Even in the healthiest networks, issues can arise that cause a loss of network connectivity, sometimes preventing users from gaining access to critical network resources. These outages may not be planned and are often difficult to predict. Network administrators normally have a short period of time to identify a problem and find a way to resolve it. In these situations, a common troubleshooting step is usually to reboot networking equipment until the issue is resolved. This may work as a short term solution, but rebooting equipment regularly may mask a larger issue that warrants further investigation. This document details steps that can be used to troubleshoot a network connectivity issue and explains helpful information that can be gathered to discover the source of this type of problem.
Basic network connectivity troubleshooting is the same regardless of the AOS device being used. Some troubleshooting options may not be available on all units; however, this document covers only the general troubleshooting options available on every unit.
When experiencing any issue, it is important to check that your equipment is running current software. If you are unsure which firmware version you should be running, ADTRAN recommends the latest Extended Maintenance Release (EMR) to avoid possible software issues. To determine the current EMR, please see the product page for your applicable product on www.adtran.com. For more information on AOS firmware naming conventions, please see AOS Firmware Release Naming Conventions.
Without a troubleshooting plan in place for network disasters, it is very easy to panic when a problem occurs. Most businesses today rely on network and Internet connections for a majority of their critical operations. When network resources are lost, panic can ensue among employees, and that pressure eventually lands on the network administrator. In almost every case, the top priority is to get the network back to working order as quickly as possible. Often, little thought is put into troubleshooting an outage beyond simply rebooting equipment.
This approach is very understandable, but what happens if this issue occurs again? What caused the problem in the first place? The answer to these questions really depends on what you learn from the problem when it occurs with the troubleshooting steps you take. Normally problems of this nature can only be fully identified if troubleshooting is done while the problem is occurring. If data is not gathered during the issue, the administrator may not learn how to properly prevent the problem. The network may come back up quickly with the reboot of a unit, or by reconnecting to the network, but if this problem happens repeatedly, the short length of the outage becomes irrelevant compared to the overall number of times the problem is experienced.
This is why an administrator should work with the following question in mind: How can I prevent this problem from happening again while getting the network back up as quickly as possible? With this line of thinking, a small amount of time may be lost while troubleshooting, but far more time is saved in the future by preventing the same problem from recurring. With a detailed and precise troubleshooting plan in place beforehand, the troubleshooting time itself can be reduced even further.
This section details basic, general steps that should be taken when a network problem involves loss of connectivity to network resources or the Internet. Feel free to improvise on these steps, as not every step applies to every network situation. They should, however, serve as a general guide during each network issue.
Depending on how large your network is, a report of a "network problem" can be pretty vague. Normally a network administrator will find out about a network problem through another employee, possibly someone who is not technically savvy. They may just say that they "can't get onto the Internet" or "can't access the file storage server". Though these problems sound simple, there may not be enough context to know exactly what is causing them. If the physical link on a user's client goes down, they will lose connectivity to everything. So, a user who reports lost Internet connectivity may actually have lost all connectivity. Similarly, someone could claim they lost access to a server because that was the only thing they were using, when in reality they have no connectivity at all.
In cases like this, it is important to ask the user troubleshooting questions to narrow down the problem, or to log onto their system directly to see if the issue can be put into context. Some of the things you want to find out in each type of situation:
- Loss of Internet connectivity
  - Is this affecting other users? Which users and how many? What do they have in common?
  - Can the user connect to internal resources (can they use them, PING them, etc.)?
  - Have they tried more than one web page?
  - Can they PING an Internet address (as DNS could actually be the issue)? If not, how many hops can they traceroute to it? Where does it fail?
  - Can they PING their default gateway?
  - Can they PING the management IP of the switch they are plugged into?
  - Do they have a connectivity light on their Ethernet port?
- Loss of network resource
  - Is this affecting other users? How many and who?
  - Does the user have connectivity to anything else?
  - Can they PING the unit that is running that network resource? How far can they traceroute to it?
  - Can they PING their default gateway?
  - Can they PING the management IP of the switch they are plugged into?
  - Do they have a connectivity light on their Ethernet port?
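The questions above can be read as an ordered series of checks, working outward from the user's PC. The following is a minimal Python sketch of that reasoning, not an ADTRAN tool: the `Checks` fields and the `localize` function are hypothetical names introduced here, and the ordering assumes the switch, gateway, and Internet router sit in that sequence on the path, which may not match every network.

```python
from dataclasses import dataclass


@dataclass
class Checks:
    """Answers to the triage questions; True means the check passed."""
    link_light: bool        # connectivity light on the Ethernet port
    ping_switch: bool       # management IP of the switch the user is plugged into
    ping_gateway: bool      # default gateway
    ping_internal: bool     # an internal resource
    ping_internet_ip: bool  # a public IP address (no DNS involved)
    resolve_name: bool      # a hostname resolves through DNS


def localize(c: Checks) -> str:
    """Return the point closest to the user at which a check fails."""
    if not c.link_light:
        return "client link (cable, NIC, or switch port)"
    if not c.ping_switch:
        return "access switch or the path to it"
    if not c.ping_gateway:
        return "default gateway or the VLAN toward it"
    if not c.ping_internal:
        return "internal routing or the resource itself"
    if not c.ping_internet_ip:
        return "Internet router or service provider"
    if not c.resolve_name:
        return "DNS resolution"
    return "no fault found by these checks"


# A user whose local checks all pass but who cannot reach a public IP.
user = Checks(True, True, True, True, False, False)
print(localize(user))  # prints "Internet router or service provider"
```

The value of the ordering is that the first failing check bounds the problem: everything closer to the user has already been shown to work.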
As you can see, some of the steps are very similar in each of the above situations because they are all trying to achieve the central point: Narrow down where the problem is occurring and who the problem is affecting. These steps will not always be the same: there is a certain amount of improvisation that will need to be used to fully figure out the problem. However, these steps should serve as a general guide showing the thought process that should be used when a network connectivity problem occurs. An example situation is shown below:
Company.com's network administrator gets a call from users in building A complaining of "network connectivity problems". Upon arriving at building A and questioning the users further, the network administrator realizes they do not have Internet access. By asking around, it is discovered that no one in building A seems to have Internet access. A quick call over to the employees at building B confirms their Internet is up and working. Since their PCs are located in a different VLAN, it seems that building A's VLAN is somehow not getting out to the Internet.
The administrator starts by sending a PING from a Building A PC to 126.96.36.199 (a public IP that's easy to remember and always accessible on the Internet) to see if the issue is caused by a lack of DNS resolution. This fails, so it seems that there is an actual break in connectivity. The administrator then decides to traceroute to 188.8.131.52. This fails upon reaching the third hop, which is the site's main Internet router. The administrator PINGs several other internal network units to confirm it is just Internet access that is lost. With that done, the administrator has narrowed the problem down to the Internet router or the service provider network beyond it, and can now focus troubleshooting on that one area.
This is just a basic example of a network connectivity complaint, but the general troubleshooting steps apply to the majority of issues of this type. In a network with a large number of routers, switches, and other units, narrowing down the problem can take hours off the total troubleshooting time needed before the network is back up and functioning normally. The following section discusses what to do if the unit that the issue is narrowed down to is an AOS unit.
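When traceroute output is captured to a file, the last hop that answered can be picked out programmatically. The sketch below is illustrative only: it assumes Unix-style numeric traceroute output (one hop per line, `* * *` for silent hops), and the sample uses reserved documentation addresses rather than anything from Company.com's network.

```python
import re

# A hop line starts with the hop number; header lines do not.
HOP_RE = re.compile(r"^\s*(\d+)\s+(.*)$")
IPV4_RE = re.compile(r"\b(\d{1,3}(?:\.\d{1,3}){3})\b")

# Hypothetical capture: the trace reaches hop 3 and nothing answers beyond it.
SAMPLE = """\
traceroute to 203.0.113.10 (203.0.113.10), 30 hops max, 60 byte packets
 1  192.0.2.1  0.412 ms  0.380 ms  0.365 ms
 2  192.0.2.129  0.911 ms  0.874 ms  0.860 ms
 3  198.51.100.1  1.530 ms  1.498 ms  1.475 ms
 4  * * *
 5  * * *
"""


def last_responding_hop(output):
    """Return (hop_number, address) of the last hop that answered, or None."""
    last = None
    for line in output.splitlines():
        match = HOP_RE.match(line)
        if not match:
            continue  # header or blank line
        address = IPV4_RE.search(match.group(2))
        if address:
            last = (int(match.group(1)), address.group(1))
    return last


print(last_responding_hop(SAMPLE))  # prints (3, '198.51.100.1')
```

The last responding hop marks the boundary: the fault is at that device or somewhere past it, which is the same conclusion the administrator reaches in the example.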
Resuming from the example in the above section, if the issue in Company.com's network leads the administrator to the Internet router (which happens to be an AOS unit), they must now troubleshoot the AOS unit directly to see if the issue in the unit can be identified. Unfortunately, it is a very common instinct to simply reboot the unit that seems to be causing the problem. Most residential router manufacturers encourage this practice in home networks, and even in an enterprise network, it's sometimes hard to imagine how something that has been working for a period of time could just stop. However, rebooting the unit has several negative effects:
- There is a chance it won't fix the problem.
  - AOS routers are not made like residential routers, which sometimes require regular reboots to work properly. In contrast, AOS units can run weeks, months, and even years without rebooting (although it is encouraged to regularly upgrade your unit, which will require a reboot). This means that there is a low chance a reboot will actually fix what is happening, so the boot procedure just becomes wasted time and could negatively affect all other users serviced by this router.
- Upon reboot, all relevant log information (and the ability to live troubleshoot a problem, assuming a reboot fixes it) is flushed because it is stored in RAM, which is cleared upon a reboot. All the precious data the unit output during the issue is gone, meaning that it could be very difficult to reproduce and possibly fix in a swift and timely manner without waiting for the issue to reoccur.
- If for some reason the underlying issue is related to the unit's hardware, it is possible (albeit rare) that the unit could not recover after the reboot which could cause a need for a replacement device to be installed to restore connectivity.
Rebooting should normally be a last resort when attempting to restore connectivity. If the troubleshooting steps are done properly beforehand and a reboot ends up resolving the issue, important information has still been obtained during the troubleshooting period that could help ADTRAN support engineers identify the issue that affected the unit. The output detailed below should be gathered from an AOS unit when issues of this nature occur, before a reboot is performed (if necessary).
Once the issue has been narrowed down to an AOS device, the device should be accessed for further troubleshooting via the Command Line Interface (CLI). This is important because the CLI has the most tools to help troubleshoot connectivity issues. If you need assistance logging into the AOS CLI, please see Accessing the Command Line Interface in AOS. If you are unable to log in to the CLI, please see the section If the Unit is not Accessible.
Once inside the CLI, attempt to find answers to these types of questions:
- Do the interface/connections to this unit seem to be functioning properly? Do the unit's LEDs show the normal colors for a functional unit?
- Can you PING the gateway? Traceroute past it?
- Can the internal subnets experiencing the problem respond to PING? Can they respond if sourced from another interface in the unit?
Asking and answering the above questions should help confirm whether the connectivity problem actually is in the AOS unit, or exists in a different section of the network past it. Assuming the issues still seem to point to the AOS unit causing the problem, the following steps at minimum should be taken. Note: It is recommended that anyone logging into the unit use a program like PuTTY that can log all session output to a text file:
- Use the show running-config command to get a copy of the current configuration.
  - Configuration files and updates should be regularly pulled from the unit and stored, especially whenever a change is made. Unfortunately, this is not always the case if multiple administrators use the unit. Pulling the configuration is an important step so that the configuration can be verified to be properly administered.
- Run show flash and compare the startup-configuration with the backup-configuration.
  - This command lists all the files currently on the unit's flash.
  - The listing should include a startup-config and a startup-config.bak. Startup-config.bak is a copy of the previous startup-config after it's been saved. In other words, after saving a configuration, startup-config is a current copy of the config, while startup-config.bak is a copy of the config prior to the last time the configuration was saved. If these are not equal, it is important that they are pulled and compared (you can show them on the screen with the show file flash <name> command) so the last change made can be examined to see if it is possibly the issue.
- Run the show interfaces command.
  - For intermittent loss of connectivity, this can be a very useful command, as each port can be examined for large numbers of errors, over-utilization of bandwidth, or any other anomalies.
  - In situations that are more complex and require ADTRAN Technical Support, this is an important set of information that can be an aid to an ADTRAN engineer.
  - If a T1 interface being used is showing an "Alarm", please see Troubleshooting Layer 2 Protocols over T1 with CLI.
- If using the firewall, run the show ip policy-sessions command.
  - This command can be used to see if user traffic is reaching the AOS device and whether that traffic is being properly allowed or having NAT performed before leaving the unit.
- Run the show process cpu command to see if CPU utilization is a current issue in the unit.
  - Use show process queue as well to see if any of the processes have been high in the past despite being low at the moment.
- If the problem still has not been resolved, run show tech.
  - This will print a file to the flash (and to the screen) that contains the output of a large array of applicable commands describing the current state of the unit.
  - This can be provided to ADTRAN Technical Support. The file can be pulled from the unit in the same manner as an exception report, as described in Retrieving an Exception Report from the Flash of an AOS Product.
- Generate an exception report using the exception report generate command.
  - This, like show tech, offers a lot of current unit information, as well as internal information that can be examined by ADTRAN Technical Support.
  - This file can be pulled from the unit using the document Retrieving an Exception Report from the Flash of an AOS Product.
- Run any applicable debug and show commands for features being used that are not working properly.
  - For example, if the issue is phones that have lost connectivity to a SIP server, SIP message debugs would need to be gathered.
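Once startup-config and startup-config.bak have been copied off the unit as text, comparing them by eye is error-prone on a long configuration. The following Python sketch uses the standard library's `difflib` to show only what changed at the last save. The two configuration fragments are short placeholders written for this example, not output from a real unit.

```python
import difflib

# Placeholder fragments standing in for the two files pulled from flash.
PREVIOUS = """\
interface eth 0/1
  ip address 192.0.2.1 255.255.255.0
  no shutdown
"""
CURRENT = """\
interface eth 0/1
  ip address 192.0.2.2 255.255.255.0
  no shutdown
"""


def config_diff(previous, current):
    """Return a unified diff from the .bak copy to the current startup-config."""
    return list(difflib.unified_diff(
        previous.splitlines(),
        current.splitlines(),
        fromfile="startup-config.bak",
        tofile="startup-config",
        lineterm="",
    ))


for line in config_diff(PREVIOUS, CURRENT):
    print(line)
```

Lines prefixed with `-` exist only in the backup and lines prefixed with `+` exist only in the current file, so together they describe the last saved change. An empty result means the two files are identical.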
The following commands can be pasted into a device to quickly gather information described above without typing each command individually. They should be entered in privileged exec mode:
term length 0
show ip policy-sessions
show process cpu
show process queue
exception report generate
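If the session was logged to a text file as recommended earlier, the capture can be split into one section per command before it is reviewed or sent to support. This is a rough Python sketch under stated assumptions: the prompt string `Router#` is hypothetical (use the unit's actual hostname prompt), and the sample log contains placeholder text rather than real command output.

```python
# Hypothetical capture; the bracketed lines stand in for real command output.
SAMPLE_LOG = """\
Router#term length 0
Router#show process cpu
[cpu output placeholder]
Router#show process queue
[queue output placeholder line 1]
[queue output placeholder line 2]
Router#
"""


def split_session_log(log, prompt="Router#"):
    """Map each command to the list of output lines that followed it."""
    sections = {}
    current = None
    for line in log.splitlines():
        if line.startswith(prompt):
            command = line[len(prompt):].strip()
            if command:
                current = command
                sections[current] = []
            else:
                current = None  # bare prompt: no command was entered
        elif current is not None:
            sections[current].append(line)
    return sections


for command, output in split_session_log(SAMPLE_LOG).items():
    print(command, "->", len(output), "line(s)")
```

If the same command appears twice in a capture, this sketch keeps only the later output; keying on a running index instead would preserve both.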
After running through these steps, if the issue still persists, a reboot can be performed to see if the problem resolves itself. If the reboot resolves it, all the above information should be provided to ADTRAN Technical Support along with a detailed problem description and a network diagram. If the reboot does not resolve it, contact ADTRAN Technical Support with the information above to help continue to troubleshoot the issue.
In certain cases during a network outage, attempts to log into a unit may fail. (This discounts a user not having proper credentials; if you need credentials to log into the unit, please see another unit administrator.) This could be for several reasons:
- The unit's CPU is running too high to respond.
- The network the connection is being initiated from doesn't have IP connectivity to the AOS unit.
- The network bandwidth is saturated completely.
- In some very rare cases, the unit could be in a state where it does not respond to management.
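A quick TCP probe from the management workstation can help separate some of these reasons. The sketch below is an assumption-laden illustration: the port list (SSH 22, Telnet 23, HTTP 80, HTTPS 443) is a guess at which management services are enabled, `probe_management` is a name invented here, and the result cannot distinguish a busy CPU from a saturated link.

```python
import socket


def probe_management(host, ports=(22, 23, 80, 443), timeout=2.0,
                     connect=socket.create_connection):
    """Try a TCP connection to each port and record how the attempt ended."""
    results = {}
    for port in ports:
        try:
            conn = connect((host, port), timeout)
        except socket.timeout:
            results[port] = "timeout"   # nothing answered within the timeout
        except ConnectionRefusedError:
            results[port] = "refused"   # host answered, service not listening
        except OSError as exc:
            results[port] = f"error: {exc}"  # e.g. no route to host
        else:
            conn.close()
            results[port] = "open"
    return results


if __name__ == "__main__":
    # 192.0.2.1 is a documentation address; substitute the unit's management IP.
    print(probe_management("192.0.2.1", timeout=1.0))
```

A "refused" result shows that the workstation has IP connectivity to the unit even though the service is unavailable, while timeouts on every port are consistent with a missing path, a saturated link, or an unresponsive unit.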
In these cases, the unit should be accessed via the console port using a male-to-female, straight-through DB9 cable. Once logged in via console, the output of the commands shown in the section Information to Gather from the Unit should be gathered to help troubleshoot the unit or to provide to ADTRAN Technical Support.
If the console is also not responsive, a reboot will most likely return the unit to working condition, but it will not provide any information needed to help troubleshoot the issue. In this case, if the unit has a NIM, it should be pulled out of the AOS device without powering down the device. This causes the unit to reboot and print an exception report to flash; the report can be provided to ADTRAN Technical Support and can prove vital to finding a resolution. If a NIM is not present, call ADTRAN Technical Support about the issue.
Syslog should also be set up for a unit in this case so that any messages generated before the unit becomes inaccessible are logged and can be examined later. This can be set up using Configuring Syslog Logging in AOS.
For information on how to set up Syslog, please see Configuring Syslog Logging in AOS
For instructions on how to pull an exception report from an AOS unit, please see Retrieving an Exception Report from the Flash of an AOS Product
For information on problems with AOS units behind third-party modems, please see Problems with Internet Connectivity to an AOS Unit Behind a Third Party Modem
For information on WAN fail-over, which may be able to mitigate some connectivity issues in the future, please see Configuring WAN Failover with Network Monitor in AOS