Part 2, Detailed diagnosis and troubleshooting
Network problem determination: AIX tools for a system administrator
This article provides you with a set of commands available on IBM AIX®, many of which are also available on other flavors of UNIX®, that can help you get as much information as you can about exactly what is going on when your host has problems communicating with another. It also provides a logical step-by-step approach to diagnosing common issues.
For the purposes of this article, the target host system used in all sample commands and output is called testhost.
Tell me more
Depending on the nature of the network problem you're diagnosing, it's sometimes worth investigating whether the failing application or command has any kind of verbose, trace, or debug options. For example, both the ssh (Secure Shell) and scp (Secure Copy) commands have a verbose switch (-v) that can provide you with an extensive trace of the communication, key exchange, and authentication that takes place between client and server (see Listing 1).
Listing 1. Connecting to a remote host with a verbose ssh session
    # ssh -v testhost
    OpenSSH_4.2p1, OpenSSL 0.9.7d 17 Mar 2004
    debug1: Reading configuration data /opt/freeware/etc/ssh_config
    debug1: Connecting to testhost [10.217.1.206] port 22.
    debug1: Connection established.
    debug1: permanently_set_uid: 0/0
    debug1: identity file /root/.ssh/identity type -1
    debug1: identity file /root/.ssh/id_rsa type 1
    debug1: identity file /root/.ssh/id_dsa type -1
    debug1: Remote protocol version 1.99, remote software version OpenSSH_4.1
    debug1: match: OpenSSH_4.1 pat OpenSSH*
    debug1: Enabling compatibility mode for protocol 2.0
    debug1: Local version string SSH-2.0-OpenSSH_4.2
    debug1: SSH2_MSG_KEXINIT sent
    debug1: SSH2_MSG_KEXINIT received
    debug1: kex: server->client aes128-cbc hmac-md5 none
    debug1: kex: client->server aes128-cbc hmac-md5 none
    debug1: SSH2_MSG_KEX_DH_GEX_REQUEST(1024<1024<8192) sent
    debug1: expecting SSH2_MSG_KEX_DH_GEX_GROUP
    debug1: SSH2_MSG_KEX_DH_GEX_INIT sent
    debug1: expecting SSH2_MSG_KEX_DH_GEX_REPLY
    debug1: Host 'testhost' is known and matches the RSA host key.
    debug1: Found key in /root/.ssh/known_hosts:14
    debug1: ssh_rsa_verify: signature correct
    debug1: SSH2_MSG_NEWKEYS sent
    debug1: expecting SSH2_MSG_NEWKEYS
    debug1: SSH2_MSG_NEWKEYS received
    debug1: SSH2_MSG_SERVICE_REQUEST sent
    debug1: SSH2_MSG_SERVICE_ACCEPT received
    debug1: Authentications that can continue: publickey,password,keyboard-interactive
    debug1: Next authentication method: publickey
    debug1: Trying private key: /root/.ssh/identity
    debug1: Offering public key: /root/.ssh/id_rsa
    debug1: Authentications that can continue: publickey,password,keyboard-interactive
    debug1: Trying private key: /root/.ssh/id_dsa
    debug1: Next authentication method: keyboard-interactive
    debug1: Authentications that can continue: publickey,password,keyboard-interactive
    debug1: Next authentication method: password
    root@testhost's password:
    debug1: Authentication succeeded (password).
    debug1: channel 0: new [client-session]
    debug1: Entering interactive session.
    Last unsuccessful login: Wed 27 Jan 13:30:23 2010 on ssh from 10.216.163.37
    Last login: Wed 10 Feb 16:05:48 2010 on /dev/pts/0 from 10.216.163.37
    *******************************************************************************
    *                                                                             *
    *                                                                             *
    *  Welcome to AIX Version 5.3!                                                *
    *                                                                             *
    *                                                                             *
    *  Please see the README file in /usr/lpp/bos for information pertinent to    *
    *  this release of the AIX Operating System.                                  *
    *                                                                             *
    *                                                                             *
    *******************************************************************************
    #
If you have login access to the problematic host (ideally the server failing to service network requests to a particular port, although sometimes errors can be reported on the requesting client as well), you should check system logs for relevant messages. These include files such as /var/adm/messages, /var/log/syslog, and /var/log/mail, depending on how the system logging daemon is configured in /etc/syslog.conf, as well as daemon-specific logs if any exist (for example, ftpd, sshd, telnetd). It's often the case that warnings, errors, or failures are logged in one or more of these logs. Therefore, they're a good place to look for information that might help identify root cause.
Some services allow for the configuration of verbose, debug, or trace-type logging so that more than the standard informational or error messages are logged. If the problem is reproducible, it's worth investigating the potential to use this type of diagnostic logging for the duration of the testing. However, it's not advisable to keep verbose logging on indefinitely, as doing so can cause disk and file system space issues.
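As a rough illustration of the log check described above, the following sketch searches a set of log files for lines that mention a given daemon together with a warning, error, or failure keyword. The function name scan_logs and the keyword list are assumptions made for this example; the files you pass in should be whichever ones /etc/syslog.conf actually points to on your system.

```shell
# scan_logs DAEMON FILE...
# Print every line that mentions DAEMON and also contains one of a few
# common problem keywords (case-insensitive). Files that are missing or
# unreadable are skipped. The keyword list is illustrative, not exhaustive.
scan_logs() {
    daemon=$1
    shift
    for log in "$@"; do
        [ -r "$log" ] || continue
        # "|| true": finding nothing in a file is not an error
        grep -i -F -- "$daemon" "$log" |
            grep -i -E 'warn|error|fail|refused|denied' || true
    done
}
```

For example, scan_logs sshd /var/adm/messages /var/log/syslog would list suspicious sshd entries from those two files, if they exist on your host.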
To establish a better idea of what a process is doing, you can use truss to trace the system calls that a process makes. The truss command can either execute a specified command or attach to an existing process to produce a trace (assuming that you own the running process or have root privileges). In the case of the latter, you can stop the trace at any time by pressing Control-C.
Listing 2 shows a basic trace of the command ssh testhost, along with a short extract of its output. The -l switch prefixes each trace entry with the ID of the thread making the call (note that it differs from the process ID returned by _getpid() in the listing), while the -d switch displays a timestamp relative to the start of the trace.
Listing 2. Basic system call trace of a command
    # truss -ld ssh testhost
    2785347: 0.0000: execve("/usr/bin/ssh", 0x2FF22B70, 0x2FF22B7C) argc: 2
    2785347: 0.0137: __loadx(0x03020000, 0x2FF22A40, 0x00000080, 0xDEADBEEF, 0xDEADBEEF) = 0x00000000
    2785347: 0.0141: __loadx(0x0C000000, 0xF0208964, 0xF1422004, 0xF020832C, 0x00000001) = 0x00000000
    2785347: 0.0143: thread_init(0x0000000000000000, 0x00000000D011A9BC) =
    2785347: 0.0146: sbrk(0x00000000) = 0x20015B5C
    2785347: 0.0148: vmgetinfo(0x2FF22958, 7, 16) = 0
    2785347: 0.0151: sbrk(0x00000000) = 0x20015B5C
    2785347: 0.0153: vmgetinfo(0x2FF22410, 7, 16) = 0
    2785347: 0.0156: sbrk(0x00000000) = 0x20015B5C
    2785347: 0.0158: sbrk(0x00000004) = 0x20015B5C
    2785347: 0.0160: __libc_sbrk(0x00000000) = 0x20015B60
    2785347: 0.0163: getrpid(-1, -1, 10) = 475322
    2785347: 0.0165: _getpid() = 475322
    .
    .
    .
    .
    2785347: 35.9732: kioctl(0, 1074295912, 0x2FF22520, 0x00000000) = 0
    2785347: 35.9735: getsockopt(3, 6, 1, 0x2FF22554, 0x2FF22550) = 0
    2785347: 35.9737: setsockopt(3, 6, 1, 0x2FF22554, 4) = 0
    2785347: 35.9739: ngetsockname(3, 0x2FF22498, 0x2FF22490) = 0
    2785347: 35.9741: setsockopt(3, 0, 3, 0x2FF22560, 4) = 0
    2785347: 35.9743: _select(7, 0x2004E738, 0x2004FA48, 0x00000000, 0x00000000) = 1
    2785347: kwrite(3, " H14 l95121D i86 H Q o10".., 384) = 384
    2785347: 35.9748: _select(7, 0x2004E738, 0x2004FA48, 0x00000000, 0x00000000) = 1
    2785347: kread(3, " t x "0699841A E a S y\n".., 8192) = 768
    2785347: 35.9837: _select(7, 0x2004E738, 0x2004FA48, 0x00000000, 0x00000000) = 2
    Last unsuccessful login: Sat 13 Feb 22:16:28 2010 on ssh from myhost.testdomain.com
    Last login: Sat 13 Feb 22:16:56 2010 on /dev/pts/4 from myhost.testdomain.com
    *******************************************************************************
    *                                                                             *
    *                                                                             *
    *  Welcome to AIX Version 5.1!                                                *
    *                                                                             *
    *                                                                             *
    *  Please see the README file in /usr/lpp/bos for information pertinent to    *
    *  this release of the AIX Operating System.                                  *
    *                                                                             *
    *                                                                             *
    *******************************************************************************
    2785347: kwrite(5, " x ".., 567) = 567
    2785347: 35.9849: _select(7, 0x2004E738, 0x2004FA48, 0x00000000, 0x00000000) = 1
    2785347: kread(3, " x d o e x10 # 0 A1C c17".., 8192) = 48
    2785347: 36.1103: _select(7, 0x2004E738, 0x2004FA48, 0x00000000, 0x00000000) = 1
    2785347: kwrite(5, " t e s t h o s t : r o o".., 17) = 17
    testhost:root> 2785347: 219.4781: _select(7, 0x2004E738, 0x2004FA48, 0x00000000, 0x00000000) (sleeping...)
    2785347: 219.4781: _select(7, 0x2004E738, 0x2004FA48, 0x00000000, 0x00000000) = 1
    2785347: kread(4, "04 n a80 n V\f a\0\0\010".., 16384) = 1
    2785347: 220.1322: _select(7, 0x2004E738, 0x2004FA48, 0x00000000, 0x00000000) = 1
    2785347: kwrite(3, " O8D d r 013 g1982 o\n i".., 48) = 48
    2785347: 220.1327: _select(7, 0x2004E738, 0x2004FA48, 0x00000000, 0x00000000) = 1
    2785347: kread(3, " p h1A 1 a I J E031D9D1C".., 8192) = 80
    2785347: 220.1347: _select(7, 0x2004E738, 0x2004FA48, 0x00000000, 0x00000000) = 1
    2785347: kwrite(5, "\r\n", 2) = 2
    2785347: 220.1352: close(5) = 0
    2785347: 220.1354: _select(7, 0x2004E738, 0x2004FA48, 0x00000000, 0x00000000) = 1
    2785347: kread(3, " O a *901C ^81 . B e83 R".., 8192) = 96
    2785347: 220.1358: close(4) = 0
    2785347: 220.1360: kioctl(0, 22528, 0x00000000, 0x00000000) = 0
    2785347: 220.1362: kioctl(0, 21507, 0x200151F8, 0x00000000) = 0
    2785347: 220.1365: close(6) = 0
    2785347: 220.1367: _select(7, 0x2004E738, 0x2004FA48, 0x00000000, 0x00000000) = 1
    2785347: kwrite(3, "1A | B O # E c v9D e93 >".., 32) = 32
    2785347: 220.1372: sigprocmask(2, 0xF1423790, 0x2FF22630) = 0
    2785347: 220.1374: _sigaction(28, 0x00000000, 0x2FF226E0) = 0
    2785347: 220.1375: thread_setmymask_fast(0x00000000, 0x00000000, 0x00000000, 0x102A8043, 0x00000000, 0x00000135, 0x00000000, 0x00000000) = 0x00000000
    2785347: 220.1377: sigprocmask(2, 0xF1423790, 0x2FF22630) = 0
    2785347: 220.1379: _sigaction(28, 0x2FF226D0, 0x00000000) = 0
    2785347: 220.1381: thread_setmymask_fast(0x00000000, 0x00000000, 0x00000000, 0x102A8043, 0x00000000, 0x0000017C, 0x00000000, 0x00000000) = 0x00000000
    2785347: 220.1383: kioctl(0, 22528, 0x00000000, 0x00000000) = 0
    2785347: 220.1385: kioctl(1, 22528, 0x00000000, 0x00000000) Err#25 ENOTTY
    2785347: 220.1387: kfcntl(1, F_GETFL, 0x00000000) = 67108869
    2785347: 220.1389: kioctl(1, -2147195266, 0x2FF22640, 0x00000000) = 0
    2785347: 220.1391: kioctl(1, -2147195267, 0x2FF22640, 0x00000000) = 0
    2785347: 220.1393: kfcntl(1, F_SETFL, 0x04000001) = 0
    2785347: 220.1395: kioctl(2, 22528, 0x00000000, 0x00000000) Err#25 ENOTTY
    2785347: 220.1397: kfcntl(2, F_GETFL, 0x00000000) = 67108865
    Connection to testhost closed.
    2785347: kwrite(2, " C o n n e c t i o n t".., 32) = 32
    2785347: 220.1402: shutdown(3, 2) = 0
    2785347: 220.1404: close(3) = 0
    2785347: 220.1406: kfcntl(1, F_GETFL, 0x102A8043) = 67108865
    2785347: 220.1408: kfcntl(2, F_GETFL, 0x102A8043) = 67108865
    2785347: 220.1410: _exit(0)
    #
Listing 3 shows a more verbose trace of a running process (process ID 976) to an output file along with a short extract. The following switches were used:
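Table 1. Switches used for more verbose trace of a running process
    Switch     Description
    -o file    Writes the trace output to the named file
    -a         Displays the parameter strings passed in each exec system call
    -e         Displays the environment strings passed in each exec system call
    -f         Follows child processes created by fork
    -l         Displays the ID of the thread making each call
    -D         Displays the time delta since the previous event on each line
    -r all     Displays the full contents of the I/O buffer for reads on all file descriptors
    -w all     Displays the full contents of the I/O buffer for writes on all file descriptors
    -x all     Displays the parameters of all traced system calls in raw (hexadecimal) format
    -p         Attaches to the running processes whose IDs are given as arguments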
Listing 3. More verbose trace of a running process to a file
    # truss -o /var/tmp/truss.out -aeflD -r all -w all -x all -p 976
    ^C
    # cat /var/tmp/truss.out
    1003752: psargs: sshd: testuser@pts/1 AIA,, ts/1
    1003752: 1798193: 0.0000: _select(0x00000008, 0x2002D788, 0x2002D798, 0x00000000, 0x00000000) (sleeping...)
    1003752: 1798193: 0.0000: _select(0x00000008, 0x2002D788, 0x2002D798, 0x00000000, 0x00000000) = 0x00000001
    1003752: 1798193: 0.7196: sigprocmask(0x00000000, 0x2FF22598, 0x2FF225A0) = 0x00000000
    1003752: 1798193: 0.0002: sigprocmask(0x00000002, 0x2FF225A0, 0x00000000) = 0x00000000
    1003752: 1798193: kread(0x00000003, 0x2FF1E590, 0x00004000) = 0x00000034
    1003752: ? 2 q q A> Ao A' A,8E ) A, Au A?9D 8 A,87 {90 A1 A^ l p 0 !02 A- A% A4 A!9C\n
    1003752: A| | &8E A!9F G A2 )1C M1E A^ AZ / AE p AI Az A
    1003752: 1798193: 0.0003: _select(0x00000008, 0x2002D788, 0x2002D798, 0x00000000, 0x00000000) = 0x00000001
    1003752: 1798193: 0.0003: sigprocmask(0x00000000, 0x2FF22598, 0x2FF225A0) = 0x00000000
    1003752: 1798193: 0.0002: sigprocmask(0x00000002, 0x2FF225A0, 0x00000000) = 0x00000000
    1003752: 1798193: kwrite(0x00000006, 0x200A6F98, 0x00000001) = 0x00000001
    1003752: p
    1003752: 1798193: 0.0003: kioctl(0x00000006, 0x00005800, 0x00000000, 0x00000000) = 0x00000000
    1003752: 1798193: 0.0002: kioctl(0x00000006, 0x00005401, 0x2FF224C0, 0x00000000) = 0x00000000
    1003752: 1798193: 0.0003: _select(0x00000008, 0x2002D788, 0x2002D798, 0x00000000, 0x00000000) = 0x00000001
    1003752: 1798193: 0.1359: sigprocmask(0x00000000, 0x2FF22598, 0x2FF225A0) = 0x00000000
    1003752: 1798193: 0.0002: sigprocmask(0x00000002, 0x2FF225A0, 0x00000000) = 0x00000000
    1003752: 1798193: kread(0x00000007, 0x2FF1E4F0, 0x00004000) = 0x00000013
    1003752: t e s t h o s t : t e s t u s e r >
    .
    .
    .
    .
    1003752: 1798193: kread(0x00000007, 0x2FF1E4F0, 0x00004000) = 0x0000003F
    1003752:
    1003752: M o z i l l a\r\n
    1003752: 1798193: 0.0003: _select(0x00000008, 0x2002D788, 0x2002D798, 0x00000000, 0x00000000) = 0x00000002
    1003752: 1798193: 0.0003: sigprocmask(0x00000000, 0x2FF22598, 0x2FF225A0) = 0x00000000
    1003752: 1798193: 0.0002: sigprocmask(0x00000002, 0x2FF225A0, 0x00000000) = 0x00000000
    1003752: 1798193: kread(0x00000007, 0x2FF1E4F0, 0x00004000) = 0x00000051
    1003752:
    1003752:
    1003752: t r u s s . t x t\r\n f p d g
    1003752: 1798193: 2.0001: _select(0x00000008, 0x2002D788, 0x2002D798, 0x00000000, 0x00000000) (sleeping...)
    ^C
    #
You can use the system call trace to identify errors that may be the root cause of your problem. Look for calls marked with Err#, indicating that the call failed, followed by an error number that you can look up in /usr/include/sys/errno.h. You can also use the trace to identify potential performance delays by looking for long deltas between calls when using the -D switch. For example, the elapsed time between the last two events in the sample trace output in Listing 3 was 2.0001 seconds.
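When a trace file runs to thousands of lines, it can help to summarize the failed calls rather than read them one by one. The sketch below counts each distinct Err# marker in a saved trace file. The function name truss_errors is hypothetical, and the pattern assumes that failures appear as Err#, a number, a space, and an upper-case error name, as they do in Listing 2.

```shell
# truss_errors FILE
# Count how often each "Err#<number> <NAME>" marker occurs in a truss
# output file and print the counts, most frequent first.
truss_errors() {
    awk '
        match($0, /Err#[0-9]+ [A-Z0-9]+/) {
            count[substr($0, RSTART, RLENGTH)]++
        }
        END {
            for (e in count) print count[e], e
        }
    ' "$1" | sort -rn
}
```

Running truss_errors /var/tmp/truss.out against a trace written with the -o switch gives a quick list of the error codes worth looking up.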
Houston, we have a problem!
Now that you have a useful toolkit of network diagnostic aids, it's time to look at a logical, step-by-step approach to troubleshooting common problems. The following section lists a number of common AIX network-related issues and provides a guide for diagnosis and what to look for in each one.
Host unknown
If a host name used on a command or by an application isn't recognized, check the order in which names are resolved by looking at the hosts record in /etc/irs.conf and /etc/netsvc.conf. For a host to be referenced by name, the name must resolve to an IP address through one of the services listed there.
If local is specified in the hosts record, look for the host name in the /etc/hosts file. The example in Listing 4 shows a simple grep for the host testhost in this file returning a successful match. The host name must appear in one of the fields after the first field (the IP address). In the example shown, the server is also known by two aliases: testhost.testdomain.com and aixserver. This means that you can refer to this particular host by any of those three names when using commands that require a host name argument.
Listing 4. Looking for a host in /etc/hosts
    # grep testhost /etc/hosts
    10.217.1.206 testhost testhost.testdomain.com aixserver
    #
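A plain grep also matches substrings (testhost2, for instance) and commented-out entries, so a stricter check can be useful. The following sketch compares whole fields after the IP address, as described above. The function name hosts_lookup and its optional file argument are assumptions made for this example.

```shell
# hosts_lookup NAME [FILE]
# Print the IP address of every entry in a hosts-format FILE (default
# /etc/hosts) that lists NAME as its host name or as one of its aliases.
# The comparison is on whole fields and ignores case; anything after a
# "#" is treated as a comment.
hosts_lookup() {
    awk -v name="$1" '
        BEGIN { want = tolower(name) }
        {
            sub(/#.*/, "")                 # strip comments
            for (i = 2; i <= NF; i++)      # field 1 is the IP address
                if (tolower($i) == want) { print $1; break }
        }
    ' "${2:-/etc/hosts}"
}
```

With the /etc/hosts entry shown in Listing 4, hosts_lookup aixserver would print 10.217.1.206, and no output means that the name is not defined locally.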
If bind or dns is specified in the hosts record, use nslookup to ensure that the host name resolves through DNS. If you look at the example in Listing 5, you can see that resolution has been successful and the DNS server testdns.testdomain.com (shown with its IP address) has returned a known IP address for the host testhost of 10.217.1.206.
Listing 5. Resolving a host name via DNS
    # nslookup testhost
    Server:  testdns.testdomain.com
    Address:  158.177.79.90

    Name:    testhost.testdomain.com
    Address:  10.217.1.206

    #
Any additional name resolution services specified in your configuration files are outside the scope of this document and won't be discussed here.
Unresponsive host
If a host is known but you find that users are complaining that the host itself or an application running on it isn't responding, use ping and look for 0% packet loss (see Listing 6). Anything else means that there could be a problem with the target host or the network.
Listing 6. Pinging a responsive host
    # ping testhost
    PING testhost: (10.217.1.206): 56 data bytes
    64 bytes from 10.217.1.206: icmp_seq=0 ttl=253 time=0 ms
    64 bytes from 10.217.1.206: icmp_seq=1 ttl=253 time=0 ms
    64 bytes from 10.217.1.206: icmp_seq=2 ttl=253 time=0 ms
    64 bytes from 10.217.1.206: icmp_seq=3 ttl=253 time=0 ms

    ----testhost PING Statistics----
    4 packets transmitted, 4 packets received, 0% packet loss
    round-trip min/avg/max = 0/0/0 ms
    #
Also, look for long response times in the time= field or a spike in the values reported. Both of these can indicate poor network or host performance, which may be causing the application that your users are complaining about to time out.
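If you want to run the packet-loss check from a script, you can pull the percentage out of the summary line that ping prints. This is a minimal sketch that assumes the summary has the form shown in Listing 6 (a number followed by "% packet loss"); the function name ping_loss is hypothetical.

```shell
# ping_loss
# Read ping output on standard input and print the packet-loss percentage
# from the summary line ("... N% packet loss"). Prints nothing if no such
# line is present.
ping_loss() {
    sed -n 's/.* \([0-9][0-9.]*\)% packet loss.*/\1/p'
}
```

A typical use is to send a fixed number of packets with the -c switch, for example ping -c 4 testhost | ping_loss, and to treat any result other than 0 as a reason to investigate further.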
Ensure that there's a route out to the target host by using route get (see Listing 7), verifying with the network administrator that this is the correct gateway to use.
Listing 7. Getting routing table information for a host
    # route get testhost
       route to: testhost
    destination: 10.203.35.128
           mask: 255.255.255.128
        gateway: 10.203.35.1
      interface: en2
    interf addr: myhost
          flags: <UP,GATEWAY,DONE,PRCLONING>
     recvpipe  sendpipe  ssthresh  rtt,msec    rttvar  hopcount      mtu     expire
           0         0         0         0         0         0         0  -9751026
    #
Use ifconfig (see Listing 8) to make sure that the interface reported by route get is configured in AIX and showing as UP and RUNNING.
Listing 8. Displaying network interface status
    # ifconfig en1
    en1: flags=7e080863,40<UP,BROADCAST,NOTRAILERS,RUNNING,SIMPLEX,MULTICAST,GROUPRT,64BIT,
            CHECKSUM_OFFLOAD,CHECKSUM_SUPPORT,PSEG>
            inet 10.216.163.37 netmask 0xffffff00 broadcast 10.216.163.255
            tcp_sendspace 131072 tcp_recvspace 65536
    # ifconfig -a
    en2: flags=7e080863,40<UP,BROADCAST,NOTRAILERS,RUNNING,SIMPLEX,MULTICAST,GROUPRT,64BIT,
            CHECKSUM_OFFLOAD,CHECKSUM_SUPPORT,PSEG>
            inet 10.203.35.14 netmask 0xffffff80 broadcast 10.203.35.127
    en1: flags=7e080863,40<UP,BROADCAST,NOTRAILERS,RUNNING,SIMPLEX,MULTICAST,GROUPRT,64BIT,
            CHECKSUM_OFFLOAD,CHECKSUM_SUPPORT,PSEG>
            inet 10.216.163.37 netmask 0xffffff00 broadcast 10.216.163.255
            tcp_sendspace 131072 tcp_recvspace 65536
    en0: flags=7e080822,10<BROADCAST,NOTRAILERS,SIMPLEX,MULTICAST,GROUPRT,64BIT,
            CHECKSUM_OFFLOAD,CHECKSUM_SUPPORT,PSEG>
    lo0: flags=e08084b<UP,BROADCAST,LOOPBACK,RUNNING,SIMPLEX,MULTICAST,GROUPRT,64BIT>
            inet 127.0.0.1 netmask 0xff000000 broadcast 127.255.255.255
            inet6 ::1/0
            tcp_sendspace 65536 tcp_recvspace 65536
    #
Use ping to check that the gateway reported by route get is contactable from your host. If it isn't, there may be a problem with the physical connection from the network adapter on your host to the gateway (for example, a faulty switch port, cable, or network card).
Use traceroute (see Listing 9) to trace the route to the target host. A route to a host doesn't necessarily mean a route back, so check with the network administrator to ensure that both exist and that the application traffic being sent or received isn't being blocked by any firewalls. If any of the hops on the route returns no data (marked by asterisks [*]), it could indicate a problem with the routing. However, the packets that traceroute uses to trace the route may themselves be blocked by firewalls. The output from this command should help the network administrator determine whether there's a real routing issue.
Listing 9. Tracing a successful route to a host
    # traceroute testhost
    trying to get source for testhost
    source should be 10.216.163.37
    traceroute to testhost (10.217.1.206) from 10.216.163.37 (10.216.163.37), 30 hops max
    outgoing MTU = 1500
     1  10.216.163.2 (10.216.163.2)  1 ms  0 ms  0 ms
     2  10.217.189.6 (10.217.189.6)  0 ms  0 ms  0 ms
     3  testhost (10.217.1.206)  1 ms  1 ms  1 ms
    #
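To pick out the hops that returned no data, you can filter the traceroute output for lines where every probe is shown as an asterisk. The sketch below assumes the hop-per-line layout of Listing 9; the function name silent_hops is hypothetical.

```shell
# silent_hops
# Read traceroute output on standard input and print the number of every
# hop for which all probes went unanswered, that is, where only asterisks
# follow the hop number.
silent_hops() {
    awk '$1 ~ /^[0-9]+$/ {
        silent = 1
        for (i = 2; i <= NF; i++) if ($i != "*") silent = 0
        if (NF > 1 && silent) print $1
    }'
}
```

For example, traceroute testhost | silent_hops prints nothing for the successful trace in Listing 9. A hop number in the output is a starting point for a conversation with the network administrator, not proof of a routing fault, because firewalls can drop the probes.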
Unresponsive TCP port
If a host is known and responding to ping but a particular TCP port used by an application or remote service doesn't appear to be, use telnet to try and make a connection to the specific port on the target host using the example shown in Listing 10, which attempts to connect to port 25 on host testhost.
Listing 10. Testing port 25 (SMTP) on a host (successful)
    # telnet testhost 25
    Trying...
    Connected to testhost.
    Escape character is '^]'.
    220 testhost.testdomain.com ESMTP Sendmail Wed, 10 Feb 2010 15:52:28 GMT
    ^]
    telnet> quit
    Connection closed.
    #
Common ports are listed in /etc/services. A successful connection should result in the message Escape character is '^]' and optionally a message from the remote service, such as the mail server, shown in Listing 10. If no such messages are received and the connection times out or is refused, then check with the network administrator that there are no firewalls en route blocking the type of traffic being sent. Also, check with the systems administrator of the target host that the application server or remote service is running and listening on the specified port and that firewalls running on the host are not blocking traffic.
Not connecting to a responsive TCP port
If a host is known, responding to ping, and responding on a particular TCP port to other hosts but not yours, use telnet to try to connect to that port, following the logic shown in Unresponsive TCP port.
Use netstat to look for connections to the host and their state using the second example shown in Listing 11, which looks for all connections to a particular IP address and port (port 22 at 10.217.1.206). Unless the connection state is shown as ESTABLISHED, the connection is either still being made or has been terminated. For example, a status of SYN_SENT indicates that a three-way handshake has been initiated by your host, but as yet no acknowledgement has been received from the target host. This could mean that there's a route to the target but no route back for this type of traffic. In this situation, ask the network administrator whether any firewalls on the route back are blocking this type of traffic.
Listing 11. Displaying the status of connections to hosts
    # netstat -an | grep 10.217.1.206
    tcp4       0      0  10.203.35.14.22        10.217.1.206.1023      ESTABLISHED
    tcp4       0      0  10.203.35.14.46183     10.217.1.206.22        ESTABLISHED
    # netstat -an | grep 10.217.1.206.22
    tcp4       0      0  10.203.35.14.46183     10.217.1.206.22        ESTABLISHED
    # netstat -an | grep ESTABLISHED
    tcp4       0      0  10.203.35.14.22        10.217.1.206.1023      ESTABLISHED
    tcp4       0      0  10.203.35.14.46183     10.217.1.206.22        ESTABLISHED
    tcp4       0      0  10.216.163.37.1521     10.216.163.37.44122    ESTABLISHED
    tcp4       0      0  10.216.163.37.44122    10.216.163.37.1521     ESTABLISHED
    tcp4       0      0  127.0.0.1.199          127.0.0.1.32769        ESTABLISHED
    tcp4       0      0  127.0.0.1.32769        127.0.0.1.199          ESTABLISHED
    tcp4       0      0  10.203.35.14.46183     10.203.35.170.22       ESTABLISHED
    tcp4       0      0  10.216.163.37.32770    10.216.163.37.32771    ESTABLISHED
    #
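Because grep on an address also matches longer addresses and ports that merely start with the same digits, you may prefer to match the foreign-address column exactly and count connections by state. The sketch below assumes the six-column netstat -an layout of Listing 11, with the address and port joined by a dot; the function name conn_states is hypothetical.

```shell
# conn_states REMOTE
# Read "netstat -an" output on standard input and count TCP connections
# per state for one remote endpoint. REMOTE is either an address
# ("10.217.1.206", any port) or address.port ("10.217.1.206.22").
conn_states() {
    awk -v remote="$1" '
        $1 ~ /^tcp/ && ($5 == remote || index($5, remote ".") == 1) {
            n[$6]++
        }
        END { for (s in n) print s, n[s] }
    ' | sort
}
```

For example, netstat -an | conn_states 10.217.1.206.22 would report one ESTABLISHED connection for the output shown in Listing 11. A count under SYN_SENT that does not go away is the situation described above.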
Use tcpdump to display packets sent to and received from the host on the specified port using the example shown in Listing 12. If only packets sent by your host are shown, this is another indication that the problem is with traffic sent back by the target and therefore the route back.
Listing 12. Display packets destined for or sent by a specific host on a specific port
    # tcpdump -i en2 host testhost port 22
    12:15:38.033833162 myhost.47216 > testhost.22: . ack 610148954 win 17520 (DF) [tos 0x10]
    12:15:38.113807903 myhost.47216 > testhost.22: P 145:193(48) ack 192 win 17520 (DF) [tos 0x10]
    12:15:38.114291921 testhost.22 > myhost.47216: P 192:240(48) ack 193 win 24820 (DF) [tos 0x10]
    12:15:38.241718122 myhost.47216 > testhost.22: P 193:241(48) ack 240 win 17520 (DF) [tos 0x10]
    12:15:38.242344703 testhost.22 > myhost.47216: P 240:288(48) ack 241 win 24820 (DF) [tos 0x10]
    12:15:38.243844593 myhost.47216 > testhost.22: . ack 288 win 17520 (DF) [tos 0x10]
    12:15:38.497817604 myhost.47216 > testhost.22: P 241:289(48) ack 288 win 17520 (DF) [tos 0x10]
    12:15:38.503088328 testhost.22 > myhost.47216: P 288:336(48) ack 289 win 24820 (DF) [tos 0x10]
    12:15:38.503154802 testhost.22 > myhost.47216: P 336:432(96) ack 289 win 24820 (DF) [tos 0x10]
    ^C
    145 packets received by filter
    0 packets dropped by kernel
    #
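On a busy connection it is easier to count packets in each direction than to scan the capture by eye. The following sketch does that for saved tcpdump text output. It assumes the "source > destination:" layout shown in Listing 12 and that the host names are given exactly as tcpdump prints them; the function name packet_directions is hypothetical.

```shell
# packet_directions LOCAL REMOTE
# Read tcpdump text output on standard input and count the packets that
# went from LOCAL to REMOTE and from REMOTE to LOCAL. An address counts
# as a host if it equals the host name or starts with "name." (the form
# tcpdump uses for host.port).
packet_directions() {
    awk -v loc="$1" -v rem="$2" '
        function is_host(addr, host) {
            return addr == host || index(addr, host ".") == 1
        }
        {
            src = ""; dst = ""
            for (i = 2; i < NF; i++)
                if ($i == ">") { src = $(i - 1); dst = $(i + 1); break }
            sub(/:$/, "", dst)             # drop the trailing colon
            if (is_host(src, loc) && is_host(dst, rem)) out++
            if (is_host(src, rem) && is_host(dst, loc)) back++
        }
        END { printf "sent=%d received=%d\n", out, back }
    '
}
```

One way to use it is to redirect the tcpdump output to a file, stop the capture with Control-C, and then run packet_directions myhost testhost with that file as standard input. A result with received=0 points at the return path.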
Long login times
If users are complaining that response times logging in to a particular host are slow, log in to the host and use dig to perform a reverse lookup of the IP address of the user's computer using the example shown in Listing 13.
Listing 13. Reverse lookup of an IP address in DNS
    # dig -x 10.217.1.206

    ; <<>> DiG 9.2.0 <<>> -x 10.217.1.206
    ;; global options: printcmd
    ;; Got answer:
    ;; ->>HEADER<<- opcode: QUERY, status: NOERROR, id: 21351
    ;; flags: qr aa rd ra; QUERY: 1, ANSWER: 1, AUTHORITY: 0, ADDITIONAL: 0

    ;; QUESTION SECTION:
    ;206.1.217.10.in-addr.arpa. IN PTR

    ;; ANSWER SECTION:
    206.1.217.10.in-addr.arpa. 3600 IN PTR testhost.testdomain.com.

    ;; Query time: 11 msec
    ;; SERVER: 10.217.1.206#53(10.217.1.206)
    ;; WHEN: Fri Feb 12 13:28:16 2010
    ;; MSG SIZE rcvd: 82
    #
During login, a host may perform a reverse lookup on the IP address of the source address on the packets it receives. Depending on the configuration of that host, it can take some time for the lookup to fail and the login to continue. This process appears to the user as a long login delay.
Look in the ANSWER SECTION of the output that dig returns, specifically at the PTR record, which points to the host name. If none is returned, this could explain the delay.
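To check several client addresses in a row, you can test for the PTR record mechanically. This sketch reads dig output and reports the PTR targets found in the ANSWER SECTION, assuming the section layout shown in Listing 13; the function name ptr_names is hypothetical.

```shell
# ptr_names
# Read "dig -x ADDRESS" output on standard input and print the target of
# each PTR record in the ANSWER SECTION. The exit status is non-zero if
# no PTR record was found.
ptr_names() {
    awk '
        /^;; ANSWER SECTION:/        { in_answer = 1; next }
        /^;;/ || /^$/                { in_answer = 0 }
        in_answer && $4 == "PTR"     { print $5; found = 1 }
        END { exit !found }
    '
}
```

For example, dig -x 10.217.1.206 | ptr_names prints testhost.testdomain.com. for the output in Listing 13, while a failing exit status marks an address with no reverse entry.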
MAC address query
If an application requires a MAC address (for example, where an ACL or a firewall rule is based on one, or a configuration requires one, as with KickStart or JumpStart installation services on Linux® and Sun® Solaris®), use entstat to find out the MAC address of a local interface using the example shown in Listing 14. Look for the Hardware Address field, which is the MAC address.
Listing 14. Displaying Ethernet statistics for a network adapter
    # entstat -d en2
    -------------------------------------------------------------
    ETHERNET STATISTICS (en2) :
    Device Type: 10/100/1000 Base-TX PCI-X Adapter (14106902)
    Hardware Address: 00:02:55:d3:37:be
    Elapsed Time: 114 days 22 hours 48 minutes 20 seconds

    Transmit Statistics:                          Receive Statistics:
    --------------------                          -------------------
    Packets: 490645639                            Packets: 3225432063
    Bytes: 9251643184881                          Bytes: 215598601362
    Interrupts: 0                                 Interrupts: 3144149248
    Transmit Errors: 0                            Receive Errors: 0
    Packets Dropped: 0                            Packets Dropped: 0
                                                  Bad Packets: 0
    Max Packets on S/W Transmit Queue: 109
    S/W Transmit Queue Overflow: 0
    Current S/W+H/W Transmit Queue Length: 0

    Broadcast Packets: 442                        Broadcast Packets: 10394992
    Multicast Packets: 0                          Multicast Packets: 349
    No Carrier Sense: 0                           CRC Errors: 0
    DMA Underrun: 0                               DMA Overrun: 0
    Lost CTS Errors: 0                            Alignment Errors: 0
    Max Collision Errors: 0                       No Resource Errors: 0
    Late Collision Errors: 0                      Receive Collision Errors: 0
    Deferred: 0                                   Packet Too Short Errors: 0
    SQE Test: 0                                   Packet Too Long Errors: 0
    Timeout Errors: 0                             Packets Discarded by Adapter: 0
    Single Collision Count: 0                     Receiver Start Count: 0
    Multiple Collision Count: 0
    Current HW Transmit Queue Length: 0

    General Statistics:
    -------------------
    No mbuf Errors: 0
    Adapter Reset Count: 0
    Adapter Data Rate: 200
    Driver Flags: Up Broadcast Running
            Simplex 64BitSupport ChecksumOffload
            PrivateSegment DataRateSet

    10/100/1000 Base-TX PCI-X Adapter (14106902) Specific Statistics:
    --------------------------------------------------------------------
    Link Status: Up
    Media Speed Selected: 100 Mbps Full Duplex
    Media Speed Running: 100 Mbps Full Duplex
    PCI Mode: PCI-X (100-133)
    PCI Bus Width: 64-bit
    Jumbo Frames: Disabled
    TCP Segmentation Offload: Enabled
    TCP Segmentation Offload Packets Transmitted: 260772859
    TCP Segmentation Offload Packet Errors: 0
    Transmit and Receive Flow Control Status: Disabled
    Transmit and Receive Flow Control Threshold (High): 32768
    Transmit and Receive Flow Control Threshold (Low): 24576
    Transmit and Receive Storage Allocation (TX/RX): 16/48
    #
Use arp to find out the MAC address of a remote host (see Listing 15), assuming that the host in question is known to your host and therefore has an entry in the arp cache.
Listing 15. Displaying a host entry in the arp table
    # arp testhost
    testhost (10.217.1.206) at 0:c:29:44:90:28 [ethernet] stored in bucket 0
    #
Use ping to force an entry for the remote host into the arp cache if one doesn't exist.
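If you only need the address itself, for example to paste into a configuration file, you can cut it out of the command output. The two filters below are a sketch that assumes the output formats shown in Listing 14 and Listing 15; the function names local_mac and remote_mac are hypothetical.

```shell
# local_mac
# Read "entstat -d INTERFACE" output on standard input and print the
# value of the "Hardware Address:" line.
local_mac() {
    sed -n 's/^Hardware Address: *//p'
}

# remote_mac
# Read "arp HOST" output on standard input and print the hexadecimal
# address that follows " at ". Prints nothing if no address is shown.
remote_mac() {
    sed -n 's/.* at \([0-9A-Fa-f:][0-9A-Fa-f:]*\).*/\1/p'
}
```

For example, entstat -d en2 | local_mac prints 00:02:55:d3:37:be for the adapter in Listing 14, and arp testhost | remote_mac prints 0:c:29:44:90:28 for the entry in Listing 15.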
Are packets being sent?
If a host is not responding in some way (either completely or on a particular port) and you need to verify that your host is sending out packets, reproduce the problem with the local application or command. Then, use tcpdump to display packets sent to the host using the example shown in Listing 16.
Listing 16. Display packets destined for a specific host
    # tcpdump -i en2 dst host testhost
    tcpdump: listening on en2
    10:08:24.912057892 myhost.46183 > testhost.22: P 1299060979:1299061027(48) ack 3373421618 win 17520 (DF) [tos 0x10]
    10:08:25.009291439 myhost.46183 > testhost.22: P 1:49(48) ack 48 win 17520 (DF) [tos 0x10]
    10:08:25.093832676 myhost.46183 > testhost.22: . ack 96 win 17520 (DF) [tos 0x10]
    10:08:25.249319253 myhost.46183 > testhost.22: P 1299061075:1299061123(48) ack 3373421714 win 17520 (DF) [tos 0x10]
    ^C
    53 packets received by filter
    0 packets dropped by kernel
    #
If no packets are seen leaving your host, then either there's a problem with the sending application or with the interface or routing (which you can diagnose using the previous diagnostic steps in this section).
Are packets being received?
If a host is not responding in some way (either completely or on a particular port) and you need to verify that packets are not being received from that host, reproduce the problem with the local application or command. Then, use tcpdump to display packets received from the host using the example shown in Listing 17.
Listing 17. Display packets sent by a specific host
    # tcpdump -i en2 src host testhost
    tcpdump: listening on en2
    10:10:38.505848354 testhost.22 > myhost.46183: . ack 130 win 24820 (DF) [tos 0x10]
    10:10:38.505916972 testhost.22 > myhost.46183: F 529:529(0) ack 225 win 24820 (DF) [tos 0x10]
    10:10:43.855153846 testhost > myhost: icmp: echo reply
    10:10:44.855224394 testhost > myhost: icmp: echo reply
    ^C
    102 packets received by filter
    0 packets dropped by kernel
    #
If you have verified that your host is sending packets (using Are packets being sent?) and no packets are being received, this means that the host is not responding, the service on the host is not responding, or there is no route back (either one does not exist or the traffic is blocked by a firewall along one that does).
Connection made but application or command fails
If users are complaining that an application or command appears to establish a successful connection but fails afterwards, see whether the command has a debug, trace, or verbose option, and rerun to see whether any additional output produced identifies potential root cause. For example, both ssh and scp have a verbose switch (-v) that can provide details of the protocol exchange between client and server as follows:
- The connection is established to the remote host on TCP port 22.
- Local private key files are identified.
- Protocol versions are exchanged and agreed upon.
- A remote host key is identified and matched to the entry in the local known_hosts file for that host.
- Key authentication is tried for each private key type found locally.
- When these fail, the user is finally prompted for a password as authentication.
- The user successfully enters it, login is successful, and a shell prompt is presented.
The verbose option here can help identify whether any of those steps fails and the probable cause.
Use truss to trace the command or the process running the remote service (see Tracing a problem application or daemon).
Tracing a problem application or daemon
If the previous steps fail to uncover the exact root cause of the problem and you need to diagnose connectivity issues further, use truss to run a verbose system call trace of the command.
Also, use truss to run a verbose system call trace of the daemon that runs the remote service to which you are attempting to connect. You may need to ask the systems administrator of the remote host for help if you don't have login access to it, or if you can't run commands as either the user that owns the process or the root user.
When a system call returns with an error, the non-zero return code is shown marked with Err# followed by the error number and an error code (for example, ENOSPC). Standard error codes can be found in /usr/include/sys/errno.h and can help indicate the cause of the error. For example, a system call returning with Err#2 ENOENT (No such file or directory) would indicate that the command is expecting to find a file or directory but can't and subsequently fails. A system call returning with Err#28 ENOSPC (No space left on device) would indicate that a disk or file system is full, potentially causing the daemon to fail to respond to service requests.
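Looking up a code in the header file is easy to script. The sketch below assumes that each code is defined on a single line of the form "#define NAME number" followed by a descriptive comment; the function name errno_lookup and its optional header argument are hypothetical.

```shell
# errno_lookup KEY [HEADER]
# Print the "#define" line for an error name (for example, ENOSPC) or an
# error number (for example, 28) from an errno header file. HEADER
# defaults to /usr/include/sys/errno.h, the AIX location named above.
errno_lookup() {
    awk -v key="$1" '
        $1 == "#define" && ($2 == key || $3 == key)
    ' "${2:-/usr/include/sys/errno.h}"
}
```

For example, errno_lookup 25 shows the definition behind the Err#25 ENOTTY entries in Listing 2, together with whatever comment the header carries for it.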
The verbose trace will display data from the parameters of system calls (-x all) and the contents of the buffer for both Read (-r all) and Write (-w all) calls. The contents of these buffers can sometimes identify root cause, as well.
You can also use the -D switch to display each system call with a time delta, representing the elapsed time in seconds since the last event. You should look for long time deltas, as these could indicate delays and lead to long response times and poor performance.
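Long deltas are easy to miss in a large trace, so it can help to filter for them. The following sketch prints only the trace lines whose delta exceeds a threshold. It assumes the -D layout of Listing 3, where the delta is one of the first three colon-terminated fields and the only one containing a decimal point; the function name slow_calls is hypothetical.

```shell
# slow_calls SECONDS
# Read "truss -D" output on standard input and print every line whose
# time delta is greater than SECONDS.
slow_calls() {
    awk -v limit="$1" '
        {
            # The delta looks like "0.7196:"; process and thread IDs
            # have no decimal point, so they are skipped.
            for (i = 1; i <= NF && i <= 3; i++)
                if ($i ~ /^[0-9]+\.[0-9]+:$/) {
                    if ($i + 0 > limit + 0) print
                    break
                }
        }
    '
}
```

For example, slow_calls 0.5 < /var/tmp/truss.out would pick out the 0.7196-second and 2.0001-second entries from the extract in Listing 3. Bear in mind that a long delta before a _select call often just means the process was idle, waiting for input.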
Conclusion
Network problem diagnosis does not need to be the obstacle that some systems administrators feel it is. This series has shown that, armed with the correct knowledge, you can diagnose and analyze many of the problems you might normally take to a network administrator and identify their root cause, making the network administrator's job a lot easier when it comes to fixing them. In some cases, you'll even find that it's something you can fix yourself. Happy troubleshooting!