Troubleshooting networks without NetMon

Hi, Ned here. You may already be asking yourself why I’m writing about network troubleshooting. Isn’t this the Directory Services blog? Don’t we just care about Kerberos and group policies and the like? Shouldn’t the Networking team do all this heavy TCP/IP lifting?

Well, without the network, Active Directory and all its little pieces don’t really amount to much. We are a customer of networking ourselves, and that means to be effective DS engineers we have to understand the infrastructure that moves all our data around. Otherwise, when this important component fails, we can’t really determine whether DS is having issues or the underlying structure it relies on is in trouble. To be frank, we work a lot of cases here in third-tier support that came in as Directory Services symptoms and left resolved as network issues. At one point, 80% of all our DS cases could be traced back to DNS configuration problems!

We can’t all be network trace gurus though – it takes a lot of time and experience to get to the point where you can look at a capture in NetMon3.1 (or Wireshark, Ethereal, Packetyzer, etc.) and make meaningful sense of all the details. So what are your options if you suspect a networking problem and you don’t feel that NetMon is in your league? You can call us in Microsoft support, or you can use other tools that are simpler and often just as effective to figure out your issue. That’s what we’ll do today.

One quick note – I’m sticking with IPv4 here since that’s 99.999∞% of what you’ll see.

Network troubleshooting from 30,000 feet

Here’s an extremely unattractive flowchart I put together that covers the basic process. We are going into a great deal more detail below.

[Figure: network troubleshooting flowchart]

At its core, we will always troubleshoot the same way:

1. What’s our symptom and failing component?
2. Do we have basic network connectivity?
3. Do we have good name resolution?
4. Can we test our failing component using reliable tools?

You may be saying ‘What the heck? Does this guy think I was born yesterday?’ but trust me – plenty of engineers that should know better often rush into step 4 when they really didn’t have a good understanding of step 1 or without trying the basics in steps 2 and 3. Especially when servers are down, the boss is screaming, and the company is losing money.

Note: Unless specified, everything we do here will be from the computer that is reporting the problem or having the symptom. In all examples the network settings are:

IP address – 10.10.0.128 (SRC-CLIENT-01.contoso.com)
Subnet Mask - 255.255.0.0
Default Gateway - 10.10.0.1
DNS Server - 10.20.0.20 (DNS-01.contoso.com)
WINS Server - 10.20.0.30
Our Destination DC - 10.30.0.166 (DEST-DC-01.contoso.com)
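As a quick sanity check on those settings, here is a minimal Python sketch (not part of the original toolset; the helper name `is_on_link` is made up for illustration) that uses the standard `ipaddress` module to show which of the example addresses sit on the client's own subnet and which can only be reached through the default gateway.

```python
import ipaddress

# Client address and subnet mask from the example settings in this post.
client = ipaddress.ip_interface("10.10.0.128/255.255.0.0")

def is_on_link(address, interface=client):
    """Return True if the address is inside the client's own subnet."""
    return ipaddress.ip_address(address) in interface.network

if __name__ == "__main__":
    for label, addr in [("gateway", "10.10.0.1"),
                        ("DNS server", "10.20.0.20"),
                        ("destination DC", "10.30.0.166")]:
        where = "local subnet" if is_on_link(addr) else "reached via the gateway"
        print(f"{label} {addr}: {where}")
```

With these settings only the gateway is local, so every test against the DNS server or the DC also exercises routing.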

1. What’s our symptom and failing component?

We’re troubleshooting something that isn’t working – what exactly? Since this is a Directory Services blog I’m going to be greedy and focus on DS components. Are domain controllers not replicating SYSVOL? Are users unable to log on? Is group policy not applying? You need to understand the component in question in order to test it at the Application layer of OSI-TCP/IP.

[Figure: OSI and TCP/IP layers]

2. Do we have basic network connectivity?

Next we will determine if the lower layers are working ok. It’s very possible that our component is just one of many victims, but no one else is complaining as loudly. Let’s break out a snippet from the flowchart and follow it with some utilities.

[Figure: flowchart snippet, basic network connectivity]

Connectivity test with PING – built-in tool in all supported Windows versions

  • Can we verify our own local networking with:

PING 127.0.0.1
PING 10.10.0.128
PING 10.10.0.1

All should return:

Packets: Sent = 4, Received = 4, Lost = 0 (0% loss)

This tests whether the local TCP/IP stack responds at all (the loopback address), whether our own IP address works, and whether we can reach our gateway. If we can’t even reach our gateway but our own address responds, we probably have a local software firewall issue. Also keep in mind that most hardware firewalls (often the default gateways in customer environments these days) do not allow you to ping their interfaces. If you know for sure that the firewall’s private network interface is working, it is OK if it fails to respond to a ping.
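If you are collecting PING output from several machines, the summary line is easy to pull apart. The following is a minimal Python sketch, assuming the English-language summary format shown above; the function name `parse_ping_summary` is an illustration, not part of any tool.

```python
import re

# Matches the summary line printed at the end of a Windows PING run,
# e.g. "Packets: Sent = 4, Received = 4, Lost = 0 (0% loss)".
PING_SUMMARY = re.compile(
    r"Sent = (\d+), Received = (\d+), Lost = (\d+) \((\d+)% loss\)"
)

def parse_ping_summary(text):
    """Return the packet counts from PING output, or None if no summary is found."""
    match = PING_SUMMARY.search(text)
    if match is None:
        return None
    sent, received, lost, percent = (int(value) for value in match.groups())
    return {"sent": sent, "received": received, "lost": lost, "loss_percent": percent}

if __name__ == "__main__":
    sample = "Packets: Sent = 4, Received = 4, Lost = 0 (0% loss)"
    print(parse_ping_summary(sample))
```

Localized versions of Windows word this line differently, so treat the pattern as a starting point.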

  • Can we ping between our problem computer and the destination that our component is trying to reach with:

PING 10.30.0.166
PING DEST-DC-01.contoso.com
PING DEST-DC-01

This proves that we can get to the machine at all on the wire, both with and without name resolution. We can only use this test if the network allows ICMP – some customers decide to turn it off internally on routers and private firewalls (and no, I really haven’t ever heard a good reason why – the days of malware/hackers using ICMP to find machines on a LAN are ten years behind us; I welcome comments on this). If a ping fails, it’s important to read the error – DESTINATION UNREACHABLE or REQUEST TIMED OUT means routing is having issues and we should move to the routing tests. COULD NOT FIND HOST means name resolution is broken and we should move to the name resolution tests. You may also want to ping with the -f -l 1472 switches to verify that a full 1500-byte packet (1472 bytes of data plus 28 bytes of IP and ICMP headers) can get through without fragmenting.
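The two decisions in that paragraph can be sketched in a few lines of Python. This is only an illustration: the helper names are made up, the error matching assumes the English PING messages quoted above, and the header sizes assume plain IPv4 with no IP options.

```python
IP_HEADER_BYTES = 20    # IPv4 header without options (assumption)
ICMP_HEADER_BYTES = 8   # ICMP echo request header

def max_ping_payload(mtu=1500):
    """Largest PING data size that fits in one unfragmented packet of the given MTU."""
    return mtu - IP_HEADER_BYTES - ICMP_HEADER_BYTES

def next_step(ping_output):
    """Map a failed PING's error text to the next group of tests to run."""
    text = ping_output.lower()
    if "could not find host" in text:
        return "name resolution tests"
    if "unreachable" in text or "timed out" in text:
        return "routing tests"
    return "no known error found"

if __name__ == "__main__":
    print(max_ping_payload())               # data size to pass to -l
    print(next_step("Request timed out."))
```

The same arithmetic works for a smaller path MTU: pass the suspected MTU and use the result as the -l value.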

Routing tests with TRACERT / PATHPING / ARP / ROUTE - built-in tools in all supported Windows versions

  • Can we check which routes we’re taking and where the traffic dies with:

PATHPING 10.30.0.166
or
TRACERT 10.30.0.166

Both tools accomplish basically the same thing – letting you know where you travel on the network to reach your destination, and where the journey fails. TRACERT shows fairly quick, basic info:

Tracing route to DEST-DC-01.contoso.com [10.30.0.166] over a maximum of 30 hops:

1 1 ms 1 ms <1 ms router1.network.contoso.com [10.10.0.1]
2 <1 ms 1 ms <1 ms router2.network.contoso.com [10.30.0.1]
3 <1 ms <1 ms <1 ms DEST-DC-01.contoso.com [10.30.0.166]

Whereas PATHPING trades speed for more details:

Tracing route to DEST-DC-01.contoso.com [10.30.0.166] over a maximum of 30 hops:

0 SRC-CLIENT-01.contoso.com [10.10.0.128]
1 router1.network.contoso.com [10.10.0.1]
2 router2.network.contoso.com [10.30.0.1]
3 DEST-DC-01.contoso.com [10.30.0.166]

Computing statistics for 75 seconds...

            Source to Here   This Node/Link
Hop  RTT    Lost/Sent = Pct  Lost/Sent = Pct  Address
  0                                           SRC-CLIENT-01.contoso.com [10.10.0.128]
                             0/ 100 = 0%      |
  1  0ms    0/ 100 = 0%      0/ 100 = 0%      router1.network.contoso.com [10.10.0.1]
                             0/ 100 = 0%      |
  2  0ms    0/ 100 = 0%      0/ 100 = 0%      router2.network.contoso.com [10.30.0.1]
                             0/ 100 = 0%      |
  3  0ms    0/ 100 = 0%      0/ 100 = 0%      DEST-DC-01.contoso.com [10.30.0.166]
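To answer "where does the traffic die?" from saved TRACERT output, you can pull out the hops that answered and compare the last one with the destination. The following Python sketch assumes hop lines in the `name [address]` form shown in the TRACERT sample above; hops that time out, or that have no reverse DNS name, are simply skipped. The function name is hypothetical.

```python
import re

# A TRACERT hop line: hop number, three timings, then "name [ip]".
HOP_LINE = re.compile(r"^\s*(\d+)\s+.*\s(\S+) \[([\d.]+)\]\s*$")

def responding_hops(tracert_output):
    """Return (hop number, name, address) for each hop that answered with a name."""
    hops = []
    for line in tracert_output.splitlines():
        match = HOP_LINE.match(line)
        if match:
            hops.append((int(match.group(1)), match.group(2), match.group(3)))
    return hops

if __name__ == "__main__":
    sample = """Tracing route to DEST-DC-01.contoso.com [10.30.0.166] over a maximum of 30 hops:
1 1 ms 1 ms <1 ms router1.network.contoso.com [10.10.0.1]
2 <1 ms 1 ms <1 ms router2.network.contoso.com [10.30.0.1]
3 <1 ms <1 ms <1 ms DEST-DC-01.contoso.com [10.30.0.166]"""
    hops = responding_hops(sample)
    print("last responding hop:", hops[-1])
```

If the last responding hop is not the destination address, that hop (or the one just after it) is where to start asking questions.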

  • It’s not usually necessary, but you can also see further routing details with:

ARP –a
and
ROUTE PRINT

3. Do we have good name resolution?

We’re only in this step if we failed some of our earlier checks, or if we suspect that we only have partial name resolution (for example, a DC might have its A record but be missing the CNAME and SRV records needed for functionality). So now we’ll run through some tests to see why our name resolution isn’t working or to verify that we have all the records we need for our component.

[Figure: flowchart snippet, name resolution]

Note: It’s important that before you do any name resolution testing you always start with the following commands to ensure that you are not using cached information:

IPCONFIG /flushdns
NBTSTAT -R

Name resolution tests with NSLOOKUP - built-in tool in all supported Windows versions

  • Can we get the DNS server to give us back the A record with:

NSLOOKUP DEST-DC-01.contoso.com 10.20.0.20

This will return:

Server: DNS-01.contoso.com
Address: 10.20.0.20

Name: DEST-DC-01.contoso.com
Address: 10.30.0.166

Using the fully qualified domain name lets us know A record lookups are working. The important part about using NSLOOKUP is that it actually uses UDP DNS lookups, whereas the DNSCMD command below makes an RPC connection to the DNS server to return data, and isn’t a valid test of the DNS protocol itself.
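To make "UDP DNS lookup" a little more concrete, here is a rough Python sketch of the kind of packet such a lookup puts on the wire: a 12-byte header followed by one question for an A record. It is an illustration of the DNS message format under simple assumptions (ASCII names, one question, recursion requested), not a description of how NSLOOKUP is implemented, and the function names are made up.

```python
import socket
import struct

def encode_qname(name):
    """Encode a host name as length-prefixed DNS labels ending in a zero byte."""
    encoded = b""
    for label in name.rstrip(".").split("."):
        raw = label.encode("ascii")
        if not 0 < len(raw) <= 63:
            raise ValueError("each label must be 1 to 63 bytes long")
        encoded += bytes([len(raw)]) + raw
    return encoded + b"\x00"

def build_query(name, qtype=1, query_id=0x1234):
    """Build a DNS query: qtype 1 is an A record, class 1 is IN."""
    # Header: id, flags (0x0100 = recursion desired), 1 question, no other records.
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    question = encode_qname(name) + struct.pack("!HH", qtype, 1)
    return header + question

def send_query(server, name, timeout=3.0):
    """Send the query over UDP port 53 and return the raw reply bytes."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_query(name), (server, 53))
        reply, _ = sock.recvfrom(512)
    return reply

if __name__ == "__main__":
    print(build_query("DEST-DC-01.contoso.com").hex())
```

Calling `send_query("10.20.0.20", "DEST-DC-01.contoso.com")` would send that packet to the example DNS server; a timeout there points at UDP 53 being blocked or the server not answering.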

Name resolution tests with DNSCMD and NSLOOKUP (if appropriate) – support tools download for Windows 2000/XP/2003

  • Can we get the DNS server to give us back the CNAME and SRV records of our DCs with:

DNSCMD /EnumRecords _msdcs.contoso.com @ /Type CNAME

and

NSLOOKUP
> set type=all
> _ldap._tcp.dc._msdcs.contoso.com
> _kerberos._tcp.dc._msdcs.contoso.com

This is usually important for Directory Services engineers because the A record is only part of the puzzle. We also care about SRV records and CNAME records. That’s how AD works when it comes to LDAP, Kerberos, replication, and so on. So if you suspect one of those technologies has a name resolution issue this is appropriate to test.
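If you check several domains, it helps to generate the record names rather than type them. This small Python sketch builds the two SRV names used above for any domain, plus a one-line, non-interactive NSLOOKUP command for each; the helper names are made up, and the `-type=SRV` one-liner is offered as an alternative to the interactive session, not something taken from this post.

```python
def dc_locator_names(domain):
    """Return the LDAP and Kerberos SRV record names for domain controllers."""
    domain = domain.strip(".")
    return [
        f"_ldap._tcp.dc._msdcs.{domain}",
        f"_kerberos._tcp.dc._msdcs.{domain}",
    ]

def nslookup_commands(domain, dns_server):
    """Build one NSLOOKUP command line per SRV name, aimed at a specific DNS server."""
    return [f"NSLOOKUP -type=SRV {name} {dns_server}"
            for name in dc_locator_names(domain)]

if __name__ == "__main__":
    for command in nslookup_commands("contoso.com", "10.20.0.20"):
        print(command)
```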

Name resolution tests with NBTSTAT (if appropriate) - built-in tool in all supported Windows versions

  • Can we get WINS to give us back the records with:

NBTSTAT -c
NBTSTAT -n

This is important since despite all efforts to the contrary, WINS and NetBIOS name resolution are still part of many products, including DFS Namespaces, Netlogon, Terminal Services licensing, and much more.

If all these name resolution steps check out, it’s time to move to the Application layer testing phase.

4. Can we test our failing component using reliable tools?

The one you’ve been waiting for. At this stage we’ve eliminated the general network connectivity issues, and we suspect that just our component is a victim. If the network is fine, the most likely problems are firewall rules filtering our traffic and the application layer itself. Let’s go down some common paths to figure it out.

[Figure: flowchart snippet, component testing]

LDAP tests with LDP and PORTQRY – support tools download for Windows 2000/XP/2003; download Portqry.

  • Can we verify that LDAP is listening on DCs/GCs with:

PORTQRY -n DEST-DC-01.contoso.com -p tcp -e 389
PORTQRY -n DEST-DC-01.contoso.com -p tcp -e 636
PORTQRY -n DEST-DC-01.contoso.com -p both -e 3268
PORTQRY -n DEST-DC-01.contoso.com -p tcp -e 3269

Here’s a sample of working output from the first command:

TCP port 389 (ldap service): LISTENING

Using ephemeral source port
Sending LDAP query to TCP port 389...

LISTENING is good. :-) TCP-based LDAP ports should always be listening on DCs/GCs and never return NOT LISTENING or FILTERED. UDP-based ports should return LISTENING or FILTERED (as UDP is connectionless). Seeing TCP as FILTERED or anything as NOT LISTENING should be a red flag to find out why someone has configured a firewall to block or manipulate LDAP traffic.

NOTE: You should see more data than what is listed in the blog example.
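For the TCP case, the three states can be approximated with an ordinary socket connect. The Python sketch below is an approximation of the idea, not a reimplementation of PORTQRY: a completed connection is treated as LISTENING, an active refusal as NOT LISTENING, and silence until the timeout (or another network error) as FILTERED. The function name and the extra UNRESOLVED result are assumptions made for the example.

```python
import socket

def tcp_port_state(host, port, timeout=3.0):
    """Classify a TCP port roughly the way the text describes the PORTQRY states."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "LISTENING"          # three-way handshake completed
    except socket.gaierror:
        return "UNRESOLVED"             # the name did not resolve; not a port result
    except ConnectionRefusedError:
        return "NOT LISTENING"          # the host answered with a reset
    except OSError:
        return "FILTERED"               # timeout or other error; nothing useful came back

if __name__ == "__main__":
    for port in (389, 636, 3268, 3269):
        print(port, tcp_port_state("DEST-DC-01.contoso.com", port))
```

Unlike PORTQRY, this only proves the port accepts connections; it does not send an LDAP query afterwards.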

  • Can we connect to the domain controllers with LDP:

LDP
Connection --> Connect --> DEST-DC-01.contoso.com
Connection --> Bind
View --> Tree --> Select the domain naming context
Browse a few levels deep.

By doing the above with a reliable tool (i.e. not an application that does many things unspecific to LDAP and often uses ADSI rather than pure LDAP) we can see if unadulterated LDAP binds and queries are working. We also know that authentication is working.

SMB tests with NET USE and PORTQRY - download Portqry.

  • Can we verify that SMB is listening on ports 138 and 445 with:

PORTQRY -n DEST-DC-01.contoso.com -p udp -e 138
PORTQRY -n DEST-DC-01.contoso.com -p both -e 445

The same diatribe above applies here for LISTENING versus FILTERED. If we cannot get to 138 and 445 over the network, endless zillions of components will fail – follow that link to see what I mean, it’s a good one. If SMB is blocked via firewall rules, file sharing, group policy, named pipes, and many other applications will fail.

  • Can we connect over SMB (as an administrator) with:

NET USE \\DEST-DC-01.contoso.com\C$ /p:n

This simple and reliable test tells us that we can map a drive through SMB to the server. It also validates that at least NTLM authentication is working (to only use NTLM, use an IP address). You could use KLIST or KERBTRAY from the Resource Kit to confirm if there’s a Kerberos TGS ticket for that connection as well.
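The name-versus-address distinction in that test is easy to get wrong when scripting it, so here is a small Python sketch that builds the NET USE command and notes which authentication the target allows, following the rule stated above (an IP address means NTLM only). The helper names and the wording of the returned hint are made up for illustration.

```python
import ipaddress

def is_ip_literal(target):
    """Return True if the target is an IP address rather than a host name."""
    try:
        ipaddress.ip_address(target)
        return True
    except ValueError:
        return False

def net_use_command(target, share="C$"):
    """Build the non-persistent NET USE command used in the SMB test."""
    return "NET USE \\\\" + target + "\\" + share + " /p:n"

def auth_hint(target):
    """Describe which authentication the connection can use for this target."""
    return "NTLM only" if is_ip_literal(target) else "Kerberos or NTLM"

if __name__ == "__main__":
    for target in ("DEST-DC-01.contoso.com", "10.30.0.166"):
        print(net_use_command(target), "->", auth_hint(target))
```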

RPC tests with COMPMGMT and PORTQRY - download Portqry.

  • Can we verify the endpoint mapper is available and returning data with:

PORTQRY -n DEST-DC-01.contoso.com -p tcp -e 135

The endpoint mapper should always be LISTENING on TCP 135 (never FILTERED or NOT LISTENING) and should return all of its registered endpoint ports and named pipes. If the endpoint mapper is blocked due to firewall rules, a great many applications will fail.

  • Can we connect to the destination server with:

COMPMGMT.MSC
Computer Management --> Connect to another computer
Expand ‘System Tools’

COMPMGMT is an included app with simple RPC connectivity needs at startup. This will generate several MSRPC binds, query and respond to several RPC endpoints, and generally is a good test of basic RPC functionality. The list of RPC-based applications (from Microsoft and elsewhere) is a mile long and includes such things as AD replication, FRS replication, DFS Replication, and more.

PORTQRY scripting

Finally, here’s a little batch file you can use to run PORTQRY with a set of standard DS-related queries and output to a file. This is a useful way to see if any ports are looking troublesome even if you’re not sure which ones to be looking for. For the sharp-eyed, yes HTTP/HTTPS is included. Why? Certificate Authority Web Enrollment issues – we do a lot more in MS DS support than deal with account lockouts. :-)

@echo off
REM Sample batch wrapper script for portqry.exe
REM Designed to verify responsiveness of remote server specified on commandline
REM Requires PORTQRY.EXE in same directory as script

REM Example: checkports.cmd DEST-DC-01.contoso.com

REM Please note that this script is provided "AS IS" with no warranties, and confers no rights.
REM Use of included script sample is subject to the terms specified at
REM https://www.microsoft.com/info/cpyright.htm

ECHO Querying DNS
Portqry -n %1 -p both -e 53 > %1_checkports.txt

ECHO Querying DHCP
Portqry -n %1 -p udp -e 67 >> %1_checkports.txt

ECHO Querying HTTP
portqry -n %1 -p tcp -e 80 >> %1_checkports.txt

ECHO Querying Kerberos KDC Service
portqry -n %1 -p both -e 88 >> %1_checkports.txt

ECHO Querying NTP Time Service
Portqry -n %1 -p udp -e 123 >> %1_checkports.txt

ECHO Querying RPC EndPoint Mapper Service
portqry -n %1 -p tcp -e 135 >> %1_checkports.txt

ECHO Querying NetBIOS Name Service (WINS)
portqry -n %1 -p both -e 137 >> %1_checkports.txt

ECHO Querying NetBIOS Datagram Service
portqry -n %1 -p udp -e 138 >> %1_checkports.txt

ECHO Querying NetBIOS Session Service
portqry -n %1 -p tcp -e 139 >> %1_checkports.txt

ECHO Querying LDAP
portqry -n %1 -p tcp -e 389 >> %1_checkports.txt

ECHO Querying HTTP over SSL
portqry -n %1 -p both -e 443 >> %1_checkports.txt

ECHO Querying SMB
portqry -n %1 -p both -e 445 >> %1_checkports.txt

ECHO Querying Kerberos Logon
portqry -n %1 -p both -e 464 >> %1_checkports.txt

ECHO Querying LDAP over SSL
portqry -n %1 -p tcp -e 636 >> %1_checkports.txt

ECHO Querying Win2000/2003 AD Logon and Directory Replication
portqry -n %1 -p tcp -o 1025,1026 >> %1_checkports.txt

ECHO Querying Global Catalog
portqry -n %1 -p both -e 3268 >> %1_checkports.txt

ECHO Querying Global Catalog over SSL
portqry -n %1 -p tcp -e 3269 >> %1_checkports.txt

ECHO Querying Terminal Server / Remote Desktop
Portqry -n %1 -p tcp -e 3389 >> %1_checkports.txt

start notepad %1_checkports.txt
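The resulting text file gets long, so here is a hedged Python sketch for skimming it. It assumes every result line follows the pattern of the sample shown earlier ("TCP port 389 (ldap service): LISTENING") and applies the rule from this post: anything NOT LISTENING is suspicious, and so is a TCP port that is FILTERED. The function name, and every service name in the sample other than ldap, are assumptions.

```python
import re

# One PORTQRY result line: protocol, port number, service label, state.
RESULT_LINE = re.compile(r"^(TCP|UDP) port (\d+) \(([^)]*)\): (.+)$")

def suspicious_ports(report_text):
    """Return (protocol, port, state) for results worth a closer look."""
    flagged = []
    for line in report_text.splitlines():
        match = RESULT_LINE.match(line.strip())
        if match is None:
            continue
        protocol, port, _service, state = match.groups()
        state = state.strip()
        if state == "NOT LISTENING" or (protocol == "TCP" and state == "FILTERED"):
            flagged.append((protocol, int(port), state))
    return flagged

if __name__ == "__main__":
    sample = """TCP port 389 (ldap service): LISTENING
TCP port 636 (ldaps service): FILTERED
UDP port 138 (netbios-dgm service): LISTENING or FILTERED
UDP port 123 (ntp service): NOT LISTENING"""
    for protocol, port, state in suspicious_ports(sample):
        print(protocol, port, state)
```

To use it on real output, read the file the batch script wrote (for example `DEST-DC-01.contoso.com_checkports.txt`) and pass its contents to `suspicious_ports`.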

Further reading:

https://blogs.technet.com/networking/ Official MS Support blog of networking

https://blogs.technet.com/netmon/ Official Dev blog of NetMon

Download NetMon3.1

Service overview and network port requirements for the Windows Server system

Happy hunting.

- Ned Pyle