High-level Methodology for Troubleshooting Active Directory Problems

12/09/2009

Overview

Your entry point into troubleshooting an Active Directory problem might be as straightforward as receiving an event in an event log or an alert from a monitoring system. If the event or alert specified the components that are involved in the problem, you can start troubleshooting the process or event by referring to the appropriate section later in this guide.

However, if you are responding to a user call or a symptom noticed by IT personnel, you need to isolate the problem. You might also need to use the process in this section if previous troubleshooting efforts for an event or alert did not solve the problem. There is a possibility that you are not troubleshooting the correct combination of components.

In any case, you need to be familiar with the high-level methodology that follows for troubleshooting Active Directory. This helps you to isolate the problem to the correct components or identify a different set of components if necessary.

Figure 2.1 shows the process for troubleshooting Active Directory.

Figure 2.1: Troubleshooting Active Directory

Documenting the Problem

Documenting the problem can reduce misunderstandings and help you resolve issues more quickly. It provides an accurate history that facilitates vendor involvement when necessary. This history also helps in the problem management process. If a particular problem keeps occurring, you can use past incident histories to identify and resolve the problem.

How you begin to document the problem depends on whether you are using a monitoring system, which is a best practice for Active Directory operations. If you are not using a monitoring system, all of your help desk tickets will be generated when a dissatisfied user logs a complaint. At this point, you are reactively troubleshooting, and the problem is more urgent. Due to the nature of reactive problem-solving, you might experience a service disruption at a significant cost. It is important to use a monitoring system to avoid these costs.

If you are following the best practices for operations and are using a monitoring system, usually the monitoring system proactively alerts you before an issue escalates to a service outage. A monitoring system is also likely to indicate the most common ways to resolve the problem. If you are alerted to a problem by the monitoring system, open a new help desk ticket and document all information raised by the alert, including the suggested remedies. Collect as much supporting information from the monitoring system as possible, including other alerts occurring on the same computer or other computers and services that might also be involved in the problem.

Then open a problem ticket for the customer call and verify that you have enough information to proceed. Typically, you need information such as:

Date and time of occurrence.
Error message number and text.
Client information, including:
- Computer name for the client.
- User ID being used when the problem occurred.
- TCP/IP configuration.
- List of DNS servers that the client is configured to use.
- Operating system version, service pack, and any hot fixes.
Server information, including:
- Computer name for the server.
- TCP/IP configuration.
- Operating system version, service pack, and any hot fixes.
Network information, including:
- Domain name of the client.
- Domain name of the server.
Application name and related settings.
Service involved in the problem, such as network basic input/output system (NetBIOS), DNS, Server Message Block (SMB), or Lightweight Directory Access Protocol (LDAP).

In addition, identify whether:

The problem is repeatable. If so, include the steps taken to reproduce the problem.
Others are having the same problem.
Help desk is able to duplicate and verify the issue. Include any troubleshooting steps already taken by the help desk, such as using Ping to verify network connectivity to the client or server.
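
The items listed above can be captured in a small record so that a ticket is checked for completeness before troubleshooting starts. The following Python sketch is one possible layout, not a prescribed ticket schema. The class names, field names, and the choice of which items count as required are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional


@dataclass
class HostInfo:
    """Details recorded for a client or a server (hypothetical structure)."""
    computer_name: str
    ip_config: str = ""      # for example, the pasted TCP/IP configuration
    os_version: str = ""     # operating system version, service pack, hot fixes
    dns_servers: List[str] = field(default_factory=list)
    domain_name: str = ""


@dataclass
class ProblemTicket:
    """One problem ticket; the field names are illustrative only."""
    occurred_at: Optional[datetime] = None
    error_text: str = ""
    user_id: str = ""
    client: Optional[HostInfo] = None
    server: Optional[HostInfo] = None
    application: str = ""
    service: str = ""        # for example "DNS", "SMB", "LDAP", or "NetBIOS"
    repeatable: Optional[bool] = None
    repro_steps: List[str] = field(default_factory=list)

    def missing_fields(self) -> List[str]:
        """Return the names of the required items that are still empty."""
        missing = []
        if self.occurred_at is None:
            missing.append("occurred_at")
        if not self.error_text:
            missing.append("error_text")
        if not self.user_id:
            missing.append("user_id")
        if self.client is None or not self.client.computer_name:
            missing.append("client")
        if self.server is None or not self.server.computer_name:
            missing.append("server")
        if not self.service:
            missing.append("service")
        # A repeatable problem should come with the steps to reproduce it.
        if self.repeatable and not self.repro_steps:
            missing.append("repro_steps")
        return missing
```

If missing_fields() returns any names, go back to the caller or the monitoring system for that information before you proceed.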

Important: If the problem was not reported by the monitoring system, first open a new problem ticket to correct the gap in your monitoring coverage and then communicate the failure to the appropriate personnel. Information derived from troubleshooting this problem can provide the monitoring or problem management team with valuable insight to help detect and potentially prevent this problem in the future.

For more information about problem tickets, see the Microsoft Operations Framework (MOF) link on the Web Resources page at https://www.microsoft.com/windows/reskits/webresources/.

Identifying the Components Involved

Identify the specific components that are involved in the problem, including the clients, network paths, servers, and services. Taking time to properly identify the machines and the actual protocols or services involved minimizes the risk of wasting significant time trying to solve the wrong problems on the wrong computers. The information you obtained while documenting the problem is a good starting point, but the problem might require additional investigation to ensure that you have identified the correct components.

Note: When troubleshooting Active Directory, remember that the client is the computer that makes the request and the server is the computer that responds to the request. Thus, computers running Microsoft® Windows 2000® Professional or Microsoft® Windows 2000 Server can be either clients or servers, depending on whether they are initiating or responding to a request.

Identifying the right components can be easy, such as when a workstation makes an LDAP call to a domain controller. However, it can also be much more complex, such as when a workstation that issues a net use command to a file server receives an "Access Denied" error message.

In the latter case, the workstation is clearly the client because it initiated the request. The other apparent components (server and service) are the file server that received the request and the SMB service (the file and print access protocol used by Windows 2000). However, an entirely different server and service might be causing the problem. Consider the problems that can occur when connecting to the server:

DNS or WINS might not return the correct IP address for the intended server to the client. This indicates a name resolution problem, which involves a different server and service.
If the client is using Kerberos authentication as the authentication protocol, the Key Distribution Center (KDC) could be returning an error. This might indicate a time synchronization problem, which involves the KDC and the Windows Time Service.

Know the required steps for all of the protocols and services to function successfully, and be familiar with the common breaking points for each step.
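
The two breaking points described above, name resolution and time synchronization, can each be checked with a few lines of script. The following Python sketch is illustrative only. The function names are hypothetical, the lookup uses whatever resolver the local operating system is configured with (which may include sources other than DNS or WINS), and the five-minute limit is the default Kerberos clock skew tolerance, which your domain policy might override.

```python
import socket
from datetime import datetime, timedelta

# Default Kerberos tolerance for clock skew; adjust this value if your
# domain policy uses a different maximum.
MAX_SKEW = timedelta(minutes=5)


def resolved_addresses(host_name: str) -> set:
    """Return the set of IP addresses the local resolver gives for host_name."""
    infos = socket.getaddrinfo(host_name, None)
    return {info[4][0] for info in infos}


def name_resolution_ok(host_name: str, expected_ip: str) -> bool:
    """Return True if expected_ip is among the addresses resolved for host_name."""
    try:
        return expected_ip in resolved_addresses(host_name)
    except socket.gaierror:
        return False  # the name did not resolve at all


def clock_skew_ok(client_time: datetime, kdc_time: datetime,
                  max_skew: timedelta = MAX_SKEW) -> bool:
    """Return True if the two clocks differ by no more than max_skew."""
    return abs(client_time - kdc_time) <= max_skew
```

A False result from name_resolution_ok points toward the name resolution servers rather than the file server, and a False result from clock_skew_ok points toward time synchronization between the client and the KDC.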

Verifying Client Health

Because all client/server communications begin with the client issuing a request, start the troubleshooting process by verifying the health of the client computer that you identified in the previous step. The client must be correctly configured, connected to the network, and functioning properly. To verify the client health, perform the following tests:

Verify that the client is connected to the local area network (LAN). Verify that network cables and hubs are firmly connected, and that any status indicators on network adapters and hubs are reporting activity.
Use Performance Monitor to ensure that the client's CPU usage is not too high.
Verify network configuration for the client. Verify that the client's IP configuration settings, including DNS and WINS settings, are correct. Resolve any problems before continuing.

Client health problems are generally simple to fix. If you find a problem at this point, correct it before proceeding.
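
Part of the network configuration check can be automated once you have recorded the client's settings. The following Python sketch uses the standard ipaddress module to flag a few common misconfigurations. The function name, its parameters, and the wording of the findings are assumptions for illustration, and the checks are not exhaustive.

```python
import ipaddress
from typing import List


def check_ip_settings(address: str, prefix_len: int, gateway: str,
                      dns_servers: List[str]) -> List[str]:
    """Return a list of findings; an empty list means no issue was detected."""
    findings = []
    iface = ipaddress.ip_interface(f"{address}/{prefix_len}")

    # 169.254.0.0/16 is the link-local (APIPA) range, which a client
    # typically assigns to itself when it cannot reach a DHCP server.
    if iface.ip.is_link_local:
        findings.append("address is link-local (APIPA); DHCP may have failed")

    # The default gateway must be reachable on the client's own subnet.
    if ipaddress.ip_address(gateway) not in iface.network:
        findings.append("default gateway is outside the client's subnet")

    if not dns_servers:
        findings.append("no DNS servers configured")
    for server in dns_servers:
        try:
            ipaddress.ip_address(server)
        except ValueError:
            findings.append(f"DNS server entry is not a valid IP address: {server}")
    return findings
```

For example, check_ip_settings("192.168.1.10", 24, "192.168.1.1", ["192.168.1.2"]) returns an empty list. The script cannot tell you whether the listed DNS servers are the ones your records say the client should use, so compare them manually.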

For more information about troubleshooting client health problems, see the Operations Guide of the Microsoft® Windows 2000 Server Resource Kit. For more information about troubleshooting networking problems, see the TCP/IP Core Networking Guide of the Windows 2000 Server Resource Kit.

Verifying Network Path

Verify that the network path between the client and server is working properly. Although the problem ticket might indicate that the help desk was able to reach the server, the client is most likely on a different network segment, so verify the network path again from the client. You can either perform the following tests at the client, or use Terminal Services or Remote Assistance from your current location to issue the commands from the client. Perform the following tests:

Verify network configuration. Ensure that the IP configuration is what it should be, according to your records.
Verify network connectivity between the client and the server by using the IP address of each computer. If connectivity is a problem, open a new problem ticket as described earlier. Perimeter firewalls, IPSec, network address translation (NAT) between the client and server, or personal firewalls like those included in Windows XP Professional can cause connectivity problems.
If you cannot verify that the server received a request, or that the client received the response, use Network Monitor (NetMon) to perform a trace at the client and server. For more information about using Network Monitor, see "Monitoring Network Performance" in the Operations Guide of the Windows 2000 Server Resource Kit.
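
In addition to the tests above, a quick way to check the path for a specific service is to attempt a TCP connection from the client to the server's IP address on the port that the service uses. The following Python sketch is a minimal example. The function name is hypothetical, and the port you pass in depends on the service you identified earlier (for example, LDAP commonly listens on TCP port 389 and SMB on TCP port 445).

```python
import socket


def tcp_port_reachable(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable networks.
        return False
```

A successful connection shows only that the path and the listening port are available; it does not prove that the service is healthy. A failure does not distinguish between a stopped service and a firewall that blocks the port, so follow up with a network trace if the cause is unclear.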

For more information about troubleshooting network problems, see the TCP/IP Core Networking Guide of the Windows 2000 Server Resource Kit.

Verifying Server Health

To verify server health, perform the same verification tests on the server that you do on the client, to make sure that the server is configured correctly, connected to the network, and functioning properly. Perform the following steps:

Verify that the server is connected to the LAN. Verify that network cables and hubs are firmly connected, and that any status indicators on network adapters and hubs are reporting activity.
Verify network configuration. Verify that IP configuration settings, including DNS and WINS settings, are correct. Resolve any problems before continuing.
Verify network connectivity by using Ping or Pathping. If any of the Ping or Pathping tests fail, see "TCP/IP Troubleshooting" in the TCP/IP Core Networking Guide of the Windows 2000 Server Resource Kit.

For more information about troubleshooting server health problems, see the Operations Guide of the Windows 2000 Server Resource Kit. For more information about troubleshooting networking problems, see the TCP/IP Core Networking Guide of the Windows 2000 Server Resource Kit.

Verifying Service Health

For the service that you have identified, verify that the:

Service is installed properly on the server.
Service is running.
User has permissions to make the request.

In addition, view the service event log (typically, the application event log). If you find any warning or error events in the event log, determine the source and refer to the corresponding section in this guide for further troubleshooting procedures. If the event is not discussed in this guide, search the Microsoft Knowledge Base. To search the Microsoft Knowledge Base, see the Microsoft Knowledge Base link on the Web Resources page at https://www.microsoft.com/windows/reskits/webresources/.
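
When an event log contains many entries, it can help to count the warning and error events by source so that you know which section of this guide to consult first. The following Python sketch assumes that you have already exported the events into a list of dictionaries. The "level" and "source" keys are hypothetical names chosen for this example and do not correspond to any particular export format.

```python
from collections import Counter
from typing import Dict, Iterable


def problem_sources(events: Iterable[Dict[str, str]]) -> Counter:
    """Count warning and error events per event source."""
    counts = Counter()
    for event in events:
        # Informational events are ignored; only warnings and errors count.
        if event.get("level", "").lower() in ("warning", "error"):
            counts[event.get("source", "unknown")] += 1
    return counts
```

Calling most_common() on the result lists the sources with the most warnings and errors first, which is a reasonable order in which to investigate them.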

For more information about troubleshooting service health problems, see the Operations Guide of the Windows 2000 Server Resource Kit.

Iterate the Troubleshooting Process

If the components that you initially identified do not reveal the root cause of the problem, you must identify additional components involved in the problem. Identify the next client, server, or service that might be involved in the problem and verify the health of each of those components until you reach the actual source of the problem.

You might need to iterate the process for troubleshooting Active Directory on several different components before you successfully identify the root cause. In this case, you must "walk the chain," or repeat the troubleshooting process on each component that might be involved in the problem. Consider the following example, where you must iterate the troubleshooting process to identify the correct components.

A company has four domain controllers (DC1, DC2, DC3, and DC4). DC1 replicates to DC2, DC2 replicates to DC3, and DC3 replicates to DC4 (this is referred to as transitive replication). An administrator adds a user to Active Directory at DC1. Several hours later, the change still has not replicated to DC4. You initially identify DC3 and DC4 as the client and server involved. Your troubleshooting indicates that DC3 did not replicate the change to DC4. After verifying the health of the client, the network, the server, and replication, you determine that they are working properly. You must then iterate the troubleshooting process, but with the next link in the chain: DC2 and DC3. If this pair is working properly, then you need to verify DC1 and DC2.
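
The "walk the chain" idea in this example can be sketched in a few lines of Python. The sketch simplifies the scenario: it assumes that you have already checked each domain controller for the new user and recorded the result in a dictionary, and it treats a link as the likely point of failure when its source holds the change but its destination does not. The function name and the has_change mapping are hypothetical.

```python
from typing import Dict, List, Optional, Tuple


def find_broken_link(chain: List[str],
                     has_change: Dict[str, bool]) -> Optional[Tuple[str, str]]:
    """Walk the replication chain from the last link back to the first.

    Returns the (source, destination) pair where the source holds the
    change but the destination does not, or None if no such pair exists.
    """
    for i in range(len(chain) - 1, 0, -1):
        source, destination = chain[i - 1], chain[i]
        if has_change[source] and not has_change[destination]:
            return (source, destination)
    return None


# The change reached DC2 but not DC3, so DC3 had nothing to send to DC4.
chain = ["DC1", "DC2", "DC3", "DC4"]
has_change = {"DC1": True, "DC2": True, "DC3": False, "DC4": False}
broken = find_broken_link(chain, has_change)  # ("DC2", "DC3")
```

The pair that is returned identifies the client and server to take through the full troubleshooting process next. It does not replace verifying the health of the client, the network, the server, and replication for that pair.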

Applying a structured approach to the troubleshooting process helps you methodically find the root cause of any distributed systems problem, regardless of the client, server, or service involved.