Troubleshooting Methodology

10/08/2009

Applies To: Windows Server 2003, Windows Server 2003 with SP1

To efficiently troubleshoot Web applications, you need to apply and consistently use a troubleshooting framework, or methodology. Doing so helps you to streamline your troubleshooting activities and approach problems with confidence.

The fundamental troubleshooting methodology has five phases:

Phase 1: Discovery. Gather information about the problem.
Phase 2: Planning. Create a plan of action.
Phase 3: Problem Reproduction. Reproduce the problem, or determine that you cannot reproduce it. If you cannot reproduce the problem, then you might not have enough information to confirm that there is a problem.
Phase 4: Problem Isolation. Isolate the variables that relate directly to the problem.
Phase 5: Analysis. Analyze your findings to determine the cause of the problem.

Phase 1: Discovery

Most problems are discovered by the end user. Problems might also be revealed by the operating system or by application logging and event tracking utilities. When a problem is reported, begin the troubleshooting process by gathering information to confirm that a problem exists.

Interview the user who reported the problem

If an end user reported the problem, ask that user for detailed information about what occurred. If possible, get the exact text of any error messages. If the problem recurs, ask the user to provide screen shots for error-message information or other symptoms that appeared on a screen. If the problem is behavioral and does not generate an error message, ask very specific, closed-ended questions. For example, you might ask the following questions:

What time did the problem occur? (You can use this information to locate proximate events in a large log file or to correlate the problem with external causes.)
Which button(s) or link(s) did you click just before the problem occurred?
Did you refresh or re-request the Web page to recover from the problem?

Examine the records

The Microsoft® Windows® Server 2003, Standard Edition; Windows® Server 2003, Enterprise Edition; Windows® Server 2003, Web Edition; and Windows® Server 2003, Datacenter Edition operating systems provide logging and event tracking. Check the logs for entries around the time that the problem occurred. Additionally, the application itself might log events. For example, the IIS logs and the HTTP error log record this information. For more information, see Analyzing Log Files.
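
As an illustration of checking the logs around the time of a problem, the following minimal Python sketch filters entries that fall within a few minutes of a reported time. It assumes the log uses the W3C extended format with a `#Fields:` directive that includes `date` and `time`. The function name `entries_near`, the sample entries, and the five-minute window are hypothetical choices made for this example.

```python
from datetime import datetime, timedelta

# Hypothetical sample in W3C extended log format (not taken from a real server).
SAMPLE_LOG = """\
#Version: 1.0
#Fields: date time cs-method cs-uri-stem sc-status
2009-10-08 14:01:12 GET /default.asp 200
2009-10-08 14:29:58 POST /orders/submit.asp 500
2009-10-08 14:31:03 GET /orders/submit.asp 200
2009-10-08 15:10:44 GET /default.asp 200
"""


def entries_near(lines, when, window_minutes=5):
    """Return log entries (as dicts) logged within the window around `when`."""
    window = timedelta(minutes=window_minutes)
    fields = []
    matches = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith("#Fields:"):
            # The #Fields directive names the columns of the entries that follow.
            fields = line[len("#Fields:"):].split()
            continue
        if line.startswith("#"):
            continue  # skip other directives such as #Version or #Date
        entry = dict(zip(fields, line.split()))
        if "date" not in entry or "time" not in entry:
            continue  # cannot place this entry in time
        stamp = datetime.strptime(
            entry["date"] + " " + entry["time"], "%Y-%m-%d %H:%M:%S"
        )
        if abs(stamp - when) <= window:
            matches.append(entry)
    return matches


if __name__ == "__main__":
    reported = datetime(2009, 10, 8, 14, 30)
    for entry in entries_near(SAMPLE_LOG.splitlines(), reported):
        print(entry["time"], entry["cs-uri-stem"], entry["sc-status"])
```

With the sample above, only the 14:29:58 and 14:31:03 entries are selected. W3C extended logs typically record times in UTC, so convert the time that the user reports before you compare it with log timestamps.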

Phase 2: Planning

Creating a plan of action is quite possibly the most important phase in the troubleshooting process. In this phase, you list the steps that you will follow as you proceed through the other troubleshooting phases. Stick to this plan and refer to it often. When you are bogged down in the tasks of the problem isolation phase, referring to your plan will help you remember where you are in the troubleshooting process and avoid getting sidetracked. Revise the plan as you progress — in most cases one step will dictate later steps.

When troubleshooting IIS–related problems, your plan might consist of configuring IIS logging to log extra details and then setting up a Performance Monitor log to run during a specific period of time. Based on the results of these actions, the next step might involve making a configuration change.

Phase 3: Problem Reproduction

Determine whether the problem is readily reproducible. The ability to reproduce the problem on demand is fundamental to properly and efficiently troubleshooting the problem. Use the information that you gathered in Phase 1 to reproduce the problem, most often by repeating the actions taken by the end user before the problem occurred. If you can readily reproduce the problem, the isolation phase is more manageable. When you have determined a set of steps or events that trigger the problem, move to the isolation phase.

If you cannot readily reproduce the problem, the problem isolation phase can be tedious, and isolating the problem might even be impossible. In that case, prepare to gather the right kind of information the next time the problem happens. Consider doing any or all of the following:

Enable more detailed event tracking.
Ask users to watch for the problem and pay close attention to what they are doing if it occurs.
Write additional code in the application that looks for and highlights the problem if it happens again.

For example, when dealing with a Web application problem, you might add extra tracing code to your ASP pages or you might configure Windows security auditing for specific failures. The goal is to gather enough information so that the next time the problem occurs, the information that you obtain will be sufficient to correctly diagnose the problem.
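
The idea of adding extra tracing code can be sketched as follows. The article refers to ASP pages; this sketch uses Python only to illustrate the pattern, and the names `traced`, `order_total`, and the logger name `app.trace` are hypothetical. The wrapper records the inputs, the result, and any exception, so that the next occurrence of the problem leaves enough detail behind to diagnose it.

```python
import functools
import logging

# Hypothetical logger name; attach whatever handler your application uses.
logger = logging.getLogger("app.trace")


def traced(func):
    """Log the arguments, result, and any exception of each call to `func`."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        logger.info("enter %s args=%r kwargs=%r", func.__name__, args, kwargs)
        try:
            result = func(*args, **kwargs)
        except Exception:
            # logger.exception logs at ERROR level and includes the traceback.
            logger.exception("failed %s args=%r kwargs=%r",
                             func.__name__, args, kwargs)
            raise  # tracing must not change the behavior of the application
        logger.info("exit %s result=%r", func.__name__, result)
        return result
    return wrapper


@traced
def order_total(quantity, unit_price):
    """Example function under suspicion (hypothetical)."""
    if quantity < 0:
        raise ValueError("quantity must not be negative")
    return quantity * unit_price
```

The wrapper re-raises the exception after logging it, so the application behaves as before and only the amount of recorded detail changes. Remove or disable the tracing after the problem is diagnosed, because detailed logging adds overhead.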

Phase 4: Problem Isolation

In the problem isolation phase, you reproduce the problem as efficiently as possible by using repetitive steps. In this phase, you eliminate variables — such as settings, file actions, component starts/stops, or any change in the execution — that do not cause the problem, narrowing the variables down to those that are responsible for the problem.

You have succeeded in isolating the problem when you achieve the following conditions:

The problem can be reproduced consistently when you take a fixed series of actions.
The problem cannot be reproduced when you omit any of those actions.
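
The two conditions above can be expressed as a small search, shown here as a minimal Python sketch. The function `isolate` and the example actions are hypothetical. The sketch assumes that the problem reproduces deterministically and that adding unrelated actions never hides it. The caller supplies `reproduces`, a function that performs the given actions and returns True if the problem occurred.

```python
def isolate(actions, reproduces):
    """Shrink `actions` to a series in which every remaining action is needed.

    `reproduces(actions)` must run the actions and return True if the
    problem occurred. Assumes deterministic reproduction.
    """
    if not reproduces(actions):
        raise ValueError("the full series of actions does not reproduce the problem")
    needed = list(actions)
    index = 0
    while index < len(needed):
        trial = needed[:index] + needed[index + 1:]
        if reproduces(trial):
            needed = trial  # the omitted action is not required; drop it
        else:
            index += 1      # omitting it hides the problem; keep it
    return needed


if __name__ == "__main__":
    # Hypothetical stand-in for a real test run: the problem appears only
    # when a coupon is applied and the order is then submitted.
    def fake_run(actions):
        return "apply coupon" in actions and "submit order" in actions

    steps = ["log in", "open cart", "change language",
             "apply coupon", "submit order"]
    print(isolate(steps, fake_run))  # ['apply coupon', 'submit order']
```

Each trial omits exactly one action, which mirrors the manual practice of changing one variable at a time. In a real investigation, `reproduces` is usually a person or a test script that repeats the steps against the server.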

For example, when troubleshooting IIS–related problems, you might do the following:

Configure a Web application to be hosted in its own worker process, or configure a component to execute in an isolated host process such as COM+ Dllhost.exe, to determine which part of an application is consuming CPU time or leaking memory.
Configure security settings on a URL in varying degrees through repetitive tests to isolate authentication problems.
Capture performance data to determine whether a problem lies in the core Web server or in a Web service extension–based application, such as ASP or ASP.NET.

Phase 5: Analysis

Depending on the nature of the problem, the analysis phase can be the most difficult phase of the troubleshooting process. In the analysis phase, you use everything that you learned in the previous phases to do the following:

Determine the cause of the problem.
Explore how the problem affects the application.
Determine the best way to solve the problem.

This step might be a simple formality if, after Phase 3 is complete, you know what the problem is and how to fix it. If not, you must perform additional analysis of the data you have gathered. You might find that you must spend more time in the isolation phase.

You might find that there are several ways to fix the problem. Take what you have learned about the problem and decide which actions to take. Ask yourself these kinds of questions:

What is the impact of the problem?
Do the benefits of fixing the problem outweigh the costs of fixing it?
Is there an acceptable workaround?

Record your progress and the data that you have gathered.

As you obtain information about a problem, you will make decisions about the steps that you need to take that will often alter your plan of action. It is important to record the information that you collect so that you have a way to account for your decisions as you move through the process of troubleshooting. If you find yourself taking the wrong path, having this information will allow you to backtrack to the point where you made the wrong decision. In addition, your ability to explain the changes or improvements in the problem is often important to other stakeholders.
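
One way to keep such a record is a simple append-only journal, sketched below in Python. The classes `Journal` and `JournalEntry`, and their fields, are hypothetical and not part of any Windows or IIS tool. Each entry captures the phase, the action taken, what was observed, and the decision that followed.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class JournalEntry:
    phase: str        # for example "Discovery" or "Problem Isolation"
    action: str       # what you did
    observation: str  # what you saw
    decision: str     # what you decided to do next, and why
    recorded_at: datetime = field(
        default_factory=lambda: datetime.now(timezone.utc)
    )


@dataclass
class Journal:
    entries: list = field(default_factory=list)

    def record(self, phase, action, observation, decision):
        """Append an entry and return its position for later reference."""
        self.entries.append(JournalEntry(phase, action, observation, decision))
        return len(self.entries) - 1

    def steps_after(self, position):
        """Return entries recorded after `position`, most recent first.

        These are the steps to review or undo when backtracking to the
        decision recorded at `position`.
        """
        return list(reversed(self.entries[position + 1:]))
```

If a decision turns out to be wrong, `steps_after` lists the later steps in the order in which you would undo them. The same record also helps you explain changes in the problem to other stakeholders.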
