Troubleshooting 101

It’s been far too long between blog posts so here’s a post and a promise to blog more frequently...

 

I thought I’d share some of my thoughts on this topic that’s been the focus of my career for the past 7.5 years. I’ve tried to keep this as generic as possible.

 

Troubleshooting is part art and part science. Fortunately, with a logical approach it can be more like a science and less like a modern art disaster! :) Far too often I encounter situations where the chaos associated with a problem has clouded the judgement of those tasked with addressing it. In my experience, a structured, logical approach achieves results faster than one that mirrors the chaos of those affected by the problem. Talking about how important it is to resolve the problem does nothing to address it. Only action will achieve the outcome desired by all involved.

 

At the risk of stating the obvious, a high-level overview of my troubleshooting approach is as follows:

 

1) Define the problem

2) Gather data

3) Analyse data

4) Implement potential solutions

5) Repeat 1)-4) until the problem is resolved

 

 

1) Define the problem

 

The first and arguably most important step in troubleshooting is to define the problem that you’re hoping to overcome. After all, without a clear understanding of what you’re hoping to overcome or achieve, you’ve got little hope of getting there. Some examples of the questions you should be asking include:

 

- What symptom(s) indicate that the problem occurred or is occurring?

- Are there multiple symptoms that can be attributed to the problem either at the time of the problem or even in the timeframe leading up to the problem?

- How do you know that the problem has occurred or is occurring?

- When did the problem first occur?

- When did the problem last occur?

- Approximately how frequently is the problem occurring?

- What action(s) are you taking to recover from the problem state when it occurs?

- Can you reproduce the issue at will? If so, what are the steps necessary to do so?

 

Note that the above is not a definitive list. However, I hope it’s enough to give you an idea of the type of questioning that should occur before proceeding further down the troubleshooting path.
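The questions above can be captured in a simple record so that it’s obvious which ones still need an answer. What follows is only a minimal sketch in Python; the ProblemDefinition class and its field names are my own hypothetical choices for illustration, not part of any tool.

```python
from dataclasses import dataclass, fields
from typing import List, Optional


@dataclass
class ProblemDefinition:
    """Hypothetical record of the problem-definition questions.

    A value of None means the question has not been answered yet.
    """
    symptoms: Optional[str] = None
    how_detected: Optional[str] = None
    first_occurrence: Optional[str] = None
    last_occurrence: Optional[str] = None
    frequency: Optional[str] = None
    recovery_actions: Optional[str] = None
    reproduction_steps: Optional[str] = None

    def unanswered(self) -> List[str]:
        """Return the names of the questions that still have no answer."""
        return [f.name for f in fields(self) if getattr(self, f.name) is None]


# Example values are invented purely for illustration.
problem = ProblemDefinition(
    symptoms="Service stops responding",
    frequency="About twice a week",
)
print(problem.unanswered())
```

Anything still listed by unanswered() is a question worth asking before moving on to the next step.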

 

 

2) Gather data

 

Gather data that helps you to understand the problem. Examples include the configuration of the affected environment, the events leading up to the problem and the state of the environment at the time of the problem.
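As one small illustration of recording the environment state, the sketch below takes a timestamped snapshot of some basic facts about a machine using only the Python standard library. The capture_snapshot function and the fields it collects are assumptions made for the example; the data that actually matters will depend entirely on your problem domain.

```python
import json
import platform
import sys
from datetime import datetime, timezone


def capture_snapshot(note: str) -> dict:
    """Collect a few basic environment facts along with a UTC timestamp.

    The choice of fields here is illustrative only.
    """
    return {
        "captured_at": datetime.now(timezone.utc).isoformat(),
        "note": note,
        "hostname": platform.node(),
        "os": platform.platform(),
        "python_version": sys.version.split()[0],
    }


# The note text is invented for illustration.
snapshot = capture_snapshot("Taken immediately after the problem was reported")
print(json.dumps(snapshot, indent=2))
```

Taking the same snapshot both when things are healthy and when the problem occurs gives you something to compare during analysis.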

 

 

3) Analyse data

 

Invest time in thoroughly analysing the data that has been gathered. Leverage automated analysis tools where possible. Your goal should be to extract clues from the data that might help you to figure out the cause of the issue. Search whatever resources you have available (e.g. the Internet) in an attempt to locate others who have experienced the same or similar situations. You won’t always find others who’ve encountered exactly the same issue. However, you’re likely to find others who’ve faced something similar, and you’re likely to learn from their journey.

 

 

4) Implement potential solutions

 

Potential solutions should be justified by observations from the data analysis and/or experience in the problem domain in general. It’s often necessary to actively promote the solutions you consider most likely to work, as those in control are sometimes reluctant to risk any change to the affected environment. The reality is that a change of some sort is likely to be necessary to resolve the issue, so don’t be shy about pushing the changes you feel are most likely to achieve the objective.

 

 

5) Repeat 1)-4) until the problem is resolved

 

Troubleshooting is often an iterative process. Don’t expect to “nail it” on your first attempt. You’ll often need to refine the action plan in response to the observations made during data analysis.

 

A problem is sometimes considered “resolved” if it is agreed that relief has been achieved. In other words, determining the absolute root cause and/or fully understanding why the remedy has been successful is sometimes a luxury. Engineering types typically aren’t satisfied with an outcome unless the problem and its solution are fully understood. However, you’ll sometimes need to accept that your goal has been achieved when the problem is considered resolved by others.