When you deal with problems in IT, you often deal with problems where the root cause is unknown. To solve such problems, you have to use a systematic method. Only a systematic method leads to a fast, effective and efficient solution. One of the most commonly observed methods in my career is based on approximation. We all know it as “trial and error”. Someone keeps trying until the problem is solved. Often this method makes things worse than they were before, and it often leads to wrong conclusions and, furthermore, wrong results. If someone draws a wrong connection at the beginning of the analysis, this leads down a totally wrong path. I would like to illustrate this with an example:
John Doe tried to monitor VMware ESXi hosts with HP Systems Insight Manager (SIM). The VMware ESXi hosts were running on different HP ProLiant models. John noticed that some of the ESXi hosts showed more information than others. After a very quick Google search he concluded that this was related to iLO 4 Agentless Monitoring, because the hosts that showed all information were ProLiant Gen8 models.
As you can imagine, this was miles off the mark. The explanation was simple: The Gen8 models were installed with ESXi images from HP, which include the necessary agents. This example shows another very ugly behavior: Googling around in the hope of finding a problem description that sounds similar. This is often done by entering an error message into Google, selecting a search result and trying the proposed solution. And quite often the article is not even read; people simply scroll down to the solution. Yet the same error message can have different causes, which need different solutions.
What could be a systematic method to solve problems? I’d like to introduce you to Kepner-Tregoe (KT). KT stands for two things: a consulting company founded by Charles Kepner and Benjamin Tregoe, and a method. KT is mentioned by ITIL as a component of Problem Management in the Service Operation phase. You can use KT for problem solving, decision making or potential problem analysis. I will focus on the situation analysis and problem analysis. The situation analysis is common to problem solving, decision making and potential problem analysis.
The Kepner-Tregoe method
The KT method is based on a rational process and it’s divided into four different processes:
- situation analysis
- problem analysis
- decision analysis
- potential problem analysis
Behind each process is a question you should ask.
The situation analysis
During the situation analysis the question is “What’s going on?”. At this point, the problem analysis hasn’t started. Before you can analyse the problem, you have to clarify the situation, outline concerns and set priorities. Ask yourself about the current and future impact, how much time you have to find a solution, and at which point a solution could become impossible (limitations because of time, budget etc.).
The problem analysis
The problem analysis consists of five consecutive steps:
- Define the problem
- Describe the problem
- Create hypotheses about the cause
- Test the hypotheses
- Verify the root cause
Use the 5 Ws to define the problem. Only a problem description that includes the 5 Ws can fully describe a problem. Such a description will help you and your colleagues to understand the problem.
- Who is affected by the problem?
- Why is it important to solve the problem?
- What are the symptoms?
- When does the problem occur?
- Where does the problem occur?
Once you have created your problem description with the 5 Ws, you can make the answers more concrete with “IS” and “COULD BE but IS NOT” aspects. Let’s pick up the example from above:
Who is affected by the problem? HP ProLiant G6 and G7 models running VMware ESXi.
An HP ProLiant G7 model with a VMware ESXi image “IS” affected. An HP ProLiant G7 or Gen8 model with an HP custom image for ESXi “COULD BE but IS NOT” affected.
As you can see, this dramatically reduces the number of possible causes, especially when you add the problem description and the symptoms. But this also shows another fact: You have to take a detailed look at the affected components/systems, and you have to take care that you do not miss any differences between them (in the example all hosts were running ESXi 5.1, but some hosts were running a VMware image and some an HP custom ESXi image). You should also identify which changes were made in the past. This may be answered by the “When?” question (When does the problem occur? After demoting one of the four Active Directory Domain Controllers).
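The comparison between “IS” and “COULD BE but IS NOT” can be sketched in a few lines of Python. This is only an illustration of the idea, not part of the KT method itself: the host names, the `Host` class and the `distinctions` function are made up for this example, and the inventory just mirrors the ProLiant hosts described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Host:
    # Hypothetical inventory record; the fields mirror the example above.
    name: str
    model: str
    esxi_version: str
    image: str


def distinctions(is_hosts, is_not_hosts):
    """Return the attributes whose values cleanly separate IS from IS NOT."""
    result = {}
    for attr in ("model", "esxi_version", "image"):
        is_values = {getattr(h, attr) for h in is_hosts}
        is_not_values = {getattr(h, attr) for h in is_not_hosts}
        # An attribute is only interesting if no value appears on both sides.
        if is_values.isdisjoint(is_not_values):
            result[attr] = (is_values, is_not_values)
    return result


# "IS" affected: hosts installed with the VMware image (names are invented).
is_hosts = [
    Host("esx01", "G6", "5.1", "VMware"),
    Host("esx02", "G7", "5.1", "VMware"),
]
# "COULD BE but IS NOT" affected: hosts installed with the HP custom image.
is_not_hosts = [
    Host("esx03", "G7", "5.1", "HP custom"),
    Host("esx04", "Gen8", "5.1", "HP custom"),
]

separating = distinctions(is_hosts, is_not_hosts)
for attr, (is_values, is_not_values) in separating.items():
    print(f"{attr}: IS {sorted(is_values)} / IS NOT {sorted(is_not_values)}")
```

The model does not separate the two groups, because a G7 shows up on both sides, and neither does the ESXi version. Only the image does, which is exactly the deviation John missed.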
Now it’s time to create hypotheses about the possible cause. Depending on the problem description, the past changes and the “IS” and “COULD BE but IS NOT” aspects of the problem, it should be possible to create one or more hypotheses.
With one or more hypotheses, you have to test each of them against the “IS” and “COULD BE but IS NOT” aspects. The question is: Can the hypothesis explain the “IS” and “COULD BE but IS NOT” aspects? One of the hypotheses will best explain the “IS” and “COULD BE but IS NOT” aspects. This is the most probable hypothesis.
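To make the test step more tangible, here is a minimal Python sketch under simplifying assumptions: each hypothesis is written as a function that predicts whether a host should be affected, and it is scored by how many “IS” and “COULD BE but IS NOT” observations it explains. The observations mirror the example above; the wording of the two hypotheses and the `explains` helper are my own illustration.

```python
def explains(predict, observations):
    """Count the observations for which the prediction matches reality."""
    return sum(1 for facts, affected in observations if predict(facts) == affected)


# (facts about a host, is it affected?) taken from the example above.
observations = [
    ({"model": "G6", "image": "VMware"}, True),       # IS
    ({"model": "G7", "image": "VMware"}, True),       # IS
    ({"model": "G7", "image": "HP custom"}, False),   # COULD BE but IS NOT
    ({"model": "Gen8", "image": "HP custom"}, False), # COULD BE but IS NOT
]

# Each hypothesis predicts "affected" from the facts (illustrative wording).
hypotheses = {
    "pre-Gen8 models lack iLO 4 Agentless Monitoring": lambda f: f["model"] != "Gen8",
    "the VMware image lacks the HP agents": lambda f: f["image"] == "VMware",
}

scores = {name: explains(predict, observations) for name, predict in hypotheses.items()}
best = max(scores, key=scores.get)

for name, score in scores.items():
    print(f"{score}/{len(observations)} explained: {name}")
print("Most probable:", best)
```

The iLO hypothesis cannot explain the G7 host with the HP custom image, which is not affected. The image hypothesis explains all four observations and is therefore the most probable one.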
Verifying the root cause is the last and trickiest part. You have to verify your assumptions and reflect on how you came to your conclusion about the root cause. If you are sure that you have identified the root cause, you can develop and implement a solution. After the implementation, you have to verify the result. Is the problem solved? Yes? Fine! If not, you have to feed this result back into the test of the other hypotheses.
Kepner-Tregoe is a totally rational method. At the beginning it’s hard not to jump to quick assumptions and to take the time to reflect instead. It’s something you have to train. I guarantee that you will get better with each problem you solve. KT problem analysis was used during the Apollo 13 mission. And what can I say? It worked! So give it a try.
EDIT: Kepner-Tregoe informed me via Twitter that there are two groups on LinkedIn where you can get more information and talk to other KT practitioners.