Problem analysis with Kepner-Tregoe

This posting is ~5 years old. Keep this in mind: IT is a fast-moving business, and this information might be outdated.

When you deal with problems in IT, you often deal with problems whose root cause is unknown. To solve such problems, you have to use a systematic method. Only a systematic method leads to a fast, effective and efficient solution. One of the most common methods I have observed in my career is based on approximation. We all know it as “trial and error”: someone keeps trying things until the problem is solved. This method often makes things worse than they were before, and it often leads to wrong conclusions and, in turn, wrong results. If someone draws a wrong connection at the beginning of the analysis, the whole analysis heads down the wrong path. I would like to illustrate this with an example:

John Doe tried to monitor VMware ESXi hosts with HP Systems Insight Manager (SIM). The ESXi hosts were running on different HP ProLiant models. John noticed that some of the ESXi hosts showed more information than others. After a very quick Google search, he concluded that this was related to iLO 4 Agentless Monitoring, because the hosts that showed all information were ProLiant Gen8 models.

As you can imagine, this was miles off the mark. The explanation was simple: the Gen8 models were installed with ESXi images from HP, which include the necessary agents. This example shows another very ugly habit: googling around in the hope of finding a problem description that sounds similar. This is often done by entering an error message into Google, selecting a search result and trying the proposed solution. Quite often the article is not even read; people simply scroll down to the solution. But the same error message can easily have different causes, which need different solutions.

What could be a systematic method to solve problems? I’d like to introduce you to Kepner-Tregoe (KT). KT stands for two things: a consulting company founded by Charles Kepner and Benjamin Tregoe, and a method. ITIL mentions KT as a component of Problem Management in the Service Operation phase. You can use KT for problem solving, decision making or potential problem analysis. I will focus on the situation analysis and the problem analysis. The situation analysis is common to problem solving, decision making and potential problem analysis.

The Kepner-Tregoe method

The KT method is based on a rational process and is divided into four processes:

  • situation analysis
  • problem analysis
  • decision analysis
  • potential problem analysis

Behind each process is a question you should ask.

The situation analysis

During the situation analysis, the question is “What’s going on?”. At this point, the problem analysis hasn’t started. Before you can analyse the problem, you have to clarify the situation, outline concerns and set priorities. Ask yourself about the current and future impact, how much time you have to find a solution, and at which point a solution could become impossible (limitations because of time, budget etc.).

The problem analysis

The problem analysis consists of five consecutive steps:

  1. Define the problem
  2. Describe the problem
  3. Create hypotheses about the cause
  4. Test the hypotheses
  5. Verify the root cause

Use the 5 Ws to define the problem. Only a problem description that includes all 5 Ws can fully describe a problem. Such a description will help you and your colleagues to understand the problem.

  • Who is affected by the problem?
  • Why is it important to solve the problem?
  • What are the symptoms?
  • When does the problem occur?
  • Where does the problem occur?
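A minimal sketch of how such a description could be kept complete. This is my own illustration, not part of the KT material: the class name `ProblemDescription`, the `missing()` helper and the sample answers are all assumptions made for the example.

```python
from dataclasses import dataclass, fields


@dataclass
class ProblemDescription:
    """One answer per W; an empty string means 'not answered yet'."""
    who: str = ""
    why: str = ""
    what: str = ""
    when: str = ""
    where: str = ""

    def missing(self) -> list[str]:
        """Return the names of the Ws that are still unanswered."""
        return [f.name for f in fields(self) if not getattr(self, f.name).strip()]


# Illustrative answers loosely based on the monitoring example.
# "why" and "when" are deliberately left open.
draft = ProblemDescription(
    who="HP ProLiant G6 and G7 models running VMware ESXi",
    what="Hosts show less information in HP SIM than other hosts",
    where="HP Systems Insight Manager",
)
print(draft.missing())  # ['why', 'when']
```

The point is simply that a description with open Ws is visibly unfinished before anyone starts guessing at causes.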

Once you have created your problem description with the 5 Ws, you can make the answers more concrete with “IS” and “COULD BE but IS NOT” aspects. Let’s pick up the example from above:

Who is affected by the problem? HP ProLiant G6 and G7 models running VMware ESXi.

An HP ProLiant G7 model with a VMware ESXi image “IS” affected. An HP ProLiant G7 or Gen8 model with an HP custom image for ESXi “COULD BE but IS NOT” affected.

As you can see, this dramatically reduces the number of possible causes, especially when you add the problem description and the symptoms. But it also shows another fact: you have to take a detailed look at the affected components and systems, and you have to take care not to miss any differences between them (in the example, all hosts were running ESXi 5.1, but some were running a VMware image and some an HP custom ESXi image). You should also identify which changes were made in the past. This may be answered by the “When?” question (for example: When does the problem occur? After demoting one of the four Active Directory Domain Controllers).
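Looking for differences between “IS” and “COULD BE but IS NOT” systems can be sketched in a few lines of Python. This is only an illustration under assumptions: the inventory records, the attribute names and the function `distinguishing_attributes` are made up for the example, and in practice you would collect many more attributes per host.

```python
def distinguishing_attributes(affected, unaffected):
    """Return attributes whose values never overlap between the two groups."""
    keys = set().union(*(host.keys() for host in affected + unaffected))
    result = []
    for key in sorted(keys):
        is_values = {host.get(key) for host in affected}
        is_not_values = {host.get(key) for host in unaffected}
        # An attribute only separates the groups if no value occurs in both.
        if is_values.isdisjoint(is_not_values):
            result.append(key)
    return result


# Hypothetical inventory modelled on the example above.
affected = [
    {"model": "G6", "esxi": "5.1", "image": "VMware"},
    {"model": "G7", "esxi": "5.1", "image": "VMware"},
]
unaffected = [
    {"model": "G7", "esxi": "5.1", "image": "HP custom"},
    {"model": "Gen8", "esxi": "5.1", "image": "HP custom"},
]
print(distinguishing_attributes(affected, unaffected))  # ['image']
```

The server model does not separate the two groups, because a G7 appears on both sides. Only the installed image does.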

Now it’s time to create hypotheses about the possible cause. Based on the problem description, the past changes and the “IS” and “COULD BE but IS NOT” aspects of the problem, it should be possible to create one or more hypotheses.

With one or more hypotheses in hand, you have to test each of them against the “IS” and “COULD BE but IS NOT” aspects. The question is: Can the hypothesis explain the “IS” and “COULD BE but IS NOT” aspects? The hypothesis that best explains these aspects is the most probable one.
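Testing hypotheses against the “IS” and “COULD BE but IS NOT” aspects could look like the following sketch. It is an assumption-laden illustration, not an official KT tool: each hypothesis is written as a small function that predicts whether a host should be affected, and the observations and names are invented to match the monitoring example.

```python
# Each observation: (host attributes, True if the host IS affected).
observations = [
    ({"model": "G6", "image": "VMware"}, True),
    ({"model": "G7", "image": "VMware"}, True),
    ({"model": "G7", "image": "HP custom"}, False),
    ({"model": "Gen8", "image": "HP custom"}, False),
]

# Each hypothesis predicts whether a given host should be affected.
hypotheses = {
    "Only Gen8 has iLO 4 Agentless Monitoring": lambda h: h["model"] != "Gen8",
    "HP agents are missing from the image": lambda h: h["image"] != "HP custom",
}


def explained(hypothesis, observations):
    """Count the observations that the hypothesis predicts correctly."""
    return sum(1 for host, affected in observations if hypothesis(host) == affected)


scores = {name: explained(h, observations) for name, h in hypotheses.items()}
ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
best = ranked[0][0]

for name, score in ranked:
    print(f"{score}/{len(observations)}  {name}")
```

The iLO hypothesis cannot explain the G7 host with the HP custom image, which is not affected, so it explains only three of the four observations. The image hypothesis explains all four and is therefore the most probable one.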

Verifying the root cause is the last and trickiest part. You have to verify your assumptions and reflect on how you arrived at your conclusion about the root cause. If you are sure that you have identified the root cause, you can develop and implement a solution. After the implementation, you have to verify the result. Is the problem solved? Yes? Fine! If not, you have to feed this result back into the test of the other hypotheses.

Summary

Kepner-Tregoe is a totally rational method. At the beginning, it’s hard to resist quick assumptions and to take the time to reflect. It’s something you have to practice. I guarantee that you will get better with each problem you solve. KT problem analysis was used during the Apollo 13 mission. And what can I say? It worked! So give it a try.

EDIT: Kepner-Tregoe informed me via Twitter that there are two groups on LinkedIn where you can get more information and talk to other KT practitioners.

Patrick Terlisten

vcloudnine.de is the personal blog of Patrick Terlisten. Patrick has a strong focus on virtualization & cloud solutions, but also storage, networking, and IT infrastructure in general. He is a fan of Lean Management and agile methods, and practices continuous improvement wherever it is possible.

Feel free to follow him on Twitter and/or leave a comment.

2 thoughts on “Problem analysis with Kepner-Tregoe”

  1. darkfader

    Hi,

    one year it has taken me to finally read this.
    And I liked it – this seems more important than “the USE method” which is just rebranded common admin skills.
    Why? My hope is that the more formal approach allows for writing down optimized problem determination.
    It’s a hard-to-teach skill: how do you check for the most common elements of known issues, to quickly find the major “region” to search in, while at the same time maximizing the number of potential problems you are excluding.

    If we invest some time to track our problem determination paths (hits and misses) and sort them to match both criteria, we can use this in creating checklists and general procedures.
    People can then be aware which causes they already safely excluded, and which are still unverified.

    With that, they are able to say ”
    * We are 90% certain this is the cause and we know the fix for that. It is most efficient to look at this now since the other paths have no common mode (yet, in this what we’re looking at)
    * There’s at least 3 other causes that are 30% less likely, which we will investigate if this attempt failed.
    * We will continually check if blows up, which would prove we have been degrading things by going down the wrong path and makes most likely”

    But really I am most impressed that the KT approach is smart enough to calculate time most early.
    I can’t even count how many times I tried to explain colleagues they need to open their vendor case *early* on in the resolution, after the first 1-2 quick shots have failed. Not to wait until the last damn moment! Because then you just maximize the time till it’s fixed, goddamit!

    Thank you very much for the post!
