wissel.net

Usability - Productivity - Business - The web - Singapore & Twins

The folly of root cause analysis


IT support's dealing with management is a funny business. Whenever something goes wrong, support teams engage in " defensive blaming" and the elusive quest for a root cause.
I've seen this quest (and blaming god and country along the way if it doesn't appear) taking priority over problem resolution and prevention. The twisted thought is: " If I'm not sure about the (single) root cause, I can't neither fix nor prevent it from happening again".

Why is that a folly?
  • It paralyses: If a person bleeding enters an ER, the first call to action is to stop the bleeding. Asking: Have all the suppliers of the manufacturer of the blade that caused it have ISO 9000 certifications? Where was the ore mined to make the blade? Root cause analysis is like that. IT support however is an ER room
  • It is not real: There is never a single reason. Example: Two guys walk along the street in opposite directions. One is distracted, because he is freshly in love. The other is distracted because he just was dumped. They run into each other and bump their heads. What is the root cause for it?
    You remove a single factor: route chosen, fallen in love, being dumped, time of leaving the house, speed of walking, lack of discipline to put attention ahead of emotion etc. and the incident never would have happened.
    If I apply an IT style root cause analysis it will turn out: it was the second guy's grandfather who is the root cause: He was such a fierce person, so his son never developed trust, which later in the marriage led to a breakup while guy two was young, traumatising every breakup for him - thus being distracted
  • There is more than one: as the example shows: removing one factor could prevent the incident to happen, but might leave a instable situation. Once a root cause has been announced, the fiction spreads: "everything else is fine". Example: A database crashes. The root cause gets determined: only 500 users can be handled, so a limit is introduced. However the real reason is a faulty hardware controller that malfunctions when heating up, which happens on prolonged high I/O (or when the data center aircon gets maintained).
    The business changes and the application gets used heavier by the 500 users, crossing the temperature threshold and the database crashes again.
  • Cause and effect are not linear: Latest since Heisenberg it is clear that the world is interdependent. Even the ancient Chinese knew that. So it is not, as Newton discovered, a simple action/reaction pair (which suffers from the assumption of "ceteris paribus" which is invalid in dynamic systems), but a system of feedback loops subject to ever changing constraints.
Time is better spend understanding a problem using system thinking: Assess and classify risk factors and problem contributors. Map them with impact, probability, ability to improve and effort to action. Then go and move the factors until the problem vanishes. Don't stop there, build a safety margin.
As usual YMMV

Posted by on 09 July 2014 | Comments (2) | categories: Software

Comments

  1. posted by Mpoed on Thursday 10 July 2014 AD:
    I am not sure about the 'folly' here:

    1. The idea is to identify root causes at the end of the day (Does not have to be a single root cause).
    2. If there are established ITSM practices at the organisation you will know that as per ITIL guidance, the objective of incident management is to resolve the issue as quickly as possible without worrying about root causes. If the an issue keeps recurring then Problem Management processes are invoked to identify root cause(s) with a view of permanantly resolving the issue.
  2. posted by Stephan H. Wissel on Thursday 10 July 2014 AD:
    By mentioning causes (plural) you eliminated half of the folly Emoticon smile.gif and yes: ITIL incident management doesn't care too much about RCA.