The folly of root cause analysis
IT support's dealing with management is a funny business. Whenever something goes wrong, support teams engage in " defensive blaming" and the elusive quest for a root cause.
I've seen this quest (and blaming god and country along the way if it doesn't appear) taking priority over problem resolution and prevention. The twisted thought is: " If I'm not sure about the (single) root cause, I can't neither fix nor prevent it from happening again".
Why is that a folly?
As usual YMMV
I've seen this quest (and blaming god and country along the way if it doesn't appear) taking priority over problem resolution and prevention. The twisted thought is: " If I'm not sure about the (single) root cause, I can't neither fix nor prevent it from happening again".
Why is that a folly?
- It paralyses: If a person bleeding enters an ER, the first call to action is to stop the bleeding. Asking: Have all the suppliers of the manufacturer of the blade that caused it have ISO 9000 certifications? Where was the ore mined to make the blade? Root cause analysis is like that. IT support however is an ER room
- It is not real: There is never a single reason. Example: Two guys walk along the street in opposite directions. One is distracted, because he is freshly in love. The other is distracted because he just was dumped. They run into each other and bump their heads. What is the root cause for it?
You remove a single factor: route chosen, fallen in love, being dumped, time of leaving the house, speed of walking, lack of discipline to put attention ahead of emotion etc. and the incident never would have happened.
If I apply an IT style root cause analysis it will turn out: it was the second guy's grandfather who is the root cause: He was such a fierce person, so his son never developed trust, which later in the marriage led to a breakup while guy two was young, traumatising every breakup for him - thus being distracted - There is more than one: as the example shows: removing one factor could prevent the incident to happen, but might leave a instable situation. Once a root cause has been announced, the fiction spreads: "everything else is fine". Example: A database crashes. The root cause gets determined: only 500 users can be handled, so a limit is introduced. However the real reason is a faulty hardware controller that malfunctions when heating up, which happens on prolonged high I/O (or when the data center aircon gets maintained).
The business changes and the application gets used heavier by the 500 users, crossing the temperature threshold and the database crashes again. - Cause and effect are not linear: Latest since Heisenberg it is clear that the world is interdependent. Even the ancient Chinese knew that. So it is not, as Newton discovered, a simple action/reaction pair (which suffers from the assumption of "ceteris paribus" which is invalid in dynamic systems), but a system of feedback loops subject to ever changing constraints.
As usual YMMV
Posted by Stephan H Wissel on 09 July 2014 | Comments (2) | categories: Software