The fix that never fixes anything
You have a meeting that consistently runs over its scheduled time. After the third overrun, you set a stricter agenda. The next meeting runs over anyway. You add a timekeeper. Still runs over. You shorten the meeting by fifteen minutes to create a buffer. It fills the buffer and runs over again. Each intervention addresses the visible symptom — the meeting exceeding its allotted time — while the root cause sits untouched. Perhaps the meeting has no defined decision criteria, so every agenda item becomes an open-ended discussion. Perhaps the wrong people are in the room, so decisions require follow-up conversations that bleed into meeting time. Perhaps the meeting exists to solve a coordination problem that would disappear if two teams shared a dashboard instead.
You know this pattern. Not just in meetings — in every recurring frustration you have ever addressed with a workaround, a resolution, or a promise to "be more careful next time." The error keeps returning because you keep fixing the wrong thing. The previous lesson on error budgets established that some errors are tolerable. This lesson is about the errors that are not — the ones that repeat, that resist your interventions, that survive every surface-level fix you throw at them. These errors are not random. They are generated by a structural cause, and until you find and fix that cause, the error is not a bug. It is a feature of your system.
James Reason and the architecture of failure
The most important insight in the study of recurring errors came from James Reason, a psychologist at the University of Manchester who spent his career studying how complex systems fail. In Human Error (1990), Reason proposed a model that changed how entire industries think about mistakes.
Reason distinguished between two categories of failure: active failures and latent conditions. Active failures are the visible errors — the wrong button pressed, the deadline missed, the miscommunication in the meeting. These are what you notice. These are what you try to fix. Latent conditions are the structural factors that make active failures likely: organizational pressures, poorly designed processes, missing information channels, inadequate training. Latent conditions can lie dormant for weeks, months, or years before they combine to produce a visible error.
His Swiss cheese model illustrates the point with brutal clarity. Imagine each layer of defense in a system — each process, check, safeguard, or habit — as a slice of Swiss cheese. Each slice has holes, representing weaknesses or gaps. Normally, the holes do not align. One layer catches what another misses. But when latent conditions cause the holes to line up, a hazard passes through every defense and produces a failure. The active error — the thing you see — is just the final hole. The root cause is whatever created and aligned all the other holes.
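The layered-defense logic can be made concrete with a toy simulation. This is my own illustrative sketch, not a model Reason published: each defense layer independently fails to catch a hazard with some probability (its "hole"), and a visible failure occurs only when the hazard slips through every layer. The layer probabilities are hypothetical.

```python
import random

random.seed(42)

# Per-layer gap rates: the chance each defense fails to catch a hazard.
# These numbers are hypothetical, chosen only for illustration.
hole_probability = [0.10, 0.20, 0.15]

def hazard_gets_through(layer_gaps):
    # A visible failure requires the hazard to pass every layer:
    # all the holes must "line up" on the same occasion.
    return all(random.random() < p for p in layer_gaps)

trials = 100_000
failures = sum(hazard_gets_through(hole_probability) for _ in range(trials))
print(f"failure rate: {failures / trials:.4f}")
```

With independent layers the failure rate is roughly the product of the gap rates (here about 0.1 × 0.2 × 0.15 = 0.003), which is why layered defenses usually work. Latent conditions are dangerous precisely because they violate this independence: one structural factor, such as chronic time pressure, widens the holes in several layers at once.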
Reason's framework reveals why symptom-level fixes fail. When you punish the person who pressed the wrong button, you address the last hole in the cheese. The latent conditions — the confusing interface, the missing checklist, the time pressure that made rushing inevitable — remain. The next person facing the same conditions will press the same wrong button. You did not fix the error. You replaced the error-maker and left the error-generating system intact.
The fundamental attribution error in your own debugging
There is a reason you instinctively fix symptoms instead of causes, and cognitive science has named it precisely. Lee Ross coined the term "fundamental attribution error" in 1977 to describe the human tendency to attribute behavior to personal disposition rather than situational factors. When someone cuts you off in traffic, you think "reckless driver," not "person rushing to the hospital." When a colleague misses a deadline, you think "unreliable," not "overloaded by a process that assigns work without tracking capacity."
You apply this same bias to yourself, but with a twist. When your own errors recur, you attribute them to personal failings — lack of discipline, insufficient willpower, not caring enough. Each recurrence reinforces the narrative: "I keep doing this because something is wrong with me." This framing is not just inaccurate. It is actively destructive to root cause analysis, because it directs your attention toward the one factor that is almost never the actual root cause: your character.
Dietrich Dorner, in The Logic of Failure (1996), demonstrated this experimentally. He placed intelligent, motivated participants in computer simulations of complex systems — managing a small town, controlling an ecological region — and watched them fail repeatedly. The failures were not caused by stupidity or carelessness. They were caused by structural cognitive tendencies: oversimplifying interconnected problems, applying corrections too aggressively, ignoring side effects, and — most relevant here — treating each failure as an isolated event rather than a symptom of a systemic pattern. Dorner's participants kept fixing symptoms because they could not see the system generating them. They blamed the immediate cause because the structural cause was invisible.
This is the cognitive trap that root cause analysis is designed to break. It forces you to stop asking "What went wrong?" and start asking "What structural condition makes this failure likely to recur?"
Toyota and the engineering of systematic diagnosis
The most influential practical methodology for root cause analysis did not come from psychology or cognitive science. It came from a factory floor in Japan.
Taiichi Ohno, the architect of the Toyota Production System, developed and formalized the practice of systematic causal inquiry in manufacturing during the 1950s and 1960s. His approach rested on a principle that sounds obvious but contradicts almost everything about how people naturally respond to problems: when a machine breaks down, do not fix the machine. Find out why the machine broke down.
Ohno described the method in Toyota Production System: Beyond Large-Scale Production (1988). When a production line stopped, Toyota workers did not restart it and move on. They traced the failure backward through the causal chain until they reached a condition that, if changed, would prevent the failure from recurring. The famous Five Whys technique — which you will learn in the next lesson — emerged from this practice. But the technique was secondary to the principle: every recurring problem has a structural cause, and the only legitimate response to a recurring problem is to find and eliminate that cause.
What made Toyota's approach revolutionary was not sophistication. It was discipline. Most organizations respond to problems by resuming operations as quickly as possible. Toyota responded by stopping operations until the root cause was identified. The short-term cost was obvious — lost production time. The long-term benefit was that each problem, once properly diagnosed, never recurred. Over decades, this compounding elimination of root causes produced a manufacturing system of extraordinary reliability.
The lesson for your cognitive infrastructure is direct. When an error recurs in your life, you have two options: resume operations (apply a workaround and move on) or stop and diagnose (find the structural cause and eliminate it). The first option is faster today and guarantees the error will return. The second option costs time now and eliminates an entire category of future failure.
The anatomy of a root cause
Not every cause is a root cause. This distinction matters, because stopping at the wrong level of analysis is worse than not analyzing at all — it gives you false confidence that you have solved a problem you have only renamed.
A root cause has three characteristics. First, it is structural rather than circumstantial. "I was tired" is circumstantial. "I have no system for ensuring adequate sleep before high-stakes days" is structural. Circumstantial causes explain one instance. Structural causes explain the pattern across all instances.
Second, a root cause is actionable at the system level. "People make mistakes" is true but not actionable — you cannot eliminate human fallibility. "The checklist omits the verification step where most mistakes occur" is actionable — you can add the step. Root cause analysis that terminates at human nature has failed. It should terminate at a design choice, a process gap, or a structural condition that you can change.
Third, fixing the root cause should make the symptom impossible, or so improbable that a recurrence would surprise you. If your proposed fix merely reduces the frequency of the error, you have probably identified a contributing factor, not the root cause. The root cause is the condition without which the error cannot occur. Remove it, and the error does not merely decrease. It stops.
Reason's framework provides a useful test: if your proposed root cause is an active failure (something a person did), keep going. Active failures are almost never root causes. They are the final link in a causal chain that began with a latent condition — a design flaw, a missing process, an information gap, a structural incentive that made the active failure likely. Find the latent condition.
The AI parallel: debugging models by tracing error sources
Machine learning engineers face the root cause problem in a form that makes the principle unmistakable. When a model produces wrong predictions, the visible symptom — the incorrect output — can originate from dozens of possible root causes: training data that contains systematic bias, a loss function that optimizes for the wrong objective, feature engineering that discards critical information, a model architecture that lacks the capacity to represent the underlying pattern, or a data pipeline that introduces corruption before the model ever sees the input.
The naive response — the equivalent of "trying harder" — is to train the model longer or increase its size. This is symptom-level intervention. It sometimes improves the metric without addressing the cause, and it always leaves the system vulnerable to the same category of failure.
Skilled ML practitioners perform structured error analysis. They examine not just the aggregate loss but the distribution of errors: which examples does the model get wrong? Is there a pattern? Do the errors cluster around a particular data segment, input characteristic, or edge case? Andrew Ng has advocated for this approach extensively, arguing that systematic error analysis — categorizing errors, quantifying each category, and tracing each category back to its source in the data or architecture — produces better model improvements than any amount of undirected hyperparameter tuning (Ng, 2017).
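The core of this practice is mechanically simple. Here is a minimal sketch of the categorize-and-quantify step: during manual review, each misclassified example is tagged with a suspected error category, then the categories are ranked by frequency. The tags and counts below are hypothetical.

```python
from collections import Counter

# Hypothetical output of a manual error review: each misclassified
# example tagged with the reviewer's suspected error category.
error_tags = [
    "blurry_image", "mislabeled", "blurry_image", "rare_class",
    "blurry_image", "mislabeled", "blurry_image", "rare_class",
    "blurry_image", "pipeline_corruption",
]

# Rank categories by frequency. The dominant category points to the
# most promising structural cause to investigate first.
counts = Counter(error_tags)
total = len(error_tags)
for category, n in counts.most_common():
    print(f"{category}: {n}/{total} ({n / total:.0%})")
```

In this toy review, half the errors trace to one category, so fixing that one upstream condition (say, filtering or re-capturing blurry inputs) would do more than any amount of undirected tuning. The same ranking logic applies whether the "errors" are model predictions or your own missed deadlines.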
The parallel to personal root cause analysis is exact. When your errors cluster — when the same type of mistake keeps appearing across different contexts — the pattern itself is diagnostic. The recurrence tells you that the cause is not in the specific context (the "data point") but in the structure that processes all contexts (the "model"). You do not need more willpower. You need to audit the system that generates the behavior.
A protocol for finding root causes in your own systems
Root cause analysis is not intuition. It is a structured process. Here is a protocol you can apply to any recurring error in your cognitive infrastructure.
Step 1: Establish recurrence. Document at least three instances of the same error. If you cannot find three instances, it may not be a recurring error — it may be noise. Error budgets (L-0485) exist for a reason. Not every mistake warrants root cause analysis. Recurring patterns do.
Step 2: Strip the narratives. For each instance, you told yourself a story about why it happened. Write those stories down, then set them aside. The stories are almost certainly wrong — they are post-hoc rationalizations shaped by the fundamental attribution error. You need the facts: what happened, in what sequence, under what conditions.
Step 3: Find the invariant. Look across all instances for the structural factor that was present every time. Ignore what varied between instances — those are circumstantial details. The root cause is the constant, not the variable. Suppose you missed three deadlines under different circumstances each time, but in every case you estimated the time required without consulting data from past projects. The missing estimation process is the invariant.
Step 4: Test with removal. Ask: if I eliminated this structural factor, would the error become impossible or merely less likely? If merely less likely, you have found a contributing factor, not the root cause. Keep digging. If impossible or nearly so, you have likely found the root cause.
Step 5: Design the structural fix. The fix must change the system, not the person. "I will try harder" is not a fix. "I will add a ten-minute estimation step using historical data before committing to any deadline" is a fix. The fix should be a process change, an environmental change, or a decision-rule change — something that operates regardless of your mood, energy level, or attention on any given day.
Step 6: Verify elimination. After implementing the fix, track whether the error recurs. If it does, your root cause analysis was incomplete. Return to Step 3 and look for a deeper invariant. This is not failure. This is the process working as designed — your verification step is itself a feedback loop (L-0461) ensuring that your diagnosis was accurate.
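Steps 1 through 3 can be expressed as a small data exercise: log each instance of the recurring error with the conditions present at the time, then intersect the condition sets to surface the invariant. The condition names and data below are hypothetical.

```python
# Each set records the conditions present during one instance of the
# recurring error. The names here are hypothetical examples.
instances = [
    {"missed_deadline", "no_estimation_step", "urgent_interrupt"},
    {"missed_deadline", "no_estimation_step", "scope_change"},
    {"missed_deadline", "no_estimation_step", "team_understaffed"},
]

# Step 1: establish recurrence (at least three documented instances).
assert len(instances) >= 3, "fewer than three instances: may be noise"

# Step 3: the invariant is whatever condition appears in every instance.
invariant = set.intersection(*instances)

# Exclude the symptom itself; what remains is the candidate root cause.
candidates = invariant - {"missed_deadline"}
print(candidates)  # prints {'no_estimation_step'}
```

The circumstantial details (the interrupt, the scope change, the staffing gap) vary and drop out of the intersection; only the structural factor survives. Step 4's removal test then asks whether eliminating that surviving condition would make the error impossible, not just rarer.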
From symptom management to structural repair
There is a clean division in how people relate to their recurring problems. On one side are the symptom managers: they notice the error, feel bad about it, resolve to do better, and wait for it to happen again. Their relationship to the problem is emotional and reactive. On the other side are the structural debuggers: they notice the error, trace it to its generating condition, change the condition, and verify that the error stops. Their relationship to the problem is analytical and proactive.
The difference is not intelligence or discipline. It is method. Symptom managers lack a process for moving from "this happened again" to "here is the structural condition that caused it and here is the change that eliminates it." Root cause analysis is that process. It converts recurring frustration into permanent structural improvement.
Phase 25 is about error correction — building the capacity to detect, diagnose, and eliminate errors in your cognitive infrastructure. The previous lessons established that all systems produce errors, that errors carry information, and that you should budget for acceptable error rates. This lesson establishes the principle that governs unacceptable errors: when the same error recurs, the error is not the problem. The system that generates the error is the problem. Fix the system.
In the next lesson, you will learn the Five Whys technique — a structured method for ensuring that your root cause analysis goes deep enough. The protocol above tells you what to do. The Five Whys will tell you how to do it with precision.
Sources:
- Reason, J. (1990). Human Error. Cambridge University Press.
- Reason, J. (1997). Managing the Risks of Organizational Accidents. Ashgate Publishing.
- Ohno, T. (1988). Toyota Production System: Beyond Large-Scale Production. Productivity Press.
- Ross, L. (1977). "The Intuitive Psychologist and His Shortcomings: Distortions in the Attribution Process." Advances in Experimental Social Psychology, 10, 173-220.
- Dorner, D. (1996). The Logic of Failure: Recognizing and Avoiding Error in Complex Situations. Metropolitan Books.
- Ng, A. (2017). Machine Learning Yearning. Self-published draft.
- Meadows, D. H. (2008). Thinking in Systems: A Primer. Chelsea Green Publishing.