Rule v2

After every operational failure, run five blameless post-mortem questions

After any operational failure, write a blameless post-mortem using five questions: what happened (factual description), what was the timeline of contributing events, what were the systemic factors, what are the action items (specific system changes), and what would have prevented this.

Why This Is a Rule

Operational failures contain the highest-density learning in any system — but only if the learning is extracted through structured analysis rather than reactive blame. Blame stops analysis at the nearest person ("who screwed up?"); blameless analysis continues to the systemic factors that made the failure possible ("what about the system allowed this?").

The five questions form a complete failure analysis framework:
(1) What happened: a factual description, not interpretation. Stick to camera-testable events, meaning things a camera could have recorded.
(2) Timeline of contributing events: the causal chain, not just the immediate trigger. Failures rarely have single causes; they have sequences of contributing factors.
(3) Systemic factors: what about the system's design, processes, or environment made this failure possible or likely?
(4) Action items: specific system changes. "Be more careful" does not count; it is blame disguised as action.
(5) What would have prevented this: the counterfactual that identifies the highest-leverage intervention point.

The "blameless" constraint isn't about being nice — it's about getting accurate information. When people face blame, they minimize, deflect, and hide details. When the analysis is explicitly blameless, they share the full picture.

When This Fires

After any system, process, or operational failure (missed deadline, broken deployment, dropped commitment)
After a near-miss that could have been a failure under slightly different conditions
When the same type of failure recurs, signaling unresolved systemic factors
During any incident review or retrospective

Common Failure Mode

Stopping at question 1 ("what happened") and jumping to action items without analyzing the timeline or systemic factors. The resulting action items address the immediate trigger but not the systemic conditions that produced it, so the failure recurs in a slightly different form.
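
One way to guard against this failure mode is a completeness check that does not treat a post-mortem as finished until all five questions have an answer. The sketch below is a minimal illustration only: the `PostMortem` field names and the `unanswered_questions` helper are hypothetical, and the draft incident text is invented for the example.

```python
from dataclasses import dataclass, field


# One field per question, in the order the rule lists them.
# Field names are illustrative assumptions, not a standard schema.
@dataclass
class PostMortem:
    what_happened: str = ""
    timeline: list[str] = field(default_factory=list)
    systemic_factors: list[str] = field(default_factory=list)
    action_items: list[str] = field(default_factory=list)
    prevention: str = ""


def unanswered_questions(pm: PostMortem) -> list[str]:
    """Return the names of the questions that are still empty."""
    answers = {
        "what_happened": pm.what_happened.strip(),
        "timeline": pm.timeline,
        "systemic_factors": pm.systemic_factors,
        "action_items": pm.action_items,
        "prevention": pm.prevention.strip(),
    }
    return [name for name, value in answers.items() if not value]


# Invented draft that jumps from question 1 straight to action items.
draft = PostMortem(
    what_happened="Deployment returned errors for several minutes.",
    action_items=["Add automated check before deployment"],
)
print(unanswered_questions(draft))
# ['timeline', 'systemic_factors', 'prevention']
```

A non-empty result means the analysis skipped part of the causal picture, so the action items should be treated as provisional until the gaps are filled.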

The Protocol

After any operational failure:
(1) Write a factual description of what happened (no blame, no interpretation).
(2) Construct the timeline of contributing events: trace backward from the failure to identify each contributing factor.
(3) Identify systemic factors: what about the system's design made this failure possible?
(4) Write specific action items: system changes, not behavior changes. For example, "Add automated check before deployment", not "be more careful when deploying."
(5) State the single change that would most likely have prevented this failure. That is your highest-leverage improvement.
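
Step 4 can be checked partly by machine. The sketch below flags action items that read like behavior changes rather than system changes. The phrase list and the `behavioral_items` function are assumptions made for illustration; a substring match like this is crude and cannot replace a reviewer's judgment.

```python
# Hypothetical phrases that suggest a behavior change rather than a system change.
BEHAVIORAL_PHRASES = (
    "be more careful",
    "pay more attention",
    "remember to",
    "try harder",
)


def behavioral_items(action_items: list[str]) -> list[str]:
    """Return the action items that contain a behavior-change phrase."""
    return [
        item
        for item in action_items
        if any(phrase in item.lower() for phrase in BEHAVIORAL_PHRASES)
    ]


# The two examples quoted in step 4 of the protocol.
items = [
    "Add automated check before deployment",
    "Be more careful when deploying",
]
print(behavioral_items(items))
# ['Be more careful when deploying']
```

Any flagged item is a prompt to ask what system change would make the careful behavior unnecessary.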

Source Lessons

L-0997

Learning from operational failures

When your operations fail, treat it as a system design problem, not a personal failure.

Connections

Specializes (3)

Principle: Focus failure analysis on system-level causal chains ('what ...
Principle: Interpreting failure as information about design rather than ...
Principle: Blameless analysis of failures generates more accurate ...
