After every operational failure, answer five blameless post-mortem questions
After any operational failure, write a blameless post-mortem using five questions: what happened (factual description), what was the timeline of contributing events, what were the systemic factors, what are the action items (specific system changes), and what would have prevented this.
Why This Is a Rule
Operational failures are among the richest sources of learning in any system, but only if that learning is extracted through structured analysis rather than reactive blame. Blame stops the analysis at the nearest person ("who screwed up?"); blameless analysis continues to the systemic factors that made the failure possible ("what about the system allowed this?").
The five questions form a complete failure-analysis framework:
(1) What happened: a factual description, not an interpretation. Stick to camera-testable events, meaning things a camera could have recorded.
(2) Timeline of contributing events: the full causal chain, not just the immediate trigger. Failures rarely have a single cause; they have sequences of contributing factors.
(3) Systemic factors: what about the system's design, processes, or environment made this failure possible or likely?
(4) Action items: specific system changes, not "be more careful," which is blame disguised as action.
(5) What would have prevented this: the counterfactual that identifies the highest-leverage intervention point.
The "blameless" constraint isn't about being nice — it's about getting accurate information. When people face blame, they minimize, deflect, and hide details. When the analysis is explicitly blameless, they share the full picture.
When This Fires
- After any system, process, or operational failure (missed deadline, broken deployment, dropped commitment)
- After a near-miss that could have been a failure under slightly different conditions
- When the same type of failure recurs, signaling unresolved systemic factors
- During any incident review or retrospective
Common Failure Mode
Answering question 1 ("what happened") and then jumping straight to action items, skipping the timeline and the systemic factors. The resulting action items address the immediate trigger but not the systemic conditions that produced it, so the failure recurs in a slightly different form.
The Protocol
After any operational failure:
(1) Write a factual description of what happened (no blame, no interpretation).
(2) Construct the timeline of contributing events: trace backward from the failure to identify each contributing factor.
(3) Identify systemic factors: what about the system's design made this failure possible?
(4) Write specific action items: system changes, not behavior changes. "Add automated check before deployment," not "be more careful when deploying."
(5) State what single change would have most likely prevented this failure. That's your highest-leverage improvement.
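The protocol above can be sketched as a small template check. This is a minimal illustration in Python, not part of the original rule: the `PostMortem` class, the `review` function, and the `VAGUE_PHRASES` list are all hypothetical names, and the phrase list is only an assumed starting point for spotting action items that describe behavior changes rather than system changes.

```python
from dataclasses import dataclass, field

# Assumed examples of wording that signals a behavior change, not a system change.
VAGUE_PHRASES = ("be more careful", "pay more attention", "try harder", "remember to")


@dataclass
class PostMortem:
    """One field per post-mortem question, in protocol order."""
    what_happened: str = ""                                     # (1) factual description
    timeline: list[str] = field(default_factory=list)           # (2) contributing events
    systemic_factors: list[str] = field(default_factory=list)   # (3) systemic factors
    action_items: list[str] = field(default_factory=list)       # (4) system changes
    prevention: str = ""                                        # (5) highest-leverage change


def review(pm: PostMortem) -> list[str]:
    """Return a list of problems; an empty list means all five questions are answered."""
    problems = []
    if not pm.what_happened.strip():
        problems.append("Q1 missing: factual description")
    if not pm.timeline:
        problems.append("Q2 missing: timeline of contributing events")
    if not pm.systemic_factors:
        problems.append("Q3 missing: systemic factors")
    if not pm.action_items:
        problems.append("Q4 missing: action items")
    for item in pm.action_items:
        # Flag items that ask people to behave differently instead of changing the system.
        if any(phrase in item.lower() for phrase in VAGUE_PHRASES):
            problems.append(f"Q4 behavior change, not system change: {item!r}")
    if not pm.prevention.strip():
        problems.append("Q5 missing: what would have prevented this")
    return problems


# Made-up draft that shows the common failure mode: question 1, then straight to actions.
draft = PostMortem(
    what_happened="The deployment failed and was rolled back.",
    action_items=["Be more careful when deploying"],
)
for problem in review(draft):
    print(problem)
```

For this draft, the check reports the skipped timeline, the skipped systemic factors, the behavior-change action item, and the missing prevention statement. A substring match is a deliberately crude heuristic; it cannot judge whether an answer is any good, only whether a question was skipped or an action item uses one of the listed phrases.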