The most expensive errors are the ones you find last
In 1981, Barry Boehm published Software Engineering Economics and introduced a finding that would reshape how engineers think about quality. Analyzing defect data from large-scale projects at TRW and IBM, Boehm demonstrated that the cost of fixing a defect rises exponentially with the phase in which it is discovered. An error caught during requirements analysis might cost one unit to fix. The same error caught during design costs three to six times more. Caught during testing: fifteen to forty times more. Found in production, by real users experiencing real failures: fifty to two hundred times the original cost (Boehm, 1981).
This is not a software problem. It is a universal property of committed systems. The deeper you are into any course of action — the more resources invested, the more dependencies built, the more decisions stacked on top of your initial assumptions — the more expensive it becomes to discover that something at the foundation was wrong.
In the previous lesson, you learned to distinguish error types: execution errors, knowledge errors, and judgment errors. That distinction matters because different errors require different corrections. This lesson asks a different question: when should you find them? The answer, across every domain that has studied the question seriously, is the same. As early as possible. As cheaply as possible. Before the cost of correction compounds into the cost of catastrophe.
Boehm's curve and the architecture of regret
Boehm's exponential cost curve is the empirical backbone of the fail-fast principle, even though Boehm himself never used the phrase. His data showed something engineers already intuited but had never quantified: the cost of changing course is a function of how far you have traveled.
The mechanism is straightforward. Early in a project, few decisions have been made. The system is flexible. Assumptions can be revised with minimal rework. As work progresses, each new decision constrains future decisions. Code is written that depends on architectural choices. Teams are hired based on strategic assumptions. Processes are built around premises. Each layer of investment becomes a structural commitment that subsequent work must honor. Reversing a foundational assumption does not mean undoing one decision — it means unwinding every decision that was built on top of it.
Boehm and Basili later refined this finding in "Software Defect Reduction Top 10 List" (2001), noting that the cost ratio depends on project scale. For small, non-critical systems, the ratio might be 5:1 rather than 100:1. But the direction never reverses. Later discovery always costs more than earlier discovery. The only variable is how much more.
This principle extends far beyond code. A startup that discovers its market assumption is wrong after six months of building has wasted six months of engineering, design, and runway. One that tests the assumption in week one with a landing page and a waitlist loses a week. A person who realizes after two years that their career direction conflicts with their values has lost two years of compounding effort. One who conducts an honest self-assessment before committing loses an afternoon.
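Boehm's ratios lend themselves to a back-of-the-envelope calculation. The sketch below uses single illustrative multipliers picked from within the ranges he reported; the exact values vary by project and are not prescribed by the data.

```python
# Illustrative cost multipliers for fixing the same defect at each phase,
# chosen from within the ranges Boehm reported (actual values vary by project).
PHASE_MULTIPLIER = {
    "requirements": 1,
    "design": 5,        # reported range: 3-6x
    "testing": 25,      # reported range: 15-40x
    "production": 100,  # reported range: 50-200x
}

def fix_cost(base_cost: float, phase: str) -> float:
    """Estimated cost of fixing a defect discovered in the given phase."""
    return base_cost * PHASE_MULTIPLIER[phase]

# A defect that costs $400 to fix during requirements analysis:
early = fix_cost(400, "requirements")  # 400
late = fix_cost(400, "production")     # 40000
print(f"Early: ${early:.0f}, late: ${late:.0f}, ratio: {late / early:.0f}x")
```

The same table makes Boehm and Basili's caveat easy to model: shrink the multipliers for a small, non-critical system and the ratio compresses toward 5:1, but the ordering of the phases never changes.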
The fail-fast principle is not about tolerating failure. It is about engineering the sequence of your commitments so that the most dangerous assumptions are tested first, when falsification is cheap.
The andon cord: engineering permission to stop
The most famous physical implementation of fail-fast thinking is the andon cord on Toyota's production lines.
In the Toyota Production System, any worker who detects an abnormality — a defective part, a misaligned component, a deviation from standard work — can pull a cord (or press a button) that triggers an alert and, if necessary, stops the entire production line. This is part of Toyota's broader principle of jidoka, which the company translates as "automation with a human touch" — the idea that when an abnormality occurs, the process stops immediately rather than passing the defect downstream (Ohno, 1988).
The economics are counterintuitive. Stopping a production line costs money. Every second the line is halted, cars are not being built. A manager optimizing for short-term throughput would never stop the line for a single defect. But Toyota discovered that the cost of stopping the line immediately is always less than the cost of letting a defect propagate. A misaligned door panel caught at station twelve costs minutes to fix. The same defect discovered at final inspection costs hours. Found by a customer after purchase, it costs the price of a recall, a warranty claim, and a damaged reputation.
The andon cord works because it inverts the default. In most organizations, the default is to continue — to push the defect downstream, to hope someone else catches it, to avoid the short-term cost of stopping. The andon cord makes stopping the default response to any detected error. It engineers the system so that early detection is not just possible but expected.
This is a design choice, not a cultural one. Toyota did not achieve this by telling workers to "care more about quality." They achieved it by building a physical mechanism — the cord — that makes early error surfacing the path of least resistance. The architecture of the system determines whether errors are caught early or late.
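The inversion of defaults can be expressed directly in code. This is a minimal sketch, not Toyota's actual system: the station names and the shape of a "part" are invented for illustration. The point is structural, in that the pipeline raises at the first detected abnormality instead of passing the part downstream.

```python
class LineStopped(Exception):
    """Raised when a station pulls the andon cord."""

def check(part: dict, station: str) -> None:
    # Pulling the cord: halt the moment an abnormality is detected,
    # rather than letting the defect travel to later stations.
    if part.get("defects"):
        raise LineStopped(f"{station}: {part['defects']}")

def run_line(part: dict) -> dict:
    for station in ["stamping", "welding", "paint", "final_inspection"]:
        check(part, station)
        # ... station does its work on the part ...
    return part

try:
    run_line({"id": 42, "defects": ["misaligned door panel"]})
except LineStopped as stop:
    print(f"Line halted early: {stop}")
```

Note what the design rules out: there is no code path in which a defective part reaches `final_inspection`. Stopping is not an option the worker must choose; it is the only thing the system can do.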
Escalation of commitment: why humans fail slow
If failing fast is so obviously superior, why do most people fail slow?
In 1976, Barry Staw published "Knee Deep in the Big Muddy," a study that demonstrated one of the most robust findings in behavioral science. Staw gave participants the role of a corporate executive allocating research and development funds between two divisions of a company. After receiving feedback that their initial allocation had led to poor results, participants did not reduce their investment in the failing division. They increased it — and the increase was largest among participants who had personally made the initial allocation decision (Staw, 1976).
Staw called this escalation of commitment: the tendency to invest more resources in a failing course of action precisely because you have already invested resources in it. The phenomenon is closely related to the sunk cost fallacy — the irrational weighting of past, irrecoverable investments in decisions about future action — but escalation adds a critical psychological ingredient: personal responsibility. When you chose the path, admitting it was wrong threatens your identity as a competent decision-maker. Continuing feels like defending your judgment. Stopping feels like confessing incompetence.
This is why failing fast requires more than intellectual understanding. It requires a structural intervention against your own psychology. You already know, rationally, that sunk costs are irrelevant to future decisions. But knowing this does not prevent you from feeling the pull of commitment. Staw's research showed that even business students who understood the sunk cost fallacy in theory fell prey to escalation in practice. Knowledge alone is insufficient. You need systems that interrupt the escalation pattern before your psychology can take over.
The writer who shares a two-page outline in week one is not braver than the writer who waits six months. The first writer designed a process that forces confrontation with reality before commitment deepens. The second writer designed a process — probably without realizing it — that delays confrontation until the cost of changing course is psychologically unbearable.
Reason's latent failures: the errors already embedded
James Reason, in Human Error (1990), introduced a distinction that deepens the fail-fast principle beyond the obvious. Reason distinguished between active failures — errors committed by people at the sharp end of a system, like a surgeon making an incision or a pilot flipping a switch — and latent failures — errors embedded in system design, organizational decisions, and management practices that lie dormant until they combine with active failures to produce catastrophe.
Latent failures are the harder problem. An active failure is visible: someone did the wrong thing at the wrong time. A latent failure is invisible: a policy was written ambiguously six months ago, a safety check was removed to save time last quarter, a training program was cut during budget season. These decisions sit in the system like pathogens, producing no symptoms until conditions align to trigger a cascade.
Reason's "Swiss cheese model" illustrates how accidents happen when the holes in multiple defensive layers — each representing a latent failure — momentarily line up, allowing a hazard to pass through every barrier. No single failure causes the catastrophe. The catastrophe emerges from the accumulation of latent failures that were never surfaced, never tested, never corrected — because no one designed a mechanism to find them early.
The fail-fast principle, applied through Reason's lens, means actively hunting for latent failures before they combine. It means stress-testing assumptions, auditing processes for hidden vulnerabilities, and building mechanisms that surface dormant errors while they are still individual problems rather than components of a cascade. It means treating the absence of visible failure not as evidence of safety, but as a possible indicator that your detection systems are not sensitive enough.
The AI parallel: early stopping and validation loss
Machine learning offers a precise, quantitative demonstration of the fail-fast principle in the form of early stopping.
When training a neural network, the model's performance on training data improves continuously as training progresses. But at some point, the model begins memorizing the training data rather than learning generalizable patterns — a phenomenon called overfitting. Performance on training data continues to improve, but performance on unseen data begins to degrade. The model is getting better at the wrong thing.
Early stopping addresses this by monitoring the model's performance on a held-out validation set — data the model never trains on directly. When validation loss stops decreasing and begins to rise, training halts. The model is "failing fast" — recognizing that further investment in the current trajectory (more training epochs) will produce worse outcomes, not better ones, and stopping before the damage compounds (Goodfellow, Bengio, & Courville, 2016).
The structural parallel to human decision-making is exact. The training data is your immediate experience — the feedback that confirms your current approach is working. The validation set is external reality — the outcomes you have not yet observed, the perspectives you have not yet encountered, the consequences that have not yet materialized. Overfitting is the human equivalent of optimizing for the metrics you can see while degrading on the metrics you cannot. You feel productive. Your immediate results look good. But your approach is becoming less generalizable, less robust, less aligned with reality — and you will not discover this until you encounter data your current model cannot handle.
Early stopping requires something most people resist: defining in advance the conditions under which you will stop, and monitoring those conditions continuously rather than only checking when failure becomes obvious. It requires building the validation set before you need it, not after the damage is done.
A protocol for failing fast
The fail-fast principle is not a personality trait. It is a design specification. Here is how to implement it.
Identify your riskiest assumptions. Before committing significant resources to any project, relationship, or decision, list the three to five assumptions that, if wrong, would invalidate the most work. These are your highest-leverage error targets. Not all assumptions are equally dangerous. Focus on the ones where being wrong is most expensive.
Sequence tests by cost of falsification. For each risky assumption, design the cheapest possible test that could prove it wrong. A conversation. A prototype. A one-page document. A small experiment. Rank these tests by how little they cost to run and how much they would save you if they reveal a problem. Run the cheapest tests first.
Set kill criteria in advance. Before you begin, define the specific conditions under which you will stop, pivot, or fundamentally restructure. Write these down. Escalation of commitment is most powerful when the criteria for stopping are vague or undefined. Explicit kill criteria function as your personal andon cord — a predetermined mechanism that interrupts the default of continuing.
Build external validation into your process. The validation set in machine learning works because it is independent of the training process. Your equivalent is external feedback from people who are not invested in your success. Share work early. Seek disconfirming evidence deliberately. The feedback that stings in week one saves you from catastrophe in month six.
Review the cost curve regularly. At each stage of commitment, estimate the cost of discovering a foundational error now versus discovering it later. If the ratio is growing — if the cost of late discovery is accelerating — that is a signal to invest more in detection, not less.
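The kill-criteria step, in particular, can be made mechanical. The sketch below assumes you record a handful of metrics at each review; the criteria, metric names, and thresholds are invented for illustration, since yours will depend on the project.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class KillCriterion:
    """A stop condition written down before the project starts."""
    description: str
    triggered: Callable[[dict], bool]

# Defined in week zero, before commitment deepens.
CRITERIA = [
    KillCriterion("No paying customer after 8 weeks",
                  lambda m: m["weeks_elapsed"] >= 8 and m["paying_customers"] == 0),
    KillCriterion("Burn exceeds 150% of plan",
                  lambda m: m["burn_ratio"] > 1.5),
]

def review(metrics: dict) -> list[str]:
    """Run at every checkpoint; returns any criteria that fired."""
    return [c.description for c in CRITERIA if c.triggered(metrics)]

fired = review({"weeks_elapsed": 9, "paying_customers": 0, "burn_ratio": 1.1})
print(fired)  # ['No paying customer after 8 weeks']
```

Writing the lambda forces the precision that escalation of commitment exploits when it is absent: a criterion you cannot express as a check against observable metrics is a criterion you will renegotiate under pressure.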
The bridge from detection timing to error tolerance
The fail-fast principle establishes that errors should be found early. But it raises a question it cannot answer on its own: how many errors should you expect?
Designing systems that surface errors quickly is necessary but incomplete. You also need a framework for deciding how much error is acceptable — a way to distinguish between the error rate that signals a healthy, learning system and the error rate that signals something has gone structurally wrong. A system with zero detected errors is not a perfect system. It is almost certainly a system with insufficient detection.
This is the domain of error budgets: the practice of defining, in advance, how much failure is tolerable and using that budget as a decision-making tool. You have already built the intuition for this — every time you estimated the cost ratio between early and late discovery, you were implicitly reasoning about how to allocate your error-detection resources. The next lesson makes that reasoning explicit.
Sources:
- Boehm, B. W. (1981). Software Engineering Economics. Prentice Hall.
- Boehm, B., & Basili, V. R. (2001). "Software Defect Reduction Top 10 List." IEEE Computer, 34(1), 135-137.
- Ohno, T. (1988). Toyota Production System: Beyond Large-Scale Production. Productivity Press.
- Staw, B. M. (1976). "Knee Deep in the Big Muddy: A Study of Escalating Commitment to a Chosen Course of Action." Organizational Behavior and Human Performance, 16(1), 27-44.
- Reason, J. (1990). Human Error. Cambridge University Press.
- Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep Learning. MIT Press.
- Gawande, A. (2009). The Checklist Manifesto: How to Get Things Right. Metropolitan Books.