Test backup paths during normal operations — untested redundancy fails when you need it most
When validating redundancy in human systems, test backup paths during normal operations rather than waiting for a primary failure to force the question. Untested redundancy tends to fail at exactly the moment it is needed, because the backup's skills have atrophied or its procedures have gone stale.
Why This Is a Rule
Redundancy in human systems — backup people, alternative procedures, fallback plans — degrades silently through disuse. The backup person's skills atrophy because they never execute the primary path. The fallback procedure becomes outdated because it was written for last year's system. The alternative tool breaks because no one noticed when a dependency updated.
This silent degradation means untested redundancy fails precisely when you need it: during the primary failure that was supposed to trigger the backup. The backup hasn't been exercised, hasn't been updated, and hasn't been validated — and the moment of primary failure is the worst possible time to discover this.
Netflix's Chaos Engineering practice, which deliberately injects failures into production during normal operations to confirm that the fallbacks actually engage, is the model to copy. For human systems this translates to scheduled backup-path exercises: the backup person runs the primary role while the primary person observes, the fallback procedure is followed while the primary procedure is deliberately set aside, the alternative tool is used for a day while the primary tool sits idle. As with a chaos experiment, the exercise is bounded: the primary stays reachable so the test can be aborted the moment the backup stumbles.
When This Fires
- After establishing any backup plan, backup person, or fallback procedure
- During quarterly resilience reviews
- When you realize a backup path hasn't been tested since it was created
- Before any high-risk period where primary failure would be especially costly
Common Failure Mode
Assuming the backup works because it was designed correctly: "We have a backup plan." Having a plan and having a working, validated plan are different things. Plans atrophy like skills. The only way to know a backup works is to test it — under conditions similar to when it would actually be needed.
The Protocol
For each backup path in your system:
1. Schedule a test during normal operations, not during a crisis.
2. Temporarily activate the backup while the primary remains available as a safety net, so the exercise can be rolled back if it starts to threaten real work.
3. Verify against criteria written down before the test: does the backup actually produce the outcome the primary produces? Are the procedures current for today's system? Does the backup person have the needed skills and the needed access? Are the tools functional?
4. Fix every gap the test surfaces, and record the date, the result and what was fixed, so the next review can see how long the path has gone unvalidated.
5. Repeat at regular intervals, quarterly at minimum. Untested redundancy degrades continuously, so validation must be periodic rather than one-off.