Rulev2

Test backup paths during normal operations — untested redundancy fails when you need it most

When validating redundancy in human systems, test backup paths during normal operations rather than waiting for primary failure, as untested redundancy often fails precisely when needed due to skill atrophy or outdated procedures.

Why This Is a Rule

Redundancy in human systems — backup people, alternative procedures, fallback plans — degrades silently through disuse. The backup person's skills atrophy because they never execute the primary path. The fallback procedure becomes outdated because it was written for last year's system. The alternative tool breaks because no one noticed when a dependency updated.

This silent degradation means untested redundancy fails precisely when you need it: during the primary failure that was supposed to trigger the backup. The backup hasn't been exercised, hasn't been updated, and hasn't been validated — and the moment of primary failure is the worst possible time to discover this.

Netflix's Chaos Engineering approach — deliberately failing primary paths during normal operations to validate that backups work — is the gold standard. For human systems, this translates to scheduled backup-path exercises: the backup person executes the primary role while the primary person observes, the fallback procedure is followed while the primary procedure is temporarily disabled, the alternative tool is used for a day while the primary tool is idle.

When This Fires

After establishing any backup plan, backup person, or fallback procedure
During quarterly resilience reviews
When you realize a backup path hasn't been tested since it was created
Before any high-risk period where primary failure would be especially costly

Common Failure Mode

Assuming the backup works because it was designed correctly: "We have a backup plan." Having a plan and having a working, validated plan are different things. Plans atrophy like skills. The only way to know a backup works is to test it — under conditions similar to when it would actually be needed.

The Protocol

For each backup path in your system: (1) Schedule a test during normal operations — not during a crisis. (2) Temporarily activate the backup while the primary is available as safety net. (3) Verify: does the backup actually work? Are procedures current? Does the backup person have the needed skills? Are the tools functional? (4) Fix any gaps discovered during testing. (5) Repeat at regular intervals (quarterly minimum) — untested redundancy degrades continuously, so validation must be periodic.

Source Lessons

L-0257

Redundant relationships provide resilience

Multiple paths between important nodes make a system more robust.

Connections

Specializes (3)

PrincipleDesign a minimal degraded-mode version of chains (2-4 core PrincipleFrame checklist items as conditions to verify rather than PrinciplePractice running systems in degraded mode during normal

Test backup paths during normal operations — untested redundancy fails when you need it most

Why This Is a Rule

When This Fires

Common Failure Mode

The Protocol

Source Lessons

Redundant relationships provide resilience

Connections

Specializes (3)

Tags

Test backup paths during normal operations — untested redundancy fails when you need it most

Why This Is a Rule

When This Fires

Common Failure Mode

The Protocol

Source Lessons

Redundant relationships provide resilience

Connections

Specializes (3)

Tags