Resilience Testing
We design for failure with timeouts, retries, fail-closed behaviour, and backups. But a resilience measure you have never exercised is only a theory. Resilience testing deliberately injects failure (a dependency goes down, a region is lost, latency spikes) to prove the system actually copes, before real life proves it does not.
Writing a timeout and a fallback is one thing. Knowing they work correctly under real failure is another. Resilience testing (including chaos experiments and disaster-recovery drills) turns assumptions into evidence. Kill a dependency in a test environment and watch what happens. Does the system fail closed, degrade gracefully, recover, and alert? Or does it cascade and fall over?
This confirms the promises made in Designing for Failure, Backup/Recovery & BCP, and Incident Readiness. Start small and safe in non-production. Only consider production experiments, with strong guardrails, once you trust the basics.
Test that failure is handled
- DoDeliberately exercise failure modes (dependency down or slow, timeouts, errors, lost region) and confirm the system fails closed and degrades gracefully (see Designing for Failure).
- DoRehearse recovery for real: run restore-from-backup and DR drills on a schedule, and measure against your RPO/RTO targets (see Backup, Recovery & Business Continuity).
- DoVerify the safety-critical behaviours specifically. For example, when screening is unavailable, onboarding should block and escalate rather than approve (see AML Screening).
- DoCheck that failures are observed: the right alerts fire and reach someone (see Security Monitoring, Observability).
- ConsiderLightweight chaos experiments and load/soak tests for critical paths, widening the scope as confidence grows.
Do it safely
- DoStart in non-production with realistic conditions. Have a clear hypothesis, a limit on the blast radius, and a way to stop the experiment instantly.
- DoFeed findings back as tracked work. A resilience test that reveals a gap is a success only if the gap gets fixed (see Technical Debt).
- AvoidRunning destructive experiments in production without strong guardrails, approval, and the ability to abort. Resilience testing must not cause the incident it is meant to prevent.
- NeverRun a resilience or chaos experiment against real customer data or live regulated flows without controls that ensure no real harm or data loss.
// fail-closed code written; backups configured;
// never tested under actual failure or actually restored
The first time the screening provider really times out, or you really need the backup, is the worst time to discover the fallback has a bug or the backup will not restore. Untested resilience is only hope.
// staging drill: kill the screening provider -> assert onboarding blocks
// + alert fires + recovers when it returns
// quarterly: restore prod backup to an isolated env, verify RPO/RTO
The fail-closed path and the backups are shown to work under real failure, on a schedule. So when it happens for real, it is routine.
Self-review checklist
- AskHave the failure modes this feature relies on handling actually been exercised?
- AskHas a restore from backup or DR actually been tested recently against our targets?
- AskDo the right alerts fire when this fails?
- AskIs any experiment safely bounded, abortable, and away from real customer data?