Operations

Resilience Testing

Advanced

We design for failure with timeouts, retries, fail-closed behaviour, and backups. But a resilience measure you have never exercised is only a theory. Resilience testing deliberately injects failure (a dependency goes down, a region is lost, latency spikes) to prove the system actually copes, before real life proves it does not.

Writing a timeout and a fallback is one thing. Knowing they work correctly under real failure is another. Resilience testing (including chaos experiments and disaster-recovery drills) turns assumptions into evidence. Kill a dependency in a test environment and watch what happens. Does the system fail closed, degrade gracefully, recover, and alert? Or does it cascade and fall over?

This confirms the promises made in Designing for Failure, Backup/Recovery & BCP, and Incident Readiness. Start small and safe in non-production. Only consider production experiments, with strong guardrails, once you trust the basics.

Test that failure is handled

Do it safely

Assume it works // fail-closed code written; backups configured;
// never tested under actual failure or actually restored

The first time the screening provider really times out, or you really need the backup, is the worst time to discover the fallback has a bug or the backup will not restore. Untested resilience is only hope.

Prove it // staging drill: kill the screening provider -> assert onboarding blocks
// + alert fires + recovers when it returns
// quarterly: restore prod backup to an isolated env, verify RPO/RTO

The fail-closed path and the backups are shown to work under real failure, on a schedule. So when it happens for real, it is routine.

Self-review checklist

Why it matters: Resilience that is never tested tends to fail exactly when it is needed: the fallback has a bug, the backup will not restore, the alert never fires. Deliberately rehearsing failure turns those nasty surprises into known, fixed behaviour. For a platform handling money and regulated decisions, that is the difference between a blip and a crisis.