Operations

Resilience Testing

Advanced

We design for failure with timeouts, retries, fail-closed behaviour, and backups. But a resilience measure you have never exercised is only a theory. Resilience testing deliberately injects failure (a dependency goes down, a region is lost, latency spikes) to prove the system actually copes, before real life proves it does not.

Writing a timeout and a fallback is one thing. Knowing they work correctly under real failure is another. Resilience testing (including chaos experiments and disaster-recovery drills) turns assumptions into evidence. Kill a dependency in a test environment and watch what happens. Does the system fail closed, degrade gracefully, recover, and alert? Or does it cascade and fall over?

Resilience testing verifies the promises made in Designing for Failure, Backup/Recovery & BCP, and Incident Readiness. Start small and safe in non-production. Consider production experiments only once you trust the basics, and only with strong guardrails.

Test that failure is handled

Do: Deliberately exercise failure modes (dependency down or slow, timeouts, errors, lost region) and confirm the system fails closed and degrades gracefully (see Designing for Failure).
Do: Rehearse recovery for real: run restore-from-backup and DR drills on a schedule, and measure against your RPO/RTO targets (see Backup, Recovery & Business Continuity).
Do: Verify the safety-critical behaviours specifically. For example, when screening is unavailable, onboarding should block and escalate rather than approve (see AML Screening).
Do: Check that failures are observed: the right alerts fire and reach someone (see Security Monitoring, Observability).
Consider: Lightweight chaos experiments and load/soak tests for critical paths, widening the scope as confidence grows.
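
The checks above can be sketched as a small drill against test doubles. This is a minimal illustration, not the platform's actual code: `FakeScreeningProvider`, `AlertSink`, `onboard`, and the alert name `screening_unavailable` are all hypothetical names chosen for the example. It takes the screening dependency down, confirms onboarding blocks and escalates rather than approves, confirms an alert was raised, and confirms normal behaviour returns once the dependency is back.

```python
from dataclasses import dataclass, field
from enum import Enum


class Decision(Enum):
    APPROVED = "approved"
    REJECTED = "rejected"
    BLOCKED_ESCALATED = "blocked_escalated"


class ScreeningUnavailable(Exception):
    """Raised when the (hypothetical) screening provider cannot be reached."""


@dataclass
class FakeScreeningProvider:
    """Test double whose availability the drill can toggle."""
    available: bool = True

    def is_clear(self, customer_id: str) -> bool:
        if not self.available:
            raise ScreeningUnavailable("provider down")
        return True  # every customer screens clear in this sketch


@dataclass
class AlertSink:
    """Records alert names so the drill can assert on them."""
    fired: list = field(default_factory=list)

    def fire(self, name: str) -> None:
        self.fired.append(name)


def onboard(customer_id: str, screening: FakeScreeningProvider,
            alerts: AlertSink) -> Decision:
    """Fail closed: if screening cannot answer, block and escalate."""
    try:
        clear = screening.is_clear(customer_id)
    except ScreeningUnavailable:
        alerts.fire("screening_unavailable")
        return Decision.BLOCKED_ESCALATED
    return Decision.APPROVED if clear else Decision.REJECTED


def run_screening_outage_drill() -> dict:
    """Kill the dependency, observe behaviour, restore it, observe again."""
    provider, alerts = FakeScreeningProvider(), AlertSink()

    provider.available = False  # inject the failure
    during = onboard("cust-1", provider, alerts)
    alert_fired = "screening_unavailable" in alerts.fired

    provider.available = True  # dependency returns
    after = onboard("cust-1", provider, alerts)

    return {"during_outage": during, "alert_fired": alert_fired,
            "after_recovery": after}
```

In a real staging drill the fake provider would be replaced by actually stopping or blocking the dependency, and the alert assertion would query the monitoring system rather than an in-memory list.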

Do it safely

Do: Start in non-production with realistic conditions. Have a clear hypothesis, a limit on the blast radius, and a way to stop the experiment instantly.
Do: Feed findings back as tracked work. A resilience test that reveals a gap is a success only if the gap gets fixed (see Technical Debt).
Avoid: Running destructive experiments in production without strong guardrails, approval, and the ability to abort. Resilience testing must not cause the incident it is meant to prevent.
Never: Run a resilience or chaos experiment against real customer data or live regulated flows without controls that ensure no real harm or data loss.
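
One way to make these guardrails concrete is to have the experiment runner enforce them. The sketch below is an assumption-laden illustration, not a real chaos tool: `Experiment`, `run_experiment`, and the `allowed_envs` default are hypothetical. It refuses to run outside permitted environments, caps the number of targets (the blast radius), checks an abort signal before each injection, and always rolls back whatever it injected.

```python
import threading
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Experiment:
    hypothesis: str                       # what we expect to stay true
    environment: str                      # where the experiment runs
    max_targets: int                      # blast-radius limit
    inject: Callable[[str], None]         # apply the failure to one target
    rollback: Callable[[str], None]       # undo the failure on one target
    steady_state_ok: Callable[[], bool]   # does the hypothesis still hold?


def run_experiment(exp: Experiment, targets: Sequence[str],
                   abort: threading.Event,
                   allowed_envs: Sequence[str] = ("staging",)) -> str:
    # Guardrail 1: only run where experiments are explicitly permitted.
    if exp.environment not in allowed_envs:
        raise PermissionError(f"experiments not allowed in {exp.environment}")
    # Guardrail 2: refuse to exceed the declared blast radius.
    if len(targets) > exp.max_targets:
        raise ValueError("target count exceeds blast-radius limit")

    injected = []
    try:
        for target in targets:
            # Guardrail 3: the abort signal stops further injection.
            if abort.is_set():
                return "aborted"
            exp.inject(target)
            injected.append(target)
            if not exp.steady_state_ok():
                return "hypothesis_refuted"
        return "hypothesis_held"
    finally:
        # Always undo every injected failure, whatever the outcome.
        for target in reversed(injected):
            exp.rollback(target)
```

A result of `hypothesis_refuted` is the useful outcome here: it is the gap that should be raised as tracked work.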

Assume it works

// fail-closed code written; backups configured;
// never tested under actual failure or actually restored

The first time the screening provider really times out, or you really need the backup, is the worst time to discover the fallback has a bug or the backup will not restore. Untested resilience is only hope.

Prove it

// staging drill: kill the screening provider -> assert onboarding blocks
//                 + alert fires + recovers when it returns
// quarterly: restore prod backup to an isolated env, verify RPO/RTO

The fail-closed path and the backups are shown to work under real failure, on a schedule. So when it happens for real, it is routine.
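
Verifying RPO/RTO in a restore drill comes down to comparing three timestamps against the targets. The following is a minimal sketch; `RestoreDrill`, `evaluate_drill`, and the example times and targets are illustrative assumptions, not values from this platform. Achieved RPO is the gap between the simulated failure and the newest data present after the restore. Achieved RTO is the gap between the failure and the point where the restored system passes verification.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class RestoreDrill:
    failure_at: datetime            # simulated moment of data loss
    last_recoverable_at: datetime   # newest data present after the restore
    service_restored_at: datetime   # restored system passes verification


def evaluate_drill(drill: RestoreDrill, rpo_target: timedelta,
                   rto_target: timedelta) -> dict:
    """Compare what the drill achieved with the agreed targets."""
    achieved_rpo = drill.failure_at - drill.last_recoverable_at
    achieved_rto = drill.service_restored_at - drill.failure_at
    return {
        "achieved_rpo": achieved_rpo,
        "achieved_rto": achieved_rto,
        "rpo_met": achieved_rpo <= rpo_target,
        "rto_met": achieved_rto <= rto_target,
    }


# Illustrative numbers only: 10 minutes of data lost, 90 minutes to restore.
example = RestoreDrill(
    failure_at=datetime(2024, 1, 15, 12, 0),
    last_recoverable_at=datetime(2024, 1, 15, 11, 50),
    service_restored_at=datetime(2024, 1, 15, 13, 30),
)
report = evaluate_drill(example, rpo_target=timedelta(minutes=15),
                        rto_target=timedelta(hours=1))
```

With these example values the drill meets a 15-minute RPO but misses a 1-hour RTO, which is exactly the kind of finding a scheduled drill exists to surface before a real incident does.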

Self-review checklist

Ask: Have the failure modes this feature relies on handling actually been exercised?
Ask: Has a restore from backup or DR actually been tested recently against our targets?
Ask: Do the right alerts fire when this fails?
Ask: Is any experiment safely bounded, abortable, and away from real customer data?

Why it matters: Resilience that is never tested tends to fail exactly when it is needed: the fallback has a bug, the backup will not restore, the alert never fires. Deliberately rehearsing failure turns those nasty surprises into known, fixed behaviour. For a platform handling money and regulated decisions, that is the difference between a blip and a crisis.