Foundations

Designing for Failure (Fail-Closed)

Foundational

Everything fails eventually. Networks drop, dependencies time out, processes die mid-operation. The question is never whether failure happens, but what the system does when it does. Treat failure as the normal case. When the outcome is uncertain, take the safe path: deny, escalate, or stop. Never quietly proceed.

Designing for failure has two parts. The first is resilience: assume every dependency can be slow, missing, or wrong. Build timeouts, retries with backoff, circuit breakers, and graceful degradation, so that one component's failure does not cascade into the rest of the system. The second part matters most here: the direction of failure. When something is missing, unknown, or errored, does the system fail open (allow) or fail closed (deny)?

For an AML platform, fail-closed is not a preference. It is the rule. A screening check that times out must block and escalate, not approve. An unknown country or document type must escalate, not default to low risk. The Finperiti finding that unsigned webhooks could be acted on is a fail-open disaster: uncertainty was resolved in the attacker's favour. Always resolve it in favour of safety.
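Two of the rules above can be sketched in a few lines. This is an illustrative sketch only, assuming .NET 5 or later: the names `RiskTier`, `CountryRisk` and `WebhookVerifier`, the placeholder country codes, and the hex-encoded HMAC-SHA256 signature scheme are all assumptions made for the example, not details of the platform or of the Finperiti finding.

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Text;

// Hypothetical risk tiers; Escalate means "hold for human review".
public enum RiskTier { Low, High, Escalate }

public static class CountryRisk
{
    // Placeholder codes for illustration; a real table would come from policy.
    private static readonly Dictionary<string, RiskTier> Known =
        new(StringComparer.OrdinalIgnoreCase)
        {
            ["AA"] = RiskTier.Low,
            ["BB"] = RiskTier.High,
        };

    // Missing or unrecognised codes escalate; they never fall through to Low.
    public static RiskTier For(string? countryCode) =>
        countryCode is not null && Known.TryGetValue(countryCode, out var tier)
            ? tier
            : RiskTier.Escalate;
}

public static class WebhookVerifier
{
    // Returns true only when a signature is present, well-formed, and matches.
    // Every other case (absent, malformed, wrong) is treated as not authentic.
    public static bool IsAuthentic(byte[] secret, string body, string? signatureHex)
    {
        if (string.IsNullOrWhiteSpace(signatureHex)) return false; // unsigned: reject

        byte[] provided;
        try { provided = Convert.FromHexString(signatureHex); }
        catch (FormatException) { return false; }                  // malformed: reject

        using var hmac = new HMACSHA256(secret);
        byte[] expected = hmac.ComputeHash(Encoding.UTF8.GetBytes(body));

        // Constant-time comparison; also returns false when lengths differ.
        return CryptographicOperations.FixedTimeEquals(expected, provided);
    }
}
```

In both methods the permissive outcome has to be earned by positive evidence. The safe outcome is what remains when that evidence is missing.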

Fail in the safe direction

Fail open on error

    try { result = screening.Check(customer); }
    catch { result = ScreeningResult.Clear; } // "don't block the happy path"

A timeout or outage in the screening provider now silently approves everyone. This is the exact shape of an AML false-negative catastrophe: uncertainty resolved as 'allow'.

Fail closed, escalate

    try { result = await screening.CheckAsync(customer, timeout); }
    catch (Exception e)
    {
        audit.Record("Screening.Unavailable", customer, e);
        return Decision.BlockAndEscalate("screening unavailable");
    }

When the check cannot complete, the customer is held for human review and the failure is recorded. This is the safe and lawful default.

Build resilience against dependency failure
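One possible shape for the timeout and retry-with-backoff part of resilience, combined with a fail-closed result once retries run out, is sketched below. It assumes .NET 6 or later (for `Task.WaitAsync`). `Resilience`, `ScreeningGate`, `GateDecision` and the retry parameters are hypothetical names and values chosen for the example. Circuit breaking and audit recording are left out to keep it short.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

public static class Resilience
{
    // Runs `call` up to `maxAttempts` times. Each attempt is bounded by
    // `perAttemptTimeout`; the wait between attempts doubles each time,
    // starting at `baseDelay`. The final failure is rethrown to the caller.
    public static async Task<T> RetryAsync<T>(
        Func<CancellationToken, Task<T>> call,
        int maxAttempts,
        TimeSpan perAttemptTimeout,
        TimeSpan baseDelay)
    {
        if (maxAttempts < 1) throw new ArgumentOutOfRangeException(nameof(maxAttempts));

        for (int attempt = 1; ; attempt++)
        {
            // The token is cancelled automatically when the timeout elapses.
            using var cts = new CancellationTokenSource(perAttemptTimeout);
            try
            {
                // WaitAsync stops waiting on cancellation even if `call`
                // itself ignores the token.
                return await call(cts.Token).WaitAsync(cts.Token);
            }
            catch (Exception) when (attempt < maxAttempts)
            {
                // Exponential backoff: baseDelay, 2x, 4x, ...
                await Task.Delay(baseDelay * Math.Pow(2, attempt - 1));
            }
        }
    }
}

// Hypothetical decision type for this sketch.
public enum GateDecision { Clear, BlockAndEscalate }

public static class ScreeningGate
{
    // `check` stands in for the screening provider call; true means "clear".
    public static async Task<GateDecision> DecideAsync(
        Func<CancellationToken, Task<bool>> check,
        int maxAttempts,
        TimeSpan perAttemptTimeout,
        TimeSpan baseDelay)
    {
        try
        {
            bool clear = await Resilience.RetryAsync(
                check, maxAttempts, perAttemptTimeout, baseDelay);
            return clear ? GateDecision.Clear : GateDecision.BlockAndEscalate;
        }
        catch (Exception)
        {
            // Retries exhausted: the outcome is unknown, so hold for review.
            return GateDecision.BlockAndEscalate;
        }
    }
}
```

Retries reduce how often a transient fault reaches the decision point, but they do not change the decision itself. If the provider is still unavailable after the last attempt, the only result the gate can return is `BlockAndEscalate`.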

Self-review checklist

Why it matters: The direction a system fails in decides whether an outage is a small problem or a breach. Fail-open code turns every dependency hiccup into an opening: the wrong person let through, a check skipped, money moved. Failing closed costs a little friction when things break, and it prevents the failures that end licences and careers.