Designing for Failure (Fail-Closed)
Everything fails in time. Networks drop, dependencies time out, processes die mid-operation. The question is never whether it happens, but what the system does when it does. Treat failure as the normal case. When the outcome is uncertain, take the safe path: deny, escalate, or stop. Never quietly proceed.
Designing for failure has two parts. The first is resilience: assume every dependency can be slow, missing, or wrong. Build timeouts, retries with backoff, circuit breakers, and graceful degradation, so one component's failure does not spread to the rest. The second part matters most here: the direction of failure. When something is missing, unknown, or errored, does the system fail open (allow) or fail closed (deny or escalate)?
For an AML platform, fail-closed is not a preference. It is the rule. A screening check that times out must block and escalate, not approve. An unknown country or document type must escalate, not default to low risk. The Finperiti finding that unsigned webhooks could be acted on is a fail-open disaster: uncertainty was resolved in the attacker's favour. Always resolve it in favour of safety.
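The unsigned-webhook case above is a useful illustration of the safe direction. The following is a minimal sketch only, assuming an HMAC-SHA256 signature sent as a hex string; the class name, the parameter names, and the signing scheme are hypothetical and are not taken from the Finperiti finding itself. The point is that a missing, malformed, or mismatched signature all resolve to "not authentic".

```csharp
#nullable enable
using System;
using System.Security.Cryptography;
using System.Text;

public static class WebhookVerifier
{
    // Returns true only when a signature is present, well-formed, and matches.
    // Every other case (missing, malformed, mismatched) returns false.
    // Assumption: the sender signs the raw body with HMAC-SHA256 and sends hex.
    public static bool IsAuthentic(byte[] secret, string body, string? signatureHex)
    {
        if (string.IsNullOrWhiteSpace(signatureHex)) return false; // unsigned: reject

        byte[] provided;
        try { provided = Convert.FromHexString(signatureHex); }
        catch (FormatException) { return false; }                  // malformed: reject

        using var hmac = new HMACSHA256(secret);
        byte[] expected = hmac.ComputeHash(Encoding.UTF8.GetBytes(body));

        // Constant-time comparison; also returns false when lengths differ.
        return CryptographicOperations.FixedTimeEquals(expected, provided);
    }
}
```

A caller would act on the webhook only when `IsAuthentic` returns true, and would record and reject everything else.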
Fail in the safe direction
- Always: Default to deny or escalate on any security or compliance decision when input is missing, errored, timed out, or unrecognised.
- Do: Treat unknown as dangerous. An unmapped value (country, document type, risk band, screening result) escalates to a human. It never resolves to a permissive default.
- Do: Make the safe outcome the default branch, so forgetting to handle a case fails closed, not open.
- Consider: Giving each external decision point an explicit 'unknown/unavailable' path, built as carefully as the success path.
- Never: Fail open. When something is missing, unknown, or uncertain, deny or escalate. Never default to the permissive outcome.
- Never: Auto-approve a customer, transaction, or onboarding when a screening, sanctions, or KYC check is missing, errored, timed out, or unrecognised. Block and escalate.
// Anti-pattern: the catch block fails open.
try { result = screening.Check(customer); }
catch { result = ScreeningResult.Clear; } // "don't block the happy path"
A timeout or outage in the screening provider now silently approves everyone. This is the exact shape of an AML false-negative catastrophe: uncertainty resolved as 'allow'.
try { result = await screening.CheckAsync(customer, timeout); }
catch (Exception e)
{
    audit.Record("Screening.Unavailable", customer, e);        // record the failure
    return Decision.BlockAndEscalate("screening unavailable"); // hold for human review
}
When the check cannot complete, the customer is held for human review and the failure is recorded. This is the safe and lawful default.
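The "safe outcome is the default branch" rule from the list above can be sketched in the same way. This is an illustration under stated assumptions: the status strings and the `ScreeningPolicy` class are hypothetical, and the `Decision` enum here is a simplified stand-in for whatever decision type the platform really uses. Only one value leads to approval, and it is matched explicitly; everything else, including null and values nobody has mapped yet, lands on the escalation branch.

```csharp
#nullable enable
using System;

// Simplified stand-in for the platform's decision type (assumption).
public enum Decision { Approve, BlockAndEscalate }

public static class ScreeningPolicy
{
    // providerStatus is the raw status from a hypothetical screening provider.
    public static Decision Decide(string? providerStatus) => providerStatus switch
    {
        "clear"          => Decision.Approve,          // the only permissive path, stated explicitly
        "match"          => Decision.BlockAndEscalate,
        "possible_match" => Decision.BlockAndEscalate,
        _                => Decision.BlockAndEscalate  // null, empty, or any unmapped value
    };
}
```

If the provider later adds a new status, this code escalates it until someone maps it deliberately, which is the behaviour the rule asks for.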
Build resilience against dependency failure
- Do: Put a timeout on every outbound call (to databases, services, and third parties) so a hung dependency cannot hang you.
- Do: Retry only transient, idempotent failures, with backoff and a cap. Never retry blindly, and never retry a non-idempotent action without a guard.
- Do: Isolate failures with circuit breakers and bulkheads, so a struggling dependency breaks one feature, not the whole system.
- Consider: Graceful degradation (a reduced but safe response) where the domain allows it and the safe-direction rule still holds.
- Do not: Assume a call that usually returns in milliseconds always will. Design for the slow and missing cases too.
- Never: Let a failure in a non-critical dependency silently disable a critical control such as screening. Surface it and fail closed.
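The timeout and retry rules above could be combined roughly as follows. This is a sketch, not a definitive implementation: `ResilientScreening`, `Outcome`, the delegate signature, and the choice of which exceptions count as transient are all assumptions made for illustration, and it assumes .NET 6 or later for `Task.WaitAsync`. Each attempt has its own timeout, only transient failures are retried, the attempts are capped, the delay doubles between attempts, and exhaustion resolves to `Unavailable` rather than to a permissive result.

```csharp
using System;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

public enum Outcome { Clear, Hit, Unavailable }

public static class ResilientScreening
{
    // Assumption: these exception types are the transient ones worth retrying.
    private static bool IsTransient(Exception e) =>
        e is TimeoutException or OperationCanceledException or HttpRequestException;

    // check: the outbound screening call (hypothetical signature). It must be
    // idempotent, because it may be invoked more than once.
    public static async Task<Outcome> CheckAsync(
        Func<CancellationToken, Task<bool>> check,
        TimeSpan perAttemptTimeout,
        int maxAttempts = 3,
        TimeSpan? baseDelay = null)
    {
        TimeSpan delay = baseDelay ?? TimeSpan.FromMilliseconds(200);

        for (int attempt = 1; attempt <= maxAttempts; attempt++)
        {
            using var cts = new CancellationTokenSource(perAttemptTimeout);
            try
            {
                // WaitAsync bounds the wait even if the callee ignores the token.
                bool hit = await check(cts.Token).WaitAsync(perAttemptTimeout);
                return hit ? Outcome.Hit : Outcome.Clear;
            }
            catch (Exception e) when (IsTransient(e))
            {
                if (attempt == maxAttempts) break;           // cap reached: stop retrying
                await Task.Delay(delay);
                delay = TimeSpan.FromTicks(delay.Ticks * 2); // exponential backoff
            }
            catch (Exception)
            {
                break; // non-transient failure: do not retry
            }
        }

        // Fail closed: the caller treats Unavailable as block-and-escalate.
        return Outcome.Unavailable;
    }
}
```

The caller is expected to record the failure and map `Outcome.Unavailable` to a block-and-escalate decision, never to approval. A circuit breaker or bulkhead would sit around this call; it is left out here to keep the sketch short.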
Self-review checklist
- Ask: When this check is missing, errored, or times out, does the system deny/escalate or quietly allow?
- Ask: Is 'unknown' treated as dangerous, or does it fall through to a permissive default?
- Ask: Does every outbound call have a timeout, and are retries safe and bounded?
- Ask: If a non-critical dependency fails, could it silently take down a critical control?