Error Handling & Failure Modes

Foundational

Failure is normal, not rare. A resilient system and a fragile one usually fail at the same rate. The difference is what the code does right after a fault. Design those moments on purpose.

Every operation that can fail forces a choice: recover, retry, escalate, or stop. Much production pain comes from code that never made that choice. It caught an error and carried on, leaving the system in a state nobody planned for. The rules below cover three design decisions: where you handle a failure, how you keep information about it, and what you show to the outside world.

Where to handle failure

Do: Handle an error only where you can make a real decision about it. If a function cannot decide, let the error pass up.
Do: Give one layer clear ownership of recovery. That layer decides retry or fail, instead of every layer guessing.
Consider: Splitting transient failures (network blips, lock contention → retry with backoff) from permanent ones (validation, auth → fail fast).
Do not: Catch broad exceptions in low-level helpers just to "be safe". This takes away the caller's chance to respond correctly.
Never: Retry a non-idempotent operation without a guard, such as an idempotency key. A retried payment can become a duplicate charge.
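
One common form of the guard in the last rule is an idempotency key: the caller sends the same key on every retry, and the receiver returns the first result instead of repeating the work. The following is a minimal Python sketch of that idea, not a real payment API. `FakeGateway`, its `charge` method, and the key format are all hypothetical names invented for illustration.

```python
import uuid


class FakeGateway:
    """Hypothetical stand-in for a payment provider that honours idempotency keys."""

    def __init__(self):
        self._seen = {}   # idempotency key -> receipt of the first charge
        self.charges = 0  # how many times money actually moved

    def charge(self, idempotency_key, amount_cents):
        # A repeated key returns the first result instead of charging again.
        if idempotency_key in self._seen:
            return self._seen[idempotency_key]
        self.charges += 1
        receipt = {"charge_id": str(uuid.uuid4()), "amount_cents": amount_cents}
        self._seen[idempotency_key] = receipt
        return receipt


gateway = FakeGateway()
key = "order-1001-payment"  # assumed format: derived once per order, reused on every retry

first = gateway.charge(key, 2500)
retry = gateway.charge(key, 2500)  # e.g. the first response was lost in transit

print(gateway.charges)  # 1
print(first == retry)   # True
```

The key has to be stable across retries of the same logical operation. A key generated fresh on each attempt would defeat the guard.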

The silent swallow

try { chargeCard(order); } catch (Exception) { /* ignore */ }

Execution continues as if the charge had succeeded: the order is marked paid, but no charge happened, and there is no log, alert, or trace. This one pattern causes a large share of "it just silently didn't work" incidents.

Decide, preserve context, surface

try { chargeCard(order); }
catch (TransientGatewayError e) { enqueueRetry(order, e); }
catch (CardDeclinedError e) { markUnpaid(order); notify(order, e); throw; }

Transient and permanent failures are handled differently. A transient gateway error is queued for retry. A declined card marks the order unpaid, notifies, and rethrows the original error with its stack trace intact. The order never ends up in a wrong state.
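
The transient/permanent split can be sketched as a single retry helper owned by one layer. This is a Python sketch under stated assumptions, not a definitive implementation: `TransientError`, `PermanentError`, `call_with_retry`, and the default attempt count and delay are hypothetical names and values chosen for illustration.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical: the operation may succeed if tried again (timeout, lock contention)."""


class PermanentError(Exception):
    """Hypothetical: retrying cannot help (validation, auth)."""


def call_with_retry(operation, max_attempts=4, base_delay=0.1, sleep=time.sleep):
    """Retry transient failures with exponential backoff and jitter.

    Permanent errors are not caught here, so they propagate at once.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: escalate the original error
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay))  # jitter spreads out retries


# Usage: an operation that fails twice, then succeeds.
attempts = []


def flaky():
    attempts.append(1)
    if len(attempts) < 3:
        raise TransientError("gateway timeout")
    return "ok"


result = call_with_retry(flaky, sleep=lambda seconds: None)  # no real waiting in the demo
print(result, len(attempts))  # ok 3
```

Because only `TransientError` is caught, a validation or auth failure reaches the caller on the first attempt, which is the fail-fast behaviour the rules ask for. The retry cap keeps a doomed operation from being retried forever.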

Preserving information

Always: Keep the original cause when you wrap an error. Add context; never throw the cause away.
Do: Make every failure visible through structured logs, metrics, or traces, so you can find it without a debugger.
Consider: Defining a function's failure contract (what it throws or returns on error) as carefully as its success type.
Do not: Use errors or exceptions for normal control flow. It hides intent and, in many runtimes, is slow.
Never: Swallow an error silently. If ignoring it is truly correct, say so clearly and write down why.
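
When ignoring an error really is correct, the code can say so narrowly and leave a trace. A minimal Python sketch, assuming a hypothetical helper `remove_if_present` whose goal is "the file is absent afterwards":

```python
import logging
import os
import tempfile

log = logging.getLogger("cleanup")


def remove_if_present(path):
    """Delete a file; report whether anything was removed."""
    try:
        os.remove(path)
        return True
    except FileNotFoundError:
        # Deliberately ignored: the goal is "file is absent", and it already is.
        # Only this one exception is ignored; other errors still propagate.
        log.debug("nothing to remove at %s", path)
        return False


# Usage
fd, path = tempfile.mkstemp()
os.close(fd)
print(remove_if_present(path))  # True
print(remove_if_present(path))  # False
```

The difference from a silent swallow is scope and record: one specific exception type, a comment stating why ignoring it is safe, and a log line that makes the event findable.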

Losing the cause

catch (IOException e) { throw new AppError("save failed"); }

The real reason (disk full? permission denied?) is gone. The new error's stack trace begins at the rethrow, so nothing points back to where the failure actually happened.

Chain the cause

catch (IOException e) { throw new AppError("save failed", e); }

The original IOException is kept as the inner cause, so the real reason still shows up in logs and traces.
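
The same chaining idea in Python uses `raise ... from`, which stores the original exception on the new one's `__cause__` attribute. This is a small sketch: `AppError` and `save` are hypothetical names, and the target path is chosen so that the write is certain to fail.

```python
import os
import tempfile


class AppError(Exception):
    """Hypothetical application-level error."""


def save(path, data):
    try:
        with open(path, "w") as f:
            f.write(data)
    except OSError as e:
        # "from e" keeps the original exception as the cause of the new one.
        raise AppError(f"save failed for {path}") from e


# Usage: the parent directory does not exist, so open() raises an OSError subclass.
target = os.path.join(tempfile.mkdtemp(), "missing", "report.txt")
try:
    save(target, "hello")
except AppError as err:
    print(isinstance(err.__cause__, OSError))  # True
```

When such an error goes unhandled, the traceback prints both exceptions, so the underlying reason is visible next to the added context.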

What you expose

Do: Return stable, useful error shapes to callers (a code plus a safe message), not raw internals.
Consider: Designing for graceful degradation. A reduced but working response is better than a hard failure, where the domain allows it.
Never: Leak stack traces, SQL, secrets, or file paths to an external caller.
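
One way to get a stable error shape is to translate exceptions at the boundary: known errors map to a fixed code and safe message, everything else becomes a generic error, and the full detail goes only to internal logs. The sketch below is in Python and makes several assumptions: `NotFound`, `SAFE_ERRORS`, `to_error_response`, the code strings, and the `incident_id` field are all hypothetical, invented for illustration.

```python
import logging
import uuid

log = logging.getLogger("api")


class NotFound(Exception):
    """Hypothetical domain error that is safe to describe to callers."""


# Known errors map to a stable code and a safe message.
SAFE_ERRORS = {NotFound: ("not_found", "The requested item does not exist.")}


def to_error_response(exc):
    """Turn any exception into a stable, caller-safe payload."""
    incident_id = str(uuid.uuid4())
    # Full detail, including any traceback, stays in internal logs only.
    log.error("request failed, incident=%s", incident_id, exc_info=exc)
    code, message = SAFE_ERRORS.get(
        type(exc), ("internal_error", "Something went wrong.")
    )
    return {"code": code, "message": message, "incident_id": incident_id}


response = to_error_response(RuntimeError("SELECT * FROM users failed at /srv/app/db.py"))
print(response["code"])     # internal_error
print(response["message"])  # Something went wrong.
```

The incident identifier lets support staff find the detailed log entry without the caller ever seeing the SQL or file path.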

Self-review checklist

Ask: If this fails, who finds out, and how?
Ask: Is the system in a valid state on every failure path, not just the happy path?
Ask: Could this error be retried safely, and if so, is it?
Ask: Does anything I show on error reveal more than the caller should see?

Why it matters: Most production incidents are not caused by the first fault. They are caused by code that handled the fault badly: it hid the error, corrupted state, or retried a doomed operation forever. Good error handling turns a bug into a logged, recoverable event instead of a middle-of-the-night alert.