Error Handling & Failure Modes

Foundational

Failure is normal, not rare. A resilient system and a fragile one usually fail at the same rate. The difference is what the code does right after a fault. Design those moments on purpose.

Every operation that can fail forces a choice: recover, retry, escalate, or stop. Much production pain comes from code that never made that choice. It caught an error and carried on, leaving the system in a state nobody planned for. The rules below cover the three decisions that matter most: where you handle a failure, how you keep information about it, and what you show to the outside world.

Where to handle failure

The silent swallow

    try { chargeCard(order); } catch (Exception) { /* ignore */ }

Nothing records the failure, so the surrounding code proceeds as if the charge succeeded: the order is marked paid, but no charge happened, and there is no log, alert, or trace. This one pattern causes a large share of "it just silently didn't work" incidents.

Decide, preserve context, surface

    try { chargeCard(order); }
    catch (TransientGatewayError e) { enqueueRetry(order, e); }
    catch (CardDeclinedError e) { markUnpaid(order); notify(order, e); throw; }

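Transient and permanent failures are handled differently: a transient gateway error is queued for retry, while a declined card marks the order unpaid, sends a notification, and rethrows. The order never ends up in a wrong state, and the rethrown error still passes up with its original context intact.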
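The same decision can be sketched in Java, as a rough illustration rather than a definitive implementation. Everything named here is an assumption made for the example: the `Order` and `Status` types, the `Gateway` interface, and the in-memory `retryQueue` stand in for whatever the real system uses. Java has no bare `throw;`, so the sketch uses `throw e;`, which rethrows the same exception object and therefore keeps its original stack trace.

```java
import java.util.ArrayDeque;
import java.util.Queue;

final class PaymentHandling {
    // Hypothetical error types, named after the ones used in the text.
    static class TransientGatewayError extends RuntimeException {
        TransientGatewayError(String message) { super(message); }
    }
    static class CardDeclinedError extends RuntimeException {
        CardDeclinedError(String message) { super(message); }
    }

    enum Status { PENDING, PAID, RETRY_QUEUED, UNPAID }

    static class Order {
        final String id;
        Status status = Status.PENDING;
        Order(String id) { this.id = id; }
    }

    /** Stand-in for the real payment call; it may throw either error type. */
    interface Gateway {
        void chargeCard(Order order);
    }

    // In-memory stand-in for a real retry queue (assumption).
    static final Queue<Order> retryQueue = new ArrayDeque<>();

    static void pay(Order order, Gateway gateway) {
        try {
            gateway.chargeCard(order);
            order.status = Status.PAID; // reached only if the charge succeeded
        } catch (TransientGatewayError e) {
            // Recoverable: record the decision and try again later.
            order.status = Status.RETRY_QUEUED;
            retryQueue.add(order);
        } catch (CardDeclinedError e) {
            // Permanent: put the order in a known state, then rethrow the
            // same exception object so callers see the original stack trace.
            order.status = Status.UNPAID;
            throw e;
        }
    }

    public static void main(String[] args) {
        Order flaky = new Order("A-1");
        pay(flaky, o -> { throw new TransientGatewayError("gateway timeout"); });
        System.out.println(flaky.status); // RETRY_QUEUED

        Order declined = new Order("A-2");
        try {
            pay(declined, o -> { throw new CardDeclinedError("insufficient funds"); });
        } catch (CardDeclinedError e) {
            System.out.println(declined.status + ": " + e.getMessage()); // UNPAID: insufficient funds
        }
    }
}
```

The point of the sketch is that every path leaves `status` in a deliberate state: `PAID` is set only after the charge returns normally, so no failure path can leave the order looking paid.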

Preserving information

Losing the cause

    catch (IOException e) { throw new AppError("save failed"); }

The real reason (disk full? permission denied?) is gone. The new error's stack trace begins at this throw, so nothing points back to where the I/O call actually failed.

Chain the cause

    catch (IOException e) { throw new AppError("save failed", e); }

The original IOException is kept as the inner cause, so the real reason still shows up in logs and traces.
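A minimal Java sketch of cause chaining follows. `AppError`, `writeToDisk`, and `save` are hypothetical names chosen for illustration, and `writeToDisk` always fails here so that the behavior is deterministic. The only library features relied on are the standard `RuntimeException(String, Throwable)` constructor and `Throwable.getCause()`.

```java
import java.io.IOException;

final class CauseChaining {
    // Hypothetical application-level error, named after AppError in the text.
    static class AppError extends RuntimeException {
        AppError(String message, Throwable cause) { super(message, cause); }
    }

    // Stand-in for a real write; it always fails to keep the example deterministic.
    static void writeToDisk(String data) throws IOException {
        throw new IOException("No space left on device");
    }

    static void save(String data) {
        try {
            writeToDisk(data);
        } catch (IOException e) {
            // Passing e as the cause keeps the low-level reason reachable.
            throw new AppError("save failed", e);
        }
    }

    public static void main(String[] args) {
        try {
            save("report");
        } catch (AppError e) {
            // prints: save failed <- No space left on device
            System.out.println(e.getMessage() + " <- " + e.getCause().getMessage());
        }
    }
}
```

Callers get a domain-level message ("save failed") while anything that logs the exception can still follow the cause down to the underlying I/O failure.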

What you expose

Self-review checklist

Why it matters: Most production incidents are not caused by the first fault. They are caused by code that handled the fault badly: it hid the error, corrupted state, or retried a doomed operation forever. Good error handling turns a bug into a logged, recoverable event instead of a middle-of-the-night alert.