Runbooks & On-Call
When something breaks at 3 a.m., the person responding should not have to figure out the system under pressure. Runbooks are short, practical guides for operating and recovering services. With a clear, humane on-call setup, they turn a panicked scramble into a calm, repeatable response.
A runbook answers "what do I do when X happens?": how to tell if the service is healthy, common failures and their fixes, how to restart, roll back, or scale, what it depends on, and how to escalate. On-call is the human side: who is responsible right now, how they are alerted, and how the rota stays sustainable. Together they make operations a shared, documented capability rather than knowledge stuck in one person's head.
Runbooks and on-call support Incident Readiness (the response process), Observability (the signals), and Ownership & Accountability (you operate what you build). For a largely junior team, good runbooks are also how knowledge spreads and how a newer engineer can safely hold the pager.
Write runbooks people can actually use
- Do: Keep a runbook per service covering how to check health, key dashboards and alerts, common failures and fixes, how to restart, roll back, or scale, dependencies, and escalation contacts.
- Do: Make them practical and current, with concrete commands and steps, kept as code or docs next to the service and updated when it changes (see Documentation as Code).
- Do: Turn what you learn from every incident into runbook updates, so the next person handles it faster (see Incident Readiness).
- Consider: Automating the repetitive recovery steps a runbook describes, so the safe action is one command, not ten.
- Avoid: Runbooks that are stale, vague, or assume deep context only the author has. Under pressure they are worse than useless.
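One way to keep the runbook contents listed above from drifting is a small completeness check that runs alongside the service's other checks. The sketch below is illustrative only: the section names and the `missing_sections` helper are hypothetical, and it assumes each runbook has already been parsed into a mapping of section name to text.

```python
# Hypothetical section names derived from the list above; rename to fit your template.
REQUIRED_SECTIONS = (
    "health_check",
    "dashboards_and_alerts",
    "common_failures",
    "restart_rollback_scale",
    "dependencies",
    "escalation_contacts",
)


def missing_sections(runbook: dict[str, str]) -> list[str]:
    """Return required sections that are absent or blank, in template order."""
    return [
        name
        for name in REQUIRED_SECTIONS
        if not runbook.get(name, "").strip()
    ]


if __name__ == "__main__":
    # Placeholder content; a real runbook would hold concrete commands and links.
    draft = {
        "health_check": "How to confirm the service is healthy.",
        "common_failures": "Known failure modes and their fixes.",
        "dependencies": "   ",  # blank sections count as missing
    }
    for name in missing_sections(draft):
        print(f"runbook is missing section: {name}")
```

A check like this only confirms that sections exist. Whether the steps are current and usable under pressure still needs a human review.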
Run on-call humanely
- Do: Define on-call clearly, including who is responsible at any time, how they are reached, the expected response, and a real escalation path when they need help.
- Do: Make alerts actionable and tuned. Page on what truly needs a human now, and keep noise down so real alerts are not lost (see Observability, SLOs).
- Do: Support the people on call with a reasonable rotation, the access and runbooks they need ready, no lone-hero expectations, and recovery time after a rough night (see Wellbeing & Sustainable Pace).
- Do: Make sure newer engineers can hold the pager safely through paired or shadow on-call, clear runbooks, and easy escalation (see Collaboration & Teamwork).
- Avoid: Alert fatigue (paging on non-urgent noise) and single points of knowledge where only one person can fix a thing.
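The rule "page on what truly needs a human now" can be written down as an explicit routing decision, which makes it easier to review alert by alert. This is a minimal sketch under stated assumptions: the `Alert` fields, the `Route` values, and the policy itself are hypothetical and not tied to any particular alerting tool.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    PAGE = "page"      # interrupt the on-call engineer now
    TICKET = "ticket"  # handle during working hours
    REVIEW = "review"  # likely noise: tune the alert or remove it


@dataclass(frozen=True)
class Alert:
    name: str
    needs_human_now: bool  # assumption: set by whoever owns the alert
    actionable: bool       # assumption: there is a documented step to take


def route(alert: Alert) -> Route:
    """Route an alert using the hypothetical policy described above."""
    if alert.needs_human_now:
        return Route.PAGE
    if alert.actionable:
        return Route.TICKET
    return Route.REVIEW


if __name__ == "__main__":
    # Illustrative alert names only.
    for alert in (
        Alert("checkout-error-rate", needs_human_now=True, actionable=True),
        Alert("disk-usage-trend", needs_human_now=False, actionable=True),
        Alert("cpu-spike-info", needs_human_now=False, actionable=False),
    ):
        print(f"{alert.name}: {route(alert).value}")
```

Anything that lands in the review bucket is a candidate for tuning or deletion, which is one way to keep paging noise down over time.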
Self-review checklist
- Ask: If this service broke tonight, is there a runbook a tired colleague could follow without me?
- Ask: Are this service's alerts actionable, or just noise?
- Ask: Could a newer engineer on call diagnose and recover, or escalate easily?
- Ask: Did the lessons from the last incident make it back into the runbook?
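The last question in the checklist can be approximated with a date comparison. The sketch below assumes you can obtain two dates from your own tooling, for example the runbook's last change in version control and the most recent incident for the service. The function name and both inputs are hypothetical, and the result is a rough proxy rather than proof that the lessons were captured.

```python
from datetime import date


def runbook_is_stale(runbook_updated: date, last_incident: date) -> bool:
    """Return True when the runbook has not changed since the last incident.

    An update on the same day as the incident is treated as not stale.
    """
    return runbook_updated < last_incident


if __name__ == "__main__":
    # Illustrative dates only.
    if runbook_is_stale(date(2024, 3, 1), date(2024, 4, 15)):
        print("Runbook predates the last incident; review it.")
```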