Operations

Runbooks & On-Call

Foundational

When something breaks at 3 a.m., the person responding should not have to figure out the system under pressure. Runbooks are short, practical guides for operating and recovering services. With a clear, humane on-call setup, they turn a panicked scramble into a calm, repeatable response.

A runbook answers "what do I do when X happens?": how to tell if the service is healthy, common failures and their fixes, how to restart, roll back, or scale, what it depends on, and how to escalate. On-call is the human side: who is responsible right now, how they are alerted, and how the rota stays sustainable. Together they make operations a shared, documented capability rather than knowledge stuck in one person's head.
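One way to keep that list of questions honest is to treat a runbook as structured data and check it for gaps. The sketch below is only an illustration: the section names mirror the list above, while the `Runbook` class, its field names, and the `billing-api` service are hypothetical.

```python
from dataclasses import dataclass, field

# Section names follow the list in the text; the identifiers are illustrative.
REQUIRED_SECTIONS = (
    "health_checks",
    "common_failures",
    "restart",
    "rollback",
    "scale",
    "dependencies",
    "escalation",
)


@dataclass
class Runbook:
    service: str
    sections: dict[str, str] = field(default_factory=dict)

    def missing_sections(self) -> list[str]:
        """Return required sections that are absent or contain only whitespace."""
        return [
            name
            for name in REQUIRED_SECTIONS
            if not self.sections.get(name, "").strip()
        ]


example = Runbook(
    service="billing-api",  # hypothetical service name
    sections={
        "health_checks": "Check the service health endpoint and dashboard.",
        "restart": "Use the team's deploy tooling to restart.",
        "escalation": "   ",  # blank entries count as missing
    },
)
print(example.missing_sections())
```

A check like this could run in CI next to the service, so an incomplete runbook is caught at review time rather than during an incident.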

This supports Incident Readiness (the response process), Observability (the signals), and Ownership & Accountability (you operate what you build). For a largely junior team, good runbooks are also how knowledge spreads and how a newer engineer can safely hold the pager.

Write runbooks people can actually use

Do: Keep a runbook per service: how to check health, key dashboards and alerts, common failures and fixes, how to restart, roll back, or scale, dependencies, and escalation contacts.
Do: Make them practical and current: concrete commands and steps, kept as code or docs next to the service and updated when it changes (see Documentation as Code).
Do: Turn what you learn from every incident into runbook updates, so the next person handles it faster (see Incident Readiness).
Consider: Automating the repetitive recovery steps a runbook describes, so the safe action is one command, not ten.
Avoid: Runbooks that are stale, vague, or assume deep context only the author has. Under pressure they are worse than useless.
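As a rough sketch of the "one command, not ten" idea, a recovery step can be wrapped so that it checks health first, restarts only when needed, and then verifies the result. The `recover` function and its parameters are hypothetical; the real health check and restart commands depend on the service and are passed in as callables here.

```python
import time
from typing import Callable


def recover(
    is_healthy: Callable[[], bool],
    restart: Callable[[], None],
    attempts: int = 3,
    wait_seconds: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Restart only if unhealthy, then poll until healthy or attempts run out."""
    if is_healthy():
        return True  # nothing to do; avoid restarting a healthy service
    restart()
    for _ in range(attempts):
        sleep(wait_seconds)
        if is_healthy():
            return True
    return False  # the caller should escalate rather than retry indefinitely


# Example with fakes standing in for real health-check and restart commands.
state = {"up": False, "restarts": 0}


def fake_restart() -> None:
    state["restarts"] += 1
    state["up"] = True


recovered = recover(
    is_healthy=lambda: state["up"],
    restart=fake_restart,
    sleep=lambda _seconds: None,  # skip real waiting in the example
)
print(recovered, state["restarts"])
```

Returning `False` instead of looping forever keeps the automated step bounded, so the responder knows when to move on to the escalation contacts in the runbook.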

Run on-call humanely

Do: Define on-call clearly: who is responsible at any time, how they are reached, the expected response, and a real escalation path when they need help.
Do: Make alerts actionable and tuned: page on what truly needs a human now, and keep noise down so real alerts are not lost (see Observability, SLOs).
Do: Support the people on call: a reasonable rotation, the access and runbooks they need ready, no lone-hero expectations, and recovery time after a rough night (see Wellbeing & Sustainable Pace).
Do: Make sure newer engineers can hold the pager safely: paired or shadow on-call, clear runbooks, and easy escalation (see Collaboration & Teamwork).
Avoid: Alert fatigue (paging on non-urgent noise) and single points of knowledge where only one person can fix a thing.
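"Who is responsible right now" and "who do they escalate to" can be made unambiguous by deriving both from the rota itself. The following is a minimal sketch assuming a fixed round-robin rota with weekly shifts; the function names, the example names, and the `eng-manager` final contact are all hypothetical, and most teams would get this from their paging tool rather than write it themselves.

```python
from datetime import datetime, timedelta, timezone


def shift_index(
    rota: list[str],
    rota_start: datetime,
    now: datetime,
    shift: timedelta = timedelta(weeks=1),
) -> int:
    """Return the position in the rota of whoever is on call at `now`."""
    if not rota:
        raise ValueError("rota must not be empty")
    if now < rota_start:
        raise ValueError("now is before the rota start")
    completed_shifts = (now - rota_start) // shift  # whole shifts elapsed
    return completed_shifts % len(rota)


def escalation_path(
    rota: list[str],
    rota_start: datetime,
    now: datetime,
    final_contact: str,
    shift: timedelta = timedelta(weeks=1),
) -> list[str]:
    """Primary, then the next person in the rota, then a final contact."""
    index = shift_index(rota, rota_start, now, shift)
    candidates = [rota[index], rota[(index + 1) % len(rota)], final_contact]
    path: list[str] = []
    for person in candidates:
        if person not in path:  # avoid listing the same person twice
            path.append(person)
    return path


rota = ["ana", "ben", "chi"]  # hypothetical engineers
start = datetime(2024, 1, 1, tzinfo=timezone.utc)
now = datetime(2024, 1, 10, 3, 0, tzinfo=timezone.utc)  # 3 a.m. in week two
print(escalation_path(rota, start, now, final_contact="eng-manager"))
```

Using the next person in the rota as secondary is one simple choice; it means the escalation path never depends on a single named expert.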

Self-review checklist

Ask: If this service broke tonight, is there a runbook a tired colleague could follow without me?
Ask: Are this service's alerts actionable, or just noise?
Ask: Could a newer engineer on call diagnose and recover, or escalate easily?
Ask: Did the lessons from the last incident make it back into the runbook?

Why it matters: Incidents are won or lost on preparation. Clear runbooks and a humane, well-defined on-call turn an outage into a calm, fast recovery and stop reliability from depending on one heroic person. They also spread operational knowledge across the team, which is essential when most engineers are still building that experience.