Operations

Service Levels & Reliability Targets (SLOs)

Advanced

You can only answer "is the service reliable enough?" if you have decided what "enough" means. Service-Level Objectives turn reliability from a vague feeling into a number: a target for availability or latency that you measure, alert on, and use to decide when to push features and when to shore up stability.

A few definitions, because they are easy to mix up. An SLI (service-level indicator) is what you measure: for example, the percentage of successful requests, or the percentage of requests served in under 500 ms. An SLO is the target for that indicator over a time window (for example, 99.9% success over 30 days). An error budget is the shortfall the SLO allows (the remaining 0.1%). When you have used it up, reliability work takes priority over new features. An SLA is a contractual promise to customers, usually looser than your internal SLO.
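
To make the arithmetic concrete, here is a minimal Python sketch of how the SLI, the SLO, and the error budget relate. The `Slo` class and the function names are hypothetical, chosen for illustration; the 99.9% over 30 days target is the example used above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    """A hypothetical SLO definition: a target ratio over a rolling window."""
    target: float      # e.g. 0.999 means 99.9% of requests must be good
    window_days: int   # e.g. 30


def sli(good: int, total: int) -> float:
    """SLI: the fraction of requests that were good (1.0 if there was no traffic)."""
    return good / total if total else 1.0


def error_budget(slo: Slo, total: int) -> float:
    """Number of bad requests the SLO tolerates for this much traffic."""
    return (1.0 - slo.target) * total


def budget_remaining(slo: Slo, good: int, total: int) -> float:
    """Fraction of the error budget still unspent (negative once the SLO is breached)."""
    budget = error_budget(slo, total)
    bad = total - good
    if budget <= 0:
        # No traffic, or a 100% target: any failure at all is a breach.
        return 1.0 if bad == 0 else float("-inf")
    return 1.0 - bad / budget


def allowed_downtime_minutes(slo: Slo) -> float:
    """The same budget read as minutes of total outage per window."""
    return (1.0 - slo.target) * slo.window_days * 24 * 60


if __name__ == "__main__":
    slo = Slo(target=0.999, window_days=30)
    # One million requests in the window, 600 of them failed.
    print(sli(999_400, 1_000_000))
    print(budget_remaining(slo, 999_400, 1_000_000))
    print(allowed_downtime_minutes(slo))
```

With these example numbers, 600 failures against a budget of about 1,000 leaves roughly 40% of the budget unspent. Read as time, a 99.9% target over 30 days allows about 43 minutes of full outage.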

You do not need many SLOs. Pick the few user-facing things that matter (the API is up, requests are fast enough, critical flows like onboarding and payments work) and measure those. This guidance complements Observability & Logging Hygiene (how you measure) and Incident Readiness (what you do when you breach an SLO).

Set targets that mean something

Do: Define SLOs for the few things users actually feel: availability and latency of key endpoints, and success of critical flows (onboarding, payments, screening).
Do: Base SLIs on the user's experience (did their request succeed, quickly?), not just server-side resource metrics like CPU.
Do: Pick realistic targets. Chasing 100% is far too expensive. The right number balances reliability against the cost and pace of change.
Consider: Adopting an error budget. While you are within it, ship features; when you have used it up, prioritise reliability. It is a simple, honest rule for the trade-off.
Consider: Keeping internal SLOs tighter than any customer-facing SLA, so you notice and fix problems before they breach a promise.
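
As an illustration of SLIs based on the user's experience, the sketch below derives an availability SLI and a latency SLI from per-request records rather than from resource metrics. The `Request` record shape is hypothetical. Counting only 5xx responses as failures is an assumption you may want to change, and the 500 ms threshold is simply the example figure used earlier.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Request:
    """One user request as the user saw it (hypothetical record shape)."""
    status: int         # HTTP status code returned to the user
    duration_ms: float  # how long the user waited for the response


def availability_sli(requests: Iterable[Request]) -> float:
    """Share of requests that did not fail with a server error (5xx)."""
    reqs = list(requests)
    if not reqs:
        return 1.0  # no traffic means nothing failed
    good = sum(1 for r in reqs if r.status < 500)
    return good / len(reqs)


def latency_sli(requests: Iterable[Request], threshold_ms: float = 500.0) -> float:
    """Share of requests answered without a server error and under the threshold.

    A fast error is not a good experience, so failed requests never count
    as good here, however quickly they returned.
    """
    reqs = list(requests)
    if not reqs:
        return 1.0
    good = sum(1 for r in reqs if r.status < 500 and r.duration_ms < threshold_ms)
    return good / len(reqs)


if __name__ == "__main__":
    sample = [
        Request(status=200, duration_ms=120),
        Request(status=200, duration_ms=480),
        Request(status=200, duration_ms=900),  # succeeded, but too slow
        Request(status=503, duration_ms=30),   # fast, but failed
    ]
    print(availability_sli(sample))
    print(latency_sli(sample))
```

Both indicators are ratios of good events to all events, so each can be compared directly against an SLO target such as 0.999.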

Use them

Do: Alert on SLO burn, that is, when the error budget is being spent fast enough that you are trending toward a breach, and not just on hard outages, so you act before users are badly affected (see Observability & Logging Hygiene).
Do: Review SLOs regularly and after incidents. Adjust targets, and feed breaches into improvement work (see Incident Readiness).
Avoid: Vanity targets nobody acts on, or so many SLOs that none get attention. A few meaningful ones beat a dashboard of ignored numbers.
Avoid: Treating an SLO as a way to blame people. It is a shared signal for setting priorities, in the blameless spirit (see Ownership & Accountability).
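
One common way to alert on burn rather than on outages is a burn-rate check, sketched below in Python. The function names are hypothetical. The 14.4 threshold is a widely cited starting point for a 30-day window (sustained for one hour, it spends 2% of the budget), but treat it as an assumption to tune, not a rule from this guideline.

```python
def burn_rate(bad: int, total: int, slo_target: float) -> float:
    """Observed error rate divided by the error rate the SLO allows.

    1.0 means the budget would last exactly the SLO window; 2.0 means it
    would be gone in half the window.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    if total == 0:
        return 0.0  # no traffic, so no budget was spent
    return (bad / total) / (1.0 - slo_target)


def hours_until_exhausted(rate: float, window_days: int = 30) -> float:
    """Hours a full budget would last if this burn rate were sustained."""
    if rate <= 0:
        return float("inf")
    return window_days * 24 / rate


def should_page(long_rate: float, short_rate: float, threshold: float = 14.4) -> bool:
    """Page only when both a long and a short window are burning fast.

    The long window (say, 1 hour) shows the burn is significant; the short
    window (say, 5 minutes) shows it is still happening right now.
    """
    return long_rate >= threshold and short_rate >= threshold


if __name__ == "__main__":
    # 2% of requests failing against a 99.9% target.
    rate = burn_rate(bad=20, total=1000, slo_target=0.999)
    print(rate, hours_until_exhausted(rate))
    print(should_page(long_rate=rate, short_rate=rate))
```

In this example a 2% error rate against a 99.9% target burns the budget about 20 times faster than allowed, which would exhaust a full 30-day budget in roughly a day and a half. That is worth a page even though the service is still serving 98% of requests.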

Self-review checklist

Ask: For the feature I am building, what reliability or latency target would users care about?
Ask: Does my SLI reflect the user's actual experience, not just a server metric?
Ask: Will we get alerted before we breach, not only after an outage?
Ask: Is the target realistic, and does breaching it actually change what we prioritise?

Why it matters: Without explicit targets, reliability is argued by opinion. It is then either neglected (until an outage) or over-invested (gold-plating at the expense of delivery). SLOs give an honest, measured basis for that trade-off, catch problems before customers feel them, and keep the conversation about reliability grounded in what users actually experience.