Operations

Service Levels & Reliability Targets (SLOs)

Advanced

You can only answer "is the service reliable enough?" if you have decided what "enough" means. Service-Level Objectives turn reliability from a vague feeling into a number: a target for availability or latency that you measure, alert on, and use to decide when to push features and when to shore up stability.

A few definitions, because they are easy to mix up. An SLI (Service-Level Indicator) is what you measure: for example, the percentage of requests served in under 500 ms, or the percentage of requests that succeed. An SLO (Service-Level Objective) is the target for that indicator: for example, 99.9% success over 30 days. The error budget is the shortfall the SLO allows, here the remaining 0.1%; once it is used up, reliability work takes priority over new features. An SLA (Service-Level Agreement) is a contractual promise to customers, and is usually looser than your internal SLO so that you notice a problem before you break the contract.
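To make the arithmetic behind these terms concrete, here is a minimal sketch in Python. The function names and the request counts are illustrative assumptions, not part of any particular monitoring tool; the only figures taken from the text are the 99.9% target and the 30-day window.

```python
def error_budget_fraction(slo_target: float) -> float:
    """Share of events the SLO allows to be bad, e.g. 0.999 -> about 0.001."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return 1.0 - slo_target


def allowed_bad_minutes(slo_target: float, window_days: int) -> float:
    """Error budget expressed as minutes of full outage in the window."""
    return error_budget_fraction(slo_target) * window_days * 24 * 60


def sli(good_events: int, total_events: int) -> float:
    """The indicator itself: the fraction of events that were good."""
    if total_events == 0:
        return 1.0  # assumption: no traffic counts as meeting the target
    return good_events / total_events


def budget_remaining(good_events: int, total_events: int, slo_target: float) -> float:
    """Fraction of the error budget still unspent; negative means the SLO is breached."""
    if total_events == 0:
        return 1.0
    allowed_bad = error_budget_fraction(slo_target) * total_events
    actual_bad = total_events - good_events
    return 1.0 - actual_bad / allowed_bad


if __name__ == "__main__":
    # 99.9% over 30 days leaves roughly 43.2 minutes of total outage.
    print(round(allowed_bad_minutes(0.999, 30), 1))
    # Hypothetical month: 1,000,000 requests, 400 of them failed.
    print(round(sli(999_600, 1_000_000), 4))
    print(round(budget_remaining(999_600, 1_000_000, 0.999), 2))
```

With these made-up counts, 400 failures against roughly 1,000 allowed leaves about 60% of the budget. A negative result from `budget_remaining` is the signal that reliability work should take priority.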

You do not need many SLOs. Pick the few user-facing things that matter (the API is up, requests are fast enough, critical flows like onboarding and payments work) and measure those. This complements Observability (how you measure) and Incident Readiness (what you do when you breach a target).
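A short list of SLOs can be kept as plain data and checked against measured counts. The sketch below is one possible shape in Python. The `Slo` class, the SLO names, the targets, and the counts are all hypothetical placeholders; only the kinds of SLO (availability, latency under 500 ms, a critical flow such as payments) come from the text above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Slo:
    name: str
    sli_description: str  # what is measured
    target: float         # required fraction of good events
    window_days: int      # rolling window the target applies to


# Hypothetical catalogue: a handful of user-facing objectives, not dozens.
SLOS = [
    Slo("api-availability", "share of API requests that succeed", 0.999, 30),
    Slo("api-latency", "share of API requests served in under 500 ms", 0.99, 30),
    Slo("payments-success", "share of payment attempts that complete", 0.999, 30),
]


def breached(slo: Slo, good: int, total: int) -> bool:
    """True when the measured SLI falls below the SLO target."""
    if total == 0:
        return False  # assumption: no traffic is not treated as a breach
    return good / total < slo.target


def breached_slos(counts: dict[str, tuple[int, int]]) -> list[str]:
    """Names of SLOs whose (good, total) counts miss the target."""
    return [
        slo.name
        for slo in SLOS
        if slo.name in counts and breached(slo, *counts[slo.name])
    ]


if __name__ == "__main__":
    # Made-up counts for one window, as (good, total) per SLO.
    measured = {
        "api-availability": (99_950, 100_000),
        "api-latency": (98_000, 100_000),
        "payments-success": (5_000, 5_000),
    }
    print(breached_slos(measured))
```

In practice the counts would come from whatever metrics system the service already uses. Keeping the definitions as data makes the list easy to review and keeps it short.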

Set targets that mean something

Use them

Self-review checklist

Why it matters: Without explicit targets, reliability is decided by opinion. It then ends up either neglected (until an outage) or over-invested in (gold-plating at the expense of delivery). SLOs give an honest, measured basis for that trade-off, catch problems before customers feel them, and keep the conversation about reliability grounded in what users actually experience.