Cloud & Infrastructure

Cost & Scale Planning

Advanced

Cloud cost is a result of your design. The architecture you choose decides the bill you pay. A system that scales well can also run up a huge bill fast if nobody planned for it. Treat cost and capacity as core engineering concerns. Size them to real demand, show them in dashboards and alerts, and avoid the surprise bills that come from autoscaling something inefficient.

Two problems cost us money. The first is quiet waste: oversized resources, idle environments left running, premium tiers nobody needs, and data egress and logging that slowly add up. The second is more dangerous: an architecture where cost grows faster than value. A traffic spike or a large tenant then turns into a bill nobody approved. Both come from treating cost as a finance problem after the fact, instead of an engineering decision made up front.

Scale planning is the other half of the same job. It means knowing how the system behaves as load and data grow, where the bottlenecks and cost limits are, and having enough capacity and budget for the demand we expect. The goal is not to be cheap. The goal is to be deliberate: every significant resource has an owner, a reason, a rough cost, and an alert if it goes wrong.

Design for cost and scale

Do: Estimate the cost and scaling behaviour of a design before you build it. Ask how spend grows with users, tenants, data, and traffic.
Do: Right-size resources to real, measured demand. Start small and scale up based on evidence, not on an imagined peak.
Do: Prefer elastic, consumption-based services that scale to zero or near zero when idle, especially for non-production and bursty workloads.
Do: Provision sizing through infrastructure as code (Pulumi) so capacity is reviewed, consistent, and easy to adjust per environment.
Consider: The whole cost picture, covering compute, storage, egress, logging and telemetry retention, premium tiers, and per-tenant cost. Do not look only at the obvious VM or database line.
Always: Set autoscaling upper bounds and resource quotas, so a spike or a runaway process scales within a known, affordable limit rather than without any limit.
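
One way to make the first and last rules above concrete is a small back-of-envelope model that answers "how does spend grow with users?" before anything is built. The sketch below is illustrative only: the unit prices, the per-user load and data figures, and the function names are all assumptions, not real provider rates or an existing tool. It also assumes the worst case, where the peak instance count runs all month.

```python
import math
from dataclasses import dataclass

HOURS_PER_MONTH = 730  # common approximation for one month of runtime


@dataclass(frozen=True)
class Prices:
    """Hypothetical unit prices; substitute your provider's real rates."""
    instance_hour: float = 0.10      # one instance running for one hour
    storage_gb_month: float = 0.02   # storing 1 GB for one month
    egress_gb: float = 0.08          # 1 GB of data leaving the cloud


def instances_needed(peak_rps: float, rps_per_instance: float,
                     min_instances: int, max_instances: int) -> int:
    """Instances required for the peak, clamped to the autoscale bounds."""
    wanted = math.ceil(peak_rps / rps_per_instance)
    return max(min_instances, min(wanted, max_instances))


def monthly_cost(users: int, prices: Prices = Prices(), *,
                 rps_per_1k_users: float = 5.0,     # assumed load per 1,000 users
                 rps_per_instance: float = 50.0,    # assumed, from load testing
                 gb_stored_per_user: float = 0.5,   # assumed
                 gb_egress_per_user: float = 1.0,   # assumed, per month
                 min_instances: int = 2, max_instances: int = 20) -> float:
    """Rough monthly cost for a user count: compute + storage + egress."""
    peak_rps = users / 1000 * rps_per_1k_users
    n = instances_needed(peak_rps, rps_per_instance, min_instances, max_instances)
    compute = n * HOURS_PER_MONTH * prices.instance_hour
    storage = users * gb_stored_per_user * prices.storage_gb_month
    egress = users * gb_egress_per_user * prices.egress_gb
    return round(compute + storage + egress, 2)


if __name__ == "__main__":
    for users in (10_000, 100_000, 1_000_000):
        print(f"{users:>9} users -> {monthly_cost(users):>10.2f} per month")
```

With these made-up numbers, egress rather than compute dominates the total, which is the kind of finding the "whole cost picture" rule is meant to surface. Note that the autoscale ceiling caps compute spend only. If the model wants more instances than `max_instances` allows, that is a capacity limit to plan for, not a saving.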

Make cost visible and govern it

Do: Tag resources (environment, service, owner, cost-centre) so spend can be attributed. Review cost regularly, not just when the invoice arrives.
Do: Set budgets and cost alerts per environment or subscription, so unusual spend is caught in hours, not at month end.
Do: Load-test and capacity-test critical paths with realistic volumes, so you know where the scale and cost limits are before customers find them (see Performance & Resource Use).
Consider: Reserved capacity or savings plans for stable, predictable production workloads, once you understand usage well.
Consider: A simple cost-per-tenant or cost-per-transaction view, so pricing and margin decisions are based on real unit economics.
Do not: Leave non-production environments, oversized instances, or forgotten resources running idle. Idle spend is pure waste.
Never: Deploy an autoscaling or consumption resource with no upper limit, budget, or alert. That is how a bug or a spike becomes a large bill overnight.
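
The tagging and budget rules above can be sketched in a few lines. The example below assumes a hypothetical export of resources as plain dictionaries with a `tags` map and a `cost_to_date` figure; that shape, the `UNATTRIBUTED` bucket, and the function names are illustrative, not any provider's billing API. The projection is a simple linear run rate, which is crude but enough to raise an alarm early in the month.

```python
from collections import defaultdict

# The four tags named in the guideline above.
REQUIRED_TAGS = ("environment", "service", "owner", "cost-centre")


def missing_tags(resource: dict) -> list[str]:
    """Return the required tags a resource lacks (empty values count as missing)."""
    tags = resource.get("tags", {})
    return [t for t in REQUIRED_TAGS if not tags.get(t)]


def spend_by_tag(resources: list[dict], tag: str) -> dict[str, float]:
    """Sum month-to-date cost per tag value; untagged spend is grouped separately."""
    totals: dict[str, float] = defaultdict(float)
    for r in resources:
        key = r.get("tags", {}).get(tag) or "UNATTRIBUTED"
        totals[key] += r["cost_to_date"]
    return dict(totals)


def projected_month_end(spend_to_date: float, day_of_month: int,
                        days_in_month: int) -> float:
    """Linear run-rate projection of month-end spend."""
    if not 1 <= day_of_month <= days_in_month:
        raise ValueError("day_of_month must fall within the month")
    return spend_to_date / day_of_month * days_in_month


def projected_over_budget(spend_to_date: float, day_of_month: int,
                          days_in_month: int, budget: float) -> bool:
    """True when the current run rate would exceed the budget by month end."""
    return projected_month_end(spend_to_date, day_of_month, days_in_month) > budget
```

A check like `projected_over_budget(300.0, 10, 30, budget=800.0)` flags the overspend on day 10 instead of at invoice time. A large `UNATTRIBUTED` total from `spend_by_tag` is itself a finding: spend nobody owns is spend nobody reviews.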

No limit, no guardrails

autoscale: min=2, max=∞   // no ceiling
no budget, no cost alert, logs retained forever
dev + test environments running 24/7 at prod size

A retry storm or a traffic spike scales out with no limit, logs pile up forever, and idle non-prod runs at full size all day and night. The first sign of trouble is the invoice.

Bounded, sized, watched

autoscale: min=2, max=20 (a known, affordable ceiling)
budget + cost alert per subscription; log retention capped
dev/test scale to zero off-hours; resources tagged by owner

Scale is elastic but bounded, spend is attributed and alerted, and idle environments cost nothing. You get capacity for real demand without the risk of a runaway bill.
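
The difference between the two examples above can be checked mechanically, for instance as a review step before deployment. The sketch below assumes a hypothetical configuration dictionary; the key names (`autoscale`, `monthly_budget`, `cost_alert`, `log_retention_days`, `scale_to_zero_off_hours`) are invented for illustration and do not correspond to Pulumi or any provider schema.

```python
def guardrail_violations(cfg: dict) -> list[str]:
    """List the cost guardrails a deployment config is missing."""
    problems: list[str] = []

    # Autoscaling must have a finite ceiling that is not below the floor.
    autoscale = cfg.get("autoscale", {})
    max_instances = autoscale.get("max")
    if max_instances is None:
        problems.append("autoscale has no upper bound")
    elif max_instances < autoscale.get("min", 0):
        problems.append("autoscale max is below min")

    # Spend must be budgeted and alerted.
    if cfg.get("monthly_budget") is None:
        problems.append("no budget set")
    if not cfg.get("cost_alert", False):
        problems.append("no cost alert")

    # Logs must not be retained forever.
    if cfg.get("log_retention_days") is None:
        problems.append("log retention is unbounded")

    # Non-production should not run at full size around the clock.
    if cfg.get("environment") != "prod" and not cfg.get("scale_to_zero_off_hours", False):
        problems.append("non-production environment never scales to zero")

    # Every resource needs an owner.
    if not cfg.get("tags", {}).get("owner"):
        problems.append("no owner tag")

    return problems


if __name__ == "__main__":
    unguarded = {"environment": "dev", "autoscale": {"min": 2, "max": None}}
    for problem in guardrail_violations(unguarded):
        print("VIOLATION:", problem)
```

A config shaped like the second example (max of 20, a budget and alert, capped retention, off-hours scale to zero, an owner tag) returns an empty list. The first example fails every check.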

Self-review checklist

Ask: How does this design's cost grow as users, tenants, data, and traffic grow?
Ask: Is anything autoscaling or consumption-based without an upper bound, budget, or alert?
Ask: Is this resource right-sized for measured demand, tagged, and owned, or guessed and forgotten?
Ask: Do we know where the scale and cost cliffs are from realistic testing?

Why it matters: Cloud spend is one of our largest controllable costs. Left uncontrolled, it threatens both margin and runway. Designing for cost and scale (right-sized, bounded, tagged, and alerted) means we pay for the value we deliver, absorb growth without nasty surprises, and can say what the platform costs to run today and what it will cost at 10x scale. That discipline is the difference between scaling profitably and scaling into a crisis.