Rolling Out
Why roll out
Donald Rumsfeld’s three kinds of knowledge map neatly onto shipping software:
- Known knowns — you know what happens, you handled it, done.
- Known unknowns — you know the question but not the answer. “Will it hold at 10k requests a second?” “Will users actually click this?” You can’t settle these at your desk.
- Unknown unknowns — you don’t even know the question. The case nobody pictured, the user who does something nobody designed for.
Only the first one is done before you ship. The other two need real traffic to answer. A rollout is how you get that answer — a slice at a time, while a mistake is still cheap to undo.
Four things can go wrong
Four, and they’re independent. Each can be the only thing wrong while the rest are fine. Each needs its own signal.
- Did we build the right thing? Real users hit inputs and states you never modeled — the empty account, the 10-year-old record, the workflow nobody designed for. Tests can’t catch these; they only assert the behavior you already thought to specify.
- Did we build it right? You specified the behavior correctly; the code does something else. A plain bug. Tests catch these — but only on the paths you remembered to assert.
- Will it hold up under load? Correct code can still fall over. Cold caches, drained connection pools, a leak that kills the box after two days, p99 latency tripling. Only real traffic shows it.
- Do people want it? There is no correct answer to hit, only revealed preference. You can ship it bug-free, fast, exactly as designed, and engagement still doesn’t move. Only real users settle it, and a flat metric here isn’t a bug to fix: it’s the design being wrong.
Fixing one tells you nothing about the others. None can be fully settled at your desk.
What a rollout depends on
A rollout is: ramp a slice, watch, decide, act. Each verb needs something underneath it:
- A knob — splits traffic old vs new (flag, percentage, routing). No knob, no slice.
- A signal — a metric that moves when it breaks, and the right one: errors and latency for bugs and load, engagement for whether people want it. You can’t act on what you can’t see.
- A threshold, set before you ramp — “if errors pass X, roll back.” Decide it up front or you’ll explain the bad number away in the moment.
- Time — leaks, cron jobs, and piled-up data only show after hours. A slice held 30 seconds catches nothing.
- A way back — a flag undoes; a sent email or dropped column doesn’t. Irreversible changes (migrations, deletes) need a different playbook: expand/contract, dual-write, backfill.
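As a rough illustration of the knob, here is a minimal sketch of a percentage flag that buckets users by hash. The function name `in_rollout`, the flag name `new_checkout`, and the 0.01% bucket resolution are assumptions made for the example, not something the text prescribes. The useful property is that a user who is inside the slice at 1% stays inside at 10%, so ramping up only adds people.

```python
import hashlib


def in_rollout(user_id: str, flag_name: str, percent: float) -> bool:
    """Return True if this user falls inside the ramped slice.

    percent is 0..100. The bucket depends only on (flag_name, user_id),
    so the answer is stable across requests and across servers.
    """
    # Hash the flag name together with the user id so that different
    # flags slice the user base independently of each other.
    digest = hashlib.sha256(f"{flag_name}:{user_id}".encode()).digest()
    # Map the first 8 bytes to one of 10,000 buckets (0.01% resolution).
    bucket = int.from_bytes(digest[:8], "big") % 10_000
    # Users whose bucket is below the cutoff are in the slice.
    cutoff = round(percent * 100)
    return bucket < cutoff


if __name__ == "__main__":
    # Hypothetical flag name; turning percent down to 0 is the way back.
    print(in_rollout("user-42", "new_checkout", 10))
```

Setting `percent` to 0 turns the new path off for everyone, which is the reversible kind of way back. It does nothing for side effects that have already escaped, such as sent emails.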
How fast you turn the knob is set by one thing: keeping cause and effect readable. At 1%, one thing breaks and you know why; at 100%, five things move at once and you’re guessing. Ramp slowly enough that you can still tell which change caused which symptom.
And the limit even when you do all this: a small slice won’t surface rare bugs or failures that only exist at full load — a cache stampede may not appear at 1% at all. A clean canary is weak evidence, not proof.
What’s a good ramp
Don’t memorize a sequence. Two forces set every step.
How big a jump? Each step multiplies blast radius. 1% → 10% means a bug now hits 10× the people before you catch it. So the jump is bounded by how much damage you can survive at that stage. Start tiny — a bug at 1% is nothing — and take bigger jumps later, once the early steps have already cleared most failure modes.
How long to hold? As long as your slowest signal takes to appear. A memory leak that shows after six hours makes a ten-minute hold meaningless. And the step needs enough traffic for the signal to be real — 1% of a quiet app might be three requests, which tells you nothing.
So a good ramp is the fewest steps where each one stays survivable and is held long enough, with enough traffic, to clear the failures that step is meant to catch. Faster, you’re blind. Slower, you’re paying in flag-debt and calendar time for no new information.
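The "how long to hold" rule can be sketched as taking the larger of two numbers: the delay of your slowest signal, and the time the slice needs to collect enough requests to mean anything. The function `hold_hours` and the `min_requests` parameter are hypothetical, and how many requests count as "enough" is something you have to pick for your own metrics.

```python
import math


def hold_hours(
    slowest_signal_hours: float,
    min_requests: int,
    requests_per_hour: float,
    percent: float,
) -> float:
    """Minimum hold for one ramp step, in hours.

    The step must outlast the slowest failure you are trying to catch
    AND see at least min_requests requests in the slice.
    """
    # Traffic that actually reaches the new code at this step.
    slice_rate = requests_per_hour * percent / 100
    if slice_rate <= 0:
        return math.inf  # no traffic, no signal, however long you wait
    hours_for_traffic = min_requests / slice_rate
    return max(slowest_signal_hours, hours_for_traffic)


if __name__ == "__main__":
    # Busy app: the six-hour leak dominates the hold.
    print(hold_hours(6, 1000, 100_000, 1))
    # Quiet app at 1%: traffic dominates, and the hold runs to weeks.
    print(hold_hours(6, 1000, 300, 1))
    # Same quiet app at 20%: still traffic-bound, but far shorter.
    print(hold_hours(6, 1000, 300, 20))
```

The quiet-app case is the reason a low-traffic service may be better off starting at a larger slice: the hold at 1% is too long to be practical.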
Each rung catches a different unknown — that’s what you watch as you climb:
| Stage | Catches | Watch |
|---|---|---|
| Internal / dogfood | wrong design, crashes | does it work at all; “this is the wrong shape” |
| 1% | correctness on real data | error rate, exceptions |
| ~10% | load, emergent behavior | p99 latency, saturation, cache hit rate |
| ~50% | preference, scale | engagement/conversion vs the held-back half |
| 100% | — | done — now decide the flag’s fate |
The numbers are an example, not a law. A high-traffic app might start at 0.1%; a low-traffic one might jump straight to 20% to get a readable signal. Derive yours from the two forces.
Notice that each row of the table isn’t just “more users”; each stage finds a different kind of problem:
- Common bugs get caught instantly. If something breaks for one in twenty requests, your 1% slice hits it almost right away. The frequent stuff is cheap to find and you find it early. So the first few steps teach you the most about correctness; by 50% you’re mostly confirming the code works, not discovering new bugs.
- The big failures only show up with a crowd. A stampede on a cache or a database buckling under load may not exist at 1% at all. They need enough concurrent traffic to happen in the first place, so no amount of waiting at 1% will find them.
That’s the trap: the late steps feel boring because little new shows up — but they’re exactly where the worst, crowd-only failures hide. You go slow at the top not because you expect to learn a lot, but because the rare thing you might learn is the one that takes you down.
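The "common bugs get caught instantly" claim is simple arithmetic if you assume each request fails independently with a fixed probability. That independence is an assumption of this sketch, and the function name `p_detect` is made up for it.

```python
def p_detect(failure_rate: float, requests: int) -> float:
    """Chance of seeing at least one failure in `requests` requests,
    assuming each request fails independently with `failure_rate`."""
    return 1 - (1 - failure_rate) ** requests


if __name__ == "__main__":
    # A one-in-twenty bug after only 100 requests in the slice: about 0.994.
    print(round(p_detect(0.05, 100), 3))
    # A one-in-a-million bug after the same 100 requests: about 0.0001.
    print(round(p_detect(1e-6, 100), 4))
```

The formula only covers failures that exist at a fixed per-request rate. A crowd-only failure has an effective rate of zero in a small slice, so `p_detect` stays at zero however many requests you wait for.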
At 100%, the flag is either trash or a tool — decide which. If it only existed to ramp this change, delete it: a stale flag is a second code path nobody tests, and one day someone flips a three-year-old switch into an off-branch that’s been broken for two years. But if you have a real reason to keep the knob — perf, legal, a dependency you might need to cut — keep it on purpose, and treat it as live code you keep tested. The mistake isn’t keeping flags; it’s keeping them by accident.
How to actually run one
Before you turn the knob — all four, or don’t start:
- The flag works both ways. Test the off-switch before you need it, not during the incident.
- The metric is live and you can see it per-slice (the 1% group separately, not blended into the whole).
- The threshold is written down. “Roll back if errors > 1% or p99 > 400ms.” A number, decided now, while you’re calm.
- You know what normal looks like. You can’t spot abnormal without it.
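To see why the metric has to be visible per-slice, here is a small sketch with invented numbers. The slice names `control` and `treatment` and the event shape `(slice_name, is_error)` are assumptions for the example.

```python
from collections import defaultdict
from typing import Iterable


def error_rate_by_slice(events: Iterable[tuple[str, bool]]) -> dict[str, float]:
    """Compute the error rate separately for each slice.

    events yields (slice_name, is_error) pairs, one per request.
    """
    totals: dict[str, int] = defaultdict(int)
    errors: dict[str, int] = defaultdict(int)
    for slice_name, is_error in events:
        totals[slice_name] += 1
        if is_error:
            errors[slice_name] += 1
    return {name: errors[name] / totals[name] for name in totals}


if __name__ == "__main__":
    # Made-up traffic: 990 requests on the old path with 5 errors,
    # 10 requests on the new path with 3 errors.
    events = [("control", i < 5) for i in range(990)]
    events += [("treatment", i < 3) for i in range(10)]

    blended = sum(is_error for _, is_error in events) / len(events)
    print(blended)                      # 0.008, under a 1% threshold
    print(error_rate_by_slice(events))  # treatment alone is at 0.3
```

Blended, the new path's failures disappear into the healthy majority. Split by slice, the same data is an obvious rollback. With only ten requests the treatment number is also very noisy, which is the traffic problem from the ramp section showing up again.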
At each step, one of three moves:
- Advance — metrics within threshold and you’ve held a full cycle of your slowest signal. Go to the next step.
- Hold — a metric is degraded but still inside the threshold. Stay put, gather more samples. Don’t advance on an unresolved trend.
- Roll back — threshold breached. Pull it now.
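The three moves can be written down as one small function, which is a reasonable way to make sure the threshold really was decided in advance. This is a sketch: the default limits reuse the example numbers from the checklist (errors > 1%, p99 > 400ms), while `degraded_factor`, the baseline parameters, and all names are assumptions of the example.

```python
from dataclasses import dataclass
from enum import Enum


class Move(Enum):
    ADVANCE = "advance"
    HOLD = "hold"
    ROLL_BACK = "roll back"


@dataclass(frozen=True)
class Thresholds:
    max_error_rate: float = 0.01  # "roll back if errors > 1%"
    max_p99_ms: float = 400.0     # "... or p99 > 400ms"


def decide(
    error_rate: float,
    p99_ms: float,
    baseline_error_rate: float,
    baseline_p99_ms: float,
    hold_complete: bool,
    limits: Thresholds = Thresholds(),
    degraded_factor: float = 1.5,  # assumed meaning of "degraded"
) -> Move:
    """Pick the next move for the current ramp step."""
    # Threshold breached: pull it now, diagnose later.
    if error_rate > limits.max_error_rate or p99_ms > limits.max_p99_ms:
        return Move.ROLL_BACK
    # Inside the threshold but noticeably worse than normal: stay put.
    degraded = (
        error_rate > baseline_error_rate * degraded_factor
        or p99_ms > baseline_p99_ms * degraded_factor
    )
    if degraded or not hold_complete:
        return Move.HOLD
    # Healthy, and held for a full cycle of the slowest signal.
    return Move.ADVANCE


if __name__ == "__main__":
    print(decide(0.004, 200, 0.002, 180, hold_complete=True))
```

The rollback check comes first on purpose, so no argument about trends or incomplete holds can delay it.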
When it breaches: roll back first, diagnose after. The instinct is to investigate while it’s live — resist it. Pull the knob, halt the impact, then read the logs without users in the blast radius. The rollback is reversible; the errors you serve while debugging are not.
Signal or noise? A small slice has small samples, so the variance is high — one spike at 1% is usually nothing, don’t panic-revert on it. But a trend — successive readings climbing, or a metric that stays elevated across a full hold — is real, and “it’s probably noise” is exactly the rationalization you’ll reach for to avoid rolling back. Single spike: wait. Sustained trend: act.
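One way to keep "it’s probably noise" honest is to define a sustained trend before you look at the graph. The sketch below is one possible definition, not a statistical test: the name `is_sustained`, the `normal_max` level, and the default run length of three readings are all assumptions.

```python
def is_sustained(readings: list[float], normal_max: float, min_run: int = 3) -> bool:
    """True if the most recent readings look like a trend, not a spike.

    A trend here means the last `min_run` readings are either all above
    the normal level, or each one is higher than the one before it.
    """
    if len(readings) < min_run:
        return False  # not enough data to call anything a trend
    tail = readings[-min_run:]
    elevated = all(r > normal_max for r in tail)
    climbing = all(later > earlier for earlier, later in zip(tail, tail[1:]))
    return elevated or climbing


if __name__ == "__main__":
    # Error rate in percent, one reading per interval; normal is under 0.3.
    print(is_sustained([0.2, 0.2, 0.9, 0.2, 0.2], normal_max=0.3))  # single spike: wait
    print(is_sustained([0.2, 0.4, 0.5, 0.45], normal_max=0.3))      # stays elevated: act
    print(is_sustained([0.1, 0.15, 0.2, 0.25], normal_max=0.3))     # climbing: act
```

This sits alongside the rollback threshold rather than replacing it. A single reading that breaches the written-down threshold is still a rollback.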
What it gives you back
The knob doesn’t disappear after launch. It stays, and the rest of how you ship rests on it:
- Turn it down when reality demands — incident, load spike, cost, bad dependency. Recovery becomes “pull the knob,” not “hotfix under pressure.”
- Experiment — A/B tests and holdback groups are the same knob, pointed at the “do people want it” question.
- Ship small and often — if every change ramps safely, you deploy many small changes instead of rare big ones.
- Deploy without releasing — code can be live but off, so merging doesn’t mean exposing.
A rollout looks like a launch-day tactic. It’s actually the thing the whole shipping loop sits on top of.