A practical playbook for moving production data without taking the service offline. The four steps, the rehearsal protocol, and the metrics that prove it worked.
"Near-zero downtime" is one of those phrases that means everything and nothing. The promise sells, but the operational cost is real. This post is the thing we hand new engineers on day one — the actual checklist, with the parts we used to skip.
The four steps
Every migration we run follows the same shape. The shape is boring on purpose: boring is what lets the rare interesting bit get the attention it deserves.
- Schema first, data later. Add columns, indexes, and constraints in a separate change. Roll out, observe, then start writing.
- Dual-write. Old code keeps writing the old way. New code writes both. Read from old until you trust the new.
- Backfill in batches. Cap by row count and lock duration; pause if replication lag spikes.
- Cut over. Flip reads to the new column, keep dual-writing for one full release cycle, then drop.
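The dual-write step above can be sketched as a thin wrapper around the two write paths. This is a hypothetical sketch, not our production code: `write_old`, `write_new`, the dict-backed `store`, and the `READ_FROM_NEW` flag are all illustrative names. The one rule it encodes is the important one — a failure on the new path must never fail the request while the old path is still authoritative.

```python
import logging

log = logging.getLogger("migration")

READ_FROM_NEW = False  # flipped only at cutover, after the backfill verifies


def write_old(store, order):
    # Legacy path: stays authoritative until cutover.
    store["old"][order["id"]] = order["country_code"]


def write_new(store, order):
    # New path: written alongside, trusted only once reads are flipped.
    store["new"][order["id"]] = order["region_id"]


def save_order(store, order):
    """Dual-write: old path first; a new-path failure is logged, never raised."""
    write_old(store, order)
    try:
        write_new(store, order)
    except Exception:
        log.exception("dual-write to new path failed; old path unaffected")


def read_order(store, order_id):
    # Read from old until the new path has earned trust.
    return store["new" if READ_FROM_NEW else "old"].get(order_id)
```

Keeping the flag separate from the write logic is what makes "keep dual-writing for one full release cycle, then drop" a one-line change rather than a refactor.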
If a migration plan does not have an answer to "what do we do when we’re halfway and something looks off?", it is not a plan, it is a wish.
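The batch-and-pause rule from step 3 — cap each batch, pause on replication lag — can be sketched as a driver loop. The callables `run_batch` and `replication_lag_s` are assumptions standing in for your driver's calls (e.g. executing one capped `UPDATE` and reading lag from your replication monitoring); the thresholds come from the checklist.

```python
import time

BATCH_ROWS = 5000   # row-count cap per batch
MAX_LAG_S = 30      # pause threshold for replication lag


def backfill(run_batch, replication_lag_s, sleep=time.sleep):
    """Run fixed-size batches until one comes back empty.

    run_batch(n) executes one capped UPDATE and returns the rows it touched;
    replication_lag_s() returns current lag in seconds. On a lag spike we
    pause and re-check — we do not abort, and we do not plow through.
    """
    total = 0
    while True:
        while replication_lag_s() > MAX_LAG_S:
            sleep(5)  # lag spike: wait for it to drain, then resume
        updated = run_batch(BATCH_ROWS)
        total += updated
        if updated == 0:
            return total  # nothing left to backfill
```

The empty-batch exit doubles as the "halfway and something looks off" checkpoint: because the loop is resumable from any point, pausing it costs nothing.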
The rehearsal protocol
We rehearse every cutover at least once against a production-shaped clone. The clone has the same row counts (within ±5 %), the same indexes, and the same replication topology. It does not have the same hardware — and that is fine, because what we are timing is the shape of the work, not the absolute duration.
-- Step 3 of 4: backfill (MySQL-style batch cap; Postgres has no
-- UPDATE ... LIMIT, so batch there by primary-key range instead)
UPDATE orders
SET region_id = lookup_region(country_code)
WHERE region_id IS NULL
LIMIT 5000;

What "it worked" actually means
Three signals, in order of how seriously we treat them:
- Application error rate stays within 0.5σ of the seven-day baseline.
- Replication lag never crosses 30 s for longer than two consecutive minutes.
- The post-migration row-count assertion matches the pre-migration row count exactly. No fuzz.
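All three signals can be checked mechanically. A sketch, assuming minute-granularity lag samples and a seven-day series of baseline error rates as inputs — the function names are hypothetical, but the thresholds are the ones above.

```python
from statistics import mean, stdev


def error_rate_ok(current, baseline_rates):
    """Signal 1: current error rate within 0.5 sigma of the 7-day baseline."""
    mu, sigma = mean(baseline_rates), stdev(baseline_rates)
    return abs(current - mu) <= 0.5 * sigma


def lag_ok(lag_samples_s, sample_period_s=60):
    """Signal 2: lag never stays above 30 s for longer than two minutes.

    Tracks how long each over-threshold run has lasted; any run longer
    than 120 s fails the check.
    """
    over_for = 0
    for lag in lag_samples_s:
        over_for = over_for + sample_period_s if lag > 30 else 0
        if over_for > 120:
            return False
    return True


def row_counts_ok(before, after):
    """Signal 3: exact match. No fuzz."""
    return before == after
```

The ordering matters: error rate is checked continuously during the cutover, lag is checked during the backfill, and the row-count assertion is the final gate before the old column is dropped.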
If any of those slip, we pause and investigate. We do not roll forward to "see if it settles." It does not settle.
What we still get wrong
Rehearsals catch about 80 % of issues. The remaining 20 % is almost always the same shape: an integration we did not know about, writing through an unexpected path. The fix is not technical — it is asking, every quarter, "what writes to this table that I would not guess from reading the code?"