Rollback Plan Checklist
Runbook a release captain or on-call engineer follows to safely roll back a production deployment, from pre-release backup verification through post-incident review. Use during a deploy that's tripping error-rate or latency thresholds beyond the rollback budget.
Pre-Release Preparation
-
Capture the current production SHA and image tag
Record the currently deployed Git SHA, container image tag, and Helm chart version so the rollback target is unambiguous. Confirm the previous image is still present in ECR/GCR and has not been pruned by registry lifecycle rules, a common gotcha when the rollback target is more than 30 days old.
-
Confirm the migration is reversible
Review any DB migrations in this release for reversibility. Destructive or hard-to-undo operations (DROP COLUMN, DROP TABLE, NOT NULL backfills) cannot be safely reversed by redeploying the prior image, because dropped data does not come back. Flag these for expand/contract treatment. If the migration is one-way, the rollback plan must rely on forward-fix, not down-migration.
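The review above can be partly automated with a text scan over the migration files. The sketch below is a minimal example under stated assumptions: the pattern list, the `flag_irreversible` name, and the PostgreSQL-style `SET NOT NULL` syntax are illustrative and not part of any particular migration tool. It does not strip SQL comments, so treat a hit as a prompt for human review rather than a verdict.

```python
import re

# Patterns for the statement types the checklist calls out as unsafe to reverse.
# Illustrative only: extend this for your own SQL dialect and migration style.
RISKY_PATTERNS = {
    "DROP TABLE": re.compile(r"\bDROP\s+TABLE\b", re.IGNORECASE),
    "DROP COLUMN": re.compile(r"\bDROP\s+COLUMN\b", re.IGNORECASE),
    "SET NOT NULL": re.compile(r"\bSET\s+NOT\s+NULL\b", re.IGNORECASE),
}


def flag_irreversible(sql: str) -> list[str]:
    """Return the labels of risky operations found in a migration's SQL text."""
    return [label for label, pattern in RISKY_PATTERNS.items() if pattern.search(sql)]
```

An empty result means none of the listed patterns matched; it does not prove the migration is reversible.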
-
Assign release captain and rollback IC
Name the release captain (drives the deploy), the incident commander if rollback fires, and the comms lead who owns the status page and #incidents updates. Both primary and secondary on-call should be reachable for the duration of the deploy window.
-
Verify rollback runbook is current
Walk the runbook in Confluence/Notion against the current infra. Common drift: kubectl context names changed, the ArgoCD app was renamed, the feature-flag kill switch moved from LaunchDarkly to a self-hosted flag service. A runbook last touched 14 months ago is presumed stale.
Backup and Restore Validation
-
Trigger a pre-deploy RDS snapshot
Take a manual RDS / Cloud SQL snapshot tagged with the release candidate version. Don't rely on the nightly automated snapshot — it may be 16 hours stale by deploy time. Note the snapshot identifier in the deploy ticket.
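As a rough sketch of this step for RDS, the helper below derives a snapshot identifier from the release candidate version and requests a manual snapshot through a boto3 RDS client. The function names, the `predeploy` prefix, and the `release` tag key are assumptions chosen for illustration. The client is passed in (for example `boto3.client("rds")`) so the naming logic can be checked without AWS access.

```python
import re


def snapshot_identifier(release_version: str, prefix: str = "predeploy") -> str:
    """Build an RDS-safe snapshot identifier from a release version.

    RDS identifiers allow only letters, digits, and hyphens, must start with a
    letter, and cannot end with a hyphen or contain two hyphens in a row.
    The default prefix starts with a letter; a custom prefix must do the same.
    """
    raw = f"{prefix}-{release_version}".lower()
    cleaned = re.sub(r"[^a-z0-9]+", "-", raw)  # dots, underscores, '+' collapse to one hyphen
    return cleaned.strip("-")


def take_predeploy_snapshot(rds_client, db_instance: str, release_version: str) -> str:
    """Request a manual snapshot and return its identifier for the deploy ticket."""
    identifier = snapshot_identifier(release_version)
    rds_client.create_db_snapshot(
        DBSnapshotIdentifier=identifier,
        DBInstanceIdentifier=db_instance,
        Tags=[{"Key": "release", "Value": release_version}],
    )
    return identifier
```

The call returns as soon as the snapshot is requested, so wait for it to reach the available state before starting the deploy.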
-
Restore the snapshot to a scratch instance
Restore into a non-prod instance and run a smoke query against a known-row-count table. A backup that has never been restored is not a backup. Common failure: the restore script depends on a credential that rotated last quarter.
-
Confirm prior container image is in the registry
Pull the prior production image tag locally to verify it exists and is not garbage-collected. If your registry has a lifecycle policy that prunes untagged images after 14 days, confirm the rollback target is still tagged.
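For an ECR registry, a lighter-weight companion to the local pull is to ask the registry whether the tag still resolves. The sketch below assumes a boto3 ECR client is passed in (for example `boto3.client("ecr")`); the `image_tag_exists` name is hypothetical. It checks metadata only, so an actual pull remains the stronger test.

```python
def image_tag_exists(ecr_client, repository: str, tag: str) -> bool:
    """Return True if the tag still resolves to an image in the ECR repository."""
    try:
        response = ecr_client.describe_images(
            repositoryName=repository,
            imageIds=[{"imageTag": tag}],
        )
    except ecr_client.exceptions.ImageNotFoundException:
        # The tag was removed or the image was pruned by a lifecycle policy.
        return False
    return len(response.get("imageDetails", [])) > 0
```

A `False` result before the deploy means the rollback target has to be rebuilt or re-pushed first.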
Monitoring and Trigger Criteria
-
Open the release dashboard in Datadog
Pin the four golden signals: request rate, error rate, p95 / p99 latency, and saturation (CPU, memory, DB connections). Overlay the deploy marker on the dashboard so the canary's effect is visually unambiguous.
-
Define the rollback trigger thresholds
Pre-commit to numeric criteria: e.g., error rate > 2% sustained 5 min, p99 latency > 1.5x baseline for 10 min, or any SEV2 customer report tied to the release. Decide before deploy, not during — debating thresholds at 3am is a known anti-pattern.
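Writing the criteria down as code is one way to make the pre-commitment concrete. The sketch below encodes the two numeric examples from this step (error rate above 2% for 5 minutes, p99 above 1.5x baseline for 10 minutes) and assumes one reading per minute. The `Sample` type and function names are hypothetical, and the SEV2 customer-report trigger is left to a human.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """One per-minute reading from the release dashboard."""
    error_rate: float  # fraction of failed requests, e.g. 0.02 for 2%
    p99_ms: float      # p99 latency in milliseconds


def sustained(flags: list[bool], minutes: int) -> bool:
    """True if the most recent `minutes` readings all breached the threshold."""
    return len(flags) >= minutes and all(flags[-minutes:])


def should_roll_back(samples: list[Sample], baseline_p99_ms: float) -> bool:
    """Apply the example criteria to a list of per-minute samples, oldest first."""
    error_breach = [s.error_rate > 0.02 for s in samples]
    latency_breach = [s.p99_ms > 1.5 * baseline_p99_ms for s in samples]
    return sustained(error_breach, 5) or sustained(latency_breach, 10)
```

A brief spike that recovers within the window does not fire the trigger, which matches the "sustained" wording in the criteria.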
-
Page the on-call rotation in PagerDuty
Register the deploy window in PagerDuty so the on-call rotation knows a deploy is the likely cause of any page in the next 60 minutes. Post in #incidents and #engineering with the release tag, captain, and rollback contact.
Rollback Decision
-
Evaluate canary signals at 5% traffic
Watch the four golden signals for 10 minutes against the canary cohort. Compare error rate and p99 against the baseline the dashboard shows for the prior release. If signals are clean, proceed; if degraded beyond the trigger thresholds, hold the rollout and call the rollback decision.
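The proceed-or-hold call at the end of the observation window can be sketched as a small comparison of canary aggregates against the prior release. Everything named here is an assumption for illustration: the `CohortStats` type, the default thresholds (borrowed from the example trigger criteria of 2% errors and 1.5x p99), and the `min_requests` guard, which exists because a canary cohort with very little traffic cannot show a regression either way.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CohortStats:
    """Aggregates for one cohort over the observation window."""
    requests: int
    errors: int
    p99_ms: float

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0


def canary_verdict(
    canary: CohortStats,
    baseline: CohortStats,
    max_error_rate: float = 0.02,
    max_p99_ratio: float = 1.5,
    min_requests: int = 500,
) -> str:
    """Return 'proceed', 'hold', or 'insufficient-data' for the canary cohort."""
    if canary.requests < min_requests:
        return "insufficient-data"  # too little traffic to judge either way
    if canary.error_rate > max_error_rate:
        return "hold"
    if canary.p99_ms > max_p99_ratio * baseline.p99_ms:
        return "hold"
    return "proceed"
```

A "hold" result only pauses the rollout; the rollback decision itself still belongs to the release captain and IC.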
Rollback Execution
-
Flip the kill switch feature flag
If the release shipped behind a kill-switch flag, flip it off first — it's the fastest path to mitigation and avoids a full redeploy. Only proceed to image revert if the flag doesn't fully gate the bad code path.
-
Drain canary traffic back to the prior version
Shift the load balancer / service mesh weights so the canary receives 0% of traffic and the prior version receives 100%. Watch the error rate trend down as traffic drains. If using Argo Rollouts, abort the rollout rather than promoting it.
-
Redeploy the prior container image
Run the rollback pipeline pointing at the prior image tag captured in pre-release. For Kubernetes, run kubectl rollout undo on each affected Deployment, or set the chart values back and re-sync ArgoCD. Confirm pod readiness gates pass before declaring the rollback complete.
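For the Kubernetes path, the per-Deployment undo and the readiness wait can be scripted so nothing is skipped under pressure. The sketch below builds the kubectl command lines and runs them in order; the function names, the namespace argument, and the 300-second timeout are assumptions. Note that kubectl rollout undo returns to the previous revision, which matches the pre-release image only if no other rollout happened in between, so check the running image tag afterwards.

```python
import subprocess


def rollback_commands(deployments: list[str], namespace: str, timeout: str = "300s") -> list[list[str]]:
    """Build argv lists: undo each Deployment, then wait for its rollout to finish."""
    commands = []
    for name in deployments:
        target = f"deployment/{name}"
        commands.append(["kubectl", "rollout", "undo", target, "--namespace", namespace])
        commands.append(
            ["kubectl", "rollout", "status", target, "--namespace", namespace, f"--timeout={timeout}"]
        )
    return commands


def run_all(commands: list[list[str]]) -> None:
    """Run each command in order and stop at the first failure."""
    for argv in commands:
        subprocess.run(argv, check=True)  # raises CalledProcessError on non-zero exit
```

Building the commands separately from running them makes it easy to print and review them in the incident channel before execution.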
-
Reverse the database migration if safe
Only run the down migration if the pre-release review confirmed it is reversible. For one-way migrations, the prior image must be forward-compatible with the new schema; if it is not, restore from the pre-deploy snapshot, but only with the engineering director's explicit approval, because the restore discards every write made after the snapshot was taken.
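The branching in this step is easy to get wrong at 3am, so it can help to see it as a single decision function. This is a minimal sketch of the logic as described above; the enum and parameter names are hypothetical, and the three inputs are answers a human supplies, not something the code discovers.

```python
from enum import Enum


class SchemaAction(Enum):
    RUN_DOWN_MIGRATION = "run the down migration"
    KEEP_NEW_SCHEMA = "keep the new schema; prior image runs against it"
    RESTORE_SNAPSHOT = "restore the pre-deploy snapshot (loses later writes)"
    NEEDS_APPROVAL = "stop and get director approval or forward-fix"


def schema_rollback_action(
    reversible: bool,
    prior_image_forward_compatible: bool,
    director_approved: bool,
) -> SchemaAction:
    """Pick the schema action for a rollback, in the order the checklist gives."""
    if reversible:
        return SchemaAction.RUN_DOWN_MIGRATION
    if prior_image_forward_compatible:
        return SchemaAction.KEEP_NEW_SCHEMA
    if director_approved:
        return SchemaAction.RESTORE_SNAPSHOT
    return SchemaAction.NEEDS_APPROVAL
```

The snapshot restore is deliberately the last branch and is unreachable without explicit approval.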
-
Run smoke tests against production
Run the synthetic user-path suite (login, core workflow, checkout) against production. Confirm error rate and p99 latency have returned to the pre-deploy baseline on the dashboard.
Communication and Documentation
-
Post status page update for affected customers
Update Statuspage / Better Stack with the user-visible impact, time window, and current state (investigating → identified → monitoring → resolved). Notify customer support so inbound tickets are routed correctly.
-
File the incident ticket in Jira
Open an incident ticket linked to the release ticket. Capture the timeline, who paged whom, the rollback steps executed, and the contributing PRs. This is the source for the PIR document and any SOC 2 change-management evidence.
Post-Incident Review
-
Schedule a blameless PIR within five business days
Invite the release captain, on-call, IC, comms lead, and at least one engineer from outside the team. Keep the framing blameless: focus on contributing factors (alert tuning, missing canary check, untested migration), not individuals. First-shift narratives are usually wrong; the PIR is where the real causes surface.
-
Identify contributing factors with the 5-whys method
Walk the timeline backwards. Look for second-order causes: was the canary cohort too small to surface the issue? Did the dashboard hide the regression? Was the test that should have caught this missing or flaky-and-ignored? Capture each contributing factor as a discrete action item with a named owner.
-
File action items in Jira with named owners
Each action item gets an owner, a target sprint, and a link back to the PIR. Common gap: action items filed but never tracked to closure — six months later the same incident recurs. The engineering manager owns chasing these to done.
-
Update the rollback runbook with lessons learned
Fold any drift discovered during the rollback (renamed contexts, missing kill switches, broken restore scripts) into the runbook. Note the date and PIR ticket so the next reader knows the runbook was exercised, not just edited from a desk.