Rollback Plan Checklist

A runbook that a release captain or on-call engineer follows to roll back a production deployment safely, from pre-release backup verification through post-incident review. Use it when a deploy trips error-rate or latency thresholds beyond the rollback budget.

7 sections 22 steps Collects data

Pre-Release Preparation

Capture the current production SHA and image tag
- Record the currently deployed Git SHA, container image tag, and Helm chart version so the rollback target is unambiguous. Confirm the previous image is still present in ECR/GCR and has not been pruned by registry lifecycle rules, a common gotcha when the rollback target is more than 30 days old.
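One way to keep the rollback target unambiguous is to store the three values as a single validated record. The sketch below is illustrative only: the `RollbackTarget` class, its field names, and the validation rules (full 40-character SHA, no `latest` tag) are assumptions, not part of this template.

```python
from dataclasses import dataclass, asdict
import json
import re


@dataclass(frozen=True)
class RollbackTarget:
    """What production runs right now, i.e. what a rollback returns to."""
    git_sha: str
    image_tag: str
    chart_version: str

    def __post_init__(self):
        # Assumption: require a full 40-character SHA to avoid ambiguous prefixes.
        if not re.fullmatch(r"[0-9a-f]{40}", self.git_sha):
            raise ValueError("git_sha must be a full 40-character hex SHA")
        # Assumption: a mutable tag such as 'latest' does not pin a target.
        if self.image_tag in {"", "latest"}:
            raise ValueError("image_tag must be an explicit, immutable tag")

    def to_ticket_note(self) -> str:
        """Render the record as JSON for pasting into the deploy ticket."""
        return json.dumps(asdict(self), indent=2, sort_keys=True)
```

For example, `RollbackTarget("a" * 40, "v1.4.1", "0.9.3").to_ticket_note()` yields a JSON note with all three fields; the values shown are placeholders.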
Confirm the migration is reversible
- Review any DB migrations in this release for reversibility. Destructive ops (DROP COLUMN, DROP TABLE, NOT NULL backfills) are not safely reversible by re-running the prior image — flag these for expand/contract treatment. If the migration is one-way, the rollback plan must rely on forward-fix, not down-migration.
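A quick scan of the migration SQL can help surface the destructive operations named above before a human review. This is a rough sketch, not a SQL parser: the pattern list and the `find_destructive_statements` helper are assumptions for illustration, and `SET NOT NULL` is used here as a stand-in for spotting NOT NULL changes.

```python
import re

# Patterns for operations the checklist calls out as not safely reversible.
DESTRUCTIVE_PATTERNS = {
    "DROP TABLE": re.compile(r"\bDROP\s+TABLE\b", re.IGNORECASE),
    "DROP COLUMN": re.compile(r"\bDROP\s+COLUMN\b", re.IGNORECASE),
    "SET NOT NULL": re.compile(r"\bSET\s+NOT\s+NULL\b", re.IGNORECASE),
}


def find_destructive_statements(migration_sql: str) -> list[tuple[str, str]]:
    """Return (label, statement) pairs for statements matching a pattern."""
    findings = []
    # A naive split on ';' is enough for a review aid; it will mis-split
    # statements that contain semicolons inside string literals.
    for statement in migration_sql.split(";"):
        statement = " ".join(statement.split())  # collapse whitespace
        if not statement:
            continue
        for label, pattern in DESTRUCTIVE_PATTERNS.items():
            if pattern.search(statement):
                findings.append((label, statement))
    return findings
```

Any finding is a prompt to consider expand/contract treatment. An empty result does not prove the migration is reversible.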
Assign release captain and rollback IC
- Name the release captain (drives the deploy), the incident commander if rollback fires, and the comms lead who owns the status page and #incidents updates. Both the primary and secondary on-call engineers should be reachable for the duration of the deploy window.
Verify rollback runbook is current
- Walk the runbook in Confluence/Notion against the current infra. Common drift: kubectl context names changed, the ArgoCD app was renamed, the feature-flag kill switch moved from LaunchDarkly to a self-hosted flag service. A runbook last touched 14 months ago is presumed stale.

Backup and Restore Validation

Trigger a pre-deploy RDS snapshot
- Take a manual RDS / Cloud SQL snapshot tagged with the release candidate version. Don't rely on the nightly automated snapshot — it may be 16 hours stale by deploy time. Note the snapshot identifier in the deploy ticket.
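For RDS, the manual snapshot can be scripted so the identifier always carries the release version. The sketch below assumes a boto3 RDS client is passed in as `rds_client`; the `predeploy-` prefix, the `release` tag key, and both helper names are hypothetical. It does not cover Cloud SQL.

```python
import re
from datetime import datetime, timezone


def snapshot_identifier(release_version: str, now: datetime) -> str:
    """Build an identifier from letters, digits, and single hyphens only."""
    # RDS snapshot identifiers allow only letters, digits, and hyphens, so
    # characters such as '.' in a version string are replaced.
    safe_version = re.sub(r"[^A-Za-z0-9]+", "-", release_version).strip("-")
    if not safe_version:
        raise ValueError("release_version has no usable characters")
    stamp = now.astimezone(timezone.utc).strftime("%Y%m%d-%H%M")
    return f"predeploy-{safe_version}-{stamp}".lower()


def take_predeploy_snapshot(rds_client, db_instance_id: str,
                            release_version: str) -> str:
    """Request a manual snapshot and return its identifier for the ticket."""
    identifier = snapshot_identifier(release_version, datetime.now(timezone.utc))
    rds_client.create_db_snapshot(
        DBSnapshotIdentifier=identifier,
        DBInstanceIdentifier=db_instance_id,
        Tags=[{"Key": "release", "Value": release_version}],
    )
    return identifier
```

With boto3 installed, `take_predeploy_snapshot(boto3.client("rds"), "prod-db", "v1.4.2")` would request the snapshot; `prod-db` is a placeholder. The call returns before the snapshot completes, so wait for it to become available before starting the deploy.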
Restore the snapshot to a scratch instance
- Restore into a non-prod instance and run a smoke query against a known-row-count table. A backup that has never been restored is not a backup. Common failure: the restore script depends on a credential that rotated last quarter.
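The smoke query can be as small as a row count against a table whose size is known. A minimal sketch, assuming a DB-API connection to the scratch instance; `verify_row_count` and the `countries` table are hypothetical, and sqlite3 stands in for the restored database so the example runs on its own.

```python
import re
import sqlite3


def verify_row_count(conn, table: str, expected_min_rows: int) -> int:
    """Count rows in `table`; raise if the restore looks empty or truncated."""
    # Table names cannot be bound as query parameters, so validate the name.
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", table):
        raise ValueError(f"unsafe table name: {table!r}")
    cursor = conn.cursor()
    cursor.execute(f"SELECT COUNT(*) FROM {table}")
    (count,) = cursor.fetchone()
    if count < expected_min_rows:
        raise RuntimeError(
            f"{table} has {count} rows, expected at least {expected_min_rows}"
        )
    return count


if __name__ == "__main__":
    # sqlite3 stands in for the restored scratch instance in this demo.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE countries (code TEXT PRIMARY KEY)")
    conn.executemany("INSERT INTO countries VALUES (?)",
                     [("DE",), ("FR",), ("JP",)])
    print(verify_row_count(conn, "countries", expected_min_rows=3))  # prints 3
```

Running the check with the same credentials the real restore script uses also exercises the rotated-credential failure mode mentioned above.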
Confirm prior container image is in the registry
- Pull the prior production image tag locally to verify it exists and is not garbage-collected. If your registry has a lifecycle policy that prunes untagged images after 14 days, confirm the rollback target is still tagged.
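As a complement to the local pull, the registry can be queried directly to confirm the tag still exists. The sketch below targets ECR through a boto3 client passed in as `ecr_client`; it is a registry query, not a pull, and the `image_still_tagged` helper is hypothetical. GCR would need a different client.

```python
def image_still_tagged(ecr_client, repository: str, tag: str) -> bool:
    """Return True if the registry still lists an image carrying `tag`."""
    try:
        response = ecr_client.describe_images(
            repositoryName=repository,
            imageIds=[{"imageTag": tag}],
        )
    except ecr_client.exceptions.ImageNotFoundException:
        # The tag no longer resolves to any image in this repository.
        return False
    return any(
        tag in detail.get("imageTags", [])
        for detail in response.get("imageDetails", [])
    )
```

A `False` result before the deploy means the rollback target is already gone and the image must be rebuilt or re-pushed first.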

Monitoring and Trigger Criteria

Open the release dashboard in Datadog
- Pin the four golden signals: request rate, error rate, p95 / p99 latency, and saturation (CPU, memory, DB connections). Layer the deploy marker on the dashboard so the canary's effect is visually unambiguous.
Define the rollback trigger thresholds
- Pre-commit to numeric criteria: e.g., error rate > 2% sustained 5 min, p99 latency > 1.5x baseline for 10 min, or any SEV2 customer report tied to the release. Decide before deploy, not during — debating thresholds at 3am is a known anti-pattern.
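Writing the thresholds down as code is one way to make the pre-commitment concrete. The sketch below encodes the example criteria from this step and assumes one dashboard sample per minute; the `Sample` type and function names are illustrative, and the numbers should be replaced with whatever the team agreed on.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """One per-minute reading from the release dashboard."""
    error_rate: float       # fraction of failing requests, 0.02 means 2%
    p99_latency_ms: float


def sustained(breaches: list[bool], minutes: int) -> bool:
    """True if the most recent `minutes` readings all breached the threshold."""
    return len(breaches) >= minutes and all(breaches[-minutes:])


def should_roll_back(samples: list[Sample], baseline_p99_ms: float,
                     sev2_reports: int = 0) -> bool:
    """Apply the example criteria: errors, latency, or any SEV2 report."""
    # Error rate above 2% for the last 5 minutes.
    error_breach = sustained([s.error_rate > 0.02 for s in samples], 5)
    # p99 latency above 1.5x baseline for the last 10 minutes.
    latency_breach = sustained(
        [s.p99_latency_ms > 1.5 * baseline_p99_ms for s in samples], 10
    )
    return error_breach or latency_breach or sev2_reports > 0
```

Requiring every sample in the window to breach means a short spike does not trigger a rollback, which matches the "sustained" wording above.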
Page the on-call rotation in PagerDuty
- Announce the deploy window to the on-call rotation in PagerDuty so responders know a deploy is the likely cause of any page in the next 60 minutes. Post in #incidents and #engineering with the release tag, captain, and rollback contact.

Rollback Decision

Evaluate canary signals at 5% traffic
- Watch the four golden signals for 10 minutes against the canary cohort. Compare error rate and p99 against the baseline the dashboard shows for the prior release. If signals are clean, proceed; if degraded beyond the trigger thresholds, hold the rollout and call the rollback decision.

Rollback Execution

Flip the kill switch feature flag
- If the release shipped behind a kill-switch flag, flip it off first — it's the fastest path to mitigation and avoids a full redeploy. Only proceed to image revert if the flag doesn't fully gate the bad code path.
Drain canary traffic back to the prior version
- Shift the load balancer / service mesh weights so the canary receives 0% of traffic and the prior version receives 100%. Watch the error rate trend down as traffic drains. If using Argo Rollouts, abort the rollout rather than rolling forward.
Redeploy the prior container image
- Run the rollback pipeline pointing at the prior image tag captured in pre-release. For Kubernetes, run kubectl rollout undo on each affected Deployment, or set the chart values back and re-sync ArgoCD. Confirm pod readiness gates pass before declaring the rollback complete.
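For the kubectl path, the undo and the readiness wait can be paired per Deployment. This is a sketch under assumptions: the helper names, the 300s timeout, and the dry-run default are illustrative choices, not part of this template.

```python
import subprocess


def rollback_commands(deployments: list[str], namespace: str,
                      timeout: str = "300s") -> list[list[str]]:
    """Build an undo command plus a readiness wait for each Deployment."""
    commands = []
    for name in deployments:
        commands.append(
            ["kubectl", "rollout", "undo", f"deployment/{name}", "-n", namespace]
        )
        # `rollout status` blocks until the rollout finishes or times out.
        commands.append(
            ["kubectl", "rollout", "status", f"deployment/{name}",
             "-n", namespace, f"--timeout={timeout}"]
        )
    return commands


def run_rollback(deployments: list[str], namespace: str,
                 dry_run: bool = True) -> None:
    """Print each command, and execute it unless dry_run is set."""
    for command in rollback_commands(deployments, namespace):
        print(" ".join(command))
        if not dry_run:
            # check=True stops at the first failure instead of continuing.
            subprocess.run(command, check=True)
```

Note that `kubectl rollout undo` returns to the previous revision, which is the captured image only if nothing else rolled out in between; check the revision history first if in doubt.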
Reverse the database migration if safe
- Only run the down migration if pre-release confirmed it as reversible. For one-way migrations, the prior image must be forward-compatible with the new schema; if not, restore from the pre-deploy snapshot — but only with the engineering director's explicit approval since this drops any writes since the snapshot.
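The branching in this step can be written out as a small decision function, which makes the order of preference explicit. The enum and function names are hypothetical, and the `ESCALATE` outcome for a missing approval is an assumption about what the team would do next.

```python
from enum import Enum


class SchemaAction(Enum):
    RUN_DOWN_MIGRATION = "run the down migration"
    KEEP_NEW_SCHEMA = "keep the new schema under the prior image"
    RESTORE_SNAPSHOT = "restore the pre-deploy snapshot"
    ESCALATE = "stop and escalate for approval"  # assumption, see lead-in


def choose_schema_action(reversible: bool,
                         prior_image_forward_compatible: bool,
                         director_approved: bool) -> SchemaAction:
    """Pick the least destructive schema action that the checklist allows."""
    if reversible:
        # Pre-release review confirmed the down migration is safe.
        return SchemaAction.RUN_DOWN_MIGRATION
    if prior_image_forward_compatible:
        # One-way migration, but the old code tolerates the new schema.
        return SchemaAction.KEEP_NEW_SCHEMA
    if director_approved:
        # Last resort: loses every write made since the snapshot.
        return SchemaAction.RESTORE_SNAPSHOT
    return SchemaAction.ESCALATE
```

The snapshot restore sits last because it is the only branch that discards data.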
Run smoke tests against production
- Run the synthetic user-path suite (login, core workflow, checkout) against production. Confirm error rate and p99 latency have returned to the pre-deploy baseline on the dashboard.

Communication and Documentation

Post status page update for affected customers
- Update Statuspage / Better Stack with the user-visible impact, time window, and current state (investigating → identified → monitoring → resolved). Notify customer support so inbound tickets are routed correctly.
File the incident ticket in Jira
- Open an incident ticket linked to the release ticket. Capture the timeline, who paged whom, the rollback steps executed, and the contributing PRs. This is the source for the PIR document and any SOC 2 change-management evidence.

Post-Incident Review

Schedule a blameless PIR within five business days
- Invite the release captain, on-call, IC, comms lead, and at least one engineer not on the team. Keep the framing blameless: focus on contributing factors (alert tuning, missing canary check, untested migration), not individuals. First-shift narratives are usually wrong; the PIR is where the real causes surface.
Identify contributing factors with the 5-whys method
- Walk the timeline backwards. Look for second-order causes: was the canary cohort too small to surface the issue? Did the dashboard hide the regression? Was the test that should have caught this missing or flaky-and-ignored? Capture each contributing factor as a discrete action item with a named owner.
File action items in Jira with named owners
- Each action item gets an owner, a target sprint, and a link back to the PIR. Common gap: action items filed but never tracked to closure — six months later the same incident recurs. The engineering manager owns chasing these to done.
Update the rollback runbook with lessons learned
- Fold any drift discovered during the rollback (renamed contexts, missing kill switches, broken restore scripts) into the runbook. Note the date and PIR ticket so the next reader knows the runbook was exercised, not just edited from a desk.


