Rollback Plan Checklist

Pre-Release Preparation

    Record the currently-deployed git sha, container image tag, and Helm chart version so the rollback target is unambiguous. Confirm the previous image is still present in ECR/GCR and has not been pruned by registry lifecycle rules — a common gotcha when the rollback target is more than 30 days old.

    Review any DB migrations in this release for reversibility. Destructive ops (DROP COLUMN, DROP TABLE, NOT NULL backfills) are not safely reversible by re-running the prior image — flag these for expand/contract treatment. If the migration is one-way, the rollback plan must rely on forward-fix, not down-migration.

    Name the release captain (drives the deploy), the incident commander if rollback fires, and the comms lead who owns the status page and #incidents updates. Both primary and secondary on-call should be reachable for the duration of the deploy window.

    Walk the runbook in Confluence/Notion against the current infra. Common drift: kubectl context names changed, the ArgoCD app was renamed, the feature-flag kill switch moved from LaunchDarkly to a self-hosted flag service. A runbook last touched 14 months ago is presumed stale.

Backup and Restore Validation

    Take a manual RDS / Cloud SQL snapshot tagged with the release candidate version. Don't rely on the nightly automated snapshot — it may be 16 hours stale by deploy time. Note the snapshot identifier in the deploy ticket.

    Restore into a non-prod instance and run a smoke query against a known-row-count table. A backup that has never been restored is not a backup. Common failure: the restore script depends on a credential that rotated last quarter.

    Pull the prior production image tag locally to verify it exists and is not garbage-collected. If your registry has a lifecycle policy that prunes untagged images after 14 days, confirm the rollback target is still tagged.

Monitoring and Trigger Criteria

    Pin the four golden signals: request rate, error rate, p95 / p99 latency, and saturation (CPU, memory, DB connections). Layer the deploy marker on the dashboard so the canary's effect is visually unambiguous.

    Pre-commit to numeric criteria: e.g., error rate > 2% sustained 5 min, p99 latency > 1.5x baseline for 10 min, or any SEV2 customer report tied to the release. Decide before deploy, not during — debating thresholds at 3am is a known anti-pattern.

    Acknowledge the deploy window in PagerDuty so the alerting team knows a deploy is the likely cause of any page in the next 60 minutes. Post in #incidents and #engineering with the release tag, captain, and rollback contact.

Rollback Decision

    Watch the four golden signals for 10 minutes against the canary cohort. Compare error rate and p99 against the baseline the dashboard shows for the prior release. If signals are clean, proceed; if degraded beyond the trigger thresholds, hold the rollout and call the rollback decision.

Rollback Execution

    If the release shipped behind a kill-switch flag, flip it off first — it's the fastest path to mitigation and avoids a full redeploy. Only proceed to image revert if the flag doesn't fully gate the bad code path.

    Shift the load balancer / service mesh weights from the canary back to 0%. Watch error rate trend down as traffic drains. If using ArgoCD rollouts, abort the rollout rather than rolling forward.

    Run the rollback pipeline pointing at the prior image tag captured in pre-release. For Kubernetes, kubectl rollout undo on each affected Deployment, or set the chart values back and re-sync ArgoCD. Confirm pod readiness gates pass before declaring the rollback complete.

    Only run the down migration if pre-release confirmed it as reversible. For one-way migrations, the prior image must be forward-compatible with the new schema; if not, restore from the pre-deploy snapshot — but only with the engineering director's explicit approval since this drops any writes since the snapshot.

    Run the synthetic user-path suite (login, core workflow, checkout) against production. Confirm error rate and p99 latency have returned to the pre-deploy baseline on the dashboard.

Communication and Documentation

    Update Statuspage / Better Stack with the user-visible impact, time window, and current state (investigating → identified → monitoring → resolved). Notify customer support so inbound tickets are routed correctly.

    Open an incident ticket linked to the release ticket. Capture the timeline, who paged whom, the rollback steps executed, and the contributing PRs. This is the source for the PIR document and any SOC 2 change-management evidence.

Post-Incident Review

    Invite the release captain, on-call, IC, comms lead, and at least one engineer not on the team. Blameless framing — focus on contributing factors (alert tuning, missing canary check, untested migration), not individuals. First-shift narratives are usually wrong; the PIR is where the real causes surface.

    Walk the timeline backwards. Look for second-order causes: was the canary cohort too small to surface the issue? Did the dashboard hide the regression? Was the test that should have caught this missing or flaky-and-ignored? Capture each contributing factor as a discrete action item with a named owner.

    Each action item gets an owner, a target sprint, and a link back to the PIR. Common gap: action items filed but never tracked to closure — six months later the same incident recurs. The engineering manager owns chasing these to done.

    Fold any drift discovered during the rollback (renamed contexts, missing kill switches, broken restore scripts) into the runbook. Note the date and PIR ticket so the next reader knows the runbook was exercised, not just edited from a desk.

Use this template in Manifestly

Start a Free 14 Day Trial
Use Slack? Start your trial with one click

Related Software Development Checklists

Ready to take control of your recurring tasks?

Start Free 14-Day Trial


Use Slack? Sign up with one click

With Slack