Rollback Plan Checklist
Pre-Release Preparation
Record the currently deployed Git SHA, container image tag, and Helm chart version so the rollback target is unambiguous. Confirm the previous image is still present in ECR/GCR and has not been pruned by registry lifecycle rules, a common gotcha when the rollback target is more than 30 days old.
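One way to keep the rollback target unambiguous is to store it as a single structured record in the deploy ticket. The sketch below is illustrative only: the `RollbackTarget` class, its field names, and the `to_ticket_note` helper are hypothetical, not part of any tool named here.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class RollbackTarget:
    """The three identifiers recorded before the release (hypothetical structure)."""
    git_sha: str
    image_tag: str
    chart_version: str


def to_ticket_note(target: RollbackTarget) -> str:
    """Serialize the target as stable, sorted JSON suitable for pasting into a ticket."""
    return json.dumps(asdict(target), sort_keys=True)
```

For example, `to_ticket_note(RollbackTarget("abc1234", "v1.4.2", "0.9.1"))` yields one JSON line that the rollback pipeline or a human can read back later. The values shown are placeholders.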
Review any DB migrations in this release for reversibility. Destructive or backward-incompatible operations (DROP COLUMN, DROP TABLE, new NOT NULL constraints with backfills) cannot be safely undone by redeploying the prior image; flag these for expand/contract treatment. If the migration is one-way, the rollback plan must rely on forward-fix, not down-migration.
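A first-pass scan of migration scripts can surface statements that deserve a closer look. The following is a minimal sketch: the pattern list and the `flag_risky_statements` name are assumptions chosen for illustration, the patterns are not exhaustive, and a match only means "review this by hand", not "this is definitely irreversible".

```python
import re

# Illustrative patterns only; extend to match your migration tool's dialect.
RISKY_PATTERNS = {
    "DROP TABLE": re.compile(r"\bDROP\s+TABLE\b", re.IGNORECASE),
    "DROP COLUMN": re.compile(r"\bDROP\s+COLUMN\b", re.IGNORECASE),
    "SET NOT NULL": re.compile(r"\bSET\s+NOT\s+NULL\b", re.IGNORECASE),
}


def flag_risky_statements(migration_sql: str) -> list[str]:
    """Return the labels of risky patterns found in a migration script."""
    return [
        label
        for label, pattern in RISKY_PATTERNS.items()
        if pattern.search(migration_sql)
    ]
```

Any non-empty result is a prompt to plan the change as expand/contract, or to document it as one-way in the rollback plan.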
Name the release captain (drives the deploy), the incident commander if rollback fires, and the comms lead who owns the status page and #incidents updates. Both primary and secondary on-call should be reachable for the duration of the deploy window.
Walk the runbook in Confluence/Notion against the current infra. Common drift: kubectl context names changed, the ArgoCD app was renamed, the feature-flag kill switch moved from LaunchDarkly to a self-hosted flag service. A runbook last touched 14 months ago is presumed stale.
Backup and Restore Validation
Take a manual RDS / Cloud SQL snapshot tagged with the release candidate version. Don't rely on the nightly automated snapshot — it may be 16 hours stale by deploy time. Note the snapshot identifier in the deploy ticket.
Restore into a non-prod instance and run a smoke query against a known-row-count table. A backup that has never been restored is not a backup. Common failure: the restore script depends on a credential that rotated last quarter.
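The smoke query can be as small as a row count against a table whose size is known. This sketch assumes a Python DB-API connection to the restored instance; the `smoke_check` name and the table and threshold you pass in are hypothetical.

```python
def smoke_check(conn, table: str, min_rows: int) -> bool:
    """Return True if `table` in the restored database holds at least `min_rows` rows."""
    # Table names cannot be bound as query parameters, so validate before interpolating.
    if not table.isidentifier():
        raise ValueError(f"unexpected table name: {table!r}")
    cur = conn.cursor()
    cur.execute(f"SELECT COUNT(*) FROM {table}")
    (count,) = cur.fetchone()
    return count >= min_rows
```

Run it with the same credentials the restore script uses, so a rotated credential shows up during the rehearsal rather than during an incident.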
Pull the prior production image tag locally to verify it exists and is not garbage-collected. If your registry has a lifecycle policy that prunes untagged images after 14 days, confirm the rollback target is still tagged.
Monitoring and Trigger Criteria
Pin the four golden signals to the release dashboard: request rate, error rate, p95/p99 latency, and saturation (CPU, memory, DB connections). Overlay the deploy marker so the canary's effect is visually unambiguous.
Pre-commit to numeric criteria: e.g., error rate > 2% sustained 5 min, p99 latency > 1.5x baseline for 10 min, or any SEV2 customer report tied to the release. Decide before deploy, not during — debating thresholds at 3am is a known anti-pattern.
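Writing the criteria down as code is one way to make them non-negotiable at deploy time. The sketch below encodes the two example numeric thresholds above, assuming one metrics sample per minute; the `Sample` type and function names are hypothetical, and the SEV2-report trigger is left to human judgment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    minute: int        # minutes since deploy start
    error_rate: float  # fraction of failed requests, e.g. 0.02 for 2%
    p99_ms: float      # p99 latency in milliseconds


def sustained(samples, predicate, minutes: int) -> bool:
    """True if the most recent `minutes` samples all satisfy `predicate`."""
    recent = samples[-minutes:]
    return len(recent) == minutes and all(predicate(s) for s in recent)


def should_roll_back(samples, baseline_p99_ms: float) -> bool:
    """Error rate > 2% for 5 min, or p99 > 1.5x baseline for 10 min."""
    errors_high = sustained(samples, lambda s: s.error_rate > 0.02, 5)
    latency_high = sustained(
        samples, lambda s: s.p99_ms > 1.5 * baseline_p99_ms, 10
    )
    return errors_high or latency_high
```

A single clean sample inside the window resets the "sustained" condition, which keeps a brief spike from triggering a rollback.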
Announce the deploy window in PagerDuty (for example, as a change event) so on-call responders know a deploy is the likely cause of any page in the next 60 minutes. Post in #incidents and #engineering with the release tag, captain, and rollback contact.
Rollback Decision
Watch the four golden signals for 10 minutes against the canary cohort. Compare error rate and p99 against the baseline the dashboard shows for the prior release. If signals are clean, proceed; if degraded beyond the trigger thresholds, hold the rollout and call the rollback decision.
Rollback Execution
If the release shipped behind a kill-switch flag, flip it off first — it's the fastest path to mitigation and avoids a full redeploy. Only proceed to image revert if the flag doesn't fully gate the bad code path.
Shift the load balancer / service mesh weight for the canary back to 0%. Watch the error rate trend down as traffic drains. If using Argo Rollouts, abort the rollout rather than rolling forward.
Run the rollback pipeline pointing at the prior image tag captured in pre-release. For Kubernetes, run kubectl rollout undo on each affected Deployment, or set the chart values back and re-sync ArgoCD. Confirm the pods pass their readiness checks before declaring the rollback complete.
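For the Kubernetes path, the undo and the readiness wait can be scripted together so neither is skipped under pressure. This is a sketch that only builds the command lines; the `rollback_commands` helper, the Deployment names, the namespace, and the timeout are assumptions to replace with your own.

```python
def rollback_commands(deployments, namespace: str, timeout: str = "120s"):
    """Build kubectl argument lists to undo each Deployment and wait for it to settle."""
    commands = []
    for name in deployments:
        target = f"deployment/{name}"
        # Revert to the previous revision of the Deployment.
        commands.append(["kubectl", "rollout", "undo", target, "-n", namespace])
        # Block until the reverted rollout reports ready, or the timeout expires.
        commands.append(
            ["kubectl", "rollout", "status", target, "-n", namespace, f"--timeout={timeout}"]
        )
    return commands
```

Each list can be passed to `subprocess.run(cmd, check=True)`. Note that `kubectl rollout undo` reverts to the previous revision, which matches the recorded image tag only if no other rollout happened in between; verify the running image afterwards.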
Only run the down migration if pre-release confirmed it as reversible. For one-way migrations, the prior image must be forward-compatible with the new schema; if not, restore from the pre-deploy snapshot — but only with the engineering director's explicit approval since this drops any writes since the snapshot.
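The database branch of the rollback reduces to a short decision order, sketched below. The `DbAction` enum and `choose_db_action` function are hypothetical names; the inputs are the facts established during pre-release review and the approval described above.

```python
from enum import Enum


class DbAction(Enum):
    DOWN_MIGRATION = "run the down migration"
    KEEP_SCHEMA = "keep the new schema; the prior image tolerates it"
    RESTORE_SNAPSHOT = "restore the pre-deploy snapshot"
    ESCALATE = "stop and escalate: approval or a forward-fix is needed"


def choose_db_action(
    reversible: bool, prior_image_compatible: bool, director_approved: bool
) -> DbAction:
    """Pick the least destructive database action that is still safe."""
    if reversible:
        return DbAction.DOWN_MIGRATION
    if prior_image_compatible:
        return DbAction.KEEP_SCHEMA
    if director_approved:
        # Restoring drops every write made since the snapshot was taken.
        return DbAction.RESTORE_SNAPSHOT
    return DbAction.ESCALATE
```

The ordering matters: a snapshot restore is considered only after both cheaper options are ruled out, and never without approval.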
Run the synthetic user-path suite (login, core workflow, checkout) against production. Confirm error rate and p99 latency have returned to the pre-deploy baseline on the dashboard.
Communication and Documentation
Update Statuspage / Better Stack with the user-visible impact, time window, and current state (investigating → identified → monitoring → resolved). Notify customer support so inbound tickets are routed correctly.
Open an incident ticket linked to the release ticket. Capture the timeline, who paged whom, the rollback steps executed, and the contributing PRs. This is the source for the PIR document and any SOC 2 change-management evidence.
Post-Incident Review
Invite the release captain, on-call, IC, comms lead, and at least one engineer not on the team. Use blameless framing: focus on contributing factors (alert tuning, missing canary check, untested migration), not individuals. Early narratives formed during the incident are usually incomplete or wrong; the PIR is where the real causes surface.
Walk the timeline backwards. Look for second-order causes: was the canary cohort too small to surface the issue? Did the dashboard hide the regression? Was the test that should have caught this missing or flaky-and-ignored? Capture each contributing factor as a discrete action item with a named owner.
Each action item gets an owner, a target sprint, and a link back to the PIR. Common gap: action items filed but never tracked to closure — six months later the same incident recurs. The engineering manager owns chasing these to done.
Fold any drift discovered during the rollback (renamed contexts, missing kill switches, broken restore scripts) into the runbook. Note the date and PIR ticket so the next reader knows the runbook was exercised, not just edited from a desk.
