Release Planning Checklist
Two-week run-up to a production release: backlog refinement through deployment readiness. The release captain owns the checklist; tech leads, QA, and product collaborate on individual steps.
Backlog Review & Refinement
-
Confirm tickets have acceptance criteria
Walk the Jira/Linear board for tickets tagged to this release. Each story needs a measurable AC and a definition of done — vague tickets ("improve dashboard performance") are a common source of scope creep mid-sprint. Kick anything ambiguous back to product before locking scope.
-
Flag cross-team dependencies and blockers
Identify tickets that depend on platform, data, or external API changes. A blocked ticket discovered on release day is the most common reason scope ships short. Link the dependency in the ticket and notify the upstream team's tech lead.
-
Tag candidate tickets with the release version
Apply the fix-version label (e.g., v2024.45.0) so the release notes script can pull the right changelog entries automatically.
Release Scope Definition
-
Lock release scope with product and tech lead
Hold a 30-minute scope-lock meeting with PM and tech lead. After this meeting, additions require a written exception from the release captain — otherwise scope creep eats the QA window.
-
Inventory feature flags for gradual rollout
List every flag this release introduces or flips. Include the flag key, the rollout plan (5% canary → 25% → 100%), and the named owner who will clean it up. Flags without owners become 18-month dead code paths.
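A minimal sketch of how the flag inventory could be checked for completeness before scope lock. The FlagEntry fields and the inventory_problems helper are illustrative names, not part of any flag vendor's API; the default rollout steps mirror the 5% → 25% → 100% plan above.

```python
from dataclasses import dataclass, field


@dataclass
class FlagEntry:
    """One row of the release's flag inventory (field names are illustrative)."""
    key: str
    rollout_steps: list[int] = field(default_factory=lambda: [5, 25, 100])
    owner: str = ""  # named person responsible for removing the flag


def inventory_problems(flags: list[FlagEntry]) -> list[str]:
    """Return human-readable problems; an empty list means the inventory is complete."""
    problems = []
    for flag in flags:
        if not flag.owner.strip():
            problems.append(f"{flag.key}: no cleanup owner")
        steps = flag.rollout_steps
        if not steps or steps[-1] != 100:
            problems.append(f"{flag.key}: rollout plan does not end at 100%")
        if steps != sorted(steps):
            problems.append(f"{flag.key}: rollout percentages are not increasing")
    return problems
```

Running this in CI against a checked-in inventory file is one way to keep ownerless flags from shipping; where the inventory lives is up to the team.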
-
Capture whether the release ships a DB migration
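If any merged PR includes a schema change, an index addition, or a data backfill, answer Yes. Migrations need their own pre-flight section: on Postgres versions before 11, or on any version when the default is volatile (for example, random() or clock_timestamp()), adding a column with a default to a 50M-row table rewrites the whole table under an ACCESS EXCLUSIVE lock.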
Risk & Security Review
-
Review SCA findings for new dependencies
Pull the Snyk or Dependabot report for the release branch. Triage anything CVSS 7+ before merge; defer lower-severity transitive findings only if the package is not on a request path. Document the deferral in the ticket.
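The triage rule above can be sketched as a small function. The Finding fields are hypothetical and do not match the Snyk or Dependabot export formats; treating a lower-severity finding that is direct or on a request path as "fix before merge" is one conservative reading of the rule, not something the checklist states.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    package: str
    cvss: float
    transitive: bool        # pulled in indirectly by another dependency
    on_request_path: bool   # loaded while serving user requests


def triage(finding: Finding) -> str:
    """Return 'fix-before-merge' or 'defer-with-note' for one SCA finding."""
    if finding.cvss >= 7.0:
        return "fix-before-merge"
    if finding.transitive and not finding.on_request_path:
        # Deferral still needs a note in the ticket.
        return "defer-with-note"
    # Assumption: anything else is not eligible for deferral.
    return "fix-before-merge"
```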
-
Threat-model new external endpoints
For any new public API or webhook, walk the OWASP Top 10 with the AppSec partner: authn/authz coverage, input validation, rate-limit posture, PII in logs. A 30-minute pairing session is enough for incremental changes.
-
Document the rollback decision criteria
Spell out the thresholds that trigger an automatic rollback: error-rate spike above baseline, p99 latency regression, customer-support volume. Without pre-agreed thresholds, the call gets argued in Slack at 2am.
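One way to make the thresholds unarguable is to write them down as data and evaluate them mechanically. This is a sketch only: the numbers in RollbackThresholds are placeholders rather than recommendations, and the metric names are assumptions about what the monitoring stack exposes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RollbackThresholds:
    # Placeholder values; agree on real ones with the team before deploy day.
    error_rate_ratio: float = 2.0        # current / baseline error rate
    p99_latency_ratio: float = 1.5       # current / baseline p99 latency
    support_tickets_per_hour: int = 20


def rollback_reasons(baseline, current, thresholds=RollbackThresholds()):
    """Compare current metrics with baseline; return the list of tripped criteria.

    `baseline` and `current` are dicts with "error_rate" and "p99_ms";
    `current` also carries "support_tickets_per_hour".
    """
    reasons = []
    if current["error_rate"] > baseline["error_rate"] * thresholds.error_rate_ratio:
        reasons.append("error rate")
    if current["p99_ms"] > baseline["p99_ms"] * thresholds.p99_latency_ratio:
        reasons.append("p99 latency")
    if current["support_tickets_per_hour"] > thresholds.support_tickets_per_hour:
        reasons.append("support volume")
    return reasons
```

A non-empty result means at least one pre-agreed criterion tripped, so the discussion is about executing the rollback rather than whether to.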
Database Migration Planning
-
Confirm the migration is reversible
Either the migration has a tested down-migration, or there is a written forward-fix plan for cases where rollback after the schema change is impossible. Adding a column is reversible, because the down-migration just drops it; dropping a column that was populated from a third-party API is not, because the data cannot be restored without a backup or a re-fetch.
-
Plan the backfill in batches
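A backfill that runs as a single UPDATE holds row locks on every row it touches until it commits, so concurrent writes to those rows block for the duration. Chunk by primary key (e.g., 5,000 rows per batch with a 200ms sleep) and watch replication lag during the rehearsal. Use CREATE INDEX CONCURRENTLY for new indexes on Postgres; it cannot run inside a transaction block, so the migration tool must run that statement outside its usual transaction wrapper.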
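A minimal sketch of a primary-key-chunked backfill. It uses the standard-library sqlite3 module so it runs anywhere; against Postgres the same loop would go through your driver of choice with its own placeholder syntax. The users table and the email_lower column are hypothetical.

```python
import sqlite3
import time


def backfill_in_batches(conn, batch_size=5000, pause_seconds=0.2):
    """Fill users.email_lower from users.email, one primary-key range at a time.

    Returns the number of batches executed.
    """
    max_id = conn.execute("SELECT COALESCE(MAX(id), 0) FROM users").fetchone()[0]
    last_id = 0
    batches = 0
    while last_id < max_id:
        upper = last_id + batch_size
        conn.execute(
            "UPDATE users SET email_lower = LOWER(email) "
            "WHERE id > ? AND id <= ? AND email_lower IS NULL",
            (last_id, upper),
        )
        conn.commit()  # commit per batch so locks are released between chunks
        last_id = upper
        batches += 1
        time.sleep(pause_seconds)  # pause so replicas can catch up
    return batches
```

Chunking by id range keeps each statement cheap to plan, but sparse ids mean some batches touch fewer than batch_size rows. The IS NULL guard makes the job safe to re-run after an interruption.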
-
Rehearse migration on a production-sized clone
Restore the latest production snapshot to a staging RDS instance and run the migration end-to-end. Capture wall-clock time and lock duration; if the rehearsal takes longer than the agreed deploy window, split the migration into a pre-deploy step.
Capacity & Resource Planning
-
Confirm release captain and on-call coverage
Check PagerDuty for the deploy window and the four hours after. Release captain plus primary on-call must both be online; a vacation collision is a common reason hotfixes get botched.
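The coverage check can be sketched as an interval test: do one person's shifts leave any gap between deploy start and four hours after deploy end? The shift data is assumed to be exported already as (start, end) pairs; nothing here calls the PagerDuty API.

```python
from datetime import datetime, timedelta


def covers_window(shifts, deploy_start, deploy_end, tail=timedelta(hours=4)):
    """True if the shifts leave no gap between deploy_start and deploy_end + tail.

    `shifts` is a list of (start, end) datetime pairs for one person.
    """
    needed_until = deploy_end + tail
    covered_until = deploy_start
    for start, end in sorted(shifts):
        if start > covered_until:
            break  # gap before this shift begins
        covered_until = max(covered_until, end)
        if covered_until >= needed_until:
            return True
    return covered_until >= needed_until
```

Run it once for the release captain and once for the primary on-call; both must return True.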
-
Reserve the deploy window on the team calendar
Block Tuesday 10am–12pm (or your team's standard window). Avoid Fridays and the day before a holiday — if something breaks Saturday morning, the on-call carries it alone.
-
Verify staging matches production configuration
Diff the Terraform state for environment-specific variables: instance sizes, secrets, third-party API endpoints, feature-flag defaults. "It worked in staging" failures usually trace back to undocumented config drift.
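A sketch of the comparison itself, assuming the relevant variables for each environment have already been exported into flat dictionaries (how you extract them from Terraform is left open). The expected_to_differ set is a hypothetical allowlist for values that legitimately differ, such as hostnames.

```python
MISSING = "<missing>"


def config_drift(staging, production, expected_to_differ=frozenset()):
    """Return {key: (staging_value, production_value)} for unexpected differences."""
    drift = {}
    for key in sorted(staging.keys() | production.keys()):
        if key in expected_to_differ:
            continue
        s = staging.get(key, MISSING)
        p = production.get(key, MISSING)
        if s != p:
            drift[key] = (s, p)
    return drift
```

Compare secrets by fingerprint or version id rather than by value, so the diff output can be pasted into a ticket safely.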
QA & Test Strategy
-
Run the full e2e suite against the release candidate
Cut the release branch, tag -rc.1, and run Playwright/Cypress in the staging pipeline. Investigate every red — "just rerun, it's flaky" is how real regressions slip through. Open tickets for any flakes you defer.
-
Identify regression-risk areas for manual smoke testing
Map the merged PRs to user-facing surfaces. Auth, billing, and primary checkout paths get manual smoke regardless of automated coverage; secondary surfaces only if a PR touched them.
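The mapping from merged PRs to smoke-test surfaces can be sketched from changed file paths. The path prefixes and surface names below are hypothetical; substitute your repository layout.

```python
# Surfaces that get manual smoke testing on every release.
ALWAYS_SMOKE = {"auth", "billing", "checkout"}

# Hypothetical mapping from path prefix to user-facing surface.
SURFACE_BY_PREFIX = {
    "services/auth/": "auth",
    "services/billing/": "billing",
    "web/checkout/": "checkout",
    "web/settings/": "settings",
    "web/reports/": "reports",
}


def smoke_surfaces(changed_paths):
    """Return the sorted list of surfaces to smoke-test for this release."""
    touched = {
        surface
        for path in changed_paths
        for prefix, surface in SURFACE_BY_PREFIX.items()
        if path.startswith(prefix)
    }
    return sorted(ALWAYS_SMOKE | touched)
```

Paths that match no prefix are ignored here; in practice an unmapped path is worth a second look, since it may be a new surface nobody has registered.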
-
Capture QA sign-off on the release candidate
QA lead records the verdict, attaches the test report, and notes any deferred bugs. "Pass with notes" is allowed for cosmetic issues with a tracked follow-up; functional regressions are Fail.
-
Halt release and schedule a fix cycle
QA failed the candidate. Notify stakeholders the release is slipping, file fix tickets at SEV2 or above, and schedule a follow-up build for the next release window. Do not let pressure push a known-failing build to production.
Communication & Release Notes
-
Draft customer-facing release notes
Generate the draft from changelog entries, then strip internal jargon. Group by feature/fix/breaking change. Anything marked breaking needs a migration note for API consumers and a 90-day deprecation window if it changes existing endpoints.
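A sketch of the grouping step, assuming changelog entries arrive as dictionaries with "kind", "text", and an optional "migration_note". Those field names and the section titles are illustrative, not the format of any particular changelog tool. The function refuses to render a breaking change that has no migration note.

```python
SECTIONS = [("breaking", "Breaking changes"), ("feature", "Features"), ("fix", "Fixes")]


def render_release_notes(entries):
    """Group changelog entries by kind and render plain-text notes.

    Raises ValueError if a breaking entry has no migration note.
    """
    lines = []
    for kind, title in SECTIONS:
        matching = [e for e in entries if e["kind"] == kind]
        if not matching:
            continue
        lines.append(title)
        for entry in matching:
            lines.append(f"- {entry['text']}")
            if kind == "breaking":
                note = entry.get("migration_note")
                if not note:
                    raise ValueError(
                        f"breaking change without migration note: {entry['text']}"
                    )
                lines.append(f"  Migration: {note}")
    return "\n".join(lines)
```

Listing breaking changes first is a design choice in this sketch, so API consumers see them before anything else.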
-
Brief support on customer-visible changes
15-minute walkthrough with the support lead. Cover what changed, expected ticket categories, and the engineer to escalate to. Without this brief, support backlog spikes the day after release.
-
Post the release window in #engineering
Announcement covers: deploy date and time, scope summary, release captain, rollback contact, link to the runbook. Pin it. Lock main to release-blocking PRs only during the window.
Deployment & Rollback Planning
-
Verify the previous container image is in the registry
Pull-test the prior version's image tag from ECR/GCR. Image lifecycle policies sometimes prune the very tag you need to roll back to — discovering this during an incident is too late.
-
Walk through the rollback runbook end-to-end
Read the runbook out loud with the on-call engineer. Confirm every command works against the current cluster — kubectl contexts, deploy script flags, feature-flag kill switches. Update anything stale before deploy day.
-
Tag the release candidate SHA
Apply the version tag (e.g., v2024.45.0) to the candidate SHA and push. CI builds the final artifact from the tag; never deploy from a moving branch ref.
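A deploy script can enforce the "tags only" rule with a guard like the one below. The tag shape is an assumption inferred from the examples in this checklist (v2024.45.0 for a final release, a -rc.N suffix for candidates); adjust the pattern to your own convention.

```python
import re

# Assumed tag shape: v<4-digit year>.<1-2 digit number>.<patch>.
# Tags with an "-rc.<n>" suffix are release candidates and do not match.
FINAL_TAG = re.compile(r"refs/tags/v\d{4}\.\d{1,2}\.\d+")


def is_deployable_ref(ref: str) -> bool:
    """True only for a final release tag; branches and -rc tags are rejected."""
    return FINAL_TAG.fullmatch(ref) is not None
```

For example, is_deployable_ref("refs/heads/main") is False, so a pipeline triggered from a branch would stop before building the production artifact.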
Monitoring & Support Readiness
-
Confirm dashboards cover the new services
Open the Datadog/Grafana service dashboards. New endpoints need RED-method panels (rate, errors, duration). A new service shipping without a dashboard is a guaranteed blind spot during the post-deploy window.
-
Set SLO alert thresholds for new endpoints
Set alerts on p99 latency against a measured baseline and on error-rate burn rate. Route them to the on-call rotation in PagerDuty, not a dead Slack channel; alerts that page nobody are the same failure pattern that lets certificate renewals lapse unnoticed.
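For reference, burn rate is the observed error rate divided by the error budget (1 minus the SLO target); a burn rate of 1 spends the budget exactly over the SLO period. The sketch below pages only when both a long and a short window exceed the threshold, a common multi-window pattern. The default threshold of 14.4 is a widely cited starting point for a one-hour window, used here as a placeholder rather than a recommendation.

```python
def burn_rate(errors, requests, slo_target):
    """Observed error rate divided by the error budget (1 - slo_target)."""
    if requests == 0:
        return 0.0
    error_budget = 1.0 - slo_target
    return (errors / requests) / error_budget


def should_page(long_window, short_window, slo_target, threshold=14.4):
    """Page only if both windows burn faster than the threshold.

    Each window is an (errors, requests) pair. Requiring the short window too
    keeps the alert from firing long after the problem has stopped.
    """
    return all(
        burn_rate(errors, requests, slo_target) >= threshold
        for errors, requests in (long_window, short_window)
    )
```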
-
Brief on-call on the hotfix process
Walk the on-call through: who decides hotfix vs. rollback, the cherry-pick branch convention, and the abbreviated review path for SEV1 fixes. New on-call members get this brief every release; veterans skim.