Release Planning Checklist

Two-week run-up to a production release: backlog refinement through deployment readiness. The release captain owns the checklist; tech leads, QA, and product collaborate on individual steps.

Use this workflow

Backlog Review & Refinement

Confirm tickets have acceptance criteria
- Walk the Jira/Linear board for tickets tagged to this release. Each story needs a measurable AC and a definition of done — vague tickets ("improve dashboard performance") are a common source of scope creep mid-sprint. Kick anything ambiguous back to product before locking scope.
Flag cross-team dependencies and blockers
- Identify tickets that depend on platform, data, or external API changes. A blocked ticket discovered on release day is the most common reason scope ships short. Link the dependency in the ticket and notify the upstream team's tech lead.
Tag candidate tickets with the release version
- Apply the fix-version label (e.g., v2024.45.0) so the release notes script can pull the right changelog entries automatically.

Release Scope Definition

Lock release scope with product and tech lead
- Hold a 30-minute scope-lock meeting with PM and tech lead. After this meeting, additions require a written exception from the release captain — otherwise scope creep eats the QA window.
Inventory feature flags for gradual rollout
- List every flag this release introduces or flips. Include the flag key, the rollout plan (5% canary → 25% → 100%), and the named owner who will clean it up. Flags without owners become 18-month dead code paths.
Collects paragraph
Capture whether the release ships a DB migration
- If any merged PR includes a schema change, an index addition, or a data backfill, answer Yes. Migrations need their own pre-flight section — adding a column with a default on a 50M-row Postgres table is a full table rewrite under exclusive lock.
Collects list

Risk & Security Review

Review SCA findings for new dependencies
- Pull the Snyk or Dependabot report for the release branch. Triage anything CVSS 7+ before merge; defer lower-severity transitive findings only if the package is not on a request path. Document the deferral in the ticket.
Threat-model new external endpoints
- For any new public API or webhook, walk OWASP Top 10 with the AppSec partner: authn/authz coverage, input validation, rate-limit posture, PII in logs. A 30-minute pairing session is enough for incremental changes.
Document the rollback decision criteria
- Spell out the thresholds that trigger an automatic rollback: error-rate spike above baseline, p99 latency regression, customer-support volume. Without pre-agreed thresholds, the call gets argued in Slack at 2am.

Database Migration Planning

Confirm the migration is reversible
- Either the migration has a tested down-migration, or there is a written forward-fix plan if rollback after schema change is impossible. Dropping a column you just added is reversible; populating it from a third-party API and then deleting that column is not.
Plan the backfill in batches
- Backfills that run as a single UPDATE block writes for the duration. Chunk by primary key (e.g., 5,000 rows per batch with a 200ms sleep) and watch replication lag during the rehearsal. Use CREATE INDEX CONCURRENTLY for new indexes on Postgres.
Rehearse migration on a production-sized clone
- Restore the latest production snapshot to a staging RDS instance and run the migration end-to-end. Capture wall-clock time and lock duration; if the rehearsal takes longer than the agreed deploy window, split the migration into a pre-deploy step.

Capacity & Resource Planning

Confirm release captain and on-call coverage
- Check PagerDuty for the deploy window and the four hours after. Release captain plus primary on-call must both be online; a vacation collision is a common reason hotfixes get botched.
Reserve the deploy window on the team calendar
- Block Tuesday 10am–12pm (or your team's standard window). Avoid Fridays and the day before a holiday — if something breaks Saturday morning, the on-call carries it alone.
Verify staging matches production configuration
- Diff the Terraform state for environment-specific variables: instance sizes, secrets, third-party API endpoints, feature-flag defaults. "It worked in staging" failures usually trace back to an undocumented config drift.

QA & Test Strategy

Run the full e2e suite against the release candidate
- Cut the release branch, tag -rc.1, and run Playwright/Cypress in the staging pipeline. Investigate every red — "just rerun, it's flaky" is how real regressions slip through. Open tickets for any flakes you defer.
Identify regression-risk areas for manual smoke testing
- Map the merged PRs to user-facing surfaces. Auth, billing, and primary checkout paths get manual smoke regardless of automated coverage; secondary surfaces only if a PR touched them.
Capture QA sign-off on the release candidate
- QA lead records the verdict, attaches the test report, and notes any deferred bugs. "Pass with notes" is allowed for cosmetic issues with a tracked follow-up; functional regressions are Fail.
Collects list Collects paragraph
Halt release and schedule a fix cycle
- QA failed the candidate. Notify stakeholders the release is slipping, file fix tickets at SEV2 or above, and schedule a follow-up build for the next release window. Do not let pressure push a known-failing build to production.

Communication & Release Notes

Draft customer-facing release notes
- Generate the draft from changelog entries, then strip internal jargon. Group by feature/fix/breaking change. Anything marked breaking needs a migration note for API consumers and a 90-day deprecation window if it changes existing endpoints.
Brief support on customer-visible changes
- 15-minute walkthrough with the support lead. Cover what changed, expected ticket categories, and the engineer to escalate to. Without this brief, support backlog spikes the day after release.
Post the release window in #engineering
- Announcement covers: deploy date and time, scope summary, release captain, rollback contact, link to the runbook. Pin it. Lock main to release-blocking PRs only during the window.

Deployment & Rollback Planning

Verify the previous container image is in the registry
- Pull-test the prior version's image tag from ECR/GCR. Image lifecycle policies sometimes prune the very tag you need to roll back to — discovering this during an incident is too late.
Walk through the rollback runbook end-to-end
- Read the runbook out loud with the on-call engineer. Confirm every command works against the current cluster — kubectl contexts, deploy script flags, feature-flag kill switches. Update anything stale before deploy day.
Collects file
Tag the release candidate sha
- Apply the semver tag (e.g., v2024.45.0) to the candidate sha and push. CI builds the final artifact from the tag — never deploy from a moving branch ref.

Monitoring & Support Readiness

Confirm dashboards cover the new services
- Open the Datadog/Grafana service dashboards. New endpoints need RED-method panels (rate, errors, duration). A new service shipping without a dashboard is a guaranteed blind spot during the post-deploy window.
Set SLO alert thresholds for new endpoints
- Establish baseline p99 latency and error-rate burn-rate alerts. Route to the on-call rotation in PagerDuty, not a dead Slack channel — the cert-renewal failure pattern lives here too.
Brief on-call on the hotfix process
- Walk the on-call through: who decides hotfix vs. rollback, the cherry-pick branch convention, and the abbreviated review path for SEV1 fixes. New on-call members get this brief every release; veterans skim.