Release Checklist

Pre-Release Preparation

    Branch off the latest green main commit and tag the candidate (e.g., v2024.45.0-rc.1). The release captain owns this step. If main is red, hold the cut until the failing build is investigated — don't branch from a broken tip.

    Diff merged PRs since the last release tag against the changelog. Customer-visible changes need a public note; internal refactors get an internal-only line. Missing changelog entries are the most common reason support gets blindsided post-release.
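
The comparison above can be sketched as a set difference. This is a minimal illustration, assuming changelog entries cite pull requests as `#<number>` and that the merged PR numbers have already been collected by some other means; the function name is hypothetical.

```python
import re


def missing_changelog_entries(merged_pr_numbers, changelog_text):
    """Return merged PR numbers that the changelog never mentions."""
    # Assumption: every changelog entry references its PR as "#<number>".
    referenced = {int(n) for n in re.findall(r"#(\d+)", changelog_text)}
    return sorted(set(merged_pr_numbers) - referenced)


changelog = """
- Fix invoice rounding (#412)
- Internal: refactor auth middleware (#415)
"""
print(missing_changelog_entries([412, 415, 419], changelog))  # prints [419]
```

Anything the function returns still needs a human decision on whether it gets a public note or an internal-only line.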

    Trigger the Playwright/Cypress suite against the release-candidate build deployed to staging. Investigate every failure — do not classify as flaky without a linked ticket. Rerunning until green hides real regressions.

    Confirm the previous container image still exists in ECR/GAR and has not been pruned. Walk the rollback procedure on paper: if the deploy fails at 50% traffic, what command brings us back? If a migration is in this release, the rollback path must reverse cleanly or be forward-only safe.

    Review the migrations folder for new entries since the last release tag. Flag any migration that adds a column with a default, drops a column, or renames a column on a table over 1M rows — these need a multi-step expand/contract pattern, not a single deploy.
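
A rough first pass over new migration files could look like the sketch below. It assumes plain SQL migrations, unqualified table names, and a row-count lookup supplied by the caller; the patterns and names are illustrative, and a regex scan is no substitute for reading the migration.

```python
import re

# Hypothetical patterns for the three risky operations named in this step.
RISKY_PATTERNS = {
    "add column with default": re.compile(r"ADD\s+COLUMN\b[^;]*\bDEFAULT\b", re.I),
    "drop column": re.compile(r"DROP\s+COLUMN\b", re.I),
    "rename column": re.compile(r"RENAME\s+COLUMN\b", re.I),
}
LARGE_TABLE_ROWS = 1_000_000


def flag_migration(sql, row_counts):
    """Return (table, reason) pairs for risky ALTERs on tables over 1M rows."""
    flags = []
    # Naive split: assumes no semicolons inside string literals.
    for statement in sql.split(";"):
        match = re.search(
            r"ALTER\s+TABLE\s+(?:IF\s+EXISTS\s+)?(\w+)", statement, re.I
        )
        if not match:
            continue
        table = match.group(1)
        if row_counts.get(table, 0) <= LARGE_TABLE_ROWS:
            continue
        for reason, pattern in RISKY_PATTERNS.items():
            if pattern.search(statement):
                flags.append((table, reason))
    return flags
```

Each flagged pair is a candidate for the expand/contract pattern rather than a single deploy.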

Release Day Pre-Deploy

    Check PagerDuty and the #incidents channel. Releasing on top of an active incident makes triage impossible — you can't tell whether new symptoms are from the release or the underlying issue. If a SEV is open, push the deploy to the next window.

    Both the release captain and the primary on-call engineer must be at keyboard for the full deploy window plus 30 minutes of post-deploy monitoring. Split-brain coverage (captain leaves at deploy, on-call takes over for monitoring) is how the first error spike gets missed.

    Drop the release window, scope summary, and rollback contact in #engineering. Cross-post to #customer-support so they can triage tickets that arrive during the window. Pin the message until wrap-up is complete.

    Enable the branch-protection lock or post the freeze in #engineering. Hotfixes for the in-flight release are fine; unrelated merges are not. The freeze ends after the wrap-up section completes.

Deploy

    Run the migration ahead of the application deploy so the new code lands on a schema that already supports it. Watch replication lag during the migration — if lag exceeds 30 seconds on the primary replica, pause and investigate before continuing. Backfills run in batches with sleeps, never as a single transaction.
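
The batching rule can be sketched as a loop that checks lag before every batch. The callables `apply_batch` and `get_replica_lag` are hypothetical hooks standing in for whatever the database tooling provides; only the 30-second threshold comes from this step.

```python
import time

MAX_LAG_SECONDS = 30  # threshold from the step above


def run_backfill(batches, apply_batch, get_replica_lag,
                 sleep_seconds=1.0, sleep=time.sleep):
    """Apply batches one at a time and stop if replication lag is too high.

    Returns the number of batches applied. All callables are hypothetical.
    """
    applied = 0
    for batch in batches:
        lag = get_replica_lag()
        if lag > MAX_LAG_SECONDS:
            # Stop for a human to investigate rather than retrying blindly.
            raise RuntimeError(
                f"replication lag {lag}s exceeds {MAX_LAG_SECONDS}s "
                f"after {applied} batches"
            )
        apply_batch(batch)  # each batch is expected to commit on its own
        applied += 1
        sleep(sleep_seconds)  # give replicas time to catch up
    return applied
```

Raising instead of sleeping and retrying keeps the "pause and investigate" decision with the release captain.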

    Promote the release image to the canary deployment and shift 5% of production traffic via the load balancer or service mesh. Hold here for 10 full minutes — early canary errors usually appear within 2 minutes, but slow leaks (memory, connection pool) need the longer window.

    Compare canary error rate, p99 latency, and saturation against the baseline pods. A canary error rate more than 2x baseline, or p99 latency degraded by more than 20%, fails the canary. New error signatures in Sentry that weren't present pre-deploy also fail it. Mark Fail and the rollback step will be triggered automatically.
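
The pass/fail thresholds above can be written down as a small gate function. This is a sketch under assumptions: the `Metrics` shape is invented for illustration, the numbers would come from the monitoring stack, and saturation is left out because this step gives no threshold for it.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    error_rate: float        # fraction of requests that failed
    p99_latency_ms: float
    error_signatures: frozenset = frozenset()


def canary_failures(canary, baseline):
    """Return the reasons the canary fails; an empty list means it passes."""
    reasons = []
    if canary.error_rate > 2 * baseline.error_rate:
        reasons.append("error rate more than 2x baseline")
    if canary.p99_latency_ms > 1.20 * baseline.p99_latency_ms:
        reasons.append("p99 latency degraded by more than 20%")
    new_signatures = canary.error_signatures - baseline.error_signatures
    if new_signatures:
        reasons.append(
            "new error signatures: " + ", ".join(sorted(new_signatures))
        )
    return reasons
```

Note that with a baseline error rate of zero, any canary error fails the gate, which may or may not be the intended behavior for low-traffic services.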

    If the canary fails, re-deploy the previous image tag verified during pre-release prep, shift canary traffic back to the stable pool, and confirm the error rate returns to baseline. If a migration shipped in this release, run the down-migration only after confirming the new code is fully drained; otherwise any new code still serving traffic hits a schema it no longer recognizes.

    Shift traffic in three steps with at least 5 minutes of dwell at each stage. Watch error rate and latency at each step; a regression at 50% is easier to roll back than at 100%. Don't skip stages even if the canary looked clean — load patterns at higher percentages can surface issues the canary missed.
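
A staged rollout loop might look like the sketch below. The 25/50/100 split is an assumption (this step only says three steps), and `set_traffic` and `is_healthy` are hypothetical hooks for the load balancer and the monitoring check.

```python
import time

STAGES = (25, 50, 100)   # assumed percentages, not taken from the checklist
DWELL_SECONDS = 5 * 60   # minimum dwell from the step above


def progressive_rollout(set_traffic, is_healthy, stages=STAGES,
                        dwell_seconds=DWELL_SECONDS, sleep=time.sleep):
    """Shift traffic stage by stage.

    Returns the percentage at which health checks failed, or None if every
    stage passed. Health is sampled once after the dwell; a real rollout
    would watch error rate and latency throughout the dwell.
    """
    for percent in stages:
        set_traffic(percent)
        sleep(dwell_seconds)
        if not is_healthy():
            return percent
    return None
```

Returning the failing stage instead of rolling back inside the loop leaves the rollback decision to the caller.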

    Deploy the backend first because the new backend is backward-compatible with the old frontend; the new frontend is not guaranteed to work against the old backend. Bust the CDN cache for the frontend bundle so users get the new client immediately. Confirm the build hash served from the CDN matches the deployed artifact.

    Execute the synthetic user journey against production: login, core CRUD path, billing read, logout. This catches the misconfigurations that staging doesn't have — production secrets, third-party webhooks, real DNS.

Post-Deploy Monitoring

    Watch the Datadog/Sentry dashboard, filtered to the post-deploy window. Triage new error signatures immediately: even one new exception class warrants a look before declaring the release stable. Cross-reference Sentry release tags against the deployed sha.

    Compare p50/p95/p99 against the trailing 7-day baseline at the same hour-of-day. A 20%+ p99 regression that holds for 15 minutes is a release issue, not a noise spike — file a hotfix ticket and consider rolling back if it persists.
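
The "holds for 15 minutes" condition can be checked as a run of consecutive samples. This sketch assumes one p99 sample per minute and a single baseline value for the same hour-of-day; both inputs would come from the metrics system.

```python
def sustained_p99_regression(p99_per_minute, baseline_p99,
                             threshold=0.20, hold_minutes=15):
    """True if p99 stays at least `threshold` above baseline for
    `hold_minutes` consecutive one-minute samples."""
    limit = baseline_p99 * (1 + threshold)
    streak = 0
    for value in p99_per_minute:
        # Any sample back under the limit resets the streak.
        streak = streak + 1 if value >= limit else 0
        if streak >= hold_minutes:
            return True
    return False
```

A spike that dips back under the limit resets the count, which is what separates a sustained regression from noise.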

    Check Zendesk/Intercom inbound rate against the same-hour baseline. Customer-reported regressions often arrive 10–20 minutes after deploy, after the first wave of users hits the new code. A 2x ticket spike with overlapping symptoms is a release signal.

    Flip flags that gate new functionality after the deploy is confirmed stable, not at deploy time, so that the effects of the deploy and of the flag can be told apart. Flip via LaunchDarkly/Statsig with a gradual rollout cohort if the feature is risky. Note any flag flipped here in the release summary.

    Post a Statuspage maintenance-completed note if you opened a maintenance window. Skip if the release was transparent to users. Don't post anything that contradicts the changelog — they get cross-referenced.

Wrap-Up

    Promote the rc tag to the final release tag (e.g., v2024.45.0-rc.1 → v2024.45.0). The deployed sha is what you'll need for the post-incident review if anything regresses next week — capture it now while it's fresh.
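
Deriving the final tag from the candidate tag is a small string operation. The sketch below assumes tags follow the `v<a>.<b>.<c>-rc.<n>` shape used in the examples; the function name is hypothetical, and creating the tag itself is left to whatever tooling the team uses.

```python
import re

# Assumed tag shape, based on the example v2024.45.0-rc.1.
RC_TAG = re.compile(r"^(v\d+\.\d+\.\d+)-rc\.\d+$")


def final_tag(rc_tag):
    """Strip the -rc.N suffix; reject anything that is not an rc tag."""
    match = RC_TAG.match(rc_tag)
    if not match:
        raise ValueError(f"not a release-candidate tag: {rc_tag!r}")
    return match.group(1)


print(final_tag("v2024.45.0-rc.1"))  # prints v2024.45.0
```

Rejecting malformed input guards against promoting a tag that was never a release candidate.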

    Customer-visible notes go to the public changelog (docs site or in-app). Internal-only changes stay in the engineering changelog. Attach the published notes here for the audit trail.

    Post in #engineering: tag, sha, deploy duration, any rollback or partial-rollback events, error-rate and p99 deltas vs. baseline, flags flipped. Unfreeze main in the same message.

    Anything weird seen during deploy or monitoring becomes a Jira/Linear ticket — even minor anomalies, even if they self-resolved. Patterns across releases only emerge if you write them down.

    If a rollback fired, p99 spiked, customer tickets surged, or the deploy ran more than 2x the median window, schedule a blameless retro within 5 business days. If the deploy was uneventful, skip this step.
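
The retro triggers reduce to a single boolean check. The parameter names below are illustrative, and the median deploy duration is assumed to be tracked somewhere from previous releases.

```python
def needs_retro(rollback_fired, p99_spiked, tickets_surged,
                deploy_minutes, median_deploy_minutes):
    """True if any of the retro triggers from this step applies."""
    ran_long = deploy_minutes > 2 * median_deploy_minutes
    return rollback_fired or p99_spiked or tickets_surged or ran_long
```

For example, a clean 50-minute deploy against a 20-minute median still triggers a retro because it ran more than twice as long as usual.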
