Software Update Checklist

Steps an engineering team runs to plan, test, deploy, and verify a software update on a production system. Covers pre-flight checks, staging validation, production rollout, and post-deploy monitoring.

6 sections, 24 steps

Planning and Preparation

Review release notes and breaking changes
- Read the upstream changelog and CHANGELOG.md entries for every version between current and target. Flag deprecations, config changes, and required migrations. For semver major bumps, expect breaking changes; for minor/patch, watch for security fixes you need to call out to support.
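As a rough sketch of the semver triage described above, the helper below classifies a version jump as major, minor, or patch. The names `parse_version` and `classify_bump` are illustrative, and the parsing assumes plain MAJOR.MINOR.PATCH strings; pre-release suffixes would need extra handling.

```python
def parse_version(version: str) -> tuple[int, int, int]:
    """Parse 'v1.2.3' or '1.2.3' into (1, 2, 3). Assumes plain MAJOR.MINOR.PATCH."""
    major, minor, patch = version.lstrip("v").split(".")
    return int(major), int(minor), int(patch)


def classify_bump(current: str, target: str) -> str:
    """Return 'major', 'minor', 'patch', or 'none' for the jump current -> target."""
    cur, tgt = parse_version(current), parse_version(target)
    if tgt[0] != cur[0]:
        return "major"  # expect breaking changes
    if tgt[1] != cur[1]:
        return "minor"
    if tgt[2] != cur[2]:
        return "patch"
    return "none"
```

A "major" result is the cue to read every intermediate changelog entry for breaking changes before scheduling the window.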
Confirm rollback plan and prior image retention
- Verify the previous container image is still in ECR/GCR and not subject to lifecycle pruning. If a DB migration is included, confirm it is reversible, or document the forward-fix plan. An untested rollback is a common reason a release night turns into an outage.
Verify infrastructure and dependency requirements
- Check that runtime versions (Node, Python, JVM, Postgres), Kubernetes API versions, and Terraform provider constraints meet the new release's minimums. A surprise minimum-Postgres bump in a point release has eaten more than one release window.
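One way to make this check repeatable is to compare installed versions against the release's stated minimums. The sketch below assumes both are available as simple name-to-version dictionaries with dotted numeric versions; the function names and the input format are hypothetical, not taken from any particular tool.

```python
def version_tuple(version: str, width: int = 3) -> tuple[int, ...]:
    """Turn '14.9' into (14, 9, 0) so that '15' and '15.0' compare as equal."""
    parts = [int(p) for p in version.split(".")]
    return tuple(parts + [0] * (width - len(parts)))


def unmet_requirements(installed: dict[str, str], minimums: dict[str, str]) -> list[str]:
    """Return one human-readable message per requirement that is not satisfied."""
    problems = []
    for name, minimum in minimums.items():
        have = installed.get(name)
        if have is None:
            problems.append(f"{name}: not found (need >= {minimum})")
        elif version_tuple(have) < version_tuple(minimum):
            problems.append(f"{name}: {have} < required {minimum}")
    return problems
```

An empty list means every listed minimum is met; anything else is worth resolving before the deploy window is booked.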
Snapshot the production database
- Trigger an out-of-band RDS snapshot (or equivalent) in addition to the nightly backup. Record the snapshot identifier so the on-call engineer can restore from a known point if the migration goes sideways.
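For teams on AWS, the snapshot step could be scripted roughly as follows. This sketch assumes `rds_client` is a boto3 RDS client (for example `boto3.client("rds")`); the `predeploy-` naming scheme, the function name, and the instance name in the usage note are illustrative choices.

```python
from datetime import datetime, timezone


def take_predeploy_snapshot(rds_client, instance_id: str, release: str) -> str:
    """Start a manual snapshot and return its identifier for the change record."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%d-%H%M%S")
    # RDS snapshot identifiers allow letters, digits, and hyphens, so dots are replaced.
    safe_release = release.replace(".", "-")
    snapshot_id = f"predeploy-{instance_id}-{safe_release}-{stamp}"
    rds_client.create_db_snapshot(
        DBSnapshotIdentifier=snapshot_id,
        DBInstanceIdentifier=instance_id,
    )
    return snapshot_id
```

Calling `take_predeploy_snapshot(boto3.client("rds"), "orders-db", "v2024.45.0")` returns the identifier to paste into the change record. Snapshot creation is asynchronous, so it is worth confirming the snapshot reports as available before starting the migration.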

Testing and Quality Assurance

Run the full unit and integration suite
- Trigger CI on the release-candidate tag (e.g., v2024.45.0-rc.1). Investigate any flaky tests rather than re-running blindly — a habit of "just rerun it" hides real regressions.
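To make flakiness visible instead of rerunning until green, a team might compare outcomes across CI runs of the same commit. The sketch below assumes each run has already been reduced to a mapping of test name to pass/fail; that input shape and the function name are assumptions.

```python
from collections import defaultdict


def find_flaky_tests(runs: list[dict[str, bool]]) -> list[str]:
    """Return tests whose outcome differed across runs of the same commit."""
    outcomes = defaultdict(set)
    for run in runs:
        for test_name, passed in run.items():
            outcomes[test_name].add(passed)
    # A test that both passed and failed on identical code is flaky.
    return sorted(name for name, seen in outcomes.items() if len(seen) > 1)
```

Tests that fail in every run are not returned here; those are consistent failures and should block the release candidate outright.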
Execute end-to-end tests against staging
- Deploy the RC to staging and run Playwright/Cypress suites against it. Cover the critical user paths: signup, auth, checkout, primary CRUD flows. Staging should mirror prod config — environment drift breaks this gate.
QA smoke-test critical user paths
- Manual exploratory pass by QA on the changed surfaces. The release-notes review in the prep phase determines what gets exercised here. Any blocking defects gate the release.
Run a security and dependency scan
- Run Snyk / Dependabot / Trivy against the RC build. Triage any new high or critical CVEs introduced by the update. SBOM diff against the prior release goes into the change record for SOC 2 evidence.
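The SBOM diff mentioned above can be approximated with a small comparison, assuming each SBOM has already been flattened to a package-to-version dictionary. Real SBOM formats such as CycloneDX or SPDX carry more fields; the flattening step and the function name are assumptions here.

```python
def diff_sbom(previous: dict[str, str], candidate: dict[str, str]) -> dict[str, list[str]]:
    """Summarize packages added, removed, or re-versioned between two releases."""
    added = sorted(f"{n}@{v}" for n, v in candidate.items() if n not in previous)
    removed = sorted(f"{n}@{v}" for n, v in previous.items() if n not in candidate)
    changed = sorted(
        f"{n}: {previous[n]} -> {candidate[n]}"
        for n in previous.keys() & candidate.keys()
        if previous[n] != candidate[n]
    )
    return {"added": added, "removed": removed, "changed": changed}
```

The resulting summary is small enough to attach directly to the change record next to the scanner output.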

Pre-Deploy Coordination

Confirm no active SEV1 or SEV2 incidents
- Check PagerDuty / Incident.io for open incidents. Don't ship into an active production fire: even an unrelated SEV2 can mask new symptoms introduced by the deploy.
Notify stakeholders of the deploy window
- Post in #engineering and #customer-support: deploy window, scope of changes, named release captain, named on-call. Update the status page if the change is user-visible or carries downtime risk.
Confirm release captain and on-call coverage
- Release captain drives the deploy; primary on-call holds the pager. Both must be available for the full deploy plus a 60-minute monitoring window. No solo deploys.

Production Deploy

Apply the database migration
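- Run the migration ahead of the application deploy so the new schema is live before code that depends on it. The migration must stay compatible with the currently running version, which keeps serving traffic until the rollout completes. Use CREATE INDEX CONCURRENTLY for index work, batched backfills with sleeps, and watch replication lag throughout. On large tables, avoid ADD COLUMN ... DEFAULT where it forces a table rewrite (PostgreSQL before version 11, or any volatile default); split the change into add-column, backfill, set-default instead.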
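The "batched backfills with sleeps" pattern could look roughly like the sketch below. It uses Python's built-in sqlite3 module so the example is self-contained; against production Postgres the same loop would run through your usual driver. The `orders` table, the `status` column, and the default value are hypothetical.

```python
import sqlite3
import time


def backfill_in_batches(conn, batch_size: int = 1000, pause_seconds: float = 0.5) -> int:
    """Fill NULL `status` values in small batches, pausing between commits."""
    total = 0
    while True:
        cursor = conn.execute(
            "UPDATE orders SET status = 'active' "
            "WHERE id IN (SELECT id FROM orders WHERE status IS NULL LIMIT ?)",
            (batch_size,),
        )
        conn.commit()  # short transactions keep locks brief
        if cursor.rowcount == 0:
            return total  # nothing left to backfill
        total += cursor.rowcount
        time.sleep(pause_seconds)  # give replicas time to catch up
```

Batch size and pause length are tuning knobs: if replication lag climbs during the backfill, shrink the batch or lengthen the pause.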
Deploy backend canary at 5 percent
- Route 5% of traffic to the new version via the canary deployment in ArgoCD / the load balancer. Hold for 10 minutes minimum and watch error rate, p99 latency, and saturation on the canary pods specifically — not just aggregate dashboards.
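A canary gate like the one described above can be expressed as a comparison between canary-pod metrics and the stable baseline. In this sketch the thresholds (0.5 percentage points of extra errors, 20% extra p99 latency) are placeholder assumptions, and the metrics are assumed to come from whatever monitoring system the team already uses.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    error_rate: float       # fraction of failed requests, e.g. 0.002 means 0.2%
    p99_latency_ms: float


def canary_is_healthy(
    canary: Metrics,
    baseline: Metrics,
    max_error_delta: float = 0.005,
    max_latency_ratio: float = 1.2,
) -> bool:
    """Compare canary pods against the stable fleet, not against aggregate dashboards."""
    error_ok = canary.error_rate - baseline.error_rate <= max_error_delta
    latency_ok = canary.p99_latency_ms <= baseline.p99_latency_ms * max_latency_ratio
    return error_ok and latency_ok
```

Evaluating this at the end of the hold period gives the release captain an explicit, recorded go/no-go instead of a judgment call made from a dashboard.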
Roll out backend to 100 percent
- Step through 25% → 50% → 100% with a few minutes between increments to let metrics stabilize. Backend goes fully out before frontend so the frontend can rely on new API contracts.
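The stepped rollout can be sketched as a loop that raises the traffic weight, waits, and checks health before continuing. The `set_weight` and `is_healthy` callables are hypothetical hooks into your deployment tooling and monitoring; the three-minute soak is an assumed reading of "a few minutes".

```python
import time
from typing import Callable, Sequence


def staged_rollout(
    set_weight: Callable[[int], None],
    is_healthy: Callable[[], bool],
    steps: Sequence[int] = (25, 50, 100),
    soak_seconds: float = 180.0,
) -> bool:
    """Walk traffic through each step; stop at the first unhealthy one."""
    for weight in steps:
        set_weight(weight)
        time.sleep(soak_seconds)  # let metrics stabilize before judging the step
        if not is_healthy():
            return False  # caller decides whether to roll back
    return True
```

A `False` return leaves traffic at the failing step so the release captain can inspect it briefly before triggering the rollback.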
Deploy the frontend bundle
- Push the new frontend artifact to CloudFront / Vercel / Netlify and invalidate the CDN cache. Confirm the new asset hashes are being served before declaring the deploy complete.
Roll back the deploy
- Triggered when the canary is degraded. Re-deploy the previous container image, revert the migration if reversible (or apply the documented forward-fix), and post the abort to #engineering. File a ticket capturing the failure mode for the post-incident review.

Verification and Monitoring

Run the production smoke test
- Execute the synthetic user journey against production: signup, login, primary action, logout. A green canary plus a green smoke test is the gate for declaring the deploy successful.
Watch the golden-signals dashboard for 30 minutes
- Latency (p50, p95, p99), traffic, errors, saturation. Compare to the same time of day in the previous week, not just to the prior hour. Track errors as a rate (errors per request) rather than a raw count: when traffic is also up, a 10% rise in the error rate is easy to mistake for normal traffic growth.
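A small worked example shows why the week-over-week comparison should use rates. The request and error counts below are made up for illustration, and the helper names are assumptions.

```python
def error_rate(errors: int, requests: int) -> float:
    """Errors as a fraction of requests; 0.0 when there was no traffic."""
    return errors / requests if requests else 0.0


def relative_change(current: float, baseline: float) -> float:
    """Fractional change versus baseline, e.g. 0.10 means 10% higher."""
    if baseline == 0:
        return float("inf") if current > 0 else 0.0
    return (current - baseline) / baseline


# Same hour last week versus now (illustrative numbers).
last_week = error_rate(errors=500, requests=100_000)   # 0.5% of requests failed
this_week = error_rate(errors=660, requests=120_000)   # 0.55% of requests failed
rate_change = relative_change(this_week, last_week)    # roughly a 10% rise in the rate
```

Here traffic grew 20% and raw errors grew 32%, which is easy to wave away as "more traffic, more errors"; normalizing by request count exposes the roughly 10% regression in the rate itself.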
Check Sentry for new error signatures
- Filter Sentry / Bugsnag to the new release tag. New error fingerprints, even at low volume, are the early signal — investigate before they become a spike.
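Spotting new fingerprints amounts to a set difference between errors seen on the new release and those already known. The sketch below assumes fingerprints and counts have been exported from the error tracker into plain Python structures; it does not call any Sentry or Bugsnag API, and the fingerprint strings in the usage note are invented.

```python
def new_fingerprints(release_errors: dict[str, int], known: set[str]) -> list[tuple[str, int]]:
    """Return (fingerprint, count) pairs first seen on this release, loudest first."""
    fresh = [(fp, count) for fp, count in release_errors.items() if fp not in known]
    return sorted(fresh, key=lambda item: (-item[1], item[0]))
```

For example, with a known set containing `"TimeoutError:/api/orders"`, a release export that also includes `"TypeError:/api/checkout"` at a count of 3 would surface only the latter, even though its volume is far too low to move an error-rate graph.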
Confirm customer support volume is steady
- Ping the support lead. A spike in inbound tickets after a deploy, even without an error-rate change, is a sign that something user-visible regressed. Cosmetic and copy bugs rarely page; users report them.

Wrap-Up

Tag the release and publish notes
- Promote the RC tag to the final release tag (e.g., v2024.45.0). Update the public changelog and post the release summary to #engineering with the deployed sha.
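Deriving the final tag from the release-candidate tag can be automated so the two never drift. This sketch assumes tags follow the `vX.Y.Z-rc.N` shape used in the examples in this checklist; the regular expression and function name are illustrative.

```python
import re

# Matches tags such as v2024.45.0-rc.1 and captures the final-release part.
RC_TAG = re.compile(r"^(?P<final>v\d+\.\d+\.\d+)-rc\.\d+$")


def final_tag_for(rc_tag: str) -> str:
    """Return the final release tag for an RC tag, or raise if the input is not an RC."""
    match = RC_TAG.match(rc_tag)
    if match is None:
        raise ValueError(f"not a release-candidate tag: {rc_tag!r}")
    return match.group("final")
```

Rejecting anything that is not an RC tag guards against accidentally "promoting" a tag that was never tested as a release candidate.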
Update runbooks and architecture docs
- Reflect any new operational behavior in the service runbook: new env vars, new alerts, new dashboards, deprecated endpoints. Stale runbooks are a SOC 2 finding and an on-call tax.
Capture the change record for audit
- Attach the approved PR list, QA sign-off, deploy log, and rollback evidence (if any) to the change ticket in Jira / Linear. Vanta / Drata pulls from this for SOC 2 change-management evidence.
Schedule a post-incident review
- Triggered when the deploy was rolled back or caused a customer-visible incident. Hold a blameless PIR within five business days, with action items tracked to closure in Jira. First impressions of contributing factors are often wrong; second-order causes tend to surface only once the timeline is written down.
