Software Upgrade Checklist

Steps a sysadmin or MSP engineer follows to plan, execute, and validate a production software upgrade — from change-advisory approval through post-upgrade monitoring and rollback if needed.

Use this workflow

Pre-Upgrade Preparation

Review the vendor release notes
- Read the vendor's release notes, KB articles, and known-issues page for the target version. Flag breaking changes, deprecated features, and any prerequisite intermediate versions. Vendors frequently bury config-file format changes and minimum OS / .NET / JDK requirements in the fine print.
Verify compatibility with dependent systems
- Check OS, database, .NET / JDK / Python runtime, and integration partner versions against the new release's compatibility matrix. Confirm any agents (EDR, RMM, backup) are supported on the new version — CrowdStrike and Veeam compatibility lags major OS releases by weeks.
File the change request with the CAB
- Submit the RFC in ServiceNow / ConnectWise / Jira Service Management with scope, risk classification, maintenance window, rollback plan, and named approvers. Standard changes can use a pre-approved template; normal changes need CAB review before scheduling.
Collects list Collects text
Schedule the maintenance window
- Confirm the window falls outside any change-freeze (quarter-close, retail peak, payroll runs) and aligns with the application owner's blackout calendar. Block calendars for on-call, DBA, and network engineer if their involvement is required.
Notify stakeholders and end users
- Send notice to the affected distribution list with start time, expected duration, services impacted, and the status-page URL. For MSP clients, include the vCIO or account manager. Send a 24-hour reminder the day before.
Run a verified backup of affected systems
- Trigger a Veeam / Datto / native snapshot of the application server, database, and config volumes. Confirm the backup completes successfully and capture the restore point ID — do not rely on last night's scheduled job. For VMs, also take a hypervisor-level snapshot at the start of the window for fast rollback.
Collects text
Document the rollback plan
- Write the named rollback steps: restore from snapshot, revert database schema, redirect DNS / load balancer back to prior nodes. Define the go/no-go decision time and the named decision-maker. A rollback plan that says "restore from backup" without specifying which backup, who runs it, and how long it takes is not a plan.

Test Ring Validation

Apply the upgrade in the test environment
- Deploy to the dev or QA ring first using the same installer and runbook the production change will use. Capture installer exit codes and timing — production estimates come from this run, not the vendor's quoted duration.
Run the application owner's smoke tests
- Have the app owner or QA execute the documented smoke-test script: login, top three workflows, integrations to upstream / downstream systems, scheduled-job execution. Block production deployment if any test fails.
Collects list
Investigate test failures before proceeding
- If smoke tests failed, open a vendor support case with installer logs and stack traces before re-scheduling. Do not push to production hoping the issue is environment-specific — that's how Friday afternoon outages happen.

Production Upgrade Execution

Suspend monitoring alerts and on-call paging
- Place the affected hosts in maintenance mode in PRTG / Datadog / LogicMonitor and silence the PagerDuty service so the NOC isn't paged on every health-check failure. Set a timer — re-enable at the end of the window even if you forget.
Disable scheduled tasks and cron jobs
- Stop Windows Task Scheduler jobs, cron entries, and any RMM-scheduled scripts that touch the application or its database. A backup job firing mid-upgrade locks files and corrupts the install.
Stop the application and dependent services
Take a pre-upgrade VM snapshot
- Trigger an ESXi / Hyper-V / Proxmox snapshot with services stopped — this is the fastest rollback option. Tag the snapshot with the change ticket number; schedule deletion after 7 days of stable running.
Apply the upgrade package
- Run the installer with logging enabled (msiexec /l*v, install.log, etc.). Watch for prompts that the runbook didn't anticipate — silently accepting an EULA change or a database schema migration without reading it is how upgrades break.
Review the installer logs for errors

Collects list Collects file
Restart application services and verify startup
- Start services in dependency order — database first, then application tier, then web / load-balanced front-end. Tail the application log during startup; a service that reports "running" while throwing connection errors in the log is not actually up.

Rollback (If Upgrade Failed)

Execute the documented rollback plan
- Revert the VM snapshot or restore from the verified backup captured pre-upgrade. Bring services back in dependency order and run smoke tests against the restored version before declaring rollback complete.
Notify stakeholders of the rollback
- Send the same distribution list a status update: rollback complete, current version, root cause under investigation, next attempt TBD. Update the change ticket with rollback timestamp and reason code.
Open a vendor support case
- File with the vendor including installer logs, environment details, and exact failure mode. Get a case number before re-scheduling — the next attempt should not repeat the same path without vendor input.

Post-Upgrade Validation

Run smoke tests against production
- Execute the same smoke-test script used in the test ring: login, top workflows, upstream and downstream integrations. Confirm the application reports the new version number in the about screen or /version endpoint.
Re-enable scheduled tasks and monitoring
- Take hosts out of maintenance mode in the monitoring tool, un-silence PagerDuty, and re-enable cron / Task Scheduler jobs. Confirm the next scheduled job actually fires — disabled flags sometimes don't toggle back cleanly.
Watch dashboards for performance anomalies
- Stay on the Datadog / PRTG / LogicMonitor dashboard for at least 60 minutes post-cutover. Watch CPU, memory, disk I/O, query latency, and error rates against the prior week's baseline. New versions often regress on a single metric (memory leak, slow query) that only shows up under real load.
Update IT Glue or Hudu documentation
- Update the configuration record with the new version, upgrade date, change ticket reference, and any config changes (new ports, new service accounts, schema changes). MSPs especially: stale documentation is what makes the next on-call engineer fail.
Close the change ticket and notify stakeholders

Collects list Collects paragraph
Hold a 7-day post-upgrade health review
- Review tickets, monitoring trends, and user feedback from the week post-upgrade. Delete the rollback snapshot if the system is stable. Capture lessons learned in the runbook so the next upgrade is shorter.