Software Upgrade Checklist

Steps a sysadmin or MSP engineer follows to plan, execute, and validate a production software upgrade — from change-advisory approval through post-upgrade monitoring and rollback if needed.

1

Pre-Upgrade Preparation

  1. Review the vendor release notes
    • Read the vendor's release notes, KB articles, and known-issues page for the target version. Flag breaking changes, deprecated features, and any prerequisite intermediate versions. Vendors frequently bury config-file format changes and minimum OS / .NET / JDK requirements in the fine print.

  2. Verify compatibility with dependent systems
    • Check OS, database, .NET / JDK / Python runtime, and integration partner versions against the new release's compatibility matrix. Confirm any agents (EDR, RMM, backup) are supported on the new version — CrowdStrike and Veeam compatibility lags major OS releases by weeks.

  3. File the change request with the CAB
    • Submit the RFC in ServiceNow / ConnectWise / Jira Service Management with scope, risk classification, maintenance window, rollback plan, and named approvers. Standard changes can use a pre-approved template; normal changes need CAB review before scheduling.

    Collects list Collects text
  4. Schedule the maintenance window
    • Confirm the window falls outside any change-freeze (quarter-close, retail peak, payroll runs) and aligns with the application owner's blackout calendar. Block calendars for on-call, DBA, and network engineer if their involvement is required.

  5. Notify stakeholders and end users
    • Send notice to the affected distribution list with start time, expected duration, services impacted, and the status-page URL. For MSP clients, include the vCIO or account manager. Send a 24-hour reminder the day before.

  6. Run a verified backup of affected systems
    • Trigger a Veeam / Datto / native snapshot of the application server, database, and config volumes. Confirm the backup completes successfully and capture the restore point ID — do not rely on last night's scheduled job. For VMs, also take a hypervisor-level snapshot at the start of the window for fast rollback.

    Collects text
  7. Document the rollback plan
    • Write the named rollback steps: restore from snapshot, revert database schema, redirect DNS / load balancer back to prior nodes. Define the go/no-go decision time and the named decision-maker. A rollback plan that says "restore from backup" without specifying which backup, who runs it, and how long it takes is not a plan.

2

Test Ring Validation

  1. Apply the upgrade in the test environment
    • Deploy to the dev or QA ring first using the same installer and runbook the production change will use. Capture installer exit codes and timing — production estimates come from this run, not the vendor's quoted duration.

  2. Run the application owner's smoke tests
    • Have the app owner or QA execute the documented smoke-test script: login, top three workflows, integrations to upstream / downstream systems, scheduled-job execution. Block production deployment if any test fails.

    Collects list
  3. Investigate test failures before proceeding
    • If smoke tests failed, open a vendor support case with installer logs and stack traces before re-scheduling. Do not push to production hoping the issue is environment-specific — that's how Friday afternoon outages happen.

3

Production Upgrade Execution

  1. Suspend monitoring alerts and on-call paging
    • Place the affected hosts in maintenance mode in PRTG / Datadog / LogicMonitor and silence the PagerDuty service so the NOC isn't paged on every health-check failure. Set a timer — re-enable at the end of the window even if you forget.

  2. Disable scheduled tasks and cron jobs
    • Stop Windows Task Scheduler jobs, cron entries, and any RMM-scheduled scripts that touch the application or its database. A backup job firing mid-upgrade locks files and corrupts the install.

  3. Stop the application and dependent services
  4. Take a pre-upgrade VM snapshot
    • Trigger an ESXi / Hyper-V / Proxmox snapshot with services stopped — this is the fastest rollback option. Tag the snapshot with the change ticket number; schedule deletion after 7 days of stable running.

  5. Apply the upgrade package
    • Run the installer with logging enabled (msiexec /l*v, install.log, etc.). Watch for prompts that the runbook didn't anticipate — silently accepting an EULA change or a database schema migration without reading it is how upgrades break.

  6. Review the installer logs for errors
    Collects list Collects file
  7. Restart application services and verify startup
    • Start services in dependency order — database first, then application tier, then web / load-balanced front-end. Tail the application log during startup; a service that reports "running" while throwing connection errors in the log is not actually up.

4

Rollback (If Upgrade Failed)

  1. Execute the documented rollback plan
    • Revert the VM snapshot or restore from the verified backup captured pre-upgrade. Bring services back in dependency order and run smoke tests against the restored version before declaring rollback complete.

  2. Notify stakeholders of the rollback
    • Send the same distribution list a status update: rollback complete, current version, root cause under investigation, next attempt TBD. Update the change ticket with rollback timestamp and reason code.

  3. Open a vendor support case
    • File with the vendor including installer logs, environment details, and exact failure mode. Get a case number before re-scheduling — the next attempt should not repeat the same path without vendor input.

5

Post-Upgrade Validation

  1. Run smoke tests against production
    • Execute the same smoke-test script used in the test ring: login, top workflows, upstream and downstream integrations. Confirm the application reports the new version number in the about screen or /version endpoint.

  2. Re-enable scheduled tasks and monitoring
    • Take hosts out of maintenance mode in the monitoring tool, un-silence PagerDuty, and re-enable cron / Task Scheduler jobs. Confirm the next scheduled job actually fires — disabled flags sometimes don't toggle back cleanly.

  3. Watch dashboards for performance anomalies
    • Stay on the Datadog / PRTG / LogicMonitor dashboard for at least 60 minutes post-cutover. Watch CPU, memory, disk I/O, query latency, and error rates against the prior week's baseline. New versions often regress on a single metric (memory leak, slow query) that only shows up under real load.

  4. Update IT Glue or Hudu documentation
    • Update the configuration record with the new version, upgrade date, change ticket reference, and any config changes (new ports, new service accounts, schema changes). MSPs especially: stale documentation is what makes the next on-call engineer fail.

  5. Close the change ticket and notify stakeholders
    Collects list Collects paragraph
  6. Hold a 7-day post-upgrade health review
    • Review tickets, monitoring trends, and user feedback from the week post-upgrade. Delete the rollback snapshot if the system is stable. Capture lessons learned in the runbook so the next upgrade is shorter.