Monthly Server Maintenance Checklist

Run Setup and Server Identification

    Record the FQDN as it appears in DNS plus the role tag from the CMDB (web, app, db, file, hypervisor, AD DC). Mismatches between DNS and the CMDB are a common reason a maintenance window touches the wrong host.
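
    The DNS-versus-CMDB comparison can be scripted before the window opens. The following is a minimal sketch, not a reference implementation: the CMDB record shape (`fqdn`, `role`, `ip` keys) and the function name `identify_server` are assumptions for illustration, and a real CMDB export will look different.

```python
import socket

# Role tags listed in this checklist.
VALID_ROLES = {"web", "app", "db", "file", "hypervisor", "AD DC"}


def identify_server(fqdn, cmdb_entry, resolve=socket.gethostbyname):
    """Return a list of mismatches between DNS and a CMDB record.

    cmdb_entry is a hypothetical dict with "fqdn", "role" and "ip" keys.
    resolve can be swapped for a stub when testing offline.
    """
    problems = []
    if cmdb_entry.get("fqdn", "").lower() != fqdn.lower():
        problems.append(f"CMDB FQDN {cmdb_entry.get('fqdn')!r} != {fqdn!r}")
    if cmdb_entry.get("role") not in VALID_ROLES:
        problems.append(f"unknown role tag {cmdb_entry.get('role')!r}")
    try:
        dns_ip = resolve(fqdn)
    except OSError as exc:  # socket.gaierror is a subclass of OSError
        problems.append(f"DNS lookup failed: {exc}")
    else:
        if dns_ip != cmdb_entry.get("ip"):
            problems.append(
                f"DNS resolves to {dns_ip}, CMDB lists {cmdb_entry.get('ip')}"
            )
    return problems
```

    An empty list means DNS and the CMDB agree; anything else is worth resolving before the window starts.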

    Capture the primary management IP and the out-of-band (OOB) management address (iLO, iDRAC, or IPMI). If the host goes unresponsive after a kernel patch, the OOB address is what gets you back in without a trip to the data center.

    Verify the RFC is approved by CAB, the blackout calendar is clear, and stakeholders have been notified. A maintenance window run without an approved change ticket is itself an audit finding under SOC 2 CC8.1 and ISO 27001 A.12.1.2 (renumbered A.8.32 in the 2022 edition).

Backup and Restore Verification

    Open the backup console (Veeam, Commvault, Datto, Rubrik, or AWS Backup) and confirm the most recent job ran to completion with no warnings. A green job with skipped files is not a successful backup — drill into the per-file log.

    Restore a representative file or VM to an isolated location and validate it opens cleanly. Backups that have never been restored are not backups; this is the step that catches silent corruption before it matters.

    Validate the 3-2-1 chain (three copies, on two media types, with one offsite) and confirm that at least one immutable or air-gapped copy meets the documented retention. Ransomware playbooks assume an immutable tier exists; confirm it does for this server.
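
    The 3-2-1 check (three copies, two media types, one offsite) plus the immutable-copy requirement reduces to four conditions. This is a sketch under stated assumptions: the `BackupCopy` fields and the `check_321` name are hypothetical, the copy count is assumed to include the production data, and the inventory would have to be assembled by hand or from the backup console's own reporting.

```python
from dataclasses import dataclass


@dataclass
class BackupCopy:
    media: str            # e.g. "disk", "tape", "object-storage"
    offsite: bool
    immutable: bool       # immutable or air-gapped
    retention_days: int


def check_321(copies, required_retention_days):
    """Return the failed conditions; an empty list means the chain passes."""
    failures = []
    if len(copies) < 3:
        failures.append("fewer than 3 copies")
    if len({c.media for c in copies}) < 2:
        failures.append("fewer than 2 media types")
    if not any(c.offsite for c in copies):
        failures.append("no offsite copy")
    if not any(
        c.immutable and c.retention_days >= required_retention_days
        for c in copies
    ):
        failures.append("no immutable copy meeting retention")
    return failures
```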

    If the backup or the test restore fails, do not proceed into the patching window. Open a P2 in the PSA, page the on-call backup engineer, and document the rollback plan if patching has to continue without a fresh restore point.

OS, Firmware, and Application Patching

    Pull pending patches from WSUS / SCCM / Action1 / Automox and compare against the CISA Known Exploited Vulnerabilities catalog plus EPSS scores. CVSS alone misranks priorities — a CVSS 7.5 on the KEV list outranks a CVSS 9.8 with no observed exploitation.
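
    The ranking rule described here (KEV membership first, then EPSS, then CVSS) fits in a single sort key. A minimal sketch, assuming each pending patch has already been reduced to a dict with `cve`, `epss`, and `cvss` keys and that the KEV catalog has been loaded into a set of CVE IDs; both shapes and the CVE IDs in the usage note are placeholders, not real export formats or real vulnerabilities.

```python
def rank_patches(patches, kev_ids):
    """Order patches: KEV-listed first, then by EPSS, then by CVSS.

    patches: list of dicts with "cve", "epss" (0..1) and "cvss" (0..10).
    kev_ids: set of CVE IDs taken from the KEV catalog.
    """
    return sorted(
        patches,
        # Tuples compare left to right, and True sorts above False,
        # so KEV membership dominates the two scores.
        key=lambda p: (p["cve"] in kev_ids, p["epss"], p["cvss"]),
        reverse=True,
    )
```

    With `kev_ids = {"CVE-0000-0002"}`, a patch for that CVE at CVSS 7.5 sorts ahead of a CVSS 9.8 patch whose CVE is not in the set, which is the ordering the paragraph above argues for.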

    Stage patches via the patch tool, take a pre-patch snapshot if the platform supports it, then apply OS, BIOS / firmware, and agent (EDR, RMM, monitoring) updates in that order. Reboot once at the end rather than after each — fewer reboots, fewer surprises.

    Update cPanel / Plesk / IIS / Apache / nginx and any application runtimes (Java, .NET, Node, Python). Vendor-managed control panels often lag the OS patch cadence and are a common foothold; check the vendor's CVE feed even if no automated update is queued.

    If a patch fails or destabilizes the host, restore the pre-patch snapshot or uninstall the offending update, capture the failure signature for the postmortem, and post in the change channel. Do not leave the server in a half-patched state: either it is at the new baseline or it is rolled back to the prior known-good state.

Capacity and Database Maintenance

    Run DBCC CHECKDB on SQL Server, pg_amcheck on Postgres, or the equivalent for MySQL / Oracle. Schedule the check during a low-traffic period; corruption found here is the reason the prior section verified backups first.

    Rebuild indexes above the 30% fragmentation threshold, vacuum / shrink as appropriate, and rotate or compress old logs. Watch for runaway temp tables and orphaned WAL / transaction log files that silently fill the volume.

    Pull the 30-day trend from Datadog / Prometheus / SolarWinds. Flag any volume above 80% capacity, sustained CPU above 70%, or memory pressure that triggered swap. The point is to catch the trend before next month's window, not to firefight today.
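
    The three thresholds above can be applied mechanically once the 30-day series has been exported. The sketch below assumes the data has already been pulled into plain Python values; it does not call any monitoring API. The function names are hypothetical, and "sustained" is interpreted as seven consecutive daily averages above the limit, which is an assumption to tune per environment.

```python
VOLUME_LIMIT_PCT = 80.0   # from the checklist
CPU_LIMIT_PCT = 70.0      # from the checklist
SUSTAINED_DAYS = 7        # assumption: what counts as "sustained"


def longest_run_above(values, limit):
    """Length of the longest consecutive run of values above limit."""
    longest = current = 0
    for value in values:
        current = current + 1 if value > limit else 0
        longest = max(longest, current)
    return longest


def capacity_flags(volume_used_pct, cpu_daily_avg_pct, swap_events):
    """Return human-readable flags for the monthly capacity review.

    volume_used_pct: dict of mount point or drive -> percent used.
    cpu_daily_avg_pct: list of daily CPU averages, oldest first.
    swap_events: count of swap-in/out incidents in the period.
    """
    flags = []
    for volume, used in sorted(volume_used_pct.items()):
        if used > VOLUME_LIMIT_PCT:
            flags.append(f"{volume}: {used:.0f}% used")
    if longest_run_above(cpu_daily_avg_pct, CPU_LIMIT_PCT) >= SUSTAINED_DAYS:
        flags.append(
            f"CPU above {CPU_LIMIT_PCT:.0f}% for {SUSTAINED_DAYS}+ consecutive days"
        )
    if swap_events > 0:
        flags.append(f"swap used on {swap_events} occasion(s)")
    return flags
```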

Security Review

    Rotate service account credentials via HashiCorp Vault, AWS Secrets Manager, or Azure Key Vault; never edit the password directly on the host. Service accounts last rotated years ago because "changing them breaks things" are the most common credential-theft target on a server.

    Run an authenticated Tenable, Qualys, or Rapid7 scan against the host and compare findings to last month. Unauthenticated scans miss most local privilege issues; the credentialed result is what feeds SOC 2 / PCI evidence.

    Pull the local Administrators / sudoers / wheel group and reconcile against the IdP entitlement. Review the last 30 days of CrowdStrike or Defender for Endpoint detections on the host; quietly suppressed alerts are a common audit finding.
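
    Reconciling the privileged group against the IdP is a two-way set difference. A minimal sketch, assuming both lists have already been exported as plain account names; the function name and the case-insensitive comparison are illustrative choices, not something the source prescribes.

```python
def reconcile_admins(local_members, idp_entitled):
    """Compare local privileged accounts with the IdP entitlement list.

    Returns accounts present on the host but not entitled ("unauthorized")
    and accounts entitled but absent from the host ("missing").
    Names are compared case-insensitively (an assumption).
    """
    local = {name.lower() for name in local_members}
    entitled = {name.lower() for name in idp_entitled}
    return {
        "unauthorized": sorted(local - entitled),
        "missing": sorted(entitled - local),
    }
```

    Anything under "unauthorized" is the actionable finding; "missing" entries are usually stale entitlements to clean up in the IdP.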

Hardware and Facility Inspection

    Walk the rack: confirm both power supplies show green, no amber fault LEDs on the chassis or drive carriers, and intake/exhaust airflow is unobstructed. A server with a redundant PSU pair that is running on a single supply is silently one failure away from an outage.

    Pull RAID status from the controller (PERC, SmartArray, MegaRAID) and SMART data from each disk. A degraded array running on parity may not survive a second disk loss; treat any non-Healthy state as urgent.

    If a drive has failed or is flagged as failing, open a Dell ProSupport / HPE / Lenovo case with the controller log bundle, order the replacement under warranty, and schedule a hot-swap window. Note the failed drive's serial and slot; replacing the wrong slot collapses the array.

    Read the rack-top temperature sensor and compare to the prior month. Cold aisle should be 64–80°F (18–27°C), the recommended range in the ASHRAE TC 9.9 thermal guidelines. A creeping baseline is the early signal of a CRAC unit needing service before it fails on a Saturday.
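
    The temperature check has two parts: an absolute range and a month-over-month drift. The sketch below encodes both. The 64–80°F bounds come from the checklist; the 2°F drift threshold and the function name are assumptions chosen for illustration and should be set from the site's own history.

```python
COLD_AISLE_MIN_F = 64.0   # from the checklist
COLD_AISLE_MAX_F = 80.0   # from the checklist


def assess_cold_aisle(current_f, prior_month_f, creep_threshold_f=2.0):
    """Return findings for a cold-aisle reading in degrees Fahrenheit.

    creep_threshold_f is a hypothetical default: the month-over-month
    rise at which the baseline is considered to be creeping.
    """
    findings = []
    if not COLD_AISLE_MIN_F <= current_f <= COLD_AISLE_MAX_F:
        findings.append(f"{current_f:.1f}F is outside 64-80F")
    if current_f - prior_month_f >= creep_threshold_f:
        findings.append(
            f"baseline rose {current_f - prior_month_f:.1f}F since last month"
        )
    return findings
```

    A reading that is still in range but has risen by the threshold or more is the case this step exists to catch.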

Sign-Off and Reporting

    Close the change ticket with the outcome, attach the patch report and vulnerability scan output, and link any P2 follow-ups opened during the window. This evidence pack is what SOC 2 and ISO 27001 auditors sample at the next review.