Monthly Server Maintenance Checklist
Recurring monthly maintenance pass run by a sysadmin or MSP NOC tech across server-room environment, OS patching, database health, network, security, and backup verification. The workflow is paced across a maintenance week with conditional remediation steps for failed backups ...
Server Room Environment
-
Verify temperature and humidity in the server closet
Pull the last 30 days of readings from the environmental sensor (APC NetBotz, Sensaphone, or whatever the room is using). Target band is 64-80°F (18-27°C) and 40-60% RH, following ASHRAE TC 9.9 guidance. Time spent outside the band correlates with disk failure rates, so note any sustained excursions in the run notes, not just the current reading.
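A minimal sketch of how sustained excursions could be separated from momentary spikes in an exported sensor log. The `(timestamp, temp_f, rh_pct)` tuple layout and the 30-minute `min_duration` are assumptions for illustration, not a NetBotz or Sensaphone export format.

```python
from datetime import datetime, timedelta

# Target band from the checklist step (degrees F and percent RH).
TEMP_F = (64.0, 80.0)
RH_PCT = (40.0, 60.0)


def in_band(temp_f, rh_pct):
    """True when both temperature and humidity sit inside the target band."""
    return TEMP_F[0] <= temp_f <= TEMP_F[1] and RH_PCT[0] <= rh_pct <= RH_PCT[1]


def sustained_excursions(readings, min_duration=timedelta(minutes=30)):
    """Return (start, end) pairs for out-of-band runs lasting at least min_duration.

    readings: iterable of (timestamp, temp_f, rh_pct), sorted by timestamp.
    min_duration is an assumed cutoff for "sustained"; tune it per site.
    """
    excursions = []
    start = last = None
    for ts, temp_f, rh_pct in readings:
        if not in_band(temp_f, rh_pct):
            if start is None:
                start = ts
            last = ts
        else:
            if start is not None and last - start >= min_duration:
                excursions.append((start, last))
            start = last = None
    # Close out a run that is still open at the end of the log.
    if start is not None and last - start >= min_duration:
        excursions.append((start, last))
    return excursions
```

A single out-of-band sample has zero duration and is dropped, which keeps the run notes focused on the excursions that matter.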
-
Test the UPS battery load and runtime
Run the self-test from the UPS management interface (PowerChute, Eaton IPM). Record the projected runtime at current load. Batteries degrade silently — a UPS that showed 22 minutes last quarter and shows 8 minutes today needs replacement before next cycle, not after the next outage.
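The quarter-over-quarter comparison is simple arithmetic; a rough sketch is shown here. The 50% `max_drop` threshold is an assumed policy value, not a vendor recommendation, and the comparison only makes sense when both self-tests ran at similar load.

```python
def runtime_drop(previous_min, current_min):
    """Fractional loss in projected runtime between two self-tests."""
    if previous_min <= 0:
        raise ValueError("previous runtime must be positive")
    return (previous_min - current_min) / previous_min


def needs_replacement(previous_min, current_min, max_drop=0.5):
    # max_drop is an assumed policy threshold, not a vendor figure.
    return runtime_drop(previous_min, current_min) >= max_drop
```

For the example in the step, a fall from 22 to 8 minutes is a loss of 14/22, or about 64% of projected runtime.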
Collects number
-
Inspect rack airflow and cooling fans
Walk the row. Check for blocked perforated tiles, missing blanking panels, and fans with audible bearing noise. Hot-aisle / cold-aisle separation matters more than raw CRAC capacity — a single missing blanking panel can recirculate hot exhaust into intake.
Operating System and Patching
-
Apply OS security patches in the maintenance window
Roll patches through the patch tool (Action1, Automox, WSUS, or NinjaOne) in waves: dev → staging → prod. Prioritize CVEs on the CISA KEV list and high-EPSS items, not just CVSS — a CVSS 9.8 with no known exploit waits behind a CVSS 7.4 that's actively being weaponized.
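One way to express that ordering is a sort key that puts KEV membership first, then EPSS, then CVSS. This is a sketch under assumptions: the `Finding` fields, the 0.5 EPSS cutoff, and the CVE labels in the usage note are hypothetical, and real data would come from your scanner or patch tool export.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    cve: str       # identifier as reported by the scanner
    cvss: float    # CVSS base score, 0-10
    epss: float    # EPSS exploitation probability, 0-1
    in_kev: bool   # listed in the CISA KEV catalog


def patch_priority(finding, epss_threshold=0.5):
    """Sort key: KEV items first, then high-EPSS items, then EPSS and CVSS descending.

    epss_threshold is an assumed cutoff for "high EPSS".
    """
    return (
        not finding.in_kev,
        finding.epss < epss_threshold,
        -finding.epss,
        -finding.cvss,
    )


def prioritize(findings):
    """Return findings in the order they should be patched."""
    return sorted(findings, key=patch_priority)
```

With three hypothetical findings, `Finding("CVE-EXAMPLE-1", 9.8, 0.02, False)`, `Finding("CVE-EXAMPLE-2", 7.4, 0.91, True)`, and `Finding("CVE-EXAMPLE-3", 6.5, 0.70, False)`, the KEV-listed 7.4 sorts first and the unexploited 9.8 sorts last.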
Collects list
-
Review system event logs for critical errors
Pull the last 30 days of Windows Event Log Critical/Error entries and Linux journalctl -p err --since "30 days ago" output for each host (or query the SIEM directly if logs are forwarded to Splunk, Datadog, or Elastic). Repeated hardware errors, such as disk SMART warnings and ECC memory corrections, often precede failure by weeks.
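Once the entries are exported, repeated hardware errors can be surfaced with a simple count per host and error type. The sketch below assumes events have already been normalized to `(host, signature)` pairs; the signature strings and the `min_count` of 3 are hypothetical.

```python
from collections import Counter


def repeat_offenders(events, min_count=3):
    """Return {(host, signature): count} for errors seen at least min_count times.

    events: iterable of (host, signature) pairs, one per log entry.
    """
    counts = Counter(events)
    return {key: n for key, n in counts.items() if n >= min_count}
```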
-
Reconcile configuration drift via Ansible or Terraform
Run a plan / dry-run against the IaC repo. Investigate any drift before applying — click-ops changes from prior incidents are common and shouldn't be silently reverted. Document the resolution: either codify the change in the repo or revert the host.
-
Reboot hosts in the maintenance window
Reboot in dependency order: web tier first, databases last. Confirm each service comes back via the monitoring system (Datadog, New Relic, Grafana) before moving to the next host. A clean reboot also surfaces services that were started by hand and left running but never registered with systemd / init.
Database Health
-
Run DBCC CHECKDB or pg_amcheck integrity scan
SQL Server: DBCC CHECKDB WITH PHYSICAL_ONLY on production, full CHECKDB on the most recent restore copy to avoid the I/O hit. Postgres: pg_amcheck or amcheck against B-tree indexes. Corruption is rare but silent — catching it monthly beats discovering it during a restore.
-
Verify nightly database backup jobs completed
Cross-check the backup tool's job history (Veeam, native SQL Agent, pgBackRest, Barman) against the expected schedule. Look for jobs that ran but produced suspiciously small files — that's the classic signature of a backup against an empty mount point.
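A rough sketch of the size check, comparing each run against the median size for the same job. The `(job_name, size_bytes)` input and the 0.5 `ratio` are assumptions; the data would have to be exported from whichever backup tool is in use.

```python
from statistics import median


def suspicious_backups(jobs, ratio=0.5):
    """Flag runs whose size is below ratio * median size for that job.

    jobs: list of (job_name, size_bytes) for runs that reported success.
    ratio is an assumed cutoff; an empty-mount backup is usually far smaller.
    """
    sizes_by_job = {}
    for name, size in jobs:
        sizes_by_job.setdefault(name, []).append(size)

    flagged = []
    for name, size in jobs:
        typical = median(sizes_by_job[name])
        if size < ratio * typical:
            flagged.append((name, size))
    return flagged
```

The median is used rather than the mean so that one tiny file does not drag the baseline down far enough to hide itself.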
Collects list
-
Open a P2 incident for the failed database backup
File the incident in the PSA (ConnectWise, Autotask, ServiceNow) with severity P2 and assign to the DBA on call. Don't wait for next month's pass — every additional night without a clean backup is RPO debt.
-
Rebuild fragmented database indexes
Pull the fragmentation report (sys.dm_db_index_physical_stats on SQL Server; pgstatindex from the pgstattuple extension on Postgres, since pg_stat_user_indexes reports index usage rather than fragmentation). Reorganize anything between 10-30%, rebuild over 30%. Run during the maintenance window: index rebuilds on busy tables can hold locks long enough to time out application queries.
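The thresholds in this step reduce to a small decision function. A sketch, assuming the fragmentation percentages have already been pulled into `(index_name, fragmentation_pct)` pairs; the index names in the usage note are hypothetical.

```python
def index_action(fragmentation_pct):
    """Map a fragmentation percentage to the action named in the checklist."""
    if fragmentation_pct > 30:
        return "rebuild"
    if fragmentation_pct >= 10:
        return "reorganize"
    return "none"


def maintenance_plan(indexes):
    """Return {index_name: action} for indexes that need work.

    indexes: iterable of (index_name, fragmentation_pct).
    """
    plan = {}
    for name, pct in indexes:
        action = index_action(pct)
        if action != "none":
            plan[name] = action
    return plan
```

For example, `maintenance_plan([("ix_orders_date", 42.0), ("ix_users_email", 12.5), ("pk_users", 3.1)])` schedules a rebuild for the first index, a reorganize for the second, and skips the third.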
Network and Firewall
-
Review firewall rule changes since last cycle
Pull the audit log from the NGFW (Palo Alto Panorama, Fortinet FortiManager, Cisco FMC). Cross-check every rule change against an approved RFC in the change system. Unattributed changes are the lead indicator of either a compromise or sloppy CAB process — both worth surfacing.
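The cross-check amounts to filtering audit entries whose change reference is missing or not in the approved set. The dictionary keys (`rule`, `admin`, `change_ref`) and the RFC identifiers below are hypothetical; actual NGFW audit exports use their own field names.

```python
def unattributed_changes(audit_entries, approved_rfcs):
    """Return audit entries with no change reference or an unapproved one.

    audit_entries: list of dicts with a 'change_ref' key (may be None or absent).
    approved_rfcs: iterable of approved RFC identifiers from the change system.
    """
    approved = set(approved_rfcs)
    return [e for e in audit_entries if e.get("change_ref") not in approved]
```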
-
Test the site-to-site VPN failover path
Force the primary IPsec or WireGuard tunnel down and confirm traffic cuts over to the secondary within the SLA window. Untested failover is just hopeful failover: most failover bugs are configuration mistakes that stay hidden until the path is exercised.
-
Verify switch and AP uptime in Auvik
Auvik (or your RMM's network module) shows uptime, CPU, and link errors per device. Flag anything with input errors or CRC errors trending up — usually a bad cable or a port about to fail. Replace before the support call.
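Interface error counters are cumulative, so "trending up" can be approximated by counting how many sampling intervals showed growth. This is a sketch that assumes the counter readings were exported as a plain list per port; the `min_increases` value of 3 is an arbitrary illustration and nothing here uses the Auvik API.

```python
def errors_trending_up(counter_samples, min_increases=3):
    """True when a cumulative error counter grew in at least min_increases intervals.

    counter_samples: counter readings for one port, oldest first.
    Negative deltas (counter resets after a reboot) are not counted as growth.
    """
    deltas = [b - a for a, b in zip(counter_samples, counter_samples[1:])]
    increases = sum(1 for d in deltas if d > 0)
    return increases >= min_increases
```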
Endpoint and Server Security
-
Confirm EDR agents reporting in the CrowdStrike console
Pull the agent inventory from CrowdStrike, SentinelOne, or Defender for Endpoint and reconcile against the server inventory in your CMDB. Any host that hasn't checked in for 7+ days either died, lost the agent, or was rebuilt without it — chase down each one.
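The reconciliation is a set comparison plus a staleness check. A sketch, assuming hostnames have been exported from the CMDB and last-seen timestamps from the EDR console into plain Python structures; neither export format is specified in this checklist.

```python
from datetime import datetime, timedelta


def reconcile_edr(cmdb_hosts, edr_last_seen, now, stale_after=timedelta(days=7)):
    """Return (missing, stale) hostname lists.

    cmdb_hosts: set of hostnames that should have an agent.
    edr_last_seen: {hostname: datetime of last agent check-in}.
    missing: in the CMDB but unknown to the EDR console.
    stale: known to the console but silent for stale_after or longer.
    """
    missing = sorted(h for h in cmdb_hosts if h not in edr_last_seen)
    stale = sorted(
        h for h in cmdb_hosts
        if h in edr_last_seen and now - edr_last_seen[h] >= stale_after
    )
    return missing, stale
```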
-
Run an authenticated vulnerability scan against servers
Authenticated scans from Tenable, Qualys, or Rapid7 catch issues that unauthenticated scans miss — installed package versions, registry settings, local privilege escalation paths. Attach the scan report PDF to this step for the SOC 2 / PCI evidence trail.
Collects file
-
Review privileged account session logs in the PAM vault
Pull session recordings or audit logs from the PAM tool (CyberArk, Delinea, HashiCorp Vault, BeyondTrust). Spot-check 5-10 sessions for break-glass account use, unexpected hours, or commands outside the operator's normal scope. This is the access-review evidence auditors actually look at.
Backup and Recovery Verification
-
Verify the last full backup job in Veeam
Open Veeam (or Datto, Cohesity, Rubrik, Commvault) and confirm the most recent full backup completed without warnings. Warning-state jobs often hide skipped VMs or partial captures — treat warning as failure for verification purposes.
Collects list
-
Open a P2 incident for the failed backup job
File the incident in the PSA with severity P2 and link the Veeam job log. Backup failures are the kind of finding that escalates to the CISO during the next SOC 2 audit — fix and document the root cause this cycle, not next.
-
Perform a spot-restore from a random backup set
Pick a random VM or database from the last 30 days of backups and restore to an isolated test network. Boot it, mount it, query it — confirm the data is actually recoverable, not just that the backup file exists. The 3-2-1 rule is meaningless without a tested restore.
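Letting a random number generator choose the restore target avoids always testing the same familiar VM. A minimal sketch, assuming the backup catalog has been exported as `(name, backup_date)` pairs; the names in the usage note are hypothetical.

```python
import random
from datetime import date, timedelta


def pick_restore_candidate(backup_sets, today, window_days=30, rng=None):
    """Pick one backup set at random from the last window_days days.

    backup_sets: list of (name, backup_date) pairs.
    rng: optional random.Random instance, useful for reproducible picks.
    """
    rng = rng or random.Random()
    cutoff = today - timedelta(days=window_days)
    recent = [b for b in backup_sets if cutoff <= b[1] <= today]
    if not recent:
        raise ValueError("no backup sets in the selection window")
    return rng.choice(recent)
```

For example, `pick_restore_candidate([("vm-old", date(2024, 4, 1)), ("db-recent", date(2024, 6, 15))], today=date(2024, 6, 30))` can only return the `db-recent` set, because the other falls outside the 30-day window.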
Collects list
-
Escalate the failed restore to vendor support
Open a high-priority ticket with Veeam / Datto / Cohesity support, attach the restore log, and notify the IT Lead and CISO. A failed restore is an active RPO/RTO gap — the org is operating without verified recovery until this is closed.
Hardware Health and Sign-Off
-
Check RAID array status and SMART disk health
Pull the controller status from the iDRAC, iLO, or IPMI interface. Look for degraded arrays, predictive-failure flags, and reallocated-sector counts trending up. A drive in predictive-failure today is a drive that fails during business hours next week — replace proactively.
-
Review IPMI logs for hardware faults
Walk the BMC system event log on each host. ECC memory corrections, PSU events, thermal warnings — these surface on the BMC long before the OS notices. Open RMA tickets with the vendor for anything in warranty before the next maintenance cycle.
-
Sign off on the monthly maintenance window
Final review by the IT Lead or vCIO. Confirm all findings have a tracked remediation owner. The signed run becomes the SOC 2 / ISO 27001 evidence artifact for this control period — store the export with the audit binder.
Collects list, paragraph, and signature