Monthly Server Maintenance Checklist

Recurring monthly maintenance pass run by a sysadmin or MSP NOC tech across server-room environment, OS patching, database health, network, security, and backup verification. The workflow is paced across a maintenance week with conditional remediation steps for failed backups ...

7 sections 24 steps Collects data

Server Room Environment

Verify temperature and humidity in the server closet
- Pull the last 30 days of readings from the environmental sensor (APC NetBotz, Sensaphone, or whatever the room is using). Target band is 64-80°F and 40-60% RH per ASHRAE TC 9.9. Spikes outside the band correlate with disk failure rates — note any sustained excursions in the run notes, not just the current reading.
Test the UPS battery load and runtime
- Run the self-test from the UPS management interface (PowerChute, Eaton IPM). Record the projected runtime at current load. Batteries degrade silently — a UPS that showed 22 minutes last quarter and shows 8 minutes today needs replacement before next cycle, not after the next outage.
Collects number
Inspect rack airflow and cooling fans
- Walk the row. Check for blocked perforated tiles, missing blanking panels, and fans with audible bearing noise. Hot-aisle / cold-aisle separation matters more than raw CRAC capacity — a single missing blanking panel can recirculate hot exhaust into intake.

Operating System and Patching

Apply OS security patches in the maintenance window
- Roll patches through the patch tool (Action1, Automox, WSUS, or NinjaOne) in waves: dev → staging → prod. Prioritize CVEs on the CISA KEV list and high-EPSS items, not just CVSS — a CVSS 9.8 with no known exploit waits behind a CVSS 7.4 that's actively being weaponized.
Collects list
Review system event logs for critical errors
- Pull the last 30 days of Windows Event Log Critical/Error entries and Linux journalctl -p err for each host (or query the SIEM directly if logs are forwarded to Splunk, Datadog, or Elastic). Repeat hardware errors — disk SMART warnings, ECC memory corrections — predict failure within weeks.
Reconcile configuration drift via Ansible or Terraform
- Run a plan / dry-run against the IaC repo. Investigate any drift before applying — click-ops changes from prior incidents are common and shouldn't be silently reverted. Document the resolution: either codify the change in the repo or revert the host.
Reboot hosts in the maintenance window
- Reboot in dependency order — databases last, web tier first. Confirm each service comes back via the monitoring system (Datadog, New Relic, Grafana) before moving to the next host. A clean reboot also surfaces services that were running on a kept-alive process but missing from systemd / init.

Database Health

Run DBCC CHECKDB or pg_amcheck integrity scan
- SQL Server: DBCC CHECKDB WITH PHYSICAL_ONLY on production, full CHECKDB on the most recent restore copy to avoid the I/O hit. Postgres: pg_amcheck or amcheck against B-tree indexes. Corruption is rare but silent — catching it monthly beats discovering it during a restore.
Verify nightly database backup jobs completed
- Cross-check the backup tool's job history (Veeam, native SQL Agent, pgBackRest, Barman) against the expected schedule. Look for jobs that ran but produced suspiciously small files — that's the classic signature of a backup against an empty mount point.
Collects list
Open a P2 incident for the failed database backup
- File the incident in the PSA (ConnectWise, Autotask, ServiceNow) with severity P2 and assign to the DBA on call. Don't wait for next month's pass — every additional night without a clean backup is RPO debt.
Rebuild fragmented database indexes
- Pull the fragmentation report (sys.dm_db_index_physical_stats or pg_stat_user_indexes). Reorganize anything between 10-30%, rebuild over 30%. Run during the maintenance window — index rebuilds on busy tables can hold locks long enough to time out application queries.

Network and Firewall

Review firewall rule changes since last cycle
- Pull the audit log from the NGFW (Palo Alto Panorama, Fortinet FortiManager, Cisco FMC). Cross-check every rule change against an approved RFC in the change system. Unattributed changes are the lead indicator of either a compromise or sloppy CAB process — both worth surfacing.
Test the site-to-site VPN failover path
- Force the primary IPsec or WireGuard tunnel down and confirm traffic cuts over to the secondary within the SLA window. Untested failover is just hopeful failover — most failover bugs are config-time, not runtime.
Verify switch and AP uptime in Auvik
- Auvik (or your RMM's network module) shows uptime, CPU, and link errors per device. Flag anything with input errors or CRC errors trending up — usually a bad cable or a port about to fail. Replace before the support call.

Endpoint and Server Security

Confirm EDR agents reporting in the CrowdStrike console
- Pull the agent inventory from CrowdStrike, SentinelOne, or Defender for Endpoint and reconcile against the server inventory in your CMDB. Any host that hasn't checked in for 7+ days either died, lost the agent, or was rebuilt without it — chase down each one.
Run an authenticated vulnerability scan against servers
- Authenticated scans from Tenable, Qualys, or Rapid7 catch issues that unauthenticated scans miss — installed package versions, registry settings, local privilege escalation paths. Attach the scan report PDF to this step for the SOC 2 / PCI evidence trail.
Collects file
Review privileged account session logs in the PAM vault
- Pull session recordings or audit logs from the PAM tool (CyberArk, Delinea, HashiCorp Vault, BeyondTrust). Spot-check 5-10 sessions for break-glass account use, unexpected hours, or commands outside the operator's normal scope. This is the access-review evidence auditors actually look at.

Backup and Recovery Verification

Verify the last full backup job in Veeam
- Open Veeam (or Datto, Cohesity, Rubrik, Commvault) and confirm the most recent full backup completed without warnings. Warning-state jobs often hide skipped VMs or partial captures — treat warning as failure for verification purposes.
Collects list
Open a P2 incident for the failed backup job
- File the incident in the PSA with severity P2 and link the Veeam job log. Backup failures are the kind of finding that escalates to the CISO during the next SOC 2 audit — fix and document the root cause this cycle, not next.
Perform a spot-restore from a random backup set
- Pick a random VM or database from the last 30 days of backups and restore to an isolated test network. Boot it, mount it, query it — confirm the data is actually recoverable, not just that the backup file exists. The 3-2-1 rule is meaningless without a tested restore.
Collects list
Escalate the failed restore to vendor support
- Open a high-priority ticket with Veeam / Datto / Cohesity support, attach the restore log, and notify the IT Lead and CISO. A failed restore is an active RPO/RTO gap — the org is operating without verified recovery until this is closed.

Hardware Health and Sign-Off

Check RAID array status and SMART disk health
- Pull the controller status from the iDRAC, iLO, or IPMI interface. Look for degraded arrays, predictive-failure flags, and reallocated-sector counts trending up. A drive in predictive-failure today is a drive that fails during business hours next week — replace proactively.
Review IPMI logs for hardware faults
- Walk the BMC system event log on each host. ECC memory corrections, PSU events, thermal warnings — these surface on the BMC long before the OS notices. Open RMA tickets with the vendor for anything in warranty before the next maintenance cycle.
Sign off on the monthly maintenance window
- Final review by the IT Lead or vCIO. Confirm all findings have a tracked remediation owner. The signed run becomes the SOC 2 / ISO 27001 evidence artifact for this control period — store the export with the audit binder.
Collects list Collects paragraph Collects signature

Use this template

Copy it to your account, customize the steps, and run it with your team in minutes.

Use this workflow Start free trial

Sections 7

Steps 24

Category Information Technology

Price Free to start

Need a different process

Browse hundreds of free templates across every team and industry.

Back to template library

Related templates

More workflows your team can run.

Information Technology

Run Monthly Server Maintenance Checklist with your team

Customize the steps, assign roles, set a schedule, and keep a complete record for every run.

Use this workflow Start free trial

Monthly Server Maintenance Checklist

Server Room Environment

Operating System and Patching

Database Health

Network and Firewall

Endpoint and Server Security

Backup and Recovery Verification

Hardware Health and Sign-Off

Use this template

Related templates

Desktop Configuration Checklist

User Access Control Checklist

Monthly Server Maintenance Checklist

Performance Monitoring Checklist

Run Monthly Server Maintenance Checklist with your team