Monthly Server Maintenance Checklist

Server Room Environment

    Pull the last 30 days of readings from the environmental sensor (APC NetBotz, Sensaphone, or whatever the room is using). Target band is 64-80°F and 40-60% RH per ASHRAE TC 9.9. Spikes outside the band correlate with disk failure rates — note any sustained excursions in the run notes, not just the current reading.

    Run the self-test from the UPS management interface (PowerChute, Eaton IPM). Record the projected runtime at current load. Batteries degrade silently — a UPS that showed 22 minutes last quarter and shows 8 minutes today needs replacement before next cycle, not after the next outage.

    Walk the row. Check for blocked perforated tiles, missing blanking panels, and fans with audible bearing noise. Hot-aisle / cold-aisle separation matters more than raw CRAC capacity — a single missing blanking panel can recirculate hot exhaust into intake.

Operating System and Patching

    Roll patches through the patch tool (Action1, Automox, WSUS, or NinjaOne) in waves: dev → staging → prod. Prioritize CVEs on the CISA KEV list and high-EPSS items, not just CVSS — a CVSS 9.8 with no known exploit waits behind a CVSS 7.4 that's actively being weaponized.

    Pull the last 30 days of Windows Event Log Critical/Error entries and Linux journalctl -p err for each host (or query the SIEM directly if logs are forwarded to Splunk, Datadog, or Elastic). Repeat hardware errors — disk SMART warnings, ECC memory corrections — predict failure within weeks.

    Run a plan / dry-run against the IaC repo. Investigate any drift before applying — click-ops changes from prior incidents are common and shouldn't be silently reverted. Document the resolution: either codify the change in the repo or revert the host.

    Reboot in dependency order — databases last, web tier first. Confirm each service comes back via the monitoring system (Datadog, New Relic, Grafana) before moving to the next host. A clean reboot also surfaces services that were running on a kept-alive process but missing from systemd / init.

Database Health

    SQL Server: DBCC CHECKDB WITH PHYSICAL_ONLY on production, full CHECKDB on the most recent restore copy to avoid the I/O hit. Postgres: pg_amcheck or amcheck against B-tree indexes. Corruption is rare but silent — catching it monthly beats discovering it during a restore.

    Cross-check the backup tool's job history (Veeam, native SQL Agent, pgBackRest, Barman) against the expected schedule. Look for jobs that ran but produced suspiciously small files — that's the classic signature of a backup against an empty mount point.

    File the incident in the PSA (ConnectWise, Autotask, ServiceNow) with severity P2 and assign to the DBA on call. Don't wait for next month's pass — every additional night without a clean backup is RPO debt.

    Pull the fragmentation report (sys.dm_db_index_physical_stats or pg_stat_user_indexes). Reorganize anything between 10-30%, rebuild over 30%. Run during the maintenance window — index rebuilds on busy tables can hold locks long enough to time out application queries.

Network and Firewall

    Pull the audit log from the NGFW (Palo Alto Panorama, Fortinet FortiManager, Cisco FMC). Cross-check every rule change against an approved RFC in the change system. Unattributed changes are the lead indicator of either a compromise or sloppy CAB process — both worth surfacing.

    Force the primary IPsec or WireGuard tunnel down and confirm traffic cuts over to the secondary within the SLA window. Untested failover is just hopeful failover — most failover bugs are config-time, not runtime.

    Auvik (or your RMM's network module) shows uptime, CPU, and link errors per device. Flag anything with input errors or CRC errors trending up — usually a bad cable or a port about to fail. Replace before the support call.

Endpoint and Server Security

    Pull the agent inventory from CrowdStrike, SentinelOne, or Defender for Endpoint and reconcile against the server inventory in your CMDB. Any host that hasn't checked in for 7+ days either died, lost the agent, or was rebuilt without it — chase down each one.

    Authenticated scans from Tenable, Qualys, or Rapid7 catch issues that unauthenticated scans miss — installed package versions, registry settings, local privilege escalation paths. Attach the scan report PDF to this step for the SOC 2 / PCI evidence trail.

    Pull session recordings or audit logs from the PAM tool (CyberArk, Delinea, HashiCorp Vault, BeyondTrust). Spot-check 5-10 sessions for break-glass account use, unexpected hours, or commands outside the operator's normal scope. This is the access-review evidence auditors actually look at.

Backup and Recovery Verification

    Open Veeam (or Datto, Cohesity, Rubrik, Commvault) and confirm the most recent full backup completed without warnings. Warning-state jobs often hide skipped VMs or partial captures — treat warning as failure for verification purposes.

    File the incident in the PSA with severity P2 and link the Veeam job log. Backup failures are the kind of finding that escalates to the CISO during the next SOC 2 audit — fix and document the root cause this cycle, not next.

    Pick a random VM or database from the last 30 days of backups and restore to an isolated test network. Boot it, mount it, query it — confirm the data is actually recoverable, not just that the backup file exists. The 3-2-1 rule is meaningless without a tested restore.

    Open a high-priority ticket with Veeam / Datto / Cohesity support, attach the restore log, and notify the IT Lead and CISO. A failed restore is an active RPO/RTO gap — the org is operating without verified recovery until this is closed.

Hardware Health and Sign-Off

    Pull the controller status from the iDRAC, iLO, or IPMI interface. Look for degraded arrays, predictive-failure flags, and reallocated-sector counts trending up. A drive in predictive-failure today is a drive that fails during business hours next week — replace proactively.

    Walk the BMC system event log on each host. ECC memory corrections, PSU events, thermal warnings — these surface on the BMC long before the OS notices. Open RMA tickets with the vendor for anything in warranty before the next maintenance cycle.

    Final review by the IT Lead or vCIO. Confirm all findings have a tracked remediation owner. The signed run becomes the SOC 2 / ISO 27001 evidence artifact for this control period — store the export with the audit binder.