Server Maintenance Checklist
Monthly server maintenance routine for a sysadmin or MSP tech to verify hardware health, OS and application patching, security posture, and backup recoverability across the server fleet.
Hardware Health
-
Inspect chassis fans, PSU, and cabling
For physical servers, walk the rack: check fan RPMs in iDRAC / iLO / IPMI, confirm both PSUs are drawing on dual feeds (no failover-to-single condition), and listen for bearing noise. Document any amber LEDs. For colocated hardware, pull the latest remote-hands report.
-
Review disk space and growth trends
Check filesystem utilization on every mount; flag anything over 80%. Compare against last month's snapshot to catch runaway log directories or unrotated journal files. Common gotcha: /var/log on a small root partition fills silently between cycles.
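The threshold check and the month-over-month comparison above can be sketched in Python. This is only an illustration under stated assumptions: the function names, the `GROWTH_ALERT_POINTS` value, and the idea of storing each month's snapshot as a mount-to-percentage dictionary are invented for the example and are not part of any RMM.

```python
import shutil

THRESHOLD_PCT = 80.0        # from the checklist: flag anything over 80%
GROWTH_ALERT_POINTS = 10.0  # assumed value: a monthly jump worth a second look


def used_percent(mount: str) -> float:
    """Return current utilization of a mount point as a percentage.

    This is used/total, so it can differ slightly from df's Use% on
    filesystems that reserve blocks for root.
    """
    usage = shutil.disk_usage(mount)
    return 100.0 * usage.used / usage.total


def review_mounts(current: dict[str, float], previous: dict[str, float]) -> list[str]:
    """Compare this month's utilization snapshot against last month's."""
    findings = []
    for mount, pct in sorted(current.items()):
        if pct > THRESHOLD_PCT:
            findings.append(f"{mount}: {pct:.0f}% used (over {THRESHOLD_PCT:.0f}%)")
        # A mount with no previous reading is treated as having zero growth.
        growth = pct - previous.get(mount, pct)
        if growth >= GROWTH_ALERT_POINTS:
            findings.append(f"{mount}: grew {growth:.0f} points since last snapshot")
    return findings
```

Building `current` with `{m: used_percent(m) for m in mounts}` and saving it after each run gives next month's `previous`.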
-
Expand or remediate the full volume
Open a change ticket for the volume expansion or cleanup. For LVM, grow the PV (or add a new PV to the VG), extend the LV, and grow the filesystem online; for cloud disks (EBS, Azure managed disk), resize the disk, then run the in-guest grow. Don't simply delete logs without confirming retention requirements for SOC 2 / PCI evidence.
-
Verify RAID and SMART status
Run the controller utility (megacli, storcli, perccli) to confirm all virtual disks are Optimal and no physical disk is in Predictive Failure. Pull smartctl -a output for direct-attach disks and flag any reallocated-sector or pending-sector counts trending up.
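For direct-attach ATA disks, the two counters named above can be pulled out of saved `smartctl -a` output and compared with last month's values. The sketch below assumes the classic ten-column ATA attribute table, with the raw value in the last column; NVMe and SAS devices print a different layout and would need their own parser. The function names are hypothetical.

```python
WATCHED = ("Reallocated_Sector_Ct", "Current_Pending_Sector")


def smart_counts(smartctl_output: str) -> dict[str, int]:
    """Extract raw values of the watched attributes from `smartctl -a` text.

    Assumes ATA attribute rows with ten whitespace-separated columns:
    ID#, ATTRIBUTE_NAME, FLAG, VALUE, WORST, THRESH, TYPE, UPDATED,
    WHEN_FAILED, RAW_VALUE.
    """
    counts = {}
    for line in smartctl_output.splitlines():
        fields = line.split()
        if len(fields) >= 10 and fields[1] in WATCHED and fields[9].isdigit():
            counts[fields[1]] = int(fields[9])
    return counts


def trending_up(current: dict[str, int], previous: dict[str, int]) -> list[str]:
    """Return the watched attributes whose raw count rose since last month."""
    return [name for name in WATCHED if current.get(name, 0) > previous.get(name, 0)]
```

Any attribute returned by `trending_up` is a candidate for proactive disk replacement, even if the controller still reports the disk as healthy.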
-
Check CPU, memory, and thermal metrics
Pull 30-day CPU, memory, and inlet-temperature graphs from Datadog / Prometheus / the RMM. Note any host averaging over 80% CPU or showing sustained memory pressure. That is a capacity-planning signal worth raising at the next ops review, not a same-day fire.
OS and Application Patching
-
Review system and application logs
Scan the SIEM (Splunk, Sumo, Elastic, Wazuh) for the past 30 days of error and warning events on each server. Look for repeating kernel oops, OOM kills, service crashes, or auth failures clustering around a single source IP.
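When a SIEM query is not at hand, the "auth failures clustering around a single source IP" check can be approximated on raw log lines. This sketch assumes OpenSSH's standard "Failed password for ... from IP port N" message; the function name and the `min_failures` default are assumptions for the example.

```python
import re
from collections import Counter

# OpenSSH logs failed logins as:
#   Failed password for [invalid user] NAME from IP port N ssh2
FAILED_LOGIN = re.compile(r"Failed password for (?:invalid user )?\S+ from (\S+) port \d+")


def noisy_sources(log_lines, min_failures=5):
    """Count failed SSH logins per source IP and return the heavy hitters.

    Returns (ip, count) pairs, most failures first.
    """
    hits = Counter()
    for line in log_lines:
        match = FAILED_LOGIN.search(line)
        if match:
            hits[match.group(1)] += 1
    return [(ip, n) for ip, n in hits.most_common() if n >= min_failures]
```

The same counting pattern applies to other repeating events, such as OOM kills per host, by swapping the regular expression.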
-
Reconcile pending OS updates against Patch Tuesday
Pull the pending-update report from Action1 / NinjaOne / WSUS / dnf updateinfo. Cross-check missing KBs against the current Patch Tuesday bulletin and the CISA KEV list — KEV-listed CVEs jump the queue regardless of CVSS score.
-
Audit installed packages against the baseline
Compare rpm -qa / dpkg -l / Get-Package output against the Ansible / Puppet / Chef baseline. Unexpected packages are a drift signal: someone installed them by hand (click-ops) instead of through the baseline. Document each one and either codify it or remove it.
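The baseline comparison is a set difference at heart. The sketch below assumes both the installed list and the baseline have already been reduced to one package name per line; the function names are hypothetical, and reporting packages that are in the baseline but not installed is an addition beyond what the step asks for.

```python
def parse_package_names(listing: str) -> set[str]:
    """Turn a one-name-per-line listing into a set of package names.

    A listing like this can come from, for example,
    rpm -qa --qf '%{NAME}\\n'  or  dpkg-query -W -f='${Package}\\n'.
    """
    return {line.strip() for line in listing.splitlines() if line.strip()}


def package_drift(installed: set[str], baseline: set[str]) -> dict[str, list[str]]:
    """Report packages that differ between a host and its baseline."""
    return {
        # Installed on the host but not declared in the baseline.
        "unexpected": sorted(installed - baseline),
        # Declared in the baseline but absent from the host.
        "missing": sorted(baseline - installed),
    }
```

Everything under `"unexpected"` needs a decision: add it to the baseline or remove it from the host.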
-
Verify application and runtime versions
Confirm Java, .NET, Node, Python, nginx, Apache, PostgreSQL, MSSQL, and any line-of-business app are within their supported release window. End-of-life runtimes are a vulnerability-management finding even when no CVE is currently flagged.
Security Posture
-
Verify EDR agent health and definitions
In CrowdStrike / SentinelOne / Defender for Endpoint, confirm the agent reported within the last 24 hours, definitions are current, and tamper-protection is enabled. Hosts that haven't checked in for 7+ days get a ticket — a silent EDR is worse than no EDR because it implies coverage that isn't there.
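The two check-in thresholds above translate into a small classifier over last-seen timestamps exported from the EDR console. This is a sketch: the function names and the intermediate "watch" bucket (older than 24 hours but under 7 days) are assumptions, since the step only defines the healthy and ticket-worthy ends.

```python
from datetime import datetime, timedelta, timezone

HEALTHY_WITHIN = timedelta(hours=24)  # from the checklist
TICKET_AFTER = timedelta(days=7)      # from the checklist


def checkin_status(last_seen: datetime, now: datetime) -> str:
    """Classify one agent by how long ago it last reported."""
    age = now - last_seen
    if age <= HEALTHY_WITHIN:
        return "ok"
    if age >= TICKET_AFTER:
        return "ticket"
    return "watch"  # assumed middle bucket, not defined by the checklist


def hosts_needing_tickets(last_seen_by_host: dict, now: datetime = None) -> list:
    """Return hostnames whose agent has been silent for 7 days or more."""
    now = now or datetime.now(timezone.utc)
    return sorted(
        host for host, seen in last_seen_by_host.items()
        if checkin_status(seen, now) == "ticket"
    )
```

Timestamps should all be timezone-aware (or all naive) so the subtraction is valid.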
-
Review firewall and security-group rules
Diff the current firewall config (NGFW rule export, AWS security group, Azure NSG) against last month's export. Flag any 0.0.0.0/0 rules on management or database ports (22 SSH, 3389 RDP, 5985 WinRM, 1433 MSSQL). Document the business justification for any exception or schedule remediation.
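The open-to-the-world check can be sketched against a normalized rule export. The rule shape here (a dictionary with `name`, `cidr`, `from_port`, and `to_port`) is a hypothetical format for illustration, not the schema of any particular firewall or cloud API, and the sketch covers only the IPv4 `0.0.0.0/0` case named in the step.

```python
WATCHED_PORTS = {22, 3389, 5985, 1433}  # port set from the checklist step
OPEN_TO_WORLD = "0.0.0.0/0"


def exposed_management_rules(rules):
    """Return (rule name, exposed ports) for rules open to any source.

    A rule counts as exposing a port when the port falls inside the
    rule's inclusive from_port..to_port range.
    """
    findings = []
    for rule in rules:
        if rule["cidr"] != OPEN_TO_WORLD:
            continue
        exposed = sorted(
            port for port in WATCHED_PORTS
            if rule["from_port"] <= port <= rule["to_port"]
        )
        if exposed:
            findings.append((rule["name"], exposed))
    return findings
```

Checking the range rather than exact port numbers catches wide rules such as 1000-6000, which expose several watched ports at once.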
-
Run the vulnerability scan and triage by KEV/EPSS
Trigger an authenticated scan in Tenable / Qualys / Rapid7. Triage by combining CVSS, CISA KEV listing, and EPSS score — a CVSS 7.5 on the KEV list with EPSS over 0.5 outranks a CVSS 9.8 with no exploit activity. Attach the scan PDF to the run for SOC 2 / ISO 27001 evidence.
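One way to encode the triage order described above is a sort key that puts KEV listing first, then EPSS, then CVSS. This is a sketch of one reasonable ordering, not a standard formula; the `Finding` class and the placeholder CVE identifiers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    cve: str      # identifier as reported by the scanner
    cvss: float   # base score, 0.0 to 10.0
    epss: float   # exploit probability, 0.0 to 1.0
    kev: bool     # True if the CVE is on the CISA KEV list


def triage(findings):
    """Order findings so the most urgent come first.

    KEV-listed findings outrank everything else; within each group,
    higher EPSS wins, and CVSS only breaks remaining ties.
    """
    return sorted(findings, key=lambda f: (f.kev, f.epss, f.cvss), reverse=True)
```

With this key, a KEV-listed CVSS 7.5 with EPSS 0.62 sorts ahead of a CVSS 9.8 with EPSS 0.01 and no KEV listing, matching the example in the step.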
-
Audit local accounts and privileged access
Enumerate local admins, sudoers, and service accounts on the host. Cross-check against the IdP active-employee list — orphan accounts from prior offboardings are the most common access-review finding. Confirm break-glass accounts are stored in the vault, not on a sticky note.
Backup and Recovery Verification
-
Confirm last successful backup job
Pull the Veeam / Datto / Rubrik / AWS Backup job report for the past 7 days. Every protected workload should have at least one successful job within the RPO window. Investigate any chain of failed jobs — silent backup failures are how organizations discover their 3-2-1 isn't really 3-2-1.
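The "at least one successful job within the RPO window" rule can be checked mechanically once the job report is exported. The sketch assumes each job is a dictionary with `workload`, `status`, and `finished` keys; that shape and the function name are hypothetical, not the export format of any backup product.

```python
from datetime import datetime, timedelta


def workloads_outside_rpo(jobs, workloads, rpo: timedelta, now: datetime):
    """Return protected workloads with no successful job inside the RPO window.

    A workload is flagged if it has no successful job at all, or if its
    most recent success finished longer ago than the RPO allows.
    """
    last_success = {}
    for job in jobs:
        if job["status"] != "success":
            continue
        name = job["workload"]
        if name not in last_success or job["finished"] > last_success[name]:
            last_success[name] = job["finished"]
    return sorted(
        name for name in workloads
        if name not in last_success or now - last_success[name] > rpo
    )
```

Passing the full list of protected workloads matters: a workload that never ran a job does not appear in the job report, which is exactly the silent failure this step is meant to catch.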
-
Investigate and re-run failed backup jobs
Open a ticket per failed job, identify the root cause (VSS quiesce failure, repository full, credential expired, agent offline), remediate, and trigger a manual job. Do not close this step until the workload has a clean job in the chain.
-
Verify offsite and immutable copy
3-2-1 means three copies, two media, one offsite. Confirm the offsite tier (S3 with Object Lock, Azure immutable blob, tape rotation, secondary repository) received the most recent restore point and that immutability is still in effect — ransomware playbooks target the backup repo first.
-
Perform a spot-check restore
Restore one randomly selected file or VM to an isolated sandbox and confirm that the VM boots or the file opens cleanly. A backup that has never been restored is a hope, not a recovery. Capture the restore time and compare it against the documented RTO; deviations are worth raising at the next DR review.
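Picking the restore target with a random number generator, rather than by habit, keeps the spot-check from testing the same workload every month. The sketch below also records the RTO comparison; the function names and the result dictionary are assumptions for the example.

```python
import random


def pick_restore_target(candidates, rng=None):
    """Choose one restore target at random from the protected items.

    Pass a seeded random.Random to make the choice reproducible in an
    audit record; by default a fresh generator is used.
    """
    rng = rng or random.Random()
    return rng.choice(sorted(candidates))


def check_rto(restore_seconds: int, rto_seconds: int) -> dict:
    """Compare a measured restore time against the documented RTO."""
    return {
        "restore_seconds": restore_seconds,
        "rto_seconds": rto_seconds,
        "over_by_seconds": max(0, restore_seconds - rto_seconds),
        "within_rto": restore_seconds <= rto_seconds,
    }
```

Storing the `check_rto` result with each run builds the restore-time history that the DR review needs.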
-
Escalate the failed restore to the IT lead
A failed spot-check is a P2 incident: open a ticket, page the IT lead, and freeze backup-policy changes until root cause is identified. Document findings in the postmortem template; this becomes audit evidence at the next SOC 2 / ISO review.