Server Maintenance Checklist

Monthly server maintenance routine for a sysadmin or MSP tech to verify hardware health, OS and application patching, security posture, and backup recoverability across the server fleet.

4 sections 18 steps Collects data

Hardware Health

Inspect chassis fans, PSU, and cabling
- For physical servers, walk the rack: check fan RPMs in iDRAC / iLO / IPMI, confirm both PSUs are drawing on dual feeds (no failover-to-single condition), and listen for bearing noise. Document any amber LEDs. For colocated hardware, pull the latest remote-hands report.
Review disk space and growth trends
- Check filesystem utilization on every mount; flag anything over 80%. Compare against last month's snapshot to catch runaway log directories or unrotated journal files. Common gotcha: /var/log on a small root partition fills silently between cycles.
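A minimal sketch of the 80% check and the month-over-month comparison, assuming utilization is stored as a fraction per mount point. The function names and the snapshot format are illustrative, not part of any specific monitoring tool.

```python
import shutil

THRESHOLD = 0.80  # flag anything over 80%, as in the step above


def utilization(total: int, used: int) -> float:
    """Fraction of the filesystem in use (0.0 to 1.0)."""
    return used / total if total else 0.0


def snapshot(mounts: list[str]) -> dict[str, float]:
    """Read live utilization for caller-supplied mount points.

    Note: this can differ slightly from df's Use% on filesystems
    that reserve blocks for root.
    """
    result = {}
    for mount in mounts:
        usage = shutil.disk_usage(mount)
        result[mount] = utilization(usage.total, usage.used)
    return result


def flag_mounts(usage_by_mount: dict[str, float],
                threshold: float = THRESHOLD) -> list[str]:
    """Mount points whose utilization is strictly above the threshold."""
    return sorted(m for m, frac in usage_by_mount.items() if frac > threshold)


def growth(current: dict[str, float],
           last_month: dict[str, float]) -> dict[str, float]:
    """Change in utilization since last month, for mounts present in both."""
    return {m: round(current[m] - last_month[m], 4)
            for m in current if m in last_month}
```

Calling `flag_mounts(snapshot(["/", "/var/log"]))` on a host would list the mounts to raise; storing each month's `snapshot` output gives `growth` something to compare against.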
Expand or remediate the full volume
- Open a change ticket for the volume expansion or cleanup. For LVM, grow the PV (or add one to the volume group), extend the LV, and grow the filesystem online; for cloud disks (EBS, Azure managed disk), resize the disk, then run the in-guest grow. Don't delete logs without first confirming retention requirements for SOC 2 / PCI evidence.
Verify RAID and SMART status
- Run the controller utility (megacli, storcli, perccli) to confirm all virtual disks are Optimal and no physical disk is in Predictive Failure. Pull smartctl -a output for direct-attach disks and flag any reallocated-sector or pending-sector counts trending up.
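One way to spot counters "trending up" is to diff this month's raw values against last month's. The sketch below assumes the raw values have already been parsed out of `smartctl -a` into dictionaries; the parsing step and the function name are left as assumptions.

```python
# ATA attribute names as they appear in smartctl's attribute table.
WATCHED = ("Reallocated_Sector_Ct", "Current_Pending_Sector")


def rising_counters(previous: dict[str, int],
                    current: dict[str, int],
                    watched: tuple[str, ...] = WATCHED) -> dict[str, int]:
    """Return {attribute: increase} for watched counters that grew."""
    deltas = {}
    for name in watched:
        if name in previous and name in current:
            delta = current[name] - previous[name]
            if delta > 0:
                deltas[name] = delta
    return deltas
```

A non-empty result for any disk is the flag the step describes; a stable non-zero count is worth noting but is not a trend.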
Check CPU, memory, and thermal metrics
- Pull 30-day CPU, memory, and inlet-temperature graphs from Datadog / Prometheus / the RMM. Note any host averaging over 80% CPU or memory pressure — that's a capacity-planning signal worth raising at the next ops review, not a same-day fire.
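As a rough illustration of the 80% rule, the sketch below averages per-host samples that have already been exported from the monitoring tool. The input shape (host name mapped to a list of percentages) is an assumption.

```python
from statistics import mean


def capacity_signals(samples: dict[str, list[float]],
                     threshold: float = 80.0) -> dict[str, float]:
    """Hosts whose average over the window exceeds the threshold (percent)."""
    return {host: round(mean(values), 1)
            for host, values in samples.items()
            if values and mean(values) > threshold}
```

The output is a list for the ops review, not an alert feed, which matches the "capacity-planning signal" framing above.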

OS and Application Patching

Review system and application logs
- Scan the SIEM (Splunk, Sumo, Elastic, Wazuh) for the past 30 days of error and warning events on each server. Look for repeating kernel oops, OOM kills, service crashes, or auth failures clustering around a single source IP.
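Clustering auth failures by source IP can be sketched with a counter over exported log lines. The pattern below targets the common OpenSSH "Failed password" message; other log sources would need their own pattern, and the `min_count` cutoff is an arbitrary assumption.

```python
import re
from collections import Counter

# Matches lines such as:
#   "Failed password for root from 203.0.113.7 port 22 ssh2"
FAILED_LOGIN = re.compile(
    r"Failed password for (?:invalid user )?\S+ from (\d{1,3}(?:\.\d{1,3}){3})"
)


def failures_by_source(lines, min_count: int = 5) -> list[tuple[str, int]]:
    """Source IPs with at least min_count failed logins, busiest first."""
    counts = Counter(
        m.group(1) for line in lines if (m := FAILED_LOGIN.search(line))
    )
    return [(ip, n) for ip, n in counts.most_common() if n >= min_count]
```

In practice the SIEM's own aggregation does this at scale; the sketch only shows the shape of the query.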
Reconcile pending OS updates against Patch Tuesday
- Pull the pending-update report from Action1 / NinjaOne / WSUS / dnf updateinfo. Cross-check missing KBs against the current Patch Tuesday bulletin and the CISA KEV list — KEV-listed CVEs jump the queue regardless of CVSS score.
Audit installed packages against the baseline
- Compare rpm -qa / dpkg -l / Get-Package output against the Ansible / Puppet / Chef baseline. Unexpected packages are a drift signal: someone installed them by hand, outside configuration management. Document each one and either codify it in the baseline or remove it.
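The comparison itself is a set difference. A minimal sketch, assuming both the installed list and the baseline have been reduced to sets of package names (version pinning is ignored here for brevity):

```python
def package_drift(installed: set[str], baseline: set[str]) -> dict[str, list[str]]:
    """Packages present but not in the baseline, and vice versa."""
    return {
        "unexpected": sorted(installed - baseline),
        "missing": sorted(baseline - installed),
    }
```

"unexpected" entries are the drift described above; "missing" entries suggest the baseline was never fully applied.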
Verify application and runtime versions
- Confirm Java, .NET, Node, Python, nginx, Apache, PostgreSQL, MSSQL, and any line-of-business app are within their supported release window. End-of-life runtimes are a vulnerability-management finding even when no CVE is currently flagged.

Security Posture

Verify EDR agent health and definitions
- In CrowdStrike / SentinelOne / Defender for Endpoint, confirm the agent reported within the last 24 hours, definitions are current, and tamper-protection is enabled. Hosts that haven't checked in for 7+ days get a ticket — a silent EDR is worse than no EDR because it implies coverage that isn't there.
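The two thresholds in this step (24 hours and 7 days) can be expressed as a small classifier. This is a sketch that assumes the last check-in timestamp has already been exported from the EDR console; the status labels are made up for illustration.

```python
from datetime import datetime, timedelta, timezone


def classify_agent(last_seen: datetime, now: datetime) -> str:
    """Map time since last check-in to a follow-up action."""
    age = now - last_seen
    if age >= timedelta(days=7):
        return "ticket"   # silent for 7+ days: open a ticket
    if age > timedelta(hours=24):
        return "stale"    # missed the 24-hour window: investigate
    return "healthy"
```

Passing `now` explicitly, for example `datetime.now(timezone.utc)`, keeps the function easy to test and avoids mixing naive and timezone-aware timestamps.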
Review firewall and security-group rules
- Diff the current firewall config (NGFW rule export, AWS security group, Azure NSG) against last month's export. Flag any 0.0.0.0/0 rules on management or database ports (22 SSH, 3389 RDP, 5985 WinRM, 1433 MSSQL). Document the business justification for any exception or schedule remediation.
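A sketch of the open-to-the-world check, assuming each rule has been exported as a dictionary with `name`, `cidr`, `from_port`, and `to_port` keys. That shape is hypothetical; real exports differ by vendor.

```python
import ipaddress

WATCHED_PORTS = {22, 3389, 5985, 1433}  # the ports named in the step above


def open_rules(rules: list[dict]) -> list[tuple[str, list[int]]]:
    """Rules that expose a watched port to any source address."""
    flagged = []
    for rule in rules:
        network = ipaddress.ip_network(rule["cidr"], strict=False)
        if network.prefixlen != 0:  # only 0.0.0.0/0 and ::/0 match everything
            continue
        exposed = sorted(p for p in WATCHED_PORTS
                         if rule["from_port"] <= p <= rule["to_port"])
        if exposed:
            flagged.append((rule["name"], exposed))
    return flagged
```

Checking port ranges rather than single ports matters: a wide range such as 1000-6000 exposes three of the four watched ports without naming any of them.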
Run the vulnerability scan and triage by KEV/EPSS
- Trigger an authenticated scan in Tenable / Qualys / Rapid7. Triage by combining CVSS, CISA KEV listing, and EPSS score — a CVSS 7.5 on the KEV list with EPSS over 0.5 outranks a CVSS 9.8 with no exploit activity. Attach the scan PDF to the run for SOC 2 / ISO 27001 evidence.
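The step does not prescribe a formula for combining the three signals. One simple ordering that reproduces the example above is to sort by KEV membership first, then EPSS, then CVSS. The sketch below assumes that ordering; the `Finding` type and the identifiers are placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    cve: str     # identifier as reported by the scanner
    cvss: float  # base score, 0.0 to 10.0
    epss: float  # exploit probability, 0.0 to 1.0
    kev: bool    # True if listed in the CISA KEV catalog


def triage(findings: list[Finding]) -> list[Finding]:
    """Order findings: KEV-listed first, then by EPSS, then by CVSS."""
    return sorted(findings, key=lambda f: (f.kev, f.epss, f.cvss), reverse=True)
```

With this ordering, a KEV-listed CVSS 7.5 finding with EPSS 0.6 sorts ahead of an unlisted CVSS 9.8 finding with EPSS 0.01. Teams that weight the signals differently would swap in their own key.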
Audit local accounts and privileged access
- Enumerate local admins, sudoers, and service accounts on the host. Cross-check against the IdP active-employee list — orphan accounts from prior offboardings are the most common access-review finding. Confirm break-glass accounts are stored in the vault, not on a sticky note.

Backup and Recovery Verification

Confirm last successful backup job
- Pull the Veeam / Datto / Rubrik / AWS Backup job report for the past 7 days. Every protected workload should have at least one successful job within the RPO window. Investigate any chain of failed jobs — silent backup failures are how organizations discover their 3-2-1 isn't really 3-2-1.
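The RPO check reduces to comparing each workload's last successful job against the window. A minimal sketch, assuming the job report has been reduced to a mapping of workload name to last-success timestamp (or `None` if no job succeeded in the report period):

```python
from datetime import datetime, timedelta
from typing import Optional


def outside_rpo(last_success: dict[str, Optional[datetime]],
                rpo: timedelta,
                now: datetime) -> list[str]:
    """Workloads with no successful job inside the RPO window."""
    return sorted(name for name, ts in last_success.items()
                  if ts is None or now - ts > rpo)
```

Treating a missing timestamp as a violation is deliberate: a workload that never appears in the report is the silent failure this step is meant to catch.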
Investigate and re-run failed backup jobs
- Open a ticket per failed job, identify the root cause (VSS quiesce failure, repository full, credential expired, agent offline), remediate, and trigger a manual job. Do not close this step until the workload has a clean job in the chain.
Verify offsite and immutable copy
- 3-2-1 means three copies, two media, one offsite. Confirm the offsite tier (S3 with Object Lock, Azure immutable blob, tape rotation, secondary repository) received the most recent restore point and that immutability is still in effect — ransomware playbooks target the backup repo first.
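The 3-2-1 rule can be checked mechanically once the copies of a workload are listed. The sketch below assumes each copy is described by a dictionary with a `media` label and an `offsite` flag; both the shape and the labels are illustrative.

```python
def meets_3_2_1(copies: list[dict]) -> bool:
    """True if there are 3+ copies on 2+ media types with 1+ offsite."""
    media_types = {copy["media"] for copy in copies}
    return (len(copies) >= 3
            and len(media_types) >= 2
            and any(copy["offsite"] for copy in copies))
```

This only validates the inventory. It does not confirm that the offsite copy received the latest restore point or that immutability is still enforced, which still need to be checked in the storage platform itself.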
Perform a spot-check restore
- Restore one randomly selected file or VM to an isolated sandbox and confirm it boots or opens cleanly. A backup that has never been restored is a hope, not a recovery. Capture the restore time and compare it against the documented RTO; deviations are worth raising at the next DR review.
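A small sketch of the selection and the RTO comparison, assuming the candidate list comes from the backup inventory and times are measured in seconds. The function names are hypothetical.

```python
import random
from typing import Optional


def pick_restore_target(candidates: list[str], seed: Optional[int] = None) -> str:
    """Choose one workload at random; pass a seed to make the pick repeatable."""
    rng = random.Random(seed)
    return rng.choice(sorted(candidates))


def rto_deviation(restore_seconds: float, rto_seconds: float) -> float:
    """Positive result means the restore took longer than the documented RTO."""
    return restore_seconds - rto_seconds
```

Recording the seed alongside the run makes the selection auditable: a reviewer can confirm the target was not hand-picked.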
Escalate the failed restore to the IT lead
- A failed spot-check is a P2 incident: open a ticket, page the IT lead, and freeze backup-policy changes until the root cause is identified. Document findings in the postmortem template; this becomes audit evidence at the next SOC 2 / ISO review.
