Server Maintenance Checklist

Hardware Health

    For physical servers, walk the rack: check fan RPMs in iDRAC / iLO / IPMI, confirm both PSUs are drawing on dual feeds (no failover-to-single condition), and listen for bearing noise. Document any amber LEDs. For colocated hardware, pull the latest remote-hands report.

    Check filesystem utilization on every mount; flag anything over 80%. Compare against last month's snapshot to catch runaway log directories or unrotated journal files. Common gotcha: /var/log on a small root partition fills silently between cycles.
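
    The utilization check above can be sketched in a few lines of Python. This is a minimal sketch, not a monitoring replacement: the mount list, the function names, and the 10-point month-over-month growth cutoff are illustrative assumptions; only the 80% threshold comes from the checklist.

```python
import shutil

THRESHOLD = 0.80  # the checklist flags anything over 80%


def collect_usage(mounts):
    """Map each readable mount point to its used fraction (0.0 to 1.0)."""
    usage = {}
    for mount in mounts:
        try:
            stats = shutil.disk_usage(mount)
        except OSError:
            continue  # mount missing or unreadable on this host; skip it
        usage[mount] = stats.used / stats.total if stats.total else 0.0
    return usage


def flag_mounts(current, previous=None, threshold=THRESHOLD, growth=0.10):
    """Return mounts over the threshold, plus mounts that grew sharply.

    `previous` is last month's snapshot in the same {mount: fraction} shape.
    The 10-point growth cutoff is an illustrative assumption.
    """
    previous = previous or {}
    flagged = {}
    for mount, frac in current.items():
        if frac > threshold:
            flagged[mount] = "over threshold"
        elif mount in previous and frac - previous[mount] > growth:
            flagged[mount] = "growing fast"
    return flagged


if __name__ == "__main__":
    # Hypothetical mount list; replace with the mounts on the host.
    for mount, reason in flag_mounts(collect_usage(["/", "/var/log"])).items():
        print(f"{mount}: {reason}")
```

    The used fraction here is used/total, which can read slightly lower than the Use% column in df on filesystems that reserve blocks for root.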

    For any mount flagged above, open a change ticket for the volume expansion or cleanup. For LVM, grow the underlying PV (or add one to the volume group), extend the LV, and grow the filesystem online; for cloud disks (EBS, Azure managed disk), resize the disk, then run the in-guest grow. Don't delete logs without first confirming retention requirements for SOC 2 / PCI evidence.

    Run the controller utility (megacli, storcli, perccli) to confirm all virtual disks are Optimal and no physical disk is in Predictive Failure. Pull smartctl -a output for direct-attach disks and flag any reallocated-sector or pending-sector counts trending up.
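
    To make "trending up" concrete, the sector counts can be extracted from saved smartctl output and compared with the previous cycle. The sketch below assumes the ATA attribute table layout (ten whitespace-separated columns, raw value last); NVMe and SAS devices report differently. The sample text is illustrative, not captured from a real disk, and the function names are hypothetical.

```python
# ATA attribute names as printed in the smartctl attribute table.
WATCHED = ("Reallocated_Sector_Ct", "Current_Pending_Sector")

# Illustrative sample of the attribute table; not real disk output.
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       8
  9 Power_On_Hours          0x0032   095   095   000    Old_age   Always       -       21903
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       2
"""


def parse_watched(smartctl_text):
    """Return {attribute_name: raw_value} for the watched attributes."""
    counts = {}
    for line in smartctl_text.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and carry the raw value in column 10.
        if len(fields) >= 10 and fields[0].isdigit() and fields[1] in WATCHED:
            if fields[9].isdigit():
                counts[fields[1]] = int(fields[9])
    return counts


def trending_up(current, previous):
    """Return watched attributes whose raw count rose since the last cycle."""
    return [n for n in WATCHED if current.get(n, 0) > previous.get(n, 0)]


if __name__ == "__main__":
    last_month = {"Reallocated_Sector_Ct": 8, "Current_Pending_Sector": 0}
    print(trending_up(parse_watched(SAMPLE), last_month))
```

    Storing each cycle's parsed counts alongside the run gives the next cycle something to diff against.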

    Pull 30-day CPU, memory, and inlet-temperature graphs from Datadog / Prometheus / the RMM. Note any host averaging over 80% CPU or showing sustained memory pressure; that's a capacity-planning signal worth raising at the next ops review, not a same-day fire.
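
    If the monitoring tool can export the 30-day samples, the averaging check reduces to a one-liner. A minimal sketch, assuming a hypothetical export shaped as {host: [percent samples]}; the function name is illustrative.

```python
from statistics import mean


def hosts_over(samples_by_host, limit=80.0):
    """Return hosts whose average utilization exceeds the limit (percent)."""
    return sorted(
        host
        for host, samples in samples_by_host.items()
        if samples and mean(samples) > limit  # skip hosts with no data
    )


if __name__ == "__main__":
    # Hypothetical 30-day CPU samples per host.
    print(hosts_over({"web01": [70, 95, 90], "db01": [40, 50], "idle01": []}))
```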

OS and Application Patching

    Scan the SIEM (Splunk, Sumo, Elastic, Wazuh) for the past 30 days of error and warning events on each server. Look for repeating kernel oops, OOM kills, service crashes, or auth failures clustering around a single source IP.

    Pull the pending-update report from Action1 / NinjaOne / WSUS / dnf updateinfo. Cross-check missing KBs against the current Patch Tuesday bulletin and the CISA KEV list — KEV-listed CVEs jump the queue regardless of CVSS score.

    Compare rpm -qa / dpkg -l / Get-Package output against the Ansible / Puppet / Chef baseline. Unexpected packages are a drift signal: someone installed them by hand (click-ops) instead of through configuration management. Document each one and either codify it in the baseline or remove it.
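
    The comparison itself is a set difference. The sketch below assumes Debian-style dpkg -l output and a baseline supplied as a plain set of package names; both helper names are hypothetical, and reporting baseline packages that are missing from the host is an addition beyond what the checklist asks for.

```python
def parse_dpkg_list(text):
    """Return names of installed packages from `dpkg -l` output."""
    names = set()
    for line in text.splitlines():
        fields = line.split()
        # "ii" marks a package that is desired-installed and currently installed.
        if len(fields) >= 3 and fields[0] == "ii":
            names.add(fields[1].split(":")[0])  # drop an arch suffix like ":amd64"
    return names


def package_drift(installed, baseline):
    """Compare installed package names against the baseline set."""
    return {
        "unexpected": sorted(installed - baseline),  # on host, not in baseline
        "missing": sorted(baseline - installed),     # in baseline, not on host
    }


if __name__ == "__main__":
    # Illustrative dpkg -l excerpt and baseline; not taken from a real host.
    sample = """\
ii  nginx          1.18.0-6     amd64        web server
ii  libc6:amd64    2.31-13      amd64        GNU C Library
rc  oldpkg         1.0          amd64        removed, config remains
ii  htop           3.0.5-7      amd64        interactive process viewer
"""
    print(package_drift(parse_dpkg_list(sample), {"nginx", "libc6", "chrony"}))
```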

    Confirm Java, .NET, Node, Python, nginx, Apache, PostgreSQL, MSSQL, and any line-of-business app are within their supported release window. End-of-life runtimes are a vulnerability-management finding even when no CVE is currently flagged.

Security Posture

    In CrowdStrike / SentinelOne / Defender for Endpoint, confirm the agent reported within the last 24 hours, definitions are current, and tamper-protection is enabled. Hosts that haven't checked in for 7+ days get a ticket — a silent EDR is worse than no EDR because it implies coverage that isn't there.
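
    The two check-in thresholds above translate directly into a small classifier. This is a sketch under stated assumptions: the last-seen timestamps are presumed to come from the EDR console export, and the middle "follow up" band (between 24 hours and 7 days) is an assumption, since the checklist only defines the two ends.

```python
from datetime import datetime, timedelta, timezone

FRESH = timedelta(hours=24)  # agent should have reported within 24 hours
SILENT = timedelta(days=7)   # 7+ days without check-in gets a ticket


def classify_agent(last_seen, now):
    """Classify an EDR agent by how long ago it last checked in."""
    age = now - last_seen
    if age <= FRESH:
        return "ok"
    if age >= SILENT:
        return "ticket"
    return "follow up"  # assumed middle band, not defined by the checklist


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    # Hypothetical hosts and last-seen times.
    hosts = {"web01": now - timedelta(hours=2), "db01": now - timedelta(days=9)}
    for host, seen in hosts.items():
        print(host, classify_agent(seen, now))
```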

    Diff the current firewall config (NGFW rule export, AWS security group, Azure NSG) against last month's export. Flag any rule that allows 0.0.0.0/0 to management or database ports (22 SSH, 3389 RDP, 5985 WinRM, 1433 MSSQL). Document the business justification for any exception or schedule remediation.
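
    Once the rules are exported, the any-source check can be automated. The rule shape below (name, action, source, from_port, to_port) is a hypothetical normalized form, not the export format of any specific firewall or cloud API. Treating ::/0 the same as 0.0.0.0/0 is an addition for IPv6 coverage.

```python
import ipaddress

WATCHED_PORTS = {22, 3389, 5985, 1433}  # ports named in the checklist


def is_any_source(cidr):
    """True for 0.0.0.0/0 or ::/0, i.e. a source that matches every address."""
    return ipaddress.ip_network(cidr, strict=False).prefixlen == 0


def risky_rules(rules):
    """Return (rule name, exposed ports) for allow rules open to any source."""
    flagged = []
    for rule in rules:
        if rule["action"] != "allow" or not is_any_source(rule["source"]):
            continue
        exposed = sorted(
            p for p in WATCHED_PORTS if rule["from_port"] <= p <= rule["to_port"]
        )
        if exposed:
            flagged.append((rule["name"], exposed))
    return flagged


if __name__ == "__main__":
    # Hypothetical normalized rules.
    rules = [
        {"name": "ssh-any", "action": "allow", "source": "0.0.0.0/0", "from_port": 22, "to_port": 22},
        {"name": "rdp-corp", "action": "allow", "source": "10.0.0.0/8", "from_port": 3389, "to_port": 3389},
        {"name": "https-any", "action": "allow", "source": "0.0.0.0/0", "from_port": 443, "to_port": 443},
    ]
    print(risky_rules(rules))
```

    Port ranges are checked by containment, so an "all ports" rule from any source is caught as well as a single-port rule.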

    Trigger an authenticated scan in Tenable / Qualys / Rapid7. Triage by combining CVSS, CISA KEV listing, and EPSS score — a CVSS 7.5 on the KEV list with EPSS over 0.5 outranks a CVSS 9.8 with no exploit activity. Attach the scan PDF to the run for SOC 2 / ISO 27001 evidence.
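
    One way to encode the triage rule is a sort key that considers KEV listing first, then EPSS, then CVSS. The strict lexicographic ordering is an assumption; the checklist only says the three signals are combined. The Finding class and the finding identifiers are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    ident: str   # scanner finding or CVE identifier
    cvss: float  # CVSS base score, 0.0 to 10.0
    kev: bool    # listed in the CISA KEV catalog
    epss: float  # EPSS probability, 0.0 to 1.0


def triage_key(finding):
    """Sort key: KEV-listed first, then higher EPSS, then higher CVSS."""
    return (not finding.kev, -finding.epss, -finding.cvss)


def triage(findings):
    """Return findings in remediation order under the assumed key."""
    return sorted(findings, key=triage_key)


if __name__ == "__main__":
    # Hypothetical findings mirroring the example in the checklist.
    findings = [
        Finding("finding-b", cvss=9.8, kev=False, epss=0.01),
        Finding("finding-a", cvss=7.5, kev=True, epss=0.62),
    ]
    for f in triage(findings):
        print(f.ident, f.cvss, f.kev, f.epss)
```

    A weighted score is an equally reasonable design; the tuple key is used here because it keeps the ordering easy to explain to an auditor.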

    Enumerate local admins, sudoers, and service accounts on the host. Cross-check against the IdP active-employee list — orphan accounts from prior offboardings are the most common access-review finding. Confirm break-glass accounts are stored in the vault, not on a sticky note.
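
    The cross-check against the IdP can be sketched as a case-insensitive set difference. The inputs are assumed to be plain username lists (host enumeration, IdP export, approved service accounts); the function name and the sample usernames are hypothetical.

```python
def orphan_accounts(host_accounts, active_users, service_accounts=()):
    """Return host accounts with no active employee or approved service owner."""
    known = {u.lower() for u in active_users}
    known |= {s.lower() for s in service_accounts}
    # Compare case-insensitively; hosts and IdPs often differ in casing.
    return sorted(a for a in host_accounts if a.lower() not in known)


if __name__ == "__main__":
    print(
        orphan_accounts(
            host_accounts=["alice", "Bob", "svc-backup", "carol"],
            active_users=["alice", "bob"],
            service_accounts=["svc-backup"],
        )
    )
```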

Backup and Recovery Verification

    Pull the Veeam / Datto / Rubrik / AWS Backup job report for the past 7 days. Every protected workload should have at least one successful job within the RPO window. Investigate any chain of failed jobs — silent backup failures are how organizations discover their 3-2-1 isn't really 3-2-1.
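
    The RPO check can be expressed over an exported job list. The record shape (workload, status, finished) is a hypothetical normalized form, not the report format of any of the products named above, and the 24-hour RPO in the example is illustrative.

```python
from datetime import datetime, timedelta


def workloads_outside_rpo(jobs, workloads, rpo, now):
    """Return protected workloads with no successful job inside the RPO window."""
    last_success = {}
    for job in jobs:
        if job["status"] != "success":
            continue
        name = job["workload"]
        # Keep only the most recent successful finish time per workload.
        if name not in last_success or job["finished"] > last_success[name]:
            last_success[name] = job["finished"]
    return sorted(
        w for w in workloads
        if w not in last_success or now - last_success[w] > rpo
    )


if __name__ == "__main__":
    now = datetime(2024, 3, 10, 6, 0)
    # Hypothetical job records.
    jobs = [
        {"workload": "db01", "status": "success", "finished": datetime(2024, 3, 10, 1, 0)},
        {"workload": "file01", "status": "failed", "finished": datetime(2024, 3, 10, 1, 30)},
        {"workload": "file01", "status": "success", "finished": datetime(2024, 3, 7, 1, 30)},
    ]
    print(workloads_outside_rpo(jobs, ["db01", "file01", "mail01"], timedelta(hours=24), now))
```

    Passing the full list of protected workloads matters: a workload with no jobs at all never appears in a job report, so it can only be caught by comparing against the inventory.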

    Open a ticket per failed job, identify the root cause (VSS quiesce failure, repository full, credential expired, agent offline), remediate, and trigger a manual job. Do not close this step until the workload has a clean job in the chain.

    3-2-1 means three copies, two media, one offsite. Confirm the offsite tier (S3 with Object Lock, Azure immutable blob, tape rotation, secondary repository) received the most recent restore point and that immutability is still in effect — ransomware playbooks target the backup repo first.
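
    The 3-2-1 rule is simple enough to check mechanically per workload. In this sketch the BackupCopy class and its field values are hypothetical, and the copy list is assumed to include the production copy. It checks the three counts plus offsite immutability, but not whether the offsite tier holds the latest restore point.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupCopy:
    location: str
    media: str            # e.g. "disk", "object", "tape"
    offsite: bool
    immutable: bool = False


def check_321(copies):
    """Return a list of 3-2-1 violations; an empty list means the rule holds."""
    problems = []
    if len(copies) < 3:
        problems.append("fewer than three copies")
    if len({c.media for c in copies}) < 2:
        problems.append("fewer than two media types")
    if not any(c.offsite for c in copies):
        problems.append("no offsite copy")
    elif not any(c.offsite and c.immutable for c in copies):
        problems.append("offsite copy is not immutable")
    return problems


if __name__ == "__main__":
    # Hypothetical copies of one workload, production copy included.
    copies = [
        BackupCopy("production SAN", "disk", offsite=False),
        BackupCopy("local repository", "disk", offsite=False),
        BackupCopy("S3 with Object Lock", "object", offsite=True, immutable=True),
    ]
    print(check_321(copies))
```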

    Restore one randomly selected file or VM to an isolated sandbox and confirm that it boots or opens cleanly. A backup that has never been restored is a hope, not a recovery. Capture the restore time and compare it against the documented RTO; deviations are worth raising at the next DR review.
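
    Picking the target with a random number generator, rather than by hand, avoids always testing the same convenient VM. A minimal sketch with hypothetical names; the restore itself still happens in the backup product.

```python
import random
from datetime import timedelta


def pick_restore_target(candidates, rng=None):
    """Choose one workload at random; pass a seeded rng for a reproducible pick."""
    rng = rng or random.Random()
    return rng.choice(sorted(candidates))


def rto_deviation(restore_time, rto):
    """Positive result means the restore exceeded the RTO by that amount."""
    return restore_time - rto


if __name__ == "__main__":
    # Hypothetical workload names and timings.
    print(pick_restore_target(["db01", "file01", "mail01"]))
    print(rto_deviation(timedelta(minutes=95), timedelta(hours=1)))
```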

    A failed spot-check is a P2 incident: open a ticket, page the IT lead, and freeze backup-policy changes until the root cause is identified. Document findings in the postmortem template; they become audit evidence at the next SOC 2 / ISO review.