Server Maintenance Checklist

Monthly maintenance window workflow for sysadmins to verify hardware health, apply OS and firmware patches, validate security posture, and confirm backups are restorable. Run during the scheduled maintenance window with CAB-approved chan...

Use this workflow

Pre-Maintenance Preparation

Confirm CAB-approved change record
- Pull the RFC from ServiceNow / ConnectWise / Jira Service Management. Verify CAB approval, scheduled window, rollback plan, and downstream service owners notified. Emergency changes still require post-hoc CAB review.
Collects text
Notify stakeholders of the maintenance window
- Send the standard maintenance announcement 48 hours ahead — include start/end time in local TZ, expected impact, and rollback contact. For MSP clients, send via the PSA so it lands in the ticket history.
Verify last successful backup before window
- Confirm Veeam / Datto / Rubrik shows a green job within the last 24 hours and the immutable copy is current. Do not start patching if the most recent backup failed — without a rollback point, an OS update gone wrong becomes a restore-from-tape exercise.
Collects list
Remediate the failed backup before proceeding
- Re-run the failed job, escalate to the backup engineer if it fails again, and document root cause in the ticket. Do not enter the maintenance window until a green backup exists or the change has been rescheduled with CAB.

Hardware Inspection

Inspect chassis and clear dust from intakes
- Walk the rack with compressed air for intake filters and a flashlight for blocked vents. Note any front-bezel LED faults — amber LEDs on Dell PowerEdge / HPE ProLiant front panels usually indicate a hardware event logged in iDRAC / iLO.
Verify rack temperature and airflow
- Check intake (cold-aisle) temperature against ASHRAE recommended range (18-27°C / 64-81°F). Confirm hot/cold aisle containment is intact and no blanking panels are missing — gaps cause recirculation that masks real cooling problems.
Inspect cabling and PDU connections
- Verify dual power feeds land on separate PDUs (A-side / B-side). Tug-test SFP+ / fiber connections — a half-seated transceiver is a common cause of intermittent flapping that monitoring catches but humans miss on a visual sweep.
Pull diagnostic logs from iLO/iDRAC/IPMI
- Export the SEL (system event log) and lifecycle log. Look for predictive failure alerts on disks, ECC memory errors, and fan/PSU faults. Predictive disk failures get a replacement scheduled before the next window — don't let a yellow indicator ride.
Collects list
Test UPS battery health and runtime
- Run the self-test from PowerChute / APC NMC or the equivalent vendor console. Sealed lead-acid batteries typically need replacement every 3-5 years; runtime drift below the documented load is the early signal.

Patching and Firmware

Review the patch ring assignment
- Confirm whether this server is in the test, pilot, or production ring for the current Patch Tuesday cycle. Production servers should only receive patches that have soaked 7-14 days in pilot without incident.
Apply OS patches via WSUS or Intune
- Push approved KBs from WSUS / SCCM / Intune / Automox per the change record. Cross-reference CVEs against the vendor advisory and CISA KEV catalog — known-exploited vulnerabilities take precedence over routine cumulative updates.
Apply firmware updates from vendor catalog
- Use Dell Repository Manager / HPE SUM / Lenovo XClarity to stage BIOS, RAID controller, NIC, and BMC firmware. Firmware mismatches between cluster nodes are a leading cause of vMotion / live-migration failures.
Reboot and confirm services return healthy
- After reboot, confirm services start in the expected order, monitoring agents reconnect (PRTG / Datadog / LogicMonitor), and dependent applications respond. Flag any service stuck in a starting state for follow-up before closing the window.
Collects list
Execute the documented rollback plan
- Restore from the pre-window snapshot or uninstall the offending KB per the rollback plan in the change record. Open a P2 ticket capturing the symptom, KB number, and host so the patch can be held in pilot for the next cycle.

Security Hardening

Review firewall rules and ACL drift
- Diff the current FortiGate / Palo Alto / Meraki / pfSense ruleset against the approved baseline. Any rule without a ticket reference or an expiry date is a candidate for removal — temporary rules that became permanent are the most common path to over-permissive access.
Audit privileged accounts and stale logins
- Pull local admins, Domain Admins, and service accounts. Flag any account with no logon in 60+ days or a password older than the rotation policy. Service accounts running as Domain Admin go on the remediation list — none should remain after this audit cycle.
Run authenticated vulnerability scan
- Trigger a Tenable / Qualys / Rapid7 InsightVM authenticated scan against the host. Track Critical and High findings against SLA — typically 15 days for Critical, 30 days for High under most internal policies and SOC 2 controls.
Collects number
Verify EDR agent health and definitions
- Check CrowdStrike / SentinelOne / Defender for Endpoint console: agent online, sensor version current, last check-in within tolerance, no quarantined items pending review. An EDR agent silently uninstalled by a patch is a known failure mode.

Performance and Capacity

Review CPU, memory, and IOPS trends
- Pull 30-day trends from PRTG / SolarWinds / Datadog. Flag any resource sustained above 80% — that's the inflection point where contention starts hitting end-user latency. Document capacity asks for the next budget cycle.
Check disk free space and growth rate
- Volumes under 15% free or growing more than 5% per week get a capacity ticket. Watch for runaway log directories — IIS, SQL transaction logs, and Windows Update cache are the usual culprits when a volume fills overnight.
Validate monitoring and alert routing
- Trigger a synthetic alert and confirm it pages the current on-call via PagerDuty / Opsgenie. Schedules drift after every offboarding — a silent monitoring tool is worse than no monitoring tool.

Backup and DR Validation

Verify nightly job success in backup console
- Confirm last 7 nights of jobs in Veeam / Datto / Rubrik / Cohesity. Investigate any retry-then-succeed pattern — it's usually the canary for a storage or network problem brewing on the backup target.
Restore a sample file to isolated environment
- Pick a random file from a random job and restore it to a sandbox VM. The point is to prove the restore path works end-to-end, including the credentials and decryption keys — the documentation that the backup is actually usable, not just that the job ran green.
Collects list
Confirm immutable offsite copy is current
- Verify the 3-2-1 third copy in S3 Object Lock / Azure immutable blob / air-gapped tape is within RPO. Ransomware-resilient backup means the production account cannot delete or encrypt this copy — confirm IAM separation, not just job success.
Sign off on the maintenance run
- Close the change record with outcome notes and any deviations from the approved plan. Attach screenshots of patch results, scan summary, and restore confirmation. Anything that needs follow-up gets a child ticket before this step is complete.
Collects list Collects paragraph Collects signature