Server Maintenance Checklist

Pre-Maintenance Preparation

    Pull the RFC from ServiceNow / ConnectWise / Jira Service Management. Verify CAB approval, the scheduled window, the rollback plan, and that downstream service owners have been notified. Emergency changes still require post-hoc CAB review.

    Send the standard maintenance announcement 48 hours ahead — include start/end time in local TZ, expected impact, and rollback contact. For MSP clients, send via the PSA so it lands in the ticket history.

    Confirm Veeam / Datto / Rubrik shows a green job within the last 24 hours and the immutable copy is current. Do not start patching if the most recent backup failed — without a rollback point, an OS update gone wrong becomes a restore-from-tape exercise.

    If the most recent backup failed, re-run the failed job, escalate to the backup engineer if it fails again, and document the root cause in the ticket. Do not enter the maintenance window until a green backup exists or the change has been rescheduled with CAB.
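
    The go/no-go rule above can be sketched as a small gate function. This is a minimal illustration, not a vendor integration: the status string "Success", the function name, and the choice to apply the same 24-hour limit to the immutable copy are all assumptions, and the timestamps would come from whatever your backup console exports.

```python
from datetime import datetime, timedelta, timezone

def backup_gate(last_job_status, last_job_end, immutable_copy_end, now,
                max_age=timedelta(hours=24)):
    """Return True only if patching may proceed.

    Assumptions (not from any vendor API): status is the literal string
    "Success", and "current" for the immutable copy means within max_age.
    """
    job_ok = last_job_status == "Success" and (now - last_job_end) <= max_age
    copy_ok = (now - immutable_copy_end) <= max_age
    return job_ok and copy_ok

if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    # A green job from 3 hours ago with a copy from 5 hours ago passes the gate.
    print(backup_gate("Success", now - timedelta(hours=3),
                      now - timedelta(hours=5), now))
```

    A False result means the window does not open: re-run, escalate, or reschedule with CAB.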

Hardware Inspection

    Walk the rack with compressed air for intake filters and a flashlight for blocked vents. Note any front-bezel LED faults — amber LEDs on Dell PowerEdge / HPE ProLiant front panels usually indicate a hardware event logged in iDRAC / iLO.

    Check intake (cold-aisle) temperature against ASHRAE recommended range (18-27°C / 64-81°F). Confirm hot/cold aisle containment is intact and no blanking panels are missing — gaps cause recirculation that masks real cooling problems.

    Verify dual power feeds land on separate PDUs (A-side / B-side). Tug-test SFP+ / fiber connections — a half-seated transceiver is a common cause of intermittent flapping that monitoring catches but humans miss on a visual sweep.

    Export the SEL (system event log) and lifecycle log. Look for predictive failure alerts on disks, ECC memory errors, and fan/PSU faults. Predictive disk failures get a replacement scheduled before the next window — don't let a yellow indicator ride.

    Run the UPS self-test from PowerChute / APC NMC or the equivalent vendor console. Sealed lead-acid batteries typically need replacement every 3-5 years; measured runtime falling below the documented figure for the connected load is the early signal.

Patching and Firmware

    Confirm whether this server is in the test, pilot, or production ring for the current Patch Tuesday cycle. Production servers should only receive patches that have soaked 7-14 days in pilot without incident.
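
    As a rough sketch of the soak rule, the check below promotes a patch to production only after it has spent a minimum number of days in pilot with no incidents. The function name and the default of 7 days (the low end of the 7-14 day range) are assumptions; set the minimum to match your own ring policy.

```python
from datetime import date

def ready_for_production(pilot_start, today, pilot_incidents, min_soak_days=7):
    """True if the patch soaked at least min_soak_days in pilot with zero incidents."""
    soaked_days = (today - pilot_start).days
    return pilot_incidents == 0 and soaked_days >= min_soak_days

if __name__ == "__main__":
    # Nine days in pilot, no incidents: eligible under a 7-day minimum.
    print(ready_for_production(date(2024, 6, 11), date(2024, 6, 20), 0))
```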

    Push approved KBs from WSUS / SCCM / Intune / Automox per the change record. Cross-reference CVEs against the vendor advisory and CISA KEV catalog — known-exploited vulnerabilities take precedence over routine cumulative updates.
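
    One way to apply the KEV-first rule is to sort the approved patch list so that anything fixing a known-exploited CVE is pushed first. The sketch below assumes you have already exported the approved KBs with their CVE lists and loaded the KEV CVE IDs into a set; the record layout and the placeholder identifiers are hypothetical.

```python
def order_patches(patches, kev_cves):
    """Return patches with KEV-related ones first.

    patches: list of dicts like {"kb": str, "cves": [str, ...]} (hypothetical layout).
    kev_cves: set of CVE IDs taken from the CISA KEV catalog.
    sorted() is stable, so the original order is kept within each group.
    """
    return sorted(patches, key=lambda p: not (set(p["cves"]) & kev_cves))

if __name__ == "__main__":
    # Placeholder identifiers, not real KBs or CVEs.
    approved = [
        {"kb": "KB-ROUTINE", "cves": ["CVE-EXAMPLE-1"]},
        {"kb": "KB-URGENT", "cves": ["CVE-EXAMPLE-2", "CVE-EXAMPLE-3"]},
    ]
    for patch in order_patches(approved, {"CVE-EXAMPLE-3"}):
        print(patch["kb"])
```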

    Use Dell Repository Manager / HPE SUM / Lenovo XClarity to stage BIOS, RAID controller, NIC, and BMC firmware. Firmware mismatches between cluster nodes are a leading cause of vMotion / live-migration failures.
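
    Firmware drift between cluster nodes is easy to spot once the inventory is in one place. The sketch below assumes a hypothetical export shaped as node name to component-version mapping (for example, collected from the BMC inventory pages) and reports every component that is not on a single version across the cluster.

```python
from collections import defaultdict

def firmware_mismatches(inventory):
    """Find components whose firmware version differs between nodes.

    inventory: {node: {component: version}} (hypothetical export format).
    Returns {component: {version: [nodes]}} for components with more than one version.
    """
    by_component = defaultdict(lambda: defaultdict(list))
    for node, components in inventory.items():
        for component, version in components.items():
            by_component[component][version].append(node)
    return {c: dict(v) for c, v in by_component.items() if len(v) > 1}

if __name__ == "__main__":
    cluster = {
        "node1": {"BIOS": "2.1", "NIC": "1.0"},
        "node2": {"BIOS": "2.2", "NIC": "1.0"},
    }
    print(firmware_mismatches(cluster))
```

    An empty result means every component reported by the nodes is on one version; a component missing from a node's export is not flagged by this sketch.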

    After reboot, confirm services start in the expected order, monitoring agents reconnect (PRTG / Datadog / LogicMonitor), and dependent applications respond. Flag any service stuck in a starting state for follow-up before closing the window.

    If post-reboot validation fails, restore from the pre-window snapshot or uninstall the offending KB per the rollback plan in the change record. Open a P2 ticket capturing the symptom, KB number, and host so the patch can be held in pilot for the next cycle.

Security Hardening

    Diff the current FortiGate / Palo Alto / Meraki / pfSense ruleset against the approved baseline. Any rule without a ticket reference or an expiry date is a candidate for removal — temporary rules that became permanent are the most common path to over-permissive access.
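
    Once the ruleset is exported, the removal-candidate rule can be applied mechanically. The sketch below reads the rule as "missing either a ticket reference or an expiry date"; the field names (name, ticket, expires) are assumptions about a hypothetical export, not any vendor's schema.

```python
def removal_candidates(rules):
    """Return names of rules that lack a ticket reference or an expiry date.

    rules: list of dicts with hypothetical keys "name", "ticket", "expires".
    Missing keys, None, and empty strings all count as absent.
    """
    return [r["name"] for r in rules
            if not r.get("ticket") or not r.get("expires")]

if __name__ == "__main__":
    exported = [
        {"name": "allow-vendor-vpn", "ticket": "CHG-1", "expires": "2025-01-31"},
        {"name": "temp-any-any", "ticket": "", "expires": None},
    ]
    print(removal_candidates(exported))
```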

    Pull local admins, Domain Admins, and service accounts. Flag any account with no logon in 60+ days or a password older than the rotation policy. Service accounts running as Domain Admin go on the remediation list — none should remain after this audit cycle.
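
    The three flags in this step can be expressed as a single audit function. This is a sketch under stated assumptions: the account record layout is hypothetical, an account that has never logged on is treated as stale, and the 365-day password age is a stand-in for whatever your rotation policy specifies.

```python
from datetime import datetime, timedelta

def audit_account(acct, now, max_idle=timedelta(days=60),
                  max_password_age=timedelta(days=365)):
    """Return a list of reasons this account should be flagged (empty if clean).

    acct keys (hypothetical): "last_logon" (datetime or None),
    "password_set" (datetime), "is_service" (bool), "groups" (list of str).
    """
    reasons = []
    last_logon = acct.get("last_logon")
    if last_logon is None or now - last_logon >= max_idle:
        reasons.append("no logon in 60+ days")
    if now - acct["password_set"] > max_password_age:
        reasons.append("password older than rotation policy")
    if acct.get("is_service") and "Domain Admins" in acct.get("groups", ()):
        reasons.append("service account in Domain Admins")
    return reasons

if __name__ == "__main__":
    now = datetime(2024, 6, 1)
    svc = {"last_logon": now - timedelta(days=1),
           "password_set": now - timedelta(days=30),
           "is_service": True, "groups": ["Domain Admins"]}
    print(audit_account(svc, now))
```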

    Trigger a Tenable / Qualys / Rapid7 InsightVM authenticated scan against the host. Track Critical and High findings against SLA: typically 15 days for Critical and 30 days for High under most internal policies. SOC 2 does not prescribe specific remediation windows, but auditors will test against whatever SLA the policy states.
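
    Tracking findings against SLA comes down to date arithmetic. The sketch below uses the 15-day and 30-day windows mentioned above as defaults; the function name is hypothetical, and counting from the date the finding was first seen is an assumption, since some policies count from vendor disclosure instead.

```python
from datetime import date, timedelta

# Default windows from the text; adjust to your own policy.
SLA_DAYS = {"Critical": 15, "High": 30}

def sla_status(severity, first_seen, today):
    """Return (due_date, days_remaining); a negative remainder means overdue."""
    due = first_seen + timedelta(days=SLA_DAYS[severity])
    return due, (due - today).days

if __name__ == "__main__":
    due, remaining = sla_status("Critical", date(2024, 3, 1), date(2024, 3, 10))
    print(due.isoformat(), remaining)
```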

    Check CrowdStrike / SentinelOne / Defender for Endpoint console: agent online, sensor version current, last check-in within tolerance, no quarantined items pending review. An EDR agent silently uninstalled by a patch is a known failure mode.

Performance and Capacity

    Pull 30-day trends from PRTG / SolarWinds / Datadog. Flag any resource sustained above 80% — that's the inflection point where contention starts hitting end-user latency. Document capacity asks for the next budget cycle.
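
    "Sustained" needs a working definition before it can be checked against exported trend data. The sketch below assumes one: a resource is flagged when it stays above the threshold for at least a minimum number of consecutive samples. The function names and the default run length of 4 samples are assumptions; pick a run length that matches your sampling interval.

```python
def longest_run_above(samples, threshold=80.0):
    """Length of the longest consecutive run of samples strictly above threshold."""
    longest = current = 0
    for value in samples:
        current = current + 1 if value > threshold else 0
        longest = max(longest, current)
    return longest

def is_sustained(samples, threshold=80.0, min_run=4):
    """True if utilization stayed above threshold for min_run consecutive samples."""
    return longest_run_above(samples, threshold) >= min_run

if __name__ == "__main__":
    cpu_pct = [70, 85, 90, 88, 60, 95]
    print(longest_run_above(cpu_pct), is_sustained(cpu_pct))
```

    Measuring consecutive samples rather than the average keeps short spikes from being flagged and keeps a long plateau from being hidden by quiet nights.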

    Volumes under 15% free or growing more than 5% per week get a capacity ticket. Watch for runaway log directories — IIS, SQL transaction logs, and Windows Update cache are the usual culprits when a volume fills overnight.
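
    The two capacity triggers can be checked from a pair of usage readings taken a week apart. In this sketch the function name is hypothetical, and weekly growth is measured relative to the space used a week ago, which is one reasonable reading of "5% per week"; measuring against total volume size would give different numbers.

```python
def needs_capacity_ticket(total_gb, used_now_gb, used_week_ago_gb,
                          min_free_pct=15.0, max_weekly_growth_pct=5.0):
    """True if the volume is under min_free_pct free or grew faster than the limit."""
    free_pct = 100.0 * (total_gb - used_now_gb) / total_gb
    if used_week_ago_gb > 0:
        growth_pct = 100.0 * (used_now_gb - used_week_ago_gb) / used_week_ago_gb
    else:
        growth_pct = 0.0  # no baseline to compare against
    return free_pct < min_free_pct or growth_pct > max_weekly_growth_pct

if __name__ == "__main__":
    # 1 TB volume, 530 GB used now, 500 GB a week ago: 6% weekly growth.
    print(needs_capacity_ticket(1000, 530, 500))
```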

    Trigger a synthetic alert and confirm it pages the current on-call via PagerDuty / Opsgenie. Schedules drift after every offboarding — a silent monitoring tool is worse than no monitoring tool.

Backup and DR Validation

    Confirm last 7 nights of jobs in Veeam / Datto / Rubrik / Cohesity. Investigate any retry-then-succeed pattern — it's usually the canary for a storage or network problem brewing on the backup target.
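
    The retry-then-succeed pattern is easy to miss when every job ends green. The sketch below assumes a hypothetical export of the last seven nights with one record per job run, carrying a name, a final result, and a retry count, and tallies how often each job needed a retry before succeeding.

```python
from collections import Counter

def retry_then_succeed_counts(job_runs):
    """Count, per job name, the runs that succeeded only after at least one retry.

    job_runs: list of dicts with hypothetical keys "name", "result", "retries".
    """
    return Counter(run["name"] for run in job_runs
                   if run["result"] == "Success" and run["retries"] > 0)

if __name__ == "__main__":
    week = [
        {"name": "sql01", "result": "Success", "retries": 2},
        {"name": "sql01", "result": "Success", "retries": 1},
        {"name": "fs01", "result": "Success", "retries": 0},
    ]
    print(dict(retry_then_succeed_counts(week)))
```

    A job that shows up more than once in a week is the one to investigate on the backup target first.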

    Pick a random file from a random job and restore it to a sandbox VM. The point is to prove the restore path works end-to-end, including the credentials and decryption keys: evidence that the backup is actually usable, not just that the job ran green.

    Verify the 3-2-1 third copy in S3 Object Lock / Azure immutable blob / air-gapped tape is within RPO. Ransomware-resilient backup means the production account cannot delete or encrypt this copy — confirm IAM separation, not just job success.

    Close the change record with outcome notes and any deviations from the approved plan. Attach screenshots of patch results, scan summary, and restore confirmation. Anything that needs follow-up gets a child ticket before this step is complete.