Hardware Maintenance Checklist
Quarterly hardware maintenance routine for sysadmins covering physical inspection, firmware updates, performance monitoring, and backup/redundancy verification across servers, network gear, and storage.
Routine Physical Inspections
-
Walk the rack and log visible damage
Inspect each U for chassis dents, scorched components, bulging capacitors, and amber LEDs on iLO/iDRAC/IPMI front panels. Note serial and asset tag for any unit showing wear — these become candidates for the next refresh cycle.
-
Reseat and dress loose cabling
Verify power, IPMI/OOB, and data cables are seated and labeled at both ends. Loose SFP+ transceivers and half-clicked C13/C14 connectors are the most common cause of intermittent link flaps that show up later in monitoring.
-
Clean fans, filters, and intake vents
Use compressed air on chassis fans, PSU intakes, and rack door filters. Dust buildup on a 1U server's fan stack can raise internal temperatures by 5-10°C and accelerate drive failure. Schedule with hot/cold aisle airflow in mind: don't blow dust into adjacent racks.
-
Verify asset tags match the CMDB
Spot-check 10% of inspected hardware against the CMDB or RMM inventory (ServiceNow, Hudu, IT Glue, Auvik). Mismatched serials are how a vendor audit (Microsoft, VMware, Oracle) turns into a six-figure true-up.
-
Read inlet and exhaust temperatures
Pull inlet temps from iDRAC/iLO or rack PDU sensors. ASHRAE's recommended envelope is 18-27°C inlet (the class A1 allowable range is wider, 15-32°C); sustained readings above 27°C point to a CRAC issue or blocked airflow. Log readings to PRTG / LogicMonitor / SolarWinds for trending.
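As a rough illustration of the inlet-temperature step, the sketch below classifies a series of readings against the 18-27°C range named in the checklist. The function name, the status labels, and the choice to treat "sustained" as "every reading in the window is out of range" are assumptions made for this example, not part of any vendor tool.

```python
# Bounds taken from the checklist text (degrees Celsius).
RECOMMENDED_MIN_C = 18.0
RECOMMENDED_MAX_C = 27.0


def classify_inlet(readings_c):
    """Classify one sensor's inlet readings over a review window.

    Assumption: "sustained" means every reading in the window is out of range,
    so a single transient spike does not raise a flag.
    """
    if not readings_c:
        return "no-data"
    if all(r > RECOMMENDED_MAX_C for r in readings_c):
        return "sustained-high"
    if all(r < RECOMMENDED_MIN_C for r in readings_c):
        return "sustained-low"
    return "ok"


# Hypothetical readings keyed by a made-up sensor name.
sample = {
    "rack12-u20": [22.5, 23.0, 22.8],
    "rack12-u38": [27.9, 28.4, 29.1],
}
flagged = {name: classify_inlet(vals) for name, vals in sample.items()}
```

In practice the readings would come from the BMC or PDU export rather than a literal dictionary, and the window length is a judgment call.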
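The 10% asset-tag spot check in this section could be sketched as follows. It assumes both the inspection notes and the CMDB export have already been reduced to plain dictionaries mapping asset tag to serial number; `spot_check` and its parameters are hypothetical names, and no real CMDB API is called.

```python
import random


def spot_check(inspected, cmdb, percent=10, seed=None):
    """Sample a percentage of inspected assets and compare serials to the CMDB.

    inspected, cmdb: dicts mapping asset tag -> serial number (assumed shape).
    Returns (sampled_tags, mismatches).
    """
    if not inspected:
        return [], {}
    rng = random.Random(seed)
    # Integer ceiling division so small racks still get at least one sample.
    sample_size = max(1, -(-len(inspected) * percent // 100))
    sampled_tags = rng.sample(sorted(inspected), sample_size)
    mismatches = {}
    for tag in sampled_tags:
        recorded = cmdb.get(tag)  # None means the tag is missing from the CMDB
        if recorded != inspected[tag]:
            mismatches[tag] = {"observed": inspected[tag], "cmdb": recorded}
    return sampled_tags, mismatches
```

Passing a fixed `seed` makes the sample reproducible for the audit trail; leaving it unset gives a fresh sample each quarter.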
Firmware and BIOS Updates
-
Inventory current firmware versions
Pull BIOS, BMC, NIC, HBA, and drive firmware via Dell OME, HPE OneView, Lenovo XClarity, or vendor CLI. Mismatched firmware across cluster nodes (vSAN, S2D, Nutanix) is a common cause of hard-to-diagnose performance regressions.
-
Review vendor advisories for CVEs
Check Dell, HPE, Cisco, Fortinet, and Palo Alto security advisories for CVSS 7+ items affecting your installed firmware. Cross-reference against the CVEs your vuln scanner (Tenable, Qualys, Rapid7) flagged this cycle.
-
File a change request with rollback plan
Submit RFC to CAB with the maintenance window, blast radius, rollback steps, and named change owner. Standard changes (pre-approved firmware on identical hardware) skip CAB; novel firmware to production is a normal change.
-
Export BMC and BIOS configuration
Export current iDRAC/iLO config (XML/JSON) and BIOS settings before flashing. Vendor update bundles occasionally reset BIOS to defaults — losing your boot order, SR-IOV, or virtualization toggles is the classic post-flash gotcha.
-
Apply updates to the test ring first
Three-ring deployment: test (lab/dev), pilot (one prod node), prod (rest). Wait 7 days between rings to surface latent issues. Updating all production nodes the same evening is how you discover a bad NIC firmware after the fact.
-
Validate post-update health
Re-run vendor diagnostics (Dell SupportAssist, HPE Active Health), check ESXi/Hyper-V cluster status, and confirm no new SEL events. Roll back if any node fails to boot or shows degraded sensors.
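To make the firmware inventory step in this section concrete, here is a minimal sketch that reports components running more than one version across a cluster. The inventory shape (node name mapped to component/version pairs) and the function name are assumptions; in practice the data would come from an export of whichever management tool is in use.

```python
from collections import defaultdict


def find_firmware_drift(inventory):
    """Return {component: {version: [nodes]}} for components with mixed versions.

    inventory: dict mapping node name -> {component: version} (assumed shape).
    """
    by_component = defaultdict(lambda: defaultdict(list))
    for node, components in inventory.items():
        for component, version in components.items():
            by_component[component][version].append(node)
    return {
        component: {version: sorted(nodes) for version, nodes in versions.items()}
        for component, versions in by_component.items()
        if len(versions) > 1  # more than one version in the cluster means drift
    }


# Hypothetical three-node cluster; version strings are made up.
cluster = {
    "esx01": {"bios": "2.19.1", "nic": "22.5.7"},
    "esx02": {"bios": "2.19.1", "nic": "22.5.7"},
    "esx03": {"bios": "2.19.1", "nic": "21.8.0"},
}
drift = find_firmware_drift(cluster)
```

Running the same comparison before and after the update window gives a quick check that the cluster converged on one version per component.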
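The three-ring rollout with a 7-day wait between rings translates into a simple date calculation. The sketch below computes the earliest start date for each ring; the ring names come from the checklist, while the function name and the example start date are illustrative assumptions.

```python
from datetime import date, timedelta

RINGS = ("test", "pilot", "prod")  # ring order from the checklist
SOAK_DAYS = 7                      # wait between rings, per the checklist


def ring_schedule(start, rings=RINGS, soak_days=SOAK_DAYS):
    """Return the earliest start date for each ring, in rollout order."""
    return {
        ring: start + timedelta(days=index * soak_days)
        for index, ring in enumerate(rings)
    }


# Hypothetical start date for the test ring.
schedule = ring_schedule(date(2025, 1, 6))
```

These are earliest dates only; a ring that surfaces a problem should push every later ring back rather than keep the calendar.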
Performance Monitoring
-
Review SEL and syslog for hardware errors
Pull System Event Logs from each BMC and the last 30 days of syslog from your SIEM (Splunk, Sentinel, Sumo). Look for ECC memory errors, predictive drive failures, and PSU input loss events — these precede outright failure by weeks.
-
Check CPU, memory, and disk baselines
Compare 30-day rolling averages against last quarter in PRTG, LogicMonitor, or Datadog. Hosts running >80% sustained CPU or >85% memory are candidates for workload rebalancing before they page on-call at 3 AM.
-
Run vendor diagnostics on flagged hardware
Use Dell SupportAssist, HPE SmartArray Diagnostics, or smartctl on any node showing anomalies. SMART attributes 5 (reallocated sectors), 187 (reported uncorrectable errors), and 197 (current pending sectors) are the early-warning trio for SSD/HDD failure.
-
File replacement tickets for failing components
Open warranty cases with the vendor (Dell ProSupport, HPE Foundation Care) for any flagged drive, DIMM, or PSU. Capture the service tag, dispatch ID, and ETA in the PSA ticket so the dispatcher knows when to schedule the on-site swap.
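The baseline review in this section uses two thresholds: more than 80% sustained CPU or more than 85% memory. A minimal sketch of that filter is shown below, assuming the 30-day rolling averages have already been exported from the monitoring tool into a dictionary; the function name and data shape are hypothetical.

```python
CPU_LIMIT_PCT = 80.0  # threshold from the checklist
MEM_LIMIT_PCT = 85.0  # threshold from the checklist


def rebalance_candidates(averages):
    """Return {host: [reasons]} for hosts over either threshold.

    averages: dict mapping host -> {"cpu": pct, "mem": pct}, where each value
    is a 30-day rolling average (assumed shape).
    """
    flagged = {}
    for host, metrics in averages.items():
        reasons = []
        if metrics["cpu"] > CPU_LIMIT_PCT:
            reasons.append("cpu")
        if metrics["mem"] > MEM_LIMIT_PCT:
            reasons.append("mem")
        if reasons:
            flagged[host] = reasons
    return flagged


# Hypothetical hosts and averages.
hosts = {
    "hv01": {"cpu": 62.0, "mem": 71.5},
    "hv02": {"cpu": 83.4, "mem": 88.0},
    "hv03": {"cpu": 45.0, "mem": 91.2},
}
candidates = rebalance_candidates(hosts)
```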
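For the SMART check on flagged drives, the sketch below scans an already-parsed attribute report for attributes 5, 187, and 197. It assumes the report is a dictionary shaped like the JSON that recent smartmontools releases emit for ATA drives (an `ata_smart_attributes` object containing a `table` list with `id` and `raw.value` fields); treat that layout, and the rule that any nonzero raw value deserves a look, as assumptions to verify against your own output.

```python
# Attribute IDs named in the checklist.
WATCHED = {
    5: "reallocated sectors",
    187: "uncorrectable errors",
    197: "current pending",
}


def smart_warnings(report):
    """Return {attribute_id: raw_value} for watched attributes with nonzero raw values.

    report: parsed JSON dict; the key layout is an assumption (see lead-in).
    """
    table = report.get("ata_smart_attributes", {}).get("table", [])
    warnings = {}
    for attr in table:
        attr_id = attr.get("id")
        raw_value = attr.get("raw", {}).get("value", 0)
        if attr_id in WATCHED and raw_value > 0:
            warnings[attr_id] = raw_value
    return warnings


# Hypothetical report fragment for one drive.
example_report = {
    "ata_smart_attributes": {
        "table": [
            {"id": 5, "raw": {"value": 8}},
            {"id": 9, "raw": {"value": 41250}},
            {"id": 187, "raw": {"value": 0}},
            {"id": 197, "raw": {"value": 2}},
        ]
    }
}
drive_warnings = smart_warnings(example_report)
```

NVMe drives report health differently and would need a separate check.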
Backup and Redundancy Checks
-
Confirm backup jobs hit RPO targets
In Veeam, Datto, or Rubrik, confirm last 30 days of backup jobs completed within RPO. Green dashboards lie — a job can complete "successfully" while skipping a VM that was offline. Verify per-protected-object, not per-job.
-
Run a restore drill into an isolated VLAN
Pick one production VM and restore it to an isolated test VLAN. Boot it, log in, and verify application data. The restore-test gap is where ransomware-day finds you — backup nightly green for 18 months, restore script broken since Q1.
-
Investigate the failed restore
If the restore drill failed, open a SEV2 ticket. Common causes: rotated service-account credentials, expired backup-vendor license, immutability lock blocking the test target, or backup format change after a software upgrade. Document the root cause in the post-mortem.
-
Verify RAID and pool health
Check RAID controller status (PERC, SmartArray) and ZFS/Storage Spaces pool health. A degraded single-parity array has no redundancy left and is one unreadable sector away from data loss. Confirm hot spares are present and accessible.
-
Test redundant PSU and uplink failover
During a maintenance window, pull one PSU per dual-PSU server and one uplink per LAG to confirm seamless failover. Failover that has never been tested often reveals that the B-side feed has been dead for months and nobody noticed.
-
Update the DR runbook with current state
Reconcile the DR runbook in IT Glue / Hudu / Confluence against this quarter's hardware changes. Verify RPO/RTO targets still match leadership expectations and the BCP. Stale runbooks cost hours during an actual incident.
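The advice earlier in this section to verify RPO per protected object rather than per job can be sketched as a comparison between the list of objects that should be protected and the newest successful restore point for each. The data shapes and the function name below are assumptions; the real inputs would come from the backup product's reporting export.

```python
from datetime import datetime, timedelta


def rpo_violations(protected_objects, last_restore_point, rpo, now):
    """Return {object: age_or_None} for objects outside the RPO.

    protected_objects: names that must be protected (e.g. from the CMDB).
    last_restore_point: dict mapping name -> datetime of newest good restore point.
    A value of None means no restore point exists for that object at all.
    """
    violations = {}
    for name in protected_objects:
        newest = last_restore_point.get(name)
        if newest is None:
            violations[name] = None  # skipped entirely, e.g. VM was offline
        elif now - newest > rpo:
            violations[name] = now - newest
    return violations


# Hypothetical VMs, timestamps, and a 24-hour RPO.
now = datetime(2025, 3, 1, 8, 0)
points = {
    "sql01": datetime(2025, 3, 1, 1, 0),
    "file01": datetime(2025, 2, 26, 1, 0),
}
gaps = rpo_violations(["sql01", "file01", "app02"], points, timedelta(hours=24), now)
```

Driving the loop from the list of objects that should be protected, not from the job history, is what catches a VM that was silently skipped.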