Hardware Maintenance Checklist
Routine Physical Inspections
Inspect each U for chassis dents, scorched components, bulging capacitors, and amber LEDs on iLO/iDRAC/IPMI front panels. Note serial and asset tag for any unit showing wear — these become candidates for the next refresh cycle.
Verify power, IPMI/OOB, and data cables are seated and labeled at both ends. Loose SFP+ transceivers and half-clicked C13/C14 connectors are the most common cause of intermittent link flaps that show up later in monitoring.
Use compressed air on chassis fans, PSU intakes, and rack door filters. Dust buildup on a 1U server's fan stack pushes inlet temps 5-10°C and accelerates drive failure. Schedule with hot/cold aisle airflow in mind — don't blow dust into adjacent racks.
Spot-check 10% of inspected hardware against the CMDB or RMM inventory (ServiceNow, Hudu, IT Glue, Auvik). Mismatched serials are how a vendor audit (Microsoft, VMware, Oracle) turns into a six-figure true-up.
Pull inlet temps from iDRAC/iLO or rack PDU sensors. ASHRAE A1 recommends 18-27°C inlet; sustained readings above 27°C mean a CRAC issue or blocked airflow. Log readings to PRTG / LogicMonitor / SolarWinds for trend.
Firmware and BIOS Updates
Pull BIOS, BMC, NIC, HBA, and drive firmware via Dell OME, HPE OneView, Lenovo XClarity, or vendor CLI. Mismatched firmware across cluster nodes (vSAN, S2D, Nutanix) is a leading cause of weird performance regressions.
Check Dell, HPE, Cisco, Fortinet, and Palo Alto security advisories for CVSS 7+ items affecting your installed firmware. Cross-reference against the CVEs your vuln scanner (Tenable, Qualys, Rapid7) flagged this cycle.
Submit RFC to CAB with the maintenance window, blast radius, rollback steps, and named change owner. Standard changes (pre-approved firmware on identical hardware) skip CAB; novel firmware to production is a normal change.
Export current iDRAC/iLO config (XML/JSON) and BIOS settings before flashing. Vendor update bundles occasionally reset BIOS to defaults — losing your boot order, SR-IOV, or virtualization toggles is the classic post-flash gotcha.
Three-ring deployment: test (lab/dev), pilot (one prod node), prod (rest). Wait 7 days between rings to surface latent issues. Updating all production nodes the same evening is how you discover a bad NIC firmware after the fact.
Re-run vendor diagnostics (Dell SupportAssist, HPE Active Health), check ESXi/Hyper-V cluster status, and confirm no new SEL events. Roll back if any node fails to boot or shows degraded sensors.
Performance Monitoring
Pull System Event Logs from each BMC and the last 30 days of syslog from your SIEM (Splunk, Sentinel, Sumo). Look for ECC memory errors, predictive drive failures, and PSU input loss events — these precede outright failure by weeks.
Compare 30-day rolling averages against last quarter in PRTG, LogicMonitor, or Datadog. Hosts running >80% sustained CPU or >85% memory are candidates for workload rebalancing before they page on-call at 3 AM.
Use Dell SupportAssist, HPE SmartArray Diagnostics, or smartctl on any node showing anomalies. SMART attributes 5 (reallocated sectors), 187 (uncorrectable errors), and 197 (current pending) are the early-warning trio for SSD/HDD failure.
Open warranty cases with the vendor (Dell ProSupport, HPE Foundation Care) for any flagged drive, DIMM, or PSU. Capture the service tag, dispatch ID, and ETA in the PSA ticket so the dispatcher knows when to schedule the on-site swap.
Backup and Redundancy Checks
In Veeam, Datto, or Rubrik, confirm last 30 days of backup jobs completed within RPO. Green dashboards lie — a job can complete "successfully" while skipping a VM that was offline. Verify per-protected-object, not per-job.
Pick one production VM and restore it to an isolated test VLAN. Boot it, log in, and verify application data. The restore-test gap is where ransomware-day finds you — backup nightly green for 18 months, restore script broken since Q1.
Open a SEV2 ticket. Common causes: rotated service-account credentials, expired backup-vendor license, immutability lock blocking the test target, or backup format change after a software upgrade. Document the root cause in the post-mortem.
Check RAID controller status (PERC, SmartArray) and ZFS/Storage Spaces pool health. A degraded array running on a single parity disk is one bad sector away from data loss. Confirm hot spares are present and accessible.
During a maintenance window, pull one PSU per dual-PSU server and one uplink per LAG to confirm seamless failover. PSUs that have never been tested often reveal that the B-side feed has been dead for months and nobody noticed.
Reconcile the DR runbook in IT Glue / Hudu / Confluence against this quarter's hardware changes. Verify RPO/RTO targets still match leadership expectations and the BCP. Stale runbooks cost hours during an actual incident.
