Hardware Maintenance Checklist

Quarterly hardware maintenance routine for sysadmins covering physical inspection, firmware updates, performance monitoring, and backup/redundancy verification across servers, network gear, and storage.

1

Routine Physical Inspections

  1. Walk the rack and log visible damage
    • Inspect each U for chassis dents, scorched components, bulging capacitors, and amber fault LEDs on the front panel (cross-check any alert in iLO/iDRAC/IPMI). Note the serial and asset tag of any unit showing wear; these become candidates for the next refresh cycle.

    Collects list Collects image Collects paragraph
  2. Reseat and dress loose cabling
    • Verify power, IPMI/OOB, and data cables are seated and labeled at both ends. Loose SFP+ transceivers and partially seated C13/C14 connectors are among the most common causes of intermittent link flaps and power events that show up later in monitoring.

  3. Clean fans, filters, and intake vents
    • Use compressed air on chassis fans, PSU intakes, and rack door filters. Dust buildup on a 1U server's fan stack restricts airflow, can raise operating temperatures by 5-10°C, and accelerates drive failure. Plan the work around hot/cold aisle airflow so you don't blow dust into adjacent racks.

  4. Verify asset tags match the CMDB
    • Spot-check 10% of inspected hardware against the CMDB or RMM inventory (ServiceNow, Hudu, IT Glue, Auvik). Mismatched serials are how a vendor audit (Microsoft, VMware, Oracle) turns into a six-figure true-up.

  5. Read inlet and exhaust temperatures
    • Pull inlet temps from iDRAC/iLO or rack PDU sensors. ASHRAE's recommended envelope for class A1 equipment is 18-27°C inlet; sustained readings above 27°C usually point to a CRAC issue or blocked airflow. Log readings to PRTG, LogicMonitor, or SolarWinds so you can trend them quarter over quarter.

    Collects number
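
The temperature check in step 5 can be sketched as a small classifier. This is a minimal illustration, not a monitoring integration: the sensor names and readings are hypothetical, and treating "sustained" as "the average of the collected readings exceeds 27°C" is an assumption you may want to replace with a time-window rule from your monitoring tool.

```python
from statistics import mean

# Recommended inlet envelope quoted in step 5 (degrees Celsius).
RECOMMENDED_MIN_C = 18.0
RECOMMENDED_MAX_C = 27.0


def classify_inlet(readings_c):
    """Classify a series of inlet readings for one sensor.

    Returns "no-data", "sustained-high", "spike", "cold", or "ok".
    Assumption: "sustained" means the average of the readings is above
    the recommended maximum; a single high reading counts as a spike.
    """
    if not readings_c:
        return "no-data"
    avg = mean(readings_c)
    if avg > RECOMMENDED_MAX_C:
        return "sustained-high"
    if max(readings_c) > RECOMMENDED_MAX_C:
        return "spike"
    if avg < RECOMMENDED_MIN_C:
        return "cold"
    return "ok"


# Hypothetical sample data: sensor name -> readings in degrees Celsius.
samples = {
    "rack12-front-top": [24.5, 25.0, 26.1],
    "rack12-front-mid": [27.8, 28.4, 29.0],
    "rack13-front-top": [25.0, 27.5, 25.2],
}

report = {name: classify_inlet(vals) for name, vals in samples.items()}
for name, status in sorted(report.items()):
    print(f"{name}: {status}")
```

Anything classified as "sustained-high" is worth a CRAC or airflow check before the next quarter; a "spike" is usually worth a second reading first.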
2

Firmware and BIOS Updates

  1. Inventory current firmware versions
    • Pull BIOS, BMC, NIC, HBA, and drive firmware versions via Dell OME, HPE OneView, Lenovo XClarity, or the vendor CLI. Mismatched firmware across cluster nodes (vSAN, S2D, Nutanix) is a frequent cause of hard-to-diagnose performance regressions.

  2. Review vendor advisories for CVEs
    • Check Dell, HPE, Cisco, Fortinet, and Palo Alto security advisories for CVSS 7+ items affecting your installed firmware. Cross-reference against the CVEs your vuln scanner (Tenable, Qualys, Rapid7) flagged this cycle.

    Collects list
  3. File a change request with rollback plan
    • Submit an RFC to the CAB with the maintenance window, blast radius, rollback steps, and a named change owner. Standard changes (pre-approved firmware on identical hardware) skip CAB; firmware that is new to production is a normal change and needs CAB review.

  4. Export BMC and BIOS configuration
    • Export current iDRAC/iLO config (XML/JSON) and BIOS settings before flashing. Vendor update bundles occasionally reset BIOS to defaults — losing your boot order, SR-IOV, or virtualization toggles is the classic post-flash gotcha.

    Collects file
  5. Apply updates to the test ring first
    • Deploy in three rings: test (lab/dev), pilot (one prod node), then prod (the rest). Wait 7 days between rings to surface latent issues. Updating all production nodes the same evening is how you discover bad NIC firmware on every node at once.

  6. Validate post-update health
    • Re-run vendor diagnostics (Dell SupportAssist, HPE Active Health), check ESXi/Hyper-V cluster status, and confirm no new SEL events. Roll back if any node fails to boot or shows degraded sensors.

    Collects list
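
The firmware inventory from step 1 lends itself to an automated drift check before any change request is filed. The sketch below assumes you have already exported the inventory into a plain dictionary of node to component versions; the node names, component keys, and version strings are all hypothetical, and no vendor API is called.

```python
from collections import defaultdict


def find_firmware_drift(inventory):
    """Return {component: {version: [nodes]}} for every component whose
    firmware version is not identical across all nodes in the inventory."""
    by_component = defaultdict(lambda: defaultdict(list))
    for node, components in inventory.items():
        for component, version in components.items():
            by_component[component][version].append(node)

    drift = {}
    for component, versions in by_component.items():
        if len(versions) > 1:  # more than one distinct version in the cluster
            drift[component] = {v: sorted(n) for v, n in versions.items()}
    return drift


# Hypothetical export for one three-node cluster.
inventory = {
    "esx01": {"bios": "2.1.0", "bmc": "6.10", "nic": "22.5"},
    "esx02": {"bios": "2.1.0", "bmc": "6.10", "nic": "22.5"},
    "esx03": {"bios": "2.0.4", "bmc": "6.10", "nic": "22.5"},
}

for component, versions in find_firmware_drift(inventory).items():
    print(component, versions)
```

Run it once per cluster rather than across the whole estate, since different hardware generations legitimately carry different firmware.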
3

Performance Monitoring

  1. Review SEL and syslog for hardware errors
    • Pull System Event Logs from each BMC and the last 30 days of syslog from your SIEM (Splunk, Sentinel, Sumo). Look for ECC memory errors, predictive drive failures, and PSU input loss events; these often precede outright failure by weeks.

  2. Check CPU, memory, and disk baselines
    • Compare 30-day rolling averages against last quarter in PRTG, LogicMonitor, or Datadog. Hosts running >80% sustained CPU or >85% memory are candidates for workload rebalancing before they page on-call at 3 AM.

  3. Run vendor diagnostics on flagged hardware
    • Use Dell SupportAssist, HPE SmartArray Diagnostics, or smartctl on any node showing anomalies. SMART attributes 5 (reallocated sectors), 187 (reported uncorrectable errors), and 197 (current pending sectors) are the early-warning trio for SSD/HDD failure; a nonzero or rising raw value on any of them warrants a closer look.

  4. File replacement tickets for failing components
    • Open warranty cases with the vendor (Dell ProSupport, HPE Foundation Care) for any flagged drive, DIMM, or PSU. Capture the service tag, dispatch ID, and ETA in the PSA ticket so the dispatcher knows when to schedule the on-site swap.

    Collects number
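
The SMART check in step 3 can be sketched as a filter over smartctl's JSON output. This assumes the layout emitted by `smartctl -A --json` in smartmontools 7.x for ATA devices (an `ata_smart_attributes.table` list whose entries carry `id` and `raw.value`); verify that against your installed version, and note that NVMe devices report health differently. The sample data is hypothetical and trimmed.

```python
import json

# The three attributes named in step 3.
WATCHED_IDS = {
    5: "reallocated sectors",
    187: "uncorrectable errors",
    197: "current pending sectors",
}


def flag_smart_attributes(smart_report):
    """Return sorted (id, label, raw_value) tuples for watched attributes
    whose raw value is greater than zero."""
    table = smart_report.get("ata_smart_attributes", {}).get("table", [])
    flagged = []
    for attr in table:
        attr_id = attr.get("id")
        raw = attr.get("raw", {}).get("value", 0)
        if attr_id in WATCHED_IDS and raw > 0:
            flagged.append((attr_id, WATCHED_IDS[attr_id], raw))
    return sorted(flagged)


# Hypothetical, trimmed output for one drive (assumed JSON layout).
sample_json = """
{
  "ata_smart_attributes": {
    "table": [
      {"id": 5, "name": "Reallocated_Sector_Ct", "raw": {"value": 8}},
      {"id": 9, "name": "Power_On_Hours", "raw": {"value": 41210}},
      {"id": 187, "name": "Reported_Uncorrect", "raw": {"value": 0}},
      {"id": 197, "name": "Current_Pending_Sector", "raw": {"value": 2}}
    ]
  }
}
"""

sample = json.loads(sample_json)
for attr_id, label, raw in flag_smart_attributes(sample):
    print(f"attribute {attr_id} ({label}): raw value {raw}")
```

Comparing the flagged values against last quarter's run matters more than any single snapshot, since a growing count is the stronger failure signal.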
4

Backup and Redundancy Checks

  1. Confirm backup jobs hit RPO targets
    • In Veeam, Datto, or Rubrik, confirm that the last 30 days of backup jobs completed within RPO. Green dashboards lie: a job can complete "successfully" while skipping a VM that was offline. Verify per protected object, not per job.

  2. Run a restore drill into an isolated VLAN
    • Pick one production VM and restore it to an isolated test VLAN. Boot it, log in, and verify application data. The restore-test gap is where ransomware day finds you: nightly backups green for 18 months, restore script broken since Q1.

    Collects list
  3. Investigate any failed restore
    • If the restore drill fails, open a SEV2 ticket. Common causes: rotated service-account credentials, an expired backup-vendor license, an immutability lock blocking the test target, or a backup format change after a software upgrade. Document the root cause in the post-mortem.

  4. Verify RAID and pool health
    • Check RAID controller status (PERC, SmartArray) and ZFS/Storage Spaces pool health. A degraded array with no remaining redundancy (for example, RAID 5 with one failed disk) is one unreadable sector away from data loss. Confirm hot spares are present and accessible.

  5. Test redundant PSU and uplink failover
    • During a maintenance window, pull one PSU per dual-PSU server and one uplink per LAG to confirm seamless failover. Redundancy that has never been tested often turns out to be missing: the B-side feed has been dead for months and nobody noticed.

  6. Update the DR runbook with current state
    • Reconcile the DR runbook in IT Glue / Hudu / Confluence against this quarter's hardware changes. Verify RPO/RTO targets still match leadership expectations and the BCP. Stale runbooks cost hours during an actual incident.

    Collects signature
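
The per-object verification in step 1 can be sketched as a comparison of each protected object's newest restore point against the RPO. This is a simplified illustration: the object names, timestamps, and 24-hour RPO are hypothetical, it checks only the latest restore point rather than the full 30-day history, and it assumes you have already exported restore-point times from your backup product.

```python
from datetime import datetime, timedelta, timezone


def find_rpo_violations(last_restore_points, rpo, now):
    """Return sorted names of protected objects whose newest restore point
    is missing (None) or older than the RPO as of `now`."""
    violations = []
    for name, restore_point in last_restore_points.items():
        if restore_point is None or now - restore_point > rpo:
            violations.append(name)
    return sorted(violations)


# Hypothetical export: protected object -> newest restore point (UTC).
now = datetime(2024, 1, 15, 6, 0, tzinfo=timezone.utc)
last_restore_points = {
    "sql01": now - timedelta(hours=5),    # inside a 24-hour RPO
    "file01": now - timedelta(hours=30),  # stale, e.g. VM was offline
    "app02": None,                        # never backed up
}

violations = find_rpo_violations(last_restore_points, timedelta(hours=24), now)
print(violations)
```

Keying the check on protected objects rather than jobs is the point: a VM silently skipped by a "successful" job still shows up here as stale or missing.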
