Hardware Troubleshooting Checklist
Ticket Intake and Triage
Pull the asset tag from the device sticker (Dell service tag, HP serial, Lenovo MTM) and reconcile against the RMM inventory and PSA ticket. A device that doesn't match the RMM record is a flag — either it was never enrolled, or somebody swapped hardware without updating CMDB.
Watch the failure happen before you start swapping parts. Intermittent issues that the user reports but you cannot reproduce are usually environmental (thermals, power, cable seating) — note conditions when it occurs.
Reseat power cable, monitor cable (DisplayPort connectors back out under their own weight — known issue), USB peripherals, and dock connections. Try a different wall outlet or PDU port if a desktop won't power on.
Record any beep codes, diagnostic LED patterns (Dell light codes, HP caps-lock blinks), or BSOD stop codes. Photograph the screen if the error is unfamiliar — these codes map directly to the OEM service manual.
System Diagnostics
Boot the vendor's pre-OS diagnostic: Dell ePSA / SupportAssist (F12), HP PC Hardware Diagnostics UEFI (F2), Lenovo Diagnostics (F10). These run before the OS loads, which rules out driver and OS-layer noise.
Filter the System log for source disk, nvme, WHEA-Logger, and Kernel-Power. WHEA errors precede most CPU and memory failures by days. On Linux, check dmesg and journalctl -p err.
Confirm the boot drive is detected, Secure Boot and TPM are enabled per company baseline, and virtualization extensions (VT-x / AMD-V) are on. A drained CMOS battery resets these — symptom is wrong-clock errors and lost boot order.
Drain residual power: unplug, hold the power button 30 seconds (laptops with non-removable batteries — pinhole reset on the underside). For servers, fully remove power from both PSUs for 60 seconds before re-seating.
Component Isolation
Sagging 12V rails cause random reboots that look like every other failure mode. A $20 PSU tester or a multimeter on the 24-pin connector confirms it in 30 seconds.
Boot from a MemTest86 USB and run at least one full pass (4 passes for confidence). If errors appear, pull modules and test individually to identify which DIMM and which slot is at fault — bad slots are common on older motherboards.
Use CrystalDiskInfo on Windows or smartctl -a /dev/sdX on Linux. Watch reallocated sectors, pending sectors, and (for NVMe) media wear indicator. A drive in Caution status is on borrowed time — back up before further testing.
Run a 15-minute load while monitoring temps in HWiNFO. Artifacts under load with stable temps point to VRAM; thermal throttling above 90°C points to dried thermal paste or a clogged heatsink.
Confirm the switch port shows the expected link speed and duplex; a 1Gb NIC negotiating at 100Mb half-duplex usually means a damaged cable or bent contact. Try a known-good Cat 6 patch cable before assuming the NIC is bad.
Environmental Checks
CPU package above 95°C, GPU above 90°C, or NVMe above 75°C all trigger throttling that the user experiences as random slowness or hangs. Idle thermals matter too — high idle indicates a fan or paste problem.
Compressed air outdoors or in a cleaning area, hold fans still while blowing (free-spinning fans generate back-EMF that can damage the controller). Laptops in dusty environments need this every 6-12 months.
For desktops behind an APC or Eaton UPS, check the self-test result and battery age. UPS batteries typically last 3-5 years; an aged battery throws line-noise events that look like random reboots on the protected device.
Firmware and Driver Remediation
Use Dell Command Update, HP Image Assistant, or Lenovo System Update — not random firmware off forums. For servers, run iDRAC / iLO / IPMI lifecycle controller updates with the device on AC power and no scheduled jobs running.
Pull drivers from the OEM support portal keyed to the asset tag, not from Windows Update. Windows Update ships generic Microsoft-signed drivers that often lag the vendor by months — chipset and storage drivers especially.
If the symptom started right after a Windows Update or driver push, use Device Manager → Properties → Driver tab → Roll Back Driver. Pin the prior version in WSUS / Intune / patch tool to prevent re-application until the vendor fixes it.
Bring the device current on the latest cumulative update through your patch tool (Intune, WSUS, Automox, Action1). Some hardware-related fixes ride along in CUs and won't be obvious from the KB summary.
Resolution and Escalation
Reproduce the original failing condition and confirm with the user — not just a tech-side smoke test. Schedule a 24-hour soak before closing if the original report was intermittent.
File via the vendor support portal with the asset tag, diagnostic codes, and photos of the failure. Pro-support contracts (Dell ProSupport, HP CarePack, Lenovo Premier) typically ship a part next-business-day; standard warranty is depot repair at 5-10 days.
Pull a spare from the loaner pool, image with Autopilot or your MDM, and coordinate handoff with the user. Capture the BitLocker recovery key from the failed device before the chassis goes back to the OEM.
Document the root cause, the parts replaced, the firmware/driver versions applied, and any preventive notes. This is the entry future technicians will search when the same model fails next quarter — write it for them.
