Hardware Troubleshooting Checklist
Bench and on-site workflow a sysadmin or MSP technician runs to triage, isolate, and resolve a hardware fault on a workstation, laptop, or server. Covers intake through OEM RMA escalation.
Ticket Intake and Triage
-
Capture asset tag, user, and symptom
Pull the asset tag from the device sticker (Dell service tag, HP serial, Lenovo MTM) and reconcile against the RMM inventory and PSA ticket. A device that doesn't match the RMM record is a flag — either it was never enrolled, or somebody swapped hardware without updating CMDB.
Collects text Collects text Collects paragraph -
Reproduce the reported symptom
Watch the failure happen before you start swapping parts. Intermittent issues that the user reports but you cannot reproduce are usually environmental (thermals, power, cable seating) — note conditions when it occurs.
-
Verify power, peripherals, and cabling
Reseat power cable, monitor cable (DisplayPort connectors back out under their own weight — known issue), USB peripherals, and dock connections. Try a different wall outlet or PDU port if a desktop won't power on.
-
Document POST codes and on-screen errors
Record any beep codes, diagnostic LED patterns (Dell light codes, HP caps-lock blinks), or BSOD stop codes. Photograph the screen if the error is unfamiliar — these codes map directly to the OEM service manual.
Collects paragraph
System Diagnostics
-
Run the OEM diagnostics utility
Boot the vendor's pre-OS diagnostic: Dell ePSA / SupportAssist (F12), HP PC Hardware Diagnostics UEFI (F2), Lenovo Diagnostics (F10). These run before the OS loads, which rules out driver and OS-layer noise.
-
Review Event Viewer hardware logs
Filter the System log for source disk, nvme, WHEA-Logger, and Kernel-Power. WHEA errors precede most CPU and memory failures by days. On Linux, check dmesg and journalctl -p err.
-
Check BIOS/UEFI settings and boot order
Confirm the boot drive is detected, Secure Boot and TPM are enabled per company baseline, and virtualization extensions (VT-x / AMD-V) are on. A drained CMOS battery resets these — symptom is wrong-clock errors and lost boot order.
-
Perform a hard power cycle
Drain residual power: unplug, hold the power button 30 seconds (laptops with non-removable batteries — pinhole reset on the underside). For servers, fully remove power from both PSUs for 60 seconds before re-seating.
Component Isolation
-
Test PSU rails with a PSU tester
Sagging 12V rails cause random reboots that look like every other failure mode. A $20 PSU tester or a multimeter on the 24-pin connector confirms it in 30 seconds.
-
Run MemTest86 on the RAM
Boot from a MemTest86 USB and run at least one full pass (4 passes for confidence). If errors appear, pull modules and test individually to identify which DIMM and which slot is at fault — bad slots are common on older motherboards.
-
Pull S.M.A.R.T. status from the boot drive
Use CrystalDiskInfo on Windows or smartctl -a /dev/sdX on Linux. Watch reallocated sectors, pending sectors, and (for NVMe) media wear indicator. A drive in Caution status is on borrowed time — back up before further testing.
Collects list -
Stress-test the GPU with FurMark or vendor tool
Run a 15-minute load while monitoring temps in HWiNFO. Artifacts under load with stable temps point to VRAM; thermal throttling above 90°C points to dried thermal paste or a clogged heatsink.
-
Verify NIC link state and replace the patch cable
Confirm the switch port shows the expected link speed and duplex; a 1Gb NIC negotiating at 100Mb half-duplex usually means a damaged cable or bent contact. Try a known-good Cat 6 patch cable before assuming the NIC is bad.
Environmental Checks
-
Read thermals in HWiNFO under load
CPU package above 95°C, GPU above 90°C, or NVMe above 75°C all trigger throttling that the user experiences as random slowness or hangs. Idle thermals matter too — high idle indicates a fan or paste problem.
-
Clear dust from intakes and heatsinks
Compressed air outdoors or in a cleaning area, hold fans still while blowing (free-spinning fans generate back-EMF that can damage the controller). Laptops in dusty environments need this every 6-12 months.
-
Confirm UPS battery health and runtime
For desktops behind an APC or Eaton UPS, check the self-test result and battery age. UPS batteries typically last 3-5 years; an aged battery throws line-noise events that look like random reboots on the protected device.
Firmware and Driver Remediation
-
Apply vendor firmware updates
Use Dell Command Update, HP Image Assistant, or Lenovo System Update — not random firmware off forums. For servers, run iDRAC / iLO / IPMI lifecycle controller updates with the device on AC power and no scheduled jobs running.
-
Install the OEM driver pack
Pull drivers from the OEM support portal keyed to the asset tag, not from Windows Update. Windows Update ships generic Microsoft-signed drivers that often lag the vendor by months — chipset and storage drivers especially.
-
Roll back recent driver if a regression is suspected
If the symptom started right after a Windows Update or driver push, use Device Manager → Properties → Driver tab → Roll Back Driver. Pin the prior version in WSUS / Intune / patch tool to prevent re-application until the vendor fixes it.
-
Apply pending OS cumulative updates
Bring the device current on the latest cumulative update through your patch tool (Intune, WSUS, Automox, Action1). Some hardware-related fixes ride along in CUs and won't be obvious from the KB summary.
Resolution and Escalation
-
Confirm whether the issue is resolved
Reproduce the original failing condition and confirm with the user — not just a tech-side smoke test. Schedule a 24-hour soak before closing if the original report was intermittent.
Collects list -
Open an OEM warranty RMA case
File via the vendor support portal with the asset tag, diagnostic codes, and photos of the failure. Pro-support contracts (Dell ProSupport, HP CarePack, Lenovo Premier) typically ship a part next-business-day; standard warranty is depot repair at 5-10 days.
-
Schedule a loaner or replacement deployment
Pull a spare from the loaner pool, image with Autopilot or your MDM, and coordinate handoff with the user. Capture the BitLocker recovery key from the failed device before the chassis goes back to the OEM.
-
Close the PSA ticket with remediation notes
Document the root cause, the parts replaced, the firmware/driver versions applied, and any preventive notes. This is the entry future technicians will search when the same model fails next quarter — write it for them.
Collects paragraph Collects file
Use this template
Copy it to your account, customize the steps, and run it with your team in minutes.
Browse hundreds of free templates across every team and industry.
Back to template libraryRun Hardware Troubleshooting Checklist with your team
Customize the steps, assign roles, set a schedule, and keep a complete record for every run.