Hardware Troubleshooting Checklist

Bench and on-site workflow a sysadmin or MSP technician runs to triage, isolate, and resolve a hardware fault on a workstation, laptop, or server. Covers intake through OEM RMA escalation.

6 sections 24 steps Collects data
1

Ticket Intake and Triage

  1. Capture asset tag, user, and symptom
    • Pull the asset tag from the device sticker (Dell service tag, HP serial, Lenovo MTM) and reconcile against the RMM inventory and PSA ticket. A device that doesn't match the RMM record is a flag — either it was never enrolled, or somebody swapped hardware without updating CMDB.

    Collects text Collects text Collects paragraph
  2. Reproduce the reported symptom
    • Watch the failure happen before you start swapping parts. Intermittent issues that the user reports but you cannot reproduce are usually environmental (thermals, power, cable seating) — note conditions when it occurs.

  3. Verify power, peripherals, and cabling
    • Reseat power cable, monitor cable (DisplayPort connectors back out under their own weight — known issue), USB peripherals, and dock connections. Try a different wall outlet or PDU port if a desktop won't power on.

  4. Document POST codes and on-screen errors
    • Record any beep codes, diagnostic LED patterns (Dell light codes, HP caps-lock blinks), or BSOD stop codes. Photograph the screen if the error is unfamiliar — these codes map directly to the OEM service manual.

    Collects paragraph
2

System Diagnostics

  1. Run the OEM diagnostics utility
    • Boot the vendor's pre-OS diagnostic: Dell ePSA / SupportAssist (F12), HP PC Hardware Diagnostics UEFI (F2), Lenovo Diagnostics (F10). These run before the OS loads, which rules out driver and OS-layer noise.

  2. Review Event Viewer hardware logs
    • Filter the System log for source disk, nvme, WHEA-Logger, and Kernel-Power. WHEA errors precede most CPU and memory failures by days. On Linux, check dmesg and journalctl -p err.

  3. Check BIOS/UEFI settings and boot order
    • Confirm the boot drive is detected, Secure Boot and TPM are enabled per company baseline, and virtualization extensions (VT-x / AMD-V) are on. A drained CMOS battery resets these — symptom is wrong-clock errors and lost boot order.

  4. Perform a hard power cycle
    • Drain residual power: unplug, hold the power button 30 seconds (laptops with non-removable batteries — pinhole reset on the underside). For servers, fully remove power from both PSUs for 60 seconds before re-seating.

3

Component Isolation

  1. Test PSU rails with a PSU tester
    • Sagging 12V rails cause random reboots that look like every other failure mode. A $20 PSU tester or a multimeter on the 24-pin connector confirms it in 30 seconds.

  2. Run MemTest86 on the RAM
    • Boot from a MemTest86 USB and run at least one full pass (4 passes for confidence). If errors appear, pull modules and test individually to identify which DIMM and which slot is at fault — bad slots are common on older motherboards.

  3. Pull S.M.A.R.T. status from the boot drive
    • Use CrystalDiskInfo on Windows or smartctl -a /dev/sdX on Linux. Watch reallocated sectors, pending sectors, and (for NVMe) media wear indicator. A drive in Caution status is on borrowed time — back up before further testing.

    Collects list
  4. Stress-test the GPU with FurMark or vendor tool
    • Run a 15-minute load while monitoring temps in HWiNFO. Artifacts under load with stable temps point to VRAM; thermal throttling above 90°C points to dried thermal paste or a clogged heatsink.

  5. Verify NIC link state and replace the patch cable
    • Confirm the switch port shows the expected link speed and duplex; a 1Gb NIC negotiating at 100Mb half-duplex usually means a damaged cable or bent contact. Try a known-good Cat 6 patch cable before assuming the NIC is bad.

4

Environmental Checks

  1. Read thermals in HWiNFO under load
    • CPU package above 95°C, GPU above 90°C, or NVMe above 75°C all trigger throttling that the user experiences as random slowness or hangs. Idle thermals matter too — high idle indicates a fan or paste problem.

  2. Clear dust from intakes and heatsinks
    • Compressed air outdoors or in a cleaning area, hold fans still while blowing (free-spinning fans generate back-EMF that can damage the controller). Laptops in dusty environments need this every 6-12 months.

  3. Confirm UPS battery health and runtime
    • For desktops behind an APC or Eaton UPS, check the self-test result and battery age. UPS batteries typically last 3-5 years; an aged battery throws line-noise events that look like random reboots on the protected device.

5

Firmware and Driver Remediation

  1. Apply vendor firmware updates
    • Use Dell Command Update, HP Image Assistant, or Lenovo System Update — not random firmware off forums. For servers, run iDRAC / iLO / IPMI lifecycle controller updates with the device on AC power and no scheduled jobs running.

  2. Install the OEM driver pack
    • Pull drivers from the OEM support portal keyed to the asset tag, not from Windows Update. Windows Update ships generic Microsoft-signed drivers that often lag the vendor by months — chipset and storage drivers especially.

  3. Roll back recent driver if a regression is suspected
    • If the symptom started right after a Windows Update or driver push, use Device Manager → Properties → Driver tab → Roll Back Driver. Pin the prior version in WSUS / Intune / patch tool to prevent re-application until the vendor fixes it.

  4. Apply pending OS cumulative updates
    • Bring the device current on the latest cumulative update through your patch tool (Intune, WSUS, Automox, Action1). Some hardware-related fixes ride along in CUs and won't be obvious from the KB summary.

6

Resolution and Escalation

  1. Confirm whether the issue is resolved
    • Reproduce the original failing condition and confirm with the user — not just a tech-side smoke test. Schedule a 24-hour soak before closing if the original report was intermittent.

    Collects list
  2. Open an OEM warranty RMA case
    • File via the vendor support portal with the asset tag, diagnostic codes, and photos of the failure. Pro-support contracts (Dell ProSupport, HP CarePack, Lenovo Premier) typically ship a part next-business-day; standard warranty is depot repair at 5-10 days.

  3. Schedule a loaner or replacement deployment
    • Pull a spare from the loaner pool, image with Autopilot or your MDM, and coordinate handoff with the user. Capture the BitLocker recovery key from the failed device before the chassis goes back to the OEM.

  4. Close the PSA ticket with remediation notes
    • Document the root cause, the parts replaced, the firmware/driver versions applied, and any preventive notes. This is the entry future technicians will search when the same model fails next quarter — write it for them.

    Collects paragraph Collects file

Use this template

Copy it to your account, customize the steps, and run it with your team in minutes.


Sections 6
Steps 24
Category Systems Administration
Price Free to start
Need a different process

Browse hundreds of free templates across every team and industry.

Back to template library

Run Hardware Troubleshooting Checklist with your team

Customize the steps, assign roles, set a schedule, and keep a complete record for every run.