Network Troubleshooting Checklist
A runbook for a sysadmin or MSP technician to triage and resolve a network outage or degradation, from initial physical-layer checks through DNS/DHCP and wireless diagnosis. It captures the symptom, scope, and resolution for the ticket record.
Ticket Intake and Scope
-
Capture the reported symptom and impact
Record what the user sees — "no internet," "slow Teams calls," "shared drive unreachable" — plus when it started and what changed recently (patch window, ISP work, office move). Vague tickets like "network is down" almost always narrow to one app, one subnet, or one VLAN once you ask.
-
Determine the scope of impact
Scope drives the next move. A single user is an endpoint problem; one VLAN or floor points at a switch or AP; a site-wide outage points at the firewall, ISP, or DNS. Confirm by asking a second user on the same subnet, or by checking the PRTG, Auvik, or Meraki dashboard for affected devices.
-
Set ticket priority and notify stakeholders
Site-wide or multi-site impact escalates to P1 — page the on-call engineer via PagerDuty / Opsgenie, post in the NOC channel, and start the incident timeline. Single-user issues stay at standard helpdesk priority.
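The scope-to-suspect mapping in this section can be sketched as a small lookup, useful if intake is scripted or fed from a form. The scope labels and the function name are hypothetical; the runbook does not state a priority for a single VLAN or floor, so that entry is left unset, and the multi-site suspects are assumed to match the site-wide ones.

```python
from typing import Optional, Tuple

# Scope labels are illustrative names for the cases described above.
TRIAGE = {
    "single_user": ("endpoint", "standard"),
    "vlan_or_floor": ("switch or AP", None),  # priority not stated in the runbook
    "site_wide": ("firewall, ISP, or DNS", "P1"),
    "multi_site": ("firewall, ISP, or DNS", "P1"),  # suspects assumed same as site-wide
}


def triage(scope: str) -> Tuple[str, Optional[str]]:
    """Return (likely fault domain, ticket priority) for a scope label."""
    try:
        return TRIAGE[scope]
    except KeyError:
        raise ValueError(f"unknown scope: {scope!r}") from None
```

A None priority means the decision is left to local policy.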
Physical and Endpoint Checks
-
Verify cabling, link lights, and PoE
Confirm the patch cable is seated at both ends, the switchport link LED is lit (green on most switches; amber often signals a fault, a blocked port, or a lower negotiated speed, depending on the vendor), and PoE devices are drawing power. A surprising fraction of "network down" tickets are a kicked cable or a tripped PoE budget on an aging switch.
-
Confirm IP, gateway, and DNS on the client
Run ipconfig /all (Windows), ifconfig (macOS), or ip addr (Linux). An APIPA address (169.254.x.x) means DHCP failed; skip ahead to the DHCP section. A correct lease with wrong DNS points at scope options or a static override on the NIC.
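If you collect client addresses in bulk (for example from an RMM export), the APIPA check can be automated with Python's standard ipaddress module. This is a minimal sketch; the function names are illustrative, and the gateway check is an extra sanity test not spelled out in the step above.

```python
import ipaddress

APIPA_NET = ipaddress.ip_network("169.254.0.0/16")


def classify_client_address(addr: str) -> str:
    """Label an IPv4 address as 'apipa' (DHCP failed) or 'leased-or-static'."""
    ip = ipaddress.ip_address(addr)
    if ip.version == 4 and ip in APIPA_NET:
        return "apipa"
    return "leased-or-static"


def gateway_on_link(client_cidr: str, gateway: str) -> bool:
    """True if the gateway sits inside the client's own subnet."""
    network = ipaddress.ip_interface(client_cidr).network
    return ipaddress.ip_address(gateway) in network
```

For example, classify_client_address("169.254.12.7") returns "apipa", and gateway_on_link("192.168.1.50/24", "192.168.2.1") returns False, which would point at a wrong mask or gateway in the lease or static config.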
-
Ping the default gateway and a known external host
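Ping the gateway first to isolate LAN vs. WAN. If the gateway responds but 8.8.8.8 doesn't, the problem is at or upstream of the firewall (firewall egress, ISP, WAN circuit). If the gateway itself doesn't respond, it's a LAN-side switch, VLAN, or cable issue. Some gateways are configured to drop ICMP, so confirm with the ARP table before blaming the LAN.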
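The two-ping decision can be scripted as below. This is a sketch under assumptions: ping_once shells out to the system ping with a single-packet count flag (-n on Windows, -c elsewhere), and on Windows an exit code of 0 does not always mean the target itself replied, so treat the result as a hint. The wording of the "all OK" branch is an addition, not part of the step above.

```python
import platform
import subprocess


def ping_once(host: str, timeout_s: int = 3) -> bool:
    """Send one ICMP echo using the system ping; True if it exits with 0."""
    count_flag = "-n" if platform.system() == "Windows" else "-c"
    try:
        result = subprocess.run(
            ["ping", count_flag, "1", host],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except (OSError, subprocess.TimeoutExpired):
        return False
    return result.returncode == 0


def isolate_lan_vs_wan(gateway_ok: bool, external_ok: bool) -> str:
    """Apply the gateway-first rule to two ping results."""
    if not gateway_ok:
        return "LAN side: switch, VLAN, or cable"
    if not external_ok:
        return "WAN side: ISP or WAN circuit"
    return "basic reachability OK: continue with DNS and application checks"
```

Typical use would be isolate_lan_vs_wan(ping_once(gateway_ip), ping_once("8.8.8.8")), where gateway_ip comes from the client's lease.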
-
Run traceroute to isolate the failure hop
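Use tracert (Windows) or traceroute (macOS/Linux) to the destination the user can't reach. The hop where latency jumps and stays high, or where replies stop for good, is your suspect. A single silent or slow hop followed by normal hops is usually just a router rate-limiting ICMP, not a fault. Asymmetric routing or an MPLS handoff is a common gotcha at the ISP boundary.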
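Reading a trace by eye is error-prone when some hops rate-limit ICMP. The sketch below takes per-hop round-trip times (None where a hop did not answer) and returns the first hop where replies stop for good or where latency jumps and stays high. The 100 ms spike threshold and the function name are assumptions; tune the threshold to your links.

```python
from typing import List, Optional


def first_suspect_hop(
    rtts_ms: List[Optional[float]], spike_ms: float = 100.0
) -> Optional[int]:
    """Return the 1-based suspect hop, or None if the path looks clean."""
    # Case 1: replies stop and never resume before the end of the trace.
    last_reply = max(
        (i for i, rtt in enumerate(rtts_ms) if rtt is not None), default=-1
    )
    if last_reply < len(rtts_ms) - 1:
        return last_reply + 2  # first silent hop, 1-based

    # Case 2: latency jumps by spike_ms and every later reply stays that high.
    prev = 0.0
    for i, rtt in enumerate(rtts_ms):
        if rtt is None:
            continue  # a lone silent hop is usually ICMP rate limiting
        sustained = all(r is None or r >= prev + spike_ms for r in rtts_ms[i:])
        if rtt - prev >= spike_ms and sustained:
            return i + 1
        prev = rtt
    return None
```

A trace of [1.0, 2.0, None, 12.0, 13.0] is treated as clean, since the silent third hop is followed by normal replies.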
Network Device Diagnosis
-
Pull switch and router logs for the affected port or VLAN
Log into the upstream switch over SSH or through its management console (Meraki dashboard, Cisco IOS show logging, FortiGate diagnose, Aruba show log). Look for err-disable, STP topology change, port flap, or duplex mismatch entries within the incident window.
-
Check switchport status and error counters
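Run show interface status and show interface counters errors on Cisco, or the equivalent on your platform. CRC errors point at a bad cable or NIC; input drops point at a microburst or saturated uplink; err-disabled ports usually mean a port-security violation, though BPDU guard and similar protections also trigger it. Counters are cumulative, so compare two readings taken a few minutes apart rather than trusting a single snapshot.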
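Because error counters accumulate since the last clear, it helps to diff two snapshots before drawing conclusions. The sketch below assumes you have already parsed the counters into dictionaries; the key names ("crc", "input_drops") are hypothetical and depend on your parser, not on any vendor output format.

```python
from typing import Dict, List


def counter_deltas(before: Dict[str, int], after: Dict[str, int]) -> Dict[str, int]:
    """Per-counter growth between two snapshots of the same port."""
    return {name: value - before.get(name, 0) for name, value in after.items()}


def interpret_port(deltas: Dict[str, int], err_disabled: bool = False) -> List[str]:
    """Translate counter growth into the hints listed in this step."""
    hints = []
    if deltas.get("crc", 0) > 0:
        hints.append("CRC errors: suspect a bad cable or NIC")
    if deltas.get("input_drops", 0) > 0:
        hints.append("input drops: suspect a microburst or saturated uplink")
    if err_disabled:
        hints.append("err-disabled: check for a port-security violation")
    return hints
```

A port whose CRC count is large but not growing would return no hints, which usually means the bad cable was already replaced and the counters were never cleared.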
-
Verify VLAN assignment and trunk configuration
Confirm the access port is in the right VLAN and the upstream trunk carries it (show interface trunk). A native-VLAN mismatch across a trunk is a classic STP and broadcast-loop trigger.
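Trunk allowed-VLAN lists are usually printed as comma-separated IDs and ranges such as "1,10,20-30", which are easy to misread. A minimal sketch for checking membership and comparing native VLANs on both ends follows; the function names are illustrative, and the input is assumed to be the allowed-VLAN string copied from the switch output.

```python
from typing import Set


def parse_vlan_list(spec: str) -> Set[int]:
    """Expand a list like '1,10,20-30' into a set of VLAN IDs."""
    vlans: Set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-", 1)
            vlans.update(range(int(low), int(high) + 1))
        else:
            vlans.add(int(part))
    return vlans


def trunk_carries(vlan: int, allowed_spec: str) -> bool:
    """True if the trunk's allowed list includes the access port's VLAN."""
    return vlan in parse_vlan_list(allowed_spec)


def native_vlan_mismatch(local_native: int, remote_native: int) -> bool:
    """True if the two ends of a trunk disagree on the native VLAN."""
    return local_native != remote_native
```

For example, trunk_carries(25, "1,10,20-30") is True, while trunk_carries(40, "1,10,20-30") is False and would explain a client that gets link but no traffic.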
-
Confirm routing table and HSRP/VRRP state
Check show ip route for the destination prefix and confirm the active gateway peer (HSRP/VRRP) is the one you expect. A failover that didn't fail back is a common cause of intermittent connectivity after a maintenance window.
DNS and DHCP
-
Test name resolution against internal and external resolvers
Run nslookup or dig against the internal DNS server (DC, Windows Server DNS, BIND) and against an external resolver (1.1.1.1, 8.8.8.8, Quad9). If the internal server answers for internal names but fails on external names, check its forwarders; if the external resolver answers but the internal server does not, the DC's DNS service is the suspect.
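Where nslookup or dig is not available, or you want to loop over several resolvers, a query can be sent to a specific server with only the Python standard library. This sketch builds a single A-record query in the standard DNS wire format and checks the response code and answer count. It uses UDP only, ignores truncation, and does not validate the reply ID, so treat it as a reachability probe rather than a resolver. The function names are illustrative.

```python
import socket
import struct


def build_query(name: str, query_id: int = 0x1234) -> bytes:
    """Build a DNS query for an A record with recursion desired."""
    header = struct.pack("!HHHHHH", query_id, 0x0100, 1, 0, 0, 0)
    labels = name.rstrip(".").split(".")
    qname = b"".join(bytes([len(l)]) + l.encode("ascii") for l in labels) + b"\x00"
    return header + qname + struct.pack("!HH", 1, 1)  # QTYPE=A, QCLASS=IN


def parse_reply(reply: bytes) -> tuple:
    """Return (rcode, answer_count) from a DNS reply header."""
    _id, flags, _qd, ancount, _ns, _ar = struct.unpack("!HHHHHH", reply[:12])
    return flags & 0x000F, ancount


def resolves(name: str, server: str, timeout_s: float = 2.0) -> bool:
    """True if the given server returns at least one answer for the name."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout_s)
        try:
            sock.sendto(build_query(name), (server, 53))
            reply, _ = sock.recvfrom(512)
        except OSError:  # includes timeouts and unreachable networks
            return False
    if len(reply) < 12:
        return False
    rcode, answers = parse_reply(reply)
    return rcode == 0 and answers > 0
```

Calling resolves() once against the internal server and once against an external resolver gives the two results the step above compares.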
-
Verify DHCP scope utilization and lease activity
Open the DHCP console (Windows Server DHCP, ISC Kea, Meraki, FortiGate) and check the scope. A scope at 100% utilization leaves new clients without a lease, so they fall back to APIPA addresses, which looks identical to a "network down" report. Expand the scope or shorten the lease as a temporary fix; investigate the device-count spike afterward.
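Utilization is simply active leases divided by usable addresses in the range. The sketch below computes it from the scope's start and end addresses; the function name and the exclusions parameter are illustrative, and the lease count is assumed to come from your DHCP console or export.

```python
import ipaddress


def scope_utilization(
    start: str, end: str, active_leases: int, exclusions: int = 0
) -> float:
    """Percent of a DHCP range in use, after subtracting excluded addresses."""
    first = int(ipaddress.ip_address(start))
    last = int(ipaddress.ip_address(end))
    usable = last - first + 1 - exclusions
    if usable <= 0:
        raise ValueError("scope has no usable addresses")
    return 100.0 * active_leases / usable
```

For a range of 192.168.1.100 through 192.168.1.199 with 95 active leases, this returns 95.0, which is close enough to exhaustion to act on before clients start failing.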
-
Review DNS and DHCP server logs
Check Event Viewer (DHCP-Server, DNS-Server channels) or the equivalent on your platform for repeated NACKs, scope-exhausted entries, or zone-transfer failures. Cross-reference timestamps with the user's report.
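Cross-referencing is easier if the exported log entries are filtered down to a window around the reported start time. A minimal sketch, assuming the entries have already been parsed into (timestamp, message) pairs in the same time zone as the user's report; the 15-minute default window is an arbitrary starting point.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

LogEntry = Tuple[datetime, str]


def entries_near(
    entries: List[LogEntry], reported_at: datetime, window_minutes: int = 15
) -> List[LogEntry]:
    """Keep entries within window_minutes before or after the reported time."""
    window = timedelta(minutes=window_minutes)
    return [(ts, msg) for ts, msg in entries if abs(ts - reported_at) <= window]
```

Widen the window if users tend to report late; a NACK storm often starts well before the first ticket arrives.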
Wireless Diagnosis
-
Confirm the issue is wireless-specific
Have the user plug into a wired port (or test with a known-good wired endpoint nearby). If wired works and wireless doesn't, the problem is the AP, SSID, or RF environment — not the upstream network.
-
Check AP health and client count in the controller
In the Meraki / UniFi / Aruba / Mist dashboard, confirm the nearest AP is online, on the right firmware, and not stuck with 60+ clients on a single radio. A single overloaded AP is the most common "wifi is slow" cause in conference rooms.
-
Review signal strength, channel, and interference
Use the controller's RF spectrum view or a tool like Ekahau / NetSpot to confirm signal strength at the affected location is above -67 dBm and the channel isn't being clobbered by a neighbor or rogue AP. 2.4 GHz channel overlap is the usual culprit in dense offices.
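The two checks in this step reduce to simple arithmetic. Signal strength is compared against the -67 dBm floor, and 2.4 GHz channels 1 through 13 are spaced 5 MHz apart while each transmission is roughly 20 MHz wide, so two channels fewer than five numbers apart overlap. The sketch below encodes both; the function names are illustrative, and channel 14 is left out because its spacing differs.

```python
def signal_ok(rssi_dbm: float, floor_dbm: float = -67.0) -> bool:
    """True if the measured signal is stronger than the floor."""
    return rssi_dbm > floor_dbm


def channels_overlap_24ghz(channel_a: int, channel_b: int) -> bool:
    """True if two 2.4 GHz channels (1-13) overlap or share a channel."""
    for channel in (channel_a, channel_b):
        if not 1 <= channel <= 13:
            raise ValueError(f"unsupported 2.4 GHz channel: {channel}")
    return abs(channel_a - channel_b) < 5
```

This is why 1, 6, and 11 are the usual non-overlapping plan: channels_overlap_24ghz(1, 6) is False, while a neighbor on channel 3 overlaps both 1 and 6.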
-
Validate SSID, RADIUS, and authentication
Confirm the SSID is broadcast on the correct AP group, and for 802.1X SSIDs, test a RADIUS auth from the controller against NPS / ClearPass / Cisco ISE. A mismatched RADIUS shared secret, an expired server certificate, or a cert renewal that didn't propagate is a classic post-maintenance failure.
Resolution and Documentation
-
Apply the fix and verify with the reporting user
Don't close on "should be working now." Have the original reporter reproduce their workflow — the Teams call, the file open, the SaaS login — and confirm it succeeds. Restored ping is not the same as restored business function.
-
Document root cause and resolution
Write the ticket close-out in IT Glue / Hudu / Confluence with the symptom, scope, root cause, and the exact fix command or config change. Future-you (or the next on-call) will search this in six months when it recurs.
-
Schedule a post-incident review for P1 outages
If the incident was site-wide or multi-site, hold a 30-minute blameless review within 48 hours. Capture preventive actions — monitoring gap, runbook gap, config drift — as tickets, not as wishes in meeting notes.