Network Troubleshooting Checklist

Runbook a sysadmin or MSP technician follows to triage and resolve a network outage or degradation, from initial physical-layer checks through DNS/DHCP and wireless diagnosis. Captures the symptom, scope, and resolution for the ticket record.

6 sections, 21 steps

Ticket Intake and Scope

Capture the reported symptom and impact
- Record what the user sees — "no internet," "slow Teams calls," "shared drive unreachable" — plus when it started and what changed recently (patch window, ISP work, office move). Vague tickets like "network is down" almost always narrow to one app, one subnet, or one VLAN once you ask.
Determine the scope of impact
- Scope drives the next move. A single affected user usually points at the endpoint; one VLAN or floor points at a switch or AP; a site-wide outage points at the firewall, ISP, or DNS. Confirm by asking a second user on the same subnet, or by checking the PRTG, Auvik, or Meraki dashboard for other affected devices.
Set ticket priority and notify stakeholders
- Site-wide or multi-site impact escalates to P1 — page the on-call engineer via PagerDuty / Opsgenie, post in the NOC channel, and start the incident timeline. Single-user issues stay at standard helpdesk priority.
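The scope and priority rules in this section can be sketched as a small decision helper. This is a minimal illustration, not a standard: the `Scope` enum, the function names, and the `whole_site_down` flag are hypothetical, and mapping a one-VLAN or one-floor issue to standard priority is an assumption the checklist does not state.

```python
from enum import Enum


class Scope(Enum):
    SINGLE_USER = "single user"
    SEGMENT = "one VLAN or floor"
    SITE_WIDE = "site-wide"
    MULTI_SITE = "multi-site"


def classify_scope(affected_users: int, affected_sites: int, whole_site_down: bool) -> Scope:
    """Map what intake learned onto the scope buckets used in this checklist."""
    if affected_sites > 1:
        return Scope.MULTI_SITE
    if whole_site_down:
        return Scope.SITE_WIDE
    if affected_users <= 1:
        return Scope.SINGLE_USER
    return Scope.SEGMENT


def priority_for(scope: Scope) -> str:
    """Site-wide and multi-site impact is P1; everything else stays standard.

    Treating SEGMENT as standard priority is an assumption; adjust to local policy.
    """
    if scope in (Scope.SITE_WIDE, Scope.MULTI_SITE):
        return "P1"
    return "standard"
```

For example, `priority_for(classify_scope(40, 2, True))` yields `"P1"`, which is the point to page on-call and start the incident timeline.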

Physical and Endpoint Checks

Verify cabling, link lights, and PoE
- Confirm the patch cable is seated at both ends, the switchport link LED is lit (green on most switches; amber often signals a fault, a blocked port, or a reduced link speed, depending on the vendor), and PoE devices are drawing power. Many "network down" tickets turn out to be a kicked cable or an exhausted PoE budget on an aging switch.
Confirm IP, gateway, and DNS on the client
- Run ipconfig /all (Windows), ifconfig (macOS), or ip addr (Linux). An APIPA address (169.254.x.x) means the client got no DHCP lease; skip ahead to the DHCP section. A valid lease with the wrong DNS servers points at the DHCP scope options or a static override on the NIC.
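If you are triaging many clients from an inventory export, the APIPA check can be scripted. A minimal sketch using Python's standard `ipaddress` module; the function name and return strings are hypothetical.

```python
import ipaddress


def diagnose_client_ip(addr: str) -> str:
    """Flag APIPA (169.254.0.0/16) addresses, which indicate a failed DHCP lease."""
    ip = ipaddress.ip_address(addr)
    # is_link_local is True for 169.254.0.0/16 on IPv4 addresses.
    if ip.version == 4 and ip.is_link_local:
        return "apipa"
    return "not-apipa"
```

An `"apipa"` result sends you to the DHCP section; `"not-apipa"` only means the address is not self-assigned, so the gateway and DNS settings still need a look.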
Ping the default gateway and a known external host
- Ping the gateway first to isolate LAN vs. WAN. If the gateway responds but 8.8.8.8 doesn't, the problem is at or beyond the WAN edge (firewall policy or NAT, ISP, WAN circuit). If the gateway itself doesn't respond, suspect a LAN-side switch, VLAN, or cable issue.
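The gateway-then-external logic can be sketched as follows. The `ping` helper shells out to the system `ping` command and treats its exit code as a rough signal only (Windows `ping` can report success on a "Destination host unreachable" reply). The function names and return codes are hypothetical, and the case where only the gateway is silent is an added assumption that the gateway drops ICMP.

```python
import platform
import subprocess


def ping(host: str, count: int = 2) -> bool:
    """Return True if the system ping command exits successfully."""
    flag = "-n" if platform.system() == "Windows" else "-c"
    result = subprocess.run(
        ["ping", flag, str(count), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def isolate_fault(gateway_ok: bool, external_ok: bool) -> str:
    """Apply the LAN vs. WAN split described in this step."""
    if gateway_ok and external_ok:
        return "ok"  # connectivity is fine; look at DNS or the application
    if gateway_ok:
        return "wan"  # gateway answers, external host does not
    if external_ok:
        return "gateway-drops-icmp"  # assumption: gateway filters ping
    return "lan"  # switch, VLAN, or cable
```

A typical call would be `isolate_fault(ping("192.168.1.1"), ping("8.8.8.8"))`, where `192.168.1.1` stands in for your actual default gateway.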
Run traceroute to isolate the failure hop
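- Use tracert (Windows) or traceroute (macOS/Linux) to the destination the user can't reach. The hop where latency spikes and stays high, or where replies stop for good, is your suspect. A single hop that times out while later hops still answer is usually ICMP rate limiting, not a failure. Asymmetric routing or an MPLS handoff is a common gotcha at the ISP boundary.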
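Reading a trace by eye works, but the rule can also be written down. A minimal sketch, assuming you have already extracted one round-trip time per hop (with `None` for a hop that timed out); the function name and the 100 ms spike threshold are hypothetical choices, not part of any tool.

```python
from typing import Optional


def first_suspect_hop(rtts_ms: list[Optional[float]], spike_ms: float = 100.0) -> Optional[int]:
    """Return the 1-based hop where replies stop for good or latency jumps.

    A silent hop followed by later replies is skipped, since many routers
    rate-limit or drop the ICMP replies that traceroute depends on.
    """
    previous = 0.0
    for index, rtt in enumerate(rtts_ms):
        hop = index + 1
        if rtt is None:
            # Suspect only if nothing beyond this hop answers either.
            if all(r is None for r in rtts_ms[index:]):
                return hop
            continue
        if rtt - previous >= spike_ms:
            return hop
        previous = rtt
    return None
```

For the trace `[1.2, 2.0, 9.5, None, 11.0, 180.0, 185.0]` this returns hop 6: the timeout at hop 4 is ignored because hop 5 answers, and the jump from 11 ms to 180 ms is the first spike.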

Network Device Diagnosis

Pull switch and router logs for the affected port or VLAN
- Connect to the upstream device over SSH or through its management console (Meraki dashboard event log, Cisco IOS show logging, FortiGate diagnose, Aruba show log). Look for err-disable, STP topology change, port flap, or duplex mismatch entries within the incident window.
Check switchport status and error counters
- Run show interfaces status and show interfaces counters errors on Cisco, or the equivalent on your platform. CRC errors point at a bad cable or NIC; input drops point at a microburst or saturated uplink; err-disabled ports commonly mean a port-security or BPDU Guard violation.
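Interface counters are cumulative, so what matters is whether they are still climbing. A minimal sketch that compares two snapshots taken a few minutes apart and applies the mapping from this step; the counter names (`crc`, `input_drops`), function names, and return codes are hypothetical and would need to match whatever your platform or monitoring export provides.

```python
def counter_deltas(before: dict[str, int], after: dict[str, int]) -> dict[str, int]:
    """Difference between two counter snapshots of the same port."""
    return {name: after[name] - before.get(name, 0) for name in after}


def interpret_port(deltas: dict[str, int], err_disabled: bool = False) -> list[str]:
    """Translate rising counters into the suspects named in this step."""
    findings = []
    if err_disabled:
        findings.append("port-security-violation")
    if deltas.get("crc", 0) > 0:
        findings.append("cable-or-nic")
    if deltas.get("input_drops", 0) > 0:
        findings.append("microburst-or-saturated-uplink")
    return findings
```

A port whose CRC count moved from 10 to 25 between snapshots would come back as `["cable-or-nic"]`; a port with large but static counters comes back empty, which usually means the errors predate the incident.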
Verify VLAN assignment and trunk configuration
- Confirm the access port is in the right VLAN and the upstream trunk carries it (show interface trunk). A native-VLAN mismatch across a trunk is a classic STP and broadcast-loop trigger.
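Trunk allowed-VLAN lists are usually printed as compact ranges such as `1,10-20,30`, which makes it easy to misread whether a VLAN is carried. A minimal sketch of that membership check, assuming you have copied the allowed-VLAN string out of the trunk output by hand; the function names are hypothetical and the parser handles only comma-separated numbers, ranges, and the literal `none`.

```python
def parse_vlan_list(spec: str) -> set[int]:
    """Expand a string like '1,10-20,30' into a set of VLAN IDs."""
    if spec.strip().lower() == "none":
        return set()
    vlans: set[int] = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-", 1)
            vlans.update(range(int(low), int(high) + 1))
        else:
            vlans.add(int(part))
    return vlans


def trunk_carries(access_vlan: int, allowed_spec: str) -> bool:
    """True if the access port's VLAN is in the trunk's allowed list."""
    return access_vlan in parse_vlan_list(allowed_spec)
```

With an allowed list of `1,10-20,30`, VLAN 15 is carried and VLAN 25 is not, which would explain a port that links up but never reaches its gateway.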
Confirm routing table and HSRP/VRRP state
- Check show ip route for the destination prefix and confirm the active gateway peer (HSRP/VRRP) is the one you expect. A failover that didn't fail back is a common cause of intermittent connectivity after a maintenance window.

DNS and DHCP

Test name resolution against internal and external resolvers
- Run nslookup or dig against the internal DNS server (DC, Windows Server DNS, BIND) and against an external resolver (1.1.1.1, 8.8.8.8, Quad9). If the internal server resolves internal names but not external ones, check its forwarders; if the external resolver answers but the internal server doesn't, the DNS service on the DC is the suspect.
Verify DHCP scope utilization and lease activity
- Open the DHCP console (Windows Server DHCP, ISC Kea, Meraki, FortiGate) and check the scope. A scope at 100% utilization has no leases left to offer, so new clients fall back to APIPA addresses and the result looks identical to a "network down" report. Expand the scope or shorten the lease time as a temporary fix; investigate the device-count spike afterward.
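If your DHCP platform reports lease counts but not a percentage, utilization is simple arithmetic over the scope range. A minimal sketch using the standard `ipaddress` module; the function name and the `excluded` parameter (addresses reserved or excluded from the pool) are hypothetical.

```python
import ipaddress


def scope_utilization(start: str, end: str, active_leases: int, excluded: int = 0) -> float:
    """Percentage of the usable pool currently leased."""
    first = int(ipaddress.ip_address(start))
    last = int(ipaddress.ip_address(end))
    pool_size = last - first + 1 - excluded
    if pool_size <= 0:
        raise ValueError("scope has no usable addresses")
    return 100.0 * active_leases / pool_size
```

A pool from 10.0.5.50 to 10.0.5.249 holds 200 addresses, so 200 active leases is 100% and the next new client will fall back to APIPA.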
Review DNS and DHCP server logs
- Check Event Viewer (DHCP-Server, DNS-Server channels) or the equivalent on your platform for repeated NACKs, scope-exhausted entries, or zone-transfer failures. Cross-reference timestamps with the user's report.
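Cross-referencing is easier once the log entries are exported with timestamps. A minimal sketch that keeps only entries within a window around the reported start time; the function name, the `(timestamp, message)` tuple shape, and the 15-minute default are hypothetical.

```python
from datetime import datetime, timedelta


def entries_near(
    report_time: datetime,
    entries: list[tuple[datetime, str]],
    window_minutes: int = 15,
) -> list[tuple[datetime, str]]:
    """Return log entries within +/- window_minutes of the user's report."""
    window = timedelta(minutes=window_minutes)
    return [(ts, msg) for ts, msg in entries if abs(ts - report_time) <= window]
```

Make sure the server logs and the user's reported time are in the same time zone before comparing; a one-hour offset will hide every relevant entry.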

Wireless Diagnosis

Confirm the issue is wireless-specific
- Have the user plug into a wired port (or test with a known-good wired endpoint nearby). If wired works and wireless doesn't, the problem is the AP, SSID, or RF environment — not the upstream network.
Check AP health and client count in the controller
- In the Meraki / UniFi / Aruba / Mist dashboard, confirm the nearest AP is online, on the right firmware, and not stuck with 60+ clients on a single radio. A single overloaded AP is a common cause of "wifi is slow" complaints in conference rooms.
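With more than a handful of APs, it helps to filter a client-count export rather than scan the dashboard. A minimal sketch, assuming a hypothetical export shaped as AP name mapped to per-radio client counts; the 60-client threshold comes from this step, and the function name is made up for illustration.

```python
def overloaded_radios(
    aps: dict[str, dict[str, int]],
    max_clients: int = 60,
) -> list[tuple[str, str, int]]:
    """List (ap_name, radio, client_count) for radios at or above the threshold."""
    return sorted(
        (name, radio, count)
        for name, radios in aps.items()
        for radio, count in radios.items()
        if count >= max_clients
    )
```

An AP that shows 45 clients in total may be fine, while one with 62 clients on its 5 GHz radio alone is the one to rebalance or supplement.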
Review signal strength, channel, and interference
- Use the controller's RF spectrum view or a tool like Ekahau / NetSpot to confirm signal strength at the affected location is above -67 dBm and the channel isn't being clobbered by a neighbor or rogue AP. 2.4 GHz channel overlap is the usual culprit in dense offices.
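Both checks in this step reduce to simple comparisons. A minimal sketch, assuming 2.4 GHz channels 1 through 13 and the conservative rule behind the 1/6/11 plan, where channels fewer than five apart are treated as overlapping; the function names are hypothetical.

```python
def signal_ok(rssi_dbm: float, floor_dbm: float = -67.0) -> bool:
    """True if the measured signal is above the -67 dBm target."""
    return rssi_dbm > floor_dbm


def channels_overlap_24ghz(a: int, b: int) -> bool:
    """True if two 2.4 GHz channels (1-13) overlap under the 1/6/11 rule."""
    for channel in (a, b):
        if not 1 <= channel <= 13:
            raise ValueError(f"channel {channel} is outside 1-13")
    # Channel centers are 5 MHz apart, so fewer than five channels of
    # separation is treated as overlap (conservative assumption).
    return abs(a - b) < 5
```

A neighbor on channel 3 overlaps both channel 1 and channel 6, which is why a single off-plan AP can degrade two otherwise clean channels.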
Validate SSID, RADIUS, and authentication
- Confirm the SSID is broadcast on the correct AP group, and for 802.1X SSIDs, test a RADIUS auth from the controller against NPS / ClearPass / Cisco ISE. A mismatched RADIUS shared secret, an expired server certificate, or a cert renewal that didn't propagate is a classic post-maintenance failure.

Resolution and Documentation

Apply the fix and verify with the reporting user
- Don't close on "should be working now." Have the original reporter reproduce their workflow — the Teams call, the file open, the SaaS login — and confirm it succeeds. Restored ping is not the same as restored business function.
Document root cause and resolution
- Write the ticket close-out in IT Glue / Hudu / Confluence with the symptom, scope, root cause, and the exact fix command or config change. Future-you (or the next on-call) will search this in six months when it recurs.
Schedule a post-incident review for P1 outages
- If the incident was site-wide or multi-site, hold a 30-minute blameless review within 48 hours. Capture preventive actions — monitoring gap, runbook gap, config drift — as tickets, not as wishes in meeting notes.
