Network Monitoring Checklist
A recurring operational pass that a sysadmin or NOC technician runs to keep network configuration, monitoring, security, performance, and recovery in good order. Designed for a monthly cadence against a single site or client environment.
Network Inventory and Configuration
-
Pull the device inventory from the RMM
Export the live device list from NinjaOne, Datto RMM, ConnectWise Automate, or Auvik — whatever is the system of record. Reconcile against IT Glue or Hudu documentation; rogue or undocumented devices are the most common audit finding at this step.
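The reconciliation in this step amounts to a two-way set difference. A minimal sketch, assuming both systems can export a flat list of device identifiers such as MAC addresses; the function name and the sample values are hypothetical and do not reflect any RMM's or documentation tool's export format:

```python
def reconcile(rmm_devices, documented_devices):
    """Compare two collections of device identifiers, ignoring case and whitespace."""
    live = {d.strip().lower() for d in rmm_devices}
    documented = {d.strip().lower() for d in documented_devices}
    return {
        # Seen by the RMM but missing from documentation: possible rogue devices.
        "undocumented": sorted(live - documented),
        # Documented but not seen by the RMM: retired or offline devices.
        "stale": sorted(documented - live),
    }

# Hypothetical sample data.
result = reconcile(
    ["AA:BB:CC:00:00:01", "aa:bb:cc:00:00:02", "AA:BB:CC:00:00:03"],
    ["aa:bb:cc:00:00:01", "aa:bb:cc:00:00:02", "aa:bb:cc:00:00:09"],
)
```

Both directions matter: the first list feeds the audit finding, the second flags documentation that needs pruning.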
-
Verify firmware versions on routers, switches, and firewalls
Compare running firmware on FortiGate, Meraki, Catalyst, Aruba, and SonicWall gear against the vendor's current recommended release. Note any device more than two minor releases behind — these are change-window candidates, not emergency patches unless a CVE applies.
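The "more than two minor releases behind" rule can be sketched as a version comparison. This assumes purely numeric `major.minor.patch` version strings, which not every vendor uses; the function names are hypothetical, and treating a major-version mismatch as a candidate for manual review is an assumption rather than something the checklist states:

```python
def parse_version(text):
    """Turn '7.2.5' into (7, 2, 5). Assumes purely numeric dotted versions."""
    return tuple(int(part) for part in text.split("."))

def minor_gap(running, recommended):
    """Minor releases behind within the same major, or None if the majors differ."""
    run, rec = parse_version(running), parse_version(recommended)
    if run[0] != rec[0]:
        return None
    return max(0, rec[1] - run[1])

def is_change_window_candidate(running, recommended, max_minor_gap=2):
    gap = minor_gap(running, recommended)
    # Assumption: a different major version is always worth a manual review.
    return gap is None or gap > max_minor_gap
```

For example, `is_change_window_candidate("7.0.12", "7.4.3")` flags the device, while a device on `7.2.5` against the same target does not cross the threshold.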
-
Rotate privileged device credentials via PAM
Rotate local admin and enable secrets in CyberArk, Delinea, Passportal, or Hudu Vault. Common gotcha: a service account hardcoded in a backup script that breaks the moment the password rotates — confirm dependent services before the change.
-
Confirm VLAN segmentation matches the topology diagram
Walk the trunk configuration on each access switch against the documented VLAN plan. PCI scope, guest WiFi isolation, and IoT/printer segmentation are the three places drift quietly accumulates.
-
Audit 802.1X and NAC enforcement on access ports
Spot-check that 802.1X is enforcing (not in monitor mode) on production switchports. RADIUS misconfigurations frequently cause silent fallback to MAB, which defeats the control.
Monitoring and Alerting
-
Confirm SNMPv3 polling on every managed device
Pull the polling status report from PRTG, Auvik, LogicMonitor, or SolarWinds Orion. Devices that are reachable but not polling are usually credential mismatches or ACLs blocking the NMS source IP — fix at the device, not the monitoring server.
-
Re-baseline thresholds against the past 30 days
Adjust CPU, memory, and interface utilization thresholds based on the trailing 30-day baseline so the noise floor reflects current load. Static thresholds copied from a vendor template are the leading cause of alert fatigue.
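One way to derive a threshold from the trailing baseline is the mean plus a multiple of the standard deviation, capped at 100%. This is a sketch of one common approach, not a method the checklist prescribes; the function name, the multiplier `k`, and the sample values are assumptions:

```python
import statistics

def baseline_threshold(samples_pct, k=3.0, cap=100.0):
    """Alert threshold = mean + k * population stdev of the samples, capped at `cap`."""
    mean = statistics.fmean(samples_pct)
    spread = statistics.pstdev(samples_pct)
    return min(cap, mean + k * spread)

# Hypothetical utilization samples in percent.
quiet_link = baseline_threshold([40, 60], k=2)   # mean 50, stdev 10
busy_link = baseline_threshold([95, 99], k=3)    # would exceed 100, so it is capped
```

A percentile-based threshold works as well; what matters is that the value comes from observed load rather than a vendor template.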
-
Triage syslog and SIEM events from the prior week
Review unresolved events in Sentinel, Splunk, or Elastic for the prior seven days. Focus on auth failures against the management plane, unexpected config changes, and IPS signature hits, since these surface incidents the threshold alerts miss.
-
Tune PagerDuty routing and on-call schedule
Verify each service in PagerDuty or Opsgenie maps to the correct escalation policy and that holiday coverage is filled. Stale schedules — someone who left the team six months ago still on the Tier 2 rota — are how a P1 sits unacknowledged at 2am.
-
Update runbooks for the top three alert types
Pull the most-fired alerts from the past month and confirm each has a current runbook in IT Glue or Hudu with the actual diagnostic command, not generic advice. The Tier 1 tech at 3am should not need to invent the fix.
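Finding the top three alert types and checking them against existing runbooks can be sketched with a counter. This assumes the alerting tool can export one alert name per firing; the alert names and the runbook set below are hypothetical:

```python
from collections import Counter

def top_alert_types(alert_names, n=3):
    """Return the n most frequently fired alert names, most frequent first."""
    return [name for name, _ in Counter(alert_names).most_common(n)]

# Hypothetical month of alert firings.
fired = (
    ["High CPU"] * 5
    + ["Interface down"] * 9
    + ["BGP flap"] * 2
    + ["Disk full"] * 4
)
top_three = top_alert_types(fired)

# Hypothetical set of alert types that already have a current runbook.
documented_runbooks = {"Interface down", "BGP flap"}
missing_runbooks = [name for name in top_three if name not in documented_runbooks]
```

Anything left in `missing_runbooks` is the runbook work for this cycle.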
Security and Compliance
-
Run an authenticated Nessus scan on the production segment
Use credentialed scanning in Tenable, Qualys, or Rapid7 — unauthenticated scans miss most of the meaningful findings. Confirm scan windows are coordinated so the IPS does not blackhole the scanner mid-run.
-
Triage scan findings by CVSS score
Sort findings by CVSS and exploitability (KEV catalog membership trumps raw score). Critical findings on internet-facing assets get an emergency change; internal mediums roll into the next monthly patch ring.
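The ordering rule (KEV membership first, then raw score) maps directly onto a two-part sort key. A minimal sketch, assuming each finding has already been annotated with whether its CVE appears in the KEV catalog; the `Finding` class and the sample identifiers are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class Finding:
    ident: str      # hypothetical finding identifier
    cvss: float     # CVSS base score, 0.0 to 10.0
    in_kev: bool    # True if the CVE is listed in the KEV catalog

def triage_order(findings):
    """KEV-listed findings first, then by descending CVSS score."""
    return sorted(findings, key=lambda f: (not f.in_kev, -f.cvss))

findings = [
    Finding("FINDING-A", 9.8, False),
    Finding("FINDING-B", 7.5, True),
    Finding("FINDING-C", 5.3, False),
    Finding("FINDING-D", 8.1, True),
]
ordered = [f.ident for f in triage_order(findings)]
```

Note that the 7.5 finding outranks the 9.8 one here because it is known to be exploited, which is the point of the rule.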
-
File an emergency change for critical CVEs
Open an emergency RFC in ServiceNow or ConnectWise PSA with the CVE, affected hosts, mitigation, and rollback plan. Skip CAB only with documented executive approval: an emergency change is still a tracked change.
-
Confirm IDS/IPS signatures and EDR policies are current
Check signature subscription status on FortiGate IPS or Palo Alto Threat Prevention, and policy version in CrowdStrike, SentinelOne, or Defender for Endpoint. Lapsed subscriptions silently stop updating without breaking traffic.
-
Verify patch-ring deployment in Intune or SCCM
Walk the test → pilot → production rings and confirm last month's KB rollouts completed without app regressions. The point of three rings is catching a bad KB at pilot; if 100% of the fleet is patched on day 1, the rings are not real.
-
Tabletop the incident response runbook
Run a 30-minute tabletop with the on-call rota against a recent realistic scenario — ransomware on a file server, phishing-driven token theft, or a public-facing service compromise. Capture gaps and feed them back into the runbook.
Performance Optimization
-
Analyze NetFlow for top talkers and saturated links
Pull NetFlow or sFlow from the core in Auvik, Kentik, or SolarWinds NTA. Backup jobs running during business hours and unsanctioned cloud sync clients are the usual top talkers — both have policy fixes, not bandwidth fixes.
-
Validate QoS markings on voice and video traffic
Confirm DSCP markings (EF for voice, AF41 for video) survive end-to-end across LAN and SD-WAN. ISPs commonly strip or rewrite markings at the handoff unless the SD-WAN edge re-marks, and Teams and Zoom call-quality complaints often trace back to this.
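When checking a packet capture at the far end, the DSCP value is the upper six bits of the IP ToS/Traffic Class byte: EF is 46 and AF41 is 34. A small sketch of that check, assuming the ToS byte has already been read from a capture; the function and dictionary names are hypothetical:

```python
# Standard DSCP code points: EF = 46 (voice), AF41 = 34 (video).
EXPECTED_DSCP = {"voice": 46, "video": 34}

def dscp_from_tos(tos_byte):
    """DSCP is the top six bits of the ToS byte; the low two bits are ECN."""
    return (tos_byte & 0xFF) >> 2

def marking_survived(traffic_class, observed_tos_byte):
    """True if the observed ToS byte still carries the expected DSCP value."""
    return dscp_from_tos(observed_tos_byte) == EXPECTED_DSCP[traffic_class]
```

A voice packet arriving with ToS `0xB8` still carries EF; the same packet arriving with `0x00` has been re-marked to best effort somewhere along the path.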
-
Run iPerf tests across site-to-site VPN tunnels
Run iperf3 between sites at off-peak hours. Compare against last month's results and the contracted circuit speed; throughput that stays well below the contracted speed across repeated runs is an ISP ticket, not a firewall tuning exercise.
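The comparison can be scripted against iperf3's JSON output (`iperf3 -J`). This sketch assumes a TCP test, where receiver-side throughput is reported under `end.sum_received.bits_per_second`; the 80% floor ratio, the function names, and the sample figures are hypothetical, and a single run does not establish a sustained drop:

```python
import json

def received_mbps(iperf_json_text):
    """Extract receiver-side throughput in Mbit/s from `iperf3 -J` output (TCP test)."""
    result = json.loads(iperf_json_text)
    return result["end"]["sum_received"]["bits_per_second"] / 1_000_000

def below_floor(current_mbps, contracted_mbps, floor_ratio=0.8):
    """True if throughput is under the assumed floor (a fraction of contracted speed)."""
    return current_mbps < contracted_mbps * floor_ratio

# Trimmed, hypothetical sample of iperf3 JSON output.
sample = '{"end": {"sum_received": {"bits_per_second": 412000000.0}}}'
current = received_mbps(sample)
```

Storing `current` each month gives the month-over-month comparison; repeating the run on several days separates a one-off dip from a circuit problem.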
-
Forecast WAN capacity from 30-day utilization
Pull 95th-percentile utilization for each WAN circuit. Anything sustained above 70% is an upgrade conversation — the budget cycle is long, so flag now rather than the week the link saturates.
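If the monitoring tool does not report the 95th percentile directly, it can be computed from raw samples by sorting them and discarding the top 5%. A minimal sketch using that nearest-rank method; the function names and the sample data are hypothetical, and the 70% trigger is the one named in this step:

```python
def percentile_95(samples):
    """Nearest-rank 95th percentile: the highest sample after dropping the top 5%."""
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100   # integer ceil(0.95 * n)
    return ordered[rank - 1]

def needs_upgrade_conversation(samples_mbps, capacity_mbps, trigger_pct=70.0):
    """True if 95th-percentile utilization exceeds the trigger percentage."""
    utilization_pct = percentile_95(samples_mbps) / capacity_mbps * 100
    return utilization_pct > trigger_pct

# Hypothetical samples: 1, 2, ..., 100 Mbit/s.
samples = list(range(1, 101))
p95 = percentile_95(samples)
```

Using the 95th percentile instead of the peak keeps short bursts from driving an upgrade that sustained load does not justify.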
Backup and Recovery
-
Verify nightly config backups in RANCID or Oxidized
Confirm Oxidized, RANCID, or the RMM-native backup successfully captured every device's running-config in the past 24 hours. Devices that authenticate with TACACS+ often silently stop backing up when the TACACS+ shared secret rotates.
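The 24-hour freshness check can be sketched independently of the backup tool. This assumes the last successful backup time per device can be exported (for example from file modification times or version-control history); the function name, device names, and timestamps are hypothetical:

```python
from datetime import datetime, timedelta

def stale_backups(last_backup_by_device, now, max_age=timedelta(hours=24)):
    """Return devices whose newest backup is missing or older than max_age."""
    stale = []
    for device, last_backup in last_backup_by_device.items():
        if last_backup is None or now - last_backup > max_age:
            stale.append(device)
    return sorted(stale)

# Hypothetical report data; None means no backup was ever captured.
now = datetime(2024, 1, 15, 6, 0)
report = stale_backups(
    {
        "core-sw-01": datetime(2024, 1, 15, 2, 0),
        "edge-fw-01": datetime(2024, 1, 13, 2, 0),
        "access-sw-07": None,
    },
    now,
)
```

Checking against the full inventory, not just the devices the backup tool knows about, is what catches the silent failures.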
-
Run a restore drill into the lab segment
Pick a switch or firewall and restore last night's backup to a lab device, not the live one. The 3-2-1 rule is meaningless if the restore path has not been exercised: format-vs-archive mismatches surface here, never in a tabletop.
-
Open a remediation ticket for the failed restore
File a P2 ticket assigned to the backup engineer with the drill notes and the device on which the restore failed. Do not close this monthly checklist with a known-broken backup — the next ransomware day finds whatever was deferred.
-
Confirm the immutable offsite copy meets 3-2-1
Verify the offsite tier in Veeam, Datto, or AWS Backup uses object lock or equivalent immutability. A backup writable from production is not ransomware-resilient regardless of how many copies exist.
-
Update the DR runbook with topology changes
Reconcile the DR runbook against any new VLANs, circuits, or vendor changes from this cycle. RPO and RTO commitments only hold if the runbook reflects the current network — annual reviews catch this too late.