Network Monitoring Checklist

A recurring operational pass that a sysadmin or NOC technician runs to keep network configuration, monitoring, security, performance, and recovery in good order. Designed for a monthly cadence against a single site or client environment.

5 sections, 25 steps

Network Inventory and Configuration

Pull the device inventory from the RMM
- Export the live device list from NinjaOne, Datto RMM, ConnectWise Automate, or Auvik — whatever is the system of record. Reconcile against IT Glue or Hudu documentation; rogue or undocumented devices are the most common audit finding at this step.
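The reconciliation above can be sketched as a set comparison between two CSV exports. This is a minimal sketch, assuming each tool can export a CSV with a hostname column; the column names `hostname` and `name` and the sample rows are hypothetical, not taken from any vendor's export format.

```python
import csv
import io


def load_hostnames(csv_text, column):
    """Return the set of normalized (trimmed, lowercased) hostnames in one CSV export."""
    hosts = set()
    for row in csv.DictReader(io.StringIO(csv_text)):
        name = (row.get(column) or "").strip().lower()
        if name:
            hosts.add(name)
    return hosts


def reconcile(rmm_hosts, documented_hosts):
    """Split hostnames into undocumented (RMM only) and stale (documentation only)."""
    return {
        "undocumented": sorted(rmm_hosts - documented_hosts),
        "stale_docs": sorted(documented_hosts - rmm_hosts),
    }


# Hypothetical exports; real column names depend on the tool.
rmm_csv = "hostname,ip\nsw-core-01,10.0.0.2\nfw-edge-01,10.0.0.1\nap-lobby-03,10.0.20.13\n"
docs_csv = "name,location\nSW-CORE-01,MDF\nfw-edge-01,MDF\nsw-old-02,IDF2\n"

result = reconcile(load_hostnames(rmm_csv, "hostname"), load_hostnames(docs_csv, "name"))
print(result)
```

Devices in `undocumented` are the rogue or undocumented candidates; `stale_docs` lists documentation entries the RMM no longer sees.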
Verify firmware versions on routers, switches, and firewalls
- Compare running firmware on FortiGate, Meraki, Catalyst, Aruba, and SonicWall gear against the vendor's current recommended release. Note any device more than two minor releases behind — these are change-window candidates, not emergency patches unless a CVE applies.
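The "more than two minor releases behind" rule can be expressed as a small comparison. This is a sketch, assuming purely numeric dotted versions such as `7.2.4`; vendor schemes with letter suffixes or train names would need their own parsing, and the function names here are hypothetical.

```python
def parse_version(text):
    """Turn '7.2.4' into (7, 2, 4). Assumes numeric, dot-separated components."""
    return tuple(int(part) for part in text.strip().split("."))


def is_change_window_candidate(running, recommended, max_minor_lag=2):
    """True when the running release trails the recommended one by more than max_minor_lag minors."""
    run, rec = parse_version(running), parse_version(recommended)
    if run[0] != rec[0]:
        # Majors differ: an older major counts as behind regardless of minor numbers.
        return run[0] < rec[0]
    return rec[1] - run[1] > max_minor_lag


print(is_change_window_candidate("7.0.12", "7.4.1"))
print(is_change_window_candidate("7.2.4", "7.4.1"))
```

The threshold is a parameter because "two minor releases" is this checklist's convention, not a vendor rule.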
Rotate privileged device credentials via PAM
- Rotate local admin and enable secrets in CyberArk, Delinea, Passportal, or Hudu Vault. Common gotcha: a service account hardcoded in a backup script that breaks the moment the password rotates — confirm dependent services before the change.
Confirm VLAN segmentation matches the topology diagram
- Walk the trunk configuration on each access switch against the documented VLAN plan. PCI scope, guest WiFi isolation, and IoT/printer segmentation are the three places drift quietly accumulates.
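Drift between a trunk and the VLAN plan can be checked by expanding the allowed-VLAN list and comparing sets. This is a minimal sketch, assuming the allowed list is available as a comma-and-range string such as `10,20,30-32`; how that string is pulled from each switch is not shown, and the helper names are hypothetical.

```python
def expand_vlan_list(spec):
    """Expand '10,20,30-32' into {10, 20, 30, 31, 32}."""
    vlans = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue
        if "-" in part:
            low, high = part.split("-")
            vlans.update(range(int(low), int(high) + 1))
        else:
            vlans.add(int(part))
    return vlans


def trunk_drift(configured_spec, planned_vlans):
    """Report VLANs on the trunk but not in the plan, and planned VLANs missing from the trunk."""
    configured = expand_vlan_list(configured_spec)
    return {
        "unexpected": sorted(configured - planned_vlans),
        "missing": sorted(planned_vlans - configured),
    }


# Hypothetical trunk and plan for one uplink.
print(trunk_drift("10,20,30-32,99", {10, 20, 30, 31, 40}))
```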
Audit 802.1X and NAC enforcement on access ports
- Spot-check that 802.1X is enforcing, not running in monitor mode, on production switchports. RADIUS misconfigurations frequently cause silent fallback to MAB, which defeats the control.

Monitoring and Alerting

Confirm SNMPv3 polling on every managed device
- Pull the polling status report from PRTG, Auvik, LogicMonitor, or SolarWinds Orion. Devices that are reachable but not polling are usually credential mismatches or ACLs blocking the NMS source IP — fix at the device, not the monitoring server.
Re-baseline thresholds against the past 30 days
- Adjust CPU, memory, and interface utilization thresholds based on the trailing 30-day baseline so the noise floor reflects current load. Static thresholds copied from a vendor template are the leading cause of alert fatigue.
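One way to derive a threshold from the trailing baseline is mean plus a multiple of the standard deviation, capped at a hard ceiling. The checklist does not prescribe a method, so treat this as one illustrative option; the `sigmas` and `ceiling` defaults are assumptions to tune per environment.

```python
import statistics


def rebaseline_threshold(samples, sigmas=3.0, ceiling=95.0):
    """Alert threshold (percent) from trailing utilization samples: mean + sigmas * stdev, capped."""
    mean = statistics.fmean(samples)
    spread = statistics.pstdev(samples)
    return min(mean + sigmas * spread, ceiling)


# Hypothetical daily CPU averages (percent) for one device.
cpu_samples = [40, 42, 38, 41, 39]
print(round(rebaseline_threshold(cpu_samples), 2))
```

The ceiling keeps a noisy device from ending up with a threshold it can never reach.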
Triage syslog and SIEM events from the prior week
- Review unresolved events in Sentinel, Splunk, or Elastic for the prior seven days. Focus on auth failures against the management plane, unexpected config changes, and IPS signature hits; these surface incidents the threshold alerts miss.
Tune PagerDuty routing and on-call schedule
- Verify each service in PagerDuty or Opsgenie maps to the correct escalation policy and that holiday coverage is filled. Stale schedules — someone who left the team six months ago still on the Tier 2 rota — are how a P1 sits unacknowledged at 2am.
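Holiday coverage gaps can be checked mechanically once the schedule is exported. This is a minimal sketch, assuming shifts are available as (start, end) datetime pairs; the export from PagerDuty or Opsgenie is not shown, and `coverage_gaps` is a hypothetical helper.

```python
from datetime import datetime


def coverage_gaps(shifts, window_start, window_end):
    """Return (gap_start, gap_end) intervals inside the window that no shift covers."""
    gaps = []
    cursor = window_start
    for start, end in sorted(shifts):
        if start > cursor:
            gaps.append((cursor, min(start, window_end)))
        cursor = max(cursor, end)
        if cursor >= window_end:
            break
    if cursor < window_end:
        gaps.append((cursor, window_end))
    return gaps


# Hypothetical holiday window with one uncovered day.
shifts = [
    (datetime(2024, 12, 23, 9), datetime(2024, 12, 25, 9)),
    (datetime(2024, 12, 26, 9), datetime(2024, 12, 27, 9)),
]
for gap in coverage_gaps(shifts, datetime(2024, 12, 24), datetime(2024, 12, 27)):
    print(gap[0].isoformat(), "to", gap[1].isoformat())
```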
Update runbooks for the top three alert types
- Pull the most-fired alerts from the past month and confirm each has a current runbook in IT Glue or Hudu with the actual diagnostic command, not generic advice. The Tier 1 tech at 3am should not need to invent the fix.
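Finding the most-fired alerts and checking them against the runbook index is a short counting exercise. This is a sketch, assuming the month's alerts are exported as a list of alert-type names and the runbook index is a name-to-document mapping; both shapes and all names below are hypothetical.

```python
from collections import Counter


def top_alerts_missing_runbooks(fired_alerts, runbooks, top_n=3):
    """Return (alert_name, count) for each of the top_n alert types that has no runbook entry."""
    top = Counter(fired_alerts).most_common(top_n)
    return [(name, count) for name, count in top if name not in runbooks]


# Hypothetical month of alerts and a runbook index.
fired = ["link_down"] * 5 + ["high_cpu"] * 3 + ["bgp_flap"] * 2 + ["disk_full"]
runbooks = {"link_down": "runbook-link-down", "bgp_flap": "runbook-bgp-flap"}
print(top_alerts_missing_runbooks(fired, runbooks))
```

This only confirms a runbook exists; whether it contains the actual diagnostic command still needs a human read.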

Security and Compliance

Run an authenticated Nessus scan on the production segment
- Use credentialed scanning in Tenable, Qualys, or Rapid7 — unauthenticated scans miss most of the meaningful findings. Confirm scan windows are coordinated so the IPS does not blackhole the scanner mid-run.
Triage scan findings by CVSS score
- Sort findings by CVSS and exploitability; membership in the CISA Known Exploited Vulnerabilities (KEV) catalog trumps raw score. Critical findings on internet-facing assets get an emergency change; internal mediums roll into the next monthly patch ring.
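The ordering rule above, KEV membership first and CVSS second, maps onto a two-part sort key. This is a minimal sketch, assuming findings are dictionaries with `cve`, `cvss`, and `internet_facing` fields (a hypothetical shape, not a scanner export format) and using the CVSS v3 bands of 9.0 and above for critical and 4.0 to 6.9 for medium. The CVE identifiers are placeholders.

```python
def triage(findings, kev_cves):
    """Sort findings so KEV-listed CVEs come first, then by descending CVSS score."""
    return sorted(findings, key=lambda f: (f["cve"] not in kev_cves, -f["cvss"]))


def disposition(finding):
    """Apply the two rules the checklist states; everything else needs a human decision."""
    if finding["cvss"] >= 9.0 and finding["internet_facing"]:
        return "emergency change"
    if 4.0 <= finding["cvss"] <= 6.9 and not finding["internet_facing"]:
        return "monthly patch ring"
    return "manual decision"


# Placeholder identifiers, not real CVEs.
findings = [
    {"cve": "CVE-EXAMPLE-1", "cvss": 9.8, "internet_facing": True},
    {"cve": "CVE-EXAMPLE-2", "cvss": 7.5, "internet_facing": False},
    {"cve": "CVE-EXAMPLE-3", "cvss": 5.0, "internet_facing": False},
]
kev = {"CVE-EXAMPLE-2"}

for f in triage(findings, kev):
    print(f["cve"], disposition(f))
```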
File an emergency change for critical CVEs
- Open an emergency RFC in ServiceNow or ConnectWise PSA with the CVE, affected hosts, mitigation, and rollback plan. Skip CAB only with documented executive approval; an emergency change is still a tracked change.
Confirm IDS/IPS signatures and EDR policies are current
- Check signature subscription status on FortiGate IPS or Palo Alto Threat Prevention, and policy version in CrowdStrike, SentinelOne, or Defender for Endpoint. Lapsed subscriptions silently stop updating without breaking traffic.
Verify patch-ring deployment in Intune or SCCM
- Walk the test → pilot → production rings and confirm last month's KB rollouts completed without app regressions. The point of three rings is catching a bad KB at pilot; if 100% of the fleet is patched on day 1, the rings are not real.
Tabletop the incident response runbook
- Run a 30-minute tabletop with the on-call rota against a recent realistic scenario — ransomware on a file server, phishing-driven token theft, or a public-facing service compromise. Capture gaps and feed them back into the runbook.

Performance Optimization

Analyze NetFlow for top talkers and saturated links
- Pull NetFlow or sFlow from the core in Auvik, Kentik, or SolarWinds NTA. Backup jobs running during business hours and unsanctioned cloud sync clients are the usual top talkers — both have policy fixes, not bandwidth fixes.
Validate QoS markings on voice and video traffic
- Confirm DSCP markings (EF for voice, AF41 for video) survive end-to-end across LAN and SD-WAN. ISPs commonly strip or rewrite markings at the handoff, so the SD-WAN edge has to re-mark traffic for QoS to hold; Teams and Zoom call quality complaints often trace back to this.
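When reading a packet capture, the DSCP value sits in the upper six bits of the IP TOS/Traffic Class byte, so EF (46) appears as 0xB8 and AF41 (34) as 0x88. The sketch below checks whether an observed TOS byte still carries the expected class; the helper names are hypothetical and the capture step itself is not shown.

```python
# Standard DSCP code points for the two classes named above.
DSCP = {"EF": 46, "AF41": 34}


def tos_byte(dscp):
    """TOS/Traffic Class byte for a DSCP value with the ECN bits zero."""
    return dscp << 2


def dscp_from_tos(tos):
    """Extract the DSCP value from a TOS byte, discarding the two ECN bits."""
    return tos >> 2


def marking_survived(expected_class, observed_tos):
    """True when the observed TOS byte still carries the expected DSCP class."""
    return dscp_from_tos(observed_tos) == DSCP[expected_class]


print(hex(tos_byte(DSCP["EF"])), hex(tos_byte(DSCP["AF41"])))
print(marking_survived("EF", 0xB8), marking_survived("EF", 0x00))
```

A TOS byte of 0x00 on the far side of the handoff means the marking was reset to best effort somewhere along the path.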
Run iPerf tests across site-to-site VPN tunnels
- Run iperf3 between sites at off-peak hours. Compare against last month's results and the contracted circuit speed; sustained drops below the floor are an ISP ticket, not a firewall tuning exercise.
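The month-over-month comparison can be automated from iperf3's JSON output (`iperf3 -c <host> -J`). This is a sketch, assuming a TCP test whose report contains `end.sum_received.bits_per_second`; the sample JSON is trimmed to that one field, and the 80% floor and 10% regression margins are hypothetical defaults, since the checklist does not define "the floor" numerically.

```python
import json


def received_mbps(iperf_json_text):
    """Receiver-side throughput in Mbit/s from an iperf3 -J TCP report."""
    report = json.loads(iperf_json_text)
    return report["end"]["sum_received"]["bits_per_second"] / 1_000_000


def assess(current_mbps, previous_mbps, contracted_mbps, floor_ratio=0.8, regression_ratio=0.9):
    """Flag results below a fraction of the contracted speed or well below last month's run."""
    return {
        "below_floor": current_mbps < contracted_mbps * floor_ratio,
        "regressed": current_mbps < previous_mbps * regression_ratio,
    }


# Trimmed, hypothetical report containing only the field this sketch reads.
sample = '{"end": {"sum_received": {"bits_per_second": 412000000.0}}}'
current = received_mbps(sample)
print(current, assess(current, previous_mbps=480.0, contracted_mbps=500.0))
```

VPN encapsulation overhead means tunnel throughput will sit somewhat under the raw circuit speed, which is one reason to keep the floor ratio adjustable.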
Forecast WAN capacity from 30-day utilization
- Pull 95th-percentile utilization for each WAN circuit. Anything sustained above 70% is an upgrade conversation — the budget cycle is long, so flag now rather than the week the link saturates.
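The 95th-percentile check can be reproduced outside the NMS from raw utilization samples. This is a minimal sketch using the nearest-rank method, assuming samples in Mbit/s at a fixed polling interval; monitoring tools may interpolate differently, so small discrepancies from their reports are expected.

```python
def percentile_95(samples):
    """Nearest-rank 95th percentile: the value at or below which 95% of samples fall."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = (95 * len(ordered) + 99) // 100  # integer ceiling of 0.95 * n
    return ordered[rank - 1]


def needs_upgrade_conversation(samples_mbps, circuit_mbps, limit=0.70):
    """True when 95th-percentile utilization exceeds the given fraction of circuit capacity."""
    return percentile_95(samples_mbps) / circuit_mbps > limit


# Hypothetical samples: a mostly idle link with one short burst.
quiet_link = [10] * 99 + [100]
print(percentile_95(quiet_link), needs_upgrade_conversation(quiet_link, circuit_mbps=100))
```

Using the 95th percentile rather than the peak keeps a single burst, as in the sample data, from triggering an upgrade request.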

Backup and Recovery

Verify nightly config backups in RANCID or Oxidized
- Confirm Oxidized, RANCID, or the RMM-native backup successfully captured every device's running-config in the past 24 hours. Devices that authenticate with TACACS+ often silently stop backing up when the TACACS+ shared secret rotates.
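The 24-hour freshness check reduces to comparing each device's last successful backup time against the inventory. This is a sketch, assuming the last-backup timestamps have already been collected into a mapping (for example from file modification times or the tool's status report, not shown here); device names are hypothetical.

```python
from datetime import datetime, timedelta


def stale_backups(devices, last_backup, now, max_age=timedelta(hours=24)):
    """Return devices with no recorded backup or whose latest backup is older than max_age."""
    stale = []
    for device in devices:
        taken = last_backup.get(device)
        if taken is None or now - taken > max_age:
            stale.append(device)
    return stale


# Hypothetical inventory and backup timestamps.
now = datetime(2024, 1, 2, 6, 0)
last_backup = {
    "sw-core-01": datetime(2024, 1, 2, 1, 0),
    "fw-edge-01": datetime(2023, 12, 30, 1, 0),
}
print(stale_backups(["sw-core-01", "fw-edge-01", "rtr-wan-01"], last_backup, now))
```

Iterating over the inventory rather than the backup store is deliberate: a device that has never backed up would otherwise go unnoticed.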
Run a restore drill into the lab segment
- Pick a switch or firewall and restore last night's backup to a lab device, not the live one. The 3-2-1 rule is meaningless if the restore path has not been exercised; format-vs-archive mismatches always surface here, never in a tabletop.
Open a remediation ticket for the failed restore
- If the restore drill failed, file a P2 ticket assigned to the backup engineer with the drill notes and the device on which the restore failed. Do not close this monthly checklist with a known-broken backup; the next ransomware day finds whatever was deferred.
Confirm the immutable offsite copy meets 3-2-1
- Verify the offsite tier in Veeam, Datto, or AWS Backup uses object lock or equivalent immutability. A backup writable from production is not ransomware-resilient regardless of how many copies exist.
Update the DR runbook with topology changes
- Reconcile the DR runbook against any new VLANs, circuits, or vendor changes from this cycle. RPO and RTO commitments only hold if the runbook reflects the current network — annual reviews catch this too late.
