System Monitoring Checklist
Daily and weekly operational checks a sysadmin or NOC technician runs to keep network, server, application, and security telemetry healthy. Use as a recurring shift checklist with branches for triggered incidents.
Network Monitoring
-
Confirm all monitored devices are reachable
Open the PRTG / Auvik / SolarWinds Orion device tree and confirm every monitored switch, router, firewall, and AP is green. Investigate any device in down or unknown state — SNMP credential changes after firmware updates are a common cause of false-positive outages.
-
Review NetFlow for traffic anomalies
Compare the last 24 hours against baseline in Auvik / Orion NTA. Look for sustained egress spikes to unfamiliar ASNs (potential exfiltration) and unusual east-west traffic between VLANs that should be segmented.
-
Audit firewall rule hits and denies
Check FortiGate / Palo Alto / Meraki rule-hit counts. Zero-hit allow rules over 90 days are candidates for cleanup; sudden spikes in denies on a previously-quiet rule warrant investigation.
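The 90-day zero-hit rule can be sketched in a few lines of Python. This is a minimal illustration, not a vendor integration: the `Rule` record and its fields are hypothetical and would have to be filled from whatever rule-hit export your firewall provides.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Rule:
    # Hypothetical record built from a firewall rule-hit export.
    name: str
    action: str                   # "allow" or "deny"
    last_hit: Optional[datetime]  # None means the rule has never matched
    created: datetime


def cleanup_candidates(rules: List[Rule], now: datetime,
                       max_idle_days: int = 90) -> List[str]:
    """Return names of allow rules with no hits inside the idle window."""
    cutoff = now - timedelta(days=max_idle_days)
    candidates = []
    for rule in rules:
        if rule.action != "allow":
            continue
        # A rule younger than the window has not had time to prove itself.
        if rule.created > cutoff:
            continue
        if rule.last_hit is None or rule.last_hit < cutoff:
            candidates.append(rule.name)
    return candidates
```

Skipping rules created inside the window avoids flagging a rule that was only deployed last week.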
-
Check WAN bandwidth utilization
Review per-site WAN circuits for sustained >80% utilization. Note any saturated links so capacity tickets are opened with the carrier before user-impact tickets land.
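"Sustained" is worth pinning down so a single burst does not open a carrier ticket. A minimal sketch, assuming utilization samples arrive at a fixed interval (for example every 5 minutes) and that six consecutive samples over the threshold count as sustained; both numbers are assumptions to tune per site.

```python
from typing import List


def sustained_over(samples: List[float], threshold: float = 80.0,
                   min_consecutive: int = 6) -> bool:
    """True if utilization stays above threshold for min_consecutive samples in a row."""
    run = 0
    for pct in samples:
        # Extend the current run, or reset it when a sample drops back under.
        run = run + 1 if pct > threshold else 0
        if run >= min_consecutive:
            return True
    return False
```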
-
Verify VPN tunnel health
Confirm site-to-site IPsec tunnels are up and stable in the firewall console. For client VPN / ZTNA, spot-check session counts and authentication failures against baseline; repeated MFA failures on one account are an early brute-force signal.
Server Performance
-
Check CPU utilization across hosts
In vCenter / Hyper-V Manager / Datadog, identify hosts averaging >80% CPU over the prior 24h. Note ready-time on VMs as a co-symptom of host contention, not just guest load.
-
Review memory pressure and ballooning
Look for active memory ballooning or swap on ESXi hosts; on Windows guests, watch for sustained high committed bytes vs. RAM. A SQL or Exchange VM trending up week-over-week is the early signal for a leak or undersized config.
-
Verify disk capacity on critical volumes
Check datastores, OS volumes, SQL data and log drives, and Exchange databases. Anything past 85% used gets a remediation ticket; anything past 90% gets paged. Snapshot sprawl on VM datastores is a frequent culprit.
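The two thresholds above map directly onto a small classifier. The sketch below uses Python's standard `shutil.disk_usage` for a local path; the function names and the "ok" / "ticket" / "page" labels are illustrative, and datastores or remote volumes would need figures from your monitoring tool instead.

```python
import shutil


def classify_usage(used_pct: float) -> str:
    """Map percent-used to the checklist action: past 90 pages, past 85 tickets."""
    if used_pct > 90:
        return "page"
    if used_pct > 85:
        return "ticket"
    return "ok"


def volume_status(path: str) -> str:
    """Classify a locally mounted volume by its current usage."""
    usage = shutil.disk_usage(path)
    return classify_usage(100 * usage.used / usage.total)
```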
-
Plan capacity remediation for full volumes
For any volume flagged above 85%: identify the growth driver (log file, backup chain, snapshot, user data), open a change request, and either expand the volume, archive data, or prune snapshots. Schedule the change in the next maintenance window unless a P2 threshold is breached.
-
Confirm critical services are running
Spot-check service-level monitors for AD DS, DNS, DHCP, Exchange transport, SQL Server, and any line-of-business app daemons. RMM agent dashboards (NinjaOne, Datto, Atera) are usually the fastest view across the fleet.
-
Triage Windows and Linux event logs
Pull the last 24h of System and Application errors from Windows Event Log and journalctl on Linux. Repeated disk I/O errors, DCOM faults, or kernel oopses are leading indicators that warrant a host-level deeper dive.
Security and Compliance
-
Confirm EDR signatures and agents are current
In CrowdStrike / SentinelOne / Defender for Endpoint, list any sensor older than N-1 version or any endpoint that hasn't checked in within 7 days. Stale agents are a common audit finding and the most frequent cause of "why didn't EDR catch it."
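If the console export is easier to work with offline, the two staleness tests can be combined in one pass. A sketch under assumptions: each endpoint is a dict with hypothetical `host`, `version`, and `last_seen` keys, versions are dotted integers, and `n_minus_1` is the oldest version you still accept.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Tuple


def parse_version(text: str) -> Tuple[int, ...]:
    """Turn '7.10.1' into (7, 10, 1) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def stale_endpoints(endpoints: List[dict], n_minus_1: str, now: datetime,
                    max_silence_days: int = 7) -> Dict[str, List[str]]:
    """Return {host: [reasons]} for sensors that are outdated or silent."""
    floor = parse_version(n_minus_1)
    cutoff = now - timedelta(days=max_silence_days)
    findings: Dict[str, List[str]] = {}
    for ep in endpoints:
        reasons = []
        if parse_version(ep["version"]) < floor:
            reasons.append("sensor older than N-1")
        if ep["last_seen"] < cutoff:
            reasons.append("no check-in within window")
        if reasons:
            findings[ep["host"]] = reasons
    return findings
```

Comparing parsed tuples rather than raw strings matters: as strings, "7.9.0" sorts after "7.10.0".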
-
Review patch compliance against last Patch Tuesday
Pull the Intune / SCCM / Automox / Action1 compliance report. Flag any KB with critical or high CVSS still missing past the deployment ring SLA. Note exceptions with a documented mitigation; do not let "pending reboot" linger past 14 days.
-
Investigate failed authentication patterns
In Entra ID sign-in logs and SIEM (Sentinel / Splunk), look for password spray patterns, impossible-travel sign-ins, and legacy-auth attempts that should be blocked by conditional access. Repeated MFA-fatigue prompts on a single user are a high-priority signal.
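A password spray shows up as one source failing against many different accounts, which is easy to miss when reading failures per user. A minimal sketch, assuming failed sign-ins have already been exported as `(source_ip, username)` pairs; the threshold of 10 distinct accounts is an assumption, not a standard.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def spray_suspects(failures: Iterable[Tuple[str, str]],
                   min_accounts: int = 10) -> Dict[str, int]:
    """Return {source_ip: distinct_account_count} for likely spray sources."""
    accounts_by_source = defaultdict(set)
    for source_ip, username in failures:
        # Usernames are case-insensitive for sign-in purposes.
        accounts_by_source[source_ip].add(username.lower())
    return {
        ip: len(users)
        for ip, users in accounts_by_source.items()
        if len(users) >= min_accounts
    }
```

Twenty failures against one account from one address is a brute-force or lockout problem, not a spray, so it is deliberately not flagged here.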
-
Open an incident response ticket
Create a SEV2 IR ticket in ServiceNow / Halo PSA, attach the sign-in log evidence, and notify the security on-call. Disable the suspect account, revoke active sessions in Entra ID, and force MFA re-enrollment before continuing the daily check.
-
Spot-check privileged group membership
List members of Domain Admins, Enterprise Admins, Global Administrator, and any tenant-level privileged roles. Anything added since the last run requires a matching change ticket; standing membership held by an everyday user account rather than a dedicated tier-0 admin account is a finding.
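Comparing against the last run is a set difference. The sketch below assumes you keep the previous run's member list and a set of accounts covered by approved change tickets; how those are stored is up to you, and the function name is illustrative.

```python
from typing import List, Set


def unexplained_additions(previous: Set[str], current: Set[str],
                          ticketed: Set[str]) -> List[str]:
    """Members added since the last run that have no matching change ticket."""
    return sorted((current - previous) - ticketed)
```

For example, with `previous={"adm-ana"}`, `current={"adm-ana", "adm-raj", "svc-backup"}`, and `ticketed={"adm-raj"}`, the only account returned is `svc-backup`.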
-
Verify backup job success and immutability
In Veeam / Datto / Rubrik confirm overnight jobs completed successfully and that the immutable / object-locked copy is intact. A green job with a stale restore point (older than RPO) is the failure mode to watch — alert thresholds often miss it.
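The "green but stale" case is simple to test for once job status and restore-point age are checked separately. A sketch with hypothetical inputs; the job result and latest restore-point timestamp would come from your backup product's report or API.

```python
from datetime import datetime, timedelta


def backup_state(job_succeeded: bool, latest_restore_point: datetime,
                 now: datetime, rpo: timedelta) -> str:
    """Return 'failed', 'stale' (green job, restore point older than RPO), or 'ok'."""
    if not job_succeeded:
        return "failed"
    if now - latest_restore_point > rpo:
        return "stale"
    return "ok"
```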
Application Monitoring
-
Review uptime against SLA targets
Pull synthetic check results from Site24x7 / LogicMonitor / Datadog for each tier-1 application. Flag any service on track to exhaust its monthly SLA error budget; if most of the budget is already spent by mid-month, one more outage will breach the SLA.
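The budget-burn check can be made concrete with a little arithmetic. The sketch below assumes a simple linear projection of downtime over the month, which is one reasonable model rather than the only one; the function names are illustrative.

```python
def error_budget_minutes(sla_pct: float, days_in_month: int) -> float:
    """Allowed downtime in minutes for the month at the given SLA percentage."""
    return (100 - sla_pct) / 100 * days_in_month * 24 * 60


def projected_breach(downtime_min: float, day_of_month: int,
                     days_in_month: int, sla_pct: float) -> bool:
    """True if downtime so far, extrapolated linearly, exceeds the monthly budget."""
    budget = error_budget_minutes(sla_pct, days_in_month)
    projected = downtime_min * days_in_month / day_of_month
    return projected > budget
```

A 99.9% target over a 30-day month allows about 43.2 minutes of downtime, so 30 minutes already spent by day 15 projects to 60 minutes and should be flagged.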
-
Triage application error logs
Filter the last 24h of application logs for ERROR/FATAL. Group by signature; new error signatures since yesterday get a follow-up ticket assigned to the application owner, not silently acknowledged.
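Grouping by signature usually means stripping the parts of a message that vary between occurrences. A rough sketch, assuming numbers and hex values are the only variable parts; real logs often need extra patterns for GUIDs, paths, or timestamps.

```python
import re
from collections import Counter
from typing import Dict, Iterable

# Hex literals and decimal numbers are treated as the variable parts of a message.
_VARIABLE = re.compile(r"0x[0-9a-fA-F]+|\d+")


def signature(message: str) -> str:
    """Collapse variable tokens so repeated errors share one signature."""
    return _VARIABLE.sub("<n>", message).strip()


def new_signatures(today: Iterable[str], yesterday: Iterable[str]) -> Dict[str, int]:
    """Return {signature: count} for signatures seen today but not yesterday."""
    known = {signature(m) for m in yesterday}
    counts = Counter(signature(m) for m in today)
    return {sig: n for sig, n in counts.items() if sig not in known}
```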
-
Check database query performance
Review SQL Server / PostgreSQL slow-query logs and wait-stat dashboards. A query that doubled in average runtime week-over-week often means a missing index or a stats-out-of-date condition rather than a load problem.
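The week-over-week comparison can be scripted once average runtimes are exported per query. A sketch assuming two dicts keyed by a query identifier (a fingerprint or hash from your slow-query tooling) with average runtime in milliseconds; the 2x factor mirrors the "doubled" rule above.

```python
from typing import Dict, List


def regressed_queries(last_week: Dict[str, float], this_week: Dict[str, float],
                      factor: float = 2.0) -> List[str]:
    """Query IDs whose average runtime grew by at least `factor` week-over-week."""
    return sorted(
        query_id
        for query_id, avg_ms in this_week.items()
        # Queries with no prior baseline cannot be compared and are skipped.
        if last_week.get(query_id, 0) > 0 and avg_ms >= factor * last_week[query_id]
    )
```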
-
Validate API endpoint health
Confirm external and internal API synthetic checks return the expected 2xx status within the latency budget. TLS certificates expiring within 30 days surface here; escalate to the ACME automation owner before the browser warnings start.
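Where a synthetic check does not already report certificate expiry, it can be read directly with Python's standard `ssl` module. A minimal sketch: `fetch_not_after` needs network access to the endpoint, and the helper names are illustrative.

```python
import socket
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Days between `now` (timezone-aware) and a certificate's notAfter string."""
    expires = datetime.fromtimestamp(ssl.cert_time_to_seconds(not_after),
                                     tz=timezone.utc)
    return (expires - now).total_seconds() / 86400


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Connect to host:port and return the leaf certificate's notAfter field."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def expiring_soon(not_after: str, now: datetime, warn_days: int = 30) -> bool:
    """True if the certificate expires within the warning window."""
    return days_until_expiry(not_after, now) <= warn_days
```

A typical call would be `expiring_soon(fetch_not_after("api.example.com"), datetime.now(timezone.utc))`, with the hostname replaced by your own endpoint.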
-
Inspect application resource consumption
Compare per-app CPU, memory, and disk I/O against the prior week. App pools recycling more often than configured and JVM heap creeping toward Xmx are the two patterns worth chasing before users notice.
Alerting and Reporting
-
Review the overnight alert queue
Walk the PagerDuty / Opsgenie incident timeline since the last shift. Confirm every P1/P2 has a closing note with root cause or a follow-up ticket; unacknowledged alerts older than the SLA are the first thing to escalate to the service coordinator.
-
Tune noisy alert thresholds
Identify any monitor that fired more than 5 times in 24h without action — that's alert fatigue waiting to happen. Adjust thresholds, add hysteresis, or convert to a daily digest. Document the change in the monitoring runbook.
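Finding the noisy monitors is a counting exercise over the alert export. The sketch below reads "without action" as "none of the fires in the window were actioned", which is one interpretation; the `(monitor, fired_at, was_actioned)` tuple layout is hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from typing import Iterable, List, Tuple


def noisy_monitors(events: Iterable[Tuple[str, datetime, bool]], now: datetime,
                   max_fires: int = 5,
                   window: timedelta = timedelta(hours=24)) -> List[str]:
    """Monitors that fired more than max_fires times in the window with no action taken."""
    start = now - window
    fires = defaultdict(int)
    actioned = set()
    for monitor, fired_at, was_actioned in events:
        if fired_at < start or fired_at > now:
            continue  # outside the review window
        fires[monitor] += 1
        if was_actioned:
            actioned.add(monitor)
    return sorted(m for m, n in fires.items()
                  if n > max_fires and m not in actioned)
```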
-
Confirm escalation chains routed correctly
Verify on-call rotations in PagerDuty / Opsgenie match this week's published schedule, including holiday coverage. A page that escalates to a former employee is the most embarrassing way to find a stale schedule.
-
Publish the daily monitoring report
Generate the daily summary from BrightGauge / built-in PSA dashboard and post to the NOC channel. Yellow or Red status requires a named follow-up owner and ETA before close-of-shift; do not hand off Red without a paged engineer.
-
Verify monitoring tools are themselves healthy
Confirm PRTG / Orion / Datadog agents are reporting from every site, RMM check-in counts match endpoint inventory, and SIEM ingestion is within expected EPS. A monitoring system that's down silently is worse than no monitoring — schedule a watchdog if one doesn't exist.
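A watchdog can be as simple as comparing the list of sources you expect against when each was last heard from. A sketch under assumptions: `expected` is your site or agent inventory, `last_seen` comes from the monitoring platform, and the 15-minute tolerance is a placeholder.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Set


def silent_sources(expected: Set[str], last_seen: Dict[str, datetime],
                   now: datetime,
                   max_age: timedelta = timedelta(minutes=15)) -> List[str]:
    """Expected sources that have never reported or have gone quiet for too long."""
    return sorted(
        name for name in expected
        if name not in last_seen or now - last_seen[name] > max_age
    )
```

Driving the check from the expected inventory, not from the data received, is what catches a source that has never reported at all. Run it from somewhere other than the monitoring server so it survives that server going down.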