System Monitoring Checklist
Daily and weekly operational checks a sysadmin or NOC technician runs to keep network, server, application, and security telemetry healthy. Use as a recurring shift checklist with branches for triggered incidents.
Network Monitoring
-
Confirm all monitored devices are reachable
Open the PRTG / Auvik / SolarWinds Orion device tree and confirm every monitored switch, router, firewall, and AP is green. Investigate any device in down or unknown state — SNMP credential changes after firmware updates are a common cause of false-positive outages.
-
Review NetFlow for traffic anomalies
Compare the last 24 hours against baseline in Auvik / Orion NTA. Look for sustained egress spikes to unfamiliar ASNs (potential exfiltration) and unusual east-west traffic between VLANs that should be segmented.
-
Audit firewall rule hits and denies
Check FortiGate / Palo Alto / Meraki rule-hit counts. Allow rules with zero hits over 90 days are candidates for cleanup; a sudden spike in denies on a previously quiet rule warrants investigation.
-
Check WAN bandwidth utilization
Review per-site WAN circuits for sustained >80% utilization. Note any saturated links so capacity tickets are opened with the carrier before user-impact tickets land.
-
Verify VPN tunnel health
Confirm site-to-site IPsec tunnels are up and stable in the firewall console. For client VPN / ZTNA, spot-check session counts and authentication failures against baseline; repeated MFA failures on one account are an early brute-force signal.
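For the WAN bandwidth check in this section, it helps to pin down what "sustained" means so a single burst does not open a carrier ticket. A minimal Python sketch, assuming utilization samples have already been exported from the monitoring tool as percentages per circuit at a fixed polling interval. The function names, the sample data, and the six-consecutive-samples window are illustrative assumptions, not values from any vendor API; only the 80% threshold comes from the checklist.

```python
from typing import Dict, List


def sustained_over(samples: List[float], threshold: float = 80.0, min_run: int = 6) -> bool:
    """Return True if at least `min_run` consecutive samples exceed `threshold`."""
    run = 0
    for pct in samples:
        # Extend the current run while above threshold, otherwise reset it.
        run = run + 1 if pct > threshold else 0
        if run >= min_run:
            return True
    return False


def saturated_circuits(by_circuit: Dict[str, List[float]],
                       threshold: float = 80.0,
                       min_run: int = 6) -> List[str]:
    """Names of circuits whose utilization stayed above threshold for min_run samples."""
    return sorted(name for name, samples in by_circuit.items()
                  if sustained_over(samples, threshold, min_run))


# Hypothetical 5-minute samples (percent utilization) for two sites.
example = {
    "site-a-wan1": [70, 85, 86, 90, 88, 91, 87, 60],  # six in a row above 80%
    "site-b-wan1": [95, 40, 96, 42, 97, 41, 98, 43],  # short bursts only
}
print(saturated_circuits(example))  # prints ['site-a-wan1']
```

With 5-minute polling, `min_run=6` corresponds to roughly 30 minutes above threshold; adjust it to match whatever the team agrees counts as sustained.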
Server Performance
-
Check CPU utilization across hosts
In vCenter / Hyper-V Manager / Datadog, identify hosts averaging >80% CPU over the prior 24h. Also note CPU ready time on VMs: high ready time indicates host contention, not just guest load.
-
Review memory pressure and ballooning
Look for active memory ballooning or swap on ESXi hosts; on Windows guests, watch for sustained high committed bytes vs. RAM. A SQL or Exchange VM trending up week-over-week is the early signal for a leak or undersized config.
-
Verify disk capacity on critical volumes
Check datastores, OS volumes, SQL data and log drives, and Exchange databases. Anything past 85% used gets a remediation ticket; anything past 90% gets paged. Snapshot sprawl on VM datastores is a frequent culprit.
-
Plan capacity remediation for full volumes
For any volume flagged above 85%: identify the growth driver (log file, backup chain, snapshot, user data), open a change request, and either expand the volume, archive data, or prune snapshots. Schedule the change in the next maintenance window unless a P2 threshold is breached.
-
Confirm critical services are running
Spot-check service-level monitors for AD DS, DNS, DHCP, Exchange transport, SQL Server, and any line-of-business app daemons. RMM agent dashboards (NinjaOne, Datto, Atera) are usually the fastest view across the fleet.
-
Triage Windows and Linux event logs
Pull the last 24h of System and Application errors from Windows Event Log and journalctl on Linux. Repeated disk I/O errors, DCOM faults, or kernel oopses are leading indicators that warrant a host-level deeper dive.
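The disk capacity thresholds earlier in this section (ticket past 85%, page past 90%) can be expressed as a small classifier. This is a sketch under stated assumptions: the function names are hypothetical, the path in the usage example is a placeholder, and real datastore figures would normally come from the hypervisor or RMM tool rather than from the local filesystem.

```python
import shutil

TICKET_PCT = 85.0  # checklist threshold: open a remediation ticket
PAGE_PCT = 90.0    # checklist threshold: page the on-call


def classify_usage(used_pct: float) -> str:
    """Map a percent-used figure to the checklist action: 'ok', 'ticket', or 'page'."""
    if used_pct > PAGE_PCT:
        return "page"
    if used_pct > TICKET_PCT:
        return "ticket"
    return "ok"


def volume_used_pct(path: str) -> float:
    """Percent of the volume containing `path` that is in use (local volumes only)."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


if __name__ == "__main__":
    # Placeholder path list; replace with the critical volumes for the host.
    for path in ["/"]:
        pct = volume_used_pct(path)
        print(f"{path}: {pct:.1f}% used -> {classify_usage(pct)}")
```

Keeping the classifier separate from the measurement makes it easy to feed it numbers from any source, for example a CSV export of datastore usage.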
Security and Compliance
-
Confirm EDR signatures and agents are current
In CrowdStrike / SentinelOne / Defender for Endpoint, list any sensor older than N-1 version or any endpoint that hasn't checked in within 7 days. Stale agents are a common audit finding and the most frequent cause of "why didn't EDR catch it."
-
Review patch compliance against last Patch Tuesday
Pull the Intune / SCCM / Automox / Action1 compliance report. Flag any KB that fixes a critical- or high-severity CVE (by CVSS score) and is still missing past the deployment ring SLA. Note exceptions with a documented mitigation; do not let "pending reboot" linger past 14 days.
-
Investigate failed authentication patterns
In Entra ID sign-in logs and SIEM (Sentinel / Splunk), look for password spray patterns, impossible-travel sign-ins, and legacy-auth attempts that should be blocked by conditional access. Repeated MFA-fatigue prompts on a single user are a high-priority signal.
-
Open an incident response ticket
If the sign-in review points to a likely account compromise, create a SEV2 IR ticket in ServiceNow / Halo PSA, attach the sign-in log evidence, and notify the security on-call. Disable the suspect account, revoke active sessions in Entra ID, and force MFA re-enrollment before continuing the daily check.
-
Spot-check privileged group membership
List members of Domain Admins, Enterprise Admins, Global Administrator, and any tenant-level privileged roles. Anything added since the last run requires a matching change ticket; standing membership held by a day-to-day user account rather than a dedicated tier-0 admin account is a finding.
-
Verify backup job success and immutability
In Veeam / Datto / Rubrik confirm overnight jobs completed successfully and that the immutable / object-locked copy is intact. A green job with a stale restore point (older than RPO) is the failure mode to watch — alert thresholds often miss it.
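The "green job, stale restore point" failure mode in the backup check above is easy to test for once the newest restore point timestamp is available. A minimal sketch, assuming the job result and timestamp have been pulled from the backup product's report or API by some other means; the function name, status strings, and the 24-hour RPO in the example are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def restore_point_status(job_succeeded: bool,
                         newest_restore_point: Optional[datetime],
                         rpo: timedelta,
                         now: datetime) -> str:
    """Classify a backup job, treating an out-of-RPO restore point as a failure
    even when the job itself reported success."""
    if not job_succeeded:
        return "job-failed"
    if newest_restore_point is None or now - newest_restore_point > rpo:
        return "stale-restore-point"
    return "ok"


# Hypothetical example: job is green, but the newest restore point is 30h old
# against a 24h RPO.
now = datetime(2024, 1, 10, 8, 0, tzinfo=timezone.utc)
print(restore_point_status(True, now - timedelta(hours=30), timedelta(hours=24), now))
# prints stale-restore-point
```

Passing `now` in explicitly keeps the check deterministic and easy to test; in a scheduled job it would be `datetime.now(timezone.utc)`.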
Application Monitoring
-
Review uptime against SLA targets
Pull synthetic check results from Site24x7 / LogicMonitor / Datadog for each tier-1 application. Flag any service burning its monthly SLA error budget faster than the calendar: if most of the budget is already spent by mid-month, one more outage will breach the SLA.
-
Triage application error logs
Filter the last 24h of application logs for ERROR/FATAL. Group by signature; new error signatures since yesterday get a follow-up ticket assigned to the application owner, not silently acknowledged.
-
Check database query performance
Review SQL Server / PostgreSQL slow-query logs and wait-stat dashboards. A query that doubled in average runtime week-over-week often means a missing index or a stats-out-of-date condition rather than a load problem.
-
Validate API endpoint health
Confirm external and internal API synthetic checks return the expected 2xx status within the latency budget. TLS certificate expiry within 30 days surfaces here; escalate to the ACME automation owner before browser warnings start.
-
Inspect application resource consumption
Compare per-app CPU, memory, and disk I/O against the prior week. App pools recycling more often than configured and JVM heap creeping toward Xmx are the two patterns worth chasing before users notice.
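For the SLA review earlier in this section, the budget arithmetic can be made explicit. The sketch below assumes an availability-style SLA measured per calendar month and downtime already totaled in minutes; the function names, the 99.9% target, and the example figures are illustrative assumptions, not values from the checklist.

```python
from typing import Dict


def error_budget_minutes(sla_target: float, days_in_month: int) -> float:
    """Allowed downtime for the month, e.g. 99.9% over 30 days is about 43.2 minutes."""
    return (1.0 - sla_target) * days_in_month * 24 * 60


def budget_status(sla_target: float, days_in_month: int,
                  days_elapsed: int, downtime_minutes: float) -> Dict[str, float]:
    """Compare the fraction of budget consumed with the fraction of the month elapsed."""
    budget = error_budget_minutes(sla_target, days_in_month)
    consumed = downtime_minutes / budget
    elapsed = days_elapsed / days_in_month
    return {
        "budget_minutes": budget,
        "consumed_fraction": consumed,
        "elapsed_fraction": elapsed,
        # True when the service is spending budget faster than the calendar.
        "ahead_of_pace": consumed > elapsed,
    }


# Hypothetical example: 99.9% target, day 15 of a 30-day month, 30 minutes down so far.
status = budget_status(0.999, 30, 15, 30.0)
print(round(status["budget_minutes"], 1))  # prints 43.2
print(status["ahead_of_pace"])             # prints True
```

In this example about 69% of the budget is gone with half the month left, so a further outage of roughly 13 minutes would breach the target.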
Alerting and Reporting
-
Review the overnight alert queue
Walk the PagerDuty / Opsgenie incident timeline since the last shift. Confirm every P1/P2 has a closing note with root cause or a follow-up ticket; unacknowledged alerts older than the SLA are the first thing to escalate to the service coordinator.
-
Tune noisy alert thresholds
Identify any monitor that fired more than 5 times in 24h without action — that's alert fatigue waiting to happen. Adjust thresholds, add hysteresis, or convert to a daily digest. Document the change in the monitoring runbook.
-
Confirm escalation chains routed correctly
Verify on-call rotations in PagerDuty / Opsgenie match this week's published schedule, including holiday coverage. A page that escalates to a former employee is the most embarrassing way to find a stale schedule.
-
Publish the daily monitoring report
Generate the daily summary from BrightGauge / built-in PSA dashboard and post to the NOC channel. Yellow or Red status requires a named follow-up owner and ETA before close-of-shift; do not hand off Red without a paged engineer.
-
Verify monitoring tools are themselves healthy
Confirm PRTG / Orion / Datadog agents are reporting from every site, RMM check-in counts match endpoint inventory, and SIEM ingestion is within expected EPS. A monitoring system that's down silently is worse than no monitoring — schedule a watchdog if one doesn't exist.
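The noisy-alert rule earlier in this section (more than 5 firings in 24h without action) can be applied to an exported alert log. A minimal sketch, assuming each event carries a monitor name, a firing time, and a flag for whether anyone acted on it; the `AlertEvent` type, its field names, and the sample data are hypothetical and do not mirror the PagerDuty or Opsgenie export format.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone
from typing import Iterable, List, NamedTuple


class AlertEvent(NamedTuple):
    monitor: str
    fired_at: datetime
    actioned: bool  # True if someone acknowledged and acted on this firing


def noisy_monitors(events: Iterable[AlertEvent],
                   now: datetime,
                   window: timedelta = timedelta(hours=24),
                   max_unactioned: int = 5) -> List[str]:
    """Monitors with more than `max_unactioned` unactioned firings inside the window."""
    cutoff = now - window
    counts = Counter(
        e.monitor for e in events
        if cutoff <= e.fired_at <= now and not e.actioned
    )
    return sorted(name for name, n in counts.items() if n > max_unactioned)


# Hypothetical data: one monitor fires six times unactioned, another three times,
# and one firing falls outside the 24h window.
now = datetime(2024, 1, 10, 8, 0, tzinfo=timezone.utc)
events = (
    [AlertEvent("disk-latency-host12", now - timedelta(hours=h), False) for h in range(1, 7)]
    + [AlertEvent("wan-loss-site-a", now - timedelta(hours=h), False) for h in range(1, 4)]
    + [AlertEvent("wan-loss-site-a", now - timedelta(hours=30), False)]
)
print(noisy_monitors(events, now))  # prints ['disk-latency-host12']
```

The output is a candidate list for threshold tuning, hysteresis, or conversion to a daily digest, as described in the checklist item.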