System Monitoring Checklist

Daily and weekly operational checks a sysadmin or NOC technician runs to keep network, server, application, and security telemetry healthy. Use as a recurring shift checklist with branches for triggered incidents.

Network Monitoring

Confirm all monitored devices are reachable
- Open the PRTG / Auvik / SolarWinds Orion device tree and confirm every monitored switch, router, firewall, and AP is green. Investigate any device in down or unknown state — SNMP credential changes after firmware updates are a common cause of false-positive outages.
Review NetFlow for traffic anomalies
- Compare the last 24 hours against baseline in Auvik / Orion NTA. Look for sustained egress spikes to unfamiliar ASNs (potential exfiltration) and unusual east-west traffic between VLANs that should be segmented.
Audit firewall rule hits and denies
- Check FortiGate / Palo Alto / Meraki rule-hit counts. Zero-hit allow rules over 90 days are candidates for cleanup; sudden spikes in denies on a previously-quiet rule warrant investigation.
Check WAN bandwidth utilization
- Review per-site WAN circuits for sustained >80% utilization. Note any saturated links so capacity tickets are opened with the carrier before user-impact tickets land.
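One way to separate sustained saturation from short bursts is to require several consecutive samples above the threshold. The following is a minimal sketch, assuming utilization has already been exported as a list of percentages at a fixed polling interval. The function name, the sample data, and the choice of six consecutive samples are illustrative assumptions, not settings from any particular tool.

```python
def sustained_over(samples, threshold=80.0, min_consecutive=6):
    """Return True if at least `min_consecutive` back-to-back samples exceed `threshold`."""
    run = 0
    for pct in samples:
        # Extend the current run on a breach, reset it otherwise.
        run = run + 1 if pct > threshold else 0
        if run >= min_consecutive:
            return True
    return False


# Hypothetical 5-minute samples for two circuits (percent of link capacity).
bursty = [40, 95, 42, 91, 38, 97, 41, 90]
saturated = [70, 82, 85, 88, 91, 86, 84, 60]

print(sustained_over(bursty))     # False: no spike lasts six samples
print(sustained_over(saturated))  # True: six samples in a row above 80%
```

Counting consecutive samples rather than averaging keeps a single large burst from opening a capacity ticket.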
Verify VPN tunnel health
- Confirm site-to-site IPsec tunnels are up and stable in the firewall console. For client VPN / ZTNA, spot-check session counts and authentication failures against baseline; repeated MFA failures on one account are an early brute-force signal.

Server Performance

Check CPU utilization across hosts
- In vCenter / Hyper-V Manager / Datadog, identify hosts averaging >80% CPU over the prior 24h. Note ready-time on VMs as a co-symptom of host contention, not just guest load.
Review memory pressure and ballooning
- Look for active memory ballooning or swap on ESXi hosts; on Windows guests, watch for sustained high committed bytes vs. RAM. A SQL or Exchange VM trending up week-over-week is the early signal for a leak or undersized config.
Verify disk capacity on critical volumes
- Check datastores, OS volumes, SQL data and log drives, and Exchange databases. Anything past 85% used gets a remediation ticket; anything past 90% gets paged. Snapshot sprawl on VM datastores is a frequent culprit.
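The 85% and 90% thresholds above can be expressed as a small classifier. This is a sketch only: the function name, the volume names, and the sizes are hypothetical, and "past" is read here as strictly greater than the threshold.

```python
def classify_volume(used, total, ticket_pct=85.0, page_pct=90.0):
    """Map a volume's usage to the checklist action: 'ok', 'ticket', or 'page'.

    `used` and `total` must be in the same unit (bytes, GiB, ...).
    """
    if total <= 0:
        raise ValueError("total must be positive")
    pct = 100.0 * used / total
    if pct > page_pct:
        return "page"
    if pct > ticket_pct:
        return "ticket"
    return "ok"


# Hypothetical inventory: (volume name, used GiB, total GiB).
volumes = [("sql-data", 870, 1000), ("sql-log", 460, 500), ("os-c", 60, 120)]
for name, used, total in volumes:
    print(name, classify_volume(used, total))
# sql-data ticket
# sql-log page
# os-c ok
```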
Plan capacity remediation for full volumes
- For any volume flagged above 85%: identify the growth driver (log file, backup chain, snapshot, user data), open a change request, and either expand the volume, archive data, or prune snapshots. Schedule the change in the next maintenance window unless a P2 threshold is breached.
Confirm critical services are running
- Spot-check service-level monitors for AD DS, DNS, DHCP, Exchange transport, SQL Server, and any line-of-business app daemons. RMM agent dashboards (NinjaOne, Datto, Atera) are usually the fastest view across the fleet.
Triage Windows and Linux event logs
- Pull the last 24h of System and Application errors from Windows Event Log and journalctl on Linux. Repeated disk I/O errors, DCOM faults, or kernel oopses are leading indicators that warrant a deeper host-level investigation.

Security and Compliance

Confirm EDR signatures and agents are current
- In CrowdStrike / SentinelOne / Defender for Endpoint, list any sensor older than N-1 version or any endpoint that hasn't checked in within 7 days. Stale agents are a common audit finding and the most frequent cause of "why didn't EDR catch it."
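The 7-day check-in rule can be sketched as a filter over last-seen timestamps. This assumes the EDR console's host list has been exported into a mapping of hostname to last check-in time; the hostnames, the export shape, and the function name are hypothetical and do not reflect any vendor's API.

```python
from datetime import datetime, timedelta, timezone


def stale_agents(agents, now, max_age_days=7):
    """Return hostnames whose last check-in is older than `max_age_days`."""
    cutoff = now - timedelta(days=max_age_days)
    return sorted(host for host, last_seen in agents.items() if last_seen < cutoff)


# Hypothetical export: hostname -> last check-in (UTC).
now = datetime(2024, 6, 15, 8, 0, tzinfo=timezone.utc)
agents = {
    "ws-001": now - timedelta(hours=3),
    "ws-002": now - timedelta(days=9),
    "srv-db1": now - timedelta(days=7, minutes=1),
}

print(stale_agents(agents, now))  # ['srv-db1', 'ws-002']
```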
Review patch compliance against last Patch Tuesday
- Pull the Intune / SCCM / Automox / Action1 compliance report. Flag any KB with critical or high CVSS still missing past the deployment ring SLA. Note exceptions with a documented mitigation; do not let "pending reboot" linger past 14 days.
Investigate failed authentication patterns
- In Entra ID sign-in logs and SIEM (Sentinel / Splunk), look for password spray patterns, impossible-travel sign-ins, and legacy-auth attempts that should be blocked by conditional access. Repeated MFA-fatigue prompts against a single user are a high-priority signal.
Open an incident response ticket
- If the authentication review points to a likely account compromise, create a SEV2 IR ticket in ServiceNow / Halo PSA, attach the sign-in log evidence, and notify the security on-call. Disable the suspect account, revoke active sessions in Entra ID, and force MFA re-enrollment before continuing the daily check.
Spot-check privileged group membership
- List members of Domain Admins, Enterprise Admins, Global Administrator, and any tenant-level privileged roles. Anything added since the last run requires a matching change ticket; standing membership held by anything other than a dedicated tier-0 admin account is a finding.
Verify backup job success and immutability
- In Veeam / Datto / Rubrik confirm overnight jobs completed successfully and that the immutable / object-locked copy is intact. A green job with a stale restore point (older than RPO) is the failure mode to watch — alert thresholds often miss it.
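The "green job, stale restore point" case can be caught by comparing the newest restore point's age with the RPO instead of trusting job status alone. This is a sketch under assumptions: the job records, the `"Success"` status string, and the 24-hour RPO are hypothetical and would need mapping to whatever the backup product actually reports.

```python
from datetime import datetime, timedelta, timezone


def backup_findings(jobs, now, rpo):
    """Return (job name, reason) for failed jobs and for successful jobs
    whose newest restore point is older than the RPO."""
    findings = []
    for job in jobs:
        age = now - job["latest_restore_point"]
        if job["status"] != "Success":
            findings.append((job["name"], "job did not succeed"))
        elif age > rpo:
            findings.append((job["name"], "restore point older than RPO"))
    return findings


# Hypothetical job export.
now = datetime(2024, 6, 15, 8, 0, tzinfo=timezone.utc)
jobs = [
    {"name": "fs01-nightly", "status": "Success",
     "latest_restore_point": now - timedelta(hours=6)},
    {"name": "sql01-nightly", "status": "Success",
     "latest_restore_point": now - timedelta(hours=50)},
    {"name": "ex01-nightly", "status": "Failed",
     "latest_restore_point": now - timedelta(hours=30)},
]

for name, reason in backup_findings(jobs, now, rpo=timedelta(hours=24)):
    print(name, "->", reason)
# sql01-nightly -> restore point older than RPO
# ex01-nightly -> job did not succeed
```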

Application Monitoring

Review uptime against SLA targets
- Pull synthetic check results from Site24x7 / LogicMonitor / Datadog for each tier-1 application. Flag any service burning its monthly SLA error budget faster than the calendar; a service that has used most of its budget by mid-month will breach on the next outage.
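The budget arithmetic is simple enough to sketch. This assumes an availability SLA expressed as a percentage over a 30-day month and a linear projection of downtime so far; the function names and the example figures are illustrative only.

```python
def error_budget_minutes(sla_pct, days_in_month=30):
    """Allowed downtime per month, in minutes, for a given availability SLA."""
    return days_in_month * 24 * 60 * (100.0 - sla_pct) / 100.0


def projected_burn(downtime_minutes, day_of_month, sla_pct, days_in_month=30):
    """Fraction of the monthly budget consumed if downtime continues at the
    rate seen so far. A value above 1.0 projects a breach."""
    budget = error_budget_minutes(sla_pct, days_in_month)
    projected = downtime_minutes / day_of_month * days_in_month
    return projected / budget


# A 99.9% SLA allows about 43.2 minutes of downtime in a 30-day month.
print(round(error_budget_minutes(99.9), 1))    # 43.2
# 30 minutes down by day 15 projects to 60 minutes, about 1.39x the budget.
print(round(projected_burn(30, 15, 99.9), 2))  # 1.39
```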
Triage application error logs
- Filter the last 24h of application logs for ERROR/FATAL. Group by signature; new error signatures since yesterday get a follow-up ticket assigned to the application owner, not silently acknowledged.
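Grouping by signature can be sketched by collapsing the variable parts of each message and diffing today's set against yesterday's. The normalization rules (hex values and numbers only) and the sample messages are assumptions; real logs usually need more patterns, such as UUIDs or file paths.

```python
import re
from collections import Counter


def signature(message):
    """Collapse variable parts (hex ids, numbers) so similar errors group together."""
    message = re.sub(r"0x[0-9a-fA-F]+", "<hex>", message)
    return re.sub(r"\d+", "<n>", message)


def new_signatures(today, yesterday):
    """Return {signature: count} for signatures seen today but not yesterday."""
    known = {signature(m) for m in yesterday}
    counts = Counter(signature(m) for m in today)
    return {sig: n for sig, n in counts.items() if sig not in known}


# Hypothetical ERROR/FATAL messages.
yesterday = ["Timeout after 3000 ms calling billing", "Order 1182 not found"]
today = [
    "Timeout after 4500 ms calling billing",
    "Null reference at 0x7ffdeadbeef in worker 4",
    "Null reference at 0x7ffabc in worker 9",
]

print(new_signatures(today, yesterday))
# {'Null reference at <hex> in worker <n>': 2}
```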
Check database query performance
- Review SQL Server / PostgreSQL slow-query logs and wait-stat dashboards. A query that doubled in average runtime week-over-week often means a missing index or a stats-out-of-date condition rather than a load problem.
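The week-over-week comparison can be sketched as below, assuming average runtimes per query have already been exported from the slow-query log or wait-stat dashboard. The query identifiers and timings are hypothetical.

```python
def regressed_queries(last_week, this_week, factor=2.0):
    """Return (query id, old avg ms, new avg ms) for queries whose average
    runtime grew by at least `factor`. Queries without a baseline are skipped."""
    regressions = []
    for query_id, new_ms in this_week.items():
        old_ms = last_week.get(query_id)
        if old_ms and new_ms >= factor * old_ms:
            regressions.append((query_id, old_ms, new_ms))
    return sorted(regressions)


# Hypothetical average runtimes in milliseconds.
last_week = {"q_orders_by_customer": 120.0, "q_invoice_rollup": 800.0, "q_login": 15.0}
this_week = {"q_orders_by_customer": 310.0, "q_invoice_rollup": 850.0,
             "q_login": 14.0, "q_new_report": 400.0}

print(regressed_queries(last_week, this_week))
# [('q_orders_by_customer', 120.0, 310.0)]
```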
Validate API endpoint health
- Confirm external and internal API synthetic checks return the expected 2xx status within the latency budget. TLS certificate expiry within 30 days surfaces here; escalate to the ACME automation owner before the browser warnings start.
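Where the synthetic check does not report certificate expiry directly, the 30-day window can be checked with Python's standard `ssl` module. This is a sketch: `fetch_not_after` needs network access to the endpoint and is not called here, and the sample expiry string and reference date are made up.

```python
import socket
import ssl
from datetime import datetime, timezone


def fetch_not_after(host, port=443, timeout=5.0):
    """Return the certificate's notAfter string from a live TLS handshake."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def days_until_expiry(not_after, now):
    """Whole days between `now` and a notAfter string such as
    'Jun 20 12:00:00 2024 GMT'."""
    expires = datetime.fromtimestamp(ssl.cert_time_to_seconds(not_after), tz=timezone.utc)
    return (expires - now).days


# Hypothetical expiry string checked against a fixed reference date.
now = datetime(2024, 6, 1, tzinfo=timezone.utc)
remaining = days_until_expiry("Jun 20 12:00:00 2024 GMT", now)
print(remaining, "escalate" if remaining <= 30 else "ok")  # 19 escalate
```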
Inspect application resource consumption
- Compare per-app CPU, memory, and disk I/O against the prior week. App pools recycling more often than configured and JVM heap creeping toward Xmx are the two patterns worth chasing before users notice.

Alerting and Reporting

Review the overnight alert queue
- Walk the PagerDuty / Opsgenie incident timeline since the last shift. Confirm every P1/P2 has a closing note with root cause or a follow-up ticket; un-acknowledged alerts older than the SLA are the first thing to escalate to the service coordinator.
Tune noisy alert thresholds
- Identify any monitor that fired more than 5 times in 24h without action — that's alert fatigue waiting to happen. Adjust thresholds, add hysteresis, or convert to a daily digest. Document the change in the monitoring runbook.
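Finding tuning candidates can be sketched as a count of un-actioned alerts per monitor over the last 24 hours. The record shape (`monitor`, `fired_at`, `actioned`) and the monitor names are hypothetical; an export from the paging tool would need mapping into it.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone


def noisy_monitors(alerts, now, max_fires=5, window=timedelta(hours=24)):
    """Return monitors with more than `max_fires` un-actioned alerts in the window."""
    cutoff = now - window
    counts = Counter(
        a["monitor"] for a in alerts
        if a["fired_at"] >= cutoff and not a["actioned"]
    )
    return sorted(name for name, n in counts.items() if n > max_fires)


# Hypothetical alert history: 7 ignored, 6 actioned, 3 ignored.
now = datetime(2024, 6, 15, 8, 0, tzinfo=timezone.utc)
alerts = (
    [{"monitor": "disk-latency", "fired_at": now - timedelta(hours=h), "actioned": False}
     for h in range(1, 8)]
    + [{"monitor": "cpu-high", "fired_at": now - timedelta(hours=h), "actioned": True}
       for h in range(1, 7)]
    + [{"monitor": "ping-loss", "fired_at": now - timedelta(hours=h), "actioned": False}
       for h in range(1, 4)]
)

print(noisy_monitors(alerts, now))  # ['disk-latency']
```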
Confirm escalation chains routed correctly
- Verify on-call rotations in PagerDuty / Opsgenie match this week's published schedule, including holiday coverage. A page that escalates to a former employee is the most embarrassing way to find a stale schedule.
Publish the daily monitoring report
- Generate the daily summary from BrightGauge / built-in PSA dashboard and post to the NOC channel. Yellow or Red status requires a named follow-up owner and ETA before close-of-shift; do not hand off Red without a paged engineer.
Verify monitoring tools are themselves healthy
- Confirm PRTG / Orion / Datadog agents are reporting from every site, RMM check-in counts match endpoint inventory, and SIEM ingestion is within expected EPS. A monitoring system that's down silently is worse than no monitoring — schedule a watchdog if one doesn't exist.
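Matching check-in counts against inventory is more useful as a set comparison than a raw count, since one missing host and one unexpected host cancel out in a total. This sketch assumes both lists have been exported as hostnames; the names and the function are hypothetical.

```python
def coverage_gaps(inventory, reporting):
    """Compare the endpoint inventory with hosts actually reporting in."""
    inventory, reporting = set(inventory), set(reporting)
    return {
        "missing": sorted(inventory - reporting),     # in inventory, silent
        "unexpected": sorted(reporting - inventory),  # reporting, not inventoried
    }


# Hypothetical exports from the asset inventory and the RMM console.
inventory = ["ws-001", "ws-002", "srv-db1", "srv-app1"]
reporting = ["ws-001", "srv-db1", "srv-app1", "lab-temp7"]

print(coverage_gaps(inventory, reporting))
# {'missing': ['ws-002'], 'unexpected': ['lab-temp7']}
```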