System Monitoring Checklist

Daily and weekly operational checks that a sysadmin or NOC technician runs to keep network, server, application, and security telemetry healthy. Use it as a recurring shift checklist, with branches for triggered incidents.

5 sections, 27 steps
1

Network Monitoring

  1. Confirm all monitored devices are reachable
    • Open the PRTG / Auvik / SolarWinds Orion device tree and confirm every monitored switch, router, firewall, and AP is green. Investigate any device in a down or unknown state; SNMP credential changes after firmware updates are a common cause of false-positive outages.

  2. Review NetFlow for traffic anomalies
    • Compare the last 24 hours against baseline in Auvik / Orion NTA. Look for sustained egress spikes to unfamiliar ASNs (potential exfiltration) and unusual east-west traffic between VLANs that should be segmented.

  3. Audit firewall rule hits and denies
    • Check FortiGate / Palo Alto / Meraki rule-hit counts. Zero-hit allow rules over 90 days are candidates for cleanup; sudden spikes in denies on a previously-quiet rule warrant investigation.

  4. Check WAN bandwidth utilization
    • Review per-site WAN circuits for sustained >80% utilization. Note any saturated links so capacity tickets are opened with the carrier before user-impact tickets land.

  5. Verify VPN tunnel health
    • Confirm site-to-site IPsec tunnels are up and stable in the firewall console. For client VPN / ZTNA, spot-check session counts and authentication failures against baseline; repeated MFA failures on one account are an early brute-force signal.
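
The "sustained >80% utilization" test in step 4 can be made concrete. The following is a minimal Python sketch, assuming utilization is sampled at a fixed interval and that "sustained" means a run of at least six consecutive samples above the threshold. The run length, the function name `is_sustained`, and the circuit names and sample values are all hypothetical, not taken from any monitoring product.

```python
from typing import Iterable


def is_sustained(samples: Iterable[float], threshold: float = 80.0, min_run: int = 6) -> bool:
    """Return True if at least `min_run` consecutive samples exceed `threshold`."""
    run = 0
    for pct in samples:
        # Extend the current run on a breach, reset it otherwise.
        run = run + 1 if pct > threshold else 0
        if run >= min_run:
            return True
    return False


# Hypothetical 5-minute utilization samples (percent) per WAN circuit.
circuits = {
    "site-a-mpls": [62, 85, 91, 88, 84, 86, 90, 83, 70],  # long plateau above 80%
    "site-b-dia": [40, 95, 42, 97, 45, 99, 41, 96, 43],   # short bursts only
}

saturated = [name for name, samples in circuits.items() if is_sustained(samples)]
print(saturated)
```

Requiring a consecutive run rather than a simple average keeps short bursts from opening carrier tickets while still catching a link that stays saturated.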

2

Server Performance

  1. Check CPU utilization across hosts
    • In vCenter / Hyper-V Manager / Datadog, identify hosts averaging >80% CPU over the prior 24h. Check CPU ready time on the VMs as well: high ready time indicates host contention, not just guest load.

  2. Review memory pressure and ballooning
    • Look for active memory ballooning or swap on ESXi hosts; on Windows guests, watch for sustained high committed bytes vs. RAM. A SQL or Exchange VM trending up week-over-week is the early signal for a leak or undersized config.

  3. Verify disk capacity on critical volumes
    • Check datastores, OS volumes, SQL data and log drives, and Exchange databases. Anything past 85% used gets a remediation ticket; anything past 90% gets paged. Snapshot sprawl on VM datastores is a frequent culprit.

  4. Plan capacity remediation for full volumes
    • For any volume flagged above 85%: identify the growth driver (log file, backup chain, snapshot, user data), open a change request, and either expand the volume, archive data, or prune snapshots. Schedule the change in the next maintenance window unless a P2 threshold is breached.

  5. Confirm critical services are running
    • Spot-check service-level monitors for AD DS, DNS, DHCP, Exchange transport, SQL Server, and any line-of-business app daemons. RMM agent dashboards (NinjaOne, Datto, Atera) are usually the fastest view across the fleet.

  6. Triage Windows and Linux event logs
    • Pull the last 24h of System and Application errors from the Windows Event Log and, on Linux, from journalctl. Repeated disk I/O errors, DCOM faults, or kernel oopses are leading indicators that warrant a deeper host-level investigation.
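
The two disk thresholds in step 3 (past 85% used gets a ticket, past 90% gets paged) map onto a small classification function. This is a sketch only: the function name `classify_volume`, the action labels, and the volume inventory are hypothetical, and "past" is assumed to mean strictly greater than the threshold.

```python
def classify_volume(used: float, total: float) -> str:
    """Map a volume's usage to a checklist action: 'ok', 'ticket', or 'page'."""
    if total <= 0:
        raise ValueError("total must be positive")
    pct_used = 100.0 * used / total
    if pct_used > 90.0:
        return "page"    # past 90%: page the on-call
    if pct_used > 85.0:
        return "ticket"  # past 85%: open a remediation ticket
    return "ok"


# Hypothetical inventory: volume name -> (used GiB, total GiB).
volumes = {
    "sql-data": (430, 500),  # 86% used
    "sql-log": (185, 200),   # 92.5% used
    "os-c": (60, 120),       # 50% used
}

actions = {name: classify_volume(used, total) for name, (used, total) in volumes.items()}
print(actions)
```

Checking the higher threshold first ensures a volume past 90% is paged rather than only ticketed.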

3

Security and Compliance

  1. Confirm EDR signatures and agents are current
    • In CrowdStrike / SentinelOne / Defender for Endpoint, list any sensor older than N-1 version or any endpoint that hasn't checked in within 7 days. Stale agents are a common audit finding and the most frequent cause of "why didn't EDR catch it."

  2. Review patch compliance against last Patch Tuesday
    • Pull the Intune / SCCM / Automox / Action1 compliance report. Flag any KB that addresses a critical- or high-severity CVE and is still missing past the deployment ring SLA. Note exceptions with a documented mitigation; do not let "pending reboot" linger past 14 days.

  3. Investigate failed authentication patterns
    • In Entra ID sign-in logs and SIEM (Sentinel / Splunk), look for password spray patterns, impossible-travel sign-ins, and legacy-auth attempts that should be blocked by conditional access. Repeated MFA-fatigue prompts on a single user are a high-priority signal.

  4. Open an incident response ticket
    • If the failed-authentication review above surfaces a credible compromise indicator, create a SEV2 IR ticket in ServiceNow / Halo PSA, attach the sign-in log evidence, and notify the security on-call. Disable the suspect account, revoke active sessions in Entra ID, and force MFA re-enrollment before continuing the daily check.

  5. Spot-check privileged group membership
    • List members of Domain Admins, Enterprise Admins, Global Administrator, and any other tenant-level privileged roles. Any member added since the last run requires a matching change ticket; standing membership held by anything other than a dedicated tier-0 admin account is a finding.

  6. Verify backup job success and immutability
    • In Veeam / Datto / Rubrik confirm overnight jobs completed successfully and that the immutable / object-locked copy is intact. A green job with a stale restore point (older than RPO) is the failure mode to watch — alert thresholds often miss it.
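
The "green job with a stale restore point" failure mode in step 6 is easy to check once the newest restore point's timestamp is exported. Below is a minimal Python sketch, assuming a job export with the hypothetical fields `name`, `status`, `latest_restore_point`, and `rpo_hours`. These field names and the sample data are illustrative and do not reflect the Veeam, Datto, or Rubrik APIs.

```python
from datetime import datetime, timedelta, timezone


def restore_point_is_stale(latest_restore_point: datetime, rpo: timedelta, now: datetime) -> bool:
    """True when the newest restore point is older than the RPO allows."""
    return now - latest_restore_point > rpo


# Fixed "now" so the example is reproducible; use datetime.now(timezone.utc) in practice.
now = datetime(2024, 1, 10, 8, 0, tzinfo=timezone.utc)

# Hypothetical job export; both jobs report success.
jobs = [
    {"name": "fileserver", "status": "Success", "rpo_hours": 24,
     "latest_restore_point": datetime(2024, 1, 10, 1, 0, tzinfo=timezone.utc)},
    {"name": "sql-prod", "status": "Success", "rpo_hours": 24,
     "latest_restore_point": datetime(2024, 1, 8, 1, 0, tzinfo=timezone.utc)},
]

# Flag a job if it failed OR if its newest restore point is older than its RPO.
needs_attention = [
    job["name"]
    for job in jobs
    if job["status"] != "Success"
    or restore_point_is_stale(job["latest_restore_point"], timedelta(hours=job["rpo_hours"]), now)
]
print(needs_attention)
```

Evaluating restore-point age independently of job status is what catches the case the step warns about: `sql-prod` reports success but its newest restore point is 55 hours old against a 24-hour RPO.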

4

Application Monitoring

  1. Review uptime against SLA targets
    • Pull synthetic check results from Site24x7 / LogicMonitor / Datadog for each tier-1 application. Flag any service that is burning its monthly SLA error budget faster than the month is elapsing; if most of the budget is gone by mid-month, one more outage will breach the SLA.

  2. Triage application error logs
    • Filter the last 24h of application logs for ERROR/FATAL. Group by signature; new error signatures since yesterday get a follow-up ticket assigned to the application owner, not silently acknowledged.

  3. Check database query performance
    • Review SQL Server / PostgreSQL slow-query logs and wait-stat dashboards. A query that doubled in average runtime week-over-week often means a missing index or a stats-out-of-date condition rather than a load problem.

  4. Validate API endpoint health
    • Confirm external and internal API synthetic checks return the expected 2xx responses within the latency budget. TLS certificates expiring within 30 days surface here; escalate to the ACME automation owner before browser warnings start.

  5. Inspect application resource consumption
    • Compare per-app CPU, memory, and disk I/O against the prior week. App pools recycling more often than configured and JVM heap usage creeping toward the -Xmx limit are the two patterns worth chasing before users notice.
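
The SLA budget check in step 1 comes down to simple arithmetic. The sketch below shows one possible interpretation, assuming a 30-day month and treating a service as at risk when the fraction of its downtime budget already consumed exceeds the fraction of the month elapsed. The function names and sample figures are hypothetical.

```python
def error_budget_minutes(sla_target_pct: float, days_in_month: int = 30) -> float:
    """Downtime (in minutes) the SLA target allows over the month."""
    total_minutes = days_in_month * 24 * 60
    return total_minutes * (100.0 - sla_target_pct) / 100.0


def burning_too_fast(downtime_minutes: float, sla_target_pct: float,
                     day_of_month: int, days_in_month: int = 30) -> bool:
    """True when budget consumption is ahead of the calendar."""
    budget = error_budget_minutes(sla_target_pct, days_in_month)
    if budget <= 0:
        # A 100% target has no budget: any downtime is a breach.
        return downtime_minutes > 0
    consumed_fraction = downtime_minutes / budget
    elapsed_fraction = day_of_month / days_in_month
    return consumed_fraction > elapsed_fraction


# A 99.9% target over 30 days allows about 43.2 minutes of downtime.
print(round(error_budget_minutes(99.9), 1))

# 30 minutes of downtime by day 15 is roughly 69% of the budget at 50% of the month.
print(burning_too_fast(downtime_minutes=30, sla_target_pct=99.9, day_of_month=15))
```

Comparing consumed budget against elapsed time flags a service before it breaches, which leaves room to schedule remediation rather than react to the breach itself.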

5

Alerting and Reporting

  1. Review the overnight alert queue
    • Walk the PagerDuty / Opsgenie incident timeline since the last shift. Confirm every P1/P2 has a closing note with root cause or a follow-up ticket; unacknowledged alerts older than the acknowledgement SLA are the first thing to escalate to the service coordinator.

  2. Tune noisy alert thresholds
    • Identify any monitor that fired more than 5 times in 24h without action — that's alert fatigue waiting to happen. Adjust thresholds, add hysteresis, or convert to a daily digest. Document the change in the monitoring runbook.

  3. Confirm escalation chains routed correctly
    • Verify on-call rotations in PagerDuty / Opsgenie match this week's published schedule, including holiday coverage. A page that escalates to a former employee is the most embarrassing way to find a stale schedule.

  4. Publish the daily monitoring report
    • Generate the daily summary from BrightGauge or the built-in PSA dashboard and post it to the NOC channel. Yellow or Red status requires a named follow-up owner and ETA before close-of-shift; do not hand off Red without a paged engineer.

  5. Verify monitoring tools are themselves healthy
    • Confirm PRTG / Orion / Datadog agents are reporting from every site, RMM check-in counts match endpoint inventory, and SIEM ingestion is within expected EPS. A monitoring system that's down silently is worse than no monitoring — schedule a watchdog if one doesn't exist.
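
The noisy-alert rule in step 2 (a monitor that fired more than 5 times in 24h without action) can be applied to an exported alert history. This is a minimal Python sketch, assuming each event carries the hypothetical fields `monitor`, `fired_at`, and `actioned`. The field names, monitor names, and sample events are illustrative and are not the PagerDuty or Opsgenie export format.

```python
from collections import Counter
from datetime import datetime, timedelta


def noisy_monitors(events, now, window=timedelta(hours=24), max_unactioned=5):
    """Return monitors that fired more than `max_unactioned` times in `window` with no action."""
    cutoff = now - window
    counts = Counter(
        event["monitor"]
        for event in events
        if event["fired_at"] >= cutoff and not event["actioned"]
    )
    return sorted(name for name, fired in counts.items() if fired > max_unactioned)


now = datetime(2024, 1, 10, 8, 0)

# Hypothetical alert history.
events = (
    # Six unactioned firings in the last 24h: noisy.
    [{"monitor": "disk-latency-sql01", "fired_at": now - timedelta(hours=h), "actioned": False}
     for h in range(1, 7)]
    # Six firings, two of them actioned: only four unactioned, so not noisy.
    + [{"monitor": "cpu-web02", "fired_at": now - timedelta(hours=h), "actioned": h <= 2}
       for h in range(1, 7)]
    # Seven unactioned firings, but all more than two days old: outside the window.
    + [{"monitor": "ping-branch3", "fired_at": now - timedelta(days=2, hours=h), "actioned": False}
       for h in range(1, 8)]
)

noisy = noisy_monitors(events, now)
print(noisy)
```

The resulting list is a starting point for the threshold, hysteresis, or digest changes described in the step, not a replacement for reviewing each monitor.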

Category: Systems Administration