System Monitoring Checklist
Network Monitoring
Open the PRTG / Auvik / SolarWinds Orion device tree and confirm every monitored switch, router, firewall, and AP is green. Investigate any device in a down or unknown state; SNMP credential changes after firmware updates are a common cause of false-positive outages.
Compare the last 24 hours against baseline in Auvik / Orion NTA. Look for sustained egress spikes to unfamiliar ASNs (potential exfiltration) and unusual east-west traffic between VLANs that should be segmented.
Check FortiGate / Palo Alto / Meraki rule-hit counts. Zero-hit allow rules over 90 days are candidates for cleanup; sudden spikes in denies on a previously-quiet rule warrant investigation.
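The 90-day zero-hit check above can be sketched as a small filter over an exported rule list. This is a minimal sketch, assuming the export has already been parsed into records; the `Rule` fields and the `cleanup_candidates` name are hypothetical and do not correspond to any vendor's API.

```python
from datetime import date, timedelta
from typing import List, NamedTuple, Optional


class Rule(NamedTuple):
    """Hypothetical record parsed from a firewall rule export."""
    name: str
    action: str                # "allow" or "deny"
    created: date
    last_hit: Optional[date]   # None if the rule has never matched


def cleanup_candidates(rules: List[Rule], today: date, window_days: int = 90) -> List[str]:
    """Return names of allow rules with no hits inside the window."""
    cutoff = today - timedelta(days=window_days)
    candidates = []
    for rule in rules:
        if rule.action != "allow":
            continue
        if rule.created > cutoff:
            continue  # rule is younger than the window; too early to judge
        if rule.last_hit is None or rule.last_hit < cutoff:
            candidates.append(rule.name)
    return candidates
```

Skipping rules younger than the window avoids flagging a rule that simply has not had time to collect hits.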
Review per-site WAN circuits for sustained >80% utilization. Note any saturated links so capacity tickets are opened with the carrier before user-impact tickets land.
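One way to separate sustained saturation from a brief burst is to require several consecutive samples above the threshold. The sketch below assumes utilization has been exported as a list of percentages per circuit; the function name, the circuit names, and the three-sample default are illustrative assumptions, not values from any monitoring product.

```python
from typing import Dict, List


def saturated_links(
    samples: Dict[str, List[float]],
    threshold: float = 80.0,
    min_consecutive: int = 3,
) -> List[str]:
    """Return circuits whose utilization exceeded the threshold for at
    least min_consecutive samples in a row."""
    flagged = []
    for circuit, series in samples.items():
        run = 0
        for pct in series:
            run = run + 1 if pct > threshold else 0
            if run >= min_consecutive:
                flagged.append(circuit)
                break
    return flagged
```

With 5-minute polling, `min_consecutive=3` would mean roughly 15 minutes above 80%; tune it to whatever "sustained" means for the site.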
Confirm site-to-site IPsec tunnels are up and stable in the firewall console. For client VPN / ZTNA, spot-check session counts and authentication failures against baseline; repeated MFA failures on one account are an early brute-force signal.
Server Performance
In vCenter / Hyper-V Manager / Datadog, identify hosts averaging >80% CPU over the prior 24h. Note ready-time on VMs as a co-symptom of host contention, not just guest load.
Look for active memory ballooning or swap on ESXi hosts; on Windows guests, watch for sustained high committed bytes vs. RAM. A SQL or Exchange VM trending up week-over-week is the early signal for a leak or undersized config.
Check datastores, OS volumes, SQL data and log drives, and Exchange databases. Anything past 85% used gets a remediation ticket; anything past 90% gets paged. Snapshot sprawl on VM datastores is a frequent culprit.
For any volume flagged above 85%: identify the growth driver (log file, backup chain, snapshot, user data), open a change request, and either expand the volume, archive data, or prune snapshots. Schedule the change in the next maintenance window unless a P2 threshold is breached.
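The 85% and 90% thresholds above map onto a simple three-way classification. This is a minimal sketch; the function name and the "ok" / "ticket" / "page" labels are assumptions chosen for illustration.

```python
def classify_volume(used_bytes: int, total_bytes: int) -> str:
    """Map a volume's usage onto the checklist thresholds:
    past 90% pages, past 85% gets a remediation ticket."""
    if total_bytes <= 0:
        raise ValueError("total_bytes must be positive")
    used_pct = 100.0 * used_bytes / total_bytes
    if used_pct > 90.0:
        return "page"
    if used_pct > 85.0:
        return "ticket"
    return "ok"
```

On a single host, the inputs could come from the standard library's `shutil.disk_usage(path)`, which returns `total`, `used`, and `free` in bytes.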
Spot-check service-level monitors for AD DS, DNS, DHCP, Exchange transport, SQL Server, and any line-of-business app daemons. RMM agent dashboards (NinjaOne, Datto, Atera) are usually the fastest view across the fleet.
Pull the last 24h of System and Application errors from Windows Event Log and journalctl on Linux. Repeated disk I/O errors, DCOM faults, or kernel oopses are leading indicators that warrant a host-level deeper dive.
Security and Compliance
In CrowdStrike / SentinelOne / Defender for Endpoint, list any sensor running a version older than N-1 and any endpoint that hasn't checked in within 7 days. Stale agents are a common audit finding and the most frequent cause of "why didn't EDR catch it?"
Pull the Intune / SCCM / Automox / Action1 compliance report. Flag any KB with critical or high CVSS still missing past the deployment ring SLA. Note exceptions with a documented mitigation; do not let "pending reboot" linger past 14 days.
In Entra ID sign-in logs and SIEM (Sentinel / Splunk), look for password spray patterns, impossible-travel sign-ins, and legacy-auth attempts that should be blocked by conditional access. Repeated MFA-fatigue prompts on a single user are a high-priority signal.
If the sign-in review turns up a likely compromised account, create a SEV2 IR ticket in ServiceNow / Halo PSA, attach the sign-in log evidence, and notify the security on-call. Disable the suspect account, revoke active sessions in Entra ID, and force MFA re-enrollment before continuing the daily check.
List members of Domain Admins, Enterprise Admins, Global Administrator, and any tenant-level privileged roles. Anything added since the last run requires a matching change ticket; standing membership held by anything other than a dedicated tier-0 admin account is a finding.
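The "added since the last run" comparison is a set difference per group. The sketch below assumes membership snapshots and ticketed additions are already available as sets of account names; the function name and data shapes are hypothetical.

```python
from typing import Dict, Set


def unapproved_additions(
    previous: Dict[str, Set[str]],
    current: Dict[str, Set[str]],
    ticketed: Dict[str, Set[str]],
) -> Dict[str, Set[str]]:
    """Return, per group, members added since the previous snapshot
    that have no matching change ticket."""
    findings = {}
    for group, members in current.items():
        added = members - previous.get(group, set()) - ticketed.get(group, set())
        if added:
            findings[group] = added
    return findings
```

Saving each day's `current` snapshot as the next day's `previous` keeps the comparison stateless apart from one stored file.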
In Veeam / Datto / Rubrik confirm overnight jobs completed successfully and that the immutable / object-locked copy is intact. A green job with a stale restore point (older than RPO) is the failure mode to watch — alert thresholds often miss it.
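The "green job, stale restore point" case can be caught by checking the restore point's age separately from the job status. A minimal sketch, assuming the job status and latest restore point timestamp have been pulled from the backup product's report; the function name and status strings are illustrative only.

```python
from datetime import datetime, timedelta
from typing import Optional


def backup_health(
    job_status: str,
    latest_restore_point: Optional[datetime],
    now: datetime,
    rpo: timedelta,
) -> str:
    """Return "failed", "stale", or "ok" for one protected workload."""
    if job_status != "success":
        return "failed"
    # A successful job can still leave the newest restore point older than RPO.
    if latest_restore_point is None or now - latest_restore_point > rpo:
        return "stale"
    return "ok"
```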
Application Monitoring
Pull synthetic check results from Site24x7 / LogicMonitor / Datadog for each tier-1 application. Flag any service that is burning through its monthly SLA error budget faster than the calendar: if most of the budget is already spent at mid-month, one more outage will breach the SLA.
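Budget burn can be expressed as the fraction of allowed downtime already used, compared with the fraction of the month elapsed. The sketch below is one common way to compute it, not a formula taken from any of the tools named above; the function name and the 30-day default are assumptions.

```python
from typing import Tuple


def budget_burn(
    slo_target: float,
    downtime_min: float,
    elapsed_days: float,
    days_in_month: int = 30,
) -> Tuple[float, float]:
    """Return (fraction of budget consumed, burn rate).

    A burn rate above 1.0 means the budget is being spent faster than
    the month is passing.
    """
    if not 0 < slo_target < 1 or elapsed_days <= 0:
        raise ValueError("slo_target must be in (0, 1) and elapsed_days > 0")
    budget_min = days_in_month * 24 * 60 * (1 - slo_target)
    consumed = downtime_min / budget_min
    burn_rate = consumed / (elapsed_days / days_in_month)
    return consumed, burn_rate
```

For a 99.9% target over 30 days the budget is about 43.2 minutes, so 30 minutes of downtime by day 15 has used roughly 69% of the budget at a burn rate near 1.39.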
Filter the last 24h of application logs for ERROR/FATAL. Group by signature; new error signatures since yesterday get a follow-up ticket assigned to the application owner, not silently acknowledged.
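Grouping by signature usually means stripping the variable parts of a message so that repeats collapse together. The sketch below uses a deliberately crude normalization (every run of digits becomes `<n>`); the function names are hypothetical, and real logs may need more rules for hostnames, GUIDs, or paths.

```python
import re
from collections import Counter
from typing import Iterable

_DIGITS = re.compile(r"\d+")
_LEVELS = ("ERROR", "FATAL")


def signature(line: str) -> str:
    """Collapse digit runs so timestamps and IDs do not split groups."""
    return _DIGITS.sub("<n>", line.strip())


def error_signatures(lines: Iterable[str]) -> Counter:
    """Count ERROR/FATAL lines by normalized signature."""
    return Counter(
        signature(line) for line in lines if any(lvl in line for lvl in _LEVELS)
    )


def new_signatures(yesterday: Iterable[str], today: Iterable[str]) -> Counter:
    """Signatures seen today that did not appear yesterday."""
    seen = error_signatures(yesterday)
    return Counter(
        {sig: n for sig, n in error_signatures(today).items() if sig not in seen}
    )
```

Each key in the result is a candidate for a follow-up ticket to the application owner.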
Review SQL Server / PostgreSQL slow-query logs and wait-stat dashboards. A query that doubled in average runtime week-over-week often means a missing index or a stats-out-of-date condition rather than a load problem.
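The week-over-week doubling check reduces to comparing two per-query averages. A minimal sketch, assuming average runtimes in milliseconds have been exported per query identifier; the function name and input shape are assumptions.

```python
from typing import Dict, List


def regressed_queries(
    last_week_ms: Dict[str, float],
    this_week_ms: Dict[str, float],
    factor: float = 2.0,
) -> List[str]:
    """Return query IDs whose average runtime grew by at least `factor`."""
    return sorted(
        query
        for query, avg in this_week_ms.items()
        if last_week_ms.get(query, 0) > 0 and avg >= factor * last_week_ms[query]
    )
```

Queries with no prior-week baseline are skipped rather than flagged, since there is nothing to compare them against.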
Confirm external and internal API synthetic checks return the expected 2xx status within the latency budget. TLS certificates expiring within 30 days surface here; escalate to the ACME automation owner before the browser warnings start.
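The 30-day expiry check can be done from a certificate's `notAfter` string. The sketch below takes that string as input so it runs without network access; the function names are hypothetical, while `ssl.cert_time_to_seconds` is the standard-library parser for this date format.

```python
import ssl
from datetime import datetime, timezone


def days_until_expiry(not_after: str, now: datetime) -> int:
    """Whole days between `now` (timezone-aware) and the certificate's
    notAfter value, e.g. "Jan  5 09:34:43 2018 GMT"."""
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expires - now).days


def needs_escalation(not_after: str, now: datetime, warn_days: int = 30) -> bool:
    """True when the certificate expires within the warning window."""
    return days_until_expiry(not_after, now) <= warn_days
```

For a live endpoint, the `notAfter` value is available from the dictionary returned by `SSLSocket.getpeercert()` after a handshake made with `ssl.create_default_context()`.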
Compare per-app CPU, memory, and disk I/O against the prior week. App pools recycling more often than configured and JVM heap creeping toward Xmx are the two patterns worth chasing before users notice.
Alerting and Reporting
Walk the PagerDuty / Opsgenie incident timeline since the last shift. Confirm every P1/P2 has a closing note with root cause or a follow-up ticket; un-acknowledged alerts older than the SLA are the first thing to escalate to the service coordinator.
Identify any monitor that fired more than 5 times in 24h without action — that's alert fatigue waiting to happen. Adjust thresholds, add hysteresis, or convert to a daily digest. Document the change in the monitoring runbook.
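Finding noisy monitors is a counting exercise over the last 24 hours of alerts. This sketch assumes alerts have been exported as (monitor name, was actioned) pairs and reads "without action" as counting only the unactioned fires; both the shape and that reading are assumptions.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def noisy_monitors(
    alerts: Iterable[Tuple[str, bool]], max_fires: int = 5
) -> List[str]:
    """Return monitors with more than max_fires unactioned alerts."""
    unactioned = Counter(name for name, actioned in alerts if not actioned)
    return sorted(name for name, count in unactioned.items() if count > max_fires)
```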
Verify on-call rotations in PagerDuty / Opsgenie match this week's published schedule, including holiday coverage. A page that escalates to a former employee is the most embarrassing way to find a stale schedule.
Generate the daily summary from BrightGauge / built-in PSA dashboard and post to the NOC channel. Yellow or Red status requires a named follow-up owner and ETA before close-of-shift; do not hand off Red without a paged engineer.
Confirm PRTG / Orion / Datadog agents are reporting from every site, RMM check-in counts match endpoint inventory, and SIEM ingestion is within expected EPS. A monitoring system that's down silently is worse than no monitoring — schedule a watchdog if one doesn't exist.
