Application Performance Monitoring Checklist
Infrastructure Monitoring Coverage
Reconcile the inventory in the RMM (Datadog, NinjaOne, PRTG, LogicMonitor) against the CMDB or hypervisor host list. New VMs spun up outside the gold image are the usual gap — they show up as healthy but report no metrics. List any hosts missing an agent.
Spot-check the 7-day baseline on domain controllers, hypervisors, and database hosts. Saturated baselines (sustained CPU above 80%, memory pressure with paging) get flagged for capacity planning, not just alerting.
SQL transaction logs, tempdb growth, and snapshot residue are the usual culprits. Confirm any volume above 80% has a remediation owner and ETA.
Pull NetFlow or Auvik path data for site-to-site VPN tunnels and cloud peerings. ISP path changes routinely add 30-60ms without notice and only show up when a user complains about Teams.
Confirm switches, firewalls, and APs are still forwarding to Splunk / Sentinel / Graylog. A silent collector is the most common cause of post-incident gaps in the audit trail.
Application Metrics Collection
Confirm New Relic / Datadog APM / AppDynamics / Dynatrace agents are reporting on every tier-1 service. Deployments that change runtime versions silently drop the agent — check the last-seen timestamp against the last release.
Means hide outliers; p95/p99 catches the long tail users actually feel. Compare against the published SLO. Flag any endpoint that breached its objective more than three times in the last week.
Pull HTTP 5xx and uncaught exception counts per service. Note the top offender for the optimization phase later in the run.
Run the slow-query report from SQL Server / Postgres / MySQL and correlate against APM trace flame graphs. Missing indexes on a recently-added column are the most common new entry.
Stripe, Twilio, Auth0, and any payment / identity provider must have an external probe. When their status page lies, your synthetic check is the only ground truth.
User Experience Monitoring
Confirm the RUM snippet (Datadog RUM, New Relic Browser, Sentry) is loading on every public route. CSP changes and tag-manager edits are the usual reason coverage silently drops.
LCP under 2.5s, INP under 200ms, CLS under 0.1 at the 75th percentile. Flag pages where the rolling 7-day score regressed from "good" to "needs improvement."
Pingdom / Site24x7 / Datadog Synthetics scripted journeys (login, search, checkout) should sit above 99.5% pass for the week. Flaky logins from a single region usually mean the script — not the app — needs maintenance.
Pull five sessions tagged with rage-clicks or dead-clicks from FullStory / LogRocket / Hotjar. These usually map to a UI regression that error-rate dashboards miss because the front-end didn't throw.
Alerting and Notifications
Thresholds set 18 months ago against 2-vCPU hosts no longer fit a 16-vCPU upgrade. Compare each P1/P2 threshold to the current 7-day baseline; mark stale ones for adjustment.
Adjust the flagged thresholds in Datadog / PRTG / LogicMonitor. Open a change ticket with before/after values so the next reviewer can audit the rationale.
Confirm the on-call schedule for the next two weeks has no gaps and that overrides for PTO are in place. The Friday-before-a-long-weekend gap is the classic miss.
Fire a synthetic test alert from each integration. Slack webhook tokens expire silently; SMS gateway changes break Twilio routing without warning.
Pull the top 10 alerts by volume for the week. Anything firing more than 50 times without a corresponding ticket is a candidate for tuning, not a real signal.
Reporting and Dashboards
Update SLO burn-down, p95 latency, error budget, and uptime tiles. Broken queries from a renamed metric are the usual reason a tile shows "no data" the morning of a leadership review.
Export the SLO summary, top 5 incidents, and capacity trend into the standard report template. Attach the PDF to this step for the audit trail.
Send to the IT leadership distribution list and the affected service owners. For MSP runs, attach to the client's QBR folder in IT Glue or Hudu.
Performance Optimization Review
Cross-reference APM flame graphs, slow-query logs, and synthetic regressions to single out the largest user-impact bottleneck. If nothing material surfaced, mark No and skip to load-test scheduling.
File the ticket in Jira / ServiceNow / ConnectWise PSA with the trace ID, affected endpoint, and proposed mitigation (index, cache, or query rewrite). Assign to the owning engineering team, not back to ops.
Use k6, JMeter, or Locust against staging — never against prod without a maintenance window. Target 1.5x peak production traffic to leave headroom for the next quarter.
Final reviewer captures the overall posture, any open exceptions, and a digital signature for the audit trail. Required for SOC 2 CC7.2 evidence on monitoring activities.
Use this template in Manifestly
- Cloud Migration Checklist
- Cloud Security Checklist
- User Access Review Checklist
- Data Recovery Checklist
- Containerization Rollout Checklist
- Database Backup Checklist
- Password Management Checklist
- Backup and Restore Checklist
- Network Upgrade Checklist
- Server Backup Checklist
- Business Continuity Plan Checklist
- Problem Management Checklist
- Server Decommissioning Checklist
- Cloud Monitoring Checklist
- Hardware Inventory Checklist
- IT Regulatory Compliance Review
- Release Management Checklist
- Server Maintenance Checklist
- Rollback Plan Checklist
- Customer Support Ticket Workflow
- Software Upgrade Checklist
- Quarterly Compliance Reporting Checklist
- Patch Management Checklist
- Hardware Maintenance Checklist
- Server Security Checklist
- IT Emergency Response Checklist
- Incident Management Checklist
- Disaster Recovery Plan Checklist
- User Role Management Checklist
- Software Installation Checklist
- Compliance Audit Checklist
- Access Control Checklist
- Cloud Cost Management Checklist
- IT Staff Performance Review
- Change Management Checklist
- Firewall Configuration Checklist
- Security Audit Checklist
- Quarterly Network Security Review
- Database Migration Checklist
- Employee Onboarding Checklist
- Capacity Planning Checklist
- IT Budgeting Checklist
- Network Monitoring Checklist
- Cloud Deployment Checklist
- Database Installation Checklist
- IT Service Request Checklist
- Database Security Checklist
- System Monitoring Checklist
- Hardware Troubleshooting Checklist
- IT Strategy Checklist
- Patch Deployment Checklist
- Hardware Upgrade Checklist
- Performance Tuning Checklist
- Employee Training Checklist
- User Onboarding Checklist
- IT Vendor Management Checklist
- Server Build and Hardening Checklist
- IT Policy Review Checklist
- Help Desk Ticket Handling Checklist
- Infrastructure as Code Checklist
- Hardware Disposal Checklist
- IT Resource Allocation Checklist
- Incident Response Checklist
- Network Troubleshooting Checklist
- User Offboarding Checklist
Ready to take control of your recurring tasks?
Start Free 14-Day TrialUse Slack? Sign up with one click
