Application Performance Monitoring Checklist
Recurring review the IT operations team runs to confirm APM coverage, alert hygiene, and performance posture across infrastructure, application, and user-experience layers. Designed to run weekly with a deeper monthly pass.
Infrastructure Monitoring Coverage
-
Audit monitoring agent coverage across hosts
Reconcile the agent inventory in the RMM or monitoring platform (Datadog, NinjaOne, PRTG, LogicMonitor) against the CMDB or hypervisor host list. New VMs spun up outside the gold image are the usual gap: they show up as healthy but report no metrics. List any hosts missing an agent.
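The reconciliation above reduces to a set difference. A minimal sketch, assuming both systems can export hostnames as plain lists; the function name and the sample hosts are hypothetical, and short names versus FQDNs would need extra handling:

```python
def hosts_missing_agent(cmdb_hosts, monitored_hosts):
    """Return CMDB hostnames with no matching monitoring agent, sorted."""

    def norm(name):
        # Normalize so " DB01. " and "db01" compare equal.
        return name.strip().rstrip(".").lower()

    monitored = {norm(h) for h in monitored_hosts}
    return sorted({norm(h) for h in cmdb_hosts} - monitored)


# Hypothetical exports from the CMDB and the monitoring tool.
cmdb = ["web01", "WEB02", "db01", "vm-temp-17"]
agents = ["web01", "web02", "db01"]
print(hosts_missing_agent(cmdb, agents))  # ['vm-temp-17']
```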
-
Verify CPU and memory baselines on critical hosts
Spot-check the 7-day baseline on domain controllers, hypervisors, and database hosts. Saturated baselines (sustained CPU above 80%, memory pressure with paging) get flagged for capacity planning, not just alerting.
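One way to make "sustained" concrete is to look for the longest consecutive run of samples above the threshold. This is only a sketch: the hourly sampling interval and the six-sample cutoff are assumptions, not values from this checklist.

```python
def longest_run_above(samples, threshold=80.0):
    """Length of the longest consecutive run of samples above threshold."""
    longest = current = 0
    for value in samples:
        current = current + 1 if value > threshold else 0
        longest = max(longest, current)
    return longest


def is_saturated(samples, threshold=80.0, min_run=6):
    # min_run is an assumed cutoff: with hourly samples, 6 means six hours.
    return longest_run_above(samples, threshold) >= min_run


# Hypothetical hourly CPU averages (percent) for one host.
hourly_cpu = [45, 62, 85, 88, 91, 86, 84, 90, 70, 55]
print(longest_run_above(hourly_cpu))  # 6
print(is_saturated(hourly_cpu))       # True
```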
-
Review disk space and IOPS on database servers
SQL transaction logs, tempdb growth, and snapshot residue are the usual culprits. Confirm any volume above 80% has a remediation owner and ETA.
-
Check network latency between sites and cloud regions
Pull NetFlow or Auvik path data for site-to-site VPN tunnels and cloud peerings. ISP path changes routinely add 30-60ms without notice and only show up when a user complains about Teams.
-
Validate SNMP and syslog ingestion into the SIEM
Confirm switches, firewalls, and APs are still forwarding to Splunk / Sentinel / Graylog. A silent collector is the most common cause of post-incident gaps in the audit trail.
Application Metrics Collection
-
Verify APM tracing on critical services
Confirm New Relic / Datadog APM / AppDynamics / Dynatrace agents are reporting on every tier-1 service. Deployments that change runtime versions can silently drop the agent, so check each agent's last-seen timestamp against the service's last release.
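The last-seen check can be sketched as a timestamp comparison per service. The dictionary layout, field names, and sample services below are hypothetical; in practice the two timestamps would come from the APM tool and the deployment system.

```python
from datetime import datetime, timezone


def agents_dropped_by_release(services):
    """Return services whose APM agent has not reported since the last release."""
    return sorted(
        name
        for name, info in services.items()
        if info["agent_last_seen"] < info["last_release"]
    )


utc = timezone.utc
# Hypothetical data: "checkout" went quiet two minutes before its release.
services = {
    "checkout": {
        "last_release": datetime(2024, 5, 14, 10, 0, tzinfo=utc),
        "agent_last_seen": datetime(2024, 5, 14, 9, 58, tzinfo=utc),
    },
    "search": {
        "last_release": datetime(2024, 5, 10, 15, 0, tzinfo=utc),
        "agent_last_seen": datetime(2024, 5, 16, 8, 30, tzinfo=utc),
    },
}
print(agents_dropped_by_release(services))  # ['checkout']
```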
-
Review p95 and p99 endpoint response times
Means hide outliers; p95 and p99 catch the long tail users actually feel. Compare each against the published SLO. Flag any endpoint that breached its objective more than three times in the last week.
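A small worked example of why the mean misleads, using the nearest-rank percentile definition (APM tools may interpolate differently). The latency samples, the 400 ms objective, and the assumption of one p95 evaluation per day are all illustrative.

```python
import math
import statistics


def percentile(values, pct):
    """Nearest-rank percentile: smallest sample with at least pct% at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def breach_count(daily_p95_ms, slo_ms):
    """Number of days on which the daily p95 exceeded the objective."""
    return sum(1 for value in daily_p95_ms if value > slo_ms)


# Hypothetical latencies in ms: 90 fast requests, 5 slow, 5 very slow.
latencies = [100] * 90 + [400] * 5 + [2000] * 5
print(statistics.mean(latencies))  # 210
print(percentile(latencies, 95))   # 400
print(percentile(latencies, 99))   # 2000

# Hypothetical daily p95 values for one endpoint against a 400 ms objective.
week = [310, 280, 450, 520, 290, 610, 480]
print(breach_count(week, 400) > 3)  # True
```

The mean of 210 ms looks healthy even though one request in twenty takes two seconds.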
-
Audit application error rates by service
Pull HTTP 5xx and uncaught exception counts per service. Note the top offender for the optimization phase later in the run.
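Picking the top offender by rate rather than raw count keeps high-traffic services from dominating the list. A sketch assuming request logs reduced to (service, status) pairs; the service names are made up.

```python
from collections import Counter


def error_rates(requests):
    """Map each service to the fraction of its requests that returned HTTP 5xx."""
    total, errors = Counter(), Counter()
    for service, status in requests:
        total[service] += 1
        if 500 <= status <= 599:
            errors[service] += 1
    return {service: errors[service] / total[service] for service in total}


def top_offender(requests):
    """Service with the highest 5xx rate."""
    rates = error_rates(requests)
    return max(rates, key=rates.get)


# Hypothetical sample of (service, HTTP status) pairs.
requests = [
    ("api", 200), ("api", 500),
    ("web", 200), ("web", 200), ("web", 503), ("web", 200),
]
print(error_rates(requests))   # {'api': 0.5, 'web': 0.25}
print(top_offender(requests))  # api
```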
-
Inspect slow database queries and lock contention
Run the slow-query report from SQL Server / Postgres / MySQL and correlate against APM trace flame graphs. Missing indexes on a recently-added column are the most common new entry.
-
Confirm third-party API monitoring is healthy
Stripe, Twilio, Auth0, and any payment / identity provider must have an external probe. When their status page lies, your synthetic check is the only ground truth.
User Experience Monitoring
-
Verify RUM coverage on production front-end
Confirm the RUM snippet (Datadog RUM, New Relic Browser, Sentry) is loading on every public route. CSP changes and tag-manager edits are the usual reason coverage silently drops.
-
Review Core Web Vitals against thresholds
LCP at or under 2.5 s, INP at or under 200 ms, and CLS at or under 0.1, each measured at the 75th percentile. Flag pages where the rolling 7-day score regressed from "good" to "needs improvement."
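The rating logic can be sketched as below. The "good" bounds match the thresholds above; the upper bounds (4 s, 500 ms, 0.25) are the published Web Vitals boundaries for "poor" and are not stated in this checklist. The nearest-rank p75 and the sample values are assumptions; RUM vendors may compute percentiles differently.

```python
import math

# metric -> (upper bound for "good", upper bound for "needs improvement")
THRESHOLDS = {"lcp_ms": (2500, 4000), "inp_ms": (200, 500), "cls": (0.1, 0.25)}


def p75(samples):
    """Nearest-rank 75th percentile of a non-empty list."""
    ordered = sorted(samples)
    return ordered[math.ceil(75 * len(ordered) / 100) - 1]


def rate(metric, samples):
    """Classify a metric's p75 as good, needs improvement, or poor."""
    good, needs_improvement = THRESHOLDS[metric]
    value = p75(samples)
    if value <= good:
        return "good"
    if value <= needs_improvement:
        return "needs improvement"
    return "poor"


def regressed(metric, last_week, this_week):
    """True when a page slipped from good to needs improvement."""
    return (rate(metric, last_week), rate(metric, this_week)) == (
        "good",
        "needs improvement",
    )


# Hypothetical LCP samples in ms for one page, two consecutive weeks.
print(regressed("lcp_ms", [1800, 2000, 2200, 3900], [1800, 2600, 2700, 3900]))  # True
```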
-
Check synthetic monitor pass rate
Pingdom / Site24x7 / Datadog Synthetics scripted journeys (login, search, checkout) should sit above 99.5% pass for the week. Flaky logins from a single region usually mean the script — not the app — needs maintenance.
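Breaking the pass rate down by probe region makes the single-region pattern easy to spot. A sketch assuming results exported as (region, passed) pairs; the region names and counts are invented.

```python
from collections import defaultdict


def pass_rate_by_region(results):
    """Map each region to its fraction of passing synthetic runs."""
    counts = defaultdict(lambda: [0, 0])  # region -> [passed, total]
    for region, passed in results:
        counts[region][0] += int(passed)
        counts[region][1] += 1
    return {region: p / t for region, (p, t) in counts.items()}


def regions_below(results, target=0.995):
    """Regions whose weekly pass rate is under the target."""
    rates = pass_rate_by_region(results)
    return sorted(region for region, value in rates.items() if value < target)


# Hypothetical week of login-journey runs from two probe regions.
results = (
    [("us-east", True)] * 200
    + [("eu-west", True)] * 190
    + [("eu-west", False)] * 10
)
print(pass_rate_by_region(results))  # {'us-east': 1.0, 'eu-west': 0.95}
print(regions_below(results))        # ['eu-west']
```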
-
Sample session replays for friction signals
Pull five sessions tagged with rage-clicks or dead-clicks from FullStory / LogRocket / Hotjar. These usually map to a UI regression that error-rate dashboards miss because the front-end didn't throw.
Alerting and Notifications
-
Audit alert thresholds against current baselines
Thresholds set 18 months ago against 2-vCPU hosts no longer fit a 16-vCPU upgrade. Compare each P1/P2 threshold to the current 7-day baseline; mark stale ones for adjustment.
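One possible heuristic for "stale" is sketched below; it is an assumption for illustration, not a rule from this checklist. A threshold at or below the baseline p95 would fire on normal load, and one far above the baseline maximum would likely never fire. The slack factor of 2.0 is arbitrary.

```python
def classify_threshold(threshold, baseline_p95, baseline_max, slack=2.0):
    """Compare an alert threshold to the current 7-day baseline.

    Hypothetical rule: "too tight" if normal load already crosses it,
    "too loose" if it sits more than `slack` times above the observed peak.
    """
    if threshold <= baseline_p95:
        return "too tight"
    if threshold > baseline_max * slack:
        return "too loose"
    return "ok"


# Hypothetical examples, all values in the same unit as the metric.
print(classify_threshold(90, baseline_p95=95, baseline_max=99))      # too tight
print(classify_threshold(1000, baseline_p95=120, baseline_max=300))  # too loose
print(classify_threshold(400, baseline_p95=120, baseline_max=300))   # ok
```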
-
Update stale thresholds in the monitoring tool
Adjust the flagged thresholds in Datadog / PRTG / LogicMonitor. Open a change ticket with before/after values so the next reviewer can audit the rationale.
-
Verify PagerDuty rotation and override coverage
Confirm the on-call schedule for the next two weeks has no gaps and that overrides for PTO are in place. The Friday-before-a-long-weekend gap is the classic miss.
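Gap detection is an interval sweep: sort the shifts and track how far coverage extends. A sketch assuming shifts and overrides are exported as (start, end) datetime pairs; the dates are invented and show a weekend gap.

```python
from datetime import datetime


def coverage_gaps(shifts, window_start, window_end):
    """Return (start, end) intervals inside the window that no shift covers."""
    gaps = []
    cursor = window_start  # everything before cursor is covered or reported
    for start, end in sorted(shifts):
        if end <= cursor:
            continue  # shift adds no new coverage
        if start >= window_end:
            break  # remaining shifts start after the window
        if start > cursor:
            gaps.append((cursor, start))
        cursor = max(cursor, end)
    if cursor < window_end:
        gaps.append((cursor, window_end))
    return gaps


# Hypothetical two-week window with nobody on call over the first weekend.
window = (datetime(2024, 7, 1, 9), datetime(2024, 7, 15, 9))
shifts = [
    (datetime(2024, 7, 1, 9), datetime(2024, 7, 5, 17)),
    (datetime(2024, 7, 8, 9), datetime(2024, 7, 15, 9)),
]
for gap_start, gap_end in coverage_gaps(shifts, *window):
    print(gap_start, "->", gap_end)
```

PTO overrides can be passed in as additional shifts, since any interval that covers the time counts.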
-
Test alert delivery to email, SMS, and Slack
Fire a synthetic test alert from each integration. Slack webhook URLs can be revoked or orphaned without notice (for example when the owning app or user is removed), and SMS gateway changes can break Twilio routing without warning.
-
Review alert fatigue and noisy-rule list
Pull the top 10 alerts by volume for the week. Any rule firing more than 50 times without a corresponding ticket is likely noise rather than signal, and a candidate for tuning.
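The noisy-rule filter can be sketched as a count over the week's alert export. The (rule, ticket) pair format and the rule names are assumptions; a rule counts as ticketed if any of its firings links to a ticket.

```python
from collections import Counter


def noisy_rules(alerts, top_n=10, max_unticketed=50):
    """Top rules by volume that fired more than max_unticketed times with no ticket."""
    fired = Counter()
    ticketed = set()
    for rule, ticket in alerts:
        fired[rule] += 1
        if ticket:
            ticketed.add(rule)
    return [
        rule
        for rule, count in fired.most_common(top_n)
        if count > max_unticketed and rule not in ticketed
    ]


# Hypothetical week: (rule name, linked ticket ID or None).
alerts = (
    [("disk-latency", None)] * 120
    + [("cpu-high", "INC-1")]
    + [("cpu-high", None)] * 79
    + [("cert-expiry", None)] * 3
)
print(noisy_rules(alerts))  # ['disk-latency']
```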
Reporting and Dashboards
-
Refresh the executive performance dashboard
Update SLO burn-down, p95 latency, error budget, and uptime tiles. Broken queries from a renamed metric are the usual reason a tile shows "no data" the morning of a leadership review.
-
Compile the weekly performance report
Export the SLO summary, top 5 incidents, and capacity trend into the standard report template. Attach the PDF to this step for the audit trail.
-
Distribute the report to stakeholders
Send to the IT leadership distribution list and the affected service owners. For MSP runs, attach to the client's QBR folder in IT Glue or Hudu.
Performance Optimization Review
-
Identify the week's top performance bottleneck
Cross-reference APM flame graphs, slow-query logs, and synthetic regressions to single out the largest user-impact bottleneck. If nothing material surfaced, mark No and skip to load-test scheduling.
-
Open a remediation ticket with the service owner
File the ticket in Jira / ServiceNow / ConnectWise PSA with the trace ID, affected endpoint, and proposed mitigation (index, cache, or query rewrite). Assign to the owning engineering team, not back to ops.
-
Schedule a load test against the staging environment
Use k6, JMeter, or Locust against staging — never against prod without a maintenance window. Target 1.5x peak production traffic to leave headroom for the next quarter.
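The 1.5x target translates into load-generator settings with a little arithmetic. The sketch below uses Little's law (concurrent users = throughput x time per iteration) to estimate the virtual-user count; the peak rate, response time, and think time are made-up inputs.

```python
import math


def load_test_target(peak_rps, headroom=1.5):
    """Target request rate: peak production traffic times the headroom factor."""
    return peak_rps * headroom


def virtual_users(target_rps, response_s, think_s):
    """Little's law estimate, assuming each user issues one request per
    (response time + think time) seconds."""
    return math.ceil(target_rps * (response_s + think_s))


# Hypothetical inputs: 400 req/s peak, 0.25 s responses, 1.75 s think time.
target = load_test_target(400)
print(target)                             # 600.0
print(virtual_users(target, 0.25, 1.75))  # 1200
```

The resulting numbers would feed the user count or arrival rate in whichever tool runs the test.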
-
Sign off on the weekly APM review
The final reviewer records the overall posture, any open exceptions, and a digital signature for the audit trail. The signed record serves as SOC 2 CC7.2 evidence of monitoring activities.