Application Performance Monitoring Checklist

Recurring review the IT operations team runs to confirm APM coverage, alert hygiene, and performance posture across infrastructure, application, and user-experience layers. Designed to run weekly with a deeper monthly pass.

6 sections, 26 steps, collects data
1

Infrastructure Monitoring Coverage

  1. Audit monitoring agent coverage across hosts
    • Reconcile the agent inventory in the RMM or monitoring platform (Datadog, NinjaOne, PRTG, LogicMonitor) against the CMDB or hypervisor host list. New VMs spun up outside the gold image are the usual gap: they look healthy in the hypervisor but report no metrics. List any hosts missing an agent.

    Collects paragraph
  2. Verify CPU and memory baselines on critical hosts
    • Spot-check the 7-day baseline on domain controllers, hypervisors, and database hosts. Saturated baselines (sustained CPU above 80%, memory pressure with paging) get flagged for capacity planning, not just alerting.

  3. Review disk space and IOPS on database servers
    • SQL transaction logs, tempdb growth, and snapshot residue are the usual culprits. Confirm any volume above 80% has a remediation owner and ETA.

  4. Check network latency between sites and cloud regions
    • Pull NetFlow or Auvik path data for site-to-site VPN tunnels and cloud peerings. ISP path changes routinely add 30-60ms without notice and only show up when a user complains about Teams.

  5. Validate SNMP and syslog ingestion into the SIEM
    • Confirm switches, firewalls, and APs are still forwarding to Splunk / Sentinel / Graylog. A silent collector is the most common cause of post-incident gaps in the audit trail.
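
The agent coverage audit in step 1 of this section comes down to a set difference between two host lists. A minimal sketch, assuming both lists are available as plain hostname exports; the function names and sample hosts are hypothetical, and the short-name normalization is an assumption that would need adjusting if the same short name exists in more than one domain:

```python
def normalize(hostname: str) -> str:
    """Lowercase and drop the domain so 'DB01.corp.local' matches 'db01'."""
    return hostname.strip().lower().split(".")[0]


def hosts_missing_agent(inventory_hosts, monitored_hosts):
    """Return inventory hosts with no matching monitoring agent, sorted."""
    monitored = {normalize(h) for h in monitored_hosts}
    return sorted({normalize(h) for h in inventory_hosts} - monitored)


# Hypothetical sample data; real lists would come from the CMDB or
# hypervisor export and from the monitoring platform's agent export.
cmdb = ["DC01.corp.local", "db01.corp.local", "hv02", "web07.corp.local"]
agents = ["dc01", "DB01", "hv02.corp.local"]

print(hosts_missing_agent(cmdb, agents))  # ['web07']
```

The resulting list is what the step asks the reviewer to record.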

2

Application Metrics Collection

  1. Verify APM tracing on critical services
    • Confirm New Relic / Datadog APM / AppDynamics / Dynatrace agents are reporting on every tier-1 service. Deployments that change runtime versions silently drop the agent — check the last-seen timestamp against the last release.

  2. Review p95 and p99 endpoint response times
    • Means hide outliers; p95 and p99 catch the long tail users actually feel. Compare each against the published SLO. Flag any endpoint that breached its objective more than three times in the last week.

  3. Audit application error rates by service
    • Pull HTTP 5xx and uncaught exception counts per service. Note the top offender for the optimization phase later in the run.

    Collects text
  4. Inspect slow database queries and lock contention
    • Run the slow-query report from SQL Server / Postgres / MySQL and correlate against APM trace flame graphs. Missing indexes on a recently-added column are the most common new entry.

  5. Confirm third-party API monitoring is healthy
    • Stripe, Twilio, Auth0, and any other payment or identity provider must have an external probe. When a provider's status page lags the real outage, your synthetic check is the only ground truth.
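
To illustrate why step 2 of this section prefers percentiles to means, the sketch below computes nearest-rank percentiles and counts SLO breaches per endpoint. It assumes a breach is one daily p95 reading above the objective, which is an interpretation rather than something the checklist specifies; the function names and sample figures are hypothetical:

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def breached_endpoints(daily_p95_ms, slo_ms, max_breaches=3):
    """Return endpoints whose daily p95 exceeded the SLO more than max_breaches times."""
    flagged = {}
    for endpoint, values in daily_p95_ms.items():
        breaches = sum(1 for v in values if v > slo_ms[endpoint])
        if breaches > max_breaches:
            flagged[endpoint] = breaches
    return flagged


# Hypothetical latencies in milliseconds: 18 fast requests and 2 slow ones.
latencies = [100] * 18 + [900, 1200]
print(sum(latencies) / len(latencies))  # 195.0 (the mean looks healthy)
print(percentile(latencies, 95))        # 900
print(percentile(latencies, 99))        # 1200

# Hypothetical week of daily p95 readings against a 300 ms objective.
week = {"/checkout": [310, 320, 290, 330, 305, 340, 280]}
print(breached_endpoints(week, {"/checkout": 300}))  # {'/checkout': 5}
```

Most APM tools expose percentiles directly; the point of the sketch is only that a 195 ms mean can coexist with a 900 ms p95.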

3

User Experience Monitoring

  1. Verify RUM coverage on production front-end
    • Confirm the RUM snippet (Datadog RUM, New Relic Browser, Sentry) is loading on every public route. CSP changes and tag-manager edits are the usual reason coverage silently drops.

  2. Review Core Web Vitals against thresholds
    • LCP under 2.5s, INP under 200ms, CLS under 0.1 at the 75th percentile. Flag pages where the rolling 7-day score regressed from "good" to "needs improvement."

  3. Check synthetic monitor pass rate
    • Pingdom / Site24x7 / Datadog Synthetics scripted journeys (login, search, checkout) should sit above 99.5% pass for the week. Flaky logins from a single region usually mean the script — not the app — needs maintenance.

  4. Sample session replays for friction signals
    • Pull five sessions tagged with rage-clicks or dead-clicks from FullStory / LogRocket / Hotjar. These usually map to a UI regression that error-rate dashboards miss because the front-end didn't throw.
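
The Core Web Vitals check in step 2 of this section can be sketched as a 75th-percentile classification. The "good" bounds are the ones quoted in the step; the "needs improvement" upper bounds (4000 ms, 500 ms, 0.25) are the commonly published Core Web Vitals values and are not stated in this checklist. Function names and sample values are hypothetical:

```python
import math

# (good upper bound, needs-improvement upper bound).
# LCP and INP are in milliseconds; CLS is unitless.
THRESHOLDS = {"LCP": (2500, 4000), "INP": (200, 500), "CLS": (0.1, 0.25)}


def p75(samples):
    """Nearest-rank 75th percentile of a non-empty list of samples."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    return ordered[math.ceil(75 * len(ordered) / 100) - 1]


def rate(metric, samples):
    """Classify a metric as 'good', 'needs improvement', or 'poor' at p75."""
    good, needs_improvement = THRESHOLDS[metric]
    value = p75(samples)
    if value <= good:
        return "good"
    if value <= needs_improvement:
        return "needs improvement"
    return "poor"


def regressed(metric, previous_samples, current_samples):
    """True when a page moved from 'good' to 'needs improvement' between windows."""
    return (rate(metric, previous_samples) == "good"
            and rate(metric, current_samples) == "needs improvement")


# Hypothetical INP samples (ms) for two consecutive 7-day windows.
print(regressed("INP", [150, 180, 190, 210], [150, 220, 260, 480]))  # True
```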

4

Alerting and Notifications

  1. Audit alert thresholds against current baselines
    • Thresholds set 18 months ago against 2-vCPU hosts no longer fit a 16-vCPU upgrade. Compare each P1/P2 threshold to the current 7-day baseline; mark stale ones for adjustment.

    Collects list
  2. Update stale thresholds in the monitoring tool
    • Adjust the flagged thresholds in Datadog / PRTG / LogicMonitor. Open a change ticket with before/after values so the next reviewer can audit the rationale.

  3. Verify PagerDuty rotation and override coverage
    • Confirm the on-call schedule for the next two weeks has no gaps and that overrides for PTO are in place. The Friday-before-a-long-weekend gap is the classic miss.

  4. Test alert delivery to email, SMS, and Slack
    • Fire a synthetic test alert from each integration. Slack webhooks can be revoked or stop delivering without notice; SMS gateway changes break Twilio routing without warning.

  5. Review alert fatigue and noisy-rule list
    • Pull the top 10 alerts by volume for the week. Anything firing more than 50 times without a corresponding ticket is a candidate for tuning, not a real signal.
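
The noisy-rule review in step 5 of this section can be sketched as a count-and-filter pass over the week's alert events. This assumes the events can be exported as records with a rule name and that the set of rules with a linked ticket is known; the field name `rule`, the function name, and the sample data are hypothetical:

```python
from collections import Counter


def noisy_rules(alert_events, ticketed_rules, top_n=10, min_firings=50):
    """Return (rule, count) pairs from the top_n rules by volume that fired
    more than min_firings times and have no corresponding ticket."""
    counts = Counter(event["rule"] for event in alert_events)
    return [
        (rule, fired)
        for rule, fired in counts.most_common(top_n)
        if fired > min_firings and rule not in ticketed_rules
    ]


# Hypothetical week of alert events.
events = (
    [{"rule": "disk-latency-warn"}] * 120
    + [{"rule": "cpu-high"}] * 60
    + [{"rule": "cert-expiry"}] * 3
)

print(noisy_rules(events, ticketed_rules={"cpu-high"}))
# [('disk-latency-warn', 120)]
```

Here `cpu-high` is excluded because it has a ticket, and `cert-expiry` is excluded because it fired only three times.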

5

Reporting and Dashboards

  1. Refresh the executive performance dashboard
    • Update SLO burn-down, p95 latency, error budget, and uptime tiles. Broken queries from a renamed metric are the usual reason a tile shows "no data" the morning of a leadership review.

  2. Compile the weekly performance report
    • Export the SLO summary, top 5 incidents, and capacity trend into the standard report template. Attach the PDF to this step for the audit trail.

    Collects file
  3. Distribute the report to stakeholders
    • Send to the IT leadership distribution list and the affected service owners. For MSP runs, attach to the client's QBR folder in IT Glue or Hudu.
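
For the error budget tile in step 1 of this section, the underlying arithmetic can be sketched as follows. This assumes a request-based availability SLO, which the checklist does not specify; the function name, the 99.9% target, and the request counts are hypothetical:

```python
def error_budget(slo_target, total_requests, failed_requests):
    """Return (allowed_failures, fraction_of_budget_consumed) for a request-based SLO."""
    allowed = (1 - slo_target) * total_requests
    consumed = failed_requests / allowed if allowed else float("inf")
    return allowed, consumed


# Hypothetical week: 2,000,000 requests, 1,500 failures, 99.9% objective.
allowed, consumed = error_budget(0.999, total_requests=2_000_000, failed_requests=1_500)
print(round(allowed), f"{consumed:.0%}")  # 2000 75%
```

A tile showing 75% consumed with days still left in the window is the kind of figure worth calling out in the weekly report.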

6

Performance Optimization Review

  1. Identify the week's top performance bottleneck
    • Cross-reference APM flame graphs, slow-query logs, and synthetic regressions to single out the largest user-impact bottleneck. If nothing material surfaced, mark No and skip to load-test scheduling.

    Collects list
  2. Open a remediation ticket with the service owner
    • File the ticket in Jira / ServiceNow / ConnectWise PSA with the trace ID, affected endpoint, and proposed mitigation (index, cache, or query rewrite). Assign to the owning engineering team, not back to ops.

  3. Schedule a load test against the staging environment
    • Use k6, JMeter, or Locust against staging — never against prod without a maintenance window. Target 1.5x peak production traffic to leave headroom for the next quarter.

  4. Sign off on the weekly APM review
    • Final reviewer captures the overall posture, any open exceptions, and a digital signature for the audit trail. Required for SOC 2 CC7.2 evidence on monitoring activities.

    Collects list Collects paragraph Collects signature
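
The 1.5x target in step 3 of this section can be turned into a ramp plan before it is entered into k6, JMeter, or Locust. A minimal sketch: the stepped ramp, the stage length, and the function name are assumptions for illustration, not part of the checklist, and the output is a plain list rather than any tool's native configuration format:

```python
def load_test_stages(peak_rps, headroom=1.5, steps=4, step_minutes=5):
    """Split a ramp to peak_rps * headroom into equal steps."""
    target = peak_rps * headroom
    return [
        {"minutes": step_minutes, "target_rps": round(target * i / steps)}
        for i in range(1, steps + 1)
    ]


# Hypothetical peak of 800 requests per second observed in production.
for stage in load_test_stages(peak_rps=800):
    print(stage)
# {'minutes': 5, 'target_rps': 300}
# {'minutes': 5, 'target_rps': 600}
# {'minutes': 5, 'target_rps': 900}
# {'minutes': 5, 'target_rps': 1200}
```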

Category Systems Administration