Performance Monitoring Checklist

Recurring monitoring workflow for SRE and platform teams covering host, application, database, and network telemetry plus security and compliance checks. Run weekly to catch SLO regressions, capacity issues, and drift before customers do.

6 sections, 22 steps

System and Host Telemetry

Verify log shipping to the central pipeline
- Confirm Fluent Bit / Vector / CloudWatch agents on every node are forwarding to Loki, Splunk, or Datadog without dropped batches. Common gotcha: a node that rotated its IAM role 30 days ago and silently lost write permission on the log group.
Review CPU, memory, and disk saturation
- Pull the USE-method dashboard for each fleet — utilization, saturation, errors. Flag any host above 80% sustained CPU, above 85% memory working set, or above 75% disk used. Disk is the silent killer; full root volumes wedge kubelet.
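The thresholds in this step can be encoded as a small check, sketched below. The HostSample type, its field names, and the sample fleet are hypothetical; in practice the values would come from whatever backs the USE dashboard.

```python
from dataclasses import dataclass

# Thresholds from the step above, expressed as percentages.
CPU_MAX = 80.0
MEM_MAX = 85.0
DISK_MAX = 75.0


@dataclass
class HostSample:
    # Hypothetical record; field names are illustrative only.
    name: str
    cpu_pct: float   # sustained CPU utilization over the review window
    mem_pct: float   # memory working set as a share of total memory
    disk_pct: float  # used space on the fullest volume


def saturation_flags(host: HostSample) -> list[str]:
    """Return the resources that exceed their threshold for one host."""
    flags = []
    if host.cpu_pct > CPU_MAX:
        flags.append("cpu")
    if host.mem_pct > MEM_MAX:
        flags.append("memory")
    if host.disk_pct > DISK_MAX:
        flags.append("disk")
    return flags


fleet = [
    HostSample("web-1", cpu_pct=42.0, mem_pct=61.0, disk_pct=78.5),
    HostSample("web-2", cpu_pct=91.0, mem_pct=88.0, disk_pct=40.0),
    HostSample("web-3", cpu_pct=35.0, mem_pct=50.0, disk_pct=30.0),
]

# Keep only hosts with at least one flagged resource.
flagged = {h.name: saturation_flags(h) for h in fleet if saturation_flags(h)}
print(flagged)
```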
Reconcile baseline metrics against last week
- Compare p50/p95/p99 latency, request rate, and error rate to the prior 7-day window. A 20%+ shift without a known release is the signal — investigate before it becomes a SEV.
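One way to make the 20% rule mechanical is to compute the relative shift of each metric against last week's value. This is a minimal sketch; the metric names and sample numbers are made up for illustration, and it treats a shift in either direction as drift.

```python
def relative_shift(current: float, baseline: float) -> float:
    """Fractional change of current versus baseline (0.25 means +25%)."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (current - baseline) / baseline


def drifted_metrics(current: dict, baseline: dict, threshold: float = 0.20) -> dict:
    """Return the metrics whose absolute shift meets or exceeds the threshold."""
    shifts = {name: relative_shift(current[name], baseline[name]) for name in baseline}
    return {name: shift for name, shift in shifts.items() if abs(shift) >= threshold}


# Hypothetical weekly aggregates for one service.
last_week = {"p50_ms": 40.0, "p95_ms": 120.0, "p99_ms": 300.0, "rps": 1500.0, "error_rate": 0.004}
this_week = {"p50_ms": 41.0, "p95_ms": 150.0, "p99_ms": 310.0, "rps": 1480.0, "error_rate": 0.004}

drift = drifted_metrics(this_week, last_week)
print(drift)
```

Here only p95 latency moved by 20% or more (120 ms to 150 ms is a 25% increase), so it is the one metric to check against the release log.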
File a capacity ticket for the drifted service
- Open a Linear / Jira ticket against the owning team with the dashboard link, the drift magnitude, and the suspected contributing factor (release, traffic shift, dependency upgrade). Tag the on-call SRE and the service owner.

Application Performance (APM)

Inspect p95 and p99 latency by endpoint
- Open the Datadog / New Relic / Honeycomb APM service map. Sort by p99 descending; the worst offenders usually concentrate on a single downstream span (DB call, third-party API, cache miss). Note any endpoint over its SLO.
Triage Sentry errors above threshold
- Filter to issues with >100 events/week or affecting >50 users. Assign each to an owner via CODEOWNERS or service catalog. Stale unassigned errors are the single biggest reason teams stop trusting Sentry.
Confirm SLO error budget remaining
- Pull the 30-day rolling SLO burn for each tier-1 service. If burn rate exceeds 2x for the trailing 24 hours, freeze non-critical deploys to that service until burn returns to budget.
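Burn rate is commonly defined as the observed error rate divided by the error budget (1 minus the SLO target), so 1x spends the budget exactly over the SLO window. The sketch below assumes that definition and a request-based SLO; the function names and sample counts are illustrative.

```python
def burn_rate(bad_events: int, total_events: int, slo_target: float) -> float:
    """Observed error rate divided by the error budget (1 - slo_target)."""
    if total_events <= 0:
        raise ValueError("total_events must be positive")
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return (bad_events / total_events) / budget


def should_freeze(rate: float, limit: float = 2.0) -> bool:
    """True when the trailing-24h burn rate exceeds the 2x limit from the step above."""
    return rate > limit


# Hypothetical trailing-24h counts for a tier-1 service with a 99.9% SLO.
rate = burn_rate(bad_events=300, total_events=100_000, slo_target=0.999)
print(round(rate, 2), should_freeze(rate))
```

With a 0.1% budget, 300 failures in 100,000 requests is roughly a 3x burn, which is above the 2x freeze line.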
Audit alert thresholds and noisy pages
- In PagerDuty / Opsgenie, list alerts that paged more than 3 times this week without an action item. Re-tune or move to ticket-only — alert fatigue is how real SEVs get missed at 3am.
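If the week's pages can be exported as simple records, the noisy ones fall out of a count. The sketch below assumes a hypothetical export format of (alert name, whether an action item was filed) pairs; it does not call the PagerDuty or Opsgenie APIs.

```python
from collections import Counter


def noisy_alerts(pages: list, max_pages: int = 3) -> list:
    """Alerts that paged more than max_pages times with no action item filed.

    pages is a list of (alert_name, has_action_item) tuples for one week.
    """
    counts = Counter(name for name, _ in pages)
    actioned = {name for name, has_action in pages if has_action}
    return sorted(
        name for name, n in counts.items() if n > max_pages and name not in actioned
    )


# Hypothetical week of pages.
pages = (
    [("disk-full", False)] * 4
    + [("cpu-high", False)] * 2
    + [("api-5xx", True)] * 5
)
print(noisy_alerts(pages))
```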

Database Performance

Review slow query log on RDS
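- Pull pg_stat_statements (Postgres) or Performance Insights (RDS) for the top 20 queries by total time. Flag any query above 500ms p95 or any sequential scan on a table over 1M rows. pg_stat_statements reports mean, min, max, and standard deviation rather than percentiles, so use mean execution time as a proxy when p95 is not available from APM traces.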
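A sketch of the pull and the flagging step follows. The SQL uses the PostgreSQL 13+ column names (older versions call them total_time and mean_time). Because pg_stat_statements does not expose percentiles, the filter uses mean_exec_time as a stand-in for p95, which is an assumption of this example. The rows shown are invented, and running the query is left to whichever database client you already use.

```python
# Top 20 statements by cumulative execution time (PostgreSQL 13+ column names).
TOP_QUERIES_SQL = """
SELECT queryid, calls, total_exec_time, mean_exec_time, query
FROM pg_stat_statements
ORDER BY total_exec_time DESC
LIMIT 20;
"""


def slow_queries(rows: list, threshold_ms: float = 500.0) -> list:
    """Rows whose mean execution time (ms) exceeds the threshold.

    Mean is used as a proxy because percentiles are not available here.
    """
    return [row for row in rows if row["mean_exec_time"] > threshold_ms]


# Hypothetical result rows, shaped like the SELECT above.
rows = [
    {"queryid": 1, "calls": 90_000, "total_exec_time": 540_000.0, "mean_exec_time": 6.0,
     "query": "SELECT ... FROM orders WHERE id = $1"},
    {"queryid": 2, "calls": 400, "total_exec_time": 320_000.0, "mean_exec_time": 800.0,
     "query": "SELECT ... FROM events WHERE payload LIKE $1"},
]

slow = slow_queries(rows)
print([row["queryid"] for row in slow])
```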
Check replication lag and connection pool
- Replica lag above 30 seconds on a read-replica fleet means stale reads in production. PgBouncer / RDS Proxy connection saturation above 80% means the next traffic spike causes timeouts.
Review index bloat and unused indexes
- Indexes with zero scans over 30 days waste write IO and storage. Run pgstattuple on the largest tables; schedule REINDEX CONCURRENTLY during the next maintenance window for any index over 30% bloat.
Verify nightly backup and snapshot age
- Confirm the most recent automated snapshot is under 24 hours old and matches retention policy. Backup green is not the same as restore green — log the next quarterly restore drill date.

Network and Edge

Check ALB and CloudFront error rates
- 5xx rate on the public ALB above 0.1% sustained is a signal. CloudFront cache hit ratio below the per-distribution baseline usually means a deploy invalidated the cache or someone added a Vary header.
Inspect VPC flow logs for anomalies
- Look for unexpected egress to non-allowlisted destinations, NAT gateway saturation, or REJECT spikes between subnets. Most NAT bill surprises trace back to a single chatty pod.
Confirm TLS certificates are not near expiry
- List ACM and cert-manager certificates with under 30 days remaining. ACM auto-renewal can fail silently if DNS validation records were removed; check the renewal eligibility status, not just expiry.
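Once the expiry dates are exported from ACM and cert-manager, the 30-day filter is a date comparison. The sketch below assumes a hypothetical mapping of certificate name to expiry timestamp; it does not query either system and does not cover the renewal eligibility check.

```python
from datetime import datetime, timedelta, timezone


def expiring_soon(certs: dict, now: datetime, window_days: int = 30) -> list:
    """Names of certificates with fewer than window_days remaining.

    Certificates that have already expired are included as well.
    """
    cutoff = now + timedelta(days=window_days)
    return sorted(name for name, not_after in certs.items() if not_after < cutoff)


# Hypothetical inventory with a fixed "now" so the result is reproducible.
now = datetime(2024, 1, 1, tzinfo=timezone.utc)
certs = {
    "api.example.com": now + timedelta(days=10),
    "www.example.com": now + timedelta(days=90),
    "old.example.com": now - timedelta(days=1),
}
print(expiring_soon(certs, now))
```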

Infrastructure Health

Review Kubernetes pod restarts and OOMKills
- Pull kube_pod_container_status_restarts_total over 7 days. Any deployment with >5 restarts or recurring OOMKilled containers needs a memory-limit or leak investigation before next release.
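Because kube_pod_container_status_restarts_total is a counter, the number to compare against 5 is its increase over the window, not its latest value. The sketch below is a simplified version of what a counter-increase calculation does: it treats any drop as a reset to zero and does no extrapolation. Grouping samples by deployment name is an assumption of this example, and the sample series are invented.

```python
def counter_increase(samples: list) -> int:
    """Total increase across time-ordered counter samples.

    A drop between two samples is treated as a counter reset to zero.
    """
    total = 0
    for prev, cur in zip(samples, samples[1:]):
        total += cur - prev if cur >= prev else cur
    return total


def restart_offenders(series: dict, max_restarts: int = 5) -> list:
    """Deployments whose restart count grew by more than max_restarts."""
    return sorted(
        name for name, samples in series.items()
        if counter_increase(samples) > max_restarts
    )


# Hypothetical 7-day samples per deployment.
series = {
    "checkout": [0, 1, 3, 0, 2, 4],  # includes one counter reset
    "search": [7, 7, 8],
}
print(restart_offenders(series))
```

Note that search has the higher raw counter value but only one restart in the window, while checkout accumulated seven across a reset.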
Check node autoscaling and cost trend
- Confirm cluster-autoscaler / Karpenter is scaling down idle nodes. A 15% week-over-week cost increase without a traffic increase is usually a stuck HPA or a runaway batch job.
Validate the quarterly restore drill is on the calendar
- Backup runs nightly; restore is the only proof the backup works. Confirm the next drill is scheduled into a non-prod environment with a named owner. SOC 2 CC7.5 evidence depends on this.

Security and Compliance

Triage critical and high CVEs from Snyk
- Filter Snyk / Dependabot to CVSS >= 7.0 in production-deployed services. Patch versions auto-merge; majors get a ticket with an owner and a 30-day SLA. Don't let the queue grow past 20.
Review GuardDuty and IDS findings
- Walk the past week of GuardDuty / Wazuh findings. Suppress documented false positives in the platform, not in your head — undocumented suppressions fail the next audit.
Reconcile production access against SSO directory
- Pull AWS console, GitHub org, and kubectl RBAC bindings; cross-check against active employees in Okta / Google Workspace. Departing-engineer access is the most common offboarding gap auditors flag under SOC 2 CC6.
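The cross-check reduces to a set difference per system. The sketch below assumes each system's access list and the directory export have already been reduced to sets of email addresses, which is a simplification; the system names and addresses are hypothetical.

```python
def orphaned_access(bindings: dict, active: set) -> dict:
    """Per system, the identities with access that are not active employees.

    Comparison is case-insensitive; systems with no orphans are omitted.
    """
    active_lower = {user.lower() for user in active}
    result = {}
    for system, users in bindings.items():
        extra = {user.lower() for user in users} - active_lower
        if extra:
            result[system] = extra
    return result


# Hypothetical exports from each system and from the SSO directory.
bindings = {
    "aws": {"ana@example.com", "raj@example.com"},
    "github": {"Ana@example.com"},
    "k8s": {"lee@example.com"},
}
active = {"ana@example.com", "lee@example.com"}

print(orphaned_access(bindings, active))
```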
Sign off on the weekly monitoring review
- SRE lead signs off after confirming all action items have owners and tickets. This is the artifact for SOC 2 CC7.2 (continuous monitoring) evidence.
