Performance Monitoring Checklist
Recurring monitoring workflow for SRE and platform teams covering host, application, database, and network telemetry plus security and compliance checks. Run weekly to catch SLO regressions, capacity issues, and drift before customers do.
System and Host Telemetry
-
Verify log shipping to the central pipeline
Confirm Fluent Bit / Vector / CloudWatch agents on every node are forwarding to Loki, Splunk, or Datadog without dropped batches. Common gotcha: a node that rotated its IAM role 30 days ago and silently lost write permission on the log group.
-
Review CPU, memory, and disk saturation
Pull the USE-method dashboard for each fleet — utilization, saturation, errors. Flag any host above 80% sustained CPU, above 85% memory working set, or above 75% disk used. Disk is the silent killer; full root volumes wedge kubelet.
-
Reconcile baseline metrics against last week
Compare p50/p95/p99 latency, request rate, and error rate to the prior 7-day window. A 20%+ shift without a known release is the signal — investigate before it becomes a SEV.
-
File a capacity ticket for the drifted service
Open a Linear / Jira ticket against the owning team with the dashboard link, the drift magnitude, and the suspected contributing factor (release, traffic shift, dependency upgrade). Tag the on-call SRE and the service owner.
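The week-over-week comparison in the baseline step can be sketched as a small drift check. This is a minimal illustration, not a prescribed implementation: the `WeeklyStats` fields and function names are hypothetical, and the aggregates are assumed to have been exported from whichever metrics backend is in use. It applies the 20% threshold from the step above in both directions, because a sharp drop in request rate is as suspicious as a jump in latency.

```python
from dataclasses import dataclass

DRIFT_THRESHOLD = 0.20  # the checklist's 20% week-over-week signal


@dataclass(frozen=True)
class WeeklyStats:
    """Aggregates for one 7-day window (field names are illustrative)."""
    p50_ms: float
    p95_ms: float
    p99_ms: float
    request_rate: float  # requests per second
    error_rate: float    # fraction of requests that failed


def relative_shift(current: float, baseline: float) -> float:
    """Signed relative change of current versus the baseline window."""
    if baseline == 0:
        # No meaningful ratio; treat any nonzero value as unbounded drift.
        return 0.0 if current == 0 else float("inf")
    return (current - baseline) / baseline


def drifted_metrics(current: WeeklyStats, baseline: WeeklyStats,
                    threshold: float = DRIFT_THRESHOLD) -> dict[str, float]:
    """Return each metric whose absolute shift meets or exceeds the threshold."""
    shifts = {}
    for name in ("p50_ms", "p95_ms", "p99_ms", "request_rate", "error_rate"):
        shift = relative_shift(getattr(current, name), getattr(baseline, name))
        if abs(shift) >= threshold:
            shifts[name] = shift
    return shifts
```

A non-empty result with no matching release in the deploy log is the case that warrants the capacity ticket; the returned shift values can be pasted in as the drift magnitude.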
Application Performance (APM)
-
Inspect p95 and p99 latency by endpoint
Open the Datadog / New Relic / Honeycomb APM service map. Sort by p99 descending; the worst offenders usually concentrate on a single downstream span (DB call, third-party API, cache miss). Note any endpoint over its SLO.
-
Triage Sentry errors above threshold
Filter to issues with >100 events/week or affecting >50 users. Assign each to an owner via CODEOWNERS or service catalog. Stale unassigned errors are the single biggest reason teams stop trusting Sentry.
-
Confirm SLO error budget remaining
Pull the 30-day rolling SLO burn for each tier-1 service. If burn rate exceeds 2x for the trailing 24 hours, freeze non-critical deploys to that service until burn returns to budget.
-
Audit alert thresholds and noisy pages
In PagerDuty / Opsgenie, list alerts that paged more than 3 times this week without an action item. Re-tune or move to ticket-only — alert fatigue is how real SEVs get missed at 3am.
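The error-budget step above relies on two numbers: the burn rate over the trailing 24 hours and the budget remaining over the 30-day window. A minimal sketch of the arithmetic follows, assuming a request-based SLO where the only inputs are failed and total request counts. The function names are hypothetical, and the 2x freeze threshold is taken from the step above. Burn rate here is the observed error rate divided by the error rate the SLO allows, so a value of 1.0 means the budget is being spent exactly on schedule.

```python
def burn_rate(failed: int, total: int, slo_target: float) -> float:
    """Observed error rate divided by the error rate the SLO allows."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    if total == 0:
        return 0.0  # no traffic, so no budget consumed
    allowed_error_rate = 1.0 - slo_target
    return (failed / total) / allowed_error_rate


def budget_remaining(failed_30d: int, total_30d: int, slo_target: float) -> float:
    """Fraction of the 30-day error budget still unspent (negative if overspent)."""
    return 1.0 - burn_rate(failed_30d, total_30d, slo_target)


def should_freeze_deploys(failed_24h: int, total_24h: int, slo_target: float,
                          max_burn: float = 2.0) -> bool:
    """True when the trailing-24-hour burn rate exceeds the freeze threshold."""
    return burn_rate(failed_24h, total_24h, slo_target) > max_burn
```

For a 99.9% SLO, 30 failures in 10,000 requests is a 0.3% error rate against a 0.1% allowance, which gives a burn rate of about 3 and trips the freeze.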
Database Performance
-
Review slow query log on RDS
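Pull pg_stat_statements (Postgres) or Performance Insights (RDS) for the top 20 queries by total time. Flag any query above 500ms p95 or any sequential scan on a table over 1M rows. pg_stat_statements reports mean and max execution time rather than percentiles, so take the p95 from APM traces or use mean execution time as a proxy.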
-
Check replication lag and connection pool
Replica lag above 30 seconds on a read-replica fleet means stale reads in production. PgBouncer / RDS Proxy connection saturation above 80% means the next traffic spike causes timeouts.
-
Review index bloat and unused indexes
Indexes with zero scans over 30 days waste write IO and storage. Run pgstattuple on the largest tables and pgstatindex (from the same extension) on their B-tree indexes; schedule REINDEX CONCURRENTLY (Postgres 12+) during the next maintenance window for any index over 30% bloat.
-
Verify nightly backup and snapshot age
Confirm the most recent automated snapshot is under 24 hours old and matches retention policy. Backup green is not the same as restore green — log the next quarterly restore drill date.
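The snapshot-age check above reduces to a timestamp comparison once the latest snapshot time per database is known. The sketch below assumes those timestamps have already been collected as timezone-aware datetimes, for example from the `SnapshotCreateTime` field of the RDS snapshot listing. The function name and input shape are hypothetical.

```python
from datetime import datetime, timedelta, timezone

MAX_SNAPSHOT_AGE = timedelta(hours=24)  # the checklist's freshness requirement


def stale_snapshots(latest_by_db: dict[str, datetime], now: datetime,
                    max_age: timedelta = MAX_SNAPSHOT_AGE) -> dict[str, timedelta]:
    """Map each database whose newest snapshot is max_age old or older to that age.

    Both `now` and the snapshot timestamps must be timezone-aware;
    mixing naive and aware datetimes raises TypeError.
    """
    stale = {}
    for db, created_at in latest_by_db.items():
        age = now - created_at
        if age >= max_age:
            stale[db] = age
    return stale


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    latest = {
        "orders": now - timedelta(hours=6),    # illustrative values only
        "billing": now - timedelta(hours=30),
    }
    for db, age in stale_snapshots(latest, now).items():
        print(f"{db}: latest snapshot is {age} old")
```

This only covers the "backup green" half of the step. It says nothing about whether the snapshot can actually be restored.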
Network and Edge
-
Check ALB and CloudFront error rates
5xx rate on the public ALB above 0.1% sustained is a signal. CloudFront cache hit ratio below the per-distribution baseline usually means a deploy invalidated the cache or someone added a Vary header.
-
Inspect VPC flow logs for anomalies
Look for unexpected egress to non-allowlisted destinations, NAT gateway saturation, or REJECT spikes between subnets. Most NAT bill surprises trace back to a single chatty pod.
-
Confirm TLS certificates are not near expiry
List ACM and cert-manager certificates with under 30 days remaining. ACM auto-renewal can fail silently if DNS validation records were removed; check the renewal eligibility status, not just expiry.
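The ALB check above hinges on the word "sustained", which the checklist does not define. One way to make it concrete, sketched here under the assumption that it means a fixed number of consecutive measurement intervals over the 0.1% threshold, is shown below. The function names and the default of three intervals are hypothetical choices, not part of the checklist.

```python
from typing import Iterable

ALB_5XX_THRESHOLD = 0.001  # 0.1%, from the checklist


def error_rate(count_5xx: int, total_requests: int) -> float:
    """Fraction of requests that returned a 5xx in one interval."""
    return count_5xx / total_requests if total_requests else 0.0


def sustained_breach(rates: Iterable[float],
                     threshold: float = ALB_5XX_THRESHOLD,
                     min_consecutive: int = 3) -> bool:
    """True if `rates` holds at least `min_consecutive` back-to-back
    intervals strictly above `threshold`.

    `min_consecutive` is an assumed definition of "sustained".
    """
    run = 0
    for rate in rates:
        run = run + 1 if rate > threshold else 0
        if run >= min_consecutive:
            return True
    return False
```

Requiring a consecutive run keeps a single noisy interval from raising a flag, at the cost of reacting a few intervals late. The interval length and run length are worth tuning per load balancer.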
Infrastructure Health
-
Review Kubernetes pod restarts and OOMKills
Pull kube_pod_container_status_restarts_total over 7 days. Any deployment with >5 restarts or recurring OOMKilled containers needs a memory-limit or leak investigation before next release.
-
Check node autoscaling and cost trend
Confirm cluster-autoscaler / Karpenter is scaling down idle nodes. A 15% week-over-week cost increase without a traffic increase is usually a stuck HPA or a runaway batch job.
-
Validate the quarterly restore drill is on the calendar
Backup runs nightly; restore is the only proof the backup works. Confirm the next drill is scheduled into a non-prod environment with a named owner. SOC 2 CC7.5 evidence depends on this.
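The pod-restart review above can be reduced to a small roll-up once the 7-day restart increase per container is available, for instance from a PromQL query such as `increase(kube_pod_container_status_restarts_total[7d])`. The sketch below assumes that data has already been fetched and joined with an OOMKilled flag per container. The `ContainerSample` type and function name are hypothetical, and any single OOMKilled container is flagged, which is stricter than the "recurring" wording in the step.

```python
from collections import defaultdict
from typing import Iterable, NamedTuple


class ContainerSample(NamedTuple):
    """One container's 7-day figures (shape is illustrative)."""
    deployment: str
    restarts_7d: float  # increase in the restart counter over the window
    oom_killed: bool    # last termination reason was OOMKilled


def deployments_to_investigate(samples: Iterable[ContainerSample],
                               max_restarts: int = 5) -> dict[str, str]:
    """Map each flagged deployment to a short reason string."""
    restarts: dict[str, float] = defaultdict(float)
    oom: set[str] = set()
    for sample in samples:
        restarts[sample.deployment] += sample.restarts_7d
        if sample.oom_killed:
            oom.add(sample.deployment)

    flagged = {}
    for deployment, total in restarts.items():
        reasons = []
        if total > max_restarts:
            reasons.append(f"{total:.0f} restarts")
        if deployment in oom:
            reasons.append("OOMKilled")
        if reasons:
            flagged[deployment] = ", ".join(reasons)
    return flagged
```

Summing per deployment rather than per pod means a crash loop spread across many replicas still crosses the threshold.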
Security and Compliance
-
Triage critical and high CVEs from Snyk
Filter Snyk / Dependabot to CVSS >= 7.0 in production-deployed services. Patch versions auto-merge; majors get a ticket with an owner and a 30-day SLA. Don't let the queue grow past 20.
-
Review GuardDuty and IDS findings
Walk the past week of GuardDuty / Wazuh findings. Suppress documented false positives in the platform, not in your head — undocumented suppressions fail the next audit.
-
Reconcile production access against SSO directory
Pull AWS console users, GitHub org members, and Kubernetes RBAC bindings; cross-check against active employees in Okta / Google Workspace. Departing-engineer access is the most common offboarding gap auditors flag under SOC 2 CC6.
-
Sign off on the weekly monitoring review
SRE lead signs off after confirming all action items have owners and tickets. This is the artifact for SOC 2 CC7.2 (continuous monitoring) evidence.
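The access reconciliation step above is essentially a set difference between who holds access and who is active in the directory. A minimal sketch follows, assuming each system's grants have been exported as a set of email-style identities and the directory as a set of active users. The function name and input shapes are hypothetical, and matching is case-insensitive because different systems often store the same address with different casing.

```python
def orphaned_access(grants: dict[str, set[str]],
                    active_users: set[str]) -> dict[str, set[str]]:
    """Map each system to the identities that hold access there
    but are not present in the active directory set."""
    active = {user.lower() for user in active_users}
    orphans_by_system = {}
    for system, identities in grants.items():
        orphans = {i for i in identities if i.lower() not in active}
        if orphans:
            orphans_by_system[system] = orphans
    return orphans_by_system


if __name__ == "__main__":
    grants = {  # illustrative data only
        "aws": {"ana@example.com", "bo@example.com"},
        "github": {"Ana@example.com"},
        "k8s": {"cy@example.com"},
    }
    active = {"ana@example.com", "cy@example.com"}
    print(orphaned_access(grants, active))
```

Service accounts and break-glass identities will show up as orphans under this approach, so a documented allowlist is likely needed before treating the output as audit evidence.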