Performance Monitoring Checklist

System and Host Telemetry

    Confirm Fluent Bit / Vector / CloudWatch agents on every node are forwarding to Loki, Splunk, or Datadog without dropped batches. Common gotcha: a node that rotated its IAM role 30 days ago and silently lost write permission on the log group.

    Pull the USE-method dashboard for each fleet — utilization, saturation, errors. Flag any host above 80% sustained CPU, above 85% memory working set, or above 75% disk used. Disk is the silent killer; full root volumes wedge kubelet.
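
The thresholds in the item above can be applied mechanically once utilization figures are exported. This is a minimal sketch, assuming a hypothetical `HostSample` record whose fields you would map from your own metrics backend; only the three limits come from the checklist.

```python
from dataclasses import dataclass

# Thresholds from the checklist item, expressed as fractions of capacity.
CPU_LIMIT = 0.80
MEMORY_LIMIT = 0.85
DISK_LIMIT = 0.75


@dataclass
class HostSample:
    # Hypothetical record shape; not tied to any particular metrics API.
    name: str
    cpu_sustained: float       # sustained CPU utilization, 0.0-1.0
    memory_working_set: float  # memory working set, 0.0-1.0
    disk_used: float           # fraction of disk used, 0.0-1.0


def flag_host(host: HostSample) -> list[str]:
    """Return the resources on which this host exceeds its threshold."""
    breaches = []
    if host.cpu_sustained > CPU_LIMIT:
        breaches.append("cpu")
    if host.memory_working_set > MEMORY_LIMIT:
        breaches.append("memory")
    if host.disk_used > DISK_LIMIT:
        breaches.append("disk")
    return breaches
```

Returning the list of breached resources rather than a boolean keeps the report specific enough to paste into a ticket.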

    Compare p50/p95/p99 latency, request rate, and error rate to the prior 7-day window. A 20%+ shift without a known release is the signal — investigate before it becomes a SEV.
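
The week-over-week comparison above comes down to a relative-change check per metric. A rough sketch follows, assuming both windows have already been aggregated into plain dictionaries; the metric names and the `drifted_metrics` helper are illustrative, not part of any monitoring product.

```python
def relative_shift(current: float, baseline: float) -> float:
    """Fractional change of current versus baseline (0.25 means +25%)."""
    if baseline == 0:
        # No baseline to divide by: treat any nonzero value as unbounded drift.
        return float("inf") if current != 0 else 0.0
    return (current - baseline) / baseline


def drifted_metrics(current: dict[str, float],
                    baseline: dict[str, float],
                    threshold: float = 0.20) -> dict[str, float]:
    """Return metrics whose shift is at least `threshold` in either direction."""
    drift = {}
    for name, value in current.items():
        if name not in baseline:
            continue  # nothing to compare against
        shift = relative_shift(value, baseline[name])
        if abs(shift) >= threshold:
            drift[name] = shift
    return drift
```

The check is two-sided on purpose: a sharp drop in request rate is as worth investigating as a latency increase.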

    Open a Linear / Jira ticket against the owning team with the dashboard link, the drift magnitude, and the suspected contributing factor (release, traffic shift, dependency upgrade). Tag the on-call SRE and the service owner.

Application Performance (APM)

    Open the Datadog / New Relic / Honeycomb APM service map. Sort by p99 descending so the slowest services surface first; the worst offenders usually concentrate on a single downstream span (DB call, third-party API, cache miss). Note any endpoint over its SLO.

    In Sentry, filter to issues with more than 100 events per week or affecting more than 50 users. Assign each to an owner via CODEOWNERS or the service catalog. Stale unassigned errors are the single biggest reason teams stop trusting Sentry.

    Pull the 30-day rolling SLO burn for each tier-1 service. If burn rate exceeds 2x for the trailing 24 hours, freeze non-critical deploys to that service until burn returns to budget.
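
The freeze rule above depends on how burn rate is defined. The sketch below assumes the common definition (observed error rate divided by the error budget, so 1.0 means the budget is being spent exactly on schedule); the checklist does not spell this out, so treat the formula and the function names as assumptions.

```python
def burn_rate(bad_events: int, total_events: int, slo_target: float) -> float:
    """Observed error rate divided by the error budget (1 - slo_target)."""
    error_budget = 1.0 - slo_target
    if error_budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    if total_events == 0:
        return 0.0  # no traffic in the window, so nothing was burned
    return (bad_events / total_events) / error_budget


def should_freeze_deploys(bad_events: int, total_events: int,
                          slo_target: float, max_burn: float = 2.0) -> bool:
    """True when the trailing-window burn rate exceeds the 2x limit."""
    return burn_rate(bad_events, total_events, slo_target) > max_burn
```

For example, a 99.9% SLO with 30 failed requests out of 10,000 in the trailing 24 hours gives a burn rate of roughly 3, which would trigger the freeze.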

    In PagerDuty / Opsgenie, list alerts that paged more than 3 times this week without an action item. Re-tune or move to ticket-only — alert fatigue is how real SEVs get missed at 3am.
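
If the paging history can be exported as a flat list of alert names, the noisy-alert review above is a counting exercise. A small sketch, assuming a hypothetical export where each page appears once and a separately maintained set of alerts that already have an action item:

```python
from collections import Counter


def noisy_alerts(pages: list[str], actioned: set[str],
                 max_pages: int = 3) -> list[str]:
    """Alerts that paged more than `max_pages` times with no action item."""
    counts = Counter(pages)
    return sorted(
        name for name, n in counts.items()
        if n > max_pages and name not in actioned
    )
```

Each name returned is a candidate for re-tuning or demotion to ticket-only.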

Database Performance

    Pull pg_stat_statements (Postgres) or Performance Insights (RDS) for the top 20 queries by total time. Flag any query above 500ms p95 or any sequential scan on a table over 1M rows.

    Replica lag above 30 seconds on a read-replica fleet means stale reads in production. PgBouncer / RDS Proxy connection saturation above 80% means the next traffic spike causes timeouts.
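
Both limits in the item above are simple comparisons once the numbers are in hand. The following sketch assumes lag and pool figures have already been collected from the database and the pooler; how you collect them is deployment-specific and not shown, and the function name is illustrative.

```python
LAG_LIMIT_SECONDS = 30.0
POOL_SATURATION_LIMIT = 0.80


def db_capacity_warnings(replica_lag_seconds: dict[str, float],
                         pool_in_use: int, pool_size: int) -> list[str]:
    """Return human-readable warnings for lagging replicas and a hot pool."""
    if pool_size <= 0:
        raise ValueError("pool_size must be positive")
    warnings = []
    for replica, lag in sorted(replica_lag_seconds.items()):
        if lag > LAG_LIMIT_SECONDS:
            warnings.append(
                f"{replica}: replica lag {lag:.0f}s exceeds {LAG_LIMIT_SECONDS:.0f}s"
            )
    saturation = pool_in_use / pool_size
    if saturation > POOL_SATURATION_LIMIT:
        warnings.append(f"pool: {saturation:.0%} of connections in use")
    return warnings
```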

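    Indexes with zero scans over 30 days (idx_scan = 0 in pg_stat_user_indexes) waste write IO and storage; review them as drop candidates. Use the pgstattuple extension to measure bloat on the largest tables and their indexes; schedule REINDEX CONCURRENTLY during the next maintenance window for any index over 30% bloat.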

    Confirm the most recent automated snapshot is under 24 hours old and matches retention policy. Backup green is not the same as restore green — log the next quarterly restore drill date.

Network and Edge

    5xx rate on the public ALB above 0.1% sustained is a signal. CloudFront cache hit ratio below the per-distribution baseline usually means a deploy invalidated the cache or someone added a Vary header.

    In VPC Flow Logs, look for unexpected egress to non-allowlisted destinations, NAT gateway saturation, or REJECT spikes between subnets. Most NAT bill surprises trace back to a single chatty pod.

    List ACM and cert-manager certificates with under 30 days remaining. ACM auto-renewal can fail silently if DNS validation records were removed; check the renewal eligibility status, not just expiry.
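
The expiry half of the item above is a date comparison. A minimal sketch, assuming certificate names and expiry timestamps have already been gathered into a dictionary (the gathering step, and ACM's renewal eligibility check, are not covered here):

```python
from datetime import datetime, timedelta


def expiring_soon(certs: dict[str, datetime], now: datetime,
                  window_days: int = 30) -> list[str]:
    """Certificates expiring within `window_days`, including expired ones."""
    cutoff = now + timedelta(days=window_days)
    return sorted(name for name, not_after in certs.items() if not_after < cutoff)
```

Pass timezone-aware datetimes for both `now` and the expiry values; Python raises a `TypeError` when comparing aware and naive timestamps.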

Infrastructure Health

    Pull kube_pod_container_status_restarts_total over 7 days. Any deployment with >5 restarts or recurring OOMKilled containers needs a memory-limit or leak investigation before next release.
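
Once the 7-day restart increase and OOMKilled counts are summed per deployment, the rule above can be sketched as below. The input dictionaries are hypothetical aggregates, and treating "recurring" as two or more OOM kills is an assumption the checklist does not define.

```python
def deployments_to_investigate(restarts: dict[str, int],
                               oom_kills: dict[str, int],
                               max_restarts: int = 5,
                               recurring_oom: int = 2) -> dict[str, list[str]]:
    """Map each flagged deployment to the reasons it was flagged."""
    flagged = {}
    for name in sorted(set(restarts) | set(oom_kills)):
        reasons = []
        if restarts.get(name, 0) > max_restarts:
            reasons.append("restarts")
        # Assumption: "recurring" means at least `recurring_oom` OOM kills.
        if oom_kills.get(name, 0) >= recurring_oom:
            reasons.append("oom")
        if reasons:
            flagged[name] = reasons
    return flagged
```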

    Confirm cluster-autoscaler / Karpenter is scaling down idle nodes. A 15% week-over-week cost increase without a traffic increase is usually a stuck HPA or a runaway batch job.

    Backup runs nightly; restore is the only proof the backup works. Confirm the next drill is scheduled into a non-prod environment with a named owner. SOC 2 CC7.5 evidence depends on this.

Security and Compliance

    Filter Snyk / Dependabot to CVSS >= 7.0 in production-deployed services. Patch versions auto-merge; majors get a ticket with an owner and a 30-day SLA. Don't let the queue grow past 20.
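
The triage policy above can be sketched as a filter plus a version-bump check. This assumes findings are available as `(package, cvss, current_version, fixed_version)` tuples with plain `MAJOR.MINOR.PATCH` versions, which is a hypothetical shape rather than the Snyk or Dependabot export format. The checklist does not say how to handle minor bumps; routing them to a ticket here is an assumption.

```python
Finding = tuple[str, float, str, str]  # package, cvss, current, fixed


def bump_kind(current: str, fixed: str) -> str:
    """Classify an upgrade as 'major', 'minor', or 'patch'."""
    cur = [int(part) for part in current.split(".")]
    new = [int(part) for part in fixed.split(".")]
    if new[0] != cur[0]:
        return "major"
    if new[1] != cur[1]:
        return "minor"
    return "patch"


def triage(findings: list[Finding], min_cvss: float = 7.0) -> dict[str, list[str]]:
    """Split high-severity findings into auto-merge and ticket queues."""
    queues: dict[str, list[str]] = {"auto_merge": [], "ticket": []}
    for package, cvss, current, fixed in findings:
        if cvss < min_cvss:
            continue  # below the severity bar for this review
        if bump_kind(current, fixed) == "patch":
            queues["auto_merge"].append(package)
        else:
            # Assumption: minor bumps get a ticket, same as majors.
            queues["ticket"].append(package)
    return queues
```

Comparing `len(queues["ticket"])` against 20 gives the queue-size check from the same item.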

    Walk the past week of GuardDuty / Wazuh findings. Suppress documented false positives in the platform, not in your head — undocumented suppressions fail the next audit.

    Pull AWS console users, GitHub org members, and Kubernetes RBAC bindings; cross-check against active employees in Okta / Google Workspace. Lingering access for departed engineers is the most common offboarding gap auditors flag under SOC 2 CC6.
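
The cross-check above is a set difference per system. A minimal sketch, assuming each access list and the active-employee roster have been exported as sets of email addresses (the exports themselves are not shown, and the system names in the usage below are only labels):

```python
def orphaned_accounts(access_lists: dict[str, set[str]],
                      active_employees: set[str]) -> dict[str, set[str]]:
    """For each system, return identities not on the active roster."""
    active = {email.lower() for email in active_employees}
    orphans = {}
    for system, users in access_lists.items():
        # Compare case-insensitively; exports often differ in casing.
        extra = {user.lower() for user in users} - active
        if extra:
            orphans[system] = extra
    return orphans
```

Service accounts will show up in the result as well, so keep a documented allowlist for them rather than ignoring the output.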

    SRE lead signs off after confirming all action items have owners and tickets. This is the artifact for SOC 2 CC7.2 (continuous monitoring) evidence.
