Performance Monitoring Checklist
System and Host Telemetry
Confirm Fluent Bit / Vector / CloudWatch agents on every node are forwarding to Loki, Splunk, or Datadog without dropped batches. Common gotcha: a node that rotated its IAM role 30 days ago and silently lost write permission on the log group.
Pull the USE-method dashboard for each fleet — utilization, saturation, errors. Flag any host above 80% sustained CPU, above 85% memory working set, or above 75% disk used. Disk is the silent killer; full root volumes wedge kubelet.
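The thresholds in this step can be expressed as a small check. This is a minimal sketch: `HostSample` and its fields are hypothetical names, and the readings are assumed to be already-averaged "sustained" percentages exported from your dashboard.

```python
from dataclasses import dataclass

# Thresholds from the checklist step, in percent.
CPU_LIMIT = 80.0
MEM_LIMIT = 85.0
DISK_LIMIT = 75.0


@dataclass
class HostSample:
    # Hypothetical record shape; adapt to whatever your dashboard exports.
    name: str
    cpu_pct: float   # sustained CPU utilization
    mem_pct: float   # memory working set
    disk_pct: float  # disk used


def flag_host(h: HostSample) -> list[str]:
    """Return the dimensions on which a host exceeds its threshold."""
    breaches = []
    if h.cpu_pct > CPU_LIMIT:
        breaches.append("cpu")
    if h.mem_pct > MEM_LIMIT:
        breaches.append("memory")
    if h.disk_pct > DISK_LIMIT:
        breaches.append("disk")
    return breaches
```

A host sitting exactly on a threshold is not flagged, since the step says "above".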
Compare p50/p95/p99 latency, request rate, and error rate to the prior 7-day window. A 20%+ shift without a known release is the signal — investigate before it becomes a SEV.
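The 20% comparison against the prior window might be sketched as follows. The function names are illustrative, and both values are assumed to be the same aggregate (for example p95 latency) computed over the current and the prior 7-day window.

```python
def relative_shift(current: float, baseline: float) -> float:
    """Fractional change of current versus baseline (0.25 means +25%)."""
    if baseline == 0:
        # No baseline traffic: any non-zero value counts as an unbounded shift.
        return float("inf") if current != 0 else 0.0
    return (current - baseline) / baseline


def is_drift(current: float, baseline: float, threshold: float = 0.20) -> bool:
    """True when the metric moved by at least the threshold in either direction."""
    return abs(relative_shift(current, baseline)) >= threshold
```

Treating drops as drift too is a design choice here: a sudden fall in request rate can be as meaningful as a rise in latency.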
Open a Linear / Jira ticket against the owning team with the dashboard link, the drift magnitude, and the suspected contributing factor (release, traffic shift, dependency upgrade). Tag the on-call SRE and the service owner.
Application Performance (APM)
Open the Datadog / New Relic / Honeycomb APM service map. Sort by p99 descending; the worst offenders usually concentrate on a single downstream span (DB call, third-party API, cache miss). Note any endpoint over its SLO.
In Sentry, filter to issues with more than 100 events per week or affecting more than 50 users. Assign each to an owner via CODEOWNERS or the service catalog. Stale unassigned errors are the single biggest reason teams stop trusting Sentry.
Pull the 30-day rolling SLO burn for each tier-1 service. If the burn rate over the trailing 24 hours exceeds 2x (error budget consumed at twice the rate the SLO allows), freeze non-critical deploys to that service until burn returns to budget.
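One common way to compute burn rate is the observed error ratio divided by the error budget implied by the SLO target. The sketch below assumes that definition and a request-based SLO; the function names and the event counts are illustrative.

```python
def burn_rate(bad_events: int, total_events: int, slo_target: float) -> float:
    """Observed error ratio divided by the allowed error ratio (1 - SLO target)."""
    if total_events == 0:
        return 0.0
    error_budget = 1.0 - slo_target
    return (bad_events / total_events) / error_budget


def should_freeze(rate: float, limit: float = 2.0) -> bool:
    """Freeze non-critical deploys when the trailing-24h burn rate exceeds the limit."""
    return rate > limit
```

With a 99.9% target, 30 failed requests out of 10,000 in the trailing 24 hours is a burn rate of about 3, which is over the 2x line.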
In PagerDuty / Opsgenie, list alerts that paged more than 3 times this week without an action item. Re-tune or move to ticket-only — alert fatigue is how real SEVs get missed at 3am.
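If you export the week's pages as a flat list of alert names, the noisy ones can be picked out with a counter. This is a sketch over hypothetical inputs; it does not call the PagerDuty or Opsgenie APIs.

```python
from collections import Counter


def noisy_alerts(page_log, alerts_with_action, max_pages=3):
    """Alerts that paged more than max_pages times and have no action item.

    page_log: one alert name per page sent this week (hypothetical export).
    alerts_with_action: set of alert names that already have an action item.
    """
    counts = Counter(page_log)
    return sorted(
        name
        for name, n in counts.items()
        if n > max_pages and name not in alerts_with_action
    )
```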
Database Performance
Pull pg_stat_statements (Postgres) or Performance Insights (RDS) for the top 20 queries by total time. Flag any query above 500ms p95 or any sequential scan on a table over 1M rows.
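Once the query statistics are exported, the flagging rule is straightforward to apply. The row shape below is hypothetical. Note that pg_stat_statements itself reports mean, min, max, and standard deviation per statement rather than percentiles, so the `p95_ms` field is assumed to come from another source.

```python
def flag_queries(rows, top_n=20, p95_limit_ms=500.0, big_table_rows=1_000_000):
    """Take the top_n rows by total time and return (query_id, reasons) for flagged ones.

    Each row is a dict with hypothetical keys:
    query_id, total_ms, p95_ms, seq_scan (bool), table_rows.
    """
    top = sorted(rows, key=lambda r: r["total_ms"], reverse=True)[:top_n]
    flagged = []
    for r in top:
        reasons = []
        if r["p95_ms"] > p95_limit_ms:
            reasons.append("slow_p95")
        if r["seq_scan"] and r["table_rows"] > big_table_rows:
            reasons.append("seq_scan_large_table")
        if reasons:
            flagged.append((r["query_id"], reasons))
    return flagged
```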
Replica lag above 30 seconds on a read-replica fleet means stale reads in production. PgBouncer / RDS Proxy connection saturation above 80% means the next traffic spike causes timeouts.
Indexes with zero scans over 30 days waste write IO and storage; review them as candidates for removal. Run pgstattuple on the largest tables; schedule REINDEX CONCURRENTLY during the next maintenance window for any index over 30% bloat.
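The two index findings lead to different follow-ups, which a short triage function can separate. This is a sketch under assumptions: the input dicts are hypothetical, the bloat percentage is assumed to be computed already, and listing zero-scan indexes for review (rather than rebuilding them) is an interpretation of the step, not something it states.

```python
def index_actions(indexes, bloat_limit_pct=30.0):
    """Split indexes into a review list and REINDEX statements.

    Each index is a dict with hypothetical keys: name, scans_30d, bloat_pct.
    """
    review_unused = []
    reindex_sql = []
    for ix in indexes:
        if ix["scans_30d"] == 0:
            # Unused index: rebuilding it would not help, so list it for review.
            review_unused.append(ix["name"])
        elif ix["bloat_pct"] > bloat_limit_pct:
            reindex_sql.append(f'REINDEX INDEX CONCURRENTLY {ix["name"]};')
    return review_unused, reindex_sql
```

The generated statements are meant to be reviewed and run in the maintenance window, not executed automatically.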
Confirm the most recent automated snapshot is under 24 hours old and matches retention policy. Backup green is not the same as restore green — log the next quarterly restore drill date.
Network and Edge
5xx rate on the public ALB above 0.1% sustained is a signal. CloudFront cache hit ratio below the per-distribution baseline usually means a deploy invalidated the cache or someone added a Vary header.
In VPC Flow Logs, look for unexpected egress to non-allowlisted destinations, NAT gateway saturation, or REJECT spikes between subnets. Most NAT bill surprises trace back to a single chatty pod.
List ACM and cert-manager certificates with under 30 days remaining. ACM auto-renewal can fail silently if DNS validation records were removed; check the renewal eligibility status, not just expiry.
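Checking both expiry and renewal status can be sketched as below. The certificate dicts are hypothetical; the `"INELIGIBLE"` value is assumed to mirror the renewal eligibility field ACM reports, so verify the exact field name and values against your tooling before relying on it.

```python
from datetime import datetime, timezone


def certs_needing_attention(certs, now, window_days=30):
    """Return (domain, days_left, renewal_blocked) for certificates to look at.

    Each cert is a dict with hypothetical keys:
    domain, not_after (timezone-aware datetime), renewal_eligibility (optional).
    """
    flagged = []
    for c in certs:
        days_left = (c["not_after"] - now).days
        renewal_blocked = c.get("renewal_eligibility") == "INELIGIBLE"
        # Flag on either condition: a distant expiry with blocked renewal
        # is still a problem waiting to happen.
        if days_left < window_days or renewal_blocked:
            flagged.append((c["domain"], days_left, renewal_blocked))
    return flagged
```

Pass `datetime.now(timezone.utc)` as `now` in real use; taking it as a parameter keeps the check easy to test.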
Infrastructure Health
Pull kube_pod_container_status_restarts_total over 7 days. Any deployment with >5 restarts or recurring OOMKilled containers needs a memory-limit or leak investigation before next release.
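Because kube_pod_container_status_restarts_total is a counter, the 7-day figure is the increase across the window, not the latest value. A minimal sketch of that calculation, assuming you have the raw samples ordered oldest to newest (a metrics backend would normally do this for you; this version does no extrapolation):

```python
def restart_increase(samples):
    """Total increase of a counter across ordered samples, tolerating resets."""
    total = 0
    for prev, cur in zip(samples, samples[1:]):
        # A drop means the counter reset (e.g. pod recreated), so count from zero.
        total += (cur - prev) if cur >= prev else cur
    return total


def needs_investigation(restarts_7d, oom_killed, max_restarts=5):
    """Apply the step's rule: more than 5 restarts, or any recurring OOMKilled."""
    return restarts_7d > max_restarts or oom_killed
```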
Confirm cluster-autoscaler / Karpenter is scaling down idle nodes. A 15% week-over-week cost increase without a traffic increase is usually a stuck HPA or a runaway batch job.
Backup runs nightly; restore is the only proof the backup works. Confirm the next drill is scheduled into a non-prod environment with a named owner. SOC 2 CC7.5 evidence depends on this.
Security and Compliance
Filter Snyk / Dependabot to CVSS >= 7.0 in production-deployed services. Patch versions auto-merge; majors get a ticket with an owner and a 30-day SLA. Don't let the queue grow past 20.
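The routing rule in this step could be sketched as follows. The finding dicts are hypothetical and versions are assumed to be plain MAJOR.MINOR.PATCH strings. The step does not say what to do with minor bumps, so this sketch leaves them in a separate bucket for manual review.

```python
def bump_kind(current, fixed):
    """Classify the upgrade from current to fixed as major, minor, or patch."""
    c = [int(p) for p in current.split(".")[:3]]
    f = [int(p) for p in fixed.split(".")[:3]]
    if f[0] != c[0]:
        return "major"
    if f[1] != c[1]:
        return "minor"
    return "patch"


def triage(findings, min_cvss=7.0):
    """Route findings at or above min_cvss by the size of the required upgrade.

    Each finding is a dict with hypothetical keys: package, cvss, current, fixed_in.
    """
    routed = {"auto_merge": [], "ticket": [], "manual_review": []}
    for f in findings:
        if f["cvss"] < min_cvss:
            continue
        kind = bump_kind(f["current"], f["fixed_in"])
        if kind == "patch":
            routed["auto_merge"].append(f["package"])
        elif kind == "major":
            routed["ticket"].append(f["package"])
        else:
            routed["manual_review"].append(f["package"])
    return routed
```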
Walk the past week of GuardDuty / Wazuh findings. Suppress documented false positives in the platform, not in your head — undocumented suppressions fail the next audit.
Pull AWS console, GitHub org, and kubectl RBAC bindings; cross-check against active employees in Okta / Google Workspace. Departing-engineer access is the most common offboarding gap auditors flag under SOC 2 CC6.
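At its core the cross-check is a set difference per system. The sketch below assumes identities have already been normalized to one form (for example work email); the system names and exports are hypothetical.

```python
def orphaned_access(active_employees, bindings):
    """Map each system to the identities that hold access but are not active employees.

    active_employees: set of identities from the identity provider.
    bindings: dict of system name -> set of identities with access.
    """
    result = {}
    for system, users in bindings.items():
        leftover = users - active_employees
        if leftover:
            result[system] = sorted(leftover)
    return result
```

Service accounts will show up in the output unless they are filtered out first, which is usually worth reviewing anyway.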
SRE lead signs off after confirming all action items have owners and tickets. This is the artifact for SOC 2 CC7.2 (continuous monitoring) evidence.
