Monitoring Setup Checklist
Scope and Tooling Decisions
The platform lead picks the metrics, logs, and traces backends — typically Datadog, Grafana/Prometheus/Loki, or New Relic. Confirm the agent or OpenTelemetry collector approach before any installs; mixing vendors mid-rollout is the most common rework driver.
Pick 2-4 SLIs the service owner will defend — typically availability (success rate), latency at p95 or p99, and freshness for data pipelines. Set the SLO target and error budget window (28-day rolling is common). Skipping this step means alerts fire on opinions, not commitments.
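To make the error budget concrete, the arithmetic can be sketched as below. This is a minimal illustration, not a prescribed tool: the function name and the 99.9% target are assumptions chosen for the example.

```python
def error_budget_minutes(slo_target: float, window_days: int = 28) -> float:
    """Return the minutes of full unavailability an SLO allows in the window."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes


# Hypothetical example: a 99.9% availability SLO over a 28-day rolling window.
budget = error_budget_minutes(0.999, window_days=28)
print(f"{budget:.2f} minutes of error budget")
```

Under these assumptions the service can be fully down for roughly 40 minutes per window before the budget is spent, which gives the service owner a number to defend.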
Page through PagerDuty or Opsgenie, not a shared Slack channel. Confirm the rotation, escalation policy, and a secondary so a single missed page doesn't drop a SEV2.
Infrastructure Monitoring
Roll the agent (Datadog Agent, node_exporter, OTel collector) through Terraform or your AMI/launch-template pipeline rather than per-host. For Kubernetes, use the DaemonSet pattern so new nodes pick it up automatically.
Wire alerts on CPU saturation, memory pressure, disk usage above 85%, and disk fill-rate (which catches log explosions before the host pages). Use 5-minute averaging windows to avoid flapping on transient spikes.
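A fill-rate alert is essentially a linear extrapolation of disk usage. The sketch below shows the idea with two usage samples; the function name, the sample numbers, and the 4-hour paging threshold are all assumptions for illustration, and most monitoring backends offer a built-in forecast function that would replace this.

```python
from typing import Optional


def hours_until_full(
    used_before_gb: float,
    used_now_gb: float,
    elapsed_hours: float,
    capacity_gb: float,
) -> Optional[float]:
    """Project hours until the disk is full from two usage samples."""
    if elapsed_hours <= 0:
        raise ValueError("elapsed_hours must be positive")
    growth_per_hour = (used_now_gb - used_before_gb) / elapsed_hours
    if growth_per_hour <= 0:
        return None  # usage is flat or shrinking, so there is no projected fill time
    return (capacity_gb - used_now_gb) / growth_per_hour


# Hypothetical numbers: a 100 GB volume that grew from 60 GB to 70 GB in one hour.
eta = hours_until_full(60.0, 70.0, 1.0, 100.0)
PAGE_IF_FULL_WITHIN_HOURS = 4.0  # assumed threshold, tune per service
should_page = eta is not None and eta < PAGE_IF_FULL_WITHIN_HOURS
print(eta, should_page)
```

In this example the disk is only 70% full, so a static 85% threshold stays quiet, while the fill-rate projection already shows about three hours of headroom.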
Standardize on JSON-structured logs and ship via Fluent Bit, Vector, or the vendor agent. Confirm log retention matches compliance needs (SOC 2 generally expects 90+ days for security-relevant logs).
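For services that do not already emit structured logs, a JSON formatter on top of the standard library is usually enough to get started. The following is a minimal sketch; the field names, the `build_json_logger` helper, and the service name are assumptions, and the team's log schema should take precedence.

```python
import io
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)


def build_json_logger(name: str, stream=None) -> logging.Logger:
    """Return a logger whose only handler writes JSON lines to the stream."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler(stream)  # defaults to stderr when stream is None
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


buffer = io.StringIO()
log = build_json_logger("checkout-service", buffer)  # hypothetical service name
log.info("order %s accepted", "A-1001")
print(buffer.getvalue().strip())
```

Writing one JSON object per line keeps the output easy for a shipper such as Fluent Bit or Vector to parse without custom regexes.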
Application Performance Monitoring
Add the APM library or OpenTelemetry SDK to the service and confirm traces show up end-to-end across HTTP, database, and queue spans. Watch out for high-cardinality tags (user_id, request_id as a tag) — they blow up billing fast.
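The billing impact of a high-cardinality tag comes from multiplication: in the worst case, each distinct combination of tag values becomes its own time series. A rough sketch of that estimate, plus a filter that drops risky keys before tags are attached, might look like this. The helper names and the example counts are assumptions, not part of any vendor SDK.

```python
import math

# Keys named in the checklist as risky; extend to match your own services.
HIGH_CARDINALITY_KEYS = {"user_id", "request_id"}


def estimated_series(distinct_values_per_tag: dict) -> int:
    """Worst-case series count: the product of distinct values per tag."""
    return math.prod(distinct_values_per_tag.values())


def drop_high_cardinality(tags: dict) -> dict:
    """Return a copy of the tags without the known high-cardinality keys."""
    return {k: v for k, v in tags.items() if k not in HIGH_CARDINALITY_KEYS}


# Hypothetical service: 5 endpoints, 3 status classes, 100,000 users.
without_user = estimated_series({"endpoint": 5, "status": 3})
with_user = estimated_series({"endpoint": 5, "status": 3, "user_id": 100_000})
print(without_user, with_user)

print(drop_high_cardinality({"endpoint": "/pay", "user_id": "u-42"}))
```

Identifiers such as user_id are usually better kept as span attributes or log fields, where they remain searchable without multiplying metric series.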
Pick one error tracker: Sentry, Bugsnag, or Rollbar. Verify that source maps upload from CI for frontend builds; otherwise stack traces are minified and useless. Set release tags so error rates can be correlated with deploys.
Use multi-window burn-rate alerts (fast: 2% of budget in 1 hour; slow: 5% in 6 hours) per Google SRE workbook guidance. Threshold alerts on raw latency are noisier and miss slow-burn regressions.
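The burn-rate thresholds behind these alerts follow from simple arithmetic: the burn rate is the observed error rate divided by the error rate the SLO allows, and the paging threshold is the rate that would consume a given share of the budget within the alert window. The sketch below works through the fast alert; the function names, the 99.9% target, and the 2% observed error rate are assumptions for illustration.

```python
def burn_rate(observed_error_rate: float, slo_target: float) -> float:
    """How many times faster than allowed the error budget is being spent."""
    return observed_error_rate / (1.0 - slo_target)


def burn_rate_threshold(
    budget_fraction: float,
    alert_window_hours: float,
    slo_window_days: int = 28,
) -> float:
    """Burn rate that consumes budget_fraction of the budget in the alert window."""
    return budget_fraction * (slo_window_days * 24) / alert_window_hours


# Fast alert: 2% of the budget in 1 hour, against a 28-day SLO window.
fast_threshold = burn_rate_threshold(0.02, 1.0, slo_window_days=28)

# Hypothetical reading: 2% of requests failing against a 99.9% SLO.
current = burn_rate(observed_error_rate=0.02, slo_target=0.999)
should_page = current >= fast_threshold
print(round(fast_threshold, 2), round(current, 2), should_page)
```

With a 28-day window the fast threshold works out to a burn rate of 13.44; the 14.4 figure often quoted for this alert assumes a 30-day window.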
One dashboard per service showing latency (p50/p95/p99), traffic (RPS), errors (rate and top exceptions), and saturation (CPU, memory, queue depth). Link the dashboard from the on-call runbook so the first responder lands on it from the page.
Security Monitoring
Feed CloudTrail, Kubernetes audit logs, the GitHub audit log, and SSO logs into Splunk, Sumo Logic, or the vendor SIEM. SOC 2 CC7.2 expects continuous monitoring of these sources, so confirm the feed is live, not just configured.
If the service touches PHI, cardholder data, or EU resident PII, additional logging and alerting controls apply. Flag now rather than retrofitting after an audit finding.
Required only for HIPAA / PCI / GDPR scope. Turn on alerts for failed-login bursts, privilege escalation, encryption-at-rest disablement, and unauthorized data exports. Confirm 6-year (HIPAA) or scope-appropriate retention on the audit log store.
Turn on GitHub secret scanning + push protection, GitGuardian, or equivalent on every repo. Pipe Snyk or Dependabot critical/high findings to the security channel; route SLA-bound CVEs (e.g., 30-day patch SLA) to the team's queue, not a shared inbox that nobody owns.
User and Synthetic Monitoring
Add the RUM snippet (Datadog RUM, New Relic Browser, Sentry) to the frontend. Track Core Web Vitals — LCP, INP, CLS — and break down by browser and geography. Don't ship without sampling configured; full-fidelity RUM at scale is expensive.
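RUM SDKs generally expose their own sample-rate setting, and that setting is what should be configured in practice. To show what session sampling does underneath, here is a sketch of deterministic hash-based sampling, where a session is either fully recorded or fully skipped. The function name and the 10% rate are assumptions for illustration.

```python
import hashlib


def is_sampled(session_id: str, sample_rate: float) -> bool:
    """Decide deterministically whether a session falls inside the sample."""
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0 and 1")
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate


# Hypothetical check: roughly 10% of 10,000 synthetic session IDs are kept.
kept = sum(is_sampled(f"session-{i}", 0.10) for i in range(10_000))
print(kept / 10_000)
```

Hashing the session ID rather than drawing a random number per event keeps every page view of a sampled session, so user journeys are not recorded with gaps.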
Run synthetic checks against login, the primary checkout or workflow path, and the public marketing page. Run them from at least two regions on a 5-minute interval. Synthetic checks catch DNS, TLS, and CDN failures that internal health checks miss.
Route Zendesk or Intercom tickets tagged 'outage' or 'degraded' to the on-call channel. Customer reports often beat synthetic checks by 5-15 minutes for partial-region issues.
Network and Database Monitoring
Enable VPC flow logs to S3 or CloudWatch and ingest into the SIEM. Alert on ALB 5xx rate, target unhealthy host count, and NAT gateway port-allocation errors — the last one silently breaks outbound connections under load.
For Postgres / RDS, monitor connection count vs. max_connections, replication lag, the slow-query log, deadlocks, and autovacuum age. Alert before connection saturation, not at it, because apps fail closed once the pool is exhausted.
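Alerting ahead of saturation means comparing the live connection count against max_connections at tiers below 100%. The following is a minimal sketch of that comparison; the 70% and 85% ratios and the function name are assumptions, and in practice the two inputs would come from the database (for example, a count over pg_stat_activity and the max_connections setting) or from the monitoring agent.

```python
def connection_alert_level(
    active_connections: int,
    max_connections: int,
    warn_ratio: float = 0.70,      # assumed early-warning tier
    critical_ratio: float = 0.85,  # assumed paging tier, still below saturation
) -> str:
    """Classify connection usage so alerts fire before the pool is exhausted."""
    if max_connections <= 0:
        raise ValueError("max_connections must be positive")
    utilization = active_connections / max_connections
    if utilization >= critical_ratio:
        return "critical"
    if utilization >= warn_ratio:
        return "warn"
    return "ok"


# Hypothetical readings against max_connections = 100.
for active in (50, 75, 90):
    print(active, connection_alert_level(active, 100))
```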
Alert at 30, 14, and 7 days before expiry on every public TLS cert and internal mTLS cert. Send to a channel that exists today, not the deprecated one — silent ACM auto-renewal failures are a top cause of preventable outages.
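The tiered expiry alert reduces to comparing days remaining against the three thresholds and reporting the tightest one crossed. A sketch under stated assumptions: the function name and the example dates are hypothetical, and the expiry timestamp would come from whatever certificate inventory the team already maintains.

```python
from datetime import datetime, timezone
from typing import Optional

THRESHOLDS_DAYS = (30, 14, 7)  # tiers from the checklist item above


def expiry_alert_tier(not_after: datetime, now: datetime) -> Optional[int]:
    """Return the tightest threshold (in days) already crossed, or None."""
    days_left = (not_after - now).days
    crossed = [t for t in THRESHOLDS_DAYS if days_left <= t]
    return min(crossed) if crossed else None


# Hypothetical certificate that expires 10 days from "now".
now = datetime(2024, 1, 1, tzinfo=timezone.utc)
not_after = datetime(2024, 1, 11, tzinfo=timezone.utc)
print(expiry_alert_tier(not_after, now))
```

Reporting the tightest tier lets the routing layer raise severity as expiry approaches instead of sending the same message three times.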
Cloud and Third-Party Coverage
Use AWS Budgets, GCP Budgets, or a cloud management platform (Vantage, CloudHealth) to set anomaly detection plus a hard monthly threshold. A runaway data-transfer bill or an accidentally launched GPU instance shows up here before finance notices.
Webhook the status pages of every critical dependency (Stripe, Auth0, Twilio, the cloud provider region) into the on-call channel. Don't rely on manual Slack subscriptions — engineers leave, integrations don't.
For every alert that pages, the runbook entry shows: what the alert means, the dashboard link, the most common causes, and the rollback or mitigation command. Alerts without a runbook get acked-and-ignored within two rotations.
Validation and Sign-Off
Trigger a synthetic alert (e.g., raise error rate in staging or fire a test PagerDuty incident) and confirm the page lands on the right phone within 2 minutes, escalates correctly, and resolves cleanly. A monitoring setup that has never paged is not validated.
Review the alert log: which alerts fired, which were actionable, which were noise. Adjust thresholds, lengthen windows, or delete alerts with zero true-positives. The goal is a rotation where every page is real.
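The review itself can be as simple as counting, per alert, how often it fired and how often the responder marked it actionable. The sketch below assumes the alert log can be exported as (alert name, was actionable) pairs; that export shape, the helper names, and the sample alert names are hypothetical.

```python
from collections import Counter


def summarize(events):
    """Map each alert name to (times fired, times actionable)."""
    fired = Counter()
    actionable = Counter()
    for name, was_actionable in events:
        fired[name] += 1
        if was_actionable:
            actionable[name] += 1
    return {name: (fired[name], actionable[name]) for name in fired}


def deletion_candidates(summary):
    """Alerts that fired at least once but were never actionable."""
    return sorted(name for name, (_, useful) in summary.items() if useful == 0)


# Hypothetical export from one rotation's alert log.
events = [
    ("high_cpu", False),
    ("high_cpu", False),
    ("slo_burn_fast", True),
    ("disk_fill_rate", True),
    ("disk_fill_rate", False),
]
summary = summarize(events)
print(summary)
print(deletion_candidates(summary))
```

Alerts with some true positives but a poor ratio, like the disk alert in this sample, are candidates for a threshold or window adjustment rather than deletion.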
The engineering manager (EM) confirms that SLOs, runbook, on-call routing, and dashboards are in place before the service is declared production-ready. File any gaps as tickets with named owners.
