Monitoring Setup Checklist
Steps an SRE or platform engineer follows to stand up observability across infrastructure, applications, security, and dependencies for a new service or environment. Covers metric collection, alerting, dashboards, and on-call wiring through PagerDuty or Opsgenie.
Scope and Tooling Decisions
-
Select the observability stack
The platform lead picks the metrics, logs, and traces backends — typically Datadog, Grafana/Prometheus/Loki, or New Relic. Confirm the agent or OpenTelemetry collector approach before any installs; mixing vendors mid-rollout is the most common rework driver.
-
Define SLOs for the service
Pick 2-4 SLIs the service owner will defend — typically availability (success rate), latency at p95 or p99, and freshness for data pipelines. Set the SLO target and error budget window (28-day rolling is common). Skipping this step means alerts fire on opinions, not commitments.
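The error-budget arithmetic behind this step can be sketched in a few lines. This is a minimal illustration, assuming a request-based availability SLI; the function names and the 99.9% example target are hypothetical and not tied to any vendor.

```python
def error_budget_minutes(slo_target: float, window_days: int = 28) -> float:
    """Minutes of total unavailability the SLO tolerates over the window."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1, e.g. 0.999")
    return (1.0 - slo_target) * window_days * 24 * 60


def budget_remaining(slo_target: float, good_events: int, total_events: int) -> float:
    """Fraction of the error budget still unspent (negative once overspent)."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1, e.g. 0.999")
    if total_events <= 0:
        return 1.0  # no traffic yet, so nothing has been spent
    allowed_bad = (1.0 - slo_target) * total_events
    actual_bad = total_events - good_events
    return 1.0 - actual_bad / allowed_bad
```

Under these assumptions, a 99.9% target over a 28-day window leaves roughly 40 minutes of budget, which is the number the service owner is committing to defend.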
-
Confirm the on-call routing target
Page through PagerDuty or Opsgenie, not a shared Slack channel. Confirm the rotation, escalation policy, and a secondary so a single missed page doesn't drop a SEV2.
Infrastructure Monitoring
-
Deploy the metrics agent to all hosts
Roll the agent (Datadog Agent, node_exporter, OTel collector) through Terraform or your AMI/launch-template pipeline rather than per-host. For Kubernetes, use the DaemonSet pattern so new nodes pick it up automatically.
-
Configure USE-method host alerts
Wire alerts on CPU saturation, memory pressure, disk usage above 85%, and disk fill-rate (which catches log explosions before the host pages). Use 5-minute averaging windows to avoid flapping on transient spikes.
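A fill-rate alert extrapolates recent disk growth to estimate time until the volume is full. The sketch below uses a simple two-point slope; the function names, the sample format, and the 6-hour cutoff are assumptions for illustration, and a real monitoring backend would normally provide its own forecast function.

```python
from typing import Optional, Sequence, Tuple

Sample = Tuple[float, float]  # (unix_seconds, used_bytes)


def hours_until_full(samples: Sequence[Sample], capacity_bytes: float) -> Optional[float]:
    """Linear estimate of hours until the disk fills; None if usage is flat or shrinking."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    (t0, used0), (t1, used1) = samples[0], samples[-1]
    if t1 <= t0:
        raise ValueError("samples must be in increasing time order")
    rate = (used1 - used0) / (t1 - t0)  # bytes per second
    if rate <= 0:
        return None
    return max(capacity_bytes - used1, 0.0) / rate / 3600.0


def disk_alert(samples: Sequence[Sample], capacity_bytes: float,
               usage_limit: float = 0.85, min_hours_left: float = 6.0) -> bool:
    """Fire on either the static usage limit or a short projected time-to-full."""
    used_fraction = samples[-1][1] / capacity_bytes
    eta = hours_until_full(samples, capacity_bytes)
    return used_fraction > usage_limit or (eta is not None and eta < min_hours_left)
```

Combining the two conditions means a host at 50% usage still alerts if a log explosion would fill it within the cutoff.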
-
Pipe container and host logs to the collector
Standardize on JSON-structured logs and ship via Fluent Bit, Vector, or the vendor agent. Confirm log retention matches compliance needs (SOC 2 generally expects 90+ days for security-relevant logs).
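As an illustration of JSON-structured logging, here is a minimal sketch using only the Python standard library. The field names (`ts`, `level`, `logger`, `message`) and the helper `build_json_logger` are assumptions; align them with whatever schema your collector expects.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "ts": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:
            payload["exception"] = self.formatException(record.exc_info)
        return json.dumps(payload)


def build_json_logger(name: str, stream) -> logging.Logger:
    """Return a logger that writes JSON lines to the given stream (e.g. sys.stdout)."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # avoid duplicate plain-text output via the root logger
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.handlers = [handler]
    return logger
```

Writing one JSON object per line to stdout lets Fluent Bit, Vector, or a vendor agent parse records without custom regex rules.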
Application Performance Monitoring
-
Instrument the service with APM
Add the APM library or OpenTelemetry SDK to the service and confirm traces show up end-to-end across HTTP, database, and queue spans. Watch out for high-cardinality tags (user_id, request_id as a tag) — they blow up billing fast.
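One way to keep high-cardinality values out of indexed tags is to split span attributes before export. The sketch below is vendor-neutral and hypothetical: the key list and the function name are assumptions, and how a given backend bills for tag cardinality varies.

```python
from typing import Dict, Mapping, Tuple

# Keys with effectively unbounded values; extend this set per service (assumed list).
HIGH_CARDINALITY_KEYS = frozenset({"user_id", "request_id"})


def split_span_attributes(
    attributes: Mapping[str, str],
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Return (indexed_tags, unindexed_fields) for a span's attributes."""
    indexed = {k: v for k, v in attributes.items() if k not in HIGH_CARDINALITY_KEYS}
    unindexed = {k: v for k, v in attributes.items() if k in HIGH_CARDINALITY_KEYS}
    return indexed, unindexed
```

The unindexed values stay available for debugging a single trace or log line without multiplying the number of distinct tag combinations.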
-
Wire error tracking to Sentry or an equivalent
Pick one of Sentry, Bugsnag, or Rollbar. Verify that source maps upload from CI for frontend builds; otherwise stack traces are minified and useless. Set release tags so error rates can be correlated with deploys.
-
Configure SLO-based burn-rate alerts
Use multi-window burn-rate alerts (fast: 2% of the error budget burned in 1 hour; slow: 5% in 6 hours) per the Google SRE Workbook guidance. Threshold alerts on raw latency are noisier and miss slow-burn regressions.
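A burn-rate threshold follows from the fraction of budget you are willing to spend in an alert window: threshold = budget fraction × SLO period ÷ alert window. The sketch below assumes a 30-day SLO period, for which the SRE Workbook's paging thresholds are 14.4 over 1 hour and 6 over 6 hours; the function names are hypothetical.

```python
def burn_rate_threshold(budget_fraction: float, alert_window_hours: float,
                        slo_window_days: int = 30) -> float:
    """Burn rate that spends budget_fraction of the budget within the alert window."""
    return budget_fraction * slo_window_days * 24 / alert_window_hours


def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    return error_ratio / (1.0 - slo_target)


def should_page(long_window_error_ratio: float, short_window_error_ratio: float,
                slo_target: float, threshold: float) -> bool:
    """Multi-window rule: the long and the short window must both exceed the threshold."""
    return (burn_rate(long_window_error_ratio, slo_target) >= threshold
            and burn_rate(short_window_error_ratio, slo_target) >= threshold)
```

Requiring the short window (for example 5 minutes) to agree with the long window lets the alert reset quickly once the error rate recovers.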
-
Build the service golden-signals dashboard
One dashboard per service showing latency (p50/p95/p99), traffic (RPS), errors (rate and top exceptions), and saturation (CPU, memory, queue depth). Link the dashboard from the on-call runbook so the first responder lands on it from the page.
Security Monitoring
-
Forward audit logs to the SIEM
CloudTrail, Kubernetes audit logs, GitHub audit log, and SSO logs feed Splunk, Sumo Logic, or the vendor SIEM. SOC 2 CC7.2 expects continuous monitoring of these sources — confirm the feed is live, not just configured.
-
Confirm whether the service handles regulated data
If the service touches PHI, cardholder data, or EU resident PII, additional logging and alerting controls apply. Flag now rather than retrofitting after an audit finding.
-
Enable compliance-tier audit alerts
Required only for HIPAA / PCI / GDPR scope. Turn on alerts for failed-login bursts, privilege escalation, encryption-at-rest disablement, and unauthorized data exports. Confirm 6-year (HIPAA) or scope-appropriate retention on the audit log store.
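A failed-login burst alert is essentially a sliding-window counter. The sketch below is a minimal in-memory version; the class name and the example threshold and window are assumptions, and a SIEM rule would normally express the same logic in its own query language.

```python
from collections import deque
from typing import Deque


class BurstDetector:
    """Flags when at least `threshold` events fall within `window_seconds`."""

    def __init__(self, threshold: int, window_seconds: float) -> None:
        self.threshold = threshold
        self.window_seconds = window_seconds
        self._events: Deque[float] = deque()

    def record(self, timestamp: float) -> bool:
        """Record one failed login; return True if the burst threshold is reached."""
        self._events.append(timestamp)
        cutoff = timestamp - self.window_seconds
        # Drop events that have aged out of the window.
        while self._events and self._events[0] <= cutoff:
            self._events.popleft()
        return len(self._events) >= self.threshold
```

In practice you would keep one detector per account or source IP so that a single noisy client does not mask a targeted attack.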
-
Configure secret-scanning and CVE alerts
Turn on GitHub secret scanning + push protection, GitGuardian, or equivalent on every repo. Pipe Snyk or Dependabot critical/high findings to the security channel; route SLA-bound CVEs (e.g., 30-day patch SLA) to the team's queue, not a shared inbox that nobody owns.
User and Synthetic Monitoring
-
Deploy real-user monitoring (RUM)
Add the RUM snippet (Datadog RUM, New Relic Browser, Sentry) to the frontend. Track Core Web Vitals — LCP, INP, CLS — and break down by browser and geography. Don't ship without sampling configured; full-fidelity RUM at scale is expensive.
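Session sampling is usually deterministic so that a given session is either fully captured or fully skipped. The sketch below shows the idea with a hash of the session ID; the function name is hypothetical, and real RUM SDKs expose their own sample-rate setting that you would configure instead.

```python
import hashlib


def is_sampled(session_id: str, sample_rate: float) -> bool:
    """Deterministically keep roughly `sample_rate` of sessions (0.0 to 1.0)."""
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0.0 and 1.0")
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # uniform in [0, 2**64)
    return bucket < int(sample_rate * 2**64)
```

Because the decision depends only on the session ID, every page view in a sampled session is kept, which preserves complete user journeys at a fraction of the volume.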
-
Author synthetic checks for critical user paths
Cover login, the primary checkout or workflow path, and the public marketing page. Run from at least two regions on a 5-minute interval. Synthetic checks catch DNS, TLS, and CDN failures that internal health checks miss.
-
Hook support tickets into the alert stream
Zendesk or Intercom tags like 'outage' or 'degraded' route to the on-call channel. Customer reports often beat synthetic checks by 5-15 minutes for partial-region issues.
Network and Database Monitoring
-
Monitor VPC flow logs and ALB metrics
Enable VPC flow logs to S3 or CloudWatch and ingest into the SIEM. Alert on ALB 5xx rate, target unhealthy host count, and NAT gateway port-allocation errors — the last one silently breaks outbound connections under load.
-
Track database performance counters
For Postgres / RDS: connection count vs. max_connections, replication lag, slow-query log, deadlocks, and autovacuum age. Alert before connection saturation, not at it: once the pool is exhausted, new connections are refused and requests start failing.
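Alerting ahead of saturation means comparing the active connection count to `max_connections` with headroom. In this minimal sketch the 70% and 85% cutoffs and the function name are assumptions; tune them to how quickly your service's connection count can climb.

```python
def connection_alert_level(active: int, max_connections: int,
                           warn_at: float = 0.70, page_at: float = 0.85) -> str:
    """Classify connection utilization as 'ok', 'warn', or 'page'."""
    if max_connections <= 0:
        raise ValueError("max_connections must be positive")
    utilization = active / max_connections
    if utilization >= page_at:
        return "page"
    if utilization >= warn_at:
        return "warn"
    return "ok"
```

The two inputs could come from `SELECT count(*) FROM pg_stat_activity` and `SHOW max_connections` on Postgres, or from the equivalent metrics your agent already exports.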
-
Set certificate expiration alerts
Alert at 30, 14, and 7 days before expiry on every public TLS cert and internal mTLS cert. Send to a channel that exists today, not the deprecated one — silent ACM auto-renewal failures are a top cause of preventable outages.
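The 30/14/7-day tiers can be expressed as a small lookup on the certificate's expiry time. This is a sketch with hypothetical names; it assumes you already have the certificate's `notAfter` value as a timezone-aware datetime.

```python
from datetime import datetime, timezone
from typing import Optional

ALERT_THRESHOLDS_DAYS = (30, 14, 7)


def expiry_alert_tier(not_after: datetime, now: datetime) -> Optional[int]:
    """Return the tightest tier (7, 14, or 30) the cert has crossed, or None if none."""
    days_left = (not_after - now).days
    for threshold in sorted(ALERT_THRESHOLDS_DAYS):
        if days_left <= threshold:
            return threshold
    return None
```

An already expired certificate falls into the 7-day tier here; a real check would likely treat that case as an immediate page.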
Cloud and Third-Party Coverage
-
Enable cloud-cost and budget alerts
Use AWS Budgets, GCP Budgets, or a cloud management platform (Vantage, CloudHealth): set anomaly detection plus a hard monthly threshold. A runaway data-transfer bill or an accidentally launched GPU instance shows up here before finance notices.
-
Subscribe to vendor status pages
Webhook the status pages of every critical dependency (Stripe, Auth0, Twilio, the cloud provider region) into the on-call channel. Don't rely on manual Slack subscriptions — engineers leave, integrations don't.
-
Document the runbook and on-call playbook
For every alert that pages, the runbook entry shows: what the alert means, the dashboard link, the most common causes, and the rollback or mitigation command. Alerts without a runbook get acked-and-ignored within two rotations.
Validation and Sign-Off
-
Run a paging fire drill
Trigger a synthetic alert (e.g., raise error rate in staging or fire a test PagerDuty incident) and confirm the page lands on the right phone within 2 minutes, escalates correctly, and resolves cleanly. A monitoring setup that has never paged is not validated.
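A test PagerDuty incident can be fired through the Events API v2 endpoint. The sketch below builds the event body and posts it with the standard library; the helper names, the `[FIRE DRILL]` summary text, and the dedup key format are assumptions, and the routing key must come from your own service integration.

```python
import json
import urllib.request

PAGERDUTY_EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"


def build_test_event(routing_key: str, service: str, environment: str = "staging") -> dict:
    """Build an Events API v2 trigger event clearly labeled as a drill."""
    return {
        "routing_key": routing_key,
        "event_action": "trigger",
        "dedup_key": f"fire-drill-{service}-{environment}",
        "payload": {
            "summary": f"[FIRE DRILL] test page for {service} ({environment})",
            "source": service,
            "severity": "critical",
        },
    }


def send_event(event: dict, timeout: float = 10.0) -> int:
    """POST the event and return the HTTP status code."""
    request = urllib.request.Request(
        PAGERDUTY_EVENTS_URL,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

Sending a follow-up event with the same `dedup_key` and `event_action` set to `resolve` closes the incident, which exercises the "resolves cleanly" part of the drill.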
-
Tune thresholds after one week of baseline
Review the alert log: which alerts fired, which were actionable, which were noise. Adjust thresholds, lengthen windows, or delete alerts with zero true-positives. The goal is a rotation where every page is real.
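The weekly review can be partly automated by computing, per alert, the share of firings that were actionable. The sketch below assumes the alert log has been exported as (alert name, was actionable) pairs; that format, the function name, and the 50% cutoff are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def noisy_alerts(alert_log: Iterable[Tuple[str, bool]],
                 min_actionable_ratio: float = 0.5) -> List[str]:
    """Return alert names whose actionable share is below the cutoff, sorted by name."""
    fired: Counter = Counter()
    actionable: Counter = Counter()
    for name, was_actionable in alert_log:
        fired[name] += 1
        if was_actionable:
            actionable[name] += 1
    return sorted(
        name for name, count in fired.items()
        if actionable[name] / count < min_actionable_ratio
    )
```

Alerts on the returned list are candidates for a higher threshold, a longer window, or deletion.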
-
Get engineering manager sign-off
EM confirms SLOs, runbook, on-call routing, and dashboards are in place before the service is declared production-ready. File any gaps as tickets with named owners.