Monitoring Setup Checklist

Steps an SRE or platform engineer follows to stand up observability across infrastructure, applications, security, and dependencies for a new service or environment. Covers metric collection, alerting, dashboards, and on-call wiring through PagerDuty or Opsgenie.

8 sections, 26 steps

Scope and Tooling Decisions

Select the observability stack
- The platform lead picks the metrics, logs, and traces backends, typically Datadog, Grafana/Prometheus/Loki, or New Relic. Confirm the agent or OpenTelemetry collector approach before any installs; mixing vendors mid-rollout is a common cause of rework.
Define SLOs for the service
- Pick 2-4 SLIs the service owner will defend — typically availability (success rate), latency at p95 or p99, and freshness for data pipelines. Set the SLO target and error budget window (28-day rolling is common). Skipping this step means alerts fire on opinions, not commitments.
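The error budget behind an SLO target is simple arithmetic, and writing it down helps the service owner see what the target commits them to. A minimal sketch, assuming a time-based availability SLO and a request-based one; the function names are illustrative and not from any vendor SDK:

```python
from datetime import timedelta


def error_budget(slo_target: float, window: timedelta) -> timedelta:
    """Allowed 'bad' time for a time-based availability SLO over the window."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    return window * (1.0 - slo_target)


def budget_remaining(slo_target: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of a request-based error budget still unspent (negative if overspent)."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    allowed_failures = total_requests * (1.0 - slo_target)
    if allowed_failures == 0:
        return 1.0  # no traffic yet, so nothing has been spent
    return 1.0 - failed_requests / allowed_failures


if __name__ == "__main__":
    budget = error_budget(0.999, timedelta(days=28))
    print(f"99.9% over 28 days allows {budget.total_seconds() / 60:.1f} minutes of downtime")
```

A 99.9% target over a 28-day window works out to roughly 40 minutes of budget, which is a useful sanity check before agreeing to a tighter number.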
Confirm the on-call routing target
- Page through PagerDuty or Opsgenie, not a shared Slack channel. Confirm the rotation, escalation policy, and a secondary so a single missed page doesn't drop a SEV2.

Infrastructure Monitoring

Deploy the metrics agent to all hosts
- Roll the agent (Datadog Agent, node_exporter, OTel collector) through Terraform or your AMI/launch-template pipeline rather than per-host. For Kubernetes, use the DaemonSet pattern so new nodes pick it up automatically.
Configure USE-method host alerts
- Wire alerts on CPU saturation, memory pressure, disk usage above 85%, and disk fill-rate (which catches log explosions before the host pages). Use 5-minute averaging windows to avoid flapping on transient spikes.
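The fill-rate alert is a linear extrapolation: measure how fast the disk is growing and estimate when it reaches capacity. A rough sketch of that logic, similar in spirit to what Prometheus's predict_linear does; the 24-hour lead time and the function names are assumptions chosen for illustration:

```python
from typing import List, Optional


def hours_until_full(used_then: float, used_now: float,
                     capacity: float, hours_elapsed: float) -> Optional[float]:
    """Linear estimate of hours until the disk is full, or None if it is not growing."""
    if hours_elapsed <= 0:
        raise ValueError("hours_elapsed must be positive")
    growth_per_hour = (used_now - used_then) / hours_elapsed
    if growth_per_hour <= 0:
        return None
    return (capacity - used_now) / growth_per_hour


def disk_alert(used_then: float, used_now: float, capacity: float, hours_elapsed: float,
               usage_limit: float = 0.85, min_hours_left: float = 24.0) -> List[str]:
    """Return the reasons to alert: static usage threshold, fill-rate, or both."""
    reasons = []
    if used_now / capacity > usage_limit:
        reasons.append("usage")
    eta = hours_until_full(used_then, used_now, capacity, hours_elapsed)
    if eta is not None and eta < min_hours_left:
        reasons.append("fill-rate")
    return reasons
```

A disk at 70% that grew 10 points in two hours is still under the static threshold, but the fill-rate check flags it with about six hours of headroom left.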
Pipe container and host logs to the collector
- Standardize on JSON-structured logs and ship via Fluent Bit, Vector, or the vendor agent. Confirm log retention matches compliance needs: SOC 2 does not prescribe a fixed period, but 90+ days for security-relevant logs is a common baseline.
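For services that do not already emit structured logs, a small formatter is often enough to get one JSON object per line for the shipper to parse. A minimal sketch using only the Python standard library; the field names (timestamp, level, logger, message) are an assumption and should match whatever schema your collector expects:

```python
import json
import logging
import sys
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single-line JSON object."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:
            entry["exception"] = self.formatException(record.exc_info)
        return json.dumps(entry)


def build_logger(name: str, stream=sys.stdout) -> logging.Logger:
    """Return a logger that writes JSON lines to the given stream."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers.clear()   # avoid duplicate handlers on repeated setup
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False  # keep the root logger from re-emitting as plain text
    return logger


if __name__ == "__main__":
    build_logger("checkout").info("order %s placed", 42)
```

Writing to stdout keeps the service unaware of the shipping path, so the same code works under Fluent Bit, Vector, or a vendor agent.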

Application Performance Monitoring

Instrument the service with APM
- Add the APM library or OpenTelemetry SDK to the service and confirm traces show up end-to-end across HTTP, database, and queue spans. Watch out for high-cardinality tags (user_id, request_id as a tag) — they blow up billing fast.
Wire error tracking to Sentry or an equivalent
- Pick one of Sentry, Bugsnag, or Rollbar. Verify source maps upload from CI for frontend builds; otherwise stack traces are minified and useless. Set release tags so error rates can be correlated with deploys.
Configure SLO-based burn-rate alerts
- Use multi-window burn-rate alerts (fast: 2% of the error budget consumed in 1 hour; slow: 5% in 6 hours, both for a 30-day SLO window) per the Google SRE Workbook. Threshold alerts on raw latency are noisier and miss slow-burn regressions.
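The burn-rate thresholds fall out of one formula: the fraction of budget you are willing to lose, multiplied by the SLO window and divided by the alert window. A sketch under the SRE Workbook's assumptions of a 30-day SLO window, with 2% of budget in 1 hour and 5% in 6 hours; the function names are illustrative, and the window is a parameter so a 28-day SLO can be substituted:

```python
def burn_rate_threshold(budget_fraction: float, alert_window_hours: float,
                        slo_window_hours: float = 30 * 24) -> float:
    """Burn rate that consumes budget_fraction of the error budget within the alert window."""
    return budget_fraction * slo_window_hours / alert_window_hours


def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    return error_ratio / (1.0 - slo_target)


def should_page(long_window_error_ratio: float, short_window_error_ratio: float,
                slo_target: float, threshold: float) -> bool:
    """Multi-window check: both the long and the short window must exceed the threshold.

    The short window confirms the problem is still happening, so the alert
    clears soon after the error rate recovers.
    """
    return (burn_rate(long_window_error_ratio, slo_target) >= threshold
            and burn_rate(short_window_error_ratio, slo_target) >= threshold)


if __name__ == "__main__":
    fast = burn_rate_threshold(0.02, 1)   # 2% of budget in 1 hour
    slow = burn_rate_threshold(0.05, 6)   # 5% of budget in 6 hours
    print(f"fast threshold: {fast:.1f}x, slow threshold: {slow:.1f}x")
```

With a 30-day window the two thresholds come out to burn rates of 14.4 and 6, which are the figures usually seen in published burn-rate alert rules.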
Build the service golden-signals dashboard
- One dashboard per service showing latency (p50/p95/p99), traffic (RPS), errors (rate and top exceptions), and saturation (CPU, memory, queue depth). Link the dashboard from the on-call runbook so the first responder lands on it from the page.

Security Monitoring

Forward audit logs to the SIEM
- CloudTrail, Kubernetes audit logs, GitHub audit log, and SSO logs feed Splunk, Sumo Logic, or the vendor SIEM. SOC 2 CC7.2 expects continuous monitoring of these sources — confirm the feed is live, not just configured.
Confirm whether the service handles regulated data
- If the service touches PHI, cardholder data, or EU resident PII, additional logging and alerting controls apply. Flag now rather than retrofitting after an audit finding.
Enable compliance-tier audit alerts
- Required only for HIPAA / PCI / GDPR scope. Turn on alerts for failed-login bursts, privilege escalation, encryption-at-rest disablement, and unauthorized data exports. Confirm 6-year (HIPAA) or scope-appropriate retention on the audit log store.
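A failed-login burst alert is a sliding-window count per account or source IP. Most SIEMs express this as a query, but the underlying logic can be sketched as follows; the class name and the example limits (more than 3 failures in 60 seconds) are hypothetical and should be tuned to your own baseline:

```python
from collections import deque
from typing import Deque, Dict


class BurstDetector:
    """Flag a key once it exceeds max_events within a sliding time window."""

    def __init__(self, max_events: int, window_seconds: float) -> None:
        self.max_events = max_events
        self.window_seconds = window_seconds
        self._events: Dict[str, Deque[float]] = {}

    def record(self, key: str, timestamp: float) -> bool:
        """Record one failed login; return True if the key is now in a burst."""
        events = self._events.setdefault(key, deque())
        events.append(timestamp)
        # Drop events that have aged out of the window.
        while events and events[0] <= timestamp - self.window_seconds:
            events.popleft()
        return len(events) > self.max_events


if __name__ == "__main__":
    detector = BurstDetector(max_events=3, window_seconds=60)
    for t in (0, 10, 20, 30):
        print(t, detector.record("alice", t))
```

Keying on the account catches credential stuffing against one user; keying on source IP catches one client spraying many accounts. Running both is common.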
Configure secret-scanning and CVE alerts
- Turn on GitHub secret scanning + push protection, GitGuardian, or equivalent on every repo. Pipe Snyk or Dependabot critical/high findings to the security channel; route SLA-bound CVEs (e.g., 30-day patch SLA) to the team's queue, not a shared inbox that nobody owns.

User and Synthetic Monitoring

Deploy real-user monitoring (RUM)
- Add the RUM snippet (Datadog RUM, New Relic Browser, Sentry) to the frontend. Track Core Web Vitals — LCP, INP, CLS — and break down by browser and geography. Don't ship without sampling configured; full-fidelity RUM at scale is expensive.
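Vendor RUM SDKs typically expose a sample-rate setting, and that is the setting to use in production. To show what session sampling does, here is a minimal sketch of deterministic, hash-based sampling, assuming a string session ID; the function name is illustrative:

```python
import hashlib


def is_sampled(session_id: str, sample_rate: float) -> bool:
    """Decide whether a session is kept, consistently for the same session ID.

    The session ID is hashed into a number in [0, 1); sessions below the
    sample rate are kept. Hashing (rather than random choice) means every
    page view in one session gets the same decision.
    """
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0 and 1")
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < sample_rate


if __name__ == "__main__":
    kept = sum(is_sampled(f"session-{i}", 0.10) for i in range(10_000))
    print(f"kept {kept} of 10000 sessions at a 10% sample rate")
```

Sampling whole sessions rather than individual events keeps each retained user journey complete, which matters when debugging a slow checkout flow.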
Author synthetic checks for critical user paths
- Cover login, the primary checkout or workflow path, and the public marketing page. Run from at least two regions on a 5-minute interval. Synthetic checks catch DNS, TLS, and CDN failures that internal health checks miss.
Hook support tickets into the alert stream
- Zendesk or Intercom tags like 'outage' or 'degraded' route to the on-call channel. Customer reports often beat synthetic checks by 5-15 minutes for partial-region issues.

Network and Database Monitoring

Monitor VPC flow logs and ALB metrics
- Enable VPC flow logs to S3 or CloudWatch and ingest into the SIEM. Alert on ALB 5xx rate, target unhealthy host count, and NAT gateway port-allocation errors — the last one silently breaks outbound connections under load.
Track database performance counters
- For Postgres / RDS: connection count vs. max_connections, replication lag, slow-query log, deadlocks, and transaction ID age (a sign autovacuum is falling behind). Alert before connection saturation, not at it: once the pool is exhausted, new connections are refused and requests fail outright.
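Alerting before saturation means tiering on the ratio of active connections to max_connections. A small sketch of that tiering; the 70% and 85% cut-offs are assumptions for illustration, and the inputs would come from something like a count over pg_stat_activity and the max_connections setting:

```python
def connection_alert_level(active: int, max_connections: int,
                           warn_at: float = 0.70, page_at: float = 0.85) -> str:
    """Classify connection utilization as 'ok', 'warn', or 'page'."""
    if max_connections <= 0:
        raise ValueError("max_connections must be positive")
    utilization = active / max_connections
    if utilization >= page_at:
        return "page"
    if utilization >= warn_at:
        return "warn"
    return "ok"


if __name__ == "__main__":
    for active in (50, 75, 90):
        print(active, connection_alert_level(active, max_connections=100))
```

The warn tier can go to a ticket queue while the page tier goes to on-call, which leaves time to add a pooler or raise limits before requests start failing.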
Set certificate expiration alerts
- Alert at 30, 14, and 7 days before expiry on every public TLS cert and internal mTLS cert. Send to a channel that exists today, not the deprecated one — silent ACM auto-renewal failures are a top cause of preventable outages.
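The 30/14/7-day tiers can be sketched as a function of the certificate's expiry time. This assumes you already have the notAfter timestamp as a timezone-aware datetime; the tier labels are illustrative:

```python
from datetime import datetime, timezone
from typing import Optional

ALERT_TIERS_DAYS = (7, 14, 30)  # checked from most to least urgent


def expiry_tier(not_after: datetime, now: datetime) -> Optional[str]:
    """Return the most urgent alert tier the certificate falls into, or None."""
    days_left = (not_after - now).total_seconds() / 86400
    if days_left < 0:
        return "expired"
    for tier in ALERT_TIERS_DAYS:
        if days_left <= tier:
            return f"{tier}d"
    return None


if __name__ == "__main__":
    now = datetime(2024, 1, 1, tzinfo=timezone.utc)
    for expiry_day in (6, 11, 25):
        print(expiry_day, expiry_tier(datetime(2024, 1, expiry_day, tzinfo=timezone.utc), now))
```

In Python, the notAfter field from ssl.SSLSocket.getpeercert() can be converted with ssl.cert_time_to_seconds to feed this check for live endpoints.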

Cloud and Third-Party Coverage

Enable cloud-cost and budget alerts
- AWS Budgets, GCP Budgets, or a CMP (Vantage, CloudHealth) — set anomaly detection plus a hard monthly threshold. A runaway data-transfer or accidental GPU instance shows up here before finance notices.
Subscribe to vendor status pages
- Webhook the status pages of every critical dependency (Stripe, Auth0, Twilio, the cloud provider region) into the on-call channel. Don't rely on manual Slack subscriptions — engineers leave, integrations don't.
Document the runbook and on-call playbook
- For every alert that pages, the runbook entry shows: what the alert means, the dashboard link, the most common causes, and the rollback or mitigation command. Alerts without a runbook get acked-and-ignored within two rotations.

Validation and Sign-Off

Run a paging fire drill
- Trigger a synthetic alert (e.g., raise error rate in staging or fire a test PagerDuty incident) and confirm the page lands on the right phone within 2 minutes, escalates correctly, and resolves cleanly. A monitoring setup that has never paged is not validated.
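For the PagerDuty case, a test incident can be fired and resolved through the Events API v2. A minimal sketch using only the standard library; the routing key is a placeholder for your service's integration key, and the dedup key, summary, and source values are made up for the drill:

```python
import json
import urllib.request

PAGERDUTY_EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"


def build_event(routing_key: str, action: str, dedup_key: str,
                summary: str = "", source: str = "fire-drill") -> dict:
    """Build an Events API v2 body for a 'trigger' or 'resolve' action."""
    if action not in ("trigger", "resolve"):
        raise ValueError("action must be 'trigger' or 'resolve'")
    event = {"routing_key": routing_key, "event_action": action, "dedup_key": dedup_key}
    if action == "trigger":
        event["payload"] = {"summary": summary, "source": source, "severity": "critical"}
    return event


def send_event(event: dict, timeout: float = 10.0) -> int:
    """POST the event to PagerDuty and return the HTTP status code."""
    request = urllib.request.Request(
        PAGERDUTY_EVENTS_URL,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


if __name__ == "__main__":
    # Placeholder key: replace with the integration key of the service under test.
    trigger = build_event("YOUR_ROUTING_KEY", "trigger", "drill-001",
                          summary="[DRILL] paging fire drill, please acknowledge")
    print(json.dumps(trigger, indent=2))
    # send_event(trigger)  # uncomment to page for real
```

Reusing the same dedup key in a follow-up resolve event closes the incident, which exercises the "resolves cleanly" part of the drill.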
Tune thresholds after one week of baseline
- Review the alert log: which alerts fired, which were actionable, which were noise. Adjust thresholds, lengthen windows, or delete alerts with zero true-positives. The goal is a rotation where every page is real.
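The weekly review can be reduced to a per-alert precision figure. A sketch assuming the alert log has been exported as (alert name, was it actionable) pairs; that export format and the function names are hypothetical:

```python
from collections import Counter
from typing import Dict, Iterable, List, Tuple


def review_alerts(alert_log: Iterable[Tuple[str, bool]]) -> Dict[str, dict]:
    """Summarize how often each alert fired and how often it was actionable."""
    fired: Counter = Counter()
    actionable: Counter = Counter()
    for name, was_actionable in alert_log:
        fired[name] += 1
        if was_actionable:
            actionable[name] += 1
    return {
        name: {
            "fired": count,
            "actionable": actionable[name],
            "precision": actionable[name] / count,
        }
        for name, count in fired.items()
    }


def deletion_candidates(review: Dict[str, dict]) -> List[str]:
    """Alerts that fired but were never actionable."""
    return sorted(name for name, stats in review.items() if stats["actionable"] == 0)


if __name__ == "__main__":
    log = [("HighCPU", False), ("HighCPU", False),
           ("ErrorBudgetBurn", True), ("DiskFill", True), ("DiskFill", False)]
    summary = review_alerts(log)
    print(summary)
    print("delete or rework:", deletion_candidates(summary))
```

Alerts with low but non-zero precision are candidates for longer windows or higher thresholds rather than deletion.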
Get engineering manager sign-off
- EM confirms SLOs, runbook, on-call routing, and dashboards are in place before the service is declared production-ready. File any gaps as tickets with named owners.