Monitoring Setup Checklist

Scope and Tooling Decisions

    The platform lead picks the metrics, logs, and traces backends — typically Datadog, Grafana/Prometheus/Loki, or New Relic. Confirm the agent or OpenTelemetry collector approach before any installs; mixing vendors mid-rollout is the most common rework driver.

    Pick 2-4 SLIs the service owner will defend — typically availability (success rate), latency at p95 or p99, and freshness for data pipelines. Set the SLO target and error budget window (28-day rolling is common). Skipping this step means alerts fire on opinions, not commitments.
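
    To make the error budget concrete, the arithmetic can be sketched as below. This is a minimal illustration, not a vendor feature: the function names are hypothetical, and the 28-day default simply mirrors the window mentioned above.

```python
def error_budget_minutes(slo_target: float, window_days: int = 28) -> float:
    """Minutes of total unavailability the SLO tolerates over the window."""
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be a fraction between 0 and 1")
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes


def budget_remaining(slo_target: float, good: int, total: int) -> float:
    """Fraction of the error budget left for a request-based SLI.

    A negative result means the budget is already overspent.
    """
    if total == 0:
        return 1.0  # no traffic yet, nothing spent
    allowed_bad = (1.0 - slo_target) * total
    actual_bad = total - good
    return 1.0 - actual_bad / allowed_bad
```

    A 99.9% target over 28 days leaves roughly 40 minutes of budget, which is the number the service owner is actually committing to.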

    Page through PagerDuty or Opsgenie, not a shared Slack channel. Confirm the rotation, escalation policy, and a secondary so a single missed page doesn't drop a SEV2.

Infrastructure Monitoring

    Roll the agent (Datadog Agent, node_exporter, OTel collector) through Terraform or your AMI/launch-template pipeline rather than per-host. For Kubernetes, use the DaemonSet pattern so new nodes pick it up automatically.

    Wire alerts on CPU saturation, memory pressure, disk usage above 85%, and disk fill rate (which catches log explosions before the disk actually fills). Use 5-minute averaging windows to avoid flapping on transient spikes.
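
    A fill-rate alert is usually a linear projection of recent usage samples. The sketch below shows one way to compute it, assuming samples arrive as (unix_seconds, used_bytes) pairs; the function name and input shape are illustrative and not tied to any particular agent.

```python
from typing import Optional, Sequence, Tuple


def hours_until_full(
    samples: Sequence[Tuple[float, float]], capacity_bytes: float
) -> Optional[float]:
    """Project hours until the disk fills using a least-squares slope.

    Returns None when there are too few samples or usage is not growing.
    """
    n = len(samples)
    if n < 2:
        return None
    mean_t = sum(t for t, _ in samples) / n
    mean_u = sum(u for _, u in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        return None
    # Slope in bytes per second across the sampled window.
    slope = sum((t - mean_t) * (u - mean_u) for t, u in samples) / var_t
    if slope <= 0:
        return None
    latest_used = samples[-1][1]
    return max(capacity_bytes - latest_used, 0.0) / slope / 3600.0
```

    Alerting when the projection drops below a few hours complements the static 85% threshold: one catches slow growth, the other catches sudden bursts.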

    Standardize on JSON-structured logs and ship via Fluent Bit, Vector, or the vendor agent. Confirm log retention matches compliance needs (SOC 2 does not prescribe a retention period, but auditors generally expect 90+ days for security-relevant logs).
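
    For a Python service, JSON-structured output can be sketched with the standard library alone. The field names here (timestamp, level, logger, message) are an assumption; match whatever schema your log pipeline already expects.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.fromtimestamp(
                record.created, tz=timezone.utc
            ).isoformat(),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        if record.exc_info:
            entry["exception"] = self.formatException(record.exc_info)
        return json.dumps(entry)


def build_logger(name: str) -> logging.Logger:
    """Attach a stream handler that emits JSON lines for the shipper to pick up."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger
```

    One object per line keeps the output parseable by a log shipper without multiline rules.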

Application Performance Monitoring

    Add the APM library or OpenTelemetry SDK to the service and confirm traces show up end-to-end across HTTP, database, and queue spans. Watch out for high-cardinality tags (user_id, request_id as a tag) — they blow up billing fast.
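
    The cardinality problem is easiest to see as arithmetic: the worst-case number of distinct tag combinations is the product of the distinct values per tag. The helper below is a hypothetical illustration, and the tag names and counts in the usage note are made-up examples.

```python
from math import prod
from typing import Mapping


def series_upper_bound(tag_cardinalities: Mapping[str, int]) -> int:
    """Worst-case count of distinct tag combinations for one metric.

    Real counts are usually lower because not every combination occurs,
    but the product shows how a single unbounded tag dominates.
    """
    return prod(tag_cardinalities.values())
```

    With env=3, endpoint=40, and status_class=5 the bound is 600 combinations. Adding a user_id tag with 100,000 distinct values raises it to 60 million, which is why identifiers like these belong in span attributes or logs rather than indexed tags.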

    Pick one error tracker: Sentry, Bugsnag, or Rollbar. Verify that source maps upload from CI for frontend builds; otherwise stack traces stay minified and unreadable. Set release tags so error rates can be correlated with deploys.

    Use multi-window burn-rate alerts (fast: 2% of the budget in 1 hour; slow: 5% in 6 hours) per Google SRE workbook guidance. Threshold alerts on raw latency are noisier and miss slow-burn regressions.
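
    The burn-rate thresholds behind these alerts follow from simple arithmetic. The sketch below uses the SRE workbook's published paging figures (2% of the budget in 1 hour and 5% in 6 hours, over a 30-day SLO window); the function names are illustrative, and in practice the error ratios would come from your metrics backend.

```python
def burn_rate_threshold(
    budget_fraction: float, alert_window_hours: float, slo_window_days: int = 30
) -> float:
    """Burn rate that consumes budget_fraction of the budget in the alert window."""
    return budget_fraction * (slo_window_days * 24) / alert_window_hours


def observed_burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    return error_ratio / (1.0 - slo_target)


def should_page(
    long_error_ratio: float,
    short_error_ratio: float,
    slo_target: float,
    threshold: float,
) -> bool:
    """Page only when both the long and the short window exceed the threshold.

    The short window makes the alert reset quickly once the problem stops.
    """
    return (
        observed_burn_rate(long_error_ratio, slo_target) > threshold
        and observed_burn_rate(short_error_ratio, slo_target) > threshold
    )
```

    The fast alert works out to a burn rate of 14.4 and the slow one to 6. With a 28-day window the same 2%-in-1-hour rule gives 13.44 instead.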

    One dashboard per service showing latency (p50/p95/p99), traffic (RPS), errors (rate and top exceptions), and saturation (CPU, memory, queue depth). Link the dashboard from the on-call runbook so the first responder lands on it from the page.

Security Monitoring

    CloudTrail, Kubernetes audit logs, GitHub audit log, and SSO logs feed Splunk, Sumo Logic, or the vendor SIEM. SOC 2 CC7.2 expects continuous monitoring of these sources — confirm the feed is live, not just configured.

    If the service touches PHI, cardholder data, or EU resident PII, additional logging and alerting controls apply. Flag now rather than retrofitting after an audit finding.

    Required only for HIPAA / PCI / GDPR scope. Turn on alerts for failed-login bursts, privilege escalation, encryption-at-rest disablement, and unauthorized data exports. Confirm 6-year (HIPAA) or scope-appropriate retention on the audit log store.

    Turn on GitHub secret scanning + push protection, GitGuardian, or equivalent on every repo. Pipe Snyk or Dependabot critical/high findings to the security channel; route SLA-bound CVEs (e.g., 30-day patch SLA) to the team's queue, not a shared inbox that nobody owns.

User and Synthetic Monitoring

    Add the RUM snippet (Datadog RUM, New Relic Browser, Sentry) to the frontend. Track Core Web Vitals — LCP, INP, CLS — and break down by browser and geography. Don't ship without sampling configured; full-fidelity RUM at scale is expensive.
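
    RUM SDKs expose sampling as a configuration option, so you would not normally write this yourself. As a conceptual sketch of what session-level sampling does, the hypothetical function below hashes a session ID so that a given session is consistently in or out of the sample.

```python
import hashlib


def in_sample(session_id: str, sample_rate: float) -> bool:
    """Deterministically keep roughly sample_rate of all sessions.

    Hashing the session ID keeps every event from one session together,
    instead of sampling individual events at random.
    """
    if not 0.0 <= sample_rate <= 1.0:
        raise ValueError("sample_rate must be between 0 and 1")
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # uniform in [0, 2**64)
    return bucket < sample_rate * 2**64
```

    Sampling whole sessions rather than single events keeps each sampled user journey complete, which matters when breaking down Core Web Vitals by page flow.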

    Cover login, the primary checkout or workflow path, and the public marketing page. Run from at least two regions on a 5-minute interval. Synthetic catches DNS, TLS, and CDN failures that internal health checks miss.

    Route Zendesk or Intercom tickets tagged 'outage' or 'degraded' to the on-call channel. Customer reports often beat synthetic checks by 5-15 minutes for partial-region issues.

Network and Database Monitoring

    Enable VPC flow logs to S3 or CloudWatch and ingest into the SIEM. Alert on ALB 5xx rate, target unhealthy host count, and NAT gateway port-allocation errors — the last one silently breaks outbound connections under load.

    For Postgres / RDS: connection count vs. max_connections, replication lag, slow-query log, deadlocks, and autovacuum age. Alert before connection saturation, not at it: once the pool is exhausted, new connections are refused and requests fail outright.

    Alert at 30, 14, and 7 days before expiry on every public TLS cert and internal mTLS cert. Send to a channel that exists today, not the deprecated one — silent ACM auto-renewal failures are a top cause of preventable outages.
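
    The expiry check itself can be sketched with Python's standard ssl module. The 30/14/7 thresholds come from the step above; the function names are illustrative, and fetch_not_after only sees certificates reachable from wherever the check runs.

```python
import socket
import ssl
import time
from typing import Optional

ALERT_DAYS = (30, 14, 7)


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Return the notAfter string of the certificate the host presents."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Days between now and a notAfter value such as 'Jan  5 09:34:43 2018 GMT'."""
    expires = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return (expires - current) / 86400


def due_alert(days_left: float) -> Optional[int]:
    """Return the tightest threshold already crossed, or None if none is."""
    crossed = [d for d in ALERT_DAYS if days_left <= d]
    return min(crossed) if crossed else None
```

    Checking the certificate the endpoint actually serves, rather than the one recorded in the certificate manager, is what surfaces a renewal that succeeded on paper but was never deployed.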

Cloud and Third-Party Coverage

    AWS Budgets, GCP Budgets, or a CMP (Vantage, CloudHealth): set anomaly detection plus a hard monthly threshold. A runaway data-transfer bill or an accidentally launched GPU instance shows up here before finance notices.

    Webhook the status pages of every critical dependency (Stripe, Auth0, Twilio, the cloud provider region) into the on-call channel. Don't rely on manual Slack subscriptions — engineers leave, integrations don't.

    For every alert that pages, the runbook entry shows: what the alert means, the dashboard link, the most common causes, and the rollback or mitigation command. Alerts without a runbook get acked-and-ignored within two rotations.

Validation and Sign-Off

    Trigger a synthetic alert (e.g., raise error rate in staging or fire a test PagerDuty incident) and confirm the page lands on the right phone within 2 minutes, escalates correctly, and resolves cleanly. A monitoring setup that has never paged is not validated.
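
    A test PagerDuty incident can be fired through the Events API v2. The sketch below builds the event and posts it with the standard library; the routing key is a placeholder for your service's integration key, and the summary text and dedup key format are assumptions chosen for illustration.

```python
import json
import urllib.request

EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"


def build_test_event(routing_key: str, service: str) -> dict:
    """Build a trigger event clearly marked as a paging test."""
    return {
        "routing_key": routing_key,
        "event_action": "trigger",
        # A stable dedup key lets a later "resolve" event close the same incident.
        "dedup_key": f"monitoring-validation-{service}",
        "payload": {
            "summary": f"[TEST] Paging validation for {service}",
            "source": service,
            "severity": "critical",
        },
    }


def send_event(event: dict, timeout: float = 10.0) -> int:
    """POST the event and return the HTTP status code."""
    request = urllib.request.Request(
        EVENTS_URL,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

    Sending the same event with event_action set to "resolve" afterwards exercises the clean-resolution half of the check. Record the time between the send and the phone ringing to verify the 2-minute target.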

    Review the alert log: which alerts fired, which were actionable, which were noise. Adjust thresholds, lengthen windows, or delete alerts with zero true-positives. The goal is a rotation where every page is real.
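
    The review can be reduced to a per-alert precision figure. The sketch below assumes the alert log has been exported as (alert_name, was_actionable) pairs, which is a hypothetical shape; adapt it to whatever your paging tool exports.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def alert_precision(log: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Fraction of firings that were actionable, per alert name."""
    counts = defaultdict(lambda: [0, 0])  # name -> [fired, actionable]
    for name, actionable in log:
        counts[name][0] += 1
        counts[name][1] += int(actionable)
    return {name: hits / fired for name, (fired, hits) in counts.items()}


def deletion_candidates(log: Iterable[Tuple[str, bool]]) -> List[str]:
    """Alerts that fired at least once and were never actionable."""
    return sorted(
        name for name, precision in alert_precision(log).items() if precision == 0.0
    )
```

    Alerts with low but non-zero precision are the candidates for threshold or window tuning; those at zero are the candidates for deletion.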

    EM confirms SLOs, runbook, on-call routing, and dashboards are in place before the service is declared production-ready. File any gaps as tickets with named owners.
