Incident Response Checklist

Detection and Triage

    The on-call engineer acks the page within 5 minutes to stop the escalation timer. If you can't take it, decline so it routes to the secondary — silent acks that don't lead to action are the worst outcome.

    SEV1 = customer-facing outage or data loss; SEV2 = major degradation or partial outage; SEV3 = minor / single-tenant impact. When in doubt, declare up — it's cheaper to downgrade than to retroactively realize a SEV2 was a SEV1.
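
    The severity definitions above can be read as a small decision rule. The following is a minimal sketch, not an official rubric: the `Impact` fields and the `unsure` flag are hypothetical names introduced here for illustration, and "declare up" is modeled as moving one level more severe.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    SEV1 = 1  # customer-facing outage or data loss
    SEV2 = 2  # major degradation or partial outage
    SEV3 = 3  # minor / single-tenant impact


@dataclass
class Impact:
    # Hypothetical triage inputs; the field names are illustrative only.
    customer_facing_outage: bool = False
    data_loss: bool = False
    major_degradation: bool = False
    partial_outage: bool = False
    unsure: bool = False  # responder is torn between two levels


def classify(impact: Impact) -> Severity:
    """Map observed impact to a severity, rounding up when in doubt."""
    if impact.customer_facing_outage or impact.data_loss:
        sev = Severity.SEV1
    elif impact.major_degradation or impact.partial_outage:
        sev = Severity.SEV2
    else:
        sev = Severity.SEV3
    # "When in doubt, declare up": one level more severe, capped at SEV1.
    if impact.unsure and sev > Severity.SEV1:
        sev = Severity(sev - 1)
    return sev
```

    In this sketch, an uncertain partial outage is declared as SEV1 and can be downgraded later once the impact is understood.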

    Spin up a dedicated Slack channel (e.g., #inc-2024-11-14-checkout) via Incident.io or FireHydrant. All decisions, timestamps, and graph links go here — not DMs — so the post-incident review has a single source of truth.
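
    Incident tooling normally generates the channel name, but if you need to build one by hand, a helper along these lines keeps names consistent with the example above. This is a sketch under assumptions: the function name is hypothetical, and the 80-character cap reflects Slack's commonly documented channel-name limit rather than anything stated in this checklist.

```python
import re
from datetime import date


def incident_channel_name(day: date, service: str) -> str:
    """Build a name like 'inc-2024-11-14-checkout' (without the leading '#')."""
    # Lowercase the service and collapse anything that is not a-z or 0-9
    # into single hyphens, then trim hyphens from both ends.
    slug = re.sub(r"[^a-z0-9]+", "-", service.lower()).strip("-")
    name = f"inc-{day.isoformat()}-{slug}"
    # Assumed limit: keep the name within 80 characters.
    return name[:80].rstrip("-")
```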

    Record the user-visible symptom, the alert that fired, and the affected service in plain language. Pin Datadog / Sentry / Grafana links in the channel. Keep it factual — hypotheses come later.

Roles and Communication

    The incident commander (IC) drives the response and makes calls; Comms owns Statuspage and customer messaging; Scribe maintains the channel timeline. For SEV1 these are three different people: the IC should not also be debugging the database.

    Customer-visible degradation gets a Statuspage post within 30 minutes; silence costs more trust than admitting an issue. SEV3 internal-only incidents typically don't need one.

    Comms posts the first update: what users will experience, that the team is investigating, and when the next update will be. Commit to a cadence (every 30 min for SEV1) and hold to it even when there's nothing new.
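
    To hold a fixed cadence, it helps to compute the next update time from the first post rather than from whenever the last update happened to go out. A minimal sketch, assuming a 30-minute SEV1 interval as described above; the function name and the fixed-grid approach are illustrative choices, not part of any Statuspage API.

```python
from datetime import datetime, timedelta


def next_update_due(
    first_post: datetime,
    now: datetime,
    interval: timedelta = timedelta(minutes=30),
) -> datetime:
    """Return the next slot on a fixed grid anchored at the first post."""
    if now < first_post:
        return first_post
    # Count whole intervals elapsed so far, then move to the following slot.
    slots_elapsed = (now - first_post) // interval
    return first_post + (slots_elapsed + 1) * interval
```

    Anchoring to the first post means a late update does not push every later update back.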

    Drop a brief in #cs-incidents with the symptom, the Statuspage link, and the talking points support should use. Enterprise CSMs may need to call named accounts directly for SEV1.

Containment

    Check the deploy log, recently merged PRs, feature flag flips, and infra changes (Terraform applies) in the last 24 hours. "What changed?" is the first hypothesis on most incidents.
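
    If change events from the different sources can be exported into one list, the "what changed?" question becomes a simple time-window filter. The sketch below assumes a hypothetical `Change` record; how you populate it from your deploy log, Git host, flag service, or Terraform runs is up to your tooling.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class Change:
    # Hypothetical record; kind might be "deploy", "pr_merge",
    # "flag_flip", or "terraform_apply".
    kind: str
    description: str
    at: datetime


def recent_changes(
    changes: List[Change],
    incident_start: datetime,
    window: timedelta = timedelta(hours=24),
) -> List[Change]:
    """Changes in the window before the incident, most recent first."""
    cutoff = incident_start - window
    in_window = [c for c in changes if cutoff <= c.at <= incident_start]
    return sorted(in_window, key=lambda c: c.at, reverse=True)
```

    Sorting newest first puts the most likely suspects at the top of the list.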

    Redeploy the previous container image via the standard pipeline; do not hand-edit production. If a database migration shipped with the deploy, confirm it's reversible before rolling back — irreversible migrations need a forward-fix instead.

    If the suspect change was behind a flag, kill-switch it in LaunchDarkly / Statsig / Unleash. Faster than a redeploy and reversible. Note the flag name and timestamp in the incident channel.

    Drain unhealthy pods, fail over to a replica, scale out the Auto Scaling group (ASG), or shed load with a circuit breaker. The goal is to stop the bleeding; full root cause analysis comes after customers stop being affected.

Eradication and Recovery

    Watch the golden signals dashboard (latency, traffic, errors, saturation) for 30 minutes of clean baseline before declaring recovery. Premature "all clear" calls undermine the team's credibility on the next incident.
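
    The 30-minute rule can be expressed as "the current unbroken streak of healthy samples is at least 30 minutes long". The following is a sketch only: the `Sample` fields and every threshold value are hypothetical placeholders to be replaced with your own SLO-derived numbers, and it does not detect gaps in the metric stream.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Sample:
    at: datetime
    latency_ms: float
    requests_per_s: float
    error_rate: float   # fraction of failed requests, 0.0 to 1.0
    saturation: float   # fraction of capacity in use, 0.0 to 1.0


@dataclass(frozen=True)
class Thresholds:
    # Placeholder values for illustration only.
    max_latency_ms: float = 500.0
    min_requests_per_s: float = 1.0
    max_error_rate: float = 0.01
    max_saturation: float = 0.8


def is_clean(s: Sample, t: Thresholds) -> bool:
    return (
        s.latency_ms <= t.max_latency_ms
        and s.requests_per_s >= t.min_requests_per_s
        and s.error_rate <= t.max_error_rate
        and s.saturation <= t.max_saturation
    )


def clean_since(samples: List[Sample], t: Thresholds) -> Optional[datetime]:
    """Time the current unbroken clean streak began, or None if not clean."""
    streak_start = None
    for s in sorted(samples, key=lambda s: s.at):
        if is_clean(s, t):
            if streak_start is None:
                streak_start = s.at
        else:
            streak_start = None  # any bad sample resets the clock
    return streak_start


def can_declare_recovery(
    samples: List[Sample],
    now: datetime,
    window: timedelta = timedelta(minutes=30),
    t: Thresholds = Thresholds(),
) -> bool:
    start = clean_since(samples, t)
    return start is not None and now - start >= window
```

    Resetting the streak on any unhealthy sample is deliberately conservative: a single error spike restarts the 30-minute clock.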

    Trigger the synthetic checks against production — login, checkout, the critical user paths. Manual spot-check anything the synthetics don't cover.

    If the incident touched the database — failed writes, partial transactions, replica lag — confirm read replicas have caught up and run any reconciliation queries. Silent data corruption is a much worse outcome than a visible outage.

    Comms posts the resolution update with a one-sentence summary of impact and remediation. The IC closes the incident in PagerDuty / Incident.io, which captures the timeline for the post-incident review (PIR).

Post-Incident Review

    Hold the review within 5 business days, while memory is fresh. Invite the responders, the service owners, and one engineer who wasn't involved (fresh eyes). Keep the framing blameless: focus on systems and signals, not individuals.

    Pull timestamps for: first symptom, first alert, page acknowledged, IC declared, mitigation applied, recovery confirmed. These feed mean time to detect (MTTD) and mean time to recovery (MTTR), the metrics that drive alerting and runbook investments.
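
    Given the six timestamps for each incident, the two metrics are plain averages over incidents. Definitions vary between teams; this sketch assumes time to detect runs from first symptom to first alert and time to recover runs from first symptom to confirmed recovery. The `Timeline` class and function names are illustrative.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean
from typing import Iterable, List


@dataclass
class Timeline:
    first_symptom: datetime
    first_alert: datetime
    page_acknowledged: datetime
    ic_declared: datetime
    mitigation_applied: datetime
    recovery_confirmed: datetime

    def time_to_detect(self) -> timedelta:
        # Assumed definition: symptom onset to first alert.
        return self.first_alert - self.first_symptom

    def time_to_recover(self) -> timedelta:
        # Assumed definition: symptom onset to confirmed recovery.
        return self.recovery_confirmed - self.first_symptom


def _mean_minutes(durations: Iterable[timedelta]) -> float:
    return mean(d.total_seconds() for d in durations) / 60


def mttd_minutes(timelines: List[Timeline]) -> float:
    return _mean_minutes(t.time_to_detect() for t in timelines)


def mttr_minutes(timelines: List[Timeline]) -> float:
    return _mean_minutes(t.time_to_recover() for t in timelines)
```

    The intermediate timestamps (acknowledged, IC declared, mitigation applied) are kept on the record so the same data can show where the time went, for example alert-to-ack delay.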

    Resist "root cause" framing — most incidents have multiple contributing factors. Look at: the change that triggered, the gap that let it through (test coverage, review, canary), and the gap in detection (alert tuning, missing dashboard).

    Each action item gets a Jira / Linear ticket, a named owner, and a target date. Track to closure: unclosed PIR action items are a strong predictor of the same incident recurring.
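
    Tracking to closure is easier with a periodic check for items that are past their target date. A minimal sketch, assuming action items have been exported from the ticketing system into a hypothetical `ActionItem` record; it does not call the Jira or Linear APIs.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass
class ActionItem:
    ticket: str   # ticket key from your tracker (illustrative)
    owner: str    # a named person, not a team alias
    due: date
    closed: bool = False

    def __post_init__(self) -> None:
        # Every action item needs a named owner.
        if not self.owner.strip():
            raise ValueError(f"{self.ticket} has no named owner")


def overdue(items: List[ActionItem], today: date) -> List[ActionItem]:
    """Open items whose target date has passed, oldest first."""
    late = [i for i in items if not i.closed and i.due < today]
    return sorted(late, key=lambda i: i.due)
```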

    Post the writeup to the engineering wiki (Confluence / Notion) and link it from the incident channel. For SEV1s with customer impact, draft a customer-facing summary for the CSM team to share on request.
