Incident Response Checklist

Runbook for the on-call engineer and incident commander to triage, contain, and recover from a production incident, then close the loop with a blameless post-incident review.

5 sections 21 steps Collects data
1

Detection and Triage

  1. Acknowledge the PagerDuty alert
    • The on-call engineer acknowledges the page within 5 minutes to stop the escalation timer. If you can't take it, escalate or reassign it so it routes to the secondary. An ack that leads to no action is the worst outcome: it stops the escalation while nobody is working the incident.

  2. Assign a severity level
    • SEV1 = customer-facing outage or data loss; SEV2 = major degradation or partial outage; SEV3 = minor / single-tenant impact. When in doubt, declare up — it's cheaper to downgrade than to retroactively realize a SEV2 was a SEV1.

  3. Open the incident channel
    • Spin up a dedicated Slack channel (e.g., #inc-2024-11-14-checkout) via Incident.io or FireHydrant. All decisions, timestamps, and graph links go here — not DMs — so the post-incident review has a single source of truth.

  4. Capture the initial symptoms
    • Record the user-visible symptom, the alert that fired, and the affected service in plain language. Pin Datadog / Sentry / Grafana links in the channel. Keep it factual — hypotheses come later.
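
The severity rubric and the channel naming convention above can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed implementation: the function names `classify` and `channel_name`, the boolean inputs, and the `uncertain` flag are all hypothetical, and the channel format is inferred from the single example `#inc-2024-11-14-checkout`.

```python
from datetime import date
from enum import IntEnum


class Severity(IntEnum):
    SEV1 = 1  # customer-facing outage or data loss
    SEV2 = 2  # major degradation or partial outage
    SEV3 = 3  # minor / single-tenant impact


def classify(customer_outage: bool, data_loss: bool,
             major_degradation: bool, uncertain: bool = False) -> Severity:
    """Map observed impact to a severity level (hypothetical helper)."""
    if customer_outage or data_loss:
        sev = Severity.SEV1
    elif major_degradation:
        sev = Severity.SEV2
    else:
        sev = Severity.SEV3
    # "When in doubt, declare up": bump one level, never past SEV1.
    if uncertain and sev > Severity.SEV1:
        sev = Severity(sev - 1)
    return sev


def channel_name(day: date, service: str) -> str:
    """Build a channel name in the assumed #inc-YYYY-MM-DD-service format."""
    return f"#inc-{day.isoformat()}-{service.lower()}"
```

For example, `classify(False, False, True, uncertain=True)` declares up from SEV2 to SEV1, and `channel_name(date(2024, 11, 14), "checkout")` reproduces the channel name used in step 3.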

2

Roles and Communication

  1. Assign Incident Commander, Comms, and Scribe
    • The IC drives the response and makes calls; Comms owns Statuspage and customer messaging; Scribe maintains the channel timeline. For SEV1 these are three different people — the IC should not also be debugging the database.

  2. Decide whether to post to Statuspage
    • Customer-visible degradation gets a Statuspage post within 30 minutes — silence costs more trust than admitting an issue. SEV3 internal-only incidents typically don't.

  3. Publish the initial Statuspage update
    • Comms posts the first update: what users will experience, that the team is investigating, and when the next update will be. Commit to a cadence (every 30 min for SEV1) and hold to it even when there's nothing new.

  4. Notify support and account managers
    • Drop a brief in #cs-incidents with the symptom, the Statuspage link, and the talking points support should use. Enterprise CSMs may need to call named accounts directly for SEV1.
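
The Statuspage timing rules in this section (first post within 30 minutes, then a fixed cadence) can be tracked with simple timestamp arithmetic. The sketch below assumes timezone-aware timestamps. The function names are hypothetical, and only the 30-minute SEV1 cadence comes from the checklist; the SEV2 value is a placeholder to replace with your own policy.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# First customer-facing post is due within 30 minutes (from the checklist).
FIRST_POST_DEADLINE = timedelta(minutes=30)

# SEV1 cadence is from the checklist; the SEV2 value is an assumed placeholder.
UPDATE_CADENCE = {
    "SEV1": timedelta(minutes=30),
    "SEV2": timedelta(minutes=60),
}


def needs_statuspage_post(customer_visible: bool) -> bool:
    """Customer-visible degradation gets a post; internal-only incidents typically don't."""
    return customer_visible


def first_post_deadline(detected_at: datetime) -> datetime:
    """Latest time for the initial update."""
    return detected_at + FIRST_POST_DEADLINE


def next_update_due(last_update_at: datetime, severity: str) -> Optional[datetime]:
    """When the next update is owed, or None if no cadence is defined."""
    cadence = UPDATE_CADENCE.get(severity)
    return None if cadence is None else last_update_at + cadence


def is_overdue(last_update_at: datetime, severity: str, now: datetime) -> bool:
    """True once the committed cadence has been missed."""
    due = next_update_due(last_update_at, severity)
    return due is not None and now > due
```

A reminder bot or the Comms owner could call `is_overdue` on a timer, so the cadence holds even when there is nothing new to report.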

3

Containment

  1. Identify the most recent deploy or change
    • Check the deploy log, recently merged PRs, feature flag flips, and infra changes (Terraform applies) in the last 24 hours. "What changed?" is the first hypothesis on most incidents.

  2. Roll back the offending deploy
    • Redeploy the previous container image via the standard pipeline; do not hand-edit production. If a database migration shipped with the deploy, confirm it's reversible before rolling back — irreversible migrations need a forward-fix instead.

  3. Disable the suspect feature flag
    • If the suspect change was behind a flag, kill-switch it in LaunchDarkly / Statsig / Unleash. Faster than a redeploy and reversible. Note the flag name and timestamp in the incident channel.

  4. Isolate or scale the affected service
    • Drain unhealthy pods, fail over to a replica, scale the ASG, or shed load with a circuit breaker. The goal is to stop the bleed — full root cause analysis comes after customers stop being affected.
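
The "what changed?" check and the choice between flag kill-switch, rollback, and forward-fix can be expressed as a small decision helper. This is a sketch under stated assumptions: the `Change` record, its fields, and the returned action labels are hypothetical, and preferring the flag kill-switch over a rollback is an interpretation of "faster than a redeploy", not a rule from the checklist.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import List, Optional


@dataclass
class Change:
    kind: str   # e.g. "deploy", "pr_merge", "flag_flip", "terraform_apply"
    name: str
    at: datetime
    behind_flag: bool = False
    # None means no database migration shipped with the change.
    migration_reversible: Optional[bool] = None


def recent_changes(changes: List[Change], now: datetime,
                   window: timedelta = timedelta(hours=24)) -> List[Change]:
    """Changes inside the lookback window, newest first."""
    cutoff = now - window
    in_window = [c for c in changes if cutoff <= c.at <= now]
    return sorted(in_window, key=lambda c: c.at, reverse=True)


def containment_action(change: Change) -> str:
    """Pick a containment path for a suspect change (assumed ordering)."""
    if change.behind_flag:
        return "disable_flag"   # fastest and reversible
    if change.migration_reversible is False:
        return "forward_fix"    # an irreversible migration rules out rollback
    return "rollback"           # redeploy the previous image via the pipeline
```

Sorting newest first reflects the usual heuristic that the most recent change is the first suspect; the helper only suggests a path, and the IC still makes the call.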

4

Eradication and Recovery

  1. Confirm error rate and p99 latency are recovered
    • Watch the golden signals dashboard (latency, traffic, errors, saturation) for 30 minutes of clean baseline before declaring recovery. Premature "all clear" calls undermine the team's credibility on the next incident.

  2. Run the post-deploy smoke tests
    • Trigger the synthetic checks against production — login, checkout, the critical user paths. Manual spot-check anything the synthetics don't cover.

  3. Verify data integrity and replication lag
    • If the incident touched the database — failed writes, partial transactions, replica lag — confirm read replicas have caught up and run any reconciliation queries. Silent data corruption is a much worse outcome than a visible outage.

  4. Resolve the incident and update Statuspage
    • Comms posts the resolution update with a one-sentence summary of impact and remediation. The IC closes the incident in PagerDuty / Incident.io, which captures the timeline for the PIR.
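
The "30 minutes of clean baseline" rule from step 1 can be checked mechanically against exported metric samples. The sketch below is illustrative only: the `Sample` record, the function name `is_recovered`, and the threshold parameters are hypothetical, and the thresholds themselves should come from your own service-level objectives.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class Sample:
    at: datetime
    error_rate: float       # fraction of failing requests, 0.0 to 1.0
    p99_latency_ms: float


def is_recovered(samples: List[Sample], now: datetime,
                 max_error_rate: float, max_p99_ms: float,
                 window: timedelta = timedelta(minutes=30)) -> bool:
    """True only if every sample in the trailing window is within thresholds."""
    window_start = now - window
    ordered = sorted(samples, key=lambda s: s.at)
    # Without data reaching back to the window start, a clean window
    # cannot be claimed yet.
    if not ordered or ordered[0].at > window_start:
        return False
    recent = [s for s in ordered if s.at >= window_start]
    return bool(recent) and all(
        s.error_rate <= max_error_rate and s.p99_latency_ms <= max_p99_ms
        for s in recent
    )
```

Requiring data that covers the full window is a deliberate choice: ten clean minutes after a rollback returns `False`, which guards against a premature "all clear".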

5

Post-Incident Review

  1. Schedule the blameless PIR
    • Hold the review within 5 business days, while memory is fresh. Invite the responders, the service owners, and one engineer who wasn't involved (fresh eyes). Keep the framing blameless: focus on systems and signals, not individuals.

  2. Reconstruct the timeline from the channel
    • Pull timestamps for: first symptom, first alert, page acknowledged, IC declared, mitigation applied, recovery confirmed. These feed MTTD and MTTR — the metrics that drive alerting and runbook investments.

  3. Identify contributing factors
    • Resist "root cause" framing, since most incidents have multiple contributing factors. Look at the change that triggered the incident, the gap that let it through (test coverage, review, canary), and the gap in detection (alert tuning, missing dashboard).

  4. File action items with named owners
    • Each action item gets a Jira / Linear ticket, a named owner, and a target date. Track each one to closure: action items left open after a PIR leave the same failure path in place, which makes a repeat incident more likely.

  5. Publish the PIR document
    • Post the writeup to the engineering wiki (Confluence / Notion) and link it from the incident channel. For SEV1s with customer impact, draft a customer-facing summary for the CSM team to share on request.
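
The timeline milestones from step 2 translate directly into detection and recovery durations. The sketch below assumes one common convention: time to detect runs from first symptom to first alert, and time to recover runs from first symptom to confirmed recovery. Teams define these differently, so treat the definitions, the dictionary keys, and the function names as hypothetical.

```python
from datetime import datetime, timedelta
from statistics import mean
from typing import Dict, Iterable

# Milestones in the order the checklist lists them.
MILESTONES = (
    "first_symptom", "first_alert", "page_acknowledged",
    "ic_declared", "mitigation_applied", "recovery_confirmed",
)


def check_order(timeline: Dict[str, datetime]) -> None:
    """Raise ValueError if a milestone is missing or out of sequence."""
    previous = None
    for key in MILESTONES:
        if key not in timeline:
            raise ValueError(f"missing milestone: {key}")
        if previous is not None and timeline[key] < timeline[previous]:
            raise ValueError(f"{key} is earlier than {previous}")
        previous = key


def time_to_detect(timeline: Dict[str, datetime]) -> timedelta:
    """First symptom to first alert (assumed definition)."""
    return timeline["first_alert"] - timeline["first_symptom"]


def time_to_recover(timeline: Dict[str, datetime]) -> timedelta:
    """First symptom to confirmed recovery (assumed definition)."""
    return timeline["recovery_confirmed"] - timeline["first_symptom"]


def mean_minutes(durations: Iterable[timedelta]) -> float:
    """Average a set of per-incident durations, in minutes."""
    return mean(d.total_seconds() for d in durations) / 60
```

MTTD and MTTR are then `mean_minutes` over the per-incident values from several incidents; a single incident yields one data point, not a mean.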

Category Software Development