Cloud Outage Response

Detection & Verification

    When a Better Stack alert fires, capture the incident ID, start time, alert source, and initial scope.
    Confirm AWS disruption and note affected regions/services. Record the official status URL.
    Confirm alerting tools (Better Stack, Datadog, CloudWatch) aren’t producing false positives.
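
    The fields captured during detection can be kept in one small record so nothing is lost when the incident is handed over. The following is a minimal sketch, not a prescribed format: the class name, field names, the sample incident ID, and the "at least two tools agree" threshold are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class IncidentRecord:
    """Holds the facts the detection step asks for (field names are illustrative)."""
    incident_id: str
    started_at: datetime
    alert_source: str                      # e.g. "Better Stack"
    initial_scope: str                     # free-text description of what is affected
    affected_regions: list[str] = field(default_factory=list)
    affected_services: list[str] = field(default_factory=list)
    status_url: str = ""                   # official status URL, filled in once recorded
    confirmed_by: set[str] = field(default_factory=set)  # tools that agree on the outage

    def is_verified(self, min_sources: int = 2) -> bool:
        """Assumed rule: the alert is not a false positive once enough tools agree."""
        return len(self.confirmed_by) >= min_sources


record = IncidentRecord(
    incident_id="2024-001",                # hypothetical ID format
    started_at=datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc),
    alert_source="Better Stack",
    initial_scope="Elevated API errors",
)
record.confirmed_by.update({"Better Stack", "Datadog"})
```

    Here `record.is_verified()` is true because two independent tools report the problem; a team that prefers a stricter or looser rule can change `min_sources`.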

Internal Communication

    Confirm who is acting as Incident Commander, Comms Lead, Tech Lead, and Customer Support Lead.
    Create #inc-[incident-id]. Set update cadence (e.g., every 15 min). Post kickoff with next-update time.
    Notify CTO and CS lead; include link to incident channel and current impact.
    Capture updates, hypotheses, and actions in the incident timeline, each with a timestamp.
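
    The channel name and the next-update time in the kickoff post follow directly from the incident ID and the chosen cadence. A small sketch of that arithmetic is shown here; the function names, the lowercasing of the ID, and the wording of the kickoff message are assumptions, and the 15-minute default simply mirrors the example cadence above.

```python
from datetime import datetime, timedelta, timezone


def channel_name(incident_id: str) -> str:
    """Build the channel name from the #inc-[incident-id] pattern (lowercasing is assumed)."""
    return f"#inc-{incident_id.strip().lower()}"


def next_update_time(last_update: datetime, cadence_minutes: int = 15) -> datetime:
    """Return when the next update is due, given the agreed cadence."""
    return last_update + timedelta(minutes=cadence_minutes)


def kickoff_message(incident_id: str, impact: str, now: datetime,
                    cadence_minutes: int = 15) -> str:
    """Compose a kickoff post that states the next-update time in UTC.

    `now` should be a timezone-aware datetime so the UTC label is accurate.
    """
    due = next_update_time(now, cadence_minutes).astimezone(timezone.utc)
    return (f"{channel_name(incident_id)} opened. Impact: {impact}. "
            f"Next update at {due:%H:%M} UTC.")
```

    Computing the next-update time from the last post, rather than from the incident start, keeps the promise accurate even when an update goes out late.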

Customer Communication

    Draft the customer message: summarize impact, affected features, workarounds, and next update time.
    Post the message and ETA to the Better Stack status page. Link to the status page in all communications.
    Show a banner with the message to impacted customer segments; link to the status page.
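
    A fixed template helps every customer update carry the same four elements: impact, affected features, workarounds, and next update time. The sketch below is one possible shape for such a template; the function name, the labels, and the fallback text for missing workarounds are assumptions, and it only builds the text (it does not call any status page API).

```python
def status_message(impact: str, affected_features: list[str],
                   workarounds: list[str], next_update: str) -> str:
    """Render a customer-facing update with the four required elements."""
    lines = [
        f"Impact: {impact}",
        "Affected features: " + ", ".join(affected_features),
    ]
    if workarounds:
        lines.append("Workarounds: " + "; ".join(workarounds))
    else:
        # Say so explicitly rather than leaving the line out.
        lines.append("Workarounds: none available at this time")
    lines.append(f"Next update: {next_update}")
    return "\n".join(lines)
```

    The same rendered text can be reused for the status page post and the banner, which keeps the two from drifting apart.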

Mitigation & Monitoring

    Pause non-critical tasks, queue background jobs, toggle feature flags. Record actions.
    Track AWS recovery status and internal metrics. Update the timeline and ETA at each update interval.
    Verify databases, queues, and integrations are healthy before marking resolved.
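
    The last step above is effectively a gate: the incident is resolved only when every health check passes. A minimal sketch of that gate follows, assuming each check is a zero-argument function returning True when healthy. The check names and helper functions are hypothetical, and a check that raises is treated as failing.

```python
from typing import Callable, Dict, List


def failing_checks(checks: Dict[str, Callable[[], bool]]) -> List[str]:
    """Run every check and return the names of those that fail or raise."""
    failed = []
    for name, check in checks.items():
        try:
            healthy = check()
        except Exception:
            healthy = False  # an erroring check counts as unhealthy
        if not healthy:
            failed.append(name)
    return failed


def can_mark_resolved(checks: Dict[str, Callable[[], bool]]) -> bool:
    """Allow resolution only when no check fails."""
    return not failing_checks(checks)
```

    Returning the list of failing checks, rather than a bare yes/no, gives the Tech Lead something concrete to post in the incident channel when the gate stays closed.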

Post-Incident Review

    Apply learnings to this checklist, status templates, and support macros.