Cloud Outage Response
Playbook for handling major cloud provider outages (e.g., AWS): detection, comms, mitigation, and review.
Detection & Verification
-
Acknowledge and log alert
- When a Better Stack alert triggers, capture the incident ID, start time, alert source, and initial scope.
Collects paragraph
-
Verify cloud provider status
- Confirm AWS disruption and note affected regions/services. Record the official status URL.
Collects URL
-
Set severity (SEV-1/2/3)
Collects list
-
Validate monitoring integrity
- Confirm alerting tools (Better Stack, Datadog, CloudWatch) aren’t producing false positives.
-
Enter Incident ID
Collects text
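The fields collected in the detection steps above can be kept together in one small record. The following is a minimal sketch, not part of the original template: the class and field names (`IncidentRecord`, `Severity`, `needs_leadership_escalation`) are hypothetical, and the escalation rule simply mirrors the "Escalate to leadership (SEV-1/2)" step later in this checklist.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from enum import Enum


class Severity(Enum):
    """Severity levels named in the checklist (SEV-1/2/3)."""
    SEV1 = "SEV-1"
    SEV2 = "SEV-2"
    SEV3 = "SEV-3"


@dataclass
class IncidentRecord:
    """Hypothetical container for the details captured during detection."""
    incident_id: str
    started_at: datetime
    alert_source: str
    initial_scope: str
    severity: Severity
    provider_status_url: str = ""          # official status URL, once verified
    affected_regions: list[str] = field(default_factory=list)

    def needs_leadership_escalation(self) -> bool:
        # Assumption: only SEV-1 and SEV-2 are escalated, as in the checklist.
        return self.severity in (Severity.SEV1, Severity.SEV2)


record = IncidentRecord(
    incident_id="INC-1042",                # illustrative value
    started_at=datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc),
    alert_source="Better Stack",
    initial_scope="Elevated API error rates",
    severity=Severity.SEV1,
)
```

Keeping severity as an enum rather than free text makes it harder to record a level the playbook does not define.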
Internal Communication
-
Assign incident roles
- Confirm Incident Commander, Comms Lead, Tech Lead, and Customer Support Lead.
-
Create Slack incident channel
- Create #incident-{{incident ID}}.
- Post in the #incidents channel to open the conversation, then share the incident channel with team members.
- Set the update cadence (e.g., every 15 min) and post a kickoff message with the next-update time.
-
Escalate to leadership (SEV-1/2)
- Notify CTO and CS lead; include link to incident channel and current impact.
-
Start an internal status doc
- Capture updates, hypotheses, actions, and timestamps.
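The channel-naming and update-cadence steps above can be sketched in a few lines. This is an illustration only: the helper names are hypothetical, the lowercase slug reflects Slack's channel-name restrictions as I understand them (lowercase letters, digits, hyphens, underscores), and nothing here calls the Slack API.

```python
import re
from datetime import datetime, timedelta, timezone


def incident_channel_name(incident_id: str) -> str:
    """Build the '#incident-<id>' display name from an incident ID."""
    # Lowercase, then collapse anything outside [a-z0-9_-] into a hyphen.
    slug = re.sub(r"[^a-z0-9_-]+", "-", incident_id.lower()).strip("-")
    return f"#incident-{slug}"


def next_update_time(last_update: datetime, cadence_minutes: int = 15) -> datetime:
    """Return when the next update is due for the chosen cadence."""
    return last_update + timedelta(minutes=cadence_minutes)


def kickoff_message(incident_id: str, posted_at: datetime, cadence_minutes: int = 15) -> str:
    """Compose a kickoff post that states the cadence and next-update time."""
    due = next_update_time(posted_at, cadence_minutes)
    return (
        f"Incident {incident_id} opened in {incident_channel_name(incident_id)}. "
        f"Updates every {cadence_minutes} min; next update at {due:%H:%M} UTC."
    )


posted = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(kickoff_message("INC 1042", posted))
```

Stating the next-update time in every post, rather than only the cadence, lets readers who join late know exactly when to expect news.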
Customer Communication
-
Draft customer status message
- Summarize impact, affected features, workarounds, and next update time.
-
Product manager and CTO approve status message
-
Publish status page update
- Post the message and ETA to the Better Stack status page. Link to the status page in all comms.
-
Notify in-app users via Intercom
- Show banner and message to impacted segments; link to status page.
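The status message drafted above has four parts: impact, affected features, workarounds, and next update time. One possible way to keep those parts consistent across updates is a small formatter. This is a sketch under assumptions: the function name and the output layout are hypothetical, not an approved template.

```python
from datetime import datetime, timezone


def draft_status_message(impact, affected_features, workarounds, next_update):
    """Render the four elements of a customer status message as plain text."""
    lines = [
        f"Impact: {impact}",
        "Affected features: " + ", ".join(affected_features),
        # Say so explicitly when there is no workaround yet.
        "Workarounds: " + ("; ".join(workarounds) if workarounds else "none at this time"),
        f"Next update: {next_update:%Y-%m-%d %H:%M} UTC",
    ]
    return "\n".join(lines)


message = draft_status_message(
    impact="Some users cannot sign in",          # illustrative values
    affected_features=["authentication", "payments"],
    workarounds=[],
    next_update=datetime(2024, 1, 1, 12, 15, tzinfo=timezone.utc),
)
print(message)
```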
Mitigation & Monitoring
-
Apply mitigations
- Pause non-critical tasks, queue background jobs, toggle feature flags. Record actions.
-
Monitor provider recovery
- Track AWS recovery status and internal metrics. Update the timeline and ETA at each scheduled update.
-
Validate recovery
- Verify databases, queues, and integrations are healthy before marking resolved.
-
Product team updates users via Intercom
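The "Validate recovery" step above requires databases, queues, and integrations to all be healthy before the incident is marked resolved. A minimal sketch of that gate follows; the check functions are hypothetical placeholders for whatever health probes your team already has.

```python
from typing import Callable, Dict, List, Tuple


def validate_recovery(checks: Dict[str, Callable[[], bool]]) -> Tuple[bool, List[str]]:
    """Run every health check; return (all_healthy, names_of_failing_checks)."""
    failing = []
    for name, check in checks.items():
        try:
            healthy = bool(check())
        except Exception:
            # A check that raises is treated as unhealthy, not as a pass.
            healthy = False
        if not healthy:
            failing.append(name)
    return (not failing, failing)


def broken_probe() -> bool:
    raise ConnectionError("integration unreachable")  # simulated failure


ok, failing = validate_recovery({
    "databases": lambda: True,       # placeholder probes for illustration
    "queues": lambda: True,
    "integrations": broken_probe,
})
print(ok, failing)
```

Running all checks instead of stopping at the first failure gives the Tech Lead the full list of unhealthy components in one pass.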
Post-Incident Review
-
Document timeline and lessons learned
-
Update workflow and templates
- Apply learnings to this checklist, status templates, and support macros.
-
Schedule team debrief
-
Collect MTTA / MTTR metrics
-
Confirm all users notified
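For the "Collect MTTA / MTTR metrics" step above, the figures can be computed from the timestamps recorded during each incident. The sketch below assumes MTTA is the mean time from alert to acknowledgement and MTTR is the mean time from alert to resolution; teams define MTTR in different ways, so adjust the start and end points to match your own definition. The `IncidentTimes` class is hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean


@dataclass
class IncidentTimes:
    """Hypothetical timestamps taken from an incident timeline."""
    triggered_at: datetime
    acknowledged_at: datetime
    resolved_at: datetime


def mtta_minutes(incidents):
    """Mean time to acknowledge, in minutes (alert to acknowledgement)."""
    return mean((i.acknowledged_at - i.triggered_at).total_seconds() / 60 for i in incidents)


def mttr_minutes(incidents):
    """Mean time to resolve, in minutes (alert to resolution; an assumption)."""
    return mean((i.resolved_at - i.triggered_at).total_seconds() / 60 for i in incidents)


history = [
    IncidentTimes(datetime(2024, 1, 1, 12, 0), datetime(2024, 1, 1, 12, 4), datetime(2024, 1, 1, 13, 0)),
    IncidentTimes(datetime(2024, 2, 1, 9, 0), datetime(2024, 2, 1, 9, 6), datetime(2024, 2, 1, 10, 30)),
]
print(mtta_minutes(history), mttr_minutes(history))
```

Note that `statistics.mean` raises `StatisticsError` on an empty list, so guard the call if there may be no incidents in the reporting period.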
About this workflow
Cloud outage response guide
Use this checklist to coordinate detection, communication, mitigation, and recovery during AWS, Azure, or GCP disruptions. Assign roles early, keep updates frequent, and record every action.
When to use
- Provider status page reports a service disruption
- Monitoring shows widespread failures across regions or services
- Customer-visible impact such as authentication, payments, or core features
Before you begin
- Alerts wired: Better Stack, Datadog, CloudWatch
- Slack incident channel automation ready (#inc-[id])
- Status page template approved (PM + CTO)