Cloud Outage Response

Playbook for handling major cloud provider outages (e.g., AWS): detection, comms, mitigation, and review.

5 sections, 22 steps

Detection & Verification

Acknowledge and log alert
- Better Stack alert triggered. Capture incident ID, start time, alert source, and initial scope.
Verify cloud provider status
- Confirm AWS disruption and note affected regions/services. Record the official status URL.
Set severity (SEV-1/2/3)

Validate monitoring integrity
- Confirm alerting tools (Better Stack, Datadog, CloudWatch) aren’t producing false positives.
Enter Incident ID

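The detection steps above each capture a piece of data (incident ID, start time, alert source, scope, status URL, severity). A minimal sketch of how that data could be held in one record is shown here. The `IncidentRecord` class, its field names, and the example ID are hypothetical and not part of this template; only the SEV-1/2/3 labels come from the checklist.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List

# Severity labels come from the checklist; everything else is illustrative.
SEVERITIES = ("SEV-1", "SEV-2", "SEV-3")


@dataclass
class IncidentRecord:
    """Hypothetical container for the data the detection steps collect."""
    incident_id: str
    started_at: datetime
    alert_source: str
    initial_scope: str
    provider_status_url: str = ""
    affected: List[str] = field(default_factory=list)  # regions/services
    severity: str = ""

    def set_severity(self, severity: str) -> None:
        # Reject anything outside the three levels named in the checklist.
        if severity not in SEVERITIES:
            raise ValueError(f"unknown severity: {severity!r}")
        self.severity = severity

    def missing_fields(self) -> List[str]:
        """Return the detection fields that are still empty."""
        required = {
            "provider_status_url": self.provider_status_url,
            "affected": self.affected,
            "severity": self.severity,
        }
        return [name for name, value in required.items() if not value]


record = IncidentRecord(
    incident_id="INC-0001",  # made-up example ID
    started_at=datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc),
    alert_source="Better Stack",
    initial_scope="Elevated API error rate",
)
record.set_severity("SEV-1")
print(record.missing_fields())  # fields still to be filled in
```

A check like `missing_fields()` could be run before moving on to Internal Communication, so no detection step is skipped under pressure.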

Internal Communication

Assign incident roles
- Confirm Incident Commander, Comms Lead, Tech Lead, and Customer Support Lead.
Create Slack incident channel
- Create #incident-{{incident ID}}.
  Announce it in the #incidents channel to start the conversation, and share the incident channel with team members.
  Set an update cadence (e.g., every 15 min). Post a kickoff message with the next update time.
Escalate to leadership (SEV-1/2)
- Notify CTO and CS lead; include link to incident channel and current impact.
Start an internal status doc
- Capture updates, hypotheses, actions, and timestamps.
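The channel naming and update cadence described in this section can be sketched as two small helpers. This is an illustration under assumptions: the function names are hypothetical, the sanitization rule (lowercase letters, digits, and hyphens only) is a conservative guess at what Slack accepts, and the message wording is a placeholder. Nothing here calls the Slack API.

```python
import re
from datetime import datetime, timedelta, timezone


def incident_channel_name(incident_id: str) -> str:
    """Build '#incident-<id>' using only lowercase letters, digits, hyphens."""
    slug = re.sub(r"[^a-z0-9-]+", "-", incident_id.lower()).strip("-")
    return f"#incident-{slug}"


def kickoff_message(incident_id: str, now: datetime,
                    cadence_minutes: int = 15) -> str:
    """Compose a kickoff post that states the cadence and next update time."""
    next_update = now + timedelta(minutes=cadence_minutes)
    return (
        f"Incident {incident_id} declared. "
        f"Updates every {cadence_minutes} min. "
        f"Next update at {next_update:%H:%M} UTC."
    )


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)  # example timestamp
print(incident_channel_name("INC 0001"))
print(kickoff_message("INC-0001", now))
```

Computing the next update time in code, rather than typing it by hand, keeps the posted time consistent with the agreed cadence.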

Customer Communication

Draft customer status message
- Summarize impact, affected features, workarounds, and next update time.
Product manager and CTO approve status message
Publish status page update
- Post the message and ETA to the Better Stack status page. Link to the status page in all comms.
Notify in-app users via Intercom
- Show banner and message to impacted segments; link to status page.
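The drafting and approval steps above can be sketched as a message renderer plus an approval gate. This is a hypothetical illustration: the function names, the message layout, and the approver labels are assumptions, and the code does not publish anything to Better Stack or Intercom.

```python
from datetime import datetime, timezone
from typing import List, Set

# The checklist requires sign-off from the product manager and the CTO.
REQUIRED_APPROVERS = {"product manager", "CTO"}


def render_status_message(impact: str, affected_features: List[str],
                          workarounds: List[str],
                          next_update: datetime) -> str:
    """Assemble the four elements the draft step asks for."""
    lines = [
        f"Impact: {impact}",
        "Affected features: " + ", ".join(affected_features),
    ]
    if workarounds:
        lines.append("Workarounds: " + "; ".join(workarounds))
    else:
        lines.append("Workarounds: none at this time")
    lines.append(f"Next update: {next_update:%Y-%m-%d %H:%M} UTC")
    return "\n".join(lines)


def is_approved(approvals: Set[str]) -> bool:
    """True only when every required approver has signed off."""
    return REQUIRED_APPROVERS <= approvals


message = render_status_message(
    impact="Logins are failing for some users",  # example text
    affected_features=["authentication", "payments"],
    workarounds=[],
    next_update=datetime(2024, 1, 1, 12, 30, tzinfo=timezone.utc),
)
if is_approved({"product manager", "CTO"}):
    print(message)
```

Keeping the approval check separate from rendering means the same draft can be re-rendered at each cadence without re-implementing the gate.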

Mitigation & Monitoring

Apply mitigations
- Pause non-critical tasks, queue background jobs, toggle feature flags. Record actions.
Monitor provider recovery
- Track AWS recovery status and internal metrics. Update timeline and ETA each cadence.
Validate recovery
- Verify databases, queues, and integrations are healthy before marking resolved.
Product team updates users via Intercom
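The "Validate recovery" step says databases, queues, and integrations must all be healthy before the incident is marked resolved. A minimal sketch of that gate follows, assuming each component exposes some health check as a callable returning True or False. The function names and the placeholder checks are hypothetical; real checks would query your own systems.

```python
from typing import Callable, Dict, List


def failing_components(checks: Dict[str, Callable[[], bool]]) -> List[str]:
    """Run every health check and return the names that are not healthy."""
    failing = []
    for name, check in checks.items():
        try:
            healthy = check()
        except Exception:
            # A check that raises is treated as unhealthy, not skipped.
            healthy = False
        if not healthy:
            failing.append(name)
    return failing


def can_mark_resolved(checks: Dict[str, Callable[[], bool]]) -> bool:
    """Resolve only when no component is failing."""
    return not failing_components(checks)


def broken_queue_check() -> bool:
    raise TimeoutError("queue did not respond")  # simulated failure


# Placeholder checks standing in for real probes.
checks = {
    "databases": lambda: True,
    "queues": broken_queue_check,
    "integrations": lambda: True,
}
print(failing_components(checks))
print(can_mark_resolved(checks))
```

Treating an exception as a failure is a deliberate choice here: during provider recovery, a timeout is more likely a sign of lingering trouble than a reason to skip the check.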

Post-Incident Review

Document timeline and lessons learned
Update workflow and templates
- Apply learnings to this checklist, status templates, and support macros.
Schedule team debrief
Collect MTTA / MTTR metrics
Confirm all users notified
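For the "Collect MTTA / MTTR metrics" step, the arithmetic can be sketched as follows. This assumes MTTA is the mean time from alert trigger to acknowledgement and MTTR is the mean time from alert trigger to resolution; teams define MTTR differently (resolve, recover, repair), so adjust the start and end points to your own definition. The `IncidentTimes` type and the sample timestamps are made up for illustration.

```python
from datetime import datetime, timedelta
from typing import List, NamedTuple, Sequence


class IncidentTimes(NamedTuple):
    triggered: datetime
    acknowledged: datetime
    resolved: datetime


def _mean(deltas: List[timedelta]) -> timedelta:
    if not deltas:
        raise ValueError("need at least one incident")
    return sum(deltas, timedelta()) / len(deltas)


def mtta(incidents: Sequence[IncidentTimes]) -> timedelta:
    """Mean time from trigger to acknowledgement."""
    return _mean([i.acknowledged - i.triggered for i in incidents])


def mttr(incidents: Sequence[IncidentTimes]) -> timedelta:
    """Mean time from trigger to resolution (one common definition)."""
    return _mean([i.resolved - i.triggered for i in incidents])


# Made-up sample data: two incidents.
history = [
    IncidentTimes(datetime(2024, 1, 1, 12, 0),
                  datetime(2024, 1, 1, 12, 2),
                  datetime(2024, 1, 1, 12, 30)),
    IncidentTimes(datetime(2024, 2, 1, 9, 0),
                  datetime(2024, 2, 1, 9, 4),
                  datetime(2024, 2, 1, 10, 0)),
]
print(mtta(history))  # 0:03:00
print(mttr(history))  # 0:45:00
```

The timestamps needed here are the ones the checklist already asks you to record: the alert start time during detection and the resolution time after recovery is validated.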

About this workflow

Cloud outage response guide

Use this checklist to coordinate detection, communication, mitigation, and recovery during AWS, Azure, or GCP disruptions. Assign roles early, keep updates frequent, and record every action.

When to use

Provider status page reports a service disruption
Monitoring shows widespread failures across regions or services
Customer-visible impact on authentication, payments, or core features

Before you begin

Alerts wired: Better Stack, Datadog, CloudWatch
Slack incident channel automation ready (#incident-{{incident ID}})
Status page template approved (PM + CTO)

Category: Systems Administration