Disaster Recovery Checklist
Steps an IT operations or SRE lead runs after a disaster event — fire, flood, ransomware, prolonged outage — to triage, recover systems from backup, communicate status, and capture lessons learned. Designed to anchor a real DR runbook, not a tabletop summary.
Safety and Scene Preservation
-
Confirm all personnel are accounted for
Before any IT work begins, the incident commander confirms headcount with facilities or the on-site lead. Physical safety overrides RTO — no one touches a smoking rack until the building is cleared.
-
Preserve forensic evidence on affected hosts
If ransomware or intrusion is suspected, do not power-cycle hosts before capturing volatile state. Snapshot affected VMs, preserve EDR telemetry (CrowdStrike, SentinelOne, Defender), and isolate via network ACL rather than shutdown. Powering off destroys memory artifacts that incident-response counsel will need.
-
Classify the disaster type
Damage Assessment
-
Activate the IR bridge and assign roles
Page on-call via PagerDuty or Opsgenie. Name an incident commander, scribe, and comms lead before any technical work begins. Open the war-room channel in Slack or Teams and pin the runbook link.
-
Inventory affected systems against the BIA
Cross-reference impacted systems against the Business Impact Analysis tier list. Tier-1 systems (auth, payments, customer-facing API) recover first; tier-3 (internal wiki, dev sandboxes) wait. If the BIA is stale, default to revenue-generating and customer-facing systems first.
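As a rough illustration of the tiering rule above, the sketch below sorts affected systems by BIA tier and falls back to customer-facing status when a system is missing from the BIA. The `System` record, its field names, and the sample inventory are hypothetical; a real inventory would come from your CMDB or the BIA document itself.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class System:
    name: str
    bia_tier: Optional[int]        # 1 = most critical; None = not in the BIA
    customer_facing: bool = False


def recovery_order(affected):
    """Sort by BIA tier (lowest first), customer-facing systems first within a tier.

    Assumption: a system missing from the BIA is treated as tier 1 when it is
    customer-facing and is pushed to the end otherwise.
    """
    def key(s):
        if s.bia_tier is not None:
            tier = s.bia_tier
        else:
            tier = 1 if s.customer_facing else 99
        return (tier, not s.customer_facing, s.name)

    return sorted(affected, key=key)


# Hypothetical inventory of impacted systems
affected = [
    System("internal-wiki", 3),
    System("payments", 1, customer_facing=True),
    System("auth", 1, customer_facing=True),
    System("new-checkout-api", None, customer_facing=True),
    System("dev-sandbox", 3),
]
ordered = [s.name for s in recovery_order(affected)]
print(ordered)
```

Keeping the fallback rule explicit in code makes it easy for the incident commander to see why an unclassified system was placed where it was.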
-
Set the declared RTO and RPO targets
Compare the contractual RTO/RPO from the DR plan against the current recovery estimate. If the projected recovery window will exceed the SLA, flag it now so legal and account management can prepare customer communication before the SLA breach becomes visible to customers.
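The comparison can be made concrete with a little date arithmetic. The following is a minimal sketch, assuming downtime is measured from incident start and data loss from the last good backup; the function name, parameters, and all timestamps are illustrative, not taken from any particular DR plan.

```python
from datetime import datetime, timedelta, timezone


def sla_overruns(incident_start, last_good_backup, estimated_recovery, rto, rpo):
    """Return how far the projection overshoots RTO and RPO (zero if within target)."""
    zero = timedelta(0)
    projected_downtime = estimated_recovery - incident_start
    data_loss_window = incident_start - last_good_backup
    return {
        "rto_overrun": max(projected_downtime - rto, zero),
        "rpo_overrun": max(data_loss_window - rpo, zero),
    }


# Hypothetical incident: 4h RTO, 6h RPO
overruns = sla_overruns(
    incident_start=datetime(2024, 1, 10, 2, 0, tzinfo=timezone.utc),
    last_good_backup=datetime(2024, 1, 9, 22, 0, tzinfo=timezone.utc),
    estimated_recovery=datetime(2024, 1, 10, 9, 30, tzinfo=timezone.utc),
    rto=timedelta(hours=4),
    rpo=timedelta(hours=6),
)
for name, overrun in overruns.items():
    status = "within target" if overrun == timedelta(0) else f"exceeds target by {overrun}"
    print(name, status)
```

Any non-zero overrun is the signal to bring in legal and account management.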
-
Decide whether to invoke insurance and counsel
Cyber insurance carriers often require notification within 24-72 hours of discovery, and may dictate the IR firm used. Loop in cyber counsel early to preserve attorney-client privilege over investigation findings.
Containment and Forensics
-
Isolate compromised hosts at the network layer
Use EDR network-isolation (CrowdStrike Falcon, SentinelOne) or firewall ACLs to quarantine without powering off. Block lateral movement paths — SMB, RDP, WinRM — to adjacent subnets.
-
Rotate credentials and revoke active sessions
Force session revocation in Okta or Entra ID for all privileged accounts. Rotate secrets in Vault, AWS Secrets Manager, and Azure Key Vault. Re-key any service account whose credentials may have been on a compromised host. Don't forget API tokens, deploy keys, and CI/CD runner secrets.
-
Engage the IR firm for forensic imaging
System Recovery
-
Validate backup integrity before restore
Confirm immutable backup copies (Veeam hardened repo, Datto, AWS Backup with Vault Lock) are intact and uncorrupted. For ransomware, restore from a snapshot dated before the earliest known indicator of compromise rather than from the most recent backup, which may already be encrypted or tampered with.
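The restore-point rule for ransomware can be sketched as a simple selection: take the newest snapshot that is strictly older than the earliest indicator of compromise. The snapshot times and IOC timestamp below are made up; in practice they would come from your backup catalog and the forensic timeline.

```python
from datetime import datetime, timezone


def pick_restore_point(snapshot_times, earliest_ioc):
    """Return the newest snapshot taken strictly before the earliest IOC, or None."""
    candidates = [t for t in snapshot_times if t < earliest_ioc]
    return max(candidates) if candidates else None


# Hypothetical nightly snapshots and IOC time
snapshots = [
    datetime(2024, 1, 5, 1, 0, tzinfo=timezone.utc),
    datetime(2024, 1, 6, 1, 0, tzinfo=timezone.utc),
    datetime(2024, 1, 7, 1, 0, tzinfo=timezone.utc),
    datetime(2024, 1, 8, 1, 0, tzinfo=timezone.utc),
]
earliest_ioc = datetime(2024, 1, 6, 17, 42, tzinfo=timezone.utc)

restore_point = pick_restore_point(snapshots, earliest_ioc)
print(restore_point)
```

A `None` result means no retained snapshot predates the compromise, which is worth escalating immediately. If forensics later moves the earliest IOC back, rerun the selection before restoring.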
-
Stand up infrastructure from IaC
Re-deploy VPCs, subnets, security groups, and IAM via Terraform or CloudFormation in the DR region or replacement account. Pin to the last known-good commit; do not apply unrelated pending changes during recovery.
-
Restore data to recovery infrastructure
Restore tier-1 systems first per the BIA. Database restores often take longer than expected — start them early and run in parallel with infrastructure provisioning where dependencies allow.
-
Engage alternate procurement for hardware loss
If physical hardware is lost or destroyed, engage the pre-arranged hot-site or hardware-replacement vendor. Many DR plans assume cloud-only recovery; on-prem environments need a named hardware partner with same-day-ship SLAs.
-
Validate functionality and performance
Run smoke tests against tier-1 services: auth flow, payment processing, primary API endpoints. Compare current metrics in Datadog or Grafana against pre-incident baselines. Recovery isn't done until SLIs are within tolerance.
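One way to make "within tolerance" checkable is to compare each SLI against its pre-incident baseline with a per-metric relative tolerance. This is only a sketch: the metric names, baseline values, and tolerances are invented, and in practice the numbers would be exported from your monitoring tool rather than hard-coded.

```python
def sli_regressions(baseline, current, tolerance):
    """Return SLIs missing from `current` or drifting beyond their relative tolerance.

    Drift is measured in either direction, so an unexpectedly large
    improvement is flagged for a second look as well.
    """
    failures = {}
    for name, base in baseline.items():
        now = current.get(name)
        if now is None:
            failures[name] = "missing"
            continue
        drift = abs(now - base) / abs(base) if base else abs(now)
        if drift > tolerance[name]:
            failures[name] = round(drift, 3)
    return failures


# Hypothetical pre-incident baselines, allowed relative drift, and post-restore readings
baseline = {"auth_p95_ms": 180.0, "payments_p95_ms": 420.0, "api_success_rate": 0.999}
tolerance = {"auth_p95_ms": 0.10, "payments_p95_ms": 0.10, "api_success_rate": 0.002}
current = {"auth_p95_ms": 190.0, "payments_p95_ms": 610.0, "api_success_rate": 0.998}

failures = sli_regressions(baseline, current, tolerance)
print(failures or "all SLIs within tolerance")
```

Per-metric tolerances matter here: a 10% band is reasonable for latency but far too loose for a success rate.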
Stakeholder Communication
-
Notify executives and incident sponsor
The comms lead pages the CISO, CTO, and CEO via the documented escalation matrix. Use the pre-approved status template — facts only, no speculation about cause.
-
Issue the customer-facing status page update
Post to the Statuspage or equivalent. Stick to observable impact and current ETA; do not promise root cause. Update at least every 60 minutes during active recovery, even when the update is "no change."
-
Send regulatory breach notifications
If personal data is affected, GDPR Article 33 requires supervisory authority notification within 72 hours of awareness; HIPAA, state laws, and PCI each have their own windows. Coordinate with cyber counsel before sending — wording matters for downstream litigation.
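For the 72-hour GDPR window mentioned above, a small helper can keep the deadline visible in the war room. This sketch only covers the Article 33 window; the awareness timestamp is illustrative, and windows for other regimes should be supplied by counsel rather than hard-coded.

```python
from datetime import datetime, timedelta, timezone

GDPR_ART33_WINDOW = timedelta(hours=72)


def notification_deadline(aware_at, window=GDPR_ART33_WINDOW):
    """Deadline for notifying the supervisory authority, counted from awareness."""
    return aware_at + window


def time_remaining(aware_at, now, window=GDPR_ART33_WINDOW):
    """Time left before the deadline; negative once the deadline has passed."""
    return notification_deadline(aware_at, window) - now


# Hypothetical moment the organization became aware of the breach
aware_at = datetime(2024, 3, 4, 14, 20, tzinfo=timezone.utc)
deadline = notification_deadline(aware_at)
remaining = time_remaining(aware_at, now=datetime(2024, 3, 6, 9, 0, tzinfo=timezone.utc))
print("Notify by:", deadline.isoformat())
print("Remaining:", remaining)
```

Using timezone-aware UTC timestamps avoids ambiguity when responders are spread across regions.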
Postmortem and Lessons Learned
-
Capture the incident timeline
Pull the war-room transcript, PagerDuty timeline, and deploy logs into a single chronological record. Note detection time, declaration time, containment time, and full-recovery time — these feed MTTD and MTTR metrics.
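Once the four timestamps are recorded, the per-incident durations fall out of simple subtraction. The sketch below adds an assumed `started` timestamp (when the incident actually began, as established by forensics or logs), since time-to-detect cannot be computed without it; all values are illustrative. MTTD and MTTR are then averages of these durations across incidents.

```python
from datetime import datetime, timezone


def incident_durations(started, detected, declared, contained, recovered):
    """Derive per-incident durations from the key timeline timestamps."""
    return {
        "time_to_detect": detected - started,
        "time_to_declare": declared - detected,
        "time_to_contain": contained - declared,
        "time_to_recover": recovered - started,
    }


def utc(day, hour, minute):
    # Convenience constructor for the hypothetical timeline below
    return datetime(2024, 1, day, hour, minute, tzinfo=timezone.utc)


durations = incident_durations(
    started=utc(10, 1, 15),
    detected=utc(10, 2, 0),
    declared=utc(10, 2, 20),
    contained=utc(10, 4, 5),
    recovered=utc(10, 9, 30),
)
for name, value in durations.items():
    print(f"{name}: {value}")
```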
-
Hold a blameless postmortem review
Run the review with everyone who responded, plus legal and exec sponsor. Focus on systemic gaps — missing runbooks, alert fatigue, untested backup paths — not individual judgment calls. Output is a written document, not a slide deck.
-
File remediation tickets with owners and due dates
Every action item from the postmortem becomes a ticket in Jira or the PSA with a named owner and due date. Action items without owners disappear; review the open list at the next quarterly DR tabletop to confirm closure.
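A quick check before the postmortem closes can catch action items that lack an owner or due date. The ticket structure below is a hypothetical export format, not the Jira or PSA schema; adapt the field names to whatever your tracker returns.

```python
from datetime import date


def incomplete_action_items(action_items):
    """Return summaries of action items missing an owner or a due date."""
    return [
        item["summary"]
        for item in action_items
        if not item.get("owner") or not item.get("due")
    ]


# Hypothetical postmortem action items
action_items = [
    {"summary": "Test restore of payments DB quarterly", "owner": "dba-team", "due": date(2024, 4, 1)},
    {"summary": "Add runbook for WinRM isolation", "owner": None, "due": date(2024, 3, 15)},
    {"summary": "Tune noisy disk alerts", "owner": "sre-oncall", "due": None},
]

missing = incomplete_action_items(action_items)
print(missing)
```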
-
Update the DR runbook and BIA
Fold the lessons into the live DR runbook in IT Glue, Hudu, or Confluence. Adjust BIA tiers if this incident exposed a system as more critical than its current classification. Dry-run the updated runbook at the next quarterly tabletop.