Disaster Recovery Checklist

Steps an IT operations or SRE lead runs after a disaster event — fire, flood, ransomware, prolonged outage — to triage, recover systems from backup, communicate status, and capture lessons learned. Designed to anchor a real DR runbook, not a tabletop summary.

6 sections, 22 steps
1

Safety and Scene Preservation

  1. Confirm all personnel are accounted for
    • Before any IT work begins, the incident commander confirms headcount with facilities or the on-site lead. Physical safety overrides RTO — no one touches a smoking rack until the building is cleared.

  2. Preserve forensic evidence on affected hosts
    • If ransomware or intrusion is suspected, do not power-cycle hosts before capturing volatile state. Snapshot affected VMs, preserve EDR telemetry (CrowdStrike, SentinelOne, Defender), and isolate via network ACL rather than shutdown. Powering off destroys memory artifacts that forensic investigators and counsel will need.

  3. Classify the disaster type
2

Damage Assessment

  1. Activate the IR bridge and assign roles
    • Page on-call via PagerDuty or Opsgenie. Name an incident commander, scribe, and comms lead before any technical work begins. Open the war-room channel in Slack or Teams and pin the runbook link.

  2. Inventory affected systems against the BIA
    • Cross-reference impacted systems against the Business Impact Analysis tier list. Tier-1 systems (auth, payments, customer-facing API) recover first; tier-3 (internal wiki, dev sandboxes) wait. If the BIA is stale, default to revenue-generating and customer-facing systems first.

  3. Set the declared RTO and RPO targets
    • Compare the contractual RTO/RPO from the DR plan against current reality. If the actual recovery window will exceed the SLA, flag it now so legal and account management can prepare customer communication before the SLA miss becomes visible to customers.

  4. Decide whether to invoke insurance and counsel
    • Cyber insurance carriers often require notification within 24-72 hours of discovery, and may dictate the IR firm used. Loop in cyber counsel early to preserve attorney-client privilege over investigation findings.
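
The prioritization and RTO check in steps 2 and 3 of this section can be sketched in a few lines of Python. This is a minimal illustration, not part of any DR tool: the `AffectedSystem` fields, the system names, and every duration below are hypothetical placeholders to be replaced with values from your own BIA and DR plan.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class AffectedSystem:
    name: str
    bia_tier: int                   # 1 = most critical, per the BIA tier list
    rto: timedelta                  # contractual RTO from the DR plan
    estimated_recovery: timedelta   # responders' current best estimate


def recovery_order(systems):
    """Sort by BIA tier, then by tightest RTO within a tier."""
    return sorted(systems, key=lambda s: (s.bia_tier, s.rto))


def rto_at_risk(systems):
    """Return systems whose estimated recovery exceeds the contractual RTO."""
    return [s for s in systems if s.estimated_recovery > s.rto]


# Hypothetical example data.
systems = [
    AffectedSystem("internal-wiki", 3, timedelta(hours=72), timedelta(hours=24)),
    AffectedSystem("payments", 1, timedelta(hours=4), timedelta(hours=9)),
    AffectedSystem("auth", 1, timedelta(hours=2), timedelta(hours=1)),
]

ordered_names = [s.name for s in recovery_order(systems)]
flagged_names = [s.name for s in rto_at_risk(systems)]
print("Recovery order:", ordered_names)
print("Flag to legal and account management:", flagged_names)
```

Sorting by tier first and RTO second keeps the BIA as the primary authority while still breaking ties within a tier in favor of the tightest commitment.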

3

Containment and Forensics

  1. Isolate compromised hosts at the network layer
    • Use EDR network-isolation (CrowdStrike Falcon, SentinelOne) or firewall ACLs to quarantine without powering off. Block lateral movement paths — SMB, RDP, WinRM — to adjacent subnets.

  2. Rotate credentials and revoke active sessions
    • Force session revocation in Okta or Entra ID for all privileged accounts. Rotate secrets in Vault, AWS Secrets Manager, and Azure Key Vault. Re-key any service account whose credentials may have been on a compromised host. Don't forget API tokens, deploy keys, and CI/CD runner secrets.

  3. Engage the IR firm for forensic imaging
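
For the network-layer isolation in step 1 of this section, the sketch below generates vendor-neutral deny rules for the three lateral-movement protocols named there. The port numbers are the standard TCP ports for SMB, RDP, and WinRM; the rule dictionary layout, the host address, and the subnets are hypothetical and would need translating into your firewall's or EDR's actual syntax.

```python
import ipaddress

# Standard TCP ports for the lateral-movement protocols named in step 1.
LATERAL_MOVEMENT_PORTS = {"SMB": [445], "RDP": [3389], "WinRM": [5985, 5986]}


def build_deny_rules(quarantined_host, adjacent_subnets):
    """Build generic deny rules from one quarantined host to adjacent subnets."""
    host = ipaddress.ip_address(quarantined_host)   # raises ValueError if malformed
    rules = []
    for subnet in adjacent_subnets:
        network = ipaddress.ip_network(subnet)      # raises ValueError if malformed
        for protocol, ports in LATERAL_MOVEMENT_PORTS.items():
            for port in ports:
                rules.append({
                    "action": "deny",
                    "src": str(host),
                    "dst": str(network),
                    "proto": "tcp",
                    "dst_port": port,
                    "reason": f"block {protocol} lateral movement",
                })
    return rules


# Hypothetical addresses for illustration only.
rules = build_deny_rules("10.0.1.15", ["10.0.2.0/24", "10.0.3.0/24"])
for rule in rules:
    print(rule)
```

Validating addresses with `ipaddress` before emitting rules catches typos early, which matters when rules are pushed under time pressure.
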
4

System Recovery

  1. Validate backup integrity before restore
    • Confirm immutable backup copies (Veeam hardened repo, Datto, AWS Backup with Vault Lock) are intact and uncorrupted. For ransomware, restore from a snapshot dated before the earliest known indicator of compromise, not from the most recent backup, which may already be encrypted or tampered with.

  2. Stand up infrastructure from IaC
    • Re-deploy VPCs, subnets, security groups, and IAM via Terraform or CloudFormation in the DR region or replacement account. Pin to the last known-good commit; do not apply unrelated pending changes during recovery.

  3. Restore data to recovery infrastructure
    • Restore tier-1 systems first per the BIA. Database restores often take longer than expected — start them early and run in parallel with infrastructure provisioning where dependencies allow.

  4. Engage alternate procurement for hardware loss
    • If physical hardware is lost or destroyed, engage the pre-arranged hot-site or hardware-replacement vendor. Many DR plans assume cloud-only recovery; on-prem environments need a named hardware partner with same-day-ship SLAs.

  5. Validate functionality and performance
    • Run smoke tests against tier-1 services: auth flow, payment processing, primary API endpoints. Compare current metrics in Datadog or Grafana against the pre-incident baselines on the same dashboards. Recovery isn't done until SLIs are within tolerance.
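
Two decisions in this section reduce to simple comparisons: choosing a restore point that predates the earliest known indicator of compromise (step 1) and checking SLIs against the pre-incident baseline (step 5). The sketch below assumes snapshot timestamps and metric values have already been exported from your backup and monitoring tools; the function names, metric names, timestamps, and the 10% tolerance are all hypothetical.

```python
from datetime import datetime, timezone


def pick_restore_point(snapshot_times, earliest_ioc):
    """Return the newest snapshot taken strictly before the earliest IoC, or None."""
    clean = [t for t in snapshot_times if t < earliest_ioc]
    return max(clean) if clean else None


def slis_out_of_tolerance(baseline, current, tolerance=0.10):
    """Return SLI names that are missing or deviate from baseline by more than `tolerance`."""
    failing = []
    for name, base in baseline.items():
        observed = current.get(name)
        if observed is None or abs(observed - base) > tolerance * abs(base):
            failing.append(name)
    return failing


# Hypothetical nightly snapshots and IoC time.
snapshots = [datetime(2024, 3, day, 2, 0, tzinfo=timezone.utc) for day in (1, 2, 3, 4)]
earliest_ioc = datetime(2024, 3, 3, 14, 30, tzinfo=timezone.utc)
restore_point = pick_restore_point(snapshots, earliest_ioc)

# Hypothetical baseline and post-restore readings.
baseline = {"auth_p95_ms": 180.0, "api_success_rate": 0.999}
current = {"auth_p95_ms": 240.0, "api_success_rate": 0.998}
failing = slis_out_of_tolerance(baseline, current)

print("Restore from:", restore_point)
print("SLIs out of tolerance:", failing)
```

A single symmetric percentage is a simplification; in practice each SLI usually has its own threshold and direction (latency should not rise, success rate should not fall). The restore point is also only as good as the IoC timeline, so revisit it if forensics finds earlier activity.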

5

Stakeholder Communication

  1. Notify executives and incident sponsor
    • The comms lead pages the CISO, CTO, and CEO via the documented escalation matrix. Use the pre-approved status template — facts only, no speculation about cause.

  2. Issue the customer-facing status page update
    • Post to Statuspage or an equivalent status page. Stick to observable impact and the current ETA; do not promise a root cause. Update at least every 60 minutes during active recovery, even when the update is "no change."

  3. Send regulatory breach notifications
    • If personal data is affected, GDPR Article 33 requires notifying the supervisory authority without undue delay and, where feasible, within 72 hours of becoming aware of the breach; HIPAA, US state breach laws, and PCI each have their own windows. Coordinate with cyber counsel before sending, because the wording matters for downstream litigation.
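
The two time limits in this section (status updates at least every 60 minutes, and the 72-hour GDPR Article 33 window) are easy to miscount under pressure. A minimal sketch of the arithmetic follows; the function names and timestamps are hypothetical, and the moment of "awareness" is a legal determination to confirm with counsel rather than something the code decides.

```python
from datetime import datetime, timedelta, timezone

GDPR_ART33_WINDOW = timedelta(hours=72)
STATUS_UPDATE_INTERVAL = timedelta(minutes=60)


def gdpr_notification_deadline(aware_at):
    """Latest time to notify the supervisory authority under the 72-hour window."""
    return aware_at + GDPR_ART33_WINDOW


def next_status_update_due(last_update_at):
    """Time by which the next status page update should be posted."""
    return last_update_at + STATUS_UPDATE_INTERVAL


# Hypothetical timestamps, kept in UTC to avoid time-zone mistakes.
aware_at = datetime(2024, 3, 3, 14, 30, tzinfo=timezone.utc)
last_update_at = datetime(2024, 3, 3, 16, 5, tzinfo=timezone.utc)

deadline = gdpr_notification_deadline(aware_at)
next_update = next_status_update_due(last_update_at)
print("GDPR notification deadline:", deadline.isoformat())
print("Next status update due:", next_update.isoformat())
```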

6

Postmortem and Lessons Learned

  1. Capture the incident timeline
    • Pull the war-room transcript, PagerDuty timeline, and deploy logs into a single chronological record. Note detection time, declaration time, containment time, and full-recovery time — these feed MTTD and MTTR metrics.

  2. Hold a blameless postmortem review
    • Run the review with everyone who responded, plus legal and exec sponsor. Focus on systemic gaps — missing runbooks, alert fatigue, untested backup paths — not individual judgment calls. Output is a written document, not a slide deck.

  3. File remediation tickets with owners and due dates
    • Every action item from the postmortem becomes a ticket in Jira or your PSA tool with a named owner and due date. Action items without owners disappear; review the open list at the next quarterly DR tabletop to confirm closure.

  4. Update the DR runbook and BIA
    • Fold the lessons into the live DR runbook in IT Glue, Hudu, or Confluence. Adjust BIA tiers if this incident exposed a system as more critical than its current classification. Dry-run the updated runbook at the next quarterly tabletop.
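
The milestones recorded in step 1 of this section (detection, declaration, containment, full recovery) can be turned into per-incident durations that feed MTTD and MTTR averages. The sketch below is an assumption about how to do that: the function name and timestamps are hypothetical, and time-to-detect is computed only if you also record when impact began, which the timeline step does not list.

```python
from datetime import datetime, timezone


def incident_durations(detected, declared, contained, recovered, impact_started=None):
    """Compute per-incident durations measured from detection."""
    durations = {
        "time_to_declare": declared - detected,
        "time_to_contain": contained - detected,
        "time_to_recover": recovered - detected,
    }
    # Time-to-detect needs the start of impact, which is often only known after forensics.
    if impact_started is not None:
        durations["time_to_detect"] = detected - impact_started
    return durations


# Hypothetical timeline in UTC.
durations = incident_durations(
    detected=datetime(2024, 3, 3, 14, 30, tzinfo=timezone.utc),
    declared=datetime(2024, 3, 3, 14, 50, tzinfo=timezone.utc),
    contained=datetime(2024, 3, 3, 18, 0, tzinfo=timezone.utc),
    recovered=datetime(2024, 3, 4, 9, 30, tzinfo=timezone.utc),
    impact_started=datetime(2024, 3, 3, 13, 45, tzinfo=timezone.utc),
)
for name, value in durations.items():
    print(f"{name}: {value}")
```

Averaging these values across incidents gives the mean figures; a single incident only yields one data point.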
