Disaster Recovery Checklist

Safety and Scene Preservation

    Before any IT work begins, the incident commander confirms headcount with facilities or the on-site lead. Physical safety overrides RTO — no one touches a smoking rack until the building is cleared.

    If ransomware or intrusion is suspected, do not power-cycle hosts before capturing volatile state. Snapshot affected VMs, preserve EDR telemetry (CrowdStrike, SentinelOne, Defender), and isolate via network ACL rather than shutdown. Powering off destroys memory artifacts that incident-response counsel will need.

Damage Assessment

    Page on-call via PagerDuty or Opsgenie. Name an incident commander, scribe, and comms lead before any technical work begins. Open the war-room channel in Slack or Teams and pin the runbook link.

    Cross-reference impacted systems against the Business Impact Analysis tier list. Tier-1 systems (auth, payments, customer-facing API) recover first; tier-3 (internal wiki, dev sandboxes) wait. If the BIA is stale, default to revenue-generating and customer-facing systems first.
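
    The tier ordering above can be sketched in a few lines. This is a minimal illustration, not a prescribed tool: the System record, its field names, and the sample inventory are all hypothetical, and the stale-BIA fallback is reduced to a single flag.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class System:
    name: str
    tier: int                  # BIA tier: 1 recovers first
    customer_facing: bool
    revenue_generating: bool


def recovery_order(systems, bia_is_stale=False):
    """Return systems in the order they should be recovered."""
    if bia_is_stale:
        # Fallback: revenue-generating or customer-facing systems first.
        # False sorts before True, so the priority flag is negated.
        return sorted(
            systems,
            key=lambda s: (not (s.revenue_generating or s.customer_facing), s.name),
        )
    # Trusted BIA: lowest tier number first, name as a stable tie-breaker.
    return sorted(systems, key=lambda s: (s.tier, s.name))


# Hypothetical inventory for illustration only.
inventory = [
    System("wiki", 3, False, False),
    System("payments", 1, True, True),
    System("status-page", 3, True, False),
    System("dev-sandbox", 3, False, False),
    System("auth", 1, True, False),
]

print([s.name for s in recovery_order(inventory)])
print([s.name for s in recovery_order(inventory, bia_is_stale=True)])
```

    With the stale-BIA flag set, the customer-facing status page moves ahead of the internal tier-3 systems even though its recorded tier is the same.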

    Compare the contractual RTO/RPO from the DR plan against current reality. If the projected recovery window will exceed the SLA, flag it now so legal and account management can prepare customer communication before customers notice the SLA breach themselves.
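
    As a rough sketch of that comparison, the arithmetic is two subtractions. The function name, parameter names, and sample timestamps below are assumptions made for illustration; real RTO and RPO values come from the DR plan and the customer contract.

```python
from datetime import datetime, timedelta, timezone


def sla_status(outage_start, projected_recovery, last_good_backup, rto, rpo):
    """Compare projected downtime and data-loss window against RTO and RPO."""
    projected_downtime = projected_recovery - outage_start
    data_loss_window = outage_start - last_good_backup
    return {
        "rto_breach": projected_downtime > rto,
        "rpo_breach": data_loss_window > rpo,
        # Negative margins show how far past the objective the estimate is.
        "rto_margin": rto - projected_downtime,
        "rpo_margin": rpo - data_loss_window,
    }


# Hypothetical incident: 4-hour RTO, 1-hour RPO.
utc = timezone.utc
status = sla_status(
    outage_start=datetime(2024, 3, 1, 10, 0, tzinfo=utc),
    projected_recovery=datetime(2024, 3, 1, 15, 30, tzinfo=utc),
    last_good_backup=datetime(2024, 3, 1, 9, 45, tzinfo=utc),
    rto=timedelta(hours=4),
    rpo=timedelta(hours=1),
)
print(status)
```

    In this example the projected downtime of five and a half hours exceeds the four-hour RTO, while the fifteen-minute data-loss window stays inside the RPO.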

    Cyber insurance carriers often require notification within 24-72 hours of discovery, and may dictate the IR firm used. Loop in cyber counsel early to preserve attorney-client privilege over investigation findings.

Containment and Forensics

    Use EDR network-isolation (CrowdStrike Falcon, SentinelOne) or firewall ACLs to quarantine without powering off. Block lateral movement paths — SMB, RDP, WinRM — to adjacent subnets.

    Force session revocation in Okta or Entra ID for all privileged accounts. Rotate secrets in Vault, AWS Secrets Manager, and Azure Key Vault. Re-key any service account whose credentials may have been on a compromised host. Don't forget API tokens, deploy keys, and CI/CD runner secrets.

System Recovery

    Confirm immutable backup copies (Veeam hardened repo, Datto, AWS Backup with Vault Lock) are intact and uncorrupted. For ransomware, restore from a snapshot dated before the earliest known indicator of compromise rather than from the most recent backup, which may already be encrypted or tampered with.
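
    The snapshot-selection rule can be expressed as a small filter. This is a sketch under stated assumptions: the Snapshot record and its verified flag are hypothetical stand-ins for whatever the backup product reports, and no vendor API is called here.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Snapshot:
    snapshot_id: str
    taken_at: datetime
    verified: bool   # hypothetical flag: integrity check passed


def pick_restore_point(snapshots, earliest_ioc):
    """Return the newest verified snapshot taken strictly before the earliest IOC."""
    candidates = [
        s for s in snapshots
        if s.verified and s.taken_at < earliest_ioc
    ]
    if not candidates:
        return None   # nothing safe to restore from; escalate
    return max(candidates, key=lambda s: s.taken_at)


utc = timezone.utc
# Hypothetical snapshot catalogue.
catalogue = [
    Snapshot("snap-mon", datetime(2024, 3, 4, 2, 0, tzinfo=utc), True),
    Snapshot("snap-tue", datetime(2024, 3, 5, 2, 0, tzinfo=utc), True),
    Snapshot("snap-wed", datetime(2024, 3, 6, 2, 0, tzinfo=utc), False),
    Snapshot("snap-thu", datetime(2024, 3, 7, 2, 0, tzinfo=utc), True),
]
earliest_ioc = datetime(2024, 3, 6, 14, 0, tzinfo=utc)

chosen = pick_restore_point(catalogue, earliest_ioc)
print(chosen.snapshot_id if chosen else "no safe restore point")
```

    Here the Thursday snapshot is newest but postdates the indicator of compromise, and the Wednesday snapshot failed verification, so the Tuesday snapshot is selected.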

    Re-deploy VPCs, subnets, security groups, and IAM via Terraform or CloudFormation in the DR region or replacement account. Pin to the last known-good commit; do not apply unrelated pending changes during recovery.

    Restore tier-1 systems first per the BIA. Database restores often take longer than expected — start them early and run in parallel with infrastructure provisioning where dependencies allow.
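
    One way to see which restores can run in parallel is to group them into waves with a topological sort. The sketch below uses Python's standard graphlib module; the component names and the dependency map are hypothetical, and a real plan would take them from the runbook.

```python
from graphlib import TopologicalSorter


def restore_waves(dependencies):
    """Group restore tasks into waves; every task in a wave can start together.

    `dependencies` maps each task to the tasks that must finish before it.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()   # raises graphlib.CycleError on circular dependencies
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())   # sorted only for stable output
        waves.append(ready)
        sorter.done(*ready)
    return waves


# Hypothetical dependency map: database restores do not wait on networking.
dependencies = {
    "vpc": [],
    "auth-db": [],
    "payments-db": [],
    "auth-service": ["vpc", "auth-db"],
    "payments-api": ["vpc", "payments-db", "auth-service"],
}

for number, wave in enumerate(restore_waves(dependencies), start=1):
    print(number, wave)
```

    The first wave contains both database restores alongside the VPC, which matches the advice to start long-running restores as early as dependencies allow.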

    If physical hardware is lost or destroyed, engage the pre-arranged hot-site or hardware-replacement vendor. Many DR plans assume cloud-only recovery; on-prem environments need a named hardware partner with a same-day shipping SLA.

    Run smoke tests against tier-1 services: auth flow, payment processing, primary API endpoints. Compare baseline metrics in Datadog or Grafana against pre-incident dashboards. Recovery isn't done until SLIs are within tolerance.
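
    The "within tolerance" check can be sketched as a comparison of current readings against the pre-incident baseline. This is a simplification under stated assumptions: the metric names and values are hypothetical, the 10% tolerance is an arbitrary default, and any deviation counts as a violation even when the metric improved. No Datadog or Grafana API is called; the readings are plain dictionaries.

```python
def sli_violations(baseline, current, tolerance=0.10):
    """Return SLIs whose relative deviation from baseline exceeds tolerance.

    Maps each violating SLI name to its deviation, or to None when the
    metric is missing from the current readings.
    """
    violations = {}
    for name, base in baseline.items():
        if name not in current:
            violations[name] = None   # metric not reporting after recovery
            continue
        if base == 0:
            deviation = 0.0 if current[name] == 0 else float("inf")
        else:
            deviation = abs(current[name] - base) / abs(base)
        if deviation > tolerance:
            violations[name] = deviation
    return violations


# Hypothetical readings exported from the monitoring dashboards.
baseline = {"p95_latency_ms": 200, "error_rate": 0.01, "auth_success": 0.999}
current = {"p95_latency_ms": 210, "error_rate": 0.02, "auth_success": 0.998}

problems = sli_violations(baseline, current)
print("recovered" if not problems else f"still degraded: {sorted(problems)}")
```

    In the sample data latency and auth success are close to baseline, but the error rate has doubled, so recovery would not yet be declared complete.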

Stakeholder Communication

    The comms lead pages the CISO, CTO, and CEO via the documented escalation matrix. Use the pre-approved status template — facts only, no speculation about cause.

    Post to the Statuspage or equivalent. Stick to observable impact and current ETA; do not promise root cause. Update at least every 60 minutes during active recovery, even when the update is "no change."

    If personal data is affected, GDPR Article 33 requires notifying the supervisory authority without undue delay and, where feasible, within 72 hours of becoming aware of the breach; HIPAA, US state breach-notification laws, and PCI DSS each have their own windows. Coordinate with cyber counsel before sending, because wording matters for downstream litigation.
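
    Tracking those clocks is simple date arithmetic once the awareness time is fixed. In the sketch below only the 72-hour GDPR window comes from the text above; the insurer window is a hypothetical policy term, and windows for other regimes should be supplied by counsel rather than assumed.

```python
from datetime import datetime, timedelta, timezone


def notification_deadlines(aware_at, windows):
    """Map each obligation name to its deadline, counted from awareness."""
    if aware_at.tzinfo is None:
        raise ValueError("aware_at must be timezone-aware")
    return {name: aware_at + window for name, window in windows.items()}


# The GDPR window is from Article 33; the insurer window is a placeholder.
windows = {
    "GDPR Art. 33 supervisory authority": timedelta(hours=72),
    "cyber insurer (hypothetical policy term)": timedelta(hours=24),
}

aware_at = datetime(2024, 3, 1, 8, 0, tzinfo=timezone.utc)
now = datetime(2024, 3, 2, 20, 0, tzinfo=timezone.utc)

deadlines = notification_deadlines(aware_at, windows)
for name, deadline in sorted(deadlines.items(), key=lambda item: item[1]):
    remaining = deadline - now
    state = "OVERDUE" if remaining < timedelta(0) else f"{remaining} left"
    print(f"{deadline.isoformat()}  {name}: {state}")
```

    Listing obligations by deadline makes it obvious which notice is due first; in the sample data the insurer notice is already overdue while the GDPR clock still has 36 hours to run.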

Postmortem and Lessons Learned

    Pull the war-room transcript, PagerDuty timeline, and deploy logs into a single chronological record. Note detection time, declaration time, containment time, and full-recovery time — these feed MTTD and MTTR metrics.
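
    Once the timeline is assembled, the per-incident durations fall out of the four timestamps plus an estimated onset. The sketch below makes two assumptions that teams define differently: time to detect is measured from the estimated onset (which the list above does not include), and time to contain and recover are measured from detection. All names and timestamps are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def incident_durations(started, detected, declared, contained, recovered):
    """Derive the intervals that feed MTTD and MTTR for one incident."""
    return {
        "time_to_detect": detected - started,
        "time_to_declare": declared - detected,
        "time_to_contain": contained - detected,
        "time_to_recover": recovered - detected,
    }


def mean_duration(durations):
    """Average a list of timedeltas, e.g. time_to_detect across incidents."""
    return sum(durations, timedelta(0)) / len(durations)


utc = timezone.utc
# Hypothetical timeline reconstructed from the war-room transcript.
incident = incident_durations(
    started=datetime(2024, 3, 1, 1, 0, tzinfo=utc),
    detected=datetime(2024, 3, 1, 1, 40, tzinfo=utc),
    declared=datetime(2024, 3, 1, 2, 0, tzinfo=utc),
    contained=datetime(2024, 3, 1, 4, 0, tzinfo=utc),
    recovered=datetime(2024, 3, 1, 9, 40, tzinfo=utc),
)
for name, value in incident.items():
    print(f"{name}: {value}")
```

    Averaging time_to_detect and time_to_recover across incidents with mean_duration gives the MTTD and MTTR figures under these definitions.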

    Run the review with everyone who responded, plus legal and exec sponsor. Focus on systemic gaps — missing runbooks, alert fatigue, untested backup paths — not individual judgment calls. Output is a written document, not a slide deck.

    Every action item from the postmortem becomes a ticket in Jira or the PSA with a named owner and due date. Action items without owners disappear; review the open list at the next quarterly DR tabletop to confirm closure.

    Fold the lessons into the live DR runbook in IT Glue, Hudu, or Confluence. Adjust BIA tiers if this incident exposed a system as more critical than its current classification. Dry-run the updated runbook at the next quarterly tabletop.
