Disaster Recovery Plan Checklist

Quarterly disaster recovery readiness workflow for IT operations and MSP teams. Covers BIA refresh, backup integrity verification, restore drills, tabletop exercises, business continuity validation, and the post-incident response sequenc...

Preparation and Planning

Inventory tier-1 systems with RTO and RPO
- Pull the application inventory from the CMDB or RMM and tag each system with its RTO, RPO, and dependency tree. Common gotcha: shadow SaaS tools the business owns directly never make it onto the DR list, then become the loudest complaint during an outage. Attach the spreadsheet or export.
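The tagging step above can be checked mechanically before the export is attached. This is a minimal sketch, assuming the CMDB or RMM export has already been loaded into records; `SystemRecord`, `find_inventory_gaps`, and the field names are hypothetical and do not reflect any tool's real schema.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SystemRecord:
    """One row of the application inventory (hypothetical shape)."""
    name: str
    tier: int
    rto_minutes: Optional[int] = None
    rpo_minutes: Optional[int] = None
    depends_on: List[str] = field(default_factory=list)


def find_inventory_gaps(systems: List[SystemRecord]) -> List[str]:
    """Return a readable list of missing RTO/RPO tags and unknown dependencies."""
    known = {s.name for s in systems}
    gaps = []
    for s in systems:
        if s.rto_minutes is None:
            gaps.append(f"{s.name}: missing RTO")
        if s.rpo_minutes is None:
            gaps.append(f"{s.name}: missing RPO")
        for dep in s.depends_on:
            # A dependency that is not itself inventoried is a likely blind spot.
            if dep not in known:
                gaps.append(f"{s.name}: dependency '{dep}' not in inventory")
    return gaps
```

A dependency that points at a system missing from the inventory is often how a shadow SaaS tool or forgotten database first shows up.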
Confirm the DR team roster and on-call rotation
- Verify each DR role has a primary and backup named — incident commander, comms lead, infrastructure lead, identity lead, vendor liaison. Cross-check PagerDuty or Opsgenie schedules for gaps. People leave; rosters drift.
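The roster check above lends itself to a small script. The sketch below assumes the roster has been typed or exported into a plain dictionary; it does not call the PagerDuty or Opsgenie APIs, and `roster_gaps` and the dictionary layout are illustrative.

```python
from typing import Dict, List, Optional

# Roles named in the checklist step.
REQUIRED_ROLES = [
    "incident commander",
    "comms lead",
    "infrastructure lead",
    "identity lead",
    "vendor liaison",
]


def roster_gaps(roster: Dict[str, Dict[str, Optional[str]]]) -> List[str]:
    """Flag roles without a named primary and a distinct named backup."""
    gaps = []
    for role in REQUIRED_ROLES:
        entry = roster.get(role, {})
        primary = entry.get("primary")
        backup = entry.get("backup")
        if not primary:
            gaps.append(f"{role}: no primary")
        if not backup:
            gaps.append(f"{role}: no backup")
        elif backup == primary:
            gaps.append(f"{role}: primary and backup are the same person")
    return gaps
```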
Refresh the out-of-band communication tree
- Personal phone numbers, SMS group, Signal channel, and a non-corporate email — anything that does not depend on M365 or the corporate network being up. The first failure mode of in-band comms is that the platform itself is the incident.
Update the business impact analysis
- Walk the BIA with department leads to recalibrate revenue impact per hour, regulatory exposure (HIPAA, PCI, SOX), and customer-facing SLAs. Numbers older than 12 months are stale.
Verify vendor support contracts and escalation SLAs
- Confirm active contracts with the backup vendor (Veeam, Datto, Rubrik), the firewall vendor, the ISP, and any colo or DR-as-a-Service provider. Capture support phone numbers, account IDs, and named escalation contacts in the runbook — not in someone's inbox.

Backup and Recovery Verification

Verify nightly backup jobs in the backup console
- Open Veeam, Datto, or whichever backup platform is in use and confirm the success rate of each job over the last 30 days. Investigate every warning (yellow) and failure (red). A green dashboard with quietly failing jobs is the most common DR failure mode.
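The per-job review above can be sketched as a small summary over exported job history. The `(job_name, status)` tuples and the status strings are assumptions about a CSV or report export, not the output format of Veeam, Datto, or any other product.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def summarize_backup_runs(runs: Iterable[Tuple[str, str]]) -> Dict[str, dict]:
    """Summarize runs per job; status is assumed to be 'success', 'warning', or 'failed'."""
    totals = Counter()
    successes = Counter()
    for job, status in runs:
        totals[job] += 1
        if status == "success":
            successes[job] += 1
    summary = {}
    for job, total in totals.items():
        summary[job] = {
            "runs": total,
            "success_rate": successes[job] / total,
            # Every warning or failure counts as something to investigate.
            "needs_review": total - successes[job],
        }
    return summary
```

Any job with a non-zero `needs_review` count deserves a look even when its overall success rate seems acceptable.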
Confirm 3-2-1 with an immutable offsite copy
- Three copies, two media, one offsite — and at least one immutable (object lock, write-once tape, or a separate cloud account that production credentials cannot reach). Ransomware that finds the backup share encrypts the backups too; immutability is the only durable defense.
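The 3-2-1 rule with immutability can be written as four independent checks. This is a sketch under stated assumptions: `BackupCopy` and its fields are hypothetical, and the list is expected to contain every copy the team counts toward the rule.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class BackupCopy:
    """One copy of the data (hypothetical shape)."""
    label: str
    media: str       # e.g. "disk", "tape", "object-storage"
    offsite: bool
    immutable: bool  # object lock, write-once tape, or an isolated account


def check_3_2_1(copies: List[BackupCopy]) -> List[str]:
    """Return the list of 3-2-1 violations; an empty list means the rule holds."""
    problems = []
    if len(copies) < 3:
        problems.append("fewer than three copies")
    if len({c.media for c in copies}) < 2:
        problems.append("fewer than two media types")
    if not any(c.offsite for c in copies):
        problems.append("no offsite copy")
    if not any(c.immutable for c in copies):
        problems.append("no immutable copy")
    return problems
```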
Run a restore drill into an isolated VLAN
- Pick one tier-1 system and restore it end-to-end into a network-isolated recovery VLAN — VM, application, database, dependencies. Time the restore against documented RTO. The point is not that the backup file exists; the point is that the system boots and the data is consistent.
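A restore drill result can be recorded as a timing comparison plus the two checks the step calls out (the system boots, the data is consistent). The function and parameter names below are illustrative, and how "booted" and "data consistent" are determined is left to the team.

```python
from datetime import datetime, timedelta


def evaluate_restore_drill(
    started: datetime,
    validated: datetime,
    rto: timedelta,
    booted: bool,
    data_consistent: bool,
) -> dict:
    """Compare elapsed restore time with the documented RTO and report pass/fail."""
    elapsed = validated - started
    met_rto = elapsed <= rto
    return {
        "elapsed": elapsed,
        "met_rto": met_rto,
        # A drill passes only if the system works AND the RTO was met.
        "passed": booted and data_consistent and met_rto,
    }
```

A drill where `passed` is false is the condition that triggers the vendor escalation in the next step.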
Open a P1 with the backup vendor
- Triggered when the restore drill failed or missed RTO. File a P1 with the vendor referencing the job ID and restore log. Do not close this loop until the drill passes — a quarterly check-the-box drill that fails and is shrugged off is worse than no drill, because it manufactures false confidence.
Reconcile hardware, software, and license inventory
- Walk the asset inventory against actual deployed counts: VMs in vCenter, endpoints in Intune or JAMF, M365 license seats, EDR agents. A vendor audit (Microsoft, Oracle, VMware) that finds 80 unlicensed VMs during a recovery is a six-figure surprise on top of the disaster.

Emergency Response Readiness

Distribute the emergency response card to staff
- Wallet-sized card with the IT incident hotline, out-of-band channel address, and the first three things to do (do not power down, do not reconnect, call the hotline). Email-only distribution fails when email is the incident.
Test the out-of-band communication channel
- Send a test message to the Signal or SMS group and confirm receipt from each DR team member. Channels that nobody has opened in six months are channels nobody will see during an incident.
Run a ransomware tabletop exercise
- Walk a realistic scenario: domain admin credential compromised, backup share encrypted, EDR alerts arriving at 2am Saturday. Force decisions on isolation, comms, and ransom posture. Capture every place the playbook hits an unanswered question.
Schedule remediation training for tabletop gaps
- Triggered when the tabletop surfaces gaps. File each gap as a tracked action item with a named owner and due date. Schedule role-specific training — KnowBe4 for end users, vendor sessions for tier-2 engineers, an updated runbook walkthrough for the IC pool.
Confirm coordination with local emergency services
- For physical-site disasters (fire, flood, extended power loss), confirm that contacts for building security, the fire department non-emergency line, and utility-provider account numbers are in the runbook. For cyber incidents, confirm the FBI field office and CISA reporting paths.

Business Continuity

Document failover procedures for tier-1 apps
- Per-app runbook entries: failover trigger, DNS or load-balancer change, dependency order, validation tests, rollback plan. Procedures that live in one engineer's head are single points of failure.
Load-test remote access via VPN and ZTNA
- Validate that the FortiGate or Palo Alto concentrator and the ZTNA broker can handle the full workforce connecting concurrently. March 2020 caught many organizations with VPN capacity sized for 20% of headcount. Confirm conditional access policies still block legacy auth.
Validate alternate-site readiness and connectivity
- Hot site, warm site, or DRaaS — confirm circuits up, replication current, AD/DNS reachable, and at least one technician knows how to badge in. An alternate site nobody has visited in a year is theoretical.
Distribute the stakeholder status template
- Pre-approved template for executive, customer, and regulator updates so comms during an incident are not drafted from scratch under pressure. Include placeholders for impact, ETA, workaround, and next-update time.
Hold the quarterly DR review with the vCIO
- Walk the test results, gaps, and remediation plan with the vCIO or CIO. Tie outstanding items to budget — DR investments routinely lose to feature work unless leadership is forced to choose explicitly.

Post-Incident Recovery

Classify the incident severity and document scope
- Run the severity rubric: systems affected, users affected, data exposure, regulatory triggers (HIPAA breach, PCI cardholder data, GDPR personal data). Severity drives notification clocks: HIPAA allows at most 60 days from discovery of the breach, GDPR requires notifying the supervisory authority within 72 hours of becoming aware of it, and state laws vary.
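The notification clocks above can be turned into concrete timestamps as soon as the discovery time is recorded. This sketch uses only the 60-day and 72-hour figures from the step; when each clock legally starts, and any state-law deadlines, are not modeled and need confirmation from counsel. The names are illustrative.

```python
from datetime import datetime, timedelta
from typing import Dict, Iterable, List, Tuple

# Outer limits as stated in the checklist; not legal advice.
NOTIFICATION_WINDOWS = {
    "HIPAA": timedelta(days=60),
    "GDPR": timedelta(hours=72),
}


def notification_deadlines(
    discovered_at: datetime, regimes: Iterable[str]
) -> Tuple[Dict[str, datetime], List[str]]:
    """Return (deadlines by regime, regimes with no window defined here)."""
    deadlines = {}
    unmapped = []
    for regime in regimes:
        window = NOTIFICATION_WINDOWS.get(regime)
        if window is None:
            # e.g. state breach laws: look these up case by case.
            unmapped.append(regime)
        else:
            deadlines[regime] = discovered_at + window
    return deadlines, unmapped
```

Returning the unmapped regimes separately keeps a state-law obligation from silently dropping out of the incident record.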
Activate full DR site failover
- Triggered for catastrophic incidents only. Cut DNS and traffic to the DR site per the documented failover runbook, in dependency order. Confirm AD, DNS, and identity providers come up first; application tier follows. Notify the vendor and DRaaS provider in parallel.
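Bringing services up in dependency order is a topological sort. The sketch below uses the standard-library `graphlib` module (Python 3.9+); the service names and the dependency map are made up for illustration, and a real map would come from the tier-1 inventory.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set


def failover_order(dependencies: Dict[str, Set[str]]) -> List[str]:
    """Return a start order in which every service follows what it depends on.

    `dependencies` maps each service to the set of services it needs first.
    Raises graphlib.CycleError if the map contains a circular dependency.
    """
    return list(TopologicalSorter(dependencies).static_order())


# Hypothetical dependency map: DNS first, then AD, then the identity provider.
example = {
    "ad": {"dns"},
    "idp": {"ad"},
    "db": {"dns"},
    "app": {"idp", "db"},
}
order = failover_order(example)
```

Services with no ordering constraint between them (here `db` and `ad`) may come out in either order, so the runbook should still state any preference explicitly.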
Prioritize recovery work by RTO tier
- Work the inventory in RTO order: tier-1 first (revenue, safety, regulated), then tier-2, then tier-3. Resist scope creep from loud-but-low-priority requesters; the BIA is the tiebreaker.
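The recovery queue described above amounts to a sort. In this sketch, sorting by tier first and by RTO within a tier is an assumption about how ties are broken; the dictionary keys are hypothetical.

```python
from typing import Dict, List


def recovery_queue(systems: List[Dict]) -> List[str]:
    """Order systems by tier, then by tightest RTO within each tier."""
    ranked = sorted(systems, key=lambda s: (s["tier"], s["rto_minutes"]))
    return [s["name"] for s in ranked]
```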
Send recovery status updates on a fixed cadence
- Use the pre-approved status template. Send updates every 60 minutes for major or catastrophic events and every 4 hours for moderate ones. Always include the next-update time, even if the substantive update is 'no change.' Silence is what generates the executive escalation.
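The cadence rule can be built into the status message so the next-update time is never forgotten. This is a minimal sketch: the severity labels and the four fields follow the checklist, while the function name and message layout are illustrative.

```python
from datetime import datetime, timedelta

# Cadence as stated in the checklist.
UPDATE_CADENCE = {
    "catastrophic": timedelta(minutes=60),
    "major": timedelta(minutes=60),
    "moderate": timedelta(hours=4),
}


def build_status_update(
    severity: str, sent_at: datetime, impact: str, eta: str, workaround: str
) -> str:
    """Fill the status template and compute the next-update time from severity."""
    next_update = sent_at + UPDATE_CADENCE[severity]
    return (
        f"Impact: {impact}\n"
        f"ETA: {eta}\n"
        f"Workaround: {workaround}\n"
        f"Next update: {next_update.strftime('%Y-%m-%d %H:%M')}"
    )
```

An update with no new information still goes out with `impact` set to something like "no change" and a fresh next-update time.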
Capture lessons learned and update the runbook
- Hold a blameless postmortem within five business days of recovery. Capture what actually happened versus what the playbook said would happen. Every gap becomes a tracked runbook edit with a named owner — otherwise the next incident reproduces this one.