Disaster Recovery Plan Checklist
Preparation and Planning
Pull the application inventory from the CMDB or RMM and tag each system with its RTO, RPO, and dependency tree. Common gotcha: shadow SaaS tools the business owns directly never make it onto the DR list, then become the loudest complaint during an outage. Attach the spreadsheet or export.
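Once the inventory is exported, the tagging step can be checked mechanically. A minimal Python sketch, assuming a hypothetical `System` record; the field names and sample systems are illustrative and do not come from any CMDB or RMM schema:

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class System:
    # Hypothetical record shape; map these fields from your own export.
    name: str
    rto_hours: Optional[float] = None
    rpo_hours: Optional[float] = None
    depends_on: list = field(default_factory=list)


def missing_tags(systems):
    """Names of systems that lack an RTO or an RPO."""
    return [s.name for s in systems if s.rto_hours is None or s.rpo_hours is None]


def unlisted_dependencies(systems):
    """Dependencies that are referenced but absent from the inventory."""
    known = {s.name for s in systems}
    return sorted({d for s in systems for d in s.depends_on if d not in known})


# Illustrative data only.
inventory = [
    System("erp", rto_hours=4, rpo_hours=1, depends_on=["sql-cluster", "sso"]),
    System("sql-cluster", rto_hours=2, rpo_hours=0.25),
    System("file-share", depends_on=["sso"]),
]

print(missing_tags(inventory))
print(unlisted_dependencies(inventory))
```

A dependency that is referenced but never inventoried (here the made-up `sso`) is often how a shadow SaaS tool shows up.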
Verify each DR role has a primary and backup named — incident commander, comms lead, infrastructure lead, identity lead, vendor liaison. Cross-check PagerDuty or Opsgenie schedules for gaps. People leave; rosters drift.
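The roster check can be sketched in Python as follows. The `roster` dictionary shape and the names in it are assumptions for illustration; in practice the data would be exported from the on-call tool rather than typed in.

```python
ROLES = [
    "incident commander",
    "comms lead",
    "infrastructure lead",
    "identity lead",
    "vendor liaison",
]


def roster_gaps(roster):
    """List every role missing a primary or a distinct backup."""
    gaps = []
    for role in ROLES:
        entry = roster.get(role, {})
        primary = entry.get("primary")
        backup = entry.get("backup")
        if not primary:
            gaps.append(f"{role}: no primary")
        if not backup:
            gaps.append(f"{role}: no backup")
        elif backup == primary:
            gaps.append(f"{role}: backup is the same person as primary")
    return gaps


# Illustrative roster; names are made up.
roster = {
    "incident commander": {"primary": "A. Rivera", "backup": "J. Chen"},
    "comms lead": {"primary": "M. Okafor", "backup": "M. Okafor"},
    "infrastructure lead": {"primary": "S. Patel", "backup": None},
    "identity lead": {"primary": "L. Novak", "backup": "T. Brandt"},
}

for gap in roster_gaps(roster):
    print(gap)
```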
Record personal phone numbers, an SMS group, a Signal channel, and a non-corporate email address: anything that does not depend on M365 or the corporate network being up. In-band comms fail precisely when the platform they run on is the incident.
Walk the BIA with department leads to recalibrate revenue impact per hour, regulatory exposure (HIPAA, PCI, SOX), and customer-facing SLAs. Numbers older than 12 months are stale.
Confirm active contracts with the backup vendor (Veeam, Datto, Rubrik), the firewall vendor, the ISP, and any colo or DR-as-a-Service provider. Capture support phone numbers, account IDs, and named escalation contacts in the runbook — not in someone's inbox.
Backup and Recovery Verification
Open Veeam, Datto, or whichever backup platform is in use and confirm last-30-day success rate per job. Investigate every yellow and red. A green dashboard with quietly failing jobs is the most common DR failure mode.
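The per-job success rate is simple arithmetic once the job history is exported. A hedged Python sketch, assuming the history has been flattened to `(job_name, status)` pairs; the status strings are placeholders and are not the values any particular backup platform emits:

```python
from collections import Counter


def success_rates(runs):
    """Fraction of runs per job that finished with status 'success'."""
    total = Counter(job for job, _ in runs)
    ok = Counter(job for job, status in runs if status == "success")
    return {job: ok[job] / total[job] for job in total}


def jobs_to_investigate(runs):
    """Jobs with at least one warning (yellow) or failed (red) run."""
    return sorted({job for job, status in runs if status != "success"})


# Illustrative 30-day history.
runs = (
    [("sql-nightly", "success")] * 28
    + [("sql-nightly", "warning"), ("sql-nightly", "failed")]
    + [("fileserver", "success")] * 30
)

for job, rate in success_rates(runs).items():
    print(f"{job}: {rate:.1%}")
print(jobs_to_investigate(runs))
```

Warnings are deliberately counted as non-successes here, so that a job cannot look healthy while quietly degrading.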
Three copies, two media, one offsite, and at least one immutable (object lock, write-once tape, or a separate cloud account that production credentials cannot reach). Ransomware that finds the backup share encrypts the backups too; a copy that production credentials cannot modify or delete is the one that survives.
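The copy rule above can be expressed as four independent checks. A minimal Python sketch, assuming a hypothetical `BackupCopy` record and assuming the production copy counts toward the three; both are illustrative choices rather than a vendor definition:

```python
from dataclasses import dataclass


@dataclass
class BackupCopy:
    # Hypothetical fields describing where one copy of the data lives.
    location: str
    media: str
    offsite: bool
    immutable: bool


def check_copies(copies):
    """Evaluate the three-two-one rule plus the immutability requirement."""
    return {
        "three_copies": len(copies) >= 3,
        "two_media": len({c.media for c in copies}) >= 2,
        "one_offsite": any(c.offsite for c in copies),
        "one_immutable": any(c.immutable for c in copies),
    }


# Illustrative layout; the production copy is counted as copy one.
copies = [
    BackupCopy("production SAN", "disk", offsite=False, immutable=False),
    BackupCopy("backup repository", "disk", offsite=False, immutable=False),
    BackupCopy("cloud bucket with object lock", "object storage", offsite=True, immutable=True),
]

print(check_copies(copies))
```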
Pick one tier-1 system and restore it end-to-end into a network-isolated recovery VLAN — VM, application, database, dependencies. Time the restore against documented RTO. The point is not that the backup file exists; the point is that the system boots and the data is consistent.
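Recording the drill in a structured form makes the pass or fail decision explicit. A small Python sketch under stated assumptions: the function name, its parameters, and the timestamps are all hypothetical.

```python
from datetime import datetime, timedelta


def drill_result(started, finished, rto, booted, data_consistent):
    """A drill passes only if the system boots, data is consistent, and RTO is met."""
    elapsed = finished - started
    within_rto = elapsed <= rto
    return {
        "elapsed": elapsed,
        "within_rto": within_rto,
        "passed": booted and data_consistent and within_rto,
    }


# Illustrative drill: the restore worked but took 5.5 hours against a 4-hour RTO.
result = drill_result(
    started=datetime(2024, 1, 13, 9, 0),
    finished=datetime(2024, 1, 13, 14, 30),
    rto=timedelta(hours=4),
    booted=True,
    data_consistent=True,
)
print(result)
```

A restore that succeeds but overruns the RTO still counts as a failed drill in this sketch, which matches the point that a backup file merely existing is not the test.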
Triggered when the restore drill fails or misses RTO. File a P1 with the vendor referencing the job ID and restore log. Do not close this loop until the drill passes: a quarterly check-the-box drill that fails and is shrugged off is worse than no drill, because it manufactures false confidence.
Walk the asset inventory against actual deployed counts: VMs in vCenter, endpoints in Intune or JAMF, M365 license seats, EDR agents. A vendor audit (Microsoft, Oracle, VMware) that finds 80 unlicensed VMs during a recovery is a six-figure surprise on top of the disaster.
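The reconciliation is a per-product subtraction. A hedged Python sketch, assuming licensed and deployed counts have already been collected into two dictionaries; the product keys and numbers are made up for illustration:

```python
def license_shortfalls(licensed, deployed):
    """Return products where deployed count exceeds licensed count, with the gap."""
    shortfalls = {}
    for product, count in deployed.items():
        gap = count - licensed.get(product, 0)
        if gap > 0:
            shortfalls[product] = gap
    return shortfalls


# Illustrative counts only.
licensed = {"vm": 120, "m365_seats": 450, "edr_agents": 500}
deployed = {"vm": 200, "m365_seats": 440, "edr_agents": 500}

print(license_shortfalls(licensed, deployed))
```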
Emergency Response Readiness
Wallet-sized card with the IT incident hotline, out-of-band channel address, and the first three things to do (do not power down, do not reconnect, call the hotline). Email-only distribution fails when email is the incident.
Send a test message to the Signal or SMS group and confirm receipt from each DR team member. Channels that nobody has opened in six months are channels nobody will see during an incident.
Walk a realistic scenario: domain admin credential compromised, backup share encrypted, EDR alerts arriving at 2am Saturday. Force decisions on isolation, comms, and ransom posture. Capture every place the playbook hits an unanswered question.
Triggered when the tabletop surfaces gaps. File each gap as a tracked action item with a named owner and due date. Schedule role-specific training — KnowBe4 for end users, vendor sessions for tier-2 engineers, an updated runbook walkthrough for the IC pool.
For physical-site disasters (fire, flood, extended power loss), confirm that building security contacts, the fire department non-emergency line, and utility-provider account numbers are in the runbook. For cyber incidents, confirm the FBI field office and CISA reporting paths.
Business Continuity
Per-app runbook entries: failover trigger, DNS or load-balancer change, dependency order, validation tests, rollback plan. Procedures that live in one engineer's head are single points of failure.
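Dependency order can be derived rather than remembered if each runbook entry lists what the application depends on. A minimal Python sketch using the standard-library `graphlib` module (Python 3.9+); the service names and the dependency map are hypothetical:

```python
from graphlib import TopologicalSorter

# Hypothetical map: each key depends on every service in its set.
depends_on = {
    "dns": set(),
    "active-directory": {"dns"},
    "sso": {"active-directory"},
    "sql-cluster": {"active-directory"},
    "erp": {"sso", "sql-cluster"},
}


def failover_order(graph):
    """Order services so that every dependency comes up before its dependents.

    Raises graphlib.CycleError if the map contains a circular dependency.
    """
    return list(TopologicalSorter(graph).static_order())


order = failover_order(depends_on)
print(order)
```

A `CycleError` is itself a useful finding: it means two runbook entries each claim the other must start first.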
Validate that the FortiGate or Palo Alto concentrator and the ZTNA broker can handle the full workforce concurrently. March 2020 caught many organizations with VPN capacity sized for roughly 20% of headcount. Confirm conditional access policies still block legacy auth.
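A first-pass sizing check is plain arithmetic. The Python sketch below compares headcount against concurrent-session capacity only; it is an assumption-laden simplification, since real capacity also depends on throughput and licensing terms, and the function and parameter names are hypothetical.

```python
import math


def vpn_headroom(headcount, concurrent_capacity, peak_fraction=1.0):
    """Sessions left over (positive) or missing (negative) at peak concurrency."""
    needed = math.ceil(headcount * peak_fraction)
    return concurrent_capacity - needed


# Illustrative: 500 staff, concentrator sized for 100 concurrent sessions (20%).
print(vpn_headroom(headcount=500, concurrent_capacity=100))
```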
Hot site, warm site, or DRaaS — confirm circuits up, replication current, AD/DNS reachable, and at least one technician knows how to badge in. An alternate site nobody has visited in a year is theoretical.
Pre-approved template for executive, customer, and regulator updates so comms during an incident are not drafted from scratch under pressure. Include placeholders for impact, ETA, workaround, and next-update time.
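One way to keep the placeholders enforceable is a template that refuses to render when a field is missing. A small Python sketch using the standard-library `string.Template`; the wording of the template is an illustration and not an approved message:

```python
from string import Template

# Hypothetical template; the four placeholders mirror the fields named above.
STATUS_UPDATE = Template(
    "Impact: $impact\n"
    "ETA: $eta\n"
    "Workaround: $workaround\n"
    "Next update: $next_update"
)


def render_status(**fields):
    """Render the update; substitute() raises KeyError if a placeholder is unfilled."""
    return STATUS_UPDATE.substitute(**fields)


message = render_status(
    impact="Order entry unavailable for all users",
    eta="Under investigation",
    workaround="Phone orders via the service desk",
    next_update="15:00",
)
print(message)
```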
Walk the test results, gaps, and remediation plan with the vCIO or CIO. Tie outstanding items to budget — DR investments routinely lose to feature work unless leadership is forced to choose explicitly.
Post-Incident Recovery
Run the severity rubric: systems affected, users affected, data exposure, regulatory triggers (HIPAA breach, PCI cardholder data, GDPR personal data). Severity drives notification clocks: HIPAA allows at most 60 days from discovery of the breach, GDPR allows 72 hours from becoming aware of it to notify the supervisory authority, and state laws vary.
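The notification clocks can be turned into concrete timestamps at the moment of discovery. A hedged Python sketch: the windows below encode only the two outer limits named above (60 days and 72 hours), both regimes also expect notification without undue delay, state laws are omitted, and the dictionary and function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Outer limits only; a simplification for illustration, not legal guidance.
NOTIFICATION_WINDOWS = {
    "hipaa": timedelta(days=60),
    "gdpr": timedelta(hours=72),
}


def notification_deadlines(discovered_at, regimes):
    """Latest permissible notification time per regime, counted from discovery."""
    return {r: discovered_at + NOTIFICATION_WINDOWS[r] for r in regimes}


# Illustrative discovery time.
discovered = datetime(2024, 3, 2, 2, 0, tzinfo=timezone.utc)
for regime, deadline in notification_deadlines(discovered, ["gdpr", "hipaa"]).items():
    print(regime, deadline.isoformat())
```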
Triggered for catastrophic incidents only. Cut DNS and traffic to the DR site per the documented failover runbook, in dependency order. Confirm AD, DNS, and identity providers come up first; application tier follows. Notify the vendor and DRaaS provider in parallel.
Work the inventory in RTO order: tier-1 first (revenue, safety, regulated), then tier-2, then tier-3. Resist scope creep from loud-but-low-priority requesters; the BIA is the tiebreaker.
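Sorting the inventory up front gives the team a queue to point at when a low-priority request arrives. A minimal Python sketch, assuming each system is a dictionary with hypothetical `tier` and `rto_hours` keys:

```python
def restore_queue(systems):
    """Order by tier first, then by tightest RTO within the tier."""
    return sorted(systems, key=lambda s: (s["tier"], s["rto_hours"]))


# Illustrative inventory.
systems = [
    {"name": "intranet-wiki", "tier": 3, "rto_hours": 72},
    {"name": "erp", "tier": 1, "rto_hours": 4},
    {"name": "file-share", "tier": 2, "rto_hours": 24},
    {"name": "payment-gateway", "tier": 1, "rto_hours": 1},
]

for s in restore_queue(systems):
    print(s["tier"], s["rto_hours"], s["name"])
```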
Use the pre-approved status template. Send updates every 60 minutes for major or catastrophic events, every 4 hours for moderate. Always include next-update time, even if the substantive update is 'no change.' Silence is what generates the executive escalation.
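The cadence above maps directly to a lookup, which makes the next-update time something to compute rather than guess. A small Python sketch; the severity labels follow the text, while the function name and timestamps are assumptions:

```python
from datetime import datetime, timedelta

# Cadence taken from the checklist item: 60 minutes or 4 hours.
UPDATE_INTERVAL = {
    "catastrophic": timedelta(minutes=60),
    "major": timedelta(minutes=60),
    "moderate": timedelta(hours=4),
}


def next_update_time(sent_at, severity):
    """Time by which the next status update is due, even if it says 'no change'."""
    return sent_at + UPDATE_INTERVAL[severity]


sent = datetime(2024, 3, 2, 14, 0)
print(next_update_time(sent, "major"))
print(next_update_time(sent, "moderate"))
```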
Hold a blameless postmortem within five business days of recovery. Capture what actually happened versus what the playbook said would happen. Every gap becomes a tracked runbook edit with a named owner — otherwise the next incident reproduces this one.
