Disaster Recovery Plan Checklist

Quarterly disaster recovery review and drill cycle for SaaS engineering teams. Covers RTO/RPO definition, cross-region backup verification, failover runbooks, tabletop and live drills, and the activation/recovery sequence when DR actuall...

Risk Assessment and Business Impact Analysis

Inventory tier-1 services and their dependencies
- Pull the service catalog from Backstage (or your equivalent registry). For each tier-1 service, list its database, cache, queue, third-party APIs, and identity dependencies. The common gotcha: a 'tier-1' API depends on a 'tier-3' shared service, which means tier-3 needs to be tier-1 for DR purposes.
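
The tier-promotion rule above can be sketched as a walk over the dependency graph. This is a minimal illustration, assuming the catalog has already been exported into two plain dictionaries; the service names and the `effective_tiers` helper are hypothetical and not part of Backstage.

```python
from collections import deque


def effective_tiers(declared, deps):
    """Return the effective DR tier per service (1 is most critical).

    declared: service name -> declared tier
    deps:     service name -> list of services it depends on
    """
    effective = dict(declared)
    queue = deque(declared)
    while queue:
        svc = queue.popleft()
        for dep in deps.get(svc, []):
            # A dependency must be at least as critical as anything that relies on it.
            if effective[svc] < effective.get(dep, float("inf")):
                effective[dep] = effective[svc]
                queue.append(dep)  # re-check the dependency's own dependencies
    return effective


# Hypothetical catalog export.
declared = {"checkout-api": 1, "shared-config": 3, "report-builder": 3, "config-db": 3}
deps = {
    "checkout-api": ["shared-config"],
    "shared-config": ["config-db"],
    "report-builder": ["shared-config"],
}

tiers = effective_tiers(declared, deps)
promoted = {svc: tier for svc, tier in tiers.items() if tier != declared[svc]}
print(promoted)
```

Anything that shows up in `promoted` is a service whose declared tier understates its DR importance.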
Document RTO and RPO targets per service
- Recovery Time Objective (how long can it be down?) and Recovery Point Objective (how much data can we lose?) drive every other choice — replication frequency, topology, snapshot cadence. Get sign-off from the product owner; engineering should not be inventing customer SLA targets unilaterally.
Score threat scenarios against likelihood and impact
- Cover the realistic scenarios for a SaaS shop: AWS region outage, AZ outage, accidental DROP TABLE in production, ransomware encrypting backups, expired root credentials, supply-chain compromise of a critical dependency. Assign each a likelihood and a blast radius.
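
One way to make the scoring comparable across scenarios is a simple likelihood-times-impact product. The sketch below assumes a 1 to 5 scale for both axes; the scale, the `Threat` class, and the numbers are placeholders for illustration, not assessed values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Threat:
    name: str
    likelihood: int  # assumed scale: 1 (rare) to 5 (expected within the year)
    impact: int      # assumed scale: 1 (one internal tool) to 5 (all customers)

    @property
    def score(self) -> int:
        return self.likelihood * self.impact


def rank(threats):
    """Highest score first; ties broken by name so the order is stable."""
    return sorted(threats, key=lambda t: (-t.score, t.name))


# Placeholder numbers only; replace with your own assessment.
threats = [
    Threat("AWS region outage", likelihood=2, impact=5),
    Threat("Accidental DROP TABLE in production", likelihood=3, impact=4),
    Threat("Expired root credentials", likelihood=2, impact=3),
]

for t in rank(threats):
    print(f"{t.score:>2}  {t.name}")
```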

Recovery Strategy and Site Topology

Select the recovery topology per service tier
- Pick from backup-and-restore, pilot light, warm standby, or multi-region active-active per tier. Tier-1 customer-facing services usually need warm standby (RTO < 1 hr); tier-3 internal tools can survive on backup-and-restore (RTO 24+ hr). Don't pay for active-active where pilot light is sufficient.
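
The mapping from RTO to topology can be written down as a small lookup so it is applied consistently. Only the two bands quoted above (under 1 hour, 24 hours or more) come from this checklist; treating the band in between as pilot light and gating active-active behind an explicit flag are assumptions of this sketch.

```python
def pick_topology(rto_minutes, zero_downtime_required=False):
    """Suggest the cheapest topology that can plausibly meet the RTO."""
    if zero_downtime_required:
        # Assumption: active-active only when the business explicitly requires it.
        return "active-active"
    if rto_minutes < 60:
        return "warm standby"
    if rto_minutes < 24 * 60:
        # Assumption: the band between 1 hr and 24 hr maps to pilot light.
        return "pilot light"
    return "backup-and-restore"


print(pick_topology(30))
print(pick_topology(4 * 60))
print(pick_topology(48 * 60))
```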
Designate the DR team and on-call rotation
- Named roles: Incident Commander, Communications Lead, Database Lead, Infra Lead, Scribe. Each role has a primary and a secondary on the PagerDuty schedule. The single-point-of-failure pattern — 'only Priya knows how to promote the replica' — is the most common reason DR fails when it matters.
Verify cross-region IAM, KMS, and secrets replication
- The classic DR-region failure: you fail over the workload but the KMS keys, Secrets Manager entries, or IAM roles only exist in the primary region. Confirm multi-region KMS keys, Secrets Manager replication, and that IAM/SSO will resolve when the primary region is unreachable.

Plan Documentation

Write per-service failover runbooks in the wiki
- One runbook per tier-1 service. Each runbook includes: trigger conditions, who to page, exact kubectl/Terraform/AWS CLI commands, expected output, validation queries, and rollback. Runbooks must be executable at 3am by a sleep-deprived secondary on-call — no 'see the architecture diagram' indirection.
Document the database failover and promotion procedure
- Cover RDS Multi-AZ failover, cross-region read-replica promotion, and the application-side connection-string cutover. Note replication lag thresholds — promoting a replica that's 30 minutes behind violates a 5-minute RPO. Include the steps to re-establish replication in the reverse direction once the primary region is back.
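
The lag-versus-RPO check is worth encoding as a gate in the runbook rather than leaving it to judgment at 3am. A minimal sketch, assuming the replica lag has already been read from your monitoring system; `promotion_check` is a hypothetical helper.

```python
from datetime import timedelta


def promotion_check(replica_lag: timedelta, rpo: timedelta):
    """Return (ok, message) for promoting a replica with the given lag."""
    excess = replica_lag - rpo
    if excess > timedelta(0):
        return False, f"replica lag {replica_lag} exceeds RPO {rpo} by {excess}"
    return True, f"replica lag {replica_lag} is within RPO {rpo}"


# The example from the step above: a replica 30 minutes behind, 5-minute RPO.
ok, message = promotion_check(timedelta(minutes=30), timedelta(minutes=5))
print(ok, message)
```

A failed check does not necessarily block promotion during a real outage, but it should force an explicit decision by the Incident Commander and a note in the incident log.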
Map controls to SOC 2 and HIPAA requirements
- SOC 2 CC7.5 (recovery from identified incidents) and CC9.1 (business continuity) require documented and tested DR. HIPAA §164.308(a)(7) requires a contingency plan with data backup, disaster recovery, and emergency-mode operation procedures. Tag each runbook section with the control it satisfies — Vanta/Drata/Secureframe will pull this evidence at audit time.

Backup and Replication

Configure automated RDS snapshots with PITR
- Daily automated snapshots, 35-day retention, point-in-time recovery enabled. Copy snapshots to the DR region — automated snapshots stay in-region by default, which is useless for a regional outage. Encrypt with a multi-region KMS key, not the default `aws/rds` key.
Replicate S3 buckets cross-region with versioning
- S3 Cross-Region Replication (CRR) requires versioning on both the source and destination buckets. Versioning is also the main protection against accidental deletes and ransomware re-encryption: an overwrite or delete creates a new version or a delete marker instead of destroying the earlier data, so prior versions stay recoverable in both regions. Add MFA Delete or an Object Lock policy for the most sensitive buckets.
Back up Terraform state and secrets offsite
- If the Terraform state bucket is in the affected region and you can't read it, you can't `terraform apply` to rebuild infra. Replicate the state bucket cross-region. Export Secrets Manager / Vault contents to an encrypted offsite copy on the same cadence.
Run the quarterly restore drill
- Restore the latest snapshot into a non-production environment and validate row counts, schema integrity, and a sample of recent transactions. A green backup metric for 18 months means nothing if the restore script depends on a credential that rotated. Capture wall-clock time to first byte and to full restore; that is your real RTO floor.
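
The row-count comparison and wall-clock capture can be sketched as two small helpers. This assumes the counts have already been collected from the source and the restored database into dictionaries; the table names and numbers are placeholders, and `compare_row_counts` and `timed` are hypothetical helpers.

```python
import time


def compare_row_counts(expected, restored, tolerance=0):
    """Return {table: (expected, restored)} for missing or mismatched tables."""
    mismatches = {}
    for table, want in expected.items():
        got = restored.get(table)
        if got is None or abs(want - got) > tolerance:
            mismatches[table] = (want, got)
    return mismatches


def timed(step):
    """Run a zero-argument callable and return (result, elapsed_seconds)."""
    start = time.monotonic()
    result = step()
    return result, time.monotonic() - start


# Placeholder counts for illustration.
expected = {"accounts": 1200, "invoices": 56000, "audit_log": 910000}
restored = {"accounts": 1200, "invoices": 55990}

mismatches, elapsed = timed(lambda: compare_row_counts(expected, restored))
print(mismatches)
print(f"validation took {elapsed:.3f}s")
```

In a real drill, `timed` would wrap the restore itself, and the elapsed value goes into the drill record next to the documented RTO.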
File remediation tickets for failed restores
- Open Jira tickets for each gap with a named owner and a due date before the next quarterly drill. A failed restore drill is itself a SOC 2 finding — auditors expect to see the gap, the ticket, and the closure evidence.

DR Testing and Maintenance

Run a tabletop walkthrough with on-call engineers
- 90-minute scenario session: 'us-east-1 RDS is unreachable, what do we do?' Walk the runbook step-by-step without executing. The point is to surface the steps that don't make sense to anyone except the original author. Rotate the IC role through different engineers each quarter.
Execute a failover drill in the DR region
- Actually fail over to the DR region for a tier-1 service: staging traffic at minimum, ideally a fraction of production traffic in a planned window. Measure observed RTO against the documented target. The tabletop will not surface the IAM-role-doesn't-exist-in-the-DR-region problem; only the live drill will.
File drill action items in Jira
- Every gap surfaced in tabletop or live drill becomes a ticket with an owner and a target close date. Action items without owners are not action items. Track closure rate quarterly — if action items roll over more than once, the DR program is performative.
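
The closure-rate and rollover tracking can be computed from a ticket export. A minimal sketch, assuming each ticket has been reduced to the four fields below; the `ActionItem` shape and the ticket keys are hypothetical and not a Jira API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ActionItem:
    key: str              # hypothetical ticket key
    owner: Optional[str]  # None means nobody is assigned
    closed: bool
    rollovers: int        # quarterly drills this item has stayed open through


def review(items):
    """Summarize the quarter: closure rate, unowned items, repeat rollovers."""
    total = len(items)
    closed = sum(1 for i in items if i.closed)
    return {
        "closure_rate": closed / total if total else 1.0,
        "unowned": [i.key for i in items if not i.closed and i.owner is None],
        # "Rolled over more than once" is the warning sign named above.
        "rolled_over_twice": [i.key for i in items if not i.closed and i.rollovers > 1],
    }


items = [
    ActionItem("DR-1", "alice", closed=True, rollovers=0),
    ActionItem("DR-2", None, closed=False, rollovers=0),
    ActionItem("DR-3", "bob", closed=False, rollovers=2),
    ActionItem("DR-4", "carol", closed=True, rollovers=1),
]
summary = review(items)
print(summary)
```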

Activation Decision

Confirm SEV1 trigger conditions are met
- Documented triggers: regional cloud outage with declared AWS Health Dashboard event, primary database unrecoverable for >15 min, customer-facing error rate >50% for >10 min. Don't activate DR for a single AZ blip — partial failover causes more outages than it prevents.
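
The documented triggers can be evaluated mechanically so the on-call engineer is not interpreting thresholds under pressure. The thresholds below are the ones listed in this step; the `Signals` fields are assumptions about what your monitoring exposes.

```python
from dataclasses import dataclass


@dataclass
class Signals:
    regional_outage_declared: bool   # declared event on the AWS Health Dashboard
    db_unrecoverable_minutes: float  # how long the primary database has been unrecoverable
    error_rate: float                # customer-facing error rate, 0.0 to 1.0
    error_rate_minutes: float        # how long the error rate has been at that level


def sev1_triggers(s: Signals):
    """Return the list of documented trigger conditions that are currently met."""
    reasons = []
    if s.regional_outage_declared:
        reasons.append("regional outage declared")
    if s.db_unrecoverable_minutes > 15:
        reasons.append("primary database unrecoverable for >15 min")
    if s.error_rate > 0.50 and s.error_rate_minutes > 10:
        reasons.append("customer-facing error rate >50% for >10 min")
    return reasons


print(sev1_triggers(Signals(False, 0, 0.62, 12)))
print(sev1_triggers(Signals(False, 5, 0.62, 3)))
```

An empty list means no documented trigger is met and DR activation should not be on the table yet.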
Page the incident commander via PagerDuty
- Use the dedicated DR escalation policy, not the standard SEV1 page. The IC opens the #incident-NNN Slack channel, the Zoom bridge, and assigns the Comms Lead and Scribe roles before any technical action begins.
Decide whether to activate the DR plan
- The IC makes the call after consulting the Database Lead and Infra Lead. Activation triggers the failover runbook and customer comms. Stand-down means continuing to monitor — log the rationale either way for the post-incident review.

Recovery Execution and Post-Incident

Execute the failover runbook in the DR region
- Database Lead promotes the cross-region replica; Infra Lead flips the Route 53 weighted record or Global Accelerator endpoint; application teams confirm their services are reading from the DR connection strings. Comms Lead updates the public status page and notifies enterprise customers per their contractual notification windows.
Validate data integrity against RPO targets
- Run the data-validation queries from the runbook: row counts on critical tables, latest transaction timestamp per tenant, checksum on the last hour of writes. Calculate observed data loss and compare to the documented RPO. If you exceed RPO, that's a customer notification on its own under most enterprise contracts and under GDPR Article 33 if EU PII was involved.
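
The observed-loss calculation can be sketched per tenant from the latest transaction timestamps. This assumes the timestamps have already been queried from the promoted replica and that the failure time is known; `rpo_breaches` and the tenant names are hypothetical. The gap is an upper bound on loss, since an idle tenant may simply have written nothing before the failure.

```python
from datetime import datetime, timedelta, timezone


def rpo_breaches(failure_time, last_txn_by_tenant, rpo):
    """Return {tenant: gap} where the gap before failure exceeds the RPO."""
    breaches = {}
    for tenant, last_txn in last_txn_by_tenant.items():
        gap = failure_time - last_txn
        if gap > rpo:
            breaches[tenant] = gap
    return breaches


# Placeholder timestamps for illustration.
failure = datetime(2024, 1, 10, 3, 0, tzinfo=timezone.utc)
last_txn = {
    "tenant-a": failure - timedelta(minutes=2),
    "tenant-b": failure - timedelta(minutes=12),
}

breaches = rpo_breaches(failure, last_txn, rpo=timedelta(minutes=5))
print(breaches)
```

Tenants in the result need a closer look (for example against application logs) before deciding whether a notification is owed.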
Hold the blameless post-incident review
- Within 5 business days. Cover the timeline, contributing factors (not 'root cause'), what worked, what didn't, and action items with owners. Skipping the PIR because 'we know what happened' is the most common path back to the same outage in 90 days — the second-order causes only surface in writing.