Disaster Recovery Plan Checklist

Risk Assessment and Business Impact Analysis

    Pull the service catalog from Backstage (or your equivalent registry). For each tier-1 service, list its database, cache, queue, third-party APIs, and identity dependencies. The common gotcha: a 'tier-1' API depends on a 'tier-3' shared service, which means tier-3 needs to be tier-1 for DR purposes.
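
    The tier-promotion gotcha above can be checked mechanically. The following is a minimal sketch, assuming the catalog has already been exported into a plain dictionary; the service names and the `tier`/`deps` fields are hypothetical and do not reflect Backstage's actual schema.

```python
from typing import Dict, Tuple

# Hypothetical export of the service catalog: declared tier plus direct dependencies.
CATALOG: Dict[str, dict] = {
    "checkout-api": {"tier": 1, "deps": ["orders-db", "feature-flags"]},
    "orders-db": {"tier": 1, "deps": []},
    "feature-flags": {"tier": 3, "deps": ["config-store"]},
    "config-store": {"tier": 3, "deps": []},
    "wiki": {"tier": 3, "deps": []},
}


def effective_tiers(catalog: Dict[str, dict]) -> Dict[str, int]:
    """Propagate the most critical tier (lowest number) down to every dependency.

    Iterates to a fixed point, so dependency cycles terminate safely.
    """
    effective = {name: entry["tier"] for name, entry in catalog.items()}
    changed = True
    while changed:
        changed = False
        for name, entry in catalog.items():
            for dep in entry["deps"]:
                # A dependency must be at least as critical as its consumer.
                if dep in effective and effective[dep] > effective[name]:
                    effective[dep] = effective[name]
                    changed = True
    return effective


def promotions(catalog: Dict[str, dict]) -> Dict[str, Tuple[int, int]]:
    """Return {service: (declared_tier, effective_tier)} for services needing promotion."""
    effective = effective_tiers(catalog)
    return {
        name: (entry["tier"], effective[name])
        for name, entry in catalog.items()
        if effective[name] != entry["tier"]
    }


if __name__ == "__main__":
    for service, (declared, needed) in promotions(CATALOG).items():
        print(f"{service}: declared tier-{declared}, needs tier-{needed} for DR")
```

    In this sample data, `feature-flags` and `config-store` are both flagged because a tier-1 API reaches them transitively, while `wiki` stays at tier-3.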

    Recovery Time Objective (how long can it be down?) and Recovery Point Objective (how much data can we lose?) drive every other choice — replication frequency, topology, snapshot cadence. Get sign-off from the product owner; engineering should not be inventing customer SLA targets unilaterally.

    Cover the realistic scenarios for a SaaS shop: AWS region outage, AZ outage, accidental DROP TABLE in production, ransomware encrypting backups, expired root credentials, supply-chain compromise of a critical dependency. Assign each a likelihood and a blast radius.

Recovery Strategy and Site Topology

    Pick from backup-and-restore, pilot light, warm standby, or multi-region active-active per tier. Tier-1 customer-facing services usually need warm standby (RTO < 1 hr); tier-3 internal tools can survive on backup-and-restore (RTO 24+ hr). Don't pay for active-active where pilot light is sufficient.
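
    One way to keep the per-tier choice consistent is to derive it from the signed-off RTO. This is an illustrative sketch only: the 1-hour and 24-hour boundaries come from the item above, while the 15-minute cutoff for active-active and the use of pilot light for everything in between are assumptions to adjust to your own cost model.

```python
def pick_strategy(rto_hours: float) -> str:
    """Map an agreed RTO to the cheapest strategy that can plausibly meet it."""
    if rto_hours <= 0:
        raise ValueError("RTO must be positive")
    if rto_hours < 0.25:
        # Assumption: below roughly 15 minutes, only active-active is realistic.
        return "active-active"
    if rto_hours < 1:
        # From the checklist: tier-1 customer-facing services, RTO < 1 hr.
        return "warm standby"
    if rto_hours < 24:
        # Assumption: pilot light covers the range between warm standby and restore.
        return "pilot light"
    # From the checklist: tier-3 internal tools, RTO 24+ hr.
    return "backup-and-restore"


if __name__ == "__main__":
    for service, rto in [("checkout-api", 0.5), ("reporting", 4), ("wiki", 48)]:
        print(f"{service}: RTO {rto} h -> {pick_strategy(rto)}")
```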

    Named roles: Incident Commander, Communications Lead, Database Lead, Infra Lead, Scribe. Each role has a primary and a secondary on the PagerDuty schedule. The single-point-of-failure pattern — 'only Priya knows how to promote the replica' — is the most common reason DR fails when it matters.
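
    The primary/secondary rule is easy to audit once the schedule is exported. A minimal sketch, assuming the on-call schedule has been dumped into a dictionary by hand or by script; the data shape and the names are hypothetical, and no PagerDuty API is called.

```python
from typing import Dict, List

ROLES = [
    "Incident Commander",
    "Communications Lead",
    "Database Lead",
    "Infra Lead",
    "Scribe",
]


def coverage_gaps(schedule: Dict[str, dict]) -> List[str]:
    """List every role that lacks a distinct primary and secondary."""
    gaps = []
    for role in ROLES:
        entry = schedule.get(role, {})
        primary = entry.get("primary")
        secondary = entry.get("secondary")
        if not primary:
            gaps.append(f"{role}: no primary")
        if not secondary:
            gaps.append(f"{role}: no secondary")
        elif primary == secondary:
            # Same person in both slots is still a single point of failure.
            gaps.append(f"{role}: primary and secondary are the same person")
    return gaps


if __name__ == "__main__":
    schedule = {
        "Incident Commander": {"primary": "alex", "secondary": "sam"},
        "Communications Lead": {"primary": "jo", "secondary": "lee"},
        "Database Lead": {"primary": "priya", "secondary": "priya"},
        "Infra Lead": {"primary": "chen", "secondary": "dana"},
        "Scribe": {"primary": "kim"},
    }
    for gap in coverage_gaps(schedule):
        print(gap)
```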

    The classic DR-region failure: you fail over the workload but the KMS keys, Secrets Manager entries, or IAM roles only exist in the primary region. Confirm multi-region KMS keys, Secrets Manager replication, and that IAM/SSO will resolve when the primary region is unreachable.

Plan Documentation

    One runbook per tier-1 service. Each runbook includes: trigger conditions, who to page, exact kubectl/Terraform/AWS CLI commands, expected output, validation queries, and rollback. Runbooks must be executable at 3am by a sleep-deprived secondary on-call — no 'see the architecture diagram' indirection.

    Cover RDS Multi-AZ failover, cross-region read-replica promotion, and the application-side connection-string cutover. Note replication lag thresholds — promoting a replica that's 30 minutes behind violates a 5-minute RPO. Include the steps to re-establish replication in the reverse direction once the primary region is back.
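
    The lag-versus-RPO rule can be written down as a pre-promotion gate. The sketch below assumes the replica lag has already been read from your monitoring system and is passed in as seconds; the `warn_fraction` early-warning threshold is an assumption, not something the checklist prescribes.

```python
from dataclasses import dataclass


@dataclass
class LagCheck:
    status: str  # "ok", "warn", or "violates_rpo"
    data_loss_seconds: float  # data lost if the replica were promoted right now


def check_replica_lag(
    lag_seconds: float, rpo_seconds: float, warn_fraction: float = 0.5
) -> LagCheck:
    """Classify replica lag against the RPO before promoting the replica."""
    if lag_seconds > rpo_seconds:
        status = "violates_rpo"
    elif lag_seconds > warn_fraction * rpo_seconds:
        # Assumption: alert once lag has consumed half of the RPO budget.
        status = "warn"
    else:
        status = "ok"
    return LagCheck(status=status, data_loss_seconds=lag_seconds)


if __name__ == "__main__":
    # The example from the checklist: 30 minutes behind against a 5-minute RPO.
    print(check_replica_lag(lag_seconds=30 * 60, rpo_seconds=5 * 60))
```

    A "violates_rpo" result does not necessarily block promotion during a real outage, but it tells the Incident Commander up front that the failover will exceed the documented RPO.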

    SOC 2 CC7.5 (recovery from identified security incidents) and CC9.1 (risk mitigation for business disruptions) require documented and tested DR. HIPAA §164.308(a)(7) requires a contingency plan with data backup, disaster recovery, and emergency-mode operation procedures. Tag each runbook section with the control it satisfies, so that Vanta/Drata/Secureframe can pull this evidence at audit time.

Backup and Replication

    For RDS: daily automated snapshots, 35-day retention (the maximum for automated backups), point-in-time recovery enabled. Copy snapshots to the DR region; automated snapshots stay in-region by default, which is useless for a regional outage. Encrypt with a multi-region KMS key, not the default `aws/rds` key.

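    S3 Cross-Region Replication (CRR) with versioning enabled on both source and destination, which CRR requires in any case. Versioning is the baseline protection against accidental delete or ransomware re-encryption: an overwrite or delete creates a new version or a delete marker instead of destroying the previous object, so the earlier version remains recoverable. Add an MFA-delete or Object Lock policy for the most sensitive buckets, because an attacker with permission to delete object versions can still remove the old versions.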

    If the Terraform state bucket is in the affected region and you can't read it, you can't `terraform apply` to rebuild infra. Replicate the state bucket cross-region. Export Secrets Manager / Vault contents to an encrypted offsite copy on the same cadence.

    Restore the latest snapshot into a non-production environment and validate row counts, schema integrity, and a sample of recent transactions. A green backup metric for 18 months means nothing if the restore script depends on a credential that rotated. Capture wall-clock time to first byte and to fully restored — that's your real RTO floor.
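
    The row-count validation and the timing capture can be sketched as follows. The counts are assumed to have been collected already (for example with `SELECT COUNT(*)` per table on each side); the table names, the `tolerance` parameter, and the `timed` helper are hypothetical illustrations rather than part of any restore tooling.

```python
import time
from typing import Callable, Dict, Tuple, TypeVar

T = TypeVar("T")


def compare_row_counts(
    source_counts: Dict[str, int],
    restored_counts: Dict[str, int],
    tolerance: float = 0.0,
) -> Dict[str, str]:
    """Return {table: problem} for tables that are missing or differ too much.

    tolerance is the allowed relative difference, since a snapshot is
    expected to trail the live database slightly.
    """
    problems = {}
    for table, expected in source_counts.items():
        actual = restored_counts.get(table)
        if actual is None:
            problems[table] = "missing from restore"
        elif abs(expected - actual) > tolerance * expected:
            problems[table] = f"expected {expected}, got {actual}"
    return problems


def timed(step: Callable[[], T]) -> Tuple[T, float]:
    """Run one restore step and return (result, wall-clock seconds)."""
    start = time.monotonic()
    result = step()
    return result, time.monotonic() - start


if __name__ == "__main__":
    source = {"orders": 1000, "users": 50}
    restored = {"orders": 990}
    problems, seconds = timed(
        lambda: compare_row_counts(source, restored, tolerance=0.02)
    )
    print(problems, f"validated in {seconds:.3f}s")
```

    Wrapping the actual restore command in `timed` gives the wall-clock number to record as the RTO floor; row counts alone do not prove schema integrity, so the sample-transaction check still has to be done separately.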

    Open Jira tickets for each gap found in the restore drill, with a named owner and a due date before the next quarterly drill. A failed restore drill with no remediation trail is likely to become a SOC 2 finding; auditors expect to see the gap, the ticket, and the closure evidence.

DR Testing and Maintenance

    Run a 90-minute tabletop scenario session: 'us-east-1 RDS is unreachable, what do we do?' Walk the runbook step-by-step without executing. The point is to surface the steps that don't make sense to anyone except the original author. Rotate the IC role through different engineers each quarter.

    Actually fail over to the DR region for a tier-1 service — staging traffic minimum, ideally a fraction of production traffic in a planned window. Measure observed RTO against the documented target. The tabletop will not surface the IAM-role-doesn't-exist-in-the-DR-region problem; only the live drill will.

    Every gap surfaced in tabletop or live drill becomes a ticket with an owner and a target close date. Action items without owners are not action items. Track closure rate quarterly — if action items roll over more than once, the DR program is performative.
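
    Closure rate and repeat rollovers are simple to compute from a ticket export. A minimal sketch, assuming each action item has been reduced to the hypothetical fields below; it does not call the Jira API.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ActionItem:
    key: str
    owner: Optional[str]
    closed: bool
    rollovers: int  # number of times the target close date has been pushed out


def quarterly_report(items: List[ActionItem]) -> dict:
    """Summarize the health of DR action items for the quarterly review."""
    total = len(items)
    closed = sum(1 for item in items if item.closed)
    return {
        # With no items at all there is nothing outstanding, so report 1.0.
        "closure_rate": closed / total if total else 1.0,
        # Items without owners are not action items.
        "unowned": [item.key for item in items if not item.owner],
        # Open items that rolled over more than once signal a performative program.
        "rolled_over_repeatedly": [
            item.key for item in items if not item.closed and item.rollovers > 1
        ],
    }


if __name__ == "__main__":
    items = [
        ActionItem("DR-101", "alex", closed=True, rollovers=0),
        ActionItem("DR-102", None, closed=False, rollovers=0),
        ActionItem("DR-103", "sam", closed=False, rollovers=2),
        ActionItem("DR-104", "jo", closed=True, rollovers=1),
    ]
    print(quarterly_report(items))
```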

Activation Decision

    Documented triggers: regional cloud outage with declared AWS Health Dashboard event, primary database unrecoverable for >15 min, customer-facing error rate >50% for >10 min. Don't activate DR for a single AZ blip — partial failover causes more outages than it prevents.
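
    Encoding the triggers keeps the thresholds from being reinterpreted under pressure. The sketch below uses the three thresholds listed above; the `Signals` fields are hypothetical inputs that would be filled in from monitoring and the AWS Health Dashboard. It only reports which triggers have fired and leaves the activation decision to the Incident Commander.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Signals:
    regional_outage_declared: bool  # a regional event is declared on the AWS Health Dashboard
    db_unrecoverable_minutes: float  # how long the primary database has been unrecoverable
    error_rate: float  # customer-facing error rate, 0.0 to 1.0
    error_rate_minutes: float  # how long the error rate has stayed at that level


def fired_triggers(s: Signals) -> List[str]:
    """Return the documented DR triggers that the current signals satisfy."""
    fired = []
    if s.regional_outage_declared:
        fired.append("regional outage declared")
    if s.db_unrecoverable_minutes > 15:
        fired.append("primary database unrecoverable > 15 min")
    if s.error_rate > 0.5 and s.error_rate_minutes > 10:
        fired.append("error rate > 50% for > 10 min")
    return fired


if __name__ == "__main__":
    # A short error spike on its own does not meet any trigger.
    print(fired_triggers(Signals(False, 0, 0.6, 5)))
    print(fired_triggers(Signals(True, 20, 0.8, 12)))
```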

    Use the dedicated DR escalation policy, not the standard SEV1 page. The IC opens the #incident-NNN Slack channel, the Zoom bridge, and assigns the Comms Lead and Scribe roles before any technical action begins.

    The IC makes the call after consulting the Database Lead and Infra Lead. Activation triggers the failover runbook and customer comms. Stand-down means continuing to monitor — log the rationale either way for the post-incident review.

Recovery Execution and Post-Incident

    Database Lead promotes the cross-region replica; Infra Lead flips the Route 53 weighted record or Global Accelerator endpoint; application teams confirm their services are reading from the DR connection strings. Comms Lead updates the public status page and notifies enterprise customers per their contractual notification windows.

    Run the data-validation queries from the runbook: row counts on critical tables, latest transaction timestamp per tenant, checksum on the last hour of writes. Calculate observed data loss and compare to the documented RPO. If you exceed RPO, that's a customer notification on its own under most enterprise contracts, and potentially a notifiable personal data breach under GDPR Article 33 if personal data of EU data subjects was lost.
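
    The observed-data-loss comparison comes down to timestamp arithmetic. A minimal sketch, assuming you know the last write acknowledged by the primary and the latest transaction present in the promoted replica; if the primary's last write cannot be determined, the time of failure is a conservative upper bound. The function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def observed_data_loss(
    last_primary_write: datetime, last_replica_write: datetime
) -> timedelta:
    """Window of writes that existed on the primary but not on the promoted replica."""
    return max(last_primary_write - last_replica_write, timedelta(0))


def exceeds_rpo(loss: timedelta, rpo: timedelta) -> bool:
    """True when the observed loss is larger than the documented RPO."""
    return loss > rpo


if __name__ == "__main__":
    primary = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
    replica = datetime(2024, 1, 1, 11, 52, 0, tzinfo=timezone.utc)
    loss = observed_data_loss(primary, replica)
    print(f"observed loss: {loss}, exceeds 5-minute RPO: "
          f"{exceeds_rpo(loss, timedelta(minutes=5))}")
```

    Running the same calculation per tenant, using the latest transaction timestamp per tenant from the validation queries, shows which customers actually fall outside the RPO.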

    Hold the post-incident review (PIR) within 5 business days. Cover the timeline, contributing factors (not 'root cause'), what worked, what didn't, and action items with owners. Skipping the PIR because 'we know what happened' is the most common path back to the same outage in 90 days; the second-order causes only surface in writing.