Disaster Recovery Plan Checklist
Risk Assessment and Business Impact Analysis
Pull the service catalog from Backstage (or your equivalent registry). For each tier-1 service, list its database, cache, queue, third-party APIs, and identity dependencies. The common gotcha: a 'tier-1' API depends on a 'tier-3' shared service, which means tier-3 needs to be tier-1 for DR purposes.
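The tier-promotion gotcha can be checked mechanically once the dependency list exists. The following is a minimal sketch, assuming the catalog export has been reduced to two plain dictionaries; the service names and the `effective_tiers` helper are hypothetical, not part of Backstage.

```python
from typing import Dict, List


def effective_tiers(declared: Dict[str, int], deps: Dict[str, List[str]]) -> Dict[str, int]:
    """Push the strictest (lowest-numbered) tier down to every dependency."""
    effective = dict(declared)
    stack = list(declared)
    while stack:
        svc = stack.pop()
        for dep in deps.get(svc, []):
            # Undeclared dependencies start at a placeholder tier of 99.
            if effective[svc] < effective.get(dep, 99):
                effective[dep] = effective[svc]
                stack.append(dep)  # revisit so the change keeps propagating
    return effective


# Hypothetical catalog: a tier-1 API leaning on a tier-3 shared service.
declared = {"checkout-api": 1, "billing-worker": 2, "feature-flags": 3}
deps = {
    "checkout-api": ["billing-worker", "feature-flags"],
    "billing-worker": ["feature-flags"],
}
promoted = effective_tiers(declared, deps)
for name, tier in promoted.items():
    if tier != declared[name]:
        print(f"{name}: declared tier {declared[name]}, DR tier {tier}")
```

Any service printed here is one whose declared tier understates what DR has to cover.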
Recovery Time Objective (how long can it be down?) and Recovery Point Objective (how much data can we lose?) drive every other choice — replication frequency, topology, snapshot cadence. Get sign-off from the product owner; engineering should not be inventing customer SLA targets unilaterally.
Cover the realistic scenarios for a SaaS shop: AWS region outage, AZ outage, accidental DROP TABLE in production, ransomware encrypting backups, expired root credentials, supply-chain compromise of a critical dependency. Assign each a likelihood and a blast radius.
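One simple way to turn likelihood and blast radius into a ranked register is to multiply the two. This is a sketch only: the 1 to 5 scales and every number below are illustrative placeholders, not measured values.

```python
from dataclasses import dataclass


@dataclass
class Scenario:
    name: str
    likelihood: int    # 1 (rare) .. 5 (frequent); scale is an assumption
    blast_radius: int  # 1 (one service) .. 5 (whole platform); scale is an assumption

    @property
    def score(self) -> int:
        return self.likelihood * self.blast_radius


# Placeholder scores for illustration; replace with your own assessment.
register = [
    Scenario("AWS region outage", 2, 5),
    Scenario("AZ outage", 3, 2),
    Scenario("Accidental DROP TABLE in production", 3, 4),
    Scenario("Ransomware encrypting backups", 2, 5),
]

ranked = sorted(register, key=lambda s: s.score, reverse=True)
for s in ranked:
    print(f"{s.score:>2}  {s.name}")
```

The ranking is only as good as the inputs, so review the scores with the same people who sign off on RTO and RPO.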
Recovery Strategy and Site Topology
Pick from backup-and-restore, pilot light, warm standby, or multi-region active-active per tier. Tier-1 customer-facing services usually need warm standby (RTO < 1 hr); tier-3 internal tools can survive on backup-and-restore (RTO 24+ hr). Don't pay for active-active where pilot light is sufficient.
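The per-tier choice can be written down as a lookup from the agreed RTO. In this sketch the 1-hour and 24-hour boundaries come from the guidance above; the 5-minute cutoff for active-active and the use of pilot light for everything in between are assumptions to adjust.

```python
def pick_strategy(rto_minutes: int) -> str:
    """Map an agreed RTO to the cheapest topology likely to meet it."""
    if rto_minutes < 5:          # assumption: near-zero RTO needs active-active
        return "multi-region active-active"
    if rto_minutes < 60:         # from the checklist: warm standby for RTO < 1 hr
        return "warm standby"
    if rto_minutes < 24 * 60:    # assumption: pilot light covers the middle band
        return "pilot light"
    return "backup-and-restore"  # from the checklist: RTO of 24+ hr


for rto in (2, 45, 240, 1440):
    print(f"RTO {rto} min: {pick_strategy(rto)}")
```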
Named roles: Incident Commander, Communications Lead, Database Lead, Infra Lead, Scribe. Each role has a primary and a secondary on the PagerDuty schedule. The single-point-of-failure pattern — 'only Priya knows how to promote the replica' — is the most common reason DR fails when it matters.
The classic DR-region failure: you fail over the workload but the KMS keys, Secrets Manager entries, or IAM roles only exist in the primary region. Confirm multi-region KMS keys, Secrets Manager replication, and that IAM/SSO will resolve when the primary region is unreachable.
Plan Documentation
One runbook per tier-1 service. Each runbook includes: trigger conditions, who to page, exact kubectl/Terraform/AWS CLI commands, expected output, validation queries, and rollback. Runbooks must be executable at 3am by a sleep-deprived secondary on-call — no 'see the architecture diagram' indirection.
Cover RDS Multi-AZ failover, cross-region read-replica promotion, and the application-side connection-string cutover. Note replication lag thresholds — promoting a replica that's 30 minutes behind violates a 5-minute RPO. Include the steps to re-establish replication in the reverse direction once the primary region is back.
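The lag threshold is worth encoding as an explicit gate in the runbook rather than leaving it to judgment at 3am. Here is a minimal sketch, assuming the lag value has already been read from monitoring (for example the RDS `ReplicaLag` CloudWatch metric); `safe_to_promote` is a hypothetical helper.

```python
from datetime import timedelta


def safe_to_promote(replica_lag: timedelta, rpo: timedelta) -> bool:
    """A replica is promotable without breaching RPO only if its lag fits inside it."""
    return replica_lag <= rpo


rpo = timedelta(minutes=5)
for lag in (timedelta(seconds=40), timedelta(minutes=30)):
    verdict = "promote" if safe_to_promote(lag, rpo) else "escalate: promotion would breach RPO"
    print(f"lag={lag}, rpo={rpo}: {verdict}")
```

A failed check does not forbid promotion; it means the IC promotes knowing the data-loss notification path will be needed.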
SOC 2 CC7.5 (recovery from identified incidents) and CC9.1 (business continuity) require documented and tested DR. HIPAA §164.308(a)(7) requires a contingency plan with data backup, disaster recovery, and emergency-mode operation procedures. Tag each runbook section with the control it satisfies — Vanta/Drata/Secureframe will pull this evidence at audit time.
Backup and Replication
Daily automated snapshots, 35-day retention, point-in-time recovery enabled. Copy snapshots to the DR region — automated snapshots stay in-region by default, which is useless for a regional outage. Encrypt with a multi-region KMS key, not the default `aws/rds` key.
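S3 Cross-Region Replication (CRR) with versioning enabled on both source and destination; CRR requires it. Versioning is also what protects against accidental delete or ransomware re-encryption, because the overwritten or deleted object stays recoverable as a prior version. Check whether delete-marker replication is enabled on the rule, since that setting decides whether deletes propagate to DR. Add an MFA-delete or Object Lock policy for the most sensitive buckets.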
If the Terraform state bucket is in the affected region and you can't read it, you can't `terraform apply` to rebuild infra. Replicate the state bucket cross-region. Export Secrets Manager / Vault contents to an encrypted offsite copy on the same cadence.
Restore the latest snapshot into a non-production environment and validate row counts, schema integrity, and a sample of recent transactions. A green backup metric for 18 months means nothing if the restore script depends on a credential that rotated. Capture wall-clock time to first byte and to fully restored — that's your real RTO floor.
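The row-count part of the validation can be scripted so the drill produces a pass/fail list instead of eyeballed numbers. This sketch assumes the counts have already been collected into dictionaries; the table names, the `tolerance` parameter, and `row_count_mismatches` itself are hypothetical.

```python
from typing import Dict, List


def row_count_mismatches(
    expected: Dict[str, int], restored: Dict[str, int], tolerance: float = 0.0
) -> List[str]:
    """List tables that are missing or whose restored count drifts beyond tolerance.

    tolerance is a fraction of the expected count; it allows for writes that
    landed after the snapshot was taken.
    """
    problems = []
    for table, want in expected.items():
        got = restored.get(table)
        if got is None:
            problems.append(f"{table}: missing after restore")
        elif abs(got - want) > tolerance * want:
            problems.append(f"{table}: expected {want}, restored {got}")
    return problems


# Illustrative counts only.
expected = {"orders": 1000, "tenants": 50, "invoices": 400}
restored = {"orders": 990, "invoices": 300}
issues = row_count_mismatches(expected, restored, tolerance=0.02)
for line in issues:
    print(line)
```

Row counts are a coarse check, so keep the schema and recent-transaction sampling as separate steps.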
Open Jira tickets for each gap with a named owner and a due date before the next quarterly drill. A failed restore drill is itself a SOC 2 finding — auditors expect to see the gap, the ticket, and the closure evidence.
DR Testing and Maintenance
90-minute scenario session: 'us-east-1 RDS is unreachable, what do we do?' Walk the runbook step-by-step without executing. The point is to surface the steps that don't make sense to anyone except the original author. Rotate the IC role through different engineers each quarter.
Actually fail over to the DR region for a tier-1 service — staging traffic minimum, ideally a fraction of production traffic in a planned window. Measure observed RTO against the documented target. The tabletop will not surface the IAM-role-doesn't-exist-in-the-DR-region problem; only the live drill will.
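Recording the drill as two timestamps and a target makes the RTO comparison unambiguous. A small sketch follows, with illustrative times and a hypothetical `drill_result` helper; what counts as "serving" (first healthy response from the DR region, for instance) is an assumption to pin down in the runbook.

```python
from datetime import datetime, timedelta
from typing import Dict


def drill_result(declared_at: datetime, serving_at: datetime, target_rto: timedelta) -> Dict[str, object]:
    """Compare observed recovery time against the documented target."""
    observed = serving_at - declared_at
    return {
        "observed_rto": observed,
        "target_rto": target_rto,
        "met": observed <= target_rto,
        "overrun": max(observed - target_rto, timedelta(0)),
    }


# Illustrative drill: declared at 10:00, DR region serving at 11:12.
result = drill_result(
    declared_at=datetime(2024, 1, 1, 10, 0),
    serving_at=datetime(2024, 1, 1, 11, 12),
    target_rto=timedelta(hours=1),
)
print(result)
```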
Every gap surfaced in tabletop or live drill becomes a ticket with an owner and a target close date. Action items without owners are not action items. Track closure rate quarterly — if action items roll over more than once, the DR program is performative.
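The quarterly closure check is easy to compute from a ticket export. This sketch assumes each item carries an owner, a closed flag, and a count of how many times its due date was pushed; the field names and sample items are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ActionItem:
    title: str
    owner: Optional[str]
    closed: bool
    rollovers: int  # times the target close date has been pushed


def closure_rate(items: List[ActionItem]) -> float:
    """Fraction of action items closed; an empty list counts as fully closed."""
    return sum(i.closed for i in items) / len(items) if items else 1.0


def red_flags(items: List[ActionItem]) -> List[str]:
    """Open items with no owner, or that have rolled over more than once."""
    return [i.title for i in items if not i.closed and (i.owner is None or i.rollovers > 1)]


items = [
    ActionItem("Replicate IAM role to DR region", "infra-lead", True, 0),
    ActionItem("Fix restore script credential", None, False, 0),
    ActionItem("Document reverse replication", "db-lead", False, 2),
    ActionItem("Add status page template", "comms-lead", False, 1),
]
print(f"closure rate: {closure_rate(items):.0%}")
print("red flags:", red_flags(items))
```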
Activation Decision
Documented triggers: regional cloud outage with declared AWS Health Dashboard event, primary database unrecoverable for >15 min, customer-facing error rate >50% for >10 min. Don't activate DR for a single AZ blip — partial failover causes more outages than it prevents.
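Writing the triggers as code removes ambiguity about what "met" means during an incident. The thresholds below are the ones documented above; the `Signals` fields and how they get populated from monitoring are assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Signals:
    regional_outage_declared: bool   # AWS Health Dashboard event for the region
    db_unrecoverable_minutes: float  # how long the primary database has been unrecoverable
    error_rate: float                # customer-facing error rate, 0.0 .. 1.0
    error_rate_minutes: float        # how long the error rate has been at that level


def activation_triggers(s: Signals) -> List[str]:
    """Return the documented triggers that are currently met."""
    met = []
    if s.regional_outage_declared:
        met.append("regional cloud outage declared")
    if s.db_unrecoverable_minutes > 15:
        met.append("primary database unrecoverable for >15 min")
    if s.error_rate > 0.50 and s.error_rate_minutes > 10:
        met.append("customer-facing error rate >50% for >10 min")
    return met


print(activation_triggers(Signals(False, 20, 0.10, 0)))
print(activation_triggers(Signals(False, 0, 0.60, 5)))  # short spike: no trigger
```

The function only reports which triggers are met; the activation decision still belongs to the IC.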
Use the dedicated DR escalation policy, not the standard SEV1 page. The IC opens the #incident-NNN Slack channel, the Zoom bridge, and assigns the Comms Lead and Scribe roles before any technical action begins.
The IC makes the call after consulting the Database Lead and Infra Lead. Activation triggers the failover runbook and customer comms. Stand-down means continuing to monitor — log the rationale either way for the post-incident review.
Recovery Execution and Post-Incident
Database Lead promotes the cross-region replica; Infra Lead flips the Route 53 weighted record or Global Accelerator endpoint; application teams confirm their services are reading from the DR connection strings. Comms Lead updates the public status page and notifies enterprise customers per their contractual notification windows.
Run the data-validation queries from the runbook: row counts on critical tables, latest transaction timestamp per tenant, checksum on the last hour of writes. Calculate observed data loss and compare to the documented RPO. If you exceed RPO, that's a customer notification on its own under most enterprise contracts and under GDPR Article 33 if EU PII was involved.
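The RPO comparison can be computed directly from the per-tenant timestamps the validation queries return. This sketch treats the newest transaction found on the DR side as the replication high-water mark, which gives an upper bound on the loss window; the timestamps and function name are illustrative assumptions.

```python
from datetime import datetime, timedelta
from typing import Dict


def data_loss_upper_bound(primary_lost_at: datetime, latest_txn_per_tenant: Dict[str, datetime]) -> timedelta:
    """Gap between losing the primary and the newest write that reached DR."""
    high_water_mark = max(latest_txn_per_tenant.values())
    return max(primary_lost_at - high_water_mark, timedelta(0))


# Illustrative values: primary lost at 12:00, newest replicated write at 11:57.
latest = {
    "tenant-a": datetime(2024, 1, 1, 11, 52),
    "tenant-b": datetime(2024, 1, 1, 11, 57),
}
loss = data_loss_upper_bound(datetime(2024, 1, 1, 12, 0), latest)
rpo = timedelta(minutes=5)
print(f"observed loss window: {loss}, RPO exceeded: {loss > rpo}")
```

A quiet tenant can show an old last-transaction timestamp without having lost anything, which is why the sketch uses the newest timestamp across tenants instead of the oldest.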
Hold the post-incident review (PIR) within 5 business days. Cover the timeline, contributing factors (not 'root cause'), what worked, what didn't, and action items with owners. Skipping the PIR because 'we know what happened' is the most common path back to the same outage in 90 days; the second-order causes only surface in writing.
