Backup and Recovery Checklist
Backup Scope and RPO/RTO
List every datastore in scope: RDS instances, self-hosted Postgres/MySQL, MongoDB, Redis (if persistent), Elasticsearch indices, S3 buckets holding customer uploads, and any EBS volumes attached to stateful workloads. Cross-check against the service catalog in Backstage so a newly-launched service hasn't slipped through.
Assign each datastore a tier. Tier 1 (customer-facing transactional DB) typically targets RPO ≤ 5 min, RTO ≤ 1 hour. Tier 2 (analytics, reporting) typically targets RPO ≤ 24 hours. Tier 3 (logs, derived data) is recoverable from upstream. Mismatches between the RPO/RTO stated for SOC 2 and the actual backup cadence are a common audit finding.
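The tier check above can be sketched in a few lines of Python. This is a minimal illustration, not a definitive implementation: the datastore names and backup intervals are hypothetical, and in practice the inventory would come from the service catalog and the backup configuration rather than a hard-coded list.

```python
from dataclasses import dataclass
from datetime import timedelta

# RPO targets from the tier definitions above. Tier 3 has no entry
# because it is rebuilt from upstream rather than restored from backup.
TIER_RPO = {
    1: timedelta(minutes=5),
    2: timedelta(hours=24),
}

@dataclass
class Datastore:
    name: str                   # hypothetical inventory name
    tier: int
    backup_interval: timedelta  # worst-case gap between recoverable points

def rpo_mismatches(datastores):
    """Return names of datastores whose backup interval exceeds the tier RPO."""
    misses = []
    for ds in datastores:
        target = TIER_RPO.get(ds.tier)
        if target is not None and ds.backup_interval > target:
            misses.append(ds.name)
    return misses

# Hypothetical inventory for illustration only.
inventory = [
    Datastore("orders-db", 1, timedelta(minutes=5)),
    Datastore("billing-db", 1, timedelta(hours=1)),
    Datastore("reporting-dw", 2, timedelta(hours=24)),
    Datastore("app-logs", 3, timedelta(days=7)),
]
print(rpo_mismatches(inventory))  # ['billing-db']
```

Any name this returns is a datastore whose stated RPO cannot be met by its current cadence, which is the gap an auditor would look for.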
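Record the retention configured for each backup mechanism: RDS automated backups (retention period configurable up to 35 days), manual snapshots, S3 versioning with lifecycle transitions to Glacier, and any cross-region copies. Confirm retention matches contractual commitments. Some enterprise customers require 90-day or 1-year retention, which exceeds the RDS automated-backup maximum and therefore has to be met with manual snapshots or AWS Backup.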
Backup Job Health
Pull AWS Backup job history, RDS automated snapshot status, and any custom cron-based backup logs. Look for partial failures, skipped runs, and jobs that succeeded but produced suspiciously small artifacts. "Backup success" metrics that have been green for 18 months without a real restore are not evidence of a working backup.
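One way to surface "succeeded but suspiciously small" artifacts is to compare each backup's size with the median of the runs before it. The sketch below assumes you have already extracted a chronological list of artifact sizes from the job history. The 50% threshold and 7-run window are illustrative defaults, not recommendations.

```python
from statistics import median

def suspicious_backups(sizes_bytes, min_ratio=0.5, window=7):
    """Return indices of backups smaller than min_ratio times the median
    of up to `window` preceding backups."""
    flagged = []
    for i, size in enumerate(sizes_bytes):
        history = sizes_bytes[max(0, i - window):i]
        if len(history) < 3:
            continue  # not enough history to judge this run
        if size < min_ratio * median(history):
            flagged.append(i)
    return flagged

# Hypothetical sizes (in GB for readability); the fifth run collapsed.
sizes = [100, 102, 101, 104, 12, 105, 106]
print(suspicious_backups(sizes))  # [4]
```

Using the median rather than the mean keeps a single bad run from dragging down the baseline for the runs that follow it.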
Open a Jira ticket per failing job with the failure window, root cause hypothesis, and owner. Common causes: rotated IAM credentials, expired KMS grants, source disk full, snapshot quota hit. Track to closure before the next quarterly run.
Trigger a synthetic backup failure or check the most recent real failure: did PagerDuty page the platform on-call, or did the alert quietly route to a deprecated #ops-alerts channel nobody watches? This is the most common source of silent backup rot.
For Tier 1 datastores, confirm cross-region snapshot copies are within RPO. RDS cross-region automated backups can lag by hours during throttling; S3 CRR can stall silently if the destination bucket policy rejects writes.
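The lag check can be expressed as a comparison between the timestamp of the newest copy in the destination region and the RPO. The sketch below assumes the timestamps have already been collected (for example from snapshot metadata). The datastore names and times are hypothetical.

```python
from datetime import datetime, timedelta, timezone

def lagging_copies(latest_copy_at, now, rpo):
    """Map each datastore whose newest cross-region copy is older than
    the RPO to how far behind it is."""
    return {
        name: now - copied_at
        for name, copied_at in latest_copy_at.items()
        if now - copied_at > rpo
    }

# Hypothetical observations for illustration only.
now = datetime(2024, 1, 15, 12, 0, tzinfo=timezone.utc)
latest_copy_at = {
    "orders-db": datetime(2024, 1, 15, 11, 57, tzinfo=timezone.utc),
    "billing-db": datetime(2024, 1, 15, 9, 30, tzinfo=timezone.utc),
}
print(lagging_copies(latest_copy_at, now, rpo=timedelta(minutes=5)))
# {'billing-db': datetime.timedelta(seconds=9000)}
```

Running a check like this on a schedule, rather than once a quarter, is what catches a replication stall before it matters.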
Encryption and Access Controls
Every RDS snapshot, EBS snapshot, and S3 bucket holding backups should be encrypted with a customer-managed KMS key (not the AWS-managed default). Auditors flag default-key usage; customer-managed keys give you key rotation control and a clean audit trail in CloudTrail.
Pull the IAM principals with backup:*, rds:RestoreDBInstance*, and KMS decrypt permissions on the backup keys. Restore is a privileged path — anyone who can restore can exfiltrate. Confirm departed engineers are removed; cross-check against the HRIS termination list.
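The cross-check against the termination list is a set intersection once both lists are in hand. The sketch below assumes each IAM principal has been mapped to an owner email (the `owner_email` field is a hypothetical convention, for example sourced from a tag) and that the HRIS export is a list of email addresses.

```python
def departed_with_access(principals, terminated_emails):
    """Return ARNs of principals whose owner appears on the termination list.
    Emails are compared case-insensitively with surrounding whitespace removed."""
    terminated = {email.strip().lower() for email in terminated_emails}
    return sorted(
        p["arn"]
        for p in principals
        if p["owner_email"].strip().lower() in terminated
    )

# Hypothetical data for illustration only.
principals = [
    {"arn": "arn:aws:iam::111122223333:user/alice", "owner_email": "alice@example.com"},
    {"arn": "arn:aws:iam::111122223333:user/bob", "owner_email": "Bob@example.com"},
]
terminated = ["bob@example.com "]
print(departed_with_access(principals, terminated))
# ['arn:aws:iam::111122223333:user/bob']
```

Roles assumed through SSO will not map to a single owner this way, so treat an empty result as a prompt to review role trust policies, not as proof that access is clean.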
For ransomware resilience, Tier 1 backups should sit in a vault protected by AWS Backup Vault Lock or in an S3 bucket with Object Lock retention (governance or compliance mode). Attempt a delete from a non-privileged role and confirm it's rejected. A backup that an attacker with prod credentials can delete is not a backup.
Restore Drill
Provision a dedicated VPC/account for the drill — never restore production backups into the production account, and never into the staging environment used for QA. Use the disaster-recovery Terraform workspace so the environment matches what would be stood up in a real incident.
Pick the most recent snapshot of the Tier 1 production DB and restore it. Time the operation end-to-end: snapshot copy (if cross-region), instance provisioning, DNS/connection-string updates. Compare wall-clock time against the stated RTO.
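Timing each phase separately makes it easier to see where the RTO budget goes. The following sketch assumes the per-phase durations were recorded by hand or by the drill tooling. The phase names follow the step above and the durations are made up.

```python
from datetime import timedelta

def restore_report(phases, rto):
    """Sum per-phase durations, compare the total against the RTO,
    and name the slowest phase."""
    total = sum(phases.values(), timedelta())
    slowest = max(phases, key=phases.get)
    return {"total": total, "within_rto": total <= rto, "slowest_phase": slowest}

# Hypothetical drill measurements for illustration only.
phases = {
    "snapshot_copy": timedelta(minutes=25),
    "instance_provisioning": timedelta(minutes=32),
    "dns_and_connection_strings": timedelta(minutes=8),
}
report = restore_report(phases, rto=timedelta(hours=1))
print(report["total"], report["within_rto"], report["slowest_phase"])
# 1:05:00 False instance_provisioning
```

In this made-up run the drill misses a one-hour RTO by five minutes, and the report points at provisioning as the first phase to shorten.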
Verify row counts on key tables, last-write timestamps, and foreign-key integrity, and run a smoke query that exercises the application's hottest read path. Capture the freshness of the restored data ("snapshot taken at T, restored at T+N, last committed transaction at T-M") and compare against RPO.
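The freshness figures can be derived from four timestamps. The sketch below is one possible reading of the "T, T+N, T-M" notation. It additionally assumes a simulated failure time (here, the moment the drill starts), which the checklist does not define, and measures data loss as the gap between that moment and the last committed transaction in the restored copy.

```python
from datetime import datetime, timedelta, timezone

def freshness(snapshot_at, restored_at, last_commit_at, failure_at, rpo):
    """Compute N (snapshot to restore), M (last commit to snapshot), and the
    data-loss window relative to an assumed failure time."""
    data_loss = failure_at - last_commit_at
    return {
        "restore_lag": restored_at - snapshot_at,   # N
        "commit_gap": snapshot_at - last_commit_at, # M
        "data_loss_window": data_loss,
        "within_rpo": data_loss <= rpo,
    }

# Hypothetical drill timestamps for illustration only.
utc = timezone.utc
result = freshness(
    snapshot_at=datetime(2024, 1, 15, 10, 0, 0, tzinfo=utc),
    restored_at=datetime(2024, 1, 15, 10, 50, 0, tzinfo=utc),
    last_commit_at=datetime(2024, 1, 15, 9, 59, 30, tzinfo=utc),
    failure_at=datetime(2024, 1, 15, 10, 4, 0, tzinfo=utc),
    rpo=timedelta(minutes=5),
)
print(result["data_loss_window"], result["within_rpo"])  # 0:04:30 True
```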
Restore a sample of objects from a prior version (S3 versioning) or from Glacier. Bulk retrievals typically take 5 to 12 hours from S3 Glacier Flexible Retrieval and up to 48 hours from Glacier Deep Archive, so confirm the retrieval tier in your runbook matches the RTO commitment for this data class.
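A quick way to sanity-check the runbook's retrieval tier is to compare worst-case retrieval times with the RTO. The figures in the sketch below are the upper ends of the typical ranges AWS has published for each storage class and tier. They are not guarantees and should be verified against the current S3 documentation before being relied on.

```python
from datetime import timedelta

# Upper bound of the typical retrieval time per (storage class, tier).
# Assumed from published AWS figures; verify before use.
WORST_CASE_RETRIEVAL = {
    ("GLACIER", "Expedited"): timedelta(minutes=5),
    ("GLACIER", "Standard"): timedelta(hours=5),
    ("GLACIER", "Bulk"): timedelta(hours=12),
    ("DEEP_ARCHIVE", "Standard"): timedelta(hours=12),
    ("DEEP_ARCHIVE", "Bulk"): timedelta(hours=48),
}

def tiers_meeting_rto(storage_class, rto):
    """Return retrieval tiers whose worst-case time fits within the RTO."""
    return [
        tier
        for (cls, tier), worst in WORST_CASE_RETRIEVAL.items()
        if cls == storage_class and worst <= rto
    ]

print(tiers_meeting_rto("GLACIER", timedelta(hours=6)))       # ['Expedited', 'Standard']
print(tiers_meeting_rto("DEEP_ARCHIVE", timedelta(hours=6)))  # []
```

Retrieval is only one part of the restore, so the RTO passed in here should be the budget left after copy and validation time are subtracted.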
Run terraform destroy on the drill workspace and confirm restored snapshots/instances are deleted. Restored copies of production data are a liability — every hour they linger is an hour of unmonitored prod data sitting in a less-locked-down environment.
DR Runbook and Roles
Read the runbook out loud as a team. Flag any step that references a deprecated tool, an old account ID, a Slack channel that no longer exists, or a person who has left. Stale runbooks are the #1 reason DR drills go off the rails.
The DR runbook names an incident commander (IC), comms lead, and scribe. Confirm the on-call rotation in PagerDuty/Opsgenie has at least three trained ICs: a single IC who goes on vacation is a single point of failure. Schedule cross-training if the bench is thin.
Compare measured restore time against the committed RTO, and measured data freshness against the committed RPO. A miss here triggers a follow-up plan in the next section.
Findings, Sign-Off, and Evidence
Complete this step only if the drill missed its RTO or RPO targets. Open a Jira ticket with a named owner for each gap (e.g., "reduce snapshot copy time by pre-staging cross-region copies", "shrink restore time with provisioned IOPS on restore"). Track to closure before next quarter's drill.
Edit the runbook in Confluence/Notion. Add commands you actually used (vs. what was documented), correct any wrong account IDs or paths, and note any tool that didn't behave as expected. Commit changes the same week — memory fades fast.
Capture the drill date, participants, restore times, integrity check results, and any findings. Upload to Vanta/Drata/Secureframe under the CC7.x / A1.x controls so the next SOC 2 auditor sees a clean trail of quarterly DR testing.
