Backup and Recovery Checklist
Quarterly backup verification and disaster recovery drill run by the platform/SRE team. Covers RPO/RTO targets, backup integrity testing, restore drills against a non-prod environment, and SOC 2 evidence capture.
Backup Scope and RPO/RTO
-
Inventory backed-up datastores
List every datastore in scope: RDS instances, self-hosted Postgres/MySQL, MongoDB, Redis (if persistent), Elasticsearch indices, S3 buckets holding customer uploads, and any EBS volumes attached to stateful workloads. Cross-check against the service catalog in Backstage so a newly-launched service hasn't slipped through.
-
Confirm RPO and RTO per tier
Tier 1 (customer-facing transactional DB) typically targets RPO ≤ 5 min and RTO ≤ 1 hour. Tier 2 (analytics, reporting) typically targets RPO ≤ 24 hours. Tier 3 (logs, derived data) is recoverable from upstream. Mismatches between the RPO/RTO stated for SOC 2 and the actual backup cadence are a common audit finding.
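One way to catch a cadence mismatch is to compare each datastore's backup interval with its tier's RPO, since the worst-case data loss is roughly the gap between two recovery points. The sketch below is illustrative only: the tier targets come from the text above, while the inventory entries, key names, and the function name are hypothetical.

```python
from datetime import timedelta

# Tier targets taken from the checklist text. Tier 3 has no RPO entry because
# it is rebuilt from upstream. Replace with your own commitments.
RPO_BY_TIER = {
    1: timedelta(minutes=5),
    2: timedelta(hours=24),
}

def rpo_mismatches(datastores):
    """Return (name, interval, rpo) for datastores that back up less often than their RPO.

    Each datastore is a dict with hypothetical keys: name, tier, and
    backup_interval (timedelta between consecutive recovery points).
    """
    findings = []
    for ds in datastores:
        rpo = RPO_BY_TIER.get(ds["tier"])
        if rpo is not None and ds["backup_interval"] > rpo:
            findings.append((ds["name"], ds["backup_interval"], rpo))
    return findings

# Hypothetical inventory for illustration.
inventory = [
    {"name": "orders-db", "tier": 1, "backup_interval": timedelta(hours=1)},
    {"name": "reporting-db", "tier": 2, "backup_interval": timedelta(hours=12)},
    {"name": "app-logs", "tier": 3, "backup_interval": timedelta(days=7)},
]

for name, interval, rpo in rpo_mismatches(inventory):
    print(f"{name}: recovery points every {interval}, but RPO is {rpo}")
```

In this sample only the Tier 1 database is flagged: hourly snapshots alone cannot meet a 5-minute RPO, which usually points to continuous or point-in-time recovery rather than a shorter snapshot schedule.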
-
Verify backup retention policies
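Review RDS automated backups (retention is configurable up to 35 days), manual snapshots, S3 versioning plus lifecycle transitions to Glacier, and any cross-region copies. Confirm retention matches contractual commitments: some enterprise customers require 90-day or 1-year retention, which exceeds the automated-backup maximum and has to be met with manual snapshots or AWS Backup.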
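A retention check can be sketched as a comparison between configured and required days. The example assumes input shaped like the `DBInstances` entries returned by the RDS `describe_db_instances` call (`DBInstanceIdentifier`, `BackupRetentionPeriod`); the required-days mapping, the sample values, and the function name are hypothetical.

```python
# Upper bound for RDS automated backup retention, in days.
RDS_MAX_AUTOMATED_RETENTION_DAYS = 35

def retention_gaps(db_instances, required_days):
    """List instances whose automated backup retention is below the commitment.

    `required_days` maps instance identifier -> contractual retention in days
    (a hypothetical mapping you would maintain yourself).
    """
    gaps = []
    for db in db_instances:
        name = db["DBInstanceIdentifier"]
        need = required_days.get(name)
        if need is None:
            continue  # no commitment recorded for this instance
        have = db["BackupRetentionPeriod"]
        if have < need:
            gaps.append({
                "db": name,
                "have_days": have,
                "need_days": need,
                # Automated backups cannot cover this; longer retention needs
                # manual snapshots or a backup vault.
                "beyond_automated_limit": need > RDS_MAX_AUTOMATED_RETENTION_DAYS,
            })
    return gaps

# Hypothetical sample data.
instances = [
    {"DBInstanceIdentifier": "orders-db", "BackupRetentionPeriod": 7},
    {"DBInstanceIdentifier": "billing-db", "BackupRetentionPeriod": 35},
]
commitments = {"orders-db": 30, "billing-db": 90}

for gap in retention_gaps(instances, commitments):
    print(gap)
```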
Backup Job Health
-
Audit last 30 days of backup jobs
Pull AWS Backup job history, RDS automated snapshot status, and any custom cron-based backup logs. Look for partial failures, skipped runs, and jobs that succeeded but produced suspiciously small artifacts. "Backup success" metrics that have been green for 18 months without a real restore are not evidence of a working backup.
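The audit can be partly automated. The sketch below assumes job records shaped like the `BackupJobs` entries from the AWS Backup `list_backup_jobs` call (`BackupJobId`, `ResourceArn`, `State`, `BackupSizeInBytes`). The size heuristic, comparing each artifact against the median for the same resource, is an assumption of this example and not an AWS feature; incremental backups vary in size, so treat hits as prompts for a closer look.

```python
from collections import defaultdict
from statistics import median

def audit_jobs(jobs, small_ratio=0.5):
    """Flag jobs that did not complete, or completed with an unusually small artifact."""
    # Collect sizes of completed jobs per resource to establish a baseline.
    sizes = defaultdict(list)
    for job in jobs:
        if job["State"] == "COMPLETED":
            sizes[job["ResourceArn"]].append(job.get("BackupSizeInBytes", 0))

    findings = []
    for job in jobs:
        if job["State"] != "COMPLETED":
            findings.append((job["BackupJobId"], f"state={job['State']}"))
            continue
        typical = median(sizes[job["ResourceArn"]])
        size = job.get("BackupSizeInBytes", 0)
        if typical > 0 and size < small_ratio * typical:
            findings.append((job["BackupJobId"], f"size {size} B is well below median {typical:.0f} B"))
    return findings

# Hypothetical job history for one resource.
arn = "arn:aws:rds:us-east-1:111122223333:db:orders-db"
history = [
    {"BackupJobId": "job-1", "ResourceArn": arn, "State": "COMPLETED", "BackupSizeInBytes": 100},
    {"BackupJobId": "job-2", "ResourceArn": arn, "State": "COMPLETED", "BackupSizeInBytes": 100},
    {"BackupJobId": "job-3", "ResourceArn": arn, "State": "FAILED"},
    {"BackupJobId": "job-4", "ResourceArn": arn, "State": "COMPLETED", "BackupSizeInBytes": 100},
    {"BackupJobId": "job-5", "ResourceArn": arn, "State": "COMPLETED", "BackupSizeInBytes": 10},
]

for job_id, reason in audit_jobs(history):
    print(job_id, reason)
```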
-
File tickets for failed backup jobs
Open a Jira ticket per failing job with the failure window, root cause hypothesis, and owner. Common causes: rotated IAM credentials, expired KMS grants, source disk full, snapshot quota hit. Track to closure before the next quarterly run.
-
Confirm backup alerting reaches on-call
Trigger a synthetic backup failure or check the most recent real failure: did PagerDuty page the platform on-call, or did the alert quietly route to a deprecated #ops-alerts channel nobody watches? This is the most common source of silent backup rot.
-
Verify cross-region replication lag
For Tier 1 datastores, confirm cross-region snapshot copies are within RPO. RDS cross-region automated backups can lag by hours during throttling; S3 CRR can stall silently if the destination bucket policy rejects writes.
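A simple lag measure is the age of the newest recovery point in the destination region, since that is the data you would lose if the primary region disappeared right now. The sketch below assumes you have already fetched that timestamp (for example from the snapshot creation time in the DR region); the function name and sample values are hypothetical.

```python
from datetime import datetime, timedelta, timezone

def replication_within_rpo(latest_copy_time, rpo, now=None):
    """Return (ok, lag), where lag is the age of the newest cross-region recovery point."""
    now = now or datetime.now(timezone.utc)
    lag = now - latest_copy_time
    return lag <= rpo, lag

# Hypothetical values: newest DR-region copy is 3 hours old, RPO is 5 minutes.
checked_at = datetime(2024, 1, 15, 12, 0, tzinfo=timezone.utc)
newest_copy = datetime(2024, 1, 15, 9, 0, tzinfo=timezone.utc)

ok, lag = replication_within_rpo(newest_copy, timedelta(minutes=5), now=checked_at)
print(f"within RPO: {ok}, lag: {lag}")
```

Passing `now` explicitly keeps the check reproducible when it is run against recorded timestamps after the fact.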
Encryption and Access Controls
-
Confirm KMS encryption on all snapshots
Every RDS snapshot, EBS snapshot, and S3 bucket holding backups should be encrypted with a customer-managed KMS key (not the AWS-managed default). Auditors flag default-key usage; customer-managed keys give you key rotation control and a clean audit trail in CloudTrail.
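The encryption check can be sketched as follows, assuming snapshot records shaped like the RDS `describe_db_snapshots` output (`DBSnapshotIdentifier`, `Encrypted`, `KmsKeyId`) and a lookup of key manager per key, as reported in the `KeyManager` field of KMS `describe_key` (`AWS` or `CUSTOMER`). The function name, key ARNs, and sample data are hypothetical.

```python
def encryption_findings(snapshots, key_managers):
    """Flag snapshots that are unencrypted or not encrypted with a customer-managed key.

    `key_managers` maps KmsKeyId -> "AWS" or "CUSTOMER".
    """
    findings = []
    for snap in snapshots:
        snap_id = snap["DBSnapshotIdentifier"]
        if not snap.get("Encrypted", False):
            findings.append((snap_id, "not encrypted"))
            continue
        manager = key_managers.get(snap.get("KmsKeyId"), "UNKNOWN")
        if manager != "CUSTOMER":
            findings.append((snap_id, f"key manager is {manager}"))
    return findings

# Hypothetical key ARNs and snapshots.
CMK = "arn:aws:kms:us-east-1:111122223333:key/customer-key-placeholder"
DEFAULT = "arn:aws:kms:us-east-1:111122223333:key/aws-managed-placeholder"
managers = {CMK: "CUSTOMER", DEFAULT: "AWS"}

snaps = [
    {"DBSnapshotIdentifier": "orders-2024-01-15", "Encrypted": True, "KmsKeyId": CMK},
    {"DBSnapshotIdentifier": "legacy-2024-01-15", "Encrypted": True, "KmsKeyId": DEFAULT},
    {"DBSnapshotIdentifier": "scratch-2024-01-15", "Encrypted": False},
]

for snap_id, reason in encryption_findings(snaps, managers):
    print(snap_id, reason)
```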
-
Review IAM access to backup vault
Pull the IAM principals with backup:*, rds:RestoreDBInstance*, and kms:Decrypt permissions on the backup keys. Restore is a privileged path: anyone who can restore can exfiltrate. Confirm departed engineers are removed by cross-checking against the HRIS termination list.
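The review can be roughed out in code. This sketch matches each principal's allowed action patterns against a short list of restore-related IAM actions, then intersects the result with a termination list. It is deliberately simplified: it ignores Deny statements, conditions, and resource scoping, and the principal records, field names, and email addresses are hypothetical.

```python
from fnmatch import fnmatchcase

# Restore-related actions to look for (real IAM action names).
RESTORE_ACTIONS = [
    "backup:StartRestoreJob",
    "rds:RestoreDBInstanceFromDBSnapshot",
    "rds:RestoreDBInstanceToPointInTime",
    "kms:Decrypt",
]

def granted_restore_actions(allowed_patterns):
    """Return the restore actions covered by a principal's Allow patterns (wildcards supported)."""
    return [
        action for action in RESTORE_ACTIONS
        if any(fnmatchcase(action.lower(), p.lower()) for p in allowed_patterns)
    ]

def departed_with_restore_access(principals, terminated_emails):
    """Return (principal name, actions) for principals owned by someone on the termination list."""
    terminated = {e.lower() for e in terminated_emails}
    report = []
    for p in principals:
        actions = granted_restore_actions(p["allowed_actions"])
        if actions and p["owner_email"].lower() in terminated:
            report.append((p["name"], actions))
    return report

# Hypothetical principals and HRIS export.
principals = [
    {"name": "alice", "owner_email": "alice@example.com", "allowed_actions": ["rds:*"]},
    {"name": "bob", "owner_email": "bob@example.com", "allowed_actions": ["s3:GetObject"]},
    {"name": "carol", "owner_email": "carol@example.com", "allowed_actions": ["backup:*", "kms:Decrypt"]},
]
terminated = ["Alice@example.com", "bob@example.com"]

print(departed_with_restore_access(principals, terminated))
```

For an authoritative answer on effective permissions, a policy simulator or access analyzer is a better source than pattern matching; this only narrows down where to look.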
-
Test backup vault immutability
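For ransomware resilience, Tier 1 backups should sit in an AWS Backup vault protected by Vault Lock or in an S3 bucket with Object Lock, using governance or compliance mode retention. Attempt a delete from a non-privileged role and confirm it's rejected, then repeat with credentials equivalent to a production role, since that is what an attacker is likely to hold. Governance mode can be bypassed by principals with the right permission (for S3 Object Lock, s3:BypassGovernanceRetention), so review who holds it. A backup that an attacker with prod credentials can delete is not a backup.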
Restore Drill
-
Spin up an isolated restore environment
Provision a dedicated VPC/account for the drill — never restore production backups into the production account, and never into the staging environment used for QA. Use the disaster-recovery Terraform workspace so the environment matches what would be stood up in a real incident.
-
Restore the primary database from snapshot
Pick the most recent snapshot of the Tier 1 production DB and restore it. Time the operation end-to-end: snapshot copy (if cross-region), instance provisioning, DNS/connection-string updates. Compare wall-clock time against the stated RTO.
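A small timing harness makes the end-to-end measurement repeatable and shows which phase dominates. This is a minimal sketch: the phase names mirror the text above, the `time.sleep` calls are placeholders for the real restore steps, and the class name and 1-hour RTO are assumptions for illustration.

```python
import time
from contextlib import contextmanager

class DrillTimer:
    """Record wall-clock duration per restore phase."""

    def __init__(self):
        self.phases = {}

    @contextmanager
    def phase(self, name):
        start = time.monotonic()
        try:
            yield
        finally:
            # Recorded even if the phase raises, so failed drills still have timings.
            self.phases[name] = time.monotonic() - start

    def total_seconds(self):
        return sum(self.phases.values())

timer = DrillTimer()
with timer.phase("snapshot_copy"):
    time.sleep(0.01)  # placeholder for the cross-region snapshot copy
with timer.phase("instance_provisioning"):
    time.sleep(0.01)  # placeholder for restoring the instance from the snapshot
with timer.phase("connection_string_update"):
    time.sleep(0.01)  # placeholder for DNS / connection-string changes

RTO_SECONDS = 3600  # assumed Tier 1 RTO of 1 hour
for name, seconds in timer.phases.items():
    print(f"{name}: {seconds:.2f}s")
print("within RTO:", timer.total_seconds() <= RTO_SECONDS)
```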
-
Run data integrity checks on restored DB
Row counts on key tables, last-write timestamps, foreign-key integrity, and a smoke query that exercises the application's hottest read path. Capture the freshness of the restored data — "snapshot taken at T, restored at T+N, last committed transaction at T-M" — and compare against RPO.
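The freshness comparison comes down to two subtractions. In the sketch below, `snapshot_lag` is the "M" from the text (snapshot time minus last committed transaction), and `data_loss_window` is what would actually be lost if the incident were declared at a given time. The function name, field names, and timestamps are hypothetical.

```python
from datetime import datetime, timedelta, timezone

def freshness_report(last_commit, snapshot_taken, incident_declared, rpo):
    """Compare the restored data's freshness against the committed RPO."""
    snapshot_lag = snapshot_taken - last_commit          # "M": how stale the snapshot was when taken
    data_loss_window = incident_declared - last_commit   # writes lost if the incident happened at this time
    return {
        "snapshot_lag": snapshot_lag,
        "data_loss_window": data_loss_window,
        "within_rpo": data_loss_window <= rpo,
    }

# Hypothetical drill timestamps (UTC).
report = freshness_report(
    last_commit=datetime(2024, 1, 15, 1, 58, tzinfo=timezone.utc),
    snapshot_taken=datetime(2024, 1, 15, 2, 0, tzinfo=timezone.utc),
    incident_declared=datetime(2024, 1, 15, 2, 30, tzinfo=timezone.utc),
    rpo=timedelta(minutes=5),
)
print(report)
```

In this sample the snapshot itself was only 2 minutes stale, yet the data loss window is 32 minutes, so restoring from the snapshot alone would miss a 5-minute RPO.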
-
Restore S3 customer-upload bucket
Restore a sample of objects from a prior version (S3 versioning) or from Glacier. Glacier retrieval can take hours for Bulk tier — confirm the retrieval tier in your runbook matches the RTO commitment for this data class.
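To check a runbook's retrieval tier against the RTO, it helps to write the worst-case retrieval times down explicitly. The figures below are the commonly published ranges for S3 Glacier Flexible Retrieval (Expedited 1 to 5 minutes, Standard 3 to 5 hours, Bulk 5 to 12 hours); treat them as assumptions and verify against current AWS documentation, since other storage classes such as Deep Archive are slower. The function name and overhead parameter are hypothetical.

```python
from datetime import timedelta

# Assumed worst-case retrieval times for S3 Glacier Flexible Retrieval.
GLACIER_FLEXIBLE_RETRIEVAL_MAX = {
    "Expedited": timedelta(minutes=5),
    "Standard": timedelta(hours=5),
    "Bulk": timedelta(hours=12),
}

def tiers_within_rto(rto, overhead=timedelta(0)):
    """Return retrieval tiers whose worst case, plus any copy/verify overhead, fits the RTO."""
    return [
        tier for tier, worst_case in GLACIER_FLEXIBLE_RETRIEVAL_MAX.items()
        if worst_case + overhead <= rto
    ]

print(tiers_within_rto(timedelta(hours=1)))
print(tiers_within_rto(timedelta(hours=6), overhead=timedelta(minutes=30)))
```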
-
Tear down the drill environment
Run terraform destroy on the drill workspace and confirm restored snapshots/instances are deleted. Restored copies of production data are a liability — every hour they linger is an hour of unmonitored prod data sitting in a less-locked-down environment.
DR Runbook and Roles
-
Walk through the DR runbook end to end
Read the runbook out loud as a team. Flag any step that references a deprecated tool, an old account ID, a Slack channel that no longer exists, or a person who has left. Stale runbooks are the #1 reason DR drills go off the rails.
-
Confirm Incident Commander coverage
The DR runbook names an IC, comms lead, and scribe. Confirm the on-call rotation in PagerDuty/Opsgenie has at least three trained ICs; a single trained IC who goes on vacation is a single point of failure. Schedule cross-training if the bench is thin.
-
Did the drill meet RTO and RPO targets?
Compare measured restore time against the committed RTO, and measured data freshness against the committed RPO. A miss here triggers a follow-up plan in the next section.
Findings, Sign-Off, and Evidence
-
Open remediation tickets for RTO/RPO miss
Triggered when the drill missed targets. Open named-owner Jira tickets per gap (e.g., "reduce snapshot copy time by pre-staging cross-region copies", "shrink restore time with provisioned IOPS on restore"). Track to closure before next quarter's drill.
-
Update the DR runbook with drill learnings
Edit the runbook in Confluence/Notion. Add commands you actually used (vs. what was documented), correct any wrong account IDs or paths, and note any tool that didn't behave as expected. Commit changes the same week — memory fades fast.
-
Sign off and capture SOC 2 evidence
Capture the drill date, participants, restore times, integrity check results, and any findings. Upload to Vanta/Drata/Secureframe under the CC7.x / A1.x controls so the next SOC 2 auditor sees a clean trail of quarterly DR testing.
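Capturing the evidence as a structured record makes it easier to compare quarters and to attach the same file to the compliance tool. The sketch below serializes the fields listed above to JSON; the class name, field names, and sample values are hypothetical, and the upload to the compliance platform is left out because it depends on the vendor.

```python
import json
from dataclasses import dataclass, field, asdict

@dataclass
class DrillEvidence:
    drill_date: str          # ISO date of the drill
    participants: list       # names or handles of everyone involved
    restore_seconds: dict    # phase name -> measured seconds
    integrity_checks: dict   # check name -> "pass" / "fail"
    findings: list = field(default_factory=list)

    def to_json(self):
        # Sorted keys keep diffs between quarterly records readable.
        return json.dumps(asdict(self), indent=2, sort_keys=True)

# Hypothetical record for illustration.
evidence = DrillEvidence(
    drill_date="2024-01-15",
    participants=["ic-on-call", "comms-lead", "scribe"],
    restore_seconds={"snapshot_copy": 1260.0, "instance_provisioning": 1510.5},
    integrity_checks={"row_counts": "pass", "foreign_keys": "pass"},
    findings=["Runbook referenced a retired Slack channel"],
)

print(evidence.to_json())
```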