Backup and Recovery Checklist

Quarterly backup verification and disaster recovery drill run by the platform/SRE team. Covers RPO/RTO targets, backup integrity testing, restore drills against a non-prod environment, and SOC 2 evidence capture.

1

Backup Scope and RPO/RTO

  1. Inventory backed-up datastores
    • List every datastore in scope: RDS instances, self-hosted Postgres/MySQL, MongoDB, Redis (if persistent), Elasticsearch indices, S3 buckets holding customer uploads, and any EBS volumes attached to stateful workloads. Cross-check against the service catalog in Backstage so a newly-launched service hasn't slipped through.

  2. Confirm RPO and RTO per tier
    • Tier 1 (customer-facing transactional DB) typically targets RPO ≤ 5 min and RTO ≤ 1 hour. Tier 2 (analytics, reporting) typically targets RPO ≤ 24 hours. Tier 3 (logs, derived data) is recoverable from upstream sources. Mismatches between the RPO/RTO stated in SOC 2 documentation and the actual backup cadence are a common audit finding.
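
The cadence check above can be sketched in Python. This is a minimal illustration, not a definitive implementation: the datastore names, tier labels, and backup intervals are hypothetical, and the RPO values are the typical targets quoted in this step.

```python
from datetime import timedelta

# RPO targets per tier, taken from the typical values quoted in this step.
# Tier 3 has no entry because it is recoverable from upstream.
RPO_BY_TIER = {
    "tier1": timedelta(minutes=5),
    "tier2": timedelta(hours=24),
}

def rpo_mismatches(datastores):
    """Return (name, interval, rpo) for datastores whose backup interval exceeds the tier RPO.

    The worst-case data loss for a periodic backup is one full interval,
    so an interval longer than the RPO cannot meet the target.
    """
    findings = []
    for ds in datastores:
        rpo = RPO_BY_TIER.get(ds["tier"])
        if rpo is not None and ds["backup_interval"] > rpo:
            findings.append((ds["name"], ds["backup_interval"], rpo))
    return findings

# Hypothetical inventory; names and intervals are illustrative only.
inventory = [
    {"name": "orders-db", "tier": "tier1", "backup_interval": timedelta(hours=1)},
    {"name": "reporting-db", "tier": "tier2", "backup_interval": timedelta(hours=12)},
    {"name": "app-logs", "tier": "tier3", "backup_interval": timedelta(days=7)},
]

for name, interval, rpo in rpo_mismatches(inventory):
    print(f"{name}: backs up every {interval}, but RPO is {rpo}")
```

With this sample inventory only the Tier 1 entry is flagged, because an hourly backup cannot satisfy a 5-minute RPO.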

  3. Verify backup retention policies
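    • Check RDS automated backups (retention is configurable up to a maximum of 35 days), manual snapshots, S3 versioning with lifecycle transitions to Glacier, and any cross-region copies. Confirm retention matches contractual commitments: some enterprise customers require 90-day or 1-year retention, which exceeds what RDS automated backups alone can hold and therefore needs manual snapshots or AWS Backup.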

2

Backup Job Health

  1. Audit last 30 days of backup jobs
    • Pull AWS Backup job history, RDS automated snapshot status, and any custom cron-based backup logs. Look for partial failures, skipped runs, and jobs that succeeded but produced suspiciously small artifacts. "Backup success" metrics that have been green for 18 months without a real restore are not evidence of a working backup.
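
One way to run the audit described above is sketched below in Python. It assumes the job history has already been exported into a list of records; the record keys (`day`, `status`, `size_bytes`) and the 50% size threshold are hypothetical choices for illustration, although `COMPLETED` and `FAILED` mirror AWS Backup job states.

```python
from datetime import date, timedelta
from statistics import median

def audit_jobs(jobs, start, end, small_ratio=0.5):
    """Flag failed runs, days with no run, and completed runs with unusually small artifacts.

    jobs: list of dicts with hypothetical keys "day" (date), "status" (str), "size_bytes" (int).
    small_ratio: a completed artifact below this fraction of the median size is flagged.
    """
    failed = [j["day"] for j in jobs if j["status"] != "COMPLETED"]

    # Skipped runs: calendar days in the window with no job record at all.
    seen = {j["day"] for j in jobs}
    window = [start + timedelta(days=i) for i in range((end - start).days + 1)]
    skipped = [d for d in window if d not in seen]

    # Suspiciously small: completed, but far below the median completed size.
    completed = [j for j in jobs if j["status"] == "COMPLETED"]
    small = []
    if completed:
        typical = median(j["size_bytes"] for j in completed)
        small = [j["day"] for j in completed if j["size_bytes"] < small_ratio * typical]

    return {"failed": failed, "skipped": skipped, "small": small}

# Hypothetical five-day history for a daily job.
GB = 1024 ** 3
history = [
    {"day": date(2024, 1, 1), "status": "COMPLETED", "size_bytes": 100 * GB},
    {"day": date(2024, 1, 2), "status": "COMPLETED", "size_bytes": 101 * GB},
    # 2024-01-03 has no record: the job never ran.
    {"day": date(2024, 1, 4), "status": "FAILED", "size_bytes": 0},
    {"day": date(2024, 1, 5), "status": "COMPLETED", "size_bytes": 2 * GB},
]

report = audit_jobs(history, date(2024, 1, 1), date(2024, 1, 5))
print(report)
```

Comparing against the median rather than a fixed size keeps the check usable across datastores of very different sizes.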

  2. File tickets for failed backup jobs
    • Open a Jira ticket per failing job with the failure window, root cause hypothesis, and owner. Common causes: rotated IAM credentials, expired KMS grants, source disk full, snapshot quota hit. Track to closure before the next quarterly run.

  3. Confirm backup alerting reaches on-call
    • Trigger a synthetic backup failure or check the most recent real failure: did PagerDuty page the platform on-call, or did the alert quietly route to a deprecated #ops-alerts channel nobody watches? This is the most common source of silent backup rot.

  4. Verify cross-region replication lag
    • For Tier 1 datastores, confirm cross-region snapshot copies are within RPO. RDS cross-region automated backups can lag by hours during throttling; S3 CRR can stall silently if the destination bucket policy rejects writes.

3

Encryption and Access Controls

  1. Confirm KMS encryption on all snapshots
    • Every RDS snapshot, EBS snapshot, and S3 bucket holding backups should be encrypted with a customer-managed KMS key (not the AWS-managed default). Auditors flag default-key usage; customer-managed keys give you key rotation control and a clean audit trail in CloudTrail.

  2. Review IAM access to backup vault
    • Pull the IAM principals with backup:*, rds:RestoreDBInstance*, and kms:Decrypt permissions on the backup keys. Restore is a privileged path: anyone who can restore can exfiltrate. Confirm departed engineers are removed by cross-checking against the HRIS termination list.
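
The cross-check against the termination list amounts to a set intersection. The Python sketch below assumes both lists have already been exported; the `owner_email` field, the principal ARNs, and the addresses are hypothetical.

```python
def stale_restore_access(principals, terminated_emails):
    """Return principals with restore access whose owner appears on the termination list."""
    # Normalize case and whitespace so exports from different systems still match.
    terminated = {e.strip().lower() for e in terminated_emails}
    return [p for p in principals if p["owner_email"].strip().lower() in terminated]

# Hypothetical exports from IAM and the HRIS.
principals = [
    {"arn": "arn:aws:iam::111111111111:user/alice", "owner_email": "alice@example.com"},
    {"arn": "arn:aws:iam::111111111111:user/bob", "owner_email": "Bob@Example.com"},
]
terminated = ["bob@example.com ", "carol@example.com"]

for p in stale_restore_access(principals, terminated):
    print(f"REVOKE: {p['arn']} (owner {p['owner_email']} has left)")
```

Roles and service accounts have no single owner, so a real review would also need a mapping from each role to the people allowed to assume it.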

  3. Test backup vault immutability
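    • For ransomware resilience, Tier 1 backups should sit in a vault protected by AWS Backup Vault Lock or a bucket protected by S3 Object Lock. Governance mode can be bypassed by principals with sufficient permissions (for S3, s3:BypassGovernanceRetention), so prefer compliance mode for Tier 1 or confirm that no production role holds the bypass permission. Attempt a delete from a non-privileged role and from a role with production-level credentials, and confirm both are rejected. A backup that an attacker with prod credentials can delete is not a backup.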

4

Restore Drill

  1. Spin up an isolated restore environment
    • Provision a dedicated VPC/account for the drill — never restore production backups into the production account, and never into the staging environment used for QA. Use the disaster-recovery Terraform workspace so the environment matches what would be stood up in a real incident.

  2. Restore the primary database from snapshot
    • Pick the most recent snapshot of the Tier 1 production DB and restore it. Time the operation end-to-end: snapshot copy (if cross-region), instance provisioning, DNS/connection-string updates. Compare wall-clock time against the stated RTO.
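
To time the restore phase by phase, a small helper like the following Python sketch may be enough. The class name, phase names, and the `time.sleep` placeholders are hypothetical; in a real drill each `with` block would wrap the actual restore commands.

```python
import time
from contextlib import contextmanager

class DrillTimer:
    """Record per-phase durations and the overall wall-clock time of a restore."""

    def __init__(self):
        self.phases = {}          # phase name -> seconds
        self._first_start = None
        self._last_end = None

    @contextmanager
    def phase(self, name):
        start = time.monotonic()
        if self._first_start is None:
            self._first_start = start
        try:
            yield
        finally:
            end = time.monotonic()
            self.phases[name] = end - start
            self._last_end = end

    def wall_clock_seconds(self):
        # Measured from the start of the first phase to the end of the last,
        # so idle gaps between phases count against the RTO too.
        if self._first_start is None or self._last_end is None:
            return 0.0
        return self._last_end - self._first_start

    def meets_rto(self, rto_seconds):
        return self.wall_clock_seconds() <= rto_seconds

timer = DrillTimer()
with timer.phase("snapshot_copy"):
    time.sleep(0.01)   # placeholder for the cross-region copy
with timer.phase("instance_provisioning"):
    time.sleep(0.01)   # placeholder for the restore itself
with timer.phase("dns_update"):
    time.sleep(0.01)   # placeholder for connection-string changes

for name, seconds in timer.phases.items():
    print(f"{name}: {seconds:.2f}s")
print("meets 1-hour RTO:", timer.meets_rto(3600))
```

The per-phase breakdown shows where the time went, which is useful when a missed RTO has to be turned into specific remediation tickets.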

  3. Run data integrity checks on restored DB
    • Check row counts on key tables, last-write timestamps, and foreign-key integrity, and run a smoke query that exercises the application's hottest read path. Capture the freshness of the restored data ("snapshot taken at T, restored at T+N, last committed transaction at T-M") and compare against RPO.
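
The freshness arithmetic can be made explicit with a short Python sketch. It assumes a simulated incident time is chosen for the drill, which the step above does not specify; all timestamps and the function name are hypothetical.

```python
from datetime import datetime, timedelta, timezone

def freshness_report(snapshot_at, restored_at, last_commit_at, incident_at, rpo):
    """Compute the T / T+N / T-M figures and compare data loss against the RPO."""
    data_loss = incident_at - last_commit_at
    return {
        "restore_lag": restored_at - snapshot_at,            # N
        "snapshot_staleness": snapshot_at - last_commit_at,  # M
        "data_loss_window": data_loss,  # writes after the last commit are lost
        "meets_rpo": data_loss <= rpo,
    }

utc = timezone.utc
report = freshness_report(
    snapshot_at=datetime(2024, 1, 15, 2, 0, 0, tzinfo=utc),
    restored_at=datetime(2024, 1, 15, 2, 45, 0, tzinfo=utc),
    last_commit_at=datetime(2024, 1, 15, 1, 59, 30, tzinfo=utc),
    incident_at=datetime(2024, 1, 15, 2, 4, 0, tzinfo=utc),  # simulated incident (assumption)
    rpo=timedelta(minutes=5),
)

for key, value in report.items():
    print(f"{key}: {value}")
```

In this example the last committed transaction is 4 minutes 30 seconds older than the simulated incident, so the 5-minute RPO is met with little margin.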

  4. Restore S3 customer-upload bucket
    • Restore a sample of objects from a prior version (S3 versioning) or from Glacier. Glacier retrieval can take hours: for S3 Glacier Flexible Retrieval, Standard retrievals typically take 3–5 hours and Bulk retrievals 5–12 hours. Confirm the retrieval tier in your runbook matches the RTO commitment for this data class.
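
A quick way to sanity-check the retrieval tier against the RTO is a lookup table, sketched here in Python. The durations are approximate upper bounds of the typical retrieval times AWS publishes and should be re-verified against current AWS documentation; the function name is hypothetical.

```python
from datetime import timedelta

# Approximate upper bounds of typical retrieval times, keyed by
# (S3 storage class, retrieval tier). Treat these as assumptions to re-verify.
TYPICAL_MAX_RETRIEVAL = {
    ("GLACIER", "Expedited"): timedelta(minutes=5),
    ("GLACIER", "Standard"): timedelta(hours=5),
    ("GLACIER", "Bulk"): timedelta(hours=12),
    ("DEEP_ARCHIVE", "Standard"): timedelta(hours=12),
    ("DEEP_ARCHIVE", "Bulk"): timedelta(hours=48),
}

def tiers_meeting_rto(storage_class, rto):
    """Return the retrieval tiers whose typical worst case fits inside the RTO."""
    return [
        tier
        for (cls, tier), worst_case in TYPICAL_MAX_RETRIEVAL.items()
        if cls == storage_class and worst_case <= rto
    ]

print(tiers_meeting_rto("GLACIER", timedelta(hours=8)))
print(tiers_meeting_rto("DEEP_ARCHIVE", timedelta(hours=8)))
```

An empty result means no retrieval tier fits, so that data class would need a warmer storage class to meet its RTO. Retrieval is only one part of the restore, so the real budget is tighter than this comparison suggests.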

  5. Tear down the drill environment
    • Run terraform destroy on the drill workspace and confirm restored snapshots/instances are deleted. Restored copies of production data are a liability — every hour they linger is an hour of unmonitored prod data sitting in a less-locked-down environment.

5

DR Runbook and Roles

  1. Walk through the DR runbook end to end
    • Read the runbook out loud as a team. Flag any step that references a deprecated tool, an old account ID, a Slack channel that no longer exists, or a person who has left. Stale runbooks are a leading reason DR drills go off the rails.

  2. Confirm Incident Commander coverage
    • The DR runbook names an Incident Commander (IC), comms lead, and scribe. Confirm the on-call rotation in PagerDuty/Opsgenie has at least three trained ICs; a single trained IC becomes a single point of failure (SPOF) the moment they go on vacation. Schedule cross-training if the bench is thin.

  3. Did the drill meet RTO and RPO targets?
    • Compare measured restore time against the committed RTO, and measured data freshness against the committed RPO. A miss here triggers a follow-up plan in the next section.

6

Findings, Sign-Off, and Evidence

  1. Open remediation tickets for RTO/RPO miss
    • Triggered when the drill missed targets. Open named-owner Jira tickets per gap (e.g., "reduce snapshot copy time by pre-staging cross-region copies", "shrink restore time with provisioned IOPS on restore"). Track to closure before next quarter's drill.

  2. Update the DR runbook with drill learnings
    • Edit the runbook in Confluence/Notion. Add commands you actually used (vs. what was documented), correct any wrong account IDs or paths, and note any tool that didn't behave as expected. Commit changes the same week — memory fades fast.

  3. Sign off and capture SOC 2 evidence
    • Capture the drill date, participants, restore times, integrity check results, and any findings. Upload to Vanta/Drata/Secureframe under the CC7.x / A1.x controls so the next SOC 2 auditor sees a clean trail of quarterly DR testing.
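
The evidence fields listed above could be captured in a structured record before upload. The Python sketch below is one possible shape, not a format required by any compliance tool; the class name, field names, and sample values are hypothetical.

```python
import json
from dataclasses import dataclass, asdict, field

@dataclass
class DrillEvidence:
    drill_date: str          # ISO 8601 date of the drill
    participants: list       # names or handles of attendees
    restore_minutes: dict    # datastore -> measured restore time in minutes
    integrity_checks: dict   # check name -> "pass" or "fail"
    findings: list = field(default_factory=list)

    def to_json(self):
        # Sorted keys keep successive quarters easy to diff.
        return json.dumps(asdict(self), indent=2, sort_keys=True)

# Hypothetical record for one quarterly drill.
record = DrillEvidence(
    drill_date="2024-01-15",
    participants=["alice", "bob"],
    restore_minutes={"orders-db": 42},
    integrity_checks={"row_counts": "pass", "foreign_keys": "pass"},
    findings=["Runbook referenced an old account ID"],
)

print(record.to_json())
```

Keeping the same fields every quarter makes it easier to show an auditor how restore times and findings have trended.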
