Backup and Recovery Checklist

Backup Scope and RPO/RTO

    List every datastore in scope: RDS instances, self-hosted Postgres/MySQL, MongoDB, Redis (if persistent), Elasticsearch indices, S3 buckets holding customer uploads, and any EBS volumes attached to stateful workloads. Cross-check against the service catalog in Backstage so a newly launched service hasn't slipped through.
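
    A minimal boto3 sketch of the cross-check, under the assumption that the Backstage catalog can be exported as a flat JSON list of datastore entities; the catalog_datastores() helper and its file format are hypothetical:

```python
import boto3, json

def aws_datastores(region="us-east-1"):
    """Enumerate RDS instances and attached EBS volumes in one region."""
    rds = boto3.client("rds", region_name=region)
    ec2 = boto3.client("ec2", region_name=region)
    found = set()
    for page in rds.get_paginator("describe_db_instances").paginate():
        for db in page["DBInstances"]:
            found.add(("rds", db["DBInstanceIdentifier"]))
    for page in ec2.get_paginator("describe_volumes").paginate():
        for vol in page["Volumes"]:
            if vol["Attachments"]:  # only volumes attached to workloads
                found.add(("ebs", vol["VolumeId"]))
    return found

def catalog_datastores(path="backstage-export.json"):
    # Hypothetical flat export of datastore entities from the service catalog.
    with open(path) as f:
        return {(e["type"], e["id"]) for e in json.load(f)}

for kind, ident in sorted(aws_datastores() - catalog_datastores()):
    print(f"NOT IN CATALOG (no declared backup owner?): {kind} {ident}")
```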

    Tier 1 (customer-facing transactional DB) typically targets RPO ≤ 5 min, RTO ≤ 1 hour. Tier 2 (analytics, reporting) typically targets RPO ≤ 24 hours. Tier 3 (logs, derived data) can be rebuilt from upstream sources. Mismatches between the RPO/RTO stated in SOC 2 documentation and the actual backup cadence are a common audit finding.

    Inventory retention across RDS automated backups (7–35 day window), manual snapshots, S3 versioning plus lifecycle transitions to Glacier, and any cross-region copies. Confirm retention matches contractual commitments — some enterprise customers require 90-day or 1-year retention.
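
    A sketch of the RDS side of this check, assuming a hypothetical REQUIRED_DAYS floor taken from the contractual commitment; retention beyond 35 days needs manual snapshots or AWS Backup plans rather than the automated window:

```python
import boto3

REQUIRED_DAYS = 35  # assumption: set from the contract (RDS automated backups cap at 35 days)

rds = boto3.client("rds")
for page in rds.get_paginator("describe_db_instances").paginate():
    for db in page["DBInstances"]:
        days = db["BackupRetentionPeriod"]
        if days < REQUIRED_DAYS:
            print(f"{db['DBInstanceIdentifier']}: retention {days}d "
                  f"< required {REQUIRED_DAYS}d")
```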

Backup Job Health

    Pull AWS Backup job history, RDS automated snapshot status, and any custom cron-based backup logs. Look for partial failures, skipped runs, and jobs that succeeded but produced suspiciously small artifacts. "Backup success" metrics that have been green for 18 months without a real restore are not evidence of a working backup.
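
    A sketch of the job-history pull with boto3; the 90-day window and the SMALL_BYTES threshold for "suspiciously small" are assumptions to tune per datastore:

```python
import boto3
from datetime import datetime, timedelta, timezone

SMALL_BYTES = 10 * 1024 * 1024  # assumption: completed "backups" under 10 MiB deserve a look
backup = boto3.client("backup")
since = datetime.now(timezone.utc) - timedelta(days=90)

for page in backup.get_paginator("list_backup_jobs").paginate(ByCreatedAfter=since):
    for job in page["BackupJobs"]:
        state = job["State"]
        size = job.get("BackupSizeInBytes", 0)
        if state in ("FAILED", "ABORTED", "EXPIRED"):
            print(f"{job['ResourceArn']}: {state} at {job['CreationDate']}")
        elif state == "COMPLETED" and size < SMALL_BYTES:
            print(f"{job['ResourceArn']}: completed but only {size} bytes")
```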

    Open a Jira ticket per failing job with the failure window, root cause hypothesis, and owner. Common causes: rotated IAM credentials, expired KMS grants, source disk full, snapshot quota hit. Track to closure before the next quarterly run.

    Trigger a synthetic backup failure or check the most recent real failure: did PagerDuty page the platform on-call, or did the alert quietly route to a deprecated #ops-alerts channel nobody watches? This is the most common source of silent backup rot.

    For Tier 1 datastores, confirm cross-region snapshot copies are within RPO. RDS cross-region automated backups can lag by hours during throttling; S3 CRR can stall silently if the destination bucket policy rejects writes.
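
    A sketch of the freshness check against the DR region; DB_ID, DR_REGION, and RPO_MINUTES are hypothetical values to replace with your instance identifier, DR region, and committed RPO:

```python
import boto3
from datetime import datetime, timezone

DB_ID, DR_REGION, RPO_MINUTES = "orders-prod", "us-west-2", 5  # hypothetical

rds = boto3.client("rds", region_name=DR_REGION)
snaps = rds.describe_db_snapshots(DBInstanceIdentifier=DB_ID)["DBSnapshots"]
available = [s for s in snaps if s["Status"] == "available"]
if not available:
    print(f"NO snapshot copies for {DB_ID} in {DR_REGION}: RPO cannot be met")
else:
    newest = max(s["SnapshotCreateTime"] for s in available)
    lag = (datetime.now(timezone.utc) - newest).total_seconds() / 60
    status = "OK" if lag <= RPO_MINUTES else "RPO BREACH"
    print(f"{status}: newest copy in {DR_REGION} is {lag:.0f} min old")
```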

Encryption and Access Controls

    Every RDS snapshot, EBS snapshot, and S3 bucket holding backups should be encrypted with a customer-managed KMS key (not the AWS-managed default). Auditors flag default-key usage; customer-managed keys give you key rotation control and a clean audit trail in CloudTrail.
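
    A sketch that flags RDS snapshots which are unencrypted or use the AWS-managed default key; the same KeyManager check applies to EBS snapshots and to the default encryption keys on backup buckets:

```python
import boto3

rds = boto3.client("rds")
kms = boto3.client("kms")

for page in rds.get_paginator("describe_db_snapshots").paginate():
    for snap in page["DBSnapshots"]:
        if not snap.get("Encrypted"):
            print(f"{snap['DBSnapshotIdentifier']}: NOT ENCRYPTED")
            continue
        key = kms.describe_key(KeyId=snap["KmsKeyId"])["KeyMetadata"]
        if key["KeyManager"] != "CUSTOMER":
            print(f"{snap['DBSnapshotIdentifier']}: uses AWS-managed key "
                  f"{key['KeyId']}, not a customer-managed key")
```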

    Pull the IAM principals with backup:*, rds:RestoreDBInstance*, and KMS decrypt permissions on the backup keys. Restore is a privileged path — anyone who can restore can exfiltrate. Confirm departed engineers are removed; cross-check against the HRIS termination list.
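
    One way to sketch this with the IAM policy simulator, assuming a hypothetical list of role ARNs to audit; a full review would enumerate every principal and walk attached policies directly:

```python
import boto3

ROLE_ARNS = [  # hypothetical: roles pulled from your access inventory
    "arn:aws:iam::123456789012:role/platform-engineer",
    "arn:aws:iam::123456789012:role/data-analyst",
]
ACTIONS = ["rds:RestoreDBInstanceFromDBSnapshot", "backup:StartRestoreJob", "kms:Decrypt"]

iam = boto3.client("iam")
for arn in ROLE_ARNS:
    result = iam.simulate_principal_policy(PolicySourceArn=arn, ActionNames=ACTIONS)
    allowed = [r["EvalActionName"] for r in result["EvaluationResults"]
               if r["EvalDecision"] == "allowed"]
    if allowed:
        print(f"{arn} can: {', '.join(allowed)}")
```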

    For ransomware resilience, Tier 1 backups should sit in an AWS Backup Vault Lock or S3 Object Lock with governance/compliance retention. Attempt a delete from a non-privileged role and confirm it's rejected. A backup that an attacker with prod credentials can delete is not a backup.
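
    A sketch of the lock checks, with hypothetical vault and bucket names; the delete-rejection test itself should be run while assuming a non-privileged role and expecting an AccessDenied error:

```python
import boto3
from botocore.exceptions import ClientError

VAULT_NAME, BUCKET = "tier1-backup-vault", "tier1-db-backups"  # hypothetical

backup = boto3.client("backup")
vault = backup.describe_backup_vault(BackupVaultName=VAULT_NAME)
print(f"{VAULT_NAME} locked: {vault.get('Locked', False)}")

s3 = boto3.client("s3")
try:
    cfg = s3.get_object_lock_configuration(Bucket=BUCKET)
    print(f"{BUCKET} Object Lock: {cfg['ObjectLockConfiguration']['ObjectLockEnabled']}")
except ClientError as e:
    print(f"{BUCKET} has NO Object Lock configuration: {e.response['Error']['Code']}")
```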

Restore Drill

    Provision a dedicated VPC/account for the drill — never restore production backups into the production account, and never into the staging environment used for QA. Use the disaster-recovery Terraform workspace so the environment matches what would be stood up in a real incident.

    Pick the most recent snapshot of the Tier 1 production DB and restore it. Time the operation end-to-end: snapshot copy (if cross-region), instance provisioning, DNS/connection-string updates. Compare wall-clock time against the stated RTO.
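
    A sketch of the timed restore with boto3, using hypothetical source and target identifiers; run it with drill-account credentials, and add DNS/connection-string cutover time to the measurement by hand:

```python
import time
import boto3

DB_ID = "orders-prod"      # hypothetical source instance
DRILL_ID = "orders-drill"  # hypothetical restore target in the drill account

rds = boto3.client("rds")
snaps = rds.describe_db_snapshots(
    DBInstanceIdentifier=DB_ID, SnapshotType="automated"
)["DBSnapshots"]
latest = max(snaps, key=lambda s: s["SnapshotCreateTime"])

start = time.monotonic()
rds.restore_db_instance_from_db_snapshot(
    DBInstanceIdentifier=DRILL_ID,
    DBSnapshotIdentifier=latest["DBSnapshotIdentifier"],
)
rds.get_waiter("db_instance_available").wait(
    DBInstanceIdentifier=DRILL_ID,
    WaiterConfig={"Delay": 30, "MaxAttempts": 240},  # allow up to ~2 hours
)
elapsed_min = (time.monotonic() - start) / 60
print(f"Restore of {latest['DBSnapshotIdentifier']} took {elapsed_min:.1f} min "
      f"(snapshot taken {latest['SnapshotCreateTime']})")
```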

    Verify the restored data: row counts on key tables, last-write timestamps, foreign-key integrity, and a smoke query that exercises the application's hottest read path. Capture the freshness of the restored data — "snapshot taken at T, restored at T+N, last committed transaction at T-M" — and compare against RPO.
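
    A sketch of the integrity checks for a Postgres restore, assuming psycopg2, a read-only connection string for the drill instance, and hypothetical table names and smoke query:

```python
import psycopg2

KEY_TABLES = ["orders", "customers", "payments"]  # hypothetical
conn = psycopg2.connect("postgresql://readonly@orders-drill:5432/app")  # hypothetical DSN
with conn, conn.cursor() as cur:
    for table in KEY_TABLES:
        # Hypothetical updated_at column used as the last-write timestamp.
        cur.execute(f"SELECT count(*), max(updated_at) FROM {table}")
        rows, last_write = cur.fetchone()
        print(f"{table}: {rows} rows, last write {last_write}")
    # Smoke query standing in for the application's hottest read path.
    cur.execute("SELECT * FROM orders WHERE status = 'open' "
                "ORDER BY created_at DESC LIMIT 10")
    print(f"smoke query returned {len(cur.fetchall())} rows")
```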

    Restore a sample of objects from a prior version (S3 versioning) or from Glacier. Glacier retrieval can take hours for Bulk tier — confirm the retrieval tier in your runbook matches the RTO commitment for this data class.
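
    A sketch of both object-level restores, with hypothetical bucket and key names; note the retrieval tier, since Bulk can take 5-12 hours while Standard is typically 3-5 hours and Expedited returns in minutes:

```python
import boto3

BUCKET, KEY = "customer-uploads", "uploads/sample.pdf"  # hypothetical
s3 = boto3.client("s3")

# Copy a prior (non-latest) version of the object back over the current key.
versions = s3.list_object_versions(Bucket=BUCKET, Prefix=KEY)["Versions"]
prior = next((v for v in versions if not v["IsLatest"]), None)
if prior:
    s3.copy_object(
        Bucket=BUCKET, Key=KEY,
        CopySource={"Bucket": BUCKET, "Key": KEY, "VersionId": prior["VersionId"]},
    )

# Kick off a retrieval of an archived (Glacier) object at the Standard tier.
s3.restore_object(
    Bucket=BUCKET, Key=KEY,
    RestoreRequest={"Days": 1, "GlacierJobParameters": {"Tier": "Standard"}},
)
```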

    Run terraform destroy on the drill workspace and confirm restored snapshots/instances are deleted. Restored copies of production data are a liability — every hour they linger is an hour of unmonitored prod data sitting in a less-locked-down environment.

DR Runbook and Roles

    Read the runbook out loud as a team. Flag any step that references a deprecated tool, an old account ID, a Slack channel that no longer exists, or a person who has left. Stale runbooks are the #1 reason DR drills go off the rails.

    Confirm the DR runbook names an incident commander (IC), comms lead, and scribe, and that the on-call rotation in PagerDuty/Opsgenie has at least three trained ICs — one IC plus their vacation is a single point of failure. Schedule cross-training if the bench is thin.

    Compare measured restore time against the committed RTO, and measured data freshness against the committed RPO. A miss here triggers a follow-up plan in the next section.

Findings, Sign-Off, and Evidence

    If the drill missed its targets, open a named-owner Jira ticket per gap (e.g., "reduce snapshot copy time by pre-staging cross-region copies", "shrink restore time with provisioned IOPS on restore"). Track to closure before next quarter's drill.

    Edit the runbook in Confluence/Notion. Add commands you actually used (vs. what was documented), correct any wrong account IDs or paths, and note any tool that didn't behave as expected. Commit changes the same week — memory fades fast.

    Capture the drill date, participants, restore times, integrity check results, and any findings. Upload to Vanta/Drata/Secureframe under the CC7.x / A1.x controls so the next SOC 2 auditor sees a clean trail of quarterly DR testing.