Data Backup Verification Checklist
Recurring workflow for IT and MSP teams to verify backup health, test restores against RPO/RTO targets, and confirm 3-2-1 coverage across endpoints, servers, and SaaS data. Run monthly with a quarterly DR test embedded.
Backup Coverage and Configuration Review
-
Reconcile protected assets against the CMDB
Pull the current asset list from the RMM or CMDB (NinjaOne, Datto RMM, IT Glue) and cross-check against the backup console (Veeam, Datto, Rubrik, AWS Backup). Flag any production VMs, endpoints, or M365/Google Workspace tenants that are not enrolled in a backup job.
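The reconciliation above amounts to a set difference between two exports. Below is a minimal sketch, assuming each tool can export a flat list of asset names; the function name `find_unprotected` and the sample names are hypothetical and not part of any vendor API.

```python
def find_unprotected(assets, protected):
    """Return inventory names that do not appear in the backup console export."""

    def normalize(name):
        # Treat "SQL01 " and "sql01" as the same asset.
        return name.strip().lower()

    protected_names = {normalize(name) for name in protected}
    return sorted(
        name.strip() for name in assets if normalize(name) not in protected_names
    )


# Hypothetical exports: one from the RMM/CMDB, one from the backup console.
cmdb_assets = ["SQL01", "web01", "FS02 ", "m365-tenant-contoso"]
backup_console = ["sql01", "WEB01"]

for name in find_unprotected(cmdb_assets, backup_console):
    print(f"Not enrolled in any backup job: {name}")
```

Real exports rarely share a naming scheme, so matching on a stable identifier (serial number, instance ID, tenant ID) is usually more reliable than matching on display names.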
-
Confirm 3-2-1 coverage for critical systems
Verify three copies of data, on two media types, with at least one offsite copy. For cloud-native workloads confirm cross-region replication (AWS Backup vault, Azure Backup GRS) plus an immutable copy.
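The 3-2-1 rule can be expressed as three simple conditions. The sketch below assumes the list of copies includes the production copy itself; the `BackupCopy` type and its field names are illustrative, not taken from any backup product.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupCopy:
    media: str            # e.g. "disk", "tape", "object-storage"
    offsite: bool
    immutable: bool = False


def meets_3_2_1(copies):
    """Three copies in total, on at least two media types, at least one offsite."""
    return (
        len(copies) >= 3
        and len({copy.media for copy in copies}) >= 2
        and any(copy.offsite for copy in copies)
    )


def has_immutable_copy(copies):
    """Extra check for cloud-native workloads: at least one immutable copy."""
    return any(copy.immutable for copy in copies)


copies = [
    BackupCopy(media="disk", offsite=False),                          # production data
    BackupCopy(media="disk", offsite=False),                          # local repository
    BackupCopy(media="object-storage", offsite=True, immutable=True), # cloud vault
]
print(meets_3_2_1(copies), has_immutable_copy(copies))
```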
-
Validate RPO and RTO targets per tier
Match each system's backup frequency to its documented RPO and the restore SLA to its RTO. Tier-1 systems (production DB, identity provider) typically need RPO ≤ 1 hour; tier-3 (developer workstations) may tolerate 24 hours.
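As a rough illustration of the RPO comparison: the worst-case data loss for a system is approximately the interval between its backups, so any system whose interval exceeds its documented RPO is out of tolerance. The record layout and system names below are hypothetical.

```python
from datetime import timedelta


def rpo_violations(systems):
    """Return names of systems whose backup interval is longer than their RPO."""
    return [
        system["name"]
        for system in systems
        if system["backup_interval"] > system["rpo"]
    ]


systems = [
    # Tier 1: backed up every 4 hours against a 1-hour RPO, so it is flagged.
    {"name": "prod-db", "backup_interval": timedelta(hours=4), "rpo": timedelta(hours=1)},
    # Tier 3: daily backup against a 24-hour RPO, so it passes.
    {"name": "dev-workstation", "backup_interval": timedelta(hours=24), "rpo": timedelta(hours=24)},
]
print(rpo_violations(systems))
```

This check ignores job duration; a job that takes an hour to complete widens the effective exposure beyond the schedule interval.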
-
Review backup job schedules and retention
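Confirm GFS retention (daily 14, weekly 8, monthly 12, yearly 7) aligns with policy and compliance minimums. Frameworks such as HIPAA and PCI DSS set retention floors for certain records and logs, and SOC 2 auditors test against the retention period your own policy commits to; these requirements override default vendor settings.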
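One way to sanity-check a GFS scheme against a retention floor is to estimate how far back the oldest retained restore point reaches. The sketch below uses rounded tier lengths (30-day months, 365-day years), and the six-year floor is only a placeholder; take the actual floor from your policy or compliance team.

```python
# Approximate length of each GFS tier in days (months and years are rounded).
TIER_DAYS = {"daily": 1, "weekly": 7, "monthly": 30, "yearly": 365}


def retention_reach_days(gfs):
    """Estimate how many days back the oldest retained restore point reaches."""
    return max(count * TIER_DAYS[tier] for tier, count in gfs.items())


def meets_floor(gfs, floor_days):
    """True when the longest-reaching tier covers the required retention floor."""
    return retention_reach_days(gfs) >= floor_days


gfs_policy = {"daily": 14, "weekly": 8, "monthly": 12, "yearly": 7}
print(retention_reach_days(gfs_policy))    # 7 yearly points * 365 days = 2555
print(meets_floor(gfs_policy, 6 * 365))    # placeholder six-year floor
```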
Job Health and Encryption
-
Review last 30 days of backup job results
Export the job report from Veeam ONE, Datto Status, or Rubrik Polaris. Investigate any job with a success rate below 98%; repeated VSS errors and stale agents are the usual culprits.
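The 98% threshold can be applied to an exported job report with a short aggregation. This sketch assumes the report has been reduced to `(job_name, succeeded)` pairs; real exports differ by vendor and would need their own parsing step.

```python
from collections import defaultdict


def jobs_below_threshold(runs, threshold=0.98):
    """Map each job whose success rate is under the threshold to that rate."""
    counts = defaultdict(lambda: [0, 0])  # job -> [successes, total runs]
    for job, succeeded in runs:
        counts[job][1] += 1
        if succeeded:
            counts[job][0] += 1
    rates = {job: ok / total for job, (ok, total) in counts.items()}
    return {job: rate for job, rate in rates.items() if rate < threshold}


# Hypothetical 30-day history: sql01 never failed, fs02 failed twice.
runs = [("sql01", True)] * 30 + [("fs02", True)] * 28 + [("fs02", False)] * 2
for job, rate in jobs_below_threshold(runs).items():
    print(f"{job}: {rate:.1%} success over the period")
```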
-
Open tickets for failing backup jobs
Create a P2 ticket in the PSA (ConnectWise, HaloPSA, Jira Service Management) for each failing job with the asset name, last successful run, error code, and assigned owner. Do not close the ticket until two consecutive successful runs are confirmed.
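The closure rule is easy to encode as a guard. A minimal sketch, assuming the run history for the job is available as a list of booleans ordered oldest to newest (the function name is hypothetical):

```python
def can_close_ticket(run_history, required_successes=2):
    """True only when the most recent runs are all successes.

    run_history is ordered oldest to newest; True means the run succeeded.
    """
    if len(run_history) < required_successes:
        return False
    return all(run_history[-required_successes:])


print(can_close_ticket([False, True, True]))   # last two runs succeeded
print(can_close_ticket([True, False, True]))   # only one success since the failure
```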
-
Verify encryption at rest and in transit
Confirm AES-256 at rest on the backup repository and TLS 1.2+ in transit. For cloud vaults, confirm KMS or customer-managed keys are in use, not vendor-default keys.
-
Confirm immutability and air-gap controls
Verify that object lock (S3 Object Lock, Azure Blob immutable storage) or a hardened repository is enforced on at least one copy. Ransomware actors target backup consoles first, so a copy that an admin can delete is not a recovery copy.
-
Rotate backup service account credentials
Pull the backup service account credential from the vault (HashiCorp Vault, Azure Key Vault, 1Password Secrets) and confirm it has been rotated within policy; rotate it now if it has not. Service accounts excluded from rotation are a classic finding in SOC 2 and ISO 27001 audits.
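Checking rotation age is a date comparison once the last-rotated timestamp has been read from the vault. The sketch below deliberately leaves the vault lookup out, since each product exposes that metadata differently; the 90-day default is a placeholder for whatever your policy specifies.

```python
from datetime import datetime, timedelta, timezone


def rotation_overdue(last_rotated, max_age=timedelta(days=90), now=None):
    """Return True when the credential is older than the policy allows."""
    now = now or datetime.now(timezone.utc)
    return now - last_rotated > max_age


# Hypothetical timestamp taken from the vault's metadata for the secret.
last_rotated = datetime(2024, 1, 1, tzinfo=timezone.utc)
check_time = datetime(2024, 6, 1, tzinfo=timezone.utc)
print(rotation_overdue(last_rotated, now=check_time))
```

Both timestamps should be timezone-aware; Python raises a `TypeError` when subtracting an aware datetime from a naive one.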
Restore Testing
-
Select random sample for spot-restore
Pick one file-level restore, one VM-level restore, and one SaaS object (M365 mailbox or SharePoint site) at random. Spot-restores catch silent corruption that job-success metrics miss.
-
Perform file-level restore to isolated location
Restore to a quarantine share, never overwriting production. Where possible, verify that the hash of the restored file matches the source.
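A hash comparison between the source file and the restored copy can be done with the standard library. This is a sketch that uses temporary files as stand-ins for the production file and the quarantine copy; it is only meaningful when the source has not changed since the backup was taken.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large restores do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def restore_matches_source(source_path, restored_path):
    """True when both files have identical SHA-256 digests."""
    return sha256_of(source_path) == sha256_of(restored_path)


# Demo: temporary files stand in for the source and the quarantine copy.
with tempfile.TemporaryDirectory() as workdir:
    source = Path(workdir) / "source.bin"
    restored = Path(workdir) / "restored.bin"
    source.write_bytes(b"payroll-2024")
    restored.write_bytes(b"payroll-2024")
    print(restore_matches_source(source, restored))
```

If the backup product records a checksum at backup time, comparing against that value avoids the problem of a source file that has since been modified.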
-
Perform VM-level instant recovery test
Boot the VM in an isolated network (Veeam SureBackup, Datto Screenshot Verification, Rubrik Live Mount). Confirm that the OS boots, services start, and the application heartbeat responds.
-
Record restore times against RTO targets
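Recording restore times is most useful when each result is compared directly with the RTO for that system. A minimal sketch, assuming elapsed times were measured during the spot-restores; the record fields and system names are hypothetical.

```python
from datetime import timedelta


def evaluate_restores(restores):
    """Annotate each restore record with whether it finished within its RTO."""
    return [
        {**record, "met_rto": record["elapsed"] <= record["rto"]}
        for record in restores
    ]


restores = [
    {"system": "prod-db", "elapsed": timedelta(minutes=45), "rto": timedelta(hours=1)},
    {"system": "file-server", "elapsed": timedelta(hours=5), "rto": timedelta(hours=4)},
]
for result in evaluate_restores(restores):
    status = "within RTO" if result["met_rto"] else "RTO MISSED"
    print(f'{result["system"]}: {result["elapsed"]} vs {result["rto"]} ({status})')
```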
-
Run quarterly full DR failover tabletop
Walk through the documented runbook end to end with the IR commander, infra lead, and an exec sponsor as observer. Invite legal and comms as well: leaving them out is a common gap, and a real outage is the worst time to discover that nobody owns press contact.
Remediation and Escalation
-
Escalate restore failures to engineering lead
Page the on-call engineering lead via PagerDuty or Opsgenie when a restore fails. A failed restore is a P1 finding: the affected system has no recovery path until the fault is fixed and the restore is re-tested.
-
Trigger seed reload from offsite copy
If the primary repository is corrupt, initiate a seed restore from the offsite/immutable copy. Document the chain of custody for any copy moved across regions or accounts.
-
Re-run failed jobs after remediation
Force a fresh full backup, then a follow-up incremental, and confirm both succeed before closing the incident ticket.
Capacity and Documentation
-
Review repository capacity and growth
Pull capacity metrics from the backup console. Project 90-day growth and order capacity now if free space drops below 20%; a full repository is a common cause of silent backup-job failure.
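The 90-day projection can be approximated with a linear growth model. The sketch below assumes a constant daily growth rate derived from recent history and treats the 20% rule as applying to both current and projected free space; that interpretation, and the sample figures, are assumptions.

```python
def free_fraction(capacity_tb, used_tb):
    """Share of the repository that is still free, between 0.0 and 1.0."""
    return max(0.0, (capacity_tb - used_tb) / capacity_tb)


def projected_free_fraction(capacity_tb, used_tb, daily_growth_tb, horizon_days=90):
    """Free share after linear growth over the horizon."""
    projected_used = used_tb + daily_growth_tb * horizon_days
    return free_fraction(capacity_tb, projected_used)


def should_order_capacity(capacity_tb, used_tb, daily_growth_tb, threshold=0.20):
    """True when free space is, or is projected to fall, under the threshold."""
    return (
        free_fraction(capacity_tb, used_tb) < threshold
        or projected_free_fraction(capacity_tb, used_tb, daily_growth_tb) < threshold
    )


# Hypothetical repository: 100 TB total, 70 TB used, growing 0.25 TB per day.
print(f"{projected_free_fraction(100, 70, 0.25):.1%} free in 90 days")
print(should_order_capacity(100, 70, 0.25))
```

Linear growth understates the risk when new workloads are being onboarded or when a retention change is about to increase the number of restore points kept.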
-
Apply retention and legal-hold policies
Expire backups past retention except those under legal hold. Confirm with GRC or legal that no active hold blocks disposal before purging.
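The purge rule combines an age cutoff with a legal-hold exclusion. A minimal sketch, assuming each backup record carries an identifier, the asset it protects, and a creation date, and that legal holds are tracked per asset; all of these names are hypothetical.

```python
from datetime import date, timedelta


def expirable_backups(backups, today, retention_days, held_assets):
    """Return IDs of backups past retention whose asset is not under legal hold."""
    cutoff = today - timedelta(days=retention_days)
    return [
        backup["id"]
        for backup in backups
        if backup["created"] < cutoff and backup["asset"] not in held_assets
    ]


backups = [
    {"id": "bk-001", "asset": "fs02", "created": date(2023, 1, 10)},
    {"id": "bk-002", "asset": "mail-archive", "created": date(2023, 1, 10)},
    {"id": "bk-003", "asset": "fs02", "created": date(2024, 5, 1)},
]
# mail-archive is under an active hold, so bk-002 is kept despite its age.
print(expirable_backups(backups, date(2024, 6, 1), 365, {"mail-archive"}))
```

Treat the output as a candidate list for review, not as input to an automated purge, until GRC or legal has confirmed the hold list is current.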
-
Update runbooks and asset documentation
Refresh the backup runbook in IT Glue, Hudu, or Confluence with any changes from this cycle: new asset coverage, schedule changes, repository targets, and contact escalation paths.
-
File evidence package for SOC 2 or ISO audit
Upload job reports, restore evidence, and capacity snapshots to the GRC platform (Vanta, Drata, Secureframe). Continuous evidence collection beats scrambling at audit time, because auditors flag the gaps in last-minute submissions.