Data Recovery Checklist

Quarterly restore drill that a sysadmin or MSP engineer runs to prove backups are usable, not just green. Covers backup verification, isolated-environment restore, and post-restoration validation against named systems and RPO/RTO targets.

5 sections, 19 steps
1

Backup Job Verification

  1. Confirm backup scope against the asset inventory
    • Reconcile the Veeam / Datto / Rubrik job list against the CMDB or RMM asset list. Flag any production VM, file share, M365 tenant, or SQL instance that is not covered. Newly provisioned hosts that never got tagged into a backup job are the most common gap.

  2. Review last 7 days of backup job logs
    • Open the backup console and filter for warnings and failures. A streak of "success with warnings" usually means VSS snapshot timeouts or skipped open files — investigate before assuming the job is healthy.

  3. Open a ticket for failed or warning jobs
    • File a ticket in ConnectWise PSA / Autotask / ServiceNow against the affected job, link the log excerpt, and assign to the backup engineer. Do not proceed with the restore drill until known job failures are triaged — restoring from a broken chain wastes the drill window.

  4. Verify the offsite and immutable copy
    • Confirm 3-2-1 is intact: the production copy, a secondary local copy, and an offsite copy protected by object lock or an air gap (S3 Object Lock, Wasabi immutability, LTO tape, separate cloud account). If ransomware encrypts the primary backup repository, the immutable offsite copy is the only recovery path left.

  5. Check encryption keys and recovery credentials
    • Confirm the BitLocker recovery keys, backup repository encryption passphrase, and KMS keys are present in the password vault (Keeper, Bitwarden, IT Glue, Hudu) and accessible to at least two named operators. Backups encrypted with a key nobody can find are the same as no backups.
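
The scope reconciliation in step 1 of this section comes down to a set difference between the asset inventory and the hosts covered by backup jobs. The sketch below is one possible way to do it, assuming both lists have already been exported as plain hostnames. The function names and sample hostnames are hypothetical, and real Veeam, Datto, or Rubrik exports will need their own parsing first.

```python
def normalize(name: str) -> str:
    """Lower-case a hostname and drop any domain suffix.

    Assumption: entries are hostnames, not IP addresses.
    """
    return name.strip().lower().split(".")[0]


def find_uncovered_assets(inventory: list[str], backed_up: list[str]) -> list[str]:
    """Return inventory hosts that do not appear in any backup job."""
    covered = {normalize(host) for host in backed_up}
    return sorted(host for host in inventory if normalize(host) not in covered)


if __name__ == "__main__":
    # Hypothetical exports: CMDB/RMM asset list vs. hosts named in backup jobs.
    inventory = ["SQL01.corp.local", "FS01", "APP02", "web03"]
    backed_up = ["sql01", "fs01.corp.local", "WEB03"]
    print(find_uncovered_assets(inventory, backed_up))  # ['APP02']
```

Normalizing names before comparing avoids false gaps when one system records FQDNs and the other records short names.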

2

Restore Plan and Scope

  1. Select the restore scenario for this drill
    • Rotate scenarios across drills so coverage is broad: file-level restore, full VM restore, SQL point-in-time, M365 mailbox, Entra ID object recovery, or full site failover. Pick one scenario and stick to it — combining scenarios in one drill muddies the timing data.

  2. Identify the recovery point to restore from
    • Pick a recovery point that exercises the chain — typically a synthetic full plus several incrementals, not the most recent point alone. Document the timestamp; this is what RPO is measured against.

  3. Stand up an isolated recovery network
    • Restore into an isolated VLAN or sandbox vCenter cluster, never into production. Block egress to production AD / DNS so a restored host can't register itself in DNS, create a duplicate SPN, or pull production GPOs. Veeam SureBackup, Datto Virtualization, and Rubrik Live Mount all support isolated networks.

  4. Confirm dependencies are available in the sandbox
    • List upstream dependencies — DNS, AD, certificate authority, license server, SQL backend — and either spin up isolated copies or stub them. App servers booting without a reachable DC will hang at login and skew the RTO measurement.
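
Step 2 of this section asks for a recovery point that exercises the chain rather than the newest point alone. As a rough sketch, assuming the restore points have been exported with a timestamp and a type label, the selection could look like the following. The `RestorePoint` class, the "full" / "incremental" labels, and the `min_incrementals` threshold are all hypothetical and not taken from any vendor API.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class RestorePoint:
    created: datetime
    kind: str  # "full" or "incremental" (hypothetical labels)


def pick_drill_point(points: list[RestorePoint], min_incrementals: int = 3) -> Optional[RestorePoint]:
    """Return the newest point that sits at least `min_incrementals` deep in its chain."""
    depth = 0          # incrementals seen since the most recent full
    seen_full = False  # ignore incrementals that precede the first full
    best = None
    for point in sorted(points, key=lambda p: p.created):
        if point.kind == "full":
            seen_full, depth = True, 0
        elif seen_full:
            depth += 1
            if depth >= min_incrementals:
                best = point  # later qualifying points overwrite earlier ones
    return best


if __name__ == "__main__":
    chain = [RestorePoint(datetime(2024, 1, 1), "full")] + [
        RestorePoint(datetime(2024, 1, day), "incremental") for day in range(2, 6)
    ]
    chosen = pick_drill_point(chain)
    print(chosen.created if chosen else "no point deep enough in a chain")
```

The timestamp of the chosen point is the value to document for the RPO measurement.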

3

Execute the Restore

  1. Start the restore and capture the start timestamp
    • Note the wall-clock start time. RTO measurement starts here, not when the job was queued. Watch for early failures (repository connection errors, credential prompts, dedupe rehydration stalls) and resolve them as they occur.

  2. Monitor restore progress against the RTO
    • Track throughput in the backup console. If the projected completion time exceeds the documented RTO, escalate now — do not wait for the post-restore review. Common culprits: cold cloud-tier rehydrate, network bottleneck between repo and target, undersized restore proxy.

  3. Record the restore completion time
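
The timing arithmetic behind the three steps in this section can be sketched as follows: the restore duration is the completion time minus the start time, and a projected completion time can be estimated from the average throughput so far. This is a minimal illustration that assumes a simple linear projection. The function names and the sample figures are hypothetical, and real restores rarely run at constant throughput.

```python
from datetime import datetime, timedelta


def measured_duration(start: datetime, finish: datetime) -> timedelta:
    """Elapsed wall-clock time from restore start to restore completion."""
    return finish - start


def projected_finish(start: datetime, now: datetime, done_gb: float, total_gb: float) -> datetime:
    """Linear projection of the completion time from average throughput so far."""
    if done_gb <= 0:
        raise ValueError("no data restored yet; cannot project")
    elapsed = now - start
    return start + elapsed * (total_gb / done_gb)


def exceeds_rto(start: datetime, projected: datetime, rto_target: timedelta) -> bool:
    """True when the projected restore duration is longer than the documented RTO."""
    return projected - start > rto_target


if __name__ == "__main__":
    start = datetime(2024, 1, 1, 9, 0)
    now = datetime(2024, 1, 1, 9, 30)
    eta = projected_finish(start, now, done_gb=100, total_gb=500)
    print(eta)                                          # 2024-01-01 11:30:00
    print(exceeds_rto(start, eta, timedelta(hours=2)))  # True
```

In this example 100 GB of 500 GB took 30 minutes, so the projection is 150 minutes in total, which overruns a 2-hour RTO and would trigger the escalation described in step 2.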
4

Post-Restoration Validation

  1. Boot the restored system and verify services start
    • Check the Windows Service Control Manager or systemd for failed services. SQL Server, IIS app pools, and scheduled tasks are the usual offenders — they often depend on a service account whose password has rotated since the recovery point.

  2. Run an application-level smoke test
    • Don't stop at "the VM booted." Log in as a test user, open the application, run a transaction, and validate against the application owner's pass criteria. For SQL: run DBCC CHECKDB. For file shares: spot-check files against a checksum manifest.

  3. Open a remediation ticket for the failed restore
    • If the smoke test failed, file a P2 in the PSA / ITSM tool with the console screenshot, the validation notes, and the affected backup job. Assign to the backup engineering lead and tag for the next change advisory board so the fix is tracked, not lost in a Slack thread.

  4. Verify ACLs and group memberships on restored data
    • Restored file shares often come back with broken inheritance or SIDs that no longer resolve. Run icacls or Get-Acl against a sample of folders, and confirm AD security group membership matches the production source-of-truth.

  5. Tear down the sandbox and reclaim resources
    • Power off and delete the sandbox VMs, release the isolated VLAN, and unmount any Live Mount / Instant Recovery sessions. Forgotten sandbox VMs accumulate license cost and clutter the inventory before the next drill.
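
For the file-share spot check mentioned in step 2 of this section, one possible approach is to hash a random sample of restored files and compare them against a manifest captured from production. The sketch below assumes the manifest is a mapping of relative path to SHA-256 hex digest. That format, the function names, and the default sample size are assumptions for illustration, not part of any backup product.

```python
import hashlib
import random
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in 1 MiB chunks so large files do not load into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def spot_check(restored_root: Path, manifest: dict[str, str], sample_size: int = 50) -> list[str]:
    """Return sampled relative paths that are missing or whose hash differs."""
    sample = random.sample(sorted(manifest), min(sample_size, len(manifest)))
    failures = []
    for rel_path in sample:
        target = restored_root / rel_path
        if not target.is_file() or sha256_of(target) != manifest[rel_path]:
            failures.append(rel_path)
    return sorted(failures)
```

An empty result means every sampled file matched. Files that legitimately changed in production after the recovery point will show up as mismatches, so the manifest should be taken as close to the recovery point timestamp as practical.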

5

Drill Reporting and Sign-Off

  1. Update the runbook with lessons learned
    • Edit the recovery runbook in IT Glue / Hudu / Confluence with anything that surprised the operator — undocumented dependency, missing credential, unexpected duration. The next drill (or the next real incident) is run from this document.

  2. Sign off on the drill report
    • Capture the IT manager's or vCIO's sign-off. SOC 2, HIPAA, and PCI auditors will ask for evidence that restore drills happen on a defined cadence with named approvers — the signature plus the captured RTO/RPO data is that evidence.


Category: Systems Administration