Data Recovery Checklist

Backup Job Verification

    Reconcile the Veeam / Datto / Rubrik job list against the CMDB or RMM asset list. Flag any production VM, file share, M365 tenant, or SQL instance that no job covers. Newly provisioned hosts that were never tagged into a backup job are the most common gap.

    Open the backup console and filter for warnings and failures. A streak of "success with warnings" usually means VSS snapshot timeouts or skipped open files — investigate before assuming the job is healthy.

    File a ticket in ConnectWise PSA / Autotask / ServiceNow against the affected job, link the log excerpt, and assign to the backup engineer. Do not proceed with the restore drill until known job failures are triaged — restoring from a broken chain wastes the drill window.

    Confirm 3-2-1 is intact: three copies of the data (production, a secondary local copy, and an offsite copy), on at least two different storage media, with one copy offsite behind object lock or an air gap (S3 Object Lock, Wasabi immutability, LTO tape, separate cloud account). If ransomware encrypts the primary backup, that offsite copy is the only thing left to restore from.

    Confirm the BitLocker recovery keys, backup repository encryption passphrase, and KMS keys are present in the password vault (Keeper, Bitwarden, IT Glue, Hudu) and accessible to at least two named operators. Backups encrypted with a key nobody can find are the same as no backups.
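The coverage reconciliation in the first step of this section comes down to a set difference between two exports. A minimal Python sketch, assuming both the backup console and the CMDB / RMM can export hostnames as plain lists. The function names and sample hostnames are hypothetical, and real exports will need vendor-specific parsing first.

```python
def normalize(name):
    """Lower-case a hostname and drop any DNS suffix so 'SQL01.corp.local' matches 'sql01'."""
    return name.strip().lower().split(".")[0]


def find_uncovered_assets(asset_names, protected_names):
    """Return assets listed in the CMDB/RMM export that appear in no backup job."""
    protected = {normalize(n) for n in protected_names}
    return sorted(n for n in asset_names if normalize(n) not in protected)


if __name__ == "__main__":
    # Hypothetical exports: one from the asset list, one from the backup console.
    cmdb = ["SQL01.corp.local", "FS01", "APP02", "web03"]
    backup_jobs = ["sql01", "fs01.corp.local", "WEB03"]
    print(find_uncovered_assets(cmdb, backup_jobs))  # ['APP02']
```

Matching on the short hostname is a simplification: it will not work for assets tracked by IP address or for hosts that share a short name across domains, so treat the output as a list to review rather than a verdict.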

Restore Plan and Scope

    Pick one scenario per drill and stick to it: file-level restore, full VM restore, SQL point-in-time restore, M365 mailbox, Entra ID object recovery, or full site failover. Combining scenarios in one drill muddies the timing data, so rotate scenarios across drills to keep coverage broad.

    Pick a recovery point that exercises the chain — typically a synthetic full plus several incrementals, not the most recent point alone. Document the timestamp; this is what RPO is measured against.

    Restore into an isolated VLAN or sandbox vCenter cluster, never into production. Block egress to production AD / DNS so a restored host can't register itself in production DNS, register a duplicate SPN, or pull production GPOs. Veeam SureBackup, Datto Virtualization, and Rubrik Live Mount all support isolated networks.

    List upstream dependencies — DNS, AD, certificate authority, license server, SQL backend — and either spin up isolated copies or stub them. App servers booting without a reachable DC will hang at login and skew the RTO measurement.
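The recovery point timestamp documented earlier in this section is one half of the RPO calculation. A minimal Python sketch of that arithmetic, assuming the drill also defines a simulated failure time (the checklist does not name one). The function names, timestamps, and 8-hour target are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def achieved_rpo(recovery_point, failure_time):
    """Data-loss window: time between the chosen restore point and the simulated failure."""
    if recovery_point > failure_time:
        raise ValueError("recovery point is later than the failure time")
    return failure_time - recovery_point


def meets_rpo(recovery_point, failure_time, rpo_target):
    """True when the data-loss window fits inside the documented RPO."""
    return achieved_rpo(recovery_point, failure_time) <= rpo_target


if __name__ == "__main__":
    # Hypothetical drill values; timezone-aware datetimes avoid DST surprises.
    point = datetime(2024, 3, 1, 2, 0, tzinfo=timezone.utc)
    failure = datetime(2024, 3, 1, 9, 30, tzinfo=timezone.utc)
    print(achieved_rpo(point, failure))                      # 7:30:00
    print(meets_rpo(point, failure, timedelta(hours=8)))     # True
```

Because the drill deliberately restores from an older point to exercise the chain, the window computed here describes the drill, not the best case the backup schedule can deliver.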

Execute the Restore

    Note the wall-clock start time. RTO measurement starts here, not when the job was queued. Watch for early failures — repository connection, credential prompt, dedupe rehydrate stalls — and resolve in-line.

    Track throughput in the backup console. If the projected completion time exceeds the documented RTO, escalate now — do not wait for the post-restore review. Common culprits: cold cloud-tier rehydrate, network bottleneck between repo and target, undersized restore proxy.
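The escalation rule above can be checked with simple extrapolation. A minimal Python sketch, assuming the operator can read bytes restored and total bytes from the backup console and that average throughput so far is a fair predictor of the remainder (it will not be if a cold-tier rehydrate is still warming up). The function names and figures are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def projected_finish(start, now, bytes_done, bytes_total):
    """Extrapolate the completion time from the average throughput so far."""
    elapsed = (now - start).total_seconds()
    if elapsed <= 0 or bytes_done <= 0:
        raise ValueError("need elapsed time and some restored data to project")
    rate = bytes_done / elapsed  # bytes per second
    remaining_seconds = max(bytes_total - bytes_done, 0) / rate
    return now + timedelta(seconds=remaining_seconds)


def should_escalate(start, now, bytes_done, bytes_total, rto):
    """True when the projected finish lands after the RTO deadline (start + RTO)."""
    return projected_finish(start, now, bytes_done, bytes_total) > start + rto


if __name__ == "__main__":
    GIB = 2**30
    start = datetime(2024, 3, 1, 10, 0, tzinfo=timezone.utc)
    now = start + timedelta(minutes=30)
    # 100 GiB of 500 GiB restored in 30 minutes projects roughly two more hours.
    print(should_escalate(start, now, 100 * GIB, 500 * GIB, timedelta(hours=2)))  # True
```

The deadline is measured from the wall-clock start noted in the previous step, not from when the job was queued.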

Post-Restoration Validation

    Check the Windows Service Control Manager or systemd for failed services. SQL Server, IIS app pools, and scheduled tasks are the usual offenders — they often depend on a service account whose password has rotated since the recovery point.

    Don't stop at "the VM booted." Log in as a test user, open the application, run a transaction, and validate against the application owner's pass criteria. For SQL: run DBCC CHECKDB. For file shares: spot-check files against a checksum manifest.

    If the smoke test fails, file a P2 in the PSA / ITSM tool with the console screenshot, the validation notes, and the affected backup job. Assign it to the backup engineering lead and tag it for the next change advisory board so the fix is tracked, not lost in a Slack thread.

    Restored file shares often come back with broken inheritance or SIDs that no longer resolve. Run icacls or Get-Acl against a sample of folders, and confirm AD security group membership matches the production source-of-truth.

    Power off and delete the sandbox VMs, release the isolated VLAN, and unmount any Live Mount / Instant Recovery sessions. Forgotten sandbox VMs accumulate license cost and clutter the inventory before the next drill.
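The file-share spot-check against a checksum manifest, mentioned in the validation step above, could be scripted along these lines. A minimal Python sketch, assuming the manifest is a mapping of relative path to SHA-256 hex digest captured from production before the drill. The manifest format, function names, and file names are hypothetical.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large restored files do not fill memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(root, manifest):
    """Return {relative_path: problem} for files that are missing or differ."""
    problems = {}
    for rel_path, expected in manifest.items():
        target = Path(root) / rel_path
        if not target.is_file():
            problems[rel_path] = "missing"
        elif sha256_of(target) != expected.lower():
            problems[rel_path] = "checksum mismatch"
    return problems


if __name__ == "__main__":
    # Stand-in for a restored share: one file present, one absent.
    with tempfile.TemporaryDirectory() as tmp:
        (Path(tmp) / "report.txt").write_bytes(b"quarterly numbers")
        manifest = {
            "report.txt": hashlib.sha256(b"quarterly numbers").hexdigest(),
            "budget.xlsx": "0" * 64,  # deliberately absent from the restore
        }
        print(verify_manifest(tmp, manifest))  # {'budget.xlsx': 'missing'}
```

An empty result means every sampled file matched. Files that legitimately changed between the recovery point and the manifest capture will show as mismatches, so the manifest should be taken as close to the recovery point as practical.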

Drill Reporting and Sign-Off

    Edit the recovery runbook in IT Glue / Hudu / Confluence with anything that surprised the operator — undocumented dependency, missing credential, unexpected duration. The next drill (or the next real incident) is run from this document.

    Capture the IT manager's or vCIO's sign-off. SOC 2, HIPAA, and PCI auditors will ask for evidence that restore drills happen on a defined cadence with named approvers — the signature plus the captured RTO/RPO data is that evidence.
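The evidence described above (captured RTO/RPO data plus a named approver) is easier to hand to an auditor when every drill produces the same record. A minimal Python sketch of one possible record shape. All field names are hypothetical, and it assumes the RTO clock stops when validation passes, which the checklist does not specify.

```python
import json
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass
class DrillRecord:
    scenario: str
    recovery_point: datetime
    restore_started: datetime
    restore_validated: datetime
    rto_target: timedelta
    approver: str

    @property
    def achieved_rto(self):
        """Wall-clock time from restore start to passed validation."""
        return self.restore_validated - self.restore_started

    def to_json(self):
        """Serialize the drill evidence with ISO 8601 timestamps."""
        payload = {
            "scenario": self.scenario,
            "recovery_point": self.recovery_point.isoformat(),
            "restore_started": self.restore_started.isoformat(),
            "restore_validated": self.restore_validated.isoformat(),
            "achieved_rto_minutes": self.achieved_rto.total_seconds() / 60,
            "rto_target_minutes": self.rto_target.total_seconds() / 60,
            "rto_met": self.achieved_rto <= self.rto_target,
            "approver": self.approver,
        }
        return json.dumps(payload, indent=2)


if __name__ == "__main__":
    record = DrillRecord(
        scenario="full VM restore",
        recovery_point=datetime(2024, 3, 1, 2, 0, tzinfo=timezone.utc),
        restore_started=datetime(2024, 3, 1, 10, 0, tzinfo=timezone.utc),
        restore_validated=datetime(2024, 3, 1, 11, 35, tzinfo=timezone.utc),
        rto_target=timedelta(hours=2),
        approver="IT manager (name goes here)",
    )
    print(record.to_json())
```

The JSON can be attached to the sign-off ticket or stored alongside the runbook so the cadence and approvers are visible in one place.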
