Backup and Restore Checklist

Pre-Backup Preparation

    Pull the current RPO/RTO targets from the BCP document or MSA. Note any client-tier or regulatory drivers (HIPAA, SOC 2, PCI DSS) that dictate retention or immutability requirements. Mismatched targets between the BCP and the actual backup job schedule are the most common audit finding.
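
    The target-versus-schedule comparison can be sketched as below. This is a minimal illustration, assuming the RPO targets and job intervals have already been exported into plain dictionaries; the workload names and figures are hypothetical, and no backup vendor API is involved.

```python
from datetime import timedelta

def rpo_mismatches(rpo_targets, job_intervals):
    """Return workloads whose backup interval is longer than the RPO target,
    or that have no scheduled job at all."""
    mismatches = {}
    for workload, rpo in rpo_targets.items():
        interval = job_intervals.get(workload)  # None means no job is scheduled
        if interval is None or interval > rpo:
            mismatches[workload] = (rpo, interval)
    return mismatches

# Hypothetical figures for illustration only.
rpo_targets = {
    "sql-prod": timedelta(hours=1),
    "file-share": timedelta(hours=24),
    "erp-app": timedelta(hours=4),
}
job_intervals = {
    "sql-prod": timedelta(hours=4),     # runs every 4 h against a 1 h RPO
    "file-share": timedelta(hours=24),  # meets the target exactly
}

for workload, (rpo, interval) in rpo_mismatches(rpo_targets, job_intervals).items():
    print(f"{workload}: RPO {rpo}, job interval {interval}")
```

    A job interval longer than the RPO means the target cannot be met even when every run succeeds, which is why the comparison is worth doing before looking at job results.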

    Cross-check the Veeam / Datto / Rubrik job list against the CMDB or RMM asset list. Flag any production VM, file share, SQL instance, or M365 tenant not covered by a job. New workloads added since the last cycle are the typical source of unprotected data.
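
    The cross-check is a set difference between two exports. A minimal sketch, assuming both lists are available as plain hostnames (the names below are hypothetical, and how each tool exports its list is left out):

```python
def unprotected_assets(asset_inventory, protected_by_jobs):
    """Return inventory entries with no backup job, compared case-insensitively."""
    protected = {name.strip().lower() for name in protected_by_jobs}
    return sorted(
        name for name in asset_inventory
        if name.strip().lower() not in protected
    )

# Hypothetical exports: one list from the CMDB / RMM, one from the backup console.
asset_inventory = ["DC01", "SQL-PROD-01", "FS01", "APP-NEW-07"]
protected_by_jobs = ["dc01", "sql-prod-01", "fs01"]

print(unprotected_assets(asset_inventory, protected_by_jobs))
```

    Names are normalized before comparison because the two systems rarely agree on case or stray whitespace.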

    Confirm at least 20% headroom on the primary repository and the offsite / immutable copy. Review dedupe and compression ratios for drift; sudden ratio drops usually mean a new workload is writing incompressible data (encrypted volumes, media files) and will blow the capacity plan.
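
    The headroom and ratio checks reduce to simple arithmetic. The sketch below assumes capacity and usage figures are read from the repository by hand or from a report; the 15% drift tolerance is an assumed threshold, not a vendor default.

```python
def headroom_fraction(capacity_tb, used_tb):
    """Fraction of the repository that is still free."""
    if capacity_tb <= 0:
        raise ValueError("capacity must be positive")
    return (capacity_tb - used_tb) / capacity_tb

def ratio_dropped(previous_ratio, current_ratio, tolerance=0.15):
    """True when the data-reduction ratio fell by more than `tolerance`
    (an assumed threshold) relative to the previous cycle."""
    return current_ratio < previous_ratio * (1 - tolerance)

MIN_HEADROOM = 0.20  # the 20% figure from the checklist

# Hypothetical repository figures.
primary = headroom_fraction(capacity_tb=100.0, used_tb=85.0)
print(f"primary headroom {primary:.0%}, ok={primary >= MIN_HEADROOM}")
print("ratio drift:", ratio_dropped(previous_ratio=2.4, current_ratio=1.7))
```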

    Confirm 3-2-1 posture: 3 copies, 2 media types, 1 offsite, with at least one immutable or air-gapped copy (S3 Object Lock, Veeam hardened repo, LTO tape). A backup repository that is writable from production is the single most common reason ransomware encrypts the backups along with everything else.
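
    The 3-2-1 rule can be expressed as four independent checks over a list of copies. This is a sketch with a hypothetical `BackupCopy` record; it counts the production copy as one of the three, which is the common reading of the rule.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BackupCopy:
    location: str    # e.g. "production", "onsite-repo", "cloud-bucket"
    media: str       # e.g. "disk", "object", "tape"
    offsite: bool
    immutable: bool  # immutable or air-gapped

def check_3_2_1(copies):
    """Return a list of gaps; an empty list means the posture holds.
    `copies` is assumed to include the production copy itself."""
    gaps = []
    if len(copies) < 3:
        gaps.append("fewer than 3 copies")
    if len({c.media for c in copies}) < 2:
        gaps.append("fewer than 2 media types")
    if not any(c.offsite for c in copies):
        gaps.append("no offsite copy")
    if not any(c.immutable for c in copies):
        gaps.append("no immutable or air-gapped copy")
    return gaps

# Hypothetical client: three copies, two media types, offsite, but nothing locked.
copies = [
    BackupCopy("production", "disk", offsite=False, immutable=False),
    BackupCopy("onsite-repo", "disk", offsite=False, immutable=False),
    BackupCopy("cloud-bucket", "object", offsite=True, immutable=False),
]
print(check_3_2_1(copies))
```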

    For application-consistent backups that require a brief service quiesce (SQL, Exchange, file servers with VSS), send the change notice through PSA / ITSM 48 hours ahead. Include start time, expected duration, and rollback contact.

Backup Execution

    For scheduled jobs, confirm the run kicked off at the configured time in Veeam B&R / Datto / Commvault. For ad-hoc runs, document the trigger reason in the PSA ticket. Verify VSS writers are healthy on Windows targets before the snapshot phase.

    Watch the job dashboard for warnings: VSS quiesce failures, network throughput drops, target unreachable, credential errors. Most overnight job failures trace back to a service account whose password was rotated without the stored credential in the backup software being updated.

    Native Microsoft and Google retention is not a backup. Confirm the third-party SaaS backup (Datto SaaS Protection, Veeam for M365, AvePoint, Spanning) ran for Exchange Online mailboxes, OneDrive, SharePoint, and Teams chat. New users added since the last run are typically not auto-licensed for protection.

    Verify the secondary copy job to the cloud / offsite repo finished within the WAN window. For Datto SIRIS / Veeam Cloud Connect / AWS S3 with Object Lock, confirm the immutable retention flag is set on the new restore points.
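
    The immutability check amounts to confirming that every new restore point carries a lock that lasts at least as long as the required retention. The sketch below works on plain records with hypothetical field names (`id`, `created`, `retain_until`); pulling those values from a specific product is left out on purpose.

```python
from datetime import datetime, timedelta, timezone

def unlocked_restore_points(restore_points, required_retention):
    """Return IDs of restore points whose immutability lock is missing
    or expires before the required retention period ends."""
    failing = []
    for point in restore_points:
        retain_until = point.get("retain_until")  # None means no lock was set
        must_hold_until = point["created"] + required_retention
        if retain_until is None or retain_until < must_hold_until:
            failing.append(point["id"])
    return failing

# Hypothetical restore points from last night's copy job.
created = datetime(2024, 5, 1, 2, 0, tzinfo=timezone.utc)
restore_points = [
    {"id": "rp-001", "created": created, "retain_until": created + timedelta(days=30)},
    {"id": "rp-002", "created": created, "retain_until": created + timedelta(days=7)},
    {"id": "rp-003", "created": created, "retain_until": None},
]
print(unlocked_restore_points(restore_points, timedelta(days=30)))
```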

Failure Triage

    Create the incident in ConnectWise / Autotask / ServiceNow with the failed VM list and error codes. Tag the affected client and assign per the on-call schedule. SLA clock starts at job-failure detection, not at ticket creation.
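
    The difference between the two clock starts is easy to see with a worked example. The timestamps and the 8-hour window below are hypothetical; the actual SLA term comes from the client agreement.

```python
from datetime import datetime, timedelta

def sla_deadline(detected_at, response_window):
    """The SLA clock starts when the failure was detected."""
    return detected_at + response_window

# Hypothetical timeline: job failed overnight, ticket opened in the morning.
detected_at = datetime(2024, 6, 12, 2, 10)
ticket_created_at = datetime(2024, 6, 12, 7, 45)
response_window = timedelta(hours=8)  # assumed SLA term

deadline = sla_deadline(detected_at, response_window)
lost_to_delay = ticket_created_at - detected_at
print("deadline:", deadline.isoformat())
print("time consumed before the ticket existed:", lost_to_delay)
```

    In this example more than half of the window is gone before anyone is assigned, which is the argument for automated failure alerts that open the ticket directly.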

    Common fixes: update the cached service account password on the backup proxy, clear stale VSS shadow copies, expand a tight repository, restart a hung backup agent. Rerun and confirm the restore point lands before the next scheduled cycle.

Restore Verification Drill

    Rotate test scope each cycle: a file-level restore one month, a full VM Instant Recovery the next, a SQL point-in-time restore the next. Backups that report green for 18 months and then fail on the first real restore are the canonical disaster scenario; rotation is the discipline that prevents it.
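
    The rotation is a simple modulo over the list of drill types. A minimal sketch, assuming cycles are numbered from 1:

```python
DRILL_ROTATION = [
    "file-level restore",
    "full VM Instant Recovery",
    "SQL point-in-time restore",
]

def drill_for_cycle(cycle_number):
    """Pick the drill type for a given cycle; cycle numbers start at 1."""
    if cycle_number < 1:
        raise ValueError("cycle numbers start at 1")
    return DRILL_ROTATION[(cycle_number - 1) % len(DRILL_ROTATION)]

for cycle in range(1, 5):
    print(cycle, drill_for_cycle(cycle))
```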

    Mount the restore in a fenced VLAN or Veeam SureBackup virtual lab, never in production. Restoring a domain controller into the live domain can trigger USN rollback, which has caused multiple all-hands outages.

    Boot the restored VM, log in, run application smoke tests (SQL DBCC CHECKDB, Exchange mailbox open, file checksum spot check). For databases, confirm the recovery model and last LSN match expectations.
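
    The file checksum spot check can be sketched with the standard library. This assumes both the reference copy and the restored copy are reachable as directory paths; the function names are illustrative. Files that changed on the source after the backup ran will show up as mismatches, so compare against a snapshot or a known-static folder where possible.

```python
import hashlib
import random
from pathlib import Path

def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def spot_check(source_root, restored_root, sample_size=20, seed=None):
    """Compare a random sample of files; return relative paths that are
    missing from the restore or whose contents differ."""
    source_root, restored_root = Path(source_root), Path(restored_root)
    files = sorted(
        p.relative_to(source_root) for p in source_root.rglob("*") if p.is_file()
    )
    sample = random.Random(seed).sample(files, min(sample_size, len(files)))
    mismatches = []
    for rel in sample:
        restored = restored_root / rel
        if not restored.is_file() or sha256_of(source_root / rel) != sha256_of(restored):
            mismatches.append(str(rel))
    return sorted(mismatches)

# Example call (paths are hypothetical):
# print(spot_check("/mnt/reference/finance", "/mnt/restore-lab/finance"))
```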

    Power off and delete the test VMs from the isolated lab. Leaving restored production data sitting on the recovery network is a quiet data-residency and access-control violation that auditors find on the next walkthrough.

Documentation and Reporting

    Record the restore drill date, scope, RTO measured, and any remediation in the client's documentation platform. The vCIO will pull from this record for the QBR; auditors will pull from it for SOC 2 evidence.
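
    A drill record with the measured RTO can be assembled as below. The field names and the sample client are hypothetical; the point is that the measured RTO is the elapsed time from drill start to confirmed service restoration, compared against the target.

```python
import json
from datetime import datetime, timedelta

def drill_record(client, scope, started, service_restored, rto_target, remediation=""):
    """Build a plain dictionary suitable for pasting or exporting as evidence."""
    measured = service_restored - started
    return {
        "client": client,
        "drill_date": started.date().isoformat(),
        "scope": scope,
        "rto_measured_minutes": round(measured.total_seconds() / 60),
        "rto_target_minutes": round(rto_target.total_seconds() / 60),
        "rto_met": measured <= rto_target,
        "remediation": remediation,
    }

# Hypothetical drill for illustration.
record = drill_record(
    client="Example Client",
    scope="full VM Instant Recovery",
    started=datetime(2024, 6, 12, 9, 0),
    service_restored=datetime(2024, 6, 12, 11, 30),
    rto_target=timedelta(hours=4),
)
print(json.dumps(record, indent=2))
```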

    Export the job log and restore drill record into the GRC tool (Vanta, Drata, Tugboat) for the backup and BCP control families (CC9.1, CC7.5). Missing evidence at audit time, not failed backups, is the typical SOC 2 finding.
