Backup and Restore Checklist
Operational runbook for sysadmins and MSP technicians to execute scheduled backups, verify recoverability through restore drills, and maintain a 3-2-1 ransomware-resilient backup posture across servers, endpoints, and SaaS.
Pre-Backup Preparation
-
Confirm RPO and RTO targets
Pull the current RPO/RTO targets from the BCP document or MSA. Note any client tier or regulatory drivers — HIPAA, SOC 2, PCI DSS — that dictate retention or immutability requirements. Mismatched targets between the BCP and the actual backup job schedule are the most common audit finding.
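The comparison between documented targets and actual schedules can be sketched in a few lines. This is a minimal illustration, not a vendor integration: the system names, the `rpo_targets` and `job_intervals` dictionaries, and the `rpo_gaps` helper are all hypothetical, and it assumes you have already exported the RPO targets and job intervals by hand or from your backup console.

```python
from datetime import timedelta

def rpo_gaps(rpo_targets, job_intervals):
    """Return systems whose backup interval exceeds the RPO, or that have no job."""
    gaps = {}
    for system, rpo in rpo_targets.items():
        interval = job_intervals.get(system)
        if interval is None:
            gaps[system] = "no backup job scheduled"
        elif interval > rpo:
            gaps[system] = f"interval {interval} exceeds RPO {rpo}"
    return gaps

# Hypothetical sample data: RPO from the BCP, interval from the job schedule.
rpo_targets = {
    "sql-prod-01": timedelta(hours=1),
    "fs-01": timedelta(hours=24),
}
job_intervals = {
    "sql-prod-01": timedelta(hours=4),   # runs every 4 hours, RPO is 1 hour
    "fs-01": timedelta(hours=24),
}

for system, reason in rpo_gaps(rpo_targets, job_intervals).items():
    print(f"{system}: {reason}")
```

A backup interval longer than the RPO means the worst-case data loss already exceeds what the BCP promises, even when every job succeeds.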
-
Reconcile the protected-systems inventory
Cross-check the Veeam / Datto / Rubrik job list against the CMDB or RMM asset list. Flag any production VM, file share, SQL instance, or M365 tenant not covered by a job. New workloads added since the last cycle are the typical source of unprotected data.
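The reconciliation is essentially a set difference between the asset inventory and the protected-object list. The sketch below assumes both lists have been exported as plain names (for example from a CSV); the `unprotected_assets` helper and the sample hostnames are hypothetical, and real exports may need more normalization than case and whitespace.

```python
def normalize(name):
    """Compare names case-insensitively and ignore stray whitespace."""
    return name.strip().lower()

def unprotected_assets(inventory, protected):
    """Return inventory entries that no backup job covers, sorted by name."""
    covered = {normalize(name) for name in protected}
    return sorted(name for name in inventory if normalize(name) not in covered)

# Hypothetical exports: CMDB/RMM asset list vs. backup job object list.
inventory = ["SQL-PROD-01", "fs-01", "web-03", "m365-tenant-contoso"]
protected = ["sql-prod-01", "FS-01 ", "m365-tenant-contoso"]

print(unprotected_assets(inventory, protected))
```

Running the same comparison in the other direction (protected objects missing from the inventory) can also surface decommissioned systems that still consume repository space.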
-
Check backup repository capacity
Confirm at least 20% headroom on the primary repository and the offsite / immutable copy. Review dedupe and compression ratios for drift; sudden ratio drops usually mean a new workload is writing incompressible data (encrypted volumes, media files) and will blow the capacity plan.
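As a rough illustration of the headroom check, the function below computes the free fraction of a repository and compares it with the 20% threshold. The `headroom_ok` name and the capacity figures are hypothetical; capacity and usage would come from your repository's own reporting.

```python
def free_fraction(capacity_tb, used_tb):
    """Fraction of the repository that is still free (0.0 to 1.0)."""
    if capacity_tb <= 0:
        raise ValueError("capacity must be positive")
    return (capacity_tb - used_tb) / capacity_tb

def headroom_ok(capacity_tb, used_tb, min_headroom=0.20):
    """True when at least min_headroom of the repository is free."""
    return free_fraction(capacity_tb, used_tb) >= min_headroom

# Hypothetical repositories: (capacity TB, used TB).
repositories = {"primary": (100, 72), "offsite-immutable": (100, 85)}

for name, (capacity, used) in repositories.items():
    status = "OK" if headroom_ok(capacity, used) else "BELOW 20% HEADROOM"
    print(f"{name}: {free_fraction(capacity, used):.0%} free, {status}")
```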
-
Validate immutability and air-gap configuration
Confirm 3-2-1 posture: 3 copies, 2 media types, 1 offsite, with at least one immutable or air-gapped copy (S3 Object Lock, Veeam hardened repo, LTO tape). A backup repository that production systems can write to or delete from is one of the most common reasons ransomware encrypts the backups along with everything else.
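The 3-2-1 rule can be expressed as four simple checks over a list of copies. The sketch below is illustrative only: the `BackupCopy` record and `check_321` helper are hypothetical, and it assumes the production data counts as one of the three copies, which is the usual reading of the rule.

```python
from dataclasses import dataclass

@dataclass
class BackupCopy:
    location: str     # free-text label for the copy
    media: str        # e.g. "disk", "object-storage", "tape"
    offsite: bool     # stored outside the production site
    immutable: bool   # immutable or air-gapped

def check_321(copies):
    """Return a list of 3-2-1 violations; an empty list means the posture holds."""
    problems = []
    if len(copies) < 3:
        problems.append("fewer than 3 copies")
    if len({c.media for c in copies}) < 2:
        problems.append("fewer than 2 media types")
    if not any(c.offsite for c in copies):
        problems.append("no offsite copy")
    if not any(c.immutable for c in copies):
        problems.append("no immutable or air-gapped copy")
    return problems

# Hypothetical layout: production data, local repository, cloud copy with Object Lock.
copies = [
    BackupCopy("production", "disk", offsite=False, immutable=False),
    BackupCopy("local repository", "disk", offsite=False, immutable=False),
    BackupCopy("S3 bucket with Object Lock", "object-storage", offsite=True, immutable=True),
]

print(check_321(copies) or "3-2-1 posture holds")
```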
-
Notify stakeholders of the maintenance window
For application-consistent backups requiring brief service quiesce (SQL, Exchange, file servers with VSS), send the change notice through PSA / ITSM 48 hours ahead. Include start time, expected duration, and rollback contact.
Backup Execution
-
Trigger or verify the scheduled backup job
For scheduled jobs, confirm the run kicked off at the configured time in Veeam B&R / Datto / Commvault. For ad-hoc runs, document the trigger reason in the PSA ticket. Verify VSS writers are healthy on Windows targets before the snapshot phase.
-
Monitor job progress for errors
Watch the job dashboard for warnings: VSS quiesce failures, network throughput drops, target unreachable, credential errors. Many overnight job failures trace back to a service account whose password was rotated without the stored credential in the backup software being updated.
-
Confirm SaaS backup coverage (M365, Google Workspace)
Native Microsoft and Google retention is not a backup. Confirm the third-party SaaS backup (Datto SaaS Protection, Veeam for M365, AvePoint, Spanning) ran for Exchange Online mailboxes, OneDrive, SharePoint, and Teams chat. New users added since last run are typically not auto-licensed for protection.
-
Capture job completion status
-
Confirm offsite replication completion
Verify the secondary copy job to the cloud / offsite repo finished within the WAN window. For Datto SIRIS / Veeam Cloud Connect / AWS S3 with Object Lock, confirm the immutable retention flag is set on the new restore points.
Failure Triage
-
Open a P2 ticket and identify the failed objects
Create the incident in ConnectWise / Autotask / ServiceNow with the failed VM list and error codes. Tag the affected client and assign per the on-call schedule. SLA clock starts at job-failure detection, not at ticket creation.
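To make the SLA rule concrete, the sketch below measures the deadline from the detection timestamp and reports how much of the window was spent before the ticket existed. The `sla_status` helper, the timestamps, and the 4-hour P2 window are hypothetical; substitute the window defined in your own MSA.

```python
from datetime import datetime, timedelta

def sla_status(detected_at, ticket_created_at, now, sla_window):
    """Measure the SLA from failure detection, not from ticket creation."""
    deadline = detected_at + sla_window
    return {
        "deadline": deadline,
        "lost_before_ticket": ticket_created_at - detected_at,
        "remaining": deadline - now,
        "breached": now > deadline,
    }

# Hypothetical timeline: failure detected at 02:10, ticket opened at 03:00.
status = sla_status(
    detected_at=datetime(2024, 5, 14, 2, 10),
    ticket_created_at=datetime(2024, 5, 14, 3, 0),
    now=datetime(2024, 5, 14, 5, 0),
    sla_window=timedelta(hours=4),  # assumed P2 window, not from the source
)

for key, value in status.items():
    print(f"{key}: {value}")
```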
-
Remediate and rerun the failed job
Common fixes: update the cached service account password on the backup proxy, clear stale VSS shadow copies, expand a repository that is low on space, restart a hung backup agent. Rerun and confirm the restore point lands before the next scheduled cycle.
Restore Verification Drill
-
Select the restore test scope
Rotate test scope each cycle: a file-level restore one month, a full VM Instant Recovery the next, a SQL point-in-time restore the next. Backups that report green for 18 months followed by a first restore that fails is the canonical disaster scenario; rotation is the discipline that prevents it.
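One way to keep the rotation deterministic is to derive the scope from the calendar month, so nobody has to remember what was tested last time. This is a minimal sketch assuming a monthly cycle; the `scope_for` helper is hypothetical and the scope list simply mirrors the three examples above.

```python
from datetime import date

SCOPES = [
    "file-level restore",
    "full VM Instant Recovery",
    "SQL point-in-time restore",
]

def scope_for(cycle_date):
    """Pick the drill scope for a given month, cycling through SCOPES in order."""
    # Count months continuously so the rotation does not reset each January.
    month_index = cycle_date.year * 12 + (cycle_date.month - 1)
    return SCOPES[month_index % len(SCOPES)]

# Print the schedule for three consecutive months.
for month in (1, 2, 3):
    cycle = date(2024, month, 1)
    print(cycle.strftime("%Y-%m"), "->", scope_for(cycle))
```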
-
Restore into the isolated recovery network
Mount the restore in a fenced VLAN or Veeam SureBackup virtual lab, never in production. Restoring a domain controller from an older restore point into the live domain can cause USN rollback, which breaks Active Directory replication and has caused multiple all-hands outages.
-
Validate restored data integrity
Boot the restored VM, log in, run application smoke tests (SQL DBCC CHECKDB, Exchange mailbox open, file checksum spot check). For databases, confirm the recovery model and last LSN match expectations.
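The file checksum spot check can be sketched as below. It hashes a random sample of files under a reference directory and compares each against the same relative path in the restored tree. The `spot_check` helper and its parameters are hypothetical, and the approach assumes the reference files have not changed since the restore point was taken; where that does not hold, compare against a checksum manifest captured at backup time instead.

```python
import hashlib
import random
from pathlib import Path

def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def spot_check(reference_root, restored_root, sample_size=20, seed=None):
    """Return relative paths that are missing or differ in the restored tree."""
    reference_root = Path(reference_root)
    restored_root = Path(restored_root)
    files = sorted(p for p in reference_root.rglob("*") if p.is_file())
    sample = random.Random(seed).sample(files, min(sample_size, len(files)))
    mismatches = []
    for src in sample:
        relative = src.relative_to(reference_root)
        restored = restored_root / relative
        if not restored.is_file() or sha256_of(src) != sha256_of(restored):
            mismatches.append(str(relative))
    return sorted(mismatches)
```

A call such as `spot_check("/mnt/reference", "/mnt/restored", sample_size=50)` returns an empty list when every sampled file matches; both paths here are placeholders.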
-
Tear down the recovery environment
Power off and delete the test VMs from the isolated lab. Leaving restored production data sitting on the recovery network is a quiet data-residency and access-control violation that auditors find on the next walkthrough.
Documentation and Reporting
-
Update IT Glue or Hudu documentation
Record the restore drill date, scope, measured RTO, and any remediation in the client's documentation platform. The vCIO will pull from this for the QBR; auditors will pull from this for SOC 2 evidence.
-
File the SOC 2 / HIPAA evidence artifact
Export the job log and restore drill record into the GRC tool (Vanta, Drata, Tugboat) for the backup and BCP control families (CC9.1, CC7.5). Missing evidence at audit time, not failed backups, is the typical SOC 2 finding.
-
Sign off on the cycle