Backup and Restore Checklist

Operational runbook for sysadmins and MSP technicians to execute scheduled backups, verify recoverability through restore drills, and maintain a 3-2-1 ransomware-resilient backup posture across servers, endpoints, and SaaS.

5 sections, 19 steps, collects data
1

Pre-Backup Preparation

  1. Confirm RPO and RTO targets
    • Pull the current RPO/RTO targets from the BCP document or MSA. Note any client tier or regulatory drivers (HIPAA, SOC 2, PCI DSS) that dictate retention or immutability requirements. A mismatch between the targets in the BCP and the actual backup job schedule is a common audit finding, so compare the two directly rather than assuming they agree.
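
The comparison between documented targets and job schedules can be sketched in a few lines. This is a minimal illustration, not a vendor integration: the function name `find_rpo_violations`, the system names, and the intervals are all hypothetical, and it assumes the RPO targets and job intervals have already been exported into plain dictionaries.

```python
from datetime import timedelta


def find_rpo_violations(rpo_targets, job_intervals):
    """Return {system: reason} for systems whose backup interval cannot meet the RPO.

    rpo_targets:   {system_name: timedelta}  maximum tolerable data loss
    job_intervals: {system_name: timedelta}  time between scheduled backup runs
    """
    violations = {}
    for system, rpo in rpo_targets.items():
        interval = job_intervals.get(system)
        if interval is None:
            violations[system] = "no backup job scheduled"
        elif interval > rpo:
            violations[system] = f"interval {interval} exceeds RPO {rpo}"
    return violations


# Hypothetical example data
rpo_targets = {
    "sql-prod-01": timedelta(hours=1),
    "fileserver-01": timedelta(hours=24),
}
job_intervals = {
    "sql-prod-01": timedelta(hours=4),
    "fileserver-01": timedelta(hours=24),
}

for system, reason in find_rpo_violations(rpo_targets, job_intervals).items():
    print(f"{system}: {reason}")
```

A job that runs less often than the RPO allows can never meet the target, no matter how reliably it completes.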

  2. Reconcile the protected-systems inventory
    • Cross-check the Veeam / Datto / Rubrik job list against the CMDB or RMM asset list. Flag any production VM, file share, SQL instance, or M365 tenant not covered by a job. New workloads added since the last cycle are the typical source of unprotected data.

    Collects list
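
The reconciliation in this step is essentially a set difference. The sketch below assumes both lists have been exported as plain name lists; `find_unprotected` and the asset names are hypothetical, and real exports would likely need more normalization (FQDN versus short name, for instance).

```python
def find_unprotected(assets, protected):
    """Return asset names that appear in the inventory but in no backup job.

    Matching is case-insensitive and ignores surrounding whitespace.
    """
    protected_norm = {name.strip().lower() for name in protected}
    return sorted(a for a in assets if a.strip().lower() not in protected_norm)


# Hypothetical exports from the CMDB/RMM and from the backup console
cmdb_assets = ["SQL-PROD-01", "fileserver-01", "web-02", "m365-tenant-contoso"]
backup_jobs = ["sql-prod-01", "FILESERVER-01 "]

print(find_unprotected(cmdb_assets, backup_jobs))
```
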
  3. Check backup repository capacity
    • Confirm at least 20% headroom on the primary repository and the offsite / immutable copy. Review dedupe and compression ratios for drift; sudden ratio drops usually mean a new workload is writing incompressible data (encrypted volumes, media files) and will blow the capacity plan.
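
The 20% headroom rule is simple arithmetic: free space divided by total capacity must be at least 0.20. A minimal sketch, with the hypothetical helper `has_headroom` and made-up capacity figures:

```python
def has_headroom(capacity_bytes, used_bytes, min_free_fraction=0.20):
    """Return True if the repository has at least min_free_fraction free space."""
    if capacity_bytes <= 0:
        raise ValueError("capacity_bytes must be positive")
    free_fraction = (capacity_bytes - used_bytes) / capacity_bytes
    return free_fraction >= min_free_fraction


TIB = 1024 ** 4

# Hypothetical repositories: (name, capacity, used)
repos = [
    ("primary", 40 * TIB, 30 * TIB),   # 25% free
    ("offsite", 40 * TIB, 34 * TIB),   # 15% free
]
for name, capacity, used in repos:
    status = "ok" if has_headroom(capacity, used) else "BELOW 20% HEADROOM"
    print(f"{name}: {status}")
```

For a locally mounted repository, `shutil.disk_usage(path)` returns the total and used bytes that could feed this function; appliance and cloud repositories would need figures from the vendor console or API.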

  4. Validate immutability and air-gap configuration
    • Confirm 3-2-1 posture: 3 copies, 2 media types, 1 offsite, with at least one immutable or air-gapped copy (S3 Object Lock, Veeam hardened repo, LTO tape). A backup repository that is writable from production is the single most common reason ransomware encrypts the backups along with everything else.
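
The 3-2-1 rule plus the immutability requirement can be expressed as four independent checks. The sketch below is illustrative only: `BackupCopy`, `check_321`, and the sample copies are hypothetical, and it assumes the production copy is counted as one of the three copies, which is the usual reading of the rule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupCopy:
    name: str
    media_type: str   # e.g. "disk", "object-storage", "tape"
    offsite: bool
    immutable: bool   # immutable or air-gapped


def check_321(copies):
    """Return a list of problems; an empty list means the posture passes."""
    problems = []
    if len(copies) < 3:
        problems.append("fewer than 3 copies")
    if len({c.media_type for c in copies}) < 2:
        problems.append("fewer than 2 media types")
    if not any(c.offsite for c in copies):
        problems.append("no offsite copy")
    if not any(c.immutable for c in copies):
        problems.append("no immutable or air-gapped copy")
    return problems


# Hypothetical layout: production data, a local repo, and a locked S3 bucket
copies = [
    BackupCopy("production", "disk", offsite=False, immutable=False),
    BackupCopy("local-repo", "disk", offsite=False, immutable=False),
    BackupCopy("s3-object-lock", "object-storage", offsite=True, immutable=True),
]
print(check_321(copies) or "3-2-1 posture OK")
```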

  5. Notify stakeholders of the maintenance window
    • For application-consistent backups requiring brief service quiesce (SQL, Exchange, file servers with VSS), send the change notice through PSA / ITSM 48 hours ahead. Include start time, expected duration, and rollback contact.

2

Backup Execution

  1. Trigger or verify the scheduled backup job
    • For scheduled jobs, confirm the run kicked off at the configured time in Veeam B&R / Datto / Commvault. For ad-hoc runs, document the trigger reason in the PSA ticket. Verify VSS writers are healthy on Windows targets before the snapshot phase.

  2. Monitor job progress for errors
    • Watch the job dashboard for warnings: VSS quiesce failures, network throughput drops, target unreachable, credential errors. Many overnight job failures trace back to a service account whose password was rotated without the stored credential in the backup software being updated.

  3. Confirm SaaS backup coverage (M365, Google Workspace)
    • Native Microsoft and Google retention is not a backup. Confirm the third-party SaaS backup (Datto SaaS Protection, Veeam for M365, AvePoint, Spanning) ran for Exchange Online mailboxes, OneDrive, SharePoint, and Teams chat. Users added since the last run are often not automatically licensed for protection, so check new accounts explicitly.

  4. Capture job completion status
    Collects list Collects paragraph Collects file
  5. Confirm offsite replication completion
    • Verify the secondary copy job to the cloud / offsite repo finished within the WAN window. For Datto SIRIS / Veeam Cloud Connect / AWS S3 with Object Lock, confirm the immutable retention flag is set on the new restore points.
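
For the S3 Object Lock case, the retention check can be sketched as a pure function over a retention record. The dictionary shape used here (`Retention` containing `Mode` and `RetainUntilDate`) is assumed to mirror the response of boto3's `get_object_retention` call; the function name `retention_ok`, the 30-day minimum, and the sample values are hypothetical. Fetching the record from AWS is deliberately left out.

```python
from datetime import datetime, timedelta, timezone


def retention_ok(response, min_days, now=None, allowed_modes=("COMPLIANCE",)):
    """Return True if the object is locked in an allowed mode for at least min_days.

    response: dict assumed to look like
        {"Retention": {"Mode": "COMPLIANCE", "RetainUntilDate": <aware datetime>}}
    """
    now = now or datetime.now(timezone.utc)
    retention = response.get("Retention") or {}
    mode = retention.get("Mode")
    until = retention.get("RetainUntilDate")
    if mode not in allowed_modes or until is None:
        return False
    return until >= now + timedelta(days=min_days)


# Hypothetical retention record for a new restore point
sample = {
    "Retention": {
        "Mode": "COMPLIANCE",
        "RetainUntilDate": datetime.now(timezone.utc) + timedelta(days=45),
    }
}
print(retention_ok(sample, min_days=30))
```

The default accepts only compliance mode because governance-mode locks can be bypassed by principals holding the bypass permission; pass `allowed_modes=("COMPLIANCE", "GOVERNANCE")` if governance mode is acceptable under the client's policy.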

3

Failure Triage

  1. Open a P2 ticket and identify the failed objects
    • Create the incident in ConnectWise / Autotask / ServiceNow with the failed VM list and error codes. Tag the affected client and assign per the on-call schedule. SLA clock starts at job-failure detection, not at ticket creation.
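
Because the SLA clock starts at detection, any delay in opening the ticket comes out of the response window. A small sketch of that arithmetic follows; the 4-hour P2 window, the timestamps, and the function names are assumptions for illustration, not values from any particular MSA.

```python
from datetime import datetime, timedelta, timezone


def sla_deadline(detected_at, response_window):
    """SLA deadline measured from job-failure detection, not ticket creation."""
    return detected_at + response_window


def remaining_at(moment, detected_at, response_window):
    """Time left on the SLA at a given moment (negative means breached)."""
    return sla_deadline(detected_at, response_window) - moment


# Hypothetical incident: failure detected at 02:00 UTC, ticket opened at 02:45 UTC
detected = datetime(2024, 1, 1, 2, 0, tzinfo=timezone.utc)
ticket_created = datetime(2024, 1, 1, 2, 45, tzinfo=timezone.utc)
p2_window = timedelta(hours=4)  # assumed P2 window

print("Deadline:", sla_deadline(detected, p2_window).isoformat())
print("Remaining when ticket was opened:", remaining_at(ticket_created, detected, p2_window))
```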

  2. Remediate and rerun the failed job
    • Common fixes: update the cached service account credential on the backup proxy, clear stale VSS shadow copies, free up or expand a nearly full repository, restart a hung backup agent. Rerun and confirm the restore point lands before the next scheduled cycle.

    Collects list
4

Restore Verification Drill

  1. Select the restore test scope
    • Rotate test scope each cycle: a file-level restore one month, a full VM Instant Recovery the next, a SQL point-in-time restore the next. Backups that report green for 18 months and then fail on the first real restore are the canonical disaster scenario; rotation is the discipline that prevents it.
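
The rotation can be made deterministic by deriving the scope from the cycle number, so nobody has to remember what was tested last time. A minimal sketch; `DRILL_SCOPES` and `scope_for_cycle` are hypothetical names, and the three scopes are the ones listed in this step.

```python
DRILL_SCOPES = (
    "file-level restore",
    "full VM Instant Recovery",
    "SQL point-in-time restore",
)


def scope_for_cycle(cycle_number):
    """Pick the restore test scope for a given cycle (0-based), wrapping around."""
    return DRILL_SCOPES[cycle_number % len(DRILL_SCOPES)]


for cycle in range(4):
    print(cycle, scope_for_cycle(cycle))
```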

  2. Restore into the isolated recovery network
    • Mount the restore in a fenced VLAN or Veeam SureBackup virtual lab, never in production. Restoring a domain controller into the live domain can trigger USN rollback, which has caused multiple all-hands outages.

  3. Validate restored data integrity
    • Boot the restored VM, log in, run application smoke tests (SQL DBCC CHECKDB, Exchange mailbox open, file checksum spot check). For databases, confirm the recovery model and last LSN match expectations.

    Collects list Collects number Collects paragraph
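
The file checksum spot check can be sketched as follows. It assumes a manifest of SHA-256 hashes was recorded from the source files before the drill; the function names and the manifest format are hypothetical, and the restore path in the usage comment is a placeholder.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a file in chunks so large restores do not need to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def spot_check(manifest, restore_root):
    """Return relative paths that are missing or whose hash differs.

    manifest: {relative_path: expected_sha256_hex}
    """
    failures = []
    for rel_path, expected in manifest.items():
        target = Path(restore_root) / rel_path
        if not target.is_file() or sha256_of(target) != expected:
            failures.append(rel_path)
    return failures


# Usage (placeholder path and manifest):
# failures = spot_check({"finance/ledger.xlsx": "<expected hash>"}, "/mnt/restore-test")
```

An empty result means every sampled file was restored byte-for-byte; any entry in the list is worth recording in the drill notes.
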
  4. Tear down the recovery environment
    • Power off and delete the test VMs from the isolated lab. Leaving restored production data sitting on the recovery network is a quiet data-residency and access-control violation that auditors find on the next walkthrough.

5

Documentation and Reporting

  1. Update IT Glue or Hudu documentation
    • Record the restore drill date, scope, measured RTO, and any remediation in the client's documentation platform. The vCIO pulls from this record for the QBR; auditors pull from it for SOC 2 evidence.

  2. File the SOC 2 / HIPAA evidence artifact
    • Export the job log and restore drill record into the GRC tool (Vanta, Drata, Tugboat Logic) for the backup and BCP criteria (CC9.1, CC7.5). Missing evidence at audit time, not failed backups, is the typical SOC 2 finding.

    Collects file
  3. Sign off on the cycle
    Collects text Collects signature

Category Systems Administration