Data Backup and Recovery Checklist
Scope and Recovery Objectives
List every system whose loss would stop production: ERP (NetSuite, Epicor Kinetic, Dynamics 365 BC), MES, PLM/CAD vault (SolidWorks PDM, Windchill), CMMS, QMS, label printers, and the CNC and PLC program repositories on the floor. A common gap is the standalone PC at the machine that holds the only copy of a tested G-code program.
ERP and MES typically need RTO under 4 hours and RPO under 1 hour to avoid a full shift of lost production. CAD vault and CMMS can usually accept 24-hour RPO. Document the targets so the drill later in this run has something concrete to measure against.
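One way to document the targets is as data that the later drill can read directly. The sketch below is a minimal illustration, not a prescribed format: the system keys and the `missing_targets` helper are hypothetical names, and the values are the upper bounds stated above. No RTO is stated above for the CAD vault or CMMS, so those are left as `None` rather than guessed.

```python
from datetime import timedelta

# Upper bounds taken from the paragraph above; system keys are illustrative.
# None marks an objective that has not been documented yet.
RECOVERY_TARGETS = {
    "erp": {"rto": timedelta(hours=4), "rpo": timedelta(hours=1)},
    "mes": {"rto": timedelta(hours=4), "rpo": timedelta(hours=1)},
    "cad_vault": {"rto": None, "rpo": timedelta(hours=24)},
    "cmms": {"rto": None, "rpo": timedelta(hours=24)},
}


def missing_targets(targets):
    """Return sorted (system, objective) pairs that have no documented value."""
    return sorted(
        (system, name)
        for system, objectives in targets.items()
        for name, value in objectives.items()
        if value is None
    )
```

Calling `missing_targets(RECOVERY_TARGETS)` lists the objectives that still need an owner to set them before the drill.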
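If any drawings, models, or specs in the PLM vault are ITAR- or EAR-controlled, backup copies inherit the same access restrictions. Restrict cloud backup targets to US-person access. Replicating unencrypted controlled data outside the US is an export. Encrypted copies fall under the carve-outs in 22 CFR 120.54 and 15 CFR 734.18 only when specific conditions are met (end-to-end encryption, FIPS 140-2 compliant modules, no storage in restricted countries), so confirm with the empowered official before relying on encryption alone.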
AS9100 quality records, ISO 13485 device history and batch records (including electronic records governed by 21 CFR Part 11), and customer PPAP submissions carry retention obligations of 7-15 years depending on contract. Verify that the backup retention policy meets the longest applicable requirement.
Backup Configuration
Follow the 3-2-1 rule: three copies of data, on two different media, with one off-site. Confirm the off-site copy is air-gapped or immutable; a writable cloud share that ransomware can also encrypt does not count.
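The check above can be expressed as a small validation over an inventory of copies. This is a sketch under stated assumptions: `BackupCopy` and `check_3_2_1` are hypothetical names, the inventory is maintained by hand, and the production copy is counted as one of the three.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BackupCopy:
    name: str
    media: str                    # e.g. "disk", "tape", "object-storage"
    offsite: bool
    immutable_or_airgapped: bool


def check_3_2_1(copies):
    """Return a list of rule violations; an empty list means the rule is met."""
    problems = []
    if len(copies) < 3:
        problems.append("fewer than three copies")
    if len({copy.media for copy in copies}) < 2:
        problems.append("fewer than two media types")
    # A writable off-site share does not count toward the off-site requirement.
    if not any(c.offsite and c.immutable_or_airgapped for c in copies):
        problems.append("no off-site copy that is air-gapped or immutable")
    return problems
```

A writable cloud share marked `offsite=True, immutable_or_airgapped=False` will still produce the third violation, which mirrors the caveat in the checklist item.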
Confirm the ERP nightly full plus transaction-log backups are aligned with the RPO defined earlier. For NetSuite or other SaaS ERPs, validate that the third-party backup connector ran successfully — the vendor's native export is not a backup.
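Alignment with the RPO can be checked by looking for gaps between consecutive backup completion times. The sketch below assumes the timestamps have already been exported from the backup tool; `rpo_gaps` is a hypothetical helper and not part of any vendor API.

```python
from datetime import datetime, timedelta


def rpo_gaps(backup_times, rpo, now):
    """Return (start, end) intervals longer than rpo.

    Intervals are measured between consecutive backups, and between the
    most recent backup and `now`.
    """
    if not backup_times:
        raise ValueError("no backups recorded; RPO cannot be assessed")
    points = sorted(backup_times) + [now]
    return [(a, b) for a, b in zip(points, points[1:]) if b - a > rpo]
```

Any interval returned is a window in which a failure would have lost more data than the RPO allows.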
Confirm that SolidWorks PDM or Windchill vault snapshots run on the schedule set by IT. CNC controllers need a separate sweep; most shops use a DNC tool or a scripted SMB pull from the controller after every program edit. Without this, a crashed Fanuc control means re-proving every program from scratch.
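A scripted pull can be as simple as copying any new or changed program into an archive folder. The sketch below assumes the controller's program directory is already reachable as a mounted path (for example an SMB share); `sweep_programs` and the naming scheme are illustrative. Each version is stored under a name that includes part of its SHA-256 digest, so an edited program never overwrites the previously proven copy.

```python
import hashlib
import shutil
from pathlib import Path


def sha256_of(path):
    """Hash a file in chunks so large programs do not load into memory at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def sweep_programs(source_dir, dest_dir):
    """Copy every file version not yet archived; return the archive names created."""
    source_dir, dest_dir = Path(source_dir), Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for src in sorted(p for p in source_dir.iterdir() if p.is_file()):
        # Archive name = original name + first 12 hex chars of the content hash.
        dst = dest_dir / f"{src.name}.{sha256_of(src)[:12]}"
        if not dst.exists():
            shutil.copy2(src, dst)
            copied.append(dst.name)
    return copied
```

Running the sweep twice without edits copies nothing the second time, so it is safe to schedule frequently.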
Confirm backup volumes are encrypted at rest with AES-256 and that the keys are stored in the IT password vault, not on the backup server itself. Customer cybersecurity questionnaires (CMMC, TISAX, NIST SP 800-171) ask for this specifically.
Restrict backup target ACLs to US-person accounts only and confirm the cloud region is GovCloud or equivalent. Document the export classification and the personnel list as part of the empowered official's records.
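The ACL review reduces to comparing the accounts on the backup target with the documented personnel list. A minimal sketch, assuming both lists have been exported as plain account names; `unauthorized_accounts` is a hypothetical helper, and the comparison ignores letter case on the assumption that account names are case-insensitive in the directory.

```python
def unauthorized_accounts(acl_accounts, approved_accounts):
    """Return ACL entries that do not appear on the approved personnel list."""
    approved = {name.casefold() for name in approved_accounts}
    return sorted({name for name in acl_accounts if name.casefold() not in approved})
```

An empty result means every account on the target is on the approved list; anything returned should be removed or added to the empowered official's records.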
Execution and Verification
Trigger the monthly full from Veeam, Commvault, or the equivalent. Confirm the production database quiesce or VSS snapshot succeeded; an inconsistent SQL backup may restore without error and still leave the ERP unable to start.
Walk every job in the console. Warnings are not successes — a job that completed with skipped files often means an open file lock on a CAD workstation. Confirm the off-site replication chain caught up before the next nightly window.
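The "warnings are not successes" rule can be applied mechanically to an exported job list. The record fields below (`name`, `status`, `skipped_files`) are assumptions for illustration and do not correspond to a specific Veeam or Commvault export format.

```python
def triage_jobs(jobs):
    """Split job names into (clean, needs_review).

    A job is clean only if its status is "Success" and it skipped no files;
    warnings, failures, and successes with skipped files all need review.
    """
    clean, needs_review = [], []
    for job in jobs:
        ok = job["status"] == "Success" and job.get("skipped_files", 0) == 0
        (clean if ok else needs_review).append(job["name"])
    return clean, needs_review
```

Everything in the second list gets a human look before the run is signed off.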
Auditors for ISO 27001, SOC 2, or customer cybersecurity reviews want dated screenshots of the backup console showing job status and capacity headroom. Save them where the next audit prep run can find them without a hunt.
Recovery Drill
Pull the most recent full plus log chain into the sandbox instance. Log in, run a known query (recent shop orders, last cycle count), and confirm the data matches production. A backup that restores but won't open is a common ransomware-recovery failure mode.
Pick one Haas or Fanuc program at random, restore it to the controller via DNC or USB, and have the operator dry-run it against the proven setup sheet. This end-to-end check is the most reliable way to catch silent corruption in the program backup pipeline.
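A checksum comparison can complement the dry run by showing whether the restored file is byte-identical to the one that was proven. This sketch assumes a SHA-256 digest was recorded when the program was proven out; `program_digest` and `restore_matches` are hypothetical names. It does not replace the dry run, since a matching digest says nothing about whether the program was correct when it was captured.

```python
import hashlib
from pathlib import Path


def program_digest(path):
    """Return the SHA-256 hex digest of a program file (read fully into memory)."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def restore_matches(recorded_digest, restored_path):
    """True if the restored file is byte-identical to the recorded version."""
    return program_digest(restored_path) == recorded_digest
```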
Run a stopwatch from restore start to verified application login, then compare the elapsed time to the RTO set under Scope and Recovery Objectives. A drill where the team beats RTO by hours is suspect; usually the dataset was small or the network was idle. Repeat with realistic load before signing off.
Pass means every restored system met its RTO and RPO and the application opened cleanly. Anything else is a fail — including the case where one CNC program was unreadable. Be strict; quiet partials are how DR programs decay.
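The pass criteria above can be written down so that a partial result cannot be recorded as a pass. A minimal sketch, assuming the measured values are entered by hand after the drill; `DrillResult` and `drill_passed` are illustrative names.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class DrillResult:
    system: str
    restore_duration: timedelta   # restore start to verified application login
    data_loss_window: timedelta   # age of the newest restored data
    app_opened: bool
    rto: timedelta
    rpo: timedelta

    def passed(self):
        return (
            self.app_opened
            and self.restore_duration <= self.rto
            and self.data_loss_window <= self.rpo
        )


def drill_passed(results, unreadable_programs=0):
    """Strict pass: at least one result, all systems pass, no unreadable CNC program."""
    return bool(results) and unreadable_programs == 0 and all(r.passed() for r in results)
```

A single unreadable program or one system over its RTO makes `drill_passed` return `False`, matching the strict reading in the checklist item.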
Remediation and Closeout
If the drill failed, Quality opens a corrective action request (CAR) in the QMS with a containment action (for example, manual snapshots until automated restore is fixed) and a target close date. Effectiveness verification means a clean drill in the following month, not just a re-run of the same scenario.
Capture every undocumented step the team had to figure out live — service account passwords, license server IPs, vendor support numbers. The runbook should let a new IT lead execute recovery during a 2 AM ransomware event.
Fifteen-minute readout: drill outcome, actual RTO vs target, open CARs, and any capacity or licensing issues that affect next month's run. Plant manager owns the production-impact call; IT lead owns the technical remediation.
