Database Backup Checklist

Pre-Backup Preparations

    Check free capacity on the backup destination (SAN LUN, NAS share, S3 bucket, Azure Blob, or Veeam repository) against the size of the last full plus expected log growth. A common gotcha: the repo is fine but the staging volume the backup tool writes to first is not — check both.
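The dual capacity check (repo and staging volume) can be scripted as a pre-flight gate. A minimal sketch; the paths, sizes, and 20% headroom are placeholders for your environment, not any tool's defaults:

```python
import shutil

def has_capacity(path: str, last_full_bytes: int, expected_log_growth_bytes: int,
                 headroom: float = 1.2) -> bool:
    """True if the volume at `path` can hold the next full plus expected log
    growth, with a safety headroom (20% by default)."""
    required = int((last_full_bytes + expected_log_growth_bytes) * headroom)
    return shutil.disk_usage(path).free >= required

# Check BOTH the repository and the tool's staging volume, e.g.:
# ok = all(has_capacity(p, 500 * 2**30, 50 * 2**30)
#          for p in ("/backup/repo", "/backup/staging"))
```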

    For SQL Server, run DBCC CHECKDB on the target databases or confirm the most recent run was clean. For PostgreSQL, check pg_stat_activity for long-running transactions that would bloat WAL during the backup. For MySQL/InnoDB, confirm no orphan transactions in SHOW ENGINE INNODB STATUS.
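For the PostgreSQL case, the check reduces to scanning pg_stat_activity for old xact_start values. A sketch of the query and the filtering logic; the one-hour threshold and the (pid, xact_start) row shape are assumptions to adapt:

```python
from datetime import datetime, timedelta

# Query to run against pg_stat_activity (via psql or your driver of choice):
LONG_TXN_SQL = """
SELECT pid, xact_start
FROM pg_stat_activity
WHERE xact_start IS NOT NULL
ORDER BY xact_start;
"""

def long_running(rows, now, threshold=timedelta(hours=1)):
    """Return pids of transactions open longer than `threshold`; these will
    pin WAL (and bloat the archive) for the duration of the backup."""
    return [pid for pid, started in rows if now - started > threshold]
```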

    Post in the #db-ops channel and email app owners with the planned start, expected end, and any application impact (read-only, brief I/O latency, none). Skip this for routine nightly jobs; required for ad-hoc or extended-window backups.

    Index rebuilds, statistics updates, ETL loads, and replication reseeds can collide with the backup window. Review the SQL Agent / cron / Airflow schedule for the host and any downstream replicas. Patch Tuesday and quarterly DR tests are common collision points.
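The schedule review can be partly automated with a simple interval-overlap check against whatever window data you can export from SQL Agent, cron, or Airflow. A sketch; the job names and times are illustrative:

```python
from datetime import datetime

def collides(a, b):
    """True if two (start, end) windows overlap."""
    (a_start, a_end), (b_start, b_end) = a, b
    return a_start < b_end and b_start < a_end

def collisions(backup_window, scheduled_jobs):
    """Names of jobs whose windows overlap the backup window.
    `scheduled_jobs` maps job name -> (start, end)."""
    return [name for name, win in scheduled_jobs.items()
            if collides(backup_window, win)]
```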

    Open the active job definition in Veeam, Commvault, Rubrik, or your scripted pg_dump/mysqldump wrapper. Confirm the schedule and retention match the documented RPO and retention policy, and that the immutable / object-lock copy is still configured — ransomware-resilient backup requires that the offsite copy be unmodifiable using production credentials.
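A pre-flight script can diff the exported job definition against policy. A sketch only; the field names here are illustrative, not any real tool's API:

```python
def retention_problems(job: dict, policy: dict) -> list:
    """Compare a backup job's configured retention and immutability against
    the documented policy. Returns a list of problems (empty = compliant)."""
    problems = []
    if job["retention_days"] < policy["retention_days"]:
        problems.append("retention shorter than documented policy")
    if not job.get("object_lock"):
        problems.append("immutable/object-lock copy not configured")
    return problems
```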

Backup Execution

    Trigger the configured job in the backup tool (Veeam, Commvault, Rubrik, native SQL Agent, pgBackRest, mysqldump). For ad-hoc runs, use the documented runbook command — do not invent flags at the prompt. Capture the job ID for the audit trail.

    Watch the tool's live job view for read/write MB/s, retry counts, and warnings. A throughput drop usually means the source disk is under contention or the network path to the repo is saturated. Page on errors; warnings get noted in the run log.

    Full backups alone do not meet a sub-day RPO. Verify the log-chain job is also running on its schedule (SQL Server transaction log backups, PostgreSQL WAL archive to archive_command, MySQL binlog shipping). A broken log chain is silent until restore day.
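One way to surface a broken chain before restore day is to scan the log-backup timestamps for gaps wider than the schedule interval. A minimal sketch, assuming a 15-minute log backup schedule:

```python
from datetime import datetime, timedelta

def chain_gaps(log_backup_times, max_interval=timedelta(minutes=15)):
    """Return (prev, next) pairs where consecutive log backups are further
    apart than the schedule allows; each gap is a potential broken chain."""
    times = sorted(log_backup_times)
    return [(a, b) for a, b in zip(times, times[1:]) if b - a > max_interval]
```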

    Confirm the backup file or chunk set appears in the primary repo, the secondary copy, and the immutable / offsite tier (3-2-1: 3 copies, 2 media, 1 offsite). Spot-check file size against the prior night's run; a 10x size delta usually means a misconfigured filter or full vs. incremental confusion.
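The size spot-check is easy to automate; this sketch flags deviation in either direction, using the 10x rule of thumb above as the default:

```python
def size_anomaly(current_bytes: int, previous_bytes: int, factor: float = 10) -> bool:
    """Flag a run whose size differs from the prior night's by more than
    `factor`x either way; usually a filter misconfig or full-vs-incremental
    confusion. Zero-byte runs are always anomalous."""
    if previous_bytes == 0 or current_bytes == 0:
        return True
    ratio = current_bytes / previous_bytes
    return ratio >= factor or ratio <= 1 / factor
```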

    Log start time, end time, total bytes, dedup ratio, and the final status code in the run sheet or PSA ticket. This timing data is what feeds the next quarter's window-sizing decision.

Post-Backup Verification

    Run RESTORE VERIFYONLY for SQL Server, pg_verifybackup for PostgreSQL, or the equivalent SureBackup / Live Mount verification in Veeam. This catches checksum corruption that a successful job-status code can hide.

    Restore the most recent backup into the DR lab VLAN, mount the database, and run a smoke query (row count on a known table, latest timestamp on the audit log). This is the step that catches the silent failures — a rotated credential the script depends on, a vendor format change, an encryption key the team no longer holds. Cadence: at least quarterly per the DR policy.
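The smoke query's pass/fail logic can be scripted so the drill is repeatable across operators. A sketch; the row-count floor and staleness window are assumptions to tune per database:

```python
from datetime import datetime, timedelta

def drill_passes(row_count: int, min_rows: int,
                 latest_audit_ts: datetime, backup_taken_at: datetime,
                 max_lag: timedelta = timedelta(hours=24)) -> bool:
    """A restored copy passes if the known table has at least `min_rows` rows
    and the audit log's newest entry is within `max_lag` of the backup point,
    i.e. we restored fresh data rather than a stale or truncated set."""
    return row_count >= min_rows and (backup_taken_at - latest_audit_ts) <= max_lag
```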

    A failed restore drill is a P2 — the backups are not proven usable. Open a ticket in ServiceNow / Jira Service Management / ConnectWise PSA, page the on-call DBA, and do not close this run until a successful restore is demonstrated against an alternate backup point.

    Skim the job log for VSS writer failures, snapshot quiesce timeouts, deduplication errors, or skipped objects. Warnings that recur across runs become tomorrow's failed restore — file a low-priority ticket rather than letting them accrete.
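The recurring-warning scan can be automated across a window of run logs. A sketch; the patterns listed are examples and should be extended for your tool's actual log format:

```python
import re
from collections import Counter

WARN_RE = re.compile(
    r"(VSS writer|quiesce timeout|dedup(?:lication)? error|skipped)", re.I)

def recurring_warnings(run_logs, min_runs: int = 2):
    """Count how many runs each warning pattern appears in; anything seen in
    `min_runs` or more runs deserves a ticket before it becomes a failed restore."""
    seen = Counter()
    for log in run_logs:
        # de-duplicate within a single run so one noisy run doesn't dominate
        for pat in {m.group(1).lower() for m in WARN_RE.finditer(log)}:
            seen[pat] += 1
    return [p for p, n in seen.items() if n >= min_runs]
```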

    Record the run in IT Glue / Hudu / Confluence with backup set ID, retention expiry, restore-drill date, and any deviations. This is the artifact a SOC 2 or HIPAA auditor asks for — without it, the controls are not demonstrable.

    Close the loop with app owners and the on-call rotation. For MSP-managed clients, push the result into the monthly QBR report so the customer has visible evidence the RPO/RTO commitments are being met.
