Database Design Checklist
Steps a backend or platform engineer runs to design a new production database schema, from requirements gathering through physical layout, security, and backup verification. Run this when standing up a new service's data store or doing a major schema redesign on an existing one.
Requirements Gathering
-
Document the read and write access patterns
Sit with the product engineers who will write the queries. Capture the top 10 read paths and the top 5 write paths — query shape, expected QPS, latency target, cardinality of filter columns. This drives index choice later; designing the schema before knowing the queries is the most common cause of expensive redesigns at 6 months.
-
Estimate row counts and growth rate
Project 12-month and 36-month row counts for each major entity. A 50M-row table needs different physical decisions (partitioning, online migration strategy) than a 500K-row table. Pull current numbers from analogous tables in production if you have them.
Collects number Collects paragraph -
Confirm compliance scope for the data
Will this database hold PHI, PCI cardholder data, or EU-resident PII? The answer drives encryption, audit-log, region, and BAA decisions downstream. If unsure, loop in security before continuing — retrofitting HIPAA controls onto a live schema is painful.
Collects list -
Define retention and purging requirements
For each entity, document how long data must be kept and what triggers deletion. GDPR right-to-erasure, CCPA right-to-delete, and contractual retention windows all show up here. Soft-delete vs hard-delete is a schema decision, not just a code one.
Conceptual Design
-
Draft the entity-relationship diagram
Use dbdiagram.io, Lucidchart, or Mermaid erDiagram. Show entities, relationships, and cardinality (1:1, 1:N, M:N). Commit the diagram source to the repo alongside the migration files so it stays in sync.
-
Identify primary keys and natural keys
Default to UUIDv7 or ULID for distributed systems; bigserial is fine for single-writer Postgres. Avoid composite natural keys as primary keys — they propagate into every foreign key and break when the natural key changes. Document any natural keys as unique constraints instead.
-
Document foreign-key relationships and cascade rules
Decide ON DELETE behavior per relationship: CASCADE, SET NULL, RESTRICT. CASCADE on a heavily-referenced parent table can lock huge swaths of the database; RESTRICT is safer and forces explicit cleanup. Note multi-tenant boundaries — tenant_id should be on every table that holds tenant-owned data.
-
Specify column data types and nullability
Use timestamptz, not timestamp, for any time the user might cross zones. Numeric for money (never float). Text over varchar(n) in Postgres. Default NOT NULL unless the absence of a value is semantically meaningful — nullable columns balloon query complexity.
Logical Design
-
Normalize to 3NF then justify denormalizations
Start at third normal form. Then walk each top read pattern from the requirements step — if a query joins more than 4 tables on the hot path, consider a targeted denormalization (counter cache, materialized view, derived column). Document the read pattern that justifies each denormalization so future engineers don't normalize it back.
-
Model many-to-many join tables
Each M:N relationship gets an explicit join table with its own primary key, both foreign keys, and a created_at timestamp. Add metadata columns (role, status) here when the relationship has its own attributes — don't shove them into either parent table.
-
Plan the index strategy from access patterns
One index per top read pattern, not one per column. Use composite indexes with the equality column first, then range. Partial indexes for soft-deleted-row filtering. Avoid indexing low-cardinality boolean columns alone. Every index slows writes — budget them.
-
Write the data dictionary
One row per column: table, column, type, nullability, default, constraint, description, PII classification. Check it into the repo as Markdown or generate from the migration files. This is what the next engineer reads instead of asking you on Slack.
Collects file
Physical Design
-
Select the DBMS and instance class
Default to managed Postgres (RDS, Aurora, Cloud SQL) unless you have a specific reason. Pick instance class against the projected row counts and QPS, not the launch-week numbers. For Aurora, choose Serverless v2 only if traffic is bursty — steady-state workloads are cheaper on provisioned.
-
Decide on partitioning and sharding
Partition tables projected to exceed roughly 100M rows or those with clear time-based access (events, audit logs). Range partition on created_at for time-series; hash partition on tenant_id for multi-tenant isolation. Don't partition prematurely — it complicates migrations and queries.
-
Plan for online schema migrations
Adopt pg_repack, pt-online-schema-change, or gh-ost up front for any table projected over 10M rows. Adding a column with a default on a 50M-row Postgres table = full rewrite under exclusive lock; the pattern is ADD COLUMN nullable, backfill in batches, then SET DEFAULT. Document this in the migration runbook now, not the night of the migration.
-
Size connection pool and transaction log
Put PgBouncer (or RDS Proxy) in front; raw client connections to Postgres run out of memory at a few hundred. Size WAL/transaction log for the longest expected backup window plus replica lag headroom. Default settings are wrong for any non-trivial workload.
Security and Compliance
-
Define database roles and least-privilege grants
Separate roles for app reads, app writes, migrations, and analytics. Application connection strings get the minimum required grants — no DDL from the app role. Service accounts authenticate via IAM database auth or short-lived credentials from Vault, not long-lived passwords.
-
Enable encryption at rest and in transit
KMS-backed storage encryption on the cluster, TLS required for all connections (rds.force_ssl=1 on RDS). For HIPAA or PCI scope, also use column-level encryption or pgcrypto for the highest-sensitivity fields. Confirm BAA is in place with the cloud provider before production data lands.
-
Configure audit logging
pgaudit for Postgres, audit plugin for MySQL. Ship logs to a write-once destination (CloudWatch Logs with retention, S3 with object lock). SOC 2 CC6 and HIPAA §164.312(b) both expect this; turning it on after a finding is much worse than turning it on now.
-
Complete the threat model review
Walk OWASP Top 10 against the schema with the AppSec engineer: SQL injection surface, IDOR via predictable IDs (use UUIDs for externally-exposed identifiers), tenant-isolation bypass through unscoped queries. Capture findings as tickets before the schema freezes.
Performance and Observability
-
Set SLOs for query latency
Per top read path: p50, p95, p99 latency targets. These become alerts in Datadog or Grafana. Without explicit numbers, every query is fast enough until it isn't.
-
Run EXPLAIN ANALYZE against representative queries
Load representative data volume (use pgbench or a sanitized prod snapshot) and run the top 10 queries with EXPLAIN (ANALYZE, BUFFERS). Look for sequential scans on large tables, nested loops with high row counts, and sort steps that spill to disk. Fix at design time, not in production.
-
Wire up monitoring dashboards
Datadog DBM, pganalyze, or Performance Insights — pick one and wire it up before launch, not after the first incident. Track connections, lock waits, replication lag, slow queries, and cache hit ratio. Page on replication lag over 30s and connection pool saturation over 80%.
Backup and Disaster Recovery
-
Set RPO and RTO with the product owner
RPO = how much data loss is acceptable; RTO = how long the system can be down. These are business decisions, not engineering ones. A 5-minute RPO requires PITR with frequent WAL archiving; a 24-hour RPO is satisfied by nightly snapshots. Don't pick RPO/RTO without the person who'll explain the outage to customers.
Collects text Collects text -
Configure automated backups and PITR
RDS automated backups with retention matching RPO; cross-region snapshot copy for the highest-tier services. Confirm the backup-success metric is alerting on failure — not just dashboarded green. A silent backup failure for 30 days is a SOC 2 finding waiting to happen.
-
Run a restore drill into staging
Restore the latest snapshot into a non-prod environment and run schema validation plus a smoke test of the top read paths. Time the restore — that's your real RTO. Schedule this quarterly thereafter; an untested backup is not a backup.
Collects list Collects text Collects paragraph -
File remediation tickets for any drill failures
If the restore failed or missed RTO, file Jira/Linear tickets with named owners and due dates before the schema goes into production. Common culprits: rotated credentials in the restore script, missing parameter group, incorrect KMS key permissions in the DR region.
Use this template
Copy it to your account, customize the steps, and run it with your team in minutes.
Browse hundreds of free templates across every team and industry.
Back to template libraryRun Database Design Checklist with your team
Customize the steps, assign roles, set a schedule, and keep a complete record for every run.