Database Design Checklist

Requirements Gathering

    Sit with the product engineers who will write the queries. Capture the top 10 read paths and the top 5 write paths — query shape, expected QPS, latency target, cardinality of filter columns. This drives index choice later; designing the schema before knowing the queries is the most common cause of expensive redesigns at 6 months.

    Project 12-month and 36-month row counts for each major entity. A 50M-row table needs different physical decisions (partitioning, online migration strategy) than a 500K-row table. Pull current numbers from analogous tables in production if you have them.

    Will this database hold PHI, PCI cardholder data, or EU-resident PII? The answer drives encryption, audit-log, region, and BAA decisions downstream. If unsure, loop in security before continuing — retrofitting HIPAA controls onto a live schema is painful.

    For each entity, document how long data must be kept and what triggers deletion. GDPR right-to-erasure, CCPA right-to-delete, and contractual retention windows all show up here. Soft-delete vs hard-delete is a schema decision, not just a code one.

Conceptual Design

    Use dbdiagram.io, Lucidchart, or Mermaid erDiagram. Show entities, relationships, and cardinality (1:1, 1:N, M:N). Commit the diagram source to the repo alongside the migration files so it stays in sync.

    Default to UUIDv7 or ULID for distributed systems; bigserial is fine for single-writer Postgres. Avoid composite natural keys as primary keys — they propagate into every foreign key and break when the natural key changes. Document any natural keys as unique constraints instead.

    Decide ON DELETE behavior per relationship: CASCADE, SET NULL, RESTRICT. CASCADE on a heavily-referenced parent table can lock huge swaths of the database; RESTRICT is safer and forces explicit cleanup. Note multi-tenant boundaries — tenant_id should be on every table that holds tenant-owned data.

    Use timestamptz, not timestamp, for any time the user might cross zones. Numeric for money (never float). Text over varchar(n) in Postgres. Default NOT NULL unless the absence of a value is semantically meaningful — nullable columns balloon query complexity.

Logical Design

    Start at third normal form. Then walk each top read pattern from the requirements step — if a query joins more than 4 tables on the hot path, consider a targeted denormalization (counter cache, materialized view, derived column). Document the read pattern that justifies each denormalization so future engineers don't normalize it back.

    Each M:N relationship gets an explicit join table with its own primary key, both foreign keys, and a created_at timestamp. Add metadata columns (role, status) here when the relationship has its own attributes — don't shove them into either parent table.

    One index per top read pattern, not one per column. Use composite indexes with the equality column first, then range. Partial indexes for soft-deleted-row filtering. Avoid indexing low-cardinality boolean columns alone. Every index slows writes — budget them.

    One row per column: table, column, type, nullability, default, constraint, description, PII classification. Check it into the repo as Markdown or generate from the migration files. This is what the next engineer reads instead of asking you on Slack.

Physical Design

    Default to managed Postgres (RDS, Aurora, Cloud SQL) unless you have a specific reason. Pick instance class against the projected row counts and QPS, not the launch-week numbers. For Aurora, choose Serverless v2 only if traffic is bursty — steady-state workloads are cheaper on provisioned.

    Partition tables projected to exceed roughly 100M rows or those with clear time-based access (events, audit logs). Range partition on created_at for time-series; hash partition on tenant_id for multi-tenant isolation. Don't partition prematurely — it complicates migrations and queries.

    Adopt pg_repack, pt-online-schema-change, or gh-ost up front for any table projected over 10M rows. Adding a column with a default on a 50M-row Postgres table = full rewrite under exclusive lock; the pattern is ADD COLUMN nullable, backfill in batches, then SET DEFAULT. Document this in the migration runbook now, not the night of the migration.

    Put PgBouncer (or RDS Proxy) in front; raw client connections to Postgres run out of memory at a few hundred. Size WAL/transaction log for the longest expected backup window plus replica lag headroom. Default settings are wrong for any non-trivial workload.

Security and Compliance

    Separate roles for app reads, app writes, migrations, and analytics. Application connection strings get the minimum required grants — no DDL from the app role. Service accounts authenticate via IAM database auth or short-lived credentials from Vault, not long-lived passwords.

    KMS-backed storage encryption on the cluster, TLS required for all connections (rds.force_ssl=1 on RDS). For HIPAA or PCI scope, also use column-level encryption or pgcrypto for the highest-sensitivity fields. Confirm BAA is in place with the cloud provider before production data lands.

    pgaudit for Postgres, audit plugin for MySQL. Ship logs to a write-once destination (CloudWatch Logs with retention, S3 with object lock). SOC 2 CC6 and HIPAA §164.312(b) both expect this; turning it on after a finding is much worse than turning it on now.

    Walk OWASP Top 10 against the schema with the AppSec engineer: SQL injection surface, IDOR via predictable IDs (use UUIDs for externally-exposed identifiers), tenant-isolation bypass through unscoped queries. Capture findings as tickets before the schema freezes.

Performance and Observability

    Per top read path: p50, p95, p99 latency targets. These become alerts in Datadog or Grafana. Without explicit numbers, every query is fast enough until it isn't.

    Load representative data volume (use pgbench or a sanitized prod snapshot) and run the top 10 queries with EXPLAIN (ANALYZE, BUFFERS). Look for sequential scans on large tables, nested loops with high row counts, and sort steps that spill to disk. Fix at design time, not in production.

    Datadog DBM, pganalyze, or Performance Insights — pick one and wire it up before launch, not after the first incident. Track connections, lock waits, replication lag, slow queries, and cache hit ratio. Page on replication lag over 30s and connection pool saturation over 80%.

Backup and Disaster Recovery

    RPO = how much data loss is acceptable; RTO = how long the system can be down. These are business decisions, not engineering ones. A 5-minute RPO requires PITR with frequent WAL archiving; a 24-hour RPO is satisfied by nightly snapshots. Don't pick RPO/RTO without the person who'll explain the outage to customers.

    RDS automated backups with retention matching RPO; cross-region snapshot copy for the highest-tier services. Confirm the backup-success metric is alerting on failure — not just dashboarded green. A silent backup failure for 30 days is a SOC 2 finding waiting to happen.

    Restore the latest snapshot into a non-prod environment and run schema validation plus a smoke test of the top read paths. Time the restore — that's your real RTO. Schedule this quarterly thereafter; an untested backup is not a backup.

    If the restore failed or missed RTO, file Jira/Linear tickets with named owners and due dates before the schema goes into production. Common culprits: rotated credentials in the restore script, missing parameter group, incorrect KMS key permissions in the DR region.

Use this template in Manifestly

Start a Free 14 Day Trial
Use Slack? Start your trial with one click

Related Software Development Checklists

Ready to take control of your recurring tasks?

Start Free 14-Day Trial


Use Slack? Sign up with one click

With Slack