Design Documentation Checklist
System Overview
Open with the customer problem in one paragraph and the success criteria in measurable terms — error rate target, p99 latency, adoption percentage. Avoid implementation language at this stage; reviewers should be able to disagree with the framing before reading the architecture.
List what's in scope and — more importantly — what's explicitly out. Non-goals prevent reviewer scope creep ("why doesn't this also handle multi-region failover?") and become the answer when the question comes up in launch review.
Name the personas (end user, internal admin, automated job) and the expected request shape — peak RPS, average payload size, read/write ratio. Capacity planning and SLO targets in later sections depend on this.
Functional Requirements
Use Given/When/Then format and tie each story to a Linear or Jira ticket. QA will lift acceptance criteria directly into test cases — vague phrasing here becomes vague tests later.
For each requirement, name the service or module that owns the behavior and the team in CODEOWNERS. Cross-team requirements need an explicit owner before the doc circulates — orphaned requirements are the most common cause of slipped launches.
Paste the canonical PRD / product brief URL and the tracking epic. The design doc is the engineering view; the PRD is the product view; both should reference each other.
Non-Functional Requirements
Pick numbers, not adjectives. "p99 < 250ms," "99.9% monthly availability," "error rate < 0.1%." Identify which signal will source each SLO (Datadog APM, Prometheus, synthetic checks) so the on-call dashboard can be built before launch, not after.
For each OWASP category (injection, broken auth, SSRF, etc.) note whether the design exposes the surface and how it's mitigated. Authn/authz, input validation, and rate limiting are the three areas that consistently get under-specified in design docs and surface in pentest reports later.
Document expected steady-state load and the 12-month growth assumption. Note where the design hits a scaling cliff — single-writer database, partition key hot spot, fan-out queue depth — and what triggers the next scaling investment.
Data Design
Show entities, foreign keys, indexes, and access patterns. For document/NoSQL stores, show the partition and sort keys. "What query does this table answer?" should be obvious from the diagram alone.
If the design adds, drops, or changes columns or indexes on a populated table, this is a migration. Even a column rename is a migration. Answer drives whether DBA review is required and whether backfill planning is in scope.
For Postgres: ADD COLUMN without a default, backfill in batches of 1k–10k with sleeps, then set the default in a follow-up. Estimate runtime against current row count and confirm the migration is reversible. Long-running migrations that lock the table during peak are the canonical Friday-evening incident.
Mark each field as Public, Internal, Confidential, or Restricted (PII / PHI / payment). State the retention window and deletion mechanism — GDPR right-to-erasure and CCPA right-to-delete both require named deletion paths, not just "we have backups."
Architecture and System Design
Use the C4 levels — system context, container, component — at whatever depth the change demands. A boxes-and-arrows sketch in Excalidraw or Mermaid is fine; the goal is reviewers can name each box and the arrow's protocol.
Name the libraries (FastAPI, Spring, Rails, etc.), the patterns (CQRS, outbox, saga), and any deviations from team defaults. Justify the deviations — "because it's interesting" is not a justification on a production system.
Specify retry policy (exponential backoff with jitter, max attempts, idempotency requirements), timeout values per dependency, circuit breaker triggers, and the structured-log fields that downstream observability relies on (trace_id, request_id, user_id, tenant_id).
API and Integration Points
External meaning anything outside the team's owned services — third-party SaaS, partner APIs, new internal services owned by another team. Answer drives whether AppSec review of data flows is required and whether a new vendor needs procurement / DPA review.
Commit the spec to the repo and link to the rendered docs. Versioning policy goes here too — additive changes vs. breaking changes vs. deprecation timeline. Consumers should be able to generate clients from the spec, not from prose descriptions.
State the auth method (mTLS, OAuth client credentials, signed JWT), idempotency-key behavior on writes, and what the caller should do on 5xx vs. 4xx. Without an idempotency contract, retries on payment or notification endpoints become duplicate-charge incidents.
File the AppSec ticket with the data classification table and the integration list. SOC 2 sub-processor reviews and GDPR Article 28 obligations both flow through this step — vendors handling Restricted data require a signed DPA before the integration goes live.
Rollout and Feature Flags
Name the LaunchDarkly / Unleash / Statsig flag, its default value, and its retirement date. Stale flags accumulate — assign an owner and put the cleanup ticket in the backlog at the same moment the flag is created.
Stages: 1% → 10% → 50% → 100% with hold times between. Define the metric thresholds that automatically halt rollout (error rate spike, p99 regression > 20%, saturation alarm) and who has the kill-switch credential at each stage.
Write the rollback as if you'll execute it at 2am — exact commands, exact image tag to revert to, whether the migration is reversible or requires a forward fix. "We'll figure it out" is the failure mode.
Testing and Validation
Map test type to layer: pure logic gets unit tests (Jest, pytest, RSpec), service boundaries get integration tests with real dependencies in containers, user journeys get Playwright or Cypress e2e. Set the coverage gate the PR will be checked against.
Write k6 or Locust scenarios for the peak shape from the load patterns section. The pass criteria is the SLO from the NFR section, exercised against the staging environment with production-shaped data volume.
Define the post-deploy smoke test that runs against production after each rollout stage — a few synthetic transactions covering the critical path. Datadog Synthetics or a small Playwright job in CI both work; the point is the smoke test runs automatically, not by memory.
Deployment and Maintenance
Runbook covers the top 5 expected alerts: what they mean, what to check, what to do, when to escalate. Name the PagerDuty schedule and the secondary. New services that go live without a runbook generate 3am pages that get acked and ignored.
Backup frequency, retention, and — critically — the quarterly restore drill into a non-prod environment. SOC 2 CC9 auditors want evidence the restore actually works; a green backup metric is not the same as a tested restore.
Put a 60-minute retro on the calendar two weeks after GA. Agenda: what shipped vs. spec, what surprised us, what we'd change in the doc next time. This loop is how design doc quality improves over time.
Review and Sign-Off
Post the doc in #engineering with the reviewer list (CODEOWNERS for affected services, plus the staff engineer for the area). Give reviewers at least 3 business days before the meeting — same-day review on a 20-page doc produces LGTM rubber-stamps, not real review.
60 minutes, scribe takes open questions into the doc as comments. Resist redesigning live — the meeting's job is to surface objections; the author's job after is to address them in writing.
Resolve every blocking comment with either a doc change or a written response. Non-blocking comments can be deferred but must be acknowledged. Reviewers should re-read the changed sections, not re-approve the whole doc.
Staff engineer or area tech lead records the decision in the doc header. "Approved with comments" means implementation can start while the author lands the comment-driven changes; "Rejected" means a substantive rework before re-review.
Track the rejection reasons as a checklist at the top of the doc; the next review cycle confirms each item was addressed. Re-circulate to the same reviewer set and schedule a shorter follow-up meeting (30 minutes is usually enough for a focused re-review).
