Design Documentation Checklist
Authoring and review workflow for an engineering design doc (RFC) — from problem framing through architecture, data, API contract, rollout plan, and final approval. Run by the tech lead drafting the design with reviewers across engineering, AppSec, and DBA.
System Overview
-
Draft the problem statement and goals
Open with the customer problem in one paragraph and the success criteria in measurable terms — error rate target, p99 latency, adoption percentage. Avoid implementation language at this stage; reviewers should be able to disagree with the framing before reading the architecture.
-
Define scope and explicit non-goals
List what's in scope and — more importantly — what's explicitly out. Non-goals prevent reviewer scope creep ("why doesn't this also handle multi-region failover?") and become the answer when the question comes up in launch review.
-
Identify primary users and load patterns
Name the personas (end user, internal admin, automated job) and the expected request shape — peak RPS, average payload size, read/write ratio. Capacity planning and SLO targets in later sections depend on this.
Functional Requirements
-
Enumerate user stories with acceptance criteria
Use Given/When/Then format and tie each story to a Linear or Jira ticket. QA will lift acceptance criteria directly into test cases — vague phrasing here becomes vague tests later.
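As an illustration of how a Given/When/Then criterion carries over into a test case, here is a minimal sketch. The discount rule, the `SAVE10` code, and the function `apply_discount` are hypothetical stand-ins for whatever behavior the story describes.

```python
def apply_discount(subtotal_cents: int, code: str) -> int:
    """Hypothetical behavior under test: SAVE10 takes 10% off, other codes do nothing."""
    if code == "SAVE10":
        return subtotal_cents - subtotal_cents // 10
    return subtotal_cents


def test_valid_code_reduces_total():
    # Given a cart with a 2000-cent subtotal
    subtotal = 2000
    # When the customer applies the code SAVE10
    total = apply_discount(subtotal, "SAVE10")
    # Then the total is reduced by 10%
    assert total == 1800


def test_unknown_code_leaves_total_unchanged():
    # Given a cart with a 2000-cent subtotal
    subtotal = 2000
    # When the customer applies a code that does not exist
    total = apply_discount(subtotal, "BOGUS")
    # Then the total is unchanged
    assert total == 2000
```

Each comment maps one-to-one to a clause in the story, so a vague "Then" shows up immediately as an assertion nobody can write.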
-
Map requirements to owning services
For each requirement, name the service or module that owns the behavior and the team in CODEOWNERS. Cross-team requirements need an explicit owner before the doc circulates — orphaned requirements are the most common cause of slipped launches.
-
Link the product spec and tracking epic
Paste the canonical PRD / product brief URL and the tracking epic. The design doc is the engineering view; the PRD is the product view; both should reference each other.
Non-Functional Requirements
-
Set SLOs for latency, availability, and error rate
Pick numbers, not adjectives. "p99 < 250ms," "99.9% monthly availability," "error rate < 0.1%." Identify which signal will source each SLO (Datadog APM, Prometheus, synthetic checks) so the on-call dashboard can be built before launch, not after.
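To make an availability number concrete, the sketch below converts an SLO percentage into an allowed-downtime budget. The function name and the 30-day window are assumptions for illustration; they are not part of any monitoring tool's API.

```python
def downtime_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of allowed unavailability in the window for a given availability SLO."""
    if not 0 < slo_percent <= 100:
        raise ValueError("slo_percent must be in (0, 100]")
    total_minutes = window_days * 24 * 60
    return total_minutes * (100 - slo_percent) / 100


if __name__ == "__main__":
    # 99.9% availability over an assumed 30-day month
    print(round(downtime_budget_minutes(99.9), 1))  # prints 43.2
```

A budget of roughly 43 minutes per month is a useful sanity check: if a single rollback takes 30 minutes, one bad deploy consumes most of it.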
-
Walk through the OWASP Top 10 threat model
For each OWASP category (injection, broken auth, SSRF, etc.) note whether the design exposes the surface and how it's mitigated. Authn/authz, input validation, and rate limiting are the three areas that consistently get under-specified in design docs and surface in pentest reports later.
-
Specify scalability targets and capacity headroom
Document expected steady-state load and the 12-month growth assumption. Note where the design hits a scaling cliff — single-writer database, partition key hot spot, fan-out queue depth — and what triggers the next scaling investment.
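One way to state "what triggers the next scaling investment" is to project when peak load reaches a known ceiling. The sketch below assumes compound monthly growth; the function name, the growth model, and the example numbers are all illustrative assumptions.

```python
from typing import Optional


def months_until_ceiling(
    current_rps: float,
    monthly_growth: float,
    ceiling_rps: float,
    horizon_months: int = 12,
) -> Optional[int]:
    """Return the first month in which projected peak load reaches the ceiling.

    Returns 0 if the ceiling is already reached, or None if load stays below
    the ceiling for the whole planning horizon.
    """
    if current_rps >= ceiling_rps:
        return 0
    load = current_rps
    for month in range(1, horizon_months + 1):
        load *= 1 + monthly_growth  # compound growth, an assumed model
        if load >= ceiling_rps:
            return month
    return None


if __name__ == "__main__":
    # Assumed example: 400 RPS today, 10% monthly growth, single-writer limit at 1000 RPS
    print(months_until_ceiling(400, 0.10, 1000))  # prints 10
```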
Data Design
-
Diagram the data model and entity relationships
Show entities, foreign keys, indexes, and access patterns. For document/NoSQL stores, show the partition and sort keys. "What query does this table answer?" should be obvious from the diagram alone.
-
Confirm whether a schema migration is required
If the design adds, drops, or changes columns or indexes on a populated table, it requires a migration. Even a column rename counts. The answer determines whether DBA review is required and whether backfill planning is in scope.
-
Plan the migration strategy and backfill batches
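For Postgres: ADD COLUMN without a default, backfill in batches of 1k–10k rows with a sleep between batches, then set the default in a follow-up migration. Since Postgres 11, adding a column with a constant default no longer rewrites the table, but a volatile default still does, and updating every existing row in one statement still holds row locks for the full run. Estimate runtime against the current row count and confirm the migration is reversible. Long-running migrations that lock the table during peak are the canonical Friday-evening incident.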
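The batched backfill can be sketched as a loop that walks the primary key and commits once per batch. This example uses the standard-library `sqlite3` module as a stand-in so it runs anywhere; against Postgres the same shape would go through a Postgres driver. The `orders` table, the `status` column, and the `'pending'` value are hypothetical.

```python
import sqlite3
import time


def make_demo_db(n_rows: int) -> sqlite3.Connection:
    """Create an in-memory table whose new 'status' column is still NULL everywhere."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT)")
    conn.executemany(
        "INSERT INTO orders (id) VALUES (?)", [(i,) for i in range(1, n_rows + 1)]
    )
    conn.commit()
    return conn


def backfill_in_batches(conn, batch_size=1000, sleep_seconds=0.0) -> int:
    """Backfill orders.status in primary-key order, one short transaction per batch.

    Returns the number of rows updated.
    """
    last_id = 0
    updated = 0
    while True:
        rows = conn.execute(
            "SELECT id FROM orders WHERE id > ? AND status IS NULL "
            "ORDER BY id LIMIT ?",
            (last_id, batch_size),
        ).fetchall()
        if not rows:
            return updated
        first_id, last_id = rows[0][0], rows[-1][0]
        cur = conn.execute(
            "UPDATE orders SET status = 'pending' "
            "WHERE id BETWEEN ? AND ? AND status IS NULL",
            (first_id, last_id),
        )
        conn.commit()  # commit per batch so locks are held only briefly
        updated += cur.rowcount
        time.sleep(sleep_seconds)  # give other writers room between batches
```

Because each batch filters on `status IS NULL`, the loop is safe to re-run after an interruption: rows that were already backfilled are skipped.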
-
Document PII classification and retention policy
Mark each field as Public, Internal, Confidential, or Restricted (PII / PHI / payment). State the retention window and deletion mechanism — GDPR right-to-erasure and CCPA right-to-delete both require named deletion paths, not just "we have backups."
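The classification table can also be kept as data so gaps are checked mechanically. The sketch below is one possible shape, not a prescribed format: the `FieldPolicy` type, the example field names, and the rule that only Restricted fields must carry a retention window and deletion path are assumptions.

```python
from dataclasses import dataclass
from typing import List, Optional

CLASSIFICATIONS = ("Public", "Internal", "Confidential", "Restricted")


@dataclass(frozen=True)
class FieldPolicy:
    name: str
    classification: str
    retention_days: Optional[int]  # None means no retention window was stated
    deletion_path: Optional[str]   # named job or endpoint that erases the field


def policy_gaps(fields: List[FieldPolicy]) -> List[str]:
    """List the problems a reviewer would flag in the classification table."""
    gaps = []
    for f in fields:
        if f.classification not in CLASSIFICATIONS:
            gaps.append(f"{f.name}: unknown classification {f.classification!r}")
        elif f.classification == "Restricted":
            if f.retention_days is None:
                gaps.append(f"{f.name}: missing retention window")
            if not f.deletion_path:
                gaps.append(f"{f.name}: missing deletion path")
    return gaps
```

A table where `policy_gaps` returns an empty list has a named deletion path for every Restricted field, which is the property the erasure requirements above depend on.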
Architecture and System Design
-
Publish the C4 system context diagram
Use the C4 levels (system context, container, component) at whatever depth the change demands. A boxes-and-arrows sketch in Excalidraw or Mermaid is fine; the goal is that reviewers can name each box and the protocol on each arrow.
-
Document chosen frameworks and design patterns
Name the libraries (FastAPI, Spring, Rails, etc.), the patterns (CQRS, outbox, saga), and any deviations from team defaults. Justify the deviations — "because it's interesting" is not a justification on a production system.
-
Define error handling, retries, and structured logging
Specify retry policy (exponential backoff with jitter, max attempts, idempotency requirements), timeout values per dependency, circuit breaker triggers, and the structured-log fields that downstream observability relies on (trace_id, request_id, user_id, tenant_id).
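A retry policy of this kind can be sketched in a few lines. The version below uses "full jitter" (a random delay between zero and an exponentially growing, capped window); the function names, default values, and the choice of retryable exceptions are assumptions for illustration.

```python
import random
import time


def backoff_delays(max_attempts, base=0.1, cap=5.0, rng=random.random):
    """Yield the sleep before each retry: full jitter over an exponentially growing window."""
    for attempt in range(max_attempts - 1):
        window = min(cap, base * (2 ** attempt))
        yield rng() * window


def call_with_retries(
    fn,
    max_attempts=4,
    retry_on=(TimeoutError, ConnectionError),
    sleep=time.sleep,
    **backoff,
):
    """Call fn() and retry on transient errors. fn must be safe to repeat (idempotent)."""
    delays = backoff_delays(max_attempts, **backoff)
    while True:
        try:
            return fn()
        except retry_on:
            delay = next(delays, None)
            if delay is None:
                raise  # attempts exhausted: surface the last error to the caller
            sleep(delay)
```

Errors outside `retry_on` propagate immediately, which keeps non-transient failures (validation errors, authorization failures) from being retried. Passing `sleep` and `rng` as parameters makes the policy testable without real waiting.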
API and Integration Points
-
Confirm whether new external integrations are introduced
"External" means anything outside the team's owned services: third-party SaaS, partner APIs, and new internal services owned by another team. The answer determines whether AppSec review of data flows is required and whether a new vendor needs procurement / DPA review.
-
Publish the OpenAPI or protobuf contract
Commit the spec to the repo and link to the rendered docs. Versioning policy goes here too — additive changes vs. breaking changes vs. deprecation timeline. Consumers should be able to generate clients from the spec, not from prose descriptions.
-
Document auth, idempotency, and retry semantics
State the auth method (mTLS, OAuth client credentials, signed JWT), idempotency-key behavior on writes, and what the caller should do on 5xx vs. 4xx. Without an idempotency contract, retries on payment or notification endpoints become duplicate-charge incidents.
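The idempotency-key contract can be illustrated with an in-memory sketch: a repeated key returns the stored response instead of repeating the write. The class name, the 422 response for a reused key with a different body, and the dictionary store are assumptions; a real service would need a durable store with an atomic insert to be safe under concurrent requests.

```python
class IdempotentWrites:
    """In-memory illustration of idempotency-key handling on a write endpoint."""

    def __init__(self):
        self._results = {}  # key -> (request fingerprint, stored response)

    def handle(self, idempotency_key, request_body, do_write):
        fingerprint = repr(sorted(request_body.items()))
        if idempotency_key in self._results:
            stored_fingerprint, response = self._results[idempotency_key]
            if stored_fingerprint != fingerprint:
                # Same key, different payload: reject rather than guess (assumed policy)
                return {"status": 422, "error": "idempotency key reused with a different body"}
            return response  # replay the original response; the write is not repeated
        response = do_write(request_body)
        self._results[idempotency_key] = (fingerprint, response)
        return response
```

With this contract in place, a caller that times out on a payment request can retry with the same key and receive the original result instead of creating a second charge.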
-
Schedule AppSec review of third-party data flows
File the AppSec ticket with the data classification table and the integration list. SOC 2 sub-processor reviews and GDPR Article 28 obligations both flow through this step — vendors handling Restricted data require a signed DPA before the integration goes live.
Rollout and Feature Flags
-
Define the feature flag and kill switch
Name the LaunchDarkly / Unleash / Statsig flag, its default value, and its retirement date. Stale flags accumulate — assign an owner and put the cleanup ticket in the backlog at the same moment the flag is created.
-
Plan canary stages and rollback triggers
Stages: 1% → 10% → 50% → 100% with hold times between. Define the metric thresholds that automatically halt rollout (error rate spike, p99 regression > 20%, saturation alarm) and who has the kill-switch credential at each stage.
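The halt conditions are easier to review when written as an explicit check. In the sketch below, the 20% p99 regression comes from the text above; the 0.1% error-rate ceiling and the 0.85 saturation limit are assumed defaults, and `StageMetrics` is a hypothetical container for whatever the metrics backend reports.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class StageMetrics:
    error_rate: float      # fraction of failed requests, e.g. 0.001 for 0.1%
    p99_ms: float          # 99th percentile latency in milliseconds
    cpu_saturation: float  # 0.0 to 1.0


def halt_reasons(
    baseline: StageMetrics,
    canary: StageMetrics,
    max_error_rate: float = 0.001,
    max_p99_regression: float = 0.20,
    max_saturation: float = 0.85,
) -> List[str]:
    """Return the reasons to halt the rollout; an empty list means proceed."""
    reasons = []
    if canary.error_rate > max_error_rate:
        reasons.append("error rate above threshold")
    if canary.p99_ms > baseline.p99_ms * (1 + max_p99_regression):
        reasons.append("p99 regression above threshold")
    if canary.cpu_saturation > max_saturation:
        reasons.append("saturation above threshold")
    return reasons
```

Returning the reasons rather than a bare boolean gives the person holding the kill switch something to paste into the incident channel.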
-
Capture the rollout and rollback plan
Write the rollback as if you'll execute it at 2am — exact commands, exact image tag to revert to, whether the migration is reversible or requires a forward fix. "We'll figure it out" is the failure mode.
Testing and Validation
-
Outline unit, integration, and e2e coverage
Map test type to layer: pure logic gets unit tests (Jest, pytest, RSpec), service boundaries get integration tests with real dependencies in containers, user journeys get Playwright or Cypress e2e. Set the coverage gate the PR will be checked against.
-
Define load test scenarios and p99 thresholds
Write k6 or Locust scenarios for the peak shape from the load patterns section. The pass criterion is the SLO from the NFR section, exercised against the staging environment with production-shaped data volume.
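Whichever load tool produces the raw numbers, the pass/fail decision reduces to a percentile check against the SLO. The sketch below uses the nearest-rank percentile definition; the function names and default thresholds (taken from the example SLOs earlier in this checklist) are assumptions, and load tools may compute percentiles slightly differently.

```python
import math
from typing import Sequence


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def load_test_passes(
    latencies_ms: Sequence[float],
    errors: int,
    p99_slo_ms: float = 250.0,
    max_error_rate: float = 0.001,
) -> bool:
    """True when both the p99 latency and the error rate are inside the SLO."""
    total = len(latencies_ms)
    p99_ok = percentile(latencies_ms, 99) < p99_slo_ms
    errors_ok = errors / total < max_error_rate
    return p99_ok and errors_ok
```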
-
Plan staging smoke tests and synthetic checks
Define the smoke test that runs against staging before promotion and against production after each rollout stage: a few synthetic transactions covering the critical path. Datadog Synthetics or a small Playwright job in CI both work; the point is that the smoke test runs automatically, not from memory.
Deployment and Maintenance
-
Document the production runbook and pager owner
Runbook covers the top 5 expected alerts: what they mean, what to check, what to do, when to escalate. Name the PagerDuty schedule and the secondary. New services that go live without a runbook generate 3am pages that get acked and ignored.
-
Define backup, restore, and DR drill cadence
State the backup frequency, the retention period, and, critically, the quarterly restore drill into a non-prod environment. SOC 2 auditors (under CC9, and under A1.2/A1.3 where the Availability category is in scope) want evidence that the restore actually works; a green backup metric is not the same as a tested restore.
-
Schedule the post-launch retro
Put a 60-minute retro on the calendar two weeks after GA. Agenda: what shipped vs. spec, what surprised us, what we'd change in the doc next time. This loop is how design doc quality improves over time.
Review and Sign-Off
-
Circulate the design doc for review
Post the doc in #engineering with the reviewer list (CODEOWNERS for affected services, plus the staff engineer for the area). Give reviewers at least 3 business days before the meeting — same-day review on a 20-page doc produces LGTM rubber-stamps, not real review.
-
Hold the design review meeting
Schedule 60 minutes, with a scribe recording open questions in the doc as comments. Resist redesigning live: the meeting's job is to surface objections; the author's job afterward is to address them in writing.
-
Address blocking review comments
Resolve every blocking comment with either a doc change or a written response. Non-blocking comments can be deferred but must be acknowledged. Reviewers should re-read the changed sections, not re-approve the whole doc.
-
Capture the final approval decision
Staff engineer or area tech lead records the decision in the doc header. "Approved with comments" means implementation can start while the author lands the comment-driven changes; "Rejected" means a substantive rework before re-review.
-
Resubmit the revised design doc
Track the rejection reasons as a checklist at the top of the doc; the next review cycle confirms each item was addressed. Re-circulate to the same reviewer set and schedule a shorter follow-up meeting (30 minutes is usually enough for a focused re-review).