Software Architecture Design Checklist
A staff engineer or tech lead runs this checklist when scoping a new system or a significant redesign — from requirements through design review, data and API decisions, deployment, and the architecture decision record (ADR) trail.
Requirements and Constraints
-
Interview stakeholders for functional requirements
Talk to product, support, sales engineering, and at least one customer-facing engineer. Capture user stories with acceptance criteria in Linear or Jira; vague asks like 'make it scalable' don't survive contact with a design review.
-
Capture non-functional requirements with numbers
Pin down concrete targets: p99 latency, RPS at peak, RPO/RPTO for disaster recovery, expected data volume in 12 and 36 months, SLO targets. 'Fast' and 'reliable' aren't requirements; '200ms p99 at 5k RPS, 99.9% monthly' is.
-
Identify compliance and data-residency constraints
Confirm whether the system handles PHI (HIPAA), cardholder data (PCI), or EU resident data (GDPR). Data residency requirements often dictate region selection and sub-processor choices before the first architecture diagram is drawn.
Collects list Collects text -
Get product and engineering sign-off on scopeCollects signature
High-Level Architecture
-
Draft the C4 context and container diagrams
Use the C4 model (Context → Container → Component) so the diagrams make sense to non-architects. Tools: Structurizr, Excalidraw, or Mermaid in the repo. Show external systems, trust boundaries, and the request path for the top 2-3 user journeys.
Collects file -
Choose the deployment topology
Monolith, modular monolith, or services? Single region with read replicas, or multi-region active-active? Default to the simplest topology that meets the RPO/RTO from the requirements step — multi-region adds 6-12 months of operational maturity to manage.
Collects list -
Select cloud provider and core managed services
Pin down the AWS / GCP / Azure region, compute (ECS/EKS/Lambda/Cloud Run), and managed services (RDS, ElastiCache, SQS, etc.). Cross-check against existing org accounts and Terraform modules; reusing platform team's golden path saves months over rolling your own.
-
Map service ownership to teams
Every service needs a CODEOWNERS team and a PagerDuty rotation. Conway's Law applies — if no team will own a piece on day one, it will become an orphaned service in twelve months.
Data Architecture
-
Model the core domain entities
ERD or DBML covering aggregates, ownership, and tenant boundaries. Decide single-tenant vs. shared-schema multi-tenant up front — retrofitting tenant isolation onto a shared schema is one of the most expensive migrations a SaaS team can take on.
-
Choose the primary datastore
Default to Postgres unless access patterns prove otherwise. Document the access pattern that justifies DynamoDB / MongoDB / Cassandra if you go non-relational. Mixing too many engines is a hidden tax — each one needs backup, monitoring, and on-call expertise.
Collects list -
Plan migration and backfill strategy
For replacement systems: dual-write, shadow-read, or stop-the-world cutover? Sketch the rollback. Adding a column with a default on a 50M-row Postgres table rewrites the whole table under exclusive lock — batched backfill with sleeps is the safer pattern.
-
Define backup, RPO, and restore-test cadence
Backup that has never been restored is not a backup. Schedule a quarterly restore drill into a non-prod environment; the backup-success metric being green for 18 months tells you nothing about whether the restore actually works.
API and Integration Design
-
Pick the API style and document contracts
REST + OpenAPI 3.1, gRPC + protobuf, or GraphQL? Match to consumer needs, not preference. Public APIs almost always REST; internal high-throughput service-to-service often gRPC; product-facing aggregation often GraphQL. Commit the schema to the repo so PR diffs show breaking changes.
Collects list -
Define versioning and deprecation policy
URI versioning (/v1/) vs. header versioning, semver for client SDKs, and a deprecation window (commonly 6-12 months for public APIs). Without a written policy, every breaking change becomes an ad-hoc negotiation with the largest customer.
-
Specify auth, rate limits, and idempotency
OAuth2 / OIDC for user-facing, signed JWTs or mTLS for service-to-service. Define rate-limit tiers per API key. Mutating endpoints take an Idempotency-Key header (Stripe pattern) so retries don't double-charge.
-
Design the async event contracts
If using SQS, Kafka, or EventBridge: define topic naming, partition keys, schema registry (Avro/Protobuf), and retention. Decide DLQ handling and replay strategy. Document at-least-once semantics and the consumer's idempotency expectation.
Reliability and Observability
-
Define SLOs and error budgets
Pick 2-4 SLIs per critical user journey (latency, availability, freshness, correctness) with explicit SLO targets. The error budget — derived from the SLO — is what decides whether the next release goes out or the team focuses on reliability work.
-
Specify metrics, logs, and traces
Pick the stack — Datadog, New Relic, or self-hosted Prometheus + Grafana + Loki + Tempo. Cover the RED method (Rate, Errors, Duration) per service. Watch cardinality on metric labels; runaway tag combinations blow up bills overnight.
-
Plan capacity and load testing
k6 or Locust scripts targeting the SLO RPS, run against a prod-shaped environment. Identify the saturation point (USE method) and document the auto-scaling policy that kicks in before saturation.
Security and Compliance Review
-
Run a STRIDE threat model session
Walk the data-flow diagram with security and at least one outside engineer. STRIDE prompts (Spoofing, Tampering, Repudiation, Info disclosure, DoS, Elevation of privilege) catch what a code-only review misses. Track mitigations as tickets, not bullet points.
-
Review secrets and key management plan
AWS Secrets Manager / Vault for runtime secrets, KMS for envelope encryption, no secrets in environment variables baked into images. Pre-commit hook (gitleaks or trufflehog) plus GitHub secret scanning to catch the original commit.
-
Confirm SOC 2 control coverage
Walk through change management, logical access, vulnerability management, and backup-restore controls in Vanta / Drata / Secureframe. Flag any new sub-processor that needs a signed DPA before launch.
Collects list -
Schedule the formal security review
Triggered when the data-classification step flagged PII/PHI/PCI or the topology touches a new trust boundary. Book AppSec for a 60-minute design walkthrough; bring the C4 diagrams and the threat model output.
Decision Records and Sign-Off
-
Write ADRs for the major trade-offs
One ADR per non-obvious decision: datastore choice, API style, sync vs. async, build-vs-buy. Use the Nygard format (Context → Decision → Consequences) and commit them to docs/adr/ in the repo so they show up in code review.
Collects file -
Hold the architecture review meeting
Invite staff engineers from adjacent domains, the platform team, and AppSec. Pre-read the design doc 48 hours ahead — meetings where reviewers see the doc for the first time produce surface-level feedback only.
-
Capture the review outcome and next stepsCollects list Collects paragraph Collects signature
-
Schedule rework and re-review
If the review came back as rework, file the action items as Linear/Jira tickets, assign owners, and book the follow-up review one to two weeks out. Don't let unresolved review feedback drift into implementation — the cost of fixing it post-code is 5-10x.
Use this template
Copy it to your account, customize the steps, and run it with your team in minutes.
Browse hundreds of free templates across every team and industry.
Back to template libraryRun Software Architecture Design Checklist with your team
Customize the steps, assign roles, set a schedule, and keep a complete record for every run.