Software Architecture Design Checklist

A staff engineer or tech lead runs this checklist when scoping a new system or a significant redesign — from requirements through design review, data and API decisions, deployment, and the architecture decision record (ADR) trail.

7 sections, 27 steps

Requirements and Constraints

Interview stakeholders for functional requirements
- Talk to product, support, sales engineering, and at least one customer-facing engineer. Capture user stories with acceptance criteria in Linear or Jira; vague asks like 'make it scalable' don't survive contact with a design review.
Capture non-functional requirements with numbers
- Pin down concrete targets: p99 latency, peak RPS, RPO/RTO for disaster recovery, expected data volume at 12 and 36 months, and SLO targets. 'Fast' and 'reliable' aren't requirements; '200ms p99 at 5k RPS, 99.9% monthly availability' is.
Identify compliance and data-residency constraints
- Confirm whether the system handles PHI (HIPAA), cardholder data (PCI), or EU resident data (GDPR). Data residency requirements often dictate region selection and sub-processor choices before the first architecture diagram is drawn.
Get product and engineering sign-off on scope

High-Level Architecture

Draft the C4 context and container diagrams
- Use the C4 model (Context → Container → Component) so the diagrams make sense to non-architects. Tools: Structurizr, Excalidraw, or Mermaid in the repo. Show external systems, trust boundaries, and the request path for the top 2-3 user journeys.
Choose the deployment topology
- Monolith, modular monolith, or services? Single region with read replicas, or multi-region active-active? Default to the simplest topology that meets the RPO/RTO from the requirements step; multi-region adds operational complexity that can take a team 6-12 months to mature into.
Select cloud provider and core managed services
- Pin down the AWS / GCP / Azure region, compute (ECS/EKS/Lambda/Cloud Run), and managed services (RDS, ElastiCache, SQS, etc.). Cross-check against existing org accounts and Terraform modules; reusing the platform team's golden path saves months over rolling your own.
Map service ownership to teams
- Every service needs a CODEOWNERS team and a PagerDuty rotation. Conway's Law applies — if no team will own a piece on day one, it will become an orphaned service in twelve months.

Data Architecture

Model the core domain entities
- ERD or DBML covering aggregates, ownership, and tenant boundaries. Decide single-tenant vs. shared-schema multi-tenant up front — retrofitting tenant isolation onto a shared schema is one of the most expensive migrations a SaaS team can take on.
Choose the primary datastore
- Default to Postgres unless access patterns prove otherwise. Document the access pattern that justifies DynamoDB / MongoDB / Cassandra if you go non-relational. Mixing too many engines is a hidden tax — each one needs backup, monitoring, and on-call expertise.
Plan migration and backfill strategy
- For replacement systems: dual-write, shadow-read, or stop-the-world cutover? Sketch the rollback. On Postgres versions before 11, or with a volatile default on any version, adding a column with a default to a 50M-row table rewrites the whole table under an ACCESS EXCLUSIVE lock; adding the column as nullable and filling it with a batched backfill with sleeps is the safer pattern.
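The batched backfill mentioned in this step can be sketched as follows. This is a minimal illustration, not a production migration: it uses Python's built-in sqlite3 module as a stand-in for Postgres so it runs anywhere, and the `accounts` table, `plan` column, `'free'` value, and batch size are all hypothetical.

```python
import sqlite3
import time


def make_demo_db(rows=2500):
    """Build an in-memory table whose new 'plan' column is still NULL (demo only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, plan TEXT)")
    conn.executemany(
        "INSERT INTO accounts (id) VALUES (?)",
        [(i,) for i in range(1, rows + 1)],
    )
    conn.commit()
    return conn


def backfill_in_batches(conn, batch_size=1000, pause_seconds=0.0):
    """Fill NULL values of the new column a limited number of rows at a time.

    Each batch commits on its own, so locks are held briefly and the job
    can be stopped and resumed: it only ever touches rows still NULL.
    Returns the total number of rows updated.
    """
    total = 0
    while True:
        cur = conn.execute(
            "UPDATE accounts SET plan = 'free' "
            "WHERE id IN (SELECT id FROM accounts "
            "WHERE plan IS NULL ORDER BY id LIMIT ?)",
            (batch_size,),
        )
        conn.commit()
        if cur.rowcount == 0:
            return total  # nothing left to backfill
        total += cur.rowcount
        time.sleep(pause_seconds)  # leave headroom for other writers between batches
```

Calling `backfill_in_batches(make_demo_db(), batch_size=1000, pause_seconds=0.1)` walks the table in three batches. Against a real database, the batch size and pause would be tuned by watching replication lag and lock waits.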
Define backup, RPO, and restore-test cadence
- A backup that has never been restored is not a backup. Schedule a quarterly restore drill into a non-prod environment; a backup-success metric that has been green for 18 months tells you nothing about whether the restore actually works.

API and Integration Design

Pick the API style and document contracts
- REST + OpenAPI 3.1, gRPC + protobuf, or GraphQL? Match to consumer needs, not preference. Public APIs almost always REST; internal high-throughput service-to-service often gRPC; product-facing aggregation often GraphQL. Commit the schema to the repo so PR diffs show breaking changes.
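Once the schema lives in the repo, a CI check can flag breaking changes automatically. The sketch below assumes two OpenAPI documents already parsed into dictionaries and detects only one kind of break, a removed operation. The function name is hypothetical, and a real check would also need to cover removed fields, changed types, and newly required parameters.

```python
# HTTP methods that can appear as operation keys under an OpenAPI path item.
HTTP_METHODS = {"get", "put", "post", "delete", "patch", "head", "options"}


def removed_operations(old_spec, new_spec):
    """List operations present in old_spec but missing from new_spec."""
    removed = []
    new_paths = new_spec.get("paths", {})
    for path, item in old_spec.get("paths", {}).items():
        for method in item:
            # Skip non-operation keys such as "parameters" or "summary".
            if method in HTTP_METHODS and method not in new_paths.get(path, {}):
                removed.append(f"{method.upper()} {path}")
    return sorted(removed)


old = {"paths": {"/users": {"get": {}, "post": {}}, "/orders": {"get": {}}}}
new = {"paths": {"/users": {"get": {}}}}
print(removed_operations(old, new))  # ['GET /orders', 'POST /users']
```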
Define versioning and deprecation policy
- URI versioning (/v1/) vs. header versioning, semver for client SDKs, and a deprecation window (commonly 6-12 months for public APIs). Without a written policy, every breaking change becomes an ad-hoc negotiation with the largest customer.
Specify auth, rate limits, and idempotency
- OAuth2 / OIDC for user-facing, signed JWTs or mTLS for service-to-service. Define rate-limit tiers per API key. Mutating endpoints take an Idempotency-Key header (Stripe pattern) so retries don't double-charge.
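The Idempotency-Key behavior described in this step can be illustrated with a small in-memory sketch. The class and exception names are hypothetical. A real service would keep the keys in a shared store with an atomic insert and an expiry, which this example does not attempt, and it is not safe for concurrent requests.

```python
import hashlib
import json


class IdempotencyConflict(Exception):
    """Raised when a key is reused with a different request body."""


class IdempotentHandler:
    """Wrap a mutating handler so a retried request replays the first result."""

    def __init__(self, handler):
        self._handler = handler
        self._seen = {}  # idempotency key -> (request fingerprint, stored response)

    def handle(self, idempotency_key, payload):
        # Fingerprint the body so a reused key with a different payload is rejected.
        fingerprint = hashlib.sha256(
            json.dumps(payload, sort_keys=True).encode("utf-8")
        ).hexdigest()
        if idempotency_key in self._seen:
            stored_fingerprint, response = self._seen[idempotency_key]
            if stored_fingerprint != fingerprint:
                raise IdempotencyConflict(idempotency_key)
            return response  # retry: return the stored result, skip the side effect
        response = self._handler(payload)
        self._seen[idempotency_key] = (fingerprint, response)
        return response
```

With a hypothetical `create_charge` handler wrapped this way, two calls carrying the same key and body produce one charge and two identical responses, which is the property that makes client retries safe.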
Design the async event contracts
- If using SQS, Kafka, or EventBridge: define topic naming, partition keys, schema registry (Avro/Protobuf), and retention. Decide DLQ handling and replay strategy. Document at-least-once semantics and the consumer's idempotency expectation.
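Under at-least-once delivery the same event can arrive more than once, so the consumer has to tolerate duplicates. The sketch below shows one way to meet that expectation with a processed-ID set and a dead-letter list. All names are hypothetical, the state is in memory, and no real SQS, Kafka, or EventBridge client is involved.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Event:
    event_id: str
    payload: dict


@dataclass
class Consumer:
    handler: Callable[[dict], None]
    max_attempts: int = 3
    processed_ids: set = field(default_factory=set)
    attempts: dict = field(default_factory=dict)
    dead_letters: list = field(default_factory=list)

    def deliver(self, event):
        """Return True when the broker may acknowledge the message."""
        if event.event_id in self.processed_ids:
            return True  # duplicate delivery: acknowledge without re-running
        self.attempts[event.event_id] = self.attempts.get(event.event_id, 0) + 1
        try:
            self.handler(event.payload)
        except Exception:
            if self.attempts[event.event_id] >= self.max_attempts:
                self.dead_letters.append(event)  # park for inspection and replay
                return True
            return False  # leave unacknowledged so the broker redelivers
        self.processed_ids.add(event.event_id)
        return True
```

In a real deployment the processed-ID check and the side effect would need to share a transaction or use a naturally idempotent write; otherwise a crash between the two can still produce a duplicate.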

Reliability and Observability

Define SLOs and error budgets
- Pick 2-4 SLIs per critical user journey (latency, availability, freshness, correctness) with explicit SLO targets. The error budget — derived from the SLO — is what decides whether the next release goes out or the team focuses on reliability work.
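The arithmetic behind an error budget is simple enough to sketch. The functions below are illustrative helpers, not part of any monitoring product, and they assume a 30-day window and a request-based availability SLI.

```python
def error_budget_minutes(slo, window_days=30):
    """Minutes of full unavailability the SLO tolerates over the window."""
    return (1.0 - slo) * window_days * 24 * 60


def budget_remaining(slo, total_requests, failed_requests):
    """Fraction of the error budget still unspent; negative means overspent."""
    allowed_failures = (1.0 - slo) * total_requests
    if allowed_failures == 0:
        return 0.0  # a 100% SLO (or no traffic) leaves no budget to spend
    return 1.0 - failed_requests / allowed_failures


# A 99.9% monthly SLO allows about 43.2 minutes of downtime in 30 days.
print(round(error_budget_minutes(0.999), 1))
# 250 failures out of 1,000,000 requests spends about a quarter of the budget.
print(round(budget_remaining(0.999, 1_000_000, 250), 2))
```

A team might, for example, pause feature releases when the remaining fraction drops below an agreed threshold; the threshold itself is a policy choice rather than something the math dictates.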
Specify metrics, logs, and traces
- Pick the stack — Datadog, New Relic, or self-hosted Prometheus + Grafana + Loki + Tempo. Cover the RED method (Rate, Errors, Duration) per service. Watch cardinality on metric labels; runaway tag combinations blow up bills overnight.
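To see why label cardinality matters, multiply the number of distinct values each label can take. The sketch below computes that worst-case bound for one metric; the label names and counts are made-up examples, and real series counts are usually lower because not every combination occurs.

```python
from math import prod


def series_count(label_cardinalities):
    """Upper bound on time series for one metric: the product of label cardinalities."""
    return prod(label_cardinalities.values())


labels = {"service": 20, "endpoint": 50, "status": 5}
print(series_count(labels))  # 5000

# Adding a per-user label multiplies the bound by the number of users.
labels_with_user = {**labels, "user_id": 100_000}
print(series_count(labels_with_user))  # 500000000
```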
Plan capacity and load testing
- k6 or Locust scripts targeting the SLO RPS, run against a prod-shaped environment. Identify the saturation point (USE method) and document the auto-scaling policy that kicks in before saturation.

Security and Compliance Review

Run a STRIDE threat model session
- Walk the data-flow diagram with security and at least one engineer from outside the team. STRIDE prompts (Spoofing, Tampering, Repudiation, Information disclosure, Denial of service, Elevation of privilege) catch what a code-only review misses. Track mitigations as tickets, not bullet points.
Review secrets and key management plan
- AWS Secrets Manager / Vault for runtime secrets, KMS for envelope encryption, no secrets in environment variables baked into images. Add a pre-commit hook (gitleaks or trufflehog) plus GitHub secret scanning to catch a leaked secret at the commit that introduces it.
Confirm SOC 2 control coverage
- Walk through change management, logical access, vulnerability management, and backup-restore controls in Vanta / Drata / Secureframe. Flag any new sub-processor that needs a signed DPA before launch.
Schedule the formal security review
- Triggered when the compliance and data-residency step flagged PII/PHI/PCI or the topology touches a new trust boundary. Book AppSec for a 60-minute design walkthrough; bring the C4 diagrams and the threat model output.

Decision Records and Sign-Off

Write ADRs for the major trade-offs
- One ADR per non-obvious decision: datastore choice, API style, sync vs. async, build-vs-buy. Use the Nygard format (Context → Decision → Consequences) and commit them to docs/adr/ in the repo so they show up in code review.
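A small script can keep ADR files consistent. The sketch below generates a skeleton with the Context, Decision, and Consequences headings named in this step, plus a Status heading that Nygard's original template also includes. The zero-padded file naming and the placeholder text are assumptions, not a standard.

```python
import re
from pathlib import Path

ADR_TEMPLATE = """# {number}. {title}

## Status

Proposed

## Context

What forces are at play, and what constraint makes this decision necessary?

## Decision

What we will do, stated in full sentences.

## Consequences

What becomes easier or harder as a result.
"""


def adr_filename(number, title):
    """Build a name like 0007-use-postgres.md from a number and a title."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return f"{number:04d}-{slug}.md"


def render_adr(number, title):
    """Fill the skeleton with the ADR number and title."""
    return ADR_TEMPLATE.format(number=number, title=title)


def write_adr(directory, number, title):
    """Write the skeleton under the given directory and return its path."""
    path = Path(directory) / adr_filename(number, title)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(render_adr(number, title), encoding="utf-8")
    return path
```

For example, `write_adr("docs/adr", 7, "Use Postgres as the primary datastore")` creates `docs/adr/0007-use-postgres-as-the-primary-datastore.md`, ready to be filled in and reviewed in a pull request.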
Hold the architecture review meeting
- Invite staff engineers from adjacent domains, the platform team, and AppSec. Send the design doc as a pre-read 48 hours ahead; meetings where reviewers see the doc for the first time produce surface-level feedback only.
Capture the review outcome and next steps
Schedule rework and re-review
- If the review came back as rework, file the action items as Linear/Jira tickets, assign owners, and book the follow-up review one to two weeks out. Don't let unresolved review feedback drift into implementation; fixing a design issue after the code is written is commonly estimated to cost 5-10x more than fixing it at review time.
