Software Architecture Design Checklist
A staff engineer or tech lead runs this checklist when scoping a new system or a significant redesign — from requirements through design review, data and API decisions, deployment, and the architecture decision record (ADR) trail.
Requirements and Constraints
-
Interview stakeholders for functional requirements
Talk to product, support, sales engineering, and at least one customer-facing engineer. Capture user stories with acceptance criteria in Linear or Jira; vague asks like 'make it scalable' don't survive contact with a design review.
-
Capture non-functional requirements with numbers
Pin down concrete targets: p99 latency, RPS at peak, RPO/RTO for disaster recovery, expected data volume in 12 and 36 months, SLO targets. 'Fast' and 'reliable' aren't requirements; '200ms p99 at 5k RPS, 99.9% monthly' is.
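A numeric target such as '200ms p99 at 5k RPS' can be written down as data and checked against measurements. The following is a minimal sketch, not a prescribed tool: the names LatencyTarget, percentile_nearest_rank, and meets_target are hypothetical, and it assumes the latency samples were collected while the system was running at the stated peak RPS.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class LatencyTarget:
    """A numeric latency requirement, e.g. '200ms p99 at 5k RPS'."""
    percentile: float    # 99.0 means p99
    threshold_ms: float  # latency ceiling at that percentile
    peak_rps: int        # load at which the samples are assumed to be taken


def percentile_nearest_rank(samples_ms, percentile):
    """Return the nearest-rank percentile of a non-empty list of samples."""
    if not samples_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(samples_ms)
    rank = math.ceil(percentile * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


def meets_target(samples_ms, target):
    """True when the measured percentile is at or under the threshold."""
    observed = percentile_nearest_rank(samples_ms, target.percentile)
    return observed <= target.threshold_ms
```

With a target of LatencyTarget(99.0, 200.0, 5000), a run where 10 of 1,000 requests take 500 ms still passes, while a run with 11 slow requests does not. That is the kind of unambiguous answer 'fast' cannot give.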
-
Identify compliance and data-residency constraints
Confirm whether the system handles PHI (HIPAA), cardholder data (PCI), or EU resident data (GDPR). Data residency requirements often dictate region selection and sub-processor choices before the first architecture diagram is drawn.
-
Get product and engineering sign-off on scope
High-Level Architecture
-
Draft the C4 context and container diagrams
Use the C4 model (Context → Container → Component) so the diagrams make sense to non-architects. Tools: Structurizr, Excalidraw, or Mermaid in the repo. Show external systems, trust boundaries, and the request path for the top 2-3 user journeys.
-
Choose the deployment topology
Monolith, modular monolith, or services? Single region with read replicas, or multi-region active-active? Default to the simplest topology that meets the RPO/RTO from the requirements step; running multi-region well typically takes a team 6-12 months of added operational maturity.
-
Select cloud provider and core managed services
Pin down the AWS / GCP / Azure region, compute (ECS/EKS/Lambda/Cloud Run), and managed services (RDS, ElastiCache, SQS, etc.). Cross-check against existing org accounts and Terraform modules; reusing the platform team's golden path saves months over rolling your own.
-
Map service ownership to teams
Every service needs a CODEOWNERS team and a PagerDuty rotation. Conway's Law applies — if no team will own a piece on day one, it will become an orphaned service in twelve months.
Data Architecture
-
Model the core domain entities
ERD or DBML covering aggregates, ownership, and tenant boundaries. Decide single-tenant vs. shared-schema multi-tenant up front — retrofitting tenant isolation onto a shared schema is one of the most expensive migrations a SaaS team can take on.
-
Choose the primary datastore
Default to Postgres unless access patterns prove otherwise. Document the access pattern that justifies DynamoDB / MongoDB / Cassandra if you go non-relational. Mixing too many engines is a hidden tax — each one needs backup, monitoring, and on-call expertise.
-
Plan migration and backfill strategy
For replacement systems: dual-write, shadow-read, or stop-the-world cutover? Sketch the rollback. On Postgres versions before 11, or with a volatile default on any version, adding a column with a default to a 50M-row table rewrites the whole table under an ACCESS EXCLUSIVE lock; adding the column as nullable and then running a batched backfill with sleeps is the safer pattern.
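A batched backfill with sleeps can be sketched as below. This is an illustration under stated assumptions, not a migration tool: it uses the standard-library sqlite3 module as a stand-in for Postgres so the example runs anywhere, and the accounts table, status column, and 'active' value are hypothetical.

```python
import sqlite3
import time


def backfill_in_batches(conn, batch_size=1000, pause_s=0.1):
    """Fill NULL `status` values in small batches; return rows updated.

    Assumes a hypothetical table accounts(id INTEGER PRIMARY KEY, status TEXT).
    """
    total = 0
    while True:
        cur = conn.execute(
            "UPDATE accounts SET status = 'active' "
            "WHERE id IN ("
            "  SELECT id FROM accounts WHERE status IS NULL LIMIT ?"
            ")",
            (batch_size,),
        )
        conn.commit()  # one short transaction per batch, so locks are brief
        if cur.rowcount == 0:
            return total  # nothing left to backfill
        total += cur.rowcount
        time.sleep(pause_s)  # give foreground traffic room between batches
```

Each batch commits on its own, so a failure partway through leaves the rows already filled intact and the function can simply be run again. On a real Postgres table the batch size and pause would be tuned against replication lag and lock wait metrics.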
-
Define backup, RPO, and restore-test cadence
A backup that has never been restored is not a backup. Schedule a quarterly restore drill into a non-prod environment; the backup-success metric being green for 18 months tells you nothing about whether the restore actually works.
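The core of a restore drill is comparing the restored copy against something recorded when the backup was taken. The sketch below assumes a row-count manifest as that record and uses the standard-library sqlite3 backup API as a stand-in for the real backup tooling; row_counts, take_backup, and verify_restore are hypothetical names, and table names are assumed to come from a trusted list.

```python
import sqlite3


def row_counts(conn, tables):
    """Return {table: row count}; table names must come from a trusted list."""
    return {
        t: conn.execute(f'SELECT COUNT(*) FROM "{t}"').fetchone()[0]
        for t in tables
    }


def take_backup(source):
    """Copy the source database into a fresh in-memory database."""
    backup = sqlite3.connect(":memory:")
    source.backup(backup)  # sqlite3's online backup API
    return backup


def verify_restore(restored, manifest):
    """Return {table: (expected, actual)} for every table that differs."""
    actual = row_counts(restored, list(manifest))
    return {
        t: (manifest[t], actual[t])
        for t in manifest
        if actual[t] != manifest[t]
    }
```

An empty result from verify_restore means the counts match. Row counts are only a coarse check; a fuller drill would also run application-level queries against the restored copy.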
API and Integration Design
-
Pick the API style and document contracts
REST + OpenAPI 3.1, gRPC + protobuf, or GraphQL? Match to consumer needs, not preference. Public APIs are almost always REST; internal high-throughput service-to-service traffic is often gRPC; product-facing aggregation is often GraphQL. Commit the schema to the repo so PR diffs show breaking changes.
-
Define versioning and deprecation policy
URI versioning (/v1/) vs. header versioning, semver for client SDKs, and a deprecation window (commonly 6-12 months for public APIs). Without a written policy, every breaking change becomes an ad-hoc negotiation with the largest customer.
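One way to make a deprecation window visible to clients is to return the retirement date on every response from the deprecated version. The sketch below assumes the HTTP Sunset header (RFC 8594) is used for this, which the checklist itself does not prescribe; the function name and the 180-day default are hypothetical.

```python
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime


def sunset_headers(announced_at, window_days=180):
    """Build a Sunset header for an API version being retired.

    `announced_at` must be a timezone-aware UTC datetime.
    """
    sunset_at = announced_at + timedelta(days=window_days)
    # format_datetime(..., usegmt=True) emits an HTTP-date ending in "GMT"
    return {"Sunset": format_datetime(sunset_at, usegmt=True)}


headers = sunset_headers(datetime(2025, 1, 1, tzinfo=timezone.utc))
```

Here a deprecation announced on 1 January 2025 with a 180-day window yields a Sunset date of 30 June 2025. Client SDKs can log a warning when they see the header.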
-
Specify auth, rate limits, and idempotency
OAuth2 / OIDC for user-facing, signed JWTs or mTLS for service-to-service. Define rate-limit tiers per API key. Mutating endpoints take an Idempotency-Key header (Stripe pattern) so retries don't double-charge.
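The Idempotency-Key behavior can be sketched as a lookup keyed by the client-supplied header value: the first request does the work and stores the response, and a retry with the same key gets the stored response back. This is a simplified, in-memory illustration; the class and field names are hypothetical, and a real service would need an atomic insert in a shared store plus key expiry.

```python
class IdempotentCharges:
    """In-memory sketch of idempotent handling for a mutating endpoint."""

    def __init__(self):
        self._responses = {}   # idempotency key -> stored response
        self.charges_made = 0  # side effects actually performed

    def charge(self, idempotency_key, amount_cents):
        # A retry with a known key returns the original response unchanged.
        if idempotency_key in self._responses:
            return self._responses[idempotency_key]
        self.charges_made += 1  # the side effect happens only once per key
        response = {"charge_id": self.charges_made, "amount_cents": amount_cents}
        self._responses[idempotency_key] = response
        return response
```

Calling charge twice with the same key performs one charge and returns the same charge_id both times, which is what makes client retries after a timeout safe.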
-
Design the async event contracts
If using SQS, Kafka, or EventBridge: define topic naming, partition keys, schema registry (Avro/Protobuf), and retention. Decide DLQ handling and replay strategy. Document at-least-once semantics and the consumer's idempotency expectation.
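At-least-once delivery means a consumer can see the same event twice, so the consumer has to deduplicate and needs somewhere to park events it cannot process. The sketch below is a broker-agnostic illustration: events are plain dicts with a hypothetical "id" field, the seen-set lives in memory, and the dead-letter queue is a list. A real consumer would keep the processed-ID record in durable storage.

```python
def consume(events, handler, max_attempts=3):
    """Process events at-least-once style; return the dead-lettered events."""
    seen = set()       # IDs already processed successfully
    dead_letter = []   # events that failed every attempt
    for event in events:
        if event["id"] in seen:
            continue  # duplicate delivery: skip, the work is already done
        for attempt in range(1, max_attempts + 1):
            try:
                handler(event)
                seen.add(event["id"])
                break
            except Exception:
                if attempt == max_attempts:
                    dead_letter.append(event)  # park it for inspection or replay
    return dead_letter
```

Replay then amounts to feeding the dead-lettered events back through consume once the underlying fault is fixed; the deduplication step keeps the replay from repeating work that already succeeded.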
Reliability and Observability
-
Define SLOs and error budgets
Pick 2-4 SLIs per critical user journey (latency, availability, freshness, correctness) with explicit SLO targets. The error budget — derived from the SLO — is what decides whether the next release goes out or the team focuses on reliability work.
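The error budget follows directly from the SLO by arithmetic. The sketch below shows two common views of it, allowed downtime per window and remaining budget for a request-based SLI; the function names are hypothetical and the 30-day window is an assumption.

```python
def downtime_budget_minutes(slo, window_days=30):
    """Minutes of full unavailability the SLO tolerates per window."""
    return (1 - slo) * window_days * 24 * 60


def budget_remaining(slo, total_requests, failed_requests):
    """Fraction of the error budget left; negative means it is overspent."""
    allowed_failures = (1 - slo) * total_requests
    if allowed_failures == 0:
        raise ValueError("an SLO of 100% leaves no error budget")
    return 1 - failed_requests / allowed_failures
```

A 99.9% SLO over 30 days allows about 43.2 minutes of downtime. With 1,000,000 requests and 400 failures, roughly 60% of the budget remains; at 1,500 failures the result goes negative, which is the signal to pause releases and prioritize reliability work.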
-
Specify metrics, logs, and traces
Pick the stack — Datadog, New Relic, or self-hosted Prometheus + Grafana + Loki + Tempo. Cover the RED method (Rate, Errors, Duration) per service. Watch cardinality on metric labels; runaway tag combinations blow up bills overnight.
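Metric cardinality grows multiplicatively: the worst-case number of time series for one metric is the product of the distinct values of each label. A rough estimator, with hypothetical label names and counts:

```python
import math


def series_upper_bound(label_cardinalities):
    """Worst-case time series for one metric: product of label value counts."""
    return math.prod(label_cardinalities.values())


bounded = {"service": 20, "endpoint": 50, "status": 5}
with_user_id = {**bounded, "user_id": 100_000}
```

The bounded label set tops out at 5,000 series, while adding a user_id label with 100,000 values raises the ceiling to 500,000,000. Unbounded identifiers belong in logs or traces rather than metric labels.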
-
Plan capacity and load testing
k6 or Locust scripts targeting the SLO RPS, run against a prod-shaped environment. Identify the saturation point (USE method) and document the auto-scaling policy that kicks in before saturation.
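Once a stepped load test has run, the results can be reduced to a saturation point and a scale-out threshold below it. The sketch below is a simplification that infers saturation from latency and error rate rather than from the resource metrics the USE method examines; the function names, the 30% headroom, and the tuple layout are assumptions.

```python
def find_saturation_rps(steps, p99_slo_ms, max_error_rate=0.001):
    """Return the highest tested RPS still within the SLO, or None.

    `steps` is a list of (rps, p99_ms, error_rate) tuples from a stepped test.
    """
    last_good = None
    for rps, p99_ms, error_rate in sorted(steps):
        if p99_ms > p99_slo_ms or error_rate > max_error_rate:
            break  # first step that violates the SLO: stop here
        last_good = rps
    return last_good


def scale_out_threshold(saturation_rps, headroom=0.3):
    """RPS at which auto-scaling should trigger, leaving headroom below saturation."""
    return saturation_rps * (1 - headroom)
```

If the system holds a 200 ms p99 up to 5,000 RPS and degrades at 6,000, the saturation point is 5,000 RPS, and a 30% headroom puts the scale-out trigger near 3,500 RPS per the same unit of capacity.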
Security and Compliance Review
-
Run a STRIDE threat model session
Walk the data-flow diagram with security and at least one outside engineer. STRIDE prompts (Spoofing, Tampering, Repudiation, Information disclosure, Denial of service, Elevation of privilege) catch what a code-only review misses. Track mitigations as tickets, not bullet points.
-
Review secrets and key management plan
AWS Secrets Manager / Vault for runtime secrets, KMS for envelope encryption, no secrets in environment variables baked into images. Add a pre-commit hook (gitleaks or trufflehog) plus GitHub secret scanning so a leaked secret is caught at the commit that introduces it.
-
Confirm SOC 2 control coverage
Walk through change management, logical access, vulnerability management, and backup-restore controls in Vanta / Drata / Secureframe. Flag any new sub-processor that needs a signed DPA before launch.
-
Schedule the formal security review
Triggered when the compliance and data-residency step flagged PII/PHI/PCI or the topology touches a new trust boundary. Book AppSec for a 60-minute design walkthrough; bring the C4 diagrams and the threat model output.
Decision Records and Sign-Off
-
Write ADRs for the major trade-offs
One ADR per non-obvious decision: datastore choice, API style, sync vs. async, build-vs-buy. Use the Nygard format (Context → Decision → Consequences) and commit them to docs/adr/ in the repo so they show up in code review.
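A small generator keeps ADRs uniform. The sketch below renders the Nygard sections as Markdown; it also emits the Status section from Nygard's template, which the summary above does not list. The zero-padded file naming and the function names are hypothetical conventions, not something the checklist mandates.

```python
import re


def adr_filename(number, title):
    """Build a path like docs/adr/0007-some-title.md (naming is an assumption)."""
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-")
    return f"docs/adr/{number:04d}-{slug}.md"


def render_adr(number, title, context, decision, consequences, status="Proposed"):
    """Render an ADR with Status, Context, Decision, Consequences sections."""
    sections = [
        ("Status", status),
        ("Context", context),
        ("Decision", decision),
        ("Consequences", consequences),
    ]
    lines = [f"# {number}. {title}", ""]
    for heading, body in sections:
        lines += [f"## {heading}", "", body, ""]
    return "\n".join(lines)
```

For example, adr_filename(7, "Use Postgres as primary datastore") gives docs/adr/0007-use-postgres-as-primary-datastore.md, and the rendered text can be written to that path and committed alongside the code it explains.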
-
Hold the architecture review meeting
Invite staff engineers from adjacent domains, the platform team, and AppSec. Pre-read the design doc 48 hours ahead — meetings where reviewers see the doc for the first time produce surface-level feedback only.
-
Capture the review outcome and next steps
-
Schedule rework and re-review
If the review came back as rework, file the action items as Linear/Jira tickets, assign owners, and book the follow-up review one to two weeks out. Don't let unresolved review feedback drift into implementation; fixing it after the code is written costs roughly 5-10x more.