Software Architecture Design Checklist

A staff engineer or tech lead runs this checklist when scoping a new system or a significant redesign — from requirements through design review, data and API decisions, deployment, and the architecture decision record (ADR) trail.

7 sections 27 steps Collects data
1

Requirements and Constraints

  1. Interview stakeholders for functional requirements
    • Talk to product, support, sales engineering, and at least one customer-facing engineer. Capture user stories with acceptance criteria in Linear or Jira; vague asks like 'make it scalable' don't survive contact with a design review.

  2. Capture non-functional requirements with numbers
    • Pin down concrete targets: p99 latency, RPS at peak, RPO/RTO for disaster recovery, expected data volume in 12 and 36 months, SLO targets. 'Fast' and 'reliable' aren't requirements; '200ms p99 at 5k RPS, 99.9% monthly availability' is.

  3. Identify compliance and data-residency constraints
    • Confirm whether the system handles PHI (HIPAA), cardholder data (PCI), or EU resident data (GDPR). Data residency requirements often dictate region selection and sub-processor choices before the first architecture diagram is drawn.

    Collects list Collects text
  4. Get product and engineering sign-off on scope
    Collects signature
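A numeric target from step 2 is only useful if it can be checked mechanically. The following is a minimal sketch, assuming latency samples in milliseconds are already available from a load test or from production metrics; the function names and the sample data are hypothetical. It uses the nearest-rank definition of a percentile, which is one of several common definitions.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample such that at least
    pct percent of all samples are less than or equal to it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    # 1-based rank; integer multiplication first avoids float rounding surprises.
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def meets_latency_target(samples_ms, target_ms=200.0, pct=99):
    """True when the chosen percentile is at or below the target."""
    return percentile(samples_ms, pct) <= target_ms


# Hypothetical data: 98 fast requests, one slower, one outlier.
samples = [120.0] * 98 + [180.0, 450.0]
print(percentile(samples, 99))        # 180.0
print(meets_latency_target(samples))  # True
```

Note that a single 450 ms outlier in 100 requests does not break a p99 target, which is one reason to agree on the percentile, and not only the threshold, during requirements capture.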
2

High-Level Architecture

  1. Draft the C4 context and container diagrams
    • Use the C4 model (Context → Container → Component) so the diagrams make sense to non-architects. Tools: Structurizr, Excalidraw, or Mermaid in the repo. Show external systems, trust boundaries, and the request path for the top 2-3 user journeys.

    Collects file
  2. Choose the deployment topology
    • Monolith, modular monolith, or services? Single region with read replicas, or multi-region active-active? Default to the simplest topology that meets the RPO/RTO from the requirements step; running multi-region well typically takes a team 6-12 months of additional operational maturity.

    Collects list
  3. Select cloud provider and core managed services
    • Pin down the AWS / GCP / Azure region, compute (ECS/EKS/Lambda/Cloud Run), and managed services (RDS, ElastiCache, SQS, etc.). Cross-check against existing org accounts and Terraform modules; reusing the platform team's golden path saves months over rolling your own.

  4. Map service ownership to teams
    • Every service needs a CODEOWNERS team and a PagerDuty rotation. Conway's Law applies — if no team will own a piece on day one, it will become an orphaned service in twelve months.

3

Data Architecture

  1. Model the core domain entities
    • ERD or DBML covering aggregates, ownership, and tenant boundaries. Decide single-tenant vs. shared-schema multi-tenant up front — retrofitting tenant isolation onto a shared schema is one of the most expensive migrations a SaaS team can take on.

  2. Choose the primary datastore
    • Default to Postgres unless access patterns prove otherwise. Document the access pattern that justifies DynamoDB / MongoDB / Cassandra if you go non-relational. Mixing too many engines is a hidden tax — each one needs backup, monitoring, and on-call expertise.

    Collects list
  3. Plan migration and backfill strategy
    • For replacement systems: dual-write, shadow-read, or stop-the-world cutover? Sketch the rollback. On Postgres versions before 11, or with a volatile default on any version, adding a column with a default to a 50M-row table rewrites the whole table under an ACCESS EXCLUSIVE lock. Adding the column as nullable and then running a batched backfill with sleeps is the safer pattern.

  4. Define backup, RPO, and restore-test cadence
    • A backup that has never been restored is not a backup. Schedule a quarterly restore drill into a non-prod environment; the backup-success metric being green for 18 months tells you nothing about whether the restore actually works.
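The batched backfill mentioned in step 3 can be sketched as a loop of short transactions. This is an illustration under stated assumptions, not a production migration: the `accounts` table, the `plan` column, and the `'free'` value are hypothetical, and `sqlite3` stands in for Postgres only so the example runs anywhere. Against Postgres the same loop would go through a database driver, with batch size and pause tuned against replication lag and lock waits.

```python
import sqlite3
import time


def backfill_in_batches(conn, batch_size=1000, pause_s=0.1):
    """Fill accounts.plan where it is NULL, one short transaction per batch.

    Returns the total number of rows updated.
    """
    total = 0
    while True:
        cur = conn.execute(
            "UPDATE accounts SET plan = 'free' "
            "WHERE id IN (SELECT id FROM accounts "
            "             WHERE plan IS NULL ORDER BY id LIMIT ?)",
            (batch_size,),
        )
        conn.commit()  # commit per batch so locks are held only briefly
        if cur.rowcount == 0:
            return total  # nothing left to backfill
        total += cur.rowcount
        time.sleep(pause_s)  # give other traffic room between batches


# Hypothetical setup: 2,500 rows with the new column still NULL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, plan TEXT)")
conn.executemany("INSERT INTO accounts (id) VALUES (?)",
                 [(i,) for i in range(1, 2501)])
conn.commit()

updated = backfill_in_batches(conn, batch_size=1000, pause_s=0)
remaining = conn.execute(
    "SELECT COUNT(*) FROM accounts WHERE plan IS NULL").fetchone()[0]
print(updated, remaining)  # 2500 0
```

Because the loop selects only rows that are still NULL, it can be stopped and restarted without redoing work, which also makes the rollback story simpler.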

4

API and Integration Design

  1. Pick the API style and document contracts
    • REST + OpenAPI 3.1, gRPC + protobuf, or GraphQL? Match to consumer needs, not preference. Public APIs are almost always REST; internal high-throughput service-to-service traffic is often gRPC; product-facing aggregation is often GraphQL. Commit the schema to the repo so PR diffs show breaking changes.

    Collects list
  2. Define versioning and deprecation policy
    • URI versioning (/v1/) vs. header versioning, semver for client SDKs, and a deprecation window (commonly 6-12 months for public APIs). Without a written policy, every breaking change becomes an ad-hoc negotiation with the largest customer.

  3. Specify auth, rate limits, and idempotency
    • OAuth2 / OIDC for user-facing, signed JWTs or mTLS for service-to-service. Define rate-limit tiers per API key. Mutating endpoints take an Idempotency-Key header (Stripe pattern) so retries don't double-charge.

  4. Design the async event contracts
    • If using SQS, Kafka, or EventBridge: define topic naming, partition keys, schema registry (Avro/Protobuf), and retention. Decide DLQ handling and replay strategy. Document at-least-once semantics and the consumer's idempotency expectation.
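At-least-once delivery means a consumer will sometimes see the same event twice, so the idempotency expectation has to be implemented somewhere. The sketch below shows one way to do it, assuming every event carries a unique ID; the `Event` and `Consumer` classes, the field names, and the in-memory stores are all hypothetical. In a real system the processed-ID record would live in a durable store, ideally written in the same transaction as the side effect.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    event_id: str
    payload: dict


@dataclass
class Consumer:
    max_attempts: int = 3
    processed_ids: set = field(default_factory=set)    # stand-in for a durable store
    dead_letters: list = field(default_factory=list)   # stand-in for a DLQ
    balances: dict = field(default_factory=dict)

    def apply(self, event):
        """The side effect; raises before mutating state if the payload is bad."""
        account = event.payload["account"]
        self.balances[account] = self.balances.get(account, 0) + event.payload["amount"]

    def handle(self, event):
        # Duplicate delivery: acknowledge without repeating the side effect.
        if event.event_id in self.processed_ids:
            return "duplicate"
        for _ in range(self.max_attempts):
            try:
                self.apply(event)
            except (KeyError, TypeError):
                continue  # retry; a real consumer would back off here
            self.processed_ids.add(event.event_id)
            return "processed"
        # Retries exhausted: park the event for inspection and replay.
        self.dead_letters.append(event)
        return "dead-lettered"


consumer = Consumer()
deposit = Event("evt-1", {"account": "acct-a", "amount": 50})
malformed = Event("evt-2", {"account": "acct-a"})  # missing "amount"

print(consumer.handle(deposit))    # processed
print(consumer.handle(deposit))    # duplicate
print(consumer.handle(malformed))  # dead-lettered
print(consumer.balances)           # {'acct-a': 50}
```

The same deduplicate-by-key idea underlies the Idempotency-Key header from step 3; the difference is that the key there is chosen by the HTTP client instead of the event producer.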

5

Reliability and Observability

  1. Define SLOs and error budgets
    • Pick 2-4 SLIs per critical user journey (latency, availability, freshness, correctness) with explicit SLO targets. The error budget — derived from the SLO — is what decides whether the next release goes out or the team focuses on reliability work.

  2. Specify metrics, logs, and traces
    • Pick the stack — Datadog, New Relic, or self-hosted Prometheus + Grafana + Loki + Tempo. Cover the RED method (Rate, Errors, Duration) per service. Watch cardinality on metric labels; runaway tag combinations blow up bills overnight.

  3. Plan capacity and load testing
    • k6 or Locust scripts targeting the SLO RPS, run against a prod-shaped environment. Identify the saturation point (USE method) and document the auto-scaling policy that kicks in before saturation.
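The error budget from step 1 is plain arithmetic on the SLO target. As a worked sketch, assuming an availability SLO measured over a 30-day window (the window length and the function names are assumptions, not something the checklist prescribes):

```python
def error_budget_minutes(slo, window_days=30):
    """Minutes of allowed unavailability for an availability SLO such as 0.999."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    return (1.0 - slo) * window_days * 24 * 60


def budget_remaining(slo, downtime_minutes, window_days=30):
    """Fraction of the error budget still unspent; negative means overspent."""
    budget = error_budget_minutes(slo, window_days)
    return (budget - downtime_minutes) / budget


print(f"{error_budget_minutes(0.999):.1f}")    # 43.2
print(f"{error_budget_minutes(0.9999):.2f}")   # 4.32
print(f"{budget_remaining(0.999, 21.6):.2f}")  # 0.50
```

At 99.9% the team has about 43 minutes per 30 days to spend on incidents and risky releases; each additional nine divides that by ten, which is worth seeing in numbers before committing to a target.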

6

Security and Compliance Review

  1. Run a STRIDE threat model session
    • Walk the data-flow diagram with security and at least one engineer from outside the team. STRIDE prompts (Spoofing, Tampering, Repudiation, Info disclosure, DoS, Elevation of privilege) catch what a code-only review misses. Track mitigations as tickets, not bullet points.

  2. Review secrets and key management plan
    • AWS Secrets Manager / Vault for runtime secrets, KMS for envelope encryption, no secrets in environment variables baked into images. Pre-commit hook (gitleaks or trufflehog) plus GitHub secret scanning to catch a leaked secret at the commit that introduces it.

  3. Confirm SOC 2 control coverage
    • Walk through change management, logical access, vulnerability management, and backup-restore controls in Vanta / Drata / Secureframe. Flag any new sub-processor that needs a signed DPA before launch.

    Collects list
  4. Schedule the formal security review
    • Triggered when the compliance and data-residency step flagged PII/PHI/PCI or the topology touches a new trust boundary. Book AppSec for a 60-minute design walkthrough; bring the C4 diagrams and the threat model output.
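To make the secret-scanning idea in step 2 concrete, here is a toy illustration of the kind of pattern match such a hook performs. It is not a substitute for gitleaks or trufflehog, which ship far larger rule sets; it checks only for strings shaped like long-term AWS access key IDs (the `AKIA` prefix followed by 16 uppercase letters or digits). The function name is hypothetical, and the sample key is the placeholder value used in AWS documentation.

```python
import re

# Long-term AWS access key IDs: "AKIA" plus 16 uppercase alphanumerics.
AWS_ACCESS_KEY_ID = re.compile(r"\bAKIA[0-9A-Z]{16}\b")


def find_suspect_lines(text):
    """Return 1-based line numbers that contain something shaped like a key ID."""
    return [
        number
        for number, line in enumerate(text.splitlines(), start=1)
        if AWS_ACCESS_KEY_ID.search(line)
    ]


staged_diff = (
    'region = "us-east-1"\n'
    'aws_access_key_id = "AKIAIOSFODNN7EXAMPLE"\n'
    'timeout_seconds = 30\n'
)
print(find_suspect_lines(staged_diff))  # [2]
```

A real hook would run against the staged diff and exit non-zero on any hit, so the commit is blocked before the secret reaches the remote.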

7

Decision Records and Sign-Off

  1. Write ADRs for the major trade-offs
    • One ADR per non-obvious decision: datastore choice, API style, sync vs. async, build-vs-buy. Use the Nygard format (Context → Decision → Consequences) and commit them to docs/adr/ in the repo so they show up in code review.

    Collects file
  2. Hold the architecture review meeting
    • Invite staff engineers from adjacent domains, the platform team, and AppSec. Send the design doc as a pre-read 48 hours ahead; meetings where reviewers see the doc for the first time produce surface-level feedback only.

  3. Capture the review outcome and next steps
    Collects list Collects paragraph Collects signature
  4. Schedule rework and re-review
    • If the review came back as rework, file the action items as Linear/Jira tickets, assign owners, and book the follow-up review one to two weeks out. Don't let unresolved review feedback drift into implementation; the cost of fixing it after the code is written is often estimated at 5-10x.
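For the ADRs in step 1, a small generator keeps the files consistent. The sketch below renders the Context, Decision, and Consequences sections named above, plus the Status section that Nygard's original template also includes. The Markdown heading style, the zero-padded file naming, and the function names are assumptions made for illustration; adapt them to whatever the repo already uses.

```python
import re
from datetime import date


def adr_filename(number, title):
    """Build a path such as docs/adr/0007-use-postgres.md (naming is an assumption)."""
    slug = "-".join(re.findall(r"[a-z0-9]+", title.lower()))
    return f"docs/adr/{number:04d}-{slug}.md"


def render_adr(number, title, context, decision, consequences,
               status="Proposed", decided_on=None):
    """Render one ADR as Markdown text with Nygard-style sections."""
    decided_on = decided_on or date.today()
    sections = [
        f"# {number}. {title}",
        f"Date: {decided_on.isoformat()}",
        "## Status", status,
        "## Context", context,
        "## Decision", decision,
        "## Consequences", consequences,
    ]
    return "\n\n".join(sections) + "\n"


# Hypothetical example content.
text = render_adr(
    7,
    "Use Postgres as the primary datastore",
    context="Access patterns are relational and the team already operates Postgres.",
    decision="Store core domain entities in Postgres.",
    consequences="One engine to back up, monitor, and staff on call.",
    decided_on=date(2024, 1, 15),
)
print(adr_filename(7, "Use Postgres as the primary datastore"))
# docs/adr/0007-use-postgres-as-the-primary-datastore.md
```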
