Software Architecture Design Checklist
Requirements and Constraints
Talk to product, support, sales engineering, and at least one customer-facing engineer. Capture user stories with acceptance criteria in Linear or Jira; vague asks like 'make it scalable' don't survive contact with a design review.
Pin down concrete targets: p99 latency, RPS at peak, RPO/RTO for disaster recovery, expected data volume in 12 and 36 months, SLO targets. 'Fast' and 'reliable' aren't requirements; '200ms p99 at 5k RPS, 99.9% monthly' is.
Confirm whether the system handles PHI (HIPAA), cardholder data (PCI), or EU resident data (GDPR). Data residency requirements often dictate region selection and sub-processor choices before the first architecture diagram is drawn.
High-Level Architecture
Use the C4 model (Context → Container → Component) so the diagrams make sense to non-architects. Tools: Structurizr, Excalidraw, or Mermaid in the repo. Show external systems, trust boundaries, and the request path for the top 2-3 user journeys.
Monolith, modular monolith, or services? Single region with read replicas, or multi-region active-active? Default to the simplest topology that meets the RPO/RTO from the requirements step; running multi-region well typically takes 6-12 months of added operational maturity.
Pin down the AWS / GCP / Azure region, compute (ECS/EKS/Lambda/Cloud Run), and managed services (RDS, ElastiCache, SQS, etc.). Cross-check against existing org accounts and Terraform modules; reusing the platform team's golden path saves months over rolling your own.
Every service needs a CODEOWNERS team and a PagerDuty rotation. Conway's Law applies — if no team will own a piece on day one, it will become an orphaned service in twelve months.
Data Architecture
ERD or DBML covering aggregates, ownership, and tenant boundaries. Decide single-tenant vs. shared-schema multi-tenant up front — retrofitting tenant isolation onto a shared schema is one of the most expensive migrations a SaaS team can take on.
Default to Postgres unless access patterns prove otherwise. Document the access pattern that justifies DynamoDB / MongoDB / Cassandra if you go non-relational. Mixing too many engines is a hidden tax — each one needs backup, monitoring, and on-call expertise.
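For replacement systems: dual-write, shadow-read, or stop-the-world cutover? Sketch the rollback. Watch schema changes on large tables: on Postgres before version 11, or with a volatile default on any version, adding a column with a default to a 50M-row table rewrites the whole table under an exclusive lock. Adding the column as nullable and filling it with a batched backfill with sleeps is the safer pattern.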
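The batched backfill mentioned above can be sketched roughly as follows. This is a minimal illustration, not a production migration: it uses an in-memory SQLite database so it runs anywhere, and the `accounts` table, the `plan` column, and the function names are hypothetical. The same shape (small batch, commit, pause, repeat) is what you would adapt for Postgres.

```python
import sqlite3
import time


def backfill_in_batches(conn, batch_size=1000, pause_seconds=0.0):
    """Fill accounts.plan where it is NULL, one small transaction per batch."""
    total = 0
    while True:
        cur = conn.execute(
            "UPDATE accounts SET plan = 'free' "
            "WHERE id IN (SELECT id FROM accounts WHERE plan IS NULL LIMIT ?)",
            (batch_size,),
        )
        conn.commit()  # short transactions keep lock hold times small
        if cur.rowcount == 0:
            return total  # nothing left to backfill
        total += cur.rowcount
        time.sleep(pause_seconds)  # leave room for other writers and replicas


def demo():
    """Hypothetical demo: 2,500 rows backfilled in batches of 1,000."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE accounts (id INTEGER PRIMARY KEY, plan TEXT)")
    conn.executemany(
        "INSERT INTO accounts (id) VALUES (?)", [(i,) for i in range(2500)]
    )
    conn.commit()
    updated = backfill_in_batches(conn, batch_size=1000)
    remaining = conn.execute(
        "SELECT COUNT(*) FROM accounts WHERE plan IS NULL"
    ).fetchone()[0]
    return updated, remaining
```

Because the loop only selects rows that are still NULL, it can be stopped and restarted without tracking an offset, which also makes the rollback story simpler.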
Backup that has never been restored is not a backup. Schedule a quarterly restore drill into a non-prod environment; the backup-success metric being green for 18 months tells you nothing about whether the restore actually works.
API and Integration Design
REST + OpenAPI 3.1, gRPC + protobuf, or GraphQL? Match to consumer needs, not preference. Public APIs almost always REST; internal high-throughput service-to-service often gRPC; product-facing aggregation often GraphQL. Commit the schema to the repo so PR diffs show breaking changes.
URI versioning (/v1/) vs. header versioning, semver for client SDKs, and a deprecation window (commonly 6-12 months for public APIs). Without a written policy, every breaking change becomes an ad-hoc negotiation with the largest customer.
OAuth2 / OIDC for user-facing, signed JWTs or mTLS for service-to-service. Define rate-limit tiers per API key. Mutating endpoints take an Idempotency-Key header (Stripe pattern) so retries don't double-charge.
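A minimal sketch of the Idempotency-Key behavior described above, assuming an in-memory dict in place of a durable store with a TTL. `ChargeService`, `create_charge`, and the response fields are hypothetical names for illustration and do not reflect any real payment API.

```python
from dataclasses import dataclass, field


@dataclass
class ChargeService:
    """Hypothetical handler; the dict stands in for a durable key store."""

    responses: dict = field(default_factory=dict)
    charges_made: int = 0

    def create_charge(self, idempotency_key: str, amount_cents: int) -> dict:
        # Retry with a known key: replay the stored response, do not charge again.
        if idempotency_key in self.responses:
            return self.responses[idempotency_key]
        self.charges_made += 1
        response = {
            "charge_id": f"ch_{self.charges_made}",
            "amount_cents": amount_cents,
        }
        self.responses[idempotency_key] = response
        return response
```

A real implementation would also need an atomic check-and-set on the key, and would likely reject a reused key whose request body differs from the original.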
If using SQS, Kafka, or EventBridge: define topic naming, partition keys, schema registry (Avro/Protobuf), and retention. Decide DLQ handling and replay strategy. Document at-least-once semantics and the consumer's idempotency expectation.
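To make the at-least-once and DLQ decisions concrete, here is a small sketch of a consumer loop. It is broker-agnostic and entirely in memory: `consume`, the `(event_id, payload)` delivery shape, and `max_attempts` are assumptions for illustration, not the API of SQS, Kafka, or EventBridge.

```python
def consume(deliveries, handler, max_attempts=3):
    """Process (event_id, payload) deliveries that may contain redeliveries."""
    done = set()        # stands in for a durable dedupe table
    failures = {}       # event_id -> failed attempts so far
    dead_ids = set()
    dead_letters = []   # stands in for a dead-letter queue
    for event_id, payload in deliveries:
        if event_id in done or event_id in dead_ids:
            continue  # duplicate delivery, or already parked in the DLQ
        try:
            handler(payload)
        except Exception:
            failures[event_id] = failures.get(event_id, 0) + 1
            if failures[event_id] >= max_attempts:
                dead_letters.append((event_id, payload))
                dead_ids.add(event_id)
        else:
            done.add(event_id)
    return done, dead_letters
```

The dedupe check is what lets the handler stay simple under redelivery; the dead-letter list keeps one poison message from blocking the rest, at the cost of needing a documented replay procedure.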
Reliability and Observability
Pick 2-4 SLIs per critical user journey (latency, availability, freshness, correctness) with explicit SLO targets. The error budget — derived from the SLO — is what decides whether the next release goes out or the team focuses on reliability work.
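The error budget arithmetic is simple enough to sketch. The function names below are hypothetical, and the example assumes a 30-day window and a request-based availability SLI; adjust both to match the SLO you actually wrote down.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Minutes of full downtime the SLO allows over the window."""
    return (1.0 - slo) * window_days * 24 * 60


def budget_remaining(slo: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of the request-based error budget still unspent (can go negative)."""
    allowed_failures = (1.0 - slo) * total_requests
    return 1.0 - failed_requests / allowed_failures
```

For a 99.9% SLO over 30 days, `error_budget_minutes(0.999)` comes to roughly 43.2 minutes. With 1,000,000 requests and 250 failures, `budget_remaining(0.999, 1_000_000, 250)` is about 0.75, meaning a quarter of the budget is spent.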
Pick the stack — Datadog, New Relic, or self-hosted Prometheus + Grafana + Loki + Tempo. Cover the RED method (Rate, Errors, Duration) per service. Watch cardinality on metric labels; runaway tag combinations blow up bills overnight.
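A quick way to sanity-check label cardinality before shipping a metric is to multiply the number of distinct values per label. The sketch below computes that worst-case upper bound; the function name and label counts are made up for illustration, and real series counts are usually lower because not every combination occurs.

```python
from math import prod


def max_series(label_value_counts: dict[str, int]) -> int:
    """Upper bound on time series for one metric: product of per-label value counts."""
    return prod(label_value_counts.values())


# Hypothetical label sets for a request-duration metric.
bounded = {"service": 20, "endpoint": 50, "status": 5}
with_user_id = {**bounded, "user_id": 10_000}
```

Here `max_series(bounded)` is 5,000, while `max_series(with_user_id)` is 50,000,000, which is why per-user or per-request identifiers belong in logs or traces rather than metric labels.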
k6 or Locust scripts targeting the SLO RPS, run against a prod-shaped environment. Identify the saturation point (USE method) and document the auto-scaling policy that kicks in before saturation.
Security and Compliance Review
Walk the data-flow diagram with security and at least one outside engineer. STRIDE prompts (Spoofing, Tampering, Repudiation, Info disclosure, DoS, Elevation of privilege) catch what a code-only review misses. Track mitigations as tickets, not bullet points.
AWS Secrets Manager / Vault for runtime secrets, KMS for envelope encryption, no secrets in environment variables baked into images. Pre-commit hook (gitleaks or trufflehog) plus GitHub secret scanning so a leaked secret is caught at the commit that introduces it, not months later.
Walk through change management, logical access, vulnerability management, and backup-restore controls in Vanta / Drata / Secureframe. Flag any new sub-processor that needs a signed DPA before launch.
Triggered when the data-classification step flagged PII/PHI/PCI or the topology touches a new trust boundary. Book AppSec for a 60-minute design walkthrough; bring the C4 diagrams and the threat model output.
Decision Records and Sign-Off
One ADR per non-obvious decision: datastore choice, API style, sync vs. async, build-vs-buy. Use the Nygard format (Context → Decision → Consequences) and commit them to docs/adr/ in the repo so they show up in code review.
Invite staff engineers from adjacent domains, the platform team, and AppSec. Pre-read the design doc 48 hours ahead — meetings where reviewers see the doc for the first time produce surface-level feedback only.
If the review came back as rework, file the action items as Linear/Jira tickets, assign owners, and book the follow-up review one to two weeks out. Don't let unresolved review feedback drift into implementation — the cost of fixing it post-code is 5-10x.
