API Development Checklist
Planning and Design
The product manager and tech lead document who calls this API — internal services, first-party mobile/web clients, third-party partners — and the top three jobs each consumer needs done. Different audiences drive different decisions on auth, rate limits, and SLA.
Write the OpenAPI 3.x spec before any handler code — resources, paths, request/response schemas, error envelopes. Lint with Spectral against your team's ruleset (consistent casing, required pagination params, error shape). Commit the YAML to the repo so PRs diff against it.
Pick based on consumer: OAuth 2.0 authorization code + PKCE for user-facing third-party apps, client credentials for server-to-server, signed JWTs for first-party services, API keys only for low-risk public read endpoints. Document token lifetime, refresh, and revocation up front.
Walk consumers, security, and SRE through the spec. Common findings at this stage: missing pagination on list endpoints, inconsistent error envelopes, PII leaking into URLs, no idempotency keys on POSTs that retry.
Security and Auth
Register the resource server with Auth0, Okta, Cognito, or your identity provider. Define granular scopes per resource action (e.g., orders:read, orders:write) — broad scopes like admin are flagged at SOC 2 access review.
For API-key consumers: hash keys at rest, store the plaintext only in AWS Secrets Manager or Vault for the consumer to copy once. Pre-commit hooks (gitleaks, trufflehog) plus GitHub secret scanning prevent the original commit from leaking.
Validate every field against the OpenAPI schema at the framework boundary (zod, pydantic, Joi, FluentValidation). Cover the OWASP API Top 10 — broken object level authorization, mass assignment, SSRF on URL fields. Reject unknown properties; do not trust client-supplied IDs.
TLS 1.2+ only at the ALB / API Gateway / Cloudflare edge; HSTS header on. Configure per-consumer rate limits (token bucket) and a global circuit breaker. Surface 429 with Retry-After rather than letting backends crater.
Wire Semgrep or CodeQL into CI for SAST; Snyk or Dependabot for SCA. Block the merge on any unresolved high or critical CVE. Generate an SBOM (CycloneDX) on every release build.
Development and Testing
Gate the new routes behind a LaunchDarkly / Unleash / internal flag so the code can ship to main without being externally reachable. Assign the flag a named owner and a removal date — orphaned flags pile up fast.
Unit tests on handler logic; contract tests (Pact, Dredd, schemathesis) that replay the OpenAPI spec against the running service. Contract tests catch the drift case where someone edits the handler but forgets the spec.
Use k6 or Locust to drive expected peak traffic plus 2x headroom. Watch p95/p99 latency, error rate, and DB connection pool saturation. Common failure mode: a single N+1 query that's invisible at low load and tips over at 50 RPS.
Generate the Postman collection from the OpenAPI spec and run the happy-path flow end-to-end as a real consumer would. Catches surprises in auth setup, base URL, and example payloads that automated tests miss.
Documentation and Versioning
Render the spec via Redoc, Stoplight, or Mintlify into the developer portal. Include auth setup, a quickstart curl example per resource, and the error code table — not just the auto-generated schema.
Breaking: removing or renaming a field, tightening validation, changing response shape, removing an endpoint. Non-breaking: adding optional fields, adding endpoints, relaxing validation. When in doubt, treat it as breaking.
Bump the URL version (/v2/) or media type. Mark the prior version deprecated in the spec, add a Sunset header with the removal date (minimum 6 months out), and email known consumers with the migration guide.
Public CHANGELOG.md entry under the new semver tag. Group entries as Added / Changed / Deprecated / Removed / Fixed / Security. Link each entry to the ticket so support can trace customer questions back to the change.
Release and Rollout
Argo Rollouts or your gateway's traffic-split routes 5% to the new revision. Watch error rate and p99 for at least 15 minutes before progressing. Promote 25% → 50% → 100% with checks between.
Previous container image still in the registry (not pruned), DB migration is reversible or feature-flag-gated, kubectl rollout undo tested in staging this quarter. Rollback plans that have never been run usually fail when needed.
Monitoring and Post-Launch
Datadog, Grafana, or Honeycomb panels for latency (p50/p95/p99), traffic (RPS), errors (4xx vs 5xx split), saturation (CPU, memory, DB pool). Pin the dashboard URL in the service's runbook.
Set availability and latency SLOs (e.g., 99.9% availability, p99 < 400ms). Burn-rate alerts page on-call via PagerDuty when budget consumption is unsustainable, not on every transient blip.
Runbook in the service catalog (Backstage or wiki) covering common alerts, dashboard links, escalation path, and break-glass procedures. Walk the on-call rotation through it live — runbooks nobody has read are runbooks nobody can follow at 3am.
Review adoption metrics, top error endpoints, latency outliers, and consumer feedback. File tickets for any technical debt accrued during launch — naming an owner and a sprint, not a vague backlog tag.
