Automated Testing Checklist

Test Planning

    The QA lead writes down what the suite is responsible for proving, such as happy-path checkout, auth flows, and billing webhooks, and what is explicitly out of scope (load tests, visual regression). Define exit criteria as a definition of done: e.g., all P0 user paths covered, suite runs under 15 minutes, flake rate below 2%.

    Match the runner to the language and surface: Playwright or Cypress for browser e2e, Jest/Vitest for JS unit, pytest for Python, RSpec for Ruby, k6 or Locust for load. Avoid mixing two e2e frameworks: pick one and commit, because cross-framework debugging is a common reason teams abandon automation.

    Pull the top 10 user flows from product analytics (Amplitude, Mixpanel, or your warehouse). Sign-up, login, primary CRUD operation, payment, and any flow tied to a customer SLA belong in the first batch. Capture the list as the input to test-case writing.

    Pick numbers the team will defend at retro: e.g., e2e suite under 15 min, unit suite under 3 min, flake rate under 2% measured over the last 100 runs. Without explicit budgets, suites drift to 45 minutes and engineers stop trusting them.
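
    A budget is easier to defend when a script can check it. The sketch below is one possible shape for such a check, using the example numbers from the paragraph above. The names `SuiteBudget`, `SuiteMetrics`, and `checkBudget` are hypothetical, and counting a run as flaky when it needed a retry to pass is an assumption, not something the checklist prescribes.

```typescript
interface SuiteBudget {
  maxE2eMinutes: number;
  maxUnitMinutes: number;
  maxFlakeRate: number; // fraction of runs, 0.02 means 2%
}

interface SuiteMetrics {
  e2eMinutes: number;
  unitMinutes: number;
  flakyRuns: number; // assumption: runs in the window that needed a retry to pass
  totalRuns: number; // window size, e.g. the last 100 runs
}

// Example budget taken from the numbers in the text above.
const budget: SuiteBudget = { maxE2eMinutes: 15, maxUnitMinutes: 3, maxFlakeRate: 0.02 };

// Returns one message per violated budget; an empty array means "within budget".
function checkBudget(m: SuiteMetrics, b: SuiteBudget): string[] {
  const violations: string[] = [];
  const flakeRate = m.totalRuns > 0 ? m.flakyRuns / m.totalRuns : 0;

  // "Under N" is treated as strictly less than N, so hitting the limit is a violation.
  if (m.e2eMinutes >= b.maxE2eMinutes) {
    violations.push(`e2e suite took ${m.e2eMinutes} min (budget: under ${b.maxE2eMinutes})`);
  }
  if (m.unitMinutes >= b.maxUnitMinutes) {
    violations.push(`unit suite took ${m.unitMinutes} min (budget: under ${b.maxUnitMinutes})`);
  }
  if (flakeRate >= b.maxFlakeRate) {
    violations.push(`flake rate ${flakeRate} (budget: under ${b.maxFlakeRate})`);
  }
  return violations;
}
```

    A CI step could call `checkBudget` with the latest metrics and fail the job when the returned array is non-empty.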

Test Environment Setup

    The platform engineer stands up a staging tier mirroring production — same Postgres major version, same Redis topology, same feature flags defaulted off. Ephemeral per-PR environments (Vercel preview, Render preview, or k8s namespace per branch) are ideal; a single shared staging is acceptable but tests will collide.

    Add the test job to GitHub Actions, GitLab CI, CircleCI, or Buildkite. Make it a required status check on the main branch ruleset so PRs cannot merge with red CI — the most common automation failure mode is engineers normalizing red builds.

    Use factories (FactoryBot, factory_boy, Fishery) or fixture files checked into the repo. Never depend on records that exist in shared staging — they get mutated, deleted, or rotated. Each test should create the data it needs and tear down after itself.
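
    To illustrate the factory idea without tying it to a particular library, here is a minimal hand-rolled sketch. It is not the API of FactoryBot, factory_boy, or Fishery; the `User` shape and the `buildUser` helper are hypothetical.

```typescript
interface User {
  id: number;
  email: string;
  role: "admin" | "member";
}

let nextId = 1;

// Builds a fresh user with unique defaults. Callers override only the
// fields the test actually cares about.
function buildUser(overrides: Partial<User> = {}): User {
  const id = nextId++;
  return {
    id,
    email: `user${id}@example.test`,
    role: "member",
    ...overrides, // overrides win over the defaults above
  };
}

// Each test builds its own records instead of relying on shared staging data.
const admin = buildUser({ role: "admin" });
const member = buildUser();
```

    Because every call produces a distinct `id` and `email`, two tests running in parallel do not collide on the same record.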

    Store API keys for Stripe test mode, Twilio test credentials, and OAuth client secrets in GitHub Actions secrets, AWS Secrets Manager, or Vault. Never commit a `.env.test` with real keys; running gitleaks or trufflehog as a pre-commit hook catches the slip.

Writing Test Cases

    Document the convention in CONTRIBUTING.md: e.g., `feature.behavior.spec.ts` for e2e, `module.test.ts` for unit. The test description should read as a sentence — `it('rejects expired JWTs at /api/v2/orders')` — so failures in CI are self-explanatory without opening the file.
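
    A naming convention tends to hold only if something checks it. The following sketch classifies file names against the two example patterns above. The restriction to lowercase, kebab-case segments is an assumption added for illustration, and `classifyTestFile` is a hypothetical helper.

```typescript
// Assumption: segments are lowercase letters, digits, and hyphens.
const E2E_PATTERN = /^[a-z0-9-]+\.[a-z0-9-]+\.spec\.ts$/; // feature.behavior.spec.ts
const UNIT_PATTERN = /^[a-z0-9-]+\.test\.ts$/; // module.test.ts

type TestKind = "e2e" | "unit" | "unknown";

// Classifies a bare file name (no directory part) against the convention.
function classifyTestFile(fileName: string): TestKind {
  if (E2E_PATTERN.test(fileName)) return "e2e";
  if (UNIT_PATTERN.test(fileName)) return "unit";
  return "unknown";
}
```

    A lint step could fail the build for any test file that comes back as `"unknown"`.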

    No test should depend on the order another ran in. Use `beforeEach` to set up state and `afterEach` (or DB transaction rollback) to tear down. Order-dependent suites break the moment you parallelize across runners.

    Use msw, nock, or VCR cassettes for HTTP. Tests that hit real Stripe, SendGrid, or partner APIs are the top source of flakes — rate limits, network blips, and sandbox outages all show up as red CI. Reserve real-API tests for a nightly contract-test job, not the PR pipeline.
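
    The sketch below shows the underlying idea with a hand-rolled fake rather than msw, nock, or VCR: the code under test receives its transport as a parameter, and the test passes in canned responses. All names here (`Transport`, `getChargeStatus`, `fakeTransport`, the `payments.example.test` URL) are hypothetical.

```typescript
// Minimal transport contract; production code would wrap fetch or an SDK here.
interface HttpResponse {
  status: number;
  body: string;
}
type Transport = (method: string, url: string) => Promise<HttpResponse>;

// Hypothetical code under test: it reaches the network only through `transport`.
async function getChargeStatus(transport: Transport, chargeId: string): Promise<string> {
  const res = await transport("GET", `https://payments.example.test/charges/${chargeId}`);
  if (res.status !== 200) throw new Error(`unexpected status ${res.status}`);
  const parsed = JSON.parse(res.body) as { status: string };
  return parsed.status;
}

// Test double: serves canned responses and records calls, so no request
// leaves the process. Unknown requests get a 404 instead of hitting the network.
function fakeTransport(canned: Map<string, HttpResponse>) {
  const calls: string[] = [];
  const transport: Transport = async (method, url) => {
    const key = `${method} ${url}`;
    calls.push(key);
    return canned.get(key) ?? { status: 404, body: "{}" };
  };
  return { transport, calls };
}
```

    A library such as msw or nock intercepts at the HTTP layer instead, which avoids changing function signatures; the injected fake is simply the smallest version of the same isolation.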

    When an e2e test fails in CI without local repro, the screenshot, DOM snapshot, network HAR, and console log are what make triage tractable. Playwright's trace viewer and Cypress's video output are the baseline; configure them to upload as CI artifacts on failure.

Test Execution

    Configure CI to trigger on `pull_request` and `push` to main. Split the matrix: unit tests on every PR, full e2e on PRs touching the relevant package, nightly cron for the whole suite against staging. Required status checks block merge until green.

    Playwright supports `--shard 1/4` natively; Cypress has parallelization via Cypress Cloud or Sorry-Cypress. Headless Chromium is roughly 2x faster than headed and behaves identically for 95% of tests. Reserve headed mode for local debugging only.
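
    To make the `1/4` notation concrete, the sketch below shows the property any sharding scheme has to guarantee: every test file lands in exactly one shard. It uses a simple round-robin split over a sorted file list, which is an illustration only and not a description of how Playwright or Cypress assign tests internally. `filesForShard` is a hypothetical helper.

```typescript
// Returns the files belonging to shard `shardIndex` of `shardTotal` (1-based,
// mirroring the "1/4" notation). Sorting first makes the split deterministic,
// so every runner computes the same partition from the same file list.
function filesForShard(files: string[], shardIndex: number, shardTotal: number): string[] {
  if (!Number.isInteger(shardTotal) || shardTotal < 1) {
    throw new RangeError("shardTotal must be a positive integer");
  }
  if (!Number.isInteger(shardIndex) || shardIndex < 1 || shardIndex > shardTotal) {
    throw new RangeError("shardIndex must be between 1 and shardTotal");
  }
  const sorted = [...files].sort();
  return sorted.filter((_, i) => i % shardTotal === shardIndex - 1);
}
```

    A round-robin split balances file counts, not runtime; a suite with a few very slow files may need a duration-aware split instead.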

    Tools like Datadog CI Visibility, BuildPulse, or Trunk Flaky Tests ingest JUnit XML and surface the slowest 1% and the flakiest tests. Without this data, the suite degrades silently — engineers blame infra rather than spotting the one test that fails 1 in 30 runs.
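
    For teams without one of these tools, the core calculation can be sketched in a few lines. The example assumes test results have already been parsed (for instance from JUnit XML) into plain records; the `TestResult` shape and `rankFlakyTests` are hypothetical, and "flaky" is defined here as failing in some runs but not all.

```typescript
interface TestResult {
  test: string;
  passed: boolean;
}

interface FlakeStat {
  test: string;
  runs: number;
  failures: number;
  failureRate: number;
}

// Returns tests that fail sometimes but not always, worst first.
// Tests that always fail are broken rather than flaky, so they are excluded.
function rankFlakyTests(results: TestResult[]): FlakeStat[] {
  const byTest = new Map<string, { runs: number; failures: number }>();
  for (const r of results) {
    const s = byTest.get(r.test) ?? { runs: 0, failures: 0 };
    s.runs += 1;
    if (!r.passed) s.failures += 1;
    byTest.set(r.test, s);
  }

  const stats: FlakeStat[] = [];
  byTest.forEach((s, test) => {
    if (s.failures > 0 && s.failures < s.runs) {
      stats.push({ test, runs: s.runs, failures: s.failures, failureRate: s.failures / s.runs });
    }
  });
  return stats.sort((a, b) => b.failureRate - a.failureRate);
}
```

    A test that fails 1 run in 30 shows up here with a `failureRate` of about 0.033, which is easy to miss by eye but obvious in a ranked list.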

Results Analysis and Reporting

    Build a Datadog, Grafana, or Honeycomb dashboard showing pass rate, p95 runtime, and the top 10 flaky tests over a 14-day window. Link it from the team's README so it's reviewed at every retro, not just when CI is on fire.

    Main-branch red builds page the on-call engineer; PR failures notify the PR author only. Avoid the anti-pattern of routing every failure to #engineering — within a week the channel becomes ignorable noise.
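
    The routing rule above is small enough to express as a pure function, which makes it easy to test before wiring it to a paging or chat tool. This is a sketch only; `BuildFailure`, `Route`, and `routeFailure` are hypothetical names, and the default branch name `main` is an assumption.

```typescript
interface BuildFailure {
  branch: string;
  prAuthor?: string; // present only for pull-request builds
}

type Route =
  | { kind: "page-on-call" }
  | { kind: "notify-author"; author: string }
  | { kind: "none" };

// Main-branch failures page on-call; PR failures go to the PR author only.
// Nothing is ever routed to a team-wide channel.
function routeFailure(failure: BuildFailure, mainBranch = "main"): Route {
  if (failure.branch === mainBranch) return { kind: "page-on-call" };
  if (failure.prAuthor) return { kind: "notify-author", author: failure.prAuthor };
  return { kind: "none" };
}
```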

    Compare the flake rate measured during test execution against the budget set in planning. If it is over budget, start remediation: quarantine the worst offenders before adding new tests.

    Tag the offenders `@quarantine` so they run but don't block merge, file tickets to fix or delete within 2 sprints, and remove the tag once the underlying race condition or timing issue is resolved. A quarantine list with no expiration is just hidden tech debt.
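
    One way to give the quarantine list an expiration is to record when each test was quarantined and flag entries that have outlived the deadline. The sketch below assumes two-week sprints, so "2 sprints" becomes 28 days; the `QuarantineEntry` shape, the ticket IDs, and `expiredQuarantines` are hypothetical.

```typescript
interface QuarantineEntry {
  test: string;
  ticket: string; // tracking ticket for the fix-or-delete decision
  quarantinedOn: string; // ISO date, e.g. "2024-01-01"
}

const MS_PER_DAY = 24 * 60 * 60 * 1000;

// Returns entries quarantined for more than `maxAgeDays`.
// Default of 28 days assumes two 2-week sprints.
function expiredQuarantines(
  entries: QuarantineEntry[],
  today: Date,
  maxAgeDays = 28
): QuarantineEntry[] {
  return entries.filter((entry) => {
    const ageDays = (today.getTime() - Date.parse(entry.quarantinedOn)) / MS_PER_DAY;
    return ageDays > maxAgeDays;
  });
}
```

    Running this in a scheduled job and failing it when the result is non-empty turns the two-sprint deadline into something the team cannot quietly ignore.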

Maintenance and Optimization

    When a flag reaches 100% rollout, the assertions guarded by `if (flag)` need to become unconditional, and the off-path test should be deleted. Stale flag-gated tests are a major source of dead-code accumulation, so include them in the team's quarterly flag review.

    Sprint by sprint, page objects accrete copy-pasted login helpers and duplicated selectors. Schedule a quarterly refactor pass: consolidate selectors into one place per page, replace CSS selectors with `data-testid` attributes, and delete dead helpers identified by ts-prune or similar.

    Use Codecov, Coveralls, or `nyc`/`coverage.py` reports to map coverage against the P0 path list captured in planning. Aim for 80%+ branch coverage on the critical-path modules; chasing 100% repo-wide buys little and incentivizes assertion-free tests.
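
    As a rough sketch of that mapping, the function below takes per-module branch counts and a list of critical-path modules and returns the ones under the threshold. It assumes the coverage report has already been reduced to the hypothetical `ModuleCoverage` records; it does not parse any real Codecov, Coveralls, or `nyc` output format.

```typescript
interface ModuleCoverage {
  module: string;
  branchesCovered: number;
  branchesTotal: number;
}

// Returns the critical-path modules whose branch coverage is below `minBranchPct`.
// Modules outside the critical list are ignored on purpose.
function underCoveredCriticalModules(
  report: ModuleCoverage[],
  criticalModules: string[],
  minBranchPct = 80
): string[] {
  return criticalModules.filter((name) => {
    const entry = report.find((r) => r.module === name);
    if (!entry) return true; // missing from the report counts as uncovered
    if (entry.branchesTotal === 0) return false; // no branches, nothing to cover
    return (entry.branchesCovered / entry.branchesTotal) * 100 < minBranchPct;
  });
}
```

    Gating only on the critical list keeps the 80% target where it matters and avoids pressure to pad coverage elsewhere.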
