A/B Testing Checklist
End-to-end workflow for running an on-site A/B test — from hypothesis and sample-size math through QA, launch, significance analysis, and rollout. Run by the growth or experimentation lead with input from analytics, design, and brand.
Planning and Hypothesis
-
Define the test objective and primary KPI
Pick one primary KPI tied to the funnel stage you're testing — checkout conversion rate, signup rate, lead form submit, ARPU. Avoid multi-metric primaries; secondary metrics belong in the readout, not the success criterion. Document the GA4 event name that maps to the KPI so analysts pull the right column.
-
Write the hypothesis in PICO form
Population, intervention, comparison, outcome — for example: "For mobile US visitors landing on /pricing, replacing the three-tier card with a single CTA will lift trial-start rate by ≥5% vs. the current page." The directional prediction prevents post-hoc rationalization when results come in flat.
-
Calculate required sample size and MDE
Use a power calculator (Optimizely, VWO, or Evan Miller's calculator) with baseline conversion, target MDE, alpha 0.05, power 0.80. If your weekly traffic on the target segment can't reach the sample within 4 weeks, either raise the MDE (accepting that only larger effects will be detectable) or reconsider the test. Running underpowered is one of the most common reasons A/B results are unreliable.
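As a rough cross-check on whatever calculator you use, the per-arm sample size for a two-sided, two-proportion test can be sketched with the standard normal-approximation formula. The function name and the choice to express MDE as a relative lift are assumptions for illustration; individual calculators use slightly different formulas, so expect small differences.

```python
import math
from statistics import NormalDist


def sample_size_per_arm(baseline, relative_mde, alpha=0.05, power=0.80):
    """Approximate visitors needed per arm for a two-sided two-proportion test.

    baseline:     control conversion rate, e.g. 0.10
    relative_mde: smallest relative lift worth detecting, e.g. 0.05 for +5%
    """
    p1 = baseline
    p2 = baseline * (1 + relative_mde)
    p_bar = (p1 + p2) / 2
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided critical value
    z_beta = NormalDist().inv_cdf(power)
    numerator = (
        z_alpha * math.sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return math.ceil(numerator / (p2 - p1) ** 2)


# Illustrative inputs: 10% baseline, 5% relative MDE
n = sample_size_per_arm(0.10, 0.05)
print(f"{n} visitors per arm")
```

Doubling the MDE cuts the requirement to roughly a quarter, which is why the MDE is the first lever to revisit when traffic is short.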
-
Confirm UTM and GA4 event taxonomy
Match the test's source / medium / campaign UTMs to the team convention doc. Verify the conversion event in GA4 fires on actual submit (not blur, not page-view). A mis-mapped event can distort the reported lift several times over, and it sinks experiments more reliably than bad copy does.
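A small script can catch UTM drift before launch. The sketch below assumes a convention of lowercase values and a fixed list of allowed mediums; both rules and the `ALLOWED_MEDIUMS` set are hypothetical stand-ins for whatever your team convention doc actually says.

```python
from urllib.parse import urlparse, parse_qs

REQUIRED_UTMS = ("utm_source", "utm_medium", "utm_campaign")
# Hypothetical convention: replace with the values from your team's doc.
ALLOWED_MEDIUMS = {"cpc", "email", "social", "organic"}


def utm_problems(url):
    """Return a list of human-readable problems with a landing URL's UTMs."""
    params = parse_qs(urlparse(url).query)  # blank values are dropped
    problems = []
    for key in REQUIRED_UTMS:
        values = params.get(key)
        if not values:
            problems.append(f"missing {key}")
        elif values[0] != values[0].lower():
            problems.append(f"{key} is not lowercase: {values[0]}")
    medium = params.get("utm_medium", [""])[0]
    if medium and medium.lower() not in ALLOWED_MEDIUMS:
        problems.append(f"utm_medium not in convention: {medium}")
    return problems


print(utm_problems("https://example.com/pricing?utm_source=Google&utm_medium=cpc"))
```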
Variant Build and QA
-
Build control and treatment in the testing tool
Implement variants in Optimizely, VWO, Convert, or a successor to the discontinued Google Optimize (e.g., GrowthBook, Statsig). Keep DOM changes scoped to what the hypothesis tests; incidental CSS shifts contaminate the read.
-
Configure traffic allocation and audience segment
Default 50/50 split unless you have a reason to skew. Audience targeting should match the population in the hypothesis — don't include logged-in users in a test of the anonymous pricing page. Exclude internal IPs and bot traffic.
-
Verify GA4 events fire on both variants
Use GA4 DebugView or the GTM preview console. Click the CTA on control and treatment in three browsers; confirm the conversion event fires once with the right experiment_id and variant_id parameters. On GDPR-scoped pages, also confirm that no tracking pixel loads before the visitor gives consent.
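If you transcribe the events seen during a QA click-through into a list, the "fires once with the right parameters" check can be automated. This is a minimal sketch: the event dictionary shape (`name`, `params`) and the example IDs are assumptions for illustration, not a GA4 export format.

```python
def check_conversion_events(events, event_name, experiment_id, variant_id):
    """Return problems found in the events captured from one QA click-through."""
    matches = [e for e in events if e.get("name") == event_name]
    problems = []
    if len(matches) != 1:
        problems.append(f"expected 1 {event_name} event, saw {len(matches)}")
    for event in matches:
        params = event.get("params", {})
        if params.get("experiment_id") != experiment_id:
            problems.append(f"wrong experiment_id: {params.get('experiment_id')}")
        if params.get("variant_id") != variant_id:
            problems.append(f"wrong variant_id: {params.get('variant_id')}")
    return problems


# Hypothetical capture from one click on the treatment CTA
captured = [
    {"name": "page_view", "params": {}},
    {"name": "trial_start",
     "params": {"experiment_id": "exp_pricing_cta", "variant_id": "treatment"}},
]
print(check_conversion_events(captured, "trial_start", "exp_pricing_cta", "treatment"))
```

Run it once per variant and per browser; an empty list means that click-through passed.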
-
Run cross-browser and mobile rendering QA
Chrome, Safari, Firefox, Edge; iOS Safari and Android Chrome at common viewport widths. Layout shift on a treatment that doesn't appear on control is a CLS regression that confounds the test.
-
Get brand and legal sign-off on the variant
Brand reviews tone, color, and lockup adherence. Legal reviews any factual claim that would need substantiation under FTC guidelines. Capture timestamped approval; Slack DMs aren't a record.
Launch and Monitoring
-
Launch the test to the target segment
Flip the experiment to live status during a low-traffic window so any rollback is contained. Tag the launch in the analytics annotations so the WBR dashboard reflects the change.
-
Check sample ratio mismatch on day one
For a 50/50 split, expect roughly equal exposure counts. A chi-square SRM check that fails (p < 0.001) means your assignment is broken — bot filtering, redirect chain, or caching layer is corrupting the bucketing. Pause and fix before reading any KPI.
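The SRM check is a one-degree-of-freedom chi-square goodness-of-fit test on the exposure counts. A minimal sketch using only the standard library follows; the function name and example counts are illustrative, and the p-value uses the identity that a chi-square survival function with one degree of freedom equals erfc(sqrt(x / 2)).

```python
import math


def srm_p_value(control_n, treatment_n, expected_control_share=0.5):
    """Chi-square goodness-of-fit p-value for a two-arm exposure split."""
    total = control_n + treatment_n
    expected_control = total * expected_control_share
    expected_treatment = total - expected_control
    chi2 = (
        (control_n - expected_control) ** 2 / expected_control
        + (treatment_n - expected_treatment) ** 2 / expected_treatment
    )
    # Survival function of chi-square with 1 degree of freedom
    return math.erfc(math.sqrt(chi2 / 2))


# Illustrative day-one exposure counts
p = srm_p_value(50_000, 51_500)
print("SRM detected, pause the test" if p < 0.001 else "split looks healthy")
```

With these example counts a 1,500-visitor gap on about 100,000 exposures fails the check, even though the split still looks close to 50/50 by eye.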
-
Monitor guardrail metrics weekly
Watch page-load time, bounce rate, error rate, and downstream revenue per session. A treatment that lifts the primary KPI while crashing a guardrail isn't a winner. Don't peek at primary-KPI significance mid-test; repeated early looks inflate the false-positive rate.
-
Hold the test until the pre-registered duration ends
Stopping early because results "look good" is the textbook way to ship a false positive. Run for at least one full business cycle (typically two weeks) to absorb day-of-week effects, even if the sample target is reached sooner.
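The pre-registered duration can be derived from the sample target and expected traffic. The sketch below takes the longer of "days needed to hit the sample" and a two-week floor, then rounds up to whole weeks so every weekday is represented equally; the whole-week rounding and the function name are assumptions layered on top of the guidance above.

```python
import math


def planned_duration_days(required_per_arm, daily_visitors, arms=2, min_days=14):
    """Days to run: enough for the sample, at least min_days, in whole weeks."""
    days_for_sample = math.ceil(required_per_arm * arms / daily_visitors)
    days = max(days_for_sample, min_days)
    return math.ceil(days / 7) * 7  # round up to complete weeks


# Illustrative: 57,763 visitors per arm at 6,000 eligible visitors per day
print(planned_duration_days(57_763, 6_000))
```

Write the resulting number into the test plan before launch so the stop date is not negotiable once data starts arriving.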
Analysis and Decision
-
Pull final results from the testing platform
Export raw exposure and conversion data, not just the platform's auto-summary. Reconcile the platform count against GA4 — drift over 5% means one of the two systems is dropping events.
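The reconciliation itself is simple arithmetic. This sketch measures drift relative to the testing platform's count, which is an assumption; pick one denominator and use it consistently across tests.

```python
def count_drift(platform_count, ga4_count):
    """Relative difference between platform and GA4 counts, as a fraction."""
    if platform_count <= 0:
        raise ValueError("platform_count must be positive")
    return abs(platform_count - ga4_count) / platform_count


# Illustrative conversion counts from each system
drift = count_drift(platform_count=2_310, ga4_count=2_150)
print(f"drift {drift:.1%}:", "investigate" if drift > 0.05 else "ok")
```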
-
Run significance test against pre-registered alpha
Two-proportion z-test for binary conversion KPIs; t-test or bootstrap for continuous metrics like ARPU. Apply Bonferroni or Benjamini-Hochberg if you're reading multiple secondary metrics. Confidence intervals on lift are more honest than a bare p-value.
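For a binary KPI, the z-test and its confidence interval can be sketched with the standard library alone. Names and example counts are illustrative. Note that the interval here is on the absolute difference in conversion rate (treatment minus control), not on relative lift, and it uses the unpooled standard error while the test statistic uses the pooled one, which is the common textbook pairing.

```python
import math
from statistics import NormalDist


def two_proportion_test(conv_c, n_c, conv_t, n_t, alpha=0.05):
    """Two-sided two-proportion z-test plus a CI on the absolute difference."""
    p_c = conv_c / n_c
    p_t = conv_t / n_t
    diff = p_t - p_c

    # Pooled standard error under the null hypothesis of equal rates
    pooled = (conv_c + conv_t) / (n_c + n_t)
    se_pooled = math.sqrt(pooled * (1 - pooled) * (1 / n_c + 1 / n_t))
    z = diff / se_pooled
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))

    # Unpooled standard error for the confidence interval
    se_unpooled = math.sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    ci = (diff - z_crit * se_unpooled, diff + z_crit * se_unpooled)
    return z, p_value, ci


# Illustrative counts: 10.0% control vs 11.0% treatment
z, p, (low, high) = two_proportion_test(1_000, 10_000, 1_100, 10_000)
print(f"z={z:.2f}, p={p:.4f}, diff CI=({low:.4f}, {high:.4f})")
```

Reporting the interval alongside the p-value shows how small the true effect could plausibly be, which matters more for the rollout decision than whether p crossed 0.05.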
-
Record the test outcome
Pick the call honestly. "Inconclusive" is a valid outcome when the test was underpowered or guardrails failed; calling a flat test a "loss" loses information. The category here drives whether you roll out, iterate, or shelve.
-
Cut results by segment and device
Look at mobile vs. desktop, new vs. returning, paid vs. organic. A flat overall result made up of a 12% mobile lift and a 10% desktop drop calls for a different decision than a flat-everywhere result. Treat segment cuts as hypothesis-generating, not confirmatory.
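To see how opposing segment effects can cancel, here is a small worked example with made-up counts. The numbers are invented purely to reproduce the "12% up on mobile, 10% down on desktop, flat overall" pattern and are not results from any real test.

```python
def relative_lift(conv_c, n_c, conv_t, n_t):
    """Relative lift of treatment over control conversion rate."""
    return (conv_t / n_t) / (conv_c / n_c) - 1


# Made-up counts: (control conversions, control n, treatment conversions, treatment n)
segments = {
    "mobile": (1_000, 10_000, 1_120, 10_000),
    "desktop": (1_200, 10_000, 1_080, 10_000),
}

for name, counts in segments.items():
    print(f"{name}: {relative_lift(*counts):+.1%}")

# Sum each column across segments to get the overall counts
overall = tuple(sum(col) for col in zip(*segments.values()))
print(f"overall: {relative_lift(*overall):+.1%}")
```

Because segment cuts multiply the number of comparisons, a split like this is a prompt for a follow-up test targeted at one device, not a reason to ship to that device.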
Reporting and Rollout
-
Write the experiment readout document
Standard sections: hypothesis, design, sample size, primary result with CI, guardrail summary, segment cuts, decision, follow-up. Attach screenshots of control vs. treatment so the reader can see what was tested without opening the platform.
-
Share findings in the growth review channel
Post in #growth-experiments (or your equivalent) and tag PMM, design, and analytics. Walk through the readout at the next WBR — null and inconclusive results are worth presenting; suppressing them encourages publication bias inside the team.
-
Roll out the winning variant to 100%
Hand off the change to engineering for permanent implementation rather than leaving the testing-tool override in place forever — flag-driven variants slow page load and accumulate as tech debt. Update the production page and remove the experiment from the testing tool once shipped.
-
Plan a follow-up test with a revised hypothesis
A flat result is a learning. Revise the hypothesis based on segment cuts or qualitative session-replay review (Hotjar, FullStory) and queue the next iteration. Don't re-run the same test hoping for a different p-value.
-
Log the experiment in the test repository
Add an entry to the central test log (Notion, Airtable, or the testing-tool's archive) with hypothesis, dates, sample, outcome, and link to the readout. Future tests on the same surface should start with a search of this log to avoid relitigating settled questions.