A/B Testing Checklist

Planning and Hypothesis

    Pick one primary KPI tied to the funnel stage you're testing — checkout conversion rate, signup rate, lead form submit, ARPU. Avoid multi-metric primaries; secondary metrics belong in the readout, not the success criterion. Document the GA4 event name that maps to the KPI so analysts pull the right column.

    State the hypothesis in terms of population, intervention, comparison, and outcome. For example: "For mobile US visitors landing on /pricing, replacing the three-tier card with a single CTA will lift trial-start rate by ≥5% vs. the current page." The directional prediction prevents post-hoc rationalization when results come in flat.

    Calculate the required sample size with a power calculator (Optimizely, VWO, or Evan Miller's calculator) using baseline conversion, target MDE (minimum detectable effect), alpha 0.05, and power 0.80. If weekly traffic on the target segment can't reach the sample within 4 weeks, either raise the MDE, accepting that only larger effects will be detectable, or reconsider the test. Running underpowered is one of the most common reasons A/B results are unreliable.
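    As a rough cross-check on whichever calculator you use, the sample-size arithmetic can be sketched with the standard normal-approximation formula for a two-sided two-proportion test. The function names and the choice to express the MDE as a relative lift are assumptions made for this illustration, and commercial calculators may use slightly different formulas, so expect small differences.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_arm(baseline, relative_mde, alpha=0.05, power=0.80):
    """Visitors needed in EACH arm (normal approximation, two-sided test).

    baseline      -- control conversion rate, e.g. 0.04 for 4%
    relative_mde  -- smallest relative lift worth detecting, e.g. 0.05 for +5%
    """
    p1 = baseline
    p2 = baseline * (1 + relative_mde)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for alpha
    z_beta = NormalDist().inv_cdf(power)           # quantile for desired power
    p_bar = (p1 + p2) / 2
    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return ceil(numerator / (p2 - p1) ** 2)


def weeks_to_reach(n_per_arm, weekly_visitors, arms=2):
    """Whole weeks of traffic needed to fill every arm."""
    return ceil(n_per_arm * arms / weekly_visitors)


n = sample_size_per_arm(baseline=0.04, relative_mde=0.05)
weeks = weeks_to_reach(n, weekly_visitors=60_000)
print(f"{n} visitors per arm, about {weeks} weeks at 60,000 visitors/week")
if weeks > 4:
    print("Over the 4-week limit: raise the MDE or reconsider the test")
```

    Halving the MDE roughly quadruples the required sample, which is why small lifts on low-traffic segments rarely fit inside the 4-week window.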

    Match the test's source / medium / campaign UTMs to the team convention doc. Verify that the conversion event in GA4 fires on actual submit (not on blur, not on page view). A mis-mapped event can distort reported lift severalfold and invalidate an otherwise sound experiment.

Variant Build and QA

    Implement variants in Optimizely, VWO, Convert, or a Google Optimize successor (e.g., GrowthBook, Statsig). Keep DOM changes scoped to what the hypothesis tests, because incidental CSS shifts contaminate the read.

    Default 50/50 split unless you have a reason to skew. Audience targeting should match the population in the hypothesis — don't include logged-in users in a test of the anonymous pricing page. Exclude internal IPs and bot traffic.

    Use GA4 DebugView or the GTM preview console. Click the CTA on control and treatment in three browsers; confirm the conversion event fires exactly once with the right experiment_id and variant_id parameters. On GDPR-covered pages, also confirm that no tracking pixel loads before the visitor gives consent.

    Test in Chrome, Safari, Firefox, and Edge on desktop, plus iOS Safari and Android Chrome at common viewport widths. Layout shift on a treatment that doesn't appear on control is a CLS regression that confounds the test.

    Brand reviews tone, color, and lockup adherence. Legal reviews any factual claim that requires substantiation under FTC guidelines. Capture timestamped approval in a system of record; Slack DMs don't count.

Launch and Monitoring

    Flip the experiment to live status during a low-traffic window so any rollback is contained. Tag the launch in the analytics annotations so the weekly business review (WBR) dashboard reflects the change.

    For a 50/50 split, expect roughly equal exposure counts. Run a chi-square sample ratio mismatch (SRM) check on the exposure counts. A failing check (p < 0.001) means your assignment is broken: bot filtering, a redirect chain, or a caching layer is corrupting the bucketing. Pause and fix before reading any KPI.
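    A minimal sketch of the SRM check, assuming two arms and using only the Python standard library. The function name and return shape are illustrative; with two arms the chi-square statistic has one degree of freedom, so its p-value can be computed from the complementary error function.

```python
from math import erfc, sqrt


def srm_check(control_n, treatment_n, expected_control_share=0.5, threshold=0.001):
    """Chi-square goodness-of-fit test on exposure counts for two arms.

    Returns (chi2, p_value, srm_detected).
    """
    total = control_n + treatment_n
    expected_control = total * expected_control_share
    expected_treatment = total - expected_control
    chi2 = (
        (control_n - expected_control) ** 2 / expected_control
        + (treatment_n - expected_treatment) ** 2 / expected_treatment
    )
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = erfc(sqrt(chi2 / 2))
    return chi2, p_value, p_value < threshold


chi2, p, broken = srm_check(control_n=50_000, treatment_n=51_500)
print(f"chi2={chi2:.2f}, p={p:.2g}, SRM detected: {broken}")
```

    A 50,000 vs. 51,500 split looks close to even by eye, yet it fails the check decisively, which is why the test is worth automating rather than eyeballing.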

    Watch page-load time, bounce rate, error rate, and downstream revenue per session. A treatment that lifts the primary KPI while crashing a guardrail isn't a winner. Don't peek at primary-KPI significance mid-test, because repeated early looks inflate the false-positive rate.

    Stopping early because results "look good" is the textbook way to ship a false positive. Run for at least one full business cycle (typically two weeks) to absorb day-of-week effects, even if the sample target is reached sooner.
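    The duration rule above can be written down before launch so nobody negotiates it mid-test. This is a sketch under two assumptions not stated in the checklist: the function name is hypothetical, and the run is rounded up to whole weeks so every weekday is represented equally.

```python
from math import ceil


def planned_run_days(required_total_sample, daily_visitors, min_days=14):
    """Days to run: long enough for the sample AND for a full business cycle."""
    days_for_sample = ceil(required_total_sample / daily_visitors)
    days = max(days_for_sample, min_days)
    # Assumption: round up to whole weeks to balance day-of-week effects.
    return ceil(days / 7) * 7


print(planned_run_days(required_total_sample=20_000, daily_visitors=5_000))   # sample reached in 4 days, still run 14
print(planned_run_days(required_total_sample=100_000, daily_visitors=5_000))  # 20 days of traffic, rounded up to 21
```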

Analysis and Decision

    Export raw exposure and conversion data, not just the platform's auto-summary. Reconcile the platform count against GA4 — drift over 5% means one of the two systems is dropping events.
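    The reconciliation step can be sketched as a small helper. The checklist does not say which count the 5% is measured against; this sketch assumes the larger of the two counts as the base, and the function names are hypothetical.

```python
def count_drift(platform_count, ga4_count):
    """Relative gap between the two systems, measured against the larger count."""
    larger = max(platform_count, ga4_count)
    if larger == 0:
        return 0.0
    return abs(platform_count - ga4_count) / larger


def needs_investigation(platform_count, ga4_count, tolerance=0.05):
    """True when the gap exceeds the tolerance (5% by default)."""
    return count_drift(platform_count, ga4_count) > tolerance


for label, platform, ga4 in [("exposures", 101_500, 99_800), ("conversions", 4_210, 3_890)]:
    drift = count_drift(platform, ga4)
    status = "investigate" if needs_investigation(platform, ga4) else "ok"
    print(f"{label}: drift {drift:.1%} -> {status}")
```

    Running the check separately on exposures and conversions helps show which event is being dropped.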

    Use a two-proportion z-test for binary conversion KPIs and a Welch's t-test or bootstrap for continuous metrics like ARPU. Apply Bonferroni or Benjamini-Hochberg if you're reading multiple secondary metrics. Confidence intervals on lift are more honest than a bare p-value.
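    For binary KPIs, the test and the multiple-comparison correction can be sketched with the standard library alone. The function names and return shapes are illustrative, not any platform's API. Note that the interval here is on the absolute difference in conversion rate (percentage points); an interval on relative lift needs the delta method or a bootstrap.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_ztest(conv_c, n_c, conv_t, n_t, alpha=0.05):
    """Two-sided z-test plus a CI on the absolute rate difference."""
    p_c, p_t = conv_c / n_c, conv_t / n_t
    diff = p_t - p_c
    # Pooled standard error for the hypothesis test (assumes no difference).
    pooled = (conv_c + conv_t) / (n_c + n_t)
    se_pooled = sqrt(pooled * (1 - pooled) * (1 / n_c + 1 / n_t))
    z = diff / se_pooled
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    # Unpooled standard error for the confidence interval.
    se_unpooled = sqrt(p_c * (1 - p_c) / n_c + p_t * (1 - p_t) / n_t)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    ci = (diff - z_crit * se_unpooled, diff + z_crit * se_unpooled)
    return {"diff": diff, "z": z, "p_value": p_value, "ci": ci}


def benjamini_hochberg(p_values, q=0.05):
    """Return one boolean per p-value: True if it passes at FDR level q."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    max_rank = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * q:
            max_rank = rank  # step-up: keep the largest rank that passes
    passed = [False] * m
    for rank, i in enumerate(order, start=1):
        passed[i] = rank <= max_rank
    return passed


result = two_proportion_ztest(conv_c=400, n_c=10_000, conv_t=460, n_t=10_000)
low, high = result["ci"]
print(f"diff={result['diff']:.4f}, p={result['p_value']:.4f}, 95% CI=({low:.4f}, {high:.4f})")
print(benjamini_hochberg([0.001, 0.03, 0.035, 0.2]))
```

    Apply the correction to the secondary metrics only; the primary KPI was fixed in advance and is read at the planned alpha.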

    Classify the result honestly. "Inconclusive" is a valid outcome when the test was underpowered or guardrails failed; calling a flat test a "loss" throws away information. The classification drives whether you roll out, iterate, or shelve.

    Look at mobile vs. desktop, new vs. returning, paid vs. organic. A flat overall result made up of a 12% mobile lift and a 10% desktop drop calls for a different decision than a result that is flat everywhere. Treat segment cuts as hypothesis-generating, not confirmatory.
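    A small sketch of how opposing segment effects can hide inside a flat topline. The row layout (segment, variant, visitors, conversions) and the numbers are made up for illustration; they are chosen to reproduce a 12% mobile lift and a 10% desktop drop.

```python
def segment_lifts(rows):
    """Relative lift of treatment over control, per segment.

    rows -- iterable of (segment, variant, visitors, conversions),
            where variant is "control" or "treatment".
    """
    totals = {}
    for segment, variant, visitors, conversions in rows:
        by_variant = totals.setdefault(segment, {})
        v, c = by_variant.get(variant, (0, 0))
        by_variant[variant] = (v + visitors, c + conversions)

    lifts = {}
    for segment, by_variant in totals.items():
        control_visitors, control_conv = by_variant["control"]
        treat_visitors, treat_conv = by_variant["treatment"]
        control_rate = control_conv / control_visitors
        treatment_rate = treat_conv / treat_visitors
        lifts[segment] = (treatment_rate - control_rate) / control_rate
    return lifts


# Illustrative data only.
rows = [
    ("mobile", "control", 10_000, 500),
    ("mobile", "treatment", 10_000, 560),
    ("desktop", "control", 10_000, 500),
    ("desktop", "treatment", 10_000, 450),
]
per_segment = segment_lifts(rows)
overall = segment_lifts([("all", variant, v, c) for _, variant, v, c in rows])
for name, lift in {**per_segment, **overall}.items():
    print(f"{name}: {lift:+.1%}")
```

    Because each cut is an extra comparison on a smaller sample, a segment effect found this way is a candidate for a follow-up test, not a result to ship on.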

Reporting and Rollout

    Write the readout with these standard sections: hypothesis, design, sample size, primary result with CI, guardrail summary, segment cuts, decision, follow-up. Attach screenshots of control vs. treatment so the reader can see what was tested without opening the platform.

    Post in #growth-experiments (or your equivalent) and tag PMM, design, and analytics. Walk through the readout at the next WBR — null and inconclusive results are worth presenting; suppressing them encourages publication bias inside the team.

    Hand off the change to engineering for permanent implementation rather than leaving the testing-tool override in place forever — flag-driven variants slow page load and accumulate as tech debt. Update the production page and remove the experiment from the testing tool once shipped.

    A flat result is a learning. Revise the hypothesis based on segment cuts or qualitative session-replay review (Hotjar, FullStory) and queue the next iteration. Don't re-run the same test hoping for a different p-value.

    Add an entry to the central test log (Notion, Airtable, or the testing tool's archive) with hypothesis, dates, sample, outcome, and a link to the readout. Future tests on the same surface should start with a search of this log to avoid relitigating settled questions.
