AI for Marketing
Attribution 9 min read

Most A/B Tests Cannot Detect a Real Win


By Alexa Matveeva


To detect a 5 percent relative lift on a typical 3 percent conversion rate, you need roughly 208,000 visitors per variation. At 1,000 visitors a day per arm, that is about seven months. Yet real winning lifts cluster in the low single digits, with a median of about 4.9 percent in the largest public meta-analysis. Put those together and most on-site A/B tests cannot detect the size of effect they actually produce. Always Be Testing optimizes the wrong variable. The teams that win do fewer, bigger, properly powered tests, and they use AI for the bold swing and the sample-size math, not for the verdict.

Why most of your tests cannot detect a real win

The on-site CRO industry rests on an assumption that is usually false: that your site has enough traffic to detect the effects your tests produce. The effects are small. Georgi Georgiev's meta-analysis of 115 documented GoodUI tests found a median observed lift of 4.89 percent, dropping to just under 4 percent once underpowered tests were pruned and significance recalculated. Ronny Kohavi, who ran experimentation at Bing, says most progress comes in 0.1 to 1 percent increments and a 2 percent gain on a key metric is rare. Now the sample math. Halving the effect you want to detect roughly quadruples the traffic you need, because the effect size is squared in the denominator of the sample-size formula. A 5 percent lift on a 3 percent baseline needs about 208,000 visitors per variation, a 10 percent lift about 53,000, a 20 percent lift about 14,000. Measured against requirements like these, about 70 percent of the GoodUI sample, 80 of 115 tests, were underpowered, and only 24 could detect an effect below 10 percent.
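
The sample math above can be sketched in a few lines. This is a minimal sketch, not the calculator behind the article's figures. It assumes the standard normal-approximation formula for comparing two proportions, a two-sided 5 percent significance level, and 80 percent power. None of those settings are stated in the text, but together they land on the quoted numbers. The function name and parameters are illustrative.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_arm(baseline, relative_lift, alpha=0.05, power=0.80):
    """Visitors needed in each arm to detect a relative lift in conversion rate.

    Normal-approximation formula for two proportions, two-sided test.
    The alpha and power defaults are assumptions, not values from the article.
    """
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_power = NormalDist().inv_cdf(power)          # about 0.84 for 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    delta = p2 - p1                                # absolute difference to detect
    # delta is squared in the denominator, so halving it roughly quadruples n.
    return ceil((z_alpha + z_power) ** 2 * variance / delta ** 2)


for lift in (0.05, 0.10, 0.20):
    print(f"{lift:.0%} lift: {sample_size_per_arm(0.03, lift):,} per arm")
```

Under those assumptions the three results come out close to 208,000, 53,000 and 14,000 per arm. Raising the power or tightening the significance level pushes every figure higher.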

Only one test in seven wins, even at the best shops

If you assume your next test will win, you are wrong about six times in seven. A VWO study found roughly 1 in 7 tests produces a winner, about 14 percent, and Optimizely puts it near 12 percent. The most rigorous number comes from Harvard's Stefan Thomke and Sourobh Ghosh, who analyzed about 20,000 Optimizely experiments and found only about 10 percent had a statistically significant uplift on their primary metric. Kohavi reports that at Microsoft only about one-third of well-designed experiments improved the key metric, one-third were neutral, and one-third made things worse, with Bing's rate lower still. Failure rates run around 70 percent at Microsoft, 85 percent at Bing, and 90 percent at Google Ads, Netflix and Airbnb, and these are the most sophisticated experimentation shops there are. The base rate of winning is low, so a program built on the expectation of frequent wins is built on sand.

How peeking turns a 5 percent error rate into 26 percent

The most damaging everyday error is optional stopping: shipping the moment the bar turns green. A fixed-sample test controls error only for a single look at a pre-committed sample, so every extra peek is another chance for noise to cross the line. Evan Miller's canonical simulation shows that checking after every batch and stopping at p below 0.05 produces an actual false-positive rate of 26.1 percent, so roughly one in four declared winners is pure noise. Peek ten times and what you think is 1 percent significance is really 5 percent. Optimizely built its Stats Engine on always-valid inference to let teams monitor continuously without inflating error, cutting the chance of falsely declaring a winner from about 30 percent to 5 percent. And going Bayesian is not a free pass. A 2025 simulation by Alex Molas ran a Bayesian test with a 95 percent probability-to-beat-control stopping rule, checked every 100 observations, and produced an 80 percent false-positive rate. Interpretability at any sample size is not the same as error control.
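
The peeking effect is easy to reproduce. The sketch below runs A/A tests, where both arms share the same true conversion rate, and compares two rules: stop at the first look with p below 0.05, or read the result once at the pre-committed sample. The number of looks, batch size, simulation count and seed are assumptions chosen for illustration. This is not Evan Miller's exact setup and is not expected to reproduce his 26.1 percent figure.

```python
import random
from math import sqrt
from statistics import NormalDist


def two_sided_p(conv_a, conv_b, n):
    """Two-sided p-value from a pooled two-proportion z-test, n visitors per arm."""
    pooled = (conv_a + conv_b) / (2 * n)
    se = sqrt(2 * pooled * (1 - pooled) / n)
    if se == 0:
        return 1.0  # no variation in the data yet, nothing to test
    z = (conv_b - conv_a) / n / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def false_positive_rates(sims=300, looks=20, batch=400, rate=0.03, alpha=0.05, seed=1):
    """Simulate A/A tests and return (peeking rate, single-look rate)."""
    rng = random.Random(seed)
    peeking_hits = 0
    fixed_hits = 0
    for _ in range(sims):
        conv_a = conv_b = n = 0
        crossed = False
        for _ in range(looks):
            # Add one batch of visitors to each arm, same true rate in both.
            conv_a += sum(rng.random() < rate for _ in range(batch))
            conv_b += sum(rng.random() < rate for _ in range(batch))
            n += batch
            p = two_sided_p(conv_a, conv_b, n)
            if p < alpha:
                crossed = True  # a peeker would have stopped and shipped here
        peeking_hits += crossed
        fixed_hits += p < alpha  # p from the final, pre-committed look only
    return peeking_hits / sims, fixed_hits / sims


peeking, fixed = false_positive_rates()
print(f"stop at first significant look: {peeking:.1%} false positives")
print(f"single look at the fixed sample: {fixed:.1%} false positives")
```

The single-look rate should sit near the nominal 5 percent, while the peeking rate lands several times higher. The more looks you allow, the worse it gets.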

Why your big winner never reaches revenue

A large share of reported wins are illusory, which is why the 20 percent uplift never shows up in the bottom line. Martin Goodson traces vanishing wins to three causes: low statistical power, multiple testing, and regression to the mean. A variation that won partly by luck, the winner's curse, performs closer to its true mean next time, so dramatic lifts shrink or disappear on retest. This is the practical core of Twyman's Law: any figure that looks unusually large is usually wrong. A 364 percent lift or an overnight doubling is more likely an instrumentation bug, a sample-ratio mismatch, or contamination than a discovery. Re-test any surprising winner with a holdout before you bank the revenue.
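
The winner's curse can be shown with a short simulation. The sketch below assumes a true relative lift of 2 percent on a 3 percent baseline, 10,000 visitors per arm, and a normal approximation to the conversion counts. All of these are illustrative choices, not figures from Goodson. It keeps only the tests that reach significance and reports the lift those "winners" appeared to show.

```python
import random
from math import sqrt


def simulate_winners(sims=20000, n=10000, baseline=0.03, true_lift=0.02, seed=7):
    """Return the observed relative lifts of simulated tests that reached z > 1.96."""
    rng = random.Random(seed)
    p_a = baseline
    p_b = baseline * (1 + true_lift)
    winners = []
    for _ in range(sims):
        # Normal approximation to the binomial count of conversions per arm.
        conv_a = rng.gauss(n * p_a, sqrt(n * p_a * (1 - p_a)))
        conv_b = rng.gauss(n * p_b, sqrt(n * p_b * (1 - p_b)))
        rate_a, rate_b = conv_a / n, conv_b / n
        pooled = (rate_a + rate_b) / 2
        se = sqrt(2 * pooled * (1 - pooled) / n)
        if (rate_b - rate_a) / se > 1.96:
            winners.append(rate_b / rate_a - 1)  # lift the dashboard would report
    return winners


winners = simulate_winners()
average_reported = sum(winners) / len(winners)
print(f"{len(winners)} of 20000 tests declared a winner")
print(f"true lift 2.0%, average reported lift among winners {average_reported:.1%}")
```

With so little power, a test can only cross the significance line when noise inflates the observed lift, so every declared winner overstates the true effect by a wide margin. A retest draws fresh noise, and the lift falls back toward 2 percent.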

Test bigger, not more

Always Be Testing optimizes volume, and volume is the wrong variable. When real effects are small and traffic is finite, running many small tests guarantees most of them are underpowered. A program of 100 button-color tweaks can run all year and end at the conversion rate it started with, mostly shipping noise. Peep Laja of CXL is blunt: bad testing is worse than no testing at all, and small tweaks eventually hit a local maximum where no further small change helps. The way off a local maximum is a bold differentiated change, such as a new value proposition or a restructured funnel, whose effect is large enough to clear the detection threshold at realistic traffic. Even VWO concedes that low-traffic sites should make big changes, because larger expected effects need far less traffic. Prioritization beats velocity, and frameworks like PXL, ICE and PIE exist to force fewer, better-grounded bets.

The pre-test gate (copy this)

Before you run any on-site test, clear these.

  • Compute the required sample per variation up front. As anchors, a 5 percent lift on a 3 percent baseline needs about 208,000 per arm, a 10 percent lift about 53,000, a 20 percent lift about 14,000.
  • If you cannot reach that sample within about four weeks, do not run that small test. Under about 100,000 visitors a month, default to bold changes with a target effect of 10 to 20 percent.
  • Pre-commit a fixed sample size and a stop date. Do not stop because you see significance. To monitor continuously, use an always-valid engine.
  • Check for sample-ratio mismatch before you read any result.
  • Reserve about one test in four for a high-risk differentiated swing, and treat a fully powered inconclusive test as valid evidence, not a failure.
  • Re-test any lift that looks too large with a holdout before scaling it.
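
Two of the checks above are mechanical enough to script. What follows is a minimal sketch, assuming an intended 50/50 traffic split and a 28-day window for "about four weeks". The function names and the 0.001 threshold for flagging a sample-ratio mismatch are assumptions for illustration, not part of the gate as written.

```python
from math import ceil, erfc, sqrt


def days_to_reach(sample_per_arm, daily_visitors_per_arm):
    """Days of traffic needed to fill one arm."""
    return ceil(sample_per_arm / daily_visitors_per_arm)


def passes_duration_gate(sample_per_arm, daily_visitors_per_arm, max_days=28):
    """True if the pre-committed sample is reachable within the window."""
    return days_to_reach(sample_per_arm, daily_visitors_per_arm) <= max_days


def srm_p_value(visitors_a, visitors_b):
    """Chi-square goodness-of-fit p-value against an intended 50/50 split."""
    expected = (visitors_a + visitors_b) / 2
    chi2 = ((visitors_a - expected) ** 2 + (visitors_b - expected) ** 2) / expected
    return erfc(sqrt(chi2 / 2))  # chi-square survival function, 1 degree of freedom


def has_srm(visitors_a, visitors_b, threshold=0.001):
    """Flag a sample-ratio mismatch. The threshold is an assumed convention."""
    return srm_p_value(visitors_a, visitors_b) < threshold


print(days_to_reach(208_000, 1_000))         # 5 percent lift at 1,000 a day per arm
print(passes_duration_gate(208_000, 1_000))  # too slow for a four-week window
print(passes_duration_gate(14_000, 1_000))   # 20 percent lift fits
print(has_srm(50_000, 48_500))               # split is off, do not read the result
print(has_srm(50_000, 49_800))               # within normal variation
```

At 1,000 visitors a day per arm, the 5 percent test fails the duration gate and the 20 percent test passes, which is the argument for bold changes in code form. A flagged mismatch means the assignment or tracking is broken, so fix that before reading any conversion numbers.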

Where AI helps, and where it lies to you

AI changes the economics in two honest places, the swing and the math, and fails in a third, the verdict. On the swing, an LLM can generate a differentiated value proposition and full alternative copy and layout in minutes, making the test-bigger strategy practical for teams that otherwise could not afford one. On the math, AI is well suited to running the sample-size calculation, picking a defensible minimum effect, enforcing one change against many, and flagging peeking and sample-ratio errors, the steps humans skip. Where traffic allows, AI-CRO platforms run multi-armed bandits with auto-generated variants, which beat fixed A/B tests when there are many variants and cumulative reward matters more than clean attribution. The verdict is where AI lies to you. Synthetic users are too shallow to be useful. Nielsen Norman Group found they care about everything equally while real people care about some things far more. The most rigorous academic test of generative agents found they replicate survey answers only about 85 percent as accurately as the same people replicate their own answers two weeks later. That is 85 percent of human consistency, not absolute accuracy, and the agents struggle on behavioral prediction. Vendor lift claims of 20 to 30 percent, or 410 percent, are uncorroborated marketing. There is even a sting: every extra AI-generated variant multiplies the traffic you need, so spraying out variants makes underpowering worse unless paired with bandits or strict error control.
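
The cost of extra variants can be put in numbers. The sketch below assumes each challenger is compared against a shared control, and uses a Bonferroni correction as one simple form of strict error control. That correction is my choice for illustration, not something the article prescribes. The baseline, lift, significance level and power are the same assumed settings that reproduce the article's 208,000 figure.

```python
from math import ceil
from statistics import NormalDist


def total_traffic(variants, baseline=0.03, relative_lift=0.05, alpha=0.05, power=0.80):
    """Total visitors for one control plus `variants` challengers.

    Bonferroni splits alpha across the comparisons against control, so each
    arm needs more traffic as the number of challengers grows.
    """
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - (alpha / variants) / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    per_arm = ceil((z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2)
    return (variants + 1) * per_arm


for k in (1, 2, 4, 8):
    print(f"{k} challenger(s): {total_traffic(k):,} visitors in total")
```

Traffic grows faster than the variant count, because every added challenger brings its own arm and also tightens the threshold for all the others. Four challengers cost well over three times the traffic of a simple two-arm test under these assumptions.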

The full workflow, running the power and sample-size math, picking a defensible minimum detectable effect, generating a bold differentiated challenger with full variant copy, and building the guardrail and sample-ratio checks, is packaged as a reusable Claude skill. Get the free skill.

What to do Monday

The whole discipline reduces to one line: test in proportion to your traffic. Below about 100,000 visitors a month, stop shipping tweaks and make bold changes whose effect is large enough to detect, power every test up front, and use AI for the challenger and the math but never for the verdict. The honest exception is scale. Booking.com runs more than 1,000 concurrent experiments and treats every change as an experiment. At that scale a 10 percent win rate at 1 percent average uplift compounds, the way Bing accrued about 2 percent a year from 0.1 to 0.2 percent steps, because millions of users make tiny effects detectable. Everyone else wins with fewer, bigger, properly powered tests, not a busy calendar of underpowered ones.

Sources: Georgi Georgiev, meta-analysis of 115 GoodUI tests, with a larger 1,001-test follow-up; Stefan Thomke and Sourobh Ghosh, Harvard analysis of about 20,000 Optimizely experiments; Ronny Kohavi, Trustworthy Online Controlled Experiments, plus published Microsoft and Bing data; VWO and Optimizely win-rate studies (vendor figures, directional); Evan Miller, How Not To Run An A/B Test (optional stopping and the 26.1 percent figure); Optimizely Stats Engine and Always Valid Inference (Pekelis, Walsh and Johari, Operations Research, 2022); Alex Molas, 2025 simulation of Bayesian optional stopping; Martin Goodson, Most Winning A/B Test Results Are Illusory (2014); Park and colleagues, Generative Agent Simulations of 1,000 People (arXiv 2411.10109, November 2024); Nielsen Norman Group, synthetic-user testing.
