The skill
Install once in Claude. It sizes your test, picks a defensible minimum detectable effect, and drafts a bold challenger worth running.
Most on-site A/B tests are underpowered and cannot detect the small lifts they produce. This free Claude skill sizes your test, picks an effect you can actually detect, and generates a bold challenger worth running.
We email you the download link. One email, no spam.
Drop your email and we send the download link to your inbox. Inside: the skill, a short install guide, and a plan B prompt as a no-install option.
No spam ever. Unsubscribe anytime.
Built for marketers, not statisticians.
A short step-by-step for setup, plus a no-install option if you cannot add skills.
A single prompt you paste into any Claude chat to size a test and plan it without installing anything.
Three steps, from your baseline to a trustworthy plan.
Your baseline conversion rate, your traffic and a target effect to size a test, or your current page if you want a bold challenger.
It computes the required sample and duration, tells you whether the test is powered within a realistic window, and if not, reports the smallest effect your traffic can reliably detect. Or it drafts a bold, differentiated challenger with full copy.
The sample size, the stop date, the guardrail metrics and the sample-ratio check, or a ready challenger with its single hypothesis.
The skill does the math and the design. It does not run the test or connect to your analytics. You bring the baseline and traffic, it sizes the test and plans it so the result is trustworthy.
A/B test sample size decides whether a test can conclude at all, and the required numbers are larger than most teams expect. Real winning lifts are small, but detecting a small lift takes a lot of traffic, so most on-site tests are underpowered. Here is how the skill sizes a test, picks an effect you can actually detect, and keeps you from shipping a false win.
The required sample size depends on your baseline conversion rate and the size of the lift you want to detect, and the numbers are larger than they look. On a 3 percent baseline, detecting a 5 percent relative lift needs about 208,000 visitors per variation, a 10 percent lift about 53,000, and a 20 percent lift about 14,000. The relationship is steep because the effect is squared in the denominator of the formula, so halving the minimum detectable effect roughly quadruples the traffic you need. The skill runs the classic two-proportion calculation at 80 percent power and 95 percent confidence and returns the sample per variation, the total, and the duration at your traffic.
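For readers who want to check the arithmetic, here is a minimal Python sketch of a two-proportion sample-size calculation. It assumes a two-sided z-test with unpooled variance, which reproduces the figures quoted above. The function names and the `daily_visitors` parameter are illustrative and are not taken from the skill itself.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_variation(baseline, relative_lift, alpha=0.05, power=0.80):
    """Visitors per variation for a two-sided two-proportion z-test.

    Assumption: unpooled variance, p1*(1-p1) + p2*(1-p2).
    """
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 at 95% confidence
    z_beta = NormalDist().inv_cdf(power)           # about 0.84 at 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    # The absolute effect (p2 - p1) is squared in the denominator,
    # so halving the lift roughly quadruples the required sample.
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


def duration_days(n_per_variation, daily_visitors, variations=2):
    """Days needed to collect the full sample at a given daily traffic."""
    return ceil(n_per_variation * variations / daily_visitors)


for lift in (0.05, 0.10, 0.20):
    n = sample_size_per_variation(0.03, lift)
    print(f"{lift:.0%} lift: {n:,} per variation, {2 * n:,} total")
```

On a 3 percent baseline this lands close to the 208,000, 53,000 and 14,000 figures above. Other calculators may use a pooled-variance variant and return slightly different numbers.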
Real winning lifts are small. The largest public meta-analysis found a median observed lift of about 4.9 percent, and most progress at the biggest experimentation shops comes in increments under 1 percent. Yet detecting a small lift needs traffic most sites do not have, which is why roughly 70 percent of tests in that meta-analysis were underpowered, statistically incapable of seeing the effect they produced. An underpowered test cannot give you a trustworthy answer, so the first job is to find out whether your test can conclude at all.
Work backward from your traffic. If your test cannot reach its required sample within about four weeks, it is underpowered for that effect, and the fix is to target a bigger change rather than wait months. Under about 100,000 visitors a month, the skill steers you to bold changes with a target effect of 10 to 20 percent, because larger expected effects need far less traffic. It also reserves about one test in four for a high-risk differentiated swing, since that is where outsized lifts come from.
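Working backward from traffic can be sketched as a search for the smallest relative lift whose required sample fits the window. This is an illustration under stated assumptions, not the skill's actual code: a 30-day month, a 28-day test window, two variations, a baseline below 50 percent, and a search capped at a 100 percent relative lift. All function and parameter names are hypothetical.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_variation(baseline, relative_lift, alpha=0.05, power=0.80):
    """Two-sided two-proportion z-test, unpooled variance (an assumption)."""
    p1, p2 = baseline, baseline * (1 + relative_lift)
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    return ceil(z ** 2 * (p1 * (1 - p1) + p2 * (1 - p2)) / (p2 - p1) ** 2)


def min_detectable_lift(baseline, monthly_visitors, max_days=28, variations=2):
    """Smallest relative lift that can be powered within max_days.

    Returns None if even a 100% relative lift does not fit the window.
    """
    # Visitors each variation receives during the test window.
    budget = monthly_visitors / 30 * max_days / variations
    low, high = 0.001, 1.0
    if sample_size_per_variation(baseline, high) > budget:
        return None
    # Bisection works because the required sample shrinks as the lift grows.
    for _ in range(50):
        mid = (low + high) / 2
        if sample_size_per_variation(baseline, mid) > budget:
            low = mid   # too little traffic for this lift, aim bigger
        else:
            high = mid  # this lift fits, try a smaller one
    return high


for monthly in (20_000, 100_000, 500_000):
    lift = min_detectable_lift(0.03, monthly)
    print(f"{monthly:,} visitors/month: smallest detectable lift is {lift:.1%}")
```

With a 3 percent baseline and 100,000 visitors a month, the smallest detectable lift in this sketch comes out a little above 10 percent, which is consistent with aiming for 10 to 20 percent effects at that traffic level.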
The most common way to manufacture a fake win is to stop the test the moment it looks significant. Checking repeatedly and stopping as soon as p drops below 0.05 pushes the real false-positive rate to about 26 percent. The skill therefore makes you pre-commit to a fixed sample size and a stop date, and only allows continuous monitoring through an always-valid engine. It defines guardrail metrics that must not regress and runs a sample-ratio mismatch check before any result is read. It also flags any lift that looks too good to be true and sends it to a holdout re-test, because most oversized wins fade through regression to the mean.
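Two of the checks in the paragraph above are easy to illustrate. The sketch below first simulates A/A tests (no real difference) that are stopped at the first significant look, then runs a chi-square sample-ratio mismatch check. The 20 equally spaced looks are an assumption, since the inflation depends on how often you peek and the text does not give a count. The p < 0.001 alert threshold is a common convention rather than something stated here, and the function names are illustrative.

```python
import random
from math import erfc, sqrt


def peeking_false_positive_rate(looks=20, trials=2000, z_crit=1.96, seed=0):
    """Share of A/A tests declared 'significant' when stopping at the first |z| > z_crit."""
    rng = random.Random(seed)
    false_wins = 0
    for _ in range(trials):
        total = 0.0
        for k in range(1, looks + 1):
            total += rng.gauss(0.0, 1.0)       # one equal-sized batch of data, no true effect
            if abs(total / sqrt(k)) > z_crit:  # z-statistic after k batches
                false_wins += 1
                break
    return false_wins / trials


def srm_p_value(visitors_a, visitors_b, expected_share_a=0.5):
    """Chi-square goodness-of-fit p-value (1 degree of freedom) for the traffic split."""
    total = visitors_a + visitors_b
    expected_a = total * expected_share_a
    expected_b = total - expected_a
    chi2 = ((visitors_a - expected_a) ** 2 / expected_a
            + (visitors_b - expected_b) ** 2 / expected_b)
    return erfc(sqrt(chi2 / 2))  # survival function of chi-square with 1 df


print(f"one look:  {peeking_false_positive_rate(looks=1):.1%} false positives")
print(f"20 looks:  {peeking_false_positive_rate(looks=20):.1%} false positives")
print(f"SRM p-value for a 5,000 / 5,400 split: {srm_p_value(5000, 5400):.5f}")
```

A single pre-committed look stays near the nominal 5 percent, while repeated looks push the rate several times higher. A very small SRM p-value means the split itself is broken, so the result should not be read until the cause is found.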
AI is genuinely useful for two parts of this: generating a bold challenger with full variant copy in minutes, and running the sample-size math. These are the steps teams usually skip. It is not a substitute for the experiment. Synthetic users and attention tools are directional at best, and no model reliably predicts whether real people will buy, so the verdict still comes from a powered test. For the full evidence, including why peeking inflates error and why brand-new winners often vanish, read "Most A/B Tests Cannot Detect a Real Win".