How to Calculate A/B Test Sample Size and Power (Free Claude Skill)

What's inside

Three files, one job: a test that can actually conclude

Built for marketers, not statisticians.

The skill

Install once in Claude. It sizes your test, picks a defensible minimum detectable effect, and drafts a bold challenger worth running.

Install guide

A short step-by-step for setup, plus a no-install option if you cannot add skills.

Plan B prompt

A single prompt you paste into any Claude chat to size a test and plan it without installing anything.

How it works

Three steps, from your baseline to a trustworthy plan.

Step 1: Bring your input

Your baseline conversion rate, your traffic and a target effect to size a test, or your current page if you want a bold challenger.

Step 2: Claude does the work

It computes the required sample and duration and tells you whether the test is powered within a realistic window. If it is not, it reports the smallest effect your traffic can detect. Alternatively, it drafts a bold, differentiated challenger with full copy.

Step 3: Get the plan

The sample size, the stop date, the guardrail metrics and the sample-ratio check, or a ready challenger with its single hypothesis.

The skill does the math and the design. It does not run the test or connect to your analytics. You bring the baseline and traffic, it sizes the test and plans it so the result is trustworthy.


How to calculate A/B test sample size

A/B test sample size decides whether a test can conclude at all, and the required numbers are larger than most teams expect. Real winning lifts are small, but detecting a small lift takes a lot of traffic, so most on-site tests are underpowered. Here is how the skill sizes a test, picks an effect you can actually detect, and keeps you from shipping a false win.

How many visitors does an A/B test need

It depends on your baseline conversion rate and the size of the lift you want to detect, and the numbers are larger than most teams expect. On a 3 percent baseline, detecting a 5 percent relative lift needs about 208,000 visitors per variation, a 10 percent lift about 53,000, and a 20 percent lift about 14,000. The relationship is steep because the effect is squared in the denominator of the formula, so halving the minimum detectable effect roughly quadruples the traffic you need. The skill runs the classic two-proportion calculation at 80 percent power and 95 percent confidence and returns the sample per variation, the total, and the duration at your traffic.
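As a rough illustration of the two-proportion calculation described above, here is a minimal Python sketch. It assumes a two-sided z-test with unpooled variances, which reproduces the figures quoted for a 3 percent baseline; the function name and parameters are illustrative, not taken from the skill itself.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_variation(baseline, relative_lift, alpha=0.05, power=0.80):
    """Visitors needed in each arm for a two-sided two-proportion z-test.

    Assumption: unpooled variances and a relative lift, e.g. 0.10 for +10%.
    """
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 at 95% confidence
    z_beta = NormalDist().inv_cdf(power)           # about 0.84 at 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    # The absolute effect (p2 - p1) is squared in the denominator,
    # which is why halving the effect roughly quadruples the sample.
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


for lift in (0.05, 0.10, 0.20):
    n = sample_size_per_variation(0.03, lift)
    print(f"{lift:.0%} relative lift: {n:,} per variation, {2 * n:,} total")
```

Running it for lifts of 5, 10 and 20 percent should land close to the 208,000, 53,000 and 14,000 figures above. Other calculators use slightly different variance conventions, so small differences between tools are normal.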

Why most A/B tests are underpowered

Real winning lifts are small. The largest public meta-analysis found a median observed lift of about 4.9 percent, and most progress at the biggest experimentation shops comes in increments under 1 percent. Yet detecting a small lift needs traffic most sites do not have, which is why roughly 70 percent of tests in that meta-analysis were underpowered, statistically incapable of seeing the effect they produced. An underpowered test cannot give you a trustworthy answer, so the first job is to find out whether your test can conclude at all.
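To make "underpowered" concrete, the sketch below estimates the power a test actually has for a given sample, using the same two-proportion approximation. The 3 percent baseline and the 20,000 visitors per variation are hypothetical inputs chosen for illustration; only the 4.9 percent lift comes from the text above.

```python
from math import sqrt
from statistics import NormalDist


def achieved_power(baseline, relative_lift, n_per_variation, alpha=0.05):
    """Approximate power of a two-sided two-proportion z-test.

    Assumption: unpooled variances; the negligible chance of a
    significant result in the wrong direction is ignored.
    """
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    std_error = sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / n_per_variation)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    # Probability that the observed z-statistic clears the critical value
    # when the true effect really is p2 - p1.
    return 1 - NormalDist().cdf(z_alpha - abs(p2 - p1) / std_error)


# Hypothetical test: 3% baseline, true lift of 4.9%, 20,000 visitors per arm.
power = achieved_power(0.03, 0.049, 20_000)
print(f"Chance of detecting the lift: {power:.0%}")
```

Under these assumptions the test would miss a real 4.9 percent lift far more often than it finds it, which is what underpowered means in practice.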

How to pick a minimum detectable effect

Work backward from your traffic. If your test cannot reach its required sample within about four weeks, it is underpowered for that effect, and the fix is to target a bigger change rather than wait months. Under about 100,000 visitors a month, the skill steers you to bold changes with a target effect of 10 to 20 percent, because larger expected effects need far less traffic. It also reserves about one test in four for a high-risk differentiated swing, since that is where outsized lifts come from.
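Working backward from traffic can be sketched as a search for the smallest lift whose required sample fits inside the window. This is a minimal illustration, not the skill's actual logic: the function names, the 28-day window, the search bracket and the 3,000 daily visitors in the example are all assumptions.

```python
from math import ceil
from statistics import NormalDist


def required_n(baseline, relative_lift, alpha=0.05, power=0.80):
    """Per-variation sample for a two-sided two-proportion z-test (unpooled)."""
    p1, p2 = baseline, baseline * (1 + relative_lift)
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    return ceil(z ** 2 * (p1 * (1 - p1) + p2 * (1 - p2)) / (p2 - p1) ** 2)


def days_needed(baseline, relative_lift, daily_visitors, variations=2):
    """Days of traffic needed to fill every variation."""
    return ceil(variations * required_n(baseline, relative_lift) / daily_visitors)


def smallest_detectable_lift(baseline, daily_visitors, max_days=28, variations=2):
    """Smallest relative lift that can be powered within max_days, or None."""
    budget = daily_visitors * max_days // variations  # visitors per variation
    low = 0.001
    high = min(5.0, 0.99 / baseline - 1)  # assumed bracket; keeps p2 below 1
    if required_n(baseline, high) > budget:
        return None  # even a huge lift cannot be detected in the window
    for _ in range(60):  # bisection: required_n shrinks as the lift grows
        mid = (low + high) / 2
        if required_n(baseline, mid) <= budget:
            high = mid
        else:
            low = mid
    return high


# Hypothetical site: 3% baseline, 3,000 visitors a day (about 90,000 a month).
print(f"Days for a 20% lift: {days_needed(0.03, 0.20, 3_000)}")
print(f"Smallest lift in 28 days: {smallest_detectable_lift(0.03, 3_000):.1%}")
```

For this hypothetical site the smallest detectable lift comes out a little above 10 percent, consistent with the guidance to target 10 to 20 percent effects below about 100,000 visitors a month.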

How to avoid a false positive

The most common way to manufacture a fake win is to stop the test the moment it looks significant. Checking repeatedly and stopping at p below 0.05 pushes the real false-positive rate to about 26 percent. To prevent this, the skill makes you pre-commit to a fixed sample size and a stop date, and allows continuous monitoring only through an always-valid (sequential) method designed for it. It also defines guardrail metrics that must not regress, runs a sample-ratio mismatch check before any result is read, and flags any suspiciously large lift for a holdout re-test, because most oversized wins fade through regression to the mean.
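Two of these safeguards are easy to illustrate in a few lines of Python. The first function is a sketch of a sample-ratio mismatch check as a chi-square goodness-of-fit test on visitor counts. The second simulates A/A tests (no real difference) to show how stopping at the first significant peek inflates false positives. Both are simplified illustrations under stated assumptions, not the skill's implementation: the simulation approximates each test's z-statistic with a normal random walk over equally spaced looks, and the exact inflation depends on how many times you look.

```python
import random
from math import erfc, sqrt


def srm_p_value(visitors_a, visitors_b, expected_share_a=0.5):
    """Chi-square p-value (1 degree of freedom) for a sample-ratio mismatch."""
    total = visitors_a + visitors_b
    expected_a = total * expected_share_a
    expected_b = total - expected_a
    chi2 = ((visitors_a - expected_a) ** 2 / expected_a
            + (visitors_b - expected_b) ** 2 / expected_b)
    return erfc(sqrt(chi2 / 2))  # survival function of chi-square with 1 df


def peeking_false_positive_rate(looks, simulations=20_000, z_crit=1.96, seed=0):
    """Share of A/A tests declared 'significant' at any of `looks` peeks."""
    rng = random.Random(seed)
    false_wins = 0
    for _ in range(simulations):
        running_sum = 0.0
        for k in range(1, looks + 1):
            running_sum += rng.gauss(0, 1)   # one more batch of data
            z = running_sum / sqrt(k)        # z-statistic after k batches
            if abs(z) > z_crit:              # stop at the first "win"
                false_wins += 1
                break
    return false_wins / simulations


# Hypothetical counts: a 50/50 split that drifted to 50,000 vs 50,900.
print(f"SRM p-value: {srm_p_value(50_000, 50_900):.4f}")
print(f"One look at the end: {peeking_false_positive_rate(1):.1%}")
print(f"Twenty peeks: {peeking_false_positive_rate(20):.1%}")
```

A very small SRM p-value suggests the split itself is broken, so the result should not be read until the cause is found. In the simulation, a single look at the end stays near the nominal 5 percent, while twenty peeks produce a false-positive rate several times higher.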

Where this fits and what AI can and cannot do

AI is genuinely useful for two parts of this: generating a bold challenger with full variant copy in minutes, and running the sample-size math. These are the steps teams usually skip. It is not a substitute for the experiment. Synthetic users and attention tools are directional at best, and no model reliably predicts whether real people will buy, so the verdict still comes from a powered test. For the full evidence, including why peeking inflates error and why brand-new winners often vanish, read "Most A/B Tests Cannot Detect a Real Win."
