α and power are two sides of the same coin: at a fixed sample size and effect size, you can't lower α without giving up power.
α is the false positive rate; β = 1 − power is the false negative rate. Shrink one and you pay in the other.
A holdout test is a hypothesis test. The null hypothesis (H0)
says marketing has no effect. The alternative (H1) says there is a real lift.
The decision threshold determines how you trade off two kinds of errors. In the math below,
σ is the standard error of the mean difference between cells:
σ_diff = s · √(1/n_t + 1/n_c) for a continuous metric with per-user SD s, or
σ_diff = √(2p(1-p)/n) for a conversion rate p with n users per cell. "MDE = 2.8σ" means the true
effect is 2.8 standard errors above zero: a standardized effect size, not a raw lift.
The left curve is H0 (no effect). The right curve is H1 (true lift exists). The vertical line is the decision threshold. Drag it to see how α (false positive rate) and power (true positive rate) trade off.
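The trade-off in the figure can be reproduced numerically. The following is a minimal sketch, assuming a one-sided test and both curves drawn as unit-variance normals in units of σ_diff; the function name `alpha_and_power` is illustrative, not part of the interactive.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # mean 0, SD 1; the x-axis is in units of sigma_diff


def alpha_and_power(threshold: float, separation: float) -> tuple[float, float]:
    """Return (alpha, power) for a one-sided test.

    threshold  -- position of the decision line, in standard errors above zero
    separation -- true effect under H1, in standard errors (e.g. 2.8)
    """
    alpha = 1.0 - STD_NORMAL.cdf(threshold)               # H0 mass right of the line
    power = 1.0 - STD_NORMAL.cdf(threshold - separation)  # H1 mass right of the line
    return alpha, power


# Moving the line to the right lowers alpha and power together.
for threshold in (1.0, 1.645, 2.0, 2.5):
    alpha, power = alpha_and_power(threshold, separation=2.8)
    print(f"threshold={threshold:5.3f}  alpha={alpha:.3f}  power={power:.3f}")
```

With the line at 1.645 and a separation of 2.8, this gives α ≈ 0.05 and power of roughly 88%.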
More data tightens both curves: σ_diff ∝ 1/√n. The reliable lever — also the costly one.
Test duration is, in most settings, just a proxy for n (a longer test means more traffic). It acts independently of n only when you model time-correlated effects.
Raising the MDE raises the bar: a larger effect is easier to detect, but smaller true effects slip through undetected.
CUPED: regress the outcome on a pre-test covariate with correlation ρ; the variance is multiplied by (1 − ρ²). At ρ = 0.7 that is roughly a 2× larger effective sample size for free.
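The CUPED lever can be checked on synthetic data. This is a sketch only, assuming a single pre-test covariate `x` per user and a coefficient θ = cov(y, x) / var(x) estimated on the same data; the helper names and the simulated data are illustrative.

```python
import random
from math import sqrt
from statistics import fmean, pvariance


def covariance(a, b):
    """Population covariance of two equal-length sequences."""
    a_bar, b_bar = fmean(a), fmean(b)
    return fmean([(ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b)])


def cuped_adjust(y, x):
    """Subtract the part of y that the pre-test covariate x explains."""
    theta = covariance(x, y) / pvariance(x)  # slope from regressing y on x
    x_bar = fmean(x)
    return [yi - theta * (xi - x_bar) for xi, yi in zip(x, y)]


# Synthetic users: x is the pre-test metric, y the in-test metric, corr ~ 0.7.
rng = random.Random(0)
rho = 0.7
x = [rng.gauss(0, 1) for _ in range(5000)]
y = [rho * xi + sqrt(1 - rho**2) * rng.gauss(0, 1) for xi in x]

y_adj = cuped_adjust(y, x)
r = covariance(x, y) / sqrt(pvariance(x) * pvariance(y))  # sample correlation
variance_ratio = pvariance(y_adj) / pvariance(y)          # equals 1 - r**2

print(f"sample r = {r:.3f}, variance ratio = {variance_ratio:.3f}")
print(f"effective sample-size multiplier = {1 / variance_ratio:.2f}")
```

The adjustment leaves the mean of y unchanged, so the estimated lift is still comparable across cells; only its variance shrinks.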
2.8σ only becomes operational once you pick your baseline metric and cell size. For a
conversion rate p = 5% with n = 10,000 per cell:
σ_diff ≈ √(2·0.05·0.95/10,000) ≈ 0.31% absolute. Then
MDE = 2.8 × 0.31% ≈ 0.86% absolute — about a 17% relative lift.
The interactive's "2.8σ" is the same picture for every test; the corresponding "what lift can I
actually detect?" depends entirely on your p and n. One reconciliation:
the figure's 2.8σ is the 88%-power picture at one-sided α = 0.05, while the
sample-size formula below targets 80% power, where the separation is
z_α + z_β ≈ 2.49σ (hence (z_α+z_β)² ≈ 6.18). Same machinery, two design
points — pick your power and the multiplier follows.
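To turn the multiplier into a raw lift, the arithmetic above can be sketched as follows. It assumes a one-sided test and equal cell sizes; the function names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

Z = NormalDist().inv_cdf  # standard normal quantile function


def separation(alpha: float, power: float) -> float:
    """z_alpha + z_beta for a one-sided test, in standard errors."""
    return Z(1 - alpha) + Z(power)


def se_diff_proportion(p: float, n_per_cell: int) -> float:
    """Standard error of the difference in conversion rates, equal cells."""
    return sqrt(2 * p * (1 - p) / n_per_cell)


p, n = 0.05, 10_000
se = se_diff_proportion(p, n)
for power in (0.80, 0.88):
    mult = separation(0.05, power)
    mde_abs = mult * se
    print(f"power={power:.2f}  multiplier={mult:.2f}  "
          f"MDE={mde_abs:.2%} absolute, {mde_abs / p:.0%} relative")
```

The two rows correspond to the two design points: about 2.49 standard errors at 80% power and about 2.8 at 88% power.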
Two-sample z-test for means: n ≈ 2(z_α + z_β)² · s² / Δ² per cell, where s is the per-user SD (not σ_diff) and Δ is the MDE in raw units. For proportions:
n ≈ 2(z_α + z_β)² · p(1-p) / Δ². At α = 0.05 (one-sided) and power = 0.80,
(z_α + z_β)² ≈ 6.18. Because n scales as 1/Δ², halving the MDE you want to detect requires 4× the sample.
The cost of small effects compounds.
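The formula is short enough to sketch directly. This assumes a one-sided z-test with equal cells; `n_per_cell` and its parameter names are illustrative.

```python
from math import ceil
from statistics import NormalDist


def n_per_cell(delta: float, variance: float,
               alpha: float = 0.05, power: float = 0.80) -> int:
    """Users per cell for a one-sided two-sample z-test.

    delta    -- MDE in raw units (absolute lift)
    variance -- per-user variance: s**2 for a mean, p*(1-p) for a proportion
    """
    z = NormalDist().inv_cdf
    k = (z(1 - alpha) + z(power)) ** 2  # about 6.18 at the defaults
    return ceil(2 * k * variance / delta ** 2)


p = 0.05
full = n_per_cell(delta=0.005, variance=p * (1 - p))
half = n_per_cell(delta=0.0025, variance=p * (1 - p))
print(full, half, round(half / full, 2))  # halving delta costs about 4x
```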
Directional (one-sided): "does marketing increase GMV?" All of α sits in one tail, which buys more power. Use it only when you've genuinely pre-committed that you don't care about negative effects.
Non-directional (two-sided): marketing could help or hurt, so α is split across both tails, which costs power. Use it whenever harm is possible: cannibalization, brand damage, opportunity cost of crowding out a better channel.
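The power cost of splitting α can be put in numbers. A minimal sketch, assuming the true effect is expressed in standard errors as elsewhere in this piece; the function names are illustrative.

```python
from statistics import NormalDist

N = NormalDist()


def power_one_sided(separation: float, alpha: float = 0.05) -> float:
    """Power when all of alpha sits in the upper tail."""
    return 1 - N.cdf(N.inv_cdf(1 - alpha) - separation)


def power_two_sided(separation: float, alpha: float = 0.05) -> float:
    """Power when alpha is split across both tails (either tail rejects)."""
    z = N.inv_cdf(1 - alpha / 2)
    return (1 - N.cdf(z - separation)) + N.cdf(-z - separation)


sep = 2.49  # the 80%-power design point for a one-sided test
print(f"one-sided: {power_one_sided(sep):.2f}")
print(f"two-sided: {power_two_sided(sep):.2f}")
```

At the same true effect and sample size, the two-sided test lands at roughly 70% power instead of 80%.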
The α math assumes one test, at the planned end. Each early peek at α = 0.05 is another shot at rejecting H0: five equally spaced peeks push the effective α to roughly 14%. If you must look mid-test, use a sequential design that spends α correctly: group-sequential boundaries (O'Brien-Fleming, Pocock), alpha-spending functions, or mSPRT / e-processes for fully sequential monitoring. "I'll just check on Wednesday" is not one of them.
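The inflation from peeking is easy to see in a small simulation. This is a sketch under stated assumptions: H0 is true, looks are equally spaced, each look uses a nominal two-sided 5% threshold (|z| > 1.96), and the test stops at the first rejection. The function name and defaults are illustrative.

```python
import random
from math import sqrt


def false_positive_rate(n_peeks: int, n_sims: int = 20_000,
                        z_crit: float = 1.96, seed: int = 0) -> float:
    """Share of simulated null tests that reject at any of n_peeks looks."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_sims):
        running_sum = 0.0
        for look in range(1, n_peeks + 1):
            running_sum += rng.gauss(0.0, 1.0)  # one equal-sized batch, no true effect
            z = running_sum / sqrt(look)        # z-statistic on all data so far
            if abs(z) > z_crit:
                rejections += 1                 # stop at the first "significant" peek
                break
    return rejections / n_sims


print(f"1 look : {false_positive_rate(1):.3f}")
print(f"5 looks: {false_positive_rate(5):.3f}")
```

A single look stays near the nominal 5%, while five looks land in the neighborhood of 14%.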
Each treatment-vs-control comparison eats α. A four-cell test (control + three treatments) means three comparisons, so the
Bonferroni-adjusted level is α' = α/3 ≈ 0.017 per comparison to hold family-wise α at 0.05.
Required n per cell grows accordingly (roughly 30–45% more at 80% power, depending on whether the test is one- or two-sided). Many A/B platforms run
multi-arm tests without correcting; their stated α is then optimistic.
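The sample-size penalty from a Bonferroni correction follows from the same (z_α + z_β)² term. A minimal sketch, assuming each treatment is compared only against control; the function name is illustrative.

```python
from statistics import NormalDist


def sample_size_inflation(n_comparisons: int, alpha: float = 0.05,
                          power: float = 0.80, two_sided: bool = False) -> float:
    """Ratio of required n per cell with vs. without Bonferroni correction."""
    z = NormalDist().inv_cdf
    tails = 2 if two_sided else 1
    base = (z(1 - alpha / tails) + z(power)) ** 2
    adjusted = (z(1 - alpha / (tails * n_comparisons)) + z(power)) ** 2
    return adjusted / base


# Control + three treatments -> three comparisons.
print(f"one-sided: x{sample_size_inflation(3):.2f}")
print(f"two-sided: x{sample_size_inflation(3, two_sided=True):.2f}")
```

For three comparisons at 80% power this comes out to a multiplier of about 1.43 one-sided and about 1.33 two-sided.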
α, power, n, σ, and the MDE you want to detect: there is no free lunch. Pick four
of the five and the fifth is determined.
Worked example. Suppose your baseline conversion rate is p = 5%, you want
to detect a 10% relative lift (so Δ = 0.5% absolute), at
α = 0.05 (one-sided) and power = 0.80. With
(z_α + z_β)² ≈ 6.18:
n ≈ 2 · 6.18 · 0.05·0.95 / (0.005)² ≈ 23,500 per cell, or about 47k in total. Note n
counts users (visitors), not conversions. If your channel delivers ~10k users per week, that's
about 5 weeks to reach the planned sample. If that's not feasible, your options are:
accept a larger MDE (15% relative lift drops the requirement to ~10k per cell), use CUPED with a good
pre-test covariate (ρ=0.5 → ~25% fewer users
needed), or — the answer most channels don't want to hear — stop running tests on metrics your traffic
can't support and use observational + holdout analysis instead.
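The worked example and its fallback options can be recomputed in a few lines. This is a planning sketch only, assuming two equal cells, a one-sided test at α = 0.05 with 80% power, and CUPED modeled as a (1 − ρ²) multiplier on variance; the constant and function names are illustrative.

```python
from math import ceil
from statistics import NormalDist

P_BASE = 0.05          # baseline conversion rate
WEEKLY_USERS = 10_000  # users the channel delivers per week, both cells combined

_z = NormalDist().inv_cdf
K = (_z(0.95) + _z(0.80)) ** 2  # (z_alpha + z_beta)^2, about 6.18


def users_per_cell(relative_lift: float, rho: float = 0.0) -> int:
    """Required users per cell; rho is the CUPED covariate correlation."""
    delta = P_BASE * relative_lift                      # absolute MDE
    variance = P_BASE * (1 - P_BASE) * (1 - rho ** 2)   # CUPED shrinks variance
    return ceil(2 * K * variance / delta ** 2)


def weeks_needed(n_cell: int) -> int:
    """Whole weeks to fill both cells at the assumed traffic."""
    return ceil(2 * n_cell / WEEKLY_USERS)


scenarios = {
    "10% lift, no CUPED": users_per_cell(0.10),
    "15% lift, no CUPED": users_per_cell(0.15),
    "10% lift, CUPED rho=0.5": users_per_cell(0.10, rho=0.5),
}
for name, n_cell in scenarios.items():
    print(f"{name:26s} n/cell={n_cell:6d}  weeks={weeks_needed(n_cell)}")
```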