Measurement Field Guide
Figure 04 · Test design rigor

α and β are two sides of the same coin: at a fixed sample size, you can't shrink one without paying in the other.

α: false positives. β: false negatives. Pay one, shrink the other.

A holdout test is a hypothesis test. The null hypothesis (H0) says marketing has no effect. The alternative (H1) says there is a real lift. The decision threshold determines how you trade off two kinds of errors. In the math below, σ is the standard error of the difference in means between cells: σ_diff = s · √(1/n_t + 1/n_c) for a continuous metric with per-user SD s, or √(2p(1-p)/n) for a conversion rate p with n users per cell. "MDE = 2.8σ" means the true effect is 2.8 standard errors above zero: a standardized effect size, not a raw lift.
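The two standard-error formulas above can be sketched in a few lines. This is a minimal illustration, not a production routine: the function names and the per-user SD of 40 are hypothetical, chosen only to show how an unbalanced holdout inflates σ_diff at the same total sample.

```python
import math

def se_diff_continuous(s: float, n_t: int, n_c: int) -> float:
    """Standard error of the difference in means, given per-user SD s."""
    return s * math.sqrt(1 / n_t + 1 / n_c)

def se_diff_proportion(p: float, n: int) -> float:
    """Standard error of the difference in conversion rates, n users per cell."""
    return math.sqrt(2 * p * (1 - p) / n)

# Hypothetical numbers: per-user SD of 40, 20,000 users in total.
balanced = se_diff_continuous(s=40.0, n_t=10_000, n_c=10_000)
holdout_10pct = se_diff_continuous(s=40.0, n_t=18_000, n_c=2_000)
conversion = se_diff_proportion(p=0.05, n=10_000)

print(f"50/50 split, continuous metric: {balanced:.3f}")
print(f"90/10 split, continuous metric: {holdout_10pct:.3f}")
print(f"conversion rate, p=5%, n=10,000 per cell: {conversion:.5f}")
```

With the same 20,000 users, the 90/10 split has a noticeably larger standard error than the 50/50 split, because the small cell dominates the 1/n_t + 1/n_c term.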

The α-power trade-off

Figure: two overlapping normal curves. The left curve is H0 (no effect). The right curve is H1 (a real lift exists), centered at MDE = 2.8σ. The vertical line is the decision threshold: moving it right lowers α (false positive rate) but also lowers power (true positive rate).

At the threshold shown: α (false positive) = 0.050, power (true positive) = 0.876, β (false negative) = 0.124.
Two kinds of mistakes
Type I · α
False positive — conclude lift exists when it doesn't. Convention: α = 0.05.
Type II · β
False negative — miss a real lift. Power = 1 − β, target ≥ 0.80.
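The threshold picture can be reproduced numerically. The sketch below assumes a one-sided test and measures both the threshold and the true effect in units of σ_diff; `error_rates` is a hypothetical helper name, not something from the original page.

```python
from statistics import NormalDist

Z = NormalDist()  # standard normal

def error_rates(threshold: float, effect: float) -> tuple[float, float, float]:
    """Return (alpha, power, beta) for a one-sided test.

    threshold and effect are both expressed in units of sigma_diff.
    """
    alpha = 1 - Z.cdf(threshold)            # area of H0 to the right of the threshold
    power = 1 - Z.cdf(threshold - effect)   # area of H1 to the right of the threshold
    return alpha, power, 1 - power

# The design point described in the text: threshold at 1.645 sigma, MDE = 2.8 sigma.
alpha, power, beta = error_rates(threshold=1.645, effect=2.8)
print(f"alpha={alpha:.3f} power={power:.3f} beta={beta:.3f}")

# Moving the threshold trades one error for the other.
for t in (1.282, 1.645, 2.326):
    a, pw, b = error_rates(t, effect=2.8)
    print(f"threshold={t:.3f}  alpha={a:.3f}  power={pw:.3f}  beta={b:.3f}")
```

A stricter threshold (further right) buys a lower α at the price of a higher β, which is the whole trade-off in one loop.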
The four levers of test design
n Sample size

More data tightens both curves: σ_diff ∝ 1/√n. The reliable lever — also the costly one.

T Test duration

In most settings just a proxy for n (longer test = more traffic). It acts independently of n only when you model time-correlated effects, such as weekly cycles or novelty decay.

δ MDE target

Raises the bar: designing for a larger MDE needs less data, but true effects smaller than the MDE will usually slip through undetected.

σ↓ Variance reduction

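CUPED: regress the outcome on a pre-test covariate and analyze the adjusted outcome. Variance falls to (1 − ρ²) of its original value, where ρ is the correlation between covariate and outcome. At ρ = 0.7 that is about 0.51, roughly equivalent to 2× the sample size for free.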
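A minimal simulation of the CUPED adjustment, assuming a single pre-test covariate with correlation ρ = 0.7 to the outcome. The data are synthetic and the variable names are illustrative; in a real test the slope θ is usually estimated on pooled data and applied to both cells.

```python
import math
import random
from statistics import fmean, variance

random.seed(0)
rho, n = 0.7, 20_000  # assumed pre/post correlation and simulated user count

# Synthetic users: x is the pre-test covariate, y the in-test outcome.
x = [random.gauss(0, 1) for _ in range(n)]
y = [rho * xi + math.sqrt(1 - rho**2) * random.gauss(0, 1) for xi in x]

# theta is the OLS slope of y on x: cov(x, y) / var(x).
x_bar, y_bar = fmean(x), fmean(y)
cov_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / (n - 1)
theta = cov_xy / variance(x)

# CUPED-adjusted outcome: same mean as y, lower variance.
y_cuped = [yi - theta * (xi - x_bar) for xi, yi in zip(x, y)]

variance_ratio = variance(y_cuped) / variance(y)
effective_n_multiplier = 1 / variance_ratio

print(f"variance ratio: {variance_ratio:.3f} (theory: {1 - rho**2:.3f})")
print(f"effective sample size multiplier: {effective_n_multiplier:.2f}x")
```

The adjusted metric keeps the same mean, so the treatment-effect estimate is unchanged in expectation; only its noise shrinks.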

σ → business units

2.8σ only becomes operational once you pick your baseline metric and cell size. For a conversion rate p = 5% with n = 10,000 per cell: σ_diff ≈ √(2·0.05·0.95/10,000) ≈ 0.31% absolute. Then MDE = 2.8 × 0.31% ≈ 0.86% absolute, about a 17% relative lift. The figure's "2.8σ" is the same picture for every test; the corresponding "what lift can I actually detect?" depends entirely on your p and n.

One reconciliation: the figure's 2.8σ is the 88%-power picture at one-sided α = 0.05, while the sample-size formula below targets 80% power, where the separation is z_α + z_β ≈ 1.645 + 0.842 ≈ 2.49σ (hence (z_α+z_β)² ≈ 6.18). Same machinery, two design points: pick your power and the multiplier follows.
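The conversion from a σ multiplier to a detectable lift can be sketched as follows, assuming a one-sided test on a conversion rate with equal cells. `mde_proportion` is a hypothetical helper; it returns the multiplier z_α + z_β together with the absolute and relative MDE.

```python
import math
from statistics import NormalDist

def mde_proportion(p: float, n: int, alpha: float = 0.05, power: float = 0.80):
    """Return (multiplier, absolute MDE, relative MDE) for n users per cell."""
    z = NormalDist()
    multiplier = z.inv_cdf(1 - alpha) + z.inv_cdf(power)  # z_alpha + z_beta, one-sided
    se_diff = math.sqrt(2 * p * (1 - p) / n)
    mde_abs = multiplier * se_diff
    return multiplier, mde_abs, mde_abs / p

# The two design points discussed above, at p = 5% and n = 10,000 per cell.
for target_power in (0.876, 0.80):
    m, mde_abs, mde_rel = mde_proportion(p=0.05, n=10_000, power=target_power)
    print(f"power={target_power:.3f}: {m:.2f} sigma, "
          f"MDE = {mde_abs:.2%} absolute, {mde_rel:.1%} relative")
```

Dropping the power target from about 88% to 80% lowers the multiplier from about 2.8 to about 2.49, so the same test can claim a somewhat smaller MDE, at the cost of missing real effects more often.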

Sample size formula

Two-sample z-test for means: n ≈ 2(z_α + z_β)² · s² / Δ² per cell, where s is the per-user SD (not the standard error σ_diff used above) and Δ is the raw MDE. For proportions: n ≈ 2(z_α + z_β)² · p(1-p) / Δ². At α = 0.05 (one-sided) and power = 0.80, (z_α + z_β)² ≈ 6.18. Halving the MDE you want to detect requires 4× the sample. The cost of small effects compounds.
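For readers who want the missing step, the formula follows from the threshold picture in a few lines. This is a sketch under the usual assumptions (one-sided test, equal cells of size n, normal approximation), writing s for the per-user SD and D for the observed difference between cells.

```latex
\begin{align*}
\text{Reject } H_0 \text{ when } & D > z_\alpha \,\sigma_{\text{diff}},
  \qquad \sigma_{\text{diff}} = s\sqrt{2/n} \\
\text{Power under a true effect } \Delta: \quad
  & P\!\left(D > z_\alpha \sigma_{\text{diff}}\right)
    = \Phi\!\left(\frac{\Delta}{\sigma_{\text{diff}}} - z_\alpha\right) = 1 - \beta \\
\Rightarrow \quad & \frac{\Delta}{\sigma_{\text{diff}}} = z_\alpha + z_\beta \\
\Rightarrow \quad & n = \frac{2\,(z_\alpha + z_\beta)^2\, s^2}{\Delta^2},
  \qquad s^2 = p(1-p) \text{ for a conversion rate.}
\end{align*}
```

Because Δ enters squared in the denominator, halving Δ multiplies n by four.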

One-tailed vs two-tailed
One-tailed

Directional: "does marketing increase GMV?" All α on one side → more power. Use only when you've genuinely pre-committed that you don't care about negative effects.

Two-tailed

Non-directional — marketing could help or hurt; α split across both tails → less power. Use whenever harm is possible — cannibalization, brand damage, opportunity cost of crowding out a better channel.
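To put a number on "more power" versus "less power", the sketch below compares the two designs at α = 0.05 and a target power of 0.80. It uses the normal approximation and ignores the negligible probability of rejecting in the wrong tail; the variable names are illustrative.

```python
from statistics import NormalDist

z = NormalDist()
alpha, target_power = 0.05, 0.80

z_beta = z.inv_cdf(target_power)
z_one = z.inv_cdf(1 - alpha)       # one-tailed critical value, about 1.645
z_two = z.inv_cdf(1 - alpha / 2)   # two-tailed critical value, about 1.960

# Extra sample a two-tailed test needs to reach the same power.
n_ratio = ((z_two + z_beta) / (z_one + z_beta)) ** 2

# Power of a two-tailed test run at the sample sized for the one-tailed design.
effect_in_sigma = z_one + z_beta
power_two_tailed = 1 - z.cdf(z_two - effect_in_sigma)

print(f"critical values: one-tailed {z_one:.3f}, two-tailed {z_two:.3f}")
print(f"two-tailed needs {n_ratio:.2f}x the sample for the same power")
print(f"or delivers power {power_two_tailed:.2f} at the one-tailed sample size")
```

So the price of guarding against harm is roughly a quarter more sample, or about ten points of power if the sample is fixed.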

Don't peek

The α math assumes one test, at the planned end. Each early peek at α = 0.05 is another shot at rejecting H0: five peeks push the effective α to roughly 14%. If you must look mid-test, use a sequential design that spends α correctly: group-sequential boundaries (O'Brien-Fleming, Pocock), alpha-spending functions, or mSPRT / e-processes for fully sequential. "I'll just check on Wednesday" is not one of them.
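The inflation from peeking is easy to check by simulation. The sketch below assumes a two-sided test at nominal α = 0.05 (critical value 1.96), equally sized batches of data between looks, and no true effect; `false_positive_rate` is a hypothetical helper and the simulation count and seed are arbitrary.

```python
import math
import random

def false_positive_rate(looks: int, z_crit: float = 1.96,
                        sims: int = 20_000, seed: int = 1) -> float:
    """Share of null experiments that reject H0 at any of `looks` interim checks."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(sims):
        cumulative = 0.0
        for k in range(1, looks + 1):
            cumulative += rng.gauss(0, 1)   # standardized signal from one new batch, under H0
            z = cumulative / math.sqrt(k)    # z-statistic on all data collected so far
            if abs(z) > z_crit:              # stop and declare a "win" at the first crossing
                rejections += 1
                break
    return rejections / sims

single_look = false_positive_rate(looks=1)
five_looks = false_positive_rate(looks=5)
print(f"1 look:  {single_look:.3f}")
print(f"5 looks: {five_looks:.3f}")
```

With one look the rate sits near the nominal 5%; with five looks it lands in the neighborhood of the 14% quoted above, even though every individual check used the "right" threshold.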

Multiple cells

Each pairwise comparison eats α. A four-cell test (control + three treatments) → three comparisons → Bonferroni-adjusted α' = α/3 ≈ 0.017 per comparison to hold family-wise α at 0.05. Required n per cell grows accordingly: at 80% power, roughly 33% more for a two-sided test and roughly 43% more for a one-sided one. Many A/B platforms run multi-arm tests without correcting; their stated α is then optimistic.
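The sample-size cost of the Bonferroni correction can be computed directly from the (z_α + z_β)² multiplier. This is a sketch under the normal approximation; `bonferroni_inflation` is a hypothetical helper name.

```python
from statistics import NormalDist

def bonferroni_inflation(comparisons: int, alpha: float = 0.05,
                         power: float = 0.80, two_sided: bool = False) -> float:
    """Ratio of required n per cell with vs. without Bonferroni correction."""
    z = NormalDist()
    tails = 2 if two_sided else 1
    z_beta = z.inv_cdf(power)
    base = z.inv_cdf(1 - alpha / tails) + z_beta
    adjusted = z.inv_cdf(1 - alpha / (comparisons * tails)) + z_beta
    return (adjusted / base) ** 2

one_sided = bonferroni_inflation(3)
two_sided = bonferroni_inflation(3, two_sided=True)
print(f"three comparisons, one-sided: {one_sided - 1:.0%} more users per cell")
print(f"three comparisons, two-sided: {two_sided - 1:.0%} more users per cell")
```

The inflation grows slowly with the number of comparisons, since only z_α moves while z_β stays fixed.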

What the p-value actually means
P-value
P(observed Δ or more extreme | H0 is true)
Decision rule
If p < α → reject H0 → declare lift significant.
Common trap
p = 0.03 does not mean "3% chance the result is wrong". It means that, if there were truly no effect, data this extreme (or more) would show up 3% of the time.
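As a concrete instance of the definition, here is a pooled two-proportion z-test. The conversion counts are hypothetical, invented purely for illustration, and `two_proportion_p_values` is not a library function.

```python
import math
from statistics import NormalDist

def two_proportion_p_values(conv_c: int, n_c: int, conv_t: int, n_t: int):
    """Return (z, one-sided p, two-sided p) for treatment vs. control conversion."""
    p_c, p_t = conv_c / n_c, conv_t / n_t
    pooled = (conv_c + conv_t) / (n_c + n_t)   # rate under H0: both cells share one p
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_c + 1 / n_t))
    z = (p_t - p_c) / se
    std_normal = NormalDist()
    one_sided = 1 - std_normal.cdf(z)              # P(Z >= z | H0)
    two_sided = 2 * (1 - std_normal.cdf(abs(z)))   # P(|Z| >= |z| | H0)
    return z, one_sided, two_sided

# Hypothetical result: 500/10,000 conversions in control, 560/10,000 in treatment.
z_stat, p_one, p_two = two_proportion_p_values(500, 10_000, 560, 10_000)
print(f"z = {z_stat:.2f}, one-sided p = {p_one:.3f}, two-sided p = {p_two:.3f}")
```

In this made-up case the same data clear α = 0.05 one-sided but not two-sided, which is why the choice of tails has to be made before looking at the result.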
Takeaway

A test is a bet against two kinds of mistakes. α, power, n, σ, and the MDE you want to detect are linked algebraically, so there is no free lunch: pick four of the five and the fifth is determined.

Worked example. Suppose your baseline conversion rate is p = 5%, you want to detect a 10% relative lift (so Δ = 0.5% absolute), at α = 0.05 (one-sided) and power = 0.80. With (z_α + z_β)² ≈ 6.18: n ≈ 2 · 6.18 · 0.05·0.95 / (0.005)² ≈ 23,500 per cell — call it 50k total. Note n counts users (visitors), not conversions. If your channel delivers ~10k users per week, that's about 5 weeks to reach the planned sample. If that's not feasible, your options are: accept a larger MDE (15% relative lift drops the requirement to ~10k per cell), use CUPED with a good pre-test covariate (ρ=0.5 → ~25% fewer users needed), or — the answer most channels don't want to hear — stop running tests on metrics your traffic can't support and use observational + holdout analysis instead.
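The worked example and its fallback options can be laid out as a small planning table. This sketch assumes one-sided α = 0.05, 80% power, two equal cells, and 10k users per week; the `rho` argument models CUPED as a flat (1 − ρ²) variance reduction, which is an approximation for a binary conversion metric. `n_per_cell` is a hypothetical helper.

```python
import math
from statistics import NormalDist

def n_per_cell(p: float, rel_lift: float, alpha: float = 0.05,
               power: float = 0.80, rho: float = 0.0) -> int:
    """Users per cell to detect a relative lift on conversion rate p (one-sided)."""
    z = NormalDist()
    multiplier = (z.inv_cdf(1 - alpha) + z.inv_cdf(power)) ** 2
    delta = p * rel_lift                          # absolute MDE
    var = p * (1 - p) * (1 - rho ** 2)            # assumed CUPED variance reduction
    return math.ceil(2 * multiplier * var / delta ** 2)

weekly_users = 10_000  # traffic across both cells, as in the example
scenarios = {
    "baseline: 10% relative lift": dict(rel_lift=0.10),
    "larger MDE: 15% relative lift": dict(rel_lift=0.15),
    "CUPED, rho = 0.5": dict(rel_lift=0.10, rho=0.5),
}

plan = {}
for label, kwargs in scenarios.items():
    n = n_per_cell(p=0.05, **kwargs)
    weeks = math.ceil(2 * n / weekly_users)
    plan[label] = (n, weeks)
    print(f"{label:32s} n per cell = {n:>6,}  weeks = {weeks}")
```

Swapping in your own p, lift, and weekly traffic shows quickly whether a test is feasible before any budget is committed.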

Methods note

Numbers throughout are illustrative. The 2.8σ separation, the 5% baseline rate, the 10% MDE, and the resulting ~23,500-per-cell are the simplest case that makes the α–power trade-off legible; substitute your own p, MDE, and σ.
