The holdout cell is rarely truly clean. The raw lift you measure is a diluted answer.
Two flavors of leakage. Raw lift understates. Intrinsic lift restores.
Pollution happens within the test cell: marketing leaks into the holdout
(or fails to reach part of BAU). In causal-inference language, this is non-compliance with
assignment. Raw lift estimates the ITT (Intent-to-Treat) effect: the
impact of assignment, not of treatment. Intrinsic lift is the CACE (Complier Average Causal Effect),
also known as the LATE (Local Average Treatment Effect): the effect among compliers.
The standard correction intrinsic = raw / (1−p) is the Wald estimator from
instrumental-variables analysis with binary treatment.
Detection (holdout pollution): touches show up in the traffic data. A holdout user appears in click / impression logs even though they should not have been served, so the polluted share is observable.
Detection (BAU exclusion): there is no event to count. Absence of a touch does not generate a record, so the excluded share is not directly observable at the user level.
What raw lift represents: the marketing effect you can expect under real-world operating conditions, including the leakage that's part of how the channel actually runs.
Use raw lift for: in-market performance forecasts, year-over-year comparisons, anything where you want today's lived performance, not the platonic ideal.
What intrinsic lift represents: the per-exposure causal effect of marketing, i.e. what would happen if every BAU user got marketed to and zero holdout users did.
Use intrinsic lift for: projecting impact when scaling the channel to the whole market, comparing channels on equal footing, sizing budget reallocations.
Let p = share of holdout users actually touched by marketing (read from traffic logs).
Under random assignment, a touched holdout user behaves like a BAU user. The raw lift is therefore a
weighted average of the intrinsic lift and zero — diluted by p. This leans on
the IV exclusion restriction: assignment moves conversions only through actual marketing
exposure, never on its own.
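Written out, the dilution argument is short. This is a sketch under the assumptions already stated (random assignment, the exclusion restriction, and touched holdout users receiving the full BAU effect). The symbols \tau for the intrinsic lift and c_0 for the untreated conversion rate are introduced here for illustration only.

```latex
\begin{align*}
\text{BAU rate}     &= c_0 + \tau \\
\text{holdout rate} &= c_0 + p\,\tau \\
\text{raw}          &= (c_0 + \tau) - (c_0 + p\,\tau)
                     = (1-p)\,\tau + p \cdot 0 \\
\tau                &= \frac{\text{raw}}{1-p}
\end{align*}
```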
1. Estimate p from holdout users who appear in marketing traffic logs (touch tables, click streams). Use the same channel definition the test used.
2. Compute raw lift directly from the cells as assigned (intent-to-treat).
3. Divide by (1 − p) for the point estimate. For the CI, use the delta method:
   Var(intrinsic) ≈ Var(raw)/(1−p)² + raw² · Var(p̂)/(1−p)⁴. Raw 8 ± 1pp with p = 33 ± 5pp → intrinsic ≈ 12 ± 1.7pp.
4. Sanity check: if p is large or noisy, the inflation factor explodes. Report both raw and intrinsic, never just one.
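The point estimate and delta-method interval from the steps above can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: the function name `intrinsic_lift` is hypothetical, the ± figures are treated as one standard error each, and the raw lift and p̂ are assumed to be estimated independently (the variance formula above has no covariance term).

```python
import math


def intrinsic_lift(raw, se_raw, p, se_p):
    """Wald point estimate and delta-method standard error.

    raw, se_raw: raw (ITT) lift and its standard error, in percentage points.
    p, se_p:     polluted share of the holdout and its standard error, as fractions.
    Assumes raw and p-hat are independent (no covariance term).
    """
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    scale = 1.0 - p
    point = raw / scale
    # Var(intrinsic) ~= Var(raw)/(1-p)^2 + raw^2 * Var(p_hat)/(1-p)^4
    variance = se_raw**2 / scale**2 + raw**2 * se_p**2 / scale**4
    return point, math.sqrt(variance)


point, se = intrinsic_lift(raw=8.0, se_raw=1.0, p=0.33, se_p=0.05)
print(f"raw = 8.0 +/- 1.0 pp, intrinsic = {point:.1f} +/- {se:.1f} pp")
```

Reporting the raw inputs alongside the corrected figure, as the print line does, keeps both numbers in view when p is large or noisy.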
The Wald correction assumes a touched holdout user got the same dose as a BAU
user: same impressions, same frequency, same timing. That is usually false. Pollution is typically
one stray cross-device touch, not the full ad-stock that BAU users accumulate. When the polluted dose
is lighter than the BAU dose, raw / (1−p) overstates the true intrinsic
effect: the polluted users responded weakly because they got less marketing, so they diluted the raw
lift by less than p implies, and dividing by the full (1−p) over-corrects.
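To see how much the equal-dose assumption matters, the sketch below compares the plain Wald correction with a dose-weighted one. It rests on an assumption that is not in the text above: the effect is taken to scale linearly with dose, so a polluted user who received a fraction `dose_ratio` of the BAU dose gets that fraction of the effect. All names and numbers are hypothetical.

```python
def observed_raw_lift(true_intrinsic, p, dose_ratio):
    """Raw lift seen when a share p of the holdout gets dose_ratio of the BAU effect.

    Assumes a linear dose-response (an illustrative simplification).
    """
    return true_intrinsic * (1.0 - p * dose_ratio)


def wald_estimate(raw, p):
    """Standard correction: treats every polluted user as fully treated."""
    return raw / (1.0 - p)


def dose_adjusted_estimate(raw, p, dose_ratio):
    """Correction that counts each polluted user as dose_ratio of a treated user."""
    return raw / (1.0 - p * dose_ratio)


true_intrinsic = 12.0   # pp, chosen for illustration
p = 0.33                # polluted share of the holdout
dose_ratio = 0.25       # polluted users got a quarter of the BAU dose

raw = observed_raw_lift(true_intrinsic, p, dose_ratio)
wald = wald_estimate(raw, p)
adjusted = dose_adjusted_estimate(raw, p, dose_ratio)
print(f"raw={raw:.2f}  wald={wald:.2f}  dose-adjusted={adjusted:.2f}")
```

With `dose_ratio` below 1 the dose-weighted denominator is larger than 1 − p, so the dose-adjusted figure is the smaller of the two. The gap between them is a rough gauge of how sensitive the result is to the dose assumption.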
Typical industry ranges: Connected TV 30–50% (household-level targeting → device sharing); walled-garden display 10–30% (cross-device + lookalike); open programmatic 5–15% (ID fragmentation actually helps the holdout); branded paid search 0–5% (intent-driven, hard to "miss" the holdout). The 33% example here is mid-range — use your channel's actual logs, not a default.
When holdout pollution (rate p) and BAU exclusion (rate q) happen
together, the combined adjustment is intrinsic = raw / (1 − p − q) — the same Wald form,
with the denominator now the gap in treated share between the cells: (1 − q) − p.
Running 2SLS with assignment as the instrument for actual treatment gives exactly this, plus proper
standard errors, and handles both sides jointly. q is undetectable from logs, so it always
requires a stated assumption (e.g., "we model 10% silent exclusion based on platform delivery
rates"). Document it.
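The two-sided Wald ratio can be computed directly from user-level data when both assignment and actual treatment are recorded. The sketch below is a toy illustration with made-up data and a hypothetical function name `wald_iv`. In practice treatment status on the BAU side is exactly what cannot be read from logs, so the treated flags there would come from the stated assumption about q. The sketch returns only the point estimate; a 2SLS routine from a statistics package would be needed for standard errors.

```python
from statistics import mean


def wald_iv(outcome, assigned, treated):
    """Wald / IV estimate: ITT effect on the outcome over ITT effect on treatment.

    outcome:  per-user outcome (e.g. conversions or revenue)
    assigned: 1 for BAU cell, 0 for holdout cell (the instrument)
    treated:  1 if the user actually received marketing, else 0
    """
    y_bau = [y for y, z in zip(outcome, assigned) if z == 1]
    y_hold = [y for y, z in zip(outcome, assigned) if z == 0]
    d_bau = [d for d, z in zip(treated, assigned) if z == 1]
    d_hold = [d for d, z in zip(treated, assigned) if z == 0]

    raw = mean(y_bau) - mean(y_hold)             # raw (ITT) lift
    treated_gap = mean(d_bau) - mean(d_hold)     # (1 - q) - p
    if treated_gap <= 0:
        raise ValueError("cells do not differ in treated share")
    return raw / treated_gap


# Toy data: 10 BAU users with 1 silently excluded (q = 0.1),
# 10 holdout users with 3 polluted (p = 0.3).
assigned = [1] * 10 + [0] * 10
treated = [1] * 9 + [0] * 1 + [1] * 3 + [0] * 7
# Made-up outcome: baseline 2.0 plus a true effect of 5.0 when treated.
outcome = [2.0 + 5.0 * d for d in treated]

estimate = wald_iv(outcome, assigned, treated)
print(f"Wald estimate: {estimate:.2f}")
```

Because the toy outcome is built with a constant treatment effect, the ratio recovers that effect. Real data adds noise and heterogeneous effects, so the estimate applies to compliers only.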
The "clean holdout" is structurally impossible now. Cross-device exposure, walled
gardens, Apple's App Tracking Transparency, third-party cookie deprecation — these have made zero
pollution unattainable on most digital channels. Intrinsic lift is no longer measurable; it
is estimable with assumptions. Practical detection methods: matching holdout IDs against
platform "delivered" logs (most reliable, requires deterministic ID), cookie-level cross-device
matching (degraded post-ATT), probabilistic matching for unauthenticated traffic (noisy). When
p is genuinely hard to pin down, propensity-matched untouched analysis
is the modern alternative: among holdout users with similar touch-propensity to BAU, compare the
matched-untouched subset and skip estimating p altogether.
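The deterministic-ID detection method mentioned above reduces to a set intersection. The sketch below assumes both the holdout roster and the platform's delivered log are available as lists of the same deterministic user ID. The function name and the IDs are hypothetical. The binomial standard error covers sampling noise only, not match failures or ID fragmentation.

```python
import math


def estimate_pollution(holdout_ids, delivered_ids):
    """Share of holdout users found in the delivered log, with a binomial SE."""
    holdout = set(holdout_ids)
    if not holdout:
        raise ValueError("holdout is empty")
    touched = holdout & set(delivered_ids)
    p_hat = len(touched) / len(holdout)
    # Binomial standard error: sampling noise only, ignores matching error.
    se = math.sqrt(p_hat * (1.0 - p_hat) / len(holdout))
    return p_hat, se


# Made-up IDs: 100 holdout users; the delivered log contains every third ID up to u198.
holdout_ids = [f"u{i}" for i in range(100)]
delivered_ids = [f"u{i}" for i in range(0, 200, 3)]

p_hat, se_p = estimate_pollution(holdout_ids, delivered_ids)
print(f"p_hat = {p_hat:.2f} +/- {se_p:.3f}")
```

The resulting p̂ and its standard error are the inputs the delta-method interval needs. With probabilistic or degraded matching, the true uncertainty on p is wider than this figure.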