Anatomy of a good test
- Hypothesis. "If we change X, we expect Y because Z." Specific, falsifiable.
- Primary metric. The one number that decides win/lose. NOT a list.
- Guardrail metrics. What you don't want to break (e.g. an activation lift doesn't count as a win if signup completion drops 10%).
- Sample size. Power calc: n = f(MDE, α, power, baseline rate). Don't eyeball it; see the sketch after this list.
- Stop conditions. Pre-decide when to call it. Stopping early when results look good is the most common abuse.
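A minimal power-calc sketch in Python, assuming a two-proportion z-test with the usual normal approximation and a relative MDE (the function name and defaults are mine, not a standard API):

```python
import math
from scipy.stats import norm

def sample_size_per_arm(baseline: float, mde_rel: float,
                        alpha: float = 0.05, power: float = 0.80) -> int:
    """Per-arm n for a two-proportion z-test (standard normal approximation)."""
    p1 = baseline
    p2 = baseline * (1 + mde_rel)       # relative MDE: 0.05 = detect a +5% lift
    z_alpha = norm.ppf(1 - alpha / 2)   # two-sided test
    z_power = norm.ppf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2
    return math.ceil(n)

# e.g. 10% baseline signup rate, powered to detect a +5% relative lift:
# sample_size_per_arm(0.10, 0.05)  ->  ~58,000 users per arm
```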
MDE — Minimum Detectable Effect
The smallest lift you'd care about. A test powered for MDE = 5% won't reliably detect a 2% lift, even if it's real. Picking MDE is a business decision: what lift would justify the build? Pick it too small and the required sample size explodes; too large and you'll miss real wins.
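Because n scales roughly with 1/MDE², halving the MDE quadruples the required sample. Sweeping the sketch above with a hypothetical 10% baseline makes the tradeoff concrete:

```python
for mde in (0.01, 0.02, 0.05, 0.10):
    n = sample_size_per_arm(0.10, mde)
    print(f"MDE {mde:.0%}: {n:,} per arm")   # a 1% relative MDE needs ~1.4M users per arm
```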
↳ the most common abuse
Peeking. Watching the test mid-run and stopping when it looks favorable. This inflates false-positive rates enormously — Kohavi shows that "significant" results from peeked tests are often noise. Pre-commit to a sample size and a stop condition.
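A quick A/A simulation shows the inflation (a sketch with arbitrary parameters; both arms share the same conversion rate by construction, so every "significant" call is a false positive):

```python
import numpy as np
from scipy.stats import norm

def peeked_false_positive_rate(n_sims: int = 2000, n_per_arm: int = 10_000,
                               peeks: int = 20, p: float = 0.10,
                               alpha: float = 0.05, seed: int = 0) -> float:
    """Run A/A tests, z-testing at `peeks` interim checkpoints and stopping
    at the first 'significant' result. Returns the false-positive rate."""
    rng = np.random.default_rng(seed)
    z_crit = norm.ppf(1 - alpha / 2)
    checkpoints = np.linspace(n_per_arm / peeks, n_per_arm, peeks).astype(int)
    hits = 0
    for _ in range(n_sims):
        a = rng.random(n_per_arm) < p   # control conversions
        b = rng.random(n_per_arm) < p   # "treatment" -- identical by design
        for n in checkpoints:
            pa, pb = a[:n].mean(), b[:n].mean()
            pooled = (pa + pb) / 2
            se = np.sqrt(2 * pooled * (1 - pooled) / n)
            if se > 0 and abs(pa - pb) / se > z_crit:
                hits += 1               # peeking "found" an effect that isn't there
                break
    return hits / n_sims

# peeked_false_positive_rate() comes out well above the nominal 5%;
# with peeks=1 (no peeking) it sits near 5%, as advertised.
```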
When NOT to A/B test
- The change is too large to stage as a treatment (rebuilding the whole app).
- You don't have enough traffic to power a test (required sample size > total weekly users).
- The decision is qualitative (brand redesign).
- It's a regulatory / safety change — you don't test those, you ship them.