Anatomy of a good test
- Hypothesis. "If we change X, we expect Y because Z." Specific, falsifiable.
- Primary metric. The one number that decides win/lose. NOT a list.
- Guardrail metrics. What you don't want to break (e.g. an activation lift doesn't count as a win if signup completion drops 10%).
- Sample size. Power calc: n = f(MDE, α, power, baseline rate). Don't eyeball it; see the sketch after this list.
- Stop conditions. Pre-decide when to call it. Stopping early when results look good is the most common abuse.
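A minimal power-calc sketch in Python, assuming a two-proportion z-test with the usual normal approximation and a relative MDE (the function name and defaults are mine, not a standard API):

```python
import math
from scipy.stats import norm

def sample_size_per_arm(baseline: float, mde_rel: float,
                        alpha: float = 0.05, power: float = 0.80) -> int:
    """Per-arm n for a two-proportion z-test (standard normal approximation)."""
    p1 = baseline
    p2 = baseline * (1 + mde_rel)       # relative MDE: 0.05 = detect a +5% lift
    z_alpha = norm.ppf(1 - alpha / 2)   # two-sided test
    z_power = norm.ppf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2
    return math.ceil(n)

# e.g. 10% baseline signup rate, powered to detect a +5% relative lift:
# sample_size_per_arm(0.10, 0.05)  ->  ~58,000 users per arm
```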
MDE — Minimum Detectable Effect
The smallest lift you'd care about. A test powered for MDE = 5% won't reliably detect a 2% lift, even if it's real. Picking MDE is a business decision: what lift would justify the build? Pick it too small and the required sample size explodes; too large and you'll miss real wins.
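Because n scales roughly with 1/MDE², halving the MDE quadruples the required sample. Sweeping the sketch above with a hypothetical 10% baseline makes the tradeoff concrete:

```python
for mde in (0.01, 0.02, 0.05, 0.10):
    n = sample_size_per_arm(0.10, mde)
    print(f"MDE {mde:.0%}: {n:,} per arm")   # a 1% relative MDE needs ~1.4M users per arm
```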
↳ the most common abuse
Peeking. Watching the test mid-run and stopping when it looks favorable. This inflates false-positive rates enormously — Kohavi shows that "significant" results from peeked tests are often noise. Pre-commit to a sample size and a stop condition.
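A quick A/A simulation shows the inflation (a sketch with arbitrary parameters; both arms share the same conversion rate by construction, so every "significant" call is a false positive):

```python
import numpy as np
from scipy.stats import norm

def peeked_false_positive_rate(n_sims: int = 2000, n_per_arm: int = 10_000,
                               peeks: int = 20, p: float = 0.10,
                               alpha: float = 0.05, seed: int = 0) -> float:
    """Run A/A tests, z-testing at `peeks` interim checkpoints and stopping
    at the first 'significant' result. Returns the false-positive rate."""
    rng = np.random.default_rng(seed)
    z_crit = norm.ppf(1 - alpha / 2)
    checkpoints = np.linspace(n_per_arm / peeks, n_per_arm, peeks).astype(int)
    hits = 0
    for _ in range(n_sims):
        a = rng.random(n_per_arm) < p   # control conversions
        b = rng.random(n_per_arm) < p   # "treatment" -- identical by design
        for n in checkpoints:
            pa, pb = a[:n].mean(), b[:n].mean()
            pooled = (pa + pb) / 2
            se = np.sqrt(2 * pooled * (1 - pooled) / n)
            if se > 0 and abs(pa - pb) / se > z_crit:
                hits += 1               # peeking "found" an effect that isn't there
                break
    return hits / n_sims

# peeked_false_positive_rate() comes out well above the nominal 5%;
# with peeks=1 (no peeking) it sits near 5%, as advertised.
```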
When NOT to A/B test
- The change is too large to stage as a treatment (rebuilding the whole app).
- You don't have enough traffic to power a test (required sample size > total weekly users).
- The decision is qualitative (brand redesign).
- It's a regulatory / safety change — you don't test those, you ship them.