Engineering Excellence

A/B Testing & Experiments

Intermediate

An A/B test shows different versions to different users at random and measures which one performs better. It is a careful way to answer the question "does this change actually help?" Done well, it replaces opinion with evidence. Done carelessly, it gives confident but wrong conclusions. Respect the statistics, respect the users, and never experiment with compliance or safety.

An A/B test is a controlled experiment. It splits users at random into a control group (the current experience) and one or more variants, then compares a chosen metric. It is how you test a hypothesis (see Hypothesis-Driven Development) with real behaviour instead of guesses. But it is easy to get wrong. Common mistakes are samples that are too small, checking results and stopping early, testing many things until one looks significant, or measuring the wrong outcome.

The essentials are: define the metric and sample size before you start, assign users at random, run the test long enough, and read the results honestly. In our regulated setting, experiments must respect privacy and consent. They must never run on controls that protect customers, such as KYC, screening, or fairness.

Design experiments rigorously

DoDefine the success metric and what "better" means before you start, tied to a clear hypothesis (see Hypothesis-Driven Development, Measuring What Matters).
DoAssign users to the control or a variant at random, and keep it stable. The same user should get the same experience for the whole test.
DoWork out the sample size you need, and run the test long enough to reach it. Small samples give noisy, misleading results.
DoUse feature flags or an experimentation platform to deliver variants safely, and to roll out the winner or roll back cleanly (see Feature Flags, Deployment Strategies).
DoMake sure both variants are correct and safe. An experiment is still production code and follows all our quality and security rules (see Definition of Done).
ConsiderTracking guardrail metrics alongside the target metric, so you catch a variant that raises clicks but harms errors, performance, or completion.

Interpret and run honestly

AlwaysDecide the metric, duration, and sample size up front, and stick to them. Do not check the results and stop the moment they look good. That creates false positives.
DoBe careful when you test many things at once. Test enough of them and one will look "significant" by chance. Stay sceptical (see Measuring What Matters).
DoReport results honestly, including results that show no effect or a negative effect. "No difference" is a valid and useful outcome (see Professional Ethics).
DoRespect users: privacy and consent for the tracking involved, accessibility for all variants, and no deceptive or manipulative designs (see Cookies & Consent, Product Analytics Privacy, Accessibility).
NeverA/B test or experiment on a compliance, security, or fairness control. For example, do not test whether skipping a KYC or screening step boosts conversion. These are never things to optimise (see AML Screening, Hypothesis-Driven Development).

Checking early on a tiny sample

// check the dashboard hourly; variant B is ahead after 80 users
// declare B the winner, ship it

Stopping early on a tiny sample the moment it looks good only "confirms" a result that is really just noise. The conclusion is probably wrong, and you ship a change that does not actually help.

Defined up front, big enough, honest

// metric: onboarding completion. Pre-computed sample ~5k/variant.
// run 2 weeks via flag, stable assignment, guardrails on errors/latency.
// analyse once at the end; ship only if the lift is real.

The design is fixed in advance, the sample is big enough, and guardrails catch side effects. The decision rests on a sound result: evidence, not noise.

Self-review checklist

AskDid I fix the metric, duration, and sample size before starting?
AskIs assignment random and stable, and are both variants safe and correct?
AskAm I reading the results honestly: not checking early, not picking favourable data, and watching the guardrails?
AskDoes this respect privacy and consent, and am I sure I am not experimenting on a compliance or safety control?

Why it matters: A/B testing is the best way to know whether a change truly helps. But poor practice gives confident, false conclusions that send the product in the wrong direction. Careful design, honest reading of results, and a firm rule against experimenting on safety or compliance controls let us make evidence-based decisions we can trust and defend.