A/B Test Analytics — Bayesian stats, revenue verdicts, segment truth | CustomFit.ai

What you see

One experiment, every angle that matters.

The verdict view CustomFit shows you the moment a test crosses your significance + lift gate. No exports. No SQL. No second tool.

PDP Hero CTA · "Ships tonight" vs "Add to cart"

Running 14 days · 25,643 visitors · 2 variants

Winner declared

Variant	Visitors	CVR	RPV	RPV lift
A (Control), "Add to cart"	12,842	2.84%	$1.84	baseline
B, "Ships tonight"	12,801	3.62%	$2.34	+27.4%

Statistic	Value
Bayesian P(B>A)	99.2%
p-value	0.012
95% CI on lift	+18.4% to +36.5%
Revenue impact	+$28,412 / wk

The metrics

Eight numbers we report on every experiment.

Each metric answers a different question. Read them in this order; never declare a winner on a single one.

Metric	What it is	Why it matters	Gotcha
Revenue per visitor (RPV)	Revenue ÷ unique visitors, per variant.	The only metric that maps directly to P&L. Catches CVR-lifting, AOV-tanking variants.	Needs a stable AOV baseline; trim refunds before you call a winner.
Conversion rate (CVR)	Orders ÷ unique visitors, per variant.	Fast to read, easy to explain, comparable across tests.	Blind to AOV. A win here can be a loss in revenue. Always pair with RPV.
Average order value (AOV)	Revenue ÷ orders, per variant.	Surfaces bundling, upsell, and offer-framing effects.	High variance on low order counts; needs longer runtime than CVR.
Bayesian probability to win	P(B > A) given the data so far.	Peeking-safe, intuitive — "86% likely B wins" beats "p = 0.04."	Requires a sensible prior. Default to neutral; tighten only with strong reason.
Frequentist p-value	Probability of observing a lift at least this large if the null hypothesis (no real difference) is true.	Familiar to legal, finance, and most analytics teams.	Peeking inflates false positives. Set runtime upfront; don't stop on first p < 0.05.
Confidence interval (CI)	Plausible range for the true lift.	Communicates uncertainty better than a point estimate. The interval narrows as sample size grows.	A wide CI straddling zero is not a winner, no matter how nice the midpoint looks.
Lift	% change vs control on the primary metric.	Headline number for stakeholders.	Always quote with the CI. Lift without bounds is decoration.
Holdout uplift	Treated cohort revenue vs holdout cohort revenue.	Proves the entire program is paying off, not just individual tests.	Holdout must stay untouched. Tempting to "borrow" the traffic; don't.
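
As a rough sketch of how the first three rows relate, the snippet below computes CVR, AOV, and RPV from per-variant totals. The `VariantTotals` class, the `summarize` helper, and the sample numbers are illustrative assumptions, not CustomFit's data model or API.

```python
from dataclasses import dataclass


@dataclass
class VariantTotals:
    visitors: int   # unique visitors exposed to the variant
    orders: int     # orders placed by those visitors
    revenue: float  # net revenue, with refunds already removed


def summarize(t: VariantTotals) -> dict:
    """Return CVR, AOV, and RPV for one variant."""
    cvr = t.orders / t.visitors
    aov = t.revenue / t.orders if t.orders else 0.0
    rpv = t.revenue / t.visitors
    return {"cvr": cvr, "aov": aov, "rpv": rpv}


# Hypothetical totals, for illustration only.
control = summarize(VariantTotals(visitors=10_000, orders=300, revenue=18_000.0))
print(f"CVR {control['cvr']:.2%}  AOV ${control['aov']:.2f}  RPV ${control['rpv']:.2f}")
```

Because RPV equals CVR multiplied by AOV, a variant can raise one factor and still lose on the product, which is why the table pairs them.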

Stats engine

Bayesian for speed. Frequentist for audit.

CustomFit runs both engines on every experiment. Bayesian probability answers "how likely is B better than A right now?" — peeking-safe, easy to act on. Frequentist p-values answer "assuming no real effect, how surprising is this data?" — familiar to finance, defensible in audits.

You don't have to pick a religion. Both are shown side by side. Decide which one your team trusts and gate auto-promote on it.
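
To make the two questions concrete, here is a minimal sketch that runs both calculations on conversion counts. It assumes a Beta(1, 1) prior and a pooled two-proportion z-test; the function names are hypothetical, and the order counts (365 and 463) are backed out from the rounded CVRs in the example card, so they are approximations. CustomFit's actual engine is not documented here, and because this sketch works on CVR rather than RPV, its output will not match the 99.2% and 0.012 quoted on this page.

```python
import random
from statistics import NormalDist


def prob_b_beats_a(conv_a, n_a, conv_b, n_b, draws=20_000, seed=0):
    """Monte Carlo estimate of P(rate_B > rate_A) under Beta(1, 1) priors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        wins += b > a
    return wins / draws


def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = (pooled * (1 - pooled) * (1 / n_a + 1 / n_b)) ** 0.5
    z = (p_b - p_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Approximate order counts: 2.84% of 12,842 and 3.62% of 12,801.
print("P(B > A):", prob_b_beats_a(365, 12_842, 463, 12_801))
print("p-value: ", two_proportion_p_value(365, 12_842, 463, 12_801))
```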

Same experiment, both engines

Engine	Verdict
Bayesian: P(B beats A)	99.2%
Bayesian: expected RPV lift	+$0.50
Frequentist: p-value	0.012
Frequentist: 95% CI on lift	+18.4% – +36.5%
Min. detectable effect (set)	±10% on RPV
Runtime to 80% power	11 days (actual: 14)

Both engines agree on this experiment — common for well-powered tests. When they disagree, the data is too noisy to ship either way; let it run.

Segment truth

The pooled answer hides the real one. Segment lift surfaces it.

CustomFit slices significance by every audience attribute you have — geo, device, new-vs-returning, intent, traffic source. Tests that look neutral overall often hide double-digit wins inside specific segments.

PDP Hero CTA — segment-level lift

Segment	Visitors	RPV lift	P(B>A)
All visitors (pooled)	25,643	+27.4%	99.2%
Mobile · India · new	9,142	+38.1%	99.7%
Mobile · US · new	5,206	+22.4%	97.1%
Desktop · returning	4,887	+11.8%	84.6%
Paid social referral	3,108	+31.6%	96.2%
Organic · branded	3,300	+5.2%	62.4%

The pooled win is real, but the Mobile · India · new cohort shows the largest lift (+38.1% at 99.7% probability). Ship variant B to that segment first; let the rest of the test mature before rolling broadly.
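
One way to read the segment table programmatically is to filter it against the same gates used for promotion. The sketch below copies the rows from the table and applies a 95% probability gate and a +3% lift floor (the defaults mentioned on this page); `segments_clearing_gate` is a hypothetical helper, not a CustomFit function.

```python
# (segment, visitors, RPV lift, P(B>A)) copied from the segment table.
SEGMENTS = [
    ("All visitors (pooled)", 25_643, 0.274, 0.992),
    ("Mobile · India · new", 9_142, 0.381, 0.997),
    ("Mobile · US · new", 5_206, 0.224, 0.971),
    ("Desktop · returning", 4_887, 0.118, 0.846),
    ("Paid social referral", 3_108, 0.316, 0.962),
    ("Organic · branded", 3_300, 0.052, 0.624),
]


def segments_clearing_gate(rows, min_prob=0.95, min_lift=0.03):
    """Return rows that clear both gates, largest lift first."""
    passing = [r for r in rows if r[3] >= min_prob and r[2] >= min_lift]
    return sorted(passing, key=lambda r: r[2], reverse=True)


for name, visitors, lift, prob in segments_clearing_gate(SEGMENTS):
    print(f"{name:<24} {visitors:>7,}  {lift:+.1%}  {prob:.1%}")
```

Desktop · returning and Organic · branded fall below the probability gate, which matches the advice to let those segments mature.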

Q2 program holdout — 7% of traffic

Treated traffic RPV	$3.84

Holdout traffic RPV	$3.21

Incremental lift	+19.6%

Incremental revenue (90d)	+$184,266

Tests shipped in window	11

Holdout-vs-treated is the only board-ready proof that the entire experimentation program is paying off — not just any single test.

Program-level proof

The 5–10% holdout is your only honest answer.

Individual A/B test wins compound — but stacking lifts on paper doesn't prove the whole program is moving revenue. A small slice of traffic that never sees any personalization is the cleanest counterfactual.

CustomFit reserves the holdout automatically, locks it from edits, and reports the incremental revenue contribution at any cadence you set — weekly, monthly, quarterly. It's the number your CFO wants. It's also the only one that survives an audit.
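
The arithmetic behind the Q2 holdout figures can be sketched as follows. The function names are hypothetical, and the treated visitor count is an input the page does not state, so it is left as a parameter rather than guessed.

```python
def incremental_lift(treated_rpv: float, holdout_rpv: float) -> float:
    """Relative RPV lift of treated traffic over the holdout."""
    return treated_rpv / holdout_rpv - 1


def incremental_revenue(treated_rpv: float, holdout_rpv: float,
                        treated_visitors: int) -> float:
    """Extra revenue attributed to the program across treated visitors."""
    return (treated_rpv - holdout_rpv) * treated_visitors


# RPV values taken from the Q2 holdout card.
print(f"{incremental_lift(3.84, 3.21):+.1%}")  # +19.6%
```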

Don't do these

Five ways teams fool themselves reading test results.

Each of these looked like a winning experiment. Each lost money. Watch for them.

#	Pitfall	Symptom	Fix
01	Peeking at frequentist tests	You stop the test the moment p drops below 0.05, then it climbs back.	Lock the runtime upfront, or use Bayesian probability (peeking-safe by design).
02	Calling CVR a win, missing the AOV loss	CVR +6%, RPV -2%. You celebrated the wrong number.	Always read RPV alongside CVR. If they disagree, RPV wins the tie.
03	Pooled significance hiding segment truth	Test "loses" overall, but the Mobile · India · new segment is a +28% blowout.	Read segment-level significance before killing a variant. Ship targeted, not pooled.
04	Underpowered tests on tiny traffic	200 visitors per variant, CI wider than the moon. You shipped on noise.	Pre-compute MDE. If you can't power it in 21 days, don't run it; pick a bigger lever.
05	No holdout, no program-level proof	Eight "winning" tests shipped, but revenue is flat vs same-quarter-last-year.	Carve a 5–10% holdout. Measure incremental revenue against it. Defend the program.
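
For pitfall 04, a pre-test power check might look like the sketch below. It uses the standard normal-approximation sample size formula for a two-sided two-proportion test on CVR; the function names, the 3% baseline, and the daily traffic figure are assumptions for illustration. An RPV-based test needs a revenue variance estimate instead, so treat this as a lower-effort first pass.

```python
import math
from statistics import NormalDist


def visitors_per_variant(baseline_cvr, relative_mde, alpha=0.05, power=0.80):
    """Approximate visitors needed per variant to detect the relative MDE."""
    p1 = baseline_cvr
    p2 = baseline_cvr * (1 + relative_mde)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2)


def days_to_power(n_per_variant, daily_visitors, variants=2):
    """Days of traffic needed to fill every variant."""
    return math.ceil(n_per_variant * variants / daily_visitors)


n = visitors_per_variant(baseline_cvr=0.03, relative_mde=0.10)
days = days_to_power(n, daily_visitors=4_000)  # hypothetical traffic level
print(n, days, "run it" if days <= 21 else "pick a bigger lever")
```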

Decisions on autopilot

Auto-promote winners. Auto-pause losers.

Set the thresholds once. Every test that clears them gets shipped or killed without waiting on a Monday review. You stay in the loop on every change.

Significance gate

Pick Bayesian P(B>A) ≥ 95% or frequentist p < 0.05 — set per account, override per test.

Minimum lift gate

Don't promote a 0.2% lift even if it's significant. Set a floor (default +3% on RPV).

Runtime guard

Refuse to call winners before minimum runtime. Kills the peeking trap by design.

Loser pause

When P(B>A) drops below 5% after powered runtime, auto-pause and notify the owner.

Slack + email digest

Daily digest of new winners, losers, and tests crossing thresholds — to the channels you choose.

Audit log

Every promotion, pause, and threshold change is recorded with actor and timestamp.

Sample-ratio alarm

If the A/B traffic split drifts from 50/50 by more than 2pp, the test is flagged — your data is lying before you read it.

Staged rollout

Promote at 25% → 50% → 100% with auto-monitor at each step. Catch regressions before they hit full traffic.
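
The sample-ratio alarm described above can be approximated in a few lines. The first two functions implement the 2pp drift rule as stated on this page; `srm_chi_square_p` is a commonly used complementary check (a one-degree-of-freedom chi-square test against an even split) that is added here as an assumption, not something the page says CustomFit runs.

```python
import math


def split_drift_pp(n_a: int, n_b: int) -> float:
    """Absolute drift of variant A's traffic share from 50%, in percentage points."""
    return abs(n_a / (n_a + n_b) - 0.5) * 100


def srm_flagged(n_a: int, n_b: int, max_drift_pp: float = 2.0) -> bool:
    """True when the split has drifted past the allowed threshold."""
    return split_drift_pp(n_a, n_b) > max_drift_pp


def srm_chi_square_p(n_a: int, n_b: int) -> float:
    """p-value of a chi-square test (1 degree of freedom) against a 50/50 split."""
    expected = (n_a + n_b) / 2
    chi2 = ((n_a - expected) ** 2 + (n_b - expected) ** 2) / expected
    return math.erfc(math.sqrt(chi2 / 2))


# Visitor counts from the PDP Hero CTA example.
print(srm_flagged(12_842, 12_801), round(srm_chi_square_p(12_842, 12_801), 3))
```

A fixed percentage-point rule is easy to explain, while the chi-square p-value scales with traffic, so small drifts on very large tests still get noticed.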

A/B test analytics — common questions.

What is A/B test analytics?

A/B test analytics is the layer that turns raw experiment traffic into business decisions — measuring conversion rate, revenue per visitor, statistical significance, and segment-level lift. Strong analytics tells you which variant won, by how much, in which segments, and how confident you can be — without you running the math yourself.

Bayesian or frequentist — which should we use?

Both, depending on the question. Bayesian gives you the probability variant B beats A (intuitive, peeking-safe, lets you call winners earlier). Frequentist gives you the p-value (familiar to most stakeholders, defensible in audits). CustomFit shows both for every experiment — you decide which to act on.

Why measure revenue per visitor instead of conversion rate?

Conversion rate alone is misleading — a variant can lift CVR while dropping AOV, leaving you flat or negative on revenue. Revenue per visitor (RPV) bakes both into one number that maps to your P&L. Always check RPV before declaring a winner; it catches lifts that look real but lose money.
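
Since RPV is CVR multiplied by AOV, the AOV change hiding behind a CVR win can be backed out directly. A small sketch, using the CVR +6% / RPV -2% example from the pitfalls table (the function name is hypothetical):

```python
def implied_aov_change(cvr_lift: float, rpv_lift: float) -> float:
    """AOV change implied by RPV = CVR x AOV, given the two observed lifts."""
    return (1 + rpv_lift) / (1 + cvr_lift) - 1


print(f"{implied_aov_change(0.06, -0.02):+.1%}")  # -7.5%
```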

What is segment-aware significance?

It's calculating significance for each audience segment (mobile vs desktop, new vs returning, geo, intent) instead of only the pooled population. A variant can lose overall but win sharply for first-time mobile visitors in India — segment-aware analytics surfaces that so you can ship targeted personalizations instead of one-size-fits-all winners.

Do you support holdout measurement?

Yes. A holdout is a small slice of traffic (typically 5–10%) that never sees any active personalization or test variant. Comparing the holdout against treated traffic tells you the true incremental revenue contribution of the entire program — not just per-test. Without a holdout, you can't prove the program is working in aggregate.

Can experiments auto-promote winners?

Yes — once a variant clears your configured significance + minimum-lift thresholds, CustomFit can auto-promote it to 100% traffic and notify the team, or surface it as a one-click approval. Same logic auto-pauses statistically dead losers so they stop costing you traffic. Thresholds are per-account and overridable per-test.
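
The gate logic can be pictured as a single decision function. This is a minimal sketch under stated assumptions: the thresholds default to the values quoted on this page (95% probability, +3% RPV lift floor, 5% loser threshold), the 14-day minimum runtime is a placeholder, and `decide` is a hypothetical name rather than a CustomFit API.

```python
def decide(prob_b_beats_a: float, rpv_lift: float, days_running: int,
           min_runtime_days: int = 14, promote_prob: float = 0.95,
           min_lift: float = 0.03, pause_prob: float = 0.05) -> str:
    """Return 'promote', 'pause', or 'keep running' for a two-variant test."""
    if days_running < min_runtime_days:
        return "keep running"  # runtime guard: no verdict before minimum runtime
    if prob_b_beats_a >= promote_prob and rpv_lift >= min_lift:
        return "promote"       # clears both the significance and lift gates
    if prob_b_beats_a < pause_prob:
        return "pause"         # variant is very unlikely to beat control
    return "keep running"


# Values from the PDP Hero CTA example.
print(decide(prob_b_beats_a=0.992, rpv_lift=0.274, days_running=14))  # promote
```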


Live · Right now

Mamaearth: free-shipping band, +12.4% AOV
GIVA: festive collection page, +34% revenue
Bellavita: PDP CTA test, +27.4% CVR
Kapiva: quiz-driven recs, +9.48% CTR
The Sleep Co: landing personalized, 2× captures
Plum: returning shopper swap, +18.2% CVR

A/B test analytics, built for decisions.