Every CustomFit test ships with Bayesian + frequentist verdicts, revenue per visitor, segment-aware significance, and holdout measurement — so you know which variant won, for whom, by how much, and whether to ship it. No spreadsheets. No p-value squinting.
The verdict view CustomFit shows you the moment a test crosses your significance + lift gate. No exports. No SQL. No second tool.
Each metric answers a different question. Read them in this order; never declare a winner on a single one.
| Metric | What it is | Why it matters | Gotcha |
|---|---|---|---|
| Revenue per visitor (RPV) | Revenue ÷ unique visitors, per variant. | The only metric that maps directly to P&L. Catches CVR-lifting, AOV-tanking variants. | Needs a stable AOV baseline; trim refunds before you call a winner. |
| Conversion rate (CVR) | Orders ÷ unique visitors, per variant. | Fast to read, easy to explain, comparable across tests. | Blind to AOV. A win here can be a loss in revenue. Always pair with RPV. |
| Average order value (AOV) | Revenue ÷ orders, per variant. | Surfaces bundling, upsell, and offer-framing effects. | High variance on low order counts; needs longer runtime than CVR. |
| Bayesian probability to win | P(B > A) given the data so far. | Peeking-safe, intuitive — "86% likely B wins" beats "p = 0.04." | Requires a sensible prior. Default to neutral; tighten only with strong reason. |
| Frequentist p-value | Probability of observing a lift at least this large if the null hypothesis (no real difference) is true. | Familiar to legal, finance, and most analytics teams. | Peeking inflates false positives. Set runtime upfront; don't stop on first p < 0.05. |
| Confidence interval (CI) | Plausible range for the true lift. | Communicates uncertainty better than a point estimate. Width shrinks as sample size grows. | A wide CI straddling zero is not a winner, no matter how nice the midpoint looks. |
| Lift | % change vs control on the primary metric. | Headline number for stakeholders. | Always quote with the CI. Lift without bounds is decoration. |
| Holdout uplift | Treated cohort revenue vs holdout cohort revenue. | Proves the entire program is paying off, not just individual tests. | Holdout must stay untouched. Tempting to "borrow" the traffic; don't. |
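To make the relationship between the first three metrics concrete, here is a minimal Python sketch. The totals are made-up illustration numbers, and `VariantTotals` and the helper functions are hypothetical names, not part of CustomFit. It shows that RPV equals CVR × AOV, which is why a CVR win can still be an RPV loss.

```python
from dataclasses import dataclass


@dataclass
class VariantTotals:
    visitors: int   # unique visitors exposed to the variant
    orders: int     # orders placed by those visitors
    revenue: float  # net revenue, assumed to have refunds already removed


def cvr(v: VariantTotals) -> float:
    """Conversion rate: orders / unique visitors."""
    return v.orders / v.visitors


def aov(v: VariantTotals) -> float:
    """Average order value: revenue / orders."""
    return v.revenue / v.orders


def rpv(v: VariantTotals) -> float:
    """Revenue per visitor: revenue / unique visitors (equals CVR * AOV)."""
    return v.revenue / v.visitors


def lift(control_value: float, variant_value: float) -> float:
    """Relative change of the variant versus control."""
    return (variant_value - control_value) / control_value


# Illustrative totals only: the variant converts more often but at a lower basket size.
control = VariantTotals(visitors=10_000, orders=300, revenue=15_000.0)
variant = VariantTotals(visitors=10_000, orders=318, revenue=14_700.0)

for name, metric in (("CVR", cvr), ("AOV", aov), ("RPV", rpv)):
    print(f"{name} lift: {lift(metric(control), metric(variant)):+.1%}")
```

With these numbers CVR rises while RPV falls, so reading CVR alone would pick the wrong variant.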
CustomFit runs both engines on every experiment. Bayesian probability answers "how likely is B better than A right now?" — peeking-safe, easy to act on. Frequentist p-values answer "assuming no real effect, how surprising is this data?" — familiar to finance, defensible in audits.
You don't have to pick a religion. Both are shown side by side. Decide which one your team trusts and gate auto-promote on it.
| Engine | Verdict |
|---|---|
| Bayesian: P(B beats A) | 99.2% |
| Bayesian: expected RPV lift | +$0.50 |
| Frequentist: p-value | 0.012 |
| Frequentist: 95% CI on lift | +18.4% – +36.5% |
| Min. detectable effect (set) | ±10% on RPV |
| Runtime to 80% power | 11 days (actual: 14) |
Both engines agree on this experiment — common for well-powered tests. When they disagree, the data is too noisy to ship either way; let it run.
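For readers who want to see what the two engines compute, the sketch below runs both on the same data. It is an illustration under stated assumptions, not CustomFit's implementation: it models conversion rate rather than RPV, uses a neutral Beta(1, 1) prior with Monte Carlo sampling for the Bayesian side, and a two-proportion z-test for the frequentist side. Function names and the example counts are hypothetical.

```python
import math
import random
from statistics import NormalDist


def prob_b_beats_a(conv_a, n_a, conv_b, n_b, draws=20_000, seed=0):
    """Bayesian P(B > A) on conversion rate, assuming a Beta(1, 1) prior per arm."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # Sample each arm's conversion rate from its Beta posterior.
        rate_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        wins += rate_b > rate_a
    return wins / draws


def two_proportion_test(conv_a, n_a, conv_b, n_b, alpha=0.05):
    """Two-sided z-test p-value plus a (1 - alpha) CI on the absolute difference."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    diff = p_b - p_a
    # Pooled standard error under the null hypothesis of equal rates.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se_null = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    p_value = 2 * (1 - NormalDist().cdf(abs(diff / se_null)))
    # Unpooled standard error for the interval around the observed difference.
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return p_value, (diff - z_crit * se, diff + z_crit * se)


# Illustrative counts only: 3.0% vs 3.6% conversion on 10,000 visitors per arm.
p_win = prob_b_beats_a(300, 10_000, 360, 10_000)
p_value, (ci_low, ci_high) = two_proportion_test(300, 10_000, 360, 10_000)
print(f"P(B beats A) = {p_win:.1%}")
print(f"p-value = {p_value:.3f}, 95% CI on difference = [{ci_low:.4f}, {ci_high:.4f}]")
```

On a well-powered example like this one, both views point the same way: a high probability to win, a small p-value, and an interval that excludes zero.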
CustomFit slices significance by every audience attribute you have — geo, device, new-vs-returning, intent, traffic source. Tests that look neutral overall often hide double-digit wins inside specific segments.
| Segment | Visitors | RPV lift | P(B>A) |
|---|---|---|---|
| All visitors (pooled) | 25,643 | +27.4% | 99.2% |
| Mobile · India · new | 9,142 | +38.1% | 99.7% |
| Mobile · US · new | 5,206 | +22.4% | 97.1% |
| Desktop · returning | 4,887 | +11.8% | 84.6% |
| Paid social referral | 3,108 | +31.6% | 96.2% |
| Organic · branded | 3,300 | +5.2% | 62.4% |
The pooled win is real, but the Mobile · India · new cohort shows the strongest lift. Ship variant B to that segment first; let the rest of the test mature before rolling out broadly.
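The sketch below shows how a pooled result can hide opposite effects in two segments. It is a simplified illustration with made-up counts: it uses conversion rate instead of RPV and a large-sample normal approximation for P(B>A), both of which are assumptions rather than a description of how CustomFit computes its segment table.

```python
import math
from statistics import NormalDist


def prob_b_beats_a(conv_a, n_a, conv_b, n_b):
    """Approximate P(B > A) on conversion rate via a normal approximation."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    se = math.sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    return NormalDist().cdf((p_b - p_a) / se)


# Hypothetical per-segment totals: arm -> (conversions, visitors).
segments = {
    "mobile_new": {"A": (150, 5_000), "B": (200, 5_000)},
    "desktop_returning": {"A": (250, 5_000), "B": (200, 5_000)},
}

# Per-segment verdicts.
by_segment = {
    name: prob_b_beats_a(*arms["A"], *arms["B"]) for name, arms in segments.items()
}

# Pooled verdict: sum conversions and visitors across segments for each arm.
pooled_a = [sum(arms["A"][i] for arms in segments.values()) for i in (0, 1)]
pooled_b = [sum(arms["B"][i] for arms in segments.values()) for i in (0, 1)]
pooled_prob = prob_b_beats_a(*pooled_a, *pooled_b)

print(f"pooled: P(B>A) = {pooled_prob:.1%}")
for name, prob in by_segment.items():
    print(f"{name}: P(B>A) = {prob:.1%}")
```

One caution when reading output like this: the more segments you slice, the more likely one of them looks like a winner by chance, so treat a lone segment win as a hypothesis to confirm rather than a final verdict.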
Holdout-vs-treated is the only board-ready proof that the entire experimentation program is paying off — not just any single test.
Individual A/B test wins compound — but stacking lifts on paper doesn't prove the whole program is moving revenue. A small slice of traffic that never sees any personalization is the cleanest counterfactual.
CustomFit reserves the holdout automatically, locks it from edits, and reports the incremental revenue contribution at any cadence you set — weekly, monthly, quarterly. It's the number your CFO wants. It's also the only one that survives an audit.
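The arithmetic behind a holdout report can be sketched in a few lines. This is a minimal illustration with hypothetical names and made-up figures; it assumes the holdout and treated cohorts were randomly assigned over the same period, and it ignores the uncertainty around each RPV estimate.

```python
def holdout_uplift(treated_visitors, treated_revenue, holdout_visitors, holdout_revenue):
    """Compare treated vs holdout RPV and estimate incremental revenue."""
    rpv_treated = treated_revenue / treated_visitors
    rpv_holdout = holdout_revenue / holdout_visitors
    relative_uplift = (rpv_treated - rpv_holdout) / rpv_holdout
    # Incremental revenue: the extra RPV multiplied by everyone who was treated.
    incremental_revenue = (rpv_treated - rpv_holdout) * treated_visitors
    return rpv_treated, rpv_holdout, relative_uplift, incremental_revenue


# Illustrative period: a 5% holdout of 100,000 visitors.
rpv_t, rpv_h, uplift, incremental = holdout_uplift(
    treated_visitors=95_000,
    treated_revenue=190_000.0,
    holdout_visitors=5_000,
    holdout_revenue=9_000.0,
)
print(f"Treated RPV ${rpv_t:.2f} vs holdout RPV ${rpv_h:.2f}")
print(f"Program uplift {uplift:+.1%}, incremental revenue ${incremental:,.0f}")
```

Because the holdout is small, its RPV is noisier than the treated cohort's, which is one reason to report the figure over longer windows such as a month or a quarter.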
Each of these looked like a winning experiment. Each lost money. Watch for them.
| # | Pitfall | Symptom | Fix |
|---|---|---|---|
| 01 | Peeking at frequentist tests | You stop the test the moment p drops below 0.05, then it climbs back. | Lock the runtime upfront, or use Bayesian probability (peeking-safe by design). |
| 02 | Calling CVR a win, missing the AOV loss | CVR +6%, RPV -2%. You celebrated the wrong number. | Always read RPV alongside CVR. If they disagree, RPV wins the tie. |
| 03 | Pooled significance hiding segment truth | Test "loses" overall, but mobile-IN-new-visitor is a +28% blowout. | Read segment-level significance before killing a variant. Ship targeted, not pooled. |
| 04 | Underpowered tests on tiny traffic | 200 visitors per variant, CI wider than the moon. You shipped on noise. | Pre-compute MDE. If you can't power it in 21 days, don't run it; pick a bigger lever. |
| 05 | No holdout, no program-level proof | Eight "winning" tests shipped, but revenue is flat vs same-quarter-last-year. | Carve a 5–10% holdout. Measure incremental revenue against it. Defend the program. |
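Pitfall 04 can be checked before launch. The sketch below uses the standard two-proportion sample-size approximation to estimate how many visitors each variant needs, then converts that into days. It assumes a conversion-rate metric, a two-sided test at 5% significance, and 80% power; the function names and traffic figures are hypothetical.

```python
import math
from statistics import NormalDist


def visitors_per_variant(baseline_cvr, relative_mde, alpha=0.05, power=0.80):
    """Approximate visitors needed per variant to detect a relative lift in CVR."""
    p1 = baseline_cvr
    p2 = baseline_cvr * (1 + relative_mde)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided significance
    z_power = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2)


def days_to_power(needed_per_variant, daily_visitors_per_variant):
    """Whole days required to collect the needed sample in each variant."""
    return math.ceil(needed_per_variant / daily_visitors_per_variant)


# Illustrative inputs: 3% baseline CVR, 10% relative MDE, 1,500 visitors per variant per day.
needed = visitors_per_variant(baseline_cvr=0.03, relative_mde=0.10)
days = days_to_power(needed, daily_visitors_per_variant=1_500)
print(f"Need about {needed:,} visitors per variant, roughly {days} days")
if days > 21:
    print("Cannot power this in 21 days: pick a bigger lever or a larger MDE.")
```

Small relative lifts on low baseline rates need tens of thousands of visitors per variant, which is why 200 visitors per variant cannot support a verdict.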
Set the thresholds once. Every test that clears them gets shipped or killed without waiting on a Monday review. You stay in the loop on every change.
Pick Bayesian P(B>A) ≥ 95% or frequentist p < 0.05 — set per account, override per test.
Don't promote a 0.2% lift even if it's significant. Set a floor (default +3% on RPV).
Refuse to call winners before minimum runtime. Kills the peeking trap by design.
When P(B>A) drops below 5% after powered runtime, auto-pause and notify the owner.
Daily digest of new winners, losers, and tests crossing thresholds — to the channels you choose.
Every promotion, pause, and threshold change is recorded with actor and timestamp.
If the A/B traffic split drifts from 50/50 by more than 2pp, the test is flagged — your data is lying before you read it.
Promote at 25% → 50% → 100% with auto-monitor at each step. Catch regressions before they hit full traffic.
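The gates above can be read as a single decision rule. The sketch below is one possible ordering of those checks, written as a hypothetical `decide` function; it is not CustomFit's API. The 95%, 5%, +3%, and 2pp values come from the text above, while the 14-day minimum runtime default and the choice to treat minimum runtime as the powered runtime are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Gates:
    min_prob_to_win: float = 0.95     # Bayesian P(B>A) needed to promote
    max_prob_to_pause: float = 0.05   # below this, the variant is a dead loser
    min_rpv_lift: float = 0.03        # floor on relative RPV lift
    min_runtime_days: int = 14        # assumed default, not from the source
    max_split_drift_pp: float = 2.0   # allowed drift from a 50/50 split


def decide(prob_b_wins, rpv_lift, days_running, visitors_a, visitors_b, gates=Gates()):
    """Return one of: 'flag_split', 'keep_running', 'promote', 'pause'."""
    # 1. Traffic split check: a skewed split makes every other number suspect.
    share_a_pct = 100 * visitors_a / (visitors_a + visitors_b)
    if abs(share_a_pct - 50) > gates.max_split_drift_pp:
        return "flag_split"
    # 2. Runtime gate: no verdicts before the minimum runtime.
    if days_running < gates.min_runtime_days:
        return "keep_running"
    # 3. Promote only when both significance and the lift floor are cleared.
    if prob_b_wins >= gates.min_prob_to_win and rpv_lift >= gates.min_rpv_lift:
        return "promote"
    # 4. Pause variants that are very unlikely to win.
    if prob_b_wins < gates.max_prob_to_pause:
        return "pause"
    return "keep_running"


print(decide(0.992, 0.274, 14, 12_800, 12_843))  # clears every gate
print(decide(0.990, 0.002, 20, 5_000, 5_000))    # significant but below the lift floor
print(decide(0.990, 0.200, 20, 5_300, 4_700))    # split drifted 3pp from 50/50
```

Checking the traffic split first is a deliberate choice in this sketch: if assignment is broken, the probability and lift figures should not be acted on at all.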
A/B test analytics is the layer that turns raw experiment traffic into business decisions — measuring conversion rate, revenue per visitor, statistical significance, and segment-level lift. Strong analytics tells you which variant won, by how much, in which segments, and how confident you can be — without you running the math yourself.
Use both Bayesian and frequentist statistics, depending on the question. Bayesian gives you the probability that variant B beats A (intuitive, peeking-safe, lets you call winners earlier). Frequentist gives you the p-value (familiar to most stakeholders, defensible in audits). CustomFit shows both for every experiment; you decide which to act on.
Conversion rate alone is misleading — a variant can lift CVR while dropping AOV, leaving you flat or negative on revenue. Revenue per visitor (RPV) bakes both into one number that maps to your P&L. Always check RPV before declaring a winner; it catches lifts that look real but lose money.
Segment-aware significance means calculating significance for each audience segment (mobile vs desktop, new vs returning, geo, intent) instead of only the pooled population. A variant can lose overall but win sharply for first-time mobile visitors in India. Segment-aware analytics surfaces that so you can ship targeted personalizations instead of one-size-fits-all winners.
Yes. A holdout is a small slice of traffic (typically 5–10%) that never sees any active personalization or test variant. Comparing the holdout against treated traffic tells you the true incremental revenue contribution of the entire program — not just per-test. Without a holdout, you can't prove the program is working in aggregate.
Yes. Once a variant clears your configured significance + minimum-lift thresholds, CustomFit can auto-promote it to 100% traffic and notify the team, or surface it as a one-click approval. The same logic auto-pauses statistically dead losers so they stop costing you traffic. Thresholds are per-account and overridable per-test.
14-day free trial. Bayesian + frequentist, segment-aware lift, holdout measurement — included on every plan.