
From the conversion glossary
Concepts referenced in this article, defined.
Statistical significance in A/B testing means that, if there were no real difference between your variants, a result as large as yours would show up less than 5% of the time. Here's what p-values, confidence levels, and sample size mean for your tests.

Run rigorous A/B tests and personalize every visit on Shopify or any storefront, no engineers required.
Statistical significance in A/B testing means that the difference you observed between your control and variant is unlikely to be a result of random chance. Specifically, if there were no real difference, a gap this large would appear less than 5% of the time. In practical terms: at 95% confidence, you can act on the result knowing the risk that you just got lucky is small and controlled.
If that still feels abstract, this guide will make it concrete. We'll cover what significance actually means, why 95% became the standard, what p-values are (without the statistics degree), and, critically, how to apply all of this to real decisions on your ecommerce store.
Imagine you flip a coin 10 times and get 7 heads. Is the coin biased? Maybe. But 7 heads out of 10 isn't unusual enough to be sure: it could just be chance.
Now flip it 1,000 times and get 700 heads. Now you're confident. The result is too consistent to be random.
A/B testing works the same way. When you show two versions of a product page to visitors, you're not flipping a fair coin, but you're asking the same question: is this difference real, or is it just noise in a small sample?
Statistical significance is the mathematical answer to that question. It tells you: given the sample sizes and the observed difference in conversion rates, how likely is it that you'd see a gap this large purely by chance?
When that probability drops below 5%, we call the result statistically significant. We're not saying the variant is definitely better; we're saying the evidence is strong enough to act on.
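The coin example can be checked directly. The sketch below is a minimal illustration using only Python's standard library; the helper name `tail_probability` is hypothetical, and it computes the one-sided chance of getting at least that many heads from a fair coin.

```python
from math import comb

def tail_probability(heads: int, flips: int) -> float:
    """P(at least `heads` heads in `flips` tosses of a fair coin)."""
    # Count every outcome with `heads` or more heads, then divide by
    # the total number of equally likely outcomes (2 ** flips).
    favourable = sum(comb(flips, k) for k in range(heads, flips + 1))
    return favourable / 2 ** flips

p_small = tail_probability(7, 10)
p_large = tail_probability(700, 1000)
print(f"7+ heads in 10 flips:      {p_small:.3f}")
print(f"700+ heads in 1,000 flips: {p_large:.1e}")
```

Seven or more heads in 10 flips happens about 17% of the time with a fair coin, nowhere near the 5% bar. The same 70% share over 1,000 flips has a probability so small that chance is no longer a credible explanation.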
What significance does NOT tell you:
Significance is a filter for noise. It doesn't replace judgment.
The p-value is the number that determines statistical significance. It's also one of the most misunderstood concepts in testing.
Here's the simplest way to think about it:
The p-value is the probability of seeing your result (or a more extreme one) if there were actually no difference between control and variant.
Think of it this way. You're a judge. The variant is on trial. Your null hypothesis is "the variant is innocent: there's no real difference." The p-value is the probability that you'd see evidence this strong against the variant if it were actually innocent.
Common p-value mistakes:
"p = 0.05 means there's a 95% chance the variant is better." Wrong. It means that if there were no real effect, a difference at least this large would still appear 5% of the time. These are related but not the same thing.
Using p = 0.10 as your threshold. This gives you a 10% false positive rate: when a change has no real effect, you will still declare a "win" about 1 time in 10. Run enough tests and your testing programme will be systematically misled.
Reporting the p-value without the effect size. A p-value tells you about certainty, not magnitude. You need both.
These two terms look similar and get confused constantly. They answer different questions.
Confidence Level is the threshold you set before the test. "I want 95% confidence before I act on this result." It's a decision rule: a bar you require the evidence to clear.
Confidence Interval is the range within which the true effect probably falls. If your test shows a 10% lift with a 95% confidence interval of [4%, 16%], it means: we're 95% confident the true effect is somewhere between +4% and +16%.
A narrow confidence interval means precise estimates. A wide one means you need more data.
In practice for D2C brands:
Set your confidence level at 95% before the test starts. Once you have results, look at the confidence interval. If the lower bound of the interval still represents a meaningful business lift, you're in good shape. If the interval spans from -2% to +22%, your estimate is too imprecise to act on reliably, even if the midpoint looks exciting.
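To see how an interval is produced, here is a minimal sketch of a Wald (normal-approximation) interval for the absolute difference between two conversion rates. The function name and the visitor and order counts are hypothetical, chosen only for illustration, and many testing tools use other interval methods or report relative lift instead.

```python
from math import sqrt
from statistics import NormalDist

def diff_confidence_interval(conv_a, n_a, conv_b, n_b, confidence=0.95):
    """Wald interval for (rate_b - rate_a), in absolute percentage points."""
    rate_a, rate_b = conv_a / n_a, conv_b / n_b
    # Unpooled standard error of the difference between two proportions.
    se = sqrt(rate_a * (1 - rate_a) / n_a + rate_b * (1 - rate_b) / n_b)
    # Two-sided critical value, about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    diff = rate_b - rate_a
    return diff - z * se, diff + z * se

# Hypothetical data: 10,000 visitors per arm, 300 vs 345 orders.
low, high = diff_confidence_interval(300, 10_000, 345, 10_000)
print(f"Absolute lift: {low:+.2%} to {high:+.2%}")
```

With these made-up counts the interval straddles zero, which is the "too imprecise to act on" case. The same rates with ten times the traffic would give a much narrower interval that sits entirely above zero.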
The 95% threshold (p < 0.05) wasn't handed down from a mountain. It was proposed by statistician Ronald Fisher in the 1920s as a pragmatic rule of thumb, and it stuck.
In practice, 95% confidence represents a reasonable balance:
For Indian D2C brands running experiments on product pages and checkout flows, 95% is the right default. Use 99% when you're testing something that's expensive to reverse, like a major homepage redesign or a pricing structure change.
Some teams argue that 90% is fine for low-stakes, reversible tests with small traffic bases. The problem is that it creates a culture of acting on inconclusive data. The compounding effect of 10% noise across a testing programme is significant. Stick to 95%.
This is the distinction that actually matters for your business, and the one most testing guides skip over.
Statistical significance tells you the result is probably real.
Practical significance tells you the result is worth acting on.
A brand with 500,000 monthly visitors can detect a 0.2% absolute CVR lift at 95% confidence. That's statistically significant. But if your current CVR is 3%, a 0.2 percentage-point lift (3.0% to 3.2%) means going from ₹30,00,000 to ₹32,00,000 in revenue on a ₹10 crore GMV, which may or may not justify the implementation cost.
Questions to ask for practical significance:
The rule of thumb: For most D2C brands, a test needs to show at least a 5% relative lift to be practically meaningful. Below that, you're optimising noise at the margin. Focus your programme on tests likely to move the needle by 10% or more.
You cannot talk about significance without talking about sample size. They are directly linked.
With a tiny sample, almost nothing reaches significance, even if the difference is real. With a massive sample, trivial differences become "significant" even when they don't matter.
The relationship works like this:
Example:
You're testing a product page with a 2% baseline CVR. You want to detect a 10% relative lift (from 2.0% to 2.2%). At 95% confidence, and assuming the conventional 80% statistical power, you need approximately 80,000 visitors per variant, or about 160,000 total.
If your page gets 500 visitors per day, that's roughly 320 days of testing. If it gets 200 visitors per day, you're looking at more than two years. Is that worth it to test a 0.2 percentage-point CVR improvement? Almost never.
This is why pre-test sample size calculation matters. It forces you to be explicit about what lift you care about, and whether you have the traffic to detect it. Read our dedicated guide on A/B testing sample size for the full calculation.
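Sample-size figures depend on the formula used and on the statistical power you ask for, so it is worth running the numbers yourself. The sketch below uses the standard normal-approximation formula for comparing two proportions. The function name `visitors_per_variant` is hypothetical, and the 80% power default is an assumption (a common convention), not something the example above specifies.

```python
from math import ceil
from statistics import NormalDist

def visitors_per_variant(baseline, relative_lift, confidence=0.95, power=0.80):
    """Approximate visitors needed per variant for a two-sided test."""
    p1 = baseline
    p2 = baseline * (1 + relative_lift)
    # Critical values for the chosen confidence level and power.
    z_alpha = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    z_beta = NormalDist().inv_cdf(power)
    # Sum of the Bernoulli variances of the two conversion rates.
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)

n = visitors_per_variant(0.02, 0.10)  # 2.0% baseline, 10% relative lift
print(f"Visitors per variant: {n:,}")
for daily in (500, 200):
    print(f"{daily} visitors/day: about {ceil(2 * n / daily)} days")
```

For the 2.0% to 2.2% scenario this formula lands at roughly 80,000 visitors per variant under the 80% power assumption. Asking for a larger lift shrinks the requirement quickly, because the sample size scales with the inverse square of the difference you want to detect.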
Here's a scenario most testing teams will recognise: you launch a test, check it after three days, and see 97% significance. You declare a winner and stop the test.
This is called peeking, and it's one of the most common ways testing programmes get corrupted.
The problem is mathematical. While a test runs, the significance figure your tool reports fluctuates. Due to random variation, it's completely normal for it to briefly spike above 95% early in the test, then fall back down as more data comes in.
If you check significance repeatedly during a test and stop as soon as it crosses 95%, you're not running a test at 95% confidence. The true false positive rate can be 25-40% depending on how often you check.
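The inflation is easy to reproduce with a small simulation. The sketch below runs A/A tests, where both arms share the same true conversion rate so every "significant" result is a false positive, and compares stopping at the first significant check with evaluating once at the end. All names and parameters (10 checks, 200 visitors per arm between checks, a 5% true rate) are illustrative assumptions, and the exact percentages will vary with them.

```python
import random
from math import sqrt
from statistics import NormalDist

def p_value(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:  # no conversions (or all conversions) in both arms
        return 1.0
    z = (conv_b / n_b - conv_a / n_a) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))

def false_positive_rates(tests=400, checks=10, batch=200, rate=0.05, seed=1):
    """Simulate A/A tests; return (rate when peeking, rate at final check only)."""
    rng = random.Random(seed)
    peek_hits = final_hits = 0
    for _ in range(tests):
        conv_a = conv_b = n = 0
        crossed = False
        for _ in range(checks):
            # Add one batch of visitors to each arm, then "peek".
            conv_a += sum(rng.random() < rate for _ in range(batch))
            conv_b += sum(rng.random() < rate for _ in range(batch))
            n += batch
            significant = p_value(conv_a, n, conv_b, n) < 0.05
            crossed = crossed or significant
        peek_hits += crossed        # significant at any check
        final_hits += significant   # significant at the last check only
    return peek_hits / tests, final_hits / tests

peeking, fixed = false_positive_rates()
print(f"Stop at first significant check: {peeking:.0%} false positives")
print(f"Evaluate once at the end:        {fixed:.0%} false positives")
```

The single end-of-test evaluation should stay close to the nominal 5%, while the stop-when-significant rule comes out several times higher. Increasing `checks` pushes the gap wider.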
The fix:
Most A/B testing tools use frequentist statistics: the framework we've described throughout this guide, with p-values and confidence levels.
There's an alternative: Bayesian statistics, which calculates the probability that one variant is better than another, incorporating prior knowledge and updating continuously as data comes in.
Frequentist (traditional):
Bayesian:
Recommendation for D2C brands: Start with frequentist 95% confidence as your standard. If your traffic is too low to reach frequentist significance in reasonable timeframes, explore Bayesian approaches, but understand that "80% probability of improvement" is not the same as a statistically significant result and carries real risk of false positives.
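For a sense of where a "probability of improvement" figure comes from, here is a minimal sketch of one common Bayesian approach: a Beta-Binomial model with Monte Carlo sampling. The uniform Beta(1, 1) prior, the function name, and the counts are all assumptions for illustration. Commercial tools may use different priors and models.

```python
import random

def prob_b_beats_a(conv_a, n_a, conv_b, n_b, draws=20_000, seed=0):
    """Monte Carlo estimate of P(rate_b > rate_a) under uniform Beta(1, 1) priors."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # Posterior for each arm is Beta(1 + conversions, 1 + non-conversions).
        sample_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        sample_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        wins += sample_b > sample_a
    return wins / draws

# Hypothetical low-traffic test: 1,000 visitors per arm, 30 vs 38 orders.
print(f"P(B beats A): {prob_b_beats_a(30, 1000, 38, 1000):.1%}")
```

With these made-up counts the variant looks more likely than not to be better, yet the figure sits well short of 95%. That is the kind of result that is informative for a low-traffic store but should not be read as a significant win.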
Let's run through a real calculation using realistic numbers for an Indian D2C product page.
Scenario:
Results:
Step 1: Calculate the relative lift
(3.69% - 3.0%) / 3.0% = 23% relative lift
Step 2: Plug into a significance calculator
Using a standard two-proportion z-test with these inputs:
The z-score ≈ 2.38, which corresponds to p ≈ 0.017.
Step 3: Interpret
p = 0.017 < 0.05, so the result is statistically significant at 95% confidence. In fact, this clears the 98% confidence threshold.
Step 4: Assess practical significance
At just over 500 monthly product page visitors (roughly 3,200 per six-month period), 0.69% absolute CVR improvement × 3,200 visitors × ₹2,499 AOV = approximately ₹55,000 in incremental revenue per six months from this one test. Practically meaningful: implement the variant.
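The two-proportion z-test from Step 2 can be sketched in a few lines. The visitor and order counts below are hypothetical: they were picked only because they reproduce the 3.0% and 3.69% conversion rates quoted above, and they should not be read as the scenario's actual inputs. The function name is also an illustrative choice.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_z_test(conv_a, n_a, conv_b, n_b):
    """Return (relative lift, z-score, two-sided p-value) for variant B vs control A."""
    rate_a, rate_b = conv_a / n_a, conv_b / n_b
    # Under the null hypothesis both arms share one rate, so pool the data.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / se
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return (rate_b - rate_a) / rate_a, z, p

# Hypothetical counts: 7,700 visitors per arm, 231 vs 284 orders.
lift, z, p = two_proportion_z_test(231, 7700, 284, 7700)
print(f"lift={lift:.1%}, z={z:.2f}, p={p:.3f}")
```

With these counts the output lands close to the lift, z-score, and p-value quoted in the steps above. Swapping in your own visitor and order numbers is enough to check a result from any dashboard.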
Low-traffic stores face a real problem: they don't have enough visitors to run valid tests on most pages in a reasonable timeframe. Here's what to do.
Option 1: Test bigger changes
The minimum detectable effect you care about determines how much traffic you need. If you're testing a minor headline tweak (where a 5% lift would be meaningful), you need a lot of traffic. If you test a fundamentally different value proposition (where you might see 30%+ lift), you need far less.
Low-traffic stores should focus on bold, high-contrast tests rather than incremental tweaks.
Option 2: Move the test to a higher-traffic page
If your product page gets 100 visitors per day, test on the homepage or category page first. Get your testing muscle built up before going deeper.
Option 3: Use Bayesian methods with appropriate risk framing
A Bayesian result of "78% probability Variant B is better" is still useful information for a low-traffic brand, as long as you understand the risk you're accepting. Don't treat it as equivalent to 95% frequentist confidence.
Option 4: Accept longer test durations
Some tests are worth running for 45 to 60 days, especially for high-AOV products where even a 5% CVR improvement represents significant annual revenue. Pre-commit to the test duration and don't peek.
For the full framework on what A/B testing is and how to structure a testing programme, start with our A/B testing pillar guide.
Statistical significance is not a magic number that makes decisions for you. It's a quality filter: a way of ensuring that the patterns you observe in your test data are real enough to act on.
The key principles to carry forward:
CustomFit.ai handles all of this automatically (significance tracking, stopping rules, sample size guidance, and results dashboards) so your team can focus on what to test, not how to calculate it.
1,000+ D2C brands use CustomFit.ai to run statistically valid A/B tests without needing a data science team. 14-day free trial · No credit card required · Setup in under 30 minutes.