Statistical significance

Determine whether experiment results reflect real differences or random chance to avoid making expensive decisions based on noise instead of signal.


Introduction

Statistical significance is a measure of how confident we can be that an observed result is real rather than due to random chance. In B2B sales testing, it answers the question: "If I test two approaches and see different results, are they genuinely different or just luck?" Statistical significance is typically expressed as a confidence level: 95% confidence means that if there were genuinely no difference, a result at least this extreme would occur less than 5% of the time. Results are not statistically significant unless they meet a defined threshold (usually a p-value below 0.05, equivalent to 95% confidence).

Statistical significance is important in A/B testing because small sample sizes generate unreliable results. If you test email subject line A with 50 people and subject line B with 50 people, and A gets 4 replies (8%) while B gets 2 replies (4%), the difference might look clear. But with such small samples, this 4 percentage point difference in reply rate could easily be random. Only with larger samples does the difference become statistically significant.
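To make this concrete, a pooled two-proportion z-test (one common way to compare reply rates, sketched here in Python with only the standard library; the function name is illustrative) shows how weak the evidence from 50-per-variation samples is. With counts this small the normal approximation is rough, and an exact test such as Fisher's would normally be preferred, but the conclusion is the same:

```python
import math

def two_proportion_z_test(replies_a, n_a, replies_b, n_b):
    """Two-sided pooled two-proportion z-test: returns (z statistic, p-value)."""
    p_a, p_b = replies_a / n_a, replies_b / n_b
    pooled = (replies_a + replies_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal CDF (via the error function)
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# 4 replies out of 50 vs 2 replies out of 50
z, p = two_proportion_z_test(4, 50, 2, 50)
print(f"z = {z:.2f}, p = {p:.2f}")  # p is around 0.4, nowhere near 0.05
```

A p-value around 0.4 means a gap this large would arise by chance roughly 40% of the time even if both subject lines were identical.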

Key concepts in statistical significance

  • P-value: the probability of seeing a result at least as extreme as the one observed if there were no real difference (lower is better; below 0.05 is conventionally significant)
  • Confidence level: the complement of the significance threshold (95% confidence corresponds to a 5% threshold; this is the standard)
  • Sample size: larger samples make it easier to detect real differences and achieve significance
  • Effect size: how large the actual difference is (large effects are significant with smaller samples; small effects need large samples)

Statistical significance is not the same as practical significance. A change that's statistically significant might improve your metric by 0.2%, which is mathematically real but practically irrelevant. Conversely, a change that improves your metric by 5% might not reach statistical significance if your sample size is too small.
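Both sides of this asymmetry are easy to demonstrate with a pooled two-proportion z-test: a 0.2 percentage point lift on a huge sample clears 95% confidence, while a 5 point lift on 40-per-group samples does not. The counts below are invented purely for illustration:

```python
import math

def p_value(wins_a, n_a, wins_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test."""
    pooled = (wins_a + wins_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (wins_a / n_a - wins_b / n_b) / se
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

# Statistically significant but practically tiny: 10.2% vs 10.0% at 500,000 per group
p_tiny_effect = p_value(51_000, 500_000, 50_000, 500_000)

# Practically large but not significant: 35% vs 30% at 40 per group
p_big_effect = p_value(14, 40, 12, 40)

print(f"tiny effect: p = {p_tiny_effect:.4f}")  # well below 0.05
print(f"big effect:  p = {p_big_effect:.2f}")   # well above 0.05
```

Significance tells you a difference is probably real; only effect size tells you whether it is worth acting on.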

Why it matters

Statistical significance prevents you from optimising based on random noise. If you change your prospecting email based on a statistically insignificant result, you might be making changes that don't actually help. This wastes time and potentially makes things worse. Waiting for statistical significance ensures changes are real before rolling them out broadly.

For B2B teams, this is particularly important because each prospect matters. If you change your approach based on weak evidence and it's actually wrong, you're sending ineffective messages to hundreds or thousands of prospects. The cost of wrong decisions is high, so requiring statistical significance before deciding is economically rational.

However, statistical significance can also be a false standard. If you require statistical significance before making any changes, you might move slowly whilst competitors iterate faster. The balance is requiring appropriate confidence based on decision impact: small tactical changes (email subject line) might require 90% confidence, whilst major strategic changes (sales process redesign) might require 99% confidence.

How to apply it

When running A/B tests, calculate the sample size needed before starting the test. If you expect a 20% relative improvement and want 95% confidence, online calculators (Optimizely, CXL, Evan Miller's site) will tell you exactly how many subjects per variation you need. For the low baseline rates typical of cold email (1-3% reply rates), this often runs into the thousands per variation; a few hundred per variation is only enough when the baseline rate or the expected lift is much larger. Don't stop the test early because results look good; run it to the planned size.

Document your hypothesis and decision rule before running the test. Don't decide post-hoc whether a result is significant. Say upfront: "We're testing subject line A versus B. If A generates a statistically significantly higher reply rate (95% confidence), we'll roll it out. Otherwise, we'll keep current approach." This prevents cherry-picking results or moving goalposts.

When analysing existing data (win/loss analysis, conversion patterns, opportunity analysis), apply the same statistical thinking. With 5 data points, patterns aren't reliable. With 50, they're more trustworthy. Be transparent about sample size when drawing conclusions: "We observed this pattern in 40 deals, which gives us reasonable confidence, but with 15 deals it would be uncertain."
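One way to make that transparency concrete is to quote a confidence interval alongside the observed rate. A rough normal-approximation sketch (the 30% win rate is invented for illustration; standard library only):

```python
import math

def ci_half_width(p, n, z=1.96):
    """Half-width of an approximate 95% confidence interval for a proportion."""
    return z * math.sqrt(p * (1 - p) / n)

for n in (15, 50, 200):
    print(f"n={n}: observed 30% win rate, 95% CI roughly ±{ci_half_width(0.30, n):.0%}")
```

At 15 deals the interval spans more than ±20 percentage points, so almost any pattern is compatible with the data; at 200 deals it narrows to roughly ±6 points.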

Running a properly sized email test to detect a real difference

A sales team wanted to test whether personalised subject lines outperformed generic ones. They planned to test 200 recipients per variation. Subject line A (personalised: "Quick question about your [company type]") achieved a 2% reply rate (4 replies). Subject line B (generic: "Question for you") achieved a 1.5% reply rate (3 replies). The 0.5 percentage point difference wasn't statistically significant because the sample size was far too small for such a small difference. They continued testing with larger samples and found, after roughly 6,000 recipients per variation, that personalised subject lines genuinely produced a 2.1% reply rate versus 1.6% for generic (statistically significant at 95% confidence). The original test was too small to detect this modest but real difference.
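Running the counts through a pooled two-proportion z-test (an illustrative sketch, standard library only) confirms why 200 per variation was far too small, and shows the scale needed: at observed rates of 2.1% vs 1.6%, 1,000 recipients per variation is still not enough, while around 6,000 per variation clears 95% confidence:

```python
import math

def p_value(replies_a, n_a, replies_b, n_b):
    """Two-sided p-value from a pooled two-proportion z-test."""
    pooled = (replies_a + replies_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (replies_a / n_a - replies_b / n_b) / se
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

p_small = p_value(4, 200, 3, 200)        # first stage: 4/200 vs 3/200
p_mid = p_value(21, 1000, 16, 1000)      # 2.1% vs 1.6% at 1,000 each
p_large = p_value(126, 6000, 96, 6000)   # same rates at 6,000 each

print(f"n=200:  p = {p_small:.2f}")   # around 0.7 -- no evidence at all
print(f"n=1000: p = {p_mid:.2f}")     # still well above 0.05
print(f"n=6000: p = {p_large:.3f}")   # below 0.05
```

Small differences on low baseline rates simply demand large samples; no amount of careful reading of a 200-recipient test can extract a signal that isn't detectable there.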

Avoiding false significance with proper controls

A sales team tested a new sales process with 15 deals and saw a 40% win rate versus their 30% historical average. Excited, they rolled it out. After implementing broadly, they realised the 15-deal sample was non-representative: those deals happened to be easier opportunities, so the apparent lift didn't come from the process at all. With 100+ deals they saw an actual win rate of 31%, barely above the historical average. The original sample was far too small to establish statistical significance, and they got lucky with a favourable sample. Now they require much larger samples (50+ deals minimum) before declaring process changes effective.
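The trap in this example is quantifiable: even with no real improvement, a small sample frequently looks good by chance. An exact binomial tail (standard library only; the 30% true rate is taken from the example) shows how often 15 deals at a true 30% win rate would show 40% or better:

```python
from math import comb

def binom_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), computed exactly."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))

# Probability that 15 deals at a true 30% win rate show >= 40% (6+ wins)
print(f"{binom_tail(6, 15, 0.30):.0%}")  # roughly 28% of the time
```

Roughly one small sample in four will look like a 10-point improvement purely by luck, which is exactly the mistake this team made.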

Trading off statistical significance with business urgency

A B2B SaaS company was losing deals to a competitor and needed to act quickly. Rather than waiting 6 months for statistically significant data, they tested a new value proposition angle with 30 deals (below ideal statistical power). Results looked promising: win rate against this competitor improved from 35% to 48%, trending toward significance. Rather than wait for full statistical significance, they rolled out the new angle cautiously while continuing to collect data. The business urgency (losing deals to competitor) justified taking action on trending data rather than waiting for certainty. Six months later, with 150+ deals, the improvement held at 46% win rate, confirming the initial trending result.
