Just off the top of your head, can you explain what p-values are and what they tell you about your A/B test results? If reading that question almost made you close the current tab, you’re not alone. In fact, according to FiveThirtyEight, even scientists themselves can’t clearly explain what p-values really mean.
While your testing tool probably does most of the number-crunching under the hood, it’s much better to gain a solid understanding of how the math works than to simply run the test and wait for the results to flash on your dashboard.
Without a good grasp of the math and statistical concepts behind A/B testing, you’ll simply end up guessing rather than actually experimenting. Knowing the underlying technical details helps you make better decisions in your testing and sampling design, and gives you a sound framework for interpreting results.
That’s why we put together this quick infographic that visually summarizes the ideas behind key mathematical/statistical concepts you typically encounter in A/B testing.
Don’t worry if you’re allergic to equations and arcane Greek symbols. We won’t be dealing with any of those here. We’ll only focus on building the intuition needed to make informed testing decisions.
Also, the rest of the blog post gives a more in-depth discussion of what’s on the infographic, so you should definitely check that out as well.
Let’s say you wanted to see if changing the color of your call-to-action (CTA) button on your whitepaper landing page from red to green would impact the number of downloads. You then randomly split your traffic 50-50, with one half assigned to the page having the red-colored CTA (the control group) and the other half assigned to the page which has the green-colored CTA (the variation group).
After recording 500 unique visits for each page, you observe that the conversion rate (number of downloads as a percentage of page traffic) for the control group was 7%, while the conversion rate for the variation group was 9%. You may be tempted to conclude that changing the CTA’s color has a real impact on conversions. But before you accept the results as valid, you first need to carefully answer a number of questions about your findings, such as:
- Do I have enough samples (page views) for each of the two groups?
- How likely is it that I got the test results simply by chance?
- Is the difference between the conversion rates big enough to justify making the change?
- If I ran the test again and again, how confident am I that it’s going to give me similar results?
These are only a few of the things you need to think about when planning and carrying out A/B tests. Below, we’ll go over the mathematical/statistical tools to help us objectively answer each of these questions.
The Very Basics
ConversionXL says the three statistical building blocks of A/B testing are: the mean, variance, and sampling. Let’s now gain a more intuitive feel for these concepts and understand what the numbers really tell us.
We commonly refer to the mean as the average. But what does the “mean” really mean? You can think of the mean as the number that represents a collection of numbers really well. That is, knowing the mean gives you a rough idea of what values a sample tends to have since most of the numbers in that sample will tend to cluster around the mean.
For example, if you determined that your site averaged 70,000 visits per month over the last 12 months, then you’re saying that 70,000 is a fairly acceptable summary of your monthly site traffic. That is, most of the time, your monthly site traffic will be “close” to 70,000.
Variance and Standard Deviation
The variance measures how dispersed the values in a collection of numbers are. The higher the variance, the more scattered the values in our sample set will be.
As the variance increases, the mean becomes a less reliable representative of our dataset.
Let’s say you want to compare the average time (in seconds) spent on two different pages of your site. For simplicity, you only collect 8 observations for each page. Your datasets look like this:
Page A: [20, 22, 21, 20, 20, 19, 17, 21]
Page B: [14, 27, 31, 10, 11, 28, 2, 37]
Both sample sets have a mean of 20 seconds. It’s easy to see that the average summarizes the Page A sample really well. Most of Page A’s observations tend to stick very closely to 20 seconds. On the other hand, 20 isn’t a very good summary for the observations reported under Page B. That’s because the values in the second set tend to be farther away from the mean. If we compute the sample variance for each of the two sets, we get about 2.3 for Page A and 152 for Page B.
We see that it can be misleading to solely rely on the mean to describe a sample set. That’s why you always need to look at the associated variance as well.
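If you’d like to check these numbers yourself, here’s a minimal sketch using Python’s built-in statistics module. The variable names are ours, and we’re assuming the figures above are sample variances (dividing by n - 1), since that’s what reproduces 2.3 and 152.

```python
from statistics import mean, variance

# Time on page in seconds, copied from the two samples above.
page_a = [20, 22, 21, 20, 20, 19, 17, 21]
page_b = [14, 27, 31, 10, 11, 28, 2, 37]

for name, sample in (("Page A", page_a), ("Page B", page_b)):
    # statistics.variance() computes the sample variance (divides by n - 1).
    print(f"{name}: mean = {mean(sample):.1f}, variance = {variance(sample):.1f}")
```

Both samples share the same mean, but the variance for Page B is dozens of times larger than the one for Page A.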
But one problem with the variance is that its value can be a bit tricky to interpret and use. Just look at the variances we calculated earlier. They’re both expressed in “seconds squared” units (whatever that means).
To work around this, we often include the standard deviation in our analysis. The standard deviation is simply the square root of the variance (don’t worry, you don’t need to compute this yourself most of the time).
The standard deviation is easier to work with because it’s expressed in the same units as the data. The standard deviation of the time spent on Page A is around 1.5 seconds. Now, we can measure how far a given value is from the mean by expressing the difference in units of standard deviation. For example, the value 17 is around 2 standard deviations below the mean of 20 (the difference between 17 and 20 is 3, and 3 divided by the standard deviation of 1.5 is 2).
The key thing to remember is that the variance tells you how spread out your observations are, and the standard deviation gives you roughly the typical distance of an observation from the mean.
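Here’s a quick sketch of that calculation for Page A, again with Python’s statistics module. The names avg, sd, and z are just illustrative labels we picked.

```python
from statistics import mean, stdev

page_a = [20, 22, 21, 20, 20, 19, 17, 21]

avg = mean(page_a)
sd = stdev(page_a)  # sample standard deviation = square root of the sample variance

value = 17
z = (value - avg) / sd  # distance from the mean, in standard-deviation units
print(f"standard deviation = {sd:.2f} s, {value} is {z:.2f} standard deviations from the mean")
```

The negative sign simply tells you the value sits below the mean.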
In our landing page split-test example, we use a sample of 500 unique page visits for each version. We select a sample that (hopefully) statistically represents the entire population of our landing page visitors. Since studying the whole population of page visitors is impractical, we settle for a representative sample instead.
Exactly how large our sample size should be will depend on a number of factors. Although you don’t need to know the formula for computing the ideal sample size, it’s important to understand that it uses the following factors:
- Significance level (the probability of seeing an effect when none actually exists)
- Statistical power (the probability of seeing an effect when the effect actually exists)
- Effect size (how big the difference or change is)
We’ll dig deeper into each of these later. For now, the main thing to know is that generally, the larger our sample size is, the more reliable (precise) our estimate of the mean becomes.
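If you’re curious how those three factors fit together, here’s a rough sketch. It assumes a two-sided test on two conversion rates and uses a common normal-approximation formula; your testing tool may use a different method, and the function name is our own.

```python
from math import ceil
from statistics import NormalDist

def sample_size_per_group(p_control, p_variation, alpha=0.05, power=0.80):
    """Approximate visitors needed per group (normal approximation, two-sided)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # from the significance level
    z_power = NormalDist().inv_cdf(power)          # from the statistical power
    variance_sum = p_control * (1 - p_control) + p_variation * (1 - p_variation)
    effect = p_variation - p_control               # the effect size
    return ceil((z_alpha + z_power) ** 2 * variance_sum / effect ** 2)

n_needed = sample_size_per_group(0.07, 0.09)
print(n_needed)
```

Under these assumptions, detecting a lift from 7% to 9% calls for somewhere in the neighborhood of 2,900 visitors per page, which is a lot more than the 500 in our example.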
Null Hypothesis Testing
When running A/B tests, we’re actually applying a process called null hypothesis testing (NHT). We compare the conversion rates of the two landing pages and test the null hypothesis that there is no difference between the two conversion rates (meaning the 2-percentage-point difference between the control’s 7% and the variation’s 9% simply happened by chance).
In A/B tests, a null hypothesis typically states that the change (or changes) you made on the page has no effect on conversions.
We reject the null hypothesis if the p-value is less than the significance level we set (more on this below). Rejecting the null hypothesis means our test shows evidence that there’s a “statistically significant” difference between the 7% and 9% conversion rates we saw earlier.
Having a “statistically significant” result in our A/B test indicates that the change we made to the landing page probably had an impact on the conversion rate.
Significance Level and p-value
The significance level is the probability that your A/B test incorrectly rejects a null hypothesis that’s actually true (i.e., the chance that you conclude there’s an effect when there’s really none). In other words, the significance level is the probability of getting a false positive result (or a Type I error).
It’s up to you which significance level to use, but this is typically set to 5%. Having a 5% significance level means you’re willing to accept a 5% chance of a false positive result in your A/B test.
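One way to build intuition for that 5% is to simulate lots of tests where both pages are identical, then count how often the test still flags a “winner.” The sketch below is only an illustration: the 7% rate and 500 visits come from our example, while the number of runs, the seed, and the z-test cutoff are our own assumptions.

```python
import random
from math import sqrt
from statistics import NormalDist

random.seed(7)  # fixed seed so the run is repeatable

TRUE_RATE = 0.07  # both pages convert at the same rate, so the null is true
VISITS = 500
RUNS = 2000
ALPHA = 0.05
z_critical = NormalDist().inv_cdf(1 - ALPHA / 2)  # about 1.96

def simulated_downloads(rate, visits):
    """Count how many of `visits` simulated visitors convert."""
    return sum(random.random() < rate for _ in range(visits))

false_positives = 0
for _ in range(RUNS):
    a = simulated_downloads(TRUE_RATE, VISITS)
    b = simulated_downloads(TRUE_RATE, VISITS)
    pooled = (a + b) / (2 * VISITS)
    se = sqrt(pooled * (1 - pooled) * (2 / VISITS))
    # Flag the run as "significant" when the gap exceeds the critical value.
    if se > 0 and abs(b - a) / VISITS / se > z_critical:
        false_positives += 1

false_positive_rate = false_positives / RUNS
print(f"False positive rate: {false_positive_rate:.3f}")
```

The share of flagged runs should land close to 5%, even though nothing was actually changed between the two pages.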
A related concept is the p-value. Statistics textbooks define the p-value as the probability of getting a result at least as extreme as the one observed, assuming the null hypothesis is true.
If you get confused by the “assuming the null hypothesis was true” portion, think of it as simply assuming you ran a test that’s only made up of the control group (i.e., you made no variation).
Let’s say that in our landing page split-test example, we got a p-value of 3.2% or 0.032. This means that if the green button truly made no difference (that is, if both pages shared the same underlying conversion rate), there would be only a 3.2% chance of seeing a gap at least as large as the one between the 7% and 9% we observed.
Since we set the significance level at 5%, the p-value falls below the rejection threshold. This means a result like ours would be unlikely if the null hypothesis were true. This is taken as evidence against the null hypothesis, and so we reject it.
In other words, the p-value simply tells us how surprising a given result is. If it’s very surprising (i.e., p-value is less than the significance level), then it’s most likely safe to reject the null hypothesis.
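For the curious, here’s a minimal sketch of one common way to get a p-value for two conversion rates: a pooled, two-sided two-proportion z-test. This is our assumption, not necessarily what your testing tool runs. The counts of 35 and 45 downloads are what 7% and 9% of 500 visits work out to.

```python
from math import sqrt
from statistics import NormalDist

def two_proportion_p_value(conv_a, n_a, conv_b, n_b):
    """Pooled two-sided z-test for the difference between two conversion rates."""
    rate_a, rate_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)  # shared rate under the null
    se = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (rate_b - rate_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

z, p_value = two_proportion_p_value(35, 500, 45, 500)
print(f"z = {z:.2f}, p-value = {p_value:.3f}")
```

Note that under these assumptions the p-value comes out at roughly 0.24, well above the 3.2% we used for illustration, so treat the 3.2% as a hypothetical figure. With only 500 visits per page, a 2-point gap is not surprising enough on its own, which is exactly why sample size matters.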
Statistical power refers to the probability that your A/B test will correctly reject a false null hypothesis. In plain English, it’s the chance that your test detects a specific effect when an effect actually exists.
A low-power A/B test will be less likely to pick out an effect than a high-power test. The higher the statistical power, the lower the chance that your test makes a Type II error (failing to reject a false null hypothesis, i.e., a false negative).
According to ConversionXL, A/B tests typically follow an 80% power standard. To improve your test’s statistical power, you can increase the sample size, test changes with a larger expected effect size, or extend the test’s duration (which also adds samples).
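To see how sample size drives power, here’s a rough sketch based on a normal approximation for a two-sided test. It assumes the true rates really are 7% and 9%; the function name and the sample sizes we loop over are just for illustration.

```python
from math import sqrt
from statistics import NormalDist

def approx_power(p_control, p_variation, n_per_group, alpha=0.05):
    """Approximate power of a two-sided two-proportion z-test."""
    se = sqrt(p_control * (1 - p_control) / n_per_group
              + p_variation * (1 - p_variation) / n_per_group)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_effect = abs(p_variation - p_control) / se
    # Ignores the tiny chance of rejecting in the wrong direction.
    return NormalDist().cdf(z_effect - z_alpha)

for n in (500, 1500, 3000):
    print(f"{n} visits per page: power is about {approx_power(0.07, 0.09, n):.2f}")
```

With 500 visits per page, the test would catch a true 7%-to-9% lift only about one time in five. It takes roughly 3,000 visits per page to reach the 80% standard.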
In order for your A/B tests to be actionable and useful, you not only need to determine if a given variation has an effect, but you should also measure how large the effect is. The significance level, p-value, and statistical power make up only the starting point. You also need to analyze the effect size.
In our example earlier, the effect size is the absolute difference between the two groups’ conversion rates (2 percentage points). We may also express the effect size in units of standard deviation.
It’s important to estimate and/or compute the effect size in an A/B test. Estimating the effect size at the start of a test helps you determine the sample size and statistical power, while reporting the test’s post-experiment effect size allows you to make more informed decisions about the variations you’re analyzing.
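Here’s a small sketch of three ways to express the effect size from our example. The relative lift and Cohen’s h (a standardized effect size for proportions) are measures we’re bringing in for illustration; they aren’t the only options.

```python
from math import asin, sqrt

p_control, p_variation = 0.07, 0.09

# Absolute difference, in conversion-rate terms (0.02 = 2 percentage points).
absolute_diff = p_variation - p_control

# Relative lift: the change as a fraction of the control's rate.
relative_lift = absolute_diff / p_control

# Cohen's h: a standardized effect size for comparing two proportions.
cohens_h = 2 * asin(sqrt(p_variation)) - 2 * asin(sqrt(p_control))

print(f"absolute difference: {absolute_diff * 100:.1f} percentage points")
print(f"relative lift: {relative_lift:.1%}")
print(f"Cohen's h: {cohens_h:.3f}")
```

The same result can sound very different depending on how you report it: a 2-point absolute gain is also a lift of nearly 29% relative to the control.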
The 7% and 9% conversion rates from our earlier example are called point estimates (i.e., each of them corresponds to a single estimated number). But, since these values have only been estimated from samples, they may or may not coincide with the true conversion rates for each group.
That’s why you also need to build confidence intervals for your estimated conversion rates. Confidence intervals measure the reliability of an estimate by specifying the range of likely values where the true conversion rate will probably be found.
For example, here’s how we would most likely report a confidence interval for the variation’s conversion rate: “We are 95% confident that the true conversion rate for the green-colored landing page is 9% +/- 2 percentage points.”
In this example, we’re saying that given the test results we have, our best estimate for the tweaked landing page’s conversion rate is 9% and that we’re 95% confident that the true conversion rate lies between 7% and 11%. The “+/- 2 percentage points” value is called the margin of error.
Since we’ve also made a point estimate of the control group’s conversion rate, we need to construct a separate confidence interval for it. If we find, for example, that a 95% confidence interval for the control group’s conversion rate overlaps with the other landing page’s confidence interval, we may need to keep testing to arrive at a statistically valid result.
Keep in mind that, in general, the larger the sample size, the narrower the confidence interval becomes (since more samples mean a more reliable estimate).
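As a rough illustration, here’s how you might compute these intervals with the simple normal-approximation (Wald) formula. That choice is our assumption, and testing tools often use other methods. The counts of 35 and 45 downloads are what 7% and 9% of 500 visits work out to.

```python
from math import sqrt
from statistics import NormalDist

def wald_interval(conversions, visits, confidence=0.95):
    """Normal-approximation confidence interval for a conversion rate."""
    rate = conversions / visits
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sqrt(rate * (1 - rate) / visits)  # the margin of error
    return rate - margin, rate + margin

low_c, high_c = wald_interval(35, 500)  # control (red button)
low_v, high_v = wald_interval(45, 500)  # variation (green button)

print(f"control:   {low_c:.1%} to {high_c:.1%}")
print(f"variation: {low_v:.1%} to {high_v:.1%}")
```

With this formula, the variation’s margin of error comes out closer to 2.5 percentage points than the rounded 2 points quoted above, and the two intervals overlap noticeably at 500 visits per page.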
Here’s a list of helpful resources and further readings on A/B testing statistics and inferential statistics in general:
- A/B Testing Mastery: From Beginner To Pro in a Blog Post (ConversionXL)
- Using Effect Size—or Why the P Value Is Not Enough (NCBI)
- The Search for Significance: A Crash Course in Statistical Significance (InContext)