Why do I need such a large sample size for my A/B test?
Running A/B tests is about finding out with some degree of confidence which of two versions of some thing is better. The canonical example would be running a test on the sign up form of your website. At present some percentage of users who see that page sign up or “convert”, and you’d like to see if making some change to the text or look of the page will make a greater percentage of users convert.
You can determine if that’s the case by running a hypothesis test comparing the proportion of users who convert when shown the original version (let’s call this p1) and the proportion who convert when shown the new version (let’s call this p2). The null hypothesis is typically something like p1 - p2 = 0, or that there is no real difference between the two versions. The alternate hypothesis is therefore something like p1 - p2 != 0 (the versions differ) or p2 > p1 (the new version is better). This difference between the two proportions is called the effect size, and it is often measured in standard deviations of the sampling distribution of that difference (also known as the standard error). For example if version A shows a conversion rate of 5%, version B shows a rate of 6%, and the standard deviation is 1% then the effect size is 1 standard deviation.
As you probably recall from statistics class back in the day, you must select a significance level for your test. The significance level is a probability (5% is a common choice) that corresponds to a cutoff on the distribution of results you would expect if the null hypothesis were true: about 1.96 standard deviations from the mean for a two-sided test at 5%. If you see an effect size that is more standard deviations away from the mean than that cutoff then you have a “significant” result.
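To make this concrete, here is a minimal Python sketch of such a test. It assumes a pooled two-proportion z-test with a normal approximation, which is a common choice but not the only one. The function name and the counts of 1,000 users per version are made up for illustration.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(conversions_a, n_a, conversions_b, n_b):
    """Return (z, two-sided p-value) for the null hypothesis p1 - p2 = 0."""
    p1 = conversions_a / n_a
    p2 = conversions_b / n_b
    # Under the null both versions share one conversion rate, so pool them.
    pooled = (conversions_a + conversions_b) / (n_a + n_b)
    # Standard deviation of the sampling distribution of the difference.
    standard_error = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p2 - p1) / standard_error
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical data: 5% vs 6% conversion with 1,000 users per version.
z, p_value = two_proportion_z_test(50, 1000, 60, 1000)
cutoff = NormalDist().inv_cdf(1 - 0.05 / 2)  # two-sided cutoff at 5%
print(f"z = {z:.2f}, p-value = {p_value:.3f}, cutoff = {cutoff:.2f}")
print("significant" if abs(z) > cutoff else "not significant")
```

With these made-up numbers the one point difference sits at roughly 1 standard deviation, well short of the cutoff, so it would not count as significant.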
All of this is well and good, but if you’ve looked into recommended sample sizes for running your experiment you’ve likely been surprised by how many examples are required: often many thousands for each version.
What exactly do we gain by using these recommended sample sizes as opposed to running our test until we’ve found a significant result? Quite a lot it turns out.
The Goal
Looking at your A/B tests only through the lens of significance gives you an incomplete and sometimes misleading picture. I would argue that instead of being concerned with significance, we should be concerned with the likelihood that the conclusions we draw from our test are true, and that those conclusions signify something important to us.
This goal can be broken down into more digestible chunks. First, what does it mean to be concerned with the likelihood of our results being true? This comes down to the fact that there are four possibilities when we run our test:
1. The null hypothesis is false and we reject it. A true positive.
2. The null hypothesis is true and we reject it. A false positive.
3. The null hypothesis is false and we fail to reject it. A false negative.
4. The null hypothesis is true and we fail to reject it. A true negative.
We want to be reasonably confident that our results fall into the true conclusions covered by numbers 1 and 4, and not the false conclusions of numbers 2 and 3.
By adjusting our significance level we can directly control the likelihood of false positives. A significance level of 5% means that, assuming our null hypothesis is true, there is a 5% chance that by pure luck we will see a result extreme enough to be declared significant.
That covers false positives, but what about the risk of false negatives? That’s another important piece of our quest for truth. Power in statistics is the probability of your hypothesis test detecting a positive result when it is present. Said another way, it’s the probability of not having a false negative.
So in order to effectively pursue true results we need to be concerned with both the significance level and power.
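Power can be estimated ahead of time, too. Below is a rough sketch using the normal approximation for two proportions. The function name and the example rates are my own assumptions, and the calculation ignores the tiny chance of getting a significant result in the wrong direction.

```python
from math import sqrt
from statistics import NormalDist


def power_two_proportions(p1, p2, n_per_version, alpha=0.05):
    """Approximate power of a two-sided test when the true rates are p1 and p2."""
    z_cutoff = NormalDist().inv_cdf(1 - alpha / 2)
    p_bar = (p1 + p2) / 2
    # Spread of the observed difference if the null hypothesis were true.
    se_null = sqrt(2 * p_bar * (1 - p_bar) / n_per_version)
    # Spread of the observed difference under the alternate hypothesis.
    se_alt = sqrt((p1 * (1 - p1) + p2 * (1 - p2)) / n_per_version)
    # Share of the alternate distribution that lands beyond the cutoff.
    return NormalDist().cdf((abs(p2 - p1) - z_cutoff * se_null) / se_alt)


# Hypothetical true rates of 5% and 6% at two different sample sizes.
for n in (1000, 8000):
    print(n, round(power_two_proportions(0.05, 0.06, n), 2))
```

Under these assumptions 1,000 users per version gives a power of well under 50% for a 5% vs 6% difference, while 8,000 per version gets close to 80%.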
What about the second part of our goal, conclusions that signify something important to us? This addresses the fact that you can find statistically significant results that are utterly uninteresting and not worth acting on. That is, you could run an A/B test and find a significant result that says “version B is 0.0001% more effective than version A”. Very likely, although it depends on your domain, this difference is meaningless and it wouldn’t be worth the time and effort to run the test and then replace option A with option B.
Crucially, such uninteresting results are more likely to be found as your sample size increases. This is because the larger the sample size the lower the standard deviation of the sampling distribution. Think back to our example from the beginning of the post where we had an effect size of 1% or 1 standard deviation (because our standard deviation was also 1%). If the sample size goes way up, shrinking the standard deviation to 0.1% (which takes roughly 100 times as many samples, since the standard deviation shrinks with the square root of the sample size), then the same results would show an effect size of 10 standard deviations. This would be much, much farther from the mean of the distribution and would almost certainly qualify as a significant result.
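Here is a small sketch of that shrinking standard deviation, again assuming a pooled normal approximation. The function name, the 5.0% vs 5.1% rates and the sample sizes are hypothetical and chosen only to illustrate the point.

```python
from math import sqrt


def z_score_for_difference(p1, p2, n_per_version):
    """Observed difference expressed in standard deviations (pooled estimate)."""
    p_bar = (p1 + p2) / 2
    standard_error = sqrt(2 * p_bar * (1 - p_bar) / n_per_version)
    return (p2 - p1) / standard_error


# The same tiny observed difference (5.0% vs 5.1%) at growing sample sizes.
for n in (1_000, 100_000, 10_000_000):
    z = z_score_for_difference(0.050, 0.051, n)
    print(f"n = {n:>10,}  ->  {z:.2f} standard deviations")
```

Each 100-fold increase in sample size multiplies the number of standard deviations by 10, so the same 0.1 point difference goes from about a tenth of a standard deviation to roughly ten.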
The point is that a large enough sample size can make any effect size significant. So it’s important that we define ahead of time the smallest effect size we deem to be meaningful to us. In the context of A/B testing this might mean the smallest effect size that would make it worth our while to wait for our test results and deploy a new version of the feature.
We are going to use the figure we come up with here as something called the Minimum Detectable Effect (MDE) in our sample size calculations. There will be more on MDE towards the end of the post for those who are curious, but it’s not necessary to understand it in detail to grasp what A/B testing sample size calculations should do for you.
What Your Sample Size Tells You
Taken together, what does all of this mean? We want to have a level of confidence that any conclusions we draw are true. We specify that level of confidence in terms of significance level to deal with the probability of false positives and power to deal with the probability of false negatives. We also specify the smallest effect size that is meaningful to us in order to avoid running a test only to find out that we don’t care about its results.
Given all of this we can solve for the smallest sample size that will satisfy all of those parameters. This will tell us “you need X trials per version for your experiment to have this significance level and power for an effect size at least as big as your MDE”. If your effect size is bigger than that MDE, awesome! You’ll have an even greater chance of detecting it than your specified level of power.
Somewhat counter-intuitively, you do not want to keep running your test beyond this sample size. Remember that a larger sample increases the risk of flagging as significant effects smaller than your MDE, which by your own definition is the smallest effect size you would care to notice.
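As a sketch of what “solving for the sample size” can look like, the function below uses one common normal-approximation formula for comparing two proportions. Online calculators may use slightly different formulas and so give slightly different answers. The function name, the 5% baseline and the 1 point absolute MDE are assumptions for illustration.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_version(baseline_rate, mde, alpha=0.05, power=0.80):
    """Approximate users needed per version to detect an absolute lift of `mde`."""
    p1 = baseline_rate
    p2 = baseline_rate + mde
    p_bar = (p1 + p2) / 2
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # guards against false positives
    z_power = NormalDist().inv_cdf(power)          # guards against false negatives
    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_power * sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return ceil(numerator / mde ** 2)


# Hypothetical inputs: 5% baseline conversion, 1 percentage point MDE.
print(sample_size_per_version(0.05, 0.01))
```

With these inputs the answer comes out on the order of 8,000 users per version, which is the kind of “many thousands” figure that tends to surprise people.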
Equally important is avoiding the opposite temptation: not running your test until you’ve achieved the calculated sample size, and instead stopping the test as soon as you’ve achieved significance. Evan Miller does a wonderful job explaining why you shouldn’t do this, but the short answer is that it biases you towards thinking you have significant results when really you don’t. What if after 500 trials the result is deemed significant, but after 1500 it turns out the result isn’t after all? If you had stopped early you would have drawn the incorrect conclusion about your results.
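A quick simulation can show the problem with stopping early. The sketch below runs many A/A tests, where both versions share the same true conversion rate so every significant result is a false positive. It compares stopping at the first significant peek with checking once at the planned sample size. Every parameter here (the 20% rate, a peek every 100 users, 500 simulated experiments) is an assumption chosen to keep the run fast.

```python
import random
from math import sqrt
from statistics import NormalDist


def is_significant(conv_a, conv_b, n, z_cutoff):
    """Pooled two-proportion z-test with n users in each version."""
    pooled = (conv_a + conv_b) / (2 * n)
    standard_error = sqrt(pooled * (1 - pooled) * 2 / n)
    if standard_error == 0:
        return False  # no variation observed yet, nothing to test
    return abs(conv_b - conv_a) / n / standard_error > z_cutoff


def simulate_peeking(rate=0.2, n_final=2000, peek_every=100,
                     experiments=500, alpha=0.05, seed=0):
    """Return (false positive rate when peeking, rate when testing once at the end)."""
    rng = random.Random(seed)
    z_cutoff = NormalDist().inv_cdf(1 - alpha / 2)
    stopped_early = significant_at_end = 0
    for _ in range(experiments):
        conv_a = conv_b = 0
        ever_significant = significant = False
        for n in range(peek_every, n_final + 1, peek_every):
            # Both versions convert at the same true rate (an A/A test).
            conv_a += sum(rng.random() < rate for _ in range(peek_every))
            conv_b += sum(rng.random() < rate for _ in range(peek_every))
            significant = is_significant(conv_a, conv_b, n, z_cutoff)
            ever_significant = ever_significant or significant
        stopped_early += ever_significant
        significant_at_end += significant  # result of the final look only
    return stopped_early / experiments, significant_at_end / experiments


peek_rate, fixed_rate = simulate_peeking()
print(f"false positives when peeking: {peek_rate:.1%}")
print(f"false positives at the planned sample size: {fixed_rate:.1%}")
```

I would expect the peeking rate to come out several times higher than the nominal 5%, while the single look at the end stays close to it.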
Whew! That’s a lot and certainly more complicated than simply checking if your result is significant or not. But it is also an approach that will get you results you actually care about with a level of reliability you can be comfortable with.
Postscript - Explaining Minimum Detectable Effect
The MDE is the smallest effect size needed for a test to have at least a certain level of power, given a particular significance level and sample size. For example if you know that you want your test to have a significance level of 5% and a power of 80% you could calculate an MDE which would tell you “if the true effect size is equal to or greater than our MDE we will have at least an 80% chance of detecting a positive result given our 5% significance level”.
Jonathan Leirer has a great article explaining this better than I can. But the heart of the concept is that when there is a real effect, the null hypothesis and its probability distribution (the “null distribution”) do not describe reality; rather some alternate hypothesis with its own probability distribution is true. Further, the null distribution and the alternate distribution will be some distance apart, and may have some overlap, hence the possibility of false positives and negatives.
So, if we want a power of 80% the MDE describes how far from the null distribution’s mean the alternate distribution’s mean must be for 80% of the alternate distribution to fall beyond the significance cutoff (which is defined on the null distribution). Said another way, if the alternate hypothesis is true, and 80% of the alternate hypothesis’s probability distribution falls past our significance cutoff, we have an 80% chance of seeing a significant result and so detecting the effect. It’s worth remembering here that “significant” simply means that our observed effect lies beyond the cutoff set by our significance level on the null distribution. A picture really is worth about 1,000,000 words here, so read Jonathan’s article to see a visual of how this works.
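The distance described above can be sketched in a few lines. This version assumes a two-sided test and approximates the spread of both distributions using the baseline conversion rate, which is a simplification. The function name and the example inputs are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def minimum_detectable_effect(baseline_rate, n_per_version, alpha=0.05, power=0.80):
    """Approximate absolute MDE for a given sample size per version."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # null mean to the cutoff
    z_power = NormalDist().inv_cdf(power)          # cutoff to the alternate mean
    # Simplification: use the baseline rate for the spread of both distributions.
    standard_error = sqrt(2 * baseline_rate * (1 - baseline_rate) / n_per_version)
    return (z_alpha + z_power) * standard_error


# Hypothetical inputs: 5% baseline rate at a few sample sizes per version.
for n in (2_000, 8_000, 32_000):
    mde = minimum_detectable_effect(0.05, n)
    print(f"n = {n:>6,}  ->  MDE of about {mde:.2%} (absolute)")
```

At 5% significance and 80% power the two distribution means need to sit about 2.8 standard errors apart, and quadrupling the sample size halves the MDE.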
















