Jan 2, 2021

# Discussing the Multiple Comparisons Problem

What it is and how to correct it

The discovery of an amazingly low p-value does not guarantee that the null hypothesis is false. A p-value of 0.0001 means that, even if the null hypothesis were true, there would still be a 1 in 10,000 chance of seeing a result this extreme. And p-values *alone* can be deceptive. For instance, if you run enough repeated tests, you are likely to come upon a tiny p-value at some point, whether the null hypothesis is true or not. Imagine taking 100 scientific experiments, each reporting a p-value of 0.05. Are all of their conclusions legitimate? Probably not. For any single experiment with a p-value of 0.05, there is a 5 percent chance of a false positive when the null hypothesis is true. Individually, each result inspires some confidence; collectively, however, the chance that *every* one of these rejections is correct is very small, and at least one false conclusion is very likely.
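A quick simulation makes this concrete. The sketch below (a minimal illustration, using only Python's standard library) exploits the fact that, under a true null hypothesis, p-values are uniformly distributed, so each test independently has an alpha chance of a false positive:

```python
import random

random.seed(0)

alpha = 0.05
n_tests = 100    # independent experiments, null hypothesis true in all of them
n_sims = 10_000  # how many times we repeat the whole batch

# Under a true null, a p-value is uniform on [0, 1], so drawing a uniform
# random number and checking if it falls below alpha simulates one test.
batches_with_false_positive = sum(
    any(random.random() < alpha for _ in range(n_tests))
    for _ in range(n_sims)
)

rate = batches_with_false_positive / n_sims
print(f"P(at least one p < {alpha} in {n_tests} tests) ≈ {rate:.3f}")
```

The simulated rate lands near the analytic answer, 1 − (1 − 0.05)¹⁰⁰ ≈ 0.994: with 100 true nulls, at least one "significant" result is all but guaranteed.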

Likewise, if you concurrently measure several metrics in an experiment, the probability that at least one of them will clear the alpha level increases. When we compare a multitude of quantities, we can expect to find some that are strongly correlated, whether or not there is an *actual* connection between them. This is called **spurious correlation**. A former Harvard student named Tyler Vigen set out to catalog these connections, which you can find on his website and in his book. While the graphs on his website show each pair of quantities tracking one another closely, believing that any of them share a functional connection seems unrealistic. Despite what the visuals suggest, there is no mechanism linking “US spending on science, space, and technology” with “Suicides by hanging, strangulation and suffocation”.

A spurious correlation is a **type I error**, that is, a **false positive** — we believe we’ve discovered something meaningful when there really isn’t anything there. We set a low alpha level to reduce our vulnerability to type I errors for any single comparison in an experiment. Put another way, an alpha level of 0.05 means that, when the null hypothesis is true, we make a type I error only 1 out of 20 times. But if we test 20 independent outcomes at the 0.05 level, we should *expect* about 1 false positive among them, and the chance of at least one is roughly 64 percent. The small risk of a type I error in each test accumulates as we make **multiple comparisons** by examining several factors simultaneously.
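The accumulation is easy to compute directly. For independent tests, the probability of avoiding a false positive in every test is the product of the per-test probabilities, which gives the family-wise error rate (a small sketch; the 0.05 and 20 come from the example above):

```python
alpha = 0.05
n_tests = 20

# Probability of no false positive in one test is (1 - alpha);
# across independent tests those probabilities multiply.
p_no_false_positive = (1 - alpha) ** n_tests
family_wise_error_rate = 1 - p_no_false_positive  # ~0.642

# On average, alpha * n_tests of the results are false positives.
expected_false_positives = alpha * n_tests  # 1.0

print(f"Chance of at least one false positive: {family_wise_error_rate:.1%}")
print(f"Expected number of false positives:    {expected_false_positives:.1f}")
```

So even though each individual test is fairly safe, the batch as a whole is more likely than not to contain at least one type I error.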

Because the possibility of drawing an incorrect conclusion increases when statistically measuring several quantities concurrently, statisticians have developed approaches that limit the probability of type I errors, such as the **Bonferroni correction**. With the Bonferroni correction, you divide alpha by the number of comparisons you make to set a modified threshold for rejecting the null hypothesis. For instance, if you want alpha = 0.05 and are making 10 comparisons at the same time, the Bonferroni correction sets the modified p-value threshold to 0.05/10 = 0.005. The lower threshold helps control type I errors, but it does not make you immune to them; it just reduces the overall risk of one occurring. The trade-off is that the power of each test is decreased, and type II errors become more likely.
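Here is a minimal sketch of that arithmetic, reusing the alpha = 0.05, 10-comparison example above and assuming independent tests:

```python
alpha = 0.05
m = 10  # number of simultaneous comparisons

# Bonferroni: test each comparison at alpha / m instead of alpha.
adjusted_threshold = alpha / m  # 0.005

# With the correction, the family-wise error rate for independent
# tests stays at or below the original alpha...
fwer_corrected = 1 - (1 - adjusted_threshold) ** m    # ~0.0489

# ...whereas without it, the error rate balloons well past alpha.
fwer_uncorrected = 1 - (1 - alpha) ** m               # ~0.4013

print(f"Adjusted per-test threshold: {adjusted_threshold:.3f}")
print(f"FWER with correction:    {fwer_corrected:.4f}")
print(f"FWER without correction: {fwer_uncorrected:.4f}")
```

In practice you rarely have to do this by hand: for example, `statsmodels.stats.multitest.multipletests` accepts `method='bonferroni'` and adjusts a whole array of p-values at once.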

Thank you for reading! I hope this post helped you better understand the multiple comparisons problem. For more context about error types and statistical power, check out these other blog posts I’ve written, and feel free to connect with me on LinkedIn: