Statistical Power in Hypothesis Testing

What statistical power is and what influences it in hypothesis testing


If you have studied hypothesis testing, you are probably familiar with p-values and how they’re used to decide whether to reject or fail to reject the null hypothesis. The power of a statistical test measures the ability of an experiment to detect a difference when one really exists. For this post, I will use testing the fairness of a coin as the example, and my null hypothesis will be “this coin is fair”. In this case, the probability of rejecting the null hypothesis when the coin is unfair is the power of our statistical test. The power of a statistical test depends on several variables, including the significance level (alpha) used as the threshold for rejecting the null hypothesis, the size of our samples, and how unfair the coin is (aka the effect size).

To reiterate, the power of a statistical test is defined as the probability that the null hypothesis will be rejected, given that it is indeed false. Power ranges from 0 to 1, with 1 being a perfect test that guarantees the null hypothesis is rejected whenever it is false. Power is directly connected to β (beta), the probability of a type II error (failing to reject a false null hypothesis): power = 1 − β. The other error rate is alpha (𝛼), the probability of a type I error (rejecting a true null hypothesis), and a data scientist chooses an appropriate alpha level when designing a statistical test. If you need more clarity on what the differences are between type I and type II errors, check out my blog post, here. For a given alpha, you can then design the test (for example, by choosing the sample size) to maximize power, which is the same as minimizing beta. Ideally, alpha and beta would both be as small as possible, but for a fixed sample size lowering one tends to raise the other, and shrinking both requires larger samples, which can be expensive in terms of computer resources and/or impractical.
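To make alpha, beta, and power concrete for the coin example, here is a minimal Python sketch. It assumes an exact two-sided binomial test rather than any particular test from this post, and the function names, the 100 flips, and the true tails probability of 0.6 are illustrative choices of mine.

```python
from math import comb

def binom_pmf(k, n, p):
    """Probability of exactly k tails in n flips when P(tails) = p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def rejection_cutoff(n, alpha):
    """Smallest cutoff d such that rejecting when |2k - n| >= d keeps the
    type I error rate at or below alpha for a fair coin."""
    for d in range(n + 2):
        size = sum(binom_pmf(k, n, 0.5)
                   for k in range(n + 1) if abs(2 * k - n) >= d)
        if size <= alpha:
            return d

def reject_probability(n, p_true, alpha=0.05):
    """Probability that the test rejects H0 when P(tails) is really p_true."""
    d = rejection_cutoff(n, alpha)
    return sum(binom_pmf(k, n, p_true)
               for k in range(n + 1) if abs(2 * k - n) >= d)

n_flips = 100  # assumed sample size, for illustration only

# If the coin is fair, any rejection is a type I error.
type_i_rate = reject_probability(n_flips, 0.5)

# If the coin is unfair (assumed P(tails) = 0.6), a rejection is a correct detection.
power = reject_probability(n_flips, 0.6)
beta = 1 - power  # type II error rate

print(f"type I error rate: {type_i_rate:.3f}")
print(f"power:             {power:.3f}")
print(f"beta:              {beta:.3f}")
```

Because coin flips are discrete, the actual type I error rate of this test sits at or below the chosen alpha rather than exactly on it.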

The size of the difference between the two groups you are studying is called the effect size. You can use a t-test to decide whether the coin is fair after flipping it n times, with a null hypothesis (H0) of H0(tails)=0.5. To do this, whether you compare two different coins or compare the coin you’re testing against a known distribution, you compare the mean of one sample to that of the other (or to the known mean). In these cases, the metric often used for the effect size is Cohen’s d. By definition, Cohen’s d is d = (m1 − m2) / s, where m1 and m2 are the respective means of the samples and s is the pooled standard deviation of the samples. In (hopefully) simpler terms, Cohen’s d is the difference between the sample means divided by the pooled standard deviation, which expresses the difference in units of standard deviations. The pooled standard deviation is a weighted average of the two samples’ spreads, with each sample’s spread measured around its own mean.
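Here is a small sketch of that calculation using only Python's standard library. The function name and the two lists of flips (1 for tails, 0 for heads) are made up for illustration, and the pooled standard deviation uses the usual formula that weights each sample variance by its degrees of freedom.

```python
from math import sqrt
from statistics import mean, variance

def cohens_d(sample1, sample2):
    """Cohen's d: difference of the sample means divided by the pooled standard deviation."""
    n1, n2 = len(sample1), len(sample2)
    # Pooled variance: each sample variance weighted by its degrees of freedom.
    pooled_var = ((n1 - 1) * variance(sample1)
                  + (n2 - 1) * variance(sample2)) / (n1 + n2 - 2)
    return (mean(sample1) - mean(sample2)) / sqrt(pooled_var)

# Hypothetical flips from two coins (1 = tails, 0 = heads).
coin_a = [1, 1, 0, 1, 1, 0, 1, 1, 1, 0]
coin_b = [0, 1, 0, 0, 1, 0, 1, 0, 0, 1]

print(f"Cohen's d: {cohens_d(coin_a, coin_b):.2f}")
```

A positive value means the first sample has the higher mean. Swapping the arguments flips the sign but not the magnitude.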

When calculating power for a statistical test, there are three quantities to consider: alpha, effect size, and sample size. Plotting the power of your t-tests across a range of sample sizes will give you a better understanding of the relationship between these quantities and of what makes a compelling statistical test.
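As a rough way to see how these three quantities interact, the sketch below tabulates approximate power for the coin example. It is a simplification on my part: it uses a normal approximation to a one-sample test of a proportion instead of a t-test, and the function name, sample sizes, and true tails probability of 0.6 are assumptions chosen for illustration.

```python
from math import sqrt
from statistics import NormalDist

def approx_power(n, p_true, p_null=0.5, alpha=0.05):
    """Approximate power of a two-sided z-test of H0: P(tails) = p_null.
    p_true must be strictly between 0 and 1."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    se_null = sqrt(p_null * (1 - p_null) / n)  # spread of the sample proportion under H0
    se_true = sqrt(p_true * (1 - p_true) / n)  # spread under the true coin
    # H0 is rejected when the sample proportion lands outside [lower, upper].
    upper = p_null + z_crit * se_null
    lower = p_null - z_crit * se_null
    return (1 - z.cdf((upper - p_true) / se_true)) + z.cdf((lower - p_true) / se_true)

# Print a small table of power against sample size for two alpha levels.
print(f"{'n':>6} {'alpha=0.05':>12} {'alpha=0.01':>12}")
for n in (20, 50, 100, 200, 500):
    print(f"{n:>6} {approx_power(n, 0.6, alpha=0.05):>12.3f} "
          f"{approx_power(n, 0.6, alpha=0.01):>12.3f}")
```

Reading down a column shows power rising with sample size. Reading across a row shows that a stricter alpha costs power when everything else is held fixed.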

Let’s look at how power changes with the effect size. Going back to our example, consider again trying to detect whether a coin is fair. Using the same null hypothesis from earlier, H0(tails)=0.5, our presumption is that we’re dealing with a fair coin. Again, power depends on both the sample size and the effect size. So, if the alternative hypothesis (Ha) has a large effect size, such as Ha(tails)=0.95, then the null hypothesis is more likely to be rejected because the power is higher. If the alternative hypothesis has a smaller effect size, such as Ha(tails)=0.4, there is a higher probability of failing to reject the null hypothesis because the power is lower.
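One way to check this intuition is to estimate power by simulation: flip a virtual coin many times, run the test on each simulated experiment, and count how often the null hypothesis is rejected. The sketch below does that for the two alternatives above. The 30 flips per experiment, the 2,000 experiments, the z-test with a 1.96 cutoff (alpha of about 0.05), and the function name are all assumptions I picked for illustration.

```python
import random
from math import sqrt

def simulated_power(p_true, n_flips=30, n_experiments=2000, z_crit=1.96, seed=0):
    """Fraction of simulated experiments in which H0: P(tails) = 0.5 is rejected."""
    rng = random.Random(seed)       # fixed seed so the run is repeatable
    se_null = sqrt(0.25 / n_flips)  # standard error of the proportion under H0
    rejections = 0
    for _ in range(n_experiments):
        tails = sum(rng.random() < p_true for _ in range(n_flips))
        z = (tails / n_flips - 0.5) / se_null
        if abs(z) > z_crit:
            rejections += 1
    return rejections / n_experiments

power_large_effect = simulated_power(0.95)  # very unfair coin
power_small_effect = simulated_power(0.40)  # slightly unfair coin

print(f"estimated power, P(tails)=0.95: {power_large_effect:.3f}")
print(f"estimated power, P(tails)=0.40: {power_small_effect:.3f}")
```

With the same number of flips, the very unfair coin is caught almost every time, while the slightly unfair coin slips through in most experiments.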

As I mentioned earlier, the t-test is a popular statistical test for comparing sample means. Typically, the first thing people look at when deciding whether to reject the null hypothesis is the p-value. But one really important thing I want to note is that an extremely small p-value from a t-test does not by itself mean the test was a good one or that the effect is practically meaningful. Even a tiny effect size can be statistically significant with extremely large sample sizes (think of comparing 15,000 samples versus 15 samples). When interpreting the results of a statistical test, understanding these relationships and taking into account all four parameters (alpha, effect size, sample size, and power) is essential.
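To illustrate the sample-size point, the sketch below computes a p-value for the same observed proportion of tails (8 out of every 15 flips) at the two sample sizes mentioned above. The counts and the function name are hypothetical, and the normal-approximation z-test is used only for simplicity; it is crude for as few as 15 flips.

```python
from math import sqrt
from statistics import NormalDist

def two_sided_p_value(tails, n_flips, p_null=0.5):
    """Two-sided p-value for H0: P(tails) = p_null, using a normal approximation."""
    p_hat = tails / n_flips
    z = (p_hat - p_null) / sqrt(p_null * (1 - p_null) / n_flips)
    return 2 * NormalDist().cdf(-abs(z))

# Same observed proportion of tails (about 0.533) in both hypothetical experiments.
p_small_sample = two_sided_p_value(8, 15)
p_large_sample = two_sided_p_value(8000, 15000)

print(f"n = 15:     p-value = {p_small_sample:.3g}")
print(f"n = 15,000: p-value = {p_large_sample:.3g}")
```

The observed effect is identical in both cases, so the difference in p-values comes entirely from the sample size.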

I hope this blog post helped you learn something new or clarify a confusing aspect about statistical power and/or how power is influenced by sample size, alpha, and effect size.

Thank you for reading!


Written by

Aspiring Data Scientist — Recent Graduate of Flatiron School’s Online Data Science Bootcamp
