Statistical Power: Why Small Studies Often Find Nothing
Morgan Voss
Suppose you want to know whether cats prefer a sleeping surface that is a few degrees warmer than room temperature. The preference, if it exists, is real but modest: most cats show some inclination toward warmth, but individuals vary considerably, and the effect is not large enough to be obvious from casual observation.
You recruit five cats. You run the study. You find no significant preference.
The tempting conclusion is that cats do not prefer warmer surfaces. The more defensible conclusion is that your study was not capable of detecting the effect even if it was there. A study with five cats, measuring a modest preference against substantial individual variation, had very little chance of returning a significant result. Five cats was not enough.
This is a question of statistical power.
The Definition
Statistical power is the probability of correctly rejecting the null hypothesis when the null hypothesis is false:
$$\text{Power} = P(\text{reject } H_0 \mid H_0 \text{ is false}) = 1 - \beta$$
The term $\beta$ is the Type II error rate: the probability of failing to reject a false null, sometimes called a false negative. Power and $\beta$ sum to one by definition.
The two error types have names that reflect their position in the decision table. A Type I error (rate $\alpha$) is rejecting a true null: concluding there is an effect when there is not. A Type II error (rate $\beta$) is failing to reject a false null: concluding there is no effect when there is. These errors trade off against each other. Lowering $\alpha$ typically increases $\beta$, all else equal.
The Four Quantities
Power is determined by four quantities, and understanding each one explains why some studies are doomed before the first data point is collected.
The first is effect size. Larger effects are easier to detect: a strong cat preference for warmth is detectable with fewer animals than a mild one. Effect size is usually expressed as a standardized quantity. Cohen's $d$, for mean differences, divides the raw difference by the pooled standard deviation. Small effects require large samples.
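As a rough illustration of Cohen's $d$, here is a minimal sketch that computes it for two groups. The data and the function name `cohens_d` are made up for this example: the numbers stand for minutes per hour spent on a warm versus a room-temperature surface, and they are not from any real study.

```python
from math import sqrt
from statistics import mean, variance

def cohens_d(group_a, group_b):
    """Standardized mean difference using the pooled standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    # statistics.variance is the sample variance (n - 1 in the denominator)
    pooled_var = ((n_a - 1) * variance(group_a)
                  + (n_b - 1) * variance(group_b)) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)

# Hypothetical data: minutes per hour on each surface
warm = [32, 28, 35, 30, 25]
neutral = [29, 27, 31, 26, 22]

print(round(cohens_d(warm, neutral), 2))  # 0.83
```

A raw difference of 3 minutes means little on its own. Dividing by the pooled standard deviation (about 3.6 minutes here) puts it on a scale that can be compared across studies.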
Sample size is the second lever. More observations reduce the standard error of the estimate, making it easier to distinguish a real effect from noise. The relationship is not linear: the standard error shrinks with $1/\sqrt{n}$, so the detectable signal grows only with $\sqrt{n}$. Doubling the sample produces a modest improvement. Quadrupling it produces a more substantial one.
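To see the diminishing returns, the sketch below approximates power for a two-sided, two-sample comparison at several sample sizes. It uses a normal approximation rather than the exact $t$ distribution, so it slightly overstates power for very small groups. The function name `approx_power` and the choice of $d = 0.5$ are assumptions for illustration.

```python
from math import sqrt
from statistics import NormalDist

def approx_power(d, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample test (normal approximation)."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    # Expected value of the test statistic when the true effect is d
    shift = d * sqrt(n_per_group / 2)
    # Probability of landing beyond either critical value
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)

for n in (5, 10, 20, 40, 80, 160):
    print(f"n per group = {n:>3}  power = {approx_power(0.5, n):.2f}")
```

Even with a medium effect, five per group leaves power well under 0.2 under this approximation.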
The significance level matters too. A stricter threshold (smaller $\alpha$) is harder to cross, which decreases power. Raising $\alpha$ increases power at the cost of more false positives. This is the trade-off between Type I and Type II error, and it cannot be avoided by choosing a better study design.
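The trade-off can be made concrete by holding the effect size and sample size fixed and varying only $\alpha$. This is a sketch under the same normal approximation for a two-sided, two-sample test. The defaults ($d = 0.5$, 30 per group) and the name `type_ii_rate` are arbitrary choices for the example.

```python
from math import sqrt
from statistics import NormalDist

def type_ii_rate(alpha, d=0.5, n_per_group=30):
    """Approximate beta for a two-sided, two-sample test (normal approximation)."""
    z = NormalDist()
    z_crit = z.inv_cdf(1 - alpha / 2)
    shift = d * sqrt(n_per_group / 2)
    power = z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)
    return 1 - power

# Stricter thresholds buy fewer false positives at the price of more false negatives
for alpha in (0.10, 0.05, 0.01, 0.001):
    print(f"alpha = {alpha:<5}  beta = {type_ii_rate(alpha):.2f}")
```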
Finally, variance. Higher variance in the outcome makes effects harder to detect. A cat population with very consistent thermoregulatory behavior would make the warm-surface preference easier to measure than one where individual baselines vary widely. Reducing measurement noise improves power without adding subjects.
Power Analysis
A power analysis uses these four quantities to determine sample size before data collection begins. The analyst specifies the smallest effect size worth detecting, the desired significance level, and the acceptable Type II error rate, then solves for the required $n$.
The most common target is power of 0.80, meaning an 80% chance of detecting a real effect. This implies $\beta = 0.20$: a 20% chance of a false negative. The 0.80 convention, like the 0.05 significance threshold, is somewhat arbitrary but widely adopted.
For a two-sample $t$-test with a medium effect size ($d = 0.5$) at $\alpha = 0.05$ and power of 0.80, the required sample size is roughly 64 per group. The same parameters with a small effect ($d = 0.2$) require around 394 per group. The warm-surface cat study, if the preference is genuinely modest, needs considerably more than five animals.
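These figures can be approximated by hand. A common closed-form shortcut for a two-sided, two-sample comparison is $n \approx 2(z_{1-\alpha/2} + z_{1-\beta})^2 / d^2$ per group. The sketch below implements that shortcut. It is a normal approximation, not the exact $t$-based calculation that dedicated power software performs, so it comes out about one subject per group lower than the numbers quoted above. The function name `n_per_group` is an assumption for this example.

```python
from math import ceil
from statistics import NormalDist

def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate sample size per group (normal approximation, two-sided test)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the significance level
    z_beta = z.inv_cdf(power)           # quantile matching the target power
    return ceil(2 * (z_alpha + z_beta) ** 2 / d ** 2)

print(n_per_group(0.5))  # 63
print(n_per_group(0.2))  # 393
```

Because $d$ is squared in the denominator, halving the effect size you want to detect roughly quadruples the required sample.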
Underpowered Studies and What They Produce
A study that is too small to reliably detect its target effect is underpowered. The consequences extend beyond the individual study.
An underpowered study will frequently fail to reach significance. Those non-results often go unpublished (the file drawer problem), which leaves a literature that overrepresents significant findings relative to non-significant ones. When an underpowered study does reach significance, the estimated effect is likely to be inflated: a small study crosses the threshold only if, by chance, the observed effect is larger than the true one. Subsequent studies tend to find smaller effects. This is called the winner's curse.
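A small simulation shows the inflation directly. This is a sketch with simplifying assumptions: the outcome has a known standard deviation of 1, so each study's estimate of $d$ is drawn straight from its sampling distribution, and the true effect (0.3), group size (10), and number of simulated studies are arbitrary illustrative values.

```python
import random
from math import sqrt
from statistics import NormalDist, mean

def simulate_significant_estimates(true_d=0.3, n_per_group=10, alpha=0.05,
                                   n_studies=5000, seed=1):
    """Simulate many small studies; return the estimates that reached significance."""
    rng = random.Random(seed)
    se = sqrt(2 / n_per_group)  # standard error of the estimate (unit variance assumed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    significant = []
    for _ in range(n_studies):
        estimate = rng.gauss(true_d, se)  # one study's observed effect
        if abs(estimate / se) > z_crit:   # two-sided z-test
            significant.append(estimate)
    return significant

sig = simulate_significant_estimates()
print(f"share of studies significant: {len(sig) / 5000:.2f}")
print(f"mean significant estimate:    {mean(sig):.2f}  (true effect: 0.30)")
```

With these settings an estimate has to exceed about 0.88 in absolute value to be significant, so every "successful" study reports an effect several times the true one.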
The replication crisis in several scientific fields can be partially traced to widespread publication of underpowered studies. A finding based on 20 subjects, backed only by a p-value of 0.049, is not strong evidence for much.
When to Do the Analysis
Power analysis belongs before data collection. It is a planning tool, not a post-hoc rationalization.
Computing power after a study finds no significant result and concluding the study was underpowered is sometimes called observed power analysis. It is not useful: it is essentially a transformation of the p-value back into a power estimate, and it adds no information. The useful question is whether the study was designed with enough sensitivity to detect the effect of interest. That question needs to be answered before the data exists.
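To see why observed power adds nothing, the sketch below computes it from the p-value alone for a two-sided z-test. That simplification is an assumption of this example; the same one-to-one relationship holds for other tests with different arithmetic. The function name `observed_power` is illustrative.

```python
from statistics import NormalDist

def observed_power(p_value, alpha=0.05):
    """Power computed by treating the observed effect as the true effect."""
    z = NormalDist()
    z_obs = z.inv_cdf(1 - p_value / 2)  # |z| implied by the two-sided p-value
    z_crit = z.inv_cdf(1 - alpha / 2)
    return z.cdf(z_obs - z_crit) + z.cdf(-z_obs - z_crit)

for p in (0.01, 0.05, 0.20, 0.50):
    print(f"p = {p:<4}  observed power = {observed_power(p):.2f}")
```

No data beyond the p-value went into the calculation. A result sitting exactly at $p = 0.05$ has observed power of about 0.5, and any non-significant result falls below that, so "the study was underpowered" is a restatement of "the result was not significant".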
The five-cat warm-surface study was not powered to detect a modest thermoregulatory preference. No analysis of the resulting data can repair that. The right time to notice the problem was before the first cat was recruited.
