The Multiple Comparisons Problem

Morgan Voss

You want to know which of 20 toys your cat genuinely prefers. You test each one separately. For each toy, you present it alongside a neutral object, observe her engagement, and run a significance test at the 0.05 threshold. You repeat this for all 20 toys.

At the end, one toy produces a significant result. You declare it her favorite and write up the finding.

The problem: even if she has no genuine preferences at all, you should expect to find roughly one significant result in 20 tests at this threshold. The favorite toy may simply be the false positive the statistics predicted would appear.
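To make "roughly one" concrete, here is a small sketch of the arithmetic. It assumes the 20 tests are independent and that every null hypothesis is true, so the number of false positives follows a binomial distribution. The function name `false_positive_distribution` is made up for illustration.

```python
from math import comb

def false_positive_distribution(m, alpha):
    """P(exactly k false positives) for k = 0..m.

    Assumes m independent tests in which every null hypothesis is true,
    so each test is a false positive with probability alpha.
    """
    return [comb(m, k) * alpha**k * (1 - alpha) ** (m - k) for k in range(m + 1)]

probs = false_positive_distribution(m=20, alpha=0.05)
expected = sum(k * p for k, p in enumerate(probs))

print(f"expected false positives: {expected:.2f}")
print(f"P(no significant toys):   {probs[0]:.3f}")
print(f"P(exactly one):           {probs[1]:.3f}")
print(f"P(two or more):           {1 - probs[0] - probs[1]:.3f}")
```

Under these assumptions the three outcomes come out at roughly 0.36, 0.38 and 0.26, so a cat with no preferences at all produces at least one "favorite" more often than not.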

The Family-Wise Error Rate

When you run a single test at significance level $\alpha$, the probability of a false positive is $\alpha$, provided the null hypothesis is true. When you run $m$ independent tests at the same threshold, the probability of at least one false positive across the entire family of tests is:

$$\text{FWER} = 1 - (1 - \alpha)^m$$

For $\alpha = 0.05$ and $m = 20$:

$$\text{FWER} = 1 - (0.95)^{20} \approx 0.64$$

There is about a 64% chance of at least one false positive somewhere in those 20 tests, assuming all null hypotheses are true. The per-test error rate is 5%. The family-wise error rate is 64%. These are very different quantities, and confusing them produces unreliable inference.

The family-wise error rate grows quickly with the number of tests. At $m = 100$ with $\alpha = 0.05$, the expected number of false positives is five, and the FWER is about 0.994. At least one significant result is all but guaranteed.
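The growth is easy to tabulate. This sketch evaluates the FWER formula for a few values of $m$, under the same assumption of independent tests with all nulls true. The helper name `fwer` is illustrative.

```python
def fwer(m, alpha=0.05):
    """Probability of at least one false positive in m independent tests,
    assuming every null hypothesis is true."""
    return 1 - (1 - alpha) ** m

for m in (1, 5, 20, 100):
    expected_false_positives = m * 0.05
    print(f"m={m:>3}  expected false positives={expected_false_positives:.2f}  "
          f"FWER={fwer(m):.3f}")
```

Note the two columns answer different questions: the expected count grows linearly in $m$, while the FWER saturates toward 1.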

Bonferroni Correction

The simplest fix is the Bonferroni correction: divide the significance threshold by the number of tests, using $\alpha/m$ for each individual comparison.

For 20 tests at a target FWER of 0.05, each test requires $p < 0.0025$ to be declared significant. The corrected threshold holds the FWER at or below $\alpha$ regardless of how many tests are run.

The Bonferroni correction is conservative. It does not require the tests to be independent: the bound holds under any dependence between them. The cost is that when tests are positively correlated (as they often are when the same subjects or measures are involved), the correction overcorrects and loses statistical power. Other procedures improve on this. Holm-Bonferroni is never less powerful and needs no extra assumptions, and Šidák is slightly less strict when the tests are independent. The basic Bonferroni remains widely understood and defensible.
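A minimal sketch of Bonferroni next to the Holm-Bonferroni step-down variant, using only the standard library. The p-values are made up for illustration, and the comparisons use `<=` as in the usual textbook statement (for continuous p-values this is equivalent to the strict inequality above).

```python
def bonferroni(pvals, alpha=0.05):
    """Reject each hypothesis whose p-value is at most alpha / m."""
    m = len(pvals)
    return [p <= alpha / m for p in pvals]

def holm(pvals, alpha=0.05):
    """Holm step-down: compare the smallest p-value to alpha/m, the next
    to alpha/(m-1), and so on, stopping at the first failure."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # indices, smallest p first
    reject = [False] * m
    for step, i in enumerate(order):
        if pvals[i] <= alpha / (m - step):
            reject[i] = True
        else:
            break  # every larger p-value is retained as well
    return reject

pvals = [0.2, 0.001, 0.035, 0.012, 0.013]  # hypothetical p-values, in test order

print(bonferroni(pvals))  # [False, True, False, False, False]
print(holm(pvals))        # [False, True, False, True, True]
```

Both procedures target the same FWER, but on this input Holm rejects three hypotheses where Bonferroni rejects one, because its threshold relaxes after each rejection.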

False Discovery Rate

Controlling the FWER guarantees that the probability of making even one false positive is at most $\alpha$. This is a strong criterion. In exploratory research involving hundreds or thousands of comparisons, it is often too strong: the required correction devastates power across the entire analysis.

The false discovery rate (FDR), introduced by Benjamini and Hochberg in 1995, offers a different target. Rather than bounding the probability of any false positive, it bounds the expected proportion of false positives among all significant findings. An FDR of 0.05 means that among the tests declared significant, roughly 5% are expected to be false positives.

The Benjamini-Hochberg procedure achieves FDR control by ranking p-values and applying a threshold that scales with rank. It is less conservative than Bonferroni and more suitable for large-scale exploratory analyses, such as genome-wide association studies or neuroimaging studies with thousands of voxels.
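The rank-scaled threshold can be sketched in a few lines. With $m$ p-values sorted in ascending order and a target FDR of $q$, the procedure finds the largest rank $k$ whose p-value is at most $(k/m)\,q$ and rejects every hypothesis ranked $k$ or better. The function name and the p-values below are illustrative, not taken from any particular library.

```python
def benjamini_hochberg(pvals, q=0.05):
    """Benjamini-Hochberg step-up procedure.

    Returns a list of booleans (in the original order of pvals) marking
    which hypotheses are rejected at target false discovery rate q.
    """
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # indices, smallest p first
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        if pvals[i] <= rank * q / m:
            cutoff_rank = rank  # keep the LARGEST rank that passes
    reject = [False] * m
    for i in order[:cutoff_rank]:
        reject[i] = True
    return reject

pvals = [0.2, 0.001, 0.035, 0.012, 0.013]  # hypothetical p-values, in test order

print(benjamini_hochberg(pvals))  # [False, True, True, True, True]
```

On this input the per-rank thresholds are 0.01, 0.02, 0.03, 0.04 and 0.05, so four hypotheses are rejected. A Bonferroni threshold of 0.01 on the same p-values would reject only one.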

Choosing between FWER and FDR control is a decision about what error rate you are willing to tolerate. Neither is universally correct.

P-hacking and HARKing

The multiple comparisons problem does not require a researcher to consciously test 20 things and pick the best result. It can arise through the accumulation of ordinary analytic decisions.

P-hacking refers to the practice of trying multiple analyses, subgroup selections, covariate adjustments, or outcome measures until a significant result emerges. Each decision is an informal test. The reported p-value reflects the final chosen analysis, not the full decision process. The effective number of comparisons made can be much larger than it appears.
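A rough simulation of that effect, under deliberately simple assumptions: there is no real effect, each analysis a researcher tries yields a p-value drawn uniformly from [0, 1], the analyses are independent, and only the smallest p-value is reported. Real analytic choices on one dataset are correlated, so the numbers here are an illustration rather than an estimate for any actual study. All names are hypothetical.

```python
import random

def reports_significant(n_analyses, alpha, rng):
    """Simulate one study with no true effect: try n_analyses analyses
    (independent uniform p-values, an assumption) and report the best."""
    best_p = min(rng.random() for _ in range(n_analyses))
    return best_p < alpha

def false_positive_rate(n_analyses, alpha=0.05, n_studies=20_000, seed=0):
    """Fraction of simulated null studies that end up reporting p < alpha."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    hits = sum(reports_significant(n_analyses, alpha, rng) for _ in range(n_studies))
    return hits / n_studies

for k in (1, 3, 5, 10):
    print(f"{k:>2} analyses tried -> reported false positive rate "
          f"{false_positive_rate(k):.3f}")
```

With one analysis the rate sits near the nominal 5%. With ten it is around 40%, even though each reported p-value looks like the result of a single test.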

HARKing stands for Hypothesizing After Results are Known: presenting a post-hoc hypothesis as if it were specified in advance. A researcher who tests 20 predictors and then writes the paper as if the one significant predictor was the primary hypothesis of interest has committed HARKing. The framing misleads readers about how much data dredging preceded the conclusion.

Both practices can occur without any intent to deceive. A researcher genuinely curious about their data will naturally explore it. The problem is that exploratory analysis looks identical to confirmatory analysis unless the difference is disclosed.

Pre-registration

The most straightforward structural solution is pre-registration: documenting the hypothesis, analysis plan, and number of comparisons before data collection begins, in a time-stamped record that cannot be altered later.

Pre-registration does not eliminate multiple comparisons. It makes them visible. If a pre-registered plan includes 20 tests, readers can see that and apply appropriate skepticism to any single significant result. If only five tests were specified in advance, a significant result among those five carries much more weight than a significant result from an unplanned post-hoc exploration.

The cat's toy study, pre-registered, would specify which toys are being tested and what multiple-comparisons correction will be applied. The researcher who finds one significant result under Bonferroni correction, in a pre-registered study, has produced a meaningful finding. The researcher who ran 20 unplanned tests and reported the most interesting one has not. The distinction is procedural, not mathematical, and it matters.
