Simpson's Paradox: When Subgroups Disagree With the Aggregate

Morgan Voss·May 7, 2026

A paradox in statistics usually isn't a logical contradiction. It's a result that is technically correct and deeply counterintuitive. Simpson's Paradox fits that description precisely: a trend can hold within every subgroup of a dataset and reverse in the combined data. Not occasionally. Reliably, when the subgroup structure has the right properties.

The Berkeley Example

In 1973, UC Berkeley's graduate admissions figures raised concerns about gender bias. The aggregate data seemed to support the concern: men were admitted at a higher rate than women. But when researchers looked at individual departments, most of them showed higher admission rates for women than for men. Both of these things were true simultaneously.

The resolution is not that the data was wrong. It's that women and men applied to different departments in very different proportions. Women disproportionately applied to departments with low admission rates, while men disproportionately applied to departments with higher baseline acceptance rates. The department structure was doing the work, not gender.

Aggregate your data without accounting for that structure, and you get a number that says nothing reliable about how applicants were actually treated.
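A minimal sketch of the same reversal, using two made-up departments. The counts below are invented for illustration and are not the real Berkeley figures; the department names and the helper functions `department_rates` and `aggregate_rates` are likewise hypothetical.

```python
# Hypothetical admissions counts, invented for illustration.
# These are NOT the real Berkeley figures.
# Layout: department -> gender -> (applicants, admitted)
applications = {
    "low_selectivity": {"men": (800, 480), "women": (200, 130)},
    "high_selectivity": {"men": (200, 40), "women": (800, 200)},
}


def department_rates(data):
    """Admission rate for each gender within each department."""
    return {
        dept: {gender: admitted / applied
               for gender, (applied, admitted) in by_gender.items()}
        for dept, by_gender in data.items()
    }


def aggregate_rates(data):
    """Admission rate for each gender with all departments pooled."""
    totals = {}
    for by_gender in data.values():
        for gender, (applied, admitted) in by_gender.items():
            prev_applied, prev_admitted = totals.get(gender, (0, 0))
            totals[gender] = (prev_applied + applied, prev_admitted + admitted)
    return {gender: admitted / applied
            for gender, (applied, admitted) in totals.items()}


if __name__ == "__main__":
    for dept, rates in department_rates(applications).items():
        print(dept, {g: round(r, 2) for g, r in rates.items()})
    pooled = aggregate_rates(applications)
    print("pooled", {g: round(r, 2) for g, r in pooled.items()})
```

With these invented counts, women are admitted at a higher rate in both departments (0.65 vs 0.60, and 0.25 vs 0.20), yet the pooled rates are 0.52 for men and 0.33 for women. Nothing changed except where each group applied.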

The Mechanism

The paradox arises from a lurking variable: one that is correlated with both the grouping variable and the outcome. In the Berkeley case, department selectivity was the lurking variable. It correlated with gender (women applied to harder departments) and with admission probability (harder departments admit fewer people). Strip it out of the analysis and the aggregate rate carries its influence invisibly.

This is the Yule-Simpson effect, named for George Udny Yule and Edward H. Simpson, who described it separately decades apart. The formal statement is straightforward. Suppose group A has a higher rate than group B within every subpopulation $S_1, S_2, \ldots, S_k$:

$P(Y \mid A, S_i) > P(Y \mid B, S_i) \quad \text{for all } i$

It is still possible that:

$P(Y \mid A) < P(Y \mid B)$

This happens when the subpopulations are weighted very differently between groups A and B. If group A is concentrated in subpopulations where $Y$ is rare, and group B in subpopulations where $Y$ is common, the aggregate rate for group A can fall below the aggregate rate for group B, even though A outperforms B within every individual subpopulation.
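The weighting can be written out with the law of total probability. Each aggregate rate is an average of the within-subpopulation rates, weighted by how that group is spread across the subpopulations:

$P(Y \mid A) = \sum_{i=1}^{k} P(Y \mid A, S_i)\, P(S_i \mid A)$

$P(Y \mid B) = \sum_{i=1}^{k} P(Y \mid B, S_i)\, P(S_i \mid B)$

If the weights matched, with $P(S_i \mid A) = P(S_i \mid B)$ for every $i$, the term-by-term inequality would carry through to the sums and no reversal could occur. The reversal needs the weights to differ. As a purely illustrative case with invented numbers, take $k = 2$, with A ahead in both subpopulations (0.65 vs 0.60 in $S_1$, 0.25 vs 0.20 in $S_2$), but with A spread 20/80 across $S_1$ and $S_2$ while B is spread 80/20:

$P(Y \mid A) = 0.65 \cdot 0.2 + 0.25 \cdot 0.8 = 0.33$

$P(Y \mid B) = 0.60 \cdot 0.8 + 0.20 \cdot 0.2 = 0.52$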

A Cat Food Illustration

Suppose a wet food study records body weights for both kittens and adult cats. Among kittens, those fed wet food weigh slightly less on average than those fed dry food. Among adults, the same pattern holds. Every subgroup says: wet food, lower weight.

The combined dataset says the opposite. Cats on wet food are heavier on average.

The reversal happens because adult cats, who weigh more for reasons having nothing to do with food, also tend to eat wet food more often. The sample of wet-food cats is dominated by adults. The sample of dry-food cats skews toward kittens. The age structure of the two feeding groups is so different that it overwhelms the within-group pattern.

The food is not causing the heavier weight. Age is correlated with both feeding type and body weight, and it is doing all the work. Ignore it and the combined data tells a backwards story.
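A small sketch of how the numbers could work out. The group sizes and mean weights are invented for illustration, and `groups` and `combined_mean` are hypothetical names, not part of any real study.

```python
# Hypothetical group sizes and mean weights (kg), invented for illustration.
# Layout: (age_group, food) -> (number_of_cats, mean_weight_kg)
groups = {
    ("kitten", "wet"): (10, 1.9),
    ("kitten", "dry"): (90, 2.0),
    ("adult", "wet"): (90, 4.4),
    ("adult", "dry"): (10, 4.5),
}


def combined_mean(food):
    """Mean weight of all cats on one food, pooling kittens and adults."""
    total_weight = 0.0
    total_cats = 0
    for (_age, group_food), (count, mean_kg) in groups.items():
        if group_food == food:
            total_weight += count * mean_kg
            total_cats += count
    return total_weight / total_cats


if __name__ == "__main__":
    for food in ("wet", "dry"):
        print(food, round(combined_mean(food), 2))
```

Within each age group the wet-food cats are 0.1 kg lighter, but the pooled means are 4.15 kg for wet food and 2.25 kg for dry food, because 90 of the 100 wet-food cats in this made-up sample are adults.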

Which Number Is Correct?

Both, technically. The aggregate and the within-group rates are each accurately computed. The question is which one answers the question you're actually asking.

This is where causal structure matters. If you want to know whether wet food itself affects weight, you need to compare cats who are otherwise similar: same age, same breed, same activity level. You need to condition on the confounders. The aggregate comparison doesn't do this. It reflects a mixture of causal effects and compositional artifacts.

If, on the other hand, you want to predict the average weight of a randomly selected wet-food cat from this particular population, the aggregate average is exactly right. The question determines which number is relevant.
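The two questions can be put side by side in code. The sketch below uses direct standardization, one common way to condition on a confounder: each food's age-specific means are averaged using the same age mix (here, the age mix of the whole sample). This is a technique chosen for illustration, not a method taken from any particular study, and it assumes age is the only confounder, which real data rarely guarantees. The data and all function names are invented.

```python
# Hypothetical data, invented for illustration.
# Layout: (age_group, food) -> (number_of_cats, mean_weight_kg)
groups = {
    ("kitten", "wet"): (10, 1.9),
    ("kitten", "dry"): (90, 2.0),
    ("adult", "wet"): (90, 4.4),
    ("adult", "dry"): (10, 4.5),
}


def crude_mean(food):
    """Pooled mean for one food: answers the prediction question."""
    rows = [(n, m) for (_age, f), (n, m) in groups.items() if f == food]
    return sum(n * m for n, m in rows) / sum(n for n, _ in rows)


def age_shares():
    """Share of each age group in the whole sample (both foods pooled)."""
    counts = {}
    for (age, _food), (n, _mean) in groups.items():
        counts[age] = counts.get(age, 0) + n
    total = sum(counts.values())
    return {age: n / total for age, n in counts.items()}


def standardized_mean(food):
    """Mean for one food if its cats had the whole sample's age mix."""
    shares = age_shares()
    return sum(share * groups[(age, food)][1] for age, share in shares.items())


crude_difference = crude_mean("wet") - crude_mean("dry")
adjusted_difference = standardized_mean("wet") - standardized_mean("dry")

if __name__ == "__main__":
    print("crude wet - dry:", round(crude_difference, 2))
    print("age-standardized wet - dry:", round(adjusted_difference, 2))
```

On this made-up data the crude difference is +1.90 kg, while the age-standardized difference is -0.10 kg, matching the within-group pattern. The crude figure is the right one for predicting the weight of a wet-food cat drawn from this sample; the standardized figure is closer to the question of what the food does.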

The Practical Lesson

Simpson's Paradox appears wherever data is aggregated across heterogeneous groups: medical studies, test score comparisons, hospital readmission rates. In each case, the relevant question is whether the variable that defines the subgroups is causally upstream of the comparison, downstream of it, or simply a nuisance variable.

The habit to develop is a simple one: before trusting an aggregate, ask what's being held constant. If groups differ in composition along a variable that also affects the outcome, the aggregate number may be carrying that composition more than the effect you care about.

The Berkeley case is more than 50 years old. The lesson it demonstrates has not expired.
