
What a p-value Actually Measures

Morgan Voss

A cat appears in the kitchen, reliably, about 30 seconds before the electric can opener sounds. Not after. Before. You have not touched the can opener. She arrives, sits, and waits.

This happens consistently enough that you start keeping notes. You run an informal test. The question is whether her timing is random, or whether she is responding to something you cannot hear or detect. The p-value answers one very specific version of that question: if her timing were completely random, how often would you observe behavior at least this precise?

It does not tell you whether she is psychic. It does not say how psychic she might be. It does not tell you the probability that her timing is random. It tells you how surprising the data is under the assumption that nothing is going on.

The Formal Definition

A p-value is the probability of observing a test statistic at least as extreme as the one computed from the data, assuming the null hypothesis is true:

$$p = P(T \geq t_{\text{obs}} \mid H_0)$$

Here $T$ is the test statistic (which could be a $t$-statistic, a $z$-score, a chi-squared value, or something else depending on the test), $t_{\text{obs}}$ is the value computed from the observed data, and $H_0$ is the null hypothesis.

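The phrase "at least as extreme" means in the direction specified by the alternative hypothesis. The formula above is written for an upper-tail test. A two-sided test counts both tails, which for a symmetric null distribution is $P(|T| \geq |t_{\text{obs}}| \mid H_0)$; a one-sided test counts only the relevant tail.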
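To see how the choice of tail changes the number, here is a minimal sketch for the simplest case: a $z$-statistic whose null distribution is standard normal. The function name `z_p_value` and its `alternative` argument are illustrative choices, not part of any particular library.

```python
import math


def z_p_value(z: float, alternative: str = "two-sided") -> float:
    """p-value for a z statistic, assuming a standard normal null distribution."""
    upper = 0.5 * math.erfc(z / math.sqrt(2))   # P(Z >= z)
    lower = 0.5 * math.erfc(-z / math.sqrt(2))  # P(Z <= z)
    if alternative == "greater":
        return upper
    if alternative == "less":
        return lower
    if alternative == "two-sided":
        # Both tails: double the smaller one-sided tail, capped at 1.
        return min(1.0, 2.0 * min(upper, lower))
    raise ValueError("alternative must be 'greater', 'less', or 'two-sided'")


print(z_p_value(1.96, "greater"))    # upper tail only
print(z_p_value(1.96, "two-sided"))  # both tails
```

The same observed statistic gives a two-sided p-value twice as large as the one-sided one, which is why the alternative has to be fixed before the data are seen.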

What the Null Hypothesis Framework Requires

The p-value is defined relative to a null hypothesis. The null is a specific claim about the world that the test is designed to evaluate. A typical null might be that two groups have the same mean, that a proportion equals some fixed value, or that a treatment has no effect.

The null hypothesis framework requires specifying $H_0$ before looking at the data. The p-value calculation assumes $H_0$ is true and asks: given that assumption, how probable is this outcome?

If $H_0$ is "the cat's timing is random," the test asks what distribution of timing deviations we would expect under pure randomness, and then locates the observed deviation within that distribution. A small p-value means the data falls in the tail of that distribution. It does not mean $H_0$ is false.
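One way to make this concrete is to simulate the null directly. The sketch below has to assume a specific version of "random": arrivals are uniform over a ten-minute window, and precision is measured by the standard deviation of the arrival times (smaller means more precise). The arrival times, the window length, and the function name `simulated_p_value` are all invented for illustration.

```python
import random
import statistics


def simulated_p_value(observed, window=600.0, n_sims=5_000, seed=0):
    """Monte Carlo p-value: how often do uniform-random arrivals
    cluster at least as tightly as the observed ones?"""
    rng = random.Random(seed)
    t_obs = statistics.pstdev(observed)  # observed spread, in seconds
    n = len(observed)
    hits = 0
    for _ in range(n_sims):
        # One dataset generated under the assumed null: uniform arrivals.
        sample = [rng.uniform(0.0, window) for _ in range(n)]
        if statistics.pstdev(sample) <= t_obs:  # "at least as precise"
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_sims + 1)


# Hypothetical notes: seconds between her arrival and the can opener.
observed_offsets = [28.0, 31.5, 29.0, 33.0, 27.5, 30.0, 32.0, 29.5]
print(simulated_p_value(observed_offsets))
```

A different definition of "random" (a different window, or a different measure of precision) would give a different null distribution and a different p-value. The number is always relative to the null that was actually specified.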

The Three Most Common Misreadings

The p-value is not the probability that the null hypothesis is true. That probability, $P(H_0 \mid \text{data})$, is a posterior probability. Computing it requires a prior probability on $H_0$ and the application of Bayes' theorem. The p-value $P(T \geq t_{\text{obs}} \mid H_0)$ conditions on $H_0$ being true. These are not the same quantity and not algebraically interchangeable.

Observing a p-value of 0.03 does not mean there is a 3% chance the null is true. It means that, under the null, data as extreme as what was observed would occur about 3% of the time.
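A rough calculation shows how far apart the two quantities can be. The sketch below applies Bayes' theorem to the event "the test came out significant at level alpha," a simplification of conditioning on the exact p-value. The prior, the power, and the function name `posterior_null` are assumptions chosen for illustration, not estimates for any real field.

```python
def posterior_null(prior_null: float, alpha: float, power: float) -> float:
    """P(H0 | significant result), by Bayes' theorem.

    prior_null: assumed prior probability that H0 is true
    alpha:      P(significant | H0 true), the significance level
    power:      P(significant | H0 false)
    """
    p_significant = alpha * prior_null + power * (1.0 - prior_null)
    return alpha * prior_null / p_significant


# Assumed scenario: 90% of tested hypotheses are true nulls, 80% power.
print(posterior_null(prior_null=0.9, alpha=0.05, power=0.8))
```

Under these assumed inputs the posterior probability of the null after a significant result is about 0.36, not 0.05. Change the prior and the answer changes, which is exactly the information the p-value does not carry.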

The p-value does not measure effect size. A very small p-value can accompany a trivially small effect, particularly in large samples. With a large enough dataset, almost any departure from the null will be statistically detectable. Whether the detected departure matters practically is a separate question that the p-value cannot answer. The effect size measures practical significance. The p-value measures compatibility with the null.
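The sample-size point is easy to check numerically. This sketch assumes the simplest possible setting, a one-sample $z$-test with known standard deviation, where the statistic is the effect (in standard-deviation units) times $\sqrt{n}$. The effect of 0.01 standard deviations and the function name `two_sided_p` are illustrative choices.

```python
import math


def two_sided_p(effect_in_sd: float, n: int) -> float:
    """Two-sided p-value for a one-sample z-test with known SD,
    when the sample mean sits exactly effect_in_sd away from the null."""
    z = effect_in_sd * math.sqrt(n)
    return math.erfc(abs(z) / math.sqrt(2))  # equals 2 * P(Z >= |z|)


# The same tiny effect, observed at three sample sizes.
for n in (100, 10_000, 1_000_000):
    print(n, two_sided_p(0.01, n))
```

The effect never changes, yet the p-value falls from roughly 0.92 at a hundred observations to far below any conventional threshold at a million. The p-value is tracking the sample size as much as the effect.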

A result below 0.05 is not confirmed to be real. The 0.05 threshold is a convention, not a natural law. It was Ronald Fisher's pragmatic suggestion in the 1920s, and it has been treated with a reverence Fisher himself probably did not intend. A p-value of 0.049 and a p-value of 0.051 are not meaningfully different. Treating the threshold as a binary verdict produces a literature littered with noise.

The Threshold Is Arbitrary

Fisher proposed 0.05 as a rough guide: if data would occur by chance fewer than one time in 20, that is reasonable grounds for further investigation. He did not propose it as a universal criterion for scientific truth.

Different fields use different thresholds. Particle physics requires roughly $p < 3 \times 10^{-7}$ (five sigma, one-sided) before claiming a discovery, because the prior probability of novel particles is low and the cost of false positives is high. Medical research has generally used 0.05, with consequences that have recently attracted significant scrutiny.
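The correspondence between "sigmas" and p-values is just the upper tail of a standard normal distribution. A short sketch, assuming the one-sided convention (the helper name `sigma_to_p` is made up here):

```python
import math


def sigma_to_p(n_sigma: float) -> float:
    """One-sided tail probability beyond n_sigma standard deviations
    of a standard normal distribution."""
    return 0.5 * math.erfc(n_sigma / math.sqrt(2))


for n_sigma in (2, 3, 5):
    print(n_sigma, sigma_to_p(n_sigma))
```

Five sigma comes out at about $2.9 \times 10^{-7}$, which is where the physics threshold comes from.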

The appropriate threshold depends on the cost of false positives, the cost of false negatives, and the prior plausibility of the hypothesis. The p-value does not contain that information. It measures one thing: how surprising the data is under a specific null.

What the p-value Does Well

The p-value is a coherent, well-defined summary of one aspect of the data. It places the observed result within the distribution of results that would occur if the null were true. That is useful information. The problem is not the statistic itself. It is the burden placed on it.

Whether the cat is responding to ultrasonic frequencies from the can opener's motor, to the change in your posture when you decide to feed her, or to nothing at all requires more than a p-value. It requires understanding the mechanism, the effect size, and the reliability of the result across replications. The p-value might be where that investigation starts. It should not be where it ends.
