P-Hacking: Why p < 0.05 Lies When You Run Many Tests

The most common reading: "I tested 20 things, one came back p = 0.03, so that one is real." The logic feels tight — p < 0.05 means less than a 5% chance of a false alarm, and 0.03 is well below the line.

The problem is that 5% is the per-test false-positive rate, not the per-study rate. When you run many tests, those small risks compound: at least one "significant" result becomes likely with 20 tests and nearly certain with enough of them, even in a dataset of pure noise.

Why the mistake is the natural reading

The alpha threshold is introduced in the context of a single test: "if the null is true, you will wrongly reject it only 5% of the time." That framing is correct for one test.

But most real analyses don't stop at one test. You check 20 biomarkers, or 15 demographic subgroups, or 10 different outcome variables. Nothing in the p-value output tells you how many other tests you ran alongside it. A p-value of 0.03 looks identical whether it came from the only test you ran or from test number 18 in a batch of 20.

The readout gives you no warning. The significance star is the same either way.

The actual mechanism

Each test at alpha = 0.05 has a 95% chance of not producing a false positive when the null is true. Run two independent tests, and the probability that neither produces a false positive is 0.95 × 0.95 = 0.9025. Run twenty independent tests, and that probability drops to 0.95^20 ≈ 0.358.

Which means the probability of at least one false positive across those twenty tests is:

1 − 0.358 = 0.642

Roughly a 64% chance of finding something that looks significant, in a dataset where nothing is actually going on.

This is the family-wise error rate (FWER) — the probability of making at least one Type I error across a family of tests. The per-test alpha controls only the individual test's error rate, not the family's.
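The calculation generalizes to any number of tests. A minimal sketch, assuming the tests are independent and every null hypothesis is true (the function name is illustrative, not taken from any library):

```python
def family_wise_error_rate(alpha: float, num_tests: int) -> float:
    """Probability of at least one false positive across independent tests,
    assuming every null hypothesis is true."""
    return 1 - (1 - alpha) ** num_tests


# Print the family-wise rate for a few family sizes at alpha = 0.05.
for m in (1, 2, 10, 20):
    print(f"{m:>2} tests: FWER = {family_wise_error_rate(0.05, m):.3f}")
```

With one test the function returns alpha itself; with 20 tests it returns roughly 0.64, matching the figure above. If the tests are correlated, this formula no longer holds exactly.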

P-hacking is what happens when this dynamic is exploited (sometimes unconsciously): try enough comparisons, variable transformations, or subgroup cuts, and a p < 0.05 becomes all but certain to appear by chance alone. The researcher then reports only that one result, and the reader sees a clean, "significant" finding with no sign of the tests that were run and discarded.

Worked numeric example

Suppose a team tests whether a new supplement affects 20 independent health markers — cholesterol, blood pressure, resting heart rate, and so on. The supplement is actually inert: every null hypothesis is true.

They test each marker at alpha = 0.05.

Probability each test does NOT produce a false positive: 0.95
Probability all 20 tests avoid false positives: 0.95^20 ≈ 0.358
Probability at least one test produces p < 0.05: 1 − 0.358 ≈ 0.642

So there is about a 64% chance the team finds at least one "significant" result and publishes a supplement effect that does not exist.
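A quick simulation can serve as a check on this arithmetic. The sketch below does not model real marker data. It relies on the assumption that, when the null is true and the test statistic is continuous, each p-value is uniformly distributed between 0 and 1, so it draws p-values directly. The function name, number of simulated studies, and seed are arbitrary choices for illustration.

```python
import random


def simulate_null_study(num_tests=20, alpha=0.05, num_studies=20_000, seed=0):
    """Fraction of simulated studies with at least one p < alpha.

    Assumption: every null is true and the tests are independent, so each
    p-value is an independent uniform draw on [0, 1].
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(num_studies):
        p_values = [rng.random() for _ in range(num_tests)]
        # The study "finds something" if its smallest p-value clears alpha.
        if min(p_values) < alpha:
            hits += 1
    return hits / num_studies


estimate = simulate_null_study()
print(f"Studies with at least one 'significant' marker: {estimate:.3f}")
```

The estimate should land close to the 0.642 computed above, with small run-to-run variation if the seed is changed.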

If they had pre-registered a single primary outcome before collecting data, the false-positive rate for that primary analysis would stay at 5%. The tests they ran on everything else do not inflate a pre-specified primary analysis, because for that outcome the fishing expedition never happened. Those other tests are exploratory.

Bonferroni correction is the bluntest fix: divide alpha by the number of tests. For 20 tests at a desired FWER of 0.05, the per-test threshold becomes 0.05 / 20 = 0.0025, so a result needs p < 0.0025 to clear the bar. Other corrections, such as Benjamini-Hochberg, control the false discovery rate (the expected share of false positives among the results declared significant) rather than the FWER, and are more powerful when the number of tests is large. The core idea is the same: the threshold must account for how many comparisons were made.
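Both corrections are short enough to sketch by hand. The code below is a plain-Python illustration, not a library API: the function names are hypothetical, and the 20 p-values are made up for the example (they include a 0.03 like the one in the opening scenario).

```python
def bonferroni_reject(p_values, alpha=0.05):
    """Reject only where p is below alpha divided by the number of tests."""
    threshold = alpha / len(p_values)
    return [p < threshold for p in p_values]


def benjamini_hochberg_reject(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure: find the largest rank k whose
    sorted p-value satisfies p_(k) <= (k / m) * alpha, then reject the k
    smallest p-values."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    max_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            max_rank = rank
    reject = [False] * m
    for idx in order[:max_rank]:
        reject[idx] = True
    return reject


# Hypothetical p-values from 20 tests, for illustration only.
p_values = [0.001, 0.002, 0.006, 0.03, 0.08, 0.12, 0.2, 0.25, 0.3, 0.35,
            0.4, 0.45, 0.5, 0.55, 0.6, 0.7, 0.75, 0.8, 0.9, 0.95]

print("Uncorrected:       ", sum(p < 0.05 for p in p_values))
print("Bonferroni:        ", sum(bonferroni_reject(p_values)))
print("Benjamini-Hochberg:", sum(benjamini_hochberg_reject(p_values)))
```

On these made-up values, four results pass the uncorrected 0.05 threshold, two survive Bonferroni, and three survive Benjamini-Hochberg. The p = 0.03 result fails both corrections.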

How to internalize it

Before you interpret any p-value, ask: "How many tests did I (or the authors) run to get here?" If the answer is more than one and no correction was applied, the significance threshold you're reading against is too lenient.

A pre-registered primary endpoint is not bureaucratic overhead; it is what makes the 5% threshold honest. Everything tested after the fact is exploratory, regardless of what the p-value says.

A useful sanity check: if you kept testing subgroups or variables until something hit p < 0.05, you were doing the statistical equivalent of flipping a coin until you got heads and then announcing that the coin always lands heads.

This is the same family of error as crediting a treatment when the group was selected for an extreme score: in both cases the analysis manufactures a result that looks real, and only a properly designed comparison can tell the artifact from the signal.

For a closer look at how the per-test threshold connects to what alpha actually measures in a single test, the piece on common hypothesis-testing mistakes walks through the p-value and alpha relationship in detail.

Check yourself

A researcher runs 10 independent significance tests at alpha = 0.05 on a dataset where every null hypothesis is true. What is the approximate probability that at least one test returns p < 0.05?

A. 5% — that is what alpha guarantees. B. 10% — each test adds 1 percentage point. C. 40% — the error rate compounds across tests. D. 50% — about half the tests will cross the threshold.

Correct answer: C.

The probability of no false positives across 10 tests is 0.95^10 ≈ 0.599, so the probability of at least one false positive is 1 − 0.599 ≈ 0.401, or roughly 40%. Alpha sets the per-test rate, not the family-wise rate — and the family-wise rate grows quickly once you run more than a handful of tests.

Close the gap

The multiple-comparisons problem is invisible in a single output table. You see a p-value; you don't see how many other p-values were computed and quietly set aside. That gap between what the number says and what the analysis actually did is where many replication failures live.

A tutor that works through analysis decisions with you in real time — before you lock in an interpretation — can surface the question you didn't think to ask: "How many things did you test to get here?" That is the catch that changes the conclusion.
