Hypothesis testing is one of those frameworks that feels intuitive until you try to apply it — and then each piece trips you up in a different way. In a self-study session covering frequentist inference from scratch, a student named Liam worked through the logic correctly in structured quizzes but, when asked to reason openly, fell into four distinct traps. Each one is different on the surface. Underneath, they all trace to the same root confusion.
The common mistakes
1. Treating the p-value as the probability the null hypothesis is true
The prompt was open-ended: an investigator sees p = 0.03 and tells their manager, "There's only a 3% chance the null hypothesis is true." Liam immediately flagged this as wrong — and he was right. But when asked to explain why, his phrasing exposed the boundary of his understanding: he said the p-value is "the probability of obtaining certain values when the null hypothesis is true," stopping just short of the full definition.
The trap: P(data | H₀) looks a lot like P(H₀ | data) if you're not being careful about which side of the conditional you're on. The p-value conditions on the null being true and asks how surprising your data would be. It says nothing about how probable the null is, given what you observed. Inverting that conditional — the "prosecutor's fallacy" — is a different calculation entirely, and one frequentist statistics doesn't perform.
The missing piece Liam's definition dropped: "or more extreme." The p-value is a tail probability, not the probability of a single observed outcome. That distinction matters when interpreting small p-values.
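To make the "or more extreme" clause concrete, here is a minimal sketch, using a hypothetical coin-flip example rather than anything from the session, that computes a one-sided p-value as a binomial tail sum:

```python
from math import comb

def binom_tail(n, k, p=0.5):
    """P(X >= k) for X ~ Binomial(n, p): the probability of k successes *or more extreme*."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# H0: the coin is fair. We observe 16 heads in 20 flips.
exact = comb(20, 16) * 0.5**20   # probability of exactly 16 heads -- NOT the p-value
p_value = binom_tail(20, 16)     # probability of 16, 17, 18, 19, or 20 heads -- the p-value
```

The single-outcome probability is strictly smaller than the tail sum, which is exactly the distinction Liam's definition dropped.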
2. Assuming the null hypothesis always means "no relationship"
Later in the session, after the tutor explained that the null hypothesis is a specific, testable reference point — not necessarily a zero-effect claim — the tutor posed a follow-up. A manufacturer claims their batteries last 500 hours. What's the null? Liam answered correctly: H₀ is that they last exactly 500 hours.
But earlier, when reasoning without scaffolding, he had stated it directly: "A null hypothesis implies no relationship exists between the variables." That framing is understandable — most textbook examples use zero-effect nulls, and the convention makes it feel like a rule.
It isn't. The null is whatever specific claim you're testing against. It could be "the mean is 500," "the two groups differ by exactly 3 points," or "the slope is 1.5." What makes it a null is that it's precise enough to calculate a probability distribution — not that it says nothing is happening.
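A non-zero null is tested the same way as any other. Here is a sketch of a one-sample z-test against H₀: μ = 500, with made-up battery lifetimes and an assumed known population standard deviation (both hypothetical, chosen only to illustrate the mechanics):

```python
import math
import statistics

def z_test(sample, mu0, sigma):
    """Two-sided z-test of H0: population mean == mu0, with known sd sigma."""
    n = len(sample)
    z = (statistics.fmean(sample) - mu0) / (sigma / math.sqrt(n))
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-sided normal tail probability
    return z, p_value

# Hypothetical battery lifetimes (hours); the null here is mu = 500, not "no effect".
lifetimes = [492, 505, 478, 499, 510, 484, 473, 495]
z, p = z_test(lifetimes, mu0=500.0, sigma=20.0)
```

Nothing in the calculation cares that 500 is not zero; the null just has to pin down a distribution to compute tail probabilities from.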
3. Inverting which way alpha moves error risk
This one was the most persistent. When asked to explain what happens when you lower alpha from 0.05 to 0.01, Liam wrote that you "increase the risk of Type I error." He also labeled a Type II error a "false positive," terminology that belongs to Type I.
The mechanics, laid out plainly: alpha is the maximum acceptable Type I error rate. Lowering alpha means tightening the threshold — you require more extreme data before rejecting the null. That makes it harder to get a false positive, so Type I risk falls. The tradeoff is that real effects become harder to detect: Type II risk rises.
The confusion here is directional rather than conceptual. Liam understood what the two errors were. He knew the tradeoff existed. What he had wrong was which error moves in which direction when you turn the dial. After the correction, he answered an open-ended probe cleanly: "You have to lower alpha. That increases Type II error." The directional mapping then held across the rest of the session.
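The direction of the tradeoff can be checked by simulation. This is an illustrative sketch, with function names and parameter values of my own choosing: estimate the Type I rate by testing samples drawn under a true null, and the Type II rate by testing samples drawn under a real effect, at two alpha levels.

```python
import math
import random

def p_value_z(xs, mu0, sigma):
    """Two-sided p-value from a z-test with known sigma."""
    z = (sum(xs) / len(xs) - mu0) / (sigma / math.sqrt(len(xs)))
    return math.erfc(abs(z) / math.sqrt(2))

def error_rates(alpha, trials=4000, n=20, sigma=1.0, effect=0.5):
    """Estimate (Type I, Type II) rates for a z-test at the given alpha."""
    rng = random.Random(0)
    # Null true (true mean is 0): any rejection is a false positive (Type I).
    type1 = sum(
        p_value_z([rng.gauss(0.0, sigma) for _ in range(n)], 0.0, sigma) < alpha
        for _ in range(trials)
    ) / trials
    # Real effect (true mean is `effect`): any non-rejection is a miss (Type II).
    type2 = sum(
        p_value_z([rng.gauss(effect, sigma) for _ in range(n)], 0.0, sigma) >= alpha
        for _ in range(trials)
    ) / trials
    return type1, type2

loose = error_rates(alpha=0.05)
strict = error_rates(alpha=0.01)
# Lowering alpha from 0.05 to 0.01: the Type I rate falls, the Type II rate rises.
```

The Type I estimate lands near the chosen alpha by construction, which is the sense in which alpha is a commitment rather than a measurement.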
4. Thinking a lower alpha makes your study more likely to be significant
Asked how he'd choose between α = 0.01 and α = 0.10 before running a study, Liam said he'd pick 0.01 because "it gives us a narrow range to reject the null, and our study is more likely to be significant."
This is a direct reversal. A lower alpha is a stricter criterion. To reject the null with α = 0.01, you need a p-value below 0.01. With α = 0.10, a p = 0.03 would still clear the bar. Lowering alpha makes significance harder to achieve, not easier. Studies using α = 0.01 fail to reject more often — not less.
The underlying model behind this mistake seems to be that "lower alpha = more precision = stronger result." That's partially true — a lower alpha does imply stronger evidence is required — but it conflates "more rigorous" with "more likely to detect an effect," which are opposites. Greater rigor means fewer rejections, not more.
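The decision rule itself is a single comparison, and writing it out makes the direction unmistakable. A sketch with hypothetical values:

```python
def decide(p_value, alpha):
    """Frequentist decision rule: compare the evidence (p) to the pre-committed threshold (alpha)."""
    return "reject H0" if p_value < alpha else "fail to reject H0"

# The same evidence, p = 0.03, under two thresholds:
decide(0.03, alpha=0.10)  # "reject H0" -- the looser criterion is cleared
decide(0.03, alpha=0.01)  # "fail to reject H0" -- the stricter one is not
```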
The actual mechanism
All four mistakes are variations on one confusion: treating alpha and the p-value as properties of the truth, rather than properties of the evidence threshold and the data.
When you misread the p-value as the probability the null is true, you're treating it as a statement about the world. When you assume the null always means "no relationship," you're importing a default into what is actually a design choice. When you invert the alpha-error direction, you're losing track of what alpha actually controls. And when you think lower alpha makes significance more likely, you're running the logic backward.
The framework, stated precisely:
- The null hypothesis is any specific, testable claim chosen as your reference. Zero-effect is a convention, not a definition.
- The p-value is the probability of observing data this extreme or more, assuming the null is true. It does not tell you how probable the null is.
- Alpha is the pre-specified maximum acceptable probability of a Type I error (rejecting a true null). It is not a precision setting or a detection amplifier.
- Lowering alpha raises the evidentiary bar. Type I risk falls, Type II risk rises. Raising alpha does the opposite.
These four concepts are part of a single system. The p-value is compared to alpha to make a decision. Alpha was set before the study to manage error risk. The null was specified before the study as the baseline claim. No part of the system evaluates how probable the null is — frequentist inference doesn't work that way.
How to remember it
Alpha is a commitment, not a detector. You set it before you see a single data point, and it represents the rate of false alarms you're willing to tolerate. The p-value is a measurement of evidence. Comparing them tells you whether your evidence crossed your pre-committed threshold — nothing more.
A useful contrast: lower alpha is like setting a stricter cutoff for an alarm to go off. Fewer alarms will trigger — including fewer false alarms, but also fewer real ones you might have caught.
Check yourself
A medical researcher is studying a new antibiotic. She sets α = 0.05, runs the trial, and gets p = 0.04. She rejects the null. Later, it turns out the antibiotic had no actual effect. A colleague says: "You should have used α = 0.01 — then your study would have been significant."
What's wrong with the colleague's statement?
A) Nothing — a lower alpha always increases the chance of detecting a real effect.
B) The colleague has the direction right, but α = 0.01 would not have changed this outcome since p = 0.04 is already low.
C) Lower alpha would have made rejection harder, not easier — p = 0.04 would not have crossed the α = 0.01 threshold.
D) The researcher made a Type II error, not a Type I error.
Correct answer: C.
With α = 0.01, the researcher's p = 0.04 would not have cleared the threshold, and she would not have rejected the null. The colleague's statement reverses the logic: lower alpha means a stricter criterion and fewer rejections. The error the researcher actually committed was a Type I error (rejecting a true null). A lower alpha would indeed have prevented it, but only because the study would not have reached significance at all, which is the opposite of what the colleague claimed.
Close the gap
The tutor who worked with Liam caught each of these inversions in the moment: when he said "more likely to be significant," when he called a missed effect a "false positive," and when his open-ended answer stopped one phrase short of the correct p-value definition. That real-time correction, before the wrong model has time to solidify, is exactly what Gradual Learning is built to do.