The thing about p-values…
Okay, so p-values…
I feel like I really struggle with p-values. I understand the math but it’s hard to really internalize what they mean.
This post is an attempt to provide some intuition about how to interpret p-values when you come across them.
Let’s start with some bad takes…
What are p-values NOT?
The Wikipedia article Misuse of p-values helpfully lists a number of common issues with p-values, which I paraphrase here.
The p-value is not the probability that the null hypothesis is true, or the probability that the alternative hypothesis is false.
The p-value is not the probability that the observed effects were produced by random chance alone.
There’s nothing magical about p = 0.05. Ceteris paribus, why should a result of p = 0.0499 be treated any differently from p = 0.0501?
The p-value does not indicate the size or importance of the observed effect.
American Statistical Association Statement on p-values
The ASA (helpfully, maybe begrudgingly) defined a p-value.
I’ve read the ASA Statement on p-values a few times now. I’m starting to understand it. I think. Their principles for p-values seem helpful.
P-values can indicate how incompatible the data are with a specified statistical model.
P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.
Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold.
Proper inference requires full reporting and transparency.
A p-value, or statistical significance, does not measure the size of an effect or the importance of a result.
By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis.
Intuition by Simulation
To get a better sense of this, I decided to play with a simulation. For this discussion let’s keep it very simple and talk only about the sample mean (no regressions).
We’ll start by creating a dataset that contains each integer from -5000 to 5000 once (-5000, -4999, -4998, and so on).
The mean of this population is just the simple average of all its values, which is exactly 0.
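Here’s a minimal sketch of that setup in R (the object name population is my own, not from the original post):

```r
# The "population": every integer from -5000 to 5000, each appearing once
population <- -5000:5000

mean(population)  # exactly 0, the true population mean
```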
Sample from the population and calculate summary statistics
So we have this population (every number from -5000 to 5000), and we’ll sample from it.
Let’s say we’re doing a phone survey to ask about net worth in $USD and call 1% of the target population (n = 100 people).
The mean is easy to calculate. It’s just the sum of all the values divided by the number of values in the sample. When I ran the simulation, I got a sample mean of 670.31.
Even in this “trivial” case, standard deviation gets complicated, fast. We don’t have the population, we only have a sample. So we compute the “corrected sample standard deviation,” which is shown below (and which, thankfully, R computes with the sd command):

s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}
In my simulation, I got a sample standard deviation of s = 2620.88.
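A rough sketch of that sampling step in R, continuing from the population object above (the seed and variable names are mine, and your numbers will differ from the 670.31 and 2620.88 quoted here):

```r
set.seed(2023)  # any seed; results will not match the post's numbers exactly

# Phone survey: draw 100 people (about 1% of the population) without replacement
survey <- sample(population, size = 100)

xbar <- mean(survey)  # sample mean
s    <- sd(survey)    # corrected sample standard deviation (divides by n - 1)
```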
Construct Confidence Interval
To construct the confidence interval, we go back to high school statistics and use the formula:

\bar{x} \pm z \cdot \frac{s}{\sqrt{n}}
Let’s construct a 95% confidence interval, so we need the relevant z-score, which is about 1.96. You can look it up in a table or pull it from your software.
When we construct our confidence interval from the simulation, we get a confidence interval of [156.6, 1189.3]. So we think the TRUE mean of the population is somewhere in that region (of course, we know that the TRUE mean is actually 0…more on that in a second).
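Continuing the sketch, the 95% confidence interval built from the sample above (exact endpoints depend on the random draw):

```r
z  <- qnorm(0.975)              # roughly 1.96 for a 95% interval
se <- s / sqrt(length(survey))  # standard error of the sample mean

c(lower = xbar - z * se, upper = xbar + z * se)
```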
Okay, but what about a p-value? We don’t have enough information to calculate a p-value yet. More on that in a minute… let’s do some sampling and construct CIs repeatedly first.
What if we ran the simulation many thousands of times?
We run the simulation over and over. Usually, we’ll get confidence intervals that “capture” the true mean (0), but sometimes, just by random chance, our confidence interval does not include the true mean.
The graph above is a histogram of the means, with bars colored in to indicate whether the confidence interval for that mean captured the true mean.
Let’s look at some individual confidence intervals. Remember, each line below represents a simulation where we drew 100 people from our population at random.
This aligns with our understanding of confidence intervals. About 95% of the time, the true population mean (0) was captured by the confidence interval.
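One way to run that experiment yourself, as a rough sketch (the 10,000 repetitions and the coverage check are my assumptions about how the original simulation was set up):

```r
n_sims <- 10000

captured <- replicate(n_sims, {
  smp <- sample(population, size = 100)
  se  <- sd(smp) / sqrt(100)
  ci  <- mean(smp) + c(-1, 1) * qnorm(0.975) * se
  ci[1] <= 0 && 0 <= ci[2]   # did this interval capture the true mean of 0?
})

mean(captured)  # should land close to 0.95
```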
But remember, in the real world we often only have one confidence interval. We can only run a clinical trial once, or survey our customers once. That’s where p-values come in.
Now we can look at p-values
On to p-values. We have to define the null hypothesis first. That’s where the “statistical model” the ASA talks about comes in. Usually, the null hypothesis is that whatever we are estimating is 0. In our case (since it’s a simulation and we generated the data), we know that the true population mean μ = 0, but usually you don’t know the true value. We don’t know the actual benefit of the new drug we’re testing.
• H0: μ = 0 (Null hypothesis is that the mean is 0)
• Ha: μ is not 0 (Alternate hypothesis is that the mean is NOT 0)
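Since the post builds its confidence intervals from a z-score, here is a matching sketch of the two-sided p-value for that test, using the sample drawn earlier (R’s built-in t.test would give a very similar answer via the t distribution):

```r
# z statistic for H0: mu = 0
z_stat  <- (xbar - 0) / (s / sqrt(length(survey)))
p_value <- 2 * pnorm(-abs(z_stat))  # two-sided p-value
p_value

# The more common R idiom, which uses the t distribution instead:
# t.test(survey, mu = 0)
```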
Let’s look at the p-values from those 10,000 simulations we ran. A histogram plotting the p-values and how often each one came up is shown below.
This one threw me for a loop at first. Why are they uniformly distributed? Well, it’s actually what we’d expect. The p-value is constructed so that, when the null hypothesis is true, you get p ≤ 0.05 about 5% of the time, p ≤ 0.10 about 10% of the time, and so on. That is exactly what a uniform distribution looks like. Under the null hypothesis, the p-values from repeated trials will be uniformly distributed.
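A quick way to see that flat histogram for yourself, again as a sketch of what the simulation might look like:

```r
# p-values from repeated samples when the null hypothesis is TRUE (true mean = 0)
p_null <- replicate(10000, {
  smp <- sample(population, size = 100)
  z   <- mean(smp) / (sd(smp) / sqrt(100))
  2 * pnorm(-abs(z))
})

hist(p_null)  # roughly flat: each bin gets about the same share of p-values
```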
But suppose the null hypothesis is FALSE. Suppose our population actually had a true mean μ = 500 and I drew from it at random.
What happens to our p-values now? I would usually find a sample mean greater than zero. The new histogram is shown below. Because we are drawing from a population where the true mean isn’t 0, we usually find very small p-values.
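And the same exercise with a population whose true mean is 500 (shifting every value by 500 is my own shortcut for generating such a population):

```r
shifted <- population + 500  # now the true population mean is 500

p_alt <- replicate(10000, {
  smp <- sample(shifted, size = 100)
  z   <- mean(smp) / (sd(smp) / sqrt(100))  # still testing H0: mu = 0
  2 * pnorm(-abs(z))
})

hist(p_alt)  # many more small p-values than under the null
```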
So assuming we found one of those very small p-values, we could say something like “we reject the null hypothesis.”
Look at the y-axes on Figures 4 and 5. There are about 20 hits in each bin in Figure 4, but 225 in the bin with the smallest p-values in Figure 5! Wow. So if we found a very small p-value, like 0.01, which distribution did it most likely come from? Probably the second one.
That’s basically the intuition behind using a p-value. But remember, you could get the same p-value from either distribution.
P-Values can’t tell you if your effect matters
Just looking at the p-value isn’t enough. Does the effect you found actually mean anything in the real world? The p-value is just one piece of the picture.
References
Imbens, Guido W. 2021. “Statistical Significance, p-Values, and the Reporting of Uncertainty.” Journal of Economic Perspectives 35 (3): 157–74. https://doi.org/10.1257/jep.35.3.157.
Wasserstein, Ronald L., and Nicole A. Lazar. 2016. “The ASA Statement on p-Values: Context, Process, and Purpose.” The American Statistician 70 (2): 129–33. https://doi.org/10.1080/00031305.2016.1154108.