The thing about p-values…
Okay, so p-values…
I feel like I really struggle with p-values. I understand the math but it’s hard to really internalize what they mean.
This post is an attempt to provide some intuition about how to interpret p-values when you come across them.
Let’s start with some bad takes…
What are p-values NOT?
The Wikipedia article Misuse of p-values helpfully lists a number of common issues with p-values, which I paraphrase here.
The p-value is not the probability that the null hypothesis is true, or the probability that the alternative hypothesis is false.
The p-value is not the probability that the observed effects were produced by random chance alone.
There’s nothing magical about p = 0.05. Ceteris paribus, why should a result of p = 0.0499 be treated any differently from p = 0.0501?
The p-value does not indicate the size or importance of the observed effect.
American Statistical Association Statement on p-values
The ASA (helpfully, maybe begrudgingly) defined a p-value.
I’ve read the ASA Statement on p-values a few times now. I’m starting to understand it. I think. Their principles for p-values seem helpful.
P-values can indicate how incompatible the data are with a specified statistical model.
P-values do not measure the probability that the studied hypothesis is true, or the probability that the data were produced by random chance alone.
Scientific conclusions and business or policy decisions should not be based only on whether a p-value passes a specific threshold.
Proper inference requires full reporting and transparency.
A p-value, or statistical significance, does not measure the size of an effect or the importance of a result.
By itself, a p-value does not provide a good measure of evidence regarding a model or hypothesis.
Intuition by Simulation
To get a better sense of this, I decided to play with a simulation. For this discussion let’s keep it very simple and talk only about the sample mean (no regressions).
We’ll start by creating a dataset that contains each integer from -5000 to 5000 once (-5000, -4999, -4998, and so on).
The mean of this population is just the simple average of all its values, which is exactly 0.
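Here’s a minimal sketch of that setup in R (the object name population is my own, not from the original post):

```r
# The "population": every integer from -5000 to 5000, each appearing once
population <- -5000:5000

mean(population)  # exactly 0, the true population mean
```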
Sample from the population and calculate summary statistics
So we have this population (every number from -5000 to 5000), and we’ll sample from it.
Let’s say we’re doing a phone survey to ask about net worth in $USD and call 1% of the target population (n = 100 people).
The mean is easy to calculate. It’s just the sum of all the values divided by the number of values in the sample. When I ran the simulation, I got a sample mean of 670.31.
Even in this “trivial” case, standard deviation gets complicated, fast. We don’t have the population, we only have a sample. So we compute the “corrected sample standard deviation,” which is shown below (and which, thankfully, R computes with the sd command):

s = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (x_i - \bar{x})^2}
In my simulation, I got a sample standard deviation of s = 2620.88.
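A rough sketch of that sampling step in R, continuing from the population object above (the seed and variable names are mine, and your numbers will differ from the 670.31 and 2620.88 quoted here):

```r
set.seed(2023)  # any seed; results will not match the post's numbers exactly

# Phone survey: draw 100 people (about 1% of the population) without replacement
survey <- sample(population, size = 100)

xbar <- mean(survey)  # sample mean
s    <- sd(survey)    # corrected sample standard deviation (divides by n - 1)
```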
Construct Confidence Interval
To construct the confidence interval, we go back to high school statistics and use the formula:

\bar{x} \pm z \cdot \frac{s}{\sqrt{n}}
Let’s construct a 95% confidence interval, so we need the relevant z-score, which is about 1.96. You can look it up in a table or pull it from your software.
When we construct our confidence interval from the simulation, we get a confidence interval of [156.6, 1189.3]. So we think the TRUE mean of the population is somewhere in that region (of course, we know that the TRUE mean is actually 0…more on that in a second).
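Continuing the sketch, the 95% confidence interval built from the sample above (exact endpoints depend on the random draw):

```r
z  <- qnorm(0.975)              # roughly 1.96 for a 95% interval
se <- s / sqrt(length(survey))  # standard error of the sample mean

c(lower = xbar - z * se, upper = xbar + z * se)
```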
Okay, but what about a p-value? We don’t have enough information to calculate a p-value yet. More on that in a minute… let’s do some sampling and construct CIs repeatedly first.
What if we ran the simulation many thousands of times?
We run the simulation over and over. Usually, we’ll get confidence intervals that “capture” the true mean (0), but sometimes, just by random chance, our confidence interval does not include the true mean.
The graph above is a histogram of the means, with bars colored in to indicate whether the confidence interval for that mean captured the true mean.
Let’s look at some individual confidence intervals. Remember, each line below represents a simulation where we drew 100 people from our population at random.
This aligns with our understanding of confidence intervals. About 95% of the time, the true population mean (0) was captured by the confidence interval.
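One way to run that experiment yourself, as a rough sketch (the 10,000 repetitions and the coverage check are my assumptions about how the original simulation was set up):

```r
n_sims <- 10000

captured <- replicate(n_sims, {
  smp <- sample(population, size = 100)
  se  <- sd(smp) / sqrt(100)
  ci  <- mean(smp) + c(-1, 1) * qnorm(0.975) * se
  ci[1] <= 0 && 0 <= ci[2]   # did this interval capture the true mean of 0?
})

mean(captured)  # should land close to 0.95
```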
But remember, in the real world we often only have one confidence interval. We can only run a clinical trial once, or survey our customers once. That’s where p-values come in.
Now we can look at p-values
On to p-values. We have to define the null hypothesis first. That’s where the “statistical model” the ASA talks about comes in. Usually, the null hypothesis is that whatever we are estimating is 0. In our case (since it’s a simulation and we generated the data), we know that the true population mean μ = 0, but usually you don’t know the true value. We don’t know the actual benefit of the new drug we’re testing.
• H0: μ = 0 (Null hypothesis is that the mean is 0)
• Ha: μ is not 0 (Alternate hypothesis is that the mean is NOT 0)
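Since the post builds its confidence intervals from a z-score, here is a matching sketch of the two-sided p-value for that test, using the sample drawn earlier (R’s built-in t.test would give a very similar answer via the t distribution):

```r
# z statistic for H0: mu = 0
z_stat  <- (xbar - 0) / (s / sqrt(length(survey)))
p_value <- 2 * pnorm(-abs(z_stat))  # two-sided p-value
p_value

# The more common R idiom, which uses the t distribution instead:
# t.test(survey, mu = 0)
```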
Let’s look at the p-values from those 10,000 simulations we ran. A histogram plotting the p-values and how often each one came up is shown below.
This one threw me for a loop at first. Why are they uniformly distributed? Well, it’s actually what we’d expect. The p-value is constructed so that, when the null hypothesis is true, you get p ≤ 0.05 about 5% of the time, p ≤ 0.10 about 10% of the time, and so on. That is exactly what a uniform distribution looks like. Under the null hypothesis, the p-values from repeated trials will be uniformly distributed.
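A quick way to see that flat histogram for yourself, again as a sketch of what the simulation might look like:

```r
# p-values from repeated samples when the null hypothesis is TRUE (true mean = 0)
p_null <- replicate(10000, {
  smp <- sample(population, size = 100)
  z   <- mean(smp) / (sd(smp) / sqrt(100))
  2 * pnorm(-abs(z))
})

hist(p_null)  # roughly flat: each bin gets about the same share of p-values
```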
But suppose the null hypothesis is FALSE. Suppose our population actually had a true mean μ = 500 and I drew from it at random.
What happens to our p-values now? I would usually find a sample mean greater than zero. The new histogram is shown below. Because we are drawing from a population where the true mean isn’t 0, we usually find very small p-values.
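And the same exercise with a population whose true mean is 500 (shifting every value by 500 is my own shortcut for generating such a population):

```r
shifted <- population + 500  # now the true population mean is 500

p_alt <- replicate(10000, {
  smp <- sample(shifted, size = 100)
  z   <- mean(smp) / (sd(smp) / sqrt(100))  # still testing H0: mu = 0
  2 * pnorm(-abs(z))
})

hist(p_alt)  # many more small p-values than under the null
```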
So assuming we found one of those very small p-values, we could say something like “we reject the null hypothesis.”
Look at the y-axes on Figures 4 and 5. There are about 20 hits in each bin in Figure 4, but 225 in the bin with the smallest p-values in Figure 5! Wow. So if we found a very small p-value, like 0.01, which distribution did it most likely come from? Probably the second one.
That’s basically the intuition behind using a p-value. But remember, you could get the same p-value from either distribution.
P-Values can’t tell you if your effect matters
Just looking at the p-value isn’t enough. Does the effect you found actually mean anything in the real world? The p-value is just one piece of the picture.
References
Imbens, Guido W. 2021. “Statistical Significance, p-Values, and the Reporting of Uncertainty.” Journal of Economic Perspectives 35 (3): 157–74. https://doi.org/10.1257/jep.35.3.157.
Wasserstein, Ronald L., and Nicole A. Lazar. 2016. “The ASA Statement on p-Values: Context, Process, and Purpose.” The American Statistician 70 (2): 129–33. https://doi.org/10.1080/00031305.2016.1154108.