Okay, I’m taking a quick break from my regular posts because I can’t stop geeking out about this recent XKCD comic:
If you already know about clickbait (likely) and p-values (less likely), you may not need the joke explained to you. But there’s something deep going on here too, tying into the themes of probability and perception I’ve been exploring on this blog. So if you’ll let me explain what p-values are, and why you might want to adjust them for clickbait reasons, I’ll show you what this clickbait-corrected p-value is really doing.
A crash course in p-values
If you’re trying to decide whether the world is one way or another way (say, whether eating chocolate makes you a better athlete), you can try to find out using what’s called “hypothesis testing”:
- First you make two hypotheses: either chocolate does boost performance or it doesn’t. The boring hypothesis that says there’s no relationship is called the “null” hypothesis, denoted H0. The other possibility is the “alternative” hypothesis, H1.
- You think up some experiment that you hope would show evidence of the alternative hypothesis, if it were true. For example, you could time a sprinter, feed them chocolate, and time them again to see if they run faster.
- Then you do the experiment and collect the data—maybe your sprinter ran 10% faster after eating chocolate.
- Finally, the hard part: you calculate the probability that you would have seen at least that big an effect if the null hypothesis were true, e.g. “How likely is it that the sprinter would have run at least 10% faster the second time if the difference wasn’t due to eating chocolate?”
This probability is very hard to calculate! It depends on how much the sprinter’s speed fluctuates from run to run, whether they often run faster after warming up, whether the speed increase was just due to the calories in the chocolate rather than its chocolatey-ness, and more. Designing a good experiment is all about collecting the right kind of data to isolate exactly the one change you’re interested in: measuring several runs to get a sense of the normal speed fluctuations, say, or comparing your sprinter’s times to those of a “control” runner who isn’t fed chocolate, or who is fed something similar.
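To make that last step concrete, here’s a minimal sketch in Python of one way to estimate such a probability by simulation. All the numbers (the baseline run times, the normality assumption) are invented for illustration; the idea is just to ask how often a 10% speedup would show up by chance, given the sprinter’s normal run-to-run fluctuations.

```python
import random

random.seed(0)

# Hypothetical baseline data: the sprinter's past run times in seconds,
# recorded on ordinary, chocolate-free days.
baseline_times = [12.0, 11.3, 12.8, 11.6, 12.4, 12.7, 11.5, 12.1]

mean_time = sum(baseline_times) / len(baseline_times)
variance = sum((t - mean_time) ** 2 for t in baseline_times) / (len(baseline_times) - 1)
std_dev = variance ** 0.5  # rough estimate of the run-to-run fluctuation

observed_speedup = 0.10  # the sprinter ran 10% faster after the chocolate

# Under the null hypothesis, the post-chocolate run is just another ordinary run.
# Simulate many such runs (assuming they're roughly normally distributed) and
# count how often a speedup at least this large appears purely by chance.
trials = 100_000
extreme = 0
for _ in range(trials):
    simulated_time = random.gauss(mean_time, std_dev)
    if (mean_time - simulated_time) / mean_time >= observed_speedup:
        extreme += 1

print(f"Estimated p-value: {extreme / trials:.4f}")
```

The exact number you get depends entirely on the made-up fluctuation data and the normality assumption, which is really the point of the paragraph above: a p-value is only as good as your model of what “no effect” looks like.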
Anyway, say you calculate that there was only a 3% chance your sprinter would have improved so much if it weren’t due to the chocolate. This probability is your experiment’s p-value, expressed as a decimal; in this case p=0.03. You usually compare p to some predetermined threshold (0.05, 0.01 and 0.001 are all pretty common thresholds, depending on how strict you want to be), and if p comes out smaller, you declare that the evidence you saw was too unlikely to have happened by pure chance, so the alternative hypothesis must be true! (Or, more formally, you “reject the null hypothesis.”) Then you can say “Chocolate significantly increases sprinting speed”—passing this p-value test lets you say “significantly.” Of course, there’s still the possibility that the null hypothesis really was true and you just got unlikely evidence.
On the other hand, if your p-value came out bigger than the threshold, it means your experiment couldn’t tell you which hypothesis is right—you “fail to reject the null hypothesis.” You can try again with another experiment if you want; often, observing the same results in a bigger experiment gives you a smaller p-value. (Seeing one sprinter get 10% faster after eating chocolate is a lot less convincing than seeing 100 sprinters get 10% faster, even just on average.)
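Here’s a quick illustration of that last point, again with invented numbers: if each sprinter’s percentage change between runs fluctuates by about 5% under the null hypothesis, the same 10% average speedup is mildly surprising for one sprinter and astronomically surprising for a hundred.

```python
import math

# Hypothetical assumption: under the null hypothesis, each sprinter's percentage
# change between runs is roughly normal with mean 0 and standard deviation 5%.
fluctuation = 0.05
observed_average_speedup = 0.10

def chance_of_average_speedup(n_sprinters):
    """One-sided probability that n sprinters average at least a 10% speedup
    purely by chance (normal approximation)."""
    std_of_average = fluctuation / math.sqrt(n_sprinters)
    z = observed_average_speedup / std_of_average
    return 0.5 * math.erfc(z / math.sqrt(2))

print("1 sprinter:    p ≈", chance_of_average_speedup(1))    # about 0.023
print("100 sprinters: p ≈", chance_of_average_speedup(100))  # vanishingly small
```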
Now, it’s easy to get slightly confused and think of our p-value of 0.03 as meaning that we’re now 97% sure that chocolate boosts performance. This isn’t right: p represents the probability of getting our evidence given the null hypothesis, not the probability of the null hypothesis given our evidence. But, to be doubly confusing, sometimes these are nearly the same! Here’s what you need for that to happen (and if you’re interested why, check the comments):
- The p-value itself should be small. This is not a big issue in practice—anything under 10% is good enough if you’re casual. But if the p-value is close to 100% instead of close to 0%, it’s not like you can deduce that the null hypothesis is true after all—you just didn’t collect the kind of data that could ever falsify it.
- Under the alternative hypothesis, the probability of observing evidence at least as extreme as what you got should be about 100%. Think about what would happen if you predicted that chocolate boosts performance and found that it actually seems to lower it: you might get a small p-value, but that doesn’t mean you were right.
- Prior to observing any evidence, the null and alternative hypotheses should both seem equally likely. This last criterion is rarely met; often we have some guess already and are trying to confirm it, or we’re trying to rule out something we think is unlikely anyway. Either way, we went in with expectations of what we’d find.
If all of these assumptions hold, then it’s okay to think of our p-value of 0.03 as giving us 97% confidence in the alternative hypothesis, given the evidence we got. But if any one of the assumptions doesn’t hold, we have to adjust the p-value somehow in order to interpret it as the probability that the null hypothesis is true after all. That’s where clickbait comes in.
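If you’d like to see those conditions in numbers, here’s a tiny sketch that applies Bayes’s theorem directly. The priors are invented; the point is that with a 50/50 prior and near-certain evidence under the alternative, the posterior probability of the null is almost exactly the p-value, and with a lopsided prior it isn’t.

```python
def posterior_null(p_value, prob_evidence_given_alt, prior_null):
    """Probability that the null hypothesis is true given the evidence (Bayes's theorem)."""
    prior_alt = 1.0 - prior_null
    numerator = p_value * prior_null
    return numerator / (numerator + prob_evidence_given_alt * prior_alt)

# All three conditions hold: small p, P(evidence | H1) close to 1, 50/50 prior.
print(posterior_null(0.03, 1.0, 0.5))   # about 0.029, close to p = 0.03

# Same experiment, but the claim seemed unlikely beforehand (say a 95% prior on H0).
print(posterior_null(0.03, 1.0, 0.95))  # about 0.36, nowhere near 0.03
```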
The problem with clickbait
Okay, there are many problems with clickbait. But the one I’m thinking of is this: suppose you’re browsing through this made-up list of articles recently published in a journal that only accepts p-values less than 0.05:
…
“High air pollution decreases reported quality of life”
“Parental income positively correlated with child’s future income”
“Dead fish’s neurons correctly read human emotions”
“High-protein diet helps surgery patients regain muscle mass”
…
Now ask yourself two questions:
- Which link would you be most tempted to click on?
- If I told you one of these articles was due to that 5% chance of getting the observed evidence even though the null hypothesis was true after all, which one do you think it is?
That’s clickbait for you. Even if most of the information on the internet is true, you’re most likely to read what surprises you, which is also what’s most likely to be wrong.*
What can you do? Well, you could demand more evidence for wild-sounding claims, perhaps by requiring their p-values to be under an even smaller threshold. (Physicists making claims about the nature of the universe usually wait to get experimental p-values smaller than 0.0000003 before they’re sure they’ve observed something new.) Or you could have a variable threshold that depends on how “clickbaity” the claim is.
Another way of accomplishing the same thing would be to keep the threshold the same, but artificially increase the experimental p-values of clickbaity claims. That’s exactly what XKCD does—take a look at the equations again:
pCL = p × (fraction of test subjects who click on a headline announcing that H1 is true) / (fraction of test subjects who click on a headline announcing that H0 is true)
That fraction on the right is higher the more tempted you are to read an article claiming the alternative hypothesis, so an especially clickbaity article would have its p-value increased quite a lot. If the imaginary journal from the beginning of this section had excluded articles whose “clickbait-corrected” p-values were too high, maybe it wouldn’t have published that erroneous dead fish study.
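In code, the correction is just a rescaling of the p-value by how much more clickable the “H1 is true” headline is than the “H0 is true” one. A minimal sketch, with click fractions invented for illustration:

```python
def clickbait_corrected_p(p_value, clicks_h1, clicks_h0):
    """Scale the p-value by how much more tempting the H1 headline is than the H0 one."""
    return p_value * clicks_h1 / clicks_h0

# A mundane claim: headlines for either outcome draw about the same interest.
print(clickbait_corrected_p(0.03, clicks_h1=0.55, clicks_h0=0.45))  # about 0.037, still under 0.05

# A wild claim (dead fish reading emotions): almost nobody clicks the boring version.
print(clickbait_corrected_p(0.03, clicks_h1=0.95, clicks_h0=0.05))  # 0.57, nowhere near significant
```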
The true meaning of clickbait-corrected p-values
There are lots of ways to adjust p-values to make them higher for clickbaity headlines; what makes this one so special? The answer goes back to what we were saying before about how p-values don’t represent the probability that the null hypothesis is true given your evidence, unless the two hypotheses seemed equally likely before you did the experiment. If we imagine that the likelihood that you click on a headline is proportional to how surprising you find it, i.e. the probability that you would have originally thought the headline was false, then we can recalculate the null hypothesis’s new probability… and instead of being just p, it’s the clickbait-corrected pCL! That derivation is in the comments too. All together, we get:
The clickbait-corrected p-value pCL is the probability that the null hypothesis is true given the evidence you observed.
(This assumes that pCL is close to 0, the probability of observing your evidence given the alternative hypothesis is close to 1, and absent any evidence the probability you think a statement is false is proportional to how likely you are to click on it.)
Anyway, that’s why I’ve been geeking out. Thanks for reading!
*By the way, that dead fish study was real! However, it was deliberately designed to expose bad statistical reasoning. I learned about it from Jordan Ellenberg’s fantastic book “How Not To Be Wrong.” Recommended for anyone who wants to not be wrong.
In case you’re into probability, here’s the derivation that the p-value can be interpreted as the new probability of the null hypothesis H0, given your evidence E. First, Bayes’s theorem tells us that

P(H0 | E) = P(E | H0) × P(H0) / [P(E | H0) × P(H0) + P(E | H1) × P(H1)].

Now, we’re assuming that P(E | H0) = p and P(E | H1) ≈ 1. Therefore

P(H0 | E) ≈ p × P(H0) / [p × P(H0) + P(H1)].

If P(H0) = P(H1), then we can cancel them out and get P(H0 | E) ≈ p / (p + 1) ≈ p if p ≪ 1. On the other hand, if P(H0) and P(H1) are proportional to the fraction of people who click on a headline announcing that H1 is true and the fraction who click on a headline announcing that H0 is true, respectively, then we get instead

P(H0 | E) ≈ p × click(H1) / [p × click(H1) + click(H0)] ≈ p × click(H1) / click(H0) = pCL,

as long as that last ratio is itself much smaller than 1.
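As a quick numeric sanity check on this derivation (with invented click fractions), here is the exact Bayes posterior for H0 next to the pCL shortcut. They agree when pCL is small and drift apart when it isn’t, which is exactly the caveat above.

```python
def exact_posterior_null(p_value, click_h1, click_h0):
    """Exact Bayes posterior for H0, with priors proportional to the click
    fractions and P(evidence | H1) taken to be 1."""
    prior_null = click_h1 / (click_h1 + click_h0)
    prior_alt = click_h0 / (click_h1 + click_h0)
    return p_value * prior_null / (p_value * prior_null + prior_alt)

def clickbait_corrected_p(p_value, click_h1, click_h0):
    return p_value * click_h1 / click_h0

# Mildly clickbaity claim: the shortcut and the exact posterior agree closely.
print(exact_posterior_null(0.03, 0.6, 0.4))   # about 0.043
print(clickbait_corrected_p(0.03, 0.6, 0.4))  # 0.045

# Very clickbaity claim: pCL is no longer small, so the approximation drifts.
print(exact_posterior_null(0.03, 0.95, 0.05))   # about 0.36
print(clickbait_corrected_p(0.03, 0.95, 0.05))  # 0.57
```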