Last month, a study estimating the reproducibility of psychological science was published, prompting much commentary, including an insightful article in The Atlantic. The study was conducted by a large group of 270 psychologists, who together tried to replicate 100 previously published findings from prominent journals by independently re-running the experiments. It was a big undertaking, and the first attempted at such a scale that I am aware of.
The result? Only about 40% of their experiments replicated the original findings. This sounds worryingly low, which feeds into the wider discourse about poor scientific methodology and the ‘replication crisis’.
There are many factors that lead to poor replicability, some of which were explored in this study. One that wasn’t discussed, and that I and others think is an important contributor, is the pervasive practice of using significance tests and conventional p-value thresholds (e.g. 0.05) as the sole arbiter of evidence.
p < 0.05? Publish!
p > 0.05? Reject!
Hmm…that reminds me of how we treat blood alcohol content (BAC) here in Australia:
BAC > 0.05? Get off the road, you drunk!
BAC < 0.05? Keep driving…
Of course, drunkenness is a matter of degree and the arbitrary 0.05 limit is chosen for practical convenience. Other countries use different limits.
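The same arbitrariness applies to the 0.05 p-value threshold. As a minimal sketch (in Python with SciPy; the z statistics are made-up numbers, purely for illustration), consider how little separates results that fall just either side of the line:

```python
# Minimal illustration (my own, not from any study): two-sided p-values for a
# z-test, with observed statistics just either side of the conventional 1.96
# cut-off. The strength of evidence is essentially identical, yet one result
# gets labelled 'significant' and the other does not.
from scipy.stats import norm

for z in (1.95, 1.97):
    p = 2 * norm.sf(z)  # two-sided p-value for an observed z statistic
    verdict = "significant" if p < 0.05 else "not significant"
    print(f"z = {z:.2f}  ->  p = {p:.4f}  ({verdict})")
```

Two p-values of roughly 0.051 and 0.049 describe almost exactly the same evidence, yet under the publish/reject rule they lead to opposite conclusions.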
The p-value, and ‘statistical significance’ more broadly, have become a sort of ‘currency’ that allows one to make claims about truth or falsehood. You can see this reflected in the scientific literature, where conclusions are often written in a black-and-white fashion.
When taken to the extreme, this develops into a culture of intellectual laziness and mindless dichotomisation. Rather than considering whether the evidence at hand makes sense and is consistent with other studies, previous knowledge, etc., a significant p-value is used as a licence to declare a scientific ‘finding’. This leads to implausible conclusions being published, even in prominent journals. The female hurricanes study comes to mind (see Andrew Gelman’s comments), and other such examples seem to be a regular feature on Gelman’s blog (e.g. this one from last week). It’s clear how this culture can lead to substantial publication bias.
There’s an even more fundamental problem. This obsession with dichotomisation, jokingly referred to as ‘significantitis’, feeds a belief that statistics is about getting a ‘licence’ to claim a finding. This is a misconception. Statistics is actually about quantifying uncertainty and assessing the extent of evidence. It’s about determining shades of grey, not about enforcing black and white.
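To make the ‘shades of grey’ point concrete, here is a small simulation sketch (Python with NumPy and SciPy; the effect size and sample size are assumptions of mine chosen for illustration, not figures from the replication study):

```python
# Sketch (my own assumptions, not the replication study's design): run the same
# experiment many times, with the same modest true effect each time, and see how
# the p-value behaves. A hard 0.05 cut-off turns one unchanging reality into a
# mix of 'findings' and 'non-findings'.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(1)
true_effect = 0.4   # assumed true difference in means, in standard-deviation units
n = 30              # assumed participants per group
reps = 10_000       # number of simulated experiments

pvals = np.empty(reps)
for i in range(reps):
    control = rng.normal(0.0, 1.0, n)
    treated = rng.normal(true_effect, 1.0, n)
    pvals[i] = ttest_ind(treated, control).pvalue  # two-sample t-test

print(f"proportion of runs with p < 0.05: {np.mean(pvals < 0.05):.2f}")
print(f"p-value quartiles: {np.round(np.percentile(pvals, [25, 50, 75]), 3)}")
```

Every run samples from exactly the same underlying reality, yet well under half of them cross the 0.05 line, and the p-value itself ranges widely. Treated as a continuous measure of evidence it behaves sensibly; treated as a black-and-white verdict it makes ‘findings’ appear and disappear for no good reason.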
As a statistician, I am concerned that this misconception is contributing to a lack of engagement with us in much of scientific research, and to a lack of investment in data analysis capabilities. As a scientist, I am concerned that this culture is perpetuating the reproducibility crisis, which could harm our public reputation and promote widespread disillusionment with science.
I leave you with this famous xkcd comic: