Yesterday I described how our obsession with statistical significance leads to poorer scientific findings and practice. So…what can we do about it?
One proposal, championed by John Carlin and others, is that we completely eliminate the term ‘statistical significance’ from scientific discourse. The goal is to shift attention away from unhelpful dichotomies and towards a more nuanced discussion of the degree of evidence for an effect.
This will require a change in how we present our results. Instead of talking about ‘findings’, we would describe the direction and magnitude of the effects we observe. This would naturally prompt a discussion of how relevant they are in the context of the research problem, something we should be doing anyway but that easily gets lost in the current style of discourse.
When observed effects are particularly surprising or unexpected, it is often because they really are too good to be true. Even if they are ‘significant’, they are likely to be substantial overestimates of any real effect. This can be demonstrated mathematically when statistical power is low, as the simulation sketch below illustrates. Quantifying the evidence might show, for example, a very wide confidence interval, which should ring warning bells that the estimate is unreliable. Considering what a plausible range of effects would be, and assessing the power to detect them, can shed further light on how strong a conclusion you can draw.
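To make this concrete, here is a minimal simulation sketch in Python (my own illustration, not from Carlin's talk; the true effect size, group sizes and significance threshold are arbitrary choices). It repeatedly runs a low-powered two-group comparison and looks only at the estimates that survive the p < 0.05 filter:

```python
# Illustrative sketch: 'significant' results from low-powered studies
# systematically overestimate the true effect. All numbers are invented.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
true_effect = 0.2      # small true difference in means (in SD units)
n_per_group = 25       # small groups, so power is low (around 10%)
n_studies = 10_000

significant_estimates = []
for _ in range(n_studies):
    control = rng.normal(0.0, 1.0, n_per_group)
    treated = rng.normal(true_effect, 1.0, n_per_group)
    t, p = stats.ttest_ind(treated, control)
    if p < 0.05:  # the usual 'significance' filter
        significant_estimates.append(treated.mean() - control.mean())

print(f"power ~ {len(significant_estimates) / n_studies:.2f}")
print(f"true effect: {true_effect}")
print(f"mean estimate among 'significant' studies: "
      f"{np.mean(significant_estimates):.2f}")
# With power around 0.1, the estimates that clear the threshold average
# roughly three times the true effect: too good to be true.
```

The damage comes from conditioning on significance: when power is low, the only estimates large enough to clear the threshold are the lucky overestimates.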
‘Absence of evidence is not evidence of absence’
— My daughter, on the existence of unicorns
Another benefit is that we get more clarity about ‘negative’ findings. Saying we have ‘no significant difference’ is not helpful. Does it mean we have strong evidence for a very low effect (i.e. evidence for absence), or have we simply run an underpowered study (i.e. absence of evidence)? Those are very different outcomes and we need to quantify the uncertainty in order to tell them apart.
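To illustrate the distinction, here is a short sketch (again my own, with invented numbers): two hypothetical studies, both ‘non-significant’ with a difference near zero, but with very different uncertainty once we quantify it.

```python
# Two 'non-significant' studies that tell very different stories.
import numpy as np
from scipy import stats

def mean_diff_summary(a, b, level=0.95):
    """Point estimate and Welch-t confidence interval for mean(a) - mean(b)."""
    diff = np.mean(a) - np.mean(b)
    va, vb = np.var(a, ddof=1), np.var(b, ddof=1)
    na, nb = len(a), len(b)
    se = np.sqrt(va / na + vb / nb)
    # Welch-Satterthwaite approximation for the degrees of freedom
    df = (va / na + vb / nb) ** 2 / (
        (va / na) ** 2 / (na - 1) + (vb / nb) ** 2 / (nb - 1)
    )
    half_width = stats.t.ppf(0.5 + level / 2, df) * se
    return diff, diff - half_width, diff + half_width

rng = np.random.default_rng(1)

# Hypothetical study A: large and precise, so the CI is tight around zero
# (strong evidence that any effect is small: evidence of absence).
big_a, big_b = rng.normal(0, 1, 2000), rng.normal(0, 1, 2000)

# Hypothetical study B: small and underpowered, so the CI is wide
# (compatible with effects in either direction: absence of evidence).
small_a, small_b = rng.normal(0, 1, 12), rng.normal(0, 1, 12)

for name, (a, b) in [("large study", (big_a, big_b)),
                     ("small study", (small_a, small_b))]:
    diff, lo, hi = mean_diff_summary(a, b)
    print(f"{name}: difference {diff:+.2f}, 95% CI ({lo:+.2f}, {hi:+.2f})")
```

Both studies would be reported identically under the ‘no significant difference’ convention; only the intervals reveal that one is informative and the other is not.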
An example
This proposal runs counter to much of current practice. Because ‘significance’ is so ingrained in scientific culture, it helps to have concrete examples of how to change our habits. Here is one reproduced from a talk by John Carlin.
Before:
To test the hypothesis that…development is structurally impaired in preterm infants, we studied 114 preterm infants and 18 term controls using…imaging techniques to obtain…(Y) at term corrected. There was no significant difference in Y between the preterm group and the term controls, whether adjusted or not for X.
After:
To test the hypothesis that…development is structurally impaired in preterm infants, we studied 114 preterm infants and 18 term controls using…imaging techniques to obtain…(Y) at term corrected. There was no clear evidence for a difference in Y between the preterm group and the term controls, with an overall mean reduction of 8% (95% confidence interval -3% to 17%, P = 0.17). When adjusted for X, the difference was even smaller (3%; 95% CI -6% to 12%, P = 0.48).
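For completeness, here is a speculative sketch of how a quantified summary in that style might be computed. The percentages suggest a comparison of group means on a ratio scale, which I have assumed is estimated as a difference of log means; the data below are simulated and do not reproduce the actual study.

```python
# Hedged sketch: producing an 'After'-style summary (estimate, CI, P)
# for a percentage difference between two groups. Data are simulated.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
# hypothetical positive outcome Y: 114 preterm infants, 18 term controls,
# with the preterm group set about 8% lower on average
preterm = rng.lognormal(mean=np.log(100) - 0.08, sigma=0.25, size=114)
term = rng.lognormal(mean=np.log(100), sigma=0.25, size=18)

log_pre, log_term = np.log(preterm), np.log(term)
res = stats.ttest_ind(log_pre, log_term, equal_var=False)  # Welch t-test
diff = log_pre.mean() - log_term.mean()
se = np.sqrt(log_pre.var(ddof=1) / len(log_pre) +
             log_term.var(ddof=1) / len(log_term))
# crude normal-approximation CI on the log scale, back-transformed to %
lo, hi = diff - 1.96 * se, diff + 1.96 * se
print(f"mean reduction {100 * (1 - np.exp(diff)):.0f}% "
      f"(95% CI {100 * (1 - np.exp(hi)):.0f}% to "
      f"{100 * (1 - np.exp(lo)):.0f}%, P = {res.pvalue:.2f})")
```

Whatever the exact method, the point of the rewrite stands: the reader sees the size of the difference and the range of effects compatible with the data, not just a verdict.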
General principles
- Avoid the word ‘significant’
- Use quantitative results (esp. how ‘negative’ is the result?)
- Comment on the degree of evidence
- Express results more cautiously, avoiding black/white interpretation (but best to quantify results as much as possible)
At the very least, say something like ‘strong evidence for’, ‘moderate evidence for’ or ‘no apparent relationship between’ instead of a phrase involving the word ‘significant’. Ideally, you would also quantify the evidence, as in the example above. Even without quantification, though, the focus is at least shifted away from simple dichotomisation and towards an interpretation of the degree of evidence.
‘Absence of evidence is quite possibly but not necessarily evidence of absence’
— My daughter, whose belief in the existence of unicorns has been tempered