It is well known that replicating a scientific experiment usually leads to a more conclusive result. One way in which this happens is that the statistical evidence becomes stronger when it is accumulated across many experiments. What is perhaps surprising is that describing and quantifying how this happens is not straightforward. Simple explanations can easily be misinterpreted if they gloss over key details.
Confusion due to ambiguity
One explanation I heard recently went roughly as follows:
Suppose we run a single experiment, using the conventional 5% level of statistical significance. A positive finding from this experiment will be wrong 1 out of 20 times. However, if we were to run three experiments instead of just one, the chance that all of them would be wrong would be 1 in 8,000 \((= 20^3)\).
The fact being explained here is that replication reduces the false positive rate. That is, if we assume the underlying research hypothesis is actually false, the chance that a single experiment will come out positive (i.e. will support the hypothesis based on a statistical test) is 1 in 20, and the chance that all three experiments will do so is 1 in 8,000.
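The arithmetic behind the quoted figures is simply the multiplication rule for independent events. A quick sketch in Python (assuming, as above, a 5% significance level and independent experiments):

```python
alpha = 0.05  # per-experiment false positive rate (5% significance level)

# If the hypothesis is false, a single experiment is positive
# by chance with probability 1 in 20:
single_fp = alpha        # 0.05 = 1/20

# Three independent experiments are *all* falsely positive with
# probability (1/20)^3 = 1 in 8,000:
triple_fp = alpha ** 3   # 0.05**3 = 1/8000
```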
However, most people are likely to interpret the statement differently. They will mistakenly take it to mean that the chance the research hypothesis is false, given a positive finding, is 1 in 20.
The difference is somewhat subtle. The first interpretation refers to the probability of the experimental outcome given an assumption about the truth of the research hypothesis. The second is the reverse, a probability of the hypothesis given an assumption about the outcome. The two can easily be confused, giving rise to what is known as the Prosecutor’s fallacy.
The main problem is the ambiguity of the phrase ‘will be wrong’, which can be interpreted in different ways. Most people would naturally focus on the main question of interest (‘is the hypothesis true?’) whereas classical statistics is usually posed in the reverse manner (‘what is the probability of the data given the hypothesis?’). We can attempt to fix the explanation by more precise wording, for example:
Suppose we run a single experiment, using the conventional 5% level of statistical significance. If the research hypothesis is not true, the experiment will give rise to a positive finding by chance 1 in 20 times, while with three independent experiments the chance that all three would be positive goes down to 1 in 8,000.
While this is now factually correct, the message has become a bit harder for a lay audience to understand or relate to. They will want to know how replication helps to answer the question of interest. They may even impose their own interpretation of the probabilities despite the careful wording. The Prosecutor’s fallacy still lurks in the shadows.
More meaningful explanations
To help such an audience, we can frame the explanation directly in terms of the chance that the hypothesis is true. This requires some extra information:
The statistical power of the experiment (also known as the sensitivity or the true positive rate). This is the chance that it will give a positive result if the research hypothesis is true.
The prior probability of the hypothesis. This is our best assessment of whether the research hypothesis is true before having run the experiment, summarised as a probability. (This can be based on other evidence already gathered for this hypothesis, or on evidence or experience from studies of similar or related hypotheses.)
After we conduct the experiment, we can combine the outcome and the above information using Bayes’ theorem to determine the posterior probability of the hypothesis. This is our ‘updated’ assessment of it being true, in light of the evidence provided by the experiment. It is this quantity that is of most interest to the audience, together with how it changes when replicate experiments are conducted.
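In symbols, this update is a standard application of Bayes’ theorem. Writing \(H\) for the research hypothesis and \(+\) for a positive finding:

\[
P(H \mid +) = \frac{P(+ \mid H)\,P(H)}{P(+ \mid H)\,P(H) + P(+ \mid \bar{H})\,\bigl(1 - P(H)\bigr)}
\]

where \(P(+ \mid H)\) is the power, \(P(+ \mid \bar{H})\) is the false positive rate, and \(P(H)\) is the prior probability. For \(k\) independent experiments that all return positive findings, each of the two likelihoods is replaced by its \(k\)-th power.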
For example, suppose we wish to run a psychology experiment that is somewhat under-resourced and we have assessed the power to be about 20%. Furthermore, let’s suppose we are testing a speculative hypothesis and rate the chances of it being true at about 1 in 10. A positive finding in this case would upgrade this to about 1 in 3 (a posterior probability of about 31%), which still leaves plenty of room for doubt. If we replicate the experiment two more times, and get positives each time, then the overall posterior probability would be almost 90%. This would certainly look more convincing, although perhaps not completely conclusive.
In comparison, suppose we are planning a clinical trial with a power of 80%. We will test a drug for which we already have some evidence of an effect, rating the chances of this being true as 1 in 3. A positive outcome here already entails a posterior probability of almost 90%, while positive outcomes for three independent such trials would raise this to more than 99.9%.
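As a sanity check on these figures, the Bayes update for both examples can be sketched in a few lines of Python (the function name and rounding comments are mine; the power, prior, and significance values are those assumed above):

```python
def posterior(prior, power, alpha, k=1):
    """Posterior probability that the hypothesis is true, given k
    independent experiments that all produced positive findings."""
    p_data_if_true = power ** k    # chance of k positives if hypothesis true
    p_data_if_false = alpha ** k   # chance of k positives if hypothesis false
    num = p_data_if_true * prior
    return num / (num + p_data_if_false * (1 - prior))

# Under-resourced psychology experiment: power 20%, prior 1 in 10
print(posterior(0.10, 0.20, 0.05, k=1))  # ~0.31, about 1 in 3
print(posterior(0.10, 0.20, 0.05, k=3))  # ~0.88, almost 90%

# Well-powered clinical trial: power 80%, prior 1 in 3
print(posterior(1/3, 0.80, 0.05, k=1))   # ~0.89, almost 90%
print(posterior(1/3, 0.80, 0.05, k=3))   # ~0.9995, more than 99.9%
```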
Note that in both of these examples I have assumed the experiments would be designed to have a 5% false positive rate, as is commonly done. That means for both examples the false positive rate for three experiments is 1 in 8,000. However, the quantifiable impact on the actual question of interest varies.
The above examples show how to explain the impact of replication on the statistical evidence in a way that is more understandable than if only referring to the change in the false positive rate.
I recommend using an example along these lines when communicating the benefit of replication. Tailoring the example to the audience’s interests, including using assumptions that are as realistic as possible, would allow them to more easily see the relevance of the message. Even for a fairly general audience, I recommend describing a hypothetical experiment rather than referring to generic statistical properties.
Setting up this type of explanation requires some elaboration of key assumptions, such as power and prior probability, which can take a bit of time. The reward is a meaningful and understandable example.
While it might be tempting to resort to the ‘1 in 8,000’ explanation to keep the message brief, I recommend against it because it is likely to cause confusion.
If brevity is very important, I recommend steering away from numerical explanations and instead just describing the basic concepts qualitatively. For example, ‘replicating the experiment multiple times is akin to running a single larger experiment, which naturally has greater statistical power’.