Since November 2011, the Center for Open Science has been involved in an ambitious project: to repeat 100 psychology experiments and see whether the results are the same the second time round. The first wave of results will be released in tomorrow’s edition of Science, reporting that fewer than half of the original experiments were successfully replicated.
This is a known problem (though not well known enough among researchers, and hardly at all among the lay public).
Our system actually encourages the publication of papers that are false.
We need to strike a compromise between the large cost of near-absolute certainty and the insights gained from cheaper research that carries a larger chance of being wrong.
Kind of like: do we want to spend $1 million for a 95% chance of being right, or $100,000 for an 80% chance?
Especially when a single successful replication of the cheaper study raises the odds of being right to 96%.
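The arithmetic behind that 96% is worth spelling out. A minimal sketch, assuming the two cheap studies are independent and each has a 20% chance of being a false positive:

```python
# Sketch of the 80% -> 96% claim, assuming two independent studies
# that are each wrong 20% of the time. The finding only survives as
# "wrong" if BOTH studies were false positives.
p_wrong_single = 0.20                      # one $100k study: 80% right
p_wrong_both = p_wrong_single ** 2         # both wrong by chance: 4%
p_right_after_replication = 1 - p_wrong_both
print(p_right_after_replication)           # -> 0.96
```

Two $100,000 studies still cost far less than one $1 million study, yet together they beat its 95% confidence.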
In this paper, Johnson looked at two statistical measures of relevance: “frequentist tests, which measure how unlikely a finding is to occur by chance, and Bayesian tests, which measure the likelihood that a particular hypothesis is correct given data collected in the study.”
Essentially, one approach looks at the chance that the results could have happened by random accident, while the other looks at how likely it is that the results are true.
Almost all researchers use the frequentist test, with an arbitrary value of 0.05 as the threshold – if their results produce a p-value less than 0.05 (p < 0.05), they say it is significant (“It has a 95% chance of being correct”) and publish.
Although as I wrote before:
There is less than a 1 in 20 chance that the experiment would yield the results by random chance. But if, say, 40 labs do the experiment, then several WILL get the ‘statistically significant’ result by pure chance. There is no real effect, just random fluctuation. Those labs that get a result, though, then get to publish a paper on data which are really only due to random chance. When anyone else tries to repeat the experiment, we see regression to the mean and the significance disappears.
This is so-called publication bias. If enough experiments are repeated, even random results will look significant. And get published.
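You can watch this happen in a quick simulation. The numbers below (40 labs, each flipping a fair coin 100 times and testing whether it is biased) are illustrative choices of mine, not from any study; the null hypothesis is true for every lab, yet some still cross p < 0.05:

```python
# 40 labs each test a TRUE null hypothesis (a fair coin), using a
# normal-approximation two-sided test. Some still hit p < 0.05 and
# get a "publishable" result from pure chance. Expected: ~5% of labs.
import math
import random

random.seed(1)  # fixed seed so the run is repeatable

def two_sided_p(heads, flips=100):
    """Normal-approximation p-value for 'coin is unfair'."""
    z = (heads - flips / 2) / math.sqrt(flips * 0.25)
    return math.erfc(abs(z) / math.sqrt(2))

significant = 0
for lab in range(40):
    heads = sum(random.random() < 0.5 for _ in range(100))
    if two_sided_p(heads) < 0.05:
        significant += 1   # a false positive, ready to publish

print(significant)  # typically a couple of labs out of 40 (expected ~2)
```

If only those few labs publish, the literature records an effect that does not exist.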
Or this one:
Think of flipping a coin. You might get 4 heads in a row making you believe that somehow the coin is some sort of wonderful piece of metal. Wow, it comes up heads 100% of the time. Publish and look out tenure, here I come.
But when others repeat this, the numbers revert to the mean and the paper is shown to be incorrect.
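For the record, the streak arithmetic is easy to check. Each extra head halves the probability, and a streak of five from a fair coin already slips under the usual 0.05 bar:

```python
# Probability that a fair coin shows k heads in a row by pure chance.
for k in range(1, 7):
    p = 0.5 ** k
    print(f"{k} heads in a row: p = {p:.4f}")
# 4 in a row: p = 0.0625; 5 in a row: p = 0.0312, already below 0.05.
# Easy to hit if enough people are flipping coins.
```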
It is as if we have chosen a threshold that, taken in isolation, is not nearly as stringent as “statistically significant” makes it sound.
And Johnson found that this was exactly true. A frequentist threshold of 0.05 is too loose and will produce many false positives. Which is just what this Ars Technica article shows.
He had suggested that in studies using a p-value of 0.05 as the threshold, more than 25% of the findings were false. Here we find that it might be 50% in certain fields.
Johnson noticed that none of the research with p < 0.005 failed to replicate. His suggestion for making sure we are not fooled by too loose a threshold: use p < 0.005 as the bar for statistical significance.
The problem is that if a threshold of 0.005 is used, many researchers will simply not be able to publish much of anything. It is hard enough to get results below 0.05.
Not because the researchers are being unethical. To get values closer to 0.005, substantially more resources must be used. In the social sciences, the number of subjects required could rise severalfold, increasing costs tremendously.
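How much more? Under the simplest normal-approximation sample-size formula, holding the effect size and power fixed (I assume 80% power here), the required n scales with (z_alpha/2 + z_beta)^2 – and real designs with small effects or multiple comparisons can cost considerably more than this back-of-the-envelope figure:

```python
# Rough sketch: sample size needed at a given two-sided alpha, up to a
# constant that cancels in the ratio. Assumes 80% power; the effect
# size drops out when comparing two thresholds.
from statistics import NormalDist

z = NormalDist().inv_cdf   # standard normal quantile function
z_beta = z(0.80)           # 80% power

def relative_n(alpha):
    """Sample size (up to a constant) at a given two-sided alpha."""
    z_alpha = z(1 - alpha / 2)
    return (z_alpha + z_beta) ** 2

ratio = relative_n(0.005) / relative_n(0.05)
print(f"Moving from p<0.05 to p<0.005 needs ~{ratio:.2f}x the subjects")
```

Even this idealized calculation asks for roughly 70% more subjects, and every extra subject is money.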
More money is simply not an option. So a compromise is made.
Papers are published knowing that there is a reasonable chance that they are not correct. But the information can get out to be checked. Further replication removes signal from noise and allows us to get a clearer picture.
And for possibly less money overall.
Most researchers intuitively know the provisional value of these things. We understand that the statistics only become reliable as more data are obtained.
We know that any one paper, using a threshold of p<0.05, by itself, may not be true. But if several other papers replicate the research positively, then the statistics become clearer.
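There is a standard way to make “the statistics become clearer” concrete: Fisher’s method pools independent p-values into a single combined one. The three p-values below are made-up examples of borderline studies, not real data:

```python
# Fisher's method: combine k independent p-values via a chi-square
# statistic with 2k degrees of freedom. For even dof the survival
# function has a closed form, so this needs only the stdlib.
import math

def fisher_combined_p(pvalues):
    """Combined p-value for k independent studies (Fisher's method)."""
    stat = -2 * sum(math.log(p) for p in pvalues)
    k = len(pvalues)
    term = math.exp(-stat / 2)         # i = 0 term of the series
    total, running = term, term
    for i in range(1, k):              # remaining terms (x/2)^i / i!
        running *= (stat / 2) / i
        total += running
    return total

# Three borderline replications together are far stronger than any one:
print(fisher_combined_p([0.04, 0.05, 0.03]))  # ~0.0035, below 0.005
```

Three studies that each barely clear 0.05 jointly clear even Johnson’s stricter 0.005 bar, which is exactly the sense in which replication removes signal from noise.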
But most people, thanks in part to how the media cover science, think that only true papers are published.
Thus the continuing confusion. Our system is designed to allow the publication of a lot of papers that will surely be wrong. But, in the long run, this may well be cheaper.