The Stapel affair: it is worse than we thought

After Diederik Stapel was caught cooking the scientific books, three committees investigated the extent of the fraud at their universities (Amsterdam, Groningen, Tilburg), and how Stapel was able to commit fraud on such a massive scale. The report came out last week, and I find its content nothing less than shocking. I am not just referring to what they found Stapel did, or to how the universities where he did it never suspected anything. What shocked me most was the conduct of the other researchers who were interviewed: several of them described questionable research practices, which I go through below. Worse still, many admitted to these practices without the slightest notion that they were doing something wrong.

Repeat the experiments until you get the results you want
Suppose your hypothesis says that X leads to Y. You divide your test subjects into two groups: a group that gets the X treatment and a control group that gets no treatment. If your hypothesis is correct, the treatment group should show Y more often than the control group. But how can you be sure the difference is not a coincidence? You can never be certain of that, so the difference should be so large that a coincidence is very unlikely. Statisticians express this through the ‘P-value’: the probability of getting a difference at least as large as the one you observed if your hypothesis is not true, that is, if X has no effect at all. In general scientists are satisfied if this P-value is lower than 5%. Note that this means that if the hypothesis were not true, you still have a 1 in 20 chance of getting results that suggest it is!
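
To make that 1 in 20 concrete, here is a minimal simulation sketch. Everything in it is my own assumption rather than anything from the report: two groups of 50 subjects, a 30% chance of showing Y in both groups (so X truly does nothing), and a simple two-proportion z-test as the significance test.

```python
import math
import random


def two_sided_p_value(hits_a, hits_b, n):
    """Two-proportion z-test for two groups of equal size n (normal approximation)."""
    pooled = (hits_a + hits_b) / (2 * n)
    if pooled in (0.0, 1.0):
        return 1.0  # no variation at all, so no evidence of a difference
    std_err = math.sqrt(pooled * (1 - pooled) * 2 / n)
    z = (hits_a - hits_b) / n / std_err
    return math.erfc(abs(z) / math.sqrt(2))


def false_positive_rate(n_experiments=2000, n=50, base_rate=0.3, seed=1):
    """Share of experiments with P < 0.05 when X has no effect whatsoever."""
    rng = random.Random(seed)
    significant = 0
    for _ in range(n_experiments):
        # Both groups show Y with the same probability: the hypothesis is false.
        treated = sum(rng.random() < base_rate for _ in range(n))
        control = sum(rng.random() < base_rate for _ in range(n))
        if two_sided_p_value(treated, control, n) < 0.05:
            significant += 1
    return significant / n_experiments


print(f"Share of 'significant' results without any real effect: {false_positive_rate():.3f}")
```

The printed share should land in the neighbourhood of 0.05: roughly one in twenty of these experiments 'finds' an effect that does not exist.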

So here is the problem. Some of the interviewees in the Stapel investigation argued it is perfectly normal to run several experiments until you find an effect large enough for a P-value lower than 5%. Once you have found such a result, you report the experiment that produced it and ignore the others. But any difference you find can be due to coincidence, and every extra attempt gives coincidence another chance. If you do two independent experiments, you have a chance of about 1 in 10 that at least one of them gives a P-value lower than 5% even if the hypothesis is not true; if you do three, the chance is about 1 in 7. This strategy must have produced a lot of false positives.
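
The arithmetic behind those odds is short enough to check yourself. The sketch below assumes the experiments are independent and that each has exactly a 5% chance of a false positive; the function name is mine.

```python
def chance_of_false_positive(n_experiments, alpha=0.05):
    """Probability that at least one of n independent experiments gives P < alpha
    when the hypothesis is not true."""
    return 1 - (1 - alpha) ** n_experiments


for n in (1, 2, 3, 5, 10, 20):
    p = chance_of_false_positive(n)
    print(f"{n:2d} experiments: {p:.3f} (about 1 in {1 / p:.1f})")
```

For two experiments this gives 1 − 0.95² = 0.0975, and for three 1 − 0.95³ ≈ 0.143, which is where the 1 in 10 and 1 in 7 come from.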

Select the control group you want
No significant difference between the treatment group and the control group in this experiment? No sweat, you still have data on the control group in an experiment you did last year. After all, they are all random groups, aren’t they? So you simply select the control group that gives the difference you were looking for. Another recipe for false positives.
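
How much damage does this do? The following is only a rough sketch under assumptions of my own: scores are drawn from a normal distribution with a known standard deviation of 1, every group has 30 subjects, the treatment has no effect, and the researcher keeps whichever control group yields the smallest P-value.

```python
import math
import random


def p_value(group_a, group_b):
    """Two-sided z-test on the difference in means, assuming a known SD of 1."""
    diff = sum(group_a) / len(group_a) - sum(group_b) / len(group_b)
    z = diff / math.sqrt(1 / len(group_a) + 1 / len(group_b))
    return math.erfc(abs(z) / math.sqrt(2))


def false_positive_rate(n_controls, n_experiments=2000, n=30, seed=2):
    """Share of null experiments called 'significant' when the researcher may
    pick the most favourable of n_controls control groups."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_experiments):
        treated = [rng.gauss(0, 1) for _ in range(n)]
        controls = [[rng.gauss(0, 1) for _ in range(n)] for _ in range(n_controls)]
        # Cherry-pick: keep the comparison with the smallest P-value.
        if min(p_value(treated, control) for control in controls) < 0.05:
            hits += 1
    return hits / n_experiments


print(f"one control group:     {false_positive_rate(1):.3f}")
print(f"best of three groups:  {false_positive_rate(3):.3f}")
```

With a single control group the share stays close to 5%; once you may choose among three, it climbs well above that. It does not quite reach the 1 in 7 of three separate experiments, because all three comparisons reuse the same treatment group.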

Keep mum about what you did not find
Another variety is that you had three hypotheses you wanted to test, but only two are confirmed (OK, technically hypotheses are not confirmed – you merely reject their negation). So what do you do? You simply pretend that you wanted to test these two all along and ignore the third one.

Select your outliers strategically
Suppose one of your test subjects scores extremely low or high on a variable: this person could be an exception who cannot be compared to the rest of your sample. For instance, somebody scores very high on some performance test, and when you check who it is it turns out that this person has done the test before. This is a good reason to remove this observation from your dataset because you are comparing this person to people who do the test for the first time. However, two things are important here: (1) you should explain that you excluded this observation, and why; and (2) you should do this regardless of its effect on the significance of your results. It turned out that many interviewees (1) did not report such exclusions in their publications; and (2) would only exclude an observation if doing so would make their results ‘confirm’ their hypothesis.
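
The effect of such opportunistic exclusions can be sketched in a small simulation. Again, the set-up is my own assumption and not taken from the report: two groups of 30 subjects with normally distributed scores (known standard deviation of 1), no real effect, and a researcher who, whenever the full data are not significant, tries dropping the lowest or highest score from either group and keeps any version that reaches P < 0.05.

```python
import math
import random


def p_value(group_a, group_b):
    """Two-sided z-test on the difference in means, assuming a known SD of 1."""
    diff = sum(group_a) / len(group_a) - sum(group_b) / len(group_b)
    z = diff / math.sqrt(1 / len(group_a) + 1 / len(group_b))
    return math.erfc(abs(z) / math.sqrt(2))


def without(values, target):
    """Copy of values with one occurrence of target removed."""
    trimmed = list(values)
    trimmed.remove(target)
    return trimmed


def simulate(n_experiments=2000, n=30, seed=3):
    """Return (honest, opportunistic) false positive rates on the same data."""
    rng = random.Random(seed)
    honest = opportunistic = 0
    for _ in range(n_experiments):
        treated = [rng.gauss(0, 1) for _ in range(n)]
        control = [rng.gauss(0, 1) for _ in range(n)]
        if p_value(treated, control) < 0.05:
            honest += 1
            opportunistic += 1
            continue
        # Not significant: try each single 'outlier' exclusion and see if it helps.
        variants = [
            (without(treated, min(treated)), control),
            (without(treated, max(treated)), control),
            (treated, without(control, min(control))),
            (treated, without(control, max(control))),
        ]
        if any(p_value(a, b) < 0.05 for a, b in variants):
            opportunistic += 1
    return honest / n_experiments, opportunistic / n_experiments


honest_rate, opportunistic_rate = simulate()
print(f"all data kept:              {honest_rate:.3f}")
print(f"exclusions only if helpful: {opportunistic_rate:.3f}")
```

Dropping a single observation looks harmless, but because it is only done when it helps, the second rate ends up higher than the first, even though nothing real is going on in the data.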

And all this seemed perfectly normal to some
But as I said earlier, the most troubling observation is that the interviewees had no idea that they were doing anything wrong. They said that these practices are perfectly normal in their field – in fact, on one occasion even the anonymous reviewer of an article requested that some results be removed from the article because they did not confirm the authors’ hypothesis!

The overall picture that emerges is of a culture where research is done not to test hypotheses, but to confirm them. Roos Vonk, a Dutch professor who, just before the whole fraud came out, had announced ‘results’ from an experiment with Stapel ‘showing’ that people who eat meat are more likely to show antisocial behaviour, argued on Dutch television that an experiment has “failed” if it does not confirm your hypothesis. It all reeks of a culture where the open-minded view of the curious researcher is traded for narrow-minded tunnel vision.

Don’t get me wrong here: the committee emphasizes (as any scientist should) that their sample was too small and too selective to draw any conclusions about the field of social psychology as a whole. Nevertheless, the fact that the committee observed this among several interviewees is troubling.

But the journals are also to blame, and there we come to a problem which I am sure is present in many fields, including economics. Have a sexy hypothesis? If your research confirms it, the reviewers and the editor will crawl purring at your feet. If your research does not confirm it, they will call your hypothesis far-fetched, the experimental set-up flawed, and the results boring. It’s the confirmed result that gets all the attention – and that makes for a huge bias in the overall scientific literature.
