Dear students, please question my authority

Some more thoughts on the Stapel saga.

In most of the fraud cases Diederik Stapel told his students that he would perform the experiments, collect the data, do the analysis, and give the students the results. Some students wanted to be present at the experiments (a good suggestion, because you can learn a lot from it), but he wouldn’t allow them. A few students wanted to see the raw data, but when they pressed him he expressed doubts about whether they were good enough to be his PhD students. A filthy intimidation tactic if you ask me. But his students were not the only ones duped: other academics, like the Dutch professor Roos Vonk, co-authored articles that turned out to be based on fake data. In the end the whole fraud came out because a few students finally had the courage to stand up to him (and possibly to the university board, where he had a lot of friends – luckily the board did the right thing and took their complaints seriously). Other students, who allowed themselves to be intimidated, now have flawed dissertations. A few of them have left science because of the affair.

Dear students, learn from this. I promise I won’t cook the books, but don’t take my word for it – don’t take anyone’s word for anything, and not just your thesis supervisor’s. After you graduate you will work with other people, like your boss or your co-authors. They can make mistakes. They can lie. When your name is on a proposal, a thesis, or an article, then you (and your co-authors) are responsible for its contents. Convince yourself that those contents are correct. Yes, I do the same with your contributions.

I know that in some cultures it is impolite to question the advice of your superiors, much like foot soldiers are supposed to follow their sergeant’s orders. That may work in the army, but we’re not in the army here. The one order I give you is not to take orders from me.

The Stapel affair: it is worse than we thought

After Diederik Stapel was caught cooking the scientific books, three committees investigated the extent of the fraud at their universities (Amsterdam, Groningen, Tilburg), and how it was possible that Stapel committed fraud on such a massive scale. The report came out last week, and I find its content no less than shocking. And I’m not just referring to what they found Stapel did, or to how the universities where he did it never suspected anything. What shocked me most was the conduct of the other researchers. Even worse, many admitted to these practices without the slightest notion that they were doing something wrong.

Repeat the experiments until you get the results you want
Suppose your hypothesis says that X leads to Y. You divide your test subjects into two groups: a group that gets the X treatment and a control group that gets no treatment. If your hypothesis is correct, the treatment group should show Y more often than the control group. But how can you be sure the difference is not a coincidence? You can never be certain of that, so the difference should be so large that a coincidence is very unlikely. Statisticians express this through the ‘P-value’: the probability of getting results at least as strong as yours if your hypothesis is not true. In general scientists are satisfied if this P-value is lower than 5%. Note that this means that even if the hypothesis is not true, you still have a 1 in 20 chance of getting results that suggest it is!

So here is the problem. Some of the interviewees in the Stapel investigation argued that it is perfectly normal to run several experiments until you find an effect large enough for a P-value lower than 5%. Once you have such a result, you report the experiment that produced it and ignore the other experiments. The problem is that any difference you find can be due to coincidence: if you do two experiments, there is a chance of about 1 in 10 that at least one of them gives a P-value lower than 5% even if the hypothesis is not true; with three experiments the chance is about 1 in 7. This strategy must have produced a lot of false positives.
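To see how fast this adds up, here is a minimal simulation sketch (Python, with made-up numbers; the two-group set-up and the function names are my own illustration, not anything from the report). It runs experiments in which the treatment has no effect at all and counts how often repeating the experiment delivers a ‘significant’ result anyway:

```python
# Sketch: how often does "repeat the experiment until p < 0.05" yield a
# 'significant' result when the treatment has no effect whatsoever?
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def one_experiment(n=30):
    """Treatment and control drawn from the SAME distribution: no real effect."""
    treatment = rng.normal(0.0, 1.0, n)
    control = rng.normal(0.0, 1.0, n)
    return stats.ttest_ind(treatment, control).pvalue

def false_positive_rate(max_attempts, n_simulations=10_000):
    """Fraction of simulations in which at least one of `max_attempts`
    experiments comes out 'significant' at the 5% level."""
    hits = sum(
        any(one_experiment() < 0.05 for _ in range(max_attempts))
        for _ in range(n_simulations)
    )
    return hits / n_simulations

for attempts in (1, 2, 3):
    print(attempts, "experiment(s):", false_positive_rate(attempts))
# Roughly 0.05, 0.10 (about 1 in 10) and 0.14 (about 1 in 7): 1 - 0.95**attempts.
```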

Select the control group you want
No significant difference between the treatment group and the control group in this experiment? No sweat, you still have data on the control group in an experiment you did last year. After all, they are all random groups, aren’t they? So you simply select the control group that gives the difference you were looking for. Another recipe for false positives.
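A small variation on the same sketch (again Python, again made-up numbers of my own) shows what happens if you quietly compare your treatment group against several old control groups and keep the most favourable one:

```python
# Sketch: 'select the control group you want' from several candidate control
# groups, none of which differs systematically from the treatment group.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def cherry_picked_pvalue(n=30, n_controls=4):
    treatment = rng.normal(0.0, 1.0, n)                              # no real effect
    controls = [rng.normal(0.0, 1.0, n) for _ in range(n_controls)]  # all equivalent
    # Keep whichever comparison looks best.
    return min(stats.ttest_ind(treatment, c).pvalue for c in controls)

n_simulations = 10_000
rate = sum(cherry_picked_pvalue() < 0.05 for _ in range(n_simulations)) / n_simulations
print("false positive rate with 4 candidate control groups:", rate)
# Well above the nominal 5%, even though no group was treated differently.
```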

Keep mum about what you did not find
Another variety: you had three hypotheses you wanted to test, but only two were confirmed (OK, technically hypotheses are never confirmed – you merely reject their negation). So what do you do? You simply pretend that you wanted to test those two all along and ignore the third one.

Select your outliers strategically
Suppose one of your test subjects scores extremely low or high on a variable: this person could be an exception who cannot be compared to the rest of your sample. For instance, somebody scores very high on some performance test, and when you check who it is it turns out that this person has done the test before. This is a good reason to remove this observation from your dataset because you are comparing this person to people who do the test for the first time. However, two things are important here: (1) you should explain that you excluded this observation, and why; and (2) you should do this regardless of its effect on the significance of your results. It turned out that many interviewees (1) did not report such exclusions in their publications; and (2) would only exclude an observation if doing so would make their results ‘confirm’ their hypothesis.
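The effect of that second point is easy to check with another sketch along the same lines (Python, made-up data, and an exclusion rule I invented for illustration): drop the most extreme observation only when that pushes the result below the 5% threshold.

```python
# Sketch: exclude the most extreme observation, but only if doing so turns a
# non-significant null result into a 'significant' one.
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

def p_with_strategic_exclusion(n=30):
    treatment = rng.normal(0.0, 1.0, n)   # again: no real effect
    control = rng.normal(0.0, 1.0, n)
    p = stats.ttest_ind(treatment, control).pvalue
    if p >= 0.05:
        # Try dropping the observation furthest from its group mean, in either
        # group, and keep whichever p-value 'helps' most.
        t_trimmed = np.delete(treatment, np.argmax(np.abs(treatment - treatment.mean())))
        c_trimmed = np.delete(control, np.argmax(np.abs(control - control.mean())))
        p = min(p,
                stats.ttest_ind(t_trimmed, control).pvalue,
                stats.ttest_ind(treatment, c_trimmed).pvalue)
    return p

n_simulations = 10_000
rate = sum(p_with_strategic_exclusion() < 0.05 for _ in range(n_simulations)) / n_simulations
print("false positive rate with strategic exclusion:", rate)
# Strictly above the nominal 5%: the conditional exclusion can only ever 'help'.
```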

And all this seemed perfectly normal to some
But as I said earlier, the most troubling observation is that the interviewees had no idea that they were doing anything wrong. They said that these practices are perfectly normal in their field – in fact, on one occasion even the anonymous reviewer of an article requested that some results be removed from it because they did not confirm the hypothesis!

The overall picture emerges of a culture where research is done not to test hypotheses, but to confirm them. Roos Vonk, a Dutch professor who, just before the whole fraud came out, had announced ‘results’ from an experiment with Stapel ‘showing’ that people who eat meat are more likely to show antisocial behaviour, argued on Dutch television that an experiment has “failed” if it does not confirm your hypothesis. It all reeks of a culture where the open-minded view of the curious researcher is traded for narrow-minded tunnel vision.

Don’t get me wrong here: the committees emphasize (as any scientist should) that their sample was too small and too selective to draw any conclusions about the field of social psychology as a whole. Nevertheless, the fact that they observed this among several interviewees is troubling.

But the journals are also to blame, and there we come to a problem which I am sure is present in many fields, including economics. Have a sexy hypothesis? If your research confirms it the reviewers and the editor will crawl purring at your feet. If your research does not confirm it they will call your hypothesis far-fetched, the experimental set-up flawed, and the results boring. It’s the confirmed result that gets all the attention – and that makes for a huge bias in the overall scientific literature.

More thoughts on Stapel, Smeesters, and scientific fraud in general

Whenever there is a new case of scientific fraud the question pops up: does publish or perish force scientists to lie about their results? What makes this question all the more relevant is that many universities employ their academic staff (including me) under some form of tenure track. Here publish or perish is translated into a principle of up or out: either you keep increasing your teaching evaluation scores, publication list, Hirsch Index, project acquisition, and so forth, or you’re out of a job. Needless to say, this gives quite an incentive to cook the books.

The first thing to realize here is that neither Stapel nor Smeesters is a good example of such a mechanism. Both had tenure, and Stapel had been making up data for the entire length of his career.

The second thing to realize, however, is that there are many forms of scientific misconduct, not all of which are outright fraud. Stapel is an extreme example of blatant fraud as he fabricated complete datasets. But there are more ways of behaving badly in science:

  • Skip observations that don’t support your hypothesis. This is what Smeesters is being accused of.
  • Copy text or ideas without citing the source.
  • The mirror of that: support a claim with a reference to a source that does not provide such justification.
  • Leave out details of the research method that would have put your results in a different light.
  • Run lots and lots of regressions on any combination of variables. You are bound to find a statistically significant relation between one or more variables somewhere. Present it as something you intended to investigate in the first place. (Be aware that “statistically significant at 5%” means “if there were no real relation, there would still be a 5% chance of finding one this strong”, so among many unrelated variables you can expect about 1 in 20 such “statistically significant” relations to be pure coincidence – see the sketch after this list.)
  • Include the name of some big shot who hardly contributed to the paper but will make your paper look important. The big shot has yet another publication and you can bask in his glory.
  • When you do an anonymous peer review, tell the authors to cite some of your papers, especially the ones that improve your Hirsch Index if they are cited once more.
  • When you do an anonymous peer review, reject the paper if it presents results that you present in a paper that you just submitted to another journal. After all, you want to be the first to present the idea!
  • Or even worse than that: reject the paper (or a proposal) and submit the idea yourself. (Admittedly, given the huge time lag in publications you wouldn’t have a high chance of success.)
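To illustrate the regression-fishing point in the list above, here is one more sketch (Python, purely random data of my own making): regress a noise outcome on twenty unrelated noise variables and count the ‘significant’ relations.

```python
# Sketch: run many regressions on pure noise and see how many come out
# 'statistically significant' at the 5% level.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n_obs, n_vars = 100, 20

y = rng.normal(size=n_obs)                # outcome: pure noise
X = rng.normal(size=(n_obs, n_vars))      # 20 candidate regressors: also pure noise

significant = 0
for j in range(n_vars):
    result = stats.linregress(X[:, j], y)
    if result.pvalue < 0.05:
        significant += 1

print(f"{significant} of {n_vars} regressions 'significant' in pure noise")
# On average about 1 in 20: each test carries a 5% chance of a false positive.
```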

Note how difficult it is to identify bad intentions behind some of these, and that the line between good scientific practice and scientific misconduct can be surprisingly thin:

  • You can have very good reasons to skip an observation (protest bids in contingent valuation surveys are one example). This is Smeesters’s defence.
  • You may have always thought that author X said Y in article Z, but actually you were confused with another article.
  • Nobody ever includes those details of the method in their papers, so why should you?
  • You’re a PhD student and you don’t want to let your professor down by not including him as an author – he is your supervisor, after all.
  • The paper you are reviewing would be incomplete without that reference, whether you wrote it or not.

It is easy to say that there are no such things as small sins and big sins: thou shalt not sin, period. But for most people it just doesn’t work that way: they wouldn’t mind exceeding the speed limit by 10 km per hour but would object to exceeding it by 100 km per hour. And exceeding the speed limit by 20 km per hour may make you feel slightly worse about yourself, but when you are in a hurry it becomes easier to silence that guilty feeling.

So yes, I do believe the principles of publish or perish and up or out increase the incidence of scientific misconduct, but not in the way we read about it in the news. The cases you read about in the news are poor examples of such pressures. These are the sensational ones, the blatant fabrication of data by prestigious professors with big egos. The main damage is in the everyday nitty-gritty of science, and most of it may never be detected. Does that make it less bad? No, it may actually be worse because we don’t see, let alone quantify, the damage.

So is tenure track bad? Well, to paraphrase Churchill, it is the worst system except for all the other ones. The alternative we had in The Netherlands, where you had to wait for the current professor to die or retire before you could become one, has stifled scientific progress and chased a lot of talent out of the country. I believe the solution lies not in abandoning tenure track, but rather in the way we publish our results – but I’ll leave that for another post.

Yet another scientific fraud scandal in the Netherlands

Dirk Smeesters, professor of Consumer Behaviour at Erasmus University Rotterdam, is suspected of scientific fraud. Worse than that, he is not the first one. And what really irks me about the issue is the complacency among many researchers, especially natural scientists. Many of them argue that these cases only prove the self-cleansing power of science: after all, hasn’t the impostor been caught? Fraud will always be detected.

But how many impostors do not get caught? I think it is more difficult to detect fraud in the social sciences than it is in many (but not all) natural sciences. Take Hwang Woo-Suk, the Korean researcher who faked data claiming he had cloned human embryos. Such fraud is bound to be detected. His peers have good reasons to reproduce his results so they can apply the same technique, or even improve on it. Companies may want to commercialize the technique. When it turns out it doesn’t work, people will ask him for more details on how he did it, and try again. Somewhere down the line they will get suspicious, because he is either not willing to share the details of his work or his recommendations don’t help.

The work of Diederik Stapel and Dirk Smeesters is different. There is less incentive for replication, because the experiments tend to be fairly simple: they do not involve some fancy new technology. You would not learn anything new from replicating them, and you would not be able to publish the replication (“We did the same as Stapel et al. but it didn’t work” – “Well, your experimental set-up was probably wrong”). Diederik Stapel’s findings have been applied in many Dutch schools. The only way to find out whether they worked would have been to randomly select the schools where the insights are applied – try explaining to parents why their kids are not being taught according to the latest insights in educational science. And even then, our evidence would be no more than a P-value: the probability of finding such an effect if the treatment actually had no effect. Graham Bell could demonstrate that his telephone worked, but it doesn’t work like that in the social sciences.