Our capacity to prevent, diagnose, treat, and cure disease depends upon the existence of a robust bank of scientific knowledge. To advance this essential resource, the National Institutes of Health invests roughly $30 billion annually in research. In recent years, however, drug companies and the public have voiced serious concerns about the lack of reproducibility of published scientific findings, and the issue has now risen to the level of the White House. On 29 July 2014, the White House Office of Science and Technology Policy and the National Economic Council posted a request for comment on an upcoming update to the Strategy for American Innovation. One of the items reads:

“Given recent evidence of the irreproducibility of a surprising number of published scientific findings, how can the Federal Government leverage its role as a significant funder of scientific research to most effectively address the problem?”

Concern about the irreproducibility of the scientific literature goes well beyond the recent Nature STAP papers debacle (in which scientists claimed they could generate pluripotent stem cells by treating skin cells with acid, in a process called stimulus-triggered acquisition of pluripotency, or STAP). And no field of science seems immune to this problem: the social sciences, psychological sciences, ecological sciences, computer sciences, and life sciences alike face the same lack of reproducibility.

[Figure: Which is the true mean? Each data point represents an independent result with a sample size of n = 3; error bars indicate standard deviation. The combined data follow a normal distribution.]
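To get a feel for why it is hard to pick out the true mean from a handful of small-n experiments, here is a minimal simulation sketch in Python. The true mean of 10 and standard deviation of 2 are my own illustrative assumptions, not values behind the figure:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "true" distribution; mean 10 and SD 2 are assumptions
# chosen for illustration, not values taken from the figure.
true_mean, true_sd = 10.0, 2.0

# Each of eight independent experiments reports the mean and SD of n = 3 draws.
for i in range(8):
    sample = rng.normal(true_mean, true_sd, size=3)
    print(f"Experiment {i + 1}: mean = {sample.mean():5.2f}, "
          f"SD = {sample.std(ddof=1):.2f}")
```

Each individual experiment lands somewhere different, and only by pooling the independent results does the estimate converge on the true mean.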

With respect to the life sciences, I have a hunch about what might be one of the principal culprits of irreproducibility. I hereby call on all PIs, postdocs, graduate students, research technicians, and undergraduate students to read and study a short paper by Cumming et al. that appeared in the Journal of Cell Biology in 2007, entitled “Error Bars in Experimental Biology.” Read this article! One of its key points is not only that scientists should use larger values of n, but that they should reevaluate their very definition of n. Perhaps this doesn’t sound like the kind of thing the White House is calling for, but in science the devil is in the details, and thus n has become a very necessary evil.

After reading Cumming et al., let’s put our commitment to reproducibility to the test by ensuring that our own experiments are… reproducible! How do we do that? Of course, we do the experiment multiple times. (Three would be a great start.) Unfortunately, many scientists mistakenly treat technical replicates as their marker of reproducibility instead of actually repeating the entire experiment. The reasons for doing this are not malicious; they usually stem from a misinterpretation of the meaning of n. I have learned from experience that scientists can be passionate about their use of technical replicates as n, and heated discussions can ensue. I am therefore so pleased to be writing this blog post.

As an example of what I am talking about, please consider the following (apologies to those with non-science backgrounds for the technicality of this example):

Suppose you’re testing whether a given gene (say, Atg12) has a role in a cellular process (we’ll say autophagy). You test this by silencing expression of Atg12 in a cell line, introducing either an siRNA that silences Atg12 or a control siRNA that silences nothing. You transfect three separate petri dishes with each to make sure any result you get isn’t an accident of handling. Then you lyse the cells and determine whether the gene/protein has been silenced by detection with an antibody. Now for the question: How many times did you test the silencing ability of this siRNA (i.e., what is the value of “n” for the experiment)? The answer? One. The experiment was done once. The three petri dishes (replicates) should be averaged, but no error bars, please. These replicates are technical replicates, and they tell us only how reliably we can perform a technique like measuring the level of a gene/protein (i.e., how reproducibly we can plate cells, transfect them, lyse them, and analyze them, all relatively routine steps). The biggest variation comes from how the cells are behaving, so measuring a bunch of different cell samples on the same day, treated in the same way, tells us very little about the biology of the process.

Worried that the data would look too noisy if you ran the whole experiment four times and counted the total n as only 4? Well, that’s rather the point. To be truly reproducible, the data need to be robust enough to be repeated not just in your own lab, with your reagents and your hands, but in another laboratory altogether. Yet how many of us would even do such an experiment three times in its entirety? Twice? One is the loneliest number, and yet it seems to find a surprising amount of company in the scientific literature.
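To make the distinction concrete, here is a minimal simulation sketch in Python. The knockdown level and the two noise scales are hypothetical numbers I am assuming for illustration; the point is that technical replicates scatter only by the small within-day noise, while the honest error bar comes from the means of independent experiments:

```python
import numpy as np

rng = np.random.default_rng(1)

# All numbers below are hypothetical, chosen for illustration only.
true_level = 0.30       # assumed residual Atg12 level after silencing
biological_sd = 0.10    # day-to-day (biological) variability between experiments
technical_sd = 0.02     # dish-to-dish (technical) noise within one experiment

n_experiments = 4       # independent repeats: the real n
n_dishes = 3            # technical replicates within each experiment

experiment_means = []
for day in range(n_experiments):
    # Each independent experiment has its own biological offset...
    day_level = rng.normal(true_level, biological_sd)
    # ...and the three dishes scatter around it only by technical noise.
    dishes = rng.normal(day_level, technical_sd, size=n_dishes)
    # Average the technical replicates into one number; this day contributes n = 1.
    experiment_means.append(dishes.mean())
    print(f"Day {day + 1}: within-day SD = {dishes.std(ddof=1):.3f} (deceptively tight)")

experiment_means = np.array(experiment_means)
print(f"Across the {n_experiments} independent experiments (n = {n_experiments}):")
print(f"mean = {experiment_means.mean():.3f}, SD = {experiment_means.std(ddof=1):.3f}")
```

Run it with different seeds and the within-day SD stays near 0.02 while the SD across independent experiments hovers around 0.1; those deceptively tight within-day error bars are exactly the tell described below.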

It’s easy to spot experiments in a paper that use technical replicates instead of biological replicates, because the error bars are typically vanishingly small. That’s just not what real data look like. (Again, please see Cumming et al.) A figure built from technical replicates might be labeled a “representative experiment.” At least that is more transparent than saying n = 3 when n is actually one! Please, show us all the data.

I am not suggesting that properly defining n will rid us of our reproducibility problem entirely, but at least for the life sciences, I think it would be a good step in the right direction.

“Dear President Obama, I am a scientist in the life sciences. I am concerned that too many of us are using technical replicates instead of biological replicates. Please help. Love — David”

David Madden, Ph.D., is an Associate Professor at Touro University California and Adjunct Assistant Professor at the Buck Institute for Research on Aging.