We Are Not Born Knowing How to Design and Analyse Scientific Experiments

Michael F.W. Festing

Better training in experimental techniques and changes to the
current methods of scientific research funding are both needed
to facilitate more-effective improvements in human health

The randomised controlled experiment is one of the most powerful tools ever invented for gaining new knowledge. The methods were developed for agricultural research by R.A. Fisher and colleagues at Rothamsted Experimental Station in the 1920s. The first randomised clinical trial was carried out by Bradford Hill in 1946, when he used the method to evaluate streptomycin for the treatment of tuberculosis.1 So the randomised, controlled and blinded experiment is quite a recent development in human history. Now, many thousands of such experiments are conducted each year by scientists in virtually all disciplines, including the life sciences, engineering, psychology and agriculture.

But there are many pitfalls for the unwary scientist. It is all too easy for biases to creep in, leading to the wrong conclusions.2 All too often, randomisation is inadequate because it is applied only to the allocation of animals to treatments, and not to the order in which the observations are made, and ‘blinding’ is often omitted in situations where it would be expected to be important.3 As the design of the experiment is so important, and scientists spend so much of their time doing experiments, it seems obvious that they should be well trained in the necessary techniques. Unfortunately, very few scientists get any formal training in these methods.
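The two randomisation steps and the blinding mentioned above can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions, not a prescribed protocol: the function name `randomise_experiment`, the animal identifiers, the treatment labels and the `S001`-style codes are all hypothetical. It allocates animals to treatments in equal numbers, separately randomises the order in which they are measured, and gives each animal a neutral code so that the person taking the measurements need not know its treatment.

```python
import random


def randomise_experiment(animal_ids, treatments, seed=None):
    """Return (allocation, measurement_order, blind_codes) for one experiment.

    Hypothetical helper for illustration only.
    """
    if len(animal_ids) % len(treatments) != 0:
        raise ValueError("animals must divide evenly among treatments")
    rng = random.Random(seed)  # a recorded seed makes the plan reproducible

    # Step 1: balanced random allocation of animals to treatments.
    labels = list(treatments) * (len(animal_ids) // len(treatments))
    rng.shuffle(labels)
    allocation = dict(zip(animal_ids, labels))

    # Step 2: an independent random order for making the observations.
    measurement_order = list(animal_ids)
    rng.shuffle(measurement_order)

    # Step 3: neutral codes, so the observer can be 'blinded' to treatment.
    codes = [f"S{i:03d}" for i in range(1, len(animal_ids) + 1)]
    rng.shuffle(codes)
    blind_codes = dict(zip(animal_ids, codes))

    return allocation, measurement_order, blind_codes


if __name__ == "__main__":
    animals = [f"mouse-{i}" for i in range(1, 13)]
    allocation, order, codes = randomise_experiment(
        animals, ["control", "treated"], seed=2012
    )
    for animal in order:
        # The observer's sheet lists only the code; the key is held elsewhere.
        print(codes[animal])
```

In practice the allocation key would be held by someone other than the person making the observations, and the seed would be recorded with the study records.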

Most seem to rely on what their supervisor told them when they were doing their PhD, but all too often this was based on tradition and on what the supervisor had learned from their own supervisor. As a result, far too many experiments are unrepeatable, or give the wrong result. This can have untold consequences. It would be much more cost-effective to ensure that scientists are properly trained in the first place.

Experimental design and the use of animals in research

When the experiments involve laboratory animals, there is an additional ethical issue. The use of such animals might be justified if they contribute to human welfare. But if these animal experiments give the wrong results, then not only are the animals wasted, but there may be substantial associated costs. Even the debate about whether animals can be used as models of humans can’t be realistically discussed if the animal experiments are unrepeatable. Three examples illustrate the possible magnitude of the problem:

1. In a commentary article published in Nature in 2012, Begley and Ellis4 claimed that “Unquestionably, a significant contributor to failure in oncology trials is the quality of published preclinical data”. The scientific community assumes that the claims made by preclinical research can be taken at face value, and that: “The results of preclinical studies must therefore be very robust to withstand the rigours and challenges of clinical trials, stemming from the heterogeneity of both tumours and patients.” They identified 53 “landmark” papers in cancer research and tried to repeat them, in some cases with the assistance of the original authors, but this was possible in only six cases (11%), a “shocking” result. They noted that in the studies which could be repeated, the authors had paid close attention to “controls, reagents, investigator bias and describing the complete data set.” In the unreproducible studies, there was an absence of ‘blinding’, and authors had often selected the results of a single investigation, such as a Western blot, which supported their hypothesis, and rejected any results that did not do so. Some of these papers had spawned entire new fields of research with many secondary publications, without the original work ever having been verified. In some cases, clinical trials had even been set up on the basis of this unrepeatable work. The authors concluded that “the bar for reproducibility in performing and presenting preclinical studies must be raised”, while recognising that this will require considerable effort and a change of culture.

2. A somewhat similar finding was reported by Prinz et al. in 2011.5 Working in a pharmaceutical company, they claimed that many projects to validate exciting published papers resulted in disillusionment, because the key features of the data could not be reproduced. Altogether, they collected data from 67 in-house projects to validate published work, but were only able to do so in about 20–25% of them. There did not seem to be any correlation between the quality of the journal, as judged by its Impact Factor, and the reproducibility of the results. Anecdotally, they claimed that it is generally assumed in the pharmaceutical industry that only about half the published papers of relevance to the pharmaceutical industry give repeatable results.

Although they did not go into detail on the reasons for the poor reproducibility of the studies, they did suggest that it could be due to poor statistical methods and inadequate sample sizes. They also mentioned the extreme competition among academic laboratories and the pressure to publish positive results in high-impact journals that may be tempting scientists to cut corners in their research.

3. An editorial in Nature Genetics in 2012 (Vol 44, No. 6) suggested that a paper entitled “Design, power and the interpretation of studies in the standard murine model of ALS”6 should be essential reading for all scientists carrying out preclinical research. In it, the authors noted that there were more than 50 papers which claimed to have found new drugs to treat ALS (amyotrophic lateral sclerosis) by using the standard mouse model, a transgenic strain carrying a mutant human SOD1 G93A gene. Yet only one drug, riluzole, was effective in humans, and it prolonged life for only a few months. Due to the rapid and unrelenting nature of the disease, test compounds are rapidly advanced to clinical trials. So why were none of the drugs that were apparently effective in the mouse model effective in human patients? Over a period of five years, Scott et al. screened 70 drugs, including most of the drugs previously found to be effective in mice. They used 18,000 mice with “rigorous and appropriate statistical methodologies”. They had expected to be able to reproduce the positive findings for the drugs previously screened, but were unable to do so. None of the 70 drugs tested by them produced positive results.

In explaining these results, Scott et al. identified a number of confounding effects which affected survival times in their mice. These included copy number of the transgene, litter effects, gender, failure to ‘blind’ technicians and investigators to the treatment, and failure to exclude non-ALS deaths. They then used accumulated data on their control mice to simulate the effect of failure to control these variables. They found that this mouse model is inherently noisy. If none of these confounding factors were removed, there was a 58% chance of observing a statistically significant false-positive difference between groups of ten mice, which was the usual sample size used. All of the false-positive results could be attributed to inappropriate experimental design through failure to control this variability.
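How an uncontrolled litter effect alone can inflate the false-positive rate can be shown with a small simulation. The Python sketch below is not a reconstruction of the analysis by Scott et al., and it does not use their data: the litter sizes, the standard deviations and the function names are hypothetical values chosen purely for illustration. There is no treatment effect at all. Two groups of ten mice are compared with an ordinary two-sample t-test, either when whole litters are assigned to a single group (so that litter is confounded with treatment) or when individual mice are randomised to groups without regard to litter.

```python
import random
import statistics

# Two-sided 5% critical value of Student's t with 18 degrees of freedom
# (two groups of ten mice).
T_CRIT_DF18 = 2.101


def t_statistic(a, b):
    """Pooled-variance two-sample t statistic."""
    na, nb = len(a), len(b)
    pooled_var = (
        (na - 1) * statistics.variance(a) + (nb - 1) * statistics.variance(b)
    ) / (na + nb - 2)
    return (statistics.mean(a) - statistics.mean(b)) / (
        pooled_var * (1 / na + 1 / nb)
    ) ** 0.5


def false_positive_rate(confound_litters, n_sim=2000,
                        litter_sd=10.0, mouse_sd=5.0, seed=1):
    """Fraction of null experiments declared 'significant' at the 5% level.

    litter_sd and mouse_sd are assumed values, not estimates from real mice.
    """
    rng = random.Random(seed)
    false_positives = 0
    for _ in range(n_sim):
        # Four litters of five mice; littermates share a common litter effect.
        litters = []
        for _ in range(4):
            shared = rng.gauss(0.0, litter_sd)
            litters.append([shared + rng.gauss(0.0, mouse_sd) for _ in range(5)])

        if confound_litters:
            # Poor design: whole litters go to one group or the other.
            group_a = litters[0] + litters[1]
            group_b = litters[2] + litters[3]
        else:
            # Better design: individual mice are randomised to the groups.
            mice = [m for litter in litters for m in litter]
            rng.shuffle(mice)
            group_a, group_b = mice[:10], mice[10:]

        if abs(t_statistic(group_a, group_b)) > T_CRIT_DF18:
            false_positives += 1
    return false_positives / n_sim


if __name__ == "__main__":
    print("litters confounded with group:", false_positive_rate(True))
    print("mice randomised individually: ", false_positive_rate(False))
```

Under these assumptions, the confounded design should declare a ‘significant’ difference far more often than the nominal 5%, while the randomised design should stay close to it. The exact figures depend entirely on the assumed variances, so they should not be compared directly with the 58% reported for the real mouse data.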

A need for better training

The hallmark of a good experiment should be that the results are repeatable. If they are not, then something is seriously wrong. The use of animals in biomedical research depends on the assumption that results from animals can predict human outcomes. But they will not be able to do so if they can’t even predict the results from other animals of the same species. Whether animals provide a good model of humans in specific cases is a question that can’t be answered if the animal (or human) experiments are not repeatable.

Clearly, there is an urgent need for scientists to have better training in experimental techniques, particularly in the design and statistical analysis of experiments. FRAME has been making an important contribution, unique among animal welfare organisations, in running a number of 3-day workshops on experimental design in the UK, The Netherlands and Portugal, which have been enthusiastically received by participants. It is only a drop in the ocean, but it leads the way in tackling this serious issue. Possibly the new European Directive 2010/63/EU, with its strong emphasis on training, will make a difference. Some of the FELASA Category C training courses devote a considerable amount of time to experimental design. Scientists may also find the www.3Rsreduction.co.uk website useful. It is designed to help research scientists to teach themselves the principles of experimental design, as applied to research with laboratory animals.

A shift in scientific culture also appears to be necessary. Are methods of funding scientific research also partly to blame? There is extreme pressure on scientists to publish exciting work in high-impact journals. Negative results often go unpublished because they lack excitement, leading to publication bias. But producing a scientific paper is not the objective of health research. What is needed is good quality work which leads to improvement in human health. In the search for drugs to alleviate ALS, Scott et al. screened 70 compounds by using statistically rigorous techniques, but obtained negative results and a ‘dull’ publication. However, that was of much greater value than the 50 ‘exciting’ papers which claimed to have found a useful drug, but which turned out to be false positive results due to poor experimental design. Maybe the funding organisations should re-think the way in which they distribute their funds, in order to reduce the occurrence of this type of unhelpful and unproductive situation.

Dr Michael F.W. Festing
Russell and Burch House
96–98 North Sherwood Street
Nottingham NG1 4EE
E-mail: michaelfesting@aol.com

1 Hill, A.B. (1966). Principles of Medical Statistics, 381pp. London, UK: The Lancet.
2 Sarewitz, D. (2012). Beware the creeping cracks of bias. Nature 485, 149.
3 Kilkenny, C., Parsons, N., Kadyszewski, E., Festing, M.F., Cuthill, I.C., Fry, D., Hutton, J. & Altman, D.G. (2009). Survey of the quality of experimental design, statistical analysis and reporting of research using animals. PLoS One 4, e7824.
4 Begley, C.G. & Ellis, L.M. (2012). Drug development: Raise standards for preclinical cancer research. Nature 483, 531–533.
5 Prinz, F., Schlange, T. & Asadullah, K. (2011). Believe it or not: How much can we rely on published data on potential drug targets? Nature Reviews Drug Discovery 10, 712.
