Archive for the 'Replication' category

Generalization, not "reproducibility"

Feb 26 2018 Published by under Replication, ReplicationCrisis

The wikipedia entry on Generalization reads:

A generalization (or generalisation) is the formulation of general concepts from specific instances by abstracting common properties.

This is a very clean description of what many scientists think that they are about. I certainly do. I think that we are trying to use our experiments as specific instances from which to identify concepts and phenomena that have common properties with other situations not currently being tested. Thus our results should, we hope, generalize as predictions of what will happen in other situations.

Usually situations related to human health and behavior.

A recent paper by Voelkl and colleagues talks about this but totally borks the framing and terminology. They continually misuse "reproducibility" when they really mean to refer to generalization. And this harms science.

First, a quick overview. What Voelkl et al. present is a study which conducts meta-analysis of published studies. This technique includes a host of different studies which use approximately the same methods to address approximately the same question. The outcome of such a meta-analysis can tell us if a given qualitative interpretation is more likely to be true than not (think of it as a box score of the outcomes weighted by some qualities of the specific studies) and estimate the effect size (the distance of the mean effect relative to the variation expected; Cohen's d is the version most comprehensible to me).
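To make the effect size idea concrete, here is a minimal sketch of Cohen's d for single studies and a simple inverse-variance weighted average across studies. The study numbers are made up for illustration, the variance formula is the common large-sample approximation, and the fixed-effect pooling is only one of several ways a meta-analysis can combine results. It is not the method used by Voelkl et al.

```python
import math

def cohens_d(mean_t, mean_c, sd_t, sd_c, n_t, n_c):
    """Standardized mean difference using the pooled standard deviation."""
    pooled_var = ((n_t - 1) * sd_t ** 2 + (n_c - 1) * sd_c ** 2) / (n_t + n_c - 2)
    return (mean_t - mean_c) / math.sqrt(pooled_var)

def d_variance(d, n_t, n_c):
    """Common large-sample approximation to the sampling variance of d."""
    return (n_t + n_c) / (n_t * n_c) + d ** 2 / (2 * (n_t + n_c))

def pooled_effect(studies):
    """Fixed-effect (inverse-variance weighted) mean of the per-study d values."""
    ds = [cohens_d(*s) for s in studies]
    weights = [1.0 / d_variance(d, s[4], s[5]) for d, s in zip(ds, studies)]
    pooled = sum(w * d for w, d in zip(weights, ds)) / sum(weights)
    se = math.sqrt(1.0 / sum(weights))
    return pooled, se

# Hypothetical studies: (treated mean, control mean, treated SD, control SD, n treated, n control)
studies = [
    (12.0, 10.0, 2.0, 2.0, 10, 10),
    (11.0, 10.0, 2.5, 2.5, 8, 8),
    (13.0, 10.0, 3.0, 3.0, 12, 12),
]
pooled, se = pooled_effect(studies)
print(f"pooled d = {pooled:.2f}, standard error = {se:.2f}")
```

Larger and less variable studies get more weight, which is the "box score weighted by some qualities of the specific studies" part.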

As you can imagine, in a typical meta-analysis the studies vary quite a bit in detail. Perhaps it is the strain of rat being used. Or the sex. Or the light cycle the studies were run in. Perhaps it is the time of year or the humidity of the building. And most frequently there is variation in the scientists who are doing the investigating.

Meta-analysis is a test of generalization!

This is critical.

The big point in the Voelkl paper is that individual papers which include data sets on the same topic from multiple labs are more closely correlated with the meta-analytic result. As the authors put it in the Discussion:

Using simulated sampling, we compared the outcomes of single- and multi-laboratory studies, using the same overall number of animals, in terms of their accuracy of effect size estimates (pc) and FNR. For these simulations, we chose to use a large sample of published data from preclinical studies to guarantee that the results reflect real-life conditions. We found that pc increased substantially with the number of participating laboratories, without causing a need for larger sample sizes. This demonstrates that using more representative study samples through multi-laboratory designs improves the external validity and reproducibility of preclinical animal research.

Well, no shit Sherlock. A multi-laboratory study is already a test of generalization. It says that the same qualitative interpretation can be drawn from the study regardless of variation in laboratory, personnel and probably some other key variables. Since this is also what the meta-analysis is testing, it is no surprise whatever that this would be the result.

But. These authors use "reproducibility". The Wikipedia entry on this topic is a disaster which conflates several key issues together, most pertinently generalization, reproducibility and replicability. It starts out okay:

Reproducibility is the ability to get the same research results or inferences, based on the raw data and computer programs provided by researchers.

Absolutely. Reproducibility is indeed the ability to reach the same conclusion (inferences) based on doing everything just like the other researchers did it. Great. It then immediately goes off the rails:

A related concept is replicability, meaning the ability to independently achieve non identical conclusions that are at least similar, when differences in sampling, research procedures and data analysis methods may exist.

What? That sounds more like a flexible version of reproducibility. If I had to try to parse out a difference for replicability I might observe that the term "replicates" gives us a clue. As it does further down in the Wikipedia entry, which now conflates the term repeatable with replicable.

The other component is repeatability which is the degree of agreement of tests or measurements on replicate specimens by the same observer in the same laboratory. ... Although they are often confused, there is an important distinction between replicates and an independent repetition of an experiment. Replicates are performed within an experiment.

Seriously, who is editing this thing? Replicable now equals repeatable, which is a question of whether all the subjects in your sample are doing the same thing, more or less. I can get behind this needing a separate term but can we just pick one please? And not confuse that with the issue of whether the scientific result ("inference") can be reproduced or will generalize?

Back to reproducibility.

A particular experimentally obtained value is said to be reproducible if there is a high degree of agreement between measurements or observations conducted on replicate specimens in different locations by different people

See how they've immediately diverged? This may or may not be generalization depending on what you call "replicate specimens". To most eyes this means a whole different experiment, which is for sure a test of generalization. Well, the entry immediately makes clear that the intent is to conflate:

in science, a very well reproduced result is one that can be confirmed using as many different experimental setups as possible and as many lines of evidence as possible

The line about "as many different setups as possible" is the essence of generalization. And if that isn't enough confusion this sentence brings in converging evidence which is another concept entirely!

Back to Voelkl et al.:

our results suggest that eliminating these and other risks of bias (e.g., low statistical power, analytical flexibility) is not sufficient to guarantee reproducibility; the results will remain idiosyncratic to the specific laboratory conditions unless these conditions are varied.

"Idiosyncratic" here means reproducible. It means that if you keep the conditions identical, you should be able to repeat the experiment over and over and come up with the same approximate finding ("inference"). This finding can be endlessly reproducible, be built on experiments that are highly replicable within the samples and still fail to generalize beyond the idiosyncratic way that a given lab chooses to run the experiment.

So why do I say this failure to be clear about what we mean harms science?

Well, we are deep in the midst of much furor about a "reproducibility crisis" in science. There isn't one. Or at least if there is one, it has not been demonstrated clearly. The low grade annoyance of writing and reviewing the NIH grant section on Rigor is not a huge deal (at least it hasn't been for me so far). But it is yet another thing for people to beat up grants, possibly for no good reason. On the other end of the scale this will eventually be grist for conservative Congress Critters trying to cut investment in research. Somewhere in between lies the goal of the BigPharma voices promoting the lie so as to further offload their research and development costs onto the public purse.

The more immediate problem is that if we are not clear about what we mean in this discussion, our solutions will never solve anything, and may even hurt. I believe that to some extent people are indeed accusing science of having a reproducibility problem. Meaning, one assumes, that significant amounts of published work come to inferences that cannot be sustained if the experiments are done in exactly the same way. The solution for this, one deduces, can only be that each lab must perform many replicate experiments to provide improved confidence on reproducibility prior to publishing. "Make those sloppy bastards repeat it six times and I won't have to work so hard to figure out how to get my experiment working", goes the thinking. I guess.

One interpretation of what Voelkl and colleagues are saying is that this won't help at all.

Besides known differences between the studies included in our analysis, such as the species or strain of animals (i.e., genotype) or reported differences in animal husbandry and experimental procedures, sources of variation included also many unknown and unknowable differences, such as the influence of the experimenter [38,39] or the microbiome [40], as well as subtle differences in visual, olfactory, and auditory stimulation. All those factors might affect treatment effects. Multi-laboratory designs are ideal to account for all of these sources of between-laboratory variation and should therefore replace standardized single-laboratory studies as the gold standard for late-phase preclinical trials

If we don't do work in a way that can test how well a conclusion generalizes across these issues, we will never solve the real problem. We will not know the limits of said generalization (it is not one thing, btw), the key experimental factors and the irrelevant detail. Instead we will continue to promote a collection of arbitrary and highly constrained experimental parameters and talk as if surely our results will generalize to a treatment medication for humans in rapid order.

In point of fact working to improve reproducibility (as we all do!) may be directly opposed to improving generalization and thereby compromise translation to helping improve human health.

And despite where people in science are pointing the finger of blame (i.e., the reproducibility of inferences that we can make using precisely the same approaches), they are really motivated and angered by the lack of generalization.

Seriously, listen to what the scientists who are eager to be puppeted by Big Pharma have to say. Listen to their supposed examples that show "the problem is real". Look at what makes them really mad. Ask about their attempts to perform experiments related to the ones in the published literature that anger them so much. You will be more likely to conclude that they are not in fact miffed about directly reproducing a result. More often it is a failure to generalize beyond the original experimental conditions.

Voelkl B, Vogt L, Sena ES, Würbel H (2018) Reproducibility of preclinical animal research improves with heterogeneity of study samples. PLoS Biol 16(2): e2003693.

6 responses so far

Group effects. or "effects".

Jul 22 2016 Published by under Replication, ReplicationCrisis

How many times do we see the publication of a group effect in an animal model that is really just a failure to replicate? Or a failure to completely replicate?

How many of those sex-differences, age-differences or strain-differences have been subjected to replication?

10 responses so far

Amgen continues their cherry picking on "reproducibility" agenda

Feb 05 2016 Published by under Conduct of Science, Replication, ReplicationCrisis

A report by Begley and Ellis, published in 2012, was hugely influential in fueling current interest and dismay about the lack of reproducibility in research. In their original report the authors claimed that the scientists of Amgen had been unable to replicate 47 of 53 studies.

Over the past decade, before pursuing a particular line of research, scientists (including C.G.B.) in the haematology and oncology department at the biotechnology firm Amgen in Thousand Oaks, California, tried to confirm published findings related to that work. Fifty-three papers were deemed 'landmark' studies (see 'Reproducibility of research findings'). It was acknowledged from the outset that some of the data might not hold up, because papers were deliberately selected that described something completely new, such as fresh approaches to targeting cancers or alternative clinical uses for existing therapeutics. Nevertheless, scientific findings were confirmed in only 6 (11%) cases. Even knowing the limitations of preclinical research, this was a shocking result.

Despite the limitations identified by the authors themselves, this report has taken on a life of truthy citation as if most of all biomedical science reports cannot be replicated.

I have remarked a time or two that this is ridiculous on the grounds the authors themselves recognize, i.e., a company trying to skim the very latest and greatest results for intellectual property and drug development purposes is not reflective of how science works. Also on the grounds that until we know exactly which studies and what they mean by "failed to replicate" and how hard they worked at it, there is no point in treating this as an actual result.

At first, the authors refused to say which studies or results were meant by this original population of 53.

Now we have the data! They have reported their findings! Nature announces breathlessly that Biotech giant publishes failures to confirm high-profile science.

Awesome. Right?

Well, they published three of them, anyway. Three. Out of fifty-three alleged attempts.

Are you freaking kidding me Nature? And you promote this like we're all cool now? We can trust their original allegation of 47/53 studies unreplicable?


Christ what a disaster.

I look forward to hearing from experts in the respective fields these three papers inhabit. I want to know how surprising it is to them that these forms of replication failure occurred. I want to know the quality of the replication attempts and the nature of the "failure"- was it actually failure or was it a failure to generalize in the way that would be necessary for a drug company's goals? Etc.

Oh and Amgen? I want to see the remaining 50 attempts, including the positive replications.

Begley CG, Ellis LM. Drug development: Raise standards for preclinical cancer research. Nature. 2012 Mar 28;483(7391):531-3. doi: 10.1038/483531a.

21 responses so far

British Journal of Pharmacology issues new experimental design standards

Dec 23 2015 Published by under Conduct of Science, Replication, ReplicationCrisis

The BJP has decided to require that manuscripts submitted for publication adhere to certain experimental design standards. The formulation can be found in Curtis et al., 2015.

Curtis MJ, Bond RA, Spina D, Ahluwalia A, Alexander SP, Giembycz MA, Gilchrist A, Hoyer D, Insel PA, Izzo AA, Lawrence AJ, MacEwan DJ, Moon LD, Wonnacott S, Weston AH, McGrath JC. Experimental design and analysis and their reporting: new guidance for publication in BJP. Br J Pharmacol. 2015 Jul;172(14):3461-71. doi: 10.1111/bph.12856

Some of this continues the "huh?" response of this behavioral pharmacologist who publishes in a fair number of similar journals. In other words, YHN is astonished this stuff is not just a default part of the editorial decision making at BJP in the first place. The items that jump out at me include the following (paraphrased):

2. You should shoot for a group size of N=5 or above and if you have fewer you need to do some explaining.
3. Groups less than 20 should be of equal size and if there is variation from equal sample sizes this needs to be explained. Particularly for exclusions or unintended loss of subjects.
4. Subjects should be randomized to groups and treatment order should be randomized.
6.-8. Normalization and transformation should be well justified and follow acceptable practices (e.g., you can't compare a treatment group to the normalization control that now has no variance because of this process).
9. Don't confuse analytical replicates with experimental replicates in conducting analysis.
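Item 9 is the one that trips people up most often in practice, so here is a small sketch of the distinction. The readings and the subject names are hypothetical. The idea shown is just that repeated measurements on the same subject get collapsed to one value per subject before any group statistics, so that N reflects subjects and not assay wells.

```python
from statistics import mean

def collapse_replicates(measurements):
    """Reduce analytical (technical) replicates to one value per subject.

    measurements maps a subject id to the list of repeated readings
    taken on that same subject.
    """
    return {subject: mean(values) for subject, values in measurements.items()}

# Hypothetical group: three animals, each assayed in triplicate
group = {
    "rat1": [1.0, 1.2, 1.1],
    "rat2": [2.0, 2.1, 1.9],
    "rat3": [1.5, 1.5, 1.5],
}

per_subject = collapse_replicates(group)
n_experimental = len(per_subject)                    # subjects: the N for the statistics
n_analytical = sum(len(v) for v in group.values())   # readings: not the N for the statistics

print(f"N = {n_experimental} subjects, from {n_analytical} readings")
```

Running the group comparison on the nine readings instead of the three subject means would treat the triplicates as independent animals, which they are not.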

Again, these are the "no duh!" issues in my world. Sticky peer review issues quite often revolve around people trying to get away with violating one or other of these things. At the very least reviewers want justification in the paper, which is a constant theme in these BJP principles.

The first item is a pain in the butt but not much more than make-work.

1. Experimental design should be subjected to ‘a priori power analysis’....latter requires an a priori sample size calculation that should be included in Methods and should include alpha, power and effect size.

Of course, the trouble with power analysis is that it depends intimately on the source of your estimates for effect size- generally pilot or prior experiments. But you can select basically whatever you want as your assumption of effect size to demonstrate a range of sample sizes as acceptable. Also, you can select whatever level of power you like, within reasonable bounds along the continuum from "Good" to "Overwhelming". I don't think there are very clear and consistent guidelines here.
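A quick sketch shows how much room there is. It uses the standard normal-approximation formula for a two-sided, two-group comparison, n per group = 2 * ((z_alpha + z_power) / d)^2, which runs slightly below what an exact t-test calculation gives. The function name and the grid of effect sizes are my own choices for illustration, not anything specified by BJP.

```python
import math
from statistics import NormalDist

def n_per_group(effect_size_d, alpha=0.05, power=0.80):
    """Approximate per-group sample size for a two-sided, two-group comparison.

    Normal approximation only; an exact t-test calculation gives
    slightly larger numbers, especially for small samples.
    """
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # critical value for the two-sided test
    z_power = z.inv_cdf(power)           # quantile matching the desired power
    return math.ceil(2 * ((z_alpha + z_power) / effect_size_d) ** 2)

# Same design, different assumptions about effect size and power
for d in (0.5, 0.8, 1.0, 1.5):
    print(f"d = {d}: n per group = {n_per_group(d)} at 80% power, "
          f"{n_per_group(d, power=0.95)} at 95% power")
```

Because the required n scales with 1/d^2, doubling the assumed effect size cuts the "required" sample to roughly a quarter. Pick your pilot data accordingly and nearly any group size can be made to look justified.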

The fifth one is also going to be tricky, in my view.

Assignment of subjects/preparations to groups, data recording and data analysis should be blinded to the operator and analyst unless a valid scientific justification is provided for not doing so. If it is impossible to blind the operator, for technical reasons, the data analysis can and should be blinded.

I just don't see how this is practical with a limited number of people running experiments in a laboratory. There are places this is acutely important- such as when human judgement/scoring measures are the essential data. Sure. And we could all stand to do with a reminder to blind a little more and a little more completely. But this has disaster written all over it. Some peers doing essentially the same assay are going to disagree over what is necessary and "impossible" and what is valid scientific justification.

The next one is a big win for YHN. I endorse this. I find the practice of reporting any p value other than your lowest threshold to be intellectually dishonest*.

10. When comparing groups, a level of probability (P) deemed to constitute the threshold for statistical significance should be defined in Methods, and not varied later in Results (by presentation of multiple levels of significance). Thus, ordinarily P < 0.05 should be used throughout a paper to denote statistically significant differences between groups.

I'm going to be very interested to see how the community of BJP accepts* this.

Finally, a curiosity.

11. After analysis of variance post hoc tests may be run only if F achieves the necessary level of statistical significance (i.e. P < 0.05) and there is no significant variance inhomogeneity.

People run post-hocs after a failure to find a significant main effect on the ANOVA? Seriously? Or are we talking about whether one should run all possible comparison post-hocs in the absence of an interaction? (seriously, when is the last time you saw a marginal-mean post-hoc used?) And isn't this just going to herald the return of the pre-planned comparison strategy**?

Anyway I guess I'm saying Kudos to BJP for putting down their marker on these design and reporting issues. Sure I thought many of these were already the necessary standards. But clearly there are a lot of people skirting around many of these in publications, specifically in BJP***. This new requirement will stiffen the spine of reviewers and editors alike.

*N.b. I gave up my personal jihad on this many years ago after getting exactly zero traction in my scientific community. I.e., I had constant fights with reviewers over why my p values were all "suspiciously" p<0.05 and no backup from editors when I tried to slip this concept into reviews.
**I think this is possibly a good thing.
***A little birdy who should know claimed that at least one AE resigned or was booted because they were not down with all of these new requirements.

39 responses so far

Thought of the day

Dec 05 2014 Published by under Replication, ReplicationCrisis, Science Publication

One thing that always cracks me up about manuscript review is the pose struck* by some reviewers that we cannot possibly interpret data or studies that are not perfect.

There is a certain type of reviewer that takes the stance* that we cannot in any way compare treatment conditions if there is anything about the study that violates some sort of perfect, Experimental Design 101 framing even if there is no reason whatsoever to suspect a contaminating variable. Even if, and this is more hilarious, if there are reasons in the data themselves to think that there is no effect of some nuisance variable.

I'm just always thinking....

The very essence of real science is comparing data across different studies, papers, paradigms, laboratories, etc and trying to come up with a coherent picture of what might be a fairly invariant truth about the system under investigation.

If the studies that you wish to compare are in the same paper, sure, you'd prefer to see less in the way of nuisance variation than you expect when making cross-paper comparisons. I get that. But still....some people.

Note: this in some way relates to the alleged "replication crisis" of science.
*having nothing to go on but their willingness to act like the manuscript is entirely uninterpretable and therefore unpublishable, I have to assume that some of them actually mean it. Otherwise they would just say "it would be better if...". right?

8 responses so far

Replication costs money

I ran across a curious finding in a very Glamourous publication. Being that it was in a CNS journal, the behavior sucked. The data failed to back up the central claim about that behavior*. Which was kind of central to the actual scientific advance of the entire work.

So I contemplated an initial, very limited check on the behavior. A replication of the converging sort.

It's going to cost me about $15K to do it.

If it turns out negative, then where am I? Where am I going to publish a one figure tut-tut negative that flies in the face of a result published in CNS?

If it turns out positive, this is almost worse. It's a "yeah we already knew that from this CNS paper, dumbass" rejection waiting to happen.

Either way, if I expect to be able to publish in even a dump journal I'm going to need to throw some more money at the topic. I'd say at least $50K.

At least.

Spent from grants that are not really related to this topic in any direct way.

If the NIH is serious about the alleged replication problem then it needs to be serious about the costs and risks involved.
*a typical problem with CNS pubs that involve behavioral studies.

35 responses so far