Archive for the 'ReplicationCrisis' category

Group effects. Or "effects".

Jul 22 2016 Published by under Replication, ReplicationCrisis

How many times do we see the publication of a group effect in an animal model that is really just a failure to replicate? Or a failure to completely replicate?

How many of those sex-differences, age-differences or strain-differences have been subjected to replication?

10 responses so far

Amgen continues their cherry picking on "reproducibility" agenda

Feb 05 2016 Published by under Conduct of Science, Replication, ReplicationCrisis

A report by Begley and Ellis, published in 2012, was hugely influential in fueling current interest and dismay about the lack of reproducibility in research. In their original report the authors claimed that the scientists of Amgen had been unable to replicate 47 of 53 studies.

Over the past decade, before pursuing a particular line of research, scientists (including C.G.B.) in the haematology and oncology department at the biotechnology firm Amgen in Thousand Oaks, California, tried to confirm published findings related to that work. Fifty-three papers were deemed 'landmark' studies (see 'Reproducibility of research findings'). It was acknowledged from the outset that some of the data might not hold up, because papers were deliberately selected that described something completely new, such as fresh approaches to targeting cancers or alternative clinical uses for existing therapeutics. Nevertheless, scientific findings were confirmed in only 6 (11%) cases. Even knowing the limitations of preclinical research, this was a shocking result.

Despite the limitations identified by the authors themselves, this report has taken on a life of truthy citation, as if most or all biomedical science reports cannot be replicated.

I have remarked a time or two that this is ridiculous on the grounds the authors themselves recognize, i.e., a company trying to skim the very latest and greatest results for intellectual property and drug development purposes is not reflective of how science works. Also on the grounds that until we know exactly which studies and what they mean by "failed to replicate" and how hard they worked at it, there is no point in treating this as an actual result.

At first, the authors refused to say which studies or results were meant by this original population of 53.

Now we have the data! They have reported their findings! Nature announces breathlessly that Biotech giant publishes failures to confirm high-profile science.

Awesome. Right?

Well, they published three of them, anyway. Three. Out of fifty-three alleged attempts.

Are you freaking kidding me, Nature? And you promote this like we're all cool now? We can trust their original allegation that 47 of 53 studies were unreplicable?


Christ what a disaster.

I look forward to hearing from experts in the respective fields these three papers inhabit. I want to know how surprising it is to them that these forms of replication failure occurred. I want to know the quality of the replication attempts and the nature of the "failure"- was it actually failure or was it a failure to generalize in the way that would be necessary for a drug company's goals? Etc.

Oh and Amgen? I want to see the remaining 50 attempts, including the positive replications.

Begley CG, Ellis LM. Drug development: Raise standards for preclinical cancer research. Nature. 2012 Mar 28;483(7391):531-3. doi: 10.1038/483531a.

21 responses so far

British Journal of Pharmacology issues new experimental design standards

Dec 23 2015 Published by under Conduct of Science, Replication, ReplicationCrisis

The BJP has decided to require that manuscripts submitted for publication adhere to certain experimental design standards. The formulation can be found in Curtis et al., 2015.

Curtis MJ, Bond RA, Spina D, Ahluwalia A, Alexander SP, Giembycz MA, Gilchrist A, Hoyer D, Insel PA, Izzo AA, Lawrence AJ, MacEwan DJ, Moon LD, Wonnacott S, Weston AH, McGrath JC. Experimental design and analysis and their reporting: new guidance for publication in BJP. Br J Pharmacol. 2015 Jul;172(14):3461-71. doi: 10.1111/bph.12856 [PubMed]

Some of this prompts a "huh?" response from this behavioral pharmacologist, who publishes in a fair number of similar journals. In other words, YHN is astonished this stuff was not already a default part of the editorial decision making at BJP in the first place. The items that jump out at me include the following (paraphrased):

2. You should shoot for a group size of N=5 or above and if you have fewer you need to do some explaining.
3. Groups of fewer than 20 subjects should be of equal size, and any variation from equal sample sizes needs to be explained, particularly exclusions or unintended loss of subjects.
4. Subjects should be randomized to groups and treatment order should be randomized.
6.-8. Normalization and transformation should be well justified and follow acceptable practices (e.g., you can't compare a treatment group to the normalization control that now has no variance because of this process).
9. Don't confuse analytical replicates with experimental replicates in conducting analysis.
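Items 3 and 4 above are mechanical enough to sketch in a few lines; the subject IDs and group names here are hypothetical, purely for illustration:

```python
# Sketch of items 3 and 4: randomize subjects to equal-sized groups.
# Subject IDs and group names are made up for illustration.
import random

def randomize_to_groups(subjects, group_names, seed=None):
    """Shuffle subjects and deal them into equal-sized groups."""
    if len(subjects) % len(group_names) != 0:
        # Unequal group sizes need to be explained (item 3), so refuse silently
        # producing them here.
        raise ValueError("subject count does not divide evenly into groups")
    rng = random.Random(seed)
    shuffled = subjects[:]
    rng.shuffle(shuffled)
    size = len(shuffled) // len(group_names)
    return {name: shuffled[i * size:(i + 1) * size]
            for i, name in enumerate(group_names)}

assignments = randomize_to_groups([f"rat{i:02d}" for i in range(1, 16)],
                                  ["vehicle", "low", "high"], seed=42)
```

If the subject count doesn't divide evenly, the sketch refuses rather than quietly producing unequal groups, which is roughly the spirit of item 3's "needs to be explained" clause.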

Again, these are the "no duh!" issues in my world. Sticky peer review issues quite often revolve around people trying to get away with violating one or other of these things. At the very least reviewers want justification in the paper, which is a constant theme in these BJP principles.

The first item is a pain in the butt but not much more than make-work.

1. Experimental design should be subjected to ‘a priori power analysis’....latter requires an a priori sample size calculation that should be included in Methods and should include alpha, power and effect size.

Of course, the trouble with power analysis is that it depends intimately on the source of your estimates for effect size- generally pilot or prior experiments. But you can select basically whatever you want as your assumption of effect size to demonstrate a range of sample sizes as acceptable. Also, you can select whatever level of power you like, within reasonable bounds along the continuum from "Good" to "Overwhelming". I don't think there are very clear and consistent guidelines here.
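To make that dependence concrete, here is a minimal sketch of the standard normal-approximation sample size formula for a two-sample comparison; the effect sizes below are made-up illustrations, not estimates from any real pilot data:

```python
# Sketch: how the assumed effect size drives the a priori sample size.
# Normal-approximation formula for a two-sample comparison; the effect
# sizes are hypothetical, not drawn from any real pilot experiment.
from math import ceil
from statistics import NormalDist

def n_per_group(d, alpha=0.05, power=0.80):
    """Approximate per-group N for a two-sample test at effect size d (Cohen's d)."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # two-sided significance criterion
    z_beta = z.inv_cdf(power)            # desired power
    return ceil(2 * ((z_alpha + z_beta) / d) ** 2)

for d in (0.5, 0.8, 1.2):
    print(f"assumed d = {d}: about {n_per_group(d)} per group")
```

Halving the assumed effect size roughly quadruples the required N, which is why the choice of effect size estimate dominates everything else in the calculation.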

The fifth one is also going to be tricky, in my view.

Assignment of subjects/preparations to groups, data recording and data analysis should be blinded to the operator and analyst unless a valid scientific justification is provided for not doing so. If it is impossible to blind the operator, for technical reasons, the data analysis can and should be blinded.

I just don't see how this is practical with a limited number of people running experiments in a laboratory. There are places this is acutely important- such as when human judgement/scoring measures are the essential data. Sure. And we could all stand to do with a reminder to blind a little more and a little more completely. But this has disaster written all over it. Some peers doing essentially the same assay are going to disagree over what is necessary and "impossible" and what is valid scientific justification.
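For what it's worth, the blinded-analysis half of item 5 is cheap to implement: relabel the data with random codes and keep the key away from whoever does the analysis. A minimal sketch, with hypothetical sample labels:

```python
# Sketch of item 5's fallback: hide group identity behind random codes so
# the analyst works blind; a third party holds the key until analysis is
# complete. Sample labels here are hypothetical.
import random

def blind_samples(labeled_samples, seed=None):
    """Return (coded_ids, key). The analyst sees only coded_ids."""
    rng = random.Random(seed)
    codes = [f"code{i:03d}" for i in range(len(labeled_samples))]
    rng.shuffle(codes)
    key = dict(zip(codes, labeled_samples))  # kept away from the analyst
    coded_ids = sorted(key)                  # sorted order leaks nothing
    return coded_ids, key

coded_ids, key = blind_samples(
    ["vehicle_rat01", "vehicle_rat02", "drug_rat03", "drug_rat04"], seed=7)
```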

The next one is a big win for YHN. I endorse this. I find the practice of reporting any p value other than your lowest threshold to be intellectually dishonest*.

10. When comparing groups, a level of probability (P) deemed to constitute the threshold for statistical significance should be defined in Methods, and not varied later in Results (by presentation of multiple levels of significance). Thus, ordinarily P < 0.05 should be used throughout a paper to denote statistically significant differences between groups.

I'm going to be very interested to see how the community of BJP accepts* this.

Finally, a curiosity.

11. After analysis of variance post hoc tests may be run only if F achieves the necessary level of statistical significance (i.e. P < 0.05) and there is no significant variance inhomogeneity.

People run post-hocs after a failure to find a significant main effect on the ANOVA? Seriously? Or are we talking about whether one should run all possible comparison post-hocs in the absence of an interaction? (Seriously, when was the last time you saw a marginal-mean post-hoc used?) And isn't this just going to herald the return of the pre-planned comparison strategy**?
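Whatever the intended reading, the gating itself is simple to express. A sketch of item 11's logic using SciPy's one-way ANOVA, with item 10's fixed threshold, on made-up group data (Bonferroni-corrected pairwise t-tests stand in here for whichever post hoc test a real analysis would use):

```python
# Sketch: gate post hoc comparisons on the omnibus ANOVA, per item 11.
# Group data below are made-up illustration values, not from any experiment.
from itertools import combinations
from scipy.stats import f_oneway, ttest_ind

groups = {
    "vehicle": [4.1, 3.8, 4.4, 4.0, 3.9],
    "low":     [5.0, 5.3, 4.8, 5.1, 5.2],
    "high":    [6.2, 6.0, 6.5, 6.1, 6.3],
}

alpha = 0.05  # defined once, per item 10, and not varied later
f_stat, p_omnibus = f_oneway(*groups.values())

if p_omnibus < alpha:
    # Only now run pairwise post hoc tests, Bonferroni-corrected.
    pairs = list(combinations(groups, 2))
    for a, b in pairs:
        t, p = ttest_ind(groups[a], groups[b])
        print(f"{a} vs {b}: corrected p = {min(p * len(pairs), 1.0):.4f}")
else:
    print("Omnibus F not significant; no post hoc tests run.")
```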

Anyway I guess I'm saying Kudos to BJP for putting down their marker on these design and reporting issues. Sure I thought many of these were already the necessary standards. But clearly there are a lot of people skirting around many of these in publications, specifically in BJP***. This new requirement will stiffen the spine of reviewers and editors alike.

*N.b. I gave up my personal jihad on this many years ago after getting exactly zero traction in my scientific community. I.e., I had constant fights with reviewers over why my p values were all "suspiciously" p<0.05, and no backup from editors when I tried to slip this concept into reviews.

**I think this is possibly a good thing.

***A little birdy who should know claimed that at least one AE resigned or was booted because they were not down with all of these new requirements.

39 responses so far

Thought of the day

Dec 05 2014 Published by under Replication, ReplicationCrisis, Science Publication

One thing that always cracks me up about manuscript review is the pose struck* by some reviewers that we cannot possibly interpret data or studies that are not perfect.

There is a certain type of reviewer that takes the stance* that we cannot in any way compare treatment conditions if anything about the study violates some sort of perfect, Experimental Design 101 framing, even if there is no reason whatsoever to suspect a contaminating variable. Even if, and this is more hilarious, there are reasons in the data themselves to think that there is no effect of some nuisance variable.

I'm just always thinking....

The very essence of real science is comparing data across different studies, papers, paradigms, laboratories, etc and trying to come up with a coherent picture of what might be a fairly invariant truth about the system under investigation.

If the studies that you wish to compare are in the same paper, sure, you'd prefer to see less in the way of nuisance variation than you expect when making cross-paper comparisons. I get that. But still....some people.

Note: this in some way relates to the alleged "replication crisis" of science.

*Having nothing to go on but their willingness to act like the manuscript is entirely uninterpretable and therefore unpublishable, I have to assume that some of them actually mean it. Otherwise they would just say "it would be better if...". Right?

8 responses so far

The most replicated finding in drug abuse science

Ok, ok, I have no actual data on this. But if I had to pick the one thing in substance abuse science that has been most replicated, it is this.

If you surgically implant a group of rats with intravenous catheters, hook them up to a pump which can deliver small infusions of saline adulterated with cocaine HCl and make these infusions contingent upon the rat pressing a lever...

Rats will intravenously self-administer (IVSA) cocaine.

This has been replicated ad nauseam.

If you want to pass a fairly low bar to demonstrate you can do a behavioral study with accepted relevance to drug abuse, you conduct a cocaine IVSA study [Wikipedia] in rats. Period.

And yet. There are sooooo many ways to screw it up and fail to replicate the expected finding.

Note that I say "expected finding" because we must include significant quantitative changes along with the qualitative ones.

Off the top of my head, here are the types of factors that can reduce your "effect" to a null effect, or change the outcome to the extent that even a statistically significant result isn't really the effect you were looking for:

  • Catheter diameter or length
  • Cocaine dose available in each infusion
  • Rate of infusion/concentration of drug
  • Sex of the rats
  • Age of rats
  • Strain of the rats
  • Vendor source (of the same nominal strain)
  • Time of day in which rats are run (not just light/dark* either)
  • Food restriction status
  • Time of last food availability
  • Pair vs single housing
  • "Enrichment" that is called-for in default guidelines for laboratory animal care and needs special exception under protocol to prevent.
  • Experimenter choice of smelly personal care products
  • Dirty/clean labcoat (I kid you not)
  • Handling of the rats on arrival from vendor
  • Fire-alarm
  • Cage-change day
  • Minor rat illness
  • Location of operant box in the room (floor vs ceiling, near door or away)
  • Ambient temperature of vivarium or test room
  • Schedule- weekends off? seven days a week?
  • Schedule- 1 hr? 2hr? 6 hr? access sessions
  • Schedule- are reinforcer deliveries contingent upon one lever press? five? does the requirement progressively increase with each successive infusion?
  • Animal loss from the study for various reasons

As you might expect, these factors interact with each other in the real world of conducting science. Some factors you can eliminate, some you have to work around and some you just have to accept as contributions to variability. Your choices depend, in many ways, on your scientific goals beyond merely establishing the IVSA of cocaine.

Up to this point I'm in seeming agreement with that anti-replication yahoo, am I not? Jason Mitchell definitely agrees with me that there are a multitude of ways to come up with a null result.

I am not agreeing with his larger point. In fact, quite the contrary.

The point I am making is that we only know this stuff because of attempts to replicate! Many of these attempts were null and/or might be viewed as a failure to replicate some study that existed prior to the discovery that Factor X was actually pretty important.

Replication attempts taught the field more about the model, which allowed investigators of diverse interests to learn more about cocaine abuse and, indeed, drug abuse generally.

The heavy lifting in discovering the variables and outcomes related to rat IVSA of cocaine took place long before I entered graduate school. Consequently, I really can't speak to whether investigators felt that their integrity was impugned when another study seemed to question their own work. I can't speak to how many "failure to replicate" studies were discussed at conferences and less formal interactions. But given what I do know about science, I am confident that there was a little bit of everything. Probably some accusations of faking data popped up now and again. Some investigators no doubt were considered generally incompetent and others were revered (sometimes unjustifiably). No doubt. Some failures to replicate were based on ignorance or incompetence...and some were valid findings which altered the way the field looked upon prior results.

Ultimately the result was a good one. The rat IVSA model of cocaine use has proved useful to understand the neurobiology of addiction.

The incremental, halting, back and forth methodological steps along the path of scientific exploration were necessary for lasting advance. Such processes continue to be necessary in many, many other aspects of science.

Replication is not an insult. It is not worthless or a-scientific.

Replication is the very lifeblood of science.

*Rats are nocturnal. Check out how many studies**, including behavioral ones, are run in the light cycle of the animal.

**yes to this very day, although they are certainly less common now

21 responses so far