The Wikipedia entry on Generalization reads:
A generalization (or generalisation) is the formulation of general concepts from specific instances by abstracting common properties.
This is a very clean description of what many scientists think that they are about. I certainly do. I think that we are trying to use our experiments as specific instances from which to identify concepts and phenomena that have common properties with other situations not currently being tested. Thus our results should, we hope, generalize as predictions of what will happen in other situations.
Usually situations related to human health and behavior.
A recent paper by Voelkl and colleagues talks about this but totally borks the framing and terminology. They continually misuse "reproducibility" when they really mean to refer to generalization. And this harms science.
First, a quick overview. What Voelkl et al. present is a meta-analysis of published studies. This technique pools a host of different studies which use approximately the same methods to address approximately the same question. The outcome of such a meta-analysis can tell us if a given qualitative interpretation is more likely to be true than not (think of it as a box score of the outcomes, weighted by some qualities of the specific studies) and can estimate the effect size (the distance of the mean effect relative to the expected variation; Cohen's d is the version most comprehensible to me).
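For readers who want the effect size made concrete: Cohen's d is just the difference between two group means divided by their pooled standard deviation. A minimal sketch, with made-up numbers purely for illustration (nothing here comes from the Voelkl et al. data):

```python
import math

def cohens_d(group1, group2):
    """Cohen's d: difference in means relative to the pooled standard deviation."""
    n1, n2 = len(group1), len(group2)
    m1 = sum(group1) / n1
    m2 = sum(group2) / n2
    # Sample variances (Bessel-corrected)
    v1 = sum((x - m1) ** 2 for x in group1) / (n1 - 1)
    v2 = sum((x - m2) ** 2 for x in group2) / (n2 - 1)
    pooled_sd = math.sqrt(((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2))
    return (m1 - m2) / pooled_sd

# Hypothetical treatment vs. control measurements
treated = [12.1, 13.4, 11.8, 14.0, 12.9]
control = [10.2, 11.1, 9.8, 10.7, 11.5]
print(round(cohens_d(treated, control), 2))
```

The appeal of d is exactly what the parenthetical says: it expresses the mean difference in units of the variation you'd expect anyway, which is what lets a meta-analysis compare studies that measured different things on different scales.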
As you can imagine, in a typical meta-analysis the studies vary quite a bit in detail. Perhaps it is the strain of rat being used. Or the sex. Or the light cycle the studies were run in. Perhaps it is the time of year or the humidity of the building. And most frequently there is variation in the scientists who are doing the investigating.
Meta-analysis is a test of generalization!
This is critical.
The big point in the Voelkl paper is that individual papers which include data sets on the same topic from multiple labs track the meta-analytic result more closely. As the authors put it in the Discussion:
Using simulated sampling, we compared the outcomes of single- and multi-laboratory studies, using the same overall number of animals, in terms of their accuracy of effect size estimates (pc) and FNR. For these simulations, we chose to use a large sample of published data from preclinical studies to guarantee that the results reflect real-life conditions. We found that pc increased substantially with the number of participating laboratories, without causing a need for larger sample sizes. This demonstrates that using more representative study samples through multi-laboratory designs improves the external validity and reproducibility of preclinical animal research.
Well, no shit Sherlock. A multi-laboratory study is already a test of generalization. It says that the same qualitative interpretation can be drawn from the study regardless of variation in laboratory, personnel and probably some other key variables. Since this is also what the meta-analysis is testing, it is no surprise whatever that this would be the result.
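The logic of the paper's simulated sampling can be caricatured in a few lines. The generative model and every number below are my own illustrative assumptions, not the authors' actual data or methods: give each lab an idiosyncratic version of the true effect, then compare how far a single-lab estimate lands from the population-level effect versus splitting the same total number of animals across several labs.

```python
import random
import statistics

random.seed(1)

TRUE_EFFECT = 1.0   # population-level treatment effect (assumed)
LAB_SD = 0.5        # between-lab variation in the effect, i.e. lab idiosyncrasy
NOISE_SD = 1.0      # within-lab measurement noise
N_TOTAL = 60        # total animals per study, identical for both designs

def study(n_labs):
    """Estimate the effect using n_labs labs and N_TOTAL animals overall."""
    per_lab = N_TOTAL // n_labs
    estimates = []
    for _ in range(n_labs):
        # Each lab measures its own idiosyncratic version of the effect
        lab_effect = random.gauss(TRUE_EFFECT, LAB_SD)
        measurements = [random.gauss(lab_effect, NOISE_SD) for _ in range(per_lab)]
        estimates.append(statistics.mean(measurements))
    return statistics.mean(estimates)

def mean_abs_error(n_labs, n_sims=2000):
    """Average distance of the study estimate from the true population effect."""
    return statistics.mean(abs(study(n_labs) - TRUE_EFFECT) for _ in range(n_sims))

print("single lab :", round(mean_abs_error(1), 3))
print("five labs  :", round(mean_abs_error(5), 3))
```

With the same total sample size, the five-lab design reliably lands closer to the true effect, because averaging over labs cancels out the idiosyncratic offsets that a single lab is stuck with no matter how many animals it runs. Which is the point: the multi-lab study is a built-in test of generalization.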
But. These authors use "reproducibility". The Wikipedia entry on this topic is a disaster which conflates several key issues together, most pertinently generalization, reproducibility and replicability. It starts out okay:
Reproducibility is the ability to get the same research results or inferences, based on the raw data and computer programs provided by researchers.
Absolutely. Reproducibility is indeed the ability to reach the same conclusion (inferences) based on doing everything just like the other researchers did it. Great. It then immediately goes off the rails:
A related concept is replicability, meaning the ability to independently achieve non identical conclusions that are at least similar, when differences in sampling, research procedures and data analysis methods may exist.
What? That sounds more like a flexible version of reproducibility. If I had to try to parse out a difference for replicability I might observe that the term "replicates" gives us a clue. As it does further down in the Wikipedia entry, which now conflates the term repeatable with replicable.
The other component is repeatability which is the degree of agreement of tests or measurements on replicate specimens by the same observer in the same laboratory. ... Although they are often confused, there is an important distinction between replicates and an independent repetition of an experiment. Replicates are performed within an experiment.
Seriously, who is editing this thing? Replicable now equals repeatable, which asks whether all the subjects in your sample are doing the same thing, more or less. I can get behind this needing a separate term but can we just pick one please? And not confuse that with the issue of whether the scientific result ("inference") can be reproduced or will generalize?
Back to reproducibility.
A particular experimentally obtained value is said to be reproducible if there is a high degree of agreement between measurements or observations conducted on replicate specimens in different locations by different people
See how they've immediately diverged? This may or may not be generalization depending on what you call "replicate specimens". To most eyes this means a whole different experiment, which is for sure a test of generalization. Well, the entry immediately makes clear that the intent is to conflate:
in science, a very well reproduced result is one that can be confirmed using as many different experimental setups as possible and as many lines of evidence as possible
The line about "as many different setups as possible" is the essence of generalization. And if that isn't enough confusion, this sentence brings in converging evidence, which is another concept entirely!
Back to Voelkl et al.:
our results suggest that eliminating these and other risks of bias (e.g., low statistical power, analytical flexibility) is not sufficient to guarantee reproducibility; the results will remain idiosyncratic to the specific laboratory conditions unless these conditions are varied.
"Idiosyncratic" here means reproducible. It means that if you keep the conditions identical, you should be able to repeat the experiment over and over and come up with the same approximate finding ("inference"). This finding can be endlessly reproducible, be built on experiments that are highly replicable within the samples and still fail to generalize beyond the idiosyncratic way that a given lab chooses to run the experiment.
So why do I say this failure to be clear about what we mean harms science?
Well, we are deep in the midst of much furor about a "reproducibility crisis" in science. There isn't one. Or at least if there is one, it has not been demonstrated clearly. The low grade annoyance of writing and reviewing the NIH grant section on Rigor is not a huge deal (at least it hasn't been for me so far). But it is yet another thing for people to beat up grants, possibly for no good reason. On the other end of the scale this will eventually be grist for conservative Congress Critters trying to cut investment in research. Somewhere in between lies the goal of the BigPharma voices promoting the lie so as to further offload their research and development costs onto the public purse.
The more immediate problem is that if we are not clear about what we mean in this discussion, our solutions will never solve anything, and may even hurt. I believe that to some extent people are indeed accusing science of having a reproducibility problem. Meaning, one assumes, that significant amounts of published work come to inferences that cannot be sustained if the experiments are done in exactly the same way. The solution for this, one deduces, can only be that each lab must perform many replicate experiments to provide improved confidence in reproducibility prior to publishing. "Make those sloppy bastards repeat it six times and I won't have to work so hard to figure out how to get my experiment working", goes the thinking. I guess.
One interpretation of what Voelkl and colleagues are saying is that this won't help at all.
Besides known differences between the studies included in our analysis, such as the species or strain of animals (i.e., genotype) or reported differences in animal husbandry and experimental procedures, sources of variation included also many unknown and unknowable differences, such as the influence of the experimenter [38,39] or the microbiome, as well as subtle differences in visual, olfactory, and auditory stimulation. All those factors might affect treatment effects. Multi-laboratory designs are ideal to account for all of these sources of between-laboratory variation and should therefore replace standardized single-laboratory studies as the gold standard for late-phase preclinical trials.
If we don't do work in a way that can test how well a conclusion generalizes across these issues, we will never solve the real problem. We will not know the limits of said generalization (it is not one thing, btw), the key experimental factors and the irrelevant detail. Instead we will continue to promote a collection of arbitrary and highly constrained experimental parameters and talk as if surely our results will generalize to a treatment medication for humans in rapid order.
In point of fact working to improve reproducibility (as we all do!) may be directly opposed to improving generalization and thereby compromise translation to helping improve human health.
And despite where people in science are pointing the finger of blame (i.e., the reproducibility of inferences that we can make using precisely the same approaches), they are really motivated and angered by the lack of generalization.
Seriously, listen to what the scientists who are eager to be puppeted by Big Pharma have to say. Listen to their supposed examples that show "the problem is real". Look at what makes them really mad. Ask about their attempts to perform experiments related to the ones in the published literature that anger them so much. You will be more likely to conclude that they are not in fact miffed about directly reproducing a result. More often it is a failure to generalize beyond the original experimental conditions.
Voelkl B, Vogt L, Sena ES, Würbel H (2018) Reproducibility of preclinical animal research improves with heterogeneity of study samples. PLoS Biol 16(2): e2003693. https://doi.org/10.1371/journal.pbio.2003693