IACUC 101: Satisfying the erroneous inference by eyeball technique

I stumbled back onto something I've been meaning to get to. It touches on the ethical use of animals in research, the oversight process for animal research, and the way we think about scientific inference.

 

Now, as has been discussed here and there in the animal use discussions, one of the central tenets of the review process is that scientists attempt to reduce the number of animals wherever possible. Meaning without compromising the scientific outcome, the minimum number of subjects required should be used. No more.

[Figure: physioprofitinErrBars-1.jpg, captioned "run more subjects.."]

We accept it as more or less bedrock that a result counts if it meets the appropriate statistical test at the standard of p < 0.05. Meaning that if you sampled the set of numbers that you have sampled 100 times from the same underlying population, fewer than five times would you get the result you did by chance. From which you conclude it is likely that the populations are in fact different.
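That definition is easy to check by brute force. A minimal stdlib-Python sketch (a z-test on simulated normal data, standing in for whatever test you would actually run; all numbers are illustrative): when the null is true, the test should flag roughly 5% of samples, no more and no less.

```python
import math
import random

def z_test_p(sample, mu0=0.0, sigma=1.0):
    """Two-sided p-value for a z-test of the sample mean against mu0,
    assuming the population sd (sigma) is known."""
    n = len(sample)
    z = (sum(sample) / n - mu0) / (sigma / math.sqrt(n))
    # Phi(x) = 0.5 * (1 + erf(x / sqrt(2))); two-sided tail area:
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

random.seed(1)
trials = 20000
false_pos = sum(
    # Each sample is drawn from the SAME population the null describes:
    z_test_p([random.gauss(0.0, 1.0) for _ in range(10)]) < 0.05
    for _ in range(trials)
)
rate = false_pos / trials
print(rate)  # hovers near 0.05, as the definition promises
```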

 

There is an unfortunate tendency in science, however, to believe that if your statistical test returns p < 0.01, this result is better. Somehow more significant, more reliable or more... real. On the part of the experimenter, on the part of the supervising lab head, on the part of paper reviewers and on the part of readers. Particularly the journal club variety.

False.

[Figure: physioprofitinErrBars-2.jpg, captioned "p < eleventy dude!"]

I think this is intellectually dishonest. I mean, fine, there may be some assays and data types (or experiments) that essentially require you to adopt a different criterion to accept a result as arising from something other than chance. But you should have consistent standards, and in the vast majority of cases that standard is going to be p < 0.05. Meaning that if pressed, you are willing to publish that result and willing to act as if you believe that result as firmly as you believe any other result. Trumpeting your p < 0.001 result as if it is somehow more real, however, is trying to say that you had a more stringent criterion in the first place. Which you most certainly did not. So it is dishonest. Within scientists, within fields and across science as a whole.

If p < 0.05 is the standard, then all else is gravy.

 

As anyone who has done any work with animals knows, in a whole bunch of cases you can lower that p-value simply by running more subjects. In fact it is not unheard of for PIs to tell their trainees to run a few more subjects to make the p-values (or error bars, same principle) "look better".
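The "run more subjects" effect is easy to demonstrate in simulation. A sketch in stdlib Python (a z-test on simulated data with a made-up true effect of 0.8 sd; the effect size and group sizes are illustrative, not from any real assay): the identical underlying effect yields ever-smaller typical p-values as N grows.

```python
import math
import random
import statistics

def z_test_p(sample, mu0=0.0, sigma=1.0):
    """Two-sided p-value for a z-test of the sample mean against mu0."""
    n = len(sample)
    z = (sum(sample) / n - mu0) / (sigma / math.sqrt(n))
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

random.seed(2)
EFFECT = 0.8  # hypothetical true shift from the null, in sd units
medians = {}
for n in (6, 12, 18):
    # 2000 simulated experiments at each group size:
    ps = [z_test_p([random.gauss(EFFECT, 1.0) for _ in range(n)])
          for _ in range(2000)]
    medians[n] = statistics.median(ps)
    print(n, round(medians[n], 4))
```

Same effect, same population; only N changed. Nothing got any more "real".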

 

"Look better" means, "we don't actually make our inferences by statistics at all, what we actually believe in is the significant-by-error-bar-eyeball technique".

So imagine yourself on an Institutional Animal Care and Use Committee. One of the things you are supposed to evaluate is whether the number of rats proposed for studying, say, mephedrone is excessive. Roughly speaking, let us stipulate that N=6 gives us the minimum power required to get a significant p-value; N=12 is robust, with a decent chance of p < 0.01. But the PI is asking for N=18 per group so that the error bars look super tight and that notorious Reviewer #3 won't complain that the p < 0.05 doesn't seem real to him.
What group size are you going to approve? On what basis? How do you reconcile Reduction with the eyeball inference technique?
__
By all means take a stab at interpreting the graphical results derived from a repeated-measures study involving two timepoints. Which one is the significant result? [Update: I initially forgot to mention the bars are SEM]

48 responses so far

  • Joe says:

    I see you are among the masses that prefer using SEM for error bars. Perhaps you could elaborate on that choice compared to, for example, SD or 95% CI?
    If your goal is to visually indicate 'statistical significance', then 95% CI is the preferred choice for error bars. Why didn't you choose that? You wanted to summarize the data distribution? SD best summarizes the distribution. But I know those error bars in your graph are not SD because if they were they wouldn't have improved so much with an increase in N (especially with an N above 10). SEM is good for indicating the 'accuracy' of the mean value. But I don't think the point of your graph is that physioprofitin is precisely 50 under condition 1 and 64.5 in condition 2. Because if that were your goal you would have used a table.
    And there are other types of plots you could have chosen: box and whisker, vertical scatter. Why did you choose the plot you did, DM?
    If you're gonna lecture us on statistics, dude, you're gonna have to answer the tough questions.
    Otherwise, let's have something more honest and useful. Something for which you might be qualified, which might spark a lively discussion, and which furthers the primary goal of this blog (facilitating the careers of readers). Something like the grant-writing tips & tricks that are the best entries in this blog. Something like: 'Best visual methods for impressing your audience with inherently unimpressive data'.
    Or maybe that's what you're getting at, in a roundabout way?

  • Joe says:

    Oh, and for IACUC, I always just argue for more animals by saying that super-duper nice-looking graphs are important to ensure that the results 'are publishable in a high-quality peer-reviewed journal'. Which implies (correctly) that the lives of the animals I am asking to use will not have been wasted. Doing crappy experiments that aren't publishable is a waste of animal lives, no matter how few you use.

  • whimple says:

    Ridiculously small P values are embarrassing to publish because they mean you tried way too hard, used way more animals than you needed to, were too ignorant to do an a priori power analysis, should have published long ago, etc.

  • bsci says:

    I disagree with you here. p

  • bsci says:

    Damn less than signs. Here's what I actually wrote:
    I disagree with you here. p < 0.001 absolutely means something different than p < 0.05. It means that there's a 1/1000 rather than a 1/20 chance that this result was random noise. That's real information that is reasonable to include in the article. As whimple says, in some cases these shouldn't come up. Still, let's say your power analysis is defined to identify a certain effect. This defines the number of samples. Another interesting result might have been found with fewer samples, but, since more were collected to help identify the primary effect, the p value is lower.
    On the bigger question of # of animals, you're on a bit of a slippery slope. Let's say you want to only use the precise number of animals that gets you p < 0.05 and no more. The way to do that is to recompute the statistics for every additional animal. Some additions might make the p value go up and some make it go down. You stop as soon as you get p < 0.05. This process seriously biases the statistics because you are letting the current p value define whether you collect more data and you'll always stop after getting a particularly "good" data point.
    In an ideal world, like whimple notes, a power analysis would define the # of needed animals and you never change that number. In reality, it's possible that the first bunch of data showed that some aspect of the power analysis was wrong (i.e. the noise is higher than expected). The power analysis could be revised, but adding 1 more animal at a time is a terrible idea.
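bsci's add-one-and-retest scenario can be simulated directly. A sketch in stdlib Python (z-test on null data, i.e. there is no real effect anywhere; the starting N, cap, and trial counts are made up): start at N=6, retest after every added subject, and stop at p < 0.05 or at N=30. The nominal 5% false-positive rate inflates badly.

```python
import math
import random

def z_test_p(sample, sigma=1.0):
    """Two-sided p-value for a z-test of the sample mean against zero."""
    n = len(sample)
    z = (sum(sample) / n) / (sigma / math.sqrt(n))
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

random.seed(3)
runs, hits = 5000, 0
for _ in range(runs):
    data = [random.gauss(0.0, 1.0) for _ in range(6)]  # null is true
    while True:
        if z_test_p(data) < 0.05:   # peek at the stats...
            hits += 1               # ...and stop on a "good" result
            break
        if len(data) >= 30:
            break
        data.append(random.gauss(0.0, 1.0))  # "run a few more subjects"
rate = hits / runs
print(rate)  # well above the nominal 0.05
```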

  • DrugMonkey says:

    Sorry Joe, I forgot to say
    -bars are SEM in both graphs and the N is the same. They are intended to represent independent experiments in which the outcome was different
    -means are 55.2 and 70 in the first graph and 50 and 65 in the second graph.
    Doing crappy experiments that aren't publishable is a waste of animal lives
    And therein lies the rub, eh? I totally agree with you that taking "reduction" to the edge of experimental meaninglessness is bad from an animal use perspective. But I also think that letting supposed visual attractiveness to statistically-underinformed viewers drive the sample size isn't great *either*. And the trick is: where is the line to be drawn? Peer reviewed publication, regardless of IF, is a pretty big watershed if you ask me. Debates over quality and "impact"? Pretty hard to use this as a bright line.
    (and you *do* know that the degree of obsession reviewers have with sample sizes, power, etc is inversely correlated with IF, right? just another little irony of the Glamour game...)

  • School Marm says:

    You may avoid much potty talk and frustration by simply typing &lt; (ampersand-lt-semicolon) in place of the less-than sign, I believe. Unless MT doesn't parse this (<) in the comments.

  • I suppose once I see p < 0.05, what I'm really interested in is the R^2. Besides, a difference between p = "eleventy" and p < 0.05 might say as much about the underlying statistical model as about anything that's biologically significant.

  • becca says:

    You seem to be implying that if p is less than 0.05, you believe it, if p is greater than 0.05 you don't, and that's that.
    This is incorrect.
    People, perhaps ironically especially scientists, are really bad at handling uncertainty. Statistics are a set of methods for managing and comparing uncertainty, not a way to eliminate it. In the best of all possible worlds, the human brain would be well equipped to believe in things conditionally, and with more certainty for lower p values. In the real world, we also tend to factor in a number of 'prior probability' factors like how much we believe things in this journal, from this lab, using this method, etc. Though typically this is done ad hoc and not really rigorously.
    I feel like if we were all just better designed for intuitive grasp of truth-as-a-gradient (not in a shady pomo way, just as a "you can measure uncertainty" statistically valid way), the whole thing would be much easier.
    Anyway, in answer to your question: in the best of all possible worlds, I'd approve n=6. But I suppose it might be worthwhile to take into account the scientific context as well. How novel/controversial is the result? If p less than 0.01 is out in the lit, is it going to decrease the odds somebody tries to include a reproduction of the work in their next study (yeah, nobody repeats-exactly animal work, but it's not at all uncommon, nor is it bad for science, to measure the same parameter somebody else measured while you measure something new)? Is reviewer #3 only going to turn up if they submit to glamor mag? Is the work only going to get read if they submit to glamor mag?
    There are a lot of things that *could* justify ok'ing n greater than 6. But in most cases, the result is going to be new but not earthshaking, people will believe it if p is less than 0.05, and while they might have to argue with reviewer #3, at least they'll have the IACUC's decision in writing to argue with. AND the poor grad student doing the experiments might graduate in 4 years instead of 12.

  • Joe says:

    Statistics are a set of methods for managing and comparing uncertainty, not a way to eliminate it

    Actually, dear becca, there is nothing uncertain about p values. Neither they nor any other statistical measure represents uncertainty. To the contrary, they represent absolute certainty about the probability of something happening (assuming all conditions are met etc etc.)
    For example, I can tell you with absolute certainty that, using an unbiased 6-sided die, that I will roll a three approximately 1/6 of the time. I can't tell you whether any particular roll of the die will yield a three, but that's not the province of statistics. Statistics describes populations of numbers.
    Because replicability is a foundation stone of science, and because measurement variability is an unavoidable part of reality, we must deal with populations of nonidentical numbers in science. It is easy to compare individual numbers. Three is larger than two. Seventeen is not equal to fifteen. But populations of numbers are not so easy. Is a mean value of three larger than a mean value of two? Is a mean value of seventeen no different than a mean value of fifteen? I dunno, it's hard to tell. That's where statistics comes in.
    Therefore, dear becca, I think the word you were perhaps searching for was 'variability'. Statistics is a method for dealing with variability. The human mind is not good at dealing with variability. Maybe this is because we have short memories or biased memories or short attention spans or whatever. It doesn't matter. Fortunately, we have statistics. Unfortunately, too many people twist statistics to fit their biases, instead of properly redirecting their biases to fit the statistics.

  • whimple says:

    Joe, are you this patronizing to everyone, or just to becca? Maybe just to girls? Hope not.

  • Joe says:

    I am, dear whimple, equally patronizing to everyone. Thanks for your interest in my attributes. I appreciate the attention.

  • whimple says:

    I just asked because, you know, you seemed so interested in getting student feedback on your pedagogical techniques and all. (see also: "Scholars and Teachers on Divergent Paths")

  • Joe says:

    Oh, yea. Thanks, whimple. I'll work on that.
    Not to deflect or anything, but I've been vomiting pontificatory comments like a madman the last few days and Comrade Physioprof has still not yet told me to get my own fucking blog. What's up with that? Is he sick? I'm starting to get concerned.

  • Not to deflect or anything, but I've been vomiting pontificatory comments like a madman the last few days and Comrade Physioprof has still not yet told me to get my own fucking blog. What's up with that? Is he sick?

    So long as Shitlin the Gibbering Fuck-Up stays gone, I'm not looking a gift horse in the mouth.

  • FB says:

    bsci,
    "[p less than 0.001] means that there's a 1/1000 rather than a 1/20 chance that this result was random noise."
    No, it doesn't. It means that there is less than a 1/1000 chance that, if the null hypothesis were true, you'd see data as extreme as what you observed (as measured by your chosen test statistic). This is different from "the chance the result was random noise", i.e., the probability the null hypothesis is true. (To even define that, you have to be a Bayesian.) p-values don't and can't tell you the probability of the null hypothesis.

  • becca says:

    whimple- thank you for your concern. Joe is just mad because he got some antonyms mixed up and then made the mistake of walking into a situation where I could compare him to larry the cable guy (during one of larry's dimmer moments). Therefore he has to patronize to make himself feel better. Or perhaps he thinks we have established a bantering rapport, and he can thus entertain me with his attempts to banter back.
    joe- I meant what I said and I said what I meant, a becca is faithful, one hundred percent. No uncertainty or variability about it.
    😉
    You are correct that we are sometimes bad at dealing with variability per se. Obviously, if you take enough data points, you won't be able to remember the outcomes of every trial, and statistics do help with that. Statistics can give you a summary of all the numbers you've obtained, and some idea whether you have likely obtained enough to get an idea of the population of numbers you could have obtained.
    However, I truly meant that people don't want to deal with uncertainty. The trouble is, when you are looking at a small number of variable datapoints, you have no trouble encompassing the entire dataset. It's not that you need the stats because your mind can't encompass the variability. You need the stats because, given the variability, you don't know what you can safely conclude... that is, you want a *certain* conclusion, and the variability is an obstacle. The statistics are a tool to help you get a handle on that variability such that you can gain certainty. Of course, if our minds were perfect, 'certainty' would always be a continuum, not a binary.

  • antipodean says:

    lemme get this straight... because this is fucking horrifying.
    You run an experiment and then run the stats and get a p of a level not low enough. You then make a decision to not run another experiment but to just run a few more subjects instead, pretend they are from the same experiment, and then REANALYSE THE SAME DATA?
    Is this sort of cooking the books standard practice in 'don't understand stats/experimental rigour/science' land?

  • tideliar says:

    Yeah, Antipodean, unfortunately this happens a lot.
    "6 mice should do it...Oh shit, p = 0.08. Um, do some more and see if you can chuck any of those outlier data points away..."
    Now I work more with clinicians and translational scientists, so I see it done right more frequently, because they'll actually get an epidemiologist or biostatistician/medical statistician to come in and help do it right.

  • cycloprof says:

    The cooking the books examples seem to be gross oversimplifications of physioproffitin researchin in animal models. E.g. How is running more subjects different than running an experiment over a course of 12 weeks with the cohort being expanded every two weeks (because it is physically impossible to run the entire experiment at one time)? This is also done in clinical/"translational" settings as "subjects" have lives too and cannot simply be called up from the "colony" as soon as the investigator is ready.

  • neurolover says:

    "How is running more subjects different than running an experiment over a course of 12 weeks with the cohort being expanded every two weeks (because it is physically impossible to run the entire experiment at one time)?"
    Running more subjects is different if you determine when you're going to stop from the results of your statistical test, including when you'll run another experiment.
    Physiologists do this with other n's (like cells, or samples) and it is also bad, exerting a perverting effect on the literature that we "correct" for by applying the additional biases of which becca speaks (i.e. whether we believe a result, an author, a journal, a theory . . . .).

  • DrugMonkey says:

    Indeed cycloprof. But there is a very real pitfall in just running until you get The p-value you want.
    How to not put a thumb on the scale while not wasting the subjects you have already used?

  • Alex says:

    As a physicist, the thing we are always most stringent on is units. What are the units of Physioprofitin? I'd assume something like fucktillions of molecules per bottle of Jameson.

  • DrugMonkey says:

    I was gonna write another post but I'm kinda busy...
    Next Hint.
    In the two repeated-measures, equal N studies depicted above (mean differences similar, as reported in Comment # 6) the stats I ran reported
    p < 0.002 and p < 0.001
    F-ratios of 192 and 225 respectively.
    Whut?

  • antipodean says:

    If it's repeated measures then the data are plotted either badly or wrongly. Also what the fuck is an F stat doing near a two measures per animal data-set? Why not just use a paired t-test? If the experiment is nicely set up why not take advantage- especially in animals where you have controlled most of the hard-core variance we have to deal with in 'wild-type' humans?
    I would suspect that the old clinician's standby of plotting all of the data points might work nicely in this case. I would guess that Fig1 will show parallel bands with a very regular delta effect but somewhat differing baseline and endpoint measurements.
    Cycloprof. If you have a negative-data experiment on file, try this for an exercise. Analyse your data after every subject finishes. In a number of places you will have a significant effect. Even in large simulated data sets which have been programmed to have no effect you will still have these positive effects emerging after some subjects finish. This is why you don't keep analysing data and then just adding a few more to cook the books.
    As tideliar has hinted, this behavior is regarded as research misconduct in human medical studies. You have to plan your analyses before you start. Once you've used up your p

  • Neuro-conservative says:

    In general I agree with antipodean's response @25, but note that a paired t-test is identical to a repeated-measures ANOVA with no between-subjects factor. The t statistic would be the square root of the F and would yield the same p-value.
    The figures provided by DM are counter-intuitive because the variance is shown at each time point, but the statistic is testing the variability of the delta.
    With a small number of subjects, it would be most informative to just plot each pair of points, as suggested by antipodean.
    Because it is the weekend I will not get into random effects models.
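Neuro-conservative's equivalence (a paired t-test versus a two-level repeated-measures ANOVA with no between-subjects factor, so that t^2 = F) can be verified numerically. A sketch in stdlib Python with made-up, correlated two-timepoint data:

```python
import math
import random
import statistics

def paired_t(x, y):
    """Paired t statistic: mean difference over its standard error."""
    d = [a - b for a, b in zip(x, y)]
    n = len(d)
    return statistics.mean(d) / (statistics.stdev(d) / math.sqrt(n))

def rm_anova_F(x, y):
    """F for one within-subject factor with two levels, no between factor."""
    n = len(x)
    allv = x + y
    grand = statistics.mean(allv)
    m1, m2 = statistics.mean(x), statistics.mean(y)
    subj = [(a + b) / 2 for a, b in zip(x, y)]         # subject means
    ss_cond = n * ((m1 - grand) ** 2 + (m2 - grand) ** 2)   # df = 1
    ss_subj = 2 * sum((s - grand) ** 2 for s in subj)       # df = n - 1
    ss_total = sum((v - grand) ** 2 for v in allv)
    ss_err = ss_total - ss_cond - ss_subj                   # df = n - 1
    return (ss_cond / 1) / (ss_err / (n - 1))

random.seed(4)
x = [random.gauss(50, 5) for _ in range(8)]        # timepoint 1
y = [v + random.gauss(15, 3) for v in x]           # correlated timepoint 2
t = paired_t(y, x)
F = rm_anova_F(x, y)
print(t * t, F)  # t**2 equals F up to floating-point rounding
```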

  • hibob says:

    @antipodean:"This is why you don't keep analysing data and then just adding a few more to cook the books. As tideliar has hinted this behavior is regarded as research misconduct in human medical studies. You have to plan your analyses before you start."
    It may be regarded in theory as research misconduct in human drug trials, but a JAMA study comparing published articles to their (in theory) required ClinicalTrials.gov registry descriptions suggests that changing the primary endpoints during or after data acquisition is rampant. Changing or omitting primary endpoints pretty much renders moot any power analyses conducted before the data was unblinded, and introduces biases even more readily than adding a few more subjects to cook the books, but doesn't seem to be a problem with either IRBs or high impact medical journals:
    Comparison of Registered and Published Primary Outcomes in Randomized Controlled Trials
    JAMA. 2009;302(9):977-984.
    "Context As of 2005, the International Committee of Medical Journal Editors required investigators to register their trials prior to participant enrollment as a precondition for publishing the trial's findings in member journals."
    "Results Of the 323 included trials, 147 (45.5%) were adequately registered (ie, registered before the end of the trial, with the primary outcome clearly specified). ... Among articles with trials adequately registered, 31% (46 of 147) showed some evidence of discrepancies between the outcomes registered and the outcomes published. The influence of these discrepancies could be assessed in only half of them and in these statistically significant results were favored in 82.6% (19 of 23)."
    "Conclusion Comparison of the primary outcomes of RCTs registered with their subsequent publication indicated that selective outcome reporting is prevalent."

  • Bryan says:

    Why not frame it as an effect size (standardized mean difference) and be done with it?
    One could lower the error bars by better control as well.

  • DrugMonkey says:

    I don't see where effect size helps. Consider an example of weight loss in overweight individuals ranging from 75 lbs (a 5 yr old) to 250 lbs (adult man). A 10 lb loss is a dinky effect size, no? But if produced consistently across body size by some X factor, this would be hugely important.
    Or am I missing a different approach to effect size in a repeated measures design?

  • Neuro-conservative says:

    What are you talking about, DM?? The boy and the man each lose 10lbs? Wouldn't that be really weird?

  • Bryan says:

    It would depend on the standard deviation.
    10 pound mean difference in weight loss divided by a 5 pound standard deviation would be quite impressive. Divided by a 50 pound SD and it would be a small effect.
    In your example it would be the standard deviation of the weight-loss difference score before and after whatever the treatment was.
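Bryan's point in one sketch (stdlib Python, with made-up difference scores): the same 10 lb mean loss is a huge or a modest standardized effect depending entirely on the spread of the difference scores.

```python
import statistics

def cohens_d_paired(diffs):
    """Standardized mean difference for a paired design:
    mean of the difference scores over their standard deviation."""
    return statistics.mean(diffs) / statistics.stdev(diffs)

# Hypothetical weight-loss difference scores (lbs), mean of 10 in both:
tight = [9, 11, 10, 8, 12, 10, 9, 11]        # sd ~ 1.3: huge effect
loose = [-40, 60, 5, 25, -15, 45, -10, 10]   # sd ~ 33: modest effect
print(round(cohens_d_paired(tight), 2), round(cohens_d_paired(loose), 2))
# -> 7.64 0.31
```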

  • Bryan says:

    DM:
    I was gonna write another post but I'm kinda busy...
    Next Hint.
    In the two repeated-measures, equal N studies depicted above (mean differences similar, as reported in Comment # 6) the stats I ran reported
    p

  • Alex#23:
    ---"What are the units of Physioprofitin?"---
    Muppethuggin' dollars, man. Of course, in the metric system it would be muppethuggin' euros or something like that.
    I think the IUPAC (International Union for Physioprofitin Unit Consensus) has suggested that units be commonly referred to as stinkin' PUs....currently I believe 1 stinkin' PU = 0.74945 metric stinkin' PUs.
    Hope that helps.

  • antipodean says:

    hibob
    The reason we can quantify this cooking of the books in clinical trials is because of the clinical trials registers.
    It's seriously fucked up, I agree. But with the registers we can now finger the bastards. That's why the analysis of this shit was in JAMA and why the major medical journals now have explicit instructions to reviewers about checking the clinical trials register.
    Cooking the books in one example does not make cooking the books in another example OK.
    DM. Effect size is the mean effect divided by the standard deviation of that population/sample at baseline. Like Bryan said a 10lb difference is big or small depending on what the variability was at baseline.

  • Dude, what are you feeding your 5 year olds? Nothing but duck cake?

  • Isis the Scientist says:

    Don't hate on the duck cake.

  • becca says:

    Wasn't intended as hate on the duck cake! The duck cake looked DELICIOUS. That's why I think eating enough of them would turn a five year old into a fattyboombatty. I love my chubster little baby, but even *I* might worry, a little, about a 75lbs 5 year old.

  • Cyan says:

    Silly p-valuists. The mighty likelihood function cares not for your stopping rule!

  • DSKS says:

    There's an article in Science News covering some of these issues...
    "Odds Are, It's Wrong"

  • Bob O'H says:

    Silly p-valuists. The mighty likelihood function cares not for your stopping rule!

    FTW. But only amongst the terminal cognoscenti.
    For me the answer is that the IACUC should listen to a statistician. We're just talking about experimental design, so either run your power analysis, or define a stopping rule and run a sequential design.
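For what it's worth, the power-analysis route Bob O'H mentions doesn't require special software. A Monte Carlo sketch in stdlib Python (z-test with a made-up 1.2-sd effect; a real protocol would substitute its own test and effect estimate): find the smallest N reaching 80% power at alpha = 0.05, and file that number with the IACUC before any data exist.

```python
import math
import random

def z_test_p(sample, sigma=1.0):
    """Two-sided p-value for a z-test of the sample mean against zero."""
    n = len(sample)
    z = (sum(sample) / n) / (sigma / math.sqrt(n))
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

def power(n, effect, sims=3000, seed=0):
    """Fraction of simulated size-n experiments reaching p < 0.05."""
    rng = random.Random(seed)
    hits = sum(
        z_test_p([rng.gauss(effect, 1.0) for _ in range(n)]) < 0.05
        for _ in range(sims)
    )
    return hits / sims

# Smallest group size with >= 80% power for the assumed effect:
for n in range(4, 13):
    if power(n, effect=1.2) >= 0.80:
        print("use n =", n)
        break
```

Decide N this way up front and the stopping-rule problem never arises.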

  • anon says:

    yeah, why can't people just do a simple power analysis to determine sample size? p<0.05 is an established standard, but no one ever talks about a standard for beta.

  • Lamar says:

    1. never did a single power analysis until late in my postdoc (because I was asked to, not that I used it). just being honest: seemed like BS. you have to estimate variances. sometimes hard to get variances until you actually have some data. so I plug in my estimates...hmm...n of 8 per group? OK fine. that's about what I'd use anyway. now, is it really unethical to use 7 or 9? I just find that ridiculous.
    2. I've heard both sides of the behavioral neuroscience coin in this regard: one side says order 30 rats and do the entire study in one fell swoop. god forbid don't ever try to repeat an experiment with a new batch of animals because you risk not being able to repeat the results (I always found this absurd). the other side says split the study into, say, at least 2-3 batches/groups. this might be absolutely necessary if you only have, say, 8 self-administration boxes. the first batch is pilot, you also might get a couple animals die in your first group, etc;...behavioral neuroscience just isn't always as clean as you'd like. it's not cell culture. in many cases, the first bullet point on the future directions slide in most behaviorist's works-in-progress talks is "add more n". is this cooking the books or just finishing an experiment?

  • Cyan says:

    FTW. But only amongst the terminal cognoscenti.

    Apparently I'm so "in the know" I just might die from it. Oh internet, I wish I knew how to quit you.

  • whimple says:

    yeah, why can't people just do a simple power analysis to determine sample size? p
    Standard is power (1 - beta) = 0.8, alpha = 0.05

  • jda says:

    Read more about p values. They are more subtle than many of us are taught:
    http://library.mpib-berlin.mpg.de/ft/gg/GG_Null_2004.pdf

