# Brief data analysis interlude

I have a trainee running a study in which she is examining the effects of methamphetamine on Bunny Hopping using the established open field to hedgerow assay. The primary dependent variable is escape latency from stimulus onset to crossing the plane of the hedge.

She is examining the effects of a locomotor stimulant dose of methamphetamine derived from her pilot dose-response study versus vehicle in groups of Bunnies which have been trained for six weeks in our BunnyConditioning Model and age-matched sedentary Bunnies. (The conditioning training consists of various sprint, long run, horizontal hop and vertical leap modules.)

So we have four groups of Bunnies as follows:
1. Conditioned, Vehicle
2. Conditioned, Meth
3. Sedentary, Vehicle
4. Sedentary, Meth

The trainee is actually a collaborating trainee and so these data involve the analytic input of multiple PIs in addition to the trainee's opinion. We are having a slight disagreement over the proper analysis technique so I thought I would turn to the brilliant DM readers.

• KBHC says:

I would do t-tests AND ANOVA. It's not like it costs you anything to run it a few different ways, and each test tells you something slightly different.

• ginger says:

Two-way ANOVA, because if you don't care about the possibility of interaction, there's kinda no point to manipulating both variables simultaneously.

(I'm assuming that the underlying distribution of escape latency is reasonably normal or can be transformed to normalish.)

• DrugMonkey says:

two-way ANOVA on repeated measures

Generalized linear model

One way ANOVA with multiple post tests

regression

You didn't say if the dependent variable has a normal distribution

repeated meas to control for random bunny effects, with 2 fixed factors

• DrugMonkey says:

It’s not like it costs you anything to run it a few different ways

Is it statistically dishonest to run a bunch of analyses and simply pick the one that comes out the way you want it to?

• DrugMonkey says:

Two-way ANOVA, because if you don’t care about the possibility of interaction, there’s kinda no point to manipulating both variables simultaneously.

Is this the only way to determine if there is an interaction, ginger?

• Bob O'H says:

Yes. But it's statistically honest to explore your data. And you missed out the correct analysis - plotting your data.

• pinus says:

you do a two way anova, and then follow up with some planned t-tests...no?

• DrugMonkey says:

"follow up" is, in my view, incompatible with "planned". Planned comparisons are decided in advance of running the ANOVA and do not, to my view, depend on finding significant main effects to be conducted. These are actually pretty rare in my area.

"followup" implies post-hoc test strategy where you are adjusting for all possible comparisons. this I find to be much more common. Oddly, really, because in most multi-level / multi-factor designs common to my subfields, you are accounting for a bunch of comparisons that are illogical and you would never actually report as an effect of interest.

Personally I think we should do much more with pre-planned comparisons and am not sure why we do not.

• Bob O'H says:

1. This is a designed experiment, so you should have decided what the main analysis would be before the experiment.
2. What are your biological questions? I'm guessing you want to know if meth has a different effect in sedentary and conditioned bunnies. If so, that's an interaction between 2 variables, which is pointing towards a two-way ANOVA with an interaction.
3. I don't see where the repeated measures is coming from - have you not mentioned part of the design? Your IRB will be most happy if you neglected to tell them about repeated bunny hops: they would have wanted to watch.
4. Was age matching done so that the distributions of ages are the same, or was each bunny matched to 3 other bunnies by age? Either way, should you add age as a covariate?
5. Were the bunnies cask conditioned?
6. 2-way ANOVAs are Generalized linear models. GLMs are great (sad news - one of their inventors died a couple of weeks ago), and they include a pile of useful models - even t-tests.
7. As I mentioned above, plot the data.

If you want some more help and advice feel free to email me. You can pay me in batches of coney stew.
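For anyone who wants to see the moving parts, the two-way ANOVA with interaction from point 2 can be sketched in plain Python. The latencies below are invented purely for illustration, the term names (`conditioning`, `drug`) are illustrative labels, and the formulas assume a balanced design with the same number of bunnies in every cell.

```python
from statistics import mean

# Hypothetical escape latencies in seconds, invented purely for illustration.
data = {
    ("conditioned", "vehicle"): [2.1, 2.4, 1.9, 2.6, 2.2, 2.3],
    ("conditioned", "meth"):    [1.6, 1.8, 1.5, 2.0, 1.7, 1.9],
    ("sedentary", "vehicle"):   [3.0, 3.4, 2.8, 3.1, 3.3, 2.9],
    ("sedentary", "meth"):      [2.9, 3.2, 2.7, 3.5, 3.0, 3.1],
}

def two_way_anova(data):
    """Balanced two-way ANOVA with interaction: SS, df and F for each term."""
    a_levels = sorted({a for a, _ in data})
    b_levels = sorted({b for _, b in data})
    n = len(next(iter(data.values())))  # bunnies per cell
    assert all(len(v) == n for v in data.values()), "balanced design assumed"

    cell = {k: mean(v) for k, v in data.items()}
    grand = mean(cell.values())  # equals the overall mean when balanced
    a_mean = {a: mean(cell[(a, b)] for b in b_levels) for a in a_levels}
    b_mean = {b: mean(cell[(a, b)] for a in a_levels) for b in b_levels}

    ss = {
        "conditioning": n * len(b_levels)
        * sum((a_mean[a] - grand) ** 2 for a in a_levels),
        "drug": n * len(a_levels)
        * sum((b_mean[b] - grand) ** 2 for b in b_levels),
        "interaction": n * sum(
            (cell[(a, b)] - a_mean[a] - b_mean[b] + grand) ** 2
            for a in a_levels for b in b_levels
        ),
        "error": sum((x - cell[k]) ** 2 for k, v in data.items() for x in v),
    }
    df = {
        "conditioning": len(a_levels) - 1,
        "drug": len(b_levels) - 1,
        "interaction": (len(a_levels) - 1) * (len(b_levels) - 1),
        "error": len(a_levels) * len(b_levels) * (n - 1),
    }
    ms_error = ss["error"] / df["error"]
    f = {t: (ss[t] / df[t]) / ms_error for t in ss if t != "error"}
    return ss, df, f

ss, df, f = two_way_anova(data)
for term in ("conditioning", "drug", "interaction"):
    print(f"{term:12s} SS={ss[term]:.3f}  F({df[term]}, {df['error']})={f[term]:.2f}")
```

Turning the F ratios into p-values needs the F distribution, which the standard library does not provide, so this sketch stops at the F statistics.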

• DrugMonkey says:

the comment missed the fact that I referred to four groups, no repeated measures complications here.

re 2, do you *need* to do 2-way to assess an interaction (this is especially important given you are the one saying 'plot the data'. you are venturing dangerously close to DM-rant territory btw :-))

age-matched, yeah. we order up age-matched bunny groups

GLM/2wa ANOVA- I thought the comment was just fucking with me

If you want some more help and advice feel free to email me.

psst, Bob, this is an, um, didactic exercise. I may possibly have made up the scenario out of whole cloth for discussion purposes.

• Kevin Z says:

You have one categorical variable (conditioned vs sedentary). I'm not sure how you are measuring your response (meth and vehicle), but assuming it is a continuous variable this calls for 2 way ANOVA. If you are making measurements on the same individuals (i.e. using same conditioned bunnies for meth and vehicles), you should look into a repeated measures ANOVA model. You definitely need to test the interaction. I would a priori expect there might even be an interaction because you have preconditioned the bunnies! It doesn't mean the study is flawed, it could be a discussion point depending on what you are doing.

• pinus says:

I don't know either. I get blasted anytime I try to just do pre planned comparisons.

• a biologist says:

If I am reading correctly, these are your hypotheses: The null hypothesis is that the means of each group are the same. The hypotheses are that the meth bunnies hop less well than the control bunnies, and that the conditioned bunnies hop better than the sedentary bunnies.

You use T-tests to test whether the means of groups are equal. A T-test also assumes normality, so you must test for that first.

• Bob O'H says:

re 2, do you *need* to do 2-way to assess an interaction

Well, you could do a one-way with 4 groups and compare it to a two-way without interaction, but that's just the same as doing a two-way with interaction: they're just different parameterisations. I guess you could also do something with t-tests, but they'll either be the same as a two-way (I think - I haven't checked the maths), or will be inferior and possibly misleading.

Of course, I'm also assuming normality and additivity are reasonable assumptions.

• a biologist says:

On a second reading, you haven't quite stated the hypotheses. In order to determine the appropriate statistics, the null should be explicit. I assumed the trainee wanted to know if meth inhibited Bunny Hopping, but perhaps the question is whether the conditioned bunnies hop better, even in the presence of meth.

• Jeezus motherfuck. How fucken bad does statistics education have to be that anyfuckenone could possibly think that a t-test should come anywhere near these fucken data?????

• Bob O'H says:

As bad as the typical stats education biologists get, I'm afraid.

Actually, I think you could use a t-test if you condition out the correct means, and the design is balanced. I wouldn't say this is what you should do, unless you're taking a mathematical statistics class.
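One reading of "condition out the correct means" is a single t-test on the interaction contrast, using the pooled within-group variance. In a balanced 2x2 design the square of that t should equal the interaction F from the two-way ANOVA. The sketch below checks this on invented latencies; the data and variable names are assumptions made for illustration only.

```python
from math import sqrt
from statistics import mean

# Hypothetical escape latencies in seconds, invented purely for illustration.
cells = {
    ("conditioned", "vehicle"): [2.1, 2.4, 1.9, 2.6, 2.2, 2.3],
    ("conditioned", "meth"):    [1.6, 1.8, 1.5, 2.0, 1.7, 1.9],
    ("sedentary", "vehicle"):   [3.0, 3.4, 2.8, 3.1, 3.3, 2.9],
    ("sedentary", "meth"):      [2.9, 3.2, 2.7, 3.5, 3.0, 3.1],
}
n = 6  # bunnies per cell (balanced design assumed)
groups = ("conditioned", "sedentary")
drugs = ("vehicle", "meth")

m = {k: mean(v) for k, v in cells.items()}

# Pooled within-cell variance, i.e. the ANOVA error mean square.
ss_err = sum((x - m[k]) ** 2 for k, v in cells.items() for x in v)
df_err = sum(len(v) - 1 for v in cells.values())
ms_err = ss_err / df_err

# Interaction contrast: meth effect in conditioned minus meth effect in sedentary.
contrast = (m[("conditioned", "meth")] - m[("conditioned", "vehicle")]) - (
    m[("sedentary", "meth")] - m[("sedentary", "vehicle")]
)
# Contrast weights are (+1, -1, -1, +1), so its variance is ms_err * 4 / n.
t = contrast / sqrt(ms_err * 4 / n)

# The same interaction via the usual two-way ANOVA sum of squares.
grand = mean(m.values())
row = {a: mean(m[(a, b)] for b in drugs) for a in groups}
col = {b: mean(m[(a, b)] for a in groups) for b in drugs}
ss_ab = n * sum((m[(a, b)] - row[a] - col[b] + grand) ** 2 for (a, b) in m)
f_interaction = ss_ab / ms_err  # 1 numerator degree of freedom

print(f"t = {t:.3f}, t^2 = {t * t:.3f}, interaction F = {f_interaction:.3f}")
```

Note that this is not the same thing as running separate two-group t-tests on pairs of cells, which use only part of the data for the variance estimate.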

• DrugMonkey says:

How fucken bad does statistics education have to be that anyfuckenone could possibly think that a t-test should come anywhere near these fucken data?????

See, now I would have gone with more of a Socratic dialog leading up to the alpha inflation dealio myself....

• bsci says:

I'll just re-comment what I put in "other" above. Have you confirmed that the statistical distributions are roughly normal? Reaction times are usually normal, but if escape latency has similarity to a serial search task, you might not be able to assume normality making none of these tests appropriate.

Assuming normality, for 2 conditions a 2-way ANOVA or a GLM could probably both get at the effect of interest. I wouldn't publish multiple t-tests, but they're sometimes useful to do to make sure nothing weird is happening in the data. Instead of multiple t-tests a bar graph with stdev error bars or a dot plot showing the individual values for each condition conveys the same information and can help with data visualization.

• DrugMonkey says:

right, so a biologist, the general issue here gets back to our standard for rejecting the null hypothesis. let us say we use 0.05 as our standard- 1/20 *comparisons* will be accepted as significant when there is really no difference.

let us say we want to compare meth vs veh in each pre-conditioning condition. plus compare pre-conditioning vs. sedentary within each drug treatment. with just these 4 comparisons we're already up to a 20% chance of rejecting the null for one difference in escape latency for our hopping bunnies which is due to chance, not real performance change.

again, generally speaking, the solution is to change the standard for rejecting the null for each comparison within a coherent experiment such that the *overall* chance of getting one false alarm result is 0.05. This is what ANOVA is, in essence, doing for you.
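The arithmetic behind that 20% figure can be sketched as follows. Multiplying 0.05 by 4 gives the additive upper bound; if the four comparisons were independent the chance of at least one false alarm would be a bit lower. The four comparisons described here share groups, so independence is an assumption of the sketch, not a property of the design. The function names are made up for illustration.

```python
def familywise_error(alpha, k):
    """Chance of at least one false positive across k independent tests."""
    return 1 - (1 - alpha) ** k

def bonferroni_alpha(alpha, k):
    """Per-comparison threshold that keeps the overall rate at or below alpha."""
    return alpha / k

alpha = 0.05
for k in (1, 4, 6):
    per_test = bonferroni_alpha(alpha, k)
    print(
        f"k={k}: upper bound={min(1.0, k * alpha):.3f}, "
        f"independent tests={familywise_error(alpha, k):.3f}, "
        f"Bonferroni per-test alpha={per_test:.4f}, "
        f"overall rate after correction={familywise_error(per_test, k):.4f}"
    )
```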

• DrugMonkey says:

ahh, bsci. fantastic. Another area of statistical fascism that requires additional consideration. Two questions:

1) what do you mean by "statistical distributions", i.e. what has to be normally distributed?

2) what does it mean that "ANOVA is robust against violations of assumptions"?

• Bob O'H says:

On distributional assumptions, Esa Läärä made the point in a paper last year that we shouldn't worry too much about normality. His argument is that if the sample size is large, it doesn't matter because asymptotics take over so that the standard errors are well estimated. If the sample size is small, it doesn't matter because you would detect any deviations from normality anyway. His paper is well worth a read for his other comments as well.

(I was one of the editors of the special issue that paper appeared in, so I'm biased towards recommending it)

• neuromusic says:

The only people in my grad program's stats class that understood ANOVAs afterward were the ones who had taken a legitimate stats class during undergrad.

2-way ANOVA w/ a post-hoc comparison (e.g. Tukey's)

• neuromusic says:

assuming that each group is normally distributed, etc,

• I would use 2-way ANOVA, so you can determine the interaction effects (as well as their strength). If you want to compare bunny groups, you can use Tukey's Honestly Significant Difference test (really, that's what it's called).

You might be able to use one-way tests depending on what you're asking (the a priori hypotheses matter); for exploration, look for the interaction terms.

• bsci says:

As you know, a normal distribution roughly means that, within each series of observations there is a central "true" value and a roughly equal number of observations on each side of the central value (following a specific shape). How far from the center value the observations go is the variance (i.e. the VA in ANOVA). If your data doesn't look like this, then variance doesn't really tell you much and ANOVA isn't a valid probe.

I am not a statistician by far, but ANOVA is fairly robust in that if your distribution is slightly skewed from normal it should still be a reasonable test. Still, there are cases where the distribution is far from normal. For example, I was once measuring lag time from multiple signal sources. Most of the values were clustered between B and C ms, but there was a nontrivial long tail that went all the way to G ms. I realized ANOVA was not appropriate in this case and ended up doing other analyses.

If you have enough data you can see how much it deviates from normal in each population using a quantile-quantile plot (qqplot in matlab). You might need to consult a real biostatistician to define "too much deviation"
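For those without MATLAB, the points of a quantile-quantile plot can be computed with the Python standard library. This is a rough sketch: the latencies are invented, the residuals are taken around each group's own mean, the plotting positions `(i + 0.5) / n` are one common convention among several, and the correlation at the end is only a crude summary of how straight the Q-Q line is.

```python
from math import sqrt
from statistics import NormalDist, mean

# Hypothetical escape latencies in seconds, invented purely for illustration.
cells = {
    ("conditioned", "vehicle"): [2.1, 2.4, 1.9, 2.6, 2.2, 2.3],
    ("conditioned", "meth"):    [1.6, 1.8, 1.5, 2.0, 1.7, 1.9],
    ("sedentary", "vehicle"):   [3.0, 3.4, 2.8, 3.1, 3.3, 2.9],
    ("sedentary", "meth"):      [2.9, 3.2, 2.7, 3.5, 3.0, 3.1],
}

# Residuals: each latency minus its own group mean.
residuals = [x - mean(v) for v in cells.values() for x in v]

def qq_points(values):
    """Pairs of (theoretical normal quantile, observed value), ready to plot."""
    ordered = sorted(values)
    n = len(ordered)
    std_normal = NormalDist()
    theoretical = [std_normal.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, ordered))

def pearson_r(pairs):
    """Plain Pearson correlation between the two coordinates of each pair."""
    xs, ys = zip(*pairs)
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

points = qq_points(residuals)
print(f"Q-Q correlation: {pearson_r(points):.3f}")
```

Points that fall close to a straight line suggest the residuals are roughly normal; a long curved tail at one end is the pattern to worry about with latencies.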

If you don't have enough data then you can think about the expected results. If this is essentially a reaction time measure then I wouldn't worry too much about normality. If there's some search element where some trials have the animals getting in a bad area that takes much longer to complete you might have problems.

• lylebot says:

I don't think it's accurate to say that ANOVA is changing the standard for rejection. It's just testing a different hypothesis than the 4 independent comparisons, specifically the hypothesis that all the means are equal rather than separate hypotheses about each pair of means. Methods for changing the standard by which to reject hypotheses about pairs of means include the Bonferroni correction and Tukey's Honest Significant Differences (which is based on an ANOVA model).

I voted two-way ANOVA, but actually I think you should do a randomization (exact test) or bootstrap procedure, especially if your n is small.
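A minimal sketch of the randomization idea, under some loud assumptions: the latencies are invented, the test statistic is the between-group sum of squares, and the labels are shuffled at random a fixed number of times rather than enumerating every arrangement, so this is a Monte Carlo approximation and not an exact test. It addresses the global null that all four groups are exchangeable, not the interaction specifically.

```python
import random

# Hypothetical escape latencies in seconds, invented purely for illustration.
groups = [
    [2.1, 2.4, 1.9, 2.6, 2.2, 2.3],  # conditioned, vehicle
    [1.6, 1.8, 1.5, 2.0, 1.7, 1.9],  # conditioned, meth
    [3.0, 3.4, 2.8, 3.1, 3.3, 2.9],  # sedentary, vehicle
    [2.9, 3.2, 2.7, 3.5, 3.0, 3.1],  # sedentary, meth
]

def between_group_ss(groups):
    """Between-group sum of squares; larger means the group means differ more."""
    pooled = [x for g in groups for x in g]
    grand = sum(pooled) / len(pooled)
    return sum(len(g) * (sum(g) / len(g) - grand) ** 2 for g in groups)

def permutation_p_value(groups, n_perm=2000, seed=0):
    """Share of random relabelings at least as extreme as the observed data."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    observed = between_group_ss(groups)
    pooled = [x for g in groups for x in g]
    sizes = [len(g) for g in groups]
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        start, shuffled = 0, []
        for size in sizes:
            shuffled.append(pooled[start:start + size])
            start += size
        # Small tolerance guards against floating-point ties.
        if between_group_ss(shuffled) >= observed - 1e-12:
            hits += 1
    # Add-one correction keeps the estimate away from an impossible p of 0.
    return (hits + 1) / (n_perm + 1)

print(f"permutation p ~ {permutation_p_value(groups):.4f}")
```

Because the total sum of squares is unchanged by shuffling, ranking relabelings by between-group sum of squares gives the same ordering as ranking them by the one-way F ratio.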

• Namnezia says:

Well there's nothing wrong with pre-planned comparisons if some of the comparisons do not make any scientific sense.

• David says:

This is a hare-brained study. You've picked the wrong control group entirely: if studying how fast rabbits can go, the appropriate control group is a Tortoise, using the methodology pioneered by Aesop et al (600BC).

• Devon says:

If someone seriously suggested a T-test then I suggest whacking them with a "clue-by-four."

• DrugMonkey says:

I think you touched closely on what I was trying to get at, bsci. Is it the sample you are working with that needs to be normally distributed to meet the assumption? Or is it the underlying population that needs to be normally distributed?

• bsci says:

If I'm understanding your question directly, it's the population, but they are related. A reasonably sized sample should have a similar distribution to the population you are observing. Still, if you only have 10 observations you might not be able to empirically confirm that.

Has anyone collected escape latency from stimulus onset to crossing the plane of the hedge from a large population? If that has a normal distribution, I wouldn't worry about your specific study.

• Bob O'H says:

It's the sample.

No, sorry. It's the residuals from the data. There's an assumption that the data are randomly drawn from the population, but if it's not then you've got more severe problems with generalization.

• Bob O'H says:

OK, this comment wins.

• Funky Fresh says:

Bonferroni is my homeboy.

• ecologist says:

Interesting. Very interesting.

How about scrapping significance tests altogether, and using information-theoretic statistics (AIC and its relatives) to find the statistical model most well-supported by the data. It might involve the effect of conditioning, it might involve the effect of meth, it might involve an interaction term, but it would tell you much more about the biology than a significance test.
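As a rough sketch of what that comparison could look like, the snippet below scores five candidate models with AIC. The assumptions are mine for illustration: invented latencies, normally distributed errors, a balanced design (which is what lets the additive fit be written as row mean plus column mean minus grand mean), and the least-squares form AIC = n ln(RSS/n) + 2k with constants dropped.

```python
from math import log
from statistics import mean

# Hypothetical escape latencies in seconds, invented purely for illustration.
data = {
    ("conditioned", "vehicle"): [2.1, 2.4, 1.9, 2.6, 2.2, 2.3],
    ("conditioned", "meth"):    [1.6, 1.8, 1.5, 2.0, 1.7, 1.9],
    ("sedentary", "vehicle"):   [3.0, 3.4, 2.8, 3.1, 3.3, 2.9],
    ("sedentary", "meth"):      [2.9, 3.2, 2.7, 3.5, 3.0, 3.1],
}
obs = [(a, b, x) for (a, b), v in data.items() for x in v]
n = len(obs)

grand = mean(x for _, _, x in obs)
a_mean = {a: mean(x for a2, _, x in obs if a2 == a) for a in {a for a, _, _ in obs}}
b_mean = {b: mean(x for _, b2, x in obs if b2 == b) for b in {b for _, b, _ in obs}}
cell = {k: mean(v) for k, v in data.items()}

# Each model: (number of mean parameters, function giving the fitted value).
models = {
    "null": (1, lambda a, b: grand),
    "conditioning only": (2, lambda a, b: a_mean[a]),
    "drug only": (2, lambda a, b: b_mean[b]),
    "additive": (3, lambda a, b: a_mean[a] + b_mean[b] - grand),  # balanced only
    "with interaction": (4, lambda a, b: cell[(a, b)]),
}

def aic(n_mean_params, fitted):
    """Gaussian least-squares AIC, up to an additive constant."""
    rss = sum((x - fitted(a, b)) ** 2 for a, b, x in obs)
    k = n_mean_params + 1  # +1 for the error variance
    return n * log(rss / n) + 2 * k

scores = {name: aic(*spec) for name, spec in models.items()}
best = min(scores, key=scores.get)
for name, score in sorted(scores.items(), key=lambda kv: kv[1]):
    print(f"{name:18s} delta AIC = {score - scores[best]:6.2f}")
```

With samples this small a corrected variant such as AICc may be the more sensible choice; the plain form is used here only to keep the sketch short.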

Try asking the following question: do you really (really, truly), think that there is any possibility in hell that, say, the mean latency of meth bunnies is EXACTLY equal to the mean latency of vehicle bunnies? Not sort of equal. Not approximately equal. EXACTLY equal. To the eleventy-zillionth decimal place. Because that's the hypothesis you are testing. If you already know that it's an irrelevant hypothesis, what exactly are you gaining by testing it?

Hmmmm.

Or, if you decided you still want hypothesis tests, and were worried about normality (of the errors, not the data) assumed in ANOVA, you could do a randomization test and not have to worry about that.

Fun discussion to watch, thanks.

• antipodean says:

I think you may want to use Mixed models- unless I've totally misunderstood the conversation so far

If you have repeated measures (bunny exposed to both control and intervention) and you're not bullshitting about having really matched bunnies to each other.

Random factors: the bunny (if multiple measures?) the bunny and his matched mates (ie the matching group)
Fixed: Meth vs. Vehicle and Condition vs. Sedentary and for a start the interaction.

You don't need to control for age because that's controlled for by the design of the experiment via the matching. If the latencies are seriously skewed and mucking up the normality of the residuals you could employ an inverse transform. But mixed can deal with a fair bit of non-normality going in before the residuals get mucked up.

• DrugMonkey says:

It’s the sample.

see, now I'd go with the population. In his primer of biostatistics (5th ed) Stanton Glantz continually talks about the population from which the sample is drawn when discussing assumptions of normality. Chapt 10 on when to select a nonparametric is, I've just discovered, still bookmarked in my copy I haven't opened in years for just this reference. I may possibly have had to beat a reviewer back with this citation in the past 🙂

• Fucke that austrofrench bonferroni crappe.

• pinus says:

brilliant

• Neuro-conservative says:

Latencies (for almost any behavior) are notoriously skewed, with a few subjects usually just sitting around thinking about something else. If this is not an issue in this study, then I don't understand why there would even be a debate about using 2-way ANOVA. (One could design a GLM/regression model including dummy variables for each factor and the interaction, but this should be mathematically indistinguishable from a properly specified ANOVA.)

• aaron says:

The data won't be normally distributed. Use a Kruskal-Wallis test to determine if all the means are equal (non-parametric version of ANOVA that does not assume normality). If p<0.05 that some difference exists across the groups, then you decide beforehand which comparisons you are interested in making. Then do post-hoc Mann-Whitney-Wilcoxon tests, which are like t-tests that again are non-parametric and do not assume normality, and you correct the threshold for statistical significance by dividing alpha=0.05 by the number of comparisons (Bonferroni).

• Bob O'H says:

Try asking the following question: do you really (really, truly), think that there is any possibility in hell that, say, the mean latency of meth bunnies is EXACTLY equal to the mean latency of vehicle bunnies? Not sort of equal. Not approximately equal. EXACTLY equal. To the eleventy-zillionth decimal place. Because that’s the hypothesis you are testing. If you already know that it’s an irrelevant hypothesis, what exactly are you gaining by testing it?

Using AIC to find the best model will still suffer this problem - selecting any model other than the full model is equivalent to setting the extra terms to zero.

In this example, the full model is so simple I probably wouldn't do any model selection at all - I'd first report the interaction effect. if that's small, I would report the main effects (and let the researchers decide if they're interesting).

Actually, with this design I'd try to let the graphs show the results - I'd plot the group means with standard errors with treatment (meth/control) on the x axis and bunny/tortoise in different colours/symbols. I'd draw lines between the means for each level of bunny/tortoise, to emphasise the interaction.
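The numbers that would go into such a plot are just the group means and their standard errors. A small sketch, using invented latencies and the actual conditioned/sedentary labels in place of bunny/tortoise, laid out the way the plot would be read (treatment along the row, one line per conditioning level):

```python
from math import sqrt
from statistics import mean, stdev

# Hypothetical escape latencies in seconds, invented purely for illustration.
cells = {
    ("conditioned", "vehicle"): [2.1, 2.4, 1.9, 2.6, 2.2, 2.3],
    ("conditioned", "meth"):    [1.6, 1.8, 1.5, 2.0, 1.7, 1.9],
    ("sedentary", "vehicle"):   [3.0, 3.4, 2.8, 3.1, 3.3, 2.9],
    ("sedentary", "meth"):      [2.9, 3.2, 2.7, 3.5, 3.0, 3.1],
}

def summarize(cells):
    """Mean and standard error of the mean for every group."""
    return {k: (mean(v), stdev(v) / sqrt(len(v))) for k, v in cells.items()}

summary = summarize(cells)
print(f"{'':12s} {'vehicle':>16s} {'meth':>16s}")
for level in ("conditioned", "sedentary"):
    row = "".join(
        f" {summary[(level, drug)][0]:8.2f} ± {summary[(level, drug)][1]:.2f}  "
        for drug in ("vehicle", "meth")
    )
    print(f"{level:12s}{row}")
```

Lines joining the two means in each row that are not parallel are the visual signature of an interaction.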

• Bob O'H says:

Data analysis is an art, not painting by numbers. Why not look at the data and see if they (or at least their residuals) are approximately normal? Plotting normal scores is a great way of looking for problems. It might then be that a transformation makes more sense, or that there's an outlier (perhaps a typo).

• ginger says:

(I am in the Antipodes and I went schlupping off to bed after posting, and I haven't been online at all today because I've been trapped in our tearoom marking homework assignments about probability sampling, gahhhh, so I am going to enter night 2 without actually answering your question. But it's not because I'm afraid of you people and your crazy experimental science with its discrete predictors and continuous outcomes, I promise.)
(My life centers around analysis of continuous predictors and binary outcomes, so your data are all backwardsy to me. But a zillion years ago I spent whole days of my life thinking about main effects and interactions, and it will come back to me when I am not all, "Guuuhhh, whaaaat, you would use simple multistage random sampling with stratified clusters? ZERO points and a lot of writing in the margins for you!")

• bsci says:

Assuming unbiased sample selection, a sample of sufficient size should have the same distribution as the population. That said, one can't require that the sample itself has a specific distribution because one can use reasonable statistics on fairly small samples of things... For example if you want to examine the quality of a few samples of hops from a shipment before deciding to use it for beer production (the original purpose of the t-test).

If you're using some multi-level model, Bob is right that it's the normality of the residuals that matter.

This is only tangentially related, but Stan Glantz has the most amazing story of how he originally got tenure. I might pass it along in an unblogged medium at some point.

• antipodean says:

What Bob said.

Latencies are not always so skewed as to make them unanalysable via parametric methods. You can put stuff in that's highly skewed and still get normal residuals. Give it a go.

Data analysis is a black art.


• DrugMonkey is an NIH-funded researcher who blogs about careerism in science. And occasionally about the science of drug use.
