Chatter on the Twitts today brought my attention to a paper by Weber and colleagues that had a rather startlingly honest admission.

Weber F, Hoang Do JP, Chung S, Beier KT, Bikov M, Saffari Doost M, Dan Y. Regulation of REM and Non-REM Sleep by Periaqueductal GABAergic Neurons. Nat Commun. 2018 Jan 24;9(1):354. doi: 10.1038/s41467-017-02765-w.

If you page all the way down to the end of the Methods of this paper, you will find a statement on sample size determination. I took a brief stab at finding the author guidelines for Nature Communications, because a standalone statement of how sample size was arrived upon is somewhat unusual to me. Not that I object, I just don't find this to be common in the journal articles that I read. I was unable to locate it quickly, so... moving along to the main point of the day. The statement reads, in part:

Sample sizes

For optogenetic activation experiments, cell-type-specific ablation experiments, and in vivo recordings (optrode recordings and calcium imaging), we continuously increased the number of animals until statistical significance was reached to support our conclusions.

Wow. WOW!

This flies in the face of everything I have ever understood about proper research design. In the ResearchDesign 101 approach, you determine* your ideal sample size in advance. You collect your data in essentially one go and then you conduct your analysis. You then draw your conclusions about whether the collected data support, or fail to support, rejection of a null hypothesis. This can then allow you to infer things about the hypothesis that is under investigation.

In the real world, we modify this a bit. And what I am musing today is why some of the ways that we stray from ResearchDesign orthodoxy are okay and some are not.

We talk colloquially about finding support for (or against) the hypothesis under investigation. We then proceed to discuss the results in terms of whether they tend to support a given interpretation of the state of the world or a different interpretation. We draw our conclusions from the available evidence- from our study and from related prior work. We are not, I would argue, supposed to be setting out to find the data that "support our conclusions" as mentioned above. It's a small thing and may simply reflect poor expression of the idea. Or it could be an accurate reflection that these authors really set out to do experiments until the right support for a priori conclusions has been obtained. This, you will recognize, is my central problem with people who say that they "storyboard" their papers. It sounds like a recipe for seeking support, rather than drawing conclusions. This way lies data fakery and fraud.

We also, importantly, make the best of partially successful experiments. We may conclude that there was such a technical flaw in the conduct of the experiment that it is not a good test of the null hypothesis, and essentially treat it in the Discussion section as inconclusive rather than as a real test.

One of those technical flaws may be the failure to collect the ideal sample size, again as determined in advance*. So what do we do?

So one approach is simply to repeat the experiment correctly: scrap all the prior data, put fixes in place to address the reasons for the technical failure, and run the experiment again. Even if the technical failure hit only a part of the experiment. If it affected only some of the "in vivo recordings", for example, orthodox design mavens may say it is only kosher to re-run the whole shebang.

In the real world, we often have scenarios where we attempt to replace the flawed data and combine it with the good data to achieve our target sample size. This appears to be more or less the space in which this paper is operating.

"N-up". Adding more replicates (cells, subjects, what have you) until you reach the desired target. Now, I would argue that re-running the experiment with the goal of reaching the target N that you determined in advance* is not that bad. It's the target. It's the goal of the experiment. Who cares if you messed up half of them every time you tried to run the experiment? Where "messed up" is some sort of defined technical failure rather than an outcome you don't like, I rush to emphasize!

On the other hand, if you are spamming out low-replicate "experiments" until one of the scenarios "looks promising", i.e. looks to support your desired conclusions, and then selectively "n-up" that particular experiment, well, this seems over the line to me. It is much more likely to result in false positives. I suppose any one trial experiment run at full power is just as likely to throw a false positive; it is just that you cannot afford to run as many trial experiments at full power. So I would argue the sheer number of potential experiments is greater for the low-replicate, n-up-if-promising approach.

These authors appear to have taken this strategy one step worse. Their target is not just an a priori determined sample size to be pursued only when the pilot "looks promising". In this case they take the additional step of running replicates only up to the point where they reach statistical significance. And this seems like an additional way to get an extra helping of false-positive results to me.

Anyway, you can google up information on false positive rates and p-hacking and all that to convince yourself of the math. I was more interested in trying to probe why I got such a visceral feeling that this was not okay. Even if I personally think it is okay to re-run an experiment and combine replicates (subjects in my case) to reach the a priori sample size if it blows up and you have technical failure on half of the data.
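If you'd rather see the math than google it, a quick simulation makes the point. This is a sketch, not anyone's actual analysis: it assumes the simplest possible design (a one-sample z-test with known variance), and the helper name `p_two_sided` is made up for illustration. The data are drawn from the null, so every "significant" result is a false positive.

```python
import math
import random

def p_two_sided(sample):
    """Two-sided z-test p-value for mean 0, known sigma = 1."""
    n = len(sample)
    z = abs(sum(sample) / n) * math.sqrt(n)
    return 2.0 * (1.0 - 0.5 * (1.0 + math.erf(z / math.sqrt(2.0))))

random.seed(1)
runs = 2000
false_pos_fixed = 0  # test once, at the a priori n = 30
false_pos_nup = 0    # test after every added animal, n = 5..30

for _ in range(runs):
    data = [random.gauss(0.0, 1.0) for _ in range(30)]  # null is true
    if p_two_sided(data) < 0.05:
        false_pos_fixed += 1
    # "n-up": count a win if p ever dips below 0.05 along the way
    if any(p_two_sided(data[:n]) < 0.05 for n in range(5, 31)):
        false_pos_nup += 1

print(f"fixed n:       {false_pos_fixed / runs:.3f}")  # typically lands near 0.05
print(f"n-up to p<.05: {false_pos_nup / runs:.3f}")    # typically several times higher
```

Same data, same test; the only difference is peeking after every animal and stopping at the first "significant" result.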

__

*I believe the proper manner for determining sample size is entirely apart from the error the authors have admitted to here. This isn't about failing to complete a power analysis or the like.

So human randomised clinical trials often have a priori stopping rules. If there's a standard treatment and a new treatment, and one of the treatments is hugely better than the other, we're ethically bound to stop the trial and draw our conclusion early - whether the standard is so much better that we should stick with it, or the new is so much better that we have to go change practice. Whether what we find is superiority or futility, we have to stop. But we have to set those rules in advance because we know the true effect size is what it is, it has nothing to do with how many people we measure it in - and if we just keep adding a few more and a few more and analysing and re-analysing, we are basically seeking out a false positive, and that's not just dishonest, it's dangerous to our research participants (and to everyone else who is potentially exposed to our drug.)

That's why I find this N-upping gross - it's fucking around with the stopping rules to try to engineer the "right answer" and that shit kills people.

There are Bayesian statistical techniques that can accommodate checking for statistical significance across multiple sample sizes. Essentially, you have to take some of the probability out of the tail and assign it to the sub-n that you have already done. In practice, these are extremely complicated and almost never done except in very large clinical trials with potentially dangerous or expensive treatments that have statistical teams capable of doing this analysis.

I suppose you have to give Weber et al some credit for honesty. They told you what they did. You can disbelieve their paper if you like. Although I'm kind of surprised it got through peer review with that statement.

The proper thing to do is pilot experiments that you then do not include in the actual n. The problem with that is that it is expensive in terms of both time and money. But I am seeing more and more journals and reviews asking for full replication cohorts presumably because people are not doing this.

BTW, we had a review come back recently that explicitly said "your data on X would be more convincing if you added a couple of more animals to it" because some of the animals were lost to the experiment (through technical failures making it impossible to collect the data from them). What was amazing about the request was that our data based on the n of animals going through the experiment was significant. (We had started with a larger n because we knew the experiment was technically challenging and some might well be lost to it.)

Gingerest expressed exactly what I was thinking (but much better than I would have said it).

"because we know the true effect size "

So I'm a complete newbie on this topic, but I don't understand how we know the effect size. I recently had to calculate the appropriate sample size for a project. I sat down with our stats group and was asked what my effect size was. I was honest and said we wouldn't know until we ran the tests. I have a hypothesis on the size, but no direct data. We ended up using "related" data from the literature to estimate the effect size, but that didn't really answer the question in my mind.

Here we go with the "irreproducible basic research" finger pointing.

People doing human studies do not get to claim the moral high ground here. I'm not having that. Some of the most abusive statistical practices I've ever seen have been in human studies. Large clinical trials may have all kinds of rules set out a priori, but they have plenty of other ways to game the system, i.e. not reporting data at all, making up secondary/tertiary endpoints post-hoc, last observation carried forward etc etc.

Let's not pretend that clinical trials/drug companies don't have a history of engineering the "right answer".

Dave, "because we know the true effect size"... the complete quote was "we know the true effect size is what it is". They just mean the true effect size is fixed (we don't know it before the experiment, but it has some value).

Most of these experiments work like this: You have a treatment, you want to know the effect of the treatment. You somehow quantify the effect of the treatment by measuring "something".

The question is whether "something" measured in the control group differs from "something" measured in the treatment group. The difference in "something"'s value is called the effect size.

The typical approach is to first compute the number of animals you need in each group. To perform this calculation you need to know the effect size. Of course, you don't know the effect size, in fact the whole point of the experiment is to establish that the effect size is non-zero. So you estimate the effect size. This is where the BS starts.

I maintain science would be much better off if people didn't focus so much on establishing "significance" and instead focused on precise measurements of effect size.

The problem is that once something is "significant" there's no incentive to repeat it. Whereas if someone reports an effect size of 10 (95% CI 8-12) and someone else then measures it as 10.8 (95% CI 10.6-11.0), that would be recognized as a major advance and published.

It works the other way also. Instead of "no significance, we didn't publish it", you say you measured the effect size to be 0.4 (95% CI -0.2 to 1.0); then maybe someone later on can measure it better and get an effect size of 0.5 (95% CI 0.2 to 0.8).

Thanks DNAman. Glad I'm not the only one who thinks there's an issue.

In my case, we had to do the sample size calculation for an IRB review. The sample size was initially picked based on the amount of money available (i.e. we would collect as many samples as possible with the funds we had), but that doesn't fly with an IRB (perhaps rightfully so). Based on the literature numbers we found for related effect sizes, the sample size came out to basically the same number (by chance). I can't really say what we would have done if the required sample size had been significantly higher. I guess we would just have reported that and seen what the IRB decided.

Dave, I'm definitely not claiming the moral high ground - it's just sometimes easier to see the ethical problem when it's framed in terms of human health outcomes than when it's more abstract. Of course there are many ways to game the RCT and health intervention approval systems. There's more incentive to put structures in place to prevent such gaming when the consequence is increasing the suffering or shortening the lifespans of sick people than when the consequence is misleading whole fields full of scientists about biological mechanisms, but even so, there's still plenty of gaming in both, and the gaming is horrifying to me either way. It's just a lot easier for me to explain what's gross about that paper with stopping rules. And of course given market-driven medicine, there's plenty of clinical trials that are more about the potential for profit than about alleviating human suffering. But that's a huge digression from my point, which is that making up your N as you go is cheating, at least from the frequentist statistical perspective.

Also, yes, DNAman's correctly interpreted what I was aiming to say. The underlying effect size is whatever it is, and we're trying to measure it accurately and precisely.

Also I have also done work where our sample size was set by the money, time, or number of patients we had available, and/or where there was just flat out no data to guide sample size calculations. But you still gotta decide what's feasible *in advance*, and go into it with some confidence about what you think of a null or a positive finding given your design and its execution. If you don't have enough participants to interpret your null finding as likely to be a true null, you need to be very careful how you frame your study when you write it up, but you can still contribute to the body of knowledge. A small exploratory study that generates a null where you're not sure whether it's just a power issue can still tell you something about things other than your primary research outcome (e.g. feasibility/pitfalls, secondary outcomes that maybe turn out to have a bigger effect size than you thought).

As someone who doesn't do animal studies, I have a really naive question: suppose you do have a good guess for the effect size in advance and now you want to calculate the "right" N.

From the discussion here it seems that one would choose an N so that the 1-sigma measurement uncertainty is about 1/2 (or maybe 1/4?) the effect size. Is this what you mean?

If so, why is that considered more ethical? If this is a preclinical trial and the effect is large, wouldn't it be more useful to measure it to more precision so that future clinical trials would be able to estimate proper dose or number of patients whatever?

and if the effect is smaller than you'd guessed but still likely to be big enough to be important for future trials, what is so wrong with increasing N during the study to try to measure it more accurately?

Grumpy, I can't answer the first part, but for the second, my understanding is that when we are talking about experiments with risk to participants, we want to run as few experiments as are necessary to answer the question. This minimizes the risk while still being a useful study. Too few experiments and we don't really meet our objective, too many experiments and we put more lives at risk.

Grumpy, "what is so wrong with increasing N during the study to try to measure it more accurately"

First, there's nothing wrong with doing that in many cases. If you are trying to measure the speed of light or some physics thing, yeah increase N until you get to the level of precision you want.

To understand the problem that people are complaining about here, you have to take the view of a typical biomedical experiment. The effect size has a good chance of being zero, with large variations, and increasing N is very expensive (think of increasing N by one means killing another kitten or chopping off one of your fingers or something like that).

Under those conditions, increasing N one by one is kind of like playing the lottery. You did 4 mice, not significant, try it on number 5 maybe you get significance now, etc.

If you can increase up to 1000 it won't be a problem statistically, but you'll run out of fingers.

DNAman brings up an important point. The whole reason that increasing N as you go is a problem is because we estimate p-values of significance rather than confidence intervals around an actual value. (For example, it is valid to keep adding N to measure the speed of light or mass of an electron to higher and higher precision, but it is not valid to keep adding N to determine if condition one is "significantly different" than condition two.)

Every so often, the editor of one of the major journals writes an editorial asking everyone to stop using p-values to determine condition differences and instead to report confidence intervals on actual values (such as the confidence interval on the difference between conditions one and two).

Maybe it's time we started doing that.

Dna, qaz, David, etc., It sounds like we all agree that the decision of when to stop a study is always based on values.

For the speed of light measurement, you might stop averaging after 1 week, because tying up the lab's resources for 1 month is not worth the extra factor of two in precision.

For the animal study, you stop at some point when the extra resources and ethical issues make it not worth increasing the precision. And of course the number of trials is going to be far lower when animals are involved.

What I can't understand is why one would apply a rigid strategy for the relative precision (e.g. sigma=1/4 effect size) that is desired regardless of the effect size and potential impact of the study**. If you expect the effect is 1 (in some units), but anything above 0.1 would lead to important new insights, why would you stop when your precision is 0.25 and can't reject null hypothesis?

**Actually I'm not sure this is how it works in animal studies, just what it seemed like from the discussion. Hoping to be educated on this.

Grumpy - the problem is how p-values work. A p-value measures how unlikely your collected data would be if the null hypothesis were true. So, it is essentially a threshold. With each data point, you are going to measure a real value x plus some noise parameter z. As you collect data, the inclusion of the noise means that your estimate will walk around the real value. At some sample, you might well walk across that threshold. With additional samples, you might well walk back under it. If you stop when you cross the threshold, then you are over-estimating the likelihood that you crossed that threshold.

As I said in my earlier post, it is possible to take this walk into account using p-value statistics, which would allow you to do the stats with each new sample, but the math is messy and standard tests (ANOVAs, t-tests, stuff like that) do not take that into account.

Another, simpler solution is to not work by thresholds at all, but simply actually estimate some value. (Remember, almost no neuroscience or psychology experiments work this way.) In this case, you have a value and confidence intervals on that value and additional data is always taken into account appropriately, so you don't have to predefine your n.
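For what it's worth, the kind of report qaz describes is easy to produce. A sketch, with simulated numbers (not from any actual experiment), assuming normally distributed groups and the rough 1.96 normal-approximation interval; a real analysis at n = 12 per group would use a slightly wider t-based interval.

```python
import math
import random
from statistics import mean, stdev

random.seed(2)
# simulated groups: true means 10 and 12, true SD 2, so true effect size is 2
control = [random.gauss(10.0, 2.0) for _ in range(12)]
treated = [random.gauss(12.0, 2.0) for _ in range(12)]

diff = mean(treated) - mean(control)
# standard error of the difference of two independent means
se = math.sqrt(stdev(control) ** 2 / len(control) + stdev(treated) ** 2 / len(treated))
lo, hi = diff - 1.96 * se, diff + 1.96 * se  # normal-approximation 95% CI

print(f"effect size: {diff:.2f} (95% CI {lo:.2f} to {hi:.2f})")
```

Reporting that one line, rather than "p < 0.05", is the value-estimation framing: the next lab can try to shrink the interval instead of re-litigating a threshold.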

Importantly, all you need to do is predefine your n. So saying "I will run 25 rats" or "I will run until I get 25 rats that show this baseline effect" or even "I will run rats until Tuesday, working at a steady pace" are all essentially fine. On the other hand, "we continuously increased the number of animals until statistical significance was reached to support our conclusions" is definitely not OK.

There have been a lot of complaints about underpowered experiments, but we need to remember that p-values only detect "the probability that it did NOT come from the null hypothesis". A non-significant p-value does not imply that the data does come from the null hypothesis. An underpowered experiment isn't wrong (because if you don't find anything, you can't conclude it's the same anyway), but it might be a waste of money because you looked for and didn't find something that's really there. The underpowered problem is a different beast.

Qaz, I understand that null hypothesis significance testing is the same as asking whether the confidence interval around the measured effect size includes 0. Obviously I think focusing on effect sizes and quantifying uncertainty properly is the right way to go. I also understand why stopping as soon as your confidence interval just barely doesn't include 0 is dangerous because you could have had an unlucky run at the beginning (though, to be honest, it is not difficult to quantify the extra bit of uncertainty this strategy involves so I don't have any religious opposition to it).

But I'm not trying to debate those things.

What I'm asking is why does DM and others think the "right" choice of N is one based on a calculation, before the experiment begins, which incorporates the expected effect size. To me the decision of how much precision you want should depend on how important the effect is and the ethical and resource costs associated with further improving precision. But I'm not 100% sure they aren't incorporating those things in their "power analyses".

I was pretty clear that I think the question of how you select sample size in advance is orthogonal to doing this versus “n-up until a significant result confirms our hypothesis”, Grumpy.

As you note, you may have a run of luck (good/bad/whatever) along the way. If you keep going until you cross the p-value threshold, you will incorrectly be reporting that you had a significant effect when you don't.

There are three ways to solve this problem.

1. Actually measure confidence intervals around a value. Care not whether that interval includes 0 or not, but report the interval itself.

2. Quantify the uncertainty with each check. It is very difficult to quantify the extra bit of uncertainty if you are not actually measuring confidence intervals. Most of the studies that we are talking about are using "standard statistical tests" which do not report the confidence interval or the uncertainty, but rather provide a "p-value" of the likelihood that your confidence interval is outside of 0. Importantly, the relationship between the statistic being measured (usually translated to a p-value) and the shape of that confidence interval is very complex and not readily available.

3. Just decide on your n a priori. This means that you are not falsely giving yourself multiple opportunities to cross the threshold.

As I noted, it is possible to construct variants of these standard statistical tests that allow checking along the way, but that is very complicated and requires a detailed understanding of why that statistical test actually works. (I bet 90% of people using ANOVAs or F-tests do not really understand what the assumptions underlying them are or could explain why they work and would be absolutely unable to derive these test-along-the-way variants.)
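One of the simplest of those test-along-the-way variants is the Pocock correction: plan K interim looks in advance and test each against a stricter nominal threshold, so the overall false-positive rate comes back near 0.05. A sketch under the same toy assumptions as a known-variance z-test on null data; the 0.0158 threshold is the published Pocock nominal level for five equally spaced looks.

```python
import math
import random

def p_two_sided(sample):
    # two-sided z-test against mean 0, known sigma = 1
    z = abs(sum(sample) / len(sample)) * math.sqrt(len(sample))
    return 2.0 * (1.0 - 0.5 * (1.0 + math.erf(z / math.sqrt(2.0))))

runs, looks = 4000, [10, 20, 30, 40, 50]  # peek after every 10 animals

random.seed(3)
naive = sum(
    any(p_two_sided(data[:n]) < 0.05 for n in looks)  # naive repeated testing
    for data in ([random.gauss(0, 1) for _ in range(50)] for _ in range(runs))
) / runs

random.seed(3)  # same simulated datasets, stricter per-look threshold
pocock = sum(
    any(p_two_sided(data[:n]) < 0.0158 for n in looks)  # Pocock bound, K = 5
    for data in ([random.gauss(0, 1) for _ in range(50)] for _ in range(runs))
) / runs

print(f"5 peeks at p<0.05:  {naive:.3f}")   # inflated well above 0.05
print(f"5 peeks, Pocock:    {pocock:.3f}")  # back near 0.05
```

Note the catch qaz raises: the looks and the correction have to be planned a priori, which is exactly what "n-up until significant" skips.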

Statistician here who keeps coming back to this blog for useful grants-related advice. There are a number of confusing things being thrown around here, IMHO. No offense to anyone, but I'll try to point them out as I understand them:

1. qaz should be slightly more careful with jargon: for example, there is no such thing as a "p-value of the likelihood." There is no magical Bayesian remedy either. If you are being a Bayesian, you have to rely on something called Bayes Factors (Kass, Raftery 1995 JASA - off the top of my head) and cannot use p-values.

2. It's true you cannot just "n-up" until you reach significance. No question. If you keep adding samples and re-testing, essentially any null will eventually cross significance.

3. But here's the catch: confidence intervals (CI) don't solve this issue either! With CIs you have a sense of where the true parameter might lie IN THE LONG RUN. Does your CI (based on your samples) catch the true parameter or not? There is no way to answer this and no way to assign a probability to it.

4. The only remedy is the thing that probably you scientists hate to hear from a reviewer. Replication studies. CIs, p-values etc. can only be interpreted as a "long run proportion." There is no way to deduce the true parameter value from "one" CI of sample size n (or p-value of one t-test of sample size n) but this is what people often tend to do. Please don't do this. If you do 2 replications, and two resultant CIs are (10,20) and (110,120) then you go back to the study and investigate the discrepancy, since it's unlikely that the same true parameter will give these two vastly different CIs. But, which CI is closer to the truth? No way to tell unless you run more replications.

"What I'm asking is why does DM and others think the "right" choice of N is one based on a calculation, before the experiment begins, which incorporates the expected effect size. To me the decision of how much precision you want should depend on how important the effect is and the ethical and resource costs associated with further improving precision. But I'm not 100% sure they aren't incorporating those things in their "power analyses"."

Fine by me, as long as you make the decision a priori.

Qaz, I don't think (not sure abt this) that confidence intervals get you out of the problem unless you use nonparametric methods to construct them - otherwise you're building your uncertainty estimate around the assumption that your single sample has the right parameters to use whatever distribution you've chosen for your CIs. Like, we do not really draw a nearly-infinite number of samples and then construct our confidence interval around The Truth - we draw a single sample and lean heavily on the Central Limit Theorem to get us a confidence interval we can trust, whether we are placing our faith in the standard normal distribution, the t-distribution or we're doing some funky business with link functions and/or transformation. So we still have to know how big we want our sample to be, and it has to be Big Enough, and we can't just keep recalculating until we get the interval we want. I think.

I think I was unclear what I mean by confidence intervals. What I meant was that CIs ask a fundamentally different question than p-values. When I talk about CIs, the question is "what is my best estimate of value x?" (as in, the speed of light or the mass of the electron). This can be contrasted with "is the value of x different from the value of y?", which is what p-values are getting at (as in, is an electron lighter than a proton?). The point from an n-perspective is that adding n to a "what is my best estimate of value x?" question just helps you get closer to that. Whereas it is not OK to just add n to a thresholded question ("is the value of x different from the value of y?"). The specific issues with how you determine the CIs is more complicated than I was trying to get into. (But the central tendency of the estimate only improves with adding n, and it is thus OK to keep adding n as you please because the question is about the mean not the variance.)

Also, yes, I have been sloppy with jargon (sorry about that - am not a statistician, but have a strong math background and have done both Bayes and standard stats), but I think my main statements were generally correct. Using Bayes appropriately changes the question to a value question rather than a threshold question. Bayes factors are used in lieu of p-values for threshold questions, but they are addenda added on to the real calculation, which is "what is the distribution of our estimate of the value of the difference between x and y?" If you report that estimated difference, then you can continue adding n to your heart's content, as long as you report that estimated difference (and the distribution) no matter what. If you are reporting a "Bayes Factor", then you are simply taking a different route to a p-value and you have to decide your n a priori (because you are thresholding again).

However, replication studies don't solve this problem either. They just move the p-value threshold farther away so that it's less likely you'll be fooled by a type I (false positive) finding.

And I do agree with JJ that the answer to discrepancies is more science. In fact, one of the key problems with the whole rigor/reproducibility argument today is that it is based on the incorrect belief that any one experiment tells you a final story. Any given experiment is a piece of a puzzle that we need to pull together over time. In a sense, all science is probationary. But it's the best system we've got for getting towards truth.

BTW, a great example is the mass of the electron, which took decades to correct and tighten up the errorbars: https://i.stack.imgur.com/WtmUj.png

Agree with qaz that CIs provide more complete information than p-values. An easy way to see this is if I give you a 95% CI you can easily tell if p<0.05 or not by looking at whether the null value is inside the CI (or you may even explicitly calculate p). But if I tell you p=0.03, you only know null value is not in the CI, but not what the CI per se is.

But I don't think a larger "n" changes the interpretation of a CI; 95% CIs will miss the true value 5% of the time in the long run by construction, regardless of "n." All that will happen with a larger n is that all CIs will be narrower, both those that include the true value and those that miss. Since you only get to see your CI, you don't know if you are in the lucky 95% or unlucky 5%. Repeat experiments allow construction of multiple CIs and then probe the ones that don't match the others more carefully.
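That coverage point is easy to check by simulation. A sketch assuming a known-variance normal setup, so the 95% interval is mean ± 1.96/√n: coverage stays near 95% at every n, and only the width shrinks.

```python
import math
import random

random.seed(4)
true_mu, runs = 3.0, 4000

def coverage_and_width(n):
    """Fraction of simulated 95% CIs that contain true_mu, and mean CI width."""
    hits, width = 0, 0.0
    for _ in range(runs):
        xs = [random.gauss(true_mu, 1.0) for _ in range(n)]
        m = sum(xs) / n
        half = 1.96 / math.sqrt(n)  # known sigma = 1, normal 95% half-width
        hits += (m - half) <= true_mu <= (m + half)
        width += 2 * half
    return hits / runs, width / runs

for n in (5, 20, 80):
    cov, w = coverage_and_width(n)
    print(f"n={n:3d}  coverage={cov:.3f}  mean width={w:.2f}")
```

Coverage hovers around 0.95 in every row; only the width column moves. That is the sense in which adding n to a value-estimation question is harmless.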

Anyway, going back to the original question of choosing n, I think a general recipe could be something like this:

1. Decide your significance (alpha) and the power (1 - beta) you need. This depends on the science at hand and you are the best judge. You should do this before doing anything else.

2. Run a small pilot to estimate the effect size and calculate the sample size to reach that power at that significance from #1.

3. Run your experiment with the n you calculated in #2. Do not re-use the samples from the pilot in #2. Note your CI.

4. Repeat step #3 as many times as possible and report all CIs. If you encounter an outlier CI (which will happen 5% of the time purely by chance) make sure your experiment didn't break or power didn't go out etc. but do report those CIs.

If you have an embarrassment of riches (i.e., more samples than you need) then you can use them for replication studies in #4 for a given power, rather than throwing them all in one experiment.

(I am sympathetic to the hard work that goes into data collection and understand for various reasons these steps need to be tweaked. But at the risk of being a bit tone deaf, this is what I would recommend statistically)
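For concreteness, step #2 of that recipe can be sketched with the standard normal-approximation formula n ≈ 2(z_{α/2} + z_β)² / d² per group, where d is the pilot's standardized effect size. The function name here is made up for illustration, and real designs would use a t-based correction that adds an animal or two.

```python
import math
from statistics import NormalDist

def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Normal-approximation sample size per group for a two-sample
    comparison of means; effect_size is in SD units (Cohen's d)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # 0.84 for 80% power
    return math.ceil(2 * (z_alpha + z_beta) ** 2 / effect_size ** 2)

# Suppose the pilot suggested a difference of about 0.5 SD (a "medium" effect):
print(n_per_group(0.5))  # 63 per group
```

Note the quadratic dependence on d: halve the pilot's effect size estimate and the required n quadruples, which is why a sloppy pilot estimate makes the whole calculation shaky.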

JJ - I think we are talking past each other. My point is that if your question is whether you are asking is about "what is the value of x", then you can add n. Adding n will shrink your CI around the value of x. While it doesn't tell you if you are in the good 95% or bad 5%, it does provide a more accurate estimate of x (which you can see because the CI shrinks). In fact, you should think that there is a full distribution of possibilities and you have an estimate of where x is in that distribution - that's the real Bayesian way. The CI is a mark at the 95% line of that distribution, but if you were going to do it right, you'd show that whole distribution.

On the other hand, if your question is "is x different from 0" or "is x different from y", then you are in thresholding world and adding n is not kosher and you need to play all those games you are talking about. Again, like the Bayes thing, if you are using confidence intervals to decide if x is different from y (because y lies outside your 95% confidence intervals of x), then you are essentially taking a roundabout way to get at p-values, you are in thresholding world again, and you need to pick your n a priori.

I was trying to explain to Grumpy (who I assume comes from a field that asks "what is the value of x" questions) why all this issue about picking your n beforehand.

Qaz--that graph of the electron mass error bars is a good example of the benefit of measuring effect size instead of significance. The problem with some fields is that they end up with "blah causes cancer" because they hit some significance level, but no one follows it up and it can become unchallengeable gospel. If instead they said "blah increases cancer by X (95% CI, Y-Z)", then someone else could come along and improve that.

JJ-- I get what you are saying, and I see its applicability to a big 7 year Phase III trial type study.

But, I think your 4 step recipe drives some people crazy because of this step: "Run a small pilot to estimate the effect size and calculate the sample size to reach that power at that significance from #1."

That step describes almost all of academic biomedical science. We usually have no clue about the effect size. What you call "a small pilot" is all we have money for. Once we complete that, if we have a non-zero effect size, we consider it established and move onto the next step.

"That step describes almost all of academic biomedical science. We usually have no clue about the effect size. What you call "a small pilot" is all we have money for. Once we complete that, if we have a non-zero effect size, we consider it established and move onto the next step."

Lol, this is pretty much what I've observed as well.

Here is my naive take on how experiments should be designed:

1. Decide on a scientific question.

2. Figure out what variables you have independent control over and what you can measure to answer the question.

3. Determine a model that will fit the data you expect to generate to answer the question.

4. Collect as much data as you can for as many values of the independent variable(s) as you can. Do your best to quantify both statistical *and* systematic error.

5. Fit the data in an appropriate manner (ideally something simple like least-squares curve fitting; if that is not possible consider comparing several parameter estimation methods), report back. If the initial model didn't fit the data, try to figure out why and see if another model fits. If another model fits, plan a new experiment to test that model in a different way.

One could argue that biomed ppl already follow this and they just choose a simple binary model to fit their data due to lack of control over independent variables. But it seems to me that too often they begin by assuming the model must be a simple binary test, and then they design the data collection and fitting to match.

And from my limited experience, the question of systematic error is pretty much never addressed.

BTW, lest I sound like a condescending jerk: plenty of physicists/engineers I know are incapable of 1 (clear scientific question) or 3 (modelling before collecting data) and they mostly do just fine.

I have a question for JJ. Increasingly, I see comments in figure legends stating that n mice, representing three replicates, were used to generate the figure being shown. Presumably there are criteria that determine the legitimacy of this practice? What would those be, out of interest?