A simple suggestion for Deputy Director for Extramural Research Lauer and CSR Director Nakamura

Nov 19 2015 · Published under Fixing the NIH, NIH, NIH Careerism

Michael S. Lauer, M.D., and Richard Nakamura, Ph.D., have a Perspective piece in the NEJM titled "Reviewing Peer Review at the NIH". The motivation is captured at the end of the first paragraph:

Since review scores are seen as the proximate cause of a research project's failure to obtain support, peer review has come under increasing criticism for its purported weakness in prioritizing the research that will have the most impact.

The first half or more of the Perspective details how difficult it is to even define impact and how nearly impossible it is to predict in advance, and ends up with a very true observation: "There is a robust literature showing that expert opinion often fails to predict the future." So why proceed? Well, because

On the other hand, expert opinion of past and current performance has been shown to be a robust measure; thus, peer review may be more helpful when used to assess investigators' track records and renewal grants, as is typically done for research funded by the Howard Hughes Medical Institute and the NIH intramural program.

This is laughably illogical when it comes to NIH grant awards. What really predicts future performance and scientific productivity is who manages to land the grant award. The money itself facilitates the productivity. And no, they have never, ever done this test, I guarantee you. When have they ever handed a whole pile of grant cash to a sufficient sample of the dubiously-accomplished (but otherwise reasonably qualified) and removed most funding from a fabulously productive (and previously generously-funded) sample and looked at the outcome?

But I digress. The main point comes later when the pair of NIH honchos are pondering how to, well, review the peer review at the NIH. They propose reporting broader score statistics, blinding review*, scoring renewals and new applications in separate panels and correlating scores with later outcome measures.

Notice what is missing? The very basic stuff of experimental design in many areas of research that deal with human judgment and decision making.

TEST-RETEST RELIABILITY.

INTER-RATER RELIABILITY.

Here is my proposal for Drs. Lauer and Nakamura. Find out first if there is any problem with the reliability of review for proposals. Take an allocation of grants for a given study section and convene a parallel section with approximately the same sorts of folks. Or get really creative and split the original panels in half and fill in the rest with ad hocs. Whenever there is a SEP convened, put two or more of them together. Find out the degree to which the same grants get fundable scores.
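
To make the "same grants, parallel panels" comparison concrete, here is a minimal sketch of the arithmetic, with entirely made-up scores and a made-up cutoff (an illustration, not anything CSR actually runs): chance-corrected agreement (Cohen's kappa) on the binary fundable/not-fundable call, plus a Spearman rank correlation on the raw impact scores. With more than two parallel panels you would reach for an intraclass correlation instead.

```python
# Minimal sketch: quantify agreement between two parallel panels that scored
# the same pile of applications. Scores and the cutoff below are invented.

def cohen_kappa(calls_a, calls_b):
    """Chance-corrected agreement on binary fundable/not-fundable calls."""
    n = len(calls_a)
    observed = sum(a == b for a, b in zip(calls_a, calls_b)) / n
    p_a = sum(calls_a) / n                        # panel A's "fundable" rate
    p_b = sum(calls_b) / n                        # panel B's "fundable" rate
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)  # agreement expected by chance
    return (observed - expected) / (1 - expected)

def spearman_rho(scores_a, scores_b):
    """Rank correlation of the two panels' impact scores (assumes no ties)."""
    def ranks(xs):
        order = sorted(range(len(xs)), key=lambda i: xs[i])
        r = [0] * len(xs)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    ra, rb = ranks(scores_a), ranks(scores_b)
    n = len(ra)
    d2 = sum((a - b) ** 2 for a, b in zip(ra, rb))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Hypothetical overall impact scores (NIH-style, lower is better) that two
# parallel sections gave the same ten applications.
panel_a = [22, 35, 48, 19, 57, 30, 41, 25, 63, 28]
panel_b = [31, 33, 52, 20, 49, 44, 39, 27, 58, 36]
cutoff  = 30   # pretend anything at or below 30 is a fundable score

fundable_a = [s <= cutoff for s in panel_a]
fundable_b = [s <= cutoff for s in panel_b]

print("kappa on fundable calls:", round(cohen_kappa(fundable_a, fundable_b), 2))
print("Spearman rho on scores: ", round(spearman_rho(panel_a, panel_b), 2))
```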

That's just the start. After that, start convening parallel study sections to, again, review the exact same pile of grants, except this time change the composition to see how reviewer characteristics may affect outcome. Make women-heavy panels, URM-heavy panels, panels dominated by smaller university affiliations and/or less-active research programs, etc.

This would be a great chance to pit the review methods against each other too. They should review an identical pile of proposals in traditional face-to-face meetings versus phone-conference versus that horrible web-forum thing.

Use this strategy to see how each and every aspect of the way NIH reviews grants now might contribute to similar or disparate scores.

This is how you "review peer review," gentlemen. There is no point in asking whether peer review predicts X, Y or Z outcome for a given grant when funded if it cannot even predict itself in terms of what will get funded.

__
*And by the way, when testing out peer review, make sure to evaluate the blinding. You have to ask the reviewers to say who they think the PIs are, their level of confidence, etc. And you have to actually analyze the results intelligently. It is not enough to say "they missed most of the time" if either the erroneous or correct guesses are not randomly distributed.
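
A minimal sketch of the footnote's point, using invented data (nothing here is an NIH dataset): tabulate correct guesses against an applicant characteristic, here a hypothetical "big lab" flag, to check whether the blinding fails exactly where bias is most worrying.

```python
# Minimal sketch with invented data: after a blinded review, reviewers record
# who they think the PI is. The headline miss rate alone can hide the fact
# that correct guesses cluster on a particular kind of applicant.

# One record per application (both fields are hypothetical):
#   guessed_right -- did an assigned reviewer correctly name the PI?
#   big_lab       -- is the PI a heavily published, well-funded investigator?
applications = [
    {"guessed_right": True,  "big_lab": True},
    {"guessed_right": True,  "big_lab": True},
    {"guessed_right": False, "big_lab": True},
    {"guessed_right": True,  "big_lab": True},
    {"guessed_right": True,  "big_lab": False},
    {"guessed_right": False, "big_lab": False},
    {"guessed_right": False, "big_lab": False},
    {"guessed_right": False, "big_lab": False},
    {"guessed_right": False, "big_lab": False},
    {"guessed_right": False, "big_lab": False},
]

def correct_rate(apps):
    return sum(a["guessed_right"] for a in apps) / len(apps)

overall = correct_rate(applications)
big     = correct_rate([a for a in applications if a["big_lab"]])
other   = correct_rate([a for a in applications if not a["big_lab"]])

print(f"overall correct-guess rate: {overall:.0%}")  # "they missed most of the time"
print(f"  ... for big-lab PIs:      {big:.0%}")      # but not for the famous labs
print(f"  ... for everyone else:    {other:.0%}")
# A formal test (e.g., Fisher's exact test on the 2x2 table) plus the
# reviewers' stated confidence in each guess would round out the analysis.
```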

Additional Reading: Predicting the future

In case you missed it, the Lauer version of Rock Talk is called Open Mike.

Cite:
Reviewing Peer Review at the NIH
Michael S. Lauer, M.D., and Richard Nakamura, Ph.D.
N Engl J Med 2015; 373:1893-1895. November 12, 2015.
DOI: 10.1056/NEJMp1507427

  • Rheophile says:

    Inter-rater reliability is very low for peer review in general:

    http://blog.mrtz.org/2014/12/15/the-nips-experiment.html

    http://academia.stackexchange.com/questions/33031/studies-over-how-noisy-is-it-to-accept-reject-submissions

    But perhaps with a fixed format and a higher motivation to get it right the first time, NIH grant review could do better.

  • Philapodia says:

    The only way this type of experiment would work is if the reviewers didn't know it was an experiment; otherwise you won't capture accurate behavior. Let's say you set up several study sections reviewing the same pool of grants as you suggest. I assume that the normal paylines would have to apply, but what happens if a grant scores well in the BSD study section but not so well in the 1/2 ad hoc study section and the 1/2 women study section? How do you choose what to fund? Is it fair to the rest of the applicants for a small subset of applications to be independently reviewed several times while everyone else has to wait?

    Imagine how pissed off the applicants are going to be when they find out they scored well in one study section but are not funded, or that the proposal they agonized over for weeks/months was used in an experiment without their knowledge. Also, imagine how pissed off the reviewers are going to be when they find out that they were part of an experiment and that all the effort they put into reviewing was for naught because they scored an application very well but the other study sections didn't and it wasn't funded. Will this alienate good reviewers who feel ill-used by the experiment?

    I agree that an experiment like this would provide valuable information, but there are definite pitfalls to doing the experiment.

  • Pinko Punko says:

    Why not survey applicants about their reviews? Why not have reviewers review other reviews for the same grants? I think this would make it obvious where there are some less-good reviews. I do think DM's suggestions would reveal possible bias towards some institutions/investigator types - perhaps into discussion range but not necessarily funding range.

  • jmz4gtu says:

    " Also, imagine how pissed off the reviewers are going to be when they find out that were part of an experiment"
    -Then you revoke their science card. I would hope they could at least appreciate the humor of experimenting on scientists.

    "Is it fair to the rest of the applicants for a small subset of applications to be independently reviewed several times while everyone else has to wait? "
    -Is it fair for future applicants to deal with a more flawed system just because the current crop of applicants can't deal with a delay?

    I don't think you'd have problems getting people to do a special 50/50 gender split review (at least not the women).

    In any case, there's no reason to delay decisions. Just wait until the next budget boondoggle, and do it then. The NIH *still* hasn't made any decisions on new grants for councils meeting in September (as near as I can tell with the eRA Commons NOA lookup tool). So you'd have plenty of time to review and re-review the batch of applications with multiple rounds of reviewers.

    The hardest part of DM's proposal is getting enough reviewers to pull it off while keeping the pool representative of the people who actually take the time and effort to sit on study section. Of course, if the NIH would just grow a pair and demand that any NIH-funded investigator can be called up to review once a year, this wouldn't be an issue.

  • Drugmonkey says:

    Review is confidential. Review outcome is not known to the reviewers. They wouldn't have to know if they were in the "real" one or not. Speaking of which, no reason CSR can't assign one review panel as the one that counts and still convene other panels.

  • Drugmonkey says:

    Oh and it goes without saying this would have to be a CSR project only...no telling POs about it until later!

  • AG says:

    A few studies in the past year tried to take advantage of the ARRA funding to do this type of experiment.

    See for example PubMed ID 25722441, which compared ARRA R01s to "payline grants". It concludes that, dollar for dollar, the R01 grants funded only by virtue of the extra money, which scored below the payline, were comparable in output (measured by normalized citation impact per $1M) to the payline grants. I recall a few other papers with different conclusions but can't find them.

    While certainly not a perfect experiment, these studies did try to see what happens if you give money to those deemed most worthy by the reviewers and those who just missed the cutoff. In my experience, you could probably go down to the 50th percentile and see similar productivity. Maybe even lower.

    Regarding the confidentiality of review, perhaps grants are being assigned to multiple panels. We'd never know unless they publicize their findings.

  • Mechanico-chemical Machine says:

    I am in a position to provide an anecdote regarding inter-reviewer reliability, as I have experienced the situation you suggest.

    A few yrs ago, I submitted an A0 application and was assigned to the study section I requested, which I later found out would meet in two 'parallel' sections comprised of folks of similar background and experience, a few months apart. My application comes back 'not discussed'. I look at the scores - one good review (2's), second not so good (3-4's), and the third scored me very very poorly (think 8's and 9's). So, I talk to my PO, and a short while later I get a call offering a do-over (they literally say the words "do-over"). They offer to take my same application, with no updates, edits, or changes of any kind, and send it to the parallel section to be re-scored. The parallel section won't know it's already been reviewed once. So I say, ok. Very shortly my original scores disappear from eRA and the new ones show up. The parallel section scored it at <5th percentile with 1's and 2's across the board. It was funded as requested.

  • drugmonkey says:

    I bet the NIH mantra would say "poor study section fit" for this case, rather than taking it as evidence for the entire system being unreliable.

  • jmz4gtu says:

    Do they keep track of individual reviewer statistics (e.g. scores given for each criteria)?

  • Jim Woodgett says:

    The main reason there is little reliability/reproducibility data about the quality of peer review is that the whole house of cards would topple if the degree of variation (as tested legitimately through parallel panels) became clear. Peer review is the gold standard for scientific adjudication, but it's ironic that its effectiveness is largely based on historic record and assumptions that no longer hold. Drop below a 20th percentile success rate and the power to discriminate based on merit evaporates. Hence the agency push towards more structured review and use of quantifiable metrics, which simply results in formulaic, uncreative and gamed science.

    Jaded, moi?

  • Baltogirl says:

    Mechanico-chemical may have been a part of the experiment I heard about a few years ago from someone in the know. A group of grants was to be reviewed by different panels and the agreement assessed. I never found out what happened, nor did the person who told me about the study.
    Meanwhile, Dr. Nakamura assessed the effectiveness of study section in June by holding group interviews in a DC hotel for panel members who volunteered for them; 75-dollar gift cards were provided (I went). Of note, the meetings were held BEFORE the panels met. Never heard what happened, but I really doubt that this is an effective way to assess the quality of peer review.
    I agree totally with DM regarding the absurdity of not testing the reliability of peer review using the measures he describes. (And these are scientists!?!)
    When important information is not available, it is often because leaders are afraid to know the answers: so they either do not pose the necessary questions, or they suppress the results.

  • Dr Becca says:

    The fact that pretty much everyone I've met has seen a reasonable A0 score get worse as an A1 should be evidence enough that there is a TON of randomness in the system.

  • drugmonkey says:

    That may be a different issue though. Grants are in competition with what else was submitted *for that round*, not with their own prior version.

  • Neuropop says:

    BTW, many of the tests you suggest are de rigueur in NSF/BIO. Each project comes in as a preliminary proposal, and the successful ones after review are evaluated by a different panel + ad hoc reviewers. Inasmuch as one can trust that a preliminary proposal and a full proposal are similar, one can look for test-retest reliability and perhaps even inter-rater reliability. Also, NSF program officers make it a point to balance the composition of the panels with all the criteria you mention (URM/women/small institutions, etc.). So go ask one of them how it works. Maybe there's something to learn?

  • zb says:

    " Also, imagine how pissed off the reviewers are going to be when they find out that were part of an experiment"
    -Then you revoke their science card. I would hope they could at least appreciate the humor of experimenting on scientists.

    funny.

    The NIPS study described above seems to address some of the logistics of doing the experiment.

    As others point out, I think the real problem with the experiments is that they are going to show that systems that attempt to pick the top 5% of any pool are going to be flawed (even if there were a ground truth about who the top 5% are). The system is going to be even more flawed when factoring in the unpredictability of future outcomes: that is, we might be able to pick the top 5 basketball players in the 8th grade, but if we use that pick to predict the top 5 basketball players in the 12th grade, even a perfect initial ranking will be an imperfect measure of future success.

  • drugmonkey says:

    You guys are so pessimistic! Maybe a nice set of validation experiments would prove that NIH grant review *is* quite reliable and repeatable.

  • eeke says:

    "Also, imagine how pissed off the reviewers are going to be when they find out that were part of an experiment .."

    Wouldn't informed consent apply to this? Same goes for the applicants.

    It's a waste of time to try to "improve" peer review. I think even in the best of circumstances, it would be impossible to get >80% reliability across scores. The top ~20% of grants are likely indistinguishable in terms of merit, "impact", etc. Time would be better spent lobbying Congress for more fuckin money.

  • Philapodia says:

    @jmz4gtu
    "-Then you revoke their science card. I would hope they could at least appreciate the humor of experimenting on scientists."

    How is this funny when people's careers/livelihoods are on the line? People write grants to pay their own salaries and the salaries of their students and staff (including post-docs). Sacrificing people for the greater good is kind of shitty.

    @DM
    "Review is confidential. Review outcome is not known to the reviewers. They wouldn't have to know if they were in the "real" one or not."

    Review is not really that confidential and you know it. The applicant knows who the reviewers on the panel are (but now there are three times the usual number of reviewers on the study section roster posted by CSR. Says the applicant: "Huh, what's going on here, I wonder? Am I part of an experiment?"), and the reviewers who should be reviewing the proposals are usually in fields close to the applicant's and hear things ("Huh, I wonder why Asst. Prof Youngun didn't get tenure; he/she got all ones and should have gotten that grant based on the study section I was on. I'll ask around..."). It will get around sooner than you think.

    "Oh and it goes without saying this would have to be a CSR project only...no telling POs about it until later!"

    This assumes that all of the POs working in the same area don't talk to one another. There aren't that many POs in any area, and they'll hear about it.

    @eeke
    Informed consent would tip people off and potentially make them change behavior, negating the value of the study.

    These things will get around and reduce confidence in the system even further.

  • MorganPhD says:

    Why not just have a second study section re-review a set of grants that have already been reviewed or awarded? And have it be the most recent set of grants, so it's unlikely that grants are scored lower just because of some timeliness issue (oh, you're not doing optogenetics or CRISPR yet, low score).

    Then it's not a matter of fussing around with parallel review or deciding which score counts.

    And you don't need to tell the study section members a thing. They are there to review grants and are compensated (however poorly) for it. They are helping the NIH make a decision about grants, just not about whether this PARTICULAR set of grants will be funded or not.

  • zb says:

    "How is this funny when peoples careers/livelihoods are on the line? People write grants to pay their own salaries and the salaries for their students and staff (including post-docs). Sacrificing people for the greater good is kind of shitty."

    Seriously? You're saying an experiment like this one would be somehow worse than placebo groups, which are completely accepted in science (though there are also standards of ethics in designing them)?

    The NIPS example had double review (to test for inter-evaluator reliability) and used the more positive review to determine outcome. I don't think that would work for grants (it wouldn't be fair to those grants that weren't reviewed twice, but it's not an inconceivable choice, if the number of grants in the experiment were small enough) -- but it would be easy to do the experiment while designating one of the two sections as the decision maker for each grant (while reviewers remained blind to that designation). The problem with the experiment is that it's expensive in terms of time and cost (reviewing is expensive) and I'm uncertain what we would do with the results.

    A fundamental question is whether the current "treatment" (i.e. the review process) is working, in which case any new system would face a significant burden of proof. On the other hand, if most of the patients are dying, taking risks to look for better treatments might be worth it.

    I feel like the system is broken as long as we are looking at 5% paylines, and that it's not broken at 20% paylines, because I think there's already evidence that people can roughly classify groups of applications (grants, people, etc.) into quintiles. I don't think they can classify them into 20-iles (whatever the word for that might be), and when we pretend they can, we create widespread chaos and waste. (A toy simulation of this point appears at the end of this comment.)

    With that hypothesis, a fix is to bring paylines back to 20%. We could do that with more money (but I don't think that could happen) or fewer people. Some of the elites in the field seem to think that fewer people is the solution (i.e. designate a chosen few and then review them to make sure they are in the top 20% or even the top 50%, the way, say, Howard Hughes might).

    The alternative fix is to figure out how to divide applications into 20 categories very very well. I don't see that happening, but those who are wedded to reviewing projects might be able to come up with better methods (or might think this one is working).
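
A toy Monte Carlo sketch of the payline point above, with all assumptions invented (Gaussian "true merit", review noise comparable in size to the merit spread): give two independent panels noisy looks at the same applications and ask how much their funded lists overlap at a 20% versus a 5% payline. It does not show that a 20% payline is "fine," only that agreement between panels degrades as the cut gets tighter.

```python
# Toy simulation (invented assumptions): two independent panels score the same
# applications with noise; measure how much their top picks overlap.
import random

def overlap_of_top_picks(n_apps=100, pick_fraction=0.20, noise_sd=1.0,
                         n_trials=2000, seed=1):
    rng = random.Random(seed)
    n_pick = max(1, int(n_apps * pick_fraction))
    total_overlap = 0.0
    for _ in range(n_trials):
        merit   = [rng.gauss(0, 1) for _ in range(n_apps)]     # "true" quality
        score_a = [m + rng.gauss(0, noise_sd) for m in merit]  # panel A's noisy view
        score_b = [m + rng.gauss(0, noise_sd) for m in merit]  # panel B's noisy view
        top_a = set(sorted(range(n_apps), key=lambda i: score_a[i], reverse=True)[:n_pick])
        top_b = set(sorted(range(n_apps), key=lambda i: score_b[i], reverse=True)[:n_pick])
        total_overlap += len(top_a & top_b) / n_pick
    return total_overlap / n_trials

for frac in (0.20, 0.05):
    agree = overlap_of_top_picks(pick_fraction=frac)
    print(f"payline {frac:.0%}: panels agree on about {agree:.0%} of the funded list")
```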

  • drugmonkey says:

    MorganPhD-

    Reconsideration of already-awarded grants misses the point: severe range truncation, plus the inability to include the just-missed apps from that first review in the analysis. Same problem with all of the NIH analyses of grant outcomes that are conditioned on the award having been made. It is madness that they continue to try to advance these as if they are meaningful! We need to know about the stuff that didn't get funded.

    Second, there is always the problem of time moving on in science. The best way is to have them reviewed simultaneously; otherwise a paper published or another grant awarded in between the reviews could contaminate things.

  • jmz4gtu says:

    "How is this funny when peoples careers/livelihoods are on the line? "
    -Gallows humor? It's certainly not any more messed up than your livelihood resting on the dubious claim that study sections can differentiate between a 12% and and 8% grant in terms of perceived worth, and not be affected by petty biases.

    I kind of agree that trying to improve the system is worthless with paylines where they are, but if that's what we're trying to do, then DM's suggestion of side-by-side, contemporaneous evaluation of the same grants is the best way to get a measure of variability and robustness in reviewing.

  • shrew says:

    Frankly this kind of experimental design is not exactly rocket science. (It isn't even brain surgery. "And I should know...")

    If they wanted to know these answers, they could have asked the question dozens of times by now. Baltogirl is right. They do analyses on already-funded grants in the hope that people will stop asking about it.

    Because what happens when they do the experiment, and they find out that interrater reliability is shitte? Congress starts asking questions about why they are spending all this fucking money funding science that scientists can't even agree is valuable, that's what happens.

  • Baltogirl says:

    The reason we really need to get better at peer review is that the system is so binary. The outcome is exactly the same if you get scored at 14% or 94%: no money.
    If the current system were REALLY GOOD at rating from 0 to 20%, or if grants that scored between 10 and 20 percent got SOMETHING, it would not be considered so unfair.
    It's my opinion that we could do better.

  • MorganPhD says:

    I'm suggesting taking every single grant reviewed during a given study section meeting and reviewing them 4 months later, including grants that were awarded and grants that were triaged.

    And yes, I think timeliness is an issue, but one new paper published in those 4 months is going to affect (just a guess) a small % of grants. It might be more, as I don't have the honor of being on a study section (as was previously intimated, postdocs aren't yet good enough to do that...). Although timeliness can theoretically kill grants now, as the lag phase to review is currently a few months anyway.

    I'm just trying to spitball ways to get around the ethical considerations/PI outrage if you score grants in parallel and then arbitrarily decide which score counts. In my design, the PI never knows that their grant is re-reviewed. You go with the "gold standard" method of peer review until we find another method.

    Other commenters are correct (in my opinion) in saying that the NIH is doing this type of experiment piecemeal and in a non-serious way.

    The (retroactive-type) analyses that Rockey used to put on her blog are parsed out in a way to make current and new policies look timely and important, while ignoring larger issues.

  • Dr Becca says:

    The reason we really need to get better at peer review is that the system is so binary. The outcome is exactly the same if you get scored at 14% or 94%: no money.

    This x100000000. This is why it kills me when people say "but getting discussed these days is a real accomplishment!" No it isn't. The only accomplishment is getting FUNDED.

  • lurker says:

    Grant review during the salad years followed Newtonian-like mechanics. If you got a 30% or below, there was a good chance you'd get funded. Write your grant clearly, try a few times, and you'd likely get your first grant before the tenure clock ran out.

    Today, grant reviews are in the quantum mechanics realm, making it impossible to test INTER-RATER RELIABILITY. The noise and randomness in the process are so pervasive that, as with the Heisenberg uncertainty principle, the very act of perturbing the study section process to measure it would distort the process from the original trajectory of the review events.

    1) You can't put a test grant app through this experiment and completely hide its intent. 2) Study sections meet infrequently, so who would subject a real application to a year of triplicate testing? 3) To truly model real study sections, you would also have to capture the noise from rampant turnover in ad hocs and changing sets of grants per stack (some deadlines anecdotally have weaker piles than others). Any of these factors will greatly distort the percentile rankings versus impact scores in every review event.

    My anecdata: two separate grants at two different study sections. Both had the same history at their respective study sections across submissions: A0 = upper 30s percentile, A1 = lower 20s percentile, new A0 with a paper published for each = triaged. It's very hard to recover from that, and I bet the NIH CSR is more than aware the system right now is a crapshoot for everyone save the BSDs, who are so big that, like macroscopic objects in physics, they are immune to quantum disturbances. The rest of us are but atomic riffraff at the mercy of quantum mechanics, hence the churn.

    If only I could invoke Scott Bakula to get my grants funded (let's see how many readers get this reference).

  • AcademicLurker says:

    @lurker: I hear that they made a Study Section episode of Quantum Leap, but the test audience was so bored by it that it never aired.

  • The Other Dave says:

    "When have they ever handed a whole pile of grant cash to a sufficient sample of the dubiously-accomplished (but otherwise reasonably qualified)..."

    This happens all the time. It just takes the right pedigree.

    "...and removed most funding from a fabulously productive (and previously generously-funded) sample and looked at the outcome?"

    Misconduct?

  • AcademicLurker says:

    and removed most funding from a fabulously productive (and previously generously-funded) sample and looked at the outcome?

    Misconduct?

    This has actually happened recently in a field adjacent to mine. The (now former) BSD's productivity has indeed dropped dramatically. The trouble is, how do you disentangle decreased publications due to lost funding from decreased publications due to the fact that no one trusts anything the person says anymore?

  • drugmonkey says:

    OK, so if NIH wanted to, they could actually pull together some stats on later-career folks who have declined in total grant support vs those who maintained it, particularly accounting for those PIs saved by pickups and R56s.

  • drugmonkey says:

    And maybe with the R00 pool look at those that happened to get early additional RPGs vs those that struggled longer to get additional grants.
