NIH Ginther Fail: This is not anything like real grant review

May 31 2018 | Fixing the NIH, NIH, Underrepresented Groups

I recently discussed some of the problems with a new pre-print by Forscher and colleagues describing a study which purports to evaluate bias in the peer review of NIH grants.

One thing I figured out today is that the team funded under the grant that supported the Forscher et al study also produced a prior paper I have already discussed. That prior discussion focused on the use of only funded grants to evaluate peer review behavior, and the corresponding problem of a restricted range. The conclusion of that paper was that reviewers didn't agree with each other in the evaluation of the same grant. This, in retrospect, also seems to be a design that was intended to fail: in that instance, to fail to find correspondence between reviewers, just as the Forscher study seems constructed to fail to find evidence of bias.

I am working up a real distaste for the "Transformative" research project (R01 GM111002; 9/2013-6/2018) funded to PIs M. Carnes and P. Devine that is titled EXPLORING THE SCIENCE OF SCIENTIFIC REVIEW. This project is funded to the tune of $465,804 in direct costs in the final year and reached as high as $614,398 direct in year 3. We can, I think, fairly demand a high standard for the resulting science. I do not think this team is meeting that standard.

One of the papers (Pier et al 2017) produced by this project discusses the role of study section discussion in revising/calibrating initial scoring.

Results suggest that although reviewers within a single panel agree more following collaborative discussion, different panels agree less after discussion, and Score Calibration Talk plays a pivotal role in scoring variability during peer review.

So they know. They know that scores change through discussion and they know that a given set of applications can go in somewhat different directions based on who is reviewing. They know that scores can change depending on which other ersatz panel members are included and perhaps depending on how the total load of grants is distributed to reviewers in those panels. The study described in the Forscher pre-print did not convene panels:

Reviewers were told we would schedule a conference call to discuss the proposals with other reviewers. No conference call would actually occur; we informed the prospective reviewers of this call to better match the actual NIH review process.

Brauer is an overlapping co-author. The senior author on the Forscher study is Co-PI, along with the senior author of the Pier et al. papers, on the grant that funds this work. The Pier et al 2017 Res Eval paper shows that they know full well that study section discussion is necessary to "better match the actual NIH review process". Their paper shows that study section discussion does so in part by getting better agreement on the merits of a particular proposal across the individuals doing the reviewing (within a given panel). By extension, not including any study section type discussion is guaranteed to result in a more variable assessment. To throw noise into the data. Which has a tendency to make it more likely that a study will arrive at a null result, as the Forscher et al study did.
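To put some toy numbers on that noise argument, here is a minimal simulation sketch. This is my own construction, not anything from the Forscher analysis, and every parameter value is an assumption: each preliminary score is treated as a true merit value plus reviewer noise, and the noise is larger when reviewers never calibrate against a panel. The noisier the scores, the less often even a real, modest bias reaches statistical significance.

```python
# A minimal sketch, not the authors' analysis: how extra reviewer noise
# (e.g. from skipping panel discussion) lowers the power to detect a small
# bias in preliminary scores. All parameter values here are assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def detection_rate(n_apps=100, true_bias=0.3, reviewer_sd=1.0, n_sims=2000):
    """Fraction of simulated experiments in which a t-test detects the bias.

    n_apps      -- applications per PI group
    true_bias   -- assumed mean score penalty, in points on the 1-9 scale
    reviewer_sd -- SD of reviewer-to-reviewer noise around the true score
    """
    hits = 0
    for _ in range(n_sims):
        group_a = 3.0 + rng.normal(0, reviewer_sd, n_apps)              # reference group
        group_b = 3.0 + true_bias + rng.normal(0, reviewer_sd, n_apps)  # group the bias works against
        if stats.ttest_ind(group_a, group_b).pvalue < 0.05:
            hits += 1
    return hits / n_sims

# Smaller SD stands in for scores that converged through discussion;
# larger SD for isolated reviewers with no calibration talk.
print("power with low noise (sd = 0.7): ", detection_rate(reviewer_sd=0.7))
print("power with high noise (sd = 1.5):", detection_rate(reviewer_sd=1.5))
```

Same assumed bias, same number of applications; only the reviewer noise changes, and the chance of a null result goes up with it.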

These investigators also know that the grant load for NIH reviewers is not typically three applications, as was used in the study described in the Forscher pre-print. From Pier et al 2017 again:

We further learned that although a reviewer may be assigned 9–10 applications for a standing study section, ad hoc panels or SEPs can receive assignments as low as 5–6 applications; thus, the SRO assigned each reviewer to evaluate six applications based on their scientific expertise, as we believed a reviewer load on the low end of what is typical would increase the likelihood of study participation.

I believe that the reviewer load is critically important if you are trying to mimic the way scores are decided by the NIH review process. The reason is that while several NIH documents and reviewer guides pay lip service to the idea that the review of each grant proposal is objective, the simple truth is that review is comparative.

Grant applications are scored on a 1-9 scale with descriptors ranging from Exceptional (1) to Very Good (4) to Poor (9). On an objective basis, I and many other experienced NIH grant reviewers argue, the distribution of NIH grant applications (all of them) is not flat. There is a very large peak around the Excellent to Very Good (i.e., 3-4) range, in my humble estimation. And if you are familiar with review you will know that there is a pronounced tendency of reviewers, unchecked, to stack their reviews around this range. They do it within reviewer and they do it as a panel. This is why the SRO (and Chair, occasionally) spends so much time before the meeting exhorting the panel members to spread their scores. To flatten the objective distribution of merit into a more linear set of scores. To, in essence, let a competitive ranking procedure sneak into this supposedly objective and non-comparative process.

Many experienced reviewers understand why this is being asked of them, endorse it as necessary (at the least) and can do a fair job of score spreading*.

The fewer grants a reviewer has in the immediate assignment pile, the less distance there needs to be across this pile. If you have only three grants and score them 2, 3 and 4, well hey, scores spread. If, however, you have a pile of 6 grants and score them 2, 3, 3, 3, 4, 4 (which is very likely the objective distribution) then you are quite obviously not spreading your scores enough. So what to do? Well, for some reason actual NIH grant reviewers are really loath to throw down a 1. So 2 is the top mark. Gotta spread the rest. Ok, how about 2, 3, 3...er 4 I mean. Then 4, 4...shit. 4, 5 and oh 6 seems really mean so another 5. Ok. 2, 3, 4, 4, 5, 5. Phew. Scores spread, particularly around the key window that is going to make the SRO go ballistic.

Wait, what's that? Why are reviewers working so hard around the 2-4 zone and caring less about 5+? Well, surprise surprise, that is the place** where it gets serious between probably fund, maybe fund and no way, no how fund. And reviewers are pretty sensitive to that**, even if they do not know precisely what score will mean funded / not funded for any specific application.

That little spreading exercise was for a six grant load. Now imagine throwing three more applications into this mix for the more typical reviewer load.
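Here is a toy version of that spreading exercise. It is purely illustrative, nothing from the study, and the function and the numbers are mine: a pure rank-based spread applied to the clustered piles from the example above. With three grants the clustered merits already look spread; with six, the same clustered merits get pushed across a wider range than the underlying differences warrant, which real reviewers soften only a little with ties (the 2, 3, 4, 4, 5, 5 of the example).

```python
# Purely illustrative, not from any study: spread clustered "objective"
# merits by rank, the way the exhortation to spread scores pushes reviewers.
def spread_by_rank(objective_scores, top=2):
    """Give each grant a distinct score by rank, starting at `top` (ties broken arbitrarily)."""
    order = sorted(range(len(objective_scores)), key=lambda i: objective_scores[i])
    spread = {idx: top + rank for rank, idx in enumerate(order)}
    return [spread[i] for i in range(len(objective_scores))]

pile_of_three = [2, 3, 4]           # already looks spread
pile_of_six = [2, 3, 3, 3, 4, 4]    # the clustered pile from the example above

print(spread_by_rank(pile_of_three))  # [2, 3, 4] -- no distortion needed
print(spread_by_rank(pile_of_six))    # [2, 3, 4, 5, 6, 7] -- wider than the merits warrant
```

Add three more applications to the pile and the gap between the spread scores and the clustered underlying merits only grows.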

For today, it is not important to discuss how a reviewer decides one grant comes before the other or that perhaps two grants really do deserve the same score. The point is that grants are assessed against each other. In the individual reviewer's stack and to some extent across the entire study section. And it matters how many applications the reviewer has to review. This affects that reviewer's pre-discussion calibration of scores.

Read phase, after the initial scores are nominated and before the study section meets, is another place where re-calibration of scores happens. (I'm not sure if they included that part in the Pier et al studies; it isn't explicitly mentioned, so presumably not?)

If the Forscher study only gave reviewers three grants to review, and did not do the usual exhortation to spread scores, this is another serious, and I would say fatal, flaw in the design. The tendency of real reviewers is to score more compactly. This is, presumably, enhanced by the selection of grants that were actually funded (either on the version used in the study or in revision), which we might think would at least cut off the tail of really bad proposals. The range will be 2-4*** instead of 2-5 or 6. Of course this will obscure differences between grants, making it much, much more likely that no effect of the PI's sex or ethnicity (the subject of the Forscher et al study) would emerge.
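The range-restriction point can be sketched the same way. Again, these are my own toy numbers, not anything from the pre-print: if reviewers keep scores in a tight 2-4 band, any real difference between PI groups gets compressed along with everything else while reviewer noise does not shrink, so the same relative bias is detected far less often.

```python
# A rough sketch with assumed numbers: when the usable score range is
# compressed (e.g. 2-4 instead of 2-6), a bias that scales with that range
# shrinks relative to reviewer noise and is detected less often.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

def power(range_width, bias_fraction=0.1, noise_sd=1.0, n_apps=50, n_sims=2000):
    """Detection rate when the bias is a fixed fraction of the usable score range."""
    bias = bias_fraction * range_width
    hits = 0
    for _ in range(n_sims):
        a = rng.normal(0.0, noise_sd, n_apps)
        b = rng.normal(bias, noise_sd, n_apps)
        if stats.ttest_ind(a, b).pvalue < 0.05:
            hits += 1
    return hits / n_sims

print("scores spread over 2-6 (width 4):", power(range_width=4))
print("scores stuck in 2-4 (width 2):   ", power(range_width=2))
```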

__
Elizabeth L. Pier, Markus Brauer, Amarette Filut, Anna Kaatz, Joshua Raclaw, Mitchell J. Nathan, Cecilia E. Ford and Molly Carnes, Low agreement among reviewers evaluating the same NIH grant applications. 2018, PNAS: published ahead of print March 5, 2018, https://doi.org/10.1073/pnas.1714379115

Elizabeth L. Pier, Joshua Raclaw, Anna Kaatz, Markus Brauer, Molly Carnes, Mitchell J. Nathan and Cecilia E. Ford. ‘Your comments are meaner than your score’: score calibration talk influences intra- and inter-panel variability during scientific grant peer review, Res Eval. 2017 Jan; 26(1): 1–14. Published online 2017 Feb 14. doi: 10.1093/reseval/rvw025

Patrick Forscher, William Cox, Markus Brauer, and Patricia Devine. No race or gender bias in a randomized experiment of NIH R01 grant reviews. Created on: May 25, 2018 | Last edited: May 25, 2018 https://psyarxiv.com/r2xvb/

*I have related before that when YHN was empaneled on a study section he practiced a radical version of score spreading. Initial scores for his pile were tagged to the extreme ends of the permissible scores (this was under the old system) and even intervals within that range were used to place the grants in his pile.

**as are SROs. I cannot imagine an SRO ever getting on your case to spread scores for a pile that comes in at 2, 3, 4, 5, 7, 7, 7, 7, 7.

***Study sections vary a lot in their precise calibration of where the hot zone is and how far apart scores are spread. This is why the more important funding criterion is the percentile, which attempts to adjust for such study section differences. This is the long way of saying I'm not encouraging comments niggling over these specific examples. The point should stand regardless of your pet study section's calibration points.

10 responses so far

  • qaz says:

    As a note on comparative scoring, what I felt was the most fair study section I ever attended (certainly there was less complaining by study section members at the dinner after study section about the scores given) was explicitly forced to be comparative. We were told that we could only give each score once and if we had more than 9 grants, we had to use each score once before giving a grant the same score as another. What this meant was that you were literally ranking the grants in your stack. It was ***so*** much easier than typical scoring at study section which is often a mess of trying to decide just what is a "minor weakness" and what is a "major weakness".

    In truth, during the discussion that rule was rescinded and people certainly said "this was the best grant in my stack so I gave it a 1. I don't think it's that great." and "I had a great set. So I gave it a 5, it should be better than that." But in general, the starting spread of scores made the whole process easier.

    And I 100% agree with you that there is a peak around 2-4. This was why the old 1.0-5.0 was better, because we spent all our time ranking in the 1.5 to 3.0 range and left the 5.0 for the real garbage. It was painful giving a 9 to the worst grant in my stack, which was really a 4, but the other eight were all 2s and 3s.

  • drugmonkey says:

    But in general, the starting spread of scores made the whole process easier.

    see *.

  • Pinko Punko says:

    So much bias can sneak in when you are trying to spread. Which grant gets bumped one. Is it the better written, clearer grant of moderate impact, or the bloated grant that seems like it could be more significant. It is hard

  • ola says:

    Quit being coy Punko, it's the one that didn't cite your paper!

  • Pinko Punko says:

    Haha. *stares into distance*

  • drugmonkey says:

    So much bias can sneak in when you are trying to spread. Which grant gets bumped one. Is it the better written, clearer grant of moderate impact, or the bloated grant that seems like it could be more significant. It is hard

    I was going to get into that in another post, but yeah, I do think that this spreading of scores and the weighing of factors to put one grant ahead of another is where subtle biases can work to the detriment of out groups. How many times does "well this proposal is a straight up hot mess but we know Dr. BSD is going to come up with great papers so.....1" versus "dunno who the hell this youngun is but damn this is an amazing idea that just has to be funded so that's why I'm giving it a ...3" versus...?

  • mH says:

    I think this is exactly right. As a Canadian data point, when CIHR went from panels that scored similarly to NIH (with scores clustering near the estimated payline) to forcing reviewers to rank and then using mean ranks across reviewers, they went from roughly equal success rates across career stages to mid-career success rates falling by about a third and early-career rates being cut in half.

    (Aside: isn't it just so weird that no one ever comes up with grant review innovations that happen to penalize senior researchers? I mean these are all good faith efforts, right?)

    There were other issues as well, but I think forcing ranks and/or score spreading is in opposition to the reality of the real distribution of grant "fundability" by whatever measure. In particular the forced tradeoffs of ranking will amplify biases, and this is likely worse and less fair than the random/noise effects of having clustered grants near the payline.

    All of this stuff to force spreading or jimmy with ranks is an administrative fantasy. They want bright lines in the process so they can credibly claim they are unambiguously funding the "right" grants/people and not have to admit how arbitrary it truly inescapably is, even more so at low success rates.

  • bacillus says:

    Isn't triage supposed to get rid of the 6-9 scores before SS begins its discussions? Presumably, most of what survives triage would be funded in an ideal world? In which case clustering is the obvious outcome. I'd rather get triaged than get scores of 6-9. Both send the message not to bother revising the application, but the latter seems much more spiteful and much more likely to lead to cries of "no fair" from the affected PI.

  • qaz says:

    bacillus - the new scoring system is non-linear. It's supposed to be 1-8 plus 9. So triage should be in the 9s. Essentially, think of the old system (1.0-5.0) as linear but bunched so that people generally used 15 of the 50 available numbers (everything submitted ended 1.5-3.0). Since no one needed those other numbers, the new system is supposed to take those 15 numbers and split them over 1-8 and then put everything else in 9. Of course, humans don't understand that non-linearity and reviewers re-bunched everything back to 2-3 (out of 9). Thus the cajoling to "spread the scores".

    Also, NIH has explicitly denied that they care how study section makes you feel. In the old days, there was a goal of helping junior investigators learn how to write grants, so I was told when I first started that a score of 1.5 meant "phenomenal", a 2.0 meant "worth funding", and a 3.0 meant "I like the idea, but I want to see it again when you've fixed my concerns". A 4.0 meant "don't come back with this grant" and a 5.0 carried a specific message of "we're not putting up with you." The only time I ever actually saw a 5.0 was an NRSA that was missing large portions of logic and writing but the senior investigator had signed off on it, clearly never having read it. Anyway, NIH is very strict at study section that the scores are only about ranking grants so program can make its decisions. For example, at every study section, we get reminded (scolded) that the current score does not have to be better than the previous score.

    Also, different study sections score differently. I've seen a study section where a 5 is good and fundable and others where a 5 is the worst in the day. That's why scores are normalized by study section.

    mH - thank you for that data point. That's fascinating.

  • […] to think, only about the score. The score on a given round is a value with considerable error, as the group itself described in a prior publication in which the same grant reviewed in different ersatz study sections ended up with a different […]
