NIH Ginther Fail: Do the ersatz reviews recapitulate the original reviews?

A bit in Science authored by Jocelyn Kaiser recently covered the preprint posted by Forscher and colleagues, which describes a study of bias in NIH grant review. I was struck by a response Kaiser obtained from one of the authors on the question of range restriction.

Some have also questioned Devine’s decision to use only funded proposals, saying it fails to explore whether reviewers might show bias when judging lower quality proposals. But she and Forscher point out that half of the 48 proposals were initial submissions that were relatively weak in quality and only received funding after revisions, including four that were of too low quality to be scored.

They really don't seem to understand NIH grant review, where about half of all proposals are "too low quality to be scored". Their inclusion of only 8% ND (Not Discussed) applications simply doesn't cut it. Thinking about this, however, motivated me to go back to the preprint, follow some links to the associated data and download the Excel file with the original grant scores listed.

I do still think they are missing a key point about restriction of range. It isn't, much as they would like to think, only about the score. The score on a given round is a value with considerable error, as the group itself described in a prior publication in which the same grants, reviewed in different ersatz study sections, ended up with different scores. If there is a central tendency for the true grant score, which we might approach with dozens of reviews of the same application, then any given score is sometimes going to be too good, and sometimes too bad, as an estimate of that central tendency. Which means that on a second review, the scores of the former will tend to get worse and the scores of the latter will tend to get better. The authors selected only the ones that tended to get better for inclusion (i.e., the ones that reached funding on revision).
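This selection effect can be sketched in a minimal simulation. All numbers below are hypothetical (payline, score scale, noise levels are invented, not the study's data); the point is only that among applications receiving the same mediocre first-round score, the ones whose re-review reaches the payline tend to be the ones with better underlying quality.

```python
import random
from statistics import mean

random.seed(42)

# Hypothetical model: each application has a stable "true" quality
# (lower = better, as with NIH impact scores); any single review adds noise.
N = 20000
true_q = [random.gauss(50, 10) for _ in range(N)]
round1 = [q + random.gauss(0, 8) for q in true_q]   # initial review
round2 = [q + random.gauss(0, 8) for q in true_q]   # independent re-review

PAYLINE = 40.0  # hypothetical: funded if score <= payline

# Restrict to applications with the *same* non-fundable round-1 score band.
band = [(q, r2) for q, r1, r2 in zip(true_q, round1, round2) if 45 <= r1 <= 55]

# Split the band by whether the re-review reached the payline,
# mimicking a "funded after revision" inclusion criterion.
revisable_true = [q for q, r2 in band if r2 <= PAYLINE]
unfunded_true  = [q for q, r2 in band if r2 > PAYLINE]

print(f"mean true quality, reached payline on re-review: {mean(revisable_true):.1f}")
print(f"mean true quality, still unfunded:               {mean(unfunded_true):.1f}")
```

Despite identical first-round scores, the "revisable" subset comes out with distinctly better mean true quality, which is exactly why scoring identically in one round does not make two applications equivalent.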

Another way of getting at this is to imagine two grants which get the same score in a given review round. One is kinda meh, with mostly reasonable approaches and methods from a pretty good PI with a decent reputation. The other grant is really exciting, but with some ill-considered methodological flaws and a missing bit of preliminary data. Each one comes back in revision, the former merely shined up a bit and the latter with awesome new preliminary data and the methods fixed. The meh one goes backward (enraging the PI who "did everything the panel requested") and the exciting one is now in the fundable range.

The authors have made the mistake of thinking that grants that are discussed, but get the same score well outside the range of funding, are the same in terms of true quality. I would argue that the fact that the "low quality" ones they used were revisable into the fundable range makes them different from the similar scoring applications that did not eventually win funding.

In thinking about this, I came to realize another key bit of positive-control data that the authors could provide to enhance our confidence in their study. I scanned through the preprint again and was unable to find any mention of them comparing the original scores of the proposals with the values that came out of their study. Was there a tight correlation? Was it equivalently tight across all of their PI name manipulations? To what extent did the new scores confirm the original funded, low-quality, and ND outcomes?

This would be key to at least partially counter my points about the range of applications that were included in this study. If the test reviewer subjects found the best originally scored grants to be top quality, and the worst to be the worst, independent of PI name, then this might help to reassure us that the true quality range within the discussed half was reasonably represented. If, however, the test subjects often scored the original top grants lower and the bottom grants higher, this would reinforce my contention that the range of the central tendencies for the quality of the grant applications was narrow.
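The analysis I'm asking for is simple enough to sketch. The scores below are invented stand-ins (the real per-application scores by PI-name condition are exactly what the preprint doesn't show); the shape of the check is just a per-condition correlation of original score against mean ersatz score.

```python
from statistics import mean

def pearson(xs, ys):
    """Plain Pearson correlation, stdlib only."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

# Hypothetical records: (original score, mean ersatz score, PI-name condition).
# These values are made up purely to show the shape of the analysis.
records = [
    (20, 24, "white_male"), (31, 30, "white_male"),
    (48, 44, "white_male"), (57, 60, "white_male"),
    (21, 27, "black_male"), (32, 38, "black_male"),
    (47, 52, "black_male"), (58, 55, "black_male"),
]

by_cond = {}
for orig, ersatz, cond in records:
    by_cond.setdefault(cond, ([], []))
    by_cond[cond][0].append(orig)
    by_cond[cond][1].append(ersatz)

for cond, (orig, ersatz) in sorted(by_cond.items()):
    print(f"{cond}: r = {pearson(orig, ersatz):.2f}")
```

A tight r in every condition would support the claim that the experimental reviews recapitulate the originals; a loose one, or one that varies by PI-name condition, would not.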

So how about it, Forscher et al? How about showing us the scores from your experiment for each application by PI designation along with the original scores?
__
Patrick Forscher, William Cox, Markus Brauer, and Patricia Devine. No race or gender bias in a randomized experiment of NIH R01 grant reviews. Posted on PsyArXiv; created May 25, 2018, last edited May 25, 2018.

3 responses so far

  • SidVic says:

    DISCLAIMER: As per usual I encourage you to read my posts on NIH grant matters with the recognition that I am an interested party. The nature of NIH grant review is of specific professional interest to me and to people who are personally and professionally close to

    What, what huh! this confused me, you are not talking your book here DM. Are you..
    Tell me

  • Grumpy says:

    Personally I think DM is overdoing it with the methods criticism on this paper, just like in the other one on reviewer score variability that DM tore apart. Presumably, the authors had to do their best with the proposals they were able to obtain via FOIA, and the data they present have some value even if the claims in the headlines are dangerous.

    However something I haven't seen much criticism on is their choice of definition of significance (they claim anything under 0.5 pt difference is insignificant).

    If I compare white female and black male in their data, I see a score differential of roughly 0.3 +/- 0.4 points. While you can't rule out the null hypothesis, if the 0.3 point score differential were real that could be a troubling bias.

    For order-of-magnitude calculation purposes, suppose all of these applications were scored uniformly from 1 to 5, and the funding cutoff was 2.5. If WF are getting scores from 1 to 4.7, then they have about a 41% funding rate. If BM are scoring 1.3-5, then they have a 32% funding rate. A systematic differential of 8% in funding rates could be a big problem.

    Add to that the fact that, IRL, scores are much more tightly clustered than that, and it seems a 0.3 point pure bias difference could be quite substantial. Considering the nonlinear impact of grant success on careers, I'd guess even a 0.2 point shift based on pure bias could be a disaster.
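    A minimal check of the uniform-score sketch above, using the same hypothetical score ranges and cutoff (these are illustrative numbers, not real NIH data):

    ```python
    # Hypothetical uniform-score model: scores uniform on [lo, hi],
    # funded if the score falls below the cutoff.
    def funding_rate(lo, hi, cutoff):
        frac = (cutoff - lo) / (hi - lo)
        return max(0.0, min(1.0, frac))

    wf = funding_rate(1.0, 4.7, 2.5)  # white female range
    bm = funding_rate(1.3, 5.0, 2.5)  # black male range
    print(f"WF {wf:.0%}, BM {bm:.0%}, differential {wf - bm:.0%}")
    # -> WF 41%, BM 32%, differential 8%
    ```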

  • drugmonkey says:

    the data they present have some value even if the claims in the headlines are dangerous.

    what "value" do they have?
