Agreement among NIH grant reviewers

Pier and colleagues recently published a study purporting to address the reliability of the NIH peer review process. From the summary:

We replicated the NIH peer-review process to examine the qualitative and quantitative judgments of different reviewers examining the same grant application. We found no agreement among reviewers in evaluating the same application. These findings highlight the subjectivity in reviewers’ evaluations of grant applications and underscore the difficulty in comparing the evaluations of different applications from different reviewers—which is how peer review actually unfolds.

emphasis added.

This thing is a crock and yet it has been bandied about on the Twitts as if it is the most awesome thing ever. "Aha!" cry the disgruntled applicants, "This proves that NIH peer review is horrible, terrible, no good, very bad and needs to be torn down entirely. Oh, and it also proves that it is a super criminal crime that some of my applications have gone unfunded, wah."

A smaller set of voices expressed confusion. "Weird," we say, "but probably our strongest impression from serving on panels is that there is considerable agreement in review, when you consider the process as a whole."

So, why is the study irretrievably flawed? In broad strokes it is quite simple.
Restriction of the range. Take a look at the first figure. Does it show any correlation of scores? Any fair view would say no. Aha! Whatever is being represented on the x-axis about these points does not predict anything about what is being represented on the y-axis.

This is the mistake being made by Pier and colleagues. They constructed four peer-review panels and had them review the same population of 25 grants. The trick is that, of these, 16 were already funded by the NCI and the remaining 9 were earlier, unfunded versions of grants that were eventually funded by the NCI.

In short, the study selects proposals from a very limited range of the applications being reviewed by the NIH. This figure shows the rest of the data from the above example. When you look at it like this, any fair eye concludes that whatever is being represented by the x value about these points predicts something about the y value. Anyone with the barest understanding of distributions and correlations gets this, and grasps that a distribution does not have to show perfect correspondence for there to be a predictive relationship between two variables.
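
If you want to see this for yourself, here is a minimal sketch of what restriction of range does to a perfectly respectable correlation. To be clear, this is not the authors' analysis; the pool size, the reviewer noise level and the ~12% cut are all invented for illustration. Two simulated reviewers score the same hypothetical applicant pool, and then we throw away everything except the top sliver.

```python
# Minimal sketch of restriction of range (toy numbers, not the Pier et al. data).
import numpy as np

rng = np.random.default_rng(42)

n_apps = 10_000                      # hypothetical applicant pool
merit = rng.normal(0, 1, n_apps)     # latent "true" quality of each application

# Two reviewers each see the true quality plus their own subjective noise.
reviewer_a = merit + rng.normal(0, 0.7, n_apps)
reviewer_b = merit + rng.normal(0, 0.7, n_apps)

full_r = np.corrcoef(reviewer_a, reviewer_b)[0, 1]

# Now keep only the applications in roughly the top 12% of reviewer A's scores,
# mimicking a sample built from funded and nearly-funded grants only.
cutoff = np.quantile(reviewer_a, 0.88)
top_slice = reviewer_a >= cutoff
restricted_r = np.corrcoef(reviewer_a[top_slice], reviewer_b[top_slice])[0, 1]

print(f"correlation across the full applicant pool: {full_r:.2f}")       # ~0.67
print(f"correlation within the top ~12% only:       {restricted_r:.2f}")  # ~0.35
```

Same reviewers, same noise. The only thing that changed between those two numbers is that the low-scoring applications were never in the sample.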

So. The authors' claims are bogus. Ridiculously so. They did not "replicate" the peer review process because they did not include a full range of scores/outcomes but instead picked the narrowest slice of the funded awards. I don't have time to dig up historical data, but the current funding plan for NCI calls for a 10%ile payline. You can amuse yourself with the NIH success rate data here; the very first spreadsheet I clicked on gave a success rate of 12.5% for NCI R01s.

No "agreement". "Subjectivity". Well of course not. We expect there to be variation in the subjective evaluation of grants. Oh yes, "subjective". Anyone that pretends this process is "objective" is an idiot. Underinformed. Willfully in denial. Review by human is a "subjective" process by its very definition. That is what it means.

The only debate here is how much variability we expect there to be. How much precision do we expect from the process?

The most fervent defenders of the general reliability of the NIH grant peer review process almost invariably will acknowledge that the precision of the system is not high. That the "top-[insert favored value of 2-3 times the current paylines]" scoring grants are all worthy of funding and have very little objective space between them.

Yet we still see this disgruntled-applicant phenotype, responding with raucous applause to a crock of crap conclusion like that of Pier and colleagues: people who seem to feel that somehow it is possible to have a grant evaluation system that is perfect, one that returns the exact same score for a given proposal each and every time*. I just don't understand these people.

Elizabeth L. Pier, Markus Brauer, Amarette Filut, Anna Kaatz, Joshua Raclaw, Mitchell J. Nathan, Cecilia E. Ford, and Molly Carnes. Low agreement among reviewers evaluating the same NIH grant applications. PNAS, 2018; published ahead of print March 5, 2018.

*And we're not even getting into the fact that science moves forward and that what is cool today is not necessarily anywhere near as cool tomorrow.

  • girlparts says:

    It sounds like you have more of a problem with the interpretation of the study than the study itself. It concluded that there is no repeatable difference between funded and near-funded grants, exactly what you say: that the "top-[insert favored value of 2-3 times the current paylines]" scoring grants are all worthy of funding and have very little objective space between them. They make no claims about the rest of the range. This seems quite important, if the 9th% grant is funded, and the 15th% grant isn't. The system isn't able to make distinctions within the range that is needed to make funding decisions. And if that is the case, as they point out, it is a waste of a lot of time and effort to try to do so.

    (I actually have other problems with the methodology, but not that part)

  • drugmonkey says:

    It's bullshittio to design a study like this and pretend that the conclusion won't be broad, even if you are careful to say 'among this sample of applications'. IMO. The papers are written in a way that clearly indicates they think this generalizes to the whole system and, surprise, surprise, this is the way it is being interpreted. The Abstract concludes with: "our results have broad relevance for scientific grant peer review." The Intro states: "Because the grant peer-review process at NIH is confidential, the only way to systematically examine it is to replicate the process outside of the NIH in a highly realistic manner."

    "highly", "realistic", "manner".

    Even if you are right about the intent, this sample was of already funded grants and nearly funded grants. We don't know how close the near-misses were, of course. In principle they could have been within a couple of percentile points of the paylines. Some of the paid grants could have been exception funded for all we know. Either way, it is very unlikely that it was a representative sample of anything.

    "This seems quite important, if the 9th% grant is funded, and the 15th% grant isn't. The system isn't able to make distinctions within the range that is needed to make funding decisions."

    If that was the entire point, then they went to great lengths to conceal it. And they didn't design the study to answer that question at all. If your sample is grants that are funded and grants that were funded on revision... where is the problem? One round of revision is "a waste of a lot of time and effort"? Really? One? I don't agree.

  • drugmonkey says:

    "I actually have other problems with the methodology, but not that part"

    Which parts did you find lacking?

  • A Salty Scientist says:

    I would be quite interested in a study on the full range of funded through triaged grants. My feeling is that reviewers do a pretty good job binning applications into tiers, and I think it would be useful to get a better sense of the point at which reviewers have difficulty discriminating. I think this could get at optimal paylines, though that might be moot with no real political pressure to increase funding.

  • ThamizhKudimagan says:

    girlparts: " It concluded that there is no repeatable difference between funded and near-funded grants"

    If you looked at just the abstract (and a lot of people would do just that), you'd not come away with that impression. The closest they come to saying that in the abstract is: "...reviewers must differentiate the very best applications from comparatively weaker ones".

    25 grants and 4 panels seems like a small sample size to draw such an emphatic conclusion on something that involves a lot of money and affects a lot of people.

  • Emaderton3 says:

    @A Salty Scientist

    I would love to see this too. As DM said, it would help in understanding the variability.

    I had a grant get triaged twice in a specific NIH study section. I sent the same grant (literally almost word for word) to a large nonprofit, and their review committee scored it in the top tier of applications (and it got funded). Reviewers on both panels had the same scientific backgrounds. In fact, two members were on both panels. (I also had this same grant go to another study section, get scored, and almost get chosen for select pay. However, in this case, the background of the reviewers was in a completely different discipline.)

    I know, I know, an n=1 doesn't mean anything. And perhaps the nonprofit was looking for a different level of science since it was geared toward early career investigators. But it was a relatively large, multi-year award (almost a mini-R01), and facets of my proposal were lauded by them, which wasn't the case in the NIH study section.

  • drugmonkey says:

    It sounds like perhaps the first study section was simply the wrong one, Emaderton3.

  • Grumpy says:

    I can appreciate that you don't like the spin taken here, but aren't you somebody on record saying the data are all that matters and you skip the interpretation and conclusions?

    Seems to me that ppl have been saying for some time, without citation, that nobody can tell the difference between top-scoring grants. Now there is some data to back that up. I'm personally glad they put it out there.

    Obviously it would be nicer to have a wide range, but I don't think you can get unfunded grants from a FOIA request (admittedly I didn't look carefully enough to see if that is how they got the proposals; I'm just guessing).

  • DrugMonkey says:

    I say that I’m more interested in my take on the data than I am in what the authors think. This is an example of that. Their interpretation is garbage. As is their design, and therefore their data.

  • qaz says:

    I don't understand.

    Is this paper trying to say that there's noise in the system? We knew that.

    Is this paper trying to say there's little reliable difference between a 10th percentile and a 15th percentile? We knew that.

    Is this paper trying to say that there's little reliable difference between a 10th percentile and a 50th percentile? That's wrong. And we know that.

    This has been our biennial discussion on Drug Monkey that study section is well designed to find the top quarter of grants, but not to separate within that. Can we go on to something new? Like how we're going to get the feds to provide enough money to fund that top quarter of submitted grants....

  • DNAman says:

    I think Emaderton3 has a better point than the paper.

    There is huge, huge variability in scoring for those of us who don't fit nicely into a standing study section.

    There's some study sections where all the reviewers publish in the same journals, they all go to the same conferences, and they all know one another. In these study sections, it seems that everyone recognizes the important scientific problems and the review focus is mostly on your hypothesis and how you will test it. These are mostly the standing sections.

    However, there are other study sections (mostly the ZRGs) that are just a grab bag of a few different fields. In these, a few reviewers might know each other, but the panel will be very diverse. Reviews here are HIGHLY dependent on the particular reviewer assigned to your application. You often get assigned a reviewer who isn't familiar with your field, not even with its basic agreed-upon facts and methods. Say everyone in your field uses a rat model. Then you get a review criticizing you for using a rat model because of blah blah blah. The reviewer is, of course, correct that the rat model isn't perfect, but what are you supposed to do?

  • drugmonkey says:

    qaz - the paper is trying very hard to imply that noise measured among the top 10-15% means that the top 10-15% cannot be distinguished from the bottom 50%. Unsurprisingly, this appears to be the take of most people approvingly passing this study around.

  • girlparts says:

    My objections to the methodology were two-fold. They tried to quantify the similarities among reviews by counting the number of weaknesses, which, in my opinion, is meaningless. It's the impact of a weakness that counts - one fatal flaw can invalidate a whole grant. And in my experience, the substance of the most important strengths and weaknesses actually does correlate reasonably well among reviewers. Admittedly, it would be an enormous, and somewhat subjective, task to figure this out in a study. My other objection is that they compared the results of one reviewer per proposal. The system acknowledges that there is variation among reviewers by assigning more than one reviewer per grant. These differences are supposed to even out in the final results. The authors acknowledge this in the paper, saying that other studies have looked at how many reviewers are needed to reduce variation, and then wave it away.
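
    A quick back-of-the-envelope sketch of that last point (toy numbers, nothing from the paper): pooling several noisy reviewer scores tracks the underlying quality of an application much more closely than any single reviewer's score does, which is the whole point of assigning multiple reviewers.

    ```python
    # Toy sketch: averaging k noisy reviewer scores vs. relying on a single reviewer.
    import numpy as np

    rng = np.random.default_rng(0)

    n_apps = 5_000
    merit = rng.normal(0, 1, n_apps)   # latent quality of each application

    def panel_score(k):
        """Average of k independent noisy reads of the same applications."""
        reads = merit + rng.normal(0, 0.7, (k, n_apps))
        return reads.mean(axis=0)

    for k in (1, 3, 5):
        r = np.corrcoef(panel_score(k), merit)[0, 1]
        print(f"{k} reviewer(s): correlation with latent quality ~ {r:.2f}")
    # Expect roughly 0.82, 0.93 and 0.95: reviewer noise averages down as 1/sqrt(k).
    ```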

  • DrugMonkey says:

    Whoa! I missed that single reviewer thing.

  • Pinko Punko says:

    I am confused by the comment above, re: reviewers per grant. Some reviewers only reviewed one grant, but all grants had more than one reviewer:

    " so that every application was evaluated by between two and four reviewers."

    They then used reviewer ratings to compare:

    "We measured agreement among reviewers in terms of the preliminary ratings that they assigned to grant applications before the study section meeting."

    (they did try some other ways to ask about similarity among reviews, but I agree that I wouldn't expect those to have meaning unless reviewers were asked specifically to annotate which weaknesses/strengths were the drivers)

  • drugmonkey says:

    I think they just used a single reviewer's (the primary) comments for doing the analysis of the number of strength/weakness comments. All three for criterion score analysis. Right?

  • girlparts says:

    I think they used all three reviewers for analyzing criterion score correlations, but they used each of their preliminary scores, not their average, post-discussion scores. "The current study aims to examine agreement in the individual ratings before the study section meeting, with a focus on examining the alignment between reviewers’ preliminary ratings and their written critiques." They only evaluated how much variation there is amongst individual reviewers, not in final post-discussion priority scores.

  • Pinko Punko says:

    Yeah - they didn't do discussion at all, I think? The claim was that there were already data on what discussion does (exacerbates variability between study sections?).

  • Ola says:

    Was the first image deliberately drawn to resemble a swarm of gnats (as in circling a dog shit)? If so, extra points for artistic license.

  • notmycircus says:

    You'll never be able to remove subjectivity from reviews. As a researcher, really the only tool you can and absolutely should wield is the PHS Assignment Request Form for each NIH submission. At least it's an attempt to limit variability...
