What impact, if any, does in-person discussion of NIH Grants have on the scoring?

Nov 19 2010 | Filed under: Data!!!!, Grant Review, NIH, NIH funding, Uncategorized

This is amazing. Strike that, AMAZING!

A paper published in PLoS ONE by Martin and colleagues examines the fate of R01 applications reviewed in 61 of the 172 standing study sections convened by the Center for Scientific Review of the NIH in a single round (the January 2009 Council round: submitted Jun-Jul 2008 and reviewed Oct-Nov 2008).

It is going to take me a bit to go through all the data, but let's start with Figure 1. This plots the preliminary scores (the average of the ~3 assigned reviewers) against the final priority score voted by the entire panel.

Figure 1. Average Preliminary Score versus SRG Final Priority Score. Preliminary Scores represent the average of the independent R01 priority scores given by the three assigned reviewers; the final priority score is the average of all the scores given by the voting members of the panel. Each data point represents the outcome for one R01 application. The difference between preliminary and final priority scores represents the change between the two values. Applications with differences displayed on the left declined after discussion; those on the right improved. doi:10.1371/journal.pone.0013526.g001
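To make the caption's quantities concrete, here is a tiny sketch (in Python, with made-up numbers, not the paper's actual data) of how the preliminary score, the final priority score, and the difference between them are defined:

```python
# Hypothetical illustration of the quantities plotted in Figure 1:
#   preliminary score = mean of the ~3 assigned reviewers' pre-meeting scores
#   final score       = mean of all voting panel members' scores
#   change            = final - preliminary (negative = improved, since
#                       lower is better in the NIH scoring system)

def preliminary_score(assigned_reviewer_scores):
    """Average of the assigned reviewers' pre-discussion scores."""
    return sum(assigned_reviewer_scores) / len(assigned_reviewer_scores)

def score_change(assigned_reviewer_scores, panel_scores):
    """Final panel average minus preliminary average.

    A negative change means the application's score improved
    (moved toward the better end) during discussion.
    """
    final = sum(panel_scores) / len(panel_scores)
    return final - preliminary_score(assigned_reviewer_scores)

# A made-up application: three assigned reviewers at 2.0, and the full
# panel drifts to a better (lower) score after discussion.
change = score_change([2.0, 2.0, 2.0], [1.5, 1.6, 1.4, 1.5, 1.5])
print(round(change, 2))  # prints -0.5: the score improved after discussion
```

Points to the left of the 45-degree line in Figure 1 correspond to negative changes like this one; points to the right correspond to scores that got worse after discussion.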

The first and most obvious feature is the tendency for discussion to make the best scores (lowest in the NIH scoring system) more extreme. I would suggest that this results from two factors. First, reviewers are reluctant (in my experience) to assign the best possible score prior to discussion. I don't understand this personally, but I guess I can grasp the psychology. People have the idea that perfection exists out there in some application, and they want to reserve some room so that they can avoid ever having awarded a perfect score to a lesser application. Silly, but whatever. Once discussion starts and everyone is nodding along approvingly, it is easier to drift toward a perfect score.

Second, there is a bit of the old "Fund that puppy NOW!" going on. Particularly, I would estimate, for applications that were near misses on a prior version and have come back in review. There can be a tendency to want to over-emphasize to Program staff that the study section found the application to be in the must-fund category.

__
Martin MR, Kopstein A, Janice JM (2010) An Analysis of Preliminary and Post-Discussion Priority Scores for Grant Applications Peer Reviewed by the Center for Scientific Review at the NIH. PLoS ONE 5(11): e13526. doi:10.1371/journal.pone.0013526

4 responses so far

  • halophile says:

    Cool study - I look forward to more analysis. Obviously, the points furthest from the line would probably be the most interesting discussions. Anyone have some personal insight into how a proposal would receive a mediocre score pre-meeting and then get a great score after the discussion? Obviously, the reverse situation could occur but that's too sad to think about.

  • drugmonkey says:

If by "more analysis" you mean more than the single graph... go read the paper!

A score can improve after discussion through several means.

The obvious one is that you had a substantial split in the initial scores and the advocate(s) made a convincing argument for how great the application is.

Related is when someone just kinda made a mistake and the discussion corrects that reviewer's bad impression on some point or other. Yes, people do admit they were too harsh or wrong or whatever, IME.

Third would be some version of a score-calibration error, where one or more reviewers don't change their opinion of the quality, just their calibration of where the scores should fall. Perhaps they are new to the panel. Perhaps they really did have an exceptionally good pile of grants and were trying to spread scores within their assigned pile before the meeting.

Very occasionally I have seen a panel revolt. The reviewers discuss how awesome the grant is and then end up with kinda "meh" scores, and someone else asks, "Why are your post-discussion scores so bad if you all loved it so much?" On more than one occasion I have seen this result in enough people voting outside the range to very likely produce a beneficial skew in the eventual voted score.

  • Definitely a lot of "based on what I'm hearing, this sounds like a … rather than a …" calibration meta-discussion goes on in well-functioning study sections.

    In addition to the tendency for fundably scored grants to get even better after discussion, it looks like there is a tendency for poorer grants to get even worse.

  • antipodean says:

    Bit of a trumpet shape, though. Might need some test-validation analysis rather than Pearson's and t-tests. Figure 1 appears to be sort of a Bland-Altman plot if that is a 45-degree line rather than a regression line, so it works fine visually.

    Is that what you meant, halophile?

    -antipodean
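For anyone wanting to try the Bland-Altman angle antipodean raises, the summary statistics are simple to compute. This is a minimal sketch on invented preliminary/final scores, not the paper's data:

```python
import statistics

def bland_altman(prelim, final):
    """Bland-Altman summary for paired score lists.

    Returns per-application (mean, difference) pairs, the bias (mean
    difference), and the 95% limits of agreement (bias +/- 1.96 * SD
    of the differences). A trumpet shape in the (mean, difference)
    plot suggests the disagreement grows with the score magnitude.
    """
    diffs = [f - p for p, f in zip(prelim, final)]
    means = [(f + p) / 2 for p, f in zip(prelim, final)]
    bias = statistics.mean(diffs)
    sd = statistics.stdev(diffs)
    return list(zip(means, diffs)), bias, (bias - 1.96 * sd, bias + 1.96 * sd)

# Made-up preliminary and final priority scores for five applications.
prelim = [1.8, 2.5, 3.0, 3.6, 4.2]
final = [1.5, 2.4, 3.1, 3.9, 4.6]
pairs, bias, (lo, hi) = bland_altman(prelim, final)
```

Plotting `pairs` (mean on the x-axis, difference on the y-axis) with horizontal lines at `bias`, `lo`, and `hi` gives the standard Bland-Altman picture; Figure 1's 45-degree-line scatter carries the same information rotated.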
