Mobility of Disparate Scores in NIH Grant Review

Nov 07 2014 Published under Fixing the NIH, Grant Review

If asked to pick the top two good things that I discovered about grant review when I first went for a study section stint, I'd have an easy time. The funny thing is that they come from two diametrically opposed directions.

The first amazing good thing about study section is the degree to which three reviewers of differing subdiscipline backgrounds, scientific preferences and orientations agree. Especially in your first few study section meetings there is little that is quite as nerve-wracking as submitting your initial scores and waiting to see if the other two reviewers agreed with you. This is especially the case when you are in the extreme good or bad end of the scoring distribution.

What I usually found was an amazingly good degree of agreement on the overall impact / priority score, even when the apparent sticking points, or points of approbation, differed across all three reviewers.

I think this is a strong endorsement that the system works.

The second GoodThing I experienced in my initial service on a study section was the fact that anyone could call a grant up off the triage pile for discussion. This seemed to happen very frequently, again in my initial experiences, when there were significantly different scores. In today's scoring parlance, think of one or two reviewers giving 1s and 2s while the other reviewer was giving a 5. Or vice versa. The point is to consider the cases where some reviewers are voting a triage score and some are voting a "clearly we need to discuss this" score. In the past, these were almost always called up for discussion. It didn't matter if the "good" scores were 2 to 1 or 1 to 2.

Now admittedly I have no CSR-wide statistics. It could very well be that what I experienced was unique to a given study section's culture or was driven by an SRO who really wanted widely disparate scores to be resolved.

My perception is that this no longer happens as often, and I think I know why. Naturally, the narrowing paylines may make reviewers simply not care so much. Triage or a 50 score...or even a 40 score. Who cares? Not even close to the payline, so let's not waste time, eh? But there is also a structural issue of review that has squelched the discussion of proposals with disparate preliminary scores.

For some time now, grants have been reviewed in order of priority score, with the best-scoring ones being taken up for discussion first. In prior years, the review order was more randomized with respect to the initial scores. My understanding was that proposals were grouped roughly by the POs assigned to them, so that PO visits to the study section could be as efficient as possible.

My thinking is that when an application would be called up for discussion in some random position across the 2-day meeting, people were more likely to do so. Now, when you are knowingly saying "gee, let's tack a 30-40 min discussion onto the end of day 2, when everyone is eager to make an earlier flight home to see their kids"...well, I think there is less willingness to resolve scoring disparity.

I'll note that this change came along with the insertion of individual criterion scores into the summary statement. This permitted applicants to better identify when reviewers disagreed in a significant way. I mean sure, you could always infer differences of opinion from the comments without a number attached but this makes it more salient to the applicant.

Ultimately the reasons for the change don't really matter.

I still think it a worsening of the system of NIH grant review if the willingness of review panels to resolve significant differences of opinion has been reduced.

29 responses so far

  • qaz says:

    My perception on my first study sections tracked yours very well. I was shocked how often I would come in with a chip on my shoulder either to attack or defend a grant and then how often the scores of the other reviewers agreed with me. (Actually, I find the scores agree less and feel far more random now.) And I saw many grants pulled up out of triage whenever there was any disagreement.

    Two things that I saw at the time were

    (1) that it really seemed to matter that study section get the score "right" because the score itself carried information. This meant that even a grant that wasn't going to get funded needed to get the score right so that the PI would know whether to resubmit (get in line) or not. [We can argue the advantages/disadvantages about having a line, but my point is that getting the score right is only important if it carries information.]

    However, we have now been told explicitly that the scores do not carry information, so no one cares if the score "is right" or not. All that matters is whether the grant is fundable. I think this is part of the problem with the new system - we're supposed to spread the fundable grants among 1-7 and put the not-fundable ones at 8 and 9. (No study section I've ever seen has been able to do that, but if you read the instructions, it's very clear that's what they want.) But the statement that the score doesn't matter means that there's no point to pulling a grant out of triage unless you think it has a chance of being funded.

    and (2) that the senior people really wanted to provide grantsmithing help to investigators they liked. This meant that people would pull up grants they knew were not fundable so that they would get real scores and would get discussed, producing better reviews so the applications could be improved for next time. [Yes, there's an in-club pedigree problem here, but there was also a help-the-kid mentality that I saw.]

    Again, we've been told that teaching grantsmithing is not our job. It's a "waste of the study section's time". So why pull a grant up unless it has a chance of getting funded.

    By the way, we used to take 2 days for study section (really 1.5 or so) which meant there was time (1.6 days isn't that different from 1.5). But now the same number of grants are dealt with in 1 day exactly. That means that pulling a proposal out of triage means ending a half hour late and missing flights.

  • CD0 says:

    At some point there was an accumulation of applications in many study sections (>100 per cycle), which perhaps demanded a more stringent triage system. However, I have seen a progressive decline in the number of applications that we see lately, at least in my study section. That has also resulted in a higher number of applications recalled for discussion in the last meetings that I have attended.

    I think that the problem with the disparity of the scores comes from the pressure from the NIH to "spread the scores". As expected, when they eliminated the possibility of scoring with decimal points, there was significant score compression in most study sections. Reviewers are now "encouraged" to use the entire range of scores. In my opinion, this introduces more variability into the evaluations of the 3 reviewers, depending on how good the rest of the grants in your package are.

    Also, there are many times when an intermediate score between, let's say, a "2" or a "3" would seem more appropriate, because there is a huge difference in terms of funding potential between these two numbers. I think that increasing the range of possible scores would alleviate the problem of score compression and would help reduce the disparity between scores.

  • DrugMonkey says:

    As far as I am aware SROs have *always* banged on about score spreading. Yes even back in the good old "182 is significantly different from 184" days.

    "Also, there are many times when an intermediate score between, let's say, a "2" or a "3" would seem more appropriate, because there is a huge difference in terms of funding potential between these two numbers."

    If there is a "huge" difference in percentile between 2 and 3, then your study section is compressing way too much. Rule of thumb we use is 50 = 20%ile, 40 = 15%ile, 30 = 10%ile, and 20 = 5%ile.

  • CD0 says:

    "If there is a "huge" difference in percentile between 2 and 3, then your study section is compressing way too much."

    Agreed, but that should be happening in more study sections, because in recent meetings we got increasing reminders from CSR to decompress more and more (all study sections, I suspect). Even teleconferences about this specific issue...

    I am not sure that having only 3 options to score an above-average grant (5 is supposed to be average and 1 should be rarely used) is sufficient to discriminate the best applications. But, yes, I do not have the big picture. What I have recently seen, based on my anecdotal evidence in one study section, is that the push for decompression has increased disparities between the scores of the 3 assigned reviewers.

  • drugmonkey says:

    CPP has the most score spreading section I've ever seen. Assuming they actually behave the way he describes.

  • Pinko Punko says:

    DM, you should ask people to email you percentile information to see how many study sections are like CPP's. I say email because that allows the sections we are talking about to be tracked.

    Some recent percentiles that I know of:

    Section 1: 48 priority score. 30%ile.

    Section 2: 33 priority score. 27%ile.

    Section 3: 33 priority score. 28%ile.

    Section 4: 70 priority score. 51%ile.

    Section 1 is the only one I have experience with that is even close to what CPP talks about. Sections 2-3 are really compressed.

    The attitude I find troubling is the "identify 2-3 grants to review seriously in the pile and then phone in the rest because it doesn't matter". I think this might be adding to the perceived increased arbitrary nature of the review process.

    My experience as a reviewer on a panel like Section 1 is similar to what DM has said, in that there is reasonable concurrence of reviewers with scores - but those concurrences are based on engaged reviews and are convincing. When there is not concurrence, there is usually really good discussion, and grants below the line definitely get rescued by the SRO automatically, because disparate scoring really calls for discussion. Where I think there is a greater issue is concurrence on scoring without justification - and I think this is happening if reviews are being phoned in.

  • Neuro-conservative says:

    There is a potential contradiction between spreading the scores (at least in the extreme fashion described by CPP) and the numerical scoring guidelines distributed to reviewers. In my recent experience, many/most non-triaged applications have a few minor weaknesses and perhaps one or two moderate weaknesses but no major weaknesses. It would not be fair to give such an application a 6 or 7.

  • qaz says:

    Neuro-conservative - there is an inherent incompatibility between the concept that scores carry information and the goal of using scores to separate grant applications of similar quality. This is a point that we have discussed many, many times on this blog, Rockey's blog, and many others. After much discussion, NIH has come to the conclusion that scores are merely a tool to help program separate applications. A "7" merely means that you are in the lower group *of this current round of study section*. It should no longer be taken as having any meaning in its own right. This means that "It would not be fair to give such an application a 6 or 7" is no longer a meaningful sentence.

    This is a major change from previous systems.

    More importantly, I am not convinced that reviewer psychology is compatible with this system which is why grants keep coalescing back to "between 2 and 3", no matter how NIH tries to push us to use 5s and 6s for "pretty good grants".

  • Neuro-conservative says:

    I see your point, qaz, but then NIH should stop handing us sheets with that figure illustrating the scoring curve as a function of strengths vs weaknesses.

  • drugmonkey says:

    That scoring guidelines sheet is a total fracking disaster and works at cross purposes with their rightful goal of selecting the best apps *in this Fiscal Year*.

  • Pinko Punko says:

    I argued very strenuously to my SRO that they would get the same resolution they want from "spreading" the scores of DISCUSSED grants by just allowing 0.5 increments for discussed grants. Prelim scores could stay 1-9, but scores in discussion could span 1.0-9.0 in half-point steps. This would give greater resolution but also allow a PI to get an idea of the absolute, not just relative, score of the grant. I think this means a lot, because rounds differ in quality - some meetings are just a murderers row, while in other meetings there are a bunch of decent grants but maybe something from the previous round really was scored better. A 1 should be a 1 and a 2 should be a 2, not depending on the rest of the mix. The whole idea of percentiling against the last few rounds of a panel is also to get an idea of where a grant lies in the bigger picture (and also to normalize across sections). The problem is there are just too many panels with the culture that all grants are 1-3, except the terrible ones, which are 4-5. These panels need to get some space into their compression. They are terrible.

    If the panel members buy in to a pretty good grant being a 5 (and there are panels that do that - see my comment above), it becomes a lot easier than when everything above average is already a 2-3. There are just a lot more ties, and then everything is up to program.

  • Comradde PhysioProffe says:

    "CPP has the most score spreading section I've ever seen. Assuming they actually behave the way he describes."

    We absolutely do.

  • Joe says:

    "Naturally, the narrowing paylines may make reviewers simply not care so much. Triage or a 50 score...or even a 40 score. Who cares?"
    One does feel this way on day 2. How can you not? The first 2 hours of day one are when you discuss the applications that will be funded. Later in day 1 you'll find a gem that was down in the 3's and rescue it, but that's it. The last third of day 1 and all of day 2 you will be arguing over applications that are very unlikely to be funded. Seeing good ideas in those applications, and knowing that those grants are nowhere near the cut-off, is one of the most depressing aspects of study section. It's not that you are unwilling to provide time and feedback; it's the feeling that it won't make any difference and that you are powerless to change that.

  • drugmonkey says:

    Absolutely, Joe.

  • Ola says:

    Our SRO now says that requests for rescue should preferably be submitted by email to the SRO well before the meeting. No need to embarrass yourself in front of your peers by being "the one who made us stay late". Then when we start out on day one with the discussion order, there's just 3 or 4 extra proposals beyond the cut line, so everyone knows up front how many in total will be discussed and can plan their flight/exit on day two accordingly. It also saves that painful hour at the beginning, reading up from the bottom of the triage list one-by-one to ask if anyone wants to rescue anything. You still have the option to rescue at the meeting if you want, but I.M.E. nobody does it in person any more. Relieving the peer pressure element probably helps a few proposals get pulled up that would otherwise be ND.

  • Jo says:

    The flipside is that a lot of time is wasted on the good grants where we all agree they should be funded just because everyone is "fresh" and wants to hear the sound of their own voice.

    My suggestion is that the initial triage should be based on mean score (as it currently is) but then discussion order should be based on variance, highest first.
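    Jo's two-stage idea is simple enough to sketch. Here is a minimal illustration in Python; the application names, reviewer scores, and triage cutoff are all hypothetical, and lower scores are better per NIH convention:

```python
# Sketch of Jo's proposal: triage on mean preliminary score (lower is
# better in NIH scoring), then order the surviving applications for
# discussion by reviewer-score variance, highest variance first.
# All application names, scores, and the cutoff are hypothetical.
from statistics import mean, pvariance

prelim = {
    "app_A": [2, 2, 3],  # strong, reviewers agree
    "app_B": [1, 2, 5],  # strongly disparate scores
    "app_C": [5, 5, 5],  # weak, reviewers agree -> triaged
    "app_D": [2, 4, 6],  # disparate scores
}

TRIAGE_CUTOFF = 4.5  # hypothetical: mean above this is not discussed

discussed = {app: s for app, s in prelim.items() if mean(s) <= TRIAGE_CUTOFF}
order = sorted(discussed, key=lambda app: pvariance(prelim[app]), reverse=True)
print(order)  # most contested applications come up first
```

    Under this ordering, the disparately scored applications get discussed while the panel is still fresh, and the unanimous ones take the late slots, where quick confirmation costs little.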

  • E rook says:

    Joe - shouldn't that indicate it'd be better to discuss the middle of the pack first? That's what we do in med school admissions during interview season. We individually rank the candidates we interviewed, and an algorithm combines our scores, then ranks them. "Any red flags with 1 through 4? mm-K." There's no point in wasting time talking about something we agree on.

  • Davis Sharp says:

    I don't study human behavior, but some people who do have told me that the human limit when assigning things to discrete bins is 7. Beyond that, we have a hard time separating things. Grant scores went from 41 discrete values (the 1.0-5.0 scale in 0.1 increments) to 9. But now some people want to add a 0.5 increment and increase the scale to 17 discrete values. I think this will eventually limit the range used by reviewers to 1.0-5.0, because no-one wants to give a bad score. Then the same people will want 0.25 increments.

    The entire committee votes on a final score. So, the average is not likely to be an integer.

  • E rook says:

    Jo beat me to it. Variance is a better ordering than starting in the middle. I was thinking something like discussing the 20th through 30th percentile first, then 10th through 20th, then 1-10, then 49-50. Perhaps weighting percentile ranks with score variance to give a discussion order. No idea if this would help applicants, but maybe it would make the experience less painful for reviewers.

  • Pinko Punko says:

    Davis,

    I think there are quite a number of ties, especially if the study section is compressed- this means there aren't 17 bins, there are essentially 4.

    Joe,

    I would say that it is massively depressing, but don't serve on the panel if you can't give effort to grants that for some reason might not make the cut. Those PIs are probably on a 2-3 submission track to try to get the work funded and putting a grant behind the 8-ball with a disengaged or lazy review is not only not helping, it could actually be hurting the grant.

    I know panels are painful for a lot of reasons, and they take a mental toll on me for sure, but given our daily jobs and the amount of stuff any PI has to deal with, my goodness the slog of day 2 of study section really is nothing compared to lots of other things, and I think the excuse of "it's not worth it/I'm so tired/it just sucks" - yeah, it sucks a lot, but you can do your best to make it better for people by doing the job, or you just don't agree to be on the panel. Really frustrating to hear this stuff.

  • drugmonkey says:

    Jo-

    Brilliant!

  • Pinko Punko says:

    It is worth spending some time discussing the "good" grants, because those grants were reviewed by only three people. The panel discusses them, and the discussion might concur or some flaws might be detected. Either way, getting a feel for what is a 1, 2, or 3 lets the panel calibrate across the proposals. I don't think that grants in the middle are compromised - the last grants discussed are. The grants in the middle are usually still pretty fluid based on the discussion of the first set. Many times scores will be reduced for those "top" grants based on discussion, and grants in the middle will go up upon full discussion. I do think that the high-variance grants that are dropped to the bottom are the ones that could be at a disadvantage.

  • Jim Woodgett says:

    Bear with me....

    The Canadian Institutes of Health Research (pico-NIH) is reforming its review processes (as well as its programs; hey, in for a penny). Instead of study sections/panels, there will be virtual review, with each application scored by 5 reviewers who are selected based on keyword expertise. Since no two applications will be reviewed by the same 5 people, the scores do not allow cross-comparison - especially as we are talking about the entire breadth of health research, from biomedical to health systems. Therefore, each reviewer's collection of reviews will be ranked relative to one another. If they have 15 applications, they will be assigned fractions 1/15, 2/15, etc. Another reviewer may do 10, and so their reviews will be 1/10, 2/10, etc. The rankings for each application are then combined to yield an overall rank and variance. The smaller the relative score, the better.

    The data will be plotted as overall rank vs variance (Z-score). The top 10% will be approved for funding. Below that, a grey zone (shaped to capture both variance and rank, and sized to roughly twice the number of additional grants that can be funded) will be evaluated by a face-to-face super panel (probably more than 1), which will vote Y/N by focussing on the type of variance (e.g. was the outlier reviewer onto a fatal flaw?).
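    The fractional-ranking arithmetic can be sketched in a few lines of Python. The reviewer names and piles below are hypothetical; a lower mean fractional rank is better, and the variance is what flags the grey-zone cases for the super panel:

```python
# Sketch of the CIHR-style virtual ranking described above: each reviewer
# ranks only their own pile, positions are converted to fractions of pile
# size (1/15, 2/15, ... in the example above), and each application's
# fractional ranks are combined into an overall rank (mean) plus a
# variance that flags reviewer disagreement. All data here is hypothetical.
from statistics import mean, pvariance

# reviewer -> their pile, ordered best to worst
piles = {
    "rev1": ["app_X", "app_Y", "app_Z"],
    "rev2": ["app_Y", "app_X"],
    "rev3": ["app_X", "app_Z", "app_Y"],
}

frac_ranks = {}
for rev, pile in piles.items():
    n = len(pile)
    for pos, app in enumerate(pile, start=1):
        frac_ranks.setdefault(app, []).append(pos / n)

# overall rank (lower is better) and variance (high = reviewers disagree)
summary = {app: (mean(fr), pvariance(fr)) for app, fr in frac_ranks.items()}
ranking = sorted(summary, key=lambda app: summary[app][0])
print(ranking)
```

    A funder would then fund the top slice of `ranking` directly and route the high-variance grey zone to the face-to-face panel.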

    I don't know how this will work out. It has several contingencies. There need to be at least 5 reviews/application for the statistics to work (7 is optimal so 5 may be on the edge). The reviews should be independently derived (collusion leads to herd mentality). The work load is increased (instead of 2-3 reviewers, need 5 or more) so the agency has simplified the application, CV and review with more structure being introduced. This, paradoxically, can increase workload as they seek additional information.

    The advantage is that this approach removes study section politics. It should better capture applications that fall between study sections. It should avoid pre-assignment to certain specialities and self-adapt to the volumes of applications in various areas (i.e. the success rate for a given discipline/area should be equivalent across the spectrum). It also removes the possible effects of discussion order, as there is none....

    As an aside, the first pilots of this approach are currently running. There are mixed reports of effectiveness and other confounding factors.

    We Canadian guinea pigs are getting a feel for what it's like being experimental subjects.

  • eeke says:

    @ Jim Woodgett - I like that idea. I'm curious to know what percentage of grants will fall within the top range and get rubber-stamp-funded without further review - this assumes that all or most reviewers will be in agreement about the top 10%.

    I'm not comfortable hearing about this study section "culture" crap. To me, it suggests that if you're in that culture club, you'll be ok, but if you're new to that particular study section, you don't have a chance. I've looked at newly funded applications by study section on the Reporter website, and I find HUGE differences in numbers of new R01s (or equivalents) that get funded for the year. In some cases, there will be 3x or more new awards from one study section versus another. Does anyone have thoughts about this? Should the study sections with extremely low numbers of new projects awarded be avoided?

  • drugmonkey says:

    Without knowing how many apps they handle, eeke, it would be foolish to draw conclusions.

  • drugmonkey says:

    If you are looking at continuation/new R01 ratios I suppose that might be useful for strategy.

  • Jim Woodgett says:

    The issue of study section culture clubbing is exacerbated in Canada, as our swimming pools are smaller. There are most definitely cliques here, and the quality of research performed across the various panels is nowhere near even (an issue recognized by the agency without calling out the delinquents). The main advantage of the virtual route is that applications that don't fit a predetermined panel don't get handed around like musical chairs. Such applications traditionally do very poorly, as they are less likely to find advocates. Broader-scope panels would avoid this, but that hasn't been tried. Of course, the headlong rush into major reforms is also driven by historically low success rates, and that will not be changed by the process.

    The ranking scheme may help normalize scores as it is true that different study sections will vary their definitions of merit. But the Great White Northern Experiment has to be done (and there are no controls of course) to determine whether that is the case.
