Crisis of Faith

Jan 04 2013 Published by under Grant Review, NIH, NIH Careerism

The hardest thing about grant review is giving good scores to proposals that clearly suck compared with your own proposals that have been scoring* outside of the fundable range in recent rounds.

___
*because of rat bastige biased and incompetent reviewers that make eRroRZ of FaCT!, of course.

51 responses so far

  • Drugmonkey says:

    phew. found one in my pile that has a teensy bit of promise. high hopes, people, high hopes.

  • Grumble says:

    Why would you give them good scores if they "clearly suck"???

  • DrugMonkey says:

    Because one should spread scores and focus on the quality represented in one's pile first and foremost, IMO.

    Except in a very general calibration way, I think review approaches that attempt a ranking against some larger, fixed notion of grant apps are deeply flawed. That calibration cheat sheet is fucked up, especially since SROs chant their score spreading mantra incessantly.

  • Dave says:

    DM: Do you consider the current payline when scoring? What would you say is your average score and what is the spread?

  • DrugMonkey says:

    No, 5, 8

  • qaz says:

    How many grants do you have to review in your cycle? Do you really think you have enough to judge a sample? By spreading the scores within your own individual stack, you are basically running preliminary heats. Winner of each heat goes on to the next stage.

    I don't see why you shouldn't bring your extensive grant reviewing experience to bear on where the scores fit into the expected distribution.

  • Physician Scientist says:

    Dave-
    In my pile right now, I'm averaging a 5 with a spread of 5. The tighter spread is just because I don't see the point of giving something worse than a 7 (it'll be triaged anyway, why pile on) and I've got nothing in my pile that deserves a 1.

  • eeke says:

    "I've got nothing in my pile that deserves a 1"

    You guys are harsh. How about a "2". We're not talking about Nobel prize-winning research here, just some project that is worthy of support. It sounds like you are shit-canning every proposal. Has it always been like this, or are you thinking that if someone actually gets funded, it will mean a lower chance of funding for yourself?

  • Comradde PhyioProffe says:

    Eeke, you need to understand the relationship between impact scores and percentiles, and what spreading scores means. If the study section decides to be "nice" and give everyone 2s, then their work has been a waste, and their peer review will have zero influence on who gets funded. If the study section wants to influence the funding process, then it needs to spread scores, especially in the range of what it considers the top 20 percent of grants. You want to score the last grant in the top 20 percent at a 4, 15 percent at a 3, 5 percent at a 2, and 1 percent at a 1.
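
The percentile-to-score targets sketched in that comment can be written as a tiny lookup. This is purely a hypothetical illustration of the suggested calibration, not anything NIH publishes; the cutoffs come straight from the comment:

```python
def target_score(pctile):
    """Map a grant's rough percentile within the section's recent pool
    to the suggested preliminary score:
    ~1 %ile -> 1, ~5 %ile -> 2, ~15 %ile -> 3, ~20 %ile -> 4.
    (Illustrative sketch only; cutoffs are from the comment, not NIH.)"""
    for cutoff, score in ((1, 1), (5, 2), (15, 3), (20, 4)):
        if pctile <= cutoff:
            return score
    return None  # below the top 20 percent: spread across 5-9


target_score(12)  # a ~12th-percentile grant lands at a 3
```

The point of the steep mapping is that the whole 1-4 range is spent on the top fifth of the pool, leaving 5-9 for everything else, which is what "spreading scores" means in practice.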

  • DrugMonkey says:

    Qaz-
    Qaz-
    I have come to believe the averaging process of three reviewers, all taking a fair stab at spreading, is the way to accomplish the task we are set by the ICs. They do not fund zero grants next year because the ones this FY were awesome. Nor do they double up next year after we give all grants 6s this year.

    Judging your own pile as superlative or shitty is fraught with problems. Also, in a tactical sense it leads to other reviewers downgrading your opinion (IMO, naturally).

    I suggest every reviewer should assign at least one 1 and one 9 if they have a normal load.

  • Pinko Punko says:

    I'm on a section that is pretty good about calibrating, at least in the sense that 1 means phenomenal, so many reviewers will not have any 1s. The hard part is that 3s are on the edge of not being funded, and 3s should be really good grants. This particular section seems to match the suggested criteria very well, but a similar section is just totally compressed: in my section a 30 = 15-25 %ile, but in the other section a 22 = 25-30 %ile, or something like that. Basically anything decent is getting a 3 or better, so the average is definitely not a 5 in that section. I was 1-8 last time with a median of 5, but read more than my pile to calibrate better.

    Some reviewers did seem to have strong piles, but they were generally validated by the other reviewers giving similar scores, so I don't think I noticed anyone inflating grades.

    eeke, as DM likes to point out, there is no "handing out" of scores. You can't just toss someone a 2. That grant gets funded, someone else's doesn't. There is no such thing as giving anyone a break. A lot of worthwhile, impactful research is not being funded now because there isn't money. Grants getting 3 or 4 in a study section that spreads scores all the way out are not getting funded. Those are really strong grants.

  • qaz says:

    "I suggest every reviewer should assign at least one 1 and one 9 if they have a normal load." That'd be a fair structure if that's what everyone is doing. It isn't a good description of what the other reviewers I've encountered on study section are doing.

    If NIH wanted us to do that, then they should ask us to rank the grants from best to worst of our group, not to score it.

  • Pinko Punko says:

    The SROs I know specifically ask people not to consider their pile at all, and to aim for consistent calibration, not 30 sliding scales.

  • DrugMonkey says:

    I have yet to encounter an SRO that did not harp on about score spreading.

  • Comradde PhyioProffe says:

    If 30 is 15-25 %ile, then your study section is not spreading scores sufficiently. 30 should be 10-15 percentile. By making 30 well outside the fundable range, you are effectively compressing the scores that span the fundable and non-fundable limit into just between 20 and 30, which is way too tight. You want 30 to be right at the border of fundable and non-fundable--which on average across ICs should be 12-17 %ile--and have 40 be at 25 %ile.

    And BTW, what DickeMonkey is suggesting about spreading the scores of your own pile from 1-9 is perfectly fine, so long as each reviewer genuinely recalibrates as a consequence of the panel discussion.

  • DrugMonkey says:

    There is also a recalibration process during the read phase in the ~1 week prior to the meeting. Looking at the scores of other assigned reviewers then justifies adjustment of your starting position determined by my suggestion for radical score spreading.

  • DrugMonkey says:

    You also can take a look (during the read phase) at similarly scored apps that you didn't review. Decide based on your evaluation of the apps, critiques and scores if you need to do any within-section adjustment.

  • DrugMonkey says:

    Oh and PP is totally right that if you compress your scores you are diminishing your (the study section's) role in determining what gets funded. Close or identical scores mean the POs have to make more decisions on their own.

    Now sure, you could rely on your brilliant bullet points to convince the PO...

  • qaz says:

    "Score spreading" is not the same as "ranking your pile". (Ranking your pile is fine if everyone does it.) But spreading scores is actually about acknowledging the non-linearity of the new scoring system.

    And CPP is exactly correct - if a 3 is not fundable, then you've simply translated back to the old system (where a 2/5 was the fundable boundary). Of course, in the old system, IT DIDN'T MATTER IF THE SCORES WERE SPREAD because there was plenty of room to differentiate a 1.4 from a 1.8 (that's 4 units). In the old system, you had a linear function from 1.0 (perfect) to 5.0 (terrible). In my experience, most of the grants we saw at study section sat in the 1.2 to 2.8 range (1.6 units of differentiation). In the new system, you are supposed to translate that 1.2-2.8 into 1-7, leaving 8 and 9 for what used to be the lower half. But psychologists have known for decades that humans don't like non-linear scales, so everyone ends up pushing the fundable range into 2-3 (only 1 unit of differentiation). This is why SROs are always harping on spreading scores.

    Ranking your pile might actually be a better system. But that's not what I've ever been told by an SRO, nor is it what other reviewers are doing.
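
The non-linear translation described above can be made concrete with a toy function. This is purely illustrative; the breakpoints (1.2, 2.8, the 8/9 bucket) are taken from the comment's characterization of the old and new systems, not from any NIH guidance:

```python
def old_to_new(old_priority):
    """Translate an old-system priority score (1.0 best to 5.0 worst)
    into a new-system 1-9 score the way the comment describes: stretch
    the 1.2-2.8 band (where most discussed grants sat) across 1-7, and
    compress everything worse into 8-9. Illustrative sketch only."""
    if old_priority <= 2.8:
        # linear stretch: 1.2 maps to 1, 2.8 maps to 7
        return max(1, round(1 + (old_priority - 1.2) * (6 / 1.6)))
    # the old "lower half" collapses into just 8 and 9
    return 8 if old_priority <= 3.9 else 9
```

The discontinuity at 2.8 is exactly the non-linearity reviewers resist: a middling old-system 3.0 and a dreadful 4.8 both land in the 8-9 bucket, while tiny differences among good grants are supposed to be stretched across most of the scale.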

  • Pinko Punko says:

    I would disagree with CPP because this study section is of course where all the grants are above average. Actually, the average score is 4.5, so only slight compression. And, frankly, compression should be expected as grants improve but are not funded: you should start seeing more just-barely-unfunded grants in that range, and an increase in compression; they have to start bunching up.

  • DrugMonkey says:

    qaz-

    You are high, and score spreading was important under the old system too. That hasn't changed. The new system (designed to increase ties) is supposed to break *precisely* the fallacy you express, namely that a 162 differed meaningfully from a 164.

  • DrugMonkey says:

    The reasons SROs harp is because reviewers tend to clump scores around the perceived payline. As PP discussed, this *reduces the decision power of the initial review*.

  • Dave says:

    As someone who just got a 30 for an A0 K application, I'm pretty fascinated by this discussion!!

  • Dave says:

    DM: that's exactly why I asked if you consider the payline when scoring.

  • DrugMonkey says:

    Another reason to use radical score spreading is to help combat this tendency in yourself, Dave.

  • DrugMonkey says:

    In the event you are asking a slightly different question Dave, it is *very* hard to avoid approaching review from a gut level perspective of "this should fund" versus "this should not fund". Perhaps even impossible to totally get rid of that subtle (or not subtle) influence.

    Among other problems this gets into the quagmire of fixing the grant for the applicant via comments. Because reviewers are thinking "this should fund next time with a bit of revising".

  • Grumble says:

    As usual, NIH tries to fit the square peg of human nature into the round hole of its policies & procedures. If SROs constantly have to nag reviewers about spreading the scores, and reviewers constantly ignore them (or at least drift back towards fundable/not fundable scores after the SRO's stern reminders), then maybe the system doesn't work and should be changed.

    I've reviewed grants for several European funding organizations. They don't ask for scores. They ask whether the grant should be funded or not. Why can't NIH do the same? NIH should realize that grant reviewers are pretty much always going to have the answer to this question in the backs of their minds. So why not take advantage of it rather than make it out to be some kind of egregious moral failing to have this question influence scores?

  • qaz says:

    In the study section I was on five years ago, there was a definite qualitative difference between a 1.4 and a 1.8, and discussed scores averaged between a 1.2 and 2.8. Scores from different reviewers reliably clustered - if two people gave it a 1.4 and the third a 1.8, that was a point of major discussion. Generally, I could predict the other scores to within a range of about 0.2. Although scores were clustered in the 1.2 to 2.8 range, there was very good cross-reviewer reliability. Which means there WAS a difference. And we did see the occasional garbage proposal which got a 4 or 5. In the study section I am on now there is little difference between a 2 or a 3, and discussed grant scores range from 2 to 4. (Actual score ranges go from 1 to 9, but little worse than a 4 is ever discussed.) I fail to see how providing a more quantized, less detailed scale provides better information. There was a difference between 1.4 and 1.6. It was noisy and "not significant", but definitely correlated with quality.

    "Not significantly different" (meaning the noise spreads more than the real difference) does not mean "the same" (meaning no difference).

    I do agree that reviewers tend to clump scores around the payline. This is because we really only have three values: "Fund!" "Don't fund!" And "Maybe?". All of the useful scoring information is in the Maybe category. So what we really should do is to max out your amps on Fund! and on Don't Fund! and to spend the scoring range on the edge. This is what the new scoring system tries to do - a 1 is Fund, an 8/9 is Don't Fund, and the rest is supposed to be the scoring range. That's pretty much what the reviewer guidelines say. The problem is that psychologically, it's hard to be nonlinear, and we draw a linear range from 1 to 9, leaving the scoring range back at 2 (in a tight year) to 3 (in a good year), with no room to actually measure the edge.

  • DrugMonkey says:

    Yeah, you've arrived at what I am suggesting qaz. Use 1 all the way to 9.

    With respect to the "more information" issue, it was absolutely an intentional design feature to decrease the false differences between scores of 138 and 142. Ties were expected and built in. From there you can either look at it as freeing the PO from the tyranny of the review order...or encouraging them to do the job they should have been doing all along.

  • Dave says:

    How does the NIH normalize for these differences in scoring behavior across ICs and study sections?

  • Comradde PhyioProffe says:

    Percentile.

  • DrugMonkey says:

    The R01 scores are expressed as a percentile rank against all scores in that round and the two previous rounds for that study section, in the general case. (R21s are typically not percentiled.)

    In theory this accounts for hard ass or softball sections.
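
The percentiling described here can be sketched roughly as follows. This assumes the commonly cited midpoint-rank formula, 100*(k - 0.5)/N; the exact CSR procedure may differ in detail:

```python
def nih_percentile(score, base_scores):
    """Percentile rank of an overall impact score against a base of
    scores (this round plus the two previous rounds for the section).
    Lower scores are better. Uses the commonly cited midpoint-rank
    formula; the real CSR calculation may differ in detail."""
    better = sum(1 for s in base_scores if s < score)  # strictly better
    ties = sum(1 for s in base_scores if s == score)
    k = better + (ties + 1) / 2          # midpoint rank among ties
    return round(100 * (k - 0.5) / len(base_scores))
```

Because each grant is ranked only against its own section's recent base, a hard-ass section and a softball section end up roughly comparable at the percentile level, which is the normalization Dave asked about.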

  • Some ICs (such as NINDS) percentile R21s when they are reviewed by standing CSR study sections.

  • Joe says:

    Problems arise when one or two reviewers are using the whole 1-9 scale or are trying to use the score descriptor sheet while the rest of the group is not. Reviewers 1 and 2 give a really good to pretty good proposal a 2 and a 3, and reviewer 3 gives it a 5. The proposal gets discussed near the end of the second day, and none of the reviewers are convinced to budge. The group can score from 2-5, and the proposal gets a 4, nowhere near funding.

  • "The group can score from 2-5, and the proposal gets a 4, nowhere near funding."

    This mathematically entails that the other couple dozen people on the review panel were on average more comfortable with the 5 than with the 2 or 3. There is a huge amount of continuous calibration of scoring that goes on throughout the one or two days of review. This includes tons of comments by other members of the panel to the effect of "What I am hearing you say about this grant sounds a lot more like a 3 than a 5", or "You sounded a lot more enthusiastic about the Drugmonkey grant than the Physioprof grant we are discussing now, but you gave them both 4s", or "I am hearing multiple weaknesses that substantially affect the likely impact of the proposed studies, and so there is no way that this should be scored a 4", or "I am only hearing minor weaknesses that don't substantially affect the likely impact of the proposed studies, so there is no way that this should be scored a 6".

  • And I should also say that if a grant falls below the triage line based on a single outlier preliminary score, like one 2, one 3, and one 8, then the grant is highly likely to be discussed based on the demand of either or both of the 2,3 reviewers. And if it isn't, then it entails that neither of the 2,3 reviewers was really all that enthusiastic. So this idea that outstanding grants are being "trashed" by a single assigned reviewer is not actually realistic.

    And BTW, this expresses yet *another* reason why junior faculty should be allowed, and even encouraged, to serve on study section: so that they learn that their paranoid fantasies of all the "bias" and "errors" that led to their grants not getting funded are actually chimeras.

  • Dave says:

    "And BTW, this expresses yet *another* reason why junior faculty should be allowed, and even encouraged, to serve on study section: so that they learn that their paranoid fantasies of all the "bias" and "errors" that led to their grants not getting funded are actually chimeras."

    I would love to simply observe study sections, never mind actually serve on one.

  • "I would love to simply observe study sections, never mind actually serve on one."

    This is forbidden by federal statute, so the only way to get junior PIs in the room is as reviewers.

  • Joe says:

    CPP,
    What you say is what it should be, and I am glad that you are on a study section that behaves so. However, I often see one or two guys who give much larger number scores than everyone else. I do hear the comments you mention, along the lines of "this sounds much more like a 3 than a 5." But only rarely have I seen a very mixed-score proposal (like the 2, 3, 5 example) brought to a 2,2,3 or 2,2,2. Also it seems that, given a wider range, scores will increase by more than just the vote of the hard-nosed guy: some panelists will choose the larger number when they can do it anonymously, even though they would not raise their hands and vote out of range.

  • DrugMonkey says:

    CPP's two comments are spot bang on the mark, people.

    Joe, I've seen it go down every which way. Grants get saved by a brilliant discussion from a reviewer or get hammered after someone identifies problems that are made clear to everyone. sometimes it is clear that there is just a plain old legitimate difference of opinion on how to weight factors and the panel members are told to vote their conscience within a huge post discussion range. sometimes the scores are "2,2,2...ok, we're done here"

    like all of life, YMMV* is a good way to think about it.

    *which brings me back, as always, to my mantra. the only way to beat the odds is to submit a whole lot of grants, targeting multiple study sections of interest.

  • Comradde PhyioProffe says:

    Dude, all my comments are always spot bang on the mark!

  • Joe says:

    "And I should also say that if a grant falls below the triage line based on a single outlier preliminary score, like one 2, one 3, and one 8, then the grant is highly likely to be discussed based on the demand of either or both of the 2,3 reviewers."

    I'm with you except for the "highly likely" part. At the end of day 2, you're exhausted from arguing all afternoon about a bunch of proposals in the 3-4 range that you know won't be funded. Are you going to ask to pull up from triage a proposal you gave a 3 to? No, because you wonder if maybe you missed something or were being overly generous or shouldn't you be better about using the whole range anyway? If you gave it a 2, and you don't have to make the late-afternoon flight, then yes, you might be willing to get over the "nobody's getting funded" depression and fight the good fight.

  • [...] comment I made about grants being "saved" in discussion reminded me of one of the first experiences I had on study [...]

  • AcademicLurker says:

    I just got the "Just Say No to Score Compression" powerpoint presentation from the head of the study section that's meeting in February. In addition to sending out the presentation, he wants a pre-study section phone conference to make sure that everyone is consistent about what the scale means and how to use it.

  • qaz says:

    "In addition to sending out the presentation, he wants a pre-study section phone conference to make sure that everyone is consistent about what the scale means and how to use it." I think this is required. We have it every cycle in my study section. It won't do any good. The new scoring system is a mess. We should trash it and go back to the old system. Or to DM's rank-within-your-experience system*.

    * Although ranking doesn't work for special emphasis panels and center grants where you don't have enough samples.

  • Comradde PhyioProffe says:

    Dude, you are fucken deranged. The new system is a kajillion times better than the old system, and the scoring rubric based on weighing strengths versus major and minor weaknesses is extremely useful in getting reviewers on the same page.

  • qaz says:

    CPP - I respectfully disagree.

    The more experience I got with the old system, the more I was able to predict it, the more I was able to understand where a proposal would fall in the rank. The problem with the old system was that there was a lot of noise mixed in the measurement. That's fine, but the noise was not uniform, it was Gaussian, so the measure was still a good estimate of the underlying rank.

    The more experience I get with the new system, the less it makes sense to me, the less I am able to predict it. What I've seen is that reviewers treat all decent grants as 2-3 and 4-9 as ways of saying how bad they are. Clearly, that's not what we're supposed to do, but at least in the study section that I've seen, that is consistently what happens. The more experience I have with the new system (on in-person study section, with my own grants, and in special emphasis panels), the less I am able to predict where a grant will end up. Assuming that the goal of study section is to provide an actual rank order, the old system was infinitely better than the new.

    However, I think DM has the truth of it, which is that the quantization is an intended feature of the new system and not a bug. It moves all of the real funding decision into Program, leaving study section the ability to say YES (1), maybe (2), only if needed for mercy (3), and NO (4-9).

    If the study section was working the way the scoring instructions say that it should, the impact-percentile scatter plot should be decidedly non-linear, with a bunch of grants at 1, most grants at 8-9, and a smaller set of grants in the middle (since that's supposed to measure the details of the maybe group and is where we all agree the needed range should be). But it's not. The impact-percentile scatter plots are always extremely linear. The problem is that most of the grants we get are very good and if they were a linear function of all possible grants (even all grants I've seen in my experience), most (50%+) of grants would score a 2.5 meaning "probably likely to produce some pretty good science."

    Look, if program wants a simple ranking of the grants, we should admit that this is an ordinal not an interval scale. I think that we should be instructed to do what DM does, which is to rank the grants you get, not to score them. The problem is that this only works if everyone does it. And that is not what we've been told to do. Nor is it what my fellow study-section reviewers do. In fact, I have heard members chastised for saying with a shrug "I gave it a 1. It was the best grant I read this cycle."

    If program wants scoring, then it should provide plenty of room to score and then realize that there's noise in the system.

  • You are grossly misreading the way the new system is both supposed to work and how it actually works, at least in my experience. The goal is to spread the top 25% of grants from 10-40.

  • qaz says:

    Then why is an impact of 4 = 40th percentile? (See http://scientopia.org/blogs/drugmonkey/2010/08/09/impact-score-versus-percentile-scatter-plot/ Yes, I know that's impact score, rather than overall. I can't find the percentile relationship to overall, but I'm sure overall is similar.)

    If the goal was to spread 25% of grants into 10-40, and to triage 50% of grants, then one would expect to be discussing a lot of grants with average scores of 6. Does this happen on any study sections? (It certainly doesn't on the one I'm on.)

  • Joe says:

    qaz has exactly characterized the behavior of the study sections I have been on, i.e., "all decent grants [score] 2-3 and 4-9 [is used as a way] of saying how bad they are." Grants with an average score of 6 are nowhere near getting discussed. You can get away with saying that a proposal is the best one in your stack, but if you link that directly to the score you gave it, you will be called out for it and told you are not supposed to score that way.
