NIH Ginther Fail: A transformative research project

May 29, 2018 | Published under: Fixing the NIH, NIH, Underrepresented Groups

In August of 2011 the Ginther et al. paper published in Science let us know that African-American PIs were disadvantaged in the competition for NIH awards. It identified an overall disparity in success rates, and it also found that African-American PIs who did get funded had to revise and resubmit their proposals more often before the award was made.

Both of these have significant consequences for what science gets done and how careers unfold.

I have been very unhappy with the NIH response to this finding.

I have recently become aware of a "Transformative" research project (R01 GM111002; 9/2013-6/2018) funded to PIs M. Carnes and P. Devine that is titled EXPLORING THE SCIENCE OF SCIENTIFIC REVIEW. From the description/abstract:

Unexplained disparities in R01 funding outcomes by race and gender have raised concern about bias in NIH peer review. This Transformative R01 will examine if and how implicit (i.e., unintentional) bias might occur in R01 peer review... Specific Aim #2. Determine whether investigator race, gender, or institution causally influences the review of identical proposals. We will conduct a randomized, controlled study in which we manipulate characteristics of a grant principal investigator (PI) to assess their influence on grant review outcomes...The potential impact is threefold; this research will 1) discover whether certain forms of cognitive bias are or are not consequential in R01 peer review... the results of our research could set the stage for transformation in peer review throughout NIH.

It could not be any clearer that this project is a direct NIH response to the Ginther result. So it is fully and completely appropriate to view any resulting studies in this context. (Just to get this out of the way.)

I became aware of this study through a Twitter mention of a pre-print that has been posted on PsyArXiv. The version I have read is:

No race or gender bias in a randomized experiment of NIH R01 grant reviews. Patrick Forscher, William Cox, Markus Brauer, Patricia Devine. Created on: May 25, 2018 | Last edited: May 25, 2018

The senior author is one of the Multi-PIs on the aforementioned funded research project, and the pre-print makes this even clearer with its funding statement:

Funding: This research was supported by 5R01GM111002-02, awarded to the last author.

So while yes, the NIH does not dictate the conduct of research under awards that it makes, this effort can be fairly considered part of the NIH response to Ginther. As you can see from comparing the abstract of the funded grant to the pre-print study there is every reason to assume the nature of the study as conducted was actually spelled out in some detail in the grant proposal. Which the NIH selected for funding, apparently with some extra consideration*.

There are many, many, many things wrong with the study as depicted in the pre-print. It is going to take me more than one blog post to get through them all, so consider no single post to be a complete accounting. I may also repeat myself on certain aspects.

First up today is the part of the experimental design that was intended to create the impression in the minds of the reviewers that a given application had a PI with certain key characteristics, namely on the spectra of sex (male versus female) and ethnicity (African-American versus Irish-American). This, I will note, is a tried and true design feature behind some very useful prior findings. Change the author names to initials and you can reduce apparent sex-based bias in the review of papers. Change the author names to African-American-sounding ones and you can change the opinion of the quality of legal briefs. Change the sex or apparent ethnicity of the name on job resumes and you can change the proportion called for further interviewing. Etc. You know the literature. I am not objecting to the approach itself; it is a good one. I am objecting to the way it was applied to NIH grant review.

The problem with applying this to NIH grant review is that the Investigator(s) is such a key component of review. It is one of five allegedly co-equal review criteria, and the grant proposal includes a specific document (the Biosketch) that details a specific individual and their contributions to science. This differs tremendously from the job of evaluating a legal brief. It differs tremendously from reviewing a large stack of resumes submitted in response to a fairly generic job posting. It even differs from the job of reviewing a manuscript submitted for potential publication. NIH grant review specifically demands an assessment of the PI in question.

What this means is that it is really difficult to fake the PI and have success in your design. Success absolutely requires that the reviewers who are the subjects in the study both fail to detect the deception and genuinely develop a belief that the PI has the characteristics intended by the manipulation (i.e., man versus woman and black versus white). The authors recognized this, as we see from page 4 of the pre-print:

To avoid arousing suspicion as to the purpose of the study, no reviewer was asked to evaluate more than one proposal written by a non-White-male PI.

They understand that suspicion as to the purpose of the study is deadly to the outcome.

So how did they attempt to manipulate the reviewer's percept of the PI?

Selecting names that connote identities. We manipulated PI identity by assigning proposals names from which race and sex can be inferred 11,12. We chose the names by consulting tables compiled by Bertrand and Mullainathan 11. Bertrand and Mullainathan compiled the male and female first names that were most commonly associated with Black and White babies born in Massachusetts between 1974 and 1979. A person born in the 1970s would now be in their 40s, which we reasoned was a plausible age for a current Principal Investigator. Bertrand and Mullainathan also asked 30 people to categorize the names as “White”, “African American”, “Other”, or “Cannot tell”. We selected first names from their project that were both associated with and perceived as the race in question (i.e., >60 odds of being associated with the race in question; categorized as the race in question more than 90% of the time). We selected six White male first names (Matthew, Greg, Jay, Brett, Todd, Brad) and three first names for each of the White female (Anne, Laurie, Kristin), Black male (Darnell, Jamal, Tyrone), and Black female (Latoya, Tanisha, Latonya) categories. We also chose nine White last names (Walsh, Baker, Murray, Murphy, O’Brian, McCarthy, Kelly, Ryan, Sullivan) and three Black last names (Jackson, Robinson, Washington) from Bertrand and Mullainathan’s lists. Our grant proposals spanned 12 specific areas of science; each of the 12 scientific topic areas shared a common set of White male, White female, Black male, and Black female names. First names and last names were paired together pseudo-randomly, with the constraints that (1) any given combination of first and last names never occurred more than twice across the 12 scientific topic areas used for the study, and (2) the combination did not duplicate the name of a famous person (i.e., “Latoya Jackson” never appeared as a PI name).
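
For concreteness, here is a minimal sketch of how a constrained pseudo-random pairing like the one they describe could be implemented. The name lists and the two constraints come straight from the excerpt above; everything else (function names, the random seed, the famous-name exclusion list) is my own illustration, not the authors' actual materials.

```python
# Minimal sketch (not the authors' code) of a constrained pseudo-random
# pairing of first and last names as described in the quoted passage.
import itertools
import random

FIRST = {
    "white_male":   ["Matthew", "Greg", "Jay", "Brett", "Todd", "Brad"],
    "white_female": ["Anne", "Laurie", "Kristin"],
    "black_male":   ["Darnell", "Jamal", "Tyrone"],
    "black_female": ["Latoya", "Tanisha", "Latonya"],
}
LAST = {
    "white": ["Walsh", "Baker", "Murray", "Murphy", "O'Brian", "McCarthy",
              "Kelly", "Ryan", "Sullivan"],
    "black": ["Jackson", "Robinson", "Washington"],
}
FAMOUS = {("Latoya", "Jackson")}  # combinations excluded as famous-person names

def assign_names(n_topics=12, max_reuse=2, seed=0):
    """Assign one PI name per identity category to each scientific topic area,
    never reusing a first+last combination more than `max_reuse` times and
    never producing a famous person's name."""
    rng = random.Random(seed)
    used = {}  # (first, last) -> number of topic areas it has appeared in
    assignments = []
    for _ in range(n_topics):
        topic = {}
        for category, firsts in FIRST.items():
            race = "black" if category.startswith("black") else "white"
            candidates = [
                (f, l) for f, l in itertools.product(firsts, LAST[race])
                if (f, l) not in FAMOUS and used.get((f, l), 0) < max_reuse
            ]
            first, last = rng.choice(candidates)
            used[(first, last)] = used.get((first, last), 0) + 1
            topic[category] = f"{first} {last}"
        assignments.append(topic)
    return assignments

if __name__ == "__main__":
    for i, names in enumerate(assign_names(), start=1):
        print(i, names)
```

Note, by the way, how small the "Black" pool is: eight or nine available first/last combinations have to stretch across twelve topic areas, which is presumably why reuse up to twice had to be allowed at all.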

So basically the equivalent of blackface. They selected some highly stereotypical "black" first names and some "white" surnames, which are almost all Irish (hence my comment above about Irish-American ethnicity instead of Caucasian-American; this also needs some exploring).

Sorry, but for me this heightens the concern that reviewers would deduce what the experimenters were up to. Right? Each reviewer had only three grants (which is a problem for another post) and at least one of them practically screams in neon lights "THIS PI IS BLACK! DID WE MENTION BLACK? LIKE REALLY REALLY BLACK!". As we all know, nothing close to 33% of NIH applications come from African-American investigators. Any experienced reviewer would be at risk of noticing that something is a bit off. The authors say nay.

A skeptic of our findings might put forward two criticisms: .. As for the second criticism, we put in place careful procedures to screen out reviewers who may have detected our manipulation, and our results were highly robust even to the most conservative of reviewer exclusion criteria.

As far as I can tell their "careful procedures" included only:

We eliminated from analysis 34 of these reviewers who either mentioned that they learned that one of the named personnel was fictitious or who mentioned that they looked up a paper from a PI biosketch, and who were therefore likely to learn that PI names were fictitious.

"who mentioned".

There was some debriefing which included:

reviewers completed a short survey including a yes-or-no question about whether they had used outside resources. If they reported “yes”, they were prompted to elaborate about what resources they used in a free response box. Contrary to their instructions, 139 reviewers mentioned that they used PubMed or read articles relevant to their assigned proposals. We eliminated the 34 reviewers who either mentioned that they learned of our deception or looked up a paper in the PI’s biosketch and therefore were very likely to learn of our deception. It is ambiguous whether the remaining 105 reviewers also learned of our deception.

and

34 participants turned in reviews without contacting us to say that they noticed the deception, and yet indicated in review submissions that some of the grant personnel were fictitious.

So despite the instructions discouraging participants from using outside materials, significant numbers of them did so anyway. And reviewers turned in reviews without saying they were on to the deception when they clearly were. The authors did not, apparently, debrief in a way that could definitively say whether all, most or few reviewers were on to their true purpose. Nor does there appear to be any mention of asking reviewers afterwards whether they knew about Ginther specifically, or about disparate grant award outcomes in general terms. That would seem to be important.

Why? Because if you tell most normal decent people that they are to review applications to see if they are biased against black PIs they are going to fight as hard as they can to show that they are not a bigot. The Ginther finding was met with huge and consistent protestation on the part of experienced reviewers that it must be wrong because they themselves were not consciously biased against black PIs and they had never noticed any overt bias during their many rounds of study section. The authors clearly know this. And yet they did not show that the study participants were not on to them. While using those rather interesting names to generate the impression of ethnicity.

The authors make several comments throughout the pre-print about how this is a valid model of NIH grant review. They take a lot of pride in their design choices in many places. I was very struck by:

names that were most commonly associated with Black and White babies born in Massachusetts between 1974 and 1979. A person born in the 1970s would now be in their 40s, which we reasoned was a plausible age for a current Principal Investigator.

because my first thought when reading this design was "gee, most of the African-Americans that I know who have been NIH funded PIs are named things like Cynthia and Carl and Maury and Mike and Jean and.....dude something is wrong here.". Buuuut, maybe this is just me and I do know of one "Yasmin" and one "Chanda" so maybe this is a perceptual bias on my part. Okay, over to RePORTER to search out the first names. I'll check all time and for now ignore F- and K-mechs because Ginther focused on research awards, iirc. Darnell (4, none with the last names the authors used); LaTonya (1, ditto); LaToya (2, one with middle / maiden? name of Jones, we'll allow that and oh, she's non-contact MultiPI); Tyrone (6; man one of these had so many awards I just had to google and..well, not sure but....) and Tanisha (1, again, not a president surname).

This brings me to "Jamal". I'm sorry but in science when you see a Jamal you do not think of a black man. And sure enough RePORTER finds a number of PIs named Jamal but their surnames are things like Baig, Farooqui, Ibdah and Islam. Not US Presidents. Some debriefing here to ensure that reviewers presumed "Jamal" was black would seem to be critical but, in any case, it furthers the suspicion that these first names do not map onto typical NIH funded African-Americans. This brings us to the further observation that first names may convey not merely ethnicity but something about subcategories within this subpopulation of the US. It could be that these names cause percepts bound up in geography, age cohort, socioeconomic status and a host of other things. How are they controlling for that? The authors make no mention that I saw.

The authors take pains to brag on their clever deep thinking in using an age range that would correspond to PIs in their 40s (wait, actually 35-40, if the claim that this was funded under the -02 year of the project is accurate, and this when the average age at first major NIH award is 42?) to select the names, and then they didn't even bother to check whether these names appear in the NIH database of funded awards?
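
That check is not hard to do, either. Here is a minimal sketch of the kind of RePORTER query I ran by hand, scripted against NIH's public RePORTER search API. I am assuming the v2 projects/search endpoint and its pi_names criterion here; the exact endpoint and field names should be verified against the current API documentation rather than taken from me.

```python
# Minimal sketch: list funded NIH PIs with a given first name via the
# public RePORTER search API. The endpoint and field names are assumptions
# based on the v2 API and should be checked against the current docs.
import requests

REPORTER_URL = "https://api.reporter.nih.gov/v2/projects/search"

def pis_with_first_name(first_name, limit=200):
    """Return the set of PI full names on awards matching this first name."""
    payload = {
        "criteria": {"pi_names": [{"first_name": first_name}]},
        "limit": limit,
    }
    resp = requests.post(REPORTER_URL, json=payload, timeout=30)
    resp.raise_for_status()
    names = set()
    for project in resp.json().get("results", []):
        for pi in project.get("principal_investigators") or []:
            if pi.get("first_name", "").lower() == first_name.lower():
                names.add(pi.get("full_name"))
    return names

if __name__ == "__main__":
    for name in ["Darnell", "Jamal", "Tyrone", "Latoya", "Latonya", "Tanisha"]:
        print(name, len(pis_with_first_name(name)))
```

Even a quick pass like this would have flagged the Jamal-surname problem before the materials went out to reviewers.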

The takeaway for today is that the study's validity rests on the reviewers not knowing its true purpose. And yet the authors' own data show that reviewers did not follow the instructions to avoid outside research, and that reviewers did not necessarily volunteer that they had detected the name deception*** even when some of them clearly had. Combine this with the way the study created the impression of PI ethnicity via these particular first names, and I think this can be considered a fatal flaw in the study.
__

Race, Ethnicity, and NIH Research Awards. Donna K. Ginther, Walter T. Schaffer, Joshua Schnell, Beth Masimore, Faye Liu, Laurel L. Haak, Raynard Kington. Science, 19 Aug 2011: Vol. 333, Issue 6045, pp. 1015-1019. DOI: 10.1126/science.1196783

*Notice the late September original funding date combined with the June 30 end date for subsequent years? This almost certainly means it was an end of year pickup** of something that did not score well enough for regular funding. I would love to see the summary statement.

**Given that this is a "Transformative" award, it is not impossible that they save these up for the end of the year to decide. So I could be off base here.

*** As a bit of a sidebar, there was a Twitter person who claimed to have been a reviewer in this study and found a Biosketch from a supposedly female PI referring to a sick wife. Maybe the authors intended this, but it sure smells like sloppy construction of their materials. What other tells were left? And if they *did* intend to bring in LGBTQ assumptions...well, this just seems like throwing random variables into the mix to add noise.

DISCLAIMER: As per usual I encourage you to read my posts on NIH grant matters with the recognition that I am an interested party. The nature of NIH grant review is of specific professional interest to me and to people who are personally and professionally close to me.

23 responses so far

  • None says:

    Question: would you have written such a long and vehement post if this study had concluded that there is rampant discrimination against African-Americans?

  • drugmonkey says:

    Question: Would you have attempted this ad hominem dismissal ploy instead of addressing the substance of my points if I had?

  • Pinko Punko says:

    Everything about their output so far seems to lack the needed ambition or scope to get at the problem. The counterargument is that they will be incrementally moving the needle towards further studies, but I feel like I would design studies in a different way. But I know nothing. My perception is that the main study in question is useless.

  • none says:

    But seriously... these name-swap experiments are done all the time. You can potentially learn a lot from them. Can you say in all honesty that you would write a multi-part blog series criticizing this study if it had come to a different conclusion? Go ahead and make the claim if you think it's true.

    And nothing in my original comment was ad hominem. On the other hand, referring to names like Darnell and Latoya as "blackface" seems pretty outrageously offensive to me...

  • Anon says:

    (second attempt to post this)

    Since blinding is one of the crucial issues here, are there common methods that have been shown to reliably tell if subjects are blind to the manipulation? I don't work with human subjects, so not something I have experience with, but seems like something people must have worked on. Maybe give some of the subjects a more obvious "tell" and see if the results differ? Maybe interview them? Or query them at the end and reward those who guess correctly, to disincentivize lying?

    And picking names based solely on race, rather than from the upper socioeconomic bracket of each racial group being studied, seems like a bad idea when you're trying to study how people evaluate professionals with advanced degrees. I guess "Carl Johnson" wouldn't be an obvious tell, but maybe "Bachelor of Science, Morehouse College" vs "Bachelor of Science, University of Vermont" would provide some cues. Modify the biosketch to add some awards and memberships relevant to minorities in STEM. It would be a strong hint at race, but it would be a more plausible biosketch than "LaTonya Washington, Bachelor of Science from U. North Dakota" or whatever.

  • drugmonkey says:

    But seriously... these name-swap experiments are done all the time. You can potentially learn a lot from them

    Is it really that difficult for you to read a blog post for content? or do your pre-existing ideas get in the way of that?

  • drugmonkey says:

    Anon- I'm no expert but I've read enough in Experimental Design classes and other coursework to know that yes, there are better and worse methods for this sort of thing. Debriefing your subjects can be very important as can be pilot studies to validate your methods and make sure they are working as intended. In short, it is a science like any other. These authors did do some debriefing but not, apparently, on the critical issues. Making sure that nobody in the analysis sample suspected what they were up to seems to be essential. As they pretty much state at one point. So why did they leave it up to random self-nomination to determine this critical bit of information?

    All the things you suggest are manipulations that could be evaluated in pilot work to see what works best to convey race without giving away the game.

  • Jonathan Badger says:

    and some "white" surnames which are almost all Irish

    Considering one of the PIs has a first name of Molly and the other Patricia, they might not know there is a difference.

  • Ola says:

    It boggles my mind why it is necessary to do this type of "research" on limited subsets of the data? CSR is already sitting on the mother lode of data - they just have to figure out how to parse it!

    The hypothesis would be that proposals from all races and sexes are equally meritorious, so just divide up the thousands upon thousands of existing reviews by category, and match them up just like they do in clinical trials. Take account of confounders, and figure out if there's really a bias in a real world setting with real reviewers on real grants. No need for fancy made up names when there are lots of real ones already in the CSR database.

    Or, you know, try to build a flawed model to ask the same question, and then spend forever wondering if the results are generalizable to the whole population (which, to reiterate, you have the data for!)

  • SidVic says:

    "The hypothesis would be that proposals from all races and sexes are equally meritorious, so just divide up the thousands upon thousands of existing reviews by category, and match them up just like they do in clinical trials. "

    What, huh? How do you test this?

  • SidVic says:

    By all means, let's get granular, OLA. Match the study section reviewers to WASP, Asian, Jew, African-a, mixed race, Italian-a, etc. Then correlate their scores with tribal membership and across the tribes (cause after all the WASP may be harder on the AA than another affiliation) to applications. It should be possible to tease out some biases with this analysis. Any statisticians out there that are willing to ruin their careers are welcome to take that and run with it...

    Just to clarify my above comment - I think you confuse a hypothesis with a presumption. Please clarify if I'm in error. The road you and DM advocate has pitfalls - you goof.

  • DrugMonkey says:

    JB- I suspect the Irish surname thing may be a result of the original study pulling this from Massachusetts births.

  • Ola says:

    @SidVic - Yes I see the "hypothesis" was perhaps not framed properly, but it is testable.

    The way to go about this in a statistically rigorous manner would be to identify a variety of factors (race, gender, age, geography, academic rank, what-have-you) that might affect proposal score when all other factors are corrected for, then do regression analyses to determine the degree to which each of these factors is predictive of score. You would then be able to say "N % of the overall difference in score is attributable to factor X". There would be built in positive controls for things we would expect to have an impact (such as academic rank), and you could also build in random factors that would not be expected to have an impact (e.g. average word length used in the equipment/resources section). Correct for multiple parallel testing, and out spits your answer - does race (or your variable-du-jour) correlate with score. What percent of score can be accounted for when race is the only variable?

    The hard work is in the nitty gritty of matching up the proposals. So, for each application from a black female PI who is a married full professor at a small liberal arts college in the rural northeast, you'd need to find a matching white female PI who is also a married full professor at a small liberal arts college in the rural northeast. No different than what the MPH folks do when parsing clinical trial data (matching a diabetic 300lb teenage pacific islander with a non-diabetic 300lb teenage pacific islander). It takes time, and you need a LOT of data to pick matched examples from - that's why CSR, with their mother lode of examples, is the only org that can reliably do this.

  • SidVic says:

    Yes, yes, multivariate analysis. They did this on the gender wage gap and concluded that it doesn't exist. It has limits, as the candidate variables are important and virtually unlimited. Would you include IQ, hours worked per week, number of kids and time spent at their recitals... There is a subset of hypercompetitive men (one could argue pathologically so) that are accumulating the accolades. As I see it only one experiment would settle the matter. Divvy up the applications and send them to GOD (or an all-knowing AI) to determine their merit. Then let the chips fall where they may.

    Disparate outcomes are not evidence of systemic bias. Hell's bells man, you and DM, or people of your viewpoint, are running the show. You can't keep saying the system is rigged when you're running it! 100 years ago AAs were trying to pass as white to increase opportunities. Now we have examples of whites trying to pass as URM to take advantage of affirmative action!

  • mathlete says:

    "Now we have examples of whites trying to pass as URM to take advantage of affirmative action!"

    Perhaps (I suspect it's infrequent at best), but not because being a URM is a sweet ride to the top (they are UR for a reason). There may be a slight added advantage at the point of an application, but only because it's trying to (insufficiently) correct for a massive inequity that has accumulated over a lifetime (multiple ones at that).

    However, I do agree that it's *possible* that peer review is not significantly biased (it's been a while since I read DM's earlier posts on the subject). Even if so, we shouldn't pat ourselves on the back and say that our purely meritocratic system has no responsibility at all to address the other structural inequities that persist.

  • drugmonkey says:

    Of course it may not be at peer review. Could *easily* lie with the grey zone pickup behavior of Program. Interestingly we’ve heard nothing at all about NIH efforts to investigate that.

  • zb says:

    I really don't see how blinded grant review is possible. I've always known the people I was reviewing, for grants and papers. Potentially in some very large fields? Or in introductory grants? (i.e. for trainees? though trainee awards often seem to be highly influenced by training labs, so I'm not sure that would work, either, but maybe that could be held constant).

    I think the name study has been done with student (or postdoc?) queries and found a difference in return responses based on the name sex (and race?) of the querier (though I'm not finding a cite, so I can't critique the study).

    What kind of review would request that you not look up papers in PubMed? Is that real in any grant review?

    I just had a conversation about a group of smart 3rd graders who were tasked with imagining "scientists", "firefighters", . . . . and got within a few seconds that they were being "tested" on gender bias.

  • Postdoc says:

    zb: That anecdote about the 3rd graders is fascinating. Don't think I would have assumed that back in the day.

  • Anon says:

    With regard to preventing PubMed searches, I think one could tell the experimental subjects that portions of the grant have been modified both for the privacy of the people who provided source material and to see how people respond to applications written in different ways, so the bibliographies and biosketches do not have 100% accurate citations. That might satisfy the subjects. But then you'd want to query them after they finished their reviews and ask if they could guess the research question (with some sort of reward for getting it right or getting close), to see if the blinding worked.

    With regard to the 3rd graders, I think it would depend on what was going on in the classroom. If they have a curriculum that very consciously pushes against gender stereotypes, with frequent lessons on how stereotypes are false, frequent reminders that girls can become engineers and boys can become nurses, stories that emphasize girl heroes rescuing boys (with discussions of how important that is), etc., then I could easily see third graders figuring out what adults are interested in.

    Especially if the curricular push against gender stereotypes was recent. When I was in 4th grade I started at a new school. I never thought of the teachers at my k-3 school as having a pedagogical philosophy, partly because the fish never notice the water and partly because they were pretty well-rounded in their approach (which is still an approach, just not always an easily-noticed one). The teachers at my new school, however, had very clearly been to a workshop on a very specific teaching approach, and man did it show.

  • drugmonkey says:

    Anon- yes, I agree that there are many things that could reasonably be tried to validate and improve the methods.

    Year 5 of the transformative award has $465,804 in direct costs. Similar in other years except the $614,398 in Year 3.

    They've had a lot of time and cash to work on methods validation, if you ask me.

    Looking over the Results tab of the grant, they've spent a fair bit of effort on Aim 1, designed to "Identify the extent to which investigator characteristics influence the words and descriptors chosen by R01 peer-reviewers and how text relates to assigned scores," and some effort on Aim 2, "Examine how interactional patterns among study section members promote receptivity and resistance to discussion topics and associated grant applicants."

    wait, record scratch. This is all starting to sound very familiar

    http://drugmonkey.scientopia.org/2018/03/09/agreement-among-nih-grant-reviewers/

  • […] recently discussed some of the problems with a new pre-print by Forscher and colleagues describing a study which […]

  • qaz says:

    As a grant reviewer, I would expect Methods Validation to *precede* the grant. I would want to see it in the preliminary data.

  • […] study on Twitter, others panned it. A blogger known as Drugmonkey argued that reviewers had likely figured out the names were fake and deliberately gave their applications good scores to “show that they are not a bigot.” […]
