Zealots

One of my favorite things about this blog, as you know Dear Reader, is the way it exposes me (and you) to the varied perspectives of academic scientists. Scientists who seemingly share a lot of workplace and career commonalities which, on examination, turn out to differ in both expected and unexpected ways. I think we all learn a lot about the conduct of science in the US and, to a lesser extent, worldwide in this process.

Despite numerous pointed discussions about differences of experience and opinion for over a decade now, it still manages to surprise me that so many scientists cannot grasp a simple fact.

The way that you do science, the way the people around you do science and the way you think science should be done are always but one minor variant on a broad, broad distribution of behaviors and habits. Much of this is on clear display from public evidence. The journals that you read. The articles that you read. The ones that you don't read but can't possibly miss knowing that they exist. Grant funding agencies. Who gets funded. Universities. Med schools within Universities. Research Institutions or foundations. Your colleagues. Your mentors and trainees. Your grad school drinking buddies. Conference friends and academic society behaviors.

It is really hard to miss. IMO.

And yet.

We still have this species of dumbass on the internet who can't get it through his* thick head that his experiences, opinions and, yes, those of his circle of reflecting-room buddies and acolytes, are but a drop in the bucket.

And they almost invariably start bleating on about how their perspective is not only the right way to do things but that some other practice is unethical and immoral. Despite the evidence (again, often quite public evidence) that large swaths of scientists do their work in this totally other, and allegedly unethical, way.

The topic of the week is data leeching, aka the OpenAccessEleventy perspective that every data set you generate in your laboratory should be made available in easily understood, carefully curated format for anyone to download. These leeches then insist that anyone should be free to use these data in any way they choose with barely the slightest acknowledgment of the person who generated the data.

Nobody does this. Right? It's a tiny minority of all academic scientific endeavor that meets this standard at present. Limited in the individuals, limited in the data types and limited in the scope even within most individuals who DO share data in this way. Maybe we are moving to a broader adoption of these practices. Maybe we will see significant advance. But we're not there right now.

Pretending we are, with no apparent recognition of the relative proportions across academic science, verges on the insane. Yes, like literally delusional insanity**.

__
*94.67% male

**I am not a psychiatrist™

49 responses so far

  • Jonathan Badger says:

    Okay, for the people who *don't* believe in providing your data (and who think people who want to use your data are somehow "leeches"), what exactly do you think the point of publishing your paper is? Your interpretation of your data may be of passing interest, but the interesting thing is that you've generated data. And that combined with other data may yield insights you haven't thought of.

  • Pipsqueak says:

    Mind me asking what sorts of computational tools you use when you do research, DM?

  • drugmonkey says:

    Why? These people are not limiting themselves to any computational tool or data that requires such.

  • drugmonkey says:

    JB- you are always free to interpret the data presented in a paper as you see fit.

  • qaz says:

    JB - the point of publishing a paper is to report a discovery.

    The idea that the purpose of publishing a paper is to provide collected data is to fundamentally misunderstand science. (See, we can be zealots too.)

    Science is about integration, understanding, mechanism, and the identification of generalities that allow us to control future situations by identifying what the key parameters of those situations are and how those parameters change with manipulation.

    Data sharing can be useful, as a check against stupid mistakes and fraud, but the vast majority of reproducibility problems are about not yet knowing the right generalities. A much better way to make those comparisons is to replicate and extend a discovery yourself. That way you know what both conditions really are.

  • WH says:

    "every data set you generate in your laboratory should be made available in easily understood, carefully curated format for anyone to download."

    This is currently true of next-gen sequencing data as well as crystallography. Use of the former is widespread in biological science at the moment and the de facto requirement for public database deposition of the latter has been responsible for ratting out cheats* and various "mistakes"**. It's always seemed ironic to me that these are some of the harder types of data to fake yet have the most stringent reporting requirements.

    *https://ori.hhs.gov/case-summary-murthy-krishna-hm
    **https://www.nature.com/articles/nature17421, for example

  • Jonathan Badger says:

    You are thinking too small scale. Yes, labs can replicate and extend particular experiments, but they can't (nor should they) replicate all experiments in a field. But they can and should benefit from having all that data available -- in that way they can apply statistics and machine learning to test hypotheses in silico. Even if you don't trust computational results completely (and granted, you shouldn't), these computational experiments can at least tell you whether or not it is really worth doing the actual bench stuff.

  • Pipsqueak says:

    Well, do you feel the same way about data processing software? If a group develops some critical analytical technique, and uses it in publications, should they be obliged to share it?

  • grumpy says:

    Explain something to me: suppose you generate a beautiful data set, publish the methods and initial discoveries from the data, and co-release the raw data with publication.

    Now a bunch of other scientists look through your data and generate all sorts of new discoveries. Is it really the case in some fields that the generators of the data set get left out of the citations/awards/plenary talks/whatever? Are they not credited (amongst people in the know) as being the pioneering seeds to these new discoveries?

    I find it kind of hard to imagine, but if anyone has specific anecdotes to share, it would be interesting to hear.

  • Jonathan Badger says:

    @pipsqueak
    Yes, of course. Any published software needs to be open source (and to be fair, many if not all journals already require that). A paper about a piece of closed-source software isn't a paper - it's an advertisement.

  • drugmonkey says:

    There ya go. Field specific practices evolve to fit their needs.

  • Jonathan Badger says:

    Except are they really "field specific"? Rather than simply that some fields are more forward thinking and less fuzzy and should be role models for the others?

  • dho says:

    If you are a funding agency, and you are paying a lab to do science, doesn't it maximize your return on investment if that lab is required to make the data that they generate available to the community of other labs they are also paying to do science?

    I agree with much of what you say: curating data is really hard, it is not equally attractive to everyone (and it is much easier to contemplate sinking time into it when a lab is doing OK), some fields lend themselves more to making data available than others, and within groups that aspire to this (mine does), not all projects achieve the same level of curation.

    I'll add that most datasets aren't nearly as interesting to anyone else as they are to the group that generates them, and it is delusional to think that many (any?) people have the time or inclination to spend re-running your analyses to check the impact of tweaking parameter X in tool Y.

    But it comes back to accountability. If I don't pay for my science out-of-pocket, I'm accountable to those who do. Though this is not the prevailing mindset among academic scientists, in my experience it is what people outside academia think we do already.

    And there is a selfish benefit - knowing that data needs to be carefully curated and that it could be exposed to the sunlight if anyone cares to look at it tends to improve attention to detail at all stages throughout the data generation and analysis lifecycle.

  • qaz says:

    The truth is that most people I know would be happy to have their data shared. Theoretically, a paper should have all of the necessary information to replicate the experiment. In practice, of course, that's impossible because we don't always know what aspects of the question are important. Does time of day matter? Does phase of the moon? Does gender of the experimenter? Does the current political landscape? Does the humidity of the environment? (The first, third, and fifth have all been found to surprisingly matter in neurophysiology experiments. The second and fourth don't seem to.) Sharing data sets would be great. But the world is not so simple.

    There are three issues that need to be addressed before we can get to data sharing.

    (1) Giving people credit for the work done to collect the data. This is easy. We need a mechanism to cite the data collection. It can create a citation count just like any other.

    (2) The career trajectory of the experimental researcher collecting the data. One data set does not equal one publication (we do not write lab reports). Instead, many laboratories, particularly with complex data (like neurophysiology) mine their data for several years to get several publications. How do we protect the career trajectory of this process? That's not so easy and not so clear.

    (3) The burden of preparing the data for sharing. The data that are shared currently are data that are easily prepared with standard formats (gene sequences, crystal structures). Preparing a neurophysiology data set for sharing would take a graduate student approximately 3 to 6 months to convert our internal lab formats into standard formats from other sites and to add the meta-data and other descriptions to make the data usable by people who are not privy to our internal lab structure. This is the big problem. I don't think people who work on gene sequences appreciate the data sharing burden some other data sets create.

    My personal solution is to collaborate with anyone who wants to work on re- or further- analyzing our data so that we can help them work through problem 3.

    I would also point out that it is not clear to me what the real advantage of data sharing is. It would allow people to do checks for trivial mistakes (like the economics case where they screwed up their Excel spreadsheet) but it wouldn't catch experimental mistakes (like the cold fusion mistake where they didn't stir the water). I suppose it's cheaper because you don't have to re-do an expensive experiment, but given the whole "we need to do more replication", it seems to me that re-doing experiments is important. It would allow computational people to "leech" off of experimental people, but a true collaboration seems like it would be a much better process.

  • PKC says:

    I work on computational modeling of single-cell data. Another lab, working in the same realm, published a really cool paper with an analysis method I wanted to replicate. So, I went online, downloaded the whole dataset and analysis code, and ran it myself. This was incredibly helpful to me - realistically, there's probably about an 80% chance I would have been able to replicate the analysis based on the published paper, but it would have taken 10 times longer. Had I not had access to the code and raw data, I probably wouldn't try to use this method, but since they generously provided it, it makes my life easier and hopefully my work better.

    To me, there seems to be a spectrum that data sits on. Two datasets, of equal interest and impact, can be heavily analyzed (think raw image data) or almost completely non-analyzed (westerns). In the former case, it seems obvious that sharing the raw data and analysis is desirable both for reproducibility and the edification of the broader community. In the latter case, "raw" data may not be incredibly useful to publish, but it's also very easy to publish - so, why not? I'm sure there is a middle ground - where it is costly and time-consuming to prepare raw data for sharing, and the added value is questionable. However, it also strikes me that in some cases, if labs made it a goal to publish their raw data, they would quickly adopt better data organization methods that would also benefit the lab.

  • drugmonkey says:

    Preach, qaz.

  • drugmonkey says:

    The “catching mistakes” is a red herring. Mostly used as cover for one of two real agendas. Data leeching. Or trying to question or deny the authors’ interpretations. This latter has both positives and negatives.

  • drugmonkey says:

    dho, PKC:

    Nobody is saying people *can't* choose to do open science if it works for them. Not even questioning the value; in fact, I endorse that. The question is whether everyone should be forced to do all this. Who pays the costs, how many pay those costs and who reaps the (rare?) benefits?

    Suppose 100 PIs adopt these costly procedures for decades and only *1* dataset from 1 lab produces something via Open Access? Is it still a good general mandate?

  • drugmonkey says:

    Also: the thing I have most often been interested in hearing about from peers is not the raw data going into a pub. I want to hear about data that they didn’t think was publishable. The weird stuff. The negatives. The failed experiments. The half-start lab lore shit. This would require something even more costly and burdensome, basically pre-registration of every single experimental move and OpenLabBook science.

  • Odyssey says:

    In my 20+ year faculty career I've been asked for raw data, I think, twice. I provided it happily. But twice in twenty years? If that's the level of interest in my raw data, why would I spend all the time and effort required to curate and deposit it in some public database?

    My work has a healthy citation rate, so the lack of interest in my raw data isn't for lack of readership. I'm pretty sure this is yet another field-specific thing.

  • Jonathan Badger says:

    Rather than talking about nonsensical "data leeching", we should be talking about "data hoarding". Data you generate isn't "yours" if it is funded from a public source -- it belongs to everyone. Indeed, many funding agencies already require sequence data to be publicly available after a period of time -- whether or not it has been published. This simply needs to be extended by the funding agencies to all data. People who don't like sharing are free to work in industry where secrecy and paranoia are considered positives.

  • AcademicLurker says:

    It would be interesting to look at the development of the Protein Data Bank and the eventual near-universal adoption of the requirement that structures be deposited in the PDB as a condition of publication.

    That discussion was just being wrapped up when I was starting graduate school. In that case, I believe that there was very broad community buy-in among crystallographers and NMR folks.

  • Steven Morris says:

    We've actually had a lot of success in finding new insights from re-analysis of old datasets (bisulfite sequencing and methylation arrays). If you're trying to develop a new method for analyzing data, it's important to have access to a lot of previously published data to use as a basis for testing and comparison. IMO "data leeching" is a very unhelpful and regressive neologism in the world of publicly-funded science, where presumably most of us work.

  • drugmonkey says:

    JB- you are stating an aspirational opinion that clearly flies in the face of extant facts. Try to be clearer about your “shoulds”. Or, show where all data generated by public funds follows this supposed principle of yours.

    SM- how many times have you asked for data? What percentage refused?

  • PKC says:

    As I've thought more about this, I may be changing my mind. I think what it really is, is that it feels easier to hit 'download' than it does to reach out and email someone. In my previous anecdote, I definitely would have felt some combination of intimidation and awkwardness at reaching out to a random PI. However, that doesn't mean I necessarily SHOULD have felt that way.

    Furthermore, when we transition from an email request to a click of a button, what do we lose? It seems likely that the email or conversation might be more fruitful than a click, by providing useful context for the data, or information on analyses that didn't work. That can't happen when data is just uploaded. I still think open science is a good thing, but the reality of how often data is actually shared makes me rethink some things.

  • --bill says:

    The most recent volume of Osiris deals with data; this is Vol. 32, No. 1.
    W. Patrick McCray's article "The Biggest Data of All: Making and Sharing a Digital Universe" is a brief history of how standardization of data formats and analytic software tools occurred in astronomy.
    It speaks to a lot of the issues raised in these comments.

  • Jonathan Badger says:

    @DM -- It's unclear what you disagree with. Do you doubt that funding agencies require release of sequence data? I can find the actual text if you want. But the point is *why* do funding agencies have this rule? Isn't it obvious it's because the public paid for this data? And while there may not be such rules for other sorts of data as yet, the exact same justification would apply at a logical and ethical level.

  • drugmonkey says:

    The zealots do not say “sequence data” or otherwise restrict their claims at all. Your assertion about public was also not limited and would therefore apply to lots of stuff including military activity. It obviously doesn’t, so that appeal is null and void. “Logical and ethical” is just your attempt to cover your personal preference with some sort of non-subjective fig leaf.

  • Dnaman says:

    Hasn't it been NIH policy for 15 years that proposals should contain a data sharing plan?

    https://grants.nih.gov/grants/policy/data_sharing/data_sharing_guidance.htm

    There are a few cases where you can't share final data, for privacy reasons. But in most cases, you should.

    DM, what do you write in your Data Sharing Plan? "This section is not applicable to me. It is only for zealots and a tiny minority of scientists."

  • Pipsqueak says:

    My question was at DM: Do you feel the same way about data processing software? If a group develops some critical analytical technique, and uses it in publications, should they be obliged to share it?

  • Jonathan Badger says:

    Or, alternatively, you are getting angry because you realize deep down that data sharing is the only right and ethical position to take and feel ashamed now that people are beginning to call you and other data hoarders out. Trying to change the subject to classified military research (which in itself has huge ethical problems) is just dumb.

  • Draino says:

    As soon as they ask for all my raw data, I will point them to the decade of mouse genotyping records. All the photos of all the gels of all the PCRs of all the mice. All the columns of all the +/+ and +/- and -/- for all the mice. Have at it. Bask in the glory of my technicians.

  • Neuro-conservative says:

    I think data sharing has considerable merit, but the Open Science warriors would have more credibility if they at least
    1) acknowledged the costs (to data-generating scientists), and
    2) acknowledged that they (OS warriors) just so happen to directly receive personal benefit from such a regime.

  • drugmonkey says:

    Dnaman- In essence. There are rules about what requires a sharing plan and if it’s not required I say so. So do most of the grants I’ve reviewed lately. They stick to the actual policy instead of adhering to InternetOpenWaccaloon invented policy.

  • My favorite ones go one step further - it's not just "my way is the only right way to do things", it's "No, no one actually does it that other way, and furthermore they didn't really tell you that they do it that way, you're just imagining things." Unfortunately, that's a thing too.

  • mathlete says:

    I generate (as much as) a few hundred GB of data per paper. Who the f... is going to store all that data for the 2-5 people* who are even going to bother to download it? If someone asks for something, I happily make it available to them. But I also like to know who is interested, and the resulting dialog has led to a collaboration or two.

    On a related note, I do get pissed when papers announce a new computational method yet don't provide the scripts/code to implement it. What's the point then?

  • Jonathan Badger says:

    Exactly. No point. That's how I feel about papers that don't provide the data too. BTW this isn't the 1990s. "hundreds of gigabytes" isn't a large amount of data anymore. You could even store it on your cellphone if you have an SD card. It would be less than one percent of the storage of even a cheap laptop these days. People shouldn't have to "ask" you for your data. Properly done, scripts would gather terabytes of data from multiple sources and analyze it.
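
    As a concrete sketch of what "scripts gathering data" looks like -- purely illustrative, using NCBI's public efetch endpoint and an arbitrary accession, not anyone's actual pipeline:

        import urllib.request

        # Fetch a published genome sequence straight from a public
        # repository -- no email to the data generator required.
        url = ("https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
               "?db=nuccore&id=NC_000913.3&rettype=fasta&retmode=text")
        with urllib.request.urlopen(url) as response:
            fasta = response.read().decode()
        print(fasta.splitlines()[0])  # header line of the E. coli K-12 record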

  • drugmonkey says:

    My favorite ones go one step further - it's not just "my way is the only right way to do things", it's "No, no one actually does it that other way, and furthermore they didn't really tell you that they do it that way, you're just imagining things." Unfortunately, that's a thing too.

    Yes. Absolutely mindbogglingly self-involved intentionally blinkered attitudes.

  • mathlete says:

    BTW this isn't the 1990s. "hundreds of gigabytes" isn't a large amount of data anymore. You could even store it on your cellphone if you have an SD card. It would be less than one percent of the storage of even a cheap laptop these days. People shouldn't have to "ask" you for your data. Properly done, scripts would gather terabytes of data from multiple sources and analyze it.

    Sure, you could buy a giant SD card to hold it. With regards to a laptop though, I think you miscalculated; it would take up all of my current 256 GB SSD for one paper's worth of data (just ballparking here).

    But more importantly, do the journals want to deal with this, both storage and bandwidth? Or am I or my university expected to maintain a repository? I have ~40 TB of local disk space - should I just chmod a+r that and give a login to everyone in the world?

    How about YOU build the system that makes all of this trivially easy and then we can argue about whether participation should be mandatory?

  • Jonathan Badger says:

    Fair enough if you really have that small amount of storage on your laptop, but a 256GB SD is literally what I have in my phone for my music collection. You can get one for $85 these days.

    As for data sharing, I don't need to create a system for doing that -- it's kind of a solved problem. Even in the 1990s institutions maintained ftp servers for data sharing. These days people more commonly use web-based systems like Globus that share particular directories on a server so that they don't need to be copied to an ftp site. I'm sure your institution has some system already in place whether you know of it or not -- besides sequence jockeys, physicists and astronomers have needed to share large files for decades.

  • thorazine says:

    "Fair enough if you really have that small amount of storage on your laptop, but a 256GB SD is literally what I have in my phone for my music collection. You can get one for $85 these days."

    Sure. But what you actually said before was, " 'hundreds of gigabytes' isn't a large amount of data anymore. You could even store it on your cellphone if you have an SD card. It would be less than one percent of the storage of even a cheap laptop these days." This is literally untrue: for 250 GB to be less than one percent of a laptop's storage, the laptop would need more than 25 TB of it, and the maximum manufacturer-spec MacBook Pro comes with a 4TB SSD. It is not a cheap laptop; yes, it's probably more expensive than a roughly-comparable laptop from your favorite vendor, but even if the price is cut in half, _that's still not a cheap laptop_.

    But the larger point is this: it's not just me, and my shit laptop (or mathlete's shit laptop, for that matter), that's the problem. Advancing technology means that the sheer quantity of data coming out - genome sequence, gene expression, imaging, etc. etc. ad nauseam - requires, at this point, giant server farms and serious bandwidth to share remarkable amounts of data _that almost no one wants to see_. This is already happening. It's using up not-insignificant resources. There's a trade-off here between access and expense, and it's not clear that the appropriate equilibrium point is the one at which absolutely everything is shared.

    As for the assertion that this is a solved problem - well, yes, partly. We have institutional repositories; we have databases like GEO and ArrayExpress and so on. But these don't work well for more traditional forms of data, where the metadata is often hard to standardize and voluminous. This is not an insoluble problem, but it's one where the return on investment in solving it is arguably low, and the return on investment in actually storing all those data is definitely low - since, really, nobody wants to see your original electrophysiology traces, or all the unsuccessful in situ hybridizations, or Draino's mouse genotyping PCRs. And this doesn't even get into the problem that a lot of the data that's already publicly available is really shit, which is a problem because it's an excuse not to do the experiment, or not to fund the experiment, because somebody already did it, even if they did it wrong.

  • dho says:

    Remarkable amounts of data also create opportunities for reanalysis that add value to the original dataset — answering questions not conceived by the original authors. I did this myself a few years ago. I pulled down a public dataset shared with a publication, looked at something in the data that was utterly useless from the perspective of the original study, and ended up telling a second story. Call this “data leeching” if you want, but I don’t see it that way. In part because just like in everything else, it is possible, and heck, even easy, to treat others the way you’d want to be treated yourself. In my case, I emailed the authors once I had a hint of a signal and invited them to participate in my reanalysis if they were interested. They ended up as co-authors on the resulting paper.

    I’m not only a data consumer, but also a data generator. Data that I’ve freely shared before and after publication has helped other groups. Some have contacted me about it and collaborated, others not so much. But in any case, it’s not my data. It’s data that I’m lucky enough to have the privilege to collect on behalf of the taxpayers who fund the work.

    Finally, it's a red herring to say that you don't have enough storage or bandwidth to share data. Unless you are doing something truly astonishingly useful to huge numbers of people, the pool of people interested in your data is going to be small. Sharing 250GB of data, whether you consider it a lot or a little, is relatively trivial over any reasonable university network. As noted above, if your university has a physics department they are already moving way more data than this frequently. And 250GB *is* 5% of a 5TB external USB hard drive, which can be purchased for under $150. There are good reasons why people might not want to share data, but unless you have petabytes of it, infrastructure isn't one of them.

  • drugmonkey says:

    but it's one where the return on investment in solving it is arguably low, and the return on investment in actually storing all those data is definitely low - since, really, nobody wants to see your original electrophysiology traces, or all the unsuccessful in situ hybridizations, or Draino's mouse genotyping PCRs.

    Critical point. And it's not just paying for the storage and bandwidth. It's the data curation that really costs. And all for what? Universal effort on data curation and standardization of everything you do, just for the exceptionally rare cases where someone wants to see it? When most of this could be taken care of with the usual order of business - requests and collaboration - on an efficient, as-needed basis?

    it's an excuse not to do the experiment, or not to fund the experiment, because somebody already did it, even if they did it wrong.

    This gets to the garbage-in, garbage-out aspect of arguing that OpenData is going to help with the alleged replication crisis. It is far better for other labs to repeat and extend our findings. This way we can get a better handle on the generalization issue which is the real issue at hand. Re-analysis may catch math errors, I guess, but damn little else. At best re-analysis is there for the critic to start arguing that choices about the type of analysis conducted are wrong and if you do this analysis the other way it either questions an interpretation of a positive result or creates a new positive from something the original authors thought was a negative. I don't find those sorts of discussions all that interesting, personally.

    In my case, I emailed the authors once I had a hint of a signal and invited them to participate in my reanalysis if they were interested. They ended up as co-authors on the resulting paper.

    The data generators got credit and participated. This is not "leeching". This is collaborating. I have no problem with this, done voluntarily, whatsoever.

    Some have contacted me about it and collaborated, others not so much. But in any case, it’s not my data.

    You are welcome to operate how you like. But I don't agree with you that the data generators do not have an interest in the data that are generated with taxpayer funds. This is a suggestion that is out of step with ongoing scientific practice. Even if there are some areas in which data deposition is indeed mandated, it is far from universal to the extent the zealots are demanding.

  • qaz says:

    JB - the idea that you think that ftp (or web hosting) solves the problem of data sharing means that you really don't understand the problem.

    The problem is meta-data and data formats. I can put my data up on a webserver, no problem, but there is no way that anyone outside of my lab can understand it without a way to read our in-house data formats. Moreover, even if I put up the basic saving/loading engines for it, the highest likelihood is that people are going to misunderstand it. I would have to add lots of meta-data.

    As I mentioned in my comments earlier (which I notice you never responded to), the *real* problem is the burden of putting up complex data in a format that can be understood and processed. I don't think you understand how individualized neurophysiological and behavioral data is. Because the data is complex, there is no simple standardized format to report it (like a gene sequence would be). Because the collection process is small (as compared to a billion dollar astronomy telescope or even a multimillion dollar fMRI machine), there are a thousand different formats. Essentially, each lab is unique in its data format.
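
    To give a flavor of what "adding the meta-data" means, here is a minimal sketch (Python, with invented field names -- a real session would need far more than this, plus definitions of every event code and coordinate frame):

        import json

        # Invented, minimal description of ONE recording session.
        # Multiply by hundreds of sessions, each with its own quirks.
        session_metadata = {
            "subject": {"species": "rat", "id": "R042", "sex": "M"},
            "task": "T-maze foraging, 40 trials",
            "recording": {
                "device": "24-channel tetrode array",
                "region": "dorsal CA1",
                "sampling_rate_hz": 30000,
            },
            "events_file": "R042_events.dat",  # in-house format; needs its own parser
            "notes": "light cycle reversed; experimenter was female",
        }

        with open("R042_session.json", "w") as f:
            json.dump(session_metadata, f, indent=2)

    And that is before anyone touches the raw voltage traces themselves.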

    Taking a single data set and making it accessible and understandable to someone outside my lab would take an individual graduate student 3-6 months. That's a very expensive burden for a very small return.

    As DM says, it's the curation that's the problem.

    Moreover, you have not addressed the career path problem. Currently, in behavioral neuroscience (as one example where this is particularly hard), practice is that a lab runs a difficult experiment, and the PI's career is partially dependent on getting several papers out from a data set, particularly in its interaction with other data sets. How are you going to protect tenure and career paths? The culture of behavioral neuroscience does not separate out data-generators and data-analyzers and I think that most behavioral neuroscientists would be offended by the suggestion that we go that direction.

    And finally, while I certainly agree that one can find new things to mine in data sets, that's not the primary reason the zealots are generally bringing up these requirements - it's usually about the (overhyped) "reproducibility crisis". As DM says, it's far better for a new lab to repeat and extend the experiment. As I pointed out, it doesn't get at the real problems underlying reproducibility, which are differences in underlying procedures (like stirring the water).

    I fail to see why emailing the authors for data so that one can collaborate is a problem.

    For the record, what I write in my "data sharing" section is that I am happy to share data requested via email and that I will provide it with appropriate exegesis.

  • dho says:

    @qaz - You're right that curation is the biggest challenge by far. My point on infrastructure was that it isn't the biggest challenge, not by a long shot, and is easily solvable.

    With respect to curation, my experience, which will not generalize to everyone, is that knowing data will end up online leads to better curation closer to the point of data collection. Fewer temporary notes on napkins and scraps of paper. More careful thought about how to organize data. A big problem in my group (and while we do sequencing, we also do a lot of other types of experiments that are non-standardized) is that without careful curation data becomes more and more useless the farther away one gets from its collection. A genuine question -- if it would take 3-6 months for a graduate student in your lab to make the data accessible to someone outside your lab, how does this work when you bring new people into your team? Are others in your group able to follow the work of that grad student 2 years after they graduate and move on from your lab? It's questions like this that make me think that *for my lab* more curation at the time of the experiment when topics are fresh in mind is a worthwhile use of time, but I understand this doesn't apply equally to everyone (maybe I'm not as much of a zealot as some since I quit the twitter at the start of the year so I don't know what others are saying about these things).

    In terms of career path, I think you used the same word I would -- culture. The culture of science and the varied fields within it can, does, and should change over time as new technologies become available. If you put out a dataset as part of a paper that is subsequently used in four other papers, there could be a mechanism to capture that as evidence that you have made a high-impact dataset. I don't care for metrics like the H-index, but it seems like these would have been nearly impossible to calculate 30 years ago, yet today they are trivial. Why couldn't similar metrics for the impact of individual datasets generated by labs be implemented and used to assess impact?
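
    The computation side of such a metric would be trivial -- a sketch, with made-up reuse counts (the hard part is the bookkeeping, not the math):

        def h_index(counts):
            """Largest h such that h items each have >= h citations/reuses."""
            ranked = sorted(counts, reverse=True)
            return sum(1 for rank, c in enumerate(ranked, start=1) if c >= rank)

        # Hypothetical: papers that re-used each of a lab's shared datasets.
        dataset_reuse = {"ephys-2014": 7, "behavior-2016": 3, "imaging-2017": 1}
        print(h_index(dataset_reuse.values()))  # -> 2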

  • Jonathan Badger says:

    I feel a bit like those aliens from 1950s SF movies that visit humanity and say things like "We were not so different once. We once had war and poverty like you Earthlings but we moved beyond such things". Once upon a time people in sequence/structure space didn't understand the importance of data sharing either.

    In the late 1960s Margaret Dayhoff (a pioneering computational biologist who among other things is responsible for the one-letter amino acid codes we use today) created her "Atlas of Protein Sequence and Structure". People didn't get it. They couldn't see why gathering all the sequence and structure data in one place was useful and even accused her of being a leech. Eventually she lost funding for the project and over a decade later came the beginnings of GenBank as we know it (originally at Los Alamos -- NIH types *still* didn't get it, although physicists did). The point is all the arguments you are making about the feasibility and desirability of large scale data sharing have been made long since -- and been refuted.

  • Odyssey says:

    The point is all the arguments you are making about the feasibility and desirability of large scale data sharing have been made long since -- and been refuted.

    No, they haven't. They really, really haven't. Why do you keep avoiding addressing the issues of curation cost and desirability of many data types?

  • qaz says:

    dho - On the surface, I agree with both of your points, but there's complexity underneath that (I think) is important to work through.

    Re: the second point, sure, if you could change the culture. But changing culture is hard (look at the failure to change GlamourMagJournaling even when every single scientist agrees it's a disaster), particularly when not everyone agrees that the current data sharing practice is a problem. On the list of things I think need to be fixed culturally, data sharing doesn't make the top ten.

    Another important question is whether this would turn some people into data-generators and others into data-users. I'm not sure that's a good thing.

    Re: the first point. My lab contains several continuities of knowledge, including multiple long-term technicians, a perma-doc, and (not least) me (!), all of whom have deep knowledge of the procedures and what specific terms mean on the napkins and notecards. Furthermore, graduate students tend to stick around for 5 years, and postdocs for 4, so we tend to have good overlap on a project from 2 years before. And, even more so, graduateD students and postdocs are available by email or phone. And if a new student is going to data mine an old experiment (something we actually do a lot of), I work directly with that student to ensure that their understanding and knowledge of the data is correct.

    That being said, yes, Quality Management is an important thing. It's something we've started teaching our graduate students in our graduate program explicitly in a required experimental methods class. Furthermore, in principle, I agree with you. Quality Management and Data Curation are things that I am *trying* to implement in my lab. But it's actually very hard to force even into a single laboratory. I struggle mightily to get all of my students to use the same codeset and to write their code in the same way and to store their data in the database the same way and to do their experiments identically and ... and ... and... I don't think I do a great job, but I try. We have a program here helping labs learn to do Quality Management and apparently, I'm one of the better labs (which terrifies me). So, yes, I agree completely with you that good data curation would help the lab.

    But what I'm seeing right now is that I'm being asked to implement a 10-year program of reconstruction retroactively and with no increase in budget.

    JB - No, you're like those physicists who come over to neuroscience and say "why isn't this easy? You just assume the cow is spherical." (BTW, the most interesting and important part of the cow is the deviation from sphericity.) What you're telling us is that over the last 60 years, trivial codings like gene sequences finally got data shared. I'm willing to bet that neurophysiology and behavioral neuroscience are data sharing by 2080. 🙂

  • drugmonkey says:

    With respect to curation, my experience, which will not generalize to everyone, is that knowing data will end up online leads to better curation closer to the point of data collection. Fewer temporary notes on napkins and scraps of paper. More careful thought about how to organize data. A big problem in my group (and while we do sequencing, we also do a lot of other types of experiments that are non-standardized) is that without careful curation data becomes more and more useless the farther away one gets from its collection.

    So what? You can curate your data how you like and other people can curate how they like. Why are you using "it works better for me" to insist that everyone else should do it?
