Posted on June 27th, 2011 in Essays, Force11, Maryann Martone, News & Events
It is likely that many of the members of SFN have read a bit of text when applying for NSF or NIH grants that states “thou shalt share thine data”. Actually the statements are here:
NSF Data Sharing Policy: NSF’s “Dissemination and Sharing of Research Results” states that “Proposals submitted or due on or after January 18, 2011, must include a supplementary document of no more than two pages labeled ‘Data Management Plan.’” To get you started, please see below NIF’s “Data Management Plan” text template for NSF grants.
NIH Data Sharing Policy: According to NIH’s “Final NIH Statement on Sharing Research Data,” NIH’s policy is that “all investigator-initiated applications with direct costs greater than $500,000 in any single year will be expected to address data sharing in their application.”
So does this mean that SFN members are running to various databases to turn over all of their negative results? Are they clamoring to post every file that their undergraduates produced to the World Wide Web? Have they accosted their university webmaster recently and loaded him/her with various disks? Well, not that we can see over in bioinformatics (see Nature special data issue).
So why would people want to share their data?
1. The scientific enterprise of various laboratories is paid for by the taxpayers and, as such, belongs to them. Yes, there is a lot of small print that needs to go at the end of this sentence having to do with timing of publication, human subjects etc., but the principle is sound.
2. The sharing of data is becoming possible, compared to a few years ago when my hard drive had a total capacity of 20 MB. I think that this sort of storage space is now given away with each gumball purchase.
3. We all want to be immortal, and if the papers that we have published have not given us that immortality, perhaps our data will. In all reality, data are probably as important to the scientific enterprise as journal articles before 1975. We probably will not find any use for them in the next few years, but throwing them away should be rather distasteful.
4. Data, especially negative data, can be much more informative than the ‘representative data’ published in Figure 1. What I mean is that in a recent conversation with an NIH official, I discovered to my dismay that a clinical trial was recently aborted (aka tens of millions of dollars wasted) even though the preliminary animal data looked quite promising. The reason given was that most of the negative data were discarded into waste bins and never published. This is a very common practice. I have picked through stacks of data myself to find just the perfect “representative figure,” but this practice seems to be doing a disservice to science, and it may be time to change course.
5. Because we say so (here “we” means the informatics and computational neuroscientists)! While 1% of data may be published in some journal (which makes available an image file after you sign over your first-born to them), the rest of it could be used to find out some interesting answers that require aggregating data from many labs. There are lots of people that are computationally inclined (it may be contagious, don’t get too close to your grad students), and they need data. Actually, they need to have well-curated, meaningful data that they can use to feed their algorithms. The hope is that they will build interesting models that shed new light on brain function, but they can’t do it if you toss your intellectual output into the waste basket.
Why don’t people share?
1. It is difficult: While a data sharing plan is required, neither the NSF nor the NIH actually tells people how to share their data. Furthermore, there are projects and databases that can take certain kinds of data, but not others, and there are more than 1800 of these at our last count. Some journals also require submissions to certain databases before publishing, but not all do.
2. It is costly: If a particular data set has no home in a public database, then the cost of creating a database, systems support, a webmaster, proprietary tools etc. can eat up the bulk of most R01s.
3. You get no credit! Scientists who share their data seem to be treated no differently than those that hoard their data. “Data sharers” receive no additional publications, no extra points toward their tenure review, although they do seem to be cited more often (PLOS article). Also, the NIH seems not to have any mechanism in place, at the moment, to check whether the data were indeed shared.
4. “It’s mine, my precious!!!” (the creature known as Gollum). Need I say more?
So here are some things that we in bioinformatics can do to help:
We bio-informaticians (aka those of us with databases) can all make an effort to structure database submission such that it is transparent, easy to reference in publications, and relatively fast. We need to make sure that the researcher is not burdened with a 400-item minimal information checklist and a deep ontological modeling of their data! Biologists mostly just need to submit an Excel sheet with some numbers in it and some identifying information, i.e., if they tell you that their subject is a B6, then don’t ask if it’s a mouse!
We made a cheat-sheet! At the Neuroscience Information Framework (NIF), we have compiled a few of the data sharing statements and put them up on the web, so they can be shamelessly copied. Of course our motive is that we would like to get the data at the time of publication, as our algorithms and our continued support from NIH seem to depend on it.
We should also be able to make it easy to find a good database to submit data to. The NIF project was built to account for all the things that NIH funded in informatics (how many databases are there for RNAi? Answer), but has grown to include a deep search of more than 80 databases and has a catalog of over 3800 projects including software tools, atlases and core facilities. This means that we are in a good position to know which projects could take your data, and if none will, then we will put the data up as we already have the infrastructure and computer geeks required. Furthermore, we have some neuro-biologists that can help translate neuro-English to computer-geek-ease!
There are many other issues in data publishing, and I hope that this post will help to stimulate that discussion.
NIF Project Lead