Posted on September 3rd, 2010 in Anita Bandrowski, Curation, Essays, Force11, General information
Publishing papers is one of the mainstays of scientific discourse. Many of us worry about the impact factor of journals, how many times we have been cited, and how easily our papers can be accessed. The mantra has historically been ‘publish or perish,’ and has recently shifted to ‘get funding or perish,’ but that is another blog post altogether.
In recent years we have had a great deal more data to publish or, at the very least, have been given the opportunity to include more than just a “representative figure.” The growth of databases and “supplemental materials” as repositories for information has been largely responsible for this trend. This growth is a good thing; it has allowed the publishing of enormous studies of microarray datasets that test thousands of genes in a single experiment. Where before researchers could only discuss a small subset of the studied genes, they can now make the entirety of their data freely available, which (theoretically) makes it widely disseminated and discoverable.
The problem with supplemental data is that it is not indexed by PubMed, and even its owners have difficulty finding it (personal communication, Nestler 2010; also see 2009 Nature editorial by Guralnick). Furthermore, there are few standards for formatting these data during publication, resulting in significant heterogeneity in the data formats. For example, when we extracted tables that report microarray data for a project within the Neuroscience Information Framework (NIF) called the Drug Dependent Gene database, some of the tables were in the form of Excel spreadsheets, some were PDF files, and some were simply .jpg or .tiff files. Some of the tables did not have titles, and many did not link back to their original papers, meaning that even if they were indexed by Google or another search engine, there would be no way to determine the actual context of the file.
Recently, the Journal of Neuroscience considered having researchers keep supplemental material on their own websites. All of our experiences at the NIF indicate that this is likely to be disastrous. NIF is in the business of maintaining links to websites, yet even we often lose them because webmasters rename directories in the process of instituting a “sleek, new design,” and don’t leave behind permanent forwarding addresses. The only difference may be the order of the words within a link, but even the most minor changes break links and require a human to ‘fix the link’ by hand. For a few links this is not a difficult task, but for thousands? It gets a little difficult. Sweeping changes in directory organization are something that we have long advocated against, but when it came to redesigning the NIF pages, we committed the cardinal sin of changing our directory structure and have had to go back to create redirects to all of the new pages so that people linking to us would land in the appropriate location.
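To make the idea of a “permanent forwarding address” concrete, here is a minimal sketch in Python of a redirect table that maps renamed directories to their new locations. It is only an illustration under stated assumptions, not a description of how the NIF site is actually configured: the paths, the `example.org` host, and the names `REDIRECTS` and `resolve` are all hypothetical.

```python
from urllib.parse import urlsplit

# Hypothetical table of renamed directories: old path prefix -> new path prefix.
# A real site would keep one entry per directory moved during a redesign.
REDIRECTS = {
    "/nif/resources/": "/resources/nif/",
    "/old-news/": "/news/archive/",
}


def resolve(url, redirects=REDIRECTS):
    """Return the forwarding address for url, or url unchanged if nothing moved."""
    parts = urlsplit(url)
    for old_prefix, new_prefix in redirects.items():
        if parts.path.startswith(old_prefix):
            # Swap only the renamed prefix; keep the rest of the path,
            # the query string, and the fragment as they were.
            new_path = new_prefix + parts.path[len(old_prefix):]
            return parts._replace(path=new_path).geturl()
    return url


if __name__ == "__main__":
    print(resolve("http://example.org/nif/resources/ddg.html"))
```

Keeping such a table next to the site means a redesign adds a few entries in one place, rather than forcing everyone who links to the site to repair thousands of links by hand.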
At the NIF, we realize the tremendous value of supplementary data, both in increasing the impact of a particular article and as a tool to reduce unnecessary duplication of experiments, and would like to share a few thoughts on the subject:
1. We advocate the submission of data to the appropriate databases (Gemma, GEO or DDG for microarray data, CCDB for microscopic images, etc.). If you have a set of data that does not fit any of these, please ask us where it could go at our forum or at our email list.
2. We advocate the creation of one or more data warehouses that would permanently store the data that don’t fit into an existing database. Should libraries take these data sets and store them for us the way that they handle books?
3. We strongly advocate the use of a standard vocabulary to normalize the data submitted to any such database. Doing so would allow data to be integrated into the extensively developed semantic matrix of NIF, allowing it to be discovered more easily by the researchers using our portal, but also by automated agents deployed for text mining purposes. It is yet to be determined whether text mining technologies will produce useful answers to the real scientific questions, but making our data available to them in a format that is easy to digest seems a reasonable step.
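As a rough illustration of what normalizing to a standard vocabulary means in practice, the sketch below maps free-text labels from a submitted table onto one canonical term each and reports the labels it cannot place. It is a toy example under stated assumptions: the three-entry `VOCABULARY` table and the function names are hypothetical, and a real vocabulary would come from a curated source rather than a hand-written dictionary.

```python
# Hypothetical synonym table: lowercase label -> canonical term.
VOCABULARY = {
    "nucleus accumbens": "nucleus accumbens",
    "accumbens nucleus": "nucleus accumbens",
    "nac": "nucleus accumbens",
}


def normalize_term(raw, vocabulary=VOCABULARY):
    """Return the canonical term for a raw label, or None if it is unknown."""
    # Lowercase and collapse whitespace so "  NAc " and "nac" look the same.
    key = " ".join(raw.lower().split())
    return vocabulary.get(key)


def normalize_column(values, vocabulary=VOCABULARY):
    """Split a column of labels into normalized terms and unmapped leftovers."""
    mapped, unmapped = [], []
    for value in values:
        term = normalize_term(value, vocabulary)
        if term is None:
            unmapped.append(value)  # needs a curator's attention
        else:
            mapped.append(term)
    return mapped, unmapped


if __name__ == "__main__":
    print(normalize_column(["NAc", "Accumbens  nucleus", "striatum"]))
```

The unmapped list matters as much as the mapped one: it is the queue of labels that a human curator still has to place before the data can be integrated.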
NIF currently provides search tools across a wide variety of data sets and could provide an efficient means of searching the landscape of misfit data sets, as well as a way to link back to the primary articles, widening the impact of research groups and authors. Moreover, including hard-won supplementary data will enrich the rapidly growing knowledgebase within NIF, a semantic knowledgebase that can be discovered, explored and accessed by any student, scientist or interested researcher via the internet.
We are currently developing submission protocols and web-based tools that will automate and ease the process of submitting microarray and immunohistochemistry data to the DrugDependentGeneDB. In the meantime, we would love to hear your thoughts or ideas regarding this or any NIF matter. Please feel free to leave a comment, become our friend, or send an email.