By Anita Bandrowski, Ph.D.
Humanity began writing on stone and clay tablets, then moved to papyrus and paper, and now we write with electrons. Does it seem that our media for storing information are becoming flimsier, or is it better to search through piles of electrons than through card catalogs? How can we save the wonderful work that we are all paying for (in the form of government-funded research)? Do database records hold the same value as published papers? If so, how can we maintain them indefinitely? Should there be a paper version of each database? How can cloud computing and the linked data/open data initiatives help? What is the role of libraries in this sort of data landscape?
In my own experience working on a semantic web project, the Neuroscience Information Framework (NIF) at the University of California, San Diego, I noticed something strange that has happened to our society that bears on these questions. For several months my desk was housed among many others in one of those open workspaces whose explicit goal is to improve communication between individuals (no cubicles). One day there was an interruption in the building's wireless service. The interruption brought the inevitable frustration of "I can't do what I was just doing," but then a tremendous event occurred: the strange entities who had been toiling near me, and whose existence I acknowledged with a nod each morning, became real humans. The amazing awakening resembled an episode of Star Trek in which the Borg, a half-machine, half-biological group fully integrated into the hive mind, suddenly lose connectivity to the hive and bumble around, very confused. People all around me began waking from the technology trance and started to act more like ... people. They greeted me, we exchanged opinions of the wireless service, and we met.
With my Borg experience in mind, questions of our deep dependence on technology crystallized. What if the power went off on Wikipedia? What if Google didn't exist? How would I find things? How would I be able to work without Google Docs? In this networked world, is it possible that we can't survive without the collective?
The level of integration of online information and search systems with our lives has become very eerie, to say the least.
As scientists, do we have the same issues? Can't we do research without PubMed? A few years ago, while at Stanford, a colleague and I were talking to an art historian, and the conclusion of the discussion was: "if it (a scientific paper or a piece of data) does not exist on the web, then it does not exist." This was quite contrary to the experience of the art historian, who apparently still did research in a physical building that contained actual papers, books, and non-digital versions of art.
So, then, who backs up the data on which we are becoming completely dependent? When researchers move to a new university or pass on to the great beyond, what happens to the data stores they maintained? Do they take their data with them, setting up cloud computing operations?
The good news is that scientific data in databases, whether or not they are published on paper, are backed up, and data are regularly checked for integrity at most sites. Data and software tools are also replicated in so-called "mirrors," which are essentially copies of the same data or software tools that serve a particular community. Additionally, the National Library of Medicine copies and stores many of the significant databases in its systems, allowing researchers to access them and storing a digital copy for posterity. For example, the GENSAT project data exist on Rockefeller servers, but a mirror of the data is also set up at NCBI (the electronic arm of the National Library of Medicine and the home of PubMed).
This seems safe enough. However, the directors of the National Institutes of Health are not always as willing to support databases indefinitely as they are to pay researchers to set them up. So after five or ten years, when the funding runs out, what happens to all the data that researchers painstakingly toiled for years to gather? Some data were published on paper; some were likely never published anywhere, or were pulled together from papers by raw human effort, such as the Ki Database, which gathered from many publications the raw numbers for affinity between drugs and receptors. Many databases contain that elusive negative data, which is not considered worthy of publishing by the peer-reviewing crowd, but which may save other researchers tremendous time when they try to replicate an experiment that several others have already found did not work. Some databases migrate to funded projects and are then maintained by other universities while the funding is in flux, but some simply vanish into the ether. Should someone maintain them?
The experience of the private human genome project's "PANTHER" database, started by Craig Venter at Celera, later Applera, later Applied Biosystems, later an unsupported project at SRI International (formerly the Stanford Research Institute), and now potentially rising from the ashes as a new project, shows that industrial data may face a similar, or potentially even more dire, fate.
In recent years, several movements have swept data science. One is the open data movement; another is the linked data movement. Both bear on this issue of data maintenance. The linked data movement (one of the buzzwords in the semantic web community) attempts to link all pieces of related information through formal relationships, somewhat like playing an enormous game of "Six Degrees of Kevin Bacon" with scientific data. Obviously, these data sets must be openly accessible for this to work, so the open data movement spurred the creation of huge datasets readable by anyone in the world. These data sets include some of the most valuable biomedical data, such as OMIM and PubMed, but also include Wikipedia and other less-than-peer-reviewed data. Many people in the open data world talk about their preferred ways of storing that data, such as "tuples" (in practice, subject-predicate-object triples) or graphs, but it all boils down to a few main ideas:
- A piece of data should persist in a reliable way, with a reliable address.
- A piece of data should be in a format that is readable by others.
- A piece of data should have a unique identifier, akin to a social security number.
- A piece of data is not owned by anyone, but should be traceable to its origin.
Therefore, the open data community has a vested interest in making all data available for its systems to consume and compute over, including the databases whose authors, or whose authors' funding, have expired.
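The four ideas above can be sketched in a few lines of code. This is a minimal illustration in plain Python, not a real triple store; the identifiers and facts (the `gene:`, `paper:`, and `person:` records) are hypothetical examples invented for the sketch.

```python
# Each piece of data is a (subject, predicate, object) triple, where the
# subject and object carry stable, unique identifiers -- so every statement
# persists at a reliable "address" and is traceable to its origin record.
triples = [
    ("gene:Aqp4", "studied_in", "paper:PMID-0000001"),       # hypothetical
    ("paper:PMID-0000001", "written_by", "person:Example"),  # hypothetical
    ("gene:Aqp4", "expressed_in", "cell:Astrocyte"),         # hypothetical
]

def match(triples, subject=None, predicate=None, obj=None):
    """Return every triple consistent with the pattern (None = wildcard)."""
    return [
        (s, p, o)
        for (s, p, o) in triples
        if (subject is None or s == subject)
        and (predicate is None or p == predicate)
        and (obj is None or o == obj)
    ]

# Because the format is uniform and readable, anyone's software can ask:
# "everything known about Aqp4, traceable back to its source records."
print(match(triples, subject="gene:Aqp4"))
```

Real linked-data systems use the same pattern-matching idea at scale (RDF triples queried with SPARQL), which is why the community cares so much about data staying online: a triple whose identifier no longer resolves breaks every query that passes through it.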
In the linked data model, to continue the "Six Degrees of Kevin Bacon" analogy, the data graph would suffer if the record of a movie were wiped off the graph. Would we still know that Tom Hanks is connected to Kevin Bacon if Apollo 13 were no longer a data link? Probably, but the link would no longer be direct.
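What "losing a data link" means can be made concrete with a toy graph and a breadth-first search; the graph below is a deliberately tiny, illustrative sketch, not a real filmography database.

```python
from collections import deque

# A toy "Six Degrees" graph: actors link to movies, movies link to actors.
graph = {
    "Kevin Bacon": ["Apollo 13"],
    "Apollo 13": ["Kevin Bacon", "Tom Hanks"],
    "Tom Hanks": ["Apollo 13", "The Polar Express"],
    "The Polar Express": ["Tom Hanks"],
}

def shortest_path(graph, start, goal):
    """Breadth-first search: the shortest chain of links, or None if none."""
    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        if path[-1] == goal:
            return path
        for nxt in graph.get(path[-1], []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None

print(shortest_path(graph, "Kevin Bacon", "Tom Hanks"))
# -> ['Kevin Bacon', 'Apollo 13', 'Tom Hanks']
```

Delete the "Apollo 13" node from this particular graph and the query returns None: in a richly connected graph a lost record merely lengthens paths, but in a sparse one, as with most scientific data, it can sever the connection entirely.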
The problem with linked data disappearing is that the relationship between Aquaporin-4 and Eric Nestler is far less well established than the relationship between Tom Hanks and Kevin Bacon. As it happens, a database of supplementary materials contains this connection (see the Drug Dependent Gene database). Indeed, if data are deposited inside a database but are not central nodes of discourse, they may disappear without a sound. Yet their inherent value may not lie in their connectivity; they may instead be valuable in a direction that few have pursued as a line of investigation, such as a promising lead for a therapeutic agent in a particular disease, or the piece of negative data that will spare another researcher a year of fruitless endeavor.
The six degrees of online data sources
The stance of the Neuroscience Information Framework (NIF), as a member of the semantic web community, is that data should be preserved because they may be useful at a later time. The larger question is: who will pay to preserve the data? What is the role of libraries in an age where books are no longer made of paper but are stores of knowledge with a 'front end' and a 'back end'? Will we have thousands of databases taking up room in library basements somewhere, where they can be accessed like so many other 'collections,' or will projects such as NIF be the keepers of these data because they can integrate searching across data structures? Who will champion data preservation in the digital age?