Posted on October 31st, 2014 in Anita Bandrowski, Curation, Data Spotlight, Inside NIF, Interoperability
We have been hearing for some time that when it comes to data, it is all about size. The “bigger is better” mantra has been all over the press, but is it really size that matters?
There are the so-called “Big Data” projects such as the Allen Brain Atlas, which generates data, sans hypothesis, over the whole brain for thousands of genes. This is great because the goal of the project is to generate consistent data and not worry about which disease will or will not be impacted by each data point. That may be a great new paradigm for science, but there are not many projects like this “in the wild”.
Most data being generated in the world of science can be considered small, i.e., it would fit on a personal computer, and there are a LOT of labs out there generating this sort of data. So the question that we addressed in the recent Big Data issue of Nature Neuroscience is whether small data could be organized to become big data. If such a thing is desirable, then what would be the steps to accomplish this lumping?
Here are the principles that we have extracted from working on NIF that we think will really help small data (from Box 2):
Discoverable. Data must be modeled and hosted in a way that they can be discovered through search. Many data, particularly those in dynamic databases, are considered to be part of the ‘hidden web’, that is, they are opaque to search engines such as Google. Authors should make their metadata and data understandable and searchable (for example, use recognized standards when possible, avoid special characters and non-standard abbreviations), ensure the integrity of all links and provide a persistent identifier (for example, a DOI).
Accessible. When discovered, data can be interrogated. Data and related materials should be available through a variety of methods including download and computational access via the Cloud or web services. Access rights to data should be clearly specified, ideally in a machine-readable form.
Intelligible. Data can be read and understood by both human and machine. Sufficient metadata and context description should be provided to facilitate reuse decisions. Standard nomenclature should be used, ideally derived from a community or domain ontology, to make it machine readable.
Assessable. The reliability of data sources can be evaluated. Authors should ensure that repositories and data links contain sufficient provenance information so that a user can verify the source of the data.
Useable. Data can be reused. Authors should ensure that the data are actionable, for example, that they are in a format in which they can be used without conversion or that they can readily be converted. In general, PDF is not a good format for sharing data. Licenses should make data available with as few restrictions as possible for researchers. Data in the laboratory should be managed as if it is meant to be shared; many research libraries now have data-management programs that can help.
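To make the five principles a little more concrete, here is a minimal sketch of how a lab might check its own dataset description before sharing it. Everything in it is an assumption made for illustration: the field names, the grouping of fields under each principle, the `missing_fields` helper and the example record (including the placeholder DOI and URL) do not come from the paper or from NIF.

```python
import json

# Hypothetical field names, grouped by the principle they are meant to support.
REQUIRED_FIELDS = {
    "Discoverable": ["title", "keywords", "identifier"],
    "Accessible": ["download_url", "access_rights"],
    "Intelligible": ["description", "vocabulary"],
    "Assessable": ["provenance"],
    "Useable": ["file_format", "license"],
}


def missing_fields(record):
    """Return {principle: [field names]} for fields that are absent or empty."""
    gaps = {}
    for principle, fields in REQUIRED_FIELDS.items():
        absent = [name for name in fields if not record.get(name)]
        if absent:
            gaps[principle] = absent
    return gaps


# An invented example record; the DOI and URL are placeholders, not real ones.
record = {
    "title": "Example spine density measurements",
    "keywords": ["dendritic spine", "hippocampus"],
    "identifier": "doi:10.0000/example",
    "download_url": "https://example.org/data/spines.csv",
    "access_rights": "open",
    "description": "Spine counts per dendrite segment, one row per cell.",
    "vocabulary": "",  # no community ontology named yet
    "provenance": "Collected in a hypothetical Example Lab, 2014.",
    "file_format": "text/csv",
    "license": "CC0-1.0",
}

print(json.dumps(missing_fields(record), indent=2))
```

For this record the only gap reported is the empty `vocabulary` field under Intelligible. A check like this says nothing about whether the metadata are accurate; it only flags what has been left blank.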