Big Data vs Small Data: Is it really about size?

04:10pm October 31, 2014
Anita Bandrowski

Curation Interoperability Data Spotlight Inside NIF

We have been hearing for some time that when it comes to data, it is all about size. The bigger is better mantra has been all over the press, but is it really size that matters?

There are the so called "Big Data" projects such as the Allen Brain Atlas, which generates data, sans hypothesis, over the whole brain for thousands of genes. This is great because the goal of the project is to generate consistent data and not worry about which disease will or will not be impacted by each data point. That may be a great new paradigm for science, but there are not many projects like this "in the wild".

Most data is being generated in the world of science can be considered small, i.e., would fit on a personal computer, and there are a LOT of labs out there generating this sort of data. So the question that we addressed in the recent the Big Data issue of Nature Neuroscience, is whether small data could organize to become big data? If such a thing is desirable, then what would be the steps to accomplish this lumping?

Here are the principles that we have extracted from working on NIF that we think will really help small data (from Box 2):

Discoverable. Data must be modeled and hosted in a way that they can be discovered through search. Many data, particularly those in dynamic databases, are considered to be part of the 'hidden web', that is, they are opaque to search engines such as Google. Authors should make their metadata and data understandable and searchable, (for example, use recognized standards when possible, avoid special characters and non-standard abbreviations), ensure the integrity of all links and provide a persistent identifier (for example, a DOI).

Accessible. When discovered, data can be interrogated. Data and related materials should be available through a variety of methods including download and computational access via the Cloud or web services. Access rights to data should be clearly specified, ideally in a machine-readable form.

Intelligible. Data can be read and understood by both human and machine. Sufficient metadata and context description should be provided to facilitate reuse decisions. Standard nomenclature should be used, ideally derived from a community or domain ontology, to make it machine readable.

Assessable. The reliability of data sources can be evaluated. Authors should ensure that repositories and data links contain sufficient provenance information so that a user can verify the source of the data.

Useable. Data can be reused. Authors should ensure that the data are actionable, for example, that they are in a format in which they can be used without conversion or that they can readily be converted. In general, PDF is not a good format for sharing data. Licenses should make data available with as few restrictions as possible for researchers. Data in the laboratory should be managed as if it is meant to be shared; many research libraries now have data-management programs that can help.

Search

Recent Blog

: Recommendations for repositories and scientific gateways from a neuroscience perspective

: NEURON course at the CNS meeting

: Neuron Course this year has just been announced!

About

Community Resources

More Resources

Literature

Big Data vs Small Data: Is it really about size?

Search

Recent Blog

Blog Tags

About

Recent News Entries

Contact Us

SciCrunch

Log in

Leaving Community

About

Community Resources

More Resources

Literature

Log in

Big Data vs Small Data: Is it really about size?

Big Data vs Small Data: Is it really about size?

Search

Recent Blog

Blog Tags

About

Recent News Entries

Contact Us

SciCrunch