Archive for the ‘Curation’ Category

Eating my own Dog Food!

Posted on July 4th, 2015 in Anita Bandrowski, Curation, Interoperability, News & Events | No Comments »

While not all of you have been fortunate enough to attend the first Beyond the PDF meeting, I will say this: it was eye opening for this scientist. To me, the most memorable statement from the meeting was when Geoffrey Bilder argued from the back of the room that we should all Eat Our Own Dog Food! What he meant was that anyone building tools should actually use them, and anyone proclaiming broad “thou shalt-s” should live up to the proclamation himself.

Easier said than done, Geoffrey!

In the years since this historic meeting, these statements have been eating away at my psyche.

I lead the Resource Identification Initiative, a project to add unique identifiers to all papers that use: antibodies, model organisms, software tools or databases. Basically I am telling authors to do “my bidding” and make their papers better to search and give academic credit to developers of software tools like R or ImageJ. I am asking these authors to help others selflessly and do something different than they have done before.

When submitting a paper to Frontiers in NeuroInformatics, as a middle author at the very beginning of the RII project, I felt very reluctant to add RRIDs to the paper. Who was I to suggest such a thing? I waited for the editor to remind us to add the identifiers; I waited and no question came. Before final submission, I overcame my very uncharacteristic muteness and asked my collaborators to add the RRIDs to a table where I felt they were appropriate. It turned out that my colleagues did not object, and the journal editor also didn’t say anything about including them. His journal was not yet on board, something that has been remedied since.

Why did I feel so strongly that I should not include an identifier for tools while telling others to do it?
What was I afraid of?
Change is hard!

I am really not sure now what I was so afraid of, because after overcoming this initial scientific recalcitrance I simply put RRIDs in the next paper without a second thought and have continued to put them in ever since.

So as I was drafting this blog, a colleague asked me to contribute to a table in her paper, I will be one of those middle authors (huge paper with tons of authors), but this time as with my own papers I have asked her to include the RRIDs without being afraid; it took me about 8 minutes to pull all relevant RRIDs from and the paper was just submitted. I do not care if the journal is participating in the initiative officially or not.

I guess that what I have learned from all of this is that once you accept change it becomes the new normal, and RRIDs are a great new normal. Thanks Geoffrey for nagging me; I am very glad to say that I have Eaten My Own Dog Food!

Big Data vs Small Data: Is it really about size?

Posted on October 31st, 2014 in Anita Bandrowski, Curation, Data Spotlight, Inside NIF, Interoperability | No Comments »

We have been hearing for some time that when it comes to data, it is all about size. The bigger is better mantra has been all over the press, but is it really size that matters?

There are the so called “Big Data” projects such as the Allen Brain Atlas, which generates data, sans hypothesis, over the whole brain for thousands of genes. This is great because the goal of the project is to generate consistent data and not worry about which disease will or will not be impacted by each data point. That may be a great new paradigm for science, but there are not many projects like this “in the wild”.

Most data being generated in the world of science can be considered small, i.e., it would fit on a personal computer, and there are a LOT of labs out there generating this sort of data. So the question that we addressed in the recent Big Data issue of Nature Neuroscience is whether small data could organize to become big data. If such a thing is desirable, then what would be the steps to accomplish this lumping?

Here are the principles that we have extracted from working on NIF that we think will really help small data (from Box 2):

Discoverable. Data must be modeled and hosted in a way that they can be discovered through search. Many data, particularly those in dynamic databases, are considered to be part of the ‘hidden web’, that is, they are opaque to search engines such as Google. Authors should make their metadata and data understandable and searchable, (for example, use recognized standards when possible, avoid special characters and non-standard abbreviations), ensure the integrity of all links and provide a persistent identifier (for example, a DOI).

Accessible. When discovered, data can be interrogated. Data and related materials should be available through a variety of methods including download and computational access via the Cloud or web services. Access rights to data should be clearly specified, ideally in a machine-readable form.

Intelligible. Data can be read and understood by both human and machine. Sufficient metadata and context description should be provided to facilitate reuse decisions. Standard nomenclature should be used, ideally derived from a community or domain ontology, to make it machine readable.

Assessable. The reliability of data sources can be evaluated. Authors should ensure that repositories and data links contain sufficient provenance information so that a user can verify the source of the data.

Useable. Data can be reused. Authors should ensure that the data are actionable, for example, that they are in a format in which they can be used without conversion or that they can readily be converted. In general, PDF is not a good format for sharing data. Licenses should make data available with as few restrictions as possible for researchers. Data in the laboratory should be managed as if it is meant to be shared; many research libraries now have data-management programs that can help.
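
As a rough illustration of what these principles ask of a data set's metadata, here is a minimal Python sketch. The record, its field names (`identifier`, `access_rights`, `description`, `provenance`, `format`) and the checks themselves are assumptions made up for this example; they are crude proxies, not part of Box 2 or of any NIF tool.

```python
# Hypothetical metadata record; every field name and value is an assumption.
record = {
    "identifier": "doi:10.0000/example",   # placeholder, not a real DOI
    "access_rights": "open, no login required",
    "description": "Example data set with column definitions and units",
    "provenance": "Deposited by the generating lab, with collection dates",
    "format": "csv",
}


def check_record(record):
    """Return one pass/fail flag per principle (a crude proxy for each)."""
    fmt = str(record.get("format", "")).lower()
    return {
        # Discoverable: a persistent identifier is present.
        "discoverable": bool(record.get("identifier")),
        # Accessible: access rights are stated.
        "accessible": bool(record.get("access_rights")),
        # Intelligible: some description or context is provided.
        "intelligible": bool(record.get("description")),
        # Assessable: provenance information is provided.
        "assessable": bool(record.get("provenance")),
        # Useable: a format is declared and it is not PDF.
        "useable": bool(fmt) and fmt != "pdf",
    }


report = check_record(record)
for principle, ok in report.items():
    print(principle, "ok" if ok else "missing")
```

A real check would need far more than the presence of a field, but even this much shows that each principle can be turned into something a machine can test.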


RRIDs are in the wild! Thanks to JCN and PeerJ

Posted on April 9th, 2014 in Anita Bandrowski, Curation, Essays, News & Events | 1 Comment »

We believe that reproducing science starts with being able to know what “materials” were used in generating the results.

Along with a truly dedicated group of volunteers from academia, government and non-government institutes, publishers and commercial antibody companies we have been running the Resource Identification Initiative (RII).

This initiative is meant to accomplish the following lofty goal: Ask authors to uniquely identify their antibodies (no easy task), organisms (an even harder task), and the databases and software tools that they used in their paper.

In order to ask them at the appropriate time, we gathered a group of journal chief editors to help us ask this question when authors are most interested in answering it: during the process of publication. We created many things to help authors identify their resources, such as a database that stores information for 5 of the most common species used in experiments, antibody catalogs from over 200 vendors, and a database and tool catalog that contains over 3000 software tools and over 2500 academic databases, the largest of its kind.

We have been granted 3 months to determine if authors would actually do this. It has been two months: we have fielded requests from about 30 users who could not find their resources, and there have been more than 40 new software tools or databases registered to our tools registry, and more than 100 antibodies, but we kept waiting for RRIDs to show up in the literature.

Today our wait is over thanks to two papers, Khalil and Levitt in the Journal of Comparative Neurology and Joshi et al in PeerJ.

These authors apparently were able to correctly identify resources such as Matlab, NeuroLucida, ProteinDataBank and antibodies including anti-cholera toxin antibody from List Bio.

What does this tell us?

Well to start that this process is not impossible! That identifiers do exist for many things or the process of obtaining new ones is not so difficult that people can’t do this. It also tells us that when asked at the right time, authors are willing to go the extra step, find and provide identifiers to their reagents or software tools!

Great, but why do I care about a single paper that uses an antibody or Matlab?

Well, it turns out that for many years JCN and NIF staff have been working diligently to link papers through that same identifier so in the case of this cholera toxin antibody we have marked 23 other papers that have used it since 2006.


Open Science? Try Good Science.

Posted on April 7th, 2014 in Author, Curation, Essays, Maryann Martone, News & Events | 1 Comment »

If the Neuroscience Information Framework is any guide, we are certainly in an era of “Openness” in biomedical science.  A search of the NIF Registry of tools, databases and projects for biomedical science for “Open” leads to over 700 results,  ranging from open access journals, to open data, to open tools.  What do we mean by “open”?  Well, not closed or, at least, not entirely closed.  These open tools are, in fact, covered by a myriad of licenses and other restrictions on their use.  But, the general theme is that they are open for at least non-commercial use without fees or undue licensing restrictions.

So, is Open Science already here?  Not exactly.  Open Science is more than a subset of projects that make data available or sharing of software tools, often because they received specific funding to do so.  According to Wikipedia, “Open science is the umbrella term of the movement to make scientific research, data and dissemination accessible to all levels of an inquiring society, amateur or professional. It encompasses practices such as publishing open research, campaigning for open access, encouraging scientists to practice open notebook science, and generally making it easier to publish and communicate scientific knowledge.”   Despite the wealth of Open platforms, most of the products of science, including, most notably, the data upon which scientific insights rest, remain behind closed doors.  While attitudes and regulations are clearly changing, as the latest attempts by PLoS to establish routine sharing of data illustrate (just Google #PLOSfail), we are not there yet.

Why are so many pushing for routine sharing of data and a more open platform for conducting science?  I became interested in data sharing in the late 1990’s as a microscopist, as we started to scale up the rate and breadth at which we could acquire microscopic images.  Suddenly, due to precision stages and wide field cameras, we were able to image tissue sections at higher resolution over much greater expanses of tissue than before, when we were generally restricted to isolated snapshots or low magnification surveys.   I knew that there was far more information within these micrographs and reconstructions than could be analyzed by a single scientist.  It seemed a shame that they were not made more widely available.  To help provide a platform, we established the Cell Centered Database, which has recently merged with the Cell Image Library.  Although we were successful in the CCDB in attracting outside researchers to deposit their data, we were rarely contacted by researchers wanting to deposit their data; most of the time we had to ask, although many would release the data if we did.  But I do distinctly remember one researcher saying to me:  “I understand how sharing my data helps you, but not me”.

True.  So in the interest of full disclosure, let me state a few things.  I try to practice Open Science, but am not fanatical. I try to publish in open access journals, although I am not immune to the allure of prestigious closed journals.  I do blog, make my slides available through Slide Share, and upload pre-prints to Research Gate.  But I continue to remain sensitive to the fact that through my informatics work in the Neuroscience Information Framework and my advocacy for transforming scholarly communications through FORCE11 (the Future of Research Communications and e-Scholarship), I am now in a field where:  A)  I no longer really generate data.  I generate ontologies and other information artefacts, and these I share, but not images, traces, sequences, blots, structures;  B)  I do benefit when others share their data, as I build my research these days on publicly shared data.

But do I support Open Science because I am a direct beneficiary of open data and tools?  No.  I support Open Science because I believe that Open Science = Good Science.  To paraphrase Abraham Lincoln:  “If I could cure Alzheimer’s disease by making all data open, I would do so;  if I could cure Alzheimer’s disease by making all data closed, I would do so.”  In other words, if the best way to do science is the current mode:  publish findings in high impact journals that only become open access after a year, make sure no one can access or re-use your data, make sure your data and articles are not at all machine-processable, publish under-powered studies with only positive results, allow errors introduced by incorrect data or analyses to stay within the literature for years, then I’m all for it.

But, we haven’t cured Alzheimer’s disease or much else in the neurosciences lately.  That’s not to say that our current science, based on intense competition and opaque data and methods, has not produced spectacular successes.  It surely has.  But the current system has also led to some significant failures as well, as the retreat of pharmaceutical companies from neuroscience testifies.  Can modernizing and opening up the process of science to humans and machines alike accelerate the pace of discovery?  I think we owe the taxpayers, who fund our work in hope of advancing society and improving human health, an honest answer here.   Are we doing science as well as it can be done?

I don’t believe so.  And, as this is a blog and not a research article, I am allowed to state that categorically.  I believe that at a minimum, Open Science pushes science towards increased transparency, which, in my view, helps scientists produce better data and helps weed out errors more quickly.  I also believe that our current modes of scientific communication are too restrictive, and create too high a barrier for us to make available all of the products of our work, and not just the positive results.  At a maximum, I believe that routine sharing of data will help drive biomedical sciences towards increased discovery, not just because we will learn to make data less messy, but because we will learn to make better use of the messy data we have.

Many others have written on why scientists are hesitant or outright refuse to share their data and process  (see #PLOSfail above) so I don’t need to go into detail here.  But at least one class of frequent objections has to do with the potential harm that sharing will do to the researcher who makes data available.  A common objection is that others will take advantage of data that you worked hard to obtain before you can reap the full benefits.  Others say that there is no benefit to sharing negative results, detailed lab protocols or data, or blogging, saying that it is more productive for them to publish new papers than to spend time making these other products available.   Others are afraid that if they make data available that might have errors, their competitors would attack them and their reputations would be tarnished.  Some have noted that unlike in the Open Source Software community, where identifying and fixing a bug is considered a compliment, in other areas of scholarship, it is considered an attack.

All of these are certainly understandable objections.  Our current reward system does not provide much incentive for Open Science, and changing our current culture, as I’ve heard frequently, is hard.  Yes it is.  But if our current reward system is supporting sub-optimal science, then don’t we as scientists have an obligation to change it?  Taxpayers don’t fund us because they care about our career paths.  No external forces that I know of support, or even encourage, our current system of promotion and reward:  it is driven entirely by research scientists.  Scientists run the journals, the peer-review system, the promotion committees, the academic administration, the funding administration, the scientific societies and the training of more scientists.  Given that non-scientists are beginning to notice, as evidenced by articles in the Economist (2013) and other non-science venues about lack of reproducibility, perhaps it’s time to start protecting our brand.

While many discussions on Open Science have focused on potential harm to scientists who share their data and negative results, I haven’t yet seen discussions on the potential harm that Opaque Science does to scientists.  Have we considered the harm that is done to graduate students and young scientists when they spend precious months or years trying to reproduce a result that was perhaps based on faulty data or selective reporting of results?  I once heard a heart-breaking story of a promising graduate student who couldn’t reproduce the results of a study published in a high impact journal.  His advisor thought the fault was his, and he was almost ready to quit the program.  When he was finally encouraged to contact the author, he found that they couldn’t necessarily reproduce the results either.   I don’t know whether the student eventually got his degree, but you can imagine the impact such an experience has on young scientists.   Beyond my anecdotal example above, we have documented examples where errors in the literature have significant effects on grants awarded or the ability to publish papers that are in disagreement (e.g., Miller,  2006).  All of these have a very real human cost to science and scientists.

On a positive note, for the first time in my career, since I sipped the Kool Aid back in the early days of the internet, I am seeing real movement by not just a few fringe elements, but by journals, senior scientists, funders and administrators, towards change.  It is impossible to take a step without tripping over a reference to Big Data or metadata.  Initiatives are underway to create a system of reward around data in the form of data publications and data citations.  NIH has just hired Phil Bourne, a leader in the Open Science movement, as Associate Director of Data Science.  And, of course, time is on our side, as younger scientists and those entering into science perhaps have different attitudes towards sharing than their older colleagues.   Time will also tell whether Open Science = Good Science.  If it doesn’t, I promise to be the first to start hoarding my data again and publishing only positive results.


Economist, How Science Goes Wrong, Oct 19, 2013

Miller, G.  (2006) A scientist’s nightmare: software problem leads to five retractions.  Science, 22, 314, pp 1856-1857.


Blog originally posted to Wiley Exchanges.

Major Update of BAMS data

Posted on November 19th, 2013 in Anita Bandrowski, Curation, Data Spotlight | No Comments »

From our friends up the coast.

We are happy to announce a major BAMS update:

1. Data. More than 4000 connections reports collated from the primary
rat literature. The number of publicly available connectivity reports,
accessible through both BAMS and BAMS2 exceeds the 70k benchmark. Most
of the newly added reports are related to the cortico-cortical
connections in the rat. Consequently the rat cortico-cortical matrix
is better populated and is accessible through the “Connectomes” tab of
the classic BAMS menu

2. Interfaces and functionalities.
A. Many of BAMS’ users asked for a simpler method to view connections
data. Hence, we implemented an additional list view of the connections
in classic BAMS. This interface is now implemented in the “Reference”
section of BAMS (searching for information by references). A second
request from users was related to the export of data from the classic
BAMS. Consequently, the connections data retrieved by searches can be
downloaded by any user in Excel format. This new functionality was
also implemented in the “Reference” section of BAMS.
Example: the connectivity data collated from Cenquizca & Swanson,
We will expand these two new simple tools to the other BAMS modules
and data types after feedback from users.
B. The connectivity matrices that can be constructed online allow
inspection of detailed data by clicking on their colored squares. You
can verify this for any of the matrices available at the URL:
C. A new functionally relevant rat brain circuitry, the connections
between the medial cortex and the amygdala, can be accessed online:

3. Last but not least, the presence of BAMS at the SfN Meeting 2013 is
described together with two very important neuroinformatic platforms,
NIF and NeuroLex. You can read the article at the URL:

Resource Identification Guidelines – now at Elsevier

Posted on September 6th, 2013 in Anita Bandrowski, Curation, Interoperability, NIFarious Ideas | No Comments »

Many groups have attributed the problem of reproducibility of results to scientists having very large data sets and highlighting the interesting, yet most likely statistically anomalous, findings, along with other science no-no’s like reporting only positive results.

Our group has been working to improve the reporting of methods and reagents, and I am happy to report that these ideas have been resonating.

Sponsored by FORCE11, a group of researchers, reagent vendors and publishers has been meeting to discuss how best to accomplish better reporting in all of the literature, and both the NIH and publishers themselves are now becoming interested in its success. The latest and greatest evidence of this can be found on the Elsevier website, as a guideline to authors; this will soon be followed by a pilot project to be launched at the Society for Neuroscience meeting with over 25 journals and most major publishers.

Of course there is no reason to wait for an editor to ask to put in catalog numbers or stock numbers for transgenic animals. These should be things that we are trained to do in graduate school as good practices for reporting our findings.

We seem to be getting ready to change (or change back) to more rigorous methods reporting, which should strengthen the recently eroded credibility of the scientific enterprise. I, for one, hope that the message that will be communicated is: “scientists don’t hide problems, even endemic ones, we examine them and find workable solutions”.

There is a Link between literature and data, it has been there for years, but nobody ever found it

Posted on July 10th, 2013 in Anita Bandrowski, Curation, Data Spotlight, Force11, Interoperability, NIFarious Ideas | No Comments »

The NIH recently issued a request for information about the NIH data catalog, to which our group and many others have responded. Many voices, including fairly important ones from the White House, are now calling for making scientific research data open, available and linked to the publications written about the data. This is a very good thing. It should lead to better handling and comparison of data and better science.

However, sitting in many recent meetings with members of various national libraries, who shall remain nameless, I am astounded to learn that not only the scientists, but also librarians have never found the LinkOut feature in PubMed.

LinkOut is a little option at the bottom of all articles in PubMed, hidden by the good staff into complete obscurity; please see the screen shot below if you don’t believe me that such a feature exists.

[Screen shot of the LinkOut option at the bottom of a PubMed article, 2013-07-10]

The article above links to two data sets: one is based on a curated set of annotations linking genes to genetic disorders, and the other is a set of statements about antibody reagents used in this paper. Links from other papers lead to computational model code described in the paper, activation foci or data repositories.

Although it is certainly rarely used, the model organism communities, data repositories and researchers, have been diligently adding their data to PubMed in the form of links. We may quibble about the fact that PubMed asks many of us to reduce the specific links to data to generic links that lead to another version of the same article, but the fact is, that the links to data are present! Because they are present, if the National Library of Medicine ever decides to search them, export them, or acknowledge their existence, it would be a treasure trove of data to literature links that would not require a huge new investment in infrastructure.

I am not suggesting that our infrastructure could not be upgraded, in fact we have many more technical gripes that I will not bring up here, but I am suggesting that we all take advantage of the massive investment of time and energy of curators and authors over the last decades to meticulously link their data or data repositories to the literature.

The LinkOut broker has helped NIF aggregate a list of about 250,000 links from ~40 databases, but what PubMed must have is a much much larger set of data. The links provided by NIF can be searched through the NIF site, they can be filtered by category and by database, and they can be extracted and embedded into other sites like science direct (see Of these 1/4 million links that we provide to PubMed, between 100 and 200 users find them per month. I think that we can and should do better.

  • We can ask that PubMed makes links to data prominent.
  • We can ask that any links in PubMed be of good quality, e.g., results of text-mining output should not be included without verification by authors or curators.
  • We can ask that the links show actual data as opposed to the representation of the paper in another site (currently required).

If you feel the sudden urge to be an arm-chair activist, then please let PubMed know that it would be nice if they celebrated the current links between data and publications instead of hiding them.

We have A LOT of neuroscience information, and would like to share….

Posted on May 14th, 2013 in Curation, Inside NIF, Jonathan Cachat | No Comments »

Over the past 4 years, the Neuroscience Information Framework systematically scanned the literature, internet and social buzz for all things neuroscience (& biomedical science). This tedious bookkeeping has resulted in the largest, most comprehensive catalog of neuroscience-relevant information ever amassed – with the added bonus of semantically enhanced search functions. And now, we would like to share it with you via myNIF…but before those details…

What do we mean “neuroscience information”?

Neuroscience information includes data, resources, literature, grants, multimedia, social buzz, a lexicon and more..

Data: Over 140 independent databases (e.g. CCDB,, GENSAT) are deeply indexed and semantically mapped by NIF – representing over 400 million pieces of data. These data are considered part of the “hidden web”, not indexed by major search engines because doing so requires specialized database query statements for retrieving the data within, rather than on the surfaces of pages surrounding the database. NIF has developed technologies to regularly re-crawl and update data content, index it, and provide search within the contents of these databases simultaneously. Moreover, data resulting from a search can be exported with a single click into standard data formats for any desired subsequent analysis. This can simply save you time: if you need to know what types of serotonin receptors have been classified in zebrafish (Danio rerio), searching NIF for ‘zebrafish serotonin receptor’ provides results from authoritative data providers (HomoloGene, EntrezGene) which can be compared instantly, rather than visiting each site separately and comparing through notes, multiple windows, or several downloads. In addition to this primary information, the results also include related, and sometimes very helpful, information about zebrafish and serotonin: signaling pathways, antibodies, and grant information.

Resources: Need to find a software analysis package for microarray data? NIF can recommend 41 options, as well as 100+ unique organizations, centers, labs and websites that have similar interests. Looking for non-governmental funding of ALS research? Here are 7. What about a tissue bank with Alzheimer’s disease CNS tissue samples available for researchers? NIF is aware of around 88 worth a look. All of this is to convey that a resource is an object or entity, with a website, that provides potential value to neuroscience research or researchers. Importantly, this catalog of resources indexed by NIF is maintained at NeuroLex, a semantic mediawiki website. It is homologous to Wikipedia, in that anyone can contribute their resource or favorite resources, but is endowed with reasoning capabilities permitting logical reasoning on relationships between data (e.g. list all GABAergic Neurons).

What is the Cerebral Cortex?

Posted on January 14th, 2013 in Anita Bandrowski, Curation, Essays, Force11, Interoperability, News & Events | No Comments »

by Anita Bandrowski,

This may seem a silly question, but let’s see if you are more like a fifth grader or more like me. It appears that a fifth grade class I recently interacted with can answer a question that I am having a lot of trouble with. They rattle off “the outside part of the brain”. True enough.
They can point to it; it’s the part that is “squiggly”. True enough.
“It is the part that thinks”. Ok, we can go with that answer.

So why are these fifth graders smarter than I am? Pun intended.

Read the rest of this entry »

So all I need is a number?

Posted on August 17th, 2012 in Curation, Force11, Interoperability, Maryann Martone | No Comments »

In the Neuroscience Information Framework (, we often tout the importance of using unique identifiers rather than text strings as a way to ensure that search engines like NIF can mitigate the ambiguity associated with searching for strings.  NIF provides access to the largest source of neuroscience information on the web, by providing simultaneous search over multiple databases, catalogs and literature databases.  If you search for Ca2 in NIF, you will find information on calcium, the hippocampus and a gene called CA2.  Unique identifiers can disambiguate among these by assigning unique handles to each;  a sort of social security number for each thing that we want to talk about.  Many groups are creating and promoting unique identifiers for all sorts of entities:  people (e.g., ORCID), articles (PubMed ID’s) and they are very handy things.  NIF itself has gotten into the business through its unique resource identifiers and antibody ID’s.   So all I need is a number, right?  Alas, no.  Because numbers, like names, are not unique either.  I just searched through NIF and found an antibody in the Beta Cell Consortium Database.  There was a column for “people who are using this” with a reference of  10077578.  Clicking on it took me to an article in PubMed, so clearly it is a PubMed ID.  Great, I thought.  I want to see who else references that paper in NIF.  So I typed PMID:10077578 into the NIF search interface and was able to retrieve the article in the NIF literature database.  But that’s not what I wanted.  Most of the time, database providers don’t provide the prefix PMID;  rather, they list just the numbers in a column labeled “Reference” or “Citation”.  So I typed in 10077578 and got multiple hits in the data federation from several databases.  Great, I thought.  Here are other sources of information that are referencing this paper.  Unfortunately, one was to Novus Biochemical antibody 100-77578, and one was to the gene Rumal_1324 (GeneID: 10077578).
So, clearly a number is not enough.  Some sort of name space is required, e.g., PMID:10077578 clearly tells me where I am to look.  NIF should have known better and is working to resolve this glitch, by identifying each number with a prefix, and in time, a full URI (Uniform Resource Identifier, not an upper respiratory infection).  The semantic web community has been working on these standards for a long time and discussion of the URI  is beyond this post.  But this is yet another example of why we at NIF encourage resource providers to think globally about their data;  are we producing our data in a form that makes it easier to link individual parts of our resource to other parts?
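
The collision described above can be sketched in a few lines of Python. This is a minimal illustration, not how NIF actually works: the three records, their layout, the `CatalogNo` prefix and the helper functions are all hypothetical, and only the identifier values themselves (PMID:10077578, GeneID: 10077578, catalog number 100-77578) come from the example in this post.

```python
# Hypothetical records standing in for rows from three federated sources.
# Only the identifier values come from the post; the layout is assumed,
# and "CatalogNo" is a made-up prefix for the vendor catalog number.
RECORDS = [
    {"source": "literature", "prefix": "PMID", "local_id": "10077578"},
    {"source": "gene database", "prefix": "GeneID", "local_id": "10077578"},
    {"source": "antibody catalog", "prefix": "CatalogNo", "local_id": "100-77578"},
]


def normalize(text):
    """Drop punctuation, the way a naive string search might."""
    return "".join(ch for ch in text if ch.isalnum())


def search_bare(number):
    """Match on the bare number only: ambiguous across sources."""
    wanted = normalize(number)
    return [r for r in RECORDS if normalize(r["local_id"]) == wanted]


def search_prefixed(curie):
    """Match on prefix plus number, e.g. 'PMID:10077578'."""
    prefix, _, local_id = curie.partition(":")
    prefix = prefix.strip().lower()
    local_id = local_id.strip()
    return [
        r for r in RECORDS
        if r["prefix"].lower() == prefix and r["local_id"] == local_id
    ]


bare_hits = search_bare("10077578")
prefixed_hits = search_prefixed("PMID:10077578")
print(len(bare_hits), "records match the bare number")
print(len(prefixed_hits), "record matches the prefixed identifier")
```

The bare number pulls in the paper, the gene and the antibody, while the prefixed form returns only the paper. A full URI would go one step further by also saying where the prefix itself is defined.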