Archive for the ‘Interoperability’ Category

There is a Link between literature and data, it has been there for years, but nobody ever found it

Posted on July 10th, 2013 in Anita Bandrowski, Curation, Data Spotlight, Force11, Interoperability, NIFarious Ideas | No Comments »

The NIH recently issued a request for information about the NIH data catalog, to which our group and many others have responded. Many voices, including fairly important ones from the White House, are now calling for making scientific research data open, available and linked to the publications written about the data. This is a very good thing. It should lead to better handling and comparison of data and better science.

However, sitting in many recent meetings with members of various national libraries, who shall remain nameless, I am astounded to learn that not only the scientists, but also librarians have never found the LinkOut feature in PubMed.

LinkOut is a little option at the bottom of all articles in PubMed, hidden by the good staff into complete obscurity. Please see the screenshot below if you don’t believe me that such a feature exists.

[Screenshot: LinkOut links at the bottom of a PubMed article page]

The article above links to two data sets: one is based on OMIM.org, a curated set of annotations linking genes to genetic disorders, and the other is antibodyregistry.org, a set of statements about antibody reagents used in this paper. Links from other papers lead to computational model code described in the paper, activation foci or data repositories.

Although the feature is certainly rarely used, the model organism communities, data repositories and researchers have been diligently adding their data to PubMed in the form of links. We may quibble about the fact that PubMed asks many of us to reduce the specific links to data to generic links that lead to another version of the same article, but the fact is that the links to data are present! Because they are present, if the National Library of Medicine ever decides to search them, export them, or acknowledge their existence, they would be a treasure trove of data-to-literature links that would not require a huge new investment in infrastructure.

I am not suggesting that our infrastructure could not be upgraded; in fact, we have many more technical gripes that I will not bring up here. I am suggesting that we all take advantage of the massive investment of time and energy that curators and authors have made over the last decades to meticulously link their data or data repositories to the literature.

The LinkOut broker has helped NIF aggregate a list of about 250,000 links from ~40 databases, but what PubMed must have is a much, much larger set of data. The links provided by NIF can be searched through the NIF site, filtered by category and by database, and extracted and embedded into other sites like ScienceDirect (see neuinfo.org/developers). Yet only between 100 and 200 users per month find the quarter million links that we provide to PubMed. I think that we can and should do better.

  • We can ask that PubMed makes links to data prominent.
  • We can ask that any links in PubMed be of good quality, e.g., results of text-mining output should not be included without verification by authors or curators.
  • We can ask that the links show actual data as opposed to the representation of the paper in another site (currently required).

If you feel the sudden urge to be an arm-chair activist, then please let PubMed know that it would be nice if they celebrated the current links between data and publications instead of hiding them.

How long does it take to get a resource into NIF? The case of the Open Source Brain.

Posted on June 4th, 2013 in Anita Bandrowski, Data Spotlight, Force11, Inside NIF, Interoperability | 2 Comments »

Believe it or not, there really is a project called Open Source Brain, and it is a wonderful community of hackers that attempts to do very novel things with open source models, mainly in a format called NeuroML.

What is the Open Source Brain?

Well, it takes models, converts them into cool visualizations and then allows users to manipulate them in their browser, with functionality similar to Google Body. The hope is to strap on some significant computational power from the Neuroscience Gateway’s massive clusters so that the pretty pictures can be fully functional, but for now, this is a great way of exploring three-dimensional neurons and connectivity.

[Screenshot: an Open Source Brain visualization]

But the reason I am blogging about this project is not only the “ooohh-aaaahhh” factor that nice graphics usually have on me, but also that this source came to NIF in an interesting way: by a human flying from London on his way to another meeting. Unfortunately, last week we did not know about the Open Source Brain, but Padraig knew about NIF and wanted to register the project, hoping to integrate his data or at least “get the process started”.

At 10:30 am we were sufficiently caffeinated to begin and created a registry entry, from which we obtained an identifier.

The identifier was then used to create a sitemap entry in the DISCO database (essentially, anyone who has logged in to NeuroLex can do this by clicking a button at the bottom of a curated registry entry).

Then we added an “interop” file, which instructs our crawler to put the XML data output by Open Source Brain into our local data warehouse, making sure to specify appropriate tables and columns.

Then we went to lunch, came back after fighting much larger crowds at the Indian place than were expected before finals, and created the “view” of the data (basically, wrote a SQL statement and used our concept mapping tool to define what data would be displayed).

By 3:30 pm we had a view deployed. Well ok, we did have to import the data twice because we messed up the file once, and this deployment was the beta server and we had to wait to update to production until Friday night, but that is still pretty darn fast in my opinion.

The question for many people who have data has been “How much effort will it take to make my data interoperable with community resources?” For the first time ever, we can report … it will only take a couple of hours (we should insert many caveats here).

What is the Cerebral Cortex?

Posted on January 14th, 2013 in Anita Bandrowski, Curation, Essays, Force11, Interoperability, News & Events | No Comments »

by Anita Bandrowski

This may seem a silly question, but let’s see if you are more like a fifth grader or more like me. It appears that a fifth grade class I recently interacted with can answer a question that I am having a lot of trouble with. They rattle off “the outside part of the brain”. True enough.
They can point to it; it’s the part that is “squiggly”. True enough.
“It is the part that thinks”. Ok, we can go with that answer.

So why are these fifth graders smarter than I am? Pun intended.

A Call to Science Bloggers

Posted on November 9th, 2012 in Force11, Interoperability, Jonathan Cachat, News & Events | 5 Comments »

With the growth of scientist participation in blogging and social networks, a considerable amount of worthwhile scientific chatter is unfolding online. Several prominent blogs have emerged; in fact, NIF is now indexing many of these sites (via RSS feeds), and they can be found in the Multimedia data type***.

In our continued effort to integrate and link data, NIF would like to create two-way links between your blog posts and the scientific articles they discuss through NIF Literature. For example, if users find an article in NIF Literature, we can provide links to blogs or tweets that have discussed this article, in addition to the current link to full text access options. Your site or blog would also be included in PubMed search results thanks to NIF’s automated LinkOut services.

However, it is currently very hard to achieve this goal and would require substantial manual curation efforts. In order to automate this process, we submit a few simple guidelines to the online science community.

1) Blogs and other long-form posts should always include related PubMed Identifiers (PMID) in citations. References can be in text, or placed together at the end of a post, but either way should include PMID: ######## for all citations. This standardized format of ‘PMID:######’ was suggested by the BioDBcore and biosharing.org initiatives and we strongly support it.

  • This is a MindHacks post without any citation information at all (aside from a link to Nature). For the purposes of this article, this is the worst possible scenario. It is a wonderful dialogue on this exciting article, but it is very unlikely that people reading the article will ever know that this post exists, which is unfortunate for everyone involved.
  • This Neuroskeptic post correctly included citation information, along with the PMID, at the end of the post (see Screenshot below).

2) Short-form posts should include PMIDs when possible, particularly if linked directly to an article. For example, see a recent tweet here.

3) Be found – index with search engines including NIF. For more information about submitting your site, blog or resource to NIF check here or fill in the small form here.
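
To illustrate why the ‘PMID:######’ convention in guideline 1 makes automated linking feasible, here is a minimal Python sketch that pulls PMIDs out of a post. It is only an illustration, not the code NIF runs: the function name, the sample text and the 1 to 8 digit bound are assumptions made for this example.

```python
import re

# Matches the 'PMID:12345678' convention, allowing optional spaces after the colon.
# The 1-8 digit bound is an assumption for this sketch, not a PubMed rule.
PMID_PATTERN = re.compile(r"\bPMID:\s*(\d{1,8})\b")


def extract_pmids(post_text):
    """Return the unique PMIDs cited in a post, in order of first appearance."""
    seen = []
    for match in PMID_PATTERN.finditer(post_text):
        pmid = match.group(1)
        if pmid not in seen:
            seen.append(pmid)
    return seen


# Hypothetical post text, written only for this example.
post = (
    "A nice discussion of a recent paper (PMID: 10077578). "
    "See also PMID:10077578 for the original data."
)
print(extract_pmids(post))
```

A bare number without the prefix is deliberately ignored here, since a crawler cannot tell whether it is a PMID, a catalog number or something else.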

The internet was designed to enable a web of links between ideas, information and people. Following these simple guidelines will not only increase the connectivity between data; the resulting social and semantic links are also valuable to information creators. First, they promote more opportunities for scientific exchange and feedback. Second, they provide additional avenues to calculate impact metrics, similar to those observed by AltMetrics.org and PLoS journals.

Do you have any other thoughts related to increasing data integration and interpretability? Share them here in the comments below!

***If you would like to have your blog or site included within the NIF index drop us a line – info@neuinfo.org

How to make the most annoying biological database

Posted on November 4th, 2012 in Anita Bandrowski, Force11, Interoperability, NIFarious Ideas, Uncategorized | No Comments »

Dear biological database owners,

We have attempted to let people know how to make databases more interoperable and discoverable, but this blog takes a very different approach to the idea. The ideas brought forward include making data silos and generating non-unique identifiers; my current favorite is the 44-page getting started guide.

So, what is it that you will build next?

So all I need is a number?

Posted on August 17th, 2012 in Curation, Force11, Interoperability, Maryann Martone | No Comments »

In the Neuroscience Information Framework (http://neuinfo.org), we often tout the importance of using unique identifiers rather than text strings as a way to ensure that search engines like NIF can mitigate the ambiguity associated with searching for strings. NIF provides access to the largest source of neuroscience information on the web, by providing simultaneous search over multiple databases, catalogs and literature databases. If you search for Ca2 in NIF, you will find information on calcium, the hippocampus and a gene called CA2. Unique identifiers can disambiguate among these by assigning unique handles to each; a sort of social security number for each thing that we want to talk about. Many groups are creating and promoting unique identifiers for all sorts of entities, such as people (e.g., ORCID) and articles (PubMed IDs), and they are very handy things. NIF itself has gotten into the business through its unique resource identifiers and antibody IDs.

So all I need is a number, right? Alas, no. Because numbers, like names, are not unique either. I just searched through NIF and found an antibody in the Beta Cell Consortium Database. There was a column for “people who are using this” with a reference of 10077578. Clicking on it took me to an article in PubMed, so clearly it is a PubMed ID. Great, I thought. I want to see who else references that paper in NIF. So I typed PMID:10077578 into the NIF search interface and was able to retrieve the article in the NIF literature database. But that’s not what I wanted. Most of the time, database providers don’t provide the prefix PMID; rather, they list just the numbers in a column labeled “Reference” or “Citation”. So I typed in 10077578 and got multiple hits in the data federation from several databases. Great, I thought. Here are other sources of information that are referencing this paper.

Unfortunately, one was to Novus Biochemical antibody 100-77578, and one was to the gene Rumal_1324 (GeneID: 10077578). So, clearly a number is not enough. Some sort of namespace is required, e.g., PMID:10077578 clearly tells me where I am to look. NIF should have known better and is working to resolve this glitch by identifying each number with a prefix and, in time, a full URI (Uniform Resource Identifier, not an upper respiratory infection). The semantic web community has been working on these standards for a long time and discussion of the URI is beyond this post. But this is yet another example of why we at NIF encourage resource providers to think globally about their data: are we producing our data in a form that makes it easier to link individual parts of our resource to other parts?
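
The collision described above can be sketched in a few lines of Python. This is a toy model, not NIF’s search code: the index, the namespace labels "GeneID" and "NovusCatalog", and the function names are assumptions chosen for illustration. It shows how a bare number matches records in every namespace, while a prefixed identifier resolves to exactly one.

```python
# Toy index: each record is stored under a (namespace, local_id) pair.
# The three entries echo the collision described in the post; the namespace
# labels "GeneID" and "NovusCatalog" are illustrative assumptions.
RECORDS = {
    ("PMID", "10077578"): "PubMed article",
    ("GeneID", "10077578"): "gene Rumal_1324",
    ("NovusCatalog", "100-77578"): "Novus antibody",
}


def digits_only(local_id):
    """Drop punctuation so '100-77578' compares equal to '10077578', as a naive search might."""
    return "".join(ch for ch in local_id if ch.isdigit())


def lookup(query):
    """Prefixed queries hit one namespace; bare numbers match across all of them."""
    if ":" in query:
        namespace, local_id = query.split(":", 1)
        hit = RECORDS.get((namespace, local_id))
        return [hit] if hit else []
    return [
        label
        for (namespace, local_id), label in RECORDS.items()
        if digits_only(local_id) == digits_only(query)
    ]


print(lookup("10077578"))       # ambiguous: matches every namespace
print(lookup("PMID:10077578"))  # unambiguous: one record
```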

Four Things You Can Do to Make Your Database More Interoperable

Posted on June 25th, 2009 in Force11, Interoperability, Maryann Martone, News & Events | 4 Comments »

As part of the Neuroscience Information Framework (NIF), we provide access to data contained in databases and structured web resources (e.g. queryable web services), sometimes referred to as the deep or hidden web, that are independently maintained by resource providers around the globe. We believe that this federated model is the most practical way to provide our users with access to the latest data without NIF having to maintain a centralized resource. A federation model assumes that we can access each database or service and allow users to discover these resources through the NIF. It also lets us merge data from different databases, essentially mixing and matching results in a way that is useful to our users.

The NIF has been registering databases and structured web resources for just over a year, moving discussions of database interoperability from the theoretical to the practical realm. As new databases are created every day, we thought it would be useful to provide our perspective on this issue so that decisions can be made at the outset that would improve the likelihood that the database can interoperate with others later on.  We are not going to discuss the relative merits of database platforms, e.g., relational, XML, object-oriented.   Nor will we consider here whether RDF is the answer to all interoperability problems (but stay tuned).  Rather, here we will focus on our experience with integration of existing databases, most of which are relational.

What is interoperability? We define it simply as: “the ability of a system or component to function effectively with other systems or components” (http://www.yourdictionary.com/interoperability). Why would you as a resource provider want to become interoperable? Here are 3 good reasons:

1)      To be found.  NIF is just one of many portals on the web. We specialize in scientific data, so we have tools that allow scientists to search for all genes expressed in a particular brain region, for example, but our problems are the same ones dealt with by all search portals: where is the information that I want? Usually, it is scattered across web pages, pdf files and databases, many of which cannot be searched effectively by search engines. Academics and NIH are excellent at providing wonderful data, data models, and manuscripts describing them, but we at NIF have discovered that academics who create databases are usually not so good at marketing. If you have just created a database, do you want others to find it and use it? If you follow a few simple rules for your database or other type of data resource, your data will have a higher impact on the community.

2)      To be useful:  No matter how comprehensive a database you create, you will be capturing only a tiny fraction of information on biological systems. That’s why we have so many databases out there. NIF has identified over 1500 independent databases that are potentially useful to neuroscience and we find more every day. A single individual could spend a lifetime locating these resources and querying them; NIF lets you query them all simultaneously and combine and compare information across them (or at least, we will let you do that in the future).

3)      To be helpful:  I hear scientists complain all the time that they can’t use microscope parts from one manufacturer on an instrument made by another manufacturer.  I myself complain when I can’t use someone else’s cell phone charger for my phone.  Yet, we as scientists are unwilling to work a little extra to make our data and databases interoperable. We understand that databases are developed for a specific purpose by a specific group to serve their needs.  We also understand that the financial and technological resources for creating and maintaining these resources widely differ.  But we also know that there are some practices which can make it more or less difficult for a resource like NIF to make the contents of a web accessible database available and usable.

The issue of database interoperability is a complex one, and some excellent frameworks and discussions are available (http://www.sei.cmu.edu/isis/guide/introduction/lcim.htm). For the purposes of this discussion, we will simplify the issue and address issues of interoperability at two levels: technical and data.

At the technical level, we have encountered several roadblocks to making data resources available through the NIF. At the most basic level are issues of access, e.g., institutional firewalls, stability of access privileges and access methods. At another level, we have encountered difficulties in efficiently accessing databases when the identifiers for individual records change upon update. We have noted that some databases and vocabularies use identifiers that get regenerated every time the resource is updated. This practice makes it very difficult for NIF to maintain appropriate indices and links. We recommend that identifiers be stable; if they are to be removed, they should be made obsolete rather than deleted. Related to this is the use of sessions to retrieve data pages instead of stable URIs. Under this practice the application allows a user to access data only in a linear manner, i.e., the main page showing the cerebellum must be accessed before any of its subcomponents. Each session generates a temporary pointer or ‘session identifier’, which makes it difficult for a system such as NIF to make use of many of the specific data elements inside resources that use sessions extensively. To harness the power of all the data available in these resources, they need to be made available outside of their web interface.
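
As a minimal sketch of what “stable, and obsoleted rather than deleted” can look like in practice, consider the toy registry below. It is written in Python for illustration only and does not describe how any particular database is built: the `Registry` class, the `rec_000001` ID scheme and the `replaced_by` field are all assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Record:
    identifier: str                     # assigned once, never regenerated on update
    label: str
    obsolete: bool = False
    replaced_by: Optional[str] = None   # optional pointer to a successor record


class Registry:
    def __init__(self):
        self._records = {}
        self._next = 1

    def add(self, label):
        identifier = f"rec_{self._next:06d}"  # hypothetical ID scheme
        self._next += 1
        self._records[identifier] = Record(identifier, label)
        return identifier

    def update(self, identifier, label):
        # Content changes keep the same identifier, so external links survive.
        self._records[identifier].label = label

    def retire(self, identifier, replaced_by=None):
        # Mark the record obsolete instead of deleting it: old links still resolve.
        record = self._records[identifier]
        record.obsolete = True
        record.replaced_by = replaced_by

    def resolve(self, identifier):
        return self._records[identifier]


registry = Registry()
old_id = registry.add("Purkinje neuron")
new_id = registry.add("Cerebellar Purkinje cell")
registry.retire(old_id, replaced_by=new_id)
print(registry.resolve(old_id))
```

An aggregator that indexed `old_id` before the update can still resolve it and follow `replaced_by`, instead of hitting a broken link.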

The above issues touch upon the ability of NIF to issue queries against a remote database, generate a search index and return results. Equally important is the ability for NIF to search the database effectively and provide comprehensible and useful results to NIF users. In our experience, the lack of a standard terminology is one of the major impediments to effective search across databases. In our very first database federation exercise, we registered 3 databases that had data on neurons: NeuronDB, Neuromorpho and CCDB. If we look at the list of neuron names, we see that there were 3 variants on the same cell class: Cerebellar Purkinje cell, Purkinje neuron and Purkinje cell. If a user typed “Purkinje”, results may not be specific to Purkinje neurons and may contain information related to Purkinje fibers. However, if a user wanted specific information about “Purkinje neuron” they would not retrieve records from all 3 databases. Fortunately, we now have the NeuroLex, a lexicon for neuroscience that maps all 3 of these terms as synonyms to each other and assigns a unique ID to the class. When issuing a query from NIF, NIF attempts to autocomplete all terms from the NeuroLex vocabulary. If we have the term, we automatically search for synonyms. If a source uses any of the terms mapped to the ID by NeuroLex, the result will be returned. If a source uses a custom abbreviation (PC) or a symbolic notation (Purkinje cell = 1), then special mapping of the source database will have to occur using our concept mapping tool. Note that we are not considering here whether the meaning of Purkinje cell is the same across all of these sources. Meaning is a more difficult issue and one which will be addressed in a future blog post. But for now, just having a standard, non-symbolic term makes integration of databases a lot easier.
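
The synonym expansion described above can be sketched as follows. This is a simplified Python illustration under stated assumptions, not NIF’s implementation: the class ID `nlx_example_1` is made up and is not a real NeuroLex identifier, and because we do not say here which database uses which term, the sources are given placeholder names.

```python
# Hypothetical lexicon: one class ID mapped to its synonymous terms.
# "nlx_example_1" is invented for this sketch; it is not a real NeuroLex ID.
LEXICON = {
    "nlx_example_1": {"Cerebellar Purkinje cell", "Purkinje neuron", "Purkinje cell"},
}

# Reverse lookup from a lower-cased term to its class ID.
TERM_TO_ID = {
    term.lower(): class_id
    for class_id, terms in LEXICON.items()
    for term in terms
}

# Placeholder sources, each using its own label for the same cell class.
SOURCES = {
    "database_A": ["Cerebellar Purkinje cell", "Purkinje fiber"],
    "database_B": ["Purkinje neuron"],
    "database_C": ["Purkinje cell"],
}


def expand(query):
    """Return the query plus all synonyms sharing its class ID (lower-cased)."""
    class_id = TERM_TO_ID.get(query.lower())
    if class_id is None:
        return {query.lower()}
    return {term.lower() for term in LEXICON[class_id]}


def federated_search(query):
    """Match each source's labels against the expanded term set."""
    wanted = expand(query)
    results = {}
    for source, labels in SOURCES.items():
        hits = [label for label in labels if label.lower() in wanted]
        if hits:
            results[source] = hits
    return results


print(federated_search("Purkinje neuron"))
```

Because matching is on whole terms mapped to one ID, a query for “Purkinje neuron” reaches all three sources without dragging in “Purkinje fiber”. A source that used an abbreviation such as PC would still need a custom mapping.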

So here are our top 4 barriers to data federation in the NIF:

1)      Unstable identifiers:  Every time the database updates, the identifiers change and all pre-indexed links to those data records break;

2)      Access:  For increased utilization of the data, stable access needs to be provided either through a public connection to the database, a periodic dump of the database contents or through structured web services;

3)      Sessions:  For general information, results and data should be accessible using a static (i.e., non-session-based, or stateless) URL;

4)      Vocabulary:  Use a standard terminology and avoid symbolic notations where possible.