Posted on August 17th, 2012 in Curation, Force11, Interoperability, Maryann Martone
In the Neuroscience Information Framework (http://neuinfo.org), we often tout the importance of using unique identifiers rather than text strings, so that search engines like NIF can mitigate the ambiguity that comes with searching for strings. NIF provides access to the largest source of neuroscience information on the web by providing simultaneous search over multiple databases, catalogs and literature databases. If you search for Ca2 in NIF, you will find information on calcium, the hippocampus and a gene called CA2. Unique identifiers can disambiguate among these by assigning a unique handle to each: a sort of social security number for each thing that we want to talk about. Many groups are creating and promoting unique identifiers for all sorts of entities, such as people (e.g., ORCID) and articles (PubMed IDs), and they are very handy things. NIF itself has gotten into the business through its unique resource identifiers and antibody IDs.
So all I need is a number, right? Alas, no, because numbers, like names, are not unique either. I just searched through NIF and found an antibody in the Beta Cell Consortium Database. There was a column for “people who are using this” with a reference of 10077578. Clicking on it took me to an article in PubMed, so clearly it is a PubMed ID. Great, I thought. I want to see who else references that paper in NIF. So I typed PMID:10077578 into the NIF search interface and was able to retrieve the article in the NIF literature database. But that’s not what I wanted. Most of the time, database providers don’t include the prefix PMID; rather, they list just the number in a column labeled “Reference” or “Citation”.
So I typed in 10077578 and got multiple hits in the data federation from several databases. Great, I thought. Here are other sources of information that are referencing this paper. Unfortunately, one hit was for Novus Biochemical antibody 100-77578, and another was for the gene Rumal_1324 (GeneID: 10077578).
So, clearly a number is not enough. Some sort of namespace is required; e.g., PMID:10077578 tells me exactly where I am to look. NIF should have known better and is working to resolve this glitch by identifying each number with a prefix and, in time, a full URI (Uniform Resource Identifier, not an upper respiratory infection). The semantic web community has been working on these standards for a long time, and a discussion of the URI is beyond the scope of this post. But this is yet another example of why we at NIF encourage resource providers to think globally about their data: are we producing our data in a form that makes it easier to link individual parts of our resource to other parts?
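To make the point concrete, here is a minimal Python sketch of the difference between searching on a bare number and searching on a prefixed identifier. It is only an illustration under stated assumptions, not how NIF actually indexes its data: the three toy records mirror the hits described above, the prefix "NovusCat" is made up for this example, and the digits-only matching is just one guess at why a catalog number like 100-77578 could turn up in a search for 10077578.

```python
import re
from typing import List, NamedTuple


class Identifier(NamedTuple):
    prefix: str    # namespace, e.g. "PMID"
    local_id: str  # identifier within that namespace


# Toy records modeled on the three hits described in the post.
# "NovusCat" is a hypothetical prefix used only for illustration.
RECORDS = {
    Identifier("PMID", "10077578"): "article in PubMed",
    Identifier("GeneID", "10077578"): "gene Rumal_1324",
    Identifier("NovusCat", "100-77578"): "antibody catalog entry",
}


def digits_only(text: str) -> str:
    """Drop every non-digit character, e.g. '100-77578' -> '10077578'."""
    return re.sub(r"\D", "", text)


def search_bare(number: str) -> List[Identifier]:
    """Naive search: compare digits alone and ignore the namespace."""
    wanted = digits_only(number)
    return [ident for ident in RECORDS if digits_only(ident.local_id) == wanted]


def parse_prefixed(text: str) -> Identifier:
    """Split 'PREFIX:ID' into its namespace and local identifier."""
    prefix, sep, local_id = text.partition(":")
    if not sep or not prefix or not local_id:
        raise ValueError(f"expected PREFIX:ID, got {text!r}")
    return Identifier(prefix, local_id)


def search_prefixed(text: str) -> List[Identifier]:
    """Namespace-aware search: the prefix and the local ID must both match."""
    ident = parse_prefixed(text)
    return [ident] if ident in RECORDS else []


if __name__ == "__main__":
    print(search_bare("10077578"))           # hits in all three namespaces
    print(search_prefixed("PMID:10077578"))  # only the PubMed article
```

The bare search returns records from all three namespaces, while the prefixed search returns only the PubMed article. A full URI would go one step further by making the namespace itself globally unambiguous.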