Posted on October 12th, 2009 in Anita Bandrowski, Curation, Essays, Force11
Benefits and pitfalls of integration of two very different data types, by Dr. Anita Bandrowski, NIF Curator
Overview of NIF registration processes and the role of DISCO:
The Neuroscience Information Framework (NIF) project has a dynamic inventory of more than 2300 neuroscience-relevant resources. What makes that inventory dynamic is that NIF encourages resource providers to register their resources in our catalog of “all things neuroscience.” This process is not terribly involved: providers fill out basic information about their resource such as the URL, name, description and keywords. In the near future, resource providers will also be able to take away a “DISCO” file, short for resource discovery. This file lives on the resource provider’s Web site, and the provider keeps the information in it current at the source. When a change is made, an automated agent that crawls the site periodically alerts NIF to the change. Resource providers therefore do not need to send updated information to NIF or to any other system that indexes the file; the updates are performed by the system, and the NIF catalog is kept up to date without anyone having to visit each of the 2000+ sites currently listed.
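As a rough illustration of the crawl-and-alert idea described above, the sketch below compares a digest of each provider-hosted descriptor file against the digest recorded on the previous crawl. This is not the DISCO implementation: the function names, the use of SHA-256, and the example URLs are assumptions made for illustration only, and fetching the files is left out.

```python
import hashlib
from typing import Dict, List


def fingerprint(content: str) -> str:
    """Return a stable digest of a descriptor file's text (SHA-256 is an assumption)."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def detect_changes(previous: Dict[str, str], fetched: Dict[str, str]) -> List[str]:
    """Return the URLs whose descriptor file is new or has changed.

    previous maps a descriptor URL to the digest stored on the last crawl.
    fetched maps a descriptor URL to the text retrieved on this crawl.
    """
    changed = []
    for url, text in fetched.items():
        if previous.get(url) != fingerprint(text):
            changed.append(url)
    return changed
```

A periodic job could call `detect_changes` after each crawl and queue only the returned URLs for re-indexing, which is what keeps providers from having to notify the catalog themselves.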
The process of provider registration is a good idea, and we are not the only ones to think of it. Other projects in biomedical science essentially seek to accomplish the same goal. Of these, the Biositemaps project, supported by the National Centers for Biomedical Computation, has advanced considerably towards implementing a similar technology. NIF believes that if providers register their resources using one of these tools, then they should not have to do it again using a slightly different tool. Rather, the data generated by all tools should be accessible to all systems. We have just completed an exercise in harvesting Biositemaps files into NIF and provide here our experience with and perspectives on the exercise.
Rationale for integration:
Tools such as Biositemaps and DISCO allow the people who know the most about their resources, i.e., those who created them, to describe those resources so that search engines can easily find them. This “self description” is a great idea in theory, but in practice it may not work as intended. The NIF project, a framework for resource description and discovery, has recently developed tools to harvest the descriptions from Biositemaps. We believe that biomedical resources should be described in a consistent manner and made discoverable so that projects similar to NIF can present them to our user community. During this exercise, we have come across several problems that were echoed by other projects attempting to do similar things.
At the outset, the Biositemaps initiative was created as a Google sitemaps-like database intended to point search engines to appropriate information about biological software and data sets. Biositemaps holds a great deal of appropriate data about biologically relevant software tools, so NIF was highly interested in importing it. The data was especially enticing because resource providers can update it dynamically: if a particular software tool has a new version, the search systems are notified of the update automatically.
Metadata structure compatibility and vocabularies:
NIF has made a conscious decision to have a very simple metadata structure to alleviate problems, including the inappropriate use of metadata fields and the time intensiveness of both the curation effort and the training of curators. The original NIF developed a fairly comprehensive structure (still available at http://neurogateway.org; see also Gardner et al., 2008) that was populated by the resource providers themselves. These resource providers were mostly scientists who were building tools or databases. Many scientists are not metadata experts, and this led to very inconsistent labeling of resources at the outset of the NIF project. The inconsistencies in annotation made searching for resources a very difficult task; furthermore, the complicated structure was not intuitive to the end user. The simple structure adopted by NIF alleviated the curation and search problems and also turned out to be quite useful for integrating many different metadata structures, including Biositemaps. Mapping the fields from Biositemaps to NIF was very simple, taking only a few days to reconcile.
The most significant effort in achieving integration was mapping the resource types, e.g., database, software tool. Biositemaps populates the resource type from the Biomedical Resource Ontology (BRO: http://bioportal.bioontology.org/ontologies/39002), while NIF uses the NIFSTD resource ontology. These two efforts were developed independently but are now converging through the concerted effort of both groups. During this process, however, they continue to have some differences. For example, some classes exist in one ontology and not the other, e.g., core facility, which is explicitly labeled in the BRO but not in the NIF. Thus, if resource providers mark their resources as a core facility, NIF can’t automatically ingest this information, and a human curator must intervene. We have therefore continued to align the BRO and NIFSTD as closely as possible to reduce the need for human intervention.
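The type-mapping step, with its fallback to a human curator, might look something like the sketch below. The mapping table is hypothetical: its labels are illustrative placeholders and are not copied from the BRO or NIFSTD, and `map_resource_type` is a name invented for this example.

```python
from typing import Dict, Optional, Tuple

# Hypothetical mapping table; real BRO and NIFSTD class labels may differ.
BRO_TO_NIFSTD: Dict[str, str] = {
    "Database": "database",
    "Software": "software tool",
}


def map_resource_type(bro_label: str) -> Tuple[Optional[str], bool]:
    """Map a BRO resource type to a NIFSTD type.

    Returns (nifstd_label, needs_curator). When no mapping exists,
    the label is None and the record is flagged for human review.
    """
    target = BRO_TO_NIFSTD.get(bro_label)
    return target, target is None
```

Under these assumptions, `map_resource_type("Core Facility")` returns `(None, True)`, mirroring the case above in which a class present in only one ontology cannot be ingested automatically.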
Data structure compatibility and scope:
While the metadata structure harmonization has taken some effort, it is a tamable exercise. The data that resource providers supply within Biositemaps, however, is extraordinarily heterogeneous in quality. In about 200 out of 400 Biositemaps, the data are well formed, but the remaining records contain only partial information, including missing resource names or URLs, which makes it difficult to import all of the Biositemaps data into NIF in an automated fashion. The NIF registry database (like all databases) expects to see certain minimal data, including a name and a URL. When these items are not present, the database does not accept the record.
Additional heterogeneity comes from the amount of descriptive text. NIF registry records prepared by curators have text of 3-6 paragraphs in most cases, but most Biositemaps resources describe themselves in a single sentence. NIF uses longer descriptions because we found out early in the project that longer descriptive text includes many terms that NIF users search for but that may not be listed as keywords, which makes search through the NIF registry more effective. With minimal descriptions, it is unlikely that the NIF search interface would retrieve Biositemaps resources in a sea of NIF curated resources.
Finally, combining records that are already present in NIF with Biositemaps data presented some challenges to our system. Because we don’t yet have a universal way of assigning URIs to resources, resources tend to be cross-listed in many catalogs. For this reason, NIF is supporting the Common Naming Project (http://neurocommons.org/page/Common_Naming_Project). Because NIF had already curated the listed resources, in many cases more thoroughly than the resource providers themselves had, reconciling and merging the information was not straightforward.
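The minimal-data requirement described in this section (a name and a URL) can be sketched as a simple pre-import check. The field names `name` and `url` and both function names are assumptions for illustration; the actual NIF registry schema is not shown in this post.

```python
from typing import Dict, Iterable, List, Tuple


def missing_required_fields(record: Dict[str, str],
                            required: Iterable[str] = ("name", "url")) -> List[str]:
    """Return the required fields that are absent or blank in a harvested record."""
    return [f for f in required if not str(record.get(f) or "").strip()]


def partition_records(records: List[Dict[str, str]]) -> Tuple[List[dict], List[dict]]:
    """Split harvested records into (accepted, rejected) by the minimal-data rule."""
    accepted, rejected = [], []
    for record in records:
        if missing_required_fields(record):
            rejected.append(record)   # would need a curator or provider follow-up
        else:
            accepted.append(record)
    return accepted, rejected
```

Reporting which fields are missing, rather than only rejecting the record, gives a curator something concrete to send back to the resource provider.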
To address the problems noted above, NIF has updated the registry data structure to accommodate two coexisting versions of each record: one is a storage bin for automated data, and the other is the human-curated version. Any record that is publicly available in NIF will be curated by a human; with automatic registration, the human curator will be prompted to review the site whenever an update occurs.
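One way to picture the two-version record is the sketch below. It is an illustration under assumptions, not the NIF registry schema: the class name, field names, and the `needs_review` flag are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class RegistryRecord:
    """Hypothetical registry entry holding two coexisting versions."""
    url: str
    curated: Dict[str, str] = field(default_factory=dict)    # human-curated, public
    harvested: Dict[str, str] = field(default_factory=dict)  # storage bin for automated data
    needs_review: bool = False

    def apply_harvest(self, data: Dict[str, str]) -> None:
        """Store newly harvested data and prompt a review only if it changed."""
        if data != self.harvested:
            self.harvested = dict(data)
            self.needs_review = True

    def public_view(self) -> Dict[str, str]:
        """Expose only the curated version; harvested data never overwrites it."""
        return dict(self.curated)

    def mark_reviewed(self) -> None:
        """Called once a curator has looked at the latest harvested data."""
        self.needs_review = False
```

Keeping the harvested data separate means an automated update can never silently replace a curator's description; it can only raise a flag.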
Resource characterization is a tricky problem, and it is difficult to know the correct way to represent a resource for a particular audience. For example, the I2B2 project (https://www.i2b2.org/; an NIH-funded National Center for Biomedical Computing containing a large number of software resources) created an individual Biositemaps entry for each plug-in to its software tools. This is an issue of scope. Because NIF’s curators as a policy do not divide resources to this extent, we consider most plug-ins to be a part of the software resource, not an individual resource (there are some exceptions, such as MATLAB libraries). For a project such as NIF, resources need to be well defined, because trying to catalog every resource useful to neuroscientists becomes a daunting task if resources are defined too narrowly, for example as individual plug-ins. If we consider an appropriate resource to be something like NITRC, a software library with hundreds of software applications, then it will take a curator some time to annotate it. However, if we consider each plug-in to each program a resource, the task becomes too large and is not likely to help users. On the other hand, if a user is looking for a very specific plug-in, then having access to each individually is likely to be useful.
To solve this scope problem, we have created a uniqueness criterion for the URL, meaning that if the URL is not unique among several Biositemaps “resources,” then the resource descriptions will be folded into one. The solution is not perfect because unrelated resources could potentially have the same URL, but this strategy solved more problems than it created.
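The URL uniqueness criterion can be sketched as a grouping step. This is only an illustration of the idea: the entry fields, the function names, and the normalization rule (trimming whitespace and a trailing slash) are assumptions, not a description of NIF's actual code.

```python
from typing import Any, Dict, List


def normalize_url(url: str) -> str:
    """Assumed normalization: trim whitespace and any trailing slash."""
    return url.strip().rstrip("/")


def fold_by_url(entries: List[Dict[str, str]]) -> List[Dict[str, Any]]:
    """Fold entries that share a URL into one resource description.

    Each folded resource keeps every contributing name and description
    so that a curator can decide how to merge them.
    """
    folded: Dict[str, Dict[str, Any]] = {}
    for entry in entries:
        key = normalize_url(entry["url"])
        group = folded.setdefault(key, {"url": key, "names": [], "descriptions": []})
        group["names"].append(entry.get("name", ""))
        group["descriptions"].append(entry.get("description", ""))
    return list(folded.values())
```

As the paragraph above notes, the criterion is imperfect: two unrelated resources hosted at the same URL would be folded together by this grouping, so the folded output still benefits from a curator's review.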
“Self registration” tools such as Biositemaps can be used to help human curators annotate a resource, including alerting curators that a resource has been created. However, while these tools can certainly help, we believe that these self-reporting tools do not replace trained human curators.