As part of the Neuroscience Information Framework (NIF), we provide access to data contained in databases and structured web resources (e.g. queryable web services), sometimes referred to as the deep or hidden web, that are independently maintained by resource providers around the globe. We believe that this federated model is the most practical way to provide our users with access to the latest data without NIF having to maintain a centralized resource. A federation model assumes that we can access each database or service and allow users to discover these resources through the NIF. It also lets us merge data from different databases, essentially mixing and matching results in a way that is useful to our users.
The NIF has been registering databases and structured web resources for just over a year, moving discussions of database interoperability from the theoretical to the practical realm. As new databases are created every day, we thought it would be useful to provide our perspective on this issue so that decisions can be made at the outset that would improve the likelihood that the database can interoperate with others later on. We are not going to discuss the relative merits of database platforms, e.g., relational, XML, object-oriented. Nor will we consider here whether RDF is the answer to all interoperability problems (but stay tuned). Rather, here we will focus on our experience with integration of existing databases, most of which are relational.
What is interoperability? We define it simply as: “the ability of a system or component to function effectively with other systems or components” (http://www.yourdictionary.com/interoperability). Why would you as a resource provider want to become interoperable? Here are 3 good reasons:
1) To be found: NIF is just one of many portals on the web. We specialize in scientific data, so we have tools that allow scientists to search for all genes expressed in a particular brain region, for example, but our problem is the same one dealt with by all search portals: where is the information that I want? Usually, it is scattered across web pages, PDF files and databases, many of which cannot be searched effectively by search engines. Academics and NIH are excellent at providing wonderful data, data models, and manuscripts describing them, but we at NIF have discovered that academics who create databases are usually not so good at marketing. If you have just created a database, do you want others to find it and use it? If you follow a few simple rules for your database or other type of data resource, your data will have a higher impact on the community.
2) To be useful: No matter how comprehensive a database you create, you will be capturing only a tiny fraction of information on biological systems. That’s why we have so many databases out there. NIF has identified over 1500 independent databases that are potentially useful to neuroscience and we find more every day. A single individual could spend a lifetime locating these resources and querying them; NIF lets you query them all simultaneously and combine and compare information across them (or at least, we will let you do that in the future).
3) To be helpful: I hear scientists complain all the time that they can’t use microscope parts from one manufacturer on an instrument made by another manufacturer. I myself complain when I can’t use someone else’s cell phone charger for my phone. Yet, we as scientists are unwilling to work a little extra to make our data and databases interoperable. We understand that databases are developed for a specific purpose by a specific group to serve their needs. We also understand that the financial and technological resources for creating and maintaining these resources differ widely. But we also know that there are some practices which can make it more or less difficult for a resource like NIF to make the contents of a web-accessible database available and usable.
The issue of database interoperability is a complex one, and some excellent frameworks and discussions are available (http://www.sei.cmu.edu/isis/guide/introduction/lcim.htm). For the purposes of this discussion, we will simplify matters and address interoperability at two levels: technical and data.
At the technical level, we have encountered several roadblocks to making data resources available through the NIF. At the most basic level are issues of access, e.g., institutional firewalls, stability of access privileges and access methods. At another level, we have encountered difficulties in efficiently accessing databases when the identifiers for individual records change upon update. We have noted that some databases and vocabularies use identifiers that get regenerated every time the resource is updated. This practice makes it very difficult for NIF to maintain appropriate indices and links. We recommend that identifiers be stable; if a record is to be removed, its identifier should be marked obsolete rather than deleted. Related to this is the use of sessions to retrieve data pages instead of stable URIs. Under this practice the application allows a user to access data only in a linear manner, i.e. the main page showing the cerebellum must be accessed before any of its subcomponents. Each session generates a temporary pointer or ‘session identifier’, which makes it difficult for a system such as NIF to make use of many of the specific data elements inside resources that use sessions extensively. To harness the power of all the data available in these resources, the data need to be made available outside of the web interface.
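To make the recommendation on identifiers concrete, here is a minimal sketch in Python. It is not how NIF or any particular database does this; the `Registry` class, the `rec:` identifier format and the `replaced_by` field are all hypothetical names chosen for illustration. The point is only that an identifier, once issued, is never reused or deleted, and that a retired record is flagged as obsolete and can point to its replacement.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Record:
    identifier: str                    # issued once, never reassigned
    label: str
    obsolete: bool = False             # set instead of deleting the record
    replaced_by: Optional[str] = None  # optional pointer to a successor


class Registry:
    """Hypothetical in-memory store that keeps identifiers stable."""

    def __init__(self) -> None:
        self._records: Dict[str, Record] = {}
        self._next_number = 1

    def add(self, label: str) -> str:
        # The counter only moves forward, so an identifier is never reused,
        # even after the record it named has been retired.
        identifier = f"rec:{self._next_number:06d}"
        self._next_number += 1
        self._records[identifier] = Record(identifier, label)
        return identifier

    def retire(self, identifier: str, replaced_by: Optional[str] = None) -> None:
        # Mark the record obsolete rather than removing it, so that
        # previously indexed links still resolve to something.
        record = self._records[identifier]
        record.obsolete = True
        record.replaced_by = replaced_by

    def resolve(self, identifier: str) -> Record:
        # Follow replacement pointers until a current record is found,
        # or return the obsolete record itself if it has no successor.
        record = self._records[identifier]
        while record.obsolete and record.replaced_by is not None:
            record = self._records[record.replaced_by]
        return record
```

With a scheme like this, an external index that stored `rec:000001` last year can still call `resolve("rec:000001")` today and either get the current record or learn that the entry has been retired.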
The above issues touch upon the ability of NIF to issue queries against a remote database, generate a search index and return results. Equally important is the ability of NIF to search the database effectively and provide comprehensible and useful results to NIF users. In our experience, the lack of a standard terminology is one of the major impediments to effective search across databases. In our very first database federation exercise, we registered 3 databases that had data on neurons: NeuronDB, Neuromorpho and CCDB. If we look at the list of neuron names, we see that there were 3 variants on the same cell class: Cerebellar Purkinje cell, Purkinje neuron and Purkinje cell. If a user typed “Purkinje”, results might not be specific to Purkinje neurons and might contain information related to Purkinje fibers. However, if a user wanted specific information about “Purkinje neuron”, they would not retrieve records from all 3 databases. Fortunately, we now have NeuroLex, a lexicon for neuroscience that maps all 3 of these terms as synonyms to each other and assigns a unique ID to the class. When a user issues a query, NIF attempts to autocomplete the terms from the NeuroLex vocabulary. If we have the term, we automatically search for synonyms. If a source uses any of the terms mapped to the ID by NeuroLex, the result will be returned. If a source uses a custom abbreviation (PC) or a symbolic notation (Purkinje cell = 1), then special mapping of the source database will have to occur using our concept mapping tool. Note that we are not considering here whether the meaning of Purkinje cell is the same across all of these sources. Meaning is a more difficult issue and one which will be addressed in a future blog post. But for now, just having a standard, non-symbolic term makes integration of databases a lot easier.
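The synonym handling described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not NIF's actual code: the class ID `example:0001` is a made-up placeholder rather than a real NeuroLex identifier, and `LEXICON`, `expand_query` and `class_id_for` are hypothetical names. The three Purkinje terms come from the example in the text.

```python
from typing import Dict, List, Optional, Set

# Hypothetical lexicon: one class ID mapped to the terms treated as synonyms.
# "example:0001" is a placeholder, not a real NeuroLex identifier.
LEXICON: Dict[str, Set[str]] = {
    "example:0001": {
        "Cerebellar Purkinje cell",
        "Purkinje neuron",
        "Purkinje cell",
    },
}


def _norm(term: str) -> str:
    # Compare terms case-insensitively and ignore extra whitespace.
    return " ".join(term.lower().split())


# Reverse index from normalized term to class ID.
INDEX: Dict[str, str] = {
    _norm(term): class_id
    for class_id, terms in LEXICON.items()
    for term in terms
}


def expand_query(term: str) -> List[str]:
    """Return every synonym of a known term, or the term itself if unknown."""
    class_id = INDEX.get(_norm(term))
    if class_id is None:
        return [term]
    return sorted(LEXICON[class_id])


def class_id_for(term: str,
                 source_map: Optional[Dict[str, str]] = None) -> Optional[str]:
    """Map a source's term to a class ID.

    source_map holds per-source abbreviations or symbolic codes
    (e.g. "PC" or "1") that the lexicon cannot know about.
    """
    key = _norm(term)
    if source_map:
        for local_term, class_id in source_map.items():
            if _norm(local_term) == key:
                return class_id
    return INDEX.get(key)
```

A query for “Purkinje neuron” expands to all three variants, so records from all three sources match. A source that stores `PC` or `1` matches only once someone supplies a per-source mapping such as `{"PC": "example:0001"}`, which is the extra work that a standard, non-symbolic term avoids.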
So here are our top 4 barriers to data federation in the NIF:
1) Unstable identifiers: Every time the database updates, the identifiers change and all pre-indexed links to those data records break;
2) Access: For increased utilization of the data, stable access needs to be provided either through a public connection to the database, a periodic dump of the database contents or through structured web services;
3) Sessions: For general information, results and data should be accessible using a static (i.e. non-session-based or stateless) URL;
4) Vocabulary: Use a standard terminology and avoid symbolic notations where possible.
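As a rough illustration of the point about sessions, the Python sketch below contrasts a session-bound link with a static one. Everything here is an assumption made for the example: the list of session parameter names is a heuristic based on common web frameworks, and the `/record/<id>` URL layout and the function names are hypothetical.

```python
from urllib.parse import parse_qsl, quote, urlsplit

# Heuristic list of parameter names that commonly carry a session identifier.
# This is an illustrative assumption, not an exhaustive or official list.
SESSION_PARAMS = {"jsessionid", "phpsessid", "sessionid", "sid"}


def looks_session_bound(url: str) -> bool:
    """Guess whether a URL only works inside one user's session."""
    parts = urlsplit(url)
    query_keys = {
        key.lower()
        for key, _ in parse_qsl(parts.query, keep_blank_values=True)
    }
    # Some Java servers append ";jsessionid=..." to the path instead.
    in_path = ";jsessionid=" in parts.path.lower()
    return in_path or bool(query_keys & SESSION_PARAMS)


def stable_record_url(base_url: str, record_id: str) -> str:
    """Build a static link that depends only on the record's identifier.

    The "/record/<id>" layout is a hypothetical convention.
    """
    return f"{base_url.rstrip('/')}/record/{quote(record_id, safe='')}"
```

A link built by `stable_record_url` can be stored in an index and followed months later, whereas a link flagged by `looks_session_bound` is likely to stop working once the session that produced it expires.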