Archive for the ‘Maryann Martone’ Category

Open Science? Try Good Science.

Posted on April 7th, 2014 in Author, Curation, Essays, Maryann Martone, News & Events | 1 Comment »

If the Neuroscience Information Framework is any guide, we are certainly in an era of “Openness” in biomedical science.  A search of the NIF Registry of tools, databases and projects for biomedical science for “Open” leads to over 700 results,  ranging from open access journals, to open data, to open tools.  What do we mean by “open”?  Well, not closed or, at least, not entirely closed.  These open tools are, in fact, covered by a myriad of licenses and other restrictions on their use.  But, the general theme is that they are open for at least non-commercial use without fees or undue licensing restrictions.


So, is Open Science already here?  Not exactly.  Open Science is more than a subset of projects that make data available or share software tools, often because they received specific funding to do so.  According to Wikipedia, “Open science is the umbrella term of the movement to make scientific research, data and dissemination accessible to all levels of an inquiring society, amateur or professional. It encompasses practices such as publishing open research, campaigning for open access, encouraging scientists to practice open notebook science, and generally making it easier to publish and communicate scientific knowledge.”   Despite the wealth of Open platforms, most of the products of science, including, most notably, the data upon which scientific insights rest, remain behind closed doors.  While attitudes and regulations are clearly changing, as the latest attempts by PLoS to establish routine sharing of data illustrate (just Google #PLOSfail), we are not there yet.

Why are so many pushing for routine sharing of data and a more open platform for conducting science?  I became interested in data sharing in the late 1990s as a microscopist, as we started to scale up the rate and breadth at which we could acquire microscopic images.  Suddenly, thanks to precision stages and wide-field cameras, we were able to image tissue sections at higher resolution over much greater expanses of tissue than before, when we were generally restricted to isolated snapshots or low-magnification surveys.   I knew that there was far more information within these micrographs and reconstructions than could be analyzed by a single scientist.  It seemed a shame that they were not made more widely available.  To help provide a platform, we established the Cell Centered Database (CCDB), which has recently merged with the Cell Image Library.  Although the CCDB succeeded in attracting data from outside researchers, we were rarely contacted by researchers wanting to deposit their data; most of the time we had to ask, although many would release their data if we did.  But I do distinctly remember one researcher saying to me:  “I understand how sharing my data helps you, but not me.”

True.  So in the interest of full disclosure, let me state a few things.  I try to practice Open Science, but I am not fanatical. I try to publish in open access journals, although I am not immune to the allure of prestigious closed journals.  I do blog, make my slides available through SlideShare, and upload pre-prints to ResearchGate.  But I remain sensitive to the fact that, through my informatics work in the Neuroscience Information Framework and my advocacy for transforming scholarly communications through FORCE11 (the Future of Research Communications and e-Scholarship), I am now in a field where:  A)  I no longer really generate data.  I generate ontologies and other information artefacts, and these I share, but not images, traces, sequences, blots, or structures.  B)  I do benefit when others share their data, as I build my research these days on publicly shared data.

But do I support Open Science because I am a direct beneficiary of open data and tools?  No.  I support Open Science because I believe that Open Science = Good Science.  To paraphrase Abraham Lincoln:  “If I could cure Alzheimer’s disease by making all data open, I would do so;  if I could cure Alzheimer’s disease by making all data closed, I would do so.”  In other words, if the best way to do science is the current mode:  publish findings in high impact journals that only become open access after a year, make sure no one can access or re-use your data, make sure your data and articles are not at all machine-processable, publish under-powered studies with only positive results, allow errors introduced by incorrect data or analyses to stay within the literature for years, then I’m all for it.

But, we haven’t cured Alzheimer’s disease or much else in the neurosciences lately.  That’s not to say that our current science, based on intense competition and opaque data and methods, has not produced spectacular successes.  It surely has.  But the current system has also led to some significant failures, as the retreat of pharmaceutical companies from neuroscience testifies.  Can modernizing and opening up the process of science to humans and machines alike accelerate the pace of discovery?  I think we owe the taxpayers, who fund our work in hope of advancing society and improving human health, an honest answer here.   Are we doing science as well as it can be done?

I don’t believe so.  And, as this is a blog and not a research article, I am allowed to state that categorically.  I believe that at a minimum, Open Science pushes science towards increased transparency, which, in my view, helps scientists produce better data and helps weed out errors more quickly.  I also believe that our current modes of scientific communication are too restrictive, and create too high a barrier for us to make available all of the products of our work, and not just the positive results.  At a maximum, I believe that routine sharing of data will help drive biomedical sciences towards increased discovery, not just because we will learn to make data less messy, but because we will learn to make better use of the messy data we have.

Many others have written on why scientists hesitate to share, or outright refuse to share, their data and process (see #PLOSfail above), so I don’t need to go into detail here.  But at least one class of frequent objections has to do with the potential harm that sharing will do to the researcher who makes data available.  A common objection is that others will take advantage of data that you worked hard to obtain before you can reap the full benefits.  Others say that there is no benefit to sharing negative results, detailed lab protocols, data, or blog posts, arguing that it is more productive for them to publish new papers than to spend time making these other products available.   Still others are afraid that if they make available data that might contain errors, their competitors will attack them and their reputations will be tarnished.  Some have noted that unlike in the Open Source Software community, where identifying and fixing a bug is considered a compliment, in other areas of scholarship it is considered an attack.

All of these are certainly understandable objections.  Our current reward system does not provide much incentive for Open Science, and changing our current culture, as I’ve heard frequently, is hard.  Yes it is.  But if our current reward system is supporting sub-optimal science, then don’t we as scientists have an obligation to change it?  Taxpayers don’t fund us because they care about our career paths.  No external forces that I know of support, or even encourage, our current system of promotion and reward:  it is driven entirely by research scientists.  Scientists run the journals, the peer-review system, the promotion committees, the academic administration, the funding administration, the scientific societies and the training of more scientists.  Given that non-scientists are beginning to notice, as evidenced by articles in the Economist (2013) and other non-science venues about lack of reproducibility, perhaps it’s time to start protecting our brand.

While many discussions on Open Science have focused on potential harm to scientists who share their data and negative results, I haven’t yet seen discussions on the potential harm that Opaque Science does to scientists.  Have we considered the harm that is done to graduate students and young scientists when they spend precious months or years trying to reproduce a result that was perhaps based on faulty data or selective reporting of results?  I once heard a heartbreaking story of a promising graduate student who couldn’t reproduce the results of a study published in a high impact journal.  His advisor thought the fault was his, and he was almost ready to quit the program.  When he was finally encouraged to contact the author, he found that they couldn’t necessarily reproduce the results either.   I don’t know whether the student eventually got his degree, but you can imagine the impact such an experience has on young scientists.   Beyond my anecdotal example above, we have documented examples where errors in the literature have had significant effects on grants awarded or on the ability to publish papers that are in disagreement (e.g., Miller, 2006).  All of these carry a very real human cost to science and scientists.

On a positive note, for the first time since I sipped the Kool-Aid back in the early days of the internet, I am seeing real movement towards change, not just by a few fringe elements, but by journals, senior scientists, funders and administrators.  It is impossible to take a step without tripping over a reference to Big Data or metadata.  Initiatives are underway to create a system of reward around data in the form of data publications and data citations.  NIH has just hired Phil Bourne, a leader in the Open Science movement, as Associate Director for Data Science.  And, of course, time is on our side, as younger scientists and those entering into science perhaps have different attitudes towards sharing than their older colleagues.   Time will also tell whether Open Science = Good Science.  If it doesn’t, I promise to be the first to start hoarding my data again and publishing only positive results.

References:

Economist, How Science Goes Wrong, Oct 19, 2013

Miller, G.  (2006) A scientist’s nightmare: software problem leads to five retractions.  Science, 314, pp 1856–1857.

 

Blog originally posted to Wiley Exchanges.

BioCreative User Interactive Task – Request

Posted on September 4th, 2013 in General information, Jonathan Cachat, Maryann Martone, News & Events | No Comments »

The Neuroscience Information Framework is proud to support the BioCreative effort (http://www.biocreative.org) whose goal is bridging the gap between text mining and biocuration.

One activity BioCreative is currently recruiting for is a User Interactive Task (IAT), in which a domain expert is recruited to test a system built for a biocuration task.

This study is conducted remotely and is time flexible, as long as the evaluation is performed by September 25.  There are two levels of participation: full participation involves curating a selected corpus (about 60 abstracts total), completing a user survey, and being listed as a co-author on the BioCreative IAT overview article; partial participation involves performing basic pre-defined tasks on the system’s website, completing a user survey, and being acknowledged in the BioCreative IAT overview article.

BioCreative has 9 systems, and one of them is EGAS (http://bioinformatics.ua.pt/software/egas/), a web-based platform for biomedical text mining and collaborative curation that supports manual and automatic annotation of concepts and relations. In this case, the task concerns protein interactions in a biomedical corpus related to neurological diseases.

Since NIF’s community expertise is in neuroscience, BioCreative asked for our help recruiting suitable participants. Specifically, they need a couple of participants for full testing of EGAS.

The benefits to biocurators participating in this activity are multifold, including:

  • direct communication and interaction with developers
  • exposure to new text mining tools that can be potentially adapted and integrated into the biocuration workflow
  • contribution to the development of text mining systems that meet the needs of the biocuration community
  • dissemination of findings in a peer-reviewed journal article.

Here is the website for BioCreative IAT: http://www.biocreative.org/tasks/biocreative-iv/iat-task-biocurators/

Link to EGAS system and task description: http://www.biocreative.org/media/store/files/2013/egas.pdf

To register, select EGAS as the system in the registration page:

http://www.biocreative.org/tasks/biocreative-iv/iat-task-biocurators/#registration

How Do You Evaluate a Database?

Posted on May 3rd, 2013 in Author, Essays, Force11, Maryann Martone, News & Events, NIFarious Ideas | 3 Comments »

by Maryann E Martone

I was speaking with a colleague recently who, like many of us, had experienced the frustration of trying to support his on-line resources.  He has assembled a comprehensive on-line resource; it is used by the community and has been used by others to publish their studies.  It is not GenBank or EBI;  it is one of the thousands of on-line databases created by individuals or small groups that the Neuroscience Information Framework and others have catalogued.  My colleague has spent years on this resource, pored over hundreds of references and entered close to a million statements into the database.  By many measures, it is a successful resource.  But in the grant review, he was criticized for not having enough publications.  I experienced the same thing in a failed grant for the resource that I had created, the Cell Centered Database.  In fairness, that was not the most damning criticism, but it just seemed so very misplaced. I had succeeded in standing up and populating a resource well before there was any thought of actually sharing data.  People used the database and published papers on it, but apparently I should have been spending more time writing about it and less time working on it.

The problems of creating and maintaining these types of resources are well known and were discussed at Beyond the PDF2:  to be funded, you have to be innovative.  But you don’t have to be innovative to be useful.  To quote or paraphrase Carole Goble at the recent conference,  “Merely being useful is not enough.”

But presumably there is a threshold of perceived value where “merely being useful” is enough.  I am thinking of the Protein Data Bank or PubMed.  These resources are well funded and well used, but hardly innovative.  I am guessing that many resources like those my colleague and I created were started with the hope that they would be as well supported and as integral to people’s work as the PDB or PubMed.  But the truth is, they are not in the same class.  They are still valuable, however, and represent works of scholarship.  We are now allowed to list them on our biosketch for NSF.  So my question to you is:  how do we evaluate these thousands of smaller databases?

Ironically, our peers have no trouble evaluating an article about our databases, but they have much more trouble evaluating the resource itself.  How does one weigh 30,000 curated statements against one article?  What level of page views, visits, downloads and citations makes a database worthwhile?  If my colleague had published 10 papers, the reviewers likely wouldn’t have checked how often they were cited, particularly if they were recent.  What is the equivalent of a citation classic for databases?  If you don’t have the budget of NCBI, then what level of service can you reasonably expect from these databases?  I thought that the gold standard was a published study, by a group unconnected to you, that used your database to do something else.  Grant reviewers found that unconvincing.  Perhaps I didn’t have enough? But how many of these do you need, relative to the size of your community, and on what time frame should you expect them to appear?  Sometimes studies take years to publish.  Do they need to come from the community that you thought you were targeting (and whose institute may have funded your resource), or does evidence from other communities count?

So perhaps if we want to accept databases and other artefacts in lieu of the article, we should help define a reasonable set of criteria by which they can be evaluated.  Anyone care to help here?

Why I started blogging: a scientist’s perspective

Posted on December 19th, 2012 in Essays, Force11, Maryann Martone, News & Events, NIFarious Ideas | 8 Comments »

by Maryann Martone

A recent post at the London School of Economics Social Science Impact blog on “Finding the time to blog” reminded me that I wanted to write a blog about why I started to blog. The use of social media and its proper place in academic communications is being discussed in many circles. Over at FORCE11, we aggregate quite a few blog feeds like the one from LSE where these issues are thoroughly covered. I wanted, however, to share a personal perspective. Like many scientists, I suspect, I was at first reluctant to blog. I did write a few posts for the NIF blog when we started it up, but then stopped because “It takes too much time”. Each blog took me several weeks before I was happy with it and, as is well advertised, blogs don’t count towards academic promotion, etc. So if I was going to spend that amount of time, I might as well spend it towards something that does count: writing papers, giving talks, training, teaching, networking and, oh, doing research. Besides, who would want to hear what I had to say?

Well, the astute reader might have noted that many of our rewarded activities involve someone (funders, conference organizers, students) actually paying to hear what we have to say. And, the astute reader might also note that a blog is a much more effective communication vehicle than most of these for accomplishing these tasks. I started to blog for real when I realized that a blog is my communication with the world. A lot of money has been invested in me as a vehicle for knowledge acquisition and integration. The more I share that with the world, the better I do my job. A blog is not a learned treatise which needs to carefully consider all angles, acknowledge all references in a specified format and go through rounds and rounds of editing to craft the language so as to offend nobody with unsupported statements. A blog is a written yet highly interactive version of the type of conversation I engage in every day with students, colleagues, audiences. It is my thoughts on a topic, developed over a lifetime of active inquiry, open to correction and discussion. You can believe them or not, just as you choose to believe them when I am speaking to you in an informal or formal setting.

But unlike these other forms of transient communication, where my words evaporate into the air, blogs live on the net. They are searched by Google, so they can be found easily. And they are living things, open to comment, discussion, updating. Once I realized what a blog could be, I could fire one off in a matter of minutes. Do I get some things wrong? Sure. But isn’t that why we communicate with each other in science, so we can try to put our thoughts in order in a way where flaws can be exposed? It was a magical moment when I read over a blog that I had posted earlier and realized that I had left out a part of the argument. Oh no! But then I just opened edit and put it in. But what if I misrepresent some part of an argument or forget to acknowledge someone? Isn’t that why we have peer review? Well, if you want peer review, just read the comments. Usually, someone will correct you if they care enough. And again, you can immediately acknowledge that input and modify your posting or post a new one. So rather than blogging taking me away from my job, I actually think it lets me do it better. It is a freeing form of communication. Scientists generally are interesting people, but you would never know it from the articles they produce. But you do when you get them talking. And that, imho, is what a blog should be: scientists talking for everyone’s benefit.

Those mean journals won’t publish my methods!

Posted on September 13th, 2012 in Essays, Force11, Maryann Martone, News & Events, NIFarious Ideas | 5 Comments »

The NIF team recently attended the Neuroinformatics Conference, held in Munich, Germany.  The conference featured several lively discussions on the reproducibility problem in neuroscience (and neuroinformatics) and what should be done.  Many in the audience complained that part of the problem is that the journals, especially the high impact ones (you know who you are), are cutting materials and methods further and further.  Many calls were made to put pressure on the publishers, and NIF is certainly all for that.  But thanks to our involvement in FORCE11, we asked the audience:  “Why are you relying on the journals for this?  If you think that you need detailed materials and methods, why aren’t you publishing them on the web?  Your paper can still be in the journal, but why aren’t you making videos explaining your methods and posting them on YouTube or SciVee?  Why aren’t you using wikis like OpenWetWare to make your detailed protocols available?  Why aren’t you writing a blog post about your paper, including more detailed methods?  Why aren’t you putting your data into public repositories?  Why aren’t you creating a video protocol with JoVE?”  I think it’s time for the scientific community at large to start asking themselves these questions.  But more importantly, it is time for the scientific community to act.  Scientists need to start cleaning their own house.  We do not have to wait for the journals to allow us to make our science more reproducible.  For the good of our respective fields, we should be doing this now.  If you don’t like these venues, NIF would be happy to host your videos and protocols.  NIF doesn’t care where something is, as long as we can link to it.  And NIF will link your protocol/video/blog to your published article in PubMed too, using our link-out feature.  These links are also featured in the NIF literature search.
And don’t go telling us that if the journals don’t do it, then it won’t be part of the permanent record.  That is undoubtedly true.  But until the journals change, the materials and methods will continue to shrink.  Isn’t a short term solution better than no solution at all?

How to bury your academic writing (or should I write that book chapter?)

Posted on August 31st, 2012 in Essays, Force11, Maryann Martone, News & Events, NIFarious Ideas | No Comments »

A recent blog post by Dorothy Bishop on “How to bury your academic writing” came through this week that considers the relative impact of book chapters vs. published articles.  She concluded that book chapters generate far fewer citations than published articles and attributed this to the fact that book chapters generally sit behind a pay wall, often a fairly hefty one (the latter is my opinion, not hers).  It prompted a follow-up blog post by Pat Thomson, “Is writing book chapters a waste of time?”, in defense of writing book chapters. I don’t think that Ms. Bishop was saying that book chapters are a waste of time; indeed, she claimed that some of her best scholarly work was done as book chapters, as the medium allows for more speculation and creativity than journal articles.  I too have found that to be true;  some of my best works were book chapters, even though I was told early in my academic career that book chapters were generally a waste of time and effort, as they did not count towards academic promotion (at least in the biomedical field). But they allowed me greater literary freedom than the typical biomedical article, and I was able to speculate and develop arguments without reviewers crying “foul!”.   But even I can’t gain access to many of these chapters anymore, except as my original Word files, unless I have a copy of the book around.  So I concur with Ms. Bishop that writing book chapters is perfectly fine, but writing them on-line, where they can be found and actually read, would likely make them much more useful.  There are a lot of interesting tools and models out there where this could be done, e.g., Wikibooks. I confess that very rarely in my career have I been tempted to answer one of the many invitations to edit a book.  But if I did, I would strongly consider taking Ms. Bishop’s advice:  “My own solution would be for editors of such collections to take matters into their own hands, bypass publishers altogether, and produce freely downloadable, web-based copy.”

Adapted from a piece I wrote at FORCE11.

Maryann Martone, Neuroscience Information Framework

So all I need is a number?

Posted on August 17th, 2012 in Curation, Force11, Interoperability, Maryann Martone | No Comments »

In the Neuroscience Information Framework (http://neuinfo.org), we often tout the importance of using unique identifiers rather than text strings as a way to ensure that search engines like NIF can mitigate the ambiguity associated with searching for strings.  NIF provides access to the largest source of neuroscience information on the web by providing simultaneous search over multiple databases, catalogs and literature databases.  If you search for Ca2 in NIF, you will find information on calcium, the hippocampus and a gene called CA2.  Unique identifiers can disambiguate among these by assigning unique handles to each;  a sort of social security number for each thing that we want to talk about.  Many groups are creating and promoting unique identifiers for all sorts of entities, such as people (e.g., ORCID) and articles (PubMed IDs), and they are very handy things.  NIF itself has gotten into the business through its unique resource identifiers and antibody IDs.   So all I need is a number, right?

Alas, no.  Because numbers, like names, are not unique either.  I just searched through NIF and found an antibody in the Beta Cell Consortium Database.  There was a column for “people who are using this” with a reference of 10077578.  Clicking on it took me to an article in PubMed, so clearly it is a PubMed ID.  Great, I thought.  I want to see who else references that paper in NIF.  So I typed PMID:10077578 into the NIF search interface and was able to retrieve the article in the NIF literature database.  But that’s not what I wanted.  Most of the time, database providers don’t provide the prefix PMID;  rather, they list just the numbers in a column labeled “Reference” or “Citation”.  So I typed in 10077578 and got multiple hits in the data federation from several databases.  Great, I thought.  Here are other sources of information that are referencing this paper.

Unfortunately, one was to Novus Biochemical antibody 100-77578, and one was to the gene Rumal_1324 (GeneID: 10077578).  So, clearly a number is not enough.  Some sort of namespace is required;  e.g., PMID:10077578 clearly tells me where I am to look.  NIF should have known better and is working to resolve this glitch by identifying each number with a prefix and, in time, a full URI (Uniform Resource Identifier, not an upper respiratory infection).  The semantic web community has been working on these standards for a long time, and discussion of the URI is beyond the scope of this post.  But this is yet another example of why we at NIF encourage resource providers to think globally about their data:  are we producing our data in a form that makes it easier to link individual parts of our resource to other parts?
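The collision above can be sketched in a few lines of Python. This is a hypothetical illustration, not NIF’s actual resolver; the records and their descriptions are invented, and only the colliding number mirrors the anecdote. The point is simply that a prefixed identifier names its namespace, while a bare number matches every namespace that happens to reuse it.

```python
# Hypothetical records, invented for illustration; keys are (namespace, local id).
RECORDS = {
    ("PMID", "10077578"): "PubMed article cited by the antibody record",
    ("GeneID", "10077578"): "gene Rumal_1324",
}

def lookup(identifier):
    """Resolve an identifier to a list of matching records.

    A prefixed identifier like 'PMID:10077578' hits at most one record;
    a bare number like '10077578' matches every namespace that reuses it.
    """
    if ":" in identifier:
        prefix, local = identifier.split(":", 1)
        hit = RECORDS.get((prefix, local))
        return [hit] if hit is not None else []
    return [rec for (prefix, local), rec in RECORDS.items() if local == identifier]

print(len(lookup("PMID:10077578")))  # 1 -- unambiguous
print(len(lookup("10077578")))       # 2 -- collides across namespaces
```

The prefix:number form is exactly the “CURIE” pattern the semantic web community uses as a compact stand-in for a full URI.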

NIF and Data Sharing

Posted on June 27th, 2011 in Essays, Force11, Maryann Martone, News & Events | 1 Comment »

It is likely that many of the members of SFN have read a bit of text when applying for NSF or NIH grants that states “thou shalt share thine data”. Actually the statements are here:

NSF Data Sharing Policy — NSF’s “Dissemination and Sharing of Research Results” states that “Proposals submitted or due on or after January 18, 2011, must include a supplementary document of no more than two pages labeled ‘Data Management Plan.’” To get you started, please see below NIF’s “Data Management Plan” text template for NSF grants.

NIH Data Sharing Policy — NIH’s “Final NIH Statement on Sharing Research Data” says that “all investigator-initiated applications with direct costs greater than $500,000 in any single year will be expected to address data sharing in their application.”

So does this mean that SFN members are running to various databases to turn over all of their negative results? Are they clamoring to post every file that their undergraduates produced to the World Wide Web? Have they accosted their university webmaster recently and loaded him/her with various disks? Well, not that we can see over in bioinformatics (see Nature’s special data issue).

So why would people want to share their data?

1. The scientific enterprise of various laboratories is paid for by the taxpayers and, as such, belongs to them. Yes, there is a lot of small print that needs to go at the end of this sentence having to do with timing of publication, human subjects etc., but the principle is sound.

2. The sharing of data is becoming possible, compared to a few years ago when my hard drive had a total capacity of 20 MB. I think that this sort of storage space is now given away with each gumball purchase.

3. We all want to be immortal, and if the papers that we have published have not given us that immortality, perhaps our data will. In all reality, data are probably as important to the scientific enterprise as journal articles were before 1975. We probably will not find a use for them in the next few years, but throwing them away seems rather distasteful.

4. Data, especially negative data, can be much more informative than the ‘representative data’ published in Figure 1. What I mean is that in a recent conversation with an NIH official, I discovered to my dismay that a clinical trial was recently aborted (aka tens of millions of dollars wasted) even though the preliminary animal data looked quite promising. The reason given was that most of the negative data had been discarded into waste bins and not published. This is a very common practice; I have picked through stacks of data to find just the perfect “representative figure.” But this practice does a disservice to science, and it may be time to change course.

5. Because we say so (here “we” means the informatics and computational neuroscientists)! While 1% of data may be published in some journal (which makes an image file available after you sign over your first-born to them), the rest of it could be used to answer interesting questions that require aggregating data from many labs. There are lots of people who are computationally inclined (it may be contagious, don’t get too close to your grad students), and they need data. Actually, they need well-curated, meaningful data that they can use to feed their algorithms. The hope is that they will build interesting models that shed new light on brain function, but they can’t do it if you toss your intellectual output into the waste basket.

Why don’t people share?

1. It is difficult: While a data sharing plan is required, neither the NSF nor the NIH actually tells people how to share their data. Furthermore, there are projects and databases that can take certain kinds of data, but not others, and there are more than 1800 of these at our last count. Some journals also require submissions to certain databases before publishing, but not all do.

2. It is costly: If a particular dataset has no home in an existing public database, then counting the cost of creating a database, systems support, a webmaster, proprietary tools, etc., sharing can eat up the bulk of most R01s.

3. You get no credit! Scientists who share their data seem to be treated no differently from those who hoard it. “Data sharers” receive no additional publications and no extra points toward their tenure review, although they do seem to be cited more often (PLOS article). Also, the NIH seems to have no mechanism in place, at the moment, to check whether data were indeed shared.

4. “It’s mine, my precious!!!” (the creature known as Gollum). Need I say more?

So here are some things that we in bioinformatics can do to help:

We bio-informaticians (aka those of us with databases) can all make an effort to structure database submission so that it is transparent, easy to reference in publications, and relatively fast. We need to make sure that the researcher is not burdened with a 400-item minimal information checklist and a deep ontological modeling of their data! Biologists mostly just need to submit an Excel sheet with some numbers in it and some identifying information, i.e., if they tell you that their subject is a B6, then don’t ask if it’s a mouse!
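To make the point concrete, here is a minimal sketch of the kind of low-friction ingest we have in mind: the submitter provides a flat spreadsheet row plus a strain name, and the database, not the biologist, infers the obvious (B6 implies mouse). The field names and the strain table are assumptions for illustration, not any real submission format.

```python
import csv
import io

# Hypothetical lookup: the database knows common strain names,
# so it never has to ask the submitter "is B6 a mouse?"
STRAIN_TO_SPECIES = {"B6": "Mus musculus"}

def ingest(csv_text):
    """Parse a flat spreadsheet-style submission and fill in
    metadata that can be inferred instead of demanded."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    for row in rows:
        row["species"] = STRAIN_TO_SPECIES.get(row["strain"], "unknown")
    return rows

submission = "subject,strain,measurement\ns1,B6,0.42\n"
rows = ingest(submission)
```

The design choice here is simply that curation burden moves to the database side; the submitter's checklist stays one row long.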

We made a cheat-sheet! At the Neuroscience Information Framework (NIF), we have compiled a few data sharing statements and put them up on the web, so they can be shamelessly copied. Of course our motive is that we would like to get the data at the time of publication, as our algorithms and our continued support from the NIH seem to depend on it.

We should also make it easy to find a good database to submit data to. The NIF project was built to account for all the things that the NIH funded in informatics (how many databases are there for RNAi? Answer), but it has grown to include a deep search of more than 80 databases and a catalog of over 3800 projects, including software tools, atlases and core facilities. This means that we are in a good position to know which projects could take your data, and if none will, then we will put the data up ourselves, as we already have the infrastructure and computer geeks required. Furthermore, we have some neurobiologists who can help translate neuro-English into computer-geek-ese!

There are many other issues in data publishing, and I hope that this post will help to stimulate that discussion.

Cheers,

——————————————-
Anita Bandrowski
NIF Project Lead
——————————————-

The Meaning of “Is”

Posted on April 16th, 2010 in Curation, Essays, Force11, General information, Maryann Martone | 1 Comment »

That’s an easy one, with all due respect to our former president.  As far as the NIF is concerned, “IS” is the inferior salivatory nucleus.  How do we know?

Perform a search in NIF and you will see various terms highlighted in the search results (the current highlighting color is brick red, but we are open to suggestions).   Hover over each of these highlighted terms and NIF will tell you what the term means to the NIF system.  If you hover over “IS,” NIF tells you it’s an anatomical structure. If you right click on it and ask to see “IS” in the NeuroLex, it will tell you that IS is an abbreviation for the inferior salivatory nucleus.  This new feature is an example of what is often called “entity recognition.”

In the formal world of knowledge representation, an entity is that which is perceived, known, or inferred to have its own distinct existence.  For NIF, entities are those things like organisms, cells, molecules, and techniques that define our domain.  These entities are represented in the NIF ontologies.  Each entity has its own numerical identifier, sort of like a social security number, that uniquely identifies the entity.  This identifier is used to map the different ways of saying the same thing to the same entity.  For example, NIF doesn’t care whether you call entity birnlex_2645 the IS, the inferior salivary nucleus, or Freddy, for that matter.  They are all (and always) the same thing.

Unfortunately, the richness and complexity of our language makes recognizing entities a tricky thing, as everyone who uses a search engine knows.  Not only can we call the same entity many things, but we can call many entities the same thing.  Chances are that the IS highlighted by NIF in the search results actually is not the inferior salivatory nucleus but the third person form of the verb “to be,” or perhaps it is the initial segment of an axon or the Institute for Science.    Right now, NIF doesn’t really know.

In future releases of NIF, we will be working towards improving the accuracy of our entity recognition.  Why?  Because once we know that IS is a brain nucleus, we can find anything that is known about it:  its projections, its genes, the diseases in which it is affected.  A preview of what is coming can be seen in the NIF Cards.

IS Search

Search for IS with NIF Card

NIF cards for each entity can be viewed by right clicking over the highlighted term and selecting “Show NIF card” from the menu. NIF cards currently are only implemented for anatomical structures and cells.

For now, however, we hope you will explore the new NIF and develop an appreciation for the difficulties of semantic search by seeing what NIF thinks the results mean.  You may be surprised!

Four Things You Can Do to Make Your Database More Interoperable

Posted on June 25th, 2009 in Force11, Interoperability, Maryann Martone, News & Events | 4 Comments »

As part of the Neuroscience Information Framework (NIF), we provide access to data contained in databases and structured web resources (e.g. queryable web services), sometimes referred to as the deep or hidden web, that are independently maintained by resource providers around the globe. We believe that this federated model is the most practical way to provide our users with access to the latest data without NIF having to maintain a centralized resource.  A federation model assumes that we can access each database or service and allow users to discover these resources through the NIF.  It also lets us merge data from different databases, essentially mixing and matching results in a way that is useful to our users.

The NIF has been registering databases and structured web resources for just over a year, moving discussions of database interoperability from the theoretical to the practical realm. As new databases are created every day, we thought it would be useful to provide our perspective on this issue so that decisions can be made at the outset that would improve the likelihood that the database can interoperate with others later on.  We are not going to discuss the relative merits of database platforms, e.g., relational, XML, object-oriented.   Nor will we consider here whether RDF is the answer to all interoperability problems (but stay tuned).  Rather, here we will focus on our experience with integration of existing databases, most of which are relational.

What is interoperability?  We define it simply as:  “the ability of a system or component to function effectively with other systems or components” (http://www.yourdictionary.com/interoperability).  Why would you, as a resource provider, want to become interoperable?  Here are 3 good reasons:

1)      To be found.  NIF is just one of many portals on the web. We specialize in scientific data, so we have tools that allow scientists to search, for example, for all genes expressed in a particular brain region, but our problems are the same ones dealt with by all search portals:  where is the information that I want?  Usually, it is scattered across web pages, PDF files and databases, many of which cannot be searched effectively by search engines.  Academics and the NIH are excellent at producing wonderful data, data models, and manuscripts describing them, but we at NIF have discovered that academics who create databases are usually not so good at marketing.  If you have just created a database, do you want others to find it and use it?  If you follow a few simple rules for your database or other data resource, your data will have a higher impact on the community.

2)      To be useful:  No matter how comprehensive a database you create, you will be capturing only a tiny fraction of the information on biological systems.  That’s why we have so many databases out there.  NIF has identified over 1500 independent databases that are potentially useful to neuroscience, and we find more every day.  A single individual would spend a lifetime locating these resources and querying them;  NIF lets you query them all simultaneously and combine and compare information across them (or at least, we will let you do that in the future).

3)      To be helpful:  I hear scientists complain all the time that they can’t use microscope parts from one manufacturer on an instrument made by another manufacturer.  I myself complain when I can’t use someone else’s cell phone charger for my phone.  Yet we as scientists are unwilling to do a little extra work to make our data and databases interoperable. We understand that databases are developed for a specific purpose by a specific group to serve their needs.  We also understand that the financial and technological resources for creating and maintaining these resources differ widely.  But we also know that there are some practices that can make it more or less difficult for a resource like NIF to make the contents of a web-accessible database available and usable.

The issue of database interoperability is a complex one, and some excellent frameworks and discussions are available.  (http://www.sei.cmu.edu/isis/guide/introduction/lcim.htm).  For the purposes of this discussion, we will simplify the issue and address issues of interoperability at two levels: technical and data.

At the technical level,  we have encountered several roadblocks to making data resources available through the NIF.  At the most basic level are issues of access, e.g., institutional firewalls, stability of access privileges and access methods.  At another level, we have encountered difficulties in efficiently accessing databases when the identifiers for individual records change upon update. We have noted that some databases and vocabularies use identifiers that get regenerated every time the resource is updated. This practice makes it very difficult for NIF to maintain appropriate indices and links. We recommend that identifiers be stable; if records are to be removed, they should be made obsolete rather than deleted.  Related to this is the use of sessions to retrieve data pages instead of stable URIs. Under this practice the application allows a user to access data only in a linear manner, i.e. the main page showing the cerebellum must be accessed before any of its subcomponents. Each session generates a temporary pointer or ‘session identifier’, which makes it difficult for a system such as NIF to make use of the specific data elements inside of resources that use sessions extensively. To harness the power of all the data available in these resources, they need to be made available outside of their web interface.
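The "obsolete rather than deleted" practice is worth spelling out, since it is cheap to adopt. Here is a minimal sketch under assumed record names: a retired record keeps its identifier and gains a flag, so an external index like NIF's never ends up holding a dangling link.

```python
# A toy record store. Keys are the stable identifiers that external
# systems (e.g. NIF's search index) link against.
records = {
    "rec_001": {"title": "cerebellum overview", "obsolete": False},
}

def retire(rec_id):
    """Mark a record obsolete instead of deleting it, so any
    pre-indexed link to rec_id still resolves (to a tombstone)."""
    records[rec_id]["obsolete"] = True  # never: del records[rec_id]

retire("rec_001")
```

An external crawler that follows an old link then sees an explicit "this record was retired" answer rather than a broken identifier, and can prune its index on its own schedule.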

The above issues touch upon the ability of NIF to issue queries against a remote database, generate a search index and return results.  Equally important is the ability for NIF to search the database effectively and provide comprehensible and useful results to NIF users.  In our experience, the lack of a standard terminology is one of the major impediments to effective search across databases. In our very first database federation exercise, we registered 3 databases that had data on neurons:  NeuronDB, NeuroMorpho and CCDB.  If we look at the list of neuron names, we see that there were 3 variants on the same cell class:  Cerebellar Purkinje cell, Purkinje neuron, and Purkinje cell.  If a user typed “Purkinje”, results might not be specific to Purkinje neurons and might contain information related to Purkinje fibers.  However, if a user wanted specific information about “Purkinje neuron”, they would not retrieve records from all 3 databases.  Fortunately, we now have the NeuroLex, a lexicon for neuroscience that maps all 3 of these terms as synonyms to each other and assigns a unique ID to the class.  When issuing a query from NIF, NIF attempts to autocomplete all terms from the NeuroLex vocabulary.  If we have the term, we automatically search for synonyms.  If a source uses any of the terms mapped to the ID by NeuroLex, the result will be returned.  If a source uses a custom abbreviation (PC) or a symbolic notation (Purkinje cell = 1), then special mapping of the source database will have to occur using our concept mapping tool.  Note that we are not considering here whether the meaning of Purkinje cell is the same across all of these sources.  Meaning is a more difficult issue and one which will be addressed in a future blog.  But for now, just having a standard, non-symbolic term makes integration of databases a lot easier.
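The synonym expansion step described above can be sketched in a few lines. This is an illustration of the idea, not NIF's actual implementation; the synset key and the record layout are assumptions, while the three Purkinje variants and the source databases are the ones named in the text.

```python
# One NeuroLex-style class ID groups all variant spellings.
# The key "nlx_purkinje" is a made-up placeholder ID.
SYNSETS = {
    "nlx_purkinje": {
        "Cerebellar Purkinje cell",
        "Purkinje neuron",
        "Purkinje cell",
    },
}

def expand(term):
    """Expand a query term to all terms sharing its class ID;
    unknown terms pass through unexpanded."""
    for synonyms in SYNSETS.values():
        if term in synonyms:
            return synonyms
    return {term}

def search(records, term):
    """Return every record whose cell name matches any synonym."""
    terms = expand(term)
    return [r for r in records if r["cell"] in terms]

federated = [
    {"db": "NeuronDB",   "cell": "Cerebellar Purkinje cell"},
    {"db": "NeuroMorpho", "cell": "Purkinje neuron"},
    {"db": "CCDB",       "cell": "Purkinje cell"},
]
```

With expansion in place, a query for "Purkinje neuron" hits all three sources; without it, it would match only the one database that happens to use that exact spelling.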

So here are our top 4 barriers to data federation in the NIF:

1)      Unstable identifiers:  Every time the database updates, the identifiers change and all pre-indexed links to those data records break;

2)      Access:  For increased utilization of the data, stable access needs to be provided either through a public connection to the database, a periodic dump of the database contents or through structured web services;

3)      Sessions: For general information, results and data should be accessible using a static (i.e. non-session-based or stateless) URL;

4)      Vocabulary:  Use a standard terminology and avoid symbolic notations where possible.