Posted on February 10th, 2017 in Anita Bandrowski, Force11, News & Events
Science is the act of trying and trying again, whether or not we confirm what we think should be happening.
Begley and Ellis, in their 2012 paper from Amgen, stated that only about 11% of cancer studies were replicable, sending shockwaves through the scientific community for years. However, the authors did not give the scientific community the data behind those replication attempts.
This week in eLife, the Center for Open Science and a cohort of great ‘re-do-ers’ have just published the first batch of studies that are replicates of influential cancer studies, attempting to confirm or refute what the original studies claimed. We at the RRID initiative have noted that the original studies often lacked identifying information for the reagents they used, as is alluded to by some of the replication attempts. These simple omissions make replication much more difficult, something that the ‘re-do-ers’ struggled with.
This is really a monumental step, and we will wait for the final publications to determine whether these rigorous and fully transparent attempts also fall at the 11% replication level claimed by Begley and Ellis. So far, some of the replicates show trends in the same direction reported by the original study authors, though no replication attempt has panned out exactly the same way as the original paper. We certainly need to wait for the rest of the reports, but I am personally heartened that the original authors are engaged in the replication, commenting on these reports and attempting to understand both their own data and the new data.
The immortal Aristotle was once reported to say, “Quality is not an act, it is a habit.” I think that if he were alive today he would be very interested in these developments and would implore us to look at ourselves and, if we did not like what we saw, to change for the better. While none of my papers will likely be the target of this kind of scrutiny, I do hope that their methods and results will stand up in the long term. This is a call to action for all of us: to be more precise and to do better at delivering on the promise of science for the patients who deserve our very best attention and very best methodology.
Greetings SciCrunch/NIF Community!
Please read below for a message from Maryann Martone regarding the upcoming “Annotating all Knowledge” event!
Join us for the kickoff face-to-face meeting of the Annotating all Knowledge coalition. This is an all-day workshop, open to all, but space is limited, so please register ASAP. You do not have to attend the FORCE2016 meeting to attend the workshop, although we strongly encourage it!
Goals of the workshop: At the first Beyond the PDF conference, the goals were lofty and the aims clear: “to move us beyond a static and disparate data and knowledge representation”. With open, web-based annotation a reality, the time and technology are ripe to realize this vision. This workshop will bring together key players in the scholarly communications ecosystem to explore the feasibility and desirability of equipping all books, articles, and other digital media, new and old, with the capability for readers to create, share, and discover annotations from colleagues, authors, friends, and experts around the globe. This technology should be open source, federated, and based on standards. The goal: to bring open annotation to all scholarly works within 5 years.
The workshop will feature presentations, discussions and demonstrations covering current progress, opportunities and barriers to bringing a pervasive, open annotation layer to all scholarly work. We will consider how open annotation fits into the larger idea of the 21st century Scholarly Commons, through linking of annotations to systems like ORCIDs. Outcomes will be to forge relationships and outline next steps towards achieving the goal.
The NIF/SciCrunch Team
U.S. Department of Health and Human Services
NATIONAL INSTITUTES OF HEALTH
NIH News, Office of the Director (OD) <http://www.nih.gov/icd/od/>
For Immediate Release: Monday, December 9, 2013
NIH NAMES DR. PHILIP E. BOURNE FIRST ASSOCIATE DIRECTOR FOR DATA SCIENCE
National Institutes of Health Director Francis S. Collins, M.D., Ph.D., announced today the selection of Philip E. Bourne, Ph.D., as the first permanent Associate Director for Data Science (ADDS). Dr. Bourne is expected to join the NIH in early 2014.
“Phil will lead an NIH-wide priority initiative to take better advantage of the exponential growth of biomedical research datasets, which is an area of critical importance to biomedical research. The era of ‘Big Data’ has arrived, and it is vital that the NIH play a major role in coordinating access to and analysis of many different data types that make up this revolution in biological information,” said Collins.
Dr. Bourne comes to the NIH from the University of California San Diego, where he is the Associate Vice Chancellor for Innovation and Industry Alliances of the Office of Research Affairs and a Professor in the Department of Pharmacology and the Skaggs School of Pharmacy and Pharmaceutical Sciences. He also is the Associate Director of the Research Collaboratory for Structural Bioinformatics (RCSB) Protein Data Bank. Dr. Bourne was trained as a physical chemist and obtained his Ph.D. from The Flinders University in South Australia.
Dr. Bourne’s professional interests focus on relevant biological and educational outcomes derived from computation and scholarly communication. This work involves the use of algorithms, text mining, machine learning, metalanguages, biological databases, and visualization applied to problems in systems pharmacology, evolution, cell signaling, apoptosis, immunology, and scientific dissemination. He has published over 300 papers and five books. One area to which he is extremely committed is furthering the free dissemination of science through new models of publishing and better integration and subsequent dissemination of data and results.
Collins added, “I also must recognize and thank Dr. Eric Green, who served as the Acting ADDS since I announced the search to fill this new position. His willingness to take on this challenging role in its inception, and to get the ball rolling on the enormous tasks that accompany this high-priority initiative, is sincerely appreciated. Eric is certain to remain a tremendous source of knowledge and support as Phil continues the NIH’s effort to manage ‘Big Data’.”
About the National Institutes of Health (NIH): NIH, the nation’s medical research agency, includes 27 Institutes and Centers and is a component of the U.S. Department of Health and Human Services. NIH is the primary federal agency conducting and supporting basic, clinical, and translational medical research, and is investigating the causes, treatments, and cures for both common and rare diseases. For more information about NIH and its programs, visit <www.nih.gov>.
NIH…Turning Discovery into Health — Registered, U.S. Patent and Trademark Office ###
A new editorial in the journal Neuroinformatics points out that we should probably have a way of handling credit that is distinct from authorship.
Traditionally, scientific authorship has held a special status. We build our reputations, get jobs (or don’t get them), and validate ourselves with numbers such as the H-index, all based on authorship. Remember the ‘publish or perish’ mantra?
Now that scientific output is no longer just papers but also data, the question is how we can credit scientists for scientific contributions that are not papers. One obvious answer is to do it the way we always have: add them as authors.
There are authors who publish hundreds of papers a year, companies that would like to publish scientific work, consortia like ADNI that require authorship for the use of their data, and even people who play an online game being listed as authors.
Are these really authors?
Scientific output is not limited to papers, and we should not make the author field of papers a proxy for all scientific credit, especially in new modes of communication. Adding corporate authorship to publications, so that ADNI or J&J can claim authorship, is a “duct tape” solution to this question, not a real one. Authorship has to be rethought from the ground up.
The NIH recently issued a request for information about the NIH data catalog, to which our group and many others responded. Many voices, including fairly important ones from the White House, are now calling for making scientific research data open, available, and linked to the publications written about the data. This is a very good thing. It should lead to better handling and comparison of data, and to better science.
However, sitting in many recent meetings with members of various national libraries, who shall remain nameless, I am astounded to learn that not only scientists but also librarians have never found the LinkOut feature in PubMed.
LinkOut is a little option at the bottom of every article in PubMed, hidden by the good staff in complete obscurity; please see the screenshot below if you don’t believe me that such a feature exists.
Although it is certainly rarely used, the model organism communities, data repositories, and researchers have been diligently adding their data to PubMed in the form of links. We may quibble about the fact that PubMed asks many of us to reduce specific links to data to generic links that lead to another version of the same article, but the fact is that the links to data are present! Because they are present, if the National Library of Medicine ever decides to make them searchable, exportable, or even just acknowledge their existence, they would constitute a treasure trove of data-to-literature links that would not require a huge new investment in infrastructure.
I am not suggesting that our infrastructure could not be upgraded; in fact, we have many more technical gripes that I will not bring up here. I am suggesting that we all take advantage of the massive investment of time and energy by curators and authors over the last decades in meticulously linking their data or data repositories to the literature.
The LinkOut broker has helped NIF aggregate about 250,000 links from roughly 40 databases, but PubMed must hold a much, much larger set. The links provided by NIF can be searched through the NIF site, filtered by category and by database, and extracted and embedded into other sites such as ScienceDirect (see neuinfo.org/developers). Of the quarter-million links we provide to PubMed, between 100 and 200 users find them per month. I think we can and should do better.
If you feel the sudden urge to be an arm-chair activist, then please let PubMed know that it would be nice if they celebrated the current links between data and publications instead of hiding them.
It appears that the top search term is “database”, and the top database (no pun intended) is the Registry (of databases). Certainly a very interesting month.
Open Source Brain
Human Brain Atlas
Allen Brain Atlas
Gene Ontology Annotations
ABCD Brain Regions
Allen Mouse Brain Atlas
The top 25 search terms are:
“Gene Ontology Tools”
open source brain
Marijuana OR THC OR cannabis OR Cannabinoid
“Drug Related Gene Database”
I recently asked a bench scientist about her experiences publishing in a very new mode of scholarly communication, i.e., in F1000Research, which is open access, has an open review process, and is about as transparent as the community has ever asked any journal to be. The question was how she viewed this process.
To give a bit of background, she is still attempting to publish three articles in F1000Research about her work on tracking down the switch from benign to malignant tumor growth. Two of the articles are now accepted for publication and in the process of being indexed by PubMed (F1000Research 2013, 2:10 (doi: 10.12688/f1000research.2-10.v1); F1000Research 2013, 2:9 (doi: 10.12688/f1000research.2-9.v2)), and the last is in the bowels of the publishing machinery (Witkiewicz et al., Article I).
I asked her a set of questions about the review process, which she discusses below. She agreed to let me post them here. Just as a note, the articles, prior to publication, were viewed 1415, 1373, and 1005 times and downloaded 231, 330, and 321 times, respectively. This sort of buzz is seldom generated by published work, so I have been quite surprised that it can be generated prior to publication.
Your questions are easy to answer; however, I would like to point out that my answers may not well represent the larger community of younger bench scientists. My sense of right and wrong has been shaped in different countries (Poland, Austria, and Canada) and at different times. Nevertheless, here they are, for whatever they are worth:
How do you view the landscape of open scholarly communication, do you get lost in it?
If I do not feel lost in the maze of the new ways of communicating it may be because of not having explored it enough. So far I have been relying mostly on the traditional ways of searching literature: PubMed and following references within articles found that way, as needed. I do get personal copies of the Scientist and Nature Methods and attend meetings in San Diego that are relevant to my work. I think it was in The Scientist that I first read about PLoS ONE and later F1000Research. From the meetings I get new clues for additional searches of the literature on my own.
If you were asked to change your methods to include catalog numbers or unique identifiers, would this make you mad and would you comply?
The catalog numbers for antibodies, the strain of GFP labeled mice and references to cell lines are all in the first versions of the articles, as they should be. These sorts of things although tedious do not bother me and in the long run having all practical details in one easy to find place is helpful.
Were there things you appreciated about having an open review?
Yes, definitely. I very much appreciated the professional editorial help up front. Another, even more critical, point is that if the referees listed by the journal decline the invitation to write a review, others not listed there may be considered as well. I waited too long for the second and third reviews, not realizing that they would not come.
Were there things that were a lot harder?
No. It is perhaps a little hard to take that defending one’s position does not change anything in the end. The editor does not judge one way or another. However, I do not mind that, because negative comments do not disqualify the article if there are other reviews. That is fair enough. Any rules are fine with me provided all parties play by the same rules. ‘Dura lex sed lex’: the law is harsh, but it is the law.
Do you think that open review is more or less fair than traditional reviews?
Open review is more fair although fewer people are free enough to take sides in public.
Believe it or not, there really is a project called Open Source Brain, and it is a wonderful community of hackers attempting to do very novel things with open-source models, mainly in a format called NeuroML.
What is the Open Source Brain?
Well, it takes models, converts them into cool visualizations, and then allows users to manipulate them in their browser, with functionality similar to Google Body. The hope is to harness some significant computational power from the Neuroscience Gateway’s massive clusters so that the pretty pictures can be fully functional, but for now, this is a great way of exploring three-dimensional neurons and connectivity.
But the reason I am blogging about this project is not just the “ooohh-aaaahhh” factor that nice graphics usually have on me; it is also because this resource came to NIF in an interesting way: via a human flying from London on his way to another meeting. Until last week we did not know about the Open Source Brain, but Padraig knew about NIF and wanted to register the project, hoping to integrate his data or at least “get the process started”.
The identifier was then used to create a sitemap entry in the DISCO database (essentially, anyone who has logged in to NeuroLex can click a button at the bottom of a curated registry entry to do this).
Then we added an “interop” file, which instructs our crawler to pull the XML data output by Open Source Brain into our local data warehouse, making sure to specify the appropriate tables and columns.
Then we went to lunch, came back after fighting much larger crowds at the Indian place than we expected before finals, and created the “view” of the data (basically, we wrote a SQL statement and used our concept-mapping tool to define what data would be displayed).
By 3:30 pm we had a view deployed. Well, OK, we did have to import the data twice because we messed up the file once, and this deployment was on the beta server, so we had to wait until Friday night to update production, but that is still pretty darn fast in my opinion.
The question for many people who have data has been: how much effort will it take to make my data interoperable with community resources? For the first time ever, we can report that it will take only a couple of hours (with many caveats).
Scene: It is 4 am and the grant is due the next day. You have 12 pages to tell a story that will determine whether or not you can pay your salary and support your lab next year. You delete a sentence on page 5. The document swells to 13 pages. Oh no! The figures have all jumped around! The next paragraph is now in Courier 14 bold! You hit your head against the screen and spend the next hour trying to get everything back in place with the sentence deleted. The next day, you’re tired, irritable, and have a broken computer screen.
Scene: My colleagues and I are writing a paper together. We spend weeks in Google Docs working on a collaborative document. We are all almost happy with the text. We then download it and import it into Word so we can work on the formatting and the figures. We then put it in Dropbox or email it around, where all the versions get out of sync. My final edits don’t get copied into the final version. We get criticized for being sloppy.
Scene: You just got your rejection letter from Nature. They liked your paper, just not enough for this journal. They recommend Glia. You studiously look up the instructions for authors in Glia. You used a reference manager, but the references got screwed up anyway when you put the paper in Nature format. All the EndNote numbers are there, but it messed up your Word document, so you did some hand editing. You have to change “Methods” to “Materials and Methods” and make other cosmetic changes. You spend a day reformatting the paper for Glia; you don’t prepare for your lecture, you don’t mentor a student, you don’t do your research; you don’t do anything but move and rename sections, renumber references, and list the first 6 authors instead of the first 5. Science is stalled.
Scene: My child has a rare neurological disorder. A group of text miners offer to help mine the literature for a new druggable target. We ask a Foundation for access to all the literature on this disease. We are given 5000 abstracts and 600 pdfs from open access journals. We spend months working on agreements with the publishers, each of whom owns their own chunk of the field, for their content and go to open source repositories and authors. We get the other 4400 pdfs after a year. We spend 6 months cracking the pdf, trying to align the sections, and massaging the XML. My child can no longer walk.
Who is the common villain in all of this? You might be tempted to blame it on the publisher, and you wouldn’t be 100% wrong. But actually, you should blame it on formatting. At Beyond the PDF, Kaveh Bazargan put forth his proposal in the Vision session: “Why we should publish in XML and nothing else”. He didn’t win. But if you think about it, it’s actually a rather compelling vision. It’s not that he’s against formatting, it’s that formatting should be the “clothes that make the man” and not the man himself. The man should be born naked, as he is, and then dressed up to suit the occasion. And if all papers, as naked babies, are structured more or less the same, we can build all sorts of ways to add value to them.
So now let’s think about formatting. It is formatting that causes me to hit my head against the screen at 4 am. My number of words didn’t change in any significant way; the only thing that changed is how they were laid out on the page. Page numbers are a relic of the printing press. Look at your e-book on a Kindle or an iPad. There are no page numbers because there are no pages. If limits on grants and publications were consistently numbers of words and not pages, the problem would go away.
Why did I have to leave Google Docs? Because everyone knows that when it comes down to formatting and figures, Google Docs doesn’t cut it just yet. So we have to exit our workspace and go to a Word processor. Our paper didn’t change in a fundamental way; we were happy with the content. Anyone try to reconcile 3 Word documents on a Mac? There are probably great ways to do it, but I haven’t found them. Additional formatting is truly “an exercise in irrelevance”, to quote Philip Lord’s excellent blog.
Why, when we have a reference ID, the DOI, that allows a publisher to retrieve all the metadata for a piece of scholarly work, is the author tasked with conforming to a reference style? Even if it requires a publisher to pay a company somewhere to format it, isn’t that what they do already? Why is the scholar doing this? Why do the publishers then have to pay someone else (CrossRef) to put back all the information that was stripped out of the reference in the first place?
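To see why this is mechanical work, consider what a DOI already buys you: a resolvable metadata record (CrossRef serves records in roughly this CSL-JSON shape). Given such a record, any reference style is a simple transformation. The record and the two style functions below are made-up illustrations, not any real journal’s specification.

```python
# A fabricated CrossRef-like metadata record for one article.
metadata = {
    "DOI": "10.12688/f1000research.2-10.v1",
    "title": "Tracking the switch from benign to malignant growth",
    "author": [{"family": "Witkiewicz", "given": "H"}],
    "container-title": "F1000Research",
    "issued": {"date-parts": [[2013]]},
}

def fmt_numeric(m):
    """Render the record in a numbered, Vancouver-ish style."""
    authors = ", ".join(f"{a['family']} {a['given']}" for a in m["author"])
    year = m["issued"]["date-parts"][0][0]
    return f"{authors}. {m['title']}. {m['container-title']}. {year}."

def fmt_author_year(m):
    """Render the same record in an author-year style."""
    authors = ", ".join(f"{a['family']}, {a['given']}" for a in m["author"])
    year = m["issued"]["date-parts"][0][0]
    return f"{authors} ({year}) {m['title']}. {m['container-title']}."

print(fmt_numeric(metadata))
print(fmt_author_year(metadata))
```

Same record, two styles, zero author effort: that is the work we currently do by hand for every resubmission.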
And finally, as I learned from the School of Data, computers hate formatting. The first step of converting data into something a machine can read is to strip out the formatting. So if machines like to access XML or HTML5 or whatever, why are we requiring all these transformations from text to XML to PDF to XML to PDF to XML as the paper wends its way from word processor to submission to editing to final version? Who knows what comes out the other side and what happens to it along the way?
Kaveh points out that there needs to be a version of record. Isn’t it somewhat amazing that we don’t have one now? What is the version of record? The original document? I never go back and make the changes that the copy editor requests on my original text file, but that might be what I upload to my institutional repository or send to a colleague if I don’t have the rights to the PDF. The PDF at the publisher? The version at PubMed Central? Can we guarantee that they are all the same? It seems rather ridiculous that in 100 years we might have people publishing doctoral dissertations on 21st-century writings, trying to determine the author’s original intent, much as current scholars try to reconstruct Euripides’ or Shakespeare’s intent from surviving versions.
So what is the alternative? Let’s envision a world where we have a standard XML, or at least an interoperable XML, for representing scholarly papers. Wouldn’t it make more sense, as Kaveh suggested, to have the XML version be the version of record and then have publishers, authors, whomever, dress it up whatever way they want? How many different ways are there of structuring a scientific paper? Why can’t authors write their papers in a generic format and then have the publishers present them as they wish? Why can’t the peer-reviewed and typo-corrected XML version be deposited in a global ArXiv-like repository where it is accessible to text miners along with everything else? And why can’t this version appear as PDF, HTML, DOC, or anything else it needs to be?
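As a toy illustration of the “naked XML, dressed later” idea, here is one generic article record rendered two different ways from the same source. The element names are invented for this sketch, not a real JATS or publisher schema.

```python
# One XML version of record, two presentations derived from it.
import xml.etree.ElementTree as ET

RECORD = """
<article>
  <title>On Jumping Figures</title>
  <section name="Methods"><p>We deleted a sentence on page 5.</p></section>
</article>
"""

def to_html(xml_text):
    """Dress the record up as HTML for a publisher's website."""
    root = ET.fromstring(xml_text)
    parts = [f"<h1>{root.findtext('title')}</h1>"]
    for sec in root.findall("section"):
        parts.append(f"<h2>{sec.get('name')}</h2>")
        parts += [f"<p>{p.text}</p>" for p in sec.findall("p")]
    return "\n".join(parts)

def to_plain(xml_text):
    """Render the same record as vanilla plain text."""
    root = ET.fromstring(xml_text)
    lines = [root.findtext("title").upper()]
    for sec in root.findall("section"):
        lines.append(sec.get("name"))
        lines += [p.text for p in sec.findall("p")]
    return "\n".join(lines)

print(to_html(RECORD))
print(to_plain(RECORD))
```

The content never changes; only the transformation applied to it does, which is exactly the separation the vision calls for.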
Isn’t it time that we as scholars throw our copy of Microsoft Word out the window and say “I’m mad as hell and I’m not going to take it anymore”. Has anyone tried to calculate the unproductive hours spent by scholars on formatting? Has anyone else been reduced to tears by jumping figures at 4 am? Has anyone then asked “Why?”
Kaveh may not have won because to the crowd at Beyond the PDF it seemed too obvious or unsophisticated. But isn’t it the obvious that we should be going after in FORCE11? We have on-line authoring tools available; we have the formats; we have institutional repositories and the arXiv model. If we have all the tools and technology and formats necessary to make content accessible to the web, text miners and publishers alike, then the obvious questions are “Why aren’t we?” and, “What would it take to make it happen?”
Scene (2014): I sit down to write my paper, because despite all advances, I still like to craft an argument in narrative, as my forebears have done for millennia. I log into (Annotum, Authorea, WordPress, Google, Microsoft, LaTeX; insert your tool here). These authoring tools are specially designed for scholarly publishing styles. My co-authors and I log in with our ORCID iDs. We compose our paper with the appropriate sections. I insert my reference DOIs from (Mendeley, EndNote, Zotero). I insert my figure reference IDs from (my desktop, Cell Image Library, FigShare). I reference my dataset DOIs according to Amsterdam Manifesto standards from (Dataverse, NIF, CDL); my workflows from (Workflow4Ever, MyExperiment); my code from (GitHub, SourceForge). All my coauthors hit “Acceptable” and I hit “Send”.
Send to whom? Why, my preferred broker for peer review (Nature, Journal of Neuroscience, PeerJ, Rubriq). My first choice doesn’t want me, so I go to my second. They will take it but want it reduced to 5000 words and 5 figures. OK, deal. I refine my argument and shorten. By doing that, some references disappear from the text. Bye. I send it off. The peer reviewers request refinements and catch some typos. I fix them. The Glia copy editor finds some additional mistakes. I correct them. When the mistakes are fixed, my coauthors say “Fine” and I say “Final”. We have gotten our permission slip, and our work may now be admitted into the Body Scholarly for posterity. The (XML, HTML5, whatever) version is stamped “Version of Record” and enters the global ArXiv for scholarly communications.
Glia takes my XML and makes it look very pretty. They advertise it on their website and sell the pretty version back to institutions and individuals so they can read it. Some scholars have accused the publishers of existing only to sell us formatting. That’s OK. I pay for formatting all the time. I pay extra for the floral pattern on the Kleenex box because it goes with my color scheme. But my mother, who has no institutional subscription, can go to PubMed Central and get my vanilla-formatted but still functional document. Sometimes, I buy generic.
Meanwhile, my text miner friends have a subscription to the Global ArXiv via their institution or a personal one, if they are not affiliated with an institution. They agree that they will not try to recreate an individual article for resale and then mine away. Perhaps a new target is discovered for my child’s disease or maybe there isn’t enough information available. But at least we have uniform access to the entire corpus in a form suitable for mining.
Now, with all the extra time saved by not formatting, I am largely done writing my grant proposal at 11 pm instead of 4 am. It needs to be 10,000 words, with each figure counting for 500 words. At 11 pm, I am 10 words over. I delete a sentence and hit “Done”. At 11:15, I’m brushing my teeth. Grant reviewers get access to a nifty program that allows them to view, print, or read in a variety of formats depending on their medium. One reviewer prefers to print things out on paper. Each figure is a full page and the font is 18 point, so the grant is 40 pages long. But a younger reviewer with better vision who still likes paper chooses a 12-point font and a half page for figures, and her version is 12 pages. Another reviewer has an iPad and likes to swipe and zoom, but he still has my same 8,000 words and 4 figures. Same content, same length, no extra burden on the reviewers, just different formatting.
There are many other current ways to enhance scholarly communication and many that haven’t yet been invented. But I think decoupling the front-end human-readable format from the back-end machine-readable format is so fundamental to our current difficulties that we should tackle it collectively. And I believe it is doable, and it just might be the carrot that induces our scholars to abandon their current models and move toward the future. From that, all else flows.