Over the past five years, we have seen a push by databases to streamline data extraction from journal articles through a variety of methods, including the development of text-mining tools, the development of stand-alone user interfaces that can mine and hyperlink journal articles, and direct collaboration with journals to ask authors to provide a minimum amount of information to the database [1, 12–16]. Authors are valuable participants in our pipeline, even though our experience with author participation demonstrates that they are not completely reliable in providing the data we request. Our experience supports the results of the FEBS Letters experiment and the BioCreative II.5 challenge [12], in which author performance was determined to be relatively low and, further, author participation did not save trained curators any significant time in the identification and extraction of relevant data. We also agree with the conclusion of the BioCreative II.5 assessment that author participation is useful when combined with database-generated annotations, both human and machine. Regardless, we think that it is still very important to involve the authors in this pipeline. Ultimately, we hope that the more frequently authors are asked to participate in database curation efforts, the more such participation will become a standard part of having their work published.
We have a narrow goal in asking for author participation in this pipeline: we ask authors to help us link objects that do not exist in the database lexicon repository, and thus in our database. We have also taken advantage of this author submission pipeline to feed a literature triage pipeline, whereby data-type curators who are not part of this markup pipeline are alerted when a paper contains new objects that will need curating. This latter use of author submissions is not a primary focus of this project, but it does provide valuable communication from which both the author and our database benefit. Any level of participation from authors is beneficial to us; we are fortunate that we have had reasonable author participation.
The proper balance between automation and manual efforts
A recent article by Atwood et al. [17] presented a nice overview of the different hyperlinking tools currently available. While fully automated hyperlinking tools can provide instantaneous links, are portable enough to use on any online HTML page, and can form links to any source, they are at a disadvantage when it comes to ensuring accuracy. By using a manual QC step, we can selectively unlink ambiguous terms, ensuring that the reader is taken to the correct webpage. One suggested solution to resolve these ambiguities is to rely on user feedback and employ the reader to correct links, an approach in use by Utopia Documents [14] and Reflect [1] (relayed to us by one of the anonymous reviewers). While this may prove an optimal solution for these fully automated tools, we start from the actual database that is being linked to, so we might not benefit as much as these other projects. However, we are open to modifications of our pipeline that would increase our efficiency. As the most time-consuming steps of this pipeline are the manual QC and XML editing, we are actively developing tools to cut down on these steps. For example, we have created an interface that allows the curator to more easily view the marked-up HTML as well as a list of the entities and links.
Benefits to the community
We established a manual QC step to resolve ambiguous or erroneous links that occur with automated linking. Automated tools cannot distinguish one entity from another entity with the same name, even if the entities are, for example, genes in entirely different species. Because of the well-established nomenclature in the C. elegans field, ambiguity is low for most of the entity classes. As we begin to link terms in classes that use more familiar terms, such as the anatomy class, we have noted a rise in the number of links to resolve.
We use our QC step to enrich our database by identifying entities missing from our database. We can also take advantage of the QC step to link synonyms of terms to the appropriate community-approved name. These synonyms could represent jargon terms or author-assigned names that did not follow community-based nomenclature rules when first adopted. For example, a researcher may have cloned a gene and assigned a name to it without first finding out whether the gene name conflicts with a pre-existing class of genes soon to be published. In such cases, the gene name used by the author can be linked to the correct sequence page at the database, offsetting any confusion should the gene name change after the article is published.
By far the most important duty of this step is to resolve ambiguous links; however, we have taken advantage of this step to feed important information back to the authors and journal as well as to enhance our own database. For example, we use the author first-pass form to capture new entities that do not exist in our database. Combined with the manual QC step, this allows us to incorporate data into our database before the paper is published. In addition, our manual QC step has identified entities that had been discovered years ago but had not yet been entered into the database; this occurred for two papers.
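The synonym handling described above can be sketched as a simple lookup. This is only an illustrative sketch, not the pipeline's actual implementation: the function names, the status labels, and the gene names (`abc-1`, `xyz-2`, and their synonyms) are all hypothetical. A term that maps to exactly one approved name is linked, a term that maps to several is held for manual QC, and a term that maps to none is reported as a possible new entity.

```python
from collections import defaultdict


def build_synonym_index(records):
    """Map every approved name and synonym to the set of approved names it may denote.

    `records` is a hypothetical dict: approved name -> list of synonyms.
    """
    index = defaultdict(set)
    for approved, synonyms in records.items():
        index[approved].add(approved)
        for synonym in synonyms:
            index[synonym].add(approved)
    return index


def resolve(term, index):
    """Return (approved_name, status) for a term found in an article.

    status is one of:
      "linked"    - exactly one approved name matches; safe to link
      "ambiguous" - several approved names match; leave for manual QC
      "unknown"   - no match; a candidate new entity for the database
    """
    candidates = index.get(term, set())
    if len(candidates) == 1:
        return next(iter(candidates)), "linked"
    if len(candidates) > 1:
        return None, "ambiguous"
    return None, "unknown"


# Hypothetical lexicon: "old-9" was used as a name for two different genes.
records = {
    "abc-1": ["lab-name-7", "old-9"],
    "xyz-2": ["old-9"],
}
index = build_synonym_index(records)
```

With this hypothetical lexicon, `resolve("lab-name-7", index)` links to `abc-1`, while `old-9` is held back for the curator because it could refer to either gene.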
Finally, our manual QC step has proved beneficial to the authors and journal as we have been able to catch typos and XML formatting errors that were missed by the authors and copy-editors.
Resolving ambiguities
The hardest hurdle in all of the hyperlinking efforts to date is resolving which species an ambiguous entity belongs to (see [18] for example). One automated method, Linnaeus, has been developed to tackle the problem of identifying all gene names belonging to all species in an article without any prior knowledge of which species are discussed [19, 20]. However, our problem differs from others in that we have the advantage of knowing a priori, from the journal publisher and the authors, which species are discussed in an article. Hence the problem of species identification itself is not a major challenge for us. In addition, as noted, authors usually specify the species name near the entity should there be any ambiguity [20]. For example, authors may use abbreviations (Sc for S. cerevisiae, Sp for Schizosaccharomyces pombe) before the entity name for disambiguation, and these abbreviations could be identified before linking is started. Finally, since we expand our pipeline one MOD at a time and work closely with MOD curators, the unique styles and conventions of writing scientific articles for each species could be captured and used for automatic disambiguation. While proximity of the species name to the entity does help in ambiguity resolution in most cases, such a heuristic approach may not work for all cases because of the complexities of natural language text. Ultimately, the manual QC step remains the best way to identify and correct any errors arising from automatic methods.
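The two heuristics mentioned above (a species abbreviation prefixed to the entity name, and proximity of a species name to the entity) can be illustrated with a minimal sketch. It is not the method used in our pipeline or in Linnaeus; the function name, the character window, and the prefix table are assumptions made for illustration, and only the Sc and Sp abbreviations come from the text. Anything the heuristics cannot resolve is returned as `None`, that is, left for manual QC.

```python
import re

# Abbreviations taken from the text; a real table would be built per MOD (assumption).
SPECIES_PREFIXES = {
    "Sc": "Saccharomyces cerevisiae",
    "Sp": "Schizosaccharomyces pombe",
}


def guess_species(text, entity, known_species, window=60):
    """Guess the species of `entity` in `text`, or return None if unresolved.

    `known_species` is the list of species known a priori for the article.
    `window` is a hypothetical number of characters searched on each side
    of the first mention of the entity.
    """
    # Heuristic 1: a species abbreviation directly before the entity name,
    # e.g. "ScCDC28", "Sc CDC28" or "Sc-CDC28".
    prefixes = "|".join(re.escape(p) for p in SPECIES_PREFIXES)
    match = re.search(rf"\b({prefixes})[ -]?{re.escape(entity)}\b", text)
    if match:
        return SPECIES_PREFIXES[match.group(1)]

    # Heuristic 2: exactly one known species name near the first mention.
    mention = re.search(rf"\b{re.escape(entity)}\b", text)
    if mention is None:
        return None
    start = max(0, mention.start() - window)
    context = text[start:mention.end() + window]
    nearby = [species for species in known_species if species in context]

    # Zero or several nearby species names: do not guess, defer to manual QC.
    return nearby[0] if len(nearby) == 1 else None
```

For instance, `guess_species("ScCDC28 is required for division.", "CDC28", [])` resolves through the prefix, whereas a sentence that names two species next to the same gene returns `None` and is passed to the curator.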
We are planning on linking articles from more model organisms and are looking for other databases and journals to actively participate in this project.