One of the main challenges of genome projects is providing reliable functional annotations for gene products. For extensively studied model organisms, such as the nematode C. elegans, high-quality functional annotations are largely gleaned from manual curation of experiments reported in the published literature, with additional annotations obtained from computational or comparative methods, such as protein domain analysis [1, 2]. For organisms with smaller research communities, however, functional annotations may initially derive largely from computational or comparative methods which, in turn, can rely heavily upon the accuracy and completeness of model organism genome curation for providing suitable reference annotations and training sets [3–5]. Thus, the extent to which model organism databases can keep pace with annotating an ever-expanding literature will potentially have an impact not only on model organism genome curation but also on the curation of a variety of other genomes. Given that manual curation is unlikely to keep up with current publication rates, developing new approaches to extracting biological facts from the published literature is imperative.
Introduced over ten years ago, the Gene Ontology (GO) has since become the de facto resource for functional genome annotation using controlled vocabularies. Divided into three distinct ontologies that describe Biological Processes, Molecular Functions, and Cellular Components, the GO is used by database curators to record key biological features of a gene product in language that is both human-readable and computationally amenable. A key feature of the GO is that its ontologies are structured as directed acyclic graphs (DAGs) in which terms have parent-child relationships, with child terms being more specific or specialized than their respective parent(s). For example, the GO term mitochondrion is a child of intracellular membrane-bounded organelle which is, in turn, a child of intracellular organelle. Annotations made to more specialized child terms thus apply transitively to the more general parent terms as well: an annotation to mitochondrion is, transitively, an annotation to intracellular organelle. A second key feature of the GO is the use of evidence codes to support annotations. Curators use evidence codes to indicate the methodology that researchers used to infer facts about the genes or gene products they are studying. For example, annotation of a gene product to the Biological Process term cell division (GO:0051301) based upon a mutant phenotype that results in arrested cell division would use the Inferred from Mutant Phenotype (IMP) evidence code. Likewise, annotation of a gene product to the Cellular Component term plasma membrane (GO:0005886) based upon immunofluorescence experiments would use the Inferred from Direct Assay (IDA) evidence code. Selection of the appropriate GO evidence code thus requires information about the experiment or assay used in a publication, which is often found only by reading the full text.
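The transitive propagation of annotations through the GO graph can be illustrated with a minimal sketch. It assumes a toy child-to-parent mapping containing only the three Cellular Component terms named above; the names `PARENTS`, `ancestors`, and `implied_annotations` are hypothetical and do not correspond to any GO software.

```python
# Toy fragment of the GO Cellular Component graph: child term -> parent terms.
# Only the two parent-child links mentioned in the text are included; the
# real ontology contains many more terms and relationships.
PARENTS = {
    "mitochondrion": ["intracellular membrane-bounded organelle"],
    "intracellular membrane-bounded organelle": ["intracellular organelle"],
    "intracellular organelle": [],
}


def ancestors(term, parents=PARENTS):
    """Return every term reachable from `term` by following parent links."""
    seen = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            # A DAG term may have several parents, so push all of them.
            stack.extend(parents.get(current, []))
    return seen


def implied_annotations(term, parents=PARENTS):
    """Return the direct annotation plus all annotations it implies."""
    return {term} | ancestors(term, parents)
```

Under these assumptions, a direct annotation to mitochondrion implies annotations to both of its more general ancestor terms, whereas an annotation to intracellular organelle implies nothing more specific.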
Manual GO curation is, therefore, a labor-intensive process that, if thorough, can require reading and annotating the full text of hundreds, if not thousands, of publications. Thus, there is a growing need for semi- or fully-automated GO curation strategies that will help database curators rapidly and accurately identify key experimental results in the full text of research articles.
Natural language processing (NLP) applications offer a promising approach to aiding manual GO curation. Such applications use various methodologies, including classification of text within documents, to assist in associating gene products to GO terms. While some of these applications have met with considerable success in suggesting possible GO annotations, only a few take into account a real-world database curation pipeline in which GO curators examine articles from a wide variety of journals and use the full text of these articles to first select a GO term and then evaluate the experimental methodology to confidently select the appropriate GO evidence code, an absolute necessity for making a GO annotation [8–16]. Thus, there is a need for text mining tools that can accurately mimic the manual curation process and so be reliably incorporated into a database curation pipeline.
Here, we describe our strategy for using natural language processing, in particular the Textpresso text mining system, to curate experimentally determined subcellular localization of C. elegans proteins using the Cellular Component ontology of GO. Textpresso, an open source text mining tool, functions as both a simple search engine and a pattern-based information extraction engine that employs categories of conceptually related words to semantically mark up the full text of papers. Examples of Textpresso categories include disease, phenotype, and regulation, which contain words and phrases such as 'Bardet-Biedl syndrome1', 'cell fate transformation', and 'downregulate', respectively. Textpresso searches may be performed using keywords and/or categories, and the results are presented as a list of sentences containing terms and phrases that match the search criteria (irrespective of their order in the sentence), numerically ranked according to the number of terms or phrases in the sentence that match the search string. At present, there are 19 implementations of Textpresso worldwide, including Textpresso for Neuroscience, budding yeast, Drosophila, Arabidopsis, and E. coli, with over 100 categories, such as Brain Area and Human Disease, Fly Body Parts, and Arabidopsis genes, to aid in searching and fact extraction.
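The keyword-and-category search described above can be sketched in a few lines. This is an illustrative approximation and not Textpresso's implementation: the category contents, the placeholder protein name `xyz-1`, the example sentences, and the rule that every search criterion must match at least once are all assumptions made for the sketch.

```python
import re

# Hypothetical miniature categories; real Textpresso categories are far larger.
CATEGORIES = {
    "phenotype": {"cell fate transformation"},
    "regulation": {"downregulate"},
}

# Invented example sentences; "xyz-1" is a placeholder protein name.
SENTENCES = [
    "Loss of xyz-1 causes a cell fate transformation.",
    "xyz-1 mutants show a cell fate transformation, and xyz-1 acts in the gonad.",
    "xyz-1 encodes a receptor.",
]


def count_hits(phrase, text):
    """Count whole-word occurrences of `phrase` in lower-cased `text`."""
    pattern = r"\b" + re.escape(phrase.lower()) + r"\b"
    return len(re.findall(pattern, text))


def score(sentence, keywords=(), categories=()):
    """Number of matching terms, or 0 if any search criterion is unmatched."""
    text = sentence.lower()
    total = 0
    for keyword in keywords:
        hits = count_hits(keyword, text)
        if hits == 0:
            return 0  # assumption: every keyword must be present
        total += hits
    for name in categories:
        hits = sum(count_hits(p, text) for p in CATEGORIES[name])
        if hits == 0:
            return 0  # assumption: every category must be represented
        total += hits
    return total


def search(sentences, keywords=(), categories=()):
    """Return (score, sentence) pairs for matching sentences, best first."""
    scored = [(score(s, keywords, categories), s) for s in sentences]
    hits = [pair for pair in scored if pair[0] > 0]
    return sorted(hits, key=lambda pair: pair[0], reverse=True)
```

Calling `search(SENTENCES, keywords=["xyz-1"], categories=["phenotype"])` ranks the second sentence first because it contains more matching terms, and omits the third because it contains no phenotype phrase. Word order within a sentence plays no role in the score.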
We focused on GO Cellular Component annotations for three main reasons, each of which suggested that this data type could serve as a first proof of principle for developing a Textpresso-based, semi-automated curation pipeline to replace a heretofore fully manual approach. First, Cellular Component annotations are generally the result of a limited number of experimental strategies, namely microscopy and subcellular fractionation, thus potentially limiting the number of terms we would need to create new curation task-specific categories. Second, because we had noted during extensive manual curation that the conclusion, i.e., subcellular localization, and the type of experimental assay are often stated in the same sentence, our search strategy of finding sentences that include proteins, cellular components, and words or phrases referring to the experimental approach was likely to be successful. Third, Cellular Component annotations derived from the small-scale experiments we were aiming to curate are typically made using one GO evidence code, "Inferred from Direct Assay (IDA)", thus simplifying the annotation process. In addition, there is a need for Cellular Component curation tools that make use of the full text of research articles, as authors often fail to include the results of such experiments in the abstracts of their papers. To illustrate, for a random sampling of 27 C. elegans proteins (see Results), only 28.4% of possible GO Cellular Component annotations could be made solely from PubMed abstracts compared to full text, with only 17.9% of those annotations as specific as annotations made from full text.
To investigate the potential usefulness of Textpresso for GO Cellular Component curation, we constructed three new Textpresso categories, termed Cellular Components, Assay Terms, and Verbs, containing terms and phrases found in a gold-standard set of sentences describing experimentally determined subcellular localization. To assess the performance of the new categories, we used them to annotate previously uncurated C. elegans proteins and found that by using Textpresso we were able to make 66.2% of all possible Cellular Component annotations with 97.3% accuracy. Further, by comparing the relative efficiencies of manual versus Textpresso-based curation, we find that Textpresso-based curation has the potential to improve curation efficiency at least eight-fold, and possibly as much as 15-fold, depending upon the individual curator. By incorporating a Textpresso-based Cellular Component curation pipeline into WormBase, we have moved from fully manual curation to computer-assisted validation for this data type.