An evaluation of GO annotation retrieval for BioCreAtIvE and GOA
© Camon et al 2005
Published: 24 May 2005
The Gene Ontology Annotation (GOA) database http://www.ebi.ac.uk/GOA aims to provide high-quality supplementary GO annotation to proteins in the UniProt Knowledgebase. Like many other biological databases, GOA gathers much of its content from the careful manual curation of literature. However, as both the volume of literature and the number of proteins requiring characterization increase, the manual processing capability can become overloaded.
Consequently, semi-automated aids are often employed to expedite the curation process. Traditionally, electronic techniques in GOA depend largely on exploiting the knowledge in existing resources such as InterPro. However, in recent years, text mining has been hailed as a potentially useful tool to aid the curation process.
To encourage the development of such tools, the GOA team at EBI agreed to take part in the functional annotation task of the BioCreAtIvE (Critical Assessment of Information Extraction systems in Biology) challenge.
BioCreAtIvE task 2 was an experiment to test if automatically derived classification using information retrieval and extraction could assist expert biologists in the annotation of the GO vocabulary to the proteins in the UniProt Knowledgebase.
GOA provided the training corpus of over 9000 manual GO annotations extracted from the literature. For the test set, we provided a corpus of 200 new Journal of Biological Chemistry articles used to annotate 286 human proteins with GO terms. A team of experts manually evaluated the results of 9 participating groups, each of which provided highlighted sentences to support their GO and protein annotation predictions. Here, we give a biological perspective on the evaluation, explain how we annotate GO using literature and offer some suggestions to improve the precision of future text-retrieval and extraction techniques. Finally, we provide the results of the first inter-annotator agreement study for manual GO curation, as well as an assessment of our current electronic GO annotation strategies.
The GOA database currently extracts GO annotation from the literature with 91 to 100% precision, and at least 72% recall. This creates a particularly high threshold for text mining systems, which in the initial results of BioCreAtIvE task 2 (GO annotation extraction and retrieval) precisely predicted GO terms only 10 to 20% of the time.
Improvements in the performance and accuracy of text mining for GO terms should be expected in the next BioCreAtIvE challenge. In the meantime the manual and electronic GO annotation strategies already employed by GOA will provide high quality annotations.
The number of proteins requiring functional characterization in the UniProt Knowledgebase is still growing. Although proteins can be electronically annotated using existing resources, the most reliable and detailed annotation is still manually extracted from the literature by a team of experts. The problem with knowledge archived in the literature is that it is represented in scientific natural language where a variety of text phrases can be used to describe the same concept. Traditionally, this information could be deciphered by humans but was not easy to interpret computationally. Furthermore, the number of biological databases has also increased so that up-to-date annotation relies on the ability to integrate information from multiple sources.
Currently, one of the most important advances in database annotation, querying and interoperability is the development and use of structured vocabularies. In this regard, one of the most successful is the 'Gene Ontology' (GO) [3, 4]. Since 2001, the GOA database [2, 5] at the EBI has used GO to provide consistent descriptors for proteins in its UniProt Knowledgebase in the categories of molecular function, biological process and cellular component.
With the success of GO's integration into the analyses of microarray [6, 7] and mass spectrometry data, academic and pharmaceutical institutions are keen to fast-track the assignment of GO terms to large datasets. Consequently, a new generation of tools has been developed which aims to predict GO annotations using interacting networks, existing protein features [2, 10], sequence and semantic similarities. Numerous text mining systems [13–15] have also attempted this task or reported results on aspects of this task.
While some of these tools are useful, others demonstrate a lack of understanding of how GO is used and queried by a biologist. For example, the GO term 'cell adhesion' (GO:0007155) has been experimentally verified as a process involving the protein ICAM1, but to assign that GO term automatically to every paper that mentions the protein ICAM1 is simply incorrect. Not every article that mentions ICAM1 will experimentally verify that process within its text; instead, it might simply describe the sequence. Annotating GO terms to biomedical literature in this way is not useful to curators, as the GO term is often not attached to a 'relevant' paper. For developers of automatic information extraction and retrieval techniques, however, this strategy might form part of a useful intermediate step to limit the number of GO terms to be searched in a given piece of text.
So what do GO curators really need? A useful tool would allow curators to retrieve all 'relevant' papers which report on the distinct features of a given protein and species and then to locate within the text the experimental evidence to support a GO term assignment. Given that GO is not designed for text mining, it is no surprise that the exact text strings of many of the 18,000 GO terms will not be found verbatim in the literature. Despite these difficulties, GOA is often asked to evaluate various automatic GO retrieval and extraction systems. To encourage their comparison and development and to save time in individually evaluating the different strategies, the GOA team was delighted to take part in task 2 of the BioCreAtIvE (Critical Assessment of Information Extraction systems in Biology) challenge.
BioCreAtIvE task 2 was designed to assess if automatically derived classification using information retrieval and extraction could assist biologists in the annotation of the GO terminology to proteins in UniProt. For the training set, participants were provided with papers linked to GO annotations from human proteins already publicly available. For the test set, we annotated 286 blind (not yet released) human proteins with GO terms using the full text of 202 Journal of Biological Chemistry articles. We manually evaluated 22,000 segments of text, which were provided to support the correct GO term and protein predictions. In this paper, we give a biological perspective on the evaluation, explain how we manually annotate GO using literature and offer some suggestions to improve the precision of future text retrieval and extraction techniques. Finally, we provide the results of the first inter-annotator agreement study for manual GO curation, as well as results assessing our current electronic GO annotation strategies, to help to establish a threshold for the text mining technology.
One of the distinguishing features of the UniProt Knowledgebase is the high level of annotation and database cross-references that are integrated with each entry. It therefore makes sense that the large-scale assignment of GO terms to the proteins in UniProt should exploit the existing knowledge stored in these entries. Enzyme Commission (EC) numbers and Swiss-Prot keywords have been manually curated into UniProt entries for many years. A manual mapping of GO terms to these existing vocabularies allows GO terms to be retrofitted to appropriate UniProt records. Similarly, UniProt records contain cross-references to the InterPro and HAMAP (High-quality Automated and Manual Annotation of microbial Proteomes) databases [16, 17]. This is because the associated sequence contains features (signatures and domains) which provide evidence for their membership in a particular protein family. Based on a review of the literature of the well-annotated family members, GO terms are manually mapped to InterPro and HAMAP records. These mappings are released monthly with every GOA release and provide a useful first-pass electronic GO annotation in the GOA database. As of March 2005, this strategy provided 69% GO coverage of UniProt records for over 85,000 species. Surprisingly, BioCreAtIvE participants did not appear to exploit these released GO annotations to help limit the GO lineages that might be found in the test set papers. Later in this paper, we will explore the accuracy of these electronic GO annotations and compare the results to the text mining systems used in BioCreAtIvE.
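The mapping-based strategy described above can be illustrated with a minimal sketch. It assumes that each UniProt entry is reduced to its cross-references and that the manual mappings are available as simple lookup tables; the function name, the table layout and the example identifiers are illustrative assumptions and are not taken from the released GOA mapping files.

```python
from typing import Dict, List

# Illustrative mapping tables (assumed values, not copied from the GOA release files).
MAPPINGS: Dict[str, Dict[str, List[str]]] = {
    "EC": {"2.7.1.1": ["GO:0004396"]},
    "InterPro": {"IPR000719": ["GO:0004672", "GO:0006468"]},
}


def electronic_annotations(
    xrefs: Dict[str, List[str]],
    mappings: Dict[str, Dict[str, List[str]]] = MAPPINGS,
) -> List[str]:
    """Return the sorted GO IDs implied by the cross-references of one entry."""
    terms = set()
    for source, identifiers in xrefs.items():
        table = mappings.get(source, {})
        for identifier in identifiers:
            # Cross-references without a mapping simply contribute nothing.
            terms.update(table.get(identifier, []))
    return sorted(terms)


# Hypothetical entry carrying one EC number and two InterPro cross-references.
entry_xrefs = {"EC": ["2.7.1.1"], "InterPro": ["IPR000719", "IPR999999"]}
print(electronic_annotations(entry_xrefs))
```

A text mining system could use the same lookup in reverse, restricting its candidate GO terms for a protein to the lineages already suggested by the electronic annotation.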
Electronic techniques are efficient in associating high-level GO terms to large datasets. On the other hand, manual curation provides more reliable and detailed GO annotation but is slower and more labour-intensive. It is clear that the manual curation process requires automatic assistance. However, before attempting to develop strategies to help curators make more rapid GO assignments, it is important to first understand current manual approaches.
Each GO Consortium member uses slightly different techniques in locating papers suitable for manual GO annotation [18, 19]. The following describes the approach of the GOA curators. First we have to decide which human proteins to prioritize for GO annotation. We concentrate on 3 categories: (a) those which have no GO annotation, (b) those which have disease relevance and (c) those which are important for microarray analyses. Having chosen the protein accession to annotate, we now need to find relevant scientific papers. The first step is to decide if the papers already linked within the UniProt entry are relevant for GO annotation. The decision on whether to read the full text of a paper is based on the curator's interpretation of the text used in the paper title or abstract. The journals cited in UniProt/TrEMBL records are inherited from EMBL/DDBJ/GenBank databases and so may describe the sequence rather than GO function, process or component. Papers that reference the sequence are accompanied by a remark located in the reference position (RP) line, which says 'SEQUENCE FROM N.A.' (nucleic acid). On the other hand, UniProt/Swiss-Prot records are manually supplemented with documents to support the annotation stored in the comment (CC) lines. In these cases, the remark in the RP line might also indicate the type of information extracted from a paper e.g. 'SUBCELLULAR LOCATION', 'FUNCTION', 'INTERACTION'. It should be noted, however, that the use of the word FUNCTION in Swiss-Prot is not the same as 'Molecular Function' usage in GO. Frequently, GO process terms can be extracted from FUNCTION CC lines.
In addition to the papers archived in the UniProt records, the NCBI PubMed advanced search is queried to find papers that support supplementary GO annotation. Various combinations of the gene and protein, full and abbreviated names are searched. Initially, searches are limited to 'Title' or 'Title/abstract' and to 'Human entries only'. Electronic GO annotation and information in UniProt/Swiss-Prot CC lines often provide curators with an insight into the types of functions that could be extracted from the literature. With this information to hand, curators are able to refine their search options to find more than enough relevant papers for GO annotation. In GOA, our current aim is to find the most recent papers which provide experimental evidence for the unique features of a given protein. Our approach is protein-centric rather than paper-centric as it is not necessary to read all the relevant papers that might be used to assign the same GO term. In the future, however, adding more papers to experimentally verify a given function will provide greater confidence to the GO annotations. A good source of a complete set of functional annotations is often retrieved from recent review articles. These reports often have links to relevant papers with experimental verification. Any papers that report new data are fed back to the UniProt curators to add to the original entry.
Most GO Consortium members would agree that the most difficult task in searching the literature is finding papers that have experimental information for a given species. Often, the species 'name' (e.g. human) is not mentioned in the 'Title' or 'Abstract' and occasionally, not directly mentioned in the full text. On these occasions, the method section of the paper has to be read and perhaps the taxonomic origin of a cell line identified before any attempt at GO curation. Filtering 'Human entries only' via PubMed is not always accurate. In addition, authors do not always cite the most up-to-date gene nomenclature e.g. use of upper case letters for human gene symbols. This is likely to affect the precision of automatic 'gene product' entity extraction techniques.
Once a relevant paper is found, the full text is read to identify the unique features of a given protein. The majority of papers will mention more than one protein; however, a curator will concentrate on capturing the information pertinent to the main protein chosen for annotation. Most curators still prefer to print out papers rather than view papers online. This is simply to limit computer eye strain and because a curator can quickly scan and select the most relevant parts of the document for curation. Words or short phrases which can be converted to GO terms are highlighted by hand and the correct GO term identifier (ID) is documented in the paper margins for review.
GO terms are chosen by querying the GO files with the QuickGO web browser [2, 23] or with a local copy of DAG-Edit (official GO editor with browsing capability). Before assigning a GO term, the definition must be read to check its suitability. Obsolete GO terms (children of obsolete molecular function (GO:0008369), obsolete cellular component (GO:0008370), obsolete biological process (GO:0008371)) are not used in annotation. When electronic or manual GO annotations become obsolete, they are manually replaced with an appropriate term. The reason for the obsoletion and suggestions for replacement GO terms are documented in GO comment lines. If a useful term is missing from the ontology, an existing GO term is in the incorrect hierarchical position or a definition needs to be refined, a curator request is sent to the GO editorial office using SourceForge [3, 25].
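A text mining system could apply the same rule before submitting predictions. The following is a minimal sketch, assuming the ontology has been loaded into a child-to-parents dictionary; the function names and the toy identifiers are hypothetical, and the three obsolete root IDs should be checked against the ontology file in use.

```python
from typing import Dict, FrozenSet, Iterable, List, Set

# Roots under which obsolete terms are filed (assumed IDs; verify against the ontology file).
OBSOLETE_ROOTS: FrozenSet[str] = frozenset({"GO:0008369", "GO:0008370", "GO:0008371"})


def ancestors(term: str, parents: Dict[str, List[str]]) -> Set[str]:
    """Collect every ancestor of a term by walking the child-to-parents links."""
    seen: Set[str] = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def strip_obsolete(
    predictions: Iterable[str],
    parents: Dict[str, List[str]],
    roots: FrozenSet[str] = OBSOLETE_ROOTS,
) -> List[str]:
    """Keep only predictions that are neither an obsolete root nor one of its descendants."""
    return [
        term
        for term in predictions
        if term not in roots and not (ancestors(term, parents) & roots)
    ]


# Toy ontology fragment with hypothetical term names.
toy_parents = {
    "TOY:retired_term": ["GO:0008369"],
    "TOY:live_child": ["TOY:live_parent"],
}
print(strip_obsolete(["TOY:retired_term", "TOY:live_child"], toy_parents))
```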
The GO Consortium avoids using species-specific definitions for GO nodes; however, some functions, processes and components are not common to all organisms. Inappropriate species-specific GO terms (e.g. germination GO:0009844) should not be manually annotated to mammalian proteins. Sometimes these inappropriate terms can be distinguished by the sensu (in the sense of) designation (e.g. embryonic development (sensu Magnoliophyta), GO:0009793). Curators are cautious when manually assigning these terms. To avoid generating inappropriate GO term assignments, the text mining community should read the GO Consortium documentation on the subject.
If a curator is unsure of which process term should accompany a function term, they can consult the 'Often annotated with' section of the QuickGO browser. Here, GO terms that are assigned in tandem are displayed. These are also referred to as common concurrent assignments and are calculated on our existing manual and electronic GO annotations.
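The exact calculation behind the 'Often annotated with' display is not described here. One plausible reading is a simple count of how often two GO terms are assigned to the same protein. The sketch below makes that assumption; the function name and the toy annotations are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations
from typing import Dict, Iterable, Set, Tuple


def concurrent_assignments(
    annotations: Iterable[Tuple[str, str]],
) -> Dict[str, Counter]:
    """Count, for each GO term, how often every other term is annotated to the same protein."""
    terms_by_protein: Dict[str, Set[str]] = defaultdict(set)
    for protein, go_id in annotations:
        terms_by_protein[protein].add(go_id)

    co_counts: Dict[str, Counter] = defaultdict(Counter)
    for terms in terms_by_protein.values():
        # Each unordered pair of terms on one protein counts once in both directions.
        for first, second in combinations(sorted(terms), 2):
            co_counts[first][second] += 1
            co_counts[second][first] += 1
    return co_counts


# Toy (protein, GO term) pairs with hypothetical identifiers.
toy_annotations = [
    ("P1", "TOY:function_a"), ("P1", "TOY:process_b"),
    ("P2", "TOY:function_a"), ("P2", "TOY:process_b"), ("P2", "TOY:component_c"),
]
counts = concurrent_assignments(toy_annotations)
print(counts["TOY:function_a"].most_common(1))
```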
Table 1. Important regions of a paper for GO annotation and the types of GO evidence codes that can typically be extracted from these regions.
Region of Paper | GO Evidence Code
(region not given) | Non-traceable author statement (NAS), Traceable author statement (TAS)
(region not given) | Non-traceable author statement (NAS), Traceable author statement (TAS)
(region not given) | All GO evidence codes
(region not given) | All GO evidence codes
(region not given) | All GO evidence codes
Materials and Methods | Identify species (via cell line). Identify GO evidence code according to experiment used.
If no functional annotation can be found for a given protein after an exhaustive literature search, the GO terms molecular_function unknown (GO:0005554), biological_process unknown (GO:0000004) or cellular_component unknown (GO:0008372) can be assigned with GO evidence code ND ('No Data').
It is clear from the above that the manual GO annotation effort has many steps, which could be assisted by automatic information extraction techniques. For these reasons, BioCreAtIvE organizers designed a biologically motivated task which asked systems to identify the proteins in the text, to check if any functional annotation was present and to return the GO term ID representing this information and the evidence text that supported the annotation.
To train systems to perform this task accurately, thousands of manual GO annotation examples were required. The training data provided to participants is documented online. Essentially, the training set was extracted from the publicly available non-redundant human GO annotation dataset (gene_association.goa_human.gz). It consisted of approximately 9000 manual GO associations linked to UniProt accessions, PubMed IDs and GO evidence codes. It was advised that GO annotations with GO evidence codes 'Inferred from Sequence/Structural Similarity' (ISS), 'Inferred by Curator' (IC) judgment and 'No Data' (ND) should be ignored.
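The filtering advice above can be sketched as follows. The sketch assumes the tab-delimited gene association layout in which the accession, GO ID, reference and evidence code occupy columns 2, 5, 6 and 7, and that comment lines begin with '!'; the function name and the sample rows are hypothetical. Electronic (IEA) annotations could be excluded in the same way by adding the code to the ignored set.

```python
from typing import FrozenSet, Iterable, List, Tuple

# Evidence codes participants were advised to ignore.
IGNORED_EVIDENCE: FrozenSet[str] = frozenset({"ISS", "IC", "ND"})


def training_rows(
    lines: Iterable[str],
    ignored: FrozenSet[str] = IGNORED_EVIDENCE,
) -> List[Tuple[str, str, str, str]]:
    """Return (accession, GO ID, reference, evidence code) for each usable annotation line."""
    rows = []
    for line in lines:
        if not line.strip() or line.startswith("!"):
            continue  # skip blank lines and header comments
        columns = line.rstrip("\n").split("\t")
        if len(columns) < 7:
            continue  # skip malformed lines
        accession, go_id, reference, evidence = columns[1], columns[4], columns[5], columns[6]
        if evidence in ignored:
            continue
        rows.append((accession, go_id, reference, evidence))
    return rows


def _sample(accession: str, go_id: str, reference: str, evidence: str) -> str:
    # Build a hypothetical annotation line; only the columns used above carry meaning.
    return "\t".join(["UniProt", accession, "SYMBOL", "", go_id, reference, evidence, "", "F"])


sample_lines = [
    "!gaf-like header comment",
    _sample("X00001", "TOY:0000001", "PMID:0000001", "TAS"),
    _sample("X00002", "TOY:0000002", "PMID:0000002", "ISS"),
    _sample("X00003", "TOY:0000003", "PMID:0000003", "IDA"),
]
print(training_rows(sample_lines))
```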
It is important to note that historically, most of the human GO annotations in the GOA database were generated before 2002. Approximately 6000 manual annotations were integrated from the former Proteome Inc. (now Incyte Genomics), which may or may not have been extracted from full text, while an additional 3000 proteins were annotated by UniProt curators from abstracts only, as part of a fast-tracking strategy. These annotations can be identified in the GOA database with GO evidence codes NAS or TAS. Since 2002, full text articles are always read but the annotation and thus the creation of a large and useful training set is slow. These data can be identified in GOA by extracting terms with the GO evidence code IDA, IEP, IMP, IGI or IPI. As a result, the amount of useful training data will be relatively small and will represent relatively few GO terms. Furthermore, the relevant passages of the text used in curation were not marked in the training set. As such, the training data was not equivalent to the task allocated (marked passages not provided). This was unavoidable given current annotation approaches and may have affected the precision and recall abilities of some systems in the first BioCreAtIvE challenge. However, a positive outcome of the BioCreAtIvE evaluation is that marked passages useful in GO curation have been manually verified and made available for future training.
GO term name: estrogen receptor activity
One difficulty in creating the test set was that curators were often restricted to a single article per protein. Normally, a curator would seek verification of author statements from more than one paper. As a result, some articles were slightly over-annotated compared to the normal curation process.
The test set was released to the BioCreAtIvE organizers on 3 November 2003. It was advised that participants should not use versions of GO archived in the CVS repository beyond this date. This was to ensure that the same GO ontology files were available to both the annotators and participants. The test set was suppressed from the monthly GOA release until January 2004.
BioCreAtIvE organizers created an online evaluation tool for task 2. For subtask 2.1, the tool displayed the UniProt accession in the test set, along with associated 'known' GO terms and documents. Participants were expected to return a segment of text (the evidence text) from the document that supported the annotation of the 'known' GO term. The provision of evidence text was critical for the evaluators as it provided a basis for rejecting or accepting that finding. Evidence text was visible to evaluators by means of a red text highlight. The full text surrounding the evidence text was also visible in black or blue font. The evaluation tool was easy to use and was designed with the evaluators to closely resemble a curation aid that might develop from this technology. Two GOA curators evaluated subtask 2.1. There were 9 distinct users for this task but 21 separate runs were submitted for evaluation.
Evaluation criteria for GO and protein predictions.
Criteria for GO term assignment | Criteria for protein association
The GO term assignment was correct or close to what a curator would choose, given the evidence text. | The protein mentioned in the evidence text correctly represented the associated UniProt accession (correct species).
The GO term assignment was in the correct lineage, given the evidence text, but was too high level (parent of the correct GO term) e.g. biological_process or too specific. | The evidence text did not support annotation to the associated UniProt accession but was generally correct for the protein family or orthologs (non-human species).
The evidence text did not support the GO term assignment. Note: The GO term may have been correct for the protein but the evidence text did not support it. | The evidence text did not mention the correct protein (e.g. for Rev7 protein (ligand) incorrect evidence text referred to 'Rev7 receptor') or protein family.
Summary of mistakes and curator comments following the task 2 evaluation.
Mistake | Curator comment
Predicting obsolete GO terms | Strip obsolete GO terms, i.e. children of obsolete molecular function (GO:0008369), obsolete cellular component (GO:0008370), obsolete biological process (GO:0008371).
Predicting GO terms from Materials and Methods e.g. 'pH' value yielded 'pH domain binding' (GO:0042731), 'CHO cell line' yielded numerous GO terms containing 'acetylcholine'. | Only look in certain sections of the paper for features. See Table 1 for GOA.
Predicting plant GO terms to human proteins e.g. germination (GO:0009844) | Look at GO Documentation on sensu and strip out unnecessary GO terms.
Highlighting too much text | Set limit on evidence text highlight to be useful for curators. Limit to <5 lines.
Over-predicting GO terms from one line of text | More important to curator to choose a higher level term that is correct than to be too specific and incorrect.
Common GO terms predicted out of context e.g. text 'mapped to chromosome 3q26' yielded GO component term 'chromosome' GO:0005694. Text indicates chromosome number, not where the protein functions. e.g. text '249 amino acid' yielded multiple GO terms i.e. 'amino acid activation' GO:0043038. | Most papers will mention chromosome location and the amino acid length of a sequence. Do not predict GO terms from text if words 'chromosome' or 'amino acid' in evidence text are accompanied by a number.
Choosing first paragraph of paper as supporting text | Although a lot of information can be found in the introduction of a paper, the task was to choose the highlight which supported the GO term. Whole paragraph highlights do not speed up the curation process. Limit to <5 lines.
Difficulty in interpreting word order e.g. 'RNA binding protein' yielded the incorrect GO prediction 'protein binding' |
Difficulty in predicting correct taxonomic origin of protein. | This can also be difficult for a curator, given lack of evidence in text.
Too many low confidence runs | Only submit data with high confidence level for evaluation. Limit participants to their best run/technique. (little difference between runs, repeat evaluations)
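Two of the curator comments above, the limit on highlight length and the rule about 'chromosome' or 'amino acid' accompanied by a number, lend themselves to a simple pre-submission filter. The following is a minimal sketch under stated assumptions: the regular expression, the function names and the choice to count lines with splitlines are illustrative and were not used by any participating system.

```python
import re

# 'chromosome' directly followed by a number, or a number directly followed by 'amino acid'.
NUMERIC_CONTEXT = re.compile(
    r"\bchromosome\s+\d|\b\d+\s*-?\s*amino[- ]acids?\b",
    re.IGNORECASE,
)

MAX_HIGHLIGHT_LINES = 4  # "Limit to <5 lines"


def has_numeric_context(evidence_text: str) -> bool:
    """True when the text gives a chromosome location or a sequence length."""
    return NUMERIC_CONTEXT.search(evidence_text) is not None


def keep_prediction(evidence_text: str) -> bool:
    """Keep a prediction only if its evidence text is short and not a numeric mention."""
    if has_numeric_context(evidence_text):
        return False
    return len(evidence_text.splitlines()) <= MAX_HIGHLIGHT_LINES


print(keep_prediction("The gene was mapped to chromosome 3q26."))
print(keep_prediction("ICAM1 mediates cell adhesion in endothelial cells."))
```

A filter of this kind only removes the most obvious out-of-context matches; it does not address word order or species assignment.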
Curator pairs compared: Curator 1 + 2 | Curator 2 + 3
Comparison of BioCreAtIvE test set manual annotations with electronic GO annotation predictions.
Categories reported: Total IEA annotations | Same lineage > granularity | Same lineage < granularity | Total same lineage | Total potential incorrect | Total minimal correct
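A comparison of this kind can be sketched as follows, assuming that 'same lineage' means one term is an ancestor of the other in the GO graph. The function names, the category labels and the toy ontology are hypothetical and are not the procedure used to produce the figures reported here.

```python
from typing import Dict, List, Set


def ancestors(term: str, parents: Dict[str, List[str]]) -> Set[str]:
    """Collect every ancestor of a term by walking the child-to-parents links."""
    seen: Set[str] = set()
    stack = [term]
    while stack:
        current = stack.pop()
        for parent in parents.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def compare_terms(predicted: str, manual: str, parents: Dict[str, List[str]]) -> str:
    """Classify an electronic prediction relative to a manual annotation."""
    if predicted == manual:
        return "exact match"
    if predicted in ancestors(manual, parents):
        return "same lineage, less granular"  # prediction is a parent of the manual term
    if manual in ancestors(predicted, parents):
        return "same lineage, more granular"  # prediction is a child of the manual term
    return "different lineage"


# Toy ontology fragment with hypothetical term names.
toy_parents = {
    "TOY:specific_binding": ["TOY:binding"],
    "TOY:binding": ["TOY:root"],
    "TOY:kinase": ["TOY:root"],
}
print(compare_terms("TOY:binding", "TOY:specific_binding", toy_parents))
```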
Manual verification of electronic GO annotation reliability on 44 proteins.
Categories reported: Total Proteins (+ predictions) | Total Proteins (- predictions) | No. GO terms predictions
The GOA database currently provides 69% GO coverage of the UniProt Knowledgebase using in-house electronic and manual annotations as well as annotations integrated from GO Consortium members including MGI, SGD and FlyBase. The analyses presented in this paper indicate that these techniques have high precision (90–100%) but every year, the number of new proteins requiring GO annotation increases. As such we need to develop new techniques to increase GO coverage without compromising on high quality annotation. We used the BioCreAtIvE functional annotation challenge as an opportunity to help the research community, which might in turn ultimately help us to speed up our curation progress. The results of BioCreAtIvE task 2 indicate that the prediction of GO terms and provision of supporting text evidence is a difficult task. However, given the simple mistakes that were made and the creation of relevant training data during the evaluation, the improved performance of text mining systems in the next BioCreAtIvE challenge is inevitable. To supplement this training data and limit the expense of future evaluations, a tool that would allow curators to highlight and even link several important sentences that support a GO annotation might be useful. Ultimately, future functional annotation challenges could be evaluated semi-automatically by matching the highlighted regions of text.
Although GO was not designed with text mining in mind, it does try to create a vocabulary for biological research that could be deciphered by both humans and machine processing. The complications in matching exact GO terms in the literature might be resolved when the GO Consortium implements its plans to decompose the GO phrases into individual words or concepts and properties, and by the mapping of more synonyms to GO terms.
Improvements in the performance and accuracy of text mining should be expected in the next BioCreAtIvE challenge. In the future we hope it will offer a useful supplement to the manual and electronic techniques already employed by GOA.
We would like to thank Alfonso Valencia, Lynette Hirschman, Christian Blaschke, Alexander Yeh and Mark Colismo for organizing the BioCreAtIvE challenge and would also like to praise the community effort of the GO Consortium.
The GOA project is supported by grants QRLT-2001-00015 and QLRI-2000-00981 of the European Commission and a supplementary grant, HG-O2273, from the National Institutes of Health (NIH).
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.