GOChase-II: correcting semantic inconsistencies from Gene Ontology-based annotations for gene products

Abstract

Background

The Gene Ontology (GO) provides a controlled vocabulary for describing genes and gene products. Despite the undoubted importance of GO, several drawbacks of GO and GO-based annotations have been reported. We identified three types of semantic inconsistency in GO-based annotations: semantically redundant, biological-domain-inconsistent and taxonomy-inconsistent annotations.

Methods

To determine the semantic inconsistencies in GO annotations, we used the hierarchical structure of the GO graph and the tree structure of the NCBI taxonomy. GO-based annotations were collected from twenty-seven biological databases to find semantically inconsistent annotations.

Results

The distributions and possible causes of the semantic inconsistencies were investigated using twenty-seven biological databases with GO-based annotations. We found that some annotation evidence codes were associated with the inconsistencies. The numbers of gene products and species in a database, which are related to the complexity of database management, also correlate with the inconsistencies. Consequently, numerous annotation errors arise and are propagated throughout biological databases and GO-based high-level analyses. GOChase-II was developed to detect and correct both syntactic and semantic errors in GO-based annotations.

Conclusions

We identified inconsistencies in GO-based annotations and provide software, GOChase-II, that corrects these semantic inconsistencies in addition to the syntactic errors already corrected by GOChase-I.

Background

The Gene Ontology (GO) project was started to provide semantic standards for the annotation of molecular attributes of genes and gene products [1]. The Gene Ontology is a controlled vocabulary for describing genes and gene products in terms of their associated biological processes, cellular components and molecular functions. The structural foundation of GO is formally a Directed Acyclic Graph (DAG) in which the terms are the nodes and the relationships are the edges of the graph [2].

GO has grown enormously. The number of organism groups participating in the GO Consortium has grown every quarter from the initial three to roughly two dozen [3]. Many biological databases use GO to annotate the molecular attributes of genes and gene products [4, 5]. GO-based analyses of microarray and mass spectrometry data have been successfully realized [3]. Recently, a new generation of GO-based tools has been developed to enhance biological knowledge, for example by classifying protein structures [6], predicting gene-phenotype associations [7] and building gene networks [8]. More details are available at the GO website (http://www.geneontology.org/GO.tools.shtml). The Unified Medical Language System (UMLS) Metathesaurus has been integrated with GO to expand UMLS into the biological domain [9].

Despite the undoubted importance of GO, several drawbacks of GO and GO-based annotations have been reported. Masseroli and Pinciroli pointed out structural and semantic problems of GO such as metonymy, species-specific terms and multiple paths [10]. Dolan et al. evaluated the reliability of GO-based annotations [11] and reported poor inter-annotator reliability for human-mouse orthologous gene pairs between two gene-annotation groups, MGI and GOA. Park et al. identified syntactic errors caused by two GO-update operations, ‘new obsoletions’ and ‘new term merges’, that are used in the course of GO version changes [12]. They introduced GOChase to detect and correct the syntactic errors and error propagations in GO-based annotations (http://www.snubi.org/software/GOChase/).

In the present study, we further identified three types of semantic error in GO-based annotations: redundant, biological-domain-inconsistent and taxonomy-inconsistent annotations.

The first type is “redundant annotation.” According to the current GO annotation paradigm, a gene annotated to a GO term is considered to be implicitly annotated to all parents of that term. Assigning both a parent term and its child term to the same gene is therefore regarded as “redundant annotation.” In some cases, however, when the parent and child terms are annotated to a specific gene product with different evidence codes, the annotations cannot be called completely redundant. For example, an experiment may provide enough evidence to annotate to a parent but not to any specific child, whereas a more specific annotation may be predicted by sequence comparison or other computation. In such cases both annotations would be retained: the parent because of its experimental support and the child for its specificity. We therefore analyzed redundant annotations separately according to whether the parent and child terms use the same or different evidence codes.

The second type is “biological-domain-inconsistent annotation.” A GO term should avoid species-specific definitions and should instead be applicable to more than one taxonomic class of organisms (The Gene Ontology Consortium, 2000). Some GO terms nevertheless have species-specific characteristics, such as nucleus (GO:0005634), which is specific to eukaryotes, and unidirectional conjugation (GO:0009291), which is specific to prokaryotes. As GO-based annotation expands to various species, such species-specific terms become increasingly problematic. For example, a gene product with UniProt ID O24899 from Helicobacter pylori, a bacterium, is wrongly annotated to nucleus, a eukaryote-only GO term.

The third type is “taxonomy-inconsistent annotation.” Recently, the GO Consortium provided terms with taxonomy restrictions, which pair species-specific terms with the NCBI taxonomy groups for which they are or are not appropriate (http://www.geneontology.org/GO.sensu.shtml). Forty-four taxonomic groups had taxonomy-restricted terms in the January 2010 GO version. A taxonomy-inconsistent annotation occurs when a taxonomy-restricted term is annotated to a gene that does not belong to the corresponding taxonomy group. The GO Consortium checks annotations against the taxonomy-restricted terms and provides reports of inconsistent annotations, but many annotations have been produced without considering the restrictions. For example, we found that a eukaryote-restricted GO term, Golgi apparatus (GO:0005794), was wrongly annotated to 27 gene products of Escherichia coli, a bacterium.

In the present study, we analyzed the distributions of the semantic inconsistencies in GO-based annotations using 27 major biological databases. To understand the factors influencing such inconsistent annotations, we performed a correlation analysis between the inconsistent annotations and their possible attributes, including the usage of evidence codes (http://www.geneontology.org/GO.evidence.shtml), the number of gene products, the number of species and the number of GO terms. We developed a set of web-based utilities, GOChase-II, to correct the semantic inconsistencies in addition to the syntactic errors already corrected by GOChase-I [12].

Material and methods

Databases

We obtained GO DB downloads from the GO database site (http://www.geneontology.org/GO.downloads.database.shtml). We collected GO-based annotations for genes and gene products from 27 major biological databases, including NCBI’s Gene and Ensembl. The GO DB schema used for data integration was obtained at http://www.geneontology.org/images/diag-godb-er.jpg. To extract the GO-update history, we downloaded GO monthly reports from January 2000 to December 2007 from the GO FTP site (ftp://ftp.geneontology.org). Because the GO Consortium has not provided monthly reports since January 2008, we used the OBO-Edit tool to generate monthly GO change reports from that date onward [15]. OBO-Edit-generated reports provide four additional types of change: change comment, change synonym, change category, and change external reference. They also provide the six types of change defined by the monthly reports: new term, new obsoletion, term name change, new definition, new term merge and term movement. We parsed these 11 types of change to resolve the GO-update history. The NCBI taxonomy database (ftp://ftp.ncbi.nih.gov/pub/taxonomy/) was downloaded to find and correct biological-domain-inconsistent annotations. The NCBI taxonomy database indexes over 320,000 named organisms that are represented in the databases with at least one nucleotide or protein sequence [16].

Semantic inconsistencies

The hierarchical relationships extracted from the GO DAGs were used to determine redundant annotations. For each gene product in the 27 biological databases, every pair of GO terms annotated to that gene product was tested for a parent-child relationship (Table 1). We distinguished redundant annotations in which the parent and child terms use the same evidence code from those in which they use different evidence codes. In some cases (see the Background section), parent-child annotations with different evidence codes serve as supporting data rather than true redundancy.
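The parent-child test described above can be sketched as follows. This is a minimal illustration rather than the GOChase-II implementation: the function names, the `child -> parents` dictionary layout and the toy term identifiers (`GO:A`, `GO:B`, `GO:C`) are assumptions made for the example.

```python
from collections import defaultdict


def ancestors(term, parents):
    """Return every ancestor of `term` in a DAG given a child -> parents map."""
    seen = set()
    stack = list(parents.get(term, ()))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            stack.extend(parents.get(node, ()))
    return seen


def redundant_pairs(annotations, parents):
    """Find redundant (gene, ancestor, descendant, same_evidence) tuples.

    `annotations` is an iterable of (gene_product, go_term, evidence_code).
    `same_evidence` is True when both terms share at least one evidence code.
    """
    by_gene = defaultdict(lambda: defaultdict(set))
    for gene, term, evidence in annotations:
        by_gene[gene][term].add(evidence)

    found = []
    for gene, terms in by_gene.items():
        for child in terms:
            # Any annotated term that is also an ancestor of `child` is redundant.
            for parent in ancestors(child, parents) & set(terms):
                same = bool(terms[parent] & terms[child])
                found.append((gene, parent, child, same))
    return sorted(found)


# Toy DAG (hypothetical identifiers): GO:C is_a GO:B is_a GO:A.
TOY_PARENTS = {"GO:C": ["GO:B"], "GO:B": ["GO:A"]}
TOY_ANNOTATIONS = [
    ("P1", "GO:A", "IDA"),  # experimental support for the ancestor
    ("P1", "GO:C", "IEA"),  # electronic annotation for the descendant
    ("P2", "GO:B", "IEA"),
    ("P2", "GO:C", "IEA"),
    ("P3", "GO:C", "IDA"),  # no ancestor annotated, so not redundant
]
pairs = redundant_pairs(TOY_ANNOTATIONS, TOY_PARENTS)
```

Checking against all ancestors, not only direct parents, reflects the paradigm that an annotation implies every term on the path to the root; the `same_evidence` flag separates the two cases analyzed in Table 1.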

Table 1 Redundant annotations in biological databases

To find biological-domain inconsistency in GO annotations, we reviewed and manually extracted 410 ‘eukaryote-only’ and 73 ‘prokaryote-only’ GO terms, including such terms as RNA import into nucleus and ketodeoxyoctanoate biosynthesis (see Additional files 1 and 2). All gene products in the 27 databases were divided into non-prokaryotic and non-eukaryotic classes according to the species definition in the NCBI taxonomy. Biological-domain-inconsistent annotation was determined by testing the consistency between the species of a gene product and the ‘prokaryote-only’ or ‘eukaryote-only’ classification of the annotated term.
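As a rough sketch of this consistency test, the snippet below flags annotations whose term is restricted to one domain while the species belongs to another. The two GO IDs come from the Background section; the data layout, the function name and the `example_human_protein` identifier are assumptions made for illustration, and the real term lists are those in Additional files 1 and 2.

```python
# Tiny stand-ins for the manually curated term lists (Additional files 1 and 2).
EUKARYOTE_ONLY = {"GO:0005634"}   # nucleus
PROKARYOTE_ONLY = {"GO:0009291"}  # unidirectional conjugation


def domain_inconsistent(annotations, species_domain):
    """Return annotations whose term does not fit the species' domain.

    `annotations` is an iterable of (gene_product, species, go_term).
    `species_domain` maps a species name to "eukaryote" or "prokaryote".
    """
    flagged = []
    for gene, species, term in annotations:
        domain = species_domain.get(species)
        if term in EUKARYOTE_ONLY and domain != "eukaryote":
            flagged.append((gene, species, term))
        elif term in PROKARYOTE_ONLY and domain != "prokaryote":
            flagged.append((gene, species, term))
    return flagged


SPECIES_DOMAIN = {
    "Helicobacter pylori": "prokaryote",
    "Homo sapiens": "eukaryote",
}
EXAMPLE_ANNOTATIONS = [
    ("O24899", "Helicobacter pylori", "GO:0005634"),          # case cited in the Background
    ("example_human_protein", "Homo sapiens", "GO:0005634"),  # hypothetical, consistent
]
flagged = domain_inconsistent(EXAMPLE_ANNOTATIONS, SPECIES_DOMAIN)
```

Note that a species missing from `species_domain` is flagged for either kind of restricted term in this sketch; a production check would more likely report such cases separately.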

There were 44 taxonomy groups with taxonomy-restricted terms in the January 2010 GO version. Taxonomy-inconsistent annotation was determined by testing for inconsistency between the taxonomy restriction of a GO term and the species of origin of the annotated gene product.
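One way to picture this test is to walk the NCBI taxonomy tree from the species of a gene product up to the root and check whether the required taxonomy group lies on that path. The sketch below assumes a `child -> parent` map of NCBI taxon IDs (heavily abridged here, with intermediate ranks omitted) and handles only an "only in taxon" style of restriction; the function names and gene identifiers are hypothetical.

```python
def lineage(taxon, parent_of):
    """Return `taxon` followed by its ancestors, walking a child -> parent map."""
    path = []
    while taxon is not None and taxon not in path:
        path.append(taxon)
        taxon = parent_of.get(taxon)
    return path


def taxonomy_inconsistent(annotations, only_in_taxon, parent_of):
    """Return (gene, taxon, term) triples that violate a taxonomy restriction.

    `only_in_taxon` maps a restricted GO term to the taxon it is limited to.
    """
    flagged = []
    for gene, taxon, term in annotations:
        required = only_in_taxon.get(term)
        if required is not None and required not in lineage(taxon, parent_of):
            flagged.append((gene, taxon, term))
    return flagged


# Abridged tree: 562 Escherichia coli, 2 Bacteria, 9606 Homo sapiens,
# 2759 Eukaryota, 131567 cellular organisms, 1 root (its own parent).
PARENT_OF = {562: 2, 9606: 2759, 2: 131567, 2759: 131567, 131567: 1, 1: 1}
ONLY_IN_TAXON = {"GO:0005794": 2759}  # Golgi apparatus, restricted to eukaryotes

TAXON_ANNOTATIONS = [
    ("ecoli_gene_1", 562, "GO:0005794"),   # hypothetical E. coli gene product
    ("human_gene_1", 9606, "GO:0005794"),  # hypothetical human gene product
]
violations = taxonomy_inconsistent(TAXON_ANNOTATIONS, ONLY_IN_TAXON, PARENT_OF)
```

The `taxon not in path` guard stops the walk at the root, which is recorded as its own parent in the taxonomy dump.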

Attributes for inconsistent annotation

In search of possible attributes for the inconsistent annotations, we evaluated five candidates by correlation analysis: the use of different evidence codes, the number of gene products, the number of species, the number of GO terms, and the average number of GO annotations. Every GO annotation is supposed to indicate its type of evidence. There are 18 evidence codes currently available. When no evidence code was assigned to an annotation, we marked it as ‘Not Available (NA)’.
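The text does not state which correlation coefficient was used; the sketch below assumes the Pearson coefficient r, computed across databases between one attribute (for example, the number of IEA annotations per database) and the number of inconsistent annotations. The per-database counts are invented purely for illustration.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Hypothetical per-database counts (one entry per database), not real data.
iea_annotations = [120, 4500, 980, 15000]
redundant_annotations = [10, 400, 95, 1300]
r = pearson_r(iea_annotations, redundant_annotations)
```

With counts that span several orders of magnitude, as they do across the 27 databases, a few very large databases can dominate r, so a rank-based coefficient may be worth comparing.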

Results

To analyze the distributions of the semantic inconsistencies in GO-based annotations, we calculated the distribution of redundant annotations in the 27 biological databases (Table 1). All databases contain redundant annotations. The fraction of redundant annotations ranges from 0.9% to 91% for gene products (31% on average), from 2% to 26% for GO terms (13% on average), and from 0.4% to 38% for GO annotations (12% on average). UniProtKB/Swiss-Prot shows the highest redundancy for gene products (91%) and GO annotations (38%). The database showing the highest redundancy in GO terms is Ensembl (24%). GeneDB_Pfalciparum shows the lowest values among the databases: 0.9% for gene products, 2.5% for GO terms and 0.4% for GO annotations. In all databases, the fraction of redundant annotations based on the same evidence code is larger than that based on different evidence codes.

The distributions of biological-domain-inconsistent annotations were calculated using the prokaryote-only and eukaryote-only GO terms we defined. Biological-domain-inconsistent annotations were found in thirteen databases for non-prokaryotic gene products and in eight databases for non-eukaryotic gene products (Table 2). Most databases have fewer than 100 inconsistent annotations; the exceptions are four databases (Ensembl, NCBI Gene, UniProtKB/Swiss-Prot and UniProtKB/TrEMBL). In both biological domains, UniProtKB/TrEMBL has the highest proportion of inconsistent annotations by all three measures.

Taxonomy-inconsistent annotations were found in 27 of the 44 taxonomy groups having at least one taxonomy-restricted GO term (Table 3 in Additional file 3; see Methods). Table 3 in Additional file 3 shows the numbers of taxonomy-inconsistent annotations (as numerators) and the numbers of taxonomy-restricted GO terms used (as denominators) in the 27 databases. A blank cell means that no annotation uses a taxonomy-restricted GO term. Taxonomy-inconsistent annotations are not evenly distributed across databases or taxonomy groups (Table 3 in Additional file 3). NCBI Gene, Ensembl, UniProtKB/Swiss-Prot and UniProtKB/TrEMBL, for example, have inconsistent annotations in most taxonomy groups, whereas no taxonomy-inconsistent annotation was found in five databases: CGD (0/2174), GeneDB_Tbrucei (0/422), NCBI (0/1291), PseudoCAP (0/20), and UniProt (0/17). Interestingly, all annotations for Passeriformes (11/11) are taxonomy inconsistent. Cellular organisms show the lowest rate of taxonomy-inconsistent annotations (7/42344) among the 27 taxonomy groups.

Table 2 Biological-domain-inconsistent annotations in biological databases

To investigate which factors are related to each type of inconsistent annotation, we analyzed the correlation between the three types of inconsistent annotation and 23 possible attributes (Table 3). As shown in Table 3, Inferred from Electronic Annotation (IEA) shows the highest correlation with redundant (r=0.99) and taxonomy-inconsistent annotation (r=0.99). Biological-domain-inconsistent annotation shows a high correlation with the number of gene products (r=0.97). The number of species and the average number of GO annotations show high correlations, whereas the number of GO terms shows a low correlation, with all types of inconsistent annotation (Table 3).

Table 3 Gene Ontology distribution incorrectly annotated across evidence codes and the related factors

GOChase-II implementation

GOChase-I [12] is a set of web-based utilities to detect and correct syntactic errors in GO-based annotations caused by GO versioning and tracing problems. In contrast, GOChase-II (http://www.snubi.org/software/GOChase2/) attempts to correct semantic errors in GO-based annotations. It provides four web-based interfaces.

(1) GOChase-History resolves the whole evolution history of a GO ID. As an example, the GO term sorocarp development (GO:0030587) has repeatedly moved back and forth among fifteen GO terms (reproduction, cell communication, development, response to external stimulus, physiological process, biological_process, response to biotic stimulus, morphogenesis, multicellular organismal development, anatomical structure morphogenesis, anatomical structure development, asexual reproduction, fruiting body development in response to starvation, fruiting body development, response to starvation) through 31 GO operations in fifteen updates between March 2002 and November 2008.

(2) GOChase-Species resolves the distribution of the usage of a GO term across different species and displays the distribution on the taxonomy tree. The Species function is a powerful tool for analyzing the species specificity of a GO term. Some terms are limited to specific species whereas others are used for a wide variety of species. For example, negative regulation of vulval development (GO:0040027) is annotated 395 times but exclusively to Caenorhabditis elegans (i.e. 100%). It is suggested that cyanelle may be a species-specific term. We identified 3548 GO terms annotated to only a single species in the January 2010 GO version (see Additional file 4). On the other hand, oxidoreductase activity (GO:0016491) is annotated 800,048 times to 108,929 different species (i.e. 7.3 times per species on average). The Species function can also be used to find incorrect uses of species-specific terms.

(3) GOChase-Correct highlights a 'merged term' and redirects it to the correct 'target term' into which the term has been merged. For an obsolete term, GOChase provides the alternative terms. GOChase-Correct also corrects redundant and biological-domain-inconsistent annotations.

(4) When one inputs a GO ID, GOChase resolves all gene products annotated with that GO ID across all the databases in Table 1. GOChaser provides GO enrichment analysis for input gene-expression clusters. Although most GO enrichment analysis tools have similar functionality [14], GOChaser is unique in correcting both syntactic and semantic errors to improve the analysis results. GOChaser provides two statistical models, the hypergeometric test and Fisher’s exact test, with multiple-hypothesis testing correction (Bonferroni correction).
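For readers unfamiliar with the enrichment statistics named above, the following is a minimal sketch of a one-sided hypergeometric test with Bonferroni correction. It is not GOChaser's code; the function names and the example counts are assumptions made for illustration.

```python
from math import comb


def hypergeom_pvalue(k, n, K, N):
    """Upper-tail probability P(X >= k) for a hypergeometric draw.

    N: genes in the background, K: background genes annotated to the term,
    n: genes in the cluster, k: cluster genes annotated to the term.
    """
    upper = min(n, K)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)


def bonferroni(pvalues):
    """Multiply each p-value by the number of tests and cap the result at 1."""
    m = len(pvalues)
    return {term: min(1.0, p * m) for term, p in pvalues.items()}


# Hypothetical example: a 3-gene cluster drawn from 10 background genes,
# 3 of which carry the term, with all 3 cluster genes annotated to it.
p = hypergeom_pvalue(k=3, n=3, K=3, N=10)
adjusted = bonferroni({"term_a": 0.01, "term_b": 0.6})
```

The upper tail of the hypergeometric distribution is also what a one-sided Fisher's exact test computes on the corresponding 2x2 table. Correcting annotation errors before counting `k` and `K` changes those counts directly, which is how the error corrections can affect enrichment results.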

Conclusion and discussion

We identified and corrected three types of semantic inconsistency in GO-based annotations for gene products from 27 major biological databases. GO has become a widely accepted ontology in the biomedical field. The under-managed errors and inconsistencies may reflect its short history, its ever-growing complexity, and the vast amount of biological domain knowledge. Recently, the GO Consortium started refining GO content and structure [17]. The present study demonstrates that the GO community may be empowered by bioinformatics tools that provide error-proofing mechanisms for the GO hierarchical relationships, species-specific definitions and GO term usage guidelines.

To sum up our results, no database is free from semantically inconsistent annotations. Among the three types of semantically inconsistent annotation, redundant annotation is the most common error: about 12% of all annotations are redundant. Only a few biological-domain-inconsistent annotations are found in the 18 biological databases because of the small numbers of ‘eukaryote-only’ (410) and ‘prokaryote-only’ (71) GO terms.

The high correlation between IEA and inconsistent annotations (Table 3) suggests that IEA has lower reliability than the other evidence codes. Electronically generated associations made without human judgment are labelled as IEA. The GO Consortium proposes a hierarchy of reliability among evidence codes (http://www.geneontology.org/GO.evidence.shtml). In general, TAS and IDA show higher reliability, and both have low correlations with all three types of inconsistent annotation. Most of the evidence codes that are curated by humans also have low correlations with all types of inconsistent annotation. This result implies that the hierarchy of reliability among evidence codes is also reflected in the inaccurate annotations.

The numbers of gene products and species in a database show high correlations with all types of inconsistent annotation except taxonomy-inconsistent annotation. This suggests that the complexity of database maintenance may affect the occurrence of inconsistent annotations. Such databases therefore have a stronger need for a sound mechanism such as GOChase-II to avoid semantic inconsistencies caused by multiple user groups.

References

  1. Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, et al.: Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 2000, 25(1):25–29. 10.1038/75556

  2. Aho AV, Hopcroft JE, Ullman JD: Directed graphs. In Data Structures and Algorithms. Massachusetts: Addison-Wesley; 1983:219–221.

  3. Lewis SE: Gene Ontology: looking backwards and forwards. Genome Biol 2005, 6(1):103. 10.1186/gb-2004-6-1-103

  4. Camon E, Magrane M, Barrell D, Lee V, Dimmer E, Maslen J, Binns D, Harte N, Lopez R, Apweiler R: The Gene Ontology Annotation (GOA) Database: sharing knowledge in Uniprot with Gene Ontology. Nucleic Acids Res 2004, 32(Database issue):D262–266. 10.1093/nar/gkh021

  5. Stover NA, Krieger CJ, Binkley G, Dong Q, Fisk DG, Nash R, Sethuraman A, Weng S, Cherry JM: Tetrahymena Genome Database (TGD): a new genomic resource for Tetrahymena thermophila research. Nucleic Acids Res 2006, 34(Database issue):D500–503. 10.1093/nar/gkj054

  6. Sadowski MI, Taylor WR: On the evolutionary origins of "Fold Space Continuity": A study of topological convergence and divergence in mixed alpha-beta domains. J Struct Biol 2010. [Epub ahead of print]

  7. Mehan MR, Nunez-Iglesias J, Dai C, Waterman MS, Zhou XJ: An integrative modular approach to systematically predict gene-phenotype associations. BMC Bioinformatics 2010, 11(1):S62. 10.1186/1471-2105-11-S1-S62

  8. Martin A, Ochagavia ME, Rabasa LC, Miranda J, Fernandez-de-Cossio J, Bringas R: BisoGenet: a new tool for gene network building, visualization and analysis. BMC Bioinformatics 2010, 11: 91. 10.1186/1471-2105-11-91

  9. Bodenreider O: The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Res 2004, 32(Database issue):D267–270. 10.1093/nar/gkh061

  10. Masseroli M, Pinciroli F: Using Gene Ontology and genomic controlled vocabularies to analyze high-throughput gene lists: three tool comparison. Comput Biol Med 2006, 36(7–8):731–747. 10.1016/j.compbiomed.2005.04.008

  11. Dolan ME, Ni L, Camon E, Blake JA: A procedure for assessing GO annotation consistency. Bioinformatics 2005, 21 Suppl 1(1):i136–143. 10.1093/bioinformatics/bti1019

  12. Park YR, Park CH, Kim JH: GOChase: correcting errors from Gene Ontology-based annotations for gene products. Bioinformatics 2005, 21(6):829–831. 10.1093/bioinformatics/bti106

  13. The Gene Ontology Consortium: Creating the gene ontology resource: design and implementation. Genome Res 2001, 11(8):1425–1433. 10.1101/gr.180801

  14. Khatri P, Draghici S: Ontological analysis of gene expression data: current tools, limitations, and open problems. Bioinformatics 2005, 21(18):3587–3595. 10.1093/bioinformatics/bti565

  15. Day-Richter J, Harris MA, Haendel M, Gene Ontology OBO-Edit Working Groups, Lewis S: OBO-Edit-- an ontology editor for biologists. Bioinformatics 2007, 23(16):2198–2200. 10.1093/bioinformatics/btm112

  16. Sayers EW, Barrett T, Benson DA, Bolton E, Bryant SH, Canese K, Chetvernin V, Church DM, Dicuccio M, Federhen S, et al.: Database resources of the National Center for Biotechnology Information. Nucleic Acids Res 2010, 38(Database issue):D173–180.

  17. The Gene Ontology Consortium: The Gene Ontology in 2010: extensions and refinements. Nucleic Acids Research 2010, 38(Database issue):D331–335. 10.1093/nar/gkp1018

Acknowledgements

This research was supported by Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education, Science and Technology (2010-0028631). Y.R.P was supported in part by a grant of the Korea Healthcare technology R&D Project, Ministry of Health & Welfare, Republic of Korea (A070001).

This article has been published as part of BMC Bioinformatics Volume 12 Supplement 1, 2011: Selected articles from the Ninth Asia Pacific Bioinformatics Conference (APBC 2011). The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/12?issue=S1.

Author information

Correspondence to Ju Han Kim.

Additional information

Authors' contributions

YRP conceived the study, wrote the manuscript and implemented the web-based program. JK wrote the manuscript and validated the inconsistent annotation. HWL validated the taxonomy-specific GO terms and helped to draft the manuscript. YJY calculated history data of GO term. JHK coordinated and supervised the study. All authors read and approved the final manuscript.

Competing interests

The authors declare that they have no competing interests.

Yu Rang Park, Jihun Kim contributed equally to this work.

Electronic supplementary material

Additional file 1 (TXT, 27 KB)

Additional file 2 (TXT, 10 KB)

Additional file 3 (PDF, 43 KB)

Additional file 4 (TXT, 334 KB)

About this article

Keywords

  • Gene Ontology
  • Unified Medical Language System
  • Biological Database
  • Taxonomy Group
  • Semantic Error