Comparison of annotation terms between automated and curated E. coli K12 databases

Background

Genome sequencing and annotation provide ways to understand genomes. Annotating a genome identifies its genes, with precise start and end sites, and describes their cellular components, molecular functions and biological processes. The growing wealth of genomic data has made it necessary to identify the information encoded within each genome, which in turn has led to the development of automated annotation techniques that assign functions to newly sequenced genes based on similarity to previously annotated genes. This approach has some problems: for example, an error in a previously annotated genome can propagate into a whole family of misannotated genes. Automated annotation also usually fails to meet the "gold standard" of curated databases because it provides less detail, classifying proteins into broader categories. To overcome this problem, ontology terms were introduced in automated databases as a means of recognizing types of proteins at the level of detail of curated databases.

In this project we compared the results of a predictive automated bacterial annotation program with a curated annotation database, EcoCyc. EcoCyc is a conservative multidimensional annotation system that is validated by over 15,000 publications. Automated annotation systems such as BASys can be used as first-pass annotation tools that try to add as many annotations as possible by drawing upon over 30 sources. The Gene Ontology provides a defined library of terms describing the biological processes, cellular components and molecular functions of a gene in an organism. Because ontology annotations use a limited, shared vocabulary, we compared ontologies between the BASys and EcoCyc databases. BASys also generated additional non-ontology terms and metadata. Methods were developed to compare these additional terms to the EcoCyc database, and approximately 17% of the BASys-predicted ontologies were found to match the EcoCyc database.

Materials and methods

The Gene Ontology database [5] was used to convert each of the annotation terms into corresponding GO numbers using annotation term-2-GO files, as shown in Figure 1. EcoCyc [3] and BASys [4] were the databases used for comparison.

Figure 1

Flow chart for the processing of annotation terms. EcoCyc was used as the standard for the comparison of annotation terms in the form of gene IDs and GO numbers. To convert BASys terms from different datasets (row 3, lanes 2–9) into gene IDs and GO numbers (row 5, lanes 2–9), conversion files (row 4, lanes 2–9) from the Gene Ontology site were used. Each of the BASys gene IDs and GO numbers (row 5, lanes 2–9) was compared to the EcoCyc gene IDs and GO numbers (row 3, lane 1).

The annotation terms from each database were converted into common GO numbers using the respective conversion files from the Gene Ontology site http://www.geneontology.org/.
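The conversion step can be illustrated with a minimal sketch. It assumes the conversion files follow the plain-text "external2go" layout distributed by the Gene Ontology site, in which comment lines start with `!` and each mapping line reads `database:identifier description > GO:term name ; GO:0000000`. The function names `parse_term2go` and `annotate_genes`, the gene identifiers and the sample lines are hypothetical and are not taken from the original work.

```python
import re
from collections import defaultdict

# GO identifiers are "GO:" followed by seven digits.
GO_ID = re.compile(r"GO:\d{7}")

def parse_term2go(lines):
    """Map each external term ID to the set of GO IDs listed for it.

    Assumes the external2go layout; lines that do not fit are skipped.
    """
    mapping = defaultdict(set)
    for line in lines:
        line = line.strip()
        if not line or line.startswith("!"):  # blank or comment line
            continue
        if ">" not in line:
            continue
        left, right = line.split(">", 1)
        fields = left.split()
        if not fields:
            continue
        term_id = fields[0]  # e.g. "InterPro:IPR000003"
        mapping[term_id].update(GO_ID.findall(right))
    return dict(mapping)

def annotate_genes(gene_terms, term2go):
    """Convert {gene ID: [term IDs]} into a set of (gene ID, GO ID) pairs."""
    pairs = set()
    for gene, terms in gene_terms.items():
        for term in terms:
            for go_id in term2go.get(term, ()):
                pairs.add((gene, go_id))
    return pairs

# Hypothetical input: two mapping lines and two genes with placeholder IDs.
sample_lines = [
    "! example comment line",
    "InterPro:IPR000003 Retinoid X receptor > GO:DNA binding ; GO:0003677",
    "InterPro:IPR000003 Retinoid X receptor > GO:nucleus ; GO:0005634",
]
term2go = parse_term2go(sample_lines)
gene_pairs = annotate_genes(
    {"gene_A": ["InterPro:IPR000003"], "gene_B": ["Pfam:PF99999"]},
    term2go,
)
```

Representing each annotation as a (gene ID, GO number) pair puts both databases into the same form, so they can later be compared with ordinary set operations. Terms with no entry in the mapping, such as the placeholder `Pfam:PF99999`, simply contribute no pairs.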

Results and conclusion

Our results showed that of the approximately 4200 genes in E. coli, 1594 have been validated by EcoCyc based on ontology numbers. EcoCyc is a conservative annotation system that requires strict validation before annotation terms are entered into its database. BASys, on the other hand, was more liberal, assigning ontology-based annotations to 2511 genes, because it was designed to annotate each gene as fully as possible. The total count of GO numbers shown in Table 1 was 21,708. Table 1 shows that about 17% (4% true positives and 13% true negatives) of BASys ontology assignments were validated by EcoCyc. About 70% were false positives and 13% were false negatives. The high false positive rate might be due to incomplete literature validation in the EcoCyc database.
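The tallying of matches and mismatches can be sketched as follows, assuming each database has been reduced to a set of (gene ID, GO number) pairs. The text does not state how true negatives were counted, so this sketch covers only matched pairs, pairs found in the automated set alone, and pairs found in the curated set alone. It is not expected to reproduce the percentages in Table 1. The function name `compare_annotations` and the sample pairs are hypothetical.

```python
def compare_annotations(predicted, reference):
    """Tally agreement between two sets of (gene ID, GO ID) pairs.

    predicted: pairs from the automated database (for example BASys)
    reference: pairs from the curated database (for example EcoCyc)
    Returns (counts, fractions); fractions are relative to the union of pairs.
    """
    counts = {
        "true_positive": len(predicted & reference),   # in both sets
        "false_positive": len(predicted - reference),  # automated only
        "false_negative": len(reference - predicted),  # curated only
    }
    total = len(predicted | reference)
    fractions = {
        name: (value / total if total else 0.0)
        for name, value in counts.items()
    }
    return counts, fractions

# Hypothetical pairs with placeholder gene IDs.
predicted_pairs = {
    ("gene_A", "GO:0003677"),
    ("gene_A", "GO:0005634"),
    ("gene_B", "GO:0016020"),
}
reference_pairs = {
    ("gene_A", "GO:0003677"),
    ("gene_C", "GO:0005737"),
}
counts, fractions = compare_annotations(predicted_pairs, reference_pairs)
```

Under these definitions a "false positive" means only that the curated database lacks the pair, which fits the observation above that incomplete curation could inflate the false positive rate.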

Table 1 Summary of matches and mismatches between databases

References

  1. Karp PD, Keseler IM, Shearer A, Latendresse M, Krummenacker M, Paley SM, Paulsen I, Collado-Vides J, Gama-Castro S, Peralta-Gil M, et al.: Multidimensional annotation of the Escherichia coli K-12 genome. Nucleic Acids Res 2007, 35(22):7577–7590. 10.1093/nar/gkm740

  2. Van Domselaar GH, Stothard P, Shrivastava S, Cruz JA, Guo A, Dong X, Lu P, Szafron D, Greiner R, Wishart DS: BASys: a web server for automated bacterial genome annotation. Nucleic Acids Res 2005, (33 Web Server):W455-W459. 10.1093/nar/gki593

  3. EcoCyc [http://www.ecocyc.org/]

  4. BASys Bacterial Annotation System [http://wishart.biology.ualberta.ca/basys/cgi/gallery.pl]

  5. Gene Ontology [http://www.geneontology.org/]

Acknowledgements

Bioinformatics and Information Science Center, Western Kentucky University.

Author information

Corresponding author

Correspondence to ReddySailaja Marpuri.

Rights and permissions

Open Access This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution 2.0 International License (https://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Marpuri, R., Rinehart, C.A. Comparison of annotation terms between automated and curated E. coli K12 databases. BMC Bioinformatics 10 (Suppl 7), A10 (2009). https://doi.org/10.1186/1471-2105-10-S7-A10

  • DOI: https://doi.org/10.1186/1471-2105-10-S7-A10