Comparison of annotation terms between automated and curated E. coli K12 databases

from UT-ORNL-KBRIN Bioinformatics Summit 2009, Pikeville, TN, USA. 20–22 March 2009


Background
Genome sequencing and annotation may provide ways to understand genomes. Annotating a genome identifies genes in terms of precise start and end sites and describes their cellular components, molecular functions and biological processes. The growing wealth of genomic data has made it necessary to identify the information encoded within genomes, which in turn has led to the development of automated annotation techniques that assign functions to newly sequenced genes based on similarity to previously annotated genes. This approach has several problems; for example, an error in a previously annotated genome can propagate into a whole family of misannotated genes. Automated annotation also usually fails to meet the "gold standard" of curated databases because its level of detail is reduced, classifying proteins into broader categories. To overcome this problem, ontology terms were used in automated databases as a means of understanding and recognizing types of proteins to the level of curated databases.

Figure 1. Flow chart for processing of annotation terms. EcoCyc was used as the standard for the comparison of annotation terms in the form of gene IDs and GO numbers. To convert BASys terms from different datasets (row 3, lanes 2-9) into gene IDs and GO numbers (row 5, lanes 2-9), conversion files (row 4, lanes 2-9) from the Gene Ontology site were used. Each set of BASys gene IDs and GO numbers (row 5, lanes 2-9) was compared to the EcoCyc gene IDs and GO numbers (row 3, lane 1).
In this project we compared the results of predictive automated bacterial annotation programs with those of a curated annotation database such as EcoCyc. EcoCyc is a conservative multidimensional annotation system that is validated by over 15,000 publications. Automated annotation systems such as BASys can be used as first-pass annotation tools that try to add as many annotations as possible by drawing upon over 30 sources. Gene Ontology provides a defined library of terms related to the biological processes, cellular components and molecular functions of a gene in an organism. Because ontology annotations use a limited, shared set of terms, we compared ontologies between the BASys and EcoCyc databases. BASys also generated additional non-ontology terms and metadata. Methods were developed to compare these additional terms to the EcoCyc database, and it was found that approximately 17% of the BASys predicted ontologies matched the EcoCyc database.

Materials and methods
The Gene Ontology database [5] was used to convert each annotation term into its corresponding GO number, as shown in Figure 1, using annotation term-2-GO files. BASys and EcoCyc were the databases used for comparison [3,4].
The annotation terms from each database were converted into common GO numbers using the respective conversion files from the Gene Ontology site http://www.geneontology.org/.
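The conversion step can be sketched as follows. This is a minimal illustration, assuming the conversion files follow the plain-text term-to-GO layout distributed by the Gene Ontology site (comment lines starting with `!`, and one mapping per line of the form `EC:1.1.1.1 > GO:alcohol dehydrogenase (NAD+) activity ; GO:0004022`). The function names, gene IDs and sample lines are hypothetical and are not taken from the original analysis.

```python
from collections import defaultdict


def parse_term2go(lines):
    """Build {external term ID -> set of GO numbers} from mapping-file lines."""
    mapping = defaultdict(set)
    for line in lines:
        line = line.strip()
        # Skip blank lines and '!' comment/header lines.
        if not line or line.startswith("!"):
            continue
        # Skip anything that does not look like a mapping line.
        if " > " not in line:
            continue
        left, right = line.split(" > ", 1)
        # The external ID is the first whitespace-delimited token on the left.
        term_id = left.split()[0]
        # The GO number is the last ';'-separated field on the right.
        go_id = right.rsplit(";", 1)[-1].strip()
        if go_id.startswith("GO:"):
            mapping[term_id].add(go_id)
    return dict(mapping)


def terms_to_go(gene_terms, mapping):
    """Convert {gene ID -> annotation terms} into a set of (gene ID, GO number) pairs."""
    pairs = set()
    for gene_id, terms in gene_terms.items():
        for term in terms:
            # Terms without a mapping are silently dropped.
            for go_id in mapping.get(term, ()):
                pairs.add((gene_id, go_id))
    return pairs


# Hypothetical excerpt of a term-2-GO file (assumed format).
sample_lines = [
    "! hypothetical excerpt of a term-2-GO file",
    "EC:1.1.1.1 > GO:alcohol dehydrogenase (NAD+) activity ; GO:0004022",
    "Pfam:PF00005 ABC_tran > GO:ATP binding ; GO:0005524",
]
mapping = parse_term2go(sample_lines)

# Hypothetical gene IDs and annotation terms.
gene_terms = {"geneA": ["EC:1.1.1.1", "Pfam:PF00005"], "geneB": ["EC:9.9.9.9"]}
pairs = terms_to_go(gene_terms, mapping)
```

Representing each annotation as a (gene ID, GO number) pair puts both databases in the same form, so they can be compared with ordinary set operations.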

Results and conclusion
Our results showed that of the approximately 4200 genes in E. coli, 1594 have been validated by EcoCyc based on ontology numbers. EcoCyc is a conservative annotation system that requires strict validation before annotation terms are entered into its database. In contrast, BASys was more liberal, assigning ontology-based annotations to 2511 genes, because it was designed to annotate each gene as fully as possible. The total number of GO numbers shown in Table 1 was 21,708. Table 1 shows that about 17% (4% true positives and 13% true negatives) of BASys ontology assignments were validated by EcoCyc. About 70% were false positives and 13% were false negatives. The high false positive rate might be due to incomplete literature validation of the EcoCyc database.
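The four categories reported above can be illustrated with a small sketch. It assumes, hypothetically, that each assignment is a (gene ID, GO number) pair, that EcoCyc is treated as the reference, and that true negatives are counted over an explicit universe of candidate pairs. The source does not state how its true negatives were defined, so this is one possible reading and not the authors' procedure; all names and data below are invented for illustration.

```python
def compare_assignments(predicted, reference, universe):
    """Count agreement categories between predicted and reference pairs.

    predicted, reference, universe: sets of (gene ID, GO number) pairs.
    Pairs outside the universe are ignored (an assumption of this sketch).
    """
    predicted = predicted & universe
    reference = reference & universe
    counts = {
        "true_positive": len(predicted & reference),
        "false_positive": len(predicted - reference),
        "false_negative": len(reference - predicted),
        "true_negative": len(universe - predicted - reference),
    }
    total = len(universe)
    # Express each category as a percentage of all candidate pairs.
    percents = {k: 100.0 * v / total for k, v in counts.items()} if total else {}
    return counts, percents


# Hypothetical toy data: five candidate (gene ID, GO number) pairs.
a, b, c, d, e = [("gene%d" % i, "GO:000000%d" % i) for i in range(1, 6)]
universe = {a, b, c, d, e}
basys_pairs = {a, b, c}   # automated predictions
ecocyc_pairs = {a, d}     # curated reference

counts, percents = compare_assignments(basys_pairs, ecocyc_pairs, universe)
agreement = percents["true_positive"] + percents["true_negative"]
```

Under this reading, the agreement figure is the sum of the true positive and true negative percentages, which matches how the reported 17% is composed (4% + 13%).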