CORRIE: enzyme sequence annotation with confidence estimates
© Audit et al; licensee BioMed Central Ltd. 2007
Published: 22 May 2007
Using a previously developed automated method for enzyme annotation, we report the re-annotation of the ENZYME database and the analysis of local error rates per class. In control experiments, we demonstrate that the method is able to correctly re-annotate 91% of all Enzyme Classification (EC) classes with high coverage (755 out of 827). Only 44 enzyme classes are found to contain false positives, while the remaining 28 enzyme classes are not represented. We also show cases where the re-annotation procedure results in partial overlaps for those few enzyme classes where a certain inconsistency might appear between homologous proteins, mostly due to function specificity. Our results allow the interactive exploration of the EC hierarchy for known enzyme families as well as putative enzyme sequences that may need to be classified within the EC hierarchy. These aspects of our framework have been incorporated into a web-server, called CORRIE, which stands for Correspondence Indicator Estimation and allows the interactive prediction of a functional class for putative enzymes from sequence alone, supported by probabilistic measures in the context of the pre-calculated Correspondence Indicators of known enzymes with the functional classes of the EC hierarchy. The CORRIE server is available at: http://www.genomes.org/services/corrie/.
The explosion of genome sequencing technologies has resulted in an ever-increasing gap between the discovery of new gene sequences and their experimental characterization. The accumulation of raw sequence data has dictated the use of computational techniques for the inference of their possible functional roles, based on the evolutionary conservation of structure and function. However, this widely used empirical process has not attracted sufficient attention as a fundamental problem in computational biology, requiring rigorous analysis.
The typical solution to annotation transfer involves the inference of functional properties based on sequence similarity. This procedure can be divided into two steps: (i) the establishment of a list of proteins of known function and significant sequence similarity to the uncharacterized sequence; (ii) the selection of those characterized sequences from which the annotation might be transferred. The procedure relies on the assumption of a strong relationship between protein structure and function. Despite the fact that this hypothesis is strongly supported by various studies, there is concern that a blind application of such procedures usually leads to annotation errors [5–8]. Two major types of errors can be made: (i) the short-listed homologous protein(s) have a different function from the query sequence (erroneous assignment, despite correct reference); (ii) the transferred annotations are incorrect (erroneous reference, despite correct assignment). The latter type followed by an iterative usage of annotation transfer results in the important problem of error propagation in annotated databases [3, 9]. Modeling studies have demonstrated that dramatic consequences on the reliability of database annotations can thus arise, with detrimental effects for the quality and integrity of reference databases. One of the challenges for future improvements is the association of function assignments with a measure of reliability that can control annotation quality, by excluding spurious annotations. Herein, we address this issue by analysing the Enzyme Classification (EC) hierarchy within a probabilistic framework for the process of homology-based annotation, as a follow-up of a previous theoretical study.
Methods and results
Our approach relies on a reference dataset such as the EC hierarchy, in which protein sequences are pre-classified into an arbitrary number of functional classes. An assignment corresponds to membership in a functional class; thus, function sharing becomes an explicit property. The possibility that a protein belongs to a functional class is assessed from its similarity relationships with all protein sequences that do or do not belong to that class. Most existing methods instead map functions to proteins by clustering proteins on sequence similarity, irrespective of any function sharing, and then compiling the functional descriptions available in the most relevant cluster(s) to annotate the uncharacterized sequence(s) [11–13]. An innovative feature of our strategy is that individual sequences are mapped to functional classes, rather than individual functions being mapped to sequence classes.
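To make the idea of mapping a sequence to functional classes concrete, the following minimal Python sketch scores each class by the fraction of a query's similarity neighbours that belong to it. This is a deliberately naive stand-in and not the Correspondence Indicator used by CORRIE, whose definition is given in the earlier study; the function name, the data layout and all identifiers are assumptions made for illustration.

```python
from collections import defaultdict

def neighbour_fraction(query, neighbours, classes_of):
    """Naive stand-in score (NOT the CORRIE Correspondence Indicator):
    for each functional class, the fraction of the query's similarity
    neighbours that are members of that class."""
    hits = [p for p in neighbours.get(query, ()) if p != query]
    if not hits:
        return {}
    counts = defaultdict(int)
    for protein in hits:
        # A protein may belong to more than one functional class.
        for cls in classes_of.get(protein, ()):
            counts[cls] += 1
    return {cls: n / len(hits) for cls, n in counts.items()}

# Toy data with hypothetical identifiers (not taken from ENZYME).
neighbours = {"q1": ["p1", "p2", "p3", "p4"]}
classes_of = {
    "p1": {"classA"},
    "p2": {"classA"},
    "p3": {"classA", "classB"},
    "p4": {"classB"},
}
scores = neighbour_fraction("q1", neighbours, classes_of)
```

In this toy case three of the four neighbours belong to classA and two to classB, so the query scores higher for classA. A probabilistic framework would replace this raw fraction with a calibrated membership probability.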
First, we followed the exact leave-one-out re-annotation scheme for assignments described previously, using the updated information for proteins and EC classes, for comparison purposes. The overall (mean) performance improved somewhat. We generated (at P = 1) 59,766 assignments for 59,746 proteins (coverage 92.1%); some proteins have more than one assignment at P = 1. The number of annotation errors was 90, implying an error rate r = 0.15% (90 cases out of 59,766 assignments). Compared to our previous report, in which we annotated 28,088 enzymes over 589 classes, we observe an increase in coverage (92.1% compared to 90.6%) and a significant decrease in error rate (0.15% compared to 0.21%), despite a more than two-fold increase in the data.
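The arithmetic behind these figures can be checked with a short Python sketch. The counts are those quoted above; the variable names are illustrative, and the relative reduction is computed from the rounded percentages, so it is only approximate.

```python
# Counts quoted in the text for assignments made at P = 1.
assignments = 59_766   # assignments generated
proteins = 59_746      # proteins that received at least one assignment
errors = 90            # incorrect assignments

# Global error rate r = errors / assignments.
error_rate = errors / assignments

# Assignments in excess of one per protein (proteins with several classes).
surplus_assignments = assignments - proteins

# Approximate relative reduction versus the earlier error rate of 0.21%.
relative_reduction = (0.21 - 0.15) / 0.21

print(f"error rate: {error_rate:.2%}")
print(f"surplus assignments: {surplus_assignments}")
print(f"relative reduction in error rate: {relative_reduction:.0%}")
```

The surplus of 20 assignments reflects the proteins that receive more than one assignment at P = 1.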
Table 1. Local error rate per EC class, for those cases where there is more than one error.
NADH dehydrogenase (ubiquinone)
Non-specific Ser/Thr protein kinase
H+-transporting two-sector ATPase
DNA-directed RNA polymerase
Table 2. Overlapping EC classes, for those cases where there are more than two errors from a true EC class to an assigned EC class.
Name of true class
Name of assigned class
DNA-directed DNA polymerase
DNA-directed RNA polymerase
Substrate: DNA or RNA
NADH dehydrogenase (quinone)
NADH dehydrogenase (ubiquinone)
Electron acceptor: quinone or ubiquinone
Hydrolysis of 1,4-beta-D-glucosidic linkages
Exo-hydrolysis or endo-hydrolysis
Non-specific Ser/Thr protein kinase
Substrate: PI3 or Ser/Thr
NDP-glucose – starch glucosyltransferase
Substrate: NDP-glucose or ADP-glucose
Sodium-transporting two-sector ATPase
H+-transporting two-sector ATPase
Ion-transporting two-sector ATPase
Ion specificity: Na+ or H+
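As an illustration of how overlapping classes of this kind could be tabulated, the Python sketch below counts misassignments per (true class, assigned class) pair and keeps the pairs with more than two errors. The input format and the class labels are assumptions for the example, not the authors' data or code.

```python
from collections import Counter

def overlapping_pairs(records, min_errors=3):
    """Count errors per (true class, assigned class) pair and keep the
    pairs with at least `min_errors` errors (default: more than two)."""
    counts = Counter((true, assigned) for true, assigned in records
                     if true != assigned)
    return {pair: n for pair, n in counts.items() if n >= min_errors}

# Toy records of (true class, assigned class); labels are hypothetical.
records = (
    [("classA", "classB")] * 3     # three errors from classA to classB
    + [("classA", "classA")] * 10  # correct assignments are ignored
    + [("classC", "classD")] * 2   # only two errors: below the cut-off
)
result = overlapping_pairs(records)
```

Only the classA to classB pair passes the cut-off here. Keeping the pair ordered (true class first) preserves the direction of the misassignment.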
Finally, we have implemented this strategy as a web server called CORRIE, built on MySQL, and announce its availability for wider use by the community. The software requires a reference set of protein sequences, their association with a functional classification and an all-vs-all similarity table. Then, for any unclassified query sequence, CORRIE generates a probability for its membership in a functional class. CORRIE has been made accessible at http://www.genomes.org/services/corrie/; a downloadable version will follow soon. The format of the results is simple: for a given query sequence, the user obtains the query sequence identifier, the original description (from the FASTA file format), an internal CORRIE protein identifier for retrieval purposes, the assignment probability, the predicted EC class, the EC description, and the local error rate for the specific class (as a guide for the quality of annotations) (Figure 1). The server provides all annotations obtained by CORRIE (including those with P < 1). Users may also use different α values and the multivariate framework. They can also browse through various results to refine their assessment of annotation quality and generally explore structure/function relationships within the entire sequence space of proteins known to be associated with enzymatic functions.
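The result fields listed above could be represented as a simple record, as in the following Python sketch. The class name, field names and example values are hypothetical; they mirror the description in the text and are not the server's actual output format or API.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class CorrieResult:
    """One result row; field names are assumptions based on the text."""
    query_id: str            # query sequence identifier
    description: str         # original FASTA description
    corrie_id: str           # internal CORRIE protein identifier
    probability: float       # assignment probability P
    ec_class: str            # predicted EC class
    ec_description: str      # description of the EC class
    local_error_rate: float  # local error rate of that class

def confident(results, threshold=1.0):
    """Keep results whose assignment probability reaches the threshold."""
    return [r for r in results if r.probability >= threshold]

# Toy rows with placeholder values (not real server output).
rows = [
    CorrieResult("query1", "hypothetical protein", "C000001",
                 1.0, "classA", "example class A", 0.0),
    CorrieResult("query1", "hypothetical protein", "C000001",
                 0.4, "classB", "example class B", 0.02),
]
top = confident(rows)
```

Filtering at P = 1 corresponds to the strict setting used in the re-annotation experiment, while a lower threshold exposes the additional annotations with P < 1 that the server also reports.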
We have previously developed a framework for the probabilistic annotation of enzymes into the functional classes of the EC hierarchy. We have now extended this work using a larger reference database, and have reduced the error rates significantly while maintaining a coverage of >90%. We have also examined the local errors made in this assignment process and identified those EC classes more prone to non-specific structure/function relationships. Finally, we have made the system available as an interactive web server for the exploration of enzyme sequence space.
It is interesting to note that most errors reported (Tables 1 and 2) occur between closely related EC classes. This is particularly evident in cases where the similarity and difference of the function between overlapping classes is described (Table 2). In all six cases, the overall function remains the same while the difference lies in substrate specificity or the reaction mechanism. Recent studies have shown that substrate specificity in four of these twelve overlapping classes can be modulated with a small number of mutations. For instance, it has been reported recently that an RNA polymerase function was obtained from a DNA polymerase using in vitro compartmentalization, and a mutant with a single mutation was among the best mutants at synthesizing RNA. Also, in the case of a transporting ATPase, a change in transport specificity from H+ to Li+ was achieved by just four mutations.
Beyond the issue of functional specificity, there is also an aspect of biological reality in the problematic cases, in terms of overlapping enzyme properties. In other words, these classes might represent activities that co-exist in the same enzyme. In the previous example of the DNA polymerase, it has also been reported that a mutant with just five mutations maintained DNA polymerase activity, demonstrating that both activities can co-exist. Also, in the case of glucanases, co-existence of endo- and exo-activities has been observed in some enzymes. Finally, with starch glucosyltransferases, CORRIE annotates ADP-glucose specific enzymes as being NDP-glucose specific, which is less accurate yet valid.
These examples illustrate the intricate nature of the sequence-function relationship found among those few cases that CORRIE fails to annotate correctly, and point to the limitation of using sequence similarity as a distance measure between enzymes. Therefore, we envisage implementing other methods in CORRIE in the near future. For example, the sequences within each class could be used to create one or more sequence profiles against which a new sequence could be aligned to produce an alternative CI measure, possibly focusing on key residues [23, 24]. This would increase the sensitivity and specificity to a point where these ambiguous classes can be detected accurately.
One shortcoming of CORRIE, since it is based on the ENZYME database for validation purposes, is the implicit assumption that the query sequences are enzymes. A possible future development would be the explicit detection of enzyme sequences from similarity information. Schemes addressing the issue of enzyme recognition have been proposed previously. Such detection could be achieved by an all-vs-all comparison with the entire UniProt database, followed by classification using CORRIE. In that setting, hypothetical proteins that match known enzyme classes could readily be assigned to specific EC numbers, with the proper probabilistic measures attached to them. Currently, this is possible, but the error rate is certainly under-estimated. Finally, the extension to other classification schemes (and semantically richer formats) will facilitate the assignment of protein sequences to various aspects of biological function beyond the EC hierarchy.
The CGU at CERTH is supported by the Networks of Excellence BioSapiens (contract number LSHG-CT-2003-503265) and ENFIN (LSHG-CT-2005-518254), both funded by the European Commission.
This article has been published as part of BMC Bioinformatics Volume 8, Supplement 4, 2007: The Second Automated Function Prediction Meeting. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/8?issue=S4.
- Andrade MA, Sander C: Bioinformatics: from genome data to biological knowledge. Curr Opin Biotechnol 1997, 8: 675–83. 10.1016/S0958-1669(97)80118-8
- Ouzounis CA, Karp PD: The past, present and future of genome-wide re-annotation. Genome Biol 2002, 3: COMMENT2001. 10.1186/gb-2002-3-2-comment2001
- Karp PD: What we do not know about sequence analysis and sequence databases. Bioinformatics 1998, 14: 753–4. 10.1093/bioinformatics/14.9.753
- Wilson CA, Kreychman J, Gerstein M: Assessing annotation transfer for genomics: quantifying the relations between protein sequence, structure and function through traditional and probabilistic scores. J Mol Biol 2000, 297: 233–49. 10.1006/jmbi.2000.3550
- Kyrpides NC, Ouzounis CA: Whole-genome sequence annotation: 'Going wrong with confidence'. Mol Microbiol 1999, 32: 886–7. 10.1046/j.1365-2958.1999.01380.x
- Bork P, Koonin EV: Predicting functions from protein sequences – where are the bottlenecks? Nat Genet 1998, 18: 313–8. 10.1038/ng0498-313
- Devos D, Valencia A: Intrinsic errors in genome annotation. Trends Genet 2001, 17: 429–31. 10.1016/S0168-9525(01)02348-4
- Gerlt JA, Babbitt PC: Can sequence determine function? Genome Biol 2000, 1: REVIEWS0005. 10.1186/gb-2000-1-5-reviews0005
- Gilks WR, Audit B, De Angelis D, Tsoka S, Ouzounis CA: Modeling the percolation of annotation errors in a database of protein sequences. Bioinformatics 2002, 18: 1641–9. 10.1093/bioinformatics/18.12.1641
- Levy ED, Ouzounis CA, Gilks WR, Audit B: Probabilistic annotation of protein sequences based on functional classifications. BMC Bioinformatics 2005, 6: 302. 10.1186/1471-2105-6-302
- Abascal F, Valencia A: Automatic annotation of protein function based on family identification. Proteins 2003, 53: 683–92. 10.1002/prot.10449
- Krebs WG, Bourne PE: Statistically rigorous automated protein annotation. Bioinformatics 2004, 20: 1066–73. 10.1093/bioinformatics/bth039
- Leontovich AM, Brodsky LI, Drachev VA, Nikolaev VK: Adaptive algorithm of automated annotation. Bioinformatics 2002, 18: 838–44. 10.1093/bioinformatics/18.6.838
- Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol 1990, 215: 403–10.
- Bairoch A: The ENZYME database in 2000. Nucleic Acids Res 2000, 28: 304–5. 10.1093/nar/28.1.304
- Wu CH, Apweiler R, Bairoch A, Natale DA, Barker WC, Boeckmann B, Ferro S, Gasteiger E, Huang H, Lopez R, et al.: The Universal Protein Resource (UniProt): an expanding universe of protein information. Nucleic Acids Res 2006, 34: D187–91. 10.1093/nar/gkj161
- Promponas VJ, Enright AJ, Tsoka S, Kreil DP, Leroy C, Hamodrakas S, Sander C, Ouzounis CA: CAST: an iterative algorithm for the complexity analysis of sequence tracts. Complexity analysis of sequence tracts. Bioinformatics 2000, 16: 915–22. 10.1093/bioinformatics/16.10.915
- Weiss H, Leonard K, Neupert W: Puzzling subunits of mitochondrial cytochrome reductase. Trends Biochem Sci 1990, 15: 178–80. 10.1016/0968-0004(90)90155-5
- Bayer EA, Chanzy H, Lamed R, Shoham Y: Cellulose, cellulases and cellulosomes. Curr Opin Struct Biol 1998, 8: 548–57. 10.1016/S0959-440X(98)80143-7
- Ong JL, Loakes D, Jaroslawski S, Too K, Holliger P: Directed evolution of DNA polymerase, RNA polymerase and reverse transcriptase activity in a single polypeptide. J Mol Biol 2006, 361: 537–50. 10.1016/j.jmb.2006.06.050
- Zhang Y, Fillingame RH: Changing the ion binding specificity of the Escherichia coli H(+)-transporting ATP synthase by directed mutagenesis of subunit c. J Biol Chem 1995, 270: 87–93. 10.1074/jbc.270.1.87
- Schubot FD, Kataeva IA, Chang J, Shah AK, Ljungdahl LG, Rose JP, Wang BC: Structural basis for the exocellulase activity of the cellobiohydrolase CbhA from Clostridium thermocellum. Biochemistry 2004, 43: 1163–70. 10.1021/bi030202i
- Casari G, Sander C, Valencia A: A method to predict functional residues in proteins. Nat Struct Biol 1995, 2: 171–8. 10.1038/nsb0295-171
- Lichtarge O, Bourne HR, Cohen FE: An evolutionary trace method defines binding surfaces common to protein families. J Mol Biol 1996, 257: 342–58. 10.1006/jmbi.1996.0167
- des Jardins M, Karp PD, Krummenacker M, Lee TJ, Ouzounis CA: Prediction of enzyme classification from protein sequence without the use of sequence similarity. Proc Int Conf Intell Syst Mol Biol 1997, 5: 92–9.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.