ORENZA: a web resource for studying ORphan ENZyme activities
© Lespinet and Labedan; licensee BioMed Central Ltd. 2006
Received: 25 July 2006
Accepted: 06 October 2006
Published: 06 October 2006
Despite the current availability of several hundreds of thousands of amino acid sequences, more than 36% of the enzyme activities (EC numbers) defined by the Nomenclature Committee of the International Union of Biochemistry and Molecular Biology (NC-IUBMB) are not associated with any amino acid sequence in major public databases. This wide gap separating knowledge of biochemical function and sequence information is found for nearly all classes of enzymes. Thus, there is an urgent need to explore these sequence-less EC numbers, in order to progressively close this gap.
We designed ORENZA, a PostgreSQL database of ORphan ENZyme Activities, to collate information about the EC numbers defined by the NC-IUBMB with specific emphasis on orphan enzyme activities. Complete lists of all EC numbers and of orphan EC numbers are available and will be periodically updated. ORENZA allows one to browse the complete list of EC numbers or the subset associated with orphan enzymes or to query a specific EC number, an enzyme name or a species name for those interested in particular organisms. It is possible to search ORENZA for the different biochemical properties of the defined enzymes, the metabolic pathways in which they participate, the taxonomic data of the organisms whose genomes encode them, and many other features. The association of an enzyme activity with an amino acid sequence is clearly underlined, making it easy to identify at once the orphan enzyme activities. Interactive publishing of suggestions by the community would provide expert evidence for re-annotation of orphan EC numbers in public databases.
ORENZA is a Web resource designed to progressively bridge the unwanted gap between function (enzyme activities) and sequence (dataset present in public databases). ORENZA should increase interactions between communities of biochemists and of genomicists. This is expected to reduce the number of orphan enzyme activities by allocating gene sequences to the relevant enzymes.
Browsing the EC hierarchy. For each level, the total number of EC numbers is indicated, with the number of orphan EC numbers in brackets.
Total EC numbers per level (class 3 and its subclasses, then the sub-subclasses of 3.2):
EC level    Total EC numbers
3           1065
3.1         267
3.2         163
3.3         10
3.4         317
3.5         171
3.6         109
3.7         10
3.8         10
3.9         1
3.10        2
3.11        2
3.12        1
3.13        2
3.2.1       140
3.2.2       23
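The per-level counts displayed when browsing the hierarchy can be illustrated with a minimal Python sketch. It is not the code behind ORENZA (which relies on PostgreSQL and PHP); the function name `hierarchy_counts` and the toy EC numbers are assumptions made for illustration only.

```python
from collections import defaultdict


def hierarchy_counts(all_ecs, orphan_ecs):
    """Return {level prefix: (total EC numbers, orphan EC numbers)}.

    Hypothetical helper: every complete EC number such as "3.2.1.4" is
    counted once under each of its parent levels "3", "3.2" and "3.2.1".
    """
    orphan_set = set(orphan_ecs)
    counts = defaultdict(lambda: [0, 0])
    for ec in set(all_ecs):
        parts = ec.split(".")
        for depth in range(1, len(parts)):
            prefix = ".".join(parts[:depth])
            counts[prefix][0] += 1
            if ec in orphan_set:
                counts[prefix][1] += 1
    return {prefix: tuple(pair) for prefix, pair in counts.items()}


# Toy data, invented for illustration; not taken from ORENZA.
toy_all = ["3.2.1.1", "3.2.1.2", "3.2.2.1", "3.4.11.1"]
toy_orphans = ["3.2.1.2", "3.4.11.1"]
toy_counts = hierarchy_counts(toy_all, toy_orphans)

for prefix in sorted(toy_counts):
    total, orphans = toy_counts[prefix]
    print(f"{prefix} {total} ({orphans})")
```

Each printed line follows the layout of the figure: the level, the total number of EC numbers, and the number of orphans in brackets.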
Unexpectedly, Peter Karp [3] and we [4, 5] independently observed that a significant part of these curated and approved EC numbers does not correspond to any amino acid sequence in public databases. Recent updates of our previous results confirm this very large gap between known enzyme function and recorded protein sequence. At present, only 2483 EC numbers have at least one associated sequence in release 8.1 (13-Jun-2006) of the UniProt Knowledgebase. We have used the term orphan enzyme activities for the 1444 EC numbers that do not have a sequence associated with them. Remarkably, these orphan enzyme activities currently represent 36.8% of the 3927 retained EC numbers.
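The figures quoted above are mutually consistent, as the following short Python check shows. The variable names are ours; the three numbers are those given in the text.

```python
with_sequence = 2483  # EC numbers with at least one UniProt sequence
orphans = 1444        # EC numbers without any associated sequence

total = with_sequence + orphans
orphan_share = 100 * orphans / total

print(total)                   # 3927
print(f"{orphan_share:.1f}%")  # 36.8%
```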
We have already shown that orphans are present in about the same proportion in every class and subclass of enzyme activities. Likewise, we found no correlation between orphan distribution and main functional categories. Of the enzyme activities involved in well-studied metabolic pathways, 25.3% are sequence-less, while we found 49.5% orphans among non-metabolic enzyme activities.
Thus, it appears that there is an important gap between function and sequence, which implies that its progressive bridging would require a concerted effort as already underlined [3, 4]. Accordingly, we have built ORENZA, a database of ORphan ENZyme Activities, to offer such a tool to the research community. Hereafter, we describe the content of this resource and we detail how to use it in order to reach the goals defined above.
Construction and content
Structure of the ORENZA database
In order to build an efficient relational database that will help to identify the encoding gene for the maximum number of sequence-less enzyme activities (the so-called orphan enzymes), we have retrieved data from various public databases and organized them as described below.
A Perl script screened the UniProt Knowledgebase for occurrences of EC numbers. Any EC number assigned by the NC-IUBMB that is not referenced in UniProt is defined as an orphan enzyme activity. Note that we did not take into account partial or incomplete EC numbers (318 in the present version of UniProt), which are too ambiguous for sound use.
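The screening step amounts to a set difference between the official EC list and the EC numbers found in UniProt annotations. The sketch below is written in Python rather than Perl and is not the authors' script; the regular expression, the function names and the annotation line layout ("EC x.x.x.x" inside a description line) are assumptions, and the EC number 9.9.9.9 is an invented placeholder.

```python
import re

# A complete EC number has four numeric fields. Partial EC numbers such as
# "3.4.-.-" contain dashes, so they never match and are ignored.
COMPLETE_EC = re.compile(r"\bEC[ =](\d+\.\d+\.\d+\.\d+)(?![\d.])")


def referenced_ecs(annotation_lines):
    """Collect every complete EC number cited in the annotation lines."""
    found = set()
    for line in annotation_lines:
        found.update(COMPLETE_EC.findall(line))
    return found


def orphan_ecs(official_ecs, annotation_lines):
    """Official EC numbers that no annotation line refers to."""
    return set(official_ecs) - referenced_ecs(annotation_lines)


# Toy input; the line layout is an assumption about the flat-file format.
toy_lines = [
    "DE   Alcohol dehydrogenase (EC 1.1.1.1).",
    "DE   Uncharacterized peptidase (EC 3.4.-.-).",
    "DE   Hexokinase (EC 2.7.1.1).",
]
toy_official = {"1.1.1.1", "2.7.1.1", "9.9.9.9"}  # 9.9.9.9 is a placeholder

print(sorted(orphan_ecs(toy_official, toy_lines)))
```

In this toy run only the placeholder 9.9.9.9 is reported as an orphan, and the partial number 3.4.-.- is skipped, in line with the rule stated above.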
Structuring the relational database and implementing the web resource
We chose to use exclusively open source tools to build the ORENZA database.
Accordingly, PostgreSQL 8.1, one of the most advanced open source databases, was installed on a Linux platform. The PHP language was used to structure the Web service and to better exploit the queries from the relational database.
Browsing and searching ORENZA
One can browse and/or search ORENZA using three main avenues as described in detail below.
Browsing the whole set of EC numbers
The first level consists of characteristics of the enzymatic activity and its history. The description section contains information taken from the NC-IUBMB data, such as the different names (common, systematic, and others) of the enzyme, a scheme of the reaction(s) it catalyses, and data about the cofactors and NC-IUBMB comments about the reaction that are extracted from the ENZYME database. In the history part, we list fundamental references and the date of creation of the entry in the official NC-IUBMB nomenclature.
The second level presents information about the position of the enzyme in the cell metabolism, with the corresponding number of a KEGG map, and its taxonomic ubiquity, with a list of organisms where this enzymatic activity has been characterized as recorded in the BRENDA database.
The third level presents information about the protein itself, such as motifs (from PROSITE) and the lists of amino acid sequences found in SwissProt and in TrEMBL. If there is no sequence, as is the case for EC 220.127.116.11, which is labeled "orphan", this is clearly mentioned (Fig. 3B).
Browsing the orphan EC numbers
The second main avenue offered by ORENZA to explore the enzyme universe is the entire list, periodically updated, of the orphan enzyme activities. As described above, there are several ways to retrieve these orphans besides browsing the list in its entirety.
First, one can browse the different levels (class, subclass, etc.) of the EC hierarchy exactly as already described for the whole dataset of EC numbers.
It is possible to query ORENZA for a specific enzyme activity by entering either the EC number or the enzyme name. For example, entering the word "aspartate" recovers 41 EC numbers, 13 being presently not assigned to a sequence.
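A name query of this kind can be sketched as follows. The example uses Python's built-in `sqlite3` module as a stand-in for PostgreSQL, and the two-table layout (`enzyme`, `sequence`), the function `search_by_name` and all rows are hypothetical; the actual ORENZA schema is not described here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE enzyme (ec_number TEXT PRIMARY KEY, name TEXT NOT NULL);
    CREATE TABLE sequence (
        accession TEXT PRIMARY KEY,
        ec_number TEXT REFERENCES enzyme(ec_number)
    );
    """
)
# Invented rows: placeholder EC numbers and accessions, not real entries.
conn.executemany(
    "INSERT INTO enzyme VALUES (?, ?)",
    [
        ("0.0.0.1", "toy aspartate kinase"),
        ("0.0.0.2", "toy aspartate oxidase"),
        ("0.0.0.3", "toy glutamate ligase"),
    ],
)
conn.executemany(
    "INSERT INTO sequence VALUES (?, ?)",
    [("TOY_ACC_1", "0.0.0.1"), ("TOY_ACC_3", "0.0.0.3")],
)


def search_by_name(connection, term):
    """Return (ec_number, name, is_orphan) for enzyme names containing term."""
    rows = connection.execute(
        """
        SELECT e.ec_number, e.name, COUNT(s.accession)
        FROM enzyme AS e
        LEFT JOIN sequence AS s ON s.ec_number = e.ec_number
        WHERE e.name LIKE ?
        GROUP BY e.ec_number, e.name
        ORDER BY e.ec_number
        """,
        (f"%{term}%",),
    ).fetchall()
    # An EC number with zero attached sequences is flagged as an orphan.
    return [(ec, name, n_seq == 0) for ec, name, n_seq in rows]


hits = search_by_name(conn, "aspartate")
n_orphans = sum(1 for _, _, is_orphan in hits if is_orphan)
print(f"{len(hits)} EC numbers, {n_orphans} without a sequence")
```

The left join keeps enzymes that have no sequence row, which is what allows the orphan status to be reported alongside every hit.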
Table 2. Distribution of orphan enzyme activities in a few model organisms. Values are given as species-specific orphans (/total orphans).
Interestingly, the proportion of orphans that are common to these five model organisms is extremely low. Only three EC numbers are found as orphans in the five organisms: EC 3.6.1.18 (FAD diphosphatase), EC 3.6.4.4 (plus-end-directed kinesin ATPase), and EC 3.6.4.5 (minus-end-directed kinesin ATPase). Moreover, only three EC numbers are found as orphans in E. coli, fungi and animals but not in plants: EC 1.1.1.43 (phosphogluconate 2-dehydrogenase), EC 1.5.3.2 (N-methyl-L-amino-acid oxidase), and EC 3.6.4.1 (myosin ATPase).
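Comparisons of this kind reduce to set intersections and differences over per-organism orphan lists. The following Python sketch assumes such lists are available as sets; the function `shared_orphans`, the organism labels and the identifiers are invented for illustration and do not reproduce the data of the table.

```python
def shared_orphans(orphans_by_organism, include, exclude=()):
    """EC numbers orphan in every `include` organism and in no `exclude` one."""
    if not include:
        raise ValueError("at least one organism must be included")
    shared = set.intersection(
        *(set(orphans_by_organism[name]) for name in include)
    )
    for name in exclude:
        shared -= set(orphans_by_organism[name])
    return shared


# Toy orphan sets with placeholder organism labels and identifiers.
toy = {
    "org_a": {"ec1", "ec2", "ec3"},
    "org_b": {"ec1", "ec2", "ec4"},
    "org_c": {"ec1", "ec2", "ec5"},
    "org_d": {"ec1", "ec6"},
}

common_to_all = shared_orphans(toy, ["org_a", "org_b", "org_c", "org_d"])
absent_from_d = shared_orphans(toy, ["org_a", "org_b", "org_c"], exclude=["org_d"])
print(sorted(common_to_all), sorted(absent_from_d))
```

The first call mirrors the question "which orphans are common to all organisms", the second "which orphans are shared by three groups but not found as orphans in the fourth".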
Table 3. The six orphan enzyme activities that are specific to Homo sapiens.
Role in human physiology:
- urinary infection by Flavobacterium heparinum
- blood coagulation, cardiovascular disease, carcinogenesis
Building an ORENZA community
We clearly need the help of a large array of experts to identify the putative sequence(s) associated with orphan enzyme activities [3, 4]. In order to encourage such a collective effort, we propose, as a part of the ORENZA resource, a user-friendly tool that will allow people having sound knowledge about specific enzyme activities to make helpful suggestions. Moreover, such a resource could help to establish fruitful and dynamic interactions between different experts interested in the same field. Indeed, each suggestion (with identification of its author) will appear on the ORENZA resource as a new item in the individual file of the corresponding EC number. If several experts agree on the same suggestion, it would be transmitted to the curators of UniProt with a high degree of confidence. In cases where experts provide conflicting advice, all versions of the advice provided will be published as they have been set and validated. This would allow the community to decide, eventually.
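The handling of suggestions described above can be pictured with a small Python sketch. It is purely illustrative: the function `triage_suggestions`, the threshold of two agreeing experts (the text only says "several"), the status labels and the sample records are all assumptions, not features of the ORENZA site.

```python
from collections import defaultdict


def triage_suggestions(suggestions, min_agreeing=2):
    """Group (ec_number, expert, accession) records per EC number.

    Hypothetical policy: a single proposal backed by at least
    `min_agreeing` experts is forwarded; differing proposals are all kept.
    """
    by_ec = defaultdict(lambda: defaultdict(set))
    for ec, expert, accession in suggestions:
        by_ec[ec][accession].add(expert)

    report = {}
    for ec, proposals in by_ec.items():
        if len(proposals) > 1:
            status = "conflicting: publish all"
        else:
            experts = next(iter(proposals.values()))
            if len(experts) >= min_agreeing:
                status = "forward to curators"
            else:
                status = "awaiting agreement"
        report[ec] = (status, {acc: sorted(exp) for acc, exp in proposals.items()})
    return report


# Invented records: placeholder EC numbers, experts and accessions.
toy_suggestions = [
    ("0.0.0.1", "expert_a", "ACC_X"),
    ("0.0.0.1", "expert_b", "ACC_X"),
    ("0.0.0.2", "expert_a", "ACC_Y"),
    ("0.0.0.2", "expert_c", "ACC_Z"),
    ("0.0.0.3", "expert_b", "ACC_W"),
]
toy_report = triage_suggestions(toy_suggestions)
for ec in sorted(toy_report):
    print(ec, toy_report[ec][0])
```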
The presence of so many EC numbers that do not have an associated sequence appears rather extraordinary at a time when we are inundated by genomic data. Such a situation hampers research at different levels. Alleviating this problem would be very helpful for the difficult task of annotating and/or reannotating genomes. Thus, there is an urgent need to bridge this unwanted gap between biochemical knowledge and massive identification of coding sequences, and we and others (see Karp [3]) think that the whole community must contribute to this task. This is why we built the ORENZA resource.
We designed this database to be an interactive tool allowing each expert to exploit his/her knowledge about an enzyme, or a group of related enzymes, that has been registered as an orphan enzyme activity.
Different cases may exist, and we have already described three of them where personal expertise would eliminate many errors and/or neglected instances. (i) A trivial error takes place when the enzyme has been correctly described in a sequence database but its EC number is not indicated. This is the case, for example, for glyceraldehyde 3-phosphate dehydrogenases, as already shown. One of these sequences (GAPOR, EC 18.104.22.168) has been entered in UniProt without its EC number although the information was given in relevant published papers. Presently, we estimate that up to 20% of the so-called orphan EC numbers might correspond to such a trivial incomplete annotation in the sequence databases (OL & BL, unpublished results). (ii) A sequence or a partial sequence has been previously determined but has not been published. We recently described such an instance in the case of putrescine carbamoyltransferases. (iii) We further observed that around 50% of the present orphan EC numbers are found in only one species or a few closely related organisms, as shown in Fig. 6. This is due, in the large majority of cases, to the fact that we lack genetic tools for such imperfectly studied organisms. Moreover, the availability of genomic sequences for closely related species is of no help when the orphan EC numbers are specific to the studied organisms (see Tables 2 and 3).
We consider ORENZA to be a useful resource for all categories of biologists. Let us take for instance the data summarized in Table 2 and more precisely the observation that human cells harbour six enzyme activities that are not found elsewhere and that are not associated with any amino acid sequence (Table 3).
Any biologist would attempt to better understand the origin of such metabolic specificities. Any progress in this field could have positive consequences in terms of medical advances (see Table 3).
The genomicist would wonder whether the occurrence of these six orphans indicates a major annotation problem in the current analysis of the human genome. The expert on either a specific enzyme or a physiological aspect related to these orphan enzyme activities would feel personally concerned, and we hope that he/she will promptly answer such a challenge.
Availability and requirements
The ORENZA resource is freely available via the Internet at http://www.orenza.u-psud.fr. The web site has been tested with the Mozilla 1.7.12, Mozilla Firefox 1.5, and Internet Explorer 6.0 web browsers.
Complete lists of all EC numbers and of orphan EC numbers are available and will be periodically updated. All data can be easily downloaded as text files.
We thank the two anonymous reviewers for their constructive comments and Claudio Scazzocchio for critical reading of the manuscript and help with the English language. The Agence Nationale de Recherche (programme Masse de Données) and the CNRS have funded this project, including the processing charge for publishing this paper.
- Nomenclature Committee of the International Union of Biochemistry and Molecular Biology (NC-IUBMB): Enzyme Nomenclature. Eur J Biochem 1999, 264: 610–650. [http://www.chem.qmul.ac.uk/iubmb/enzyme/index.html] 10.1046/j.1432-1327.1999.nomen.x
- Fleischmann A, Darsow M, Degtyarenko K, Fleischmann W, Boyce S, Axelsen KB, Bairoch A, Schomburg D, Tipton KF, Apweiler R: IntEnz, the integrated relational enzyme database. Nucleic Acids Res 2004, 32: D434–437. [http://www.ebi.ac.uk/intenz/index.html] 10.1093/nar/gkh119
- Karp PD: Call for an enzyme genomics initiative. Genome Biol 2004, 5: 401. 10.1186/gb-2004-5-8-401
- Lespinet O, Labedan B: Orphan enzymes? Science 2005, 307: 42. 10.1126/science.307.5706.42a
- Lespinet O, Labedan B: Puzzling over orphan enzymes. Cell Mol Life Sci 2006, 63: 517–523. 10.1007/s00018-005-5520-6
- Bairoch A, Apweiler R, Wu CH, Barker WC, Boeckmann B, Ferro S, Gasteiger E, Huang H, Lopez R, Magrane M, Martin MJ, Natale DA, O'Donovan C, Redaschi N, Yeh LS: The Universal Protein Resource (UniProt). Nucleic Acids Res 2005, 33: D154–159. [http://www.expasy.uniprot.org/index.shtml] 10.1093/nar/gki070
- Kanehisa M, Goto S, Kawashima S, Okuno Y, Hattori M: The KEGG resources for deciphering the genome. Nucleic Acids Res 2004, 32: D277-D280. [http://www.genome.ad.jp/kegg] 10.1093/nar/gkh063
- Schomburg I, Chang A, Ebeling C, Gremse M, Heldt C, Huhn G, Schomburg D: BRENDA, the enzyme database: updates and major new developments. Nucleic Acids Res 2004, 32: D431-D433. [http://www.brenda.uni-koeln.de/] 10.1093/nar/gkh081
- Wheeler DL, Chappey C, Lash AE, Leipe DD, Madden TL, Schuler GD, Tatusova TA, Rapp BA: Database resources of the National Center for Biotechnology Information. Nucleic Acids Res 2000, 28: 10–14. [http://www.ncbi.nlm.nih.gov/Taxonomy/] 10.1093/nar/28.1.10
- Bairoch A: The ENZYME database in 2000. Nucleic Acids Res 2000, 28: 304–305. [http://www.expasy.org/enzyme/] 10.1093/nar/28.1.304
- Hulo N, Bairoch A, Bulliard V, Cerutti L, De Castro E, Langendijk-Genevaux PS, Pagni M, Sigrist CJ: The PROSITE database. Nucleic Acids Res 2006, 34: D227-D230. [http://www.expasy.org/prosite/] 10.1093/nar/gkj063
- Berman HM, Henrick K, Nakamura H: Announcing the worldwide Protein Data Bank. Nature Structural Biology 2003, 10: 980. [http://www.wwpdb.org/] 10.1038/nsb1203-980
- Green ML, Karp PD: Genome annotation errors in pathway databases due to semantic ambiguity in partial EC numbers. Nucleic Acids Res 2005, 33: 4035–4039. 10.1093/nar/gki711
- Naumoff DG, Xu Y, Glansdorff N, Labedan B: Retrieving sequences of enzymes experimentally characterized but erroneously annotated: the case of the putrescine carbamoyltransferase. BMC Genomics 2004, 5: 52. 10.1186/1471-2164-5-52
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.