Data mining of enzymes using specific peptides
© Weingart et al; licensee BioMed Central Ltd. 2009
Received: 4 June 2009
Accepted: 24 December 2009
Published: 24 December 2009
Predicting the function of a protein from its sequence is a long-standing challenge of bioinformatic research, typically addressed using either sequence-similarity or sequence-motifs. We employ a novel motif method based on Specific Peptides (SPs), which are unique to specific branches of the Enzyme Commission (EC) functional classification. We devise the Data Mining of Enzymes (DME) methodology, which searches for SPs on arbitrary proteins and determines from a protein's sequence whether it is an enzyme and, if so, what its EC classification is.
We extract novel SP sets from Swiss-Prot enzyme data. Using a training set of July 2006, and test sets of July 2008, we find that the predictive power of SPs, both for true-positives (enzymes) and true-negatives (non-enzymes), depends on the coverage length of all SP matches (the number of amino-acids matched on the protein sequence). DME is quite different from BLAST. Comparing the two on an enzyme test set of July 2008, we find that DME has lower recall. On the other hand, DME can provide predictions for proteins regarded by BLAST as having low homologies with known enzymes, thus supplying complementary information. We test our method on a set of proteins belonging to 10 bacteria, dated July 2008, establishing the usefulness of the coverage-length cutoff to determine true-negatives. Moreover, sifting through our predictions we find that some of them have been substantiated by Swiss-Prot annotations by July 2009. Finally we extract, for production purposes, a novel SP set trained on all Swiss-Prot enzymes as of July 2009. This new set increases considerably the recall of DME. The new SP set is being applied to three metagenomes: Sargasso Sea with over 1,000,000 proteins, producing predictions of over 220,000 enzymes, and two human gut metagenomes. The outcome of these analyses can be characterized by the enzymatic profile of the metagenomes, describing the relative numbers of enzymes observed for different EC categories.
Employing SPs for predicting enzymatic activity of proteins works well once one utilizes coverage-length criteria. In our analysis, L ≥ 7 has led to highly accurate results.
Recently there has been a rapid growth in the number of putative proteins derivable from new genomic and metagenomic data [1]. The extended use of environmental shotgun sequencing to study diverse microbial systems has made metagenomics a rapidly growing field, leading to a flood of data and calling for the development and application of new tools for its investigation [2]. Conventional tools for predicting the function of a protein from its sequence are based on sequence-similarity [3] or sequence-motifs [4, 5]. Here we outline a relatively simple and straightforward method that is applicable to large numbers of sequences. Its purpose is to determine whether each protein in the data is an enzyme and, if so, what its EC classification is. This Data Mining of Enzymes (DME) is based on the Specific Peptide (SP) method of Kunik et al. [6], and is carried out by comparing the sequences of all proteins with a list of all SPs and looking for matches of the latter in the data.
SPs are strings of amino-acids, extracted from enzyme sequences using the motif extraction algorithm MEX [7]. They are selected for their specificity to levels of the Enzyme Commission (EC) 4-level functional hierarchy. We have updated the SP set of [6] by extracting it from all Swiss-Prot enzymes of July 1st, 2006. More details are provided in Methods.
Using SPs for prediction of enzymatic function needs some further decisions as to what to do if various SP hits on the same protein have EC assignments that are not consistent with one another. Moreover, one should decide when a single SP hit is sufficient to make a prediction. The methodology developed here relies on coverage length (overall number of amino-acids) of consistent SP hits. This is further described below, when testing performance on an enzyme test set, and when discussing a ten-organism test-set that contains non-enzymatic as well as enzymatic proteins. We develop a random model for the latter to assess the effect of accidental SP matches. The resulting methodology, which we call Data Mining of Enzymes (DME), is being applied to analyze several metagenomes.
The new SP sets
A novel method based on sequence motifs has been proposed by Kunik et al. [6], who studied enzymes in the Swiss-Prot database. They demonstrated that enzyme functions, as represented by the four-level EC hierarchy, can be deduced from the appearance of deterministic short strings of amino-acids, denoted as Specific Peptides (SPs), on these enzymes. The SPs were derived from enzyme sequence data using the unsupervised motif extraction algorithm MEX [7], and filtered by EC so that each SP is specific to a particular EC branch, specifying the EC function that the enzyme performs. Thus, if an extracted motif is found to occur on enzymes belonging to only one EC number (i.e., the 4th level of the EC hierarchy), this peptide is declared to be an SP labeled with this EC number. If, however, the motif occurs on enzymes with several EC numbers, all of which share the same 3rd-level hierarchy (i.e. the first three digits of their EC numbers are the same), the motif is declared an SP labeled at the third level of the EC hierarchy, etc. The SPs of [6] comprise on average 8.4 amino-acids (SD 4.5), and were shown to compete favorably with a Smith-Waterman based SVM classifier. Usage of the SP methodology is demonstrated by our web-tool http://adios.tau.ac.il/DME. Given the sequence of an enzyme, this tool searches through the set of all SPs and finds which of them coincide with substrings of the sequence. It indicates where they lie and which EC assignment is associated with each SP, and provides the EC predicted by the DME method for the queried protein.
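The labeling rule described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function name `sp_label` is hypothetical, and it assumes that every EC number is given as a complete four-field string such as "6.1.1.1".

```python
from typing import Iterable, Optional


def sp_label(ec_numbers: Iterable[str]) -> Optional[str]:
    """Return the deepest EC prefix shared by all enzymes that carry a motif.

    Hypothetical helper: assumes complete four-field EC strings.
    Returns None when the enzymes already disagree at level 1.
    """
    split = [ec.split(".") for ec in set(ec_numbers)]
    if not split:
        return None
    shared = []
    # Walk the hierarchy level by level and stop at the first disagreement.
    for digits in zip(*split):
        if len(set(digits)) != 1:
            break
        shared.append(digits[0])
    return ".".join(shared) if shared else None


print(sp_label(["6.1.1.1"]))              # motif seen on one EC number only
print(sp_label(["6.1.1.1", "6.1.1.4"]))   # shared third level
print(sp_label(["1.1.1.1", "3.6.3.14"]))  # no shared level
```

A motif found only on enzymes of one EC number keeps the full four-level label, while a motif shared by several EC numbers is labeled at the deepest level on which they all agree.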
Kunik et al. [6] investigated 50,698 enzyme sequences of the Swiss-Prot release 48.3 of October 2005. We have used the same methodology and applied it to all enzymes in the Swiss-Prot/Enzyme records of July 1st, 2006. The number of enzymes that have a single EC assignment is 89,854. Applying MEX and filtering its output by EC levels in the same way as [6], we have obtained 87,017 SPs. This new 1st list of SPs serves as the basis for developing and analyzing our methodology.
In making the prediction of an EC number (i.e. 4th level of the EC hierarchy) based on one SP match, or several SP matches that have the same EC number assignment, we require that the total number of amino-acids of the protein matched with these SPs be at least seven. We refer to this number as the coverage-length L. If L at level 4, L4, is less than 7, we check for SP hits that are consistent at level 3 of the EC hierarchy, i.e. have identical first three digits in their assignments. Once again, a prediction is made if L at level 3, L3, is at least 7. In principle, the threshold on L at every EC level can be viewed as a parameter of our method. Lowering this threshold increases recall at the expense of precision, as will be discussed below.
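The decision rule above can be sketched as follows. This is an illustration under stated assumptions rather than the DME code itself: coverage length is taken to be the number of distinct residues covered by exact SP matches sharing an EC prefix, competing prefixes are resolved by picking the one with the largest coverage, and the peptides and function names in the example are invented.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Set, Tuple

Hit = Tuple[int, int, str]  # (start, end, EC label of the SP), end exclusive


def find_hits(seq: str, sps: Dict[str, str]) -> List[Hit]:
    """Exact-match search: one hit per occurrence of each SP in the sequence."""
    hits = []
    for peptide, ec in sps.items():
        start = seq.find(peptide)
        while start != -1:
            hits.append((start, start + len(peptide), ec))
            start = seq.find(peptide, start + 1)
    return hits


def coverage_by_prefix(hits: List[Hit], level: int) -> Dict[str, int]:
    """Coverage length per EC prefix at a given level (assumed: distinct residues)."""
    covered: Dict[str, Set[int]] = defaultdict(set)
    for start, end, ec in hits:
        digits = ec.split(".")
        if len(digits) >= level:  # SPs labeled above this level do not count
            covered[".".join(digits[:level])].update(range(start, end))
    return {prefix: len(pos) for prefix, pos in covered.items()}


def dme_predict(seq: str, sps: Dict[str, str], threshold: int = 7) -> Optional[str]:
    """Try level 4 first, then level 3; return None if neither reaches the threshold."""
    hits = find_hits(seq, sps)
    for level in (4, 3):
        cov = coverage_by_prefix(hits, level)
        if cov:
            best = max(cov, key=cov.get)  # assumption: largest coverage wins
            if cov[best] >= threshold:
                return best
    return None


# Hypothetical peptides: L4 is 6 and 5 for the two EC numbers, but L3 is 11.
sequence = "MKTAYIAKQRGHWLLDEFP"
print(dme_predict(sequence, {"AYIAKQ": "6.1.1.1", "GHWLL": "6.1.1.4"}))
```

In the example neither level-4 label reaches the threshold on its own, so the two hits are pooled at level 3 and the prediction is made there. For SP sets of the size discussed here, a hash-based look-up of substrings would be a more practical search than the per-peptide scan used in this sketch.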
Compilation of training and test datasets.
Dataset | Selection Criteria from Swiss-Prot | Number of Proteins (and SPs)
Training set #1 | Single EC annotation and Date-Integrated before 7/1/2006 | (#SPs = 87,017)
"Enzyme Test Set" | EC annotation and Date-Integrated between 7/1/2006 and 7/1/2008 |
"Ten Organism Test-Set" | EC annotation and Date-Integrated between 7/1/2006 and 7/1/2008 and all non-enzymes before 7/1/2008 |
Training set #2 | Single EC annotation and Date-Integrated before 7/27/2009 | (#SPs = 312,465)
54% of the proteins in the 1st training set carry Swiss-Prot annotations of 'active site', 'binding site' or 'metal binding site' at specific locations of single amino-acids. SPs cover these functionally important sites significantly more than other loci on proteins, thus indicating biological significance of SPs (for an extensive discussion see [8], in particular Table 1 there). SP matches that overlap such sites are compiled, and the corresponding SPs are denoted as Annotated SPs (ASPs). We have thus compiled a list of 6,078 ASPs. All appear at least four times in the training set, and the location of the annotation is consistent in the different appearances. Most ASPs carry single annotations (1,900 active sites, 1,932 binding sites and 1,819 metal binding sites), 418 ASPs carry two annotations and 3 ASPs carry all three annotations.
A second set of SPs is extracted from Swiss-Prot data dated July 27th, 2009. This training set, consisting of all singly annotated enzymes, contains 201,169 proteins. It has led to 312,465 SPs. Their length distribution is presented in Additional file 2, Figure S1. This set includes 285,485 SPs with labels corresponding to EC levels 3 and 4 (containing 257,598 SPs of length ≥ 7). Only SPs with EC labels at levels 3 and 4 are relevant for the assignment of EC level-3 annotations to proteins, and hence for the calculation of recall included in Table 1. It should be emphasized that only 191,275 of the Swiss-Prot annotated enzymes in the training set carry EC annotations at levels 3 and 4. They are the ones on which the EC predictions at level 3 are tested, leading to the recall result of 94%. The 2nd SP set is being used for the analysis of metagenomic data and is incorporated in our web-tool at http://adios.tau.ac.il/DME.
Estimate of accidental SP matches
Proteins that do not possess enzymatic functions may still have a substring that matches an SP. Such SP matches will be called 'accidentals'. Their occurrences can be modeled by SP hits on random protein sequences. Such random sequences are generated from real data by scrambling the order of the amino-acids in every protein, conserving only first-order statistics. Three such sets were produced in order to measure the expected number of random hits. Estimates of the probabilities of accidental occurrences of SPs are derived below for the 10 organism test-set and for Sargasso Sea data.
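A minimal sketch of such a random model is given below. It assumes that "scrambling" means a uniform shuffle of the residues within each protein, which preserves each protein's length and amino-acid composition; the function names and the fixed seed are illustrative choices, not taken from the paper.

```python
import random
from typing import List


def scramble(seq: str, rng: random.Random) -> str:
    """Shuffle the residues of one protein, keeping its amino-acid composition."""
    residues = list(seq)
    rng.shuffle(residues)
    return "".join(residues)


def random_model(proteins: List[str], n_sets: int = 3, seed: int = 0) -> List[List[str]]:
    """Produce n_sets scrambled copies of the whole protein set."""
    rng = random.Random(seed)  # fixed seed so that the sketch is reproducible
    return [[scramble(p, rng) for p in proteins] for _ in range(n_sets)]


sets = random_model(["MKTAYIAKQRGHWLLDEFP", "MSTNPKPQRKTKRNT"])
print(len(sets), len(sets[0]))
```

Searching the scrambled sets with the same SP list and the same coverage-length criteria then gives an estimate of how many hits are expected by chance alone.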
Recall-precision analysis of EC annotations in enzymes
This is a generalization of the common terms used in binary classification problems, where P|P, P|DP and NP|P take the place of true-positive, false-positive and false-negative, respectively.
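Assuming that the standard binary-classification formulas carry over unchanged under this correspondence (an assumption, with P|P, P|DP and NP|P read as counts of proteins), precision and recall would take the form:

```latex
\[
\mathrm{precision} = \frac{P|P}{P|P + P|DP},
\qquad
\mathrm{recall} = \frac{P|P}{P|P + NP|P}
\]
```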
Recall-precision analysis of EC annotations in proteins
in conventional binary classifications.
Analysis of the Methodology
Analysis of the Enzyme Test Set using the 1st SP set
Variation of precision and recall of DME (based on the 1st SP set) on the enzyme test-set as function of the L3 threshold.
Although precision turns out to be quite high, even for low L3 values, recall is low when compared to what BLAST [9] can achieve on this test-set. Using the most significant hit of a BLAST search against the 1st training set as its prediction, and requiring its e-value to stay below 1e-05, we find BLAST precision of 98% and recall of 95%, to be compared with DME values of 98.4% and 70% when setting L3 ≥ 7. Thus, while precision is similar, DME loses on recall. There is no direct relation between DME and BLAST, although high coverage-length L values of DME usually go hand in hand with very low e-values of BLAST. Differences may occur for low L values of DME and relatively high e-values in BLAST. We refer to Kunik et al. [6] for a discussion of such examples (see Table 4 there). The advantages of SPs in resolving classification problems in situations of remote homology have been discussed and exemplified by [8].
It is worth pointing out that the ability to use such a small threshold value of L ≥ 7 is strongly connected to our requirement that the SP matches on the protein's sequence be exact. If one were to allow for insertions, deletions or replacements, such as those scored by the BLOSUM62 matrix [10], this would not work. Based on various trials we may state that, whereas reliance on BLOSUM works well for BLAST searches over large sequences, it ruins the predictivity and specificity of SP searches even if only single amino-acid changes are allowed.
Analysis of the ten organism test-set
Comparison of results for the ten organism test-set with those of a random model as function of coverage-length at level 3 of the EC hierarchy.
DME predictions vs. Swiss-Prot EC (level 3) annotations for the 10 organism Test Set.
The interest in this exercise is twofold: to see how well our method performs on unassigned proteins, i.e. true-negatives, and how good our predictions are for putative novelties. Our accuracy turns out to be high, 95.1%, which indicates that the large majority of our negative assignments are correct.
DME predictions for the ten-organism test-set are compared with recent Swiss-Prot EC assignments.
DME Prediction (1st SP set) | Current Swiss-Prot EC annotation
Classification based on Annotated SPs
It has been noted by [6] and [8] that some of the SPs can be demonstrated to play important biological roles since they carry crucial amino-acids known to serve as active sites, binding sites or metal binding sites. Such annotations are available for 54% of the enzymes in the 1st Swiss-Prot training set. Selecting only SPs that carry these annotations we obtain a set of 6,078 Annotated SPs (ASPs), a mere 7% of all SPs. We have tested it on the enzyme test set. Using annotation predictions at the third level of EC we find precision 99.6% and recall 25.4%. The limited recall is due to the fact that ASPs have been derived from only 54% of the training set. Nonetheless they possess the advantage of being selected for their demonstrated operational importance to the catalytic function. Because of their limited recall we have not used the ASPs as the primary tool for large scale analysis; however, we list their properties in our web tool http://adios.tau.ac.il/DME. Any queried protein can be analyzed by this tool for SP hits and the expected DME prediction. The appearance of ASPs may lend additional credence to the prediction, as well as specify the positions of expected active or binding sites.
Analysis of Sargasso-Sea data
Numbers of sequences with consistent SP hits (same category at level 3 of the EC hierarchy) are compared between 5000 proteins randomly chosen from Sargasso-Sea data, and a corresponding random model, as function of coverage-length.
Similar results are obtained for L4. The results of Table 6 are slightly better than Table 3. The reason is that we have limited ourselves here to SPs of individual length 7 or more. Once again we choose L = 7 as our threshold for DME predictions. Applying DME with this threshold we obtain EC assignments at levels 3 and 4 for 220,278 proteins. All assignments are provided in Additional file 1, Tables S2-S4.
In addition to 6.1.1 (aaRS) enzymes we observe the following leading categories: 3.6.3 (Hydrolases catalyzing transmembrane movement of substances involving ATPases), 2.7.7 (Nucleotidyl transferases), 1.1.1 (Oxidoreductases acting on the CH-OH group of donors), and 4.2.1 (Carbon-oxygen lyases).
Leading occurrences of EC-numbers in Sargasso-Sea data
DNA-directed RNA polymerase
NADH dehydrogenase (quinone)
DNA topoisomerase (ATP-hydrolysing). DNA gyrase.
carbamoyl-phosphate synthase (glutamine-hydrolysing)
H+-transporting two-sector ATPase. ATP synthase.
DNA-directed DNA polymerase
Some examples of doubly annotated enzymes uncovered by DME in the Sargasso-Sea data.
The first and the last entries in Table 8 have many analogs in currently known doubly-annotated enzymes in Swiss-Prot. Checking all proteins we find that the SP hits that belong to the two different EC numbers do not overlap on the protein sequences, thus falling comfortably into the categorization of two different catalytic domains. It is interesting to note that finding multiple domains is easier with SPs than it is with BLAST: we will not miss out on a small domain of a protein that may be overshadowed by sequence similarities with a larger protein domain, and we can immediately check whether the different catalytic regions lie on disjoint sections of the protein. A full list of the doubly annotated Sargasso-Sea enzymes is presented in Additional file 1, Table S3. A further list of triple-enzymatic annotations is presented in Additional file 1, Table S4.
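The non-overlap check mentioned above is simple to express in code. The sketch below assumes that SP hits are stored as half-open (start, end) intervals of residue indices, one list per EC number; the names and the example coordinates are hypothetical.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # half-open interval [start, end) on the protein sequence


def overlaps(a: Span, b: Span) -> bool:
    """True if two half-open intervals share at least one residue."""
    return a[0] < b[1] and b[0] < a[1]


def hits_are_disjoint(hits_a: List[Span], hits_b: List[Span]) -> bool:
    """True if no SP hit of the first EC number overlaps a hit of the second."""
    return not any(overlaps(a, b) for a in hits_a for b in hits_b)


# Invented coordinates: hits for one EC number near the N-terminus,
# hits for a second EC number further along the chain.
print(hits_are_disjoint([(10, 18), (40, 47)], [(120, 129)]))
```

A protein whose two sets of hits pass this check is a candidate for carrying two separate catalytic domains, which can then be examined individually.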
Human Gut Metagenome
Gill et al. [13] have analyzed the DNA sequences obtained from fecal DNA of two healthy adults: 'subject 7', a female aged 28, and 'subject 8', a male aged 37. We have analyzed the resulting proteins (downloaded from http://img.jgi.doe.gov/m/) with our DME method. The two proteomes of subjects 7 and 8 consist of 20,523 and 25,980 proteins respectively. We predict enzymatic annotations for 3,428 proteins of subject 7 and 4,102 proteins of subject 8. In relative terms, these numbers are lower than the enzymatic content of Sargasso-Sea. Numbers of 6.1.1 enzymes are predicted to be 260 and 264 for subjects 7 and 8 respectively. Thus the number of different species contained in these samples is scaled down by two orders of magnitude compared to the Sargasso-Sea data, which is quite reasonable given the size of the databases. Further comparisons between the three metagenomes are offered in the next section.
To compare different metagenomes with each other, one obviously has to resort to some normalization method. Normalizing the results of a histogram like Figure 2 by the total number of enzymes that we find, we obtain a spectrum characteristic of the genome or metagenome we study, which we will refer to as its enzymatic profile.
Absolute values of differences between enzymatic profiles, based on the DME predicted distributions at level 3 of EC.
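The normalization and comparison described here can be sketched as follows. This is an illustration only: it assumes that each predicted enzyme contributes one level-3 EC label, and the function names and the toy inputs are invented.

```python
from collections import Counter
from typing import Dict, Iterable


def enzymatic_profile(predictions: Iterable[str]) -> Dict[str, float]:
    """Fraction of predicted enzymes falling into each level-3 EC category."""
    counts = Counter(predictions)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {ec: n / total for ec, n in counts.items()}


def profile_differences(p: Dict[str, float], q: Dict[str, float]) -> Dict[str, float]:
    """Absolute difference per EC category; a missing category counts as zero."""
    return {ec: abs(p.get(ec, 0.0) - q.get(ec, 0.0)) for ec in set(p) | set(q)}


# Toy inputs, not real metagenome data.
sample_a = enzymatic_profile(["6.1.1", "6.1.1", "2.7.7", "1.1.1"])
sample_b = enzymatic_profile(["6.1.1", "2.7.7"])
print(sorted(profile_differences(sample_a, sample_b).items()))
```

Because each profile sums to one, datasets of very different sizes can be compared category by category, and the categories with the largest differences are the natural candidates for closer inspection.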
It has been emphasized by [14] and by [15] that the functional characteristics of a metagenome vary with the environment in which it is found. Hence we expect the enzymatic profiles to vary accordingly. Our exercise shows that the gross features of microbial communities may be similar; thus more attention will have to be paid to smaller details, in particular to the cases where the relative differences between EC categories are the largest. This may become a useful tool in the future.
We wish to close this section by emphasizing that the three metagenomic profiles are different from those derived from the genome of E. coli, and very different from human. The comparisons are presented in Additional file 2, Figure S2, drawn according to the top 20 categories of E. coli, and Additional file 2, Figure S3, displaying the top 20 categories of human. It is quite evident that the weights (or numbers of different genes) of different EC categories change considerably from human to E. coli to bacterial metagenomes. This implies that enzymatic profiles contain information that may be of value in future studies of novel genetic material.
Using SPs it seems quite straightforward to perform data-mining of enzymes. There are however several provisos: a) although a majority of enzymes carry SPs, there exists a minority that does not; hence not all enzymes are expected to be discovered in a new dataset. b) SPs were substantiated on a training set, and their generalization carries with it some error, even on a test set composed of enzymes only. Errors may be due to (i) changes in the official EC classification of an enzyme, (ii) real biological changes such as evolutionary loss of an active site in a protein that resembles a known enzyme but has no catalytic function, or (iii) random appearance of SPs on proteins that have no catalytic activity. Errors due to reclassification of EC numbers cannot be controlled in any a-priori manner. The question of functionality loss can be partially checked by searching for the absence of annotated SPs in cases where such annotations may be expected for the enzyme in question. This demonstrates the importance of detailed corroboration of each individual prediction of the large-scale method studied here. The third source of errors, random appearance of SPs on proteins other than enzymes, has been taken into account by limiting our predictions to consistent SP hits with a minimal coverage length of 7, and by specifying the L values of our predictions as a measure of their confidence.
DME is based on deterministic motifs only, i.e. strings with specific sequences of amino-acids. Comparing it with the well-known motif method of Prosite patterns [5], by using available information in Swiss-Prot, we find that the latter has precision of 97% and recall of only 47% on the Enzyme test set, thus falling short of DME predictions. When comparing DME to BLAST on the enzyme test-set we found that DME had comparable precision (98.4% vs 98%) while BLAST had much better recall (95% vs 70%). Note that this comparison was based on the 1st SP set of July 2006.
It should be appreciated that the comparative procedure based on the Enzyme test set has some bias in favor of BLAST, because the latter serves as one of the inputs to Swiss-Prot assignments. As a result, cases of remote homology which may be captured by DME could have been missed by BLAST-based assignments, as was demonstrated by [6] and by [8]. The SP-based search has two other advantages over BLAST: it is conceptually simpler, relying only on a look-up table, and it points to specific locations on the queried protein which may be relevant to the expected catalytic function of that enzyme. Hence it may have wide practical implications for enzyme research and development.
In spite of all the provisos outlined in the first paragraph, our predictions concerning the 10 organism test-set reported in this paper do extremely well. Moreover, note that the recall of SPs on their training sets increased markedly, from 85% in 2006 to 94% in 2009 (see Table 1). This means that the minority of enzymes without SP hits diminishes with time. The reason is quite clear: MEX thrives on redundancy of patterns in the data. Therefore, the more proteins of the same family there are in the database, the better MEX will perform. As these families fill up in the Swiss-Prot database, they can be better represented by simple SP motifs. Higher recall on the training set is expected to translate into higher recall on future test sets, suggesting that the gap between the recall of BLAST and that of DME will shrink with time. Indeed, carrying out a DME analysis, based on the 2nd SP set, of 19,849 enzymes that were added to Swiss-Prot from July 28 to Sep 29, 2009, we find on this novel test set precision of 99.2% and recall of 92.4%. This is a considerable increase over the recall of 70% of the 1st SP set measured on the enzyme test set (see Table 1).
A straightforward peptide characterization of protein families seemed hopeless a decade or two ago, and hence necessitated the development of more sophisticated approaches such as BLAST, to quantify sequence similarities. Our analysis demonstrates that this has changed with time (and increasing amounts of data) so that nowadays the SP approach may be regarded as a useful tool, leading to valuable information. Such information, for three metagenomic data-sets, has been presented here as an example of the power of our novel methodology.
The requirement that SP occurrences on protein sequences have some minimal coverage length, e.g. L ≥ 7 amino-acids in our analyses, leads to the novel tool of DME. It is applicable to large genomic and metagenomic data, and provides a good indicator for the enzymatic classification of the queried proteins, based on a look-up table only. A web tool identifying SP (and ASP) occurrences on any queried protein sequence, and providing the EC prediction of DME, is available online at http://adios.tau.ac.il/DME.
We thank Uri Gophna and Eytan Ruppin for helpful conversations.
This study was supported in part by fellowships granted to UW and YL by the Edmond J. Safra Bioinformatics program at Tel-Aviv University.
1. Angly FE, Felts B, Breitbart M, Salamon P, Edwards RA, Carlson C, Chan AM, Haynes M, Kelley S, Liu H, Mahaffy JM, Mueller JE, Nulton J, Olson R, Parsons R, Rayhawk S, Suttle CA, Rohwer F: The Marine Viromes of Four Oceanic Regions. PLoS Biol 2006, 4(11):e368. doi:10.1371/journal.pbio.0040368
2. Eisen JA: Environmental Shotgun Sequencing: Its Potential and Challenges for Studying the Hidden World of Microbes. PLoS Biol 2007, 5(3):e82. doi:10.1371/journal.pbio.0050082
3. Tian W, Skolnick J: How well is enzyme function conserved as a function of pairwise sequence identity? J Mol Biol 2003, 333: 863–882. doi:10.1016/j.jmb.2003.08.057
4. Bork P, Koonin EV: Protein sequence motifs. Curr Op Structural Biology 1996, 6: 366–376. doi:10.1016/S0959-440X(96)80057-1
5. Bairoch A, Bucher P, Hofmann K: Prosite. Nuc Acids Res 1997, 25: 217–221. doi:10.1093/nar/25.1.217
6. Kunik V, Meroz Y, Solan Z, Sandbank B, Weingart U, Ruppin E, Horn D: Functional representation of enzymes by specific peptides. PLOS Comp Biol 2007, 3(8):e167. doi:10.1371/journal.pcbi.0030167
7. Solan Z, Horn D, Ruppin E, Edelman S: Unsupervised learning of natural languages. Proc Natl Acad Sci USA 2005, 102: 11629–11634. doi:10.1073/pnas.0409746102
8. Meroz Y, Horn D: Biological Roles of Specific Peptides in Enzymes. Proteins: Structure, Function, and Bioinformatics 2008, 72(2):606–612. doi:10.1002/prot.21951
9. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol 1990, 215: 403–410.
10. Eddy SR: Where did the BLOSUM62 alignment score matrix come from? Nat Biotechnol 2004, 22(8):1035–6. doi:10.1038/nbt0804-1035
11. Venter JC, Remington K, Heidelberg JF, Halpern AL, Rusch D, Eisen JA, Wu DY, Paulsen I, Nelson KE, Nelson W, Fouts DE, Levy S, Knap AH, Lomas MW, Nealson K, White O, Peterson J, Hoffman J, Parsons R, Baden-Tillson H, Pfannkoch C, Rogers YH, Smith HO: Environmental Genome Shotgun Sequencing of the Sargasso Sea. Science 2004, 304: 66–74. doi:10.1126/science.1093857
12. Watanabe K, Nelson J, Harayama S, Kasai H: ICB database: the gyrB database for identification and classification of bacteria. Nucleic Acids Res 2001, 29(1):344–5. doi:10.1093/nar/29.1.344
13. Gill SR, Pop M, DeBoy RT, Eckburg PB, Turnbaugh PJ, Samuel BS, Gordon JI, Relman DA, Fraser-Liggett CM, Nelson KE: Metagenomic Analysis of the Human Distal Gut Microbiome. Science 2006, 312: 1355–1359. doi:10.1126/science.1124234
14. Tringe SG, Von Mering C, Kobayashi A, Salamov AA, Chen K, Chang HW, Podar M, Short JM, Mathur EJ, Detter JC, Bork P, Hugenholtz P, Rubin EM: Comparative Metagenomics of Microbial Communities. Science 2005, 308: 554–557. doi:10.1126/science.1107851
15. von Mering C, Hugenholtz P, Raes J, Tringe SG, Doerks T, Jensen LJ, Ward N, Bork P: Quantitative Phylogenetic Assessment of Microbial Communities in Diverse Environments. Science 2007, 315: 1126–1130. doi:10.1126/science.1133420
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.