EFICAz2: enzyme function inference by a combined approach enhanced by machine learning
© Arakaki et al. 2009
Received: 18 November 2008
Accepted: 13 April 2009
Published: 13 April 2009
We previously developed EFICAz, an enzyme function inference approach that combines predictions from non-completely overlapping component methods. Two of the four components in the original EFICAz are based on the detection of functionally discriminating residues (FDRs). FDRs distinguish between members of an enzyme family that are homofunctional (classified under the EC number of interest) and those that are heterofunctional (annotated with another EC number or lacking enzymatic activity). Each of the two FDR-based components is associated with one of two specific kinds of enzyme families. EFICAz exhibits high precision, except when the maximal test to training sequence identity (MTTSI) is lower than 30%. To improve EFICAz's performance in this regime, we: i) increased the number of predictive components and ii) took advantage of consensual information from the different components to make the final EC number assignment.
We have developed two new EFICAz components, analogous to the two FDR-based components, where the discrimination between homo- and heterofunctional members is based on the evaluation, via Support Vector Machine models, of all the aligned positions between the query sequence and the multiple sequence alignments associated with the enzyme families. Benchmark results indicate that: i) the new SVM-based components outperform their FDR-based counterparts, and ii) both SVM-based and FDR-based components generate unique predictions. We developed classification tree models to optimally combine the results from the six EFICAz components into a final EC number prediction. The new implementation of our approach, EFICAz2, exhibits a highly improved prediction precision at MTTSI < 30% compared to the original EFICAz, with only a slight decrease in prediction recall. A comparative analysis of enzyme function annotation of the human proteome by EFICAz2 and KEGG shows that: i) when both sources make EC number assignments for the same protein sequence, the assignments tend to be consistent and ii) EFICAz2 generates considerably more unique assignments than KEGG.
Performance benchmarks and the comparison with KEGG demonstrate that EFICAz2 is a powerful and precise tool for enzyme function annotation, with multiple applications in genome analysis and metabolic pathway reconstruction. The EFICAz2 web service is available at: http://cssb.biology.gatech.edu/skolnick/webservice/EFICAz2/index.html
From a purely biochemical point of view, enzymes constitute the most important group of proteins. They are versatile, catalyzing most chemical reactions involved in the metabolism of living organisms, and abundant, representing approximately 15% to 35% of a given proteome [1, 2]. Enzymes are classified according to the Enzyme Commission (EC) system, a hierarchical system that assigns a unique four-field number to each enzymatic activity. The first field of an EC number indicates the general class of catalyzed reaction. The second and third fields depend on different criteria related to the chemical features of the substrate and the product of the reaction, and the fourth field is a sequential number without any special meaning. A comprehensive and detailed enzyme function annotation of the available genomes is necessary not only to increase our understanding of the biochemistry of living organisms, but also to gain more insight into the evolutionary processes that originated the diversity of enzymes currently found in nature. The precise assignment of EC numbers to catalytic proteins is a vital requirement for the correct reconstruction of metabolic pathways. Moreover, reconstructed metabolic pathways play a key role in many biomedical approaches [6–9], but the success of these applications strongly depends on the quality of the functional annotations of the enzymes comprising such pathways.
Despite the great importance of precise EC number assignments, enzyme functions, as well as other molecular, cellular or physiological functions, are often inferred from sequence similarity to previously characterized proteins. In this annotation modality, commonly known as "prediction by homology transfer", the (incorrect) assumption is that all homologs have the same function. This functional annotation strategy is negatively affected by at least two factors. The first factor is the functional diversity of highly similar sequences observed in many protein families. For example, to transfer detailed enzyme function, given by four-field EC numbers, with an average precision of at least 90%, a sequence identity threshold of 60% is required. However, the functional annotation of many genomes has been carried out employing much lower thresholds. The second factor is the structural and functional modularity of proteins; thus, when the modular nature of proteins is disregarded, functional annotations based on best database hits are often erroneous. Mainly due to these factors, sequence similarity-based annotation strategies result in a high number of errors [18, 19] that often propagate in public databases. For instance, it has been estimated that functional assignments inferred by sequence similarity in the Gene Ontology sequence database (GOSeqLite) have an error rate of 49%. Other approaches for enzyme function prediction do not directly depend on the level of similarity between sequences. For example, several methods are based on the identification of specific structural patterns associated with functional sites [22–24], but they are limited by the requirement that the query protein's structure be solved. Yet other approaches are based on the analysis of properties of proteins such as tissue specificity, subcellular location and phylogenetic information, or genome context and other functional association evidence. However, these methods also suffer from the lack of consistent and comprehensive database annotations related to this kind of sequence-independent feature.
To address the limitations of transfer of enzyme function by sequence similarity, we developed EFICAz (Enzyme Function Inference by a Combined Approach), an engine for large-scale high-precision enzyme function inference . The original implementation of EFICAz combines the predictions of four independent methods: (C1) CHIEFc family based FDR recognition: detection of Functionally Discriminating Residues (FDRs) in enzyme families obtained by a Conservation-controlled HMM Iterative procedure for Enzyme Family classification (CHIEFc), (C2) Multiple Pfam family based FDR recognition: detection of FDRs in combinations of Pfam families that concurrently detect a particular enzyme function, (C3) CHIEFc family specific SIT evaluation: pairwise sequence comparison using a CHIEFc family specific Sequence Identity Threshold (SIT), and (C4) High specificity multiple PROSITE pattern recognition: detection of multiple PROSITE patterns that, taken all together, are specifically associated to a particular enzyme function. Since each predictive component was designed to be highly precise and predictions made by any pair of components do not completely overlap (including C1 and C2, which only differ in the way the protein families are defined), at the final stage, EFICAz makes a particular EC number assignment when one or more of the four component methods predict a given EC number. Since EFICAz and its components have been fully described before , here, we briefly introduce the basics of the predictive components based on the recognition of FDRs and highlight possible improvements.
A CHIEFc or Pfam enzyme family E is defined by a multiple alignment of sequences evolutionarily related to a seed group of sequences sharing a particular EC number $EC_E$. FDRs are residues in specific positions of the alignment, selected via an Evolutionary Footprinting method for their ability to discriminate between homo-functional and hetero-functional family members. Homo- and hetero-functional family members are defined as sequences annotated or not annotated with the EC number $EC_E$, respectively. To apply an FDR recognition method, we first determine if a query sequence q is a member of an enzyme family E by evaluating a Hidden Markov Model derived from E. If so, we check if q exhibits conservation of the FDRs associated with E. When both conditions are fulfilled, we predict that q is a homo-functional member of E and assign the EC number $EC_E$ to the query sequence q. A figure illustrating the concept of FDRs can be found in Additional file 1: Figure S1. Example of Functionally Discriminating Residues (FDRs). A potential pitfall of the FDR recognition methods is that if the number of FDRs for a given enzyme family is too small, it can be difficult to achieve high prediction precision, because the matching of a very small number of residues in an alignment is more likely to occur by chance. Conversely, if the number of FDRs is too large, the prediction recall might suffer, because the matching of a large number of residues in an alignment imposes a very restrictive condition. In principle, these issues could be addressed by techniques more advanced than FDR matching in terms of their ability to detect the signals characteristic of homo-functional enzyme family members in the query sequence. In this work, we describe the development of a method for enzyme function inference that is based on this premise. We employ a Support Vector Machine (SVM) learning approach that evaluates all the aligned positions between a query sequence and the multiple sequence alignment associated with a given Pfam or CHIEFc enzyme family. We term these components: (C5) CHIEFc family based SVM evaluation and (C6) Multiple Pfam family based SVM evaluation, and our benchmarks show that they yield higher predictive performance than their counterparts based on FDR recognition.
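The two-condition FDR recognition test described above (family membership, then conservation of the FDRs) can be illustrated with a minimal sketch. This is not the EFICAz implementation: the function name, the representation of FDRs as a map from alignment column to allowed residues, and the example sequences are assumptions made for illustration, and the HMM membership test is taken as a precomputed boolean.

```python
from typing import Dict


def predict_homofunctional(aligned_query: str,
                           is_family_member: bool,
                           fdrs: Dict[int, str]) -> bool:
    """Return True only if both FDR-recognition conditions hold.

    aligned_query: query sequence aligned to the family alignment,
        one character per alignment column ('-' for a gap).
    is_family_member: outcome of the family HMM test (computed elsewhere).
    fdrs: hypothetical map from 0-based alignment column to a string of
        residues accepted at that column.
    """
    if not is_family_member:
        return False
    # Every FDR column must carry one of the accepted residues.
    return all(aligned_query[col] in allowed for col, allowed in fdrs.items())


# Hypothetical family with two FDR columns.
example_fdrs = {2: "H", 5: "DE"}
conserved = predict_homofunctional("MKHLADG", True, example_fdrs)
mismatch = predict_homofunctional("MKHLANG", True, example_fdrs)
```

With very few FDR columns the `all(...)` test is easy to satisfy by chance, and with many columns it becomes very restrictive, which is the precision/recall tension noted in the text.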
As mentioned above, in the previous implementation of EFICAz, all EC numbers predicted by the four original component methods were reported, whether they agreed with each other or not. Here, based on estimations of the method's performance that are more realistic than those published before [1, 27], we show that such a strategy tends to negatively affect prediction precision, especially at low levels of maximal test to training sequence identity (MTTSI, formally defined in the Methods section). To address this issue, we have developed a tree-based classification algorithm that applies a set of hierarchical rules to generate an EC number assignment from the list of the component methods that predict such EC number and the query sequence's MTTSI. We have included the two additional SVM-based component methods as well as the classification tree algorithm in the current implementation of EFICAz, which we term EFICAz2. According to the results of our performance benchmarks, EFICAz2 is dramatically more precise than EFICAz at low MTTSI, while it shows only a modest decrease in recall in this MTTSI regime.
The rest of this paper is organized as follows: in the Results and Discussion section, we describe the development and benchmarking of the SVM-based enzyme function inference method and the classification tree algorithm to generate the final EC number prediction, and present a comparative study of enzyme function annotations of the human proteome by EFICAz2 and KEGG . In the Conclusions section, we summarize the present work, stress its significance, and discuss its limitations. Finally, in the Methods section, we describe the data sources and procedures for training and benchmarking of EFICAz2, provide details about the statistical analyses and technical aspects of the generation of SVM and classification tree models, and describe the data sources for the comparative analysis of enzyme function annotation of the human proteome.
Two of the four component methods in the original implementation of EFICAz are based on the identification of homo-functional members of a given CHIEFc (C1) or Multiple Pfam enzyme family (C2), i.e., members whose enzymatic activity coincides with that of the seed enzymes that originated the family. The criterion followed by these methods to consider a query sequence as homo-functional (and therefore make the corresponding EC assignment) is the matching of FDRs. Since FDRs constitute a subset of all residues in the multiple sequence alignment associated to an enzyme family, we reasoned that an algorithm operating over all the aligned positions (i.e., with access to all possible information) could achieve higher discriminatory power, at least in certain cases. This situation is analogous to that of patterns and profiles for the identification of protein families and domains in the PROSITE database .
Initially, PROSITE consisted of patterns alone and was later enriched by the inclusion of profiles. Although, in general, PROSITE profiles exhibit increased sensitivity with respect to patterns, profiles and patterns complement each other, i.e. both types of descriptors offer unique advantages in particular cases .
Our implementation of the profile-like approach to the recognition of homo-functional sequences is based on SVM models associated to each enzyme family. The basic idea of the SVM algorithm is mapping the data from an input space into a high-dimensional feature space via a kernel function, and finding a hyper-plane to separate positive and negative samples in the feature space . The training of the SVM models is carried out using the whole set of aligned residues in the corresponding multiple sequence alignment, which include both positives or homo-functional sequences and negatives or hetero-functional sequences (see Methods section, "Support vector machine models"). The new component methods were termed: (C5) CHIEFc family based SVM evaluation and (C6) Multiple Pfam family based SVM evaluation. In order to compare the performance of the new SVM-based components to that of the FDR-based components, we carried out extensive benchmarking. First, we trained the two FDR-based (C1 and C2) and the two SVM-based components (C5 and C6) using previous releases of the corresponding databases; these specific versions of the component methods were later included in EFICAz2 version 10, based on the Release 10 of UniProt  (see Methods section, "Datasets for the training of different EFICAz2 versions"). Then, we selected test sequences from all of the well annotated, newly added Swiss-Prot sequences in UniProt Release 12.6 that were not included in the Release 10. Finally, for each test sequence, we collected the enzyme function predicted by each of the four components under evaluation and calculated the average precision and recall (see Methods section, "Benchmarking of EFICAz2 version 10"). The statistical significance of the differences in method's performance was evaluated as described in "Statistical analyses", in the Methods section.
In order to improve the precision of our approach, we decided to investigate more efficient ways to integrate the predictions generated by the six EFICAz component methods. We had demonstrated in our previous work that increased precision can be achieved by requiring the consensus of two or more components of EFICAz . Here, we decided to train decision tree models to find the optimal way to take advantage of consensual information from the different components. Decision trees are very effective tools in machine learning that produce accurate, highly interpretable predictions and have been successfully used in several computational biology and bioinformatics applications , including enzyme function prediction . For our particular case, we sought decision trees able to output a binary outcome (whether a given EC number is assigned or not to a protein sequence), based on the prediction results of each component. Decision trees that produce discrete outcomes are called classification trees . There are several possibilities to consider regarding the level of generalization of the classification trees, for example, whether or not they depend on the specific EC number type. In principle, EC number-specific classification trees could yield more accurate predictions. However, since not all the EC number types are represented in the set of test sequences, we opted for an EC number-independent solution.
We compared the performance of EFICAz2 (Figure 4, red columns) to that of the original EFICAz with four components or the updated version with six components. Compared to the original EFICAz, EFICAz2 displays a statistically significant decrease in average recall at MTTSI < 30% (a difference in recall of 5% and 10% for three- and four-field EC numbers, respectively, Figure 4A, C) and at a few other MTTSI intervals, although the difference in recall is less than 5% in these latter cases. More importantly, EFICAz2 shows a dramatic increase in average precision at MTTSI < 30% (a difference in precision of 25% and 55% for three- and four-field EC numbers, respectively, Figure 4B, D). Similar tendencies, with average recall decreases and average precision increases of higher magnitude, can be observed when EFICAz2 is compared to EFICAz updated to six components. In summary, we first shifted the precision-recall trade-off towards higher recall and lower precision by adding the SVM-based components to the original EFICAz implementation. Then, by making more efficient use of consensus between predictions from different components via classification tree models, we achieved acceptable levels of average precision at low MTTSI, with low impact on the average recall.
The EFICAz2 code is available upon request to academic and non-profit users. In addition, we have made EFICAz2 available as a web service that allows the submission of query protein sequences and returns the output via email. If an enzyme function inference is made, the output consists of the four-field or three-field EC number prediction/s, the predictive component/s that recognized the EC number/s, the MTTSI interval associated with the query sequence and the mean and standard deviation of the precision obtained from benchmarks.
EFICAz2 exhibits an average precision of at least 90% for MTTSI ≥ 40% (Figure 4B, D), a non-trivial achievement, considering that to achieve this level of precision from a sequence similarity criterion alone, MTTSI ≥ 60% is required. Moreover, we significantly improved the prediction precision at MTTSI < 30%, compared to the original implementation of EFICAz. Nevertheless, the recall in this regime still requires additional improvement (average recall of 33% and 23% for three-field and four-field EC numbers at MTTSI < 30%, respectively, Figure 4A, C). One possibility to overcome this limitation of EFICAz2 is to include methods that do not depend on sequence information. Some protein features that have been used before for enzyme function prediction include protein-protein interaction, phylogenetic distribution, tissue specificity and subcellular localization. Although we will explore the possibility of including non-sequence-dependent features of proteins in future versions of EFICAz, their use may be hampered by the low availability or inconsistency of this kind of annotation in current databases.
Table 1. Comparative enzyme function annotation of the human proteome(1)
Paired values give number of annotations/number of proteins.

Level of detail of the enzyme function assignment: Three-field EC numbers
  EFICAz2, EC numbers with less than three fields(4): 20,889
  EFICAz2, three-field EC numbers: 3,508/3,416(5)
  KEGG, EC numbers with less than three fields(4): 21,398
  KEGG, three-field EC numbers: 2,954/2,907
  EFICAz2 novels: 798/790
  KEGG novels: 309/281
  Level of EC annotation agreement(6)

Level of detail of the enzyme function assignment: Four-field EC numbers
  EFICAz2, EC numbers with less than four fields(4): 21,660
  EFICAz2, four-field EC numbers: 2,850/2,645
  KEGG, EC numbers with less than four fields(4): 21,833
  KEGG, four-field EC numbers: 2,523/2,472
  EFICAz2 novels: 522/483
  KEGG novels: 338/310
  Level of EC annotation agreement(6)
Table 1 shows that, although both KEGG and EFICAz2 provide unique annotations, the novel assignments made by EFICAz2 clearly outnumber those from KEGG. At the level of detail of three-field EC numbers, there are 798 novel annotations by EFICAz2 corresponding to 790 proteins versus 309 unique annotations for 281 proteins from KEGG. Similarly, for four-field EC numbers, there are 522 novel annotations for 483 proteins by EFICAz2 versus 338 unique annotations for 310 proteins from KEGG. We analyzed the agreement between EFICAz2 and KEGG assignments for the 2,626 sequences that were annotated with a level of detail of at least one three-field EC number by both sources. For a given annotated protein, we distinguished among three possibilities: i) full agreement, where all the EC number/s assigned to the protein by EFICAz2 and KEGG coincide, ii) partial agreement, where at least one but not all the EC numbers assigned to the protein by these sources agree, and iii) no agreement, where none of the EC numbers assigned to the protein by these sources agree. For the 2,626 common sequences annotated with three-field EC numbers, the level of full agreement is 96.8%, while the level of partial agreement or better is 99.3%. Similarly, for the 2,162 sequences annotated with four-field EC numbers by both sources, the full and at least partial agreement is 92.5% and 97.9%, respectively. The matching of EC numbers is done at the stated level of detail, i.e. when comparing three-field or four-field EC numbers, only the first three fields or the full four fields are considered, respectively.
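The three agreement categories defined above can be expressed as a small set comparison. The sketch below is illustrative only: the function names are hypothetical, EC numbers are assumed to be dot-separated strings with at least as many fields as the requested level of detail, and the example EC numbers were chosen for the demonstration rather than taken from the annotated proteome.

```python
def truncate_ec(ec: str, fields: int) -> str:
    """Keep only the first `fields` dot-separated fields of an EC number."""
    return ".".join(ec.split(".")[:fields])


def agreement(eficaz2_ecs, kegg_ecs, fields=3) -> str:
    """Classify the agreement between two annotation sets for one protein.

    Returns "full" when both sources assign exactly the same EC numbers at
    the requested level of detail, "partial" when they share at least one
    but not all, and "none" when they share nothing.
    """
    a = {truncate_ec(ec, fields) for ec in eficaz2_ecs}
    b = {truncate_ec(ec, fields) for ec in kegg_ecs}
    if a == b:
        return "full"
    if a & b:
        return "partial"
    return "none"


# Two four-field numbers that coincide only in their first three fields.
three_field = agreement(["1.1.1.1"], ["1.1.1.2"], fields=3)
four_field = agreement(["1.1.1.1"], ["1.1.1.2"], fields=4)
```

The same pair of annotations can therefore count as full agreement at the three-field level and as no agreement at the four-field level, which is why the two levels are reported separately.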
The level of agreement between KEGG and EFICAz2 can also be assessed on the basis of the total number of EC number predictions by one or the other source, rather than by the total number of annotated proteins. The number of annotations and the number of proteins may differ because a single protein may have more than one enzymatic activity; therefore, more than one EC number may be associated with it. In this case, we only distinguish between agreement and lack of it. The number of annotations by EFICAz2 and KEGG for the 2,626 sequences annotated with three-field EC numbers by both sources is 2,710 and 2,645, respectively. Thus, the level of agreement is 96.7% ([67+2,554]/2,710) and 99.1% ([67+2,554]/2,645) when expressed in terms of the number of EFICAz2 and KEGG three-field EC number annotations, respectively. The number of annotations by EFICAz2 and KEGG for the 2,162 sequences annotated with four-field EC numbers by both sources is 2,328 and 2,185, respectively. Therefore, the level of agreement is 91.7% ([117+2,019]/2,328) and 97.8% ([117+2,019]/2,185), when expressed in terms of the number of EFICAz2 and KEGG four-field EC number annotations, respectively.
This comparative analysis indicates that when both sources make EC number assignments for the same protein sequence, there is a high chance that these assignments are consistent. On the other hand, at the level of detail of three-field EC numbers, EFICAz2 generates more than double the number of unique assignments (i.e., assignments for proteins annotated as non-enzymes by the other compared source), while it provides more than 50% additional unique assignments when four-field EC numbers are considered. The unique EC number assignments made by EFICAz2 can be found in Additional file 2: Novel enzyme function annotations of the human proteome by EFICAz2.
In this work, we described, implemented and tested EFICAz2, a new version of EFICAz, our automated approach for enzyme function prediction, enhanced by means of machine learning techniques. We increased the number of EFICAz components from four to six by adding two methods based on the evaluation of Pfam and CHIEFc enzyme families by SVM classifiers. The SVM-based components showed statistically significant performance improvements compared to their counterpart methods based on the detection of FDRs. We generated a set of classification trees to integrate and take advantage of the complementarity between the predictions from the six component methods, and achieved a remarkable increase in average precision at low MTTSI, with only moderate impact on average recall. When we applied EFICAz2 to the enzyme function reannotation of the human proteome, we found that for proteins annotated as enzymes by both EFICAz2 and KEGG, the assigned EC numbers were highly consistent. Moreover, the number of unique enzyme assignments generated by EFICAz2 is considerably higher than the number of unique enzyme annotations in KEGG. Thus, the results of the performance benchmark and the comparison with KEGG demonstrate that EFICAz2 is a powerful and precise tool for enzyme function annotation, with multiple applications in genome analysis and metabolic pathway reconstruction.
Table: Number of sequences in reference sets used for EFICAz2 training. Columns: reference sequence set; EFICAz2 version 10; EFICAz2 version 13. Rows include "enzymes" (three-field EC number) and "enzymes" (four-field EC number).
Table: Number of families and EC number types associated with different EFICAz2 predictive components. Columns: type of EFICAz2 component; three-field EC numbers (EFICAz2 version 10, EFICAz2 version 13); four-field EC numbers (EFICAz2 version 10, EFICAz2 version 13). Final row: all EFICAz2 components.
For a given enzyme function f described by a three-field or four-field EC number, we calculate $\mathrm{precision}_f = TP_f/(TP_f + FP_f)$ and $\mathrm{recall}_f = TP_f/(TP_f + FN_f)$, where (i) $TP_f$ (number of true positives) is the number of test sequences for which the function f is assigned by both EFICAz2 and UniProt 12.6, (ii) $FP_f$ (number of false positives) is the number of test sequences for which the function f is assigned by EFICAz2 but not by UniProt 12.6, and (iii) $FN_f$ (number of false negatives) is the number of test sequences for which the function f is assigned by UniProt 12.6 but not by EFICAz2.
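As a worked illustration of these definitions, the following sketch counts true positives, false positives and false negatives per enzyme function from two annotation maps. The function names, the dictionary layout and the toy data are assumptions made for the example; they do not reproduce the benchmark pipeline.

```python
from collections import defaultdict


def per_function_counts(predicted, reference):
    """Count TP, FP and FN per enzyme function.

    predicted, reference: dicts mapping a sequence id to the set of EC
    numbers assigned by the predictor and by the reference annotation.
    """
    counts = defaultdict(lambda: {"TP": 0, "FP": 0, "FN": 0})
    for seq_id in set(predicted) | set(reference):
        pred = predicted.get(seq_id, set())
        ref = reference.get(seq_id, set())
        for f in pred & ref:      # assigned by both sources
            counts[f]["TP"] += 1
        for f in pred - ref:      # assigned by the predictor only
            counts[f]["FP"] += 1
        for f in ref - pred:      # assigned by the reference only
            counts[f]["FN"] += 1
    return dict(counts)


def precision(c):
    """TP/(TP+FP), or None when the function was never predicted."""
    denom = c["TP"] + c["FP"]
    return c["TP"] / denom if denom > 0 else None


def recall(c):
    """TP/(TP+FN), or None when the function never occurs in the reference."""
    denom = c["TP"] + c["FN"]
    return c["TP"] / denom if denom > 0 else None


# Toy data (hypothetical sequence ids and assignments).
predicted = {"s1": {"1.1.1.1"}, "s2": {"1.1.1.1"}, "s3": {"2.7.1.1"}}
reference = {"s1": {"1.1.1.1"}, "s2": {"2.7.1.1"},
             "s3": {"2.7.1.1"}, "s4": {"1.1.1.1"}}
counts = per_function_counts(predicted, reference)
```

Returning `None` for an undefined ratio keeps functions with no predictions (or no reference annotations) out of any later average instead of counting them as zero.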
In UniProt, as well as in most protein sequence databases, the distribution of different EC classes is non-uniform, i.e. some enzyme functions are overrepresented while others are underrepresented (see Figure 6). To reduce the bias towards the most represented enzyme functions, we evaluate precision and recall for each individual enzyme function f, and then calculate average values. On the other hand, it is clear that test sequences with higher sequence identity to training enzymes are easier to predict than those exhibiting lower sequence identity. This correlation, plus the fact that, in general, the sequence identities of the test sequences to the training enzymes are not uniformly distributed, introduces another potential source of bias. To reduce this second type of bias, we evaluate EFICAz2's performance at different levels of maximal test to training sequence identity (MTTSI). We define MTTSI as the maximal sequence identity between a given test sequence whose predicted function is f and any training enzyme whose true function is f.
Given an MTTSI interval m and an enzyme function f, we first select the test sequences whose EFICAz2 predicted function is f and whose MTTSI falls into the interval m. Then, based on the selected test sequences, we calculate the precision and recall of EFICAz2 for enzyme function f and MTTSI bin m. For each MTTSI bin, we calculate and report the average precision and recall across all enzyme functions for which these performance indicators are defined (i.e., where $(TP_f + FP_f) > 0$ for precision calculation and where $(TP_f + FN_f) > 0$ for recall calculation). It has to be mentioned that in previous benchmarks of EFICAz [1, 27], we calculated the average precision per MTTSI bin only across the EC number types that were represented in the test sequences. In this work, we decided to average the performance of all possible EC number types, which translates into a decreased average precision (because, by definition, all the additional enzyme functions considered for the average will have zero true positives) but provides a more realistic estimation of our method's performance.
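The per-bin averaging rule can be sketched as follows, assuming the TP, FP and FN counts have already been tallied for each enzyme function within each MTTSI bin. The bin labels, function names and counts below are hypothetical and serve only to show how undefined ratios are excluded from the averages.

```python
def average_defined(values):
    """Average the values that are not None; None if nothing is defined."""
    defined = [v for v in values if v is not None]
    return sum(defined) / len(defined) if defined else None


def bin_averages(counts_by_bin):
    """Average precision and recall per MTTSI bin.

    counts_by_bin: {bin_label: {ec_number: (TP, FP, FN)}}
    Returns {bin_label: (average_precision, average_recall)}.
    """
    result = {}
    for bin_label, per_function in counts_by_bin.items():
        precisions, recalls = [], []
        for tp, fp, fn in per_function.values():
            # Precision is defined only when TP + FP > 0.
            precisions.append(tp / (tp + fp) if tp + fp > 0 else None)
            # Recall is defined only when TP + FN > 0.
            recalls.append(tp / (tp + fn) if tp + fn > 0 else None)
        result[bin_label] = (average_defined(precisions),
                             average_defined(recalls))
    return result


# Hypothetical counts for two MTTSI bins.
toy_counts = {
    "<30%": {"1.1.1.1": (1, 1, 1), "2.7.1.1": (0, 0, 2)},
    "30-40%": {"1.1.1.1": (3, 0, 1)},
}
averages = bin_averages(toy_counts)
```

In the "<30%" bin of the toy data, the second function contributes a recall of zero but no precision value, so the two averages are taken over different numbers of functions.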
In this work, we evaluated two more versions of EFICAz, besides EFICAz2: i) the original implementation of EFICAz where predictions from four component methods are combined without integration by classification tree models, and ii) a version that combines the previous four components and the two new SVM-based components, also lacking the benefit of classification tree predictive models. These versions only differ from EFICAz2 in the number of utilized component methods, or the way the predictions from different components are combined. Thus, the procedures for training of the individual components described above for EFICAz2 also apply to these two other versions of EFICAz.
We performed two-tailed t-tests to determine the significance of the differences in the average recall and precision at specific MTTSI intervals observed between different pairs of predictive methods. Our null hypothesis was that there is no significant change in these performance indicators (critical alpha level = 0.05). To evaluate differences in average recall, we used correlated t-tests because the recall values from each of the two compared methods can be matched according to their specific EC numbers. Conversely, to evaluate differences in average precision, we used t-tests for unpaired data because the prediction precision values associated with each method are not defined for the same set of EC numbers. In this case, assuming that the random variables had different (heteroscedastic t-test) or the same variance (homoscedastic t-test) yielded the same results at the set critical alpha level of 0.05.
We built an SVM model for each particular Pfam and CHIEFc enzyme family, whether the family is associated with a three-field or a four-field EC number. Each enzyme family consists of a multiple sequence alignment of homo- and hetero-functional members; the goal of each SVM model is to discriminate between them. For classification purposes, homo- and hetero-functional members of an enzyme family are considered as positives and negatives, respectively. To transform the aligned protein sequences into a data matrix suitable for machine learning, a particular amino acid encoding scheme needs to be selected. Several methods for amino acid encoding have been proposed in the literature [38–40]. Here, we adopt an encoding method where each amino acid is represented by five highly interpretable continuous variables derived from multivariate statistical analysis of 494 physicochemical attributes. Thus, for training and evaluation of the SVM models, each aligned position of a member sequence is regarded as a five-dimensional vector, and a multiple sequence alignment with M proteins and N aligned positions is converted to a data matrix with M samples and N*5 input features. Therefore, a different SVM model is associated with each enzyme family, each model having a different number of features, depending on the number of aligned positions. We implemented the SVM models using the libSVM package (kernel function = Radial Basis Function (RBF), γ = 1/k, where k is the number of attributes in the input data, and C = 1).
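The conversion of an alignment with M sequences and N columns into an M by N*5 data matrix can be sketched as below. Several things here are assumptions: the five numbers per residue are made-up placeholders rather than the published factor scores, only three residues are listed, gaps and unlisted residues are mapped to a zero vector (the text does not say how they were handled), and the function names are hypothetical. The SVM training step itself is not shown.

```python
# Placeholder five-value vectors. These numbers are invented for the
# example and are NOT the published physicochemical factor scores.
PLACEHOLDER_FACTORS = {
    "A": (0.1, 0.2, 0.3, 0.4, 0.5),
    "G": (0.5, 0.4, 0.3, 0.2, 0.1),
    "K": (1.0, -1.0, 0.5, -0.5, 0.0),
}
# Assumption: gaps and residues missing from the table become zeros.
ZERO_VECTOR = (0.0, 0.0, 0.0, 0.0, 0.0)


def encode_alignment(aligned_seqs, factors=PLACEHOLDER_FACTORS):
    """Turn M aligned sequences of length N into an M x (N*5) matrix."""
    if len({len(seq) for seq in aligned_seqs}) != 1:
        raise ValueError("all aligned sequences must have the same length")
    return [
        [value for residue in seq
         for value in factors.get(residue, ZERO_VECTOR)]
        for seq in aligned_seqs
    ]


def rbf_gamma(matrix):
    """gamma = 1/k, where k is the number of input features."""
    return 1.0 / len(matrix[0])


matrix = encode_alignment(["AG-", "KGA"])   # M = 2, N = 3
gamma = rbf_gamma(matrix)                   # k = 15 features
```

Because the number of features depends on the alignment length, both the matrix width and the value of γ differ from one enzyme family to another.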
Decision trees are predictive models that classify data by mapping features of the data items to inferences about their target values, by means of a hierarchy of questions about such features . Decision trees can be implemented as classification trees when the outcome is discrete, or regression trees when the outcome is continuous . In this work, we have used classification trees to integrate the predictions generated by each of the six EFICAz component methods (C1 to C6) into a final, more precise EC number prediction. The source for training and testing of our classification tree predictive models is the dataset described in "Benchmarking of EFICAz2 version 10", in the Methods section. Our training samples are (p, z) pairs, where p denotes a protein sequence and z indicates its EC number. The features considered for the classification are the prediction statuses of the six EFICAz components. We encode the feature information for a given sample (p, z) in a six dimensional binary vector. Thus, "1" in certain dimension of the vector means that the corresponding EFICAz component predicts that protein sequence p exhibits the enzymatic activity associated to EC number z, while "0" indicates the opposite. The outcome of the predictive model is a logic variable indicating whether or not z is assigned to p.
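The six-dimensional binary encoding of a (p, z) sample can be written in a few lines. The sketch below assumes the set of components that predicted EC number z for protein p is already known; the function name and the ordering constant are illustrative choices, not part of the published method.

```python
# Fixed ordering of the six EFICAz2 component methods.
COMPONENTS = ("C1", "C2", "C3", "C4", "C5", "C6")


def feature_vector(predicting_components):
    """Encode which components predict EC number z for protein p.

    predicting_components: iterable of component labels that made the
    prediction. Returns a list of six 0/1 values in C1..C6 order.
    """
    hits = set(predicting_components)
    return [1 if component in hits else 0 for component in COMPONENTS]


# Example: only the CHIEFc-based FDR and SVM components predict z.
vector = feature_vector({"C1", "C5"})
```

A classification tree trained on such vectors learns which combinations of component votes justify assigning z to p.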
We generated classification trees for two levels of enzyme function description (three- and four-field EC numbers) in two variants each, one for protein sequences with MTTSI < 30% and the other for protein sequences with MTTSI ≥ 30%. The 30% MTTSI threshold was empirically determined and optimized to achieve a biologically useful trade-off between the prediction performance of sequences in or out of the "Twilight Zone" of function prediction, as evaluated in our benchmarks. To create the classification trees, we used the rpart package version 3.1–41 from the statistical analysis tool R. The fitting of the models was done using the default parameters of the rpart function, with the exception of the weights argument. We opted for an EC number-dependent case weight equal to the harmonic mean of 1 and 1/N, i.e. 2/(N+1), where N is the number of training sequences that belong to a given EC number. The rationale of this weighting scheme is that it is a halfway balance between two extreme situations: i) implementing a weight = 1/N and thus completely ignoring the natural biases in enzyme abundance that might be partially reflected in databases (all EC number types are treated equally, whether represented by only one or by a large number of sequences), and ii) using a weight = 1 for all cases (no weighting), with the risk of excessively biasing the models towards the EC numbers most abundantly represented in our training set of sequences.
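The case weight 2/(N+1) follows from the harmonic mean of 1 and 1/N, since 2·1·(1/N)/(1 + 1/N) = 2/(N+1). A minimal sketch of how such weights could be derived from the EC labels of the training samples is given below; the function name and the toy labels are hypothetical, and the resulting list is simply the kind of per-case vector one would hand to a tree-fitting routine.

```python
from collections import Counter


def case_weights(ec_labels):
    """Return one weight per training sample, equal to 2/(N+1).

    ec_labels: list with the EC number of each training sample.
    N is the number of training samples sharing that EC number.
    """
    n_per_ec = Counter(ec_labels)
    return [2.0 / (n_per_ec[ec] + 1) for ec in ec_labels]


# Toy labels: three samples of one EC number and one of another.
weights = case_weights(["1.1.1.1", "1.1.1.1", "1.1.1.1", "2.7.1.1"])
```

An EC number represented by a single sequence keeps a weight of 1, while heavily represented EC numbers are down-weighted roughly in proportion to 2/N.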
The sources for the human protein sequences and their enzyme function annotations were the KEGG Genes and Brite databases (Release 47.0+/06-26, of June 26, 2008), respectively.
This research was supported by grant No. GM-48835 of the Division of General Medical Sciences of the NIH.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.