Automatic prediction of catalytic residues by modeling residue structural neighborhood
© Cilia and Passerini; licensee BioMed Central Ltd. 2010
Received: 6 February 2009
Accepted: 3 March 2010
Published: 3 March 2010
Prediction of catalytic residues is a major step in characterizing the function of enzymes. In its simplest formulation, the problem can be cast as a binary classification task at the residue level: predicting whether a residue is directly involved in the catalytic process. The task is hard even when structural information is available, because of the wide range of roles a functional residue can play and the large imbalance between the number of catalytic and non-catalytic residues.
We developed an effective representation of structural information by modeling spherical regions around candidate residues and extracting statistics on the properties of their content, such as physico-chemical properties, atomic density, flexibility and presence of water molecules. We trained an SVM classifier combining our features with sequence-based information and previously developed 3D features, and compared its performance with the most recent state-of-the-art approaches on different benchmark datasets. We further analyzed the discriminant power of the information provided by the presence of heterogens in the residue neighborhood.
Our structure-based method achieves consistent improvements on all tested datasets over both sequence-based and structure-based state-of-the-art approaches. Structural neighborhood information is shown to be responsible for such results, and predicting the presence of nearby heterogens seems to be a promising direction for further improvements.
Discovering the molecular mechanisms underlying protein function is a key step in understanding the complex processes of living systems, and could make it possible to correct dysfunctions. Large-scale genomics projects are providing a huge amount of protein sequence information and, at a lower but increasing rate, structural information. Nonetheless, the function of a large portion of these proteins is still undetermined, as it is often not straightforward to understand the details of a protein's function even when its 3D structure is known. The task requires a time-consuming trial-and-error process of hypothesis formulation and verification by targeted experiments such as site-directed mutagenesis. Considering the rate at which protein structures are solved, the gap with respect to functionally characterized proteins is destined to increase over time. Automatic approaches for the detection of protein functional sites can be very useful in narrowing this gap, either by determining functional residues exactly or by reducing the number of candidates to be experimentally verified.
In this paper we focus on predicting catalytic residues in enzymes. Enzymes are proteins able to accelerate chemical processes inside a cell. During catalysis, the enzyme forms complexes with the substrates, usually small molecules, and in doing so lowers the activation energy of the reactions, thus increasing their rate. According to their function, enzymes are classified into six functional classes in the so-called Enzyme Classification (EC) Nomenclature. Many enzymes need to be bound to an additional non-protein component, called a cofactor, in order to perform their function. Cofactors can be grouped into: (a) coenzymes, i.e. dissociable cofactors that are usually organic; and (b) prosthetic groups, i.e. non-dissociable cofactors. The enzyme lacking the cofactor is inactive and is called an apoenzyme, while the enzyme with the cofactor is active and is called a holoenzyme. Enzymatic functional domains are also called active or catalytic sites. The residues that are directly involved in the catalytic process (e.g. nucleophiles, proton donors) constitute the active site, while residues in the surrounding space attract and orient the molecule to be bound, and constitute the binding domain. The first kind of residues are of interest in our study. From now on we refer to them as functional or catalytic residues.
Traditional approaches to functional site identification use homology-based strategies. The function of a novel protein is inferred by aligning its sequence, or superimposing its structure, with those of already annotated proteins. In one such study, active sites of non-annotated proteins in the Pfam database, which contains about 8,200 protein families, are predicted using a rule-based technique which exploits homology and sequence similarity with other annotated proteins. The methodology is based on the transfer of experimentally determined active site data to other sequences within the same Pfam family. The authors show that it is possible to gain functional annotation for a large number of sequences in the Pfam database (enzymatic families) for which the residues responsible for catalysis have not been determined. However, these homology-based techniques are well known to fail in specific situations. First, an annotated homologue of the target protein needs to be available, which prevents their application to novel folds. Furthermore, proteins with similar overall tertiary structure can have different active sites, i.e. different functions [5, 6], and proteins with different overall tertiary structure can show the same function and similar active sites, as has been documented for proteases. Finally, the increasing lack of functional annotations makes transferring them by homology even less effective.
A number of researchers have recently tackled the problem of functional residue prediction. One approach generated three-dimensional templates of protein active sites with rigid prosthetic groups. It is based on the simultaneous alignment of several protein structures, and relies on local atomic-level similarities based on multiple comparisons. The generated patterns include 3D atomic coordinates, position of chemical groups, and cavity locations. However, the approach remains limited to the subset of proteins having rigid prosthetic groups.
In its simplest formulation, functional residue prediction can be cast as a binary classification task at the residue level. Petrova and Wu and Youn et al. have addressed it with a Support Vector Machine (SVM) fed with both protein sequence and structural properties. Capra and Singh relied on an information-theoretic approach for estimating sequence conservation. The authors show that conservation of sequentially close residues improves predictive performance, especially when catalytic residues are involved. Carefully crafted conservation scores have also been shown to play a major role in predictive performance. Closeness centrality measures have been used to improve catalytic residue prediction with neural networks trained by a genetic algorithm. Reviews of approaches and applications for structure-based protein function prediction are available in the literature. Interestingly, a recent study showed that sequence information alone could provide results similar to those obtained by previous structure-based methods. Our early experiments came to the same conclusion (see Results), with profile-enriched sequential information providing results which were statistically indistinguishable from those of carefully crafted features extracted from 3D coordinates. This seems to indicate that much work still has to be done in order to fully exploit the information contained in protein three-dimensional structures. A number of recent approaches investigated the use of topological, electrostatic and graph-theoretic structure-based features for predicting ligand binding sites or protein functional sites. In a very recent work, electrostatic features from THEMATICS and geometric features are combined with sequence conservation features in a maximum likelihood approach called Partial Order Optimum Likelihood (POOL). The authors underline the key role of THEMATICS features, which are extracted from the theoretical titration curves of the residues.
In this work, we show how to effectively employ 3D structure information by modeling the structural neighborhood of candidate residues, represented as a sphere centered on the residue side chain. We encoded such neighborhood information with statistics on the properties of its content, such as physico-chemical properties, atomic density, flexibility and presence of water molecules. We trained a support vector machine combining our structural neighborhood features with evolutionarily enriched sequence information as well as previously developed 3D features. Our structure-based method achieves improvements over both sequence-based and structure-based state-of-the-art predictors, as measured on a set of benchmark datasets with varying characteristics, and structural neighborhood information is shown to be responsible for such improvements. We additionally investigated the role of ligand information in the presence of heterogens playing possible catalytic or structural roles, and showed that exploiting such information in both sequence-based and structure-based active site predictions is an interesting direction for further research.
Results and Discussion
Functional residue prediction can be cast as a binary classification task at the residue level, namely predicting, for each residue of a given protein, whether it is directly involved in catalysis or not. We performed a detailed analysis and feature engineering on a dataset (PW) of 79 enzymes selected by Petrova and Wu for their structural and functional heterogeneity with respect to their SCOP fold classification, EC numbers and BLAST sequence similarity. The dataset contains enzymes from all six classes in the Enzyme Classification (EC) Nomenclature. We collected sequential and three-dimensional data for a total of 23,635 residues from the enzymes' PDB files. A few residues were removed with respect to the 23,664 originally extracted, due to uncertain correspondence in the mapping between the two datasets or to conflicts between the residues reported in the PDB structure file and in the FASTA sequence from UniProt. Only 254 out of 23,635 residues are labeled as functional in the Catalytic Site Atlas (CSA) database. Hence the dataset is strongly unbalanced (see Additional file 1), with a ratio between positive and negative examples of about 1:92.
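The imbalance quoted above follows directly from the two residue counts. The short check below is only an illustration of that arithmetic; the variable names are ours and not part of the original study.

```python
# Counts reported in the text for the PW dataset.
total_residues = 23_635
catalytic = 254

# Everything not labeled as functional in the CSA counts as a negative example.
non_catalytic = total_residues - catalytic

# Number of negative examples per positive example (the "1:92" ratio).
negatives_per_positive = non_catalytic / catalytic

# Fraction of residues that are positive examples.
positive_fraction = catalytic / total_residues

print(f"negatives per positive: {negatives_per_positive:.1f}")
print(f"positive fraction: {positive_fraction:.2%}")
```

With so few positives, a classifier that labels every residue as non-catalytic is almost always right, which is why accuracy is not a useful measure for this task.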
We also conducted a broad experimental evaluation of the obtained features on a set of larger benchmark datasets which were proposed by previous sequence-based and structure-based approaches. Three benchmark datasets with varying homology level were taken from previous work: a SCOP fold dataset (EF fold), a SCOP family dataset (EF family) and a SCOP superfamily dataset (EF superfamily). Two additional datasets were included to study the performance of our approach in the presence of low homology: the HA SCOP superfamily dataset and the independent test set T-124. The characteristics of these five datasets are summarized in . Finally, we included the dataset of 160 proteins (POOL-160) used to evaluate the POOL method, in order to compare with that approach.
We addressed the learning task with an SVM, a state-of-the-art binary classification algorithm which aims at separating positive and negative examples with a large margin, possibly accounting for margin errors. Details on SVMs can be found in several textbooks. All experiments were carried out using the SVM Light software, downloadable from http://svmlight.joachims.org/. Our experimental evaluation is based on a 10-fold cross-validation procedure stratified at the protein level, that is, ensuring that all residues of a given protein always appear together in the same fold.
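The protein-level constraint on the folds can be sketched as follows. This is a minimal illustration under our own assumptions: the function name `protein_level_folds`, the random round-robin assignment and the toy residue list are hypothetical and do not reproduce the actual fold construction used in the experiments.

```python
import random


def protein_level_folds(residue_protein_ids, n_folds=10, seed=0):
    """Return one fold index per residue, with all residues of a protein sharing a fold.

    residue_protein_ids: list with the protein identifier of each residue.
    The proteins (not the residues) are shuffled and dealt round-robin to folds.
    """
    proteins = sorted(set(residue_protein_ids))
    rng = random.Random(seed)
    rng.shuffle(proteins)
    fold_of_protein = {p: i % n_folds for i, p in enumerate(proteins)}
    return [fold_of_protein[p] for p in residue_protein_ids]


# Toy example: three proteins with 3, 4 and 2 residues, split into 2 folds.
residues = ["1B93"] * 3 + ["2HGS"] * 4 + ["1PFK"] * 2
folds = protein_level_folds(residues, n_folds=2)
```

Splitting at the residue level instead would let residues of the same protein appear in both training and test sets, which tends to give optimistic estimates.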
We fixed the regularization parameter (parameter c in the SVM Light implementation) to 1, and tuned the cost factor (parameter j in the SVM Light implementation), which weights errors on positive examples more heavily than errors on negative ones. The cost factor was tuned on each fold of the 10-fold cross-validation by an inner cross-validation procedure inside its training set. Tuning the cost factor is particularly important for this application due to the strong imbalance between the number of positive and negative examples. Previous works [14, 16] addressed this problem by subsampling negative examples according to a certain ratio and training the classifier on the reduced set.
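The role of the cost factor can be illustrated with a cost-weighted hinge loss, in which margin errors on positive examples are multiplied by j. The sketch below is our own simplified illustration, not SVM Light code: `weighted_hinge_loss` and `select_cost_factor` are hypothetical helpers, and `inner_cv_f1` stands for whatever routine returns the inner cross-validation F1 for a given j.

```python
def weighted_hinge_loss(scores, labels, j=1.0):
    """Sum of hinge losses, with losses on positive examples multiplied by j.

    scores: real-valued classifier outputs; labels: +1 or -1 for each example.
    """
    total = 0.0
    for score, label in zip(scores, labels):
        loss = max(0.0, 1.0 - label * score)  # zero when the margin is respected
        weight = j if label == 1 else 1.0     # errors on positives cost j times more
        total += weight * loss
    return total


def select_cost_factor(candidates, inner_cv_f1):
    """Pick the candidate j with the highest inner cross-validation F1."""
    return max(candidates, key=inner_cv_f1)


# One positive and one negative example, both inside the margin by 0.5.
example_loss = weighted_hinge_loss([0.5, -0.5], [1, -1], j=3.0)
```

Unlike subsampling, this keeps every negative example in the training set and compensates for the imbalance through the weights alone.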
The following measures have been used to evaluate our approach:
where t+, t-, f+ and f- are the true positives, true negatives, false positives and false negatives, respectively.
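The measures named in this section (Precision, Recall, F1 and MCC) have standard definitions in terms of these four counts. The sketch below uses the textbook formulas; the function name and the convention of returning 0 when a denominator vanishes are our own assumptions.

```python
import math


def classification_measures(tp, tn, fp, fn):
    """Compute Precision, Recall, F1 and MCC from the four confusion counts."""
    # Precision: fraction of predicted positives that are truly positive.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    # Recall: fraction of true positives that are recovered.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    # F1: harmonic mean of Precision and Recall.
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    # MCC: Matthews correlation coefficient.
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "mcc": mcc}


measures = classification_measures(tp=50, tn=850, fp=50, fn=50)
```

Neither F1 nor MCC rewards a classifier for simply predicting the majority class, which makes them suitable for a dataset this unbalanced.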
F1 is the harmonic mean of Recall and Precision, giving equal weight to the two complementary measures. It is the measure we optimized in our model selection phase. The averaged ROC and RP curves are drawn by averaging the per-protein curves. ROC and RP curves and their areas provide a broader picture of a classifier's performance, as they do not require choosing a fixed decision threshold to discriminate positive and negative examples, but evaluate all possible thresholds. For highly skewed datasets, the area under the RP curve is more informative than the area under the ROC curve. We included both measures to allow for comparisons with previous approaches.
We evaluated the statistical significance of the performance differences between the various settings by paired Wilcoxon tests on the F1 measure reported for each fold. We employed a significance level α of 0.05.
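A paired Wilcoxon signed-rank test over ten per-fold F1 values is small enough to be computed exactly. The sketch below is an illustrative pure-Python version under our own assumptions (zero differences dropped, average ranks for ties, two-sided p-value by enumerating all sign assignments); it is not the code used in the study, and in practice a library routine such as `scipy.stats.wilcoxon` would normally be used.

```python
from itertools import product


def wilcoxon_signed_rank_exact(x, y):
    """Exact two-sided paired Wilcoxon signed-rank test for small samples.

    Returns (W+, p-value), where W+ is the sum of ranks of positive differences.
    """
    diffs = [a - b for a, b in zip(x, y) if a != b]  # drop zero differences
    n = len(diffs)
    if n == 0:
        return 0.0, 1.0
    # Rank the absolute differences, assigning average ranks to ties.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = average_rank
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    expected = sum(ranks) / 2  # mean of W+ under the null hypothesis
    observed_dev = abs(w_plus - expected)
    # Under the null hypothesis every sign pattern is equally likely.
    extreme = 0
    for signs in product((0, 1), repeat=n):
        w = sum(r for r, s in zip(ranks, signs) if s)
        if abs(w - expected) >= observed_dev - 1e-9:
            extreme += 1
    return w_plus, extreme / 2 ** n


# Toy per-fold scores for two settings (illustrative numbers, not from the paper).
setting_a = [2, 4, 6, 8, 10, 12, 14, 16, 18, 20]
setting_b = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
w_plus, p_value = wilcoxon_signed_rank_exact(setting_a, setting_b)
```

With ten folds the enumeration covers only 1,024 sign patterns, so the exact distribution is cheap to compute and no normal approximation is needed.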
Overview of the extracted features
1D1: Target amino acid name
1D2: Target amino acid type
3D1: Physical and chemical properties (amino acid attributes)
3D2: Amino acid composition
3D4: Water molecule quantity
From the protein sequence we extracted a conservation profile (1D3) capturing evolutionary information, together with standard attributes encoding the name and type of the amino acid (1D1 and 1D2).
From the protein tertiary structure we extracted features characterizing the three-dimensional neighborhood of a residue: statistics on the neighborhood properties such as physico-chemical characteristics (3D1), type and frequencies of the neighboring amino acids (3D2), charge or neutrality of the surrounding space (3D3), presence of water (3D4), atomic density (3D5) and flexibility of the target residue (3D6); the presence of disulphide bridges involving the target residue (3D7); and information on the presence of potential cofactors or other ligands, encoding the presence of nearby heterogens playing possible catalytic or structural roles (3D8) as well as whether they directly bind the target residue (3D9). The Methods section gives a detailed explanation of such structural features and the rationale behind their use.
Our aim was to exploit information related to the properties of the local structure surrounding a residue. We added these features to those already used in previous work, which aim at modeling properties of the target residue plus its relationship with the whole region containing it. Such a combined representation allowed us to obtain significant improvements, as detailed in the Results section.
Legend of abbreviations.
SVM_P 5 1D: the attributes extracted from the protein sequence among the 24 in the P 24 set
SVM_P 24: the whole set of 24 attributes proposed in previous work
SVM_P 7: the optimal set of 7 attributes selected among the 24 in the P 24 set
SVM_1Di-j, 3Dk-r: the attributes from 1D i to 1D j and/or from 3D k to 3D r as described in section Methods, with i, j = 1, 2, 3 and k, r = 1, ..., 9
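Table 4. Results of different feature sets (performance % ± s.d.; one row per feature set, five measures per row)
Row 1, sequence-based attributes only (1D1-3): 22 ± 11; 30 ± 11; 1.3 ± 0.7; 24 ± 7; 24 ± 8
Row 2, SVM_P 5 1D: 26 ± 8; 29 ± 12; 0.9 ± 0.3; 27 ± 9; 26 ± 9
Row 3, SVM_P 5 1D_1D1-3: 27 ± 10; 30 ± 10; 1.0 ± 0.4; 27 ± 8; 27 ± 8
Row 4, SVM_P 7: 22 ± 11; 37 ± 11; 1.8 ± 1.3; 26 ± 10; 27 ± 10
Row 5, SVM_P 24: 26 ± 10; 37 ± 14; 1.2 ± 0.5; 30 ± 9; 30 ± 10
Row 6, P 24 attributes plus conservation profiles (1D3): 26 ± 6; 44 ± 10; 1.4 ± 0.3; 32 ± 7; 33 ± 7
Row 7, SVM_P 24_1D1-3, 3D1-6: 28 ± 9; 46 ± 10; 1.4 ± 0.5; 34 ± 8; 34 ± 8
Row 8, SVM_P 24_1D1-3, 3D1-9: 33 ± 14; 48 ± 8; 1.4 ± 0.7; 37 ± 7; 38 ± 6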
The first set of experiments refers to a sequence-based functional residue predictor, where each residue is characterized by features extracted from the protein sequence only (see Table 1). In Table 4, row 1 reports experimental results obtained by using our sequence-based attributes only, including the multiple alignment conservation profiles. We also experimented with windows of conservation profiles of size varying between 1 and 10, where size w implies a window of w residues on each side of the target residue along the primary sequence, in addition to the profile of the target residue itself. Including such windows only provides a slight improvement (with w = 7) while drastically reducing the classifier's efficiency. Furthermore, the windowed features proved harmful when combined with structural information, possibly because the large number of features they introduce masks the signal coming from other, more informative ones.
Rows 2 and 3 report additional results on sets of attributes extracted from sequence information only. The set SVM_P 5 1D is a group of five attributes from the previously proposed set of 24, which includes the 1D1 and 1D2 attributes (see Table 1) and a conservation score from the Scorecons server, plus its entropy and relative entropy values, in place of our conservation profile. The results are comparable with those obtained with conservation profiles. Results combining all the available features extracted from the protein sequence are reported in row 3 (SVM_P 5 1D_1D1-3).
Results from row 4 onwards include additional information provided by structural features. In rows 4 and 5 we employed the two sets of attributes proposed in previous work, i.e. the subset of the 7 optimal ones (SVM_P 7) and the entire set of 24 attributes (SVM_P 24), respectively. Note that we obtained performance improvements over the originally reported results (F1 = 13% and MCC = 23% for the P 24 feature set) by tuning the cost factor for false positives versus false negatives, instead of randomly sub-sampling negative examples in order to obtain a balanced set.
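The windowed encoding of conservation profiles can be sketched as below. This is a minimal illustration: the function name `windowed_profile` is hypothetical, and padding with zeros beyond the ends of the sequence is our assumption, since the text does not say how terminal residues were handled.

```python
def windowed_profile(profiles, index, w):
    """Concatenate the profiles of the target residue and of w residues on each side.

    profiles: list of per-residue profile vectors (all of the same length).
    Positions falling outside the sequence are padded with zeros (an assumption).
    """
    width = len(profiles[0])
    features = []
    for pos in range(index - w, index + w + 1):
        if 0 <= pos < len(profiles):
            features.extend(profiles[pos])
        else:
            features.extend([0.0] * width)
    return features


# Toy sequence of three residues with two-dimensional profiles.
toy_profiles = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
first_residue_features = windowed_profile(toy_profiles, index=0, w=1)
```

The feature count grows as (2w + 1) times the profile length, so a window of w = 7 multiplies the profile features by 15, which is consistent with the loss of efficiency noted above.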
SVM_P 24_1D1-3, 3D1-6: 26 ◦ ∙; 28 ◦ ∙; 27 ◦ ∙; 28 ◦ ∙; 27 ◦ ∙
On the one hand, using features extracted from the primary sequence alone allows us to apply the predictor to the much larger set of proteins whose sequence, but not necessarily structure, has been determined. On the other hand, as already discussed in the introduction and also stated in previous reviews, the availability of structural information should be able to contribute significantly to solving the task. Indeed, adding three-dimensional information in the form of properties of the residue structural neighborhood allowed us to achieve significant improvements, as detailed below.
Row 6 in Table 4 reports results of the combination of our conservation profiles (1D3) with all the previously proposed sequence and structural attributes. Row 7 reports the result obtained by adding structural attributes encoding statistics of the residue three-dimensional neighborhood properties (3D1-6), without including the attributes related to the ligands (3D7-9, see Methods). Such results are always significantly better than those of sequence-based classifiers according to the statistical tests (see Table 5). Furthermore, performance improvements with respect to previous structure-based results (SVM_P 24) are significant in all but the smallest test set.
Finally, row 8 reports the performance obtained by including all the available ligand-based features, which yield further improvements and allow the correct prediction of some especially tough cases (detailed below), paving the way to an interesting research direction. Additional files 3 and 4 report detailed results and predictions for this classifier.
Further analyses of the effect of different sets of features on prediction errors provide some interesting insights into their usefulness and reliability. The quality of multiple alignments strongly influences the performance of sequence-based classifiers. On the proteins for which PSI-Blast did not provide good alignments we observed poor performance. In those cases structural features help compensate for such deficiencies. The inclusion of ligand features allows the correct prediction of many catalytic residues which have low catalytic propensity, like the glycine in the methylglyoxal synthase (PDB code 1B93) and the glycine in the human glutathione synthetase (PDB code 2HGS). The latter is an emblematic case of the importance of ligand features, as in their absence only one of its four catalytic residues is correctly predicted. In the phosphofructokinase (PDB code 1PFK) the encoding of ligand features helps to correctly predict the two arginine residues of the active site. Looking at the three-dimensional structure of the protein, the active site seems to be exposed rather than located in a hydrophobic core. This implies that its residues have structural features which may differ from those typical of the other catalytic residues in the dataset. On the other hand, we also observed a few cases in which the addition of ligand features worsens predictions. This happens mainly when no heterogen appears in the crystal structure, possibly because the enzyme was solved in its apo form. We are planning to verify this conjecture by applying techniques for detecting binding sites in 3D structures.
Note that while the presence of a heterogen provides a clear hint that the area could contain an active site, it is not by itself sufficient to determine the set of catalytic residues. Out of the 365 heterogen-binding residues in the dataset, only 62 were actually labelled as catalytic. If we restrict the analysis to the subset of heterogens which tend to occur near catalytic sites in enzymes (see the Ligand features section), the fraction becomes 51 out of 285. As detailed in the literature, the sole fact of binding a substrate or cofactor does not classify a residue as catalytic. It also has to perform some specific activity such as proton abstraction from substrate, cofactor or water activation. For instance, the above-mentioned phosphofructokinase (PDB code 1PFK) contains three heterogens: ADP, beta-fructose diphosphate (FBP), and a magnesium ion; of the 15 residues which bind one of them, only four are actually catalytic. In this case the predictor manages to selectively exploit ligand information in identifying two active arginine residues, one of which does not directly bind any heterogen, with a single additional FBP-bound arginine incorrectly predicted as catalytic. Given that information on binding residues helps in detecting active ones, it would be interesting to predict it when missing, either because sequence information alone is available, or because the 3D structure does not contain the bound cofactor and/or substrate. Indeed, both binding and active residues should be identified in order to fully characterize the functional domain. We believe that combining active and binding site prediction in a single collective model, as already done with profile-HMMs for specific functional domains, is a promising research direction, which can rely on a number of works for predicting binding sites from both sequence [32, 33] and structural information [29, 34].
In order to further investigate the discriminative potential of the features extracted from the 3D residue neighborhood, we also experimented with a structured kernel. We employed a 3D decomposition kernel on planar shapes in 3D space, originally proposed for the classification of small molecules. We adapted it to the functional residue prediction task by extracting specific shapes from the residue structural neighborhood. Among the different design choices we tried, the best performing one was the set of planar shapes of two (segment) and three (triangle) vertices in the 3D neighborhood of a residue. One of the vertices was the target residue itself, and the others were residues evolutionarily conserved over one of the hydrophobic, charged or polar classes. While providing reasonable performance when used alone, with an average F1 of 22% and an average MCC of 25%, such shapes failed to improve performance in combination with the remaining sequence- and structure-based features. This result confirms that effectively exploiting three-dimensional information for modeling catalytic residues is a hard task, and that further research is needed.
Comparison with other methods
We conducted a broad range of experiments on multiple benchmark datasets (see the Dataset section), and compared our results with the most recent methods for both sequence-based and structure-based prediction. Since none of the other methods directly encodes information on heterogens, we excluded such features from our set in all these comparisons.
Comparison with state-of-the-art sequence-based approach.
Previous results suggested that appropriate sequence-based features managed to match the performance of different structure-based predictors on the same datasets, a result we also observed in our early experiments on the PW dataset. Conversely, the improvements we achieve here show that structural information can indeed be effectively employed in predictions. Nonetheless, further research is needed in order to fully exploit it, as our results using heterogen information seem to indicate.
Comparison with the structure-based approach by Tang et al. on the PW dataset.
Methods compared: Tang et al. (GANN); SVM_P 24_1D1-3, 3D1-6.
Comparison with the best results reported for the POOL structure-based method on their benchmark dataset of 160 proteins.
Methods compared: POOL(T)POOL(G)POOL(C)/all protein (Tong et al.); SVM_P 24_1D1-3, 3D1-6 at equal precision; SVM_P 24_1D1-3, 3D1-6 at equal recall.
In this work we addressed the problem of predicting catalytic residues from protein sequence and structure. We developed an effective approach to exploit structural information, by modeling residue structural neighborhood as a spherical region centered on the side chain centroid and including various statistics on the properties of the neighborhood content. Our method outperforms the current state-of-the-art sequence-based and structure-based approaches, as shown on different benchmarking datasets. We further explored the information provided by the presence of nearby heterogens, playing possible catalytic or structural roles, when such information is available from the solved structure. We showed that ligand information can play a key role in correctly identifying functional residues with low catalytic propensities, and we are currently investigating solutions to jointly predict active and binding residues in the site in a fully collective approach.
Features Extracted from the Sequence
The features extracted from the primary sequence encode characteristics of the target residues and evolutionary information (see Table 1):
1D1 encodes the amino acid name of the residue.
1D2 encodes the amino acid type of the residue based on its physico-chemical properties: H, R, K, E, D as charged; Q, T, S, N, C, Y, W as polar; and G, F, L, M, A, I, P, V as hydrophobic.
1D3 encodes evolutionary information in the form of multiple alignment profiles.
1D1 and 1D2 are categorical (or nominal) attributes, and are one-hot encoded: each attribute is encoded with a vector of bits of size equal to the number of possible attribute values; value k is encoded with a vector having one at position k, and zero at all other positions. 1D3 is a real vector of conservation profiles computed from multiple alignments. We performed a two-iteration Position-Specific Iterative Blast (PSI-Blast) search on a database of non-redundant protein sequences (nr) downloadable from ftp://ftp.ncbi.nlm.nih.gov/blast/db/. A threshold of 5e-3 on the expectation value was employed both for the initial iteration and for extending hits. We enriched the profile extracted from the multiple alignment with two values indicating its informativeness and reliability, namely the profile entropy and the weight of the conservation profile with respect to pseudocounts.
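The one-hot encoding of 1D1 and 1D2 can be sketched as follows. The three amino acid classes are those listed for 1D2 above; the alphabetical ordering of the amino acids, the ordering of the classes and the function names are our own assumptions.

```python
# The twenty standard amino acids, in an arbitrary (here alphabetical) order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Amino acid types as defined for attribute 1D2.
TYPES = ["charged", "polar", "hydrophobic"]
TYPE_OF = {}
for aa in "HRKED":
    TYPE_OF[aa] = "charged"
for aa in "QTSNCYW":
    TYPE_OF[aa] = "polar"
for aa in "GFLMAIPV":
    TYPE_OF[aa] = "hydrophobic"


def one_hot(value, values):
    """Vector with a one at the position of `value` and zeros elsewhere."""
    return [1 if v == value else 0 for v in values]


def encode_1d1_1d2(aa):
    """Concatenate the one-hot encodings of amino acid name (1D1) and type (1D2)."""
    return one_hot(aa, AMINO_ACIDS) + one_hot(TYPE_OF[aa], TYPES)


histidine_features = encode_1d1_1d2("H")
```

Each residue thus contributes 20 + 3 = 23 binary features from these two attributes, exactly two of which are set.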
Features Extracted from the Structure
Residue Structural Neighborhood
Given such a 3D representation of residues, we define the structural neighborhood of a residue x as the set of residues and molecules contained in the volume of a sphere centered on x (x will be a target residue in our setting).
One can consider spherical regions of different radii. In this work we fix the radius of the sphere to a maximum of 8 Å, which is the maximum interaction distance between a residue and a water molecule. The rationale behind this choice is that the interaction with a water molecule is very important for catalysis in enzymes such as the hydrolases.
Features characterizing a residue can be extracted from the protein structure if available. We showed (see Results) that extracting features from a residue neighborhood, thus exploiting the locality of the protein structure, can be useful to discriminate between functional and non-functional residues. Table 2 summarizes the scalar features we extracted from the residue 3D neighborhood. The first group contains statistics on the properties of the neighborhood content, while the second encodes information on possible ligands contained in the neighborhood. Each row in the table corresponds to an attribute or a set of attributes encoding the properties specified in the description. In the following we provide a detailed description of such features. The section ends with a description of the 3D shapes we extracted from the structural neighborhood, which proved reasonably informative when applied alone, but failed to improve the results in combination with the other features, as discussed in the Results section.
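Selecting the content of the sphere amounts to a distance test against the radius. The sketch below is illustrative only: it assumes that every neighboring residue or molecule is represented by a single 3D point (for instance a centroid), which is our simplification, and the names `centroid` and `structural_neighborhood` are hypothetical.

```python
import math


def centroid(coords):
    """Arithmetic mean of a list of (x, y, z) atom coordinates."""
    n = len(coords)
    return tuple(sum(c[i] for c in coords) / n for i in range(3))


def structural_neighborhood(center, items, radius=8.0):
    """Return the labels of the items lying within `radius` angstroms of `center`.

    items: list of (label, (x, y, z)) pairs, one representative point per
    residue or molecule (an assumption made for this sketch).
    """
    return [label for label, pos in items if math.dist(center, pos) <= radius]


# Toy example: the target residue sits at the origin.
target_center = centroid([(-1.0, 0.0, 0.0), (1.0, 0.0, 0.0)])
candidates = [
    ("ALA 10", (3.0, 4.0, 0.0)),   # 5 angstroms away
    ("HOH 201", (0.0, 0.0, 8.0)),  # exactly on the sphere surface
    ("GLY 55", (6.0, 6.0, 6.0)),   # about 10.4 angstroms away
]
neighbors = structural_neighborhood(target_center, candidates)
```

For whole proteins a spatial index would avoid the all-against-all distance computation, but the direct loop is enough to show the definition.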
Statistics of the Neighborhood Properties
The first set of features encodes aggregate values representing properties of the atoms included in the sphere.
3D1 encodes chemical and physical properties of the residue neighborhood. This set of attributes represents properties such as hydrophobicity, polarity, polarizability and Van der Waals volume of the neighboring residues. They are encoded in a three-bin distribution (normalized number of residues with low, medium and high hydrophobicity, polarity, polarizability and Van der Waals volume) according to the indices reported in the Amino Acid Index Database. The same encoding has previously been used for protein function classification.
3D2 encodes the amino acid composition of the 3D sphere, represented as the frequency of occurrence of each one of the twenty amino acids.
3D3 represents charge or neutrality of the 3D sphere, encoded into three values: the number of positively charged residues, the number of negatively charged residues and their sum.
3D4 encodes the quantity of water in the sphere, measured as the number of water molecules within the sphere radius. This group of attributes is motivated by the fact that an active site is usually located in a hydrophobic core of the protein, while on the surface the quantity of water is higher and the residues exposed to the solvent are not hydrophobic.
3D5 measures the atomic density of the sphere, calculated as the total number of atoms it contains.
3D6 represents the residue temperature factor (B-factor), as a measure of the residue flexibility. It is calculated as the average of the atomic B-factors of the atoms composing the residue, normalized over the whole protein. As the temperature factor can depend on the crystal structure, normalizing over the whole protein helps to exclude the variations that can be present among different protein crystal structures. Note that previous work employed an unnormalized version of the residue B-factor instead.
In oxidizing environments, cysteines tend to form covalent bonds called disulphide bridges, which help stabilize the 3D structure of the protein. Disulphide-bonded cysteines are usually not involved in the catalytic process: in the PW dataset of 79 enzymes the only exception is a protein disulphide isomerase (PDB code 1MEK). It has two catalytic cysteine residues in a thioredoxin domain similar to one of the well-known thioredoxin proteins. We encoded information on bridges by a flag (3D7) indicating whether the target residue is a disulphide-bonded cysteine.
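Three of the statistics above (3D2, 3D3 and 3D6) can be sketched in a few lines. This is an illustration under stated assumptions: the neighborhood is given as a string of one-letter residue codes, H, R and K are taken as positively charged and D and E as negatively charged, and the B-factor normalization is implemented as a z-score over the protein, which the text does not specify. All function names are hypothetical.

```python
from collections import Counter
from statistics import mean, pstdev

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
POSITIVE = set("HRK")  # assumed positively charged residues
NEGATIVE = set("DE")   # assumed negatively charged residues


def composition_3d2(neighbors):
    """Frequency of each of the twenty amino acids among the neighbors."""
    counts = Counter(neighbors)
    n = len(neighbors)
    return [counts[aa] / n if n else 0.0 for aa in AMINO_ACIDS]


def charge_3d3(neighbors):
    """Number of positive residues, number of negative residues, and their sum."""
    positive = sum(1 for aa in neighbors if aa in POSITIVE)
    negative = sum(1 for aa in neighbors if aa in NEGATIVE)
    return [positive, negative, positive + negative]


def normalized_bfactors_3d6(residue_bfactors):
    """Z-score of each residue's mean B-factor over the whole protein (assumed scheme)."""
    mu = mean(residue_bfactors)
    sigma = pstdev(residue_bfactors)
    return [(b - mu) / sigma if sigma else 0.0 for b in residue_bfactors]


composition = composition_3d2("AAG")
charges = charge_3d3("KDEA")
z_scores = normalized_bfactors_3d6([10.0, 20.0, 30.0])
```

Whatever normalization is chosen, the point is that B-factors become comparable across crystal structures solved under different conditions.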
Enzymes often employ cofactors in order to help interacting with the substrate. Therefore, the presence of a cofactor in the structural neighborhood of a certain residue is an indication that the area could be an active site. On the other hand, many heterogens bind residues for structural rather than catalytic purposes, like NI in the methylmalonyl coa decarboxylase (PDB code 1EF8)  which is involved in trimerization. The Het-PDB Navi database  provides information on a large set of small molecules found in the protein structures of the PDB. For example information about the reaction in which the cofactors, substrates and products are involved, and the cofactor interface propensity. A description of the mechanisms of the catalysis is included in the CSA functional annotations whenever such information is available. It describes the role of the cofactors and which are the substrates and products of the reaction. In the dataset that we used for the feature engineering, 51 out of 79 enzyme structures contain heterogen molecules. For the remaining structures we can not say whether they are apoenzymes or they just do not require any help from cofactors during the catalysis. In the former case, methods for predicting metal-binding sites in apo protein structures  may be used to identify the presence of possible cofactors.
According to our analysis on this set of proteins, most of those heterogens have a demonstrated or putative role in the catalytic process (ZN, NAG, NAD, BME, MG, MN, U5P, ADP, HEM, FAD, MPD), while for others this role can be clearly ruled out (CL, NA, K, MAN), or it is just uncertain (PO4, SO4, POP). In order to correctly encode discriminant features related to the presence of cofactors, we divided the heterogen molecules into groups (at least the 71 we found in the PDB dataset, excluding DNA molecules) based on their physico-chemical, functional, spatial or shape characteristics.
As an example, in this dataset of 79 enzymes ZN usually has a verified role within the active site, thus we considered it as a primarily catalytic cofactor. Among the whole set of known enzymes, however, there are cases, such as the DNA glycosylase, with a zinc finger in which ZN has a structural role. We believe other features of the residue 3D neighborhood (e.g. four cysteine residues in the same sphere around a ZN atom) should help discriminate functional from non-functional residues in these cases.
By merging the above-mentioned literature-based information with our analysis of the distances from catalytic residues, we derived the final classification into three groups reported in Table 10.
Table 10. Classification of heterogens into three groups.
Group 1 (demonstrated or putative catalytic role): FE2, MN, CU1, MG, ZN3, ZN, HEM, HEG, HEC, SRM, MPD, MRD, FOK, PLP, P5P, PHS, OWQ, NO3, FS4, SF4, PVL, PYR, SEG, DHZ, FMT, HAD, CIT, ACN, PAC, ACT, 2PE, CNA, U5P, IKT, PGC, PGH, IMU, F6P, IMP, EEB, GLP, FBP, UD1, FCN, AZA, CRB, DHS, BME, ATP, ADP, GSH, FAD, FMN, SAM, AMP, NAD, GDP, GTP, GMP, MHF, NDP, NAG, NRI
Group 2 (catalytic role ruled out): K, NA, NI, FE, CA, CL, SAC, FCY, PCA, MES, MAN
Group 3 (uncertain role): PO4, PI, IPS, POP, SO4, SUL, GOL
According to the catalytic residue definition that guides the annotation of residues as functional in the CSA database, residues which bind a substrate or a cofactor are not annotated as catalytic unless they are in some way directly involved in the catalytic process. This consideration can be particularly useful for discriminating among residues with a high catalytic propensity (e.g. CYS, HIS) that bind cofactors for structural reasons. We represented this information as an additional feature (3D9) encoding the presence of a bond between the target residue and a cofactor. We used a distance threshold of 3 Å for detecting bonds.
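The bond flag described above can be sketched as a simple distance test. The 3 Å threshold comes from the text; the function name, the coordinate layout and the choice of testing every residue atom against every heterogen atom are assumptions made for illustration, since the text does not state which atoms are compared.

```python
import math

BOND_THRESHOLD = 3.0  # Å, the distance threshold stated in the text


def binds_cofactor(residue_atoms, heterogen_atoms, threshold=BOND_THRESHOLD):
    """3D9 sketch: True if any residue atom lies closer than threshold to any
    heterogen atom. Both arguments are lists of (x, y, z) coordinates.
    Comparing all atom pairs is an assumption, not the documented procedure.
    """
    return any(
        math.dist(r, h) < threshold
        for r in residue_atoms
        for h in heterogen_atoms
    )
```

For instance, `binds_cofactor([(0, 0, 0)], [(0, 0, 2.5)])` evaluates to `True`, while moving the heterogen atom to `(0, 0, 3.5)` gives `False`.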
Geometric shapes extracted according to spatial considerations can be viewed as features characterizing a residue structural neighborhood. Planar shapes, for instance, can be viewed as substructures of the 3D space surrounding a residue and characterizing its interactions with the other residues.
We extracted planar shapes with two vertices (segments) and three vertices (triangles) from the structural neighborhood of a residue. Each vertex corresponds to a residue that we labeled with its class type: charged (Ch), hydrophobic (Hy) or polar (Po). This allowed us to reduce the sparseness of the whole set of shapes, thus increasing the likelihood of shape matches during the kernel evaluations. We represented the three-dimensional neighborhood of a residue as: (a) a cloud of points corresponding to the side-chain centroids of the residues, labeled with Ch, Hy or Po according to their class; (b) a graph where each pair of vertices in the cloud is connected by an edge if their distance is less than 5 Å.
The 3D kernel measures the similarity between two residues in terms of the shapes which are shared between their respective 3D neighborhoods.
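The shape extraction and the kernel can be illustrated with a minimal sketch. The labels and the 5 Å edge threshold follow the text; everything else is an assumption made for the example: shapes are identified only by the sorted labels of their vertices (edge lengths are ignored), and the kernel is taken to be the dot product of the two shape-count vectors, which is one common way of counting shared substructures and not necessarily the definition used in this work.

```python
import math
from collections import Counter
from itertools import combinations


def neighborhood_shapes(points, max_dist=5.0):
    """Count labeled segments and triangles in a residue neighborhood.

    points: list of (label, (x, y, z)), label in {"Ch", "Hy", "Po"}, where the
    coordinates are side-chain centroids. Two vertices are connected if their
    distance is below max_dist (in Å).
    """
    def linked(i, j):
        return math.dist(points[i][1], points[j][1]) < max_dist

    shapes = Counter()
    indices = range(len(points))
    # Segments: every connected pair of vertices.
    for i, j in combinations(indices, 2):
        if linked(i, j):
            shapes[tuple(sorted((points[i][0], points[j][0])))] += 1
    # Triangles: every triple whose three edges are all present.
    for i, j, k in combinations(indices, 3):
        if linked(i, j) and linked(j, k) and linked(i, k):
            shapes[tuple(sorted(points[n][0] for n in (i, j, k)))] += 1
    return shapes


def shape_kernel(shapes_a, shapes_b):
    """Assumed kernel: dot product of the two shape-count vectors."""
    return sum(count * shapes_b[shape] for shape, count in shapes_a.items())
```

Grouping residues into three classes keeps the number of distinct shapes small (six segment types and ten triangle types under this labeling), which is what makes matches between neighborhoods likely.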
We normalized attribute values in the [-1,+1] range by applying the following linear transformation: $\mathrm{value}' = 2\,\frac{\mathrm{value} - \min}{\max - \min} - 1$, where $\min$ and $\max$ are the minimum and maximum values of the attribute. While this implies a lower data sparsity with respect to a [0,1] normalization, preliminary experiments showed that it achieved better overall results. Missing values were managed by replacing categorical attributes with their modes and numerical attributes with their means, both computed from the distributions of observed values in the dataset.
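The preprocessing described above can be sketched as follows, assuming that the linear transformation is a standard min-max rescaling over the observed values of each attribute. The function names, the use of `None` to mark missing values and the handling of constant attributes are assumptions made for this example.

```python
from statistics import mean, mode


def impute_numeric(values):
    """Replace missing entries (None) with the mean of the observed values."""
    fill = mean(v for v in values if v is not None)
    return [fill if v is None else v for v in values]


def impute_categorical(values):
    """Replace missing entries (None) with the mode of the observed values."""
    fill = mode(v for v in values if v is not None)
    return [fill if v is None else v for v in values]


def scale_to_unit_range(values):
    """Linearly map values to [-1, +1] with value' = 2 (value - min) / (max - min) - 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant attribute carries no information, map it to 0.
        return [0.0 for _ in values]
    return [2 * (v - lo) / (hi - lo) - 1 for v in values]
```

Imputation is applied before scaling, so that the imputed mean is mapped into the same [-1,+1] range as the observed values.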
The authors would like to thank Natalia Petrova and Cathy Wu for making their dataset available for the experiments and Ziding Zhang and Zhi-Ya Sheng for providing the details of their 10-fold cross-validation for comparison.
We also wish to thank Mauro Brunato for his substantial contribution in the earliest stages of this work. Some of the molecular graphics images in this paper, or parts of them, were produced using the UCSF Chimera package from the Resource for Biocomputing, Visualization, and Informatics at the University of California, San Francisco (supported by NIH P41 RR-01081).
- Humm A, Fritsche E, Mann K, Göhl M, Huber R: Recombinant expression and isolation of human L-arginine:glycine amidinotransferase and identification of its active-site cysteine residue. Biochemical Journal 1997, 322(Pt 3):771–776.
- Webb E: Enzyme nomenclature: recommendations of the nomenclature committee of the international union of biochemistry and molecular biology on the nomenclature and classification of enzymes. Academic Press, New York; 1992.
- Mistry J, Bateman A, Finn R: Predicting active site residue annotations in the Pfam database. BMC Bioinformatics 2007, 8: 298. 10.1186/1471-2105-8-298
- Finn RD, Mistry J, Schuster-Böckler B, Griffiths-Jones S, Hollich V, Lassmann T, Moxon S, Marshall M, Khanna A, Durbin R, Eddy SR, Sonnhammer EL, Bateman A: Pfam: clans, web tools and services. Nucleic Acids Research 2006, (34 Database):D247-D251. 10.1093/nar/gkj149
- Todd AE, Orengo CA, Thornton JM: Evolution of function in protein superfamilies, from a structural perspective. Journal of Molecular Biology 2001, 307(4):1113–1143. 10.1006/jmbi.2001.4513
- Nagano N, Orengo C, Thornton J: One fold with many functions: the evolutionary relationships between TIM barrel families based on their sequences, structures and functions. Journal of Molecular Biology 2002, 321(5):741–765. 10.1016/S0022-2836(02)00649-6
- Wallace AC, Laskowski RA, Thornton JM: Derivation of 3D coordinate templates for searching structural databases: application to Ser-His-Asp catalytic triads in the serine proteinases and lipases. Protein Science 1996, 5(6):1001–1013.
- Nebel JC: Generation of 3D templates of active sites of proteins with rigid prosthetic groups. Bioinformatics 2006, 22(10):1183–1189. 10.1093/bioinformatics/btl040
- Petrova NV, Wu CH: Prediction of catalytic residues using Support Vector Machine with selected protein sequence and structural properties. BMC Bioinformatics 2006, 7: 312–324. 10.1186/1471-2105-7-312
- Youn E, Peters B, Radivojac P, Mooney SD: Evaluation of features for catalytic residue prediction in novel folds. Protein Science 2007, 16(2):216–226. 10.1110/ps.062523907
- Capra JA, Singh M: Predicting functionally important residues from sequence conservation. Bioinformatics 2007, 23(15):1875–1882. 10.1093/bioinformatics/btm270
- Fischer JD, Mayer CE, Söding J: Prediction of protein functional residues from sequence by probability density estimation. Bioinformatics 2008, 24(5):613–620. 10.1093/bioinformatics/btm626
- Chea E, Livesay DR: How accurate and statistically robust are catalytic site predictions based on closeness centrality? BMC Bioinformatics 2007, 8: 153. 10.1186/1471-2105-8-153
- Tang Y, Sheng Z, Chen Y, Zhang Z: An improved prediction of catalytic residues in enzyme structures. Protein Engineering Design and Selection 2008, 21(5):295–302. 10.1093/protein/gzn003
- Gherardini P, Helmer-Citterich M: Structure-based function prediction: approaches and applications. Briefings in Functional Genomics and Proteomics 2008, 7(4):291–302. 10.1093/bfgp/eln030
- Zhang T, Zhang H, Chen K, Shen S, Ruan J, Kurgan L: Accurate sequence-based prediction of catalytic residues. Bioinformatics 2008, 24(20):2329–2338. 10.1093/bioinformatics/btn433
- Xie L, Bourne P: A robust and efficient algorithm for the shape description of protein structures and its application ... BMC Bioinformatics 2007, 8: S9. 10.1186/1471-2105-8-S4-S9
- Wei Y, Ko J, Murga L, Ondrechen M: Selective prediction of interaction sites in protein structures with THEMATICS. BMC Bioinformatics 2007, 8: 119. 10.1186/1471-2105-8-119
- Amitai G, Shemesh A, Sitbon E, Shklar M: Network analysis of protein structures identifies functional residues. Journal of Molecular Biology 2004, 344: 1135–1146. 10.1016/j.jmb.2004.10.055
- Tong W, Wei Y, Murga L, Ondrechen M, Williams R: Partial Order Optimum Likelihood (POOL): maximum likelihood prediction of protein active site residues using 3D structure and sequence properties. PLoS Computational Biology 2009, 5: e1000266. 10.1371/journal.pcbi.1000266
- Ondrechen M, Clifton J, Ringe D: THEMATICS: a simple computational predictor of enzyme function from structure. Proceedings of the National Academy of Sciences 2001, 98(22):12473–12478. 10.1073/pnas.211436698
- The Universal Protein Resource (UniProt) [http://www.uniprot.org/]
- Porter CT, Bartlett GJ, Thornton JM: The Catalytic Site Atlas: a resource of catalytic sites and residues identified in enzymes using structural data. Nucleic Acids Research 2004, (32 Database):D129-D133. 10.1093/nar/gkh028
- Cristianini N, Shawe-Taylor J: An Introduction to Support Vector Machines and Other Kernel-based Learning Methods. Cambridge University Press; 2000.
- Joachims T: Making large-scale SVM learning practical. In Advances in Kernel Methods - Support Vector Learning. Edited by: Schölkopf B, Burges C, Smola A. MIT Press, Cambridge, MA; 1999.
- SVM Light software [http://svmlight.joachims.org/]
- Davis J, Goadrich M: The relationship between Precision-Recall and ROC curves. In Proceedings of the 23rd International Conference on Machine Learning (ICML '06). New York, NY, USA: ACM; 2006:233–240.
- Valdar WS: Scoring residue conservation. Proteins 2002, 48(2):227–241. 10.1002/prot.10146
- Babor M, Gerzon S, Raveh B, Sobolev V, Edelman M: Prediction of transition metal-binding sites from apo protein structures. Proteins 2008, 70: 208–217. 10.1002/prot.21587
- Bartlett G, Porter C, Borkakoti N, Thornton J: Analysis of catalytic residues in enzyme active sites. Journal of Molecular Biology 2002, 324: 105–121. 10.1016/S0022-2836(02)01036-7
- Bernardes J, Fernandez J, Vasconcelos A: Structural descriptor database: a new tool for sequence based functional site prediction. BMC Bioinformatics 2008, 9: 492. 10.1186/1471-2105-9-492
- Lippi M, Passerini A, Punta M, Rost B, Frasconi P: MetalDetector: a web server for predicting metal binding sites and disulfide bridges in proteins from sequence. Bioinformatics 2008, 24(18):2094–2095. 10.1093/bioinformatics/btn371
- Shu N, Zhou T, Hovmöller S: Prediction of zinc-binding sites in proteins from sequence. Bioinformatics 2008, 24(6):775–782. 10.1093/bioinformatics/btm618
- Ebert J, Altman R: Robust recognition of zinc binding sites in proteins. Protein Science 2008, 17: 54–65. 10.1110/ps.073138508
- Ceroni A, Costa F, Frasconi P: Classification of small molecules by two- and three-dimensional decomposition kernels. Bioinformatics 2007, 23(16):2038–2045. 10.1093/bioinformatics/btm298
- Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research 1997, 25(17):3389–3402. 10.1093/nar/25.17.3389
- Kawashima S, Ogata H, Kanehisa M: AAindex: Amino Acid Index Database. Nucleic Acids Research 1999, 27: 368–369. 10.1093/nar/27.1.368
- Borgwardt KM, Ong CS, Schönauer S, Vishwanathan SVN, Smola AJ, Kriegel HP: Protein function prediction via graph kernels. Bioinformatics 2005, 21: i47-i56. 10.1093/bioinformatics/bti1007
- Benning MM, Haller T, Gerlt JA, Holden HM: New reactions in the crotonase superfamily: structure of methylmalonyl CoA decarboxylase from Escherichia coli. Biochemistry 2000, 39(16):4630–4639. 10.1021/bi9928896
- Yamaguchi A, Iida K, Matsui N, Tomoda S, Yura K: Het-PDB Navi: a database for protein-small molecule interactions. Journal of Biochemistry 2004, 135: 79–84. 10.1093/jb/mvh009
- Het-PDB Navi database [http://hetpdbnavi.nagahama-i-bio.ac.jp/index.php]
- Pettersen EF, Goddard TD, Huang CC, Couch GS, Greenblatt DM, Meng EC, Ferrin TE: UCSF Chimera-a visualization system for exploratory research and analysis. Journal of Computational Chemistry 2004, 25(13):1605–1612. 10.1002/jcc.20084
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.