CSmetaPred: a consensus method for prediction of catalytic residues
© The Author(s). 2017
Received: 19 June 2017
Accepted: 5 December 2017
Published: 22 December 2017
Knowledge of catalytic residues can play an essential role in elucidating mechanistic details of an enzyme. However, experimental identification of catalytic residues is a tedious and time-consuming task, which can be expedited by computational predictions. Despite significant development in active-site prediction methods, one of the remaining issues is ranked positions of putative catalytic residues among all ranked residues. In order to improve ranking of catalytic residues and their prediction accuracy, we have developed a meta-approach based method CSmetaPred. In this approach, residues are ranked based on the mean of normalized residue scores derived from four well-known catalytic residue predictors. The mean residue score of CSmetaPred is combined with predicted pocket information to improve prediction performance in meta-predictor, CSmetaPred_poc.
Both meta-predictors are evaluated on two comprehensive benchmark datasets and three legacy datasets using Receiver Operating Characteristic (ROC) and Precision Recall (PR) curves. The visual and quantitative analysis of ROC and PR curves shows that meta-predictors outperform their constituent methods and CSmetaPred_poc is the best of evaluated methods. For instance, on CSAMAC dataset CSmetaPred_poc (CSmetaPred) achieves highest Mean Average Specificity (MAS), a scalar measure for ROC curve, of 0.97 (0.96). Importantly, median predicted rank of catalytic residues is the lowest (best) for CSmetaPred_poc. Considering residues ranked ≤20 classified as true positive in binary classification, CSmetaPred_poc achieves prediction accuracy of 0.94 on CSAMAC dataset. Moreover, on the same dataset CSmetaPred_poc predicts all catalytic residues within top 20 ranks for ~73% of enzymes. Furthermore, benchmarking of prediction on comparative modelled structures showed that models result in better prediction than only sequence based predictions. These analyses suggest that CSmetaPred_poc is able to rank putative catalytic residues at lower (better) ranked positions, which can facilitate and expedite their experimental characterization.
The benchmarking studies showed that employing meta-approach in combining residue-level scores derived from well-known catalytic residue predictors can improve prediction accuracy as well as provide improved ranked positions of known catalytic residues. Hence, such predictions can assist experimentalist to prioritize residues for mutational studies in their efforts to characterize catalytic residues. Both meta-predictors are available as webserver at: http://18.104.22.168/csmetapred/.
In the post genomic era, one of the challenges is accurate protein function annotation as these could provide clues to insights into molecular details of biological processes [1, 2]. The task of protein function annotation combines cumbersome experimental studies with automated computational prediction methods, which usually allow molecular function annotation based on: establishing sequence/structure relationship between proteins of unknown function to proteins of known function, predicting putative binding sites for metals/chemical compounds/DNA/RNA/protein and prediction of catalytic residues of enzymes . The knowledge of catalytic residues can also assist in elucidation of reaction mechanism apart from providing enhanced function annotation of enzymes.
In the past decade, many sequence and/or structure-based catalytic residue prediction methods have been developed that rely on remote homology recognition, statistical and machine-learning algorithms. The sequence based prediction methods used sequence homology information or/and conserved family patterns/motifs [3–5], sensitive sequence-based scoring functions, amino acid stereochemical features [6, 7], conservation scores such as Von Neumann entropy, relative entropy, Jensen-Shannon divergence and sum-of-pairs measure [3, 8] to predict catalytic residues. Other prediction methods used phylogenetic motifs and phylogenetic trees [9, 10]. CRpred is one of the best sequence based methods that uses various sequence features such as residue type, hydrophobicity, and PSI-BLAST profiles  in a Support Vector Machine (SVM) based binary classification of residues into catalytic and non-catalytic residues. With the availability of tertiary structures, many methods were developed that used structure similarity searches with pre-calculated active site structural motif/template library [12–14], such as CATSID . Many other structure-based methods used structural features such as hydrophobicity distribution in protein , electrostatics , chemical properties , network centrality measures [18, 19], distribution of catalytic residues with centroid of structure , unusual central atomic distances , geometry based , contact density , structural neighbourhood  and side-chain orientation of catalytic residues [25, 26]. Many of these methods combine sequence and structural features to improve prediction accuracy [27–33]. For example, EXIA2 employs side-chain orientation of polar/charged residues and sequence features ; and DISCERN uses statistical models based on phylogenomic conservation score of sequence and several structural features  to predict catalytic residues.
Despite significant development in catalytic residue prediction methods, the ranked positions of known catalytic residues are on an average high among the list of all ranked residues. Improving the ranked positions of putative catalytic residues will facilitate and expedite their experimental identification and characterization. Moreover, an improved ranking of catalytic residues will also increase their prediction accuracy. To address these issues, we have developed methods based on meta-approach to predict active site residues that combine results from four well-known predictors to generate a consensus ranked list of residues. In the meta-predictor CSmetaPred, the residues are ranked based on the mean of normalized residue scores (meta-score) obtained from four predictors. Next, we included the predicted pocket information with the mean residue score or meta-score to further improve prediction performance in another meta-predictor, CSmetaPred_poc. Previously, meta-approaches have been shown to improve accuracy for protein structure and binding site predictions [34–36].
We used five benchmark datasets to evaluate meta-predictors that include three datasets from previous studies, which we primarily used as a legacy dataset to compare predictions from previous prediction methods. In the present work, we compiled 2 datasets macie-254 and csalit-688. Macie-254 is derived from the MACiE (mechanism, annotation and classification in enzymes) database , which provides manually curated list of catalytic residues with their putative roles in mechanistic steps of an enzymatic reaction. From 335 MACiE entries, enzymes having catalytic site defined in single pdb chain were used to prepare a non-redundant set of 254 proteins at 60% sequence identity using CD-HIT . Similarly, a non-redundant csalit-688 dataset (60% sequence identity) was generated from only literature annotated catalytic residues of pdb entries in Catalytic Site Atlas (CSA) database  and those not present in MACiE dataset. CSA database may annotate more than one catalytic site for a given single pdb chain depending on its reference source. Here, we merged two or more catalytic sites in a single pdb chain that have at least one common residue between them. The two datasets macie-254 and csalit-688 are combined to form a non-redundant CSAMAC dataset at 60% sequence identity using CD-HIT. Additionally, an unbound non-redundant (60% sequence identity) dataset, UB-137, was prepared from CSAMAC pdb entries, which are not bound to any ligand (pdb entries without HETATM record). The Table S1 in Additional file 1 provides list of datasets with pdb entries and their known catalytic residues. From earlier works, we took EF-Fold, POOL-160, and PW-79 datasets along with their respective catalytic residues definition [17, 30, 33] and pruned them to construct POOL-148, EF-Fold-164 and PW-79 (for details, see Additional file 2: S1 Text). Three datasets are pooled to construct a non-redundant (at 60% sequence identity) EF_POOL_PW dataset. Since pdb entries of EF_POOL_PW datasets are redundant with CSAMAC, we have described evaluation on CSAMAC dataset in the main text, whereas results from legacy datasets are provided in the supplementary material (Additional file 2). The average (standard deviation) number of catalytic residues in CSAMAC and EF_POOL_PW datasets is 3.3 (1.9) and 3.2 (1.9) respectively.
Overview of method
We have chosen four well-known active site prediction methods viz. CRpred, CATSID, DISCERN and EXIA2 for implementing in meta-predictors. These methods are selected primarily based on their prediction performances and their availability either as source code or easily automatable webservers. Moreover, these also are representative of different input features (sequence or/and structure) used for catalytic residue prediction. Among these, CRpred rely on only sequence derived features, CATSID uses only structural features, whereas, DISCERN and EXIA2 employ both sequence and structural properties for prediction of catalytic residues.
The webservers of CATSID (http://catsid.llnl.gov/) and EXIA2 (http://22.214.171.124/) were used for catalytic residues predictions. We have used EXIA2 webserver for prediction of proteins in benchmark studies. However, due to temporary unavailability of EXIA2 webserver, we have recoded EXIA2 and implemented in CSmetaPred webserver. We locally executed CRpred and DISCERN suites of program to predict catalytic residues using the packages obtained from developers’ websites (http://biomine.ece.ualberta.ca/CRpred/CRpred.htm) and (http://phylogenomics.berkeley.edu/software/) respectively. The procedure to derive residue score for various predictors is described below.
The score assigned to each residue from DISCERN and CRpred outputs are taken for meta-score calculation. These DISCERN and CRpred scores are referred to as Sdi and Scr respectively. From EXIA2 webserver parsed outputs, we took rank score and WCN assigned to residues as two independent scores for computing meta-score. The residue rank score, combines score for average side chain vector directions of its neighboring residues, amino acid combinations, structural flexibility and sequence conservation . WCN score is a measure of structural flexibility that is either obtained from EXIA2 output or is calculated using previously described algorithm . The rank score (Srs) is defined only for 12 amino acids (R, N, D, C, Q, E, H, K, S, T, Y and W), whereas WCN score (Swcn) is derived for all residues. Unlike other predictors, the CATSID outputs a list of hit templates and their associated template score, which is a measure of likelihood that a query protein shares catalytic function with the template. CATSID also provides alignment between the query and catalytic residues of template. To obtain residue score (Sca), we assign a template score to the aligned query residues in the alignment between query and template. If a residue is present in more than one alignment, we sum the score from each query template alignment and assign this summed score to the residue. Here, we have used all templates irrespective of any previously suggested score cut-off.
where, zSc(ij) is z-score of residue i for method j and p(j) is binary function with p(j) = 1 for residue having a assigned score, or 0 otherwise. The av-csc score is used in CSmetaPred to rank residues for every protein, wherein high score represents a greater chance for it to be a catalytic residue.
where, av-csc(j) is meta-score of pocket residue j, Nres(i) is number of residues in a given pocket i.
We selected top 5 ranked pockets from both methods and merge two pockets having ≥50% number of common residues between them. Thus, we generate a combined list of predicted pockets from both LIGSITE and Fpocket. The parameters for pocket ranking and merging were optimized using macie-254 dataset (Additional file 2: Figure S1).
In CSmetaPred_poc, residues are ranked based on av-csc-poc score.
Generation of homology models
To improve catalytic residues prediction of enzymes, without known tertiary structure, we have evaluated meta-predictor performance on homology modelled protein structures built using MODELLER . The protein models are built based on a single template structure with sequence identities ranging from 40% to 90% between query and template sequences. Details of dataset and construction of template library are given in supporting information (Additional file 2: S1 Text).
Each full-length protein sequence from CSAMAC dataset is queried against template library (LIB_TEMP) using profile_build() module of MODELLER to select a set of 335 proteins, which have sequence identity from 40 to 90% and coverage ≥70% to template sequences. The templates for 335 protein sequence is identified by searching these sequence against template library (LIB_TEMP) using profile_build() module of MODELLER. The templates with sequence identity <40% and >90% or query coverage <70% are discarded from the list of possible templates. Next, each query and template alignment having sequence identity ranging 40%–90% is grouped into sequence identities bins of 40–50%, 50–60%, 60–70%, 70–80% and 80–90% (see Additional file 2: S1 Text) with each bin having 235, 135, 53, 22 and 23 query-template alignments respectively. For each query-template alignment, we generated 10 models using align2d() module and the best model (having the lowest DOPE energy score) was used for prediction. Thus, we generated a total of 468 models for 335 query protein sequences.
Evaluation of method
The predictors are primarily evaluated using Receiver Operating Characteristic (ROC) and Precision Recall (PR) curves, which are frequently used in assessment of binary classifiers. As most of methods used in this study provide ranked list of residues, we create a binary classification by selecting top n ranked list as predicted catalytic residues and rest as non-catalytic residues. Hence, true positives (TP) are correctly predicted catalytic residues; false negatives (FN) are catalytic residues predicted as non-catalytic; false positives (FP) are non-catalytic residues predicted as catalytic; true negatives (TN) are correctly predicted non-catalytic residues. The precision, True Positive Rate (TPR) and False Positive Rate (FPR) are defined as:
Precision = TP / (TP + FP)
TPR (recall) = TP / (TP + FN)
FPR (1-specificity) = FP/ (FP + TN)
The measures mentioned above are used to compare prediction performances of CSmetaPred, CSmetaPred_poc, CRpred, EXIA2, DISCERN and WCN. The procedure adopted to rank residues from CRpred, EXIA2, DISCERN and WCN is given in supporting information (Additional file 2: S1 Text).
Results and discussions
As mentioned before, we have evaluated meta-predictors and compared them with their constituent methods on five benchmark datasets using average ROC and PR curves. In the present work, we have not included CATSID for comparison because using our approach of assigning score to residues from the template match score obtained from CATSID could provide score/rank only for subset of residues, whereas other methods rank all residues.
CSmetaPred prediction performance
Summary of MAS, MAP and catalytic residues median rank
CSAMAC dataset (884 protein)
Next, we have used PR curve to compare CSmetaPred performance with a motive to evaluate how well it classifies positives unlike ROC, which also consider misclassification of negatives. As is evident from the visual inspection of average PR curves (Fig. 2b) that CSmetaPred has better performance than EXIA2, WCN, DISCERN and CRpred SVM performance. This is also apparent from MAP (Table 1) and AUCPR (Additional file 2: Table S2) measures, which are used as a scalar value for comparison of PR curves. On CSAMAC dataset, MAP for CSmetaPred is highest (0.489) among its constituent methods (Table 1). Similarly, AUCPR of CSmetaPred is highest with a value of 0.324 followed by EXIA2, DISCERN and WCN having values of 0.167, 0.103 and 0.034 respectively (Additional file 2: Table S2). Importantly, evaluation of differences in prediction performances are found to be statistically significant (Wilcoxon signed-rank test with p-value <0.0001, see Additional file 2: Table S3) when we compared AP (described in Methods) calculated for each protein between CSmetaPred and other methods in a pairwise comparison (per protein basis). Both visual and quantitative analysis of PR curves show that CSmetaPred prediction is better than its constituent methods on EF_POOL_PW and five individual datasets (Additional file 2: Figure S4 for PR curves; MAP and AUCPR are summarized in Table S2). As CSmetaPred has comparatively lower median and average rank for catalytic residues (Table 1), this also suggests that CSmetaPred is able to improve catalytic residues ranks in comparison to other methods.
Further, we compared CSmetaPred predicted ranks of catalytic residues to their best possible ranks derived from CRpred, DISCERN and EXIA2. Here, the best possible rank of a residue is the minimum rank assigned by scores from above mentioned methods. It is important to note that in this analysis we have excluded CATSID because we could rank only subset of residues for which template was identified by CATSID (see Methods section). This inability to rank all residues will lead to lower ranks and which may not imply a necessarily better performance. The ‘best possible rank’ is a theoretical best scenario for selecting ranks for catalytic residue and this provides an upper bound of meta-approach performance. The comparison of ranks is performed on CSAMAC dataset having 2912 catalytic residues. Since lower ranked catalytic residues will largely affect CSmetaPred performance, we analyzed catalytic residues having the best possible rank less than 20. Most (86.9%) of catalytic residues have the best possible rank ≤20. Of these, for ~51% of catalytic residues CSmetaPred predicted ranks are either unchanged or improved marginally having median and mean decrease in rank of 2 and 3.4 respectively. Further, CSmetaPred predicted ranks are higher (poorer) for ~49% of catalytic residues compared to the best possible rank. Importantly, the increase in CSmetaPred predicted rank is not large as evident from median and mean rank increases of 3 and 7.6 respectively. The detailed analysis of catalytic residues with increase in CSmetaPred ranks showed that in most instances these residues are predicted only by one or two methods, which is exhibited in their higher normalized scores, whereas other methods assign lower residue scores as predictions from other methods are not good. This indicates that even though meta-predictor is not able to achieve the best possible scenario in meta-approach, it does not decrease ranks of catalytic residues drastically from the best possible scenario.
Summary of MAS, MAP, and median ranks of known catalytic residues when only polar/charged residues are ranked
CSAMAC Polar dataset (873 protein)
Comparison of CSmetaPred_poc with CSmetaPred
Performance of CSmetaPred_poc on ligand unbound structures
Since CSmetaPred_poc prediction relies on predicted pockets information to improve its prediction, we assessed any bias in pocket prediction due to ligand/s (substrate/product/cofactor) bound to proteins in pdb structures. For this assessment, we performed prediction using non-redundant pdb entries without any ligand bound pdb dataset (UB-137). The detailed analysis of average ROC and PR curves show that CSmetaPred_poc is still the best performing method (Additional file 2: Figure S9). CSmetaPred_poc achieves MAS and MAP values on UB-137 dataset of 0.976 and 0.620 respectively (Additional file 2: Table S2).
Catalytic residues rank analysis
Having shown that meta-predictors are the best among evaluated methods, next we have analyzed ranked position of known catalytic residues from all methods. As a metric to compare methods, we have used median rank of catalytic residues. Both meta-predictors achieve lower (better) median rank in comparison to other methods across all datasets (Table 1 and see Additional file 2: Table S2). In fact, CSmetaPred_poc achieves the lowest catalytic residue median rank of 6. The same is observed when either polar/charged or non-polar residues (Table 2 and Additional file 2: Table S4) are ranked separately.
We are not committed to any specific cut-off to select active site residues. However, to prioritize residues for experimental studies we analyzed our results to find a rank or filtration ratio cut-off to select active site residues. Based on average precision, average recall, and average accuracy from all datasets, we suggest residues with ranks ≤20 or ranks ≤4% filtration ratio cut-off as catalytic residues. On CSAMAC dataset, with 4% filtration ratio cut-off CSmetaPred_poc achieves the average precision, recall, and accuracy of 0.2, 0.79, and 0.96 respectively. Using same dataset and method the average precision, recall and accuracy with rank 20 are 0.14, 0.87, and 0.94 respectively. Similar average values are observed in other datasets. These cut-off values are to be used as an indicator rather than a rule to predict catalytic residues. Using these criteria, CSmetaPred_poc is able to rank ~87% and ~76% of known catalytic residues within top 20 ranks and 4% filtration ratio respectively.
As mentioned before (see Methods section) the binary classified residues generated by varying ranks in CSmetaPred_poc ranked list of residues is used for comparison of CSmetaPred_poc with other classifiers. On PW-79 dataset, Cilia and Passerini method achieves average recall and precision of 0.46 and 0.28 respectively . With the same dataset, at a recall of 0.46 CSmetaPred_poc has precision of 0.54 and at a precision 0.28 it has recall of 0.87. CRpred achieves average recall of 0.54 and precision of 0.175 on PW-79 dataset . CSmetaPred_poc achieves a precision of 0.50 at same recall of 0.54 and a recall of 0.94 at same precision of 0.175. A similar comparison with other datasets (POOL-148 and EF-Fold-164) is difficult, as we have excluded some pdb entries from these datasets (see Methods).
Catalytic residue prediction using protein models
Since experimental tertiary structures of many enzymes are not yet known, we evaluated whether modelled structures could be used for reliable prediction using CSmetaPred_poc. Moreover, this also provides comparison of prediction performance between modelled structures and prediction based only on sequence information. In this analysis, we build full-length homology models for sequences, with known tertiary structure, using single template having sequence identity ranging between 40 to 90% between query and template sequences (see Methods section).
First, we assessed model quality using Root Mean Square Deviation (RMSD) between the model and native structure. The average RMSD is 3.0 Å between models and native structures. The analysis of models with high RMSD showed that it is mostly due to either a long N/C terminal region or part of query sequences without any template aligned regions. Moreover, in some cases there was large conformational change between template and native structure that lead to large RMSD between model and native structure. Most high RMSD between model and native structure was from lowest sequence identity bin of 40–50%. Next, we used average ROC and PR curves to compare CSmetaPred_poc performance on models and native structures. Based on qualitative (visual inspection) and quantitative comparisons of ROC/PR curves, prediction based on models does not show comparable performance with their respective native structures (Additional file 2: Figure S12 and Table S5). However, considering median rank of catalytic residues, models result in slightly higher (poor) rank of 8.4 than native structures (rank 6.3). The detail analysis showed that most of poor performance is from models in 40–50% sequence identity category. In comparison to CRpred, which used only sequence information for prediction, models perform better as assessed by visual inspection of ROC/PR curves (Additional file 2: Figure S12) and comparison of median/average rank. CRpred results in median and average rank of catalytic residues of 11 and 23.2 respectively. This shows that CSmetaPred_poc will result in better prediction than using only sequence information as in CRpred. A study performed on usefulness of modelled structures has also suggested that low quality predicted structures could be used for catalytic residues prediction .
We compared residue level prediction results from CSmetaPred_poc with other methods on a list of pdb entries obtained from previous work or generated in the present work, which were mostly part of benchmark datasets. CSmetaPred_poc is able to get similar or better rank in most of the cases (Additional file 2: Table S6). Further, we analyzed prediction results for structures, with known catalytic residues, deposited in RCSB PDB database  subsequent to development of our method. For this analysis, we manually searched PDB database for recently determined tertiary structures of enzymes having curated list of catalytic residues. Interestingly, CSmetaPred_poc is able to predict most catalytic residues within top 20 ranks (Additional file 2: Table S7) as we observed in benchmark datasets. Below, we discuss some examples having the best results from CSmetaPred_poc.
The experimental site-directed mutagenesis in thioesterase enzyme YbdB from Escherichia coli has identified H89, E63, S67, H54, and Q48 as putative catalytic residues . Using YbdB structure (pdb id: 4k4c), CSmetaPred_poc is able to predict residues H89, E63, S67, H54, and Q48 residues at ranks 1, 2, 4, 5, and 19 respectively. A recent study on β-keto-acid cleavage enzyme family KCE (DUF849) has identified H46, H48, E143, R226, D231 as crucial catalytic residues and S82, T106, and E230 as important residues . Interestingly, CSmetaPred_poc ranks H46, H48, E143, R226, D231, S82, T106, and E230 residues (pdb id: 2y7f), at 2, 4, 6, 1, 5, 14, 17, and 3 ranked positions respectively.
The catalytic residues of Escherichia coli γ-glutamylcysteine synthetase (GshA) are not yet characterized. Using Escherichia coli GshA tertiary structure (pdb id: 1v4gA ) we predicted catalytic residues for this enzyme. From the subset of top 20 predicted residues (Additional file 2: Table S8), we mutated R330 (rank 1), R235 (rank 11), Y131, (rank 16) and R132 (rank 20) to investigate their role in catalysis using previously described in vivo and in vitro assays . Preliminary studies show no enzymatic activity for R330A mutant and reduced activity for mutants of R235, R132, and Y131 (Additional file 2: Figure S13 and Table S9) suggesting these could play a role in catalysis. Interestingly, R330 structural equivalent in GshA homologue from Saccharomyces cerevisiae (Sc-γ-GCS) (pdb id: 3ig5) is R472, which has also been suggested to be a catalytic residue . Further detailed study is required to investigate specific role of R330 in catalysis.
CSmetaPred and CSmetaPred_poc are provided as a webserver, which is accessible at http://126.96.36.199/csmetapred/. This server can take sequence or structure as an input. In CSmetaPred server we use our in-house recoded EXIA2. The comparison of residue ranks between original EXIA2 server result and our coded program is shown in Additional file 2: Figure S14.
We have developed meta-approach based catalytic residue prediction methods viz. CSmetaPred and CSmetaPred_poc that combine residue scores from four well known catalytic residue prediction methods. Based on visual and quantitative analysis of ROC and PR curves on benchmark datasets (including 3 legacy datasets), CSmetaPred shows improved prediction over its constituent methods. This approach of combining residues is simple yet effective in ranking catalytic residues. CSmetaPred_poc further improves prediction performance by including predicted pockets from LIGSITE and Fpocket. The assessment based on both ROC and PR curves shows that CSmetaPred_poc is the best of evaluated approaches. Importantly, known catalytic residues are at lower ranked (better) positions in prediction by both meta-predictors. This is also evident from the lowest median predicted rank of catalytic residues from CSmetaPred_poc in all datasets. CSmetaPred_poc achieves prediction accuracy of 0.94 on CSAMAC dataset taking residues below rank 20 as true positives. Moreover, on the same dataset CSmetaPred_poc predicted all catalytic residues for ~73% of enzymes within top 20 ranks. The benchmarking of CSmetaPred_poc on comparative modelled structures showed that predicted tertiary structures could be used reliably for catalytic residue predictions in absence of experimentally determined structures. These analyses suggest that meta-predictors could assist experimentalists in their efforts to experimentally identify and characterize catalytic residues by prioritizing residues for mutational studies.
Swapnil Tichkule helped in manually identifying enzymes from PDB database for predictions discussed in case studies section of the manuscript.
This works was supported by IISER Mohali startup funds; Center of excellence in frontier areas of science and technology grant [MHRD-14-0064] from the Ministry of Human Resource Development, Government of India. AKB acknowledges funding from the JC Bose National Fellowship, Department of Science & Technology, Government of India. PC is supported by Department of Biotechnology, Government of India. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Availability of data and materials
All data generated or analyzed during this study are included in this published article (and its additional information files).
SBP conceived and designed the meta-predictor approach. PC implemented meta-predictor approach and analysed prediction results. SK and AKB designed and performed experimental studies on γ-glutamylcysteine synthetase. SBP and PC wrote the manuscript. All authors have read and approved the manuscript.
Ethics approval and consent to participate
Consent for publication
The authors declare they have no competing interests.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- Loewenstein Y, Raimondo D, Redfern OC, Watson J, Frishman D, Linial M, Orengo C, Thornton J, Tramontano A. Protein function annotation by homology-based inference. Genome Biol. 2009;10(2):207.View ArticlePubMedPubMed CentralGoogle Scholar
- Radivojac P, Clark WT, Oron TR, Schnoes AM, Wittkop T, Sokolov A, Graim K, Funk C, Verspoor K, Ben-Hur A, et al. A large-scale evaluation of computational protein function prediction. Nat Methods. 2013;10(3):221–7.View ArticlePubMedPubMed CentralGoogle Scholar
- Ja C, Singh M. Predicting functionally important residues from sequence conservation. Bioinformatics. 2007;23:1875–82.View ArticleGoogle Scholar
- Chien TY, Chang DT, Chen CY, Weng YZ, Hsu CM. E1DS: catalytic site prediction based on 1D signatures of concurrent conservation. Nucleic Acids Res. 2008;36(Web Server issue):W291–6.View ArticlePubMedPubMed CentralGoogle Scholar
- Mistry J, Bateman A, Finn RD. Predicting active site residue annotations in the Pfam database. BMC Bioinformatics. 2007;8:298.View ArticlePubMedPubMed CentralGoogle Scholar
- Dou Y, Wang J, Yang J, Zhang C. L1pred: a sequence-based prediction tool for catalytic residues in enzymes with the L1-logreg classifier. PLoS One. 2012;7:3–9.Google Scholar
- Fischer JD, Mayer CE, Söding J. Prediction of protein functional residues from sequence by probability density estimation. Bioinformatics. 2008;24:613–20.View ArticlePubMedGoogle Scholar
- Yona G, Levitt M. Within the twilight zone: a sensitive profile-profile comparison tool based on information theory. J Mol Biol. 2002;315(5):1257–75.View ArticlePubMedGoogle Scholar
- La D, Livesay DR. Predicting functional sites with an automated algorithm suitable for heterogeneous datasets. BMC Bioinformatics. 2005;6:116.View ArticlePubMedPubMed CentralGoogle Scholar
- Sankararaman S, Sjölander K. INTREPID - information-theoretic tree traversal for protein functional site identification. Bioinformatics. 2008;24:2445–52.View ArticlePubMedPubMed CentralGoogle Scholar
- Zhang T, Zhang H, Chen K, Shen S, Ruan J, Kurgan L. Accurate sequence-based prediction of catalytic residues. Bioinformatics. 2008;24:2329–38.View ArticlePubMedGoogle Scholar
- Kato T, Nagano N. Discriminative structural approaches for enzyme active-site prediction. BMC Bioinformatics. 2011;12(Suppl 1):S49.View ArticlePubMedPubMed CentralGoogle Scholar
- Nilmeier JP, Kirshner DA, Wong SE, Lightstone FC. Rapid catalytic template searching as an enzyme function prediction procedure. PLoS One. 2013;8Google Scholar
- Tang YR, Sheng ZY, Chen YZ, Zhang Z. An improved prediction of catalytic residues in enzyme structures. Protein Eng Des Sel. 2008;21(5):295–302.View ArticlePubMedGoogle Scholar
- Bryliński M, Prymula K, Jurkowski W, Kochańczyk M, Stawowczyk E, Konieczny L, Roterman I. Prediction of functional sites based on the fuzzy oil drop model. PLoS Comput Biol. 2007;3:0909–23.Google Scholar
- Bate P, Warwicker J. Enzyme/non-enzyme discrimination and prediction of enzyme active site location using charge-based methods. J Mol Biol. 2004;340(2):263–76.View ArticlePubMedGoogle Scholar
- Tong W, Wei Y, Murga LF, Ondrechen MJ, Williams RJ. Partial order optimum likelihood (POOL): maximum likelihood prediction of protein active site residues using 3D structure and sequence properties. PLoS Comput Biol. 2009;5Google Scholar
- Chea E, Livesay DR. How accurate and statistically robust are catalytic site predictions based on closeness centrality? BMC Bioinformatics. 2007;8:153.View ArticlePubMedPubMed CentralGoogle Scholar
- Fajardo JE, Fiser A. Protein structure based prediction of catalytic residues. BMC Bioinformatics. 2013;14:63.View ArticlePubMedPubMed CentralGoogle Scholar
- Ben-Shimon A, Eisenstein M. Looking at enzymes from the inside out: the proximity of catalytic residues to the molecular centroid can be used for detection of active sites and enzyme-ligand interfaces. J Mol Biol. 2005;351(2):309–26.View ArticlePubMedGoogle Scholar
- Kochańczyk M. Prediction of functionally important residues in globular proteins from unusual central distances of amino acids. BMC Struct Biol. 2011;11:34.View ArticlePubMedPubMed CentralGoogle Scholar
- Mitternacht S, Berezovsky IN. A geometry-based generic predictor for catalytic and allosteric sites. Protein Eng Des Sel. 2011;24:405–9.View ArticlePubMedGoogle Scholar
- Huang SW, Yu SH, Shih CH, Guan HW, Huang TT, Hwang JK. On the relationship between catalytic residues and their protein contact number. Curr Protein Pept Sci. 2011;12(6):574–9.View ArticlePubMedGoogle Scholar
- Cilia E, Passerini A. Automatic prediction of catalytic residues by modeling residue structural neighborhood. BMC Bioinformatics. 2010;11:115.View ArticlePubMedPubMed CentralGoogle Scholar
- Chien YT, Huang SW. Accurate prediction of protein catalytic residues by side chain orientation and residue contact density. PLoS One. 2012;7Google Scholar
- Lu CH, Yu CS, Chien YT, Huang SW. EXIA2: web server of accurate and rapid protein catalytic residue prediction. Biomed Res Int. 2014;2014:807839.PubMedPubMed CentralGoogle Scholar
- Brodkin HR, NA DL, Somarowthu S, Mills CL, Novak WR, Beuning PJ, Ringe D, Ondrechen MJ. Prediction of distal residue participation in enzyme catalysis. Protein Sci. 2015;24:762–78.View ArticlePubMedPubMed CentralGoogle Scholar
- Izidoro SC, de Melo-Minardi RC, Pappa GL. GASS: identifying enzyme active sites with genetic algorithms. Bioinformatics. 2015;31(6):864–70.View ArticlePubMedGoogle Scholar
- Laskowski RA, Watson JD, Thornton JM. ProFunc: a server for predicting protein function from 3D structure. Nucleic Acids Res. 2005;33(Web Server issue):W89–93.View ArticlePubMedPubMed CentralGoogle Scholar
- Petrova NV, CH W. Prediction of catalytic residues using support vector machine with selected protein sequence and structural properties. BMC Bioinformatics. 2006;7:312.View ArticlePubMedPubMed CentralGoogle Scholar
- Sankararaman S, Sha F, Kirsch JF, Jordan MI, Sjölander K. Active site prediction using evolutionary and structural information. Bioinformatics. 2010;26:617–24.View ArticlePubMedPubMed CentralGoogle Scholar
- Wang K, Horst JA, Cheng G, Nickle DC, Samudrala R. Protein meta-functional signatures from combining sequence, structure, evolution, and amino acid property information. PLoS Comput Biol. 2008;4Google Scholar
- Youn E, Peters B, Radivojac P, Mooney SD. Evaluation of features for catalytic residue prediction in novel folds. Protein Sci. 2007;16(2):216–26.View ArticlePubMedPubMed CentralGoogle Scholar
- Ginalski K, Elofsson A, Fischer D, Rychlewski L. 3D-jury: a simple approach to improve protein structure predictions. Bioinformatics. 2003;19(8):1015–8.View ArticlePubMedGoogle Scholar
- Huang B. MetaPocket: a meta approach to improve protein ligand binding site prediction. OMICS. 2009;13(4):325–30.View ArticlePubMedGoogle Scholar
- Zhou H, Pandit SB, Skolnick J. Performance of the pro-sp3-TASSER server in CASP8. Proteins. 2009;77(Suppl 9):123–7.View ArticlePubMedPubMed CentralGoogle Scholar
- Holliday GL, Almonacid DE, Bartlett GJ, O'Boyle NM, Torrance JW, Murray-Rust P, Mitchell JB, Thornton JM. MACiE (mechanism, annotation and classification in enzymes): novel tools for searching catalytic mechanisms. Nucleic Acids Res. 2007;35(Database issue):D515–20.View ArticlePubMedGoogle Scholar
- Li W, Godzik A. Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics. 2006;22(13):1658–9.View ArticlePubMedGoogle Scholar
- Furnham N, Holliday GL, de Beer TA, Jacobsen JO, Pearson WR, Thornton JM. The catalytic site atlas 2.0: cataloging catalytic sites and residues identified in enzymes. Nucleic Acids Res. 2014;42(Database issue):D485–9.View ArticlePubMedGoogle Scholar
- Lin CP, Huang SW, Lai YL, Yen SC, Shih CH, CH L, Huang CC, Hwang JK. Deriving protein dynamical properties from weighted protein contact number. Proteins. 2008;72(3):929–35.View ArticlePubMedGoogle Scholar
- Le Guilloux V, Schmidtke P, Tuffery P. Fpocket: an open source platform for ligand pocket detection. BMC Bioinformatics. 2009;10:168.View ArticlePubMedPubMed CentralGoogle Scholar
- Hendlich M, Rippmann F, Barnickel G. LIGSITE: automatic and efficient detection of potential small molecule-binding sites in proteins. J Mol Graph Model. 1997;15(6):359–63. 389View ArticlePubMedGoogle Scholar
- Sali A, Blundell TL. Comparative protein modelling by satisfaction of spatial restraints. J Mol Biol. 1993;234(3):779–815.View ArticlePubMedGoogle Scholar
- Fawcett T. An introduction to ROC analysis. Pattern Recogn Lett. 2006;27:861–74.View ArticleGoogle Scholar
- Davis J, Goadrich M: The relationship between precision-recall and ROC curves. Proceedings of the 23rd International Conference on Machine learning -- ICML'06 2006:233–240.Google Scholar
- Saito T, Rehmsmeier M. The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PLoS One. 2015;10(3):e0118432.View ArticlePubMedPubMed CentralGoogle Scholar
- Manning CD, Raghavan P, Schütze H: Introduction to information retrieval: Cambridge University Press; 2008.Google Scholar
- Bartlett GJ, Porter CT, Borkakoti N, Thornton JM. Analysis of catalytic residues in enzyme active sites. J Mol Biol. 2002;324(1):105–21.View ArticlePubMedGoogle Scholar
- Carbajo D, Tramontano A. A resource for benchmarking the usefulness of protein structure models. BMC Bioinformatics. 2012;13:188.View ArticlePubMedPubMed CentralGoogle Scholar
- Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE. The Protein Data Bank. Nucleic Acids Res. 2000;28(1):235–42.View ArticlePubMedPubMed CentralGoogle Scholar
- Wu R, Latham JA, Chen D, Farelli J, Zhao H, Matthews K, Allen KN, Dunaway-Mariano D. Structure and catalysis in the Escherichia Coli hotdog-fold thioesterase paralogs YdiI and YbdB. Biochemistry. 2014;53(29):4788–805.View ArticlePubMedPubMed CentralGoogle Scholar
- Bastard K, Smith AA, Vergne-Vaxelaire C, Perret A, Zaparucha A, De Melo-Minardi R, Mariage A, Boutard M, Debard A, Lechaplais C, et al. Revealing the hidden functional diversity of an enzyme family. Nat Chem Biol. 2014;10(1):42–9.View ArticlePubMedGoogle Scholar
- Hibi T, Nii H, Nakatsu T, Kimura A, Kato H, Hiratake J, Oda J. Crystal structure of gamma-glutamylcysteine synthetase: insights into the mechanism of catalysis by a key enzyme for glutathione homeostasis. Proc Natl Acad Sci U S A. 2004;101(42):15052–7.View ArticlePubMedPubMed CentralGoogle Scholar
- Kumar S, Kasturia N, Sharma A, Datt M, Bachhawat AK. Redox-dependent stability of the gamma-glutamylcysteine synthetase enzyme of Escherichia Coli: a novel means of redox regulation. Biochem J. 2013;449(3):783–94.View ArticlePubMedGoogle Scholar
- Biterova EI, Barycki JJ. Mechanistic details of glutathione biosynthesis revealed by crystal structures of Saccharomyces Cerevisiae glutamate cysteine ligase. J Biol Chem. 2009;284(47):32700–8.View ArticlePubMedPubMed CentralGoogle Scholar