CSmetaPred: a consensus method for prediction of catalytic residues

Background Knowledge of catalytic residues can play an essential role in elucidating mechanistic details of an enzyme. However, experimental identification of catalytic residues is a tedious and time-consuming task, which can be expedited by computational predictions. Despite significant development in active-site prediction methods, one of the remaining issues is ranked positions of putative catalytic residues among all ranked residues. In order to improve ranking of catalytic residues and their prediction accuracy, we have developed a meta-approach based method CSmetaPred. In this approach, residues are ranked based on the mean of normalized residue scores derived from four well-known catalytic residue predictors. The mean residue score of CSmetaPred is combined with predicted pocket information to improve prediction performance in meta-predictor, CSmetaPred_poc. Results Both meta-predictors are evaluated on two comprehensive benchmark datasets and three legacy datasets using Receiver Operating Characteristic (ROC) and Precision Recall (PR) curves. The visual and quantitative analysis of ROC and PR curves shows that meta-predictors outperform their constituent methods and CSmetaPred_poc is the best of evaluated methods. For instance, on CSAMAC dataset CSmetaPred_poc (CSmetaPred) achieves highest Mean Average Specificity (MAS), a scalar measure for ROC curve, of 0.97 (0.96). Importantly, median predicted rank of catalytic residues is the lowest (best) for CSmetaPred_poc. Considering residues ranked ≤20 classified as true positive in binary classification, CSmetaPred_poc achieves prediction accuracy of 0.94 on CSAMAC dataset. Moreover, on the same dataset CSmetaPred_poc predicts all catalytic residues within top 20 ranks for ~73% of enzymes. Furthermore, benchmarking of prediction on comparative modelled structures showed that models result in better prediction than only sequence based predictions. These analyses suggest that CSmetaPred_poc is able to rank putative catalytic residues at lower (better) ranked positions, which can facilitate and expedite their experimental characterization. Conclusions The benchmarking studies showed that employing meta-approach in combining residue-level scores derived from well-known catalytic residue predictors can improve prediction accuracy as well as provide improved ranked positions of known catalytic residues. Hence, such predictions can assist experimentalist to prioritize residues for mutational studies in their efforts to characterize catalytic residues. Both meta-predictors are available as webserver at: http://14.139.227.206/csmetapred/. Electronic supplementary material The online version of this article (10.1186/s12859-017-1987-z) contains supplementary material, which is available to authorized users.


Compilation of EF-Fold, POOL-148 and PW-79 dataset
From earlier works, we took EF-Fold, POOL-160, and PW-79 datasets along with their respective catalytic residues definition. Enzymes with catalytic site present in more than one Protein Data Bank (pdb) chain are removed from these datasets. Obsolete pdb entry was either replaced with updated pdb entry or removed from the dataset. Subsequently, we renamed datasets as POOL-148, EF-Fold-164 and PW-79 depending on number of pdb entries in them, which are 148, 164 and 79 respectively. These 3 datasets are merged to construct a nonredundant (60% sequence identity) EF_POOL_PW dataset using CD-HIT. EF_POOL_PW dataset consists of 286 proteins.

Procedure to rank residues from CRpred, DISCERN and WCN
We used residue SVM score in CRpred to rank residues. The WCN and DISCERN residues are ranked based on WCN and DISCERN residue scores respectively.

Procedure to rank non-polar residues from EXIA2
EXIA2 server explicitly ranks amino acids having catalytic functional side chain based on rank score that includes polar/charged amino acids (R, N, D, C, Q, E, H, K, S, T, Y) and tryptophan. However, all residues should be ranked in order to facilitate EXIA2 comparison with meta-predictors and other methods. To rank rest non-functional side chain (non-polar) amino acids including glycine, we have followed second phase approach of EXIA2 as mentioned in Huang et al.( [1]. From the EXIA2 server results page, for every residue in the ranked order, we fetch its neighbouring non-polar residues with WCN score > 0.9 and rank these neighbours based on WCN score. Importantly, neighbours follow the order list from EXIA2. For instance, neighbours of ranked 1 residues are ranked before rank 2 residue neighbours. Every non-polar residue is ranked only once, i.e. in its first occurrence as neighbour to a residue. This results in the ranked list of nonpolar residues, which is referred to as NP-1. Next, we calculate WCN score for residues neither ranked in EXIA2 results page nor listed as neighbours in NP-1. These set of non-polar residues are ranked on WCN score that is referred to as NP-2. For final ranked list of residues, we first take EXIA2 ranked list of residues followed by NP-1 and NP-2.

Comparison of CSmetaPred predicted ranks of known catalytic residues with the best possible rank from all methods
We compared catalytic residue predicted ranks from CSmetaPred to their best possible ranks derived from scores of all of its constituent methods (EXIA2, DISCERN CRpred and CATSID). Here, the best possible rank is the minimum of ranks assigned to residues in five scores (zSrs, zSwcn, zSca, zSdi and zScr). Initially, we did not consider CATSID for the best rank analysis because it provides template matched to the query and since not all the residues could be matched to a template, we could derive residue scores (residue rank subsequently) only for subset of residues that could lead to numerically low ranks. Apart from ranking only a subset of residues, an additional issue with ranking residues in CATSID hits is that more than one residue can have same score and identical ranks. In this analysis we have used CSAMAC dataset with 2912 catalytic residues. We restricted analysis for residues having the best rank ≤ 20. Importantly, most (94.8%) of catalytic residues have the best possible rank ≤ 20. Of these residues, ~26% and ~74% of catalytic residues showed no change/decrease (better performance) and increase (poor performance) in ranks with respect to the best rank respectively. For ~74% of residues with higher (poorer) ranks than the best possible rank cases, these do not have large increase in ranks as exhibited by mean and median increase in rank of 10.1 and 4 respectively. The detailed analysis of cases with large increase in CSmetaPred ranks showed that in most instances only one or two methods have a high residue scores, whereas other methods scores are relatively low, which lead to a decrease in the meta-score with subsequent increase in their ranked position.

Examples of large increase/decrease in CSmetaPred predicted ranks with respect to the best possible ranks from all methods
As discussed before, the best possible rank analysis provides the upper bound of meta-approach implemented in CSmetaPred. We analyzed CSmetaPred predicted ranks of some catalytic residues, which show large increase (poor predicted CSmetaPred ranks) or decrease (better ranks from CSmetaPred) in ranked positions with respect to the best possible rank. One of the catalytic residues (LYS-150) of enzyme phosphatidylinositol phosphate kinase (1b01B) is ranked at 28, 17, 25 and 17 by EXIA2, CRpred, WCN and DISCERN respectively (CATSID did not provide rank for this residue) that improves to rank 8 by CSmetaPred. Even though LYS-150 is not top residue in all methods, it is among top ranked residues and has consistent normalized residue scores of 1.1230, 2.4744, 1.4825 and 1.8339 from EXIA2, CRpred, WCN and DISCERN respectively. An example of increase in CSmetaPred predicted ranks is residue LYS-591 from enzyme isoleucyl-trna synthetase (1ileA) that is ranked at 120, 96, 569 and 6 by EXIA2, CRpred, WCN and DISCERN respectively. This residue (LYS-591) is ranked at 110 in CSmetaPred predicted ranks, mostly because normalized scores are not consistent and only one method (DISCERN) assigns relatively high scores as shown by normalized scores of 0.3521, 1.0424, -0.5579 and 2.647 from EXIA2, CRpred, WCN and DISCERN respectively. We have provided predicted ranks and scores of all benchmark proteins in our webserver.

Methodology used for homology modelling and construction of template library
The template library (LIB_TEMP) is constructed using PISCES server [2] with following criteria: resolution ≤2 Å, sequence length 40-1000 residues and nonredundant at 60% sequence identity.
The proteins from CSAMAC dataset having sequence identity from 40-90% and ≥ 70% coverage with templates from template library constitutes the dataset of proteins for homology modelling. To construct dataset for modelling we searched full-length protein sequences of each pdb entry from CSAMAC dataset against LIB_TEMP using profile_build() module of MODELLER to identity protein sequences, which have templates with sequence identity ranging from 40 to 90% and coverage ≥ 70% with respect to query sequence. This resulted in a set 335 protein used for homology modelling. We built models using MODELLER by taking only one template for a given query sequence. To identify template for 335 proteins, we parsed their profile_build() result and removed any template having sequence identity <40% and >90% or coverage <70% to query sequence. Further, based on sequence identity between query sequence (335 proteins) and templates we categorized query-template alignments into following sequence identity bins: 40-50%, 50-60%, 60-70%, 70-80% and 80-90%. Within each sequence identity category, we selected the best template, having maximum sequence identity, for a given query sequence to built model using MODELLER. Thus, resulted in 235, 135, 53, 22 and 23 models in 40-50%, 50-60%, 60-70%, 70-80% and 80-90% sequence identity category. For modelling, the alignment between query and template is constructed using align2d() module. The automodel class of MODELLER is used to build 10 different models. These models are ranked based on DOPE score. The model with lowest DOPE energy score is used as representative model in catalytic residue prediction. The present modelling strategy is a conservative structure prediction in template-based modelling category using MODELLER because multiple templates have been shown to improve quality of predicted structures.

Cloning, expression, and purification of mutant gshA
Different mutants of EcgshA were constructed by site overlap extension PCR using primer sets listed in Table S10 (Additional file 2). For protein expression, all mutant genes were cloned in pET23a expression vector at Sma1 and BamH1 sites. Protein was expressed and purified using the protocol as discussed earlier [4].

γ -GCS activity assay
The enzyme activity of mutant was assayed using protocol mentioned earlier [4]. Briefly, 1 μg purified protein was used to initiate the reaction in a standard reaction mixture of 800 μl contained 150 mM Tris/HCl, pH 8.0, 40 mM MgCl2, 150 mM KCl, 2 mM sodium phosphoenol pyruvate, 20 mM sodium L-glutamate, 15 mM L-cysteine, 4 mM sodium ATP, 4 units of LDH (rabbit muscle type II) 1.6 units of PK (rabbit muscle), and 0.24 mM NADH. The change in NADH absorbance was monitored at 340 nm using a spectrophotometer (UV-1800, Shimadzu, UV spectrophotometer) for 5 minutes. The reaction was carried out at 25 0 C. To determine the Km for cysteine, L-glutamate and ATP were used at 20 mM and 4 mM concentration respectively. The calculation of enzyme activity was done using GraphPad Prism software.  Figure S1: Cumulative distribution of catalytic residues present in predicted pockets. Plot showing cumulative distribution of catalytic residues within a given pocket rank on macie-254 dataset for: Pockets output from LIGSITE/Fpocket, re-ranked pockets using poc_sc score and merged top 5 reranked pockets. The vertical line shows that at pocket rank 5 both LIGSITE and Fpocket have achieved close to the maximum catalytic residues identified within predicted pockets. The drastic increase in catalytic residues fraction after reranking in LIGSITE could also be due to merging of pockets within LIGSITE.  EF_POOL_PW (B) datasets. All four methods contribute to different extent towards improving meta-score based ranking in CSmetaPred. It is apparent from ROC curves that excluding CRpred or CATSID residue score has maximum effect on prediction performance. This suggests that these two methods have major contribution in meta-score.                Figure S13: In vivo complementation assay of GshA wild type and mutant enzymes. Figure showing in vivo complementation assay of predicted catalytic residues mutants of GshA enzyme. Saccharomyces cerevisiae strain ABC1195 plasmids bearing WT gshA or the different cysteine binding residues gshA mutant gene cloned under TEF promoter. The transformants were grown overnight in SD+GSH medium and used to re-inoculate secondary culture. Cells were harvested at OD600 = 0.6 and serially diluted (0.2 to 0.0002 OD600). 10 μl was spotted on SD medium with or without GSH as sole source of organic sulphur. The vector pTEF416 and EcGshA were used as negative and positive control respectively.   Figure S14: Residues rank comparison between EXIA2 server and in-house recoded EXIA2. Plot showing residues rank comparison from EXIA2 server output and in-house recoded EXIA2 for (A) all residues, and (B) catalytic residues. The Pearson correlation coefficient between ranks for all residues obtained from EXIA2 and in-house program is 0.86, and the same for catalytic residues is 0.55.