Overview of experiments
This section presents experimental results that demonstrate the performance of template-based active-site prediction algorithms. A template is defined as a set of atoms in an active site of an enzyme. Roughly speaking, active-site prediction compares active-site templates with local sites in protein structures; if a local site is sufficiently similar in structure to a template, it is predicted to be an active site. More precisely, there are two typical ways of predicting active sites using templates: single template analysis and multiple template analysis.
Single template analysis
In single template analysis, protein structures are searched for local sites similar to a query template, in order to discover proteins that have the same active site as the template. The input and output of single template analysis are as follows:
Input (query): Template structure.
Output: List of proteins in a protein structure database. Each protein in the list is accompanied by the coordinates of the local sites that are similar to the query template.
An example of the output of single template analysis is given in Table 1. Template 1ams was used to generate the list. The top 10 proteins are listed in descending order of similarity, together with the predicted local sites in those proteins. All 10 proteins have the same function as the template, and their active sites coincide exactly with the listed sites.
A typical procedure of single template analysis consists of two stages as follows:
Local site search. A local site search (LSS) algorithm, such as TESS or JESS, is used to enumerate candidate local site structures similar to the template.
Similarity/deviation computation. The similarity of each candidate site to the template, or its deviation from the template, is computed. The candidate sites are then sorted in descending order of similarity or in ascending order of deviation.
Multiple template analysis
Multiple template analysis attempts to find active sites in a query protein structure by searching for local sites that are similar to any template in a set of pre-defined templates. It is used to predict the function of the query protein.
Input (query): Protein tertiary structure.
Output: List of local sites that are similar to a pre-defined template.
The list in Table 2 is an example of the output of multiple template analysis. This result was generated by predicting the active sites in the query protein structure 1bio. The list contains 10 local sites, sorted by confidence level. The true active site in 1bio is “HIS 57 A, ASP 102 A, GLY 193 A, SER 195 A”, which is predicted at the top of the list.
A typical procedure of multiple template analysis consists of three stages:
Local site search. A local site search (LSS) algorithm is used to enumerate the candidates of local structures similar to a template.
Similarity/deviation computation. The similarity or deviation of each candidate site to or from the corresponding template is computed.
Post-processing. The similarities or deviations above are transformed into probabilistic scores so that local sites detected by different templates can be compared and sorted in descending order of score.
To investigate the performance of the templates, a search experiment was conducted against the PDB protein structure dataset. The enzyme database EzCatDB, which contains information on the classification of catalytic reactions and on the active sites of enzymes, was used for the experiment. EzCatDB provides a hierarchical classification of enzyme reactions, RLCP, which groups enzymes according to the reaction type, the reactive site of the substrate, the catalytic mechanism, and the active site of the enzyme. The RLCP classification differs from the conventional enzyme classification, EC, as described in the Background section. In all, 36 templates were prepared based on the RLCP classification. To evaluate the prediction methods, protein structures assigned to the same RLCP class as one of the 36 templates were enumerated for use in our experiments. Consequently, 1,219 structures were obtained.
A template is created from the set of amino acid residues in the active site of an EzCatDB entry. In the EzCatDB database, each residue in the active site is classified into one of four types: catalytic-site residue, cofactor-binding-site residue, modified residue, and mainchain catalytic residue. For catalytic-site residues and modified residues, the sidechain atoms are included automatically in the template, whereas for cofactor-binding-site residues all atoms are included. For mainchain catalytic residues, only the mainchain atoms are included. The 36 templates were created by this procedure. The original PDB IDs of those templates are listed in Table 3 (suppl.).
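The atom-selection rule described above can be sketched as follows. This is an illustration only, not the code used in the study: the string labels for the four residue types, the function name select_template_atoms, and the choice of N, CA, C, and O as the mainchain atoms (with everything else treated as sidechain) are assumptions made for the example.

```python
# Backbone atom names in PDB convention. Treating every other atom as a
# sidechain atom is an assumption of this sketch.
MAINCHAIN_ATOMS = {"N", "CA", "C", "O"}

def select_template_atoms(residue_type, atom_names):
    """Return the atom names of one active-site residue that enter the template."""
    if residue_type in ("catalytic-site", "modified"):
        # Sidechain atoms only.
        return [a for a in atom_names if a not in MAINCHAIN_ATOMS]
    if residue_type == "cofactor-binding-site":
        # All atoms of the residue.
        return list(atom_names)
    if residue_type == "mainchain-catalytic":
        # Mainchain atoms only.
        return [a for a in atom_names if a in MAINCHAIN_ATOMS]
    raise ValueError(f"unknown residue type: {residue_type}")

# Atom names of a serine residue.
serine = ["N", "CA", "C", "O", "CB", "OG"]
for kind in ("catalytic-site", "cofactor-binding-site", "mainchain-catalytic"):
    print(kind, select_template_atoms(kind, serine))
```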
In our study, TESS was implemented as the LSS algorithm that performs the local site search, which is the first step of the catalytic-function prediction system. The LSS algorithm was applied to the 1,219 protein structures, and local sites with an RMSD larger than 4.0 Å were removed from the detected local structures. As a result, 587,431 local structures were detected and used for our experiments. A local site was then labeled as a positive site for a template if it was annotated in EzCatDB as having the same function as the template; all other local sites were labeled as negative sites. Some templates hit many local sites, but others hit only a few. Template 1acb hit the largest number of positive sites among the 36 templates (557). Template 1qk2 detected 108,036 negative sites, the greatest number among the 36 templates. The medians of the numbers of positive and negative sites per template are 42 and 3,401.5, respectively.
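The filtering and labeling step can be illustrated with a short sketch. The record layout is hypothetical: the dictionary keys "template", "rmsd", and "site_class", the function name filter_and_label, and the class labels "A" and "B" are placeholders invented for the example and do not correspond to EzCatDB fields or RLCP classes. Only the 4.0 Å cutoff is taken from the text.

```python
RMSD_CUTOFF = 4.0  # angstroms, the cutoff stated in the text

def filter_and_label(hits, template_class):
    """Drop hits above the RMSD cutoff and attach a positive/negative label.

    hits: list of dicts with the hypothetical keys "template", "rmsd", and
          "site_class" (annotated class of the local site, or None).
    template_class: dict mapping a template id to the class of that template.
    """
    labeled = []
    for hit in hits:
        if hit["rmsd"] > RMSD_CUTOFF:
            continue  # removed from the detected local structures
        is_positive = hit["site_class"] == template_class[hit["template"]]
        labeled.append({**hit, "positive": is_positive})
    return labeled

# Toy data; the ids and class labels are placeholders.
template_class = {"t1": "A", "t2": "B"}
hits = [
    {"template": "t1", "rmsd": 1.2, "site_class": "A"},   # positive
    {"template": "t1", "rmsd": 3.9, "site_class": None},  # negative
    {"template": "t2", "rmsd": 4.5, "site_class": "B"},   # above cutoff
    {"template": "t2", "rmsd": 2.0, "site_class": "A"},   # negative
]
labeled = filter_and_label(hits, template_class)
print([(h["template"], h["positive"]) for h in labeled])
```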
In most protein structures, the LSS algorithm detected only a few positive sites but many more negative sites. The distribution of the detected sites in the multiple template analysis is shown in the histograms (Figure 5 (suppl.)), where the x-axis shows the number of detected sites for each query protein and the y-axis shows the frequency of hit proteins. Note that the same local site can be counted several times when it is detected by several different templates. Exactly one positive site was detected in each of 849 protein structures. The remaining 370 protein structures have multiple positive sites. About 95.5% of protein structures contain fewer than five positive sites. The number of negatives is much greater than the number of positives in most protein structures. The median number of negative sites in a protein structure is 52, the 95th percentile is 163, and the maximum is 720. This fact motivated us to devise precise similarity or deviation measurements between local site structures and template structures, in order to extract the true positive sites from among the numerous local sites detected in a protein structure.
To examine the generalization performance of the prediction algorithms, the dataset of 1,219 PDB entries was divided randomly into a training set and a test set, each containing 50% of the original dataset. The division was then adjusted so that, for every template, the LSS algorithm hits at least one active site in the training set. The test set was used only for prediction; it was never used for learning. This procedure was repeated 30 times. The prediction performances reported in this section are averages over the 30 trials.
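A minimal sketch of such a repeated split is given below. The text does not state how the division was adjusted, so this sketch simply redraws the random split until every template has at least one positive structure in the training half; that redraw strategy, the function name split_once, and the input format are assumptions for illustration.

```python
import random

def split_once(pdb_ids, positives_by_template, rng, max_tries=1000):
    """Return (train, test) halves in which every template has at least
    one positive structure in the training half.

    positives_by_template: dict mapping a template id to the set of PDB ids
    that contain a positive site for it (hypothetical input format).
    """
    ids = list(pdb_ids)
    half = len(ids) // 2
    for _ in range(max_tries):
        rng.shuffle(ids)
        train, test = set(ids[:half]), set(ids[half:])
        # Accept the split only if every template is covered in training.
        if all(train & pos for pos in positives_by_template.values()):
            return train, test
    raise RuntimeError("no valid split found")

# Toy example with placeholder ids; the study used 1,219 entries and 30 trials.
pdb_ids = ["p1", "p2", "p3", "p4", "p5", "p6"]
positives_by_template = {"t1": {"p1", "p2"}, "t2": {"p3", "p4"}}
rng = random.Random(0)
trials = [split_once(pdb_ids, positives_by_template, rng) for _ in range(30)]
print(len(trials), sorted(trials[0][0]), sorted(trials[0][1]))
```

Performance figures would then be computed on the test half of each trial and averaged over the trials.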
Single template analysis
To discriminate active sites among the local sites detected by the LSS algorithm, precise similarity or deviation measurements are necessary. Currently, the standard measurement is the root mean square deviation (RMSD). Our study introduces two measurements: Weighted Mean Deviation (WMD) and DALI Score-based Discriminative Similarity (DSDS). RMSD is computed from the unweighted average of squared Euclidean distances, whereas WMD takes a weighted average of distances. DSDS is a linear combination of DALI scores. The parameters of both WMD and DSDS are obtained using machine learning algorithms; as shown below, the resulting measurements differ notably from the conventional measurements in prediction performance. To confirm the effectiveness of machine learning for DSDS, the mean of DALI scores (MDS) was also tested as a similarity between local sites and templates, serving as an experimental control. The square of RMSD is designated as the Unweighted Mean Deviation (UMD); it yields the same ranking as RMSD.
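The relation among UMD, RMSD, and WMD can be made concrete with a small sketch. Several points are assumptions, since the precise definitions are not given in this section: the local site is taken to be already superimposed on the template with atoms in corresponding order, WMD is taken to be a weighted average of squared distances normalized by the sum of the weights (so that equal weights reproduce UMD), and the weights are simply supplied by hand, whereas in the study they are learned.

```python
import math

def squared_distances(site, template):
    """Squared Euclidean distance between each pair of corresponding atoms."""
    return [sum((p - q) ** 2 for p, q in zip(a, b))
            for a, b in zip(site, template)]

def umd(site, template):
    """Unweighted mean of the squared distances (the square of RMSD)."""
    d2 = squared_distances(site, template)
    return sum(d2) / len(d2)

def rmsd(site, template):
    return math.sqrt(umd(site, template))

def wmd(site, template, weights):
    """Weighted mean of the squared distances (assumed form, see lead-in)."""
    d2 = squared_distances(site, template)
    return sum(w * d for w, d in zip(weights, d2)) / sum(weights)

# Toy coordinates for a two-atom template and one candidate site.
template = [(0.0, 0.0, 0.0), (3.0, 0.0, 0.0)]
site = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
print(umd(site, template), rmsd(site, template))
print(wmd(site, template, [1.0, 1.0]), wmd(site, template, [3.0, 1.0]))
```

Because the square root is monotonic, sorting candidate sites by UMD or by RMSD gives the same order, which is why the two are interchangeable for ranking.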
An example of the prediction results in the single template analysis is shown in Table 1. This is a result obtained by the DSDS measurement, using template 1ams. All local sites in the list are positive, as described above. For comparison, Table 4 (suppl.) portrays a prediction result obtained using the UMD measurement, with the same template. Unfortunately, all the local sites in the list are negative.
To evaluate the performance of active-site prediction algorithms in the single template analysis, two criteria were adopted: ROC score and Sensitivity. The ROC score is the area under the ROC curve, which is drawn in a two-dimensional space where the x-axis shows the false positive rate (FPR) and the y-axis shows the true positive rate (TPR). As the discrimination threshold on the similarity/deviation varies, different FPRs and TPRs are obtained, yielding many points in the FPR-vs.-TPR space. Connecting those points yields an ROC curve for each template and for each of the 30 trials, and the resulting ROC scores were averaged. The sensitivity was also evaluated with the discrimination threshold adjusted so that the specificity is 95%. In this article, the capitalized word Sensitivity denotes the sensitivity at a specificity of 95%.
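The two criteria can be sketched for a single template as follows. The sketch assumes that a higher score means a more confident prediction (deviations would be negated first), computes the area under the ROC curve by pairwise comparison with ties counted as one half, and picks the threshold for Sensitivity by allowing at most floor(0.05 × number of negatives) false positives; the handling of ties and of the threshold is an assumption, and the scores are toy values.

```python
import math

def roc_score(pos, neg):
    """Area under the ROC curve via pairwise comparison (ties count 0.5)."""
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def sensitivity_at_specificity(pos, neg, specificity=0.95):
    """TPR at a threshold admitting at most floor((1 - specificity) * n_neg)
    false positives. Requires specificity > 0."""
    neg_sorted = sorted(neg, reverse=True)
    allowed_fp = math.floor((1.0 - specificity) * len(neg_sorted))
    threshold = neg_sorted[allowed_fp]  # sites scoring above this are called positive
    return sum(s > threshold for s in pos) / len(pos)

# Toy similarity scores for the positive and negative sites of one template.
pos_scores = [0.9, 0.8, 0.4]
neg_scores = [0.7, 0.3, 0.2, 0.1]
print(roc_score(pos_scores, neg_scores))
print(sensitivity_at_specificity(pos_scores, neg_scores))
```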
Figure 1 depicts the averages of the ROC scores and Sensitivities over the 36 templates. Among the four similarity or deviation measurements, DSDS achieved the best ROC score and the best Sensitivity (ROC score of 0.981 and Sensitivity of 0.936). WMD obtained an ROC score of 0.977 and a Sensitivity of 0.920, each the second best among the four measurements. The difference between DSDS and WMD is small; a one-sample t-test did not detect a statistically significant difference in either ROC score or Sensitivity (P-values of 0.106 and 0.181, respectively). Compared to UMD, WMD obtained a significant improvement; the changes in ROC score and Sensitivity are 0.0181 and 0.0343, respectively, with P-values of 0.00495 and 0.00356. The improvements from MDS to DSDS were larger: the changes in ROC score and Sensitivity are 0.0253 and 0.0674, respectively. These results suggest that machine learning is more effective when combined with the DALI score than when combined with the mean deviation.
Figure 2 and Figure 6 (suppl.) show scatter plots of the ROC scores and the Sensitivities, respectively, comparing three measurements (UMD, WMD, and DSDS) template by template. The two measurements obtained using machine learning, WMD and DSDS, performed much better than the baseline measurement, UMD. No remarkable difference between WMD and DSDS was observed in the scatter plots.
Figure 7 (suppl.) plots the median ROC curves over the 36 templates. The median ROC curve is obtained by computing an ROC curve for each template and taking the median of the 36 TPRs at every FPR. To show how the ROC curves vary across templates, the curves of the 25th and 75th percentiles are also plotted. The median ROC curve is drawn as a solid curve; the 25th and 75th percentiles are drawn as dotted curves below and above the median curve, respectively. For every prediction method, the variation in the ROC curves across templates is not small. Another notable point is that the TPR value on the 25th percentile for DSDS is markedly higher than that for MDS: at an FPR of 5%, MDS and DSDS yield TPR values of 0.823 and 0.966 on the 25th percentile, respectively, which implies that DSDS performs more stably than MDS.
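The construction of the median and percentile curves can be sketched as below. The step convention used to read a TPR off each per-template ROC curve at a given FPR, and the use of the inclusive quantile method of the Python standard library, are assumptions; the three toy templates stand in for the 36 templates of the study.

```python
import math
import statistics

def tpr_at_fpr(pos, neg, fpr):
    """TPR when at most floor(fpr * n_neg) negatives score above the threshold."""
    neg_sorted = sorted(neg, reverse=True)
    k = math.floor(fpr * len(neg_sorted))
    threshold = neg_sorted[k] if k < len(neg_sorted) else -math.inf
    return sum(s > threshold for s in pos) / len(pos)

def roc_band(per_template_scores, fpr_grid):
    """For each FPR, return (25th percentile, median, 75th percentile) of the
    TPRs taken across templates."""
    band = []
    for fpr in fpr_grid:
        tprs = [tpr_at_fpr(pos, neg, fpr) for pos, neg in per_template_scores]
        q1, q2, q3 = statistics.quantiles(tprs, n=4, method="inclusive")
        band.append((q1, q2, q3))
    return band

# Toy (positive scores, negative scores) pairs for three templates.
per_template = [
    ([0.9, 0.8], [0.5, 0.1]),
    ([0.9, 0.3], [0.5, 0.1]),
    ([0.2, 0.05], [0.5, 0.1]),
]
band = roc_band(per_template, [0.0, 1.0])
print(band)
```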
Multiple template analysis
To perform multiple template analysis for predicting the function of a query protein structure, the similarity or deviation measurements, such as RMSD, must be transformed through post-processing into unified scores that can be compared among different templates. One post-processing method is logistic regression (LR). The LR method provides posterior probabilities for given similarity or deviation values. The parameters of the posterior probability function are estimated using training data. Another post-processing method is PINTS. Our experiments revealed that PINTS works well for the square root of WMD. However, PINTS can be applied to neither MDS nor DSDS.
As an experimental control, methods that compare the similarity or deviation measurements directly, designated as Direct in this paper, were also tested. In the Direct methods, no post-processing is conducted. Consequently, there are 10 combinations of similarity or deviation measurements and post-processing methods: UMD-Direct, WMD-Direct, MDS-Direct, DSDS-Direct, UMD-LR, WMD-LR, MDS-LR, DSDS-LR, UMD-PINTS, WMD-PINTS.
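The LR post-processing can be illustrated with a one-variable logistic regression that maps a deviation value to a posterior probability of being a positive site. This is a sketch under assumptions: the fitting procedure (plain batch gradient descent), the learning rate and epoch count, the idea of fitting one model per template, and the toy training values are all chosen for illustration and are not taken from the study.

```python
import math

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def fit_logistic(xs, ys, lr=0.1, epochs=5000):
    """Fit p(positive | x) = sigmoid(a * x + b) by batch gradient descent."""
    a, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_a = grad_b = 0.0
        for x, y in zip(xs, ys):
            err = sigmoid(a * x + b) - y
            grad_a += err * x
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b

def posterior(x, a, b):
    return sigmoid(a * x + b)

# Toy training data for one template: deviation values and labels
# (1 = positive site, 0 = negative site).
deviations = [0.2, 0.4, 0.5, 1.5, 2.0, 2.5, 3.0]
labels = [1, 1, 1, 0, 0, 0, 0]
a, b = fit_logistic(deviations, labels)
print(a, b, posterior(0.3, a, b), posterior(2.0, a, b))
```

Because the output is a probability rather than a raw deviation, values produced by models fitted for different templates lie on a common scale and can be sorted together.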
The example presented in Table 2 was generated using WMD-PINTS. The local sites in the list are the predicted active sites in the protein structure 1bio. As described above, the top-ranked prediction is the true positive site. Table 5 (suppl.) shows a prediction result obtained using the UMD-PINTS measurement for the same protein structure, 1bio. In this case, no active site was predicted in the top 10.
To evaluate the prediction performance in the multiple template analysis, an ROC curve is drawn for each protein structure and the area under the curve, the ROC score, is computed. A modified version of the ROC score, called the ROC5 score, is also used for performance evaluation. The ROC5 score is the area under the ROC curve up to the first five false positives, scaled to lie between 0 and 1.
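A sketch of an ROC5-style computation for one query protein is given below. It assumes the common ROC-n form, in which the number of true positives ranked above each of the first n false positives is summed and divided by n times the total number of positives. The tie handling (negatives placed first) and the padding used when fewer than n negatives exist are assumptions, and the scored sites are toy values.

```python
def roc_n(scored_sites, n=5):
    """ROC-n score for a ranked list of (score, is_positive) pairs,
    where a higher score means a more confident prediction."""
    total_pos = sum(1 for _, is_pos in scored_sites if is_pos)
    if total_pos == 0:
        raise ValueError("need at least one positive site")
    # Sort by descending score; on ties, negatives come first (pessimistic).
    ranked = sorted(scored_sites, key=lambda sp: (-sp[0], sp[1]))
    tp_seen, fp_seen, area = 0, 0, 0
    for _, is_pos in ranked:
        if is_pos:
            tp_seen += 1
        else:
            fp_seen += 1
            area += tp_seen  # true positives ranked above this false positive
            if fp_seen == n:
                break
    # If fewer than n negatives exist, the missing columns count all positives.
    area += (n - fp_seen) * total_pos
    return area / (n * total_pos)

# Toy ranked list for one query protein: two positives, six negatives.
sites = [(0.9, True), (0.8, False), (0.7, True), (0.6, False),
         (0.5, False), (0.4, False), (0.3, False), (0.2, False)]
print(roc_n(sites))
```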
Figure 3 presents the average ROC score and the average ROC5 score across the 1,219 proteins for each measurement. Logistic regression improves all similarity/deviation measurements, and PINTS further improves the performance. WMD-PINTS achieved an ROC of 0.996 and an ROC5 of 0.970, which are significantly better than those of any of the four measurements using logistic regression. The best of those four measurements is DSDS-LR, which yields an ROC of 0.991 and an ROC5 of 0.942. Among the four methods without post-processing, DSDS-Direct obtained the best ROC (0.982) and UMD-Direct obtained the best ROC5 (0.848), although both are significantly worse than either WMD-PINTS or DSDS-LR.
To compare the machine learning-based methods with the experimental control, the ROC and ROC5 scores of three methods, UMD-PINTS, WMD-PINTS, and DSDS-LR, are shown in the scatter plots of Figure 4 and Figure 8 (suppl.). These three are the best methods among those using UMD, WMD, and DSDS, respectively. Figure 4 shows the ROC scores for the 1,219 PDB structures, and Figure 8 (suppl.) shows the ROC5 scores. For both criteria, the two methods WMD-PINTS and DSDS-LR performed much better than the baseline method, UMD-PINTS, which suggests that machine learning is effective for the multiple template analysis. No remarkable difference between the two machine learning-based methods was observed.