Enhancing protein-vitamin binding residues prediction by multiple heterogeneous subspace SVMs ensemble
BMC Bioinformatics volume 15, Article number: 297 (2014)
Vitamins are typical ligands that play critical roles in various metabolic processes. The accurate identification of the vitamin-binding residues solely based on a protein sequence is of significant importance for the functional annotation of proteins, especially in the post-genomic era, when large volumes of protein sequences are accumulating quickly without being functionally annotated.
In this paper, a new predictor called TargetVita is designed and implemented for predicting protein-vitamin binding residues using protein sequences. In TargetVita, features derived from the position-specific scoring matrix (PSSM), predicted protein secondary structure, and vitamin binding propensity are combined to form the original feature space; then, several feature subspaces are selected by performing different feature selection methods. Finally, based on the selected feature subspaces, heterogeneous SVMs are trained and then ensembled for performing prediction.
The experimental results obtained with four separate vitamin-binding benchmark datasets demonstrate that the proposed TargetVita is superior to the state-of-the-art vitamin-specific predictor, and an average improvement of 10% in terms of the Matthews correlation coefficient (MCC) was achieved over independent validation tests. The TargetVita web server and the datasets used are freely available for academic use at http://csbio.njust.edu.cn/bioinf/TargetVita or http://www.csbio.sjtu.edu.cn/bioinf/TargetVita.
Functional positions in a protein are residues that play more critical roles than other residues and enable the protein to perform specific biological functions, such as capturing drugs, binding ligands, and interacting with other proteins. However, functionally annotated proteins still account for only a small portion of sequenced proteins, and the gap between annotated and sequenced proteins is ever-increasing with the rapid development of advanced sequencing technology and concerted genome projects. Automated computational methods for the prediction of protein functional positions are urgently needed and have become a hotspot in bioinformatics research. During the past decades, machine-learning-based computational methods have been extensively applied to various protein functional position prediction problems [5–7].
Protein-ligand interaction is one of the most important protein functions and plays vital roles in virtually all biological processes [2, 8, 9]. Considerable effort has been made to design effective methods for protein-ligand binding residue (site) prediction, and much progress has been made in this area. In the early stage, general-purpose protein-ligand predictors, which predict ligand binding sites (pockets) regardless of ligand type, dominated the field of protein-ligand binding site prediction. Such predictors include LIGSITE, CASTp, SURFNET, POCKET, fpocket, Q-SiteFinder, and SITEHOUND. Later, researchers observed that protein-ligand binding sites (pockets) vary significantly in their roles, sizes, and distributions for different types of protein-ligand interactions, and different ligands tend to bind diverse types of residues with prominent specificities [18, 19]. These observations motivated the emergence of ligand-specific predictors, which are specifically designed to predict binding residues or sites for certain ligand types, such as NsitePred and TargetS for protein-nucleotide binding prediction, FINDSITE-metal and CHED for protein-metal binding prediction, MetaDBSite and DNABR for protein-DNA binding prediction, predictors for protein-drug binding prediction, and others. These studies have shown that ligand-specific predictors are often superior to general-purpose predictors and are a promising route for improving the performance of protein-ligand prediction [20, 21].
Vitamins are typical ligands and play critical roles in various metabolic processes [27–29]. However, to the best of our knowledge, minimal work has been performed to design a specific predictor for predicting protein-vitamin binding residues. Recently, Panwar et al.  published their pioneering work on protein-vitamin binding prediction and a predictor, called VitaPred, was implemented. VitaPred  is a sequence-based ligand-specific predictor specifically designed for predicting protein-vitamin binding residues, and it consists of four independent prediction modules for predicting vitamin, vitamin-A, vitamin-B, and pyridoxal-5-phosphate (vitamin-B6) binding residues. VitaPred encodes each residue into a 340-D feature vector by applying a sliding window to the position-specific scoring matrix (PSSM) of a protein sequence; then, a support vector machine (SVM) is trained on the set of feature vectors of all the training residues. In the prediction stage, the feature vector of each residue in a query sequence is fed into the trained SVM, and the binding propensity of each residue is obtained; finally, a threshold is used to determine whether a residue is vitamin-interacting. Although VitaPred achieved great success in predicting protein-vitamin binding residues, there is still room for further improving the prediction performance: first, only the PSSM-derived feature was used in VitaPred, and other valuable features (e.g., protein secondary structure) were not well considered; second, the PSSM feature constructed in VitaPred may contain redundant information, which is useless or even harmful for performing prediction.
This paper follows the pioneering work of Panwar et al.  and aims to further improve the performance of protein-vitamin binding residue prediction. A new predictor, called TargetVita, which utilises multiple sequence-derived features and heterogeneous SVMs ensemble based on feature selection, is developed. In TargetVita, three different types of features (i.e., a position-specific scoring matrix feature, a predicted secondary structure feature, and a vitamin binding propensity feature) are combined to form the original feature space; then, three feature selection methods are performed on the original feature space to extract three different feature subspaces, and heterogeneous SVMs are trained on the reduced feature subspaces. Finally, when performing prediction, the vitamin-binding propensity of each residue in a query sequence is predicted by averaging the outputs of the three trained heterogeneous SVMs. Experimental results obtained with four separate vitamin-binding benchmark datasets demonstrate that the proposed TargetVita is superior to VitaPred  and an average improvement of 10% in terms of the Matthews correlation coefficient (MCC) was achieved over independent validation tests.
In this study, four benchmark datasets created by Panwar et al.  were utilised to evaluate the efficacy of the proposed method. For convenience, the four benchmark datasets are denoted as DVI (dataset of vitamin-interacting proteins), DVAI (dataset of vitamin-A-interacting proteins), DVBI (dataset of vitamin-B-interacting proteins), and DPLPI (dataset of pyridoxal-5-phosphate interacting proteins), respectively.
Each benchmark dataset was constructed with a stringent procedure as follows : taking DVI as an example, 1061 PDB IDs of proteins that make contact with vitamins were first collected from SuperSite documentation ; then, the sequences of all chains of these 1061 PDB IDs were downloaded from the Protein Data Bank . Among the obtained sequences, 2720 sequences were finally chosen according to the results returned from the Ligand Protein Contact (LPC) web server  by taking 1061 PDB IDs as inputs. Then, a threshold of 5.0 Å was used to determine the vitamin-interacting residues: a residue was considered to be vitamin-interacting if the closest distance between atoms of the protein and the partner vitamin was within the threshold (5.0 Å) ; finally, the maximal pairwise sequence identity of the vitamin-binding sequences obtained in the above steps was further reduced to 25% by using BLASTCLUST , and the obtained 187 vitamin-interacting chains with 30156 vitamin-binding residues constituted the DVI. Similarly, DVAI, DVBI, and DPLPI, which consist of 31, 141, and 71 non-redundant sequences, respectively, were also constructed by repeating the above-mentioned steps.
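The 5.0 Å contact rule described above can be illustrated with a minimal sketch. The function name, the treatment of "within the threshold" as less than or equal to the cut-off, and the toy coordinates are assumptions made for illustration; they are not taken from the LPC server or from the original dataset construction.

```python
import math

def is_vitamin_interacting(residue_atoms, ligand_atoms, cutoff=5.0):
    """Return True if the closest residue-ligand atom pair is within `cutoff` angstroms.

    Atoms are (x, y, z) tuples. Using <= for the comparison is an assumption.
    """
    return any(
        math.dist(a, b) <= cutoff
        for a in residue_atoms
        for b in ligand_atoms
    )

# Toy coordinates in angstroms; not taken from any real structure.
residue = [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)]
vitamin = [(6.0, 0.0, 0.0), (9.0, 0.0, 0.0)]
print(is_vitamin_interacting(residue, vitamin))
```

In this toy case the closest pair is 4.5 Å apart, so the residue would be labelled as vitamin-interacting.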
Four different independent validation datasets for DVI, DVAI, DVBI, and DPLPI, which consist of 46, 15, 27, and 16 non-redundant sequences, respectively, were also constructed. In addition, to guarantee the independence of the independent validation subsets, the maximal pairwise sequence identity within each independent validation dataset is less than 25%, and any sequence in an independent validation dataset shares <25% identity with the sequences in the corresponding training dataset. Table 1 summarises the detailed compositions of the four benchmark datasets. Details for constructing these datasets can be found in the work of Panwar et al.
To evaluate the performance of TargetVita with non-vitamin binding proteins, we also constructed a non-vitamin binding dataset, denoted as NVD, from BioLip , which is the most recently released semi-manually curated database for biologically relevant ligand–protein interactions. We constructed the NVD as follows: First, all the sequences that do not interact with vitamins are extracted from BioLip; then, the maximal pairwise sequence identity of the extracted protein sequences is culled to 30% by using CD-Hit  program, and the reduced dataset is obtained. Moreover, if a given sequence in the reduced dataset shares >30% identity with a sequence in the training dataset DVI, we remove the sequence from the reduced dataset. Finally, the remaining 6676 sequences (with 1852390 residues) constitute NVD.
All the datasets used in this study are included in Additional file 1.
Feature representation is a critical step in designing a machine-learning-based predictor. In this study, multiple sequence-derived features, which potentially have a positive impact on the performance improvement of protein-vitamin binding residue prediction, are extracted and combined to form an informative feature space.
1) Position-specific scoring matrix (PSSM)
Previous studies have demonstrated that the evolutionary information reflected by a position-specific scoring matrix (PSSM) is a powerful feature source in many bioinformatics problems including protein-ligand binding predictions [20, 21, 37–40]. In view of this, PSSM was also taken as a feature source in this study. First, we obtained the original PSSM of a sequence by executing PSI-BLAST to search the Swiss-Prot database with the sequence as the query, using three iterations and an E-value cut-off of 0.001; then, each element x contained in the PSSM was normalised by the logistic function f(x) = 1/(1 + e^(−x)), and the normalised PSSM was obtained. Finally, the PSSM-based feature vector for each residue in the query sequence can be extracted with a sliding window as follows: for a residue at position i of the query sequence, its feature vector consists of the normalised PSSM elements of the query sequence corresponding to a sequence segment of length W centred on i. In this study, W, i.e., the size of the sliding window, was set to 17, which has been demonstrated to be a better choice in VitaPred and several other protein-ligand binding site prediction studies [37, 38]. Consequently, the dimensionality of the PSSM feature vector of a residue is 17 × 20 = 340.
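The normalisation and sliding-window steps can be sketched as follows. This is a minimal illustration, not the TargetVita implementation: the helper names are hypothetical, the raw PSSM is a toy matrix rather than PSI-BLAST output, and zero-padding of window positions that fall beyond the sequence termini is an assumption, since the handling of termini is not specified here.

```python
import math

def normalise(x):
    """Logistic function f(x) = 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def window_features(matrix, i, window=17):
    """Concatenate the rows of `matrix` inside a window centred on position i.

    Window positions beyond either terminus are zero-padded (an assumption).
    """
    half = window // 2
    n_cols = len(matrix[0])
    features = []
    for pos in range(i - half, i + half + 1):
        if 0 <= pos < len(matrix):
            features.extend(matrix[pos])
        else:
            features.extend([0.0] * n_cols)
    return features

# Toy raw PSSM for a 30-residue sequence: 30 rows x 20 columns of integer scores.
raw_pssm = [[(r + c) % 7 - 3 for c in range(20)] for r in range(30)]
norm_pssm = [[normalise(x) for x in row] for row in raw_pssm]

vec = window_features(norm_pssm, i=0)
print(len(vec))  # 17 * 20 = 340
```

The same helper applied to an L × 3 matrix yields a 17 × 3 = 51-dimensional vector.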
2) Predicted secondary structures (PSSs)
A fundamental hypothesis for most of the sequence-based protein attribute predictions is that sequences with similar structures will have similar functions. Previous studies have also shown that a close relationship exists between protein structure and function. Many structural characteristics, such as secondary structure information, have been extensively investigated for the identification of protein functional residues (e.g., protein-ligand binding residues [40, 42]). The appropriate utilisation of protein structural information may potentially help to improve the performance of protein-ligand binding prediction, as has been empirically demonstrated in our recent work [21, 38]. Therefore, protein secondary structure information, predicted from the protein sequence by performing PSIPRED , was used as another feature source for protein-vitamin binding residue prediction.
The predicted secondary structure information of a protein sequence is obtained by applying PSIPRED  software, which predicts the likelihood that a given residue in a protein sequence belongs to one of three secondary structure classes: coil (C), helix (H), and strand (E). More specifically, for a protein sequence with L residues, PSIPRED outputs an L × 3 probability matrix, which represents the predicted secondary structure information of the protein. Again, a sliding window of size 17 was used to extract the predicted secondary structure feature of each residue, and the dimensionality of the extracted PSSs feature vector was 17 × 3 = 51.
3) Vitamin binding propensities (VBPs)
Previous studies have demonstrated that different ligands tend to bind different residues [18, 19, 21]. Panwar et al.  also analysed different protein-interacting residues of different vitamin classes and observed that different vitamins tend to bind different residues; this phenomenon can also be observed within vitamin sub-classes. Motivated by this observation, we can thus calculate the binding propensities of the 20 native amino acids for each type of vitamin and then extract a vitamin-specific 17-D binding propensity feature vector, denoted as VBP, for each residue in a protein sequence by concatenating the binding propensities of its neighbouring residues within a window of size 17 centred at the residue.
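As a rough illustration of the VBP feature, the sketch below assumes that the binding propensity of an amino acid is the fraction of its occurrences in the training set that are annotated as binding. That definition, the zero-padding at the termini, the function names, and the toy sequences are all assumptions; the exact propensity formula is not given in this section.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def binding_propensities(sequences, labels):
    """Per-amino-acid fraction of occurrences labelled as binding (assumed definition)."""
    total, bound = Counter(), Counter()
    for seq, lab in zip(sequences, labels):
        for aa, is_bound in zip(seq, lab):
            total[aa] += 1
            bound[aa] += is_bound
    return {aa: (bound[aa] / total[aa] if total[aa] else 0.0) for aa in AMINO_ACIDS}

def vbp_vector(seq, i, propensity, window=17):
    """Propensities of the residues in a window centred on i (zero-padded at termini)."""
    half = window // 2
    return [
        propensity.get(seq[p], 0.0) if 0 <= p < len(seq) else 0.0
        for p in range(i - half, i + half + 1)
    ]

# Toy training data: 1 marks a vitamin-binding residue, 0 a non-binding residue.
seqs = ["ACDKKA", "KKGA"]
labs = [[0, 0, 0, 1, 1, 0], [1, 0, 0, 0]]
prop = binding_propensities(seqs, labs)
vec = vbp_vector(seqs[0], 3, prop)
print(prop["K"], len(vec))
```

Because the propensities are computed separately for each vitamin type, each of the four prediction tasks would have its own propensity table under this scheme.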
Finally, the feature representation of a residue is formed by serially combining its three corresponding feature vectors, i.e., PSSM, PSSs, and VBPs, and the dimensionality of the obtained feature vector is 340 + 51 + 17 = 408-D.
Ensemble multiple heterogeneous subspace SVMs based on feature selection
After determining the feature representation, prediction models can be trained on a dataset with machine-learning algorithms such as SVM, as used in this study. However, directly training prediction models on the original feature space is often not the best solution. One important reason is that redundant information that has no positive, and sometimes even negative, impact on the prediction performance could potentially exist. Selection of the most discriminative feature subspace from the original feature space may help to improve prediction performance. Accordingly, feature selection has been a hotspot and is widely used in many bioinformatics and related fields , such as sequence analysis [45, 46] and microarray analysis [47, 48]. For example, our recent work  has demonstrated that the PSSM feature contains redundant information, which is useless for disulphide connectivity prediction, and the prediction performance can be further improved when the original PSSM feature is reduced to a lower but more compact feature subspace via feature selection.
However, many existing traditional feature selection methods, such as data variance, the Fisher score, and the Laplacian score, have two deficiencies. First, the importance of each feature is calculated individually; thus, the correlation and dependency of different feature components are neglected. Second, the dimensionality of the reduced feature subspace needs to be prescribed in advance of the feature selection process, which is often difficult or even impossible in practice.
Recently, we developed a generalised Joint Laplacian Feature Weights Learning algorithm , denoted as JLFWL, which can effectively address the above-mentioned two deficiencies. Rather than computing feature weights one by one, JLFWL automatically determines the optimal size of the feature subspace and selects the best feature components from the original feature space by iteratively learning the feature weights jointly and simultaneously.
Here, we briefly restate JLFWL; details can be found in the original publication of the algorithm.
Let matrix X = [x_1, x_2, ⋯, x_M] ∈ R^(N × M) be a training dataset, where M is the number of samples, N is the dimensionality of features, and x_i is the feature vector of the i-th training sample. Then, the Joint Laplacian Feature Weights Learning can be summarised in Algorithm 1.
Note that in Algorithm 1, ε ≥ 0 is a parameter to control the -norm of w. In this study, ε is set to be 0.5.
Different feature subspaces can be selected from the original feature space with different feature selection methods, and the discriminative characteristics of the obtained feature subspaces may also differ. Prediction models trained on these different feature subspaces potentially complement each other, which motivates us to propose an ensemble learning scheme, i.e., a multiple heterogeneous subspace SVMs ensemble based on feature selection, as follows:
First, multiple feature subspaces are selected by applying different feature selection methods; then, based on these selected feature subspaces, multiple prediction models, which are termed as heterogeneous models, can be trained on the same dataset; for a query input, the final prediction output is obtained by ensembling the outputs of the trained heterogeneous prediction models.
To demonstrate the efficacy of the proposed ensemble learning scheme, three feature selection methods (i.e., our JLFWL together with two traditional feature selection methods, namely the Fisher score and the Laplacian score) are used to perform feature selection, and SVM is used as the base machine-learning algorithm to train multiple heterogeneous prediction models. We locally performed JLFWL on the benchmark datasets and found that the optimal dimensionality of the feature subspace, which is automatically determined by JLFWL, is 386. For consistency, the dimensionalities of the feature subspaces obtained by the Fisher score and Laplacian score feature selection methods are also set to 386.
In this study, C-SVM is used, and the three heterogeneous SVM models are weighted equally in the ensemble. The radial basis function is chosen as the kernel function. The other two parameters, i.e., the regularisation parameter γ and the kernel width parameter σ, are set according to the optimisation results from a grid search strategy in the LIBSVM software.
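Of the three selection methods, the Fisher score is the simplest to illustrate. The sketch below uses the standard textbook form (between-class scatter divided by within-class scatter, computed per feature) and keeps the k highest-scoring features. The function names and the toy data are hypothetical, and the exact variant used in TargetVita may differ.

```python
def fisher_scores(X, y):
    """Fisher score per feature: between-class scatter over within-class scatter."""
    n_features = len(X[0])
    classes = sorted(set(y))
    scores = []
    for j in range(n_features):
        col = [row[j] for row in X]
        mean_all = sum(col) / len(col)
        between = within = 0.0
        for c in classes:
            vals = [v for v, label in zip(col, y) if label == c]
            mean_c = sum(vals) / len(vals)
            between += len(vals) * (mean_c - mean_all) ** 2
            within += sum((v - mean_c) ** 2 for v in vals)
        if within == 0:
            # No within-class spread: perfectly separating or constant feature.
            scores.append(float("inf") if between > 0 else 0.0)
        else:
            scores.append(between / within)
    return scores

def top_k_features(scores, k):
    """Indices of the k highest-scoring features, returned in ascending index order."""
    ranked = sorted(range(len(scores)), key=lambda j: -scores[j])
    return sorted(ranked[:k])

# Toy data: feature 0 separates the two classes, feature 1 does not.
X = [[1.0, 5.0], [1.2, 1.0], [3.0, 4.0], [3.2, 2.0]]
y = [0, 0, 1, 1]
scores = fisher_scores(X, y)
print(top_k_features(scores, 1))
```

In the setting described above, k would be 386 and X would hold the 408-dimensional residue feature vectors.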
Workflow of the proposed TargetVita
Figure 1 illustrates the workflow of the proposed TargetVita. In the training stage, all the feature vectors of the residues in the training sequences constitute the training feature vector set; then, L feature subspaces can be selected by performing L different feature selection methods on the training feature vector set; based on the selected feature subspaces, L heterogeneous SVM models can be trained.
In the prediction stage, for each residue in a query protein sequence, its PSSM and PSSs feature vectors are first extracted by calling PSI-BLAST and PSIPRED and applying a sliding window technique; then, its PSSM, PSSs, and VBP feature vectors are combined and filtered by the L selected feature subspaces, and the L filtered feature vectors (i.e., feature vectors after feature selection) are fed to the L corresponding trained heterogeneous SVMs, each of which predicts a vitamin-interaction score for that residue; the final score for the residue is obtained by averaging the predicted scores of those heterogeneous SVM models. Finally, a threshold T is used to determine whether the residue is vitamin-interacting: residues with scores above the threshold are marked as vitamin-interacting.
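The prediction stage for a single residue can be sketched as below. The trained SVMs are replaced by hypothetical stand-in callables that map a filtered feature vector to a score, and the subspaces are toy index lists; only the filter, average, and threshold logic is meant to mirror the description above.

```python
def ensemble_predict(feature_vector, subspaces, models, threshold):
    """Filter the vector by each subspace, score it with the matching model,
    average the scores with equal weights, and compare against the threshold."""
    scores = []
    for indices, model in zip(subspaces, models):
        filtered = [feature_vector[j] for j in indices]
        scores.append(model(filtered))
    final = sum(scores) / len(scores)
    return final, final > threshold

# Stand-in scorer (NOT an SVM): the mean of the selected feature values.
def mean_scorer(values):
    return sum(values) / len(values)

feature_vector = [0.2, 0.9, 0.4, 0.7]   # toy 4-D vector instead of 408-D
subspaces = [[0, 1], [1, 2], [1, 3]]    # toy selected-feature indices, L = 3
models = [mean_scorer, mean_scorer, mean_scorer]

final, is_binding = ensemble_predict(feature_vector, subspaces, models, threshold=0.5)
print(round(final, 4), is_binding)
```

Equal-weight averaging keeps the ensemble free of extra parameters beyond the decision threshold T.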
To help with the visualisation of the prediction results, MODELLER software  is taken to model the protein 3D structure from the sequence, and the predicted vitamin-interacting residues are highlighted in red on the modelled 3D structure.
To evaluate the performance of the proposed method, four evaluation indexes routinely used in this field, i.e., Sensitivity (Sn), Specificity (Sp), Accuracy (Acc), and the Matthews correlation coefficient (MCC), were used:
Sn = TP / (TP + FN)
Sp = TN / (TN + FP)
Acc = (TP + TN) / (TP + FP + TN + FN)
MCC = (TP × TN − FP × FN) / √[(TP + FP)(TP + FN)(TN + FP)(TN + FN)]
where TP, FP, TN, and FN are the abbreviations of True Positive, False Positive, True Negative, and False Negative, respectively.
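For reference, the standard definitions of these four indexes in terms of TP, FP, TN, and FN can be computed as in the following sketch. The function name and the toy counts are illustrative only, and the sketch assumes that both classes are present so that no denominator of Sn or Sp is zero.

```python
import math

def evaluate(tp, fp, tn, fn):
    """Return (Sn, Sp, Acc, MCC) from confusion-matrix counts."""
    sn = tp / (tp + fn)                       # sensitivity
    sp = tn / (tn + fp)                       # specificity
    acc = (tp + tn) / (tp + fp + tn + fn)     # accuracy
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return sn, sp, acc, mcc

# Toy counts, not results from any dataset in this study.
sn, sp, acc, mcc = evaluate(tp=40, fp=20, tn=180, fn=10)
print(sn, sp, acc, round(mcc, 3))
```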
However, these four evaluation indexes are threshold-dependent, i.e., the values of these indexes vary with the threshold chosen. Clearly, it is impossible and unnecessary to report the values of these indexes under all the possible thresholds [20, 21, 37]. In light of this, two threshold selection strategies, which have been widely applied in the related fields, were taken to report the threshold-dependent evaluation indexes, i.e., Sn, Sp, Acc, and MCC.
Strategy I: Threshold that balances the values of Sn and Sp
The threshold-dependent evaluation indexes are reported with a threshold, denoted as T_Balance, at which the value of Sn is equal or roughly equal to that of Sp. For convenience, the evaluation results obtained under T_Balance will be termed Balanced Evaluation in the subsequent descriptions.
Strategy II: Threshold that maximises the value of MCC
Because the MCC provides an overall measurement of the quality of binary predictions, we also reported the threshold-dependent evaluation indexes by choosing the threshold, denoted as T_MaxMCC, that maximises the MCC of the predictions. Similarly, for convenience, the evaluation results obtained under T_MaxMCC will be termed MaxMCC Evaluation.
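Both threshold selection strategies can be sketched with a simple scan over candidate thresholds. Using the observed scores as the candidate set, treating a score strictly above the threshold as a positive prediction, and the function names are assumptions of this sketch; the toy scores and labels are not from any dataset in this study.

```python
import math

def confusion(scores, labels, t):
    """Confusion counts when residues with score > t are predicted as binding."""
    tp = sum(1 for s, y in zip(scores, labels) if s > t and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s > t and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s <= t and y == 1)
    tn = sum(1 for s, y in zip(scores, labels) if s <= t and y == 0)
    return tp, fp, tn, fn

def sn_sp_mcc(tp, fp, tn, fn):
    sn = tp / (tp + fn)
    sp = tn / (tn + fp)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return sn, sp, ((tp * tn - fp * fn) / denom if denom else 0.0)

def select_thresholds(scores, labels):
    """Return (T_Balance, T_MaxMCC) chosen among the observed score values."""
    candidates = sorted(set(scores))
    stats = {t: sn_sp_mcc(*confusion(scores, labels, t)) for t in candidates}
    t_balance = min(candidates, key=lambda t: abs(stats[t][0] - stats[t][1]))
    t_maxmcc = max(candidates, key=lambda t: stats[t][2])
    return t_balance, t_maxmcc

scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
labels = [0, 0, 0, 1, 0, 1, 1, 1]
t_balance, t_maxmcc = select_thresholds(scores, labels)
print(t_balance, t_maxmcc)
```

The two thresholds generally differ, which is why the results are reported separately under Balanced Evaluation and MaxMCC Evaluation.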
In addition, to evaluate the overall prediction quality of a prediction model, the AUC, which is the area under the Receiver Operating Characteristic (ROC) curve and is threshold-independent, was also taken.
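The AUC can be computed without choosing any threshold by using its rank interpretation: it equals the probability that a randomly chosen binding residue receives a higher score than a randomly chosen non-binding residue, with ties counted as one half. The sketch below uses this pairwise form for clarity; it is quadratic in the number of residues, so it is meant as an illustration rather than an efficient implementation, and the toy data are not results from this study.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9]
labels = [0, 0, 0, 1, 0, 1, 1, 1]
print(auc(scores, labels))
```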
Results and discussion
Inappropriate cross-validation will over-estimate prediction performance
Cross-validation methods are often used to evaluate the performance of a predictor [56, 57]. Previous studies have shown that leave-one-out cross-validation (the jack-knife test) is the most stringent [58–61]. However, leave-one-out cross-validation is time-consuming, especially when a dataset is huge and a complicated prediction algorithm (such as SVM in this study) is used. Additionally, because we need to make a fair comparison of the proposed method with VitaPred, five-fold cross-validation, which was used by Panwar et al. to evaluate VitaPred, was also adopted in this study.
Another critical aspect that should be addressed here is the method of performing five-fold cross-validation. In fact, five-fold cross-validation can be performed at two different levels for the considered protein-vitamin binding residue prediction problem: (i) residue-level cross-validation and (ii) sequence-level cross-validation.
Residue-level five-fold cross-validation is performed as follows: residues in all the training protein sequences are randomly partitioned into five equally sized, disjoint subsets; then, one subset is used for testing, and the remaining four subsets are used for training; this process is continued until all the five subsets of the training dataset are traversed.
For sequence-level five-fold cross-validation, training protein sequences, rather than training residues, are randomly partitioned into five equally sized, disjoint subsets; then, residues in one subset are used for testing, and residues in the remaining four subsets are used for training; this practice is continued until all the five subsets of the training dataset are traversed.
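The difference between the two partitioning levels can be made concrete with a small sketch. The helper name, the fixed random seed, and the toy dataset of six 10-residue sequences are assumptions for illustration; the point is only that residue-level folds place residues of the same sequence on both sides of the train/test split, whereas sequence-level folds never do.

```python
import random

def k_folds(items, k=5, seed=0):
    """Shuffle `items` and deal them into k disjoint, near-equal folds."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]

# Toy dataset: 6 hypothetical sequences of 10 residues each.
lengths = {f"seq{n}": 10 for n in range(6)}
residues = [(sid, pos) for sid, n in lengths.items() for pos in range(n)]

# Sequence-level: partition sequence IDs; all residues follow their sequence.
seq_folds = k_folds(lengths)
seq_test = set(seq_folds[0])
seq_train = {sid for fold in seq_folds[1:] for sid in fold}

# Residue-level: partition individual residues directly.
res_folds = k_folds(residues)
res_test = {sid for sid, _ in res_folds[0]}
res_train = {sid for fold in res_folds[1:] for sid, _ in fold}

print("sequence-level shared sequences:", len(seq_test & seq_train))
print("residue-level shared sequences:", len(res_test & res_train))
```

In this toy setup each residue-level fold holds 12 residues, which cannot consist solely of complete 10-residue sequences, so at least one sequence always contributes residues to both the test and training sides.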
Panwar et al. used residue-level cross-validation to evaluate the performance of VitaPred. However, we believe that residue-level cross-validation tends to over-estimate the performance of a prediction model and is therefore inappropriate. We empirically demonstrate this argument as follows:
Note that to objectively and fairly compare our results with those obtained by VitaPred , the same machine-learning model (i.e., SVM) and the same feature representation (i.e., 340-D PSSM feature) were also used in our experiments.
For each of the four benchmark datasets (i.e., DVI, DVAI, DVBI, and DPLPI), we performed residue- and sequence-level five-fold cross-validations. Tables 2 and 3 summarise the performance comparisons between residue- and sequence-level five-fold cross-validations on the four benchmark datasets under Balanced Evaluation and MaxMCC Evaluation, respectively. Note that in Tables 2 and 3, SVM-R and SVM-S denote the results obtained by residue- and sequence-level cross-validations, respectively.
In Table 2, it can be observed that the values of Sn, Sp, Acc, MCC, and AUC for SVM-R are consistently superior to those for SVM-S throughout the four benchmark datasets. Taking MCC as an example, SVM-R clearly outperforms SVM-S, and an average improvement of 2% was observed on the four benchmark datasets. Similar results can also be observed in Table 3. From the comparison results between SVM-R and SVM-S on the four considered datasets listed in Tables 2 and 3, we empirically demonstrated that the residue-level cross-validation does over-estimate the performance of a prediction model. We speculate that the main reason for this over-estimation is that during the residue-level cross-validation, some testing residues and training residues may originate from the same protein sequence and thus have much higher homology, which will lead to a better prediction performance.
On the other hand, SVM-R is in fact a re-implementation of VitaPred  because the prediction model and feature representation used in SVM-R are exactly the same as those used in VitaPred. By revisiting Tables 2 and 3, we can find that the AUC values for SVM-R on the four benchmark datasets are 0.87, 0.85, 0.90, and 0.97, which are equal to that for VitaPred with only one minor exception (i.e., 0.85 and 0.83 for SVM-R and VitaPred, respectively, in the DVAI dataset), showing that SVM-R and VitaPred achieved almost equal overall prediction quality.
However, several abnormal phenomena are observed when comparing the other four indexes (i.e., Sn, Sp, Acc, and MCC) between SVM-R and VitaPred. Using the results in the DPLPI dataset under Balanced Evaluation as an example (refer to Table 2), the values of Sn, Sp, and Acc for VitaPred are 90.20%, 92.61%, and 92.40%, respectively, which are lower than those for SVM-R (i.e., 91.48%, 93.38%, and 93.30%, respectively); however, the value of MCC for VitaPred is, unexpectedly, approximately 12% higher than that for SVM-R. A similar phenomenon can also be observed in the DVAI dataset. According to the definitions of Sn, Sp, Acc, and MCC and the relationships between them, this phenomenon should not appear. Based on the comparison between SVM-R and VitaPred, we speculate that the MCC reported for VitaPred has been over-estimated or over-optimised, which will be further demonstrated in the subsequent independent validation test.
Considering that the residue-level cross-validation will over-estimate the performance of a prediction model, together with the fact that VitaPred may possibly have over-estimated the MCCs on benchmark datasets, we will take SVM-S (i.e., a re-implementation of VitaPred) over sequence-level cross-validation evaluation, rather than VitaPred itself, as the baseline predictor to demonstrate the improvements in our proposed methods in the subsequent experiments.
Improving prediction performance by combining the features of PSSM, PSSs, and VBPs
In this section, we will demonstrate that the performance of protein-vitamin interaction prediction can be further improved by combining the features of PSSM, PSSs, and VBPs. The features of 340-D PSSM, 51-D PSSs, and 17-D VBPs are serially combined to form a 408-D discriminative feature, denoted as PSSM + PSSs + VBPs. We then evaluated the SVM-S with the PSSM + PSSs + VBPs feature as a model input on each of the four benchmark datasets over five-fold sequence-level cross-validation. Tables 4 and 5 summarise the performance comparisons between the PSSM + PSSs + VBPs and PSSM features under Balanced Evaluation and MaxMCC Evaluation, respectively. From Tables 4 and 5, we can see that the prediction performances are indeed improved on all four benchmark datasets after incorporating the PSSs and VBPs features into the PSSM feature under both Balanced Evaluation and MaxMCC Evaluation. Taking MCC, which is the overall measurement of the quality of the binary predictions, as an example, an average improvement of 1.5% was observed under both Balanced Evaluation and MaxMCC Evaluation. In terms of the AUC, which measures the overall prediction quality of a prediction model, an average improvement of 1% was also observed. Figure 2 illustrates the ROC curves of the predictions with PSSM and PSSM + PSSs + VBPs features, respectively, on the DVI dataset over sequence-level five-fold cross-validation.
We have also provided performance comparisons of different feature combinations in Additional file 2: Table S1.
Ensembling heterogeneous SVMs helps to improve the prediction performance
In this section, we will empirically demonstrate that the prediction performance of protein-vitamin interactions can be further improved by ensembling multiple heterogeneous SVMs. More specifically, we adopted three feature selection methods (i.e., JLFWL, Fisher score, and Laplacian score) to select three different feature subsets from the original PSSM + PSSs + VBPs feature space; then, we trained three heterogeneous SVMs on the selected feature subsets. The final prediction was performed by averaging the outputs of the three trained SVMs. For comparison, we also directly trained an SVM with the original PSSM + PSSs + VBPs feature, denoted as no ensemble. Table 6 summarises the performance comparisons between ensemble and no ensemble on all four considered datasets over sequence-level five-fold cross-validation under Balanced Evaluation. Figure 3 illustrates the ROC curves of the predictions with ensemble and no ensemble, respectively, on the DVI dataset over sequence-level five-fold cross-validation.
In Table 6, it can be observed that the values of the five evaluation indexes (i.e., Sn, Sp, Acc, MCC, and AUC) of the prediction under ensemble are consistently superior to those of the prediction under no ensemble, with only two exceptions: Sn in the DVI and DPLPI datasets. The prediction under ensemble achieved average improvements of approximately 2% and 1% for MCC and AUC, respectively. The results demonstrate that the heterogeneous SVMs trained with different feature subsets can complement each other, which accounts for the improvements in the prediction performance.
Comparison with existing protein-vitamin interaction predictors
In this section, we will compare the proposed method, called TargetVita, with existing predictors for protein-vitamin prediction. Note that in TargetVita, the PSSM + PSSs + VBPs feature was used as the input feature, and the SVM ensemble based on feature selection was applied. To the best of our knowledge, VitaPred  is the only predictor that was specifically designed for protein-vitamin prediction; thus, we compare the proposed TargetVita with VitaPred under both a cross-validation test and an independent validation test.
However, the cross-validation performance of VitaPred may have been overestimated because it was evaluated by performing residue-level cross-validation. In view of this, SVM-S, which is a re-implementation of VitaPred that was evaluated over sequence-level cross-validation, was taken to compare with the proposed TargetVita when performing the cross-validation test. As for the independent validation test, TargetVita will be compared with both VitaPred and the SVM-S.
A. Cross-validation test
Tables 7 and 8 summarise the performance comparisons between TargetVita and SVM-S on the four datasets over five-fold sequence-level cross-validation under Balanced Evaluation and MaxMCC Evaluation, respectively. From Table 7, we can see that the values of the five evaluation indexes for TargetVita are consistently superior to those of SVM-S throughout the four benchmark datasets. Taking MCC and AUC as examples, which are the two indexes measuring the overall prediction performance of a predictor, TargetVita clearly outperforms SVM-S, and average improvements of 3.5% and 2% were observed, respectively. Under MaxMCC Evaluation (refer to Table 8), a similar phenomenon can also be observed, with only minor exceptions on Sn.
B. Independent validation test
Cross-validation comparisons alone are often not convincing evidence that a newly developed method outperforms an existing one, because the new method may be over-fitted and/or over-optimised to the underlying dataset in pursuit of favourable comparison results [21, 62, 63]. Validation on fresh independent data is therefore regarded as an important and necessary procedure when comparing different methods, and it has been widely applied in related research.
In view of this, we also performed independent validation tests to further demonstrate the superiority of the proposed TargetVita over existing protein-vitamin predictors. Tables 9 and 10 summarise the performance comparisons between TargetVita and existing predictors on independent validation tests under Balanced Evaluation and MaxMCC Evaluation, respectively. From Tables 9 and 10, two observations can be made:
First, the proposed TargetVita significantly outperforms VitaPred under both Balanced Evaluation and MaxMCC Evaluation. Taking MCC as an example, TargetVita achieved improvements of approximately 9% to 11% and 6% to 20% on the four independent validation datasets under Balanced Evaluation and MaxMCC Evaluation, respectively. In addition, TargetVita also outperformed SVM-S, the re-implementation of VitaPred and the second-best performer, with an average MCC improvement of approximately 2.5%.
Second, for VitaPred, the MCC values obtained in the cross-validation test and in the independent validation test differ considerably under both Balanced Evaluation and MaxMCC Evaluation, whereas for TargetVita and SVM-S they do not. In other words, VitaPred performs significantly better in the cross-validation test than in the independent validation test, while TargetVita and SVM-S achieve similar performance in both tests.
Taking the results under MaxMCC Evaluation as an example, the values of MCC over the cross-validation test for VitaPred are 0.53, 0.48, 0.61, and 0.81, respectively (refer to Table 3), while the values of MCC over the independent validation test for VitaPred are 0.38, 0.37, 0.35, and 0.63, respectively (refer to Table 10), for the four considered vitamins. Then, we can calculate that the MCC differences between the cross-validation test and the independent validation test of VitaPred for the four considered vitamins are 0.15, 0.11, 0.26, and 0.18, respectively. By revisiting Table 8, together with Table 10, we calculate that the MCC differences between the cross-validation test and the independent validation test of the proposed TargetVita for the four considered vitamins are only 0.03, 0.01, 0.05, and 0.05, respectively. Similarly, the MCC differences of SVM-S can also be calculated from Table 3 and Table 10. Figure 4 illustrates the differences between the MCC values over the cross-validation test and the independent validation test for VitaPred, SVM-S, and TargetVita on the four considered vitamins under MaxMCC Evaluation. Note that in Figure 4, d_V, d_S, and d_T denote the MCC differences for VitaPred, SVM-S, and TargetVita, respectively, and only the MCC differences on the DPLPI dataset are explicitly labelled.
Figure 4 shows that the proposed TargetVita and SVM-S achieve similar performances (in terms of MCC) over both the cross-validation test and the independent validation test, as their MCC differences are small. In contrast, the performance of VitaPred over the independent validation test is significantly lower than that over the cross-validation test for all four considered vitamins. This indicates that the performance of VitaPred over the cross-validation test has potentially been over-estimated or over-optimised, which leads to a lower generalisation capability (i.e., poor performance with independent fresh data). This observation further supports the speculation (i.e., the MCC of VitaPred has potentially been over-estimated) we made in a previous section.
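The MCC differences discussed above can be reproduced with a few lines of arithmetic. The sketch below uses only the MaxMCC values quoted in the text for VitaPred and the differences quoted for TargetVita, in the order given for the four vitamins; the variable names are illustrative, and the mean gaps are derived here rather than reported in the paper.

```python
# MCC values for VitaPred under MaxMCC Evaluation, as quoted in the text.
vitapred_cv = [0.53, 0.48, 0.61, 0.81]           # cross-validation (Table 3)
vitapred_independent = [0.38, 0.37, 0.35, 0.63]  # independent validation (Table 10)

# Per-vitamin gap between cross-validation and independent validation.
vitapred_gap = [round(cv - ind, 2)
                for cv, ind in zip(vitapred_cv, vitapred_independent)]

# Gaps quoted in the text for TargetVita (from Tables 8 and 10).
targetvita_gap = [0.03, 0.01, 0.05, 0.05]

# Mean gaps, derived here for illustration only.
mean_vitapred_gap = round(sum(vitapred_gap) / len(vitapred_gap), 3)
mean_targetvita_gap = round(sum(targetvita_gap) / len(targetvita_gap), 3)

print(vitapred_gap)
print(mean_vitapred_gap, mean_targetvita_gap)
```

The per-vitamin gaps for VitaPred match the 0.15, 0.11, 0.26, and 0.18 stated above.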
C. Performance on a non-vitamin binding dataset
We then compared the proposed TargetVita with VitaPred on the non-vitamin binding dataset NVD; the results are listed in Additional file 2: Table S2. Note that the results of VitaPred and TargetVita were obtained by feeding the 6676 sequences to their corresponding web servers with default threshold settings.
From Table S2, we can clearly see that the proposed TargetVita achieved much better prediction performance than VitaPred. Among the 1852390 residues in the 6676 non-vitamin binding sequences, 46319 residues were mistakenly predicted as binding residues by VitaPred, while only 36361 false positives were obtained by TargetVita.
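Because every sequence in NVD is non-vitamin binding, each predicted binding residue is a false positive, so the counts above translate directly into false positive rates. The short calculation below is a sketch using only the figures quoted in the text; the variable names are illustrative, and the rates themselves are derived here rather than reported in the paper.

```python
# Counts quoted in the text for the NVD dataset.
total_residues = 1852390
false_positives_vitapred = 46319
false_positives_targetvita = 36361

# False positive rate = false positives / all (negative) residues.
fpr_vitapred = false_positives_vitapred / total_residues
fpr_targetvita = false_positives_targetvita / total_residues

# Absolute and relative reduction in false positives.
fewer_false_positives = false_positives_vitapred - false_positives_targetvita
relative_reduction = fewer_false_positives / false_positives_vitapred

print(f"VitaPred FPR:   {fpr_vitapred:.2%}")
print(f"TargetVita FPR: {fpr_targetvita:.2%}")
print(f"Fewer false positives: {fewer_false_positives} ({relative_reduction:.1%})")
```

Under these figures, the false positive rate drops from about 2.50% to about 1.96%, roughly a fifth fewer false positives.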
In this study, we have designed and implemented a new sequence-based predictor, called TargetVita, for protein-vitamin binding residue prediction. TargetVita performs prediction by utilising multiple features derived from protein sequences and effectively ensembling heterogeneous SVMs trained on different feature subspaces. Experimental results on benchmark datasets demonstrated that the proposed TargetVita can achieve good performance and is superior to existing protein-vitamin binding residue predictors.
Our future work will focus on further improving the prediction performance of TargetVita by uncovering new effective feature sources and applying more powerful machine-learning algorithms.
Availability of supporting data
The data sets supporting the results of this article are included within Additional file 1.
Knowles J, Gromo G: A guide to drug discovery: target selection in drug discovery. Nat Rev Drug Discov. 2003, 2 (1): 63-69. 10.1038/nrd986.
Gao M, Skolnick J: The distribution of ligand-binding pockets around protein-protein interfaces suggests a general mechanism for pocket formation. Proc Natl Acad Sci U S A. 2012, 109 (10): 3784-3789. 10.1073/pnas.1117768109.
De Las Rivas J, Fontanillo C: Protein-protein interactions essentials: key concepts to building and analyzing interactome networks. PLoS Comput Biol. 2010, 6 (6): e1000807-10.1371/journal.pcbi.1000807.
Casari G, Sander C, Valencia A: A method to predict functional residues in proteins. Nat Struct Biol. 1995, 2 (2): 171-178. 10.1038/nsb0295-171.
Peterson LE, Chen XW: Machine learning in biomedicine and bioinformatics. Int J Data Min Bioinform. 2009, 3 (4): 363-364.
Inza I, Calvo B, Armananzas R, Bengoetxea E, Larranaga P, Lozano JA: Machine learning: an indispensable tool in bioinformatics. Methods Mol Biol. 2010, 593: 25-48. 10.1007/978-1-60327-194-3_2.
Larranaga P, Calvo B, Santana R, Bielza C, Galdiano J, Inza I, Lozano JA, Armananzas R, Santafe G, Perez A, Robles V: Machine learning in bioinformatics. Brief Bioinform. 2006, 7 (1): 86-112. 10.1093/bib/bbk007.
Alberts B: Molecular biology of the cell. 2008, New York: Garland Science, 5th edition
Kokubo H, Tanaka T, Okamoto Y: Ab initio prediction of protein-ligand binding structures by replica-exchange umbrella sampling simulations. J Comput Chem. 2011, 32 (13): 2810-2821. 10.1002/jcc.21860.
Sousa SF, Fernandes PA, Ramos MJ: Protein-ligand docking: current status and future challenges. Proteins. 2006, 65 (1): 15-26. 10.1002/prot.21082.
Hendlich M, Rippmann F, Barnickel G: LIGSITE: automatic and efficient detection of potential small molecule-binding sites in proteins. J Mol Graph Model. 1997, 15 (6): 359-363. 10.1016/S1093-3263(98)00002-3.
Dundas J, Ouyang Z, Tseng J, Binkowski A, Turpaz Y, Liang J: CASTp: computed atlas of surface topography of proteins with structural and topographical mapping of functionally annotated residues. Nucleic Acids Res. 2006, 34 (Web Server issue): W116-W118.
Laskowski RA: SURFNET: a program for visualizing molecular surfaces, cavities, and intermolecular interactions. J Mol Graph. 1995, 13 (5): 323-330. 10.1016/0263-7855(95)00073-9.
Levitt DG, Banaszak LJ: POCKET: a computer graphics method for identifying and displaying protein cavities and their surrounding amino acids. J Mol Graph. 1992, 10 (4): 229-234. 10.1016/0263-7855(92)80074-N.
Le Guilloux V, Schmidtke P, Tuffery P: Fpocket: an open source platform for ligand pocket detection. BMC Bioinform. 2009, 10: 168-10.1186/1471-2105-10-168.
Laurie AT, Jackson RM: Q-SiteFinder: an energy-based method for the prediction of protein-ligand binding sites. Bioinformatics. 2005, 21 (9): 1908-1916. 10.1093/bioinformatics/bti315.
Hernandez M, Ghersi D, Sanchez R: SITEHOUND-web: a server for ligand binding site identification in protein structures. Nucleic Acids Res. 2009, 37 (Web Server issue): W413-W416.
Henrich S, Salo-Ahen OM, Huang B, Rippmann FF, Cruciani G, Wade RC: Computational approaches to identifying and characterizing protein binding sites for ligand design. J Mol Recognit. 2010, 23 (2): 209-219.
Gromiha MM: Development of RNA stiffness parameters and analysis on protein-RNA binding specificity: comparison with DNA. Curr Bioinform. 2012, 7 (2): 173-179. 10.2174/157489312800604408.
Chen K, Mizianty MJ, Kurgan L: Prediction and analysis of nucleotide-binding residues using sequence and sequence-derived structural descriptors. Bioinformatics. 2012, 28 (3): 331-341. 10.1093/bioinformatics/btr657.
Yu D, Hu J, Yang J, Shen H, Tang J, Yang J: Designing template-free predictor for targeting protein-ligand binding sites with classifier ensemble and spatial clustering. IEEE/ACM Trans Comput Biol Bioinform. 2013, 10 (4): 994-1008.
Brylinski M, Skolnick J: FINDSITE-metal: integrating evolutionary information and machine learning for structure-based metal-binding site prediction at the proteome level. Proteins. 2011, 79 (3): 735-751. 10.1002/prot.22913.
Babor M, Gerzon S, Raveh B, Sobolev V, Edelman M: Prediction of transition metal-binding sites from apo protein structures. Proteins. 2008, 70 (1): 208-217.
Si J, Zhang Z, Lin B, Schroeder M, Huang B: MetaDBSite: a meta approach to improve protein DNA-binding sites prediction. BMC Syst Biol. 2011, 5 (Suppl 1): S7-10.1186/1752-0509-5-S1-S7.
Ma X, Guo J, Liu HD, Xie JM, Sun X: Sequence-based prediction of DNA-binding residues in proteins with conservation and correlation information. IEEE/ACM Trans Comput Biol Bioinform. 2012, 9 (6): 1766-1775.
Schmidtke P, Barril X: Understanding and predicting druggability. A high-throughput method for detection of drug binding sites. J Med Chem. 2010, 53 (15): 5858-5867. 10.1021/jm100574m.
Rebeille F, Ravanel S, Marquet A, Mendel RR, Webb ME, Smith AG, Warren MJ: Roles of vitamins B5, B8, B9, B12 and molybdenum cofactor at cellular and organismal levels. Nat Prod Rep. 2007, 24 (5): 949-962. 10.1039/b703104c.
Webb ME, Marquet A, Mendel RR, Rebeille F, Smith AG: Elucidating biosynthetic pathways for vitamins and cofactors. Nat Prod Rep. 2007, 24 (5): 988-1008. 10.1039/b703105j.
Dick T, Manjunatha U, Kappes B, Gengenbacher M: Vitamin B6 biosynthesis is essential for survival and virulence of Mycobacterium tuberculosis. Mol Microbiol. 2010, 78 (4): 980-988. 10.1111/j.1365-2958.2010.07381.x.
Panwar B, Gupta S, Raghava GPS: Prediction of vitamin interacting residues in a vitamin binding protein using evolutionary information. BMC Bioinform. 2013, 14 (1): 44-10.1186/1471-2105-14-44.
Bauer RA, Günther S, Jansen D, Heeger C, Thaben PF, Preissner R: SuperSite: dictionary of metabolite and drug binding sites in proteins. Nucleic Acids Res. 2009, 37 (suppl 1): D195-D200.
Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE: The protein data bank. Nucleic Acids Res. 2000, 28 (1): 235-242. 10.1093/nar/28.1.235.
Sobolev V, Sorokine A, Prilusky J, Abola EE, Edelman M: Automated analysis of interatomic contacts in proteins. Bioinformatics. 1999, 15 (4): 327-332. 10.1093/bioinformatics/15.4.327.
Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997, 25 (17): 3389-3402. 10.1093/nar/25.17.3389.
Yang J, Roy A, Zhang Y: BioLiP: a semi-manually curated database for biologically relevant ligand-protein interactions. Nucleic Acids Res. 2013, 41 (D1): D1096-D1103. 10.1093/nar/gks966.
Li W, Godzik A: Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics. 2006, 22 (13): 1658-1659. 10.1093/bioinformatics/btl158.
Yu DJ, Hu J, Huang Y, Shen HB, Qi Y, Tang ZM, Yang JY: TargetATPsite: a template-free method for ATP-binding sites prediction with residue evolution image sparse representation and classifier ensemble. J Comput Chem. 2013, 34 (11): 974-985. 10.1002/jcc.23219.
Yu DJ, Hu J, Tang ZM, Shen HB, Yang J, Yang JY: Improving protein-ATP binding residues prediction by boosting SVMs with random under-sampling. Neurocomputing. 2013, 104: 180-190.
Firoz A, Malik A, Joplin KH, Ahmad Z, Jha V, Ahmad S: Residue propensities, discrimination and binding site prediction of adenine and guanine phosphates. BMC Biochem. 2011, 12: 20-10.1186/1471-2091-12-20.
Chen K, Mizianty MJ, Kurgan L: ATPsite: sequence-based prediction of ATP-binding residues. Proteome Sci. 2011, 9 (Suppl 1): S4-10.1186/1477-5956-9-S1-S4.
Schaffer AA: Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements. Nucleic Acids Res. 2001, 29: 2994-3005. 10.1093/nar/29.14.2994.
Zhang YN, Yu DJ, Li SS, Fan YX, Huang Y, Shen HB: Predicting protein-ATP binding sites from primary sequence through fusing bi-profile sampling of multi-view features. BMC Bioinform. 2012, 13: 118-10.1186/1471-2105-13-118.
Jones DT: Protein secondary structure prediction based on position-specific scoring matrices. J Mol Biol. 1999, 292 (2): 195-202. 10.1006/jmbi.1999.3091.
Saeys Y, Inza I, Larranaga P: A review of feature selection techniques in bioinformatics. Bioinformatics. 2007, 23 (19): 2507-2517. 10.1093/bioinformatics/btm344.
Tadesse MG, Vannucci M, Liò P: Identification of DNA regulatory motifs using Bayesian variable selection. Bioinformatics. 2004, 20 (16): 2553-2561. 10.1093/bioinformatics/bth282.
Saeys Y, Degroeve S, Aeyels D, Rouze P, Van de Peer Y: Feature selection for splice site prediction: a new method using EDA-based feature ranking. BMC Bioinform. 2004, 5 (1): 64-10.1186/1471-2105-5-64.
Statnikov A, Aliferis CF, Tsamardinos I, Hardin D, Levy S: A comprehensive evaluation of multicategory classification methods for microarray gene expression cancer diagnosis. Bioinformatics. 2005, 21 (5): 631-643. 10.1093/bioinformatics/bti033.
Gevaert O, Smet FD, Timmerman D, Moreau Y, Moor BD: Predicting the prognosis of breast cancer by integrating clinical and microarray data with Bayesian networks. Bioinformatics. 2006, 22 (14): e184-e190. 10.1093/bioinformatics/btl230.
Zhu L, Yang J, Song JN, Chou KC, Shen HB: Improving the accuracy of predicting disulfide connectivity by feature selection. J Comput Chem. 2010, 31 (7): 1478-1485.
Duda RO, Hart PE, Stork DG: Pattern classification. 2001, New York, USA: John Wiley & Sons, 2nd edition
He X, Cai D, Niyogi P: Laplacian score for feature selection. Advances in neural information processing systems. 2005, Vancouver, British Columbia, Canada: MIT Press
Yan H, Yang J: Joint Laplacian feature weights learning. Pattern Recogn. 2014, 47 (3): 1425-1432. 10.1016/j.patcog.2013.09.038.
Vapnik VN: Statistical learning theory. 1998, New York: Wiley-Interscience
Fan RE, Chen PH, Lin CJ: Working set selection using second order information for training SVM. J Mach Learn Res. 2005, 6: 1889-1918.
Marti-Renom MA, Stuart AC, Fiser A, Sanchez R, Melo F, Sali A: Comparative protein structure modeling of genes and genomes. Annu Rev Biophys Biomol Struct. 2000, 29: 291-325. 10.1146/annurev.biophys.29.1.291.
Chou KC, Zhang CT: Prediction of protein structural classes. Crit Rev Biochem Mol Biol. 1995, 30 (4): 275-349. 10.3109/10409239509083488.
Yu D, Wu X, Shen H, Yang J, Tang Z, Qi Y: Enhancing membrane protein subcellular localization prediction by parallel fusion of multi-view features. IEEE Trans Nanobiosci. 2012, 11 (4): 375-385.
Chou KC, Shen HB: Cell-PLoc 2.0: an improved package of web-servers for predicting subcellular localization of proteins in various organisms. Nat Sci. 2010, 2 (10): 1090-1103.
Chou KC: Some remarks on protein attribute prediction and pseudo amino acid composition. J Theor Biol. 2011, 273 (1): 236-247. 10.1016/j.jtbi.2010.12.024.
Huang T, Chen L, Cai YD, Chou KC: Classification and analysis of regulatory pathways using graph property, biochemical and physicochemical property, and functional property. PLoS One. 2011, 6 (9): e25297-10.1371/journal.pone.0025297.
Chou KC, Shen HB: Cell-PLoc: a package of Web servers for predicting subcellular localization of proteins in various organisms. Nat Protoc. 2008, 3 (2): 153-162. 10.1038/nprot.2007.494.
Boulesteix AL: Over-optimism in bioinformatics research. Bioinformatics. 2010, 26 (3): 437-439. 10.1093/bioinformatics/btp648.
Sterling TD: Publication decisions and their possible effects on inferences drawn from tests of significance - or vice versa. J Am Stat Assoc. 1959, 54 (285): 30-34.
This work was supported by the National Natural Science Foundation of China (No. 61373062, 61175024, 61222306, 61100116, and 61233011), the Natural Science Foundation of Jiangsu (No. BK20141403), China Postdoctoral Science Foundation (No. 2013M530260, 2014T70526), "The Six Top Talents" of Jiangsu Province (No. 2013-XXRJ-022), and the Fundamental Research Funds for the Central Universities (No. 30920130111010).
The authors declare that they have no competing interests.
Conception and design: DJY, JH, HY, XBY, HBS, JYY; Acquisition of data: JH, DJY, HBS; Analysis of data: DJY, JH, YH, XBY, HBS, JYY; Writing and revising the manuscript: DJY, JH, HY, XBY, HBS. All authors read and approved the final manuscript.
Electronic supplementary material
Additional file 2: Table S1. Performance comparisons of different feature combinations over 5-fold sequence-level cross-validation under MaxMCC Evaluation. Table S2. Performance comparisons between the proposed TargetVita and VitaPred on the non-vitamin binding dataset NVD. (PDF 165 KB)
Cite this article
Yu, DJ., Hu, J., Yan, H. et al. Enhancing protein-vitamin binding residues prediction by multiple heterogeneous subspace SVMs ensemble. BMC Bioinformatics 15, 297 (2014). https://doi.org/10.1186/1471-2105-15-297