Prediction of cis/trans isomerization in proteins using PSI-BLAST profiles and secondary structure information
© Song et al; licensee BioMed Central Ltd. 2006
Received: 13 December 2005
Accepted: 09 March 2006
Published: 09 March 2006
The majority of peptide bonds in proteins occur in the trans conformation. However, a considerable fraction of prolyl peptide bonds adopt the cis form. Proline cis/trans isomerization is known to play a critical role in protein folding, splicing, cell signaling and transmembrane active transport. Accurate prediction of proline cis/trans isomerization in proteins would have many important applications towards the understanding of protein structure and function.
In this paper, we propose a new approach to predict proline cis/trans isomerization in proteins using a support vector machine (SVM). Preliminary results indicated that Radial Basis Function (RBF) kernels led to better prediction performance than polynomial and linear kernel functions. We used single sequence information of different local window sizes, amino acid compositions of different local sequences, multiple sequence alignments obtained from PSI-BLAST and the secondary structure information predicted by PSIPRED. We explored these different sequence encoding schemes in order to investigate their effects on the prediction performance. The training and testing of this approach were performed on a newly enlarged dataset of 2424 non-homologous proteins determined by X-ray diffraction, using 5-fold cross-validation. A window size of 11 provided the best performance for determining proline cis/trans isomerization from the single amino acid sequence. Using multiple sequence alignments in the form of PSI-BLAST profiles significantly improved the prediction performance: the prediction accuracy increased from 62.8% with single sequence to 69.8%, and the Matthews Correlation Coefficient (MCC) improved from 0.26 with single local sequence to 0.40. Furthermore, when coupled with the secondary structure information predicted by PSIPRED, our method yielded a prediction accuracy of 71.5% and an MCC of 0.43, which are 9% and 0.17 higher, respectively, than the values achieved with single sequence information.
A new method has been developed to predict the proline cis/trans isomerization in proteins based on support vector machine, which used the single amino acid sequence with different local window sizes, the amino acid compositions of local sequence flanking centered proline residues, the position-specific scoring matrices (PSSMs) extracted by PSI-BLAST and the predicted secondary structures generated by PSIPRED. The successful application of SVM approach in this study reinforced that SVM is a powerful tool in predicting proline cis/trans isomerization in proteins and biological sequence analysis.
It is well known that planar peptide bonds occur predominantly in the trans conformation; cis peptide bonds occur rarely in proteins because an energy barrier of approximately 20 kcal/mol separates the trans and cis conformations. However, in the case of the Xaa-Pro peptide bond (whose interconversion is also called peptidyl prolyl isomerization, where Xaa is any amino acid), the difference in energy between the trans and cis isomers is only 0.5 kcal/mol, and the energy barrier is about 13 kcal/mol. Thus a considerable proportion (about 4–5%) of Xaa-Pro peptide bonds adopt the cis conformation, while only 0.03–0.05% of Xaa-nonPro bonds occur in the cis form [2–4].
In recent years, an increasing number of protein structures have been determined that exhibit conformational heterogeneity of one or more prolyl peptide bonds. Proline cis peptide bonds bear great biological significance in protein structure and function. The importance of proline cis/trans isomerization as a rate-limiting step in protein folding has been well characterized [6–8]; for example, it has been suggested to dominate the folding of the alpha subunit of trp synthase in E. coli. The isomerization of Xaa-Pro peptide bonds can be catalyzed and accelerated by the so-called peptidyl prolyl cis/trans isomerases, which are involved in cell signaling and cell replication and are implicated in the induction of severe diseases such as cancer, AIDS, Alzheimer's disease and other neurodegenerative disorders. In addition, proline isomerization functions as a molecular switch because of its potential ability to control protein activity within the confines of the intrinsic conformational exchange.
Since high-throughput genome sequencing projects are producing large amounts of raw sequence data, fast and accurate prediction methods are in great demand to annotate protein structural and functional properties. In this respect, accurate prediction of proline cis/trans isomerization in proteins would have many important applications in protein structure prediction and rational molecular design. Numerous studies on the correlations of the proline cis/trans population and the prolyl puckering have been reported, based on analyses of different non-redundant datasets of protein X-ray structures [1, 4, 6, 12, 13]. The results indicated that there exists a significant correlation between cis conformation content and the local amino acid sequences adjacent to proline residues.
More recently, Pahlke et al employed different statistical methods, such as Chou-Fasman parameter calculation and occurrence matrices, to analyze the probability of the cis and trans proline conformations and derived patterns for their possible prediction. A recent study on the conservation of cis prolyl bonds showed that cis prolyl residues are more often conserved than trans prolyl ones in evolutionarily related proteins, and that overall protein sequence homology is a stronger indicator of the occurrence of cis prolyl residues than local sequence motifs.
However, most of these studies were merely based on statistical analysis of the residues neighboring the central proline, without further systematic prediction of proline cis/trans isomerization from the primary protein sequence. To the best of our knowledge, the first attempt to predict peptidyl prolyl cis/trans isomerization on the basis of amino acid sequences was made by Frömmel and Preissner. They used six different patterns to correctly assign about 72.7% (176 cis-prolyl residues in their relatively small dataset of 242 Xaa-Pro bonds) of known cis-prolyl residues, by taking into account the neighboring ± 6 residues centered on proline, as well as their physicochemical properties. Later, the support vector machine (SVM) was introduced for this task and achieved 76.7% prediction accuracy in a jack-knife test for the cis proline residues, using the single amino acid sequence information encoded by binary bits (0 and 1) as the input vector. The COPS algorithm was developed to predict cis/trans peptide bond isomerization based on conformation parameters, but this method only took advantage of the secondary structure information of amino acid triplets and did not consider the important amino acid sequence information.
In this paper, we propose a novel method to predict proline cis/trans isomerization based on a support vector machine, which combines the position-specific scoring matrices (PSSMs) extracted from the sequence profiles by PSI-BLAST and the predicted secondary structures generated by the PSIPRED program as the SVM input vector, in addition to the single amino acid sequence information. Our method was evaluated on a well-resolved non-homologous dataset by a 5-fold cross-validation test and achieved an overall prediction accuracy of 71.5% and a Matthews Correlation Coefficient (MCC) of 0.43, a prediction performance comparable with the previously reported results.
Xaa-Pro cis and Xaa-Pro trans peptide bond distribution
Effect of different kernel functions and parameters
The selection of the kernel function parameters is an important step for SVM training and testing, because they implicitly determine the structure of the high dimensional feature space when constructing the Optimal Separating Hyperplane (OSH). Several parameters must be determined in advance to optimize SVM training, such as the regularization parameter C, the γ parameter in the RBF kernel, and the d parameter in polynomial kernel functions. The parameter C is a regularization parameter which controls the trade-off between the margin and the training error.
Prediction accuracy comparison with different kernel functions and parameters. The results were obtained by 5-fold cross-validation. Parameter settings compared: polynomial kernel (α = 1, β = 1, d = 2; α = 1, β = 1, d = 5) and RBF kernel (γ = 0.01, C = 2.0; γ = 0.06, C = 2.0; γ = 0.2, C = 1.0).
The imbalance problem
The imbalance problem occurs when there is a large difference between the numbers of positive and negative samples in the dataset. In this study, the cis and trans prolyl residues are unbalanced (1265 cis prolines versus 27196 trans ones). This problem must be taken into consideration because, if this proportion is used directly, training becomes difficult and the SVM classifier will not converge. In that case, the SVM tends to predict the over-represented class (trans) accurately and to assign the under-represented class (cis) incorrectly.
Usually, there are two ways to overcome the imbalance problem: (1) increasing the size of the under-represented class by randomly resampling the dataset and (2) decreasing the size of the over-represented class by randomly removing its samples. Here, we explored the second one. We set the ratio of positive to negative training samples (the positive-negative-training ratio) at 1:1, since the SVM achieves a better balance of accuracy and coverage under this ratio.
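The second strategy (random removal of over-represented samples) can be sketched as follows. This is a minimal illustration rather than the procedure actually used in this work: the function name `undersample`, the fixed random seed and the placeholder sample identifiers are all assumptions made for the example.

```python
import random

def undersample(positives, negatives, ratio=1.0, seed=0):
    """Randomly keep about ratio * len(positives) negative samples.

    Hypothetical helper: the name, the seed and the rounding rule are
    illustrative assumptions, not details taken from the paper.
    """
    rng = random.Random(seed)
    n_keep = min(len(negatives), int(round(ratio * len(positives))))
    kept_negatives = rng.sample(negatives, n_keep)  # sampling without replacement
    return positives, kept_negatives

# Placeholder identifiers standing in for the cis and trans prolyl residues.
cis = list(range(1265))
trans = list(range(27196))
pos, neg = undersample(cis, trans)
print(len(pos), len(neg))
```

With a ratio of 1.0 the two classes end up the same size, which corresponds to the 1:1 positive-negative-training ratio described above.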
Prediction using single sequence information
The SVM was trained and tested with single sequences encoded as binary bits (0 and 1). In this coding scheme, each amino acid is represented by a 20-dimensional binary vector, e.g. Ala (10000000000000000000), Cys (01000000000000000000), ..., Tyr (00000000000000000001).
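A minimal sketch of this binary coding is given below. The residue ordering (alphabetical by one-letter code) matches the Ala, Cys, ..., Tyr examples above; the all-zero padding for window positions that fall outside the sequence, and the function names, are assumptions made for illustration.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # Ala first, Cys second, ..., Tyr last

def one_hot(residue):
    """Return the 20-dimensional binary vector of one residue."""
    vec = [0] * 20
    idx = AMINO_ACIDS.find(residue)
    if idx >= 0:  # unknown residues stay all-zero (assumption)
        vec[idx] = 1
    return vec

def encode_window(sequence, center, half_width):
    """Concatenate one-hot vectors for a window of 2 * half_width + 1 residues."""
    features = []
    for pos in range(center - half_width, center + half_width + 1):
        if 0 <= pos < len(sequence):
            features.extend(one_hot(sequence[pos]))
        else:
            features.extend([0] * 20)  # assumed padding beyond the termini
    return features

vector = encode_window("MKAPLGT", center=3, half_width=5)
print(len(vector))
```

For a window of 11 residues the input vector has 11 × 20 = 220 components.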
Predictive performance of SVM based on single sequence inputs of different local window sizes. More details for prediction accuracy measurement are given in the Methods section. The results were obtained by 5-fold cross-validation.
Prediction using amino acid composition of local sequence
We also used the amino acid compositions of different window sizes as SVM input, and compared the influence of different window sizes on the prediction performance. In many cases, amino acid compositions have been shown to improve prediction performance to a certain extent. The amino acid composition is calculated by
$$f_i = \frac{n_i}{w}, \quad i = 1, 2, \ldots, 20$$
where n_i is the number of occurrences of amino acid type i in the local sequence window of window size w.
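As an illustration, the composition of a local window can be computed as below. The sketch assumes that each count is divided by the window size w, so that the 20 values sum to 1 for a window of standard residues; the function name is hypothetical.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def composition(window):
    """Return the 20 amino acid frequencies n_i / w for a sequence window."""
    w = len(window)
    if w == 0:
        raise ValueError("window must contain at least one residue")
    counts = Counter(window)
    return [counts[aa] / w for aa in AMINO_ACIDS]

comp = composition("AAPGA")
print(comp[AMINO_ACIDS.index("A")], comp[AMINO_ACIDS.index("P")])
```

Passing the full protein sequence instead of a local window gives the full-length composition discussed in the text.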
Predictive performance of SVM based on amino acid compositions of different local window sizes. More details for prediction accuracy measurement are given in the Methods section. The results were obtained by 5-fold cross-validation.
The prediction performance reached Q2 = 61.6% and MCC = 0.23 at the full length. The relatively high accuracy by using only amino acid compositions of the full sequence length mainly comes from the improvement on the Sensitivity value (as high as 72.6%) despite its low Specificity (44.5%), implying that proline cis/trans isomerization state is also determined by the global sequence information, as well as the local sequence information.
Prediction using multiple sequence alignment and secondary structure information
Comparison of predictive performance of SVM based on different encoding input information. More details for prediction accuracy measurement are given in the Methods section. The results were obtained by 5-fold cross-validation.
It is well known that using multiple sequence alignments rather than a single amino acid sequence can improve prediction accuracy. In order to further improve the prediction performance, we therefore included multiple sequence alignments in the form of PSI-BLAST position-specific scoring matrices (PSSMs) as the SVM input. As expected, including evolutionary information in the form of PSI-BLAST profiles significantly increased the prediction performance: the MCC improved from 0.26 with single local sequence to 0.40. The considerable improvement came from the use of position-specific scoring matrices, which contain relevant information from protein sequences distantly related to the query proteins. Because the PSI-BLAST profiles represent this weighted information as position-specific probabilities, they greatly enhanced the prediction performance.
Recently, Pahlke et al developed a stand-alone algorithm, COPS, to predict the cis and trans conformation of amino acids in proteins. Their algorithm was based on statistical analysis of the so-called conformation parameters, an extension of the Chou-Fasman parameters. COPS derived four rules to predict the cis conformation by taking into consideration the secondary structure of amino acid triplets alone. We therefore wanted to know whether introducing the secondary structure information predicted by PSIPRED as the input to the SVM classifier would be beneficial. As can be seen in Table 4, the overall accuracy Q2 was 63.6% and the MCC value was 0.27 for the secondary structure model ("SS"), which was better than that obtained with local sequence ("LS"). The results indicated that the secondary structure predicted by PSIPRED provides more useful information for the prediction than the local sequence alone.
To further improve the prediction performance, we combined the multiple sequence alignment in the form of PSI-BLAST profiles ("MS") and the predicted secondary structure from PSIPRED ("SS"). Among the five SVM models, "MS+SS" provided the best predictions of proline cis/trans isomerization. For this model, the overall accuracy Q2 was 71.5% and the MCC was 0.43, while the MCC values for "LS", "AA" and "SS" were 0.26, 0.23 and 0.27, respectively. There is also a great improvement in the Sensitivity and Specificity values with the "MS+SS" encoding scheme. The final values of Sensitivity and Specificity are 70.7% and 72.2%, which are 14% and 3.5% higher, respectively, than those obtained with single sequence alone. All these prediction scores indicate that using multiple sequence alignments together with the predicted secondary structure considerably increases the number of true positives and true negatives and decreases the over- and under-predictions.
However, our results also showed that simply combining "AA" with "MS+SS" did not give better prediction performance than "MS+SS" alone (data not shown). A likely reason is that including too many input features not only increases the useful information available to the SVM classifier but also introduces considerable noise.
Comparison with other methods
Comparison of predictive performance with Naïve Bayes, Logistic regression, IBk (K-nearest neighbors) and J48 (decision trees) classifiers. More details for prediction accuracy measurement are given in the Methods section. The results were obtained by 5-fold cross-validation.
The prediction accuracy of SVM is about 12% and 13% higher than that of the Naïve Bayes and Logistic regression classifiers, respectively. The accuracy differences between the SVM classifier and those based on K-nearest neighbors and decision trees are even larger. The same tendency exists for the MCC values. Moreover, the SVM classifier could correctly assign 70.7% of the cis proline residues, 13% more than any other classifier from the Weka package used in this study. In contrast, Naïve Bayes and Logistic regression could recognize only about 61% of the trans proline samples in the dataset, and they also performed poorly on the cis proline ones (less than 60%). Therefore, SVM outperformed the other machine learning techniques in predicting proline cis/trans isomerization on the same dataset.
Comparison of predictive performance with other methods. Recoverable column headers: prediction accuracy (%) and prediction performance evaluation method. Recoverable entries: a dataset of 242 Xaa-Pro bonds, SVM with single sequence, and SVM with PSI-BLAST and PSIPRED.
Based on the statistical analysis of the neighbors (± 6 residues) of proline residues and their physicochemical properties, Frömmel and Preissner found six patterns that could be applied to assign correctly 72.7% (less than 75%) of known cis proline residues. However, it should be pointed out that their result was obtained on a small dataset containing only 242 Xaa-Pro peptide bonds, so the six patterns might not be applicable to larger datasets.
COPS is a stand-alone algorithm that was developed based on the extended Chou-Fasman parameters, i.e. the conformation parameters for each amino acid after considering the correlation between the secondary structure information and the cis/trans conformation. Its prediction is made using four rules, all of which need to be fulfilled; otherwise trans is predicted. As can be seen from Table 6, the prediction accuracy of COPS for cis proline is 63.6% (averaged by 10-fold cross-validation), which is consistent with the result obtained by using SVM based on predicted secondary structure.
Wang et al first introduced the support vector machine to solve this task and achieved an overall accuracy of 69.8% and 76.7%, when measured by the independence and jack-knife tests, respectively. They used the single amino acid sequence information encoded by binary bits (20-dimensional vectors composed of 0 and 1) as the input vector to the SVM. Although their prediction accuracy by jack-knife test was better than that of our method, these results were obtained on a different dataset.
Perhaps we should not attach too much importance to the prediction score here, because it is unfair to compare the different studies using different datasets and accuracy assessment methods. Although different datasets (242 prolyl residues, 2193, 2424 and 8584 proteins) and different prediction performance test methods (self-consistency, jack-knife and n-fold cross-validation) were used, our method achieved a comparable prediction performance, especially after adopting the PSI-BLAST and PSIPRED encoding scheme. Therefore we can conclude that our method was successful in predicting the proline cis/trans isomerization, with the prediction accuracy at a satisfactory level.
CISPEPpred web server
The CISPEPpred web server has been developed for the prediction of proline cis/trans isomerization in proteins by using the method in this work. This server provides two SVM models based, respectively, on the single sequence and on the multiple sequence alignment in the form of PSI-BLAST profiles along with the secondary structure predicted by PSIPRED. For a protein sequence submitted in FASTA format, the server reports the positions of the proline residues in the sequence and their predicted cis/trans isomerization states. Additional information including the introduction, methodology and the PDB chain list used in this study can be found at this website.
Prediction of proline cis/trans isomerization is important in the understanding of protein structure and function. In the present work, we carried out an extensive prediction study of proline cis/trans isomerization using different encoding schemes and developed a novel tool to implement this task based on support vector machines. We investigated the effect of different SVM kernel functions and their corresponding parameters and found that the RBF kernel achieved better prediction performance than the polynomial and linear kernels. Our results indicate that an SVM classifier built on multiple sequence alignments in the form of PSI-BLAST profiles yields better performance: the prediction accuracy improved from 62.8% with single sequence to 69.8%, while the MCC improved from 0.26 with single local sequence to 0.40. This result supports the view that multiple sequence alignments improve prediction performance relative to a single sequence. Moreover, when PSI-BLAST profiles in the form of position-specific scoring matrices were used together with the secondary structures predicted by PSIPRED, the prediction performance was further improved to a Q2 of 71.5% and an MCC of 0.43.
There are three important factors that account for the prediction performance of our method. Firstly, we employed SVM, which is a new machine learning method based on Statistical Learning Theory. SVM has many attractive features, not only in its fast speed and scalability but also in its ability to extract and condense the information contained in the training samples. Secondly, multiple sequence alignments in the form of PSI-BLAST profiles were used. The PSI-BLAST profiles were generated by searching for remote protein homologs in the NCBI non-redundant database and thus contain useful evolutionary information. Thirdly, the secondary structure predicted by PSIPRED was also used. Recent studies indicate that the secondary structure neighboring prolines can be used to predict the cis/trans conformation with good performance [14, 18]. PSIPRED is considered one of the best secondary structure prediction methods. The strategy of using multiple sequence alignments in the form of PSI-BLAST profiles together with secondary structure information predicted by PSIPRED has been successfully applied in the prediction of α-turn and β-turn types in proteins [26, 27].
Further improving the prediction accuracy with only local sequence information remains a difficult and challenging task, because peptidyl prolyl cis/trans isomerization is also determined by the intrinsic flexibility of the cis/trans switch within the proline residues themselves, which increases the prediction difficulty. The prediction performance is also related to global information at the protein level, such as the amino acid composition. Moreover, a recent study suggested that global sequence homology is a strong indicator of the occurrence of cis prolyl residues. The key point is to find accurate descriptors of cis/trans proline residues and to put forward appropriate encoding schemes that serve efficiently as classifier input vectors. However, the unbalanced distribution of cis/trans samples in proteins and the cis/trans conformational switching of proline further increase the difficulty of predicting their states. It should be pointed out that the overall prediction accuracy of cis/trans isomerization is correlated with the ratio between these two classes, so it would be reasonable not to attach much importance to the absolute Q2 values. In this respect, MCC should be considered an equally important measure of classification performance.
Future improvements may be achieved by combining several available methods and incorporating more information to describe the prolyl cis/trans peptide bonds, for example, protein solvent accessibility. Since protein solvent accessibility is an important factor in determining protein structure and function, including this information might enhance the prediction performance. In fact, recent studies indicated that cis proline residues are more frequently found in surface accessible areas than trans prolines. Therefore, further improvement is anticipated from combining non-local structural descriptors of proteins, such as protein structural classes and homologs, with local profiles of proline residues, such as protein solvent accessibility profiles. Future work may focus on this direction and improve the prediction accuracy by constructing such multiple feature vectors.
In this paper, we developed a new method to predict proline cis/trans isomerization in proteins based on a support vector machine. The CISPEPpred web server has been designed to implement this task. The preliminary experiments indicate that RBF kernels lead to better prediction performance than polynomial and linear kernel functions. We proposed several different sequence encoding schemes and compared their resulting prediction performance. The purpose of this study was to find which kind of input information leads to the best prediction result. The prediction accuracies were averaged over 5-fold cross-validation. Using multiple sequence alignments significantly improved the prediction performance: the prediction accuracy increased from 62.8% with single sequence to 69.8% and the MCC from 0.26 to 0.40. Moreover, when coupled with the secondary structure information predicted by PSIPRED, the prediction accuracy was further improved to 71.5% and the MCC to 0.43, which are 9% and 0.17 higher than the values achieved with single sequence information. The successful application of the SVM approach in this study reinforces that SVM is a powerful prediction tool for extracting the relationship between proline cis/trans isomerization and the primary amino acid sequence. We believe that CISPEPpred will be a useful tool for proline cis/trans isomerization prediction and will provide helpful and complementary information in understanding protein structure and function.
In the present study, the dataset comprised 2424 non-homologous protein chains, which were obtained from the Culled PDB list provided by the PISCES server. This list was generated on October 15, 2005. All structures in this dataset were determined by X-ray crystallography with resolution better than 2.0 Å and R-factor less than 0.25. The sequence identity between each pair of sequences was less than 25%. Protein chains shorter than 60 amino acids were excluded from our dataset. Every chain contains at least one proline residue. There are 609182 residues in total in this dataset. The protein chain names can be found in Additional file 1. The detailed information on the proline cis/trans peptide records and the protein sequences of each protein chain can be found in Additional files 2 and 3.
Although the PDB files do contain CISPEP records, we did not extract these records directly because such annotations may contain errors, for example in the bond angles [22, 23]. Instead, we calculated the ω dihedral angle of the CO-NH bond between each proline residue and the preceding amino acid. Bonds with an ω dihedral angle between -30° and +30° were considered cis peptide bonds, whereas bonds with an ω dihedral angle between -180° and -30° or between +30° and +180° were assumed to be trans. According to this definition, we obtained 28461 ω dihedral angles for the Xaa-Pro bonds, which included 1265 cis and 27196 trans prolyl residues.
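The classification rule above can be illustrated with a short sketch. It assumes that ω is measured over the four atoms Cα(i-1), C(i-1), N(i) and Cα(i), where residue i is the proline; this is the usual definition of ω, but the atom choice, the function names and the toy coordinates are assumptions for illustration, and parsing of PDB files is omitted.

```python
import math

def _sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])

def _cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])

def _dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]

def omega_degrees(ca_prev, c_prev, n_pro, ca_pro):
    """Dihedral angle (degrees, in [-180, 180]) defined by four atom positions."""
    b1 = _sub(c_prev, ca_prev)
    b2 = _sub(n_pro, c_prev)
    b3 = _sub(ca_pro, n_pro)
    n1 = _cross(b1, b2)
    n2 = _cross(b2, b3)
    y = math.sqrt(_dot(b2, b2)) * _dot(b1, n2)
    x = _dot(n1, n2)
    return math.degrees(math.atan2(y, x))

def classify_peptide_bond(omega):
    """Apply the -30 to +30 degree window for cis; everything else is trans."""
    return "cis" if -30.0 <= omega <= 30.0 else "trans"

# Toy planar coordinates (not taken from any PDB entry).
cis_like = omega_degrees((0.0, 1.0, 0.0), (0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, 1.0, 0.0))
trans_like = omega_degrees((0.0, 1.0, 0.0), (0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (1.0, -1.0, 0.0))
print(classify_peptide_bond(cis_like), classify_peptide_bond(trans_like))
```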
Sequence profiles generated by PSI-BLAST
We used a sliding window method to describe the neighboring sequence environments of proline residues, with local window length 2l + 1. The local window was centered on the proline residue and the preceding amino acid. Evolutionary information in the form of multiple sequence alignment profiles generated by the PSI-BLAST program was included in this window as the input information. The idea of adopting the intermediate PSI-BLAST generated position-specific scoring matrix (PSSM) as direct input was first proposed by Jones. This method is now widely used in protein secondary structure prediction [24–27], subcellular localization prediction, disulfide connectivity prediction, solvent accessibility prediction, protein-protein binding site prediction, DNA binding site prediction, protein B-factor profile prediction, as well as protein contact number prediction. Including evolutionary information in the form of PSI-BLAST profiles has been shown to improve the prediction accuracy by about 3–5% in these problems.
Here, we applied this method as the first use of PSSMs in proline cis/trans isomerization prediction. Firstly, we obtained the NCBI nr database, which combines all non-redundant GenBank translations, SwissProt, PIR, PDB, PRF, and the NCBI RefSeq database. Then, the blastpgp program was run to query each protein in our dataset against the NCBI nr database to generate the PSSM profiles, using three iterations of PSI-BLAST with a cutoff E-value of 0.001. After that, these profiles were scaled to the required 0–1 range by the following standard logistic function:
$$f(x) = \frac{1}{1 + e^{-x}}$$
where x is the raw profile matrix value. The scaled PSSM profiles were then used as the input information to SVM.
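The scaling step can be sketched as follows, assuming the PSSM has already been parsed into a list of rows of raw scores; the parsing of blastpgp output is not shown, and the function names and example values are illustrative only.

```python
import math

def logistic(x):
    """Standard logistic function mapping any real score into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def scale_pssm(pssm):
    """Apply the logistic function to every raw score of a PSSM (list of rows)."""
    return [[logistic(value) for value in row] for row in pssm]

# Two made-up rows of raw scores, truncated to three columns for brevity.
raw = [[-3, 0, 5], [2, -1, 0]]
scaled = scale_pssm(raw)
print(scaled[0][1])
```

A raw score of 0 maps to 0.5, and increasingly positive or negative scores approach 1 and 0, respectively.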
The use of PSSM profiles avoids time-consuming multiple sequence alignment procedures. The PSSM of a protein sequence is an M × 20 matrix, where M is the target sequence length and 20 is the number of amino acid types. Each element of the matrix represents the log-odds score of an amino acid at one position in the multiple alignment. The window size 2l + 1 defines the vicinity of the target prolyl peptide bond and determines how much neighboring sequence information is included in the prediction. In order to evaluate the influence of different window sizes on the prediction performance, we selected 9 window sizes to build our SVM predictors, i.e. 2l + 1 = 3, 5, 7, 9, 11, 13, 15, 17, 19 (l = 1, 2, 3, 4, 5, 6, 7, 8, 9, respectively).
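A sliding window over a per-residue matrix can be sketched as below. The zero padding used for positions beyond the chain termini and the function name are assumptions; the same helper would apply equally to an M × 20 profile or to any other per-residue matrix.

```python
def window_features(matrix, center, l):
    """Flatten the rows center - l .. center + l of a per-residue matrix.

    Rows outside the sequence are replaced by zeros (an assumed padding
    convention, not one stated in the text).
    """
    n_cols = len(matrix[0])
    features = []
    for pos in range(center - l, center + l + 1):
        if 0 <= pos < len(matrix):
            features.extend(matrix[pos])
        else:
            features.extend([0.0] * n_cols)
    return features

# Made-up 8 x 20 profile in which row i holds the value i + 1 everywhere.
profile = [[float(i + 1)] * 20 for i in range(8)]
feats = window_features(profile, center=1, l=2)
print(len(feats))
```

For l = 5 and a 20-column profile, the resulting vector has (2 × 5 + 1) × 20 = 220 components.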
Predicted secondary structure by PSIPRED
The predicted probability matrices of secondary structure states from PSIPRED were also used in prediction. PSIPRED is a well-known program for predicting protein secondary structure, whose output provides reliability indices (in the 0–1 range) for the three secondary structure states (helix, strand and coil) of each residue in the protein sequence. We directly extracted the M × 3 matrix from the output file of PSIPRED using a sliding window scheme, where M is the target sequence length and 3 is the number of secondary structure types.
Support vector machine
The concept of the support vector machine (SVM) was first introduced by Vapnik and his coworkers [36, 37]. SVM is a new machine learning method based on Statistical Learning Theory (SLT) and has been extensively used in many kinds of pattern recognition problems, such as microarray data analysis, protein secondary structure prediction, protein subcellular localization prediction [40, 42, 43], disulfide connectivity prediction and protein solvent accessibility prediction. The SVM approach usually outperforms other machine learning technologies, including artificial neural networks (ANN), K-nearest neighbor (KNN) methods and Bayesian inference classification. The basic idea of SVM is to transform the samples into a high dimensional feature space and construct an Optimal Separating Hyperplane (OSH) that maximizes its distance from the closest training samples. The attractive features of SVM lie in its fast speed and scalability, as well as its ability to extract and condense the information contained in the training samples. SVM can be used not only for two-class classification but can also be extended to multi-class problems. A more detailed description of SVM can be found in Vapnik's publications [36, 37].
In the present study, we used SVM_light, an implementation of Vapnik's SVM for support vector classification, regression and pattern recognition. 5-fold cross-validation was used on the dataset of 2424 protein sequences to evaluate the prediction efficiency of the current method. The whole dataset was randomly divided into 5 subsets of roughly equal size. In each validation step, one subset was selected for testing, while the rest were used as the training dataset. The selection of the kernel function parameters is an important step for SVM training and testing, because they implicitly determine the structure of the high dimensional feature space when constructing the OSH. Several parameters must be determined in advance to optimize SVM training, such as the regularization parameter C, the γ parameter in the RBF kernel, and the d parameter in polynomial kernel functions.
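The cross-validation split described above can be sketched in a few lines. This is a generic illustration under stated assumptions (a fixed random seed, integer placeholders standing in for protein chains, and a hypothetical function name) and does not reproduce the actual partition used in this study.

```python
import random

def k_fold_splits(items, k=5, seed=0):
    """Yield (train, test) lists for k-fold cross-validation."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]  # k folds of roughly equal size
    for i in range(k):
        test = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, test

chains = range(2424)  # placeholder identifiers for the protein chains
splits = list(k_fold_splits(chains, k=5))
print([len(test) for _, test in splits])
```

Each chain appears in exactly one test subset, so every sample is predicted once by a model that was not trained on it.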
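Here, we adopted the polynomial kernel function and the Radial Basis Function (RBF) kernel to construct the SVM classifiers:
$$K(\mathbf{x}_i, \mathbf{x}_j) = (\mathbf{x}_i \cdot \mathbf{x}_j + 1)^d$$
$$K(\mathbf{x}_i, \mathbf{x}_j) = \exp\left(-\gamma \lVert \mathbf{x}_i - \mathbf{x}_j \rVert^2\right)$$
where, in the case of the polynomial kernel, the degree d needs to be tuned, and in the case of the RBF kernel, the γ parameter and the regularization parameter C need to be tuned.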
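To make the role of d and γ concrete, the two kernels can be sketched as plain Python functions. The sketch assumes the commonly used forms (x·y + 1)^d for the polynomial kernel and exp(-γ‖x − y‖²) for the RBF kernel; the function names and default parameter values are hypothetical and are not the tuned values from this study.

```python
import math

def polynomial_kernel(x, y, d=3):
    """Polynomial kernel, assuming the common form (x . y + 1)^d."""
    dot = sum(a * b for a, b in zip(x, y))
    return (dot + 1.0) ** d

def rbf_kernel(x, y, gamma=0.1):
    """RBF kernel, assuming the common form exp(-gamma * ||x - y||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)
```

A larger γ makes the RBF kernel decay faster with distance, so each training sample influences a smaller neighbourhood; a larger d lets the polynomial kernel model higher-order interactions between features.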
Waikato environment for knowledge analysis (Weka)
Weka 3.4.5 is a comprehensive Java machine learning package  that provides implementations of many state-of-the-art learning and data mining algorithms , such as decision trees, rule sets, Bayesian classifiers, support vector machines, logistic and linear regression, multi-layer perceptrons and nearest-neighbor methods, as well as meta-learners like bagging, boosting and stacking . The algorithms provided by Weka can be classified into three types: classification, regression and feature selection. More information about Weka can be found in . In this work, we selected four algorithms to build our classifiers: 1) Naïve Bayes, which is an implementation of the probabilistic Naïve Bayesian classifier; 2) Logistic regression, which is a variation of ordinary regression frequently used when the observed outcome is restricted to two values; 3) lazy IBk, which is a k-nearest neighbors classifier that uses a distance metric for classification; 4) J48, which is an implementation of a decision tree learner.
The input data for Weka classifiers is represented in ARFF (Attribute-Relation File Format), which lists all instances, with the values for each instance separated by commas and the class label given as "yes" for cis proline fragments and "no" for trans proline fragments. As a result of dataset training and testing, a confusion matrix is generated showing how many instances of each actual class were assigned to each predicted class.
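As an illustration of the input format, the sketch below writes a small ARFF document from Python. The relation name, the attribute names `f1` and `f2`, the feature values and the helper `to_arff` are all hypothetical; only the ARFF keywords (`@relation`, `@attribute`, `@data`) and the "yes"/"no" class labels follow the description above.

```python
def to_arff(relation, feature_names, rows):
    """Build ARFF text from rows of (feature_values, is_cis) pairs.

    Hypothetical helper: every feature is declared numeric and the class
    attribute is nominal with the values "yes" (cis) and "no" (trans).
    """
    lines = [f"@relation {relation}", ""]
    for name in feature_names:
        lines.append(f"@attribute {name} numeric")
    lines.append("@attribute class {yes,no}")
    lines.append("")
    lines.append("@data")
    for values, is_cis in rows:
        label = "yes" if is_cis else "no"
        lines.append(",".join(str(v) for v in values) + "," + label)
    return "\n".join(lines) + "\n"

# Two made-up instances, one cis and one trans.
example = to_arff(
    "proline_cis_trans",
    ["f1", "f2"],
    [([0.1, 0.5], True), ([0.3, 0.2], False)],
)
```

In a real encoding each instance would carry one numeric attribute per feature of the chosen sequence window, with the class label as the last column.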
To evaluate the prediction performance of the classifiers, we used the 5-fold cross-validation method, i.e. the dataset was randomly divided into five groups, with each group containing roughly equal numbers of protein sequences. Each group was singled out in turn as the testing dataset, while the proteins in the remaining groups were used as the training dataset.
Four different measurements were used to assess the prediction performance of our method. The sensitivity (sens; also called recall, i.e. the fraction of positive examples that are predicted correctly) is given by
$$\mathrm{sens} = \frac{TP}{TP + FN}$$
where TP is the number of true positives and FN is the number of false negatives or under-predictions.
The specificity (spec; also called the true negative rate, i.e. the fraction of negative examples that are predicted correctly) is given by
$$\mathrm{spec} = \frac{TN}{TN + FP}$$
where TN is the number of true negatives, and FP is the number of false positives or over-predictions.
The overall prediction accuracy (Q2) is given by
$$Q_2 = \frac{TP + TN}{TP + TN + FP + FN}$$
The Matthews Correlation Coefficient (MCC)  is defined as
$$\mathrm{MCC} = \frac{TP \times TN - FP \times FN}{\sqrt{(TP + FP)(TP + FN)(TN + FP)(TN + FN)}}$$
The value of MCC is 0 for a random assignment and 1.0 for a perfect prediction. All the results obtained here are from 5-fold cross-validation.
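The four measures can be computed directly from the counts in a two-class confusion matrix. The sketch below uses the standard definitions sens = TP/(TP + FN), spec = TN/(TN + FP), accuracy = (TP + TN)/(TP + TN + FP + FN) and the usual MCC formula; the function name `binary_metrics` and the example counts are hypothetical and are not results from this study.

```python
import math

def binary_metrics(tp, tn, fp, fn):
    """Return sensitivity, specificity, accuracy and MCC from raw counts.

    Hypothetical helper: cis is taken as the positive class. Ratios with
    an empty denominator are reported as 0.0 rather than raising an error.
    """
    sens = tp / (tp + fn) if (tp + fn) else 0.0
    spec = tn / (tn + fp) if (tn + fp) else 0.0
    acc = (tp + tn) / (tp + tn + fp + fn)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"sens": sens, "spec": spec, "acc": acc, "mcc": mcc}

# Made-up counts, for illustration only.
m = binary_metrics(tp=40, tn=30, fp=20, fn=10)
```

Because MCC uses all four cells of the confusion matrix, it stays informative when the two classes are of very different sizes, a situation in which accuracy alone can be misleading.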
We also assessed the classification performance using Receiver Operating Characteristic (ROC) analysis . ROC analysis is a threshold-independent measure and a classic technique from signal processing, and it has been used in the prediction of protein α-turns, β-turns and B-factor profiles [25, 26, 33]. For a prediction method, the ROC curve plots sensitivity as a function of one minus specificity (1-specificity) over all possible thresholds. The area under the ROC curve is an important index for evaluating classification performance: the curve that lies highest and furthest to the left in the plot represents the best classification method .
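A ROC curve of this kind can be traced by sweeping a decision threshold over the classifier scores. The following is a minimal sketch, assuming labels of 1 for cis (positive) and 0 for trans (negative), and that both classes are present; the function names `roc_points` and `auc` are hypothetical, and the area is approximated with the trapezoidal rule.

```python
def roc_points(scores, labels):
    """Return ROC points as (1 - specificity, sensitivity) pairs.

    The threshold is lowered step by step from the highest score to the
    lowest; instances with tied scores are admitted together.
    """
    pos = sum(labels)
    neg = len(labels) - pos
    ranked = sorted(zip(scores, labels), key=lambda p: -p[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        current = ranked[i][0]
        while i < len(ranked) and ranked[i][0] == current:
            if ranked[i][1]:
                tp += 1
            else:
                fp += 1
            i += 1
        points.append((fp / neg, tp / pos))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))
```

A classifier that ranks every cis instance above every trans instance yields an area of 1.0, while a ranking no better than chance gives an area close to 0.5.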
Availability and requirements
The prediction web server CISPEPpred is available at http://foo.maths.uq.edu.au/~sjn/.
Abbreviations
- PSSM: Position-Specific Scoring Matrix
- SVM: Support Vector Machine
- SLT: Statistical Learning Theory
- ANN: Artificial Neural Network
- OSH: Optimal Separating Hyperplane
- RBF: Radial Basis Function
- ARFF: Attribute-Relation File Format
- Q2: Overall prediction accuracy
- MCC: Matthews Correlation Coefficient
- ROC: Receiver Operating Characteristic
- AAC: Amino Acid Composition
- MSA: Multiple Sequence Alignment
Acknowledgements
The authors would like to thank Dr. Tianhai Tian, Dr. Stephen Jeffery and Dr. André Leier (at the Advanced Computational Modelling Centre, The University of Queensland) for enlightening discussions. This work was supported by grants from the Australian Research Council (ARC), and some of the computer simulations were performed at the High Performance Computing Facility at The University of Queensland. We are also grateful to the developers of PSI-BLAST, PSIPRED, SVM_light and Weka.
References
- Stewart DE, Sarkar A, Wampler JE: Occurrence and role of cis peptide bonds in protein structures. J Mol Biol 1990, 214: 253–260. doi:10.1016/0022-2836(90)90159-J
- Weiss MS, Jabs A, Hilgenfeld R: Peptide bonds revisited. Nat Struct Biol 1998, 5: 676. doi:10.1038/1368
- Jabs A, Weiss MS, Hilgenfeld R: Non-proline cis peptide bonds in protein. J Mol Biol 1999, 286: 291–304. doi:10.1006/jmbi.1998.2459
- Pal D, Chakrabarti P: Cis peptide bonds in proteins: residues involved, their conformation, interaction and locations. J Mol Biol 1999, 294: 271–288. doi:10.1006/jmbi.1999.3217
- Andreotti AH: Native state proline isomerization: an intrinsic molecular switch. Biochemistry 2003, 42: 9515–9524. doi:10.1021/bi0350710
- Reimer U, Scherer G, Drewello M, Kruber S, Schutkowski M, Fischer G: Side-chain effects on peptidyl-prolyl cis/trans isomerization. J Mol Biol 1998, 279: 449–460. doi:10.1006/jmbi.1998.1770
- Eckert B, Martin A, Balbach J, Schmid FX: Prolyl isomerization as a molecular timer in phage infection. Nat Struct Mol Biol 2005, 12: 619–623. doi:10.1038/nsmb946
- Wedemeyer WJ, Welker E, Scheraga HA: Proline cis-trans isomerization and protein folding. Biochemistry 2002, 41: 14637–14644. doi:10.1021/bi020574b
- Wu Y, Matthews CA: Cis-prolyl peptide bond isomerization dominates the folding of the alpha subunit of trp synthase, a TIM barrel protein. J Mol Biol 2002, 322: 7–13. doi:10.1016/S0022-2836(02)00737-4
- Schmid FX, Mayr LM, Mücke M, Schönbrunner ER: Prolyl isomerases: role in protein folding. Advan Protein Chem 1993, 44: 25–66.
- Dugave C, Demange L: Cis-trans isomerization of organic molecules and biomolecules: implications and applications. Chem Rev 2003, 103: 2475–2532. doi:10.1021/cr0104375
- Kang YK, Choi HY: Cis-trans isomerization and puckering of proline residue. Biophys Chem 2004, 111: 135–142. doi:10.1016/j.bpc.2004.05.006
- Reimer U, Fischer G: Local structural changes caused by peptidyl-prolyl cis/trans isomerization in the native state of proteins. Biophys Chem 2002, 96: 203–212. doi:10.1016/S0301-4622(02)00013-3
- Pahlke D, Freund C, Leitner D, Labudde D: Statistically significant dependence of the Xaa-Pro peptide bond conformation on secondary structure and amino acid sequence. BMC Struct Biol 2005, 5: 1–8. doi:10.1186/1472-6807-5-8
- Lorenzen S, Peters B, Goede A, Preissner R, Frömmel C: Conservation of cis prolyl bonds in proteins during evolution. Proteins 2005, 58: 589–595. doi:10.1002/prot.20342
- Frömmel C, Preissner R: Prediction of prolyl residues in cis-conformation in protein structures on the basis of the amino acid sequence. FEBS Lett 1990, 277: 159–163. doi:10.1016/0014-5793(90)80833-5
- Wang ML, Li WJ, Xu WB: Support vector machines for prediction of peptidyl prolyl cis/trans isomerization. J Peptide Res 2004, 63: 23–28. doi:10.1046/j.1399-3011.2004.00100.x
- Pahlke D, Leitner D, Wiedemann U, Labudde D: COPS- cis/trans peptide bond conformation prediction of amino acids on the basis of secondary structure information. Bioinformatics 2005, 21: 685–686. doi:10.1093/bioinformatics/bti089
- Altschul SF, Madden TL, Schaffer AA, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acid Res 1997, 25: 3389–3402. doi:10.1093/nar/25.17.3389
- Jones DT: Protein secondary structure prediction based on position-specific scoring matrices. J Mol Biol 1999, 292: 195–202. doi:10.1006/jmbi.1999.3091
- PISCES: a protein sequence culling server [http://dunbrack.fccc.edu/PISCES.php]
- Janin J: Errors in three dimensions. Biochimie 1990, 72: 705–709. doi:10.1016/0300-9084(90)90153-8
- Laskowski RA, MacArthur MW, Moss DS, Thornton JM: PROCHECK: a program to check the stereochemical quality of protein structure. J Appl Crystallogr 1993, 26: 283–291. doi:10.1107/S0021889892009944
- Guo J, Chen H, Sun Z, Lin Y: A novel method for protein secondary structure prediction using dual-layer SVM and profiles. Proteins 2004, 54: 738–743. doi:10.1002/prot.10634
- Kaur H, Raghava GP: Prediction of α-turns in proteins using PSI-BLAST profiles and secondary structure information. Proteins 2004, 55: 83–90. doi:10.1002/prot.10569
- Kaur H, Raghava GP: A neural network method for prediction of β-turn types in proteins using evolutionary information. Bioinformatics 2004, 20: 2751–2758. doi:10.1093/bioinformatics/bth322
- Zhang Q, Yoon S, Welsh WJ: Improved method for predicting β-turn using support vector machine. Bioinformatics 2005, 21: 2370–2374. doi:10.1093/bioinformatics/bti358
- Xie D, Li A, Wang M, Fan Z, Feng H: LOCSVMPSI: a web server for subcellular localization of eukaryotic proteins using SVM and profile of PSI-BLAST. Nucleic Acid Res 2005, 33: W105-W110. doi:10.1093/nar/gki359
- Chen YC, Hwang JK: Prediction of disulfide connectivity from protein sequences. Proteins 2005, 61: 507–512. doi:10.1002/prot.20627
- Qin S, He Y, Pan XM: Predicting protein secondary structure and solvent accessibility with an improved multiple linear regression method. Proteins 2005, 61: 473–480. doi:10.1002/prot.20645
- Bradford JR, Westhead DR: Improved prediction of protein-protein binding sites using a support vector machines approach. Bioinformatics 2005, 21: 1487–1494. doi:10.1093/bioinformatics/bti242
- Ahmad S, Sarai A: PSSM-based prediction of DNA binding sites in proteins. BMC Bioinformatics 2005, 6: 33. doi:10.1186/1471-2105-6-33
- Yuan Z, Bailey TL, Teasdale RD: Prediction of protein B-factor profiles. Proteins 2005, 58: 905–912. doi:10.1002/prot.20375
- Yuan Z: Better prediction of protein contact number using a support vector regression analysis of amino acid sequence. BMC Bioinformatics 2005, 6: 248. doi:10.1186/1471-2105-6-248
- NCBI FTP website [ftp://ftp.ncbi.nlm.nih.gov/blast/db/]
- Vapnik V: Statistical learning theory. New York: Wiley; 1998.
- Vapnik V: The nature of statistical learning theory. New York: Springer; 2000.
- Brown MPS, Grundy WN, Lin D, Cristianini N, Sugnet CW, Furey TS, Ares M, Haussler D: Knowledge-based analysis of microarray gene expression data by using support vector machines. Proc Natl Acad Sci 2000, 97: 262–267. doi:10.1073/pnas.97.1.262
- Hua S, Sun Z: A novel method of protein secondary structure prediction with high segment overlap measure: support vector machine approach. J Mol Biol 2001, 308: 397–407. doi:10.1006/jmbi.2001.4580
- Hua S, Sun Z: Support vector machine approach for protein subcellular localization prediction. Bioinformatics 2001, 17: 721–728. doi:10.1093/bioinformatics/17.8.721
- Qian J, Lin J, Luscombe NM, Yu H, Gerstein M: Prediction of regulatory networks: genome-wide identification of transcription factor targets from gene expression data. Bioinformatics 2003, 19: 1917–1926. doi:10.1093/bioinformatics/btg347
- Sarda D, Chua GH, Li KB, Krishnan A: pSLIP: SVM based protein subcellular localization prediction using multiple physiochemical properties. BMC Bioinformatics 2005, 6: 152. doi:10.1186/1471-2105-6-152
- Cui Q, Jiang T, Liu B, Ma S: Esub8: A novel tool to predict protein subcellular localizations in eukaryotic organisms. BMC Bioinformatics 2005, 5: 66. doi:10.1186/1471-2105-5-66
- Yuan Z, Burrage K, Mattick JS: Prediction of protein solvent accessibility using support vector machines. Proteins 2002, 48: 566–570. doi:10.1002/prot.10176
- Weka 3: Data Mining Software in Java [http://www.cs.waikato.ac.nz/ml/weka/]
- Frank E, Hall K, Trigg L, Holmes G, Witten IH: Data mining in bioinformatics using Weka. Bioinformatics 2004, 20: 2479–2481. doi:10.1093/bioinformatics/bth261
- Witten IH, Frank E: Data mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann, San Francisco, CA; 2000.
- Matthews BW: Comparison of predicted and observed secondary structure of T4 phage lysozyme. Biochim Biophys Acta 1975, 405: 442–451.
- Centor RM: Signal detectability: The use of roc curves and their analysis. Med Decis Making 1991, 11: 102–106.
- CISPEPpred web server [http://foo.maths.uq.edu.au/~sjn/]
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.