Predicting residue-wise contact orders in proteins by support vector regression

Background The residue-wise contact order (RWCO) describes the sequence separations between the residues of interest and its contacting residues in a protein sequence. It is a new kind of one-dimensional protein structure that represents the extent of long-range contacts and is considered as a generalization of contact order. Together with secondary structure, accessible surface area, the B factor, and contact number, RWCO provides comprehensive and indispensable important information to reconstructing the protein three-dimensional structure from a set of one-dimensional structural properties. Accurately predicting RWCO values could have many important applications in protein three-dimensional structure prediction and protein folding rate prediction, and give deep insights into protein sequence-structure relationships. Results We developed a novel approach to predict residue-wise contact order values in proteins based on support vector regression (SVR), starting from primary amino acid sequences. We explored seven different sequence encoding schemes to examine their effects on the prediction performance, including local sequence in the form of PSI-BLAST profiles, local sequence plus amino acid composition, local sequence plus molecular weight, local sequence plus secondary structure predicted by PSIPRED, local sequence plus molecular weight and amino acid composition, local sequence plus molecular weight and predicted secondary structure, and local sequence plus molecular weight, amino acid composition and predicted secondary structure. When using local sequences with multiple sequence alignments in the form of PSI-BLAST profiles, we could predict the RWCO distribution with a Pearson correlation coefficient (CC) between the predicted and observed RWCO values of 0.55, and root mean square error (RMSE) of 0.82, based on a well-defined dataset with 680 protein sequences. Moreover, by incorporating global features such as molecular weight and amino acid composition we could further improve the prediction performance with the CC to 0.57 and an RMSE of 0.79. In addition, combining the predicted secondary structure by PSIPRED was found to significantly improve the prediction performance and could yield the best prediction accuracy with a CC of 0.60 and RMSE of 0.78, which provided at least comparable performance compared with the other existing methods. Conclusion The SVR method shows a prediction performance competitive with or at least comparable to the previously developed linear regression-based methods for predicting RWCO values. In contrast to support vector classification (SVC), SVR is very good at estimating the raw value profiles of the samples. The successful application of the SVR approach in this study reinforces the fact that support vector regression is a powerful tool in extracting the protein sequence-structure relationship and in estimating the protein structural profiles from amino acid sequences.


Background
A major challenge in structural bioinformatics is the prediction of protein structure and function from primary amino acid sequences. This problem is becoming more pressing now as the protein sequence-structure gap is widening rapidly as a result of the completion of large-scale genome sequencing projects [1,2]. As an intermediate but useful step, predicting a number of key properties of proteins including secondary structure, solvent accessibility, contact numbers and contact order is a possible and promising strategy, which simplifies the prediction task by projecting the three-dimensional structures onto one dimension, i.e. strings of residue-wise structural assignments [3][4][5][6].
However, the current state-of-art methods can only achieve a prediction accuracy of 76%-80%, for the threestate secondary structure prediction [7]. One of the main reasons for the limitation to accurate secondary structure prediction is attributed to the long-range residue contacts (described by residue contact order), which is often overlooked or under-represented in the current prediction methods. Kihara examined the relationship between residue contact order and the prediction accuracy and found that there exists a negative correlation for the α-helices and β-strands [8]. Their studies indicated that long-range residue contacts have significant effects on the secondary structure prediction. Therefore, it is worthwhile incorporating these two-dimensional contact maps of residue contact orders in order to further improve the prediction performance. Moreover, in addition to its significance to secondary structure prediction, residue contact order also has an important implication in protein folding rate prediction [9,10]. Previous studies have well established that residue contact order has a strong correlation with folding rate and, more recently, Punta and Rost demonstrated that the two-state folding rates of a protein can be reliably estimated by predicting its residue-residue contacts even for the proteins of unknown structures [10].
Residue-wise contact order (RWCO) is a new kind of onedimensional protein structure representing the extent of long-range contacts, which is a sum of sequence separations between the given residue and all the other contacting residues [11,12]. Relative contact order (CO) was originally put forward by Plaxco et al. to describe the complexity of protein topology and is often used to study the correlation between protein topology and folding rate [13]. Based on this definition, Kihara further defined the residue contact order (RCO), which was the average contact order of the residue of interest [8]. Recently, Kinjo et al. put forward a similar definition and introduced the concept of RWCO [11,12], which can be considered as a generalization of RCO. In other words, RWCO is the sum of the sequence separation of contacting residues, that is, for residue i, RWCO i = n × RCO i , where n is the number of contacting residues with residue i [8]. As discussed by Kinjo et al., CO is a per-protein quantity based on the whole protein level, while RWCO and RCO are per-residue properties based on the residue level. Recent studies have indicated that it is applicable to use RWCO, together with contact numbers and secondary structures to accurately recover the three-dimensional structures of a protein [6,12]. Therefore, accurate prediction of RWCO values in proteins would have many important applications, especially in protein structure prediction and protein folding rate prediction, as well as helping to determine protein homologous folds.
Several methods have been developed so far to predict the RWCO distributions from the primary amino acid sequences. Kinjo et al. proposed a simple linear regression method to predict RWCO values and the local sequence information with multiple sequence alignments in the form of PSI-BLAST profiles was extracted using a sliding window scheme centered on the target residue. Their method achieved a highest correlation coefficient (CC) of 0.59 between the native (observed) and predicted RWCO values using an unusual half window size of 26 (full window size = 53). And the corresponding root mean square error (RSME) was 1.03. This result was averaged on the test datasets by 15-fold cross-validation. They claimed that this long-range correlation reflected by the unusually long window size was a conspicuous property of RWCO, which was distinctly different from any other one-dimensional structure prediction [11]. Later they developed another method called critical random network (CRN) to refine this task using the same extra-large window size of 53 residues, and their accuracy was further improved to a CC of 0.60 and RMSE of 0.88 [12].
In the present study, we proposed a novel method to predict the RWCO profiles from amino acid sequences based on support vector regression (SVR). Different from the linear regression approach, our method uses the non-linear radial basis kernel function (RBF) to approximate and determine the sequence-RWCO relationship. We extensively explored seven different sequence encoding schemes and examined their different effects on the prediction performance. The results showed that introducing the predicted secondary structure by PSIPRED program, in conjunction with the global information such as protein molecular weight and amino acid compositions, could significantly enhance the prediction performance. Our method could predict RWCO values with a Pearson's correlation coefficient (CC) of 0.60 and root mean square error (RMSE) of 0.78. We compared our prediction accuracy with that of Kinjo et al. using the same 15-fold crossvalidation based on the same training and testing datasets. Our results show that our approach is superior to the linear regression method and slightly better than the critical random network method in predicting protein structural profile values and describing sequence-structure relationships.

RWCO distribution at four different radius thresholds
The RWCO value for each residue in the dataset was computed by defining four different sphere radii r d centered on the C β atom of the target residue, i.e. r d = 8Å, 10Å, 12Å and 14Å. For each given radius cutoff r d , we computed the average RWCO distributions over the whole dataset using formula (1) and (2), which are displayed in Figure 1. The corresponding mean ( ) and standard deviation (SD) are listed in Table 1. There are significant correlations between the four different RWCO distributions. The RWCO values defined by four different radii cutoffs have correlation coefficients (CCs) all greater than 0.853 (Table  2). It can be seen that RWCO distributions with large radius cutoffs (r d = 12Å and 14Å) are close to gamma distributions ( Figure 1) and even after the normalization step using equation (3), their normalized RWCO distribution profiles retain the same tendency. Since previous studies also indicated that larger radii r d = 12Å and 14Å have more significant meaning in protein fold recognition [20] and because the directly related work [11,12] also used a large radius cutoff of 12Å, we set r d = 12Å in the following analysis in order to be consistent with the previous work and make an objective comparison.

Relationship between accessible surface area and RWCO
Since RWCO is a per-residue quantity of amino acid [11,12], it is natural to conjecture that there exists a relationship between RWCO and the solvent accessibility profile of amino acid residue. In order to investigate their connections, we extracted the accessible surface area (ASA) values of each residue in our dataset using the DSSP program [15]. The negative relationship between RWCO and ASA could be observed with a correlation coefficient of -0.463 ( Figure 2). This means that the larger the ASA of a residue, the smaller the RWCO value of that residue, which is consistent with the expectation that the residue with small ASA has large numbers of contact residues in the structure space around itself.

Predicting RWCO values using multiple sequence alignment profiles
As many studies have indicated, the evolutionary information implicitly contained in the multiple sequence alignments could provide better prediction performance compared with the single sequence alone. In this study, the position-specific scoring matrix (PSSM) generated by the PSI-BLAST program [31] served as the input to SVR. The important evolutionary information is stored in these multiple sequence alignment profiles. For an objective comparison with the results of Kinjo et al. [11,12], we also performed the same 15-fold cross-validation test in this study, i.e. 680 proteins were randomly divided into two parts: the training dataset with 630 proteins and the testing dataset with the remaining 50 proteins [11,12]. This procedure was repeated 15 times, generating the final 15 combinations of SVR training and testing datasets. At each cross-validation step, we built the SVR model using the normalized training set, predicted the normalized RWCO values using this model and then transformed to their N   Table 3 (column "LS").
In the current work, RWCO is normalized using the entire benchmark dataset. More specifically, RWCO is normalized according to the formula (3) in the Methods Section using the Standard Deviation and mean raw RWCO values that are computed based on the whole dataset. We first computed the normalized RWCO values before SVR training and testing, then replaced the raw RWCO values by using these normalized values (both for the training and testing datasets). After predicting the normalized RWCO values for the test datasets, we restored the raw RWCO values by transforming the predicted normalized RWCO values to raw ones by using equation (3). The rea-son for using normalized RWCO values instead of the raw values here is that this strategy can improve the prediction performance and is more robust than if raw values are used. As suggested by the reviewer, we tested the predictive performance of the same sequence encoding scheme "LS+W+AA+SS" based on both the normalized values and raw ones, whose result comparison is shown in Table 3. It is clear that the predictor using normalized RWCO values is superior to that of using raw values-the CC improves from 0.58 to 0.60, whereas the values of DevAp and RMSE_raw drop from 0.90 to 0.87 and from 3.09 to 3.05, respectively. This normalization step is important for achieving better prediction performance in the training and testing SVR process.
It can be seen that the use of multiple sequence alignments for SVR training and testing yields CC = 0.55, The accessible surface area as a function of RWCO DevA p = 0.94 and RMSE_raw = 3.21 (   Figure 4 shows the 3D structure depiction of the badlypredicted transcriptional activator GCN5. It has two regions that are badly predicted: from residue position 4 to 23 (represented using the green ball-and-stick model) and from position 92 to 106 (represented using the blue ball-and-stick model). Possible reasons for explaining why some regions of a protein are not well predicted may be due to the fact that there are some residues with relatively small numbers in the dataset and therefore they are less adequately represented after input into SVR models. This would also account for the lower prediction accuracy in some regions of the protein. SVM usually achieves better prediction accuracy when using well-represented datasets than the inadequately-represented ones as the training and testing datasets.

Improving the prediction performance by incorporating global information such as protein molecular weight and amino acid composition, as well as the predicted secondary structure by PSIPRED
The multiple sequence alignment profile used here is a kind of local sequence feature. However, we still need to take into account additional global features to further improve the prediction performance. Kinjo et al. also pointed out that the global context has an effect on the prediction accuracy and it might be useful to include more global features of amino acid sequences [12]. On the other hand, protein molecular weight, as another global sequence feature, could considerably improve the prediction accuracy [20]. We thus divided the protein sequences into four subgroups with equal protein numbers according to their molecular weights. We also incorporated the amino acid composition as the input vector to SVR.
In this work, we employed seven different encoding schemes, i.e. local sequence ("LS"), local sequence plus molecular weight ("LS+W"), local sequence plus amino acid composition ("LS+AA"), local sequence plus predicted secondary structure information by PSIPRED ("LS+SS"), local sequence together with molecular weight and amino acid composition ("LS+W+AA"), local sequence together with molecular weight and predicted secondary structure ("LS+W+SS"), local sequence, molecular weight, amino acid composition and predicted secondary structure information ("LS+W+AA+SS"). PSIPRED is a program to generate the probability profiles for three secondary structure state (helix, strand and coli) assignments for each residue of the predicted protein [31]. We extracted the 15 × 3 = 45 matrix from the output file of PSIPRED by selecting the sliding window size 15, and incorporated this matrix into the SVR model. For the last sequence encoding scheme "LS+W+AA+SS", a residue was encoded as a 15 × 20+1+20+15 × 3 = 366 dimensional vector. The prediction results for each subgroup are shown in Table 4.
As a kind of global feature using either the amino acid composition ("LS+AA") or protein molecular weight ("LS+W") yields the better prediction performance compared with local sequence alone. However, in contrast to amino acid composition, it is worth noting that protein molecular weight here can give a more significant improvement. The significance of molecular weight on The predicted and observed RWCO profiles for proteins integrin alpha-L (PDB code: 1mjn, chain A) and transcriptional activa-tor GCN5 (PDB code: 1e6i, chain A) the prediction performance has been previously observed in the prediction study of protein contact numbers [20]. This effect is even remarkable when predicting proteins with relatively small molecular weights. For instance, for proteins with weights less than 12320 Daltons, "LS+AA" schemes can give prediction accuracy with a CC of 0.53, DevA p of 0.91 and RMSE_raw of 2.97, while "LS+W" can increase the CC to 0.54 and decrease DevA p and RMSE_raw to 0.89 and 2.89, respectively. Furthermore, when combining the amino acid and molecular weight information, there is still a significant improvement in the final prediction performance. The encoding scheme "LS+W+AA" could predict RWCO values with an overall CC of 0.57, DevA p of 0.91 and RMSE_raw of 3.12.
Proteins with relatively large molecular weights are less well predicted than proteins with smaller molecular weights. For example, for proteins with molecular weights larger than 26460 Daltons, the "LS" encoding scheme could only predict their RWCO values with a CC of 0.52, DevA p of 1.00 and RMSE_raw of 3.44, which is rather lower than for the other protein groups. Even after adopting the "LS+W+AA+SS" encoding scheme, the resulting improvement is still not as significant as other protein groups, i.e. with a CC of 0.55, DevA p of 0.94 and RMSE_raw of 3.39. This might be attributable to the small numbers of large proteins in the current datasets which are under-represented when building SVR models, while the availability of the training samples could in turn affect the predictive ability of built SVM models to a large extent.
When compared with the global features such as amino acid composition ("AA") and protein molecular weight ("W"), however, the predicted secondary structure by PSIPRED seems to be the most important determinant of our predictors. This is apparent by observing the significant performance improvement that the CC increases from 0.55 using the "LS" encoding scheme to 0.58 using the "LS+SS" scheme, whereas the values of DevAp, RMSE_norm and RMSE_raw decreases from 0.94, 0.82, and 3.21 to 0.90, 0.79 and 3.10, respectively. We can also draw the same conclusion by comparing the performance improvement of the "LS+W" encoding scheme with that of the "LS+W+SS" scheme. The CC improves from 0.56 to 0.59, while the DevAp and RMSE_raw values decrease from 0.92 to 0.89 and 3.15 to 3.07, respectively, after incorporating "SS" information in the encoding scheme "LS+W". As a result, our method achieved the overall best prediction accuracy after adopting the encoding scheme "LS+W+AA+SS", i.e. combing all the four kinds of information. The average CC, DevA p and RMSE_raw scores are 0.60, 0.87 and 3.05, respectively.
To visualize the prediction accuracy of individual protein, we plotted the CC against the corresponding protein molecular weight in Figure 5. It can be shown that most proteins tested are situated in the region with CCs larger than 0.4 or more, while there also exist some separate proteins which are poorly predicted with CCs lower than 0.3. These proteins that are poorly predicted are found to be mainly distributed on both sides of protein weights, suggesting that both some small and large proteins are less accurately predicted.
We also calculated the overall distributions of CC, DevA p and RMSE of the testing proteins sequences for the seven different encoding schemes, which are depicted in Figure  6. The peak values of CC, DevA p and RMSE are close to 0.60, 0.86 and 3.0, respectively. For the CC distribution, the rightmost curve in the plot represents the best prediction method, while for DevA p and RMSE distributions, the leftmost curves denote the best method. All the three distributions of CC, DevA p and RMSE indicated that the "LS+W+AA+SS" encoding scheme leads to the best performance.
The mean absolute errors (MAEs) of residues with different RWCO values are plotted in Figure 7. It can be seen The 3D structure of the predicted transcriptional activator GCN5 (PDB code: 1e6i, chain A), whose two badly predicted regions (from residue position 4 to 23 and from position 92 to 106) are highlighted using the green and blue ball-and-stick models, respectively Figure 4 The 3D structure of the predicted transcriptional activator GCN5 (PDB code: 1e6i, chain A), whose two badly predicted regions (from residue position 4 to 23 and from position 92 to 106) are highlighted using the green and blue ball-and-stick models, respectively. The cyan backbone depicts other regions of the protein chain. This 3D molecular image was generated by the Protein Explorer program [36].
that the "LS+W+AA+SS" encoding scheme leads to the least mean absolute error for the majority of the regions and thus could give the best prediction performance. Residues with RWCO value 4 are predicted with the least mean absolute errors, due to the fact that these proteins have the largest numbers in the current dataset and thus are adequately represented. On the other hand, residues with larger RWCO values (RWCO>20) have larger MAEs and are thus worst predicted.

Comparison with other methods
We also compared our SVR based method with other prediction methods, such as the linear regression method [11] and critical random networks (CRNs) [12]. For an objective comparison, these methods are all measured on the same training and testing datasets using 15-fold crossvalidation. The results are summarized in Table 5.
When selecting the sequence encoding scheme "LS+W+AA+SS", the SVR method could achieve the best prediction accuracy with a CC of 0.60, DevA p of 0.87, and RMSE_raw of 3.05. The linear regression method is based on the simple linear regression scheme and achieved prediction accuracy with a CC of 0.59 and DevA p of 1.03 [11]. CRN predicted RWCO values by defining a linear function of a state vector associated with a target sequence, namely, the position-specific scoring matrices (PSSMs) generated from PSI-BLAST and achieved a best prediction performance with CC of 0.60 and DevA p of 0.88 [12]. Both linear regression and CRN methods employed the same local window size of 52 residues to achieve their respective best performance. As can be seen, the SVR method performed much better than the simple linear regression method and slightly better than that of the CRN method with the same accuracy of CC and smaller DevA p values. These results suggest that the SVR method is at least competitive with, if not better than, the previously developed methods.

Discussion
Residue-wise contact order, in conjunction with secondary structure, solvent accessibility, B factors and contact number, can provide complementary and indispensable information for the ultimate prediction of protein threedimensional structures. Due to the importance of residue contact orders on the protein folding and protein structure prediction, studies in this direction are receiving more and more attention recently [8][9][10][11][12][13].
Several ways may help to further improve the prediction performance in the future. The first approach is to use more accurately determined PDB structures with better resolutions. The second is to incorporate other informative and complementary features, such as protein solvent accessibility and contact numbers, which have been proved to have high correlations with RWCO values in proteins [11,12]. The third strategy can focus on how to effectively represent those under-represented proteins with lower or higher molecular weights. Increasing the ratio of these proteins in the whole dataset could also contribute to enhancing the prediction accuracy. Further improvement is possibly achieved by using refined datasets and combining more informative multiple feature descriptors together.
As a new machine learning method, support vector regression has many attractive features and our study presented here has further enhanced its useful application in reliably predicting residue-wise contact orders in proteins. The present method may also be useful in protein structure prediction, protein folding rate prediction and protein engineering applications.

Conclusion
In this paper, we have developed a novel approach to predict the residue-wise contact order in proteins using support vector regression based on the local protein sequence descriptor (multiple sequence alignments in the form of PSI-BLAST profiles) and two global descriptors (protein molecular weight and amino acid composition). The predicted secondary structure by PSIPRED also served as input to the SVR. For completeness, we introduced seven different sequence encoding combinations and investigated their effects on the prediction performance. We found that using the local sequence descriptor could provide benchmark prediction accuracy with a CC of 0.55, DevA p of 0.94 and RMSE of 3.21. Furthermore, after adopting the sequence encoding scheme "LS+W+AA+SS" that combined the local sequence descriptor, global descriptors and the predicted secondary structure together, our method could yield the best prediction performance with a CC of 0.60, DevA p of 0.87 and RMSE of 3.05, a significant improvement over the accuracy based on local sequence information alone. Our results indicated that both the local sequence context and the pre- The CC distribution versus protein molecular weight Figure 5 The CC distribution versus protein molecular weight. Each circle in the plot represents a protein sequence in the testing datasets.
The distributions of CC, DevA p and RMSE for the seven different encoding schemes Figure 6 The distributions of CC, DevA p and RMSE for the seven different encoding schemes. "LS", "LS+AA", "LS+W", "LS+SS", "LS+W+AA", "LS+W+SS", and "LS+W+AA+SS" are represented by dotted black, dashed red, dot-and-dashed green, dot-and-dashed orange, dash-dot-and-dotted cyan, short-dashed magenta and solid blue lines, respectively.
dicted secondary structure are important determinants in predicting residue-wise contact orders in proteins. We have demonstrated that the SVR approach is competitive with other existing algorithms based on linear regression models. Due to its attractive potential in condensing information and regressing value profiles, it is anticipated that the SVR method will play a more important role in analyzing large-scale genome and proteome data as more biological data becomes available through genome sequencing projects.

Dataset
We used the same dataset previously prepared by Kinjo and Nishikawa [11,12], which included 680 protein sequences and was originally extracted from ASTRAL database version 1.65 [16]. This is a well-defined dataset and each of the protein chains represents a superfamily from all-α, all-β, α/β, α+β and Multi-domain proteins in SCOP database [17]. The sequence identity between each pair of chains was less than 40%.
However, in the current ASTRAL SCOP version 1.69 (generated on August 1, 2005), some original protein chains included in version 1.65 are replaced or discarded. They are d1dj0a1 (replaced by d1dj0a_), d1dkza_ (replaced by d1dkza1 and d1dkza2), d1fvka1 (replaced by d1fvka_), d1gdoa (replaced by d1xffa_), d1hf8a_ (replaced by d1hf8a1 and d1hf8a2), d1jx4a_ (replaced by d1jx4a1 and d1jx4a2), and d1oi2a1 (replaced by d1oi2a_). In order to The mean absolute errors for different RWCO values of residues Figure 7 The mean absolute errors for different RWCO values of residues. The seven different encoding schemes "LS", "LS+AA", "LS+W", "LS+SS", "LS+W+AA", "LS+W+SS", and "LS+W+AA+SS" are represented by dotted black, dashed red, dotand-dashed green, dot-and-dashed orange, dash-dot-and-dotted cyan, short-dashed magenta and solid blue lines, respectively. There are a total of 120421 residues in this dataset. The protein chain names and their corresponding amino acid sequences can be found in the Additional File 1 (supplementary material). The detailed RWCO information with a radius cutoff of 12Å can be found in the Additional File 2 (supplementary material).

Residue-wise contact order
The concept of residue-wise contact order (RWCO) was first introduced by Kinjo and Nishikawa [11,12]. The discrete RWCO values of the i-th residue in a protein sequence with M residues is defined by where r i,j is the distance between the C β atoms of the i-th and j-th residues (C α atoms for glycine) in the protein sequence. Two residues are considered to be in contact if their C β atoms locate within a sphere of the threshold radius r d . Note that the trivial contacts between the nearest and second-nearest residues are excluded. In order to smooth the discrete RWCO values, Kinjo et al. proposed a particular sigmoid function [11,12,14], which is given by where w is a parameter that determines the sharpness of the sigmoid function. In the present study, for the sake of comparison, we set r d = 12 Å and w = 3, which was adopted by Kinjo et al. [11,12]. To find the function between the protein sequence and the normalized RWCO values, we use ε-insensitive support vector regression (ε-SVR) [29,30].  denote a set of training data, where feature vector x i denotes residue i in a protein sequence with M residues, and y i represents its corresponding normalized RWCO value.

Normalization of RWCO
The expected function of SVR is Here, W is the weight, b is the bias, and ΌW, the Lagrange multipliers contribute to the ultimate SVR prediction, whose associated samples are known as support vectors. As a contrast, those zero-valued Lagrange multipliers falling inside the ε-insensitive tube make no contribution to the regression. Normally, the number of support vectors is much smaller than that of the samples, thus SVR has the attractive property of condensing information in the training samples which is represented by these useful support vectors with non-zero values.
The kernel function K (x i , x) has several different forms, such as polynomial kernel function, radial basis kernel function (RBF), sigmoid kernel function, etc. The radial basis kernel function is adopted in this study, which is given by (8) where γ parameter needs to be regulated.
We used SVM_light, an implementation of Vapnik's SVM for support vector classification, regression and pattern recognition [35]. In the present study, we selected radial basis kernel function at ε = 0.01, γ = 0.01 and C = 5.0 to build the SVR models. This combination of parameters has been proven to yield the best performance in previous studies of predicting accessible surface area, B-factor and contact number [18][19][20].

Sequence encoding scheme
Since numerous studies have well established that the prediction performance resulting from using multiple sequence alignments in the form of PSI-BLAST [31] profiles usually outperforms that of single sequence [11,12,[18][19][20]25-28], we are more interested in utilizing multiple sequence encoding schemes here, in which the intermediate PSI-BLAST generated position-specific scoring matrix (PSSM) is used as the direct input to SVR.
We extracted the local sequence fragments of the centered residues of interest by a sliding window coding scheme, with window length 2l+1, where l is the half window size. We ran blastpgp program in the PSI-BLAST package to query each protein in the dataset against the NCBI nr database to generate the PSSM profiles, by three iterations of PSI-BLAST, with a cutoff E-value of 10 -7 . The PSSM is an M × 20 matrix, where M is the target sequence length and 20 is the number of amino acid types. Each element of the matrix represents the log-likelihood for each residue position in the multiple sequence alignment. Evolutionary information was included in this window as the input information coded by M × 20 dimensional vectors.
For the sake of SVR input and process, we simply divided all the elements in the PSSM profiles by 10 to normalize them, thus most values fell between -1.0 and 1.0. We selected a windows size of M = 15 to build the SVR predictors, which has been proven to yield the best performance in previous studies [18][19][20].

Prediction performance evaluation
In order to objectively evaluate the prediction performance of our approach, we employed the 15-fold cross-validation methods. The 680 protein sequences used in this study were randomly divided into fifteen subsets with roughly equal numbers of protein sequences. In each validation step, one subset was singled out in turn as the test- To measure the performance of SVR methods in this application, we calculated the Pearson's correlation coefficients (CC) between the predicted and observed RWCO values in a protein sequence as given by where x i and y i are the observed and predicted normalized RWCO values of the i-th residue, and and are their corresponding means. Here N is the total residue number in a protein.
In order to compare the prediction performance with existing methods, we also used the same measure DevA p proposed by Kinjo et al. [11,12] to calculate the RMS error between the predicted and observed RWCO values The root mean square error (RMSE) is also given by We computed two kinds of RMSE values: one is based on the predicted and observed normalized RWCO values (denoted by RMSE_norm) and the other is based on the predicted and observed raw (absolute) RWCO values (denoted by RMSE_raw). .

Additional file 1
The