PredRSA: a gradient boosted regression trees approach for predicting protein solvent accessibility
© Fan et al. 2016
Published: 11 January 2016
Protein solvent accessibility prediction is a pivotal intermediate step towards modeling protein tertiary structures directly from one-dimensional sequences. It also plays an important part in identifying protein folds and domains. Although several methods have been proposed for protein solvent accessibility prediction in recent years, their performance is far from satisfactory. In this work, we propose PredRSA, a computational method that can accurately predict the relative solvent accessible surface area (RSA) of residues by exploring various local and global sequence features which have been observed to be associated with solvent accessibility. Based on these features, a novel and efficient approach, Gradient Boosted Regression Trees (GBRT), is adopted for the first time to predict RSA.
Experimental results obtained from 5-fold cross-validation on the Manesh-215 dataset show that the mean absolute error (MAE) and the Pearson correlation coefficient (PCC) of PredRSA are 9.0 % and 0.75, respectively, which are better than those of existing methods. Moreover, we evaluate the performance of PredRSA using an independent test set of 68 proteins. Compared with the state-of-the-art approaches (SPINE-X and ASAquick), PredRSA achieves a significant improvement in prediction quality.
Our experimental results show that the Gradient Boosted Regression Trees algorithm and the novel feature combination are quite effective in relative solvent accessibility prediction. The proposed PredRSA method could be useful in assisting the prediction of protein structures by applying the predicted RSA as useful restraints.
Since the concept of solvent accessibility was first introduced by Lee and Richards, defined as the surface area of a protein that is accessible to a spherical solvent molecule probing the surface of that protein, it has been considered a key factor for understanding protein structure and function. Predicting the three-dimensional (3D) structures of proteins from their one-dimensional sequences is a challenging issue, made more pressing by the increasing gap between the enormous number of protein sequences and the number of known structures. Studies of solvent accessibility in proteins have provided many useful insights into the 3D structures of proteins. Furthermore, knowledge of solvent accessibility has proved useful for structural domain identification, fold recognition, binding region identification and protein intrinsic disorder prediction. Solvent accessibility is particularly important because it is associated with the spatial arrangement and packing of amino acids during protein folding. It also plays an important role in predicting the active sites of protein-protein or protein-ligand binding.
In many earlier studies, solvent accessibility prediction was treated as a classification problem with varying thresholds, either two-state (exposed or buried) or three-state (exposed, intermediate or buried). However, there is no standard definition for the thresholds of solvent accessibility states. For instance, a residue may be classified as exposed with a relative solvent accessibility threshold of 10 %, but the same residue may be classified as buried with a threshold of 20 %. In view of this, it is necessary to predict the real values of solvent accessibility. Several representative machine learning techniques have been proposed for this purpose, including multiple linear regression, support vector regression, neural networks, energy optimization and the nearest neighbor method.
For real-valued solvent accessibility prediction, Ahmad et al. proposed a neural network method with only single-sequence information as the input features. This method achieved an MAE of 18.0–19.5 % on different data sets. Adamczak et al. employed evolutionary information in the form of a position-specific scoring matrix (PSSM) profile to train a neural network-based regression model for the prediction. Compared with the single-sequence-based neural network, the prediction performance was improved and the MAE decreased by about 5 % on the PFAM database. Subsequently, Lee et al. applied the PSSM profile by constructing a correlation matrix across different window positions to train a multiple linear regression method. The result showed a performance of 16.6 % MAE and 0.63 PCC on the Barton-502 dataset. Garg et al. took multiple sequence alignment and secondary structure as input features to predict RSA with a feed-forward neural network. The result indicated a lower MAE of 15.9 % and a higher PCC of 0.68 on CASP6.
Although these methods for surface accessibility prediction have been developed, several issues remain and make surface accessibility prediction a very challenging task. There are three main reasons: (1) specific biological properties for precisely predicting surface accessibility are not fully exploited, and no single parameter can definitively estimate the accessible surface area, so various combinations of different feature types, including PSSM profiles, secondary structure features, native disorder features and other global sequence features, need to be investigated comprehensively; (2) the performance of the existing methods is still unsatisfactory, especially in independent testing; and (3) high-performance ensemble learning algorithms such as boosted regression trees have not been used intensively in this area.
In this article, we propose a new and efficient approach, PredRSA (Prediction of Relative Solvent Accessible surface area), that integrates the gradient boosted regression trees (GBRT) algorithm with multiple sequence-based features (position-specific scoring matrix, secondary structure, conservation score, native disorder) and a global feature (side-chain environment) to predict RSA. We have benchmarked PredRSA using the Manesh training dataset and an independent dataset. The results show that PredRSA significantly outperforms the state-of-the-art methods and indicate that the GBRT algorithm and the novel feature combination are important determinants in the prediction of RSA.
The GBRT algorithm
where $F_0(x)$ is an initial value and $F_{m-1}(x)$ is the $(m-1)$th additive function, combined from the first to the $(m-1)$th weak regression tree.
The value of the transition point δ depends on the iteration number m.
Finally, the resulting RSA value $y$ corresponding to the amino acid residue $x$ is given by $y = F_M(x)$.
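The additive structure described in this section can be illustrated with a minimal sketch. It is not the PredRSA implementation: it assumes a single scalar feature, depth-1 regression trees (stumps) as the weak learners, and a squared-error loss in place of the robust loss with transition point δ mentioned above. The names `fit_stump`, `fit_gbrt`, `n_trees` and `learning_rate` are hypothetical. Each round fits a weak tree to the current residuals and adds a shrunken copy of it to the running model, so the final prediction is the initial value plus the sum of all M trees.

```python
from statistics import mean


def fit_stump(xs, residuals):
    """Fit a depth-1 regression tree (one threshold) to the residuals."""
    best = None
    for t in sorted(set(xs))[:-1]:  # every threshold that leaves both sides non-empty
        left = [r for x, r in zip(xs, residuals) if x <= t]
        right = [r for x, r in zip(xs, residuals) if x > t]
        lv, rv = mean(left), mean(right)
        sse = sum((r - lv) ** 2 for r in left) + sum((r - rv) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, t, lv, rv)
    if best is None:  # all inputs identical: fall back to a constant tree
        const = mean(residuals)
        return lambda x: const
    _, t, lv, rv = best
    return lambda x: lv if x <= t else rv


def fit_gbrt(xs, ys, n_trees=100, learning_rate=0.1):
    """Build F_M(x) = F_0 + learning_rate * sum of M weak trees (squared-error loss)."""
    f0 = mean(ys)                      # initial value F_0: the mean target
    trees = []
    preds = [f0] * len(xs)
    for _ in range(n_trees):
        # For squared error, the negative gradient is simply the residual.
        residuals = [y - p for y, p in zip(ys, preds)]
        tree = fit_stump(xs, residuals)
        trees.append(tree)
        preds = [p + learning_rate * tree(x) for p, x in zip(preds, xs)]
    return lambda x: f0 + learning_rate * sum(tree(x) for tree in trees)


# Toy usage with made-up data: two buried-like and two exposed-like targets.
model = fit_gbrt([0.0, 1.0, 2.0, 3.0], [0.1, 0.1, 0.8, 0.8])
```

Swapping the residual computation for the gradient of a robust loss would bring the sketch closer to the formulation referenced in the text; the additive update itself would stay the same.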
Sequence encoding schemes
Selecting appropriate features is a crucial step because it directly determines the prediction performance. In this article, we explore various sequence-based features which have been shown to be related to solvent accessibility or have been applied to similar problems. These features include PSSM profiles, PSIPRED-predicted secondary structure, DISOPRED-predicted native disorder, conservation score and side-chain environment compositions. The remainder of this section describes in more detail how these sequence-based features are extracted and encoded.
where x is the score derived from the PSSM profile and x′ is the standardized value of x. For a given residue, its local sequence fragment is extracted with a sliding window and encoded as a 20×(2l+1)-dimensional vector, where l denotes the half window size and L=2l+1 is the whole window length. To select the optimal local window size for RSA prediction, the predictive performance of window sizes L from 3 to 17 has been evaluated. In this encoding scheme, a residue is thus encoded by a 20×L-dimensional vector.
In addition, we introduce the residue conservation score for solvent accessibility prediction. The sequence conservation of a residue is a measure of how often that residue is seen at an equivalent position in equivalent proteins across different species. Generally, the conservation score of a residue increases with its degree of burial. The conservation score is also obtained by a PSI-BLAST search.
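The sliding-window encoding described above can be sketched as follows. Two points are assumptions rather than details taken from the text: the standardization formula is not reproduced in this section, so the sketch uses a logistic squashing function as a stand-in, and window positions that fall beyond either end of the chain are filled with zeros. The function names are hypothetical.

```python
import math


def standardize(score):
    """Assumed standardization: logistic function mapping a PSSM score into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-score))


def encode_windows(pssm, half_window):
    """Encode each residue as a flat vector of n_cols * (2 * half_window + 1) values.

    pssm is a list with one row per residue; each row holds the 20 PSSM scores.
    """
    n_cols = len(pssm[0])
    padding = [0.0] * n_cols  # assumption: zero padding beyond the chain termini
    features = []
    for i in range(len(pssm)):
        vector = []
        for j in range(i - half_window, i + half_window + 1):
            if 0 <= j < len(pssm):
                vector.extend(standardize(v) for v in pssm[j])
            else:
                vector.extend(padding)
        features.append(vector)
    return features


# Toy usage: a 5-residue chain with all-zero scores and l = 3, i.e. L = 7.
toy_pssm = [[0] * 20 for _ in range(5)]
toy_features = encode_windows(toy_pssm, half_window=3)
```

With l = 3 each residue receives 20 × 7 = 140 values, which matches the 20×L dimensionality stated in the text. The same windowing would apply to the three-column secondary structure profile and the two-column disorder profile if standardization were skipped for those inputs.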
PSIPRED-predicted secondary structure information
In this work, we use the PSIPRED program to predict the secondary structure information. PSIPRED provides highly accurate predictions of protein secondary structure by applying a feed-forward neural network. The outputs of PSIPRED are encoded as probability profiles over three secondary structure states (C for coil, H for helix and E for strand). Previous work has shown that incorporating PSIPRED-predicted secondary structure information can significantly improve prediction performance.
Analogously, for a given residue, its three-state secondary structure profiles are extracted and encoded using a sliding window of L=2l+1 consecutive residues. Therefore, in this encoding scheme, a residue is encoded as a 3×L=3×(2l+1)-dimensional vector.
DISOPRED-predicted native disorder information
In the past decade, disordered or unstructured protein regions have received considerable attention because they are commonly responsible for important protein functions. As such, there has been an increasing interest in studying such regions in proteins. Unstructured regions are found to be associated with molecular assembly, protein modification and molecular recognition. Research shows that unstructured regions have a large solvent accessible area, which explains why polar and charged residues, which interact favorably with water, are prevalent in these regions. Disordered regions are therefore strongly correlated with local solvent accessibility. Local solvent accessibility values are also often used to find disordered regions.
To further improve the performance, in this study we use the DISOPRED program to output the predicted probability of each residue being natively disordered or ordered. Similarly, a residue is encoded by a 2×L=2×(2l+1)-dimensional vector in this encoding scheme.
The concept of side-chain environment was first proposed by Eisenberg et al. and used to identify protein sequences that fold into a known three-dimensional structure. Li et al. then utilized it for the prediction of protein-protein binding sites.
Framework of PredRSA
To determine the optimal local sliding window size L and the iterative tree number M, we calculate the prediction performance for L in the range of 3–17 with a step of 2 and M in the range of 100–1500 with a step of 50 using a grid search method. With L=7 and M=800, the PredRSA approach achieves the best performance for the RSA prediction.
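The grid search described above can be sketched as a loop over every (L, M) pair. The `evaluate` callback is hypothetical: in practice it would train a model with window size L and M trees and return a cross-validated error such as the MAE. The toy scoring function in the usage lines is invented purely so that the example runs, and it is deliberately shaped to have its minimum at the reported optimum; it does not reproduce any measured result.

```python
from itertools import product


def grid_search(evaluate, window_sizes, tree_counts):
    """Return (best_score, L, M) for the pair that minimizes evaluate(L, M)."""
    best = None
    for window, n_trees in product(window_sizes, tree_counts):
        score = evaluate(window, n_trees)
        if best is None or score < best[0]:
            best = (score, window, n_trees)
    return best


# Ranges stated in the text: L from 3 to 17 in steps of 2, M from 100 to 1500 in steps of 50.
WINDOW_SIZES = range(3, 18, 2)
TREE_COUNTS = range(100, 1501, 50)


def toy_error(window, n_trees):
    """Invented stand-in for a cross-validated error; not real data."""
    return abs(window - 7) + abs(n_trees - 800) / 1000.0


best_score, best_window, best_trees = grid_search(toy_error, WINDOW_SIZES, TREE_COUNTS)
```

The two ranges give 8 window sizes and 29 tree counts, so an exhaustive search trains 232 configurations, each of which would be cross-validated in a real run.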
Results and discussion
Two non-homologous datasets of protein chains with pair-wise sequence similarity of less than 25 % have been used in order to objectively compare our approach with previously developed methods. One dataset consists of 215 proteins and was used earlier by Manesh et al. for predicting the solvent-accessible surface area of residues. The other dataset consists of 502 proteins, obtained from the Cuff and Barton dataset of 513 proteins by removing sequences with fewer than 30 residues. These two datasets are referred to as Manesh-215 and CB-502, respectively. Since the Manesh-215 dataset has been widely used by researchers to benchmark prediction methods, we use it as the main dataset for evaluation and analysis to allow direct comparison.
To further evaluate the performance of existing methods and the method developed in the present study, we also generate an independent dataset of CASP10 proteins. Originally, it contains 85 proteins, and we have removed 17 structures (with their chains) by using the PISCES culling server with a 25 % sequence similarity cutoff, retaining X-ray structures (resolution better than 3.0 Å and R-factor below 0.3) and NMR structures that contain more than 50 residues. The remaining 68 proteins are used for the independent test.
Calculation of RSA
In this work, we take relative solvent accessibility, also called relative solvent accessible surface area (RSA), as the prediction target. The RSA of a residue in a protein chain is a normalized value between 0 and 1. It is calculated by dividing the solvent accessible surface area (ASA) of the residue by its maximum solvent accessibility, taken from Manesh's work, which uses Gly-X-Gly extended tripeptides. The values of ASA are calculated using DSSP for all considered protein structures.
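The normalization described above amounts to a single division, sketched below. The function name is hypothetical, and clipping the ratio to the interval from 0 to 1 is an assumption made here so that a residue whose observed ASA exceeds its reference maximum still yields a valid RSA. No reference maxima are hard-coded, since those values come from the cited Gly-X-Gly tripeptide table, and the numbers in the usage line are illustrative only.

```python
def relative_solvent_accessibility(asa, max_asa):
    """Return RSA = ASA / maximum ASA, clipped to the interval [0, 1].

    asa:     observed accessible surface area of the residue (e.g. from DSSP)
    max_asa: reference maximum for that residue type (Gly-X-Gly tripeptide)
    """
    if max_asa <= 0:
        raise ValueError("max_asa must be positive")
    return min(max(asa / max_asa, 0.0), 1.0)


# Illustrative numbers only, not taken from any reference table.
example_rsa = relative_solvent_accessibility(asa=50.0, max_asa=200.0)
```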
To measure the performance of real-valued RSA predictions, three widely used measures are adopted in this study.
where $N$ is the total number of residues in the protein sequence to be predicted; $x_i$ and $y_i$ are the experimental and predicted RSA values of the $i$-th residue, respectively; and $\bar{x}$ and $\bar{y}$ are their corresponding means. PCC = 1 indicates that the two sets of values are fully correlated, while PCC = 0 indicates that they are completely uncorrelated.
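The two real-valued measures named in the abstract, MAE and PCC, can be computed as in the sketch below. It uses their standard textbook definitions with the symbols defined above (experimental values x, predicted values y, N residues); the function names are hypothetical and the usage data are made up.

```python
import math


def mean_absolute_error(observed, predicted):
    """MAE: average absolute difference between experimental and predicted RSA."""
    n = len(observed)
    return sum(abs(x - y) for x, y in zip(observed, predicted)) / n


def pearson_correlation(observed, predicted):
    """PCC: covariance of the two series divided by the product of their spreads."""
    n = len(observed)
    mean_x = sum(observed) / n
    mean_y = sum(predicted) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(observed, predicted))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in observed))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in predicted))
    return cov / (spread_x * spread_y)


# Made-up values: every prediction is offset from the experimental value by 0.1.
observed_rsa = [0.1, 0.4, 0.7]
predicted_rsa = [0.2, 0.5, 0.8]
mae = mean_absolute_error(observed_rsa, predicted_rsa)
pcc = pearson_correlation(observed_rsa, predicted_rsa)
```

A constant offset leaves the PCC at its maximum while the MAE stays non-zero, which is why the two measures are reported together.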
where $N$ is the total number of residues in a chain, and $N_b$ and $N_e$ represent the number of residues correctly predicted as buried and exposed, respectively. TP, TN, FP and FN are the numbers of true positives, true negatives, false positives and false negatives, respectively.
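For the two-state evaluation, the quantities defined above can be combined as in the following sketch. Several choices are assumptions: a residue is called exposed when its RSA is at or above the threshold, exposed is treated as the positive class, the accuracy is taken as (N_b + N_e) / N, the Matthews correlation coefficient (MCC) uses its standard definition, and the MCC is reported as 0 when its denominator vanishes. The function names are hypothetical.

```python
import math


def to_two_state(rsa_values, threshold):
    """Map real-valued RSA to states: True = exposed, False = buried."""
    return [value >= threshold for value in rsa_values]


def two_state_metrics(observed, predicted, threshold=0.25):
    """Return (accuracy, mcc) after thresholding both series at the same cutoff."""
    obs = to_two_state(observed, threshold)
    pred = to_two_state(predicted, threshold)
    tp = sum(o and p for o, p in zip(obs, pred))            # exposed, predicted exposed
    tn = sum(not o and not p for o, p in zip(obs, pred))    # buried, predicted buried
    fp = sum(not o and p for o, p in zip(obs, pred))
    fn = sum(o and not p for o, p in zip(obs, pred))
    accuracy = (tp + tn) / len(obs)                         # (N_e + N_b) / N
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return accuracy, mcc


# Made-up values: one residue of each outcome type at a 25 % threshold.
accuracy, mcc = two_state_metrics([0.05, 0.10, 0.40, 0.60], [0.10, 0.30, 0.50, 0.20])
```

The dependence on the cutoff is easy to see with `to_two_state`: a residue with an RSA of 0.15 is exposed at a threshold of 0.10 but buried at 0.20, which is the ambiguity that motivates real-valued prediction.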
Effect of different sequence encoding schemes on the prediction performance
Prediction of real-valued RSA using the GBRT algorithm based on five different sequence encoding schemes that incorporate various combinations of sequence features
Performance comparison with other regression approaches
Performances comparison in predicting real values: PredRSA vs. other existing methods
Performance comparison for two-state prediction
Prediction performance of two-state classification based on different thresholds
Performance comparison of two-state classification: PredRSA vs. other existing predictors
Independent test on the CASP10 dataset
An independent test (CASP10) is constructed to further validate the usability of our PredRSA method. We train the models on the Manesh-215 dataset and test them against the CASP10 dataset, which contains 68 proteins. Other state-of-the-art methods, including SPINE-X and ASAquick, are also evaluated. SPINE-X uses a multistep neural-network algorithm that couples secondary structure prediction with prediction of solvent accessibility and backbone torsion angles in an iterative manner, while ASAquick utilizes solely sequential window information and global features with a general neural network method. The Pearson correlation coefficient of PredRSA is 0.71, which is 0.02 and 0.04 higher than those of SPINE-X (0.69) and ASAquick (0.67), respectively.
Residue-specific variation in prediction error
Knowledge of residue solvent accessibility gives useful insights into protein structure and function prediction. In this work, we have presented PredRSA to predict the real-valued relative solvent accessibility as well as the classification state (buried or exposed) of a target residue. The method is based on a gradient boosted regression trees (GBRT) algorithm combined with a novel set of features. The 5-fold cross-validated correlation coefficient between predicted and experimental RSA (0.75) is significantly better than that of existing methods on the Manesh-215 dataset. We also performed additional independent benchmark tests of PredRSA on the CASP10 set containing 68 proteins, where we find that the proposed method outperforms existing methods. Furthermore, for prediction of discrete states, our method achieves an accuracy of 79.7 % with an MCC value of 0.56 using two-state classification at a threshold of 25 %, which defines an approximately balanced division into the two classes.
Experimental results show that GBRT is an efficient machine learning approach for predicting continuous values of the solvent accessibility of a target residue. Compared with other traditional techniques, GBRT has several clear advantages, such as high prediction accuracy and stronger generalization capability.
In addition, PredRSA utilizes a variety of sequence-derived features, including the position-specific scoring matrices and conservation score in the form of PSI-BLAST profiles, predicted secondary structure, predicted natively disordered regions and side-chain environment. We have comprehensively assessed the effects of different sequence encoding schemes on the prediction performance of RSA, and the results show that PredRSA outperforms previous methods. Our work provides a complementary and useful approach towards more accurate prediction of protein solvent accessibility.
The publication fee of this article is funded by National Natural Science Foundation of China under grant No.61309010. This article has been published as part of BMC Bioinformatics Volume 17 Supplement 1, 2016: Selected articles from the Fourteenth Asia Pacific Bioinformatics Conference (APBC 2016). The full contents of the supplements are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/17/S1.
This work was supported by National Natural Science Foundation of China under grants No. 61309010 and No. 61379057, China Postdoctoral Science Foundation under grant no. 2015T80886, Specialized Research Fund for the Doctoral Program of Higher Education of China under grant no. 20130162120073 and Shanghai Key Laboratory of Intelligent Information Processing under grant no. IIPL-2014-002.
- Lee B, Richards FM: The interpretation of protein structures: estimation of static accessibility. J mol biol. 1971, 55 (3): 379-4. 10.1016/0022-2836(71)90324-X.
- Eyal E, Najmanovich R, Mcconkey BJ, Edelman M, Sobolev V: Importance of solvent accessibility and contact surfaces in modeling side-chain conformations in proteins. J comput chem. 2004, 25 (5): 712-24. 10.1002/jcc.10420.
- Rost B, Sander C: Conservation and prediction of solvent accessibility in protein families. Proteins Struct Funct Genet. 1994, 20 (3): 216-26. 10.1002/prot.340200303.
- Wodak SJ, Janin J: Location of structural domains in proteins. Biochem. 1981, 20 (23): 6544-52. 10.1021/bi00526a005.
- Liu S, Zhang C, Liang S, Zhou Y: Fold recognition by concurrent use of solvent accessibility and residue depth. Proteins Struct Funct Genet. 2007, 68 (3): 636-45. 10.1002/prot.21459.
- Eisenberg D, McLachlan AD: Solvation energy in protein folding and binding. Nature. 1986, 319 (6050): 199-203. 10.1038/319199a0.
- Mooney C, Pollastri G, Shields DC, Haslam NJ: Prediction of short linear protein binding regions. J mol biol. 2012, 415 (1): 193-204. 10.1016/j.jmb.2011.10.025.
- Zhang QC, Deng L, Fisher M, Guan J, Honig B, Petrey D: Predus: a web server for predicting protein interfaces using structural neighbors. Nucleic acids res. 2011, 39 (suppl 2): 283-7. 10.1093/nar/gkr311.
- He B, Wang K, Liu Y, Xue B, Uversky VN, Dunker AK: Predicting intrinsic disorder in proteins: an overview. Cell res. 2009, 19 (8): 929-49. 10.1038/cr.2009.87.
- Huang B, Schroeder M: Ligsitecsc: predicting ligand binding sites using the connolly surface and degree of conservation. BMC structural biol. 2006, 6 (1): 19. 10.1186/1472-6807-6-19.
- Naderi-Manesh H, Sadeghi M, Arab S, Moosavi Movahedi AA: Prediction of protein surface accessibility with information theory. Proteins Struct Funct Bioinforma. 2001, 42 (4): 452-9. 10.1002/1097-0134(20010301)42:4<452::AID-PROT40>3.0.CO;2-Q.
- Ahmad S, Gromiha MM: Netasa: neural network based prediction of solvent accessibility. Bioinforma. 2002, 18 (6): 819-24. 10.1093/bioinformatics/18.6.819.
- Yuan Z, Burrage K, Mattick JS: Prediction of protein solvent accessibility using support vector machines. Proteins Struct Funct Bioinforma. 2002, 48 (3): 566-70. 10.1002/prot.10176.
- Kim H, Park H: Prediction of protein relative solvent accessibility with support vector machines and long-range interaction 3d local descriptor. Proteins Struct Funct Bioinforma. 2004, 54 (3): 557-62. 10.1002/prot.10602.
- Sim J, Kim SY, Lee J: Prediction of protein solvent accessibility using fuzzy k-nearest neighbor method. Bioinforma. 2005, 21 (12): 2844-9. 10.1093/bioinformatics/bti423.
- Wang JY, Lee HM, Ahmad S: Prediction and evolutionary information analysis of protein solvent accessibility using multiple linear regression. Proteins Struct Funct Bioinforma. 2005, 61 (3): 481-91. 10.1002/prot.20620.
- Yuan Z, Huang B: Prediction of protein accessible surface areas by support vector regression. Proteins Struct Funct Bioinforma. 2004, 57 (3): 558-64. 10.1002/prot.20234.
- Xu W, Li A, Wang X, Jiang Z, Feng H: Improving prediction of residue solvent accessibility with svr and multiple sequence alignment profile. Conf Proc IEEE Eng Med Biol Soc. 2005, 3: 2595-8.
- Nguyen MN, Rajapakse JC: Two-stage support vector regression approach for predicting accessible surface areas of amino acids. Proteins Struct Funct Bioinforma. 2006, 63 (3): 542-50. 10.1002/prot.20883.
- Ahmad S, Gromiha MM, Sarai A: Real value prediction of solvent accessibility from amino acid sequence. Proteins Struct Funct Bioinforma. 2003, 50 (4): 629-35. 10.1002/prot.10328.
- Adamczak R, Porollo A, Meller J: Accurate prediction of solvent accessibility using neural networks–based regression. Proteins Struct Funct Bioinforma. 2004, 56 (4): 753-67. 10.1002/prot.20176.
- Xu Z, Zhang C, Liu S, Zhou Y: Qbes: predicting real values of solvent accessibility from sequences by efficient, constrained energy optimization. Proteins Struct Funct Bioinforma. 2006, 63 (4): 961-6. 10.1002/prot.20934.
- Joo K, Lee SJ, Lee J: Sann: solvent accessibility prediction of proteins by nearest neighbor method. Proteins Struct Funct Bioinforma. 2012, 80 (7): 1791-7.
- Bateman A, Birney E, Cerruti L, Durbin R, Etwiller L, Eddy SR, et al: The pfam protein families database. Nucleic acids res. 2002, 30 (1): 276-80. 10.1093/nar/30.1.276.
- Garg A, Kaur H, Raghava G: Real value prediction of solvent accessibility in proteins using multiple sequence alignment and secondary structure. Proteins Struct Funct Bioinforma. 2005, 61 (2): 318-24. 10.1002/prot.20630.
- Song J, Tan H, Wang M, Webb GI, Akutsu T: Tangle: two-level support vector regression approach for protein backbone torsion angle prediction from primary sequences. PloS ONE. 2012, 7 (2): 30361. 10.1371/journal.pone.0030361.
- Friedman JH: Greedy function approximation: a gradient boosting machine. Ann Stat. 2001, 29: 1189-232.
- Huber PJ: Robust estimation of a location parameter. Ann Math Stat. 1964, 35 (1): 73-101. 10.1214/aoms/1177703732.
- Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, et al: Gapped blast and psi-blast: a new generation of protein database search programs. Nucleic acids res. 1997, 25 (17): 3389-402. 10.1093/nar/25.17.3389.
- Deng L, Guan J, Wei X, Yi Y, Zhang QC, Zhou S: Boosting prediction performance of protein–protein interaction hot spots by using structural neighborhood properties. J Comput Biol. 2013, 20 (11): 878-91. 10.1089/cmb.2013.0083.
- Deng L, Zhang QC, Chen Z, Meng Y, Guan J, Zhou S: PredHS: a web server for predicting protein–protein interaction hot spots by using structural neighborhood properties. Nucleic acids res. 2014, 42: W290-295. 10.1093/nar/gku437.
- Jones DT: Protein secondary structure prediction based on position-specific scoring matrices. J mol biol. 1999, 292 (2): 195-202. 10.1006/jmbi.1999.3091.
- Ward JJ, Sodhi JS, McGuffin LJ, Buxton BF, Jones DT: Prediction and functional analysis of native disorder in proteins from the three kingdoms of life. J mol biol. 2004, 337 (3): 635-45. 10.1016/j.jmb.2004.02.002.
- Bowie JU, Luthy R, Eisenberg D: A method to identify protein sequences that fold into a known three-dimensional structure. Science. 1991, 253 (5016): 164-70. 10.1126/science.1853201.
- Zhang J, Zhao X, Sun P, Ma Z: Psno: predicting cysteine s-nitrosylation sites by incorporating various sequence-derived features into the general form of chous pseaac. Int J Mol Sci. 2014, 15 (7): 11204-19. 10.3390/ijms150711204.
- Song J, Burrage K, Yuan Z, Huber T: Prediction of cis/trans isomerization in proteins using psi-blast profiles and secondary structure information. BMC bioinforma. 2006, 7 (1): 124. 10.1186/1471-2105-7-124.
- Chen K, Kurgan L: Pfres: protein fold classification by using evolutionary information and predicted secondary structure. Bioinforma. 2007, 23 (21): 2843-50. 10.1093/bioinformatics/btm475.
- Mizianty MJ, Kurgan L: Improved identification of outer membrane beta barrel proteins using primary sequence, predicted secondary structure, and evolutionary information. Proteins Struct Funct Bioinforma. 2011, 79 (1): 294-303. 10.1002/prot.22882.
- Li N, Sun Z, Jiang F: Prediction of protein-protein binding site by using core interface residue and support vector machine. BMC bioinforma. 2008, 9 (1): 553. 10.1186/1471-2105-9-553.
- Deng L, Guan J, Dong Q, Zhou S: Prediction of protein-protein interaction sites using an ensemble method. BMC bioinforma. 2009, 10 (1): 426. 10.1186/1471-2105-10-426.
- Pugalenthi G, Kumar Kandaswamy K, Chou KC, Vivekanandan S, Kolatkar P: Rsarf: prediction of residue solvent accessibility from protein sequence using random forest method. Protein and peptide letters. 2012, 19 (1): 50-6. 10.2174/092986612798472875.
- Dyson HJ, Wright PE: Intrinsically unstructured proteins and their functions. Nat Rev Mol Cell Biol. 2005, 6 (3): 197-208. 10.1038/nrm1589.
- Haynes C, Oldfield CJ, Ji F, Klitgord N, Cusick ME, Radivojac P, et al: Intrinsic disorder is a common feature of hub proteins from four eukaryotic interactomes. PLoS Comput Biol. 2006, 2 (8): 100. 10.1371/journal.pcbi.0020100.
- Gsponer J, Futschik ME, Teichmann SA, Babu MM: Tight regulation of unstructured proteins: from transcript synthesis to protein degradation. Science. 2008, 322 (5906): 1365-8. 10.1126/science.1163581.
- Schlessinger A, Punta M, Yachdav G, Kajan L, Rost B: Improved disorder prediction by combination of orthogonal approaches. PLoS ONE. 2009, 4 (2): 4433. 10.1371/journal.pone.0004433.
- Zhang H, Zhang T, Chen K, Shen S, Ruan J, Kurgan L: On the relation between residue flexibility and local solvent accessibility in proteins. Proteins Struct Funct Bioinforma. 2009, 76 (3): 617-36. 10.1002/prot.22375.
- Marsh JA: Buried and accessible surface area control intrinsic protein flexibility. J mol biol. 2013, 425 (17): 3250-63. 10.1016/j.jmb.2013.06.019.
- Cuff JA, Barton GJ: Application of multiple sequence alignment profiles to improve protein secondary structure prediction. Proteins Struct Funct Bioinforma. 2000, 40 (3): 502-11. 10.1002/1097-0134(20000815)40:3<502::AID-PROT170>3.0.CO;2-Q.
- Wang JY, Ahmad S, Gromiha MM, Sarai A: Look-up tables for protein solvent accessibility prediction and nearest neighbor effect analysis. Biopolymers. 2004, 75 (3): 209-16. 10.1002/bip.20113.
- The CASP10 Database. http://predictioncenter.org/casp10/groups_analysis.cgi. Accessed 2012.
- Wang G, Dunbrack RL: Pisces: a protein sequence culling server. Bioinforma. 2003, 19 (12): 1589-91. 10.1093/bioinformatics/btg224.
- Kabsch W, Sander C: Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers. 1983, 22 (12): 2577-637. 10.1002/bip.360221211.
- Faraggi E, Xue B, Zhou Y: Improving the prediction accuracy of residue solvent accessibility and real-value backbone torsion angles of proteins by guided-learning through a two-layer neural network. Proteins Struct Funct Bioinforma. 2009, 74 (4): 847-56. 10.1002/prot.22193.
- Chang DT, Huang HY, Syu YT, Wu CP: Real value prediction of protein solvent accessibility using enhanced pssm features. BMC bioinforma. 2008, 9 (Suppl 12): 12. 10.1186/1471-2105-9-S12-S12.
- Petersen B, Petersen TN, Andersen P, Nielsen M, Lundegaard C: A generic method for assignment of reliability scores applied to solvent accessibility predictions. BMC Struct Biol. 2009, 9 (1): 51. 10.1186/1472-6807-9-51.
- Chothia C: The nature of the accessible and buried surfaces in proteins. J mol biol. 1976, 105 (1): 1-12. 10.1016/0022-2836(76)90191-1.
- Oobatake M, Ooi T: Hydration and heat stability effects on protein unfolding. Prog Biophys Mol Biol. 1993, 59 (3): 237-84. 10.1016/0079-6107(93)90002-2.
- Meshkin A, Sadeghi M, Ghasem-Aghaee N: Prediction of relative solvent accessibility using pace regression. EXCLI J. 2009, 8: 211-7.
- Faraggi E, Zhang T, Yang Y, Kurgan L, Zhou Y: Spine x: improving protein secondary structure prediction by multistep learning coupled with prediction of solvent accessible surface area and backbone torsion angles. J comput chem. 2012, 33 (3): 259-67. 10.1002/jcc.21968.
- Faraggi E, Zhou Y, Kloczkowski A: Accurate single-sequence prediction of solvent accessible surface area using local and global features. Proteins Struct Funct Bioinforma. 2014, 82 (11): 3170-6. 10.1002/prot.24682.
This article is published under license to BioMed Central Ltd. Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.