Template-based C8-SCORPION: a protein 8-state secondary structure prediction method using structural information and context-based features
© Yaseen and Li; licensee BioMed Central Ltd. 2014
Published: 14 July 2014
Secondary structure prediction of proteins is important to many protein structure modeling applications. Correct prediction of secondary structures can significantly reduce the degrees of freedom in protein tertiary structure modeling and therefore reduce the difficulty of obtaining high-resolution 3D models.
In this work, we investigate a template-based approach to enhance 8-state secondary structure prediction accuracy. We construct structural templates from known protein structures with certain sequence similarity. The structural templates are then incorporated as features with sequence and evolutionary information to train two-stage neural networks. When structural templates are absent, heuristic structural information is incorporated instead.
After applying the template-based 8-state secondary structure prediction method, the 7-fold cross-validated Q8 accuracy is 78.85%. Even templates from structures with only 20%~30% sequence similarity can help improve the 8-state prediction accuracy. More importantly, when good templates are available, the prediction accuracy of less frequent secondary structures, such as 3-10 helices, turns, and bends, is greatly improved, which is useful for practical applications.
Our computational results show that the templates containing structural information are effective features to enhance 8-state secondary structure predictions. Our prediction algorithm is implemented on a web server named "C8-SCORPION" available at: http://hpcr.cs.odu.edu/c8scorpion.
An important intermediate step in modeling the three-dimensional structure of a protein is to accurately predict its secondary structures. Most often, the secondary structures are classified into three general states, i.e., helices (H), strands (E), and coils (C). Correspondingly, success of secondary structure prediction is typically measured by the Q3 (3-state) accuracy. Many machine learning methods, including statistical analysis, neural networks, hidden Markov models, and support vector machines, have been developed to predict secondary structures. Accordingly, many secondary structure prediction servers are available, including GOR4, PSI-Pred, PHD, SAM, Porter, JPred, SPINE, SSPRO, NETSURF, and many others. Modern secondary structure prediction servers can generate prediction results with close to 80% Q3 accuracy.
Compared to the general three secondary structure states, the DSSP program provides a more detailed classification by assigning secondary structures to eight states: 3-10 helix (G), α-helix (H), π-helix (I), β-strand (E), bridge (B), turn (T), bend (S), and others (C). The 8-state secondary structures convey more precise structural information than the 3-state ones, which is particularly important for a variety of applications. For example, accurate 8-state secondary structure predictions can restrict the variations of backbone dihedral angles to a small range according to the Ramachandran plots and thus reduce the search space in template-free protein tertiary structure modeling. Moreover, differentiating among 3-10 helix, α-helix, and π-helix in secondary structure prediction helps to assign residues and fit protein structure models in cryo-electron microscopy density maps. Unfortunately, most secondary structure prediction software packages and servers provide only 3-state predictions.
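To make the relation between the eight DSSP states and the three general states concrete, the following minimal sketch collapses an 8-state string into a 3-state string. The mapping shown (G, H, I to helix; E, B to strand; everything else to coil) is one widely used reduction and is an assumption here; the article does not state which reduction it has in mind, and the function name is hypothetical.

```python
# One common reduction of the eight DSSP states to three states.
# Assumption: this particular mapping is illustrative, not taken from the article.
DSSP8_TO_3 = {
    "H": "H", "G": "H", "I": "H",  # alpha-, 3-10 and pi-helix -> helix
    "E": "E", "B": "E",            # strand and bridge -> strand
    "T": "C", "S": "C", "C": "C",  # turn, bend and others -> coil
}


def reduce_to_three_states(ss8: str) -> str:
    """Map an 8-state string to a 3-state string of the same length."""
    return "".join(DSSP8_TO_3[state] for state in ss8)


if __name__ == "__main__":
    print(reduce_to_three_states("CHHHHGGGTSEEEEBC"))  # CHHHHHHHCCEEEEEC
```

The reduction is many-to-one, which is why an 8-state prediction carries strictly more information than the corresponding 3-state prediction.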
[Table: Prediction accuracy of RaptorXss8 on the CB513, CASP9, Manesh215, and Carugo338 benchmarks.]
Most current secondary structure prediction methods do not rely on similarity to known protein structures; in other words, these methods are de novo, where the secondary structure prediction is based on sequence information only. However, we cannot neglect the fact that many protein sequences have some degree of similarity among themselves. In fact, over half of all known protein sequences have some detectable similarity (higher than 25%) to one or more sequences of known structures [15, 16]. It has been reported that around 75% of newly deposited protein structures in the PDB database show significant similarity to previously deposited structures. Consequently, taking advantage of the structural similarity of proteins with sequence similarity may lead to significant improvement in protein structure prediction. Indeed, the latest version of Porter has used homology-based templates for 3-state secondary structure prediction. Porter has been reported to achieve improved prediction accuracy when known structures with >30% sequence similarity are available, and even to reach the theoretical upper bound of secondary structure prediction when such sequence similarity is higher than 50%.
In this paper, we investigate the template-based method for 8-state secondary structure prediction. We extract structural information from known structures of chains with certain sequence similarity to build structural templates. Then, the structural information contained in the templates is incorporated (as features) together with sequence and evolutionary information for neural network training and validation.
In the case where structural information from the structural template is not available for a residue, context-based scores estimating the favorability of that residue adopting a secondary structure conformation in the presence of its neighbors in sequence are used instead. The fundamental idea of the context-based scores is that the formation of secondary structure exhibits strong local dependency; in particular, residues in a protein sequence are strongly correlated at different sequence positions in coils, β-sheets, 3-10 helices, α-helices, and π-helices. We extract statistics to derive context-based scores from a large training data set. These context-based scores are then incorporated as sequence-structure features together with sequence, template, and evolutionary information in the neural network training process for 8-state secondary structure prediction.
We test our template-based 8-state prediction method on several popularly used benchmarks including CB513, Manesh215, and Carugo338 as well as the CASP9 targets. The prediction accuracies for the eight states are analyzed.
The protein data sets
We use the protein chain dataset Cull5547, generated by the PISCES server on 10/21/2011, for neural network training and Cull16633 for context-based score generation. Cull5547 contains 5,547 protein chains with at most 25% sequence identity and a 2.0 Å resolution cutoff, and Cull16633 contains 16,633 protein chains with at most 50% sequence identity and a 3.0 Å resolution cutoff. We eliminate very short chains, whose lengths are less than 40 residues, since the PSI-BLAST program is usually unable to generate profiles for very short sequences, as well as very large chains, whose lengths are greater than 1,000 residues. We also eliminate residue samples with undetermined secondary structures.
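The filtering rules above can be sketched as follows. This is only an illustration under stated assumptions: the input layout (a dictionary from chain id to a sequence and its 8-state string), the function names, and the "X" marker for residues with undetermined secondary structure are all hypothetical; only the 40 and 1,000 residue limits come from the text.

```python
from typing import Dict, List, Tuple

MIN_LEN, MAX_LEN = 40, 1000  # chain length limits stated in the text
UNDETERMINED = "X"           # hypothetical marker for an unassigned state

Chain = Tuple[str, str]      # (amino acid sequence, 8-state string)


def filter_chains(chains: Dict[str, Chain]) -> Dict[str, Chain]:
    """Keep chains whose length is between 40 and 1,000 residues inclusive."""
    return {
        chain_id: (seq, ss)
        for chain_id, (seq, ss) in chains.items()
        if MIN_LEN <= len(seq) <= MAX_LEN
    }


def labelled_positions(ss: str) -> List[int]:
    """Indices of residues that have a determined secondary structure state."""
    return [i for i, state in enumerate(ss) if state != UNDETERMINED]
```

Dropping undetermined residues at the sample level, as in `labelled_positions`, keeps the chain intact so that neighboring residues can still contribute to sliding-window features.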
Public benchmarks, including CB513, Manesh215, Carugo338, and the recent CASP9 targets, which are widely employed as benchmarks for 3-state secondary structure predictions, are used to benchmark our method in 8-state predictions.
We use a window size of 15 residues for input encodings. Each residue is represented with 20 values from the PSSM (Position-Specific Scoring Matrix) data, 1 extra input to indicate whether the residue window overlaps the C- or N-terminus, 1 value for the degree of similarity, and 8 values for structural information from the template or context-based secondary structure scores. Hence, with 30 values per window position and 15 positions, a total of 450 values are used to describe each residue.
The types and conformations of nearby residues play a critical role in the secondary structure conformation that a residue may adopt. In particular, the hydrogen bonds between residues at positions i and i + 3, i and i + 4, and i and i + 5 lead to the formation of 3-10 helices, α-helices, and π-helices, respectively. Residues in contacting parallel or anti-parallel β-sheets are connected by hydrogen bonds at alternating positions. Moreover, the formation of interactions within coils beyond nearest neighbors appears not to contribute with statistical significance in determining coil structure. Hence, correlations among residues provide significant information in predicting secondary structure.
In this method, we extract statistics of singlet, doublet, and triplet residues at different relative positions from protein sequences in the Cull16633 dataset. These statistics represent estimates of the probabilities of residues adopting a specific structural state when none, one, or two of their neighbors in context are taken into consideration, respectively.
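The arithmetic behind the 450 inputs (15 window positions times 20 + 1 + 1 + 8 = 30 values) can be illustrated with a small sketch. The ordering of values within a window slot, the zero padding beyond the chain ends, and the per-slot terminal flag are assumptions made for illustration; the article only lists which values are included. All names are hypothetical.

```python
from typing import List, Sequence

WINDOW = 15                   # window size stated in the text
PER_RESIDUE = 20 + 1 + 1 + 8  # PSSM + terminal flag + similarity + structure


def encode_window(
    pssm: Sequence[Sequence[float]],       # one 20-value PSSM row per residue
    similarity: Sequence[float],           # one similarity value per residue
    structure: Sequence[Sequence[float]],  # one 8-value row per residue
    center: int,
) -> List[float]:
    """Build the 450-value input vector for the residue at index `center`."""
    half = WINDOW // 2
    features: List[float] = []
    for pos in range(center - half, center + half + 1):
        if 0 <= pos < len(pssm):
            features.extend(pssm[pos])
            features.append(0.0)            # slot lies inside the chain
            features.append(similarity[pos])
            features.extend(structure[pos])
        else:
            features.extend([0.0] * 20)     # assumed zero padding
            features.append(1.0)            # slot lies beyond a terminus
            features.append(0.0)
            features.extend([0.0] * 8)
    return features
```

With this layout every residue, including those near the termini, yields a vector of exactly `WINDOW * PER_RESIDUE` = 450 values.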
where is the PSSM frequency for residue type at the jth position of a protein sequence.
These pseudo-potentials are incorporated as context-based scores representing sequence-structure features in neural network training when structural information from templates is not available.
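As a rough illustration of the statistics described above, the sketch below counts how often a residue type is observed in each structural state, alone (singlets) and together with a neighbor at a given sequence offset (doublets). Triplets follow the same pattern and are omitted for brevity. This only shows the counting step; the conversion of such statistics into pseudo-potentials and the PSSM weighting are not reproduced, and the function names, data layout, and `max_offset` parameter are hypothetical.

```python
from collections import Counter
from typing import Iterable, Tuple


def context_counts(chains: Iterable[Tuple[str, str]], max_offset: int = 2):
    """Count singlet and doublet occurrences per structural state.

    `chains` yields (sequence, 8-state string) pairs of equal length.
    """
    singlet = Counter()  # key: (residue, state)
    doublet = Counter()  # key: (residue, neighbor, offset, state)
    for seq, ss in chains:
        for i, (aa, state) in enumerate(zip(seq, ss)):
            singlet[(aa, state)] += 1
            for k in range(1, max_offset + 1):
                if i + k < len(seq):
                    doublet[(aa, seq[i + k], k, state)] += 1
    return singlet, doublet


def singlet_probability(singlet: Counter, aa: str, state: str) -> float:
    """Estimate P(state | residue type) from the singlet counts."""
    total = sum(n for (a, _), n in singlet.items() if a == aa)
    return singlet[(aa, state)] / total if total else 0.0
```

For example, with the two toy chains `("AAG", "HHC")` and `("AG", "EC")`, alanine is seen twice in state H and once in state E, so `singlet_probability(singlet, "A", "H")` evaluates to 2/3.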
Neural network model
The prediction accuracy is calculated as the average of the seven prediction scores. We use both Q8 and SOV8 (segment overlap) scores to measure the quality of our 8-state secondary structure predictions.
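The Q8 score is the percentage of residues whose predicted state matches the observed one among the eight states, and the reported accuracy averages the scores of the individual folds. A minimal sketch, with hypothetical function names, is given below; SOV8 is segment-based and is not covered here.

```python
from typing import Sequence


def q8_accuracy(predicted: str, observed: str) -> float:
    """Percentage of residues whose predicted 8-state label is correct."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    if not observed:
        raise ValueError("at least one residue is required")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)


def mean_accuracy(fold_scores: Sequence[float]) -> float:
    """Average of the per-fold scores (seven folds in the article)."""
    return sum(fold_scores) / len(fold_scores)


if __name__ == "__main__":
    print(q8_accuracy("HHEC", "HHCC"))  # 75.0
```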
N-fold cross validation
To obtain a reliable estimate of the 8-state secondary structure prediction accuracy, we use 7-fold cross validation on Cull5547. We randomly divide the chains in Cull5547 into 7 subsets of approximately equal size, such that in each round five subsets are used for training, one for testing, and one for validation.
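The split described above can be sketched as follows. The rotation scheme (the test subset advances by one each round and the next subset serves as the validation subset) is an assumption, since the article only states the 5/1/1 proportions; the function name and the seed parameter are hypothetical.

```python
import random
from typing import Iterable, Iterator, List, Tuple


def seven_fold_rounds(
    chain_ids: Iterable, n_folds: int = 7, seed: int = 0
) -> Iterator[Tuple[List, List, List]]:
    """Yield (train, validation, test) chain lists for each round."""
    ids = list(chain_ids)
    random.Random(seed).shuffle(ids)                 # random division
    folds = [ids[i::n_folds] for i in range(n_folds)]
    for r in range(n_folds):
        val_index = (r + 1) % n_folds                # assumed rotation
        test = folds[r]
        validation = folds[val_index]
        train = [
            chain
            for j, fold in enumerate(folds)
            if j not in (r, val_index)
            for chain in fold
        ]
        yield train, validation, test
```

Splitting at the chain level rather than the residue level keeps all residues of one protein in the same subset, so that no chain contributes to both training and testing in a given round.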
[Table: 7-fold cross-validation accuracy in template-based 8-state prediction.]
[Table: Comparison between 8-state predictions with and without templates on CB513, CASP9, Manesh215, and Carugo338.]
[Table: Comparison of 7-fold cross-validation prediction accuracies in the eight states, with the number of chains, when templates with different sequence similarities are used.]
For α-helices (H), the prediction accuracy using templates with very low sequence similarity (0%, 10%] is already rather high (92.05%), mainly because a sufficient number of α-helix samples are available and the formation of an α-helix results mainly from local interactions. Nevertheless, the structural templates help refine the α-helix predictions with slight accuracy improvements. When structural templates with 40% or better similarity are available, the prediction accuracy of β-sheets (E) is also improved to above 90%, reaching the theoretical upper bound in secondary structure prediction. Templates with 40%+ similarity also significantly improve the accuracies of 3-10 helices (G) and bends (S) from 20%+ to 50%+. Similar but less significant improvements are found in turns (T) and coils (C). However, the prediction results for bridges (B) and π-helices (I) are disappointing. Only when templates with very high similarity (>70%) are available can we obtain 44% prediction accuracy in bridges (B). The prediction accuracy for π-helices (I) is still 0%. This is mainly because π-helices are extremely rare (0.02%) and are often misclassified as α-helices (H).
We describe a template-based approach to enhance 8-state secondary structure prediction accuracy in this paper. Our computational results show that secondary structure templates, even those obtained from sequences with only 20%~30% sequence similarity, can help improve the 8-state prediction accuracy. Overall, 78.85% Q8 accuracy and 80.10% SOV8 accuracy are achieved in 7-fold cross validation. The effectiveness of using structural information in templates has been demonstrated on popular benchmarks including CB513, CASP9, Manesh215, and Carugo338. More importantly, when good templates are available, the prediction accuracy of less frequent secondary structure states, such as 3-10 helices, turns, and bends, is greatly improved, which makes the predictions suitable for practical applications.
A webserver (C8-Scorpion) implementing 8-state secondary structure prediction is currently available at http://hpcr.cs.odu.edu/c8scorpion.
Publication charges for this work were funded by NSF grant 1066471 to YL.
This article has been published as part of BMC Bioinformatics Volume 15 Supplement 8, 2014: Selected articles from the Third IEEE International Conference on Computational Advances in Bio and Medical Sciences (ICCABS 2013): Bioinformatics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/15/S8.
1. Rost B: Review: Protein secondary structure prediction continues to rise. J Struct Biol. 2001, 134 (2-3): 204-218. 10.1006/jsbi.2001.4336.
2. Garnier J, Gibrat JF, Robson B: GOR method for predicting protein secondary structure from amino acid sequence. Methods Enzymol. 1996, 266: 540-553.
3. Jones DT: Protein secondary structure prediction based on position-specific scoring matrices. J Mol Biol. 1999, 292 (2): 195-202. 10.1006/jmbi.1999.3091.
4. Rost B, Sander C: Combining evolutionary information and neural networks to predict protein secondary structure. Proteins. 1994, 19 (1): 55-72. 10.1002/prot.340190108.
5. Karplus K, Barrett C, Cline M, Diekhans M, Grate L, Hughey R: Predicting protein structure using only sequence information. Proteins-Structure Function and Genetics. 1999, Suppl 1: 121-125.
6. Pollastri G, McLysaght A: Porter: a new, accurate server for protein secondary structure prediction. Bioinformatics. 2005, 21 (8): 1719-1720. 10.1093/bioinformatics/bti203.
7. Cole C, Barber JD, Barton GJ: The Jpred 3 secondary structure prediction server. Nucleic Acids Res. 2008, 36: W197-W201. 10.1093/nar/gkn238.
8. Dor O, Zhou YQ: Achieving 80% ten-fold cross-validated accuracy for secondary structure prediction by large-scale training. Proteins. 2007, 66 (4): 838-845.
9. Pollastri G, Przybylski D, Rost B, Baldi P: Improving the prediction of protein secondary structure in three and eight classes using recurrent neural networks and profiles. Proteins-Structure Function and Genetics. 2002, 47 (2): 228-235. 10.1002/prot.10082.
10. Petersen B, Petersen TN, Andersen P, Nielsen M, Lundegaard C: A generic method for assignment of reliability scores applied to solvent accessibility predictions. BMC Struct Biol. 2009, 9: 51. 10.1186/1472-6807-9-51.
11. Kabsch W, Sander C: Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers. 1983, 22 (12): 2577-2637. 10.1002/bip.360221211.
12. Ramachandran GN, Sasisekharan V: Conformation of polypeptides and proteins. Advances in Protein Chemistry. 1968, 23: 283-438.
13. Topf M, Baker ML, Marti-Renom MA, Chiu W, Sali A: Refinement of protein structures by iterative comparative modeling and CryoEM density fitting. J Mol Biol. 2006, 357 (5): 1655-1668. 10.1016/j.jmb.2006.01.062.
14. Wang ZY, Zhao F, Peng J, Xu JB: Protein 8-class secondary structure prediction using conditional neural fields. Proteomics. 2011, 11 (19): 3786-3792. 10.1002/pmic.201100196.
15. Montgomerie S, Sundararaj S, Gallin WJ, Wishart DS: Improving the accuracy of protein secondary structure prediction using structural alignment. BMC Bioinformatics. 2006, 7:
16. Pollastri G, Martin AJM, Mooney C, Vullo A: Accurate prediction of protein secondary structure and solvent accessibility by consensus combiners of sequence and structure information. BMC Bioinformatics. 2007, 8:
17. Wang GL, Dunbrack RL: PISCES: a protein sequence culling server. Bioinformatics. 2003, 19 (12): 1589-1591. 10.1093/bioinformatics/btg224.
18. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997, 25 (17): 3389-3402. 10.1093/nar/25.17.3389.
19. Cuff JA, Barton GJ: Application of multiple sequence alignment profiles to improve protein secondary structure prediction. Proteins-Structure Function and Genetics. 2000, 40 (3): 502-511. 10.1002/1097-0134(20000815)40:3<502::AID-PROT170>3.0.CO;2-Q.
20. Ahmad S, Gromiha MM, Sarai A: Real value prediction of solvent accessibility from amino acid sequence. Proteins-Structure Function and Genetics. 2003, 50 (4): 629-635. 10.1002/prot.10328.
21. Carugo O: Predicting residue solvent accessibility from protein sequence by considering the sequence environment. Protein Engineering. 2000, 13 (9): 607-609. 10.1093/protein/13.9.607.
22. Kinch LN, Shi S, Cheng H, Cong Q, Pei JM, Mariani V, Schwede T, Grishin NV: CASP9 target classification. Proteins. 2011, 79: 21-36. 10.1002/prot.23190.
23. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE: The Protein Data Bank. Nucleic Acids Res. 2000, 28 (1): 235-242. 10.1093/nar/28.1.235.
24. Li Y, Liu H, Rata I, Jakobsson E: Building a Knowledge-Based Statistical Potential by Capturing High-Order Inter-residue Interactions and its Applications in Protein Secondary Structure Assessment. Journal of Chemical Information and Modeling. 2013, 53 (2): 500-508. 10.1021/ci300207x.
25. Sippl MJ: Calculation of Conformational Ensembles from Potentials of Mean Force - an Approach to the Knowledge-Based Prediction of Local Structures in Globular-Proteins. J Mol Biol. 1990, 213 (4): 859-883. 10.1016/S0022-2836(05)80269-4.
26. Zemla A, Venclovas C, Fidelis K, Rost B: A modified definition of Sov, a segment-based measure for protein secondary structure prediction assessment. Proteins-Structure Function and Genetics. 1999, 34 (2): 220-223. 10.1002/(SICI)1097-0134(19990201)34:2<220::AID-PROT7>3.0.CO;2-K.
27. Rata I, Li Y, Jakobsson E: Backbone Statistical Potential from Local Sequence-Structure Interactions in Protein Loops. Journal of Physical Chemistry B. 2010, 114 (5): 1859-1869. 10.1021/jp909874g.
28. Samudrala R, Moult J: An all-atom distance-dependent conditional probability discriminatory function for protein structure prediction. Journal of Molecular Biology. 1998, 275: 895-916. 10.1006/jmbi.1997.1479.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.