Template-based C8-SCORPION: a protein 8-state secondary structure prediction method using structural information and context-based features
BMC Bioinformatics volume 15, Article number: S3 (2014)
Secondary structure prediction of proteins is important to many protein structure modeling applications. Correct prediction of secondary structures can significantly reduce the degrees of freedom in protein tertiary structure modeling and therefore reduce the difficulty of obtaining high-resolution 3D models.
In this work, we investigate a template-based approach to enhance 8-state secondary structure prediction accuracy. We construct structural templates from known protein structures with certain sequence similarity. The structural templates are then incorporated as features with sequence and evolutionary information to train two-stage neural networks. When no structural template is available, heuristic structural information is incorporated instead.
After applying the template-based 8-state secondary structure prediction method, the 7-fold cross-validated Q8 accuracy is 78.85%. Even templates from structures with only 20%~30% sequence similarity can help improve the 8-state prediction accuracy. More importantly, when good templates are available, the prediction accuracy of less frequent secondary structures, such as 3-10 helices, turns, and bends, is greatly improved, which is useful for practical applications.
Our computational results show that the templates containing structural information are effective features to enhance 8-state secondary structure predictions. Our prediction algorithm is implemented on a web server named "C8-SCORPION" available at: http://hpcr.cs.odu.edu/c8scorpion.
An important intermediate step in modeling the three-dimensional structure of a protein is to accurately predict its secondary structures. Most often, the secondary structures are classified into three general states, i.e., helices (H), strands (E), and coils (C). Correspondingly, success of secondary structure prediction is typically measured by the Q3 (3-state) accuracy. Many machine learning methods, including statistical analysis, neural networks, hidden Markov models, and support vector machines, have been developed to predict secondary structures. Correspondingly, there are many secondary structure prediction servers available, including GOR4, PSI-Pred, PHD, SAM, Porter, JPred, SPINE, SSPRO, NETSURF, and many others. Modern secondary structure prediction servers can generate prediction results with close to 80% Q3 accuracy.
Compared to the three general secondary structure states, the DSSP program provides a more detailed classification by assigning secondary structures to eight states: 3-10 helix (G), α-helix (H), π-helix (I), β-strand (E), bridge (B), turn (T), bend (S), and others (C). The 8-state secondary structures convey more precise structural information than the 3-state ones, which is particularly important for a variety of applications. For example, accurate 8-state secondary structure predictions can restrict the variations of backbone dihedral angles to a small range according to the Ramachandran plots and thus reduce the search space in template-free protein tertiary structure modeling. Moreover, differentiating among 3-10 helix, α-helix, and π-helix in secondary structure prediction helps to assign residues and fit protein structure models in cryo-electron microscopy density maps. Unfortunately, most secondary structure prediction software packages and servers provide only 3-state predictions.
To the best of our knowledge, very few methods have been developed for 8-state secondary structure prediction. Pollastri et al. extended their 3-state prediction method to SSpro8 for 8-state secondary structure prediction. The reported Q8 accuracy of SSpro8 is 62-63%. A more recent 8-state prediction method, RaptorXss8, developed by Wang et al., has reported 67.9% Q8 accuracy through the use of conditional neural field (CNF) models. Table 1 shows the prediction accuracy of RaptorXss8 on several popularly used secondary structure prediction benchmarks, including CB513, CASP9, Manesh215, and Carugo338. Although nearly 70% Q8 accuracy is achieved, the prediction accuracies of different states vary significantly. In particular, the prediction accuracies of G, I, B, and S are very low, mainly because these states appear relatively infrequently in the Protein Data Bank (PDB); their distribution is shown in Figure 1. The low prediction accuracies in these states limit the application of 8-state secondary structure prediction in practice.
Most current secondary structure prediction methods do not rely on similarity to known protein structures; in other words, these methods are de novo, where the secondary structure prediction is based on sequence information only. However, we cannot neglect the fact that many protein sequences have some degree of similarity among themselves. Actually, over half of all known protein sequences have some detectable similarity (higher than 25%) to one or more sequences of known structures [15, 16]. About 75% of newly deposited protein structures in the PDB were reported to show significant similarity to previously deposited structures. Consequently, taking advantage of the structural similarity of proteins with sequence similarity may lead to significant improvement of protein structure prediction. In fact, the latest version of Porter has used homology-based templates for 3-state secondary structure prediction. Porter has been reported to achieve improved prediction accuracy when known structures with >30% sequence similarity are available, and even to reach the theoretical upper bound of secondary structure prediction when such sequence similarity is higher than 50%.
In this paper, we investigate the template-based method for 8-state secondary structure prediction. We extract structural information from known structures of chains with certain sequence similarity to build structural templates. Then, the structural information contained in the templates is incorporated (as features) together with sequence and evolutionary information for neural network training and validation.
In the case where structural information from the structural template is not available for a residue, context-based scores estimating the favorability of that residue adopting a secondary structure conformation in the presence of its neighbors in sequence are used instead. The fundamental idea of the context-based scores is that the formation of secondary structure exhibits strong local dependency: residues in a protein sequence are strongly correlated at different sequence positions in coils, β-sheets, 3-10 helices, α-helices, and π-helices. We extract statistics to derive context-based scores from a large training data set. These context-based scores are then incorporated as sequence-structure features together with sequence, template, and evolutionary information in the neural network training process for 8-state secondary structure prediction.
We test our template-based 8-state prediction method on several popularly used benchmarks including CB513, Manesh215, and Carugo338 as well as the CASP9 targets. The prediction accuracies for the eight states are analyzed.
The protein data sets
We use the protein chain dataset Cull5547, generated by the PISCES server on 10/21/2011, for neural network training and Cull16633 for generating context-based scores. Cull5547 contains 5,547 protein chains with at most 25% sequence identity and a 2.0 Å resolution cutoff, and Cull16633 contains 16,633 protein chains with at most 50% sequence identity and a 3.0 Å resolution cutoff. We eliminate very short chains (fewer than 40 residues), since the PSI-BLAST program is usually unable to generate profiles for very short sequences, and very large chains (more than 1,000 residues). We also eliminate residue samples with undetermined secondary structures.
Public benchmarks, including CB513, Manesh215, Carugo338, and the recent CASP9 targets, that are popularly employed as benchmarks for 3-state secondary structure predictions, are used to benchmark our method in 8-state predictions.
Figure 2 illustrates the procedure of constructing structural templates. First, for a given protein sequence target, PSI-BLAST is used to search against the NR (Non-Redundant) database with E-value = 0.001 and at most 3 iterations to generate the PSSM (Position Specific Scoring Matrix) data. Then, the PSSM is used to search against the Protein Data Bank (PDB) for alignments with E-value = 10.0. If known structures are available in PDB, their 8-state assignments are determined by the DSSP program and then a structural template is built for the corresponding residue positions. Among the list of templates constructed, we select the top-ranked one with less than 95% sequence similarity, according to the PSI-BLAST ranking.
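The final selection step can be illustrated with a short sketch. It assumes the PSI-BLAST hits against PDB have already been parsed into a list kept in ranking order; the `Hit` record, its field names, and the example identifiers are hypothetical and are not part of the C8-SCORPION implementation.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Hit:
    """One PSI-BLAST alignment against PDB (hypothetical record layout)."""
    chain_id: str       # identifier of the aligned PDB chain
    similarity: float   # sequence similarity to the target, in percent
    e_value: float      # E-value of the alignment


def select_template(hits: List[Hit],
                    max_similarity: float = 95.0,
                    max_e_value: float = 10.0) -> Optional[Hit]:
    """Return the best-ranked hit below the similarity cutoff, or None.

    `hits` is assumed to be in PSI-BLAST ranking order (best first).
    """
    for hit in hits:
        if hit.e_value <= max_e_value and hit.similarity < max_similarity:
            return hit
    return None


# Made-up hits: the first is excluded because it is (nearly) the target itself.
hits = [Hit("1abcA", 100.0, 1e-80),
        Hit("2xyzB", 61.0, 1e-40),
        Hit("3pqrA", 28.0, 0.5)]
best = select_template(hits)
print(best.chain_id if best else "no template")
```

When no hit passes the filter, the function returns `None`, which corresponds to the template-less case where context-based scores are used instead.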
We use a window size of 15 residues for input encodings. Each residue is represented with 20 values from the PSSM data, 1 extra input to indicate whether the residue window overlaps the C- or N-terminus, 1 value for the degree of similarity, and 8 values for structural information from the template or context-based secondary structure scores. Hence, a total of 15 × 30 = 450 values are used to describe each residue.
Figure 3 shows an example of encoding residues in a protein sequence. For a residue with available structural information in the template, the corresponding secondary structure state is set to 1 while the other states are set to 0. At the same time, the degree of similarity is set to the sequence similarity. On the other hand, if the structural information for a residue is not available in the template, the degree of similarity is set to zero and the context-based scores are incorporated instead. The context-based scores are statistics-based pseudo-potentials that specify the favorability of a residue adopting a certain secondary structure in its amino acid context.
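The encoding can be sketched as follows. This is a minimal illustration rather than the authors' code: the ordering of the 30 per-residue values, the use of the terminal flag to mark window positions that fall outside the chain, and the 0-to-1 scale for the degree of similarity are all assumptions made for the example.

```python
import numpy as np

STATES = "GHIEBTSC"            # the eight DSSP states named in the text
WINDOW, PER_RESIDUE = 15, 30   # 20 PSSM + 1 terminal flag + 1 similarity + 8 structural


def encode_residue(pssm_row, template_state, similarity, context_scores):
    """Build the 30 values for one residue inside the chain."""
    structural = np.zeros(8)
    if template_state is not None:
        structural[STATES.index(template_state)] = 1.0  # one-hot template state
        sim = similarity
    else:
        structural[:] = context_scores                  # fall back to context-based scores
        sim = 0.0
    # Layout (assumed): PSSM values, terminal flag, similarity, structural values.
    return np.concatenate([pssm_row, [0.0], [sim], structural])


def encode_window(per_residue, center):
    """Concatenate the 15-residue window centred on `center` into 450 values."""
    half = WINDOW // 2
    rows = []
    for pos in range(center - half, center + half + 1):
        if 0 <= pos < len(per_residue):
            rows.append(per_residue[pos])
        else:
            pad = np.zeros(PER_RESIDUE)
            pad[20] = 1.0   # flag: this window slot lies beyond the N- or C-terminus
            rows.append(pad)
    return np.concatenate(rows)


# Toy chain of 20 residues; every other residue has a template state "H".
rng = np.random.default_rng(0)
per_residue = [
    encode_residue(rng.random(20), "H" if i % 2 == 0 else None, 0.61, rng.random(8))
    for i in range(20)
]
x = encode_window(per_residue, center=0)
print(x.shape)
```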
The types and conformations of nearby residues play a critical role in the secondary structure conformation that a residue may adopt. In particular, the hydrogen bonds between residues at positions i and i + 3, i and i + 4, and i and i + 5 lead to the formation of 3-10 helices, α-helices, and π-helices, respectively. Residues in contacting parallel or anti-parallel β-sheets are connected by hydrogen bonds in alternating positions. Moreover, the formation of interactions within coils beyond nearest neighbors appears not to contribute with statistical significance in determining coil structure. Hence, correlations among residues provide significant information in predicting secondary structure.
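In this method, we extract statistics of singlet ($R_i$), doublet ($R_i R_{i+k}$), and triplet ($R_i R_{i+k_1} R_{i+k_2}$) residues at different relative positions from the protein sequences in the Cull16633 dataset. These statistics represent estimations of the probabilities of residues adopting a specific structural state when none, one, or two of their neighbors in context are taken into consideration, respectively.
The observed probabilities of the ith residue in a singlet $R_i$, doublet $R_i R_{i+k}$, and triplet $R_i R_{i+k_1} R_{i+k_2}$ adopting a specific structural state $C_i$ are respectively estimated by
$$P_{obs}(C_i \mid R_i) = \frac{N_{obs}(C_i, R_i)}{N_{obs}(R_i)},$$
$$P_{obs}(C_i \mid R_i R_{i+k}) = \frac{N_{obs}(C_i, R_i R_{i+k})}{N_{obs}(R_i R_{i+k})},$$
$$P_{obs}(C_i \mid R_i R_{i+k_1} R_{i+k_2}) = \frac{N_{obs}(C_i, R_i R_{i+k_1} R_{i+k_2})}{N_{obs}(R_i R_{i+k_1} R_{i+k_2})}.$$
Here $N_{obs}(C_i, R_i)$, $N_{obs}(C_i, R_i R_{i+k})$, and $N_{obs}(C_i, R_i R_{i+k_1} R_{i+k_2})$ are the weighted observed numbers of singlets $R_i$, doublets $R_i R_{i+k}$, and triplets $R_i R_{i+k_1} R_{i+k_2}$ with $R_i$ adopting conformation $C_i$ in the protein structure database, and $N_{obs}(R_i)$, $N_{obs}(R_i R_{i+k})$, and $N_{obs}(R_i R_{i+k_1} R_{i+k_2})$ are the weighted observed numbers of singlets, doublets, and triplets. The observed numbers are accumulated from PSSM frequencies rather than raw counts; for singlets, for instance,
$$N_{obs}(R) = \sum_j f_j(R),$$
where $f_j(R)$ is the PSSM frequency for residue type $R$ at the jth position of a protein sequence.
Correspondingly, the context-dependent pseudo-potentials are generated using the derived statistics of correlations between each residue and its nearby neighbors, based on Sippl's potentials of mean force method. According to the inverse-Boltzmann theorem, we calculate the mean-force potential for a singlet residue $R_i$ adopting structural state $C_i$,
$$U_{singlet}(R_i, C_i) = -RT \ln \frac{P_{obs}(C_i \mid R_i)}{P_{ref}(C_i \mid R_i)}.$$
Here R is the gas constant, T is the temperature, and $P_{ref}(C_i \mid R_i)$ is the referenced probability. In our method, we employ the conditional probability approach to estimate the referenced probability.
Similarly, the mean-force potentials $U_{doublet}$ and $U_{triplet}$ for residue $R_i$ adopting structural state $C_i$ are
$$U_{doublet}(R_i, C_i, k) = -RT \ln \frac{P_{obs}(C_i \mid R_i R_{i+k})}{P_{ref}(C_i \mid R_i R_{i+k})},$$
$$U_{triplet}(R_i, C_i, k_1, k_2) = -RT \ln \frac{P_{obs}(C_i \mid R_i R_{i+k_1} R_{i+k_2})}{P_{ref}(C_i \mid R_i R_{i+k_1} R_{i+k_2})},$$
with the corresponding referenced probabilities $P_{ref}(C_i \mid R_i R_{i+k})$ and $P_{ref}(C_i \mid R_i R_{i+k_1} R_{i+k_2})$.
Then, the context-dependent pseudo-potential for $R_i$ is obtained by combining the singlet, doublet, and triplet potentials.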
These pseudo-potentials are incorporated as context-based scores representing sequence-structure features in neural network training when structural information from templates is not available.
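To make the inverse-Boltzmann step concrete, the sketch below computes a singlet potential, -RT ln(P_obs / P_ref), from PSSM-weighted counts. It covers singlets only, expresses energies in units of RT, and uses the overall frequency of a state (ignoring residue type) as the reference probability; that reference is an assumption chosen for illustration and may differ from the conditional-probability reference used in the method. All function and variable names are hypothetical.

```python
import math
from collections import defaultdict

RT = 1.0  # assumption: potentials are expressed in units of RT


def singlet_counts(chains):
    """Accumulate PSSM-weighted counts.

    `chains` is a list of (freqs, states) pairs, where freqs[j] maps each
    residue type to its PSSM frequency at position j and states[j] is the
    8-state label at that position.
    """
    n_state_res = defaultdict(float)  # weighted count of residue type r in state c
    n_res = defaultdict(float)        # weighted count of residue type r
    n_state = defaultdict(float)      # weighted count of state c over all types
    total = 0.0
    for freqs, states in chains:
        for f_j, c in zip(freqs, states):
            for r, w in f_j.items():
                n_state_res[(c, r)] += w
                n_res[r] += w
                n_state[c] += w
                total += w
    return n_state_res, n_res, n_state, total


def singlet_potential(c, r, counts):
    """-RT ln(P_obs / P_ref); requires a non-zero observed count for (c, r)."""
    n_state_res, n_res, n_state, total = counts
    p_obs = n_state_res[(c, r)] / n_res[r]
    p_ref = n_state[c] / total        # assumed reference probability
    return -RT * math.log(p_obs / p_ref)


# Toy data: one three-residue chain with made-up frequencies.
chains = [([{"A": 0.8, "G": 0.2}, {"A": 0.5, "G": 0.5}, {"G": 1.0}], "HHC")]
counts = singlet_counts(chains)
u_helix_ala = singlet_potential("H", "A", counts)
u_helix_gly = singlet_potential("H", "G", counts)
print(u_helix_ala, u_helix_gly)
```

A negative value indicates that the residue type is observed in the state more often than the reference predicts, and a positive value indicates the opposite. A practical implementation would also need pseudo-counts or a cutoff to handle residue-state pairs that are never observed.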
Neural network model
We incorporate two phases of standard feed-forward neural network training for the 8-state secondary structure prediction. The first phase is the primary sequence-structure prediction and the second phase is the structure-structure refinement. The numbers of hidden nodes in the first and second networks are 225 and 68, respectively. Figure 4 shows the encoding diagram and the two-phase neural network architecture. Each neural network is trained to predict the secondary structure state of a residue in the middle of the residue window.
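The data flow through the two phases can be sketched with untrained networks. Only the layer widths 450, 225, 68, and 8 come from the text; the sigmoid hidden units, the softmax outputs, the random weights, and the choice of a 15-residue window of first-phase outputs (15 × 8 = 120 values) as the second-phase input are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)


def init(n_in, n_hidden, n_out=8):
    """Random weights for a one-hidden-layer feed-forward network."""
    return (rng.normal(0.0, 0.1, (n_in, n_hidden)), np.zeros(n_hidden),
            rng.normal(0.0, 0.1, (n_hidden, n_out)), np.zeros(n_out))


def forward(x, params):
    """Forward pass: sigmoid hidden layer, softmax over the 8 states (both assumed)."""
    w1, b1, w2, b2 = params
    h = 1.0 / (1.0 + np.exp(-(x @ w1 + b1)))
    z = h @ w2 + b2
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)


def windows(rows, size=15):
    """Flatten a zero-padded sliding window around every position."""
    half = size // 2
    padded = np.pad(rows, ((half, half), (0, 0)))
    return np.stack([padded[i:i + size].ravel() for i in range(len(rows))])


phase1 = init(450, 225)      # sequence-structure network
phase2 = init(15 * 8, 68)    # structure-structure network (input width assumed)

features = rng.random((40, 450))             # 40 residues, 450 inputs each
first_pass = forward(features, phase1)       # per-residue 8-state probabilities
refined = forward(windows(first_pass), phase2)
predicted = refined.argmax(axis=1)           # index of the predicted state
print(first_pass.shape, refined.shape)
```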
The prediction accuracy is calculated as the average of the prediction scores from the seven cross-validation folds. We use both Q8 and SOV8 (segment overlap) scores to measure the quality of our 8-state secondary structure predictions.
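For reference, Q8 is the percentage of residues whose predicted state equals the observed state. The sketch below computes it for one chain, together with a per-state breakdown; the function names and the example strings are illustrative only, and SOV8 is not covered.

```python
from collections import Counter


def q8(predicted, observed):
    """Percentage of residues whose predicted 8-state label matches the observed one."""
    if len(predicted) != len(observed) or not observed:
        raise ValueError("sequences must be non-empty and of equal length")
    return 100.0 * sum(p == o for p, o in zip(predicted, observed)) / len(observed)


def per_state_accuracy(predicted, observed):
    """Accuracy computed separately for each observed state."""
    totals = Counter(observed)
    hits = Counter(o for p, o in zip(predicted, observed) if p == o)
    return {s: 100.0 * hits[s] / totals[s] for s in totals}


observed = "CHHHHTTEEC"
predicted = "CHHHHSTEEC"   # one turn residue mispredicted as a bend
overall = q8(predicted, observed)
by_state = per_state_accuracy(predicted, observed)
print(overall, by_state)
```

The per-state view matters here because a high overall Q8 can hide very low accuracy on rare states.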
N-fold cross validation
To obtain a reliable estimate of the 8-state secondary structure prediction accuracy, we use 7-fold cross validation on Cull5547. We randomly divide the chains in Cull5547 into 7 subsets of approximately the same size; in each fold, five subsets are used for training, one for testing, and one for validation.
With the best alignment of less than 95% similarity selected for every protein chain in the Cull5547 dataset, the 7-fold cross-validated Q8 accuracy of the template-based 8-state prediction reaches 78.85%. Table 2 lists the Q8 and SOV8 accuracies of 7-fold cross validation for each state.
Table 3 compares the Q8 and SOV8 accuracies of predictions with and without templates on the CB513, CASP9, Manesh215, and Carugo338 benchmarks. Clearly, when homology structural information is available, the 8-state prediction accuracy is significantly improved. It is also interesting to find that when structural templates are used, the improvement in 8-state prediction accuracy on CASP9 is much smaller than on the other benchmark sets. This is because the CASP9 targets are deliberately selected to have relatively low similarity to sequences with existing structures in PDB.
Figure 5 shows the distribution of the prediction accuracy as a function of sequence similarity level in CB513, CASP9, Manesh215, and Carugo338, as well as in Cull5547 under cross-validation. Unsurprisingly, the higher the sequence similarity of the template, the more accurate the prediction results are. More importantly, even templates with only 20%~30% sequence similarity can improve the prediction accuracy by nearly 5% in various benchmark sets compared to predictions without templates.
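One way to realize such a split is sketched below. The text does not say how the validation subset is chosen in each fold; pairing test subset k with validation subset k + 1 (cyclically) is an assumption, as are the function name and the fixed random seed.

```python
import random


def seven_fold_splits(chain_ids, n_folds=7, seed=0):
    """Yield (train, validation, test) lists of chain ids, one triple per fold."""
    ids = list(chain_ids)
    random.Random(seed).shuffle(ids)
    # Striding over the shuffled list gives subsets whose sizes differ by at most one.
    folds = [ids[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        v = (k + 1) % n_folds            # assumed choice of validation subset
        train = [c for j, fold in enumerate(folds) if j not in (k, v) for c in fold]
        yield train, folds[v], folds[k]


for train, validation, test in seven_fold_splits(range(5547)):
    print(len(train), len(validation), len(test))
```

Each chain appears in the test subset of exactly one fold, so averaging the seven test scores uses every chain once.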
Figure 6 uses the A chain of protein 1BTN as an example to demonstrate the effectiveness of template-based 8-state secondary structure prediction. Prediction without template has 73.6% Q8 accuracy. The best template found in PDB has 61% sequence similarity. Under the guidance of the structural template, the mispredicted helix segment and bend segment in template-less prediction (highlighted in Figure 6) are corrected, which leads to overall 89.6% Q8 accuracy.
As shown in Table 1, the prediction accuracies for different states vary widely because the eight states occur with very unbalanced frequencies in protein structures. In this paper, we are particularly interested in how effectively structural templates improve the prediction accuracies of those states that are predicted poorly without templates. From Cull5547, we create five subsets of chains that have structural templates with similarity levels in the intervals (0%, 10%], (10%, 20%], (20%, 40%], (40%, 70%], and (70%, 95%], respectively. Then, 7-fold neural network trainings are carried out for each subset and the average cross-validation prediction accuracy for each state is reported in Table 4.
For α-helices (H), the prediction accuracy using templates with very low sequence similarity (0%, 10%] is already rather high (92.05%), mainly because a sufficient number of α-helix samples are available and the formation of an α-helix mainly results from local interactions. Nevertheless, the structural templates help refine the α-helix predictions with slight accuracy improvements. When structural templates with 40% or better similarity are available, the prediction accuracy of β-sheets (E) is also improved to above 90%, reaching the theoretical upper bound in secondary structure prediction. Templates with 40%+ similarity also significantly improve the accuracies of 3-10 helices (G) and bends (S) from 20%+ to 50%+. Similar but less significant improvements are found in turns (T) and coils (C). However, the prediction results for bridges (B) and π-helices (I) are disappointing. Only when templates with very high similarity (>70%) are available can we obtain 44% prediction accuracy for bridges (B). The prediction accuracy for π-helices (I) is still 0%. This is mainly because π-helices are extremely rare (0.02%) and are often misclassified as α-helices (H).
We describe a template-based approach to enhance 8-state secondary structure prediction accuracy in this paper. Our computational results show that secondary structure templates, even those obtained from sequences with only 20%~30% sequence similarity, can help improve the 8-state prediction accuracy. Overall, 78.85% Q8 accuracy and 80.10% SOV8 accuracy are achieved in 7-fold cross validation. The effectiveness of using structural information in templates has been demonstrated on popular benchmarks including CB513, CASP9, Manesh215, and Carugo338. More importantly, when good templates are available, the prediction accuracy of less frequent secondary structure states, such as 3-10 helices, turns, and bends, is greatly improved, which makes the predictions suitable for practical use in applications.
A web server (C8-SCORPION) implementing 8-state secondary structure prediction is currently available at http://hpcr.cs.odu.edu/c8scorpion.
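Grouping chains by template similarity amounts to a simple binning step, sketched below. The sketch assumes right-closed intervals, following the (0%, 10%] notation used in the discussion of the results, and the function and constant names are hypothetical.

```python
import bisect

EDGES = [10, 20, 40, 70, 95]   # upper bounds of the five similarity intervals, in percent
LABELS = ["(0%, 10%]", "(10%, 20%]", "(20%, 40%]", "(40%, 70%]", "(70%, 95%]"]


def similarity_bin(similarity):
    """Return the interval label for a template similarity, or None if out of range."""
    if not 0 < similarity <= 95:
        return None
    # bisect_left finds the first upper bound that is >= similarity.
    return LABELS[bisect.bisect_left(EDGES, similarity)]


for s in (8.0, 10.0, 25.0, 61.0, 95.0, 99.0):
    print(s, similarity_bin(s))
```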
Rost B: Review: Protein secondary structure prediction continues to rise. J Struct Biol. 2001, 134 (2-3): 204-218. 10.1006/jsbi.2001.4336.
Garnier J, Gibrat JF, Robson B: GOR method for predicting protein secondary structure from amino acid sequence. Methods Enzymol. 1996, 266: 540-553.
Jones DT: Protein secondary structure prediction based on position-specific scoring matrices. J Mol Biol. 1999, 292 (2): 195-202. 10.1006/jmbi.1999.3091.
Rost B, Sander C: Combining evolutionary information and neural networks to predict protein secondary structure. Proteins. 1994, 19 (1): 55-72. 10.1002/prot.340190108.
Karplus K, Barrett C, Cline M, Diekhans M, Grate L, Hughey R: Predicting protein structure using only sequence information. Proteins-Structure Function and Genetics. 1999, Suppl 1: 121-125.
Pollastri G, McLysaght A: Porter: a new, accurate server for protein secondary structure prediction. Bioinformatics. 2005, 21 (8): 1719-1720. 10.1093/bioinformatics/bti203.
Cole C, Barber JD, Barton GJ: The Jpred 3 secondary structure prediction server. Nucleic Acids Res. 2008, 36: W197-W201. 10.1093/nar/gkn238.
Dor O, Zhou YQ: Achieving 80% ten-fold cross-validated accuracy for secondary structure prediction by large-scale training. Proteins. 2007, 66 (4): 838-845.
Pollastri G, Przybylski D, Rost B, Baldi P: Improving the prediction of protein secondary structure in three and eight classes using recurrent neural networks and profiles. Proteins-Structure Function and Genetics. 2002, 47 (2): 228-235. 10.1002/prot.10082.
Petersen B, Petersen TN, Andersen P, Nielsen M, Lundegaard C: A generic method for assignment of reliability scores applied to solvent accessibility predictions. BMC Struct Biol. 2009, 9 (51): 10.1186/1472-6807-9-51.
Kabsch W, Sander C: Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers. 1983, 22 (12): 2577-2637. 10.1002/bip.360221211.
Ramachandran GN, Sasisekharan V: Conformation of polypeptides and proteins. Advances in protein chemistry. 1968, 23: 283-438.
Topf M, Baker ML, Marti-Renom MA, Chiu W, Sali A: Refinement of protein structures by iterative comparative modeling and CryoEM density fitting. J Mol Biol. 2006, 357 (5): 1655-1668. 10.1016/j.jmb.2006.01.062.
Wang ZY, Zhao F, Peng J, Xu JB: Protein 8-class secondary structure prediction using conditional neural fields. Proteomics. 2011, 11 (19): 3786-3792. 10.1002/pmic.201100196.
Montgomerie S, Sundararaj S, Gallin WJ, Wishart DS: Improving the accuracy of protein secondary structure prediction using structural alignment. BMC Bioinformatics. 2006, 7:
Pollastri G, Martin AJM, Mooney C, Vullo A: Accurate prediction of protein secondary structure and solvent accessibility by consensus combiners of sequence and structure information. BMC Bioinformatics. 2007, 8:
Wang GL, Dunbrack RL: PISCES: a protein sequence culling server. Bioinformatics. 2003, 19 (12): 1589-1591. 10.1093/bioinformatics/btg224.
Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997, 25 (17): 3389-3402. 10.1093/nar/25.17.3389.
Cuff JA, Barton GJ: Application of multiple sequence alignment profiles to improve protein secondary structure prediction. Proteins-Structure Function and Genetics. 2000, 40 (3): 502-511. 10.1002/1097-0134(20000815)40:3<502::AID-PROT170>3.0.CO;2-Q.
Ahmad S, Gromiha MM, Sarai A: Real value prediction of solvent accessibility from amino acid sequence. Proteins-Structure Function and Genetics. 2003, 50 (4): 629-635. 10.1002/prot.10328.
Carugo O: Predicting residue solvent accessibility from protein sequence by considering the sequence environment. Protein Engineering. 2000, 13 (9): 607-609. 10.1093/protein/13.9.607.
Kinch LN, Shi S, Cheng H, Cong Q, Pei JM, Mariani V, Schwede T, Grishin NV: CASP9 target classification. Proteins. 2011, 79: 21-36. 10.1002/prot.23190.
Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE: The Protein Data Bank. Nucleic Acids Res. 2000, 28 (1): 235-242. 10.1093/nar/28.1.235.
Li Y, Liu H, Rata I, Jakobsson E: Building a Knowledge-Based Statistical Potential by Capturing High-Order Inter-residue Interactions and its Applications in Protein Secondary Structure Assessment. Journal of chemical information and modeling. 2013, 53 (2): 500-508. 10.1021/ci300207x.
Sippl MJ: Calculation of Conformational Ensembles from Potentials of Mean Force - an Approach to the Knowledge-Based Prediction of Local Structures in Globular-Proteins. J Mol Biol. 1990, 213 (4): 859-883. 10.1016/S0022-2836(05)80269-4.
Zemla A, Venclovas C, Fidelis K, Rost B: A modified definition of Sov, a segment-based measure for protein secondary structure prediction assessment. Proteins-Structure Function and Genetics. 1999, 34 (2): 220-223. 10.1002/(SICI)1097-0134(19990201)34:2<220::AID-PROT7>3.0.CO;2-K.
Rata I, Li Y, Jakobsson E: Backbone Statistical Potential from Local Sequence-Structure Interactions in Protein Loops. Journal of Physical Chemistry B. 2010, 114 (5): 1859-1869. 10.1021/jp909874g.
Samudrala R, Moult J: An all-atom distance-dependent conditional probability discriminatory function for protein structure prediction. Journal of Molecular Biology. 1998, 275: 895-916. 10.1006/jmbi.1997.1479.
Publication charges for this work were funded by NSF grant 1066471 to YL.
This article has been published as part of BMC Bioinformatics Volume 15 Supplement 8, 2014: Selected articles from the Third IEEE International Conference on Computational Advances in Bio and Medical Sciences (ICCABS 2013): Bioinformatics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/15/S8.
The authors declare that they have no competing interests.
YL conceived the context-based scoring method. AY implemented the method and carried out the computation. AY and YL performed the result analysis. Both authors read and approved the final manuscript.