Template-based C8-SCORPION: a protein 8-state secondary structure prediction method using structural information and context-based features

Background: Secondary structure prediction of proteins is important to many protein structure modeling applications. Correct prediction of secondary structures can significantly reduce the degrees of freedom in protein tertiary structure modeling and therefore reduce the difficulty of obtaining high-resolution 3D models. Methods: In this work, we investigate a template-based approach to enhance 8-state secondary structure prediction accuracy. We construct structural templates from known protein structures with certain sequence similarity. The structural templates are then incorporated as features with sequence and evolutionary information to train two-stage neural networks. When structural templates are absent, heuristic structural information is incorporated instead. Results: After applying the template-based 8-state secondary structure prediction method, the 7-fold cross-validated Q8 accuracy is 78.85%. Even templates from structures with only 20%~30% sequence similarity can help improve the 8-state prediction accuracy. More importantly, when good templates are available, the prediction accuracy of less frequent secondary structures, such as 3-10 helices, turns, and bends, is greatly improved, which is useful for practical applications. Conclusions: Our computational results show that templates containing structural information are effective features for enhancing 8-state secondary structure prediction. Our prediction algorithm is implemented on a web server named "C8-SCORPION", available at http://hpcr.cs.odu.edu/c8scorpion.


Background
An important intermediate step in modeling the three-dimensional structure of a protein is to accurately predict its secondary structures [1]. Most often, the secondary structures are classified into three general states, i.e., helices (H), strands (E), and coils (C). Correspondingly, the success of secondary structure prediction is typically measured by the Q3 (3-state) accuracy. Many machine learning methods, including statistical analysis, neural networks, hidden Markov models, and support vector machines, have been developed to predict secondary structures. Many secondary structure prediction servers are available as a result, including GOR4 [2], PSI-Pred [3], PHD [4], SAM [5], Porter [6], JPred [7], SPINE [8], SSPRO [9], NETSURF [10], and many others. Modern secondary structure prediction servers can generate prediction results with close to 80% Q3 accuracy.
Compared to the general three secondary structure states, the DSSP program [11] provides a more detailed classification by assigning secondary structures to eight states: 3-10 helix (G), α-helix (H), π-helix (I), β-strand (E), bridge (B), turn (T), bend (S), and others (C). The 8-state secondary structures convey more precise structural information than the 3-state ones, which is particularly important for a variety of applications. For example, accurate 8-state secondary structure predictions can restrict the variations of backbone dihedral angles to a small range according to the Ramachandran plots [12] and thus reduce the search space in template-free protein tertiary structure modeling. Moreover, differentiating among 3-10 helix, α-helix, and π-helix in secondary structure prediction aids in assigning residues and fitting protein structure models in cryo-electron microscopy density maps [13]. Unfortunately, most secondary structure prediction software packages and servers provide only 3-state predictions.
To the best of our knowledge, very few methods have been developed for 8-state secondary structure prediction. Pollastri et al. [9] extended their 3-state prediction method to SSpro8 for 8-state secondary structure prediction. The reported Q8 accuracy of SSpro8 is 62-63% [9]. A more recent 8-state prediction method, RaptorXss8, developed by Wang et al. [14], has reported 67.9% Q8 accuracy through the use of conditional neural field (CNF) models. Table 1 shows the prediction accuracy of RaptorXss8 on several popularly used secondary structure prediction benchmarks, including CB513, CASP9, Manesh215, and Carugo338. Although nearly 70% Q8 accuracy is achieved, the prediction accuracies of different states vary significantly. In particular, the prediction accuracies of G, I, B, and S are very low, mainly because these states appear relatively infrequently in the Protein Data Bank (PDB); their distribution is shown in Figure 1. The low prediction accuracies in these states limit the application of 8-state secondary structure prediction in practice.
Most current secondary structure prediction methods do not rely on similarity to known protein structures; in other words, these methods are de novo, where the secondary structure prediction is based on sequence information only. However, we cannot neglect the fact that many protein sequences have some degree of similarity among themselves. In fact, over half of all known protein sequences have some detectable similarity (higher than 25%) to one or more sequences of known structures [15,16]. About 75% of newly deposited protein structures in the PDB have been reported to show significant similarity to previously deposited structures. Consequently, taking advantage of the structural similarity of proteins with sequence similarity may lead to significant improvement in protein structure prediction. Indeed, the latest version of Porter [6] uses homology-based templates for 3-state secondary structure prediction [16]. Porter has been reported to improve prediction accuracy when known structures with >30% sequence similarity are available and even to reach the theoretical upper bound of secondary structure prediction when the sequence similarity is higher than 50%.
In this paper, we investigate the template-based method for 8-state secondary structure prediction. We extract structural information from known structures of chains with certain sequence similarity to build structural templates. Then, the structural information contained in the templates is incorporated (as features) together with sequence and evolutionary information for neural network training and validation.
In the case where structural information from the structural template is not available for a residue, context-based scores estimating the favorability of that residue adopting a secondary structure conformation in the presence of its neighbors in sequence are used instead. The fundamental idea of the context-based scores is that the formation of secondary structure exhibits strong local dependency; in particular, residues in a protein sequence are strongly correlated in different sequence positions in coils, β-sheets, 3-10 helices, α-helices, and π-helices. We extract statistics to derive context-based scores from a large training data set. These context-based scores are then incorporated as sequence-structure features together with sequence, template, and evolutionary information in the neural network training process for 8-state secondary structure prediction. We test our template-based 8-state prediction method on several popularly used benchmarks including CB513, Manesh215, and Carugo338 as well as the CASP9 targets. The prediction accuracies for the eight states are analyzed.

The protein data sets
We use the protein chain dataset Cull5547, generated by the PISCES server [17] on 10/21/2011, for neural network training and Cull16633 for context-based score generation. Cull5547 contains 5,547 protein chains with at most 25% sequence identity and a 2.0 Å resolution cutoff, and Cull16633 contains 16,633 protein chains with at most 50% sequence identity and a 3.0 Å resolution cutoff. We eliminate very short chains (fewer than 40 residues), since the PSI-BLAST program [18] is usually unable to generate profiles for very short sequences, as well as very long chains (more than 1,000 residues). We also eliminate residue samples with undetermined secondary structures.
Public benchmarks, including CB513 [19], Manesh215 [20], Carugo338 [21], and the recent CASP9 targets [22], which are popularly employed as benchmarks for 3-state secondary structure prediction, are used to benchmark our method in 8-state prediction.
Figure 2 illustrates the procedure of constructing structural templates. First, for a given target protein sequence, PSI-BLAST is used to search against the NR (Non-Redundant) database with E-value = 0.001 and at most 3 iterations to generate the PSSM (Position-Specific Scoring Matrix) data. Then, the PSSM is used to search against the Protein Data Bank (PDB [23]) for alignments with E-value = 10.0. If known structures are available in PDB, their 8-state assignments are determined by the DSSP program and a structural template is built for the corresponding residue positions. Among the templates constructed, we select the top-ranked one, according to PSI-BLAST ranking, that has less than 95% sequence similarity.
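The final selection step can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Hit` record, its fields, the placeholder chain identifiers, and the function `select_template` are hypothetical, and the hits are assumed to be already parsed from the PSI-BLAST output and kept in rank order.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Hit:
    """One alignment against PDB (hypothetical record layout)."""
    pdb_chain: str      # placeholder identifier, not a real PDB entry
    similarity: float   # percent sequence similarity to the target
    aligned_dssp: str   # 8-state string per target position, '-' if unaligned


def select_template(hits: List[Hit], max_similarity: float = 95.0) -> Optional[Hit]:
    """Return the top-ranked hit whose similarity is below the cutoff.

    `hits` is assumed to be in PSI-BLAST rank order (best first).
    """
    for hit in hits:
        if hit.similarity < max_similarity:
            return hit
    return None  # no usable template; context-based scores are used instead


hits = [
    Hit("XXXX_A", 99.0, "CHHHHHHC"),  # above the 95% cutoff, skipped
    Hit("YYYY_B", 61.0, "CHHHH-SC"),  # first hit under the cutoff
    Hit("ZZZZ_A", 35.0, "CCHHH-TC"),
]
best = select_template(hits)
```

Returning `None` when no hit passes the cutoff mirrors the fallback to context-based scores for targets without usable templates.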

Encoding
We use a window size of 15 residues for input encodings. Each residue is represented with 20 values from the PSSM (Position-Specific Scoring Matrix) data, 1 extra input to indicate whether the residue window overlaps the C- or N-terminus, 1 value for the degree of similarity, and 8 values for structural information from the template or context-based secondary structure scores [24]. Hence, a total of 450 values (15 × 30) are used to describe each residue. Figure 3 shows an example of encoding residues in a protein sequence. For a residue with available structural information in the template, the corresponding secondary structure state is set to 1 while the other states are set to 0. At the same time, the degree of similarity is set to the sequence similarity. On the other hand, if the structural information for a residue is not available in the template, the degree of similarity is set to zero and the context-based scores are incorporated instead. The context-based scores are statistics-based pseudo-potentials that specify the favorability of a residue adopting a certain secondary structure in its amino acid context [24].
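The encoding can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' code: the function names, the order of the eight states, the use of '-' for positions without template information, the similarity expressed as a fraction, and the choice to set the terminal flag on window positions that fall outside the sequence are all assumptions.

```python
STATES = tuple("GHIEBTSC")   # eight DSSP states; the order is an assumption
WINDOW = 15
PER_RESIDUE = 20 + 1 + 1 + 8  # PSSM + terminal flag + similarity + structure


def encode_position(pssm_row, template_state, similarity, context_scores):
    """Return the 30 values describing one residue inside the sequence."""
    if template_state in STATES:
        # Template available: one-hot state plus the template similarity.
        structure = [1.0 if s == template_state else 0.0 for s in STATES]
        sim = similarity
    else:
        # No template information: similarity 0, context-based scores instead.
        structure = list(context_scores)
        sim = 0.0
    return list(pssm_row) + [0.0] + [sim] + structure


def encode_window(center, pssm, template, similarity, context):
    """Return the 450 input values for the residue at index `center`."""
    half = WINDOW // 2
    features = []
    for pos in range(center - half, center + half + 1):
        if 0 <= pos < len(pssm):
            features += encode_position(pssm[pos], template[pos],
                                        similarity, context[pos])
        else:
            # Past the N- or C-terminus: only the terminal flag is set.
            features += [0.0] * 20 + [1.0] + [0.0] * 9
    return features


# Toy 5-residue example with made-up numbers.
pssm = [[0.05] * 20 for _ in range(5)]
context = [[0.1] * 8 for _ in range(5)]
template = "H-E--"  # '-' marks residues without template information
vec = encode_window(0, pssm, template, 0.61, context)
```

In this toy call the first seven window positions lie before the N-terminus, so their terminal flags are set, and the centre residue carries a one-hot H together with the similarity value.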

Context-based scores
The types and conformations of nearby residues play a critical role in the secondary structure conformation that a residue may adopt [24]. In particular, hydrogen bonds between residues at positions i and i + 3, i and i + 4, and i and i + 5 lead to the formation of 3-10 helices, α-helices, and π-helices, respectively. Residues in contacting parallel or anti-parallel β-sheets are connected by hydrogen bonds at alternating positions. Moreover, interactions within coils beyond nearest neighbors appear not to contribute with statistical significance to determining coil structure [27]. Hence, correlations among residues provide significant information for predicting secondary structure.
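In this method, we extract statistics of singlet ($R_i$), doublet ($R_i R_{i+k}$), and triplet ($R_i R_{i+k_1} R_{i+k_2}$) residues at different relative positions from the protein sequences in the Cull16633 dataset. These statistics represent estimates of the probabilities of residues adopting a specific structural state when none, one, or two of their neighbors in context are taken into consideration, respectively.
The observed probabilities of the $i$th residue $R_i$ in a singlet ($R_i$), doublet ($R_i R_{i+k}$), and triplet ($R_i R_{i+k_1} R_{i+k_2}$) adopting a specific structural state $C_i$ are respectively estimated by

$$P_{obs}(C_i \mid R_i) = \frac{N_{obs}(C_i, R_i)}{N_{obs}(R_i)},$$

$$P_{obs}(C_i \mid R_i R_{i+k}) = \frac{N_{obs}(C_i, R_i R_{i+k})}{N_{obs}(R_i R_{i+k})},$$

and

$$P_{obs}(C_i \mid R_i R_{i+k_1} R_{i+k_2}) = \frac{N_{obs}(C_i, R_i R_{i+k_1} R_{i+k_2})}{N_{obs}(R_i R_{i+k_1} R_{i+k_2})}.$$

Here $N_{obs}(C_i, R_i)$, $N_{obs}(C_i, R_i R_{i+k})$, and $N_{obs}(C_i, R_i R_{i+k_1} R_{i+k_2})$ are the observed numbers of singlets, doublets, and triplets with $R_i$ adopting conformation $C_i$ in the protein structure database. $N_{obs}(R_i)$, $N_{obs}(R_i R_{i+k})$, and $N_{obs}(R_i R_{i+k_1} R_{i+k_2})$ are the weighted observed numbers of singlets, doublets, and triplets. The observed numbers are weighted by PSSM frequencies accumulated over sequence positions; for singlets,

$$N_{obs}(R_i) = \sum_j PSSM_j(R_i),$$

where $PSSM_j(R_i)$ is the PSSM frequency for residue type $R_i$ at the $j$th position of a protein sequence, and the doublet and triplet counts are weighted analogously.
Correspondingly, the context-dependent pseudo-potentials are generated from the derived statistics of correlations between each residue and its nearby neighbors, based on Sippl's potentials of mean force method [25]. According to the inverse-Boltzmann theorem, we calculate the mean-force potential $U_{singlet}(R_i, C_i)$ for a singlet residue $R_i$ adopting structural state $C_i$,

$$U_{singlet}(R_i, C_i) = -RT \ln \frac{P_{obs}(C_i \mid R_i)}{P_{ref}(C_i \mid R_i)}.$$

Here $R$ is the gas constant, $T$ is the temperature, and $P_{ref}(C_i \mid R_i)$ is the reference probability. In our method, we employ the conditional probability approach described in [28] to estimate the reference probability. Similarly, the mean-force potentials $U_{doublet}(C_i, R_i R_{i+k})$ and $U_{triplet}(C_i, R_i R_{i+k_1} R_{i+k_2})$ for residue $R_i$ adopting structural state $C_i$ are

$$U_{doublet}(C_i, R_i R_{i+k}) = -RT \ln \frac{P_{obs}(C_i \mid R_i R_{i+k})}{P_{ref}(C_i \mid R_i R_{i+k})}$$

and

$$U_{triplet}(C_i, R_i R_{i+k_1} R_{i+k_2}) = -RT \ln \frac{P_{obs}(C_i \mid R_i R_{i+k_1} R_{i+k_2})}{P_{ref}(C_i \mid R_i R_{i+k_1} R_{i+k_2})},$$

with the corresponding reference probabilities estimated by the same approach. The context-dependent pseudo-potential for $R_i$ is then built from these singlet, doublet, and triplet mean-force potentials.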
These pseudo-potentials are incorporated as context-based scores representing sequence-structure features in neural network training when structural information from templates is not available.
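As a rough illustration of the inverse-Boltzmann step for singlets only, the following Python sketch accumulates PSSM-weighted counts and converts them into potentials. It is a sketch under assumptions, not the authors' procedure: the function name and input layout are hypothetical, RT is set to 1 (arbitrary units), and the overall frequency of each state is used as a stand-in reference probability, which may differ from the conditional probability approach of [28].

```python
import math
from collections import defaultdict


def singlet_potentials(chains, rt=1.0):
    """Compute U(C, R) = -RT * ln(P_obs(C | R) / P_ref(C)) for singlets.

    `chains` is an iterable of (pssm_freqs, states) pairs, where
    pssm_freqs[j] maps residue type -> PSSM frequency at position j and
    states[j] is the DSSP state at position j (hypothetical layout).
    """
    n_res = defaultdict(float)        # weighted count of residue type R
    n_state_res = defaultdict(float)  # weighted count of R in state C
    n_state = defaultdict(float)      # weighted count of state C
    total = 0.0
    for pssm_freqs, states in chains:
        for freqs, state in zip(pssm_freqs, states):
            for res, weight in freqs.items():
                n_res[res] += weight
                n_state_res[(state, res)] += weight
                n_state[state] += weight
                total += weight

    potentials = {}
    for (state, res), count in n_state_res.items():
        p_obs = count / n_res[res]
        p_ref = n_state[state] / total  # ASSUMPTION: stand-in reference
        potentials[(state, res)] = -rt * math.log(p_obs / p_ref)
    return potentials


# Toy data: one 4-residue chain with deterministic profiles.
toy = [([{"A": 1.0}, {"A": 1.0}, {"G": 1.0}, {"G": 1.0}], "HHHC")]
pots = singlet_potentials(toy)
```

A negative value means the residue type is observed in that state more often than the reference predicts, i.e. the state is favorable in this pseudo-potential sense.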

Neural network model
We incorporate two phases of standard feed-forward neural network training for the 8-state secondary structure prediction. The first phase is the primary sequence-structure prediction and the second phase is the structure-structure refinement. The numbers of hidden nodes in the first and second networks are 225 and 68, respectively. Figure 4 shows the encoding diagram and the two-phase neural network architecture. Each neural network is trained to predict the secondary structure state of the residue in the middle of the residue window.
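The data flow through the two phases can be sketched in plain Python. The sketch shows an untrained forward pass only and rests on several assumptions that the text does not specify: sigmoid hidden units, a softmax output layer, random initial weights, and a second-phase input formed from a 15-residue window of the eight first-phase outputs (120 values). All function names are hypothetical.

```python
import math
import random


def make_layer(n_in, n_out, rng):
    """Random (untrained) weights and zero biases for one dense layer."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)]
               for _ in range(n_out)]
    return weights, [0.0] * n_out


def sigmoid(z):
    return [1.0 / (1.0 + math.exp(-v)) for v in z]


def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def dense(x, layer, activation):
    weights, biases = layer
    return activation([sum(w * v for w, v in zip(row, x)) + b
                       for row, b in zip(weights, biases)])


def make_network(n_in, n_hidden, rng, n_out=8):
    return make_layer(n_in, n_hidden, rng), make_layer(n_hidden, n_out, rng)


def forward(x, network):
    hidden_layer, output_layer = network
    return dense(dense(x, hidden_layer, sigmoid), output_layer, softmax)


def refine(stage1_outputs, stage2, window=15):
    """Second phase: slide a window over the first-phase outputs."""
    half = window // 2
    pad = [[0.0] * 8 for _ in range(half)]
    padded = pad + stage1_outputs + pad
    refined = []
    for i in range(len(stage1_outputs)):
        x = [v for row in padded[i:i + window] for v in row]
        refined.append(forward(x, stage2))
    return refined


rng = random.Random(0)
stage1 = make_network(450, 225, rng)     # sequence-structure network
stage2 = make_network(15 * 8, 68, rng)   # structure-structure network
inputs = [[rng.random() for _ in range(450)] for _ in range(4)]
first = [forward(x, stage1) for x in inputs]
final = refine(first, stage2)
```

The hidden-layer sizes 225 and 68 come from the text; everything else about the layers is illustrative.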

Performance measures
The prediction accuracy is calculated as the average of the seven prediction scores. We use both Q8 and SOV8 (Segment overlap [26]) scores to measure the qualities of our 8-state secondary structure predictions.
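For reference, Q8 is the percentage of residues whose predicted 8-state label matches the DSSP assignment. A minimal Python sketch follows; the function names and the toy strings are hypothetical, and SOV8 is not implemented here because it requires the segment-overlap definition of [26].

```python
from collections import Counter


def q8(predicted, observed):
    """Percentage of residues whose predicted state equals the observed one."""
    if len(predicted) != len(observed):
        raise ValueError("sequences must have the same length")
    correct = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * correct / len(observed)


def per_state_accuracy(predicted, observed):
    """Q8-style accuracy computed separately for each observed state."""
    totals = Counter(observed)
    hits = Counter(o for p, o in zip(predicted, observed) if p == o)
    return {state: 100.0 * hits[state] / totals[state] for state in totals}


observed = "HHHHEECCTS"
predicted = "HHHCEECCSS"
overall = q8(predicted, observed)
by_state = per_state_accuracy(predicted, observed)
```

Reporting the per-state values alongside the overall score matters here because rare states contribute little to Q8 even when they are predicted poorly.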

N-fold cross validation
To obtain a reliable estimate of the 8-state secondary structure prediction accuracy, we use 7-fold cross validation on Cull5547. We randomly divide the chains in Cull5547 into 7 subsets of approximately the same size, such that five subsets are used for training, one for testing, and one for validation.
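The split can be sketched in Python as below. The function name is hypothetical, and the rotation rule (the fold after the test fold serves as the validation fold) is an assumption, since the text only states the 5/1/1 proportions.

```python
import random


def seven_fold_rounds(chain_ids, n_folds=7, seed=0):
    """Yield (train, validation, test) lists of chain ids for each round."""
    ids = list(chain_ids)
    random.Random(seed).shuffle(ids)
    # Deal the shuffled chains into folds of approximately equal size.
    folds = [ids[i::n_folds] for i in range(n_folds)]
    for r in range(n_folds):
        v = (r + 1) % n_folds  # ASSUMPTION: next fold is used for validation
        train = [c for i, fold in enumerate(folds)
                 if i not in (r, v) for c in fold]
        yield train, folds[v], folds[r]


rounds = list(seven_fold_rounds(range(70)))
```

Splitting by chain rather than by residue keeps all residues of one protein in the same subset, so no chain contributes to both training and testing in a given round.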

Results
After selecting the best alignment with similarity less than 95% for all protein chains in the Cull5547 dataset, the final seven-fold cross-validated Q8 accuracy of the template-based 8-state prediction reaches 78.85%. Table 2 lists the Q8 and SOV8 accuracies of 7-fold cross validation for each state. Table 3 compares the Q8 and SOV8 accuracies of predictions with and without templates on the CB513, CASP9, Manesh215, and Carugo338 benchmarks. Clearly, when homologous structural information is available, the 8-state prediction accuracy is significantly improved. It is also interesting that when structural templates are used, the improvement in 8-state prediction accuracy on CASP9 is much smaller than on the other benchmark sets. This is because CASP9 targets are deliberately selected to have relatively low similarity to sequences with existing structures in PDB. Figure 5 shows the distribution of the prediction accuracy as a function of sequence similarity level in CB513, CASP9, Manesh215, and Carugo338, as well as in Cull5547 under cross-validation. Not surprisingly, the higher the sequence similarity of the template, the more accurate the prediction. More importantly, even templates with only 20%~30% sequence similarity can improve the prediction accuracy by nearly 5% in various benchmark sets compared to predictions without templates. Figure 6 uses the A chain of protein 1BTN as an example to demonstrate the effectiveness of template-based 8-state secondary structure prediction. Prediction without a template has 73.6% Q8 accuracy. The best template found in PDB has 61% sequence similarity. Under the guidance of the structural template, the mispredicted helix and bend segments in the template-less prediction (highlighted in Figure 6) are corrected, which leads to an overall Q8 accuracy of 89.6%.

Discussion
As shown in Table 1, the prediction accuracies for different states vary widely because the eight states appear with very unbalanced frequencies in protein structures. In this paper, we are particularly interested in how effectively structural templates improve the prediction accuracies of those states that are predicted poorly without templates. From Cull5547, we create five subsets of chains that have structural templates with similarity levels in the intervals (0%, 10%], (10%, 20%], (20%, 40%], (40%, 70%], and (70%, 95%], respectively. Then, 7-fold neural network training is carried out for each subset and the average cross-validation prediction accuracy for each state is reported in Table 4.
For α-helices (H), the prediction accuracy using templates with very low sequence similarity (0%, 10%] is already rather high (92.05%), mainly because a sufficient number of α-helix samples are available and the formation of α-helices results mainly from local interactions. Nevertheless, the structural templates help refine the α-helix predictions with slight accuracy improvements. When structural templates with 40% or better similarity are available, the prediction accuracy of β-sheets (E) is also improved to above 90%, reaching the theoretical upper bound in secondary structure prediction. Templates with 40%+ similarity also significantly improve the accuracies of 3-10 helices (G) and bends (S) from 20%+ to 50%+. Similar but less significant improvements are found in turns (T) and coils (C). However, the prediction results for bridges (B) and π-helices (I) are disappointing. Only when templates with very high similarity (>70%) are available can we obtain 44% prediction accuracy in bridges (B). The prediction accuracy for π-helices (I) is still 0%. This is mainly because π-helices are extremely rare (0.02%) and are often misclassified as α-helices (H).

Conclusions
We describe a template-based approach to enhance 8-state secondary structure prediction accuracy in this paper. Our computational results show that secondary structure templates, even those obtained from sequences with only 20%~30% sequence similarity, can help improve the 8-state prediction accuracy. Overall, 78.85% Q8 accuracy and 80.10% SOV8 accuracy are achieved in 7-fold cross validation. The effectiveness of using structural information in templates has been demonstrated on popular benchmarks including CB513, CASP9, Manesh215, and Carugo338. More importantly, when good templates are available, the prediction accuracy of less frequent secondary structure states, such as 3-10 helices, turns, and bends, is greatly improved, which makes the predictions suitable for practical use. A web server (C8-SCORPION) implementing 8-state secondary structure prediction is currently available at http://hpcr.cs.odu.edu/c8scorpion.