Secondary structure specific simpler prediction models for protein backbone angles

Motivation Protein backbone angle prediction has achieved significant accuracy improvement with the development of deep learning methods. Usually the same deep learning model is used in making predictions for all residues regardless of the categories of secondary structures they belong to. In this paper, we propose to train separate deep learning models for each category of secondary structures. Machine learning methods strive to achieve generality over the training examples and consequently lose accuracy. In this work, we explicitly exploit classification knowledge to restrict generalisation within the specific class of training examples. This is to compensate for the loss of generalisation by exploiting specialisation knowledge in an informed way. Results The new method named SAP4SS obtains mean absolute error (MAE) values of 15.59, 18.87, 6.03, and 21.71 respectively for four types of backbone angles φ, ψ, θ, and τ.
Consequently, SAP4SS significantly outperforms existing state-of-the-art methods SAP, OPUS-TASS, and SPOT-1D: the differences in MAE for all four types of angles are from 1.5 to 4.1% compared to the best known results. Availability SAP4SS along with its data is available from https://gitlab.com/mahnewton/sap4ss.

However, ω angles are 180° for the majority of proteins [3]. AA side chains also have their own dihedral angles, but they are out of the scope of this work since they can be dealt with later once backbones are obtained. Nevertheless, protein backbone structures are important for both template-based and template-free PSP [2,4].
Besides the representation method discussed above, protein backbone structures can also be represented by Cα atoms since successive Cα atoms have almost the same distance. In this case, instead of φ, ψ, and ω, two other angles θ and τ are used. Note that θ and τ are respectively a planar and a dihedral angle comprising respectively three and four consecutive Cα atoms. Since multiple residues are needed to define θ and τ, they could somewhat capture local structures.
In this work, we develop deep neural network (DNN) models to predict the backbone angles φ, ψ, θ, and τ for proteins. Protein backbone angle prediction (BAP) has achieved significant progress with the development of DNNs. Yet more accurate BAP is needed since errors in any angle in a protein have a cascading effect on the entire protein structure.
Both local structures around residues and long range interactions between residues have been considered in BAP. Sliding windows [5,6,9,12] around residues have been used in feature encoding to capture the local structures. On the other hand, entire protein sequences have been used as features [9,11,16] to capture long range interactions. Convolutional neural networks (CNNs) [8,14] or LSTM-BRNNs [6,7] have also been used to capture long range interactions.
For benchmark datasets, we refer to PISCES [17], SPOT-1D [7,18], PDB150 [19] and CAMEO93 [20]. The first two are large with respectively 5.5K and 12.5K proteins and 1.2M and 2.7M residues. The last two are small with 150 and 93 proteins respectively and are used in testing.
Proteins locally exhibit three major secondary structure (SS) types, namely helices, sheets, and coils. This three-state classification can be extended to an eight-state classification. Essentially some SS types are associated with angle ranges. For example, helices and sheets have ranges of 20 for φ and ψ. Because of these narrow angle ranges, BAP could be essentially viewed as a classification problem via SS type prediction, although backbone angles are actually continuous valued. Unfortunately, coils have no such ranges and they are about 40% of the residues in an average protein [21]. So SS prediction does not make BAP trivial. SS prediction has made significant progress via DNN models [8,11,22-26] and ab initio methods [27]. SSpro8 [28] achieves respectively 92% and 79% accuracy on proteins with or without homologs in the Protein Data Bank (PDB).
Predicted SS types have been used as features in deep learning for BAP [5,9,12,29]. Features in general in deep learning only implicitly capture problem characteristics and the neural network model then attempts to establish the unknown input-output relation, but again very implicitly. Also, any machine learning method strives to achieve generality over the training examples and consequently loses accuracy in the process. While generic artificial intelligence (AI) methods could be adapted to a range of problems easily, they usually suffer from the loss of explicit problem specific knowledge. So explicit exploitation of any available knowledge is of great importance in AI. This could actually bridge the gap between the generality of the approach and the specificity of the problem. Inspired by this AI interest, we attempt to explicitly exploit predicted SS knowledge in BAP. More particularly, for BAP, we train separate deep learning models for each SS category. This restricts the generalisation only within the specific class of training examples and thus compensates for the loss of generalisation by exploiting specialisation knowledge in an informed way.
We name our new BAP method as Simpler Angle Predictor for Secondary Structures (SAP4SS), which has DNN models similar to a very recent BAP method named SAP [30]. Like SAP, our new method SAP4SS has simpler DNN models than what other recent methods such as OPUS-TASS [8] and SPOT-1D [7] have. SAP4SS uses the same fully connected neural network (FCNN) architecture as SAP does while OPUS-TASS and SPOT-1D use ensembles of LSTM-BRNNs and ResNets. SAP4SS has more features than SAP but fewer features than OPUS-TASS and SPOT-1D. While SAP has been trained on all residues, SAP4SS has separate DNN models for residues that belong to separate 3-state SS types.
On well-known benchmark datasets, SAP4SS obtains mean absolute error (MAE) values of 15.59, 18.87, 6.03, and 21.71 respectively for φ, ψ, θ, and τ predictions. As a result, SAP4SS significantly outperforms existing state-of-the-art methods SAP, SPOT-1D and OPUS-TASS: the differences in MAE are from 1.5 to 4.1% compared to the best known results. The SAP4SS program along with its data is available from the website https://gitlab.com/mahnewton/sap4ss.
ANGLOR [12] predicts φ and ψ angles separately, by utilising neural networks and support vector machines (SVM) [31] respectively. SPIDER [5] applies a stacked sparse auto-encoder deep neural network for predicting θ and τ angles. SPIDER2 [9] uses three iterations of the SPIDER-type models. SPIDER3 [6] applies a bidirectional recurrent neural network (BRNN) with predicted backbone torsion angles, predicted secondary structures, and predicted solvent accessibilities as input. The predicted features are reused to train the BRNN iteratively four times. RaptorX-Angle [29] employs a combination of clustering and deep learning for predicting φ and ψ values. DeepRIN [19] utilises a deep residual inception network to predict φ and ψ values. NetSurfP-2.0 [14] employs large Long Short-Term Memory (LSTM) networks in BRNNs to predict φ and ψ angles.
SPOT-1D uses an ensemble of 9 LSTM-BRNN and ResNets with input features PSSM, HMM, 7PCP, and contact maps. The contact maps are from SPOT-Contact [18] and are used in a sliding window fashion. However, SPOT-1D also uses entire proteins at a time as input. SPOT-1D predicts all four types of backbone angles. The output angles are predicted as trigonometric ratios. OPUS-TASS predicts only φ and ψ angles with ensembles of DNNs having CNN, LSTM, and Transformer [32] layers. It has an input feature PSP19 [33] that classifies residues into rigid-body blocks. OPUS-TASS introduces a new constrained feature CSF3 [34] to describe backbone structures. OPUS-TASS uses a multitask learning strategy [35]. SAP predicts all four types of backbone angles using a simple FCNN with sliding windows, 8-state SS predictions, PSSM, and 7PCP input features.

Methods
In this section, we describe the deep learning models used in SAP4SS and the datasets used in experiments. These are similar to those used in SAP [30], but SAP4SS has additional input features and has separate DNN models for SS types.

Input features
Like SAP, for each residue, we consider the following features: 8-state SS prediction by SSpro8 [28], where the prediction is encoded as a one-hot vector as shown in Fig. 1; 20 features from the PSSM profile generated by three iterations of PSI-BLAST [10] against the UniRef90 sequence database updated in April 2018; and 7PCP. Moreover, for each residue, in SAP4SS, we additionally consider 20 residue-substitution features from the HMM profile generated by HHblits [36] and 1 feature for ASA predicted by SPOT-1D [7]. These make at most 56 features for each residue, but we evaluate their effects on the prediction accuracy by using and not using the HMM and ASA features.
To capture local structures around each residue, like SAP, we consider sliding windows of size W where ⌊W/2⌋ residues are after and ⌊W/2⌋ residues are before a given residue. Although some BAP methods based on recurrent neural networks (RNN) and CNNs use entire proteins at a time, we do not do so since the effects of very long range interactions are not clear from distance based molecular dynamic forces. For each residue, for a given window size W, we thus have up to 56W input features. Based on SAP's reported experimental results, for SAP4SS, we consider window sizes 5 and 9 only.
These input features are encoded by using either [0, 1] range based normalisation, shown in (1), or Z-score based normalisation, shown in (2), before feeding to the DNNs.
x′ = (x − x_min) / (x_max − x_min)    (1)
x′ = (x − µ) / σ    (2)
In (1), x_min and x_max are the minimum and maximum values in the training set for the feature which x is coming from. In (2), µ and σ are the mean and standard deviation of the values in the training set for the feature which x is coming from. In both (1) and (2), x′ is the encoded value.
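The window construction and the two encodings described above can be illustrated with a minimal sketch. The function names are hypothetical, and two details are assumptions not stated in the text: positions that fall outside the protein are zero-padded, and a constant feature (zero range or zero standard deviation) is encoded as 0.

```python
from typing import List, Sequence


def range_normalise(x: float, x_min: float, x_max: float) -> float:
    """[0, 1] range based encoding; x_min and x_max come from the training set."""
    if x_max == x_min:  # assumption: a constant feature is encoded as 0
        return 0.0
    return (x - x_min) / (x_max - x_min)


def zscore_normalise(x: float, mu: float, sigma: float) -> float:
    """Z-score based encoding; mu and sigma come from the training set."""
    if sigma == 0.0:  # assumption: a constant feature is encoded as 0
        return 0.0
    return (x - mu) / sigma


def window_features(per_residue: Sequence[Sequence[float]], i: int, w: int) -> List[float]:
    """Concatenate the features of floor(w/2) residues before residue i,
    residue i itself, and floor(w/2) residues after it.

    Positions outside the protein are zero-padded (an assumption).
    """
    half = w // 2
    n_feat = len(per_residue[0])
    out: List[float] = []
    for j in range(i - half, i + half + 1):
        if 0 <= j < len(per_residue):
            out.extend(per_residue[j])
        else:
            out.extend([0.0] * n_feat)
    return out
```

With 56 features per residue and an odd window size W, the returned vector has 56W entries, matching the input size stated above.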

Predicted outputs
For each residue, SAP4SS predicts four angles φ, ψ, θ, and τ, where θ and τ are defined by three and four consecutive Cα atoms respectively. Like SAP, SAP4SS predicts the angle values directly. To keep the predicted angles within the periodic range [−180°, 180°], 360° is subtracted from any angle value greater than 180° and added to any angle value less than −180°.
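The adjustment described above can be written as a short sketch. The function name is hypothetical, and the single correction step assumes that a raw network output never strays more than 360° outside the valid range.

```python
def wrap_angle(angle: float) -> float:
    """Bring a directly predicted angle (in degrees) back into [-180, 180].

    Applies the rule from the text once: subtract 360 if the value exceeds
    180, add 360 if it is below -180. Assumes the raw value lies within
    (-540, 540), so that one correction is enough.
    """
    if angle > 180.0:
        return angle - 360.0
    if angle < -180.0:
        return angle + 360.0
    return angle
```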

Loss functions
We use MAE as the loss function in the DNN models. We calculate the absolute error AE = min(D, |360 − D|) where D = |P − A|, P is a predicted angle, and A is the actual angle for a residue. The mean is taken over the AEs for all residues. We ignore the angles at the beginning or at the end of the proteins, since they are not defined. The AE deals with the angle periodicity issue and is used in validation and testing as well.
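As an illustration of the error measure above, the following sketch computes AE and MAE in plain Python. Representing an undefined terminal angle as None is an assumption made here for the example; the actual implementation may mask such residues differently.

```python
from typing import Optional, Sequence


def absolute_error(predicted: float, actual: float) -> float:
    """AE = min(D, |360 - D|) with D = |P - A|, all in degrees."""
    d = abs(predicted - actual)
    return min(d, abs(360.0 - d))


def mean_absolute_error(predicted: Sequence[float],
                        actual: Sequence[Optional[float]]) -> float:
    """Mean of the AEs over residues whose actual angle is defined.

    Undefined angles (e.g. at chain termini) are given as None and skipped.
    """
    errors = [absolute_error(p, a) for p, a in zip(predicted, actual) if a is not None]
    if not errors:
        raise ValueError("no defined angles to average over")
    return sum(errors) / len(errors)
```

For example, a prediction of 179° against an actual angle of −179° gives D = 358 and hence an AE of 2°, rather than 358°.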

Neural networks
We follow the same neural network architecture and implementation as SAP. In summary, we use an FCNN with 3 hidden layers, each having 150 neurons. Different numbers of layers have already been experimented with by SAP; so we do not run similar experiments again. The number of input features to the FCNN depends on whether the HMM and ASA features are used and on the window size. The Python-based FCNN implementation uses the Keras library, the SGD optimiser with momentum 0.9, and the kernel initialiser glorot_uniform. The learning rate is initially 0.01. Then, after three successive iterations with no improvement in the loss function, the learning rate is reduced by a factor of 0.5 until it reaches 10^−15. For the input and hidden layers, the activation function is sigmoid while it is linear for the output layer. NVIDIA Tesla V100-PCIE-32GB machines are used to run the programs.
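The learning rate schedule described above can be simulated with a small library-free sketch. It is similar in spirit to the Keras ReduceLROnPlateau callback, but whether the original implementation uses that callback is not stated, and the function name and the exact counting of non-improving iterations are assumptions.

```python
from typing import List, Sequence


def lr_schedule(losses: Sequence[float],
                initial_lr: float = 0.01,
                factor: float = 0.5,
                patience: int = 3,
                min_lr: float = 1e-15) -> List[float]:
    """Return the learning rate in effect after each iteration.

    The rate is multiplied by `factor` whenever `patience` successive
    iterations show no improvement over the best loss seen so far, and it
    is never reduced below `min_lr`.
    """
    lr = initial_lr
    best = float("inf")
    stale = 0  # successive iterations without improvement
    history: List[float] = []
    for loss in losses:
        if loss < best:
            best = loss
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                lr = max(lr * factor, min_lr)
                stale = 0
        history.append(lr)
    return history
```

Starting from 0.01, the rate passes below the floor after about 44 halvings, after which it stays at the minimum value.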

Benchmark datasets
In this study, we use the datasets SPOT-1D [7,18], PDB150 [19], and CAMEO93 [20]. Our training and validation proteins come from SPOT-1D while testing proteins come from all three datasets. The SPOT-1D dataset has proteins that were culled from PISCES [17] in February 2017 with the constraints of resolution < 2.5 Å, R-free < 1, and a sequence identity cutoff of 25% according to BlastClust [10]. SPOT-1D also has proteins that were released between January 01, 2018 and July 16, 2018 with resolution < 2.5 Å, R-free < 0.25, and a 25% sequence identity cutoff w.r.t. those structures released prior to 2018. The PDB150 dataset has proteins released between February 1, 2019 and May 15, 2019. For each PDB150 protein, PSI-BLAST [10] was applied against the whole CullPDB [17] dataset with e-value smaller than 0.005. The CAMEO93 dataset contains proteins released between February 2020 and March 2020. We use SSpro8 [28] to generate SS predictions. So we perform 25% sequence similarity checking of training, validation, and testing proteins w.r.t. SSpro8's training proteins. Moreover, as is done in SAP, we have performed additional filtering to deal with sequence mismatches between data source files and with discontinuities in proteins' amino acid or secondary structure sequences. Also, we have excluded some proteins that cannot be dealt with by OPUS-TASS and have included large parts of some proteins that have discontinuities. After all these, we have 61 and 55 proteins left in the PDB150 and CAMEO93 datasets respectively. Since these are considerably small numbers, we combine these with the testing proteins from SPOT-1D. Table 1 shows the numbers of proteins in each of the training, validation, and testing datasets.

SS specific DNNs
We use 3-state SS predictions made by SSpro8 [28]. These are denoted by coils, helices, and sheets. We then categorise residues in the training proteins based on the three types.
For each of the three types, we then train a separate DNN on the residues that belong to that SS type. For testing, based on which of the three SS types a residue belongs to, we use the particular DNN for the SS type. Table 2 shows the distribution of the residues over the SS types: the top part shows the Dictionary of Protein Secondary Structure (DSSP) based actual classifications and the bottom part shows the predicted classifications.
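The per-SS-type training split and test-time routing described above can be sketched as follows. The labels "H", "E", and "C" for helices, sheets, and coils, the function names, and the representation of a model as a plain callable are all assumptions made for illustration; they stand in for the trained DNNs.

```python
from typing import Callable, Dict, List, Sequence

# A model is any callable mapping one residue's input features to an angle.
Model = Callable[[Sequence[float]], float]

SS_TYPES = ("H", "E", "C")  # assumed labels: helix, sheet, coil


def split_by_ss(features: Sequence[Sequence[float]],
                ss_labels: Sequence[str]) -> Dict[str, List[Sequence[float]]]:
    """Group residue feature vectors by predicted 3-state SS type,
    giving one training pool per SS-specific model."""
    groups: Dict[str, List[Sequence[float]]] = {ss: [] for ss in SS_TYPES}
    for x, ss in zip(features, ss_labels):
        groups[ss].append(x)
    return groups


def predict_with_ss_models(features: Sequence[Sequence[float]],
                           ss_labels: Sequence[str],
                           models: Dict[str, Model]) -> List[float]:
    """For each residue, use the model trained for its predicted SS type."""
    if len(features) != len(ss_labels):
        raise ValueError("one SS label is needed per residue")
    return [models[ss](x) for x, ss in zip(features, ss_labels)]
```

Note that the routing relies on predicted SS types at both training and test time, so an SS misprediction sends a residue to a model trained on a different class.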

Results
We compare various SAP4SS settings to find the best setting for each of the three SS types and for each of the four angle types. We then compare the performances of the best settings with that of the current state-of-the-art predictors. We also analyse the results in various ways.

Determining best settings
We consider SAP as our baseline setting in which 8-state SS predictions, PSSM, and 7PCP are used as input features and an FCNN with 3 hidden layers as the DNN. For SAP4SS, we consider ASA and HMM as two other types of features. For input encoding, we consider either range based or Z-score based encoding and for window size, we consider either 5 or 9. We, therefore, have 16 SAP4SS settings for each of the three SS types. Each SAP4SS setting is denoted by a name AHIW where A ∈ {Y, N} denotes whether ASA is used or not, H ∈ {Y, N} denotes whether HMM features are used or not, I ∈ {R, Z} denotes whether range based or Z-score based input encoding is used, and W ∈ {5, 9} denotes the window size. For example, a setting YNR5 denotes that ASA is used, HMM features are not used, range based input encoding is used, and the window size used is 5. Figure 2 shows the validation and testing performances of all 16 SAP4SS settings for each type of backbone angle and for each of the 3-state SS classes. From the charts in the figure, the best settings as shown in Table 3 are selected, putting more emphasis on the testing performances and breaking ties with a view to reducing the total number of best settings. Notice that the testing performances in the charts are often worse than the validation performances and for the selected best settings the differences are up to 2.68 for τ and coils. This is explainable since our testing set is diverse as it includes proteins from the SPOT-1D, CAMEO93, and PDB150 datasets.
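The AHIW naming scheme can be enumerated and decoded with a short sketch. The function names and dictionary keys are hypothetical; the per-residue feature counts (8 for SS, 20 for PSSM, 7 for 7PCP, 1 for ASA, 20 for HMM) are those given in the Input features section.

```python
from itertools import product
from typing import Dict, List, Union


def all_settings() -> List[str]:
    """Enumerate the 2 x 2 x 2 x 2 = 16 setting names of the form AHIW."""
    return ["".join(parts) for parts in product("YN", "YN", "RZ", "59")]


def parse_setting(name: str) -> Dict[str, Union[bool, str, int]]:
    """Decode a setting name such as 'YNR5' into its four choices."""
    if len(name) != 4 or name not in all_settings():
        raise ValueError(f"not a valid AHIW setting: {name!r}")
    a, h, i, w = name
    return {
        "asa": a == "Y",
        "hmm": h == "Y",
        "encoding": "range" if i == "R" else "zscore",
        "window": int(w),
    }


def input_size(name: str) -> int:
    """Number of FCNN inputs: per-residue features times the window size."""
    cfg = parse_setting(name)
    per_residue = 8 + 20 + 7  # 8-state SS one-hot, PSSM, 7PCP (baseline)
    if cfg["asa"]:
        per_residue += 1
    if cfg["hmm"]:
        per_residue += 20
    return per_residue * int(cfg["window"])
```

For instance, YYZ9 uses all 56 per-residue features with window size 9, giving 504 inputs, while NNR5 uses only the 35 baseline features and gives 175 inputs.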
For the best settings, we perform 10-fold cross validation to check the robustness of the selected models. The variations in the performances of the 10-fold runs are statistically not significant; so we do not show them.
Henceforth, for further experiments, we use the best settings in Table 3 as the final SAP4SS method and compare it with existing methods.

Comparison with existing predictors
We compare SAP4SS's performance with that of SPOT-1D [7], OPUS-TASS [8], and SAP [30] on our 1321 testing proteins. As described in the section titled Benchmark datasets, the testing proteins are from the SPOT-1D, PDB150, and CAMEO93 datasets. Table 4 shows the MAE values for various angles as predicted by SAP4SS and existing state-of-the-art methods for residues of various 3-state actual SS types. SAP4SS is better than the existing methods in all cases, except in φ angles for sheets. To see the relative improvements in MAE values obtained by SAP4SS compared to the existing methods, we compute a relative improvement with respect to the best existing method where SAP4SS obtains the best performance, and improvement = (Best MAE − SAP4SS MAE) / SAP4SS MAE (a negative value, and hence a degradation) where SAP4SS is outperformed. From the table, we see the improvements could be more than 6% and in 8 out of 12 cases, more than 1%. Table 5 shows the MAE values for angles as predicted by various methods. SAP4SS performs better than the existing methods in all four types of angles. We compute the relative improvement using the same formula as described before and see that SAP4SS obtains more than 1.5% and less than 4.1% improvement in the MAE values.
To check the correlations between the actual angles and the angles predicted by SAP4SS and the existing methods, we compute Spearman rank correlation coefficients and show them in Table 6. As we see, SAP4SS obtains the best positive coefficients among all the methods compared.
To check the statistical significance of the differences in the AE values for various methods, we perform Analysis of Variance (ANOVA) and, at the 95% confidence level, we see that at least one method is significantly different from the other methods. So, as a posthoc analysis technique, we then perform Tukey's Honest Significant Difference (HSD) test to check pairwise differences at the 95% confidence level. From the results, we see that the null hypothesis is not rejected only for predictions of φ values by OPUS-TASS and SAP. For other angles and other pairs of methods, the differences are significant. Figure 3 shows the 95% confidence intervals for the AE values for various methods. Any overlap, e.g. for φ for OPUS-TASS and SAP, denotes that the difference is not significant, while non-overlapping intervals indicate significant differences.

Table 3 The best SAP4SS settings for SS types and angle types (columns: SS type, Angle types, Best setting)

Comparison on protein length groups
We group our testing proteins on their numbers of residues and then compare the MAE values of SAP4SS, SAP, OPUS-TASS, and SPOT-1D for each group. For convenience of comparison, in Table 7, MAE values for SAP4SS are shown and for other methods, relative MAEs are computed and shown in % columns (formula shown in the caption of the table). As we see, SAP4SS outperforms the other three methods in all cases except one (OPUS-TASS in φ for length group 301-400).

Correct prediction per protein
We compare SAP4SS with SAP, OPUS-TASS, and SPOT-1D on the percentages of proteins having certain percentages of angles correctly predicted within a given threshold MAE. The threshold MAE values are 6° and 12°; in SAP and SPIDER, multiples of 6° have been used as thresholds and the lowest MAE is about 6° for one angle, e.g. θ. The percentages of angles are varied from 0 to 100% in steps of 10%. Figure 4 shows these results. We see that SAP4SS has higher percentages of proteins with the same percentages of angles within the thresholds, and so outperforms the other methods for all angles at both threshold levels.
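The measure behind this comparison can be sketched as follows. The function names are hypothetical, and two conventions are assumptions: an angle counts as correct when its absolute error is at most the threshold, and a protein counts when its share of correct angles is at least the given percentage.

```python
from typing import List, Sequence


def fraction_within(errors: Sequence[float], threshold: float) -> float:
    """Fraction of a protein's angle errors that are within the threshold."""
    return sum(1 for e in errors if e <= threshold) / len(errors)


def percent_proteins(per_protein_errors: Sequence[Sequence[float]],
                     threshold: float,
                     min_percent: float) -> float:
    """Percentage of proteins with at least `min_percent` % of their
    angles predicted within `threshold` degrees."""
    hits = sum(
        1 for errs in per_protein_errors
        if 100.0 * fraction_within(errs, threshold) >= min_percent
    )
    return 100.0 * hits / len(per_protein_errors)


def curve(per_protein_errors: Sequence[Sequence[float]],
          threshold: float) -> List[float]:
    """Evaluate the measure for angle percentages 0, 10, ..., 100."""
    return [percent_proteins(per_protein_errors, threshold, p)
            for p in range(0, 101, 10)]
```

Each method yields one such curve per angle type and threshold; a curve that lies above another indicates that more proteins reach the same share of correctly predicted angles.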

Protein structure generation
With the φ and ψ angles predicted by the various methods compared, we generate the protein structures for a number of proteins. From our 1321 proteins, we take only the whole proteins but not the domains. Then, we create a subset named alpha proteins taking 34 proteins that have at least 60% helix residues. Similarly, we create a subset named beta proteins taking 19 proteins that have at least 60% sheet residues. While generating the protein structures, ω angles are assumed to be 180° and C-N peptide bonds are assumed to have length 1.33 Å. The bond angles and bond lengths are usually standard within different residues of the same amino acids over native structures of various proteins. Using the standard bond angles and bond lengths, we could build each amino acid residue from scratch. However, we take an alternative way, in which we take one residue for each amino acid from native structures of some other known proteins and build a residue library. Each of the 20 residues in the library essentially has coordinates for up to Cβ atoms; note that the necessary bodily translation and rotation operations are performed to establish local coordinate systems. Concatenating these residues, using each residue as many times as we need for the given protein, we first create a linear chain for a protein and then apply the predicted φ and ψ angles on each residue to get the generated conformation. Figure 5 shows the RMSD values for various methods for the alpha and beta proteins. The performances are clearly far from a reasonable target of 6 Å and any comparison with large RMSD values is merely indicative. Nevertheless, we see that SAP4SS has shown comparable performance with respect to other methods in both alpha and beta proteins. In this context, note that SAP4SS and SAP use sliding windows to capture local interactions while OPUS-TASS and SPOT-1D use entire proteins to capture global interactions.

Conclusion
In this paper, we improve the accuracy of backbone angle prediction for protein structures. Machine learning methods lose accuracy in the process of achieving generality over the training proteins. We aim to supplement our deep learning method with specific knowledge about secondary structure types. Using predicted secondary structure types, we categorise residues and then train separate deep learning models for each category. This essentially restricts the generalisation process within the specific category. Our method named SAP4SS obtains mean absolute error values of 15.59, 18.87, 6.03, and 21.71 respectively for four types of backbone angles φ, ψ, θ, and τ. These are 1.5-4.1% better than predictions made by the current state-of-the-art prediction methods. SAP4SS along with its data is available from https://gitlab.com/mahnewton/sap4ss.