FoldHSphere: deep hyperspherical embeddings for protein fold recognition

Background: Current state-of-the-art deep learning approaches for protein fold recognition learn protein embeddings that improve prediction performance at the fold level. However, there still exists a performance gap between the fold level and the (relatively easier) family level, suggesting that it might be possible to learn an embedding space that better represents the protein folds.

Results: In this paper, we propose the FoldHSphere method to learn a better fold embedding space through a two-stage training procedure. We first obtain prototype vectors for each fold class that are maximally separated in hyperspherical space. We then train a neural network by minimizing the angular large margin cosine loss to learn protein embeddings clustered around the corresponding hyperspherical fold prototypes. Our network architectures, ResCNN-GRU and ResCNN-BGRU, process the input protein sequences by applying several residual-convolutional blocks followed by a gated recurrent unit-based recurrent layer. Evaluation results on the LINDAHL dataset indicate that the use of our hyperspherical embeddings effectively bridges the performance gap between the family and fold levels. Furthermore, our FoldHSpherePro ensemble method yields an accuracy of 81.3% at the fold level, outperforming all the state-of-the-art methods.

Conclusions: Our methodology is efficient in learning discriminative and fold-representative embeddings for the protein domains. The proposed hyperspherical embeddings are effective at identifying the protein fold class by pairwise comparison, even when amino acid sequence similarities are low.

Supplementary Information: The online version contains supplementary material available at 10.1186/s12859-021-04419-7.


1 Training Dataset and Cross-Validation Subsets
2 Thomson-derived Hyperspherical Prototypes

Optimization curves
In order to maximally separate our fold class prototypes in the hyperspherical space, we minimized a Thomson-related loss function to optimize the matrix $W \in \mathbb{R}^{K \times d}$, where $K = 1154$ is the number of fold classes and $d = 512$ is the embedding dimension. We accelerated the optimization process by using the Adam optimizer [1] with a learning rate of $10^{-3}$. To check the convergence of the algorithm, at each iteration we monitored both the sum of all inverse distances (THL-sum) and the maximum cosine similarity between all pairs of prototypes. Figure S1 includes the optimization curves for the two loss variants, THL-sum and THL-maxcos, and the two initial matrices: $W_{softmax}$ (taken from the CNN-GRU model) and $W_{random}$. When optimizing the THL-sum function, we see that the maximum cosine similarity between all pairs of prototypes decreases up to a certain point, after which it starts increasing again. This suggests that the algorithm is incorrectly trying to separate the majority of points while bringing a few others closer together in order to further minimize the loss function. To avoid this unwanted behavior, we set the optimum iteration as the one with the minimum value of maximum cosine similarity, which is 1130 for the $W_{softmax}$ and 546 for the $W_{random}$ initial matrices.
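As a minimal sketch of this first stage (in PyTorch; the exact THL-sum expression follows our reading of the sum-of-inverse-distances description above, and all function and variable names are illustrative):

```python
import torch
import torch.nn.functional as F

K, d = 1154, 512  # number of fold classes and embedding dimension

# Random initialization (W_random); alternatively, W_softmax would be loaded
# from the classification layer of a pre-trained CNN-GRU model.
W = torch.randn(K, d, requires_grad=True)
optimizer = torch.optim.Adam([W], lr=1e-3)

def thl_sum(W):
    """THL-sum: sum of inverse pairwise distances between unit-norm prototypes."""
    Wn = F.normalize(W, dim=1)                # project rows onto the hypersphere
    dists = torch.cdist(Wn, Wn)               # K x K pairwise Euclidean distances
    iu = torch.triu_indices(K, K, offset=1)   # each pair once, diagonal excluded
    return (1.0 / dists[iu[0], iu[1]]).sum()

def max_cos(W):
    """Monitoring metric: maximum cosine similarity over all prototype pairs."""
    Wn = F.normalize(W, dim=1)
    cos = Wn @ Wn.t()
    cos.fill_diagonal_(-1.0)                  # exclude self-similarity
    return cos.max().item()

for it in range(2000):
    optimizer.zero_grad()
    thl_sum(W).backward()
    optimizer.step()
    # track max_cos(W) and keep the iteration where it is lowest as the optimum
```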
On the other hand, the THL-maxcos function achieves a lower maximum cosine similarity, whereas the sum of inverse distances remains higher. In this case, we optimized for a large number of iterations (50,000).

Figure S1: Thomson optimization curves, monitoring two metrics at each iteration: sum of inverse distances (above) and maximum cosine similarity (below) between all pairs of prototypes. For both metrics, we compare different options for initialization and loss function: initial matrix $W_{softmax}$ and THL-sum (blue line), $W_{softmax}$ and THL-maxcos (magenta line), or $W_{random}$ and THL-sum (green dashed line).

Cosine similarity of prototype vectors
We then examined several structural characteristics of the optimized prototypes, in comparison with the initial matrices $W_{softmax}$ and $W_{random}$. In Figure S2, we plot the $K \times K$ cosine similarity matrix for each set of fold class vectors or prototypes, as well as the histogram of such pairwise cosine similarities. To ensure maximum angular separation between prototypes, given the number of points (fold classes) and dimensions in the hypersphere, the cosine values should be around zero or negative, as this translates into angles close to or greater than 90 degrees. We observe this in the cosine similarity histograms for the optimized prototypes. However, the cosine similarity matrix for the initial $W_{softmax}$ suggests that the fold class vectors learned by the softmax loss may contain rich information about the structural classes (i.e., 7 clusters can be observed, the same number as structural classes defined in SCOPe [2]).

Figure S2: Cosine similarity matrices (above) and cosine similarity probability histograms (below) computed for the $K$ prototypes (corresponding to $K$ different folds). The compared sets of prototypes, from left to right, are: initial matrix $W_{softmax}$, optimized $W_{softmax}$ with THL-sum, optimized $W_{softmax}$ with THL-maxcos, initial $W_{random}$, and optimized $W_{random}$ with THL-sum.
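For reference, a minimal sketch of how such a matrix and histogram can be computed (NumPy/Matplotlib; here `W` is assumed to be any of the compared prototype matrices as a NumPy array):

```python
import numpy as np
import matplotlib.pyplot as plt

Wn = W / np.linalg.norm(W, axis=1, keepdims=True)   # unit-norm rows
C = Wn @ Wn.T                                       # K x K cosine similarity matrix

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(10, 4))
ax1.imshow(C, vmin=-1, vmax=1, cmap='coolwarm')     # similarity matrix (above)
pairs = C[np.triu_indices_from(C, k=1)]             # off-diagonal pairs only
ax2.hist(pairs, bins=100, density=True)             # probability histogram (below)
plt.show()
```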

Intra- and inter-structural class prototype separation
To evaluate the structural class information contained in the optimized prototypes, we grouped the $K$ fold prototypes into their respective structural classes (7 classes, from a to g in SCOPe [2]). Then, we measured the average separation in terms of cosine distance, considering fold prototypes within the same structural class (intra-class separation) and prototypes from different structural classes (inter-class separation). We also computed the angular Fisher score (AFS) [3], defined as:

$$\mathrm{AFS} = \frac{S_{intra}}{S_{inter}} = \frac{\sum_{r} \sum_{\mathbf{w}_i \in W_r} \left(1 - \cos\langle \mathbf{w}_i, \mathbf{m}_r \rangle\right)}{\sum_{r} n_r \left(1 - \cos\langle \mathbf{m}_r, \mathbf{m} \rangle\right)} \qquad (1)$$

where $S_{intra}$ and $S_{inter}$ are the intra-class and inter-class scatter values, respectively, $W_r$ is the subset of $W$ containing the $n_r$ prototypes from structural class $r$, $\mathbf{m}_r$ is the mean vector of those $n_r$ prototypes, and $\mathbf{m}$ is the mean vector of the whole set of prototypes. Lower AFS values thus indicate prototypes that cluster more tightly within their structural classes, relative to the separation between classes.
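As a minimal sketch of this computation (PyTorch; the function name is our own, and we assume the class labels of the $K$ prototypes are given as an integer tensor):

```python
import torch
import torch.nn.functional as F

def angular_fisher_score(W, labels):
    """Angular Fisher score of eq. (1): intra-class over inter-class angular
    scatter (lower values indicate tighter structural classes).
    W: (K, d) prototype matrix; labels: (K,) structural class per prototype."""
    Wn = F.normalize(W, dim=1)            # unit-norm prototypes
    m = Wn.mean(dim=0)                    # global mean vector
    s_intra = s_inter = 0.0
    for r in labels.unique():
        Wr = Wn[labels == r]              # prototypes of structural class r
        mr = Wr.mean(dim=0)               # class mean vector
        s_inter = s_inter + len(Wr) * (1.0 - F.cosine_similarity(mr, m, dim=0))
        s_intra = s_intra + (1.0 - F.cosine_similarity(Wr, mr.unsqueeze(0), dim=1)).sum()
    return (s_intra / s_inter).item()
```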
In Figure S3, we can see that the prototypes optimized from $W_{softmax}$ using the THL-sum function retain the structural class information, with higher intra-class cosine similarity values than those across different structural classes. However, this information is not preserved when using $W_{random}$ as the initial matrix or THL-maxcos as the loss function. Additionally, in Table S3 we observe that the initial $W_{softmax}$ provides a lower angular Fisher score (AFS) than the initial $W_{random}$. Once more, this suggests that the prototypes in $W_{softmax}$ are more informative about the structural classes. However, the AFS values increase after optimizing both initial matrices with Thomson. This can be attributed to the significant increase in the cosine distance between all prototypes, which also increases the $S_{intra}$ term in equation (1). Overall, the $W_{softmax}$ with THL-sum option provides a better (lower) angular Fisher score (0.9073) than the rest of the options.

Figure S3: Intra-class and inter-class separation, in terms of cosine similarity, of the fold prototypes grouped by structural class, for each initialization and loss function option.

Cross-validation performance and optimal set of prototypes
We then trained our neural network models CNN-GRU and ResCNN-GRU using the LMCL function and a fixed matrix of prototypes in the classification layer. In Figure S4 we compare the cross-validation performance of three matrices optimized by Thomson: $W_{softmax}$ from each model with either THL-sum or THL-maxcos, and $W_{random}$ with THL-sum. Here we applied the tanh activation in the embedding layer and used a scale $s = 30$ in the LMCL function. These results show that the set of prototypes derived from the $W_{softmax}$ matrix with the THL-sum loss function yields better fold classification performance than the other two options.

Finally, we repeated the process for the rest of the neural network models (each with its own matrix $W_{softmax}$) and obtained the set of optimized prototypes by minimizing THL-sum. The Thomson optimization curves and the optimal iteration for each model can be found in Figure S5. These optimized matrices are the ones we used to train the neural network models with the Thomson LMCL option.
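A minimal sketch of this second stage (PyTorch, assuming a CosFace-style formulation of the LMCL; the scale $s = 30$ is the value stated above, and the margin $m = 0.6$ is the value reported later for the ResCNN-BGRU model):

```python
import torch
import torch.nn.functional as F

def lmcl_fixed_prototypes(embeddings, labels, prototypes, s=30.0, m=0.6):
    """Large margin cosine loss with a fixed, pre-optimized prototype matrix.
    embeddings: (B, d) network outputs; prototypes: (K, d) Thomson prototypes."""
    x = F.normalize(embeddings, dim=1)          # unit-norm embeddings
    W = F.normalize(prototypes, dim=1)          # unit-norm fold prototypes (frozen)
    cos = x @ W.t()                             # (B, K) cosine similarities
    margin = torch.zeros_like(cos)
    margin.scatter_(1, labels.unsqueeze(1), m)  # subtract m only at the true class
    return F.cross_entropy(s * (cos - margin), labels)
```

Because the prototypes are frozen, only the embedding network receives gradients, so the loss pulls each embedding toward its fold prototype on the hypersphere.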

3 Effect of Secondary Structure Predictions on Performance
In this work we represented each protein domain using a set of 45 features per amino acid residue, which include a one-hot encoding of the amino acid, the PSSM profile, and secondary structure and solvent accessibility predictions. Like other methods in the literature [4,5], we used the SSpro/ACCpro programs from SCRATCH-1D [6] to obtain the secondary structure and solvent accessibility predictions. However, it must be noted that SSpro/ACCpro use homology analysis, so when the protein domain can be found in the PDB database they provide nearly perfect predictions of these features. In order to study the impact of using different predictors, we replaced our predictions with those given by SSpro/ACCpro "ab-initio" (i.e., without homology analysis) and by NetSurfP-2.0 (hhblits) [7]. As the NetSurfP-2.0 method was introduced quite recently, we expect it to provide better results than SSpro/ACCpro "ab-initio", closer to those given by SSpro/ACCpro with homology. Table S4 includes the results obtained at the test phase (using LINDAHL) by our ResCNN-GRU and ResCNN-BGRU models trained using either the softmax loss, LMCL or Thomson LMCL. As we can see, performance drops when the homology analysis is deactivated (SCRATCH-AB) but, in general, the proposed losses achieve better performance than softmax. NetSurfP-2.0, on the other hand, provides results closer to those given by SCRATCH (homology), especially if we consider the top-5 accuracy. However, it is difficult to draw conclusions without re-training our models on such predictions. Differences in performance could also be explained by the fact that, in order to predict the fold class, some models might be more robust than others to secondary structure prediction errors.

Table S4: Effect of secondary structure and solvent accessibility predictions on FoldHSphere performance on the LINDAHL dataset. Fold recognition accuracy (%) is reported at the family, superfamily and fold levels, considering both the top 1 and top 5 ranked templates. We compare the predictions given by SCRATCH (homology) [6], SCRATCH-AB (ab-initio) [6] and NetSurfP-2.0 (hhblits) [7] on the pre-trained ResCNN-GRU and ResCNN-BGRU neural network models, using different loss functions: (a) softmax loss with sigmoid activation, (b) LMCL with tanh activation, and (c) Thomson LMCL with tanh activation.
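As an illustrative sketch of the per-residue representation described at the start of this section (the exact split of the 45 features into 20 one-hot, 20 PSSM, 3 secondary structure and 2 solvent accessibility dimensions is our assumption; names are hypothetical):

```python
import numpy as np

def residue_features(aa_onehot, pssm_row, ss3_probs, acc2_probs):
    """Concatenate the per-residue descriptors into a single 45-d vector.
    Assumed split: 20 (one-hot) + 20 (PSSM) + 3 (SS) + 2 (solvent acc.) = 45."""
    return np.concatenate([aa_onehot, pssm_row, ss3_probs, acc2_probs])

# For a domain of L residues, stacking the rows gives the (L, 45) network input.
```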

Figure S6: Two-dimensional t-SNE projection [8] of the 976 hyperspherical embeddings extracted for the LINDAHL dataset, computed with a perplexity of 50 and the cosine metric. The resulting points are colored according to the structural class of each domain, which are named from 1 to 9 in SCOP 1.37 [9].

Figure S7: Dendroheatmap of the 512-dimensional hyperspherical embeddings extracted from the ResCNN-BGRU model trained with Thomson LMCL ($s = 30$ and $m = 0.6$). The analysis was done by running bi-clustering over 53 protein domains from the LINDAHL test set, covering 7 folds named 1_3, 2_8, 3_33, 4_76, 5_1, 6_7 and 7_28. We computed the cosine distance between embedding vectors (rows) and between embedding components (columns) separately, and then applied hierarchical clustering with single linkage to group similar vectors and components together. The individual elements of each embedding vector are colored according to their values (lower values in green, higher values in red). Note that the legend values range from −1 to 1, as the embeddings were extracted after applying the tanh activation function. The protein domains cluster together according to their embedding vectors into 7 differentiated clusters, one for each selected fold.
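A minimal sketch of how these two visualizations could be reproduced (assuming scikit-learn, SciPy and seaborn; `emb` is a hypothetical (976, 512) NumPy array of extracted embeddings):

```python
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import pdist
from sklearn.manifold import TSNE
import seaborn as sns

# Figure S6: 2-D t-SNE projection with the stated settings.
coords = TSNE(n_components=2, perplexity=50, metric='cosine').fit_transform(emb)

# Figure S7: bi-clustering with cosine distance and single linkage,
# applied separately to rows (embedding vectors) and columns (components).
row_link = linkage(pdist(emb, metric='cosine'), method='single')
col_link = linkage(pdist(emb.T, metric='cosine'), method='single')
sns.clustermap(emb, row_linkage=row_link, col_linkage=col_link,
               cmap='RdYlGn_r', vmin=-1, vmax=1)   # green = low, red = high
```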