DeepFrag-k: a fragment-based deep learning approach for protein fold recognition

Background: Protein fold recognition is one of the most essential problems in structural bioinformatics. In this paper, we design a novel deep learning architecture, called DeepFrag-k, which identifies fold-discriminative features at the fragment level to improve the accuracy of protein fold recognition. DeepFrag-k is composed of two stages: the first stage employs a multi-modal Deep Belief Network (DBN) to predict the potential structural fragments given a sequence, represented as a fragment vector, and the second stage uses a deep convolutional neural network (CNN) to classify the fragment vector into the corresponding fold.
Results: Our results show that DeepFrag-k yields 92.98% accuracy in predicting the top-100 most popular fragments, which can be used to generate discriminative fragment feature vectors to improve protein fold recognition.
Conclusions: There is a set of fragments that can serve as structural "keywords" distinguishing between major protein folds. The deep learning architecture in DeepFrag-k is able to accurately identify these fragments as structural features to improve protein fold recognition.

these experimentally-determined protein structures according to the hierarchy of structural similarity. Over the past decades, the number of identified protein sequences has increased dramatically due to high-throughput sequencing technologies; however, the number of unique structural folds has remained unchanged over the past seven years [3], indicating that the protein structure universe is nearly complete. A highly accurate computational fold recognition method is a critical tool to bridge the sequence-structure gap.
Fold recognition methods can be classified into two categories: sequence alignment methods and machine learning methods [4]. The idea behind sequence alignment methods is to match a sequence or sequence profile against those with experimentally determined structures as templates [5] to identify the most suitable fold. Machine learning methods, on the other hand, aim to identify global or local features of a given sequence and then classify it into one of the known fold categories. Early machine learning fold recognition methods used multi-layer perceptrons and support vector machines [6]. Later, ensemble classifiers and kernel-based methods were introduced to discover correlations between sequence features, overcoming the weaknesses of the early machine learning methods and improving the discriminability of the fold recognizers [5]. Recently, deep learning techniques have been applied to extract effective features, such as secondary structures [4] and inter-residue contacts [7], to further improve fold recognition.
In this work, we present a novel deep neural network architecture, called DeepFrag-k, to classify target protein sequences into known protein folds. Unlike most fold recognition methods, which predict folds directly from sequence and sequence-related features, DeepFrag-k adopts a two-stage process, where a fragment vector is predicted in stage 1 and the corresponding protein fold is then predicted in stage 2. The fundamental idea in DeepFrag-k is to predict the potential structural fragments that a target protein sequence will form [8] during folding, represented as a fragment vector, which contains highly discriminative features to distinguish a protein fold [9]. If a protein sequence is regarded as a document, the fragments can be treated as words in this document. The fragments form structural motifs, which are the building blocks from which the protein structure is assembled. In particular, certain fragments are critical to carrying out important protein functions. These fragments can be treated as "keyword" features that are able to uniquely distinguish one fold from the others.
DeepFrag-k is composed of two stages. The first stage uses a multi-modal Deep Belief Network (DBN) to fuse multiple groups of features, including sequence composition, amino acid physicochemical properties, and evolutionary information, to precisely predict potential structure fragments for a given sequence, which are represented as a fragment vector. Then, a 1-D Convolutional Neural Network (CNN) is employed to classify the fragment vector into the appropriate fold. We evaluate DeepFrag-k on three fold recognition datasets: Ding and Dubchak (DD) [10], Extended DD (EDD) [11], and Taguchi and Gromiha (TG) [12]. Our results show that DeepFrag-k is more accurate, sensitive, and robust in protein fold recognition than the existing methods, including PFP-Pred [13], GAOEC [14], PFP-FunDSeqE [15], Dehzangi et al. [6,16], MarFold [17], PFP-RFSM [18], Feng and Hu [19], Feng et al. [20], PFPA [21], Paliwal et al. [22,23], Dehzangi et al. [24], HMMFold [25], Saini et al. [26], and ProFold [27].
Figure 1 presents the two-stage deep neural network architecture of DeepFrag-k. In the first stage, we predict a fragment vector representation of a target protein sequence using a fragment prediction model based on a multi-modal DBN [28], which predicts the potential fragments that the target protein sequence will form during the protein folding process. In particular, we focus on the top-100 most popular fragments, 4 to 20 residues in length, described in our Frag-k fragment libraries [8,9]. Our previous results [9] show that these fragments can be used as structural "keywords" to effectively distinguish between major protein folds. In the multi-modal DBN, the DBNs interact with each other to learn a latent fragment representation from the set of features derived from sequence composition, physicochemical properties, and evolutionary information. The output of the first stage is a fragment vector for the target protein sequence.
Afterwards, in the second stage, this fragment vector is fed to a 1D Convolutional Neural Network (1D-CNN) [29] classifier, as the feature vector of the target protein sequence, to predict the likelihood of each protein fold.

DeepFrag-k fold recognition architecture
DeepFrag-k is implemented on the TensorFlow platform. Leaky ReLU activation functions are used in the DBN and CNN layers to avoid the vanishing gradient problem and speed up training. The Adam optimization algorithm for stochastic gradient descent is adopted for training the DBN and CNN models, with a learning rate of 0.0001. The training of DeepFrag-k is carried out on a GPU P40 server with 3,840 CUDA cores and 24 GB of GDDR5 memory.
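As a rough illustration of the two training ingredients named above, the following sketch implements a leaky ReLU and a single-parameter Adam update in plain Python. Only the learning rate of 0.0001 comes from the text; the leaky ReLU slope `alpha` and the Adam constants `beta1`, `beta2`, and `eps` are commonly used defaults assumed here, and the quadratic toy objective is purely hypothetical.

```python
import math

def leaky_relu(x, alpha=0.01):
    """Identity for positive inputs, small slope alpha otherwise.
    alpha=0.01 is an assumed value; the paper does not state the slope."""
    return x if x > 0 else alpha * x

def adam_step(param, grad, m, v, t, lr=1e-4, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter.
    lr matches the paper; beta1, beta2, and eps are assumed defaults."""
    m = beta1 * m + (1 - beta1) * grad          # first-moment estimate
    v = beta2 * v + (1 - beta2) * grad * grad   # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                # bias correction
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

# Toy objective (hypothetical): minimise f(w) = (w - 1)^2 starting from w = 0.
w, m, v = 0.0, 0.0, 0.0
for t in range(1, 101):
    grad = 2 * (w - 1.0)
    w, m, v = adam_step(w, grad, m, v, t)
```

Because Adam normalises the gradient by its running magnitude, each step here moves `w` by roughly the learning rate, which is why such a small rate implies many training iterations.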

Fragment prediction (Stage 1)
A protein fold distinguishes itself by forming certain unique secondary structures and super-secondary structure motifs, such as β-hairpins, short β-sheets, helix-loop-helix, and helix-turn-helix, which are represented as structural fragments. Correctly predicting these fragments from a given sequence can lead to effective features for fold recognition. However, the sequence features used to predict fragments have distinct statistical properties, and the correlations between them are highly nonlinear [28]. It is difficult for a shallow model to capture these correlations and form an integrated, informative representation. Our fragment prediction model consists of a multimodal DBN and a fully-connected network. The multimodal DBN is designed to tackle the above challenge by using an integrated representation to enhance the fragment prediction accuracy [28]. Figure 2 summarizes the framework of our proposed fragment prediction model. We use the Frag-k fragment libraries to train the fragment prediction model. First, we use the extracted sequence composition, physicochemical properties, and evolutionary information as feature groups to learn the latent representations of the top-100 Frag-k fragments. As shown in [28], the top-100 Frag-k fragments are capable of classifying major SCOP folds with high accuracy and can also be used to assemble most protein structures with high precision. The multiple feature representations learned by the DBNs are concatenated to train a Restricted Boltzmann Machine (RBM) model [28], which fuses them into a latent feature representation for the feature groups. Finally, two fully-connected 1,000 × 1,000 neural network layers followed by a SoftMax layer of 100 output nodes, representing the top-100 Frag-k fragments, are trained with these latent feature representations to generate the fragment prediction. Such layer-by-layer learning helps gradually extract the effective features from the original feature groups [30].
The multimodal DBN learns discriminative latent features as a joint distribution determined by the hidden variables of the uncorrelated input feature groups [28]. As a result, the hybrid framework of multi-modal learning fuses an abstraction-level representation, which enables the fragment predictor to flexibly integrate different feature groups for fragments of different lengths.
The fragment prediction model is trained via stochastic gradient descent. During the training process, the Frag-k fragment library, with 1,000 samples in each fragment class, is randomly split into batches, each of which contains 500 samples. To prevent overfitting, dropout layers with a dropout rate of 0.5 are inserted after every hidden layer and an early stopping strategy is employed.
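The data flow at the two ends of the fragment predictor can be sketched as follows: per-modality latent vectors are concatenated into one joint input, and a SoftMax over 100 outputs selects a fragment class. This is only a schematic of the concatenation and output steps; the DBN, RBM, and fully-connected layers in between are omitted, and the vector sizes and logit values are hypothetical.

```python
import math

def fuse(*modality_vectors):
    """Concatenate per-modality latent vectors into one joint vector."""
    fused = []
    for vec in modality_vectors:
        fused.extend(vec)
    return fused

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical latent representations from the three feature groups.
seq_repr = [0.2, 0.7]          # sequence composition
phys_repr = [0.1, 0.4, 0.9]    # physicochemical properties
evo_repr = [0.5]               # evolutionary information
joint = fuse(seq_repr, phys_repr, evo_repr)

# Hypothetical output logits for the top-100 fragment classes.
logits = [0.0] * 100
logits[42] = 5.0
probs = softmax(logits)
predicted_fragment = max(range(100), key=probs.__getitem__)
```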
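The batching and early stopping described above can be sketched in a few lines. The sketch assumes 100 fragment classes (the top-100 targets) with 1,000 samples each, which gives 200 batches of 500 samples per epoch; the `patience` value and the validation losses are hypothetical, since the text does not specify the stopping criterion.

```python
import random

def make_batches(n_samples, batch_size, seed=0):
    """Shuffle sample indices and split them into consecutive batches."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i:i + batch_size] for i in range(0, n_samples, batch_size)]

def early_stopping_epoch(val_losses, patience=3):
    """Return the epoch index at which training stops: the first epoch where
    the validation loss has failed to improve `patience` times in a row,
    or the last epoch if that never happens. patience=3 is an assumption."""
    best = float("inf")
    waited = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, waited = loss, 0
        else:
            waited += 1
            if waited >= patience:
                return epoch
    return len(val_losses) - 1

# Assumed: 100 classes x 1,000 samples, batch size 500 as stated in the text.
batches = make_batches(100 * 1000, 500)
stop_at = early_stopping_epoch([1.0, 0.8, 0.7, 0.75, 0.72, 0.71, 0.6])
```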

Fold prediction (Stage 2)
The fragment feature vector generated from stage 1 is fed to a 1D-CNN architecture to predict the protein fold, as shown in Fig. 3. The proposed 1D-CNN comprises two pairs of convolution and max pooling layers (COV1-MP1 and COV2-MP2), two fully-connected layers FC1 and FC2, and a SoftMax layer. Between MP1 and COV2, we include a stacking The purpose of these 2D filters is to capture the relationships across the latent features produced by the convolution filters of the original fragment vector in COV1. Then the generated output is subsampled in max pooling layer MP2. In order to classify the flattened output of MP2 into the corresponding folds, two fully-connected layers, FC1 and FC2, followed by a SoftMax layer are employed. We summarize the hyper-parameters for the deep fold recognition architecture in Table 1.
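To make the convolution and pooling steps concrete, the following sketch applies one 1D convolution filter and one max pooling layer to a toy fragment vector. The vector length of 100, the kernel of size 3, and the pooling size of 2 are assumptions chosen for illustration; they are not the hyper-parameters of Table 1, and only a single filter of the first COV-MP pair is shown.

```python
def conv1d(signal, kernel):
    """Stride-1 convolution without padding (cross-correlation, as in CNNs)."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]

def max_pool1d(signal, size):
    """Non-overlapping max pooling; an incomplete trailing window is dropped."""
    return [max(signal[i:i + size])
            for i in range(0, len(signal) - size + 1, size)]

# Toy fragment vector (assumed length 100) with two adjacent active fragments.
fragment_vector = [0.0] * 100
fragment_vector[10] = 1.0
fragment_vector[11] = 1.0

feature_map = conv1d(fragment_vector, [1.0, 1.0, 1.0])  # length 100 - 3 + 1 = 98
pooled = max_pool1d(feature_map, 2)                      # length 98 // 2 = 49
```

The filter responds most strongly where neighbouring entries of the fragment vector are active together, which is the kind of local pattern a convolution layer over the fragment vector can pick up.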

Feature extraction
Constructing a proper feature vector from a protein sequence is a critical step for protein fragment prediction [7]. Using a multiple-feature extraction strategy, which represents the sequence, evolutionary, and physicochemical information of a sequence fragment, maximizes the discriminative capability of the fold recognizer [31]. The sequence features for fragments used in DeepFrag-k include frequencies of functional groups, information entropy of amino acids and dipeptides [32], distribution of amino acid relative positions [31], and transitions of functional groups [33]. The physicochemical features include PseAAC (Pseudo Amino Acid Composition) [34] and Discrete Wavelet Transform (DWT) of   Table 2.
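As an example of one of the sequence features listed above, the information entropy of amino acids and dipeptides can be computed from the composition of a fragment. The sketch below assumes the standard Shannon entropy in bits over observed residue and overlapping dipeptide frequencies; the exact definition used in [32] may differ, and the function names are hypothetical.

```python
import math
from collections import Counter

def shannon_entropy(items):
    """Shannon entropy (in bits) of the empirical distribution of items."""
    counts = Counter(items)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def amino_acid_entropy(fragment):
    """Entropy of the single-residue composition of a fragment."""
    return shannon_entropy(fragment)

def dipeptide_entropy(fragment):
    """Entropy of the overlapping dipeptide composition of a fragment."""
    return shannon_entropy(fragment[i:i + 2] for i in range(len(fragment) - 1))

uniform = amino_acid_entropy("ACDE")   # four distinct residues, equal frequency
repeat = dipeptide_entropy("ACAC")     # dipeptides: AC, CA, AC
```

A fragment made of a single repeated residue has zero entropy, while a fragment with evenly mixed residues has the maximum entropy for its alphabet size.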

Datasets
Three datasets, DD [10], TG [12], and EDD [11], are used to compare the effectiveness of DeepFrag-k with existing fold recognition methods. The sequences in these datasets cover most of the sequences in the SCOP database. The DD dataset is composed of a training set and a testing set, both of which cover 27 protein folds in the SCOP database that together span the α, β, α/β, and α + β structural classes. The DD training set contains 311 protein sequences with ≤ 40% residue identity and the testing set contains 383 protein sequences with ≤ 35% residue identity. Additionally, the sequences in the training set have ≤ 35% identity with those in the testing set, which ensures an unbiased performance evaluation. The TG dataset contains 1,612 protein sequences with ≤ 25% sequence identity belonging to 30 different folds in SCOP 1.73 [12]. The EDD dataset is an extended version of the DD dataset, which contains 3,418 protein sequences with ≤ 40% sequence identity [11].

Fragment prediction model
The extracted sequence composition, physicochemical properties, and evolutionary information features of the Frag-k fragments are fed to the fragment prediction model to predict their corresponding fragment classes. We investigate the performance of the classifier measured by specificity, sensitivity, and accuracy, which are defined as the percentage of predicted fragment classes that are true positives, the percentage of true positives that are correctly predicted, and the fraction of fragments that are correctly classified, respectively. We first examine the classification of sequence fragments of the same length. Figure 4 shows the accuracy, specificity, and sensitivity of the ten-fold cross-validation results for top-100 Frag-k fragment targets of lengths ranging from 4 to 20 residues. The prediction accuracies of longer fragments (≥ 10 residues) are better than those of the shorter ones, with both specificity and sensitivity over 80%. This is because the longer fragments encompass richer discriminative information. However, when the top-100 Frag-k fragments with variable lengths are used as the target classes, the prediction accuracy reaches over 90%, because these top-100 Frag-k fragments with variable lengths are more representative structural keywords in the protein structure universe, as we showed in our previous study [9].
We analyze the effect of the three feature groups (Table 2) used to represent the sequence fragments on the prediction accuracy for variable-length Frag-k fragments. We compose individual and combined sequence composition, physicochemical properties, and evolutionary information feature vectors to train the fragment prediction model shown in Fig. 2. The ten-fold cross-validation accuracy results are reported in Fig. 5. The evolutionary information plays the most important role; however, all of these feature groups contribute to the overall improvement in fragment prediction accuracy.
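The three measures can be written out directly from the definitions given above. In this sketch, specificity follows the text's definition, TP / (TP + FP), which is the quantity more commonly called precision, and sensitivity is TP / (TP + FN). The function names and the toy labels are hypothetical, and per-class values are computed one class at a time.

```python
def class_metrics(y_true, y_pred, cls):
    """Per-class specificity and sensitivity, following the text's definitions.
    specificity = TP / (TP + FP)  (commonly called precision)
    sensitivity = TP / (TP + FN)  (commonly called recall)"""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == cls and p == cls)
    fp = sum(1 for t, p in pairs if t != cls and p == cls)
    fn = sum(1 for t, p in pairs if t == cls and p != cls)
    specificity = tp / (tp + fp) if tp + fp else 0.0
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    return specificity, sensitivity

def accuracy(y_true, y_pred):
    """Fraction of fragments assigned to the correct class."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# Toy example with three fragment classes (hypothetical labels).
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
spec0, sens0 = class_metrics(y_true, y_pred, 0)
spec1, sens1 = class_metrics(y_true, y_pred, 1)
overall = accuracy(y_true, y_pred)
```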

Fold classification model
As shown in our previous work [9], the Frag-k fragment library with variable length achieves higher fold classification accuracy than fixed-length ones. Moreover, our results in the previous section show that the prediction accuracy on variable-length Frag-k fragments is higher than that on individual fixed-length fragments. Therefore, we use the fragment vectors based on variable-length fragment predictions from the fragment prediction model as input to the fold recognition model.
We use the sequences in the DD, EDD, and TG datasets to evaluate the performance of DeepFrag-k. First, for a given sequence, we use a sliding window of 4 to 20 residues to consecutively segment it into a set of overlapping fragments, where gaps and non-protein residues are excluded. Figure 6 summarizes the ten-fold cross-validation results of DeepFrag-k and other fold recognition methods on the DD dataset. DeepFrag-k outperforms the other methods, yielding 85.3% accuracy, which is 9.1 percentage points higher than the second highest, ProFold (76.2%). More detailed comparisons between DeepFrag-k and ProFold for each individual protein fold are listed in Table 3. DeepFrag-k demonstrates better fold recognition accuracy than ProFold in 18 out of 27 protein folds. It is also important to note that DeepFrag-k shows more balanced prediction accuracy. In particular, for folds such as b.34, b.47, c.3, c.37, and d.15, on which ProFold exhibits poor prediction results, DeepFrag-k yields significant accuracy improvements.
We further evaluate the performance of DeepFrag-k on the EDD and TG datasets. The ten-fold cross-validation results in comparison with other methods are illustrated in Fig. 7. DeepFrag-k yields 96.1% and 97.5% accuracies on the EDD and TG datasets, respectively, which are higher than those of the other fold recognition methods. Because significantly more samples are available in the EDD and TG datasets, which is particularly helpful for our deep learning model to capture the discriminative features of the protein folds in sequence space, DeepFrag-k yields better fold recognition accuracies on the EDD and TG datasets than on the DD dataset. Figure 8 depicts the Class Activation Map (CAM) [36] of DeepFrag-k on the EDD dataset to show how protein folds are classified based on the fragment feature vectors from the protein sequences. The activation units that are most discriminative to fold
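The sliding-window segmentation used to prepare a sequence for DeepFrag-k can be sketched as below. The sketch assumes a stride of one residue and assumes that gaps and non-standard residue codes are removed before windowing; the text does not specify either detail, and the function name `segment` is hypothetical.

```python
STANDARD_RESIDUES = set("ACDEFGHIKLMNPQRSTVWY")

def segment(sequence, min_len=4, max_len=20):
    """Return (start, fragment) pairs for every window of min_len..max_len
    residues, after dropping gaps and non-standard residue codes.
    Stride 1 and filtering before windowing are assumptions."""
    cleaned = "".join(ch for ch in sequence.upper() if ch in STANDARD_RESIDUES)
    fragments = []
    for length in range(min_len, max_len + 1):
        for start in range(len(cleaned) - length + 1):
            fragments.append((start, cleaned[start:start + length]))
    return fragments

# The gap '-' and the unknown residue 'X' are dropped before windowing.
small = segment("ACD-EFX", min_len=4, max_len=5)
n_fragments = len(segment("A" * 30))
```

Under these assumptions, a cleaned sequence of n residues with n ≥ 20 yields the sum of (n − L + 1) over L = 4, …, 20 fragments, so the number of fragments passed to stage 1 grows linearly with sequence length.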

Discussions
In our previous work [9], we developed a protein structural fragment library (Frag-k), composed of about 400 backbone fragments ranging from 4 to 20 residues, as the structural "keywords" in the protein structure universe. A structure dictionary using these fragments as keywords can classify the major protein folds with high accuracy. The success of DeepFrag-k is due to identifying these keywords with high precision as structural features that are effective for fold recognition. The deep learning architecture in DeepFrag-k plays an important role in accurately identifying these fragments.
The current version of DeepFrag-k has its limitations. The CNN used in stage 2 of DeepFrag-k is effective in capturing local interaction patterns between fragments, but has difficulty learning their high-order, long-range interactions, which are essential to forming stable spatial structures. This problem may be addressed by incorporating deep learning techniques, such as Recurrent Neural Networks (RNNs), that can treat sequence data as time series and capture long-range correlations.

Conclusions and future research directions
In this paper, we design DeepFrag-k, a two-stage deep neural network architecture for fold recognition. The fragment prediction stage fuses the sequence composition, physicochemical properties, and evolutionary information feature groups of sequence fragments to derive effective fragment feature vectors, which are passed to the fold recognition stage. Owing to the highly discriminative capability of the fragment feature vectors, DeepFrag-k yields significant accuracy improvements over other fold recognition methods on the DD, EDD, and TG datasets.
We will investigate using RNNs to capture high-order, long-range interactions between structural fragments to further improve DeepFrag-k. In addition, the features derived in DeepFrag-k are based on sequence fragments, and they can be combined with other sequence or structure features, such as inter-residue interactions [7], to further improve fold recognition. Finally, accurate fold recognition allows cooperatively fitting sequences into known three-dimensional folds, increasing the success rate by detecting very remote homologies. The recognized folds can be used as high-quality templates to predict tertiary structures at high resolution. These will be our future research directions.