 Methodology article
 Open access
DeepDist: real-value interresidue distance prediction with deep residual convolutional network
BMC Bioinformatics volume 22, Article number: 30 (2021)
Abstract
Background
Driven by deep learning, interresidue contact/distance prediction has improved significantly and has substantially enhanced ab initio protein structure prediction. Currently, most distance prediction methods classify interresidue distances into multiple distance intervals instead of directly predicting real-value distances. The output of the former has to be converted into real-value distances before it can be used in tertiary structure prediction.
Results
To explore the potential of predicting real-value interresidue distances, we develop a multitask deep learning distance predictor (DeepDist) based on new residual convolutional network architectures to simultaneously predict real-value interresidue distances and classify them into multiple distance intervals. Tested on 43 CASP13 hard domains, DeepDist achieves comparable performance in real-value distance prediction and multiclass distance prediction. The average mean square error (MSE) of DeepDist’s real-value distance prediction is 0.896 Å² when predicted distances ≥ 16 Å are filtered out, which is lower than the 1.003 Å² of DeepDist’s multiclass distance prediction. When distance predictions are converted into contact predictions at the 8 Å threshold (the standard threshold in the field), the precision of the top L/5 and L/2 contact predictions of DeepDist’s multiclass distance prediction is 79.3% and 66.1%, respectively, higher than the 78.6% and 64.5% of its real-value distance prediction and the best results in the CASP13 experiment.
Conclusions
DeepDist can predict interresidue distances well and improve binary contact prediction over the existing state-of-the-art methods. Moreover, the predicted real-value distances can be directly used to reconstruct protein tertiary structures better than multiclass distance predictions due to the lower MSE. Finally, we demonstrate that predicting the real-value distance map and multiclass distance map at the same time performs better than predicting real-value distances alone.
Background
Recently, the accuracy of protein interresidue contact prediction has increased substantially owing to residue-residue coevolution analysis methods that effectively detect the directly correlated mutations of contacting residues in the sequences of a protein family, such as Direct Coupling Analysis (DCA) [1], plmDCA [2], GREMLIN [3], CCMpred [4], and PSICOV [5]. The capability of these methods to extract correlated mutation information for contact prediction largely depends on the number of effective sequences in the multiple sequence alignment (MSA) of a target protein. Owing to advances in DNA/RNA sequencing technology [6, 7], many proteins have a large number of sufficiently diverse homologous sequences, which makes their contact/distance prediction fairly accurate. However, for targets with a small number of effective homologous sequences (i.e. shallow sequence alignments), the coevolutionary scores are noisy and not reliable for contact prediction. The problem can be largely addressed by using the noisy coevolutionary scores as input for advanced deep learning techniques, which have strong pattern recognition power, to predict interresidue contacts and distances.
Since deep learning was introduced for contact prediction in 2012 [8], different deep learning architectures have been designed to integrate traditional sequence features with interresidue coevolution scores to substantially improve contact/distance prediction [9,10,11,12], even for some targets with shallow MSAs.
The improved contact predictions can be converted into interresidue distance information, which has been successfully used with distance-based modeling methods such as CONFOLD [13], CONFOLD2 [14], and EVFOLD [15] to build accurate tertiary structures for ab initio protein targets [16, 17].
In the most recent CASP13 experiment, several groups (e.g., AlphaFold [18] and RaptorX [19]) applied deep learning techniques to classify interresidue distances into multiple fine-grained distance intervals (i.e. predict the distance distribution) to further improve ab initio structure prediction substantially. However, the probabilities of a distance belonging to different intervals predicted by the multiclassification approach still need to be converted into a distance value to be used for tertiary structure modeling. There is a lack of deep learning regression methods that directly predict the exact real value of interresidue distances.
In this study, we develop a deep residual convolutional neural network method (DeepDist) to predict both the full-length real-value distance map and the multiclass distance map (i.e. distance distribution map) for a target protein. According to the test on 43 CASP13 hard domains (i.e. both FM and FM/TBM domains; FM: free modeling; TBM: template-based modeling), 37 CASP12 hard (FM) domains, and 268 CAMEO targets, the method can predict interresidue distances effectively and performs better than existing state-of-the-art methods in terms of the precision of binary contact prediction. We further show that predicting both the real-value distance map and the multiclass distance map simultaneously is more accurate than predicting the real-value distance map only, demonstrating the advantage of DeepDist’s multitask learning framework for improving protein distance prediction.
Results
Comparing DeepDist with state-of-the-art methods on CASP12 and CASP13 datasets in terms of precision of binary contact predictions
As a multitask predictor, our distance predictor DeepDist can not only classify each residue pair into distance intervals (multiclassification) but also predict its real-value distance (regression). We convert the predicted distances into contact maps in order to compare DeepDist with existing methods using the most widely used evaluation metrics: the precision of the top L/5, L/2, and L long-range contact predictions (long range: sequence separation of the residue pair ≥ 24). Figure 1 reports the contact prediction precision of the multiclass distance prediction and the real-value distance prediction of DeepDist and several state-of-the-art methods on two CASP test datasets (43 CASP13 FM and FM/TBM domains and 37 CASP12 FM domains). To compare strictly on the 43 CASP13 test domains, we take the contact precision results of RaptorX-Contact [19], AlphaFold [18], and TripletRes [12] as reported in their papers. For trRosetta [20], we ran it with the same MSAs used with DeepDist to predict a distance probability distribution map and converted it into a binary contact map at the 8 Å threshold. On the CASP13 dataset (Fig. 1a), the contact precision of DeepDist is higher than that of three top methods in CASP13 (RaptorX-Contact, AlphaFold, and TripletRes) as well as trRosetta in almost all cases. For instance, the precision of the top L/5 long-range predicted contacts for DeepDist(multiclass) and DeepDist(real_dist) is 0.793 and 0.786 on the CASP13 dataset, respectively, higher than the 0.751 of trRosetta. The precision of the top L/2 long-range predicted contacts for DeepDist(multiclass) is 0.661, which is similar to trRosetta’s precision of 0.652. According to this metric, the multiclass distance prediction (DeepDist(multiclass)) works slightly better than the real-value distance prediction (DeepDist(real_dist)).
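The top L/k long-range precision used throughout this section can be sketched as follows. This is a minimal illustration, not the evaluation script used in the study: the function name, the `divisor` parameter, and the choice of a strict `< 8 Å` comparison for a true contact are assumptions made for the example.

```python
def top_long_range_precision(scores, true_dist, divisor,
                             min_sep=24, threshold=8.0):
    """Precision of the top L/divisor long-range pairs ranked by score.

    scores    : L x L nested list of predicted contact scores
    true_dist : L x L nested list of true distances in angstroms
    divisor   : 5 for top L/5, 2 for top L/2, 1 for top L
    """
    L = len(scores)
    # Each long-range pair is counted once (i < j, separation >= min_sep).
    pairs = [(scores[i][j], i, j)
             for i in range(L) for j in range(i + min_sep, L)]
    pairs.sort(key=lambda item: item[0], reverse=True)
    top = pairs[:max(1, L // divisor)]
    if not top:
        return 0.0
    # Assumption: a pair is a true contact when its distance is below 8 A.
    hits = sum(1 for _, i, j in top if true_dist[i][j] < threshold)
    return hits / len(top)
```

For a protein of length L = 150, `divisor=5` evaluates the 30 highest-scoring pairs whose sequence separation is at least 24.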
We also compare DeepDist with DeepMetaPSICOV [11] on 37 CASP12 FM domains. To rigorously evaluate them, we ran DeepMetaPSICOV with the same sequence-based features (sequence profile from PSI-BLAST [21] and solvent accessibility from PSIPRED [22]) and MSAs used with DeepDist. Both multiclass distance prediction and real-value distance prediction of DeepDist perform consistently better than DeepMetaPSICOV (Fig. 1b).
Comparison of predicting real-value distance map and multiclass distance map simultaneously with predicting real-value distance map alone
In order to evaluate whether predicting the real-value distance map and the multiclass distance map together improves the performance over predicting the real-value distance map only, we conducted two experiments. Experiment 1 trained real-value distance prediction and multiclass distance prediction simultaneously; Experiment 2 trained real-value distance prediction only. To ensure a fair comparison, the two experiments used the same input features (PLM) and the same model architecture (PLM_Net, described in the Methods section).
We evaluated the real-value distance prediction performance of the two experiments based on several evaluation metrics: long-range (residue pair separation ≥ 24) contact precision, MSE, and Pearson coefficient. As shown in Table 1, the real-value distance prediction trained simultaneously with multiclass distance prediction in Experiment 1 performed better than the real-value distance prediction trained alone in Experiment 2 according to all the metrics. The results demonstrate that DeepDist’s multitask learning framework can improve the performance of real-value distance prediction.
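The MSE and Pearson coefficient between a predicted and a true distance map can be sketched as below. The filter on predicted distances ≥ 16 Å follows the text; restricting the comparison to pairs with a minimum sequence separation, and the function names, are assumptions made for this example.

```python
import math

def filtered_pairs(pred, true, cutoff=16.0, min_sep=24):
    """Collect (predicted, true) distances for pairs i < j, dropping
    pairs whose predicted distance is >= cutoff (in angstroms)."""
    L = len(pred)
    return [(pred[i][j], true[i][j])
            for i in range(L) for j in range(i + min_sep, L)
            if pred[i][j] < cutoff]

def mean_squared_error(pairs):
    """Mean of squared differences, in squared angstroms."""
    return sum((p - t) ** 2 for p, t in pairs) / len(pairs)

def pearson(pairs):
    """Pearson correlation coefficient between predicted and true values."""
    n = len(pairs)
    mean_p = sum(p for p, _ in pairs) / n
    mean_t = sum(t for _, t in pairs) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in pairs)
    var_p = sum((p - mean_p) ** 2 for p, _ in pairs)
    var_t = sum((t - mean_t) ** 2 for _, t in pairs)
    return cov / math.sqrt(var_p * var_t)
```

Both functions expect at least one retained pair, and `pearson` additionally needs non-constant predicted and true values.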
Comparison of the ensemble model based on four kinds of inputs and a single model based on one input
Table 2 reports the performance of DeepDist (an ensemble of multiple models trained on four kinds of inputs) on the CASP13 dataset. The accuracy of DeepDist’s real-value distance prediction (DeepDist(real_dist)) and multiclass distance prediction (DeepDist(multiclass)) in Table 2 is substantially higher than the accuracy of Experiment 1 in Table 1, a single deep model trained on one kind of feature (PLM). For instance, the precision for top L/5 contact prediction and the MSE of DeepDist(real_dist) are 0.786 and 0.896 Å², better than the 0.699 and 1.151 Å² of the single model PLM_Net. The same trend is observed for other single models trained on COV, PRE, or OTHER features separately. The results clearly demonstrate that the ensemble approach improves the accuracy of interresidue distance prediction.
Comparison between real-value distance prediction and multiclass distance distribution prediction in terms of 3D protein structure folding
To test the usefulness of the two distance predictions for 3D structure folding, we use the real-value distance map and the multiclass distance map predicted by DeepDist, each with DFOLD [23], to construct 3D models for the 43 CASP13 hard domains. Table 3 shows the average TM-score of the top 1 model and the best of the top 5 models built from real-value distances (DeepDist(real_dist)) and from multiclass distances (DeepDist(multiclass)) on the 43 CASP13 FM and FM/TBM domains. The average TM-scores of the top 1 and top 5 models generated from real-value distance predictions are 0.487 and 0.522, which demonstrates the feasibility of applying real-value distance predictions to build protein tertiary structures of moderate model quality.
Figure 2 illustrates the distribution of the TM-scores of the top 1 models of the 43 CASP13 domains for DeepDist(real_dist) and DeepDist(multiclass). The distribution of DeepDist(real_dist) shifts toward higher scores (TM-score > 0.6). As shown in Additional file 1: Table S1, the real-value distance prediction has 13 domains with TM-score > 0.6 and the multiclass distance prediction has 12. In the target-by-target comparison, when the models of both methods have TM-score > 0.6, models constructed from the real-value distance prediction tend to have higher scores. This is consistent with Fig. 2, where the TM-score distribution curve of the real-value distance prediction tends to sit above the curve of the multiclass distance prediction when TM-score > 0.6. The reduction in the MSE of the predicted distances may be one of the factors contributing to the improvement of DeepDist(real_dist) over DeepDist(multiclass) for 3D modeling. The average MSE between the predicted real-value distance map and the true distance map is 0.8964 Å², which is lower than the average MSE (1.0037 Å²) between the distance map converted from the predicted multiclass distance map and the true distance map. The way multiclass distance predictions are converted to real-value distance constraints and the way the upper and lower distance bounds are set for constructing 3D models may be two further factors that affect the final model quality.
On the 43 CASP13 FM and FM/TBM domains, we also compared the models generated from the predicted distances of DeepDist with two popular ab initio distance-based model folding methods: DMPfold [24] and CONFOLD2 [14] (Table 3). For DMPfold, we applied the same sequence-based features and multiple sequence alignments used with DeepDist as input to build 3D models. For CONFOLD2, we converted the predicted distance map to a contact map as its input to build 3D models. As shown in Table 3, both DeepDist and DMPfold perform much better than the contact-based method CONFOLD2, clearly demonstrating that distance-based 3D modeling is better than contact-based 3D modeling. The average TM-score of DeepDist(real_dist) is 0.487, higher than the 0.438 of DMPfold, probably due to the more accurate distance prediction made by DeepDist. Considering the top 5 models, DeepDist(real_dist) folds 23 out of 43 domains correctly (TM-score > 0.5), more than the 16 folded by DMPfold. Figure 3 illustrates the DeepDist distance map for the target T0997 and four other high-quality CASP13 tertiary structure models built from the predicted real-value distances that have TM-scores ≥ 0.7.
The relationship between 3D models reconstructed from predicted real-value distances and multiple sequence alignments
The main input features used with DeepDist are derived from MSAs. Figure 4 plots the TM-scores of the top 1 models of the 43 CASP13 domains against the natural logarithm of the number of effective sequences in their MSAs. There is a moderate correlation (Pearson’s correlation = 0.66) between the two. Moreover, 3D models for 6 domains (T0957s2D1, T0958D1, T0986s2D1, T0987D1, T0989D1, and T0990D1) with shallow alignments (the number of effective sequences (Neff) in the alignment < 55) have TM-score > 0.5 (i.e. TM-scores of 0.568, 0.644, 0.658, 0.555, 0.545 and 0.593, respectively), indicating that DeepDist works well on some targets with shallow alignments.
Evaluation of CAMEO targets
In order to further evaluate DeepDist on a large dataset, we test DeepDist on 268 CAMEO targets selected from 08/31/2018 to 08/24/2019. The average precision of the top L/5 and L/2 long-range interresidue contact predictions converted from the real-value distance prediction is 0.691 and 0.598, respectively. 191 out of 268 targets have a long-range top L/5 contact prediction precision ≥ 0.7. Figure 5 shows 5 high-quality models constructed from DeepDist-predicted real-value distances. For the 14 targets with 50 or fewer effective sequences, the average top L/5 and top L/2 long-range contact prediction precision is 0.696 and 0.515, which is reasonable. When the predicted distances are used to build 3D structures for the 14 targets, five of them have models with TM-score > 0.5. This further confirms that DeepDist’s predicted distances can fold some proteins with very shallow alignments correctly.
Discussion
Although there are numerous deep learning methods that conduct distance prediction by classifying distances into multiple intervals, there are few deep learning methods that predict real-value distances via regression. Our results demonstrate that it is worthwhile to explore the potential of real-value distance prediction, which can be directly used by 3D modeling methods to build protein tertiary structures. Evaluated by the precision of binary contact prediction, the accuracy of real-value distance prediction trained alone is worse than that obtained when real-value distances are predicted and distances are classified into multiple intervals at the same time in a multitask learning framework (Table 1). This demonstrates the strength of DeepDist in predicting the two types of distances simultaneously to improve the accuracy of real-value distance prediction. Moreover, the two distance predictions in DeepDist achieve comparable results. The distance multiclassification prediction of DeepDist is slightly better than the real-value distance prediction in terms of the precision of contact prediction, but it is a little worse in terms of the MSE of the predicted distances. The p-values (shown in Additional file 1: Tables S2 and S3) calculated from the paired t-test of the corresponding MSE value pairs between DeepDist(real_dist) and DeepDist(multiclass) suggest significant differences in their mean MSE values. All these results show that real-value distance prediction can add some value on top of distance multiclassification prediction. Both the strengths and weaknesses of the two distance prediction methods in DeepDist have been demonstrated in this study. Which method to use may depend on the specific needs of users and on multiple factors such as how to convert multiclassification distances into real-value distances, how to estimate distance errors, and which distances can be used by a 3D modeling tool.
Moreover, more experiments are still needed to investigate if and how real-value distance prediction can directly improve the performance of distance multiclassification prediction.
Conclusion
We develop an interresidue distance predictor, DeepDist, based on new deep residual convolutional neural networks to predict both the real-value distance map and the multiclass distance map simultaneously. We demonstrate that predicting the two at the same time yields higher accuracy in real-value distance prediction than predicting real-value distances alone. The overall performance of DeepDist’s real-value distance prediction and multiclass distance prediction is comparable according to multiple evaluation metrics. Both kinds of distance predictions of DeepDist are more accurate than several state-of-the-art methods on the CASP13 hard targets. Moreover, DeepDist can work well on some targets with shallow multiple sequence alignments. Finally, the real-value distance predictions can be used to reconstruct 3D protein structures better than the multiclass distance predictions, showing that predicting real-value interresidue distances can add value on top of existing distance prediction approaches.
Methods
Overview
The overall workflow of DeepDist is shown in Fig. 6. We use four sets of 2D coevolutionary and sequence-based features to train four deep residual convolutional neural network architectures, respectively, to predict the Euclidean distance between residues in a protein target. Three of the four feature sets are mostly coevolution-based features calculated from multiple sequence alignments, i.e. the covariance matrix (COV) [25], the precision matrix (PRE) [26], and the pseudolikelihood maximization matrix (PLM) [4]. Considering that coevolution-based features sometimes cannot provide sufficient information, particularly when targets have shallow alignments, a fourth set of sequence-based features (OTHER), such as the sequence profile generated by PSI-BLAST [21] and solvent accessibility from PSIPRED [22], is used. The output of DeepDist is a real-value L × L distance map and a multiclass distance map (L: the length of the target protein). The two types of distance maps are generated by two prediction branches. For each branch, the final output is produced by the ensemble of four deep network models (COV_Net, PLM_Net, PRE_Net, and OTHER_Net) named after their input feature sets (COV, PLM, PRE, and OTHER). For the prediction of the multiclass distance map, we discretize the interresidue distances into 25 bins: 1 bin for distances < 4.5 Å, 23 bins from 4.5 to 16 Å at an interval size of 0.5 Å, and a final bin for all distances ≥ 16 Å. For the real-value distance map, we simply use the true distance map of the native structure as the training labels for the deep learning models, without discretization. Because large distances are not useful and not predictable, we only predict interresidue distances less than 16 Å by filtering out true distances ≥ 16 Å.
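The 25-bin discretization described above can be sketched as a small mapping from a distance to a class index. The function name and the 0-based numbering of the classes are choices made for this illustration; the bin edges are those stated in the text.

```python
def distance_to_bin(distance):
    """Map a distance in angstroms to one of 25 class indices (0..24).

    Class 0      : distance < 4.5
    Classes 1-23 : 0.5-wide intervals covering [4.5, 16.0)
    Class 24     : distance >= 16.0
    """
    if distance < 4.5:
        return 0
    if distance >= 16.0:
        return 24
    # min() guards against floating-point rounding just below 16.0.
    return min(23, 1 + int((distance - 4.5) / 0.5))
```

With this numbering, classes 0 to 7 together cover all distances below 8 Å, the contact threshold used elsewhere in the paper.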
Datasets
We select targets from the training list used in DMPfold [24] and extract their true structures from the Protein Data Bank (PDB) to create a training dataset. After filtering out the redundancy with the validation dataset and test datasets according to a 25% sequence identity threshold, 6463 targets are left in the training dataset. The validation set contains 144 targets used to validate DNCON2 [10]. The three blind test datasets are 37 CASP12 FM domains, 43 CASP13 FM and FM/TBM domains, and 268 CAMEO targets collected from 08/31/2018 to 08/24/2019.
Input feature generation
The sequence databases used to search for homologous sequences for feature generation include Uniclust30 (2017-10) [27], Uniref90 (2018-04), Metaclust50 (2018-01) [28], a customized database that combines Uniref100 (2018-04) and metagenomics sequence databases (2018-04), and the NR90 database (2016). All the sequence databases were constructed before the CASP13 experiment.
Coevolutionary features (i.e. COV, PRE, and PLM) are the main input features for DeepDist, where COV is the covariance matrix calculated from the marginal and pair frequencies of each amino acid pair [25], PRE [26] is the inverse covariance matrix, and PLM is the inverse Potts model coupling matrix optimized by pseudolikelihoods [4]. All three coevolutionary features are generated from the multiple sequence alignment (MSA). Two methods, DeepMSA [29] and our in-house DeepAln, are used to generate MSAs for a target. The outputs of both MSA generation methods combine the iterative homologous sequence searches of HHblits [30] and Jackhmmer [31] on several sequence databases. The two methods differ in the sequence databases used and in the strategy for combining the output of the HHblits and Jackhmmer searches. DeepMSA trims the sequence hits from Jackhmmer and performs sequence clustering, which shortens the time for constructing the HHblits database for the next round of search. To leverage its speed, we apply DeepMSA to search against a large customized sequence database that is composed of UniRef100 and metagenomic sequences. In contrast, DeepAln directly uses the full-length Jackhmmer hits for building customized HHblits databases and is slower. It is applied to the Metaclust sequence database. The detailed comparison of the two MSA generation methods is reported in Additional file 1: Table S4. In addition to the three kinds of coevolutionary features, 2D features such as the coevolutionary contact scores generated by CCMpred, Shannon entropy sum, mean contact potential, normalized mutual information, and mutual information are also generated. Moreover, some other features used in DNCON2, including sequence profile, solvent accessibility, joint entropy, and Pearson correlation, are also produced; these are collectively called the OTHER feature set.
The features above are generated for the MSAs of both DeepMSA and DeepAln. Each of them is used to train a deep model to predict both a real-value distance map and a multiclass distance map, resulting in 8 predicted real-value distance maps and 8 multiclass distance maps (Fig. 6).
Deep network architectures for distance prediction
We started by training the first network (COV_Net) with a simple feature set that consists of the covariance matrix described above, along with sequence profile (PSSM), contact scores (CCMpred), and Pearson correlation. Inspired by COV_Net, two networks, PLM_Net and PRE_Net, which use two related coevolutionary matrices (PLM and PRE) generated from the multiple sequence alignment, were then added to use the coevolutionary relationship between amino acid pairs more effectively. Since all three networks depend heavily on the quality of the MSA, the fourth network, OTHER_Net, was constructed with only non-coevolutionary sequence-based features as input in case the MSA is shallow. To make sure every network works well, we tweaked the model architecture for each feature set. In total, there are four different networks in DeepDist, which are called COV_Net, PLM_Net, PRE_Net, and OTHER_Net (Fig. 7), respectively. PRE_Net and OTHER_Net share almost the same architecture with some minor differences. The detailed comparison of the four networks is shown in Additional file 1: Table S5.
COV_Net (Fig. 7a) uses the COV matrix along with sequence profile (PSSM), contact scores (CCMpred), and Pearson correlation as input. It starts with a normalization block called RCIN that contains instance normalization (IN) [32], row normalization (RN), column normalization (CN) [33] and a ReLU [34] activation function, followed by one convolutional layer with 128 kernels of size 1 × 1 and one Maxout [35] layer to reduce the number of input channels from 483 to 64. The output of Maxout is then fed into 16 residual blocks. Each residual block is composed of two RCIN normalization blocks, two convolutional layers that consist of 64 kernels of size 3 × 3, and one squeeze-and-excitation block (SE_block) [36]. The output feature maps of the block and the input of the block are added together as input for a ReLU activation function to generate the output of the residual block. The last residual block is followed by one convolutional instance normalization layer. The output of the layer is converted into two output maps simultaneously. One real-value distance map is obtained by a ReLU function through a convolution kernel of size 1 × 1, and one multiclass distance map with 25 output channels is obtained by a softmax function.
PLM_Net (Fig. 7b) uses as input the PLM matrix concatenated with the sequence profile (PSSM) and Pearson correlation. The input is first fed into an instance normalization layer, followed by one convolutional layer and one Maxout layer. The output of Maxout is then fed into 20 residual blocks. Each residual block contains three RCIN blocks, four convolutional layers with 64 kernels of size 3 × 3, one SE_block, and one dropout layer [37] with a dropout rate of 0.2. The residual block is similar to the bottleneck residual block, except that the middle convolutional layer of kernel size 3 × 3 is replaced with three convolutional layers of kernel size 3 × 3, 7 × 1, and 1 × 7, respectively. The last residual block is followed by the same layers as in COV_Net to predict a real-value distance map and a multiclass distance map.
PRE_Net (Fig. 7c) uses as input the PRE matrix as well as entropy scores (joint entropy, Shannon entropy) and sequence profile (PSSM). An instance normalization layer is first applied to the input. Unlike COV_Net and PLM_Net, one convolutional layer with 64 kernels of size 1 × 1 and an RCIN block are applied after the instance normalization layer for dimensionality reduction. The output of the RCIN block is then fed through 16 residual blocks. Each residual block is made of two stacked sub-blocks (each containing one convolutional layer with 64 kernels of size 3 × 3, an RCIN block, a dropout layer with a dropout rate of 0.2, an SE_block, and the shortcut connection). The final output layers after the residual blocks are the same as in COV_Net.
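How a Maxout layer halves the 128 feature channels produced by the 1 × 1 convolution to 64 can be illustrated on the channel vector of a single residue pair. This is a conceptual sketch only: grouping adjacent channels in pairs is an assumption made here, and the actual DeepDist implementation may group channels differently.

```python
def maxout(channels, group_size=2):
    """Keep the maximum of each consecutive group of channels.

    channels   : list of feature values for one (i, j) position
    group_size : number of channels merged into one output channel
    """
    if len(channels) % group_size != 0:
        raise ValueError("channel count must be divisible by group_size")
    return [max(channels[k:k + group_size])
            for k in range(0, len(channels), group_size)]

# 128 channels from the 1 x 1 convolution become 64 output channels.
reduced = maxout([float(c) for c in range(128)])
```

In the network this operation is applied independently at every position of the L × L feature map.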
OTHER_Net uses OTHER features as input. Its architecture is basically the same as PRE_Net, except that it has 22 residual blocks and there is no dropout layer in each residual block.
The final output of DeepDist is an average real-value distance map and an average multiclass distance map calculated from the output of the four individual network models, i.e. the output of the ensemble of the individual networks.
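The ensemble step can be sketched as an element-wise average over the maps predicted by the individual models. The sketch assumes an unweighted mean, which is the simplest reading of "average"; the function names are hypothetical.

```python
def average_distance_maps(maps):
    """Element-wise mean of several L x L real-value distance maps."""
    n, L = len(maps), len(maps[0])
    return [[sum(m[i][j] for m in maps) / n for j in range(L)]
            for i in range(L)]

def average_distribution_maps(maps):
    """Element-wise mean of several L x L x C multiclass maps.

    Averaging valid probability vectors yields a valid probability
    vector, so no renormalization is needed.
    """
    n, L, C = len(maps), len(maps[0]), len(maps[0][0][0])
    return [[[sum(m[i][j][c] for m in maps) / n for c in range(C)]
             for j in range(L)]
            for i in range(L)]
```

The same functions apply whether four maps (one per network) or eight maps (one per network and MSA source) are combined.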
Training
The dimension of the input of COV_Net, PLM_Net, and PRE_Net is L × L × 483, L × L × 482, and L × L × 484, respectively, which is very large and consumes a lot of memory. Therefore, we use data generators from Keras to load the large feature data batch by batch. The batch size is set to 1. A normal initializer [38] is used to initialize the network. For epochs ≤ 30, the Adam optimizer [39] is used with an initial learning rate of 0.001. For epochs > 30, stochastic gradient descent (SGD) with momentum [40] is used instead, with an initial learning rate of 0.01 and a momentum of 0.9. The real-value distance prediction and multiclass distance classification are trained in two parallel branches. The mean squared error (MSE) and cross-entropy are used as their loss functions, respectively. At each epoch, the precision of the top L/2 long-range contact predictions, derived from the average of the two contact maps converted from the real-value distance map and the multiclass distance map, is calculated on the validation dataset. The interresidue real-value distance map is converted to a contact map by inverting the predicted distance to obtain a relative contact probability (i.e. 1/d_ij: relative contact probability score; d_ij: predicted distance between residues i and j). The multiclass distance map is converted to a binary contact map by summing the predicted probabilities of all the distance intervals ≤ 8 Å to obtain contact probabilities.
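The two conversions from distance predictions to contact scores can be sketched as below. The class layout follows the 25 bins defined in the Overview (class 0 for < 4.5 Å, then 0.5 Å intervals), so the intervals lying entirely at or below 8 Å are classes 0 to 7. The small `eps` guard against a predicted distance of zero is an assumption added for the example, as are the function names.

```python
# Classes 0..7 cover distances below 8 A: (<4.5), [4.5, 5.0), ..., [7.5, 8.0).
NUM_CONTACT_CLASSES = 8

def contact_score_from_distance(distance, eps=1e-6):
    """Relative contact score 1/d for a predicted real-value distance."""
    return 1.0 / max(distance, eps)

def contact_prob_from_distribution(probs):
    """Contact probability: total probability of the classes below 8 A."""
    if len(probs) != 25:
        raise ValueError("expected 25 class probabilities")
    return sum(probs[:NUM_CONTACT_CLASSES])
```

The 1/d score is only a ranking score rather than a calibrated probability, which is sufficient for top L/k precision because that metric depends only on the ordering of residue pairs.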
Ab initio protein folding by predicted distances
We use the distances predicted by DeepDist with our in-house tool DFOLD [23], built on top of CNS [41], a software package that implements a distance geometry algorithm for NMR-based structure determination, to convert the distance restraints into 3D structure models. For the predicted real-value distance map, we select the predicted distances ≤ 15 Å with sequence separation ≥ 3 to generate the distance restraints between the Cb-Cb atoms of residue pairs. 0.1 Å is added to or subtracted from the predicted distances to set the upper and lower distance bounds. For the predicted multiclass distance map, we first convert the distance probability distribution matrix to a real-value distance map by setting each distance to the probability-weighted mean distance of all intervals for a residue pair and using the standard deviation to calculate the upper and lower distance bounds. Given a final real-value distance map, we prepare five different subsets of input distance restraints by filtering out distances ≥ x, where x = 11 Å, 12 Å, 13 Å, 14 Å, and 15 Å, respectively. For each subset of distance restraints, we run DFOLD for 3 iterations. In each iteration, we generate 50 models and select the top five models ranked by the CNS energy score, the sum of all violations of all distance restraints used to generate a model. The top selected models generated from the five subsets are further ranked by SBROD [42]. The final top one model is the one with the highest SBROD score. PSIPRED is used to predict the secondary structure to generate hydrogen bond and torsion angle constraints for DFOLD to use.
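The preparation of distance restraints from the two kinds of maps can be sketched as follows. The selection criteria (predicted distance ≤ 15 Å, sequence separation ≥ 3, bounds of ± 0.1 Å) come from the text. The representative distance assigned to each class (interval midpoints, with 4.25 Å and 16.25 Å for the two open-ended classes) is a hypothetical choice for this illustration, since the text does not state which values DFOLD uses.

```python
import math

# Hypothetical representative distance per class: interval midpoints,
# plus assumed values for the open-ended first and last classes.
BIN_CENTERS = [4.25] + [4.75 + 0.5 * k for k in range(23)] + [16.25]

def expected_distance(probs):
    """Probability-weighted mean distance and its standard deviation."""
    mean = sum(p * c for p, c in zip(probs, BIN_CENTERS))
    var = sum(p * (c - mean) ** 2 for p, c in zip(probs, BIN_CENTERS))
    return mean, math.sqrt(var)

def real_value_restraints(dist_map, max_dist=15.0, min_sep=3, margin=0.1):
    """Return (i, j, lower, upper) restraints from a real-value map."""
    L = len(dist_map)
    restraints = []
    for i in range(L):
        for j in range(i + min_sep, L):
            d = dist_map[i][j]
            if d <= max_dist:
                restraints.append((i, j, d - margin, d + margin))
    return restraints
```

Calling `real_value_restraints` with `max_dist` set to 11, 12, 13, 14, and 15 would produce the five restraint subsets described above, up to the treatment of distances exactly equal to the cutoff.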
Availability of data and materials
The datasets used in this study and the source code of DeepDist are available at https://github.com/multicomtoolbox/deepdist.
Change history
29 June 2021
A Correction to this paper has been published: https://doi.org/10.1186/s12859021042693
Abbreviations
MSE: The average mean square error
DCA: Direct coupling analysis
MSA: Multiple sequence alignment
FM: Free modeling
TBM: Template-based modeling
References
Weigt M, White RA, Szurmant H, Hoch JA, Hwa T. Identification of direct residue contacts in protein–protein interaction by message passing. Proc Natl Acad Sci. 2009;106(1):67–72.
Ekeberg M, Lövkvist C, Lan Y, Weigt M, Aurell E. Improved contact prediction in proteins: using pseudolikelihoods to infer Potts models. Phys Rev E. 2013;87(1):012707.
Kamisetty H, Ovchinnikov S, Baker D. Assessing the utility of coevolution-based residue–residue contact predictions in a sequence- and structure-rich era. Proc Natl Acad Sci. 2013;110(39):15674–9.
Seemayer S, Gruber M, Söding J. CCMpred—fast and precise prediction of protein residue–residue contacts from correlated mutations. Bioinformatics. 2014;30(21):3128–30.
Jones DT, Buchan DW, Cozzetto D, Pontil M. PSICOV: precise structural contact prediction using sparse inverse covariance estimation on large multiple sequence alignments. Bioinformatics. 2012;28(2):184–90.
Meyer F, Paarmann D, D’Souza M, Olson R, Glass EM, Kubal M, Paczian T, Rodriguez A, Stevens R, Wilke A. The metagenomics RAST server—a public resource for the automatic phylogenetic and functional analysis of metagenomes. BMC Bioinform. 2008;9(1):386.
Wilke A, Bischof J, Gerlach W, Glass E, Harrison T, Keegan KP, Paczian T, Trimble WL, Bagchi S, Grama A. The MG-RAST metagenomics database and portal in 2015. Nucl Acids Res. 2016;44(D1):D590–4.
Eickholt J, Cheng J. Predicting protein residue–residue contacts using deep networks and boosting. Bioinformatics. 2012;28(23):3066–72.
Wang S, Sun S, Li Z, Zhang R, Xu J. Accurate de novo prediction of protein contact map by ultra-deep learning model. PLoS Comput Biol. 2017;13(1):e1005324.
Adhikari B, Hou J, Cheng J. DNCON2: improved protein contact prediction using two-level deep convolutional neural networks. Bioinformatics. 2018;34(9):1466–72.
Kandathil SM, Greener JG, Jones DT. Prediction of interresidue contacts with DeepMetaPSICOV in CASP13. Proteins Struct Funct Bioinform. 2019;87(12):1092–9.
Li Y, Zhang C, Bell EW, Yu DJ, Zhang Y. Ensembling multiple raw coevolutionary features with deep residual neural networks for contact-map prediction in CASP13. Proteins Struct Funct Bioinform. 2019;87(12):1082–91.
Adhikari B, Bhattacharya D, Cao R, Cheng J. CONFOLD: residue-residue contact-guided ab initio protein folding. Proteins Struct Funct Bioinform. 2015;83(8):1436–49.
Adhikari B, Cheng J. CONFOLD2: improved contact-driven ab initio protein structure modeling. BMC Bioinform. 2018;19(1):22.
Sheridan R, Fieldhouse RJ, Hayat S, Sun Y, Antipin Y, Yang L, Hopf T, Marks DS, Sander C. EVfold.org: evolutionary couplings and protein 3D structure prediction. bioRxiv. 2015:021022.
Michel M, Hayat S, Skwark MJ, Sander C, Marks DS, Elofsson A. PconsFold: improved contact predictions improve protein models. Bioinformatics. 2014;30(17):i482–8.
Monastyrskyy B, d’Andrea D, Fidelis K, Tramontano A, Kryshtafovych A. Evaluation of residue–residue contact prediction in CASP10. Proteins Struct Funct Bioinform. 2014;82:138–53.
Senior AW, Evans R, Jumper J, Kirkpatrick J, Sifre L, Green T, Qin C, Žídek A, Nelson AWR, Bridgland A, et al. Improved protein structure prediction using potentials from deep learning. Nature. 2020;577(7792):706–10.
Xu J, Wang S. Analysis of distance-based protein structure prediction by deep learning in CASP13. Proteins Struct Funct Bioinform. 2019;87(12):1069–81.
Yang J, Anishchenko I, Park H, Peng Z, Ovchinnikov S, Baker D. Improved protein structure prediction using predicted interresidue orientations. Proc Natl Acad Sci. 2020;117(3):1496–503.
Bhagwat M, Aravind L. PSI-BLAST tutorial. In: Comparative genomics. Springer; 2007. p. 177–186.
Jones DT. Protein secondary structure prediction based on position-specific scoring matrices. J Mol Biol. 1999;292(2):195–202.
Greener JG, Kandathil SM, Jones DT. Deep learning extends de novo protein modelling coverage of genomes using iteratively predicted structural constraints. Nat Commun. 2019;10(1):1–13.
Jones DT, Kandathil SM. High precision in protein contact prediction using fully convolutional neural networks and minimal sequence features. Bioinformatics. 2018;34(19):3308–15.
Li Y, Hu J, Zhang C, Yu DJ, Zhang Y. ResPRE: high-accuracy protein contact prediction by coupling precision matrix with deep residual neural networks. Bioinformatics. 2019;35(22):4647–55.
Mirdita M, von den Driesch L, Galiez C, Martin MJ, Söding J, Steinegger M. Uniclust databases of clustered and deeply annotated protein sequences and alignments. Nucl Acids Res. 2017;45(D1):D170–6.
Steinegger M, Söding J. Clustering huge protein sequence sets in linear time. Nat Commun. 2018;9(1):1–8.
Zhang C, Zheng W, Mortuza S, Li Y, Zhang Y. DeepMSA: constructing deep multiple sequence alignment to improve contact prediction and fold-recognition for distant-homology proteins. Bioinformatics. 2019.
Remmert M, Biegert A, Hauser A, Söding J. HHblits: lightning-fast iterative protein sequence searching by HMM-HMM alignment. Nat Methods. 2012;9(2):173.
Eddy S. HMMER user’s guide. Department of Genetics, Washington University School of Medicine. 1992;2(1):13.
Ulyanov D, Vedaldi A, Lempitsky V. Instance normalization: the missing ingredient for fast stylization. Preprint arXiv:1607.08022. 2016.
Mao W, Ding W, Xing Y, Gong H. AmoebaContact and GDFold as a pipeline for rapid de novo protein structure prediction. Nat Mach Intell. 2019;2019:1–9.
Nair V, Hinton GE. Rectified linear units improve restricted Boltzmann machines. In: Proceedings of the 27th international conference on machine learning (ICML-10). 2010. p. 807–814.
Goodfellow IJ, Warde-Farley D, Mirza M, Courville A, Bengio Y. Maxout networks. Preprint arXiv:1302.4389. 2013.
Hu J, Shen L, Sun G. Squeeze-and-excitation networks. In: Proceedings of the IEEE conference on computer vision and pattern recognition. 2018. p. 7132–7141.
Srivastava N, Hinton G, Krizhevsky A, Sutskever I, Salakhutdinov R. Dropout: a simple way to prevent neural networks from overfitting. J Mach Learn Res. 2014;15(1):1929–58.
He K, Zhang X, Ren S, Sun J. Delving deep into rectifiers: surpassing human-level performance on ImageNet classification. In: Proceedings of the IEEE international conference on computer vision. 2015. p. 1026–1034.
Kingma DP, Ba J. Adam: a method for stochastic optimization. Preprint arXiv:1412.6980. 2014.
Qian N. On the momentum term in gradient descent learning algorithms. Neural Netw. 1999;12(1):145–51.
Brünger AT, Adams PD, Clore GM, DeLano WL, Gros P, Grosse-Kunstleve RW, Jiang JS, Kuszewski J, Nilges M, Pannu NS. Crystallography and NMR system: a new software suite for macromolecular structure determination. Acta Crystallogr D Biol Crystallogr. 1998;54(5):905–21.
Karasikov M, Pagès G, Grudinin S. Smooth orientation-dependent scoring function for coarse-grained protein quality assessment. Bioinformatics. 2019;35(16):2801–8.
Acknowledgements
We wish to thank CASP organizers and predictors for sharing the data used in this work.
Funding
Research reported in this publication was supported in part by two NSF grants (DBI 1759934 and IIS1763246), a DOE grant (DE-SC0021303), and an NIH grant (R01GM093123) to JC. The funding agencies did not play a role in this research.
Author information
Contributions
JC conceived the project. TW, ZG, JH, and JC designed the method. TW and ZG implemented the method and gathered the results. TW, ZG, and JC analyzed the results. TW, ZG, JH, and JC wrote the manuscript. All authors edited and approved the manuscript. TW and ZG contributed equally to this work. All authors read and approved the final manuscript.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare they have no conflict of interest.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
The original version of this article was revised: the funding note has been updated.
Supplementary information
Additional file 1. Supplemental results and data.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Wu, T., Guo, Z., Hou, J. et al. DeepDist: real-value inter-residue distance prediction with deep residual convolutional network. BMC Bioinformatics 22, 30 (2021). https://doi.org/10.1186/s12859-021-03960-9
DOI: https://doi.org/10.1186/s12859-021-03960-9