CONFOLD2: improved contact-driven ab initio protein structure modeling

Background: Contact-guided protein structure prediction methods are becoming increasingly successful because of the latest advances in residue-residue contact prediction. To support contact-driven structure prediction, effective tools that can quickly build tertiary structural models of good quality from predicted contacts need to be developed. Results: We develop an improved contact-driven protein modelling method, CONFOLD2, and study how it may be effectively used for ab initio protein structure prediction with predicted contacts as input. It builds models using various subsets of input contacts to explore the fold space under the guidance of a soft-square energy function, and then clusters the models to obtain the top five models. CONFOLD2 obtains an average reconstruction accuracy of 0.57 TM-score for the 150 proteins in the PSICOV contact prediction dataset. When benchmarked on the CASP11 contacts predicted using CONSIP2 and the CASP12 contacts predicted using Raptor-X, CONFOLD2 achieves a mean TM-score of 0.41 on both datasets. Conclusion: CONFOLD2 quickly generates the top five structural models for a protein sequence when its predicted secondary structure and predicted contacts are at hand. The source code of CONFOLD2 is publicly available at https://github.com/multicom-toolbox/CONFOLD2/. Electronic supplementary material: The online version of this article (10.1186/s12859-018-2032-6) contains supplementary material, which is available to authorized users.

The most successful ab initio protein structure prediction methods, i.e. fragment-assembly based methods, require generating a large number of decoys to deliver accurate predictions. Methods that can build models faster and are more sensitive to residue contacts are needed to realize the promise of ab initio protein structure prediction driven by the recent advances in contact prediction [1,2]. The CONFOLD method [3] can build high-quality secondary structures (including beta-sheets) and correct tertiary structures when predicted contacts are accurate. It is integrated into other protein structure prediction methods such as CoinFold [4] and PconsFold2 [2]. In this paper, we develop an improved version of CONFOLD by incorporating a soft-square energy function, building models using multiple subsets of contacts, adding model selection capability, and rigorously testing it on various datasets including the Critical Assessment of protein Structure Prediction (CASP) 11 and 12 datasets. CONFOLD2 also addresses a major limitation of the CONFOLD method, i.e. generating a decoy set of 200 models without selecting a top-one or top-five model. Compared to fragment-assembly methods that need to generate thousands of model decoys [5], CONFOLD2 explores the fold space by generating just a few hundred model decoys, and hence it runs relatively fast.

Implementation
Recently, it has been found that energy functions that do not penalize unsatisfied predicted contacts beyond a certain distance threshold yield more accurate model reconstruction [5][6][7]. Different contact energy functions, such as FADE [5], a square-well function with exponential decay [6], and a modified Lorentz potential [7], have been found to work best for various contact-guided folding algorithms, mostly fragment-assembly based methods. When distance geometry based approaches are used to fold proteins with restraints, it has been shown that the soft-square function performs best, with the 'rswitch' parameter to be tuned [8].
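To make the idea of a restraint energy that stops growing quadratically beyond a switching distance concrete, the following is a minimal sketch of a soft-square term. The functional form (quadratic up to a violation of r_sw, then a + b/Δ^softexp + c·Δ, softened on the upper-bound side only) is assumed here to follow the usual CNS-style soft-square restraint; the exponent, soft exponent, and asymptote slope defaults are illustrative assumptions, not values taken from this paper. The function name `soft_square_energy` and its argument names are hypothetical.

```python
def soft_square_energy(dist, d0=10.0, d_minus=0.0, d_plus=5.0, r_sw=1.8,
                       exponent=2.0, soft_exp=1.0, asymptote=1.0,
                       weight=1.0, ceil=1000.0):
    """Soft-square restraint energy for one residue pair (illustrative sketch).

    The restraint is satisfied when d0 - d_minus <= dist <= d0 + d_plus.
    exponent, soft_exp and asymptote are assumed values for illustration.
    """
    lower = d0 - d_minus
    upper = d0 + d_plus

    # Violation: how far the distance lies outside the allowed interval.
    if dist > upper:
        delta = dist - upper
    elif dist < lower:
        delta = lower - dist
    else:
        return 0.0

    scale = min(ceil, weight)  # the weight of a pair is capped at `ceil`

    if dist > upper + r_sw:
        # Soft branch: a and b are chosen so that both the value and the
        # slope match the quadratic branch at delta == r_sw.
        b = (asymptote - exponent * r_sw ** (exponent - 1)) \
            * r_sw ** (soft_exp + 1) / soft_exp
        a = r_sw ** exponent - b * r_sw ** (-soft_exp) - asymptote * r_sw
        return scale * (a + b / delta ** soft_exp + asymptote * delta)

    # Hard branch: plain power-law penalty for small violations.
    return scale * delta ** exponent


if __name__ == "__main__":
    for d in (9.0, 12.0, 16.0, 16.8, 20.0, 40.0):
        print(f"{d:5.1f} A -> {soft_square_energy(d):8.3f}")
```

With these assumed settings, a large violation is penalized roughly linearly rather than quadratically, so a few badly wrong predicted contacts cannot dominate the total energy.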
We replaced CONFOLD's [3] [...] term, the maximum weight (ceil) that any pair of predicted contacts can have is set to 1000, and 'w' is the weight of each contact pair and is set to 1. The most important parameter affecting the quality of reconstruction is r_sw, and we optimized it to be 1.8. 'a' and 'b' are constants determined at run-time such that the function is smooth at r_sw [...]

[...] CONFOLD modeling are selected based on the contact energy score, resulting in a total of 200 models. Next, to filter out unfolded models, we rank these 200 models by calculating their contact satisfaction score using the top L/5 long-range contacts, and filter out the bottom 150 models. The remaining 50 models are clustered into five clusters by calculating their pairwise structural similarity measured by TM-score. We select the five models closest to the centroids of these five clusters as the top five predictions, with the rank determined by the satisfaction score of the top L/5 long-range contacts. The SCRATCH suite [9] is used to predict three-state secondary structure and Maxcluster [10] to compute pairwise model similarity for clustering.

Figure 1. Behavior of the contact energy term for various r_sw values. For this demonstration, the desired distance is set to 10 Å with a lower bound of 0 Å and an upper bound of 5 Å, i.e. the desired distance between the pair of restrained residues is between 10.0 Å and 15.0 Å. The "Existing" energy calculation refers to the old energy term implemented in the CONFOLD method. The plot shows that, depending on the switching parameter r_sw, the energy can taper early at around 1 or 2 Å for r_sw = 2 or at more than 25 Å for r_sw = 6.
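The model selection steps described above (rank by contact satisfaction, keep the top 50, cluster into five groups, pick the model nearest each centroid, rank by satisfaction) can be sketched as follows. This is a simplified illustration under stated assumptions: a contact counts as satisfied when the two residues lie within 8 Å in the model, long-range means a sequence separation of at least 24 residues, and a small k-medoids loop stands in for the Maxcluster clustering used in the paper. The function names `contact_satisfaction` and `select_top_models` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Contact = Tuple[int, int, float]  # (residue i, residue j, predicted probability)


def contact_satisfaction(coords: Sequence[Sequence[float]],
                         contacts: Sequence[Contact],
                         n_top: int,
                         cutoff: float = 8.0,
                         min_sep: int = 24) -> float:
    """Fraction of the n_top most confident long-range contacts that are
    within `cutoff` angstroms in a model (cutoff and min_sep are assumptions)."""
    long_range = [c for c in contacts if abs(c[0] - c[1]) >= min_sep]
    top = sorted(long_range, key=lambda c: c[2], reverse=True)[:n_top]
    if not top:
        return 0.0
    hits = sum(1 for i, j, _ in top if math.dist(coords[i], coords[j]) <= cutoff)
    return hits / len(top)


def select_top_models(satisfaction: Sequence[float],
                      similarity: Sequence[Sequence[float]],
                      keep: int = 50, k: int = 5, iters: int = 20) -> List[int]:
    """Return indices of k representative models, best satisfaction first.

    similarity[i][j] is a pairwise structural similarity such as TM-score.
    """
    # Step 1: drop likely unfolded models, keeping the best `keep` by satisfaction.
    ranked = sorted(range(len(satisfaction)),
                    key=lambda i: satisfaction[i], reverse=True)
    kept = ranked[:keep]

    # Step 2: k-medoids clustering (a stand-in for Maxcluster).
    medoids = kept[:k]
    for _ in range(iters):
        clusters = {m: [] for m in medoids}
        for i in kept:
            # A medoid always stays in its own cluster, so no cluster is empty.
            owner = i if i in clusters else max(medoids, key=lambda m: similarity[i][m])
            clusters[owner].append(i)
        # New medoid: the member most similar to the rest of its cluster.
        new_medoids = [max(members, key=lambda c: sum(similarity[c][j] for j in members))
                       for members in clusters.values()]
        if set(new_medoids) == set(medoids):
            break
        medoids = new_medoids

    # Step 3: rank the cluster representatives by contact satisfaction.
    return sorted(medoids, key=lambda i: satisfaction[i], reverse=True)
```

In practice the similarity matrix would come from an external structure comparison tool; here it is simply passed in, so the sketch stays independent of how TM-scores are computed.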

As the first benchmark, we compared the performance of CONFOLD2 with the original CONFOLD method [3] on the 150 proteins in the PSICOV dataset [11] using the contacts predicted by PSICOV [11] (see Table 1) [...] Table S1 for a detailed comparison).

Next, to evaluate our model selection technique (selecting the top five models from 200), we compared our approach of model selection using clustering with model ranking using the contact satisfaction score only. On the same dataset, when we selected the top five models using the contact satisfaction score of the top L/5 or L/2 long-range contacts, we achieved a best-of-top-five TM-score of 0.50. The rationale for using the top L/5 or L/2 contacts (instead of L or more) is that these subsets are found to best reflect the accuracy of the predicted contacts [12]. In contrast, when we filter out the bottom 150 models, cluster the remaining 50 into five clusters, and select the cluster centroids, we obtain a best-of-top-five TM-score of 0.52, suggesting that the clustering approach is effective in selecting models built from contacts. As summarized in Table 1, we also reconstructed models for the PSICOV-150 dataset using contacts predicted by MetaPSICOV [13] and obtained a mean TM-score of 0.62 when the best of the top five models is evaluated (see Supp. Table S1 for detailed results), indicating that improved contact prediction leads to better tertiary structure reconstruction.

Finally, using CONFOLD2, we predicted models for the protein sequence targets in [...]

Table 1. Summary of the performance of CONFOLD2 on the PSICOV, CASP11, and CASP12 datasets. The mean contact precision of the top L/5 contacts is reported for all the datasets for (i) all contacts (short-range, medium-range, and long-range: P_SR+MR+LR) and (ii) long-range contacts (P_LR). The TM-scores of the best-of-200 and best-of-5 models reconstructed by CONFOLD2 are also presented. Results for the single-domain and multi-domain subsets of the CASP11 and CASP12 datasets are also reported separately.

Dataset | Method | Contact Precision (L/5): P_SR+MR+LR | Contact Precision (L/5): P_LR | TM-score of Models: Best-of-200 | TM-score of Models: Best-of-5