The ab initio approaches to protein structure prediction usually employ the Monte Carlo technique to search the structural conformation that has the lowest energy. However, the widely-used energy functions are usually ineffective for conformation search. How to construct an effective energy function remains a challenging task.

Results

Here, we present a framework to construct effective energy functions for protein structure prediction. Unlike existing energy functions only requiring the native structure to be the lowest one, we attempt to maximize the attraction-basin where the native structure lies in the energy landscape. The underlying rationale is that each energy function determines a specific energy landscape together with a native attraction-basin, and the larger the attraction-basin is, the more likely for the Monte Carlo search procedure to find the native structure. Following this rationale, we constructed effective energy functions as follows: i) To explore the native attraction-basin determined by a certain energy function, we performed reverse Monte Carlo sampling starting from the native structure, identifying the structural conformations on the edge of attraction-basin. ii) To broaden the native attraction-basin, we smoothened the edge points of attraction-basin through tuning weights of energy terms, thus acquiring an improved energy function. Our framework alternates the broadening attraction-basin and reverse sampling steps (thus called BARS) until the native attraction-basin is sufficiently large. We present extensive experimental results to show that using the BARS framework, the constructed energy functions could greatly facilitate protein structure prediction in improving the quality of predicted structures and speeding up conformation search.

Conclusion

Using the BARS framework, we constructed effective energy functions for protein structure prediction, which could improve the quality of predicted structures and speed up conformation search as well.

Background

Determination of protein structure is important for understanding protein functions [1]. The classical techniques for protein structure determination include X-ray crystallography, nuclear magnetic resonance, and electron microscopy. These determination techniques, however, often suffer from the limitations in both expensive costs and long determination period, leading to the ever-increasing gap between the number of known protein sequences and that of solved protein structures [2]. Computational approaches to protein structure prediction from sequences are becoming increasingly important to narrow down the gap [3].

The protein structure prediction approaches can be categorized into two families, namely, template-based modeling [4–10] and ab initio approaches [11–15]. Recently the predicted contacts have also been shown to be invaluable to protein structure prediction [16–21]. Unlike the template-based modeling approaches, the ab initio prediction approaches work without requirements of known similar protein structures. Briefly speaking, most ab initio prediction approaches are based on the hypothesis that the native structure of a protein should be the highly-populated one with sufficiently low energy; thus, ab initio approaches usually perform conformation search to find a structural conformation with sufficiently low energy. For example, Rosetta employs the Monte Carlo technique to search conformations assembled from fragments of known structures, and finally reports the centroid of a large cluster of low-energy conformations [11].

For the ab initio prediction approaches, one of the key issues is designing an effective energy function [11, 14, 15]. Typically, an energy function is a weighted-sum of multiple energy terms. The energy terms characterize specific structural features, especially the interplay between local and global interactions of residues. For example, the hydrophobic interaction term was designed to capture the observed tendency of non-polar residues to aggregate in aqueous solution and exclude water molecules. Van der Waals force term is the sum of the attractive or repulsive forces among residues. Hydrogen bonding term describes the electromagnetic attractive interaction between polar molecules in which hydrogen is bound to highly electronegative atom oxygen in the carboxyl [1]. In Rosetta, a total of 13 energy terms were used at the residue level, and over 140 terms were used at the full-atom level; therefore, it is important to find the optimal weighting of so many energy terms [11]. This study focuses on designing an optimal weighting of the 13 energy terms used in Rosetta.

Ideally, an effective energy function is expected to be able to distinguish the native structure from non-native conformations (called decoys), and could drive as much as possible initial conformations to the native-like one during the conformation search process. To achieve these two objectives, a widely-used strategy for designing energy function is to maximize the correlation between energy and quality of decoys [22]. Here the quality of a decoy refers to the structural similarity between the decoy with the native structure, which is measured using root mean square deviation (RMSD) of backbone atoms in this study. Inspired by the idea of “funnel-shaped free energy surface”, Levitt et al. proposed a funnel sculpting technique to construct energy functions that allow the conformation search procedure to easily “roll” into the native structure from a random starting conformation [23]. In another study, Shell et al. attempted to smooth energy function to make the energy landscape a funnel [24].

In this study, we present a framework that constructs effective energy function for protein structure prediction. Our framework, called BARS, consists of two procedures, i.e., broadening attraction-basin where the native structure lies in the energy landscape (hereafter denoted as native attraction-basin), and reverse sampling. The underlying rationale is that the larger the attraction-basin is, the more likely for the Monte Carlo procedure to find the native structure. To explore the attraction basin, we performed reverse sampling starting from the native structure. Subsequently, we tuned the weights of energy terms to broaden the native attraction-basin and thus acquired an improved energy function. We showed that both the possibility of successful search and the quality of predicted conformation increase when using the improved energy function.

The manuscript is organized as follows: “Methods” section describes the whole framework of our method, and the linear program model to optimize protein energy weights as well. “Results and discussion” section lists experimental results of the optimized energy function. In “Conclusion” section, we will discuss some limitations of our method and possible future works.

Methods

To construct an effective energy function, our BARS framework alternates two procedures, i.e., for an energy function, we first explored the native attraction-basin in the corresponding energy landscape using reverse sampling, and then improved the energy function through broadening the attraction-basin. These two procedures were alternated until the energy function changes sufficiently small between successive iterations as below.

The details of the two procedures are described as follows.

Exploring the native attraction-basin using reverse sampling

When applying Monte Carlo technique to search the structural conformation with the lowest energy, the ab initio approaches might finally end with success if starting from some initial structural conformations, and might end with failure if starting from other initial structural conformations. An initial structural conformation is said to lie in the native attraction-basin if the conformation can finally evolve into the native structure during the conformation search process.

To explore the native attraction-basin under a specific energy function, we propose a technique called reverse Monte Carlo sampling. Here, the term “reverse” comes from the fact that the sampling process works essentially reverse to the general Monte Carlo technique used for conformation search. Specifically, the general Monte Carlo search procedure starts from a random initial conformation and moves towards the native structure, during which the energy of conformation is reducing. For each inter-mediate structural conformation, a perturbation is made to generate a new conformation. Some popular perturbation techniques include fragment replacing used by Rosetta [11] and torsion angle sampling used by FALCON [13]. The newly-generated conformation is accepted if it has lower energy relative to the original conformation; otherwise the new conformation will be rejected with a probability according to the Metropolis-Hasting rule [25] (Fig. 1). To emphasize the difference between the general Monte Carlo search technique and the sampling technique used in this study, we denoted the former one as forward Monte Carlo technique.

On the contrary to the forward Monte Carlo search technique, our reverse Monte Carlo sampling procedure starts from the native structure and moves towards the edge of the native attraction-basin, during which the energy of conformations is increasing (Fig. 2). The reverse sampling process ends at a conformation if any perturbation of this conformation cannot leads to increase of energy. Intuitively, this conformation lies at the edge of the native attraction-basin and thus is denoted as edge point conformation in this study.

Formally, each execution of reverse sampling will generate a path of conformations P=S_{0}→S_{1}→S_{2}→...→S_{n}, where S_{0} denotes the native structure, S_{i}(1≤i≤n) denotes the ith inter-mediate conformations along the reverse sampling path, and S_{n} denotes the final edge point conformation. The energy of the inter-mediation conformations increases along the path, suggesting the monotonicity of energy within the native attraction-basin. The RMSD between S_{0} and S_{n} is calculated as a rough measure of the radius of native attraction-basin. For the edge point conformation S_{n}, we perform m times of perturbation, thus acquiring m perturbation neighbors of S_{n}, denoted as \(N(S_{n})=\left \{S_{n}^{(1)}, S_{n}^{(2)},..., S_{n}^{(m)}\right \}\) (Fig. 2). In our study, m is set as 1000.

Broadening the native attraction-basin by smoothening the edge point conformations

Intuitively, if we can “smoothen” the edge point conformations, the native attraction-basin will be broadened since the reverse sampling process will not be stuck at these edge point conformations. We accomplished the “smoothening” operation through tuning weights of energy terms such that the energy of S_{n} is decreased to be less than at least one of its perturbation neighbors \(S_{n}^{(i)} (1 \le i \le m)\). To make the native structure still the lowest one under the new energy function, we imposed a constraint on weight-tuning such that after tuning weights, the new energy of S_{i} should be lower than that of the conformation S_{i+1}. In other words, the monotonicity of energy are maintained within the attraction-basin, and thus the shape of the native attraction-basin will also be maintained. Figure 2 shows how the conformation path changes after improving energy function.

We designed a linear program to calculate the optimal weights that satisfy this constraint.

$$ W \cdot {E}_{n} \leq \frac{1}{m}\sum\limits_{j=1}^{m} W\cdot {E}_{n}^{(j)} $$

(2)

$$ W \geq 0 $$

(3)

$$ |W| = |{W}_{0}| $$

(4)

Here the vector W denotes the weights of energy terms, and W_{0} denotes the original weights before tuning. For an inter-mediate conformation S_{i} in the reverse sampling path, E_{i} denotes the vector of its energy terms, i.e. \(E_{i}= < e_{i}^{(1)},e_{i}^{(2)},\ldots,e_{i}^{(13)} >\), where \(e_{i}^{(k)}\) represents the k-th energy term. The objective of the linear program is to find a new weighting scheme with change as small as possible. Formula (1) describes a constraint that the original relative order of S_{i} and S_{i+1}, i.e., the monotonicity of energy within the native attraction-basin, should be kept after tuning weights. Formula (2) was designed to “smoothen” the edge point conformation, i.e. at least one of the m perturbation neighbors of the edge point conformation has a higher energy; thus, S_{n} is no longer an edge point conformation under the new energy function.

Results and discussion

Data set

We evaluated the BARS framework on Test101 dataset that contains a total of 101 benchmark proteins. The criteria for selecting these proteins are: i) The length of these proteins are less than 120 amino acids as the energy terms used in this study were designed for small proteins [11]. ii) These proteins cover most SCOP superfamilies as energy functions differ with protein class. We used three proteins as representatives of three SCOP classes (Table 1) to explain the working process of the BARS framework and summarized experimental results on Test101 dataset in Additional file 1: Table S1.

Analysis of evolution of energy functions

As mentioned above, the BARS framework alternates the broadening attraction-basin and reverse sampling steps to gradually improve energy function. In our experiment, the initial weights of energy terms were set as the weights used by Rosetta in the scoring function score3.

Table 2 shows how the energy functions evolve as iteration proceeds for protein 1ctfA. From this table, it can be observed that the weights of energy terms are almost fixed after 6 iterations. In addition, although the difference between consecutive iterations are not very large as expected, the final weights are quite different from the initial ones (Manhattan distance: 15.16).

We also compared the final weights for proteins in different classes. As shown in Table 3, the final weights exhibited considerable difference for proteins from different classes. For example, the weight of Env term in 1iieA (class A) is 5.69, about 4 times larger than that of 1ctfA (class D), and over 2.5 times larger than that of 1iloA (class C). This is consistent with the fact that the environment local geometrical term is more important for all- α proteins, since local residue-residue interactions dominate the helix formation process. In addition, 1ctfA (class D) can be distinguished from the other two proteins at the sheet term: the final weight of this term is 3.48 for 1ctfA, much larger than that of 1iloA (1.90) and 1iieA (1.14). This is also reasonable as α + β proteins usually contain anti-parallel β-sheets, whereas α/ β proteins contains β- α- β motifs. Taken together, the table supports the view point that different energy terms are emphasized for proteins in different classes.

Broadening attraction-basin as iteration proceeds

We further investigated whether the native attraction-basin could be broadened after smoothening the edge point conformations. To measure the size of the native attraction-basin, we performed reverse sampling for 50 times, thus acquiring 50 edge point conformations. The mean RMSD between these edge point conformations and the native structure is calculated and used to measure the size of the native attraction-basin. Intuitively the calculated mean RMSD can be treated as radius of the native attraction-basin.

As shown in Fig. 3, the mean RMSD was 6 Å initially, and increased to nearly 14 Å at the final iteration. This clearly suggested that the attraction basin was really significantly enlarged as iteration proceeds.

Protein structure prediction using the improved energy function

We further investigated whether the improved energy function could facilitate protein structure prediction or not. For this aim, we compared the predicted structures by running Rosetta with different energy functions. Specifically, for each iteration step of BARS, we used the corresponding weighting scheme of energy terms to construct an energy function, and then run Rosetta using this energy function to generate 1000 decoys. We investigated two aspects of these predicted decoys: i)Quality of final prediction results: Among these decoys, a clustering procedure was executed and the centroid of the largest cluster was reported as the final prediction. Then we analyzed RMSD of the final prediction with the native structure. ii)Good decoy ratio: Among these generated decoys, we calculated the ratio of “good” decoys. Here, we adopted the widely-used criterion that for small proteins, a decoy is called good decoy if its RMSD to the native structure is less than 6 Å [13].

As shown in Fig. 4, the quality of the final prediction results improved as iteration proceeded. Taking protein 1iloA as an example, the final prediction had a RMSD of 2.7 Å when using the original weighting scheme (Fig. 5, left panel). In contrast, when using the optimized weighting scheme, the quality of the final prediction structure improved (RMSD: 1.3 Å, Fig. 5, right panel). We repeated this experiment on the 101 benchmark proteins in Test101 set and observed that for 82 proteins, the quality of final prediction structure improved (Fig. 6 and Additional file 1: Table S1). For example, the quality of predicted structure for protein 1pchA was low (RMSD: 11.721 Å) when using the original energy weight; in contrast, when using the optimized weighting scheme, the predicted structure significantly improved (RMSD: 3.417 Å).

Besides the quality of the final prediction, the good decoy ratio also increased considerably (Fig. 7). For example, if using the initial weights for protein 1iloA, only 11 decoys among the 1000 decoys were good decoys. In contrast, over 200 good decoys were generated when using the optimized weighting scheme. Taken together, these results clearly suggest that using the energy function constructed with BARS, the general Monte Carlo search procedure significantly improved in both success possibility and the quality of prediction results. This improvement should also be attributed to the broadened native attraction-basin.

Application range of the constructed energy functions

Application range is one of the key issues of energy functions. An ideal energy function is expected to be applicable on a large amount of proteins rather than a single protein. To examine the application range of the energy functions acquired using BARS, we run Rosetta with the energy function acquired from protein 1ctfA on other seven benchmark proteins in the same class to 1ctfA. As shown in Table 4, on all of the seven proteins, the predicted structures have high quality (RMSD less than 7 Å). More importantly, on five out of the seven proteins, the prediction results using the energy function acquired from 1ctfA are much better than those predicted using the original weights.

Similarly, we run Rosetta with the energy function acquired from protein 1iieA on other 20 benchmark proteins. As shown in Table 5, on 19 out of the 20 benchmark proteins, the predicted structures have RMSD less than 5 Å. On 18 out of these benchmark proteins, the predicted structures using the weights acquired from 1iieA are better than those predicted using the original weights.

In summary, these results demonstrate the wide application range of the energy functions acquired using the BARS framework. A reasonable explanation of this wide application range is that proteins in a class might share similar folding process; thus, the energy function optimized on a certain protein is also applicable for other proteins in the same class.

Conclusion

In this study we report the BARS framework for constructing effective energy functions. The framework attempts to improve energy function gradually such that the native attraction-basin was broadened. During this process, a reverse Monte Carlo sampling strategy was proposed to explore the native attraction-basin. Extensive experimental results demonstrate both effectiveness and wide application range of the constructed energy functions.

It has been reported that protein folding is a hierarchical process. According to this observation, Rosetta employs a multi-step prediction strategy. In particular, Rosetta first uses score function score_{0} with only hydrophobic core terms, then uses score_{2}/score_{5} with secondary structure terms, and finally uses score_{3} to incorporate a total of 13 energy terms [11]. This study focuses on the optimization of weights for the third step. How to design effective energy functions for the first and second steps remains as one of our future works. We also noticed that on 19 out of the 101 benchmark proteins, the quality of prediction results using the optimized weighting scheme were low. How to design better energy functions for these proteins is another future work. It should be pointed out that the RMSD deviation of the 50 edge point conformations is large at iteration 5 for protein 1ctfA (Fig. 3). This might imply the irregular shape of the native attraction-basin. We will investigate this issue in future work.

The application of the BARS framework is not limited to protein structure prediction. Constructing an effective scoring function is usually the first important step for optimization problems in various domains such as RNA structure prediction, natural language processing, etc. How to optimally combine multiple terms into a scoring function is a challenging task. Our BARS framework should greatly facilitate designing effective scoring functions for a large variety of problems.

Abbreviations

BARS:

Broadening attraction-basin and reverse sampling

RMSD:

Root mean square deviation

References

Branden C, Tooze J. Introduction to protein structure. New York: Garland Publishing; 1999.

Li X, Hu C, Liang J. Simplicial Edge Representation of Protein Structures and Alpha Contact Potential with Confidence Measure. Protein Struct Funct Bioinform. 2003; 53(4):792–805.

Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, et al. Gapped Blast and PsiBlast: a new generation of protein database search programs. Nucleic Acids Res. 1997; 25(17):3389–402.

Zhou H, Zhou Y. Fold recognition by combining sequence profiles derived from evolution and from depth-dependent structural alignment of fragments. Proteins. 2004; 58(2):321–8.

Xu J, Li M, Lin G, Kim D, Xu Y. Protein Threading by Linear Programming. In: Biocomputing: Proceedings of the 2003 Paci Symposium. vol. 8. Pacific Symposium on Biocomputing: 2003. p. 264–75.

Wu S, Szilagyi A, Zhang Y. Improving protein structure prediction using multiple sequence-based contact predictions. Structure. 2011; 19(8):1182–91.

Simons KT, Kooperberg C, Huang E, Baker D. Assembly of protein tertiary structures from fragments with similar local sequences using simulated annealing and Bayesian scoring functions. J Mol Biol. 1997; 268(1):209–25.

Xu D, Zhang Y. Ab initio protein structure assembly using continuous structure fragments and optimized knowledge based force field. Proteins. 2012; 80(7):1715–35.

Andreatta M, Laplagne S, Li SC, Smale S. Prediction of residue-residue contacts from protein families using similarity kernels and least squares regularization. 2013. arXiv preprint arXiv:13111301.

Di Lena P, Nagata K, Baldi P. Deep architectures for protein contact map prediction. Bioinformatics. 2012; 28(19):2449–57.

Michel M, Hayat S, Skwark MJ, Sander C, Marks DS, Elofsson A. PconsFold: improved contact predictions improve protein models. Bioinformatics. 2014; 30(17):i482–8.

Zhang H, Gao Y, Deng M, Wang C, Zhu J, Li SC, et al. Improving residue-residue contact prediction via low-rank and sparse decomposition of residue correlation matrix. Biochem Biophys Res Commun. 2016; 472(1):217–22.

We would like to thank Torsten Julich for discussion.

Funding

We would like to thank the National Key Research and Development Program of China (2018YFC0910405), and the National Natural Science Foundation of China (31671369 and 31770775) for providing financial supports for this study and publication charges.

Chao Wang and Yi Wei contributed equally to this work.

Authors and Affiliations

Key Lab of Intelligent Information Processing, Institute of Computing Technology, Chinese Academy of Sciences, 6, Kexueyuan South Road, Zhongguancun, Beijing, 100190, China

Chao Wang, Yi Wei, Haicang Zhang, Lupeng Kong, Shiwei Sun & Dongbo Bu

University of Chinese Academy of Sciences, 19-1, Yuquan Road, Shijingshan, Beijing, 100049, China

Chao Wang, Yi Wei, Haicang Zhang, Lupeng Kong, Shiwei Sun, Wei-Mou Zheng & Dongbo Bu

Institute of Theoretical Physics, Chinese Academy of Sciences, 55, Zhongguancun East Road, Beijing, 100190, China

DB and WZ conceived the study. CW and YW performed the computation and analysis. HZ and SS analyzed the new framework. CW, LK, WZ, and DB wrote the manuscript. All authors read and approved the final manuscript.

Quality of predicted structures using the original weights and the optimized weights of energy terms on 101 benchmark proteins. (PDF 82 kb)

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Wang, C., Wei, Y., Zhang, H. et al. Constructing effective energy functions for protein structure prediction through broadening attraction-basin and reverse Monte Carlo sampling.
BMC Bioinformatics20
(Suppl 3), 135 (2019). https://doi.org/10.1186/s12859-019-2652-5