 Research article
 Open Access
 Published:
Balancing multiple objectives in conformation sampling to control decoy diversity in templatefree protein structure prediction
BMC Bioinformaticsvolume 20, Article number: 211 (2019)
Abstract
Background
Computational approaches for the determination of biologicallyactive/native threedimensional structures of proteins with novel sequences have to handle several challenges. The (conformation) space of possible threedimensional spatial arrangements of the chain of amino acids that constitute a protein molecule is vast and highdimensional. Exploration of the conformation spaces is performed in a samplingbased manner and is biased by the internal energy that sums atomic interactions. Even stateoftheart energy functions that quantify such interactions are inherently inaccurate and associate with protein conformation spaces overly rugged energy surfaces riddled with artifact local minima. The response to these challenges in templatefree protein structure prediction is to generate large numbers of lowenergy conformations (also referred to as decoys) as a way of increasing the likelihood of having a diverse decoy dataset that covers a sufficient number of local minima possibly housing nearnative conformations.
Results
In this paper we pursue a complementary approach and propose to directly control the diversity of generated decoys. Inspired by hard optimization problems in highdimensional and nonlinear variable spaces, we propose that conformation sampling for decoy generation is more naturally framed as a multiobjective optimization problem. We demonstrate that mechanisms inherent to evolutionary search techniques facilitate such framing and allow balancing multiple objectives in protein conformation sampling. We showcase here an operationalization of this idea via a novel evolutionary algorithm that has high exploration capability and is also able to access lowerenergy regions of the energy landscape of a given protein with similar or better proximity to the known native structure than several stateoftheart decoy generation algorithms.
Conclusions
The presented results constitute a promising research direction in improving decoy generation for templatefree protein structure prediction with regards to balancing of multiple conflicting objectives under an optimization framework. Future work will consider additional optimization objectives and variants of improvement and selection operators to apportion a fixed computational budget. Of particular interest are directions of research that attenuate dependence on protein energy models.
Background
Faster and cheaper highthroughput gene sequencing technologies have contributed millions of uncharacterized proteinencoding gene sequences in genomic databases [1]. Wetlaboratory efforts on resolving threedimensional (tertiary) biologicallyactive/native structures of proteins have contributed an order of magnitude less [2]. This disparity and the recognition that tertiary structure determines to a large extent biological function and molecular mechanisms in the cell [3] motivate the development of complementary, computational approaches to tertiary protein structure prediction (PSP) [4].
Due to hardware and algorithmic improvements, templatefree PSP methods, which focus on the most challenging setting of obtaining biologicallyactive structures of a protein from knowledge of its aminoacid sequence (in absence of a structural template from a close or remote homologous sequence), have made steady improvements in their capabilities [5]. Despite the success of hallmark protocols, such as Rosetta [6], Quark [7], and others [5], most notably due to domainspecific insight, templatefree PSP presents outstanding computational challenges. The space of possible threedimensional spatial arrangements of the chain of amino acids that constitute a protein molecule is vast and highdimensional; we refer to this space as conformation space to recognize choices in the computational representation of a structure^{Footnote 1}. Exploration of such complex spaces is performed in a samplingbased manner (most commonly under the Metropolis Monte Carlo – MMC framework) and is biased by the internal energy that sums atomic interactions. The goal is to generate lowenergy conformations that have a higher likelihood of being nearnative conformations (and populating thermodynamicallystable regions of the energy surface) [8]. However, even stateoftheart energy functions that quantify atomic interactions in a conformation are inherently inaccurate; they result in overly rugged energy surfaces (associated with protein conformation spaces) that are riddled with artifact local minima [9].
The key question in conformation sampling for templatefree PSP is how to obtain a broad, samplebased representation of the vast and highdimensional conformation spaces (and in turn the associated energy surface) and not miss possibly diverse local minima that may house nearnative conformations. The response to this question traditionally has been by the numbers; that is, the objective becomes to generate a large number of lowenergy conformations (also referred to as decoys) as a way of increasing the likelihood of having a diverse decoy dataset that covers a sufficient number of local minima possibly housing nearnative conformations.
In this paper we pursue a complementary approach and propose to directly control the diversity of sampled conformations. Inspired by hard optimization problems in highdimensional and nonlinear variable spaces, we propose that conformation sampling for decoy generation is more naturally framed as a multiobjective optimization problem. We demonstrate that mechanisms inherent to evolutionary search techniques facilitate such framing and allow balancing multiple competing objectives in protein conformation sampling. We showcase an operationalization of this idea via a novel evolutionary algorithm that has high exploration capability and is additionally able to access lowerenergy regions of the energy landscape of a given protein with similar or better proximity to the known native structure than stateoftheart algorithms.
The rest of this article is organized as follows. Related work is summarized in the following section. The proposed algorithm is described in the “Methods” section and evaluated in the “Results” section. The article concludes with a summary and discussion of future directions of work in the “Conclusion” section.
Related work
Key features are behind advances over the past decade in templatefree PSP. The conformation space is simplified and reduced in dimensionality. The atoms of the side chain in each amino acid are compressed into a pseudoatom, and the conformation variables are dihedral angles on bonds connecting modeled backbone atoms and sidechain pseudoatoms. Note that even this representation yields hundreds of dihedral angles (thus, a conformation space of hundreds of dimensions) even for chains not exceeding 150 amino acids. Additionally, the molecular fragment replacement technique is used to discretize the conformation space by bundling backbone dihedral angles together. Values are assigned for a consecutive number of angles simultaneously according to structural pieces or fragment configurations that are precompiled over known native protein structures [6].
Despite these two key developments, the conformation space demands powerful optimization algorithms under the umbrella of stochastic optimization. These algorithms have to balance limited computational resources between exploration of a space through global search with exploitation of local minima in the energy surface (the conformation space lifted by the internal energy of each conformation) through local search. The common approach, in Rosetta and others [10], achieves exploitation through intensive localized MMC search, while using multistart or randomrestart for global search or exploration. There are no explicit controls in these MMCbased treatments to balance between exploration and exploitation, which is key when the search space is highdimensional and highly nonlinear (rich in local minima). Moreover, to account for the fact that computational resources may be wasted on exploiting false local minima (artifacts of the particular energy function used)^{Footnote 2}, the recommendation from developers is to generate a large number of decoys (e.g., run the Rosetta abinitio protocol for conformation sampling tens of thousands of times).
MMCbased treatments do not address the core issue of balancing exploration with exploitation. Evolutionary algorithms (EAs) are inherently better equipped at addressing this balance for complex optimization problems [11]. A growing body of research shows that, when injected with domainspecific insight (as in Rosetta), EAs outperform Rosetta in exploration capability [12–16]. EAs carry out stochastic optimization inspired by natural selection. In particular, in populationbased EAs, a fixedsize population of individuals (conformations in our context) evolves over a number of generations. At every generation, individuals are selected to serve as parents. Selected parents are subjected to variation operators that produce new offspring. In memetic/hybrid EAs, this global search is interleaved with local search, as offspring are additionally subjected to an improvement operator, so that they can better compete with parents. A selection operator implements the concept of natural selection, as it pares down the combined parent and offspring population down to the fixedsize population. The interested reader is pointed to work in [14] for a review of EAs for templatefree PSP over the years.
EAs easily allow for framing conformation sampling for templatefree PSP as a multiobjective optimization problem. The latter may not seem immediately obvious, but the rise of false local minima is due to lack of knowledge on how to combine competing atomic interactions (electrostatic, hydrogenbonding, and others) and how much to weight each category of interactions in an energy function. These categories are often conflicting; that is, a change in a conformation may cause an increase in the value of one energetic term (e.g., electrostatics) but a decrease in the value of another (e.g., hydrogen bonding). Rather than combining such terms in one energy function that is used as an aggregate optimization objective, proofofconcept work has pursued a multiobjective optimization setting by treating different terms in an energy function as separate optimization objectives [16, 17]. It is worth noting that algorithmic ingredients in an EA (its various operators) naturally allow pursuing a multiobjective optimization treatment for decoy generation. Moreover, as we show in this paper, such mechanisms allow to control the diversity of sampled conformations and thus yield a broader, samplebased representation of the conformation space (and its energy surface).
Methods
The proposed algorithm is a memetic EA that controls the diversity of the conformations it computes via the selection operator that determines individual survival. The algorithm builds over expertise in our laboratory on EAs for decoy generation; namely, how to inject Rosetta domainspecific insight (structure representation, molecular fragment replacement technique, and scoring functions for conformation evaluation) in evolutionary search mechanisms. The methodological contribution in this paper is a novel, sophisticated selection operator to control conformation diversity and handle conflicting optimization objectives.
Summary of main ingredients
We provide a summary of the main computational ingredients first. The proposed EA evolves a fixedsize population of N conformations over generations. Great care is taken so the initial population P_{0} contains N physicallyrealistic, yet diverse conformations. Each conformation is initialized as an extended backbone conformation, and a series of fragment replacements randomize each conformation while adding secondary structure. This process is conducted as a Monte Carlo search, guided by two different scoring functions that first encourage avoidance of steric clashes (selfcollisions) and then the formation of secondary structure.
In the proposed EA, at the beginning of each generation, all conformations in the population are selected as parents and varied so that each yields one offspring conformation. The variation makes use of the popular molecular fragment replacement technique (described in greater detail below), effectively selecting a number of consecutive dihedral angles starting at some amino acid selected at random and replacing the angles with new ones drawn from a precompiled fragment library. This process and the variation operator are described in greater detail below. The variation operator contributes to exploration. To additionally improve exploitation (digging deeper into the energy surface), each offspring is further subjected to an improvement operator. This operator maps each offspring to a nearby local minimum in the energy surface via a greedy local search (that again utilizes fragment replacements), detailed below. At the end of the variation and improvement operators, the algorithm has now computed N new (offspring) conformations that will fight for survival among one another and the N parent conformations. The winners constitute the next population.
We now describe each of the operators in further detail.
Fragment replacement
In molecular fragment repacement, an amino acid in the segment [1,l−f+1] (where l is the number of amino acids in the protein chain) over the chain of amino acids is selected at random, effectively picking at random a fragment [i,i+f−1] of f consecutive amino acids in the sequence. This sequence of amino acids exists in some fragment configuration in some current conformation C_{curr}. The entire configuration of 3×f backbone dihedral angles (ϕ,ψ, and ω per amino acid) in C_{curr} is replaced with a new configuration of 3×f backbone dihedral angles to obtain C_{new}. The new configuration is obtained from precompiled fragment libraries. These libraries are computed over known native structures of proteins (deposited, for instance, in the Protein Data Bank) and are organized in such a way that a query with the aminoacid sequence of a fragment returns 200 configurations; one is selected at random to replace the configuration in the selected fragment in C_{curr}. The described process is the molecular fragment replacement in Rosetta. The reader is referred to Ref. [6] for further information on fragment libraries.
Initial population operator
Recall that a population contains a fixed number of conformations N. Given the aminoacid sequence of l amino acids, the Pose construct of the Rosetta framework is utilized to obtain an extended chain of backbone atoms, with the sidechain of each amino acid reduced to a centroid pseudoatom (this is known as the centroid representation in Rosetta). This process is repeated N times to obtain N (identical) extended conformations. Each extended conformation is then subjected to two consecutive stages of local search. Each one is implemented as an MMC search, but the stages use different scoring functions and different values for the scaling parameter α that controls the acceptance probability in the Metropolis criterion. In both stages, an MC move is a fragment replacement; a fragment of length 9 (9 consecutive amino acids) is selected at random over the chain of amino acids and replaced with a fragment configuration drawn at random from 9 aminoacid (aa) long fragment libraries. The latter are prebuilt given a target sequence by making use of the online Robetta fragment server [6].
In the first stage, the goal is to randomize each extended chain via fragment replacements but still avoid self collisions. The latter are penalized in the score0 scoring function, which is a Rosetta scoring function that consists of only a soft steric repulsion. This scoring function is utilized in stage one to obtain a diverse population of random conformations free of self collisions. A scaling parameter α=0 is used in the Metropolis criterion; this effectively sets the acceptance probability to 0, which guarantees that a move is only accepted if it lowers score0. This strict constraint is necessary to avoid carrying through selfcolliding conformations.
In the second stage, the goal changes from obtaining randomized, collisionfree conformations to conformations that resemble protein structures in that they have secondary structure elements that are packed rather than stretched out in space. This is achieved by switching from score0 to score1, which imposes more constraints than collision avoidance and allows formation of secondary structure. In addition, the scaling parameter is set to a higher value of 2, which increases the acceptance probability, increasing the diversity of conformations. This stage, also implemented as an MMC search where moves are fragment replacements, proceeds on a conformation until l consecutive moves (l is number of amino acids in a given protein sequence) fail per the Metropolis criterion. We note that score0 and score1 are members of a suite of Rosetta scoring functions that are weighted sums of 13 distinct energy terms. The process employed in the initial population (utilizing fragment length of 9 and different scoring functions at different substages) mirrors that in Rosetta (though the length of the MMC trajectories in the substages in the simulated annealing algorithm employed for decoy generation in Rosetta is much longer). The final ensemble of conformations obtained by the initial population operator now contains credible, proteinlike conformations.
Variation operator
The variation operator is applied onto a parent individual to obtain offspring. This operator implements asexual reproduction/mutation, making use of fragment replacement to vary a parent and obtain a new, offspring conformation. We note that in the variation operator, one does not want to institute too much of a (structural) change from the parent in the offspring, so that good properties of the parent are transferred to the offspring, but enough change to obtain a conformation different from the parent. For this reason, a fragment length f=3 is used in the variation operator. Note that the fragment replacement in the variation operator is not in the context of some MMC search; that is, one fragment replacement is carried out, and the result is accepted, yielding an offspring conformation obtained from a thusvaried parent.
Improvement operator
This operator maps an offspring to a nearby local minimum via a greedy local search that resembles stage two in the initial population operator. The search carries out fragment replacements (utilizing f=3) that terminates on an offspring when k consecutive moves fail to lower energy. The latter is measured via Rosetta’s score3. This scoring function upweights energetic constraints (terms) that favor formation of compact tertiary structures [18]. The utilization of score3 in the proposed algorithm mirrors the fact that in Rosetta, the majority of the search is done with score3. That is, most of the computational budget (in terms of fitness evaluations) is expended on the local improvement operator.
Selection operator
The selection operator is the mechanism leveraged to pursue a multiobjective optimization setting and directly control the diversity of computed conformations. We first describe how the selection operator allows a multiobjective optimization setting.
Multiobjective optimization under Pareto dominance
Let us consider that a certain number of optimization objectives is provided along which to compare conformations. A conformation C_{a} is said to dominate another conformation C_{b} if the value of each optimization objective in C_{a} is lower than the value of that same objective in C_{b}; this is known as strong dominance. If equality is allowed, the result is soft dominance. The proposed algorithm makes use of strong dominance. Utilizing the concept of dominance, one can measure the number of conformations that dominate a given conformation C_{b}. This measure is known as Pareto rank (PR) or, equivalently, domination count. In contrast, the number of conformations dominated by a given conformation C_{a} is known as the Pareto count (PC) of C_{a}. If no conformation in a set dominates a given conformation C_{b}, then C_{b} has a domination count (PR) of 0 and is said to be nondominated. Nondominated conformations constitute the Pareto front.
The concept of Pareto dominance can be operationalized in various ways. In early proofofconcept work [16, 17], the Rosetta score4 (which includes both shortrange and longrange hydrogen bonding terms) was divided into three optimization objectives along which parents and offspring can be compared in the selection operator: shortrange hydrogen bonds (objective 1), longrange hydrogen bonds (objective 2), and everything else (summed together in objective 3). This categorization recognizes the importance of hydrogen bonds for formation of native structure [18]. Using these three objectives, work in [16] utilizes only PR in the selection operator, first sorting the N parent and N offspring conformations from low to high PR, and then further sorting conformations with the same PR from low to high score4 (total energy that sums all three objectives). PC can be additionally considered to obtain a sorted order, as in [17]. Conformations with the same PR are sorted from high to low PC, and conformations with the same PC are further sorted from low to high score4. The selection operator then selects the top N conformations (out of the combined 2N conformations of parents and offspring) according to the resulting sorted order.
Nondominated Fronts
The proposed algorithm truly considers a multiobjective setting and does not utilize an aggregate energy value (the sum of the objectives). Specifically, the algorithm considers nondominated fronts in its selection operator. A fast, nondominated sorting algorithm (originally proposed in [19]) is used to generate these fronts as follows. All the conformations in the combined parent and offspring population that have a domination count of 0 (thus, are nondominated) make up the first nondominated front F_{1}. Each subsequent, nondominated front F_{i} is generated as follows. For each conformation C∈F_{i−1}, the conformations dominated by C constitute the set S_{C}. The domination count of each member in S_{C} is decremented by 1. Conformations in S_{C} that have their domination count reduced to 0 make up the subsequent, nondominated front F_{i}. This process of generating nondominated fronts terminates when the total number of conformations over the generated fronts equals or exceeds the population size N. In this way, the selection operator is accumulating enough goodquality conformations from which it can further draw based on additional nonenergy based objectives. Moreover, this allows generating Paretooptimal solutions over the generations and achieving better convergence to the true, Paretooptimal set.
Densitybased conformation diversity
Borrowing from evolutionary computation research [19] on optimization problems of few variables ranging from 1 to 30 (as opposed to hundreds of variables in our setting), we leverage crowding distance to retain diverse conformations. Crowding distance estimates the density of the conformations in the population space and guides the selection process over generations towards less crowded regions [19]. We use the crowding distance assignment technique to compute the average distance of a conformation from other conformations in the same nondominated front along each of the optimization objectives. First, the crowding distance of each conformation is initialized to 0. Then, for each objective, conformations are sorted based on their corresponding score (value of that objective) in ascending order and assigned infinite distance value to conformations with the highest and lowest scores; this ensures that conformations with the highest and lowest scores (effectively constituting the boundaries of the population space) are always selected. For all other conformations C, the absolute normalized difference in scores between the two closest conformations on either side of C is added to the crowding distance. Finally, when all the objectives are considered, the crowding distance of a conformation is the sum of the individual distances along each objective.
Putting it all together: Conformation diversity in a multiobjective optimization setting
To obtain the next population, the selection operator selects r conformations from the nondominated fronts F_{1},F_{2},…,F_{t} sequentially, where r is \(\sum _{i \in \{1, 2, \ldots, t\}}F_{i}\) until r+F_{t+1} reaches or exceeds N. If r<N, which is usually the case, the crowding distance of conformations in F_{t+1} is computed and used to sort them in descending order. The selection operator then selects the top N−r conformations in this order.
It is worth noting that in our earlier operationalizations of multiobjective optimization for templatefree PSP, all conformations ever computed were retained for the calculation of PR and PC values for each conformation. This introduces a significant computational overhead, which the proposed algorithm circumvents. The proposed algorithm instead uses only the current combined population of parents and offspring to perform selection, thus saving such overhead.
Implementation details
The population size is N=100 conformations, in keeping with earlier work on multiobjective EAs. Instead of imposing a bound on the number of generations, the proposed algorithm is executed for a fixed budget of 10,000,000 energy evaluations. The algorithm is implemented in Python and interfaces with the PyRosetta library. The algorithm takes 1−4 h on one Intel Xeon E52670 CPU with 2.6GHz base processing speed and 64GB of RAM. The range in running time depends primarily on the length of the protein. As further described in the “Results” section, the algorithm is run 5 times on a test case (a target aminoacid sequence) to remove differences due to stochasticity.
Results
Experimental setup
The evaluation is carried out on two datasets, a benchmark dataset of 20 proteins of varying folds (α,β,α+β, and coil) and lengths (varying from 53 to 146 amino acids), and a dataset of 10 hard, freemodeling targets from the Critical Assessment of protein Structure Prediction (CASP) community experiment. The first dataset was first presented partially in [20] and then enriched with more targets in [12, 13, 16, 21, 22]. Our second dataset consists of 10 freemodeling domains from CASP12 and CASP13.
The proposed algorithm is compared with Rosetta’s decoy sampling algorithm, a memetic EA that does not utilize multiobjective optimization [15], and two other memetic EAs that do so (one utilizing only Pareto Rank [16], and the other utilizing both Pareto Rank and Pareto Count [17], as described in the previous section). We will correspondingly refer to these algorithms as Rosetta, mEA, mEAPR, and mEAPR+PC. To aid in the comparisons, we will refer to the algorithm proposed in this paper as EvoDiverse. This comparison allows us to isolate the impact of the selection operator in EvoDiverse over those in mEAPR, and mEAPR+PC, as well as point to the impact of the multiobjective setting (in comparison with mEA) and the evolutionary computation framework overall (in comparison with Rosetta). Each of these algorithms is run 5 times on each target sequence, and what is reported is their best performance over all 5 runs combined. Each run continues for a fixed computational budget of 10M energy evaluations.
In keeping with published work on EAs [14], performance is measured by the lowest energy ever reached and the lowest distance ever reached to the known native structure of a target under consideration. The former measures the exploration capability. Since lower energies do not necessarily correlate with proximity to the native structure, it is important to also measure the distance of each decoy to a known native structure. We do so via a popular dissimilarity metric, least rootmeansquareddeviation (lRMSD) [23]. lRMSD first removes differences due to rigidbody motions (wholebody translation and rotation in three dimensions), and then averages the summed Euclidean distance of corresponding atoms in two conformations over the number of atoms compared. Typically, in templatefree PSP, the comparison focuses on the main carbon atom of each amino acid (the CA atoms). It is worth noting that lRMSD is nondescriptive above 8Å and increases with sequence/chain length. An RMSD within 5−6Å is considered to have captured the native structure. In addition to lRMSD, our evaluation on the CASP12 and CASP13 dataset includes two additional measures, the “Template Modeling Score” (TMscore) [24] and the “Global Distance Test  Total Score” (GDT_TS) [25, 26]. Both metrics produce a score between 0 and 1, where a score of 1 suggests a perfect match. A higher score indicates a better proximity. In practice, TMscores and GDT_TS scores of 0.5 and higher are indicative of good predictions/models.
To carry out a principled comparison, we evaluate the statistical significance of the presented results. We use Fisher’s [27] and Barnard’s [28] exact tests over 2x2 contingency matrices keeping track of the particular performance metric under comparison. Fisher’s exact test is conditional and widely adopted for statistical significance. Barnard’s test is unconditional and generally considered more powerful than Fisher’s test on 2x2 contingency matrices. We use 2sided tests to determine which algorithms do not have similar performance and 1sided tests to determine if EvoDiverse performs significantly better than the other algorithms under comparison.
Comparative analysis on benchmark dataset
Figure 1 shows the lowest energy obtained over combined 5 runs of mEA, mEAPR, mEAPR+PC, Rosetta, and EvoDiverse for each of the 20 target proteins; the latter are denoted on the x axis by the Protein Data Bank (PDB) [2] identifier (ID) of a known native structure for each target. Figure 2 presents the comparison in terms of the lowest lRMSD achieved on each of the test cases. Colorcoding is used to distinguish the algorithms from one another.
A summary of comparative observations is presented in Table 1. Table 1(a) shows that lowest energy is achieved by EvoDiverse in 9/20 of the test cases over the other algorithms; in comparison, mEAPR achieves the lowest energy in 4/20, mEA and mEAPR+PC in 3/20, and Rosetta in only 1 case. In a headtohead comparison, EvoDiverse bests each of the other algorithms in a comparison of lowest energy. Table 1(b) shows that lowest lRMSD is achieved by EvoDiverse in 10/20 test cases over the other algorithms; in comparison, mEAPR achieves the lowest energy in 2/20, mEA and mEAPR+PC in 1/20, and Rosetta in 9 cases. In a headtohead comparison, EvoDiverse bests each of the other algorithms in a comparison of lowest lRMSD, as well.
The above comparisons are further strengthened via statistical analysis. Table 2(a) shows the pvalues obtained in 1sided statistical significance tests that pitch EvoDiverse against each of the other algorithms (in turn), evaluating the null hypothesis that EvoDiverse performs similarly or worse than its counterpart under comparison, considering two metrics, achieving the lowest energy in each test case, and achieving a lower (lowest) energy on each test case that its current counterpart. Both Fisher’s and Barnard’s test are conducted, and pvalues less than 0.05 (which reject the null hypothesis) are marked in bold. Table 2(a) shows that the null hypothesis is rejected in most of the comparisons; EvoDiverse performs better than mEA and Rosetta; the performance over mEAPR and mEAPR+PC is not statistically significant.
Table 2(b) shows the pvalues obtained in 1sided statistical significance tests that pitch the performance of EvoDiverse against each of the other algorithms (in turn), evaluating the null hypothesis that EvoDiverse performs similarly or worse than its counterpart under comparison, considering two metrics, achieving the lowest lRMSD in each test case, and achieving a lower (lowest) lRMSD on each test case than its current counterpart. Both Fisher’s and Barnard’s test are conducted, and pvalues less than 0.05 (rejecting the null hypothesis) are in bold. Table 2(b) shows that the null hypothesis is rejected in most tests; EvoDiverse outperforms all algorithms except for Rosetta.
Table 3(a) shows the pvalues obtained in 2sided statistical significance tests that pitch EvoDiverse against each of the other algorithms (in turn), evaluating the null hypothesis that EvoDiverse performs similarly to its counterpart under comparison, considering two metrics, achieving the lowest energy in each test case, and achieving a lower (lowest) energy on each test case than its current counterpart. Both Fisher’s and Barnard’s test are conducted, and pvalues less than 0.05 (which reject the null hypothesis) are marked in bold. Table 2(a) shows that the null hypothesis is rejected in most of the comparisons; EvoDiverse does not perform similarly to mEA and Rosetta; the dissimilarity of performance compared to mEAPR and mEAPR+PC is not statistically significant at 95% confidence level. Similarly, Table 3(b) shows the pvalues obtained in 2sided statistical significance tests that now consider the lowest lRMSD instead of lowest energy. Table 3(b) shows that the null hypothesis is rejected in most tests; EvoDiverse does not perform similarly to all algorithms except for Rosetta at 95% confidence level.
Taken altogether, these results indicate that EvoDiverse has a high exploration capability, decidedly outperforming mEA and Rosetta in terms of its ability to wisely use a fixed computational budget to reach lowerenergy levels, and performing similarly or better than mEAPR and mEAPR+PC. The latter result is not surprising, as mEAPR, mEAPR+PC, and EvoDiverse use a multiobjective optimization framework, which delays a premature convergence, thus allowing them to reach lower energies within the same computational budget provided to mEA and Rosetta. Interestingly though, the headtohead lRMSD comparisons show that, while mEAPR and mEAPR+PC achieve lower energies than Rosetta, this does not help them achieve the same performance as Rosetta in terms of lowest lRMSDs. In contrast, EvoDiverse effectively retains the best of both. It is able to reach lower energies than Rosetta and comparable or lower lRMSDs than Rosetta, thus constituting a clear advantage over the current stateoftheart multiobjective optimization EAs.
When analyzing the performance of decoy generation algorithms, it is additionally informative to visualize the energy landscape that they probe one decoy at a time. We do so by plotting decoyenergy pairs, representing a decoy with its lowest lRMSD coordinate to the known native structure of each test case. Figures 3 and 4 juxtapose such landscapes for two selected test cases, the protein with known native structure under PDB ID 1ail, and that with known native structure under PDB ID 1dtjA, respectively.
The comparison is limited here to landscapes probed by EvoDiverse, mEAPR, and mEAPR+PC, as prior work comparing mEAPR and mEAPR+PC to Rosetta and mEA shows that these two algorithms achieve better funneling (better correlation between low energies and low lRMSDs to the native structure), and that mEAPR+PC does so the best for 1ail, while mEAPR does so for 1dtjA.
Figure 3 shows that EvoDiverse reveals better funneling of the landscape than mEAPR+PC (higher correlation between low energies and low lRMSDs) and multiple nonnative local minima, visually confirming its high exploration capability. Figure 4 shows that EvoDiverse and mEAPR reveal similar correlation between low energies and low lRMSDs (higher than both Rosetta and mEA) and multiple nonnative local minima.
Figure 5 superimposes the best decoy (lowest lRMSD to the known native structure) over the known native structure for three selected proteins (PDB IDs 1ail, 1dtjA, and 3gwl). Rendering is performed with the CCP4mg molecular graphics software [29]. In the case of 1ail, EvoDiverse obtains the lowest lRMSD to the native structure (1Å). On 1dtjA, EvoDiverse reaches a similar lowest lRMSD (2.6Å) as Rosetta and mEAPR (confirmed in Fig. 2). On 3gwl, EvoDiverse achieves a dramatic improvement of lowest lRMSD to the native structure over all other algorithms; while none of the other algorithms reach below 5Å, EvoDiverse reaches 3.2Å, almost a 2Å improvement.
Comparative analysis on CASP 1213 dataset
Table 4 shows the lowest energy and the average energy of the 10 best decoys obtained by EvoDiverse and Rosetta on each of the 10 target domains denoted by their identifiers in column 1. The lower energy values between the two algorithms on each target domain are marked in bold. Table 4 shows that lower energy values are obtained by EvoDiverse in 7/10 cases compared to Rosetta’s 3/10 cases. When the average of the best 10 decoys is considered instead, EvoDiverse achieves lower energy values in 8/10 cases compared to Rosetta’s 2/10 cases.
The above comparisons are further strengthened via statistical analysis. Table 8(a) shows the pvalues obtained in 1sided statistical significance tests that pitch EvoDiverse against Rosetta, evaluating the null hypothesis that EvoDiverse performs similarly or worse than Rosetta. Both Fisher’s and Barnard’s test are conducted, and pvalues less than 0.05 (which reject the null hypothesis) are marked in bold. Table 8(a) shows that the null hypothesis is rejected when the average of the best 10 decoys is considered, and EvoDiverse performs significantly better than Rosetta with 95% confidence. When the focus is on the lowest energy reached, the performance improvement of EvoDiverse over Rosetta is not statistically significant at 95% confidence level, although pvalues are very close to the 0.05 threshold.
Table 5 shows the lowest lRMSD to the native structure and the average lRMSD of the 10 best decoys obtained by EvoDiverse and Rosetta on each of the 10 target domains denoted by their identifiers in column 1. The lower lRMSD values between the two algorithms on each target domain are marked in bold. Table 4 shows that lower lRMSDs are obtained by EvoDiverse in 6/10 cases compared to Rosetta’s 4/10 cases. When the average of the bestlRMSD 10 decoys is considered, EvoDiverse achieves lower lRMSD in 9/10 cases compared to 2/10 cases of Rosetta. Figure 6 shows the best decoy (lowest lRMSD to the known native structure) obtained on each target domain by EvoDiverse and Rosetta. Rendering is performed with the CCP4mg molecular graphics software [29].
The above comparisons are further strengthened via statistical analysis. Table 8(b) shows the pvalues obtained in 1sided statistical significance tests that pitch EvoDiverse against Rosetta, evaluating the null hypothesis that EvoDiverse performs similarly or worse than Rosetta. Again, both Fisher’s and Barnard’s test are conducted, and pvalues less than 0.05 (which reject the null hypothesis) are marked in bold. Table 8(b) shows that the null hypothesis is rejected when the average of the best 10 decoys is considered and EvoDiverse performs significantly better than Rosetta with 95% confidence. When the focus is on the lowest lRMSD reached, the performance improvement of EvoDiverse over Rosetta is not statistically significant at 95% confidence level.
Table 6 shows the highest TMscore to the native structure and the average TMscore of the 10 best (in terms of TMscores) decoys obtained by EvoDiverse and Rosetta on each of the 10 target domains denoted by their identifiers in column 1. The higher TMscore values between the two algorithms on each target domain are marked in bold. Table 6 shows that higher TMscores are obtained by EvoDiverse and Rosetta on 5/10 cases. When the focus is on the average TMscore of the best (in terms of TMscores) 10 decoys is considered, EvoDiverse achieves higher TMscore in 6/10 cases compared to Rosetta’s 5/10.
Table 8(c) shows the pvalues obtained in 1sided statistical significance tests that pitch EvoDiverse against Rosetta, evaluating the null hypothesis that EvoDiverse performs similarly or worse than Rosetta. Both Fisher’s and Barnard’s test are conducted, and pvalues less than 0.05 (which reject the null hypothesis) are marked in bold. Table 8(c) shows that the null hypothesis is not rejected with 95% confidence and the performance improvement of EvoDiverse over Rosetta is not statistically significant.
Table 7 shows the highest GDT_TS score to the native structure and the average GDT_TS score of the 10 best (in terms of GDT_TS scores) decoys obtained by EvoDiverse and Rosetta on each of the 10 target domains denoted by their identifiers in column 1. The higher GDT_TS scores between the two algorithms on each target domain are marked in bold. Table 7 shows that higher values (on both the highest GDT_TS score and the average GDT_TS score over the 10 best decoys) are achieved by EvoDiverse in 6/10 cases compared to Rosetta’s 5/10.
Table 8(d) shows the pvalues obtained in 1sided statistical significance tests that pitch EvoDiverse against Rosetta, evaluating the null hypothesis that EvoDiverse performs similarly or worse than Rosetta. Both Fisher’s and Barnard’s test are conducted, and pvalues less than 0.05 (which reject the null hypothesis) are marked in bold. Table 8(d) shows that the null hypothesis is not rejected with 95% confidence and the performance improvement of EvoDiverse over Rosetta is not statistically significant.
Conclusion
This paper presents a novel conformation sampling algorithm, EvoDiverse, that operationalizes the multiobjective, stochastic optimization framework. The algorithm does not use total energy as a basis of selection but instead makes use of nondomination rank and crowding distance in its selection operator to encourage conformation diversity.
Yet, the results show that EvoDiverse reaches regions of lower total energy in the energy landscape of the benchmark dataset used here for evaluation, showcasing its higher exploration capability over the Rosetta decoy generation protocol and other, stateoftheart multiobjective EAs that use total energy as an additional optimization objective. In addition, EvoDiverse is able to reach comparable or lower lRMSDs than Rosetta, thus constituting a clear advantage over the current stateoftheart multiobjective EAs.
It is worth noting that EvoDiverse does not make use of an archive of decoys ever sampled, unlike other multiobjective EAs that do so to update the Pareto metrics for use in the selection operator. EvoDiverse uses only the current population and their offspring to perform selection, thus saving storage overhead.
The presented results constitute a promising research direction in improving decoy generation, and future work will consider additional optimization objectives and variants of improvement and selection operators to apportion a fixed computational budget. Of particular interest are directions of research that attenuate dependence on protein energy models and permit as optimization objectives learned rather than physicsbased models of structural integrity and nativeness.
Notes
 1.
The term conformation, though often interchanged with structure, refers to an assignment of values to variables selected to represent a spatial arrangement of the chain of amino acids. These variables can be Cartesian coordinates, angles, or others.
 2.
Work in [9] analyzes Rosetta energy/scoring functions and reports that, while these functions have improved, false minima are found on generated conformation/decoy datasets.
Abbreviations
 aa:

Amino acid
 EA:

Evolutionary algorithm
 lRMSD:

Least rootmeansquareddeviation
 PC:

Pareto count
 PDB:

Protein data bank
 PR:

Pareto rank
 PSP:

Protein structure prediction
References
 1
BlabyHaas CE, de CrécyLagard V. Mining highthroughput experimental data to link gene and function. Trends Biotechnol. 2013; 29(4):174–82.
 2
Berman HM, Henrick K, Nakamura H. Announcing the worldwide Protein Data Bank. Nat Struct Biol. 2003; 10(12):980.
 3
Boehr DD. Wright PE: How do proteins interact?Science. 2008; 320(5882):1429–30.
 4
Maximova T, Moffatt R, Ma B, Nussinov R, Shehu A. Principles and Overview of Sampling Methods for Modeling Macromolecular Structure and Dynamics. PLoS Comp Biol. 2016; 12(4):e1004619.
 5
Kryshtafovych A, Barbato A, Fidelis K, Monastyrskyy B, Schwede T, Tramontano A. Assessment of the assessment: evaluation of the model quality estimates in CASP10. Proteins. 2014; 82(Suppl 2):112–26.
 6
LeaverFay A, Tyka M, Lewis SM, Lange OF, Thompson J, Jacak R, et al. ROSETTA3: an objectoriented software suite for the simulation and design of macromolecules. Methods Enzymol. 2011; 487:545–74.
 7
Xu D, Zhang Y. Ab initio protein structure assembly using continuous structure fragments and optimized knowledgebased force field. Proteins Struct Funct Bioinf. 2012; 80(7):1715–35.
 8
Nussinov R, Wolynes PG. A second molecular biology revolution? The energy landscapes of biomolecular function. Phys Chem Chem Phys. 2014; 16(14):6321–2.
 9
Rubenstein AB, Blacklock K, Nguyen H, Case DA, Khare SD. Systematic Comparison of Amber and Rosetta Energy Functions for Protein Structure Evaluation. J Chem Theory Comput. 2018;:6321–6322. [Preprint].
 10
Shehu A. Probabilistic Search and Optimization for Protein Energy Landscapes In: Aluru S, editor. Handbook of Computational Molecular Biology. Singh A: Chapman & Hall/CRC Computer & Information Science Series: 2013.
 11
De Jong KA. Evolutionary Computation: a Unified Approach. Cambridge: MIT Press; 2006.
 12
Zhang G, Ma L, Wang X, Zhou X. Secondary Structure and Contact Guided Differential Evolution for Protein Structure Prediction. IEEE/ACM Trans Comput Biol Bioinf. 2018;:1–1. ISSN=15455963, https://doi.org/10.1109/TCBB.2018.2873691.
 13
Zhang GJ, Zhou GX, Yu XF, Hao H, Yu L. Enhancing protein conformational space sampling using distance profileguided differential evolution. IEEE/ACM Trans Comput Biol and Bioinf. 2017; 14(6):1288–301.
 14
Shehu A. A Review of Evolutionary Algorithms for Computing Functional Conformations of Protein Molecules In: Zhang W, editor. ComputerAided Drug Discovery, Methods in Pharmacology and Toxicology. Springer Verlag: 2015.
 15
Olson B, De Jong KA, Shehu A. OffLattice Protein Structure Prediction with Homologous Crossover. In: Conf on Genetic and Evolutionary Computation (GECCO). New York: ACM: 2013. p. 287–94.
 16
Olson B, Shehu A. MultiObjective Stochastic Search for Sampling Local Minima in the Protein Energy Surface. In: ACM Conf on Bioinf and Comp Biol (BCB). Washington: 2013. p. 430–9.
 17
Olson B, Shehu A. MultiObjective Optimization Techniques for Conformational Sampling in TemplateFree Protein Structure Prediction. In: Intl Conf on Bioinf and Comp Biol (BICoB). Las Vegas: 2014. p. 143–8.
 18
Shmygelska A, Levitt M. Generalized ensemble methods for de novo structure prediction. Proc Natl Acad Sci USA. 2009; 106(5):94305–5126.
 19
Deb K, Agrawal S, Pratap A, Meyarivan T. A fast and elitist multiobjective genetic algorithm: NSGAII. IEEE Trans Evol Comput. 2002; 6(2):182–97.
 20
Meiler J, Baker D. Coupled prediction of protein secondary and tertiary structure. Proc Natl Acad Sci USA. 2003; 100(21):12105–10.
 21
DeBartolo J, Hocky G, Wilde M, Xu J, Freed KF, Sosnick TR. Protein structure prediction enhanced with evolutionary diversity: SPEED. Protein Sci. 2010; 19(3):520–34.
 22
Molloy K, Saleh S, Shehu A. Probabilistic Search and Energy Guidance for Biased Decoy Sampling in Abinitio Protein Structure Prediction. IEEE/ACM Trans Comput Biol and Bioinf. 2013; 10(5):1162–75.
 23
McLachlan AD. A mathematical procedure for superimposing atomic coordinates of proteins. Acta Crystallogr A. 1972; 26(6):656–7.
 24
Zhang Y, Skolnick J. Scoring function for automated assessment of protein structure template quality. Proteins. 2004; 57:702–10.
 25
Zemla A, Venclovas C, Moult J, Fidelis K. Processing and analysis of CASP3 protein structure predictions. Proteins. 1999; 37:22–29.
 26
Zemla A, Venclovas C, Moult J, Fidelis K. Processing and evaluation of predictions in CASP4. Proteins. 2001; 45:13–21.
 27
Fisher RA. On the interpretation of χ ^{2} from contingency tables, and the calculation of P. J Roy Stat Soc. 1922; 85:87–94.
 28
Barnard GA. A new test of 2x2 tables. Nature. 1945; 156:177.
 29
McNicholas S, Potterton E, Wilson KS, Noble MEM. Presenting your structures: the CCP4mg moleculargraphics software. Acta Cryst. 2011; D76:386–94.
Acknowledgements
Computations were run on ARGO, a research computing cluster provided by the Office of Research Computing at George Mason University, VA (URL: http://orc.gmu.edu).
Funding
This work is supported in part by NSF IIS Grant No. 1763233. The funding body did not play any role in the design of the study and collection, analysis, and interpretation of data and in writing of the manuscript.
Availability of data and materials
All software and data are available upon demand.
Author information
Affiliations
Contributions
ABZ and AS suggested the methods and the performance study in this manuscript and drafted the manuscript. ABZ implemented the methods and carried out the performance study. AS guided the study, provided comments and suggestions on the methods and performance evaluation, and improved the manuscript writing. All authors have read and approved the final manuscript.
Corresponding author
Correspondence to Amarda Shehu.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Received
Accepted
Published
DOI
Keywords
 Protein energy landscape
 Structural dynamics
 Stochastic optimization