The road not taken: retreat and diverge in local search for simplified protein structure prediction
© Shatabda et al.; licensee BioMed Central Ltd. 2013
Published: 21 January 2013
Given a protein's amino acid sequence, the protein structure prediction problem is to find a three-dimensional structure that has the native energy level. For many decades, it has been one of the most challenging problems in computational biology. A simplified version of the problem is to find an on-lattice self-avoiding walk that minimizes the interaction energy among the amino acids. Local search methods have been widely used for the protein structure prediction problem because of their efficiency in finding very good solutions quickly. However, they suffer mainly from two problems: re-visitation and stagnation.
In this paper, we present an efficient local search algorithm that deals with these two problems. During search, we select the best candidate at each iteration, but store the unexplored second best candidates in a set of elite conformations, and explore them whenever the search faces stagnation. Moreover, we propose a new non-isomorphic encoding for the protein conformations, used to store conformations and to check similarity in a memory-based search. This new encoding helps eliminate conformations that are equivalent under rotation and translation, and thus better prevents re-visitation.
On standard benchmark proteins, our algorithm significantly outperforms the state-of-the-art approaches for the Hydrophobic-Polar energy model on the Face Centered Cubic (FCC) lattice.
Proteins are among the most important molecules present in the living cell. Given a protein's amino acid sequence, the protein structure prediction (PSP) problem is to find a three-dimensional native structure that has the lowest free energy. In order to function properly, a protein has to fold into its native structure. Mis-folded proteins cause many critical diseases such as Alzheimer's disease, Cystic fibrosis, and Mad Cow disease. Knowledge of this native structure is of paramount importance and can have an enormous impact on the field of drug discovery. Not much is known about the folding process, and the nature of the energy function is also very complex. For many decades, PSP has been considered one of the hardest problems in biology. In vitro laboratory methods such as X-ray crystallography and Nuclear Magnetic Resonance (NMR) spectroscopy are very slow and expensive. For these reasons, the problem has attracted many researchers from other fields, who apply their own techniques [1, 2].
Computational methods applied to PSP fall into three broad categories: ab initio, homology modeling, and protein threading. The latter two methods depend on the templates (or structures) of known proteins and are useful only when matching templates are found. Research in ab initio PSP has been motivated by Anfinsen's famous dogma. In 1973, Nobel Prize Laureate Christian B. Anfinsen suggested that the native structure of a globular protein is determined only by its primary amino acid sequence. The ab initio PSP can be viewed as a search problem, where one has to find a stable, unique, and kinetically accessible native structure from the space of all possible structures (also called conformations). The search space for this problem, even in the simplified models, contains an astronomically large number of conformations. Therefore, systematic search techniques are almost impractical, since they perform exhaustive search and require a huge amount of computational resources. In contrast, local search methods are normally very quick in finding good solutions, although they suffer from re-visitation and stagnation, and require good heuristics.
Performance of the computational methods also degrades when applied to high-resolution models that deal with real structures of proteins. This is due to three reasons: i) the contributing factors of different forces to the energy functions are unknown, ii) protein models with atomic-level details require huge computational effort, and iii) the space of possible conformations is very large and complex. For these reasons, the general paradigm of de novo PSP is to begin with the sampling of a large set of candidate (decoy) structures guided by a scoring function. In the final stage, refinements are applied to obtain the real structure. The simplified models, though they lack many details, provide a realistic backbone for the proteins and can be refined to obtain real structures.
Local search algorithms, when applied to large proteins (sequence length around 200 monomers), suffer from severe re-visitation and stagnation. To handle these issues, a number of techniques have been applied in the PSP literature [5–7], including tabu lists, adaptive measures, and various restart mechanisms. Similar approaches have also been used in other domains such as propositional satisfiability and the quadratic assignment problem. Many of the algorithms apply random restarts or restart from the best local minimum [6, 7]; these do not solve the problem in general.
In this paper, we present a new algorithm for the simplified protein structure prediction problem. During the search, our method selects the best candidate in each iteration, but memorizes the second best conformations that are generated but not selected or explored (called elite conformations). Whenever the search faces stagnation, we select the best conformation from this elite set and continue the search from there. This retreat helps the search diverge. Similar techniques have been used in systematic search techniques like A* search, but those require a huge amount of memory to store the unexplored frontier. We maintain only a small set of previously generated conformations by discarding conformations with similar fitness. This reduces the memory requirement and provides a mechanism to go back to earlier conformations that have lower fitness values but the potential to lead towards better search regions. We also propose a new non-isomorphic encoding that removes non-unique, or isomorphic, conformations from the search space and makes similarity matching of conformations efficient. These isomorphic conformations are essentially the same and differ only because of translational and rotational symmetry. We applied this encoding in our algorithm along with the long term memory of local minima proposed in . Experimental results show that our algorithm significantly outperforms the state-of-the-art algorithms on standard benchmark proteins using the Hydrophobic-Polar (HP) energy model and the Face Centered Cubic (FCC) lattice.
Lau and Dill  proposed a simplified HP energy model for the protein structure prediction problem, which has been proved to be a hard combinatorial problem . Due to this complexity, several techniques and their hybridizations have been applied to solve it. The similarity with the thermodynamic nature of protein folding led researchers to apply simulated annealing [12, 13]. Genetic algorithms were first applied to the problem by Unger and Moult . The basic genetic algorithm was subsequently improved by many researchers [15–17].
Yue and Dill  applied constraint-based approaches for the first time and developed the Constraint Based Hydrophobic Core Construction (CHCC) algorithm. Their method had several pitfalls: CHCC could only support the HP model and failed to report degeneracy or non-unique structures for several protein sequences. The research group of Rolf Backofen developed a Constrained-based Protein Structure Prediction (CPSP) tool , which provided solutions to these problems. However, the CPSP tool depends on pre-calculated cores and does not converge for larger protein sequences. Palu et al.  developed the COLA solver using highly optimized constraints and propagators to obtain satisfactory results on small and medium-sized instances (length < 80). Lesh et al.  provided a novel set of transformations called pull moves, extendible to any lattice. Both Lesh et al.  and Blazewicz et al.  implemented tabu search meta-heuristics independently of each other.
Hybrid techniques that combine the power of different strategies provided better results. Using the pull moves, Klau et al.  proposed an interactive optimization framework called Human Guided Simple Search (HuGS). Using the same pull move set, Ullah et al.  proposed a two-stage optimization approach. Furthermore, Ullah et al.  combined local search and constraint programming approaches. They introduced a protein folding simulation procedure on FCC lattice and employed the COLA solver  to generate neighborhood states for a simulated annealing based local search. They used MJ matrices with 20 × 20 amino acid pairwise interactions. They tested their approaches on some real proteins (length < 80) from the Protein Data Bank (PDB). Jiang et al.  combined tabu search strategy (GTS) with genetic algorithms in the two-dimensional HP Model.
Cebrian et al.  were the first to use tabu search to find 3D structures of the Harvard instances  on FCC lattices. In subsequent work, Dotu et al. [6, 7] applied Large Neighborhood Search (LNS) to further optimize the results found in . They also improved the tabu search by adopting a new neighborhood selection technique . Both of their methods are implemented in COMET. Shatabda et al.  proposed a memory-based approach on top of the algorithm proposed by Dotu et al.  and improved the results on the FCC lattice and HP energy model. Other methods, such as Simulated Annealing , Ant Colony Optimization (ACO) , and Extremal Optimization , are also found in the literature.
Proteins are polymers of amino acid monomers. In a simplified model, all monomers have an equal size and all bonds are of an equal length. Each amino acid monomer is represented by a single point, and its position is restricted to a three-dimensional lattice. A simplified energy function is used to calculate the energy of a conformation. The given amino acid sequence is mapped onto a fixed lattice such that every two consecutive monomers in the sequence are also neighbors on the lattice (called the chain constraint) and no two monomers occupy the same lattice point (called the self-avoiding constraint).
The Face Centered Cubic (FCC) lattice is preferred over other lattices since it has the highest packing density  for spheres of equal size, and provides the highest degree of freedom for placing an amino acid monomer. Thus, it provides a realistic discrete mapping for proteins. The FCC lattice is generated by the following 12 basis vectors: (1, 1, 0), (-1, -1, 0), (-1, 1, 0), (1, -1, 0), (0, 1, 1), (0, -1, -1), (0, 1, -1), (0, -1, 1), (1, 0, 1), (-1, 0, -1), (-1, 0, 1), (1, 0, -1). Two lattice points p, q are said to be in contact, or neighbors of each other, if q = p + v for some vector v in the basis of the lattice.
The energy of a conformation is defined as the sum of the pairwise contact potentials:

E = Σ_{i < j-1} c_ij · e_ij   (1)

where c_ij = 1 only if two monomers i and j are neighbors (or in contact) on the lattice and 0 otherwise. The other term, e_ij, is calculated depending on the types of the amino acids: e_ij = -1 if s_i = s_j = H and 0 otherwise. Minimizing the summation in (1) is equivalent to maximizing the number of non-consecutive H-H contacts. Several other variants of the HP model  exist in the literature.
Using the HP energy model together with the FCC lattice, the simplified PSP problem is defined as follows: given a sequence s of length n, find a self-avoiding walk p_1 ⋯ p_n on the lattice such that the energy defined by (1) is minimized.
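As an illustration of the model, a minimal sketch (not the authors' implementation; all names are ours) that checks FCC contacts and evaluates the HP energy of a conformation might look like this:

```python
# Sketch: the 12 FCC basis vectors; two lattice points are in contact
# iff their difference is one of these vectors.
FCC_BASIS = [
    (1, 1, 0), (-1, -1, 0), (-1, 1, 0), (1, -1, 0),
    (0, 1, 1), (0, -1, -1), (0, 1, -1), (0, -1, 1),
    (1, 0, 1), (-1, 0, -1), (-1, 0, 1), (1, 0, -1),
]

def hp_energy(sequence, coords):
    """HP energy on the FCC lattice: each non-consecutive H-H contact
    contributes -1.  sequence is a string of 'H'/'P'; coords is a list
    of (x, y, z) lattice points, one per monomer."""
    energy = 0
    for i in range(len(sequence)):
        for j in range(i + 2, len(sequence)):  # skip consecutive monomers
            if sequence[i] == 'H' and sequence[j] == 'H':
                d = tuple(coords[j][k] - coords[i][k] for k in range(3))
                if d in FCC_BASIS:  # lattice contact
                    energy -= 1
    return energy
```

For example, an H-P-H chain whose two H monomers end up at contact distance has energy -1.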
Local Search Framework.

Procedure localSearch()
  while iteration ≤ maxIteration do
    m ← selectMove() and apply it to the current conformation
    if local minima is detected then
      store the local minimum in memory
    if nonImprovingSteps ≥ maxStable then
      retreat to a conformation selected from the elite set

Procedure selectMove()
  while moveList.notEmpty() do
    m ← getNextCandidate()
    c ← getConformation(m)
    e ← getNonIsoEncoding(c)
    b ← getPacked(e)
    if match(b, proximity) then
      discard m and continue with the next candidate
    else
      return m
  if moveList.empty() then
    no moves possible
    nonImprovingSteps ← maxStable + 1
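The framework above can be sketched in a generic, self-contained form as follows (a hypothetical illustration, not the authors' code: the neighborhood, fitness, and elite-set handling are abstracted into plain Python callables and a list):

```python
def local_search(initial, neighbors, fitness, elite, max_iter=1000, max_stable=50):
    """Sketch of the retreat-and-diverge loop: greedily take the best
    neighbor, memorize the second best in the elite set, and on
    stagnation retreat to the best stored elite conformation."""
    current = best = initial
    stable = 0  # consecutive non-improving steps
    for _ in range(max_iter):
        cands = sorted(neighbors(current), key=fitness)
        if not cands:
            break  # no moves possible
        if len(cands) > 1:
            elite.append(cands[1])  # memorize the second best candidate
        current = cands[0]          # greedy: take the best candidate
        if fitness(current) < fitness(best):
            best, stable = current, 0
        else:
            stable += 1
        if stable >= max_stable and elite:  # stagnation: retreat
            current = min(elite, key=fitness)
            elite.remove(current)
            stable = 0
    return best
```

On a toy 1-D landscape (neighbors x-1 and x+1, fitness |x - 10|), the loop walks from 0 to the minimum at 10.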
The fitness function that drives the search is defined as:

F(c) = Σ_{1 ≤ i < j ≤ n, s_i = s_j = H} dv(i, j)   (2)

where dv(i, j) = d(i, j) - 2 and d(i, j) = (x_i - x_j)² + (y_i - y_j)² + (z_i - z_j)², i.e. the square of the Euclidean distance between the i-th and j-th amino acids in the current conformation c of a sequence s of length n. On the FCC lattice, d(i, j) = 2 for two points in contact, so dv(i, j) is zero exactly when an H-H pair forms a contact. The energy level of the structure is still determined by the HP energy value; the fitness function is used only to drive the search. The search algorithm periodically switches the type of amino acid being moved and selects the best move on an amino acid that is not in the tabu list. For moves of P-type amino acids, it selects a random move, since moving a P-type amino acid does not affect the fitness function. The search restarts from the previously found best solution whenever the fitness function does not improve for maxStable steps. The memory-based search in  extends this local search framework. It stores a proportion of the local minima encountered, and whenever a move is selected, it generates the resulting conformation and checks its similarity with the stored local minima. If the generated conformation is within a given proximity of a stored local minimum, the conformation is discarded. Hamming distance is used as the similarity measure, and a relative encoding is used to represent the conformations.
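A sketch of this distance-driven fitness, assuming the sum ranges over all hydrophobic pairs (the function name and details are ours):

```python
def fitness(sequence, coords):
    """Sum of dv(i, j) = d(i, j) - 2 over H-H pairs, where d(i, j) is
    the squared Euclidean distance between monomers i and j.  dv is 0
    when the pair sits at FCC contact distance (squared distance 2),
    so lower fitness pulls hydrophobic monomers into a compact core."""
    total = 0
    n = len(sequence)
    for i in range(n):
        for j in range(i + 1, n):
            if sequence[i] == 'H' and sequence[j] == 'H':
                d = sum((coords[i][k] - coords[j][k]) ** 2 for k in range(3))
                total += d - 2
    return total
```

Note that the fitness, unlike the energy in (1), changes smoothly as H monomers approach each other, which gives the search a gradient to follow even before any contact forms.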
Our algorithm is developed on top of the memory-based search. The pseudo-code for our algorithm is depicted in Table 1. Our algorithm differs from the memory-based approach in Line 14 of Procedure localSearch(), where we select a conformation from the elite set at stagnation, and in Line 9 of Procedure selectMove(), where we store the prominent but unselected candidate conformations in the elite set. It also differs in the encoding used to represent the conformations: we apply the new encoding at Line 4 of Procedure selectMove(), before matching against the stored local minima, and at Line 10 of Procedure localSearch(), while storing a local minimum. The rest of this section describes the procedures of our algorithm in detail.
In each iteration of a local search, a number of conformations are generated. However, only a few of them are explored in the next iterations. In the case of a single candidate search, only a single conformation, typically the best conformation according to the heuristic, is selected for the next iteration. In successive iterations, the search continues by generating the neighbors of the selected conformations. The other potential conformations with good fitness values are never used, since the search is greedy in nature. We call them elite conformations. These conformations, if ever explored, may lead to better search regions. Note that systematic search techniques store and explore such conformations, but they require a huge amount of memory. Moreover, selection in a systematic search like A* depends on a heuristic function that requires the goal to be known beforehand. In our case, the optimal structure is totally unknown, and we cannot afford to store a huge number of conformations. In our algorithm, we store only the second best conformations and explore them whenever the search faces stagnation.
Pseudo-code for Elite Set Methods.
Procedure selectFromEliteSet()
  while eliteSet.notEmpty() do
    c ← eliteSet.getTopElement()
    eliteSet.release()
    e ← getNonIsoEncoding(c)
    b ← getPacked(e)
    if match(b, proximity) == false then
      return c

Procedure storeEliteConformations()
  sb ← set of second best candidates
  while sb.notEmpty() do
    m ← sb.getNextCandidate()
    c ← getConformation(m)
    e ← getNonIsoEncoding(c)
    b ← getPacked(e)
    if match(b, proximity) == false then
      eliteSet.add(c)
We select the top element from the priority queue whenever the search stagnates, and the search then continues from the selected elite conformation. Guided by the fitness function defined in (2), the search quickly forms a compact hydrophobic core at the center of the conformation, and the greedy search oscillates within the same region of the search space before it can improve the fitness enough to break the core or to form an alternate core. The detailed nature of the search is discussed in . This oscillating behavior indicates that if we select a conformation from one region of the search space, we can ignore the other conformations that have the same or nearly the same fitness value and lie within the same temporal locality. Therefore, every time an elite conformation is selected from the list, we also discard a fixed proportion of the top elements of the list. This eliminates conformations that are similar in fitness value and structure and are also temporally proximate. This retreat effectively helps the search diverge, and it also reduces the memory requirement of the priority queue. The detailed pseudo-code of the method is given in the left side of Table 2. The method eliteSet.release() at Line 6 releases the top elements from the elite set.
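The elite set can be sketched with a binary heap; the discard fraction and the tie-breaking by insertion order below are our assumptions, not values from the paper:

```python
import heapq

class EliteSet:
    """Min-heap keyed on fitness (lower is better).  On stagnation,
    pop the best elite conformation and discard a fixed fraction of
    the remaining top entries, pruning conformations that are close
    in fitness and temporally proximate to the one just selected."""
    def __init__(self, discard_fraction=0.1):
        self.heap = []
        self.discard_fraction = discard_fraction
        self.counter = 0  # insertion order breaks fitness ties

    def add(self, fitness, conformation):
        heapq.heappush(self.heap, (fitness, self.counter, conformation))
        self.counter += 1

    def retreat(self):
        """Return the best stored conformation, releasing a fraction
        of the remaining top elements; None if the set is empty."""
        if not self.heap:
            return None
        _, _, conformation = heapq.heappop(self.heap)
        for _ in range(int(len(self.heap) * self.discard_fraction)):
            heapq.heappop(self.heap)  # release similar top elements
        return conformation
```

With a discard fraction of 0.5, selecting the best of three stored conformations also drops the runner-up, leaving only the structurally distant third.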
Pseudo-code for Non-Isomorphic Encoding.
for i ← 1 to N do
  absdir ← c.getAbsDir(i)
  if absdir is a new direction then
    Map[absdir] ← dirCount
    dirCount ← dirCount + 1
  encoded[i] ← Map[absdir]
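The encoding loop above can be written concretely as follows (a sketch assuming the absolute move directions are given as a list; names are ours):

```python
def non_iso_encoding(abs_dirs):
    """Relabel absolute move directions in order of first appearance.
    A lattice rotation permutes the direction labels consistently, so
    two conformations that differ only by rotation yield the same
    canonical label sequence and can be matched directly."""
    mapping = {}   # Map[absdir] -> canonical label
    encoded = []
    for d in abs_dirs:
        if d not in mapping:
            mapping[d] = len(mapping)  # next unused label (dirCount)
        encoded.append(mapping[d])
    return encoded
```

For instance, the direction strings U-R-R-D-U and R-D-D-L-R are permutations of each other's labels and both encode to [0, 1, 1, 2, 0], so they are recognized as the same conformation.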
We implemented our algorithm in C++ and ran experiments on the NICTA (http://www.nicta.com.au) cluster machine. The cluster has a number of machines, each equipped with two 6-core CPUs (AMD Opteron @2.8 GHz, 3 MB L2/6 M L3 Cache) and 64 GB of memory, running Rocks OS (a Linux variant for clusters). We compared the performance of our algorithm to that of the tabu search by Dotu et al.  and the memory-based approach proposed in . Each algorithm was run 50 times for each of the protein sequences, and each run was given 5 hours to finish. We could not compare our results with the Large Neighborhood Search (LNS)  since the COMET program exited with a 'too much memory needed' error for the large benchmark proteins that we selected. We do not show results for the small Harvard instances (length = 48) or other smaller protein sequences, since all algorithms reach near-optimal conformations and the differences in the energy levels achieved for these proteins are relatively small.
We also used a second set of benchmark proteins derived from the famous Critical Assessment of Techniques for Protein Structure Prediction (CASP) competition (http://predictioncenter.org/casp9/targetlist.cgi). These proteins are of length 230 ± 50. Six protein sequences were randomly chosen from the target list and then converted into HP sequences. Results for these six proteins are also given in Table 4 (lower part), along with the PDB id of each protein. The parameter settings for these six proteins were kept the same. The LNS column contains no data for these six proteins since they were not used in .
From the average energy levels shown in bold face in Table 4, it is clearly evident that, for all twelve proteins, our algorithm significantly outperforms both of the other algorithms. We performed a statistical t-test for independent samples at the 95% significance level to verify the significance of the difference in performance. We also report new lowest energy levels (with respect to incomplete search methods) for all twelve proteins; these energy levels are shown in italics in Table 4.
To quantify the difference, we measure the relative improvement with respect to the optimal lower bound:

RI = (E_r - E_o) / (E_r - E_l) × 100%

where E_o is the average energy level achieved by our approach, E_r is the average energy level achieved by the other approach, and E_l is the optimal lower bound of the energy level. The missing values indicate the absence of any known lower bound for the corresponding protein sequence. Similar measurements were also used in . From the values reported in Table 4, we clearly see that our algorithm produces conformations that are significantly better in terms of the average energy level achieved.
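Under the definitions above, the relative improvement can be computed as in this small sketch (the exact formula is our reconstruction from the variable definitions; treat it as an assumption):

```python
def relative_improvement(e_o, e_r, e_l):
    """Percentage of the remaining gap between the reference average
    energy e_r and the optimal lower bound e_l that our average
    energy e_o closes.  (Formula reconstructed; an assumption, not a
    quote from the paper.)"""
    return 100.0 * (e_r - e_o) / (e_r - e_l)
```

For example, with a lower bound of -160 and a reference average of -140, reaching an average of -150 closes half of the remaining gap, i.e. a relative improvement of 50%.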
Effect of Non-Isomorphic Encoding.
In this paper, we presented a local search algorithm for solving the protein structure prediction problem on the FCC lattice using the low-resolution HP energy model. Experimental results show that our algorithm outperforms the state-of-the-art algorithms. We used a novel encoding scheme to represent the conformations, along with a set of elite conformations, to handle the stagnation of the local search. We believe that the use of domain-specific heuristics when selecting conformations from the elite set can further improve the performance of the algorithm. In the future, we wish to explore this, and to apply our techniques to higher-resolution models and other energy models to observe their effect. We also wish to apply our techniques to other domains such as propositional satisfiability and vehicle routing. We believe the proposed encoding scheme will add efficiency to search techniques such as genetic algorithms.
The publication costs for this article were funded by the corresponding author's institution.
This article has been published as part of BMC Bioinformatics Volume 14 Supplement 2, 2013: Selected articles from the Eleventh Asia Pacific Bioinformatics Conference (APBC 2013): Bioinformatics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/14/S2.
We gratefully acknowledge the support of the Griffith University eResearch Services Team and the use of the High Performance Computing Cluster "Gowonda" to complete this research. We also thank NICTA, which is funded by the Australian Government as represented by the Department of Broadband, Communications and the Digital Economy and the Australian Research Council through the ICT Centre of Excellence program.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.