The protein folding problem
The primary structure of a protein is comprised of a sequence of n amino acids, i.e., s = {s1, ..., s
n
} where each s
i
∈
.
is the alphabet of amino acids and |
| = 20. Given a lattice model ℒ, the 3D conformation of protein is the assignment of amino acids in the lattice points such that adjacent positions of s remain adjacent in the lattice and no two amino acids overlap. The protein tends to reach a 3D conformation with minimum free energy, which is called its tertiary structure. One idea of determining the free energy associated with a particular conformation is to assign energy to pairs of amino acids on the lattice within a predefined distance (formalized as contacts) and then summing all up. The objective of the protein folding problem is to determine the 3D conformation from primary structure such that the free energy is minimized.
Lattice and energy model
The domain of FCC lattice consists of points (x, y, z) ∈ ℤ such that x + y + z is even. Two FCC points P
i
(x
i
, y
i
, z
i
) and P
j
(x
j
, y
j
, z
j
) are adjacent if and only if |x
i
- x
j
| ≤ 1, |y
i
- y
j
| ≤ 1, |z
i
- z
j
| ≤ 1 and |x
i
- x
j
| + |y
i
- y
j
| + |z
i
- z
j
| = 2. Each FCC lattice point is adjacent to 12 neighbouring points and three consecutive adjacent points forms one of these four angles 60°, 90°, 120°, 180°. Observe that, for lattice units, it holds that |x
i
- x
j
| + |y
i
- y
j
| + |z
i
- z
j
| = 2. Two amino acids s
i
and s
j
in lattice positions P
i
and P
j
, respectively, are said to be in contact, i.e., contact(P
i
, P
j
), if and only if they are not adjacent in the primary sequence, i.e., |i - j| > 1 and |x
i
- x
j
| + |y
i
- y
j
| + |z
i
- z
j
| = 2.
In [14], the authors developed table of empirical contact potential that points out the energy contributions associated to pairs of amino acids s
i
and s
j
in contact, described by the commutative function energy(s
i
, s
j
). These contributions are developed from statistical methods applied to structures obtained from X-rays and NMR experiments.
Mathematical formalization
Given the amino acid sequence s, a conformation ϕ of s in FCC lattice is an injective function ϕ: {1... n} → ℒ such that ϕ (s
i
) and φ(si+1) are adjacent for i = 1... n - 1 and ϕ(s
i
) ≠ ϕ(s
j
) for i ≠ j to avoid overlapping.
The protein folding problem can then be formalized as an optimization problem for finding the conformation ϕ of s such that the following energy function is minimized:
CP framework
Modeling the protein folding problem on FCC lattice leads to the following structural constraints.
-
adjacent property: Adjacent amino acids in the primary sequence are mapped to adjacent lattice points. For each i ∈ {1... n - 1}, |x
i
- xi+1| + |y
i
- yi+1| + |z
i
- zi+1| = 2
-
non-overlap property: Two non-adjacent amino acids in the primary sequence must not occupy the same lattice point and must be separated by at least one lattice unit. For each i, j ∈ {1... n} and |i - j| > 1, |x
i
- x
j
| + |y
i
-y
j
| + |z
i
- z
j
| ≥ 2
The protein folding problem is effectively a Constraint Satisfaction Problem (CSP) defined by the constraints above with an objective energy function (Equation 1) to be minimized. Authors in [16, 18] added some extra constraints to disallow angles of 60° and 180° for three consecutive amino acids. Also their definition of contact distance is somewhat different from us. This study is, however, intended to measure the effectiveness of hybrid approach over local search and pure constraint programming approaches. Protein folding simulation results have been reported in the literature with local search using efficient neighbourhood function that allows all the possible angles for three consecutive amino acids in FCC lattice [6, 15]. These definition of angles and contact distance are used in our work in order to make valid comparisons.
COLA solver
COLA is an ad-hoc constraint solver on discrete 3-dimentional crystal lattices developed by Palu et al. [18]. The solver allows user to define lattice variables with associated domains, constraints over them and to search the space of admissible solutions.
Given a sequence s and lattice ℒ, the lattice variable V
i
represents the lattice point (x
i
, y
i
, z
i
) of amino acid s
i
. A domain D is described by a pair of points (D,
), where D= (D
x
, D
y
, D
z
) ∈ ℤ and
. D implicitly defines a box:
Each variable V is associated to a domain DVdescribed by a pair of points (DV,
). V is admissible if Box(DV) ≠ Φ. V is ground if it is admissible and DV=
.
A set of primitive binary constraints is defined in the form of C(V
i
, V
j
, d) over two variables V
i
and V
j
, based on spatial distances d ∈ ℕ. A CSP on the variables V1, ..., V
n
with domains
is a set of constraints of above-mentioned forms. A solution of the CSP is an assignment of lattice points to the variables V1, ..., V
n
within corresponding domains and satisfying all the binary constraints.
The solver is modeled in a way that separates constrain phase from search phase, thereby not allowing addition of new variables or constraints during search phase. The solver uses a combination of consistency techniques and systematic search to guide the solving process. Standard backtracking+propagation search procedure [22] is implemented to explore the search tree efficiently. The detailed description on COLA solver can be found in [18].
LS framework
Simulated annealing (SA) was introduced as an optimization tool independently in [23, 24]; see also [25]. The underlying algorithm acts within a solution space in accordance with a specific neighbourhood structure, where the transition steps are controlled by the objective function. The solution space for the protein folding problem consists of all the possible self-avoiding walks (SAW) on the FCC lattice. The objective function to be minimized here is given by the Equation 1. A logarithmic cooling schedule is employed which was shown in [26] to converge to optimal solutions. In the context of local search methods, it is important to employ an efficient neighbourhood function that determines the overall performance in terms of run-time and final energy value. We use CP technique described in CP framework Section to generate the neighbourhood for LS.
CP as neighbourhood generator for LS
A CSP can be built for a given protein sequence that enumerates all conformations satisfying structural constraints of protein in a given lattice. The procedure is basically an exploration of exponentially large search tree. With the most efficient propagation+backtracking technique, such an enumeration requires huge amount of time even for the medium sized instance. The size of the solution space for the protein folding problem is the number of self-avoiding walks on the FCC lattice that can be approximated by the formula (see [27]).
Without the presence of additional constraints (eg. secondary structure information, not allowing certain angles in the folding etc. [16, 18]), exploring this exponential search tree for the best energy value is infeasible.
Instead of exploring the complete search tree, we propose to explore a small part of it at a time and use this information for the next step. In each iteration of a standard local search procedure, the CSP solver is asked to enumerate all the possible neighbours of the current conformations by keeping certain parts (variables) of it fixed to its current lattice positions and allowing a small part to change their positions. The best neighbouring conformation found this way will be used for the next iteration.
We denote small parts of the protein sequence as subchains for the rest of the discussion. The Logarithmic simulated annealing (LSA) procedure starts with an initial conformation, i.e., all lattice variables [V1, ..., V
n
] are ground. Then in each iteration of LSA, a CSP solver generates random neighbourhood in the following way:
Step 1. Randomly select a lattice variable V
i
.
Step 2. Randomly select the length of the subchain l ∈ {7, 9, 11, 13} to be perturbed.
Step 3. Keep the two parts of the conformation unchanged, namely lattice variable sets [V1, ... V
i
] and Vi+l+1... V
n
. These variables remain fixed (ground) to their current values.
Step 4. For each variable V
j
with lattice point (x
j
, y
j
, z
j
) in the subchain [Vi+1, ... vi+l], we redefine the domains
as follows:
and
.
Here 1 ≤ b ≤ 3 is the domain dilation parameter which controls the size of the domain.
Step 5. Let CSP solver enumerate all the conformations based on current variable and domain definitions. Select the conformation with minimum energy.
In short, given a protein sequence and its current conformation having certain parts fixed, we ask CSP to find all the possible conformations by altering the remaining part of the conformation and also not changing the current conformation too much. The convergence of the solution towards lower energy state depends on the efficiency of the local search procedure (i.e., cooling schedule) employed and careful choice of subchain length (l) and domain dilation (b) parameters.
The choice of initial conformation is particularly important for our approach. It is not only recommended, but also absolutely necessary to start with a somewhat random "compact" structure irrespective of the energy value. At the initial stage of the search, the compact structure allows CSP solver to generate "good" neighbourhood conformations that help local search algorithm to converge quickly to conformations with lower energy. Otherwise, if we start with a "flat" initial conformation like a straight sheet-like structure, it take ages for local search to get to even moderately compact structure. Since the purpose of CSP solver is to rearrange a small subchain within a bounded box (defined by the domains of the corresponding variables), the shape of the box directly effects the possible number of rearrangements, i.e., the number of self avoiding walks within the box with two fixed end-points. It is observed that, for a curve-like subchain structure, the number of SAW's are higher, whereas for a sheet-like subchain structure, the number of SAW's are lower. Therefore, in our implementation of local search, the initialization procedure generates a compact initial conformation by iterating the following steps as long as amino acids are available to be placed.
Step 1. Apply consecutive forward-left and forward-right move for 5 times.
Step 2. Go up by applying consecutive left-up and right-up move once.
Step 3. Apply consecutive backward-right and backward-left move for 5 times.
Step 4. Go up by applying consecutive left-up and right-up move once again.
Figure 1(a) illustrates such an initial conformation for a protein having length of 54. Figure 1(b-c) shows the conformations obtained from two consecutive hybrid local search iterations using this initial conformation. Figure 1(d) depicts the final conformation.
Modification to COLA
COLA is a public-domain solver available for download in [28]. It contains the implementation of protein folding problem on FCC lattice with additional constraints and different notions of contact distance as mentioned earlier. For the sake of our study, we remove those additional constraints and modify the contact definition as well. The major part of modification, though, has to be done in the search procedure. In the original implementation, the search procedure starts with the leftmost variable with all the variables non-ground and not labeled. The search continues by assigning values to variables (labeling) one after another according to one of the two variable selection strategies (leftmost or first-fail) employing standard combination of consistency checking and propagation techniques. Once all the variables are labeled after n level of branching, the search backtracks to the last modified variables and assign new values to it (if available). This way the whole search space is explored systematically. In our case, most of the variables (n -l) are set ground at the beginning of search and the l non-ground variables are assigned new domains. Therefore we modify the search procedure to make it start from the leftmost non-ground variable and continue the process till the rightmost non-ground variable using leftmost variable selection strategy that explores only l level of branching. While bounded block fails (BBF) heuristic for solution searching over the whole solution space is more efficient [18], a complete enumeration is found to be more appropriate for our purpose, since we have to deal with only a small block of variables during each search phase. For the same reason, the choice of variable selection strategy is immaterial too. Therefore, we implement leftmost variable selection strategy with complete enumeration of search tree for our approach.
This modified search procedure is integrated into LSA to work as neighbourhood generator as described in CP as neighbourhood generator for LS Section. The choice of the range for subchain length, l is purely empirical. It is observed that CSP performs well with the smaller choice of l and tends to give better solution in shorter time than with longer l. On the other hand, though longer l results in increasing search time, they are vital for allowing local search to escape from local minima at times. Therefore, when selecting the length of the subchain to be altered, randomness is skewed in such a manner that shorter l are selected more than longer l. Since the choice of l and b directly effects the size of the search space (in effect runtime), we empirically find the acceptable combination of l, b and distribution of l (
(l)) that significantly speed-up the CSP solving phase (see Table 3).
The precise mathematical anaylsis for calculating the size of the search space explored by hybrid approach is complex and yet to be explained by us. It could be explained best by finding the number of self avoiding walks within a bounded box on FCC lattice when two end-points are fixed, which is a complex problem itself and deserves attention from mathematicians. In general, the upper bound of the search space in each iteration of the hybrid local search algorithm is close to a number 10l. l is the length of the subchain to be perturbed and the number of effective neighbouring positions for a given variable are approximated by 10. The tuning of parameters l and
(l) allows us to restrict the search space from a number close to 10n(see Equation 2) to a number close to I
h
* 108.5 where I
h
is the number of iterations used for hybrid LSA procedure. The introduction of domain dilation parameter b further reduces the search space by reducing the effective neighbouring positions to a number less than 10. It explains why hybrid approach gains significant speed-up over pure CP approach. The runtime difference between pure simulated annealing and hybrid simulated annealing can be explained similarly by the fact that pure simulated annealing explores a smaller search space of size I
ls
, where I
ls
is the number of maximum iterations used in the algorithm.