Constraint Logic Programming approach to protein structure prediction
 Alessandro Dal Palù^{1},
 Agostino Dovier^{1}Email author and
 Federico Fogolari^{2}
DOI: 10.1186/147121055186
© Dal Palù et al; licensee BioMed Central Ltd. 2004
Received: 09 July 2004
Accepted: 30 November 2004
Published: 30 November 2004
Abstract
Background
The protein structure prediction problem is one of the most challenging problems in biological sciences. Many approaches have been proposed using database information and/or simplified protein models. The protein structure prediction problem can be cast in the form of an optimization problem. Notwithstanding its importance, the problem has very seldom been tackled by Constraint Logic Programming, a declarative programming paradigm suitable for solving combinatorial optimization problems.
Results
Constraint Logic Programming techniques have been applied to the protein structure prediction problem on the facecentered cube lattice model. Molecular dynamics techniques, endowed with the notion of constraint, have been also exploited. Even using a very simplified model, Constraint Logic Programming on the facecentered cube lattice model allowed us to obtain acceptable results for a few small proteins. As a test implementation their (known) secondary structure and the presence of disulfide bridges are used as constraints. Simplified structures obtained in this way have been converted to all atom models with plausible structure. Results have been compared with a similar approach using a wellestablished technique as molecular dynamics.
Conclusions
The results obtained on small proteins show that Constraint Logic Programming techniques can be employed for studying protein simplified models, which can be converted into realistic all atom models. The advantage of Constraint Logic Programming over other, much more explored, methodologies, resides in the rapid software prototyping, in the easy way of encoding heuristics, and in exploiting all the advances made in this research area, e.g. in constraint propagation and its use for pruning the huge search space.
Background
 1.
assemblying the structure of a protein using structural fragments of similar sequences, available in the protein structure repository (the Protein Databank [3]), and later screening the feasibility of the resulting structures, using energetic criteria;
 2.
representing the protein chain by a highly simplified model which is, hopefully, treatable.
This second class of approaches is appealing in many respects [4]: first, the linkage between kinetics and thermodynamics of protein folding process and the basic intramolecular interactions is more easily addressable, because of the lesser number of variables. Second, the use of a simplified model agrees with the idea that details of atomic interactions between aminoacids are less important than the overall character of these interactions, because protein structure is flexible and can accommodate changes in the volume and shape of aminoacids much better than changes in their character (e.g. polar vs. hydrophobic [5]). Besides aiming at catching essential features of the protein folding process, simplified models have important computational advantages: generating and evaluating the energy of a conformation is efficiently done due to the reduced number of variables. A less evident benefit is that sampling (e.g. by molecular dynamics simulation or Monte Carlo methods) may be much more efficient due to the smoothness of energy surface due, once again, to the reduced number of degrees of freedom. Many lattice models have been used for simplified representation of proteins, up to date. Their capability of reproducing the secondary structure of proteins, as well as their relative arrangement has been reviewed by Godzik et al. [6]. A reasonable tradeoff between accuracy and the need to keep limited the number of base vectors is achieved by the face centered cubic ( ) lattice studied by Toma and Toma [7]. In particular both αhelices and βstrands are modelled with a very low RMSD from standard regular structures. Lattice models have been used mainly for understanding general properties of proteins, rather than for real predictive tasks, although their use, especially in hierarchical protocols has been proposed and realized. In particular, the (210) lattice has been used successfully by Skolnick and Kolinski in prediction of a small beta protein [8] and many other useful applications have been reported since these earlier works (see e.g. for recent successful applications [9, 10] and also the two recent reviews [4, 11]). A deep analysis of realistic lattice models of proteins proposed so far is definitely out of the scope of the present work, but there are few aspects of lattice models of proteins which need to be mentioned. The successful application of a lattice model depends obviously on the efficiency in generating conformations and searching for local minima. This aspect is dealt in the present work using Constraint Logic Programming, and taking advantage of all theoretical and implementative developments that have been realized in this context. The approach (and related languages) has been very seldom applied in the context of protein modeling and it has not been used for realistic protein structural predictions, to the best of our knowledge. A different, but equally important, aspect concerns the reliability of the model itself and of the forcefield used to evaluate conformational free energy. This aspect will not be dealt with by this work. An appropriate forcefield must take into account both local propensities to adopt a particular secondary structure (which ultimately depend on aminoacids' covalent structure and bulkiness) and their tendency to be in contact (which ultimately depends on their physicochemical character). Contact potentials have been derived by many groups (see e.g. [12, 13]) based on the observed versus expected contacts stored in the database. A similar approach could be followed in order to derive a torsional potential in order to describe local conformational propensities. However, it is not obvious how these potentials should be derived for lattice models and how the two potentials are to be considered together. These problems are not investigated here. Rather we consider contact potentials previously derived by our group from statistical analysis of the database [13], which are expected not to be accurate for a lattice model, but nevertheless should be able to reproduce essential features of aminoacid interactions. The local propensity to adopt a particular secondary structure can be computed by predictive methods [14]. However, for the small peptides analyzed in this paper, the correct secondary structure is selected from the deposited structures for testing purposes.
Constraint Logic Programming (briefly, ) [15, 16] is a declarative programming paradigm particularly wellsuited for encoding combinatorial minimization problems. It is the natural merger of the two declarative paradigms known as Constraint Solving and Logic Programming.
domain([X, Y, Z], 1, 10)
 3
* X + 4 * Y + 5 * Z ≤ 40, X + Y <Z
We have modeled a sort of knapsack problem using ( ). In general, in the modeling stage we can use constraints as well as declarative programs involving them.
Solution's search is performed by a constraint solver that is available in the language. The constraint solver uses constraints for sensibly pruning the search tree. One of the main capabilities is called constraint propagation. Constraint propagation reduces the domains of the variables eliminating those values that cannot lead to constraint solutions. For instance, in the considered example, constraint propagation reduces the domains of the variables X, Y, and Z to {1, ..., 4}, {1, ..., 4}, and {3, ..., 6}, respectively. For finding a possible solution, a further builtin capability – the labeling predicate – can be used. We can look for a generic solution as well as for a solution minimizing some function. In the example above, we could ask for minimizing the function 2X^{2} + Y + 4Z. This can be done by adding a constraint of the form:
F = 2 * X * X + Y + 4 * Z, labeling([minimize (F)], [X, Y, Z]).
The constraint solver then exploits the solution's search using constraint propagation and branchandbound techniques returning the answer:
F = 3, X = 3, Y = 1, Z = 5
The library clpfd of SlCStus Prolog [17] allows to effectively program in this framework. Let us observe that it is not required that F be a linear function.
The above described approach to optimization combinatorial problems is the socalled Constrain & Generate technique introduced as opposed to the Generate & Test technique of the classical Logic Programming approach (see, e.g. [18]). In the latter approach, a first phase generates nondeterministically a possible solution, and then the deterministic testphase checks whether the solution is acceptable or not. If the search space is exponential, this technique is not applicable. In the former approach, a first deterministic phase introduces a number of constraints, then a nondeterministic phase starts the generation of the solutions' space. The constraints introduced allow to sensibly prune the solutions' space in order to make the procedure effective. Moreover, in this phase one can take advantage from language builtin strategies (such as constraint propagation, branch and bound) and it is possible to further drive the solution search by means of problemdependent heuristics.
Briefly, next_constraints recursively calls the predicate next for each pair of consecutive aminoacids. Assume that <X 1, Y 1, Z 1> and <X 2, Y 2, Z 2> are the variables that will store the positions of a consecutive pair of aminoacids, then the predicate next states that X 1  X 2 + Y 1  Y 2 + Z 1 Z 2 = 2 and that X 1  X 2 ∈ {0, 1}, Y 1  Y 2 ∈ {0, 1}, Z 1  Z 2 ∈ {0, 1}. This is exactly the notion of adjacency in the facecentered cubic lattice of size 2 that we have used (see also the Methods Section).
Results and discussion
Constrained optimization problem in ( )
Experimental results
Name  N  Time  Energy  RMSD 

1LE0  12  1.3 s  9040  2.8 / 2.6 (2–11) 
1KVG  12  7.3 s  14409  2.7 / 2.4 (3–11) 
1LE3  16  2.3 s  13653  3.0 / 2.7 (2–15) 
1EDP  17  20.4 s  19389  4.3 / 1.1 (9–15) 
1PG1  18  14.6 s  10126  6.0 / 5.2 (4–17) 
1ZDD  34  300 s (limit)  20412  5.6 / 5.6 (5–34) 
17 m25 s  22350  4.0 / 4.0 (5–34)  
1VII  36  300 s (limit)  20860  10.4 / 6.7 (4–32) 
1000 s (limit)  22377  9.1 /6.3 (4–32)  
7 h42 m  26408  10.2 / 7.8 (4–32)  
CF = 0.3  3 h58 m  28710  8.0 / 7.4 (4–32)  
1VII(*)  36  300 s (limit)  17948  9.2 / 7.3 (4–32) 
1000 s (limit)  17948  9.2 /7.3 (4–32)  
3 h20 m  21211  10.3 / 6.9 (4–32)  
1E0M  37  300 s (limit)  13830  6.5 / 5.8 (8–29) 
1200 s (limit)  24613  8.4 / 3.6 (8–29)  
10 h (limit)  26592  8.8 / 3.4(8–29)  
24 h (limit)  30163  8.9 / 4.4 (8–29)  
2GP8  40  300 s (limit)  10303  10.5 / 8.9 (6–38) 
1000 s (limit)  24748  4.1 / 3.5 (6–38)  
10 h (limit)  26196  4.9 / 3.5 (6–38)  
10 h39 m  26196  4.9 / 3.5 (6–38)  
1ED0  46  300 s (limit)  29970  7.3 / 4.1 (3–40) 
1000 s (limit)  32369  8.6 / 7.1 (3–40)  
9 h38 m  38218  8.0 / 7.2 (3–40)  
1ENH  54  300 s (limit)  12480  10.4 / 8.9 (8–52) 
1000 s (limit)  12480  10.4 / 8.9 (8–52)  
10 h(limit)  23373  9.9 / 8.6 (8–52)  
24 h (limit)  23373  9.9 / 8.6 (8–52)  
6PTI  58  300 s (limit)  no sol.  
1000 s (limit)  29709  10.0 / 9.7 (3–55)  
10 h (limit)  37837  10.0 / 9.7 (3–55)  
24 h (limit)  37837  10.0 / 9.7 (3–55)  
CF = 0.25  48 h (limit)  42096  9.7 / 9.4 (3–55)  
CF= 0.18  24 h (limit)  47451  10.9 / 10.7 (3–55)  
2IGD  60  300 s (limit)  24158  19.3 / 16.3 (6–59) 
1000 s (limit)  29178  19.0 /16.2 (6–59)  
10 h (limit)  37479  16.9 / 15.0(6–59)  
24 h (limit)  37479  16.9 / 15.0 (6–59)  
CF = 0.17  4 h 59 m  40588  12.6 / 11.5 (6–59)  
2ERA  61  300 s (limit)  28993  11.6 / 10.6 (3–55) 
9 m28 s,  30746  12. 3/ 12.1 (3–55)  
ss = 5  15 m13 s  31692  12.7/11.6 (3–55)  
CF = 0.25, ss = 5  15 m12 s  33693  10.9/9.3 (3–55)  
CF = 0.25, ss = 4  1000 s (limit)  32985  12.3/12.4 (3–55)  
CF = 0.19, ss = 5  1000 s (limit)  34275  10.6/8.9 (3–55)  
CF = 0.19, ss = 4  1000 s (limit)  38138  11.6/10.6 (3–55)  
1SN1  63  300 s (limit)  no sol.  
1000 s (limit)  53888  13.0 / 10.5 (2–51)  
10 h (limit)  57615  11.9/ 9.6 (2–51)  
CF = 0.25, ss = 5  10 h (limit)  47121  8.6 / 8.1 (2–51)  
1YPA  63  300 s (limit)  36626  16.1 / 9.4 (12–52) 
1000 s (limit)  33886  17.1 / 10.9 (12–52)  
10 h (limit)  33886  17.1 / 10.9 (12–52)  
CF = 0.17  100 s (limit)  26297  12.5 / 10.5 (12–52)  
CF = 0.17  10 h (limit)  60244  12.9 / 9.8 (12–52) 
From left to right, the meaning of each column is as follows: the protein PDB identification code, the number N of aminoacids, the execution time, the energy of the best model found and its RMSD from the native structure for all the residues and for the core residues only. When there is not explicitly written "limit" it means that the program successfully terminated in the time reported; otherwise the program terminated due to time limit. We wish to observe that the results with time limit 10 h/24 h are typically computed in few hours. The rest of the time is used to further explore the solutions' space.
When a CF = η is reported a further constraint on the compactness ratio η is added before the search. CF = η bounds the linear distances X_{ i } X_{ j }, Y_{ i } Y_{ j }, and Z_{ i } Z_{ j } between all pair of residues i and j to η N where N is the length of the primary list. If η is low (e.g. 0.2), this constraint imposes a compact form to the protein and strongly reduces the running time.
One of the structural constraints considered is the presence of disulfide bonded residues (ssbonds). The rigid structure of the lattice is such that a low value of Euclidean distance (e.g., 2) between ssbonds often precludes all possible solutions. For this reason the default is chosen as 6. However, in some cases we tried computations with lower value. In these cases in the table the text ss = γ is reported.
The secondary structure, as computed from the deposited structure in PDB, has been input as constraint. As a unique exception, in the case of 1VII(*) we have instead predicted it using the GOR IV secondary structure prediction method [19].
Summary of molecular dynamics results
Name  N  RMSD (Å)  Time  RMSD ( + opt) (Å) 

1VII  36  5.3 / 4.7 (4–32)  17.8 h (2 ns)  5.8 / 4.8 (4–32) 
1E0M  37  5.5 / 4.0 (7–30)  26.3 h (4 ns)  8.7 / 3.6 (7–30) 
2GP8  40  5.9 / 3.8 (6–38)  37.7 h (4 ns)  3.9 / 2.3 (6–38) 
1ENH  54  5.9 / 5.0 (8–52) / 3.7 (8–36)  29.4 h (2 ns)  11.2 / 10.7 (8–52) / 4.7 (8–36) 
2IGD  61  5.7 / 4.1 (6–59)  48.6 h (4 ns)  12.9 / 11.5 (6–59) 
1YPA  64  9.2 / 7.1 (12–52)  116.9 h (8 ns)  11.8 / 9.4 (12–52) 
Comparison with Rosetta predictions
Name  N  Time  RMSD  Rosetta Time  Rosetta RMSD 

1ZDD  34  17 m.25 s.  4.0  5 m.35 s.  3.5 
1VII  36  3 h.58 m.  7.4 (4–32)  5 m.35 s.  4.2 
1E0M  37  10 h.  3.4 (8–29)  6 m.35 s.  7.7 
2GP8  40  10 h.  3.5 (6–38)  6 m.35 s.  6.4 
1ED0  46  10 h.  7.2 (3–40)  7 m.23 s.  8.9 
Constrained molecular dynamics simulation
The simulation time, ranging approximately between one and four CPU days, required for folding each protein on a 1.533 GHz AMD Athlon processor is reported in Table 2. The columns (from left to right) in Table 2 report the PDB identification code of the protein, the number of residues, the RMSD from native structure computed on C_{ α }atoms on the whole protein and only on core residues and the simulation time. The last column reports the RMSD from native structure for models obtained by after addition of all atoms and energy minimization as described in the Methods Section.
The simulation time needed for obtaining structures similar to native structures increases with the size of the protein both for the increasing size of the system and for the longer simulated annealing runs needed because of increasing complexity of the free energy landscape. Unfortunately a safer scheme would employ substantially longer simulation times.
This fact prompts for searching alternative ways to employ the same ideas.
The results in terms of RMSD from native structure support the idea that folding may be achieved, at least in simulation, by a hierarchical approach where local secondary structure elements are formed first and later their arrangement and contacts are optimized. A similar conclusion has been reached using a different model by Maritan and coworkers [21]. The RMSD on core residues is, in all but one case, less than 5.0 Å. In four out of six cases the RMSD on core residues is close to 4.0 Å. In the worst case, which is also the longest simulated chain, the RMSD on core residues is 7.1 Å.
Conclusions
The results obtained using this approach and reported in Tables 1 to 3 show that for small proteins a solution for the optimization problem is obtained in less than few hours. For the larger proteins studied here the inaccuracies of both the lattice model and contact potential prevent finding a compact solution. These problems are more likely to appear with increasing size of the protein and when the length of nonconstrained chain connecting two secondary structure elements is short, because the lattice allows a limited set of conformations.
Further work is being devoted towards a more realistic modeling representation of the protein, with at least two centers of interaction per residue, and towards refinement of the potential function by including a term for rotamer preferences. This term should map on the lattice the directional preferences of each unit with respect to the previous three units. Each of the six possible next positions for each unit should be weighted by an energy term derived from database analysis.
Also the optimal size of non constrained parts of the chain will be determined in order to allow more possible relative orientations among constrained secondary structure elements, possibly without increasing significantly the computation time. At present, however, when the positions of all atoms are reconstructed from the lattice C_{ α }trace, the RMSD on core residues of the resulting models, after energy minimization, compared to native structures, is as low as 4.8 Å for the thermostable domain of villin headpiece (PDB id.: 1VII), 3.6 Å for the WW domain (PDB id.: 1E0M), 2.3 Å for the coat proteinbinding domain of bacteriophage P22 (PDB id.: 2GP8).
It should be also noted that both the thermostable domain of villin headpiece and the WW contain three secondary structure elements that can be arranged in different ways in order to produce a compact structure. The low RMSD is therefore significant.
A comparable protocol employing a molecular dynamics simulated annealing procedure still leads to superior results for larger proteins, as expected because the protein representation is more accurate, but it takes longer execution times between one and four days on a 1.5 GHz P3 machine.
Recent results have shown that simplified models and more refined models can be employed successfully in hierarchical modeling procedures [9, 10]. The results obtained in the present work suggest that could be useful for finding starting conformations for further refinement.
Methods
The protein structure prediction problem as a minimization problem
The sequence of aminoacids defining a protein is called primary structure. This structure uniquely determines the (3D) native conformation, also known as tertiary structure. The protein structure prediction problem is the problem of predicting the tertiary structure of a protein given its primary structure. The native tertiary structure minimizes the global free energy of the protein.
Abstraction level
We consider each aminoacid as a single sphere centered in its C_{ α }atom; the distance between two consecutive C_{ α }atoms is assumed to be 3.8 Å Recent results (see, e.g., [13]) show that a contact between two residues, when represented only by their C_{ α }atoms, is optimally defined for C_{ α } C_{ α }distances shorter than 6.4 Å The number is obtained as the sum of the radius of the two C_{ α }carbon atoms we are dealing with (2 x 1.9 Å) and the value of 2.6 Å empirically determined in [13] for van der Waals surface contact. A table that points out the energy associated to pairs of aminoacids in contact has been developed [12, 13]. Let us denote by Pot(x, y) the energy value associated to a contact between aminoacids x and y (the order is immaterial); this value can either be positive or negative, according to the pair x, y.
Lattice model
According to [25] we use the FaceCentered Cubic Lattice ( ) that allows realistic angles between consecutive residues. The lattice is composed by cubes of size 2, where the central point of each face and the vertices are admitted. Thus, the domain consists in a set of triples <x, y, z> where <x, y, z ∈ >. We recall that given a point <x, y, z>, its 2norm is: <x, y, z> = . Given two points p_{1} and p_{2}, p_{1}  p_{2} is known as their Euclidean distance.
Mathematical formalization
In this setting, it is possible to formalize the protein folding problem as an optimization problem. Given a sequence S = s_{1} ... s_{ n }, with s_{ i }aminoacids, a fold of S is a function ω : {1, ..., n} → such that: ω(i)  ω(i + 1) = and ω(i)  ω(j) ≥ 2 for i ≠ j. The first constraint states that consecutive aminoacids have a fixed distance, corresponding to one lattice unit; the second that each aminoacid occupies a unitary sphere and that two spheres cannot overlap.
The protein folding problem can be reduced to the optimization problem of finding the fold ω of S such that the following energy is minimized [26, 27]:
where contact(ω(i), ω(j)) is 1 if ω(i)  ω(j) = 2, 0 otherwise. To avoid solutions equivalent modulo simple symmetries, other constraints can be added on the first positions.
Complexity issues
The decision version of this problem (and even of its HPabstraction) is proven to be NPcomplete on various lattices [28, 29]. However, we do not want to solve the problem for proteins of arbitrary length. Solving it for length N = 200–300 could be considered as an important contribution to biological sciences and there are yet such results using the HPabstraction [30]. Thus, in spite of its NPcompleteness, it is important to understand the size of the solution's space. The size of the solution's space is the number of selfavoiding walks on the lattice that can be approximated by the formula (cf., e.g., [31])
SAW_{ fcc }= 1.26N^{0.162}(10.0364)^{ N } (2)
This formula should modify in the presence of additional constraints as mentioned later.
Main implementation issues
Constraints
With distance_constraints, we also impose that two non consecutive residues must be separated by more than one lattice unit, to reflect the steric interaction between the C_{ α }s modelling aminoacids.
As described above, compact_constraints imposes that, for every pair of aminoacids, the norm of the projection of their distance on each x, y, z coordinate, is smaller than CompactFactor × N.
As said in the Lattice model Section, a contact is generated by two non consecutive aminoacids with Euclidean distance less than or equal to 2. As a consequence of the constraints applied, it suffices to check for a contact when the lattice distance equals 2, since distance_constraints excludes from the domain the possibility to place two non consecutive aminoacids at one lattice unit.
We also impose constraints coming from secondary structure information. Secondary structure can be predicted with good approximation (e.g., [36]). In our set of data we have collected such information from the Protein Data Bank. We represent secondary structure information as helix(i, j): elements i, i + 1, ..., j of the input sequence form an αhelix; strand(i, j): elements i, i + 1, ..., j are in a βstrand; ssbond(i, j): there is a disulfide bridge between element number i and j.
We use an auxiliary list called Indexes that stores torsional angles defined by four consecutive aminoacid positions. Due to lattice structure and our constraints, every four consecutive aminoacids can form only 6 discrete angles. Thus, each variable in Indexes can assume a value i from {0, ..., 5}, representing torsional angles of 0°, 60°, 120°, 180°, 240°, 300°, respectively. With these conventions, helices are approximated by sequences of indexes of the form 5, 5, 5, ... while βstrands are associated to sequences of the form 3, 3, 3, .... Note that specifying the coordinates of three points (i.e. to place and orient the protein) and the indexes, uniquely determines the conformation, ssbond(i, j), introduces a maximum distance constraint between the aminoacids i and j. The predicate energy_constrain is developed using an auxiliary symmetric matrix M. The optimal fold is reached when the sum of M elements is minimal. During the labeling phase, the information stored in M is used to control the minimization process and to cut the search tree.
Labeling stage
To reduce the size of the solution's space visited during execution, we have replaced the builtin labeling predicate with an adhoc constraintbased solution search predicate, called solutions_search. We describe here briefly the main features of this predicate and of its auxiliary predicates.
solutions_search • If the Tertiary list or the Indexes list is ground (already computed), then it terminates the folding process (possibly, after a call to the builtin labeling).

Otherwise, it calls choose_labeling. When this procedure terminates, it calls recursively solutions_search. Termination is guaranteed by the fact that each call to choose_labeling reduces the number of nonground variables.
choose_labeling • If the number of variables to be instantiated is low (in our code less than 4), it calls the builtin labeling.

Otherwise, it calls selection_strategy. This predicate computes several subsequences of the list of Indexes. Each subsequence consists of alternations of ground elements and nonground variables. selection_strategy selects the most known subsequence, namely the one containing the smallest ratio of variable over ground indexes, preferring the ones that include a ssbond. If in the selected subsequence there are too many variables, an arbitrary subsequence cut is done. After the subsequence is selected, the procedure labeling_new_launch is called.
labeling_new_launch It calls the auxiliary predicate labeling_new but stops the solution search when the global runtime is greater than the input time limit. If this is the case, the best computed solution is returned.
labeling_new This procedure receives the chosen sublist to be folded. Each index variable in it, is assigned an admissible value between 0 and 5. The order of values that is tried for each index is described by a precomputed auxiliary list. For each torsional index, a frequency statistics of the 6 indexes is precomputed and extracted from the PDB, according to the specific aminoacid sequence involved locally. We use this information to direct the search and explore first the most common torsional angles, in the hope that this selection rule reflects nature's strategy.
Moreover, the energy associated to the fold is minimized. For doing that, after each instantiation of a fixed number t of variables in a phase, we collect the best known ground admissible solution, its energy and its associated potential matrix. We compare the current status to history and decide if it is reasonable to cut the search tree. In particular, we designed a heuristic that allows to control the effectiveness of the cut, adapting it dynamically to the status of the fold. Practically, when the protein is partially specified, we estimate the ratio between ground and nonground variables in the potential matrix. If the ratio is low (i.e. the protein is poorly determined), we allow the current energy to be worse than the corresponding counterpart in the best fold so far reached. When the ratio is high (i.e. protein almost folded) we constrain the current energy to be slightly lower than the previous best known.
Molecular dynamics simulations
In order to have a fair comparison with a similar approach using allatom protein models we built detailed all atom models for six proteins in the studied set (namely those with PDB id. code: 1VII, 1E0M, 2GP8, 1ENH, 2IGD, 1YPA) and imposed, through torsional constraints, the secondary structure geometry found in the native structure. The constraining potential was 100 * (θ  θ_{0})^{2} kcal/(mol rad^{2}). The reference target angles (i.e. θ_{0} in the previous formula) were set to φ = 139 and ψ = 135 for residues in βstrand and to φ = 48 and ψ = 57 for residues in αhelices. For all constrained residues also the ω dihedral angle was constrained at 180 degrees.
The chain was first built fully extended and minimized by 400 steepest descent minimization steps and by 500 conjugate gradients minimization steps.
The protein was then heated in 10 ps up to 900 K in 20000 steps using a timestep of 0.0005 ps. Then the temperature was lowered down to 270 K in 20 steps. During each step molecular dynamics simulation was carried out for 100 ps for a total simulation time of 2 ns.
Simulations used the Generalized Born implicit solvent method [37] as implemented in the program CHARMM [38] with standard parameters for proteins. The forcefield used was CHARMM v.27 [39].
In order to obtain globular protein during simulation a constraint on the radius of gyration (computed only on C_{ α }atoms) was imposed. The target radius was decreased during the simulation from a value proper of an extended conformation down to the value given by 2.2N^{0.38} [34] where N is the number of residues.
Detailed models from lattice models
Declarations
Authors’ Affiliations
References
 Moult J, Hubbard T, Fidelis K, Pedersen J: Critical Assessment of Methods of Protein Structure Prediction (CASP): Round III. Proteins: Struct Funct Genet 1999, (Suppl 3):2–6.
 Venclovas C, Zemla A, Fidelis K, Moult J: Assessment of progress over the CASP experiments. Proteins: Struct Funct Genet 2003, (Suppl 6):585–595.
 Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE: The Protein Data Bank. Nucleic Acids Res 2000, 28: 235–242.PubMed CentralView ArticlePubMedGoogle Scholar
 Skolnick J, Kolinski A: Reduced models of proteins and their applications. Polymer 2004, 45: 511–524.View ArticleGoogle Scholar
 Bashford D, Chothia C, Lesk A: Determinants of a protein fold. Unique features of the globin amino acid sequences. J Mol Biol 1987, 196: 199–216.View ArticlePubMedGoogle Scholar
 Godzik A, Kolinski A, Skolnick J: Lattice representations of globular proteins: how good are they? J Comp Chem 1993, 14: 1194–1202.View ArticleGoogle Scholar
 Toma T, Toma S: Folding simulation of protein models on the structurebased cubooctahedral lattice with the Contact Interactions algorithm. Protein Sci 1999, 8: 196–202.PubMed CentralView ArticlePubMedGoogle Scholar
 Skolnick J, Kolinski A: Simulations of the folding of a globular protein. Science 1990, 250: 1121–1125.View ArticlePubMedGoogle Scholar
 Xia Y, Huang ES, Levitt M, Samudrala R: Ab initio construction of protein tertiary structures using a hierarchical approach. J Mol Biol 2000, 300: 171–185.View ArticlePubMedGoogle Scholar
 Zhang Y, Kolinski A, Skolnick J: TOUCHSTONE II: a new approach to ab initio protein structure prediction. Biophys J 2003, 85: 1145–1164.PubMed CentralView ArticlePubMedGoogle Scholar
 L Mirny ES: Protein folding theory: from lattice to allatom models. Annu Rev Biophys Biomol Struct 2001, 30: 361–96.View ArticleGoogle Scholar
 Miyazawa S, Jernigan RL: Residueresidue potentials with a favorable contact pair term and an unfavorable high packing density term, for simulation and threading. J Mol Biol 1996, 256(3):623–644.View ArticlePubMedGoogle Scholar
 Berrera M, Molinari H, Fogolari F: Amino acid empirical contact energy definitions for fold recognition in the space of contact maps. BMC Bioinformatics 2003., 4(8):
 Rost B: Protein secondary structure prediction continues to rise. J Struct Biol 2001, 134: 204–218.View ArticlePubMedGoogle Scholar
 Jaffar J, Maher MJ: Constraint Logic Programming: A Survey. Journal of Logic Programming 1994, 19–20: 503–581.View ArticleGoogle Scholar
 Marriott K, Stuckey PJ: Programming with Constraints. The MIT Press, Cambridge, Mass; 1998.Google Scholar
 Swedish Institute for Computer Science: SICStus Prolog Home Page.[http://www.sics.se/sicstus/]
 Sterling L, Shapiro E: The art of Prolog. 2nd edition. The MIT Press, Cambridge, Mass; 1997.Google Scholar
 Pôle Biolnformatique Lyonnais: GOR IV secondary structure prediction method.[http://npsapbil.ibcp.fr]
 Simons K, Bonneau R, Ruczinski I, Baker D: Ab initio protein structure prediction of CASP III targets using ROSETTA. Proteins: Struct Fund Genet 1999, 3: 171–176.View ArticleGoogle Scholar
 Hoang TX, Seno F, Banavar JR, Cieplak M, Maritan A: Assembly of protein tertiary structures from secondary structures using optimized potentials. Proteins: Struct Funct Genet 2003, 52: 155–165.View ArticleGoogle Scholar
 Backofen R: The protein structure prediction problem: A constraint optimization approach using a new lower bound. Constraints 2001, 6(2–3):223–255.View ArticleGoogle Scholar
 Dill KA: Dominant forces in protein folding. Biochemistry 1990, 29: 7133–7155.View ArticlePubMedGoogle Scholar
 Yue K, Dill KA: Sequencestructure relationships in proteins and copolymers. Physical Review E 1993, 48(3):2267–2278.View ArticleGoogle Scholar
 Raghunathan G, Jernigan RL: Ideal architecture of residue packing and its observation in protein structures. Protein Sci 1997, 6: 2072–2083.PubMed CentralView ArticlePubMedGoogle Scholar
 Clote P, Backofen R: Computational Molecular Biology: An Introduction. John Wiley & Sons; 2001.Google Scholar
 Dal Palù A, Dovier A, Fogolari F: Protein Folding in CLP () with Empirical Contact Energies. In In Recent Advances in Constraints, of Lecture Notes in Artificial Intelligence Edited by: Apt KR, Fages F, Rossi F, Szeredi P, Vancza J. 2004., 3010:Google Scholar
 Crescenzi P, Goldman D, Papadimitrou C, Piccolboni A, Yannakakis M: On the complexity of protein folding. In Proc of STOC 1998, 597–603.Google Scholar
 Hart WE, Newman A: The Computational Complexity of Protein Structure Prediction in Simple Lattice Models. In In Handbook on Algorithms in Bioinformatics. CRC Press; in press.
 Backofen R, Will S: A ConstraintBased Approach to Structure Prediction for Simplified Protein Models that Outperforms Other Existing Methods. In Proceedings of the 19th International Conference on Logic Programming (ICLP 2003), of Lecture Notes in Computer Science, Springer 2003, 2916: 49–71.Google Scholar
 Schuster P, Stadler PF: Discrete Models of Biopolymers. In In Handbook of Computional Chemistry. Edited by: Crabbe MJC, Drew M, Konopka A. Marcel Dekker, New York; 2001.Google Scholar
 Dovier A: Protein Folding with ConstraintsBased Methods.[http://www.dimi.uniud.it/dovier/PF]
 Cantor CR, Schimmel PR: Biophysical chemistry. W H Freeman and Co; 1980.Google Scholar
 Huang X, Powers R: Validity of using the radius of gyration as a restraint in NMR protein structure detremination. J Am Chem Soc 2001, 123: 3834–3835.View ArticlePubMedGoogle Scholar
 Fogolari F, Esposito G, Viglino P, Cattarinussi S: Modeling of polypeptide chains as C α chains, C α chains with C β , and C α chains with ellipsoidal lateral chains. Biophys J 1996, 70: 1183–1197.PubMed CentralView ArticlePubMedGoogle Scholar
 Rost B, Sander C: Prediction of Protein Secondary Structure at better than 70% accuracy. J Mol Biol 1993, 232: 584–599.View ArticlePubMedGoogle Scholar
 Qiu D, Shenkin P, Hollinger F, Still W: The GB/SA Continuum Model for Solvation. A Fast Analytical Method for the Calculation of Approximate Born Radii. J Phys Chem 1997, 101: 3005–3014.View ArticleGoogle Scholar
 Brooks BR, Bruccoleri RE, Olafson BD, States DJ, Swaminathan S, Karplus M: CHARMM: A Program for Macromolecular Energy Minimization and Dynamics Calculations. J Comp Chem 1983, 4: 187–217.View ArticleGoogle Scholar
 MacKerell ADJ, Bashford D, Bellott M, Dunbrack RLJ, Evanseck JD, Field MJ, Fischer S, Gao J, Guo H, Ha S, JosephMcCarthy D, Kuchnir L, Kuczera K, Lau FTK, Mattos C, Michnick S, Ngo T, Nguyen DT, Prodhom B, Reiher WEI, Roux B, Schlenkrich M, Smith JC, Stote R, Straub J, Watanabe M, WiorkiewiczKuczera J, Yin D, Karplus M: Allatom empirical potential for molecular modeling and dynamics Studies of proteins. J Phys Chem B 1998, 102: 3586–3616.View ArticlePubMedGoogle Scholar
 European Bioinformatics Institute: MaxSprout: Reconstruction of 3D coordinates from C (alpha) trace.[http://www.ebi.ac.uk/maxsprout]
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.