CPSP-tools – Exact and complete algorithms for high-throughput 3D lattice protein studies
© Mann et al; licensee BioMed Central Ltd. 2008
Received: 18 December 2007
Accepted: 07 May 2008
Published: 07 May 2008
The principles of protein folding and evolution pose problems of very high inherent complexity. Often these problems are tackled using simplified protein models, e.g. lattice proteins. The CPSP-tools package provides programs to solve exactly and completely the problems typical of studies using 3D lattice protein models. Among the tasks addressed are the prediction of (all) globally optimal and/or suboptimal structures as well as sequence design and neutral network exploration.
In contrast to stochastic approaches, which are not capable of answering many fundamental questions, our methods are based on fast, non-heuristic techniques. The resulting tools are designed for high-throughput studies of 3D-lattice proteins utilising the Hydrophobic-Polar (HP) model. The source bundle is freely available .
The CPSP-tools package is the first set of exact and complete methods for extensive, high-throughput studies of non-restricted 3D-lattice protein models. In particular, our package deals with cubic and face centered cubic (FCC) lattices.
The organisation of bio-molecules, in particular proteins, in the sequence and structure space has recently been attracting increased attention. Particularly questions concerning finding the native structure or investigating the kinetics and evolution of proteins have been widely studied. These problems are often tackled using simplified models such as the Hydrophobic-Polar (HP) model (e.g. Jacob et al. ). Though abstract, these models are computationally feasible and do allow for deeper insights into fundamental and general principles [2–4].
Several recurring tasks can be identified in such studies using simplified models. Namely, predicting the native structure, classifying whether a sequence is protein-like, calculating its degeneracy and stability, or the design of sequences that optimally fold to a given structure. The problems associated with these tasks are computationally very hard (NP-complete) [5–7]. Nevertheless, these tasks demand for exact and complete (i.e. non-heuristic) methods. It is important to note that stochastic methods cannot be used for proving optimality and in particular proving that a sequence has a unique lowest energy (protein-like) fold .
Consequently, with the exception of Yue and Dill , all studies requiring complete and exact answers to optimal structure prediction were based on exhaustive enumeration. These studies were, hence, confined to small sequence lengths. In other approaches, structures are artificially restricted to be maximally compact (e.g. filling a 3 × 3 × 3 cube) . This allows for complete enumeration but artificially biases the energy function towards overall hydrophobicity.
Furthermore, many studies are confined to extremely simplified models on the 2D-square or 3D-diamond-lattice [2, 11]. The coordination number, a measurement of lattice complexity, is four in both cases. The use of lattices with such a low complexity may lead to oversimplified models that are not able to reproduce real world properties. Park and Levitt  have shown that lattices with higher coordination number provide a much better fit to real protein structures. A further hint toward the simplicity of the 2D-lattice is the low computational complexity of inverse folding when compared to the 3D-cubic lattice . The Constraint-based Protein Structure Prediction (CPSP) approach by Backofen and Will  provides a way to overcome the aforementioned obstacles. The method is tailored to the HP model introduced by Lau and Dill . This model is widely used in the literature [15, 16]. CPSP supports complex 3D lattices (currently cubic and face centered cubic) without artificial restrictions (e.g. to be maximally compact). The approach predicts all globally optimal structures together with a proof of optimality. No naive, exhaustive enumeration of all structures is performed and it is as fast as stochastic methods that cannot prove optimality. Backofen and Will  showed that the CPSP-approach could fold even sequences of length 200 to optimality within seconds. In contrast, exhaustive structure enumeration as e.g. done by Blackburne and Hirst  is restricted to short sequence lengths. For instance, on a 3D-cubic lattice it is only viable to enumerate up to about length 20. In fact, the exact number of structures is only known up to length 23 where there are already more than 5 × 1015 . CPSP uses constraint programming that is commonly applied to hard (NP-complete) problems and, thus, avoids the complete expansion of the whole search space. Hence, constraint-programming techniques are a powerful tool to handle the high complexity that typifies problems related to protein structure. Constraint-programming techniques have successfully been applied to structure prediction with given secondary structure information , analysis of NMR data , and modeling of protein complexes .
Currently, we are not aware of any other complete approach that ensures optimality of the predicted structures in different lattices. There is an alternative to CPSP for the 3D-cubic lattice, the constraint-based hydrophobic core construction method by Yue and Dill . This allows the prediction of optimal structures and proves their optimality. However, using the CPSP-approach, Backofen and Will showed that the method developed by Yue and Dill is not always complete in enumerating all optimal structures .
CPSP-tools provides a set of programs that enable typical, modern research tasks to be calculated efficiently and accurately. Here we list the programs each with a typical example application. HPSTRUCT predicts (all) optimal and suboptimal structures as required for investigating properties of low energy conformations, as e.g. studied by Jacob and Unger . The statistical analysis of protein-like sequences, see Blackburne and Hirst , requires a degeneracy-based classification of sequences. This is possible with HPDEG. For the exploration of protein evolution, similar to Wroe and Chan , one needs to investigate the sequence-structure space. We provide HPDESIGN for sequence design and HPNNET for neutral network computation.
All methods can be applied to HP-sequences in the cubic and the more complex face centered cubic lattice model. Before giving a detailed description of the tools, we first introduce the idea of H-cores, central to these methods.
In the HP lattice models, two monomers form a contact if they occupy neighboring positions in the lattice. The energy of a structure is defined by the number of contacts between H-monomers, i.e. HH-contacts. Thus, an optimal (minimum energy) conformation maximizes the number of HH-contacts. An important observation is that optimal structures show an almost optimal (maximally compact) packing of the H-monomers. Such dispersions of H-monomers without any chain connectivity are called H-cores. The compactness of the H-cores is a basic feature that can be used for structure prediction and sequence design. Note that optimal H-cores are independent of a particular sequence and depend only on the number of H-monomers. Hence, compact and nearly compact H-cores can be precalculated and stored in a database. HPSTRUCT and HPDESIGN use this database as a starting point for their calculations (details later). Thereby, redundant computation is avoided, which significantly speeds up the CPSP-approach and related applications.
The enumeration of all optimal H-cores in complex lattice models such as FCC is a computationally hard problem by itself and was solved by Backofen and Will using constraint-programming techniques . Firstly an upper bound on the number of possible contacts for a given number of monomers is calculated via dynamic programming. Subsequently, this information is used to enumerate all compact optimal and almost optimal (suboptimal) H-cores for a given number of H-monomers using constraint-programming. Some statistics on the number of H-cores in the 3D-cubic lattice are given in Fig. 4. It shows that the number of H-cores grows exponentially in H-core size but still much slower than the number of structures for a corresponding sequence length.
HPSTRUCT implements the CPSP approach, as introduced by Backofen and Will , to predict provably optimal structures of 3D lattice proteins in the HP-model. For a given HP-sequence S and a given lattice type (cubic or face centered cubic), (all) optimal structures are calculated. The CPSP approach computes the global minimal energy for S.
The CPSP-approach is based on the H-core database as described before. For a concrete sequence S the approach systematically examines the list of H-cores compatible with S in decreasing maximal contact number. For each core, it attempts to thread the sequence through the core. Threading means to find a placement of the monomers of S in a self-avoiding walk such that all H-monomers are elements of the given H-core and all P-monomers are outside of the core. Since the H-cores are considered in the order of decreasing contacts, the first successful threading results in a structure with global minimal energy. Note that at this point the algorithm has proven that there is no structure of S that forms more HH-contacts.
Technically, the threading of a sequence through a core is performed by a constraint program. For this purpose, we formulate the threading problem as a constraint satisfaction problem (CSP) . It constrains the H-monomers of the sequence to the positions in the H-core. Further, it enforces successive monomers along the sequence to be neighbored in the lattice and prohibits the multiple use of a single position. The constraint-programming machinery allows for the enumeration of all valid placements according to the given constraints. In this way, all (sub)optimal structures for a given sequence can be calculated. For a more detailed description of the CSP definition and the mechanisms for solving it see .
All resulting structures of HPSTRUCT are returned in absolute move string representation. This compactly encodes the lattice position vectors between successive monomers in the structure and reduces the space consumption for huge data sets.
To handle the common case of highly degenerated sequences (with many optima), HPSTRUCT offers the possibility to limit the number of predicted structures or to generate only a representing subset. Such a subset only contains structures that are separated by at least (a user defined) distance k. The distance measure is the hamming distance on the absolute move strings.
The degeneracy of an HP-sequence S is the number of optimal structures S can adopt. It can be calculated using HPDEG and is the base to determine the stability of structures . HPDEG specializes HPSTRUCT and completely counts all optimal structures.
An important application of HPDEG is the classification of sequences as protein-like or not. A sequence is protein-like if it can adopt only one optimal structure (degeneracy 1), a definition applied by Li et al.  and Huard et al.  among others.
HPDEG is directly based on the CPSP-approach to compute the degeneracy. Here, all solutions for all arbitrary H-cores/CSPs are calculated. In addition, a significant acceleration of the process can be achieved by the search decomposition methods we introduced in . This is done by identifying sub-chains of the sequence that can be placed independently from each other. Their placements are calculated separately and the resulting numbers are multiplied to the overall structure number of the whole chain. This decomposition strategy results in a speedup of 3-times and higher on average.
HPDESIGN solves the inverse folding problem, i.e. the design of sequences that form a given structure X as their unique optimum. It allows deeper investigations of sequence-structure relations and a better understanding of general properties of protein folding .
The inverse folding problem (IFP) in 3D lattices has been shown by Berman et al.  to be NP-complete, i.e. it is, as the protein folding problem, a hard computational problem. In contrast, as the same authors show, the IFP in the simple 2D lattice is solvable in polynomial time. This indicates once more the higher complexity of three-dimensional lattice models. To our knowledge, HPDESIGN is the only method applicable to a 3D-model that calculates the desired sequence properties without exhaustive sequence space enumeration.
The approach is based on the CPSP H-core database in order to get a set of good candidate sequences C. First, using H-cores ordered by decreasing size and optimality, a matching of the core and the structure is done. For each match a candidate sequence is derived and added to C. Afterwards, each c ∈ C is evaluated concerning degeneracy and checked if X is its optimal structure.
The candidate set C, produced by the filtering step using the H-cores, consists of sequences that can adopt X with an optimal or slightly sub-optimal H-core. Therefore, their probability to form X as their unique optimum is very high and the size of C very small compared to the whole sequence space. The latter is of high importance for the performance of the method.
Often sequences with a special ratio of H/P occurrences or with only limited degeneracy are of interest. Both can be specified using HPDESIGN.
Furthermore, the number of evaluated H-cores is selectable to allow a balancing between runtime and completeness. This is done by adjusting their allowed level of optimality used in the filtering step.
The organisation of sequence space in neutral networks provides insights into evolutionary principles [15, 30]. Such networks can be expanded using HPNNET. A neutral network for a given structure X is an undirected binary graph, where each node represents a sequence that forms X as its unique optimal structure. Edges connect evolutionary related sequences, i.e. sequences that differ only in one sequence position, a point mutation. HPNNET expands a neutral network starting from an initial sequence (or a set of sequences) S that folds into the structure X.
The method follows the generate-and-test paradigm. Recursively, all neighboring sequences of S are tested if they adopt X as their unique optimum. If so, they are added to the network and their neighbors are checked. Therefore, HPNNET is capable of detecting and expanding connected neutral networks of different structures.
Running HPNNET with S as the only start sequence results in the connected component of the network S belongs to. However, Blackburne and Hirst  have shown by exhaustive enumeration in restricted models that neutral networks may consist of several connected components. To find and study them in complex three-dimensional lattices a combination of HPDESIGN and HPNNET can be used. The independently designed sequences resulting from HPDESIGN have a high chance to belong to different components. HPNNET supports as input such a set of sequences and expands all corresponding connected components. An example is later shown in the results section.
In addition to those described above, CPSP-tools provides a set of utility programs helpful for lattice protein studies. For instance using HPCONVERT, it is possible to convert between absolute move strings, the 3D-position data in XYZ-, Protein Data Bank (PDB-) and Chemical Markup Language (CML-) format. A move string normalization, as well as a conversion into an orientation independent relative move string, is available for a symmetry independent structure comparison.
HPVIEW interactively visualizes structures in 2D-square, 3D-cubic, and 3D-FCC lattices using the Jmol interface .
Installation and Usage
The package supplies standard installation procedures for Linux based on common tools (GNU automake) and can be compiled and installed easily on current 32- and 64-bit Linux systems (including Cygwin for Microsoft Windows™). The programs are written in C++ for highest performance and provide a slim text-based user interface for efficient pipelining as required for high-throughput experiments. A web front end is under development.
All constraint programming based algorithms utilize the open source Gecode system .
The validity of the algorithms has been tested and confirmed on a large set of benchmark problems. The functionality of H-core database access, structure prediction, and degeneracy computation are collected in the C++ CPSP-library. A complete API is included which allows the embedding, extension, and use of the CPSP approach in new programs.
To reduce package size, only a small fraction of the H-core database is included in the source package. This already enables the use of CPSP-tools for short sequences. The complete database is available on request.
Results and Discussion
Exemplary runs and data. Example runs of the exemplified CPSP-tools application scenarios. The corresponding sequences and structures are given in Table 2. The neutral net N is given in Figure 3.
X0, E = -13
X1, E = -22
X1, S1, deg = 1
S1 .. S4
X1, minH = 17, so = 2
S1 .. S12
13 m 43 s
X1, S1 .. S12, deg = 1
N, S1 .. S14
Data of exemplary runs.
S1 .. S14
Studies of sequence or structure features of proteins as done by Huard et al.  require a classification of sequences as protein-like. One way is to classify them by the number of optimal structures, i.e. their degeneracy. The fast calculation of this sequence property by HPDEG allows production of sufficiently large benchmark sets for detailed studies. To illustrate this, we run HPDEG for a random HP-sequence S0 revealing an enormous degeneracy, which is a frequent finding in the HP-model. As a starting point for the following scenarios, we evaluate the degeneracy of S1, a sequence with a single optimal structure. The very short runtimes for both checks are given in Table 1.
Calculating the globally optimal structure for a given sequence is the main task in many studies, e.g. see Jacob and Unger . Furthermore, in stochastic folding simulation approaches knowing the minimal possible energy is favorable. Both can be calculated extremely rapidly using HPSTRUCT. Again, We demonstrate this with sequences S0 and S1. This results in an energy of -13 and -22 and the optimal structures X0 and X1, respectively. Both structures are visualized in Figure 2.
To study protein evolution on the sequence level, neutral networks are widely utilized . Using HPNNET we can span the connected component of the neutral network for a given sequence with a unique optimal structure. Applied to S1 with X1 we find four sequences S2 .. S4 sharing X1 as their unique optimal structure. Note, this can be done without exhaustive sequence enumeration for a given structure.
The detailed study of neutral networks by Blackburne and Hirst  has shown that neutral networks may decompose into connected components. Their results are based on full enumeration of sequences and structures in the diamond lattice. This approach does not extend to complex lattice models due to the enormous size of the structure space as discussed above.
HPDESIGN can overcome that problem by directly designing sequences of the neutral network. Recall that the neutral network contains only sequences with the same unique optimal structure. The described design approach allows one to generate sequences of independent components in the neutral network without exhaustive enumeration. Afterwards, the full components can be expanded via HPNNET.
Preliminary studies performed with CPSP-tools indicate that neutral networks as large as N with several large independent components are rare in the unrestricted 3D-cubic model.
For complex 3D models, mainly heuristic and/or stochastic approaches to search for optimal structures of a given sequence are available [8, 33]. However, these methods are (a) incomplete and (b) cannot ensure the global optimality of the predicted structures. In consequence, the investigation of problems requiring this information was only possible using exhaustive enumeration, which is not possible for longer sequence lengths.
The CPSP approach is as fast as common stochastic methods but ensures that all predicted structures are globally optimal, and that none are missing. This is done without exhaustive structure space exploration applying constraint-programming techniques. Therefore, it is well suited to many studies in complex 3D models; especially for finding protein-like sequences, the investigation of neutral networks or sequence design. Further applications range from the generation of candidate sets to the validation of results of folding simulations and stochastic optimization methods.
The CPSP-tools package combines several applications in the field of bioinformatics concerning 3D lattice proteins. It allows advanced investigation of problems related to protein structure prediction, sequence evolution, inverse folding, and energy landscapes.
Availability and requirements
Project name: CPSP-tools
Project home page: http://www.bioinf.uni-freiburg.de/sw/cpsp/
Operating system(s): all Linux based systems (including Cygwin for MS Windows™)
Programming language: C++
Other requirements: Gecode and BIU library (a source bundle is provided)
License: BSD-style license
Any restrictions to use by non-academics: none
Martin Mann is supported by the EU project EMBIO (EC contract number 012835). Sebastian Will is partially supported by the EU Network of Excellence REWERSE (project reference number 506779).
Further, thanks to the reviewers of an earlier version of the manuscript for their helpful comments and Rhodri Saunders for proofreading.
- Jacob E, Horovitz A, Unger R: Different mechanistic requirements for prokaryotic and eukaryotic chaperonins: a lattice study. Bioinformatics 2007, 23: 240–248. 10.1093/bioinformatics/btm180View ArticleGoogle Scholar
- Wolfinger MT, Will S, Hofacker IL, Backofen R, Stadler PF: Exploring the lower part of discrete polymer model energy landscapes. Europhysics Lett 2006, 74: 725–732. 10.1209/epl/i2005-10577-0View ArticleGoogle Scholar
- Huard FP, Deane CM, Woo GR: Modelling sequential protein folding under kinetic control. Bioinformatics 2006, 22: 202–210. 10.1093/bioinformatics/btl248View ArticleGoogle Scholar
- Unger R, Moult J: Finding the lowest free energy conformation of a protein is an NP-hard problem: proof and implications. Bull Math Biol 1993, 55: 1183–1198.View ArticlePubMedGoogle Scholar
- Berger B, Leighton T: Protein folding in the hydrophobic-hydrophilic (HP) model is NP-complete. J Comp Biol 1998, 5: 27–40.View ArticleGoogle Scholar
- Berman P, DasGupta B, Mubayi D, Sloan R, Turán G, Zhang Y: The protein sequence design problem in canonical model on 2D and 3D lattices. In Combinatorial Pattern Matching. Volume 3109. Springer; 2004:244–253.View ArticleGoogle Scholar
- Hsu HP, Mehra V, Nadler W, Grassberger P: Growth-based optimization algorithm for lattice heteropolymers. Phys Rev E 2003, 68: 021113. 10.1103/PhysRevE.68.021113View ArticleGoogle Scholar
- Yue K, Dill KA: Forces of tertiary structural organization in globular proteins. Proc Natl Acad Sci 1995, 92: 146–150. 10.1073/pnas.92.1.146PubMed CentralView ArticlePubMedGoogle Scholar
- Li H, Tang C, Wingreen NS: Designability of protein structures: a lattice-model study using the Miyazawa-Jernigan matrix. Proteins 2002, 49: 403–412. 10.1002/prot.10239View ArticlePubMedGoogle Scholar
- Blackburne BP, Hirst JD: Population dynamics simulations of functional model proteins. J Chem Phys 2005, 123: 154907–9. 10.1063/1.2056545View ArticlePubMedGoogle Scholar
- Park BH, Levitt M: The complexity and accuracy of discrete state models of protein structure. J Mol Biol 1995, 249: 493–507. 10.1006/jmbi.1995.0311View ArticlePubMedGoogle Scholar
- Backofen R, Will S: A constraint-based approach to fast and exact structure prediction in three-dimensional protein models. Constraints 2006, 11: 5–30. 10.1007/s10601-006-6848-8View ArticleGoogle Scholar
- Lau KF, Dill KA: A lattice statistical mechanics model of the conformational and sequence spaces of proteins. Macromolecules 1989, 22: 3986–3997. 10.1021/ma00200a030View ArticleGoogle Scholar
- Cui Y, Wong WH, Bornberg-Bauer E, Chan HS: Recombinatoric exploration of novel folded structures: a heteropolymer-based model of protein evolutionary landscapes. Proc Natl Acad Sci 2002, 99: 809–814. 10.1073/pnas.022240299PubMed CentralView ArticlePubMedGoogle Scholar
- Jacob E, Unger R: A tale of two tails: why are terminal residues of proteins exposed? Bioinformatics 2007, 23: 225–230. 10.1093/bioinformatics/btl318View ArticleGoogle Scholar
- Blackburne BP, Hirst JD: Three-dimensional functional model proteins: structure function and evolution. J Chem Phys 2003, 119: 3453–3460. 10.1063/1.1590310View ArticleGoogle Scholar
- Sloane NJA: Number of n-step self-avoiding walks on cubic lattice. On-Line Encyclopedia of Integer Sequences 2007. [http://www.research.att.com/~njas/sequences/A001412]View ArticleGoogle Scholar
- Dal Palu A, Dovier A, Fogolari F: Constraint Logic Programming approach to protein structure prediction. BMC Bioinformatics 2004, 5: 186. 10.1186/1471-2105-5-186PubMed CentralView ArticlePubMedGoogle Scholar
- Krippahl L, Barahona P: PSICO: Solving protein structures with constraint programming and optimization. Constraints 2002, 7: 317–331. 10.1023/A:1020577603762View ArticleGoogle Scholar
- Krippahl L, Moura JJ, Palma PN: Modeling protein complexes with BiGGER. Proteins 2003, 52: 19–23. 10.1002/prot.10387View ArticlePubMedGoogle Scholar
- Decatur SE: Protein folding in the generalized hydrophobic-polar model on the triangular lattice. 1996.Google Scholar
- Bagci Z, Jernigan RL, Bahar I: Residue coordination in proteins conforms to the closest packing of spheres. Polymer 2002, 43: 451–459. 10.1016/S0032-3861(01)00427-XView ArticleGoogle Scholar
- Wroe R, Chan HS, Bornberg-Bauer E: A structural model of latent evolutionary potentials underlying neutral networks in proteins. HFSP J 2007, 1: 79–87. 10.2976/1.2739116PubMed CentralView ArticlePubMedGoogle Scholar
- Backofen R, Will S: Optimally Compact Finite Sphere Packings – Hydrophobic Cores in the FCC. In Proc of the 12th Annual Symposium on Combinatorial Pattern Matching. Volume 2089. Springer; 2001:257–272.View ArticleGoogle Scholar
- Marriott K, Stuckey PJ: Programming with Constraints: an Introduction. The MIT Press; 1998.Google Scholar
- Shortle D, Chan HS, Dill KA: Modeling the effects of mutations on the denatured states of proteins. Prot Sci 1992, 1: 201–215.View ArticleGoogle Scholar
- Will S, Mann M: Counting protein structures by DFS with dynamic decomposition. Proc of Workshop on Constraint Based Methods for Bioinformatics 2006, 83–90.Google Scholar
- Gupta A, Manuch J, Stacho L: Structure-approximating inverse protein folding problem in the 2D HP model. J Comp Biol 2005, 12: 1328–1345. 10.1089/cmb.2005.12.1328View ArticleGoogle Scholar
- Schuster P, Stadler PF: Networks in molecular evolution. Complexity 2002, 8: 34–42. 10.1002/cplx.10052View ArticleGoogle Scholar
- Jmol: an open-source Java viewer for chemical structures in 3D[http://jmol.sourceforge.net/]
- Gecode – generic constraint development environment[http://www.gecode.org]
- Shmygelska A, Hoos HH: An ant colony optimisation algorithm for the 2D and 3D hydrophobic polar protein folding problem. BMC Bioinformatics 2005, 6: 30. 10.1186/1471-2105-6-30PubMed CentralView ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.