An algorithm to enumerate all possible protein conformations verifying a set of distance constraints

Background The determination of protein structures satisfying distance constraints is an important problem in structural biology. Whereas the most common method currently employed is simulated annealing, there have been other methods previously proposed in the literature. Most of them, however, are designed to find one solution only. Results In order to explore exhaustively the feasible conformational space, we propose here an interval Branch-and-Prune algorithm (iBP) to solve the Distance Geometry Problem (DGP) associated to protein structure determination. This algorithm is based on a discretization of the problem obtained by recursively constructing a search space having the structure of a tree, and by verifying whether the generated atomic positions are feasible or not by making use of pruning devices. The pruning devices used here are directly related to features of protein conformations. Conclusions We described the new algorithm iBP to generate protein conformations satisfying distance constraints, that would potentially allows a systematic exploration of the conformational space. The algorithm iBP has been applied on three α-helical peptides.


Background
Protein structure determination is crucial for understanding protein function, as it paves the way to the discovery of new chemical compounds and of new approaches to control the biological processes.
The problem of protein structure determination is certainly a problem with multiple solutions, as proteins are flexible polymers. As most of the experimental techniques of the structural biology obtain measurements averaged on an ensemble of protein conformations, the usual approaches for structure determination intend to find an average structure or a set of conformations describing fluctuations around an average structure. A path intending to get a complete coverage of the conformational Full list of author information is available at the end of the article space, given a series of constraints, is usually not taken, although such an approach could provide precious information about the conformational equilibrium, which is essential in the function of many proteins, as the HIV protease [1].
An important class of experimental methods for protein structure determination is based on the measurement of inter-atomic distances and angles, such as Nuclear Magnetic Resonance (NMR) spectroscopy [2] and crosslinking coupled to mass spectrometry [3]. In NMR, distance intervals between hydrogens are determined from the measurement of nuclear Overhauser effects (NOE). The experimentally measured distances are then used as constraints for protein structure calculations. Pure in silico approaches have been also developed based on the use of inter-atomic distance constraints, such as homology modeling [4] or prediction of protein-protein complexes [5] and ligand poses [6].
The Distance Geometry Problem (DGP) [7,8] consists in identifying the sets of points which satisfy a set of constraints based on relative distances between some pairs of such points. The present work describes an algorithm developed to solve DGP in the context of protein structure determination: the points represent the protein atoms.
The DGP is a constraint satisfaction problem. Several approaches solve this problem by reformulating it [8] as a global optimization problem having a continuous search domain, and whose objective function is generally a penalty function designed to measure the violation of the distance constraints. Over the years, the solution of DGPs arising in structural biology have been typically attempted by Simulated Annealing (SA) approaches based on molecular dynamics [9]. Other proposed approaches are based on various optimization methods as in [10]. As all meta-heuristic approaches, SA may provide approximate solutions but does not deliver optimality certificates. In the case of protein structure determination, since the optimization problem is a reformulation of a constraint satisfaction problem, solutions given by SA can be successively verified by checking the violations of the distance constraints. However, additional solutions may exist but go undetected by SA. Thus, an algorithm for the systematic enumeration of the possible conformations of a given protein could find a widespread field of application. Branch-and-prune algorithms and similar were proposed in the general context of protein structure determination [11][12][13][14][15][16], (see also [8] and references therein). However, these studies primarily addressed the question of defining relative orientations of protein monomers in symmetric oligomers, not the determination of all possible conformation of a polypeptide chain with a very large number of degrees of freedom from distance constraints. Systematic exploration was proved to be useful in the case of residual dipolar couplings (RDC) constraints [17], for exploring the sidechains conformations [18,19] and for assignment of NOEs, provided that the structure is known [20]. For the structure determination from RDCs, it has been shown [21] that when using RDCs but only sparse NOEs the problem can be solved in polynomial time. Such approaches have also been used for structure determination in X-ray crystallography for non-crystallographic symmetry by orienting and translating symmetric protein subunits [22]. To the best of our knowledge, in this paper we present the first application of a Branch-and-Prune algorithm to the problem of full protein structure determination based on unambiguous distance information.
Under certain conditions, DGPs can be discretized [23] (see below), which means that the search domain for the corresponding optimization problem can be reduced to a discrete set, which has the structure of a tree. The discretization makes the enumeration of the entire solution set of DGP instances possible. This is important when the experimental constraints do not specify the protein conformation uniquely, i.e., more than one conformation satisfies all constraints. For solving discretized DGP, we employ an interval branch-and-prune (iBP) algorithm [24], which is based on the idea of recursively exploring the tree while generating new candidate atomic positions (branching phase) and to verify the feasibility of such positions (pruning phase) ( Figure 1). By making use of pruning devices, branches rooted at infeasible positions can be discarded from the tree, so that the search can be reduced to the feasible parts of the tree ( Figure 2). Pruning devices can be conceived and integrated in iBP to improve  the performances of the pruning phase and thus of the algorithm.
In the present work, we first describe the branching phase and the pruning devices used to determine the solutions to the Distance Geometry problem. Then, an overall view of the method is given along with the use of the branching and pruning devices at different steps and the complexity of the algorithm is analyzed. We finally illustrate the algorithm application with three proteins for which α-helical regions are known along with few long-range NMR constraints (ie. constraints measured between residues i and j such that |i − j| > 3 in the protein sequence). The obtained conformations display good stereochemical quality parameters, and the conformational space explored is larger than the one sampled with traditional optimization methods such as simulated annealing.

Methods
In order to sample the conformational space of a protein, we use a Branch-and-Prune algorithm to build a tree in which each node represents a solution for one atomic position. We limit ourselves in the present work to the calculation of the backbone and Cβ atomic coordinates.
The constraints used to generate atomic coordinates along the Branch-and-Prune algorithm are the following: 1. covalent distance constraints corresponding to bond lengths and bond angles, whose values are derived from high-resolution small molecule X-ray crystal structures [25]; 2. NMR distance constraints; 3. van der Waals radii of atoms between non-bonded atom pairs (i, j): a fraction of the sum of the van der Waals radii of each atom provides a lower bound to the corresponding inter-atomic distances: where σ ∈ [0, 1], and is typically around 0.85. The values for the radii are given in Table 1 [26,27]. These lower bounds apply only in the cases where no larger lower bound has been determined from NMR distance constraints; 4. distances derived from the backbone torsion angles φ and ψ; 5. hydrogen bonds in α-helix; 6. amino-acid chirality; 7. α-helix geometry.
The atom coordinates are calculated, one by one, following the atom order P ato described in Figure 3 and previously proposed in [24]. In this order, some atoms are repeated to insure that any entered atom is defined by distance constraints with respect to three preceding atoms in P ato [24]. The carbonyl oxygens and the atoms Cβ, which were not present in the order P ato , are calculated separately.
Then, the tree is built using a recursive procedure to create each node of the tree. This procedure is called branching phase. The created nodes are then submitted to the pruning devices in order to decide whether the node should be kept or removed. If the node is removed, the possible branches starting from this node are also pruned. A pruning device is responsible for checking whether a partial solution is feasible, i.e. to check whether a set of embedded atoms fulfill the constraints (1)-(7) described above.
In the following, we describe the branching phase and the pruning devices. Then, the complexity of the algorithm is described from a theoretical point of view, before presenting some application cases.

Branching devices
The tree parsed during iBP is formed by nodes, each corresponding to one set of atomic coordinates from the order P ato (Figure 3) [24]. At each level of the tree, the atomic coordinates of the corresponding atom are calculated by making use of a recursive procedure, called branching phase. The current atom position is defined by distance constraints to three other atoms. These distances are obtained from the constraints (1-3) described above: (1) the covalent constraints, (2) the NMR distance constraints, (3) the van der Waals radii.

Figure 3
Order P ato of the atoms parsed during the branch-and-prune algorithm.
If the distance constraints specify a unique value rather than an interval, this signifies that the distances to three immediate predecessors from the current vertex are known: these are the centers of the three spheres, and the distances are the radii of these spheres. The position of the current vertex/atom is thus defined by the intersection of three spheres, so there are at most two solutions for the current atom position: this is called a 2-branching situation (Figure 4).
When a distance is not uniquely defined, but rather defined by lower and upper bounds, i.e. d i,j ∈[ l i,j , u i,j ], this distance is uniformly discretized by sampling b ≥ 1 values in [ l i,j , u i,j ], as depicted in Figure 5.
(2) Figure 4 Intersection of three spheres. Intersection of three spheres, colored in yellow, green and cyan. The two points produced by the intersection are indicated with red spots.
In this case, we have a b-branching situation. The algorithm used for calculating the atom coordinates is then applied to each set ofd i values sampled for the distance constraints. The choice of the discretization factor b is a crucial point: a small value might lead to an infeasible problem because we may not select any feasible distance; a larger value increases the computational burden. In general, the finer the discretization, the more accurate the computation is, but it is not trivial to figure out the optimal value for b. One way to choose b is to consider that the number of nodes in the search tree is bounded by 3 + (2 l b k ) , where l is the number of tree levels where we have a 2-branching situation, and k is the number of tree levels where we have a b-branching situation [28]. Appropriate values of b should result in a manageable number of nodes.
Given the position of the three previous atoms k −3, k − 2, k − 1 in the order P ato and given the constraints to these atoms of the atom k to be embedded, the position of k is calculated by a recursive matrix multiplication by making use of the set of distances d = {d k,k−1 , d k,k−2 , d k,k−3 } between the previous atoms and k. Although there are several methods to compute sphere intersections [29], in our experience, the best trade-off between efficiency and numerical stability is given by the use of recursion matrices [23], and of the two following angles: (i) the torsion angle ω 3 formed by atoms {k, k − 1, k − 2, k − 3} which depends on the distance between k and k − 3, (ii) the angle The recursion is applied through the equation: where: and σ ∈ {+1, −1}. The series of recursion matrices is initialized as: d 2,1 being the distance between the first and the second atom, and d 3,2 the distance between the third and the second atom in the order P ato . The total number of B k matrices to be calculated along the parsing of the tree is bounded by 2 |P ato | b, where |P ato | is the size of the ordered atom list P ato . The product Q k−1 B k is calculated in two steps: (1) the fourth column of Q k , which gives us the coordinates of k, is computed; (2) only if k is not pruned, the three remaining columns are computed.
We must distinguish two cases when embedding an atom k. If it is the first appearance of k in P ato , we use equation 3 to compute all possible embeddings of k for σ ∈ {+1, −1} and the set of distances d. If it is not the first appearance of k in P ato , we need to take into account the fact that numerical instabilities generate matrices which will lead to slightly different coordinates for k than those computed the first time. In order to decrease the impact of these numerical errors, we compute the set of distances d, the angles θ 2 , ω 3 and for σ ∈ {+1, −1} the corresponding . We choose the value of k that yields the updated coordinates of k being the closest to the previous coordinates of this atom.
Each carbonyl oxygen O i−1 is uniquely determined for residue i, once C i−1 , N i and H i have been embedded, since these atoms are all part of the peptide plane [30]. As is common practice (see, e.g., [31][32][33]), we fix here the torsion angle ω of the peptide plane to -180°or 0°. In a previous implementation [34], the positions of the carboxylic oxygens were not stored. Although this approach leads to memory savings, the availability of carboxylic oxygen positions can improve the definition of the α-helix secondary structure.
The positions of the carbonyl oxygens are thus now calculated in the following way. If k = O i−1 is the carboxylic oxygen atom located at the vertex k, and {v 1 , v 2 , v 3 } are the vertices corresponding to atoms {C i−1 , N i , H i }, belonging on the same peptide plane π , we denote n π the normal vector to π . The coordinates of k can then be computed by solving the following non-linear system: where d ki are the distances between atoms k and i. Using an approach similar to those employed in [35], we obtain the equivalent linear system: The parameter d k1 is the length of the bond connecting O i−1 and C i−1 , the parameters d k2 and d k3 are the distances between k = O i−1 and N i , H i , calculated from bond angles and bond lengths between atoms of the peptide plane, and the angle ω of 180°in a trans peptide plane. The case of the cis peptide plane can be treated in the same way, modifying the value of ω to be 0°.
Following the idea proposed for carbonyl oxygens, the coordinates k of a Cβ atom can be computed from previously calculated atoms, because the four distances of k to atoms {v 1 = Cα, v 2 = Hα, v 3 = N, v 4 = C} are exactly known, and because these five atoms are not coplanar. The coordinates k are calculated by solving the linear system: The parameter d k1 is the length of the bond connecting k = Cβ and Cα, the parameters d k2 , d k3 and d k4 are the distances between k = Cβ and Hα, N, C, calculated from bond angles and bond lengths between these atoms.

Pruning devices
Once the set of possible coordinates of the atom k has been determined in the branching phase described above, pruning devices are used to check whether the coordinates of k are feasible. In some cases described below, the coordinates of k along with the coordinates of previously embedded atoms are checked together. If the check is negative, the solution obtained for k is discarded, which prunes all tree branches originating from the node k. In this section, we present the pruning devices used to accept or discard the coordinates of the atom k generated by the branching devices. The pruning device applies all these tests as soon as the involved atoms have been embedded.

Direct distance feasibility (DDF)
As the coordinates for an atom k are determined, we first check that all distances between k and the other embedded atoms respect the given lower and upper bounds arising from the constraints (1-3) listed in section "Solving the DGP with iBP".

Torsion angle feasibility (TAF)
The values of the backbone torsion angles φ, ψ, are used as a pruning device, checking whether they are located in the permitted regions of the Ramachandran plot. The pruning device, first introduced in [34], is implemented in the following way. The torsion angle ξ ijkl defined by a quadruple of atoms {i, j, k, l} falls into a domain ijkl , up to a certain tolerance t > 0. In general, ijkl is the union of κ dis-joined intervals, i.e. ijkl = κ c=1 c ijkl (9) From the bounds on a torsion angle ξ ijkl it is possible to derive bounds on the distance d il , noticing that where: Taking the maximum and minimum values of d(ξ ijkl ) for ξ ijkl ∈ ijkl , we obtain an interval [l il , u il ] for the distance d il . The sign of the angle ξ ijkl is used as an additional pruning criterion along with the d il interval.

Dijkstra shortest-path (DSP)
As introduced in [23], we can exploit the fact that the distances are Euclidean to improve the iBP pruning capabilities. We extend and generalize the procedure presented in [36] in the following way. We introduce an auxiliary graph G + with the same topology as the graph connecting the atoms in the protein, but such that the weight of each edge (i, j) is the upper bound of the distance d ij . For every pair of atoms i, j, the shortest-path between i, j in G + is a valid over-estimate of d ij . Thus we used an all-toall shortest-path algorithm, the Floyd-Warshall algorithm [37], to refine the upper bound for each pair of atoms.
The Dijkstra Shortest-Path pruning device uses the refined upper bounds of inter-atomic distances in the following way. According to Lemma 4 in [23], for an atom k and for each atom pair i, j such that i < j < k in the order P ato and for which d ik is known, the embedding of k can be pruned if: (12) where u jk is the upper bound of the atom pair (j, k) obtained using the Floyd-Warshall algorithm [37].

Chirality (CHI)
The pruning of atom coordinates through the aminoacid chirality is implemented through the so-called CORN rule of thumb: in amino acids, the groups COOH, R (sidechain), NH2 and H are bonded to the chiral center Cα carbon. Starting with the hydrogen atom away from the viewer, if these groups are arranged clockwise around the Cα carbon, then the amino-acid is in the D-form. If these groups are arranged counter-clockwise, the amino-acid is in the L-form. The CORN rule was restated by imposing that the torsion angle defined by the atoms C, Cβ, N, Hα of residue i for the D-form or C, N, Cβ, Hα of residue i for the L-form, is positive.

α-helix secondary structure
We proposed the use of α helix information as a pruning device in the context of the iBP algorithm first in [34].
The α helix location can be determined from an analysis of the NMR chemical shifts by TALOS [38]. Four criteria are used to enforce the formation of an α helix: (i) the formation of backbone hydrogen bonds between amide hydrogens and carbonyl oxygens, (ii) the alignment of the amide and carbonyl functions checked by a qualitative condition on the energy of the hydrogen bond, (iii) the definition of backbone φ and ψ torsion angles already described in the Torsion Angle Feasibility, (iv) the definition of three additional angles θ , θ ' and θ " similar to the ones introduced by Grishaev et al. [39].
On a sequence of m + 1 contiguous residues I α = {i, i + 1, . . . , i + m} forming an α helix, for any pair of residues (i − 4, i) belonging to I α , the lower and upper bounds on the distance between the carboxylic oxygen O i−4 and the amide hydrogen H i should be compatible with the formation of an hydrogen bond. The upper and lower bounds are defined in an input parameter file of iBP, and were set to 1.9 and 3.0 Å in the present work.
The condition checking the alignment of atoms involved in the hydrogen bond is implemented by calculating a local energy information defined in the DSSP package [40]: with q 1 = 0.42, q 2 = 0.2 and f = 332, and d AB correspond to the distance between atoms A and B.

Implementation details
In this section we provide an overview of the main implementation features. The iBP algorithm has been coded in C++ with extensive use of template meta-programming [41], STL [42,43], and BOOST (www.boost.org). Linear systems, as for instance (7), are solved using the LAPACK library [44].
Discretizable DGP instances were represented by simple weighted undirected graphs G = (V , E, d), which were handled by the Boost Graph Library (BGL) [45]. The points in R 3 were represented using the Boost Geometry Library (also known as Generic Geometry Library, GGL: www.boost.org).
Constraints on distances, angles or energy are typically expressed by enforcing a variable x to take values in a domain D, which is generally the union of intervals and singletons: The Boost Interval Library (BIL -see [46,47]) was used to store such representation, and to perform basic operations for intervals and singletons. On top of the BIL, we define the type domain which contains a set of intervals and operations as intersection, scaling, etc. The BIL allows also to select the underlining data format for the interval (single/double precision real, integer).

Theory
In this section we give some details about the worstcase asymptotic complexity behavior of the iBP algorithm. The description given above includes many details which are useful for finding the structure of proteins but which somewhat complicate the precise mathematical treatment. We first give a very brief abstract description of the iBP and of the formal problem it solves, and then proceed to discuss its complexity.
Formally speaking, the DGP is the following decision problem: given an integer K > 0, a simple undirected graph G = (V , E) and an edge weight function d : E → R + , is there a realization x : V → R K such that for each {u, v} ∈ E we have x u − x v 2 = d uv ? Note that we are writing x u for x(u) and d uv for d (u, v). We also remark that in the more "applied" interpretation given in the preceding section, the range of the edge function d is IR + , i.e. the set of all non-negative closed real intervals, and K = 3. The DGP is NP-hard for any K > 1 and NP-complete for K = 1 [48]. Since we are interested in finding all solutions of the DGP rather than just one, we denote by X the set of all realizations of G.

Assumptions on the DGP input data
In fact, due to the fact that our data come from a protein structure setting, we can also make the following assumptions about G and d: 1. there is an order 1, 2, . . . , n on the vertices such that 1,2,3 is a triangle in the graph G and, for each vertex We remark that the above definitions can be appropriately extended to Euclidean spaces of any dimension K > 0, not just K = 3. We call E D the discretization edges and E P the pruning edges. Discretization edges ensure that the graph G is rigid, which implies that there are finitely many realizations of G in R K . Pruning edges make some of those realizations infeasible, and thereby make the solution set X smaller. A few remarks are in order: • we consider that distances which are known because of covalent bond relations are sufficiently precise to be represented by a scalar; • we consider that distances which are known from NOESY (or other) experiments can be represented by intervals; We remark that the order on V was initially intended to follow the protein backbone [49], but new orders which better exploit the hydrogen atoms in or close to the backbone have been defined in [50,51]: these are the orders on which the above assumptions are based.
The DGP with the restrictions above, but where all intervals are replaced by scalars, is called DISCRETIZ-ABLE MOLECULAR DGP (DMDGP). Both the DMDGP and its generalization to any K (denoted by K DMDGP) are NP-hard [52,53]. The problem defined above, involving intervals, obviously contains the DMDGP as a sub-case and is hence also NP-hard by inclusion.

When all distances are precise
We first focus on the simplest case, where all intervals are replaced by scalar values. Then d : E → R + , and b = 1. In this simplified setting, the iBP is simply called BP [52], and the order on V is called a contiguous trilateration order [54] or a DMDGP order [55].
The BP can be defined as a recursive procedure: assuming we already found a realization x 1 , . . . , x v−1 for the vertices 1, . . . , v − 1, and that we mean to find a consistent realization x v for v, the discretization edges E D guarantee that there will be at most two positions for x v compatible with the distances restricted to E D [49]. This can be intuitively understood in R 3 by considering the intersection of three spheres centered at : the first two spheres either do not meet or their intersection is in general a circle, and the intersection of the third sphere with this circle is either empty or consists in general of two points [56]. We can now consider the distances defined on pruning edges in E P , linking v to its preceding vertices in order to accept or reject these two points. For each accepted point we recursively call BP with v replaced by v + 1, for all v < n.
When v = n we have a valid realization of the graph: we save it in X, and proceed to complete the recursive search. This yields a search tree which is explored depth-first. The recursion starts after placing the initial triangle 1,2,3 (either arbitrarily or by using BP restricted to subspaces), so this tree starts branching at level 4. It can be proved that, at completion, X contains all incongruent (modulo translations and rotations) realizations of G.
In the case where E P = ∅, the search tree is a complete binary tree with 2 n−3 nodes at the n-th (and last) level: in other words, its depth is n and its width is 2 n−3 . This is the worst case, since the BP must explore all of the nodes in the tree, and proves that the BP (and hence the iBP, since it generalizes the BP) is an exponential-time algorithm in n.
When E P = ∅, it was shown that X almost always contains a number of solutions which is either zero or a power of two [55]; this discovery led to a set of results where the BP search tree width can be kept polynomial in n during the search [53]. Since the exponential behavior is only due to the tree width, this yields a set of cases where the BP is actually fixed-parameter tractable (FPT). Throughout all our experiments with protein data we were always able to fix the parameter controlling the exponential growth of the tree width to a universal constant, which makes BP "polynomial on proteins" (this is an informal statement -the precise statement is given in [53]).

Intervals and discrete distance sets
The theory supporting the case where d might map edges to discrete sets of distance values or intervals, which is the case treated in this paper, is not so clearly understood yet. As it generalizes the simpler case sketched above, in a certain sense it inherits its properties, but this is an oversimplification: for instance, if all intervals are [0, ∞], it is obvious that the problem is easy independently of the graph topology, since every realization is valid.
Some bounds on the cardinality of X in the presence of discrete sets and intervals are given in [55]. Our understanding is that if the intervals are small enough, the theory which led to fixed-parameter tractability goes through with few changes, but we have no way so far of establishing an aprioristic maximum width for the intervals. If the intervals are very large the problem might become tractable, as mentioned above, for the purposes of finding at least one solution. The iBP would still behave exponentially, however.

Results and discussion
We applied the presented algorithm to three examples of proteins displaying α helical secondary structures. Before presenting the obtained results, we emphasize that the method proposed here has a completely different philosophy than classical optimization approaches commonly used in the field of NMR structure determination. In the present approach, each constraint is treated in the strict sense, that is, no violation, however small, is tolerated. This is why we consistently use the word constraint in the paper. This is what potentially allows us to systematically explore the entire search space. However, the use of the procedure demands that the data have been pre-processed accordingly, and all geometric inconsistencies that exist in three-dimensional space have been removed.
For the proteins studied here, if one includes the ensemble of NMR interval distance constraints stored in the .mr file at the Protein Data Bank (PDB) [57] as well as all pruning devices described above, all solutions are pruned out, indicating that no solution to the distance geometry problem exists with the deposited data. This is not really surprising, since the optimization algorithms generally used in NMR structure determination are based on optimization of a target function or hybrid energy rather than on strict constraint satisfaction. That is, there is always a phase where the algorithm tries to find a trade-off when inconsistencies exist between constraints. The optimization thus produces solutions in which chemical and NMR constraints are optimized, but in which small violations are always present. These inconsistencies are present in any structure determination, in particular because distance constraints are imprecise, due to experimental limitations.
Since the data in the PDB for the examples presented here were not pre-processed the way our algorithm requires, we decided to use a subset of the stored data sets: the definition of α-helix regions and a few long-range distance constraints arbitrary selected from the set of NMR constraints for structures with more than one α-helix. In order to further reduce the risk of all solutions being pruned, we used tolerance values for atomic positions and angles between atoms ( Table 2).
The three examples we chose to illustrate the algorithm display an increasing structural complexity: (i) a single α helix, corresponding to the structure of peptide CM15 determined in micelles (PDB id: 2JMY [58]), (ii) an α helical hairpin (PDB id: 2KXA [59]), (iii) the insecticidial toxin TAITX-1a, formed as a bundle of four α helices, restrained by three disulphide bridges (PDB id: 2KSL). The main characteristics of the studied proteins are given in Table 2.
All three examples were originally determined by NMR spectroscopy, and the corresponding constraint lists are available from the PDB. The analysis by PROCHECK [60] of the Ramachandran diagram of these three PDB structures shows that more than 85% of the residues are located in the core region. For 2KXA and 2KSL, more than 95% of the residues are located in the core and allowed region, whereas in 2JMY, 7% of the residues are located in the generously allowed region. For 2KXA, one PRO residue was replaced by an ALA, as the PRO cycle has not yet been included in the current version of the iBP algorithm.
We generated conformations using the branching phase and the pruning devices described above. The long-range constraints added for the calculations of 2KXA and 2KSL, are: For all calculations, except the one of 2JMY with the α helix defined along the whole sequence, the obtained conformations were filtered according to the coordinate root mean-squared deviation (RMSD: 1.5 Å) with respect to the previously obtained conformation in the iBP procedure. Enforcing an RMSD value larger than 1.5 Å between two successively stored conformations, avoids an oversampling of the conformational space. Each calculation was stopped after storing 10000 filtered conformations. For our three examples, five calculations were performed in total: three on 2JMY with different definitions of the α helix (residues 1-15, 3-13 and 5-11), and one each for 2KXA and 2KSL. For the first calculation on 2JMY, one conformation was obtained and saved. The second and third calculations on 2JMY were quite short, of the order of minutes (Table 2), which is due to the small size of the corresponding tree. For the 2KXA and 2KSL calculations, 10000 conformations were obtained in about 30 mins of calculation. Large total numbers of conformations were generated: this number increases from ∼634,000 (2JMY_1) up to ∼3,400,000 (2KXA) with the size of the considered problem, depending on the number of residues and on the number of constraints. Despite 2KSL being the largest example, the second smallest number of conformations was generated, which is the sign of a severe pruning arising from a rather restricted conformational space.
The reliability of the obtained conformations was checked in three ways. First, the whole set of NMR constraints deposited along with the PDB entries and involving backbone hydrogens, were probed on the conformations. Second, the quality of the obtained  conformations was checked using PROCHECK [60] analysis of the Ramachandran plot. Third, the obtained conformations were clustered with an unsupervised clustering method, namely the self-organizing map or SOM [61][62][63][64], in order to investigate the properties of sampled conformations.
The agreement of the obtained conformations with the backbone NMR constraints deposited with the PDB structures was checked by calculating the distances between the backbone hydrogens in each obtained conformation. The distances larger than the upper bound of the constraint correspond to violations of this constraint. The mean number of violated constraints along with the mean value of the difference to the upper bound for these constraints were calculated on all conformations (Table 2). For the 2JMY calculation with the 1-15 α helix definition, no violation of the NMR constraints could be observed. As expected, when the α helix definition is reduced (2JMY_1 and 2JMY_2), the average number of violations increases as well as the average maximum violation. Not surprisingly, the most violated constraints involve residues located at the N and C terminal parts of the α-helix, TRP-2, PHE-5, LYS-3, LYS-6 and VAL-11, VAL-14, LEU-15 for 2JMY_1 and 2JMY_2. The largest violations and number of violations are of the same order or value for 2KXA than for 2JMY_1 and 2JMY_2. In contrast, the largest violations and number of violations are observed for 2KSL and involve residues CYS-33, GLU-34, PHE-38, TYR-43. Such over-restraining of NMR structures have been put in evidence in the past, through molecular dynamics simulations [65] and analysis of the structure quality [66].
The average number of violations is similar for 2JMY_2, 2KXA and 2KSL, but the average maximum violation for 2KSL is twice as large as that for 2JMY_2 and 2KXA. This might be due to the very restrained conformations of 2KSL, which contain three disulphide bridges. Due to this restrained conformation, the NMR constraint list is probably more prone to contain inconsistencies, and large mechanical strain can be stored in the structure if one uses an optimization procedure such as simulated annealing. In contrast, no mechanical strain whatsoever is generated by the iBP algorithm, and the obtained conformations might have a stronger tendency to deviate from the PDB conformations.
For each example, the obtained conformations were compared to the first conformation deposited in the PDB. Minimum RMSD values in the range 1.1-2.1 Å were obtained for all targets, except 2KSL for which the minimum RMSD value was 3.0 Å. Thus the Branch-and-Prune algorithm was able to capture conformations close to the PDB conformations, the larger value obtained for 2KSL arising from the larger mechanical strain quoted above.
For each calculation, the conformation displaying the smallest number of NMR constraint violations was compared to the first conformation deposited in the PDB. The RMSD values are smaller than 1.5 Å for 2JMY and 2KXA. This shows that, in the context of the iBP algorithm, the measured NMR constraints also push the structure toward the PDB structure. For 2JMY_1 and 2JMY_2, the RMSD value increases since the definition of the α helical region is shorter. For 2KSL, the conformation displaying the smallest number of constraint violations, displays an RMSD of 3.5Å with the PDB first conformation, which agrees with the maximum number of violations observed for this protein and with the minimum RMSD with the PDB structure analyzed above.
From the PROCHECK [60] analysis, the percentage of residues located in core and allowed Ramachandran regions, is larger than 95% for all targets except 2JMY_1, 2JMY_2, for which the percentages are about 80% due to the reduced definition of the α helix. For all targets, the percentage of residues in disallowed regions is equal to zero. The relatively important percentage of residues located in the allowed region may arise from the systematic exploration performed by the Branch-and-Prune algorithm, the strict nature of the constraints, and the nature of the pruning devices.
In order to further probe the robustness of the proposed algorithm, iBP calculations on 2KXA and 2KSL have been performed, using input data degraded in the following way: (i) the length of each α helix has been reduced by 1 residues at each extremity, (ii) the lower and upper bounds of the long-range distance constraints have been increased by 0.5 Å. The introduction of this noise into the α helical and long-range constraints makes the iBP solution moving apart from the PDB structure, as the minimum RMSD to PDB structure changes from 1.1 to 2 Å for 2KXA, and from 3.0 to 4.3 Å for 2KSL. Nevertheless, the quality of the Ramachandran diagram remains satisfying, with 93.3% and 95.4% of the residues located in the core and allowed regions of the Ramachandran plot for 2KXA and 2KSL.
The conformations were clustered using a selforganizing map (SOM) approach [62,63], on which the coordinate RMSD values between the conformers obtained by Branch-and-Prune and the corresponding PDB structure, were projected on the SOMs ( Figure 6). These RMSD values lay in the 1.3-3.2 Å range for 2JMY_1, in the 2.4-4.9 Å range for 2JMY_2, in the 1.5-4.0 Å range for 2KXA, and in the 3.2-6.0 Å for 2KSL.
In the SOMs for the four calculations ( Figure 6), the RMSD values are colored according to their RMSD from the PDB entry, violet color indicating values smaller than the median value of the sampled RMSD value, green color indicating RMSD values larger than this median value. For 2JMY_1, 2KXA and 2KSL, a larger number of neurons of the SOMs belongs to the second group, which is the sign of an enhanced sampling of the conformational space with respect to the region sampled by simulated annealing. For 2JMY_2, the inverse picture is observed, which may arise from the more limited conformational space available to be sampled for a unique α-helix.
In 2KSL and 2KXA SOMs, the protein conformations corresponding to the region displaying the smallest coordinate RMSD values with respect to the PDB structure, were extracted (Figure 7). These sets of conformers are similar to the superimposed conformations obtained in a usual NMR calculation. Figure 6 Clustering of the conformations obtained by the iBP algorithm. Self-organizing maps describing the clustering of the conformations obtained by the iBP algorithm on 2JMY, 2KXA and 2KSL. The contour plots (lines) represent the local similarity between the clustered conformations. The color scales (on plot left) extend from blue to red (from very similar to very dissimilar conformations). The small red points are drawn on the SOM neuron for which the largest local similarity is observed between conformations. Each SOM neuron is colored according to the average value of the coordinates RMSD of the neuron conformations with respect to the PDB structure. The color scales extend (on plot right) from purple to green (from very similar to very dissimilar to the PDB structure). The similarity between SOM neurons as well as the RMSD to the PDB structure are expressed in Å for comparison purposes.

Conclusions
We proposed here a Branch-and-Prune algorithm (iBP) to solve the Distance Geometry Problem, in order to sample exhaustively the conformational space of the backbone of α-helical proteins. The iBP algorithm bears a very slight reminiscence to variable target function approaches for example implemented in DISMAN [67], due to the sequential nature of introducing constraints and non-bonded interactions. However, the precise way of introducing the constraints and non-bonded interactions differs significantly, and DISMAN does not systematically search space but is an optimization approach.
We introduced new pruning devices integrated in the iBP algorithm for DGP with intervals and we tested our iBP implementation on the backbones of α-helical proteins. Several pruning devices have been designed to enforce amino-acid chirality, α-helix geometry and van der Waals steric hindrance. The algorithm allowed to efficiently reconstruct backbone conformations of three α-helical peptides, of various sizes, and for which the structure were previously solved by NMR. The obtained solutions satisfy most of the NMR constraints involving backbone hydrogen bonds, and display very acceptable Ramachandran statistics. The present work represents a first successful step on the way to reconstruct protein structures using a branch-and-prune algorithm applied to the Distance Geometry problem.
Applications where this approach could have significant advantages are cases where there are few distances defining the tertiary structure of a protein, where it is important to characterize the space of all solutions. It might also be useful as part iterative automated assignment algorithms such as ARIA [68], CYANA [69] or UNIO [70], where in a first iteration all solutions compatible with a few unambiguous long-range constraints could be generated to reduce the ambiguity of the remaining constraints. Another application of the approach proposed here would be to provide input molecular conformations to model the structure of multi-subunit complexes into an electron microscopy density map [71].
Some limitations of the current version of iBP prevent for the moment its use with real nuclear Overhauser effect (NOE) data. These limitations are the use of unambiguous distance constraints, the non-inclusion of protein side-chains, the loss of information intervals and the appropriate weighting of the various constraints in order to overcome the inconsistencies contained among the whole constraint set. Protein side-chains can be added to the protein backbone afterward. The discretization of circle arcs could be tackled using algebraic geometry and geometric algebra approaches [72]. The Bayesian approach [73] developed for the objective weighting of various NMR contraints according to the data quality could be used to alleviate the inconsistency problems. The use of unambiguous distance constraints is probably the most unavoidable aspect of the current set-up of the algorithm.