An algorithm to enumerate all possible protein conformations verifying a set of distance constraints

Cassioli, Andrea; Bardiaux, Benjamin; Bouvier, Guillaume; Mucherino, Antonio; Alves, Rafael; Liberti, Leo; Nilges, Michael; Lavor, Carlile; Malliavin, Thérèse E

doi:10.1186/s12859-015-0451-1

Methodology Article
Open access
Published: 28 January 2015

An algorithm to enumerate all possible protein conformations verifying a set of distance constraints

Andrea Cassioli⁴,
Benjamin Bardiaux^1,2,
Guillaume Bouvier^1,2,
Antonio Mucherino⁶,
Rafael Alves⁴,
Leo Liberti^4,5,
Michael Nilges^1,2,
Carlile Lavor³ &
…
Thérèse E Malliavin^1,2

BMC Bioinformatics volume 16, Article number: 23 (2015) Cite this article

3668 Accesses
34 Citations
1 Altmetric
Metrics details

Abstract

Background

The determination of protein structures satisfying distance constraints is an important problem in structural biology. Whereas the most common method currently employed is simulated annealing, there have been other methods previously proposed in the literature. Most of them, however, are designed to find one solution only.

Results

In order to explore exhaustively the feasible conformational space, we propose here an interval Branch-and-Prune algorithm (iBP) to solve the Distance Geometry Problem (DGP) associated to protein structure determination. This algorithm is based on a discretization of the problem obtained by recursively constructing a search space having the structure of a tree, and by verifying whether the generated atomic positions are feasible or not by making use of pruning devices. The pruning devices used here are directly related to features of protein conformations.

Conclusions

We described the new algorithm iBP to generate protein conformations satisfying distance constraints, that would potentially allows a systematic exploration of the conformational space. The algorithm iBP has been applied on three α-helical peptides.

Background

Protein structure determination is crucial for understanding protein function, as it paves the way to the discovery of new chemical compounds and of new approaches to control the biological processes.

The problem of protein structure determination is certainly a problem with multiple solutions, as proteins are flexible polymers. As most of the experimental techniques of the structural biology obtain measurements averaged on an ensemble of protein conformations, the usual approaches for structure determination intend to find an average structure or a set of conformations describing fluctuations around an average structure. A path intending to get a complete coverage of the conformational space, given a series of constraints, is usually not taken, although such an approach could provide precious information about the conformational equilibrium, which is essential in the function of many proteins, as the HIV protease [1].

An important class of experimental methods for protein structure determination is based on the measurement of inter-atomic distances and angles, such as Nuclear Magnetic Resonance (NMR) spectroscopy [2] and cross-linking coupled to mass spectrometry [3]. In NMR, distance intervals between hydrogens are determined from the measurement of nuclear Overhauser effects (NOE). The experimentally measured distances are then used as constraints for protein structure calculations. Pure in silico approaches have been also developed based on the use of inter-atomic distance constraints, such as homology modeling [4] or prediction of protein-protein complexes [5] and ligand poses [6].

The Distance Geometry Problem (DGP) [7,8] consists in identifying the sets of points which satisfy a set of constraints based on relative distances between some pairs of such points. The present work describes an algorithm developed to solve DGP in the context of protein structure determination: the points represent the protein atoms.

The DGP is a constraint satisfaction problem. Several approaches solve this problem by reformulating it [8] as a global optimization problem having a continuous search domain, and whose objective function is generally a penalty function designed to measure the violation of the distance constraints. Over the years, the solution of DGPs arising in structural biology have been typically attempted by Simulated Annealing (SA) approaches based on molecular dynamics [9]. Other proposed approaches are based on various optimization methods as in [10]. As all meta-heuristic approaches, SA may provide approximate solutions but does not deliver optimality certificates. In the case of protein structure determination, since the optimization problem is a reformulation of a constraint satisfaction problem, solutions given by SA can be successively verified by checking the violations of the distance constraints. However, additional solutions may exist but go undetected by SA. Thus, an algorithm for the systematic enumeration of the possible conformations of a given protein could find a widespread field of application. Branch-and-prune algorithms and similar were proposed in the general context of protein structure determination [11-16], (see also [8] and references therein). However, these studies primarily addressed the question of defining relative orientations of protein monomers in symmetric oligomers, not the determination of all possible conformation of a polypeptide chain with a very large number of degrees of freedom from distance constraints. Systematic exploration was proved to be useful in the case of residual dipolar couplings (RDC) constraints [17], for exploring the sidechains conformations [18,19] and for assignment of NOEs, provided that the structure is known [20]. For the structure determination from RDCs, it has been shown [21] that when using RDCs but only sparse NOEs the problem can be solved in polynomial time. Such approaches have also been used for structure determination in X-ray crystallography for non-crystallographic symmetry by orienting and translating symmetric protein subunits [22]. To the best of our knowledge, in this paper we present the first application of a Branch-and-Prune algorithm to the problem of full protein structure determination based on unambiguous distance information.

Under certain conditions, DGPs can be discretized [23] (see below), which means that the search domain for the corresponding optimization problem can be reduced to a discrete set, which has the structure of a tree. The discretization makes the enumeration of the entire solution set of DGP instances possible. This is important when the experimental constraints do not specify the protein conformation uniquely, i.e., more than one conformation satisfies all constraints. For solving discretized DGP, we employ an interval branch-and-prune (iBP) algorithm [24], which is based on the idea of recursively exploring the tree while generating new candidate atomic positions (branching phase) and to verify the feasibility of such positions (pruning phase) (Figure 1). By making use of pruning devices, branches rooted at infeasible positions can be discarded from the tree, so that the search can be reduced to the feasible parts of the tree (Figure 2). Pruning devices can be conceived and integrated in iBP to improve the performances of the pruning phase and thus of the algorithm.

In the present work, we first describe the branching phase and the pruning devices used to determine the solutions to the Distance Geometry problem. Then, an overall view of the method is given along with the use of the branching and pruning devices at different steps and the complexity of the algorithm is analyzed. We finally illustrate the algorithm application with three proteins for which α-helical regions are known along with few long-range NMR constraints (ie. constraints measured between residues i and j such that |i−j|>3 in the protein sequence). The obtained conformations display good stereochemical quality parameters, and the conformational space explored is larger than the one sampled with traditional optimization methods such as simulated annealing.

Methods

In order to sample the conformational space of a protein, we use a Branch-and-Prune algorithm to build a tree in which each node represents a solution for one atomic position. We limit ourselves in the present work to the calculation of the backbone and C β atomic coordinates.

The constraints used to generate atomic coordinates along the Branch-and-Prune algorithm are the following:

1.
covalent distance constraints corresponding to bond lengths and bond angles, whose values are derived from high-resolution small molecule X-ray crystal structures [25];
2.
NMR distance constraints;
3.
van der Waals radii of atoms between non-bonded atom pairs (i,j): a fraction of the sum of the van der Waals radii of each atom provides a lower bound to the corresponding inter-atomic distances:
$$ d_{ij}\geq \sigma (r^{vdw}_{i} + r^{vdw}_{j}), $$
((1))

where σ∈ [ 0,1], and is typically around 0.85. The values for the radii are given in Table 1 [26,27]. These lower bounds apply only in the cases where no larger lower bound has been determined from NMR distance constraints;
Table 1 Van der Waals radii (see [ 26 ] and [ 27 ])
Full size table
4.
distances derived from the backbone torsion angles ϕ and ψ;
5.
hydrogen bonds in α-helix;
6.
amino-acid chirality;
7.
α-helix geometry.

The atom coordinates are calculated, one by one, following the atom order P _ato described in Figure 3 and previously proposed in [24]. In this order, some atoms are repeated to insure that any entered atom is defined by distance constraints with respect to three preceding atoms in P _ato [24]. The carbonyl oxygens and the atoms C β, which were not present in the order P _ato, are calculated separately.

Then, the tree is built using a recursive procedure to create each node of the tree. This procedure is called branching phase. The created nodes are then submitted to the pruning devices in order to decide whether the node should be kept or removed. If the node is removed, the possible branches starting from this node are also pruned. A pruning device is responsible for checking whether a partial solution is feasible, i.e. to check whether a set of embedded atoms fulfill the constraints (1)-(7) described above.

In the following, we describe the branching phase and the pruning devices. Then, the complexity of the algorithm is described from a theoretical point of view, before presenting some application cases.

Branching devices

The tree parsed during iBP is formed by nodes, each corresponding to one set of atomic coordinates from the order P _ato (Figure 3) [24]. At each level of the tree, the atomic coordinates of the corresponding atom are calculated by making use of a recursive procedure, called branching phase. The current atom position is defined by distance constraints to three other atoms. These distances are obtained from the constraints (1-3) described above: (1) the covalent constraints, (2) the NMR distance constraints, (3) the van der Waals radii.

If the distance constraints specify a unique value rather than an interval, this signifies that the distances to three immediate predecessors from the current vertex are known: these are the centers of the three spheres, and the distances are the radii of these spheres. The position of the current vertex/atom is thus defined by the intersection of three spheres, so there are at most two solutions for the current atom position: this is called a 2-branching situation (Figure 4).

When a distance is not uniquely defined, but rather defined by lower and upper bounds, i.e. d _i,j∈[l _i,j,u _i,j], this distance is uniformly discretized by sampling b≥1 values in [l _i,j,u _i,j], as depicted in Figure 5.

$$ \tilde d_{i}=\left\{ l_{i,i-3} + (t-1)\frac{(u_{i,i-3}-l_{i,i-3})}{b} : t=1,\ldots,b\right\}. $$

((2))

In this case, we have a b-branching situation.

The algorithm used for calculating the atom coordinates is then applied to each set of $\tilde {d}_{i}$ values sampled for the distance constraints. The choice of the discretization factor b is a crucial point: a small value might lead to an infeasible problem because we may not select any feasible distance; a larger value increases the computational burden. In general, the finer the discretization, the more accurate the computation is, but it is not trivial to figure out the optimal value for b. One way to choose b is to consider that the number of nodes in the search tree is bounded by 3+(2^l b ^k), where l is the number of tree levels where we have a 2-branching situation, and k is the number of tree levels where we have a b-branching situation [28]. Appropriate values of b should result in a manageable number of nodes.

Given the position of the three previous atoms k−3, k−2, k−1 in the order P _ato and given the constraints to these atoms of the atom k to be embedded, the position of k is calculated by a recursive matrix multiplication by making use of the set of distances d={d _k,k−1,d _k,k−2,d _k,k−3} between the previous atoms and k. Although there are several methods to compute sphere intersections [29], in our experience, the best trade-off between efficiency and numerical stability is given by the use of recursion matrices [23], and of the two following angles: (i) the torsion angle ω ₃ formed by atoms {k,k−1,k−2,k−3} which depends on the distance between k and k−3, (ii) the angle θ ₂ formed by atoms {k,k−1,k−2}.

The recursion is applied through the equation:

$$ \begin{aligned} \left[\begin{array}{c} x_{k} \\ y_{k} \\ z_{k} \\ 1 \end{array} \right] &= B_{1} B_{2} B_{3} \ldots B_{k}(d,\sigma) \left[\begin{array}{c} 0 \\ 0 \\ 0 \\ 1 \end{array}\right]\\ &= Q_{k-1}B_{k}(d,\sigma) \left[\begin{array}{c} 0 \\ 0 \\ 0 \\ 1 \end{array}\right] = Q_{k} \left[\begin{array}{c} 0 \\ 0 \\ 0 \\ 1 \end{array}\right], \end{aligned} $$

((3))

where:

$$ {\fontsize{7}{6}\begin{aligned} B_{k} (d,\sigma) = \left[\begin{array}{cccc} -\cos\theta_{2} & -\sigma \sin\theta_{2} &0& -d_{k,k-1}\cos\theta_{2}\\ \sigma\sin\theta_{2}\cos\omega_{3} & -\cos\theta_{2}\cos\omega_{3} & -\sin\omega_{3}& \sigma d_{k,k-1} \sin\theta_{2}\cos\omega_{3}\\ \sigma\sin\theta_{2}\sin\omega_{3} & -\cos\theta_{2} \sin\omega_{3}&\cos\omega_{3} &\sigma d_{k,k-1} \sin\theta_{2}\sin\omega_{3}\\ 0 & 0& 0&1 \end{array}\right], \end{aligned}} $$

((4))

and σ∈{+1,−1}. The series of recursion matrices is initialized as:

$$ \begin{aligned} B_{1}= \left[\begin{array}{cccc} 1 & 0& 0&0\\ 0 & 1& 0&0\\ 0 & 0& 1&0\\ 0 & 0& 0&1 \end{array}\right], B_{2}= \left[\begin{array}{cccc} -1 & 0& 0& -d_{2,1}\\ 0 & 1& 0&0\\ 0 & 0& -1&0\\ 0 & 0& 0&1 \end{array}\right],\\ B_{3}= \left[\begin{array}{cccc} -\cos\theta_{3} & -\sin\theta_{3} &0& -d_{3,2}\cos\theta_{3}\\ \sin\theta_{3}& -\cos\theta_{3} & 0 & d_{3,2} \cos\theta_{3}\\ 0 & 0& 1&0\\ 0 & 0& 0&1 \end{array}\right]. \end{aligned} $$

((5))

d _2,1 being the distance between the first and the second atom, and d _3,2 the distance between the third and the second atom in the order P _ato.

The total number of B _k matrices to be calculated along the parsing of the tree is bounded by 2∣ P _ato ∣b, where ∣ P _ato ∣ is the size of the ordered atom list P _ato. The product Q _k−1 B _k is calculated in two steps: (1) the fourth column of Q _k, which gives us the coordinates of k, is computed; (2) only if k is not pruned, the three remaining columns are computed.

We must distinguish two cases when embedding an atom k. If it is the first appearance of k in P _ato, we use equation 3 to compute all possible embeddings of k for σ∈{+1,−1} and the set of distances d. If it is not the first appearance of k in P _ato, we need to take into account the fact that numerical instabilities generate matrices which will lead to slightly different coordinates for k than those computed the first time. In order to decrease the impact of these numerical errors, we compute the set of distances d, the angles θ ₂,ω ₃ and for σ∈{+1,−1} the corresponding matrices B _k(d,+1),B _k(d,−1), which lead to two possible embeddings of k (Equation 3), as k ⁺=Q _k−1 B _k(d,+1) and k ⁻=Q _k−1 B _k(d,−1). We choose the value of k that yields the updated coordinates of k being the closest to the previous coordinates of this atom.

Each carbonyl oxygen O ⁱ⁻¹ is uniquely determined for residue i, once C ⁱ⁻¹, N ⁱ and H ⁱ have been embedded, since these atoms are all part of the peptide plane [30]. As is common practice (see, e.g., [31-33]), we fix here the torsion angle ω of the peptide plane to -180° or 0°. In a previous implementation [34], the positions of the carboxylic oxygens were not stored. Although this approach leads to memory savings, the availability of carboxylic oxygen positions can improve the definition of the α-helix secondary structure.

The positions of the carbonyl oxygens are thus now calculated in the following way. If k=O ⁱ⁻¹ is the carboxylic oxygen atom located at the vertex k, and {v ₁,v ₂,v ₃} are the vertices corresponding to atoms {C ⁱ⁻¹,N ⁱ,H ⁱ}, belonging on the same peptide plane π, we denote n _π the normal vector to π. The coordinates of k can then be computed by solving the following non-linear system:

$$ \left\{ \begin{array}{ll} \| k - v_{i} \|^{2} = d_{ki}^{2}, & i=1,2,3\\ n_{\pi}^{T} (v_{1} - k) = 0& \end{array} \right.. $$

((6))

where d _ki are the distances between atoms k and i. Using an approach similar to those employed in [35], we obtain the equivalent linear system:

$$ \left\{ \begin{array}{l} 2 (v_{2} - v_{1})^{T} k = d_{k1}^{2} - d_{k2}^{2} -\|v_{1}\|^{2} +\|v_{2}\|^{2}\\ 2 (v_{3} - v_{1})^{T} k = d_{k1}^{2} - d_{k3}^{2} -\|v_{1}\|^{2} +\|v_{3}\|^{2}\\ n_{\pi}^{T} (v_{1} - k) = 0 \end{array} \right. $$

((7))

The parameter d _k1 is the length of the bond connecting O ⁱ⁻¹ and C ⁱ⁻¹, the parameters d _k2 and d _k3 are the distances between k=O ⁱ⁻¹ and N ⁱ, H ⁱ, calculated from bond angles and bond lengths between atoms of the peptide plane, and the angle ω of 180° in a trans peptide plane. The case of the cis peptide plane can be treated in the same way, modifying the value of ω to be 0°.

Following the idea proposed for carbonyl oxygens, the coordinates k of a C β atom can be computed from previously calculated atoms, because the four distances of k to atoms {v ₁=C α,v ₂=H α,v ₃=N,v ₄=C} are exactly known, and because these five atoms are not coplanar. The coordinates k are calculated by solving the linear system:

$$ \left\{ \begin{aligned} 2 (v_{2} - v_{1})^{T} k = d_{k1}^{2} - d_{k2}^{2} -\|v_{1}\|^{2} +\|v_{2}\|^{2}\\ 2 (v_{3} - v_{1})^{T} k = d_{k1}^{2} - d_{k3}^{2} -\|v_{1}\|^{2} +\|v_{3}\|^{2}\\ 2 (v_{4} - v_{1})^{T} k = d_{k1}^{2} - d_{k4}^{2} -\|v_{1}\|^{2} +\|v_{4}\|^{2} \end{aligned} \right. $$

((8))

The parameter d _k1 is the length of the bond connecting k=C β and C α, the parameters d _k2, d _k3 and d _k4 are the distances between k=C β and H α, N, C, calculated from bond angles and bond lengths between these atoms.

Pruning devices

Once the set of possible coordinates of the atom k has been determined in the branching phase described above, pruning devices are used to check whether the coordinates of k are feasible. In some cases described below, the coordinates of k along with the coordinates of previously embedded atoms are checked together. If the check is negative, the solution obtained for k is discarded, which prunes all tree branches originating from the node k. In this section, we present the pruning devices used to accept or discard the coordinates of the atom k generated by the branching devices. The pruning device applies all these tests as soon as the involved atoms have been embedded.

Direct distance feasibility (DDF)

As the coordinates for an atom k are determined, we first check that all distances between k and the other embedded atoms respect the given lower and upper bounds arising from the constraints (1-3) listed in section “Solving the DGP with iBP”.

Torsion angle feasibility (TAF)

The values of the backbone torsion angles ϕ,ψ, are used as a pruning device, checking whether they are located in the permitted regions of the Ramachandran plot. The pruning device, first introduced in [34], is implemented in the following way. The torsion angle ξ _ijkl defined by a quadruple of atoms {i,j,k,l} falls into a domain Ξ _ijkl, up to a certain tolerance ε _t>0. In general, Ξ _ijkl is the union of κ dis-joined intervals, i.e.

$$ \Xi_{ijkl} = \bigcup\limits_{c=1}^{\kappa} \Xi_{ijkl}^{c} $$

((9))

From the bounds on a torsion angle ξ _ijkl it is possible to derive bounds on the distance d _il, noticing that

$$ d_{il}(\xi_{ijkl}) = \sqrt{d_{ij}^{2} + d_{lj}^{2} - 2(\cos(\xi_{ijkl})\sqrt{ef} + bc) d_{ij}d_{lj} }, $$

((10))

where:

$$ \begin{aligned} b&= \frac{1}{2}\frac{d_{lj}^{2} + d_{jk}^{2} - d_{lk}^{2}}{d_{lj}d_{kj}} \\ c&= \frac{1}{2}\frac{d_{ij}^{2} + d_{jk}^{2} - d_{ik}^{2}}{d_{ij}d_{jk}}\\ e&= 1-b^{2}, f= 1-c^{2}.\\ \end{aligned} $$

((11))

Taking the maximum and minimum values of d(ξ _ijkl) for ξ _ijkl∈Ξ _ijkl, we obtain an interval [ l _il,u _il] for the distance d _il. The sign of the angle ξ _ijkl is used as an additional pruning criterion along with the d _il interval.

Dijkstra shortest-path (DSP)

As introduced in [23], we can exploit the fact that the distances are Euclidean to improve the iBP pruning capabilities. We extend and generalize the procedure presented in [36] in the following way. We introduce an auxiliary graph G ⁺ with the same topology as the graph connecting the atoms in the protein, but such that the weight of each edge (i,j) is the upper bound of the distance d _ij. For every pair of atoms i,j, the shortest-path between i,j in G ⁺ is a valid over-estimate of d _ij. Thus we used an all-to-all shortest-path algorithm, the Floyd-Warshall algorithm [37], to refine the upper bound for each pair of atoms.

The Dijkstra Shortest-Path pruning device uses the refined upper bounds of inter-atomic distances in the following way. According to Lemma 4 in [23], for an atom k and for each atom pair i,j such that i<j<k in the order P _ato and for which d _ik is known, the embedding of k can be pruned if:

$$ \|i-j\| - d_{ik}> u_{jk} $$

((12))

where u _jk is the upper bound of the atom pair (j,k) obtained using the Floyd-Warshall algorithm [37].

Chirality (CHI)

The pruning of atom coordinates through the amino-acid chirality is implemented through the so-called CORN rule of thumb: in amino acids, the groups COOH, R (sidechain), NH2 and H are bonded to the chiral center C α carbon. Starting with the hydrogen atom away from the viewer, if these groups are arranged clockwise around the C α carbon, then the amino-acid is in the D-form. If these groups are arranged counter-clockwise, the amino-acid is in the L-form. The CORN rule was restated by imposing that the torsion angle defined by the atoms C,C β,N,H α of residue i for the D-form or C,N,C β,H α of residue i for the L-form, is positive.

α-helix secondary structure

We proposed the use of α helix information as a pruning device in the context of the iBP algorithm first in [34]. The α helix location can be determined from an analysis of the NMR chemical shifts by TALOS [38]. Four criteria are used to enforce the formation of an α helix: (i) the formation of backbone hydrogen bonds between amide hydrogens and carbonyl oxygens, (ii) the alignment of the amide and carbonyl functions checked by a qualitative condition on the energy of the hydrogen bond, (iii) the definition of backbone ϕ and ψ torsion angles already described in the Torsion Angle Feasibility, (iv) the definition of three additional angles θ, θ’ and θ” similar to the ones introduced by Grishaev et al. [39].

On a sequence of m+1 contiguous residues I _α={i,i+1,…,i+m} forming an α helix, for any pair of residues (i−4,i) belonging to I _α, the lower and upper bounds on the distance between the carboxylic oxygen O ⁱ⁻⁴ and the amide hydrogen H ⁱ should be compatible with the formation of an hydrogen bond. The upper and lower bounds are defined in an input parameter file of iBP, and were set to 1.9 and 3.0 Å in the present work.

The condition checking the alignment of atoms involved in the hydrogen bond is implemented by calculating a local energy information defined in the DSSP package [40]:

$$ q_{1}q_{2}\!\left[ \frac{1}{d_{O_{i-4}N_{i}}}\! +\! \frac{1}{d_{C_{i-4}H_{i}}}- \frac{1}{d_{O_{i-4}H_{i}}}\! - \!\frac{1}{d_{C_{i-4}N_{i}}} \right]\cdot f< -0.5, $$

((13))

with q ₁=0.42,q ₂=0.2 and f=332, and d _AB correspond to the distance between atoms A and B.

The last criterion enforces the angles θ, θ’, θ” to be respectively into the interval values 0/70°, 0/90° and 110/180°.

Implementation details

In this section we provide an overview of the main implementation features. The iBP algorithm has been coded in C++ with extensive use of template meta-programming [41], STL [42,43], and BOOST (www.boost.org). Linear systems, as for instance (7), are solved using the LAPACK library [44].

Discretizable DGP instances were represented by simple weighted undirected graphs G=(V,E,d), which were handled by the Boost Graph Library (BGL) [45]. The points in $\mathbb {R}^{3}$ were represented using the Boost Geometry Library (also known as Generic Geometry Library, GGL: www.boost.org).

Constraints on distances, angles or energy are typically expressed by enforcing a variable x to take values in a domain , which is generally the union of intervals and singletons:

$$ \mathcal{D} =\left\{ \bigcup_{j=1}^{m} \bar x_{j} \right\}\cup \left\{ \bigcup_{i=1}^{k} \left[{x_{i}^{l}},{x_{i}^{u}}\right]\right\}. $$

((14))

The Boost Interval Library (BIL – see [46,47]) was used to store such representation, and to perform basic operations for intervals and singletons. On top of the BIL, we define the type domain which contains a set of intervals and operations as intersection, scaling, etc. The BIL allows also to select the underlining data format for the interval (single/double precision real, integer).

Theory

In this section we give some details about the worst-case asymptotic complexity behavior of the iBP algorithm. The description given above includes many details which are useful for finding the structure of proteins but which somewhat complicate the precise mathematical treatment. We first give a very brief abstract description of the iBP and of the formal problem it solves, and then proceed to discuss its complexity.

Formally speaking, the DGP is the following decision problem: given an integer K>0, a simple undirected graph G=(V,E) and an edge weight function $d:E\to \mathbb {R}_{+}$, is there a realization $x:V\to \mathbb {R}^{K}$ such that for each {u,v}∈E we have ∥x _u−x _v∥₂=d _uv? Note that we are writing x _u for x(u) and d _uv for d(u,v). We also remark that in the more “applied” interpretation given in the preceding section, the range of the edge function d is $\mathbb {IR}_{+}$, i.e. the set of all non-negative closed real intervals, and K=3. The DGP is NP-hard for any K>1 and NP-complete for K=1 [48]. Since we are interested in finding all solutions of the DGP rather than just one, we denote by X the set of all realizations of G.

Assumptions on the DGP input data

In fact, due to the fact that our data come from a protein structure setting, we can also make the following assumptions about G and d:

1.
there is an order 1,2,…,n on the vertices such that 1,2,3 is a triangle in the graph G and, for each vertex v>3, v is adjacent to v−1,v−2,v−3;
2.
the set of edges E can be partitioned in two subsets E _D and E _P, such that E _P consists of all edges {u,v} with v>4 and |v−u|>3, and $E_{D}=E\smallsetminus E_{P}$;
3.
E _D can be further subdivided in $E_{D}^{\prime }$ and $E_{D}^{\prime \prime }$, so that $E_{D}^{\prime \prime }$ consists of all edges {u,v} with |v−u|=3, and $E_{D}^{\prime }=E_{D}\smallsetminus E_{D}^{\prime \prime }$;
4.
the distance function d is such that: (a) d _uv is a scalar for each $\{u,v\}\in E^{\prime }_{D}$; (b) d _uv consists of a discrete set of b scalars for each $\{u,v\}\in E^{\prime \prime }_{D}$; (c) d _uv is a general interval for all {u,v}∈E _P.

We remark that the above definitions can be appropriately extended to Euclidean spaces of any dimension K>0, not just K=3. We call E _D the discretization edges and E _P the pruning edges. Discretization edges ensure that the graph G is rigid, which implies that there are finitely many realizations of G in $\mathbb {R}^{K}$. Pruning edges make some of those realizations infeasible, and thereby make the solution set X smaller. A few remarks are in order:

we consider that distances which are known because of covalent bond relations are sufficiently precise to be represented by a scalar;
we consider that distances which are known from NOESY (or other) experiments can be represented by intervals;
we assume that a limited number of the intervals can be discretized into sets containing a finite number b of values within the intervals;
the edges in $E_{D}^{\prime }$ represent atom pairs of the form {v,v−1} or {v,v−2} for any v>2: these are involved in covalent bonds;
the edges in $E_{D}^{\prime \prime }$ represent atom pairs which are assigned a certain number b of possible values (optionally b=1 for certain pairs);
the edges in E _P represent atom pairs for which the distance might be a general interval.

We remark that the order on V was initially intended to follow the protein backbone [49], but new orders which better exploit the hydrogen atoms in or close to the backbone have been defined in [50,51]: these are the orders on which the above assumptions are based.

The DGP with the restrictions above, but where all intervals are replaced by scalars, is called DISCRETIZABLE MOLECULAR DGP (DMDGP). Both the DMDGP and its generalization to any K (denoted by ^KDMDGP) are NP-hard [52,53]. The problem defined above, involving intervals, obviously contains the DMDGP as a sub-case and is hence also NP-hard by inclusion.

When all distances are precise

We first focus on the simplest case, where all intervals are replaced by scalar values. Then $d:E\to \mathbb {R}_{+}$, and b=1. In this simplified setting, the iBP is simply called BP [52], and the order on V is called a contiguous trilateration order [54] or a DMDGP order [55].

The BP can be defined as a recursive procedure: assuming we already found a realization x ₁,…,x _v−1 for the vertices 1,…,v−1, and that we mean to find a consistent realization x _v for v, the discretization edges E _D guarantee that there will be at most two positions for x _v compatible with the distances restricted to E _D [49]. This can be intuitively understood in $\mathbb {R}^{3}$ by considering the intersection of three spheres centered at x _v−1,x _v−2,x _v−3 with radii d _v,v−1,d _v,v−2,d _v,v−3: the first two spheres either do not meet or their intersection is in general a circle, and the intersection of the third sphere with this circle is either empty or consists in general of two points [56]. We can now consider the distances defined on pruning edges in E _P, linking v to its preceding vertices in order to accept or reject these two points. For each accepted point we recursively call BP with v replaced by v+1, for all v<n. When v=n we have a valid realization of the graph: we save it in X, and proceed to complete the recursive search. This yields a search tree which is explored depth-first. The recursion starts after placing the initial triangle 1,2,3 (either arbitrarily or by using BP restricted to subspaces), so this tree starts branching at level 4. It can be proved that, at completion, X contains all incongruent (modulo translations and rotations) realizations of G.

In the case where $E_{P}=\varnothing $, the search tree is a complete binary tree with 2ⁿ⁻³ nodes at the n-th (and last) level: in other words, its depth is n and its width is 2ⁿ⁻³. This is the worst case, since the BP must explore all of the nodes in the tree, and proves that the BP (and hence the iBP, since it generalizes the BP) is an exponential-time algorithm in n.

When $E_{P}\not =\varnothing $, it was shown that X almost always contains a number of solutions which is either zero or a power of two [55]; this discovery led to a set of results where the BP search tree width can be kept polynomial in n during the search [53]. Since the exponential behavior is only due to the tree width, this yields a set of cases where the BP is actually fixed-parameter tractable (FPT). Throughout all our experiments with protein data we were always able to fix the parameter controlling the exponential growth of the tree width to a universal constant, which makes BP “polynomial on proteins” (this is an informal statement — the precise statement is given in [53]).

Intervals and discrete distance sets

The theory supporting the case where d might map edges to discrete sets of distance values or intervals, which is the case treated in this paper, is not so clearly understood yet. As it generalizes the simpler case sketched above, in a certain sense it inherits its properties, but this is an oversimplification: for instance, if all intervals are [ 0,∞], it is obvious that the problem is easy independently of the graph topology, since every realization is valid.

Some bounds on the cardinality of X in the presence of discrete sets and intervals are given in [55]. Our understanding is that if the intervals are small enough, the theory which led to fixed-parameter tractability goes through with few changes, but we have no way so far of establishing an aprioristic maximum width for the intervals. If the intervals are very large the problem might become tractable, as mentioned above, for the purposes of finding at least one solution. The iBP would still behave exponentially, however.

Results and discussion

We applied the presented algorithm to three examples of proteins displaying α helical secondary structures. Before presenting the obtained results, we emphasize that the method proposed here has a completely different philosophy than classical optimization approaches commonly used in the field of NMR structure determination. In the present approach, each constraint is treated in the strict sense, that is, no violation, however small, is tolerated. This is why we consistently use the word constraint in the paper. This is what potentially allows us to systematically explore the entire search space. However, the use of the procedure demands that the data have been pre-processed accordingly, and all geometric inconsistencies that exist in three–dimensional space have been removed.

For the proteins studied here, if one includes the ensemble of NMR interval distance constraints stored in the.mr file at the Protein Data Bank (PDB) [57] as well as all pruning devices described above, all solutions are pruned out, indicating that no solution to the distance geometry problem exists with the deposited data. This is not really surprising, since the optimization algorithms generally used in NMR structure determination are based on optimization of a target function or hybrid energy rather than on strict constraint satisfaction. That is, there is always a phase where the algorithm tries to find a trade-off when inconsistencies exist between constraints. The optimization thus produces solutions in which chemical and NMR constraints are optimized, but in which small violations are always present. These inconsistencies are present in any structure determination, in particular because distance constraints are imprecise, due to experimental limitations.

Since the data in the PDB for the examples presented here were not pre-processed the way our algorithm requires, we decided to use a subset of the stored data sets: the definition of α-helix regions and a few long-range distance constraints arbitrary selected from the set of NMR constraints for structures with more than one α-helix. In order to further reduce the risk of all solutions being pruned, we used tolerance values for atomic positions and angles between atoms (Table 2).

Table 2 Analysis of conformations obtained by the branch-and-pruning algorithm on the three proteins targets: 2JMY, 2KXA and 2KSL

Full size table

The three examples we chose to illustrate the algorithm display an increasing structural complexity: (i) a single α helix, corresponding to the structure of peptide CM15 determined in micelles (PDB id: 2JMY [58]), (ii) an α helical hairpin (PDB id: 2KXA [59]), (iii) the insecticidial toxin TAITX-1a, formed as a bundle of four α helices, restrained by three disulphide bridges (PDB id: 2KSL). The main characteristics of the studied proteins are given in Table 2. All three examples were originally determined by NMR spectroscopy, and the corresponding constraint lists are available from the PDB. The analysis by PROCHECK [60] of the Ramachandran diagram of these three PDB structures shows that more than 85% of the residues are located in the core region. For 2KXA and 2KSL, more than 95% of the residues are located in the core and allowed region, whereas in 2JMY, 7% of the residues are located in the generously allowed region. For 2KXA, one PRO residue was replaced by an ALA, as the PRO cycle has not yet been included in the current version of the iBP algorithm.

We generated conformations using the branching phase and the pruning devices described above. The long-range constraints added for the calculations of 2KXA and 2KSL, are:

for 2KXA, one constraint between H α hydrogen and carbonyl oxygen of Ala-5 and Met-17, enforcing the pairing of the two α-helices,
for 2KSL, three constraints between Carbons β of Cys-7 and Cys-37, of Cys-23 and Cys-33 and of Cys-26 and Cys-46, corresponding to the formation of the three disulphide bridges.

For all calculations, except the one of 2JMY with the α helix defined along the whole sequence, the obtained conformations were filtered according to the coordinate root mean-squared deviation (RMSD: 1.5 Å) with respect to the previously obtained conformation in the iBP procedure. Enforcing an RMSD value larger than 1.5 Å between two successively stored conformations, avoids an oversampling of the conformational space. Each calculation was stopped after storing 10000 filtered conformations. For our three examples, five calculations were performed in total: three on 2JMY with different definitions of the α helix (residues 1-15, 3-13 and 5-11), and one each for 2KXA and 2KSL. For the first calculation on 2JMY, one conformation was obtained and saved. The second and third calculations on 2JMY were quite short, of the order of minutes (Table 2), which is due to the small size of the corresponding tree. For the 2KXA and 2KSL calculations, 10000 conformations were obtained in about 30 mins of calculation. Large total numbers of conformations were generated: this number increases from ∼634,000 (2JMY_1) up to ∼3,400,000 (2KXA) with the size of the considered problem, depending on the number of residues and on the number of constraints. Despite 2KSL being the largest example, the second smallest number of conformations was generated, which is the sign of a severe pruning arising from a rather restricted conformational space.

The reliability of the obtained conformations was checked in three ways. First, the whole set of NMR constraints deposited along with the PDB entries and involving backbone hydrogens, were probed on the conformations. Second, the quality of the obtained conformations was checked using PROCHECK [60] analysis of the Ramachandran plot. Third, the obtained conformations were clustered with an unsupervised clustering method, namely the self-organizing map or SOM [61-64], in order to investigate the properties of sampled conformations.

The agreement of the obtained conformations with the backbone NMR constraints deposited with the PDB structures was checked by calculating the distances between the backbone hydrogens in each obtained conformation. The distances larger than the upper bound of the constraint correspond to violations of this constraint. The mean number of violated constraints along with the mean value of the difference to the upper bound for these constraints were calculated on all conformations (Table 2). For the 2JMY calculation with the 1-15 α helix definition, no violation of the NMR constraints could be observed. As expected, when the α helix definition is reduced (2JMY_1 and 2JMY_2), the average number of violations increases as well as the average maximum violation. Not surprisingly, the most violated constraints involve residues located at the N and C terminal parts of the α-helix, TRP-2, PHE-5, LYS-3, LYS-6 and VAL-11, VAL-14, LEU-15 for 2JMY_1 and 2JMY_2. The largest violations and number of violations are of the same order or value for 2KXA than for 2JMY_1 and 2JMY_2. In contrast, the largest violations and number of violations are observed for 2KSL and involve residues CYS-33, GLU-34, PHE-38, TYR-43. Such over-restraining of NMR structures have been put in evidence in the past, through molecular dynamics simulations [65] and analysis of the structure quality [66].

The average number of violations is similar for 2JMY_2, 2KXA and 2KSL, but the average maximum violation for 2KSL is twice as large as that for 2JMY_2 and 2KXA. This might be due to the very restrained conformations of 2KSL, which contain three disulphide bridges. Due to this restrained conformation, the NMR constraint list is probably more prone to contain inconsistencies, and large mechanical strain can be stored in the structure if one uses an optimization procedure such as simulated annealing. In contrast, no mechanical strain whatsoever is generated by the iBP algorithm, and the obtained conformations might have a stronger tendency to deviate from the PDB conformations.

For each example, the obtained conformations were compared to the first conformation deposited in the PDB. Minimum RMSD values in the range 1.1-2.1 Å were obtained for all targets, except 2KSL for which the minimum RMSD value was 3.0 Å. Thus the Branch-and-Prune algorithm was able to capture conformations close to the PDB conformations, the larger value obtained for 2KSL arising from the larger mechanical strain quoted above.

For each calculation, the conformation displaying the smallest number of NMR constraint violations was compared to the first conformation deposited in the PDB. The RMSD values are smaller than 1.5 Å for 2JMY and 2KXA. This shows that, in the context of the iBP algorithm, the measured NMR constraints also push the structure toward the PDB structure. For 2JMY_1 and 2JMY_2, the RMSD value increases since the definition of the α helical region is shorter. For 2KSL, the conformation displaying the smallest number of constraint violations, displays an RMSD of 3.5Å with the PDB first conformation, which agrees with the maximum number of violations observed for this protein and with the minimum RMSD with the PDB structure analyzed above.

From the PROCHECK [60] analysis, the percentage of residues located in core and allowed Ramachandran regions, is larger than 95% for all targets except 2JMY_1, 2JMY_2, for which the percentages are about 80% due to the reduced definition of the α helix. For all targets, the percentage of residues in disallowed regions is equal to zero. The relatively important percentage of residues located in the allowed region may arise from the systematic exploration performed by the Branch-and-Prune algorithm, the strict nature of the constraints, and the nature of the pruning devices.

In order to further probe the robustness of the proposed algorithm, iBP calculations on 2KXA and 2KSL have been performed, using input data degraded in the following way: (i) the length of each α helix has been reduced by 1 residues at each extremity, (ii) the lower and upper bounds of the long-range distance constraints have been increased by 0.5 Å. The introduction of this noise into the α helical and long-range constraints makes the iBP solution moving apart from the PDB structure, as the minimum RMSD to PDB structure changes from 1.1 to 2 Å for 2KXA, and from 3.0 to 4.3 Å for 2KSL. Nevertheless, the quality of the Ramachandran diagram remains satisfying, with 93.3% and 95.4% of the residues located in the core and allowed regions of the Ramachandran plot for 2KXA and 2KSL.

The conformations were clustered using a self-organizing map (SOM) approach [62,63], on which the coordinate RMSD values between the conformers obtained by Branch-and-Prune and the corresponding PDB structure, were projected on the SOMs (Figure 6). These RMSD values lay in the 1.3-3.2 Å range for 2JMY_1, in the 2.4-4.9 Å range for 2JMY_2, in the 1.5-4.0 Å range for 2KXA, and in the 3.2-6.0 Å for 2KSL.

In the SOMs for the four calculations (Figure 6), the RMSD values are colored according to their RMSD from the PDB entry, violet color indicating values smaller than the median value of the sampled RMSD value, green color indicating RMSD values larger than this median value. For 2JMY_1, 2KXA and 2KSL, a larger number of neurons of the SOMs belongs to the second group, which is the sign of an enhanced sampling of the conformational space with respect to the region sampled by simulated annealing. For 2JMY_2, the inverse picture is observed, which may arise from the more limited conformational space available to be sampled for a unique α-helix.

In 2KSL and 2KXA SOMs, the protein conformations corresponding to the region displaying the smallest coordinate RMSD values with respect to the PDB structure, were extracted (Figure 7). These sets of conformers are similar to the superimposed conformations obtained in a usual NMR calculation.

Conclusions

We proposed here a Branch-and-Prune algorithm (iBP) to solve the Distance Geometry Problem, in order to sample exhaustively the conformational space of the backbone of α-helical proteins. The iBP algorithm bears a very slight reminiscence to variable target function approaches for example implemented in DISMAN [67], due to the sequential nature of introducing constraints and non-bonded interactions. However, the precise way of introducing the constraints and non-bonded interactions differs significantly, and DISMAN does not systematically search space but is an optimization approach.

We introduced new pruning devices integrated in the iBP algorithm for DGP with intervals and we tested our iBP implementation on the backbones of α-helical proteins. Several pruning devices have been designed to enforce amino-acid chirality, α-helix geometry and van der Waals steric hindrance. The algorithm allowed to efficiently reconstruct backbone conformations of three α-helical peptides, of various sizes, and for which the structure were previously solved by NMR. The obtained solutions satisfy most of the NMR constraints involving backbone hydrogen bonds, and display very acceptable Ramachandran statistics. The present work represents a first successful step on the way to reconstruct protein structures using a branch-and-prune algorithm applied to the Distance Geometry problem.

Applications where this approach could have significant advantages are cases where there are few distances defining the tertiary structure of a protein, where it is important to characterize the space of all solutions. It might also be useful as part iterative automated assignment algorithms such as ARIA [68], CYANA [69] or UNIO [70], where in a first iteration all solutions compatible with a few unambiguous long-range constraints could be generated to reduce the ambiguity of the remaining constraints. Another application of the approach proposed here would be to provide input molecular conformations to model the structure of multi-subunit complexes into an electron microscopy density map [71].

Some limitations of the current version of iBP prevent for the moment its use with real nuclear Overhauser effect (NOE) data. These limitations are the use of unambiguous distance constraints, the non-inclusion of protein side-chains, the loss of information intervals and the appropriate weighting of the various constraints in order to overcome the inconsistencies contained among the whole constraint set. Protein side-chains can be added to the protein backbone afterward. The discretization of circle arcs could be tackled using algebraic geometry and geometric algebra approaches [72]. The Bayesian approach [73] developed for the objective weighting of various NMR contraints according to the data quality could be used to alleviate the inconsistency problems. The use of unambiguous distance constraints is probably the most unavoidable aspect of the current set-up of the algorithm.

References

Huang X, Britto M, Kear-Scott J, Boone C, Rocca J, Simmerling C, et al. The role of select subtype polymorphisms on HIV-1 protease conformational sampling and dynamics. J Biol Chem. 2014; 289:17203–14.
Article CAS PubMed PubMed Central Google Scholar
Kanelis V, Forman-Kay J, Kay L. Multidimensional NMR methods for protein structure determination. IUBMB Life. 2001; 52:291–302.
Article CAS PubMed Google Scholar
Sinz A. Chemical cross-linking and mass spectrometry for mapping three-dimensional structures of proteins and protein complexes. J Mass Spectrometry. 2003; 38:1225–37.
Article CAS Google Scholar
Marti-Renom M, Stuart A, Fiser A, Sánchez R, Melo F, Sali A. Comparative protein structure modeling of genes and genomes. Annu Rev Biophys Biomol Struct. 2000; 29:291–325.
Article CAS PubMed Google Scholar
Vajda S, Kozakov D. Convergence and combination of methods in protein-protein docking. Curr Opin Struct Biol. 2009; 19:164–70.
Article CAS PubMed PubMed Central Google Scholar
Bello M, Martínez-Archundia M, Correa-Basurto J. Automated docking for novel drug discovery. Expert Opin Drug Discovery. 2013; 8:821–34.
Article CAS Google Scholar
Crippen G, Havel T. Distance geometry and molecular conformation. New York: Wiley; 1988.
Google Scholar
Liberti L, Lavor C, Maculan N, Mucherino A. Euclidean distance geometry and applications. SIAM Rev. 2014; 56:3–69.
Article Google Scholar
Nilges M, Gronenborn A, Brünger A, Clore G. Determination of three-dimensional structures of proteins by simulated annealing with interproton distance restraints. Application to crambin, potato carboxypeptidase inhibitor and barley serine proteinase inhibitor 2. Protein Eng. 1988; 2:27–38.
Article CAS PubMed Google Scholar
Alipanahi B, Krislock N, Ghodsi A, Wolkowicz H, Donaldson L, Li M. Determining protein structures from NOESY distance constraints by semidefinite programming. J Comput Biol. 2013; 20:296–310.
Article CAS PubMed PubMed Central Google Scholar
Wang C, Lozano-Pérez T, Tidor B. AmbiPack: a systematic algorithm for packing of macromolecular structures with ambiguous distance constraints. Proteins. 1998; 32:26–42.
Article CAS PubMed Google Scholar
Potluri S, Yan A, Chou J, Donald B, Bailey-Kellogg C. Structure determination of symmetric homo-oligomers by a complete search of symmetry configuration space using nmr restraints and Van Der Waals packing. Proteins. 2006; 65:203–19.
Article CAS PubMed Google Scholar
Potluri S, Yan A, Donald B, Bailey-Kellogg C. A complete algorithm to resolve ambiguity for intersubunit NOE assignment in structure determination of symmetric homo-oligomers. Protein Sci. 2007; 16:69–81.
Article CAS PubMed PubMed Central Google Scholar
Martin J, Yan A, Bailey-Kellogg C, Zhou P, Donald B. A geometric arrangement algorithm for structure determination of symmetric protein homo-oligomers from NOEs and RDCs. J Comput Biol. 2011; 18:1507–23.
Article CAS PubMed PubMed Central Google Scholar
Martin J, Yan A, Bailey-Kellogg C, Zhou P, Donald B. A graphical method for analyzing distance restraints using residual dipolar couplings for structure determination of symmetric protein homo-oligomers. Protein Sci. 2011; 20:970–85.
Article CAS PubMed PubMed Central Google Scholar
Reardon P, Sage H, Dennison S, Martin J, Donald B, Alam S, et al. Structure of an HIV-1-neutralizing antibody target, the lipid-bound gp41 envelope membrane proximal region trimer. Proc Nat Acad Sci USA. 2014; 111:1391–6.
Article CAS PubMed PubMed Central Google Scholar
Zeng J, Boyles J, Tripathy C, Wang L, Yan A, Zhou P, et al. High-resolution protein structure determination starting with a global fold calculated from exact solutions to the rdc equations. J Biomol NMR. 2009; 45:265–81.
Article CAS PubMed PubMed Central Google Scholar
Gordon D, Hom G, Mayo S, Pierce N. Exact rotamer optimization for protein design. J Comput Chem. 2003; 24:232–43.
Article CAS PubMed Google Scholar
Kingsford C, Chazelle B, Singh M. Solving and analyzing side-chain positioning problems using linear and integer programming. Bioinformatics. 2005; 21:1028–36.
Article CAS PubMed Google Scholar
Wang L, Donald B. An efficient and accurate algorithm for assigning nuclear Overhauser effect restraints using a rotamer library ensemble and residual dipolar couplings. The IEEE computational systems bioinformatics conference (CSB). Stanford, CA: The Institute of Electrical and Electronics Engineers, Inc.; 8-12 Aug 2005, pp. 189–202.
Google Scholar
Wang L, Mettu R, Donald B. A polynomial-time algorithm for de novo protein backbone structure determination from NMR data. J Comput Biol. 2006; 13:1276–88.
Google Scholar
O’Neil R, Lilien R, Donald B, Stroud R, Anderson A. Phylogenetic classification of protozoa based on the structure of the linker domain in the bifunctional enzyme, dihydrofolate reductase-thymidylate synthase. J Biol Chem. 2003; 278:52980–7.
Article CAS PubMed Google Scholar
Lavor C, Liberti L, Maculan N, Mucherino A. The discretizable molecular distance geometry problem. Comput Optimization App. 2012; 52:115–46.
Article Google Scholar
Lavor C, Liberti L, Mucherino A. The interval Branch-and-Prune Algorithm for the Discretizable Molecular Distance Geometry Problem with Inexact Distances. J Global Optimization. 2013; 56:855–71.
Article Google Scholar
Engh RA, Huber R. Accurate bond and angle parameters for x-ray protein structure refinement. Acta Crystallogr Sect A: Found Crystallogr. 1991; 47(4):392–400.
Article Google Scholar
Rocchia W, Alexov E, Honig B. Extending the applicability of the nonlinear poisson-boltzmann equation: Multiple dielectric constants and multivalent ions. J Phys Chem B. 2001; 105(28):6507–14.
Article CAS Google Scholar
Honig B, Nicholls A. Classical electrostatics in biology and chemistry. Science. 1995 May 26; 268(5214):1144–9.
Article CAS PubMed Google Scholar
Liberti L, Masson B, Lee J, Lavor C, Mucherino A. On the number of realizations of certain Henneberg graphs arising in protein conformation. Discrete Appl Mathematics. 2014; 165:213–32.
Article Google Scholar
Coope I. Reliable computation of the points of intersection of n spheres in $\mathbb {R}^{n}$. ANZIAM Journal. 2000; 42:461–77.
Article Google Scholar
Berg J, Tymoczko J, Stryer L. Biochemistry: International Edition. New York: WH Freeman & Co; 2006.
Google Scholar
Güntert P, Mumenthaler C, Wüthrich K. Torsion angle dynamics for NMR structure calculation with the new program DYANA. J Mol Biol. 1997; 273:283–98.
Article PubMed Google Scholar
Güntert P, Wüthrich K. Sampling of conformation space in torsion angle dynamics calculations. Comp Phys Commun. 2001; 138:155–69.
Article Google Scholar
López-Méndez B, Güntert P. Automated protein structure determination from NMR spectra. J Am Chem Soc. 2006; 128:13112–22.
Article CAS PubMed Google Scholar
Mucherino A, Lavor C, Malliavin T, Liberti L, Nilges M, Maculan N. Influence of pruning devices on the solution of molecular distance geometry problems. In: Pardalos, P., Rebennack, S. (eds.) Lecture Notes in Computer Science 6630. Germany: Springer: 2011. p. 206–17.
Google Scholar
Dong Q, Wu Z. A geometric build-up algorithm for solving the molecular distance geometry problem with sparse distance data. J Global Optimization. 2003; 26(3):321–33.
Article Google Scholar
Lavor C, Liberti L, Mucherino A, Maculan N. On a discretizable subclass of instances of the molecular distance geometry problem. In: Proceedings of the 2009 ACM Symposium on Applied Computing. ACM Press: 2009. p. 804–5.
Floyd RW. Algorithm 97: shortest path. Commun ACM. 1962; 5(6):345.
Article Google Scholar
Shen Y, Delaglio F, Cornilescu G, Bax A. TALOS+: a hybrid method for predicting protein backbone torsion angles from NMR chemical shifts. J Biomol NMR. 2009; 44:213–23.
Article CAS PubMed PubMed Central Google Scholar
Grishaev A, Bax A. An empirical backbone-backbone hydrogen-bonding potential in proteins and its applications to NMR structure refinement and validation. J Am Chem Soc. 2004; 126(23):7281–92.
Article CAS PubMed Google Scholar
Kabsch W, Sander C. Dictionary of protein secondary structure: pattern recognition of hydrogen-bonded and geometrical features. Biopolymers. 1983; 22(12):2577–637.
Article CAS PubMed Google Scholar
Abrahams D, Gurtovoy A. C++ Template Metaprogramming: Concepts, Tools, and Techniques from Boost and Beyond. Boston, Massachusetts: Addison-Wesley Professional; 2004.
Google Scholar
Austern MH. Generic Programming and the STL: Using and Extending the C++ Standard Template Library. Boston, Massachusetts: Addison-Wesley Longman Publishing Co., Inc.; 1998.
Google Scholar
Josuttis N. The C++ Standard Library: a Tutorial and Reference. Boston, Massachusetts: Addison-Wesley Professional; 1999.
Google Scholar
Anderson E, Bai Z, Bischof C, Blackford S, Demmel J, Dongarra J, et al. LAPACK Users’ Guide, 3rd edn. Philadelphia, PA: Society for Industrial and Applied Mathematics; 1999.
Book Google Scholar
Lee L-Q, Lumsdaine A. The Boost Graph Library: User Guide and Reference Manual. Boston, Massachusetts: Addison-Wesley Professional; 2002.
Google Scholar
Brönnimann H, Melquiond G, Pion S. The design of the boost interval arithmetic library. Theor Comput Sci. 2006; 351(1):111–8.
Article Google Scholar
Brönnimann H, Melquiond G, Pion S. The boost interval arithmetic library. In: Proceedings of the 5th Conference on Real Numbers and Computers. Lyon, France: 2003. p. 65–80. http://www.lri.fr/~melquion/doc/03-rnc5-article.ps.gz.
Saxe J. Embeddability of weighted graphs in k-space is strongly NP-hard. Proc 17th Allerton Conference Commun Control Comput. Monticello, Illinois; 1979:480-9.
Liberti L, Lavor C, Maculan N. A branch-and-prune algorithm for the molecular distance geometry problem. Int Trans Operational Res. 2008; 15:1–7.
Article Google Scholar
Lavor C, Mucherino A, Liberti L, Maculan N. On the computation of protein backbones by using artificial backbones of hydrogens. J Global Optimization. 2011; 50:329–44.
Article Google Scholar
Costa V, Mucherino A, Lavor C, Cassioli A, Carvalho L, Maculan N. Discretization orders for protein side chains. J Global Optimization. 2014; 60:333–49.
Article Google Scholar
Lavor C, Liberti L, Maculan N, Mucherino A. The discretizable molecular distance geometry problem. Comput Optimization App. 2012; 52:115–46.
Article Google Scholar
Liberti L, Lavor C, Mucherino A. The discretizable molecular distance geometry problem seems easier on proteins. In: Mucherino A, Lavor C, Liberti L, Maculan N (eds.) Distance Geometry: Theory, Methods, and Applications. New York: Springer: 2013.
Google Scholar
Cassioli A, Günlük O, Lavor C, Liberti L. Discretization vertex orders for distance geometry. Discrete Applied Mathematics (accepted). in press.
Liberti L, Masson B, Lavor C, Lee J, Mucherino A. On the number of realizations of certain Henneberg graphs arising in protein conformation. Discrete Appl Mathematics. 2014; 165:213–32.
Article Google Scholar
Lavor C, Lee J, Lee-St. John A, Liberti L, Mucherino A, Sviridenko M. Discretization orders for distance geometry problems. Optimization Lett. 2012; 6:783–96.
Article Google Scholar
Berman H, Kleywegt G, Nakamura H, Markley J. The future of the Protein Data Bank. Biopolymers. 2013; 99:218–22.
Article CAS PubMed Google Scholar
Respondek M, Madl T, Göbl C, Golser R, Zangger K. Mapping the orientation of helices in micelle-bound peptides by paramagnetic relaxation waves. J Am Chem Soc. 2007; 129:5228–34.
Article CAS PubMed Google Scholar
Lorieau J, Louis J, Bax A. The complete influenza hemagglutinin fusion domain adopts a tight helical hairpin arrangement at the lipid: water interface. Proc Nat Acad Sci USA. 2010; 107:11341–6.
Article CAS PubMed PubMed Central Google Scholar
Laskowski R, MacArthur M, Moss D, Thornton J. PROCHECK: a program to check the stereochemical quality of protein structure. J Appl Crystallogr. 1993; 26:283–91.
Article CAS Google Scholar
Bouvier G, Desdouits N, Ferber M, Blondel A, Nilges M. An automatic tool to analyze and cluster macromolecular conformations based on Self-Organizing Maps. Bioinformatics. 2015. in press.
Miri L, Bouvier G, Kettani A, Mikou A, Wakrim L, Nilges M, et al. Stabilization of the integrase-DNA complex by Mg2+ ions and prediction of key residues for binding HIV-1 integrase inhibitors. Proteins. 2014; 82:466–78.
Article CAS PubMed Google Scholar
Bouvier G, Duclert-Savatier N, Desdouits N, Meziane-Cherif D, Blondel A, Courvalin P, et al. Functional motions modulating VanA ligand binding unraveled by self-organizing maps. J Chem Inf Model. 2014; 54:289–301.
Article CAS PubMed Google Scholar
Kohonen T. Self-organizing Maps. Heidelberg, Germany: Springer; 2001.
Book Google Scholar
Fan H, Mark A. Relative stability of protein structures determined by X-ray crystallography or NMR spectroscopy: a molecular dynamics simulation study. Proteins. 2003; 53:111–20.
Article CAS PubMed Google Scholar
Nabuurs S, Spronk C, Vuister G, Vriend G. Traditional biomolecular structure determination by NMR spectroscopy allows for major errors. PLoS Computional Biol. 2006; 2:9.
Article CAS Google Scholar
Braun W, Gō N. Calculation of Protein Conformations by Proton-Proton Distance Constraints: A New Efficient Algorithm. J Mol Biol. 1985; 186:611–26.
Article CAS PubMed Google Scholar
Rieping W, Habeck M, Bardiaux B, Bernard A, Malliavin T, Nilges M. ARIA2: automated NOE assignment and data integration in NMR structure calculation. Bioinformatics. 2007; 23:381–2.
Article CAS PubMed Google Scholar
Guntert P. Automated NMR structure calculation with CYANA. Methods Mol Biol. 2004; 278:353–78.
CAS PubMed Google Scholar
Guerry P, Herrmann T. Comprehensive automation for NMR structure determination of proteins. Methods Mol Biol. 2012; 831:429–51.
Article CAS PubMed Google Scholar
Lasker K, Sali A, Wolfson H. Determining macromolecular assembly structures by molecular docking and fitting into a electron density map. Proteins. 2010; 78:3205–11.
Article CAS PubMed PubMed Central Google Scholar
Lavor C, Alves R, Figueiredo W, Petraglia A, Maculan N. Clifford Algebra and the discretizable molecular distance geometry problem. Adv Appl Clifford Algebras. 2015. in press.
Bernard A, Vranken W, Bardiaux B, Nilges M, Malliavin T. Bayesian estimation of NMR restraint potential and weight: a validation on a representative set of protein structures. Proteins. 2008; 79:1525–37.
Article CAS Google Scholar

Download references

Acknowledgments

TM, MN, BB thank the Institut Pasteur and the CNRS for support. This work was funded by the European Union (FP7-IDEAS-ERC 294809 to MN), the “investissement d’avenir” program (grant bip:bip to MN and LL), and the Brazilian Research Agencies FAPESP and CNPq (to CL and RA).

Author information

Authors and Affiliations

Institut Pasteur, Structural Bioinformatics Unit, 25, rue du Dr Roux, Paris, 75015, France
Benjamin Bardiaux, Guillaume Bouvier, Michael Nilges & Thérèse E Malliavin
CNRS UMR3528, 25, rue du Dr Roux, Paris, 75015, France
Benjamin Bardiaux, Guillaume Bouvier, Michael Nilges & Thérèse E Malliavin
University of Campinas (IMECC-UNICAMP), Campinas-SP, 13083-859, Brasil
Carlile Lavor
LIX, Ecole Polytechnique, Palaiseau, 91128, France
Andrea Cassioli, Rafael Alves & Leo Liberti
IBM TJ Watson Research Center, NY Yorktown Heights, 10598, USA
Leo Liberti
Université de Rennes-I, Rennes, France
Antonio Mucherino

Authors

Andrea Cassioli
View author publications
You can also search for this author in PubMed Google Scholar
Benjamin Bardiaux
View author publications
You can also search for this author in PubMed Google Scholar
Guillaume Bouvier
View author publications
You can also search for this author in PubMed Google Scholar
Antonio Mucherino
View author publications
You can also search for this author in PubMed Google Scholar
Rafael Alves
View author publications
You can also search for this author in PubMed Google Scholar
Leo Liberti
View author publications
You can also search for this author in PubMed Google Scholar
Michael Nilges
View author publications
You can also search for this author in PubMed Google Scholar
Carlile Lavor
View author publications
You can also search for this author in PubMed Google Scholar
Thérèse E Malliavin
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Thérèse E Malliavin.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

AC, TM, MN, LL and AM designed the work. AC implemented the algorithm. TM, GB and BB performed and analyzed the application cases. AC, BB, AM, LL, CL, RA, MN and TM wrote the manuscript. All authors read and approved the final manuscript.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.

The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.

To view a copy of this licence, visit https://creativecommons.org/licenses/by/4.0/.

The Creative Commons Public Domain Dedication waiver (https://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Reprints and permissions

About this article

Cite this article

Cassioli, A., Bardiaux, B., Bouvier, G. et al. An algorithm to enumerate all possible protein conformations verifying a set of distance constraints. BMC Bioinformatics 16, 23 (2015). https://doi.org/10.1186/s12859-015-0451-1

Download citation

Received: 09 July 2014
Accepted: 05 January 2015
Published: 28 January 2015
DOI: https://doi.org/10.1186/s12859-015-0451-1

An algorithm to enumerate all possible protein conformations verifying a set of distance constraints