Computing H/D-Exchange rates of single residues from data of proteolytic fragments

Althaus, Ernst; Canzar, Stefan; Ehrler, Carsten; Emmett, Mark R; Karrenbauer, Andreas; Marshall, Alan G; Meyer-Bäse, Anke; Tipton, Jeremiah D; Zhang, Hui-Min

doi:10.1186/1471-2105-11-424

Methodology article
Open access
Published: 11 August 2010

Computing H/D-Exchange rates of single residues from data of proteolytic fragments

Ernst Althaus¹,
Stefan Canzar²,
Carsten Ehrler³,
Mark R Emmett^4,5,
Andreas Karrenbauer⁶,
Alan G Marshall^4,5,
Anke Meyer-Bäse⁷,
Jeremiah D Tipton⁴ &
…
Hui-Min Zhang⁴

BMC Bioinformatics volume 11, Article number: 424 (2010) Cite this article

5771 Accesses
35 Citations
Metrics details

Abstract

Background

Protein conformation and protein/protein interaction can be elucidated by solution-phase Hydrogen/Deuterium exchange (sHDX) coupled to high-resolution mass analysis of the digested protein or protein complex. In sHDX experiments mutant proteins are compared to wild-type proteins or a ligand is added to the protein and compared to the wild-type protein (or mutant). The number of deuteriums incorporated into the polypeptides generated from the protease digest of the protein is related to the solvent accessibility of amide protons within the original protein construct.

Results

In this work, sHDX data was collected on a 14.5 T FT-ICR MS. An algorithm was developed based on combinatorial optimization that predicts deuterium exchange with high spatial resolution based on the sHDX data of overlapping proteolytic fragments. Often the algorithm assigns deuterium exchange with single residue resolution.

Conclusions

With our new method it is possible to automatically determine deuterium exchange with higher spatial resolution than the level of digested fragments.

Background

In the solution-phase Hyrdogen/Deuterium ex-change (sHDX) experiment, protein surface accessibility is probed by exchange of labile hydrogen for deuterium. Simply speaking, hydrogens located at solvent exposed sites exchange at a higher rate with deuteriums from the solution than others. From these exchange rates one can therefore deduce information about protein solvent accessibility and thus protein conformation.

There is controversy surrounding the effect of D₂O solvent on the conformation of proteins. Sheu et al. [1] used molecular dynamic modeling of a small peptide to illustrate compaction of the peptide conformation in D₂O versus H₂O. This small compaction of the conformation occurs when the pep-tide is fully deuterated (which is never observed in the sHDX experiments). Since sHDX monitors the incorporation of deuterium over time the resulting slight compaction of the structure is minimized. Other methods used for the study of protein/protein interaction or protein conformation such as cross-linking [2, 3] or hydroxyl radical addition [4–6] result in large conformational change of the protein structure; leaving sHDX as the method of choice for probing protein conformational changes in solution.

NMR spectroscopy has been the gold standard for determination of protein structure, but it has limitations on protein solubility and molecular weight (<50 kD). Solution-phase HDX with mass spectrometry analysis has higher sensitivity and is not limited by molecular weight, but sHDX is hampered with a major difficulty. One only obtains exchange data for peptic fragments and assigning exchange rates to single residues has to be done by manual interpretation.

We provide an automated method to resolve this problem. More precisely, we present an algorithm that enumerates all possible exchange rates for single residues that explain the observed data of the peptic fragments. As the number of possibilities is often very large, we combine sets of assignments to equivalence classes which are easily interpreted such that the number of equivalence classes is typically very small.

The assignment of exchange rates to single residues from the data of the peptic fragments is a combinatorial problem. Hence, we apply methods from combinatorial optimization to it, i.e. we show how to formalize the problem as an integer linear program and propose methods to solve the problem.

Biochemical Background

Concerning the determination of protein-protein interaction, X-ray crystal diffraction and NMR [7] pro-vide the highest resolution of the sites of interaction. On the downside, both methods require large (milligram) quantities of protein. Other techniques rely on chemical or photo-induced reactions with MS analysis [8, 9] to reveal functional groups that are ex-posed to the solvent. These methods also suffer from physical limitations.

Another method utilizes hydroxyl radical reactions with alkyl CH bonds. The OH tends to re-act mainly with surface-exposed residues providing a good footprint of the solvent exposed surface of the protein(s) [4, 6]. The modification is covalent and thus irreversible, but each modification can potentially change the conformation of the protein, thus skewing results.

Exchange of labile hydrogens for deuteriums (sHDX) as a probe of protein surface accessibility does not change the conformation of the protein. Advantages of MS over NMR and X-ray crystallography structural determination are the ability to work at low concentration and high molecular weight.

The experiment is initiated by dilution of the protein solution into a biological buffer made with D₂O. Solvent accessible hydrogens are exchanged with deuterium. The exchange is quenched (greatly slowed) by dropping the pH to between pH 2.3 and pH 2.5 and lowering the temperature to approximately 0°C. The protein complex is digested with a protease that is active under quench conditions (such as pepsin) and on-line liquid chromatography is performed directly to the FT-ICR MS. Deuterium in corporation is monitored by the increase in mass of each peptic fragment as the deuterons are added.

These data sets are large, often with many over-lapping proteolytic fragments. From these data, the exchange rate is easily determined for the same peptic fragments from the protein and the protein/protein complex [10] (all other fragments are disregarded). When peptic fragments are not directly comparable, but are overlapping (Figure 1) manual interpretation must be performed to assign exchange rate to single residues. Data analysis is the greatest bottleneck in sHDX experiments; thus automated data analysis is necessary. Furthermore, we are interested in all such assignments, as averaging over all solutions gives better results in practice.

Mathematical Abstraction

In this section, we present our mathematical model for the assignment of exchange rates to residues. A brief overview of the introduced terms and symbols can be found in Table 1. In an idealized setting, we consider the following problem. We sequentially number the n residues of a protein from1 to n, beginning at the N-terminal residue and ending at the C-terminal residue. The set of peptic fragments resulting from the digestion of the protein is captured by a set ℱ of integer intervals (i, j) := {k ∈ ℕ | i ≤ k ≤ j}, for two positive integers i, j with i ≤ j, representing the endpoints of the corresponding fragment. In other words, the peptic fragment represented by (i, j) spans residues i, i+1, ..., j. Furthermore, K denotes the number of different classes of exchange rates, arising from the discretization of the experimentally measured deuterium uptake rates [11]. The K distinct classes of exchange rates, to which we simply refer as colors in the following, are represented by set $S$ . To simplify notation we number the colors from 1 to K and identify in the following the colors by their respective number. The experimentally found bulk information of how many residues within each fragment (i, j) ∈ ℱ fall into each of the exchange rate categories is given by "requirement" integers $b_{(i, j)}^{k}$ , for each fragment (i, j) ∈ ℱ and each color k ∈ $S$ . We call the vector b^kof "requirements" with respect to color k, indexed by fragments from, ℱ the right hand side for color k. In our experimental data, exactly three different colors are distinguished (interpreted as slow, medium, and fast exchange rates), i.e. K = 3. However, our method is not restricted to this case.

Table 1 Overview of terms and symbols

Full size table

The mathematical notion introduced above is illustrated in Figure 1. There the residues, numbered from 1 to 28, are spanned by 9 peptic fragments, i.e. |ℱ = 9|. The third peptic fragment "VWGKVEAD" will then be represented by the integer interval (12; 19). From the experimental data we know that 5 out of the 8 residues contained in this fragment exchanged slowly (s), two at medium rate (m), and the last remaining residue exchanged fast (f) and thus $b_{(12, 19)}^{s} = 5, b_{(12, 19)}^{m} = 2$ , and $b_{(12, 19)}^{f} = 1$ .

Determining the exchange rate of single residues from the experimentally found data for the peptic fragments then translates into finding a "consistent" assignment of colors from $S$ to the integer points from {1, ..., n}, representing the residues of the protein, that complies with the constraints imposed by the "requirements" $b_{(i, j)}^{k}$ . More precisely, we have to determine an assignment π : {1, ..., n} ↦ $S$ such that |{i ≤ l ≤ j : π(l) = k}| = $b_{(i, j)}^{k}$ for all given fragments (i, j) ∈ ℱ and all possible colors k ∈ $S$ . We call such an assignment feasible.

We say that two fragments (i, j) and (i', j') over-lap, if they share at least one common residue, i.e. (i, j) ∩ (i', j') ≠ ∅. The partition of the set of fragments ℱ into a maximum number of subsets, such that no two fragments from different subsets overlap, defines independent subproblems; an assignment of exchange rates to the residues spanned by the fragments of one subset does not affect the solution of a subproblem corresponding to any other subset of fragments.

Furthermore, we denote by $P$ the partition of the set of residues {1, ..., n} into maximal subsets such that residues from the same subset are spanned by exactly the same set of fragments. More precisely, for all residues i and j in the same part of $P$ and for all fragments f ∈ ℱ it holds i ∈ f ⇔ j ∈ f. Hence, for each part p ∈ $P$ and each fragment f ∈ ℱ either p ⊆ f or p ∩ f = ∅. In Figure 1 for example, residues number 7 (Q), 8 (Q), and 9 (V), are all contained in fragments number 1, 2, 6, and 8 and thus form an element $\overset{ª}{p}$ of partition $P$ . Note that for the two neighboring residues the set of containing fragments differs from {1, 2, 6, 8} and therefore part $\overset{ª}{p}$ = {7, 8, 9} is maximal.

However, data collected in real experiments usually contain some noise, such that no feasible assignment of exchange rates as defined above exists. Therefore, the goal is to compute all assignments that minimize the total sum of errors. Here, the error of an assignment π in fragment (i, j) ∈ ℱ with respect to color k is defined as the surplus, respectively the shortage, of residues in (i, j) that are assigned color k, compared to the number of such residues suggested by the experimental data. That is,

e_{(i, j)}^{k} = | b_{(i, j)}^{k} - | {i \leq l \leq j : π (l) = k} | |

(1)

and thus the objective is to minimize the sum of this deviations over all fragments and colors, i.e.

minimize \sum_{k \in S} \sum_{(i, j) \in ℱ} e_{(i, j)}^{k} .

(2)

In Figure 1 the colors green, yellow and red encode an optimal assignment π* of the exchange rates slow, medium and fast, with respect to objective (2). Under this assignment, fragment 3 contributes an error of 1 both w.r.t. color yellow (medium exchange rate)and color red (fast rate) to the total error of 17, while it satisfies the requirement for color green (slow rate, numbered 1) $b_{(12, 19)}^{1} = 5$ exactly.

Results and Discussion

In the following, we present different approaches to tackle the assingment problem that we have derrived from the mathematical abstraction mentioned before.

Integer linear programming formulation

First, we formulate the idealized version of the problem assuming error-free experimental data as an integer linear program (ILP). That is, we give an ILP whose feasible solutions correspond one-to-one to the feasible assignments of colors to residues.

Let π : {1, ..., n} ↦ $S$ be an assignment of colors to residues. A binary variable $x_{i}^{k}$ for every color k ∈ $S$ and every residue i ∈ {1, ..., n} indicates whether residue i is assigned color k or not, i.e.

x_{i}^{k} = {\begin{array}{l} 1 if π (i) = k \\ 0 otherwise . \end{array}

We denote by $x^{k} = (x_{1}^{k}, x_{2}^{k}, ..., x_{n}^{k})$ the vector of binary variables modeling the assignment of color k and let $x = {(x^{k})}_{k \in S}, x \in {0, 1}^{K n}$ .

Since every residue is assigned exactly one color, it must hold $\sum_{k \in S} x_{i}^{k} = 1$ for all i ∈ {1, ..., n}. Conversely, every 0-1 assignment to variables x satisfying $\sum_{k \in S} x_{i}^{k} = 1$ for all i ∈ {1, ..., n} corresponds to an assignment of colors to residues. A 0-1 assignment to x corresponds to a feasible color assignment π, if and only if furthermore $\sum_{l = i}^{j} x_{l}^{k} = b_{(i, j)}^{k}$ holds for all (i, j) ∈ ℱ and k ∈ $S$ .

Now consider the problem of computing an assignment with minimum total error. Translating the definition of the error that we make when assigning color k (or not) to residues in fragment (i, j) (see equation (1)) to the context of 0-1 assignments to variables x^k, the problem of minimizing (2) becomes

minimize \sum_{k \in S} \sum_{(i, j) \in ℱ} | b_{(i, j)}^{k} - \sum_{l = i}^{j} x_{l}^{k} | .

Concerning the formulation of a minimum sum of absolute values in terms of a linear objective function and linear constraints, observe that $| b_{(i, j)}^{k} - \sum_{l = i}^{j} x_{l}^{k} |$ is the smallest number $e_{(i, j)}^{k}$ that satisfies

\begin{matrix} e_{(i, j)}^{k} \geq \sum_{l = i}^{j} x_{l}^{k} - b_{(i, j)}^{k} and \\ e_{(i, j)}^{k} \geq - \sum_{l = i}^{j} x_{l}^{k} + b_{(i, j)}^{k} \end{matrix}

Hence, after introducing a variable $e_{(i, j)}^{k}$ for every color k ∈ $S$ and every fragment (i, j) ∈ ℱ, the integer linear program we are looking at is

\begin{array}{l} \min & \sum_{k \in S} \sum_{(i, j) \in ℱ} e_{(i, j)}^{k} \\ s . t . & e_{(i, j)}^{k} \geq \sum_{l = i}^{j} x_{l}^{k} - b_{(i, j)}^{k} & for all k \in S, (i, J) \in ℱ \\ e_{(i, j)}^{k} \geq - \sum_{l = i}^{j} x_{l}^{k} + b_{(i, j)}^{k} & for all k \in S, (i, J) \in ℱ \\ \sum_{k \in S} x_{l}^{k} = 1 & for all 1 \leq l \leq n \\ x \in {0, 1}^{K n} \end{array}

We refer to this integer linear program as basic-ILP.

In our experiments, it turns out that finding a single solution is very fast, whereas enumerating all solutions takes quite some time due to their large number. This large number can be explained as follows: Recall that $P$ is the partition of {1, ..., n} into a minimal number of parts, such that for each element p ∈ $P$ and each fragment f ∈ F either p ⊆ f or p ∩ f = ∅. In other words, no fragment starts or ends within such a part. Therefore, from an assignment π we can derive further assignments π' exhibiting the same total error, by simply permuting the colors within these parts, i.e. if i, j ∈ p for p ∈ $P$ and the total error of an assignment π is e₁, than π' with π' (i) = π (j), π' (j) = π (i) and π' (l) = π (l) for l ≠ i, j has total error e₂ with e₂ = e₁. We call two assignments equivalent, if one can be obtained from the other by iteratively applying this rule.

In order to enumerate equivalent solutions only once, we modify our integer linear program as follows: For k ∈ $S$ and p ∈ $P$ , we replace the binary variables ${(x_{l}^{k})}_{l \in p}$ by a single integer variable $y_{p}^{k}$ with $y_{p}^{k} : = \sum_{l \in p} x_{l}^{k}$ . Moreover, let A be the |ℱ| × | $P$ | inclusion matrix, i.e. for every f ∈ ℱ and p ∈ $P$ , the corresponding entry is given by

a_{f, p} = {\begin{matrix} 1 & if p \subseteq f \\ 0 & otherwise . \end{matrix}

We denote by $e^{k} = {(e_{(i, j)}^{k})}_{(i, j) \in ℱ}$ the vector of errors with respect to color k and by $y^{k} = {(y_{p}^{k})}_{p \in}$ the number of residues colored k. In matrix notation the constraints are then of the form

\begin{matrix} - A y^{k} + e^{k} \geq - b^{k} \\ A y^{k} + e^{k} \geq b^{k} \end{matrix}

for all k ∈ $S$ . Hence our integer linear program be-comes

\begin{array}{l} \min & \sum_{k \in S} \sum_{f \in ℱ} e_{f}^{k} \\ s . t . & - A y^{k} + e^{k} \geq - b^{k} & for all k \in S \\ A y^{k} + e^{k} \geq b^{k} & for all k \in S \\ \sum_{k \in S} y^{k} = P \\ y \geq 0, integer \end{array}

(3)

where P is the vector that contains |p| for each component p ∈ $P$ and $y = {(y^{k})}_{k \in S}$ We refer to this integer linear program as improved-ILP. We compute all solutions within a certain error bound by following basically the same approach as described above. However, the number of solutions now is just a fraction of the number of solutions of the original basic-ILP yielding a significant speed-up

Although there is commercial software for integer programming which quickly solves instances of reasonable size, there is no algorithm that is guaranteed to find an optimum solution in polynomial time, since integer programming is NP-complete in general. However, the problem of assigning exchange rates to residues in a way that is conform with the experimentally found bulk data exhibits a certain combinatorial structure. In the next section, we exploit this fact to derive an exact polynomial-time algorithm for the case of two colors and use it as a building block for approximation algorithms for more than two colors subsequently.

A Combinatorial Approach

First, let us consider the special case of two colors, i.e. K = 2 and thus $S$ = {1,2}. That is, we have constraints of the form $y_{p}^{1} + y_{p}^{2} = | p |$ for all p ∈ $P$ . This allows us to simplify the linear program considerably. We replace $y_{p}^{2} = | p | - y_{p}^{1}$ and omit the superscript of the y-variables in the following. This yields

\begin{matrix} - A y + e^{1} \geq - b^{1} & A y + e^{2} \geq F - b^{2} \\ A y + e^{1} \geq b^{1} & - A y + e^{2} \geq - F + b^{2} \end{matrix}

where F is the vector of fragment sizes. We may get rid of half of the constraints by the following observation. Let b := max {b¹, F - b²} and $\overset{ª}{b} : = \min {b^{1}, F - b^{2}}$ where the maximum is taken component-wise. Let y be an arbitrary feasible solution with minimum total error $\sum_{f \in ℱ} e_{f}^{1} + e_{f}^{2}$ . We may consider the contribution of each fragment independently for that particular y. We may rename the error variables e¹ and e² component-wise according to b and $\overset{ª}{b}$ , i.e.

\begin{matrix} e_{f} : = {\begin{array}{l} e_{f}^{1} & if b_{f} = b_{f}^{1} \\ e_{f}^{2} & otherwise \end{array} & {\bar{e}}_{f} : = {\begin{array}{l} e_{f}^{1} & if {\bar{b}}_{f} = b_{f}^{1} \\ e_{f}^{2} & otherwise \end{array} \end{matrix}

(4)

For each f ∈ ℱ with ${\overset{ª}{b}}_{f} \leq a_{f}^{T} y \leq b_{f}$ , we have $e_{f}^{1} + e_{f}^{2} = b_{f} - {\overset{ª}{b}}_{f}$ . If $a_{f}^{T} y > b_{f}$ , we get $e_{f}^{1} + e_{f}^{2} = 2 e_{f} + b_{f} - {\overset{ª}{b}}_{f}$ . Analogously, we get $e_{f}^{1} + e_{f}^{2} = 2 {\overset{ª}{e}}_{f} + b_{f} - {\overset{ª}{b}}_{f}$ if $a_{f}^{T} y < {\overset{ª}{b}}_{f}$ . Hence, it is sufficient to optimize the following linear program

\begin{array}{l} \min & \sum_{f \in ℱ} e_{f} + {\bar{e}}_{f} \\ s . t . & - A y + e \geq - b \\ A y + \bar{e} \geq \bar{b} \\ - y \geq - P \\ y, e, \bar{e} \geq 0 \end{array}

(5)

which is integral if b and $\overset{ª}{b}$ are integral since the constraint matrix is totally unimodular. The corresponding dual LP is given by

\begin{array}{l} \max & - b^{T} f^{1} - {\bar{b}}^{T} f^{2} + P^{T} f^{3} \\ s . t . & - A^{T} f^{1} + A^{T} f^{2} - f^{3} \leq 0 \\ 0 \leq f^{1, 2} \leq 1 \\ 0 \leq f^{3} \end{array}

(6)

which is equivalent to (multiplying the objective function by -1 and introducing slack variables)

\begin{array}{l} - \min & - b^{T} f^{1} - {\bar{b}}^{T} f^{2} + P^{T} f^{3} \\ s . t . & - A^{T} f^{1} + A^{T} f^{2} - f^{3} + f^{4} = 0 \\ 0 \leq f^{1, 2} \leq 1 \\ 0 \leq f^{3, 4} \end{array}

(7)

We will show next that this LP is a Minimum Cost Circulation Problem. To this end, let M be the matrix of the equality constraints, i.e.

M : = (- A^{T} A^{T} - I I) .

Note that this matrix has the column-wise consecutive-ones property. By row operations like in Gaussian elimination, we can easily transform M such that each column contains exactly one +1 and one -1, as follows. We add the dummy constraint 0 = 0 at the end and subtract from each row its predecessor. The resulting matrix, say $\overset{ª}{M}$ , can be considered as the node-arc-incidence matrix of a directed graph. Since the right hand side remains unchanged, we get a Minimum Cost Circulation problem on a graph with | $P$ | + 1 nodes and $O (| | + | ℱ |)$ arcs [12]. As a matter of fact, we have for each variable y_p two arcs corresponding to the constraint 0 ≤ y_p ≤ |p| and for each fragment (i, j) the arcs (i, j + 1) and (j + 1, i) as depicted in Figure 2.

For three or more colors the complexity is open. The totally unimodularity of the constraint matrix is destroyed, i.e. there are instances with fractional vertices, e.g. the one from Figure 2 with the appropriate right hand sides. Moreover, there is an instance which has a positive error, but the value of the LP is 0. Hence the integrality gap is infinite. If the number of colors is not fixed but part of the input, the problem is NP-complete [13].

A Simple and Efficient Heuristic for the General Case

We present an algorithm that uses our combinatorial approach for the 2-color case (K = 2) from previous section as a subroutine to provide solutions that approximate (without performance guarantee) a coloring, i.e. an assignment of colors to residues, with minimum total error for instances with arbitrary but fixed number of colors. The general idea is to reduce the problem to the 2-color case by merging all but one color, say color i, to a single color and solve the resulting problem by an algorithm for the minimum cost circulation problem, as described in the section about the Combinatorial Approach. We remove residues colored i by the obtained solution and solve the coloring problem on the remaining residues using K - 1 colors recursively.

Our approach works as follows. Consider an arbitrary color k ∈ $S$ . We compute a subset of the residues that are assigned color k such that the total error with respect to color k and the sum of all remaining colors is minimized, i.e. we solve the two color problem with requirements (= right hand sides)

(b^{k}, \sum_{k^{'} \neq k} b^{k^{'}}) .

Residues assigned color k in an optimal solution to this problem will be colored k in the final solution too, the assignment of the remaining colors $S$ \{k} to the remaining residues is computed recursively.

Note that the order in which colors are selected to be the next fixed color k in the recursive computation can be arbitrary. Nevertheless, they might lead to solutions of different total error. As we have only three different colors in our experimental data, we evaluate all six orderings and return the best solution found.

In the next section we present a Lagrangian relaxation method to compute, based on our combinatorial approach for the 2-color case, a bound on the minimum total error, which is exploited in a branch-&-bound manner to determine all optimal colorings.

A Lagrangian Relaxation Approach

In this section we propose a Lagrangian relaxation approach for the problem, which is particularly suit-able for finding all optimal solutions. It is based on the improved-ILP formulation:

\min \sum_{k \in S} \sum_{f \in ℱ} e_{f}^{k}

(8)

\begin{matrix} s . t . & - A y^{k} + e^{k} \geq - b^{k} & for all k \in S \end{matrix}

(9)

\begin{array}{l} A y^{k} + e^{k} \geq b^{k} & for all k \in S \end{array}

(10)

\begin{array}{l} \sum_{k \in S} y^{k} = P \\ y \geq 0, integer \end{array}

(11)

where P is the vector that contains the length of parts in $P$ . The problem can be considered to contain independent structures for each color k ∈ $S$ , namely the set of positive integer vectors y^ksatisfying (9) and (10) under the objective (8), that are linked by constraints (11). Therefore, dualizing the linking constraints (11), with Lagrangian multipliers λ, splits the problem into an independent problem for each color k ∈ $S$ :

\begin{array}{l} \min & \sum_{k \in S} \sum_{f \in ℱ} e_{f}^{k} + λ^{T} (\sum_{k \in S} y^{k} - P) & (I P (λ)) \\ s . t . & - A y^{k} + e^{k} \geq - b^{k} & for all k \in S \\ A y^{k} + e^{k} \geq b^{k} & for all k \in S \\ 0 \leq y^{k} \leq P & for all k \in S \\ y interger \end{array}

Neglecting the constant term -λ^TP in the objective function and replacing error variable e by e + ē we have to determine, for every color k ∈ $S$ , an optimal integral solution to the following linear program:

\begin{array}{l} \min \sum_{f \in ℱ} (e_{f}^{k} + {\overset{ª}{e}}_{f}^{k}) + λ^{T} y^{k} \\ s . t . - A y^{k} + e^{k} \geq - b^{k} \end{array}

(12)

A y^{k} + {\overset{ª}{e}}^{k} \geq b^{k}

(13)

\begin{array}{l} e^{k}, {\overset{ª}{e}}^{k} \geq 0 \\ 0 \leq y^{k} \leq P \end{array}

(14)

Note that we added constraint (14) to enforce $e_{f}^{k}$ or ${\overset{ª}{e}}_{f}^{k}$ to be zero if ${\overset{ª}{e}}_{f}^{k}$ , respectively $e_{f}^{k}$ , corresponds to the absolute value of the error, i.e. if the constraint (13), respectively the constraint (12), for fragment f is tight. Note that we have to enforce $e_{f}^{k}$ and ${\overset{ª}{e}}_{f}^{k}$ to be nonnegative. In every optimum solution either e or ē (or both) will be zero for each fragment f. Similar as for linear program (5), its dual is given by (omitting the color superscript k):

\begin{array}{l} - \min & b^{T} f^{1} - b^{T} f^{2} + P^{T} f^{3} \\ s .t . & - A^{T} f^{1} + A^{T} f^{2} - f^{3} + f^{4} = λ \\ 0 \leq f^{1, 2} \leq 1 \\ 0 \leq f^{3, 4} \end{array}

(15)

This linear program differs from LP (7) only in the right-hand sides of the equality constraints.

Conclusions

We applied our methods to process data from typical biochemical experiments. We report our results for four proteins: Calcium-binding protein (Cabin), Cytochrome P450 (CytoC), FK506 binding protein(FKBP), with two different digests (pepsin and XIII), and myoglobin. As a preprocessing step, the single fragments were analyzed with our integer linear programming based technique [14], except for FKBP V2 (MEM) which was analyzed with the MEM-method [15] and is based on the same data as FKBP V1 (ILP). We analyzed FKBP with only the xiii digestion (V3) and combined the datasets from the two digestions (V1, V2 and V4). The number discretized exchange rates per fragment obtained in this preprocessing step serves as input to the algorithm.

The instances have between 74 and 152 residues and between 18 and 49 fragments. The solutions with a minimal number of errors could be computed in less than 0.11 seconds for all instances. Computing all (non-equivalent) solutions with a minimal number of errors, from 96 up to almost 20 million in number, took less than 7 minutes, where the running time greatly depends on the number of solutions (see Table 2). Computing all solutions using the basic-ILP takes much longer as with the improved-ILP. For all instances, the heuristic computes a solution with the minimal error.

Table 2 Results & Runtime

Full size table

Where available, we compared our assignments of exchange rates to the results obtained by NMR-analysis (FKBP and CytoC [16]). The error measure is based on a comparison per part. Within each part, the rates assigned by the algorithm are compared to the ones from NMR. Table 3 summarizes the results. The table also shows the importance of taking all solutions into account, as averaging typically yields better results than a single solution. The assignments coincide to 60 - 75% to the ones obtained by NMR, when choosing the optimal ordering with in the parts of equivalent residues. Figure 3 provides the results for FK506 and Cytochrome P450 at single residue resolution for manual inspection.

Table 3 Comparison with NMR

Full size table

A structural view on the results for FK506 and myoglobin is given in Figure 4. For myoglobin we do not have NMR data at hand. Nevertheless does the figure nicely agree with the expected out come, as buried parts of the protein show on the average lower exchange rates than exposed parts. The two figures have been produced by use of PyMOL [17].

In our solutions, the resolution is significantly increased compared to the input data, i.e., the length of fragments obtained from the sHDX experiments. The parts are typically small (see Table 2), between2 and 4 residues. 75% of the parts are smaller than 8 residues. For 46% of the amino-acids, we get single-residue resolution on the data.

The results for the real instances are very promising, as the small number of easily interpretable classes of equivalent solutions can be used in protein structure prediction tools and for manual inspection.

Methods

In this Section, we describe the computational methods, which we use to solve the different formulation, as well as the biochemical methods to obtain the experimental data.

Solution of the integer linear program

We implemented our approach using the C++-Library SCIL [18] to solve integer linear programs. SCIL uses the libraries LEDA [19] and SCIP [20].SCIP uses CPLEX [21] or SoPlex [22] as solver for linear programs. The underlying solution method is branch-&-bound, that is described in detail in [23].

In order to find all solutions within a given error bound e, the constraint

\sum_{k \in S} \sum_{(i, j) \in ℱ} e_{(i, j)}^{k} \leq e

is added to the integer linear program and hence we are faced with the problem of computing all feasible solutions of an integer linear program. We do this with a branching-approach similar to the classical branch-&-bound method for finding an optimal solution: First, the linear relaxation is solved. If the linear relaxation is infeasible, the search on this branch terminates. If the solution is integral, it is stored (provided the solution was not found yet). If there is a binary variable which was not fixed so far (i.e. not set to 0 or 1), one such variable $x_{l}^{k}$ is picked and the two subproblems, in which the variable is fixed o 0 or 1 recursively, are solved. Notice that it is possible that we branch on a variable which already has an integral value. In this case, the solution of the linear relaxation of the subproblem will be the same as in problem itself. Nevertheless, we will terminate, as there are only a finite number of variables to branch on.

Solving the Combinatorial Problem

We may use any algorithm that solves the Minimum Cost Circulation problem, e.g. Cycle Canceling or Successive Shortest Path (see [12] for further reference). Both approaches have their advantages. The former always maintains a feasible circulation, i.e. we start with the zero flow and augment flow along negative cycles in the residual network until no negative cycle remains. Since the residual network with respect to an optimal circulation does not contain a directed negative circuit, we can find node potentials, i.e. a corresponding dual solution, using the Bellman-Ford algorithm in $O (| P | \cdot | ℱ |)$ time. The difference between the potential of two neighboring nodes then yields the value of the corresponding y-variable. The errors are determined straight forward. If there is a solution without error this approach yields a solution within the running time of Bellman-Ford. On the other hand, the Successive Shortest Path algorithm maintains similar node potentials such that the arc-weights remain non-negative. Since the total excess is bounded by | $P$ | in our case, the running time of that algorithm is $O (| P | \cdot | ℱ | + | P |^{2} \log | P |)$ .

Solving the Lagrangian Dual

Instead of a minimum cost circulation problem(right-hand side is 0), we have to solve the more general minimum cost flow problem [12] where the supplies and demands $\overset{ª}{λ}$ of the nodes are determined by the difference of Lagrangian multipliers, i.e. $\overset{ª}{λ}$ is of dimension | $P$ | + 1 and ${\overset{ª}{λ}}_{i} = λ_{i} - λ_{i - 1}$ for $2 \leq i \leq | P |, {\overset{ª}{λ}}_{1} = λ_{1}$ and ${\overset{ª}{λ}}_{| P | + 1} = - λ_{| P |}$ . A feasible flow of minimum cost can be computed efficiently by, e.g., the cycle-canceling algorithm and the successive shortest path algorithm, as well as variants of them, like the capacity scaling algorithm[12]. In our implementation (C++) we used the LEDA library [19] to solve the Lagrangian subproblem by analgorithm based on capacity scaling and successive shortest path computation [12].

We improve the resulting bounds by the subgradient optimization method described in the following and incorporate the overall approach into a branch-&-bound algorithm as the lower bounding scheme.

Let v(IP (λ)) denote the optimal value of IP (λ). Then for any vector λ of Lagrangian multipliers, the (non-differentiable) Lagrangian function

z (λ) = v (I P (λ))

provides a lower bound on the minimum total error. To benefit from the sharpest possible bound in the branch-&-bound framework we are interested in solving the Lagrangian dual problem

z^{*} = \max_{λ} z (λ) .

We apply the subgradient method to obtain near-optimal Lagrangian multipliers. Following the approach by Held and Karp [24] we iteratively determine values λ^ℓ+1for ℓ = 0,1, ..., of the Lagrangian multipliers by moving in the direction of a subgradient with "step length" μ_ℓ :

λ^{ℓ + 1} = λ^{ℓ} + μ_{ℓ} (\sum_{k \in S} y^{k} (ℓ) - P),

where ${(y^{k} (ℓ))}_{k \in S}$ is any optimal solution to IP(λ^ℓ ). The step length is computed according to formula

μ_{ℓ} = \frac{θ_{ℓ} (U B - z (λ^{k}))}{{‖ Σ_{k \in S} y^{k} (ℓ) - P ‖}^{2}},

where UB is a previously computed upper bound on z* and θ is a step size parameter assuming values in {x ∈ ℝ | 0 < x ≤ 2}. In the experiments it turns out, that initializing the vector of Lagrangian multipliers λ⁰ to the length P of the corresponding intervals in $P$ increases the convergence rate dramatically. We also experienced a fast convergence to near-optimal Lagrangian multipliers when following the classical Held-Karp method to choose the step size scalar θ: We start with θ₀ = 2 and half θ_ℓ whenever the best Lagrangian bound v(IP(λ)) found so far has not increased in a certain number of iterations. As soon as the step size scalar falls below a specified threshold or the number of iterations exceeds a certain limit (which is adaptive with respect to the depth of the branch-&-bound node), we branch on a variable $y_{p}^{k}, k \in S, p \in P$ , such that ${\overset{ª}{y}}_{p}^{k} - ⌊ {\overset{ª}{y}}_{p}^{k} ⌋$ is close to 0.5, where ${\overset{ª}{y}}_{p}^{k}$ is the average value of variable $y_{p}^{k}$ in the last h = 10 Lagrangian solutions. Since we aim to find all optimal colorings, we also branch on variables that are integral. Incorporating the Lagrangian approach as a lower bounding scheme into a branch-&-bound frame work gives an alternative algorithm that does not depend on commercial software packages.

Experimental Setting

The entire sHDX experiment was automated with a LEAP robot (HTS PAL, Leap Technologies, Carrboro, NC).

Automation of the experiment reduces human error and reduces deuterium for hydrogen back-exchange. All time points where interlaced and performed in triplicate to ensure experimental reproducibility. After digestion, the protein digest was injected from a 10 μ L loop to either a 1 mm × 50 mm C5 column (Phenomenex) or a Pro-Zap Pro-sphere HP C18 HR 1.5u 10 mm × 2.1 mm (All-tech). A rapid gradient 2% B to 95% B in 1.5 min(A: acetonitrile/H₂O/formic acid 5/94.5/0.5, B: ace-tonitrile/H₂O/formic acid 95/4.5/0.5) was used to elute peptides. The eluent was post-column split and infused by microelectrospray ionization into a custom built 14.5 T LTQ FT-ICR mass spectrometer. The extraction of the peptic fragments and their deuterium uptakes from these data was done by an in-house analysis package [25]. Then we compute the cumulative exchange rates from the deuterium uptakes with either the MEM-method [15] or a new approach based on integer linear programming [14].

A current limitation for implementation of this software is back exchange of deuterium-to-hydrogen during the separation of the samples. It has been reported that different peptides have a different percentage of back exchange due to the sequence of amino acids [26, 27]. Furthermore, the peptide sequence overlap will limit the ability to map single amino acid rate kinetics. Thus, reduction of backexchange has been investigated [28, 29], along with multiple acid proteases to increase sequence coverage [30]. The sHDX experiment is continually being improved, but in its current state the sHDX experiment does not take away from the integrity of the algorithm to discern single amino acid exchange kinetics.

References

Sheu SY, Schlag EW, Selzle HL, Yang DY: Molecular Dynamics of Hydrogen Bonds in Protein-D2O: The Solvent Isotope Effect. J Phys Chem A 2008, 112: 797–802. 10.1021/jp0771668
Article CAS PubMed Google Scholar
Kalkhof S, Ihling C, Mechtler K, Sinz A: Chemical Cross-Linking and High-Performance Fourier Transform Ion Cyclotron Resonance Mass Spectrometry for Protein Interaction Analysis: Application to a Calmodulin/Target Peptide Complex. Analytical Chemistry 2004, 77(2):495–503. 10.1021/ac0487294
Article Google Scholar
Fancy DA, Kodadek T: Chemistry for the analysis of proteinprotein interactions: Rapid and efficient cross-linking triggered by long wavelength light. Proc Natl Acad Sci USA 1999, 96: 6020–6024. 10.1073/pnas.96.11.6020
Article CAS PubMed PubMed Central Google Scholar
Goshe M, Chen Y, Anderson V: Identification of the Sites of Hydroxyl Radical Reaction with Peptides by Hydrogen/Deuterium Exchange: Pr evalence of Reactions with the Side Chains. Biochemistry 2000, 39(7):1761–1770. 10.1021/bi991569j
Article CAS PubMed Google Scholar
Holmberg L, Noller HF: Mapping the ribosomal RNA neighborhood of protein L11 by directed hydroxyl radical probing. J Mol Biol 1999, 289(2):223–233. 10.1006/jmbi.1999.2706
Article CAS PubMed Google Scholar
Sharp J, Becker J, Hettich R: Analysis of Protein Solvent Accessible Surfaces by Photochemical Oxidation and Mass Spectrometry. Analytical Chemistry 2004, 76(3):672–683. 10.1021/ac0302004
Article CAS PubMed Google Scholar
Zuiderweg E: Mapping Protein-Protein Interactions in Solution by NMR Spectroscopy. Biochemistry 2002, 41: 1–7. 10.1021/bi011870b
Article CAS PubMed Google Scholar
Leite J, Cascio M: Probing the Topology of the Glycine Receptor by Chemical Modification Coupled to Mass Spectrometry. Biochemistry 2002, 41(19):6140–6148. 10.1021/bi015895m
Article CAS PubMed Google Scholar
Kang S, Hawkridge A, Johnson K, Muddiman D, Prevelige P: Identification of Subunit-Subunit Interactions in Bacteriophage P22 Procapsids by Chemical Cross-linking and Mass Spectrometry. Journal of Proteome Research 2006, 5(2):370–377. 10.1021/pr050356f
Article CAS PubMed Google Scholar
Lam T, Lanman J, Emmett M, Hendrickson C, Prevelige P: Mapping of protein:protein contact surfaces by hydrogen/deuterium exchange, followed by on-line high-performance liquid chromatography-electrospray ionization fourier-transform ion-cyclotron-resonance mass analysis. Journal of Chromatography A 2002, 982: 85–95. 10.1016/S0021-9673(02)01357-2
Article CAS PubMed Google Scholar
Zhang Z, Smith DL: Determination of amide hydrogen exchange by mass spectrometry: a new tool for protein structure elucidation. Protein Sci 1993, 2(4):522–531. 10.1002/pro.5560020404
Article CAS PubMed PubMed Central Google Scholar
Ahuja RK, Magnanti TL, Orlin JB: Network flows: theory, algorithms, and applications. Englewood Cliffs, NJ: Prentice Hall Inc.; 1993.
Google Scholar
Althaus E, Canzar S, Elbassioni K, Karrenbauer A, Mestre J: Approximating the Interval Constrained Coloring Problem. In Proceedings of the 11th Scandinavian Workshop on Algorithm Theory. Gothenburg, Sweden; 2008.
Google Scholar
Althaus E, Canzar S, Ehrler C, Emmett MR, Karrenbauer A, Marshall AG, Meyer-Bäse A, Tipton J, Zhang H: Discrete Fitting of Hydrogen-Deuterium-exchange-data of overlapping fragments. The 2009 International Conference on Bioinformatics & Computational Biology 2009, in press.
Google Scholar
Zhang Z, Guan S, Marshall AG: Enhancement of the Effective Resolution of Mass Spectra of High Mass Biomolecules by Maximum Entropy-Based Deconvolution to Eliminate the Isotopic Natural Abundance Distribution. Journal of American Society of Mass Spectrometry 1997, 8: 659–670. 10.1016/S1044-0305(97)82982-0
Article CAS Google Scholar
Milne JS, Mayne L, Roder H, Wand AJ, Englander SW: Determinants of protein hydrogen exchange studied in equine cytochrome c. Protein Sci 1998, 7(3):739–745. 10.1002/pro.5560070323
Article CAS PubMed PubMed Central Google Scholar
The PyMOL Molecular Graphics System, Version 1.2r3pre, Schrödinger, LLC[http://www.pymol.org]
SCIL (Symbolic Constraints for Integer Linear Programming)2004. [http://www.mpi-sb.mpg.de/SCIL]
Mehlhorn K, Näher S: The LEDA Platform of Combinatorial and Geometric Computing. Cambridge: Cambridge University Press; 1999.
Google Scholar
für Informationstechnik Berlin KZZ:Solving Constraint Integer Programs. 2009. [http://scip.zib.de/]
Google Scholar
ILOG-CPLEX 10.02007. [http://www.ilog.com]
SoPlex (Sequential object-oriented simplex class library)2004. [http://www.zib.de/Optimization/Software/Soplex]
Wolsey LA: Integer programming. Wiley-interscience series in discrete mathematics and optimization, New York: Wiley & Sons; 1998.
Google Scholar
Held M, Karp R: The traveling salesman problem and minimum spanning trees: Part II. Mathematical Programming 1971, 1: 6–25. 10.1007/BF01584070
Article Google Scholar
Kazazic S, Zhang HM, Schaub TM, Emmett MR, Hendrickson CL, Blakney GT, Marshall AG: Automated Data Reduction for Hydrogen/Deuterium Exchange Experiments, Enabled by High-Resolution Fourier Transform Ion Cyclotron Resonance Mass Spectrometry. Journal of American Society of Mass Spectrometry 2010, (21):550–558. 10.1016/j.jasms.2009.12.016
Molday RS, Englander SW, Kallen RG: Primary structure effects on peptide group hydrogen exchange. Biochemistry 1972, 11(2):150–158. 10.1021/bi00752a003
Article CAS PubMed Google Scholar
Bai Y, Milne JS, Mayne L, Englander SW: Primary structure effects on peptide group hydrogen exchange. Proteins 1993, 17: 75–86. 10.1002/prot.340170110
Article CAS PubMed PubMed Central Google Scholar
Emmett MR, Kazazic S, Marshall AG, Chen W, Shi SDH, Bolanos B, Greig MJ: Supercritical Fluid Chromatography Reduction of Hydrogen/Deuterium Back Exchange in Solution-Phase Hydrogen/Deuterium Exchange with Mass Spectrometric Analysis. Analytical Chemistry 2006, 78(19):7058–7060. 10.1021/ac060693n
Article CAS PubMed Google Scholar
Zhang HM, Bou-Assaf GM, Emmett MR, Marshall AG: Fast Reversed-Phase Liquid Chromatography to Reduce Back Exchange and Increase Throughput in H/D Exchange Monitored by FT-ICR Mass Spectrometry. Journal of the American Society for Mass Spectrometry 2009, 20(3):520–524. 10.1016/j.jasms.2008.11.010
Article CAS PubMed PubMed Central Google Scholar
Zhang HM, Kazazic S, Schaub TM, Tipton JD, Emmett MR, Marshall AG: Enhanced Digestion Efficiency, Peptide Ionization Efficiency, and Sequence Resolution for Protein Hydrogen/Deuterium Exchange Monitored by Fourier Transform Ion Cyclotron Resonance Mass Spectrometry. Anal Chem 2008, 80(23):9034–41. 10.1021/ac801417d
Article CAS PubMed PubMed Central Google Scholar

Download references

Acknowledgements

Work supported by NSF (DMR-00-84173), NIH (GM78359), Florida State University, the National High Magnetic Field Laboratory in Tallahassee, FL, and by Deutsche Forschungsgemeinschaft (DFG) within Priority Programme 1307 "Algorithm Engineering".

Author information

Authors and Affiliations

Institut für Informatik, Fachbereich 08, Staudingerweg 9, 55099, Mainz, Germany
Ernst Althaus
Algorithmic Computational Biology, Centrum Wiskunde & Informatica, Amsterdam, The Netherlands
Stefan Canzar
Center for Bioinformatics, Computer Science Department, Saarland University, 66041, Saarbrücken, Germany
Carsten Ehrler
Ion Cyclotron Resonance Program, National High Magnetic Field Laboratory, Florida State University, 1800 East Paul Dirac Drive, Tallahassee, FL, 32310-4005, USA
Mark R Emmett, Alan G Marshall, Jeremiah D Tipton & Hui-Min Zhang
Department of Chemistry & Biochemistry, Florida State University, 95 Chieftain Way, Tallahassee, FL, 32306, USA
Mark R Emmett & Alan G Marshall
Institute of Mathematics, EPFL, Station 8, 1015, Lausanne, Switzerland
Andreas Karrenbauer
Department of Scientific Computing, Florida State University, Tallahassee, FL, 32306-4120, USA
Anke Meyer-Bäse

Authors

Ernst Althaus
View author publications
You can also search for this author in PubMed Google Scholar
Stefan Canzar
View author publications
You can also search for this author in PubMed Google Scholar
Carsten Ehrler
View author publications
You can also search for this author in PubMed Google Scholar
Mark R Emmett
View author publications
You can also search for this author in PubMed Google Scholar
Andreas Karrenbauer
View author publications
You can also search for this author in PubMed Google Scholar
Alan G Marshall
View author publications
You can also search for this author in PubMed Google Scholar
Anke Meyer-Bäse
View author publications
You can also search for this author in PubMed Google Scholar
Jeremiah D Tipton
View author publications
You can also search for this author in PubMed Google Scholar
Hui-Min Zhang
View author publications
You can also search for this author in PubMed Google Scholar

Corresponding author

Correspondence to Ernst Althaus.

Additional information

Authors' contributions

HMZ, JT, and MRE performed the H/D exchange experiments and analyzed the data to yield the rate constant distributions from which the subsequent residue assignments were made. EA, SC, CE, and AK developed the mathematical model, performed the computational experiments and drafted the manuscript. AMB initiated and identified the mathematical approach. AGM participated in the design and coordination of the study and in preparation of the manuscript. All authors read and approved the final manuscript.

Authors’ original submitted files for images

Below are the links to the authors’ original submitted files for images.

Authors’ original file for figure 1

Authors’ original file for figure 2

Authors’ original file for figure 3

Authors’ original file for figure 4

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Reprints and permissions

About this article

Cite this article

Althaus, E., Canzar, S., Ehrler, C. et al. Computing H/D-Exchange rates of single residues from data of proteolytic fragments. BMC Bioinformatics 11, 424 (2010). https://doi.org/10.1186/1471-2105-11-424

Download citation

Received: 16 January 2009
Accepted: 11 August 2010
Published: 11 August 2010
DOI: https://doi.org/10.1186/1471-2105-11-424

Computing H/D-Exchange rates of single residues from data of proteolytic fragments