In the following, we present different approaches to tackle the assignment problem that we have derived from the mathematical abstraction introduced above.
Integer linear programming formulation
First, we formulate the idealized version of the problem assuming error-free experimental data as an integer linear program (ILP). That is, we give an ILP whose feasible solutions correspond one-to-one to the feasible assignments of colors to residues.
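Let π : {1, ..., n} → 𝒦 be an assignment of colors to residues, where 𝒦 = {1, ..., K} denotes the set of colors. A binary variable $x_i^k$ for every color k ∈ 𝒦 and every residue i ∈ {1, ..., n} indicates whether residue i is assigned color k or not, i.e.
$$x_i^k = \begin{cases} 1 & \text{if } \pi(i) = k, \\ 0 & \text{otherwise.} \end{cases}$$
We denote by $x^k = (x_1^k, \dots, x_n^k)$ the vector of binary variables modeling the assignment of color k and let $x = (x^1, \dots, x^K)$.
Since every residue is assigned exactly one color, $\sum_{k \in \mathcal{K}} x_i^k = 1$ must hold for all i ∈ {1, ..., n}. Conversely, every 0-1 assignment to variables x satisfying $\sum_{k \in \mathcal{K}} x_i^k = 1$ for all i ∈ {1, ..., n} corresponds to an assignment of colors to residues. A 0-1 assignment to x corresponds to a feasible color assignment π if and only if, furthermore, $\sum_{l=i}^{j} x_l^k = b_{ij}^k$ holds for all (i, j) ∈ ℱ and k ∈ 𝒦, where $b_{ij}^k$ denotes the measured number of residues of color k in fragment (i, j).
Now consider the problem of computing an assignment with minimum total error. Translating the definition of the error that we make when assigning color k (or not) to residues in fragment (i, j) (see equation (1)) to the context of 0-1 assignments to variables $x^k$, the problem of minimizing (2) becomes
$$\begin{aligned} \min \quad & \sum_{k \in \mathcal{K}} \sum_{(i,j) \in \mathcal{F}} \Bigl| \sum_{l=i}^{j} x_l^k - b_{ij}^k \Bigr| \\ \text{s.t.} \quad & \sum_{k \in \mathcal{K}} x_i^k = 1 \quad \text{for all } i \in \{1, \dots, n\}, \\ & x_i^k \in \{0, 1\} \quad \text{for all } i \in \{1, \dots, n\},\ k \in \mathcal{K}. \end{aligned}$$
Concerning the formulation of a minimum sum of absolute values in terms of a linear objective function and linear constraints, observe that $\bigl| \sum_{l=i}^{j} x_l^k - b_{ij}^k \bigr|$ is the smallest number $e_{ij}^k$ that satisfies
$$e_{ij}^k \ge \sum_{l=i}^{j} x_l^k - b_{ij}^k \quad \text{and} \quad e_{ij}^k \ge b_{ij}^k - \sum_{l=i}^{j} x_l^k.$$
Hence, after introducing a variable $e_{ij}^k$ for every color k ∈ 𝒦 and every fragment (i, j) ∈ ℱ, the integer linear program we are looking at is
$$\begin{aligned} \min \quad & \sum_{k \in \mathcal{K}} \sum_{(i,j) \in \mathcal{F}} e_{ij}^k \\ \text{s.t.} \quad & \sum_{l=i}^{j} x_l^k + e_{ij}^k \ge b_{ij}^k \quad \text{for all } (i,j) \in \mathcal{F},\ k \in \mathcal{K}, \\ & \sum_{l=i}^{j} x_l^k - e_{ij}^k \le b_{ij}^k \quad \text{for all } (i,j) \in \mathcal{F},\ k \in \mathcal{K}, \\ & \sum_{k \in \mathcal{K}} x_i^k = 1 \quad \text{for all } i \in \{1, \dots, n\}, \\ & x_i^k \in \{0, 1\} \quad \text{for all } i \in \{1, \dots, n\},\ k \in \mathcal{K}. \end{aligned}$$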
We refer to this integer linear program as basic-ILP.
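To make the objective of this program concrete, the following minimal Python sketch evaluates the total error of an assignment and finds the optimum of a tiny instance by exhaustive enumeration. It is a brute-force reference for the objective that the ILP encodes, not an ILP solver. The toy instance, the dictionary `b` of measured counts and all function names are hypothetical.

```python
from itertools import product


def total_error(pi, fragments, b, colors):
    """Sum over colors k and fragments (i, j) of
    |#{l in i..j : pi(l) = k} - b[k][(i, j)]| (residues are 1-based)."""
    err = 0
    for k in colors:
        for (i, j) in fragments:
            count = sum(1 for l in range(i, j + 1) if pi[l - 1] == k)
            err += abs(count - b[k][(i, j)])
    return err


def brute_force_optimum(n, fragments, b, colors):
    """Enumerate all len(colors)**n assignments.
    Returns (minimum total error, list of all optimal assignments)."""
    best, optimal = None, []
    for pi in product(colors, repeat=n):
        err = total_error(pi, fragments, b, colors)
        if best is None or err < best:
            best, optimal = err, [pi]
        elif err == best:
            optimal.append(pi)
    return best, optimal


# Hypothetical toy instance: 4 residues, 3 fragments, 2 colors.
fragments = [(1, 2), (1, 4), (3, 4)]
colors = [1, 2]
b = {1: {(1, 2): 1, (1, 4): 3, (3, 4): 2},
     2: {(1, 2): 1, (1, 4): 1, (3, 4): 0}}

best, optimal = brute_force_optimum(4, fragments, b, colors)
print(best, optimal)
```

The enumeration grows as K^n and is only meant for checking small cases against a solver.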
In our experiments, it turns out that finding a single solution is very fast, whereas enumerating all solutions takes quite some time due to their large number. This large number can be explained as follows: Recall that 𝒫 is the partition of {1, ..., n} into a minimal number of parts, such that for each part p ∈ 𝒫 and each fragment f ∈ ℱ either p ⊆ f or p ∩ f = ∅. In other words, no fragment starts or ends within such a part. Therefore, from an assignment π we can derive further assignments π' exhibiting the same total error by simply permuting the colors within these parts, i.e. if i, j ∈ p for some p ∈ 𝒫 and the total error of π is e, then π' with π'(i) = π(j), π'(j) = π(i) and π'(l) = π(l) for l ≠ i, j has total error e as well. We call two assignments equivalent if one can be obtained from the other by iteratively applying this rule.
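The following Python sketch illustrates the partition and the swapping rule on a toy instance. It assumes that the parts are taken as maximal runs of consecutive residues within which no fragment starts or ends; the function names and the instance are hypothetical.

```python
def coarsest_partition(n, fragments):
    """Split {1, ..., n} into maximal runs of consecutive residues such that
    no fragment (i, j) starts or ends inside a run."""
    # A new part begins at residue 1, at every fragment start i,
    # and directly after every fragment end j.
    starts = {1}
    for (i, j) in fragments:
        starts.add(i)
        if j + 1 <= n:
            starts.add(j + 1)
    cuts = sorted(starts) + [n + 1]
    return [list(range(cuts[t], cuts[t + 1])) for t in range(len(cuts) - 1)]


def color_counts(pi, fragments, colors):
    """Number of residues of each color in each fragment (pi is 1-based)."""
    return {(k, f): sum(1 for l in range(f[0], f[1] + 1) if pi[l - 1] == k)
            for k in colors for f in fragments}


def swap(pi, i, j):
    """Exchange the colors of residues i and j."""
    out = list(pi)
    out[i - 1], out[j - 1] = out[j - 1], out[i - 1]
    return tuple(out)


fragments = [(1, 2), (1, 4), (3, 4)]
colors = [1, 2]
parts = coarsest_partition(4, fragments)
pi = (1, 2, 1, 1)
within = swap(pi, 1, 2)   # residues 1 and 2 lie in the same part
across = swap(pi, 2, 3)   # residues 2 and 3 lie in different parts
print(parts)
print(color_counts(pi, fragments, colors) == color_counts(within, fragments, colors))
print(color_counts(pi, fragments, colors) == color_counts(across, fragments, colors))
```

A swap within a part leaves every per-fragment color count, and hence the total error, unchanged; a swap across parts in general does not.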
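In order to enumerate equivalent solutions only once, we modify our integer linear program as follows: For k ∈ 𝒦 and p ∈ 𝒫, we replace the binary variables $x_i^k$, i ∈ p, by a single integer variable $y_p^k$ with $0 \le y_p^k \le |p|$, which counts the residues in p that are assigned color k. Moreover, let A be the |ℱ| × |𝒫| inclusion matrix, i.e. for every f ∈ ℱ and p ∈ 𝒫, the corresponding entry is given by
$$A_{f,p} = \begin{cases} 1 & \text{if } p \subseteq f, \\ 0 & \text{otherwise.} \end{cases}$$
We denote by $y^k$ the vector of the variables $y_p^k$, by $e^k$ the vector of errors with respect to color k and by $b^k$ the vector of measured numbers of residues colored k, the latter two indexed by the fragments. In matrix notation the constraints are then of the form
$$A y^k + e^k \ge b^k, \qquad A y^k - e^k \le b^k$$
for all k ∈ 𝒦. Hence our integer linear program becomes
$$\begin{aligned} \min \quad & \sum_{k \in \mathcal{K}} \mathbf{1}^T e^k \\ \text{s.t.} \quad & A y^k + e^k \ge b^k \quad \text{for all } k \in \mathcal{K}, \\ & A y^k - e^k \le b^k \quad \text{for all } k \in \mathcal{K}, \\ & \sum_{k \in \mathcal{K}} y^k = P, \\ & y^k \in \mathbb{Z}_{\ge 0}^{|\mathcal{P}|} \quad \text{for all } k \in \mathcal{K}, \end{aligned} \tag{3}$$
where P is the vector that contains |p| for each component p ∈ 𝒫 and $\mathbf{1}$ denotes the all-ones vector. We refer to this integer linear program as improved-ILP. We compute all solutions within a certain error bound by following basically the same approach as described above. However, the number of solutions now is just a fraction of the number of solutions of the original basic-ILP, yielding a significant speed-up.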
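The reduction in the number of solutions can be quantified directly: all assignments that place the same number of residues of each color into each part collapse to a single vector y. The sketch below builds the inclusion matrix, aggregates an assignment into y and counts how many assignments share that y. Function names and the toy data are hypothetical, and parts are represented as sorted lists of consecutive residues.

```python
from math import factorial


def inclusion_matrix(fragments, parts):
    """A[f][p] = 1 if part p lies inside fragment f = (i, j), else 0."""
    return [[1 if i <= p[0] and p[-1] <= j else 0 for p in parts]
            for (i, j) in fragments]


def aggregate(pi, parts, colors):
    """y[k][t] = number of residues of color k in part t (pi is 1-based)."""
    return {k: [sum(1 for l in p if pi[l - 1] == k) for p in parts]
            for k in colors}


def equivalent_assignments(y, parts, colors):
    """Number of assignments that aggregate to y: one multinomial
    coefficient |p|! / prod_k y[k][t]! per part, multiplied together."""
    total = 1
    for t, p in enumerate(parts):
        ways = factorial(len(p))
        for k in colors:
            ways //= factorial(y[k][t])
        total *= ways
    return total


fragments = [(1, 2), (1, 4), (3, 4)]
parts = [[1, 2], [3, 4]]
colors = [1, 2]
A = inclusion_matrix(fragments, parts)
y = aggregate((1, 2, 1, 1), parts, colors)
print(A, y, equivalent_assignments(y, parts, colors))
```

For long parts with mixed colors this count grows quickly, which is consistent with the speed-up reported for improved-ILP.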
Although there is commercial software for integer programming which quickly solves instances of reasonable size, no algorithm is known that is guaranteed to find an optimum solution in polynomial time, since integer programming is NP-hard in general. However, the problem of assigning exchange rates to residues in a way that is consistent with the experimentally found bulk data exhibits a certain combinatorial structure. In the next section, we exploit this fact to derive an exact polynomial-time algorithm for the case of two colors, which we subsequently use as a building block for approximation algorithms for more than two colors.
A Combinatorial Approach
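First, let us consider the special case of two colors, i.e. K = 2 and thus 𝒦 = {1, 2}. That is, we have constraints of the form $y_p^1 + y_p^2 = |p|$ for all p ∈ 𝒫. This allows us to simplify the linear program considerably. We replace $y^2 = P - y^1$ and omit the superscript of the y-variables in the following. This yields
$$\begin{aligned} \min \quad & \mathbf{1}^T (e^1 + e^2) \\ \text{s.t.} \quad & A y + e^1 \ge b^1, \qquad A y - e^1 \le b^1, \\ & A y + e^2 \ge F - b^2, \qquad A y - e^2 \le F - b^2, \\ & 0 \le y \le P, \quad y \text{ integral}, \end{aligned}$$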
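where F is the vector of fragment sizes. We may get rid of half of the constraints by the following observation. Let $b := \max\{b^1, F - b^2\}$ and $\underline{b} := \min\{b^1, F - b^2\}$, where the maximum and the minimum are taken component-wise. Let y be an arbitrary feasible solution with minimum total error $\mathbf{1}^T (e^1 + e^2)$. We may consider the contribution of each fragment independently for that particular y. We may rename the error variables $e^1$ and $e^2$ component-wise according to b and $\underline{b}$, i.e.
$$(e_f, \bar{e}_f) := \begin{cases} (e_f^2, e_f^1) & \text{if } b_f^1 \ge F_f - b_f^2, \\ (e_f^1, e_f^2) & \text{otherwise,} \end{cases} \tag{4}$$
so that $e_f = |(Ay)_f - \underline{b}_f|$ and $\bar{e}_f = |(Ay)_f - b_f|$. For each f ∈ ℱ with $\underline{b}_f \le (Ay)_f \le b_f$, we have $e_f + \bar{e}_f = b_f - \underline{b}_f$. If $(Ay)_f > b_f$, we get $e_f + \bar{e}_f = b_f - \underline{b}_f + 2\,((Ay)_f - b_f)$. Analogously, we get $e_f + \bar{e}_f = b_f - \underline{b}_f + 2\,(\underline{b}_f - (Ay)_f)$ if $(Ay)_f < \underline{b}_f$. The total error thus equals the constant $\mathbf{1}^T (b - \underline{b})$ plus twice the total deviation of Ay from the interval $[\underline{b}, b]$. Hence, it is sufficient to optimize the following linear program, in which e and ē measure only the deviation below $\underline{b}$ and above b, respectively,
$$\begin{aligned} \min \quad & \mathbf{1}^T (e + \bar{e}) \\ \text{s.t.} \quad & A y + e \ge \underline{b}, \\ & A y - \bar{e} \le b, \\ & 0 \le y \le P, \quad e, \bar{e} \ge 0, \end{aligned} \tag{5}$$
which is integral if b and $\underline{b}$ are integral since the constraint matrix is totally unimodular. The corresponding dual LP is given by
$$\begin{aligned} \max \quad & \underline{b}^T u - b^T v - P^T w \\ \text{s.t.} \quad & A^T u - A^T v - w \le 0, \\ & 0 \le u \le \mathbf{1}, \quad 0 \le v \le \mathbf{1}, \quad w \ge 0, \end{aligned} \tag{6}$$
which is equivalent to (multiplying the objective function by -1 and introducing slack variables s)
$$\begin{aligned} \min \quad & -\underline{b}^T u + b^T v + P^T w \\ \text{s.t.} \quad & A^T u - A^T v - w + s = 0, \\ & 0 \le u \le \mathbf{1}, \quad 0 \le v \le \mathbf{1}, \quad w, s \ge 0. \end{aligned} \tag{7}$$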
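The elimination of half of the constraints rests on a per-fragment identity: if z residues of a fragment of size F receive color 1, the two-color error |z - b1| + |(F - z) - b2| equals the width of the interval between b1 and F - b2 plus twice the distance of z from that interval. The following sketch checks this identity exhaustively for small values. The function names are hypothetical, and the scalars `b1`, `b2`, `size` stand for single components of the vectors used in the text.

```python
def two_color_error(z, b1, b2, size):
    """Error of one fragment when z of its `size` residues get color 1
    and the remaining size - z residues get color 2."""
    return abs(z - b1) + abs((size - z) - b2)


def interval_form(z, b1, b2, size):
    """Same quantity, expressed via the interval [lo, hi] spanned by
    b1 and size - b2."""
    hi = max(b1, size - b2)
    lo = min(b1, size - b2)
    below = max(0, lo - z)   # deviation of z below the interval
    above = max(0, z - hi)   # deviation of z above the interval
    return (hi - lo) + 2 * (below + above)


# Exhaustive check over all small fragment sizes and measured counts.
identity_holds = all(
    two_color_error(z, b1, b2, size) == interval_form(z, b1, b2, size)
    for size in range(7)
    for z in range(size + 1)
    for b1 in range(size + 1)
    for b2 in range(size + 1)
)
print(identity_holds)
```

Since the interval width does not depend on z, minimizing the total error is equivalent to minimizing the total deviation from the intervals, which needs only one lower and one upper constraint per fragment.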
We will show next that this LP is a Minimum Cost Circulation Problem. To this end, let M be the matrix of the equality constraints, i.e.
$$M = \bigl( A^T \mid -A^T \mid -I \mid I \bigr),$$
where the column blocks correspond to the dual variables of the lower and upper fragment constraints, to the dual variables of the bounds y ≤ P, and to the slack variables, respectively.
Note that this matrix has the column-wise consecutive-ones property (up to the sign of a column). By row operations as in Gaussian elimination, we can easily transform M such that each column contains exactly one +1 and one -1, as follows. We add the dummy constraint 0 = 0 at the end and subtract from each row its predecessor. The resulting matrix, say $\tilde{M}$, can be considered as the node-arc incidence matrix of a directed graph. Since the right-hand side remains unchanged, we get a Minimum Cost Circulation problem on a graph with |𝒫| + 1 nodes and 2(|ℱ| + |𝒫|) arcs [12]. As a matter of fact, we have for each variable $y_p$ two arcs corresponding to the constraint $0 \le y_p \le |p|$ and for each fragment (i, j) the arcs (i, j + 1) and (j + 1, i) as depicted in Figure 2.
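The row transformation can be checked mechanically. The sketch below assumes that the equality constraints have the block form M = (A^T | -A^T | -I | I), with one row per part and the parts ordered along the sequence; this block structure is our reading of the dual program and is labeled as an assumption. All function names are hypothetical.

```python
def dual_matrix(A):
    """Assumed block form M = [A^T | -A^T | -I | I]: one row per part,
    columns for the two fragment duals, the bound duals and the slacks."""
    n_parts = len(A[0])
    M = []
    for p in range(n_parts):
        col_a = [row[p] for row in A]
        unit = [1 if q == p else 0 for q in range(n_parts)]
        M.append(col_a + [-a for a in col_a] + [-u for u in unit] + unit)
    return M


def to_incidence(M):
    """Append the dummy row 0 = 0, then subtract from each row its
    predecessor (differences are taken between the original rows)."""
    rows = [list(r) for r in M] + [[0] * len(M[0])]
    return [rows[0]] + [[a - b for a, b in zip(rows[r], rows[r - 1])]
                        for r in range(1, len(rows))]


# Toy instance: fragments (1,2), (1,4), (3,4) over the parts {1,2} and {3,4}.
A = [[1, 0], [1, 1], [0, 1]]
M_tilde = to_incidence(dual_matrix(A))
columns = list(zip(*M_tilde))
print(len(M_tilde), len(columns))
print(all(sorted(v for v in col if v != 0) == [-1, 1] for col in columns))
```

For this instance the transformed matrix has |𝒫| + 1 = 3 rows (nodes) and 2(|ℱ| + |𝒫|) = 10 columns (arcs), each column containing exactly one +1 and one -1.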
For three or more colors the complexity is open. The total unimodularity of the constraint matrix is lost, i.e. there are instances with fractional vertices, e.g. the one from Figure 2 with appropriate right-hand sides. Moreover, there is an instance which has a positive error although the value of the LP relaxation is 0. Hence the integrality gap is unbounded. If the number of colors is not fixed but part of the input, the problem is NP-complete [13].
A Simple and Efficient Heuristic for the General Case
We present an algorithm that uses our combinatorial approach for the 2-color case (K = 2) from the previous section as a subroutine to provide solutions that approximate (without performance guarantee) a coloring, i.e. an assignment of colors to residues, with minimum total error for instances with an arbitrary but fixed number of colors. The general idea is to reduce the problem to the 2-color case by merging all but one color, say color i, into a single color and to solve the resulting problem by an algorithm for the minimum cost circulation problem, as described in the section about the Combinatorial Approach. We then remove the residues colored i by the obtained solution and solve the coloring problem on the remaining residues recursively using K - 1 colors.
Our approach works as follows. Consider an arbitrary color k ∈ 𝒦. We compute a subset of the residues that are assigned color k such that the total error with respect to color k and the sum of all remaining colors is minimized, i.e. we solve the two-color problem with requirements (= right-hand sides)
$$\tilde{b}^1 := b^k, \qquad \tilde{b}^2 := \sum_{l \in \mathcal{K} \setminus \{k\}} b^l,$$
where $b^l$ denotes the vector of measured numbers of residues of color l per fragment.
Residues assigned color k in an optimal solution to this problem will be colored k in the final solution too; the assignment of the remaining colors 𝒦 \ {k} to the remaining residues is computed recursively.
Note that the order in which colors are selected to be the next fixed color k in the recursive computation can be arbitrary. Nevertheless, different orders might lead to solutions of different total error. As we have only three different colors in our experimental data, we evaluate all six orderings and return the best solution found.
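The recursion can be sketched at the level of parts as follows. This is a minimal illustration under stated assumptions: the exact two-color step is replaced by an exhaustive search over the vectors y (a stand-in for the minimum cost circulation computation, feasible only for tiny instances), and all names as well as the toy instance are hypothetical.

```python
from itertools import permutations, product


def two_color_brute_force(A, cap, b_k, b_rest):
    """Stand-in for the circulation solver: exhaustively choose y with
    0 <= y <= cap minimizing the error of color k plus that of the merged rest."""
    best_y, best_err = None, None
    for y in product(*(range(c + 1) for c in cap)):
        err = 0
        for row, bk, br in zip(A, b_k, b_rest):
            z = sum(a * v for a, v in zip(row, y))
            rest = sum(a * (c - v) for a, c, v in zip(row, cap, y))
            err += abs(z - bk) + abs(rest - br)
        if best_err is None or err < best_err:
            best_y, best_err = list(y), err
    return best_y


def peel_colors(A, P, b, order):
    """Fix the colors one at a time in the given order; the last color
    receives all residues that are still uncolored."""
    cap, y = list(P), {}
    for idx, k in enumerate(order[:-1]):
        rest = order[idx + 1:]
        b_rest = [sum(b[l][f] for l in rest) for f in range(len(A))]
        y[k] = two_color_brute_force(A, cap, b[k], b_rest)
        cap = [c - v for c, v in zip(cap, y[k])]
    y[order[-1]] = cap
    return y


def total_error(A, y, b):
    """Sum over colors and fragments of |(A y^k)_f - b^k_f|."""
    return sum(abs(sum(a * v for a, v in zip(row, y[k])) - b[k][f])
               for k in y for f, row in enumerate(A))


def heuristic(A, P, b):
    """Try every color order and keep the coloring with the smallest error."""
    candidates = (peel_colors(A, P, b, order) for order in permutations(sorted(b)))
    return min(candidates, key=lambda y: total_error(A, y, b))


# Hypothetical instance: two parts of size 2, three fragments, three colors.
A = [[1, 0], [1, 1], [0, 1]]
P = [2, 2]
b = {1: [1, 2, 1], 2: [1, 1, 0], 3: [0, 1, 1]}
solution = heuristic(A, P, b)
print(solution, total_error(A, solution, b))
```

The result is a vector y per color; any assignment of colors to residues that respects these counts within each part realizes the same total error.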
In the next section we present a Lagrangian relaxation method that computes, based on our combinatorial approach for the 2-color case, a bound on the minimum total error. This bound is exploited in a branch-and-bound manner to determine all optimal colorings.
A Lagrangian Relaxation Approach
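In this section we propose a Lagrangian relaxation approach for the problem, which is particularly suitable for finding all optimal solutions. It is based on the improved-ILP formulation:
$$\min \ \sum_{k \in \mathcal{K}} \mathbf{1}^T e^k \tag{8}$$
$$\text{s.t.} \quad A y^k + e^k \ge b^k \quad \text{for all } k \in \mathcal{K}, \tag{9}$$
$$A y^k - e^k \le b^k \quad \text{for all } k \in \mathcal{K}, \tag{10}$$
$$\sum_{k \in \mathcal{K}} y^k = P, \tag{11}$$
$$y^k \in \mathbb{Z}_{\ge 0}^{|\mathcal{P}|} \quad \text{for all } k \in \mathcal{K},$$
where P is the vector that contains the length of parts in 𝒫. The problem can be considered to contain independent structures for each color k ∈ 𝒦, namely the set of nonnegative integer vectors $y^k$ satisfying (9) and (10) under the objective (8), that are linked by constraints (11). Therefore, dualizing the linking constraints (11) with Lagrangian multipliers λ splits the problem into an independent problem for each color k ∈ 𝒦:
$$L(\lambda) = -\lambda^T P + \sum_{k \in \mathcal{K}} \min \bigl\{ \mathbf{1}^T e^k + \lambda^T y^k \ : \ (9),\ (10),\ 0 \le y^k \le P,\ y^k \text{ integral} \bigr\}.$$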
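Neglecting the constant term $-\lambda^T P$ in the objective function and replacing the error variable e by e + ē, we have to determine, for every color k ∈ 𝒦, an optimal integral solution to the following linear program:
$$\min \ \mathbf{1}^T (e^k + \bar{e}^k) + \lambda^T y^k$$
$$\text{s.t.} \quad A y^k + e^k \ge b^k, \tag{12}$$
$$A y^k - \bar{e}^k \le b^k, \tag{13}$$
$$e^k, \bar{e}^k \ge 0, \tag{14}$$
$$0 \le y^k \le P.$$
Note that we added constraint (14) to enforce $e_f^k$ or $\bar{e}_f^k$ to be zero if $\bar{e}_f^k$, respectively $e_f^k$, corresponds to the absolute value of the error, i.e. if constraint (13), respectively constraint (12), for fragment f is tight. In contrast to (9) and (10), where nonnegativity of the error variable is implied, we have to enforce $e^k$ and $\bar{e}^k$ to be nonnegative explicitly. In every optimum solution either $e_f^k$ or $\bar{e}_f^k$ (or both) will be zero for each fragment f. Similarly to linear program (5), its dual, after multiplying the objective function by -1 and introducing slack variables s, is given by (omitting the color superscript k):
$$\begin{aligned} \min \quad & -b^T u + b^T v + P^T w \\ \text{s.t.} \quad & A^T u - A^T v - w + s = \lambda, \\ & 0 \le u \le \mathbf{1}, \quad 0 \le v \le \mathbf{1}, \quad w, s \ge 0. \end{aligned} \tag{15}$$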
This linear program differs from LP (7) only in the right-hand sides of the equality constraints.
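The bound obtained from the relaxation can be illustrated on a tiny two-color instance. The sketch below evaluates the Lagrangian function for given multipliers and compares it with the exact optimum. It assumes the bounds 0 ≤ y ≤ P in each color subproblem and solves the subproblems by exhaustive search, a stand-in for the flow computation described in the text. All names and the instance are hypothetical.

```python
from itertools import product


def fragment_error(A, y, b_k):
    """sum_f |(A y)_f - b_k[f]| for one color."""
    return sum(abs(sum(a * v for a, v in zip(row, y)) - bf)
               for row, bf in zip(A, b_k))


def color_subproblem(A, P, b_k, lam):
    """min over integer 0 <= y <= P of fragment_error + lam . y (brute force)."""
    best = None
    for y in product(*(range(c + 1) for c in P)):
        val = fragment_error(A, y, b_k) + sum(l * v for l, v in zip(lam, y))
        best = val if best is None else min(best, val)
    return best


def lagrangian_bound(A, P, b, lam):
    """L(lam) = -lam . P + sum over colors of the independent subproblem optima."""
    constant = -sum(l * c for l, c in zip(lam, P))
    return constant + sum(color_subproblem(A, P, b[k], lam) for k in b)


def exact_optimum_two_colors(A, P, b):
    """Optimum of the linked problem for colors 1 and 2, using y^2 = P - y^1."""
    best = None
    for y1 in product(*(range(c + 1) for c in P)):
        y2 = [c - v for c, v in zip(P, y1)]
        err = fragment_error(A, y1, b[1]) + fragment_error(A, y2, b[2])
        best = err if best is None else min(best, err)
    return best


# Hypothetical instance with inconsistent data, so the optimum is positive.
A = [[1, 0], [1, 1], [0, 1]]
P = [2, 2]
b = {1: [2, 2, 1], 2: [1, 1, 0]}
optimum = exact_optimum_two_colors(A, P, b)
print(optimum, lagrangian_bound(A, P, b, [0, 0]), lagrangian_bound(A, P, b, [-1, -1]))
```

For every choice of multipliers the value of the Lagrangian function is a lower bound on the minimum total error, which is what allows pruning in a branch-and-bound search.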