The CSM problem is a many-to-many generalization of the classical min-cost bipartite matching problem [12]. We describe the problem in an abstract setting, and cast it to a read alignment problem in the next section.
Consider arbitrary sets $X$ and $Y$. A many-to-many matching (henceforth a matching) between $X$ and $Y$ is a set of pairs $M \subseteq X \times Y$ (see Figure 2a-c). The coverage of an element $x \in X$ with respect to a matching $M$ is $c_M(x) = |\{y : (x, y) \in M\}|$. Symmetrically, $c_M(y) = |\{x : (x, y) \in M\}|$ for $y \in Y$.
A coverage sensitive matching cost function (henceforth a cost function) $w$ for $X$ and $Y$ assigns a matching cost $w_m(x, y)$ to every pair $(x, y) \in X \times Y$, and a coverage cost $w_c(z, i)$ to every $z \in X \cup Y$ and every integer $i \ge 0$. The cost of a matching $M$ between $X$ and $Y$ with respect to $w$ is given by
$$w(M) = \sum_{(x, y) \in M} w_m(x, y) + \sum_{x \in X} w_c(x, c_M(x)) + \sum_{y \in Y} w_c(y, c_M(y)). \tag{1}$$
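As an illustration of Equation (1), the following minimal Python sketch evaluates the cost of a matching. The function name `matching_cost`, the dictionary representation of $w_m$, and the toy instance are assumptions made for this example and do not appear in the original text; element labels in $X$ and $Y$ are assumed to be distinct.

```python
from collections import Counter

def matching_cost(M, X, Y, w_m, w_c):
    """Evaluate Equation (1) for a matching M given as a set of (x, y) pairs."""
    cov_x = Counter(x for x, _ in M)  # c_M(x); a Counter yields 0 for unmatched x
    cov_y = Counter(y for _, y in M)  # c_M(y)
    pair_cost = sum(w_m[(x, y)] for (x, y) in M)
    coverage_cost = (sum(w_c(x, cov_x[x]) for x in X)
                     + sum(w_c(y, cov_y[y]) for y in Y))
    return pair_cost + coverage_cost

# Hypothetical toy instance (not taken from the paper).
X = ["x1", "x2"]
Y = ["y1"]
w_m = {("x1", "y1"): 1, ("x2", "y1"): 2}

def w_c(z, i):
    return (i - 1) ** 2  # cheapest when an element is covered exactly once

M = {("x1", "y1"), ("x2", "y1")}
print(matching_cost(M, X, Y, w_m, w_c))      # pair costs 3, plus coverage cost 1 for y1
print(matching_cost(set(), X, Y, w_m, w_c))  # three uncovered elements, cost 1 each
```

In this toy instance the full matching is more expensive than the empty one, because the pair costs outweigh the coverage penalty saved on `x1` and `x2`.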
The CSM problem
Input: A Matching Instance (X, Y, w) consisting of sets X, Y, and cost function w.
Output: Compute $\mathrm{CSM}(X, Y, w) = \min_{M \subseteq X \times Y} w(M)$.
Note that CSM generalizes classical problems in combinatorics. For example, consider the problem of finding a maximum (partial one-to-one) matching in a bipartite graph $G$ with vertex shores $X$, $Y$ and edge set $E$. This problem can be solved by solving CSM on the input $X$, $Y$ with the following costs: set $w_c(z, 0) = w_c(z, 1) = 0$ and $w_c(z, i) = \infty$ for all $z \in X \cup Y$, $i > 1$; set $w_m(x, y) = -1$ for $(x, y) \in E$, and $w_m(x, y) = \infty$ otherwise. Similarly, CSM can be used to solve the minimum/maximum weight variants of the bipartite matching problem. However, CSM is NP-hard in general (see Additional File 1), and therefore we do not expect to solve general instances efficiently.
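The reduction from maximum bipartite matching can be checked on a small graph. The sketch below is an exhaustive (exponential-time) CSM solver intended only for tiny instances; the name `csm_brute_force` and the example graph are assumptions made for illustration, not part of the original text.

```python
import math
from collections import Counter
from itertools import combinations

def csm_brute_force(X, Y, w_m, w_c):
    """Exhaustive CSM: try every subset of finite-cost pairs (exponential time)."""
    pairs = [(x, y) for x in X for y in Y if w_m(x, y) < math.inf]
    best_cost, best_M = math.inf, None
    for k in range(len(pairs) + 1):
        for M in combinations(pairs, k):
            cov_x = Counter(x for x, _ in M)
            cov_y = Counter(y for _, y in M)
            cost = (sum(w_m(x, y) for x, y in M)
                    + sum(w_c(x, cov_x[x]) for x in X)
                    + sum(w_c(y, cov_y[y]) for y in Y))
            if cost < best_cost:
                best_cost, best_M = cost, set(M)
    return best_cost, best_M

# Hypothetical bipartite graph: choosing (a, 1) first would leave b unmatched.
X, Y = ["a", "b"], [1, 2]
E = {("a", 1), ("a", 2), ("b", 1)}

def w_m(x, y):
    return -1 if (x, y) in E else math.inf

def w_c(z, i):
    return 0 if i <= 1 else math.inf  # coverage above 1 is forbidden

best_cost, best_M = csm_brute_force(X, Y, w_m, w_c)
print(-best_cost, best_M)  # size of a maximum matching and one such matching
```

With these costs every finite-cost matching is one-to-one and costs minus its size, so the negated optimum is the maximum matching size.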
CSM with convex coverage costs
Let $(X, Y, w)$ be a matching instance. We say that $w$ has convex coverage costs if for every element $z \in X \cup Y$ and every integer $i > 0$, $w_c(z, i) \le \frac{w_c(z, i - 1) + w_c(z, i + 1)}{2}$. We show here that CSM with convex coverage costs can be reduced to the polynomial-time solvable min-cost integer flow problem [11].
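A coverage cost function can be tested for convexity numerically. The following sketch checks the midpoint form of the condition, $2 w_c(z, i) \le w_c(z, i - 1) + w_c(z, i + 1)$, over a finite range of coverages only; the function name and the bound `max_coverage` are assumptions introduced for this example.

```python
import math

def has_convex_coverage_costs(w_c, elements, max_coverage):
    """Check 2*w_c(z, i) <= w_c(z, i-1) + w_c(z, i+1) for 0 < i < max_coverage."""
    for z in elements:
        for i in range(1, max_coverage):
            # The midpoint form avoids inf - inf, which would be NaN
            # if marginal costs were compared directly.
            if 2 * w_c(z, i) > w_c(z, i - 1) + w_c(z, i + 1):
                return False
    return True

quadratic = lambda z, i: (i - 1) ** 2                   # convex
saturating = lambda z, i: min(i, 2)                     # flattens out: not convex
one_to_one = lambda z, i: 0 if i <= 1 else math.inf     # max-matching costs

print(has_convex_coverage_costs(quadratic, ["z"], 4))
print(has_convex_coverage_costs(saturating, ["z"], 4))
print(has_convex_coverage_costs(one_to_one, ["z"], 4))
```

The third cost function is the one used above to encode maximum bipartite matching; it passes the check, which is consistent with that problem being solvable through the flow reduction.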
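For $x \in X$, denote $d_x = |\{y : w_m(x, y) < \infty\}|$, and similarly $d_y = |\{x : w_m(x, y) < \infty\}|$ for $y \in Y$. Denote $d_X = \max_{x \in X} d_x$ and $d_Y = \max_{y \in Y} d_y$. The reduction builds the flow network $N = (G, s, t, c, w')$, where $G$ is the network graph, $s$ and $t$ are the source and sink nodes respectively, and $c$ and $w'$ are the edge capacity and cost functions respectively. The graph $G = (V, E)$ is defined as follows (Figure 2d).
♦ $V = X \cup Y \cup C_X \cup C_Y \cup \{s, t\}$, where the sets $C_X = \{c^X_1, \ldots, c^X_{d_X}\}$, $C_Y = \{c^Y_1, \ldots, c^Y_{d_Y}\}$, and $\{s, t\}$ contain unique nodes different from all nodes in $X$ and $Y$. Note that we use the same notations for elements in $X$ and $Y$ and their corresponding nodes in $V$, where ambiguity can be resolved by the context.
♦ $E = E_1 \cup E_2 \cup E_3 \cup E_4 \cup E_5$, where
$E_1 = \{(s, c^X_i) : c^X_i \in C_X\}$,
$E_2 = \{(c^X_i, x) : x \in X, 1 \le i \le d_x\}$,
$E_3 = \{(x, y) : x \in X, y \in Y, w_m(x, y) < \infty\}$,
$E_4 = \{(y, c^Y_j) : y \in Y, 1 \le j \le d_y\}$,
and $E_5 = \{(c^Y_j, t) : c^Y_j \in C_Y\}$.
The capacity function $c$ assigns infinite capacities to all edges in $E_1$ and $E_5$, and unit capacities to all edges in $E_2$, $E_3$ and $E_4$. The cost function $w'$ assigns zero costs to edges in $E_1$ and $E_5$, costs $w_c(x, i) - w_c(x, i - 1)$ to edges $(c^X_i, x) \in E_2$, costs $w_c(y, i) - w_c(y, i - 1)$ to edges $(y, c^Y_i) \in E_4$, and costs $w_m(x, y)$ to edges $(x, y) \in E_3$. For $E' \subseteq E$, denote $w'(E') = \sum_{e \in E'} w'(e)$. An integer flow in $N$ is a function $f : E \to \{0, 1, 2, \ldots\}$ satisfying $f(e) \le c(e)$ for every $e \in E$ (capacity constraints), and $\sum_{(u, v) \in E} f(u, v) = \sum_{(v, u) \in E} f(v, u)$ for every $v \in V \setminus \{s, t\}$ (flow conservation constraints). The cost of a flow $f$ in $N$ is defined by $w'(f) = \sum_{e \in E} f(e) \, w'(e)$.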
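The construction can be sketched in Python as follows. This is one reading of the network under stated assumptions: coverage nodes are shared across elements and written here as the hypothetical tuples `("cX", i)` and `("cY", j)`, elements are tagged as `("x", x)` and `("y", y)` to keep node names distinct, and coverage costs are assumed finite so that the marginal costs are well defined. The function name `build_network` is likewise an assumption.

```python
import math

def build_network(X, Y, w_m, w_c):
    """Return {(u, v): (capacity, cost)} for the edges E1..E5 of the network N."""
    pairs = [(x, y) for x in X for y in Y if w_m(x, y) < math.inf]
    d_x = {x: sum(1 for p in pairs if p[0] == x) for x in X}
    d_y = {y: sum(1 for p in pairs if p[1] == y) for y in Y}
    edges = {}
    for i in range(1, max(d_x.values(), default=0) + 1):   # E1: source to C_X
        edges[("s", ("cX", i))] = (math.inf, 0)
    for x in X:                                             # E2: marginal cost of the
        for i in range(1, d_x[x] + 1):                      # i-th partner of x
            edges[(("cX", i), ("x", x))] = (1, w_c(x, i) - w_c(x, i - 1))
    for x, y in pairs:                                      # E3: matching edges
        edges[(("x", x), ("y", y))] = (1, w_m(x, y))
    for y in Y:                                             # E4: marginal cost of the
        for j in range(1, d_y[y] + 1):                      # j-th partner of y
            edges[(("y", y), ("cY", j))] = (1, w_c(y, j) - w_c(y, j - 1))
    for j in range(1, max(d_y.values(), default=0) + 1):   # E5: C_Y to sink
        edges[(("cY", j), "t")] = (math.inf, 0)
    return edges

# Hypothetical toy instance (not taken from the paper).
X, Y = ["a", "b"], ["u"]
w_m = lambda x, y: {"a": 1, "b": 2}[x]
w_c = lambda z, i: (i - 1) ** 2
network = build_network(X, Y, w_m, w_c)
print(len(network))
```

Storing the marginal cost $w_c(z, i) - w_c(z, i - 1)$ on the $i$-th unit-capacity edge is what lets a flow through $z$ accumulate the coverage cost one unit at a time.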
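In what follows, let $(X, Y, w)$ be a matching instance where $w$ has convex coverage costs, and let $N$ be its corresponding network. Due to the convexity requirement, for every $x \in X$ and every integer $i$ with $0 < i < d_x$, $w'(c^X_i, x) = w_c(x, i) - w_c(x, i - 1) \le w_c(x, i + 1) - w_c(x, i) = w'(c^X_{i+1}, x)$. Similarly, for every $y \in Y$ and every integer $i$ with $0 < i < d_y$, $w'(y, c^Y_i) \le w'(y, c^Y_{i+1})$, and we get the following observation:
Observation 1. Series of the form $w'(c^X_1, x), w'(c^X_2, x), \ldots, w'(c^X_{d_x}, x)$ and $w'(y, c^Y_1), w'(y, c^Y_2), \ldots, w'(y, c^Y_{d_y})$ are non-decreasing. Consequently, for every $x \in X$ and $E' \subseteq \{(c^X_i, x) : 1 \le i \le d_x\}$, $w'(E') \ge \sum_{i=1}^{|E'|} w'(c^X_i, x)$, and similarly for $y \in Y$ and $E' \subseteq \{(y, c^Y_j) : 1 \le j \le d_y\}$.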
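Given a flow $f$ in $N$, define the matching $M_f = \{(x, y) : (x, y) \in E_3, f(x, y) = 1\}$. For $x \in X$ and $y \in Y$, denote $E^f_x = \{(c^X_i, x) \in E_2 : f(c^X_i, x) = 1\}$ and $E^f_y = \{(y, c^Y_j) \in E_4 : f(y, c^Y_j) = 1\}$. Since for edges $e \in E_1 \cup E_5$ we have that $w'(e) = 0$, and since for edges $e \in E_2 \cup E_3 \cup E_4$ we have that $f(e) \in \{0, 1\}$ (due to capacity constraints), we can write
$$w'(f) = \sum_{(x, y) \in M_f} w_m(x, y) + \sum_{x \in X} w'(E^f_x) + \sum_{y \in Y} w'(E^f_y). \tag{2}$$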
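Given a finite-cost matching $M$ between $X$ and $Y$, define the flow $f_M$ in $N$ as follows:
♦ For every $(x, y) \in E_3$, $f_M(x, y) = 1$ if $(x, y) \in M$, and otherwise $f_M(x, y) = 0$;
♦ For every $(c^X_i, x) \in E_2$, $f_M(c^X_i, x) = 1$ if $i \le c_M(x)$, and otherwise $f_M(c^X_i, x) = 0$;
♦ For every $(y, c^Y_j) \in E_4$, $f_M(y, c^Y_j) = 1$ if $j \le c_M(y)$, and otherwise $f_M(y, c^Y_j) = 0$;
♦ For every $(s, c^X_i) \in E_1$, $f_M(s, c^X_i) = \sum_{x : (c^X_i, x) \in E_2} f_M(c^X_i, x)$;
♦ For every $(c^Y_j, t) \in E_5$, $f_M(c^Y_j, t) = \sum_{y : (y, c^Y_j) \in E_4} f_M(y, c^Y_j)$.
It is simple to verify that $f_M$ is a valid flow in $N$ (satisfying all capacity and flow conservation constraints), and that $M_{f_M} = M$.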
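Claim 1. For every flow $f$ in $N$, $w'(f) \ge w'(f_{M_f})$.
Proof. From flow conservation constraints, $|E^f_x| = c_{M_f}(x)$ for every $x \in X$, where in particular by definition we have that $E^{f_{M_f}}_x = \{(c^X_i, x) : 1 \le i \le c_{M_f}(x)\}$. Therefore, it follows from Observation 1 that $w'(E^f_x) \ge w'(E^{f_{M_f}}_x)$ for every $x \in X$, and similarly it may be shown that $w'(E^f_y) \ge w'(E^{f_{M_f}}_y)$ for every $y \in Y$. Hence,
$$w'(f) = \sum_{(x, y) \in M_f} w_m(x, y) + \sum_{x \in X} w'(E^f_x) + \sum_{y \in Y} w'(E^f_y) \ge \sum_{(x, y) \in M_f} w_m(x, y) + \sum_{x \in X} w'(E^{f_{M_f}}_x) + \sum_{y \in Y} w'(E^{f_{M_f}}_y) = w'(f_{M_f}).$$
□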
Denote $\Delta = \sum_{x \in X} w_c(x, 0) + \sum_{y \in Y} w_c(y, 0)$, and note that $\Delta$ depends only on the instance $(X, Y, w)$ and not on any specific matching.
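Claim 2. For every matching $M$ between $X$ and $Y$, $w'(f_M) = w(M) - \Delta$.
Proof. For $x \in X$, we have that $w'(E^{f_M}_x) = \sum_{i=1}^{c_M(x)} \left( w_c(x, i) - w_c(x, i - 1) \right) = w_c(x, c_M(x)) - w_c(x, 0)$, and similarly $w'(E^{f_M}_y) = w_c(y, c_M(y)) - w_c(y, 0)$ for $y \in Y$. Therefore,
$$w'(f_M) = \sum_{(x, y) \in M} w_m(x, y) + \sum_{x \in X} \left( w_c(x, c_M(x)) - w_c(x, 0) \right) + \sum_{y \in Y} \left( w_c(y, c_M(y)) - w_c(y, 0) \right) = w(M) - \Delta.$$
□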
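The telescoping argument behind Claim 2 can be checked numerically. The sketch below computes the flow cost of a matching as the sum of pair costs and of the marginal coverage costs on the saturated unit-capacity edges, and compares it with the matching cost minus $\Delta$. All function names and the toy instance are assumptions made for this example.

```python
from collections import Counter
from itertools import combinations

def coverage(M):
    """Coverage counts, keyed by side so that X and Y labels cannot collide."""
    cov = Counter()
    for x, y in M:
        cov[("x", x)] += 1
        cov[("y", y)] += 1
    return cov

def matching_cost(M, X, Y, w_m, w_c):
    """Cost of M: pair costs plus the coverage cost of every element."""
    cov = coverage(M)
    return (sum(w_m[p] for p in M)
            + sum(w_c(x, cov[("x", x)]) for x in X)
            + sum(w_c(y, cov[("y", y)]) for y in Y))

def flow_cost_of_matching(M, X, Y, w_m, w_c):
    """Cost of the flow induced by M: pair costs plus marginal coverage costs."""
    cov = coverage(M)
    cost = sum(w_m[p] for p in M)
    for x in X:  # the first c_M(x) coverage edges into x carry one unit each
        cost += sum(w_c(x, i) - w_c(x, i - 1) for i in range(1, cov[("x", x)] + 1))
    for y in Y:  # likewise for the first c_M(y) coverage edges out of y
        cost += sum(w_c(y, j) - w_c(y, j - 1) for j in range(1, cov[("y", y)] + 1))
    return cost

# Hypothetical toy instance (not taken from the paper).
X, Y = ["x1", "x2"], ["y1"]
w_m = {("x1", "y1"): 1, ("x2", "y1"): 2}
w_c = lambda z, i: (i - 1) ** 2
delta = sum(w_c(z, 0) for z in X + Y)

pairs = list(w_m)
for k in range(len(pairs) + 1):
    for M in combinations(pairs, k):
        assert flow_cost_of_matching(M, X, Y, w_m, w_c) == matching_cost(M, X, Y, w_m, w_c) - delta
```

Note that the identity itself does not rely on convexity; convexity is needed only for Claim 1, where an arbitrary flow is compared with the flow induced by its matching.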
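Claim 3. Let $f^*$ be a minimum cost flow in $N$. Then, $M_{f^*}$ is a minimum cost matching between $X$ and $Y$, and $\mathrm{CSM}(X, Y, w) = w'(f^*) + \Delta$.
Proof. Since $f^*$ is a minimum cost flow in $N$, $w'(f^*) \le w'(f_{M_{f^*}})$, thus by Claims 1 and 2, $w'(f^*) = w'(f_{M_{f^*}}) = w(M_{f^*}) - \Delta$. Let $M$ be a matching between $X$ and $Y$. Again, from the optimality of $f^*$, $w'(f^*) \le w'(f_M)$ and so $w(M_{f^*}) - \Delta \le w(M) - \Delta$, and in particular $w(M_{f^*}) \le w(M)$. Thus, $M_{f^*}$ is a minimum cost matching for $(X, Y, w)$, and so $\mathrm{CSM}(X, Y, w) = w(M_{f^*}) = w'(f^*) + \Delta$.
□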
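To make the reduction concrete, the sketch below solves a small CSM instance through the flow network and recovers the matching from the optimal flow. It is a minimal illustration under stated assumptions, not the implementation used in the paper: the flow is computed by successive shortest paths with Bellman-Ford (a standard technique chosen here for brevity), coverage costs are assumed convex and finite, infinite capacities are replaced by a sufficiently large integer, and all names (`solve_csm_by_flow`, the node tuples `("cX", i)`, `("cY", j)`) are hypothetical.

```python
import math

def solve_csm_by_flow(X, Y, w_m, w_c):
    """Return (CSM cost, matching) via min-cost flow on the reduction network."""
    pairs = [(x, y) for x in X for y in Y if w_m(x, y) < math.inf]
    d_x = {x: sum(1 for p in pairs if p[0] == x) for x in X}
    d_y = {y: sum(1 for p in pairs if p[1] == y) for y in Y}
    big = len(pairs) + 1   # finite stand-in for infinite capacity
    arcs = []              # [tail, head, residual capacity, cost]; arc k ^ 1 reverses arc k
    pair_arc = {}          # index of each E3 arc -> its pair (x, y)

    def add(u, v, cap, cost):
        arcs.append([u, v, cap, cost])
        arcs.append([v, u, 0, -cost])

    for i in range(1, max(d_x.values(), default=0) + 1):
        add("s", ("cX", i), big, 0)
    for x in X:
        for i in range(1, d_x[x] + 1):
            add(("cX", i), ("x", x), 1, w_c(x, i) - w_c(x, i - 1))
    for x, y in pairs:
        pair_arc[len(arcs)] = (x, y)
        add(("x", x), ("y", y), 1, w_m(x, y))
    for y in Y:
        for j in range(1, d_y[y] + 1):
            add(("y", y), ("cY", j), 1, w_c(y, j) - w_c(y, j - 1))
    for j in range(1, max(d_y.values(), default=0) + 1):
        add(("cY", j), "t", big, 0)

    n_nodes = len({a[0] for a in arcs} | {a[1] for a in arcs})
    flow_cost = 0
    while True:
        dist, pred = {"s": 0}, {}
        for _ in range(n_nodes):  # Bellman-Ford: residual costs may be negative
            changed = False
            for k, (u, v, cap, cost) in enumerate(arcs):
                if cap > 0 and u in dist and dist[u] + cost < dist.get(v, math.inf):
                    dist[v], pred[v], changed = dist[u] + cost, k, True
            if not changed:
                break
        if dist.get("t", math.inf) >= 0:  # no augmenting path lowers the cost
            break
        v = "t"
        while v != "s":                   # push one unit along the shortest path
            k = pred[v]
            arcs[k][2] -= 1
            arcs[k ^ 1][2] += 1
            v = arcs[k][0]
        flow_cost += dist["t"]

    matching = {xy for k, xy in pair_arc.items() if arcs[k][2] == 0}
    delta = sum(w_c(z, 0) for z in list(X) + list(Y))
    return flow_cost + delta, matching

# Hypothetical toy instance (not taken from the paper).
X, Y = ["a", "b"], ["u", "v"]
costs = {("a", "u"): 1, ("a", "v"): 4, ("b", "u"): 2, ("b", "v"): 3}
csm_cost, csm_matching = solve_csm_by_flow(
    X, Y, lambda x, y: costs[(x, y)], lambda z, i: 5 * (i - 1) ** 2)
print(csm_cost, csm_matching)
```

Because the flow value is not fixed in advance, augmentation stops as soon as the cheapest augmenting path has non-negative cost; the returned cost adds $\Delta$ back to the flow cost, as in Claim 3.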