MCF is a classical optimization problem that searches for k flows for k source-sink pairs (s
i
, t
i
) in a network N in order to either minimize the total cost of flows or maximize the total flow subject to capacity and demand constraints.
Quasispecies reconstruction can be formulated as an optimization problem in two ways: (1) identification of the most probable set of quasispecies formed by the largest subset of reads from the data, referred to as packing formulation; and (2) identification of a minimal set of quasispecies explaining all observed reads, referred to as covering formulation. These two formulations, when applied to MCFs, were developed into path packing and path covering algorithms of ShotMCF and AmpMCF, respectively.
AmpMCF algorithm
We consider an amplicon A as a multiset of reads such that each read r∈A has the same predefined starting and ending position in the genome start(A) and end(A), respectively. Two amplicons A
1
, A
2
are considered overlapped if and only if start(A1) ≤ start(A
2
) < end(A
1
) ≤ end(A2). A set of amplicons A = {A1, ..., A
m
} is said to be overlapping if and only if A
i
and Ai+ 1overlap for i = 1...m-1. Given an overlapping set A, we define a partial order < on the set of reads R = A1⋃...⋃A
m
as follows: r <r' if and only if r∈A
i
, r'∈Ai+1and r and r' are consistent over their overlap of length li,i+1 = end(Ai)- start(Ai+1)+1, i.e., the suffix of length li,i+1 of r coincides with the prefix of length li,i+1 of r'.
Given an overlapping set A = {A1, ..., A
m
}, we construct an (m+2)-staged directed vertex-weighted read-graph as follows: G = (V(G) = V1 ⋃ ... ⋃ V
m
⋃ {s, t}, E(G), c), where each v∈ V
i
, 1 ≤ i ≤ m corresponds to a distinct read r
v
∈A
i
. An edge (u, v) ∈ E(G) if and only if either r
u
<r
v
or u = s, v∈ A1 or u∈A
m
, v = t. We also define the function c: V1 ⋃ ... ⋃ V
m
→ [0,1], where c(v) denotes the frequency of the read represented by v ∈ V
i
in amplicon A
i
. It is evident that every full-size quasispecies that has a sequenced read from each amplicon A
i
corresponds to an (s, t)-path in the graph G.
A bipartite clique of G is defined as a set of vertices C⊆V(G) such that C⊆V
i
⋃Vi+ 1for some i and every vertex from the set C⋂V
i
is adjacent to every vertex from the set C⋂Vi+1.
Lemma 1. Consistent overlaps in amplicons A
i
, A
i+1
correspond to disjoint bipartite cliques in G.
Proof. Suppose the contrary; then there exist vertices v, v' ∈ A
i
and u, u' ∈Ai+1, such that r
v
<r
u
, r
v
<ru', r
v'
<r
u
, but it is not true that r
v'
<ru'. Since rv'and r
u
are comparable but rv'and ru'are not, the prefixes of length li,i+ 1of r
u
and ru'must not be consistent. This implies a contradiction with r
v
<r
u
and r
v
<ru'.
Using this simple finding, we transform the read graph G into a new "forked" edge-weighted directed read-graph H = (V(H), E(H), d) as follows. Consider each p × q-bipartite clique C = K
p,q
of G not containing vertices s,t. C ⊆ A
i
⋃Ai+1for some i∈{1, ..., m-1}. Add a new "fork" vertex v
fork
, delete all edges of the bipartite clique C and add edges from the sets {(u,v
fork
): u∈C⋂A
i
} and {(v
fork
, v): v∈C⋂Ai+1}. Define a new edge weight function d : E(H) → N as follows: d(uv
fork
) = c(u), d(v
fork
v) = c(v), d(su) = d(vt) = 0. Figure 1 illustrates this transformation.
As for G, every full-size quasispecies corresponds to (s,t)-path in the forked read graph H. However, H is (2m+1)-staged directed graph with much fewer edges than G: for every bipartite clique K
p,q
pq edges in G are replaced by only p+q edges in H. Since in network flow problems variables usually are associated with edges, this reduction is highly useful for the construction of the fast network flows-based algorithm for the quasispecies spectrum reconstruction problem.
The quasispecies reconstruction problem may be restated as the following covering problem:
Problem 1. Given a forked read graph H, cover H with a set of unique (s,t)-paths P
i
with frequencies g
i
such that the total frequency of paths is minimal and for every directed edge (u,v)∈ E(H) the sum of frequencies of paths containing (u, v) is at least d(uv).
We next reformulate Problem 1 as an MCF problem. Suppose that k is an upper bound for the number of quasispecies (k is the parameter of the algorithm analogous to the parameters of clustering algorithms such as k-means). Then an exact solution of Problem 1 could be obtained using the following Mixed Integer Linear Programming formulation:
(1)
s.t.
(2)
(3)
(4)
(5)
(6)
(7)
The variables represent the values of the flow i on the edge (u,v). With each flow gi we associate a binary vectors fi such that for every (u,v)∈E(H)
(8)
This condition is guaranteed by the constraints (5). Constraints (2) and (3) are covering and flow conservation constraints, respectively. Constraints (4) guarantee that flows gi are unsplittable for every i = 1, ..., k, i.e. the edges carrying each flow form a simple (s,t)- path P
i
in the forked read-graph H. In particular, the constraints imply that for every i = 1, ..., k the values are equal for all edges of P
i
. Therefore can be interpreted as values proportional to frequencies of quasispecies i.
The frequency of i- th quasispecies is calculated as the normalized size of the i-th flow by the formula
(9)
ShotMCF algorithm
The input is a set of distinct reads R with counts (c
v
: v∈ R) and a set of candidate sequences Q = {q1, ..., q
k
} generated using the max bandwidth method of ViSpA. We construct the directed read graph G = (V, E) as follows:
-
1)
for each read r
v
∈R aligned with the reference sequence add a vertex v∈V; the consensus of candidate sequences can be used as a reference;
-
2)
the directed edge (u, v) belongs to E if and only if some suffix of r
u
overlaps with a prefix of r
v
and the two reads agree inside the overlap;
-
3)
for each candidate sequence q
i
∈ Q add a source s
i
and a sink t
i
. Add edges (s
i
,v) and (v,t
i
) for each vertex v∈R such that r
v
coincides with the prefix or suffix of q
i
, respectively.
Let a read r
v
of length l be aligned with a candidate sequence q
i
and its alignment have j mismatches (replacements, insertions and deletions). Let pi
v
be the probability that read r
v
was obtained from the sequence q
i
. This probability can be estimated as
(10)
where ε is the sequencing error rate, i.e. the probability of error per nucleotide. Note that the analogous formula was used in the quasispecies theory for the calculation of the probability of mutation between two different quasispecies [15].
Using the read-graph constructed above, the quasispecies frequencies estimation problem can be formulated in terms of MCF as follows. Each (s
i
, t
i
)-path corresponds to some full-genome quasispecies, which can coincide with q
i
with a probability depending on values pi
v
. By using pi
v
as coefficients in the MCF objective function, we arrive to the following formulation:
(11)
s.t.
(12)
(13)
(14)
Here are flow variables. are auxiliary variables used for the simplicity of notations, which represent total flow through vertices v∈V. The resulted flow is fractional and can split, thus allowing for read errors and mutations. (11)-(14) is a variant of MCF, where vertex capacity constraints are used instead of edge capacity constraints. Once the problem is solved, the frequency of each candidate quasispecies could be estimated using (9).