MCF is a classical optimization problem that searches for k flows for k sourcesink pairs (s_{
i
}, t_{
i
}) in a network N in order to either minimize the total cost of flows or maximize the total flow subject to capacity and demand constraints.
Quasispecies reconstruction can be formulated as an optimization problem in two ways: (1) identification of the most probable set of quasispecies formed by the largest subset of reads from the data, referred to as packing formulation; and (2) identification of a minimal set of quasispecies explaining all observed reads, referred to as covering formulation. These two formulations, when applied to MCFs, were developed into path packing and path covering algorithms of ShotMCF and AmpMCF, respectively.
AmpMCF algorithm
We consider an amplicon A as a multiset of reads such that each read r∈A has the same predefined starting and ending position in the genome start(A) and end(A), respectively. Two amplicons A_{
1
}, A_{
2
} are considered overlapped if and only if start(A_{1}) ≤ start(A_{
2
}) < end(A_{
1
}) ≤ end(A_{2}). A set of amplicons A = {A_{1}, ..., A_{
m
}} is said to be overlapping if and only if A_{
i
} and A_{i+ 1}overlap for i = 1...m1. Given an overlapping set A, we define a partial order < on the set of reads R = A_{1}⋃...⋃A_{
m
} as follows: r <r' if and only if r∈A_{
i
}, r'∈A_{i+1}and r and r' are consistent over their overlap of length l_{i,i+1} = end(A_{i}) start(A_{i+1})+1, i.e., the suffix of length l_{i,i+1} of r coincides with the prefix of length l_{i,i+1} of r'.
Given an overlapping set A = {A_{1}, ..., A_{
m
}}, we construct an (m+2)staged directed vertexweighted readgraph as follows: G = (V(G) = V_{1} ⋃ ... ⋃ V_{
m
} ⋃ {s, t}, E(G), c), where each v∈ V_{
i
}, 1 ≤ i ≤ m corresponds to a distinct read r_{
v
} ∈A_{
i
}. An edge (u, v) ∈ E(G) if and only if either r_{
u
} <r_{
v
} or u = s, v∈ A_{1} or u∈A_{
m
}, v = t. We also define the function c: V_{1} ⋃ ... ⋃ V_{
m
} → [0,1], where c(v) denotes the frequency of the read represented by v ∈ V_{
i
} in amplicon A_{
i
}. It is evident that every fullsize quasispecies that has a sequenced read from each amplicon A_{
i
} corresponds to an (s, t)path in the graph G.
A bipartite clique of G is defined as a set of vertices C⊆V(G) such that C⊆V_{
i
}⋃V_{i+ 1}for some i and every vertex from the set C⋂V_{
i
} is adjacent to every vertex from the set C⋂V_{i+1}.
Lemma 1. Consistent overlaps in amplicons A_{
i
}, A_{
i+1
} correspond to disjoint bipartite cliques in G.
Proof. Suppose the contrary; then there exist vertices v, v' ∈ A_{
i
} and u, u' ∈A_{i+1}, such that r_{
v
} <r_{
u
}, r_{
v
} <r_{u'}, r_{
v'
} <r_{
u
}, but it is not true that r_{
v'
} <r_{u'}. Since r_{v'}and r_{
u
} are comparable but r_{v'}and r_{u'}are not, the prefixes of length l_{i,i+ 1}of r_{
u
} and r_{u'}must not be consistent. This implies a contradiction with r_{
v
} <r_{
u
} and r_{
v
} <r_{u'}.
Using this simple finding, we transform the read graph G into a new "forked" edgeweighted directed readgraph H = (V(H), E(H), d) as follows. Consider each p × qbipartite clique C = K_{
p,q
}of G not containing vertices s,t. C ⊆ A_{
i
}⋃A_{i+1}for some i∈{1, ..., m1}. Add a new "fork" vertex v_{
fork
}, delete all edges of the bipartite clique C and add edges from the sets {(u,v_{
fork
}): u∈C⋂A_{
i
}} and {(v_{
fork
}, v): v∈C⋂A_{i+1}}. Define a new edge weight function d : E(H) → N as follows: d(uv_{
fork
}) = c(u), d(v_{
fork
}v) = c(v), d(su) = d(vt) = 0. Figure 1 illustrates this transformation.
As for G, every fullsize quasispecies corresponds to (s,t)path in the forked read graph H. However, H is (2m+1)staged directed graph with much fewer edges than G: for every bipartite clique K_{
p,q
} pq edges in G are replaced by only p+q edges in H. Since in network flow problems variables usually are associated with edges, this reduction is highly useful for the construction of the fast network flowsbased algorithm for the quasispecies spectrum reconstruction problem.
The quasispecies reconstruction problem may be restated as the following covering problem:
Problem 1. Given a forked read graph H, cover H with a set of unique (s,t)paths P_{
i
} with frequencies g_{
i
} such that the total frequency of paths is minimal and for every directed edge (u,v)∈ E(H) the sum of frequencies of paths containing (u, v) is at least d(uv).
We next reformulate Problem 1 as an MCF problem. Suppose that k is an upper bound for the number of quasispecies (k is the parameter of the algorithm analogous to the parameters of clustering algorithms such as kmeans). Then an exact solution of Problem 1 could be obtained using the following Mixed Integer Linear Programming formulation:
minimize{\displaystyle \sum _{\begin{array}{c}\hfill i=1,\dots k\hfill \\ \hfill \left(s,u\right)\in E\left(H\right)\hfill \end{array}}}{g}_{su}^{i}
(1)
s.t.
{\displaystyle \sum _{i=1}^{k}}{g}_{uv}^{i}\ge d\left(uv\right),\phantom{\rule{0.3em}{0ex}}\phantom{\rule{0.3em}{0ex}}\left(u,v\right)\in E\left(H\right)
(2)
{\displaystyle \sum _{\left(u,v\right)\in E\left(H\right)}}{g}_{uv}^{i}={\displaystyle \sum _{\left(v,w\right)\in E\left(H\right)}}{g}_{vw}^{i},v\in V\left(H\right)\backslash \left\{s,t\right\},\phantom{\rule{1em}{0ex}}i=1,\dots ,k
(3)
{\displaystyle \sum _{\left(v,u\right)\in E\left(H\right)}}{f}_{vu}^{i}\le 1,v\in V\left(H\right),\phantom{\rule{1em}{0ex}}i=1,\dots ,k
(4)
{f}_{uv}^{i}\ge {g}_{uv}^{i},\left(u,v\right)E\left(H\right),\phantom{\rule{1em}{0ex}}i=1,\dots ,k
(5)
{f}_{uv}^{i}\in \left\{0,1\right\},\phantom{\rule{1em}{0ex}}\left(u,v\right)\in E\left(H\right),\phantom{\rule{1em}{0ex}}i=1,\dots ,k
(6)
{g}_{uv}^{i}\in \left[0,1\right],\left(u,v\right)\in E\left(H\right),\phantom{\rule{1em}{0ex}}i=1,\dots ,k
(7)
The variables {g}_{uv}^{i} represent the values of the flow i on the edge (u,v). With each flow g^{i} we associate a binary vectors f^{i} such that for every (u,v)∈E(H)
\mathsf{\text{if}}{g}_{uv}^{i}0\mathsf{\text{then}}{f}_{uv}^{i}=1
(8)
This condition is guaranteed by the constraints (5). Constraints (2) and (3) are covering and flow conservation constraints, respectively. Constraints (4) guarantee that flows g^{i} are unsplittable for every i = 1, ..., k, i.e. the edges carrying each flow form a simple (s,t) path P_{
i
} in the forked readgraph H. In particular, the constraints imply that for every i = 1, ..., k the values {g}_{uv}^{i} are equal for all edges of P_{
i
}. Therefore {g}_{uv}^{i} can be interpreted as values proportional to frequencies of quasispecies i.
The frequency of i th quasispecies is calculated as the normalized size of the ith flow by the formula
\frac{{\sum}_{su\in E\left(H\right)}{g}_{su}^{i}}{{\sum}_{i=1}^{k}{\sum}_{su\in E}{g}_{su}^{i}}.
(9)
ShotMCF algorithm
The input is a set of distinct reads R with counts (c_{
v
} : v∈ R) and a set of candidate sequences Q = {q_{1}, ..., q_{
k
}} generated using the max bandwidth method of ViSpA. We construct the directed read graph G = (V, E) as follows:

1)
for each read r_{
v
} ∈R aligned with the reference sequence add a vertex v∈V; the consensus of candidate sequences can be used as a reference;

2)
the directed edge (u, v) belongs to E if and only if some suffix of r_{
u
} overlaps with a prefix of r_{
v
} and the two reads agree inside the overlap;

3)
for each candidate sequence q_{
i
}∈ Q add a source s_{
i
} and a sink t_{
i
}. Add edges (s_{
i
},v) and (v,t_{
i
}) for each vertex v∈R such that r_{
v
} coincides with the prefix or suffix of q_{
i
}, respectively.
Let a read r_{
v
} of length l be aligned with a candidate sequence q_{
i
} and its alignment have j mismatches (replacements, insertions and deletions). Let p^{i}_{
v
} be the probability that read r_{
v
} was obtained from the sequence q_{
i
}. This probability can be estimated as
{p}_{v}^{i}={\left(\frac{\epsilon}{3}\right)}^{j}{\left(1\epsilon \right)}^{lj},
(10)
where ε is the sequencing error rate, i.e. the probability of error per nucleotide. Note that the analogous formula was used in the quasispecies theory for the calculation of the probability of mutation between two different quasispecies [15].
Using the readgraph constructed above, the quasispecies frequencies estimation problem can be formulated in terms of MCF as follows. Each (s_{
i
}, t_{
i
})path corresponds to some fullgenome quasispecies, which can coincide with q_{
i
} with a probability depending on values p^{i}_{
v
}. By using p^{i}_{
v
} as coefficients in the MCF objective function, we arrive to the following formulation:
maximize{\displaystyle \sum _{v\in V}}{\displaystyle \sum _{i=1}^{k}}{p}_{v}^{i}{g}_{v}^{i}
(11)
s.t.
{\displaystyle \sum _{i=1}^{k}}{g}_{v}^{i}\le {c}_{v},v\in V
(12)
{\displaystyle \sum _{uv\in E}}{g}_{uv}^{i}={\displaystyle \sum _{vw\in E}}{g}_{vw}^{i},\phantom{\rule{1em}{0ex}}v\in V\backslash \left\{{s}_{i},{t}_{i}\right\},\phantom{\rule{1em}{0ex}}i=1,\dots ,k
(13)
{g}_{uv}^{i}\ge 0,\phantom{\rule{1em}{0ex}}uv\in E,i=1,\dots ,k
(14)
Here {g}_{uv}^{i} are flow variables. {g}_{v}^{i}={\sum}_{uv\in E}{g}_{uv}^{i} are auxiliary variables used for the simplicity of notations, which represent total flow through vertices v∈V. The resulted flow is fractional and can split, thus allowing for read errors and mutations. (11)(14) is a variant of MCF, where vertex capacity constraints are used instead of edge capacity constraints. Once the problem is solved, the frequency of each candidate quasispecies could be estimated using (9).