Volume 15 Supplement 9
Proceedings of the Fourth Annual RECOMB Satellite Workshop on Massively Parallel Sequencing (RECOMBSeq 2014)
On the complexity of Minimum Path Cover with Subpath Constraints for multiassembly
 Romeo Rizzi†^{1},
 Alexandru I Tomescu†^{2}Email author and
 Veli Mäkinen^{2}
DOI: 10.1186/1471210515S9S5
© Rizzi and Tomescu; licensee BioMed Central Ltd. 2014
Published: 10 September 2014
Abstract
Background
Multiassembly problems have gathered much attention in the last years, as NextGeneration Sequencing technologies have started being applied to mixed settings, such as reads from the transcriptome (RNASeq), or from viral quasispecies. One classical model that has resurfaced in many multiassembly methods (e.g. in Cufflinks, ShoRAH, BRANCH, CLASS) is the Minimum Path Cover (MPC) Problem, which asks for the minimum number of directed paths that cover all the nodes of a directed acyclic graph. The MPC Problem is highly popular because the acyclicity of the graph ensures its polynomialtime solvability.
Results
In this paper, we consider two generalizations of it dealing with integrating constraints arising from long reads or pairedend reads; these extensions have also been considered by two recent methods, but not fully solved. More specifically, we study the two problems where also a set of subpaths, or pairs of subpaths, of the graph have to be entirely covered by some path in the MPC. We show that in the case of long reads (subpaths), the generalized problem can be solved in polynomialtime by a reduction to the classical MPC Problem. We also consider the weighted case, and show that it can be solved in polynomialtime by a reduction to a mincost circulation problem. As a side result, we also improve the time complexity of the classical minimum weight MPC Problem. In the case of pairedend reads (pairs of subpaths), the generalized problem becomes NPhard, but we show that it is fixedparameter tractable (FPT) in the total number of constraints. This computational dichotomy between long reads and pairedend reads is also a general insight into multiassembly problems.
Keywords
multiassembly RNASeq minimum path cover directed acyclic graph network flow mincost circulationIntroduction
Background
The last years have witnessed NextGeneration Sequencing technologies applied to mixed settings in which the input sample consists of different, but highly related, genomic sequences. A major problem in this setting is to assemble the NGS reads produced from these different sequences, problem called multiassembly [1].
An emblematic example is the multiassembly of the expressed transcripts of a gene from RNASeq reads [2, 3]. The RNA transcripts of a gene are concatenations of exons, which can be shared among them, and whose length is typically much longer than the short read length. The RNASeq technology has proved essential in characterizing gene regulation and function, understanding development, disease, and disorders, including cancer [4–7]. The most popular tool for multiassembly of RNASeq reads is Cufflinks [8], but the great interest in the community has led to a recent proliferation of methods and tools, such as [9–20]. Another example is the multiassembly of NGS reads from viral quasispecies [21]. Since many viruses, such as HIV or HCV, encode their genomes in RNA rather than DNA, they lack DNA polymerase and are unable to repair mistakes in their genomes as they reproduce. Over the course of infection, the mistakes made in the replication of the virus are passed down to descendants, producing a family of related variants of the original viral genome, referred to as quasispecies. Among all of the new quasispecies produced, some may be more virulent than others, and it is of great epidemiological interest to identify them. Methods for this problem include [22–27].
The vast majority of the multiassembly tools are genomeguided, in the sense that they have access to one reference genome. Consequently, the analysis proceeds by aligning the reads to this reference, and constructing one of two major graph models. In the first, called an overlap graph, the nodes stand for reads and the edges stand for overlaps between reads. This model is employed both for RNASeq reads (by Cufflinks [8]), and for pyrosequencing reads from a viral population (ShoRAH [25, 26]). In the second model, called a splicing graph and used mainly for RNASeq reads, the nodes stand for contiguous stretches of DNA present entirely in some transcript (pseudoexons); its edges stand for reads spanning two pseudoexons and indicate that they are consecutive in some transcript. This model is employed by most of the other methods for the multiassembly of RNASeq reads [9–20]. Since both graph models arise from alignments to a reference sequence, they are also directed and acyclic (DAGs). Moreover, the nodes and the edges of the graph are weighted according to the observed coverage, and different strategies exist for integrating them into the formulation of the multiassembly problem. For example, in Cufflinks [8], the weight of an edge reflects the belief that its two endpoints originate from different transcripts, and is computed using the percentsplicedin metric proposed in [28].
Motivation
Given an overlap or a splicing DAG, many methods [8, 19, 20, 25–27] model the multiassembly problem as a Minimum Path Cover Problem; these include the wellknown tool for RNASeq reads Cufflinks [8]. A path cover in a directed graph G is a set of paths which cover all the nodes of G. A minimum path cover (MPC) is a path cover of minimum cardinality. Often, the edges of the DAG are weighted, and one is then interested in a minimum weight MPC. Even though this problem is in general NPcomplete (a path cover has cardinality 1 if and only if the directed graph has a Hamiltonian path), it is solvable in polynomial time on DAGs [29]. This fact is one of the main reasons why the MPC Problem has attracted so much interest. Therefore, it makes sense to extend it with other biological information, while maintaining its polynomialtime solvability.
In this paper we consider additional information arising from pairedend reads or long reads. Observe that, currently, both graph models and the associated MPC Problem include constraints only on pairs of nodes which must be consecutive in the (same) genomic sequence. However, on the one hand, most sequencers produce pairedend reads; these two reads correspond to nodes that must be in the same genomic sequence, but they are no longer consecutive in it. On the other hand, ThirdGeneration Sequencing technologies, like Pacific Biosciences [30], produce long reads whose length is in the range of thousands of basepairs. If properly errorcorrected, they introduce additional constraints on the sequences of nodes which must appear as consecutive in the same assembled genomic sequences. In the case of a splicing graph, such additional constraints can be introduced even from short reads completely overlapping a short middle pseudoexon (such as in the case of alternative donor/acceptor sites [31]).
Two different problem formulations have been recently proposed to better guide the multiassembly using pairedend or long reads. In the first [20], a partial assembly of the RNA transcripts is assumed (transfrags), and the following problem, which we call Minimum Path Cover with Subpath Constraints (MPCSC), is proposed. Given a DAG G and a set of subpaths in G (the transfrags, or the long reads), we are asked to find a MPC such that each given subpath is contained completely in some path of the path cover. In [20], the authors consider in fact the weighted version of the problem, and propose a polynomialtime reduction to the classical weighted MPC Problem. However, their reduction is incomplete as it does not deal with the case when two subpaths P_{1} and P_{2} are such that a suffix of P_{1} is a prefix of P_{2}. In the second formulation [19], given a DAG G and a set of pairedend RNASeq read alignments to the nodes of G, we are asked to find a minimum path cover whose paths contain all given pairedend reads. We call this problem Minimum Path Cover with Paired Subpaths Constraints (MPCPSC). In [19], the authors tackle the MPCPSC Problem by modeling it as the NPcomplete set cover problem.
Results and discussion
In this paper, we solve both the MPCSC and the MPCPSC Problem. Namely, we state the MPCSC Problem more generally than in [20], and give a correct and robust polynomialtime reduction of it to the classical MPC Problem on a DAG. Denote by n the number of nodes of the input DAG, by m its number of edges, by c the total number of subpath constraints, and by N the sum of their lengths. Constructing this reduction to the classical MPC Problem requires a preprocessing step, which, if implemented trivially, takes O(c^{2}n^{2}) time; however, we can reduce that to O(N + c^{2}) by use of a suffix tree construction suitable for large alphabets [32], and of an optimaltime algorithm for computing all pairs longest suffixprefix overlaps [33, 34]. The complexity of solving Problem MPCSC thus becomes $O\left(N+{c}^{2}+\sqrt{\left(n+c\right)}\left({n}^{2}+c\right)\right)$.
We also consider the weighted version of Problem MPCSC, and show that it can be solved in time O(N + (n + c)^{2} log(n + c) + (n + c)(m + c)) by a reduction to a mincost circulation problem on a network with flow lower bounds only [35]. Moreover, we prove that the MPCPSC Problem itself is NPcomplete, but we show that it is fixedparameter tractable (FPT) in the total number of constraints on the DAG.
As a side result of this paper, we obtain a simple algorithm for the classical minimum weight MPC Problem running in time O(n^{2}log n + nm), based on a recent reduction to a network flow problem [36]. This improves the current best bound O(n^{2} log n + nt(G)), where $t\left(G\right)\in \left\{m,m+1,\dots ,\left(\begin{array}{c}\hfill n\hfill \\ \hfill 2\hfill \end{array}\right)\right\}$ is the number of edges in the transitive closure of G, arising from the reduction in [29].
In view of this computational dichotomy between pairedend reads and long reads/transfrags, an alternative title of this paper could have been "Long reads are better than pairedend reads in multiasssembly problems". In fact, in the experiments we conducted for our own tool for RNASeq multiassembly Traph [37, 38], we fed Cufflinks [8], IsoLasso [10], SLIDE [12] and Traph both with singleend and pairedend reads, but did not notice any significant change in the multiassembly accuracy. Nevertheless, an immediate solution to the negative result concerning the complexity of the MPCPSC Problem could be to simply transform pairedend reads into long reads by a local assembly method which fills the gap between them, such as [39, 40].
As a preliminary experiment, in the Supplementary Material we show the solutions of Problem MPCSC on simulated RNASeq data from six cancerrelated genes. These results are compared to the ground truth, and to Cufflinks' solutions (given that Cufflinks uses the classical MPC model). These preliminary results indicate that, thanks to the additional long read constraints to the MPC problem, Problem MPCSC reports more transcripts than Cufflinks, and they are generally more accurate.
Both MPCSC and MPCPSC Problems are natural extensions of the classical MPC problem, and can be applied to any graph model for multiassembly, such as an overlap graph or a splicing graph. The MCP Problem has received great interest in the multiassembly community, and pairend reads, long reads, or transfrags are either already, or expected to be easily available in the near future. Our positive result concerning the MPCSC Problem, and the two proposed solutions for the MPCPSC Problem, give efficient ways to incorporate additional information that an NGS pipeline can provide. Moreover, all of our solutions are based on easy to implement reductions, and resort to wellknown problems in combinatorial optimization, for which there are many existing solvers.
Independently and parallel to this work, [41] gave analogs of our Thm. 4 and Lemma 2 for Problem MPCPSC.
Methods
A faster algorithm for the weighted Minimum Path Cover (MPC) Problem
Given a directed graph G, we say that a family $\mathcal{P}=\left\{{P}_{1},\dots ,{P}_{k}\right\}$ of paths in G is a path cover of G if every v ∈ V (G) belongs to some P_{ i }. Throughout this paper, we let n stand for the number of vertices of G and m stand for the number of edges of G. A minimum path cover (MPC) of G is a path cover of G of minimum cardinality. If each edge e of G has a nonnegative weight w(e), then a minimum weight minimum path cover is a minimum path cover $\phantom{\rule{0.1em}{0ex}}\mathcal{P}$ which minimizes the sum of the weights of the edges of the paths of $\phantom{\rule{0.1em}{0ex}}\mathcal{P}$, that is, ${\Sigma}_{P\in \mathcal{P}}$ Σ_{ e∈P }w(e).
A recent solution for the MPC Problem reduces it instead to a minflow problem [36], as follows. Each node of G is replaced by an arc with lower bound 1 (all other edges of G have lower bound 0), and a new global source s and sink t are added to G and connected to all sources and sinks of G, respectively (see Figures 1(a) and 1(c)). A minflow on this digraph is a flow of minimum value satisfying all lower bounds. The value of the minflow on this network equals the maximum size of an antichain of G, and any decomposition of it into paths gives a MPC [36]. A decomposition of a flow on a DAG into paths can be computed in time linear in the number of edges, by traversing the edges used by the flow [45]. A minflow problem can be solved by two applications of a maxflow algorithm [45]. Therefore, using the recent result on maxflows [46], this approach finds a MPC in time O(nm).
If in the unweighted case, the complexity of the method of [36] is incomparable with the complexity of solving the MPC Problem by a maximum matching problem, in the weighted case, the method of [36] leads to one of improved complexity. This is obtained by an algorithm for the following restricted variant of the mincost circulation problem [45, 47]: given a directed graph, and a flow lower bound for each edge and a cost per flow unit for each edge, the task is to find a circulation of minimum total cost satisfying all lower bounds. A circulation is a function assigning a flow value to each edge such that the flow conservation property is satisfied for all nodes; consequently, the flow network cannot have sources or sinks.
To solve the minimum weight MPC Problem, we extend the reduction in [36] by associating to the edges either cost 0, if they correspond to the nodes of G or are incident to s or t; or their weight in G, if they correspond to edges of G. Moreover, we add a new edge from t to s with lower bound 0 and having as cost the sum of all edge weights (plus a positive constant if all are 0). This implies that all mincost circulations induce a minflow (removing the edge from t to s), and thus, by [36], induce also a MPC that is of minimum weight; obviously, vice versa, a minimum weight MPC induces a mincost circulation on the constructed flow network.
There are many algorithms and solvers for the mincost circulation problem, with various time complexity upper bounds [47], for example O(nm log log C log(nK)) [48], where C is the maximum edge bound, and K is the maximum cost. If edges have only lower bounds, as in our case, the mincost circulation problem can be solved in time O(n log C(m + n log n)) [35]; since we have C = 1, this reduces to O(n^{2} log n + nm). Therefore, we have the following theorem.
Theorem 1 A minimum weight MPC of a DAG with n nodes and m edges can be computed in time O(n^{2}log n + nm), by a reduction to a mincost circulation problem.
The new problem formulations
We first consider the problem arising from long reads, or from transfrags. We introduce a slight generalization of a path cover of a DAG G, namely a set of paths which cover only a given subset ${V}^{\prime}$ of the nodes. We are also given a subset ${E}^{\prime}$ of the edges of G, and a family of subpaths ${\mathcal{P}}^{in}$ in G that all have to be entirely covered by some path of the path cover. We could have modeled each edge constraint in ${E}^{\prime}$ as a path of length 1 in ${\mathcal{P}}^{in}$, but for clarity, we keep these separate. Formally, we have:
Minimum Path Cover with Subpath Constraints (MPCSC) Problem INPUT: A DAG G and
 1
A subset ${V}^{\prime}$ of V (G)
 2
A subset ${E}^{\prime}$ of E(G)
 3
A family ${\mathcal{P}}^{in}=\left\{{P}_{1}^{in},\dots ,{P}_{t}^{in}\right\}$ of directed paths in G
TASK: Find a minimum number k of directed paths ${P}_{1}^{sol},\dots ,{P}_{k}^{sol}$ in G such that
 1
Every node in ${V}^{\prime}$ occurs in some ${P}_{{}^{i}}^{sol}$
 2
Every edge in ${E}^{\prime}$ occurs in some ${P}_{{}^{i}}^{sol}$
 3
Every path P^{ in } ϵ${\mathcal{P}}^{in}$is entirely contained in some ${P}_{i}^{sol}$
We call the elements of the sets ${V}^{\prime}$,${E}^{\prime}$, ${\mathcal{P}}^{in}$ constraints, and we say that the k paths in a solution satisfy these constraints.
In the second problem, we consider the weighted case, with one further generalization, as follows. As also noted by [20], in practice the paths in the path cover should start only in source nodes or in a specific subset of other nodes of G; similarly for their ending nodes. For example, in our method for the multiassembly of RNAtranscripts [37, 38], these nodes are identified when there is a sharp increase/decrease in read coverage in the middle of an exon, indicating the start/end of a transcript.
Minimum Weight Minimum Path Cover with Subpath Constraints (MWMPCSC) Problem
 1
A subset ${V}^{\prime}$ of V (G)
 2
A subset ${E}^{\prime}$ of E(G)
 3
A family ${\mathcal{P}}^{in}=\left\{{P}_{1}^{in},\dots ,{P}_{t}^{in}\right\}$ of directed paths in G
 4
A superset S of the sources of G, and a superset T of the sinks of G
 5
A weight w(e) for each e ∈ E(G)
 1
Every node in ${V}^{\prime}$ occurs in some ${P}_{i}^{sol}$
 2
Every edge in ${E}^{\prime}$ occurs in some ${P}_{i}^{sol}$
 3
Every path P ^{ in } ∈ ${\mathcal{P}}^{in}$ is entirely contained in some ${P}_{i}^{sol}$
 4
Every path ${P}_{i}^{sol}$ starts in a node of S and ends in a node of T
 5
$\sum _{i\in \left\{1,\dots ,k\right\}}}{\displaystyle \sum _{\mathsf{\text{edge}}e\in {P}_{i}^{sol}}}w\left(e\right)$ is minimum among all tuples of k paths satisfying properties 14
When only pairedend reads are available, each such pair of reads corresponds to a pair of subpaths that must both be covered by the same path in the path cover. Formally, we have:
Minimum Path Cover with Paired Subpath Constraints (MPCPSC) Problem
 1
A subset ${V}^{\prime}$ of V (G)
 2
A subset ${E}^{\prime}$ of E(G)
 3
A family ${\mathcal{P}}^{in}=\left\{\left({P}_{1,1}^{in},{P}_{1,2}^{in}\right),\dots ,\left({P}_{t,1}^{in},{P}_{t,2}^{in}\right)\right\}$ of pairs of directed paths in G
 1
Every node in ${V}^{\prime}$ occurs in some ${P}_{i}^{sol}$
 2
Every edge in ${E}^{\prime}$ occurs in some ${P}_{i}^{sol}$
 3
For every pair $\left({P}_{j,1}^{in},{P}_{j,2}^{in}\right)\in {\mathcal{P}}^{in}$, there exists such ${P}_{i}^{sol}$ that both ${P}_{j,1}^{in}$ and ${P}_{j,2}^{in}$ are entirely contained in ${P}_{i}^{sol}$
The MPC with Subpath Constraints (MPCSC) Problem
The unweighted case
In this section, we reduce the MPCSC Problem to the classical MPC Problem. We describe our reduction as a sequence of commented algorithmic steps.
Step 1. for every $\left(u,v\right)\in {E}^{\prime}$ do: ${V}^{\prime}:={V}^{\prime}\backslash \left\{u,v\right\}$;
If the MPC has a path P covering the arc (u, v), then P also covers both u and v. Therefore, the constraints u, v can be dropped from ${V}^{\prime}$ (if present).
Similarly to Step 1, if the MPC has a path P covering a subpath ${P}_{j}^{in}\in {\mathcal{P}}^{in}$ , then P also covers every node and edge of ${P}_{j}^{in}$, thus these constraints can be dropped from ${V}^{\prime}$ and ${E}^{\prime}$ (if present).
After this step, no subpath constraint is completely included into another; this is key for the correctness of Step 4 below.
Step 4. while there exist two paths ${P}_{i}^{in}$, ${P}_{j}^{in}\in {\mathcal{P}}^{in}$ such that a suffix of ${P}_{i}^{in}$ is a prefix of ${P}_{j}^{in}$ do:
let ${P}_{i}^{in}$, ${P}_{j}^{in}\in {\mathcal{P}}^{in}$ be as above and with the common part (i.e., the suffix of ${P}_{i}^{in}$ which is a prefix of ${P}_{j}^{in}$) the longest possible;
In this step, we merge paths sharing a suffix/prefix. We do this iteratively, at each step merging that pair of paths for which the shared suffix/prefix is longest possible. The correctness of this step is guaranteed by Lemma 1 below.
Lemma 1 If the MPCSC Problem on an instance $\left(G,{V}^{\prime},{E}^{\prime},{\mathcal{P}}^{in}\right)$ admits a solution with k paths, then also the problem instance transformed by applying Steps 14 admits a solution with k paths, and this solution also satisfies the original constraints ${V}^{\prime}$, ${E}^{\prime}$, ${\mathcal{P}}^{in}$.
Proof The correctness of Steps 13 was argued next to their introduction. Assume that $G,{V}^{\prime},{E}^{\prime},{\mathcal{P}}^{in}$ have been transformed by these first three steps, and let ${P}_{i}^{in},{P}_{j}^{in}\in {\mathcal{P}}^{in}$ be such that their common part (i.e., the suffix of ${P}_{i}^{in}$ which is a prefix of ${P}_{j}^{in}$) is longest possible. Suppose that the original problem admits a solution ${\mathcal{P}}^{sol}=\left\{{P}_{1}^{sol},\dots ,{P}_{k}^{sol}\right\}$ such that ${P}_{i}^{in},{P}_{j}^{in}$ are covered by different solution paths say ${P}_{a}^{sol}$ and ${P}_{b}^{sol}$, respectively. We show that the transformed problem admits a solution ${\mathcal{P}}^{*}=\left(\left\{{P}_{1}^{sol},\dots ,{P}_{k}^{sol}\right\}\backslash \left\{{P}_{a}^{sol},{P}_{b}^{sol}\right\}\cup \left\{{P}_{a}^{*},{P}_{b}^{*}\right\}\right)$, having the same cardinality as$\phantom{\rule{0.1em}{0ex}}\mathcal{P}$, in which ${P}_{i}^{in},{P}_{j}^{in}$ are covered by the same path ${P}_{a}^{*}$, and ${\mathcal{P}}^{*}$ also satisfies the original constraints ${V}^{\prime},{E}^{\prime},{\mathcal{P}}^{in}$.

${P}_{a}^{*}$ be the path obtained as the concatenation of the path ${P}_{a}^{sol}$ taken from its starting node until v_{ i }with the path ${P}_{b}^{sol}$ taken from v_{ i }until its end node (so that ${P}_{a}^{*}$ covers both ${P}_{i}^{in}$ and ${P}_{j}^{in}$).

${P}_{b}^{*}$ be the path obtained as the concatenation of the path ${P}_{b}^{sol}$ taken from its starting node until v_{ i }with the path ${P}_{a}^{sol}$ taken from v_{ i }until its end node.
We have to show that the path cover ${\mathcal{P}}^{*}=\left(\left\{{P}_{1}^{sol},\dots ,{P}_{k}^{sol}\right\}\backslash \left\{{P}_{a}^{sol},{P}_{b}^{sol}\right\}\right)\cup \left\{{P}_{a}^{*},{P}_{b}^{*}\right\}$ satisfies the original constraints ${V}^{\prime},{E}^{\prime},{\mathcal{P}}^{in}$. Since ${P}_{a}^{*}$ and ${P}_{b}^{*}$ use exactly the same edges as ${P}_{a}^{sol}$ and ${P}_{b}^{sol}$, then ${V}^{\prime}$ and ${E}^{\prime}$ are satisfied. Moreover, the only two problematic cases are when there is a subpath constraint ${P}_{k}^{in}$ which has v_{ i } as internal node and is satisfied only by ${P}_{a}^{sol}$, or it is satisfied only by ${P}_{b}^{sol}$. Denote, analogously, by u_{ k } and v_{ k } the endpoints of ${P}_{k}^{in}$. From the fact that the input was transformed at Step 3, ${P}_{i}^{in}$ and ${P}_{j}^{in}$ are not completely included in ${P}_{k}^{in}$.
Case 1. ${P}_{k}^{in}$ is satisfied only by ${P}_{a}^{sol}$ (Figures 3(a) and 3(b)). Since ${P}_{i}^{in}$ is not completely included in ${P}_{k}^{in}$, u_{ k } is an internal node of ${P}_{i}^{in}$; thus, a suffix of ${P}_{i}^{in}$ is prefix also of ${P}_{k}^{in}$. From the fact that the common part between ${P}_{i}^{in}$ and ${P}_{j}^{in}$ is longest possible, we have that vertices u_{ j } , u_{ k } , v_{ i } appear in this order in ${P}_{i}^{in}$. Thus, ${P}_{k}^{in}$ is also satisfied by ${P}_{b}^{*}$, since u_{ k } appears after u_{ j } on P^{ i }.
Case 2. ${P}_{k}^{in}$ is satisfied only by ${P}_{b}^{sol}$, and it is not satisfied by ${P}_{a}^{*}$ (Figure 3(c)). This means that ${P}_{k}^{in}$ starts on ${P}_{b}^{sol}$ before u_{ j } and, since it contains v_{ i }, it ends on ${P}_{b}^{sol}$ after v_{ i }. From the fact that ${P}_{j}^{in}$ in not completely included in ${P}_{k}^{in}$, v_{ k } is an internal node of ${P}_{j}^{in}$, and thus a suffix of ${P}_{k}^{in}$ equals a prefix of ${P}_{j}^{in}$. This common part is now longer than the common suffix/prefix between ${P}_{i}^{in}$ and ${P}_{j}^{in}$, which contradicts maximality of the suffix/prefix between ${P}_{i}^{in}$ and ${P}_{j}^{in}$. This proves the lemma.
The remaining steps can be seen as analogous to the reduction in [20].
Step 5. for every path ${P}_{i}^{in}\in {\mathcal{P}}^{in}$ do:
In this step, we represent each subpath constraint by an edge constraint. Its correctness is guaranteed by the fact that by now, no two subpath constraints are such that a suffix of the first is a prefix of the second. We should stress out that if there are more paths with the same endpoints, we may add parallel edges to the DAG. However, in Step 6 below these parallel edges will be transformed into parallel paths of length 2, rendering the DAG simple again.
At this point, we have transformed all subpath constraints into edge constraints. The edge constraints can be modeled as node constraints by simply subdividing each edge and introducing a new node in the middle of it; this node is then added to ${V}^{\prime}$.
Step 7. G:= T(G)
We replace G by its transitive closure, since in Step 8 below we are going to remove from G all vertices not in ${V}^{\prime}$.
Step 8. Remove from G all nodes not in ${V}^{\prime}$;
Since only the nodes in ${V}^{\prime}$ have to be covered by the paths in the path cover, we remove all other nodes. This is correct, since, at Step 7 above, we introduced all edges between nodes v and ${v}^{\prime}$ such that ${v}^{\prime}$ was reachable from v through some nodes not in ${V}^{\prime}$.
Step 9. Compute a MPC for the resulting graph G;
This can be done by any method discussed previously.
Step 10. Postprocess the paths obtained at Step 9 above by reverting the transformations executed at Steps 18, in reverse order.
Theorem 2 Problem MPCSC on a graph with n nodes, m edges, c subpath or edge constraints, and with N being the sum of subpath constraint lengths, can be solved by solving the classical MPC Problem in a graph with O(n + c) nodes and O(n^{2} + c) edges. This graph can be computed in time O(N + c^{2} + n^{2}), thus the complexity of Problem MPCSC is $O\left(N+{c}^{2}+\sqrt{\left(n+c\right)}\left({n}^{2}+c\right)\right)$.
Proof The complexity of the preprocessing phase is dominated by Steps 3 and 4. Step 3 can be solved by first building a (generalized) suffix tree on the concatenation of subpath constraints with a distinct symbol #_{ i } added after each constraint sequence ${P}_{i}^{in}$. This can be done in O(N) time even on our alphabet of size O(n) [32].
Then one can do as follows during depthfirst traversal of the tree: If a leaf corresponding to the suffix starting at the beginning of subpath constraint ${P}_{i}^{in}$ has an incoming edge labeled by only #_{ i } and its parent has still other children, then the constraint is a substring of another constraint and must be removed (together with the leaf).
For Step 4, we compute all pairs longest suffixprefix overlaps between the subpath constraints using an O(N + c^{2}) time algorithm in [33, Theorem 7.10.1, page 137], [45] with [32] as a subroutine for the sake of large alphabet. The output can be casted to a doublelinked list L containing elements of the form (i, j, len, prev_{ i }, next_{ i }, prev_{ j } , next_{ j } ) in decreasing order of the overlap length, len, between constraints ${P}_{i}^{in}$ and ${P}_{j}^{in}$. Pointers prev_{ i }, next_{ i }, prev_{ j }, and next_{ j } tell the previous/next occurrences of the tuple having i as the first element and j as the second element, respectively. Then popping the first tuple from L tells us the first constraints to merge, and following prev_{ ∗ } and next_{ ∗ } pointers we can remove all overlaps no longer relevant for next mergings; when removing, we need to make sure the nested doublelinked lists formed by the prev_{ ∗ } and next_{ ∗ } pointers are also updated. Continuing like this until the list L is empty gives all the overlaps in total O(c^{2}) time. Notice that the new merged constraints do not need to be separately taken into account in overlap computation; no completely new overlaps can be created due to Step 3.
Merging itself requires a similar linked list structure being a special case of unionfind: All the constraints are represented as doublelinked lists with node numbers as elements. Merging can be done by linking the doublelinked lists together, removing the extra overlapping part from the latter list and redirecting its start pointer to point inside the newly formed merged list. When finished with merging, the new constraints are exactly those old constraints whose start pointers still point to the beginning of a node list. The complexity of merging is thus O(N).
The weighted case
To solve the MWMPCSC Problem, we build on the reduction in [36] to a network flow problem. This reduction will allow the addition of edge weights and of constraints on the starting/ending nodes of the solution paths. Note that these constraints S and T cannot be included in the reduction of the MPC Problem to a bipartite matching problem. Moreover, the heuristic in [20, Sec. 2.4.2] of arbitrarily extending the paths in a minimum weight MPC towards sources/sinks cannot be proved to be correct.
 1
We replace each node $v\in {V}^{\prime}$ by an edge (v_{1}, v_{2}) such that all inneighbors of v are now inneighbors of v_{1}, and all outneighbors of v are now outneighbors of v_{2}. If node v was introduced at Step 6 to model an edge coming from a subpath constraint P , then the cost per unit of flow of (v_{1}, v_{2}) is the sum of the weights of the edges of P ; otherwise, it is 0.
 2.
For each edge e of G, if e is an original edge of G, we set its flow lower bound to 0 and its cost per unit of flow to w(e); otherwise we set both to 0.
 3.
The global source s has outgoing edges precisely to the nodes in the set S, and the global sink t has incoming edges precisely from the nodes in T ; we also add the edge (t, s). All edges incident to s or t have flow flower bound 0 and cost 0, except for the edge (t, s) having as cost the sum of all edge weights (plus a positive constant if all are 0). This guarantees, like before, that any mincost circulation is also a minflow.
Note that, by reducing to a flow problem, we do not have to perform Steps 7 and 8 anymore, since the coverage constraints are now modeled as flow lower bound constraints. As in the case of Thm. 1, we compute a mincost circulation on this transformed input G, that is, a function f : E(G) → ℕ which satisfies all the flow conservation property for all nodes, satisfies all edge lower bounds, and minimizes ${\sum}_{e\in E\left(G\right)}f\left(e\right)$. We then decompose the circulation (from which we remove the edge (t, s)) into paths, and covert these paths into paths of the original input graph. This is done by reverting the transformations executed at Steps 16, in reverse order (as done for the MPCSC Problem). As before, these paths form a MPC satisfying all constraints, and they also start and end in vertices of S and T , respectively (because of the way s and t were connected to the other nodes of the graph). Since these paths arise from a mincost circulation, then they also form a minimum weight MPC satisfying the input constraints. The flow network has only flow lower bounds, thus we can again apply the algorithm of [49], to get the following:
Theorem 3 Problem MWMPCSC on a graph with n nodes, m edges, c subpath or edge constraints, and with N being the sum of subpath constraint lengths, can be solved by reducing it to a mincost circulation problem on a network with O(n + c) nodes and O(m + c) edges, and with flow lower bounds only. This network can be computed in time O(N + c^{2} + m), and the complexity of Problem MWMPCSC becomes O(N + (n + c)^{2} log(n + c) + (n + c)(m + c)).
The MPC with Paired Subpaths Constraints (MPCPSC) Problem
The NPcompleteness proof
In this section we show that the MPCPSC Problem is NPcomplete. Our reduction is from the NPcomplete problem of deciding whether the chromatic number of a graph G, χ(G), is 3 [50]. We will show that it is actually NPcomplete to determine if the MPCPSC Problem admits a solution with just 3 paths, even on planar DAGs, of width 2, seriesparallel, when only paired subpath constraints are imposed, and all subpaths are just edges.
Theorem 4 Problem MPCPSC is NPcomplete.
Proof We show that the graph G = (V, E) has χ(G) = 3 if and only if the DAG P(G) drawn in Figure 5 admits a solution to Problem MPCPSC with 3 paths.
(⇒) Suppose that χ(G) = 3. Definitely, we need at least three paths to solve P(G), since the three edges v_{1}, X_{1}, Y_{1} exiting from node 0 cannot be covered by the same path, and each of them is mentioned in some constraint. By definition, G is 3 colorable if and only if V can be partitioned into three sets V_{ A }, V_{ B } , V_{ C } such that no edge of G is contained in any of them. We use these three sets to build up the three solution paths for Problem MPCPSC as follows: for all X ∈ {A, B, C}, in the first stage (until node n) path P_{ X } picks up all edges labeled with a node in V_{ X } and no edge labeled with a node in V\V_{ X } ; next, in the second stage (from node n until node n + m), P_{ X } picks up those edges $\left[{v}_{{i}_{k}}\right]$ such that ${v}_{{i}_{k}}$ belongs to P_{ X } . This is possible, since no edge ${e}_{k}={v}_{{i}_{k}}{v}_{{j}_{k}}$ is contained in the same color class, and consequently the two of edges of P (G) labeled ${v}_{{i}_{k}}$ and ${v}_{{j}_{k}}$ do not belong to the same path among {P_{ A }, P_{ B } , P_{ C } }. Thus, $\left[{v}_{{i}_{k}}\right]$ and $\left[{v}_{{j}_{k}}\right]$ do not have to be both covered by the same solution path. Therefore, the three paths P_{ A }, P_{ B } , P_{ C } satisfy all paired subpath constraints, and are a solution to Problem MPCPSC.
(⇐) Suppose the DAG P(G) drawn in Figure 5 admits a solution to Problem MPCPSC with 3 paths P_{ A }, P_{ B } , P_{ C } . Then, we partition V into three color classes A, B, C by setting v_{ i }∈ × if and only if the edge of P(G) labeled by v_{ i } (in the first stage from node 0 to node n) belongs to P_{ X } , for all X ∈ {A, B, C}. To see that {A, B, C} is indeed a partition of V , observe that in each block k of the first stage of P(G), no two paths in {P_{ A }, P_{ B } , P_{ C } } can share an edge, since all three edges v_{ k } , X_{ k } , Y_{ k } appear in some constraint. Therefore, each edge v_{ k } appears in exactly one of {P_{ A }, P_{ B } , P_{ C } }. The proof that the partition {A, B, C} is also a proper coloring of G encounters no difficulty, as the rationale behind the reduction was illustrated in the forward implication.
Corollary 1 For no ε > 0 there exists a $\left(\frac{4}{3}\in \right)$approximation algorithm for Problem MPCPSC unless P=NP. Moreover, the problem is not FPT when parameterized on OPT (the minimum number of paths in a solution).
The FPT algorithm
In the previous section, we obtained the NPcompleteness for the decision problem OPT = 3; this rules out a Dynamic Programming approach for Problem MPCPSC. In this section, we show that if OPT = 2, then the problem can be solved in polynomial time. This also leads to an FPT algorithm on the total number of constraints.
For any constraint of the input DAG G that is made up of a pair (P_{1}, P_{2}) of subpaths of G, we may assume that there exists a directed path of G completely containing both P_{1} and P_{2}, otherwise, the input instance is infeasible. Given any two constraints X and Y (X and Y can be nodes, edges, or pairs of subpaths), we say that X and Y are compatible if there is a directed path of G completely containing both X and Y . We exploit the following structural property:
Lemma 2 Let C be a set of constraints on a DAG G. There exists a directed path P in G which satisfies all constraints in C if and only if any two constraints in C are compatible.
Proof The forward implication is clear from the definition. For the backward implication, recall that the width of a DAG denotes the maximum size of an antichain of it. We claim that the union of the constraints in C is a DAG of width 1. Indeed, if it were of width 2 it would contain two nodes v_{1} and v_{2} which are pairwise not reachable by a directed path, thus forming an antichain of size 2. Since we assumed that for all pairs (P_{1}, P_{2}) of subpaths constraints of G, there exists a directed path of G completely containing both P_{1} and P_{2}, this implies that v_{1} and v_{2} belong to two different constraints X and Y in C. Thus, X and Y are not compatible, a contradiction.
Theorem 5 Given an instance for Problem MPCPSC, we can decide in polynomial time if OPT = 2, and if so, find the two solution paths. Moreover, Problem MPCPSC is fixedparameter tractable (FPT) in the total number C of input constraints.
Proof We build an incompatibility graph from the input constraints: every constraint is represented by a node, and we add an edge between two constraints iff they are incompatible. Then, OPT = 2 iff this incompatibility graph is bipartite, and the two classes of the bipartition give the two solution paths; this can be done in time O(C^{2}). If OPT > 2, then we try all possible ways of partitioning the set of all input constraints (the number of these possibilities is a function only on C), and check that each class of the partition consists of pairwise compatible constraints.
Notes
Declarations
Acknowledgements
We thank Anna Kuosmanen and Ahmed Sobih for their prompt help with the preliminary experiments on RNASeq data, which we included as Supplementary Material. This work was partially supported by the Academy of Finland under grant 250345 (CoECGR) and by the Finnish Cultural Foundation.
Declarations
Publication of this article was supported by the Academy of Finland under grant 250345 (CoECGR).
This article has been published as part of BMC Bioinformatics Volume 15 Supplement 9, 2014: Proceedings of the Fourth Annual RECOMB Satellite Workshop on Massively Parallel Sequencing (RECOMBSeq 2014). The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/15/S9.
Authors’ Affiliations
References
 Xing Y: The multiassembly problem: reconstructing multiple transcript isoforms from EST fragment mixtures. Genome Research. 2004, 14 (3): 426441. 10.1101/gr.1304504.PubMed CentralView ArticlePubMedGoogle Scholar
 Mortazavi A: Mapping and quantifying mammalian transcriptomes by RNASeq. Nature Methods. 2008, 5: 621628. 10.1038/nmeth.1226.View ArticlePubMedGoogle Scholar
 Pepke S, Wold B, Mortazavi A: Computation for ChIPseq and RNAseq studies. Nature methods. 2009, 6 (11): 2232.View ArticleGoogle Scholar
 Kim E, Goren A, Ast G: Insights into the connection between cancer and alternative splicing. Trends in genetics: TIG. 2008, 24 (1): 710. 10.1016/j.tig.2007.10.001.View ArticlePubMedGoogle Scholar
 LopezBigas N, Audit B, Ouzounis C, Parra G, Guigo R: Are splicing mutations the most frequent cause of hereditary disease?. FEBS Letters. 2005, 579 (9): 19001903. 10.1016/j.febslet.2005.02.047.View ArticlePubMedGoogle Scholar
 Wang Z, Gerstein M, Snyder M: RNASeq: a revolutionary tool for transcriptomics. Nature Reviews Genetics. 2009, 10 (1): 5763. 10.1038/nrg2484.PubMed CentralView ArticlePubMedGoogle Scholar
 Shah S: The clonal and mutational evolution spectrum of primary triplenegative breast cancers. Nature. 2012, 486 (7403): 395399.PubMedGoogle Scholar
 Trapnell C: Transcript assembly and quantification by RNASeq reveals unannotated transcripts and isoform switching during cell differentiation. Nature Biotechnology. 2010, 28: 511515. 10.1038/nbt.1621.PubMed CentralView ArticlePubMedGoogle Scholar
 Feng J: Inference of isoforms from short sequence reads. RECOMB  Research in Computational Molecular Biology. Edited by: Berger, B. 2010, LNCS, 6044: 138157. 10.1007/9783642126833_10.Google Scholar
 Li W: IsoLasso: a LASSO regression approach to RNASeq based transcriptome assembly. Journal of Computational Biology. 2011, 18 (11): 16931707. 10.1089/cmb.2011.0171.PubMed CentralView ArticlePubMedGoogle Scholar
 Lin YY: CLIIQ: Accurate Comparative Detection and Quantification of Expressed Isoforms in a Population. WABI  12th Workshop on Algorithms for Bioinformatics. 2012, LNCS, 7534: 178189. 10.1007/9783642331220_14.View ArticleGoogle Scholar
 Li JJ: Sparse linear modeling of nextgeneration mRNA sequencing (RNASeq) data for isoform discovery and abundance estimation. Proceedings National Academy of Sciences. 2011, 108 (50): 1986719872. 10.1073/pnas.1113972108.View ArticleGoogle Scholar
 Guttman M: Ab initio reconstruction of cell typespecific transcriptomes in mouse reveals the conserved multiexonic structure of lincRNAs. Nature Biotechnology. 2010, 28 (5): 503510. 10.1038/nbt.1633.PubMed CentralView ArticlePubMedGoogle Scholar
 Mezlini AM: iReckon: Simultaneous isoform discovery and abundance estimation from RNAseq data. Genome Research. 2012, 23 (3): 519529.View ArticlePubMedGoogle Scholar
 Mangul S: An integer programming approach to novel transcript reconstruction from pairedend RNASeq reads. ACM Conference on Bioinformatics, Computational Biology and Biomedical Informatics. Edited by: Ranka, S. 2012, ACM, New York, NY, USA, 369376.View ArticleGoogle Scholar
 Xia Z: NSMAP: A method for spliced isoforms identification and quantification from RNASeq. BMC Bioinformatics. 2011, 12 (1): 16210.1186/1471210512162.PubMed CentralView ArticlePubMedGoogle Scholar
 Bernard E: Efficient RNA Isoform Identification and Quantification from RNASeq Data with Network Flows. preprint: SU2CAACRDT0409; SES0835531; CCF0939370.Google Scholar
 Hiller D: Simultaneous Isoform Discovery and Quantification from RNASeq. Statistics in Biosciences. 2013, 5 (1): 119. 10.1007/s1256101390887.View ArticleGoogle Scholar
 Song L, Florea L: CLASS: constrained transcript assembly of RNAseq reads. BMC Bioinformatics. 2013, 14 (S5): 14Proceedings paper from RECOMBseq: Third Annual Recomb Satellite Workshop on Massively Parallel Sequencing Beijing, China. 1112 April 2013View ArticleGoogle Scholar
 Bao E, Jiang T, Girke T: Branch: boosting rnaseq assemblies with partial or related genomic sequences. Bioinformatics. 2013, 29 (10): 12501259. 10.1093/bioinformatics/btt127.View ArticlePubMedGoogle Scholar
 Beerenwinkel N, Gu¨nthard HF, Roth V, Metzner KJ: Challenges and opportunities in estimating viral genetic diversity from nextgeneration sequencing data. Frontiers in Microbiology. 2012, 3: 329PubMed CentralView ArticlePubMedGoogle Scholar
 Mancuso N, Tork B, Skums P, Mandoiu II, Zelikovsky A: Viral quasispecies reconstruction from amplicon 454 pyrosequencing reads. Bioinformatics and Biomedicine Workshops. 2011, IEEE, Atlanta, GA, USA, 94101.Google Scholar
 O'Neil S, Emrich S: Haplotype and minimumchimerism consensus determination using short sequence data. BMC Genomics. 2012, 13 (Suppl 2): 410.1186/1471216413S2S4.View ArticleGoogle Scholar
 Huang A, Kantor R, DeLong A, Schreier L, Istrail S: Qcolors: An algorithm for conservative viral quasispecies reconstruction from short and noncontiguous next generation sequencing reads. Bioinformatics and Biomedicine Workshops. 2011, IEEE, Atlanta, GA, USA, 130136.Google Scholar
 Eriksson N, Pachter L, Mitsuya Y, Rhee SY, Wang C, Gharizadeh B, Ronaghi M, Shafer RW, Beerenwinkel N: Viral population estimation using pyrosequencing. PLoS Computational Biology. 2008, 4 (5):Google Scholar
 Zagordi O, Bhattacharya A, Eriksson N, Beerenwinkel N: ShoRAH: estimating the genetic diversity of a mixed sample from nextgeneration sequencing data. BMC Bioinformatics. 2011, 12 (1): 11910.1186/1471210512119.PubMed CentralView ArticlePubMedGoogle Scholar
 Westbrooks K, Astrovskaya I, Campo DS, Khudyakov Y, Berman P, Zelikovsky A: HCV Quasispecies Assembly Using Network Flows. ISBRA Lecture Notes in Computer Science. Edited by: Mandoiu, I.I., Sunderraman, R., Zelikovsky, A. 2008, Springer, Berlin, 4983: 159170. 10.1007/9783540794509_15.Google Scholar
 Wang ET, Sandberg R, Luo S, Khrebtukova I, Zhang L, Mayr C, Kingsmore SF, Schroth GP, Burge CB: Alternative isoform regulation in human tissue transcriptomes. Nature. 2008, 456 (7221): 470476. 10.1038/nature07509.PubMed CentralView ArticlePubMedGoogle Scholar
 Fulkerson DR: Note on dilworth's decomposition theorem for partially ordered sets. Proceedings of the American Mathematical Society. 1956, 7 (4): 701702.Google Scholar
 Schadt EE, Turner S, Kasarskis A: A window into thirdgeneration sequencing. Human molecular genetics. 2010, 19 (R2): 227240. 10.1093/hmg/ddq416.View ArticleGoogle Scholar
 Sammeth M, Foissac S, Guig´o R: A General Definition and Nomenclature for Alternative Splicing Events. PLoS Computational Biology. 2008, 4 (8): 100014710.1371/journal.pcbi.1000147.View ArticleGoogle Scholar
 Farach M: Optimal suffix tree construction with large alphabets. 38th Annual Symposium on Foundations of Computer Science (FOCS'97). 1997, IEEE Computer Society, Washington, DC, USA, 137143.View ArticleGoogle Scholar
 Gusfield D: Algorithms on Strings, Trees, and Sequences  Computer Science and Computational Biology. 1997, Cambridge University Press, Cambridge UKView ArticleGoogle Scholar
 Gusfield D, Landau GM, Schieber B: An efficient algorithm for the all pairs suffixprefix problem. Inf Process Lett. 1992, 41 (4): 181185. 10.1016/00200190(92)90176V.View ArticleGoogle Scholar
 Gabow HN, Tarjan RE: Faster scaling algorithms for network problems. SIAM J Comput. 1989, 18 (5): 10131036. 10.1137/0218069.View ArticleGoogle Scholar
 Pijls W, Potharst R: Another note on dilworth's decomposition theorem. Journal of Discrete Mathematics. 2013, 2013: 692645View ArticleGoogle Scholar
 Tomescu AI, Kuosmanen A, Rizzi R, M¨akinen V: A Novel Combinatorial Method for Estimating Transcript Expression with RNASeq: Bounding the Number of Paths. WABI 2013  13th Workshop on Algorithms for Bioinformatics. 2013, LNBI, 8126: 440451.Google Scholar
 Tomescu AI, Kuosmanen A, Rizzi R, M¨akinen V: A Novel MinCost Flow Method for Estimating Transcript Expression with RNASeq. BMC Bioinformatics. 2013, 14 (Suppl 5): 15Proceedings paper from RECOMBseq: Third Annual Recomb Satellite Workshop on Massively Parallel Sequencing Beijing, China. 1112 April 2013Google Scholar
 Nadalin F, Vezzi F, Policriti A: GapFiller: a de novo assembly approach to fill the gap within paired reads. BMC Bioinformatics. 2012, 13 (S14): 8View ArticleGoogle Scholar
 Boetzer M, Pirovano W: Toward almost closed genomes with gapfiller. Genome Biology. 2012, 13 (6): 5610.1186/gb2012136r56.View ArticleGoogle Scholar
 Beerenwinkel N, Beretta S, Bonizzoni P, Dondi R, Pirola Y: Covering pairs in directed acyclic graphs. In: Language and Automata Theory and Applications. Lecture Notes in Computer Science. 2014, Springer, Berlin, 8370: 126137. 10.1007/9783319049212_10.Google Scholar
 Dilworth RP: A Decomposition Theorem for Partially Ordered Sets. The Annals of Mathematics. 1950, 51 (1):Google Scholar
 Hopcroft JE, Karp RM: An n^{5/2} algorithm for maximum matchings in bipartite graphs. SIAM J Comput. 1973, 2 (4): 225231. 10.1137/0202019.View ArticleGoogle Scholar
 Fredman ML, Tarjan RE: Fibonacci heaps and their uses in improved network optimization algorithms. J ACM. 1987, 34 (3): 596615. 10.1145/28869.28874.View ArticleGoogle Scholar
 Ahuja RK, Magnanti TL, Orlin JB: Network Flows: Theory, Algorithms, and Applications. 1993, PrenticeHall, Inc., Upper Saddle River, NJ, USAGoogle Scholar
 Orlin JB: Max flows in O(nm) time, or better. In: Proceedings of the 45th Annual ACM Symposium on the Theory of Computing. STOC '13. 2013, ACM, New York, NY, USA, 765774.View ArticleGoogle Scholar
 Schrijver A: Combinatorial Optimization  Polyhedra and Efficiency. 2003, Springer, BerlinGoogle Scholar
 Ahuja RK, Goldberg AV, Orlin JB, Tarjan RE: Finding minimumcost flows by double scaling. Mathematical Programming. 1992, 53: 243266. 10.1007/BF01585705.View ArticleGoogle Scholar
 Gabow HN, Tarjan RE: Faster scaling algorithms for general graph matching problems. J ACM. 1991, 38 (4): 815853. 10.1145/115234.115366.View ArticleGoogle Scholar
 Garey MR, Johnson DS: Computers and Intractability: A Guide to the Theory of NPCompleteness. 1979, W. H. Freeman & Co., New York, NY, USAGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.