Volume 16 Supplement 14
Proceedings of the 13th Annual Research in Computational Molecular Biology (RECOMB) Satellite Workshop on Comparative Genomics: Bioinformatics
Exact approaches for scaffolding
- Mathias Weller^{1, 2}Email author,
- Annie Chateau^{1, 2} and
- Rodolphe Giroudeau^{1}
https://doi.org/10.1186/1471-2105-16-S14-S2
© Weller et al.; 2015
Published: 2 October 2015
Abstract
This paper presents new structural and algorithmic results around the scaffolding problem, which occurs prominently in next generation sequencing. The problem can be formalized as an optimization problem on a special graph, the "scaffold graph". We prove that the problem is polynomial if this graph is a tree by providing a dynamic programming algorithm for this case. This algorithm serves as a basis to deduce an exact algorithm for general graphs using a tree decomposition of the input. We explore other structural parameters, proving a linear-size problem kernel with respect to the size of a feedback-edge set on a restricted version of Scaffolding. Finally, we examine some parameters of scaffold graphs, which are based on real-world genomes, revealing that the feedback edge set is significantly smaller than the input size.
Keywords
scaffolding exact methods dynamic programming treewidth preprocessing fixed-parameter tractableIntroduction
During the last decade, a huge amount of new genomes have been sequenced, leading to an abundance of available DNA resources. Nevertheless, most of recent genome projects stay unfinished, in the sense that databases contain much more incompletely assembled genomes than whole stable reference genomes [1]. One reason for this phenomenon is that, for most of the analyses performed on DNA, an incomplete assembly is sufficient. Sometimes it is even possible to perform them directly from the sequenced data. Another reason is that producing a complete genome, or an as-complete-as-possible-genome is a difficult task. Traditionally, producing a complete genome consists of three steps, each of them computationally or financially difficult: the assembly, the scaffolding, and the finishing. The step of scaffolding, on which we focus here, consists of orienting and ordering at the same time the contigs produced by assembly. Many methods have been proposed and the recent activity on the subject shows that it is an active field (see, not exhaustively, [2–8] and Section ). A good survey of recent methods can be found in [9] for instance. Since the problem has been proved $\mathcal{N}\mathcal{P}$-complete from its first formulation [10], nearly all of these methods propose heuristic approaches. One of them claims that an exact method provides better results [5], however, the authors prepend a heuristic graph simplification before running their exact algorithm.
The approach presented here relies on a combinatorial problem on a dedicated graph, called scaffold graph, representing the link between already assembled contigs. The main idea is to represent each contig by two vertices linked by an edge (these "contig-edges" form a perfect matching on the scaffold graph). Other edges are constructed and weighted using complementary information on the contigs. The weight of a non-contig edge uv, with uu′, vv′ being contig-edges, corresponds to a confidence measure that the uu′ contig is succeeded by the vv′ contig (oriented as u′ − u − v − v′). The scaffold graph is a flexible tool to study the scaffolding issues. Indeed, the graph is a syntactical data-structure which may represent several semantics. For instance, the scaffold graphs used for our previous experiments have been built using Next-Generation Sequencing data, namely paired-end reads. However, we also could provide other type of information to compute the weight on the edges, like for instance ancestral support in a phylogenetic context, or comparison to other extent genomes which could be used as multiple references. The way to define the weight on the edges does not change the main goal of our method, which is to determine the optimal ordering and orientation of the contigs, given a specific criterion. It is also possible to mix two or more criteria in order to take several sources of information into account.
We also introduced two structural parameters σ_{p} and σ_{c} representing the respective numbers of linear and circular chromosomes sought in the genome. Those parameters seems quite artificial at a first sight, but they are well-motivated as follows. First, in a number of species, genomes are hard to assemble with classical methods because of an heterogeneous chromosomal structure or difficulties to perform classical assembly. This is the case, for instance, with the micro-chromosomes in the chicken genome [11]. Also, when scaffolding metagenomic data, one has to deal with very complex genomic structures [12]. We hope that relaxing the classical model with one chromosome, mixed with the flexibility of the scaffold graph, could help handling such complex situations. The second motivation to introduce those parameters came from the desire to explore an intermediary case between the very classical and studied $\mathcal{N}\mathcal{P}$-complete [13]TRAVELING SALESMAN PROBLEM (TSP) and a totally free structure, in which case optimizing a covering by paths and cycles is polynomial-time solvable [14]. The SCAFFOLDING problem then consists of finding (at most) σ_{p} paths and σ_{c} cycles that cover all contig-edges while maximizing the total weight. Previously, we proved that SCAFFOLDING is $\mathcal{N}\mathcal{P}$-complete, even under restricted conditions [15], developed polynomial-time approximation algorithms [15, 16] and evaluated them experimentally [17]. In this paper, we continue to explore this problem, as well as the structural properties of the scaffold graph. This exploration aims to develop an efficient method, that could be used both to refine the ratio analysis on previously designed heuristics and to handle cases of small genomes with better quality.
We present both theoretical and practical results, as well as our experimental findings. The paper is organized as follows: we relate a general state-of-the-art and methodological comparison with existing methods and our previous work in Section. In Section, we formally introduce the problem and some notations and definitions. In Section, we show that the general SCAFFOLDING problem is polynomial-time solvable on trees, and present a dynamic programming algorithm. We extended this algorithm in Section 11 to a parameterized algorithm with respect to the treewidth of the input, that is, this algorithm runs in polynomial time, on scaffold graphs that are sparse, i.e. treelike. In Section 11, we lay the foundation to developing effective preprocessing routines for SCAFFOLDING, by presenting a set of reduction rules whose application shrinks input instances of a derived problem, RESTRICTED SCAFFOLDING. We show that the size of the resulting instances can be bounded linearly in their feedback edge set number, thus proving a problem kernel for this parameter. Finally, in Section 11, we present some experimental analysis of scaffold graphs that are derived from real-world genomes.
Related work
Genome scaffolding is an intrinsically complex problem, and was studied through several computation models and tools. The first model, presented in [10], leads to its classification as an $\mathcal{N}\mathcal{P}$-complete problem. This work also proposes the first greedy approach. The scaffolding was then performed using genomic fragments coupled by mate-pairs [18], which differ from the paired-end reads essentially by their larger insert-size (The insert-size is the gap, in base pairs, between two fragments constituting a pair of reads), and their lower covering depth, so the size of the data and the organization of the graph may differ from actual Next-Generation Sequencing data using paired-end reads. A greedy-like approach is also used in SSPACE [19], iteratively combining the longest contigs. Conversely, Gao et al. present a dynamic-programming-based exact approach (OPERA) which provides higher-quality scaffolds than existing heuristic approaches [5]. However, their algorithm runs in O(n^{ k } ) time (where k is the "library width") which quickly grows out of reasonable proportions. Thus, to apply this method on real data, they prepend a graph contraction procedure to limit the size of the input. In SOPRA [4], Dayarian et al. introduce a removal procedure for problematic contigs, and separate the orientation step from the linking step. The orientation step uses a simulated annealing technique for regions of high complexity. In GRASS [3], a genetic algorithm is provided, using mixed integer programming (MIP) as in [6] and an expectation-maximization process to counter the intractability of the MIP model. They obtain results which are intermediary in quality between SSPACE and Opera. In SCARPA [2], the two problems of orienting and ordering the contigs are separated, as in SOPRA. Transforming the graph such that the contig orientation problem has a feasible solution is a fixed-parameter tractable problem with respect to the number of nodes to remove [20]. The authors first use the corresponding algorithm to pre-process the graph and exhibit an optimal relative orientation of the contigs. Then, pre-oriented contigs are ordered using a heuristic, and the removal of articulation vertices is used to limit the size of the connected components. Misassembled contigs are detected and removed. In BESST [21], the authors refine the previous approaches by considering the library insert size distribution to filter relevant linking information, leading to an almost linear scaffold graph, which is easy to treat. The scalability of these approaches is not always proven, and the general increase of the size and amount of available data brings real efficiency questions. Recently, some methods also use phylogenetic or comparative genomic approaches via rearrangement analysis to perform scaffolding on ancient or extant genomes when a set of closely related species genomes are available (see for instance [22–24]). This encourages the mix of multiple sources of information for scaffolding a given genome.
The scaffolding problem have been extensively studied in the framework of complexity and approximation. In [15, 16], the authors proved that the problem is $\mathcal{N}\mathcal{P}$-complete even if the scaffold graph is a planar bipartite graph. They also proposed some lower bounds for exact exponential-time algorithms for SCAFFOLDING according to the EXPONENTIAL-TIME HYPOTHESIS[25]. Finally, they proved that the minimization version of SCAFFOLDING (seeking a minimum-weight cover of the scaffold graph) is unlikely to be approximable within any constant ratio, even if the scaffold graph is a completed bipartite graph [15]. On the positive side, two polynomial-time factor-3-approximation algorithms for the maximization version have been designed. The first is an O(n^{2} log n)-time greedy algorithm, the second is based on the computation of a maximum matching (O(n^{3})-time). The theoretical aspects of the scaffold graph have been completed by extensive experimental results [17] for the maximization version on simulated and real datasets.
The main contribution of the present paper lays in both the algorithmic exploration of exact methods for the scaffolding problem with structural parameters, and the initiation of a discussion on the scaffold graph properties. Indeed, the knowledge of those properties may lead to algorithmic improvement, as well as the detection of putative errors in assemblies.
Definitions
Scaffolding problem. The central combinatorial object we are working with in this work is the "scaffold graph" (see also [16, 17]).
Definition 1 (Scaffold graph) A scaffold graph is a pair (G, M) of a graph G = (V, E, w) with an even number of vertices, and a weight function w : E → ℕ on its edges, and a perfect matching M on G.
Graph Theory. Slightly abusing notation, we sometimes consider paths as sets of edges. Furthermore, for a matching M and a vertex u, we define M(u) as the unique vertex v with uv ∈ M if such a v exists, and M(u) = ⊥, otherwise. A path p is alternating with respect to a matching M if, for all vertices u of p, also M(u) is a vertex of p. If M is clear from context, we do not mention it explicitly. For a graph G = (V, E) and a vertex set V^{′} ⊆ V, let G[V^{′}] denote the subgraph of G induced by V^{′} and let G − V^{′} := G[V \ V^{′}]. For S ⊆ E, let G − S := (V, E \ S ) and, for any edge or vertex x, we abbreviate G − {x} =: G − x. For a set of pairs S , we let Gr(S ) denote the graph having S as edgeset, that is, Gr(S ) := (U_{ e∈S } e, S ). For a function ω : E → ℕ and a set S ⊆ E, we abbreviate Σ_{ e∈S } ω(e) =: ω(S ). Let G = (V, E) be a graph with a matching M and let S be a matching in G − M. The number of paths (resp. cycles) in G[S ∪ M] is denoted by $\left|\right|S|{|}_{p}^{G,M}$ (resp. $\left|\right|S|{|}_{c}^{G,M}$). If all paths in G[S ∪ M] are alternating with respect to M, then, we call S a ||S||_{ p }-||S||_{ c }-cover (or simply cover) for G with respect to M (we will omit M if its clear from context). If ||S||_{ c } = 0, we may also refer to S as a ||S||_{ p }-path cover (or simply path cover). The general scaffold problem is expressed as follows (see also [16, 17]): SCAFFOLDING (SCA)
Input: A scaffold graph (G = (V, E, w), M), σ_{p} ∈ ℕ, σ_{c} ∈ ℕ, k ∈ ℕ
Question: Is there a σ_{p}-σ_{c}-cover S for G with respect to M with ω(S ) ≥ k?
Tree Decompositions. A tree decomposition of a graph G = (V, E) is a pair (T = (V^{ T } , E^{ T } ), X : V^{ T } → 2^{ V } ) such that (1) for all uv ∈ E, there is some i ∈ V^{ T } with uv ⊆ X(i) and (2) for all v ∈ V, the subtree T_{ v } := T[{X(i) | v ∈ X(i)}] is connected. We call the images of X "bags" and the size of the largest bag minus one is the width of the decomposition. A decomposition of minimum width for G is called the optimal for G and its width is called the treewidth of G. It is a folklore theorem that each graph G has an optimal tree decomposition (T, X) that is nice, that is, each bag X(i) is one of the following types:
Leaf bag: i is a leaf of T and X(i) = ∅,
Introduce vertex v bag: i is internal with child j and X(i) = X( j) ∪ {v} with v ∉ X(j),
Forget v bag: i is internal with child j and X(i) = X(j) − v with v ∈ X(i),
Introduce edge uv bag: i is internal with child j and uv ∈ E and uv ⊆ X(i) = X(j),
Join bag: i is internal with children j and ℓ and X(i) = X(j) = X(ℓ).
Additionally, each edge e ∈ E is introduced exactly once. Given a width-tw tree decomposition, we can obtain a nice tree decomposition of width tw in n · poly(tw) time [26].
Parameterized Algorithmics. Parameterized complexity theory challenges the traditional view of measuring the running times exclusively in the input size n. Instead, we aim at identifying a parameter of the input that we expect to be small (much smaller than n) in all instances that the application at hand may produce. We then focus on developing algorithms whose exponential part can be bounded in this parameter (see [27] for details). Parameterized complexity allows to prove performance guarantees for preprocessing algorithms: a polynomial-time algorithm that, given an instance x with parameter k, computes an instance x′ with parameter k′ ≤ k such that x is a yes instance if and only if x′ is a yes instance, is called kernelization and the result x′ is the kernel. It is common to present a kernelization by showing various reduction rules that "cut away" the easy parts of the input and, when applied exhaustively to the input, shrink it enough to prove the size bounds.
Scaffolding on trees
Towards developing an algorithm for SCAFFOLDING that runs fast on graphs that are "close to trees", we consider a strategy to solve SCAFFOLDING in case the input graph is a tree. We use a bottom-up dynamic programming approach that computes for each vertex v starting with the leaves, the best possible solutions for the subtree rooted at v. For ease of presentation, we thus consider the input tree T to be rooted arbitrarily with r denoting the root vertex. Note that the solution cannot contain any cycles in this case.
At each vertex, the dynamic programming algorithm needs to decide how many paths should be covered in each of the subtrees of its children. Seeing that it is infeasible to try all combinations, we employ, again, a dynamic programming strategy solving this problem for a given vertex v: We order the children of each vertex v arbitrarily and let s_{ v } denote the sequence of children of v with s_{ v }[j] denoting the j^{th} child of v and s_{ v }[1..j] := U_{ i≤j } s_{ v }[ j]. Then, we update the global dynamic programming table of v in order of s_{ v }. Thus, when we consider the j^{th} child u of v, we only have to split the paths to be covered between the two subtrees T_{ u } and the union of all previously considered subtrees rooted at children of v (that is, T [U_{ℓ<j}T_{ sv [ℓ] } ∪ {v}]).
In the following, for a vertex v, let C(v) denote its children and par(v) its parent (or ⊥ if v = r), and let T_{ v } denote the subgraph of T that is rooted at v. For u, v ∈ V, we define T_{ u } +T_{ v } := T [V(T_{ u })∪V(T_{ v })] and T_{ u } +v := T [V(T_{ u })∪{v}]. Let v ∈ V and 1 ≤ j ≤ |C(v)|. Then, we define ${T}_{j}^{v}:={\sum}_{i\le j}{T}_{{s}_{v}\left[i\right]}+v$. Finally, we abbreviate s_{ v } := s_{ v }[1..0] := ⊥.
Algorithm 1: Algorithm to fill the dynamic programming table.
1 I ← leaves of T;
2 while ∃v ∈ V \ I s.t. C(v) ⊆ I do
3 foreach 1 ≤ j ≤ |C(v)| with u := s_{ v }[j] do
4 foreach i ≤ σ_{p} do
5 if uv ∈ M then
6 [0, i, j]_{ v } ← max_{α∈{0,1}} max_{ℓ≤i−(1−α)}[α, ℓ, ∞]_{ u } + [0, i − (ℓ − α + 1), j − 1]_{ v };
7 [1, i, j]_{ v } ← max_{α∈{0,1}} max_{ℓ≤i}[α, ℓ, ∞]_{ u } + [1, i − (ℓ − α), j − 1]_{ v };
8 else
9 [0, i, j]_{ v } ← max_{ℓ≤i}max_{α∈{0,1}}[α, ℓ, ∞]_{ u } + [0, i − ℓ, j − 1]_{ v };
10 ${\left[\mathsf{\text{1}},i,j\right]}_{v}\leftarrow \underset{\ell \le i}{\mathsf{\text{max}}}\left\{\begin{array}{c}\mathsf{\text{ma}}{\mathsf{\text{x}}}_{\alpha \in \left\{0,\mathsf{\text{1}}\right\}}{\left[\alpha ,\ell ,\infty \right]}_{u}+{\left[\mathsf{\text{1}},i-\ell ,j-\mathsf{\text{1}}\right]}_{v};\hfill \\ \omega \left(uv\right)+{\left[0,\ell ,\infty \right]}_{u}+\left\{\begin{array}{c}{\left[0,i-\left(\ell -\mathsf{\text{1}}\right),j-\mathsf{\text{1}}\right]}_{v}\phantom{\rule{0.3em}{0ex}}\phantom{\rule{0.3em}{0ex}}\phantom{\rule{0.3em}{0ex}}\phantom{\rule{0.3em}{0ex}}\phantom{\rule{0.3em}{0ex}}\phantom{\rule{0.3em}{0ex}}\phantom{\rule{0.3em}{0ex}}\phantom{\rule{0.3em}{0ex}}\phantom{\rule{0.3em}{0ex}};\mathsf{\text{if}}\phantom{\rule{0.3em}{0ex}}w\in {s}_{v}\left[\mathsf{\text{1}}..j\right]\hfill \\ {\left[0,i-\ell ,j-\mathsf{\text{1}}\right]}_{v}\phantom{\rule{0.3em}{0ex}}\phantom{\rule{0.3em}{0ex}}\phantom{\rule{0.3em}{0ex}}\phantom{\rule{0.3em}{0ex}}\phantom{\rule{0.3em}{0ex}}\phantom{\rule{0.3em}{0ex}}\phantom{\rule{0.3em}{0ex}}\phantom{\rule{0.3em}{0ex}}\phantom{\rule{0.3em}{0ex}}\phantom{\rule{0.3em}{0ex}}\phantom{\rule{0.3em}{0ex}}\phantom{\rule{0.3em}{0ex}}\phantom{\rule{0.3em}{0ex}}\phantom{\rule{0.3em}{0ex}}\phantom{\rule{0.3em}{0ex}}\phantom{\rule{0.3em}{0ex}}\phantom{\rule{0.3em}{0ex}}\mathsf{\text{otherwise}}\hfill \end{array}\right.\hfill \end{array}\right.$
11 if v = r then return max_{ c∈{0,1} }[c, σ_{p}, ∞]_{ r }else I ← I ∪ {v};
Semantics: For (c, i, j, v) ∈ {0, 1} × [σ_{p}] × [|C(v)|] × V we let [c, i, j]_{ v } denote the maximum weight of an i-0-cover S for ${T}_{j}^{v}$such that v is incident with exactly c edges of S. We abbreviate [c, i, ∞]_{ v } := [c, i, |C(v)|]_{ v }.
We maintain a set I of initialized vertices and, as soon as r is initialized, the algorithm stops. Thus, we assume r ∉ I. Finally, the maximum weight of a σ_{p}-0-cover for T can be computed by max_{ c∈{0,1} }[c, σ_{p}, ∞]_{ r }.
Lemma 1 Algorithm 1 is correct, that is, for all (i, j, c, v) ∈ [σ_{p}] × [|C(v)|] × {0, 1} × V and for any maximum-weight i-0-cover S for ${T}_{j}^{v}$with respect to M such that v is incident with exactly c edges of S we have [c, i, j]_{ v } = ω(S).
Concerning the running time, we note that the body of the loop in line 3 is executed exactly once per vertex and, hence, lines 6,7,9, and 10 are executed σ_{p} times per vertex. As they compute the maximum over σ_{p} values, the whole algorithm runs in $O\left(n\cdot {\sigma}_{p}^{\mathsf{\text{2}}}\right)$ time.
Corollary 1 SCAFFOLDING on trees can be solved in$O\left(n\cdot {\sigma}_{p}^{\mathsf{\text{2}}}\right)$time.
Scaffolding with respect to treewidth
To present our algorithm, we use special permutations (involutions) to model matchings that allow reflexive pairing (that is, matching a vertex with itself). Thus, slightly abusing notation, we will consider permutations as sets of pairs. We denote the subset of reflexive pairs of a permutation P by P^{1} and the subset of non-reflexive pairs by P^{2}. Then, Gr(P) :=Gr(P^{2}). For permutations P and Q, we define P □ Q as the set of pairs uv such that P(x) = ⊥ ⊕ Q(x) = ⊥ for all x ∈ uv and there is a u-v-path in Gr(P ∪ Q). Furthermore, for a function d : A → B and (x, y) ∈ A × B, we define d[x → y] as the result of setting d(x) := y (that is (d \ ({x} × B)) ∪ {(x, y)}). Here, d[x → ⊥] means to remove x from the domain of d. Let T = (χ, E^{ T } ) be a tree decomposition of G with root X(r) ∈ χ. For a bag X(i), let G_{ i }denote the subgraph of G that contains exactly those edges of G that are introduced in a bag of the subtree of T that is rooted at X(i) and let ${G}_{i}^{{}^{S}}:={G}_{i}\left[S\cup M\right]$.
A table entry for the bag X(i) will be indexed by (i) a function d : X(i) → {0, 1, 2}, (ii) a permutation P with U_{ uv∈P } uv = d^{−1}(1) and(iii) integers p ≤ σ_{p} and c ≤ σ_{c}. See Figure 2 for an illustration of d and P. An entry will have the following semantics:
Definition 2 Let i ∈ V^{ T }. We call a set S_{ i } ⊆ E(G_{ i }) \ M eligible with respect to a tuple (d, P, p, c, i) if
1 each vertex v ∈ X(i) has degree d(v) in ${G}_{i}^{{S}_{i}}$,
2 for each uv ∈ P, if u ≠ v, then there is a u-v-path in ${G}_{i}^{{S}_{i}}$and, if u = v, then there is a path q of non-zero-length in ${G}_{i}^{{S}_{i}}$ that contains u and avoids d^{−1}(1) − u (we say that q is dangling from u).
3 ${G}_{i}^{{S}_{i}}$contains p paths and c cycles that do not intersect d^{−1}(1),
Semantics: A table entry [d, P, p, c]_{ i } is the maximum weight of any set that is eligible with respect to (d, P, p, c, i).
Then, we can read the maximum weight of a solution S for G from [∅, ∅, σ_{p}, σ_{c}]_{ r }.
Given a nice tree decomposition and a bag X(i) with children X(j) and X(ℓ) (possibly j = ℓ), we compute [d, P, p, c]_{ i } depending on the type of the bag X(i) (entries that are not mentioned explicitly are set to −∞):
Leaf bag: Set [∅, ∅, 0, 0]_{ i } := 0.
Introduce vertex v: Newly introduced vertices do not have introduced edges yet. Thus, we force the degree of v in ${G}_{i}^{S}$ to 0: Formally, let [d, P, p, c]_{ i } := [d[v → ⊥], P, p, c] _{ j }if d(v) = 0 and ∞, otherwise.
Introduce edge uv: Let d(u), d(v) ≥ 1 and, by symmetry, let d(u) ≥ d(v) ≥ 1. We define a value z (representing the assumption that uv is in S ) as follows. Let d′ := d[u →d(u) − 1, v → d(v) − 1], that is, we let d′ be the result of decreasing both d(u) and d(v) by one.
Case 2: d(u) = d(v) = 1. Then, both u and v are not incident to any edges in ${G}_{j}^{S}$ and, in ${G}_{j}^{{s}_{j}}$, there is just the edge uv incident to both. Thus, we set z only if uv ∈ P. Formally, let z := [d′, P − uv, p, c]_{ j } if uv ∈ P.
Case 3: d(u) = 2, d(v) = 1. Then, there is a path q containing uv and ending in v in ${G}_{i}^{S}$. If q ends in a vertex x in d^{−1}(1) − v, we have vx ∈ P and the permutation for X(j) contains ux. Otherwise, we have vv ∈ P and the permutation for X(j) contains uu. Note that, since v ∈ d^{−1}(1), we know that P(v) ≠ ⊥. Formally, for vx ∈ P, let z := [d′, (P − vv) + uu, p, c]_{ j }, if v = x and z := [d′, (P − vx) + ux, p, c]_{ j }, otherwise. Finally, let [d, P, p, c]_{ i } := z if uv ∈ M and [d, P, p, c]_{ i } := max{z + ω(uv), [d, P, p, c]_{ j }}, otherwise.
Lemma 2 The described algorithm is correct, that is, the computed value [d, P, p, c]_{ i } corresponds to the semantics.
Theorem 1 SCAFFOLDING can be solved in O(tw^{tw} ·σ_{p} · σ_{c} · n) time, given a width-tw tree decomposition of the input instance.
Kernel for restricted scaffolding
Towards developing effective preprocessing for SCAFFOLDING, we consider a more restricted problem variant, where all paths and cycles of the solution have to be of certain, respective lengths. $\mathcal{N}\mathcal{P}$-hardness of this variant can be inferred with the same reduction as used for SCAFFOLDING[15].
RESTRICTED SCAFFOLDING (RSCA)
Input: G with perfect matching M, ω : E → ℕ with ω(M) = 0, σ_{p} ∈ ℕ, σ_{c} ∈ ℕ, ℓ_{p} ∈ ℕ^{∗}, ℓ_{c} ∈ ℕ^{∗}, k ∈ ℕ
Question: ∃_{ X⊆E\M } s.t. G − X is a collection of σ_{p} alternating paths, each of length ℓ_{p},b and σ_{c} alternating cycles, each of length ℓ_{c}, and ω(E) − ω(X) ≥ k?
The length l_{ p } denotes the number of edges in the paths. It is necessarily odd in a solution of RESTRICTED SCAFFOLDING. In the following, we show that RESTRICTED SCAFFOLDING admits a linear-size problem kernel with respect to the parameter FES (feedback edge set), which is the size of a smallest set of edges whose deletion leaves an acyclic graph. To this end, we present a number of intuitive polynomial-time executable reduction rules that shrink the input graph. The first two rules shrink treelike structures of the instance while the remaining rules contract long chains.
For a u-v-path p, we call u and v its outer vertices while all other vertices of p are inner vertices. Let V^{ ○ } be the set of vertices that are in some cycle in G and let G^{ ○ } = G[V^{ ○ }]. Let G^{∗} = G[V^{∗}] denote the convex hull of G^{ ○ }, that is, V^{∗} is the set of vertices on some shortest path between some vertices u, v ∈ V^{ ○ }. For each v ∈ V^{∗}, the tree rooted at v that is incident with v in G − E^{∗} is called the "pendant tree" T_{ v } of v.
Reducing pendant trees
First, we remove isolated paths by cutting them into pieces of length ℓ_{p}. Since the correctness of this is trivial, we omit the proof.
The next rule cuts branching edges in pendant trees. To this end, it finds an occurrence of a path of length ℓ_{p} that is furthest from the root of the pendant tree.
Tree Rule 2 (see Figure 4(b)) Let T_{ v } be the pendant tree of some v ∈ V^{∗}, let u be a leaf of maximal distance to v in T_{ v }. Let W be the set of vertices × of T_{ v } such that a length-ℓ_{p}alternating u-x-path exists and let w be a vertex in W maximizing dist_{ Tv } (v, w). Then, delete from G all edges that are incident with the least common ancestor of u and w but are not on the unique u-w-path in T_{ v }.
Lemma 3 Let I = (G, M, ω, σ_{p}, σ_{c}, ℓ_{p}, ℓ_{c}, k) be a yes-instance of RESTRICTED SCAFFOLDING such that G is reduced with respect to Tree Rule 2 and let v ∈ V^{∗}. Then, T_{ v } is a path p that is alternating with respect to M ∩ E(T_{ v }) and |p| < ℓ_{p} and v is an endpoint of p.
In the following, we assume that G is reduced with respect to Tree Rule 2 and, thus, we can reject all instances for which Lemma 3 does not hold. Hence, in the following, we assume that Lemma 3 holds for the input instance. The next reduction rule helps unify the way in which pendant trees (which are now paths) attach to G^{∗}, simplifying the rest of the presentation.
Tree Rule 3 (see Figure 4(c)) Let T_{ v } be the pendant tree of some v ∈ V^{∗} that is incident with an edge e ∈ E(T_{ v }) \ M. Then, delete from G all edges incident with v that are not in M + e.
Reducing long paths
In the following, we assume that G is reduced with respect to the tree reduction rules presented in the previous section. For i ∈ ℕ, let V_{ i } denote the set of vertices of G that have degree i in G. The goal in this subsection is to reduce the length of chains of degree-two vertices in G. Thus, we consider paths whose inner vertices are all in V_{2}. We call these paths deg-2 path. If the solution has a cycle running through this path, then we cannot modify its length. Therefore, we focus on paths that are guaranteed to not be in a cycle in the solution:
The path reduction rules are based on the idea that the path segments that a deg-2 path p in G is split into by the solution are recurring, that is, if the solution contains the third edge of p, then it also contains the 3 + (ℓ_{p} + 1)^{th} edge of p. This gets slightly more complicated if there are pendants in p, but first, let us consider paths without any pendants.
The remaining two rules deal with deg-2 paths containing pendants. Note that any path in the solution that contains the pendant can only "continue" in two directions.
Path Rule 2 (see Figure 5(b)) Let p = (u_{0}, u_{1}, . . ., u_{ℓp +1} = w) and q = (w = v_{0}, v_{1}, . . ., v_{ℓp +1}) be deg-2 paths and let T_{ w } contain γ > 0 edges. For all i ≤ ℓ_{p}, let e_{ i } := u_{ i }u_{ i+1 }, let f_{ i } := v_{ i }v_{ i+1 } and let i^{+} := (i − γ) mod (ℓ_{p} + 1). Then, for all i ≤ ℓ_{p}, add ω(e_{ i }) to ω( f_{ i+ }) and contract e_{ i }. Finally, decrease σ_{p} by 1.
In case of two pendants, the solution is restricted by the distance between the pendants. If we can infer what the solution does by this length, we implement this right away, otherwise, we can represent the choices that a solution can take with a single pendant.
(1) If γ + γ_{ u }≠ f ℓ_{p} + 1 = γ + γ_{ v }, then delete ux_{1} (see Figure 6(a)).
(2) If γ + γ_{ u } = ℓ_{p} + 1 ≠ f γ + γ_{ v }, then delete vx_{|p|−1}.
(3) If γ + γ_{ u } = ℓ_{p} + 1 = γ + γ_{ v }, then return G′ (see Figure 6(b)).
(4) If γ = 1 and γ_{ u } + γ_{ v } + 1 = ℓ_{p}, then return G′ (see Figure 6(b)).
(5) If γ = 1 and γ_{ u } + γ_{ v } + 1 ≠ ℓ_{p}, then delete ux_{1} (see Figure 6(a)).
(6) If γ ≠ 1 and γ + γ_{ u } + γ_{ v } ≡ ℓ_{p} mod (ℓ_{p} + 1), then delete all edges e ∈ E(G) \ (M ∪ {ux_{1}}).
(7) In all other cases, return "NO".
Finally, we can show that an input graph that is reduced with respect to these rules cannot be larger than 11ℓ_{p} · FES(G) or 11ℓ_{c} · FES(G). To prove this, let G^{†} = (V^{†}, E^{†}) be the result of contracting all degree-2 vertices in G^{∗}.
Lemma 4 Let G be reduced with respect to all presented reduction rules. Then, |V| ≤ ℓ · (|V^{†}| + 3|E^{†}|) with ℓ := max{ℓ_{c}, ℓ_{p}}.
Proof By Lemma 3, we know that ∀_{ v∈V }†|E(T_{ v })| < ℓ_{p}. Therefore, if Lemma 4 is false, there is an edge uv ∈ E^{†} such that there are > 3ℓ vertices between u and v (i.e. uv is a contraction of more than 3ℓ edges of G). Nevertheless, by irreducibility with respect to Path Rule 3, there is at most one vertex w between u and v such that T_{ w }is not empty and the distance between u and v cannot be greater than 2ℓ + 1 (by irreducibility with respect to Path Rule 1 and 2). So, |E(T_{ w })| ≥ ℓ, contradicting Lemma 3.
Theorem 2 RESTRICTED SCAFFOLDING admits a kernel containing at most 11ℓ · FES(G) vertices and (11ℓ + 1) · FES(G) edges where ℓ := max{ℓ_{p}, ℓ_{c}}.
Results
Data
Details on the second dataset.
Genome | Reads library | Assembly tool | Mapping tool |
---|---|---|---|
staphylo | short jump library (from GAGE [32]) | velvet [33] | bwa [30] |
ecoli | Illumina reads library SRR001665 | velvet | bowtie [34] |
ypco92 | simulated with wgsim | minia [35] | bwa |
wolbachia | simulated with toyseq for the Variathon experiment [36] | minia | bwa |
arabido | Illumina reads library SRR616966 | velvet | bowtie |
Scaffold graphs parameters.
Data | |V| | |E| | Min/Max/Avg degree | FES | FVS (ub) | tw(ub) | h | dcy |
---|---|---|---|---|---|---|---|---|
anopheles^{Ch} | 84090 | 113497 | 1 / 51 / 2.70 | 29851 | 17962 | 12 | 3 | |
anthrax^{Ch} | 8110 | 11013 | 1 / 7 / 2.72 | 2906 | 1232 | 574 | 7 | 2 |
ebola^{Co} | 34 | 43 | 1 / 5 / 2.53 | 10 | 6 | 3 | 4 | 2 |
gloeobacter^{Ch} | 9034 | 12402 | 1 / 12 / 2.75 | 3375 | 2484 | 639 | 8 | 3 |
lactobacillus^{Ch} | 3796 | 5233 | 1 / 12 / 2.76 | 1439 | 804 | 260 | 8 | 2 |
monarch^{Mt} | 28 | 33 | 1 / 4 / 2.36 | 6 | 4 | 3 | 4 | 2 |
pandora^{Co} | 4902 | 6722 | 1 / 7 / 2.74 | 1822 | 1277 | 327 | 7 | 2 |
pseudomonas^{Ch} | 10496 | 14334 | 1 / 9 / 2.73 | 3851 | 2692 | 752 | 8 | 2 |
rice^{Cp} | 168 | 223 | 1 / 6 / 2.65 | 56 | 31 | 9 | 5 | 2 |
sacchr3^{Ch} | 592 | 823 | 1 / 7 / 2.78 | 232 | 142 | 43 | 6 | 2 |
sacchr12^{Ch} | 1778 | 2411 | 1 / 10 / 2.12 | 637 | 575 | 124 | 7 | 2 |
Scaffold graph parameters for select genomes and different cut-off thresholds: 0, 3, 6, and 10.
Data & threshold (|V|) | |E| | Min/Max/Avg degree | FES | FVS(ub) | tw(ub) | h | dcy | |
---|---|---|---|---|---|---|---|---|
arabido (345232) | 0 3 6 10 | 318984 252762 230333 215094 | 1 / 31 / 1.85 1 / 14 / 1.46 1 / 9 / 1.33 1 / 9 / 1.25 | 43593 8024 3247 1224 | 15703 5881 2814 1020 | 10610 792 74 51 | 18 9 8 8 | 4 3 2 2 |
ecoli (1732) | 0 3 6 10 | 8142 4043 3105 2695 | 2 / 46 / 9.40 2 / 23 / 4.67 2 / 18 / 3.59 2 / 16 / 3.11 | 6411 2312 1374 964 | 644 406 327 278 | 551 303 162 102 | 32 16 13 11 | 9 4 4 3 |
staphylo (602) | 0 3 6 10 | 4765 1743 1017 790 | 1 / 128 / 15.83 1 / 58 / 5.79 1 / 22 / 3.37 1 / 16 / 2.62 | 4164 1152 464 279 | 167 113 78 65 | 124 57 25 14 | 52 25 14 11 | 28 11 5 4 |
wolbachia (560) | 0 3 6 10 | 1036 523 459 399 | 1 / 56 / 3.70 1 / 15 / 1.86 1 / 15 / 1.63 1 / 5 / 1.43 | 481 47 19 12 | 106 30 16 11 | 26 3 2 2 | 11 6 5 5 | 3 2 2 2 |
ypco92 (2656) | 0 3 6 10 | 3465 2849 2651 2525 | 1 / 8 / 2.61 1 / 6 / 2.15 1 / 5 / 1.99 1 / 5 / 1.90 | 821 241 99 73 | 313 136 86 68 | 191 38 3 2 | 7 6 5 5 | 2 2 2 2 |
Graph parameters
Table 2 presents some parameters of the generated graphs. The h-index is the maximum number such that the graph contains h vertices of degree at least h. It measures the degree of connectivity of a graph. The feedback edge set (FES) is the size of a smallest set of edges whose deletion leaves an acyclic graph. The degeneracy is the smallest value d for which every subgraph has a vertex of degree at most d. It is a kind of measure of sparsity of the graph. We notice that scaffold graphs look quite sparse, with few vertices of high degrees and a feedback edge set number that is usually significantly lower than the number of vertices. While degree-based graph parameters like the degeneracy d are tiny in all instances, we recall that our problem generalizes Hamiltonian Cycle, which is already N P-hard on 3-regular graphs. However, Table 2 shows that, for instances that we expect to be seen in practice, these measures can be assumed constant and, thus, it might be worth considering a combination involving these parameters.
Table 3 shows the same statistics for the second set of graphs where edges of small weight (that is, low confidence) were discarded (for weight thresholds of 10, 6, 3 and 0, yielding the original graph with all edges present). We notice that even with a light filtering, the parameters of the scaffold graph are considerably lowered, confirming that raw data suffers from significant noise. Thus, filtering low quality information not only increases the quality of the scaffolding, but also may lead to a significant leap to tractability. Note that, at the time of writing this article, the Staphylococcus Aureus genome is no longer available in the NCBI Nucleotide database ("This RefSeq genome was suppressed because updated RefSeq validation criteria identified problems with the assembly or annotation.") and the Escherichia Coli genome has been updated since the presented version. Errors in assemblies are quite frequent ([37]) and yield additional issues in scaffolding. Having a better idea of the structure of a classical scaffold graph, and some simple criteria to determine what is anomalous and what is normal (subgraphs induced by repeats for instance) would be of real interest for the analysis of genomes.
Implementation and first results
We implemented the dynamic programming algorithm presented in Section 11 in C++ using boost and the treewidth optimization library TOL [38]. We ran it on a selection of the generated data sets (see Section 11) for which the greedy fill-in heuristic produced tree decompositions of width at most 45. We chose (σ_{p}, σ_{c}) = (3, 1) for the ebola and monarch genomes and (σ_{p}, σ_{c}) = (20, 3) for the more complicated inputs. The tests were run on an AMD Opteron(tm) Processor 6376 at 2300 MHz.
Discussion and conclusion
In this paper, we considered exact approaches to the $\mathcal{N}\mathcal{P}$-hard SCAFFOLDING problem which is an integral part of genome sequencing. We showed that it can be solved in polynomial time on trees or graphs that are close to being trees (constant treewidth) by a dynamic programming algorithm. We proved a linear-size problem kernel for the parameter "feedback edge set" for a restricted version in which the lengths of paths and cycles are fixed. We implemented an exact algorithm solving SCAFFOLDING in f(tw) · poly(n) time and evaluated our implementation experimentally, supporting the claim that this method produces high-quality scaffolds. Our experiments are run on data sets that are based on specific real-world genomes, which we also examined to identify a number of interesting parameters that may help design parameterized algorithms and random scaffold graph generators that produce more realistic instances. We are currently transferring the preprocessing rules to the general problem variant. We are highly interested in further graph classes that are closer to real-world instances than trees and on which the problem might be polynomial-time solvable. From an algorithmic point of view, we remark that only few bags of the used tree decompositions are small (see Figure 7). Thus, we envision a hybrid strategy of branching on the vertices in the largest bags before running the dynamic-programming algorithm and using this "distance x to treewidth-x" parameter to re-analyze the problem.
We intend to perform more extensive tests on diverse datasets, in particular comparing the quality of this approach to existing ones using, for instance, the criteria presented in [9]. This work demands reliable benchmarks, with up-to-date and well assembled genomes. From a bioinformatics point of view, we lay the groundwork for a careful analysis of the scaffold graph, as well as a tool to speed up the above algorithms, as a help to analyze the quality of the assembly, and maybe the structure of the genome itself.
Appendix
Proof of Lemma 1 The proof is by induction over the distance of v to r (descending) and, in case of ties, j (ascending).
Induction Base: The statement holds for all vertices v if j = 0 since ${T}_{0}^{v}=T\left[\left\{v\right\}\right]$ does not contain edges.
Induction Step (≤): Next, we show that [c, i, j]_{ v } ≤ ω(S ) by proving that a i-path cover S′ for ${T}_{j}^{v}$ of weight [c, i, j]_{ v } exists and, thus, [c, i, j]_{ v } = ω(S′) ≤ ω(S ) by optimality of S. To this end, let u := s_{ v }[j] and let w := M(v).
Case 4: w ≠ u and c = 1.
Proof of Section 11
Proof of Lemma 2 The proof is by induction on the distance of i to r (descending). In the induction base, i is a leaf of T and X(i) = ∅ and G_{ i } is empty. Thus, the domains of d and P are empty. Thus, [∅, ∅, 0, 0]_{ i } = 0 and all other entries are −∞.
For the induction step, we distinguish the possible bag types of X(i) with children X( j) and X(ℓ) (possibly j = ℓ):
Introduce vertex v: Since G_{ i } does not contain edges incident to v, only tuples with d(v) = 0 and P(v) = ⊥ are valid.
Forget vertex v: Let S_{ i } be a maximum weight set that is eligible for (d, P, p, c, i). We show that [d, P, p, c]_{ i } = ω(S_{ i }).
"≤":
Case 1: [d, P, p, c]_{ i } = [d[v → 1], P + vv, p − 1, c] _{ j }. By induction hypothesis, there is a set S_{ j } corresponding[1] to [d[v → 1], P + vv, p − 1, c] _{ j }. We show that S_{ j } is eligible for (d, P, p, c, i) and, thus, ω(S _{ i }) ≥ ω(S _{ j }) = [d, P, p, c]_{ i }. Since X(i) ⊂ X( j) and all paths between vertices in d^{−1}(1) that are represented by P + vv are also represented by P, the first two conditions are satisfied by S_{ j } for i. Since ${\text{deg}}_{{G}_{j}^{{S}_{j}}}\left(v\right)=1$, there is a path q in ${G}_{j}^{{s}_{j}}$ containing v. But since v ∉ d^{−1}(1), we know that ${G}_{i}^{{S}_{i}}$ contains one path more that does not intersect d^{−1}(1). Thus, ${G}_{i}^{{S}_{i}}$ contains p paths that do not intersect d^{−1}(1) and S_{ j } satisfies the third condition.
Case 2: [d, P, p, c]_{ i } = [d[v → 1], (P − uu) + uv, p, c] _{ j } for some uu ∈ P. By induction hypothesis, there is a set S_{ j } corresponding to [d[v → 1], (P−uu)+uv, p, c] _{ j }. Since uv ∈ (P − uu) + uv, there is a u-v-path q in ${G}_{j}^{{s}_{j}}$ (by Definition 2(2)) and q intersects d^{−1}(1) in u. Thus, ${G}_{i}^{{S}_{i}}$ contains p paths that do not intersect d^{−1}(1) and, thus, S_{ j } is eligible for (d, P, p, c, i).
Case 3: [d, P, p, c]_{ i } = [d[v → x], P, p, c] _{ j } for some x ∈ {0, 2}. By induction hypothesis, there is a set S_{ j } corresponding to [d[v → x], P, p, c] _{ j }. Then, S_{ j } is also eligible for (d, P, p, c, i).
"≥": Let $x:={\mathsf{\text{deg}}}_{{G}_{i}^{{s}_{i}}}\left(v\right)$.
Case 1: x ∈ {0, 2}. Then, S_{ i } is eligible for (d[v → x], P, p, c, j) and, by induction hypothesis, ω(S _{ i }) ≤ [d[v → x], P, p, c] _{ j } ≤ [d, P, p, c]_{ i }.
Case 2: x = 1. Then, by Definition 2(1), there is a path q in ${G}_{i}^{{S}_{i}}$ ending in v. If q has another end u in d^{−1}(1), then S_{ i } is eligible for (d[v → 1], (P − uu) + uv, p − 1, c, j) and, thus, ω(S _{ i }) ≤ [d[v → 1], (P − uu) + uv, p − 1, c]. Otherwise, S_{ i } is eligible for (d[v → 1], P + vv, p − 1, c, j). In both cases, ω(S _{ i }) ≤ [d, P, p, c]_{ i }.
Introduce edge uv: Let S _{ i }correspond to [d, P, p, c]_{ i }. We show that [d, P, p, c]_{ i }= ω(S _{ i }). Let d′ and z be as described in the dynamic programming and note that ${G}_{j}^{=}{G}_{i}^{-}uv.$ "≤": First, if [d, P, p, c]_{ i } = [d, P, p, c] _{ j }, then, by induction hypothesis, there is a set S _{ j }corresponding to [d, P, p, c] _{ j } and uv ∉ M. Then, ${G}_{j}^{{s}_{j}}={G}_{i}^{{s}_{j}}$ implying that S_{ j } is eligible for (d, P, p, c, i) and, thus, ω(S _{ i }) ≥ ω(S _{ j }) = [d, P, p, c]_{ i }.
In the following, we proceed in a similar manner for the case that [d, P, p, c]_{ i } = z + ω(uv): To show [d, P, p, c]_{ i } ≤ ω(S _{ i }), we consider a set S_{ j } that corresponds to the entry for X( j) from which [d, P, p, c]_{ i } is computed and whose existence is granted by induction hypothesis. Then, we show that S _{ j }+ uv is eligible for (d, P, p, c, i), implying ω(S _{ i }) ≥ ω(S_{ j } + uv) = z + ω(uv) if uv ∉ M and ω(S _{ i }) ≥ ω(S _{ j }) = z if uv ∈ M. Note that, in each of the cases, the degrees of u and v in ${G}_{j}^{{s}_{j}}$ are one less than their degrees in ${G}_{i}^{{s}_{{j}^{+uv}}}$. Thus, S_{ j } satisfies Definition 2(1) for d′.
[1]We say a set S corresponds to an entry [d, P, p, c]_{ i } if S is eligible for (d, P, p, c, i) and its weight ω(S ) = [d, P, p, c]_{ i }is maximum among all sets that are.
Case 1: d(u) = d(v) = 2. Then, both u and v have degree 1 in ${G}_{j}^{{s}_{j}}$.
Case 1a: z = [d′, P + uv, p, c − 1]_{ j }. Then, there is a u-v-path in ${G}_{j}^{{s}_{j}}$ (see Figure 3(a)). Since d^{−1}(1) = d′^{−1}(1) \ {u, v}, adding uv does not touch any paths intersecting d^{−1}(1). Thus, Definition 2(2) is satisfied. Further, adding uv closes a cycle in ${G}_{i}^{{s}_{{j}^{+uv}}}$ and, thus, ${G}_{i}^{{s}_{{j}^{+uv}}}$ contains one more cycle that does not intersect d^{−1}(1) than ${G}_{j}^{{s}_{j}}$. Thus, also Definition 2(3) is satisfied.
Case 1b: z = [d′, P ∪ {uu, vv}, p − 1, c] _{ j }. Then, both u and v have dangling paths in ${G}_{j}^{{s}_{j}}$ (see Figure 3(b)). By the same arguments as in Case 1a, Definition 2(2) is satisfied. Further, adding uv connects two paths such that the resulting path does not intersect d^{−1}(1). Thus, there are one more such paths in ${G}_{i}^{{s}_{{j}^{+uv}}}$ than in ${G}_{j}^{{s}_{j}}$, implying that Definition 2(3) is satisfied.
Case 1c: z = [d′, (P − xx) ∪ {ux, vv}, p, c] _{ j } for some x ∈ d^{−1}(1) = d′^{−1}(1) \ {u, v} with xx ∈ P. Then, ${G}_{j}^{{s}_{j}}$ contains a path dangling from v and a u-x-path (see Figure 3(c)). Thus, adding uv connects two paths such that the resulting path intersects d^{−1}(1) exclusively in x. Hence, there is a path dangling from x in ${G}_{i}^{{s}_{{j}^{+uv}}}$ and, hence, S_{ j } + uv satisfies Definition 2(2). Since no paths or cycles avoiding d′^{−1}(1) are affected by adding uv, Definition 2(3) is satisfied. The case that z = [d′, (P − xx) ∪ {vx, uu}, p, c] _{ j } is analogous.
Case 1d: z = [d′, (P − xy) ∪ {ux, vy}, p, c] _{ j } for some x, y ∈ d^{−1}(1) with xy ∈ P. Then, ${G}_{j}^{{s}_{j}}$ contains paths q_{ u } and q_{ v } that start in u and v, respectively, and end in x and y, respectively (see Figure 3(d)). Thus, adding uv connects q_{ u } and q_{ v } to a single path q that starts in x and ends in y, thus intersecting d^{−1}(1). Hence both Definition 2(2) and (3) are satisfied.
Case 2: d(u) = d(v) = 1. Thus, u and v are not incident to any edges in ${G}_{j}^{{s}_{j}}$. Thus, uv forms a new path connecting u and v in ${G}_{i}^{{s}_{{j}^{+uv}}}$. Since, in this case, z = [d′, P−uv, p, c]_{ j }and uv ∈ P, we conclude that both Definition 2(2) and (3) are satisfied.
Case 3: d(u) = 2, d(v) = 1. Then, v has no incident edges in ${G}_{j}^{{s}_{j}}$ and, thus, it is only adjacent to u in ${G}_{i}^{{s}_{{j}^{+uv}}}$. Further, u has degree 1 in ${G}_{j}^{{s}_{j}}$, so it is endpoint to a path q in ${G}_{j}^{{s}_{j}}$. Thus, vx ∈ P for some x ∈ d^{−1}(1) and ${G}_{i}^{{s}_{{j}^{+uv}}}$ contains the path q′ := q + uv.
Case 3a: x = v. Then, z = [d′, (P − vv) + uu, p, c] _{ j }. Since ${G}_{i}^{{s}_{{j}^{+uv}}}$ contains q′ which is a path dangling from u, Definition 2(2) is satisfied. Since no other paths are touched, Definition 2(3) is satisfied.
Case 3b: x ≠ v. Then, z = [d′, (P − vx) + ux, p, c]_{ j }. Since ${G}_{i}^{{s}_{{j}^{+uv}}}$ contains q′ which is a u-x-path, Definition 2(2) is satisfied. Since no other paths are touched, Definition 2(3) is satisfied.
"≥": If uv ∉ S_{ i } ∩ M, then ${G}_{i}^{{s}_{i}}={G}_{j}^{{s}_{j}}$ and, thus, S_{ i } is eligible for (d, P, p, c, j). By induction hypothesis, [d, P, p, c] _{ j } ≥ ω(S _{ i }) and, thus, [d, P, p, c]_{ i } ≥ [d, P, p, c] _{ j }} ≥ ω(S _{ i }). Otherwise, uv ∈ S_{ i } ∪ M
In the following, we show that S_{ i } − uv is eligible for a tuple corresponding to one of the entries over which we maximize to compute z. Thus, ω(S _{ i }) ≤ ω(S_{ i } − uv) + ω(uv) ≤ z + ω(uv) if uv tt M and ω(S _{ i }) ≤ z if uv ∈ M. Note that, in each case, S_{ i } − uv satisfies Definition 2(1) since the degrees of u and v decrease by one when removing uv.
Case 1: d(u) = d(v) = 2. Then, uv is part of a path q in ${G}_{i}^{{S}_{i}}$ and neither u nor v is an endpoint of q.
Case 1a: q is closed (that is, a cycle) (see Figure 3(a)). Then, q does not intersect d^{−1}(1), implying that ${G}_{i}^{{S}_{i}}$ contains one more such cycle than ${G}_{j}^{{s}_{i-uv}}$. Further, q − uv is a u-v-path in ${G}_{j}^{{s}_{i-uv}}$ intersecting d′^{−1}(1) only in u and v. Thus, S_{ i } − uv is eligible for (d′, P + uv, p, c − 1, j).
Case 1b: q is open and does not intersect d^{−1}(1) (see Figure 3(b)). Then, ${G}_{i}^{{S}_{i}}$ contains one more of such paths than ${G}_{j}^{{s}_{i-uv}}$. Further, q − uv decomposes into paths q_{ u } and q_{ v }intersecting d′^{−1}(1) only in u and v, respectively. Thus, S_{ i } − uv is eligible for (d′, P + {uu, vv}, p − 1, c, j).
Case 1c: q is open and intersects d^{−1}(1) in a single vertex x (see Figure 3(c)). Then, q − uv decomposes into paths q_{ u } and q_{ v } in ${G}_{j}^{{s}_{i-uv}}$, one of which intersects d′^{−1}(1) in x. Since no path avoiding d^{−1}(1) is touched, S_{ i } − uv is eligible for either (d′, (P − xx) ∪ {xu, vv}, p, c, j) or (d′, (P − xx) ∪ {uu, vx}, p, c, j).
Case 1d: q is open and intersects d^{−1}(1) in 2 distinct vertices x and y (see Figure 3(d)). Then, q − uv decomposes into a u-x-path q_{ u } and a v-y-path q_{ v } in ${G}_{j}^{{s}_{i-uv}}$. Thus, S_{ i } - uv is eligible for (d′, (P − xy) ∪ {ux, vy}, p, c, j).
Case 2: d(u) = d(v) = 1. Then, q = {uv} is a path in ${G}_{i}^{{S}_{i}}$ and, thus, u and v are isolated in ${G}_{j}^{{s}_{i-uv}}$. Thus, uv ∈ P and S_{ i } − uv is eligible for (d′, P − uv, p, c, j).
Case 3: d(u) = 2 and d(v) = 1. Thus, ${G}_{i}^{{S}_{i}}$ contains a path q = (v, u, . . .). If q intersects d^{−1}(1) − v in a vertex x, then q − uv intersects d′^{−1}(1) − u and, thus, S_{ i } − uv is eligible for (d′, (P − vx) + ux, p, c, j). Otherwise, q is a dangling from v in ${G}_{i}^{{S}_{i}}$ and q − uv is a dangling from u in ${G}_{j}^{{s}_{i-uv}}$, implying that S_{ i } − uv is eligible for (d′, (P − vv) + uu, p, c, j).
We show that S_{ i } := S _{ j }∪ S _{t} is eligible for (d, P, p, c, i) and, thus, ω(S ) ≥ ω(S _{ i }) = [d_{1}, P_{1}, p_{1}, c_{1}] _{ j } + [d_{2}, P_{2}, p_{2}, c_{2}]_{ℓ} = [d, P, p, c]_{ i }. First, by (2), we have that Definition 2(1) is satisfied. Second, to show that Definition 2(2) is satisfied, consider some uv ∈ P. Then, there is a path q = (u, x_{1}, x_{2}, . . ., v) in Gr(P_{1} ∪ P_{2}). Note that ${x}_{1},{x}_{2},\dots \in {d}_{1}^{-1}\left(1\right)\cup {d}_{2}^{-1}\left(1\right)$. Thus, ${G}_{i}^{{S}_{j}\cup {S}_{\ell}}={G}_{i}^{{s}_{i}}$ contains a u-x_{1}-path, an x_{1}-x_{2}-path, . . . . The concatenation of these paths forms a u-v path in ${G}_{i}^{{S}_{i}}$. Third, note that, for each uu ∈ (P_{1})^{1}, there is a path ${q}_{1}^{u}$ dangling from u in ${G}_{j}^{{s}_{j}}$ such that the only vertex of ${d}_{1}^{-1}\left(1\right)\cup {d}_{1}^{-1}\left(0\right)\supseteq {d}^{-1}\left(1\right)$ in ${q}_{1}^{u}$ is u. Analogously, a similar path ${q}_{2}^{u}$ exists for each uu ∈ (P_{2})^{1} in ${G}_{\ell}^{{s}_{\ell}}$. Thus, for each uu ∈ (P_{1} ∩ P_{2})^{1}, there is a path ${q}^{u}:={q}_{1}^{u}\cup {q}_{2}^{u}$ in ${G}_{i}^{{S}_{j}\cup {S}_{\ell}}$ containing u and avoiding d^{−1}(1) and q^{ u } is neither in ${G}_{j}^{{s}_{j}}$ nor in ${G}_{{}_{\ell}}^{{S}_{\ell}}$. Since ${G}_{j}^{{s}_{j}}$ contains p_{1} paths avoiding ${d}_{1}^{-1}\left(1\right)\cup {d}_{1}^{-1}\left(0\right)\supseteq {d}^{-1}\left(1\right)$ and ${G}_{{}_{\ell}}^{{S}_{\ell}}$ contains p_{2} paths avoiding ${d}_{2}^{-1}\left(1\right)\cup {d}_{2}^{-1}\left(0\right)\supseteq {d}^{-1}\left(1\right)$, we have that ${G}_{i}^{{S}_{j}\cup {S}_{\ell}}$ contains p_{1} + p_{2} + |(P_{1} ∩ P_{2})^{1}| such paths. Similarly, ${G}_{i}^{{S}_{j}\cup {S}_{\ell}}$ concontains c_{1} + c_{2} + |(P_{1} ∩ P_{2})^{2}| cycles avoiding d^{−1}(1).
"≥": Let S_{ j } := S_{ i } ∩E(G _{ j }) and let S_{ ℓ } := S_{ i } ∩E(G_{ ℓ }). We show that S_{ j } and S _{ ℓ }are eligible for tuples (d_{1}, P_{1}, p_{1}, c_{1}, j) and (d_{2}, P_{2}, p_{2}, c_{2}, ℓ), respectively, such that (2)-(5) hold. First, for all u ∈ X(i) = X( j) = X(ℓ) let d_{1} (d_{2}) be the number of edges of G _{ j } (G_{ ℓ }) incident with u. Since $E\left({G}_{i}^{{s}_{i}}\right)=E\left({G}_{j}^{{s}_{j}}\right)\uplus E\left({G}_{\ell}^{{S}_{\ell}}\right)$, we conclude that (2) holds and Definition 2(1) is satisfied. Second, let P_{1} be the set of pairs uv with u, $v\in {{d}_{1}}^{-\mathsf{\text{1}}}\left(1\right)$ such that, if u ≠ v, then there is a u-v path in ${G}_{j}^{{s}_{j}}$ and, if u = v, then ${G}_{j}^{{s}_{j}}$ contains a path dangling from u. Let P_{2} be defined analogously for d_{2} and ${G}_{\ell}^{{s}_{\ell}}$. Then, Definition 2(2) is satisfied. Further, for all uv ∈ P, since d(u) = 1 ⇐⇒ d_{1}(u) = 1 ⊕ d_{2}(u) = 1, we have P_{1}(u) = ⊥ ⊕ P_{2}(u) = ⊥ and there is a u-v-path q in ${G}_{i}^{{S}_{i}}$ that decomposes into paths of ${G}_{j}^{{s}_{j}}$ that connect vertices of ${{d}_{1}}^{-\mathsf{\text{1}}}\left(1\right)$ and paths of ${G}_{\ell}^{{s}_{\ell}}$ that connect vertices of ${{d}_{2}}^{-\mathsf{\text{1}}}\left(1\right)$. Thus, Gr(P_{1} ∪ P_{2}) contains a u-v-path and we conclude that (3) holds. Third, let p_{1} and c_{1} be the number of paths and cycles, respectively, in ${G}_{j}^{{s}_{j}}$ that avoid ${{d}_{1}}^{-\mathsf{\text{1}}}\left(1\right)$. Likewise for p_{2} and c_{2} in ${G}_{\ell}^{{s}_{\ell}}$ and d_{2}. Then Definition 2(3) is satisfied. Further, let p ′ denote the number of pairs uu ∈ P_{1} ∩ P_{2}, that is, p ′ := |(P_{1} ∩ P_{2})^{1}|. Then, since ${G}_{i}^{{S}_{j}\cup {S}_{\ell}}$ contains, for each such pair uu, a different path consisting of the concatenation of the two paths dangling from u in ${G}_{j}^{{s}_{j}}$ and ${G}_{\ell}^{{s}_{\ell}}$, respectively, ${G}_{i}^{{s}_{j}\cup {s}_{\ell}}$ contains exactly p_{1} + p_{2} + p′ paths avoiding d^{−1}(1). Thus, (4) holds and, in complete analogy, (5) holds. □
Proof of Theorem 1, sketch The bottleneck in the computation are the join nodes so we focus on computing their dynamic programming table. To calculate the number of entries that have to be considered in order to compute [d, P, p, c]_{ i }, assume that P_{1} and P_{2} are fixed. Then so is P and ${{d}_{1}}^{-\mathsf{\text{1}}}\left(1\right)$ and ${{d}_{2}}^{-\mathsf{\text{1}}}\left(1\right)$. Now, consider a vertex $u\in X\left(j\right)\backslash {d}_{1}^{-1}\left(1\right)$. If u is incident to an edge in ${G}_{j}^{\oslash}$, that is, an edge in M that has been introduced in the subtree rooted at j, then d_{1}(u) > 0 and, thus, d_{1}(u) = 2. However, if u is not incident with any matching edges in ${G}_{j}^{\oslash}$, then d_{1}(u) < 2, since otherwise, u would be incident to two non-matching edges. Thus, $u\in {d}_{1}^{-1}\left(2\right)$ if and only if u is incident to a matching edge in ${G}_{j}^{\oslash}$ and, consequently, fixing P_{1} and P_{2} also fixes d_{1} and, by extension, d_{2}. Finally, we can choose p_{1} and c_{1} in order to compute p_{2} and c_{2}. Since we need to consider only permutations that are also involutions, there are less than tw^{tw} ways to choose P_{1} and P_{2}. Thus, the maximum is over at most tw^{tw} ·σ_{p} · σ_{c} elements. Since there are O(n) bags in the tree decomposition, the algorithm can be executed in the claimed running time. □
Lemma 5 Tree Rule 2 is correct, that is, the instance I = (G, M, σ_{p}, σ_{c}, ℓ_{p}, ℓ_{c}) is yes if and only if the result I′ = (G′, M′, σ_{p} − 1, σ_{c}, ℓ_{p}, ℓ_{c}) of applying Tree Rule 2 to I is yes.
Proof Clearly, since all vertices of the graph G must be covered, then the only way to pack the vertex u is to include it into a path of length l_{ p } included the vertex v′ = LCA(u, w) and then decrease the number of paths by one. □
Proof of Lemma 3 We show that T_{ v } does not contain branching vertices (vertices with at least two children) since it is straightforward that, if T_{ v } is a path, it has to be alternating for I to be a yes-instance and, if its length exceeds ℓ_{p}, then Tree Rule 2 applies. Towards a contradiction, assume that T_{ v } has branching vertices and let z denote such a vertex in T_{ v } such that, among all branching vertices, z is furthest from v. Let u be a leaf of T_{ z } that has maximum distance to z and let d denote this distance. Since z is the only branching vertex in T_{ z } and I is a yes-instance, the unique u-z path in T_{ z } is alternating. Thus, by irreducibility with respect to Tree Rule 2, we know that d < ℓ_{p}. However, since z is branching, there is another leaf w at distance d′ ≤ d < ℓ_{p} to z in T_{ z }. Thus, if the unique u-w-path in T_{ z } is not alternating or its length is not ℓ_{p}, then I is not a yes-instance. But otherwise, Tree Rule 2 is applicable to u and w, contradicting irreducibility. □
Lemma 6 Tree Rule 3 is correct, that is, the instance I = (G, ω, M, σ_{p}, σ_{c}, ℓ_{p}, ℓ_{c}) is a yes-instance if and only if the result of applying Tree Rule 3 to I is.
Proof Let v and e be as defined in Tree Rule 3 and let e = {u, v}. To show correctness of Tree Rule 3, we prove that all optimal solutions for I contain e. To this end, let S be an optimal solution with e ∉ S . By Lemma 3, T_{ v } is an alternating path p ending in v with |p| < ℓ_{p}. By definition, M(u) is on p, so |p| ≥ 2. But then, G − e contains an isolated path of length strictly less than ℓ_{p}, implying that S is not a solution for I. □
Lemma 7 Path Rule 1 is correct, that is, the instance I = (G, ω, M, σ_{p}, σ_{c}, ℓ_{p}, ℓ_{c}) is a yes-instance if and only if the result I′ = (G′, ω, M′, σ_{p} − 1, σ_{c}, ℓ_{p}, ℓ_{c}) of applying Path Rule 1 to I is.
Proof Let $p=\left({u}_{0},{u}_{1},\dots ,{u}_{2{l}_{p}+2}\right)$ be as in Path Rule 1 and let ${p}^{\prime}:=\left({u}_{0},{u}_{1},\dots ,{u}_{{\ell}_{\mathsf{\text{p}}}+1}\right)$. Note that ${e}_{{\ell}_{\mathsf{\text{p}}}+1},{e}_{{\ell}_{\mathsf{\text{p}}}+2},\dots $ exist in G′.
"⇒": Let S be a solution for I, let ${p}^{\prime}:=\left({u}_{0},\phantom{\rule{2.36043pt}{0ex}}{u}_{1},\phantom{\rule{2.36043pt}{0ex}}\dots \phantom{\rule{2.36043pt}{0ex}},\phantom{\rule{2.36043pt}{0ex}}{u}_{{\ell}_{\mathsf{\text{p}}}+1}\right)$ and note that no cycle of G[S ∪ M] contains p since |p| > ℓ_{c}. We show that S ′ := S \ p ′ is a solution of for I^{′} with ω^{′}(S ^{′}) = ω(S ). First, note that all vertices of G^{′} are covered by S ^{′}. Second, note that p \ p^{′} is alternating since |p| = ℓ_{p} + 1. Finally, G^{′}[S ^{′} ∪ M^{′}] contains one path less than G[S ∪ M].
"⇐": Let S ^{′} be a solution for I^{′} and let u denote the vertex onto which ${u}_{0},{u}_{1},\dots \phantom{\rule{0.3em}{0ex}},{u}_{{\ell}_{\mathsf{\text{p}}}+1}$ have been contracted in I^{′} and let $e:=u{u}_{{\ell}_{\mathsf{\text{p}}}+2}$. We construct a solution S for I with ω(S ) = ω^{′}(S ^{′}). First, since |p| > ℓ_{c} + ℓ_{p} + 1, no cycle in G[S ∪ M] contains e. If e ∉ S ^{′} ∪ M^{′}, then S := S ^{′} ∪ (p^{′} − e_{0}). If e ∈ S ^{′} ∪ M^{′}, then e _{ j }∉ S ^{′} ∪ M^{′} for some ℓ_{p} <j ≤ 2ℓ_{p}. Then, $S:={S}^{\prime}\cup \left({p}^{\prime}-{e}_{j-\left({\ell}_{\mathsf{\text{p}}}+1\right)}\right)$ □
Lemma 8 Path Rule 2 is correct, that is, the instance I = (G, ω, M, σ_{p}, σ_{c}, ℓ_{p}, ℓ_{c}) is a yes-instance if and only if the result I^{′} = (G^{′}, ω^{′}, M^{′}, σ_{p} − 1, σ_{c}, ℓ_{p}, ℓ_{c}) of applying Path Rule 2 to I is.
Proof First, note that no solution for I or I^{′} covers w in a cycle since T_{ w } is necessarily covered by a path. Also note that q exists in I^{′}.
"⇒": Let S be a solution for I and note that there is some i ≤ ℓ_{p} such that e_{ i }∉ S ∪M (in fact, i ∈ {ℓ_{p}, γ}). Then, however, f_{ i−γ }∉ S and S ^{′} := S \ p is a solution for I^{′} and ω^{′}(S ^{′}) = ω(S ).
"⇐": Let S ^{′} be a solution for I^{′} and note that there is some i ≤ ℓ_{p} − γ such that f_{ i } ∉ S ^{′} ∪ M^{′} (in fact, i ∈ {0, ℓ_{p} − γ}). But then, S ^{′} ∪ (p − e_{ i+γ }) is a solution for I and ω(S ) = ω^{′}(S ^{′}). □
Lemma 9 Path Rule 3 is correct, that is, the instance I = (G, ω, M, σ_{p}, σ_{c}, ℓ_{p}, ℓ_{c}) is a yes-instance if and only if the result I^{′} = (G^{′}, ω, M, σ_{p} − 1, σ_{c}, ℓ_{p}, ℓ_{c}) of applying Path Rule 3 to I is.
Proof "⇒": Let S be solution for I and let q_{ u } and q_{ v } denote the paths containing u and v, respectively, in G[S ∪ M].
Case 1: Neither q_{ u } nor q_{ v } contain edges of p. Then, γ = 1 and ux_{1} ∉ S ∪ M. If γ_{ u } +γ_{ v } +1 = ℓ_{p}, then Case (4) applies and otherwise, Case (5) applies. In both cases, S is also a solution for I^{′}.
Case 2: q_{ v } contains edges of p, but q_{ u } does not. Then, γ_{ v }+ γ = ℓ_{p} + 1 and ux_{1} ∉ S ∪ M. If γ_{ u } + γ = ℓ_{p} + 1, then Case (3) applies and otherwise, Case (1) applies. In both cases, S is also a solution for I^{′}.
Case 3: q_{ u } contains edges of p, but q_{ v } does not. Then, γ_{ u } + γ = ℓ_{p} + 1 and vx_{|p|−1} ∉ S ∪ M and ux_{1} ∈ S \ M. If γ_{ u } + γ = ℓ_{p} + 1, then Case (3) applies and switching ux_{1} for wx_{1} in S gives a solution of same weight for I^{′}. Otherwise, Case (2) applies and S is also a solution
for I^{′}.
Case 4: Both q_{ u } and q_{ v } contain edges of p. Then, γ_{ u } + γ_{ v } + γ ≡ ℓ_{p} mod (ℓ_{p} + 1) and ux_{1} ∈ S \ M. If γ ≠ 1, then Case (6) applies and S is also a solution for I^{′}. Otherwise, Case (3) applies and switching ux_{1} for wx_{1} in S gives a solution of same weight for I^{′}.
"⇐": Let S ^{′} be a solution for I^{′}. Note that, if G^{′} does not contain wx_{1}, then S is clearly also a solution for I. Thus, assume that G^{′} contains wx_{1} and, thus, either Case (3) or (4) applies to I. We show that the result S of switching wx_{1} for ux_{1} in S ^{′} is a solution of the same weight for I. To show this, it suffices to show that, if a path q contains wx_{1}, then q ends at u. Assume this is false, that is, q contains wx_{1} and some edge e ∈ E(G^{′}) \ M incident with u. Since T_{ w } is not empty, no cycle in G^{′}[S ^{′} ∪ M] contains u. Then, the length of the u-v-path containing p − ux_{1} in G^{′} is equivalent to γ + γ_{ u } modulo ℓ_{p} + 1.
Case 1: vx_{|p|−1 }∈ S ∪ M. Then, since q does not end in u, we have ${\gamma}_{u}+\gamma +{\gamma}_{v}\not\equiv {\ell}_{\mathsf{\text{p}}}\phantom{\rule{0.3em}{0ex}}\mathsf{\text{mod}}\phantom{\rule{2.36043pt}{0ex}}\left({\ell}_{\mathsf{\text{p}}}+1\right)$. Thus, Case (4) does not apply to I, implying that Case (3) applies to I. But then, γ + γ_{ v } = ℓ_{p} + 1 and, hence, wx_{1} ∉ S ∪ M.
Case 2: vx_{|p|−1} ∉ S ∪ M. Then, since q does not end in u, we have ${\gamma}_{u}+\gamma \not\equiv 0\phantom{\rule{0.3em}{0ex}}\mathsf{\text{mod}}\phantom{\rule{2.36043pt}{0ex}}\left({\ell}_{\mathsf{\text{p}}}+1\right)$, implying γ_{ u } + γ ≠ ℓ_{p} + 1 since 0 < γ ≤ ℓ_{p} and 0 < γ_{ u } < ℓ_{p}. Thus, Case (3) does not apply to I, implying that Case (4) applies to I. But then, γ = 1 and, since vx_{|p|−1} ∉ S ∪ M, we have wx_{1} ∉ S ∪ M. □
Observation 1 Let G be connected. Then, |V^{†}| ≤ 2 FES(G) since |E^{†}| ≤ |V^{†}| + FES(G^{†}) ≤ |V^{†}| + FES(G) and 2|E^{†}| ≥ 3|V^{†}|.
Proof of Theorem 2 By Lemma 4, we have $\left|V\right|\phantom{\rule{0.3em}{0ex}}\le \ell \left(\left|{V}^{\u2020}\right|+3\left|{E}^{\u2020}\right|\right)\stackrel{\text{Observation}1}{\le}\ell (2\phantom{\rule{2.36043pt}{0ex}}\mathsf{\text{FES}}\left(G\right)+3\left(\mathsf{\text{FES}}\left(G\right)+2\phantom{\rule{2.36043pt}{0ex}}\mathsf{\text{FES}}\left(G\right)\right)=11\ell \phantom{\rule{2.36043pt}{0ex}}\mathsf{\text{FES}}\left(G\right)$ and, thus, we obtain $\left|E\right|\le \mathsf{\text{11}}\ell \phantom{\rule{0.3em}{0ex}}\mathsf{\text{FES(}}G\mathsf{\text{)}}+\mathsf{\text{FES(}}G\mathsf{\text{)}}=\left(\mathsf{\text{11}}\ell +\mathsf{\text{1}}\right)\cdot \mathsf{\text{FES(}}G\mathsf{\text{)}}.$ □
Declarations
Acknowledgements
Publication of this work is funded by the Institut de Biologie Computationnelle.
This article has been published as part of BMC Bioinformatics Volume 16 Supplement 14, 2015: Proceedings of the 13th Annual Research in Computational Molecular Biology (RECOMB) Satellite Workshop on Comparative Genomics: Bioinformatics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/16/S14.
Authors’ Affiliations
References
- Reddy TBK, Thomas AD, Stamatis D, Bertsch J, Isbandi M, Jansson J, Mallajosyula J, Pagani I, Lobos EA, Kyrpides NC: The Genomes OnLine Database (GOLD) v.5: a metadata management system based on a four level (meta)genome project classification. Nucleic Acids Research. 2014, 43 (D1): 1099-1106. [https://gold.jgi-psf.org/distribution]View ArticleGoogle Scholar
- Donmez N, Brudno ML: SCARPA: scaffolding reads with practical algorithms. Bioinformatics. 2013, 29 (4): 428-434.View ArticlePubMedGoogle Scholar
- Gritsenko AA, Nijkamp JF, Reinders MJT, de Ridder D: GRASS: a generic algorithm for scaffolding next-generation sequencing assemblies. Bioinformatics. 2012, 28 (11): 1429-1437.View ArticlePubMedGoogle Scholar
- Dayarian A, Michael TP, Sengupta AM: SOPRA: Scaffolding algorithm for paired reads via statistical optimization. BMC Bioinformatics. 2010, 11: 345-View ArticlePubMedPubMed CentralGoogle Scholar
- Gao S, Sung W-K, Nagarajan N: Opera: Reconstructing Optimal Genomic Scaffolds with High-Throughput Paired-End Sequences. Journal of Computational Biology. 2011, 18 (11): 1681-1691.View ArticlePubMedPubMed CentralGoogle Scholar
- Salmela L, Mäkinen V, Välimäki N, Ylinen J, Ukkonen E: Fast scaffolding with small independent mixed integer programs. Bioinformatics. 2011, 27 (23): 3259-3265.View ArticlePubMedPubMed CentralGoogle Scholar
- Boetzer M, Henkel CV, Jansen HJ, Butler D, Pirovano W: Scaffolding pre-assembled contigs using SSPACE. Bioinformatics. 2011, 27 (4): 578-579.View ArticlePubMedGoogle Scholar
- Koren S, Treangen TJ, Pop M: Bambus 2: Scaffolding metagenomes. Bioinformatics. 2011, 27 (21): 2964-2971.View ArticlePubMedPubMed CentralGoogle Scholar
- Hunt M, Newbold C, Berriman M, Otto T: A comprehensive evaluation of assembly scaffolding tools. Genome Biology. 2014, 15 (3):Google Scholar
- Huson DH, Reinert K, Myers EW: The greedy path-merging algorithm for contig scaffolding. Journal of the ACM. 2002, 49 (5): 603-615.View ArticleGoogle Scholar
- Solinhac R, Leroux S, Galkina S, Chazara O, Feve K, Vignoles F, Morisson M, Derjusheva S, Bed'hom B, Vignal A, Fillon V, Pitel F: Integrative mapping analysis of chicken microchromosome 16 organization. BMC Genomics. 2010, 11 (1): 616-View ArticlePubMedPubMed CentralGoogle Scholar
- Sharpton TJ: An introduction to the analysis of shotgun metagenomic data. Frontiers in Plant Science. 2014, 5 (209):Google Scholar
- Shmoys DB, Lenstra JK, Kan AHGR, Lawler EL: The Traveling Salesman Problem: a Guided Tour of Combinatorial Optimization. Edited by: John Wiley & Sons. 1985Google Scholar
- Tutte WT: A short proof of the factor theorem for finite graphs. Canadian Journal of Mathematics. 1954, 6: 347-352.View ArticleGoogle Scholar
- Chateau A, Giroudeau R: Complexity and Polynomial-Time Approximation Algorithms around the Scaffolding Problem. AlCoB Lecture Notes in Computer Science. Edited by: Dediu, A.H., Martín-Vide, C., Truthe, B. 2014, Springer, 8542: 47-58.Google Scholar
- Chateau A, Giroudeau R: A complexity and approximation framework for the maximization scaffolding problem. Theoretical Computer Science. 2015, 595: 92-106. doi:10.1016/j.tcs.2015.06.023View ArticleGoogle Scholar
- Weller M, Chateau A, Giroudeau R: On the implementation of polynomial-time approximation algorithms for scaffold problems. submitted 2015Google Scholar
- Nieuwerburgh FV, Thompson RC, Ledesma J, Deforce D, Gaasterland T, Ordoukhanian P, Head SR: Illumina mate-paired DNA sequencing-library preparation using Cre-Lox recombination. Nucleic Acids Research. 2012, 40 (3): 24-24.View ArticleGoogle Scholar
- Boetzer M, Henkel CV, Jansen HJ, Butler D, Pirovano W: Scaffolding pre-assembled contigs using SSPACE. Bioinformatics. 2011, 27 (4): 578-579.View ArticlePubMedGoogle Scholar
- Reed B, Smith K, Vetta A: Finding odd cycle transversals. Operations Research Letters. 2004, 32 (4): 299-301.View ArticleGoogle Scholar
- Sahlin K, Vezzi F, Nystedt B, Lundeberg J, Arvestad L: BESST - efficient scaffolding of large fragmented assemblies. BMC Bioinformatics. 2014, 15 (1): 281-View ArticlePubMedPubMed CentralGoogle Scholar
- Luhmann N, Chauve C, Stoye J, Wittler R: Scaffolding of ancient contigs and ancestral reconstruction in a phylogenetic framework. Proceedings of Brazilian Symposium on Bioinformatics Lecture Notes in Computer Science. 2014, 8826: 135-143.View ArticleGoogle Scholar
- Rajaraman A, Tannier E, Chauve C: FPSAC: Fast Phylogenetic Scaffolding of Ancient Contigs. Bioinformatics. 2013, 29 (23): 2987-2994.View ArticlePubMedGoogle Scholar
- Aganezov S, Sitdykovaa N, Alekseyev MA: AGCConsortium: Scaffold assembly based on genome rearrangement analysis. Computational Biology and Chemistry. 2015, 57: 46-53.View ArticlePubMedGoogle Scholar
- Impagliazzo R, Paturi R, Zane F: Which Problems Have Strongly Exponential Complexity?. Journal of Computer and System Sciences. 2001, 63 (4): 512-530.View ArticleGoogle Scholar
- Bodlaender HL, Cygan M, Kratsch S, Nederlof J, Impagliazzo R, Paturi R, Zane F: Deterministic single exponential time algorithms for connectivity problems parameterized by treewidth. Information and Computation. 2015, 243: 86-111.View ArticleGoogle Scholar
- Downey RG, Fellows MR: Fundamentals of Parameterized Complexity. Texts in Computer Science. 2013, SpringerGoogle Scholar
- Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R: The sequence alignment/map format and SAMtools. Bioinformatics. 2009, 25 (16): 2078-2079.View ArticlePubMedPubMed CentralGoogle Scholar
- Chikhi R, Rizk G: Space-efficient and exact de bruijn graph representation based on a bloom filter. Algorithms for Molecular Biology. 2013, 8: 22-View ArticlePubMedPubMed CentralGoogle Scholar
- Li H, Durbin R: Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics. 2010, 26 (5): 589-595.View ArticlePubMedPubMed CentralGoogle Scholar
- Briot N, Chateau A, Coletta R, Givry SD, Leleux P, Schiex T: An Integer Linear Programming Approach for Genome Scaffolding. Workshop Constraints in Bioinformatics. 2014Google Scholar
- Salzberg SL, Phillippy AM, Zimin A, Puiu D, Magoc T, Koren S, Treangen TJ, Schatz MC, Delcher AL, Roberts M, Marc¸ais G, Pop M, Yorke JA: GAGE: A critical evaluation of genome assemblies and assembly algorithms. Genome Research. 2012, 22 (3): 557-567. [http://gage.cbcb.umd.edu]View ArticlePubMedPubMed CentralGoogle Scholar
- Zerbino DR, Birney E: Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome research. 2008, 18 (5): 821-829.View ArticlePubMedPubMed CentralGoogle Scholar
- Langmead B, Trapnell C, Pop M, Salzberg SL: Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology. 2009, 10 (3): 25-View ArticleGoogle Scholar
- Chikhi R, Rizk G: Space-efficient and exact de Bruijn graph representation based on a Bloom filter. Algorithms for Molecular Biology. 2013, 8 (22):Google Scholar
- 2013, Variathon, [http://bioinf.dimi.uniud.it/variathon]
- Denton JF, Lugo-Martinez J, Tucker AE, Schrider DR, Warren WC, Hahn MW: Extensive error in the number of genes inferred from draft genome assemblies. PLoS Computational Biology. 2014, 10 (2): 1003998-View ArticleGoogle Scholar
- Koster A: TOL - Treewidth Optimization Library. 2012Google Scholar
- Morgulis A, Coulouris G, Raytselis Y, Madden TL, Agarwala R, Scha¨ ffer AA, Koster A: Database indexing for production MegaBLAST searches. Bioinformatics (Oxford, England). 2008, 24 (16): 1757-1764.View ArticleGoogle Scholar
- Kolodner R, Tewari KK: Inverted repeats in chloroplast DNA from higher plants*. Proceedings of the National Academy of Sciences of the United States of America. 1979, 76 (1): 41-45.View ArticlePubMedPubMed CentralGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.