Skip to main content

Exact approaches for scaffolding

Abstract

This paper presents new structural and algorithmic results around the scaffolding problem, which occurs prominently in next generation sequencing. The problem can be formalized as an optimization problem on a special graph, the "scaffold graph". We prove that the problem is polynomial if this graph is a tree by providing a dynamic programming algorithm for this case. This algorithm serves as a basis to deduce an exact algorithm for general graphs using a tree decomposition of the input. We explore other structural parameters, proving a linear-size problem kernel with respect to the size of a feedback-edge set on a restricted version of Scaffolding. Finally, we examine some parameters of scaffold graphs, which are based on real-world genomes, revealing that the feedback edge set is significantly smaller than the input size.

Introduction

During the last decade, a huge amount of new genomes have been sequenced, leading to an abundance of available DNA resources. Nevertheless, most of recent genome projects stay unfinished, in the sense that databases contain much more incompletely assembled genomes than whole stable reference genomes [1]. One reason for this phenomenon is that, for most of the analyses performed on DNA, an incomplete assembly is sufficient. Sometimes it is even possible to perform them directly from the sequenced data. Another reason is that producing a complete genome, or an as-complete-as-possible-genome is a difficult task. Traditionally, producing a complete genome consists of three steps, each of them computationally or financially difficult: the assembly, the scaffolding, and the finishing. The step of scaffolding, on which we focus here, consists of orienting and ordering at the same time the contigs produced by assembly. Many methods have been proposed and the recent activity on the subject shows that it is an active field (see, not exhaustively, [28] and Section ). A good survey of recent methods can be found in [9] for instance. Since the problem has been proved N P -complete from its first formulation [10], nearly all of these methods propose heuristic approaches. One of them claims that an exact method provides better results [5], however, the authors prepend a heuristic graph simplification before running their exact algorithm.

The approach presented here relies on a combinatorial problem on a dedicated graph, called scaffold graph, representing the link between already assembled contigs. The main idea is to represent each contig by two vertices linked by an edge (these "contig-edges" form a perfect matching on the scaffold graph). Other edges are constructed and weighted using complementary information on the contigs. The weight of a non-contig edge uv, with uu′, vv′ being contig-edges, corresponds to a confidence measure that the uu′ contig is succeeded by the vv′ contig (oriented as u′uvv′). The scaffold graph is a flexible tool to study the scaffolding issues. Indeed, the graph is a syntactical data-structure which may represent several semantics. For instance, the scaffold graphs used for our previous experiments have been built using Next-Generation Sequencing data, namely paired-end reads. However, we also could provide other type of information to compute the weight on the edges, like for instance ancestral support in a phylogenetic context, or comparison to other extent genomes which could be used as multiple references. The way to define the weight on the edges does not change the main goal of our method, which is to determine the optimal ordering and orientation of the contigs, given a specific criterion. It is also possible to mix two or more criteria in order to take several sources of information into account.

We also introduced two structural parameters σp and σc representing the respective numbers of linear and circular chromosomes sought in the genome. Those parameters seems quite artificial at a first sight, but they are well-motivated as follows. First, in a number of species, genomes are hard to assemble with classical methods because of an heterogeneous chromosomal structure or difficulties to perform classical assembly. This is the case, for instance, with the micro-chromosomes in the chicken genome [11]. Also, when scaffolding metagenomic data, one has to deal with very complex genomic structures [12]. We hope that relaxing the classical model with one chromosome, mixed with the flexibility of the scaffold graph, could help handling such complex situations. The second motivation to introduce those parameters came from the desire to explore an intermediary case between the very classical and studied N P -complete [13]TRAVELING SALESMAN PROBLEM (TSP) and a totally free structure, in which case optimizing a covering by paths and cycles is polynomial-time solvable [14]. The SCAFFOLDING problem then consists of finding (at most) σp paths and σc cycles that cover all contig-edges while maximizing the total weight. Previously, we proved that SCAFFOLDING is N P -complete, even under restricted conditions [15], developed polynomial-time approximation algorithms [15, 16] and evaluated them experimentally [17]. In this paper, we continue to explore this problem, as well as the structural properties of the scaffold graph. This exploration aims to develop an efficient method, that could be used both to refine the ratio analysis on previously designed heuristics and to handle cases of small genomes with better quality.

We present both theoretical and practical results, as well as our experimental findings. The paper is organized as follows: we relate a general state-of-the-art and methodological comparison with existing methods and our previous work in Section. In Section, we formally introduce the problem and some notations and definitions. In Section, we show that the general SCAFFOLDING problem is polynomial-time solvable on trees, and present a dynamic programming algorithm. We extended this algorithm in Section 11 to a parameterized algorithm with respect to the treewidth of the input, that is, this algorithm runs in polynomial time, on scaffold graphs that are sparse, i.e. treelike. In Section 11, we lay the foundation to developing effective preprocessing routines for SCAFFOLDING, by presenting a set of reduction rules whose application shrinks input instances of a derived problem, RESTRICTED SCAFFOLDING. We show that the size of the resulting instances can be bounded linearly in their feedback edge set number, thus proving a problem kernel for this parameter. Finally, in Section 11, we present some experimental analysis of scaffold graphs that are derived from real-world genomes.

Related work

Genome scaffolding is an intrinsically complex problem, and was studied through several computation models and tools. The first model, presented in [10], leads to its classification as an N P -complete problem. This work also proposes the first greedy approach. The scaffolding was then performed using genomic fragments coupled by mate-pairs [18], which differ from the paired-end reads essentially by their larger insert-size (The insert-size is the gap, in base pairs, between two fragments constituting a pair of reads), and their lower covering depth, so the size of the data and the organization of the graph may differ from actual Next-Generation Sequencing data using paired-end reads. A greedy-like approach is also used in SSPACE [19], iteratively combining the longest contigs. Conversely, Gao et al. present a dynamic-programming-based exact approach (OPERA) which provides higher-quality scaffolds than existing heuristic approaches [5]. However, their algorithm runs in O(nk ) time (where k is the "library width") which quickly grows out of reasonable proportions. Thus, to apply this method on real data, they prepend a graph contraction procedure to limit the size of the input. In SOPRA [4], Dayarian et al. introduce a removal procedure for problematic contigs, and separate the orientation step from the linking step. The orientation step uses a simulated annealing technique for regions of high complexity. In GRASS [3], a genetic algorithm is provided, using mixed integer programming (MIP) as in [6] and an expectation-maximization process to counter the intractability of the MIP model. They obtain results which are intermediary in quality between SSPACE and Opera. In SCARPA [2], the two problems of orienting and ordering the contigs are separated, as in SOPRA. Transforming the graph such that the contig orientation problem has a feasible solution is a fixed-parameter tractable problem with respect to the number of nodes to remove [20]. The authors first use the corresponding algorithm to pre-process the graph and exhibit an optimal relative orientation of the contigs. Then, pre-oriented contigs are ordered using a heuristic, and the removal of articulation vertices is used to limit the size of the connected components. Misassembled contigs are detected and removed. In BESST [21], the authors refine the previous approaches by considering the library insert size distribution to filter relevant linking information, leading to an almost linear scaffold graph, which is easy to treat. The scalability of these approaches is not always proven, and the general increase of the size and amount of available data brings real efficiency questions. Recently, some methods also use phylogenetic or comparative genomic approaches via rearrangement analysis to perform scaffolding on ancient or extant genomes when a set of closely related species genomes are available (see for instance [2224]). This encourages the mix of multiple sources of information for scaffolding a given genome.

The scaffolding problem have been extensively studied in the framework of complexity and approximation. In [15, 16], the authors proved that the problem is N P -complete even if the scaffold graph is a planar bipartite graph. They also proposed some lower bounds for exact exponential-time algorithms for SCAFFOLDING according to the EXPONENTIAL-TIME HYPOTHESIS[25]. Finally, they proved that the minimization version of SCAFFOLDING (seeking a minimum-weight cover of the scaffold graph) is unlikely to be approximable within any constant ratio, even if the scaffold graph is a completed bipartite graph [15]. On the positive side, two polynomial-time factor-3-approximation algorithms for the maximization version have been designed. The first is an O(n2 log n)-time greedy algorithm, the second is based on the computation of a maximum matching (O(n3)-time). The theoretical aspects of the scaffold graph have been completed by extensive experimental results [17] for the maximization version on simulated and real datasets.

The main contribution of the present paper lays in both the algorithmic exploration of exact methods for the scaffolding problem with structural parameters, and the initiation of a discussion on the scaffold graph properties. Indeed, the knowledge of those properties may lead to algorithmic improvement, as well as the detection of putative errors in assemblies.

Definitions

Scaffolding problem. The central combinatorial object we are working with in this work is the "scaffold graph" (see also [16, 17]).

Definition 1 (Scaffold graph) A scaffold graph is a pair (G, M) of a graph G = (V, E, w) with an even number of vertices, and a weight function w : E on its edges, and a perfect matching M on G.

Notice that this model is close to what previous work call scaffold graph, except that this graph is not a directed graph, and contigs are represented by edges instead of vertices. Figure 1 shows an example of a scaffold graph.

Figure 1
figure 1

A scaffold graph with 17 contigs (bold edges) and 26 (weighted) links between them, corresponding to the ebola data set (see Section 11).

Graph Theory. Slightly abusing notation, we sometimes consider paths as sets of edges. Furthermore, for a matching M and a vertex u, we define M(u) as the unique vertex v with uv M if such a v exists, and M(u) = , otherwise. A path p is alternating with respect to a matching M if, for all vertices u of p, also M(u) is a vertex of p. If M is clear from context, we do not mention it explicitly. For a graph G = (V, E) and a vertex set V V, let G[V] denote the subgraph of G induced by V and let GV := G[V \ V]. For S E, let GS := (V, E \ S ) and, for any edge or vertex x, we abbreviate G − {x} =: Gx. For a set of pairs S , we let Gr(S ) denote the graph having S as edgeset, that is, Gr(S ) := (U eS e, S ). For a function ω : E and a set S E, we abbreviate Σ eS ω(e) =: ω(S ). Let G = (V, E) be a graph with a matching M and let S be a matching in GM. The number of paths (resp. cycles) in G[S M] is denoted by | | S | | p G , M (resp. | | S | | c G , M ). If all paths in G[S M] are alternating with respect to M, then, we call S a ||S|| p -||S|| c -cover (or simply cover) for G with respect to M (we will omit M if its clear from context). If ||S|| c = 0, we may also refer to S as a ||S|| p -path cover (or simply path cover). The general scaffold problem is expressed as follows (see also [16, 17]): SCAFFOLDING (SCA)

Input: A scaffold graph (G = (V, E, w), M), σp , σc , k

Question: Is there a σp-σc-cover S for G with respect to M with ω(S ) ≥ k?

Tree Decompositions. A tree decomposition of a graph G = (V, E) is a pair (T = (VT , ET ), X : VT → 2V ) such that (1) for all uv E, there is some i VT with uv X(i) and (2) for all v V, the subtree T v := T[{X(i) | v X(i)}] is connected. We call the images of X "bags" and the size of the largest bag minus one is the width of the decomposition. A decomposition of minimum width for G is called the optimal for G and its width is called the treewidth of G. It is a folklore theorem that each graph G has an optimal tree decomposition (T, X) that is nice, that is, each bag X(i) is one of the following types:

Leaf bag: i is a leaf of T and X(i) = ,

Introduce vertex v bag: i is internal with child j and X(i) = X( j) {v} with v X(j),

Forget v bag: i is internal with child j and X(i) = X(j) − v with v X(i),

Introduce edge uv bag: i is internal with child j and uv E and uv X(i) = X(j),

Join bag: i is internal with children j and ℓ and X(i) = X(j) = X(ℓ).

Additionally, each edge e E is introduced exactly once. Given a width-tw tree decomposition, we can obtain a nice tree decomposition of width tw in n · poly(tw) time [26].

Parameterized Algorithmics. Parameterized complexity theory challenges the traditional view of measuring the running times exclusively in the input size n. Instead, we aim at identifying a parameter of the input that we expect to be small (much smaller than n) in all instances that the application at hand may produce. We then focus on developing algorithms whose exponential part can be bounded in this parameter (see [27] for details). Parameterized complexity allows to prove performance guarantees for preprocessing algorithms: a polynomial-time algorithm that, given an instance x with parameter k, computes an instance x′ with parameter k′k such that x is a yes instance if and only if x′ is a yes instance, is called kernelization and the result x′ is the kernel. It is common to present a kernelization by showing various reduction rules that "cut away" the easy parts of the input and, when applied exhaustively to the input, shrink it enough to prove the size bounds.

Scaffolding on trees

Towards developing an algorithm for SCAFFOLDING that runs fast on graphs that are "close to trees", we consider a strategy to solve SCAFFOLDING in case the input graph is a tree. We use a bottom-up dynamic programming approach that computes for each vertex v starting with the leaves, the best possible solutions for the subtree rooted at v. For ease of presentation, we thus consider the input tree T to be rooted arbitrarily with r denoting the root vertex. Note that the solution cannot contain any cycles in this case.

At each vertex, the dynamic programming algorithm needs to decide how many paths should be covered in each of the subtrees of its children. Seeing that it is infeasible to try all combinations, we employ, again, a dynamic programming strategy solving this problem for a given vertex v: We order the children of each vertex v arbitrarily and let s v denote the sequence of children of v with s v [j] denoting the jth child of v and s v [1..j] := U i≤j s v [ j]. Then, we update the global dynamic programming table of v in order of s v . Thus, when we consider the jth child u of v, we only have to split the paths to be covered between the two subtrees T u and the union of all previously considered subtrees rooted at children of v (that is, T [Uℓ<jT sv [ℓ] {v}]).

In the following, for a vertex v, let C(v) denote its children and par(v) its parent (or if v = r), and let T v denote the subgraph of T that is rooted at v. For u, v V, we define T u +T v := T [V(T u )V(T v )] and T u +v := T [V(T u ){v}]. Let v V and 1 ≤ j ≤ |C(v)|. Then, we define T j v : = i j T s v [ i ] + v . Finally, we abbreviate s v := s v [1..0] := .

Algorithm 1: Algorithm to fill the dynamic programming table.

1 I ← leaves of T;

2 while v V \ I s.t. C(v) I do

3     foreach 1 ≤ j ≤ |C(v)| with u := s v [j] do

4         foreach iσp do

5             if uv M then

6                 [0, i, j] v ← maxα{0,1} maxℓ≤i−(1−α)[α, ℓ, ∞] u + [0, i − (ℓ − α + 1), j − 1] v ;

7                 [1, i, j] v ← maxα{0,1} maxℓ≤i[α, ℓ, ∞] u + [1, i − (ℓ − α), j − 1] v ;

8             else

9                 [0, i, j] v ← maxℓ≤imaxα{0,1}[α, ℓ, ∞] u + [0, i − ℓ, j − 1] v ;

10                [ 1 , i , j ] v max i ma x α { 0 , 1 } [ α , , ] u + [ 1 , i - , j - 1 ] v ; ω ( u v ) + [ 0 , , ] u + [ 0 , i - ( - 1 ) , j - 1 ] v ; if w s v [ 1 . . j ] [ 0 , i - , j - 1 ] v otherwise

11 if v = r then return max c{0,1} [c, σp, ∞] r else II {v};

Semantics: For (c, i, j, v) {0, 1} × [σp] × [|C(v)|] × V we let [c, i, j] v denote the maximum weight of an i-0-cover S for T j v such that v is incident with exactly c edges of S. We abbreviate [c, i, ∞] v := [c, i, |C(v)|] v .

We maintain a set I of initialized vertices and, as soon as r is initialized, the algorithm stops. Thus, we assume r I. Finally, the maximum weight of a σp-0-cover for T can be computed by max c{0,1} [c, σp, ∞] r .

Lemma 1 Algorithm 1 is correct, that is, for all (i, j, c, v) [σp] × [|C(v)|] × {0, 1} × V and for any maximum-weight i-0-cover S for T j v with respect to M such that v is incident with exactly c edges of S we have [c, i, j] v = ω(S).

Concerning the running time, we note that the body of the loop in line 3 is executed exactly once per vertex and, hence, lines 6,7,9, and 10 are executed σp times per vertex. As they compute the maximum over σp values, the whole algorithm runs in O ( n σ p 2 ) time.

Corollary 1 SCAFFOLDING on trees can be solved in O ( n σ p 2 ) time.

Scaffolding with respect to treewidth

In this section, we develop a parameterized algorithm with respect to the structural parameter "treewidth", solving SCAFFOLDING in O(twtw) poly(n, σp, σc) time. It is based on using a dynamic programming table to keep track of solutions that interact with the bags of a tree decomposition in a certain way (see Figure 2). Since we store for each type of interaction only the best solution and the number of interactions can be bounded in the treewidth, we arrive at the claimed bounds.

Figure 2
figure 2

Setup of the dynamic programming on tree decompositions. A vertex u X(i) can have degree d(u) = 0 (white circle), d(u) = 1 (black circle), or d(u) = 2 (gray circle) in G i S . Vertices with d(u) = 1 are always incident with paths in G i S (gray ellipse), forming a pairing (a permutation) P on them.

To present our algorithm, we use special permutations (involutions) to model matchings that allow reflexive pairing (that is, matching a vertex with itself). Thus, slightly abusing notation, we will consider permutations as sets of pairs. We denote the subset of reflexive pairs of a permutation P by P1 and the subset of non-reflexive pairs by P2. Then, Gr(P) :=Gr(P2). For permutations P and Q, we define PQ as the set of pairs uv such that P(x) = Q(x) = for all x uv and there is a u-v-path in Gr(P Q). Furthermore, for a function d : AB and (x, y) A × B, we define d[xy] as the result of setting d(x) := y (that is (d \ ({x} × B)) {(x, y)}). Here, d[x] means to remove x from the domain of d. Let T = (χ, ET ) be a tree decomposition of G with root X(r) χ. For a bag X(i), let G i denote the subgraph of G that contains exactly those edges of G that are introduced in a bag of the subtree of T that is rooted at X(i) and let G i S : = G i [ S M ] .

A table entry for the bag X(i) will be indexed by (i) a function d : X(i) → {0, 1, 2}, (ii) a permutation P with U uvP uv = d−1(1) and(iii) integers pσp and cσc. See Figure 2 for an illustration of d and P. An entry will have the following semantics:

Definition 2 Let i VT. We call a set S i E(G i ) \ M eligible with respect to a tuple (d, P, p, c, i) if

1 each vertex v X(i) has degree d(v) in G i S i ,

2 for each uv P, if u ≠ v, then there is a u-v-path in G i S i and, if u = v, then there is a path q of non-zero-length in G i S i that contains u and avoids d−1(1) − u (we say that q is dangling from u).

3 G i S i contains p paths and c cycles that do not intersect d−1(1),

Semantics: A table entry [d, P, p, c] i is the maximum weight of any set that is eligible with respect to (d, P, p, c, i).

Then, we can read the maximum weight of a solution S for G from [, , σp, σc] r .

Given a nice tree decomposition and a bag X(i) with children X(j) and X(ℓ) (possibly j = ℓ), we compute [d, P, p, c] i depending on the type of the bag X(i) (entries that are not mentioned explicitly are set to −∞):

Leaf bag: Set [, , 0, 0] i := 0.

Introduce vertex v: Newly introduced vertices do not have introduced edges yet. Thus, we force the degree of v in G i S to 0: Formally, let [d, P, p, c] i := [d[v], P, p, c] j if d(v) = 0 and ∞, otherwise.

Forget vertex v: A vertex v that we forget in bag X(i) can have degree 0,1, or 2 in G j S . If it has degree 1, then there is a path q dangling from it. If q ends in some other vertex u X(i) \ {v} = X( j), then, the permutation for X(i) contains uu and the permutation for X(j) contains uv. Otherwise, q is dangling from v in G j S so the permutation for X(j) contains vv. Formally, let

[ d , P , p , c ] i : = max max u u P [ d [ v 1] , ( P - u u ) + u v , p , c ] j , [ d [ v 1] , P + v v , p - 1 , c ] j , max x { 0 , 2 } [ d [ v x , P , p , c j .

Introduce edge uv: Let d(u), d(v) ≥ 1 and, by symmetry, let d(u) ≥ d(v) ≥ 1. We define a value z (representing the assumption that uv is in S ) as follows. Let d′ := d[ud(u) − 1, vd(v) − 1], that is, we let d′ be the result of decreasing both d(u) and d(v) by one.

Case 1 d(u) = d(v) = 2. Then, P avoids u and v. Since we assume uv S , this means that u and v have dangling paths q u and q v in G i S - u v = G j S that both intersect d′−1(1). If q u = q v , then adding uv to S closes a cycle in G i S and the permutation for X(j) contains uv (see Figure 3(a)). Otherwise, q u q v . Then, if q u intersects d′−1(1) \ {u, v} in a vertex x, then the permutation for X(j) contains ux (see Figure 3(c) and Figure 3(d)), otherwise, it contains uu (see Figure 3(b)). Likewise, if q v intersects d′−1(1) \ {u, v} in a vertex y, then the permutation for X(j) contains vy, otherwise, it contains vv. Note that, if both q u and q v intersect d′−1(1) \ {u, v} (see Figure 3(d)), then we have xy P. Note further that, if neither q u nor q v intersects d′−1(1) \ {u, v} (see Figure 3(b)), then q u q v {uv} is a path in G i S that does not intersect d−1(1) and, thus, we have to decrease the number p of such paths we are looking for in G j S . Formally, let

z : =  max [ d , P + u v , p , c - 1 ] j , [ d , P { u u , v v } , p - 1 , c ] j , max x x P [ d , ( P - x x ) { u x , v v } , p , c ] j , [ d , ( P - x x ) { u u , v x } , p , c ] j , max x y P [ d , ( P - x y ) { u x , v y } , p , c ] j .

Case 2: d(u) = d(v) = 1. Then, both u and v are not incident to any edges in G j S and, in G j s j , there is just the edge uv incident to both. Thus, we set z only if uv P. Formally, let z := [d′, Puv, p, c] j if uv P.

Case 3: d(u) = 2, d(v) = 1. Then, there is a path q containing uv and ending in v in G i S . If q ends in a vertex x in d−1(1) − v, we have vx P and the permutation for X(j) contains ux. Otherwise, we have vv P and the permutation for X(j) contains uu. Note that, since v d−1(1), we know that P(v) ≠ . Formally, for vx P, let z := [d′, (Pvv) + uu, p, c] j , if v = x and z := [d′, (Pvx) + ux, p, c] j , otherwise. Finally, let [d, P, p, c] i := z if uv M and [d, P, p, c] i := max{z + ω(uv), [d, P, p, c] j }, otherwise.

Join: The join bag X(i) "glues" the (disjoint) partial solutions of its children together at the vertices of X(i) = X( j) = X(ℓ). In particular, the degrees in G j S and in G S have to add up to d. Furthermore, the permutations P1 and P2 for X(j) and X(ℓ), respectively, have to "fit" P: For example, let uv P1 and vw P2, implying that there are paths q j and q in G j S and G S , respectively, that are connecting u and v and v and w, respectively. Then, in G i S , there is a single path q j q connecting u and w and containing v (with d(v) = 2). Finally, the numbers of paths and cycles have to "fit" p and c: For example, if the permutations for both X(j) and X(ℓ) contain uu (that is, u (P1P2)1), then G i S contains a path containing u that is neither in G j S nor in G S . On top of this, the remaining paths must be distributed among G j S and G S . Formally, let

[ d , P , p , c ] i : = max d 1 , d 2 : X ( i ) { 0 , 1 , 2 } v X ( i ) , d 1 ( v ) + d 2 ( v ) = d ( v ) max P 1 , P 2 P = P 1 P 2 max p 1 , p 2 , c 1 , c 2 p 1 + p 2 + | ( P 1 P 2 ) 1 | = p c 1 + c 2 + | ( P 1 P 2 ) 2 | = c [ d 1 , P 1 , p 1 , c 1 ] j + [ d 2 , P 2 , p 2 , c 2 ]

Lemma 2 The described algorithm is correct, that is, the computed value [d, P, p, c] i corresponds to the semantics.

Theorem 1 SCAFFOLDING can be solved in O(twtw ·σp · σc · n) time, given a width-tw tree decomposition of the input instance.

Kernel for restricted scaffolding

Towards developing effective preprocessing for SCAFFOLDING, we consider a more restricted problem variant, where all paths and cycles of the solution have to be of certain, respective lengths. N P -hardness of this variant can be inferred with the same reduction as used for SCAFFOLDING[15].

RESTRICTED SCAFFOLDING (RSCA)

Input: G with perfect matching M, ω : E with ω(M) = 0, σp , σc , ℓp , ℓc , k

Question: XE\M s.t. GX is a collection of σp alternating paths, each of length ℓp,b and σc alternating cycles, each of length ℓc, and ω(E) − ω(X) ≥ k?

The length l p denotes the number of edges in the paths. It is necessarily odd in a solution of RESTRICTED SCAFFOLDING. In the following, we show that RESTRICTED SCAFFOLDING admits a linear-size problem kernel with respect to the parameter FES (feedback edge set), which is the size of a smallest set of edges whose deletion leaves an acyclic graph. To this end, we present a number of intuitive polynomial-time executable reduction rules that shrink the input graph. The first two rules shrink treelike structures of the instance while the remaining rules contract long chains.

For a u-v-path p, we call u and v its outer vertices while all other vertices of p are inner vertices. Let V be the set of vertices that are in some cycle in G and let G = G[V]. Let G = G[V] denote the convex hull of G, that is, V is the set of vertices on some shortest path between some vertices u, v V. For each v V, the tree rooted at v that is incident with v in GE is called the "pendant tree" T v of v.

Reducing pendant trees

First, we remove isolated paths by cutting them into pieces of length ℓp. Since the correctness of this is trivial, we omit the proof.

Tree Rule 1 (see Figure 4(a)) Let p be an isolated path in G. If |p| ≥ ℓp, then split off a length-p path q and decrease (σp, k) by (1, ω(q)). Otherwise, return "NO".

Figure 3
figure 3

The four cases for introducing an edge uv with d ( u ) = d ( v ) = 2 in the dynamic programming for X ( i ): (a) there is a u - v -path in G j S , (b) u and v have dangling paths in G j S , (c) there is a u - x -path and a dangling path on v in G j S , or (d) there is a u - x -path and a v - y -path in G j S .

Figure 4
figure 4

Illustrations for the tree reduction rules.

The next rule cuts branching edges in pendant trees. To this end, it finds an occurrence of a path of length ℓp that is furthest from the root of the pendant tree.

Tree Rule 2 (see Figure 4(b)) Let T v be the pendant tree of some v V, let u be a leaf of maximal distance to v in T v . Let W be the set of vertices × of T v such that a length-palternating u-x-path exists and let w be a vertex in W maximizing dist Tv (v, w). Then, delete from G all edges that are incident with the least common ancestor of u and w but are not on the unique u-w-path in T v .

Lemma 3 Let I = (G, M, ω, σp, σc, ℓp, ℓc, k) be a yes-instance of RESTRICTED SCAFFOLDING such that G is reduced with respect to Tree Rule 2 and let v V. Then, T v is a path p that is alternating with respect to ME(T v ) and |p| < ℓp and v is an endpoint of p.

In the following, we assume that G is reduced with respect to Tree Rule 2 and, thus, we can reject all instances for which Lemma 3 does not hold. Hence, in the following, we assume that Lemma 3 holds for the input instance. The next reduction rule helps unify the way in which pendant trees (which are now paths) attach to G, simplifying the rest of the presentation.

Tree Rule 3 (see Figure 4(c)) Let T v be the pendant tree of some v V that is incident with an edge e E(T v ) \ M. Then, delete from G all edges incident with v that are not in M + e.

Reducing long paths

In the following, we assume that G is reduced with respect to the tree reduction rules presented in the previous section. For i , let V i denote the set of vertices of G that have degree i in G. The goal in this subsection is to reduce the length of chains of degree-two vertices in G. Thus, we consider paths whose inner vertices are all in V2. We call these paths deg-2 path. If the solution has a cycle running through this path, then we cannot modify its length. Therefore, we focus on paths that are guaranteed to not be in a cycle in the solution:

The path reduction rules are based on the idea that the path segments that a deg-2 path p in G is split into by the solution are recurring, that is, if the solution contains the third edge of p, then it also contains the 3 + (ℓp + 1)th edge of p. This gets slightly more complicated if there are pendants in p, but first, let us consider paths without any pendants.

Path Rule 1 (see Figure 5(a)) Let p = (u0, u1, . . .) be a deg-2 path with |p| > max{ℓp, ℓc} + ℓp + 1 and let e i := u i u i+1 for all i. Then, for all e i with i ≤ ℓp, add ω(e i ) to ω(eℓp +i+1) and contract e i . Finally, decrease σp by 1.

Figure 5
figure 5

Illustrations for the first path reduction rules.

The remaining two rules deal with deg-2 paths containing pendants. Note that any path in the solution that contains the pendant can only "continue" in two directions.

Path Rule 2 (see Figure 5(b)) Let p = (u0, u1, . . ., uℓp +1 = w) and q = (w = v0, v1, . . ., vℓp +1) be deg-2 paths and let T w contain γ > 0 edges. For all i ≤ ℓp, let e i := u i u i+1 , let f i := v i v i+1 and let i+ := (i − γ) mod (ℓp + 1). Then, for all i ≤ ℓp, add ω(e i ) to ω( f i+ ) and contract e i . Finally, decrease σp by 1.

In case of two pendants, the solution is restricted by the distance between the pendants. If we can infer what the solution does by this length, we implement this right away, otherwise, we can represent the choices that a solution can take with a single pendant.

Path Rule 3 (see Figure 6) Let p = (x0, x1, . . .) be a u-v-deg-2 path such that u, v V. Let γ u > 0 and γ v > 0 be the number of edges in T u and T v , respectively, and let w be the vertex at maximum distance to u in T u . Let γ := |p| mod (ℓp + 1) and let G′ be the result of replacing ux1 by wx1 of the same weight in G.

Figure 6
figure 6

Illustrations for Path Rule 3.

(1) If γ + γ u ≠ f ℓp + 1 = γ + γ v , then delete ux1 (see Figure 6(a)).

(2) If γ + γ u = ℓp + 1 ≠ f γ + γ v , then delete vx|p|−1.

(3) If γ + γ u = ℓp + 1 = γ + γ v , then return G′ (see Figure 6(b)).

(4) If γ = 1 and γ u + γ v + 1 = ℓp, then return G′ (see Figure 6(b)).

(5) If γ = 1 and γ u + γ v + 1 ≠ ℓp, then delete ux1 (see Figure 6(a)).

(6) If γ ≠ 1 and γ + γ u + γ v ≡ ℓp mod (ℓp + 1), then delete all edges e E(G) \ (M {ux1}).

(7) In all other cases, return "NO".

Finally, we can show that an input graph that is reduced with respect to these rules cannot be larger than 11ℓp · FES(G) or 11ℓc · FES(G). To prove this, let G = (V, E) be the result of contracting all degree-2 vertices in G.

Lemma 4 Let G be reduced with respect to all presented reduction rules. Then, |V| ≤ ℓ · (|V| + 3|E|) with ℓ := max{ℓc, ℓp}.

Proof By Lemma 3, we know that vV †|E(T v )| < ℓp. Therefore, if Lemma 4 is false, there is an edge uv E such that there are > 3ℓ vertices between u and v (i.e. uv is a contraction of more than 3ℓ edges of G). Nevertheless, by irreducibility with respect to Path Rule 3, there is at most one vertex w between u and v such that T w is not empty and the distance between u and v cannot be greater than 2ℓ + 1 (by irreducibility with respect to Path Rule 1 and 2). So, |E(T w )| ≥ ℓ, contradicting Lemma 3.

Theorem 2 RESTRICTED SCAFFOLDING admits a kernel containing at most 11ℓ · FES(G) vertices and (11ℓ + 1) · FES(G) edges where ℓ := max{ℓp, ℓc}.

Results

Data

In our experiments, we worked with two datasets, all derived from real genomes (see Table 1). The first one is a set of eleven contig sets, produced from real genomes using the following process: first, the genomes where taken off the NCBI nucleotide database (http://www.ncbi.nlm.nih.gov). Then, for each of them, a set of simulated paired-end reads was generated with the tool wgsim ([28]), with default parameters and a 20X mean covering depth. Thereafter, assembly was performed with the tool minia ([29]) with a k-mer size k = 29. Reads were mapped on the contigs with bwa ([30]), with default parameters and using the sampe method. The second dataset is composed of five scaffold graphs, already presented in [31] and [17]. Some of them have been produced by simulating reads, other come from real paired-end reads libraries (see Table 1). Finally, using the scaftools previously developed in [15, 31], we produced the scaffold graphs corresponding to these datasets. See [31] for a more detailed explanation of this pipeline.

Table 1 Details on the second dataset.

The aim of this process is first to produce a benchmark of test graphs that are more realistic than uniformly generated graphs to test our algorithms on, second to study the different parameters that may be interesting to consider for parameterized algorithms, and finally to help to design a more realistic and convenient scaffold graph generator allowing to generate graphs directly, avoiding the complicated pipeline described above (see Table 2 and Table 3). The second dataset was used to study the influence of a preprocessing operation on the scaffold graph, aiming at filtering low informative edges. Simplistically, we removed respectively edges with weight less than 3, 6 and 10. Results are presented in Table 3. We already know that this operation improves the quality of the produced scaffolds, when compared to the original reference genome. Here we would like to observe the behavior of our putative interesting structural parameters according to this filtering.

Table 2 Scaffold graphs parameters.
Table 3 Scaffold graph parameters for select genomes and different cut-off thresholds: 0, 3, 6, and 10.

Graph parameters

Table 2 presents some parameters of the generated graphs. The h-index is the maximum number such that the graph contains h vertices of degree at least h. It measures the degree of connectivity of a graph. The feedback edge set (FES) is the size of a smallest set of edges whose deletion leaves an acyclic graph. The degeneracy is the smallest value d for which every subgraph has a vertex of degree at most d. It is a kind of measure of sparsity of the graph. We notice that scaffold graphs look quite sparse, with few vertices of high degrees and a feedback edge set number that is usually significantly lower than the number of vertices. While degree-based graph parameters like the degeneracy d are tiny in all instances, we recall that our problem generalizes Hamiltonian Cycle, which is already N P-hard on 3-regular graphs. However, Table 2 shows that, for instances that we expect to be seen in practice, these measures can be assumed constant and, thus, it might be worth considering a combination involving these parameters.

Table 3 shows the same statistics for the second set of graphs where edges of small weight (that is, low confidence) were discarded (for weight thresholds of 10, 6, 3 and 0, yielding the original graph with all edges present). We notice that even with a light filtering, the parameters of the scaffold graph are considerably lowered, confirming that raw data suffers from significant noise. Thus, filtering low quality information not only increases the quality of the scaffolding, but also may lead to a significant leap to tractability. Note that, at the time of writing this article, the Staphylococcus Aureus genome is no longer available in the NCBI Nucleotide database ("This RefSeq genome was suppressed because updated RefSeq validation criteria identified problems with the assembly or annotation.") and the Escherichia Coli genome has been updated since the presented version. Errors in assemblies are quite frequent ([37]) and yield additional issues in scaffolding. Having a better idea of the structure of a classical scaffold graph, and some simple criteria to determine what is anomalous and what is normal (subgraphs induced by repeats for instance) would be of real interest for the analysis of genomes.

Implementation and first results

We implemented the dynamic programming algorithm presented in Section 11 in C++ using boost and the treewidth optimization library TOL [38]. We ran it on a selection of the generated data sets (see Section 11) for which the greedy fill-in heuristic produced tree decompositions of width at most 45. We chose (σp, σc) = (3, 1) for the ebola and monarch genomes and (σp, σc) = (20, 3) for the more complicated inputs. The tests were run on an AMD Opteron(tm) Processor 6376 at 2300 MHz.

Figure 7 shows running times and memory consumption needed to produce a solution as well as details regarding the used tree decompositions. Figure 8 shows optimal solutions for the two smallest data-sets: ebola and monarch. To validate the proposed scaffolding, we compared the output to the alignment of the contigs on the reference sequence using megablast ([39]). In the case of the ebola genome, the two isolated contigs (in green) are small (about 150 bp). One of them is placed between the orange contig and its neighbor, the other one finds its place at the other extremity of the chain. In the input scaffold graph, we notice that they are linked to the wrong node, we suppose this is due to their small size, disturbing the alignment step. For the monarch mitochondrion, however, two of the contigs (appearing in red) did not match on the sequence, meaning that the assembly yields some errors. The isolated contig is also small and not correctly connected in the scaffold graph.

Figure 7
figure 7

Information on the tree decomposition and resource requirements of running the algorithm for select instances with small treewidth. We chose (σp, σc) = (3, 1) for ebola and monarch, (20, 3) for rice and (256, 16) for the second data set (with edge cut-offs).

Figure 8
figure 8

Optimal solutions for the ebola (left) and monarch (right) graphs. Edges have their weights on the right. Matching edges are bold and normalized to weight 0.

Concerning the rice chloroplast, among the 84 contigs, only three were misplaced. All three of them are small (< 130 bp), two of them strongly overlap and the third has two occurrences in the reference genome, one complete and one partial. The seven remaining scaffolds follow exactly the right relative order and orientation of the contigs on the reference genome. Chloroplast genomes have a particularity which make them interesting as data for scaffolding. They present an inverted repeat region of approximately 20 kbp [40]. Figure 9 focuses on one of the scaffolds, where this inverted repeat is shown in pink. The other occurrence is not present in the scaffolding. To notice, the weights inside this repeat are in average higher than outside, which is totally expected since the read cover is approximately doubled for these sequences. Thus, areas of the graph with higher weight would lead to repeat hypothesis, if we are confident in the homogeneity of the cover.

Figure 9
figure 9

Scaffold in the rice chloroplast genome, including the inverted repeat.

Discussion and conclusion

In this paper, we considered exact approaches to the N P -hard SCAFFOLDING problem which is an integral part of genome sequencing. We showed that it can be solved in polynomial time on trees or graphs that are close to being trees (constant treewidth) by a dynamic programming algorithm. We proved a linear-size problem kernel for the parameter "feedback edge set" for a restricted version in which the lengths of paths and cycles are fixed. We implemented an exact algorithm solving SCAFFOLDING in f(tw) · poly(n) time and evaluated our implementation experimentally, supporting the claim that this method produces high-quality scaffolds. Our experiments are run on data sets that are based on specific real-world genomes, which we also examined to identify a number of interesting parameters that may help design parameterized algorithms and random scaffold graph generators that produce more realistic instances. We are currently transferring the preprocessing rules to the general problem variant. We are highly interested in further graph classes that are closer to real-world instances than trees and on which the problem might be polynomial-time solvable. From an algorithmic point of view, we remark that only few bags of the used tree decompositions are small (see Figure 7). Thus, we envision a hybrid strategy of branching on the vertices in the largest bags before running the dynamic-programming algorithm and using this "distance x to treewidth-x" parameter to re-analyze the problem.

We intend to perform more extensive tests on diverse datasets, in particular comparing the quality of this approach to existing ones using, for instance, the criteria presented in [9]. This work demands reliable benchmarks, with up-to-date and well assembled genomes. From a bioinformatics point of view, we lay the groundwork for a careful analysis of the scaffold graph, as well as a tool to speed up the above algorithms, as a help to analyze the quality of the assembly, and maybe the structure of the genome itself.

Appendix

Proof of Lemma 1 The proof is by induction over the distance of v to r (descending) and, in case of ties, j (ascending).

Induction Base: The statement holds for all vertices v if j = 0 since T 0 v =T [ { v } ] does not contain edges.

Induction Step (): First, we show that [c, i, j] v ω(S). To this end, let u := s v [j] and, noting that all vertices except maybe v of T j v are incident with edges in M, let q denote the path in G[S M] containing u. Let w be the vertex paired with v by M. Let α denote the number of edges in qE(T u + v) \ M incident with u and let βc denote the number of edges in qE ( T j - 1 v ) \M incident with s v [1..(j − 1)]. Let S u := U pS pE(T u ) and let S j - 1 := p S pE ( T j - 1 v ) and note that S u and S j−1 are path covers of T u and T j - 1 v , respectively. Thus, by induction hypothesis,

ω ( S u ) [ α , S u p , ] u and ω ( S j - 1 ) [ β , S j - 1 p , j - 1 ] v .
(1)

Case 1: uv M and c = 0. Then, S p = S u p T u + v + S j - 1 p = S u p + ( 1 - α ) + S j - 1 p and, thus,

[ c , i , j ] v l i n e 6 [ α , S u p , ] u + [ 0 , i - ( S u p - α + 1 ) , j - 1 ] v ( 1 ) ω ( S u ) + ω ( S j - 1 ) = ω ( S ) .

Case 2: uv M and c = 1. Then, β = 1. Furthermore, the path containing u is split over S u + uv and S j-1 , implying S p = S u p T u + v + S j - 1 p -1= S u p -α+ S j - 1 p . Thus,

[ c , i , j ] v l i n e 7 [ α , S u p , ] u + [ 0 , i - ( S u p - α ) , j - 1 ] v ( 1 ) ω ( S u ) +ω ( S j - 1 ) =ω ( S ) .

Case 3: uv M and c = 0. Then, ||S|| p = ||S u || p + ||S j−1 || p and we have

[ c , i , j ] v l i n e 9 [ α , S u p , ] u + [ 0 , i - S u p , j - 1 ] v ( 1 ) ω ( S u ) +ω ( S j - 1 ) =ω ( S ) .

Case 4: uv M and c = 1. If uv S , then β = 1. Furthermore, ||S|| p = ||S u || p + ||S j−1 || p and we have

[ c , i , j ] v l i n e 10 [ α , S u p , ] u + [ 1 , i - S u p , j - 1 ] v ( 1 ) ω ( S u ) +ω ( S j - 1 ) =ω ( S ) .

Otherwise, uv S , implying α = β = 0. If w s v [1..j], then S p = S u + u v p T u + v + S j - 1 p = S u p + S j - 1 p and

[ c , i , j ] v l i n e 10 ω ( u v ) + [ 0 , S u p , ] u + [ 0 , i - S u p , j - 1 ] v ( 1 ) ω ( u v ) +ω ( S u ) +ω ( S j - 1 ) =ω ( S ) .

Otherwise, w s v [1..j] and the path containing u is split over S u and S j−1 . Thus, S p = S u +uv p T u + v + S j - 1 p -1= S u p + S j - 1 p -1, implying

[ c , i , j ] v l i n e 10 ω ( u v ) + [ 0 , S u p , ] u + [ 0 , i - ( S u p - 1 ) , j - 1 ] v ( 1 ) ω ( u v ) +ω ( S u ) +ω ( S j - 1 ) =ω ( S ) .

Induction Step (): Next, we show that [c, i, j] v ω(S ) by proving that a i-path cover S′ for T j v of weight [c, i, j] v exists and, thus, [c, i, j] v = ω(S′) ≤ ω(S ) by optimality of S. To this end, let u := s v [j] and let w := M(v).

Case 1: w = u and c = 0. By line 6, there are ℓ and α such that [0, i, j] v = [α, ℓ, ∞] u +[0, i− (ℓ − α + 1), j − 1] v . By induction hypothesis, there are path covers S u and S j−1 corresponding to [α, ℓ, ∞] u and [0, i − (ℓ − α + 1), j − 1] v , respectively. Then, S := S u S j - 1 is a path cover for T j v and S p = S u p T u + v + S j - 1 p = S u p +1-α+ S j - 1 p =i. Furthermore,

ω ( S ) = ω ( S u ) + ω ( S j - 1 ) = i n d . h y p . [ α , , ] u + [ 0 , i - ( - α + 1 ) , j - 1 ] v = l i n e 6 [ 0 , i , j ] v .

Case 2: u = w and c = 1. By line 7, there are ℓ and α such that [1, i, j] v = [α, ℓ, ∞] u +[1, i− (ℓ − α), j − 1] v . By induction hypothesis, there are path covers S u and S j−1 corresponding to [α, ℓ, ∞] u and [1, i − (ℓ − α), j − 1] v , respectively. Then, S′ := S u S j−1 is a path cover for T j v and S p = S u p T u + v + S j - 1 p -1 since S j−1 contains an edge incident to v. Thus, S p = S u p + S j - 1 p -α=i. Furthermore,

ω ( S ) = ω ( S u ) + ω ( S j - 1 ) = i n d . h y p . [ α , , ] u + [ 1 , i - ( - α ) , j - 1 ] v = l i n e 7 [ 1 , i , j ] v .

Case 3: w ≠ u and c = 0. By line 9, there are ℓ and α such that [0, i, j] v = [α, ℓ, ∞] u + [0, i − ℓ, j − 1] v . By induction hypothesis, there are path covers S u and S j−1 corresponding to [α, ℓ, ∞] u and [0, i − ℓ, j − 1] v , respectively. Since c = 0 we have uv S u and, thus, S := S u S j - 1 is a path cover for T j v and S p = S u p + S j - 1 p =i. Furthermore,

ω ( S ) = ω ( S u ) + ω ( S j - 1 ) = i n d . h y p . [ α , , ] u + [ 0 , i - , j - 1 ] v = l i n e 9 [ 0 , i , j ] v .

Case 4: w ≠ u and c = 1.

Case 4a: There are ℓ and α such that [1, i, j] v = [α, ℓ, ∞] u + [1, i − ℓ, j − 1] v (see line 10). By induction hypothesis, there are path covers S u and S j−1 corresponding to [α, ℓ, ∞] u and [1, i − ℓ + 1, j − 1] v , respectively. Since uv M, some edge incident to u in T u is in M. Then, S := S u S j - 1 is a path cover for T j v and ||S′|| p = ||S u || p + ||S j−1 || p = i. Furthermore,

ω ( S ) = ω ( S u ) + ω ( S j - 1 ) = i n d . h y p . [ α , , ] u + [ 1 , i - , j - 1 ] v = l i n e 10 [ 1 , i , j ] v .

Case 4b: There is some ℓ such that [1, i, j] v = ω(uv) + [0, ℓ, ∞] u + [1, i − (ℓ − 1), j − 1] v (see line 10). Then, w s v [1..j] and, by induction hypothesis, there are path covers S u and S j−1 corresponding to [0, ℓ, ∞] u and [1, i−ℓ+1, j−1] v , respectively. Then, S′ := (S u +uv) S j−1 is a path cover for T j v and w,v and u are on the same path p in T j v [ S M ] . Since u is incident to an edge of M in T u , we know that p does not end in u. Thus, ||S′|| p = ||S u || p + ||S j−1 || p − 1 = i. Furthermore,

ω ( S ) = ω ( S u ) + ω ( S j - 1 ) + ω ( u v ) = i n d . h y p . [ 0 , , ] u + [ 1 , i - ( + 1 ) , j - 1 ] v = l i n e 10 [ 1 , i , j ] v .

Case 4c: There is some ℓ such that [1, i, j] v = ω(uv) + [0, ℓ, ∞] u + [1, i − ℓ, j − 1] v (see line 10). Then, w s v [1..j] and, by induction hypothesis, there are path covers S u and S j−1 corresponding to [0, ℓ, ∞] u and [1, i − ℓ, j − 1] v , respectively. Thus, S′ := (S u + uv) S j−1 is a path cover for T j v . Since u is incident to an edge of M in T u , we have ||S′|| p = ||S u || p + ||S j−1 || p = i. Furthermore,

ω ( S ) = ω ( S u ) + ω ( S j - 1 ) + ω ( u v ) = i n d . h y p . [ 0 , , ] u + [ 1 , i - , j - 1 ] v = l i n e 10 [ 1 , i , j ] v .

Proof of Section 11

Proof of Lemma 2 The proof is by induction on the distance of i to r (descending). In the induction base, i is a leaf of T and X(i) = and G i is empty. Thus, the domains of d and P are empty. Thus, [, , 0, 0] i = 0 and all other entries are −∞.

For the induction step, we distinguish the possible bag types of X(i) with children X( j) and X(ℓ) (possibly j = ℓ):

Introduce vertex v: Since G i does not contain edges incident to v, only tuples with d(v) = 0 and P(v) = are valid.

Forget vertex v: Let S i be a maximum weight set that is eligible for (d, P, p, c, i). We show that [d, P, p, c] i = ω(S i ).

"≤":

Case 1: [d, P, p, c] i = [d[v → 1], P + vv, p − 1, c] j . By induction hypothesis, there is a set S j corresponding[1] to [d[v → 1], P + vv, p − 1, c] j . We show that S j is eligible for (d, P, p, c, i) and, thus, ω(S i ) ≥ ω(S j ) = [d, P, p, c] i . Since X(i) X( j) and all paths between vertices in d−1(1) that are represented by P + vv are also represented by P, the first two conditions are satisfied by S j for i. Since deg G j S j ( v ) =1, there is a path q in G j s j containing v. But since v d−1(1), we know that G i S i contains one path more that does not intersect d−1(1). Thus, G i S i contains p paths that do not intersect d−1(1) and S j satisfies the third condition.

Case 2: [d, P, p, c] i = [d[v → 1], (Puu) + uv, p, c] j for some uu P. By induction hypothesis, there is a set S j corresponding to [d[v → 1], (Puu)+uv, p, c] j . Since uv (Puu) + uv, there is a u-v-path q in G j s j (by Definition 2(2)) and q intersects d−1(1) in u. Thus, G i S i contains p paths that do not intersect d−1(1) and, thus, S j is eligible for (d, P, p, c, i).

Case 3: [d, P, p, c] i = [d[vx], P, p, c] j for some x {0, 2}. By induction hypothesis, there is a set S j corresponding to [d[vx], P, p, c] j . Then, S j is also eligible for (d, P, p, c, i).

"≥": Let x:= deg G i s i ( v ) .

Case 1: x {0, 2}. Then, S i is eligible for (d[vx], P, p, c, j) and, by induction hypothesis, ω(S i ) ≤ [d[vx], P, p, c] j ≤ [d, P, p, c] i .

Case 2: x = 1. Then, by Definition 2(1), there is a path q in G i S i ending in v. If q has another end u in d−1(1), then S i is eligible for (d[v → 1], (Puu) + uv, p − 1, c, j) and, thus, ω(S i ) ≤ [d[v → 1], (Puu) + uv, p − 1, c]. Otherwise, S i is eligible for (d[v → 1], P + vv, p − 1, c, j). In both cases, ω(S i ) ≤ [d, P, p, c] i .

Introduce edge uv: Let S i correspond to [d, P, p, c] i . We show that [d, P, p, c] i = ω(S i ). Let d′ and z be as described in the dynamic programming and note that G j = G i - uv. "≤": First, if [d, P, p, c] i = [d, P, p, c] j , then, by induction hypothesis, there is a set S j corresponding to [d, P, p, c] j and uv M. Then, G j s j = G i s j implying that S j is eligible for (d, P, p, c, i) and, thus, ω(S i ) ≥ ω(S j ) = [d, P, p, c] i .

In the following, we proceed in a similar manner for the case that [d, P, p, c] i = z + ω(uv): To show [d, P, p, c] i ω(S i ), we consider a set S j that corresponds to the entry for X( j) from which [d, P, p, c] i is computed and whose existence is granted by induction hypothesis. Then, we show that S j + uv is eligible for (d, P, p, c, i), implying ω(S i ) ≥ ω(S j + uv) = z + ω(uv) if uv M and ω(S i ) ≥ ω(S j ) = z if uv M. Note that, in each of the cases, the degrees of u and v in G j s j are one less than their degrees in G i s j + u v . Thus, S j satisfies Definition 2(1) for d′.

[1]We say a set S corresponds to an entry [d, P, p, c] i if S is eligible for (d, P, p, c, i) and its weight ω(S ) = [d, P, p, c] i is maximum among all sets that are.

Case 1: d(u) = d(v) = 2. Then, both u and v have degree 1 in G j s j .

Case 1a: z = [d′, P + uv, p, c − 1] j . Then, there is a u-v-path in G j s j (see Figure 3(a)). Since d−1(1) = d′−1(1) \ {u, v}, adding uv does not touch any paths intersecting d−1(1). Thus, Definition 2(2) is satisfied. Further, adding uv closes a cycle in G i s j + u v and, thus, G i s j + u v contains one more cycle that does not intersect d−1(1) than G j s j . Thus, also Definition 2(3) is satisfied.

Case 1b: z = [d′, P {uu, vv}, p − 1, c] j . Then, both u and v have dangling paths in G j s j (see Figure 3(b)). By the same arguments as in Case 1a, Definition 2(2) is satisfied. Further, adding uv connects two paths such that the resulting path does not intersect d−1(1). Thus, there are one more such paths in G i s j + u v than in G j s j , implying that Definition 2(3) is satisfied.

Case 1c: z = [d′, (Pxx) {ux, vv}, p, c] j for some x d−1(1) = d′−1(1) \ {u, v} with xx P. Then, G j s j contains a path dangling from v and a u-x-path (see Figure 3(c)). Thus, adding uv connects two paths such that the resulting path intersects d−1(1) exclusively in x. Hence, there is a path dangling from x in G i s j + u v and, hence, S j + uv satisfies Definition 2(2). Since no paths or cycles avoiding d′−1(1) are affected by adding uv, Definition 2(3) is satisfied. The case that z = [d′, (Pxx) {vx, uu}, p, c] j is analogous.

Case 1d: z = [d′, (Pxy) {ux, vy}, p, c] j for some x, y d−1(1) with xy P. Then, G j s j contains paths q u and q v that start in u and v, respectively, and end in x and y, respectively (see Figure 3(d)). Thus, adding uv connects q u and q v to a single path q that starts in x and ends in y, thus intersecting d−1(1). Hence both Definition 2(2) and (3) are satisfied.

Case 2: d(u) = d(v) = 1. Thus, u and v are not incident to any edges in G j s j . Thus, uv forms a new path connecting u and v in G i s j + u v . Since, in this case, z = [d′, Puv, p, c] j and uv P, we conclude that both Definition 2(2) and (3) are satisfied.

Case 3: d(u) = 2, d(v) = 1. Then, v has no incident edges in G j s j and, thus, it is only adjacent to u in G i s j + u v . Further, u has degree 1 in G j s j , so it is endpoint to a path q in G j s j . Thus, vx P for some x d−1(1) and G i s j + u v contains the path q′ := q + uv.

Case 3a: x = v. Then, z = [d′, (Pvv) + uu, p, c] j . Since G i s j + u v contains q′ which is a path dangling from u, Definition 2(2) is satisfied. Since no other paths are touched, Definition 2(3) is satisfied.

Case 3b: x ≠ v. Then, z = [d′, (Pvx) + ux, p, c] j . Since G i s j + u v contains q′ which is a u-x-path, Definition 2(2) is satisfied. Since no other paths are touched, Definition 2(3) is satisfied.

"≥": If uv S i M, then G i s i = G j s j and, thus, S i is eligible for (d, P, p, c, j). By induction hypothesis, [d, P, p, c] j ω(S i ) and, thus, [d, P, p, c] i ≥ [d, P, p, c] j } ≥ ω(S i ). Otherwise, uv S i M

In the following, we show that S i uv is eligible for a tuple corresponding to one of the entries over which we maximize to compute z. Thus, ω(S i ) ≤ ω(S i uv) + ω(uv) ≤ z + ω(uv) if uv tt M and ω(S i ) ≤ z if uv M. Note that, in each case, S i uv satisfies Definition 2(1) since the degrees of u and v decrease by one when removing uv.

Case 1: d(u) = d(v) = 2. Then, uv is part of a path q in G i S i and neither u nor v is an endpoint of q.

Case 1a: q is closed (that is, a cycle) (see Figure 3(a)). Then, q does not intersect d−1(1), implying that G i S i contains one more such cycle than G j s i - u v . Further, quv is a u-v-path in G j s i - u v intersecting d′−1(1) only in u and v. Thus, S i uv is eligible for (d′, P + uv, p, c − 1, j).

Case 1b: q is open and does not intersect d−1(1) (see Figure 3(b)). Then, G i S i contains one more of such paths than G j s i - u v . Further, quv decomposes into paths q u and q v intersecting d′−1(1) only in u and v, respectively. Thus, S i uv is eligible for (d′, P + {uu, vv}, p − 1, c, j).

Case 1c: q is open and intersects d−1(1) in a single vertex x (see Figure 3(c)). Then, quv decomposes into paths q u and q v in G j s i - u v , one of which intersects d′−1(1) in x. Since no path avoiding d−1(1) is touched, S i uv is eligible for either (d′, (Pxx) {xu, vv}, p, c, j) or (d′, (Pxx) {uu, vx}, p, c, j).

Case 1d: q is open and intersects d−1(1) in 2 distinct vertices x and y (see Figure 3(d)). Then, quv decomposes into a u-x-path q u and a v-y-path q v in G j s i - u v . Thus, S i - uv is eligible for (d′, (Pxy) {ux, vy}, p, c, j).

Case 2: d(u) = d(v) = 1. Then, q = {uv} is a path in G i S i and, thus, u and v are isolated in G j s i - u v . Thus, uv P and S i uv is eligible for (d′, Puv, p, c, j).

Case 3: d(u) = 2 and d(v) = 1. Thus, G i S i contains a path q = (v, u, . . .). If q intersects d−1(1) − v in a vertex x, then quv intersects d′−1(1) − u and, thus, S i uv is eligible for (d′, (Pvx) + ux, p, c, j). Otherwise, q is a dangling from v in G i S i and quv is a dangling from u in G j s i - u v , implying that S i uv is eligible for (d′, (Pvv) + uu, p, c, j).

Join: "≤": Let d1, d2, P1, P2, p1, p2, c1, c2 be such that [d, P, p, c] i = [d1, P1, p1, c1] j + [d2, P2, p2, c2]. By induction hypothesis, there are sets S j and S corresponding to [d1, P1, p1, c1] j and [d2, P2, p2, c2], respectively, such that

v X ( i ) d ( v ) = d 1 ( v ) + d 2 ( v )  , 
(2)
P= P 1 P 2 ,
(3)
p= p 1 + p 2 +| ( P 1 P 2 ) 1 |, and
(4)
c= c 1 + c 2 +| ( P 1 P 2 ) 2 |.
(5)

We show that S i := S j S t is eligible for (d, P, p, c, i) and, thus, ω(S ) ≥ ω(S i ) = [d1, P1, p1, c1] j + [d2, P2, p2, c2] = [d, P, p, c] i . First, by (2), we have that Definition 2(1) is satisfied. Second, to show that Definition 2(2) is satisfied, consider some uv P. Then, there is a path q = (u, x1, x2, . . ., v) in Gr(P1 P2). Note that x 1 , x 2 , d 1 - 1 ( 1 ) d 2 - 1 ( 1 ) . Thus, G i S j S = G i s i contains a u-x1-path, an x1-x2-path, . . . . The concatenation of these paths forms a u-v path in G i S i . Third, note that, for each uu (P1)1, there is a path q 1 u dangling from u in G j s j such that the only vertex of d 1 - 1 ( 1 ) d 1 - 1 ( 0 ) d - 1 ( 1 ) in q 1 u is u. Analogously, a similar path q 2 u exists for each uu (P2)1 in G s . Thus, for each uu (P1P2)1, there is a path q u := q 1 u q 2 u in G i S j S containing u and avoiding d−1(1) and qu is neither in G j s j nor in G S . Since G j s j contains p1 paths avoiding d 1 - 1 ( 1 ) d 1 - 1 ( 0 ) d - 1 ( 1 ) and G S contains p2 paths avoiding d 2 - 1 ( 1 ) d 2 - 1 ( 0 ) d - 1 ( 1 ) , we have that G i S j S contains p1 + p2 + |(P1P2)1| such paths. Similarly, G i S j S concontains c1 + c2 + |(P1P2)2| cycles avoiding d−1(1).

"≥": Let S j := S i E(G j ) and let S := S i E(G ). We show that S j and S are eligible for tuples (d1, P1, p1, c1, j) and (d2, P2, p2, c2, ), respectively, such that (2)-(5) hold. First, for all u X(i) = X( j) = X() let d1 (d2) be the number of edges of G j (G ) incident with u. Since E ( G i s i ) =E ( G j s j ) E ( G S ) , we conclude that (2) holds and Definition 2(1) is satisfied. Second, let P1 be the set of pairs uv with u, v d 1 - 1 ( 1 ) such that, if u ≠ v, then there is a u-v path in G j s j and, if u = v, then G j s j contains a path dangling from u. Let P2 be defined analogously for d2 and G s . Then, Definition 2(2) is satisfied. Further, for all uv P, since d(u) = 1 d1(u) = 1 d2(u) = 1, we have P1(u) = P2(u) = and there is a u-v-path q in G i S i that decomposes into paths of G j s j that connect vertices of d 1 - 1 ( 1 ) and paths of G s that connect vertices of d 2 - 1 ( 1 ) . Thus, Gr(P1 P2) contains a u-v-path and we conclude that (3) holds. Third, let p1 and c1 be the number of paths and cycles, respectively, in G j s j that avoid d 1 - 1 ( 1 ) . Likewise for p2 and c2 in G s and d2. Then Definition 2(3) is satisfied. Further, let p ′ denote the number of pairs uu P1P2, that is, p ′ := |(P1P2)1|. Then, since G i S j S contains, for each such pair uu, a different path consisting of the concatenation of the two paths dangling from u in G j s j and G s , respectively, G i s j s contains exactly p1 + p2 + p′ paths avoiding d−1(1). Thus, (4) holds and, in complete analogy, (5) holds.    □

Proof of Theorem 1, sketch The bottleneck in the computation are the join nodes so we focus on computing their dynamic programming table. To calculate the number of entries that have to be considered in order to compute [d, P, p, c] i , assume that P1 and P2 are fixed. Then so is P and d 1 - 1 ( 1 ) and d 2 - 1 ( 1 ) . Now, consider a vertex uX ( j ) \ d 1 - 1 ( 1 ) . If u is incident to an edge in G j , that is, an edge in M that has been introduced in the subtree rooted at j, then d1(u) > 0 and, thus, d1(u) = 2. However, if u is not incident with any matching edges in G j , then d1(u) < 2, since otherwise, u would be incident to two non-matching edges. Thus, u d 1 - 1 ( 2 ) if and only if u is incident to a matching edge in G j and, consequently, fixing P1 and P2 also fixes d1 and, by extension, d2. Finally, we can choose p1 and c1 in order to compute p2 and c2. Since we need to consider only permutations that are also involutions, there are less than twtw ways to choose P1 and P2. Thus, the maximum is over at most twtw ·σp · σc elements. Since there are O(n) bags in the tree decomposition, the algorithm can be executed in the claimed running time. □

Lemma 5 Tree Rule 2 is correct, that is, the instance I = (G, M, σp, σc, ℓp, ℓc) is yes if and only if the result I′ = (G′, M′, σp − 1, σc, ℓp, ℓc) of applying Tree Rule 2 to I is yes.

Proof Clearly, since all vertices of the graph G must be covered, then the only way to pack the vertex u is to include it into a path of length l p included the vertex v′ = LCA(u, w) and then decrease the number of paths by one. □

Proof of Lemma 3 We show that T v does not contain branching vertices (vertices with at least two children) since it is straightforward that, if T v is a path, it has to be alternating for I to be a yes-instance and, if its length exceeds ℓp, then Tree Rule 2 applies. Towards a contradiction, assume that T v has branching vertices and let z denote such a vertex in T v such that, among all branching vertices, z is furthest from v. Let u be a leaf of T z that has maximum distance to z and let d denote this distance. Since z is the only branching vertex in T z and I is a yes-instance, the unique u-z path in T z is alternating. Thus, by irreducibility with respect to Tree Rule 2, we know that d < ℓp. However, since z is branching, there is another leaf w at distance d′d < ℓp to z in T z . Thus, if the unique u-w-path in T z is not alternating or its length is not ℓp, then I is not a yes-instance. But otherwise, Tree Rule 2 is applicable to u and w, contradicting irreducibility. □

Lemma 6 Tree Rule 3 is correct, that is, the instance I = (G, ω, M, σp, σc, ℓp, ℓc) is a yes-instance if and only if the result of applying Tree Rule 3 to I is.

Proof Let v and e be as defined in Tree Rule 3 and let e = {u, v}. To show correctness of Tree Rule 3, we prove that all optimal solutions for I contain e. To this end, let S be an optimal solution with e S . By Lemma 3, T v is an alternating path p ending in v with |p| < ℓp. By definition, M(u) is on p, so |p| ≥ 2. But then, Ge contains an isolated path of length strictly less than ℓp, implying that S is not a solution for I. □

Lemma 7 Path Rule 1 is correct, that is, the instance I = (G, ω, M, σp, σc, ℓp, ℓc) is a yes-instance if and only if the result I′ = (G′, ω, M′, σp − 1, σc, ℓp, ℓc) of applying Path Rule 1 to I is.

Proof Let p= ( u 0 , u 1 , , u 2 l p + 2 ) be as in Path Rule 1 and let p := ( u 0 , u 1 , , u p + 1 ) . Note that e p + 1 , e p + 2 , exist in G′.

"": Let S be a solution for I, let p := ( u 0 , u 1 , , u p + 1 ) and note that no cycle of G[S M] contains p since |p| > ℓc. We show that S ′ := S \ p ′ is a solution of for I with ω(S ) = ω(S ). First, note that all vertices of G are covered by S . Second, note that p \ p is alternating since |p| = ℓp + 1. Finally, G[S M] contains one path less than G[S M].

"": Let S be a solution for I and let u denote the vertex onto which u 0 , u 1 , , u p + 1 have been contracted in I and let e:=u u p + 2 . We construct a solution S for I with ω(S ) = ω(S ). First, since |p| > ℓc + ℓp + 1, no cycle in G[S M] contains e. If e S M, then S := S (pe0). If e S M, then e j S M for some ℓp <j ≤ 2ℓp. Then, S:= S ( p - e j - ( p + 1 ) )

Lemma 8 Path Rule 2 is correct, that is, the instance I = (G, ω, M, σp, σc, ℓp, ℓc) is a yes-instance if and only if the result I = (G, ω, M, σp − 1, σc, ℓp, ℓc) of applying Path Rule 2 to I is.

Proof First, note that no solution for I or I covers w in a cycle since T w is necessarily covered by a path. Also note that q exists in I.

"": Let S be a solution for I and note that there is some i ≤ ℓp such that e i S M (in fact, i {ℓp, γ}). Then, however, f i−γ S and S := S \ p is a solution for I and ω(S ) = ω(S ).

"": Let S be a solution for I and note that there is some i ≤ ℓp − γ such that f i S M (in fact, i {0, ℓp − γ}). But then, S (pe i+γ ) is a solution for I and ω(S ) = ω(S ). □

Lemma 9 Path Rule 3 is correct, that is, the instance I = (G, ω, M, σp, σc, ℓp, ℓc) is a yes-instance if and only if the result I = (G, ω, M, σp − 1, σc, ℓp, ℓc) of applying Path Rule 3 to I is.

Proof "": Let S be solution for I and let q u and q v denote the paths containing u and v, respectively, in G[S M].

Case 1: Neither q u nor q v contain edges of p. Then, γ = 1 and ux1 S M. If γ u v +1 = ℓp, then Case (4) applies and otherwise, Case (5) applies. In both cases, S is also a solution for I.

Case 2: q v contains edges of p, but q u does not. Then, γ v + γ = ℓp + 1 and ux1 S M. If γ u + γ = ℓp + 1, then Case (3) applies and otherwise, Case (1) applies. In both cases, S is also a solution for I.

Case 3: q u contains edges of p, but q v does not. Then, γ u + γ = ℓp + 1 and vx|p|−1 S M and ux1 S \ M. If γ u + γ = ℓp + 1, then Case (3) applies and switching ux1 for wx1 in S gives a solution of same weight for I. Otherwise, Case (2) applies and S is also a solution

for I.

Case 4: Both q u and q v contain edges of p. Then, γ u + γ v + γ ≡ ℓp mod (ℓp + 1) and ux1 S \ M. If γ ≠ 1, then Case (6) applies and S is also a solution for I. Otherwise, Case (3) applies and switching ux1 for wx1 in S gives a solution of same weight for I.

"": Let S be a solution for I. Note that, if G does not contain wx1, then S is clearly also a solution for I. Thus, assume that G contains wx1 and, thus, either Case (3) or (4) applies to I. We show that the result S of switching wx1 for ux1 in S is a solution of the same weight for I. To show this, it suffices to show that, if a path q contains wx1, then q ends at u. Assume this is false, that is, q contains wx1 and some edge e E(G) \ M incident with u. Since T w is not empty, no cycle in G[S M] contains u. Then, the length of the u-v-path containing pux1 in G is equivalent to γ + γ u modulo ℓp + 1.

Case 1: vx|p|−1 S M. Then, since q does not end in u, we have γ u +γ+ γ v p mod ( p + 1 ) . Thus, Case (4) does not apply to I, implying that Case (3) applies to I. But then, γ + γ v = ℓp + 1 and, hence, wx1 S M.

Case 2: vx|p|−1 S M. Then, since q does not end in u, we have γ u +γ0 mod ( p + 1 ) , implying γ u + γ ≠ ℓp + 1 since 0 < γ ≤ ℓp and 0 < γ u < ℓp. Thus, Case (3) does not apply to I, implying that Case (4) applies to I. But then, γ = 1 and, since vx|p|−1 S M, we have wx1 S M. □

Observation 1 Let G be connected. Then, |V| ≤ 2 FES(G) since |E| ≤ |V| + FES(G) ≤ |V| + FES(G) and 2|E| ≥ 3|V|.

Proof of Theorem 2 By Lemma 4, we have |V| ( | V | + 3 | E | ) Observation  1 ( 2 FES ( G ) + 3 ( FES ( G ) + 2 FES ( G ) ) = 11 FES ( G ) and, thus, we obtain | E | 11 FES( G ) + FES( G ) = ( 11 + 1 ) FES( G ) .

References

  1. Reddy TBK, Thomas AD, Stamatis D, Bertsch J, Isbandi M, Jansson J, Mallajosyula J, Pagani I, Lobos EA, Kyrpides NC: The Genomes OnLine Database (GOLD) v.5: a metadata management system based on a four level (meta)genome project classification. Nucleic Acids Research. 2014, 43 (D1): 1099-1106. [https://gold.jgi-psf.org/distribution]

    Article  Google Scholar 

  2. Donmez N, Brudno ML: SCARPA: scaffolding reads with practical algorithms. Bioinformatics. 2013, 29 (4): 428-434.

    Article  CAS  PubMed  Google Scholar 

  3. Gritsenko AA, Nijkamp JF, Reinders MJT, de Ridder D: GRASS: a generic algorithm for scaffolding next-generation sequencing assemblies. Bioinformatics. 2012, 28 (11): 1429-1437.

    Article  CAS  PubMed  Google Scholar 

  4. Dayarian A, Michael TP, Sengupta AM: SOPRA: Scaffolding algorithm for paired reads via statistical optimization. BMC Bioinformatics. 2010, 11: 345-

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  5. Gao S, Sung W-K, Nagarajan N: Opera: Reconstructing Optimal Genomic Scaffolds with High-Throughput Paired-End Sequences. Journal of Computational Biology. 2011, 18 (11): 1681-1691.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  6. Salmela L, Mäkinen V, Välimäki N, Ylinen J, Ukkonen E: Fast scaffolding with small independent mixed integer programs. Bioinformatics. 2011, 27 (23): 3259-3265.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  7. Boetzer M, Henkel CV, Jansen HJ, Butler D, Pirovano W: Scaffolding pre-assembled contigs using SSPACE. Bioinformatics. 2011, 27 (4): 578-579.

    Article  CAS  PubMed  Google Scholar 

  8. Koren S, Treangen TJ, Pop M: Bambus 2: Scaffolding metagenomes. Bioinformatics. 2011, 27 (21): 2964-2971.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  9. Hunt M, Newbold C, Berriman M, Otto T: A comprehensive evaluation of assembly scaffolding tools. Genome Biology. 2014, 15 (3):

  10. Huson DH, Reinert K, Myers EW: The greedy path-merging algorithm for contig scaffolding. Journal of the ACM. 2002, 49 (5): 603-615.

    Article  Google Scholar 

  11. Solinhac R, Leroux S, Galkina S, Chazara O, Feve K, Vignoles F, Morisson M, Derjusheva S, Bed'hom B, Vignal A, Fillon V, Pitel F: Integrative mapping analysis of chicken microchromosome 16 organization. BMC Genomics. 2010, 11 (1): 616-

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  12. Sharpton TJ: An introduction to the analysis of shotgun metagenomic data. Frontiers in Plant Science. 2014, 5 (209):

  13. Shmoys DB, Lenstra JK, Kan AHGR, Lawler EL: The Traveling Salesman Problem: a Guided Tour of Combinatorial Optimization. Edited by: John Wiley & Sons. 1985

    Google Scholar 

  14. Tutte WT: A short proof of the factor theorem for finite graphs. Canadian Journal of Mathematics. 1954, 6: 347-352.

    Article  Google Scholar 

  15. Chateau A, Giroudeau R: Complexity and Polynomial-Time Approximation Algorithms around the Scaffolding Problem. AlCoB Lecture Notes in Computer Science. Edited by: Dediu, A.H., Martín-Vide, C., Truthe, B. 2014, Springer, 8542: 47-58.

    Google Scholar 

  16. Chateau A, Giroudeau R: A complexity and approximation framework for the maximization scaffolding problem. Theoretical Computer Science. 2015, 595: 92-106. doi:10.1016/j.tcs.2015.06.023

    Article  Google Scholar 

  17. Weller M, Chateau A, Giroudeau R: On the implementation of polynomial-time approximation algorithms for scaffold problems. submitted 2015

    Google Scholar 

  18. Nieuwerburgh FV, Thompson RC, Ledesma J, Deforce D, Gaasterland T, Ordoukhanian P, Head SR: Illumina mate-paired DNA sequencing-library preparation using Cre-Lox recombination. Nucleic Acids Research. 2012, 40 (3): 24-24.

    Article  CAS  Google Scholar 

  19. Boetzer M, Henkel CV, Jansen HJ, Butler D, Pirovano W: Scaffolding pre-assembled contigs using SSPACE. Bioinformatics. 2011, 27 (4): 578-579.

    Article  CAS  PubMed  Google Scholar 

  20. Reed B, Smith K, Vetta A: Finding odd cycle transversals. Operations Research Letters. 2004, 32 (4): 299-301.

    Article  Google Scholar 

  21. Sahlin K, Vezzi F, Nystedt B, Lundeberg J, Arvestad L: BESST - efficient scaffolding of large fragmented assemblies. BMC Bioinformatics. 2014, 15 (1): 281-

    Article  PubMed  PubMed Central  Google Scholar 

  22. Luhmann N, Chauve C, Stoye J, Wittler R: Scaffolding of ancient contigs and ancestral reconstruction in a phylogenetic framework. Proceedings of Brazilian Symposium on Bioinformatics Lecture Notes in Computer Science. 2014, 8826: 135-143.

    Article  Google Scholar 

  23. Rajaraman A, Tannier E, Chauve C: FPSAC: Fast Phylogenetic Scaffolding of Ancient Contigs. Bioinformatics. 2013, 29 (23): 2987-2994.

    Article  CAS  PubMed  Google Scholar 

  24. Aganezov S, Sitdykovaa N, Alekseyev MA: AGCConsortium: Scaffold assembly based on genome rearrangement analysis. Computational Biology and Chemistry. 2015, 57: 46-53.

    Article  CAS  PubMed  Google Scholar 

  25. Impagliazzo R, Paturi R, Zane F: Which Problems Have Strongly Exponential Complexity?. Journal of Computer and System Sciences. 2001, 63 (4): 512-530.

    Article  Google Scholar 

  26. Bodlaender HL, Cygan M, Kratsch S, Nederlof J, Impagliazzo R, Paturi R, Zane F: Deterministic single exponential time algorithms for connectivity problems parameterized by treewidth. Information and Computation. 2015, 243: 86-111.

    Article  Google Scholar 

  27. Downey RG, Fellows MR: Fundamentals of Parameterized Complexity. Texts in Computer Science. 2013, Springer

    Google Scholar 

  28. Li H, Handsaker B, Wysoker A, Fennell T, Ruan J, Homer N, Marth G, Abecasis G, Durbin R: The sequence alignment/map format and SAMtools. Bioinformatics. 2009, 25 (16): 2078-2079.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  29. Chikhi R, Rizk G: Space-efficient and exact de bruijn graph representation based on a bloom filter. Algorithms for Molecular Biology. 2013, 8: 22-

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  30. Li H, Durbin R: Fast and accurate long-read alignment with Burrows-Wheeler transform. Bioinformatics. 2010, 26 (5): 589-595.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  31. Briot N, Chateau A, Coletta R, Givry SD, Leleux P, Schiex T: An Integer Linear Programming Approach for Genome Scaffolding. Workshop Constraints in Bioinformatics. 2014

    Google Scholar 

  32. Salzberg SL, Phillippy AM, Zimin A, Puiu D, Magoc T, Koren S, Treangen TJ, Schatz MC, Delcher AL, Roberts M, Marc¸ais G, Pop M, Yorke JA: GAGE: A critical evaluation of genome assemblies and assembly algorithms. Genome Research. 2012, 22 (3): 557-567. [http://gage.cbcb.umd.edu]

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  33. Zerbino DR, Birney E: Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome research. 2008, 18 (5): 821-829.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

  34. Langmead B, Trapnell C, Pop M, Salzberg SL: Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome Biology. 2009, 10 (3): 25-

    Article  CAS  Google Scholar 

  35. Chikhi R, Rizk G: Space-efficient and exact de Bruijn graph representation based on a Bloom filter. Algorithms for Molecular Biology. 2013, 8 (22):

  36. 2013, Variathon, [http://bioinf.dimi.uniud.it/variathon]

  37. Denton JF, Lugo-Martinez J, Tucker AE, Schrider DR, Warren WC, Hahn MW: Extensive error in the number of genes inferred from draft genome assemblies. PLoS Computational Biology. 2014, 10 (2): 1003998-

    Article  Google Scholar 

  38. Koster A: TOL - Treewidth Optimization Library. 2012

    Google Scholar 

  39. Morgulis A, Coulouris G, Raytselis Y, Madden TL, Agarwala R, Scha¨ ffer AA, Koster A: Database indexing for production MegaBLAST searches. Bioinformatics (Oxford, England). 2008, 24 (16): 1757-1764.

    Article  CAS  Google Scholar 

  40. Kolodner R, Tewari KK: Inverted repeats in chloroplast DNA from higher plants*. Proceedings of the National Academy of Sciences of the United States of America. 1979, 76 (1): 41-45.

    Article  CAS  PubMed  PubMed Central  Google Scholar 

Download references

Acknowledgements

Publication of this work is funded by the Institut de Biologie Computationnelle.

This article has been published as part of BMC Bioinformatics Volume 16 Supplement 14, 2015: Proceedings of the 13th Annual Research in Computational Molecular Biology (RECOMB) Satellite Workshop on Comparative Genomics: Bioinformatics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/16/S14.

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Mathias Weller.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

MW, AC and RG conceived the method and the proofs. MW implemented and tested the algorithms on all datasets. MW, AC and RG wrote the paper.

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Weller, M., Chateau, A. & Giroudeau, R. Exact approaches for scaffolding. BMC Bioinformatics 16 (Suppl 14), S2 (2015). https://doi.org/10.1186/1471-2105-16-S14-S2

Download citation

  • Published:

  • DOI: https://doi.org/10.1186/1471-2105-16-S14-S2

Keywords