Exact approaches for scaffolding

This paper presents new structural and algorithmic results around the scaffolding problem, which occurs prominently in next generation sequencing. The problem can be formalized as an optimization problem on a special graph, the "scaffold graph". We prove that the problem is polynomial if this graph is a tree by providing a dynamic programming algorithm for this case. This algorithm serves as a basis to deduce an exact algorithm for general graphs using a tree decomposition of the input. We explore other structural parameters, proving a linear-size problem kernel with respect to the size of a feedback-edge set on a restricted version of Scaffolding. Finally, we examine some parameters of scaffold graphs, which are based on real-world genomes, revealing that the feedback edge set is significantly smaller than the input size.


Introduction
During the last decade, a huge amount of new genomes have been sequenced, leading to an abundance of available DNA resources. Nevertheless, most of recent genome projects stay unfinished, in the sense that databases contain much more incompletely assembled genomes than whole stable reference genomes [1]. One reason for this phenomenon is that, for most of the analyses performed on DNA, an incomplete assembly is sufficient. Sometimes it is even possible to perform them directly from the sequenced data. Another reason is that producing a complete genome, or an as-complete-as-possible-genome is a difficult task. Traditionally, producing a complete genome consists of three steps, each of them computationally or financially difficult: the assembly, the scaffolding, and the finishing. The step of scaffolding, on which we focus here, consists of orienting and ordering at the same time the contigs produced by assembly. Many methods have been proposed and the recent activity on the subject shows that it is an active field (see, not exhaustively, [2][3][4][5][6][7][8] and Section). A good survey of recent methods can be found in [9] for instance. Since the problem has been proved N P -complete from its first formulation [10], nearly all of these methods propose heuristic approaches. One of them claims that an exact method provides better results [5], however, the authors prepend a heuristic graph simplification before running their exact algorithm.
The approach presented here relies on a combinatorial problem on a dedicated graph, called scaffold graph, representing the link between already assembled contigs. The main idea is to represent each contig by two vertices linked by an edge (these "contig-edges" form a perfect matching on the scaffold graph). Other edges are constructed and weighted using complementary information on the contigs. The weight of a non-contig edge uv, with uu′, vv′ being contig-edges, corresponds to a confidence measure that the uu′ contig is succeeded by the vv′ contig (oriented as u′ − u − v − v′). The scaffold graph is a flexible tool to study the scaffolding issues. Indeed, the graph is a syntactical data-structure which may represent several semantics. For instance, the scaffold graphs used for our previous experiments have been built using Next-Generation Sequencing data, namely paired-end reads. However, we also could provide other type of information to compute the weight on the edges, like for instance ancestral support in a phylogenetic context, or comparison to other extent genomes which could be used as multiple references. The way to define the weight on the edges does not change the main goal of our method, which is to determine the optimal ordering and orientation of the contigs, given a specific criterion. It is also possible to mix two or more criteria in order to take several sources of information into account . We also introduced two structural parameters s p and s c representing the respective numbers of linear and circular chromosomes sought in the genome. Those parameters seems quite artificial at a first sight, but they are wellmotivated as follows. First, in a number of species, genomes are hard to assemble with classical methods because of an heterogeneous chromosomal structure or difficulties to perform classical assembly. This is the case, for instance, with the micro-chromosomes in the chicken genome [11]. Also, when scaffolding metagenomic data, one has to deal with very complex genomic structures [12]. We hope that relaxing the classical model with one chromosome, mixed with the flexibility of the scaffold graph, could help handling such complex situations. The second motivation to introduce those parameters came from the desire to explore an intermediary case between the very classical and studied N P -complete [13]TRAVELING SALESMAN PROBLEM (TSP) and a totally free structure, in which case optimizing a covering by paths and cycles is polynomial-time solvable [14]. The SCAFFOLDING problem then consists of finding (at most) s p paths and s c cycles that cover all contig-edges while maximizing the total weight. Previously, we proved that SCAFFOLDING is N P -complete, even under restricted conditions [15], developed polynomial-time approximation algorithms [15,16] and evaluated them experimentally [17]. In this paper, we continue to explore this problem, as well as the structural properties of the scaffold graph. This exploration aims to develop an efficient method, that could be used both to refine the ratio analysis on previously designed heuristics and to handle cases of small genomes with better quality.
We present both theoretical and practical results, as well as our experimental findings. The paper is organized as follows: we relate a general state-of-the-art and methodological comparison with existing methods and our previous work in Section. In Section, we formally introduce the problem and some notations and definitions. In Section, we show that the general SCAFFOLDING problem is polynomial-time solvable on trees, and present a dynamic programming algorithm. We extended this algorithm in Section 11 to a parameterized algorithm with respect to the treewidth of the input, that is, this algorithm runs in polynomial time, on scaffold graphs that are sparse, i.e. treelike. In Section 11, we lay the foundation to developing effective preprocessing routines for SCAFFOLDING, by presenting a set of reduction rules whose application shrinks input instances of a derived problem, RESTRICTED SCAFFOLDING. We show that the size of the resulting instances can be bounded linearly in their feedback edge set number, thus proving a problem kernel for this parameter. Finally, in Section 11, we present some experimental analysis of scaffold graphs that are derived from real-world genomes.

Related work
Genome scaffolding is an intrinsically complex problem, and was studied through several computation models and tools. The first model, presented in [10], leads to its classification as an N P -complete problem. This work also proposes the first greedy approach. The scaffolding was then performed using genomic fragments coupled by matepairs [18], which differ from the paired-end reads essentially by their larger insert-size (The insert-size is the gap, in base pairs, between two fragments constituting a pair of reads), and their lower covering depth, so the size of the data and the organization of the graph may differ from actual Next-Generation Sequencing data using paired-end reads. A greedy-like approach is also used in SSPACE [19], iteratively combining the longest contigs. Conversely, Gao et al. present a dynamic-programming-based exact approach (OPERA) which provides higher-quality scaffolds than existing heuristic approaches [5]. However, their algorithm runs in O(n k ) time (where k is the "library width") which quickly grows out of reasonable proportions. Thus, to apply this method on real data, they prepend a graph contraction procedure to limit the size of the input. In SOPRA [4], Dayarian et al. introduce a removal procedure for problematic contigs, and separate the orientation step from the linking step. The orientation step uses a simulated annealing technique for regions of high complexity. In GRASS [3], a genetic algorithm is provided, using mixed integer programming (MIP) as in [6] and an expectation-maximization process to counter the intractability of the MIP model. They obtain results which are intermediary in quality between SSPACE and Opera. In SCARPA [2], the two problems of orienting and ordering the contigs are separated, as in SOPRA. Transforming the graph such that the contig orientation problem has a feasible solution is a fixed-parameter tractable problem with respect to the number of nodes to remove [20]. The authors first use the corresponding algorithm to pre-process the graph and exhibit an optimal relative orientation of the contigs. Then, pre-oriented contigs are ordered using a heuristic, and the removal of articulation vertices is used to limit the size of the connected components. Misassembled contigs are detected and removed. In BESST [21], the authors refine the previous approaches by considering the library insert size distribution to filter relevant linking information, leading to an almost linear scaffold graph, which is easy to treat. The scalability of these approaches is not always proven, and the general increase of the size and amount of available data brings real efficiency questions. Recently, some methods also use phylogenetic or comparative genomic approaches via rearrangement analysis to perform scaffolding on ancient or extant genomes when a set of closely related species genomes are available (see for instance [22][23][24]). This encourages the mix of multiple sources of information for scaffolding a given genome.
The scaffolding problem have been extensively studied in the framework of complexity and approximation. In [15,16], the authors proved that the problem is N P -complete even if the scaffold graph is a planar bipartite graph. They also proposed some lower bounds for exact exponential-time algorithms for SCAFFOLDING according to the EXPONENTIAL-TIME HYPOTHESIS [25]. Finally, they proved that the minimization version of SCAFFOLDING (seeking a minimum-weight cover of the scaffold graph) is unlikely to be approximable within any constant ratio, even if the scaffold graph is a completed bipartite graph [15]. On the positive side, two polynomialtime factor-3-approximation algorithms for the maximization version have been designed. The first is an O(n 2 log n)-time greedy algorithm, the second is based on the computation of a maximum matching (O(n 3 )-time). The theoretical aspects of the scaffold graph have been completed by extensive experimental results [17] for the maximization version on simulated and real datasets.
The main contribution of the present paper lays in both the algorithmic exploration of exact methods for the scaffolding problem with structural parameters, and the initiation of a discussion on the scaffold graph properties. Indeed, the knowledge of those properties may lead to algorithmic improvement, as well as the detection of putative errors in assemblies.

Definitions
Scaffolding problem. The central combinatorial object we are working with in this work is the "scaffold graph" (see also [16,17]).
Definition 1 (Scaffold graph) A scaffold graph is a pair (G, M) of a graph G = (V, E, w) with an even number of vertices, and a weight function w : E N on its edges, and a perfect matching M on G.
Notice that this model is close to what previous work call scaffold graph, except that this graph is not a directed graph, and contigs are represented by edges instead of vertices. Figure 1 shows an example of a scaffold graph.
Graph Theory. Slightly abusing notation, we sometimes consider paths as sets of edges. Furthermore, for a match-    M] are alternating with respect to M, then, we call S a ||S|| p -||S|| c -cover (or simply cover) for G with respect to M (we will omit M if its clear from context). If ||S|| c = 0, we may also refer to S as a ||S|| p -path cover (or simply path cover). The general scaffold problem is expressed as follows (see also [16,17]): SCAFFOLDING (SCA) We call the images of X "bags" and the size of the largest bag minus one is the width of the decomposition. A decomposition of minimum width for G is called the optimal for G and its width is called the treewidth of G. It is a folklore theorem that each graph G has an optimal tree decomposition (T, X) that is nice, that is, each bag X(i) is one of the following types: Leaf bag: i is a leaf of T and X(i) = ∅, Introduce vertex v bag: i is internal with child j and X (i) = X( j) ∪ {v} with v ∉ X(j), Forget v bag: i is internal with child j and X(i) = X(j) − v with v ∈ X(i), Introduce edge uv bag: i is internal with child j and uv ∈ E and uv ⊆ X(i) = X(j), Join bag: i is internal with children j and ℓ and X(i) = X (j) = X(ℓ).
Additionally, each edge e ∈ E is introduced exactly once. Given a width-tw tree decomposition, we can obtain a nice tree decomposition of width tw in n · poly(tw) time [26].
Parameterized Algorithmics. Parameterized complexity theory challenges the traditional view of measuring the running times exclusively in the input size n. Instead, we aim at identifying a parameter of the input that we expect to be small (much smaller than n) in all instances that the application at hand may produce. We then focus on developing algorithms whose exponential part can be bounded in this parameter (see [27] for details). Parameterized complexity allows to prove performance guarantees for preprocessing algorithms: a polynomialtime algorithm that, given an instance x with parameter k, computes an instance x′ with parameter k′ ≤ k such that x is a yes instance if and only if x′ is a yes instance, is called kernelization and the result x′ is the kernel. It is common to present a kernelization by showing various reduction rules that "cut away" the easy parts of the input and, when applied exhaustively to the input, shrink it enough to prove the size bounds.

Scaffolding on trees
Towards developing an algorithm for SCAFFOLDING that runs fast on graphs that are "close to trees", we consider a strategy to solve SCAFFOLDING in case the input graph is a tree. We use a bottom-up dynamic programming approach that computes for each vertex v starting with the leaves, the best possible solutions for the subtree rooted at v. For ease of presentation, we thus consider the input tree T to be rooted arbitrarily with r denoting the root vertex. Note that the solution cannot contain any cycles in this case.
At each vertex, the dynamic programming algorithm needs to decide how many paths should be covered in each of the subtrees of its children. Seeing that it is infeasible to try all combinations, we employ, again, a dynamic programming strategy solving this problem for a given vertex v: We order the children of each vertex v arbitrarily and let s v denote the sequence of children of v with s v [j] denoting the j th child of v and s v [1 ..j . Then, we update the global dynamic programming table of v in order of s v . Thus, when we consider the j th child u of v, we only have to split the paths to be covered between the two subtrees T u and the union of all previously considered subtrees rooted at children of v (that is, In the following, for a vertex v, let C(v) denote its children and par(v) its parent (or ⊥ if v = r), and let T v denote the subgraph of T that is rooted at .0] := ⊥. Algorithm 1: Algorithm to fill the dynamic programming table. 1I We maintain a set I of initialized vertices and, as soon as r is initialized, the algorithm stops. Thus, we assume r ∉ I. Finally, the maximum weight of a s p -0-cover for T can be computed by max c∈{0,1} [c, s p , ∞] r .
Concerning the running time, we note that the body of the loop in line 3 is executed exactly once per vertex and, hence, lines 6,7,9, and 10 are executed s p times per vertex. As they compute the maximum over s p values, the whole algorithm runs in O(n · σ 2 p ) time.

Scaffolding with respect to treewidth
In this section, we develop a parameterized algorithm with respect to the structural parameter "treewidth", solving SCAFFOLDING in O(tw tw ) poly(n, s p , s c ) time. It is based on using a dynamic programming table to keep track of solutions that interact with the bags of a tree decomposition in a certain way (see Figure 2). Since we store for each type of interaction only the best solution and the number of interactions can be bounded in the treewidth, we arrive at the claimed bounds.
To present our algorithm, we use special permutations (involutions) to model matchings that allow reflexive pairing (that is, matching a vertex with itself). Thus, slightly abusing notation, we will consider permutations as sets of pairs. We denote the subset of reflexive pairs of a permutation P by P 1 and the subset of non-reflexive pairs by P 2 .
Then, Gr(P) :=Gr(P 2 ). For permutations P and Q, we define P □ Q as the set of pairs uv such that P(x) = ⊥ ⊕ Q (x) = ⊥ for all x ∈ uv and there is a u-v-path in Gr(P ∪ Q). Furthermore, for a function d : For a bag X(i), let G i denote the subgraph of G that contains exactly those edges of G that are introduced in a bag of the subtree of T that is rooted at X(i) and let G A table entry for the bag X(i) will be indexed by (i) a function d : Figure 2 for an illustration of d and P. An entry will have the following semantics: i contains p paths and c cycles that do not intersect d −1 (1), Semantics: A table entry [d, P, p, c] i is the maximum weight of any set that is eligible with respect to (d, P, p, c, i).
Then, we can read the maximum weight of a solution S for G from [∅, ∅, s p , s c ] r .
Given a nice tree decomposition and a bag X(i) with children X(j) and X(ℓ) (possibly j = ℓ), we compute [d, P, p, c] i depending on the type of the bag X(i) (entries that are not mentioned explicitly are set to −∞): If it has degree 1, then there is a path q dangling from it. If q ends in some other vertex u ∈ X(i) \ {v} = X( j), then, the permutation for X(i) contains uu and the permutation for X(j) contains uv.
Introduce edge uv: Let d(u), d(v) ≥ 1 and, by symmetry, let d(u) ≥ d(v) ≥ 1. We define a value z (representing the assumption that uv is in S ) as follows.
, that is, we let d′ be the result of decreasing both d(u) and d(v) by one.
Case 1 d(u) = d(v) = 2. Then, P avoids u and v. Since we assume uv ∈ S , this means that u and v have dangling paths q u and q v in G S i − uv = G S j that both intersect d′ −1 (1). If q u = q v , then adding uv to S closes a cycle in G S i and the permutation for X(j) contains uv (see Figure 3(a)). Figure 3(c) and Figure 3(d)), otherwise, it contains uu (see Figure 3(b)). Likewise, if q v intersects d′ −1 (1) \ {u, v} in a vertex y, then the permutation for X(j) contains vy, otherwise, it contains vv. Note that, if both q u and q v intersect d′ −1 (1) \ {u, v} (see Figure 3(d)), then we have xy ∈ P. Note further that, if neither q u nor q v intersects d′ −1 (1) \ {u, v} (see Figure 3(b)), then q u ∪ q v ∪ {uv} is a path in G S i that does not intersect d −1 (1) and, thus, we have to decrease the number p of such paths we are looking for in G S j . Formally, let Then, both u and v are not incident to any edges in G S j and, in G s j j , there is just the edge uv incident to both. Thus, we set z only if uv ∈ P.
we have vx ∈ P and the permutation for X(j) contains ux. Otherwise, we have vv ∈ P and the permutation for X(j) contains uu. Note that, since v ∈ d −1 (1), we know that P(v) ≠ ⊥. Join: The join bag X(i) "glues" the (disjoint) partial solutions of its children together at the vertices of X(i) = X( j) = X(ℓ). In particular, the degrees in G S j and in G S have to add up to d. Furthermore, the permutations P 1 and P 2 for X(j) and X(ℓ), respectively, have to "fit" P: For example, let uv ∈ P 1 and vw ∈ P 2 , implying that there are paths q j and q ℓ in G S j and G S , respectively, that are connecting u and v and v and w, respectively. Then, in G S i , there is a single path q j ∪ q ℓ connecting u and w and containing v (with d (v) = 2). Finally, the numbers of paths and cycles have to "fit" p and c: For example, if the permutations for both X (j) and X(ℓ) contain uu (that is, u ∈ (P 1 ∩ P 2 ) 1 ), then G S i contains a path containing u that is neither in G S j nor in G S . On top of this, the remaining paths must be distributed among G S j and G S . Formally, let

Lemma 2
The described algorithm is correct, that is, the computed value [d, P, p, c] i corresponds to the semantics.
Theorem 1 SCAFFOLDING can be solved in O(tw tw ·s p · s c · n) time, given a width-tw tree decomposition of the input instance.

Kernel for restricted scaffolding
Towards developing effective preprocessing for SCAF-FOLDING, we consider a more restricted problem variant, where all paths and cycles of the solution have to be of certain, respective lengths. N P -hardness of this variant can be inferred with the same reduction as used for SCAFFOLDING [15].
G − X is a collection of s p alternating paths, each of length ℓ p ,b and s c alternating cycles, each of length ℓ c , and ω(E) − ω(X) ≥ k?
The length l p denotes the number of edges in the paths. It is necessarily odd in a solution of RESTRICTED SCAF-FOLDING. In the following, we show that RESTRICTED SCAFFOLDING admits a linear-size problem kernel with respect to the parameter FES (feedback edge set), which is the size of a smallest set of edges whose deletion leaves an acyclic graph. To this end, we present a number of intuitive polynomial-time executable reduction rules that shrink the input graph. The first two rules shrink treelike structures of the instance while the remaining rules contract long chains.
For a u-v-path p, we call u and v its outer vertices while all other vertices of p are inner vertices. Let V ○ be the set of vertices that are in some cycle in G and let G ○ = G[V ○ ]. Let G * = G[V * ] denote the convex hull of G ○ , that is, V * is the set of vertices on some shortest path between some vertices u, v ∈ V ○ . For each v ∈ V * , the tree rooted at v that is incident with v in G − E * is called the "pendant tree" T v of v.

Reducing pendant trees
First, we remove isolated paths by cutting them into pieces of length ℓ p . Since the correctness of this is trivial, we omit the proof.
Tree Rule 1 (see Figure 4(a)) Let p be an isolated path in G. If |p| ≥ ℓ p , then split off a length-ℓ p path q and decrease (s p , k) by (1, ω(q)). Otherwise, return "NO".
The next rule cuts branching edges in pendant trees. To this end, it finds an occurrence of a path of length ℓ p that is furthest from the root of the pendant tree.
Tree Rule 2 (see Figure 4(b)) Let T v be the pendant tree of some v ∈ V * , let u be a leaf of maximal distance to v in T v . Let W be the set of vertices × of T v such that a length-ℓ p alternating u-x-path exists and let w be a vertex in W maximizing dist Tv (v, w). Then, delete from G all edges that are incident with the least common ancestor of u and w but are not on the unique u-w-path in T v .
Lemma 3 Let I = (G, M, ω, s p , s c , ℓ p , ℓ c , k) be a yesinstance of RESTRICTED SCAFFOLDING such that G is reduced with respect to Tree Rule 2 and let v ∈ V * . Then, T v is a path p that is alternating with respect to M ∩ E(T v ) and |p| < ℓ p and v is an endpoint of p.
In the following, we assume that G is reduced with respect to Tree Rule 2 and, thus, we can reject all instances for which Lemma 3 does not hold. Hence, in the following, we assume that Lemma 3 holds for the input instance. The next reduction rule helps unify the way in which pendant trees (which are now paths) attach to G * , simplifying the rest of the presentation.
Tree Rule 3 (see Figure 4(c)) Let T v be the pendant tree of some v ∈ V * that is incident with an edge e ∈ E(T v ) \ M. Then, delete from G all edges incident with v that are not in M + e.

Reducing long paths
In the following, we assume that G is reduced with respect to the tree reduction rules presented in the previous section. For i ∈ N, let V i denote the set of vertices of G that have degree i in G. The goal in this subsection is to reduce the length of chains of degree-two vertices in G. Thus, we consider paths whose inner vertices are all in V 2 . We call these paths deg-2 path. If the solution has a cycle running through this path, then we cannot modify its length. Therefore, we focus on paths that are guaranteed to not be in a cycle in the solution: The path reduction rules are based on the idea that the path segments that a deg-2 path p in G is split into by the solution are recurring, that is, if the solution contains the third edge of p, then it also contains the 3 + (ℓ p + 1) th edge of p. This gets slightly more complicated if there are pendants in p, but first, let us consider paths without any pendants.
Path Rule 1 (see Figure 5(a)) Let p = (u 0 , u 1 , . . .) be a deg-2 path with |p| > max{ℓ p , ℓ c } + ℓ p + 1 and let e i := u i u i +1 for all i. Then, for all e i with i ≤ ℓ p , add ω(e i ) to ω(e ℓp +i +1 ) and contract e i . Finally, decrease s p by 1.
The remaining two rules deal with deg-2 paths containing pendants. Note that any path in the solution that contains the pendant can only "continue" in two directions.
Path Rule 2 (see Figure 5 Then, for all i ≤ ℓ p , add ω(e i ) to ω( f i+ ) and contract e i . Finally, decrease s p by 1.
In case of two pendants, the solution is restricted by the distance between the pendants. If we can infer what the solution does by this length, we implement this right away, otherwise, we can represent the choices that a solution can take with a single pendant.
Path Rule 3 (see Figure 6) Let p = (x 0 , x 1 , . . .) be a u-vdeg-2 path such that u, v ∈ V * . Let γ u > 0 and γ v > 0 be the number of edges in T u and T v , respectively, and let w be the vertex at maximum distance to u in T u . Let γ := |p| mod (ℓ p + 1) and let G′ be the result of replacing ux 1 by wx 1 of the same weight in G.
(6) If γ ≠ 1 and γ + γ u + γ v ≡ ℓ p mod (ℓ p + 1), then delete all edges e ∈ E(G) \ (M ∪ {ux 1 }). (7) In all other cases, return "NO". Finally, we can show that an input graph that is reduced with respect to these rules cannot be larger than 11ℓ p · FES(G) or 11ℓ c · FES(G). To prove this, let G † = (V † , E † ) be the result of contracting all degree-2 vertices in G * .
Proof By Lemma 3, we know that ∀ v∈V †|E(T v )| < ℓ p . Therefore, if Lemma 4 is false, there is an edge uv ∈ E † such that there are > 3ℓ vertices between u and v (i.e. uv is a contraction of more than 3ℓ edges of G). Nevertheless, by irreducibility with respect to Path Rule 3, there is at most one vertex w between u and v such that T w is not empty and the distance between u and v cannot be greater than 2ℓ + 1 (by irreducibility with respect to Path Rule 1 and 2). So, |E(T w )| ≥ ℓ, contradicting Lemma 3.

Data
In our experiments, we worked with two datasets, all derived from real genomes (see Table 1). The first one is a set of eleven contig sets, produced from real genomes using the following process: first, the genomes where taken off the NCBI nucleotide database (http://www.ncbi. nlm.nih.gov). Then, for each of them, a set of simulated paired-end reads was generated with the tool wgsim ([28]), with default parameters and a 20X mean covering depth. Thereafter, assembly was performed with the tool minia ( [29]) with a k-mer size k = 29. Reads were mapped on the contigs with bwa ( [30]), with default parameters and using the sampe method. The second dataset is composed of five scaffold graphs, already presented in [31] and [17]. Some of them have been produced by simulating reads, other come from real paired-end reads libraries (see Table 1). Finally, using the scaftools previously developed  in [15,31], we produced the scaffold graphs corresponding to these datasets. See [31] for a more detailed explanation of this pipeline.
The aim of this process is first to produce a benchmark of test graphs that are more realistic than uniformly generated graphs to test our algorithms on, second to study the different parameters that may be interesting to consider for parameterized algorithms, and finally to help to design a more realistic and convenient scaffold graph generator allowing to generate graphs directly, avoiding the complicated pipeline described above (see Table 2 and Table 3). The second dataset was used to study the influence of a preprocessing operation on the scaffold graph, aiming at filtering low informative edges. Simplistically, we removed respectively edges with weight less than 3, 6 and 10. Results are presented in Table 3. We already know that this operation improves the quality of the produced scaffolds, when compared to the original reference genome. Here we would like to observe the behavior of our putative interesting structural parameters according to this filtering. Table 2 presents some parameters of the generated graphs. The h-index is the maximum number such that the graph contains h vertices of degree at least h. It measures the degree of connectivity of a graph. The feedback edge set (FES) is the size of a smallest set of edges whose deletion leaves an acyclic graph. The degeneracy is the smallest value d for which every subgraph has a vertex of degree at most d. It is a kind of measure of sparsity of the graph. We notice that scaffold graphs look quite sparse, with few vertices of high degrees and a feedback edge set number that is usually significantly lower than the number of vertices. While degree-based graph parameters like the degeneracy d are tiny in all instances, we recall that our problem generalizes Hamiltonian Cycle, which is already N P-hard on 3-regular graphs. However, Table 2 shows that, for instances that we expect to be seen in practice, these measures can be assumed constant and, thus, it might be worth considering a combination involving these parameters. Table 3 shows the same statistics for the second set of graphs where edges of small weight (that is, low confidence) were discarded (for weight thresholds of 10, 6, 3 and 0, yielding the original graph with all edges present). We notice that even with a light filtering, the parameters of the scaffold graph are considerably lowered, confirming that raw data suffers from significant noise. Thus, filtering low quality information not only increases the quality of the scaffolding, but also may lead to a significant leap to tractability. Note that, at the time of writing this article, the Staphylococcus Aureus genome is no longer available in the NCBI Nucleotide database ("This RefSeq genome was suppressed because updated RefSeq validation criteria identified problems with the assembly or annotation.") and the Escherichia Coli genome has been updated since the presented version. Errors in assemblies are quite frequent ( [37]) and yield additional issues in scaffolding. Having a better idea of the structure of a classical scaffold graph, and some simple criteria to determine what is anomalous and what is normal (subgraphs induced by repeats for instance) would be of real interest for the analysis of genomes.

Implementation and first results
We implemented the dynamic programming algorithm presented in Section 11 in C++ using boost and the treewidth optimization library TOL [38]. We ran it on a selection of the generated data sets (see Section 11) for which the greedy fill-in heuristic produced tree decompositions of width at most 45. We chose (s p , s c ) = (3, 1) for the ebola and monarch genomes and (s p , s c ) = (20, 3) for the more complicated inputs. The tests were run on an AMD Opteron(tm) Processor 6376 at 2300 MHz. Figure 7 shows running times and memory consumption needed to produce a solution as well as details regarding the used tree decompositions. Figure 8 shows optimal solutions for the two smallest data-sets: ebola and monarch. To validate the proposed scaffolding, we compared the output to the alignment of the contigs on the reference sequence using megablast ( [39]). In the case of the ebola genome, the two isolated contigs (in green) are small (about 150 bp). One of them is placed between the orange contig and its neighbor, the other one finds its place at the other extremity of the chain. In the input scaffold graph, we notice that they are linked to the wrong node, we suppose this is due to their small size, disturbing the alignment step. For the monarch mitochondrion, however, two of the contigs (appearing in red) did not match on the sequence, meaning that the assembly yields some errors. The isolated contig is also small and not correctly connected in the scaffold graph.  Concerning the rice chloroplast, among the 84 contigs, only three were misplaced. All three of them are small (< 130 bp), two of them strongly overlap and the third has two occurrences in the reference genome, one complete and one partial. The seven remaining scaffolds follow exactly the right relative order and orientation of the contigs on the reference genome. Chloroplast genomes have a particularity which make them interesting as data for scaffolding. They present an inverted repeat region of approximately 20 kbp [40]. Figure 9 focuses on one of the scaffolds, where this inverted repeat is shown in pink. The other occurrence is not present in the scaffolding. To notice, the weights inside this repeat are in average higher than outside, which is totally expected since the read cover is approximately doubled for these sequences. Thus, areas of the graph with higher weight would lead to repeat hypothesis, if we are confident in the homogeneity of the cover.

Discussion and conclusion
In this paper, we considered exact approaches to the N P -hard SCAFFOLDING problem which is an integral part of genome sequencing. We showed that it can be solved in polynomial time on trees or graphs that are close to being trees (constant treewidth) by a dynamic programming algorithm. We proved a linear-size problem kernel for the parameter "feedback edge set" for a restricted version in which the lengths of paths and cycles are fixed. We implemented an exact algorithm solving SCAFFOLDING in f(tw) · poly(n) time and evaluated our implementation experimentally, supporting the claim that this method produces high-quality scaffolds. Our experiments are run on data sets that are based on specific real-world genomes, which we also examined to identify a number of interesting parameters that may help design parameterized algorithms and random scaffold graph generators that produce more realistic instances. We are currently transferring the preprocessing rules to the general problem variant. We are highly interested in further graph classes that are closer to real-world instances than trees and on which the problem might be polynomial-time solvable. From an algorithmic point of view, we remark that only few bags of the used tree decompositions are small (see Figure 7). Thus, we envision a hybrid strategy of branching on the vertices in the largest bags before running the dynamicprogramming algorithm and using this "distance x to treewidth-x" parameter to re-analyze the problem.
We intend to perform more extensive tests on diverse datasets, in particular comparing the quality of this approach to existing ones using, for instance, the criteria presented in [9]. This work demands reliable benchmarks, with up-to-date and well assembled genomes. From a bioinformatics point of view, we lay the groundwork for a careful analysis of the scaffold graph, as well as a tool to speed up the above algorithms, as a help to analyze the quality of the assembly, and maybe the structure of the genome itself.

Appendix
Proof of Lemma 1 The proof is by induction over the distance of v to r (descending) and, in case of ties, j (ascending).
and note that S u and S j−1 are path covers of T u and T v j−1 , respectively. Thus, by induction hypothesis, Case 2: uv ∈ M and c = 1. Then, b = 1. Furthermore, the path containing u is split over S u + uv and S j-1 , implying S p = S u Tu+v p (1) ≥ ω(Su)+ω(Sj−1) = ω(S).

Induction
Step ( Case 2: u = w and c = 1. By line 7, there are ℓ and a such that [1, i, j]  Since c = 0 we have uv ∉ S u and, thus, S := S u S j−1 is a path cover for T v j and S p = S u p + S j−1 p = i . Furthermore, Case 4: w ≠ u and c = 1. Case 4a: There are ℓ and a such that [1, i, j] 10). By induction hypothesis, there are path covers S u and S j−1 corresponding to [a, ℓ, ∞] u and [1, i − ℓ + 1, j − 1] v , respectively. Since uv ∉ M, some edge incident to u in T u is in M. Then, S := S u S j−1 is a path cover for T v j and ||S′|| p = ||S u || p + ||S j−1 || p = i. Furthermore, Case 4b: There is some ℓ such that [1, i, j] .j] and, by induction hypothesis, there are path covers S u and S j−1 corresponding to [0, ℓ, ∞] u and [1, i−ℓ+1, j−1] v , respectively. Then, S′ := (S u +uv) ∪ S j−1 is a path cover for T v j and w,v and u are on the same path p in Since u is incident to an edge of M in T u , we know that p does not end in u. Thus, ||S′|| p = ||S u || p + || Case 4c: There is some ℓ such that [1, i, j] ..j] and, by induction hypothesis, there are path covers S u and S j−1 corresponding to [0, ℓ, ∞] u and [1, i − ℓ, j − 1] v , respectively. Thus, S′ := (S u + uv) ∪ S j−1 is a path cover for T v j . Since u is incident to an edge of M in T u , we have

Proof of Section 11
Proof of Lemma 2 The proof is by induction on the distance of i to r (descending). In the induction base, i is a leaf of T and X(i) = ∅ and G i is empty. Thus, the domains of d and P are empty. Thus, [∅, ∅, 0, 0] i = 0 and all other entries are −∞.
For the induction step, we distinguish the possible bag types of X(i) with children X( j) and X(ℓ) (possibly j = ℓ): Introduce vertex v: Since G i does not contain edges incident to v, only tuples with d(v) = 0 and P(v) = ⊥ are valid.
Forget vertex v: Let S i be a maximum weight set that is eligible for (d, P, p, c, i). We show that [d, P, p, c] i = ω(S i ).
"≤": We show that S j is eligible for (d, P, p, c, i) and, thus, ω(S i ) ≥ ω(S j ) = [d, P, p, c] i . Since X(i) ⊂ X( j) and all paths between vertices in d −1 (1) that are represented by P + vv are also represented by P, the first two conditions are satisfied by S j In the following, we proceed in a similar manner for the case that [d, P, p, c] i = z + ω(uv): To show [d, P, p, c] i ≤ ω(S i ), we consider a set S j that corresponds to the entry for X( j) from which [d, P, p, c] i is computed and whose existence is granted by induction hypothesis. Then, we show that S j + uv is eligible for (d, P, p, c, i), implying ω(S i ) ≥ ω(S j + uv) = z + ω(uv) if uv ∉ M and ω(S i ) ≥ ω(S j ) = z if uv ∈ M. Note that, in each of the cases, the degrees of u and v in G s j j are one less than their degrees in G s j +uv i . Thus, S j satisfies Definition 2(1) for d′. [1]We say a set S corresponds to an entry [d, P, p, c] i if S is eligible for (d, P, p, c, i) and its weight ω(S ) = [d, P, p, c] i is maximum among all sets that are.  Figure 3(b)). By the same arguments as in Case 1a, Definition 2(2) is satisfied. Further, adding uv connects two paths such that the resulting path does not intersect d −1 (1). Thus, there are one more such paths in G s j +uv i than in G s j j , implying that Definition 2(3) is satisfied.
contains a path dangling from v and a u-x-path (see Figure 3(c)). Thus, adding uv connects two paths such that the resulting path intersects d −1 (1) exclusively in x. Hence, there is a path dangling from x in G s j +uv i and, hence, S j + uv satisfies Definition 2(2). Since no paths or cycles avoiding d′ −1 (1) are affected by adding uv, Definition 2(3) is satisfied. The case that z = [d′, (P − xx) ∪ {vx, uu}, p, c] j is analogous. Case 1d: z = [d′, (P − xy) ∪ {ux, vy}, p, c] j for some x, y ∈ d −1 (1) with xy ∈ P. Then, G s j j contains paths q u and q v that start in u and v, respectively, and end in x and y, respectively (see Figure 3(d)). Thus, adding uv connects q u and q v to a single path q that starts in x and ends in y, thus intersecting d −1 (1). Hence both Definition 2 (2) and (3)  Note that, in each case, S i − uv satisfies Definition 2(1) since the degrees of u and v decrease by one when removing uv. Case 1: d(u) = d(v) = 2. Then, uv is part of a path q in G S i i and neither u nor v is an endpoint of q. Case 1a: q is closed (that is, a cycle) (see Figure 3 . Thus, uv ∈ P and S i − uv is eligible for (d′, P − uv, p, c, j). Case 3: d(u) = 2 and d(v) = 1. Thus, G S i i contains a path q = (v, u, . . .). If q intersects d −1 (1) − v in a vertex x, then q − uv intersects d′ −1 (1) − u and, thus, S i − uv is eligible for (d′, (P − vx) + ux, p, c, j). Otherwise, q is a dangling from v in G S i i and q − uv is a dangling from u in G s i−uv j , implying that S i − uv is eligible for (d′, (P − vv) + uu, p, c, j).
We show that S i := S j ∪ S t is eligible for (d, P, p, c, i) and, thus, ω(S ) ≥ ω(S i ) = [d 1 , P 1 , p 1 , c 1 ] j + [d 2 , P 2 , p 2 , c 2 ] ℓ = [d, P, p, c] i . First, by (2), we have that Definition 2(1) is satisfied. Second, to show that Definition 2(2) is satisfied, consider some uv ∈ P. Then, there is a path q = (u, x 1 , x 2 , . . ., v) in Gr(P 1 ∪ P 2 ). Note that (1). Thus, G S j ∪S i = G s i i contains a u-x 1 -path, an x 1 -x 2 -path, . . . . The concatenation of these paths forms a u-v path in G S i i . Third, note that, for each uu ∈ (P 1 ) 1 , there is a path q u 1 dangling from u in G s j j such that the only vertex of Analogously, a similar path q u 2 exists for each uu ∈ (P 2 ) 1 in G s . Thus, for each uu ∈ (P 1 ∩ P 2 ) 1 , there is a path q u := q u 1 ∪ q u 2 in G S j ∪S i containing u and avoiding d −1 (1) and q u is neither in G s j j nor in G S . Since G s j j contains p 1 paths avoiding d −1 1 (1) ∪ d −1 1 (0) ⊇ d −1 (1) and G S contains p 2 paths avoiding d −1 we have that G S j ∪S i contains p 1 + p 2 + |(P 1 ∩ P 2 ) 1 | such paths. Similarly, G S j ∪S i concontains c 1 + c 2 + |(P 1 ∩ P 2 ) 2 | cycles avoiding d −1 (1). "≥": Let S j := S i ∩E(G j ) and let S ℓ := S i ∩E(G ℓ ). We show that S j and S ℓ are eligible for tuples (d 1 , P 1 , p 1 , c 1 , j) and (d 2 , P 2 , p 2 , c 2 , ℓ), respectively, such that (2)-(5) hold. First, for all u ∈ X(i) = X( j) = X(ℓ) let d 1 (d 2 ) be the number of edges of G j (G ℓ ) incident with u. Since E(G s i i ) = E(G s j j ) E(G S ), we conclude that (2) holds and Definition 2(1) is satisfied. Second, let P 1 be the set of pairs uv with u, v ∈ d 1 −1 (1) such that, if u ≠ v, then there is a u-v path in G s j j and, if u = v, then G s j j contains a path dangling from u. Let P 2 be defined analogously for d 2 and G s . Then, Definition 2(2) is satisfied. Further, for all uv ∈ P, since d(u) = 1 ⇐⇒ d 1 (u) = 1 ⊕ d 2 (u) = 1, we have P 1 (u) = ⊥ ⊕ P 2 (u) = ⊥ and there is a u-v-path q in G S i i that decomposes into paths of G s j j that connect vertices of d 1 −1 (1) and paths of G s that connect vertices of d 2 −1 (1) . Thus, Gr(P 1 ∪ P 2 ) contains a u-v-path and we conclude that (3) holds. Third, let p 1 and c 1 be the number of paths and cycles, respectively, in G s j j that avoid d 1 −1 (1) .