Constructing the de Bruijn graph
The de Bruijn graph derived from length-k reads of a genome g contains a node for each (k - 1)-mer present in the genome, and a directed edge u → v for every instance where the (k - 1)-mer represented by v occurs immediately after the (k - 1)-mer for u. In other words, there is an edge if u occurs at position i and v occurs starting at position i + 1. These edges can be obtained by considering every k-mer present in the genome and connecting the node for its (k - 1)-prefix to the node for its (k - 1)-suffix. Crucially, the de Bruijn graph can be a multigraph, with parallel edges between nodes. Self-loops are also permitted. Note that this definition of a de Bruijn graph differs from the traditional definition described in the mathematical literature in the 1940s [22] that requires the graph to contain all length-k strings that can be formed from an alphabet (rather than just those strings present in the genome). The more general formulation of the de Bruijn graph used in this paper, and the closely related string graph, are commonly used in the sequence assembly literature [23–26] under the same name, and we follow the same convention. Throughout this paper, we use s, t, u, v and w to denote both nodes within the de Bruijn graph and the DNA sequences they represent. The de Bruijn graph encodes all the information available from the input of the list of all k-mers of g. Because an edge u → v only occurs if v follows u somewhere in the genome g, we must traverse every edge to reconstruct the genome, and thus the correct genome sequence corresponds to an Eulerian path through the de Brujin graph [27].
We classify the nodes of a de Bruijn graph as decision and non-decision nodes. Decision nodes--those with more than one predecessor or more than one successor--are the primary complication in reconstructing the correct genome sequence from a de Brujin graph as they introduce ambiguity in possible graph traversals. Non-decision nodes (with at most one successor and at most one predecessor) correspond to sections of the graph that can be unambiguously reconstructed.
Simplification of the de Bruijn graphs
We applied the following graph transformations to the idealized de Bruijn graph constructed from the genome sequences. Each transformation preserves the entire set of paths consistent with the graph. See Figure 1 for illustrations of the various transformations. Successively applied, these transformations simplify the graph towards the smallest equivalent graph possible -- the most parsimonious graph structure that encodes the full information contained in the set of k-mers.
We use the term path to be an ordered sequence of edges. If the path starts and ends at the same node, it is a circuit. A trail is a path that is not a circuit. If the nodes s and t with unequal in- and out-degrees exist then the de Bruijn graph admits an Eulerian trail starting at s and ending at t, otherwise it contains only Eulerian circuits. Either is possible for graphs obtained from genomic sequence data: a circular chromosome necessarily yields a graph with circuits, while a linear genome may or may not, depending on whether the 3'-most k - 1 nucleotides equal the 5'-most.
Compressing paths
If u → v is an edge, we say u is a predecessor of v and v is a successor of u. If a node u has a successor v, and if u is the only predecessor of v, then nodes u and v can be merged into a single node that has the predecessors of u and the successors of v (Figure 1a), similar to previous approaches [14–17]. This transformation, called path compression, can be made even if there are multiple, parallel edges from u to v. Prior to path compression, there is a node for each (k - 1)-mer that appears in the genome, and the number of edges equals the genome length. Following path compression, the graphs dramatically decrease in size as non-branching chains are replaced by single nodes.
Path compression is a standard, natural technique, but additional simplifications are also possible for reducing the size of the genome graph even further. Decision nodes--those with more than one predecessor or more than one successor--are the complication in extracting the correct genome sequence from a genome graph. We distinguish three types of decision nodes: forward decision nodes have more than one successor, but a single predecessor; backward decision nodes have more than one predecessor, but a single successor; and full decision nodes have both more than one predecessor and more than one successor. We refer to forward and backward decision nodes as half decisions nodes.
Compressing tree-like regions
Any Eulerian graph G can be decomposed into simple, edge-disjoint cycles. (A simple cycle is one that uses each node at most once.) From this decomposition, a cycle graph cycle(G) can be constructed with a vertex for each simple cycle in G and an edge connecting cycles if they share a node in the Eulerian graph. If cycle(G) is a tree, then G only has a single Eulerian tour [28]. More generally, if an induced subgraph of cycle(G) is a tree, then the corresponding region has a unique traversal in any Eulerian tour. Thus, these tree-like regions can be collapsed into a single node labeled by the sequence constructed from that unique tour (Figure 1b).
Rather than explicitly constructing the cycle graph, in practice it is more efficient to recursively collapse the tree-like regions, starting with the leaves. Leaves can be identified by pairs of nodes u, v for which u is both the only predecessor and only successor of v, and u has only a single in-edge and a single out-edge that are not adjacent to v. In this case, the sequence of v can be appended to u, and v can be eliminated. A node u that has one or more self-loops and has only one in- and one out-edge that are not self-loops is also a leaf, and its self-loops can be collapsed into u. After collapsing a leaf, we perform any newly possible path compressions near the collapsed node.
Splitting half decision nodes
The graph can be modified by splitting forward and backward decision nodes into several new nodes without changing the strings represented by the graph (Figure 1c). If v is a forward decision node with predecessor u and successors w1, ..., w
m
(m ≥ 2) then node v can be replaced by m new nodes v1, ..., v
m
. Edges v
i
→ w
i
are added with multiplicity equal to the multiplicity of the edge v → w
i
. Backward decision nodes can similarly be split. Splitting a forward decision node in this manner may cause its predecessor to become a forward decision node, which can subsequently be split. In this way we can "unzip" a sequence of half decision nodes, thereby shifting the decision point in the direction of the unique predecessor/successor. While this transformation increases the number of nodes, the new nodes v
i
can often be eliminated by an application of path compression.
Exploiting edge multiplicities
If, at each decision node, each incoming edge could be correctly paired with an outgoing edge, there would be no difficulty in reconstructing the correct genome. Typically, without additional information, it is not possible to make any such pairings. In some cases, however, the edge multiplicities can be used to identify a predecessor of a node that must be matched with a successor (Figure 1d). In particular, let u → v → w be three nodes in a path such that the edge u → v has the highest multiplicity among edges entering v and v → w has the highest multiplicity among edges leaving v. Let c
u
and c
w
be the multiplicities of the edges u → v and v → w, respectively. If u ≠ w, we can infer that the path u → v → w must be part of any Eulerian tour if c
u
> d+(v) -c
w
, where d+(v) is the number of edges leaving v. In other words, all the incoming edges coming from u cannot be matched without using at least one outgoing edge adjacent to w (using reasoning similar to the pigeonhole principle). This reasoning can be used to reconstruct longer subsequences of DNA, but because it generally increased the size of the final graphs due to interactions with the other simplification techniques described here, we do not consider this type of simplification further in this paper.
Converting non-decision nodes to edges
Repeated application of these transformations typically leads to many three-node paths u → v → w where u is the only predecessor of v and w is the only successor of v but, because there are other edges incident to u and w, the path cannot be collapsed. We can, however, replace node v with an edge u → w labeled with the sequence of v (Figure 1e). Following path compression and this transformation, the graph contains either just a single node or only decision nodes that have both more than one predecessor and more than one successor.
Maximal compression
The simplification procedures aim to: (i) coalesce multiple adjacent non-decision nodes into single nodes or edges; and (ii) resolve some of the apparent ambiguity represented by the decision nodes. In order to estimate the smallest graph that is equivalent to the input de Bruijn graph, we apply each of these transformations in turn. Because several of the transformations can lead to opportunities for new applications of other transformations, the order in which we apply them is important. We start by performing path compression. Then, all tree-like regions of the graph are collapsed, followed by another round of path compression. We then split all backward decision nodes, then all forward decision nodes, and perform the path compressions that have been newly made possible. Finally, we convert non-decision nodes into edges. These techniques reduce, on average, the number of edges in the de Bruijn graphs by a further 65% compared with performing only path compression.
Counting words consistent with de Bruijn graphs
In 1975, J.P. Hutchinson and H.S. Wilf [29, 30] gave an expression that can be used to compute the number of unique words consistent with a de Bruijn graph. The following theorem, from [30], gives an expression for the number of possible reconstructions that are consistent with the de Bruijn graph. Let d-(u) and d+(u) be, respectively, the in- and out-degrees of vertex u (where the graph will be clear in context). Because the de Bruijn graph is Eulerian, d+(u) = d-(u) for all u with the possible exception of two nodes s and t for which d-(s) = d+(s) - 1 and d+(t) = d-(t) - 1.
Theorem 1 (Adapted from [30]) Let A = (a
uv
) be the adjacency matrix for an n-vertex de Bruijn graph G, with both a
uv
> 1 and self-loops allowed. If d+(v) = d-(v) for all v, then choose a vertex t arbitrarily, otherwise pick the unique t such that d+(t) = d-(t) - 1. Finally, let r
u
= d+(u) + 1 if u = t or r
u
= d+(u) otherwise. Then the number of words consistent with G that can be spelled ending with node t is given by
where L is the n × n matrix with r
u
-a
uu
along the diagonal and -a
uv
in entry (u, v).
The values r
u
in Theorem 1 are the number of times that the sequence represented by u must appear in the output word, given that path for the word ends at node t and thus that the output word ends with the sequence represented by t. When traversing the graph, we output u when we exit node u, except for the last node on a trail or circuit which must also be output when it is entered for the last time so that the k-mer associated with the last traversed edge is output. See [29, 30] for a proof of the above theorem. Briefly, it is proved by an extension of the "BEST" theorem [31–33] for counting the number of Eulerian circuits modified to correct for multiple edges and for the fact that we allow Eulerian trails in addition to only circuits. For example, the sequence q = "sababababab" yields a three node de Bruijn graph when k = 5. For this graph, the number of tours is 3!3! = 36 but they all yield the same sequence. By increasing the length of q in this example, we can increase the number of Eulerian circuits arbitrarily, while maintaining the property that all circuits reconstruct a single word.
(G, t) is computable in polynomial time.
To correctly apply Theorem 1 to the short-read assembly problem, we have to consider the cases of linear and circular genomes separately. In the typical circular case, the graph D
k
(g) will contain Eulerian circuits, and any node can be chosen as the final node t. Theorem 1 will count the number of distinct linear words q ending at t. If t occurs more than once in the genome but there is some unique sequence of DNA in the genome, then cycling q to end at each instance of t yields a different word counted by
(D
k
(g), t), each of which is equivalent to the same cyclic ordering of the nodes. In this case, because there are d+(t) occurrences of t, the number of cyclically equivalent words starting at t is
(G
k
(g), t)/d+(t).
We say a de Bruijn graph of a circular genome is periodic if it can be traversed with an Eulerian path of the form (tw1tw2t... tw
m
)ct for c > 1 and some choice of (possibly empty) paths w1, ..., w
m
that do not contain t. (The exponentiation notation indicates c repetitions of the parenthesized path.) In the very special case of periodic, circular genomes the number of solution words does not equal
(D
k
(g), t)/d+(t) for each t because many words may be cyclic permutations of each other. We do not encounter this special case in the chromosomes studied here. In particular, if there is any unique sequence of DNA of length ≥ k - 1 then the graph will not be periodic.
In the less common case of linear genomes, if the graph admits an Eulerian trail, there is a single choice of which node to chose as t. If a linear genome yields an Eulerian graph with Eulerian circuits (because its k - 1 suffix equals its k - 1 prefix), then any node can be used as the final node t, but the linearization of the Eulerian circuit may produce a chromosome with the wrong start position, a minor misassembly.
Extracting idealized contigs and computing the N50 size
Every edge in the genome graph corresponds to a sequence of DNA in the chromosome. Initially, these edges correspond to k-mer reads, but after path compression and other simplifications discussed above, these edges can represent long, unambiguous stretches of the chromosome bounded on either end by ambiguities. Hence, it is reasonable to use the set of remaining edges to estimate a set of contigs that can be extracted from a genome graph. We create one contig for each edge. If non-decision nodes are collapsed into edges as described above, some edges will be labeled with DNA sequences. If edge u → v has been labeled, the contig associated with an edge u → v will contain the sequence represented by node u concatenated with the sequence assigned to edge u → v, otherwise the contig will contain only the sequence represented by node u. If u → v is labeled, then the length of the contig is taken to be length(u) + length(u, v) - 2(k - 2). If u → v is unlabeled, the length of the contig is taken to be length(u) - (k - 2). Because the sequences represented by adjacent nodes and edges overlap by k - 2 nucleotides, the terms (k -2) and 2(k -2) ensure that overlapping bases are counted only once. Hence, the sum of the lengths of the contigs equals the chromosome length. The N50 size is the length m of the largest contig such that at least 50% of the genome is covered by contigs of size ≥ m; it is a commonly used measure of the connectedness of the assembly. Computing the N50 size using the idealized contigs defined above gives an estimate for the achievable N50 size. This estimate will probably be much larger than what can actually be obtained from real, noisy data, and thus it provides a reasonable upper bound on that size in practice.
Counting reconstructible genes
For many purposes, it is sufficient to reconstruct only the coding regions of the genes of a species. We consider a gene to be reconstructible if the sequence encoding it can be unambiguously inferred from the de Bruijn graph to be present in the genome. For example, if a gene contains a repeat, it is less likely to be reconstructible because the repeat will introduce several possible sequences downstream of the start codon.
Formally, to test whether a gene is reconstructible, we first find the node s that represents the region of the chromosome containing the start codon of the gene. We then find all paths that start at s and continue until they pass through first a backward decision and then a forward decision. For this purpose, a full decision node counts as simultaneously a backward and forward decision. If the region of the gene is wholly contained within a region represented by such a path, then we say the gene is reconstructible. This is a local definition of reconstructibility, requiring only a neighborhood around the node containing the start codon s to be considered. For the analysis presented here, we test each of the genes for reconstructibility on a graph in which the tree-like regions have been collapsed and path compression has been performed (see above).
Forward decisions are not by themselves sufficient to confuse the reconstruction: because every edge must be taken, we can exhaustively follow each path. Nodes with more than one predecessor (backward decisions), however, "pollute" the walk and once a subsequent forward decision is encountered it is not possible to locally determine which branch should be taken. It is possible that a gene may not be considered reconstructible when walking from its start codon to its stop codon but would be reconstructible if the definition were changed to consider walking from its stop codon to its start codon. In practice, we expect few genes are reconstructible by walking forward but not reconstructible by walking backward, hence we restrict ourselves to the definition of reconstructibility above.