Efficient parallel and out of core algorithms for constructing large bidirected de Bruijn graphs
 Vamsi K Kundeti^{1}Email author,
 Sanguthevar Rajasekaran^{1}Email author,
 Hieu Dinh^{1},
 Matthew Vaughn^{2} and
 Vishal Thapar^{3}
https://doi.org/10.1186/1471210511560
© Kundeti et al; licensee BioMed Central Ltd. 2010
Received: 1 June 2010
Accepted: 15 November 2010
Published: 15 November 2010
Abstract
Background
Assembling genomic sequences from a set of overlapping reads is one of the most fundamental problems in computational biology. Algorithms addressing the assembly problem fall into two broad categories  based on the data structures which they employ. The first class uses an overlap/string graph and the second type uses a de Bruijn graph. However with the recent advances in short read sequencing technology, de Bruijn graph based algorithms seem to play a vital role in practice. Efficient algorithms for building these massive de Bruijn graphs are very essential in large sequencing projects based on short reads. In an earlier work, an O(n/p) time parallel algorithm has been given for this problem. Here n is the size of the input and p is the number of processors. This algorithm enumerates all possible bidirected edges which can overlap with a node and ends up generating Θ(n Σ) messages (Σ being the size of the alphabet).
Results
In this paper we present a Θ(n/p) time parallel algorithm with a communication complexity that is equal to that of parallel sorting and is not sensitive to Σ. The generality of our algorithm makes it very easy to extend it even to the outofcore model and in this case it has an optimal I/O complexity of $\Theta (\frac{n\mathrm{log}(n/B)}{B\mathrm{log}(M/B)})$ (M being the main memory size and B being the size of the disk block). We demonstrate the scalability of our parallel algorithm on a SGI/Altix computer. A comparison of our algorithm with the previous approaches reveals that our algorithm is faster  both asymptotically and practically. We demonstrate the scalability of our sequential outofcore algorithm by comparing it with the algorithm used by VELVET to build the bidirected de Bruijn graph. Our experiments reveal that our algorithm can build the graph with a constant amount of memory, which clearly outperforms VELVET. We also provide efficient algorithms for the bidirected chain compaction problem.
Conclusions
The bidirected de Bruijn graph is a fundamental data structure for any sequence assembly program based on Eulerian approach. Our algorithms for constructing Bidirected de Bruijn graphs are efficient in parallel and out of core settings. These algorithms can be used in building large scale bidirected de Bruijn graphs. Furthermore, our algorithms do not employ any alltoall communications in a parallel setting and perform better than the prior algorithms. Finally our outofcore algorithm is extremely memory efficient and can replace the existing graph construction algorithm in VELVET.
Keywords
Background
Sequencing genomes is one of the most fundamental problems in modern Biology and has immense impact on biomedical research. De novo sequencing is computationally more challenging when compared to sequencing with a reference genome. On the other hand the existing sequencing technology is not mature enough to identify/read the entire sequence of the genome  especially for complex organisms like the mammals. However small fragments of the genome can be read with acceptable accuracy. The shotgun sequencing employed in many sequencing projects breaks the genome randomly at many places and generates a large number of small fragments (reads) of the genome. The problem of reassembling all the fragmented reads into a small sequence close to the original sequence is known as the Sequence Assembly (SA) problem.
Although the SA problem seems similar to the Shortest Common Super string (SCS) problem, there are in fact some fundamental differences. Firstly, the genome sequence might contain several repeating regions. However, in any optimal solution to the SCS problem we will not be able to find repeating regions  because we want to minimize the length of the solution string. In addition to the repeats, there are other issues such as errors in reads and double strandedness of the reads which make the reduction to SCS problem very complex.
Existing algorithms to address the SA problem can be classified into two broad categories. The first class of algorithms model a read as a vertex in a directed graph  known as the overlap graph[1]. The second class of algorithms model every substring of length k (i.e., a kmer) in a read as a vertex in a (subgraph of) a de Bruijn graph [2].
In an overlap graph, for every pair of overlapping reads, directed edges are introduced consistent with the orientation of the overlap. Since the transitive edges in the overlap graph are redundant for the assembly process they are removed and the resultant graph is called the string graph[1]. The edges of the string graph are classified into optional, required and exact. The SA problem is reduced to the identification of a shortest walk in the string graph which includes all the required and exact constraints on the edges. Identifying such a walk  minimum Swalk, on the string graph is known to be NPhard [3].
In the de Bruijn graph model every substring of length k in a read acts as a vertex [2]. A directed edge is introduced between two kmers if there exists some read in which these two kmers overlap by exactly k  1 symbols. Thus every read in the input is mapped to some path in the de Bruijn graph. The SA problem is reduced to a Chinese Postman Problem (CPP) on the de Bruijn graph, subject to the constraint that the resultant CPP tour include all the paths corresponding to the reads. This problem (Shortest Super Walk) is also known to be NPhard. Thus solving the SA problem exactly on both of these graph models is intractable.
Overlap graph based algorithms were found to perform better (see [4–7]) with Sanger based read methods. Sanger methods produce reads typically around 1000 base pairs long and are very reliable. Unfortunately Sanger based methods are very expensive. To overcome the issues with Sanger reads, new read technologies such as sequencing by synthesis (Solexa) and pyrosequencing (454 sequencing) have emerged. These rapidly emerging read technologies are collectively termed as the Next Generation Sequencing (NGS) technologies. These NGS technologies can produce shorter genome fragments (anywhere from 25 to 500 basepairs long). NGS technologies have drastically reduced the sequencing cost per basepair when compared to Sanger technology. The reliability of NGS technology is acceptable, although it is relatively lower than the Sanger technology. In the recent past, the sequencing community has witnessed an exponential growth in the adoption of these NGS technologies.
On the other hand these NGS technologies can increase the number of reads in the SA problem by a large magnitude. The computational cost of building an overlap graph on these short reads is much higher than that of building a de Bruijn graph. De Bruijn graph based algorithms handle short reads very efficiently (see [8]) in practice compared to the overlap graph based algorithms. However the major bottleneck in using de Bruijn graph based assemblers is that they require a large amount of memory to build the de Bruijn graph.
This limits the applicability of these methods to large scale SA problems. In this paper we address this issue and present algorithms to construct large de Bruijn graphs very efficiently. Our algorithm is optimal in the sequential, parallel, and outofcore models. A recent work by Jackson and Aluru [9] yielded parallel algorithms to build these de Bruijn graphs efficiently. They present a parallel algorithm that runs in O(n/p) time using p processors (assuming that n is a constantdegree polynomial in p). The message complexity of their algorithm is Θ(n Σ). By message complexity we mean the total number of messages (i.e., kmers) communicated by all the processors in the entire algorithm. The distributed de Bruijn graph building algorithm in ABySS [10] is similar to the algorithm of [9].
One of the major contributions of our work is to show that we can build a bidirected de Bruijn graph in Θ(n/p) time with a message complexity of Θ(n). An experimental comparison of these two algorithms on an SGI Altix machine shows that our algorithm is considerably faster. In addition, our algorithm works optimally in an outofcore setting. In particular, our algorithm requires only $\Theta (\frac{n\mathrm{log}(n/B)}{B\mathrm{log}(M/B)})$ I/O operations.
Methods
Preliminaries
Let s ∈ Σ ^{ n } be a string of length n. Any substring s_{ j } (i.e., s[j]s[j + 1] . . . s[j + k  1], n  k + 1 ≥ j ≥ 1) of length k is called a k mer of s. The set of all k mers of a given string s is called the kspectrum of s and is denoted by $\mathbb{S}$ (s, k). Given a kmer s_{ j } , ${\overline{s}}_{j}$ denotes the reverse complement of s_{ j } (e.g., if s_{ j } = AAGTA then ${\overline{s}}_{j}=TACTT$). Let ≤ be the partial ordering among the strings of equal length such that s_{ i } ≤ s_{ j } indicates that the string s_{ i } is lexicographically smaller than s_{ j } . Given any kmer s_{ i } , let ${\widehat{s}}_{i}$ be the lexicographically smaller string between s_{ i } and ${\overline{s}}_{j}$. We call ${\widehat{s}}_{i}$ the canonical kmer of s_{ i } . In other words, if ${s}_{i}\le {\overline{s}}_{i}$ then ${\widehat{s}}_{i}={s}_{i}$ otherwise ${\widehat{s}}_{i}={\overline{s}}_{i}$. A kmolecule of a given kmer s_{ i } is a tuple $({\widehat{s}}_{i},{\overline{\widehat{s}}}_{i})$ consisting of the canonical kmer of s_{ i } and the reverse complement of the canonical kmer. We refer to the reverse compliment of the canonical kmer as noncanonical kmer.
A de Bruijn graph D^{ k } (s) of order k on a given string s is defined as follows. The vertex set V of D^{ k } (s) is defined as the kspectrum of s (i.e., $V=\mathbb{S}\left(s,\text{}k\right)$). We use the notation suf(v_{ i }, l) (pre(v_{ i }, l), respectively) to denote the suffix (prefix, respectively) of length l in the string v_{ i } . Let the symbol ◦ denote the concatenation operation between two strings. The set of directed edges E of D^{ k } (s) is defined as follows:
E = {(v_{ i }, v_{ j } ) suf(v_{ i }, k  1) = pre(v_{ j } , k  1) Λ v_{ i }[1] ◦ suf(v_{ i }, k  1) ◦ v_{ j } [k] ∈ $\mathbb{S}$(s, k + 1)}. We can also define de Bruijn graphs for sets of strings as follows. If S = {s_{ 1 }, s_{2} . . . s_{ n } } is any set of strings, a de Bruijn graph B^{ k } (S) of order k on S has $V={\cup}_{i=1}^{n}\mathbb{S}({s}_{i},k)$ and E = {(v_{ i }, v_{ j } ) suf(v_{ i }, k  1) = pre(v_{ j } , k  1) Λ ∃ l : v_{ i }[1] ◦ suf(v_{ i }, k  1) ◦ v_{ j } [k] ∈ $\mathbb{S}$(s_{ l }, k + 1)}. To model the double strandedness of the DNA molecules we should also consider the reverse complements ($\overline{S}=\{{\overline{s}}_{1},{\overline{s}}_{2},\dots ,{\overline{s}}_{n}\}$) while we build the de Bruijn graph.
To address this a bidirected de Bruijn graph $B{D}^{k}(S\cup \overline{S})$ has been suggested in [3]. The set of vertices V of $B{D}^{k}(S\cup \overline{S})$ consists of all possible kmolecules from S ∪ $\overline{S}$. The set of bi directed edges for $B{D}^{k}(S\cup \overline{S})$ is defined as follows. Let x, y be two kmers which are next to each other in some input string $z\in (S\cup \overline{S})$. Then an edge is introduced between the k molecules v_{ i } and v_{ j } corresponding to x and y, respectively. Please note that two consecutive kmers in some input string always overlap by k  1 symbols. The converse need not be true. The orientations of the arrow heads on the edges are chosen as follows. If the canonical kmers of nodes v_{ i } and v_{ j } overlap then an edge (v_{ i }, v_{ j } ▷, ▷) is introduced. If the canonical kmer of v_{ i } overlaps with the noncanonical kmer of v_{ j } then an edge (v_{ i }, v_{ j }, ▷, ◁) is introduced. Finally if the noncanonical kmer of v_{ i } overlaps with canonical kmer of v_{ j } then an edge (v_{ i }, v_{ j }◁, ▷) is introduced.
Results
Our algorithm to construct bidirected de Bruijn graphs
In this section we describe our algorithm BiConstruct to construct a bidirected de Bruijn graph on a given set of reads. The following are the main steps in our algorithm to build the bidirected de Bruijn graph. Let R_{ f } = {r_{1}, r_{2} . . . r_{ n } }, r_{ i } ∈ Σ ^{ r } be the input set of reads. $\overline{{R}_{f}}=\{\overline{{r}_{1}},\overline{{r}_{2}}\mathrm{...}\overline{{r}_{n}}\}$ is a set of reverse complements. Let $R*\text{}={R}_{f}\cup {\overline{R}}_{f}$ and ${R}^{k}{+\text{1}}^{}={\cup}_{r\in {R}^{*}}\mathbb{S}\left(r,k+1\right)$. R^{k+1}is the set of all (k + 1)mers from the input reads and their reverse complements.

[STEP1] Generate canonical edges: Let (x, y) = (z[1 . . . k], z[2 . . . k + 1]) be the kmers corresponding to a (k + 1)mer z[1 . . . k + 1] ∈ R^{ k +1 }. Recall that $\widehat{x}$ and ŷ are the canonical k mers of x and y, respectively. Create a canonical bidirected edge $({\widehat{v}}_{i},{\widehat{v}}_{j},{o}_{1},{o}_{2})$ for each (k + 1)mer as follows.$(\widehat{{v}_{i}},\widehat{{v}_{j}},{o}_{1},{o}_{2})=\{\begin{array}{l}\begin{array}{cc}(\widehat{x},\widehat{y},\u22b3,\u22b3)& x=\widehat{x},y=\widehat{y}\\ (\widehat{y},\widehat{x},\u22b2,\u22b2)& \begin{array}{l}\text{IF}\widehat{x}\le \widehat{y},\\ \text{ELSE}\end{array}\end{array}\\ \begin{array}{c}x\ne \widehat{x}\wedge y=\widehat{y}\\ (\widehat{x},\widehat{y},\u22b2,\u22b3)& \text{IF}\widehat{x}\le \widehat{y}\end{array}\\ \begin{array}{cc}(\widehat{y},\widehat{x},\u22b2,\u22b3)& \text{ELSE}\end{array}\\ \begin{array}{c}x=\widehat{x}\wedge y\ne \widehat{y}\\ (\widehat{x},\widehat{y},\u22b3,\u22b2)& \text{IF}\widehat{x}\le \widehat{y}\end{array}\\ \begin{array}{cc}(\widehat{y},\widehat{x},\u22b3,\u22b2)& \text{ELSE}\end{array}\\ \begin{array}{c}x\ne \widehat{x}\wedge y\ne \widehat{y}\\ (\widehat{x},\widehat{y},\u22b2,\u22b2)& \text{IF}\widehat{x}\le \widehat{y},\end{array}\\ \begin{array}{cc}(\widehat{y},\widehat{x},\u22b3,\u22b3)& \text{ELSE}\end{array}\end{array}$

[STEP2] Reduce multiplicity: Sort all the bidirected edges in [STEP1], using radix sort. Since the parameter k is always odd this guarantees that a pair of canonical kmers has exactly one orientation. Remove the duplicates and record the multiplicities of each canonical edge. Gather all the unique canonical edges into an edge list ℰ.

[STEP3] Collect bidirected vertices: For each canonical bidirected edge $({\widehat{v}}_{i},{\widehat{v}}_{j},{o}_{1},{o}_{2})\in \mathcal{E}$, collect the canonical kmers ${\widehat{v}}_{i}$, ${\widehat{v}}_{j}$ into list $\mathcal{V}$. Sort the list $\mathcal{V}$ and remove duplicates, such that $\mathcal{V}$ contains only the unique canonical kmers.

[STEP4] Adjacency list representation: The list ℰ is the collection of all the edges in the bidirected graph and the list $\mathcal{V}$ is the collection of all the nodes in the bidirected graph. It is now easy to use ℰ and generate the adjacency lists representation for the bidirected graph. This may require one extra radix sorting step.
Analysis of the algorithm BiConstruct
Theorem 1. Algorithm BiConstruct builds a bidirected de Bruijn graph of order k in Θ(n) time. Here n is number of characters/symbols in the input.
Proof. Without loss of generality assume that all the reads are of the same size r. Let N be the number of reads in the input. This generates a total of (r  k)N (k + 1)mers in [STEP1]. The radix sort needs to be applied in at most 2k log(Σ) passes, resulting in 2k log(Σ)(r  k)N operations. Since n = Nr is the total number of characters/symbols in the input, the radix sort takes Θ(kn log(Σ)) operations assuming that in each pass of sorting only a constant number of symbols are used. If k log(Σ) = O(log N ), the sorting takes only O(n) time. In practice since N is very large in relation to k and Σ, the above condition readily holds. Since the time for this step dominates that of all the other steps, the runtime of the algorithm BiConstruct is Θ(n).
A parallel algorithm for building bidirected de Bruijn graphs
In this section we illustrate a parallel implementation of our algorithm. Let p be the number of processors available. We first distribute N/p reads for each processor. All the processors can execute [STEP1] in parallel. In [STEP2] we need to perform parallel sorting on the list ℰ. Parallel radix/bucket sort  which does not use any alltoall communications can be employed to accomplish this. For example, the integer sorting algorithm of Kruskal, Rudolph and Snir takes $O\left(\frac{n}{p}\frac{\mathrm{log}n}{\mathrm{log}(n/p)}\right)$ time. This will be O(n/p) if n is a constant degree polynomial in p. In other words, for coarsegrain parallelism the run time is asymptotically optimal  which means optimality within a constant. In practice coarsegrain parallelism is what we have. Here n = N (r  k + 1). We call this algorithm ParBiConstruct.
Theorem 2. Algorithm ParBiConstruct builds a bidirected de Bruijn graph in time O(n/p). The message complexity is O(n).
The algorithm of Jackson and Aluru [9] first identifies the vertices of the bidirected graph  which they call representative nodes. Then for every representative node Σ manytomany messages are generated. These messages correspond to potential bidirected edges which can be adjacent on that representative node. A bidirected edge is successfully created if both the representatives of the generated message exist in some processor, otherwise the edge is dropped. This results in generating a total of Θ(nΣ) manytomany messages. The authors in the same paper demonstrate that communicating manytomany messages is a major bottleneck and does not scale well. We remark that the algorithm BiConstruct does not involve any manytomany communications and does not have any scaling bottlenecks.
The algorithm presented in [9] can potentially generate spurious bidirected edges. According to the definition [3] of the bidirected de Bruijn graph in the context of SA problem, a bidirected edge between two kmers/vertices exists if and only if there exists some read in which these two kmers are adjacent. We illustrate this by a simple example. Consider a read r_{ i } = AATGCATC. If we wish to build a bidirected graph of order 3, then AAT, ATG, TGC, GCA, CAT , and ATC form a subset of the vertices of the bidirected graph. In this example we see that the kmers AAT and ATC overlap by exactly 2 symbols. However there cannot be any bidirected edge between them according to the definition  because they are not adjacent. On the other hand the algorithm presented in [9] generates the following edges with respect to the kmer AAT : {(AAT, ATA), (AAT, ATG), (AAT, ATT ), (AAT, ATC)}. The edges (AAT, ATA) and (AAT, ATC) are purged since the kmers ATA and ATC are missing. However bidirected edges with corresponding orientations are established between ATG and ATC. Unfortunately (AAT, ATC) is a spurious edge and can potentially generate wrong assembly solutions. In contrast to their algorithm [9] our algorithm does not use alltoall communications  although we use pointtopoint communications.
Out of core algorithms for building bidirected de Bruijn graphs
Theorem 3. There exists an outofcore algorithm to construct a bidirected de Bruijn graph using an optimal number of I/O ' s.
Proof. Replace the radix sorting with an external Rway merge sort which takes only $\Theta (\frac{n\mathrm{log}(n/B)}{B\mathrm{log}(M/B)})$ I/O' s. Here M is the main memory size, n is the sum of the lengths of all reads, and B is the block size of the disk.
Simplified bidirected de Bruijn graphs
Given a list of candidate canonical bidirected edges, we apply a ListRankingTransform (see Figure 5) which introduces two new nodes v^{ + }, v^{  } for every node v in the original graph. Directed edges corresponding to the orientation are introduced. See Figure 5.
Lemma 1. Let BG(V, E) be a bidirected graph; let BG^{ t } (V^{ t }, E^{ t } ) be the directed graph after applying ListRankingTransform. Two nodes u, v ∈ V are connected by a bidirected path iff u^{ + } ∈ V^{ t } (u^{  } ∈ V^{ t }) is connected to one of v^{ + }(v^{  }) or v^{  }(v^{ + }) by a directed path.
Algorithm for bidirected chain compaction
Given a bidirected graph we now give an outline of the algorithm which compacts all the valid bidirected chains.

STEP1: Apply the ListRankingTransform for each bidirected edge. Let the resultant directed graph be G(V, E).

STEP2: Identify a subset of nodes V' = {v : v ∈ V , (d_{i n}(v) > 1 or d_{ou t}(v) > 1)} and a subset of edges E' = {(u, v) : (u, v) ∈ E, (u ∈ V' or v ∈ V')}.

STEP3: Apply pointer jumping on the directed graph G(V  V', E  E').

STEP4: Now let $({v}_{i}^{+},\dots ,{v}_{j}^{})$ be a maximal chain obtained after pointer jumping. Due to the symmetry in the graph there exists a corresponding complementary chain $({v}_{j}^{+},\dots ,{v}_{i}^{})$. Each chain is replaced with a single node and its label is the concatenation of all the labels in the chain. We should stick to some convention while we give a new number to the newly created node. For example, we can choose min{v_{ i }, v_{ j }} as the new number for the newly created node. In our example if v_{ i }= min{v_{ i }, v_{ j }} then we replace the chain $({v}_{i}^{+},\dots ,{v}_{j}^{})$ with node ${v}_{i}^{+}$ and relabel it with the concatenated label. Similarly, the chain $({v}_{j}^{+},\dots ,{v}_{i}^{})$ will be replaced with the node ${v}_{i}^{}$ and relabeled accordingly.

STEP5: Finally, to maintain the connectivity we need to update the edges in E' to reflect the changes during compaction. Coming back to our example, we have replaced the chain $({v}_{i}^{+},\dots ,{v}_{j}^{})$ with the node ${v}_{i}^{+}$. As a result we need to replace any edge $(x,{v}_{j}^{})\in E\text{'}$ with the edge $(x,{v}_{i}^{+})$. Similarly we also need to update any edges adjacent on ${v}_{j}^{+}$.
Note that all of the above steps can be accomplished with some constant number of radix sorting operations. Figure 3 illustrates the compaction algorithm on a bidirected graph. The red nodes in Figure 3 indicate the nodes in the set V'. Red colored edges indicate the edges in E'. After list ranking we will have four maximal chains as follows: $({Y}_{2}^{},{Y}_{3}^{+}),({Y}_{3}^{},{Y}_{2}^{+}),({X}_{1}^{},{X}_{2}^{+},{X}_{3}^{})({X}_{3}^{+},{X}_{2}^{},{X}_{1}^{+})$. Now if we stick to the convention described in STEP5 we renumber the new node corresponding to the chains $({Y}_{2}^{},{Y}_{3}^{+}),({Y}_{3}^{},{Y}_{2}^{+})$ as Y_{2}. As a result the edges $({Y}_{3}^{+},{Y}_{4}^{})$ and $({Y}_{4}^{+},{Y}_{3}^{})$ are updated (shown in green in Figure 3(c)) as $({Y}_{2}^{},{Y}_{4}^{})$ and $({Y}_{4}^{+},{Y}_{2}^{+})$. Finally the directed edges are replaced with bidirected edges in Figure 3(d).
Analysis of bidirected compaction on parallel and outofcore models
Let ℰ _{ l } be the list of candidate edges for compaction. To do compaction in parallel, we can use a Segmented Parallel Prefix[11] on p processors to accomplish this in time $O(2{\mathcal{E}}_{l}/p+\mathrm{log}(p))$. List ranking can also be done outofcore as follows. Without loss of generality we can treat the input for the list ranking problem as a set S of ordered tuples of the form (x, y). Given S we create a copy and call it S'. We now perform an external sort of S and S' with respect to y (i.e., using the y value of tuple (x, y) as the key) and x, respectively. The two sorted lists are scanned linearly to identify tuples (x, y) ∈ S, (x',y') ∈ S' such that y =x'. These two tuples are merged into a single tuple (x, y') and is added to a list ${\mathcal{E}}_{l}{\text{'}}^{}$. This process is now repeated on ${\mathcal{E}}_{l}{\text{'}}^{}$. Note that if the underlying graph induced by ℰ _{ l } does not have any cycles then ${\mathcal{E}}_{l}^{\text{'}}\le {\mathcal{E}}_{l}/2$; which means that the size of ${\mathcal{E}}_{l}{\text{'}}^{}$ geometrically decreases after every iteration. The I/O complexity of each iteration is dominated by the external sorting and hence bidirected compaction can be accomplished outofcore with $\Theta ({\mathrm{log}}_{2}(C)\frac{{\mathcal{E}}_{l}}{B}{\mathrm{log}}_{\frac{M}{B}}(\frac{{\mathcal{E}}_{l}}{B}))$ I/O operations, where C is the length of the longest chain. Care should be taken to deal with cycles. The sortmerge algorithm mentioned above will run forever in the presence of cycles. To address this we can maintain what is known as join count in every sortmerge phase. Given any S the join count of a tuple (x, y) ∈ S is an indicator variable J(x, y) = 1, if ∃(y, z) ∈ S; else 0. Finally J(S) = Σ_{(x, y )∈S}J(x, y). Notice that the function J(S) strictly decreases in every sort and merge phase and finally becomes zero when there are no cycles. However in the presence of cycles the function J(S) decreases and then remains constant. Thus if the function J(S) remains constant in any two consecutive sortmerge phases we can stop iterating and report that there are some cycles. Once we stop the list ranking we can easily detect the edges in the cycles. Our implementation of this outofcore list ranking based on this idea is available at http://trinity.engr.uconn.edu/~vamsik/exlistrank.
Improving the construction of the bidirected de Bruijn graph in some practical assemblers
Discussion
Before we go into the discussion we briefly describe our experimental setting. We used a SGIAltix 64bit, 64 node supercomputer with a RAM of 2 GB per node for our parallel algorithm experiments. For our sequential outofcore experiments we used a 32bit x86 machine with 1 GB of RAM. All our algorithms are implemented in C under Linux environment.
We have compared the performance of our algorithm and that of Jackson and Aluru [9]. We refer to the later algorithm as JA. To make this comparison fair, we have implemented the JA algorithm (because their implementation is unavailable) also on the same platform that our algorithm runs on. We have used the SGI/Altix IA64 machine for all of our experiments. Our implementation uses MPI for communication between the processors. We used a test set of 8 million unpaired reads obtained from sequencing a plant genome at CSHL. The performance of both the algorithms is measured with various values of k (the de Bruijn graph parameter) on multiple processors. Both the algorithms (JA and ParBiConstruct) use the same underlying parallel sorting routines.
Comparison of runtime between the JA algorithm and our algorithm on an input with 8 million reads from a plant genome
JA ALGO  ParBiConstruct  

P  k  U(sec)  S(sec)  R(mm:ss)  U(sec)  speedup  S(sec)  speedup  R(mm:ss)  Speedup 
8  21  6388.08  21.54  13:26.42  1028.61  6.21  8.08  2.67  2:10.77  6.17 
8  27  2220.41  14.85  5:47.71  500.22  4.44  3.72  3.99  1:04.12  5.42 
8  33  1662.14  6.63  3:29.89  288.14  5.77  2.47  2.68  0:37.46  5.60 
8  35  496.89  3.81  1:03.77  159.59  3.11  1.03  3.70  0:21.19  3.01 
12  21  9116.72  20.31  12:43.14  985.97  9.25  6.91  2.94  1:23.95  9.09 
12  27  2626.30  13.94  3:59.85  479.10  5.48  3.20  4.36  0:41.32  5.80 
12  33  2320.07  6.14  3:15.15  317.08  7.32  2.35  2.61  0:27.76  7.03 
12  35  561.02  3.47  0:48.19  166.06  3.38  1.13  3.07  0:15.04  3.20 
16  21  11943.18  19.91  12:29.49  1044.31  11.44  7.42  2.68  1:06.91  11.20 
16  27  2889.32  12.95  3:50.81  498.36  5.80  3.42  3.79  0:32.48  7.11 
16  33  2971.37  6.49  3:07.41  357.63  8.31  2.53  2.57  0:23.64  7.93 
16  35  580.08  3.67  0:37.63  170.56  3.40  1.33  2.76  0:11.85  3.18 
24  21  17744.55  20.26  12:21.94  1205.51  14.72  8.00  2.53  0:51.74  14.34 
24  27  3399.06  15.24  2:39.02  658.02  5.17  4.00  3.81  0:28.72  5.54 
24  33  4981.96  7.89  6:25.41  652.80  7.63  3.69  2.14  0:51.65  7.46 
24  35  750.26  5.57  0:37.30  295.58  2.54  2.23  2.50  0:14.19  2.63 
32  21  23119.80  20.95  12:04.89  1070.31  21.60  8.40  2.49  0:34.90  20.77 
32  27  3897.63  15.20  2:12.96  464.24  8.40  5.21  2.92  0:15.81  8.41 
32  33  5132.44  9.33  3:11.64  534.38  9.60  4.84  1.93  0:21.01  9.12 
32  35  973.78  5.35  0:37.30  324.23  3.00  3.35  1.60  0:13.16  2.83 
48  21  37116.65  26.37  13:03.67  2422.83  15.32  13.46  1.96  0:55.33  14.16 
48  27  4932.47  21.70  1:48.76  1112.72  4.43  10.51  2.06  0:26.11  4.17 
48  33  6658.10  13.72  3:11.84  1157.33  5.75  9.41  1.46  0:34.33  5.59 
48  35  1020.88  10.95  0:30.47  447.84  2.28  8.18  1.34  0:14.08  2.16 
64  21  51443.09  35.46  16:25.33  1938.25  26.54  25.05  1.42  0:39.57  24.90 
64  27  6304.66  33.49  2:05.96  1029.89  6.12  21.80  1.54  0:21.64  5.82 
64  33  15314.25  23.95  5:57.82  673.55  22.74  21.31  1.12  0:17.08  20.95 
64  35  1048.56  20.71  0:25.93  358.62  2.92  18.86  1.10  0:09.87  2.63 
The user time of our algorithm is also consistently superior compared to the user time of JA. This clearly is because we do much less local computations. In contrast, JA needs to do a lot of local processing, which arises from processing all the received edges, removing redundant ones, and collecting the necessary edges to perform manytomany communications.
Comparison of memory between the JA algorithm and our algorithm on an input with 8 million reads from a plant genome.
JA ALGO  ParBiConstruct  

P  k  RES.MEM(Mb)  SHR.MEM(Mb)  RES.MEM(Mb)  efficiency  SHR.MEM(Mb)  efficiency 
8  21  903.714  3.690  244.603  3.695  3.328  1.109 
8  27  675.516  3.638  146.750  4.603  3.232  1.126 
8  33  289.787  3.588  62.580  4.631  3.007  1.193 
8  35  116.938  7.527  14.625  7.996  2.397  3.140 
12  21  681.664  4.782  156.875  4.345  4.256  1.124 
12  27  265.304  6.882  91.219  2.908  3.889  1.769 
12  33  241.630  4.721  46.805  5.163  3.675  1.285 
12  35  94.548  7.518  12.276  7.702  2.771  2.713 
16  21  501.175  5.726  152.995  3.276  5.175  1.107 
16  27  333.487  11.730  77.523  4.302  4.823  2.432 
16  33  166.395  5.714  42.746  3.893  4.628  1.235 
16  35  68.851  8.172  13.578  5.071  3.443  2.374 
24  21  344.888  8.081  105.723  3.262  7.133  1.133 
24  27  214.681  16.271  61.674  3.481  6.690  2.432 
24  33  115.630  7.830  34.583  3.344  5.959  1.314 
24  35  52.110  10.082  15.990  3.259  4.472  2.254 
32  21  260.063  10.413  86.157  3.018  9.070  1.148 
32  27  179.657  20.550  49.737  3.612  8.096  2.538 
32  33  95.790  9.968  34.180  2.803  7.495  1.330 
32  35  47.602  11.290  15.289  3.113  4.905  2.302 
48  21  186.225  14.109  71.073  2.620  12.405  1.137 
48  27  112.237  19.623  47.596  2.358  11.255  1.744 
48  33  75.059  11.996  30.929  2.427  9.070  1.323 
48  35  37.500  10.643  19.576  1.916  6.764  1.573 
64  21  150.324  17.311  63.449  2.369  14.640  1.182 
64  27  105.692  23.570  43.590  2.425  12.975  1.817 
64  33  66.095  14.168  31.853  2.075  10.653  1.330 
64  35  38.823  12.159  22.275  1.743  8.096  1.502 
Scalability of our algorithm for up to 128 million reads. Since our test dataset contained only 8 million reads, we generated these reads randomly, each read was of size 35 and k = 33 was used.
reads  p  user time (ticks)  sys time (ticks)  wall time (min:sec) 

16777216  2  37147  259  1:14.02 
33554432  2  148070  1219  2:42.66 
67108864  2  340653  2348  6:18.77 
134217728  2  770922  5560  15:00.42 
16777216  4  37254  85  0:38.95 
33554432  4  99067  677  1:48.60 
67108864  4  240861  1931  4:14.57 
134217728  4  471196  4272  8:29.62 
16777216  8  20217  57  0:21.90 
33554432  8  47319  322  0:55.41 
67108864  8  153782  1781  2:39.18 
134217728  8  314281  3456  5:17.65 
16777216  16  16951  55  0:19.73 
33554432  16  17936  135  0:25.64 
67108864  16  64408  804  1:10.91 
134217728  16  135562  2148  2:21.83 
16777216  32  12901  40  0:16.38 
33554432  32  9973  191  0:17.55 
67108864  32  46659  486  0:53.32 
134217728  32  82414  950  1:28.87 
Outofcore algorithm versus VELVET graph building algorithm
Comparison of our algorithm with VELVET on a 32bit machine with 1 GB of RAM
OUT OF CORE  VELVET  

reads  initial edges  edges after multiplicity reduction and canonicalization  page faults  time hrs:min:sec  page faults  Time hrs:min:sec 
2097152  31457280  21387750  5593  00:10:31  82  00:4:31 
4194304  62914560  32443128  23084  00:22:40  2419455  07:50:02* 
6291456  94371840  39460652  40920  00:34:09  1255816  04:22:28* 
8388608  125829120  44840055  45716  00:38:13  1064952  03:50:02* 
The program is available at http://trinity.engr.uconn.edu/~vamsik/exbuildvgraph/.
Conclusions
In this paper we have presented an efficient algorithm to build a bidirected de Bruijn graph, which is a fundamental data structure for any sequence assembly program  based on an Eulerian approach. Our algorithm is also efficient in parallel and out of core settings. These algorithms can be used in building large scale bidirected de Bruijn graphs. Also, our algorithm does not employ any alltoall communications in a parallel setting and performs better than that of [9]. Finally, our outofcore algorithm can build these graphs with a constant amount of RAM, and hence can act as a replacement for the graph construction algorithm employed by VELVET [8].
Declarations
Acknowledgements
This work has been supported in part by the following grants: NSF 0326155, NSF 0829916 and NIH 1R01LM01010101A1.
Authors’ Affiliations
References
 Kececioglu JD, Myers EW: Combinatorial algorithms for DNA sequence assembly. Algorithmica 1995, 13(1–2):7–51. 10.1007/BF01188580View ArticleGoogle Scholar
 Pevzner PA, Tang H, Waterman MS: An Eulerian path approach to DNA fragment assembly. Proceedings of the National Academy of Sciences of the United States of America 2001, 98(17):9748–9753. 10.1073/pnas.171285098View ArticlePubMedPubMed CentralGoogle Scholar
 Medvedev P, Georgiou K, Myers G, Brudno M: Computability of models for sequence assembly. Workshop on Algorithms for Bioinformatics (WABI), LNBI4645 2007, 289–301. full_textGoogle Scholar
 Huang X, Wang J, Aluru S, Yang S, Hillier L: PCAP: A wholegenome assembly program. Genome research 2003, 13(9):2164–2170. [Cited By (since 1996): 61] [http://www.scopus.com] [Cited By (since 1996): 61] 10.1101/gr.1390403View ArticlePubMedPubMed CentralGoogle Scholar
 Myers EW, Sutton GG, Delcher AL, Dew IM, Fasulo DP, Flanigan MJ, Kravitz SA, Mobarry CM, Reinert KHJ, Remington KA, Anson EL, Bolanos RA, Chou H, Jordan CM, Halpern AL, Lonardi S, Beasley EM, Brandon RC, Chen L, Dunn PJ, Lai Z, Liang Y, Nusskern DR, Zhan M, Zhang Q, Zheng X, Rubin GM, Adams MD, Venter JC: A wholegenome assembly of Drosophila. Science 2000, 287(5461):2196–2204. 10.1126/science.287.5461.2196View ArticlePubMedGoogle Scholar
 Batzoglou S, Jaffe DB, Stanley K, Butler J, Gnerre S, Mauceli E, Berger B, Mesirov JP, Lander ES: ARACHNE: A wholegenome shotgun assembler. Genome research 2002, 12: 177–189. 10.1101/gr.208902View ArticlePubMedPubMed CentralGoogle Scholar
 PHRAP ASSEMBLER[http://www.phrap.org/]
 Zerbino DR, Birney E: Velvet: Algorithms for de novo short read assembly using de Bruijn graphs. Genome research 2008, 18(5):821–829. 10.1101/gr.074492.107View ArticlePubMedPubMed CentralGoogle Scholar
 Jackson BG, Aluru S: Parallel construction of bidirected string graphs for genome assembly. International Conference on Parallel Processing 2008, 346–353. full_textGoogle Scholar
 Simpson JT, Wong K, Jackman SD, Schein JE, Jones SJM, Birol I: ABySS: A parallel assembler for short read sequence data. Genome research 2009, 19(6):1117–1123. 10.1101/gr.089532.108View ArticlePubMedPubMed CentralGoogle Scholar
 Jaja J: Introduction to Parallel Algorithms. Addison Wesley;Google Scholar
 Silberschatz A, Baer P, Gagne G: Operating System Princples. Wiley;Google Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.