Efficient parallel and out of core algorithms for constructing large bi-directed de Bruijn graphs

Background Assembling genomic sequences from a set of overlapping reads is one of the most fundamental problems in computational biology. Algorithms addressing the assembly problem fall into two broad categories - based on the data structures which they employ. The first class uses an overlap/string graph and the second type uses a de Bruijn graph. However with the recent advances in short read sequencing technology, de Bruijn graph based algorithms seem to play a vital role in practice. Efficient algorithms for building these massive de Bruijn graphs are very essential in large sequencing projects based on short reads. In an earlier work, an O(n/p) time parallel algorithm has been given for this problem. Here n is the size of the input and p is the number of processors. This algorithm enumerates all possible bi-directed edges which can overlap with a node and ends up generating Θ(nΣ) messages (Σ being the size of the alphabet). Results In this paper we present a Θ(n/p) time parallel algorithm with a communication complexity that is equal to that of parallel sorting and is not sensitive to Σ. The generality of our algorithm makes it very easy to extend it even to the out-of-core model and in this case it has an optimal I/O complexity of Θ(nlog(n/B)Blog(M/B)) (M being the main memory size and B being the size of the disk block). We demonstrate the scalability of our parallel algorithm on a SGI/Altix computer. A comparison of our algorithm with the previous approaches reveals that our algorithm is faster - both asymptotically and practically. We demonstrate the scalability of our sequential out-of-core algorithm by comparing it with the algorithm used by VELVET to build the bi-directed de Bruijn graph. Our experiments reveal that our algorithm can build the graph with a constant amount of memory, which clearly outperforms VELVET. We also provide efficient algorithms for the bi-directed chain compaction problem. Conclusions The bi-directed de Bruijn graph is a fundamental data structure for any sequence assembly program based on Eulerian approach. Our algorithms for constructing Bi-directed de Bruijn graphs are efficient in parallel and out of core settings. These algorithms can be used in building large scale bi-directed de Bruijn graphs. Furthermore, our algorithms do not employ any all-to-all communications in a parallel setting and perform better than the prior algorithms. Finally our out-of-core algorithm is extremely memory efficient and can replace the existing graph construction algorithm in VELVET.


I. INTRODUCTION
The genomic sequence of an organism is a string from the alphabet Σ = {A, T, G, C}.This string is also referred as the Deoxyribonucleic acid (DNA) sequence.DNA sequences exist as complementary pairs (A − T , G − C) due to the double strandedness of the underlying DNA structure.Several characteristics of an organism are encoded in its DNA sequence, thereby reducing the biological analysis of the organism to the analysis of its DNA sequence.Identifying the unknown DNA sequence of an organism is known as de novo sequencing and is of fundamental biological importance.On the other hand the existing sequencing technology is not mature enough to identify/read the entire sequence of the genomeespecially for complex organisms like the mammals.However small fragments of the genome can be read with acceptable accuracy.The shotgun method employed in many sequencing projects breaks the genome randomly at several places and generates several small fragments (reads) of the genome.The problem of reassembling all the fragmented reads into a small sequence close to the original sequence is known as the Sequence Assembly (SA) problem.
Although the SA problem seems similar to the Shortest Common Super string (SCS) problem, there are in fact some fundamental differences.Firstly, the genome sequence might contain several repeating regions.However, in any optimal solution to the SCS problem we will not be able to find repeating regions -because we want to minimize the length of the solution string.In addition to the repeats, there are other issues such as errors in reads and double strandedness of the reads which make the reduction to SCS problem very complex.
The literature on algorithms to address the SA problem can be classified into two broad categories.The first class of algorithms model a read as a vertex in a directed graph -known as the overlap graph [2].The second class of algorithms model every substring of length k (i.e., a kmer) in a read as a vertex in a (subgraph of) a de Bruijn graph [3].
In an overlap graph, for every pair of overlapping reads, directed edges are introduced consistent with the orientation of the overlap.Since the transitive edges in the overlap graph are redundant for the assembly process they are removed and the resultant graph is called the string graph [2].The edges of the string graph are classified into optional, required and exact.The SA problem is reduced to the identification of a shortest walk in the string graph which includes all the required and exact constraints on the edges.Identifying such a walk -minimum S-walk -on the string graph is known to be NP-hard [4].
When a de Bruijn graph is employed, we model every substring of length k (i.e., a k-mer) in a read as a vertex [3].A directed edge is introduced between two k-mers if there exists some read in which these two k-mers overlap by exactly k − 1 symbols.Thus every read in the input is mapped to some path in the de Bruijn graph.The SA problem is reduced to a Chinese Postman Problem (CPP) on the de Bruijn graph, subject to the constraint that the resultant CPP tour include all the paths corresponding to the reads.This problem is also known to be NP-hard.Thus solving the SA problem exactly on both these graph models is intractable.
Overlap graph based algorithms were found to perform better (see [5] [6] [7] [8]) with Sanger based read methods.Sanger methods produce reads typically around 1000 base pairs long.However these can produce significant read errors.To overcome the issues with Sanger reads new read technologies such as the pyrosequencing (454sequencing) have emerged.These read technologies can produce reliable and accurate genome fragments which are very short (up to 100 base-pairs long).On the other hand short read technologies can increase the number of reads in the SA problem by a large magnitude.Overlap based graph algorithms do not scale well in practice since they represent every read as a vertex.De Bruijn graph based algorithms seem to handle short reads very efficiently (see [9]) in practice compared to the overlap graph based algorithms.However the existing sequential algorithms [9] to construct these graphs are sub-optimal and require significant amounts of memory.This limits the applicability of these methods to large scale SA problems.In this paper we address this issue and present algorithms to construct large de Bruijn graphs very efficiently.Our algorithm is optimal in the sequential, parallel and out-of-core models.A recent work by Jackson and Aluru [1] yielded parallel algorithms to build these de Bruijn graphs efficiently.They present a parallel algorithm that runs in O(n/p) time using p processors (assuming that n is a constantdegree ploynomial in p).The message complexity of their algorithm is Θ(nΣ).By message complexity we mean the total number of messages (i.e., k-mers) communicated by all the processors in the entire algorithm.One of the major contributions of our work is to show that we can accomplish this in Θ(n/p) time with a message complexity of Θ(n).An experimental comparison of these two algorithms on an SGI Altix machine shows that our algorithm is considerably faster.In addition, our algorithm works optimally in an out-of-core setting.In particular, our algorithm requires only Θ( n log(n/B) B log(M/B) ) I/O operations.
The organization of the paper is as follows.In Section II we introduce some preliminaries and define a bidirected de Bruijn graph formally.Section III discusses our main algorithm in a sequential setting.Section V and Section VI show how our main idea can easily be extended to parallel and out-of-core models optimally.In Section V-A we provide some remarks on the parallel algorithm of Jackson and Aluru [1].Section VII gives algorithms to perform the simplification operation on the bi-directed de Bruijn graph.Section VIII discusses how our simplified bi-directed de Bruijn graph algorithm can replace the graph construction algorithm in a popular sequence assembly program VELVET [9].Finally we present experimental results in Section IX.

II. PRELIMINARIES
Let s ∈ Σ n be a string of length n.Any substring s j (i.e., s[j]s The set of all k−mers of a given string s is called the k−spectrum of s and is denoted by S(s, k).Given a k−mer s j , sj denotes the reverse complement of s j (e.g., if s j = AAGT A then sj = T ACT T ).Let ≤ be the partial ordering among the strings of equal length, then s i ≤ s j indicates that the string s i is lexicographically smaller than s j .Given any k−mer s i , let ŝi be the lexicographically smaller string between s i and si .We call ŝi the canonical k−mer of s i .In other words, if s i ≤ si then ŝi = s i otherwise ŝi = si .A k−molecule of a given k−mer s i is a tuple ( ŝi , si ) consisting of the canonical k−mer of s i and the reverse complement of the canonical k−mer.In the rest of this paper we use the terms positive strand and canonical k−mer interchangeably.Likewise the noncanonical k−mer is referred to as the negative strand.
A bi-directed graph is a generalized version of a standard directed graph.In a directed graph every edge has only one arrow head (-⊲ or ⊳-).On the other hand in a bi-directed graph every edge has two arrow heads attached to it (⊳-⊲, ⊳-⊳,⊲-⊳ or ⊲-⊲).Let V be the set of vertices and ] refer to the orientations of the arrow heads on the vertices v i and v j , respectively.A walk W (v i , v j ) between two nodes v i , v j ∈ V of a bi-directed graph G(V, E) is a sequence v i , e i1 , v i1 , e i2 , v i2 , . . ., v im , e im+1 , v j , such that for every intermediate vertex v il , 1 ≤ l ≤ m the orientation of the arrow head on the incoming edge adjacent on v il is opposite to the orientation of the arrow head on the out going edge.To make this clearer, let e il , v il , e il+1 be a sub-sequence in the walk then for the walk to be valid it should be the case that o 2 = o ′ 1 .Figure 1(a) illustrates an example of a bi-directed graph.Figure 1(b) shows a simple bi-directed walk between the nodes A and E. Bi-directed walk between two nodes may not be simple.Figure 1(c) shows a bi-directed walk between A and E which is not simple -because B repeats twice.
A de Bruijn graph D k (s) of order k on a given string s is defined as follows.The vertex set V of D k (s) is defined as the k−spectrum of s (i.e.V = S(s, k)).We use the notation suf (v i , l) (pre(v i , l), respectively) to denote the suffix (prefix, respectively) of length l in the string v i .Let the symbol • denote the concatenation operation between two strings.The set of directed edges E of D k (s) is defined as follows: We can also define de Bruijn graphs for sets of strings as follows.If S = {s 1 , s 2 . . .s n } is any set of strings, a de Bruijn graph B k (S) of order k on S has To model the double strandedness of the DNA molecules we should also consider the reverse complements ( S = { s1 , s2 . . .sn }) while we build the de Bruijn graph.
To address this a bi-directed de Bruijn graph BD k (S ∪ S) has been suggested in [4].The set of vertices V of BD k (S ∪ S) consists of all possible k−molecules from S ∪ S. The set of bi-directed edges for BD k (S ∪ S) is defined as follows.Let x, y be two k−mers which are next to each other in some input string z ∈ S ∪ S. Then an edge is introduced between the k−molecules v i and v j corresponding to x and y, respectively.Please note that two consecutive k−mers in some input string always overlap by k − 1 symbols.The converse need not be true.The orientations of the arrow heads on the edges are chosen as follows.If both x and y are the positive strands in v i and v j , respectively, then an edge (v i , v j , ⊲, ⊲) is introduced.If x is the positive strand in v i and y is the negative strand in v j an edge (v i , v j , ⊲, ⊳) is introduced.Finally, if x is the negative strand in v i and y is the positive strand in v j an edge (v i , v j , ⊳, ⊲) is introduced.
Figure 2 illustrates a simple example of the bi-directed de Bruijn graph of order k = 3 from a set of reads AT GG, CCAT, GGAC, GT T C, T GGA and T GGT observed from a DNA sequence AT GGACCAT and its reverse complement AT GGT CCAT .Consider two 3−molecules v 1 = (GGA, T CC) and v 2 = (GAC, GT C).Because the positive strand x = GGA in v 1 overlaps the positive strand y = GAC in v 2 by string GA, an edge (v 1 , v 2 , ⊲, ⊲) is introduced.Note that the negative strand GT C in v 2 also overlaps the negative strand T CC in v 2 by string T C, so the two overlapping strings GA and T C are drawn above the edge (v 1 , v 2 , ⊲, ⊲) in Figure 2. A bi-directed walk on the example bi-directed de Bruijn graph as illustrated by the dash line is corresponding to the original DNA sequence with the first letter omitted T GGACCAT .We would like to remark that the parameter k is always chosen to be odd to ensure that the forward and reverse complements of a k-mer are not the same.
Bi-directed de Bruijn graph example III.OUR ALGORITHM TO CONSTRUCT BI-DIRECTED DE BRUIJN GRAPHS In this section we describe our algorithm BiConstruct to construct a bi-directed de Bruijn graph on a given set of reads.The following are the main steps in our algorithm to build the bi-directed de Bruijn graph.Let R f = {r 1 , r 2 . . .r n }, r i ∈ Σ r be the input set of reads.Rf = { r1 , r2 . . .rn } is a set of reverse complements.Let R * = R f ∪ Rf and R k+1 = ∪ r∈R * S(r, k + 1).R k+1 is the set of all (k + 1)-mers from the input reads and their reverse complements.( vi Reduce multiplicity: Sort all the bidirected edges in [STEP-1], using radix sort.Since the parameter k is always odd this guarantees that a pair of canonical k-mers have exactly one orientation.Remove the duplicates and record the multiplicities of each canonical edge.Gather all the unique canonical edges into an edge list E.

IV. ANALYSIS OF THE ALGORITHM BiConstruct
Theorem 1: Algorithm BiConstruct builds a bidirected de Bruijn graph of the order k in Θ(n) time.
Here n is number of characters/symbols in the input.
Proof: Without loss of generality assume that all the reads are of the same size r.Let N be the number of reads in the input.This generates a total of (r − k)N (k + 1)-mers in [STEP-1].The radix sort needs to be applied at most 2k log(|Σ|) passes, resulting in 2k log(|Σ|)(r − k)N operations.Since n = N r is the total number of characters/symbols in the input, the radix sort takes Θ(kn log(|Σ|)) operations assuming that in each pass of sorting only a constant number of symbols are used.If k log(|Σ|) = O(log N ), the sorting takes only O(n) time.In practice since N is very large in relation to k and |Σ|, the above condition readily holds.Since the time for this step dominates that of all the other steps, the runtime of the algorithm BiConstruct is Θ(n).

V. PARALLEL ALGORITHM FOR BUILDING
BI-DIRECTED DE BRUIJN GRAPH In this section we illustrate a parallel implementation of our algorithm.Let p be the number of processors available.We first distribute N/p reads for each processor.All the processors can execute [STEP-1] in parallel.In [STEP-2] we need to perform parallel sorting on the list E. Parallel radix/bucket sort -which does not use any all-to-all communications-can be employed to accomplish this.For example, the integer sorting algorithm of Kruskal, Rudolph and Snir takes O n p log n log(n/p) time.This will be O(n/p) if n is a constant degree polynomial in p.In other words, for coarse-grain parallelism the run time is asymptotically optimal.In practice coarse-grain parallelism is what we have.Here n = N (r − k + 1).We call this algorithm Par-BiConstruct.
Theorem 2: Algorithm Par-BiConstruct builds a bidirected de Bruijn graph in time O(n/p).The message complexity is O(n).

A. Some remarks on Jackson and Aluru's algorithm
The algorithm of Jackson and Aluru [1] first identifies the vertices of the bi-directed graph -which they call representative nodes.Then for every representative node |Σ| many-to-many messages were generated.These messages correspond to potential bi-directed edges which can be adjacent on that representative node.A bi-directed edge is successfully created if both the representatives of the generated message exist in some processor, otherwise the edge is dropped.This results in generating a total of Θ(n|Σ|) many-to-many messages.The authors in the same paper demonstrate that communicating many-tomany messages is a major bottleneck and does not scale well.On the other had we remark that the algorithm BiConstruct does not involve any many-to-many communications and does not have any scaling bottlenecks.
On the other hand the algorithm presented in their paper [1] can potentially generate spurious bi-directed edges.According to the definition [4] of the bi-directed de Bruijn graph in the context of SA problem, a bi-directed edge between two k-mers/vertices exists iff there exists some read in which these two kmers are adjacent.We illustrate this by a simple example.Consider a read r i = AAT GCAT C. If we wish to build a bi-directed graph of order 3, then {AAT, AT G, T GC, GCA, CAT, AT C} form a subset of the vertices of the bi-directed graph.In this example we see that k-mers AAT and AT C overlap by exactly 2 symbols.However there cannot be any bi-directed edge between them according to the definition -because they are not adjacent.On the other hand the algorithm presented in [1] generates the following edges with respect to k-mer AAT : {(AAT, AT A), (AAT, AT G), (AAT, AT T ), (AAT, AT C)}.The edges (AAT, AT A) and (AAT, AT C) are purged since the k-mers AT A and AT C are missing.However bi-directed edges with corresponding orientations are established between AT G and AT C. Unfortunately (AAT, AT C) is a spurious edge and can potentially generate wrong Fig. 3. Problems with pointer jumping on bi-directed chains assembly solutions.In contrast to their algorithm [1] our algorithm does not use all-to-all communicationsalthough we use point-point communications.

VI. OUT OF CORE ALGORITHMS FOR BUILDING BI-DIRECTED DE BRUIJN GRAPHS
Theorem 3: There exists an out-of core algorithm to construct a bi-directed de Bruijn graph using an optimal number of I/O's.
Proof: Sketch: Replace the radix sorting with an external R−way merge which takes only Θ( n log(n/B) B log(M/B) ).Where M is the main memory size, n is the sum of the lengths of all reads, and B is the block size of the disk.

VII. SIMPLIFIED BI-DIRECTED DE BRUIJN GRAPH
The bi-directed de Bruijn graph constructed in the previous section may contain several linear chains.These chains have to be compacted to save space as well as time.The graph that results after this compaction step is refered to as the simplified bi-directed graph.A linear chain of bi-directed edges between nodes u and v can be compacted only if we can find a valid bi-directed walk connecting u and v.All the k-mers/vertices in a compactable chain can be merged into a single node, and relabelled with the corresponding forward and reverse complementary strings.In Figure 4 we can see that the nodes X 1 and X 3 can be connected with a valid bidirected walk and hence these nodes are merged into a single node.In practice the compaction of chains seems to play a very crucial role.It has been reported that merging the linear chains can reduce the number of nodes in the graph by up-to 30% [9].
Although bi-directed chain compaction problem seems like a list ranking problem there are some fundamental differences.Firstly, a bi-directed edge can be traversed in both the directions.As a result, applying pointer jumping directly on a bi-directed graph can lead to cycles and cannot compact the bi-directed chains correctly.Figure 3 illustrates the first phase of pointer jumping.As we  The same is true for the compaction of the bi-directed walk between X 1 and X 3 .The redundant edges after compaction are marked in red.Since bi-directed chain compaction has a lot of practical importance efficient and correct algorithms are essential.We now provide algorithms for the bi-directed chain compaction problem.Our key idea here is to transform a bi-directed graph into a directed graph and then apply list ranking.Given a list of candidate canonical bi-directed edges, we apply a ListRankingTransform (see Figure 5) which introduces two new nodes v + , v − for every node v in the original graph.Directed edges corresponding to the orientation are introduced.See Figure 5.
Lemma 1: Let BG(V, E) be a bi-directed graph; let BG t (V t , E t ) be the directed graph after applying Lis-tRankingTransform.Two nodes u, v ∈ V are connected by a bi-directed path iff u Proof: We first prove the forward direction by induction on the number of nodes in the bi-directed graph.Consider the basis of induction when |V | = 2, let v 0 , v 1 ∈ V .Clearly we are only interested when v 0 and v 1 are connected by a bi-directed edge.By the definition of ListRankingTransform the Lemma in this case is trivially true.Now consider a bi-directed graph with |V | = n + 1 nodes, if the path between v i , i < n and v j , j < n does not involve node v n the lemma still holds by induction on the sub bidirected graph BG(V − {v n }, E).Now assume that v i . . .v p , v n , v q . . .v j is the bi-directed path between v i and v j involving the node v n ; see Figure 6(a).Also Figure 6(a) shows how the transformed directed graph look like; observe the colors of bi-directed edges and corresponding directed edges.By induction hypothesis on the sub bi-directed paths v i . . .v p , v n and v n , v q . . .v j we have the following.v + i is connected to v + n or v − n by some directed path P 1 (See Figure 6(b); v + n is connected to v + j or v − j by some directed path P 2 .We examine three possible cases depending on how the directed edge from P 1 and P 2 is incident on v + n .In CASE-1 we have both P 1 and P 2 pointing into node v + n .This implies that the orientation of the bi-directed edges in the original graph is according to Figure 6(b).In this case we cannot have a bi-directed walk involving the node v n , which contradicts our original assumption.Similarly CASE-2(Figure 6(c)) would also lead to a similar contradiction.Only CASE-3 would let node v n involve in a bi-directed walk.In this case v + i will be connected to either v + j or v − j by concatenation of the paths P 1 , P 2 .We can make a similar argument to prove the reverse direction.

A. Algorithm for bi-directed chain compaction
We first identify a set of candidate bi-directed edges which can potentially form a chain.One possible criterion will be to include all the edges which are adjacent on bi-directed nodes with exactly one in and out degree.Each candidate bi-directed edge is transformed using ListRankingTranform and list ranking is applied on resultant set.As a consequence of the symmetry in ListRankingTransform we would see both forward and reverse complements of the compacted chains in the output.We can further canonicalize each chain and remove the duplicates by sorting.This results in unique bi-directed chains from the candidate bi-directed edges.Finally we report only the chains which are resultant of compaction of at least three bi-directed nodes.This removes all the inconsistent edges (see Figure 4) from further consideration.As a consequence of Lemma 1 all the bi-directed chains are correctly compacted.

B. Analysis of bi-directed compaction on parallel and out-of-core models
Let E l be the list of candidate edges for compaction.To do compaction in parallel, we can use a Segmented Parallel Prefix on p processors to accomplish this in time O(|2E l |/p + log(p)).On the other hand list ranking can also be done out-of-core as follows.Without loss of generality we can treat the input for the list ranking problem as a set S of ordered tuples of the form (x, y).Given S we create a copy and call it S ′ .We now perform an external sort of S, S ′ with respect to y (i.e., using the y value of tuple (x, y) as the key) and x respectively.The two sorted lists are scanned linearly to identify tuples (x, y) ∈ S, (x ′ , y ′ ) ∈ S ′ such that y = x ′ .These two
For each canonical bi-directed edge ( vi , vj , o 1 , o 2 ) ∈ E, collect the canonical k-mers vi , vj into list V. Sort the list V and remove duplicates, such that V contains only the unique canonical k-mers.• [STEP-4] Adjacency list representation: The list E is the collection of all the edges in the bi-directed graph and the list V is the collection of all the nodes in the bi-directed graph.It is now easy to use E and generate the adjacency lists representation for the bi-directed graph.This may require one extra radix sorting step.
• [STEP-3] Collect bi-directed vertices: can see, the green arcs indicate valid pointer jumps from the starting nodes.However since the orientation of the node Y 4 is reverse relative to the direction of pointer jumping a cycle results.In contrast, a valid bidirected chain compaction would merge all the nodes between Y 1 and Y 5 since there is a valid bi-directed walk between Y 1 and Y 5 .On the other hand, bi-directed chain compaction may result in inconsistent bi-directed edges and these edges should be recognised and removed.Consider a bi-directed chain in Figure4; this chain contains two possible bi-directed walks -Y 1 to Y 4 and X 1 to X 3 .The walk from Y 1 to Y 4 (Y 4 to Y 1 ) spells out a label AT AGGT (ACCT AT ) after compaction.Once we perform this compaction the edge between Y 4 and Z 1 in the original graph is no longer valid -because the k-mer on Z 1 cannot overlap with the label ACCT AT .