Compact representation of k-mer de Bruijn graphs for genome read assembly

Background: Processing of reads from high-throughput sequencing is often done in terms of edges in the de Bruijn graph representing all k-mers from the reads. The memory requirements for storing all k-mers in a lookup table can be demanding, even after removal of read errors, but can be alleviated by using a memory-efficient data structure.

Results: The FM-index, which is based on the Burrows–Wheeler transform, provides an efficient, searchable index of all substrings from a set of strings, and is used to compactly represent full genomes for use in mapping reads to a genome: the memory required to store this is of the same order of magnitude as the strings themselves. However, reads from high-throughput sequencing mostly have high coverage, and so contain the same substrings multiple times from different reads. I here present a modification of the FM-index, which I call the kFM-index, for indexing the set of k-mers from the reads. For DNA sequences, this requires 5 bits of information for each vertex of the corresponding de Bruijn subgraph, i.e. for each distinct (k−1)-mer, plus some additional overhead, typically 0.5 to 1 bit per vertex, for storing the equivalent of the FM-index for walking the underlying de Bruijn graph and reproducing the actual k-mers efficiently.

Conclusions: The kFM-index could replace more memory-demanding data structures for storing the de Bruijn k-mer graph representation of sequence reads. A Java implementation with additional technical documentation is provided which demonstrates the applicability of the data structure (http://folk.uio.no/einarro/Projects/KFM-index/).

Algorithm 1 Compute arbitrary ρ(a, i) from next stored value

function ρ(a, i)  ⊳ a ∈ Σ, i ∈ {0, . . . , n}
    if i = 0 then return ρstore(a, 0) end
    j ← ι+(i)  ⊳ stored position j = ir ≥ i
    p ← ρstore(a, j)  ⊳ stored value
    while not fj−1 do j ← j − 1 end
    if j < i then return p end
    hasEdge ← false
    loop
        if fj then  ⊳ new vertex group
            if hasEdge then p ← p − 1 end
            if j < i then return p end
            hasEdge ← false
        end if
        if a ∈ Ej then hasEdge ← true end
        j ← j − 1
    end loop
    return p
end function

The main advantage of this algorithm comes when it is combined with the original one, which computes ρ(a, i) based on the previous stored value, picking whichever algorithm has the closer stored value. This halves the average distance to the stored value, thus allowing q to be twice as high without additional computational cost.
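The effect of combining the two scan directions can be illustrated outside the kFM-index setting with a plain bit-vector rank structure. The sketch below is a hypothetical simplification (its class, fields and layout are not the kFM-index tables): prefix counts are stored at every q-th position, and a query scans forward from the previous checkpoint or backward from the next one, whichever is nearer.

```java
// Simplified stand-in for the two-sided checkpoint scheme: rank(i) counts the
// set bits in bits[0..i), scanning from the nearer of the two checkpoints.
public class NearestCheckpointRank {
    private final boolean[] bits;
    private final int q;          // checkpoint spacing
    private final int[] store;    // store[r] = number of set bits in bits[0 .. min(r*q, n))

    public NearestCheckpointRank(boolean[] bits, int q) {
        this.bits = bits;
        this.q = q;
        int n = bits.length, m = (n + q - 1) / q;
        store = new int[m + 1];
        int c = 0;
        for (int i = 0; i < n; i++) {
            if (i % q == 0) store[i / q] = c;
            if (bits[i]) c++;
        }
        store[m] = c;             // final checkpoint at position n
    }

    /** Number of set bits in bits[0 .. i). */
    public int rank(int i) {
        int n = bits.length;
        int rDown = i / q, rUp = (i + q - 1) / q;
        int up = Math.min(rUp * q, n);           // position of the next checkpoint
        if (i - rDown * q <= up - i) {
            int c = store[rDown];                // scan forward from previous checkpoint
            for (int j = rDown * q; j < i; j++) if (bits[j]) c++;
            return c;
        } else {
            int c = store[rUp];                  // scan backward from next checkpoint
            for (int j = i; j < up; j++) if (bits[j]) c--;
            return c;
        }
    }
}
```

With both directions available, the average scan length drops from about q/2 to about q/4, which is what allows q to be doubled at no extra cost.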

Improved PreMerge function
The PreMerge function is presented in the article in a simple, straightforward way. However, there are a number of special cases in which the computations can be done more efficiently. When αB = βB is encountered, the B interval is empty and what remains is to insert the interval of vertices [αA, βA), which does not require computing all four ρ values. Another similar case is when the A and B intervals both contain exactly one vertex.
The following is a version of PreMerge where these two cases are handled separately.

Algorithm 3 Prepare merge: insert individual

    …
        if a ∈ EA(iA) then preInsertA(l − 1, ρA(a, iA)) end
    end for
end function

Algorithm 4 Prepare merge: merge single vertices

    …

In both functions, the data required to make the computations, like Ei, can be looked up in one operation, which saves a little time. The most critical aspect, however, is that the number of calls to ρ is reduced.

Algorithm for generating uniquely determined paths
The first step in finding the uniquely determined paths is to identify the non-simple, or complex, vertices: i.e. those that do not have in-degree and out-degree both equal to one.
Algorithm 5 Generate uniquely determined paths

Starting with the list of non-simple vertices, for each in-edge to a non-simple vertex, start backtracking until another non-simple vertex is reached. For each such in-edge, that path is uniquely determined, and thus all uniquely determined paths are generated. Some of these paths may start and end in final-completing vertices, and thus represent a suffix of length less than k: these may be excluded once the string representation of the path has been produced.
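The backtracking procedure can be sketched on a plain digraph given as an edge list. The class and representation below are hypothetical illustrations: in the actual data structure, in-edges would be looked up via the kFM-index rather than explicit adjacency lists.

```java
import java.util.*;

// Sketch: identify non-simple (complex) vertices, then backtrack from each
// in-edge of a non-simple vertex through simple vertices to the previous
// non-simple vertex, emitting one uniquely determined path per such in-edge.
public class UniquePaths {

    /** All uniquely determined paths of the digraph on vertices 0..n-1. */
    public static List<List<Integer>> uniquePaths(int n, int[][] edges) {
        List<List<Integer>> in = new ArrayList<>(), out = new ArrayList<>();
        for (int v = 0; v < n; v++) { in.add(new ArrayList<>()); out.add(new ArrayList<>()); }
        for (int[] e : edges) { out.get(e[0]).add(e[1]); in.get(e[1]).add(e[0]); }

        // Non-simple vertices: in-degree and out-degree not both equal to one.
        boolean[] complex = new boolean[n];
        for (int v = 0; v < n; v++)
            complex[v] = in.get(v).size() != 1 || out.get(v).size() != 1;

        List<List<Integer>> paths = new ArrayList<>();
        for (int v = 0; v < n; v++) {
            if (!complex[v]) continue;
            for (int u : in.get(v)) {             // each in-edge to a non-simple vertex
                LinkedList<Integer> path = new LinkedList<>();
                path.addFirst(v);
                int w = u;
                while (!complex[w]) {             // backtrack through simple vertices
                    path.addFirst(w);
                    w = in.get(w).get(0);         // the unique predecessor
                }
                path.addFirst(w);                 // terminating non-simple vertex
                paths.add(path);
            }
        }
        return paths;
    }
}
```

Note that a cycle consisting solely of simple vertices is never reached by this sketch (it contains no non-simple vertex to start from) and would have to be handled separately.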

Parallel bit-processing for computing previous vertex position
The algorithm presented in the article, and implemented in the Java program, uses single-bit operations to compute ρ(a, i) from stored values ρstore(a, ir), where 0 = i0 < · · · < iζ = n are the stored positions with steps of at most q.
For a computer with word size ω, i.e. ω = 64 for 64-bit computers, it is possible to process ω bits of information in parallel. This can provide a substantial speed-up relative to single-bit operations: both because fewer computational operations are needed, and because the data required for the computations are packed into single words rather than read from memory one vertex at a time.
The present Java implementation stores the σ + 1 bits of information per vertex in one block, which for DNA with σ = 4 results in 12 such blocks being stored in each 64-bit word. The present implementation reads one vertex at a time without exploiting the fact that consecutive vertices are likely to be stored in the same word.
For efficient parallelisation of bit operations, it is better to reorganise the data storage so that the ω bits of data that are to be processed in parallel are stored in a single word. For this, we use σ + 1 separate words for storing the data for each of the letters and for the vertex group end flags. For computing ρ(a, i), we would then need to read one word containing the data η(a, j) for some interval of j that contains i, and one word with the corresponding vertex group data fj.
In order to ensure that every vertex group is stored within a single word, not split across words, some margin needs to be added: instead of blocks of length ω, blocks of length ν = ω − σ + 1 are used. We then store the data for positions rν − (σ − 1), . . . , rν + ν − 1 in the word ξ(a,r), where any bits outside the range of definition are set to 0. This ensures that, for any set group end flag frν+j with j ≥ 0, the entire vertex group is included in ξ(a,r). When computing ρ(a, i), we select r = ⌊i/ν⌋ to ensure that i = rν + j for some j ∈ {0, . . . , ν − 1}.
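The packing can be sketched as follows. The method name and the concrete bit order are assumptions made for illustration: here position rν + j is placed at bit ν − 1 − j of word r (earlier positions in higher bits), so the margin positions rν − (σ − 1), . . . , rν − 1 occupy the topmost σ − 1 bits of the word.

```java
// Hypothetical packing of per-position flags into blocks of nu bits with a
// (sigma - 1)-bit margin: position r*nu + j is stored at bit nu - 1 - j of
// words[r], for j = -(sigma - 1), ..., nu - 1; out-of-range bits stay 0.
public class BlockPacking {
    public static long[] pack(boolean[] flags, int sigma, int nu) {
        int blocks = (flags.length + nu - 1) / nu;
        long[] words = new long[blocks];
        for (int r = 0; r < blocks; r++) {
            for (int j = -(sigma - 1); j < nu; j++) {
                int pos = r * nu + j;
                if (pos >= 0 && pos < flags.length && flags[pos])
                    words[r] |= 1L << (nu - 1 - j);
            }
        }
        return words;
    }
}
```

A flag just below a block boundary thus appears twice: at the low end of its own block and in the margin of the next block, so a vertex group ending just after the boundary is still complete within a single word. (Real use would have ν = 61; small values are used below for readability.)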
The core algorithm takes the in-edge flags ξ(a,r) and the vertex group end flags φ(r) and computes a word χ in which set bits represent the vertex groups that contain an a in-edge. Starting with the stored value ρstore(a, (r + 1)ν), we can then subtract the number of vertex groups that contain an a in-edge, from the group containing i to the end of the block.
Algorithm 6 Whole word parallel computation of ρ(a, i) from next stored value

function ρ(a, i)  ⊳ a ∈ Σ, i ∈ {0, . . . , n}
    r ← ⌊i/ν⌋ ; j ← i mod ν  ⊳ rν + j = i
    ξ ← ξ(a,r) ; φ ← φ(r) ; φ̄ ← not φ  ⊳ φ̄ is the bitwise complement of φ
    χ ← ξ and φ  ⊳ process last vertex in each group
    loop σ − 1 times  ⊳ ensure all other vertices in the group are processed
        ξ ← (ξ and φ̄) shift −1  ⊳ delete processed vertices and shift one position
        χ ← χ or (ξ and φ)  ⊳ process next vertex in each group
    end loop
    ⊳ χ now contains a set bit for each group containing an a in-edge
    p ← bitcount(χ shift +(σ − 1 + j))  ⊳ count groups starting at position j within the block
    return ρstore(a, (r + 1)ν) − p
end function

Note that the logical operations are performed bitwise, and that the shift operator χ shift q shifts the bits of χ by q positions to the left, or −q positions to the right: e.g. χ shift −1 = 0 χω−1 . . . χ1. The bitcount function exists as a single-word operation with a low-level implementation in e.g. Java.
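The core χ computation translates directly into Java bit operations. The sketch below uses hypothetical names and assumes group end flags sit at the lowest bit of each group (positions stored from the high bits downwards); >>> is Java's logical right shift, and Long.bitCount is the single-instruction population count mentioned above.

```java
// Word-parallel marking of vertex groups containing an a in-edge:
// xi holds the a in-edge flags, phi the vertex group end flags.
public class WordParallelRho {
    /** Word with a set bit at each group end whose group has an a in-edge. */
    public static long groupsWithEdge(long xi, long phi, int sigma) {
        long chi = xi & phi;                 // process the last vertex of each group
        for (int s = 1; s < sigma; s++) {    // a group has at most sigma vertices
            xi = (xi & ~phi) >>> 1;          // drop processed vertices, shift one position
            chi |= xi & phi;                 // process the next vertex of each group
        }
        return chi;
    }
}
```

With χ in hand, the remaining step of Algorithm 6 is ρstore(a, (r + 1)ν) − Long.bitCount(chi << (sigma - 1 + j)), the left shift discarding the groups before offset j within the block.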
For faster memory access, we suggest storing the ξ and φ data so that ξ(a,r) and φ(r) for a single r are located in the same area of memory: e.g. M(σ+1)r+a = ξ(a,r), where a = 0, . . . , σ − 1 represents the letters, and M(σ+1)r+σ = φ(r). The reason for this is that CPU optimisations like prefetching and caching can make access to words stored consecutively in memory much faster than random access.
For DNA sequences processed on a typical 64-bit computer, σ = 4 and ω = 64, and the block size ν = 61 can be used. The maximal block size referred to in the article then becomes q = 61, and so this requires (4 + 1) × 64/61 ≈ 5.25 bits of information for storing the main data, and 4 × 8/61 ≈ 0.52 bits for ρstore, since lg q + 1 + (lg(n/q))/64 < 8. In total, that is 5.77 bits per vertex.
The computational complexity of computing an arbitrary ρ(a, i) is O(σ) due to the need to cover the maximal number of vertices in a vertex group. As in the article, this assumes that n < 2^ω: when this limitation is broken, the number of computational steps required whenever any position value is involved will increase by a factor of (lg n)/ω. In that respect, the computational complexity of Algorithm 6 is O(σ + (lg n)/ω) as n increases. The bitcount function has complexity O(lg ω), but will still tend to be a single CPU operation. Thus, apart from these technical assumptions, the complexity remains O(σ), which for fixed σ is constant time.
However, the apparently significant improvement from the time complexity O(q) algorithm, presented in the article and implemented in the Java program, to a complexity O(σ) algorithm may be less significant than the numbers indicate for moderately sized q. As noted above, random memory access is much slower than sequential memory access, and although the complexity increases with increasing q, benchmarking of the implementation indicated limited gain from reducing q much below 32. If we compute a baseline time by extrapolating to q = 1, i.e. where all of ρ is stored, we find that the time used with q = 32 is roughly twice the baseline time, and at q = 64 it is three times the baseline time. This may in part be due to the time required for random memory access, and will most likely not be reduced by improvements in the algorithm. Thus, the bit-parallelised algorithm may be expected to be three times as fast as the present one at q ≈ 64, and twice as fast as the present one at q ≈ 32, although with slightly less memory consumed. As the algorithm requires a different organisation of the data and is less general, it has not been implemented in the Java program.