We begin by illustrating our general strategy, showing how to represent a hash index in succinct space by building a compact (O(n log σ) bits) representation of all the fingerprints of the length-m text substrings. A succinct index is then built over this representation, obtaining a succinct hash data structure.
Definitions
Throughout this paper we will work with the alphabet ΣDNA = {A, C, G, T, N, $} (with N and $ being the undefined base and the contig end-marker, respectively), which in practice will be encoded in ΣDNA' = {0, 1, 2, 3} by assigning a numerical value in ΣDNA' to the N and $ characters. The size of our alphabet is therefore σ = |ΣDNA'| = 4, while with n, m, and w we will denote the reference length, the pattern length, and the (fixed) size of a computer memory word (i.e. the number σ^w − 1 is assumed to fit in a memory word), respectively. As hash functions we will use functions of the form h : Σm → Σw, mapping length-m Σ-strings to length-w Σ-strings. If necessary, we will use the symbol h_m^w instead of h when we need to be clear on h's domain and codomain sizes. Given a string P ∈ Σm, the value h(P) ∈ Σw will also be dubbed the fingerprint of P (in Σw). With T ∈ Σn we will denote the text that we want to index using our data structure. T_i^j will denote T[i, ..., i + j − 1], i.e. the length-j prefix of the i-th suffix of T. A hash data structure H for the text T with hash function h will be a set of ordered pairs (an index) H = {⟨h(T_i^m), i⟩ : 0 ≤ i ≤ n − m} that can be used to store and retrieve the positions of length-m substrings of T (m is therefore fixed once the index is built). A lookup operation on the hash H, given the fingerprint h(P), will consist in the retrieval of all the positions i such that ⟨h(P), i⟩ ∈ H; cases where ⟨h(P), i⟩ ∈ H but T_i^m ≠ P will be referred to as false positives.
The symbol ⊕ represents the exclusive OR (XOR) bitwise operator. a ⊕ b, where a, b ∈ Σ, will indicate the bitwise XOR among the bits of the binary encodings of a and b. Analogously, x ⊕ y, where x, y ∈ Σw, will indicate the bitwise XOR among the bits of the binary encodings of the two words x and y, and ∨ will denote the bitwise OR operator. dH(x, y) is the Hamming distance between x, y ∈ Σm. We point out that the Hamming distance is computed between characters in Σ, and not between the bits of the binary encoding of each of them. Patterns and fingerprints are viewed as bit vectors only when computing bitwise operations such as OR, AND, and XOR.
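As a concrete illustration, the following Python sketch contrasts the character-wise Hamming distance with the character-wise XOR. The 2-bit encoding A=0, C=1, G=2, T=3 is our assumption (consistent with σ = |ΣDNA'| = 4); the function names are illustrative, not taken from the tool.

```python
# Illustrative sketch; the 2-bit encoding below is an assumption
# consistent with sigma = |Sigma_DNA'| = 4.
ENC = {"A": 0, "C": 1, "G": 2, "T": 3}

def hamming(x, y):
    """Hamming distance between characters in Sigma (not between bits)."""
    return sum(1 for a, b in zip(x, y) if a != b)

def xor_chars(x, y):
    """Character-wise XOR of two Sigma-strings via their 2-bit encodings."""
    return [a ^ b for a, b in zip(x, y)]

P = [ENC[c] for c in "ACGT"]  # [0, 1, 2, 3]
Q = [ENC[c] for c in "TCGT"]  # [3, 1, 2, 3]

# A single substitution (A -> T) flips both bits of the encoding,
# but still counts as one character mismatch:
print(hamming(P, Q))    # 1
print(xor_chars(P, Q))  # [3, 0, 0, 0]
```

Note that the substitution A → T flips two bits of the encoding but contributes 1, not 2, to the Hamming distance, which is exactly the distinction made above.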
Succinct representation of hash indexes
We begin by introducing the technique allowing succinct representation of hash indexes. The central property of the class of hash functions we are going to use is given by the following definition:
Definition 1 Let Σ = {0, ..., |Σ| − 1}. We say that a function h : Σm → Σw is a de Bruijn hash function if and only if, for every pair of strings P, Q ∈ Σm, P[1, ..., m − 1] = Q[0, ..., m − 2] implies h(P)[1, ..., w − 1] = h(Q)[0, ..., w − 2].
With the following theorem we introduce the hash function used in the rest of our work and in the implementation of our structure:
Theorem 1 Let P ∈ Σm. The hash function h⊕ : Σm → Σw, w ≤ m, defined as the exclusive OR of the consecutive non-overlapping length-w blocks of P, i.e.

h⊕(P) = P[0, ..., w − 1] ⊕ P[w, ..., 2w − 1] ⊕ ⋯ ⊕ P[m − w, ..., m − 1]

(assuming for simplicity that w divides m), is a de Bruijn hash function.
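A minimal sketch of such a fingerprint, assuming h⊕ XORs the consecutive non-overlapping length-w blocks of P (and, for simplicity, that w divides m):

```python
def h_xor(P, w):
    """Sketch of h_xor: character-wise XOR of the consecutive length-w
    blocks of P (assumes w divides len(P))."""
    f = [0] * w
    for i in range(0, len(P), w):
        f = [a ^ b for a, b in zip(f, P[i:i + w])]
    return f

# The de Bruijn property: if P and Q overlap by m - 1 characters,
# their fingerprints overlap by w - 1 characters.
P = [0, 1, 2, 3, 1, 1]
Q = [1, 2, 3, 1, 1, 2]  # Q[:-1] == P[1:]
print(h_xor(P, 2), h_xor(Q, 2))  # fingerprints overlap: fP[1:] == fQ[:-1]
```

Shifting the window by one character shifts every block boundary by one, so each output position j of the new fingerprint collects exactly the characters that fed position j + 1 of the old one; this is the de Bruijn property.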
A detailed proof of this theorem is given in Additional file 1. Given a de Bruijn hash function h : Σm → Σw, we can "extend" it to another de Bruijn hash function ĥ : Σn → Σn−m+w, operating on input strings of length n greater than or equal to m, as follows:
Definition 2 Given a de Bruijn hash function h : Σm → Σw and n ≥ m, the hash value of ĥ on T ∈ Σn is the unique string ĥ(T) ∈ Σn−m+w such that:

ĥ(T)[i, ..., i + w − 1] = h(T[i, ..., i + m − 1])

for every 0 ≤ i ≤ n − m.
It is easy to show that a function ĥ enjoying the property in Definition 2 is a homomorphism on de Bruijn graphs (having as sets of nodes Σm and Σw, respectively). Since h univocally determines ĥ, and the two functions coincide on the common part Σm of their domain, in what follows we will simply use the symbol h to indicate both.
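Definition 2 also suggests a simple incremental computation of ĥ(T): the first w characters are the fingerprint of the first window, and every shift of the window contributes exactly one new character (the de Bruijn property guarantees that the overlaps agree). A sketch, again under the block-XOR assumption for h⊕ with w dividing m:

```python
def h_xor(P, w):
    """Block-XOR fingerprint (assumes w divides len(P))."""
    f = [0] * w
    for i in range(0, len(P), w):
        f = [a ^ b for a, b in zip(f, P[i:i + w])]
    return f

def h_ext(T, m, w):
    """h(T) for |T| = n: a string of length n - m + w whose width-w window
    at position i equals h(T[i, ..., i + m - 1]) (Definition 2)."""
    out = list(h_xor(T[0:m], w))
    for i in range(1, len(T) - m + 1):
        out.append(h_xor(T[i:i + m], w)[-1])  # one new character per shift
    return out

T = [0, 1, 2, 3, 0, 2, 1, 3, 2, 0]
m, w = 4, 2
H = h_ext(T, m, w)
# every width-w window of h(T) is the fingerprint of the matching window of T
print(all(H[i:i + w] == h_xor(T[i:i + m], w) for i in range(len(T) - m + 1)))
```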
From Definitions 1 and 2 we can immediately derive the following important property:
Lemma 1 If h is a de Bruijn hash function, n ≥ m, and P ∈ Σm occurs in T ∈ Σn at position i, then h(P) occurs in h(T) at position i. The opposite implication does not (always) hold; we will refer to cases of the latter kind as false positives.
On the ground of Lemma 1 we can propose, differently from standard approaches in the literature, to build an index over the hash value of the text, instead of building it over the text. This can be done while preserving our ability to locate substrings in the text, since we can simply turn our task into that of locating fingerprints in the hash of the text T. We call dB-hash the data structure obtained with this technique. Notice that the underlying hash data structure is simulated by searching the occurrences of h(P) in h(T) during a lookup operation, so the algorithm is transparent to the particular indexing technique used.
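The lookup just described can be sketched end to end: index h(T), search the fingerprint h(P) in it, then verify candidates against the text to discard false positives (block-XOR h⊕ assumed, with w dividing m; here the "index" is a plain scan of h(T), standing in for whatever indexing technique is used):

```python
def h_xor(P, w):
    f = [0] * w
    for i in range(0, len(P), w):
        f = [a ^ b for a, b in zip(f, P[i:i + w])]
    return f

def h_ext(T, m, w):
    out = list(h_xor(T[0:m], w))
    for i in range(1, len(T) - m + 1):
        out.append(h_xor(T[i:i + m], w)[-1])
    return out

def locate(T, P, w):
    """dB-hash lookup: candidates are the occurrences of h(P) in h(T)
    (Lemma 1); a text comparison removes false positives."""
    m = len(P)
    HT, fp = h_ext(T, m, w), h_xor(P, w)
    cand = [i for i in range(len(T) - m + 1) if HT[i:i + w] == fp]
    return [i for i in cand if T[i:i + m] == P]

T = [0, 1, 2, 3, 0, 1, 2, 3]
print(locate(T, [0, 1, 2, 3], 2))  # [0, 4]
```

On this (deliberately periodic) toy text every window happens to share the same fingerprint, so the hash step alone reports several false positives; the final text comparison keeps only the true occurrences.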
Search algorithm
The core of our searching procedure is based on the algorithm rNA (Vezzi et al. [5], Policriti et al. [6]), a hash-based randomized numerical aligner built on the concept of Hamming-aware hash functions (see [5] and [6] for more details). Hamming-aware hash functions are hash functions capable of "squeezing" the Hamming ball of radius k around a pattern P into a Hamming ball of the same radius around the hash value of P. This feature allows the entire Hamming ball around P to be searched much more efficiently. The following theorem holds:
Theorem 2 The de Bruijn function h⊕ defined in Theorem 1 is a Hamming-aware hash function. In particular, dH(h⊕(P), h⊕(P')) ≤ dH(P, P') and h⊕(P) ⊕ h⊕(P') = h⊕(P ⊕ P'), for every P, P' ∈ Σm.
See Additional file 1 for a detailed proof of this theorem. Since h⊕ is a de Bruijn and Hamming-aware hash function, we can use it to build our structure and adapt the rNA algorithm to it. We call dB-rNA the new version of the rNA algorithm adapted to the dB-hash data structure. More in detail, the Hamming-awareness property of h⊕ guarantees that, given a pattern P to be searched in the index, the set {h⊕(P') : dH(P, P') ≤ k} is small, O((2σ − 2)^k w^k) = O(6^k w^k) elements in our application, and can be computed in time proportional to its size. Notice that, with a generic hash function h, only the trivial upper bound O((σ − 1)^k m^k) can be given to the size of this set, since each different P' such that dH(P, P') ≤ k could give rise to a distinct fingerprint. Our proposed algorithm is almost the same as the one described in [5, 6], the only difference being that the underlying data structure is a dB-hash instead of a standard hash. Briefly, the search proceeds in 3 steps. For each pattern P:
1 h⊕(P) is computed;
2 the index is searched for each fingerprint in the set {h⊕(P') : dH(P, P') ≤ k};
3 for each occurrence found, the text and the pattern P are compared to determine the Hamming distance and discard false positives.
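The three steps can be sketched for k = 1 under the block-XOR assumption for h⊕ (a single substitution then perturbs exactly one fingerprint position by a non-zero XOR value, which is what keeps the fingerprint ball small); the plain scan of h(T) below stands in for the actual index:

```python
def h_xor(P, w):
    f = [0] * w
    for i in range(0, len(P), w):
        f = [a ^ b for a, b in zip(f, P[i:i + w])]
    return f

def h_ext(T, m, w):
    out = list(h_xor(T[0:m], w))
    for i in range(1, len(T) - m + 1):
        out.append(h_xor(T[i:i + m], w)[-1])
    return out

def ball(fp, sigma=4):
    """Fingerprints reachable with at most one substitution in the pattern:
    1 + (sigma - 1) * w elements under the block-XOR assumption."""
    out = [list(fp)]
    for j in range(len(fp)):
        for e in range(1, sigma):
            g = list(fp)
            g[j] ^= e
            out.append(g)
    return out

def search_1mm(T, P, w):
    m = len(P)
    HT = h_ext(T, m, w)
    fps = ball(h_xor(P, w))                        # step 1: fingerprint ball
    cand = {i for i in range(len(T) - m + 1)
            if HT[i:i + w] in fps}                 # step 2: index lookups
    return sorted(i for i in cand                  # step 3: verification
                  if sum(a != b for a, b in zip(T[i:i + m], P)) <= 1)

T = [0, 1, 2, 3, 0, 1, 2, 3]
print(search_1mm(T, [0, 1, 2, 2], 2))  # [0, 4]
```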
In practice, in our implementation we also split the pattern P into non-overlapping blocks before searching the index. With this strategy we reduce the maximum number of errors to be searched for, improving the speed of the tool.
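The splitting strategy rests on the pigeonhole principle: if P occurs with at most k mismatches and is split into s non-overlapping blocks, at least one block occurs with at most ⌊k/s⌋ mismatches, so each block can be searched with a smaller error budget. A minimal sketch (the splitting policy shown, with the last block taking the remainder, is our assumption):

```python
def split_blocks(P, s):
    """Split P into s non-overlapping blocks (the last takes the remainder).
    Pigeonhole: an occurrence of P with <= k mismatches contains an
    occurrence of some block with <= k // s mismatches."""
    L = len(P) // s
    return [P[i * L:(i + 1) * L] for i in range(s - 1)] + [P[(s - 1) * L:]]

P = list("ACGTACGTACGT")
for block in split_blocks(P, 3):
    print("".join(block))  # ACGT (three blocks), each searched with k // 3 errors
```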
Complexity analysis
Let occ be the number of occurrences with at most k errors of the searched pattern P in T. Assuming that the alphabet size σ is a power of 2 (a condition satisfied in our application), the expected complexity of our algorithm on uniformly distributed inputs has an upper bound of O((2σ − 2)^k w^k log m + occ); here, σ = 4 is the size of the alphabet ΣDNA'. A fully formal proof of (an extended version of) this analysis can be found in [11].
Quality-aware strategy
Our tool implements a quality-aware heuristic that significantly improves search speed, at the price of a small loss in sensitivity. Briefly, we use base qualities to pick only a small fraction of the elements from the Hamming ball centred on the hash h(B) of the searched block B, following the assumption that a high-quality base is unlikely to be a miscall. Since in practice we divide the read into non-overlapping blocks and the heuristic affects only the searched block, with this strategy we lose only a small fraction of single variants such as SNPs. More in detail, let Q be (e.g.) the Phred quality string (see [12]) associated with the searched block B. We compute a hash value on Q using the following hash function:
Definition 3 With hq we indicate the hash function mapping the quality string Q of the block B to a value hq(Q) ∈ Σw, defined as

hq(Q)[i] = t(min{Q[j] : j mod w = i}), 0 ≤ i < w,

where t is defined as t(x) = 0 if x ≥ q and t(x) = 3 otherwise. q is a quality threshold (in our implementation we use q = 15). The values 0 and 3 have been chosen due to their binary representations (00 and 11, respectively). If hq(Q)[i] = 3, then during the search we try to insert an error in position i of h⊕(B), since at least one of the bases used to compute h⊕(B)[i] has a low quality. The quality-aware strategy is then implemented as follows: let B be the block to be searched and Q its associated Phred quality string. A fingerprint f (representing a block at distance at most k from B) is searched in the structure if and only if (f ⊕ h⊕(B)) ∨ hq(Q) = hq(Q), i.e. if and only if f differs from h⊕(B) only in positions corresponding to low-quality bases. Since the number of low-quality base pairs in a read is typically low, this strategy drastically reduces the search space (which in practice leads to an approximately 10x speedup) if reliable qualities are available. In the Results section we show, moreover, that this strategy has only a negligible impact on SNP detection and on the overall precision of our tool.
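The filter can be sketched as follows (assuming the block-XOR reading of h⊕, under which fingerprint position i is computed from the bases at positions i, w + i, 2w + i, ...; the helper names are ours):

```python
def quality_mask(quals, w, q=15):
    """h_q-style mask: position i is 3 (binary 11) if some base feeding
    fingerprint position i has Phred quality below q, and 0 otherwise.
    The contributing-positions rule (i, w+i, 2w+i, ...) is an assumption
    tied to the block-XOR reading of the hash."""
    mask = [0] * w
    for j, phred in enumerate(quals):
        if phred < q:
            mask[j % w] = 3
    return mask

def passes_filter(f, fB, mask):
    """Search fingerprint f iff (f XOR h(B)) OR mask == mask, i.e. f
    differs from h(B) only at low-quality positions."""
    return all(((a ^ b) | m) == m for a, b, m in zip(f, fB, mask))

fB = [1, 2, 0, 3]                                  # h(B) of a block, w = 4
mask = quality_mask([40, 40, 10, 40, 40, 40, 40, 40], 4)
print(mask)                                        # [0, 0, 3, 0]
print(passes_filter([1, 2, 3, 3], fB, mask))       # True: differs only at pos 2
print(passes_filter([0, 2, 0, 3], fB, mask))       # False: differs at pos 0
```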
BW-ERNE: implementation details
We implemented our algorithm and data structure in the short-read aligner BW-ERNE (Burrows-Wheeler Extended Randomized Numerical alignEr), downloadable at http://erne.sourceforge.net. As the hash function for our index we use h⊕. Given a text T ∈ Σn, we calculate h⊕(T)BWT, the Burrows-Wheeler transform of h⊕(T), adding the additional structures needed to perform backward search and to retrieve text positions from h⊕(T)BWT positions. BW-ERNE also includes (from its predecessor ERNE) a simple and fast strategy to allow a single indel in the alignment. This strategy does not affect running times and permits the correct alignment of a large fraction of the short reads that come with indels (see the Results section). It is well known that DNA is extremely difficult to compress, and for this reason we chose not to introduce compression in our structure. Even though our index is not compressed, experiments show (see the Results section) that its memory requirements are similar to or even smaller than those of other tools based on the FM index, such as Bowtie [2], BWA [4] and SOAP2 [3]. Briefly, the structure is composed of three parts: the BWT index, the plain text, and an auxiliary (standard) hash.
BWT index
The BWT index is constituted by h⊕(T)BWT stored as a wavelet tree (n log σ bits), rank counters (o(n log σ) bits), and sampled suffix array pointers for its navigation (n + o(n) bits for one rank structure and 2n bits for one SA pointer every 16 text positions; the user can, however, modify the SA pointer density).
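For illustration, here is a textbook backward-search sketch over a BWT, with rank computed naively where the real index would query the wavelet tree in O(log σ) time (this is a generic FM-index sketch, not BW-ERNE's code):

```python
def bwt(s):
    """BWT of s + terminator, via sorted rotations (illustrative only)."""
    s += "\0"
    rots = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(r[-1] for r in rots)

def backward_search(bw, pat):
    """Count occurrences of pat: one rank-based interval refinement per
    character, processed right to left."""
    first = sorted(bw)
    C = {c: first.index(c) for c in set(bw)}  # chars smaller than c
    lo, hi = 0, len(bw)
    for c in reversed(pat):
        lo = C[c] + bw[:lo].count(c)          # naive rank(c, lo)
        hi = C[c] + bw[:hi].count(c)          # naive rank(c, hi)
        if lo >= hi:
            return 0
    return hi - lo

b = bwt("0123012")
print(backward_search(b, "012"))  # 2
print(backward_search(b, "30"))   # 1
```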
Plain text
is stored in a 3-bits-per-base format in blocks of 8 symbols (3n bits). We exploit this encoding to perform O(1) text-query comparison of a single block, improving the speed of the algorithm.
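The constant-time block comparison can be sketched with one XOR plus a standard bit trick that collapses each 3-bit field onto a single mismatch flag; the exact packing layout below is our assumption:

```python
FIELD = 0o11111111  # lowest bit of each of the 8 three-bit fields

def pack(symbols):
    """Pack 8 symbols (3 bits each) into a single 24-bit word."""
    x = 0
    for i, s in enumerate(symbols):
        x |= (s & 0b111) << (3 * i)
    return x

def block_mismatches(a, b):
    """Mismatching symbols between two packed blocks in O(1) word ops:
    XOR, then OR the bits of every field down onto its lowest bit."""
    x = a ^ b
    flags = (x | (x >> 1) | (x >> 2)) & FIELD
    return bin(flags).count("1")

A = pack([0, 1, 2, 3, 4, 0, 1, 2])
B = pack([0, 1, 2, 3, 4, 0, 3, 2])
print(block_mismatches(A, B))  # 1
print(block_mismatches(A, A))  # 0
```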
Auxiliary hash
To speed up lookup operations, we finally store an auxiliary hash HAUX that indexes the waux most significant digits of the fingerprints: the intervals obtained by backward search on all the numbers in the set Σwaux are precomputed and stored in HAUX. In this way, a lookup operation on HBWT requires one lookup in HAUX followed by w − waux steps of backward search. We require HAUX to occupy only n bits. This limit gives us an upper bound of waux ≤ logσ(n/log n) (up to constant terms). It can be proved [11] that the optimal word size for our algorithm is w = logσ(mn). Combining these results, it follows that the cost of a lookup operation in our data structure is O(log m).
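The role of HAUX can be sketched as a precomputed table of backward-search intervals: every length-waux string is mapped to the interval reached after its backward-search steps, so a lookup replaces the first waux steps by a single table access and then performs the remaining w − waux steps (naive rank again, for clarity; not BW-ERNE's code):

```python
from itertools import product

def bwt(s):
    """BWT of s + terminator, via sorted rotations (illustrative only)."""
    s += "\0"
    return "".join(r[-1] for r in sorted(s[i:] + s[:i] for i in range(len(s))))

def step(bw, C, lo, hi, c):
    """One backward-search step via rank (naive here)."""
    return C[c] + bw[:lo].count(c), C[c] + bw[:hi].count(c)

def build_aux(bw, C, alphabet, w_aux):
    """H_AUX analogue: precomputed interval for every length-w_aux string."""
    table = {}
    for s in product(alphabet, repeat=w_aux):
        lo, hi = 0, len(bw)
        for c in reversed(s):
            lo, hi = step(bw, C, lo, hi, c)
        table[s] = (lo, hi)
    return table

def lookup(bw, C, table, fp, w_aux):
    """One H_AUX access, then w - w_aux backward-search steps."""
    lo, hi = table[tuple(fp[-w_aux:])]     # part of fp consumed first
    for c in reversed(fp[:-w_aux]):
        lo, hi = step(bw, C, lo, hi, c)
    return hi - lo

bw = bwt("0123012")
C = {c: sorted(bw).index(c) for c in set(bw)}
table = build_aux(bw, C, "0123", 2)
print(lookup(bw, C, table, "012", 2))  # 2
```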
Summing up, the total space occupancy of the dB-hash data structure implemented in BW-ERNE is n log σ + 7n + o(n log σ) bits, corresponding in practice to approximately 1.4n Bytes (this fraction may slightly vary for different reference sizes): the index is succinct.