### Technical background

A Bloom filter (BF) is a bit array *A* of size *m* with all positions initially set to 0. An element is inserted into *A* by applying a collection of *h* hash functions, each of whose outputs specifies a position in *A* to set to 1. To query whether an element has been inserted, we apply the same *h* hash functions and check the values at the positions they return. If any position holds a 0, the element definitely was not inserted; if all positions hold a 1, either the element was inserted or it is a false positive. For a BF of size *m* into which *n* elements have been inserted using *h* hash functions, the false positive rate is *F ≈* (1 *− e*^{−hn/m})^{h}. In [11], it is shown that for fixed *m* and *n*, *F* is minimized at *h = ln*(2)*ρ*, where *ρ = m/n*. Substituting this value back into *F* gives *F = c*^{ρ}, where *c ≈* 0.6185.
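The mechanics above can be illustrated with a minimal Bloom filter sketch in Python. This is illustrative code, not the paper's implementation; the salted-SHA-256 hashing scheme is an arbitrary choice, and one byte per bit position is used for simplicity.

```python
import hashlib
import math

class BloomFilter:
    """Minimal Bloom filter sketch (illustrative, not the paper's code)."""

    def __init__(self, m, h):
        self.m = m                 # number of positions
        self.h = h                 # number of hash functions
        self.bits = bytearray(m)   # one byte per position, all initially 0

    def _positions(self, item):
        # Derive h positions by salting a single hash function with an index.
        for i in range(self.h):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def insert(self, item):
        for p in self._positions(item):
            self.bits[p] = 1

    def query(self, item):
        # True means "possibly inserted"; False means "definitely not".
        return all(self.bits[p] for p in self._positions(item))

def false_positive_rate(m, n, h):
    """F ~= (1 - e^{-hn/m})^h for n insertions into m positions, h hashes."""
    return (1.0 - math.exp(-h * n / m)) ** h

def optimal_h(m, n):
    """h = ln(2) * (m/n) minimizes F, giving F = 0.6185^{m/n}."""
    return math.log(2) * m / n
```

Any inserted element is always accepted again, while a non-inserted element is accepted only with probability *F*.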

### Encoding and decoding using a Bloom filter

Our method involves two basic processes: BF loading and querying. We initially assume all reads are unique, and later relax this assumption. We load all reads into a BF *B*, and then use the reference genome to query it: we query *B* with read-length subsequences (and their reverse complements) starting from every possible position on the genome. This identifies all potential reads corresponding to genome positions, a set that covers most of the hashed reads. Some of the accepted queries will be false positives. To exclude them during decoding, we identify a set *FP* consisting of all reads accepted by *B* that are not in the original read set. Additionally, since the reads are taken from a specimen whose genome contains mutations relative to the reference (and since sequencing is error-prone), some reads will not be recovered by querying the genome. We call this set of reads *FN*. *FN* and *FP* are stored separately from *B* and compressed using an off-the-shelf compression tool. For a set of unique reads, this suffices to allow a complete reconstruction of the reads.

Decoding proceeds by decompressing *B*, *FN*, and *FP*, and then repeating the querying procedure. We initialize the read set to *FN*. Then, we query *B* with each position of the genome, as was done to identify the elements of *FP*. Whenever *B* accepts a query, we check whether the accepted read is in *FP*, and add it to the read set if it is not. To remove the unique-read restriction, we move all repeated reads to *FN* before loading *B*; we treat reads containing 'N' characters similarly. These two additions allow us to circumvent an inherent limitation of Bloom filters (the loss of multiplicity information) and reduce the entropy in the (now multi-) set *FN*, making it more compressible. The encoding process with one BF is detailed in steps 1-4 of Figure 1 and in Algorithm 1. The relative contributions of error reads and repeated reads to *FN* are discussed in the appendix.

**Algorithm 1** Encode one. **Input**: *R*, *G*; **Output**: *B*, *FN*, *FP*. **Conventions**: Let *g* be the length of the reference genome *G*, *q*_{i} be the *i*^{th} genome query, *ℓ*_{read} be the sequenced read length, and *P* be the set of genome queries accepted by *B*. For brevity, we exhibit queries from only the forward strand, whereas our implementation queries (and accepts from) both strands.

FN := {*r* : *r ∈ R* and (*r* is repeated in *R* or *r* contains an 'N')}

R' := R \ FN

**for all** *r ∈ R'* **do**

insert(*r, B*)

**end for**

**for all** *i ∈* [1, *g − ℓ*_{read} + 1] **do**

**if** *q*_{i} *∈ B* **then**

P := P ∪ {q_{i}}

**end if**

**end for**

FN := FN ∪ (R' \ P)

FP := P \ R'

return (*B, FN, FP*)
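Assuming a Bloom-filter object with `insert`/`query` methods (interface names are illustrative), Algorithm 1 can be sketched in Python as follows (forward strand only, as in the pseudocode):

```python
from collections import Counter

def encode_one(reads, genome, read_len, bf):
    """Sketch of Algorithm 1. `bf` is a pre-sized Bloom-filter-like object
    with insert/query methods (illustrative interface, forward strand only)."""
    counts = Counter(reads)
    # Repeated reads and reads containing 'N' go straight to FN.
    fn = [r for r in reads if counts[r] > 1 or 'N' in r]
    r_unique = {r for r in reads if counts[r] == 1 and 'N' not in r}

    for r in r_unique:
        bf.insert(r)

    # Query every read-length window of the genome; collect acceptances P.
    accepted = set()
    for i in range(len(genome) - read_len + 1):
        q = genome[i:i + read_len]
        if bf.query(q):
            accepted.add(q)

    fn.extend(r_unique - accepted)   # reads never recovered by genome queries
    fp = accepted - r_unique         # windows falsely accepted by the BF
    return bf, fn, fp
```

With an exact set standing in for the BF, `fp` is empty; with a real BF, `fp` collects the windows accepted by chance.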

### Encoding and decoding using a BF cascade

Although appealingly simple, the above method did not offer competitive compression: the cost of encoding *FP* and *FN* outweighed the benefit of storing the unique reads in *B*. Thus, to reduce the number of false positives that must be compressed separately, we use a cascade of BFs as in [10]. To this end, we rename *B* and *FP* above as *B*_{1} and *FP*_{1}, respectively. We consider *B*_{1} to be the first BF in a cascade, and each element of *FP*_{1} is then hashed into a subsequent BF *B*_{2}. Since *B*_{2} is meant to store false positive reads, it should reject true reads: any element of *R'* (the set of unique reads) accepted by *B*_{2} is a false positive relative to *B*_{2}. Thus, to identify *FP*_{2}, we collect every element of *R'* accepted when querying *B*_{2}. This process can be continued for any desired number of BFs. Once the BFs are loaded in this way, we identify real reads by querying each BF in the cascade in turn, accepting a read only if the index of the first BF to reject it is even.

Since the elements inserted into *B*_{j} are necessarily a subset of those inserted into *B*_{j−2}, we see an exponential drop-off in BF size (since *F* is fixed). Since the sizes of successive BFs alternately depend on *nρ* and (2*g − n*)*ρF ≈* 2*gρF* (the number of false positives expected for 2*g* queries from *G*, multiplied by the cost *ρ* per element, assuming *g ≫ n*), we expect the total file size to be approximately *nρ* + 2*gρF* + *nρF* + 2*gρF*^{2} + *...* bits. Using *F = c*^{ρ} from above, we observe that for an infinite cascade, the average number of bits per read is then

\left(\rho + \frac{2g\rho {c}^{\rho}}{n}\right)\left(1+{c}^{\rho}+{c}^{2\rho}+\ldots\right)=\left(1+\frac{2g{c}^{\rho}}{n}\right)\left(\frac{\rho}{1-{c}^{\rho}}\right).

(1)

Here the left-hand side represents the sum of costs due to the expected number of elements in each BF of an infinite cascade. In practice, we use four BFs and a numerical solver in scipy [12] employing the L-BFGS-B [13] algorithm to find the value of *ρ* minimizing the above expression. The small list *FP*_{4} is encoded separately along with *FN*. The process is described in Figure 1, steps 5-11, and in Algorithm 2. Decoding proceeds using queries from *G* as before, but now each accepted read is used to query subsequent BFs until rejection. This is depicted in Figure 2 and Algorithm 3.
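Equation (1) is easy to evaluate and minimize numerically. The paper uses scipy's L-BFGS-B; the sketch below substitutes a plain grid scan (an illustrative replacement, with arbitrary search bounds) to show the shape of the optimization:

```python
import math

C = 0.6185  # the base c in F = c^rho at the optimal hash count

def bits_per_read(rho, g, n):
    """Right-hand side of Eq. (1): expected bits per read, infinite cascade."""
    f = C ** rho
    return (1.0 + 2.0 * g * f / n) * (rho / (1.0 - f))

def best_rho(g, n, lo=1.0, hi=60.0, steps=10000):
    """Grid scan for the minimizing rho. The paper uses scipy's L-BFGS-B;
    a plain scan is used here purely for illustration."""
    return min((lo + i * (hi - lo) / steps for i in range(steps + 1)),
               key=lambda r: bits_per_read(r, g, n))
```

For human-scale parameters (*g* of a few gigabases, *n* of hundreds of millions of reads), the minimum lies at a moderate *ρ*, balancing per-element cost against false positives leaking down the cascade.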

**Algorithm 2** Encoding. Let *B*_{j} be the *j*^{th} BF (*j ∈* [2, 4]), loaded with *FP*_{j−1}; *S ∩ B*_{j} is shorthand for the subset of *S* accepted by *B*_{j}.

(*B*_{1}, *FN*, *FP*_{1}) := **Encode one**(*R, G*) # we initialize by calling Algorithm 1

**for all** *r ∈ FP*_{1} **do**

insert(*r, B*_{2})

**end for**

FP_{2} := R' ∩ B_{2}

**for** *j* = 3 **to** 4 **do**

**for all** *r ∈ FP*_{j−1} **do**

insert(*r, B*_{j})

**end for**

FP_{j} := FP_{j−2} ∩ B_{j}

**end for**

return (*B*_{1}, *B*_{2}, *B*_{3}, *B*_{4}, *FN*, *FP*_{4})
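The cascade-loading step of Algorithm 2 can be sketched as follows, again assuming Bloom-filter-like objects with `insert`/`query` methods (hypothetical interface). Note that *FP*_{j} is collected from *R'* for *j* = 2 and from *FP*_{j−2} for *j* = 3, 4, since those are exactly the elements that reach *B*_{j} during decoding but should be rejected by it:

```python
def build_cascade(bf_list, r_unique, fp1):
    """Sketch of Algorithm 2 beyond the first filter. `bf_list` holds three
    empty Bloom-filter-like objects for B_2..B_4 (insert/query methods;
    interface names illustrative). Returns the loaded filters and FP_4."""
    fp = {1: set(fp1)}
    b = {}
    for j in (2, 3, 4):
        b[j] = bf_list[j - 2]
        for r in fp[j - 1]:                      # B_j stores FP_{j-1}
            b[j].insert(r)
        # FP_j: elements B_j should reject but accepts by chance.
        src = r_unique if j == 2 else fp[j - 2]
        fp[j] = {q for q in src if b[j].query(q)}
    return b[2], b[3], b[4], fp[4]
```

With exact sets standing in for the BFs, every *FP*_{j} for *j ≥* 2 is empty; with real BFs, each level shrinks by roughly the factor *F*.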

**Algorithm 3** Decoding. **Input**: (*B*_{1}, *B*_{2}, *B*_{3}, *B*_{4}, *FN*, *FP*_{4}, *G*); **Output**: *R*_{rc}. For brevity, reconstruction of only one strand is shown.

R_{rc} := FN

**for all** *i ∈* [1, *g − ℓ*_{read} + 1] **do**

**for** *j* = 1 **to** 4 **do**

**if** *q*_{i} *∉ B*_{j} **then**

**if** *j* is even **then**

R_{rc} := R_{rc} ∪ {q_{i}}

**end if**

continue {increment *i*}

**end if**

**end for**

**if** *j* = 4 **and** *q*_{i} *∈ FP*_{4} **then**

R_{rc} := R_{rc} ∪ {q_{i}}

**end if**

**end for**

return *R*_{rc}
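The accept-on-even-rejection rule can be sketched in Python as follows (forward strand only; the `query` interface of the filters is illustrative):

```python
def decode_reads(genome, read_len, bfs, fn, fp4):
    """Sketch of Algorithm 3. `bfs` is [B_1, B_2, B_3, B_4] (objects with a
    query method; illustrative interface). A window is a read if the first
    filter rejecting it has even index, or if all four accept and it is
    listed in FP_4."""
    reads = list(fn)                 # initialize the read set to FN
    for i in range(len(genome) - read_len + 1):
        q = genome[i:i + read_len]
        first_reject = None
        for j, b in enumerate(bfs, start=1):
            if not b.query(q):
                first_reject = j
                break
        if first_reject is None:
            if q in fp4:             # survived all filters: real iff in FP_4
                reads.append(q)
        elif first_reject % 2 == 0:  # rejected by B_2 or B_4: a real read
            reads.append(q)
    return reads
```

For instance, a window stored in *B*_{1} but absent from *FP*_{1} is (with high probability) rejected by *B*_{2}, giving first rejection index 2, and is therefore reported as a read.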

### Additional compression

BF parameters are set automatically to make each BF more compressible. For each BF, we increment the number of hash functions *h* from 1 until the filter both has an uncompressed file size below a preset threshold (we used 500 MB) and achieves the value of *F* obtained from Equation 1. Typically, this results in *h* between 1 and 3. We do this to reduce each BF's compressed size (at the expense of a larger RAM footprint); this practice was introduced in [11].
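This tuning loop can be sketched by solving *F* = (1 *− e*^{−hn/m})^{h} for *m* at each candidate *h*; the size threshold and the cap on *h* in the sketch below are illustrative parameters, not the paper's values:

```python
import math

def bits_needed(n, h, target_f):
    """Bits m such that n insertions with h hash functions give false
    positive rate target_f: solve F = (1 - e^{-hn/m})^h for m."""
    return -h * n / math.log(1.0 - target_f ** (1.0 / h))

def smallest_h(n, target_f, max_bits, h_max=16):
    """Smallest h whose filter both achieves target_f and fits within
    max_bits uncompressed (mirrors the tuning described above)."""
    for h in range(1, h_max + 1):
        if bits_needed(n, h, target_f) <= max_bits:
            return h
    return None
```

Below the optimal *h = ln*(2)*ρ*, adding a hash function reduces the array size needed for a given *F*, so the loop stops at the first *h* that fits the size budget; the resulting sparse array then compresses well.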

Once the BFs are loaded and the sets *FP*_{4} and *FN* are identified, we use 7zip [14] to compress *B*_{1}, ..., *B*_{4} and SCALCE [4] to compress the output lists *FP*_{4} and *FN*. In principle, any general-purpose compression tool can be used for the BFs; for the leftover reads, it is preferable to use a tool that takes advantage of sequence overlaps among them to compress them efficiently.