### Technical background

A Bloom filter (BF) is a bit array *A* of size *m* with all positions initially set to 0. An element is inserted into *A* by applying a collection of *h* hash functions, each of whose outputs specifies a position in *A* to set to 1. To query whether an element has been inserted, we apply the same *h* hash functions and check the values at the positions they return. If any position holds a 0, the element definitely was not inserted; if all positions hold a 1, either the element was inserted or it is a false positive. For a BF of size *m* into which *n* elements have been inserted using *h* hash functions, the false positive rate is *F ≈* (1 *− e*^{−hn/m})^{h}. In [11], it is shown that for fixed *m* and *n*, *F* is minimized at *h = ln*(2)*ρ*, where *ρ = m/n*. Substituting this value back into *F* gives *F = c*^{ρ}, where *c ≈* 0.6185.
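The mechanics above can be illustrated with a minimal Bloom filter sketch in Python. This is illustrative code, not the paper's implementation; the salted-SHA-256 hashing scheme is an arbitrary choice, and one byte per bit position is used for simplicity.

```python
import hashlib
import math

class BloomFilter:
    """Minimal Bloom filter sketch (illustrative, not the paper's code)."""

    def __init__(self, m, h):
        self.m = m                 # number of positions
        self.h = h                 # number of hash functions
        self.bits = bytearray(m)   # one byte per position, all initially 0

    def _positions(self, item):
        # Derive h positions by salting a single hash function with an index.
        for i in range(self.h):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def insert(self, item):
        for p in self._positions(item):
            self.bits[p] = 1

    def query(self, item):
        # True means "possibly inserted"; False means "definitely not".
        return all(self.bits[p] for p in self._positions(item))

def false_positive_rate(m, n, h):
    """F ~= (1 - e^{-hn/m})^h for n insertions into m positions, h hashes."""
    return (1.0 - math.exp(-h * n / m)) ** h

def optimal_h(m, n):
    """h = ln(2) * (m/n) minimizes F, giving F = 0.6185^{m/n}."""
    return math.log(2) * m / n
```

Any inserted element is always accepted again, while a non-inserted element is accepted only with probability *F*.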

### Encoding and decoding using a Bloom filter

Our method involves two basic processes: BF loading and querying. We initially assume all reads are unique, and later relax this assumption. We load all reads into a BF *B*, and then use the reference genome to query it: we query *B* with read-length subsequences (and their reverse complements) starting from every possible position on the genome. This identifies all potential reads corresponding to genome positions, a set that covers most of the hashed reads. Some of the accepted queries will be false positives. To exclude them during decoding, we identify a set *FP* consisting of all reads accepted by *B* that are not in the original read set. Additionally, since the reads are taken from a specimen whose genome contains mutations relative to the reference (and since sequencing is error-prone), some reads will not be recovered by querying the genome. We call this set of reads *FN*. *FN* and *FP* are stored separately from *B* and compressed using an off-the-shelf compression tool. For a set of unique reads, this suffices to allow a complete reconstruction of the reads.

Decoding proceeds by decompressing *B*, *FN*, and *FP*, and then repeating the querying procedure. We initialize the read set to *FN*. Then, we query *B* with each position of the genome, as was done to identify the elements of *FP*. Whenever *B* accepts a query, we check whether the accepted read is in *FP*, and add it to the read set if it is not. To remove the unique-read restriction, we move all repeated reads to *FN* before loading *B*; we treat reads containing 'N' characters similarly. These two additions allow us to circumvent an inherent limitation of Bloom filters (the loss of multiplicity information) and reduce the entropy in the (now multi-) set *FN*, making it more compressible. The encoding process with one BF is detailed in steps 1-4 of Figure 1 and in Algorithm 1. The relative contributions of error reads and repeated reads to *FN* are discussed in the appendix.

**Algorithm 1** Encode one. **Input**: *R*, *G*; **Output**: *B*, *FN*, *FP*. **Conventions**: Let *g* be the length of the reference genome *G*, *q*_{i} be the *i*^{th} genome query, *ℓ*_{read} be the sequenced read length, and *P* be the set of genome queries accepted by *B*. For brevity, we exhibit queries from only the forward strand, whereas our implementation queries (and accepts from) both strands.

FN := {*r* : *r ∈ R* and (*r* is repeated in *R* or *r* contains an 'N')}

R' := R \ FN

**for all** *r ∈ R'* **do**

insert(*r, B*)

**end for**

**for all** *i ∈* [1, *g − ℓ*_{read} + 1] **do**

**if** *q*_{i} *∈ B* **then**

P := P ∪ {q_{i}}

**end if**

**end for**

FN := FN ∪ (R' \ P)

FP := P \ R'

return (*B, FN, FP*)
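Assuming a Bloom-filter object with `insert`/`query` methods (interface names are illustrative), Algorithm 1 can be sketched in Python as follows (forward strand only, as in the pseudocode):

```python
from collections import Counter

def encode_one(reads, genome, read_len, bf):
    """Sketch of Algorithm 1. `bf` is a pre-sized Bloom-filter-like object
    with insert/query methods (illustrative interface, forward strand only)."""
    counts = Counter(reads)
    # Repeated reads and reads containing 'N' go straight to FN.
    fn = [r for r in reads if counts[r] > 1 or 'N' in r]
    r_unique = {r for r in reads if counts[r] == 1 and 'N' not in r}

    for r in r_unique:
        bf.insert(r)

    # Query every read-length window of the genome; collect acceptances P.
    accepted = set()
    for i in range(len(genome) - read_len + 1):
        q = genome[i:i + read_len]
        if bf.query(q):
            accepted.add(q)

    fn.extend(r_unique - accepted)   # reads never recovered by genome queries
    fp = accepted - r_unique         # windows falsely accepted by the BF
    return bf, fn, fp
```

With an exact set standing in for the BF, `fp` is empty; with a real BF, `fp` collects the windows accepted by chance.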

### Encoding and decoding using a BF cascade

Although appealingly simple, the above method did not offer competitive compression: the cost of encoding *FP* and *FN* outweighed the benefit of storing the unique reads in *B*. Thus, to reduce the number of false positives that must be compressed separately, we use a cascade of BFs as in [10]. To this end, we rename *B* and *FP* above as *B*_{1} and *FP*_{1}, respectively. We consider *B*_{1} to be the first BF in a cascade, and each element of *FP*_{1} is then hashed into a subsequent BF *B*_{2}. Since *B*_{2} is meant to store false positive reads, it should reject true reads: any element of *R'* (the set of unique reads) accepted by *B*_{2} is a false positive relative to *B*_{2}. Thus, to identify *FP*_{2}, we collect every element of *R'* accepted when querying *B*_{2}. This process can be continued for any desired number of BFs. Once the BFs are loaded in this way, we identify real reads by querying each BF in the cascade in turn, accepting a read only if the index of the first BF to reject it is even.

Since the elements inserted into *B*_{j} are necessarily a subset of those inserted into *B*_{j−2}, we see an exponential drop-off in BF size (since *F* is fixed). Since the sizes of successive BFs alternately depend on *nρ* and (2*g − n*)*ρF ≈* 2*gρF* (the number of false positives expected for 2*g* queries from *G*, multiplied by the cost *ρ* per element, assuming *g ≫ n*), we expect the total file size to be approximately *nρ* + 2*gρF* + *nρF* + 2*gρF*^{2} + *...* bits. Using *F = c*^{ρ} from above, we observe that for an infinite cascade, the average number of bits per read is then

\left(\rho + \frac{2g\rho {c}^{\rho}}{n}\right)\left(1+{c}^{\rho}+{c}^{2\rho}+\ldots\right)=\left(1+\frac{2g{c}^{\rho}}{n}\right)\left(\frac{\rho}{1-{c}^{\rho}}\right).

(1)

Here the left-hand side represents the sum of costs due to the expected number of elements in each BF of an infinite cascade. In practice, we use four BFs and a numerical solver in scipy [12] employing the L-BFGS-B [13] algorithm to find the value of *ρ* minimizing the above expression. The small list *FP*_{4} is encoded separately along with *FN*. The process is described in Figure 1, steps 5-11, and in Algorithm 2. Decoding proceeds using queries from *G* as before, but now each accepted read is used to query subsequent BFs until rejection. This is depicted in Figure 2 and Algorithm 3.
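Equation (1) is easy to evaluate and minimize numerically. The paper uses scipy's L-BFGS-B; the sketch below substitutes a plain grid scan (an illustrative replacement, with arbitrary search bounds) to show the shape of the optimization:

```python
import math

C = 0.6185  # the base c in F = c^rho at the optimal hash count

def bits_per_read(rho, g, n):
    """Right-hand side of Eq. (1): expected bits per read, infinite cascade."""
    f = C ** rho
    return (1.0 + 2.0 * g * f / n) * (rho / (1.0 - f))

def best_rho(g, n, lo=1.0, hi=60.0, steps=10000):
    """Grid scan for the minimizing rho. The paper uses scipy's L-BFGS-B;
    a plain scan is used here purely for illustration."""
    return min((lo + i * (hi - lo) / steps for i in range(steps + 1)),
               key=lambda r: bits_per_read(r, g, n))
```

For human-scale parameters (*g* of a few gigabases, *n* of hundreds of millions of reads), the minimum lies at a moderate *ρ*, balancing per-element cost against false positives leaking down the cascade.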

**Algorithm 2** Encoding. Let *B*_{j} be the *j*^{th} BF (*j ∈* [2, 4]), loaded with *FP*_{j−1}; *S ∩ B*_{j} is shorthand for the subset of *S* accepted by *B*_{j}.

(*B*_{1}, *FN*, *FP*_{1}) := **Encode one**(*R, G*) # we initialize by calling Algorithm 1

**for all** *r ∈ FP*_{1} **do**

insert(*r, B*_{2})

**end for**

FP_{2} := R' ∩ B_{2}

**for** *j* = 3 **to** 4 **do**

**for all** *r ∈ FP*_{j−1} **do**

insert(*r, B*_{j})

**end for**

FP_{j} := FP_{j−2} ∩ B_{j}

**end for**

return (*B*_{1}, *B*_{2}, *B*_{3}, *B*_{4}, *FN*, *FP*_{4})
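The cascade-loading step of Algorithm 2 can be sketched as follows, again assuming Bloom-filter-like objects with `insert`/`query` methods (hypothetical interface). Note that *FP*_{j} is collected from *R'* for *j* = 2 and from *FP*_{j−2} for *j* = 3, 4, since those are exactly the elements that reach *B*_{j} during decoding but should be rejected by it:

```python
def build_cascade(bf_list, r_unique, fp1):
    """Sketch of Algorithm 2 beyond the first filter. `bf_list` holds three
    empty Bloom-filter-like objects for B_2..B_4 (insert/query methods;
    interface names illustrative). Returns the loaded filters and FP_4."""
    fp = {1: set(fp1)}
    b = {}
    for j in (2, 3, 4):
        b[j] = bf_list[j - 2]
        for r in fp[j - 1]:                      # B_j stores FP_{j-1}
            b[j].insert(r)
        # FP_j: elements B_j should reject but accepts by chance.
        src = r_unique if j == 2 else fp[j - 2]
        fp[j] = {q for q in src if b[j].query(q)}
    return b[2], b[3], b[4], fp[4]
```

With exact sets standing in for the BFs, every *FP*_{j} for *j ≥* 2 is empty; with real BFs, each level shrinks by roughly the factor *F*.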

**Algorithm 3** Decoding. **Input**: (*B*_{1}, *B*_{2}, *B*_{3}, *B*_{4}, *FN*, *FP*_{4}, *G*); **Output**: *R*_{rc}. For brevity, reconstruction of only one strand is shown.

R_{rc} := FN

**for all** *i ∈* [1, *g − ℓ*_{read} + 1] **do**

**for** *j* = 1 **to** 4 **do**

**if** *q*_{i} *∉ B*_{j} **then**

**if** *j* is even **then**

R_{rc} := R_{rc} ∪ {q_{i}}

**end if**

continue {increment *i*}

**end if**

**end for**

**if** *j* = 4 **and** *q*_{i} *∈ FP*_{4} **then**

R_{rc} := R_{rc} ∪ {q_{i}}

**end if**

**end for**

return *R*_{rc}
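The accept-on-even-rejection rule can be sketched in Python as follows (forward strand only; the `query` interface of the filters is illustrative):

```python
def decode_reads(genome, read_len, bfs, fn, fp4):
    """Sketch of Algorithm 3. `bfs` is [B_1, B_2, B_3, B_4] (objects with a
    query method; illustrative interface). A window is a read if the first
    filter rejecting it has even index, or if all four accept and it is
    listed in FP_4."""
    reads = list(fn)                 # initialize the read set to FN
    for i in range(len(genome) - read_len + 1):
        q = genome[i:i + read_len]
        first_reject = None
        for j, b in enumerate(bfs, start=1):
            if not b.query(q):
                first_reject = j
                break
        if first_reject is None:
            if q in fp4:             # survived all filters: real iff in FP_4
                reads.append(q)
        elif first_reject % 2 == 0:  # rejected by B_2 or B_4: a real read
            reads.append(q)
    return reads
```

For instance, a window stored in *B*_{1} but absent from *FP*_{1} is (with high probability) rejected by *B*_{2}, giving first rejection index 2, and is therefore reported as a read.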

### Additional compression

BF parameters are set automatically to make each BF more compressible. For each BF, we increment the number of hash functions *h* from 1 until the filter both has an uncompressed file size below a preset threshold (we used 500 MB) and achieves the value of *F* obtained from Equation 1. Typically, this results in *h* between 1 and 3. We do this to reduce each BF's compressed size (at the expense of a larger RAM footprint); this practice was introduced in [11].
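This tuning loop can be sketched by solving *F* = (1 *− e*^{−hn/m})^{h} for *m* at each candidate *h*; the size threshold and the cap on *h* in the sketch below are illustrative parameters, not the paper's values:

```python
import math

def bits_needed(n, h, target_f):
    """Bits m such that n insertions with h hash functions give false
    positive rate target_f: solve F = (1 - e^{-hn/m})^h for m."""
    return -h * n / math.log(1.0 - target_f ** (1.0 / h))

def smallest_h(n, target_f, max_bits, h_max=16):
    """Smallest h whose filter both achieves target_f and fits within
    max_bits uncompressed (mirrors the tuning described above)."""
    for h in range(1, h_max + 1):
        if bits_needed(n, h, target_f) <= max_bits:
            return h
    return None
```

Below the optimal *h = ln*(2)*ρ*, adding a hash function reduces the array size needed for a given *F*, so the loop stops at the first *h* that fits the size budget; the resulting sparse array then compresses well.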

Once the BFs are loaded and the sets *FP*_{4} and *FN* are identified, we use 7zip [14] to compress *B*_{1}, ..., *B*_{4} and SCALCE [4] to compress the output lists *FP*_{4} and *FN*. In principle, any general-purpose compression tool can be used for the BFs; for the leftover reads, it is preferable to use a tool that takes advantage of sequence overlaps among them to compress them efficiently.