Volume 15 Supplement 9
Proceedings of the Fourth Annual RECOMB Satellite Workshop on Massively Parallel Sequencing (RECOMBSeq 2014)
Fast lossless compression via cascading Bloom filters
Roye Rozov^{1}, Ron Shamir^{1} and Eran Halperin^{1, 2, 3}
DOI: 10.1186/1471-2105-15-S9-S7
© Rozov et al.; licensee BioMed Central Ltd. 2014
Published: 10 September 2014
Abstract
Background
Data from large Next Generation Sequencing (NGS) experiments present challenges both in terms of costs associated with storage and in time required for file transfer. It is sometimes possible to store only a summary relevant to particular applications, but generally it is desirable to keep all information needed to revisit experimental results in the future. Thus, the need for efficient lossless compression methods for NGS reads arises. It has been shown that NGS-specific compression schemes can improve results over generic compression methods, such as the Lempel-Ziv algorithm, Burrows-Wheeler transform, or Arithmetic Coding. When a reference genome is available, effective compression can be achieved by first aligning the reads to the reference genome, and then encoding each read using the alignment position combined with the differences in the read relative to the reference. These reference-based methods have been shown to compress better than reference-free schemes, but the alignment step they require demands several hours of CPU time on a typical dataset, whereas reference-free methods can usually compress in minutes.
Results
We present a new approach that achieves highly efficient compression by using a reference genome, but completely circumvents the need for alignment, affording a great reduction in the time needed to compress. In contrast to reference-based methods that first align reads to the genome, we hash all reads into Bloom filters to encode, and decode by querying the same Bloom filters using read-length subsequences of the reference genome. Further compression is achieved by using a cascade of such filters.
Conclusions
Our method, called BARCODE, runs an order of magnitude faster than reference-based methods, while compressing an order of magnitude better than reference-free methods, over a broad range of sequencing coverage. At high coverage (50-100 fold), compared to the best tested compressors, BARCODE saves 80-90% of the running time while only increasing space slightly.
Keywords
Lossless compression, Bloom filter, Storage, Sequencing, NGS, Alignment-free
Background
Deep sequencing has become almost ubiquitous in biology over the last decade. In the past five years, sequencing costs were halved every 5 months, while storage costs were halved every 14 months [1]. The long-term effect of this trend is a growing gap between our capacity to store and analyze sequencing data, and our capacity to generate such data. For sharing results of large-scale experiments, the effects have already become readily apparent: physical hard disk transfer has become a common practice, and cloud analysis platforms have been embraced in order to avoid the prohibitive time requirements needed to download or store huge volumes.
As a result, much effort has been devoted to representing sequencing data more compactly. Specialized compression tools tailored to this context have emerged, improving upon general-purpose compressors such as gzip. These tools fall into two categories: reference-based [2, 3] and reference-free [4–6]. The former methods utilize knowledge of the genome from which reads were extracted (with mutations and errors), while the latter use no prior information. A recent article described the Pistoia Sequence Squeeze competition, wherein the relative merits of many of these methods were compared. This article also introduced new high-performance methods that were among the competition leaders [1].
Compression algorithms are evaluated by two main criteria: their compression ratio, namely the ratio of compressed to original file size, and their speed. In the context of compressing reads, compression ratios are often expressed as the average number of bits per base for a fixed read length. Currently, reference-based methods generally compress most effectively, but require long run times. To compress reads, reference-based methods first call on a short-read aligner to find a best alignment position for each read. Such an alignment typically has only a few (or no) mismatches relative to the reference. Reads can then be represented as integers marking reference positions instead of as sequences, along with the set of differences relative to the reference. Further refinements can then be applied, such as sorting the reads by reference position and then encoding differences between consecutive positions to use fewer bits, or employing Huffman coding to encode more common mutations with fewer bits than rare ones [2, 3]. Reference-free methods employ a variety of techniques, including boosting schemes for general-purpose compressors [4, 6], rough assembly for the sake of emulating reference-based compression [5], and arithmetic coding/context modeling approaches, which trade increases in run time for better compression ratios [1].
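The position-sorting refinement mentioned above can be sketched in a few lines (a toy illustration with hypothetical helper names, not any tool's actual code): after sorting alignment positions, storing only the gaps between consecutive positions yields small integers that downstream entropy coders compress well.

```python
def delta_encode(positions):
    """Sort alignment positions and replace each (after the first) by its gap
    from the previous position; dense coverage makes these gaps small."""
    positions = sorted(positions)
    return [positions[0]] + [b - a for a, b in zip(positions, positions[1:])]

def delta_decode(deltas):
    """Invert delta_encode by taking a running sum of the gaps."""
    out, total = [], 0
    for d in deltas:
        total += d
        out.append(total)
    return out
```

For example, the positions [1050, 100, 1047, 100000] become the gaps [100, 947, 3, 98950], which need far fewer bits on average than the raw positions.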
There is therefore an inherent tradeoff between run time and compression ratio. Specifically, even though compression ratios are impressive for reference-based methods, their running times are often prohibitively high. In this work we propose a new method, Bloom filter Alignment-free Reference-based COmpression and DEcompression (BARCODE, abbreviated BRC below), which achieves high compression ratios with a dramatic decrease in run time. BARCODE does so by leveraging the space efficiency of Bloom filters, probabilistic data structures that support set-membership queries. Their use has recently grown in popularity in bioinformatics [7, 8, 5, 9], mainly to avoid the memory overhead of storing the large collections of k-length substrings of sequenced reads (k-mers) used to represent nodes of de Bruijn graphs in de novo assembly. To the best of our knowledge, this is the first use of Bloom filters for NGS compression.
Here, we adopt a scheme similar to that used for assembly in two recent works [8, 10]: we hash whole reads into BFs as a means of compression. In our tests, BARCODE's run times are closest to those of reference-free methods, while its compression ratios near those of reference-based methods. In as little as a ninth of the running time, BARCODE compresses to within 20% of the compression ratios observed for reference-based methods. We demonstrate that BARCODE's efficiency improves at higher coverage levels, whereas reference-based methods show no improvement and their run-time disadvantage grows more severe. By comparing our method with several existing tools, we show its superior balance of speed and compression efficiency.
Methods
Technical background
A Bloom filter (BF) is an array A of size m with all positions initially set to 0. Elements are inserted into A by applying a collection of h hash functions: the output of each specifies a position in A to be marked with a 1. Querying whether or not an element has been inserted involves applying the same h hash functions and checking the values at the positions they return. If any of these positions holds a 0, the element definitely was not inserted; if all hold 1, either it has been inserted, or it is a false positive. For a BF of size m into which n entries are inserted using h hash functions, the false positive rate is F ≈ (1 − e^{−hn/m})^{h}. In [11], it is shown that for fixed m and n, F is minimized with h = ln(2)ρ, where ρ = m/n. Plugging this value back in for F leads to F = c^{ρ}, where c ≈ 0.6185.
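The insert/query mechanics and the false-positive formula can be sketched as follows (a minimal illustration, not the BARCODE implementation; deriving positions from salted SHA-256 digests is our own choice):

```python
import hashlib
import math

class BloomFilter:
    """Minimal Bloom filter: an m-position bit array and h hash functions."""

    def __init__(self, m, h):
        self.m = m                 # number of positions
        self.h = h                 # number of hash functions
        self.bits = bytearray(m)   # one byte per position, for simplicity

    def _positions(self, item):
        # Derive h positions from salted SHA-256 digests of the item.
        for i in range(self.h):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def insert(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        # All positions 1 => inserted or false positive; any 0 => definitely absent.
        return all(self.bits[pos] for pos in self._positions(item))

def false_positive_rate(m, n, h):
    """Expected false positive rate F ~ (1 - e^{-hn/m})^h."""
    return (1.0 - math.exp(-h * n / m)) ** h
```

At the optimal h = ln(2)ρ, `false_positive_rate` agrees with the closed form c^ρ, c ≈ 0.6185, quoted above.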
Encoding and decoding using a Bloom filter
Our method involves two basic processes: BF loading and querying. We initially assume all reads are unique and later relax this assumption. We load all reads into a BF B, and then use the reference genome to query it. We query B with read-length subsequences (and their reverse complements) from all possible start positions on the genome. This allows us to identify all of the potential reads that correspond to genome positions, a set that covers most of the hashed reads. Some of the accepted reads will be false positives. In order to avoid them in the decoding process, we identify a set FP corresponding to all reads accepted by B that are not in the original read set. Additionally, since the reads are taken from a specimen whose genome contains mutations compared to the reference (and since sequencing is error-prone), some reads will not be recovered by querying the genome. We call this set of reads FN. FN and FP are stored separately from B, and compressed using an off-the-shelf compression tool. For a set of unique reads, this suffices to allow a complete reconstruction of the reads.
Algorithm 1 Encode_one. Input: R, G; Output: B, FN, FP. Conventions: Let g be the length of the reference genome G, q_{i} be the i^{th} genome query, ℓ_{read} be the sequenced read length, and P be the set of genome queries accepted by B. For brevity, we exhibit queries from only the forward strand, whereas our implementation queries (and accepts from) both strands.
FN := {r : r ∈ R and (r is repeated in R or r contains an 'N')}
R' := R \ FN
for all r ∈ R' do
    insert(r, B)
end for
for all i ∈ [1, g − ℓ_{read} + 1] do
    if q_{i} ∈ B then
        P := P ∪ {q_{i}}
    end if
end for
FN := FN ∪ (R' \ P)
FP := P \ R'
return (B, FN, FP)
Encoding and decoding using a BF cascade
Although appealingly simple, the above method did not offer competitive compression: the cost of encoding FP and FN outweighed the benefit of storing the unique reads in B. Thus, to reduce the number of false positives that need to be compressed separately, we use a cascade of BFs as in [10]. To this end, we rename B and FP above as B_{1} and FP_{1}, respectively. We consider B_{1} to be the first BF in a cascade, and each element of FP_{1} is then hashed into a subsequent BF B_{2}. Since B_{2} is meant to store false-positive reads, it should reject true reads: thus, any element of R' (the set of unique reads) accepted by B_{2} is a false positive relative to B_{2}. Hence, to identify FP_{2}, we record each element accepted when querying R' against B_{2}. This process can be continued for any desired number of BFs. Once the BFs are loaded in this way, real reads are identified by querying each BF in the cascade in turn and accepting a read only if the index of the first BF to reject it is even.
The quantity minimized is the sum of costs due to the expected number of elements in each BF, for an infinite cascade. In practice, we use four BFs and a numerical solver in SciPy [12] employing the L-BFGS-B [13] algorithm to find the value of ρ minimizing this expression. The small list FP_{4} is encoded separately along with FN. The process is described in Figure 1, steps 5-11, and in Algorithm 2. Decoding proceeds using queries from G as before, but in this case each accepted read is used to query subsequent BFs until rejection. This is depicted in Figure 2 and Algorithm 3.
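As a rough sketch of the quantity being minimized (our reconstruction from the element counts described in the text, not necessarily the paper's exact objective): assume each BF spends ρ bits per stored element and has per-filter false positive rate F = c^{ρ}, let n = |R'| and let g be the genome length. Then B_{1} holds n reads, B_{2} holds the ≈ gF genome queries falsely accepted by B_{1}, B_{3} holds the ≈ nF elements of R' falsely accepted by B_{2}, and so on, giving

```latex
\mathrm{size}(\rho) \;\approx\; \rho\left(n + gF + nF + gF^{2} + nF^{2} + \cdots\right)
\;=\; \frac{\rho\,(n + gF)}{1 - F}, \qquad F = c^{\rho},\; c \approx 0.6185 .
```

Minimizing this over ρ trades filter density against the cost of deeper cascade levels.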
Algorithm 2 Encoding. Let B_{j} be the j^{th} BF, loaded (for j ∈ [2, 4]) with FP_{j−1}; S ∩ B_{j} is shorthand notation for the subset of S accepted by B_{j}.
(B_{1}, FN, FP_{1}) := Encode_one(R, G) # we initialize by calling Algorithm 1
for all r ∈ FP_{1} do
    insert(r, B_{2})
end for
FP_{2} := R' ∩ B_{2}
for j = 3 to j = 4 do
    for all r ∈ FP_{j−1} do
        insert(r, B_{j})
    end for
    FP_{j} := FP_{j−2} ∩ B_{j}
end for
Algorithm 3 Decoding. Input: (B_{1}, B_{2}, B_{3}, B_{4}, FN, FP_{4}, G); Output: R_{rc}. For brevity, reconstruction of only one strand is shown.
R_{rc} := FN
for all i ∈ [1, g − ℓ_{read} + 1] do
    for j = 1 to j = 4 do
        if q_{i} ∉ B_{j} then
            if j is even then
                R_{rc} := R_{rc} ∪ {q_{i}}
            end if
            break {move on to the next i}
        end if
    end for
    if q_{i} was accepted by all four BFs and q_{i} ∈ FP_{4} then
        R_{rc} := R_{rc} ∪ {q_{i}}
    end if
end for
return R_{rc}
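The full encode/decode round trip of Algorithms 1-3 can be sketched in a few dozen lines (a toy model under simplifying assumptions: single strand, unique reads, no 'N' handling, tiny fixed-size filters, and our own SHA-256-based hashing):

```python
import hashlib

class BF:
    """Tiny Bloom filter used only for this sketch."""
    def __init__(self, m=4096, h=2):
        self.m, self.h = m, h
        self.bits = bytearray(m)
    def _pos(self, item):
        for i in range(self.h):
            yield int(hashlib.sha256(f"{i}:{item}".encode()).hexdigest(), 16) % self.m
    def insert(self, item):
        for p in self._pos(item):
            self.bits[p] = 1
    def __contains__(self, item):
        return all(self.bits[p] for p in self._pos(item))

def encode(reads, genome, read_len, n_filters=4):
    """Algorithms 1-2: load the cascade; return the BFs plus the FN and final FP lists."""
    reads = set(reads)  # assume unique reads, as in Algorithm 1
    queries = {genome[i:i + read_len] for i in range(len(genome) - read_len + 1)}
    bfs = [BF() for _ in range(n_filters)]
    for r in reads:
        bfs[0].insert(r)
    accepted = {q for q in queries if q in bfs[0]}
    fn = reads - accepted            # true reads never produced by a genome query
    fp = accepted - reads            # genome queries falsely accepted by B_1
    truth = reads
    for j in range(1, n_filters):    # each level stores the previous level's false positives
        for x in fp:
            bfs[j].insert(x)
        fp, truth = {t for t in truth if t in bfs[j]}, fp
    return bfs, fn, fp

def decode(bfs, fn, fp_last, genome, read_len):
    """Algorithm 3: a query is a real read iff the first BF rejecting it has even
    (1-based) index, or it survives every BF and appears in the final FP list."""
    recovered = set(fn)
    for i in range(len(genome) - read_len + 1):
        q = genome[i:i + read_len]
        for j, bf in enumerate(bfs, start=1):
            if q not in bf:
                if j % 2 == 0:
                    recovered.add(q)
                break
        else:                        # accepted by all BFs
            if q in fp_last:
                recovered.add(q)
    return recovered
```

Because every false positive of B_{j} is stored in B_{j+1}, the parity argument resolves each query exactly, so the round trip is lossless for unique reads.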
Additional compression
BF parameters are automatically set to make each BF more compressible. This involves incrementing the number of hash functions for each BF from 1 to the minimal number that allows it both to have an uncompressed file size below a preset threshold (we used 500 MB) and to obtain the value of F from equation 1. Typically, this results in h being in the range of 1 to 3. We do this in order to reduce each BF's compressed size (at the expense of increasing its RAM footprint); this practice was introduced in [11].
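This parameter search can be sketched as follows (our illustration of the idea, not BARCODE's actual code; the 500 MB threshold appears here as `max_uncompressed_bytes`): for each candidate h, solve F = (1 − e^{−hn/m})^h for the array size m, and take the smallest h whose array fits under the cap.

```python
import math

def required_bits(n, h, target_fp):
    """Array size m (in bits) such that (1 - e^{-hn/m})^h equals target_fp,
    obtained by inverting the false-positive formula."""
    return math.ceil(-h * n / math.log(1.0 - target_fp ** (1.0 / h)))

def pick_num_hashes(n, target_fp, max_uncompressed_bytes):
    """Smallest h whose filter both meets target_fp and fits under the size cap;
    larger h shrinks the array at the cost of a denser, less compressible filter."""
    for h in range(1, 16):
        m = required_bits(n, h, target_fp)
        if m / 8 <= max_uncompressed_bytes:
            return h, m
    raise ValueError("no h in range satisfies the size cap")
```

For example, storing a million elements at a 1% false positive rate under a 3 MiB cap forces h up from 1 (which would need roughly 12 MB) to 2.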
Once the BFs are loaded and the sets FP_{4} and FN are identified, we use 7-zip [14] to compress B_{1}, ..., B_{4} and SCALCE [4] to compress the output lists FP_{4} and FN. In principle, any general compression tool can be used for the BFs, and it is preferable to use a tool that takes advantage of existing sequence overlaps among the leftover reads to compress them efficiently.
Results and discussion
Comparison on simulated reads
Table 1. BRC performance with varying coverage.
coverage | time (sec) | |R| (M) | |FP_4| (K) | |FN| (M) | BF size (MB) | FP_4 size (MB) | FN size (MB) | compression (bits/base)
10 | 410 | 6.3 | 3.7 | 2.0 | 8.5 | 0.38 | 36.8 | 0.58
20 | 590 | 12.6 | 6.8 | 4.4 | 14.2 | 0.68 | 66.8 | 0.52
30 | 800 | 18.9 | 9.3 | 7.1 | 19.0 | 0.94 | 93.8 | 0.48
40 | 1006 | 25.2 | 20 | 10.2 | 22.8 | 2.01 | 119.0 | 0.46
50 | 1220 | 31.5 | 16 | 13.5 | 26.4 | 1.66 | 143.0 | 0.43
Table 2. Program parameters used in the compression tool comparison (Figure 3).
Program | Parameters
dwgsim | -C <coverage level> -H -e 0.0-0.005 -R 0.0 -1 <read length> -2 0 -y 0.0
bowtie2 | -x chr20 -U <input fastq> -S
SCALCE | <input fastq> -T 1 -A -n library -o <output prefix>
quip (default) | -o quip -i sam <input sam>
quip (reference) | -o quip -r <ref fa> -i sam <input sam>
fastqz (default) | c <input fastq> <output prefix>
fastqz (reference) | c <input fastq> <output prefix> <packed ref file>
BRC | --rec-load-bf --err-rate 0 -e 0 -i 4 <reads file> <packed ref file>
Overall, we found that the compression ratio improved with greater coverage for all reference-free methods, and remained essentially constant for reference-based methods. Figure 3 shows that reference-based compressors achieve better compression ratios, while reference-free compressors are faster (an exception to this trend was fastqz, whose reference-based version is faster than its reference-free version, likely due to its use of context-model-based arithmetic coding). Quip performed poorly in compressing sequences without a reference, suggesting it has been optimized for speed, and perhaps for compression of qualities and read names. SCALCE shows a strong dependence of compression ratio on coverage, as expected given its exploitation of the recurrence of long subsequences. BARCODE takes advantage of this trend to also improve with higher coverage, even as the proportion of reads hashed to BFs decreases (see Table 1). BARCODE's times are closest to those of SCALCE and reference-free quip, and its compression ratios approach those of reference-based methods, especially at higher coverage values. For most coverage values, it maintains an order of magnitude time advantage over reference-based methods (~2-3x vs. fastqz, ~5-7x vs. quip), as well as an order of magnitude compression advantage over the tested reference-free methods.
Higher coverage, longer reads
Table 3. Performance comparison at high coverage values and read lengths longer than 100 bp.
Metric | coverage | quip | scalce | fastqz | BRC | quip -r | fastqz -r
Time (sec) | 50 | 759 | 533 | 5426 | 1220 | 9544 | 4063
Time (sec) | 100 | 1599 | 1016 | 10417 | 2211 | 20216 | 7479
Time (sec) | 100, len 200 | 1280 | 1165 | 8760 | 1400 | 21705 | 7400
Time (sec) | 200, len 400 | 2284 | 2518 | 15706 | 2209 | 51628 | 17769
Compression (bits/base) | 50 | 2.0 | 0.62 | 0.97 | 0.43 | 0.34 | 0.32
Compression (bits/base) | 100 | 2.0 | 0.53 | 0.58 | 0.40 | 0.34 | 0.32
Compression (bits/base) | 100, len 200 | 1.8 | 0.43 | 0.61 | 0.37 | 0.19 | 0.45
Compression (bits/base) | 200, len 400 | 1.7 | 0.32 | 0.41 | 0.31 | 0.12 | 0.41
Conclusions
We have presented a new approach to compressing sequencing reads, bridging the gap between the speed of alignment-free, reference-free methods and the compression efficiency of reference-based methods. We have tested the dependence of extant sequence compressors on coverage levels and shown that while reference-based methods compress most efficiently, they place a heavy burden on CPU time due to alignment and cannot leverage added redundancy to improve compression ratios. Reference-free methods do benefit from higher coverage, but remain a considerable distance from reference-based methods in terms of compression ratios even at the highest levels tested. Although our new method, BARCODE, obtains a better tradeoff than either of these extremes, there remains much room for improvement, even when considering the inherent constraints imposed by the Kolmogorov complexity of the data. We note that further comparison to other methods, such as CRAM [3] and sam_comp [1], is needed.
BARCODE is currently a proof-of-principle implementation, and thus we expect that further optimization will improve both run time and compression efficiency. Compression ratios may be improved by taking advantage of better general-purpose compression tools, such as the ZPAQ library [18], as fastqz and sam_comp do. Thus far, we have not utilized arithmetic coding techniques, because they employ multiple threads and thus introduce significant additional resource requirements. Our approach can also be extended to allow fast access to variants in the original data, by using conventional BFs that are not compressed and by compressing the FN/FP reads with an encoding that allows fast random access (at some expense in compression ratio). We aim to investigate these paths in the future.
Appendix
Real data test
We examined BARCODE's performance on the C. elegans data set tested in the Sequence Squeeze competition, SRR065390_1. This data set consists of 33,415,360 100 bp reads, amounting to 33-fold coverage of the genome. BARCODE's compression ratio on this data was 0.46 bits per base, and its run time was 1203 seconds, in line with the experiments described in the main text and comparable with reference-based methods tested in the Sequence Squeeze competition [1].
Contributions of repeated reads vs. errors to FN
Table 4. Counts of repeats vs. errors with increasing coverage. The proportion of reads due to errors remains roughly constant, while the proportion due to repeats increases as coverage increases.
Coverage | |R| (M) | repeats (M) | |FN| (M)
10 | 6.3 | 0.3 | 2.0
20 | 12.6 | 1.3 | 4.4
30 | 18.9 | 2.8 | 7.1
40 | 25.2 | 4.9 | 10.2
50 | 31.5 | 7.4 | 13.5
We model the contribution of errors to FN using Binom(100, p) with p = 0.0025, the mean error rate over the read length used in our simulated reads (where the error rate varies from 0 to 0.005 from the 5' to the 3' end). Most of the mass is carried by the one- and two-error terms, leading to a relative error proportion estimate of $\binom{100}{2}(1-p)^{98}p^{2} + \binom{100}{1}(1-p)^{99}p$. Table 4 shows this proportion is independent of coverage level.
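The estimate can be checked numerically (a quick sketch; p and the read length are taken from the text, and `comb` requires Python 3.8+):

```python
from math import comb

# Parameters from the text: read length 100, mean per-base error rate p = 0.0025.
p = 0.0025
n = 100

# Binomial probabilities of exactly one and exactly two errors per read.
one_err = comb(n, 1) * p * (1 - p) ** 99
two_err = comb(n, 2) * p ** 2 * (1 - p) ** 98
estimate = one_err + two_err
```

With these parameters the two terms sum to about 0.22, with the one-error term dominating, i.e. roughly a fifth of reads carry an error under this approximation.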
Availability
BARCODE can be downloaded at http://www.cs.tau.ac.il/~heran/cozygene/software.shtml
Declarations
RS was supported in part by the Israel Science Foundation (grant 317/13) and by the Raymond and Beverly Sackler chair in bioinformatics. RR was supported in part by a fellowship from the Edmond J. Safra Center for Bioinformatics at Tel Aviv University, and by the Center for Absorption in Science, the Ministry of Immigrant Absorption in Israel. EH is a faculty fellow of the Edmond J. Safra Center for Bioinformatics at Tel Aviv University. EH was partially supported by the Israel Science Foundation (grant 1425/13) and by National Science Foundation grant III-1217615.
This article has been published as part of BMC Bioinformatics Volume 15 Supplement 9, 2014: Proceedings of the Fourth Annual RECOMB Satellite Workshop on Massively Parallel Sequencing (RECOMBSeq 2014). The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/15/S9.
List of abbreviations
BF: Bloom filter
BRC: BARCODE
bpb: bits per base
bp: base pairs
Acknowledgements
RR would like to thank Oron Navon and Roy Ronen for helpful comments in preparation of the manuscript.
References
1. Bonfield JK, Mahoney MV: Compression of FASTQ and SAM format sequencing data. PLoS One. 2013, 8(3): e59190. 10.1371/journal.pone.0059190.
2. Kozanitis C, Saunders C, Kruglyak S, Bafna V, Varghese G: Compressing genomic sequence fragments using SlimGene. Journal of Computational Biology. 2011, 18: 401-413. 10.1089/cmb.2010.0253.
3. Hsi-Yang Fritz M, Leinonen R, Cochrane G, Birney E: Efficient storage of high throughput DNA sequencing data using reference-based compression. Genome Research. 2011, 21: 734-740. 10.1101/gr.114819.110.
4. Hach F, Numanagic I, Alkan C, Sahinalp SC: SCALCE: boosting sequence compression algorithms using locally consistent encoding. Bioinformatics. 2012, 28(23): 3051-3057. 10.1093/bioinformatics/bts593.
5. Jones DC, Ruzzo WL, Peng X, Katze MG: Compression of next-generation sequencing reads aided by highly efficient de novo assembly. Nucleic Acids Research. 2012, 40(22): e171. 10.1093/nar/gks754.
6. Cox AJ, Bauer MJ, Jakobi T, Rosone G: Large-scale compression of genomic sequence databases with the Burrows-Wheeler transform. Bioinformatics. 2012, 28(11).
7. Pell J, Hintze A, Canino-Koning R, Howe A, Tiedje JM, Brown CT: Scaling metagenome sequence assembly with probabilistic de Bruijn graphs. Proceedings of the National Academy of Sciences. 2012.
8. Chikhi R, Rizk G: Space-efficient and exact de Bruijn graph representation based on a Bloom filter. Algorithms in Bioinformatics. 2012, 236-248.
9. Melsted P, Pritchard J: Efficient counting of k-mers in DNA sequences using a Bloom filter. BMC Bioinformatics. 2011, 12: 333. 10.1186/1471-2105-12-333.
10. Salikhov K, Sacomoto G, Kucherov G: Using cascading Bloom filters to improve the memory usage for de Bruijn graphs. Algorithms in Bioinformatics, Lecture Notes in Computer Science. Edited by: Darling A, Stoye J. 2013, 8126: 364-376. 10.1007/978-3-642-40453-5_28.
11. Mitzenmacher M: Compressed Bloom filters. IEEE/ACM Transactions on Networking. 2002, 10.
12. Oliphant TE: SciPy: Open source scientific tools for Python. Computing in Science and Engineering. 2007, 9: 10-20.
13. Zhu C, Nocedal J, Byrd RH, Lu P: Algorithm 778: L-BFGS-B: Fortran subroutines for large-scale bound-constrained optimization. 1997.
14. Pavlov I: 7-Zip compression software. [http://www.7-zip.org]
15. Homer N: DWGSIM read simulation software. [https://github.com/nh13/dwgsim]
16. Loman NJ, Misra RV, Dallman TJ, Constantinidou C, Gharbia SE, Wain J, Pallen MJ: Performance comparison of benchtop high-throughput sequencing platforms. Nature Biotechnology. 2012, 30: 434-439. 10.1038/nbt.2198.
17. Langmead B, Salzberg SL: Fast gapped-read alignment with Bowtie 2. Nature Methods. 2012, 9: 357-359. 10.1038/nmeth.1923.
18. Mahoney M: ZPAQ compression software. [http://mattmahoney.net/dc/zpaq.html]
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.