### Technical background

A Bloom filter (BF) is a bit array *A* of size *m* with all positions initially set to 0. An element is inserted into *A* by applying a collection of *h* hash functions, each of whose outputs specifies a position in *A* to set to 1. To query whether an element has been inserted, we apply the same *h* hash functions and check the values at the positions they return. If any position holds a 0, the element definitely was not inserted; if all positions hold a 1, either the element was inserted or it is a false positive. For a BF of size *m* into which *n* elements have been inserted using *h* hash functions, the false positive rate is *F ≈* (1 *− e*^{−hn/m})^{h}. In [11], it is shown that for fixed *m* and *n*, *F* is minimized at *h = ln*(2)*ρ*, where *ρ = m/n*. Substituting this value back into *F* gives *F = c*^{ρ}, where *c ≈* 0.6185.
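The mechanics above can be illustrated with a minimal Bloom filter sketch in Python. This is illustrative code, not the paper's implementation; the salted-SHA-256 hashing scheme is an arbitrary choice, and one byte per bit position is used for simplicity.

```python
import hashlib
import math

class BloomFilter:
    """Minimal Bloom filter sketch (illustrative, not the paper's code)."""

    def __init__(self, m, h):
        self.m = m                 # number of positions
        self.h = h                 # number of hash functions
        self.bits = bytearray(m)   # one byte per position, all initially 0

    def _positions(self, item):
        # Derive h positions by salting a single hash function with an index.
        for i in range(self.h):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def insert(self, item):
        for p in self._positions(item):
            self.bits[p] = 1

    def query(self, item):
        # True means "possibly inserted"; False means "definitely not".
        return all(self.bits[p] for p in self._positions(item))

def false_positive_rate(m, n, h):
    """F ~= (1 - e^{-hn/m})^h for n insertions into m positions, h hashes."""
    return (1.0 - math.exp(-h * n / m)) ** h

def optimal_h(m, n):
    """h = ln(2) * (m/n) minimizes F, giving F = 0.6185^{m/n}."""
    return math.log(2) * m / n
```

Any inserted element is always accepted again, while a non-inserted element is accepted only with probability *F*.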

### Encoding and decoding using a Bloom filter

Our method involves two basic processes: BF loading and querying. We initially assume all reads are unique, and later relax this assumption. We load all reads into a BF *B*, and then use the reference genome to query it: we query *B* with read-length subsequences (and their reverse complements) starting from every possible position on the genome. This identifies all potential reads corresponding to genome positions, a set that covers most of the hashed reads. Some of the accepted queries will be false positives. To exclude them during decoding, we identify a set *FP* consisting of all reads accepted by *B* that are not in the original read set. Additionally, since the reads are taken from a specimen whose genome contains mutations relative to the reference (and since sequencing is error-prone), some reads will not be recovered by querying the genome. We call this set of reads *FN*. *FN* and *FP* are stored separately from *B* and compressed using an off-the-shelf compression tool. For a set of unique reads, this suffices to allow a complete reconstruction of the reads.

Decoding proceeds by decompressing *B*, *FN*, and *FP*, and then repeating the querying procedure. We initialize the read set to *FN*. Then, we query *B* with each position of the genome, as was done to identify the elements of *FP*. Whenever *B* accepts a query, we check whether the accepted read is in *FP*, and add it to the read set if it is not. To remove the unique-read restriction, we move all repeated reads to *FN* before loading *B*; we treat reads containing 'N' characters similarly. These two additions allow us to circumvent an inherent limitation of Bloom filters (the loss of multiplicity information) and reduce the entropy in the (now multi-) set *FN*, making it more compressible. The encoding process with one BF is detailed in steps 1-4 of Figure 1 and in Algorithm 1. The relative contributions of error reads and repeated reads to *FN* are discussed in the appendix.

**Algorithm 1** Encode one. **Input**: *R*, *G*; **Output**: *B*, *FN*, *FP*. **Conventions**: Let *g* be the length of the reference genome *G*, *q*_{i} be the *i*^{th} genome query, *ℓ*_{read} be the sequenced read length, and *P* be the set of genome queries accepted by *B*. For brevity, we exhibit queries from only the forward strand, whereas our implementation queries (and accepts from) both strands.

FN := {*r* : *r ∈ R* and (*r* is repeated in *R* or *r* contains an 'N')}

R' := R \ FN

**for all** *r ∈ R'* **do**

insert(*r, B*)

**end for**

**for all** *i ∈* [1, *g − ℓ*_{read} + 1] **do**

**if** *q*_{i} *∈ B* **then**

P := P ∪ {q_{i}}

**end if**

**end for**

FN := FN ∪ (R' \ P)

FP := P \ R'

return (*B, FN, FP*)
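Assuming a Bloom-filter object with `insert`/`query` methods (interface names are illustrative), Algorithm 1 can be sketched in Python as follows (forward strand only, as in the pseudocode):

```python
from collections import Counter

def encode_one(reads, genome, read_len, bf):
    """Sketch of Algorithm 1. `bf` is a pre-sized Bloom-filter-like object
    with insert/query methods (illustrative interface, forward strand only)."""
    counts = Counter(reads)
    # Repeated reads and reads containing 'N' go straight to FN.
    fn = [r for r in reads if counts[r] > 1 or 'N' in r]
    r_unique = {r for r in reads if counts[r] == 1 and 'N' not in r}

    for r in r_unique:
        bf.insert(r)

    # Query every read-length window of the genome; collect acceptances P.
    accepted = set()
    for i in range(len(genome) - read_len + 1):
        q = genome[i:i + read_len]
        if bf.query(q):
            accepted.add(q)

    fn.extend(r_unique - accepted)   # reads never recovered by genome queries
    fp = accepted - r_unique         # windows falsely accepted by the BF
    return bf, fn, fp
```

With an exact set standing in for the BF, `fp` is empty; with a real BF, `fp` collects the windows accepted by chance.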

### Encoding and decoding using a BF cascade

Although appealingly simple, the above method did not offer competitive compression: the cost of encoding *FP* and *FN* outweighed the benefit of storing the unique reads in *B*. Thus, to reduce the number of false positives that must be compressed separately, we use a cascade of BFs as in [10]. To this end, we rename *B* and *FP* above as *B*_{1} and *FP*_{1}, respectively. We consider *B*_{1} to be the first BF in a cascade, and each element of *FP*_{1} is then hashed into a subsequent BF *B*_{2}. Since *B*_{2} is meant to store false positive reads, it should reject true reads: any element of *R'* (the set of unique reads) accepted by *B*_{2} is a false positive relative to *B*_{2}. Thus, to identify *FP*_{2}, we collect every element of *R'* accepted when querying *B*_{2}. This process can be continued for any desired number of BFs. Once the BFs are loaded in this way, we identify real reads by querying each BF in the cascade in turn, accepting a read only if the index of the first BF to reject it is even.

Since the elements inserted into *B*_{j} are necessarily a subset of those inserted into *B*_{j−2}, we see an exponential drop-off in BF size (since *F* is fixed). Since the sizes of successive BFs alternately depend on *nρ* and (2*g − n*)*ρF ≈* 2*gρF* (the number of false positives expected for 2*g* queries from *G*, multiplied by the cost *ρ* per element, assuming *g ≫ n*), we expect the total file size to be approximately *nρ* + 2*gρF* + *nρF* + 2*gρF*^{2} + *...* bits. Using *F = c*^{ρ} from above, we observe that for an infinite cascade, the average number of bits per read is then

\left(\rho + \frac{2g\rho {c}^{\rho}}{n}\right)\left(1+{c}^{\rho}+{c}^{2\rho}+\ldots\right)=\left(1+\frac{2g{c}^{\rho}}{n}\right)\left(\frac{\rho}{1-{c}^{\rho}}\right).

(1)

Here the left-hand side represents the sum of costs due to the expected number of elements in each BF of an infinite cascade. In practice, we use four BFs and a numerical solver in scipy [12] employing the L-BFGS-B [13] algorithm to find the value of *ρ* minimizing the above expression. The small list *FP*_{4} is encoded separately along with *FN*. The process is described in Figure 1, steps 5-11, and in Algorithm 2. Decoding proceeds using queries from *G* as before, but now each accepted read is used to query subsequent BFs until rejection. This is depicted in Figure 2 and Algorithm 3.
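Equation (1) is easy to evaluate and minimize numerically. The paper uses scipy's L-BFGS-B; the sketch below substitutes a plain grid scan (an illustrative replacement, with arbitrary search bounds) to show the shape of the optimization:

```python
import math

C = 0.6185  # the base c in F = c^rho at the optimal hash count

def bits_per_read(rho, g, n):
    """Right-hand side of Eq. (1): expected bits per read, infinite cascade."""
    f = C ** rho
    return (1.0 + 2.0 * g * f / n) * (rho / (1.0 - f))

def best_rho(g, n, lo=1.0, hi=60.0, steps=10000):
    """Grid scan for the minimizing rho. The paper uses scipy's L-BFGS-B;
    a plain scan is used here purely for illustration."""
    return min((lo + i * (hi - lo) / steps for i in range(steps + 1)),
               key=lambda r: bits_per_read(r, g, n))
```

For human-scale parameters (*g* of a few gigabases, *n* of hundreds of millions of reads), the minimum lies at a moderate *ρ*, balancing per-element cost against false positives leaking down the cascade.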

**Algorithm 2** Encoding. Let *B*_{j} be the *j*^{th} BF (*j ∈* [2, 4]), loaded with *FP*_{j−1}; *S ∩ B*_{j} is shorthand for the subset of *S* accepted by *B*_{j}.

(*B*_{1}, *FN*, *FP*_{1}) := **Encode one**(*R, G*) # we initialize by calling Algorithm 1

**for all** *r ∈ FP*_{1} **do**

insert(*r, B*_{2})

**end for**

FP_{2} := R' ∩ B_{2}

**for** *j* = 3 **to** 4 **do**

**for all** *r ∈ FP*_{j−1} **do**

insert(*r, B*_{j})

**end for**

FP_{j} := FP_{j−2} ∩ B_{j}

**end for**

return (*B*_{1}, *B*_{2}, *B*_{3}, *B*_{4}, *FN*, *FP*_{4})
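The cascade-loading step of Algorithm 2 can be sketched as follows, again assuming Bloom-filter-like objects with `insert`/`query` methods (hypothetical interface). Note that *FP*_{j} is collected from *R'* for *j* = 2 and from *FP*_{j−2} for *j* = 3, 4, since those are exactly the elements that reach *B*_{j} during decoding but should be rejected by it:

```python
def build_cascade(bf_list, r_unique, fp1):
    """Sketch of Algorithm 2 beyond the first filter. `bf_list` holds three
    empty Bloom-filter-like objects for B_2..B_4 (insert/query methods;
    interface names illustrative). Returns the loaded filters and FP_4."""
    fp = {1: set(fp1)}
    b = {}
    for j in (2, 3, 4):
        b[j] = bf_list[j - 2]
        for r in fp[j - 1]:                      # B_j stores FP_{j-1}
            b[j].insert(r)
        # FP_j: elements B_j should reject but accepts by chance.
        src = r_unique if j == 2 else fp[j - 2]
        fp[j] = {q for q in src if b[j].query(q)}
    return b[2], b[3], b[4], fp[4]
```

With exact sets standing in for the BFs, every *FP*_{j} for *j ≥* 2 is empty; with real BFs, each level shrinks by roughly the factor *F*.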

**Algorithm 3** Decoding. **Input**: (*B*_{1}, *B*_{2}, *B*_{3}, *B*_{4}, *FN*, *FP*_{4}, *G*); **Output**: *R*_{rc}. For brevity, reconstruction of only one strand is shown.

R_{rc} := FN

**for all** *i ∈* [1, *g − ℓ*_{read} + 1] **do**

**for** *j* = 1 **to** 4 **do**

**if** *q*_{i} *∉ B*_{j} **then**

**if** *j* is even **then**

R_{rc} := R_{rc} ∪ {q_{i}}

**end if**

continue {increment *i*}

**end if**

**end for**

**if** *j* = 4 **and** *q*_{i} *∈ FP*_{4} **then**

R_{rc} := R_{rc} ∪ {q_{i}}

**end if**

**end for**

return *R*_{rc}
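The accept-on-even-rejection rule can be sketched in Python as follows (forward strand only; the `query` interface of the filters is illustrative):

```python
def decode_reads(genome, read_len, bfs, fn, fp4):
    """Sketch of Algorithm 3. `bfs` is [B_1, B_2, B_3, B_4] (objects with a
    query method; illustrative interface). A window is a read if the first
    filter rejecting it has even index, or if all four accept and it is
    listed in FP_4."""
    reads = list(fn)                 # initialize the read set to FN
    for i in range(len(genome) - read_len + 1):
        q = genome[i:i + read_len]
        first_reject = None
        for j, b in enumerate(bfs, start=1):
            if not b.query(q):
                first_reject = j
                break
        if first_reject is None:
            if q in fp4:             # survived all filters: real iff in FP_4
                reads.append(q)
        elif first_reject % 2 == 0:  # rejected by B_2 or B_4: a real read
            reads.append(q)
    return reads
```

For instance, a window stored in *B*_{1} but absent from *FP*_{1} is (with high probability) rejected by *B*_{2}, giving first rejection index 2, and is therefore reported as a read.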

### Additional compression

BF parameters are set automatically to make each BF more compressible. For each BF, we increment the number of hash functions *h* from 1 until the filter both has an uncompressed file size below a preset threshold (we used 500 MB) and achieves the value of *F* obtained from Equation 1. Typically, this results in *h* between 1 and 3. We do this to reduce each BF's compressed size (at the expense of a larger RAM footprint); this practice was introduced in [11].
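This tuning loop can be sketched by solving *F* = (1 *− e*^{−hn/m})^{h} for *m* at each candidate *h*; the size threshold and the cap on *h* in the sketch below are illustrative parameters, not the paper's values:

```python
import math

def bits_needed(n, h, target_f):
    """Bits m such that n insertions with h hash functions give false
    positive rate target_f: solve F = (1 - e^{-hn/m})^h for m."""
    return -h * n / math.log(1.0 - target_f ** (1.0 / h))

def smallest_h(n, target_f, max_bits, h_max=16):
    """Smallest h whose filter both achieves target_f and fits within
    max_bits uncompressed (mirrors the tuning described above)."""
    for h in range(1, h_max + 1):
        if bits_needed(n, h, target_f) <= max_bits:
            return h
    return None
```

Below the optimal *h = ln*(2)*ρ*, adding a hash function reduces the array size needed for a given *F*, so the loop stops at the first *h* that fits the size budget; the resulting sparse array then compresses well.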

Once the BFs are loaded and the sets *FP*_{4} and *FN* are identified, we use 7zip [14] to compress *B*_{1}, ..., *B*_{4} and SCALCE [4] to compress the output lists *FP*_{4} and *FN*. In principle, any general-purpose compression tool can be used for the BFs; for the leftover reads, it is preferable to use a tool that takes advantage of sequence overlaps among them to compress them efficiently.