Reference-free compression of high throughput sequencing data with a probabilistic de Bruijn graph

Background Data volumes generated by next-generation sequencing (NGS) technologies is now a major concern for both data storage and transmission. This triggered the need for more efficient methods than general purpose compression tools, such as the widely used gzip method. Results We present a novel reference-free method meant to compress data issued from high throughput sequencing technologies. Our approach, implemented in the software Leon, employs techniques derived from existing assembly principles. The method is based on a reference probabilistic de Bruijn Graph, built de novo from the set of reads and stored in a Bloom filter. Each read is encoded as a path in this graph, by memorizing an anchoring kmer and a list of bifurcations. The same probabilistic de Bruijn Graph is used to perform a lossy transformation of the quality scores, which allows to obtain higher compression rates without losing pertinent information for downstream analyses. Conclusions Leon was run on various real sequencing datasets (whole genome, exome, RNA-seq or metagenomics). In all cases, LEON showed higher overall compression ratios than state-of-the-art compression software. On a C. elegans whole genome sequencing dataset, LEON divided the original file size by more than 20. Leon is an open source software, distributed under GNU affero GPL License, available for download at http://gatb.inria.fr/software/leon/. Electronic supplementary material The online version of this article (doi:10.1186/s12859-015-0709-7) contains supplementary material, which is available to authorized users.

All read datasets used in the main paper are publicly available in the Sequence Read Archive (SRA) and were downloaded either from the NCBI or EBI web servers. Description of each dataset along with its SRA accession number is given in Table ST1.

SRA Accession
Org.  2 Impact of parameters k and T sol The kmer size and the minimal abundance threshold, ie. parameters k and T sol respectively, impact Leon compression ratio, as they control the number of nodes and the topology of the de Bruijn Graph. Leon performance was then computed for varying values of these parameters for the C. elegans WGS dataset.
The T sol parameter is inferred automatically from the histogram of kmer abundances (telling the number of kmers of each abundance). The following heuristic is used. The histogram is first smoothed, then we search for the position of the first increase, and the maximum value attained after that. We then look for the index of the minimum value between this first raise and this maximum. This index is the automatically inferred threshold, and roughly corresponds to the valley between erroneous kmers and "genomic" kmers. We also ensure that no more than 25 % of the distinct kmers are below the threshold, and that the threshold is >= 3. Figure S1 shows that the compression ratio is robust to variations of the T sol parameter around its optimal value. Importantly, the automatically inferred value, 8, gives a compression ratio very close to the optimal one. However, with more extreme values the compressed file size can drastically increase. For instance, if no filtering at all of low coverage kmers is performed, the compression ratio is divided by two (from 12.0 to 5.8), demonstrating the importance of removing sequencing errors in the reference de Bruijn Graph. In this case, all reads can be mapped perfectly to the graph, but the size of the graph and of the bifurcation lists are important too. Conversely, if the threshold is too high, removing too many genomic kmers, the compression ratio drops since the reference is incomplete and many more reads cannot map to it. Regarding quality scores compression ratios, a higher T sol value means fewer scores will be considered as good enough to be replaced by a high value '@', therefore quality compression ratios keeps decreasing with a higher T sol . However, we observe a plateau near the optimal value for sequence compression, which probably corresponds to values distinguishing erroneous of solid kmers. Similarly, Figure S2 shows that there is also an optimal value for the k parameter. For small k (typically k < 20 for the C. elegans dataset), many kmers are not unique in the genome, generating numerous branchings in the de Bruijn Graph. Even if the latter is much smaller in size with small k, this does not compensate for the numerous bifurcations to store for each read. Higher k value will also mean a larger dictionary of anchors and more unanchored reads. Consequently, this parameter depends on the complexity of the genome. For large genomes with numerous repeats, larger k should be preferred, but a trade-off must be found to compensate compression ratio with running time, since the counting step time can increase with k.  Figure S3 shows Leon execution time using 1 to 24 threads. The platform used is a 2.50 GHz Intel E5-2640 CPU with 12 cores ( 24 logical cores with hyper-threading). It should be noted that Leon seems to benefit highly from Intel hyper-threading. We suspect this is the case because Leon's major operations are queries inside a bloom filter, which induce many memory cache misses. Memory cache misses induce latency, usually well hidden by hyper-threading.  Table ST2.

Parallelization speed-up
Since Leon default mode is lossy for quality scores, other tools were also run in a lossy mode when possible.

Theoretical estimation of the optimal bloom filter size
Leon uses a probabilistic de Bruijn Graph, i.e. all kmer nodes of the graph are inserted in a bloom filter. The false positive rate of the bloom filter will induce false branching in the graph, meaning extra bifurcation events that will need to be stored in the compressed file.
Therefore, an optimal trade-off needs to be found: a large bloom filter will take more space in the file but will save space for read storage (and conversely).
The false positive rate F of a bloom filter can be approximated by F = f r with f = 0.5 log 2 ≈ 0.618 and r = bloom size nb elements inserted the number of bits per element inserted in the bloom filter [10].
With G the size of the target sequenced genome (i.e. approximately the number of nodes in the de Bruijn Graph) and D the average kmer abundance (somehow related to the depth of CompressQual ./hg96.sam -q 1 -l 4