 Methodology article
 Open Access
MQF and buffered MQF: quotient filters for efficient storage of k-mers with their counts and metadata
BMC Bioinformatics volume 22, Article number: 71 (2021)
Abstract
Background
Specialized data structures are required for online algorithms to efficiently handle large sequencing datasets. The counting quotient filter (CQF), a compact hash table, can efficiently store k-mers with a skewed distribution.
Result
Here, we present the mixed-counters quotient filter (MQF) as a new variant of the CQF with novel counting and labeling systems. The new counting system adapts to a wider range of data distributions for increased space efficiency and is faster than the CQF for insertions and queries in most of the tested scenarios. A buffered version of the MQF can offload storage to disk, trading speed of insertions and queries for a significant memory reduction. The labeling system provides a flexible framework for assigning labels to member items while maintaining good data locality and a concise memory representation. These labels serve as a minimal perfect hash function but are ~ tenfold faster than BBhash, with no need to reanalyze the original data for further insertions or deletions.
Conclusions
The MQF is a flexible and efficient data structure that extends our ability to work with high throughput sequencing data.
Background
Online algorithms effectively support streaming analysis of large datasets, which is important for analyzing data with large volume and high velocity [1]. Approximate data structures are commonly used in online algorithms to provide better average space and time efficiency [2]. For example, the Bloom filter supports approximate set membership queries with a predefined false-positive rate (FPR) [3]. The count-min sketch (CMS) is similar to Bloom filters and can be used to count items with a tunable rate of overestimation. However, Bloom filters and the CMS suffer from a number of problems; in particular, they have poor data locality.
The counting quotient filter (CQF) is a data structure that serves similar purposes with better efficiency for skewed distributions and much better data locality [4]. The CQF is a recent variant of the quotient filter that tracks the count of its items using variable-size counters. As a compact hash table, the CQF can operate in either probabilistic or exact mode and supports deletions, merges, and resizing.
Analysis of k-mers in biological sequencing datasets is an ongoing challenge [5]. K-mers in raw sequencing data often follow a Zipfian distribution, and the CQF was built to minimize memory requirements for counting such items [4]. However, this advantage deteriorates in applications that require frequent random access to the data structure, and where the k-mer count distribution may change in response to different sampling approaches, library preparation, and/or sequencing technologies. For example, k-mer frequencies across thousands of RNA-seq experiments show different patterns of abundant k-mers [6].
Data structures like the CMS [7] and CQF [4] also do not natively support associating k-mers with multiple values, which can be useful for coloring in de Bruijn graphs as well as other features [8,9,10]. Classical hash tables are designed to associate their keys with a generic data type, but they are expensive memory-wise [11]. Tools like Jellyfish [12] and CHTKC [13] use lock-free hash tables in their k-mer counting algorithms. Minimal perfect hash functions (MPHFs) can provide a more compact solution by mapping each k-mer to a unique integer. These integers can then be used as indices for the k-mers to label them in other data structures [14]. An implementation capable of handling large-scale datasets with fast performance requires ~ 3 bits per element [15]. However, such a concise representation comes with a high false-positive rate on queries for nonexistent items. Moreover, unlike hash tables, an MPHF does not support insertions or deletions, so any change in the k-mer set requires rehashing the original dataset.
In this paper, we introduce the mixed-counters quotient filter (MQF), a modified version of the CQF with a new encoding scheme and a labeling system supporting high data locality. We further show how the buffered MQF can be used to scale the MQF to solid-state disks. We compare the MQF with the CQF, CMS, and MPHF data structures regarding memory efficiency, speed, and applicability to specific data-analysis challenges. We also directly compare the CMS to the MQF within the khmer software package for sequencing data analysis, to showcase the benefits of the MQF in real-world applications.
Results
MQF has a lower load factor than CQF
The load factor is defined as the space actually utilized divided by the total space assigned to the data structure, and is an important measure of data-structure performance. To compare load factors between the CQF and MQF, instances of both structures were created with the same number of slots (\(2^{27}\)). Chunks of items from thirteen datasets with different distributions of item frequencies were inserted iteratively into both data structures, recording the load factor after the insertion of each chunk. After each chunk, both data structures were checked to confirm they contained exactly the same hashes. Each experiment stopped when the MQF's load factor reached 90%. The MQF had a lower load factor for all tested datasets, but the difference was minimal for the dataset with the most extreme Zipfian distribution (Z = 5). The lower the Zipfian coefficient of the tested dataset, the lower the MQF's load factor.
The lower load factor enabled the MQF to exceed double the CQF's capacity for the uniform distribution. Analysis of real k-mers from 8 sequencing datasets showed a ~ 40% increase on average in the capacity of the MQF compared to the matching CQF (Fig. 1, Additional file 1: Figure S1 and Table S1).
MQF is usually more memory efficient than CQF
Progressively increasing numbers of items were sampled from the real and Zipfian-simulated datasets. The smallest CQF and MQF able to store the same number of items from each dataset were created. To do that, the q parameter of the CQF, and the q and \(F_{size}\) parameters of the MQF, were determined empirically: starting with a small q, the CQF and MQF were tested to count the items, and q was incremented to find the smallest value that enables the structure to hold them. The MQF was more memory efficient for most real k-mer and Zipfian-simulated distributions (Fig. 2, Additional file 1: Figure S2). Tuning \(F_{size}\) enabled the MQF to grow in size gradually, whereas the CQF has to double in size to fit even a minimal increase in items beyond the capacity of a given q value (Additional file 1: Figure S3). Comparing the optimum parameters used to generate both data structures (Additional file 2: Table S2) shows that the MQF is always more memory efficient if \(F_{size} > 1\), while the CQF can be slightly more memory efficient if \(F_{size} = 1\). The latter is more likely if sequencing datasets are highly erroneous or very polymorphic (e.g. mitochondrial genome sequencing in SRR12989394), or if sequencing has very low coverage so that most k-mers appear once (e.g. small subsets of whole-genome sequencing data in ERR992657). The MQF comes with a utility file (https://github.com/dib-lab/MQF/blob/master/include/utils.h) to help predict the optimum MQF parameters.
MQF is faster than CQF and low-FPR CMS
The in-memory and buffered MQFs were evaluated for insertion and query speed against three in-memory counting structures: the CQF, the original CMS [7], and khmer's CMS [16]. To test the effect of the FPR on performance, the experiment was repeated for 4 different FPRs (0.1, 0.01, 0.001, 0.0001). All tested structures were constructed to occupy approximately the same memory, except for the buffered MQF, which used only one-third of this memory for buffering while the full-size filter resided on disk. The MQF is guaranteed to hold the same number of items as a CQF with the same number of slots. The number of slots in the CQF was chosen so that the load factor was more than 85%, and the MQFs were created with an equal number of slots. Items were sampled for counting from the real and Zipfian-simulated datasets. After the insertions finished, 5 M items from the same distribution as the insertion datasets were queried to assess the query rate. Half of the query items did not exist in the insertion datasets.
The MQF has slightly but consistently faster insertion and query rates compared to the CQF, with minimal, if any, effect of the FPR on either structure. The performance of the CMS improves with higher FPRs, and khmer's implementation of the CMS doubles the query rate of the original one. However, the MQF is faster than both CMS implementations unless the FPR exceeds 0.01 (Fig. 3, Additional file 1: Figures S4, S5).
MQF outperforms CMS in real-world problems
Khmer is a software package deploying a new implementation of the CMS for k-mer counting, error trimming, and digital normalization [16]. To test the MQF in real-life applications, we assessed the performance of the khmer software package using the CMS versus our new implementation using the MQF (https://github.com/dib-lab/khmer/tree/MQFIntegration2). A real RNA-seq dataset with 51 million reads from the Genome in a Bottle project [17] was used for error trimming and digital normalization, two real-world applications that involve both k-mer insertions and queries. An exact MQF was used to create a benchmark for the approximate data structures: it took 5 GB of RAM to create the data structure, and 45 and 43 min to perform trimming and digital normalization, respectively. The optimal memory for the MQF and the optimal number of hash functions for the CMS were calculated to achieve the specified false-positive rates. The CMS was constructed with the same size as the corresponding MQFs. The CMS and MQF versions of khmer were compared regarding speed and accuracy (Table 1).
MQF is faster than MPHF
The MPHF is constructed by default to exactly fit the input k-mers, while the MQF has a variable load factor that might affect its performance. To address this question, four growing subsets of real k-mers were inserted into MQFs of size 255 MB to achieve 60%, 70%, 80%, and 90% load factors. For labeling, the order of the 1st k-mer in each block was stored in the external labeling space (see the Methods section). MPHFs were constructed with sizes ranging from 15 to 22 MB to fit the four datasets. All data structures were queried with 35 M existing k-mers, and the query times were recorded. The MQFs were ~ 10-fold faster than the MPHFs, and the MQF's query time was invariant across the different load factors (Additional file 1: Figure S6).
Discussion
MQF is a new variant of counting quotient filters with novel counting and labeling systems. The new counting system increases memory efficiency as well as the speed of insertions and queries for a wide range of data distributions. The labeling system provides a flexible framework for labeling the member items while maintaining good data locality and a concise memory representation.
The MQF is built on the foundation of the CQF. It has the same ability to behave as an exact or approximate membership query data structure while tracking the count of its members. The insertion/query algorithm developed for the CQF enables this family of compact hash tables to perform fast under high load factors (up to 95%) [4]. CQFs are designed to work best for data from highly Zipfian distributions. However, previous k-mer spectral analysis of RNA-seq datasets showed substantial deviations from a Zipfian distribution in thousands of samples [6]. Such variation in distribution is expected given the variety of biosamples, the broad spectrum of sequencing techniques, and different approaches to data preprocessing.
The MQF implements a new counting system that allows the data structure to work efficiently with a broader range of data distributions. The counting system adopts a simple encoding scheme that uses a small fixed space, alone or with a variable number of the filter's slots, to record the count of member items (Fig. 4). Items with small counts use only the small fixed-size counters; therefore, slots that the CQF would have consumed as counters for these items are freed to accommodate more items in the filter. The MQF's load factor grows more slowly than the CQF's for all distributions except the extreme Zipfian case (Z = 5), where the load factor is almost the same (Fig. 1). This is why the memory requirement of MQFs is usually smaller than that of CQFs under most distributions, despite the extra space taken by the fixed counters (Fig. 2). The size of the fixed-size counter is constant and independent of the slot size; its memory overhead is therefore trivial with the big slots needed for smaller FPRs, and almost negligible in the exact mode. The fixed-size counter also brings an additional advantage: tuning its size enables the filter to accommodate more items with a slightly larger slot size. This allows the memory requirement of the MQF to grow gradually instead of the obligatory size doubling seen in the CQF (Fig. 2 and Additional file 1: Figures S2, S3).
Moreover, the new counting scheme in the MQF is simpler than that of the CQF. The MQF defines the required memory for any item based solely on its count (Additional file 1: Figures S7, S8). Therefore, an accurate estimate of the required memory for any dataset can be obtained extremely quickly from an approximate estimate of the data distribution [18, 19]. This is unlike the CQF, which needs to add a safety margin to account for the special slots used by its counter-encoding technique, since the number of these slots cannot be estimated in advance.
Regarding the speed of insertions and queries, the MQF is slightly faster than the CQF (Fig. 3). This can be explained partly by the lower load factor of the MQF and partly by the simplicity of the encoding/decoding scheme of its counting system. Both the MQF and CQF are faster than the CMS unless the target FPR is very high (e.g. FPR > 0.1) (Fig. 3). The CMS controls its FPR by increasing the number of its hash tables, requiring more time for insertions and queries. In comparison, quotient filters always use one hash function but with more hash bits to control the FPR, with minimal effect on insertion/query performance (Fig. 3). With a high FPR (e.g. FPR = 0.1), the CMS uses fewer hash functions and performs better than the MQF. A quotient filter and a CMS with FPR = \(\delta\) have the same probability of item collisions. However, the quotient filter is more accurate because the CMS has another type of error, with probability \(1 - \delta\), that incorrectly increases the count of its items. This error is a "bounded error" with a threshold that inversely correlates with the width of the CMS [7]. In other words, some applications might deploy a CMS with a smaller table width to be more memory efficient than the MQF, if the application can tolerate a high bounded error.
The buffered MQF can trade some insertion and query speed for a significant memory reduction by storing data on disk. The buffered structure was developed to exploit the optimized sequential read and write of SSDs. It processes most insertion operations using the bufferMQF that resides in memory, thereby limiting the number of access requests to the MQF stored on the SSD. Sequential disk access happens when the bufferMQF needs to be merged to disk. This approach is very efficient for insertions but not for random queries, which require more frequent SSD access. Therefore, the buffered MQF is best used in applications where insertions represent the performance bottleneck, e.g. k-mer counting [5] and sequencing-dataset indexing [20]. Moreover, in k-mer analysis of huge raw datasets, the buffered MQF can be used initially to filter out low-abundance k-mers (i.e. likely erroneous k-mers); an in-memory MQF holding the filtered list of k-mers can then be used for subsequent applications requiring frequent random queries. This allows multistage analyses in which a first pass eliminates likely errors to minimize the memory requirements of computationally demanding applications, as in the case of widely used graph-based algorithms [21, 22].
The CMS is commonly used for online or streaming applications as long as its high error rate can be tolerated [23]. The MQF has a better memory footprint in the approximate mode for lower error rates and thus can compete with the CMS for online applications. A major advantage of quotient filters over the CMS is their ability to resize dynamically in response to a growing input dataset [4]. The buffered version of the MQF can be very useful when the required memory is still bigger than the available RAM. We should, however, note that online applications of the MQF cannot make use of the memory optimization that could be achieved with an initial estimation of the filter parameters. A new version of the khmer software that replaces the CMS with the MQF shows that the new data structure is more efficient in real-life applications. The MQF version is faster than the CMS version unless the target FPR is high. Also, the MQF is always more accurate than the CMS, although both structures have the same FPR; this behavior of the CMS is due to the high error bound of its counts.
Unlike the CQF, the MQF is designed to be a more comprehensive associative data structure. It comes with a novel labeling system that supports associating each k-mer with multiple values, avoiding redundant duplication of k-mer keys in separate data structures. There are two types of labels. Internal labels are stored adjacent to each item to achieve the best cache locality; however, they have a fixed size and are thus practically useful only when a small label is needed. The second labeling system labels the k-mers with one or more labels stored in external arrays, using the k-mer order in the MQF as an index. External labeling is very memory efficient, mimicking the idea of the minimal perfect hash function (MPHF) [14, 15]. The MPHF undoubtedly has the lowest memory requirement of all the associative data structures [15]; however, the MQF performs better in both the construction and query phases. For construction, both structures require initial k-mer counting. The MQF needs just an extra O(N) operation to update the block labels, where N is the number of its unique k-mers. The MPHF has to read and then rehash the list of unique k-mers, possibly more than once, which makes it slower than the MQF. For queries, the MQF is 10× faster regardless of its load factor (Additional file 1: Figure S6).
Furthermore, the MQF offers more functionality and has fewer limitations than the MPHF. The MQF is capable of labeling a subset of its items, which saves significant space in many applications. For example, k-mer analysis applications may want to label only the frequent k-mers, as an intermediate solution between pruning all the infrequent k-mers and labeling all the k-mers. Moreover, the MQF allows online insertions and deletions of items as well as merging of multiple labeled MQFs (see Methods), while the MPHF, which does not store the items, needs to be rebuilt over the whole dataset, requiring reading and rehashing the data. Furthermore, the MQF can be exact, while the MPHF has false positives when queried with novel items that do not belong to the indexed dataset.
Conclusions
The MQF is a new counting quotient filter with a simplified encoding scheme and an efficient labeling system. It adapts well to a wide range of k-mer datasets, making it more memory- and time-efficient than its predecessor in many situations. A buffered version of the MQF has a fast insertion algorithm while storing most of the structure in external memory. The MQF combines a fast-access labeling system with MPHF-like associative functionality. Its performance, features, and extensibility make it a good fit for many online algorithms for sequence analysis.
Methods
MQF data structure
The MQF has a similar structure to the CQF, with a different metadata scheme that enables different counting and labeling systems (Fig. 4). Like the CQF, the MQF requires 2 parameters, r and q, and creates an array of \(2^{q}\) slots; each slot has r bits. In the MQF, \(Q_{i}\) is the slot at position i, where i = 1 … \(2^{q}\). The MQF maintains the block design of the CQF, where each block has 64 slots with their metadata and one extra byte of metadata, called the offset, to speed up item queries [4]. Both the MQF and CQF have two metadata bits accompanying each slot: \(isRunEnd_{i}\) and \(isOccupied_{i}\). In the MQF, each slot i has extra metadata: a fixed-size counter with value \(F_{i}\) and configurable size \(F_{size}\). There are also two optional fixed-size pieces of metadata allocated to allow different styles of labeling: every slot has a specific label (\(ST_{i}\)) with a configurable size (\(ST_{size} \ge 0\)), and every block j has an optional space of configurable size designed to store the number of items in the previous blocks.
The MQF uses the same insertion/query algorithm as the CQF [4]. In brief, suppose item I, repeated c times, is to be inserted into Q. A hash function H is applied to I to generate a p-bit fingerprint H(I). The H(I) value is split into two parts, a quotient and a remainder. The quotient (\(q_{i}\)) is the most significant q bits, while the remainder (\(r_{i}\)) is the remaining least significant r bits. The filter stores \(r_{i}\) in a slot \(Q_{j}\), where j is determined by linear probing starting from \(q_{i}\). One or more slots can be used to store the count of the same item. If the required slots for the item or its count are not free, all the consecutive occupied slots starting from this position are shifted to free the required space. All items having the same quotient are stored in consecutive slots, together called a run. Items in the run are sorted by \(r_{i}\), and the isRunEnd bit of the last slot in the run is set to one. isOccupied(\(q_{i}\)) is set to one if and only if there is a run for \(q_{i}\). Therefore, there is exactly one bit set in each of isOccupied and isRunEnd for each run. To query item I, a rank-and-select method is applied on the metadata arrays to find the start and end of the run for \(q_{i}\). All the items in the run are then searched linearly for the slot containing \(r_{i}\). The subsequent slot(s) can be decoded to get the count of item I. The CQF uses a special encoding scheme to recognize these counting slots, but the MQF utilizes the fixed-counter metadata element (see below).
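The quotient/remainder split described above can be sketched in a few lines (a minimal illustration only; `split_fingerprint` is a hypothetical helper, not part of the MQF API):

```python
def split_fingerprint(fingerprint: int, q: int, r: int) -> tuple[int, int]:
    """Split a p-bit fingerprint (p = q + r) into quotient and remainder.

    The quotient (most significant q bits) selects the home slot;
    the remainder (least significant r bits) is what the slot stores.
    """
    remainder = fingerprint & ((1 << r) - 1)  # low r bits
    quotient = fingerprint >> r               # high q bits
    return quotient, remainder

# Example with p = 8, q = 3, r = 5:
quotient, remainder = split_fingerprint(0b10110101, q=3, r=5)
# quotient = 0b101 (5), remainder = 0b10101 (21)
```

Because the quotient is recoverable from the slot position and the remainder is stored explicitly, the full fingerprint can be reconstructed on a query, which is what makes the exact mode possible with a reversible hash.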
Counting scheme
The MQF uses variable-length integers to encode the k-mer counts, as shown in Fig. 5. It uses as many slots as required to encode the count, while using the fixed-size space of the slots to mark the last slot. To do so, the fixed-size space in every slot before the last is assigned its maximum value (\(F_{max}\)), while the fixed-size space of the last slot stores the most significant bits of the count (\(F_{i}\), where \(F_{i} < F_{max}\)). If the value of the most significant bits equals \(F_{max}\), an extra slot with zero bits is added (see example 3 in Fig. 5). For an item with count c, the fixed-size space of the item's slot is enough for counting until \(c \ge F_{max}\), after which the number of required counting slots can be calculated as \(CEILING\left(\frac{abs(log_{2}(c - F_{max}) - F_{size})}{r}\right) + FLOOR\left(\frac{c_{s}}{F_{max}}\right)\), where \(c_{s}\) represents the decimal value of the most significant bits. In comparison to the CQF, the MQF does not use special slots to resolve ambiguities, which is more memory efficient (Additional file 1: Figures S7 and S8). The counter encoding algorithm is described in Additional file 1: Figure S9.
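The marker idea behind this scheme can be illustrated with a simplified sketch (this is not the exact MQF bit layout from Fig. 5; it only demonstrates how an all-ones fixed counter can flag "count continues in the next slot" while the last fixed counter holds a value below \(F_{max}\)):

```python
F_SIZE = 2                 # bits in the fixed-size counter (illustrative)
R = 4                      # bits per slot (illustrative)
F_MAX = (1 << F_SIZE) - 1  # all-ones marker value, here 3

def encode(count):
    """Spill low-order R-bit chunks of `count` into extra slots until the
    remainder fits in the final fixed-size counter (a value < F_MAX).
    In the MQF, every fixed counter before the last would hold F_MAX as a
    continuation marker."""
    chunks = []
    while count >= F_MAX:
        chunks.append(count & ((1 << R) - 1))  # low R bits go into a slot
        count >>= R
    return chunks, count                        # final fixed counter < F_MAX

def decode(chunks, last_fixed):
    """Reassemble the count from the last fixed counter and the slot chunks."""
    count = last_fixed
    for slot in reversed(chunks):
        count = (count << R) | slot
    return count
```

Small counts (here, below 3) need no extra slots at all, which is exactly why slots the CQF would spend on counters are freed in the MQF.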
Parameter estimation
For offline counting applications, the MQF parameters (q, r, \(F_{size}\)) can be further optimized for each dataset to create the most memory-efficient filter that has enough slots to fit all unique items and their counts. The q parameter defines the number of slots (N) in the MQF, where q = \(log_{2}(N)\). The required number of slots for items and their counts can be estimated from the cardinality of the target dataset, as with the CQF. The r parameter is calculated from the equation \(r = p - q\), where p is the total number of hash bits used to represent each item. In the exact mode, p equals the exact output size of a reversible hash function. In the inexact mode, p is controlled by the target FPR (\(\delta\)) according to the equation \(p = log_{2}\frac{N}{\delta}\), as described before [4]. The \(F_{size}\) parameter defines the size of the fixed-size counter. This is critical because, if a given MQF has too few slots for the items in a dataset, the next bigger MQF would have to double the number of slots, causing a big jump in the memory requirement. To avoid that jump, the MQF can use larger fixed-size counters to decrease the number of slots required for counting, at the expense of a slight increase in slot size. The MQF comes with a utility file (https://github.com/dib-lab/MQF/blob/master/include/utils.h) to enable the calculation of the optimum parameters.
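The equations above can be put together in a rough parameter estimator (a sketch only; the library's utils.h performs this calculation properly, and `mqf_params` and its fixed inputs are assumptions for illustration):

```python
import math

def mqf_params(n_slots_needed: int, fpr: float) -> tuple[int, int]:
    """Estimate q and r from the equations in the text.

    n_slots_needed: estimated slots for all unique items plus their
    counter slots (e.g. from a cardinality estimator such as ntCard).
    fpr: target false-positive rate (delta) for the inexact mode.
    """
    q = math.ceil(math.log2(n_slots_needed))  # number of slots N = 2**q
    n_slots = 2 ** q
    p = math.ceil(math.log2(n_slots / fpr))   # total hash bits: p = log2(N/delta)
    r = p - q                                 # remainder bits stored per slot
    return q, r

# e.g. ~1 M slots at a 1% FPR:
# mqf_params(1_000_000, 0.01) -> (20, 7)
```

Note how r grows only logarithmically as the FPR shrinks, matching the observation that the FPR has minimal effect on quotient-filter performance.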
Labeling system
The MQF can map each item to its count as well as to other values, which we call "labels". Labels in the MQF come in two different systems. An internal labeling system stores the associated value next to every key in the data structure, like a hash table. This label has a fixed size defined at the initialization of the MQF and is practically useful when a small label is needed (e.g. one or two bits). The second labeling system labels the blocks. We use this label to store the number of items inserted in the MQF before each block. This enables labeling the items of the filter via separate arrays matching the order of the items in the filter, a behavior that can act as a minimal perfect hash function [15]. The naive way to compute an item's order is to find the item in the MQF and iterate backward to the beginning of the filter, counting the preceding items, which is an O(N) operation. Because the MQF stores the number of items that exist before each block, it only needs to iterate back to the beginning of the item's block, which is an O(1) operation. The number of previous items for each block is computed after the MQF is constructed. Any additional insertions or deletions of items only require recalculation of the block label values, with no need to reanalyze the original data. Moreover, labeled MQFs can be updated by merging multiple labeled MQFs and their external labeling arrays. External label arrays need to be merged after merging the labeled MQFs: the new item order is recomputed in the final MQF, and labels in the input external arrays are then copied into a new external array according to the new item order. Such a function has to resolve conflicts for items that occur in multiple input MQFs with different external labels (Fig. 6).
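The block-label trick above amounts to a per-block prefix sum over item counts. The following sketch models the filter's occupancy as a boolean list (an assumption for illustration; the real MQF reads its isOccupied/isRunEnd bitmaps instead):

```python
BLOCK_SLOTS = 64  # the MQF groups every 64 slots into a block

def build_block_labels(occupied: list[bool]) -> list[int]:
    """Block label j = number of items stored in all earlier blocks
    (computed once, after the filter is built)."""
    labels, total = [], 0
    for start in range(0, len(occupied), BLOCK_SLOTS):
        labels.append(total)
        total += sum(occupied[start:start + BLOCK_SLOTS])
    return labels

def item_order(occupied: list[bool], labels: list[int], slot: int) -> int:
    """Rank of the item at `slot`: the block label gives the count of items
    in earlier blocks directly, so only the item's own block is scanned."""
    block = slot // BLOCK_SLOTS
    return labels[block] + sum(occupied[block * BLOCK_SLOTS:slot])
```

The returned rank is a dense index in [0, N), so it can address external label arrays exactly like an MPHF value would.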
Buffered MQF
The buffered MQF is composed of two MQF structures: a big structure stored on SSD, called onDiskMQF, and an insertion buffer stored in main memory, called bufferMQF. onDiskMQF uses STXXL vectors [24] because of the performance of their asynchronous I/O. The bufferMQF limits the number of accesses to the onDiskMQF and changes the access pattern to the on-disk structure from random to sequential. As shown in the insertion algorithm in Fig. 7, all insertions are done first on the bufferMQF; when it is full, the items are copied from the bufferMQF to the onDiskMQF, and the bufferMQF is cleared. The copy operation edits the onDiskMQF in a serial pattern, which is preferred when working on an SSD because many edits are grouped together into one read/write operation. Figure 8 shows the query algorithm: the queried items are first inserted into a temporary MQF, and sequential access is used to query the items from the onDiskMQF. The final count is the sum of the counts in the bufferMQF and the onDiskMQF.
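The buffer-then-merge pattern can be sketched with plain dictionaries standing in for the two MQFs (an assumption for illustration; the real structures are quotient filters and the merge is a sequential STXXL pass, not a dict update):

```python
class BufferedCounter:
    """Sketch of the buffered-MQF idea: insertions go to a small in-memory
    buffer; when full, the buffer is merged into the large (on-disk)
    structure in one sequential pass."""

    def __init__(self, buffer_capacity: int = 1000):
        self.buffer, self.on_disk = {}, {}
        self.capacity = buffer_capacity

    def insert(self, item, count: int = 1) -> None:
        self.buffer[item] = self.buffer.get(item, 0) + count
        if len(self.buffer) >= self.capacity:
            self.flush()

    def flush(self) -> None:
        for item, c in self.buffer.items():  # one sequential merge pass
            self.on_disk[item] = self.on_disk.get(item, 0) + c
        self.buffer.clear()

    def query(self, item) -> int:
        # final count = buffer count + on-disk count, as in Fig. 8
        return self.buffer.get(item, 0) + self.on_disk.get(item, 0)
```

Insertions touch the large structure only during flushes, which is why the scheme pays off when insertions, not random queries, are the bottleneck.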
Experimental setup of benchmarking
A total of 13 datasets (5 simulated and 8 real sequence datasets) were used in the experiments to cover a broad spectrum of fragment-selection approaches and sequencing platforms. Three datasets, called z2, z3, and z5, were simulated to follow a Zipfian distribution using three different coefficients: 2, 3, and 5, respectively; the bigger the coefficient, the more singletons in the dataset [25]. A fourth dataset was simulated from a uniform distribution with a frequency equal to 10. Another dataset was simulated to mimic single-end 100 bp Illumina DNA sequencing from human chromosome 20 using the ART simulator [26]. The remaining 8 datasets represent k-mers sampled from real sequencing datasets (ERR1050075, SRR11551346, SRR12801265, SRR12924365, SRR12937177, ERR992657, SRR12873993, SRR12989394). Throughout the manuscript, each dataset is referred to by its accession number. Additional file 1: Table S3 lists all the datasets with their descriptions. A k-mer size of 25 nucleotides was used in all experiments.
Experiments were conducted to compare the performance, memory, and accuracy of the MQF with the state-of-the-art counting structures CQF, CMS, and MPHF. Unless stated otherwise, the CQF and MQF used the same number of slots and the same slot size, while the fixed counter of the MQF was set to two. The slot size was calculated to achieve the target FPR as described in the parameter estimation section (see Methods). To create a comparable CMS, the number of tables in the sketch was calculated as \(ln\frac{1}{\delta}\), as described before [7]. The table width was calculated by dividing the MQF size by the number of tables. The MPHF was created using the default options in the BBhash repo (https://github.com/rizkg/BBHash). An Amazon AWS t3.large machine with Ubuntu Server 18.04 was used to run all the experiments. The instance had 2 vCPUs and 8 GB RAM, with a 100 GB provisioned-IOPS SSD attached for storage. All code used in the experiments can be accessed through the MQF GitHub repository (https://github.com/dib-lab/2020-paper-mqf-benchmarks).
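The CMS sizing rule used for comparability can be written out explicitly (a sketch; `cms_dimensions` and the 4-byte counter size are assumptions, since the text fixes only the depth formula and the total size):

```python
import math

def cms_dimensions(total_bytes: int, fpr: float,
                   counter_bytes: int = 4) -> tuple[int, int]:
    """Shape a CMS comparable to an MQF of `total_bytes`:
    depth = ln(1/delta) hash tables, width = total size split among them.
    The 4-byte per-counter size is an illustrative assumption."""
    depth = max(1, math.ceil(math.log(1.0 / fpr)))   # number of tables
    width = total_bytes // (depth * counter_bytes)   # counters per table
    return depth, width
```

This makes the trade-off in the results concrete: lowering the FPR adds hash tables (slowing every insertion and query) while shrinking each table's width for a fixed memory budget.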
Availability of data and materials
The datasets used in benchmarking are available in the "2020-paper-mqf-benchmarks" repository: https://github.com/dib-lab/2020-paper-mqf-benchmarks.
Abbreviations
MQF: Mixed-counters quotient filter

CQF: Counting quotient filter

FPR: False-positive rate

CMS: Count-min sketch

MPHF: Minimal perfect hash function
References
 1.
Kolajo T, Daramola O, Adebiyi A. Big data stream analysis: a systematic literature review. J Big Data. 2019;6:47.
 2.
Matias Y, Vitter JS, Young NE. Approximate data structures with applications. In: Proceedings of the fifth annual ACMSIAM symposium on discrete algorithms. Arlington: Society for Industrial and Applied Mathematics; 1994. p. 187–194.
 3.
Bloom BH. Space/time tradeoffs in hash coding with allowable errors. Commun ACM. 1970;13:422–6.
 4.
Pandey P, Bender MA, Johnson R, Patro R. A generalpurpose counting filter: making every bit count. In: Proceedings of the 2017 ACM international conference on management of data. Chicago: Association for Computing Machinery; 2017. p. 775–87.
 5.
Manekar SC, Sathe SR. A benchmark study of k-mer counting methods for high-throughput sequencing. Gigascience. 2018;7:giy125.
 6.
Yu Y, et al. SeqOthello: querying RNA-seq experiments at scale. Genome Biol. 2018;19:167.
 7.
Cormode G, Muthukrishnan S. An improved data stream summary: the count-min sketch and its applications. J Algorithms. 2005;55:58–75.
 8.
Muggli MD, et al. Succinct colored de Bruijn graphs. Bioinformatics. 2017;33:3181–7.
 9.
Bray NL, Pimentel H, Melsted P, Pachter L. Near-optimal probabilistic RNA-seq quantification. Nat Biotechnol. 2016;34:525–7.
 10.
Iqbal Z, Caccamo M, Turner I, Flicek P, McVean G. De novo assembly and genotyping of variants using colored de Bruijn graphs. Nat Genet. 2012;44:226–32.
 11.
Cormen TH, Leiserson CE, Rivest RL, Stein C. Introduction to algorithms. Cambridge: MIT Press; 2009.
 12.
Marcais G, Kingsford C. A fast, lock-free approach for efficient parallel counting of occurrences of k-mers. Bioinformatics. 2011;27:764–70.
 13.
Wang J, Chen S, Dong L, Wang G. CHTKC: a robust and efficient k-mer counting algorithm based on a lock-free chaining hash table. Brief Bioinform. 2020. https://doi.org/10.1093/bib/bbaa063.
 14.
Belazzougui D, Botelho FC, Dietzfelbinger M. Hash, displace, and compress. In: Fiat A, Sanders P, editors. Algorithms—ESA 2009. Berlin: Springer; 2009. p. 682–93.
 15.
Limasset A, Rizk G, Chikhi R, Peterlongo P. Fast and scalable minimal perfect hashing for massive key sets. In: 16th International Symposium on Experimental Algorithms. Vol. 11. London, United Kingdom; 2017; p. 1–11.
 16.
Crusoe MR, et al. The khmer software package: enabling efficient nucleotide sequence analysis. F1000 Res. 2015;4:900.
 17.
Zook JM, Salit M. Genomes in a bottle: creating standard reference materials for genomic variation—why, what and how? Genome Biol. 2011;12:P31.
 18.
Flajolet P, Fusy R, Gandouet O, Meunier F. HyperLogLog: the analysis of a near-optimal cardinality estimation algorithm. In: Discrete mathematics & theoretical computer science; 2007. p. 137–56.
 19.
Mohamadi H, Khan H, Birol I. ntCard: a streaming algorithm for cardinality estimation in genomics data. Bioinformatics. 2017;33:1324–30.
 20.
Pandey P, et al. Mantis: a fast, small, and exact large-scale sequence-search index. Cell Syst. 2018;7:201–207.e4.
 21.
Schlick T. Adventures with RNA graphs. Methods. 2018;143:16–33.
 22.
Sohn JI, Nam JW. The present and future of de novo whole-genome assembly. Brief Bioinform. 2018;19:23–40.
 23.
Muthukrishnan S. Data streams: algorithms and applications. Found Trends Theor Comput Sci. 2003;1:117–236.
 24.
Dementiev R, Kettner L, Sanders P. STXXL: standard template library for XXL data sets. Softw Prac Exp. 2008;38:589–637.
 25.
Powers D. Applications and explanations of Zipf's Law. In: CoNLL; 1998.
 26.
Huang W, Li L, Myers JR, Marth GT. ART: a next-generation sequencing read simulator. Bioinformatics. 2012;28:593–4.
Acknowledgements
Not applicable.
Funding
Not applicable.
Author information
Affiliations
Contributions
TAM and MS developed theoretical formalism. MS carried out the implementation and benchmarking. TAM conceived the original idea and supervised this work. All authors contributed to the writing of the manuscript. All authors read and approved the final manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Additional file 1.
Supplementary Figures, Supplementary Table 1 and Supplementary Table 3.
Additional file 2.
Supplementary Table 2.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Shokrof, M., Brown, C.T. & Mansour, T.A. MQF and buffered MQF: quotient filters for efficient storage of k-mers with their counts and metadata. BMC Bioinformatics 22, 71 (2021). https://doi.org/10.1186/s12859-021-03996-x
Received:
Accepted:
Published:
Keywords
 Compact hash tables
 k-mers
 De Bruijn graphs
 NGS
 Inexact data structures