Sketching and sampling approaches for fast and accurate long read classification

BMC Bioinformatics

Table 1 Comparison of sketching and sampling approaches

Approach	Index generation time	Total index size	K-mer query time
Uniform	O(n)	O(s)	O(1)
MinHash	O(n log s)	O(s)	O(1)
Weighted MinHash	O(n log s) + O(n)	O(s) + O(s) weights	O(1)
Order MinHash	O(n log s)	O(s) + O(s) positions	O(L)
Minimizer	O(n)	O(s)	O(1)

Theoretical runtimes for generating and querying screens of size s built from a genome of size n. The three main approaches (Uniform, MinHash and Minimizer) are largely equivalent in computational cost. The augmented approaches (Weighted MinHash and Order MinHash) incur additional overhead, and Order MinHash also involves a more complex query process when comparing two sketches, whose cost depends on the chosen sublist size (L). Exact counts of screen sizes and the number of lookups performed during a classification experiment, as well as the overhead of an exhaustive approach, can be found in Additional file 1: Table 5.
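To make the costs in Table 1 concrete, the following is a minimal Python sketch of how the three main screens could be built. It is an illustration under stated assumptions, not the authors' implementation: the function names (`kmer_hash`, `uniform_screen`, `minhash_screen`, `minimizer_screen`), the parameters `step` and `w`, and the use of BLAKE2b as the hash are all hypothetical choices, and "Uniform" is assumed here to mean keeping k-mers at a fixed interval along the genome.

```python
import hashlib
import heapq
from collections import deque


def kmers(seq, k):
    """Return all overlapping k-mers of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def kmer_hash(kmer):
    """Illustrative 64-bit hash of a k-mer (assumed choice: BLAKE2b)."""
    digest = hashlib.blake2b(kmer.encode(), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def uniform_screen(seq, k, step):
    """Uniform sampling (assumed form): keep every step-th k-mer in one O(n) pass."""
    return {seq[i:i + k] for i in range(0, len(seq) - k + 1, step)}


def minhash_screen(seq, k, s):
    """Bottom-s MinHash: keep the s distinct k-mers with the smallest hashes.

    heapq.nsmallest maintains a heap of size s, giving O(n log s) time.
    """
    return set(heapq.nsmallest(s, set(kmers(seq, k)), key=kmer_hash))


def minimizer_screen(seq, k, w):
    """Minimizers: for each window of w consecutive k-mers, keep the one with
    the smallest hash. A monotone deque keeps the whole pass at O(n)."""
    km = kmers(seq, k)
    hashes = [kmer_hash(x) for x in km]
    screen = set()
    candidates = deque()  # indices whose hashes increase from front to back
    for i, hv in enumerate(hashes):
        # Drop candidates that can no longer be a window minimum.
        while candidates and hashes[candidates[-1]] >= hv:
            candidates.pop()
        candidates.append(i)
        # Drop the front candidate once it falls out of the current window.
        if candidates[0] <= i - w:
            candidates.popleft()
        # Once the first full window is seen, the front is its minimizer.
        if i >= w - 1:
            screen.add(km[candidates[0]])
    return screen
```

Because each screen is stored as a hash set, a lookup such as `kmer in screen` takes constant time on average, which corresponds to the O(1) k-mer query time listed for these three approaches.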

ISSN: 1471-2105