
Table 3 Performance of existing tools

From: Sketching and sampling approaches for fast and accurate long read classification

 

Columns 2-4: Classification Accuracy (%) at Error Rate X. Columns 5-10: % of Human (H) and Contaminant (C) Reads identified at Error Rate X.

| | Accuracy, X = 1% | Accuracy, X = 5% | Accuracy, X = 10% | H, X = 1% | C, X = 1% | H, X = 5% | C, X = 5% | H, X = 10% | C, X = 10% |
|---|---|---|---|---|---|---|---|---|---|
| Novel Approaches (200 TMs) | | | | | | | | | |
| MinHash | 77.8 | 75.6 | 73.0 | 99.5 | 99.5 | 99.4 | 99.1 | 98.8 | 98.5 |
| Minimizer | 79.6 | 77.3 | 74.1 | 99.5 | 99.5 | 99.4 | 99.1 | 98.9 | 98.5 |
| Uniform | 74.5 | 72.3 | 69.9 | 99.5 | 99.4 | 99.2 | 99.0 | 98.7 | 98.4 |
| Minimap2 | 81.3 | 77.9 | 74.1 | 99.5 | 99.5 | 99.2 | 99.0 | 98.5 | 97.9 |
| Winnowmap | 81.3 | 77.9 | 74.1 | 99.5 | 99.5 | 99.2 | 99.0 | 98.5 | 98.0 |
| Kraken2 (RefSeq DB) | 72.0 | 66.0 | 58.0 | 99.3 | 99.2 | 98.8 | 98.1 | 97.2 | 96.5 |
| Kraken2 (Custom DB) | 72.2 | 66.1 | 58.0 | 99.3 | 99.3 | 98.8 | 98.1 | 97.2 | 96.5 |
| Centrifuge (RefSeq DB) | 72.2 | 67.5 | 62.1 | 99.3 | 99.2 | 98.8 | 98.4 | 97.8 | 97.2 |
| Centrifuge (Custom DB) | 72.4 | 67.5 | 62.1 | 99.3 | 99.3 | 98.8 | 98.4 | 97.8 | 97.2 |
| CLARK | 73.5 | 68.5 | 65.4 | 99.5 | 99.5 | 99.1 | 98.9 | 99.0 | 98.1 |
| MashMap | 74.5 | 70.8 | 67.3 | 99.5 | 99.4 | 98.9 | 98.1 | 97.7 | 96.6 |

  1. For genome-level classification accuracy, we find that alignment-based methods perform best: because they compare each read against the entire sequence rather than a reduced or indexed form, they can identify minute differences between highly similar genomes. Index-based approaches struggle to perform genome-level classification between highly similar genomes, with a significant number of reads classified only to a lowest common ancestor of several possible source genomes. All tools perform similarly in contaminant detection, and this task is less affected by higher error rates.
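
The error-rate sensitivity described in this note can be checked against the values in Table 3. The following is a minimal Python sketch, not part of the original paper. It transcribes each row of the table and compares the drop in classification accuracy between X = 1% and X = 10% with the spread (maximum minus minimum) of the six read-identification percentages. The names `TABLE_3`, `accuracy_drop`, and `identification_spread` are hypothetical and exist only for this illustration. The spread is used because it does not depend on how the six H/C columns are ordered.

```python
# Rows transcribed from Table 3. Each tuple holds nine values in table order:
# three classification accuracies (X = 1%, 5%, 10%) followed by the six
# human/contaminant read-identification percentages.
# TABLE_3 and the helper names below are hypothetical, for illustration only.
TABLE_3 = {
    "MinHash":                (77.8, 75.6, 73.0, 99.5, 99.5, 99.4, 99.1, 98.8, 98.5),
    "Minimizer":              (79.6, 77.3, 74.1, 99.5, 99.5, 99.4, 99.1, 98.9, 98.5),
    "Uniform":                (74.5, 72.3, 69.9, 99.5, 99.4, 99.2, 99.0, 98.7, 98.4),
    "Minimap2":               (81.3, 77.9, 74.1, 99.5, 99.5, 99.2, 99.0, 98.5, 97.9),
    "Winnowmap":              (81.3, 77.9, 74.1, 99.5, 99.5, 99.2, 99.0, 98.5, 98.0),
    "Kraken2 (RefSeq DB)":    (72.0, 66.0, 58.0, 99.3, 99.2, 98.8, 98.1, 97.2, 96.5),
    "Kraken2 (Custom DB)":    (72.2, 66.1, 58.0, 99.3, 99.3, 98.8, 98.1, 97.2, 96.5),
    "Centrifuge (RefSeq DB)": (72.2, 67.5, 62.1, 99.3, 99.2, 98.8, 98.4, 97.8, 97.2),
    "Centrifuge (Custom DB)": (72.4, 67.5, 62.1, 99.3, 99.3, 98.8, 98.4, 97.8, 97.2),
    "CLARK":                  (73.5, 68.5, 65.4, 99.5, 99.5, 99.1, 98.9, 99.0, 98.1),
    "MashMap":                (74.5, 70.8, 67.3, 99.5, 99.4, 98.9, 98.1, 97.7, 96.6),
}


def accuracy_drop(values):
    """Accuracy at X = 1% minus accuracy at X = 10%, in percentage points."""
    return round(values[0] - values[2], 1)


def identification_spread(values):
    """Largest minus smallest of the six read-identification percentages."""
    identified = values[3:]
    return round(max(identified) - min(identified), 1)


# Map each tool to (accuracy drop, identification spread).
summary = {
    tool: (accuracy_drop(values), identification_spread(values))
    for tool, values in TABLE_3.items()
}

for tool, (drop, spread) in summary.items():
    print(f"{tool:<24} accuracy drop: {drop:5.1f}   identification spread: {spread:4.1f}")
```

For every row, the accuracy drop is larger than the identification spread, which is consistent with the observation that read identification (human and contaminant) is less affected by higher error rates than genome-level classification.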