Unsupervised statistical clustering of environmental shotgun sequences

BMC Bioinformatics

Table 4 Performance comparison of LikelyBin and CompostBin on the pairs of genomes analyzed in Figures 5 and 6 and Table 2.

Org 1	Org 2	Frag L	Frag N	D₃	LikelyBin accuracy	CompostBin accuracy (10 CB seeds)	CompostBin accuracy (25 CB seeds)
S. meliloti	A. aurescens	400	500	1.02	0.94	0.93	0.93
L. lactis	F. tularensis	400	500	1.15	0.92	0.76	0.12*
S. pneumoniae	H. pylori	400	500	0.97	0.96	0.12*	0.96
P. marinus	S. aureus	400	500	0.99	0.93	0.73	0.83
M. jannaschii	S. aureus	400	500	0.92	0.94	0.17*	0.91

Frag L, fragment length; Frag N, number of fragments per source; CB seeds, labeled fragments supplied to CompostBin for training. LikelyBin consistently performed as well as or better than CompostBin despite being completely unsupervised, whereas CompostBin required a fraction of the input fragments to be labeled to seed its clustering algorithm. We supplied training fragments to CompostBin without regard to their origin (protein- or RNA-coding). In a likely practical scenario, only 16S RNA-coding fragments would be labeled; these would have different k-mer distributions from protein-coding regions, possibly confounding classification. (*) Convergence toward a good clustering was not observed in CompostBin for these datasets; accuracy can fall below 50% because of the labeled input.
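
One possible reading of the last remark, sketched in Python: when cluster identities are pinned to source organisms by labeled seeds, a two-way binning can score below 50%, whereas an unsupervised binning scored under the most favorable cluster-to-source assignment cannot. The function names, the toy labels, and the scoring rule itself are assumptions made for illustration; this is not how either tool is documented to compute accuracy.

```python
from itertools import permutations


def fixed_label_accuracy(true_labels, predicted_labels):
    """Fraction of fragments whose predicted label equals the true label."""
    matches = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return matches / len(true_labels)


def best_permutation_accuracy(true_labels, cluster_ids):
    """Accuracy under the most favorable mapping of clusters to sources.

    Assumes there are no more clusters than sources (here, two of each).
    """
    sources = sorted(set(true_labels))
    clusters = sorted(set(cluster_ids))
    best = 0.0
    for perm in permutations(sources, len(clusters)):
        mapping = dict(zip(clusters, perm))
        relabeled = [mapping[c] for c in cluster_ids]
        best = max(best, fixed_label_accuracy(true_labels, relabeled))
    return best


# Hypothetical toy data: 5 fragments from each of two sources, A and B.
true_labels = ["A"] * 5 + ["B"] * 5
# A binning that separates the sources fairly well but with the names swapped.
predicted = ["B", "B", "B", "B", "A", "A", "A", "A", "A", "B"]

print(fixed_label_accuracy(true_labels, predicted))       # 0.2
print(best_permutation_accuracy(true_labels, predicted))  # 0.8
```

With cluster names fixed, the toy binning scores 0.2; allowing the names to be swapped gives 0.8, which is why a two-cluster score under free relabeling never drops below 0.5.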

ISSN: 1471-2105