Centroid based clustering of high throughput sequencing reads based on n-mer counts

BMC Bioinformatics

Table 4 Recall rates for simulated human reads of different lengths, various distance functions, n = 2

	EM		k-means		$d_{2}^{*}$		χ²		Symmetrized KL
Read length	Recall	std. dev.	Recall	std. dev.	Recall	std. dev.	Recall	std. dev.	Recall	std. dev.
2 clusters
30	0.737	0.133	0.735	0.136	0.610	0.083	0.737	0.140	0.736	0.134
50	0.762	0.141	0.760	0.143	0.649	0.105	0.760	0.144	0.762	0.141
75	0.781	0.145	0.778	0.147	0.677	0.122	0.778	0.148	0.781	0.145
100	0.794	0.148	0.791	0.150	0.719	0.131	0.791	0.150	0.794	0.148
150	0.812	0.152	0.810	0.153	0.803	0.147	0.810	0.154	0.812	0.152
200	0.827	0.153	0.825	0.155	0.824	0.151	0.824	0.155	0.826	0.153
250	0.839	0.153	0.837	0.155	0.838	0.151	0.837	0.156	0.839	0.153
300	0.850	0.153	0.848	0.155	0.850	0.152	0.847	0.156	0.850	0.153
400	0.867	0.152	0.866	0.154	0.869	0.152	0.866	0.154	0.867	0.152
3 clusters
30	0.573	0.110	0.573	0.108	0.447	0.076	0.715	0.131	0.572	0.111
50	0.604	0.124	0.603	0.126	0.474	0.090	0.674	0.134	0.603	0.125
75	0.629	0.135	0.629	0.138	0.626	0.139	0.664	0.144	0.629	0.136
100	0.647	0.142	0.647	0.146	0.671	0.148	0.668	0.150	0.647	0.143
150	0.675	0.153	0.675	0.156	0.724	0.157	0.687	0.159	0.675	0.153
200	0.696	0.160	0.696	0.164	0.692	0.161	0.706	0.167	0.696	0.160
250	0.714	0.166	0.714	0.170	0.714	0.166	0.723	0.172	0.714	0.166
300	0.730	0.171	0.730	0.173	0.730	0.170	0.738	0.176	0.730	0.170
400	0.756	0.177	0.757	0.179	0.757	0.176	0.762	0.180	0.756	0.176

Mean recall rates and standard deviations for various read lengths and 2 or 3 clusters. For each read length, clustering was performed on 50 simulated read sets, each consisting of 100000 reads originating from 1000 randomly chosen human RNA reference sequences. Clustering was performed using all distance functions considered in the paper, including those that do not guarantee convergence. Results for the L₂ and d₂ distances are not shown. Word length is n = 2.
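
As an illustration of the quantities compared in the table, the following minimal sketch counts overlapping n-mers (n = 2) in a read and evaluates a symmetrized Kullback-Leibler divergence between two n-mer frequency vectors. It is not the authors' implementation. The function names, the pseudocount used to avoid zero frequencies, and the choice of the plain sum KL(p‖q) + KL(q‖p) as the symmetrized form are all assumptions made for this example; the paper's exact definitions are not reproduced on this page.

```python
from itertools import product
from math import log

ALPHABET = "ACGT"


def nmer_counts(read, n=2):
    """Count overlapping n-mers of a read, in lexicographic n-mer order."""
    words = ["".join(w) for w in product(ALPHABET, repeat=n)]
    index = {w: i for i, w in enumerate(words)}
    counts = [0] * len(words)
    for i in range(len(read) - n + 1):
        j = index.get(read[i:i + n])
        if j is not None:  # skip words containing N or other symbols
            counts[j] += 1
    return counts


def to_frequencies(counts, pseudocount=1.0):
    """Normalize counts to frequencies; the pseudocount is an assumption
    of this sketch that keeps every frequency strictly positive."""
    total = sum(counts) + pseudocount * len(counts)
    return [(c + pseudocount) / total for c in counts]


def symmetrized_kl(p, q):
    """KL(p||q) + KL(q||p), which simplifies to sum((p - q) * log(p / q)).
    Assumed form of the symmetrization; requires strictly positive inputs."""
    return sum((pi - qi) * log(pi / qi) for pi, qi in zip(p, q))


# Hypothetical usage: distance between a read and a cluster centroid,
# both represented by their 2-mer frequency vectors.
read = to_frequencies(nmer_counts("ACGTACGTTTGA"))
centroid = to_frequencies(nmer_counts("GGGCCCGGGCCC"))
distance = symmetrized_kl(read, centroid)
```

In a centroid-based scheme, each read would be assigned to the centroid with the smallest such distance. With n = 2 there are only 16 possible words, so each read is summarized by a vector of 16 counts.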

ISSN: 1471-2105