Centroid based clustering of high throughput sequencing reads based on n-mer counts

Solovyov, Alexander; Lipkin, W Ian

doi:10.1186/1471-2105-14-268

BMC Bioinformatics

Table 5 Recall rates for simulated human reads, different numbers of reads, n = 2


	EM		L₂		d₂
Number of reads	Recall	std. dev.	Recall	std. dev.	Recall	std. dev.
2 clusters
5000	0.783	0.160	0.793	0.166	0.790	0.165
10000	0.787	0.151	0.793	0.156	0.793	0.156
20000	0.798	0.146	0.801	0.151	0.801	0.150
30000	0.805	0.146	0.806	0.150	0.806	0.150
50000	0.812	0.147	0.812	0.150	0.812	0.149
75000	0.815	0.148	0.815	0.151	0.815	0.150
100000	0.818	0.149	0.816	0.151	0.816	0.151
150000	0.820	0.150	0.819	0.152	0.819	0.151
200000	0.821	0.150	0.819	0.152	0.819	0.152
400000	0.823	0.151	0.821	0.153	0.821	0.152
3 clusters
5000	0.657	0.181	0.660	0.184	0.656	0.181
10000	0.653	0.162	0.655	0.164	0.653	0.163
20000	0.661	0.151	0.661	0.153	0.659	0.152
30000	0.667	0.149	0.667	0.150	0.665	0.150
50000	0.674	0.150	0.674	0.151	0.673	0.152
75000	0.679	0.152	0.678	0.153	0.677	0.153
100000	0.682	0.153	0.681	0.154	0.680	0.155
150000	0.685	0.154	0.684	0.155	0.683	0.156
200000	0.686	0.155	0.685	0.156	0.685	0.157
400000	0.689	0.156	0.688	0.157	0.687	0.158
4 clusters
5000	0.577	0.183	0.587	0.189	0.581	0.188
10000	0.569	0.159	0.577	0.163	0.573	0.162
20000	0.576	0.144	0.583	0.146	0.580	0.145
30000	0.583	0.141	0.590	0.143	0.586	0.142
50000	0.591	0.140	0.598	0.142	0.595	0.142
75000	0.597	0.142	0.603	0.144	0.599	0.143
100000	0.600	0.143	0.606	0.145	0.603	0.145
150000	0.604	0.145	0.610	0.146	0.607	0.146
200000	0.605	0.145	0.612	0.147	0.608	0.147
400000	0.608	0.147	0.615	0.148	0.611	0.148
5 clusters
5000	0.520	0.181	0.534	0.187	0.527	0.184
10000	0.514	0.156	0.527	0.162	0.520	0.158
20000	0.521	0.140	0.532	0.145	0.527	0.144
30000	0.529	0.138	0.540	0.142	0.535	0.141
50000	0.539	0.139	0.549	0.143	0.544	0.142
75000	0.545	0.140	0.555	0.144	0.550	0.144
100000	0.548	0.142	0.558	0.146	0.553	0.145
150000	0.552	0.144	0.562	0.148	0.557	0.147
200000	0.554	0.145	0.564	0.149	0.560	0.148
400000	0.558	0.146	0.568	0.150	0.563	0.150

Mean recall rates and standard deviations for different numbers of reads. For each of the 50 randomly chosen subsets of human reference RNA sequences, we simulated the specified number of reads. Clustering was performed using the EM, L₂ and d₂ algorithms. Word length is n = 2; read length is 200 bp. When computing the recall rate for each contig, we use pseudocounts, artificially increasing the count of reads in each cluster by one.
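
The two quantities named in the caption can be illustrated with a minimal Python sketch. It is not the authors' code. The function names `nmer_counts` and `recall_with_pseudocounts` are hypothetical, and the recall definition used here (the fraction of a contig's reads that land in its best-represented cluster, with one pseudocount added to every cluster) is an assumed reading of the caption rather than a formula stated on this page.

```python
from collections import Counter
from itertools import product


def nmer_counts(read, n=2, alphabet="ACGT"):
    """Count overlapping n-mers in a read.

    Returns one count per word over the alphabet, in lexicographic
    order (AA, AC, AG, ..., TT for n = 2). Windows containing other
    symbols (for example N) are ignored.
    """
    words = ["".join(p) for p in product(alphabet, repeat=n)]
    seen = Counter(read[i:i + n] for i in range(len(read) - n + 1))
    return [seen.get(w, 0) for w in words]


def recall_with_pseudocounts(cluster_counts):
    """Recall for one contig under the ASSUMED definition.

    cluster_counts[k] is the number of reads from the contig that were
    assigned to cluster k. Each count is increased by one (the
    pseudocount), then the largest share is returned.
    """
    padded = [c + 1 for c in cluster_counts]
    return max(padded) / sum(padded)
```

Under this assumed definition, a contig with 8 reads in one cluster and none in the other scores (8 + 1) / (8 + 0 + 2) = 0.9, and no contig can score below 1/K for K clusters, which is worth keeping in mind when comparing rows across the 2, 3, 4 and 5 cluster blocks.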


ISSN: 1471-2105
