
Table 4 Recall rates for simulated human reads of different length, various distance functions, n = 2

From: Centroid based clustering of high throughput sequencing reads based on n-mer counts

| Read length | EM recall | EM std. dev. | k-means recall | k-means std. dev. | d2 recall | d2 std. dev. | χ2 recall | χ2 std. dev. | Symmetrized KL recall | Symmetrized KL std. dev. |
|---|---|---|---|---|---|---|---|---|---|---|
| 2 clusters | | | | | | | | | | |
| 30 | 0.737 | 0.133 | 0.735 | 0.136 | 0.610 | 0.083 | 0.737 | 0.140 | 0.736 | 0.134 |
| 50 | 0.762 | 0.141 | 0.760 | 0.143 | 0.649 | 0.105 | 0.760 | 0.144 | 0.762 | 0.141 |
| 75 | 0.781 | 0.145 | 0.778 | 0.147 | 0.677 | 0.122 | 0.778 | 0.148 | 0.781 | 0.145 |
| 100 | 0.794 | 0.148 | 0.791 | 0.150 | 0.719 | 0.131 | 0.791 | 0.150 | 0.794 | 0.148 |
| 150 | 0.812 | 0.152 | 0.810 | 0.153 | 0.803 | 0.147 | 0.810 | 0.154 | 0.812 | 0.152 |
| 200 | 0.827 | 0.153 | 0.825 | 0.155 | 0.824 | 0.151 | 0.824 | 0.155 | 0.826 | 0.153 |
| 250 | 0.839 | 0.153 | 0.837 | 0.155 | 0.838 | 0.151 | 0.837 | 0.156 | 0.839 | 0.153 |
| 300 | 0.850 | 0.153 | 0.848 | 0.155 | 0.850 | 0.152 | 0.847 | 0.156 | 0.850 | 0.153 |
| 400 | 0.867 | 0.152 | 0.866 | 0.154 | 0.869 | 0.152 | 0.866 | 0.154 | 0.867 | 0.152 |
| 3 clusters | | | | | | | | | | |
| 30 | 0.573 | 0.110 | 0.573 | 0.108 | 0.447 | 0.076 | 0.715 | 0.131 | 0.572 | 0.111 |
| 50 | 0.604 | 0.124 | 0.603 | 0.126 | 0.474 | 0.090 | 0.674 | 0.134 | 0.603 | 0.125 |
| 75 | 0.629 | 0.135 | 0.629 | 0.138 | 0.626 | 0.139 | 0.664 | 0.144 | 0.629 | 0.136 |
| 100 | 0.647 | 0.142 | 0.647 | 0.146 | 0.671 | 0.148 | 0.668 | 0.150 | 0.647 | 0.143 |
| 150 | 0.675 | 0.153 | 0.675 | 0.156 | 0.724 | 0.157 | 0.687 | 0.159 | 0.675 | 0.153 |
| 200 | 0.696 | 0.160 | 0.696 | 0.164 | 0.692 | 0.161 | 0.706 | 0.167 | 0.696 | 0.160 |
| 250 | 0.714 | 0.166 | 0.714 | 0.170 | 0.714 | 0.166 | 0.723 | 0.172 | 0.714 | 0.166 |
| 300 | 0.730 | 0.171 | 0.730 | 0.173 | 0.730 | 0.170 | 0.738 | 0.176 | 0.730 | 0.170 |
| 400 | 0.756 | 0.177 | 0.757 | 0.179 | 0.757 | 0.176 | 0.762 | 0.180 | 0.756 | 0.176 |
  1. Mean recall rates and standard deviations for various read lengths and 2 or 3 clusters. For every read length, clustering was performed on 50 simulated read sets, each set originating from 1000 randomly chosen human RNA reference sequences and containing 100000 reads. Clustering was performed using all distance functions considered in the paper, including those which do not guarantee convergence. Results for L2 and d2 distance are not shown. Word length is n = 2.
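
To make the footnote's "word length n = 2" concrete, here is a minimal Python sketch of counting the n-mers of a single read. It is an illustration only, not the paper's implementation. The function name `nmer_counts`, the alphabet `ACGT`, the use of overlapping windows, and the choice to skip words that contain other characters are all assumptions made for this example.

```python
from itertools import product


def nmer_counts(read, n=2, alphabet="ACGT"):
    """Count overlapping n-mers in one read (hypothetical helper).

    Returns a dict with one entry per possible word over `alphabet`,
    so every read maps to a vector of the same length (16 for n = 2).
    """
    counts = {"".join(word): 0 for word in product(alphabet, repeat=n)}
    for i in range(len(read) - n + 1):
        word = read[i:i + n]
        # Assumption: words with characters outside the alphabet (e.g. N) are skipped.
        if word in counts:
            counts[word] += 1
    return counts


counts = nmer_counts("ACGTACGT")
```

For a read of length L this gives at most L - n + 1 counted words, so a 30-base read contributes only 29 dinucleotide observations spread over 16 possible words.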