Table 9 Assembly statistics for clustering and random splitting of a real sample

From: Centroid based clustering of high throughput sequencing reads based on n-mer counts

Splitting | Mapped reads (% of all reads) | Total bp in contigs | Number of contigs | N50

Velvet assembly, hash length 21
All reads | 1507427 (6.99%) | 470740477 | 5650689 | 79
Soft EM, 2 clusters | 1555663 (7.21%) | 492995773 | 5928907 | 52
EM, 2 clusters | 1458627 (6.76%) | 453165323 | 5454404 | 65
k-means, 2 clusters | 1475586 (6.84%) | 455384651 | 5474129 | 124
GC content, 2 parts | 1455825 (6.75%) | 451987554 | 5437853 | 70
Random splitting, 2 clusters | 1259894 (5.84%) | 428174983 | 5219268 | 94
Soft EM, 3 clusters | 1614090 (7.48%) | 528221487 | 6359119 | 78
EM, 3 clusters | 1429190 (6.63%) | 443042444 | 5343548 | 55
k-means, 3 clusters | 1439961 (6.68%) | 443713631 | 5347679 | 77
GC content, 3 parts | 1397108 (6.48%) | 436515238 | 5278594 | 98
Random splitting, 3 clusters | 1036477 (4.81%) | 392638398 | 4878611 | 48

Velvet assembly, hash length 31
All reads | 2327126 (10.79%) | 290798616 | 2578825 | 100
Soft EM, 2 clusters | 2263596 (10.50%) | 292536888 | 2643061 | 204
EM, 2 clusters | 2112597 (9.79%) | 266185624 | 2412045 | 126
k-means, 2 clusters | 2129306 (9.87%) | 267875650 | 2424380 | 86
GC content, 2 parts | 2106489 (9.77%) | 265677735 | 2407873 | 100
Random splitting, 2 clusters | 1629402 (7.55%) | 222061071 | 2101527 | 104
Soft EM, 3 clusters | 2269196 (10.52%) | 310107203 | 2839376 | 226
EM, 3 clusters | 2002261 (9.28%) | 255782318 | 2354233 | 86
k-means, 3 clusters | 2006030 (9.30%) | 256111968 | 2358231 | 114
GC content, 3 parts | 1934436 (8.97%) | 247356556 | 2283296 | 106
Random splitting, 3 clusters | 1257062 (5.83%) | 184143812 | 1807765 | 141

Velvet assembly, hash length 41
All reads | 1403308 (6.51%) | 118746013 | 848180 | 127
Soft EM, 2 clusters | 1289123 (5.98%) | 110992860 | 805140 | 188
EM, 2 clusters | 1182223 (5.48%) | 99860264 | 725769 | 129
k-means, 2 clusters | 1191102 (5.52%) | 100436034 | 728680 | 125
GC content, 2 parts | 1182618 (5.48%) | 100638247 | 733416 | 127
Random splitting, 2 clusters | 839681 (3.89%) | 73260257 | 558661 | 83
Soft EM, 3 clusters | 1275142 (5.91%) | 114111918 | 836929 | 156
EM, 3 clusters | 1081154 (5.01%) | 92510990 | 683516 | 169
k-means, 3 clusters | 1078651 (5.00%) | 92021168 | 679148 | 136
GC content, 3 parts | 1027385 (4.76%) | 86928363 | 641027 | 136
Random splitting, 3 clusters | 622242 (2.88%) | 55079268 | 435757 | 149
Statistics of assembly of real sequencing data. A total of 21,568,249 reads from an Illumina run on nasal swab cDNA were assembled with and without splitting the sample. The sample was split into 2 or 3 clusters either randomly or with the soft and hard clustering techniques studied in the present work.
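For readers who want to check the derived columns, the following is a minimal Python sketch of how an N50 value and a mapped-read percentage can be computed. It uses the common definition of N50 (the contig length at which the cumulative length of contigs, sorted from longest to shortest, first reaches half of the total assembled bases); whether the table's values were computed with exactly this convention is an assumption. The function names `n50` and `mapped_percent` are hypothetical and do not come from the paper or from Velvet.

```python
# Total number of reads in the sample, as stated in the table caption.
TOTAL_READS = 21_568_249


def n50(contig_lengths):
    """Return the N50 of a collection of contig lengths.

    Assumes the common definition: sort contigs from longest to shortest
    and return the length of the contig at which the running total first
    reaches half of all assembled bases. Returns 0 for an empty input.
    """
    lengths = sorted(contig_lengths, reverse=True)
    half_total = sum(lengths) / 2
    running = 0
    for length in lengths:
        running += length
        if running >= half_total:
            return length
    return 0


def mapped_percent(mapped_reads, total_reads=TOTAL_READS):
    """Return mapped reads as a percentage of all reads in the sample."""
    return 100.0 * mapped_reads / total_reads


if __name__ == "__main__":
    # Toy contig lengths, for illustration only (not data from the paper).
    print(n50([2, 2, 2, 3, 3, 4, 8, 8]))
    # "All reads" row, hash length 21: 1507427 mapped reads.
    print(round(mapped_percent(1_507_427), 2))
```

Applied to the "All reads" rows, `mapped_percent` reproduces the percentages shown in parentheses, which indicates that they are taken relative to the full set of 21,568,249 reads.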