
Table 9 Assembly statistics for clustering and random splitting of a real sample

From: Centroid based clustering of high throughput sequencing reads based on n-mer counts

| Splitting | Mapped reads | Total bp in contigs | Number of contigs | n50 |
|---|---|---|---|---|
| Velvet assembly, hash length 21 | | | | |
| All reads | 1507427 (6.99%) | 470740477 | 5650689 | 79 |
| Soft EM, 2 clusters | 1555663 (7.21%) | 492995773 | 5928907 | 52 |
| EM, 2 clusters | 1458627 (6.76%) | 453165323 | 5454404 | 65 |
| k-means, 2 clusters | 1475586 (6.84%) | 455384651 | 5474129 | 124 |
| GC content, 2 parts | 1455825 (6.75%) | 451987554 | 5437853 | 70 |
| Random splitting, 2 clusters | 1259894 (5.84%) | 428174983 | 5219268 | 94 |
| Soft EM, 3 clusters | 1614090 (7.48%) | 528221487 | 6359119 | 78 |
| EM, 3 clusters | 1429190 (6.63%) | 443042444 | 5343548 | 55 |
| k-means, 3 clusters | 1439961 (6.68%) | 443713631 | 5347679 | 77 |
| GC content, 3 parts | 1397108 (6.48%) | 436515238 | 5278594 | 98 |
| Random splitting, 3 clusters | 1036477 (4.81%) | 392638398 | 4878611 | 48 |
| Velvet assembly, hash length 31 | | | | |
| All reads | 2327126 (10.79%) | 290798616 | 2578825 | 100 |
| Soft EM, 2 clusters | 2263596 (10.50%) | 292536888 | 2643061 | 204 |
| EM, 2 clusters | 2112597 (9.79%) | 266185624 | 2412045 | 126 |
| k-means, 2 clusters | 2129306 (9.87%) | 267875650 | 2424380 | 86 |
| GC content, 2 parts | 2106489 (9.77%) | 265677735 | 2407873 | 100 |
| Random splitting, 2 clusters | 1629402 (7.55%) | 222061071 | 2101527 | 104 |
| Soft EM, 3 clusters | 2269196 (10.52%) | 310107203 | 2839376 | 226 |
| EM, 3 clusters | 2002261 (9.28%) | 255782318 | 2354233 | 86 |
| k-means, 3 clusters | 2006030 (9.30%) | 256111968 | 2358231 | 114 |
| GC content, 3 parts | 1934436 (8.97%) | 247356556 | 2283296 | 106 |
| Random splitting, 3 clusters | 1257062 (5.83%) | 184143812 | 1807765 | 141 |
| Velvet assembly, hash length 41 | | | | |
| All reads | 1403308 (6.51%) | 118746013 | 848180 | 127 |
| Soft EM, 2 clusters | 1289123 (5.98%) | 110992860 | 805140 | 188 |
| EM, 2 clusters | 1182223 (5.48%) | 99860264 | 725769 | 129 |
| k-means, 2 clusters | 1191102 (5.52%) | 100436034 | 728680 | 125 |
| GC content, 2 parts | 1182618 (5.48%) | 100638247 | 733416 | 127 |
| Random splitting, 2 clusters | 839681 (3.89%) | 73260257 | 558661 | 83 |
| Soft EM, 3 clusters | 1275142 (5.91%) | 114111918 | 836929 | 156 |
| EM, 3 clusters | 1081154 (5.01%) | 92510990 | 683516 | 169 |
| k-means, 3 clusters | 1078651 (5.00%) | 92021168 | 679148 | 136 |
| GC content, 3 parts | 1027385 (4.76%) | 86928363 | 641027 | 136 |
| Random splitting, 3 clusters | 622242 (2.88%) | 55079268 | 435757 | 149 |

Statistics of assembly of real sequencing data. 21,568,249 reads from an Illumina run on nasal swab cDNA were assembled with and without splitting the sample. The sample was split into 2 or 3 clusters either randomly or with the soft and hard clustering techniques studied in the present work.
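
The percentages in the Mapped reads column appear to be the mapped read count divided by the 21,568,249 total reads given in the caption. The short Python sketch below reproduces that arithmetic for a few rows. The denominator is an assumption inferred from the caption, and the names `TOTAL_READS`, `mapped_percentage`, and `hash31` are illustrative only and do not come from the paper.

```python
# Total number of reads in the sample, as stated in the table caption.
TOTAL_READS = 21_568_249


def mapped_percentage(mapped_reads: int, total_reads: int = TOTAL_READS) -> float:
    """Return mapped reads as a percentage of all reads, rounded to 2 decimals.

    Assumption: the table's percentage is mapped_reads / total_reads.
    """
    return round(100.0 * mapped_reads / total_reads, 2)


# Mapped-read counts copied from three hash length 31 rows of the table.
hash31 = {
    "All reads": 2_327_126,
    "k-means, 2 clusters": 2_129_306,
    "Random splitting, 2 clusters": 1_629_402,
}

for splitting, mapped in hash31.items():
    print(f"{splitting}: {mapped} ({mapped_percentage(mapped):.2f}%)")
```

The same function can be applied to any other row to check a transcribed count against its printed percentage.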