
Table 2 Clustering accuracy for the gold standard data

From: A nearest-neighbors network model for sequence data reveals new insight into genotype distribution of a pathogen

Graph (852 nodes)                  Th.   Edges   C    |C1|  |C2|  |C>2|  SF prec.  SF recall  Family prec.  Family recall
Similarity Score                   15%   1563    427  300   116   432    49.3%     2.2%       35.4%         4.7%
                                   25%   3057    335  223   88    537    46.4%     3.2%       33.0%         5.1%
                                   35%   5125    265  169   72    607    42.7%     5.2%       30.6%         7.0%
                                   45%   6689    239  153   64    631    41.5%     7.1%       30.4%         8.8%
                                   55%   7433    223  140   64    644    40.7%     6.8%       29.8%         8.1%
Bitscore (from max score)          1000  1808    507  436   64    348    44.3%     3.6%       32.5%         4.1%
                                   1100  4168    364  293   56    499    42.7%     5.2%       30.7%         5.8%
                                   1200  5607    300  223   50    575    42.1%     4.7%       30.6%         5.8%
Edit Distance Threshold            50    1134    448  312   122   418    100%      4.2%       99.3%         21.9%
                                   100   3453    310  192   86    574    100%      8.2%       98.6%         32.4%
                                   150   8726    206  107   74    671    99.0%     15.5%      95.8%         45.6%
                                   175   12,966  152  71    54    727    94.6%     25.6%      91.3%         57.7%
                                   200   18,097  115  48    38    766    93.6%     29.5%      88.3%         60.7%
Needleman-Wunsch (from max score)  2200  1896    638  588   32    232    100%      8.8%       100%          33.5%
                                   2400  5485    507  450   56    346    100%      17.9%      100%          48.5%
                                   2600  8231    378  324   38    490    100%      18.8%      100%          51.8%
                                   2800  15,603  291  228   46    578    100%      30.5%      100%          64.3%
                                   3000  26,917  243  176   38    638    99.8%     39.7%      94.8%         79.1%
                                   3200  39,633  165  123   26    670    83.8%     64.2%      68.4%         87.7%
DiWANN                             NA    931     218  0     142   710    97.5%     3.5%       92.3%         25.5%

This table summarizes clustering accuracy for the gold standard sequence similarity networks. Th. gives the threshold used for a given network: a number of edits, a distance from the maximum similarity score (for bitscore and Needleman-Wunsch), or a percent similarity score. C gives the total number of clusters; |C1| gives the number of nodes in clusters of size 1 (singletons), |C2| the number of nodes in clusters of size 2, and |C>2| the number of nodes in clusters of size 3 and above. For calculating precision and recall, we assume that clusters correspond to the family and superfamily (SF) labels provided with the dataset. Sequences vary widely in length, from 100 to over 700 amino acids.
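
As a rough illustration of how the cluster-size columns and the precision/recall columns could be computed, here is a minimal Python sketch. The table note does not spell out the exact precision and recall formulas, so the pairwise definition used here is an assumption, not necessarily the authors' method: precision is the fraction of same-cluster node pairs that share a label, and recall is the fraction of same-label node pairs that fall in the same cluster. The function names and the toy data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cluster_size_summary(clusters):
    """Return (C, |C1|, |C2|, |C>2|) for a list of clusters.

    C is the number of clusters; the other three values count NODES
    in clusters of size 1, size 2, and size 3 or more, respectively.
    """
    sizes = [len(cluster) for cluster in clusters]
    nodes_in_c1 = sum(s for s in sizes if s == 1)
    nodes_in_c2 = sum(s for s in sizes if s == 2)
    nodes_in_big = sum(s for s in sizes if s > 2)
    return len(sizes), nodes_in_c1, nodes_in_c2, nodes_in_big


def pairwise_precision_recall(clusters, labels):
    """Pairwise precision and recall (an ASSUMED definition).

    clusters: list of lists of node ids (each node in exactly one cluster).
    labels:   dict mapping node id -> family or superfamily label.
    """
    same_cluster_pairs = 0
    same_cluster_and_label_pairs = 0
    for cluster in clusters:
        for a, b in combinations(cluster, 2):
            same_cluster_pairs += 1
            if labels[a] == labels[b]:
                same_cluster_and_label_pairs += 1

    # Number of node pairs sharing a label, regardless of clustering.
    label_counts = Counter(labels[n] for cluster in clusters for n in cluster)
    same_label_pairs = sum(k * (k - 1) // 2 for k in label_counts.values())

    # With no pairs to judge, report 1.0 (vacuously correct); this is a choice.
    precision = (same_cluster_and_label_pairs / same_cluster_pairs
                 if same_cluster_pairs else 1.0)
    recall = (same_cluster_and_label_pairs / same_label_pairs
              if same_label_pairs else 1.0)
    return precision, recall


# Toy data (hypothetical): six sequences, two families, three clusters.
toy_clusters = [["a", "b", "c"], ["d", "e"], ["f"]]
toy_labels = {"a": "F1", "b": "F1", "c": "F2", "d": "F2", "e": "F2", "f": "F1"}

print(cluster_size_summary(toy_clusters))
print(pairwise_precision_recall(toy_clusters, toy_labels))
```

Under this definition, a network that splits the data into many small, pure clusters scores high precision and low recall, which matches the general pattern of the edit distance and Needleman-Wunsch rows at low thresholds.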