Table 2 Clustering accuracy for the gold standard data

From: A nearest-neighbors network model for sequence data reveals new insight into genotype distribution of a pathogen

Graph (852 nodes)                  Th.   Edges   C    |C1|  |C2|  |C>2|  SF prec.  SF recall  Family prec.  Family recall
Similarity Score                   15%   1563    427  300   116   432    49.3%     2.2%       35.4%         4.7%
                                   25%   3057    335  223   88    537    46.4%     3.2%       33.0%         5.1%
                                   35%   5125    265  169   72    607    42.7%     5.2%       30.6%         7.0%
                                   45%   6689    239  153   64    631    41.5%     7.1%       30.4%         8.8%
                                   55%   7433    223  140   64    644    40.7%     6.8%       29.8%         8.1%
Bitscore (from max score)          1000  1808    507  436   64    348    44.3%     3.6%       32.5%         4.1%
                                   1100  4168    364  293   56    499    42.7%     5.2%       30.7%         5.8%
                                   1200  5607    300  223   50    575    42.1%     4.7%       30.6%         5.8%
Edit Distance Threshold            50    1134    448  312   122   418    100%      4.2%       99.3%         21.9%
                                   100   3453    310  192   86    574    100%      8.2%       98.6%         32.4%
                                   150   8726    206  107   74    671    99.0%     15.5%      95.8%         45.6%
                                   175   12,966  152  71    54    727    94.6%     25.6%      91.3%         57.7%
                                   200   18,097  115  48    38    766    93.6%     29.5%      88.3%         60.7%
Needleman-Wunsch (from max score)  2200  1896    638  588   32    232    100%      8.8%       100%          33.5%
                                   2400  5485    507  450   56    346    100%      17.9%      100%          48.5%
                                   2600  8231    378  324   38    490    100%      18.8%      100%          51.8%
                                   2800  15,603  291  228   46    578    100%      30.5%      100%          64.3%
                                   3000  26,917  243  176   38    638    99.8%     39.7%      94.8%         79.1%
                                   3200  39,633  165  123   26    670    83.8%     64.2%      68.4%         87.7%
DiWANN                             NA    931     218  0     142   710    97.5%     3.5%       92.3%         25.5%
This table summarizes clustering accuracy for the gold standard sequence similarity networks. Th. gives the threshold used for a given network: the number of edits, the distance from the maximum similarity score (for bitscore and Needleman-Wunsch), or the percent similarity score. C gives the total number of clusters, |C1| the number of nodes in clusters of size 1 (singletons), |C2| the number of nodes in clusters of size 2, and |C>2| the number of nodes in clusters of size 3 or more. For calculating precision and recall, we assume that clusters correspond to the family and superfamily (SF) labels provided with the dataset. Sequence lengths vary widely, from 100 to over 700 amino acids.
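
The columns described in this note can be illustrated with a small sketch. It treats clusters as connected components of a thresholded network and computes C, |C1|, |C2| and |C>2| as defined above. The note does not spell out how precision and recall were computed, so the sketch assumes a pairwise definition: precision is the fraction of same-cluster node pairs that share a label, and recall is the fraction of same-label pairs that land in the same cluster. The function names, the toy nodes, edges and labels are all hypothetical and are not taken from the paper or its dataset.

```python
from collections import Counter
from itertools import combinations


def connected_components(nodes, edges):
    """Group nodes into clusters using union-find over the edge list."""
    parent = {n: n for n in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for u, v in edges:
        parent[find(u)] = find(v)

    clusters = {}
    for n in nodes:
        clusters.setdefault(find(n), []).append(n)
    return list(clusters.values())


def size_summary(clusters):
    """Return (C, |C1|, |C2|, |C>2|): cluster count, then node counts by cluster size."""
    c1 = sum(len(c) for c in clusters if len(c) == 1)
    c2 = sum(len(c) for c in clusters if len(c) == 2)
    c_gt2 = sum(len(c) for c in clusters if len(c) > 2)
    return len(clusters), c1, c2, c_gt2


def pairwise_precision_recall(clusters, labels):
    """Pairwise precision and recall (an assumed definition, not confirmed by the source)."""
    same_cluster = sum(len(c) * (len(c) - 1) // 2 for c in clusters)
    label_counts = Counter(labels.values())
    same_label = sum(k * (k - 1) // 2 for k in label_counts.values())
    both = sum(
        1
        for c in clusters
        for a, b in combinations(c, 2)
        if labels[a] == labels[b]
    )
    precision = both / same_cluster if same_cluster else 1.0
    recall = both / same_label if same_label else 1.0
    return precision, recall


# Toy data (hypothetical): six sequences, three edges, two family labels.
nodes = ["a", "b", "c", "d", "e", "f"]
edges = [("a", "b"), ("b", "c"), ("d", "e")]
labels = {"a": "F1", "b": "F1", "c": "F2", "d": "F2", "e": "F2", "f": "F1"}

clusters = connected_components(nodes, edges)
summary = size_summary(clusters)
precision, recall = pairwise_precision_recall(clusters, labels)
print(summary)                 # (3, 1, 2, 3)
print(round(precision, 3))     # 0.5
print(round(recall, 3))        # 0.333
```

Under this assumed definition, a network with many small, pure clusters scores high precision and low recall, which matches the general pattern of the edit distance and Needleman-Wunsch rows at strict thresholds.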