Table 1 Clustering accuracy for the GroEL networks

From: A nearest-neighbors network model for sequence data reveals new insight into genotype distribution of a pathogen

Graph (812 nodes)                  Th.  Edges   C    |C1|  |C2|  |C>2|  Genus prec.  Genus recall  Species prec.  Species recall
Similarity Score                   5%   2886    371  246   122   444    43.0%        21.6%         34.9%          43.0%
Similarity Score                   15%  4668    275  175   90    547    38.9%        22.2%         30.3%          40.0%
Similarity Score                   25%  8222    182  122   44    646    33.0%        31.8%         22.4%          36.9%
Similarity Score                   35%  12,491  81   55    22    735    26.5%        18.6%         17.1%          28.0%
Bitscore (from max)                50   544     623  552   86    174    30.5%        21.7%         24.7%          51.3%
Bitscore (from max)                100  2895    367  243   122   447    42.1%        20.2%         34.0%          42.0%
Bitscore (from max)                200  4576    275  175   86    551    38.7%        21.7%         29.9%          42.1%
Bitscore (from max)                300  9271    183  126   40    646    31.8%        26.2%         22.4%          33.8%
Edit Distance Threshold            8    2139    456  345   128   339    97.3%        33.4%         77.9%          58.3%
Edit Distance Threshold            16   2904    391  268   126   418    95.9%        35.3%         72.7%          59.1%
Edit Distance Threshold            30   4254    304  188   118   506    90.3%        47.3%         60.6%          63.2%
Edit Distance Threshold            42   5023    256  154   90    568    85.0%        51.9%         56.5%          64.4%
Edit Distance Threshold            54   6582    206  114   76    622    81.5%        58.3%         50.8%          66.6%
Edit Distance Threshold            60   7196    190  99    80    633    80.6%        62.6%         49.3%          66.8%
Needleman-Wunsch (from max score)  100  1780    482  386   110   316    42.2%        4.7%          30.8%          5.8%
Needleman-Wunsch (from max score)  200  4691    280  175   98    539    87.0%        49.7%         59.4%          63.5%
Needleman-Wunsch (from max score)  300  7733    183  96    80    636    79.1%        62.5%         48.0%          66.8%
DiWANN                             NA   1055    180  0     118   694    80.4%        43.9%         59.5%          61.8%
This table summarizes clustering accuracy for the various GroEL networks. Th. gives the threshold used for a given network: a number of edits (edit distance), a distance from the maximum similarity score (bitscore and Needleman-Wunsch), or a percent similarity score. C gives the total number of clusters, |C1| the number of nodes in clusters of size 1 (singletons), |C2| the number of nodes in clusters of size 2, and |C>2| the number of nodes in clusters of size 3 or more. For calculating precision and recall, we assume clusters should correspond to the genus and species labels of a given GroEL sequence. Each GroEL sequence is between roughly 550 and 600 amino acids long.
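
The cluster-size columns can be reproduced from any node-to-cluster assignment, and in every row of the table |C1| + |C2| + |C>2| adds up to the 812 nodes. The Python sketch below shows that bookkeeping on a toy example. The footnote does not say how precision and recall were computed, so the pairwise definition used here (over pairs of sequences that share a cluster or share a label) is an assumption, not necessarily the authors' method. The function names and toy data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def cluster_size_summary(cluster_of):
    """Summarize a clustering given a dict mapping node id -> cluster id."""
    sizes = Counter(cluster_of.values())  # cluster id -> number of nodes in it
    return {
        "C": len(sizes),                                        # total clusters
        "C1": sum(s for s in sizes.values() if s == 1),         # nodes in singletons
        "C2": sum(s for s in sizes.values() if s == 2),         # nodes in pairs
        "C_gt2": sum(s for s in sizes.values() if s > 2),       # nodes in larger clusters
    }


def pairwise_precision_recall(cluster_of, label_of):
    """Pairwise precision and recall (an assumed definition, see lead-in).

    precision = pairs sharing cluster AND label / pairs sharing a cluster
    recall    = pairs sharing cluster AND label / pairs sharing a label
    """
    same_cluster = same_label = both = 0
    for a, b in combinations(sorted(cluster_of), 2):
        in_cluster = cluster_of[a] == cluster_of[b]
        in_label = label_of[a] == label_of[b]
        same_cluster += in_cluster
        same_label += in_label
        both += in_cluster and in_label
    precision = both / same_cluster if same_cluster else 0.0
    recall = both / same_label if same_label else 0.0
    return precision, recall


# Toy data: six sequences, three clusters, three genus labels (all made up).
clusters = {"a": 0, "b": 0, "c": 0, "d": 1, "e": 1, "f": 2}
genera = {"a": "X", "b": "X", "c": "Y", "d": "Y", "e": "Y", "f": "Z"}

summary = cluster_size_summary(clusters)
precision, recall = pairwise_precision_recall(clusters, genera)
print(summary)
print(precision, recall)
```

Under this pairwise definition, lowering a threshold so that clusters merge tends to raise recall and lower precision, which matches the direction of the trend in the edit distance rows.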