A nearest-neighbors network model for sequence data reveals new insight into genotype distribution of a pathogen

BMC Bioinformatics

Table 1 Clustering accuracy for the GroEL networks

Graph (812 nodes)	Th.	Edges	C	|C₁|	|C₂|	|C_>2|	Genus prec.	Genus recall	Species prec.	Species recall
Similarity Score	5%	2886	371	246	122	444	43.0%	21.6%	34.9%	43.0%
	15%	4668	275	175	90	547	38.9%	22.2%	30.3%	40.0%
	25%	8222	182	122	44	646	33.0%	31.8%	22.4%	36.9%
	35%	12491	81	55	22	735	26.5%	18.6%	17.1%	28.0%
Bitscore (from max)	50	544	623	552	86	174	30.5%	21.7%	24.7%	51.3%
	100	2895	367	243	122	447	42.1%	20.2%	34.0%	42.0%
	200	4576	275	175	86	551	38.7%	21.7%	29.9%	42.1%
	300	9271	183	126	40	646	31.8%	26.2%	22.4%	33.8%
Edit Distance Threshold	8	2139	456	345	128	339	97.3%	33.4%	77.9%	58.3%
	16	2904	391	268	126	418	95.9%	35.3%	72.7%	59.1%
	30	4254	304	188	118	506	90.3%	47.3%	60.6%	63.2%
	42	5023	256	154	90	568	85.0%	51.9%	56.5%	64.4%
	54	6582	206	114	76	622	81.5%	58.3%	50.8%	66.6%
	60	7196	190	99	80	633	80.6%	62.6%	49.3%	66.8%
Needleman-Wunsch (from max score)	100	1780	482	386	110	316	42.2%	4.7%	30.8%	5.8%
	200	4691	280	175	98	539	87.0%	49.7%	59.4%	63.5%
	300	7733	183	96	80	636	79.1%	62.5%	48.0%	66.8%
DiWANN	NA	1055	180	0	118	694	80.4%	43.9%	59.5%	61.8%

This table summarizes clustering accuracy for the various GroEL networks. Th. gives the threshold used for a given network: the number of edits (edit distance), the distance from the maximum similarity score (bitscore and Needleman-Wunsch), or the percent similarity score. C gives the total number of clusters. |C₁|, |C₂|, and |C_>2| give the number of nodes in clusters of size 1 (singletons), size 2, and size 3 or more, respectively, so the three values in each row sum to the 812 nodes of the graph. Precision and recall are calculated under the assumption that clusters should correspond to the genus and species labels of the GroEL sequences. Each GroEL sequence is roughly 550 to 600 amino acids long.
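The column definitions above can be illustrated with a minimal Python sketch. It assumes each node carries a cluster identifier and a taxonomic label (genus or species). The function names `cluster_size_summary` and `pairwise_precision_recall` are hypothetical, and the pairwise definition of precision and recall used here is one common choice for scoring a clustering against labels; it is an assumption, not necessarily the definition used to produce Table 1.

```python
from collections import Counter
from itertools import combinations


def cluster_size_summary(cluster_ids):
    """Return C, |C1|, |C2| and |C>2| for one clustering.

    cluster_ids[i] is the cluster identifier of node i.
    """
    sizes = Counter(cluster_ids)  # cluster id -> number of nodes in it
    return {
        "C": len(sizes),                                    # total clusters
        "C1": sum(s for s in sizes.values() if s == 1),     # nodes in singletons
        "C2": sum(s for s in sizes.values() if s == 2),     # nodes in pairs
        "C>2": sum(s for s in sizes.values() if s > 2),     # nodes in larger clusters
    }


def pairwise_precision_recall(cluster_ids, labels):
    """Pairwise precision and recall (assumed definition, see lead-in).

    A node pair is a true positive if it shares both cluster and label,
    a false positive if it shares only the cluster, and a false negative
    if it shares only the label.
    """
    tp = fp = fn = 0
    for i, j in combinations(range(len(cluster_ids)), 2):
        same_cluster = cluster_ids[i] == cluster_ids[j]
        same_label = labels[i] == labels[j]
        if same_cluster and same_label:
            tp += 1
        elif same_cluster:
            fp += 1
        elif same_label:
            fn += 1
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


if __name__ == "__main__":
    # Toy data: six nodes, three clusters, three genus labels.
    clusters = [0, 0, 1, 2, 2, 2]
    genera = ["A", "A", "A", "B", "B", "C"]
    print(cluster_size_summary(clusters))
    print(pairwise_precision_recall(clusters, genera))
```

Under this pairwise definition, many small clusters tend to give high precision and low recall, while merging clusters tends to raise recall at the cost of precision. The pair loop is quadratic in the number of nodes, which is acceptable for a graph of 812 nodes.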

ISSN: 1471-2105