A nearest-neighbors network model for sequence data reveals new insight into genotype distribution of a pathogen

BMC Bioinformatics

Table 2 Clustering accuracy for the gold standard data

Graph (852 nodes)	Th.	Edges	C	|C₁|	|C₂|	|C_>2|	SF prec.	SF recall	Family prec.	Family recall
Similarity Score	15%	1563	427	300	116	432	49.3%	2.2%	35.4%	4.7%
	25%	3057	335	223	88	537	46.4%	3.2%	33.0%	5.1%
	35%	5125	265	169	72	607	42.7%	5.2%	30.6%	7.0%
	45%	6689	239	153	64	631	41.5%	7.1%	30.4%	8.8%
	55%	7433	223	140	64	644	40.7%	6.8%	29.8%	8.1%
Bitscore (from max score)	1000	1808	507	436	64	348	44.3%	3.6%	32.5%	4.1%
	1100	4168	364	293	56	499	42.7%	5.2%	30.7%	5.8%
	1200	5607	300	223	50	575	42.1%	4.7%	30.6%	5.8%
Edit Distance Threshold	50	1134	448	312	122	418	100%	4.2%	99.3%	21.9%
	100	3453	310	192	86	574	100%	8.2%	98.6%	32.4%
	150	8726	206	107	74	671	99.0%	15.5%	95.8%	45.6%
	175	12,966	152	71	54	727	94.6%	25.6%	91.3%	57.7%
	200	18,097	115	48	38	766	93.6%	29.5%	88.3%	60.7%
Needleman-Wunsch (from max score)	2200	1896	638	588	32	232	100%	8.8%	100%	33.5%
	2400	5485	507	450	56	346	100%	17.9%	100%	48.5%
	2600	8231	378	324	38	490	100%	18.8%	100%	51.8%
	2800	15,603	291	228	46	578	100%	30.5%	100%	64.3%
	3000	26,917	243	176	38	638	99.8%	39.7%	94.8%	79.1%
	3200	39,633	165	123	26	670	83.8%	64.2%	68.4%	87.7%
DiWANN	NA	931	218	0	142	710	97.5%	3.5%	92.3%	25.5%

This table summarizes clustering accuracy for the gold standard sequence similarity networks. Th. is the threshold used to build a given network, expressed as a number of edits (edit distance), a distance from the maximum similarity score (bitscore and Needleman-Wunsch), or a percent similarity score. C is the total number of clusters, |C₁| is the number of nodes in clusters of size 1 (singletons), |C₂| is the number of nodes in clusters of size 2, and |C_>2| is the number of nodes in clusters of size 3 or more. SF denotes superfamily. Precision and recall are calculated under the assumption that clusters correspond to the family and superfamily labels provided with the dataset. Sequence lengths vary widely, from 100 to over 700 amino acids.
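
The caption does not state exactly how precision and recall were derived from the clusters and labels. The sketch below is an illustration only. It assumes that clusters are the connected components of a network, and it scores them with pairwise precision and recall, a common way to compare a clustering against reference labels but not necessarily the definition behind the table. All function names and the toy data are hypothetical.

```python
from collections import Counter
from itertools import combinations


def connected_components(nodes, edges):
    """Map each node to a representative of its connected component (union-find)."""
    parent = {n: n for n in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for u, v in edges:
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            parent[root_u] = root_v
    return {n: find(n) for n in nodes}


def size_summary(cluster_of):
    """Return (C, |C1|, |C2|, |C>2|): cluster count, then node counts by cluster size."""
    sizes = Counter(cluster_of.values())
    c1 = sum(s for s in sizes.values() if s == 1)
    c2 = sum(s for s in sizes.values() if s == 2)
    c_gt2 = sum(s for s in sizes.values() if s > 2)
    return len(sizes), c1, c2, c_gt2


def pairwise_precision_recall(cluster_of, label_of):
    """Pairwise scores (an assumed definition, not confirmed by the source).

    precision = pairs sharing a cluster AND a label / pairs sharing a cluster
    recall    = pairs sharing a cluster AND a label / pairs sharing a label
    """
    tp = fp = fn = 0
    for a, b in combinations(sorted(cluster_of), 2):
        same_cluster = cluster_of[a] == cluster_of[b]
        same_label = label_of[a] == label_of[b]
        if same_cluster and same_label:
            tp += 1
        elif same_cluster:
            fp += 1
        elif same_label:
            fn += 1
    # Assumed convention: report 1.0 when the denominator is empty.
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall


# Toy data (hypothetical): six sequences, three edges, two family labels.
nodes = ["a", "b", "c", "d", "e", "f"]
edges = [("a", "b"), ("b", "c"), ("d", "e")]
labels = {"a": "X", "b": "X", "c": "Y", "d": "Y", "e": "Y", "f": "X"}

clusters = connected_components(nodes, edges)
print(size_summary(clusters))
print(pairwise_precision_recall(clusters, labels))
```

Under this pairwise definition, a network that breaks each family into many small but pure clusters scores high precision and low recall, because few of the same-family pairs end up in the same cluster.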

ISSN: 1471-2105