Machine learning-based approaches for ubiquitination site prediction in human proteins

BMC Bioinformatics

Table 2 Details of crafting different window sizes on the training set

Window size	Type 1 (duplicate sequences and labels), all	Type 1, positive samples	Type 1, negative samples	Type 2 (duplicate sequences), all	Type 2, positive samples	Type 2, negative samples	After removing types 1 and 2, all	After removing types 1 and 2, positive samples	After removing types 1 and 2, negative samples
5	143,420	3728	139,692	154,835	12,125	142,710	84,861	13,122	71,739
7	6556	213	6343	7150	517	6633	191,493	15,066	176,427
9	2619	134	2485	2728	191	2537	193,891	15,106	178,785
15	1582	87	1495	1623	109	1514	194,544	15,133	179,411
21	1336	72	1264	1361	85	1276	194,699	15,141	179,558
27	1212	60	1152	1233	71	1162	194,776	15,147	179,629
33	1152	54	1098	1136	59	1077	194,829	15,151	179,678
45	1025	48	977	1034	53	981	194,888	15,153	179,735
55	969	44	925	978	49	929	194,920	15,155	179,765
77	854	37	817	863	42	821	194,992	15,159	179,833
99	771	34	737	776	39	737	195,047	15,161	179,886

Two types of duplication appeared in the data. Type 1: identical sequences and labels; Type 2: identical sequences but different labels
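
The two duplication types can be illustrated with a minimal Python sketch. This is not the authors' code. It assumes that each sample is a fixed-size window centred on a lysine (K), that windows near the sequence ends are padded with a placeholder residue "X", that one copy of each Type 1 group is kept, and that all Type 2 samples are dropped. The function names `extract_windows` and `split_duplicates` are hypothetical, and the exact counting and removal policy behind Table 2 may differ.

```python
from collections import defaultdict


def extract_windows(sequence, positive_sites, window_size, pad="X"):
    """Yield (window, label) for every lysine (K) in `sequence`.

    `positive_sites` holds 0-based indices of annotated ubiquitination
    sites. The padding character "X" is an assumption of this sketch.
    """
    half = window_size // 2
    padded = pad * half + sequence + pad * half
    for i, residue in enumerate(sequence):
        if residue == "K":
            # In `padded`, the lysine sits at index i + half, so this
            # slice places it at the centre of the window.
            window = padded[i:i + window_size]
            yield window, int(i in positive_sites)


def split_duplicates(samples):
    """Separate samples into Type 1 duplicates, Type 2 duplicates, and kept.

    Type 1: surplus copies of a window that always carries the same label.
    Type 2: every sample whose window occurs with conflicting labels.
    Keeping one copy per Type 1 group is an assumption of this sketch.
    """
    labels_by_window = defaultdict(list)
    for window, label in samples:
        labels_by_window[window].append(label)

    type1, type2, kept = [], [], []
    for window, labels in labels_by_window.items():
        if len(set(labels)) > 1:
            type2.extend((window, label) for label in labels)
        else:
            kept.append((window, labels[0]))
            type1.extend((window, labels[0]) for _ in labels[1:])
    return type1, type2, kept


# Toy example: the same peptide yields conflicting windows at size 3
# but distinct windows at size 5.
for size in (3, 5):
    samples = list(extract_windows("AKAKA", {1}, size))
    type1, type2, kept = split_duplicates(samples)
    print(size, len(type1), len(type2), len(kept))
```

In this toy case the two lysines share the window "AKA" at size 3 but carry different labels, so both count as Type 2. At size 5 the windows differ and both samples are kept, which mirrors the trend in Table 2 that larger windows produce fewer duplicates.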

ISSN: 1471-2105