Table 3 Details of crafting different window sizes on the validation set. Two types of duplication appeared in the data: Type 1, identical sequences and identical labels; Type 2, identical sequences but different labels.

From: Machine learning-based approaches for ubiquitination site prediction in human proteins

| Window size | Type 1: all | Type 1: positive | Type 1: negative | Type 2: all | Type 2: positive | Type 2: negative | After removing Type 1 and 2: all | After removing Type 1 and 2: positive | After removing Type 1 and 2: negative |
|---|---|---|---|---|---|---|---|---|---|
| 5 | 4861 | 47 | 4814 | 5492 | 398 | 5094 | 17,547 | 1589 | 15,958 |
| 7 | 172 | 6 | 166 | 180 | 10 | 170 | 20,506 | 1610 | 18,896 |
| 9 | 97 | 0 | 97 | 99 | 1 | 98 | 20,548 | 1613 | 18,935 |
| 15 | 57 | 0 | 57 | 59 | 1 | 58 | 20,571 | 1613 | 18,958 |
| 21 | 46 | 0 | 46 | 48 | 1 | 47 | 20,578 | 1613 | 18,965 |
| 27 | 42 | 0 | 42 | 44 | 1 | 43 | 20,580 | 1613 | 18,967 |
| 33 | 32 | 0 | 32 | 34 | 1 | 33 | 20,585 | 1613 | 18,972 |
| 45 | 20 | 0 | 20 | 20 | 0 | 20 | 20,592 | 1613 | 18,979 |
| 55 | 12 | 0 | 12 | 12 | 0 | 12 | 20,596 | 1613 | 18,983 |
| 77 | 8 | 0 | 8 | 8 | 0 | 8 | 20,598 | 1613 | 18,985 |
| 99 | 2 | 0 | 2 | 2 | 0 | 2 | 20,601 | 1613 | 18,988 |
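
The two duplication types named in the caption can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the function names `extract_window` and `classify_duplicates`, the padding character `X`, and the removal policy (keep one copy of each Type 1 group, drop every sample in a Type 2 group) are all assumptions made for illustration, and the counting convention here may differ from the one used to produce the table.

```python
from collections import defaultdict


def extract_window(protein, center, size, pad="X"):
    """Return the `size`-residue window centered on index `center` (0-based).

    Positions beyond either end of the protein are filled with `pad`
    (a hypothetical padding symbol). `size` is expected to be odd.
    """
    half = size // 2
    padded = pad * half + protein + pad * half
    # After padding, residue `center` sits at index center + half,
    # so the window starts at index `center`.
    return padded[center:center + size]


def classify_duplicates(samples):
    """Count duplicate windows in `samples`, an iterable of (sequence, label).

    Returns (type1, type2, kept):
      type1: redundant copies sharing both sequence and label
      type2: samples whose sequence occurs with conflicting labels
      kept:  de-duplicated samples under the assumed removal policy
    """
    labels_by_seq = defaultdict(list)
    for seq, label in samples:
        labels_by_seq[seq].append(label)

    type1 = 0
    type2 = 0
    kept = []
    for seq, labels in labels_by_seq.items():
        if len(set(labels)) > 1:
            # Same sequence, different labels: treated as ambiguous and dropped.
            type2 += len(labels)
        else:
            # Same sequence, same label: keep a single representative.
            type1 += len(labels) - 1
            kept.append((seq, labels[0]))
    return type1, type2, kept


if __name__ == "__main__":
    toy = [("AKC", 0), ("AKC", 0), ("GKT", 1), ("GKT", 0), ("LKM", 1)]
    print(classify_duplicates(toy))
```

Shorter windows carry less sequence context, so distinct sites are more likely to produce the same window; this matches the trend in the table, where both duplicate counts fall as the window size grows.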