Table 3 The distribution of the size of the rules learnt, i.e. the number of predicates used in each rule in a rule set. In the HIall setting, the most common predicates in rules with only one predicate were references to databases, followed by SWISS-PROT description arguments and keywords. As the number of predicates per rule increases in this setting, predicates based on pure amino acid distributions and predicted secondary structure become more dominant. In the HIseq setting, a similar shift from predicates involving amino acid distributions towards predicted secondary structure predicates was observed. Rules with more than eight predicates are based solely on secondary structure.

From: Homology Induction: the use of machine learning to improve sequence similarity searches

| Number of predicates used in each rule | HIall | HIseq |
|---|---|---|
| 1 | 1030 | 169 |
| 2 | 369 | 940 |
| 3 | 314 | 810 |
| 4 | 121 | 340 |
| 5 | 14 | 86 |
| 6 | 1 | 18 |
| 7 | 1 | 6 |
| 8 | 1 | 1 |
| 9 | 0 | 1 |
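
The shift in rule size between the two settings can be summarised numerically. The following is a minimal sketch, not part of the original analysis. It assumes that each cell of Table 3 is a count of rules of that size, and the names `HI_ALL`, `HI_SEQ`, `total_rules`, `mean_rule_size` and `share_of_size` are hypothetical helpers introduced here for illustration.

```python
# Rule counts copied from Table 3: {number of predicates: number of rules}.
# Assumption: each table cell is a count of learnt rules of that size.
HI_ALL = {1: 1030, 2: 369, 3: 314, 4: 121, 5: 14, 6: 1, 7: 1, 8: 1, 9: 0}
HI_SEQ = {1: 169, 2: 940, 3: 810, 4: 340, 5: 86, 6: 18, 7: 6, 8: 1, 9: 1}


def total_rules(dist):
    """Total number of rules in a size distribution."""
    return sum(dist.values())


def mean_rule_size(dist):
    """Average number of predicates per rule, weighted by rule count."""
    weighted = sum(size * count for size, count in dist.items())
    return weighted / total_rules(dist)


def share_of_size(dist, size):
    """Fraction of all rules that use exactly `size` predicates."""
    return dist.get(size, 0) / total_rules(dist)


if __name__ == "__main__":
    for name, dist in (("HIall", HI_ALL), ("HIseq", HI_SEQ)):
        print(
            f"{name}: {total_rules(dist)} rules, "
            f"mean size {mean_rule_size(dist):.2f}, "
            f"single-predicate share {share_of_size(dist, 1):.1%}"
        )
```

Under these assumptions, the summary makes the contrast in the caption concrete: single-predicate rules account for more than half of the HIall rules, whereas the HIseq rules are concentrated at two and three predicates.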