Table 16 The ten most important features related to difficult (D) and easy (E) classes measured by information gain

From: A detailed error analysis of 13 kernel methods for protein-protein interaction extraction

 

| Rank | Feature name (D) | ± (D) | IG (D) | Feature name (E) | ± (E) | IG (E) |
|---|---|---|---|---|---|---|
| 1 | sentence length (char) | | 0.0089 | label entropy in ST | + | 0.110 |
| 2 | label entropy in ST (SP) | | 0.0086 | sentence length (char) | + | 0.090 |
| 3 | dep frequency in DG | | 0.0079 | label entropy in DG | + | 0.089 |
| 4 | # of proteins in sentence | | 0.0078 | nn frequency in DG | | 0.081 |
| 5 | sentence length (word) | | 0.0069 | appos frequency in DG | | 0.079 |
| 6 | conj_and frequency in DG | | 0.0069 | conj_and frequency in DG | | 0.076 |
| 7 | prep_with frequency in DG | | 0.0066 | dep frequency in DG | | 0.073 |
| 8 | prep_with occurrence in DG | | 0.0066 | det frequency in DG | | 0.069 |
| 9 | nsubjpass frequency in DG | | 0.0059 | amod frequency in DG | | 0.063 |
| 10 | prep_in frequency in DG | | 0.0057 | dobj frequency in DG | | 0.062 |

  1. IG - information gain; ST - syntax tree; DG - dependency graph; SP - shortest path. Italic typesetting indicates parsing tree labels. The sign after each feature indicates positive/negative correlation.
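For readers unfamiliar with the IG column, the following is a minimal sketch of how information gain can be computed for one discrete feature against a class label. It is not the authors' implementation: the source does not say how numeric features such as sentence length were discretized, so the sketch assumes a feature that has already been binarized. The names `entropy`, `information_gain`, `long_sentence`, and `is_difficult`, as well as the toy data, are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels).values()
    return -sum((n / total) * math.log2(n / total) for n in counts)


def information_gain(feature_values, labels):
    """IG = H(class) - sum over feature values v of p(v) * H(class | v)."""
    total = len(labels)
    # Group the class labels by the feature value they co-occur with.
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


# Hypothetical toy data: 1 = sentence longer than some assumed threshold.
long_sentence = [1, 1, 1, 0, 0, 0, 1, 0]
# "D" = pair in the difficult class, "N" = not in it (assumed encoding).
is_difficult = ["D", "D", "N", "N", "N", "N", "D", "N"]

ig = information_gain(long_sentence, is_difficult)
print(f"information gain: {ig:.4f}")
```

A feature that separates the classes perfectly yields an IG equal to the class entropy, while a feature independent of the class yields an IG of zero. This is why the small absolute values in the D columns suggest that no single feature strongly identifies difficult pairs.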