Table 16. The ten most important features related to the difficult (D) and easy (E) classes, measured by information gain

From: A detailed error analysis of 13 kernel methods for protein-protein interaction extraction

      Difficult (D)                           Easy (E)
Rank  Feature name                ±  IG       Feature name              ±  IG
1     sentence length (char)         0.0089   label entropy in ST       +  0.110
2     label entropy in ST (SP)       0.0086   sentence length (char)    +  0.090
3     dep frequency in DG            0.0079   label entropy in DG       +  0.089
4     # of proteins in sentence      0.0078   nn frequency in DG           0.081
5     sentence length (word)         0.0069   appos frequency in DG        0.079
6     conj_and frequency in DG       0.0069   conj_and frequency in DG     0.076
7     prep_with frequency in DG      0.0066   dep frequency in DG          0.073
8     prep_with occurrence in DG     0.0066   det frequency in DG          0.069
9     nsubjpass frequency in DG      0.0059   amod frequency in DG         0.063
10    prep_in frequency in DG        0.0057   dobj frequency in DG         0.062
IG: information gain; ST: syntax tree; DG: dependency graph; SP: shortest path. Italic typesetting indicates parse tree labels. The sign after each feature name indicates a positive or negative correlation.
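
As a rough illustration of the ranking criterion, the sketch below computes the information gain of a single discrete feature with respect to a binary class label, using the standard definition IG(Y; X) = H(Y) - H(Y | X). The table does not say how continuous features such as sentence length were discretized or which tool produced the scores, so the function names (`entropy`, `information_gain`) and the toy data are assumptions made for illustration and are not taken from the paper.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a non-empty sequence of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(feature_values, labels):
    """IG(Y; X) = H(Y) - sum over x of p(x) * H(Y | X = x), for a discrete feature X."""
    total = len(labels)
    # Group the class labels by the feature value they co-occur with.
    groups = {}
    for x, y in zip(feature_values, labels):
        groups.setdefault(x, []).append(y)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


# Invented toy data: a binarized "long sentence" feature and a flag marking
# whether each protein pair belongs to the difficult (D) class.
is_long = [1, 1, 1, 0, 0, 0, 1, 0]
is_difficult = [1, 1, 0, 0, 0, 0, 1, 0]

ig = information_gain(is_long, is_difficult)
print(f"information gain: {ig:.4f} bits")
```

Ranking features by this score and keeping the top ten for each class would give a table of the same shape as the one above. The sign column is a separate quantity: information gain itself is non-negative and does not encode the direction of the correlation.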