BMC Bioinformatics

Table 16. The ten most important features for the difficult (D) and easy (E) classes, ranked by information gain

From: A detailed error analysis of 13 kernel methods for protein-protein interaction extraction

	Difficult (D)			Easy (E)
Rank	Feature name	±	IG	Feature name	±	IG
1	sentence length (char)	−	0.0089	label entropy in ST	+	0.110
2	label entropy in ST (SP)	−	0.0086	sentence length (char)	+	0.090
3	dep frequency in DG	−	0.0079	label entropy in DG	+	0.089
4	# of proteins in sentence	−	0.0078	nn frequency in DG	−	0.081
5	sentence length (word)	−	0.0069	appos frequency in DG	−	0.079
6	conj_and frequency in DG	−	0.0069	conj_and frequency in DG	−	0.076
7	prep_with frequency in DG	−	0.0066	dep frequency in DG	−	0.073
8	prep_with occurrence in DG	−	0.0066	det frequency in DG	−	0.069
9	nsubjpass frequency in DG	−	0.0059	amod frequency in DG	−	0.063
10	prep_in frequency in DG	−	0.0057	dobj frequency in DG	−	0.062

IG - information gain; ST - syntax tree; DG - dependency graph; SP - shortest path. Italic typesetting indicates parse tree labels. The sign after each feature indicates whether the feature correlates positively (+) or negatively (−) with the class.
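
For readers unfamiliar with the ranking criterion, the following is a minimal sketch of how information gain can be computed for one discrete feature against a class label. It is an illustration only: the function names, the binarised "long sentence" feature, and the toy data are assumptions made here, and the table does not state how the authors discretised continuous features such as sentence length. Note also that information gain is unsigned, so the ± column must come from a separate correlation measure.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(feature_values, labels):
    """IG = H(class) - H(class | feature) for a discrete feature."""
    total = len(labels)
    # Group the class labels by the feature value they co-occur with.
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    # Weighted average of the class entropy within each feature value.
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


# Toy data invented for illustration (not taken from the paper):
# a hypothetical binary feature and a "difficult vs. other" class label.
long_sentence = [True, True, True, False, False, False, True, False]
pair_class = ["D", "D", "other", "other", "other", "other", "D", "other"]

ig = information_gain(long_sentence, pair_class)
print(f"information gain: {ig:.4f} bits")
```

A feature that is constant across all examples yields an information gain of zero, while a feature that perfectly separates the classes yields the full class entropy; the small values in the Difficult (D) columns therefore indicate that no single feature explains much of that class on its own.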


ISSN: 1471-2105
