Feature engineering for MEDLINE citation categorization with MeSH

BMC Bioinformatics

Table 2 Feature comparison over all results

Feature	SVMLight	SVM-perf	AdaBoostM1	Ada Over
Unigram	0.418	0.492 †	0.420	0.471 †
Bigram	0.406	0.513* †	0.420	0.477* †
Argumentative	0.403	0.479 †	0.415	0.464 †
Noun phrases	0.222	0.329 †	0.222	0.271 †
Concepts	0.409	0.497* †	0.427	0.480* †
CUIs	0.398	0.496	0.422	0.475 †
MTI predictions	0.513*	0.531* †	0.478*	0.501* †
MTI MMI	0.398	0.454 †	0.367	0.382 †
MTI PRC	0.481*	0.502 †	0.430	0.453 †
First level taxonomy	0.300	0.456 †	0.351	0.429 †
Second level taxonomy	0.222	0.424 †	0.329	0.393 †
Third level taxonomy	0.173	0.383	0.285	0.341 †
Journal	0.115	0.193 †	0.126	0.208 †
Affiliation	0.046	0.064	0.045	0.044 †
Author	0.062	0.137 †	0.081	0.084 †

Results are reported as F-measure, using a binary representation of features. Four learning algorithms are compared: SVMLight, SVM-perf, AdaBoostM1, and AdaBoostM1 with oversampling of positive instances (Ada Over). Within each column, results significantly better than unigram (p < 0.05) are marked with *. Within each pair of methods (SVMLight/SVM-perf and AdaBoostM1/Ada Over), statistically significant differences are marked with †.
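To make the two terms in the caption concrete, the following is a minimal sketch of a binary feature representation and of the F-measure for a single category. It assumes the balanced F-measure (F1, the harmonic mean of precision and recall); the caption does not state the weighting or how scores are averaged across MeSH headings, so that part is not modelled here. The function names `binary_features` and `f_measure`, and the toy labels, are illustrative and do not come from the article.

```python
def binary_features(tokens):
    """Binary representation: a feature is either present or absent,
    so repeated occurrences of a token collapse to one entry."""
    return set(tokens)


def f_measure(gold, predicted):
    """Balanced F-measure (F1) for one category.

    gold and predicted are parallel lists of 0/1 labels, one per citation.
    Assumption: beta = 1, i.e. precision and recall are weighted equally.
    """
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == 1 and p == 1)  # true positives
    fp = sum(1 for g, p in pairs if g == 0 and p == 1)  # false positives
    fn = sum(1 for g, p in pairs if g == 1 and p == 0)  # false negatives
    if tp == 0:
        return 0.0  # avoids division by zero when nothing is correct
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy data (hypothetical): five citations, one MeSH heading.
features = binary_features(["cell", "cell", "gene"])
gold = [1, 0, 1, 1, 0]
pred = [1, 1, 0, 1, 0]
score = f_measure(gold, pred)
```

In this toy case there are two true positives, one false positive and one false negative, so precision and recall are both 2/3 and the F-measure is also 2/3.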

ISSN: 1471-2105