How to make the most of NE dictionaries in statistical NER

BMC Bioinformatics

Table 2 Upper bound protein name recognition performance after ideal lexicon enrichment

Method	Match	Recall (R)	Precision (P)	F-score (F)
Tagging (+test set protein names)	Full	79.02	61.87	69.40
	Left	82.28	64.42	72.26
	Right	80.96	63.38	71.10
Labelling (+test set protein names)	Full	86.13	72.49	78.72
	Left	89.58	75.40	81.88
	Right	90.23	75.95	82.47

Upper-bound performance on the JNLPBA-2004 test set, obtained by enriching the lexicon with protein names appearing in the test set. NB: only the lexicon was modified; the tagging and sequential labelling models were not retrained using the test set. The first block shows the performance of POS/PROTEIN tagging after adding the protein names appearing in the test set to the dictionary. Because many protein names overlap with general English words, some protein names in sentences are still not recognized as protein names even after enrichment. The second block shows the performance of the sequence labelling based on the tagging output.
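The F column can be cross-checked against the R and P columns. The following is a minimal sketch, assuming F is the balanced F-score (the harmonic mean of recall and precision); the table itself does not state the formula. The names `f_score`, `TABLE_2`, and `check_rows` are illustrative only, and the tolerance of 0.02 is an assumption that allows for recomputing from values already rounded to two decimals.

```python
def f_score(recall, precision):
    """Balanced F-score (harmonic mean) of recall and precision, in percent."""
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)


# Rows copied from Table 2: (method, match, R, P, reported F)
TABLE_2 = [
    ("Tagging", "Full", 79.02, 61.87, 69.40),
    ("Tagging", "Left", 82.28, 64.42, 72.26),
    ("Tagging", "Right", 80.96, 63.38, 71.10),
    ("Labelling", "Full", 86.13, 72.49, 78.72),
    ("Labelling", "Left", 89.58, 75.40, 81.88),
    ("Labelling", "Right", 90.23, 75.95, 82.47),
]


def check_rows(rows, tol=0.02):
    """Return rows whose reported F differs from the recomputed F by more than tol."""
    mismatches = []
    for method, match, r, p, reported in rows:
        recomputed = f_score(r, p)
        if abs(recomputed - reported) > tol:
            mismatches.append((method, match, reported, recomputed))
    return mismatches


if __name__ == "__main__":
    for method, match, r, p, reported in TABLE_2:
        print(f"{method:10s} {match:6s} reported={reported:.2f} "
              f"recomputed={f_score(r, p):.2f}")
    print("mismatches:", check_rows(TABLE_2))
```

Under this assumption every reported F value agrees with the recomputed one to within the stated tolerance; small differences in the last digit are expected because R and P are themselves rounded.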

ISSN: 1471-2105