How to make the most of NE dictionaries in statistical NER
© Sasaki et al; licensee BioMed Central Ltd. 2008
Published: 19 November 2008
When term ambiguity and variability are very high, dictionary-based Named Entity Recognition (NER) is not an ideal solution even though large-scale terminological resources are available. Many studies on statistical NER have tried to cope with these problems. However, it is not straightforward to exploit existing and additional Named Entity (NE) dictionaries in statistical NER. Presumably, adding NEs to an NE dictionary leads to better performance; in reality, however, NER models have to be retrained to achieve this. We chose protein name recognition as a case study because it suffers most from heavy term variation and ambiguity.
We have established a novel way to improve the NER performance by adding NEs to an NE dictionary without retraining. In our approach, first, known NEs are identified in parallel with Part-of-Speech (POS) tagging based on a general word dictionary and an NE dictionary. Then, statistical NER is trained on the POS/PROTEIN tagger outputs with correct NE labels attached.
We evaluated the performance of our NER on the standard JNLPBA-2004 data set. The F-score on the test set improved from 73.14 to 73.78 after adding protein names appearing in the training data to the POS tagger dictionary without any model retraining. The performance further increased to 78.72 after enriching the tagging dictionary with test set protein names.
Our approach has demonstrated high performance in protein name recognition, which indicates how to make the most of known NEs in statistical NER.
The accumulation of online biomedical information has been growing at a rapid pace, driven mainly by the rapid growth of a wide range of repositories of biomedical data and literature. The automatic construction and update of scientific knowledge bases is a major research topic in Bioinformatics. One way of populating these knowledge bases is through named entity recognition (NER). Unfortunately, biomedical NER faces many problems, e.g., protein names are extremely difficult to recognize due to high ambiguity and variability. A further problem in protein name recognition arises at the tokenization stage. Some protein names include punctuation or special symbols, which may cause tokenization to lose some word concatenation information in the original sentence. For example, the protein name IL-2 and the mathematical expression IL – 2 both yield the token sequence IL – 2, because the dash (or hyphen) is usually designated as a token delimiter. In this sense, protein name recognition from a tokenized sequence is more challenging than that from text.
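The loss of concatenation information described above can be reproduced with a toy tokenizer. The following Python sketch is an illustration only: the `tokenize` function and its delimiter set are assumptions, not the tokenizer used to prepare the JNLPBA-2004 data.

```python
import re

def tokenize(text: str) -> list[str]:
    """Split on whitespace and emit every hyphen or en dash as its own token."""
    # Assumed delimiter set: ASCII hyphen and en dash (U+2013).
    return re.findall(r"[-\u2013]|[^\s\-\u2013]+", text)

# A protein name and a subtraction written with spaces around the hyphen.
protein_tokens = tokenize("IL-2")
expression_tokens = tokenize("IL - 2")

# Both inputs produce the same token sequence, so the original spacing
# (and with it the hint that "IL-2" is one name) can no longer be recovered.
print(protein_tokens, expression_tokens)
```

Once the hyphen is a delimiter, a downstream recognizer sees identical input for both cases and has to rely on context alone.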
Research into NER is centered around three approaches: dictionary-based, rule-based and machine learning-based approaches. To overcome the usual NER pitfalls, we have opted for a hybrid approach combining dictionary-based and machine learning approaches, which we call the dictionary-based statistical NER approach. After identifying protein names in text, we link these to semantic identifiers, such as UniProt accession numbers. In this paper, we focus on the evaluation of our dictionary-based statistical NER.
Our dictionary-based statistical approach consists of two components: dictionary-based POS/PROTEIN tagging and statistical sequential labelling. First, dictionary-based POS/PROTEIN tagging finds candidates for protein names using a dictionary. The dictionary maps strings to parts of speech (POS), where the POS tag-set is augmented with a tag NN-PROTEIN. Then, sequential labelling is applied to reduce false positives and false negatives in the POS/PROTEIN tagging results. Expandability is supported by allowing a user of the NER tool to improve NER coverage by adding NE entries to the dictionary. In our approach, retraining of models is not required after dictionary enrichment.
Recently, Conditional Random Fields (CRFs) have been successfully applied to sequence labelling problems, such as POS tagging and NER. The main idea of CRFs is to estimate a conditional probability distribution over label sequences, rather than over local directed label sequences as with Hidden Markov Models and Maximum Entropy Markov Models. Parameters of CRFs can be efficiently estimated through the log-likelihood parameter estimation using the forward-backward algorithm, a dynamic programming method.
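For reference, the conditional distribution mentioned above is usually written in the standard linear-chain form shown below. The notation is an assumption introduced here for illustration (x is the token sequence, y the label sequence, f_k are feature functions, lambda_k their weights, and Z(x) the normalizer computed with the forward-backward algorithm); the article itself does not spell out the formula.

```latex
p_{\lambda}(\mathbf{y} \mid \mathbf{x})
  = \frac{1}{Z(\mathbf{x})}
    \exp\left( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y_{t-1}, y_t, \mathbf{x}, t) \right),
\qquad
Z(\mathbf{x})
  = \sum_{\mathbf{y}'}
    \exp\left( \sum_{t=1}^{T} \sum_{k} \lambda_k \, f_k(y'_{t-1}, y'_t, \mathbf{x}, t) \right)
```

Because Z(x) sums over whole label sequences, the model is normalized globally rather than per position, which is the contrast with Maximum Entropy Markov Models drawn in the text.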
Training and test data
Experiments were conducted using the training and test sets of the JNLPBA-2004 data set.
The training data set used in JNLPBA-2004 is a set of tokenized sentences with manually annotated term class labels. The sentences are taken from the Genia corpus (version 3.02), in which 2,000 abstracts were manually annotated by a biologist, drawing on a set of POS tags and 36 biomedical term classes. In the JNLPBA-2004 shared task, performance in extracting five term classes, i.e., protein, DNA, RNA, cell line, and cell type classes, was evaluated.
The test data set used in JNLPBA-2004 is a set of tokenized sentences extracted from 404 separately collected MEDLINE abstracts, where the term class labels were manually assigned, following the annotation specification of the Genia corpus.
Overview of dictionary-based statistical NER
Dictionary-based POS/PROTEIN tagging
The dictionary-based approach is beneficial when a sentence contains some protein names that conflict with general English words. Otherwise, if the POS tags of sentences are decided without considering possible occurrences of protein names, POS sequences could be disrupted. For example, in "met proto-oncogene precursor", met might be falsely recognized as a verb by a non-dictionary-based tagger.
Given a sentence, the dictionary-based approach extracts protein names as follows. First, find all word sequences that match the lexical entries, and create a token graph (i.e., trellis) according to the word order. Next, estimate the score of every path using the weights of nodes and edges estimated by training using Conditional Random Fields. Finally, select the best path.
The features used were:
bigram of adjacent POS
bigram of adjacent PROTEIN
bigram of adjacent POS-PROTEIN
During the construction of the trellis, white space is considered as the delimiter unless otherwise stated within dictionary entries. This means that unknown tokens are character sequences without spaces.
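The trellis construction and best-path selection described above can be sketched as follows. This is a minimal Python illustration under stated assumptions: the toy `LEXICON`, the hand-set node costs, the `TRANSITION_COST` table and the `UNK` tag are all hypothetical stand-ins for the dictionary and the CRF-trained node and edge weights, and the path with the lowest total cost is treated as the best path.

```python
from collections import defaultdict

# Hypothetical toy dictionary: surface string -> list of (tag, node cost).
LEXICON = {
    "met": [("VBD", 1.0), ("NN-PROTEIN", 1.5)],
    "met proto-oncogene precursor": [("NN-PROTEIN", 1.0)],
    "proto-oncogene": [("NN", 1.0)],
    "precursor": [("NN", 1.0)],
}
TRANSITION_COST = {}   # hypothetical edge costs keyed by (previous tag, tag)
UNKNOWN = ("UNK", 3.0) # assumed tag and cost for whitespace-delimited unknown tokens
MAX_WORDS = 3          # longest dictionary entry considered, in words

def build_trellis(words):
    """Return nodes (start, end, surface, tag, cost) for every dictionary match."""
    nodes = []
    for i in range(len(words)):
        for j in range(i + 1, min(len(words), i + MAX_WORDS) + 1):
            surface = " ".join(words[i:j])
            for tag, cost in LEXICON.get(surface, []):
                nodes.append((i, j, surface, tag, cost))
        if words[i] not in LEXICON:
            # Unknown tokens are character sequences without spaces.
            nodes.append((i, i + 1, words[i], UNKNOWN[0], UNKNOWN[1]))
    return nodes

def best_path(sentence):
    """Viterbi search over the trellis; returns [(surface, tag), ...]."""
    words = sentence.split()
    # best[position][tag] = (total cost, (previous position, previous tag), node)
    best = defaultdict(dict)
    best[0]["BOS"] = (0.0, None, None)
    # Sorting by start position guarantees all paths into a position are final
    # before any node leaving that position is expanded.
    for node in sorted(build_trellis(words)):
        start, end, _, tag, cost = node
        for prev_tag, (prev_cost, _, _) in list(best[start].items()):
            total = prev_cost + cost + TRANSITION_COST.get((prev_tag, tag), 0.0)
            if tag not in best[end] or total < best[end][tag][0]:
                best[end][tag] = (total, (start, prev_tag), node)
    pos = len(words)
    tag = min(best[pos], key=lambda t: best[pos][t][0])
    path = []
    while pos > 0:
        _, back, node = best[pos][tag]
        path.append((node[2], node[3]))
        pos, tag = back
    return path[::-1]

print(best_path("met proto-oncogene precursor"))
```

With these toy costs the single multi-word protein entry is cheaper than tagging the three words separately, so "met" is not mistaken for a verb. Adding an entry to `LEXICON` changes the trellis immediately, without touching any weights.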
A dictionary-based approach requires the dictionary to cover not only a wide variety of biomedical terms but also entries with:
all possible capitalization
all possible linguistic inflections
We constructed a freely available, wide-coverage English word dictionary that satisfies these conditions. We did consider the MedPost POS tagger package, which contains a free dictionary of downcased English words; however, this dictionary is not well curated as a dictionary and the number of entries is limited to only 100,000, including inflections.
Therefore, we started by constructing an English word dictionary. Eventually, we created a dictionary with about 266,000 entries for English words (systematically covering inflections) and about 1.3 million entries for protein names.
We created the general English part of the dictionary from WordNet by semi-automatically adding POS tags. The POS tag set is a minor modification of the Penn Treebank POS tag set, in that protein names are given a new POS tag, NN-PROTEIN. Further details on construction of the dictionary now follow.
Protein names were extracted from the BioThesaurus. After selecting only those terms clearly stated as protein names, a total of 1,341,992 pairwise distinct protein names were added to the dictionary.
Nouns were extracted from WordNet's noun list. Words starting with lower case and upper case letters were tagged as NN and NNP, respectively. Nouns in the NNS and NNPS categories were collected from the results of POS tagging articles from the PLoS Biology journal with TreeTagger.
Verbs were extracted from WordNet's verb list. We manually curated VBD, VBN, VBG and VBZ verbs with irregular inflections based on WordNet. Next, VBN, VBD, VBG and VBZ forms of regular verbs were automatically generated from the WordNet verb list.
Adjectives were extracted from WordNet's adjective list. We manually curated JJ, JJR and JJS of irregular inflections of adjectives based on the WordNet irregular adjective list. Base form (JJ) and regular inflections (JJR, JJS) of adjectives were also created based on the list of adjectives.
Adverbs were extracted from WordNet's adverb list. Both the original and capitalized forms were added as RB.
Pronouns were manually curated. PRP and PRP$ words were added to the dictionary.
Wh-words were manually curated. As a result, WDT, WP, WP$ and WRB words were added to the dictionary.
Words for other parts of speech were manually curated.
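The automatic generation of regular verb forms mentioned above might look roughly like the following Python sketch. The spelling rules are a deliberately simplified assumption (for example, final consonant doubling as in "stop" and "stopped" is not handled), and `regular_verb_forms` is a hypothetical helper, not the procedure actually used to build the dictionary.

```python
VOWELS = set("aeiou")
SIBILANT_ENDINGS = ("s", "x", "z", "ch", "sh")

def regular_verb_forms(base: str) -> dict[str, str]:
    """Generate VBD, VBN, VBG and VBZ forms of a regular verb (simplified rules)."""
    if base.endswith("y") and len(base) > 1 and base[-2] not in VOWELS:
        # consonant + y: carry -> carried, carrying, carries
        past, gerund, third = base[:-1] + "ied", base + "ing", base[:-1] + "ies"
    elif base.endswith("e"):
        # final e: activate -> activated, activating, activates
        past, third = base + "d", base + "s"
        keep_e = base.endswith(("ee", "oe", "ye"))
        gerund = base + "ing" if keep_e else base[:-1] + "ing"
    elif base.endswith(SIBILANT_ENDINGS):
        # sibilant ending: express -> expressed, expressing, expresses
        past, gerund, third = base + "ed", base + "ing", base + "es"
    else:
        past, gerund, third = base + "ed", base + "ing", base + "s"
    # For regular verbs the past tense and past participle coincide.
    return {"VBD": past, "VBN": past, "VBG": gerund, "VBZ": third}

print(regular_verb_forms("activate"))
```

Irregular verbs fall outside these rules, which is why the article curates them manually.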
Statistical prediction of protein names
Statistical sequential labelling was employed to improve the coverage of protein name recognition and to remove false positives resulting from the previous stage (dictionary-based tagging).
We used the JNLPBA-2004 training data, which is a set of tokenized word sequences with IOB2 protein labels. As shown in Figure 2, POSs of tokens resulting from tagging and tokens of the JNLPBA-2004 data set are integrated to yield training data for sequential labelling. During integration, when the single token of a protein name found after tagging corresponds to a sequence of tokens from JNLPBA-2004, its POS is given as NN-PROTEIN1, NN-PROTEIN2,..., according to the corresponding token order in the JNLPBA-2004 sequence.
Following the data format of the JNLPBA-2004 training set, our training and test data use the IOB2 labels, which are "B-protein" for the first token of the target sequence, "I-protein" for each remaining token in the target sequence, and "O" for other tokens. For example, "Activation of the IL 2 precursor provides" is analyzed by the POS/PROTEIN tagger as follows.
IL 2 precursor NN-PROTEIN
The tagger output is given IOB2 labels as follows:
Activation NN O
of IN O
the DT O
IL NN-PROTEIN1 B-PROTEIN
2 NN-PROTEIN2 I-PROTEIN
precursor NN-PROTEIN3 I-PROTEIN
provides VVZ O
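The integration step illustrated by this example can be sketched in Python as follows. The function `integrate` and its input formats are assumptions made for illustration; in particular, the sketch assumes the tagger's multi-word entries split on whitespace into exactly the JNLPBA-2004 tokens, and it numbers every NN-PROTEIN token, including single-token names.

```python
def integrate(tagged, gold):
    """Merge tagger output with gold IOB2 labels.

    tagged: list of (surface, pos); a protein entry may span several tokens.
    gold:   list of (token, iob2_label) from the annotated corpus.
    Returns a list of (token, pos_feature, iob2_label) rows.
    """
    rows = []
    i = 0
    for surface, pos in tagged:
        for k, part in enumerate(surface.split(), start=1):
            token, label = gold[i]
            if token != part:
                raise ValueError(f"tokenization mismatch: {part!r} vs {token!r}")
            # Number the tokens of a dictionary protein name in order.
            feature = f"{pos}{k}" if pos == "NN-PROTEIN" else pos
            rows.append((token, feature, label))
            i += 1
    return rows

tagged = [("Activation", "NN"), ("of", "IN"), ("the", "DT"),
          ("IL 2 precursor", "NN-PROTEIN"), ("provides", "VVZ")]
gold = [("Activation", "O"), ("of", "O"), ("the", "O"), ("IL", "B-PROTEIN"),
        ("2", "I-PROTEIN"), ("precursor", "I-PROTEIN"), ("provides", "O")]

rows = integrate(tagged, gold)
for row in rows:
    print(*row)
```

The POS column is only a feature for the sequential labeller. The label column always comes from the annotation, so the labeller can learn when a dictionary match should be overridden.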
We used CRF models to predict the IOB2 labels. The following features were used in our experiments.
the first letter and the last four letters of the word form, in which capital letters in a word are normalized to "A", lower case letters are normalized to "a", and digits are replaced by "0", e.g., the word form of IL-2 is AA-0.
postfixes, the last two and four letters
The window size was set to ± 2 of the current token.
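The word-form normalization and suffix features listed above can be sketched in Python as follows. The feature names and the `token_features` helper are hypothetical, and applying every feature at each position in the ± 2 window is an assumption; the article does not specify the exact feature templates.

```python
import re

def word_shape(word: str) -> str:
    """Normalize capitals to 'A', lower case letters to 'a' and digits to '0'."""
    shape = re.sub(r"[A-Z]", "A", word)
    shape = re.sub(r"[a-z]", "a", shape)
    return re.sub(r"[0-9]", "0", shape)

def token_features(tokens, i, window=2):
    """Collect word, orthographic and suffix features in a +/- window around token i."""
    feats = {}
    for offset in range(-window, window + 1):
        j = i + offset
        if 0 <= j < len(tokens):
            word = tokens[j]
            shape = word_shape(word)
            feats[f"word[{offset}]"] = word
            feats[f"shape_first[{offset}]"] = shape[:1]   # first letter of the word form
            feats[f"shape_last4[{offset}]"] = shape[-4:]  # last four letters of the word form
            feats[f"suffix2[{offset}]"] = word[-2:]       # last two letters
            feats[f"suffix4[{offset}]"] = word[-4:]       # last four letters
    return feats

print(word_shape("IL-2"))
print(token_features(["the", "IL-2", "precursor"], 1))
```

Positions that fall outside the sentence are simply skipped here; a real feature extractor might emit explicit boundary symbols instead.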
Results and discussion
Protein name recognition performance
Protein name recognition performance
(a) POS/PROTEIN tagging
(b) Word feature
(c) (b) + orthographic feature
(d) (c) + POS feature
(e) (d) + PROTEIN feature
(f) (e) after adding protein names in the training set to the lexicon
The baseline for sequential labelling (row (b)) shows the prediction performance when using only word features, with no orthographic or POS features. The F-score of the baseline labelling method was 66.62. When the orthographic feature was added (row (c)), the F-score increased by 5.40 to 72.02. When the POS feature was added (row (d)), the F-score increased by 0.19 to 72.21. Using all features (row (e)), the F-score reached 73.14. Surprisingly, adding protein names appearing in the training data to the dictionary further improved the F-score by 0.64 to 73.78, which is a state-of-the-art performance in protein name recognition using the JNLPBA-2004 data set.
Tagging and labelling speeds were measured on an unloaded Linux server with quad 1.8 GHz Opteron cores and 16 GB of memory. The dictionary-based POS/PROTEIN tagger is very fast even though the dictionary contains more than one million entries. Tagging and sequential labelling of the 4,259 sentences of the test set took 0.3 sec and 7.3 sec, respectively, so recognizing protein names in the plain text of all 4,259 sentences took 7.6 sec in total.
The advantage of the dictionary-based statistical approach is that it is versatile, as the user can easily improve its performance with no retraining. We assume the following situation as the ideal case: suppose that a user needs to analyze a large amount of text with protein names. The user wants to know the maximum performance that our dictionary-based statistical recognizer can achieve when more protein names are added to the current dictionary. Note that protein names should be identified in context. Consequently, even with the ideal dictionary, recall of the NER results is not 100%: some protein names in the ideal dictionary are dropped during statistical tagging or labelling.
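For readers unfamiliar with how such F-scores are obtained, the following Python sketch computes precision, recall and F-score from IOB2 label sequences. It assumes exact-match scoring of whole protein spans and the usual balanced F-score; the helper names are hypothetical and this is not the official JNLPBA-2004 evaluation script.

```python
def spans(labels):
    """Return the set of (start, end) spans encoded by an IOB2 label sequence."""
    found, start = set(), None
    for i, label in enumerate(list(labels) + ["O"]):  # sentinel closes a final span
        if start is not None and not label.startswith("I-"):
            found.add((start, i))
            start = None
        if label.startswith("B-"):
            start = i
    return found

def precision_recall_f(gold_labels, predicted_labels):
    """Exact-match precision, recall and balanced F-score over spans."""
    gold, predicted = spans(gold_labels), spans(predicted_labels)
    correct = len(gold & predicted)
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)

gold = ["O", "B-protein", "I-protein", "O", "B-protein"]
pred = ["O", "B-protein", "I-protein", "O", "O"]
p, r, f = precision_recall_f(gold, pred)
print(p, r, f)
```

In this toy case one of the two gold names is found and nothing spurious is predicted, so precision is perfect while recall is one half.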
Upper bound protein name recognition performance after ideal lexicon enrichment
Tagging (+test set protein names)
Labelling (+test set protein names)
It is not possible in reality to train the recognizer on target data, i.e., the test set, but it would be possible for users to add discovered protein names to the dictionary so that they could improve the overall performance of the recognizer without retraining.
Rule-based and procedural approaches are taken in [16, 17]. Machine learning-based approaches are taken in [18–24]. Machine learning algorithms used in these studies are Naive Bayes, C4.5, Maximum Entropy Models, Support Vector Machines, and Conditional Random Fields. Most of these studies applied machine learning techniques to tokenized sentences.
Conventional results for protein name recognition
Tsai et al. 
Zhou and Su 
Kim and Yoon 
Okanohara et al. 
Finkel et al. 
Song et al. 
Park et al. 
Tsai et al. and Zhou and Su combined machine learning techniques and manual heuristics tailored for the data set. Tsai et al. applied CRFs to the JNLPBA-2004 data. After applying pattern-based post-processing, they achieved the best F-score (75.12) among those reported so far. Kim and Yoon also applied heuristic post-processing. Because these NER methods are domain dependent, when porting an NER system to a new domain (e.g., metabolite names), the developer of the NER system, not a user, currently needs to devise new post-processing heuristics for the new domain to outperform purely principled methods.
The GENIA Tagger is trained on the JNLPBA-2004 Corpus. Okanohara et al. employed semi-Markov CRFs whose performance was evaluated against the JNLPBA-2004 data set. Yamamoto et al. used SVMs for character-based protein name recognition and sequential labelling. Their protein name extraction performance was 69%. Finkel et al. employed MEMMs. Settles applied CRFs. Rössler and Park et al. applied SVMs while Song et al. applied both SVMs and CRFs.
protein, binding sites
common trans-acting factor
sequential labelling error
test set error
(the) receptor, (the) binding sites
coordination (and, or)
transcription factors NF-kappa B and AP-1
transcription factors NF-kappa B
nuclear factor kappa B complex
nuclear factor kappa B
protein tyrosine kinase(s)
protein tyrosine kinase
family name, binding site, and domain
T3 binding sites
sequential labelling error
test set error
Furthermore, thanks to the dictionary-based approach, ideal dictionary enrichment, without any retraining of the models, has been shown to improve the performance to an F-score of 78.72.
Conclusion and future work
This paper has demonstrated how to utilize known named entities to achieve better performance in statistical named entity recognition. We took a two-step approach where sentences are first tokenized and tagged based on a biomedical dictionary that consists of general English words and about 1.3 million protein names. Then, a statistical sequence labelling step predicted protein names that are not listed in the dictionary and, at the same time, reduced false positives in the POS/PROTEIN tagging results. The significant benefit of this approach is that a user, not a system developer, can easily enhance the performance by augmenting the dictionary. This paper demonstrated that our approach achieved a state-of-the-art F-score of 73.78 on the standard JNLPBA-2004 data set. Furthermore, in our dictionary-based statistical NER approach, the upper bound performance using ideal dictionary enrichment, without any retraining of the models, was estimated at an F-score of 78.72.
Our future work includes applying the dictionary-based statistical NER approach to other NE categories, such as metabolite names. Furthermore, it will be of great interest to theoretically and empirically analyze the effect of dictionary enrichment on performance improvement.
This research is partly supported by EC IST project FP6-028099 (BOOTStrep), whose Manchester team is hosted by the JISC/BBSRC/EPSRC sponsored National Centre for Text Mining.
This article has been published as part of BMC Bioinformatics Volume 9 Supplement 11, 2008: Proceedings of the BioNLP 08 ACL Workshop: Themes in biomedical language processing. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/9?issue=S11
- Ananiadou S, McNaught J (eds): Text Mining for Biology and Biomedicine. Artech House, London 2006.
- Lafferty J, McCallum A, Pereira F: Conditional random fields: probabilistic models for segmenting and labelling sequence data. Proceedings of the Eighteenth International Conference on Machine Learning (ICML-2001) 2001, 282–289.
- Baum LE, Petrie T: Statistical inference for probabilistic functions of finite state Markov chains. The Annals of Mathematical Statistics 1966, 37: 1554–1563.
- McCallum A, Freitag D, Pereira F: Maximum entropy Markov models for information extraction and segmentation. Proceedings of the Seventeenth International Conference on Machine Learning 2000, 591–598.
- Kim J-D, Ohta T, Tsuruoka Y, Tateisi Y: Introduction to the Bio-Entity Recognition Task at JNLPBA. Proceedings of the Joint Workshop on Natural Language Processing in Biomedicine and its Applications (JNLPBA-2004) 2004, 70–75.
- Kim JD, Ohta T, Tateisi Y, Tsujii J: GENIA corpus – semantically annotated corpus for bio-textmining. Bioinformatics 2003, 19: i180-i182.
- Kudo T, Yamamoto K, Matsumoto Y: Applying Conditional Random Fields to Japanese Morphological Analysis. Proceedings of Empirical Methods in Natural Language Processing (EMNLP) 2004, 230–237.
- Aoe J: An Efficient Digital Search Algorithm by Using a Double-Array Structure. IEEE Transactions on Software Engineering 1989, 15(9):1066–1077.
- Part-of-Speech Tagging Guidelines for the Penn Treebank Project[ftp://ftp.cis.upenn.edu/pub/treebank/doc/tagguide.ps.gz]
- Plos Biology Journals[http://biology.plosjournals.org]
- Tjong Kim Sang EF, Veenstra J: Representing Text Chunks. Proceedings of the Ninth Conference of the European Chapter of the Association for Computational Linguistics (E-99); Bergen, June 8–12, 1999, 173–179.
- Franzen K, Eriksson G, Olsson F, Asker L, Liden P, Koster J: Protein names and how to find them. International Journal of Medical Informatics 2002, 67: 49–61.
- Fukuda K, Tsunoda T, Tamura A, Takagi T: Toward information extraction: identifying protein names from biological papers. Proceedings of Pacific Symposium on Biocomputing 1998, 705–716.
- Collier N, Nobata C, Tsujii J: Extracting the Names of Genes and Gene Products with a Hidden Markov Model. Proceedings of the 18th International Conference on Computational Linguistics (COLING'2000); Saarbrucken 2000, 201–207.
- Kazama J, Makino T, Ohta Y, Tsujii J: Tuning Support Vector Machines for Biomedical Named Entity Recognition. Proceedings of ACL-2002 Workshop on Natural Language Processing in the Biomedical Domain 2002, 1–8.
- Lee KJ, Hwang YS, Rim HC: Two-Phase Biomedical NE Recognition based on SVMs. Proceedings of ACL 2003 Workshop on Natural Language Processing in Biomedicine; Sapporo 2003, 33–40.
- Okanohara D, Miyao Y, Tsuruoka Y, Tsujii J: Improving the Scalability of Semi-Markov Conditional Random Fields for Named Entity Recognition. Proceedings of the Forty-fourth Annual Meeting of the Association for Computational Linguistics (ACL-2006); Sydney 2006, 465–472.
- Tanabe L, Wilbur WJ: Tagging Gene and Protein Names in Biomedical Text. Bioinformatics 2002, 18: 1124–1132.
- Tsuruoka Y: GENIA Tagger 3.0.[http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/tagger]
- Yamamoto K, Kudo T, Konagaya A, Matsumoto Y: Use of morphological analysis in protein name recognition. Journal of Biomedical Informatics 2004, 471–482.
- Tsai TH, Sung CL, Dai HJ, Hung HC, Sung TY, Hsu WL: NERBio: using selected word conjunctions, term normalization, and global patterns to improve biomedical named entity recognition. BMC Bioinformatics 2006, 7(Suppl 5):S11.
- Zhou GD, Su J: Exploring Deep Knowledge Resources in Biomedical Name Recognition. Proceedings of the Joint Workshop on Natural Language Processing in Biomedicine and its Applications (JNLPBA-2004) 2004, 96–99.
- Kim S, Yoon J: Experimental Study on a Two Phase Method for Biomedical Named Entity Recognition. IEICE Transactions on Information and Systems 2007, E90-D(7):1103–1120.
- Finkel J, Dingare S, Nguyen H, Nissim M, Sinclair G, Manning C: Exploiting context for biomedical entity recognition: from syntax to the Web. Proceedings of the Joint Workshop on Natural Language Processing in Biomedicine and its Applications (JNLPBA-2004) 2004, 88–91.
- Settles B: Biomedical Named Entity Recognition Using Conditional Random Fields and Novel Feature Sets. Proceedings of the Joint Workshop on Natural Language Processing in Biomedicine and its Applications (JNLPBA-2004) 2004, 104–107.
- Rössler M: Adapting an NER-System for German to the Biomedical Domain. Proceedings of the Joint Workshop on Natural Language Processing in Biomedicine and its Applications (JNLPBA-2004) 2004, 92–95.
- Park K-M, Kim S-H, Lee D-G, Rim H-C: Boosting Lexical Knowledge for Biomedical Named Entity Recognition. Proceedings of the Joint Workshop on Natural Language Processing in Biomedicine and its Applications (JNLPBA-2004) 2004, 76–79.
- Song Y, Kim E, Lee GG, Yi B: POSBIOTM-NER in the shared task of BioNLP/NLPBA. Proceedings of the Joint Workshop on Natural Language Processing in Biomedicine and its Applications (JNLPBA-2004) 2004, 100–103.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.