Volume 12 Supplement 2
Word sense disambiguation for event trigger word detection in biomedicine
© Mart`Ltd. 2011
Published: 29 March 2011
This paper describes a method for detecting event trigger words in biomedical text based on a word sense disambiguation (WSD) approach. We first investigate the applicability of existing WSD techniques to trigger word disambiguation in the BioNLP 2009 shared task data, and find that we are able to outperform a traditional CRF-based approach for certain word types. On the basis of this finding, we combine the WSD approach with the CRF, and obtain significant improvements over the standalone CRF, gaining particularly in recall.
In recent years, the biomedical text processing field has created many annotated resources to further the development of automatic text analysis methods through standardised evaluation. Following the example of TREC and other efforts, the goal is to develop datasets over which different approaches can be tested and compared. There is considerable diversity in the types of resources that have been produced, from TREC-style document relevance scores to the semantic annotation of all terms in a set of documents (for entities and events of interest).
(1) The [TRANS changes] in the mRNA levels of these protooncogenes...
(2) The human platelet-activating factor receptor (PAFR) gene is [TRANS transcribed] by...
We can see that two instances have been tagged with the event TRANS (short for TRANSCRIPTION), which refers to the process of creating an equivalent RNA copy of a sequence of DNA. This event is associated with different surface forms in the text: the noun changes in (1) and the verb transcribed in (2). These types of wide-ranging lexical variations are common across all event types in the BioNLP task, and make both the annotation and the task challenging, with the best system in the shared task obtaining just above 50% F-score in the basic event structure recognition task (task 1).
When studying the BioNLP 2009 dataset, we observed that there is a set of high-frequency word types that tend to occur over many event types. This suggests the possibility of building a separate model for each of these word types, as is done in word sense disambiguation (WSD). This technique has not been tested by previous trigger-word detection systems, where the approach is to build event-centred systems, or rely on hand-made dictionaries. The appeal of WSD is that it has been studied widely, and has been shown to perform well under certain conditions: shared tasks like SemEval have illustrated that WSD systems can perform above 70% accuracy for fine-grained sense inventories; and the performance over NLM-WSD, a collection tailored to the biomedical domain, is close to 90% accuracy. Should we be able to replicate these levels of accuracy over the BioNLP 2009 data, this has the potential to boost overall trigger word detection performance.
Another motivation for combining biomedical datasets and WSD is that the WSD community is constantly looking for new semantically-annotated data over which to test the portability of their systems. There are only a few examples of domain-specific corpora annotated with sense information, and they are costly to produce. For instance, the NLM-WSD collection was constructed using 11 annotators, and has been used extensively for biomedical WSD experiments. An alternative could be to adapt the semantic annotation of BioNLP and other related biomedical datasets (e.g. BioCreative) in order to generate more testbeds.
To summarise, our main hypothesis in this paper is that biomedical term-annotation tasks can benefit from WSD methods, which we will empirically test over the BioNLP event extraction dataset, where the correct identification of “trigger words” is a crucial component of the overall problem. We will adapt this dataset into a WSD-style collection, and evaluate the performance over a sample of word types that have particular properties, such as having enough training examples and a class distribution that is not overly skewed. We will analyse the raw performance scores to see if we can achieve performance similar to that achieved over other WSD problems, and we will also study the dataset itself, by measuring the strength of the “one sense per collocation” heuristic in relation to other WSD datasets.
The primary findings of this paper are: (a) WSD can indeed outperform sequential tagging techniques over high-frequency terms with relatively low skew; and (b) the overall performance of sequential tagging methods is boosted when we selectively include predictions from our WSD model.
This article is organised as follows. We first describe the background of our research. We then introduce our experimental setting, and perform an analysis of feature types. After this study, we present our main experimental results. We then discuss our experiments further and analyse the errors, and finally present our conclusions and future directions.
List of target events for BioNLP 2009
In the original BioNLP 2009 shared task, most systems relied on at least two separate modules: trigger detection and event construction. Trigger detection involves the identification of event triggers and their type, while event construction associates event triggers with their arguments. For our experiments, we will focus on the trigger detection task, in order to simplify the analysis and comparison of different methods. The systems that participated in the BioNLP shared task addressed the trigger detection subtask by relying on hand-made dictionaries, sequential classifiers, or class-specific models, rather than on word type models as we do in this paper. The top-ranked system in the 2009 BioNLP shared task was developed by a team from the University of Turku. Their pipeline consists of three main steps: trigger detection, argument assignment, and semantic post-processing. For trigger detection they treat each token as a separate classification problem, and train SVMs for each event type. They rely on a rich set of features, including a dictionary built from the training data, and syntactic dependencies. Their overall task-1 system achieved an F-score of 52% (with 47% recall and 58% precision) by relying on separate SVM classifiers for trigger detection and argument assignment. However, the performance over the trigger detection step in isolation was not reported.
The second-ranked system for the task was built by a team from the University of Jena. Their main architecture also had an independent module for trigger detection, which relied heavily on hand-curated dictionaries, built from the GENIA event corpus and other sources. Their work required manual effort to pre-identify the predictive power of candidate trigger words for each event type, and they relied on this information to build dependency graphs that were refined in subsequent steps. The recall of their system was similar to the top ranked system, but the precision was much lower. Again, their performance for trigger word detection is not known.
We turn to the WSD literature to see if these techniques can contribute to the trigger detection task. There is a large literature on WSD; see McCarthy and Navigli for recent overviews. The most successful approaches are supervised systems that build a separate model for each word type and POS, learning only from contexts that include it. The motivation behind this approach is the “one sense per collocation” heuristic, which observes that the meaning of a given word in a particular collocation tends to be invariant across all token occurrences.
One method that has been shown to perform consistently well over open-domain WSD and biomedical WSD is the Vector Space Model (VSM). It achieved high performance over the biomedical NLM-WSD collection (close to 90% accuracy), and it has also been applied to the Senseval-3 English Lexical Sample dataset, where it ranked among the top systems with an accuracy of 72%. This classifier can accommodate a wide range of features, from local dependencies to MeSH terms.
We designed an experiment to integrate a WSD module into an event-extraction system based on the BioNLP 2009 shared task data. We first describe the dataset used in this experiment, then the different systems tested.
In order to build a WSD collection, we first identified the candidate target words in the BioNLP data that are most likely to benefit from WSD, in terms of both WSD having a good chance of performing well over them, and there being enough token instances that, when fed back into a larger system, the predictions can potentially have a significant impact. Each word occurrence in the data will have one of the 9 trigger-word event classes or the non-event class (a total of 10 classes). Candidate word types for WSD are those which have high frequency and occur with different event classes. We rely on the GENIA tagger for tokenisation and POS-tagging, and we group word occurrences by lemma and POS.
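The selection step described above can be sketched as follows. This is only an illustration under stated assumptions: the function name `select_target_words`, the thresholds `min_freq` and `min_classes`, and the toy occurrences are hypothetical, since the actual cut-offs used to choose the target words are not given here.

```python
from collections import Counter, defaultdict

def select_target_words(occurrences, min_freq=3, min_classes=2):
    """Group (lemma, pos, label) occurrences by word type and keep the
    frequent, ambiguous ones. Thresholds are illustrative assumptions."""
    by_type = defaultdict(Counter)
    for lemma, pos, label in occurrences:
        by_type[(lemma, pos)][label] += 1

    targets = {}
    for word_type, label_counts in by_type.items():
        total = sum(label_counts.values())
        # Skip rare word types and word types seen with a single class only.
        if total < min_freq or len(label_counts) < min_classes:
            continue
        top_label, top_count = label_counts.most_common(1)[0]
        targets[word_type] = {
            "instances": total,
            "classes": len(label_counts),
            "top_class": top_label,
            "top_class_pct": 100.0 * top_count / total,
        }
    return targets

# Toy data (invented for illustration, not taken from the corpus).
toy = [
    ("transcription", "N", "NON-EVENT"),
    ("transcription", "N", "TRANSCRIPTION"),
    ("transcription", "N", "NON-EVENT"),
    ("transcription", "N", "NON-EVENT"),
    ("bind", "V", "BINDING"),
    ("bind", "V", "BINDING"),
    ("bind", "V", "BINDING"),
    ("cell", "N", "NON-EVENT"),
]
targets = select_target_words(toy)
```

In this toy run only the noun transcription survives: bind is frequent but unambiguous, and cell is too rare. The `top_class_pct` field corresponds to the majority-class bias reported for each target word.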
List of target words for our WSD experiment (N = noun, V = verb, J = adjective). We present the number of train and test instances, the number of classes, and the bias of the majority class.
Classifiers and features
Our main WSD classifier is based on the Vector Space Model (VSM), in the form of a nearest-prototype classifier. Each occurrence of an ambiguous word is represented as a binary vector in which each position indicates the occurrence/absence of a feature, and a single centroid vector is generated for each sense of each word type during training. These centroids are compared with the vectors that represent new examples using the cosine similarity metric. The sense assigned to a given test instance is that of the closest centroid.
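A minimal sketch of such a nearest-prototype classifier is given below, assuming binary features represented as Python sets of strings. The function names and the toy feature strings are hypothetical, and the centroid is computed as the mean of the training vectors, which is an assumption (cosine similarity is insensitive to the scale of the centroid, so a sum would give the same decisions).

```python
import math
from collections import defaultdict

def train_centroids(instances):
    """instances: list of (set_of_features, sense) pairs for ONE word type.
    Returns a dict mapping each sense to its centroid (feature -> weight)."""
    sums = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for features, sense in instances:
        counts[sense] += 1
        for f in features:
            sums[sense][f] += 1.0
    return {s: {f: v / counts[s] for f, v in vec.items()}
            for s, vec in sums.items()}

def cosine(binary_features, centroid):
    """Cosine similarity between a binary vector (a set) and a centroid."""
    dot = sum(centroid.get(f, 0.0) for f in binary_features)
    norm_x = math.sqrt(len(binary_features))
    norm_c = math.sqrt(sum(v * v for v in centroid.values()))
    if norm_x == 0.0 or norm_c == 0.0:
        return 0.0
    return dot / (norm_x * norm_c)

def classify(features, centroids):
    """Assign the sense of the closest centroid (ties broken alphabetically)."""
    return max(sorted(centroids), key=lambda s: cosine(features, centroids[s]))

# Toy training data for a single word type (invented for illustration).
train = [
    ({"prev=gene", "bow=promoter"}, "TRANSCRIPTION"),
    ({"prev=gene", "bow=induce"}, "TRANSCRIPTION"),
    ({"next=factor", "bow=bind"}, "NON-EVENT"),
    ({"next=factor", "bow=nuclear"}, "NON-EVENT"),
]
centroids = train_centroids(train)
prediction = classify({"prev=gene", "bow=nuclear"}, centroids)
```

One such set of centroids would be trained per word type, in line with the separate-model-per-word design described in this section.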
As a secondary WSD classifier we use a Support Vector Machine (SVM-Weka), as implemented in the Weka toolkit. SVMs map feature vectors into a high-dimensional space and construct a classifier by searching for the hyperplane in that space that gives the greatest separation between the classes. For our experiments we rely on a polynomial kernel, with the C parameter set to 1 (the default value in Weka). In both cases, we build a separate classifier for each word type.
Both WSD classifiers rely on an extensive set of features:
Local collocations (Local): A set of features which describe the context of the ambiguous word token, in the form of: (1) bigrams and trigrams containing the ambiguous word constructed from lemmas, word forms, POS tags and PROTEIN tags provided in the BioNLP 2009 dataset; and (2) the lemma/word-form of preceding/following content words (adjectives, adverbs, nouns and verbs) occurring in the same sentence as the target word.
Syntactic dependencies: We identify the syntactic dependencies between the target word and other words in the sentence. We define two types of features: (1) lexicalised dependencies (the dependency relation type + the related lexical item), and (2) unlexicalised dependencies (the dependency relation type only). We extract the dependencies from the four parsers provided by the BioNLP 2009 shared task organisers.
Bag-of-words (BOW): Lemmas of all content words (nouns, verbs, adjectives, adverbs) in the same sentence as the target word, and, as a separate feature, lemmas of all content words within a ±4-word window around the target word.
Medical Subject Headings (MeSH): the manually-assigned MeSH terms associated with the document that the sentence was taken from. MeSH is a controlled vocabulary for indexing biomedical and health-related information and documents, and all biomedical papers in MEDLINE are indexed with MeSH data.
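As an illustration of the feature types listed above, the sketch below extracts a subset of the local collocation and bag-of-words features for one target token. It is a simplification under stated assumptions: only lemma-based n-grams are produced (not the word-form, POS or PROTEIN variants), the target word itself is excluded from the bag-of-words, content words are detected from Penn Treebank style POS prefixes, and the feature naming scheme is hypothetical.

```python
# Assumed Penn Treebank style prefixes for nouns, verbs, adjectives, adverbs.
CONTENT_POS_PREFIXES = ("NN", "VB", "JJ", "RB")

def is_content(pos):
    return pos.startswith(CONTENT_POS_PREFIXES)

def extract_features(tokens, i, window=4):
    """tokens: list of (word, lemma, pos) triples for one sentence.
    i: index of the target token. Returns a set of binary feature names."""
    lemmas = [lemma for _, lemma, _ in tokens]
    n = len(tokens)
    feats = set()

    # Local collocations: lemma bigrams and trigrams containing the target.
    for size in (2, 3):
        for start in range(i - size + 1, i + 1):
            if start >= 0 and start + size <= n:
                gram = "_".join(lemmas[start:start + size])
                feats.add(f"local_{size}gram@{start - i}={gram}")

    # Bag-of-words: content-word lemmas in the sentence, and separately
    # those within +/- `window` tokens of the target.
    for j, (_, lemma, pos) in enumerate(tokens):
        if j == i or not is_content(pos):
            continue
        feats.add(f"bow_sent={lemma}")
        if abs(j - i) <= window:
            feats.add(f"bow_win={lemma}")
    return feats

sentence = [
    ("The", "the", "DT"), ("changes", "change", "NNS"), ("in", "in", "IN"),
    ("the", "the", "DT"), ("mRNA", "mRNA", "NN"), ("levels", "level", "NNS"),
    ("of", "of", "IN"), ("these", "these", "DT"),
    ("protooncogenes", "protooncogene", "NNS"),
]
feats = extract_features(sentence, 1)  # target token: "changes"
```

Dependency and MeSH features would be added to the same set in the same way, given parser output and document-level MeSH terms.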
We also implement a sequential tagger that does not follow the WSD approach of separate models for each word type. Specifically, we use the CRF++ toolkit, which has been shown to be highly successful over various chunking tasks. CRFs provide a discriminative framework for building structured models to segment and label sequence data. CRFs have the well-known advantage that they both model sequential effects and support the use of large numbers of features.
For the CRF classifier we used a similar set of feature types to the WSD classifiers: word-forms, lemmas, POS, chunk tags, protein annotations, and grammatical dependencies. For dependency annotation, we used the Bikel parser and GDep as provided by the shared task organisers. This information was provided as a feature that expresses the grammatical function of the token. We applied a window size of 4 words in either direction from the target word.
Note that the CRF classifier is able to classify occurrences other than those for the target word types, but we evaluate only over the 63 target words for a fair comparison.
Finally, we also built an extended version of CRF (CRF-VSM) which uses the WSD predictions as an extra feature. In cases where the current token is not one of the target words, the feature has the value NULL. For the training data, we obtain the predictions by using 3-fold cross-validation, and for the test data we rely on the full training set. This system allows us to combine the NER and WSD approaches to the problem. For evaluation we provide the precision, recall, and F-score for each class, in addition to reporting the micro-averaged results over the 9 trigger-word classes. We also show the average accuracy across all instances; this score is less relevant to the final goal of correctly identifying trigger words, because it is affected by the predominance of the NON-EVENT instances and ignores recall.
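The way the WSD predictions enter the CRF as a feature can be sketched as below. This is an illustrative sketch, not the actual pipeline: the fold assignment by index modulo k is an assumption (the text does not say how folds were formed), a majority-class model stands in for the VSM classifier, and all function names are hypothetical.

```python
from collections import Counter

def crossval_predictions(instances, train_fn, predict_fn, k=3):
    """Return one out-of-fold prediction per training instance, so that the
    extra feature seen at CRF training time is produced by a model that has
    never seen that instance. instances: list of (features, label)."""
    preds = [None] * len(instances)
    for fold in range(k):
        train = [inst for idx, inst in enumerate(instances) if idx % k != fold]
        model = train_fn(train)
        for idx in range(fold, len(instances), k):
            preds[idx] = predict_fn(model, instances[idx][0])
    return preds

def wsd_column(n_tokens, target_preds):
    """Build the extra feature column for a sentence of n_tokens tokens.
    target_preds maps the index of each target word to its WSD prediction;
    every other token receives the value NULL."""
    return [target_preds.get(i, "NULL") for i in range(n_tokens)]

# Stand-in WSD model (assumption): always predicts the majority class.
def train_majority(train):
    return Counter(label for _, label in train).most_common(1)[0][0]

def predict_majority(model, features):
    return model

labels = ["NON-EVENT", "NON-EVENT", "NON-EVENT",
          "TRANSCRIPTION", "NON-EVENT", "NON-EVENT"]
instances = [({"dummy"}, label) for label in labels]
preds = crossval_predictions(instances, train_majority, predict_majority)
column = wsd_column(3, {1: "TRANSCRIPTION"})
```

For the test data no cross-validation is needed: the WSD model is trained once on the full training set and its predictions fill the same column.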
We use randomised estimation to calculate whether any performance differences between methods are statistically significant. As a baseline we present the Majority Class (MC) classifier, which assigns the most frequent class seen in the training data to all the test instances.
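The evaluation measures and the significance test can be sketched as follows. The sketch assumes that micro-averaging pools true positives over the trigger-word classes and leaves out NON-EVENT, and that the randomised test is an approximate randomisation test over the F-score difference with 1,000 shuffles; the number of trials, the seed, and the function names are assumptions made for illustration.

```python
import random
from collections import Counter

NON_EVENT = "NON-EVENT"

def micro_prf(gold, pred):
    """Micro-averaged precision, recall and F-score over the event classes."""
    tp = sum(1 for g, p in zip(gold, pred) if p != NON_EVENT and p == g)
    n_pred = sum(1 for p in pred if p != NON_EVENT)
    n_gold = sum(1 for g in gold if g != NON_EVENT)
    precision = tp / n_pred if n_pred else 0.0
    recall = tp / n_gold if n_gold else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score

def majority_class(train_labels, n_test):
    """Majority Class baseline: most frequent training class everywhere."""
    top = Counter(train_labels).most_common(1)[0][0]
    return [top] * n_test

def approx_randomisation(gold, pred_a, pred_b, trials=1000, seed=0):
    """Estimate a p-value for the F-score difference between two systems by
    randomly swapping their outputs on each test instance."""
    rng = random.Random(seed)
    observed = abs(micro_prf(gold, pred_a)[2] - micro_prf(gold, pred_b)[2])
    at_least = 0
    for _ in range(trials):
        shuf_a, shuf_b = [], []
        for a, b in zip(pred_a, pred_b):
            if rng.random() < 0.5:
                a, b = b, a
            shuf_a.append(a)
            shuf_b.append(b)
        diff = abs(micro_prf(gold, shuf_a)[2] - micro_prf(gold, shuf_b)[2])
        if diff >= observed:
            at_least += 1
    return (at_least + 1) / (trials + 1)

gold = ["TRANSCRIPTION", "NON-EVENT", "TRANSCRIPTION", "NON-EVENT"]
system = ["TRANSCRIPTION", "TRANSCRIPTION", "NON-EVENT", "NON-EVENT"]
baseline = majority_class(["NON-EVENT", "NON-EVENT", "TRANSCRIPTION"], len(gold))
scores = micro_prf(gold, system)
p_value = approx_randomisation(gold, system, baseline)
```

Note that a baseline that always predicts NON-EVENT scores zero on all three measures while still reaching high accuracy, which is why accuracy is the less informative figure for this task.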
Supervised WSD systems build upon the “one sense per collocation” heuristic, which states that in fixed collocations, a given word will tend to occur with the same meaning. Yarowsky defined “collocation” as the co-occurrence of words in a given relationship, for instance the words no relevant occurring immediately before the word changes. The features that we used satisfy this definition of “collocation”. We analyse the class entropy of the features in our collection and compare it with the WSD corpora introduced earlier (Senseval-3 and NLM-WSD). We rely on all features that occur at least twice, and we average the entropy values of all the features for each of the four basic feature types. We did not have access to all feature types for every corpus (e.g. MeSH is not found in Senseval), but the available ones can give us some insight into the differences across the corpora.
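The entropy analysis described above can be sketched as follows. The sketch assumes base-2 logarithms and an unweighted average over features, neither of which is stated in the text; features are represented as hypothetical (feature type, value) pairs.

```python
import math
from collections import Counter, defaultdict

def class_entropy(label_counts):
    """Entropy (in bits, an assumption) of the class distribution of the
    instances in which one feature occurs."""
    total = sum(label_counts.values())
    return sum((c / total) * math.log2(total / c)
               for c in label_counts.values())

def average_entropy_by_type(instances, min_count=2):
    """instances: list of (features, label), where each feature is a
    (feature_type, value) pair. Features seen fewer than min_count times
    are ignored; the rest are averaged per feature type."""
    counts = defaultdict(Counter)
    for features, label in instances:
        for feat in features:
            counts[feat][label] += 1

    per_type = defaultdict(list)
    for (ftype, _), label_counts in counts.items():
        if sum(label_counts.values()) >= min_count:
            per_type[ftype].append(class_entropy(label_counts))
    return {t: sum(values) / len(values) for t, values in per_type.items()}

# Toy instances for one word type (invented for illustration).
toy = [
    ({("Local", "prev=gene"), ("BOW", "promoter")}, "TRANSCRIPTION"),
    ({("Local", "prev=gene"), ("BOW", "promoter")}, "TRANSCRIPTION"),
    ({("Local", "prev=the"), ("BOW", "promoter")}, "NON-EVENT"),
    ({("Local", "prev=the"), ("BOW", "promoter")}, "NON-EVENT"),
]
entropy = average_entropy_by_type(toy)
```

In the toy data each local collocation always co-occurs with a single class (entropy 0), while the bag-of-words feature is split evenly between two classes (entropy 1), which is the kind of contrast the heuristic predicts.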
Entropy for each feature type across the three WSD corpora
With respect to the different feature types, there are no big differences in our collection, and as expected, local features and syntactic dependencies do best. More surprising are the results across the Senseval data, with the high entropy of BOW and the low score of syntactic dependencies. NLM features have low entropy for all the different types.
Number of word types and their average training frequency for different average entropy ranges (H < .3, .3 ≤ H < .4, H ≥ .4). An example for each group is provided, together with its most frequent class.
WSD performance of the different classifiers (the best results per column are given in bold)
WSD result for different entropy ranges (H < .3, .3 ≤ H < .4, H ≥ .4; the best score for each evaluation metric and entropy range is shown in bold)
Performance of CRF-VSM by event type, sorted by F-score (Freq. = Frequency of the event in test data).
WSD experiment performance of VSM by POS. The best result per column is given in bold.
The difficulty of disambiguating verb senses has been observed variously in the WSD literature, and these results seem to confirm that tendency. Regarding adjectives, the performance is actually below the majority class baseline, but there are only three word types and relatively few token instances for each, so the overall impact on results is negligible. The difficulty here appears to relate to the choice of which word to annotate as the trigger word, and a richer model of the interaction of the words related to the event would be required to improve the performance.
Our results over the BioNLP 2009 dataset show that the transformation of existing biomedical annotation into a WSD dataset can be challenging. One of the main differences from standard WSD is that the word models tend to be biased towards the non-event class, and this can be problematic for the classifiers. For the noun transcription, for example, 83% of the training instances are of type NON-EVENT. A possible solution could be to apply re-sampling techniques to build more balanced models.
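One simple re-sampling option is random oversampling of the minority classes, sketched below. This is one possibility among several and was not evaluated in this work; the function name, the seed, and the toy data are hypothetical.

```python
import random
from collections import Counter, defaultdict

def oversample(instances, seed=0):
    """Randomly duplicate instances of the smaller classes until every class
    has as many training instances as the largest one.
    instances: list of (features, label) pairs for one word type."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for inst in instances:
        by_label[inst[1]].append(inst)

    target = max(len(group) for group in by_label.values())
    balanced = []
    for label in sorted(by_label):
        group = by_label[label]
        balanced.extend(group)
        # Draw extra copies with replacement from the same class.
        balanced.extend(rng.choice(group) for _ in range(target - len(group)))
    return balanced

# Toy word model skewed towards NON-EVENT (invented for illustration).
skewed = [({"f%d" % i}, "NON-EVENT") for i in range(5)]
skewed.append(({"g0"}, "TRANSCRIPTION"))
balanced = oversample(skewed)
counts = Counter(label for _, label in balanced)
```

Oversampling should be applied to the training portion only, so that the test distribution remains the one found in the corpus.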
Another issue is the difficulty of event annotation for humans, even when strategies to ensure quality are put in place. After a long process, the BioNLP event annotation enforced the following guidelines: text-bound annotation (grounding all annotations to strings in text), single-facet annotation (keeping the viewpoint of annotation simple and very focused), and semantic typing (looking at types of entities for each event, and types of event for each entity to detect anomalies). The developers of the corpus explain that this process improved the inter-annotator agreement significantly, although they did not provide the numbers. They also describe how some high-frequency words can represent a wide variety of biomedical events depending on their related words in the context.
The event is described in the text as a process, and the annotators mark the word that culminates it, not the initiator. In this example, lack is annotated instead of transcription:
The event is underspecified in the sentence, referring to an unspecified gene, and it is not considered relevant. In this example, transcription is not annotated:
Multiple events occur simultaneously, and the annotation has to accommodate their relationship. In the following example the main event is a negative regulation event (NEG-REG) that affects a transcription event. The annotators seem to focus on the surface form of the NEG-REG event first (marking destabilization), and then annotate as TRANSCRIPTION the noun phrase that is directly related to it, choosing preformed, and ignoring the first mention of transcription:
The noun transcription occurs multiple times referring to the same event, but only the first occurrence is tagged:
A 2-4-fold increase in IFN-beta promoter [TRANS transcription] was observed in Sendai virus induced extracts, and deletion of PRDI and PRDII elements decreased this induced level of transcription.
These examples illustrate that for the BioNLP 2009 dataset, there are meta-linguistic aspects of the annotation that have to be taken into account, and more consistency is required to close the gap between textual representations and the ultimate goal of biomedical pathways. Significant effort has gone into the annotation of the BioNLP dataset, but we believe that word-by-word analysis can provide better means to improve and extend this kind of tagging. It has been shown in the WSD evaluation tracks that the annotation of lexical-sample datasets is easier and produces better quality data than all-words datasets, and this could be translated to the annotation of biomedical events. We have seen in this corpus that 63 ambiguous word types cover 62% of the event annotations, and focusing on the instances of these words separately could be a better way to produce consistent annotation.
We described a WSD-based method for detecting event trigger words in the BioNLP 2009 shared task data, and demonstrated that it attains better performance than a traditional sequential tagging approach. The highest score is achieved when using the WSD predictions as features for a sequential tagger, which significantly improves the recall and F-score of the latter. We also observed that the training class-entropy of features seems to be a good indicator of the kind of target word types that can improve over a sequential tagger.
Another result of this work is the identification of consistency issues in biomedical annotation, even when clear guidelines are provided. We found that a word-centered approach may help to find inconsistencies, especially given that a few target words seem to have high coverage of the trigger annotations. For future work we are planning to explore other challenges, such as BioCreative, and also to deploy full systems for the BioNLP shared task challenge, in order to directly compare against other systems.
NICTA is funded by the Australian Government as represented by the Department of Broadband, Communications and the Digital Economy and the Australian Research Council through the ICT Centre of Excellence program.
This article has been published as part of BMC Bioinformatics Volume 12 Supplement 2, 2011: Fourth International Workshop on Data and Text Mining in Bioinformatics (DTMBio) 2010. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/12?issue=S2.
- Text REtrieval Conference (TREC) [http://trec.nist.gov]
- John Wilbur, Lawrence Smith, Lorraine Tanabe: Biocreative 2. gene mention task. In Second BioCreative Evaluation Workshop. Madrid, Spain; 2007:7–16.
- Jin-Dong Kim, Tomoko Ohta, Jun’ichi Tsujii: Corpus annotation for mining biomedical events from literature. BMC Bioinformatics 2008, 9: 10. doi:10.1186/1471-2105-9-10
- Jin-Dong Kim, Tomoko Ohta, Sampo Pyysalo, Yoshinobu Kano, Jun’ichi Tsujii: Overview of BioNLP’09 shared task on event extraction. In Proceedings of the BioNLP 2009 Workshop Companion Volume for Shared Task. Boulder, USA; 2009:1–9.
- Eneko Agirre, Lluis Màrquez, Richard Wicentowski (Eds): Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007). Association for Computational Linguistics. Prague, Czech Republic; 2007.
- Lluis Màrquez, Gerard Escudero, David Martinez, German Rigau: Supervised corpus-based methods for word sense disambiguation. In Word Sense Disambiguation. Edited by: Eneko Agirre and Phil Edmonds. Springer, Dordrecht, Netherlands; 2006.
- Mark Stevenson, Yikun Guo, Robert Gaizauskas, David Martinez: Disambiguation of biomedical text using diverse sources of information. BMC Bioinformatics 2008, 9(Suppl 11):S7. doi:10.1186/1471-2105-9-S11-S7
- Eneko Agirre, Oier Lopez de Lacalle, Christiane Fellbaum, Andrea Marchetti, Antonio Toral, Piek Vossen: Semeval-2010 task 17: all-words word sense disambiguation on a specific domain. DEW ’09: Proceedings of the Workshop on Semantic Evaluations: Recent Achievements and Future Directions 2009, 123–128.
- Marc Weeber, James G Mork, Alan R Aronson: Developing a test collection for biomedical word sense disambiguation. In Proceedings of the 2001 AMIA Symposium. Washington DC, USA; 2001:746–750.
- BioCreAtIvE challenge [http://biocreative.sourceforge.net/index.html]
- Jari Björne, Juho Heimonen, Filip Ginter, Antti Airola, Tapio Pahikkala, Tapio Salakoski: Extracting complex biological events with rich graph-based feature sets. In Proceedings of the BioNLP 2009 Workshop Companion Volume for Shared Task. Boulder, USA; 2009:10–18.
- Ekaterina Buyko, Erik Faessler, Joachim Wermter, Udo Hahn: Event extraction from trimmed dependency graphs. In Proceedings of the BioNLP 2009 Workshop Companion Volume for Shared Task. Boulder, USA; 2009:19–27.
- Diana McCarthy: Word sense disambiguation: An overview. Language and Linguistics Compass 2009, 3(2):537–558. doi:10.1111/j.1749-818X.2009.00131.x
- Roberto Navigli: Word sense disambiguation: a survey. ACM Computing Surveys 2009, 41(2):1–69.
- David Yarowsky: One sense per collocation. In HLT ’93: Proceedings of the workshop on Human Language Technology. Princeton, USA; 1993:266–271.
- Rada Mihalcea, Timothy Chklovski, Adam Kilgarriff: The Senseval-3 English lexical sample task. In Senseval-3: Third International Workshop on the Evaluation of Systems for the Semantic Analysis of Text. Barcelona, Spain; 2004:25–28.
- Eneko Agirre, David Martinez: The Basque Country University system: English and Basque tasks. In Proceedings of the 3rd ACL workshop on the Evaluation of Systems for the Semantic Analysis of Text (SENSEVAL). Barcelona, Spain; 2004:44–48.
- Yoshimasa Tsuruoka, Yuka Tateishi, Jin-Dong Kim, Tomoko Ohta, John McNaught, Sophia Ananiadou, Jun’ichi Tsujii: Developing a robust part-of-speech tagger for biomedical text. In Advances in Informatics - 10th Panhellenic Conference on Informatics. Volos, Greece; 2005:382–392.
- Ian H Witten, Eibe Frank: Data Mining: Practical machine learning tools and techniques. Morgan Kaufmann, San Francisco, USA; 2005.
- BioNLP 2009 shared task [http://www-tsujii.is.s.u-tokyo.ac.jp/GENIA/SharedTask/tools.shtml]
- Stuart J Nelson, Tammy Powell, Betsy L Humphreys: The Unified Medical Language System (UMLS) project. In Encyclopedia of Library and Information Science. third edition. Edited by: Marcia J. Bates and Mary Niles Maack. CRC Press, Boca Raton, USA; 2002.
- CRF++: Yet Another CRF toolkit [http://crfpp.sourceforge.net/]
- Charles Sutton, Andrew McCallum: Introduction to conditional random fields for relational learning. In Introduction to Statistical Relational Learning. Edited by: Lise Getoor and Ben Taskar. MIT Press, Cambridge, USA; 2006.
- Alexander Yeh: More accurate tests for the statistical significance of result differences. In Proceedings of the 18th Conference on Computational Linguistics (COLING). Saarbrücken, Germany; 2000:947–953.
- Sotiris Kotsiantis, Dimitris Kanellopoulos, Panayiotis Pintelas: Handling imbalanced datasets: A review. GESTS International Transactions on Computer Science and Engineering 2006, 30: 25–36.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.