The number of documents discussing biomedical science is growing at an ever-increasing rate, making it difficult to keep track of recent developments. Automated methods for cataloging, searching and navigating these documents would be of great benefit to researchers working in this area, and have potential benefits for medicine and other branches of science. Lexical ambiguity, the linguistic phenomenon in which a term (word or phrase) has more than one potential meaning, makes the automatic processing of text difficult. For example, "cold" has several possible meanings in the Unified Medical Language System (UMLS) Metathesaurus [1], including "common cold", "cold sensation" and "Chronic Obstructive Airway Disease (COLD)". Weeber et al. [2] analysed MEDLINE abstracts and found that 11.7% of phrases were ambiguous relative to the UMLS Metathesaurus.
The ability to accurately identify the meanings of terms is an important step in automatic text processing. It is necessary for applications such as information extraction and text mining which are important in the biomedical domain for tasks such as automated knowledge discovery. The NLM Indexing Initiative [3] attempted to automatically index biomedical journals with concepts from the UMLS Metathesaurus and concluded that lexical ambiguity was the biggest challenge in the automation of the indexing process. Friedman [4] reported that an information extraction system originally designed to process radiology reports had problems with ambiguity when it was applied to more general biomedical texts. During the development of an automated knowledge discovery system Weeber et al. [5] found that it was necessary to resolve the ambiguity in the abbreviation MG (which can mean 'magnesium' or 'milligram') in order to replicate a well-known literature-based discovery concerning the role of magnesium deficiency in migraine headaches [6].
Word Sense Disambiguation (WSD) is the process of resolving lexical ambiguities. WSD has been actively researched since the 1950s and is regarded as an important part of the process of understanding natural language texts. A comprehensive description of current work in WSD is beyond the scope of this paper, although overviews may be found in [7, 8]. Schuemie et al. [9] provide an overview of WSD in the biomedical domain. Previous researchers have used a variety of approaches for WSD of biomedical text. Some of them have taken techniques proven to be effective for WSD of general text and applied them to ambiguities in the biomedical domain, while others have created systems using domain-specific biomedical resources. However, there has been no direct comparison of which information sources are the most useful or whether combining a variety of sources, a strategy which has been shown to be successful for WSD in the general domain [10, 11], also improves results in the biomedical domain.
This paper compares the effectiveness of a variety of information sources for WSD in the biomedical domain. These include features which have been commonly used for WSD of general text as well as information derived from domain-specific resources, including MeSH terms.
The remainder of this section provides an overview of various approaches to WSD in the biomedical domain. The Methods section outlines our approach, paying particular attention to the various types of information used by our system. An evaluation of this system is presented in the Results section, the implications of which can be found in the Discussion section.
The NLM-WSD data set
Research on WSD for general text in the last decade has been driven by the SemEval evaluation framework (http://www.senseval.org), which provides a set of standard materials for a variety of semantic evaluation tasks [12]. At this point there is no specific collection for the biomedical domain in SemEval, but a test collection for WSD in biomedicine, the NLM-WSD data set [2], is used as a benchmark by many independent groups. (An alternative collection is described by Widdows et al. [13], although the authors acknowledge that the low levels of inter-annotator agreement for the sense tags make the use of this data problematic.) The Unified Medical Language System (UMLS) Metathesaurus was used to define the set of possible meanings in the NLM-WSD data set. In UMLS, strings are mapped onto concepts, indicating their meaning. Strings which map onto more than one concept are ambiguous. For example, the string "culture" maps onto the concepts 'Anthropological Culture' (e.g. "The aim of this paper is to describe the origins, initial steps and strategy, current progress and main accomplishments of introducing a quality management culture within the healthcare system in Poland.") and 'Laboratory Culture' (e.g. "In peripheral blood mononuclear cell culture streptococcal erythrogenic toxins are able to stimulate tryptophan degradation in humans"). Fifty terms which are ambiguous in UMLS and occur frequently in MEDLINE were chosen for the NLM-WSD data set. 100 instances of each term were selected from citations added to the MEDLINE database in 1998 and manually disambiguated by 11 annotators. Twelve terms were flagged as "problematic" due to substantial disagreement between the annotators. There are an average of 2.64 possible meanings per ambiguous term and the most ambiguous term, "cold", has five possible meanings. Concepts which were judged to be very similar in meaning were merged. For example, the two concepts for "depression", 'Depressive episode, unspecified' and 'Mental Depression', were merged into a single meaning.
In addition to the meanings defined in UMLS, annotators had the option of assigning a special tag ("none") when none of the meanings in UMLS were judged to be appropriate.
Various researchers have chosen to evaluate their systems against subsets of this data set. Liu et al. [14] used a set of 22 terms, saying "We excluded 12 [terms] that Weeber et al. considered problematic, as well as 16 terms in which the majority sense occurred with over 90% of instances." However, the 22 terms used to evaluate their system include "mosaic" and "nutrition" which Weeber et al. [2] flagged as problematic. Leroy and Rindflesch [15] used a set of 15 terms for which the majority sense accounted for less than 65% of the instances. Joshi et al. [16] evaluated against the set union of those two sets, providing 28 ambiguous terms. McInnes et al. [17] used the set intersection of the two sets (dubbed the "common subset") which contained 9 terms. The terms that form these various subsets are shown in Figure 1.
The 50 terms which form the NLM-WSD data set represent a range of challenges for WSD systems. The Most Frequent Sense (MFS) heuristic has become a standard baseline in WSD [18] and is simply the accuracy which would be obtained by assigning the most common meaning of a term to all of its instances in a corpus. Despite its simplicity, the MFS heuristic is a hard baseline to beat, particularly for unsupervised systems, because it uses hand-tagged data to determine which sense is the most frequent. Analysis of the NLM-WSD data set showed that the MFS over all 50 ambiguous terms is 78%. The different subsets have lower MFS, indicating that the terms they contain are more difficult to disambiguate. The 22 terms used by Liu et al. [14] have an MFS of 69.9% while the set used by Leroy and Rindflesch [15] has an MFS of 55.3%. The union and intersection of these sets have MFS of 66.9% and 54.9% respectively.
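As a concrete illustration, the MFS baseline for a single term can be computed directly from its gold-standard sense tags. A minimal sketch (the tag distribution below is invented for illustration, not taken from the NLM-WSD data set):

```python
from collections import Counter

def mfs_accuracy(sense_tags):
    """Accuracy obtained by always predicting the most frequent sense
    of one ambiguous term, given its gold-standard sense tags."""
    counts = Counter(sense_tags)
    most_common_count = counts.most_common(1)[0][1]
    return most_common_count / len(sense_tags)

# Invented tag distribution for one term (not actual NLM-WSD data)
tags = ["common_cold"] * 70 + ["cold_sensation"] * 20 + ["copd"] * 10
print(mfs_accuracy(tags))  # 0.7
```

The corpus-level figures quoted above are simply this quantity averaged over the terms in the relevant subset.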
WSD of biomedical text
A standard approach to WSD is to make use of supervised machine learning systems which are trained on examples of ambiguous words in context along with the correct sense for that usage. The models created are then applied to new examples of that word to determine the sense being used.
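The supervised approach can be sketched as follows. This is a minimal bag-of-words Naive Bayes classifier over context words with invented training examples; it is an illustration of the general paradigm, not the implementation of any of the systems discussed here:

```python
import math
from collections import Counter, defaultdict

class NaiveBayesWSD:
    """Minimal Naive Bayes word-sense classifier using bag-of-words
    context features (an illustrative sketch of supervised WSD)."""

    def train(self, examples):
        # examples: list of (context_tokens, sense) pairs
        self.sense_counts = Counter(sense for _, sense in examples)
        self.word_counts = defaultdict(Counter)
        self.vocab = set()
        for tokens, sense in examples:
            self.word_counts[sense].update(tokens)
            self.vocab.update(tokens)
        self.total = len(examples)

    def predict(self, tokens):
        best, best_score = None, float("-inf")
        for sense, n in self.sense_counts.items():
            # log P(sense) + sum of log P(word | sense), add-one smoothed
            score = math.log(n / self.total)
            denom = sum(self.word_counts[sense].values()) + len(self.vocab)
            for w in tokens:
                score += math.log((self.word_counts[sense][w] + 1) / denom)
            if score > best_score:
                best, best_score = sense, score
        return best

# Invented training contexts for the ambiguous term "cold"
train = [(["patient", "fever", "virus"], "common_cold"),
         (["ice", "temperature", "water"], "cold_sensation"),
         (["runny", "nose", "virus"], "common_cold")]
clf = NaiveBayesWSD()
clf.train(train)
print(clf.predict(["virus", "fever"]))  # common_cold
```

New instances are classified by scoring each candidate sense against the words observed in its context, exactly the train-then-apply pattern described above.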
Approaches which have been adapted from WSD of general text include that of Liu et al. [14]. Their technique uses a supervised learning algorithm with features consisting of a range of collocations of the ambiguous word together with all words in the abstract. They compared different supervised machine learning algorithms and found that a decision list worked best. Their best system correctly disambiguated 78% of the occurrences of 22 ambiguous terms in the NLM-WSD data set (see Figure 1).
Joshi et al. [16] also used collocations as features and experimented with five supervised learning algorithms: Support Vector Machines, Naive Bayes, decision trees, decision lists and boosting. The Support Vector Machine performed best, scoring 82.5% on a set of 28 words (see Figure 1) and 84.9% on the 22 terms used by Liu et al. [14]. Performance of the Naive Bayes classifier was comparable to the Support Vector Machine, while the other algorithms were hampered by the large number of features.
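Collocation features of the kind used in these systems are position-indexed words around the ambiguous term. A minimal sketch, in which the window size and feature-name scheme are illustrative choices rather than those of any cited system:

```python
def collocation_features(tokens, target_index, window=2):
    """Extract position-indexed words around the ambiguous term,
    e.g. {'w-1': 'common', 'w+1': 'virus'}.  The window size is an
    illustrative parameter choice."""
    feats = {}
    for offset in range(-window, window + 1):
        if offset == 0:
            continue  # skip the ambiguous term itself
        i = target_index + offset
        if 0 <= i < len(tokens):
            feats[f"w{offset:+d}"] = tokens[i]
    return feats

print(collocation_features(["the", "common", "cold", "virus", "spreads"], 2))
# {'w-2': 'the', 'w-1': 'common', 'w+1': 'virus', 'w+2': 'spreads'}
```

Varying the window size and which positions are included is exactly the kind of parameter choice whose tuning is discussed later in this section.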
Examples of approaches which have made use of knowledge sources specific to the biomedical domain include Leroy and Rindflesch [15], who used information from the UMLS Metathesaurus. They used the MetaMap tool [19] which identifies the relevant UMLS concepts for a piece of text. Leroy and Rindflesch used knowledge about whether the ambiguous word is the head word of a phrase identified by MetaMap, the ambiguous word's part of speech, semantic relations from UMLS between the ambiguous word and surrounding words, as well as semantic types of the ambiguous word and surrounding words. Naive Bayes was used as a learning algorithm. This approach correctly disambiguated 65.5% of word instances from a set of 15 terms (see Figure 1). Humphrey et al. [20] presented an unsupervised system that also used semantic types from UMLS. They constructed semantic type vectors for each word from a large collection of MEDLINE abstracts. This allowed their method to perform disambiguation at a coarser level, without the need for labeled training examples. For most, though not all, terms the semantic types can be mapped onto the UMLS concepts used to annotate instances in the NLM-WSD corpus. In addition, this approach could not disambiguate instances which had been annotated with the "none" tag which indicated that none of the meanings in UMLS were judged to be appropriate. Five terms were excluded from their evaluation, four ("cold", "man", "sex" and "weight") because the semantic types could not be mapped onto UMLS concepts and the other ("association") because all instances of that term were assigned the "none" tag. In addition, only 67% of the instances for the remaining 45 terms were used for evaluation and, since instances with the "none" tag were also excluded, their system was only evaluated against an average of 54% of the instances of these terms. An accuracy of 78.6% was reported across these instances. McInnes et al. [17] also made use of information provided by MetaMap.
In UMLS each concept has a Concept Unique Identifier (CUI) and these are also assigned by MetaMap. The information contained in CUIs is more specific than in the semantic types applied by Leroy and Rindflesch [15] and Humphrey et al. [20]. For example, two of the CUIs for the term "cold" in UMLS, "C0205939: Common Cold" and "C0024117: Chronic Obstructive Airway Disease", share the same semantic type: "Disease or Syndrome". McInnes et al. [17] were interested in exploring whether the more specific information contained in CUIs was more effective than UMLS semantic types. Their best result was reported for a system which represented each sense by all CUIs which occurred at least twice in the abstract surrounding the ambiguous word. They used a Naive Bayes classifier as the learning algorithm and reported an accuracy of 74.5% on the set of ambiguous terms tested by Leroy and Rindflesch [15] and 80.0% on the set used by Joshi et al. [16]. They concluded that CUIs are more useful for WSD than UMLS semantic types but that they are not as robust as features which are known to work in general English, such as unigrams and bigrams.
Unfortunately, direct comparison of the various WSD systems which have been evaluated on the NLM-WSD data set is not straightforward. Firstly, as we have described, systems have been tested against a variety of ambiguous terms. A more subtle problem arises in the way in which researchers have chosen to present their results. With the exception of unsupervised systems [15, 20], which do not require training data, all approaches involve training a classifier using some portion of the available data and then testing against the remaining unseen portion. These supervised approaches normally involve choices over how to set the parameters which define the group of features used. For example, Liu et al. [14] compared a total of 22 different feature sets by varying the size of the context window around the ambiguous word and the terms which are extracted. One approach [14, 16] is to experiment with a variety of parameters and choose the best one for each ambiguous term. For example, the 78% accuracy figure quoted by Liu et al. [14] is obtained by choosing the result from the best classifier for each of the 22 terms used in their evaluation. We refer to this as per-term parameter setting. An alternative methodology involves applying the same parameters to all terms. For example, the results reported by McInnes et al. [17] are obtained by using the same parameters for all terms rather than selecting the best result for each. We call this global parameter setting.
It would be preferable to automate the process of parameter setting as far as possible; however, this would be difficult for per-term parameter setting, particularly for a data set such as NLM-WSD where there are only 100 instances for each ambiguous term and many senses which occur only a few times. The alternative approach, global parameter setting, is less affected by this problem and has the advantage that the settings are more likely to be suitable for terms other than the ones which are contained in the test collection. The global parameter setting methodology is used in the experiments described later in this paper.
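The difference between the two evaluation methodologies can be made concrete. In the sketch below the terms, parameter settings and accuracy figures are all invented for illustration:

```python
def per_term_score(results):
    """Mean accuracy when the best parameter setting is chosen
    separately for each term (per-term parameter setting)."""
    return sum(max(accs.values()) for accs in results.values()) / len(results)

def global_score(results):
    """Mean accuracy of the single parameter setting that does best
    on average across all terms (global parameter setting)."""
    params = next(iter(results.values())).keys()
    return max(
        sum(results[term][p] for term in results) / len(results) for p in params
    )

# Invented accuracies: results[term][parameter_setting]
results = {
    "cold":    {"window=2": 0.80, "window=5": 0.70},
    "culture": {"window=2": 0.60, "window=5": 0.75},
}
print(per_term_score(results))  # 0.775
print(global_score(results))    # 0.725
```

Per-term selection can never score lower than global selection on the same data, which is why figures produced under the two regimes are not directly comparable.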