Normalizing biomedical terms by minimizing ambiguity and variability

Background One of the difficulties in mapping biomedical named entities, e.g. genes, proteins, chemicals and diseases, to their concept identifiers stems from the potential variability of the terms. Soft string matching is a possible solution to the problem, but its inherent heavy computational cost discourages its use when the dictionaries are large or when real time processing is required. A less computationally demanding approach is to normalize the terms by using heuristic rules, which enables us to look up a dictionary in a constant time regardless of its size. The development of good heuristic rules, however, requires extensive knowledge of the terminology in question and thus is the bottleneck of the normalization approach. Results We present a novel framework for discovering a list of normalization rules from a dictionary in a fully automated manner. The rules are discovered in such a way that they minimize the ambiguity and variability of the terms in the dictionary. We evaluated our algorithm using two large dictionaries: a human gene/protein name dictionary built from BioThesaurus and a disease name dictionary built from UMLS. Conclusions The experimental results showed that automatically discovered rules can perform comparably to carefully crafted heuristic rules in term mapping tasks, and the computational overhead of rule application is small enough that a very fast implementation is possible. This work will help improve the performance of term-concept mapping tasks in biomedical information extraction especially when good normalization heuristics for the target terminology are not fully known.


Background
Named entities such as names of genes, proteins, chemicals, tissues, and diseases play a central role in information extraction from biomedical documents [1][2][3]. To fully utilize the information they convey in the document, we generally need to perform two steps. In the first step, which is commonly called named entity recognition, we identify the regions of text that are likely to be named entities and classify them into predefined categories. Substantial research efforts have been devoted to the improvement of named entity recognizers, and today we can identify biomedical named entities in the literature with reasonable (although still not entirely satisfactory) accuracy by using rule-based or machine learning-based techniques [4][5][6][7][8].
In the second step, we map the recognized entities with the corresponding concepts in the dictionary (or ontology). This step is crucial for making the extracted information exchangeable at the concept level [9]. This mapping task has proven to be non trivial especially in the biomedical domain [10][11][12][13]. One of the main problems is that biomedical terms have many potential variants, and it is not possible for a dictionary to cover all possible variants in advance.
One possible approach to tackle this problem is to use soft string matching techniques. Soft matching enables us to compute the degree of similarity between strings, and thus we can associate a term with its concept even when the dictionary fails to contain the exact spelling of the term. In fact, soft matching methods have been shown to be useful in several gene/protein name mapping tasks [11,13,14]. Soft string matching, however, is not without drawbacks: the method requires a considerable computational cost when looking up the dictionary [15]. This problem is particularly serious when we use large dictionaries such as those for gene/protein names and disease names, which can contain more than hundreds of thousands terms. Although there are techniques to speed up the computation for simple similarity measures like uniform-cost edit distance [16], it is hard to apply those techniques to the sophisticated similarity measures needed in real mapping tasks. To make matters worse, the size of the literature that we need to analyze for biomedical information extraction could be huge-MEDLINE abstracts contain more than 70 million sentences, let alone full papers.
Another approach to alleviate the problem of term variation is to normalize the terms by using heuristic rules [17][18][19]. For example, converting capital letters to lower case has been shown to be an effective normalization rule for gene/protein names [17]. The distinct advantage of the normalization approach over the soft matching approach is the speed of looking up the dictionary. Once the terms are normalized, we can use a hashing technique to lookup a dictionary with a constant computational cost regardless of its size, while the cost for soft matching increases linearly with the size of the dictionary.
What is most important in the normalization approach is how we normalize the terms. Bad heuristic rules often lose important information in the terms. For example, deleting all digits from a term is probably a bad rule for gene/protein names, because the rule makes it impossible to distinguish 'ACE1' and 'ACE2' although it enables us to match 'ACE' with 'ACE1.' Although using good heuristic rules is certainly important, their development is not straightforward. It requires good intuition and extensive knowledge of the terminology in question; the developer has to know the types of variation and potential side effects of normalization. Consequently, it remains to be seen what normalization rules would work well for various classes of named entities in the biomedical domain.
In this paper, we present a novel approach for the automatic discovery of term normalization rules, which requires no expert knowledge of the terminology. To achieve this goal, we leverage the important insight provided in previous studies [17,20] in which contrast and variability in gene names were analyzed to test the effectiveness of several normalization heuristics. Their work suggests that one could distinguish good normalization rules from bad ones by analyzing the effect of normalization on the relationships between terms and their concept IDs in the dictionary. We take their work one step further and present a framework for discovering a list of "good" normalization rules from a dictionary in a fully automated manner.

Ambiguity and variability
In this section, we describe two notions that are needed to quantify the utility of a normalization rule. We call them ambiguity and variability.
First, let us define a dictionary simply as a list of terms {t 1 , …, t N } where each term is associated with a concept ID c j ∈ {c 1 , …, c M }. In the biomedical domain, concept IDs typically correspond to the unique identifiers for conceptual entities such as genes, chemicals, and diseases defined in biomedical databases (e.g. UniProt, InChI, and OMIM). Table 1 shows an imaginary dictionary consisting of only three concept IDs. Here, we define two values for the dictionary: the ambiguity value and the variability value. The ambiguity value quantifies how ambiguous, on average, the terms in the dictionary are. More specifically, we define the ambiguity value as follows: where N is the number of terms in the dictionary, and C(t i ) is the number of the concept IDs that include a term whose spelling is identical to t i .
The variability value, in contrast, quantifies how variable the terms are. This is calculated as: Where M is the number of concept IDs in the dictionary, and T(c j ) is the number of unique terms that the concept c j includes.
For the dictionary shown in Table 1, we can calculate these values as follows: These values can be seen as the indicators of the complexity of terminology. Ideally, we do not want the terms to be ambiguous or variable, because both lead to impaired performance in mapping tasks. We thus favour smaller values for these factors. Now let us see how a normalization rule can change the situation. Suppose that we have the normalization rule that removes hyphens. By applying it to the terms in the dictionary, 'IL-2' becomes 'IL2', and 'IL-3' becomes 'IL3'. Then we obtain new values for ambiguity and variability: We have succeeded in reducing the variability value without increasing the ambiguity. This indicates that this normalization rule is a good one.
For the same dictionary shown in Table 1, we could think of a different normalization rule that replaces all digits with the special symbol '#'. If we apply this rule to the dictionary, 'IL2' and 'IL3' become 'IL #', 'IL-2' and 'IL-3' become 'IL-#', and 'ZFP580' and 'ZFP581' become 'ZFP###'. We then obtain: Although we have a decreased value for variability, the ambiguity value has increased. This indicates that this normalization rule may not be a good one.
The examples above demonstrate that we could distinguish good normalization rules from bad ones by observing the change of the ambiguity/variability values defined in the dictionary. In general, a normalization rule reduces the variability value at the sacrifice of the increase in the ambiguity value. Therefore, what we want is a rule that can maximize the reduction of the variability value and keep the increase of the ambiguity value minimal.
We now need to integrate the two values in order to quantify the overall "goodness" of a normalization rule. We define a new value, which we call complexity, as follows: where α is the constant that determines the trade-off between ambiguity and variability.
Now the problem has become very simple; if a normalization rule can reduce the complexity value for the dictionary, then the rule is a good rule, otherwise it is a bad rule.  Table 1: Ambiguity and variability in a dictionary. This is an imaginary dictionary consisting of three concept IDs. All terms belonging to the same concept ID are assumed to be synonymous (conveying the same meaning).

Concept ID Term
Zinc finger protein

Generating rule candidates
The next problem is how we automatically generate normalization rules. Ideally, we want to allow normalization rules to be of any type, such as regular expressions and context-free grammars. However, we found it difficult to incorporate such complex rules in a fully automatic manner because it entails a huge search space for rule hypotheses.
In this work, we focus only on character-level replacement rules. By focusing on this type, we can easily generate rule candidates from the terms in the dictionary. For example, the first and the second terms in the dictionary given in Table 1 constitute the following pair.

IL-2
From this pair, we can easily see that we will be able to match the two terms if we remove the hyphens (i.e. replace the hyphens with the null character), which in turn will reduce the variability value of the dictionary. In other words, we can automatically generate the rule that removes hyphens, by observing this term pair.
More formally, we can represent any pair of terms X and Y as follows: where L is the left common substring shared by X and Y, R is the right common substring, and X C and Y C are the substrings at the center that are not shared by the two strings. From this representation, we create the rule that replaces Y C with X C , which will transform Y into X.
For the above example pair 'IL2' and 'IL-2', L is 'IL', R is '2', Y C is '-', and X C is the null character. If we take the first term 'IL2' and the third term 'Interleukin' from the dictionary in Table 1, L is 'I', R is the null character, Y C is 'nterleukin', and X C is 'L2'.

Discovering rules
In the previous sections, we have defined a measure to quantify the utility of a normalization rule and presented a method to generate a rule candidate from any given term pair. Now we describe the whole process of rule discovery. The process is as follows: 1. Generate rule candidates from all possible pairs of synonymous terms in the dictionary (i.e. terms sharing the same concept ID).
2. Select a rule that can reduce the complexity value defined by Equation 3.
3. Apply the rule to all the terms in the dictionary.
4. Go back to 1-repeat until the predefined number of iterations is reached.
Notice that the process is iterative-we apply the discovered rule immediately to the dictionary and then use the updated dictionary for the next iteration. This is because the rules discovered are to be used in sequential manner; the end product of our rule discovery system is a list of normalization rules, and we shall use them exactly in the same order specified in the list. Thus the terms in the dictionary have to be sequentially updated in the rule discovery process to make sure that they go through the same rule applications.
In step 2, we need to select a good rule from the rule candidates generated in step 1. The obvious strategy would be to select the rule that maximizes the reduction of the complexity value of the dictionary. However, we found this strategy impractical when the dictionary is large, because it requires us to try applying every rule candidate to the dictionary to see its utility. In this work, we use a less computationally intensive strategy. First, we sort the rule candidates in descending order of frequency of occurrence. We then pick up the first rule that can decrease the complexity value. This strategy worked reasonably well, since the rule candidates that are generated many times decrease the variability value to a greater degree than infrequent ones do.
To further improve the efficiency of the entire process, we do not consider any rule candidates that have failed once to reduce the complexity value. This pruning method is not completely safe, because the terms in the dictionary change as the process proceeds and thus a candidate that has been rejected once could become acceptable at some point. However, we found that the speed-up gain outweighs the demerit when the dictionary is large.

Dictionaries
We used two large-scale dictionaries for the experiments to evaluate our rule discovery algorithm. One is a dictionary for human gene/protein names, and the other for disease names.
The gene/protein name dictionary was created from BioThesaurus [21], which is a collection of more than two million gene/protein names for various species. We selected only the human genes/proteins by consulting the UniProt database [22] and removed the names that were nonsensical (e.g. IDs for other databases). The resulting dictionary consisted of 14,893 concept IDs and 205,909 terms.
The disease dictionary was created from UMLS Metathesaurus [23], which is a large multi-lingual vocabulary database that contains biomedical and health related concepts and their various names. We extracted all entries that were associated with the semantic type "Disease or Syndrome." The resulting dictionary consisted of 48,391 concept IDs and 148,531 terms. Table 2 shows statistics of the dictionaries. Note that the terms in the gene/protein dictionary are highly ambiguous in the first place. For evaluation purpose, we also created a reduced version for each dictionary by removing all ambiguous terms.

Evaluation using held-out terms
As discussed in the Methods section, we create normalization rules in such a way that they minimize the variability and ambiguity of the terms in the dictionary. We thus know that they are "good" rules for the terms included in the dictionary. It is, however, not clear if those rules are also appropriate for the terms that are not included in the dictionary. In other words, we need to evaluate how the discovered rules will help map unseen terms with their correct concept IDs.
One way of evaluating such performance is to use a heldout data set for evaluation. Before executing a rule discovery process, we remove some randomly selected terms from the dictionary and keep them as separate data. We then execute the rule discovery process. The mapping performance is then evaluated by applying the discovered rules also to the held-out terms and looking them up in the dictionary, where the lookup system produces, for each heldout term, zero or more concept IDs by exact string matching. The overall lookup performance can be evaluated in terms of precision and recall. Precision is given by where n m is the total number of concept IDs output by the lookup system, and n c is the total number of correct concept IDs output by the system. Recall is given by where n h is the number of heldout terms.
With these performance measures, we carried out several sets of experiments for automatic rule discovery. Throughout all experiments reported in this paper, we set the tradeoff parameter α in Equation 3 to 0.1. All capital letters in the terms are converted to lower case before applying our rule discovery algorithm. Table 3 shows the result for the human gene/protein dictionary. We used the reduced version of the dictionary in this experiment in order to make clear how normalization affects the precision of lookup performance (i.e. the lookup precision without normalization is ensured to be 100%). We used 1,000 heldout terms for evaluation. The first column shows the iteration counts of the rule discovery process. The second column shows the values of ambiguity and variability in the dictionary. The third column shows the rules discovered. The fourth column shows the lookup performance evaluated using the heldout terms.
The table clearly shows that the recall of lookup improved as we applied the discovered rules to the terms. More importantly, the degradation of precision was kept minimal. The discovered rules indicate that some technical words such as 'protein', 'precursor', 'variant' and 'hypothetical' are not important in conceptually characterizing a term and thus can be safely removed. The 14th rule is concerned with the acronym 'il'. The rule effectively converts its long form 'interleukin' into the acronym. Some of the rules capture synonymous expressions. For example, the 25th rule replaces the word 'subunit' with 'chain'. Table 4 shows the result for the (reduced) disease dictionary. Again, the discovered rules improved the recall of lookup performance without causing a significant deterioration of precision. The rules discovered were very differ-

Evaluation using MEDLINE snippets
In the experiments presented in the previous section, we have demonstrated that the normalization rules discovered by our algorithm work well for unseen terms as well. It is, however, still not entirely clear how useful and safe those rules are. Although we used heldout data for evaluation, the nature of the heldout terms might be too similar to the remaining terms in the dictionary and thus we cannot rule out the possibility that the rules were actually overfitting the data. Moreover, the distribution of the terms in the dictionary is different from that of the terms appearing in real text, so the rules that are harmless within the dictionary might cause a problem of ambiguity when applied to terms in text.
To confirm the effectiveness of our normalization method, we need evaluation data that stem from real text rather than a dictionary. Fortunately, the BioCreAtIvE II gene normalization task [24] provides data which can be used for our experiments. The data (the "training.genelist" file) includes gene/protein name snippets extracted from MEDLINE abstracts, and each snippet is assigned an Ent-rezGene ID. Table 5 shows some examples of the snippets. This evaluation setting could be seen as the situation where we have a named entity recognizer that can perfectly identify the regions of gene/protein names in text. We converted the EntrezGene IDs to UniProt IDs so that they can be compared to the IDs in our human gene/protein dictionary. The resulting evaluation data consisted of 965 gene/protein name snippets and their IDs (there were 33 EntrezGene IDs that we failed to convert to UniProt IDs).
With this evaluation data, we ran experiments using our gene/protein name dictionary (not the reduced version). The result is shown in Table 6. Again, the discovered rules improved the recall of lookup performance without losing precision. The main reason why the improvement of recall is not as significant as in Table 3,4 is that, unlike heldout terms, many of the snippets are readily mappable to the terms in the dictionary without any normalization. The useful rules were slightly different from the ones in Table 3. For example, the 38th rule, in effect, converts 'receptor' to 'r'. The 44th rule converts 'alpha' to 'a'. The Roman numeral 'i' is converted to the Arabic counterpart '1' by the 75th rule.

Lookup performance
The greatest advantage of the normalization approach is the speed of looking up a dictionary. Once we normalize the terms in the dictionary and the input term, we can use a hashing technique to look it up in a constant time regardless of the dictionary size. The cost required for normalizing the terms in the dictionary is not a problem since it is done prior to processing the text. In contrast, the cost required for normalizing the input term could be an issue because we need to invoke the normalization process every time we come across a term in the course of text processing.
To see the computational overhead of normalization, we carried out experiments using the same dictionary and evaluation data used in the above experiment. We implemented the methods in C++ and ran the experiments on AMD Opteron 2.2GHz servers. Table 7 shows the result. The bottom row shows the result of our automatic normalization method in which we used the 100 normalization rules discovered by the algorithm.
We can see that the application of 100 rules made the lookup process several times slower than the case without any normalization. Note, however, that it is still more than ten thousand times faster than the soft matching cases where a simple character-level bigram similarity was employed. 0.67 seconds per lookup with soft matching may not appear to be hugely problematic, but it is not a desirable speed when we want to process a large amount of text or when real time processing is required (recall that we used only the human gene/protein dictionary in this experiment, which is a tiny fraction of the biomedical terminology).
We should nevertheless emphasize that the purpose of this work is not to claim that our automatic term normalization approach is superior to soft string matching approaches. Soft matching methods have a distinct advantage of being able to output similarity scores for matched terms. Also, soft matching is in general more robust to various transformations than normalization approaches. The heavy computational cost is not a problem in certain applications. Soft matching and normalization are, in fact, complementary.
The table also shows the performance achieved by the heuristic rules given in Fang et al. [18]. The normalization consists of case normalization, replacement of hyphens with spaces, removal of punctuation, removal of parenthesized materials, and removal of spaces. Their normalization gave a better recall than our system at the price of a degradation of precision. Among their normalization rules, removal of parenthesized materials is particularly interesting, because this rule can never be produced by  our algorithm. This is an instance of "clever" rules that are difficult to discover without the help of human knowledge.
We conducted a brief error analysis on the results of this mapping task to see what types of term variations were yet to be captured by the system. Somewhat surprisingly, there were still many terms that could be mappable via character-level replacement rules. This indicates that we could improve the rule discovery process by employing a more sophisticated method to explore the hypothesis space. Our rule discovery algorithm has some commonalities with Transformation Based Learning (TBL) [25], so the approaches proposed to improve the training process in TBL (e.g. [26] and [27]) may also be useful in pursuing this direction. The other types of unresolved variations

Conclusions
Developing good heuristics for term normalization requires extensive knowledge of the terminology in question, and it is the bottleneck of normalization approaches for term-concept mapping tasks. In this paper, we have shown that the automatic development of normalization rules is a viable solution to the problem, by presenting an algorithm that can discover effective normalization rules from a dictionary. The algorithm is easy to implement and efficient enough that it is applicable to large dictionaries. Experimental results using a human gene/protein dictionary and a disease dictionary have shown that the automatically discovered rules can improve recall without a significant loss of precision in term-concept mapping tasks. This work should be particularly useful for terminologies for which good normalization rules are not fully known.
In this work, we limited the type of normalization rules to character-level replacement. There are, however, many good heuristics that cannot be captured in this framework. Extending the scope of normalization rules to more flexible expressions is certainly an interesting direction of future work.