CNN-based ranking for biomedical entity normalization

Background Most state-of-the-art biomedical entity normalization systems, such as rule-based systems, rely merely on morphological information of entity mentions and rarely consider their semantic information. In this paper, we introduce a novel convolutional neural network (CNN) architecture that regards biomedical entity normalization as a ranking problem and benefits from semantic information of biomedical entities. Results The CNN-based ranking method first generates candidates using handcrafted rules, and then ranks the candidates according to their semantic information modeled by the CNN as well as their morphological information. Experiments on two benchmark datasets for biomedical entity normalization show that our proposed CNN-based ranking method outperforms a traditional rule-based method and achieves state-of-the-art performance. Conclusions We propose a CNN architecture that regards biomedical entity normalization as a ranking problem. Comparison results show that semantic information is beneficial to biomedical entity normalization and can be combined well with morphological information in our CNN architecture for further improvement.

in the newswire domain focus on dealing with the disambiguation problem using machine learning methods based on the context of entity mentions, while, in the biomedical domain, many systems focus on the variation problem and use rule-based methods relying on morphological information of entity mentions. Although the rule-based methods achieve good performance on medical entity normalization, they still suffer from several limitations. Firstly, it is impossible to collect a complete set of rules that covers all mentions of a type of entity, as their descriptions are varied and changing. Secondly, rules for one type of entity are not directly applicable to another type of entity. For example, a synonym dictionary of disorders obviously cannot be applied to medications. Thirdly, it is difficult to handle the following two cases: 1) mentions with similar morphology but totally different meanings such as "ADA-SCID (adenosine deaminase deficiency)" and "X-SCID (X-linked combined immunodeficiency diseases)"; and 2) mentions with significantly different morphology but similar semantic meanings such as "kaplan plauchu fitch syndrome" and "acrocraniofacial dysostosis". The goal of this paper is to advance biomedical entity normalization by utilizing semantic information of biomedical entity mentions. To achieve this goal, we design a CNN architecture that captures semantic information of biomedical entity mentions and uses it to rank candidates generated by the handcrafted rules used in traditional rule-based systems. In order to tackle the absence problem, we add 'NIL' (denoting absence) into the candidate set of an entity mention when necessary. We can then handle variation and absence in a unified framework.
Our main contributions are: 1) we propose a novel CNN-based ranking method for biomedical entity normalization, which takes advantage of CNN in modeling semantic similarities of entity mentions; 2) our system achieves state-of-the-art results on two benchmark datasets.

Related work
In recent decades, a large number of studies have been proposed for biomedical entity normalization; however, nearly all of them focus on how to use morphological information to normalize entity mentions more accurately. In this section, we only discuss three representative state-of-the-art systems on the two benchmark datasets used for evaluation. Related studies using CNN for semantic matching are also briefly introduced.
UWM, developed by Ghiasvand and Kate [1], was the best system for the disease and disorder mention normalization task of the SemEval 2014 challenge, which used an expanded dataset of the eHealth task of the ShARe/CLEF 2013 challenge [2]; it achieved an accuracy of 89.50% on the dataset of the ShARe/CLEF 2013 challenge. It first automatically learned patterns of variations of clinical terms from the unified medical language system (UMLS) [3] and the training set of the challenge by computing edit distances between the variations, and then attempted to normalize unseen entity mentions by performing exact match between their variations generated by the learnt patterns and an entity mention in the training set or an entity in the given KB. DNorm [4], proposed by Leaman et al., adopted a pairwise learning-to-rank approach and achieved a good result on the NCBI disease dataset [5]. It adopted a vector space model to represent medical entity mentions, and used a similarity matrix to measure the similarity between given medical entity mentions and standard entity mentions. A similar system developed by Zhang et al. [6] was submitted to the SemEval 2014 challenge and achieved the best normalization result.
D'Souza & Ng's system (2015) [7], the best rule-based system to date on the ShARe/CLEF datasets, was a multi-pass sieve system based on manual rules. It defined 10 kinds of rules at different priority levels to measure morphological similarities between entity mentions and entities in the given KB for normalization.
TaggerOne [8], the best machine learning-based system to date on the NCBI dataset, used semi-Markov models to perform biomedical entity recognition and normalization jointly.
A convolutional neural network (CNN) is a type of feedforward artificial neural network that has been widely used to model semantic information of sequences in NLP, such as sentences [9], short texts [10], and questions and answers in question answering [11], and then to handle sequence classification [12][13][14] and matching problems [15,16]. Two works are most similar to our study: the work of Severyn and Moschitti on learning to rank short text pairs [17], and the work of Limsopatham & Collier [18] on normalising medical concepts in social media texts, which considers medical concept normalization as a classification problem. Neither of them considers any information other than the semantic information modeled by CNN, such as morphological information.

Overview
Our system consists of two modules: 1) candidate generation: generating candidates for a given biomedical entity mention, 2) candidate ranking: ranking biomedical entity candidates using a CNN architecture. A detailed description of them is given in the following sections.

Candidate generation
We use the same rules as in [7] to generate candidates. In [7], 10 kinds of rules were used as sieves at 10 priority levels to normalize biomedical entities. A biomedical entity mention was normalized into the entity found by the sieve at the highest priority level. In our system, we divide those rules into three categories according to the morphological similarity between entity mentions in text and entity mentions in a training set and a given KB: 1. exact-match (denoted by CI): an entity mention in text exactly matches an entity mention in the training set or a standard entity in the KB; 2. exact-match II (denoted by CII): an entity mention in text exactly matches an entity mention in the training set or a standard entity in the KB after morphological change; 3. partial-match (denoted by CIII): an entity mention in text cannot be found by the two categories of rules above, but some of its words appear in certain entity mentions in the training set or certain standard entities in the KB.
Given an entity mention m, three candidate subsets are generated by the rules in the three categories, denoted by $E_1$, $E_2$ and $E_3$ respectively. Candidates in $E_1$ are the most morphologically similar to the mention, candidates in $E_2$ are the second-most morphologically similar, and candidates in $E_3$ are the least morphologically similar. Table 1 gives examples of candidates of some biomedical entity mentions generated by rules in different categories.
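The three candidate subsets can be illustrated with a minimal sketch. It does not reproduce the 10 sieve rules of [7]: the `normalize` function is a hypothetical stand-in for "morphological change" (lowercasing and stripping a trailing plural "s"), the `kb` dictionary and its entity IDs are invented for illustration, and computing the partial-match set for every mention (rather than only when the first two categories fail) is a simplification.

```python
from typing import Dict, Set, Tuple


def normalize(text: str) -> str:
    """Hypothetical 'morphological change': lowercase and strip a plural 's'."""
    words = [w[:-1] if w.endswith("s") and len(w) > 3 else w
             for w in text.lower().split()]
    return " ".join(words)


def generate_candidates(mention: str,
                        kb: Dict[str, str]) -> Tuple[Set[str], Set[str], Set[str]]:
    """Return (E1, E2, E3) as sets of entity IDs.

    kb maps a surface name (from the training set or the KB) to an entity ID.
    """
    # CI: exact match on the raw string.
    e1 = {eid for name, eid in kb.items() if name == mention}
    # CII: exact match after the (assumed) morphological change.
    e2 = {eid for name, eid in kb.items()
          if normalize(name) == normalize(mention)} - e1
    # CIII: at least one word shared with the mention.
    mention_words = set(normalize(mention).split())
    e3 = {eid for name, eid in kb.items()
          if mention_words & set(normalize(name).split())} - e1 - e2
    return e1, e2, e3


# Illustrative KB with made-up identifiers.
kb = {"colonic polyps": "D1", "Colonic Polyp": "D1b", "polyps": "D2"}
e1, e2, e3 = generate_candidates("colonic polyps", kb)
```

In this toy KB the mention "colonic polyps" lands in all three categories, which makes the ordering from most to least morphologically similar easy to inspect.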
In order to integrate semantic information, for an entity mention m, we rank all candidates in the candidate subset most morphologically similar to it (denoted by E) according to the semantic similarities between m and the candidates, and choose the top-ranked entity as the normalization result of m. When there is only one element in E and E is not $E_3$ (i.e. only one entity is found using rules in CI or CII), the element is directly chosen. When E is $E_3$ and NIL is considered, we add NIL into E for ranking, as no entity can be found using rules in CI or CII. The detailed workflow of our system is shown in Fig. 1b, which is similar to that of D'Souza & Ng [7] shown in Fig. 1a.
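The selection logic described above can be sketched as follows. The function name `select_ranking_set` and the returned `direct` flag are hypothetical; the sketch only mirrors the three cases in the text (single hit from CI or CII, several hits from CI or CII, and fallback to the partial-match subset with an optional NIL).

```python
from typing import List, Sequence, Tuple

NIL = "NIL"


def select_ranking_set(e1: Sequence[str], e2: Sequence[str], e3: Sequence[str],
                       use_nil: bool) -> Tuple[List[str], bool]:
    """Return (E, direct).

    E is the non-empty subset most morphologically similar to the mention.
    direct is True when E has a single element found by CI or CII, in which
    case that element is chosen without ranking.
    """
    for subset in (e1, e2):
        if subset:
            e = sorted(subset)
            return e, len(e) == 1
    # Only partial matches are available: rank them, with NIL added
    # when the dataset considers NIL.
    e = sorted(e3)
    if use_nil:
        e.append(NIL)
    return e, False


candidates, direct = select_ranking_set([], [], ["D2"], use_nil=True)
```

Candidates are sorted only to make the output deterministic; the order carries no meaning before ranking.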

Candidate ranking
Suppose that there are n elements in E ($E = \{y_1, y_2, \cdots, y_n\}$) for an entity mention m. We design a CNN architecture (as shown in Fig. 2) to compute the similarities of the mention-and-candidate pairs $\langle m, y_1 \rangle, \langle m, y_2 \rangle, \cdots, \langle m, y_n \rangle$ based on both morphological information and semantic information, and to rank them. The architecture is composed of six layers, which can be divided into two parts: 1) semantic representation, including an input layer, a convolutional layer and a pooling layer; 2) ranking based on similarity, including a joint layer, a hidden layer and a softmax layer. In our study, we use a pairwise approach in learning to rank. In the training phase, a mention-and-candidate pair $\langle m, y_i \rangle$ ($1 \le i \le n$) is labeled as 1 if it appears in the training set, and 0 otherwise.
Semantic representation In the input layer, given an entity mention m with l words and a candidate y with k words, each word w of an entity mention or a candidate is represented by a word embedding $e_w \in \mathbb{R}^t$ of size t. We concatenate the word embeddings of all words of m and y to form their representations: a matrix of size $t \times l$ for m and a matrix of size $t \times k$ for y. Taking $m = w_1, w_2, \cdots, w_l$ as an example, it is represented by the matrix $x_m = [e_{w_1}, e_{w_2}, \cdots, e_{w_l}] \in \mathbb{R}^{t \times l}$.
In the convolutional layer, the matrix of m and that of y are converted into two groups of feature vectors by applying a convolution operator to context windows of different sizes according to filters of different sizes. Given a filter $s \in \mathbb{R}^{t \times c}$ of size c, for example, the following feature $f_i$ can be generated by applying the rectified linear unit (ReLU) function to a c-word context window of m, denoted by $w_{i:i+c-1} = w_i, w_{i+1}, \cdots, w_{i+c-1}$:

$$f_i = \mathrm{ReLU}(s \cdot x_{i:i+c-1} + b)$$

where $x_{i:i+c-1} = [e_{w_i}, e_{w_{i+1}}, \cdots, e_{w_{i+c-1}}]$ is the representation of $w_{i:i+c-1}$ and $b \in \mathbb{R}$ is a bias. When applying the convolution operator with filter s to all possible c-word context windows (i ranging from 1 to $l-c+1$), we obtain a feature vector $f = [f_1, f_2, \cdots, f_{l-c+1}] \in \mathbb{R}^{l-c+1}$. With different filters of different sizes, we obtain a group of feature vectors for m or y. The number of feature vectors equals the number of filters.
In the pooling layer, 1-max pooling is used to extract the most important feature from each feature vector, which reduces the computational complexity of the architecture. That is, $\hat{f} = \max\{f_1, f_2, \cdots, f_{l-c+1}\}$ is extracted to represent the feature vector f mentioned above. Suppose that there are p filters; we then obtain a new vector of length p to represent m or y, denoted by $v_m = [\hat{f}_1, \hat{f}_2, \cdots, \hat{f}_p]$ for m, where $\hat{f}_i$ is the feature extracted from the i-th feature vector generated by the convolutional layer.
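The convolution and 1-max pooling steps can be sketched in plain Python as below. The data layout (one embedding list per word, one weight list per filter position) and the function names are illustrative choices, and the zero-padding of mentions shorter than the filter size is an assumption that the text does not specify.

```python
from typing import List

Matrix = List[List[float]]  # one embedding of length t per word


def conv_feature_vector(x: Matrix, s: Matrix, b: float) -> List[float]:
    """Apply one filter s (c positions, each of length t) to every c-word window."""
    c, t = len(s), len(s[0])
    # ASSUMPTION: pad with zero vectors when the input has fewer than c words.
    padded = list(x) + [[0.0] * t] * max(0, c - len(x))
    feats = []
    for i in range(len(padded) - c + 1):
        total = b
        for j in range(c):
            total += sum(sw * xw for sw, xw in zip(s[j], padded[i + j]))
        feats.append(max(0.0, total))  # ReLU
    return feats


def represent(x: Matrix, filters: List[Matrix], biases: List[float]) -> List[float]:
    """1-max pooling over each filter's feature vector gives a vector of length p."""
    return [max(conv_feature_vector(x, s, b)) for s, b in zip(filters, biases)]


# Toy example with t = 2, l = 3 and one filter of size c = 2.
x_m = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
s = [[1.0, 0.0], [0.0, 1.0]]
features = conv_feature_vector(x_m, s, -1.0)
v_m = represent(x_m, [s], [-1.0])
```

With l = 3 words and c = 2 the filter yields l - c + 1 = 2 features, and pooling keeps only the larger one.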
Ranking based on similarity where $r_h$ is the weight vector of $v_{joint}$ and $b \in \mathbb{R}$ is a bias. Finally, the output of the hidden layer is further fed to the softmax layer for ranking (i.e. a pairwise learning-to-rank approach). The ranking score is calculated using the following function: where $o_h$ is the output of the hidden layer and $q_0$, $q_1$ are the weights of the 0 and 1 labels, respectively. The candidate y that achieves the highest score is chosen as the normalized entity.

Fig. 2 Architecture of the CNN-based ranking module
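The equations for the joint, hidden and softmax layers are not preserved in this text, so the following is only a sketch of one plausible reading, not the authors' exact formulation. It assumes that the joint vector concatenates $v_m$, a single morphological similarity value and $v_y$, that the hidden layer applies a tanh activation to $r_h \cdot v_{joint} + b$, and that the ranking score is the softmax probability of label 1 computed from $o_h q_0$ and $o_h q_1$. All function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def hidden_output(v_joint: Sequence[float], r_h: Sequence[float], b: float) -> float:
    # ASSUMPTION: tanh activation; the source does not state the activation.
    return math.tanh(sum(r * v for r, v in zip(r_h, v_joint)) + b)


def ranking_score(o_h: float, q0: float, q1: float) -> float:
    """Softmax probability of label 1 given the hidden output (assumed form)."""
    z0, z1 = o_h * q0, o_h * q1
    m = max(z0, z1)  # subtract the maximum for numerical stability
    e0, e1 = math.exp(z0 - m), math.exp(z1 - m)
    return e1 / (e0 + e1)


def rank(v_m: Sequence[float],
         candidates: List[Tuple[str, Sequence[float], float]],
         r_h: Sequence[float], b: float, q0: float, q1: float) -> List[str]:
    """candidates holds (name, v_y, morphological_similarity) triples."""
    scored = []
    for name, v_y, sim in candidates:
        # ASSUMPTION: joint layer = [v_m, morphological similarity, v_y].
        v_joint = list(v_m) + [sim] + list(v_y)
        scored.append((ranking_score(hidden_output(v_joint, r_h, b), q0, q1), name))
    return [name for _, name in sorted(scored, key=lambda p: p[0], reverse=True)]


ordered = rank([1.0], [("a", [0.0], 0.0), ("b", [1.0], 1.0)],
               r_h=[1.0, 1.0, 1.0], b=0.0, q0=-1.0, q1=1.0)
```

The first element of the returned list plays the role of the normalized entity; with untrained toy weights the ordering is of course only illustrative.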

Datasets
We evaluate our approach on two benchmark datasets for biomedical entity normalization: the ShARe/CLEF eHealth dataset (ShARe/CLEF) [2] and the NCBI disease dataset (NCBI) [5]. The first dataset is about disorder (i.e. disease and problem) normalization in clinical text, and the second dataset is about disease normalization in biomedical literature. Each has its own reference KB, and only the ShARe/CLEF dataset considers NIL (i.e., CUI-less). Detailed information on the two datasets is shown in Table 2, where "#*" is the number of '*', such as documents (doc), entities (ent), mentions (men), mention identifiers (ID) and NIL.

Experimental settings
We start with a baseline system, a reimplementation of D'Souza & Ng's system [7], investigate our CNN-based ranking system under two different settings: 1) CNN-based ranking without morphological information, and 2) CNN-based ranking with morphological information, and compare it with other state-of-the-art systems, including the best challenge system on the dataset of the ShARe/CLEF 2013 challenge (UWM), the best rule-based system to date (D'Souza & Ng's system), and the best machine learning-based system to date (TaggerOne). The dimensionality of the input word embeddings is set to 50 (i.e. t = 50). Each dimension is first randomly sampled from the uniform distribution U[−0.25, 0.25], then pre-trained by the word2vec tool, an implementation of the unsupervised word embedding learning algorithm proposed by Mikolov [19], on an unannotated dataset of PubMed biomedical abstracts with about 219 million words, and finally fine-tuned on each training set. We use two kinds of filters of different sizes (i.e. c = 2 and 3), and set the number of each kind of filter to 50 (i.e., p = 50). All model parameters are optimized on each training set using 10-fold cross-validation: t = 50 is the best value chosen from [50, 100, 200], and p = 50 is the best value chosen from [10, 20, 50, 100]. Model performance is measured by accuracy (i.e., the proportion of mentions correctly normalized) on each test set.
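The settings above can be collected in a small sketch that also shows the random initialization step and the accuracy metric. The constant and function names are hypothetical, and the word2vec pre-training and fine-tuning stages are deliberately left out.

```python
import random
from typing import Dict, List, Sequence

EMBEDDING_SIZE_GRID = [50, 100, 200]   # candidate values for t
NUM_FILTERS_GRID = [10, 20, 50, 100]   # candidate values for p (per filter size)
FILTER_SIZES = (2, 3)                  # c
CHOSEN_T, CHOSEN_P = 50, 50            # values selected by 10-fold cross-validation


def init_embeddings(vocab: Sequence[str], t: int = CHOSEN_T,
                    seed: int = 0) -> Dict[str, List[float]]:
    """Sample every dimension from U[-0.25, 0.25] before any pre-training."""
    rng = random.Random(seed)
    return {w: [rng.uniform(-0.25, 0.25) for _ in range(t)] for w in vocab}


def accuracy(predicted: Sequence[str], gold: Sequence[str]) -> float:
    """Proportion of mentions normalized to the correct entity."""
    if len(predicted) != len(gold) or not gold:
        raise ValueError("predicted and gold must be non-empty and of equal length")
    return sum(p == g for p, g in zip(predicted, gold)) / len(gold)


emb = init_embeddings(["tremulous", "polyps"])
acc = accuracy(["D1", "NIL"], ["D1", "D2"])
```

The fixed seed is only there to make the sketch reproducible; the text does not say how the random generator was seeded.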

Results
Our CNN-based ranking biomedical entity normalization system using morphological information (denoted by "CNN-based ranking" in Table 3) is better than the system without morphological information (denoted by "CNN-based ranking # "), and achieves the highest accuracies of 90.30% on the ShARe/CLEF test set and 86.10% on the NCBI test set, which are much higher than the accuracies achieved by the rule-based baseline system (see Table 3, where NA denotes that no result was reported). The improvement of accuracies on the two test sets ranges from 0.77 to 1.45% with an average of 0.74%. The best CNN-based ranking system outperforms UWM by 0.8% in accuracy on the ShARe/CLEF test set. Compared with D'Souza & Ng's system, our best CNN-based ranking system shows much higher accuracy on the NCBI test set, but slightly lower accuracy on the ShARe/CLEF test set. Compared with TaggerOne, the accuracy of our best CNN-based ranking system is lower by about 2.70% on the NCBI test set. It should be noted that our reimplementation of D'Souza & Ng's system obtains the same accuracy on the NCBI test set but a different accuracy on the ShARe/CLEF test set, because we cannot completely reconstruct the dictionary used in their system, which was not released. This may be the main reason why our system's performance is slightly lower than theirs.

Discussion
In this paper, we propose a CNN-based ranking method for biomedical entity normalization, which can take advantage of not only semantic information but also morphological information of biomedical entity mentions. Experiments on two benchmark datasets show that the proposed CNN-based ranking method outperforms other state-of-the-art systems that only consider morphological information.
Our CNN-based ranking system uses the same rules as the baseline system to generate candidates of entity mentions, and shows much better performance than the baseline system. The reason is that the CNN-based ranking system chooses the entity most semantically similar to each entity mention from the candidates most morphologically similar to the mention, instead of choosing an entity by artificially defined priority levels. Semantic information of entity mentions provides a route to handle the two cases mentioned in the "Background" section. For example, the entity mentions "tremulousness" and "tremulous" (from the ShARe/CLEF test set), which have similar morphology, are normalized into "tremors" and "tremulous" by the rule-based baseline system, among which "tremulousness" is wrongly normalized, but they are both correctly normalized into "tremulous" by our CNN-based ranking system. The entity mentions "metaplastic polyps of the colorectum" and "colonic polyps" (from the NCBI test set), which have significantly different morphology, are normalized into "polyps" and "colonic polyps" by the baseline system, among which "metaplastic polyps of the colorectum" is wrongly normalized, but they are correctly normalized into "colorectal polyps" and "colonic polyps", respectively, by our CNN-based ranking system. The normalization process of the two systems for these four mentions is shown in detail in Table 4. Although our system does not show better results than some state-of-the-art systems using rich manually-crafted features (e.g., TaggerOne), this study demonstrates the potential of CNN for biomedical entity normalization. How to integrate systems using rich manually-crafted features with CNN-based systems for further improvement may be a direction for future work.
There are two main limitations of our CNN-based ranking system, although it achieves good performance. Firstly, candidate generation relies on handcrafted rules, which determines the upper boundary of the CNN-based ranking system, that is, the proportion of candidate sets that contain the correct entity. On the ShARe/CLEF and NCBI test sets, the upper boundaries are 91.33% and 87.45% respectively, indicating that a number of correct entities lie beyond the reach of the rules. Secondly, the CNN-based ranking system completely ignores the ambiguity of entity mentions. To evaluate the effect of ambiguity, we calculate the proportion of ambiguous entity mentions among all entity mentions, which is 4.48% on the ShARe/CLEF test set and 1.56% on the NCBI test set. Among these ambiguous mentions, 25.83% on the ShARe/CLEF test set and 13.33% on the NCBI test set are wrongly normalized. Besides ambiguity errors, there are also many other errors in our system: 1) mapping errors between various entity mentions, such as "downbeat nystagmus", which should be mapped to "symptomatic nystagmus" but is wrongly mapped to "nystagmus, myoclonic"; 2) mapping errors caused by "NIL", such as "cerebral microbleeds", which should be mapped to "NIL" but is wrongly mapped to "brain ischemia", and "oscillopsia", which should be mapped to "cyclophoria" but is wrongly mapped to "NIL". These two types of errors account for 4.98% and 3.55% of all entity mentions on the ShARe/CLEF test set, and 12.96% and 0.73% on the NCBI test set, respectively.
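The upper boundary discussed above can be computed with a few lines, under the reading that it is the fraction of mentions whose gold entity appears in the generated candidate set. The function name and the toy data are hypothetical.

```python
from typing import Sequence, Set


def upper_bound(candidate_sets: Sequence[Set[str]], gold: Sequence[str]) -> float:
    """Fraction of mentions whose gold entity is among the generated candidates.

    No ranking model can exceed this accuracy, because a mention whose gold
    entity is missing from its candidate set can never be normalized correctly.
    """
    if len(candidate_sets) != len(gold) or not gold:
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(1 for cands, g in zip(candidate_sets, gold) if g in cands)
    return hits / len(gold)


# Toy data: the second mention's gold entity was never generated.
bound = upper_bound([{"D1", "D2"}, {"D3"}, {"NIL"}], ["D1", "D4", "NIL"])
```

Here the bound is 2/3, since only two of the three candidate sets contain the gold entity.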
For further improvement, there are two possible directions: 1) increase the upper boundary of the CNN-based ranking system by generating some candidates that are semantically similar to entity mentions; 2) design an additional disambiguation module.

Table 4 Normalization process of the rule-based baseline system and our CNN-based ranking system for some entity mentions