Currently, the vast majority of biomedical resources are in unstructured form. They originate from an assortment of sources that follow nonstandard naming conventions, which makes the required information difficult to use and understand [10]. Ontologies help researchers overcome these difficulties and make use of the vast amounts of biomedical knowledge available [41]. An ontology can provide a unique identifier for each entity, which addresses the heterogeneity problem and provides standardized and homogeneous data [39].
Linking named entities in text through an ontology is an essential step in making sense of the identified named entities [11]. Given an ontology/dictionary containing a set of entities E and a text containing a set of entity mentions M, entity linking is the task of mapping each named entity mention m in the text to its corresponding entity e in the ontology/dictionary, where m∈M and e∈E [40]. This task is also called entity normalization, entity grounding, or entity categorization; these terms are used interchangeably throughout this paper.
Figure 1 shows a sample text with annotated bacteria habitat (biotope) mentions, which are represented in bold, and Fig. 2 shows a sample portion of Onto-Biotope, which is an ontology for bacteria habitats. Given a sample text with annotated habitat mentions, the aim of habitat entity normalization is to link the mentions to their corresponding concepts in the Onto-Biotope ontology. For instance, “pediatric”, “respiratory”, and “children less than 2 years of age” are habitat entity mentions. The concept that is associated with the “pediatric” habitat mention in the Onto-Biotope ontology is “pediatric patient”, the one associated with the “respiratory” habitat mention is “respiratory tract part”, and for “children less than 2 years of age” it is “pediatric patient”.
The association between the entity mention “pediatric” and the ontology concept term name “pediatric patient” can be detected relatively easily due to the lexical similarity between them. Similarly, the habitat mention “respiratory” and the ontology concept “respiratory tract part” share a common word, making them lexically similar. However, lexical similarity does not always exist between entity mentions and concept term names or concept synonyms. For example, there is no lexical similarity between the habitat mention “children less than 2 years of age” and the ontology concept term name “pediatric patient”, which calls for the use of semantic similarity.
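The contrast between these three examples can be made concrete with a simple word-overlap measure. The following is a minimal sketch, assuming whitespace tokenization and Jaccard similarity over word sets as the notion of lexical similarity; the function names are hypothetical and this measure is chosen only for illustration, not taken from any of the systems discussed in this paper.

```python
def tokenize(text):
    """Lowercase and split on whitespace (a deliberately simple tokenizer)."""
    return set(text.lower().split())


def jaccard(mention, concept_name):
    """Word-level Jaccard similarity: shared words / all distinct words."""
    a, b = tokenize(mention), tokenize(concept_name)
    if not a or not b:
        return 0.0
    return len(a & b) / len(a | b)


# The three mention/concept pairs discussed in the text.
pairs = [
    ("pediatric", "pediatric patient"),
    ("respiratory", "respiratory tract part"),
    ("children less than 2 years of age", "pediatric patient"),
]

for mention, concept in pairs:
    print(f"{mention!r} vs {concept!r}: {jaccard(mention, concept):.2f}")
```

The first two pairs share a word and receive non-zero scores, whereas the third pair shares no word and scores zero, although the mention and the concept refer to the same habitat.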
Entity normalization can also be performed through a dictionary. For instance, the sample sentence “In Study 3, 67% of patients treated with ADCETRIS experienced any grade of neuropathy.” states a relation between the drug mention “ADCETRIS” and the adverse drug reaction mention “neuropathy”. The adverse drug reaction mention “neuropathy” can be normalized to the “peripheral neuropathy” term in the Medical Dictionary for Regulatory Activities (MedDRA) [7].
Even when the named entities are given, linking them to unique concept identifiers in an ontology/dictionary is not a trivial task in the biomedical domain. Among the many challenges of named entity linking through an ontology or a dictionary are the variety and ambiguity of named entities [4]. A named entity may appear in different surface forms in a given text, which is called the variety problem. Furthermore, two named entities with the same surface form may have different meanings, which is called the ambiguity problem. Besides these two problems, which are common across natural language processing, entity linking in the biomedical domain faces a further major challenge: the training data are relatively small, and the number of ontology/dictionary categories to be considered is large, compared to many other domains in natural language processing [6]. This poses a challenge for standard supervised classification algorithms. For example, there are 2,221 semantic categories in the Onto-Biotope ontology, while the available training set contains only 747 entity mentions and 16,295 words. For adverse drug reaction normalization, the situation is worse, since there are 22,499 MedDRA dictionary terms.
In this paper, we propose an unsupervised approach for the ontology-based normalization of named entity mentions in text, which utilizes both semantic and syntactic information. The proposed approach uses word embeddings learned from large unlabeled text to capture semantic information, and syntactic parsing information to re-rank the candidate ontology/dictionary concept terms. The approach is tested on two data sets: the BioNLP Shared Task 2016 Bacteria Biotopes (BB3) categorization sub-task data, to normalize habitat entities through the Onto-Biotope ontology, and the Text Analysis Conference 2017 Adverse Drug Reaction data, to normalize adverse drug reaction mentions through the MedDRA dictionary. On both data sets, the proposed normalization method with syntactic re-ranking achieved better performance than the normalization method without syntactic re-ranking. Furthermore, we obtained new state-of-the-art results for the Bacteria Biotopes (BB3) categorization sub-task, 2.9 percentage points above the previous best result.
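To illustrate the general idea of ranking candidate concepts by embedding similarity, the following minimal sketch averages word vectors into phrase vectors and ranks concept names by cosine similarity to a mention. It is not the implementation evaluated in this paper: the three-dimensional vectors are invented toy values rather than embeddings learned from text, averaging is only one possible way to compose phrase vectors, all names are hypothetical, and the syntactic re-ranking step is omitted because its details are not given in this section.

```python
import math

# Toy 3-dimensional "embeddings", invented for illustration only.
# Real embeddings would be learned from a large unlabeled corpus.
TOY_VECTORS = {
    "children": (0.9, 0.1, 0.0),
    "age": (0.7, 0.2, 0.1),
    "years": (0.6, 0.3, 0.1),
    "pediatric": (0.8, 0.2, 0.0),
    "patient": (0.6, 0.1, 0.3),
    "respiratory": (0.0, 0.9, 0.1),
    "tract": (0.1, 0.8, 0.2),
    "part": (0.2, 0.5, 0.5),
}


def phrase_vector(phrase):
    """Average the vectors of in-vocabulary words (assumed composition)."""
    vecs = [TOY_VECTORS[w] for w in phrase.lower().split() if w in TOY_VECTORS]
    if not vecs:
        return None
    return tuple(sum(component) / len(vecs) for component in zip(*vecs))


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def rank_concepts(mention, concept_names):
    """Return (concept name, score) pairs sorted by decreasing similarity."""
    m = phrase_vector(mention)
    scored = []
    for name in concept_names:
        c = phrase_vector(name)
        score = cosine(m, c) if m is not None and c is not None else 0.0
        scored.append((name, score))
    return sorted(scored, key=lambda item: item[1], reverse=True)


ranking = rank_concepts(
    "children less than 2 years of age",
    ["pediatric patient", "respiratory tract part"],
)
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

With these toy vectors, “pediatric patient” is ranked above “respiratory tract part” for the mention “children less than 2 years of age”, even though the mention and the concept name share no words.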
Related work
Several approaches have been proposed for the normalization of different types of biomedical entities, including genes/proteins [20, 32, 36, 46], bacteria biotopes [6, 13, 23, 37, 43], and diseases [14, 28]. Early systems tried to link entity mentions to knowledge base entities by utilizing dictionary look-up and string matching algorithms [16, 36]. Some studies [14, 23] used hand-written rules to measure the morphological similarity between entity mentions and ontology/dictionary entities, while others [17] automatically learned patterns of variations of the entities. Machine learning-based approaches, which learn the similarities between biomedical entity mentions and ontology concept names from labeled training data, have also been proposed and applied to the normalization of various biomedical entities such as diseases [28].
Most previous studies focused on utilizing morphological information for named entity normalization. However, morphological similarity alone is not adequate to normalize biomedical entities, which generally have forms different from the concept terms that they should be tagged with [6]. Word embedding models, which learn distributed representations of words from large unlabeled corpora, are promising approaches for capturing semantic information [34]. They have been successfully used in several recent Natural Language Processing (NLP) tasks, including tasks in the biomedical domain [3, 8, 35, 42]. Recently, word embeddings have also been used for the task of biomedical named entity normalization. Li et al. [30] proposed a convolutional neural network (CNN) architecture leveraging semantic and morphological information, which handles the biomedical entity normalization task as a ranking problem. In their method, candidates are first generated using hand-crafted rules and are then ranked according to semantic and morphological information, which are represented by a CNN-based model. Experiments on two benchmark datasets (the ShARe/CLEF eHealth dataset and the NCBI disease dataset) showed that semantic information, as well as morphological information, is beneficial for the biomedical entity normalization task. However, the requirement of hand-crafted rules and labeled data makes the adaptation of this method to different domains harder and more time-consuming. Cho et al. [9] proposed a semi-supervised approach that uses word embeddings to represent semantic spaces for normalizing biomedical entities such as disease names and plant names, and obtained promising performance. This method requires a domain-specific corpus and dictionary. Therefore, it is not easy to adapt to other domains for which no such resources are available.
A number of community-wide challenges, including the BioCreative Challenges [1, 2, 22, 29, 47] and the BioNLP Shared Tasks [13, 24, 25, 37], which have been conducted to assist the progress of research in biomedical text mining, have also addressed the task of biomedical entity normalization. The Bacteria Biotope task, whose ultimate aim is information extraction regarding bacteria and their habitats, was first addressed in the BioNLP Shared Task 2011 [5, 25], and was conducted again in 2013 [6, 37] and 2016. We evaluated our proposed approach on the BB-cat subtask of the 2016 edition of the Bacteria Biotope task, which addressed the normalization of habitat entity mentions in PubMed abstracts using the Onto-Biotope ontology [13]. In the official task, the teams TagIt [12] and LIMSI [18] proposed rule-based methods, while BOUN [43] proposed a similarity-based method that utilizes both approximate string matching and cosine similarity of word vectors weighted with Term Frequency-Inverse Document Frequency (TF-IDF). According to the official results, the best precision (62%) for habitat mention normalization was obtained by the BOUN system.
The bacteria habitat mention normalization problem continued to attract the attention of researchers after the shared task. CONTES is a recently proposed semi-supervised method for linking habitat entity mentions through the Onto-Biotope ontology [15]. The system is based on word embeddings that are induced from PubMed by utilizing the Word2Vec tool. The cosine similarities between term vector representations and concept vector representations are calculated to find the ontology concept most similar to the given entity mention. The authors applied the method to the test dataset of the Bacteria Biotope 2016 Task 3 (BB-cat), and obtained results comparable to the state of the art for the task of Bacteria Biotopes categorization. CONTES contains a transformation step to make the term vectors and the entity vectors, which are represented in spaces of different dimensions, comparable. The need for the transformation step makes the method semi-supervised, since it requires labeled data for training the prediction model. Recently, Mehryary et al. [33] used a TF-IDF weighted vector space representation for the named entity categorization of bacteria biotopes. Each ontology concept name and each entity mention is represented with a TF-IDF weighted vector, where each concept name in the ontology is considered a separate document and the IDF weights are calculated based on these names. The ontology concept with the highest cosine similarity is assigned to a given entity mention. Although they achieved state-of-the-art results in the normalization task, the TF-IDF based scheme has limitations in capturing the semantic relations between the ontology concepts and entity mentions, since it is primarily based on the surface forms of the words.
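The TF-IDF scheme described above, and its limitation, can be sketched as follows. This is an illustrative approximation and not the implementation of Mehryary et al. [33]: the smoothed IDF formula, the whitespace tokenization, the three-concept toy ontology, and all function names are assumptions made for this example.

```python
import math
from collections import Counter


def build_idf(concept_names):
    """IDF weights, treating each concept name as a separate document."""
    n = len(concept_names)
    df = Counter()
    for name in concept_names:
        df.update(set(name.lower().split()))
    # Smoothed IDF; the exact weighting formula is an assumption.
    return {w: math.log((1 + n) / (1 + d)) + 1.0 for w, d in df.items()}


def tfidf_vector(text, idf):
    """Sparse TF-IDF vector; words unseen in the concept names are dropped."""
    tf = Counter(w for w in text.lower().split() if w in idf)
    return {w: count * idf[w] for w, count in tf.items()}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(w, 0.0) for w, weight in u.items())
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0


def normalize(mention, concept_names):
    """Return the most similar concept name, or None if nothing overlaps."""
    idf = build_idf(concept_names)
    m = tfidf_vector(mention, idf)
    scores = {name: cosine(m, tfidf_vector(name, idf)) for name in concept_names}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0.0 else None


# Toy ontology fragment (hypothetical selection of concept names).
concepts = ["pediatric patient", "adult patient", "respiratory tract part"]

print(normalize("respiratory", concepts))
print(normalize("pediatric", concepts))
print(normalize("children less than 2 years of age", concepts))
```

In this sketch the mentions “respiratory” and “pediatric” are matched through their shared words, while “children less than 2 years of age” has no word in common with any concept name and therefore receives no match, which is the kind of surface-form limitation noted above.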
Besides the Bacteria Biotopes normalization task, we also evaluate our approach on the task of normalizing Adverse Drug Reaction (ADR) mentions in drug labels to MedDRA terms. We use the recently released data set from the Text Analysis Conference (TAC) 2017. Different types of data sources, such as electronic health records [19], scientific publications, and social media data [38], and different types of lexicons, such as the Unified Medical Language System (UMLS) [31] and the side effect resource (SIDER) [44], have been used to extract ADRs from text. Many of these studies proposed a lexicon-based matching approach for ADR recognition. Although a number of studies have been conducted to automatically identify ADRs in text and map them through a dictionary using NLP techniques, to the best of our knowledge the normalization of ADRs through a dictionary has not been studied as a separate task, independent of named entity recognition.