Improving deep learning method for biomedical named entity recognition by using entity definition information

Background Biomedical named entity recognition (NER) is a fundamental task of biomedical text mining that finds the boundaries of entity mentions in biomedical text and determines their entity type. To accelerate the development of biomedical NER techniques in Spanish, the PharmaCoNER organizers launched a competition to recognize pharmacological substances, compounds, and proteins. Biomedical NER is usually treated as a sequence labeling task, and almost all state-of-the-art sequence labeling methods ignore the meaning of different entity types. In this paper, we investigate methods to introduce the meaning of entity types into deep learning methods for biomedical NER and apply them to the PharmaCoNER 2019 challenge. The meaning of each entity type is represented by its definition information.
Material and method We investigate how to use entity definition information in the following two methods: (1) SQuAD-style machine reading comprehension (MRC) methods that treat entity definition information as the query and biomedical text as the context, and predict answer spans as entities. (2) Span-level one-pass (SOne) methods that predict entity spans one type at a time and introduce entity type meaning, which is represented by entity definition information. All models are trained and tested on the PharmaCoNER 2019 corpus, and their performance is evaluated by strict micro-averaged precision, recall, and F1-score.
Results Entity definition information improves both SQuAD-style MRC and SOne methods by about 0.003 in micro-averaged F1-score. The SQuAD-style MRC model using entity definition information as the query achieves the best performance, with a micro-averaged precision of 0.9225, a recall of 0.9050, and an F1-score of 0.9137. It outperforms the best model of the PharmaCoNER 2019 challenge by 0.0032 in F1-score. Compared with the state-of-the-art model that does not use manually crafted features, our model obtains a 1% improvement in F1-score, which is significant. These results indicate that entity definition information is useful for deep learning methods on biomedical NER.
Conclusion Our entity definition information enhanced models achieve a state-of-the-art micro-averaged F1-score of 0.9137, which implies that entity definition information has a positive impact on biomedical NER. In the future, we will explore more entity definition information from knowledge graphs.


Background
Biomedical named entity recognition (NER) is a fundamental task of biomedical text mining that identifies biomedical entity mentions of different types in biomedical text. Most biomedical NER studies focus on English text. To accelerate the development of Spanish biomedical NER techniques, Martin Krallinger et al. organized a challenge for chemical & drug mention recognition in Spanish biomedical text, called PharmaCoNER, in 2019 [1]. Participants were required to recognize the entities in Spanish biomedical text, as shown in Fig. 1.
Biomedical NER is a typical sequence labeling problem, and many state-of-the-art methods have been proposed for it, such as BiLSTM-CRF [2]. Almost none of these methods consider the meaning of different entity types, which may benefit biomedical NER. The meaning of each entity type can be represented by its definition. For example, the definition of PROTEINAS in the guideline of PharmaCoNER 2019 is: "Las menciones de proteínas y genes incluyen péptidos, hormonas peptídicas y anticuerpos." (Protein and gene mentions include peptides, peptide hormones, and antibodies). In this paper, we explore how to encode entity definition information in two kinds of deep learning methods for NER: (1) SQuAD-style MRC methods designed to find a continuous span of entity mentions in a given text for each type. We use each type's entity definition as the query instead of a naive query generated by simple rules. For convenience, we use MRC to denote SQuAD-style MRC in the remainder of this paper. (2) Span-level one-pass (SOne) methods that predict entity spans one type at a time. We use entity definition information to represent each entity type's meaning and introduce it into SOne. The definition information of each type includes the original definition of the type in the guideline and entity mentions in the text. We compare them in the SOne model.
To evaluate the performance of MRC and SOne, we conduct experiments on the PharmaCoNER 2019 corpus. Experiments show that entity definition information improves both MRC and SOne methods, by about 0.003 in micro-averaged F1-score. The MRC method using entity definition information as the query achieves the best performance, with a micro-averaged precision of 0.9225, a recall of 0.9050, and an F1-score of 0.9137. It outperforms the best model of the PharmaCoNER 2019 challenge by 0.0032 in micro-averaged F1-score.

Related work
The natural language processing (NLP) community has made a great contribution to the development of NER in biomedical text through challenges such as I2B2 (Informatics for Integrating Biology and the Bedside) [3,4], BioCreative (Critical Assessment of Information Extraction systems in Biology) [5,6], SemEval (Semantic Evaluation) [7,8], CCKS (China Conference on Knowledge Graph and Semantic Computing) [9,10] and IberLEF [11]. A large number of methods have been proposed for biomedical NER. Most of them can be classified into the following three categories: (1) Rule-based methods that extract named entities using specific rules designed by experts. The earlier clinical NLP tools are rule-based systems relying on clinical dictionaries, such as MedLEE [12], KnowledgeMap [13] and MetaMap [14]. (2) Supervised machine learning methods with hand-crafted features, such as Maximum Entropy (ME) [15,16], Support Vector Machines (SVM) [17], CRF [18,19], Hidden Markov Models (HMM) [20,21] and Structural Support Vector Machines (SSVM) [22]. They usually treat NER as a sequence labeling task, which tags a sentence with a label sequence. The common features used in supervised machine learning methods include orthographic information (e.g., capitalization, prefix, suffix and word shape), syntactic information (e.g., POS tags), dictionary information, n-gram information, discourse information (e.g., section information in EHRs) and some features generated by unsupervised learning methods [23]. (3) Deep learning methods that can learn features from large unlabeled data without costly feature engineering. Convolutional Neural Networks (CNN) [24], Recurrent Neural Networks (RNN) [25] and Long Short-Term Memory networks (LSTM) [2] have been widely used for biomedical NER and show good performance. Besides the methods mentioned above, there are also some other attempts. For example, to tackle the low-resource problem in the biomedical domain, researchers introduce multi-task learning methods to learn richer information from other tasks, such as NER from other sources, chunking, and POS tagging [26-28], and deploy transfer learning methods to first learn knowledge from related sources and then fine-tune on the target [29-33].
Nowadays, there is a growing trend of defining NLP tasks in the MRC framework. MRC models [34-36] extract answer spans from the context given a pre-defined question. Generally, SQuAD-style MRC models can be formalized as predicting the start position and the end position of the answer. Li et al. [37] treat the entity-relation extraction task as multi-turn question answering and propose a unified MRC framework to recognize entities and extract relations. Li et al. [38] propose an MRC method to recognize both flat and nested entities.

Datasets
In this study, all experiments are conducted on the PharmaCoNER 2019 corpus, annotated by medicinal chemistry experts according to a pre-defined guideline. The corpus contains 1000 clinical records with 24,654 chemical & drug mentions. It is divided into a training set of 500 records, a development set of 250 records and a test set of 250 records, where the test set is hidden in a background set of 3751 records during the test stage of the competition. In experiments, we first split each record into sentences by sentence ending symbols, including '\n', '.', ';', '?', and '!'. About 95% of sentences are no longer than 230 tokens. The corpus statistics, including the number of records, sentences, and chemical & drug mentions of different types, are listed in Table 1. It should be noted that UNCLEAR mentions are not considered during the competition.
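The sentence splitting step can be sketched as follows. This is a minimal illustration, not the released preprocessing code: it assumes that each ending symbol stays attached to the sentence it closes and that a bare period counts as a delimiter (so decimals such as "2.5" would also be split). The function name `split_record` and the sample record are hypothetical.

```python
import re

# Ending symbols listed in the text: newline, period, semicolon, '?', '!'.
# Assumption: a sentence is a run of non-delimiters plus one closing delimiter,
# or a trailing run of non-delimiters at the end of the record.
_SENTENCE = re.compile(r"[^\n.;?!]*[\n.;?!]|[^\n.;?!]+$")


def split_record(record: str) -> list[str]:
    """Split one clinical record into non-empty, stripped sentences."""
    pieces = (m.group(0).strip() for m in _SENTENCE.finditer(record))
    return [p for p in pieces if p]


if __name__ == "__main__":
    sample = "Paciente con fiebre. Se pauta paracetamol; mejora\nAlta!"
    for sentence in split_record(sample):
        print(sentence)
```

A length check against the 230-token limit would follow this step, using whichever tokenizer the model uses.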

Task definition
Given a sequence $X = \{x_1, x_2, \ldots, x_n\}$ of length $n$, we need to assign a label sequence $Y = \{y_1, y_2, \ldots, y_n\}$ to $X$, where $y_i$ is the possible label of token $x_i$ (…, PROTEINAS, NORMALIZABLES, NO_NORMALIZABLES, UNCLEAR).
MRC definition: the sequence labeling problem can be redefined in the MRC framework as follows. For each label type $y$, its definition information is regarded as a query $q_y = \{q_1, q_2, \ldots, q_m\}$ of length $m$, a sentence $X$ is regarded as the context of $q_y$, and the span of an entity of type $y$, $x^y_{\text{start}:\text{end}} = \{x_{\text{start}}, x_{\text{start}+1}, \ldots, x_{\text{end}-1}, x_{\text{end}}\}$, is recognized as an answer. The original sequence labeling problem can then be represented by the triple $(q_y, X, x^y_{\text{start}:\text{end}})$. The goal of MRC is to find the spans of all entity mentions of all types, given all sentences.
SOne definition: SOne takes the sequence $X$ as input and predicts the spans of all entities one type at a time using a multi-layer pointer network [39]. The number of network layers depends on the number of entity types. For each entity type, we add entity definition information $e$ to enhance SOne by concatenating it to all tokens.
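To make the MRC reformulation concrete, the sketch below turns one annotated sentence into one (query, context, answers) example per entity type. It is a minimal illustration under stated assumptions: spans use inclusive token indices, every type yields an example even when it has no answer, and the class name `MRCExample`, the function `to_mrc_examples`, the placeholder query strings and the sample annotation are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class MRCExample:
    query: str                      # q_y: definition text for entity type y
    context: list[str]              # X: tokens of the sentence
    answers: list[tuple[int, int]]  # inclusive (start, end) token spans of type y


def to_mrc_examples(tokens, entities, queries):
    """Build one MRC example per entity type.

    tokens:   list of sentence tokens.
    entities: list of (start, end, entity_type) with inclusive token indices.
    queries:  dict mapping entity type to its query text.
    """
    examples = []
    for entity_type, query in queries.items():
        spans = sorted((s, e) for s, e, t in entities if t == entity_type)
        examples.append(MRCExample(query, list(tokens), spans))
    return examples


if __name__ == "__main__":
    tokens = "Se administró heparina y albúmina".split()
    # Made-up annotation, for illustration only.
    entities = [(2, 2, "NORMALIZABLES"), (4, 4, "PROTEINAS")]
    queries = {
        "NORMALIZABLES": "<query for NORMALIZABLES>",
        "PROTEINAS": "<query for PROTEINAS>",
        "NO_NORMALIZABLES": "<query for NO_NORMALIZABLES>",
    }
    for example in to_mrc_examples(tokens, entities, queries):
        print(example.query, example.answers)
```

Keeping the empty-answer examples lets a model learn that a type may be absent from a sentence; whether the authors do the same is not stated in the text.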

Query generation for MRC
Query generation is critical for MRC, since queries usually contain prior knowledge about the task (e.g., entity type definitions). Li et al. [40] introduce and compare various query generation methods, including keywords, Wikipedia, rule-based template filling, synonyms, keywords combined with synonyms, and annotation guideline notes. Their results show that the annotation guideline is the best choice for query generation. Following Li et al. [40], we compare two kinds of query generation: annotation guideline and rule-based template filling. Table 2 shows our generated queries for each type of entity.
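The two query generation strategies can be sketched as a small lookup. This is an illustrative sketch only: the rule template follows the Spanish pattern shown in Table 2, and the only guideline definition included is the PROTEINAS one quoted in the Background section. The names `RULE_TEMPLATE`, `GUIDELINE_DEFINITIONS` and `build_query` are hypothetical, and definitions for the other types would have to be supplied from the annotation guideline.

```python
# Rule-based template, following the pattern of the example in Table 2.
RULE_TEMPLATE = "¿Qué entidades {entity_type} se mencionan en el texto?"

# Guideline definitions. Only the PROTEINAS definition is quoted in the paper;
# the remaining types are deliberately left out rather than guessed.
GUIDELINE_DEFINITIONS = {
    "PROTEINAS": (
        "Las menciones de proteínas y genes incluyen péptidos, "
        "hormonas peptídicas y anticuerpos."
    ),
}


def build_query(entity_type: str, strategy: str = "guideline") -> str:
    """Return the query text for one entity type under the chosen strategy."""
    if strategy == "rule":
        return RULE_TEMPLATE.format(entity_type=entity_type)
    if strategy == "guideline":
        # Raises KeyError when no definition has been supplied for the type.
        return GUIDELINE_DEFINITIONS[entity_type]
    raise ValueError(f"unknown strategy: {strategy!r}")


if __name__ == "__main__":
    print(build_query("PROTEINAS", "rule"))
    print(build_query("PROTEINAS", "guideline"))
```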

Model detail
In this study, we utilize BERT (Bidirectional Encoder Representations from Transformers) [41] as our model backbone. Figure 2 shows the skeleton of the MRC model. Given query $q_y$ and sentence $X$, we need to predict the span of every entity of type $y$, including a start position $x^y_{\text{start}}$ and an end position $x^y_{\text{end}}$. The model first takes the following input and encodes it by BERT:

$$[\text{CLS}],\ q_y,\ [\text{SEP}],\ X,\ [\text{SEP}] \tag{1}$$

where [CLS] and [SEP] are special tokens of BERT, denoting the whole sentence and the sentence separator, respectively. Suppose that the last layer output of BERT is $H \in \mathbb{R}^{s \times d}$, where $s$ is the total length of [CLS], $q_y$, [SEP], $X$ and [SEP], and $d$ is the dimension of the last layer output of BERT. The model then predicts the probabilities of the start position and the end position from $H$, using trainable parameters $W_{\text{start}}$ and $W_{\text{end}}$ and biases $b_{\text{start}}$ and $b_{\text{end}}$.
The predicted start index $I_{\text{start}}$ and end index $I_{\text{end}}$ are then obtained from these probabilities. We use MRC_rule and MRC_guideline to denote MRC using rule-based template filling for query generation and MRC using the annotation guideline as the query, respectively.
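The input layout and the span decoding step can be illustrated with a small, dependency-free sketch. Several points here are assumptions rather than details confirmed by the text: the scores stand in for the outputs of the start and end predictors, a softmax over sentence positions is used, and a single span is decoded by taking the most probable start and then the most probable end at or after it. The function names are hypothetical, and a sentence with several entities of one type would need a multi-span decoder instead.

```python
import math


def build_input(query_tokens, sentence_tokens):
    """Assemble [CLS] q_y [SEP] X [SEP] and return the index where X begins."""
    tokens = ["[CLS]", *query_tokens, "[SEP]", *sentence_tokens, "[SEP]"]
    offset = len(query_tokens) + 2  # skip [CLS], the query and the first [SEP]
    return tokens, offset


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def decode_span(start_scores, end_scores, offset, n):
    """Decode one span, restricted to the n sentence positions.

    Returns inclusive (start, end) indices relative to the sentence X.
    """
    p_start = softmax(start_scores[offset:offset + n])
    p_end = softmax(end_scores[offset:offset + n])
    i_start = max(range(n), key=p_start.__getitem__)
    i_end = max(range(i_start, n), key=p_end.__getitem__)
    return i_start, i_end


if __name__ == "__main__":
    tokens, offset = build_input(["q1", "q2"], ["a", "b", "c"])
    # Made-up scores, one per input position.
    start_scores = [0.0, 0.0, 0.0, 0.0, 0.1, 2.0, 0.3, 0.0]
    end_scores = [0.0, 0.0, 0.0, 0.0, 3.0, 0.2, 1.5, 0.0]
    print(tokens)
    print(decode_span(start_scores, end_scores, offset, 3))
```

Constraining the end index to positions at or after the start avoids invalid spans, which is one common design choice for SQuAD-style decoding.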
Figure 3 shows the skeleton of the SOne model. In this model, we first use BERT to encode the input sentence $X$ as $Z \in \mathbb{R}^{n \times d}$ (i.e., the output of BERT's last layer), and then concatenate the entity definition information representation $e \in \mathbb{R}^{d_e}$ to all tokens, where $d_e$ is the dimension of the entity definition information representation. Here, we consider three kinds of entity definition information, which correspond to the SOne_w2v, SOne_rule and SOne_guideline models in the experiments. For the entity mention word embedding, each entity type's definition information is represented by the mean pooling of word2vec embeddings. The token representations after concatenation are

$$[Z; E]$$

where $E \in \mathbb{R}^{n \times d_e}$ is $n$ copies of $e$, and $[\,;\,]$ denotes the concatenation operation.
Finally, the SOne model makes the same prediction for the start position and end position as the MRC model. The only difference is that SOne has four input-shared span predictors with the same structure and different parameters, while MRC has four separate span predictors. The overall objective function of MRC and SOne combines $L_{\text{start}}$, the start position prediction loss, and $L_{\text{end}}$, the end position prediction loss.
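The concatenation step and the per-type predictors can be sketched with plain Python lists. This is a toy illustration under stated assumptions: vectors are short lists instead of BERT outputs, each predictor is a single linear scorer, and the helper names (`mean_pool`, `concat_definition`, `predict_scores`) and all numbers are hypothetical.

```python
def mean_pool(vectors):
    """Average a non-empty list of equal-length vectors, e.g. mention embeddings."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def concat_definition(Z, e):
    """Append the definition vector e to every token vector in Z.

    Z has n vectors of size d and e has size d_e, so the result has n vectors
    of size d + d_e, i.e. [Z; E] with E being n copies of e.
    """
    return [list(z) + list(e) for z in Z]


def linear(x, w, b):
    """Score one feature vector with weights w and bias b."""
    return sum(xi * wi for xi, wi in zip(x, w)) + b


def predict_scores(Z, type_vectors, heads):
    """Score start and end positions for every type over one shared encoding Z.

    type_vectors: dict type -> definition vector e.
    heads:        dict type -> (w_start, b_start, w_end, b_end).
    """
    scores = {}
    for entity_type, e in type_vectors.items():
        features = concat_definition(Z, e)
        w_start, b_start, w_end, b_end = heads[entity_type]
        scores[entity_type] = (
            [linear(f, w_start, b_start) for f in features],
            [linear(f, w_end, b_end) for f in features],
        )
    return scores


if __name__ == "__main__":
    Z = [[1.0, 2.0], [3.0, 4.0]]             # n = 2 tokens, d = 2
    e = mean_pool([[0.0], [1.0]])            # d_e = 1, gives [0.5]
    heads = {"PROTEINAS": ([1.0, 0.0, 0.0], 0.0, [0.0, 1.0, 2.0], 1.0)}
    print(predict_scores(Z, {"PROTEINAS": e}, heads))
```

Because Z is computed once and reused for every type, the sentence is encoded in a single pass, which is the property the name "one-pass" refers to.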

Evaluation metrics
The performance of all models is measured by micro-averaged precision (P), recall (R), and F1-score (F1) under the "exact-match" criterion:

$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}, \qquad F1 = \frac{2 \times P \times R}{P + R}$$

where TP is the number of true positives, FP is the number of false positives, and FN is the number of false negatives.
These measures can be calculated by the evaluation tool [43] released by the organizers of the PharmaCoNER 2019 challenge.
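For readers who want to reproduce the metric outside the official tool, exact-match micro-averaged scores can be computed as below. This is a minimal sketch, not the official evaluation script: it assumes each entity is represented as a (document id, start, end, type) tuple and counts a prediction as correct only when all four fields match. The function names and the sample annotations are hypothetical.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def micro_prf(gold, pred):
    """Exact-match micro-averaged precision, recall and F1.

    gold, pred: sets of (doc_id, start, end, entity_type) tuples pooled over
    all documents and all entity types.
    """
    tp = len(gold & pred)
    fp = len(pred - gold)
    fn = len(gold - pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall, f1_score(precision, recall)


if __name__ == "__main__":
    gold = {
        ("d1", 0, 5, "PROTEINAS"),
        ("d1", 10, 18, "NORMALIZABLES"),
        ("d2", 3, 9, "NORMALIZABLES"),
    }
    pred = {
        ("d1", 0, 5, "PROTEINAS"),
        ("d1", 10, 16, "NORMALIZABLES"),  # boundary mismatch counts as an error
        ("d2", 3, 9, "NORMALIZABLES"),
        ("d2", 20, 25, "PROTEINAS"),      # spurious prediction
    }
    print(micro_prf(gold, pred))
    print(round(f1_score(0.9225, 0.9050), 4))
```

Applying `f1_score` to the reported precision of 0.9225 and recall of 0.9050 gives about 0.9137, consistent with the F1-score reported for MRC_guideline.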

Experiment setting
Following Xiong's work [44], we first train our models on the training set and development set, and then further fine-tune the model for 20 epochs. The maximum sentence lengths of the MRC model and the SOne model are set to 250 and 230, respectively. The difference in maximum length is due to the query in the MRC model. The learning rate of BERT is set to 2e−5, and the batch size of all models is set to 20. The dimension of the entity definition information representation $d_e$ is set to 300. Other parameters are left at their defaults. The code is available at [45].

Performance evaluation
Table 3 presents the results of our proposed MRC and SOne models (lower part) and summarizes some reported results on the PharmaCoNER corpus (upper part). First, the micro-averaged precision, recall and F1-score are 0.915, 0.9055 and 0.9109 for MRC_rule, and 0.9225, 0.9050 and 0.9137 for MRC_guideline. Both MRC_rule and MRC_guideline outperform the baseline model SOne, by 0.44% and 0.72% in micro-averaged F1-score, respectively. MRC_guideline likely performs better than MRC_rule because the guideline definitions carry expert knowledge. For the extended SOne models, all kinds of entity definition information representation bring improvements over the baseline model SOne. Compared with SOne, the micro-averaged F1-score increases to 0.912 for SOne_rule, to 0.9128 for SOne_guideline, and to 0.9094 for SOne_w2v. The overall micro-averaged F1-score improvements of the extended SOne models range from 0.29% to 0.63%.
Second, MRC_guideline outperforms all existing systems on the PharmaCoNER corpus, setting a new state of the art and pushing the micro-averaged F1-score of the benchmark to 0.9137. This amounts to a 0.32% absolute improvement over the top-1 system of the PharmaCoNER 2019 challenge, which we developed using many hand-crafted features, and a 1% absolute improvement over our previous system without such features [44]. We perform a significance test comparing the model without any features against our MRC model or SOne model, and the results show that the improvement is significant (t-test, p < 0.05) [46]. This implies that entity definition information has a positive impact on entity recognition.
Third, Table 4 shows the detailed results for each entity type of MRC_guideline and SOne_guideline. Both MRC_guideline and SOne_guideline perform best on NORMALIZABLES and worst on NO_NORMALIZABLES. Though MRC_guideline outperforms SOne_guideline in terms of micro-averaged F1-score, it mispredicts every NO_NORMALIZABLES entity. The probable reason is that the queries of NORMALIZABLES and NO_NORMALIZABLES are too similar, which may confuse our models. Overall, MRC_guideline is better than SOne_guideline on micro-averaged precision but worse on micro-averaged recall. Besides, we analyze all our proposed models and find that the SOne models can recognize NO_NORMALIZABLES entities, but the MRC models cannot. This may be because concatenating the entity definition representation benefits types with few samples.

Error analysis
Compared with previous state-of-the-art models, our model can recognize more named entities thanks to the domain knowledge embedded in the entity definition information. For example, because of the introduction of the PROTEINAS definition information, our model can recognize "timoglobulina (thymoglobulin)", "protrombina (prothrombin)" and so on, which are missed by previous state-of-the-art models. To visualize the effect of the added domain knowledge, we calculate the cosine similarity of some words based on their word2vec embeddings. For example, the similarity between "protrombina" and "proteínas" is above 0.5, whereas "protrombina" is less similar to "normalizar" or to the words in the query of the UNCLEAR type.
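The similarity check described above relies on the standard cosine similarity between two embedding vectors. The sketch below shows the computation only; the three-dimensional vectors are made-up stand-ins for word2vec embeddings and do not reproduce the reported values, and the function name `cosine_similarity` is hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


if __name__ == "__main__":
    # Toy vectors, for illustration only (not real word2vec embeddings).
    toy_embeddings = {
        "protrombina": [0.9, 0.1, 0.2],
        "proteínas": [0.8, 0.3, 0.1],
        "normalizar": [0.1, 0.9, 0.4],
    }
    target = toy_embeddings["protrombina"]
    for word in ("proteínas", "normalizar"):
        print(word, round(cosine_similarity(target, toy_embeddings[word]), 3))
```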
Though the MRC_guideline model outperforms other models, it still makes errors, mainly of the following five kinds. (1) About 20% of errors are predicted entities that are not included in the gold test set. Although these predictions, such as "vimentina (vimentin)", are entities that have appeared, they count as errors because they are not officially annotated. (2) About 30% of errors are entities that the model misses. (3) About 16% of errors occur when the model predicts the correct entity type, but the boundary is too long. For instance, the correct entity is "anticuerpos anticitoplasma (cytoplasmic antibodies)", but the model predicts "anticuerpos anticitoplasma de neutrófilo (antineutrophil cytoplasmic antibodies)", or the correct entity is "hormonas de crecimiento (growth hormones)", but the model predicts "hormonas de crecimiento y antidiurética (growth hormones and antidiuretics)". (4) About 20% of errors occur when the model predicts the correct entity type, but the boundary is too short. For example, "tinción de auramina" is wrongly predicted as "auramina (auramine)", "anticuerpos antimembrana basal glomerular (glomerular basement membrane antibodies)" is wrongly predicted as "anticuerpos antimembrana basal (basal membrane antibodies)", and "(Ig)A-kappa" is wrongly predicted as "Ig". (5) About 10% of errors occur when the model predicts the wrong entity type, and in 70% of these cases a NO_NORMALIZABLES entity is mistakenly predicted as NORMALIZABLES, such as "Viekirax", "Tobradex" and "Harvoni".
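The five error kinds can be assigned mechanically by comparing each predicted span with the gold spans of the same sentence. The sketch below is one possible way to do this, not the procedure the authors used: spans are (start, end, type) tuples with inclusive token indices, and the category names, the extra "boundary_shifted" case and both function names are hypothetical.

```python
def overlaps(a, b):
    """True when two inclusive (start, end, ...) spans share at least one token."""
    return a[1] >= b[0] and a[0] <= b[1]


def categorize(pred, gold_spans):
    """Classify one predicted span against the gold spans of its sentence."""
    p_start, p_end, p_type = pred
    if pred in gold_spans:
        return "correct"
    for g_start, g_end, _ in gold_spans:
        if (p_start, p_end) == (g_start, g_end):
            return "wrong_type"             # same boundary, different type
    for gold in gold_spans:
        g_start, g_end, g_type = gold
        if g_type == p_type and overlaps(pred, gold):
            if p_start <= g_start and p_end >= g_end:
                return "boundary_too_long"  # prediction contains the gold span
            if p_start >= g_start and p_end <= g_end:
                return "boundary_too_short" # gold span contains the prediction
            return "boundary_shifted"
    return "not_in_gold"                    # no matching gold annotation


def missed_entities(gold_spans, pred_spans):
    """Gold spans that no prediction overlaps at all."""
    return [g for g in gold_spans if not any(overlaps(p, g) for p in pred_spans)]


if __name__ == "__main__":
    # Tokens: anticuerpos(0) anticitoplasma(1) de(2) neutrófilo(3)
    gold = [(0, 1, "PROTEINAS")]
    print(categorize((0, 3, "PROTEINAS"), gold))
    print(categorize((1, 1, "PROTEINAS"), gold))
    print(categorize((0, 1, "NORMALIZABLES"), gold))
    print(missed_entities(gold + [(8, 9, "NORMALIZABLES")], [(0, 3, "PROTEINAS")]))
```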

Conclusion
This paper proposed two kinds of entity definition information enhanced models, MRC and SOne, for biomedical NER. Compared with previous models, our methods do not require hand-crafted features and achieve state-of-the-art performance with a micro-averaged F1-score of 0.9137 on the PharmaCoNER corpus. This indicates that introducing entity definition information is effective. In the future, we plan to introduce more effective entity category definition information through domain knowledge graphs and to explore better ways of adding entity definition information, such as attention mechanisms.

Fig. 1
Examples of the biomedical named entities in Spanish records. (NORMALIZABLES entities in green, PROTEINAS entities in blue, NO_NORMALIZABLES entities in yellow and UNCLEAR entities in red. Notice that UNCLEAR entities are not included in the final evaluation.)

Table 1
Statistics of the PharmaCoNER 2019 corpus

Table 2
Generated queries for each type of entity
Rule-template: ¿Qué entidades No_NORMALIZABLES se mencionan en el texto? (Which NON-NORMALIZABLE entities are mentioned in the text?)

Table 3
Results on the PharmaCoNER corpus. The method with the highest F-score among all methods is highlighted in bold. * Compared with the model without any feature, this is a significant improvement (t-test < 0.05)

Table 4
Detailed results of each entity type of MRC_guideline and SOne_guideline. The methods with the highest F-scores in each entity type are highlighted in bold.