Knowledge-enhanced biomedical named entity recognition and normalization: application to proteins and genes

Background Automated biomedical named entity recognition and normalization serves as the basis for many downstream applications in information management. However, this task is challenging due to name variations and entity ambiguity. A biomedical entity may have multiple variants and a variant could denote several different entity identifiers. Results To remedy the above issues, we present a novel knowledge-enhanced system for protein/gene named entity recognition (PNER) and normalization (PNEN). On one hand, a large amount of entity name knowledge extracted from biomedical knowledge bases is used to recognize more entity variants. On the other hand, structural knowledge of entities is extracted and encoded as identifier (ID) embeddings, which are then used for better entity normalization. Moreover, deep contextualized word representations generated by pre-trained language models are also incorporated into our knowledge-enhanced system for modeling multi-sense information of entities. Experimental results on the BioCreative VI Bio-ID corpus show that our proposed knowledge-enhanced system achieves 0.871 F1-score for PNER and 0.445 F1-score for PNEN, respectively, leading to a new state-of-the-art performance. Conclusions We propose a knowledge-enhanced system that combines both entity knowledge and deep contextualized word representations. Comparison results show that entity knowledge is beneficial to the PNER and PNEN task and can be well combined with contextualized information in our system for further improvement.


Background
With the rapid development of computer technology and biotechnology, the volume of biomedical literature is growing rapidly. This literature contains a wealth of valuable knowledge, which can be used to promote biomedical development and help people improve their living environment. Furthermore, it is well recognized that the adoption of common database identifiers (IDs) could facilitate data integration and re-use. However, manually annotating entities and their IDs in massive biomedical literature is labor-intensive and costly. New methods and tools need to be developed to support more effective and consistent extraction of biomedical entities and their IDs, thereby facilitating downstream applications such as relation extraction [1] and knowledge base completion [2].
For this purpose, the BioCreative VI Track 1 proposed a challenging task (called Bio-ID Assignment), which focused on entity tagging and ID assignment [3]. There were two specific subtasks in Track 1: 1) biomedical named entity recognition (BioNER) and 2) normalization (BioNEN), also known as disambiguation. The first subtask aimed at automatically recognizing biomedical entities and their types from texts; and the second subtask was to associate entity mentions in texts with their corresponding common IDs in knowledge bases.
BioNER has been widely studied. Most existing approaches treat this problem as a sequence labeling task, which can be handled through traditional machine learning (ML)-based models (e.g., Hidden Markov Models and Conditional Random Fields) with complex feature engineering [4,5]. Although effective, the design of features is labor-intensive and time-consuming. To overcome this drawback, neural networks were proposed to automatically extract features based on word embedding technology [6][7][8]. They constructed feature representations through multi-layer neural networks without relying on complicated feature engineering. Among them, bidirectional long short-term memory with conditional random field model (BiLSTM-CRF) exhibited promising results [6].
Compared with BioNER, BioNEN is a more challenging task. Previous work on this subtask was largely based on domain-specific dictionaries or heuristic rules [9,10], and could achieve relatively high performance. However, these methods have a heavy reliance on the completeness of dictionaries and the design of rules. Therefore, it could be difficult to apply them to new datasets or shift them to new domains. Later, some work [11,12] proposed to convert mentions and candidate entities into a common vector space, and then disambiguated candidate entities by a scoring function (e.g., cosine similarity). In recent years, neural network-based approaches have shown considerable success in entity normalization [13][14][15]. These methods used neural architectures to learn the context representations around an entity mention and calculated the context-entity similarity scores to determine which candidate is a correct assignment.
Although BioNER and BioNEN have been studied extensively, challenges still exist. One is name variation, which means that a named entity may have multiple surface forms, such as its full name, partial names, morphological variants, aliases and abbreviations [16]. The other is entity ambiguity, which means that an entity mention could possibly correspond to different entity IDs [16,17]. Fig. 1 illustrates both problems. In the solid line box of Fig. 1, the variants "VEGF (human)", "MVCD1" and "VPF" all represent the same gene entity (vascular endothelial growth factor), whose ID is "NCBI Gene: 7422". This is the name variations problem (synonymy). The arrow on this side means that different variants can correspond to the same ID in the KB. In the dashed box, the variants "VEGF (human)" and "VEGF (pig)" have the same entity name, but correspond to different genus IDs ("NCBI Gene: 7422" and "NCBI Gene: 397157", respectively). This is the entity ambiguity problem (polysemy). The arrow on this side means that the same mention can have several variants with different IDs in the KB.
Large-scale biomedical knowledge bases (KBs) such as UniProt [18] and NCBI Gene [19] contain rich information about protein/gene entities and the structural relationships between them. This information is quite useful for solving the above two issues. Luo et al. [20] and Akhondi et al. [21] showed that prior chemical entity name information provided by domain dictionaries could help boost NER performance. However, the structural information of entities, and how to use it for the Bio-ID Assignment task, has not yet been well studied.
Besides, the multi-sense information of words has been leveraged and empirically verified to be powerful in many sequence labeling tasks [22,23]. Peters et al. [22] proposed a deep contextualized word representation method, called Embeddings from Language Models (ELMo). This method directly adopted pre-trained bi-directional Language Models (biLMs) to integrate both semantic and multi-sense information of words as context-dependent embeddings. As far as we know, ELMo has not been well explored in biomedical domains.
Fig. 1 Illustration of structure information of entities. The first column represents the mention, the middle column represents the variant corresponding to the mention, and the last column represents the entity ID. The solid line box represents the name variations and the arrow in this side means that different variants can correspond to the same ID in the KB. The dashed box represents the entity ambiguity and the arrow in this side means that the same mention can have several variants with different ID in the KB
In this paper, we propose a novel knowledge-enhanced system that could employ rich entity knowledge and deep contextual word representations for protein/gene named entity recognition (PNER) and normalization (PNEN). Specifically, entity name knowledge in KBs is introduced into a BiLSTM-CRF model to recall more protein/gene mentions. Then, structural knowledge of entities is encoded by an autoencoder into ID embeddings, which are used for entity disambiguation during the PNEN phase. To further explore the entity ambiguity issue, ELMo is also incorporated into our system to capture underlying meanings for each word. Experiments on the BioCreative VI Bio-ID corpus show that our proposed knowledge-enhanced system could effectively leverage prior knowledge and achieves state-of-the-art performance on both PNER and PNEN subtasks.
The contributions of this work are summarized as follows: We explore the effect of ELMo representations on entity recognition and normalization in biomedical domains. Experimental results show that it could accurately capture context-dependent aspects of word meaning, therefore effectively improving the performance of PNER and PNEN. We integrate structural knowledge of entities into ID embeddings, which can be beneficial to remedy the entity ambiguity issue faced by PNEN. Entity name knowledge, which can be used as prior clues to better address the issue of name variations, is also incorporated into our system.

Experiment setup

Dataset
Our experiments are conducted on the corpus published by BioCreative VI Bio-ID Track 1 [3], which is drawn from annotated figure panel captions from SourceData [24] and is converted into BioC format along with the corresponding full text articles. The Bio-ID corpus contains a training set and a test set. The training set consists of 13,573 annotated figure panel captions corresponding to 3658 figures from 570 full length articles, with a total of 51,977 annotated Protein/Gene IDs. The test set consists of 4310 annotated figure panel captions from 1154 figures taken from 196 full length articles, with a total of 14,232 annotated Protein/Gene IDs. Table 1 shows statistics on the number of IDs that an entity mention has (entity ambiguity) in the Bio-ID corpus. For each target mention in the Bio-ID corpus, we estimate ambiguity as the number of different IDs associated with it by human annotators. From Table 1, we can see that: (1) many mentions tend to be highly skewed, in the sense that they usually refer to one specific entity; (2) nearly one-third of mentions correspond to two or more different IDs; (3) the ambiguity rate per ambiguous mention is 2.79 on the training set and 2.41 on the test set. Table 2 shows statistics on the number of entity variants corresponding to a specific entity ID (name variations) in the Bio-ID corpus. We compute the synonymy rate as the number of different variants that can be used to name a particular ID. From Table 2, we can see that: (1) for the 5282 IDs present in the training set and the 1980 IDs present in the test set, most entities are associated with only a single variant (78% in the training set and 85% in the test set, respectively); (2) the synonymy rate is lower than the ambiguity rate, at 2.46 on the training set and 2.26 on the test set.
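The ambiguity and synonymy statistics described above can be illustrated with a minimal sketch. It assumes the annotations are available as (mention, ID) pairs and, as a simplification, treats each distinct mention string as a variant; the function name `group_stats` and the toy data are hypothetical and are not taken from the Bio-ID tooling.

```python
from collections import defaultdict

def group_stats(pairs):
    """Group values by key and summarise how many distinct values each key has.

    pairs: iterable of (key, value) tuples.
    Returns counts of keys with one value, keys with two or more values,
    and the average number of values over the keys with two or more.
    """
    values_per_key = defaultdict(set)
    for key, value in pairs:
        values_per_key[key].add(value)
    counts = [len(v) for v in values_per_key.values()]
    multiple = [c for c in counts if c >= 2]
    return {
        "total": len(counts),
        "single": len(counts) - len(multiple),
        "multiple": len(multiple),
        "rate": sum(multiple) / len(multiple) if multiple else 0.0,
    }

# Toy (mention, ID) annotations, for illustration only.
annotations = [
    ("VEGF", "NCBI Gene: 7422"),
    ("VEGF", "NCBI Gene: 397157"),
    ("VPF", "NCBI Gene: 7422"),
    ("MVCD1", "NCBI Gene: 7422"),
    ("Miro1", "NCBI Gene: 59040"),
]

# Ambiguity: distinct IDs per mention (the Table 1 style of statistic).
ambiguity = group_stats(annotations)
# Synonymy: distinct variants per ID (the Table 2 style of statistic).
synonymy = group_stats((entity_id, mention) for mention, entity_id in annotations)
```

Swapping the roles of key and value is all that distinguishes the two statistics, which is why a single helper is used for both.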

Negative sampling
Since our disambiguation model is only given training samples for correct ID assignments, negative sampling is needed to automatically generate samples of corrupt assignments. For each context-ID pair (C, s), where s is the correct ID assignment for the context C around the entity mention, we produce some negative samples with the same context C but with a different entity ID s ′ . Following Eshel et al. [13], we uniformly sample out of the candidate IDs of each mention to obtain a corrupt s ′ for forming each negative sample (C, s ′ ).
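A minimal sketch of this negative sampling step is given below. It assumes that the candidate IDs of a mention are already available as a list; the function name `negative_samples` and the parameter `k` (number of negatives per positive pair) are hypothetical, since the number of negatives drawn per pair is not stated here.

```python
import random

def negative_samples(context, gold_id, candidate_ids, k=1, rng=None):
    """Return up to k corrupt (context, s') pairs for one positive pair (context, gold_id).

    Corrupt IDs are drawn uniformly (with replacement) from the mention's
    candidate IDs, excluding the correct ID. If the mention has no other
    candidate, no negative sample can be formed.
    """
    rng = rng or random.Random()
    corrupt_pool = [cid for cid in candidate_ids if cid != gold_id]
    if not corrupt_pool:
        return []
    return [(context, rng.choice(corrupt_pool)) for _ in range(k)]

# Toy usage with a fixed seed so that the run is repeatable.
context = ["expression", "of", "VEGF", "in", "pig", "retina"]
candidates = ["NCBI Gene: 7422", "NCBI Gene: 397157"]
negatives = negative_samples(context, "NCBI Gene: 397157", candidates,
                             k=3, rng=random.Random(0))
```

Each negative shares the context of its positive pair, so the model can only separate the two by how well the ID fits that context.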

Training details
Throughout our experiments, a word is initialized with 200-dimensional pre-trained word embeddings [25], which are trained on the openly available biomedical literature (∼5B words) using the word2vec tool. The dimensions for character, part-of-speech (POS), chunking, and knowledge features (KFs) are 50, 25, 10, and 15, respectively. The ELMo deep contextualized word representations are 1024-dimensional and are generated by biLMs pre-trained on a corpus with approximately 30 million sentences [22]. For both PNER and PNEN, we fine-tune all the parameters during training to improve the performance. UniProt [18] and the NCBI Gene [19] KBs are used for entity knowledge extraction as well as candidate ID generation. The versions of UniProt and NCBI Gene used in our experiments are 2018_11 and 04-Dec-2018, respectively. The disambiguation model is trained with fixed-size left and right contexts (n 2 = 10 words in each side excluding stop words and punctuation). Mini-batch size is set to 8 for both models. We fixed the dropout rate at 0.5 during training to ease the overfitting problem.
Note to Table 1: The left column reports four types of attributes, which are the number of unique proteins/genes mention terms (#Mentions), the number of #Mentions with only one entity ID attested in the corpus (#Monosemous), the number of #Mentions with two or more IDs attested in the corpus (#Polysemous), and the average number of candidate IDs that a polysemous target mention has (Ambiguity Rate)
In the following experiments, we randomly chose 80% of the training set to be the actual training set and the remaining 20% to be the validation set. The training set is used to fit the parameters of the model. The validation set is used to evaluate the performance of our models and to choose the hyper-parameter settings associated with the best performance (hyper-parameter tuning). The test set is used to assess the performance of the final chosen model with the official evaluation scripts provided by the Bio-ID shared task.
The PNER task is evaluated on strict entity span matching, i.e. the character offsets have to be identical with the gold standard annotations. For the PNEN task, only the normalized IDs returned by the systems are evaluated. The performance of the systems is reported as precision (P), recall (R) and F1-score (F1) on corpus level.
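To make the span-matching criteria concrete, the sketch below computes precision, recall and F1 under strict matching and, for comparison, under an overlap criterion. It is an illustration and not the official evaluation script: spans are assumed to be half-open (start, end) character offsets within one document, and the overlap variant assumes precision counts predictions that touch any gold span while recall counts gold spans touched by any prediction.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def strict_prf(pred_spans, gold_spans):
    """Strict matching: a prediction is correct only if its offsets equal a gold span."""
    pred, gold = set(pred_spans), set(gold_spans)
    tp = len(pred & gold)
    p = tp / len(pred) if pred else 0.0
    r = tp / len(gold) if gold else 0.0
    return p, r, f1_score(p, r)

def spans_overlap(a, b):
    """True if two half-open (start, end) spans share at least one character."""
    return a[0] < b[1] and b[0] < a[1]

def overlap_prf(pred_spans, gold_spans):
    """Overlap matching (assumed definition, see the lead-in)."""
    pred, gold = set(pred_spans), set(gold_spans)
    hit_pred = sum(any(spans_overlap(p_, g) for g in gold) for p_ in pred)
    hit_gold = sum(any(spans_overlap(g, p_) for p_ in pred) for g in gold)
    p = hit_pred / len(pred) if pred else 0.0
    r = hit_gold / len(gold) if gold else 0.0
    return p, r, f1_score(p, r)

# Toy spans: one exact match, one partial match, one spurious prediction.
gold = [(0, 5), (10, 15)]
pred = [(0, 5), (11, 15), (20, 24)]
strict = strict_prf(pred, gold)
overlap = overlap_prf(pred, gold)
```

In this toy case the partially matching span (11, 15) counts only under the overlap criterion, which is why overlap scores are never lower than strict scores.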

PNER performance
In this experiment, we explore the effects of different feature representations on the performance of our recognition model. Table 3 shows the results of different combinations of these features on the test set. We first take the BiLSTM-CRF model with word embedding and character embedding $[x^{w}_{t}; x^{c}_{t}]$ as the baseline for comparison. From Table 3, we can see that the addition of linguistic features (POS tagging and chunking) contributes to the PNER task on both the strict and overlap criteria, but only achieves small improvements of 0.8% and 0.5% in F1-score, respectively. On top of the linguistic features, the addition of KFs and ELMo representations brings a significant improvement in PNER performance.
Under the overlap match criteria, the addition of KFs increases the F1-score from 0.839 to 0.855 (a 1.6% improvement), with especially substantial recall gains compared with the other feature combinations. This demonstrates that the rich information on prior protein/gene entities provided by KFs helps recognize more entity variants, and it supports the validity of entity name knowledge for the name variations issue.
Similarly, the addition of ELMo representations increases the F1-score by 2.1%, from 0.839 to 0.860. Although the recall gains brought by ELMo are not as large as those of KFs, ELMo is capable of modeling the multi-sense information of words across varying linguistic contexts, resulting in an increase in precision. In other words, this allows the BiLSTM-CRF model to better understand the context information and to distinguish entities from non-entities more accurately. Moreover, ELMo can complement the context-free nature of traditional word embeddings by representing context-dependent information.
When all additional features (linguistic features, KFs and ELMo representations) are added, the best performance (0.814 F1-score under the strict criteria and 0.871 F1-score under the overlap criteria) is achieved. This suggests that KFs and ELMo are complementary, so that together they balance the recall and precision of the BiLSTM-CRF model on the PNER subtask.

PNEN performance
In this experiment, we explore the effects of the architecture of our disambiguation model on the test set based on the above PNER model selected by the validation set. Since the ELMo representation and the gating mechanism have been proven to be effective, they will be added directly to the entity disambiguation model without being explored. Five common sequence encoders are used for context representation learning, which are shown below.
LSTM and GRU Firstly, we explore the standard recurrent encoders with either Long Short-Term Memory (LSTM) or Gated Recurrent Units (GRU) for sequence encoding. The last hidden state is used to represent a sequence.
BiLSTM and BiGRU To preserve information from both past and future, we also consider bidirectional LSTM/GRU that concatenates the last hidden state of the forward direction, and the last hidden state of the backward direction to represent a sequence.
Hierarchical-ConvNet Inspired by Zhao et al. [26], we introduce a hierarchical convolutional network which concatenates different representations of the sequence at four different levels of convolutional layers. In each layer, a representation $u_i$ is computed by a max-pooling operation over the feature maps. The final sequence representation $h^{se}_{t} = [u_1; u_2; u_3; u_4]$ is the concatenation of $u_i$ from each layer.
Note to Table 2: The left column tabulates four types of attributes, which are the number of unique entity IDs (#IDs), the number of #IDs with only one variant (#Single Var.), the number of #IDs with two or more variants (#Multiple Var.), and the average number of variants that a multiple var. target ID has (Synonymy Rate)
In addition, two kinds of attention mechanism are used to verify the effect of ID embeddings. The first is Knowledge-based attention, which is designed to focus on contextual words that are more relevant to candidate IDs (described in Section 2.3.3, Entity Disambiguation). The second is Self-attention [27], which is similar to the former except that pre-trained ID embeddings are not considered when calculating the attention score.
Table 4 shows the results of different model architectures on the test set. From Table 4, we can see that: (1) The recurrent encoders (GRU, LSTM, BiLSTM and BiGRU) achieve better performance than Hierarchical-ConvNet. Recurrent encoders are suitable to capture the long-term dependencies within sequences, while Hierarchical-ConvNet is suitable to capture the local features. In most cases, entity disambiguation relies predominantly on global features rather than local features. (2) Both BiLSTM and BiGRU perform well and are superior to unidirectional models (LSTM and GRU). Compared with the unidirectional models, the bidirectional models could capture context information more comprehensively. (3) By incorporating an attention mechanism, the above five sequence encoders all achieve performance improvements, regardless of whether Knowledge-based attention or Self-attention is introduced. The possible reason is that the attention mechanism could flexibly capture global and local connections and better model long-term dependencies to capture important context information between elements in a sequence. (4) Knowledge-based attention could effectively fuse knowledge and context representations and outperforms the Self-attention mechanism. With the help of ID embeddings learned by the autoencoder, Knowledge-based attention brings more benefits to PNEN and significantly increases its F1-score under both micro- and macro-averages. We attribute this to the following two aspects. On one hand, the Knowledge-based attention mechanism could find important contexts related to the candidate ID. On the other hand, the knowledge representations obtained from KBs through ID representation learning could provide valid information for PNEN. Knowledge representations could efficiently encode prior entity structural knowledge in a low-dimensional space and significantly improve the performance of PNEN. (5) Compared to BiLSTM + Knowledge-based attention, BiGRU + Knowledge-based attention wins with a slight advantage and achieves the highest micro-averaged F1-score of 0.445 among all models, which means that it could better integrate context information and prior knowledge through the proposed attention mechanism.
Note to Table 3: Strict match criteria require that the predicted entity and the gold standard annotation match exactly at the byte offset; overlap match criteria allow a match if the predicted entity overlaps with the gold annotation at all. The highest scores are highlighted in bold. We tune the hyper-parameters through the validation set and use the official evaluation script to assess the performance of the final chosen model on the test set.
Note to Table 4: Only the normalized IDs returned by the systems are evaluated on both micro-averaged and macro-averaged metrics. Micro-averaging calculates metrics globally by counting the total true positives, false negatives and false positives; macro-averaging calculates metrics for each label in documents and finds their unweighted mean. The highest scores are highlighted in bold. We tune the hyper-parameters through the validation set and use the official evaluation script to assess the performance of the final chosen model on the test set.
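The difference between the micro- and macro-averaged F1-scores reported for PNEN can be illustrated with a short sketch. It is not the official evaluation script: it assumes that true positive, false positive and false negative counts have already been collected per group, where the grouping unit (for example a document or a label) is left open, and the function names are hypothetical.

```python
def f1_from_counts(tp, fp, fn):
    """F1 from raw counts; returns 0.0 when precision and recall are both zero."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def micro_f1(groups):
    """Sum the counts over all groups first, then compute a single F1."""
    tp = sum(g[0] for g in groups)
    fp = sum(g[1] for g in groups)
    fn = sum(g[2] for g in groups)
    return f1_from_counts(tp, fp, fn)

def macro_f1(groups):
    """Compute F1 per group, then take the unweighted mean."""
    if not groups:
        return 0.0
    return sum(f1_from_counts(*g) for g in groups) / len(groups)

# Toy (tp, fp, fn) counts for two groups of very different size.
groups = [(8, 2, 2), (1, 3, 3)]
micro = micro_f1(groups)
macro = macro_f1(groups)
```

In the toy data the large, well-handled group dominates the micro average, while the macro average weights the small, poorly handled group equally, so the two scores differ.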

Discussion
Comparison with related work for PNER
We compared our approach with other related work on the PNER subtask, and the results are shown in Table 5.
Kaewphan et al. [29] used the publicly available NER toolkit NERsuite, with one-hot word representations and POS tags as input, for PNER. Additional dictionary features were also used in their experiments, but they brought no clear performance improvement under either the strict or the overlap criteria. Their recognition approach achieved the highest rank on the PNER subtask (0.734 and 0.831 F1-scores under both matching criteria). However, traditional ML-based methods need extensive feature engineering, which is time-consuming and labor-intensive. Based on their previous work [29], Kaewphan et al. [30] further developed a BiLSTM-CRF based model, which used character embeddings learned by a Convolutional Neural Network (CNN) and the predictions from their original NERsuite model [29] as inputs of the recognition model. Neural network-based methods bring a significant improvement in PNER performance (a 3.2% F1-score improvement under the strict criteria over their previous work). However, they did not use entity name knowledge or ELMo representations, resulting in an F1-score 4.8% lower than that of our method. Sheng et al. [28] also constructed a BiLSTM-CRF model that used only word and character embeddings as inputs, without relying on any other external features.
Compared with these approaches, our model incorporates multi-sense information of words and entity name knowledge in KBs. As a result, our model improves both precision and recall while keeping them relatively balanced, and it outperforms the approaches mentioned above.

Comparison with related work for PNEN
Similarly, we compared our work with other related work on the PNEN subtask. The results are shown in Table 6.
Kaewphan et al. [29] applied exact string matching to retrieve candidate IDs of protein/gene mentions based on KBs. For ambiguous mentions with multiple candidate IDs, some heuristic rules were developed for disambiguating protein/gene mentions and uniquely assigning an ID. Their normalization approach achieved the highest rank on the PNEN subtask (0.397 micro-averaged F1-score). Typically, hand-crafted rules are clear and effective, but they are inflexible and hard to extend to a new dataset. Kaewphan et al. [30] used the same method as their previous work [29] to perform PNEN, but based on their new recognition method. Compared with their previous results, they achieved a 1.8% micro-averaged F1-score improvement, from 0.397 to 0.415. This shows that their normalization approach depends to a large extent on the performance of PNER.
Sheng et al. [28] compiled a contextual dictionary based on the training set and then checked if the entity mention was in this contextual dictionary. If so, they normalized the mention to the known ID that shared the most contextual words with the sequence the entity belonged to. For cases without matched IDs in the compiled dictionary, they used the same UniProt API and NCBI Gene API as ours to search for candidate IDs and directly assigned the first ID match to ambiguous mentions. Since no disambiguation models were used to pick candidate IDs, their approach achieved a relatively low precision and recall on the PNEN subtask. Our normalization approach outperforms the above related approaches and achieves a state-of-the-art result (0.445 micro-averaged F1-score). We attribute this to the validity of our disambiguation model, which accurately models the context representation through rich structural knowledge of entities in KBs.
Table note: The highest scores are highlighted in bold

Influences of the training data size
We further explored the influence of the training data size. We first divided the Bio-ID training data into eight parts and then added the parts to the training set one by one, thereby training eight models with training sets of different sizes. Figure 2 shows the trend of the PNER results as the size of the training set increases. From Fig. 2 we can see that the F1-scores of the PNER subtask increase gradually as the training set grows. To estimate the asymptotic F1-score for PNER, we fitted a non-linear function $F1_{PNER} = i + j m^{n}$ to the results under the Strict Match and Overlap Match criteria, where n is the number of training set parts. These functions illustrate that, as the training set increases, the asymptotic F1-scores under the strict and overlap criteria could reach about 0.829 and 0.882, respectively. Similarly, Fig. 3 shows the same information for the PNEN subtask, and the corresponding fitted functions show similar results. As the training dataset increases, the asymptotic F1-scores under micro- and macro-averages could reach about 0.463 and 0.426, respectively.
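The role of the fitted function in estimating an asymptote can be made explicit with a short derivation. It rests on an assumption about the notation: the fitted form is read as $i + j m^{n}$ with a base $0 < m < 1$, and the fitted coefficient values themselves are not reproduced here.

```latex
% Assumed reading of the fitted form: F1(n) = i + j m^n with 0 < m < 1.
\[
  \lim_{n \to \infty} F1(n)
  = \lim_{n \to \infty} \left( i + j\, m^{n} \right)
  = i ,
  \qquad \text{since } m^{n} \to 0 \text{ for } 0 < m < 1 .
\]
% With j < 0 the curve increases toward i as more training parts are added.
```

Under this reading, the reported asymptotic F1-scores would correspond to the fitted constant term $i$ of each curve, and the term $j m^{n}$ would describe how quickly the gap to that asymptote closes as parts are added.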

Error analysis
We analyzed the incorrect output of our knowledge-enhanced system on the PNEN subtask and divided the errors into the following four types: PNER false positives (FPs), PNER false negatives (FNs), PNEN FPs caused by Missed ID, and PNEN FPs caused by Incorrect ID, the last of which is the case that our system fails to assign the correct ID to the ambiguous mention. See Table 7 for more details.
PNER FPs propagate to the PNEN phase and cause 995 normalization errors, a proportion of 17.59%. PNER FPs are cases in which some non-entity words are incorrectly recognized as entities. Take sentence (1), as predicted by our knowledge-enhanced system, as an example. The word "Miro", marked with a wavy line, looks much like the annotated entity "Miro1" according to its context, but is actually noise. Although we introduced the deep contextual word representations ELMo into the system to capture details of the context, there are still errors in the PNER phase.
PNER FNs propagate to the PNEN phase and cause 2561 normalization errors, a proportion of 45.21%. Such errors are generally related to domain-specific abbreviations. In sentence (1), the underlined word "TTX" is a gene entity but is not recognized by our system. We identified two causes that prevent many entities from being recalled. One is that, although the introduced entity name knowledge contains a large amount of variant information, there is still no guarantee that its coverage of the many abbreviated variants is complete. The other is that the textual context around the entity mention is too general, which makes it difficult for our system to capture discriminative information to disambiguate mentions.
Though some ambiguous entity mentions are correctly recognized in the PNER phase, they are assigned an incorrect ID in the PNEN phase. Such errors can be further divided into two sub-categories, PNEN FPs caused by Missed ID and PNEN FPs caused by Incorrect ID. In sentence (1), for example, the true positive "Miro1" should correspond to the ID "NCBI Gene: 59040", but the incorrect ID "NCBI Gene: 8850" is assigned.
PNEN FPs caused by Missed ID are cases in which the correct ID of the ambiguous mention is not included in the result retrieved by candidate ID generation. Missed ID is usually related to the entity ambiguity issue. The more variants an ambiguous mention corresponds to, the more candidate IDs it may have, which makes it less likely that the retrieved candidate IDs (up to 5) cover the correct ID during the PNEN phase. The Missed ID subcategory brings 1376 errors, a proportion of 24.35%. PNEN FPs caused by Incorrect ID are cases in which our system fails to assign the correct ID to the ambiguous mention. Although we add pre-trained ID embeddings to help context representation learning as accurately as

Conclusions
In this paper, we present a knowledge-enhanced system for biomedical named entity recognition and normalization with proteins and genes as the application target. For the name variations challenge, entity name knowledge is used for PNER to increase its recall rate. ELMo representations are also added to the recognition model for the purpose of improving the model precision. For the entity ambiguity challenge, we use an autoencoder to encode structural knowledge of entities into ID embeddings for better entity disambiguation. Experimental results on the BioCreative VI Bio-ID dataset verify that the proposed system outperforms the existing state-of-the-art systems on both PNER and PNEN subtasks, with the aid of these two kinds of knowledge and ELMo representations. Our system implements this task as two separate steps in a pipeline, which may lead to error propagation from PNER to PNEN, as can be seen from the error analysis. As future work, we would like to construct a joint model that recognizes and normalizes protein/gene entities simultaneously, to reduce such error propagation by enabling feedback from the PNEN phase to the PNER phase. Such a model would also allow entity recognition and normalization to interact with each other so that PNER and PNEN are optimized jointly.

Methods
In this section, we describe our knowledge-enhanced system for the Bio-ID Assignment task. Figure 4 shows the workflow of our system. It can be divided into three modules: (1) Feature extraction is performed on the original corpus, and six types of features are obtained and used as input to the entity recognition model. (2) Entity recognition is used to get the entity mentions. The extracted features are mapped to vector representations and concatenated together as inputs to the entity recognition model for entity mentions. To further improve the model performance, some heuristic rules are used to correct the predicted results output by the entity recognition model. (3) Entity normalization is used to generate candidate IDs and eliminate ambiguity for mentions. This module first generates candidate IDs for the mentions, which are then mapped to pre-trained ID embeddings. The candidate ID embeddings will then be fed to the entity disambiguation model along with the local contexts of the mention, for the purpose of picking the most likely one as the assignment result for the mention.
Fig. 4 The workflow of our knowledge-enhanced system. The arrow means the workflow of the system, the rectangle indicates a specific operation or process, and the pink oval box indicates the results of entity recognition and normalization

Feature extraction
Following the practice of Tsai et al. [31], we employ the GENIA Tagger [32] to process input documents, including tokenization, POS tagging and chunking. All of these provide features for our BiLSTM-CRF model to further enrich the information of each word. In particular, to recognize more entity variants and alleviate the name variations issue, entity name knowledge contained in the UniProt and NCBI Gene KBs is used to generate our KFs. Specifically, the longest possible matches between the input word sequence and the variant terms of entities contained in the KBs are first captured. Then, each word in a match is encoded with the BIO (Begin, Inside, Outside) tagging scheme to form the KFs. Intuitively, KFs augment the hint information for each entity mention by means of the large number of variant terms in KBs.
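The generation of KFs by longest matching can be sketched as follows. This is an illustration under stated assumptions and not the system's implementation: matching is greedy from left to right, case-insensitive, and done on whitespace-split tokens, and the function name `knowledge_features` and the `max_len` cap on match length are hypothetical.

```python
def knowledge_features(tokens, variant_terms, max_len=8):
    """Assign a B/I/O knowledge feature to each token.

    At each position, the longest token span (up to max_len tokens) that
    equals a KB variant term is tagged B, I, ..., I; unmatched tokens get O.
    """
    # Store every variant term as a tuple of lower-cased tokens.
    lexicon = {tuple(term.lower().split()) for term in variant_terms}
    lowered = [tok.lower() for tok in tokens]
    tags = ["O"] * len(tokens)
    i = 0
    while i < len(tokens):
        matched = 0
        # Try the longest candidate span first, then shorter ones.
        for length in range(min(max_len, len(tokens) - i), 0, -1):
            if tuple(lowered[i:i + length]) in lexicon:
                matched = length
                break
        if matched:
            tags[i] = "B"
            for k in range(i + 1, i + matched):
                tags[k] = "I"
            i += matched
        else:
            i += 1
    return tags

# Toy usage with a three-entry variant list.
tokens = ["Expression", "of", "vascular", "endothelial",
          "growth", "factor", "and", "VPF"]
variants = ["vascular endothelial growth factor", "growth factor", "VPF"]
kf_tags = knowledge_features(tokens, variants)
```

Trying longer spans first means that "vascular endothelial growth factor" is tagged as one match rather than being split around the shorter term "growth factor".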
In addition, an ELMo representation learned by a pre-trained LSTM-based multi-layer biLM is also extracted. The biLM takes the character sequence of each word as input and encodes it with a CNN and highway networks, whose output is then given to a two-layer BiLSTM with residual connections. Then, the hidden states of each layer are combined to assign each word an ELMo representation. The ELMo representation for the t-th word is given by
$$x^{elmo}_{t} = \gamma \sum_{j} l_{j}\, h^{LM}_{t,j},$$
where $h^{LM}_{t,j}$ is the t-th hidden state of the j-th LSTM layer in the biLM, $l_{j}$ is the softmax-normalized weight of the j-th LSTM layer in the biLM, and $\gamma$ is the scalar parameter that allows the model to scale the entire ELMo representation.
ELMo can model multi-sense information of each word across different linguistic contexts and make the embeddings contain more contextual information.
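As a rough illustration of the layer combination described above, the sketch below mixes per-layer hidden states of a single word with softmax-normalized weights and a global scale. The function names and the toy vectors are assumptions for the example; a real system would take the hidden states from the pre-trained biLM.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def elmo_representation(layer_states, layer_logits, gamma):
    """Combine the hidden states of one word across biLM layers.

    layer_states[j] is the hidden vector of layer j for this word,
    layer_logits[j] is the unnormalized weight of layer j (softmax gives l_j),
    gamma is the global scaling parameter.
    """
    weights = softmax(layer_logits)
    dim = len(layer_states[0])
    mixed = [0.0] * dim
    for w, h in zip(weights, layer_states):
        for k in range(dim):
            mixed[k] += w * h[k]
    return [gamma * v for v in mixed]


# Toy example: three layers, two dimensions, equal layer weights.
states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
vec = elmo_representation(states, [0.0, 0.0, 0.0], gamma=2.0)
print(vec)
```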

Entity recognition
The architecture of BiLSTM-CRF model for entity recognition is illustrated in Fig. 5. It mainly consists of three parts: Embedding layer, BiLSTM layer and CRF layer.

Embedding layer
Given an input feature sequence $W = \{w_1, ..., w_t, ..., w_n\} \in \mathbb{R}^{n}$, it is mapped to a feature vector sequence $X = \{x_1, ..., x_t, ..., x_n\} \in \mathbb{R}^{d \times n}$ through the embedding layer, where $d$ is the embedding dimension and $n$ is the sequence length.
Each of the feature vectors $x_t \in \mathbb{R}^{d}$ consists of the following six parts: a word embedding $x_t^{w}$ mapped by the word embedding matrix pre-trained with the word2vec tool, a character embedding $x_t^{c}$ learned from the character sequence of the word by a character-level BiLSTM, an ELMo representation $x_t^{elmo}$ learned by the pre-trained LSTM-based multi-layer biLM, and the POS, chunking and knowledge feature vectors $x_t^{pos}$, $x_t^{chunk}$, $x_t^{kf}$ obtained by mapping to randomly initialized embeddings.
The concatenation of the above six feature vectors yields the output of the embedding layer, which can be represented as follows:

$$x_t = [x_t^{w}; x_t^{c}; x_t^{elmo}; x_t^{pos}; x_t^{chunk}; x_t^{kf}]$$

After that, the feature vector sequence $X$ is taken as input to the next BiLSTM layer for context representation learning.

BiLSTM layer
Long short-term memory (LSTM) is a specific type of recurrent neural network that models dependencies between elements in a sequence through recurrent connections. Here, we use a forward LSTM to compute a hidden state $\overrightarrow{h}_t = \mathrm{LSTM}(x_t, \overrightarrow{h}_{t-1}) \in \mathbb{R}^{d_2}$ of the sequence $X$ from left to right at the $t$-th time step, and a backward LSTM to compute a hidden state $\overleftarrow{h}_t = \mathrm{LSTM}(x_t, \overleftarrow{h}_{t+1}) \in \mathbb{R}^{d_2}$ of the same sequence in reverse, where $d_2$ is the dimension of the hidden state. Then, the two hidden states are concatenated to form the final output $h_t = [\overrightarrow{h}_t; \overleftarrow{h}_t]$ of the BiLSTM layer at the $t$-th time step. After that, the output of the BiLSTM, $h = \{h_1, ..., h_t, ..., h_n\} \in \mathbb{R}^{2d_2 \times n}$, is fed to a two-layer fully-connected neural network (FC) with tanh activation to predict the confidence score for each possible label of each word. Consistent with the parameter shapes, the scores are computed as $V \tanh(W h + b) \in \mathbb{R}^{k \times n}$, where $W \in \mathbb{R}^{d_2 \times 2d_2}$, $V \in \mathbb{R}^{k \times d_2}$ and $b \in \mathbb{R}^{d_2 \times n}$ are the parameters that need to be trained, and $k$ is the number of distinct labels.

CRF layer
To model the dependencies across output tags, a linear-chain CRF layer is added on top of the BiLSTM layer to decode the best tag path among all possible tag paths. For the input sequence $X$, we consider $P \in \mathbb{R}^{n \times k}$ to be the matrix of scores output by the BiLSTM layer. The element $P_{i,j}$ of the matrix corresponds to the score of the $j$-th tag of the $i$-th word. For a sequence of predictions $y = \{y_1, ..., y_t, ..., y_n\}$, we define its score to be:

$$s(X, y) = \sum_{i=0}^{n} T_{y_i, y_{i+1}} + \sum_{i=1}^{n} P_{i, y_i}$$

where $T$ is a matrix of transition scores such that $T_{i,j}$ represents the score of a transition from tag $i$ to tag $j$. $y_0$ and $y_{n+1}$ are the extra start and end tags of a sequence, and $T$ is therefore a square matrix of size $k + 2$.
Then, a softmax is used to yield a probability of the path $y$ by normalizing the above score over all possible tag paths $y'$:

$$p(y \mid X) = \frac{\exp(s(X, y))}{\sum_{y'} \exp(s(X, y'))}$$

During training, we maximize the log-likelihood of the correct tag sequence. The RMSProp technique [33] with a learning rate of 1e-3 is used to update the parameters of the BiLSTM-CRF model. While decoding, we predict the tag sequence $y^{*}$ that obtains the maximum score, given by:

$$y^{*} = \arg\max_{y'} s(X, y')$$

The Viterbi algorithm [34] is used to infer the optimal tag path efficiently.
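The path score and Viterbi decoding can be illustrated with a small pure-Python sketch. It assumes the score of a path is the sum of its transition scores (including transitions from a start tag and into an end tag) and its per-word emission scores. Placing the start and end tags at indices `k` and `k + 1` of the transition matrix, and the toy numbers, are choices made for this example only.

```python
def path_score(P, T, y):
    """Score of tag path y: transitions (with start/end tags) plus emissions."""
    k = len(P[0])
    start, end = k, k + 1  # assumption: extra tags stored after the k labels
    tags = [start] + list(y) + [end]
    transitions = sum(T[a][b] for a, b in zip(tags, tags[1:]))
    emissions = sum(P[i][t] for i, t in enumerate(y))
    return transitions + emissions


def viterbi(P, T):
    """Return (best_path, best_score) over all tag paths by dynamic programming."""
    n, k = len(P), len(P[0])
    start, end = k, k + 1
    best = [T[start][j] + P[0][j] for j in range(k)]
    backpointers = []
    for i in range(1, n):
        prev, best, ptr = best, [], []
        for j in range(k):
            cand = [prev[a] + T[a][j] for a in range(k)]
            a_star = max(range(k), key=cand.__getitem__)
            best.append(cand[a_star] + P[i][j])
            ptr.append(a_star)
        backpointers.append(ptr)
    final = [best[j] + T[j][end] for j in range(k)]
    last = max(range(k), key=final.__getitem__)
    path = [last]
    for ptr in reversed(backpointers):  # follow backpointers to the first word
        path.append(ptr[path[-1]])
    path.reverse()
    return path, final[last]


# Toy example with k = 2 labels and n = 3 words.
P = [[1.0, 0.2], [0.1, 0.9], [0.8, 0.3]]
T = [
    [0.5, -0.5, 0.0, 0.1],   # from tag 0
    [-0.2, 0.3, 0.0, 0.0],   # from tag 1
    [0.2, -0.1, 0.0, 0.0],   # from start tag
    [0.0, 0.0, 0.0, 0.0],    # from end tag (unused)
]
best_path, best_score = viterbi(P, T)
print(best_path, best_score)
```

Viterbi needs on the order of n·k² operations, whereas enumerating every path costs kⁿ, which is the efficiency consideration mentioned above.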
To further improve the PNER performance, the same post-processing rules as Luo et al. [20] and Campos et al. [35] are applied to recover likely entity mentions that the model missed and to correct incomplete entity mentions.

Entity normalization
In this section, we explain how to map each recognized protein/gene mention to the corresponding ID in the UniProt or NCBI Gene KBs. Table 8 shows the pseudocode of our PNEN algorithm. It consists of the following two modules: (1) candidate ID generation and (2) entity disambiguation (i.e., picking the proper entity ID from all candidates as the mapping ID for each entity mention).
Since pre-trained ID embeddings are required for entity disambiguation, how they are learned through structural knowledge of entities will be introduced first.

ID representation learning
KBs contain rich structural knowledge of entities (e.g., name variations and entity ambiguity as shown in Fig. 1), which can be formalized as constraints on embeddings and allows us to extend word embeddings to embeddings of entity IDs. To this end, we adopt an autoencoder to learn the embeddings of entity IDs based on the Mention-Variant-ID structures provided by KBs.

Fig. 5 The architecture of BiLSTM-CRF model for PNER. In embedding layer, "w2v" means the word embeddings pre-trained using the word2vec tool, character embedding can be learned by Character-level BiLSTM, "biLM" means the pre-trained bi-directional Language Model ELMo, "Randomly initialized" means obtaining a corresponding vector in a random manner. Six feature representations of each word are concatenated together to form an input and fed to a BiLSTM layer. The last is the CRF layer, which is used to decode the best tag path in all possible tag paths. The input sentence is "SyGCaMP5 and MtDsRed or myc -ΔEF -Miro1 -IRES -MtDsRed ( ΔEF Miro) with and without TTX treatments". The output tag sequence is "OOOOBOOOBOOOOOOOOOOOBO"
The basic premises of the autoencoder are as follows: (i) entity IDs are sums of their variants; and (ii) entity mentions are sums of their variants. Take Fig. 1 as an example to illustrate. For the premise (i), entity ID "NCBI Gene: 7422" can be represented by the sum of its variants "VEGF (human)", "MVCD1" and "VPF". For the premise (ii), mention "VEGF" can be represented by the sum of its variants "VEGF (human)" and "VEGF (pig)".
We denote the mention embedding as $m^{(i)} \in \mathbb{R}^{d}$, the variant embedding as $v^{(i,j)} \in \mathbb{R}^{d}$ and the entity ID embedding as $s^{(j)} \in \mathbb{R}^{d}$, where $v^{(i,j)}$ is the variant of mention $m^{(i)}$ that is a member of entity ID $s^{(j)}$. The mention embedding $m^{(i)}$ is initialized with the average of the word embeddings of its constituent words. We can then formalize premises (i) and (ii) as the constraints $s^{(j)} = \sum_{i} v^{(i,j)}$ and $m^{(i)} = \sum_{j} v^{(i,j)}$.

The autoencoder consists of two parts, encoding and decoding. When encoding, it takes mention embeddings as input and unravels them into the vectors of their variants. IDs are then embedded as the sum of their constituent variants. To unravel mentions into their corresponding variants for ID representation learning, we introduce a diagonal matrix $E^{(i,j)} \in \mathbb{R}^{d \times d}$ that allows the mention $m^{(i)}$ to distribute its embedding activations to its variants on each dimension separately. $E^{(i,j)}$ satisfies the condition $\sum_{j} E^{(i,j)} = I_n$, with $I_n$ being an identity matrix. Therefore, the encoding process can be written as $s^{(j)} = \sum_{i} E^{(i,j)} m^{(i)}$. When decoding, it translates entity IDs back to mentions as $\bar{m}^{(i)} = \sum_{j} D^{(i,j)} s^{(j)}$, where $D^{(i,j)} \in \mathbb{R}^{d \times d}$ is analogous to the diagonal matrix $E^{(i,j)}$ and is used to distribute an entity ID into its variants. $\bar{m}^{(i)}$ and $\bar{v}^{(i,j)}$ represent the entity mention and variant generated in the decoding process.

We align the decoded mention embedding with the original mention embedding to train the autoencoder. In addition, the variant embeddings $v^{(i,j)}$ and $\bar{v}^{(i,j)}$ obtained in the encoding and decoding parts are also aligned to strengthen the constraint on embeddings. Finally, our training objective for the autoencoder is to minimize a weighted combination of these two alignment terms, where $\alpha$ and $\beta$ are the weights and satisfy $\alpha + \beta = 1$. We set $\alpha = \beta = 0.5$, determined experimentally. With the help of this autoencoder, we thus encode structural knowledge of entities from the UniProt and NCBI Gene KBs into ID embeddings, which are used as the input of the disambiguation model.
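A minimal sketch of the encode/decode cycle follows, with each diagonal matrix stored as a plain vector of its diagonal entries. Several details are assumptions made for illustration and are not stated in the text: the squared Euclidean distance used for both alignment terms, the normalization of the decoding matrices over mentions for each ID, and the toy values. In training, the diagonal entries would be learned rather than fixed by hand.

```python
def encode(m, E):
    """Unravel mentions into variants, then sum the variants of each ID."""
    v = {(i, j): [e * x for e, x in zip(diag, m[i])] for (i, j), diag in E.items()}
    s = {}
    for (i, j), vec in v.items():
        acc = s.setdefault(j, [0.0] * len(vec))
        for k, x in enumerate(vec):
            acc[k] += x
    return v, s


def decode(s, D):
    """Distribute IDs back into variants, then sum the variants of each mention."""
    v_bar = {(i, j): [w * x for w, x in zip(diag, s[j])] for (i, j), diag in D.items()}
    m_bar = {}
    for (i, j), vec in v_bar.items():
        acc = m_bar.setdefault(i, [0.0] * len(vec))
        for k, x in enumerate(vec):
            acc[k] += x
    return v_bar, m_bar


def alignment_loss(m, v, m_bar, v_bar, alpha=0.5, beta=0.5):
    """Weighted sum of mention and variant alignment errors (squared L2, assumed)."""
    def sq(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    mention_term = sum(sq(m_bar[i], m[i]) for i in m)
    variant_term = sum(sq(v_bar[p], v[p]) for p in v)
    return alpha * mention_term + beta * variant_term


# Toy structure: mention 0 has variants in IDs 0 and 1, mention 1 only in ID 0.
m = {0: [1.0, 2.0], 1: [0.5, 0.5]}
E = {(0, 0): [0.5, 0.5], (0, 1): [0.5, 0.5], (1, 0): [1.0, 1.0]}  # sums to 1 over j
D = {(0, 0): [0.5, 2.0 / 3.0], (1, 0): [0.5, 1.0 / 3.0], (0, 1): [1.0, 1.0]}

v, s = encode(m, E)
v_bar, m_bar = decode(s, D)
print(s, alignment_loss(m, v, m_bar, v_bar))
```

In this toy setup the decoding weights were chosen so that mentions and variants are reconstructed exactly, so the loss is numerically zero. The learned ID embeddings correspond to the vectors in `s`.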

Candidate ID generation
In this module, for each entity mention m ∈ M, we aim to retrieve a candidate ID set S m which contains possible IDs that entity mention m may refer to. To this end, we propose the KB Retrieval method for candidate ID generation, as shown in part (1) of Table 8.
The KB Retrieval method treats candidate ID generation as an information retrieval process, which takes mentions as queries and returns related IDs through the APIs provided by biomedical KBs. Here, the official UniProt API [18] and the official NCBI Gene API [19] are used to search for protein and gene IDs, respectively. To limit memory use and run time, we keep the top five results returned by KB Retrieval as candidate IDs for each entity mention.

Entity disambiguation
Part (2) of Table 8 shows the process of entity disambiguation. In most cases, the size of the candidate ID set $S_m$ of a mention is larger than one. We propose a disambiguation model, which is shown in Fig. 6, to rank the candidate IDs in $S_m$ and pick the most likely one as the assignment result for the mention $m$. Specifically, the model takes the left and right contexts in which the entity mention appears and a candidate ID as inputs, and outputs a probability-like score for the candidate ID being correct. We set $n_2$ as the length of the context sequence. The left context $C_L \in \mathbb{R}^{d \times n_2}$ and the right context $C_R \in \mathbb{R}^{d \times n_2}$ around the entity mention are first fed into a pair of sequence encoders (SE) for context representation learning. Taking the left context as an example, the resulting hidden state of the $t$-th time step produced by the SE can be written as $h_t^{se} = \mathrm{SE}(C_L)$, where different architectures of SE will be compared in the experimental part.

Table 8 Pseudocode for PNEN Algorithm
For the semantic meaning of a sequence, the importance of each context word with respect to the candidate ID embedding $s$ should be different. To this end, we equip the sequence encoder with a knowledge-based attention mechanism to focus on the appropriate subparts of the input context. Following Eshel et al. [13], we use the pre-trained ID embedding $s$ as the controller to calculate the normalized weight $\alpha_t \in [0, 1]$ for each hidden state $h_t^{se}$, which is then used to encode the entire sequence into a context representation $o_L \in \mathbb{R}^{d \times 1}$ as the weighted sum $o_L = \sum_{t} \alpha_t h_t^{se}$. For each hidden state $h_t^{se}$, we use a feed-forward neural network to compute its semantic relatedness with the candidate ID embedding $s$. The score function is parameterized by $W_a^{L} \in \mathbb{R}^{1 \times d}$, $V_a^{L} \in \mathbb{R}^{1 \times d}$ and $b_a^{L} \in \mathbb{R}^{1 \times 1}$, which are attention parameters to be learned during training.
After that, the attention weight $\alpha_t \in \mathbb{R}^{1 \times 1}$ of each hidden state is obtained by normalizing these scores over all time steps of the context.
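The knowledge-based attention can be sketched as follows. The exact score function is not recoverable from the text here, so this example assumes a common additive form, `tanh(W·h_t + V·s + b)`, followed by a softmax over time steps. The function names and toy values are illustrative only.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def attend(hidden_states, s, W, V, b):
    """Weight hidden states by their relatedness to the candidate ID embedding s.

    Assumed score: tanh(W.h_t + V.s + b); weights: softmax over time steps.
    Returns (alphas, o) where o is the attention-weighted sum of hidden states.
    """
    scores = [math.tanh(dot(W, h) + dot(V, s) + b) for h in hidden_states]
    mx = max(scores)
    exps = [math.exp(e - mx) for e in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]
    dim = len(hidden_states[0])
    o = [sum(a * h[k] for a, h in zip(alphas, hidden_states)) for k in range(dim)]
    return alphas, o


# Toy left context with three time steps and d = 2.
H = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
s_id = [0.5, -0.5]
alphas, o_left = attend(H, s_id, W=[1.0, 0.0], V=[0.2, 0.2], b=0.0)
print(alphas, o_left)
```

The same routine, with its own parameters, would be applied to the right context to obtain the right context representation.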
In general, the contribution of left and right contexts should be different for the selection of different candidate IDs. For the purpose of dynamically controlling the flow of left context representation o L and right context representation o R , a gating mechanism is also adopted as shown below: where ⊙ denotes element-wise product between two vectors, σ is a sigmoid activation, and W g ∈ ℝ 1 × d , V g ∈ ℝ 1 × d , b g ∈ ℝ 1 × 1 are the model parameters that need to be trained.
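One plausible reading of the gating mechanism, given the stated parameter shapes, is a single sigmoid gate computed from both context representations and used to interpolate between them. The sketch below follows that reading. The interpolation form `g * o_L + (1 - g) * o_R` and the scalar gate broadcast over all dimensions are assumptions, not the paper's confirmed formula.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def gate_contexts(o_left, o_right, W_g, V_g, b_g):
    """Blend left and right context representations with a sigmoid gate.

    Assumed form: g = sigmoid(W_g.o_L + V_g.o_R + b_g), z = g*o_L + (1-g)*o_R.
    """
    g = 1.0 / (1.0 + math.exp(-(dot(W_g, o_left) + dot(V_g, o_right) + b_g)))
    z = [g * l + (1.0 - g) * r for l, r in zip(o_left, o_right)]
    return g, z


o_L = [1.0, 0.0]
o_R = [0.0, 1.0]
g, z = gate_contexts(o_L, o_R, W_g=[2.0, 0.0], V_g=[0.0, -1.0], b_g=0.0)
print(g, z)
```

A gate value close to 1 lets the left context dominate the output and a value close to 0 favors the right context, which matches the stated goal of dynamically controlling the flow of the two representations.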
Finally, we concatenate the output of the gating mechanism $z \in \mathbb{R}^{d \times 1}$ and the pre-trained ID embedding $s$ as the final feature representation $[z; s]$ and feed it to a classifier. The classifier consists of a two-layer fully-connected neural network (FC) with ReLU activation and an output layer with two output units in a softmax.

Fig. 6 The architecture of our proposed disambiguation model. "$C_L$" represents the left context, "$C_R$" represents the right context, and "FC" represents the fully connected layer. The target entity is "Miro1" in Fig. 5. This figure takes candidate ID "NCBI Gene: 59040" as an example