Improving biomedical named entity recognition with syntactic information

Background Biomedical named entity recognition (BioNER) is an important task for understanding biomedical texts, which can be challenging due to the lack of large-scale labeled training data and domain knowledge. To address the challenge, in addition to using powerful encoders (e.g., biLSTM and BioBERT), one possible method is to leverage extra knowledge that is easy to obtain. Previous studies have shown that auto-processed syntactic information can be a useful resource to improve model performance, but their approaches are limited to directly concatenating the embeddings of syntactic information to the input word embeddings. Therefore, such syntactic information is leveraged in an inflexible way, where inaccurate one may hurt model performance. Results In this paper, we propose BioKMNER, a BioNER model for biomedical texts with key-value memory networks (KVMN) to incorporate auto-processed syntactic information. We evaluate BioKMNER on six English biomedical datasets, where our method with KVMN outperforms the strong baseline method, namely, BioBERT, from the previous study on all datasets. Specifically, the F1 scores of our best performing model are 85.29% on BC2GM, 77.83% on JNLPBA, 94.22% on BC5CDR-chemical, 90.08% on NCBI-disease, 89.24% on LINNAEUS, and 76.33% on Species-800, where state-of-the-art performance is obtained on four of them (i.e., BC2GM, BC5CDR-chemical, NCBI-disease, and Species-800). Conclusion The experimental results on six English benchmark datasets demonstrate that auto-processed syntactic information can be a useful resource for BioNER and our method with KVMN can appropriately leverage such information to improve model performance.

the general domain, BioNER is considered to be more difficult due to the lack of largescale labeled training data and domain knowledge.
In the past decades, there have been many studies on BioNER, ranging from traditional feature based methods [4, 15-18, 20, 37] to recent deep learning based neural methods [5,12,19,23,32,45]. Among the neural methods, the ones leveraging powerful encoders (e.g., biLSTM) achieve better results comparing with feature based methods because such encoders are good at modeling contextual information. More recently, pretrained models such as ELMo [30] and BERT [6] achieved state-of-the-art performance on many NLP tasks in the general domain. Therefore, some studies [13,19] applied them to BioNER yet found that these models cannot perform as well as in the general domain when there is no domain-specific information integrated. Therefore, Lee et al. [19] proposed a variant of BERT, namely, BioBERT, for the biomedical domain, which is pretrained on large raw biomedical corpora and achieves state-of-the-art performance in BioNER.
In addition to the powerful encoders, syntactic information has also been playing an important role in many previous studies to help recognize biomedical named entities [4,5,20,23,37]. Intuitively, biomedical text often includes formal, well-structured, and long sentences, where syntactic information could be helpful because it can provide useful cues for recognizing NEs and thus help with the text understanding of NLP systems [36]. For example, Fig. 1 shows the parse tree of a sentence where the disease entity "Huntington disease" forms the object; thus, the boundary of a noun phrase can be a good cue for NER. Moreover, comparing with other types of extra resources, e.g., knowledge base [1,24,49], which are generally not publicly available or require human annotations, the syntactic information is easier to obtain through off-the-shelf NLP toolkits. Therefore, considering that the state-of-the-art BioBERT [19] does not leverage any syntactic information, we propose to improve BioBERT by incorporating the syntactic information of the input text, which is obtained from the parsing results of off-the-shelf toolkits.
To incorporate syntactic information into BioNER methods, previous studies have tried several ways. In the feature engineering methods, researchers use syntactic information to generate handcrafted features to help BioNER. For example, Song et al. [37] used part-of-speech (POS) and noun phrase tag features in a CRF-based BioNER system. In recent deep learning based methods, syntactic information is firstly represented by vectorized embeddings and then combined with word embedding through vector concatenation or element-wise summation, after which the resulting vector is fed into neural models to improve bioNER. For example, Luo et al. [23] used embedding vectors to represent syntactic information including POS and constituent labels, and concatenated Fig. 1 An example sentence. An example where the object noun phrase ("Huntington disease") is a named entity. The labels under the words are BIO tags those vectors with word embeddings. The combined embeddings were then sent into a biLSTM-CRF model with an attention mechanism to detect chemical NE. Dang et al. [5] proposed a model named D3NER, where the embeddings of various informative syntactic information are used to improve the results. Overall, previous approaches to leverage auto-processed syntactic information were limited to directly concatenating the embeddings of the syntactic information instances and the input words, without weighing the syntactic information instances in a specific context, where noisy information may hurt model performance. Therefore, we need to find a better method to leverage auto-processed syntactic information.
To weigh the syntactic information instances and leverage the important ones to improve BioNER methods, key-value memory networks (KVMN) [26] could be potentially useful, because it is demonstrated to be useful in leveraging extra information, e.g. knowledge base entities, to improve question answering tasks. In KVMN, the information is represented by key-value memory slots, where the keys are used to compute the weights for values by comparing these keys with the input, and the values are weighted summed according to the resulting weights and then used to make predictions. In addition, although the KVMN is originally proposed for question answering tasks, its variants also demonstrate impressive performance in many NLP tasks, such as Chinese word segmentation [40], semantic role labeling [11], and machine translation [27]. This motivates us to explore the possibility of using KVMN to leverage the syntactic information to improve BioNER. Therefore, in this paper, we propose BioKMNER (KM stands for Key-value Memory network), which uses KVMN to incorporate syntactic information into the backbone sequence labeling tagger to improve BioNER. Specifically, we firstly use off-the-shelf toolkits to parse biomedical text sentences and extract three types of syntactic information: namely, POS labels, syntactic constituents, and dependency relations. Then, for each word in the input sentence, in the KVMN, we use the keys to represent the context features associated with the word and the values to denote the corresponding syntactic information instances. Therefore, context features (keys) are used to compute the weights by comparing them with the input word, and syntactic information instances (values) are weighed accordingly. Finally, the weighted summed values are concatenated with the output of the encoder, where the resulting vector is fed into the decoder for prediction. In this way, the method can incorporate the pair-wisely organized context features and syntactic information instances obtained from the toolkits simultaneously. Different from previous studies that directly use noisy syntactic information instances by embedding concatenation, our BioKMNER weighs them in KVMN and thus reduces the effect of error propagation caused by the noisy parsing results. We experiment BioKM-NER on six English benchmark BioNER datasets covering four different entity types (i.e., chemical, disease, gene/protein, and species). The results demonstrate the effectiveness of our method for BioNER, where BioKMNER outperforms the BioBERT results reported by Lee et al. [19] on all datasets and achieves state-of-the-art results on four of them.
BC2GM BC2GM is a dataset used for the BioCreative II gene mention tagging task. 1 It contains 20,000 sentences from abstracts of biomedical publications and is annotated with 24,583 mentions of the names of genes, proteins and related entities. It has become the major benchmark for the NER of gene/proteins entity type [10,12,19,31,43,48].
JNLPBA JNPBA is the dataset for the Joint Workshop on NLP in Biomedicine and its Application Shared task. 2 It was organized by the GENIA Project based on the annotations of the GENIA Term corpus and consists of 2404 publication abstracts. It is widely used for evaluating multi-class biomedical entity taggers.
BC5CDR-chemical BC5CDR is a dataset used for the BioCreative V Chemical Disease Relation (CDR) Task. 3 It contains 1500 titles and abstracts from PubMed, 4 where chemical and disease mentions are annotated by human annotators. Following previous studies [23], we only use the subset with chemical entities and denote it as BC5CDR-chemical.
NCBI-disease NCBI-disease contains 793 PubMed abstracts that are annotated with disease mentions and their corresponding concepts. There are 6,892 disease mentions from 790 unique disease concepts in this dataset and 91% of the mentions are mapped to a single disease concept. It has been a widely used research resource for the disease NER.
LINNAEUS The LINNAEUS dataset was created specifically for species named entity recognition and consists of 100 full-text documents. In the LINNAEUS dataset, all mentions of species terms were manually annotated and normalized to the NCBI taxonomy IDs of the intended species.
Species-800 Species-800 is a novel benchmark corpus for species entities, which is based on manually annotated abstracts. It comprises 800 PubMed abstracts that contain identified organism mentions. Because the abstracts are select from journals on 8 different categories, the diversity of Species-800 is high and thus it is more challenging for NER systems.

Implementation
In the experiments, we use off-the-shelf NLP toolkits to generate syntactic information, following the common practice in previous studies such as Mohit and Hwa [28], Tkachenko and Simanovsky [42], and Luo et al. [23]. In our study, we focus on three types of syntactic information: POS labels, syntactic constituents, and dependency relations. We use Stanford CoreNLP Toolkits (SCT) 7 [25], which is a well-known NLP toolkit used in many studies [33,39], to obtain the POS tagging, constituency, and dependency parsing results of a given input sentence.
For the encoder, considering that BERT [6] and its variants [2,3,7,19] achieve state-of-the-art performance on many NLP tasks, we use the variant for the medical domain, i.e., BioBERT [19], in our method. Specifically, we use both the base and large version of BioBERT 8 and follow the hyper-parameters used by Lee et al. [19] (i.e., for BioBERT-Base, there are 12 self-attention heads with 768-dimensional hidden vectors; for BioBERT-Large, the number of heads is 24 with 1024-dimensional hidden vectors). All parameters in the encoder are fine-tuned in training. For the KVMN module, the embeddings of all keys and values are randomly initialized, with their dimension matching the dimension of hidden vectors in the BioBERT encoder. Besides, we follow the setting of Lee et al. [19] to run the training process, where we do not use the "alternate" annotations for the BC2GM dataset. Moreover, for each method, we train five models with different random seeds to initialize the model parameters and use the average of their micro F1 scores for evaluation. 9 In the experiments,we train each model for 150 Table 1 The statistics of the four benchmark datasets "Token #", "Sent. #" and "Entity #" represent the number of tokens, sentences, and entities  7 We use v3.9.2, downloaded from https ://stanf ordnl p.githu b.io/CoreN LP/. 8 We obtain the pre-trained models v1.1 from https ://githu b.com/naver /biobe rt-pretr ained 9 We evaluate all models by the widely used seqeval framework at https ://githu b.com/chakk i-works /seqev al.
epochs for the BC2GM dataset and for 60 epochs for all other datasets. 10 In each run, we evaluate the model on the development set after each epoch to find its best performing result.

Overall results
We run the baseline methods without using syntactic information and our methods with KVMN ( M ) to incorporate three types of syntactic information obtained from SCT on six aforementioned datasets, where two different encoders, i.e., BioBERT-Base and BioBERT-Large, are used. For reference, we also run baseline methods that use direct concatenation (DC) to incorporate such syntactic information, where the embeddings of context features and syntactic information are directly concatenated with the output of the BioBERT encoder. We report the experimental results (the average F1 scores of the five runs for each method as well as the standard deviations σ ) in Table 2. There are some observations. First, comparing with the baseline methods without using any syntactic information, our method with KVMN can work well with both BioBERT-Base and BioBERT-Large encoder, where decent improvements over the baseline methods are observed among all datasets.
Second, compared with DC, our methods with KVMN to incorporate syntactic information achieve better results in most cases. For example, on the Species-800 dataset, our method (Base + DR ( M )) obtains an average F1 score of 75.81% , while its corresponding DC-based method (Base + DR (DC)) obtains a lower average F1 score of 75.12% .

Table 2 Experimental results of models on six benchmark datasets
The experimental results are reported in terms of average F1 scores (F1) and the standard deviation σ . The methods in the group "Base" and "Large" refer to baselines with BioBERT-Base and BioBERT-Large encoder and our methods with KVMN ( M ). "DC" refers to the baseline method using direct concatenation to incorporate syntactic information. "PL", "SC", and "DR" stand for POS labels, syntactic constituents, and dependency relations, respectively Besides, in some cases where DC is applied, the syntactic information causes inferior results than baselines. For example, on the LINNAEUS dataset, the average F1 score of the DC-based method with the POS labels (Large + PL (DC)) is lower than the baseline (Large) results. One possible explanation could be: there are some noisy syntactic results extracted by off-the-shelf toolkits, which may influence the performance of the model and lead to worse results compared to the baselines only using BioBERT. Under this condition, methods with DC fails to distinguish the salient syntactic information that contributes more to the bioNER task in a specific context. On the contrary, KVMN can weigh such syntactic information according to the importance of the context features and thus, to some extent, avoid the errors caused by incorporating auto-processed syntactic information. Third, in many cases, in methods with KVMN, the information of syntactic constituents (SC) and dependency relations (DR) offers higher improvement than POS labels (PL). For example, on the BC2GM dataset, our method with the BioBERT-Large encoder obtains the average F1 scores of 85.43% and 85.17% when it is enhanced by SC and DR, respectively, while its average F1 score is 85.07% when PL is incorporated. One possible reason to explain the phenomenon could be: (1) the syntactic constituents can provide a cue of phrase functions and their boundaries (e.g., an NP treelet is not only a signal that can suggest there might be an NE inside, but also can give information about the possible starting and ending positions for that potential NE.); (2) the dependency relations link words in long-distance with their dependency relationships, which could be especially useful for biomedical texts that generally include long sentences and entities.

Comparison with previous studies
We compare the results of our best performing model with previous studies on all aforementioned datasets. The results (F1 scores) are summarized in Table 3, where our method outperforms the previous study (i.e., Lee et al. [19]) using the base and large version of BioBERT on all datasets. Specifically, for the BioBERT-Base, the improvement of F1 scores on BC2GM, JNLPBA, BC5CDR-chemical, NCBI-disease, LINNAEUS, and Species-800 are 0.20% , 0.23% , 0.53% , 0.37% , 0.55% , and 0.90% respectively; for the BioBERT-Large, the improvement on BC2GM and NCBI-disease are 0.28% and 0.84% , respectively. These results demonstrate the effectiveness of our method to leverage autoprocessed syntactic information in recognizing different types of named entities in the biomedical domain. In addition, our method achieves state-of-the-art performance on four datasets, i.e., BC2GM, BC5CDR-chemical, NCBI-disease, and Species-800. Compared with [48] and [12], we do not outperform their results on JNLPBA and LIN-NAEUS, because the gaps between their results and our baseline method, i.e., BioBERT from Lee et al. [19], are big on these datasets, which could be hard to compensate for by using syntactic information. Except for the two datasets, our method outperforms their methods on all other datasets.

Effect of syntactic information ensemble
To explore the effect of using different types of syntactic information together, we conduct syntactic information ensemble experiments on the BC5CDR-chemical dataset.
In the experiments, we test different combinations of different types of syntactic information, where two ensemble strategies are used. The first sums the weighted value embeddings of each type of syntactic information; and the second uses concatenation. The results of the average F1 scores of different settings are reported in Table 4, where the results form the baseline methods without using any syntactic information are also included for reference. We have several observations from it. First, overall, compared with the baseline methods, our methods achieve better results with both the base and large versions of the BioBERT encoder. This indicates that the combination of different types of syntactic information can help with the performance of the baseline method for BioNER. Second, the concatenation strategy performs better than the summing strategy in syntactic information fusion. One possible explanation could be: summing the embeddings of different types of syntactic information may lose some information while concatenating them can keep all information on all types of syntactic embedding.

Effect of different toolkits
To explore the effect of using syntactic information from different NLP toolkits, in addition to SCT, we try another toolkit, i.e., spaCy, 11 to obtain the auto-processed syntactic information. In the experiments, we try two types of syntactic information, i.e., POS labels (PL) and dependency relations (DR), from the POS tagger and dependency parser of spaCy. We report the results (the average F1 scores and the standard deviation σ ) of our methods with KVMN on the BC5CDR-chemical dataset in Table 5. For reference, the results of our method using SCT as well as the baseline results are also reported. From the results, we can find that, for both base and large BioBERT encoders, our method can leverage the syntactic information from different NLP toolkits and thus achieves better performance comparing with the baseline methods.

Case study
To better understand how our method improves BioNER, we conduct a case study where two example sentences are used. In Fig. 2a, b, we show two sentences and illustrate the way of syntactic constituents and dependency relations to improve  figure, a, b are two examples of syntactic information (i.e., syntactic constituents and dependency relations) and the context features for "SEP" and "dystrophy", respectively. The weights for syntactic information learned from the memories are highlighted with the darker color referring to greater value bioNER, respectively. In both cases, for a specific word, we visualize the weights assigned to the corresponding syntactic information instances (values) on its associated contextual features (keys), where the deeper color refers to the higher weight. Syntactic constituents In the example sentence shown in Fig. 2a, the word we focus on is "SEP". In this case, the constituent information firstly narrows the context features of "SEP" down to the words within the noun phrase "pure spinal SEP abnormalities". Then, the KVMN module assigns the highest weight to "abnormalities" and its carrying syntactic information of "NP" among all other syntactic instances since they could be strong signals for disease names. Therefore, our method could assign the correct NE label to "SEP". Likewise, the situation for "pure" is on the opposite and thus it receives the lowest weight among other words.
Dependency relations In addition, in Fig. 2b, we visualize the weights assigned to dependency relations for the word "dystrophy" in another example sentence. In this case, dependency information successfully finds the dependents, i.e., "Myotonic", "DM", and "disorder", of "dystrophy", which could suggest useful cues to predict the NE labels. Among those dependents, KVMN distinguishes that the dependent "discorder" with an "appos" dependency relation (appositional modifier) strongly suggests "dystrophy" is a disease entity. Therefore, KVMN assigns the highest weight to the dependency relation offered by "disorder". Similarly, another modifier (i.e., "Myotonic") of "dystrophy" is also distinguished and weighed by the KVMN, and the second-highest weight is assigned to it accordingly. It is worth noting that the dependency information that contributes most to recognizing "dystrophy" as a part of an NE comes from a word ("disorder") in the long-distance; dependency information is able to capture that information and helps our method predict the NE tag for the word "dystrophy".

Conclusion
In this paper, we propose a method named BioKMNER with KVMN to enhance BioNER with auto-processed syntactic information (i.e., POS labels, syntactic constituents, and dependency relations) from off-the-shelf toolkits. In KVMN, context features and their corresponding syntactic information instances are mapped into keys and values, respectively. The values are weighed according to the comparison between the keys and the input words. Then the values are weighed summed and the resulting vector is fed back to the backbone tagging process to make predictions. In doing so, compared with previous studies that treat different syntactic information equally and leverage them by embedding concatenation, our method can discriminatively leverage the auto-processed syntactic information and avoid the error spread caused by the direct use of auto-processed syntactic results. The experimental results on six English benchmark datasets demonstrate that syntactic information can be a good resource to improve bioNER and our method with KVMN can appropriately leverage such information. In addition, our method outperforms the strong baseline method from the previous study using BioBERT [19] on all datasets and achieves state-of-the-art results on BC2GM, BC5CDR-chemical, NCBI-disease, and Species-800 datasets.

Methods
The overall architecture of our BioKMNER is shown in Fig. 3. Following the common approaches in BioNER, we treat it as a sequence labeling task, where the input word sequence X = {x 1 , x 2 , . . . , x i , . . . x l } is tagged with a sequence of NE labels Y = { y 1 , y 2 , . . . , y i , . . . y l } . In our method, we propose key-value memory networks (KVMN) [26] to incorporate syntactic information. Specifically, context features and their carrying syntactic information instances are mapped to keys and values in KVMN, where the values are weighed according to the comparison between the keys and the input words.
In this section, we firstly introduce the syntactic information extraction process. Then we elaborate the KVMN module used to incorporate the syntactic information. Finally, we explain how our NER method works with the KVMN module.

Syntactic information extraction
In our study, we focused on three types of syntactic information: POS labels, syntactic constituents, and dependency relations. To obtain such information, we first run the offthe-shelf NLP toolkits on the input sentence X . Then for each word x i in X , we extract the context features associated with x i and their corresponding syntactic information instances. Figure 4 shows the three types of context features and their corresponding syntactic information instances 12 for the sentence "Dihydropyrimidine dehydrogenase Fig. 3 The overall architecture of BioKMNER. The top part of the figure shows the syntactic information extraction process: for the input word sequence, we firstly use off-the-shelf NLP toolkits to obtain its syntactic information (e.g., syntax tree), then map the context features and the syntactic information into keys and values, and finally convert them into embeddings. The bottom part is our sequence labeling based BioNER tagger, which uses BioBERT [19] as the encoder and a softmax layer as the decoder. Between the encoder and decoder are the key-value memory networks (KVMN) which weighs syntactic information (values) according to the importance of the context features (keys). The output of KVMN is fed into the decoder to predict output labels deficiency is an autosomal recessive disease". 13 This figure focuses on the word "deficiency" (in boldface) with its highlighted context features and their corresponding syntactic information.
POS labels Given a current word x i in X , we use a 1-word window to extract the context words and their POS labels at both sides of x i . As is shown in Fig. 4a, for word "deficiency", the context features are [deficiency,dehydrogenase,is] and the corresponding syntactic instances are [deficiency_NN,dehydrogenase_NN,is_VBZ].
Syntactic constituents First, we define a set of acceptable syntactic nodes (denoted by L ) which contains 10 different constituent types 14 to selected syntactic constituents from the syntax tree of the input X . Then, for each word x i in X , we start with the leaf of x i in the parse tree, search up to find the first acceptable syntactic node which is in L . After finding the first acceptable node of x i , the words under that node and their combination with the node type label are selected as the context features and their corresponding syntactic information respectively. As is shown in Fig. 4b  The context features and their corresponding POS labels, syntactic constituents, and dependency relations for x 5 ="deficiency" are highlighted in part a, b, and c respectively Dependency relations According to the dependency relations of the words in the sentence, we first collect the dependents and the governor of the given word (i.e., first-order dependency relations). Then, we regard its dependents, its governor, and the word itself, as the context features and regard the combination of these words and their dependency types as the syntactic instances. In Fig. 4c, for the given word "deficiency", it has two dependents (i.e., "dihydropyrimidine" and "dehydrogenase") and one governor (i.e., "disease", which is the root of the sentence). According to these dependency relations, the context features of "deficiency" are [deficiency, dihydropyrimidine, dehydrogenase, a, metabolic] and the syntactic information instances are [deficiency_nsubj, dihydr opyrimidine_compound, dehydrogenase_compound, disease_ROOT].
Through these processes, the context feature list K and the syntactic instance list V are built upon the extraction results for each type of syntactic information. For each word x i in the word sequence X , in both training and predicting process, associated context features and syntactic information instances in K and V are activated and computed. We denote the context features and the syntactic information instances for x i as , respectively. Note that the context feature list K and syntactic instance list V used in our model do not necessarily need to include all three types of the syntactic information discussed above. In other words, our model can leverage each type of syntactic information independently. In the following subsection, we illustrate our method to leverage the keys and values through KVMN.

The memory module
Previous methods to leverage syntactic information for BioNER are limited in concatenating the embeddings of syntactic information instances with the input word embeddings. This method fails to distinguish the useful syntactic instances in a specific context, so that noisy syntactic information may hurt model performance. Therefore, we propose to use KVMN to enhance the incorporation process of syntactic information. Originally, KVMN is firstly proposed to incorporate the information in a list of memory slots (k j , v j ) (where k j and v j refer to keys and values, respectively) 15 into a model for question answering tasks. In KVMN, it addresses the keys by assigning a probability weight to the value in each memory slot by comparing the question (which is denoted as x) to each key: where · are feature mapping matrices and A is a matrix. Then, KVMN reads the values by computing the weighted sum using the resulting probability weights: Afterwards, o is incorporated into the question representation by an element-wise summation: o ′ = A� X (x) + o and the resulting o ′ is used to predict the answers of (1) p j = softmax(A� X (x) · A� K (k j )) 15 Here, we use the subscript j instead of i in the original paper to avoid confusion, because i is already used to refer to the input word x i at the position i. the question. Therefore, in KVMN, the keys are used to compute the weights, which is used to address the values with respect to the input; the values are used to incorporate useful information into the input presentation and thus improve model performance. Considering that knowledge base entries have been used as a possible type of resources for the memory slots to incorporate extra knowledge into the input representation by transforms between the keys and values [26], we can also use such transforms between context features and syntactic information instances to incorporate the syntactic information into our backbone method. In doing so, not only is the syntactic information addressed by comparing the input with context features (which we think is more intuitive than comparing the input with syntactic information), but also different syntactic information instances are weighed according to the comparison between keys and the input, which allows our method to distinguish the important syntactic information instances and leverage them accordingly.
In our approach to bioNER, we adapt the KVMN to a sequence labeling paradigm by applying it to each word x i in the input. Therefore, for x i , its hidden vector h i obtained from an encoder serves as the counterpart of input representation A� X (x) ; its associated context features and the corresponding syntactic information instances stand for the keys k j and values v j , respectively. In details, the memory module takes h i for each x i , activates the keys to address their embeddings and computes the probability weights for them by where e k i,j is the embedding vector of k i.j . Afterwards, we use the resulting probabilities on syntactic information instances in V i to get the weighted value embedding o i : where e v i,j is the embedding vector of the value v i,j . Once o i is obtained for each x i , we concatenate 16 it with h i to get the o ′ i , which can be represented by o ′ i = h i ⊕ o i .

Tagging with KVMN
To facilitate the process of leveraging syntactic information through KVMN, we firstly use an encoder to obtain the hidden vector h i for each x i . Among different types of encoders, in our method, we use the prevailing BioBERT [19], which is demonstrated to be an effective encoders for many biomedical NLP tasks, such as relation extraction [22] and natural language inference [46]. Therefore, the process to obtain the hidden vectors for the input X can be represented by