Biomedical named entity recognition with the combined feature attention and fully-shared multi-task learning
BMC Bioinformatics volume 23, Article number: 458 (2022)
Abstract
Background
Biomedical named entity recognition (BioNER) is a basic and important task for biomedical text mining with the purpose of automatically recognizing and classifying biomedical entities. The performance of BioNER systems directly impacts downstream applications. Recently, deep neural networks, especially pre-trained language models, have made great progress for BioNER. However, because of the lack of high-quality and large-scale annotated data and relevant external knowledge, the capability of the BioNER system remains limited.
Results
In this paper, we propose a novel fully-shared multi-task learning model based on the pre-trained language model in biomedical domain, namely BioBERT, with a new attention module to integrate the auto-processed syntactic information for the BioNER task. We have conducted numerous experiments on seven benchmark BioNER datasets. The proposed best multi-task model obtains F1 score improvements of 1.03% on BC2GM, 0.91% on NCBI-disease, 0.81% on Linnaeus, 1.26% on JNLPBA, 0.82% on BC5CDR-Chemical, 0.87% on BC5CDR-Disease, and 1.10% on Species-800 compared to the single-task BioBERT model.
Conclusion
The results demonstrate our model outperforms previous studies on all datasets. Further analysis and case studies are also provided to prove the importance of the proposed attention module and fully-shared multi-task learning method used in our model.
Background
With the rapid development of biomedical research, the number of biomedical documents is growing exponentially, which makes it difficult for biomedical scholars to keep pace with cutting-edge research. There is an increasing need for effective natural language processing (NLP) tools to help retrieve, organize, and manage the massive amount of biomedical data and information. Biomedical named entity recognition (BioNER) is a fundamental first step in biomedical literature mining. It aims to detect the boundaries of biomedical entities and predict their entity types, such as diseases, genes, species, and chemicals. The performance of BioNER systems directly impacts downstream applications, such as biomedical relation extraction [1, 2], drug-drug interaction tasks [3, 4] and knowledge base construction [5, 6].
A BioNER task is typically considered as a sequence labeling task, which aims to assign the best label sequence for a given input sentence. A common tagging method is the BIO format [7], which denotes whether each token is at the Beginning of an entity, Inside, or Outside an entity. This method is capable of distinguishing consecutive entities and can be used easily in an end-to-end model, which inputs each token and produces BIO tags in the final layer. An example sentence annotated using the BIO format can be found in Fig. 1, where “congenital myotonic dystrophy” is the entity detected and “disease” is the entity type classified.
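The BIO scheme described above can be illustrated with a short Python sketch. The function name `spans_to_bio` and the span format (0-based token indices, end exclusive) are assumptions made for this illustration and are not taken from the paper.

```python
def spans_to_bio(tokens, spans):
    """Convert entity spans into B/I/O tags.

    spans: list of (start, end) token indices, 0-based, end exclusive.
    """
    tags = ["O"] * len(tokens)
    for start, end in spans:
        tags[start] = "B"                  # first token of the entity
        for k in range(start + 1, end):
            tags[k] = "I"                  # remaining tokens of the entity
    return tags


tokens = "This case is a paternally transmitted congenital myotonic dystrophy".split()
# One disease entity: "congenital myotonic dystrophy" (tokens 6..8, 0-based).
tags = spans_to_bio(tokens, [(6, 9)])
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

Because every entity starts with its own "B" tag, two adjacent entities remain distinguishable even when no "O" token separates them.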
Traditional methods for the BioNER task usually used dictionary-based or rule-based approaches [8, 9]. These methods relied heavily on biomedical experts to establish dictionaries or rules, which is labor-intensive and time-consuming. As the amount of data increased, more researchers turned to machine learning approaches for the BioNER task, such as the support vector machine (SVM) [10, 11] or the conditional random field (CRF) [7, 12, 13]. However, conventional machine learning approaches need many handcrafted features extracted from raw data, and their performance is limited. The rapid development of deep learning provides an easier way to overcome these problems. Crichton et al. [14] used the word context as the input to a convolutional neural network (CNN), and Habibi et al. [15] proposed a bidirectional LSTM (BiLSTM) model combined with a CRF layer. More recently, pre-trained language models like BERT [16], XLNet [17], and RoBERTa [18] have achieved great success on many NLP tasks. Lee et al. [19] introduced a domain-specific language model, named BioBERT, which is pre-trained on large-scale biomedical corpora. BioBERT largely outperformed previous methods in several biomedical text mining tasks, including BioNER. Considering the strong performance of BioBERT, we use it as the encoder of our model to obtain high-quality semantic representations.
In addition, we assume that combining the syntactic information, e.g., part-of-speech (POS) labels, syntactic constituents, and dependency relations with the pre-trained BioBERT can help recognize biomedical named entities. Specifically, sentences in biomedical texts are usually formal, well-structured and contain a lot of specialized terms, in which syntactic information can present grammatical structure for sentences and provide helpful cues for understanding the relationship between words. For example, Fig. 2 shows the constituency parse tree automatically produced by the NLP toolkit, where the disease entity is “congenital myotonic dystrophy.” The range of the noun phrase in this tree instead of the adjective phrase can be a good hint for BioNER. The other advantage of syntactic information is that it can be automatically generated by off-the-shelf NLP toolkits rather than manually constructed, which makes it easier to use in this task. Previous studies [20,21,22,23,24] suggest that the syntactic information has a certain ability to help the BioNER task. These studies normally concatenated the embeddings of the syntactic features with the word embeddings directly, which hurts the model performance because of the error-prone and noisy syntactic information processed from NLP toolkits. Accordingly, Tian et al. [25] proposed a novel model instead of directly concatenating to incorporate the syntactic information into the BioBERT encoder and achieved the best results in several BioNER datasets. They used the key-value memory network (KVMN) [26], a new deep neural method learning from pairwise information, to weight the syntactic features. However, the output of the KVMN mainly relies on the value embeddings. The key embeddings are only used for providing weights to values. To solve this problem, Tian et al. [27] proposed a new attention mechanism, named two-way attention, to integrate syntactic information for the encoder. 
The two-way attention can make full use of syntactic features, rather than using one feature (key embeddings) to weight the other (value embeddings) as in KVMN. Although this method achieved good performance in another task, named “the joint Chinese word segmentation (CWS) and part-of-speech (POS) tagging task,” it still has some shortcomings. One is that the two-way attention mechanism employed two separate attention parts, therefore it may lose some information between the two parts. Another is that the embeddings of syntactic information are only randomly initialized, which lacks a strong semantic representation ability and may cause the out-of-vocabulary problem. Consequently, we propose a novel attention mechanism to tackle these problems.
An effective deep learning model requires huge amounts of data. However, annotated datasets in the biomedical domain are often scarce because of privacy restrictions and the specialized expertise needed for annotation. To deal with this problem, multi-task learning (MTL) has been introduced by previous studies [14, 28,29,30,31,32,33,34] and has achieved great success in the BioNER task. The basic idea of MTL is to train on multiple annotated datasets at the same time to improve the performance on each single dataset. The datasets are all BioNER datasets with a similar format, but they may cover different entity types and may have been created by different researchers. Different datasets in a similar domain may contain useful common information such as lexical semantics and grammatical expression. The multi-task model can therefore share this information across different datasets in the training step. In general, previous MTL models share certain parts of the model parameters across datasets and leave the rest separate for specific tasks. For example, Crichton et al. [14] proposed an MTL model that shares parameters in the encoder layers and convolution layers, and trains the decoder layers separately for each dataset. Wang et al. [28] proposed a BiLSTM-CRF model with an additional character layer. Their MTL model was trained by sharing the parameters of the character-level and word-level LSTMs and adjusting the parameters of the CRF layer independently for different datasets. Chai et al. [34] trained an MTL model by sharing the parameters of the lower layers of XLNet and separately training the upper layers of XLNet and the CRF decoder. This kind of MTL method limits the ability to share information across different datasets and causes the model to rely too heavily on the task-specific layers. Besides, the number of model parameters grows with the number of datasets because of the additional task-specific layers. Huang et al. [35] proposed a transfer learning model that shares all parameters to integrate multiple cross-domain datasets and achieved good results in Chinese word segmentation tasks. Inspired by this work, we propose a straightforward and effective multi-task learning model that shares all parameters across different datasets. The benefit is that we do not need any task-specific models to fit different datasets, and the method can be directly applied to the single-task model without manual adjustments. It limits the growth of the total number of parameters of the MTL model and improves the performance on several BioNER datasets.
In this paper, we propose a novel fully-shared multi-task learning model based on the pre-trained BioBERT with a new attention module to integrate the auto-processed syntactic information for the BioNER task. The proposed framework contains two parts: One is the single-task method which is only trained on each single BioNER benchmark dataset, and the other is the multi-task method trained across all datasets together. Specifically, our single-task model uses a new proposed attention mechanism, named Combined Feature Attention (CFA), to integrate the syntactic information into BioBERT encoder for improving the performance. We employ the open source NLP toolkit to parse the input sentence and extract several types of syntactic information. Then, we use the proposed attention module to weight each token and its corresponding syntactic features, where syntactic features are combined with the hidden embeddings derived from BioBERT and syntactic labels obtained from the toolkit. Finally, the attention vectors are concatenated with the output of the BioBERT and used to guide the tagging process for the decoder. In this way, the single-task method takes advantage of the pre-trained BioBERT and syntactic information, and outperforms other single-task models in the BioNER task. Moreover, we introduce a straightforward and effective multi-task learning method which shares all model parameters to incorporate multiple datasets into one model. The fully-shared MTL method is a basic but effective way to learn the commonality among different datasets and can be applied easily to many single-task neural network models. To summarize, the main contributions of this paper are as follows:
- We propose a new attention mechanism, named CFA, to make good use of the pre-trained BioBERT and the syntactic information in the single-task model. Our single-task model substantially outperforms the baseline BioBERT model and other models using the syntactic information because of our better syntactic feature extraction and combination ability.
- We introduce a straightforward and effective multi-task learning method which shares all parameters without task-specific layers for different datasets. The fully-shared MTL method discriminatively exploits the implicit information across different datasets and significantly improves BioNER compared with the single-task model.
- The experiment results on seven benchmark BioNER datasets show our fully-shared MTL model with CFA outperforms others on all datasets, which proves the effectiveness of the proposed method. Analyses and case studies show all components of our proposed model are necessary for achieving high performance.
Methods
Following the previous approaches, we treat BioNER as a sequence labeling task. Given the input biomedical sentence of n words \(\text {X}=[{{x}_{1}},{{x}_{2}},\ldots ,{{x}_{i}},\ldots ,{{x}_{n}}]\), the output is a sequence of named entity labels \(\text {Y}=[{{y}_{1}},{{y}_{2}},\ldots ,{{y}_{i}},\ldots ,{{y}_{n}}]\), where \({{x}_{i}}\) is the i-th word in the sentence, and \({{y}_{i}}\) is the i-th predicted label. For each \({{x}_{i}}\), the goal is to predict the corresponding label \({{y}_{i}}\) from the set {‘B’, ‘I’, ‘O’}, where ‘B’ indicates that the word \({{x}_{i}}\) is the beginning of a biomedical entity, ‘I’ denotes that \({{x}_{i}}\) is inside a biomedical entity, and ‘O’ denotes that \({{x}_{i}}\) is outside any entity, i.e. \({{x}_{i}}\) is not part of a biomedical entity.
The proposed framework contains two parts: One is the single-task model which is only trained on each single BioNER dataset, and the other is the multi-task model trained across all datasets together. In this section, we respectively explain details of the proposed single-task model and multi-task model.
Single-task model (STM)
The overall architecture of our single-task model is detailed in Fig. 3. The left part describes the backbone of the proposed architecture for the BioNER sequence labeling paradigm, and the right part is the process of handling the syntactic information. We propose a novel attention module to integrate the syntactic information into the backbone of the model. In this section, we first introduce the process of syntactic feature extraction. Next, we describe how the proposed attention mechanism, namely combined feature attention (CFA), incorporates the syntactic features into BioBERT. Finally, we describe how the sequence labeling model works with the attention layer.
Syntactic feature extraction
Following previous studies [25, 27], we utilize three types of syntactic information: POS labels, syntactic constituents, and dependency relations. POS is a category of words with similar grammatical properties, such as nouns, verbs, adjectives, adverbs and so on. Syntactic constituent is a word or a group of words that functions as a single unit in a hierarchical structure, such as a noun phrase, or verb phrase. Dependency relations are the concept that words are connected to each other through some kind of directed links, such as nominal subject, copula, adjectival modifier and so on. To obtain the syntactic information, first, we run the open source NLP toolkit, e.g., Stanford CoreNLP Toolkit (SCT) [36] to get the results for the input sentence \(\text {X}\). Then we extract the context features and their corresponding syntactic labels of each word \({{x}_{i}}\) in \(\text {X}\) from the results. In Fig. 4, we show an example for the highlighted word “congenital” in the input sentence “This case is a paternally transmitted congenital myotonic dystrophy,” where three types of context feature and their corresponding syntactic labels are extracted. We elaborate each type of syntactic information below.
- POS labels: For each word \({{x}_{i}}\) in sentence \(\text {X}\), we employ a 1-word window to extract the neighboring words on both sides of \({{x}_{i}}\) as context features and their corresponding POS labels as syntactic labels. For example, in Fig. 4a, the word “congenital” is the currently processed word, so the context features are the word itself and its left and right neighboring words, and the syntactic labels are the corresponding POS labels of each context word obtained from the toolkit. The context features are [transmitted, congenital, myotonic], and the syntactic labels are [VBN, JJ, JJ].
- Syntactic constituents: Given a word \({{x}_{i}}\) in \(\text {X}\), we first find the leaf containing \({{x}_{i}}\) in the syntactic parse tree, and then search up from the leaf to find the first acceptable ancestor node whose label is in a pre-defined syntactic label list following “the CoNLL-2003 shared task” [37]. Then we select all the words under this node as context features and take the first acceptable ancestor node of each of these words as its corresponding syntactic label. In Fig. 4b, “NP” is the first acceptable ancestor node for the example word “congenital.” There are six words under this node and each word can find its ancestor node. The context features are [a, paternally, transmitted, congenital, myotonic, dystrophy] and the corresponding syntactic labels are [NP, ADJP, ADJP, NP, NP, NP].
- Dependency relations: Dependency relations use directed acyclic graphs to depict the structure of a given sentence. The asymmetric relationship between two basic units is called a dependency relation. One unit is the dominant element (called the governor), and the other is the subordinate element (called the dependent). For each word \({{x}_{i}}\) in \(\text {X}\), we first select the governor words and dependent words of \({{x}_{i}}\) from the dependency structure as shown in Fig. 4c. Then, we treat these governor words, dependent words and the word \({{x}_{i}}\) as context features and treat the dependency types of these words in the graph as syntactic labels. As shown in Fig. 4c, the example word “congenital” has only one governor word, “dystrophy,” which is the root of the sentence, and no dependent words. For comparison, the word “transmitted” has one dependent word, “paternally,” to which the arc from “transmitted” points. The context features for the word “congenital” are [congenital, dystrophy] and the corresponding syntactic labels are [amod, root].
After these procedures, we can build the context feature sequence S and the syntactic label sequence L for each type of the syntactic information for each input sentence \(\text {X}\). Formally, for each word \({{x}_{i}}\) in \(\text {X}\), let \({\text {S}_{i}}=[{{s}_{i,1}},{{s}_{i,2}},\ldots ,{{s}_{i,j}},\ldots ,{{s}_{i,{m}_{i}}}]\) and \({\text {L}_{i}}=[{{l}_{i,1}},{{l}_{i,2}},\ldots ,{{l}_{i,j}},\ldots ,{{l}_{i,{m}_{i}}}]\) be the sub sequence of S and L, respectively. Here, \({{s}_{i,j}}\) denotes a context word extracted by the rules we define, \({{l}_{i,j}}\) denotes the corresponding syntactic label for \({{s}_{i,j}}\), and \({m}_{i}\) denotes the length of \({\text {S}_{i}}\) and \({\text {L}_{i}}\). For example, in Fig. 4a, we focus on the 7th word “congenital,” \({{s}_{7,1}}\) = “transmitted,” \({{l}_{7,1}}\) = “VBN,” and \({m}_{7}\) =3. It’s worth noting that we obtain different S’s and L’s for three types of syntactic information, and our model utilizes each type of syntactic information separately.
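As a rough illustration of how \({\text {S}_{i}}\) and \({\text {L}_{i}}\) could be built for two of the three feature types, the following Python sketch hard-codes a parse of the example sentence instead of calling a parser. Only the POS tags of “transmitted,” “congenital” and “myotonic” and the [amod, root] labels for “congenital” come from the text. All other tags, heads and relation names, as well as the function names, are assumptions made for this example. Constituent features are omitted because they require a full parse tree.

```python
tokens = ["This", "case", "is", "a", "paternally", "transmitted",
          "congenital", "myotonic", "dystrophy"]

# POS tags: VBN, JJ, JJ (positions 5-7) are from the text; the rest are assumed.
pos_tags = ["DT", "NN", "VBZ", "DT", "RB", "VBN", "JJ", "JJ", "NN"]

# Assumed dependency parse: heads[i] is the 0-based index of the governor of
# token i (-1 marks the root); rels[i] is the relation of token i to its governor.
heads = [1, 8, 8, 8, 5, 8, 8, 8, -1]
rels = ["det", "nsubj", "cop", "det", "advmod", "amod", "amod", "amod", "root"]


def pos_features(i, window=1):
    """Context words in a +/- window around token i, with their POS labels."""
    lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
    return tokens[lo:hi], pos_tags[lo:hi]


def dep_features(i):
    """Token i, its governor and its dependents, with their dependency types."""
    idx = [i]
    if heads[i] >= 0:
        idx.append(heads[i])                                   # governor
    idx.extend(k for k, h in enumerate(heads) if h == i)       # dependents
    return [tokens[k] for k in idx], [rels[k] for k in idx]


S_pos, L_pos = pos_features(6)   # 7th word, "congenital"
S_dep, L_dep = dep_features(6)
print(S_pos, L_pos)
print(S_dep, L_dep)
```

For the 7th word the POS variant yields three context words, matching \({m}_{7}\) = 3 in the text, while the dependency variant yields two.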
Combined feature attention
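Inspired by Tian et al. [27], we use the attention method to incorporate the syntactic features into the BioBERT model. We first feed the input sentence X into the encoder pre-trained BioBERT to get the hidden vector sequence:
\[\mathbf {H}=[\mathbf {h}_{1},\mathbf {h}_{2},\ldots ,\mathbf {h}_{i},\ldots ,\mathbf {h}_{n}]=\text {BioBERT}(\text {X})\]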
where \(\mathbf {h}_{i}\in {{\mathbb {R}}^{{d}_{1}}}\) is the hidden vector of the i-th word \({{x}_{i}}\) and \({{d}_{1}}\) is the hidden dimension of the encoder. Second, we convert the context feature sequence \({{S}_{i}}\) and the syntactic label sequence \({{L}_{i}}\) into embedding matrices for each word \({{x}_{i}}\). Because the words in \({{S}_{i}}\) are also included in the input sentence X, we leverage the hidden vector sequence H to embed \({{S}_{i}}\). Unlike previous methods [25, 27] that use randomly initialized embeddings or pre-trained embeddings, the embeddings in our method carry richer semantic representations and avoid the out-of-vocabulary (OOV) problem, because they are produced by BioBERT. Specifically, a context word \({{s}_{i,j}}\) in \({{S}_{i}}\) is also the k-th word of the sentence X for some k, so we use \(\mathbf {h}_{k}\) to directly represent the embedding of \({{s}_{i,j}}\), where \(1\le k\le n\). In this way, we can obtain the embedding of each word in \({{S}_{i}}\):
\[\mathbf {{E}_{i}^\text {S}}=[\mathbf {e}_{{i,1}}^\text {S},\mathbf {e}_{{i,2}}^\text {S},\ldots ,\mathbf {e}_{{i,j}}^\text {S},\ldots ,\mathbf {e}_{{i,{m}_{i}}}^\text {S}]\]
where the context feature embedding matrix \(\mathbf {{E}_{i}^\text {S}}\in {{\mathbb {R}}^{{d}_{1}\times {m}_{i}}}\) and \(\mathbf {e}_{{i,j}}^\text {S}=\mathbf {h}_{k}\). As shown in Fig. 3, the context features for the word “congenital” in the type of dependency relations are [congenital, dystrophy], where “congenital” and “dystrophy” are the 7th and 9th words in the input sentence, so \(\mathbf {{E}_{i}^\text {S}}=[\mathbf {e}_{{i,1}}^\text {S},\mathbf {e}_{{i,2}}^\text {S}]=[\mathbf {h}_{7},\mathbf {h}_{9}]\). As for \({{L}_{i}}\), we adopt the common approach of randomly initializing the embeddings and training them with the model:
\[\mathbf {{E}_{i}^\text {L}}=[\mathbf {e}_{{i,1}}^\text {L},\mathbf {e}_{{i,2}}^\text {L},\ldots ,\mathbf {e}_{{i,j}}^\text {L},\ldots ,\mathbf {e}_{{i,{m}_{i}}}^\text {L}]\]
where the syntactic label embedding matrix is \(\mathbf {{E}_{i}^\text {L}}\in {{\mathbb {R}}^{{d}_{2}\times {m}_{i}}}\) and \({d}_{2}\) is the manually chosen dimension of the initial embeddings. Then, we concatenate \(\mathbf {{E}_{i}^\text {S}}\) and \(\mathbf {{E}_{i}^\text {L}}\) to obtain the syntactic feature embedding matrix for each input word \({{x}_{i}}\) and align the dimension of \(\mathbf {e}_{{i,j}}\) and \(\mathbf {h}_{k}\) by a fully connected layer:
\[\mathbf {{E}_{i}}=\mathbf {{W}_{e}}\cdot (\mathbf {{E}_{i}^\text {S}}\oplus \mathbf {{E}_{i}^\text {L}})+\mathbf {{b}_{e}}\]
where \(\mathbf {{E}_{i}}\in {{\mathbb {R}}^{{d}_{1}\times {m}_{i}}}\) is the syntactic feature embedding matrix for \({{x}_{i}}\), \(\mathbf {{W}_{e}}\in {{\mathbb {R}}^{{d}_{1}\times ({d}_{1}+{d}_{2})}}\) is the weight matrix and \(\mathbf {{b}_{e}}\in {{\mathbb {R}}^{{d}_{1}\times {m}_{i}}}\) is the bias vector.
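Finally, we apply the Scaled Dot-Product Attention [38], an effective attention mechanism used in many NLP tasks, with the representations of the word \({{x}_{i}}\) and its syntactic features to get attention vectors. It can be formulated as:
\[{a}_{i,j}=\frac{\exp \left( \mathbf {{h}_{i}}^{\top }\mathbf {e}_{{i,j}}/\sqrt{{d}_{1}}\right) }{\sum _{j'=1}^{{m}_{i}}\exp \left( \mathbf {{h}_{i}}^{\top }\mathbf {e}_{{i,j'}}/\sqrt{{d}_{1}}\right) },\qquad \mathbf {{a}_{i}}=\sum _{j=1}^{{m}_{i}}{a}_{i,j}\,\mathbf {e}_{{i,j}}\]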
where \({a}_{i,j}\in {{\mathbb {R}}^{1}}\) is the attention weight for each syntactic feature \(\mathbf {e}_{{i,j}}\) in \(\mathbf {{E}_{i}}\), \(\mathbf {{h}_{i}}\) is the hidden vector of \({{x}_{i}}\), \(\mathbf {{a}_{i}}\in {{\mathbb {R}}^{{d}_{1}}}\) is the weighted vector for all syntactic features in \(\mathbf {{E}_{i}}\) and \(\sum\) denotes an element-wise sum operation. After that, we concatenate \(\mathbf {{a}_{i}}\) and \(\mathbf {{h}_{i}}\) to get the output vector \(\mathbf {{o}_{i}}\) for each word \({{x}_{i}}\) in sentence X, which can be expressed by \(\mathbf {{o}_{i}}=\mathbf {{a}_{i}}\oplus {\mathbf {{h}_{i}}}\).
In this way, the proposed attention module can learn the weights of the corresponding syntactic features for the input sentence. Since the attention module uses a special embedding method which combines the information of context features and syntactic labels, we name it combined feature attention (CFA).
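To make the data flow of the CFA module concrete, here is a minimal pure-Python sketch under several assumptions: the hidden vectors are random stand-ins rather than real BioBERT outputs, the dimensions are toy values, the bias is a single vector shared by all context positions, and the attention follows the standard scaled dot-product form (softmax over \(\mathbf {h}_{i}\cdot \mathbf {e}_{i,j}/\sqrt{{d}_{1}}\)). The names `cfa`, `label_emb`, `W_e` and `b_e` are illustrative and do not refer to the authors' implementation.

```python
import math
import random

random.seed(0)
d1, d2, n = 4, 2, 9   # toy sizes; BioBERT base uses a hidden size of 768

# Stand-in for the encoder output H = [h_1, ..., h_n] (random, NOT real BioBERT).
H = [[random.uniform(-1, 1) for _ in range(d1)] for _ in range(n)]
# Randomly initialised syntactic label embeddings (trainable in a real model).
label_emb = {lab: [random.uniform(-1, 1) for _ in range(d2)]
             for lab in ("amod", "root")}
# Fully connected layer that maps the (d1 + d2)-dim concatenation back to d1.
W_e = [[random.uniform(-1, 1) for _ in range(d1 + d2)] for _ in range(d1)]
b_e = [0.0] * d1


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def cfa(i, context_idx, labels):
    """Return the attention weights and o_i = concat(a_i, h_i) for word i."""
    # e_{i,j}: concatenate h_k of the context word with its label embedding,
    # then project to d1 dimensions.
    E = []
    for k, lab in zip(context_idx, labels):
        x = H[k] + label_emb[lab]                       # list concatenation
        E.append([dot(row, x) + b for row, b in zip(W_e, b_e)])
    # Scaled dot-product scores between h_i and every syntactic feature.
    scores = [dot(H[i], e) / math.sqrt(d1) for e in E]
    top = max(scores)                                   # numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [x / total for x in exps]
    # a_i: weighted element-wise sum of the feature vectors.
    a_i = [sum(w * e[t] for w, e in zip(weights, E)) for t in range(d1)]
    return weights, a_i + H[i]


# Dependency features of the 7th word "congenital": [congenital, dystrophy].
weights, o_7 = cfa(6, [6, 8], ["amod", "root"])
print(weights, len(o_7))
```

The output vector has twice the encoder dimension because the attention vector is concatenated with, rather than added to, the hidden vector of the word.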
Sequence tagging network
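Once the output vector \(\mathbf {{o}_{i}}\) is obtained from the CFA module, we feed it into a fully-connected layer followed by a softmax layer. For each word \({{x}_{i}}\) in X, the tagging probability distribution \(\mathbf {\hat{y}}\) can be formulated as follows:
\[\mathbf {\hat{y}}=[{{\hat{y}}_{1}},{{\hat{y}}_{2}},{{\hat{y}}_{3}}]=\text {softmax}(\mathbf {W}\cdot \mathbf {{o}_{i}}+\mathbf {b})\]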
where \([{{\hat{y}}_{1}},{{\hat{y}}_{2}},{{\hat{y}}_{3}}]\) denote the probabilities of the three BioNER labels, i.e., “B,” “I” and “O,” and \(\mathbf {W}\) and \(\mathbf {b}\) are trainable parameters. A CRF layer can also be used instead of the softmax layer in our model, but in our preliminary experiments it did not bring a significant improvement and took longer to train. The loss function is cross-entropy.
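The tagging layer and its loss can be sketched in a few lines of Python. The weights and the input vector below are made-up toy values chosen only so that the arithmetic is easy to follow. The helper names are assumptions for this example.

```python
import math

LABELS = ["B", "I", "O"]


def softmax(z):
    top = max(z)                                  # numerical stability
    exps = [math.exp(v - top) for v in z]
    total = sum(exps)
    return [e / total for e in exps]


def tag_distribution(o, W, b):
    """y_hat = softmax(W . o + b) for one token."""
    logits = [sum(w * x for w, x in zip(row, o)) + bias
              for row, bias in zip(W, b)]
    return softmax(logits)


def cross_entropy(y_hat, gold_label):
    """Negative log-probability assigned to the gold label."""
    return -math.log(y_hat[LABELS.index(gold_label)])


# Toy output vector o_i and toy parameters (3 labels x 4 dimensions).
o = [0.5, -1.0, 2.0, 0.0]
W = [[1.0, 0.0, 0.5, 0.0],     # logit for "B": 0.5 + 1.0 = 1.5
     [0.0, 1.0, 0.0, 0.0],     # logit for "I": -1.0
     [0.0, 0.0, -0.5, 1.0]]    # logit for "O": -1.0
b = [0.0, 0.0, 0.0]

y_hat = tag_distribution(o, W, b)
predicted = LABELS[y_hat.index(max(y_hat))]
print(predicted, cross_entropy(y_hat, "B"))
```

In training, the per-token losses would be averaged over all tokens in a batch, which is the usual convention for token-level cross-entropy.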
Multi-task model (MTM)
Recently, multi-task learning (MTL) has been successfully applied to solve the problem of limited availability of annotated data in the BioNER tasks. Most previous MTL models for the BioNER task use multiple datasets simultaneously to train a model, in which some parameters of the model are shared for different datasets and the others are separated and task-specific. This leads to the limited ability of sharing information across different datasets and the explosive growth of the total parameters with the increase of datasets. We propose a straightforward and effective MTL method to attach a pair of tag identifiers for each input word sequence, “\(<tag>\)” and “\(</tag>\),” at the beginning and end of the sequence respectively, where “tag” denotes the name of the dataset containing the input sentence. As is shown in Fig. 5, if the input sentence X belongs to “T” dataset, we add the tag “\(<T>\)” before the first word \({x}_{1}\) and add the tag “\(</T>\)” after the last word \({x}_{9}\).
In the training step, we input sentences from all different datasets together with tag identifiers into the model. These tag identifiers distinguish the origin of each sentence to affect the hidden representations of each word in the sentence. It is similar to directly telling the model which dataset the input sentence belongs to, and allowing the model to learn the differences and commonalities between datasets. Since we only change the input sentence before encoding without the model architecture modification, the proposed fully-shared MTL method can share all parameters in the training step to integrate different datasets and train the model without any task-specific layers. In addition, BioNER datasets include various biomedical entity types, such as gene, disease, and species. There are multiple datasets for each type. Although the datasets under different types are quite different, we assume that the cross-type information of the biomedical domain can improve the performance of the multi-task model. Therefore, we train the model on the datasets under multiple entity types at the same time. Moreover, for the tag identifier “\(<tag>\),” we can use the name of the dataset or entity type containing the input sentence. Since the datasets under the same type are still different due to different constructors and annotation rules, we decide to use the name of the dataset as the tag. If we use the name of the entity type, the differences between datasets will not be captured.
In the inference step, we predict on a specific test set by adding the corresponding dataset tag to each input sentence. To recognize entities in a new biomedical sentence, an appropriate dataset tag used in the training step must be selected according to the purpose. Different choices will lead to different results. For example, to detect disease entities, the tag of any dataset belonging to the disease type should be chosen as the tag identifier.
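The input transformation behind the fully-shared MTL method is simple enough to sketch directly. In the following Python example, the function names are hypothetical, and giving the two tag tokens the label “O” is an assumption made here to keep tokens and labels aligned. The text does not state how the tag positions are labeled.

```python
def add_dataset_tags(tokens, dataset):
    """Wrap a token sequence in <dataset> ... </dataset> identifiers."""
    return [f"<{dataset}>"] + list(tokens) + [f"</{dataset}>"]


def build_mixed_training_set(datasets):
    """Merge several datasets into one training set with tag identifiers.

    datasets: dict mapping a dataset name to a list of (tokens, labels) pairs.
    """
    mixed = []
    for name, examples in datasets.items():
        for tokens, labels in examples:
            tagged = add_dataset_tags(tokens, name)
            # Assumption: the two tag tokens are labeled "O".
            mixed.append((tagged, ["O"] + list(labels) + ["O"]))
    return mixed


sentence = "This case is a paternally transmitted congenital myotonic dystrophy".split()
labels = ["O"] * 6 + ["B", "I", "I"]
mixed = build_mixed_training_set({"NCBI-disease": [(sentence, labels)]})
print(mixed[0][0])

# At inference time the same wrapper selects which dataset's annotation
# conventions the shared model should follow.
query = add_dataset_tags(sentence, "NCBI-disease")
```

With a subword tokenizer, the tag strings would also need to be registered as special tokens or they would be split into pieces. That step is left out of this sketch.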
Results
In this section, we first describe several BioNER benchmark datasets used in our experiments. Then we introduce the experimental setup and implementation details. Next, we present the results of different experiments for the proposed single-task model and multi-task model, respectively.
Datasets
We conduct experiments on seven BioNER benchmark datasets which are publicly available and widely used in previous studies. We utilize the same splitting strategy on training, validation and testing sets according to Lee et al. [19] for each dataset. Since these datasets include various biomedical entity types, we divide them into four categories: gene/protein, disease, species and chemical. Table 1 gives some details of these datasets including the number of sentences, sentence length, entity type and entity count, where the sentence length represents the average length of the sentences in the dataset, and entity count represents the total number of entities mentioned in the dataset. More details about these datasets can be found in [14].
Experiment setup
Our experiments are divided into two parts. We train the proposed single-task model (STM), named BioBERT-CFA, and some other comparative STMs for each of the datasets. Then we train the multi-task model (MTM) with all datasets jointly by using the proposed MTL method based on the vanilla BioBERT model and the proposed BioBERT-CFA model.
For the experiments of STM, we use “Stanford CoreNLP Toolkits” (SCT) [36], a well-known open source toolkit which is widely used in many NLP studies, to process each input sentence and obtain part-of-speech, constituency, and dependency parsing results as the syntactic information. We use each type of syntactic information separately in the CFA module. For the encoder, we use the base v1.1 version of BioBERT and keep the default hyper-parameters of Lee et al. [19]. The encoder consists of 12 transformer layers with a hidden vector dimension of 768. The parameters of the BioBERT encoder are fine-tuned during model training. The embeddings of context features are derived from BioBERT and the embeddings of syntactic labels are randomly initialized in the CFA module. For the experiments of MTM, we combine all datasets into one total dataset for training. Then we modify each input sentence in the total dataset by using the proposed MTL method and feed it into STM to train a multi-task model. In testing, we evaluate the results of each dataset separately for each multi-task model.
We implement all experiments on an NVIDIA Tesla V100 GPU using the PyTorch library. We employ Adam [39] as the optimizer with a learning rate of 5e-5 and train each model with a batch size of 64 and a maximum sequence length of 128 for 30 epochs. For the evaluation metrics of BioNER, we use macro-averaged F1 scores computed by the widely used seqeval script in all experiments.
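For readers unfamiliar with entity-level evaluation, the following Python sketch shows how an exact-match F1 score can be computed from B/I/O sequences. It is a simplified stand-in written for illustration and is not the seqeval script itself. In particular, it handles a single sentence with a single entity type and ignores an “I” tag that does not follow a “B,” a case that evaluation scripts may treat differently.

```python
def bio_to_spans(tags):
    """Return the set of (start, end) entity spans, end exclusive."""
    spans, start = set(), None
    for i, tag in enumerate(list(tags) + ["O"]):   # sentinel closes a final entity
        if tag != "I" and start is not None:       # the open entity ends here
            spans.add((start, i))
            start = None
        if tag == "B":
            start = i
    return spans


def entity_f1(gold_tags, pred_tags):
    """Exact-match F1: a predicted entity counts only if both boundaries match."""
    gold, pred = bio_to_spans(gold_tags), bio_to_spans(pred_tags)
    tp = len(gold & pred)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gold)
    return 2 * precision * recall / (precision + recall)


gold = ["B", "O", "B", "I"]
pred = ["B", "O", "B", "O"]    # second entity has a wrong right boundary
print(entity_f1(gold, pred))
```

Here one of two predicted entities matches one of two gold entities, so precision and recall are both 0.5. A partially overlapping prediction earns no credit, which is why boundary errors are costly under this metric.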
Single-task model results
For comparison, we adopt the following three single-task models for the BioNER task as the baselines. The first one is the vanilla BioBERT model proposed by Lee et al. [19], which achieved good performance in many biomedical tasks. The second one, named BioKMNER [25], uses a key-value memory network (KVMN) [26] to integrate the syntactic features with BioBERT, and it outperformed the vanilla BioBERT model in their experiments. For the third one, we replace the KVMN module in the BioKMNER model with the two-way attention (TWA) mechanism, which was proposed for other tasks by Tian et al. [27], to incorporate the syntactic information for the BioNER task. We name this model BioBERT-TWA and expect it to outperform BioKMNER. The BioKMNER model, the BioBERT-TWA model and the proposed single-task model BioBERT-CFA use the same three types of syntactic information, namely POS labels (POS), syntactic constituents (Syn), and dependency relations (Dep), and employ the same NLP toolkit SCT to obtain the syntactic information.
Table 2 shows the overall performance of our model BioBERT-CFA compared with the three baseline models on the seven benchmark datasets, where BioBERT (ours) denotes our reproduced results of BioBERT, BioBERT-TWA (ours) denotes our reproduced results of the two-way attention method with the BioBERT encoder, and BioBERT-CFA (ours) denotes the results of our proposed single-task model. Bold indicates the highest score among all models. There are several observations for these results.
Firstly, compared with the vanilla BioBERT model without using any syntactic information, all models incorporating syntactic information achieve better results among most datasets. It demonstrates the effectiveness of using syntactic information to help recognize biomedical named entities.
Secondly, comparing BioKMNER and BioBERT-TWA, we find that BioBERT-TWA yields better performance in most cases. For instance, on the BC2GM dataset, BioBERT-TWA (Syn) achieves an F1 score of 84.96%, while BioKMNER obtains a lower F1 score of 84.76%. The finding that KVMN does not perform as well as TWA is consistent with the results in Tian et al. [27], and may be because the way KVMN computes weights is less accurate than that of TWA.
Thirdly, the proposed single-task model BioBERT-CFA achieves the best performance on all benchmark datasets and provides a significant enhancement to the baselines by incorporating the syntactic information. For example, BioBERT-CFA achieves F1 score improvements of 1.06%, 0.95% and 0.80% on the JNLPBA, Species-800 and NCBI-disease datasets respectively compared with BioBERT, which confirms the effectiveness and universality of the proposed CFA module. Compared with BioBERT-TWA, the BioBERT-CFA model uses a novel attention mechanism and embedding method, and achieves higher F1 scores.
Among different types of syntactic information, in most cases, syntactic constituents (Syn) and dependency relations (Dep) in our experiments work better than part of speech tags (POS). For example, the BioBERT-CFA model achieves 85.36% and 85.28% F1 scores on the BC2GM dataset when it uses Syn and Dep, respectively, while 85.06% is achieved when it uses POS labels. The same phenomenon can be found in BioKMNER and BioBERT-TWA models. This is partly because the syntactic constituents and dependency relations provide more cues of the relationship between words, while the POS labels focus more on attributes of the word itself.
Multi-task model results
We train the multi-task model (MTM) on all the aforementioned datasets together using the fully-shared multi-task learning method, starting from either the vanilla BioBERT model or the proposed BioBERT-CFA model; we name the resulting models BioBERT-MTM and BioBERT-CFA-MTM, respectively. In BioBERT-CFA-MTM, we use dependency relations (Dep) as the syntactic information because of their good performance. BioBERT-STM, our baseline, denotes the single-task BioBERT model trained on each dataset separately. For comparison, we also design a BioBERT-DM model, in which BioBERT is trained on a single corpus that directly mixes all datasets without our MTL method. As shown in Table 3, BioBERT-DM greatly hurts the performance on all datasets, because different datasets have different entity types and annotation rules and this model cannot distinguish between them. In contrast, BioBERT-MTM yields stable improvements over the baseline on every dataset we test, which confirms that a model trained jointly on different datasets with the fully-shared MTL method achieves better performance than one trained on a single dataset: the fully-shared MTL method can learn useful information from different datasets. The F1 score of BioBERT-MTM on the BC5CDR-Chemical dataset increases by only 0.12%, because BC5CDR-Chemical is the only chemical-type dataset among the seven. Additionally, the final model BioBERT-CFA-MTM improves performance remarkably on all datasets, which again shows the effectiveness of the proposed CFA module and MTL method.
Comparative analysis with previous studies
In this section, we compare the results of the final model BioBERT-CFA-MTM, which uses the proposed CFA module and fully-shared MTL method, with those of previous publications on multi-task learning for BioNER. The results (F1 scores) on the same datasets are summarized in Table 4. Overall, our model outperforms previous studies in the BioNER task and achieves the best performance on all benchmark datasets. Three observations are worth noting. First, the approaches based on pre-trained language models, such as Akdemir et al. [33], Khan et al. [32], Tong et al. [40] and Chai et al. [34], generally outperform those based on CNN and BiLSTM models, such as Crichton et al. [14], Wang et al. [28], Wang et al. [29], Yoon et al. [30] and Zuo et al. [31]. This shows the power of using a pre-trained model as the encoder. Second, although the models of Akdemir et al. and Khan et al. are also based on the pre-trained BioBERT, the proposed BioBERT-CFA-MTM yields better performance because we combine the syntactic information and the MTL method in BioBERT. Third, compared with the latest state-of-the-art model from Chai et al., BioBERT-CFA-MTM achieves substantial improvements on several datasets. We attribute this to two differences. The approach of Chai et al. is based on the pre-trained XLNet model, which is inferior to the biomedical domain-specific pre-trained BioBERT model. It also divides the parameters of the XLNet-CRF model into shared layers and task-specific layers, whereas we share all the parameters of BioBERT-CFA-MTM across datasets to learn more information from them.
Discussion
The effect of dimensions
We analyze the influence of the initial dimension size of the syntactic label embedding in the CFA module. The syntactic labels are integrated with the context features to form the syntactic features, as introduced in Equation 4. Therefore, the dimension \({{d}_{2}}\) of the syntactic label embedding can affect the model performance. We test \({{d}_{2}}\) values of 32, 128, 256, 768 and 1024 in the BioBERT-CFA model with POS labels as features on the NCBI-disease dataset. Figure 6 presents the results; the best performance is achieved at 768. Since the vocabulary of potential syntactic labels is relatively small (fewer than 100 labels), we expected a small \({{d}_{2}}\) to achieve better F1 scores. However, the results show that the smallest size, 32, gives the worst performance. A possible interpretation is that, at this size, the dimension \({{d}_{2}}\) of the syntactic label embedding is much smaller than the dimension \({{d}_{1}}\) of the context feature embedding, which leads to lower weights for the syntactic labels in the CFA module. By contrast, the size 768 is equal to the dimension \({{d}_{1}}\) and achieves the best performance.
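The size argument can be made concrete with a small calculation. The sketch below assumes, purely for illustration, that the context embedding and the label embedding are concatenated, which may differ from the combination actually defined in Equation 4; the names `D1`, `D2_SIZES` and `label_share` are hypothetical. It reports what fraction of such a combined vector would come from the syntactic label embedding for each tested size.

```python
# Illustrative only: assumes the context embedding (size d1) and the syntactic
# label embedding (size d2) are concatenated, which may differ from Equation 4.
D1 = 768                               # context feature dimension d1 (the text says 768 equals d1)
D2_SIZES = [32, 128, 256, 768, 1024]   # label embedding sizes tested in the experiment


def label_share(d1: int, d2: int) -> float:
    """Fraction of a concatenated (d1 + d2) vector contributed by the labels."""
    return d2 / (d1 + d2)


shares = {d2: label_share(D1, d2) for d2 in D2_SIZES}
for d2, share in shares.items():
    print(f"d2={d2:4d}  label share={share:.2%}")
```

Under this assumption, the labels occupy 4% of the combined vector at size 32 and half of it at size 768. The calculation does not explain why 1024 performs worse than 768; that remains an empirical result of Figure 6.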
The effect of the tag pair
In the proposed fully-shared MTL method, a pair of tag identifiers, “\(<tag>\)” and “\(</tag>\),” is attached to the beginning and end of each input word sequence, respectively. To verify the effectiveness of this strategy, we conduct experiments and show the results in Table 5. BioBERT-MTM is the fully-shared MTL model, which uses the pair of tag identifiers to distinguish between datasets and achieves outstanding performance on the seven BioNER datasets. We then remove the end tag “\(</tag>\)” and keep only the beginning tag “\(<tag>\)” for each input sentence. As the results of the “w/o the end tag” model show, removing the end tag leads to a slight decline in F1 scores, which indicates that a pair of tag identifiers affects the hidden representations of the words in the sentence more positively than a single tag does. If we remove the entire pair of tag identifiers and input only the original sentence (the “w/o the tag pair” model), the model degrades to the aforementioned baseline BioBERT-DM. This seriously damages the performance because sentences from different datasets are treated as coming from the same source.
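The three input variants compared in Table 5 can be sketched as a simple preprocessing step. This is a minimal illustration, not the released implementation: the function name `add_tag_pair`, the `mode` values and the example sentence are hypothetical, and the sketch does not cover how the tag tokens are handled by the tokenizer or labeled during training.

```python
from typing import List


def add_tag_pair(tokens: List[str], tag: str, mode: str = "pair") -> List[str]:
    """Attach dataset tag identifiers to a tokenized sentence.

    mode="pair":  <tag> ... </tag>  (fully-shared MTL input)
    mode="begin": <tag> ...         ("w/o the end tag" variant)
    mode="none":  ...               ("w/o the tag pair" variant, i.e. direct mixing)
    """
    if mode == "pair":
        return [f"<{tag}>"] + tokens + [f"</{tag}>"]
    if mode == "begin":
        return [f"<{tag}>"] + tokens
    if mode == "none":
        return list(tokens)
    raise ValueError(f"unknown mode: {mode}")


# Made-up sentence for illustration; it is not taken from any benchmark dataset.
sentence = ["Mutations", "in", "BRCA1", "cause", "breast", "cancer", "."]
tagged = add_tag_pair(sentence, "NCBI-disease")
print(tagged)
```

With `mode="none"`, sentences from every dataset look identical to the encoder, which corresponds to the BioBERT-DM setting described above.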
Analysis for datasets under the same type
To analyze the effect of the entity types of the BioNER datasets, we train the BioBERT-DM and BioBERT-MTM models on the only two datasets of the disease type, i.e. NCBI-disease and BC5CDR-Disease, and show the results in Table 6. “BioBERT-MTM (all)” denotes the aforementioned result of the BioBERT-MTM model trained on all seven datasets, including those of other entity types. The BioBERT-DM model, for which we directly mix the two disease datasets for training, does not achieve satisfactory results compared with the single-task model BioBERT-STM. This shows that differences remain between datasets of the same type and that directly mixing them degrades the performance. In contrast, BioBERT-MTM obtains good results, which again indicates that the proposed fully-shared MTL method learns useful information across datasets, even when they are of the same type. Moreover, the BioBERT-MTM model trained on all datasets of multiple entity types performs better than the one trained on the two datasets of a single type, which shows that, with our MTL method, cross-type information in the biomedical domain improves the performance.
Analysis for using the tag of type name
In the above experiments with BioBERT-MTM, where the tag pair is attached to the input sentence, we use the name of the dataset containing the sentence as the tag. Alternatively, we can use the name of the corresponding entity type as the tag. For example, when the input sentence is from the Linnaeus dataset, the pair of tags is “\(<Species>\)” and “\(</Species>\).” In Table 7, “TN” denotes the method using the name of the type and “DN” denotes the method using the name of the dataset. Bold marks the highest score among all methods. We find that “DN” vastly outperforms “TN” in most cases, which shows that differences exist even between datasets of the same type. In addition, we combine “TN” and “DN” by attaching two pairs of tag identifiers at the beginning and end of the sentence, respectively, and name this method “TN+DN.” The results of “TN+DN” are worse than those of “DN.” This suggests that, in the “TN+DN” method, too many tags are attached to the input sentence and the model cannot learn the meaning of each tag well.
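The difference between the three tagging schemes can be sketched as follows. Only the “\(<Species>\)” tag for Linnaeus is given in the text; the other tag spellings, the nesting order used for “TN+DN,” the example sentence and the names `DATASET_TYPE` and `tag_sentence` are assumptions made for illustration.

```python
from typing import Dict, List

# Dataset -> entity-type tag. Only the <Species> tag for Linnaeus is stated in
# the text; the remaining tag spellings are assumptions for illustration.
DATASET_TYPE: Dict[str, str] = {
    "Linnaeus": "Species",
    "Species-800": "Species",
    "NCBI-disease": "Disease",
    "BC5CDR-Disease": "Disease",
}


def tag_sentence(tokens: List[str], dataset: str, scheme: str) -> List[str]:
    """Wrap tokens with tag pairs according to the DN, TN or TN+DN scheme."""
    names = {
        "DN": [dataset],
        "TN": [DATASET_TYPE[dataset]],
        # The nesting order for TN+DN is an assumption; the text does not fix it.
        "TN+DN": [DATASET_TYPE[dataset], dataset],
    }[scheme]
    opening = [f"<{name}>" for name in names]
    closing = [f"</{name}>" for name in reversed(names)]
    return opening + tokens + closing


# Made-up sentence for illustration.
tokens = ["Escherichia", "coli", "was", "isolated", "."]
print(tag_sentence(tokens, "Linnaeus", "TN"))
print(tag_sentence(tokens, "Linnaeus", "DN"))
print(tag_sentence(tokens, "Linnaeus", "TN+DN"))
```

Under “TN,” a sentence from Linnaeus and the same sentence from Species-800 receive identical tags, so the model cannot tell the two datasets apart; under “DN” the inputs differ.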
Case study
To better illustrate how our approach improves biomedical named entity recognition, we conduct a case study and list some practical prediction cases of the baseline single-task model and the two proposed multi-task models on several benchmark datasets. The examples are shown in Table 8, where true labels and predicted labels are underlined in the sentence for each model. In case 1, the task is to recognize entities of the gene or protein type. Unlike the baseline, BioBERT-MTM correctly detects the boundaries of the gene entity “human Elk1 gene,” possibly because the multi-task model learns similar context expressions from other related datasets. BioBERT-CFA-MTM correctly detects the boundaries of the gene entity “pseudogene Elk2,” while BioBERT-STM and BioBERT-MTM only detect “Elk2.” A possible interpretation is that BioBERT-CFA-MTM learns the relation between “pseudogene” and “Elk2” from the corresponding lexical structures. In case 2, all models correctly predict the disease entity “cancer,” but only BioBERT-CFA-MTM correctly recognizes “inherited colorectal polyposis,” which is probably due to the effect of the CFA module. In case 3, BioBERT-STM fails to predict the species entity “Colwellia psychrerythraea 34H,” while BioBERT-MTM correctly detects it. In addition, BioBERT-STM and BioBERT-MTM recognize “Alteromonadales” as a species entity, although this word is not an entity in the gold standard. In summary, BioBERT-CFA-MTM improves BioNER effectively because it learns more lexical structures from the syntactic information and shares useful information between multiple datasets through the fully-shared MTL method. Nevertheless, some difficult cases still cannot be solved by our models. In case 4, all three models infer the phrase “skin photosensitivity” to be a disease entity, but it is not an entity. The word “photosensitivity” is rare and does not appear in the training set, and the phrase “skin photosensitivity” resembles the names of other skin diseases, e.g., “skin fragility syndrome” and “skin track”; it is therefore error-prone and hard to recognize correctly. Likewise, in case 5, our models fail to recognize the species entity “rhubarb” because it is a rare word that is difficult to identify from the context.
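To show why boundary errors such as predicting “Elk2” instead of “pseudogene Elk2” matter for the reported scores, the following sketch computes entity-level precision, recall and F1 from BIO tag sequences. It assumes exact-match span evaluation with a single entity type, which this section does not spell out; the helper names and the four-token example are hypothetical.

```python
from typing import List, Set, Tuple

Span = Tuple[int, int]  # (start, end) token indices, end exclusive


def bio_to_spans(tags: List[str]) -> Set[Span]:
    """Collect entity spans from a BIO tag sequence with one entity type."""
    spans: Set[Span] = set()
    start = None
    for i, tag in enumerate(tags + ["O"]):  # the sentinel closes a trailing entity
        if tag in ("B", "O") or (tag == "I" and start is None):
            if start is not None:
                spans.add((start, i))
                start = None
            if tag != "O":
                start = i
    return spans


def prf(gold: Set[Span], pred: Set[Span]) -> Tuple[float, float, float]:
    """Exact-match precision, recall and F1 over entity spans."""
    tp = len(gold & pred)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Hypothetical tokens modeled on case 1: ["the", "pseudogene", "Elk2", "."]
gold_tags = ["O", "B", "I", "O"]  # gold entity: "pseudogene Elk2"
pred_tags = ["O", "O", "B", "O"]  # prediction covers "Elk2" only
precision, recall, f1 = prf(bio_to_spans(gold_tags), bio_to_spans(pred_tags))
print(precision, recall, f1)
```

Under exact matching, the partial prediction counts as both a false positive and a false negative, so the sentence contributes no true positive even though the model found part of the entity.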
Conclusions
In this paper, we propose a novel fully-shared multi-task learning model based on the pre-trained BioBERT with a new attention module to integrate the auto-processed syntactic information for the BioNER task. The proposed attention module CFA extracts appropriate features from syntactic information and weights these features to enhance BioNER. The proposed multi-task learning method shares all parameters to capture useful information from different datasets. We conducted a large number of experiments on seven benchmark BioNER datasets and our methods achieved the best results on all datasets. The experiment results and case studies demonstrate the importance of the proposed CFA module and fully shared MTL method used in our model. In the future, we expect to employ biomedical-specific syntactic toolkits instead of the general-purpose toolkit to further improve the performance for CFA, and apply the proposed approach to other sequence tagging tasks.
Availability of data and materials
The datasets and code are available at https://github.com/zzy1026/BioBERT-CFA-MTM. BioBERT pre-training parameters were provided by https://github.com/dmis-lab/biobert-pytorch.
Abbreviations
- BioNER: Biomedical named entity recognition
- CFA: Combined feature attention
- MTL: Multi-task learning
- POS: Part-of-speech
- CRF: Conditional random field
- CNN: Convolutional neural network
- BiLSTM: Bidirectional long short-term memory
- BERT: Bidirectional encoder representations from transformers
- SCT: Stanford CoreNLP Toolkit
References
Zhang Y, Lin H, Yang Z, Wang J, Zhang S, Sun Y, Yang L. A hybrid model based on neural networks for biomedical relation extraction. J Biomed Inform. 2018;81:83–92.
Li J, Zhang Z, Li X, Chen H. Kernel-based learning for biomedical relation extraction. J Am Soc Inform Sci Technol. 2008;59(5):756–69.
Liu S, Tang B, Chen Q, Wang X. Drug-drug interaction extraction via convolutional neural networks. Comput Math Methods Med. 2016;2016:6918381.
Kolchinsky A, Lourenço A, Wu H-Y, Li L, Rocha LM. Extraction of pharmacokinetic evidence of drug–drug interactions from the literature. PLoS ONE. 2015;10(5):e0122199.
Hao B, Zhu H, Paschalidis I. Enhancing clinical bert embedding using a biomedical knowledge base. In: Proceedings of the 28th international conference on computational linguistics; 2020. p. 657–61.
Wright D. NormCo: deep disease normalization for biomedical knowledge base construction. San Diego: University of California; 2019.
Settles B. Biomedical named entity recognition using conditional random fields and rich feature sets. In: Proceedings of the international joint workshop on natural language processing in biomedicine and its applications (NLPBA/BioNLP); 2004. p. 107–10.
Kulick S, Bies A, Liberman M, Mandel M, McDonald R, Palmer M, Schein A, Ungar L, Winters S, White P. Integrated annotation for biomedical information extraction. In: HLT-NAACL 2004 workshop: linking biological literature, ontologies and databases; 2004. p. 61–8.
Liu H, Hu Z-Z, Torii M, Wu C, Friedman C. Quantitative assessment of dictionary-based protein named entity tagging. J Am Med Inform Assoc. 2006;13(5):497–507.
Liao Z, Zhang Z. A generic classifier-ensemble approach for biomedical named entity recognition. In: Pacific-Asia conference on knowledge discovery and data mining. Springer; 2012. p. 86–97.
Lee K-J, Hwang Y-S, Kim S, Rim H-C. Biomedical named entity recognition using two-phase model based on SVMs. J Biomed Inform. 2004;37(6):436–47.
Campos D, Matos S, Oliveira JL. Gimli: open source and high-performance biomedical name recognition. BMC Bioinformatics. 2013;14(1):1–14.
Liao Z, Wu H. Biomedical named entity recognition based on skip-chain CRFs. In: 2012 international conference on industrial control and electronics engineering. IEEE; 2012. p. 1495–8.
Crichton G, Pyysalo S, Chiu B, Korhonen A. A neural network multi-task learning approach to biomedical named entity recognition. BMC Bioinformatics. 2017;18(1):1–14.
Habibi M, Weber L, Neves M, Wiegandt DL, Leser U. Deep learning with word embeddings improves biomedical named entity recognition. Bioinformatics. 2017;33(14):i37–48.
Devlin J, Chang M-W, Lee K, Toutanova K. BERT: pre-training of deep bidirectional transformers for language understanding. 2018. arXiv preprint arXiv:1810.04805.
Yang Z, Dai Z, Yang Y, Carbonell J, Salakhutdinov RR, Le QV. XLNet: generalized autoregressive pretraining for language understanding. Advances in neural information processing systems. 2019;32.
Liu Y, Ott M, Goyal N, Du J, Joshi M, Chen D, Levy O, Lewis M, Zettlemoyer L, Stoyanov V. RoBERTa: a robustly optimized BERT pretraining approach. 2019. arXiv preprint arXiv:1907.11692.
Lee J, Yoon W, Kim S, Kim D, Kim S, So CH, Kang J. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics. 2020;36(4):1234–40.
Dang TH, Le H-Q, Nguyen TM, Vu ST. D3NER: biomedical named entity recognition using CRF-bilSTM improved with fine-tuned embeddings of various linguistic information. Bioinformatics. 2018;34(20):3539–46.
Luo L, Yang Z, Yang P, Zhang Y, Wang L, Lin H, Wang J. An attention-based BilSTM-CRF approach to document-level chemical named entity recognition. Bioinformatics. 2018;34(8):1381–8.
Yao L, Liu H, Liu Y, Li X, Anwar MW. Biomedical named entity recognition based on deep neutral network. Int J Hybrid Inf Technol. 2015;8(8):279–88.
Tang B, Cao H, Wang X, Chen Q, Xu H. Evaluating word representation features in biomedical named entity recognition tasks. BioMed Res Int. 2014;2014: 240403.
Zhang J, Shen D, Zhou G, Su J, Tan C-L. Enhancing hmm-based biomedical named entity recognition by studying special phenomena. J Biomed Inform. 2004;37(6):411–22.
Tian Y, Shen W, Song Y, Xia F, He M, Li K. Improving biomedical named entity recognition with syntactic information. BMC Bioinformatics. 2020;21(1):1–17.
Miller A, Fisch A, Dodge J, Karimi A-H, Bordes A, Weston J. Key-value memory networks for directly reading documents. 2016. arXiv preprint arXiv:1606.03126.
Tian Y, Song Y, Ao X, Xia F, Quan X, Zhang T, Wang Y. Joint Chinese word segmentation and part-of-speech tagging via two-way attentions of auto-analyzed knowledge. In: Proceedings of the 58th annual meeting of the association for computational linguistics; 2020. p. 8286–96.
Wang X, Zhang Y, Ren X, Zhang Y, Zitnik M, Shang J, Langlotz C, Han J. Cross-type biomedical named entity recognition with deep multi-task learning. Bioinformatics. 2019;35(10):1745–52.
Wang X, Lyu J, Dong L, Xu K. Multitask learning for biomedical named entity recognition with cross-sharing structure. BMC Bioinformatics. 2019;20(1):1–13.
Yoon W, So CH, Lee J, Kang J. Collabonet: collaboration of deep neural networks for biomedical named entity recognition. BMC Bioinformatics. 2019;20(10):55–65.
Zuo M, Zhang Y. Dataset-aware multi-task learning approaches for biomedical named entity recognition. Bioinformatics. 2020;36(15):4331–8.
Khan MR, Ziyadi M, AbdelHady M. MT-BioNER: multi-task learning for biomedical named entity recognition using deep bidirectional transformers. 2020. arXiv preprint arXiv:2001.08904.
Akdemir A, Shibuya T. Analyzing the effect of multi-task learning for biomedical named entity recognition. 2020. arXiv preprint arXiv:2011.00425.
Chai Z, Jin H, Shi S, Zhan S, Zhuo L, Yang Y. Hierarchical shared transfer learning for biomedical named entity recognition. BMC Bioinformatics. 2022;23(1):1–14.
Huang K, Huang D, Liu Z, Mo F. A joint multiple criteria model in transfer learning for cross-domain chinese word segmentation. In: Proceedings of the 2020 conference on empirical methods in natural language processing (EMNLP); 2020. p. 3873–82.
Manning CD, Surdeanu M, Bauer J, Finkel JR, Bethard S, McClosky D. The Stanford CoreNLP natural language processing toolkit. In: Proceedings of 52nd annual meeting of the association for computational linguistics: system demonstrations; 2014. p. 55–60.
Sang EF, De Meulder F. Introduction to the CoNLL-2003 shared task: Language-independent named entity recognition. 2003. arXiv preprint arXiv:cs/0306050v1.
Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez AN, Kaiser Ł, Polosukhin I. Attention is all you need. Advances in neural information processing systems. 2017;30.
Kingma DP, Ba J. Adam: a method for stochastic optimization. 2014. arXiv preprint arXiv:1412.6980.
Tong Y, Chen Y, Shi X. A multi-task approach for improving biomedical named entity recognition by incorporating multi-granularity information. In: Findings of the association for computational linguistics: ACL-IJCNLP 2021; 2021. p. 4804–13.
Acknowledgements
We are grateful to the creators of various datasets, who provided them free of charge.
Funding
This work was partially supported by the Ministry of Science and Technology, ROC (Grant Number: 109-2221-E-468-014-MY3).
Author information
Contributions
ZZ designed and implemented the models, conducted the experiments, and drafted the manuscript. ALPC critically revised the manuscript. ZZ and ALPC participated in the literature search and experimental design. All authors read and approved the final manuscript.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Competing interests
All authors have nothing to disclose.
Consent for publication
All authors have approved the submission of this manuscript.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Zhang, Z., Chen, A.L.P. Biomedical named entity recognition with the combined feature attention and fully-shared multi-task learning. BMC Bioinformatics 23, 458 (2022). https://doi.org/10.1186/s12859-022-04994-3
DOI: https://doi.org/10.1186/s12859-022-04994-3