
CollaboNet: collaboration of deep neural networks for biomedical named entity recognition



Finding biomedical named entities is one of the most essential tasks in biomedical text mining. Recently, deep learning-based approaches have been applied to biomedical named entity recognition (BioNER) and showed promising results. However, as deep learning approaches need an abundant amount of training data, a lack of data can hinder performance. BioNER datasets are scarce resources and each dataset covers only a small subset of entity types. Furthermore, many bio entities are polysemous, which is one of the major obstacles in named entity recognition.


To address the lack of data and the entity type misclassification problem, we propose CollaboNet which utilizes a combination of multiple NER models. In CollaboNet, models trained on a different dataset are connected to each other so that a target model obtains information from other collaborator models to reduce false positives. Every model is an expert on their target entity type and takes turns serving as a target and a collaborator model during training time. The experimental results show that CollaboNet can be used to greatly reduce the number of false positives and misclassified entities including polysemous words. CollaboNet achieved state-of-the-art performance in terms of precision, recall and F1 score.


We demonstrated the benefits of combining multiple models for BioNER. Our model has successfully reduced the number of misclassified entities and improved the performance by leveraging multiple datasets annotated for different entity types. Given the state-of-the-art performance of our model, we believe that CollaboNet can improve the accuracy of downstream biomedical text mining applications such as bio-entity relation extraction.


The amount of biomedical text continues to increase rapidly. There were 4.7 million full-text online accessible articles in PubMed Central [1] in 2017. One of the obstacles in utilizing biomedical text data is that it is too large for a human to read or even search for needed information. This has led to the demand for automated extraction of valuable information. Text mining can be used to turn the time-consuming task into a fully automated job [2–7].

Named Entity Recognition (NER) is the computerized procedure of recognizing and labeling entities in given texts. In the biomedical domain, typical entity types include disease, chemical, gene and protein.

Biomedical named entity recognition (BioNER) is an essential building block of many downstream text mining applications such as extracting drug-drug interactions [8] and disease-treatment relations [9]. BioNER is also used when building a sophisticated biomedical entity search tool [10] that enables users to pose complex queries to search for bio-entities.

NER in biomedical text mining is focused mainly on dictionary-, rule-, and machine learning-based approaches [11–16]. Dictionary based systems have a simple and intuitive structure but they cannot handle unseen entities or polysemous words, resulting in low recall [11, 12]. Moreover, building and maintaining a comprehensive and up-to-date dictionary involves a considerable amount of manual work. The rule based approach is more scalable, but it needs hand crafted feature sets to fit a model to a dataset [13, 14]. These rule and dictionary-based approaches can achieve high precision [10] but can produce incorrect predictions when a new word, which is not in the training data, appears in a sentence (out-of-vocabulary problem). This out-of-vocabulary problem occurs especially frequently in the biomedical domain, as it is common for a new biomedical term, such as a new drug name, to be registered in this domain.

Recently, studies have demonstrated the effectiveness of deep learning based methods. Sahu and Anand [17] demonstrated the efficiency of Recurrent Neural Network (RNN) for NER in biomedical text. The model by Sahu and Anand is composed of a bidirectional Long Short-Term Memory Network (BiLSTM) and Conditional Random Field (CRF). Sahu and Anand [17] also used character level word embeddings but could not demonstrate their benefits. Habibi et al. [18] combined the BiLSTM-CRF model implementation of Lample et al. [19] and the word embeddings of Pyysalo et al. [20]. Habibi et al. [18] utilized character level word embeddings to capture characteristics, such as orthographic features, of bio-medical entities and achieved state-of-the-art performance, demonstrating the effectiveness of character level word embeddings in BioNER.

Although these models showed some promising results, NER is still a very challenging task in the biomedical domain for the following reasons. First, a limited amount of training data is available for BioNER tasks. Gold-standard datasets contain annotations of one or two entity types. For example, the NCBI corpus [21] includes annotations of diseases but not of other types of entities such as genes and proteins. On the other hand, the JNLPBA corpus [22] contains annotations of only genes and proteins. Therefore, the data for each entity type comprises only a small portion of the total amount of annotated data.

Multi-task learning (MTL) is a method for training a single model for multiple tasks at the same time. MTL can leverage different datasets that are collected for different but related tasks [23]. Although extracting genes is different from extracting chemicals, both tasks require learning some common features that can help understand the linguistic expressions of biomedical texts. Crichton et al. [24] developed an MTL model that was trained on various source datasets containing annotations of different subsets of entity types. An MTL model by Wang et al. [25] achieved performance comparable to that of the state-of-the-art single task NER models. Inspired by the previous studies, we propose CollaboNet which uses the collaboration of multiple models. Unlike the conventional MTL methods which use only a single static model, CollaboNet is composed of multiple models trained on different datasets for different tasks. Each model in CollaboNet is trained on dataset annotated on a specific type of entity and becomes an expert on their own entity type.

Despite the high recall obtained by the MTL based models, the precision of these models is relatively low. Since MTL based models are trained on multiple types of entities and larger training data, they have a broader coverage of various biomedical entities, which naturally results in high recall. On the other hand, as the MTL models are trained on combinations of different entity types, they tend to have difficulty in differentiating among entity types, resulting in lower precision.

Another reason NER is difficult in the biomedical domain is that an entity could be labeled as different entity types depending on its textual context. In our experiments, we observed that many incorrect predictions were a result of the polysemy problem, in which a word, for example, can be used as both a gene and disease name. Models designed to predict disease entities misidentify some genes as diseases. This misidentification of entity types increases the false positive rate. For instance, BiLSTM-CRF based models for disease entities mistakenly label the gene name “BRCA1” as a disease entity because there are disease names such as “BRCA1 abnormalities” or “Brca1-deficient” in the training set. Besides, the training set that annotates “VHL” (Von Hippel-Lindau disease) as a disease entity confuses the models because VHL is also used as a gene name, since the mutation of this gene causes VHL disease.

To solve the false positive problems due to polysemous words, CollaboNet aggregates the results of collaborator models, and uses them as an additional input to the target model. Consider the case of predicting the disease entity VHL utilizing the outputs of gene and chemical models. Once a gene model predicts VHL as a gene, the gene model informs a disease model that VHL is a gene entity so that the disease model will not predict VHL as a disease. In CollaboNet, each model is individually trained on an entity type and then further trained on the outputs of other models that are trained on the other entity types. The models in CollaboNet take turns in being the target and collaborator models during training. Consequently, each model is an expert in its own domain and helps improve the accuracy by leveraging the multi-domain information from the other models.


In the following section, we first discuss a BiLSTM-CRF model for biomedical named entity recognition. The overall structure of the BiLSTM-CRF model is illustrated in Fig. 1. Next, we introduce the structure of CollaboNet, which is comprised of a set of BiLSTM-CRF models as shown in Fig. 2.

Fig. 1 Character level word embedding using CNN and an overview of Bidirectional LSTM with Conditional Random Field (BiLSTM-CRF). Single-task model structure

Fig. 2 Structure of CollaboNet. Arrows show the flow of information when target model \(M_{Target}\) is training. The models in CollaboNet take turns in being the target model

Problem Definition

Named entity recognition involves annotating words in a sentence as named entities. More formally, given an input sequence \(S=[w_{1},w_{2},...,w_{N}]\), we predict corresponding labels \(Y=[y_{1},y_{2},...,y_{N}]\). We use the BIOES scheme [26] for representing \(y_{t}\), where B stands for Beginning, I for Inside, O for Out, E for End, and S for Single.
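As a small illustration of the BIOES scheme described above, the following sketch converts entity spans into tags. The function name `spans_to_bioes`, the inclusive `(start, end)` span convention and the example sentence are assumptions made for this illustration only; they are not taken from the datasets or from the authors' code.

```python
def spans_to_bioes(num_tokens, spans):
    """Convert entity spans (start, end), both inclusive, into BIOES tags."""
    tags = ["O"] * num_tokens
    for start, end in spans:
        if start == end:
            tags[start] = "S"          # single-token entity
        else:
            tags[start] = "B"          # first token of the entity
            for i in range(start + 1, end):
                tags[i] = "I"          # tokens strictly inside the entity
            tags[end] = "E"            # last token of the entity
    return tags


# Hypothetical sentence with one two-token disease mention ("breast cancer").
tokens = ["Mutations", "in", "BRCA1", "cause", "breast", "cancer", "."]
for token, tag in zip(tokens, spans_to_bioes(len(tokens), [(4, 5)])):
    print(token, tag)
```

In a disease dataset the gene mention "BRCA1" stays tagged O, which is exactly the situation in which a single-task disease model can be tempted to produce a false positive.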

Embedding layer

Word Embedding (WE)

Word embedding is an effective way of representing words. As word embeddings capture semantic and syntactic meanings of words, they have been widely used in various natural language processing tasks including named entity recognition. The experiment of Habibi et al. [18] showed that word embeddings trained on biomedical corpora notably improved the performance of BioNER models. Pyysalo et al. [20] were the first to suggest training word embeddings on biomedical corpora from PubMed, PubMed Central (PMC), and Wikipedia. The results of Pyysalo et al. [20] and Habibi et al. [18] suggest that using word embeddings trained on biomedical corpora is essential for BioNER. We also use the trained word embeddings provided by Pyysalo et al. [20]. For each word \(w_{t}\) in a sequence S, we denote a word represented by a word embedding as \(x_{t} \in \mathbb {R}^{d^{word}}\) where \(d^{word}\) is the dimension of the word embedding.

Character Level Word Embedding (CLWE)

To give our model character level morphological information (e.g., ‘-ase’ is common in protein entities), we also leverage the character level information of each word. We build character level word embeddings (CLWEs) using a convolution neural network (CNN), similar to the work of Santos and Zadrozny [27]. Given a word \(w_{t}\), composed of M characters, we represent \(w_{t} = \left \{c_{1}^{t},c_{2}^{t},...,c_{M}^{t}\right \}\) where \(c_{i}^{t} \in \mathbb {R}^{d^{char}}\) is a randomly initialized character embedding for each unique character. Note that unlike the word embeddings trained on separate biomedical corpora, character embeddings are learned from only the BioNER task. For the CNN, padding of the proper size ((k−1)/2) according to window size k should be attached before and after each word. We obtain a window vector \(C_{i}^{t}\) by simply concatenating the character embedding \(c_{i}^{t}\) with the character embeddings of (k−1)/2 characters on both sides:

$$ C_{i}^{t} = \left[c^{t}_{i- (k-1)/{2}}, \cdots c_{i}^{t}, \cdots c^{t}_{i+(k-1)/{2}}\right] \in \mathbb{R}^{k d^{char}} $$

From the window vector \(C_{i}^{t}\), we perform a convolution operation as follows:

$$ \left[ x^{c}_{t} \right]_{j} = \max_{1 \leq i \leq M}\left[ W_{char} C_{i}^{t} + b_{char} \right]_{j} $$

where \(W_{char} \in \mathbb {R}^{d^{clwe} \times k d^{char}}\) and \(b_{char} \in \mathbb {R}^{d^{clwe}}\) denote a trainable filter and bias, respectively. We obtain the element-wise maximum values, and the output is a character level word embedding denoted as \(x_{t}^{c} \in \mathbb {R}^{d^{clwe}}\). We concatenate the character level word embedding with the word embedding trained on biomedical corpora as \(\hat {x}_{t} = \left [x_{t}, x^{c}_{t}\right ]\) to utilize both representations in our model.
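The window construction and the element-wise maximum above can be sketched in plain Python as follows. This is a minimal illustration under stated assumptions, not the authors' TensorFlow implementation: the function name `char_cnn_embedding`, the use of nested lists instead of tensors, zero padding, and the random toy parameters are all assumptions.

```python
import random


def char_cnn_embedding(char_vecs, W, b, k):
    """Character level word embedding with max-over-positions pooling.

    char_vecs: M character embeddings, each of size d_char
    W:         d_clwe rows, each of size k * d_char (the filter W_char)
    b:         d_clwe biases (b_char)
    k:         odd window size
    """
    d_char = len(char_vecs[0])
    pad = [[0.0] * d_char for _ in range((k - 1) // 2)]  # zero padding (assumed)
    padded = pad + char_vecs + pad
    out = [float("-inf")] * len(b)
    for i in range(len(char_vecs)):
        # Window vector C_i: concatenation of the k embeddings centred on character i.
        window = [v for vec in padded[i:i + k] for v in vec]
        for j, (row, bias) in enumerate(zip(W, b)):
            score = sum(w * x for w, x in zip(row, window)) + bias
            out[j] = max(out[j], score)  # element-wise maximum over positions
    return out


# Toy usage with random parameters (sizes chosen only for illustration).
rng = random.Random(0)
d_char, d_clwe, k = 4, 3, 3
chars = [[rng.uniform(-1, 1) for _ in range(d_char)] for _ in "kinase"]
W = [[rng.uniform(-1, 1) for _ in range(k * d_char)] for _ in range(d_clwe)]
b = [0.0] * d_clwe
print(len(char_cnn_embedding(chars, W, b, k)))  # one value per filter
```

Because the maximum is taken over all character positions, the output size depends only on the number of filters, not on the word length, which is what allows the result to be concatenated with the fixed-size word embedding.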

Long Short-Term Memory (LSTM)

A Recurrent Neural Network (RNN) is a neural network that effectively handles variable-length inputs. RNNs have proven to be useful in various natural language processing tasks including language modeling, speech recognition and machine translation [28–30]. Long Short-Term Memory (LSTM) [31] is one of the most frequently used variants of recurrent neural networks. Our model uses the LSTM architecture from Graves et al. [29]. Given the outputs of an embedding layer \(\left [\hat {x}_{1},..., \hat {x}_{N}\right ]\), the hidden states of LSTM are calculated as follows:

$$ i_{t}=\sigma \left(W_{xi} \hat{x}_{t} + W_{hi} h_{t-1} + b_{i} \right) $$
$$ f_{t}=\sigma \left(W_{xf} \hat{x}_{t} + W_{hf} h_{t-1} + b_{f} \right) $$
$$ c_{t} = f_{t} \odot c_{t-1} + i_{t} \odot \tanh \left(W_{xc} \hat{x}_{t} + W_{hc} h_{t-1} + b_{c} \right) $$
$$ o_{t}=\sigma \left(W_{xo} \hat{x}_{t} + W_{ho} h_{t-1} + b_{o} \right) $$
$$ h_{t} = o_{t} \odot \tanh \left(c_{t} \right) $$

where σ and tanh denote a logistic sigmoid function and a hyperbolic tangent function, respectively, and \(\odot\) is an element-wise product. We use a forward LSTM that extracts the representations of inputs in the forward direction, and we use a backward LSTM that represents the inputs in the backward direction.

We concatenate the two states coming from the forward LSTM and the backward LSTM to form the hidden states of the bi-directional LSTM (BiLSTM). BiLSTM, proposed by Schuster and Paliwal [32], was extensively used in various sequence encoding tasks. We obtain a set of hidden states \(h_{t}^{bi} = \left [h_{t}^{f}, h_{t}^{b}\right ] \in \mathbb {R}^{2d^{lstm}}\) where \(h_{t}^{f}\) and \(h_{t}^{b}\) are hidden states of forward and backward LSTMs, respectively, at a time step t.
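The LSTM equations and the bidirectional concatenation can be written out as a short pure-Python sketch. The helper names (`lstm_step`, `run_lstm`, `bilstm`, `make_params`), the dictionary layout of the parameters and the zero initial states are assumptions made for readability; a real model would use a tensor library and trained weights.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def lstm_step(x, h_prev, c_prev, p):
    """One LSTM step; p maps names such as 'W_xi', 'W_hi', 'b_i' to parameters."""
    def gate(name, act):
        pre = [a + b + c for a, b, c in zip(matvec(p["W_x" + name], x),
                                            matvec(p["W_h" + name], h_prev),
                                            p["b_" + name])]
        return [act(v) for v in pre]

    i = gate("i", sigmoid)      # input gate
    f = gate("f", sigmoid)      # forget gate
    g = gate("c", math.tanh)    # candidate cell values
    o = gate("o", sigmoid)      # output gate
    c = [ft * cp + it * gt for ft, cp, it, gt in zip(f, c_prev, i, g)]
    h = [ot * math.tanh(ct) for ot, ct in zip(o, c)]
    return h, c


def run_lstm(xs, p, d):
    """Run an LSTM over a sequence, starting from zero states (assumed)."""
    h, c, hs = [0.0] * d, [0.0] * d, []
    for x in xs:
        h, c = lstm_step(x, h, c, p)
        hs.append(h)
    return hs


def bilstm(xs, p_fwd, p_bwd, d):
    """Concatenate forward and backward hidden states at every time step."""
    fwd = run_lstm(xs, p_fwd, d)
    bwd = run_lstm(xs[::-1], p_bwd, d)[::-1]
    return [f + b for f, b in zip(fwd, bwd)]


def make_params(d_in, d, value=0.0):
    """Constant-valued toy parameters, for illustration only."""
    p = {}
    for g in "ifco":
        p["W_x" + g] = [[value] * d_in for _ in range(d)]
        p["W_h" + g] = [[value] * d for _ in range(d)]
        p["b_" + g] = [0.0] * d
    return p


xs = [[0.5, -0.5], [1.0, 0.0], [0.0, 1.0]]       # three toy input vectors
states = bilstm(xs, make_params(2, 4, 0.1), make_params(2, 4, -0.1), 4)
print(len(states), len(states[0]))               # sequence length and 2 * d_lstm
```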

Bidirectional LSTM with Conditional Random Field (BiLSTM-CRF)

While BiLSTM handles long term dependency problems as well as backward dependency issues, modeling dependencies among adjacent output tags helps improve the performance of the sequence labeling models [25]. We applied a Conditional Random Field (CRF) to the output layer of the BiLSTM to capture these dependencies.

First, we compute the probability of each label given the sequence \(S=[w_{1},...,w_{N}]\) as follows:

$$ z_{t} = W_{y} h_{t}^{bi} + b_{y} $$
$$ \begin{aligned} p(y_{t}|w_{1},..., w_{N}; \Theta) = \text{softmax}(z_{t})\\ \text{softmax}(\mathbf{a}_{j}) = \frac{\exp{a_{j}}}{\sum_{k} \exp{a_{k}}} \end{aligned} $$

where \(W_{y} \in \mathbb {R}^{5 \times 2d^{lstm}}\) and \(b_{y} \in \mathbb {R}^{5}\) are parameters of the fully connected layer for BIOES tags, and the softmax(·) function computes the probability of each tag. Based on the probability p and the CRF layer, our training objective to minimize is defined as follows:

$$ L_{LSTM} = -\sum_{t=1}^{N} \log p(y_{t}|w_{1},..., w_{N}; \Theta) $$
$$ L_{CRF} = -\sum_{t=1}^{N} {\left(A_{y_{t-1},y_{t}} + z_{t,y_{t}} \right)} $$
$$ Loss = L_{LSTM} + L_{CRF} $$

where \(L_{LSTM}\) is the cross entropy loss for the label \(y_{t}\), and \(L_{CRF}\) is the negative sentence-level log likelihood. The score of a tag is the summation of the transition score \(A_{y_{t-1},y_{t}}\) and the emission score from our LSTM \(z_{t,y_{t}}\) at time step t.

At test time, we use Viterbi decoding to find the most probable sequence given the outputs of the BiLSTM-CRF model.
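Viterbi decoding over emission scores \(z_{t,y}\) and transition scores \(A_{y',y}\) can be sketched as below. This is a generic textbook formulation given for illustration; the function name `viterbi`, the integer tag indices and the omission of a dedicated start tag are assumptions, and the toy scores are made up.

```python
def viterbi(emissions, transitions):
    """Return the highest-scoring tag path and its score.

    emissions[t][y]:   emission score z_{t,y} of tag y at position t
    transitions[a][b]: transition score A_{a,b} from tag a to tag b
    No start-tag transition is modelled here (a simplifying assumption).
    """
    n_tags = len(emissions[0])
    score = list(emissions[0])
    backptr = []
    for em in emissions[1:]:
        new_score, ptr = [], []
        for y in range(n_tags):
            # Best previous tag for ending in tag y at this position.
            best_prev = max(range(n_tags), key=lambda a: score[a] + transitions[a][y])
            ptr.append(best_prev)
            new_score.append(score[best_prev] + transitions[best_prev][y] + em[y])
        score = new_score
        backptr.append(ptr)
    last = max(range(n_tags), key=lambda y: score[y])
    path = [last]
    for ptr in reversed(backptr):
        path.append(ptr[path[-1]])   # follow back-pointers to the first position
    return path[::-1], score[last]


# Toy example with two tags; the transition 0 -> 1 is heavily penalised.
emissions = [[2.0, 0.0], [0.0, 1.0], [2.0, 0.0]]
print(viterbi(emissions, [[0.0, 0.0], [0.0, 0.0]]))
print(viterbi(emissions, [[0.0, -10.0], [0.0, 0.0]]))
```

With zero transition scores the decoder simply follows the best emission at each position; once the transition from tag 0 to tag 1 is penalised, the same emissions decode to a path that avoids it. This is the mechanism by which a CRF layer can discourage invalid tag sequences such as an I tag directly after an O tag.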


CollaboNet, our novel NER model, is composed of multiple BiLSTM-CRF models (Fig. 2), and following the terminology of [25], we call each BiLSTM-CRF model a single-task model (STM). In CollaboNet, each STM is trained on a specific dataset and each STM is regarded as an expert on a particular entity type. These experts help each other since the knowledge of each expert is transferred to all the other experts. Training CollaboNet consists of phases and in each phase, except for the first preparation phase, only the target STM is trained on a single dataset for one epoch while the other STMs are not trained but only used to generate input for the target STM which is trained.

More formally, let us denote a set of datasets as D, and a single-task model as \(M_{k}^{n}\), which is trained on the k-th dataset in phase \(P_{n}\). In the preparation phase (\(P_{0}\)) of CollaboNet, each STM is trained independently on a corresponding dataset until the performance of each model converges.

Note that an STM in the preparation phase \(\left (M_{k}^{0}\right)\) is the same as a single BiLSTM-CRF model. In the preparation phase, we assume that each model \(M_{k}^{0}\) has obtained the maximum amount of knowledge about the k-th dataset.

In the subsequent phases \(P_{n}\), where \(n \geq 1\), we select an STM \(M_{d}^{n-1}\) which is an expert on the dataset d. We refer to the selected STM \(M_{d}^{n-1}\) as the target model, and the remaining STMs as the collaborator models. To train the target model \(M_{d}^{n-1}\), we use inputs from the target dataset d and BiLSTM outputs from collaborator models \(M_{k}^{n-1}, \left \{k \vert k \neq d, k \in D \right \}\). We train each STM on its dataset for one epoch, and update the target STM \(M_{d}^{n-1}\) as follows:

$$ \begin{aligned} \hat{S_{d}^{n}}= \alpha_{k_{1}}M_{k_{1}}^{n-1}([S_{d}; \mathbf{0}]) \oslash \cdots \oslash \alpha_{k_{m}}M_{k_{m}}^{n-1}([S_{d}; \mathbf{0}]),\\ \left\{k_{i} \vert k_{i} \neq d, k_{i} \in D \right\} \end{aligned} $$
$$ \hat{p}(Y_{d}|S_{d}) = M_{d}^{n-1}\left([S_{d};\hat{S_{d}^{n}}]\right) $$

where [·;·] denotes concatenation and \(\oslash\) denotes an aggregation operation such as max pooling or concatenation. We used weighted max pooling for the aggregation operation. \(S_{d}\) is the input sequences of the d-th dataset, and \(M_{d}^{n-1}(\cdot)\) is the output \(h_{t}\), defined by Eq. 7. When aggregating the results of collaborator models, we multiply each of the results by a weight \(\alpha_{k}\), which is a trainable parameter. The results are used to train the model \(M_{d}^{n-1}\). Using the outputs obtained by Eq. 14, we train \(M_{d}^{n-1}\) for one epoch, and it becomes \(M_{d}^{n}\) in the next phase. The CRF layer is attached to the final output of \(M_{k}^{n}\). Once we iterate over all the target datasets \(d \in D\), the next phase begins.
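The weighted max pooling aggregation can be illustrated with the following sketch. It assumes that each weight is a single scalar per collaborator and that all collaborator outputs have the same shape; both are assumptions of this example, as are the function names `aggregate` and `target_input` and the toy numbers.

```python
def aggregate(collab_outputs, alphas):
    """Element-wise max over weighted collaborator outputs.

    collab_outputs: one sequence of hidden vectors per collaborator, shaped [N][d]
    alphas:         one scalar weight per collaborator (assumed scalar)
    """
    n_tokens = len(collab_outputs[0])
    dim = len(collab_outputs[0][0])
    return [[max(a * out[t][j] for a, out in zip(alphas, collab_outputs))
             for j in range(dim)]
            for t in range(n_tokens)]


def target_input(token_vectors, aggregated):
    """Concatenate each token's own representation with the aggregated signal."""
    return [x + s for x, s in zip(token_vectors, aggregated)]


# Toy example: one token, two collaborators (hypothetical gene and chemical models).
gene_out = [[1.0, -2.0]]
chem_out = [[0.5, 3.0]]
pooled = aggregate([gene_out, chem_out], alphas=[1.0, 0.5])
print(pooled)
print(target_input([[0.1, 0.2, 0.3]], pooled))
```

Max pooling keeps the output size fixed regardless of how many collaborators are added, whereas concatenation would grow the input of the target model with every new expert.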

During the training phase \(P_{n}\) for d, the target STM, which is composed of the BiLSTM layer and the CRF layer, and the weights \(\alpha_{k}, \left \{k \vert k \neq d, k \in D \right \}\) are trained. Parameters of the other STMs are not trained; these STMs only generate inferences on dataset d in the training phase \(P_{n}\). For example, when the disease dataset is the target dataset, the BiLSTMs of the other STMs produce inferences about the other entity types for the disease dataset. More specifically, inferences about genes for the disease dataset \(M_{gene}^{n-1}([S_{{disease}}; \mathbf {0}])\), which have rich information on gene entities, will benefit the disease STM.
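The order in which models take the target role can be summarized with a small schedule generator. It only enumerates who is trained and who collaborates in each step; the function name `collabonet_schedule`, the dataset labels and the fixed number of phases are assumptions for illustration, and the actual training calls are left out.

```python
def collabonet_schedule(datasets, num_phases):
    """Yield (phase, target, collaborators) tuples in training order.

    Phase 0 is the preparation phase: every model is trained alone
    (until convergence, per the text). In each later phase every dataset
    becomes the target once, for one epoch, while the remaining models
    are frozen and only provide their BiLSTM outputs.
    """
    for name in datasets:
        yield 0, name, []
    for phase in range(1, num_phases + 1):
        for target in datasets:
            collaborators = [d for d in datasets if d != target]
            yield phase, target, collaborators


for step in collabonet_schedule(["gene", "disease", "chemical"], num_phases=2):
    print(step)
```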



We used 5 datasets (BC2GM [33], BC4CHEMD [34], BC5CDR [35–38], JNLPBA [22], NCBI [21]), all of which were collected by Crichton et al. [24] (Table 1). Each of the 5 datasets was constructed from MEDLINE abstracts, and we used the BIOES notation format for named entity labels [26]. Each dataset focuses on one of the three biomedical entity types: disease, chemical, and gene/protein. We did not use cell-type entity tags from JNLPBA for the entity types.

Table 1 Descriptions of datasets

All the datasets are comprised of pairs of input sentences and biomedical entity labels for the sentences. While the JNLPBA dataset has only training and test sets, the other four datasets contain training, development and test sets. For JNLPBA, we used part of its training set as its development set which is the same size as its test set. Also, we found that the JNLPBA dataset from Crichton et al. [24] contained sentences that were incorrectly split. So we preprocessed the original dataset by Kim et al. [22] with a more accurate sentence separation.

The BC5CDR dataset has the sub-datasets BC5CDR-chem, BC5CDR-disease and BC5CDR-both, and they contain chemical entity types, disease entity types, and both entity types, respectively. We reported the performance on BC5CDR-chem and BC5CDR-disease. We have a total of six datasets: BC2GM, BC4CHEMD, BC5CDR-chem, BC5CDR-disease, JNLPBA, and NCBI.


For the evaluation of the named entity recognition task, true positives are counted from exact matches between predicted entity spans and ground truth spans based on the BIOES notation.

We also designed and applied a simple post-processing step that corrects invalid BIOES sequences. This simple step improved precision by about 0.1 to 0.5%, and thus boosted the F1 score by about 0.04 to 0.3%.

Precision, recall and F1 scores were used to evaluate the models.

  • M = total number of predicted entities in the sequence.

  • N = total number of ground truth entities in the sequence.

  • C = total number of correct entities.

$$ \begin{aligned} {Precision}=P=\frac{C}{M}, {Recall}=R=\frac{C}{N}, \\ {F}_{1} score=\frac{2PR}{P+R} \end{aligned} $$
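The exact-match evaluation can be sketched as follows: spans are read off the BIOES tags, and C, M and N are counted on the resulting span sets. The function names and the rule of silently dropping malformed tag fragments are assumptions of this sketch; the post-processing step mentioned above is not specified in the text and is not reproduced here.

```python
def bioes_to_spans(tags):
    """Extract (start, end) entity spans from a BIOES tag sequence.

    Malformed fragments (e.g. an E without a preceding B) are ignored;
    this is an assumption of the sketch, not the paper's post-processing rule.
    """
    spans, start = set(), None
    for i, tag in enumerate(tags):
        if tag == "S":
            spans.add((i, i))
            start = None
        elif tag == "B":
            start = i
        elif tag == "E":
            if start is not None:
                spans.add((start, i))
            start = None
        elif tag == "O":
            start = None
        # An "I" tag leaves any open span open.
    return spans


def precision_recall_f1(gold_tags, pred_tags):
    gold, pred = bioes_to_spans(gold_tags), bioes_to_spans(pred_tags)
    c = len(gold & pred)                      # C: exact span matches
    p = c / len(pred) if pred else 0.0        # C / M
    r = c / len(gold) if gold else 0.0        # C / N
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1


gold = ["B", "I", "E", "O", "S"]
pred = ["O", "B", "E", "O", "S"]   # first entity has a wrong left boundary
print(precision_recall_f1(gold, pred))
```

In this toy case the boundary mistake counts both as a false positive and as a false negative, which is why span errors are costly under exact-match scoring.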

Settings and hyperparameters

We used the 200 dimensional word embedding (WE) by Pyysalo et al. [20] which was trained on PubMed, PubMed Central (PMC) and Wikipedia text, and it contains about 5 million words. Word2vec [39] was used to train the word embedding. For character level word embedding (CLWE), we used window sizes of 3, 5, and 7.

We used the AdaGrad optimizer [40] with an initial learning rate of 0.01 which was exponentially decayed for each epoch by 0.95. The dimension of the character embedding (\(d^{char}\)) was 30 and the dimension of the character level word embedding (\(d^{clwe}\)) was 200*3. We used 300 hidden units for both forward and backward LSTMs. Weights for aggregating the results of collaborator models were uniformly initialized with 1. We applied dropout [41] to two parts of CollaboNet: output of CLWE (0.5) and output of BiLSTM (0.3). The mini-batch size for our experiment was 10.
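The arithmetic behind these settings can be checked with a few lines of Python. Two points are assumptions of this sketch rather than statements from the text: that epoch 0 uses the undecayed learning rate, and that the 200*3 CLWE dimension arises from 200 filters for each of the three window sizes, concatenated.

```python
def learning_rate(epoch, initial=0.01, decay=0.95):
    """Exponentially decayed learning rate; epoch 0 uses the initial value (assumed)."""
    return initial * decay ** epoch


D_WORD = 200                  # pretrained word embedding size
WINDOWS = (3, 5, 7)           # CLWE window sizes
FILTERS_PER_WINDOW = 200      # assumption: 200 filters per window size
D_CLWE = FILTERS_PER_WINDOW * len(WINDOWS)   # 200*3
D_LSTM = 300                  # hidden units per direction

embedding_dim = D_WORD + D_CLWE   # size of the concatenated word representation
bilstm_dim = 2 * D_LSTM           # size of the concatenated BiLSTM state

print(embedding_dim, bilstm_dim)
for epoch in (0, 1, 10):
    print(epoch, learning_rate(epoch))
```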

Most of our hyperparameter settings are similar to those of Wang et al. [25]. Only a few settings such as the dropout rates were different from the hyperparameters of Wang. We tuned these hyperparameters using validation sets.

The preparation phase \(P_{0}\) for 6 datasets takes approximately 900 min, which is the same amount of time it takes to train 6 single-task models. The remaining phases \(P_{n}, n \geq 1\) require 3000 min for complete training. If we exclude BC4CHEMD, the largest dataset, then the training time for \(P_{n}\) is reduced to 1500 min, which is half the time required when all six datasets are used. Experiments were conducted on a 10-core CPU (Intel Xeon E5-260 v4 CPU 2.2 GHz) with one graphics processing unit (NVIDIA Titan Xp). Our code is written in TensorFlow 1.7 (GPU enabled version) for Python 2.7.


The experimental results of the baseline models and CollaboNet are provided in Tables 2 and 3, respectively. Table 2 shows the results of the single-task models (STMs), whereas Table 3 shows the comparison between the existing state-of-the-art multi-task learning model (MTM) and our CollaboNet.

Table 2 Performances of single-task models
Table 3 Performance of CollaboNet and the Multi-Task Model by Wang et al. [25]

Since Wang et al. [25] used BC5CDR-both for their experiments, we reran their models on BC5CDR-chem and BC5CDR-disease for a fair comparison with other models. The rerun scores are denoted with asterisks. We conducted 10 experiments with 10 different random initializations on our STM. We take arithmetic mean over the 6 datasets to compare the overall performance of each model.

Performance of single-task models

Table 2 shows the results of the STMs of Habibi et al. [18] and Wang et al. [25] (baseline STMs), and our STM on the 6 datasets. While the baseline STMs applied BiLSTM for the Character Level Word Embedding (CLWE) layer [18, 25], our STM used Convolution Neural Network (CNN) for the CLWE layer.

On average, our STM outperforms the baseline STMs in terms of precision, recall and F1 score. Although Sahu and Anand [17] tried to improve the performance of NER models with a CNN based CLWE layer, they did not succeed. In our experiments, however, our STM outperforms the baseline STMs, demonstrating the effectiveness of an STM with a CNN based CLWE layer.

Performance of CollaboNet

Comparing Tables 2 and 3, CollaboNet achieves higher precision and F1 score than most STM models on all datasets. On average, CollaboNet has improved both precision and recall. CollaboNet also outperforms the multi-task model (MTM) from Wang et al. [25] on 4 out of 6 datasets (Table 3). While multi-task learning has improved performance in previous studies [25], using CollaboNet, which consists of expert models trained for each entity type, could further improve biomedical named entity recognition performance.


Compared to baseline models, CollaboNet achieves higher performance on macro average (Tables 2 and 3). The increase in precision is encouraging when considering the practical use of bioNER systems. In a number of biomedical text mining systems, important information tends to be repeated in a large text corpus. Therefore, missing a few entities may not hinder the performance of an entire system, as this can be compensated elsewhere. However, incorrect information and the propagation of errors can affect the entire system.

In Table 4, we report the error types of our STM and CollaboNet. We define a bio-entity error as recognizing different types of biomedical entities as target entity types. For instance, recognizing ‘VHL’ as a gene when it was used as a disease in a sentence is a bio-entity error. Note that a bio-entity error could occur when an entity is a polysemous word (e.g. VHL), or comprised of multiple words (e.g. BRCA1 deficient), and thus correcting bio-entity errors requires contextual information or supervision of other entity type models. The error analysis was conducted on 4334 errors of our STM and 3966 errors of CollaboNet on 5 datasets (BC2GM, BC5CDR-chem, BC5CDR-disease, JNLPBA, NCBI). Error analysis was conducted on the models which showed the best performance in our experiments.

Table 4 The number of bio-entity type errors, the total number of errors, and the ratio of bio-entity errors to the total numbers of errors for each model prediction

The error analysis of our STM, which is a single BiLSTM-CRF model, shows that a large share of its errors are bio-entity errors, which comprise up to 49.3% of the total errors in JNLPBA. Overall, bio-entity errors constitute 1333 of the 4334 STM errors, comprising 30.8% of all the errors. Although bio-entity error was not the most common error type, its importance is much greater than that of other errors such as span error, which was the most common error type and constituted 38% of all errors. While most span errors can be easily fixed by non-experts, bio-entity errors are difficult to detect and fix, even for biomedical researchers. Also, for biomedical text mining tasks such as drug-drug interaction (DDI) extraction, span errors of an NER system have a minor effect on DDI results but bio-entity errors could lead to completely different results.

The performance improvement of CollaboNet over STM may not seem significant when considering the increased complexity of CollaboNet’s structure. We found by error analysis that CollaboNet had an increased number of span errors. As our metric is based on the exact match evaluation, consistent annotation of the ground truth dataset is important for reducing span errors which are caused by modifiers. For instance, in the phrase “acute adult renal failure,” “adult renal failure” may be labeled as an entity in some datasets. In this case, predicting “acute adult renal failure” or “renal failure” as an entity will be counted as a false negative and a false positive. On the other hand, some other datasets may include the modifier “acute” in an entity, considering “acute adult renal failure” as the only true prediction. Therefore, unlike STM, CollaboNet uses various datasets that have been annotated differently. Even though CollaboNet outperforms STM, its results may be lower due to this inconsistency in annotation.

In CollaboNet, each expert model is trained on a single entity type dataset, and their training inputs are a concatenation of word embeddings and outputs of the other expert models. We expect that the other expert models will transfer knowledge on their respective entity to the target model, and thus improve the bio-entity type error problem by collaboration. As Table 4 shows, CollaboNet performs better than our STM in detecting polysemy and other entity types. Among 3966 errors from CollaboNet, 736 errors are bio-entity errors, comprising 18.6% of all the errors.
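The ratios quoted in the error analysis follow directly from the reported counts, as the short check below shows. The counts are the ones given in the text; the helper name `share` and the derived relative reduction are additions of this sketch.

```python
def share(part, total):
    """Percentage of `part` in `total`, rounded to one decimal place."""
    return round(100.0 * part / total, 1)


stm_bio, stm_total = 1333, 4334          # our STM
collab_bio, collab_total = 736, 3966     # CollaboNet

print(share(stm_bio, stm_total))         # 30.8
print(share(collab_bio, collab_total))   # 18.6

# Derived quantity, not stated in the text: relative drop in bio-entity errors.
print(share(stm_bio - collab_bio, stm_bio))
```

The remaining, non-bio-entity errors can be obtained by subtraction from the same counts, which makes it easy to see how the error mix shifts between the two models.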

Case study

We sampled the predictions of CollaboNet and those of our STM (single-task model) to further understand the strengths of CollaboNet in Table 5.

Table 5 Case study

The first example from chemical dataset in Table 5 shows our expected result from CollaboNet. Our STM annotates antilymphocyte globulin as a chemical entity. However, it is clear that the entity is not a chemical but a type of globulin which is a protein. The second example sentence from the chemical dataset is about an ACE / ARB entity. Again, our STM misidentifies the entity as a chemical entity. On the other hand, in CollaboNet, the target model (chemical model) obtains knowledge from one of the collaborator models (the gene/protein model) to avoid mistakenly recognizing the entity as a chemical entity. As globulin or ACE entities appear in the gene/protein dataset, the chemical model obtains information from the gene/protein model.

In the disease dataset, the first example shows a multi-word entity in parentheses. As a gene model can pass syntactic and semantic information about a word e.g., mutated and its surrounding words to a disease model, CollaboNet can abstain from predicting A-T, mutated as the disease entity, which our STM model failed to do. The second example in the disease dataset is on cardiac troponin T. Since cardiac + noun in biomedical text can be easily considered as a disease name, our STM misidentified this word as a disease entity. However, with the help of a gene model, CollaboNet did not mark it as a disease entity.

The gene/protein entity type further demonstrates the effectiveness of CollaboNet in reducing bio-entity type errors. Two example sentences contain abbreviations, which are one of the distinct characteristics of gene entities. LMB and cHD are incorrectly predicted as gene/protein entities by our STM, since lots of gene/protein entities are abbreviations. However, the target model (gene/protein model) in CollaboNet can obtain information on leptomycin and disease from the chemical and disease models, respectively. With the help of information from collaborator models, CollaboNet can effectively increase the precision of other entity type models.

In addition, we found some labels in the ground truth set, which we believe are incorrect. Tsai et al. [15] also reported that the inconsistent annotations in the JNLPBA corpus limit the NER system. We report our findings in Table 6.

Table 6 Case study

In the first row of Table 6, the gene/protein entity osteopontin was not marked in the ground truth labels, whereas our network correctly predicted it as a gene entity. The second row also displays questionable results of the ground truth labels. Although lg and bcl-6, which are abbreviations of Immunoglobulin and B-cell lymphoma 6, were not labeled in the ground truth labels, our model detected them as gene/protein entities. The example sentences of gene/protein annotations in Table 6 were reviewed by several domain experts and medical doctors. As shown in the third row, beta-muricholate is a chemical entity but it was not annotated in the ground truth labels. However, the last row shows another type of annotation error. Contrast media is a general term for a medium used in medical imaging and, since it is not a proper noun, it is not a named entity.

These examples show the presence of incorrect ground truth labels, which can harm the performance of bioNER models. However, we believe that these missed or misidentified ground truth labels can be corrected by our system.
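One way such candidate label errors could be surfaced for expert review is to list the entity spans on which the model and the ground truth disagree. The sketch below is only an illustration under stated assumptions: it assumes BIO-style tags, the helper names `bio_spans` and `annotation_disagreements` are hypothetical, the tag name GENE is a placeholder, and the example sentence is invented around the osteopontin case. It is not the procedure used in this paper, where the examples were reviewed manually.

```python
def bio_spans(tags):
    """Extract (start, end, type) spans from BIO tags; end is exclusive.

    Stray I- tags that do not continue an open span are ignored.
    """
    spans, start, etype = set(), None, None
    # A trailing "O" sentinel closes a span that runs to the last token.
    for i, tag in enumerate(list(tags) + ["O"]):
        inside = tag.startswith("I-") and etype == tag[2:]
        if start is not None and not inside:
            spans.add((start, i, etype))
            start, etype = None, None
        if tag.startswith("B-"):
            start, etype = i, tag[2:]
    return spans


def annotation_disagreements(tokens, gold_tags, pred_tags):
    """List entity spans on which gold and predicted tags disagree."""
    gold, pred = bio_spans(gold_tags), bio_spans(pred_tags)

    def describe(spans):
        # Sort by position so the output order is deterministic.
        return [(" ".join(tokens[s:e]), t) for s, e, t in sorted(spans)]

    return {"predicted_only": describe(pred - gold),
            "gold_only": describe(gold - pred)}


# Invented sentence: the gold labels miss an entity that the model predicts.
tokens = ["Expression", "of", "osteopontin", "was", "increased", "."]
gold_tags = ["O", "O", "O", "O", "O", "O"]
pred_tags = ["O", "O", "B-GENE", "O", "O", "O"]
report = annotation_disagreements(tokens, gold_tags, pred_tags)
print(report)
```

Spans listed under `predicted_only` are either model false positives or annotations missing from the ground truth; deciding which still requires a domain expert.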

Future work

For future work, we first plan to cover more target entity types and use more datasets. For example, CRAFT [42], LINNAEUS [43] and Variome [44] are manually annotated datasets and valuable resources that can be used to expand our model. Second, we plan to apply CollaboNet to downstream biomedical text mining systems. For example, entity search engines such as BEST [10] could be improved by using more accurate NER models.


In this paper, we introduced CollaboNet, which consists of multiple BiLSTM-CRF models, for biomedical named entity recognition. While existing models were only able to handle datasets with a single entity type, CollaboNet leverages multiple datasets and achieves the highest F1 scores. Unlike recently proposed multi-task models, CollaboNet is built upon multiple single-task NER models (STMs) that send information to each other for more accurate predictions. In addition to improving performance over multi-task models, CollaboNet differentiates between biomedical entities that are polysemous or have similar orthographic features. As a result, our model achieved state-of-the-art performance on four bioNER datasets in terms of F1 score, precision and recall. Although our model requires a large amount of memory and time, as existing multi-task models do, the simple structure of CollaboNet allows researchers to add expert models for further entity types. As CollaboNet obtains higher precision than other models, we plan to apply it in a biomedical text mining system.



Bidirectional long short-term memory


Biomedical named entity recognition


Character embedding


Character level word embedding


Convolutional neural network


Conditional random field


Drug-Drug Interaction


Long short-term memory


Multi-task learning


Multi-task model


Named entity recognition


Natural language processing


PubMed Central


Single-task model


Recurrent neural network


Word embedding


  1. Home-PMC-NCBI. Accessed 01 Apr 2018.

  2. Cohen AM, Hersh WR. A survey of current work in biomedical text mining. Brief Bioinform. 2005; 6(1):57–71.

  3. Miwa M, Sætre R., Miyao Y, Tsujii J. Protein–protein interaction extraction by leveraging multiple kernels and parsers. Int J Med Inform. 2009; 78(12):39–46.

  4. Zheng JG, Howsmon D, Zhang B, Hahn J, McGuinness D, Hendler J, Ji H. Entity linking for biomedical literature. In: Proceedings of the ACM 8th International Workshop on Data and Text Mining in Bioinformatics. New York: ACM: 2014. p. 3–4.

  5. Tsutsui S, Ding Y, Meng G. Machine reading approach to understand alzheimers disease literature. In: Proceedings of the Tenth International Workshop on Data and Text Mining in Biomedical Informatics (DTMBIO). London: BioMed Central: 2016.

  6. Preiss J, Stevenson M. The effect of word sense disambiguation accuracy on literature based discovery. In: Proceedings of the ACM Ninth International Workshop on Data and Text Mining in Biomedical Informatics. DTMBIO ’15. New York: ACM: 2015. p. 1.

  7. Heo GE, Kang KY, Song M. Examining the field of bioinformatics by the multi-faceted informetric approach. In: Proceedings of the ACM 10th International Workshop on Data and Text Mining in Bioinformatics. 2016.

  8. Lim S, Lee K, Kang J. Drug drug interaction extraction from the literature using a recursive neural network. PloS ONE. 2018; 13(1):0190926.

  9. Rosario B, Hearst MA. Classifying semantic relations in bioscience texts. In: Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics. Stroudsburg: Association for Computational Linguistics: 2004. p. 430.

  10. Lee S, Kim D, Lee K, Choi J, Kim S, Jeon M, Lim S, Choi D, Kim S, Tan A-C, et al. Best: next-generation biomedical entity search tool for knowledge discovery from biomedical literature. PloS ONE. 2016; 11(10):0164680.

  11. Hettne KM, Stierum RH, Schuemie MJ, Hendriksen PJ, Schijvenaars BJ, Mulligen E. M. v., Kleinjans J, Kors JA. A dictionary to identify small molecules and drugs in free text. Bioinformatics. 2009; 25(22):2983–91.

  12. Song M, Yu H, Han W-S. Developing a hybrid dictionary-based bio-entity recognition technique. BMC Med Inform Decis Mak. 2015; 15(1):9.

  13. Fukuda K-I, Tsunoda T, Tamura A, Takagi T, et al. Toward information extraction: identifying protein names from biological papers. In: Pac Symp Biocomput: 1998. p. 707–18.

  14. Proux D, Rechenmann F, Julliard L, Pillet V, Jacq B. Detecting gene symbols and names in biological texts. Genome Inform. 1998; 9:72–80.

  15. Tsai RT-H, Sung C-L, Dai H-J, Hung H-C, Sung T-Y, Hsu W-L. Nerbio: using selected word conjunctions, term normalization, and global patterns to improve biomedical named entity recognition. In: BMC Bioinformatics. London: BioMed Central: 2006. p. 11.

  16. Ju M, Miwa M, Ananiadou S. A neural layered model for nested named entity recognition. In: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers). New Orleans: Association for Computational Linguistics: 2018. p. 1446–59.

  17. Sahu S, Anand A. Recurrent neural network models for disease name recognition using domain invariant features. In: Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Berlin: Association for Computational Linguistics: 2016. p. 2216–25.

  18. Habibi M, Weber L, Neves M, Wiegandt DL, Leser U. Deep learning with word embeddings improves biomedical named entity recognition. Bioinformatics. 2017; 33(14):37–48.

  19. Lample G, Ballesteros M, Subramanian S, Kawakami K, Dyer C. Neural architectures for named entity recognition. In: HLT-NAACL. San Diego: The Association for Computational Linguistics: 2016. p. 260–70.

  20. Pyysalo S, Ginter F, Moen H, Salakoski T, Ananiadou S. Distributional semantics resources for biomedical text processing. In: Proceedings of the 5th International Symposium on Languages in Biology and Medicine, Tokyo, Japan: 2013. p. 39–43.

  21. Doğan RI, Leaman R, Lu Z. Ncbi disease corpus: a resource for disease name recognition and concept normalization. J Biomed Inform. 2014; 47:1–10.

  22. Kim J-D, Ohta T, Tsuruoka Y, Tateisi Y, Collier N. Introduction to the bio-entity recognition task at JNLPBA. In: Proceedings of the International Joint Workshop on Natural Language Processing in Biomedicine and Its Applications. Geneva: Association for Computational Linguistics: 2004. p. 70–5.

  23. Caruana R. Multitask learning. Mach Learn. 1997; 28(1):41–75.

  24. Crichton G, Pyysalo S, Chiu B, Korhonen A. A neural network multi-task learning approach to biomedical named entity recognition. BMC Bioinformatics. 2017; 18(1):368.

  25. Wang X, Zhang Y, Ren X, Zhang Y, Zitnik M, Shang J, Langlotz C, Han J. Cross-type biomedical named entity recognition with deep multi-task learning. Bioinformatics. 2018.

  26. Ratinov L, Roth D. Design challenges and misconceptions in named entity recognition. In: Proceedings of the Thirteenth Conference on Computational Natural Language Learning. Stroudsburg: Association for Computational Linguistics: 2009. p. 147–55.

  27. Santos CD, Zadrozny B. Learning character-level representations for part-of-speech tagging. In: Proceedings of the 31st International Conference on Machine Learning (ICML-14). 2014. p. 1818–26.

  28. Kim Y, Jernite Y, Sontag D, Rush AM. Character-aware neural language models. In: AAAI. Phoenix: AAAI Press: 2016. p. 2741–9.

  29. Graves A, Mohamed A-R, Hinton G. Speech recognition with deep recurrent neural networks. In: Acoustics, Speech and Signal Processing (icassp), 2013 Ieee International Conference On. Vancouver: IEEE: 2013. p. 6645–9.

  30. Bahdanau D, Cho K, Bengio Y. Neural machine translation by jointly learning to align and translate. In: 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings: 2015.

  31. Hochreiter S, Schmidhuber J. Long short-term memory. Neural Comput. 1997; 9(8):1735–80.

  32. Schuster M, Paliwal KK. Bidirectional recurrent neural networks. IEEE Trans Signal Process. 1997; 45(11):2673–81.

  33. Smith L, Tanabe LK, nee Ando RJ, Kuo C-J, Chung I-F, Hsu C-N, Lin Y-S, Klinger R, Friedrich CM, Ganchev K, et al. Overview of biocreative ii gene mention recognition. Genome Biol. 2008; 9(2):2.

  34. Krallinger M, Rabal O, Leitner F, Vazquez M, Salgado D, Lu Z, Leaman R, Lu Y, Ji D, Lowe DM, et al. The chemdner corpus of chemicals and drugs and its annotation principles. J Cheminformatics. 2015; 7(S1):2.

  35. Wei C-H, Peng Y, Leaman R, Davis AP, Mattingly CJ, Li J, Wiegers TC, Lu Z. Overview of the biocreative v chemical disease relation (cdr) task. In: Proceedings of the Fifth BioCreative Challenge Evaluation Workshop: 2015. p. 154–66.

  36. Li J, Sun Y, Johnson R, Sciaky D, Wei C-H, Leaman R, Davis AP, Mattingly CJ, Wiegers TC, Lu Z. Annotating chemicals, diseases, and their interactions in biomedical literature. In: Proceedings of the Fifth BioCreative Challenge Evaluation Workshop: 2015. p. 173–82.

  37. Leaman R, Islamaj Doğan R, Lu Z. Dnorm: disease name normalization with pairwise learning to rank. Bioinformatics. 2013; 29(22):2909–17.

  38. Leaman R, Wei C-H, Lu Z. tmchem: a high performance approach for chemical named entity recognition and normalization. J Cheminformatics. 2015; 7(1):3.

  39. Mikolov T, Chen K, Corrado G, Dean J. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781. 2013.

  40. Duchi J, Hazan E, Singer Y. Adaptive subgradient methods for online learning and stochastic optimization. J Mach Learn Res. 2011; 12(Jul):2121–59.

  41. Hinton GE, Srivastava N, Krizhevsky A, Sutskever I, Salakhutdinov RR. Improving neural networks by preventing co-adaptation of feature detectors. arXiv preprint arXiv:1207.0580. 2012.

  42. Cohen KB, Lanfranchi A, Choi M. J. -y., Bada M, Baumgartner WA, Panteleyeva N, Verspoor K, Palmer M, Hunter LE. Coreference annotation and resolution in the colorado richly annotated full text (craft) corpus of biomedical journal articles. BMC Bioinformatics. 2017; 18(1):372.

  43. Gerner M, Nenadic G, Bergman CM. Linnaeus: A species name identification system for biomedical literature. BMC Bioinformatics. 2010; 11(1):85.

  44. Verspoor K, Jimeno Yepes A, Cavedon L, McIntosh T, Herten-Crabb A, Thomas Z, Plazzer J-P. Annotating the biomedical literature for the human variome. Database. 2013; 2013:019.



We are sincerely grateful to Inah Chang for conducting manual error counting. We thank Susan Kim for editing the manuscript.


The design of the study and collection, analysis, and interpretation of data were funded by the National Research Foundation of Korea (NRF-2017M3C4A7065887, 2016M3A9A7916996) and National IT Industry Promotion Agency grant funded by the Ministry of Science and ICT and Ministry of Health and Welfare (NO. C1202-18-1001, Development Project of The Precision Medicine Hospital Information System (P-HIS)). Publication costs were funded by the National Research Foundation of Korea (NRF-2016M3A9A7916996).

Availability of data and materials

The source code of CollaboNet and the datasets are available at

About this supplement

This article has been published as part of BMC Bioinformatics Volume 20 Supplement 10, 2019: Proceedings of the 12th International Workshop on Data and Text Mining in Biomedical Informatics (DTMBIO 2018). The full contents of the supplement are available online at

Author information


WY, CHS, JL and JK conceived the idea. WY and JL designed the model. WY and CHS developed CollaboNet. CHS experimented and collected analysis examples and results. WY, JL and JK wrote the manuscript. JK, as the supervisor of WY, CHS and JL, provided guidance on the experiment. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Jaewoo Kang.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License, which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver applies to the data made available in this article, unless otherwise stated.

About this article

Cite this article

Yoon, W., So, C., Lee, J. et al. CollaboNet: collaboration of deep neural networks for biomedical named entity recognition. BMC Bioinformatics 20 (Suppl 10), 249 (2019).
