A neural joint model for entity and relation extraction from biomedical text

Background: Extracting biomedical entities and their relations from text has important applications in biomedical research. Previous work primarily used feature-based pipeline models for this task. Feature-based models require substantial feature engineering. Moreover, pipeline models may suffer from error propagation and cannot exploit the interactions between subtasks. Therefore, we propose a neural joint model that extracts biomedical entities and their relations simultaneously and alleviates these problems.
Results: Our model was evaluated on two tasks: extracting adverse drug events between drug and disease entities, and extracting resident relations between bacteria and location entities. Compared with the state-of-the-art systems for these tasks, our model improved the F1 scores of the first task by 5.1% in entity recognition and 8.0% in relation extraction, and that of the second task by 9.2% in relation extraction.
Conclusions: The proposed model achieves competitive performance with less feature engineering. We demonstrate that a model based on neural networks is effective for biomedical entity and relation extraction. In addition, parameter sharing is an alternative method for neural models to jointly process this task. Our work can facilitate research on biomedical text mining.


Background
Automatically extracting entities and their relations from biomedical text has attracted much research attention in the biomedical text mining community because of its important applications in knowledge acquisition and ontology construction [1]. Recently, various related tasks have been proposed, such as protein-protein interaction detection (PPI) [2], drug-drug interaction detection (DDI) [3], adverse drug event extraction (ADE) [4] and the bacteria biotope task (BB) [5].
Taking the ADE task as an example, the objective is to recognize mentions of drug and disease entities and to extract possible ADE relations between them. Given the sentence "A woman who was treated for thyrotoxicosis[disease] with methimazole[drug] developed agranulocytosis[disease]." (entity types are shown in brackets), the outputs are three entity mentions and one ADE relation, {methimazole[drug], agranulocytosis[disease]}[ADE].
*Correspondence: dhji@whu.edu.cn 1 School of Computer, Wuhan University, Bayi Road, Wuhan, China. Full list of author information is available at the end of the article.
Entity and relation extraction is a standard task in text mining and natural language processing (NLP). Most previous work used two-step pipeline models to perform this task. First, entity mentions in a given sentence are recognized using named entity recognition (NER) techniques. NER is usually cast as a sequence labeling problem solved by conditional random fields (CRFs) [6]. Second, each entity pair is examined to decide whether it has a task-specific relation, using classification models such as support vector machines (SVMs) [7]. In the biomedical community, pipeline models are also frequently used for this task [8][9][10][11][12][13][14].
Such pipeline models suffer from two main problems. First, the errors generated in the NER step may propagate to the relation classification step. For instance, if a drug or disease entity mention is incorrectly recognized, the extraction of its related ADEs will also be incorrect. Second, the interactions between the two subtasks cannot be exploited, although these interactions may help both subtasks. For instance, given the sentence "The tire maker still employs 1400" [15], it may be difficult to recognize "1400" as a person entity in isolation, but the word "employs" indicates an employment-organization relation, which must involve a person entity. Such a relation may therefore help the model recognize "1400" correctly.
Because of these disadvantages of pipeline models, joint models, which process entity recognition and relation classification simultaneously, have been proposed. Since joint models process the two subtasks simultaneously, they can alleviate the problem of error propagation. In addition, some model parameters are shared by the entity recognition and relation classification submodels, so these parameters help the models capture the interactions between the two subtasks. Roth and Yih [16] proposed a joint inference framework based on integer linear programming to extract entities and relations. Li and Ji [15] used a single transition-based model to accomplish entity recognition and relation classification simultaneously. Kordjamshidi et al. [17] proposed a structured learning model to extract biomedical entities and their relationships. However, these feature-based approaches require much feature engineering and also suffer from feature sparsity, since the combined feature space of a joint task is significantly larger than those of its subtasks.
Recently, deep learning with neural networks has received increasing research attention in artificial intelligence [18,19], as well as in text mining and NLP [20,21]. Compared with other models, deep neural networks use low-dimensional dense embeddings to represent features such as words or part-of-speech (POS) tags, which effectively mitigates the feature sparsity problem. In addition, deep neural networks demand less feature engineering, since they can learn features from training data automatically. Ma and Hovy [22] and Lample et al. [23] used similar frameworks that combine recurrent neural networks (RNNs) with CRFs and obtained the best results on several benchmark NER datasets. For relation classification, there are two state-of-the-art methods using deep neural networks, namely RNNs [24] and convolutional neural networks (CNNs) [25]. They used RNNs or CNNs to learn relation representations along the words between two target entities or along the words on the shortest dependency path (SDP) between two target entities. Miwa and Bansal [26] proposed an end-to-end relation extraction model and obtained competitive performance on several datasets. However, there is less related work on biomedical entity and relation extraction using deep neural networks. Li et al. [27] and Mehryary et al. [28] used approaches similar to [24,25], but they focused only on relation classification with given entities. Li et al. [29] used a transition-based feed-forward neural network to jointly extract drug-disease entity mentions and their ADE relations. Jiang et al. [30] proposed two independent neural models for DDI and gene mention tagging tasks, respectively.
In this paper, we follow this recent line of work on deep neural networks and propose a neural joint model to extract biomedical entities and their relations. First, our model uses CNNs to encode the character information of words into character-level representations. Second, character-level representations, word embeddings and POS embeddings are fed into a bi-directional (Bi) long short-term memory (LSTM) [31] based RNN to learn the representations of entities and their contexts in a sentence. These representations are used to recognize biomedical entities. Third, another Bi-LSTM-RNN learns relation representations of two target entities along their SDP. These representations are used to classify their relations. The second Bi-LSTM-RNN is stacked on the first one, i.e., the output vectors of the LSTM units in the first Bi-LSTM-RNN are used as the input vectors of the LSTM units in the second one. The parameters of the LSTM units in the first Bi-LSTM-RNN are shared by both networks, so they are jointly affected by the entity recognition and relation classification tasks during training.
Our neural joint model was evaluated for extracting biomedical entities and their relations on two tasks, namely ADE [4] and BB [5]. Compared with the state-of-the-art model [29] for the ADE task, our model improved the precision and recall of drug-disease entity recognition by 3.2 and 7.1%, and those of ADE relation extraction by 3.5 and 12.9%, respectively. Compared with the best system [14] for the BB task, our model improved the precision and recall of resident relation extraction by 30.5 and 0.8%, respectively. Experimental results showed that our neural joint model could obtain competitive performance with less feature engineering. In addition, our model could obtain better performance than pipeline models by sharing parameters between the submodels. We demonstrate that deep neural networks are also effective for biomedical entity and relation extraction. Therefore, our model can facilitate research on biomedical text mining.

CNN for character-level representations
Character-level features have been demonstrated to be effective for neural NER models. For example, the suffix "bacter" is a strong indicator of a bacteria entity such as "campylobacter" or "helicobacter". Following previous work [22,23], CNNs are used to extract morphological information (such as the prefix or suffix of a word) from the characters of words. Figure 1 shows the process of extracting character information from a word and encoding it into a character-level vector representation.
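Given a word $w = \{c_1, c_2, \ldots, c_N\}$, $c_i$ denotes its $i$-th character and $emb(c_i)$ denotes the embedding of this character. To use morphological information, the embeddings of contiguous characters within a window of size $C$ are concatenated as the final representation $r_{c_i}$ of $c_i$:
$$r_{c_i} = emb(c_{i-C}) \oplus \cdots \oplus emb(c_i) \oplus \cdots \oplus emb(c_{i+C}) \tag{1}$$
For each $r_{c_i}$, a convolutional kernel generates the corresponding output $o_i$, given by
$$o_i = \tanh(W_1 r_{c_i} + b_1) \tag{2}$$
where $W_1$ and $b_1$ are the parameter matrix and bias vector that are learned, and $\tanh$ denotes the hyperbolic tangent activation function. To generate the character-level representation $r_w$ of the word $w$, max-pooling operations are applied to all kernel outputs $o_1, o_2, \ldots, o_N$. The $k$-th dimension of $r_w$ is computed by
$$r_w[k] = \max_{1 \le i \le N} o_i[k] \tag{3}$$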
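The character-level encoder described above can be sketched in a few lines of plain Python. This is a minimal illustration rather than the authors' implementation: the function name `char_cnn`, the zero padding at word boundaries, the output dimension `OUT_DIM` and the random initialization ranges for `W1` are assumptions made only for the example. The embedding dimension (25) and window size (C = 3) follow the hyper-parameter settings reported later in the paper.

```python
import math
import random

def char_cnn(char_embs, W1, b1, C):
    """Character-level word representation: window concat, tanh kernel, max-pooling."""
    n = len(char_embs)
    dim = len(char_embs[0])
    pad = [0.0] * dim  # assumption: zero vectors pad positions outside the word
    outputs = []
    for i in range(n):
        # r_{c_i}: concatenate the embeddings of characters i-C .. i+C
        r = []
        for j in range(i - C, i + C + 1):
            r.extend(char_embs[j] if 0 <= j < n else pad)
        # o_i = tanh(W1 r_{c_i} + b1)
        o = [math.tanh(sum(w * x for w, x in zip(row, r)) + b)
             for row, b in zip(W1, b1)]
        outputs.append(o)
    # r_w[k] = max over all positions i of o_i[k]
    return [max(o[k] for o in outputs) for k in range(len(b1))]

rng = random.Random(0)
EMB_DIM, C, OUT_DIM = 25, 3, 50      # OUT_DIM is a hypothetical value
IN_DIM = (2 * C + 1) * EMB_DIM       # kernel input size: (2*3+1)*25 = 175

word = "helicobacter"
# hypothetical character embedding table, one vector per distinct character
table = {ch: [rng.uniform(-0.01, 0.01) for _ in range(EMB_DIM)]
         for ch in sorted(set(word))}
W1 = [[rng.uniform(-0.1, 0.1) for _ in range(IN_DIM)] for _ in range(OUT_DIM)]
b1 = [0.0] * OUT_DIM

r_w = char_cnn([table[ch] for ch in word], W1, b1, C)
print(len(r_w))
```

Because max-pooling is taken over all character positions, the resulting vector has a fixed size regardless of word length.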

Bi-LSTM-RNN for biomedical entity recognition
Following state-of-the-art neural models [22,23,26], biomedical entity recognition is cast as a sequence labeling problem. For example, if the standard BILOU label scheme is used in the ADE task, which includes two entity types (Drug and Disease), entity labels can be designed as follows. B-Drug/B-Disease, I-Drug/I-Disease and L-Drug/L-Disease denote the beginning, inside and last words of Drug/Disease entities, respectively. U-Drug or U-Disease denotes a single-word Drug or Disease entity. O denotes that the word does not belong to any entity. For example, given the sentence "gliclazide-induced acute hepatitis", Fig. 2 shows the process of labeling each word of this sentence with our Bi-LSTM-RNN model.
The input is a sentence $w_1/p_1/r_{w_1}, w_2/p_2/r_{w_2}, \ldots, w_N/p_N/r_{w_N}$, where $w_i$ denotes the $i$-th word, $p_i$ denotes the POS tag of $w_i$, and $r_{w_i}$ denotes the character-level representation of $w_i$. For the $i$-th step of sequence labeling, the input vector $t_i$ is the concatenation of these three features, $t_i = emb(w_i) \oplus emb(p_i) \oplus r_{w_i}$.
Based on $t = \{t_1, t_2, \ldots, t_N\}$, an LSTM unit in the left-to-right direction associates each of them with a hidden state $\overrightarrow{h}_i$, which captures not only the information in the current step but also that in the previous steps. To capture the information in the following steps, we also add a counterpart $\overleftarrow{h}_i$ in the right-to-left direction.
In the hidden layer, $\overrightarrow{h}_i$ and $\overleftarrow{h}_i$ are selected as one input source in the $i$-th step. Moreover, the previous entity label $l^e_{i-1}$ is selected as another input source to account for label dependence (e.g., the label I-Drug should not follow the label O). This is not shown in Fig. 2 for conciseness. The output of the $i$-th step in the hidden layer is given by
$$h^e_i = \tanh\left(W_2 \left(\overrightarrow{h}_i \oplus \overleftarrow{h}_i \oplus emb(l^e_{i-1})\right) + b_2\right) \tag{4}$$
where $h^e_i$ denotes the output vector of the hidden layer, and $W_2$ and $b_2$ denote the parameter matrix and bias vector that are learned.
Finally, the softmax output layer calculates the probabilities $y^e$ of all entity labels $L^e$, given by
$$y^e = \mathrm{softmax}(W_3 h^e_i + b_3) \tag{5}$$
where $W_3$ and $b_3$ are learned parameters, and the $k$-th label with the maximum probability $y^e_k$ is selected as the label of the $i$-th word.
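To make the BILOU scheme concrete, the following sketch converts gold entity spans into per-token labels for the example sentence. The helper name `bilou_tags` and the span format (inclusive token indices plus a type) are illustrative assumptions. The tokenization follows the preprocessing described later, in which punctuation separates "gliclazide" from "induced".

```python
def bilou_tags(num_tokens, entities):
    """Build BILOU labels from entity spans given as (start, end, type), end inclusive."""
    tags = ["O"] * num_tokens
    for start, end, etype in entities:
        if start == end:
            tags[start] = f"U-{etype}"          # single-word entity
        else:
            tags[start] = f"B-{etype}"          # first word
            for i in range(start + 1, end):
                tags[i] = f"I-{etype}"          # inside words
            tags[end] = f"L-{etype}"            # last word
    return tags

tokens = ["gliclazide", "-", "induced", "acute", "hepatitis"]
entities = [(0, 0, "Drug"), (3, 4, "Disease")]
tags = bilou_tags(len(tokens), entities)
print(list(zip(tokens, tags)))
# [('gliclazide', 'U-Drug'), ('-', 'O'), ('induced', 'O'),
#  ('acute', 'B-Disease'), ('hepatitis', 'L-Disease')]
```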

Bi-LSTM-RNN for relation classification
Once entity recognition is finished, our model starts relation classification to determine whether a task-specific relation exists for each possible entity pair. Prior work has demonstrated the effectiveness of SDPs in dependency trees for relation classification [24,26]. The words along an SDP concentrate the most relevant information while diminishing less relevant noise. Following these studies, we use a Bi-LSTM-RNN to model relation representations between two target entities along their SDP. For example, given the sentence "gliclazide-induced acute hepatitis", Fig. 3 shows the process of classifying ADE relations with our Bi-LSTM-RNN.
Given an entity pair $e_a$ (e.g., gliclazide) and $e_b$ (e.g., acute hepatitis) in a sentence, the last words $a$ (e.g., gliclazide) and $b$ (e.g., hepatitis) of these entities are used to build the SDP between them. The SDP can be formally represented by $\{a, a_1, \ldots, a_m, c, b_n, \ldots, b_1, b\}$ (e.g., {gliclazide, induced, hepatitis}), where $c$ denotes their lowest common ancestor in the dependency tree (e.g., induced), $a_1, \ldots, a_m$ denote the words occurring between $a$ and $c$ on the SDP, and $b_1, \ldots, b_n$ denote the words occurring between $b$ and $c$. The SDP can be divided into two kinds of sequences: $\{a, a_1, \ldots, a_m, c\}$ (e.g., {gliclazide, induced}) and $\{b, b_1, \ldots, b_n, c\}$ (e.g., {hepatitis, induced}) are bottom-up sequences; $\{c, a_m, \ldots, a_1, a\}$ (e.g., {induced, gliclazide}) and $\{c, b_n, \ldots, b_1, b\}$ (e.g., {induced, hepatitis}) are top-down sequences. We extract features from both kinds of sequences with the Bi-LSTM-RNN. The input of each LSTM unit is a concatenation of three parts, given by
$$x_i = \overrightarrow{h}_i \oplus \overleftarrow{h}_i \oplus emb(d_i) \tag{6}$$
where $emb(d_i)$ denotes the embedding of the dependency type $d_i$ between the word $w_i$ and its governor in the dependency tree. $\overrightarrow{h}_i$ and $\overleftarrow{h}_i$ correspond to the word $w_i$ and are identical to the notations in Eq. 4.
Fig. 3 The Bi-LSTM-RNN for relation classification. The input sentence is tokenized before it is analyzed by a dependency parser. Tokens are indexed by Arabic numerals. The basic (a.k.a. projective) dependency style is used to build a tree. The bold lines in the tree denote the shortest dependency path (SDP) between "gliclazide" and "hepatitis" with their lowest common ancestor "induced". $x_i$ indicates the input vector of an LSTM unit as shown in Eq. 6, and $i$ corresponds to the index of a token. In the Bi-LSTM-RNN layer, solid arrow lines denote bottom-up and top-down computations along the SDP in the dependency tree.
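The construction of the SDP and its bottom-up and top-down sequences can be illustrated with a small sketch. The head-index representation of the dependency tree, the function names and the particular parse used below are assumptions for illustration only. The parse is hypothetical except for the facts stated in the text, namely that "induced" is the lowest common ancestor of "gliclazide" and "hepatitis".

```python
def path_to_root(node, head):
    """Return the list of nodes from `node` up to the root, following head links."""
    path = [node]
    while head[path[-1]] is not None:
        path.append(head[path[-1]])
    return path

def sdp_sequences(a, b, head):
    """Split the shortest dependency path between a and b at their lowest common ancestor."""
    up_a = path_to_root(a, head)
    up_b = path_to_root(b, head)
    ancestors_a = set(up_a)
    # lowest common ancestor: first node on b's root path that also lies on a's root path
    c = next(n for n in up_b if n in ancestors_a)
    bottom_up_a = up_a[: up_a.index(c) + 1]   # a ... c
    bottom_up_b = up_b[: up_b.index(c) + 1]   # b ... c
    return {
        "lca": c,
        "bottom_up": (bottom_up_a, bottom_up_b),
        "top_down": (bottom_up_a[::-1], bottom_up_b[::-1]),  # c ... a and c ... b
    }

tokens = ["gliclazide", "-", "induced", "acute", "hepatitis"]
# hypothetical parse: head[i] is the governor of token i, None marks the root
head = {0: 2, 1: 2, 2: None, 3: 4, 4: 2}

seqs = sdp_sequences(0, 4, head)
to_words = lambda seq: [tokens[i] for i in seq]
print(tokens[seqs["lca"]])                       # induced
print([to_words(s) for s in seqs["bottom_up"]])  # [['gliclazide', 'induced'], ['hepatitis', 'induced']]
print([to_words(s) for s in seqs["top_down"]])   # [['induced', 'gliclazide'], ['induced', 'hepatitis']]
```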
Since $\overrightarrow{h}_i$ and $\overleftarrow{h}_i$ are used as the inputs of these LSTM units, the Bi-LSTM-RNN for relation classification is stacked on the Bi-LSTM-RNN for entity recognition. Therefore, the two Bi-LSTM-RNNs in our joint model share some parameters, and these parameters are tuned during joint training, which helps our joint model capture the interactions between the two subtasks. Miwa and Bansal [26] also demonstrated the effectiveness of this method for neural models.
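The bottom-up and top-down computations along the SDP produce the vectors $\uparrow h_a$, $\uparrow h_b$, $\downarrow h_a$ and $\downarrow h_b$ for the two target words $a$ and $b$. In the hidden layer, these four vectors are selected as one input source, and the entity representations $r_a$ and $r_b$ are used as another input source, computed by
$$r_a = \frac{1}{|K_a|} \sum_{k \in K_a} \overrightarrow{h}_k \oplus \overleftarrow{h}_k, \qquad r_b = \frac{1}{|K_b|} \sum_{k \in K_b} \overrightarrow{h}_k \oplus \overleftarrow{h}_k \tag{7}$$
where $K_a$ and $K_b$ denote the index sets of the words in the two entities, and $\overrightarrow{h}_k$ and $\overleftarrow{h}_k$ are identical to the notations in Eq. 4. Entity representations compensate for information loss, since the SDP is built from only the last words of the two target entities. For conciseness, this part is not shown in Fig. 3.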
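Finally, the vector representations of both input sources are concatenated and passed through the hidden layer to generate the output $h^r$, given by
$$h^r = \tanh\left(W_4 \left(\uparrow h_a \oplus \uparrow h_b \oplus \downarrow h_a \oplus \downarrow h_b \oplus r_a \oplus r_b\right) + b_4\right) \tag{8}$$
A softmax layer then calculates the probabilities $y^r$ of all relation labels $L^r$, given by
$$y^r = \mathrm{softmax}(W_5 h^r + b_5) \tag{9}$$
where $W_4$, $b_4$, $W_5$ and $b_5$ are learned parameters, and the $k$-th label with the maximum probability $y^r_k$ is selected as the relation type of the two target entities $e_a$ and $e_b$.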
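The final classification step can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: it assumes that each entity representation is the average of the concatenated forward and backward hidden states of its words, that the hidden layer uses a tanh activation, and that all dimensions and the two-label set are hypothetical toy values. All function names are invented for the example.

```python
import math
import random

def mean_vector(vectors):
    """Element-wise average of equally sized vectors (assumed entity representation)."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def affine(W, x, b):
    """Compute W x + b for a matrix given as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + bi for row, bi in zip(W, b)]

def softmax(z):
    m = max(z)                                  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

def classify_relation(sdp_vecs, ent_a_states, ent_b_states, W4, b4, W5, b5):
    r_a = mean_vector(ent_a_states)
    r_b = mean_vector(ent_b_states)
    x = [v for vec in sdp_vecs for v in vec] + r_a + r_b   # concatenate both input sources
    h_r = [math.tanh(v) for v in affine(W4, x, b4)]        # hidden layer
    y_r = softmax(affine(W5, h_r, b5))                     # label probabilities
    best = max(range(len(y_r)), key=y_r.__getitem__)
    return best, y_r

rng = random.Random(1)
H, HIDDEN, LABELS = 4, 6, 2                     # hypothetical toy dimensions
vec = lambda d: [rng.uniform(-1, 1) for _ in range(d)]
sdp_vecs = [vec(H) for _ in range(4)]           # up_h_a, up_h_b, down_h_a, down_h_b
ent_a = [vec(2 * H)]                            # "gliclazide": one word
ent_b = [vec(2 * H), vec(2 * H)]                # "acute hepatitis": two words
W4 = [vec(4 * H + 4 * H) for _ in range(HIDDEN)]
b4 = [0.0] * HIDDEN
W5 = [vec(HIDDEN) for _ in range(LABELS)]
b5 = [0.0] * LABELS

label, probs = classify_relation(sdp_vecs, ent_a, ent_b, W4, b4, W5, b5)
print(label, probs)
```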

Training
Both submodels of our joint model employ the same training algorithm, with AdaGrad [32] used to control the update step, so we describe their training together. Online learning is used to train the model parameters. Given a sentence with gold-standard entities and relations, we generate training examples for the entity recognition and relation classification submodels. When an example is sent to its corresponding submodel, the cross-entropy loss for this example is computed and gradients are back-propagated to each layer of the submodel to update its parameters. The two submodels can therefore be regarded as being trained alternately. Moreover, since the parameters of the LSTM units in the entity recognition submodel are shared by both submodels, the loss of each example propagates to these parameters. They are therefore affected by both the entity recognition and relation classification tasks.
Formally, assuming that the gold-standard label and its predicted probability are $l$ and $prob_l$, the loss for each example is $-\log prob_l$. If all losses are accumulated with an $L_2$ regularization term, the final objective is given by
$$L(\theta) = -\sum \log prob_l + \frac{\lambda}{2} \lVert \theta \rVert_2^2 \tag{10}$$
where $\theta$ denotes all model parameters, and $\lambda$ is the regularization parameter.
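As a rough illustration of the training procedure, the sketch below computes the per-example cross-entropy loss, a regularized objective, and one AdaGrad update. The function names are hypothetical, the factor of one half on the regularization term and the small constant `eps` are assumptions, and the update shown is the standard AdaGrad rule rather than a detail confirmed by the paper. The learning rate of 0.03 and the regularization parameter of 1e-8 follow the hyper-parameter settings reported later.

```python
import math

def example_loss(probs, gold):
    """Cross-entropy loss of one example: -log of the probability of the gold label."""
    return -math.log(probs[gold])

def objective(losses, params, lam):
    """Accumulated losses plus an L2 penalty (the 1/2 factor is an assumption)."""
    return sum(losses) + 0.5 * lam * sum(p * p for p in params)

def adagrad_step(params, grads, cache, lr=0.03, eps=1e-8):
    """Standard AdaGrad: scale each step by the root of the accumulated squared gradients."""
    for i, g in enumerate(grads):
        cache[i] += g * g
        params[i] -= lr * g / (math.sqrt(cache[i]) + eps)

loss = example_loss([0.25, 0.75], gold=1)
total = objective([loss], params=[1.0, 2.0], lam=1e-8)

# toy problem: one online update on f(x) = (x - 3)^2, starting from x = 0
params, cache = [0.0], [0.0]
grads = [2 * (params[0] - 3.0)]       # gradient is -6.0
adagrad_step(params, grads, cache)
print(loss, total, params[0])
```

In the first step the accumulated squared gradient equals the squared gradient itself, so the update size is approximately the learning rate regardless of the gradient magnitude.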

Data
We carried out experiments on two tasks, namely adverse drug event extraction (ADE) [4] and the bacteria biotope task (BB) [5]. The ADE task aims to extract two kinds of entities (drugs and diseases) and relations about which drug is associated with which disease (ADEs). Its dataset is published in the form of independent sentences that come from 1644 PubMed abstracts. Sentences in the dataset are divided into two categories, namely 6821 sentences in which at least one drug/disease entity pair has the ADE relation (i.e., ADE sentences), and 16695 sentences in which no drug/disease entity pair has the ADE relation (i.e., non-ADE sentences). Biocurators only annotated drug/disease entities (i.e., the arguments of ADE relations) in the ADE sentences, so there are no annotated entities in the non-ADE sentences. Following previous work [29], only ADE sentences were used in our experiments since we need to evaluate the performances of both entity recognition and relation extraction. Similar to prior work [12,29], 120 relations with nested gold annotations were removed (e.g., "lithium intoxication", where "lithium" is related to "lithium intoxication").
The BB task aims to extract bacteria-related knowledge from PubMed abstracts. We focus on the BB-event+ner subtask, which consists of two parts: recognizing bacteria, habitat and geographical entity mentions, and extracting Lives_In relations between bacteria entities and their locations (either habitat or geographical entities). The training, development and test sets of the BB-event+ner subtask include 71, 36 and 54 documents, which contain 1158, 736 and 1049 entities and 327, 223 and 314 relations, respectively. The statistics of the final data used in our experiments are shown in Table 1.

Evaluation metrics
Standard precision (P), recall (R) and F1 were used as evaluation metrics for entity and relation extraction, computed by
$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}, \qquad F1 = \frac{2 \times P \times R}{P + R} \tag{11}$$
where a recognized entity mention was counted as a true positive (TP) if its boundary and type matched those of a gold entity mention. An extracted relation was counted as a TP if its relation type was correct and the boundaries and types of its related entities matched those of the entities in a gold relation. A recognized entity or extracted relation was counted as a false positive (FP) if it did not satisfy the corresponding conditions above. The number of false-negative (FN) instances was computed by counting the gold entities or relations that were not identified by our model. Since there was no official development set for the ADE task, we evaluated our model using 10-fold cross-validation, where 10% of the data were used as the development set, 10% as the test set and the remainder as the training set. The final results are reported as macro-averaged scores.
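The exact-match evaluation can be illustrated with a short sketch. The function name `prf`, the tuple representation of entity mentions and the toy fold scores are assumptions made for the example. A prediction counts as a true positive only if its boundaries and type both match a gold item.

```python
def prf(gold, predicted):
    """Exact-match precision, recall and F1 over sets of hashable items."""
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)       # items matching in boundary and type
    fp = len(predicted - gold)       # predicted items with no gold match
    fn = len(gold - predicted)       # gold items the model missed
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

# entity mentions as (start token, end token, type)
gold = [(0, 0, "Drug"), (3, 4, "Disease")]
pred = [(0, 0, "Drug"), (4, 4, "Disease")]   # wrong boundary for the disease mention
p, r, f1 = prf(gold, pred)
print(p, r, f1)   # 0.5 0.5 0.5

# macro-average over folds: average each metric across per-fold scores (toy values)
fold_scores = [(0.5, 0.5, 0.5), (1.0, 0.5, 2 / 3)]
macro = [sum(col) / len(col) for col in zip(*fold_scores)]
```

The same function applies to relations if each relation is represented as a tuple containing its type and both argument mentions.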
For the BB task, we used P, R and F1 to evaluate our model on the development set. The final results on the test set were given by the official evaluation service [5], which showed only the overall performance of relation extraction in P, R and F1.

Hyper-parameter settings
Some hyper-parameter values were tuned on the development set and others were chosen empirically following prior work [22,26], since a full search over all hyper-parameters is infeasible. Their final values are shown in Table 2. For conciseness, the dimensions of the model parameter matrices $W_1, W_2, W_3, W_4, W_5$ and bias vectors $b_1, b_2, b_3, b_4, b_5$ are not shown, since they can be easily deduced from this table. Their values were randomly initialized from a uniform distribution.
The initial AdaGrad learning rate $\alpha$ and the regularization parameter $\lambda$ were set to 0.03 and $10^{-8}$, respectively. The dimension of word embeddings was set to 200 and those of the other feature embeddings were set to 25. We used pre-trained biomedical word embeddings [33] to initialize our word embeddings, and the other kinds of embeddings were randomly initialized in the range (-0.01, 0.01). All embeddings except the word embeddings were tuned during training.
For CNN, the character window size C was set to 3, so the dimension of convolutional kernel inputs r c can be computed as (2×3+1)×25=175. For Bi-LSTM-RNN in

Preprocessing
Given a document, we used heuristic rules to split it into sentences and then tokenized these sentences into words. Tokenization was performed using not only whitespace but also punctuation, since we might not find the node for an entity (e.g., "gliclazide") in the dependency tree if it was not separated from a larger piece of text (e.g., "gliclazide-induced"). All words were transformed into their lowercase forms and numbers were replaced by zeroes. Version 3.4 of the Stanford CoreNLP toolkit [34] was used for POS tagging and dependency parsing. To ensure that dependency structures were trees, we employed basic (a.k.a. projective) dependencies. Discontinuous and nested entities were removed in order to fit our model.

Result comparisons with other work
Table 3 shows the results of prior work on the ADE task. Kang et al. [12] used a knowledge-based pipeline method, recognizing entities via an off-the-shelf tool and extracting ADEs via the UMLS Metathesaurus and Semantic Network [35]. As shown in Table 3, their method obtained imbalanced precision and recall. One likely reason is that their method did not distinguish between ADE relations and drug-disease treatment relations because of the limitations of manually designed rules and knowledge bases, which led to a high recall but a low precision. By contrast, our neural joint model achieved more balanced precision and recall without the assistance of knowledge bases. In addition, its recall for relation extraction is comparable with that of their method.
Li et al. [29] used a feed-forward neural network to jointly extract drug-disease entities and ADE relations. For drug-disease entity recognition, our model improved the precision, recall and F1 by 3.2, 7.1 and 5.1%, respectively. For ADE relation extraction, the precision, recall and F1 were improved by 3.5, 12.9 and 8.0%, respectively. Their method used knowledge bases such as WordNet [36] and CTD [37] to help improve performance. Moreover, they manually designed global features to capture the interactions between entity recognition and relation extraction. By contrast, our model obtained much better results without using any knowledge base and captured the interactions automatically.
Table 4 shows the results of related work on the BB task. LIMSI [14] achieved the best F1 in the official evaluation. It used a pipeline framework with a CRF to recognize mentions of bacteria and locations and an SVM to extract Lives_In relations between two entity mentions. UTS [5] also employed a pipeline framework, relying on two independent SVMs to perform entity recognition and relation classification, respectively. As shown in Table 4, these systems suffered from either low precision or low recall. Our neural joint model outperformed them without using the knowledge bases provided by the task organizers. In addition, neural features reduced the feature engineering required by CRFs or SVMs.
All the methods in the BB task achieved lower recall than precision, which might have two causes. First, according to the official statistics [5] shown in Table 5, there is much disagreement among annotators on whether to annotate an entity mention or relation as a gold answer. This implies that extracting Lives_In relations from PubMed abstracts is challenging even for professional annotators. Second, according to the official statistics of the BB task, 27% of the relations are inter-sentence relations (i.e., the argument entities of a relation occur in different sentences), so methods restricted to extracting intra-sentence relations (i.e., the argument entities of a relation occur in the same sentence) will suffer from low recall. Nevertheless, the extraction of inter-sentence relations is still a very challenging problem in text mining and NLP, and it is not addressed in this paper.

Feature contributions
Experiments were carried out on the development set to explore the contributions of different features. For entity recognition, our features consist of words, characters, POS tags and entity labels. For relation extraction, our features consist of words, dependency types and entity representations. In the feature contribution experiments, we took the model using only word features as the baseline and added one other kind of feature at a time.
In Table 6, entity labels were most useful in the ADE task, improving the precision and recall by 2.4 and 1.9%, respectively. In the BB task, POS tags contributed the most, improving the precision and recall by 2.3 and 4.1%, respectively. The effect of character features was moderate, improving the F1 by 0.3 and 1.3%. In Table 7, adding entity representations gave our model the largest improvements in F1, by 1.0% in the ADE task and 3.0% in the BB task, while dependency type features contributed the most to precision in the BB task.
Based on our experiments, the contributions of these features are not consistent across tasks, which is reasonable given the characteristics of these tasks and their datasets.

Comparisons of joint and pipeline models
Since our model uses parameter sharing to join two Bi-LSTM-RNN networks, it is necessary to evaluate the effectiveness of this method. To this end, a pipeline model was built without parameter sharing and compared with the joint model.
The pipeline model was built by replacing $\overrightarrow{h}_i$ and $\overleftarrow{h}_i$ in Eq. 6 with word embeddings $emb(w_i)$. The connections between the two Bi-LSTM-RNNs were thereby cut off and they became independent submodels. To be fair, both the pipeline and joint models used only word embedding features. As shown in Table 8, the performance differences between the pipeline and joint models are slight in the ADE task. In the BB task, by contrast, the joint model performs much better than the pipeline model, and its F1 scores increase by 2.8 and 4.2% in entity recognition and relation classification, respectively. Miwa and Bansal [26] performed similar experiments on other datasets and the performance differences ranged from 0.8 to 1.1%.
In general, we believe that parameter sharing between the subtasks of a joint model is effective, since the shared parameters are influenced by correlated subtasks and can help a joint model capture the interactions between these subtasks. Nevertheless, this strategy may have little effect on performance for a specific task, so the characteristics of a task also need to be considered.
Note for Table 6: "+" means that only that feature is added. "char", "pos" and "label" denote character, POS tag and entity label features, respectively.
Note for Table 7: "+" means that only that feature is added. "dep" and "entity" denote dependency type and entity representation features, respectively.

Error analysis
The errors were divided into two groups, namely FP and FN. For entity recognition, both FP and FN errors can be divided into two types: the boundary of an entity is incorrectly recognized, or the type of an entity is incorrectly recognized. For relation extraction, FP errors are of two types: the entity mentions of a relation are incorrect (in either boundaries or types), or the entity mentions are correct but their relation is incorrectly predicted. FN errors are also of two types: first, at least one entity mention of a relation has not been recognized, so the relation is lost; second, both entity mentions of a relation have been recognized, but the model does not determine that they have this relation.
The error analysis was performed on the development sets of the two datasets. As shown in Table 9, boundary identification seems to be much more difficult than type identification in biomedical entity recognition. Boundary identification errors account for more than 90% of the total errors in both tasks. This is plausible for the following reasons. First, there are only a few entity types in the ADE (drug/disease) and BB (bacteria/habitat/geographical) tasks, so it is easier for the model to identify entity types. Second, the characteristics of biomedical entities are more distinctive than those of entities in the general domain, which helps the model identify their types. For example, a bacteria entity "helicobacter" or drug entity "gliclazide" is much less ambiguous than an organization entity "bank", since "bank" can also mean "riverside". Third, the boundary of a biomedical entity is more difficult to identify, since it may span several words that express a single biomedical concept, such as the disease entity "bilateral lower leg edema" or the habitat entity "monocyte-like THP-1 cells".
In Table 10, the percentage of the first type of FP errors is much higher than that of the second type in both tasks (55.7% vs. 3.1% and 22.7% vs. 15.2%), which implies the importance of entity recognition for relation extraction. The proportion of the second type of FP errors in the BB task is larger than that in the ADE task (15.2% vs. 3.1%), which suggests that the relations in the BB task are more difficult to predict. In addition, the first type of FN errors accounts for nearly 50% of the total errors in both tasks, which indicates that missing entities are the main cause of missing relations. One way to alleviate this problem is therefore to build a high-quality entity recognition model in order to reduce the errors propagating to the subsequent step of relation extraction. Another way is to use joint models to alleviate such error propagation.
By contrast, the distribution of the second type of FN errors differs markedly between the two tasks. In the ADE task, such errors account for 0.5%, while in the BB task they account for 18.4%. A likely reason is that we used only ADE sentences, which contain at least one ADE relation, as our dataset in the ADE task, since the entities in non-ADE sentences were not annotated. The relation expressions in ADE sentences may be explicit, so they are easier for the model to identify. In contrast, we used all sentences in the BB task, which increases the difficulty of relation extraction. Furthermore, the relations in the ADE task were annotated at the sentence level, while those in the BB task were annotated at the document level, so inter-sentence relations were lost.
To further support our observations from the error analysis, we performed additional experiments comparing our model with two relation extraction methods given gold entity mentions, one of which is based on the co-occurrence of entities inside one sentence. As shown in Table 11, the co-occurrence and gold-mention based methods achieved very high performance (>95% in F1) in the ADE task, which shows that the errors of our model come mainly from entity recognition. This explains the low error rates of the second type of FP (entities correct, relations wrong: 3.1%) and FN (entities found, relations not found: 0.5%) in Table 10. The high performance when entities are given is mainly due to the annotation method of the ADE corpus: if the drug and disease entities in a sentence have no ADE relation, the entities are not annotated in that sentence either; therefore, if entities are given, ADE relations are almost determined. By contrast, the relation classification submodel of our model also contributed a number of errors in the BB task, since the co-occurrence and gold-mention based methods achieved only modest performance when entities were given. This also explains the high error rates of the second type of FP (entities correct, relations wrong: 15.2%) and FN (entities found, relations not found: 18.4%) in Table 10.

Limitations of our model
The main limitation of our model is that it cannot extract inter-sentence relations, which is a much more challenging task since it requires discourse-level language understanding and coreference resolution. Some prior work has explored methods for inter-sentence relation extraction [38,39] or event extraction [40]. In future work, our main objective is to address this limitation.

Conclusions
In this paper, we explore a neural joint model to extract biomedical entities and their relations. Our model utilizes the advantages of several state-of-the-art neural models for entity recognition or relation classification in text mining and NLP. Experimental results on two related tasks showed that our model outperformed the best systems in those tasks. We find that deep neural networks can achieve competitive performances with less work on feature engineering and less dependence on external resources such as knowledge bases. In addition, parameter sharing is an effective method for neural models to jointly process several correlated tasks. We believe that our work can facilitate the research on biomedical text mining, especially for biomedical entity and relation extraction. Whether our model is effective for other biomedical entity-relation-extraction tasks remains to be investigated.