We adopt the pretrained language model BERT [16] as our backbone, since it performs competitively on SQuAD [17] and can handle questions that have no answer. Our model consists of three major layers: (1) a BERT encoder layer; (2) a knowledge-enhanced attention layer; (3) a prediction layer. Details are described as follows.

### BERT encoder layer

To be in line with BERT, given the context sequence \(C=(w_c^1, w_c^2, \ldots , w_c^n)\) and the query sequence \(Q=(w_q^1, w_q^2, \ldots , w_q^m)\), the input is formulated as a sequence \(S_{q,c} =(\mathrm{[CLS]}, w_q^1, w_q^2, \ldots , w_q^m,\) \(\mathrm{[SEP]}, w_c^1, w_c^2, \ldots , w_c^n,\) \(\mathrm{[SEP]})\), where \(\mathrm{[CLS]}\) is the start token placed before *Q* and \(\mathrm{[SEP]}\) separates *Q* and *C*. The word sequence is then tokenized into the token sequence \({\mathbf{s}} ={[s_i]_{i=1}^k}\), and each token embedding is combined with its position embedding and segment embedding. We denote the BERT encoder, which consists of \(L\) stacked Transformer layers, as \(BERT(\cdot )\), where each layer computes:

$$\begin{aligned} {\mathbf{s}} _i^l=Transformer({\mathbf{s}} _i^{l-1}),l \in [1,L] \end{aligned}$$

(1)

The hidden representation \({\mathbf{h}} ={[h_i]_{i=1}^k}\) for the token sequence obtained from BERT is \({\mathbf{h}} =BERT({\mathbf{s}} )\).
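The input layout above can be illustrated with a minimal sketch. It works at the word level and skips subword tokenization; the function name `build_input` and the example tokens are hypothetical, and the segment ids follow the usual BERT convention (0 for the query part, 1 for the context part), which is an assumption rather than something stated in this section.

```python
def build_input(query_tokens, context_tokens):
    """Lay out [CLS] Q [SEP] C [SEP] with segment and position ids.

    Assumption: segment 0 covers [CLS], the query and the first [SEP];
    segment 1 covers the context and the final [SEP].
    """
    tokens = ["[CLS]"] + list(query_tokens) + ["[SEP]"] + list(context_tokens) + ["[SEP]"]
    segment_ids = [0] * (len(query_tokens) + 2) + [1] * (len(context_tokens) + 1)
    position_ids = list(range(len(tokens)))
    return tokens, segment_ids, position_ids


# Hypothetical example: a query about a chemical and a short context.
tokens, segment_ids, position_ids = build_input(
    ["what", "disease", "does", "aspirin", "induce"],
    ["aspirin", "caused", "gastric", "ulcer"],
)
```

Index 0 of `tokens` is always `[CLS]`, which is the position later used to represent a null answer.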

### Knowledge-enhanced attention layer

To obtain the knowledge-enhanced context representation, this layer integrates the knowledge representation with the context representation of BERT. Here, we describe the details of this layer based on CID relation extraction. In the knowledge base, the same entity pair may be labeled with different relation types in different documents, so the knowledge representation is noisy. This layer integrates the noisy knowledge representation into the context representation simply and effectively. It takes the BERT hidden representation \({\mathbf{h}}\) and the knowledge representation \({\mathbf{r}}\) as inputs, and outputs the knowledge-enhanced representation \({\mathbf{h}} ^{'}\).

To integrate prior knowledge representation, we first extract chemical-disease triples from the Comparative Toxicogenomics Database (CTD) [18] and employ TransE [19] to learn the knowledge representation. Following [1], we extract (*chemical, disease, relation*) triples from both the CDR corpus and the CTD knowledge base. In CTD, there are three types of relations, ‘*marker/mechanism*’, ‘*therapeutic*’ and ‘*inferred-association*’, of which only ‘*marker/mechanism*’ indicates the CID relation. For pairs that appear in CDR but not in CTD, we set the relation to a specific symbol *‘null’*. Thus, there are four types of relations among all the triples, and we finally obtain 2,577,184 triples for knowledge representation learning. All the generated triples are then regarded as correct examples for learning low-dimensional chemical, disease and relation representations \({\mathbf{r}} _t\) with TransE, where \({\mathbf{r}} _t \in {\mathbb {R}}^{d_2}\) and \(d_2\) denotes the representation dimension. The chemical, disease and relation representations are initialized randomly from a normal distribution before training. It is worth noting that there may be more than one relation type between an entity pair (Fig. 2).
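As a rough illustration of how TransE scores a triple, the sketch below uses the standard translation distance \(\Vert \mathbf{e}_{chem} + \mathbf{r}_t - \mathbf{e}_{dis}\Vert\) and a margin-based ranking loss against corrupted triples. The choice of norm, the margin value and the function names are assumptions made for the example; this section does not specify the training settings.

```python
import numpy as np


def transe_distance(head, rel, tail, norm=1):
    """Translation distance ||head + rel - tail||; smaller means more plausible."""
    return np.linalg.norm(head + rel - tail, ord=norm, axis=-1)


def margin_ranking_loss(pos_dist, neg_dist, margin=1.0):
    """Hinge loss pushing correct triples at least `margin` closer than corrupted ones."""
    return np.maximum(0.0, margin + np.asarray(pos_dist) - np.asarray(neg_dist)).mean()


# Embeddings are drawn from a normal distribution, as in the text;
# the dimension d2 = 4 is an arbitrary value for illustration.
rng = np.random.default_rng(0)
d2 = 4
chemical, disease, relation = rng.normal(size=(3, d2))
corrupted_disease = rng.normal(size=d2)

pos = transe_distance(chemical, relation, disease)
neg = transe_distance(chemical, relation, corrupted_disease)
loss = margin_ranking_loss([pos], [neg])
```

After training, only the relation vectors \({\mathbf{r}} _t\) are needed by the attention layer described next.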

Then, the probable relation representation \({\mathbf{r}} =[{\mathbf{r}} _t]_{t=1}^3\) between the candidate entities of each instance is introduced into the RC model to provide evidence. We use a two-step attention to combine the knowledge information and the context information. First, we adopt an attention mechanism to select the most relevant KB relation representation for the hidden representation of each token. A bilinear [20] operation is employed to calculate the attention weights between the hidden representation \({\mathbf{h} }_i \in {\mathbb {R}}^{d_1}\) and the relation representation \({\mathbf{r}} _t \in {\mathbb {R}}^{d_2}\), where \({\mathbf{W}} _{1} \in {\mathbb {R}}^{{d_1}\times {d_2}}\) and \({\mathbf{b}} _{1} \in {\mathbb {R}}^{d_2}\) are trainable parameters.

$$\begin{aligned} {\alpha _{it}}=\frac{exp({\mathbf{h}} _{i}{} {\mathbf{W} }_{1}{} {\mathbf{r}} _{t}+{\mathbf{b}} _{1})}{\sum _{t^{'}=1}^{3}exp({\mathbf{h}} _{i}{} {\mathbf{W}} _{1}{} {\mathbf{r}} _{t^{'}}+{\mathbf{b}} _{1})} \end{aligned}$$

(2)

Then, the relation representations \({\mathbf{r}} _t\) are aggregated for each hidden state \({\mathbf{h}} _i\) using these weights. Here, \({\mathbf{k}} _i\) is the weighted relation representation corresponding to token *i*.

$$\begin{aligned} {\mathbf{k}} _{i}=\sum _{t}\alpha _{it}{} {\mathbf{r}} _{t} \end{aligned}$$

(3)

Second, we adopt a knowledge-context attention mechanism between the knowledge representation \({\mathbf{k}} _{i}\) at each position *i* of the token sequence and every hidden representation \({\mathbf{h}} _j\). A bilinear operation between \({\mathbf{k} }_i\) and \({\mathbf{h}} _j\) yields the weights over the hidden representations, where \({\mathbf{W}} _2 \in {\mathbb {R}}^{{d_2}\times {d_1}}\) and \({\mathbf{b}} _2 \in {\mathbb {R}}^{d_1}\) are trainable parameters.

$$\begin{aligned} {\beta _{ij}}=\frac{exp({\mathbf{k}} _{i}{} {\mathbf{W}} _{2}{} {\mathbf{h}} _{j}+{\mathbf{b}} _{2})}{\sum _{j^{'}=1}^{k}exp({\mathbf{k}} _{i}{} {\mathbf{W}} _{2}{} {\mathbf{h}} _{j^{'}}+{\mathbf{b}} _{2})} \end{aligned}$$

(4)

Finally, the hidden representations \({\mathbf{h}} _j\) are aggregated with the weights \(\beta _{ij}\) for each position *i*. We denote the output of the two-step attention as \({\mathbf{h}} ^{'}=[{\mathbf{h}} _{i}^{'}]_{i=1}^{k}\).

$$\begin{aligned} {\mathbf{h}} _{i}^{'}=\sum _{j}\beta _{ij}{} {\mathbf{h}} _{j} \end{aligned}$$

(5)

Here, \({\mathbf{h} }_{i}^{'}\) is the context representation enhanced with knowledge representation.
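The two-step attention of Eqs. (2) to (5) can be sketched in NumPy as follows. This is an illustrative sketch rather than the authors' implementation: the function names are hypothetical, tokens are stored as rows, and the biases are treated as scalars because the bilinear score for each pair is a scalar (a constant scalar bias cancels inside the softmax).

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def knowledge_attention(h, r, W1, W2, b1=0.0, b2=0.0):
    """h: (k, d1) hidden states; r: (3, d2) relation vectors.

    W1: (d1, d2), W2: (d2, d1). Biases are scalars here (an assumption).
    """
    alpha = softmax(h @ W1 @ r.T + b1, axis=1)      # Eq. (2): (k, 3), weights over relations
    k_rep = alpha @ r                               # Eq. (3): (k, d2), per-token knowledge
    beta = softmax(k_rep @ W2 @ h.T + b2, axis=1)   # Eq. (4): (k, k), weights over tokens
    h_prime = beta @ h                              # Eq. (5): (k, d1), enhanced context
    return h_prime, alpha, beta


# Toy sizes chosen only for illustration.
rng = np.random.default_rng(0)
k, d1, d2 = 6, 8, 4
h = rng.normal(size=(k, d1))
r = rng.normal(size=(3, d2))
W1 = rng.normal(size=(d1, d2))
W2 = rng.normal(size=(d2, d1))
h_prime, alpha, beta = knowledge_attention(h, r, W1, W2)
```

Each row of `alpha` sums to one over the three relation vectors, and each row of `beta` sums to one over the \(k\) token positions, so \({\mathbf{h}} _{i}^{'}\) is a convex combination of the original hidden states.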

### Prediction layer

To obtain the final representation for prediction, the hidden representation \(\mathbf{h} _i\) and the knowledge-enhanced representation \(\mathbf{h} _i^{'}\) are first combined with a linear operation to obtain the weighted representation \(\mathbf{v} _i = \mathbf{W} _{h}{} \mathbf{h} _{i}+\mathbf{W} _{h^{'}}{} \mathbf{h} _{i}^{'}+\mathbf{b}\). Then, we concatenate the knowledge-enhanced representation \(\mathbf{h} _i^{'}\) with the weighted representation \(\mathbf{v} _i\) to obtain the input \(\mathbf{u} _i=[\mathbf{h} _i^{'}; \mathbf{v} _i]\) of the prediction layer. A feed-forward network (FFN) with RELU [21] activation, as used in existing work, is applied to \(\mathbf{u} _i\). Finally, the output is used to predict the start and end indexes of the answer. For a null answer, the start and end indexes are both set to zero when optimizing the objective function. Index zero corresponds to the start token ‘[CLS]’, which is not part of the context text, so it does not influence the optimization of the indexes of non-null answers.

$$\begin{aligned} FFN(\mathbf{u} _{i}, \mathbf{W} _{3}, \mathbf{b} _{3}, \mathbf{W} _{4}, \mathbf{b} _{4}) = RELU(\mathbf{u} _{i}{} \mathbf{W} _{3}+\mathbf{b} _{3})\mathbf{W} _{4}+\mathbf{b} _{4} \end{aligned}$$

(6)

Here, \(\mathbf{W} _3\), \(\mathbf{b} _3\), \(\mathbf{W} _{4}\) and \(\mathbf{b} _{4}\) are trainable parameters, with separate parameter sets (superscripts *s* and *e*) for the start and end predictions. Along the sequence dimension, the *start* probability distribution and the *end* probability distribution for each token \(\mathbf{s} _i\) are calculated as:

$$\begin{aligned} p_i^{s}= & {} \frac{exp(FFN(\mathbf{u} _{i}, \mathbf{W} ^s _3, \mathbf{b} ^s _3, \mathbf{W} ^s _4, \mathbf{b} ^s _4))}{\sum _{j}exp(FFN(\mathbf{u} _{j}, \mathbf{W} ^s _3, \mathbf{b} ^s _3, \mathbf{W} ^s _4, \mathbf{b} ^s _4))} \end{aligned}$$

(7)

$$\begin{aligned} p_i^{e}= & {} \frac{exp(FFN(\mathbf{u} _{i}, \mathbf{W} ^e _3, \mathbf{b} ^e _3, \mathbf{W} ^e _4, \mathbf{b} ^e _4))}{\sum _{j}exp(FFN(\mathbf{u} _{j}, \mathbf{W} ^e _3, \mathbf{b} ^e _3, \mathbf{W} ^e _4, \mathbf{b} ^e _4))} \end{aligned}$$

(8)
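A compact sketch of the prediction layer, covering the linear combination \(\mathbf{v} _i\), the concatenation \(\mathbf{u} _i\), the FFN of Eq. (6) and the softmax over positions of Eqs. (7) and (8), might look as follows. The row-vector layout, the FFN hidden size `d_ff` and the function names are assumptions made for illustration.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array of logits."""
    e = np.exp(x - x.max())
    return e / e.sum()


def ffn(u, W3, b3, W4, b4):
    """Eq. (6): RELU(u W3 + b3) W4 + b4, applied to every token row of u."""
    return np.maximum(u @ W3 + b3, 0.0) @ W4 + b4


def predict_distributions(h, h_prime, Wh, Whp, b, start_params, end_params):
    """Return start and end probability distributions over the k positions."""
    v = h @ Wh + h_prime @ Whp + b                 # weighted representation v_i
    u = np.concatenate([h_prime, v], axis=-1)      # u_i = [h'_i ; v_i], shape (k, 2*d1)
    start_logits = ffn(u, *start_params)[:, 0]     # one score per token
    end_logits = ffn(u, *end_params)[:, 0]
    return softmax(start_logits), softmax(end_logits)


def init_ffn(rng, d_in, d_ff):
    """Random FFN parameters (W3, b3, W4, b4) for the toy example."""
    return (rng.normal(size=(d_in, d_ff)), np.zeros(d_ff),
            rng.normal(size=(d_ff, 1)), np.zeros(1))


rng = np.random.default_rng(0)
k, d1, d_ff = 6, 8, 16                             # toy sizes, not from the paper
h = rng.normal(size=(k, d1))
h_prime = rng.normal(size=(k, d1))
Wh, Whp, b = rng.normal(size=(d1, d1)), rng.normal(size=(d1, d1)), np.zeros(d1)
p_start, p_end = predict_distributions(
    h, h_prime, Wh, Whp, b, init_ffn(rng, 2 * d1, d_ff), init_ffn(rng, 2 * d1, d_ff)
)
```

Because position 0 holds `[CLS]`, a null answer corresponds to both distributions peaking at index 0.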

After answer prediction, either a predicted disease text or a null answer is obtained. If the predicted disease text matches the gold disease name, a CID relation is detected between that disease and the chemical described in the corresponding question.

After relation extraction on the intra-sentential and inter-sentential data, the two sets of prediction results are merged. Because candidate instances are extracted for all mention pairs, we judge that an entity pair has a CID relation if the relation is detected in at least one of its instances. Since some documents may have no candidate CID relations after data preprocessing, we follow many other systems and apply a heuristic rule to find the likely CID pairs in them: all chemicals in the title are associated with all diseases in the abstract.
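The merging step and the fallback rule can be sketched as below. The tuple layout `(doc_id, chemical_id, disease_id, is_cid)`, the function names and the example identifiers are hypothetical and chosen only to make the logic concrete.

```python
def merge_predictions(intra, inter):
    """Union of entity pairs with at least one positive instance.

    Each input is an iterable of (doc_id, chemical_id, disease_id, is_cid)
    tuples, one per mention-level instance (hypothetical layout).
    """
    pairs = set()
    for doc_id, chem, dis, is_cid in list(intra) + list(inter):
        if is_cid:
            pairs.add((doc_id, chem, dis))
    return pairs


def title_abstract_fallback(doc_id, title_chemicals, abstract_diseases):
    """Heuristic for documents without candidates: pair every title chemical
    with every abstract disease."""
    return {(doc_id, c, d) for c in title_chemicals for d in abstract_diseases}


# Placeholder identifiers for illustration only.
intra = [("doc1", "C1", "D1", True), ("doc1", "C1", "D2", False)]
inter = [("doc1", "C1", "D2", True), ("doc1", "C2", "D1", False)]
merged = merge_predictions(intra, inter)
fallback = title_abstract_fallback("doc2", ["C3"], ["D4", "D5"])
```

In this example the pair (`C1`, `D2`) is kept because one of its two instances is positive, while (`C2`, `D1`) is dropped.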

### Objective function

To predict the start and end indexes of answer spans, the optimization objective is to maximize the conditional probability \(p(y_{s}, y_{e}|\mathbf{s} )\) of the start index \(y_{s}\) and the end index \(y_{e}\) given the input sequence \(\mathbf{s}\). At prediction time, the span indexed by (*i*, *j*) with maximum \(p_{i}^{s}p_{j}^{e}\) is chosen as the answer span. The loss is defined as the negative average of the log probabilities of the ground-truth start and end positions under the predicted distributions, where *N* is the number of examples:

$$\begin{aligned} Loss = -\frac{1}{N}\sum _{l=1}^{N}\frac{y_{s}log(p^{s})+y_{e}log(p^{e})}{2} \end{aligned}$$

(9)
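The loss of Eq. (9) and the span selection rule can be sketched as follows. Here \(y_{s}\) and \(y_{e}\) are read as one-hot indicators, so each term reduces to the log probability at the gold index. Restricting the search to spans with \(i \le j\) is an assumption of this sketch and is not stated in the text; the null span (0, 0) satisfies it.

```python
import numpy as np


def span_loss(p_start, p_end, y_start, y_end):
    """Eq. (9): negative mean of the averaged start/end log probabilities.

    p_start, p_end: (N, k) predicted distributions; y_start, y_end: (N,) gold indexes.
    """
    n = np.arange(len(y_start))
    log_ps = np.log(p_start[n, y_start])
    log_pe = np.log(p_end[n, y_end])
    return -np.mean((log_ps + log_pe) / 2.0)


def best_span(p_start, p_end):
    """Pick (i, j) maximizing p_start[i] * p_end[j], keeping only i <= j (assumption)."""
    scores = np.triu(np.outer(p_start, p_end))     # zero out spans that end before they start
    i, j = np.unravel_index(np.argmax(scores), scores.shape)
    return int(i), int(j)


# Toy distributions over three positions.
p_s = np.array([[0.1, 0.7, 0.2]])
p_e = np.array([[0.1, 0.2, 0.7]])
loss = span_loss(p_s, p_e, np.array([1]), np.array([2]))
span = best_span(p_s[0], p_e[0])
```

For these toy distributions the selected span is (1, 2), and the loss equals \(-\log 0.7\) because both gold positions receive probability 0.7.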