The purpose of this paper is to focus on the risk factors in EMRs and to predict whether an individual suffers from CVD using machine learning methods. The experiment is divided into three stages: preprocessing the dataset, identifying and extracting risk factors, and predicting CVD. In the data preprocessing stage, a few EMR texts contain missing and duplicate data, so we carried out data cleaning and interpolation. In the risk factor identification stage, we use named entity recognition technology, which has been widely adopted in industry and scientific research, to accurately and effectively identify and extract the risk factors in the EMRs together with their categories and time attributes. When we compare the recognition performance of CRF and BiLSTM-CRF, both perform well, but the latter performs better in our experiments. We attribute this to two aspects: on the one hand, the 12 risk factors are repeated many times in the EMRs; on the other hand, BiLSTM is good at capturing the contextual information of text sequences, which helps to identify entity boundaries. In the CVD prediction stage, we use the neural network model (RFAB) proposed in this paper. The main workflow described above is presented in Fig. 1.
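To make the preprocessing step concrete, the following is a minimal sketch of duplicate removal and interpolation on structured EMR fields, assuming the records are loaded into a pandas DataFrame; the column names and values are purely illustrative and do not correspond to the actual corpus.

```python
import pandas as pd

def clean_emr_table(df: pd.DataFrame) -> pd.DataFrame:
    df = df.drop_duplicates().copy()               # remove duplicate records
    numeric_cols = df.select_dtypes("number").columns
    # fill missing numeric values by interpolating between neighbouring records
    df[numeric_cols] = df[numeric_cols].interpolate(limit_direction="both")
    return df

records = pd.DataFrame({
    "patient_id": [1, 1, 2, 3],          # illustrative fields, not the real schema
    "sbp":        [150.0, 150.0, None, 128.0],
    "glucose":    [6.1, 6.1, 7.8, None],
})
print(clean_emr_table(records))
```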
Technical details of BiLSTM-CRF model
As shown in Fig. 2, BiLSTM-CRF identifies risk factors from EMRs with the BIO (Begin, Inside, Outside) annotation scheme [9]. The labels “HyC” and “HyD” in the figure both represent risk factors for hypertension; their temporal attributes indicate, respectively, that the condition has persisted with the patient (Continue) and that it occurred during the patient’s medical treatment (During). In the input layer, we determine the embedding of each input character by looking it up in a dictionary, expressed as \(Q=(q_{1},\ldots ,q_{k-3},\ldots ,q_{k})\). The character embeddings, pre-trained with the Skip-gram model [10], encode information about the characters before and after each position, that is, contextual information.
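The following illustrative snippet shows how a character sequence containing a hypertension mention could be labelled under the BIO scheme, and how character embeddings could be pre-trained with the Skip-gram setting of gensim's Word2Vec; the example sentence, labels, and hyperparameters are assumptions for illustration only.

```python
from gensim.models import Word2Vec

# Illustrative BIO labelling of a character sequence ("The patient has a
# history of hypertension"); the sentence and labels are made up, and "HyC"
# marks a hypertension mention with the "Continue" temporal attribute.
chars  = ["患", "者", "有", "高", "血", "压", "病", "史"]
labels = ["O",  "O",  "O",  "B-HyC", "I-HyC", "I-HyC", "O", "O"]

# Character embeddings pre-trained with the Skip-gram model (sg=1);
# in practice the corpus would contain all character sequences of the EMRs.
corpus = [chars]
char2vec = Word2Vec(sentences=corpus, vector_size=100, window=5,
                    min_count=1, sg=1)
print(char2vec.wv["高"].shape)   # (100,)
```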
The model identifies risk factors by predicting the label corresponding to each character. A sequence of length n is input to the model, and the embedding layer maps the characters one by one to vectors, i.e., \(X=(x_{1},\ldots ,x_{t},\ldots ,x_{n})\). The sequence is then fed to the BiLSTM layer for further encoding, where the forward and backward LSTMs compute the corresponding sequence representations \(\overrightarrow{h_{t}}\) and \(\overleftarrow{h_{t}}\) for each character t. As in the LSTM memory cell implemented by Lample et al. [11], the representation of character t carries both left and right contextual information, i.e., \(h_{t}=[\overrightarrow{h_{t}};\overleftarrow{h_{t}}]\).
Then, the feature values are centered around zero by the tanh activation function, which is used to calculate the confidence score of each label that character t may correspond to:
$$\begin{aligned} e_t=\tanh (W_eh_t), \end{aligned}$$
(1)
where the weight matrix \(W_{e}\) is the parameter to be learned in training.
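A minimal PyTorch sketch of this character encoder and of Eq. (1) is given below; the class name, vocabulary size, and layer dimensions are illustrative assumptions and are not taken from the paper.

```python
import torch
import torch.nn as nn

class CharEncoder(nn.Module):
    """Character embedding + BiLSTM + tanh scoring, as in Eq. (1)."""
    def __init__(self, vocab_size=3000, emb_dim=100, hidden=128, num_tags=25):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.bilstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.W_e = nn.Linear(2 * hidden, num_tags, bias=False)  # weight matrix W_e of Eq. (1)

    def forward(self, char_ids):                 # char_ids: (batch, seq_len)
        x = self.embed(char_ids)                 # (batch, seq_len, emb_dim)
        h, _ = self.bilstm(x)                    # h_t = [h_t forward ; h_t backward]
        e = torch.tanh(self.W_e(h))              # confidence scores e_t, Eq. (1)
        return e                                 # (batch, seq_len, num_tags)

scores = CharEncoder()(torch.randint(0, 3000, (1, 12)))
print(scores.shape)    # torch.Size([1, 12, 25])
```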
Finally, the feature information is decoded by the CRF layer, and the best label for each character is predicted. The tth column of the score matrix P output by the network corresponds to the vector \(e_{t}\) calculated by Eq. (1), where the element \(P_{i,j}\) is the score of the jth tag for the ith character in the sequence. We introduce a transition matrix T that allows previous annotation information to be used when tagging the current position; \(T_{y_i, y_{i+1}}\) represents the score of transitioning from tag \(y_i\) to tag \(y_{i+1}\). The optimal tag sequence \(y=(y_{1},\ldots ,y_{t},\ldots ,y_{n})\) is obtained by maximizing Eq. (2):
$$\begin{aligned} s(X,y) = \sum _{i=0}^{n}\left( T_{y_i,y_{i+1}}+P_{i,y_i}\right) , \end{aligned}$$
(2)
where the transition matrix T is learned as a parameter of the model during training. We then use the softmax function to obtain the conditional probability of path y by normalizing the score above over all possible tag paths \({\tilde{y}}\):
$$\begin{aligned} p(y|X) = \frac{e^{s(X,y)}}{\sum _{{\tilde{y}}}e^{s(X,{\tilde{y}})}}. \end{aligned}$$
(3)
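For clarity, the following toy example enumerates all tag paths of a short sequence and evaluates Eqs. (2) and (3) with NumPy; the emission matrix P and transition matrix T contain made-up numbers, and start/stop transitions are omitted for brevity.

```python
import numpy as np
from itertools import product

P = np.array([[1.0, 0.2],      # P[i, j]: score of tag j for character i
              [0.3, 0.9],
              [0.8, 0.1]])
T = np.array([[0.5, 0.1],      # T[a, b]: score of moving from tag a to tag b
              [0.2, 0.4]])

def path_score(tags):          # s(X, y), without start/stop transitions
    s = sum(P[i, t] for i, t in enumerate(tags))
    s += sum(T[tags[i], tags[i + 1]] for i in range(len(tags) - 1))
    return s

paths = list(product(range(2), repeat=3))          # all possible tag paths
scores = np.array([path_score(p) for p in paths])
probs = np.exp(scores) / np.exp(scores).sum()      # Eq. (3): softmax over paths
best = paths[int(np.argmax(scores))]
print(best, probs.max())
```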
In the training process, the model maximizes the log-probability of the correct label sequence; at prediction time, the best label path, i.e., the one with the highest score, is obtained from Eq. (4):
$$\begin{aligned} \mathop {\arg \max }\limits _{{\tilde{y}}}\, s(X,{\tilde{y}}). \end{aligned}$$
(4)
The Viterbi algorithm [12] is utilized as the dynamic programming algorithm to obtain the optimal tagging path.
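A compact sketch of such a Viterbi decoder for the scoring scheme of Eq. (2) is shown below; the emission scores P and transition scores T are assumed to be given, and start/stop transitions are again omitted.

```python
import numpy as np

def viterbi(P, T):
    n, k = P.shape                       # n characters, k tags
    delta = P[0].copy()                  # best score ending in each tag
    backptr = np.zeros((n, k), dtype=int)
    for i in range(1, n):
        cand = delta[:, None] + T + P[i][None, :]   # (prev_tag, cur_tag) scores
        backptr[i] = cand.argmax(axis=0)            # best previous tag per tag
        delta = cand.max(axis=0)
    tags = [int(delta.argmax())]
    for i in range(n - 1, 0, -1):                   # trace the best path back
        tags.append(int(backptr[i, tags[-1]]))
    return tags[::-1]

P = np.array([[1.0, 0.2], [0.3, 0.9], [0.8, 0.1]])
T = np.array([[0.5, 0.1], [0.2, 0.4]])
print(viterbi(P, T))    # [0, 0, 0] for the scores above
```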
Technical details of RFAB model
As shown in Fig. 3, the purpose of our work is to comprehensively model EMR texts by combining the text content with the risk factors they contain, thereby realizing the CVD prediction task. RFAB consists of four parts: the input layer, the embedding layer, the representation layer, and the prediction layer. The details are as follows.
Input Layer mainly tackles the feature acquisition of the input EMR text and the input risk factors. A Chinese raw text T contains m characters, i.e., \(C=\{c_1,c_2,\ldots ,c_m\}\), where each character \(c_i\left( 1\le i\le m\right)\) is an independent item. Meanwhile, T contains n risk factor words \(W=\{w_1,w_2,\ldots ,w_n\}\), denoted \(T^\prime\). Since a word can often be divided into several characters, it is obvious that \(n\le m\). The character sequence C and the risk factor sequence W are mapped to embedding sequences \(E^C\) and \(E^F\) of the same lengths, i.e., \(\left| C\right| =\left| E^C\right|\) and \(\left| W\right| =\left| E^F\right|\).
Embedding Layer aims to represent each item from the input layer in a continuous space. It accepts the two groups of features from the input layer and outputs two embedding matrices (i.e., \(E^C\) and \(E^F\)) by looking up an embedding dictionary. For risk factors, we sum the character-level embedding vectors matched by each word and then average them to obtain the embedding vector of that risk factor. As mentioned before, the lengths of the two feature sequences satisfy \(\left| C\right| =\left| E^C\right|\) and \(\left| W\right| =\left| E^F\right|\). To simplify the problem, we set the vector dimensions of both to the same size D. Thus, an EMR text can be represented by two vector sequences, i.e., \(E^C=\{e_1^c,e_2^c,\ldots ,e_m^c\}\) and \(E^F=\{e_1^f,e_2^f,\ldots ,e_n^f\}\). These two vector sequences are exactly two embedding matrices, i.e., \(E^C\in R^{m\times D}\) and \(E^F\in R^{n\times D}\).
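The averaging step for risk factor embeddings can be illustrated as follows; the toy character-embedding table and the three-dimensional vectors are assumptions for illustration only.

```python
import numpy as np

# Toy character-embedding lookup table (3-dimensional vectors for brevity).
char_emb = {"高": np.array([0.2, 0.1, 0.4]),
            "血": np.array([0.3, 0.5, 0.0]),
            "压": np.array([0.1, 0.2, 0.6])}

def risk_factor_embedding(word):
    """Average the embeddings of a word's characters to obtain its vector."""
    return np.mean([char_emb[c] for c in word], axis=0)

e_f = risk_factor_embedding("高血压")   # "hypertension"
print(e_f)                              # [0.2        0.26666667 0.33333333]
```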
Representation Layer aims to generate a comprehensive representation of the input EMR text by combining the context and risk factor information. Corresponding to the property of character sharing, the recurrent structure of LSTM naturally processes words and characters one by one and memorizes the characters or words that have already appeared [13]. In view of this advantage, we utilize the LSTM implementation proposed in [14] and apply the bidirectional setting (i.e., BiLSTM) to capture both the forward and backward context information. Formally, given a specific feature embedding sequence of a sentence \(s=\{x_1,x_2,\ldots ,x_N\}\), the hidden vectors of a BiLSTM are calculated as follows:
$$\begin{aligned}&\overrightarrow{h_t}\ =LSTM\left( \overrightarrow{h_{t-1}},x_t\right) , \\&\overleftarrow{h_t}\ =LSTM\left( \overleftarrow{h_{t-1}},x_t\right) , \\&y_t\ =\left[ \overrightarrow{h_t},\overleftarrow{h_t}\right] , \end{aligned}$$
(5)
where \(\overrightarrow{h_t}\) and \(\overleftarrow{h_t}\) are the forward and backward hidden vectors, respectively, at the tth step of the BiLSTM, and \(y_t\) is the hidden output at the tth step, which is the concatenation of \(\overrightarrow{h_t}\) and \(\overleftarrow{h_t}\).
As shown in Fig. 3, there are two serialized BiLSTMs in the representation layer (i.e., \(BiLSTM^c+BiLSTM^f\)). In \(BiLSTM^c\), the initial hidden states are set to zero. Meanwhile, \(BiLSTM^f\) receives the last hidden states of \(BiLSTM^c\) as input, which allows the character context information to be further combined with the risk factor information.
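A minimal PyTorch sketch of this serialization is given below, with illustrative dimensions; passing the final hidden and cell states of \(BiLSTM^c\) as the initial states of \(BiLSTM^f\) is one straightforward way to realize the connection described above, and is an assumption of this sketch rather than a detail stated in the paper.

```python
import torch
import torch.nn as nn

D, H = 64, 64
bilstm_c = nn.LSTM(D, H, batch_first=True, bidirectional=True)   # BiLSTM^c
bilstm_f = nn.LSTM(D, H, batch_first=True, bidirectional=True)   # BiLSTM^f

E_c = torch.randn(1, 30, D)    # character embeddings E^C (m = 30)
E_f = torch.randn(1, 5, D)     # risk-factor embeddings E^F (n = 5)

Y_c, (h_c, c_c) = bilstm_c(E_c)        # character context, zero initial states
Y_f, _ = bilstm_f(E_f, (h_c, c_c))     # BiLSTM^f starts from BiLSTM^c's final states
print(Y_c.shape, Y_f.shape)            # (1, 30, 128) (1, 5, 128)
```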
Additionally, to assign larger weights to important risk factors and thus model the risk-factor-sharing property when integrating information, we design an attention mechanism that captures the interrelations between risk factors and the corresponding specific EMR content. Each time \(BiLSTM^f\) receives the embedding vector of a risk factor (i.e., \(e_i^f\)), each \(y_\epsilon ^c\in Y^c=\{y_1^c,y_2^c,\ldots ,y_m^c\}\) conducts a dot product operation with \(e_i^f\). Thus, the attention vector \(\alpha ^\prime\) for \(e_i^f\) is obtained as follows:
$$\begin{aligned} \alpha ^\prime =\left[ \alpha _1^\prime ,\ldots ,\alpha _\epsilon ^\prime ,\ldots ,\alpha _m^\prime \right] ,\quad \alpha _\epsilon ^\prime =f\left( y_\epsilon ^c,e_i^f\right) ,\quad 1\le \epsilon \le m,\ 1\le i\le n, \end{aligned}$$
(6)
where \(\alpha _\epsilon ^\prime\) denotes the \(\epsilon\)th attention weight for the risk factor, and \(f\left( a,b\right)\) denotes the dot product function. Before the weighted sum operation, we normalize these weights with the softmax function, i.e., \(\alpha _\epsilon\) is obtained as follows:
$$\begin{aligned} \alpha _\epsilon =\frac{\exp \left( \alpha _\epsilon ^\prime \right) }{\sum _{j=1}^{m}\exp \left( \alpha _j^\prime \right) },\quad \text {where}\ \sum _{\epsilon =1}^{m}\alpha _\epsilon =1, \end{aligned}$$
(7)
then the embedding vector of \(e_i^f\) is modified as:
$$\begin{aligned} \widetilde{e^f}=\sum _{\epsilon =1}^{m}\alpha _\epsilon y_\epsilon ^c, \end{aligned}$$
(8)
where \(y_\epsilon ^c\) denotes the \(\epsilon\)th item of \(Y^c\). After the attention operation (i.e., \(att_i\) in Fig. 3), \(\widetilde{e^f}\) has fused the weight information of the risk factors. Then, \(BiLSTM^f\) further learns the contextual information of \(\widetilde{e^f}\) through the calculations described in Eq. (5).
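The attention computation of Eqs. (6)–(8) for a single risk factor can be sketched as follows; the tensor sizes are illustrative, and the character hidden states and the risk factor embedding are assumed here to share the same dimension so that the dot product is defined.

```python
import torch

m, H2 = 30, 128
Y_c = torch.randn(m, H2)            # y_1^c ... y_m^c from BiLSTM^c
e_f = torch.randn(H2)               # embedding of one risk factor, e_i^f

alpha_prime = Y_c @ e_f                             # Eq. (6): dot products f(y_eps^c, e_i^f)
alpha = torch.softmax(alpha_prime, dim=0)           # Eq. (7): normalized weights, sum to 1
e_f_tilde = (alpha.unsqueeze(1) * Y_c).sum(dim=0)   # Eq. (8): weighted sum over Y^c
print(alpha.sum().item(), e_f_tilde.shape)          # 1.0 torch.Size([128])
```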
Prediction Layer We take the final hidden state of \(BiLSTM^f\) (i.e., \(y_o\)) as the final output and redefine it as \(Z\in R^D\). Here, Z is exactly the ultimate representation of the input EMR text T. After that, we feed Z into a fully-connected neural network to obtain an output vector \(O\in R^K\) (K is the number of classes, i.e., \(K=\left| U\right|\)):
$$\begin{aligned} O=sigmoid\left( Z\times W\right) , \end{aligned}$$
(9)
where \(W\in R^{D\times K}\) is the weight matrix for dimension transformation, and \(sigmoid\left( \cdot \right)\) is a non-linear activation function. Finally, we apply a softmax layer to map each value in O to a conditional probability and obtain the prediction as follows:
$$\begin{aligned} P=argmax\left( softmax\left( O\right) \right) . \end{aligned}$$
(10)
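A minimal sketch of the prediction layer, Eqs. (9) and (10), is given below with illustrative values of D and K.

```python
import torch
import torch.nn as nn

D, K = 128, 2
Z = torch.randn(1, D)                        # final representation Z of the EMR text
W = nn.Linear(D, K, bias=False)              # weight matrix W in Eq. (9)

O = torch.sigmoid(W(Z))                      # Eq. (9): output vector O
probs = torch.softmax(O, dim=-1)             # conditional probabilities
pred = torch.argmax(probs, dim=-1)           # Eq. (10): predicted class
print(probs, pred)
```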
Model Training Since the task we address is a prediction task, we follow the work in [15] and apply the cross-entropy loss function to train our model; the goal is to minimize the following loss:
$$\begin{aligned} Loss=-\sum _{T\in Corpus}\sum _{i=1}^{K}{p_i\left( T\right) \log p_i\left( T\right) }, \end{aligned}$$
(11)
where T is the input EMR text, Corpus denotes the training corpus, and K is the number of classes. In the training process, we apply Adagrad as the optimizer to update the parameters of RFAB, including W and all parameters (weights and biases) in each BiLSTM. To avoid overfitting, we apply the dropout mechanism at the end of the embedding layer.
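The training setup described above can be sketched as follows; the stand-in model, the batch, and the hyperparameters (dropout rate, learning rate, number of epochs) are illustrative assumptions and do not reproduce the full RFAB architecture.

```python
import torch
import torch.nn as nn

D, K = 128, 2
# Stand-in for RFAB: dropout applied to the embedded representation, then a classifier.
model = nn.Sequential(nn.Dropout(p=0.5), nn.Linear(D, K))
optimizer = torch.optim.Adagrad(model.parameters(), lr=0.01)   # Adagrad optimizer
criterion = nn.CrossEntropyLoss()                              # cross-entropy loss, Eq. (11)

Z_batch = torch.randn(16, D)                 # placeholder text representations
labels = torch.randint(0, K, (16,))          # placeholder CVD labels

for epoch in range(5):
    optimizer.zero_grad()
    loss = criterion(model(Z_batch), labels)
    loss.backward()
    optimizer.step()
```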