Assistant diagnosis with Chinese electronic medical records based on CNN and BiLSTM with phrase-level and word-level attentions

Background Inferring diseases related to the patient’s electronic medical records (EMRs) is of great significance for assisting doctor diagnosis. Several recent prediction methods have shown that deep learning-based methods can learn the deep and complex information contained in EMRs. However, they do not consider the discriminative contributions of different phrases and words. Moreover, local information and context information of EMRs should be deeply integrated. Results A new method based on the fusion of a convolutional neural network (CNN) and bidirectional long short-term memory (BiLSTM) with attention mechanisms is proposed for predicting a disease related to a given EMR, and it is referred to as FCNBLA. FCNBLA deeply integrates local information, context information of the word sequence and more informative phrases and words. A novel framework based on deep learning is developed to learn the local representation, the context representation and the combination representation. The left side of the framework is constructed based on CNN to learn the local representation of adjacent words. The right side of the framework based on BiLSTM focuses on learning the context representation of the word sequence. Not all phrases and words contribute equally to the representation of an EMR meaning. Therefore, we establish the attention mechanisms at the phrase level and word level, and the middle module of the framework learns the combination representation of the enhanced phrases and words. The macro average f-score and accuracy of FCNBLA achieved 91.29 and 92.78%, respectively. Conclusion The experimental results indicate that FCNBLA yields superior performance compared with several state-of-the-art methods. The attention mechanisms and combination representations are also confirmed to be helpful for improving FCNBLA’s prediction performance. Our method is helpful for assisting doctors in diagnosing diseases in patients.


Background
Electronic medical records (EMRs), which record patient phenotypes and treatments, are an underutilized data source. Extracting useful information and predicting diseases using EMRs to assist doctors in disease determination and timely treatment of patients is one of the goals of intelligent medical construction [1][2][3], which can not only help us better understand the clinical manifestations of various diseases [4][5][6] but also reduce medical errors to improve the health of patients and improve the work efficiency of doctors [7][8][9].
The previous methods for predicting diseases related to information in EMRs can be roughly grouped into three categories. The methods in the first category are rule-based, which can also be called expert systems. Expert systems are designed to address problems by utilizing the knowledge and experience of human experts [10]. They perform rule matching on each input EMR to select the disease that best fits these rules to implement the corresponding diagnosis for patients. These methods have achieved great success in the field of medical aided diagnosis [11][12][13]. However, as time goes on, there have been increasingly more cases, and the data are no longer relatively structured and constrained but tend to be multistructured and unstructured. Therefore, rulemaking has become infeasible.
Methods in the second category construct shallow models based on machine learning. Such methods have achieved considerable success in the fields of text classification [14][15][16], legal prediction [17][18][19] and intelligent medical systems [20][21][22]. For example, some common techniques are utilized on public medical datasets for predicting diseases, such as support vector machines [23][24][25], random forests [26,27], and logistic regression [28], and they achieve good predictive results. However, these methods have certain limitations in feature extraction. They usually need to artificially design certain features as input for machine learning and cannot capture the deep and complex internal information of data.
The methods in the third category are based on deep learning. In recent years, deep learning has achieved the most advanced effects on various natural language processing tasks, such as machine translation [29,30], sentiment analysis [31,32], speech recognition [33,34] and language modeling [35][36][37]. Moreover, in the medical field, experiments have proven that deep learning methods outperform state-of-the-art traditional predictive models in all cases with electronic health record (EHR) data. For example, Cheng et al. [38] proposed a prediction method based on convolutional neural network (CNN) for the risk prediction of EHR. Nguyen et al. [39] introduced a CNN model for predicting the probability of readmission. Choi et al. [40] introduced a shallow recurrent neural network (RNN) model to predict diagnoses and medications. Li et al. [41] provided a transformer-based model to predict diseases in the future. Due to the advantages of deep learning in cases of using EHR data, some deep learning-based models were applied to the diagnosis of Chinese electronic medical records. CNN have strong capabilities in feature extraction and expression [42,43]. For example, methods based on CNN were proposed by Yang et al. and Chen et al. to predict diseases [10,44]. They focused on extracting local information from adjacent words. However, these methods failed to consider the context information of the word sequence. Usama et al. and Hao et al. proposed prediction methods based on a recurrent convolutional neural network (RCNN) [45,46], which learns the local information of the word context. However, they did not fully exploit the whole context and local information. In addition, the previous methods do not discriminate the different contributions of different phrases and words. In our study, a novel method based on CNN and bidirectional long short-term memory (BiLSTM) with attention mechanisms is proposed for obtaining the latent representations of EMRs, which we refer to as FCNBLA (Fig. 1). FCNBLA fully integrates local information formed by several adjacent words, context information of the whole sentence, and enhanced phrase and word information. Figure 1a is dedicated to feature extraction from adjacent words of an EMR to obtain their local representation. In Fig. 1b, the context representation is learned from the whole EMR based on BiLSTM. Fig. 1 Schematic diagram for predicting diseases related to EMRS. a Extracted local information from several adjacent words. b Extracted context information for words in EMRs. c Establish attention mechanisms at the word level and phrase level and deeply integrate local and context information to obtain combination information. d Fully learn local information, context information and combination information In Fig. 1c, each phrase and word are assigned different weights by applying attention mechanisms, which may discriminate their different contributions for predicting diseases related to EMRs. The experimental results indicate that FCNBLA outperforms several state-of-the-art methods for predicting diseases.

Datasets for disease prediction related to EMRs
The EMR dataset we use comes from previous work on disease prediction [10]. The original 18,625 EMRs were originally collected from Huangshi Central Hospital in China. The dataset contains the 10 most common diseases: diabetes, hypertension, chronic obstructive pulmonary disease (COPD), arrhythmia, asthma, gastritis, gastric polyps, gout, gastric ulcers and urinary tract infection (UTI). Each EMR contains 18 items: initial diagnosis on admission, chief complaint, history of surgery, vital signs, specialist condition, general condition, allergic history, nutritional status, suicidal tendency, specialist examination, history of surgical trauma, complications, current medical history, fertility history, auxiliary examination, personal history, past medical history, and family history. Among them, initial diagnosis on admission is a disease related to an EMR, and the remaining 17 items record the patient's condition. However, 24 EMRs only included the initial diagnosis on admission but did not include any of the remaining 17 items, so we removed them. We used the remaining 18,601 EMRs as our experimental data.
We selected 70% of the 18,601 EMRs as the training set to train the model, selected 10% as the validation set to adjust the model parameters, and selected 20% to test performance of the model. The distributions of the training, validation and testing sets are consistent with the original data distributions of 10 diseases. For the 10 diseases, their training, validation, and testing data distributions are shown in Table 1.

Disease prediction model
In this section, we describe our prediction model for learning the latent representations of EMRs and predicting diseases related to EMRs. Figure 2 shows the overall architecture of the model, which involves three major network modules. The left side is the convolutional module, which learns the local representation of a given EMR. The BiLSTM module on the right side learns the context representation of the EMR. The middle part is the fusion of local information and context information of the EMR, and its fusion representation is obtained. For disease prediction, we design a combination strategy to estimate the final association score between a disease and the EMR.

Word embedding layer
We use word embeddings as a representation of each EMR in the input layer. The word embedding layer can be simply understood as a look-up operation; that is, it reads a one-hot vector, e t ∈ R |V| , for a word and maps it to a dense vector of d dimensions, x t = (x 1 , x 2 , …, x d ) as an input of the disease prediction model. The weight matrix of the word embeddings is H ∈ R d × | V| , which is randomly initialized. We fine-tune the initial word embeddings, modifying them during gradient updates of the neural network model by backpropagating gradients. We have the following formula: where V denotes a series of words and |V| is the size of the vocabulary.

Convolutional module on the left
The CNN proposed by Lecun et al. [47] can automatically learn feature representations. The CNN architecture is composed of three different layers: the convolutional layer, the pooling layer and the fully connected layer, as shown on the left side of Fig. 2.
An EMR consisting of T words to be classified is fed into the word embedding layer, T words are converted into vectors, and then an embedding matrix I ∈ R T × d is formed as the input of the CNN. The convolutional layer and pooling layer are the core of the CNN. The CNN used in our framework consists of a convolutional layer followed by a max pooling layer. For the convolutional layer, we use 3 filters with different heights to Fig. 2 The overall framework of the model for learning the potential representation of EMRs. The left of the framework is the CNN module, the right is the BiLSTM module, and the middle is the attention module slide across I and there are 50 filters for each height. Assume the height of a filter is k, which means the filter operates on the adjacent k words and the width of each filter is the same as the dimension of each input word embedding matrix and the outputs of the convolutional layer are feature maps.
The pooling layer may reduce the parameters of the neural network while maintaining the attributes of the word sequence so that the model can be effectively prevented from overfitting [10]. The pooling operation focuses on computing the max or average of the local regions. In this paper, we use the max pooling operation for each feature Z. After the pooling operation calculation is completed, all the extracted features are concatenated to form a local representation z C of an EMR.
BiLSTM module on the right LSTM was proposed by Hochreiter et al. [48] to solve the gradient vanishing/exploding problem of RNN. However, LSTM can only obtain information from past words. For the task of determining the disease that an EMR is related to, it is very useful to obtain the past and future context information because each word of an EMR is semantically related to other words. The BiLSTM proposed by Dyer et al. [49] extended the unidirectional LSTM by introducing a second hidden layer, and the connections between hidden layers flow in reverse chronological order. Therefore, BiLSTM can be used to capture context information of an EMR. As shown in Fig. 2, on the right side of the framework, the BiLSTM contains two subnetworks: the forward LSTM is used for obtaining the forward sequence context h ! t , and the backward LSTM obtains the backward sequence context h t . The final hidden state h t of each word is the concatenation of h ! t and h t .

Attention module on the middle
In our model, the attention module is used to learn which words or phrases are more important for the representation of an EMR. Therefore, the module consists of the attention mechanism at the phrase level and the one at the word level.
Attention at the phrase level Z obtained by the left convolutional module is composed of N (1 ≤ N ≤ T − k + 1) rows. We call each row of Z a phrase vector, which contains the convolution results from the j filters performing convolution operations on a sequence of k word embeddings. Z i is the i-th row of Z. Different phrases usually have different contributions to the representation of the EMR. Thus, we establish the attention mechanism for each phrase vector Z i to generate the final attention representation. Z i is assigned an attention weight β i , and β i is defined as follows: where W r is a weight matrix, b r is a bias vector, and u p is a phrase-level context vector. v i is the feature representation of Z i , which is obtained by feeding Z i into a one-layer multilayer perceptron (MLP). β i is a standardized importance weight of Z i and N is the number of rows of the feature map Z obtained by the convolutional layer. The phrase context vector u p is randomly initialized and updated during the training process. We aggregate the representations of those informative phrases to form the enhanced local phrase information of an EMR, which is represented as follows: Attention at the word level Different words also contribute differently to the representation of an EMR. Therefore, we establish a word level on the hidden state h t (1 ≤ t ≤ T) to generate the final attention representation. The attention weight at the word level is given as follows: where W c is a weight matrix, b c indicates a bias vector and u w is a word-level context vector. u t is a hidden representation of h t and α t is a normalized attention weight of h t . The important context information of the whole sentence is represented as l w , MLP-based module CNN is based on phrase-level attention, which learns the enhanced local phrase information of the EMR, and BiLSTM based on word-level attention learns enhanced context information of the entire EMR. It is necessary to better integrate the two pieces of information, so an MLP-based integration module is established. The MLP module consists of the left and right branches. The left branch is the enhanced local phrase representation l r , the right branch is the enhanced context representation l w , and l F is the concatenation of l r and l w and is defined as follows: where [.,.] indicates the concatenation operation. l F goes through a one-layer MLP to obtain a combination representation, p F . The fully connected layer is applied to further fuse the features within p F to obtain the representation of the middle side, s F .

Combination strategy
As shown in Fig. 2, the left and right sides of the framework obtain more detailed features, which we call low-level features. The middle part is based on attention, which learns high-level features. We designed a combination strategy to obtain corresponding scores from different emphases. For the concatenation of low-level local features z C of the left side and high-level combination features p F of the middle part, the emphasis is placed on learning the local information of an EMR, s C is obtained after p C goes through the fully connected layer, and s C contains local information and context information enhanced by the phrase-level and word-level attention mechanisms. Lower-level context features z B of the right side and high-level features p F of the middle are concatenated, and the emphasis is placed on learning the context information of an EMR, p B also goes through a fully connected layer and outputs s B which contains the context information of an EMR and enhanced local information, and its dimension is the same as the number of disease labels. f is the final representation of an EMR, and it is a weighted sum of s C , s B , and s F . It is defined as follows: where α, β and γ are used to control the contributions of s C , s B and s F , the values of β and γ are calculated based on one half of 1 − α , and α is a hyperparameter. f is inputted into a softmax layer to obtain p, where p is a prediction probability distribution of C disease classes (C = 10). p i represents the probability that an EMR is related to the i-th disease. In our model, the cross-entropy loss between the ground truth distribution of disease labels and the estimated probability distribution p is calculated as follows: where g ∈ R C is a vector that contains the true classification labels. T represents the training sample set, and C is the number of diseases.

Evaluation metrics of the model
In general, we use accuracy when evaluating the performance of the classifier. Accuracy is defined as the rate of the number of samples correctly classified by the classifier among the total number of samples for a given test dataset. In terms of a specific disease, such as diabetes, an EMR with a label, diabetes, is a positive example. An EMR with any other disease labels is regarded as a negative example. Accuracy alone is not sufficient to measure the performance of a classifier. As shown in Fig. 3, in the dataset, urinary tract infections is associated with 1.6% of the EMRs, and diabetes is associated with 30.3% of the EMRs. There is an imbalance among the EMRs associated with one disease and those associated with another disease. Figure 3a shows the number of EMRs related to each disease, and Fig. 3b is the proportion of EMRs related to a specific disease among all the EMRs. For such an imbalance problem, the macro-average is also used to evaluate the performance of the model.
The macro-average calculates three values, Precision macro, i , Recall macro, i and f macro,i for each disease, and averages f macro,i values of all the diseases. Precision macro, i is the rate of the correctly identified positive samples (EMRs) of the i-th disease (called d i ) among the samples that are retrieved. It is calculated as follows: where TP macro is the number of successfully identified positive samples about d i , and FP macro is the number of samples that are misidentified as d i . Recall macro, i is the proportion of the d i -related positive samples among all samples. It is defined as follows: where FN macro is the number of misidentified d i − related samples. f macro, i is the Fscore value of d i , and it is the harmonic average of Precision macro, i and Recall macro, i ; we obtain Finally, we calculate the average of all f macro, i (1 ≤ i ≤ C) and obtain F-score macro ¼ where C represents the number of diseases.

Baselines
To evaluate the performance of the proposed method, FCNBLA, we compare it with several state-of-the-art methods of disease prediction. We describe them in detail as follows.

SVM-TFIDF
TF-IDF is a commonly used weighting technique for information retrieval and data mining. This baseline model based on TF-IDF extracted key information and formed the representations of EMRs. SVM is used to classify and predict the disease related to a specific EMR [23].

CNN
In the word embedding layer, this method maps each word of an EMR into a word embedding, and all word embeddings form an embedding matrix. In the convolutional layer, the matrix is scanned with different filters to obtain different local representations. After max pooling is completed, the extracted multiple representations are concatenated end to end. Finally, the fully connected layer and softmax layer are used to obtain the probability that the EMR is associated with a disease [10].

RCNN
This method differs from the traditional CNN, and it first applies a bi-directional recurrent structure to capture the contextual information to the greatest extent possible when learning word representations. Second, the max pooling layer is used to form a more effective semantic representation. The representation is utilized to predict the disease related to an EMR [45].

BiLSTM
To use the context information between words in the sentence, we also established a baseline method based on BiLSTM. Each word in an EMR is mapped into a word embedding through the word embedding layer, the word sequence is inputted into BiLSTM to obtain the hidden representation of any word, and the association probability is obtained. We compared our method, FCNBLA, with the baseline.

Parameter setting
Word embeddings that are inputted into the convolutional layer are the same as those that are fed into the BiLSTM layer. Our word embedding is initialized with uniform , where we set d = 300. In the convolutional module, we use three different filter heights k ∈ [2, 3, 4]. The hidden layer dimension of the LSTM is 200, and the BiLSTM eventually outputs a 400-dimensional sentence representation. The Adam optimization algorithm is used to update the parameters, and the learning rate is set to 0.001. We apply a dropout strategy to the embedding layers of CNN and BiLSTM; the dropout rate is 0.2, and the batch size is 16. The value of α is 0.3, early stopping is adopted, and its value is set to 20 and in training, we used 100 epochs. For the support vector machine (SVM) method, the term frequency-inverse document frequency (TF-IDF) is used to extract features from EMRs. The document frequency is set to 5, which means that terms that appear in fewer than 5 documents are ignored. The value of n-gram ranges between 1 and 3. For the CNN, each word is also mapped to a 300-dimensional dense vector, which is randomly initialized. The heights of the filters are 4, 5, and 6, and each height has 128 filters. To ensure the fairness of the experiment, we also use randomly initialized word embeddings for RCNN, and the hidden layer size is 100. For the competing model BiLSTM, the hidden layer dimension of LSTM is set to 150. The learning rate of all competing models is 0.001, and their epochs are 100. Our implementation uses PyTorch and Python 3.6 to train and optimize the neural networks, and we use GPU cards (Nvidia GeForce GTX 1080) to speed up the model training process.

Result comparisons with other methods
As shown in Table 2, we can see that our method achieves the best effect on each evaluation method. On the test set, our method achieves 92.78% accuracy. FCNBLA performs best in terms of macro-average results. It achieves the highest precision macro (92.31%), Recall macro (90.46) and Fscore macro (91.29%), and its Fscore macro is 3.27, 2.37, 0.75 and 1.19% higher than SVM-TFIDF, CNN, RCNN and BiLSTM, respectively. The performance of SVM-TFIDF is worse than that of the other methods. A main reason is that SVM-TFIDF is a shallow model, which fails to deeply learn the complex feature representations of EMRs. CNN only focuses on local information contained by several words, which makes its Fscore macro lower than BiLSTM. RCNN is the secondbest performing method. This means that both context information and local information are very important for the association between EMRs and diseases. BiLSTM is slightly lower than RCNN because it only learns the context information formed by word sequences. As shown in Table 3, 10 diseases are listed on the left side in descending order of data volume (the specific quantity of data for each disease is shown in Table 1). We list the macro-average F-score value corresponding to each disease. FCNBLA achieved the highest F-score value in 8 of the 10 diseases. In terms of the diseases with large quantities of data, FCNBLA shows a slight improvement in performance compared to other baseline models, such as diabetes, COPD, and arrhythmia, which improve slightly, by approximately 0.1 to 0.5%. However, there are significant improvements in the diseases with fewer data, such as UTI, gastritis, and gastric polyps, which improve by 2.17, 2.19, and 1.08%, respectively, compared to the best baseline model. RCNN performs the best for the disease hypertension, and its' Fscore macro is only 0.13% higher than our model, it indicates RCNN is just slightly better than our model for the disease. For the disease asthma, the Fscore macro of RCNN is 0.97% higher than our model. We calculated the proportion of the number of EMRs in the corresponding word number range for each disease among the total number of EMRs for that disease and listed it in the supplementary table ST1. We found that EMRs with more than 500 words accounts for 72.77% of the total EMRs for asthma, while among other diseases, the highest proportion of EMRs with more than 500 words is 31.03%. It shows that RCNN performs better than our method and the other compared methods for the EMRs with more than 500 words. The primary reason is that RCNN uses the context information of left and right sides of a word to enhance the representation of the word, and the less information is lost during the process of learning extremely long text.

Discussion
Effect of attention at the phrase level and the word level To validate the effect of phrase-level attention and that of word-level attention, we also implemented an instance of FCNBLA, which only has an attention mechanism at the phrase level (FCNA). Similarly, an instance that has only attention at the word level (FBLA) and another instance that has no attention (FNOA) are constructed. As shown in Table 4, Fscore macro values of FCNA (89.49%) and FBLA (89.72%) are 0.59 and 0.82% higher than FNOA, respectively. Compared with FNOA, their accuracy is increased by 0.19 and 0.67%, respectively. This result indicates that establishing both the attention phase level and the word level is helpful for improving the performance of disease prediction. Phase-level attention is exploited to enhance the local information, and word-level attention is used to capture the context information. For the results of Fscore macro and accuracy, FBLA is slightly higher than FCNA. This indicates that the context information is more effective than the local information in enhancing EMR representations. A possible reason is that the phrase information learned can reflect local features of EMRs, but a comprehensive understanding of the context relationships of all the words  can extract more information from a given EMR. Compared with FCNA and FBLA, the Fscore macro of FCNBLA is increased by 1.80 and 1.57%, and its accuracy is increased by 1.10 and 0.62%, respectively. This confirms that it is necessary to introduce these two attentions.

Effect of the combination features of the middle module
To verify the effect of using the combination features learned by the CNN module and BiLSTM module, we remove the entire middle module based on MLP. The new instance is referred to as CNBL. CNBL consists of the left side and the right side. The local representation is learned by the left side, and the context representation is learned by the right side. Similar to the integration strategy of the three sides, the final prediction is obtained by integrating these two sides. As shown in Table 5, FCNBLA is 1.77 and 0.92% higher than CNBL on the Fscore macro and accuracy, which confirms the importance of the middle module for deeply combining local information and context information in terms of performance improvement.

Effect of our pairwise combination strategy
To verify the effect of our pairwise combination strategy, we implement an instance of FCNBLA, which is called TCNBLA. TCNBLA consists of the left side, the right side and the middle. It concatenates all local information z C obtained by the left side, the context information z B obtained by the right side, and the enhanced combination information p F obtained by the middle module. Finally, the concatenation of the three pieces of information is fed into the fully connected and softmax layers to obtain a prediction result. As shown in Table 6, FCNBLA is 1.18 and 0.38% higher than TCNBLA on the Fscore macro and accuracy, which proves that the semantic representation of an EMR learned from different emphases has an important role in improving performance.

Conclusions
A new method based on CNN and BiLSTM, FCNBLA, is developed for predicting the disease related to a given EMR. We establish attention mechanisms at the phrase and word levels to discriminate the different contributions of each phrase and word. This new framework is composed of three parts and is constructed for learning the local representation, context representation and combination representation enhanced by the attention mechanisms. In our experiments, the results show that FCNBLA is superior to other methods not only for macro-average but also for accuracy. Experimental results also confirm that phrase-level and word-level attention mechanisms and combination representation can enhance the inference of the disease related to a given EMR. FCNBLA may give scores for the diseases related to an EMR, and these scores are used to rank candidate diseases. FCNBLA can serve as a prediction tool to assist doctors in diagnosing diseases in patients.