
Assistant diagnosis with Chinese electronic medical records based on CNN and BiLSTM with phrase-level and word-level attentions



Inferring diseases related to the patient’s electronic medical records (EMRs) is of great significance for assisting doctor diagnosis. Several recent prediction methods have shown that deep learning-based methods can learn the deep and complex information contained in EMRs. However, they do not consider the discriminative contributions of different phrases and words. Moreover, local information and context information of EMRs should be deeply integrated.


A new method based on the fusion of a convolutional neural network (CNN) and bidirectional long short-term memory (BiLSTM) with attention mechanisms is proposed for predicting a disease related to a given EMR, and it is referred to as FCNBLA. FCNBLA deeply integrates local information, context information of the word sequence and more informative phrases and words. A novel framework based on deep learning is developed to learn the local representation, the context representation and the combination representation. The left side of the framework is constructed based on CNN to learn the local representation of adjacent words. The right side of the framework based on BiLSTM focuses on learning the context representation of the word sequence. Not all phrases and words contribute equally to the representation of an EMR meaning. Therefore, we establish the attention mechanisms at the phrase level and word level, and the middle module of the framework learns the combination representation of the enhanced phrases and words. The macro average f-score and accuracy of FCNBLA achieved 91.29 and 92.78%, respectively.


The experimental results indicate that FCNBLA yields superior performance compared with several state-of-the-art methods. The attention mechanisms and combination representations are also confirmed to be helpful for improving FCNBLA’s prediction performance. Our method is helpful for assisting doctors in diagnosing diseases in patients.


Electronic medical records (EMRs), which record patient phenotypes and treatments, are an underutilized data source. Extracting useful information and predicting diseases using EMRs to assist doctors in disease determination and timely treatment of patients is one of the goals of intelligent medical construction [1,2,3], which can not only help us better understand the clinical manifestations of various diseases [4,5,6] but also reduce medical errors to improve the health of patients and improve the work efficiency of doctors [7,8,9].

The previous methods for predicting diseases related to information in EMRs can be roughly grouped into three categories. The methods in the first category are rule-based, which can also be called expert systems. Expert systems are designed to address problems by utilizing the knowledge and experience of human experts [10]. They perform rule matching on each input EMR to select the disease that best fits these rules to implement the corresponding diagnosis for patients. These methods have achieved great success in the field of medical aided diagnosis [11,12,13]. However, as time goes on, there have been increasingly more cases, and the data are no longer relatively structured and constrained but tend to be multistructured and unstructured. Therefore, rulemaking has become infeasible.

Methods in the second category construct shallow models based on machine learning. Such methods have achieved considerable success in the fields of text classification [14,15,16], legal prediction [17,18,19] and intelligent medical systems [20,21,22]. For example, common techniques such as support vector machines [23,24,25], random forests [26, 27], and logistic regression [28] have been applied to public medical datasets for predicting diseases, and they achieve good predictive results. However, these methods have certain limitations in feature extraction. They usually require manually designed features as input and cannot capture the deep and complex internal information of the data.

The methods in the third category are based on deep learning. In recent years, deep learning has achieved state-of-the-art results on various natural language processing tasks, such as machine translation [29, 30], sentiment analysis [31, 32], speech recognition [33, 34] and language modeling [35,36,37]. Moreover, in the medical field, experiments have shown that deep learning methods outperform state-of-the-art traditional predictive models in all cases involving electronic health record (EHR) data. For example, Cheng et al. [38] proposed a prediction method based on a convolutional neural network (CNN) for risk prediction with EHRs. Nguyen et al. [39] introduced a CNN model for predicting the probability of readmission. Choi et al. [40] introduced a shallow recurrent neural network (RNN) model to predict diagnoses and medications. Li et al. [41] provided a transformer-based model to predict future diseases. Owing to the advantages of deep learning on EHR data, some deep learning-based models have been applied to the diagnosis of Chinese electronic medical records. CNNs have strong capabilities in feature extraction and expression [42, 43]. For example, methods based on CNNs were proposed by Yang et al. and Chen et al. to predict diseases [10, 44]. They focused on extracting local information from adjacent words. However, these methods failed to consider the context information of the word sequence. Usama et al. and Hao et al. proposed prediction methods based on a recurrent convolutional neural network (RCNN) [45, 46], which learns the local information of the word context. However, they did not fully exploit the whole context and local information. In addition, the previous methods do not discriminate between the different contributions of different phrases and words.
In our study, a novel method based on CNN and bidirectional long short-term memory (BiLSTM) with attention mechanisms is proposed for obtaining the latent representations of EMRs, which we refer to as FCNBLA (Fig. 1). FCNBLA fully integrates local information formed by several adjacent words, context information of the whole sentence, and enhanced phrase and word information. Figure 1a is dedicated to feature extraction from adjacent words of an EMR to obtain their local representation. In Fig. 1b, the context representation is learned from the whole EMR based on BiLSTM. In Fig. 1c, each phrase and word are assigned different weights by applying attention mechanisms, which may discriminate their different contributions for predicting diseases related to EMRs. The experimental results indicate that FCNBLA outperforms several state-of-the-art methods for predicting diseases.

Fig. 1

Schematic diagram for predicting diseases related to EMRs. a Extract local information from several adjacent words. b Extract context information for words in EMRs. c Establish attention mechanisms at the word level and phrase level and deeply integrate local and context information to obtain combination information. d Fully learn local information, context information and combination information


Datasets for disease prediction related to EMRs

The EMR dataset we use comes from previous work on disease prediction [10]. The original 18,625 EMRs were collected from Huangshi Central Hospital in China. The dataset contains the 10 most common diseases: diabetes, hypertension, chronic obstructive pulmonary disease (COPD), arrhythmia, asthma, gastritis, gastric polyps, gout, gastric ulcers and urinary tract infection (UTI). Each EMR contains 18 items: initial diagnosis on admission, chief complaint, history of surgery, vital signs, specialist condition, general condition, allergic history, nutritional status, suicidal tendency, specialist examination, history of surgical trauma, complications, current medical history, fertility history, auxiliary examination, personal history, past medical history, and family history. Among them, the initial diagnosis on admission is the disease related to an EMR, and the remaining 17 items record the patient’s condition. However, 24 EMRs included only the initial diagnosis on admission and none of the remaining 17 items, so we removed them. We used the remaining 18,601 EMRs as our experimental data.

We selected 70% of the 18,601 EMRs as the training set to train the model, selected 10% as the validation set to adjust the model parameters, and selected 20% to test performance of the model. The distributions of the training, validation and testing sets are consistent with the original data distributions of 10 diseases. For the 10 diseases, their training, validation, and testing data distributions are shown in Table 1.
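The split described above can be illustrated with a minimal sketch of a per-disease (stratified) 70/10/20 split. The function name `stratified_split`, the random seed, the rounding rule and the toy label counts are all assumptions made for illustration; the source does not state how the split was implemented.

```python
import random
from collections import defaultdict

def stratified_split(labels, fractions=(0.7, 0.1, 0.2), seed=0):
    """Split sample indices into train/validation/test sets, class by class."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    rng = random.Random(seed)
    train, valid, test = [], [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_train = round(fractions[0] * len(indices))
        n_valid = round(fractions[1] * len(indices))
        train += indices[:n_train]
        valid += indices[n_train:n_train + n_valid]
        test += indices[n_train + n_valid:]  # the remainder goes to the test set
    return train, valid, test

# Toy labels with made-up counts: 100 "diabetes" EMRs and 20 "UTI" EMRs.
labels = ["diabetes"] * 100 + ["UTI"] * 20
train, valid, test = stratified_split(labels)
print(len(train), len(valid), len(test))
```

Because the split is done within each disease, a rare disease such as UTI keeps roughly the same share in all three subsets as in the full dataset.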

Table 1 Number of EMRs related to each disease in the training, validation and test sets

Disease prediction model

In this section, we describe our prediction model for learning the latent representations of EMRs and predicting diseases related to EMRs. Figure 2 shows the overall architecture of the model, which involves three major network modules. The left side is the convolutional module, which learns the local representation of a given EMR. The BiLSTM module on the right side learns the context representation of the EMR. The middle part is the fusion of local information and context information of the EMR, and its fusion representation is obtained. For disease prediction, we design a combination strategy to estimate the final association score between a disease and the EMR.

Fig. 2

The overall framework of the model for learning the potential representation of EMRs. The left of the framework is the CNN module, the right is the BiLSTM module, and the middle is the attention module

Word embedding layer

We use word embeddings as a representation of each EMR in the input layer. The word embedding layer can be simply understood as a look-up operation; that is, it reads a one-hot vector, \( {\mathbf{e}}_t\in {\mathbb{R}}^{\mid V\mid } \), for a word and maps it to a dense vector of d dimensions, \( {\mathbf{x}}_t=\left({x}_1,{x}_2,\dots, {x}_d\right) \), as an input of the disease prediction model. The weight matrix of the word embeddings is \( \mathbf{H}\in {\mathbb{R}}^{d\times \mid V\mid } \), which is randomly initialized. We fine-tune the initial word embeddings, modifying them during gradient updates of the neural network model by backpropagating gradients. We have the following formula:

$$ {\mathbf{x}}_t={\mathbf{He}}_t, $$

where V denotes the vocabulary (the set of words) and |V| is the size of the vocabulary.
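To make the look-up interpretation concrete, the following minimal sketch checks that multiplying H by a one-hot vector returns the corresponding column of H. The matrix values, the sizes (d = 3, |V| = 4) and the helper names are made up for illustration.

```python
# Toy embedding matrix H with d = 3 rows and |V| = 4 columns (made-up values).
H = [
    [0.1, 0.4, 0.7, 1.0],
    [0.2, 0.5, 0.8, 1.1],
    [0.3, 0.6, 0.9, 1.2],
]

def one_hot(index, size):
    """One-hot vector e_t with a 1 at the position of the word."""
    return [1.0 if i == index else 0.0 for i in range(size)]

def matvec(matrix, vector):
    """Plain matrix-vector product, x_t = H e_t."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def lookup(matrix, index):
    """Read column `index` of H directly, without any multiplication."""
    return [row[index] for row in matrix]

word_index = 2
x_product = matvec(H, one_hot(word_index, 4))
x_lookup = lookup(H, word_index)
print(x_product, x_lookup)
```

Both routes give the same vector, which is why embedding layers are usually implemented as an index look-up rather than an explicit product.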

Convolutional module on the left

The CNN proposed by Lecun et al. [47] can automatically learn feature representations. The CNN architecture is composed of three different layers: the convolutional layer, the pooling layer and the fully connected layer, as shown on the left side of Fig. 2.

An EMR consisting of T words to be classified is fed into the word embedding layer, where the T words are converted into vectors that form an embedding matrix \( \mathbf{I}\in {\mathbb{R}}^{T\times d} \), the input of the CNN. The convolutional layer and pooling layer are the core of the CNN. The CNN used in our framework consists of a convolutional layer followed by a max pooling layer. For the convolutional layer, we slide filters of 3 different heights across I, with 50 filters for each height. A filter of height k operates on k adjacent words, and its width equals the dimension d of the word embeddings. The outputs of the convolutional layer are feature maps.

The pooling layer may reduce the parameters of the neural network while maintaining the attributes of the word sequence so that the model can be effectively prevented from overfitting [10]. The pooling operation focuses on computing the max or average of the local regions. In this paper, we use the max pooling operation for each feature map Z. After the pooling operation is completed, all the extracted features are concatenated to form a local representation zC of an EMR.
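The convolution and max pooling steps can be sketched in plain Python as follows. This is an illustrative sketch only: the embedding values and filter weights are made up, no activation function is applied because the text does not specify one, and the function names are hypothetical.

```python
def conv_feature_map(embeddings, filt, bias=0.0):
    """Slide one filter of height k over a T x d embedding matrix.

    Returns T - k + 1 values, one per window of k adjacent words.
    """
    k = len(filt)
    T = len(embeddings)
    feature_map = []
    for start in range(T - k + 1):
        window = embeddings[start:start + k]
        value = sum(
            w * x
            for f_row, e_row in zip(filt, window)
            for w, x in zip(f_row, e_row)
        )
        feature_map.append(value + bias)
    return feature_map

def local_representation(embeddings, filters):
    """Max pooling over each feature map, then concatenation of the maxima."""
    return [max(conv_feature_map(embeddings, f)) for f in filters]

# Toy input: T = 4 words, d = 2 embedding dimensions (made-up values).
I = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]
filter_k2 = [[1.0, 0.0], [0.0, 1.0]]              # height k = 2
filter_k3 = [[1.0, 1.0], [1.0, 1.0], [1.0, 1.0]]  # height k = 3

print(conv_feature_map(I, filter_k2))
z_C = local_representation(I, [filter_k2, filter_k3])
print(z_C)
```

Each filter contributes one pooled value, so the length of the concatenated vector equals the total number of filters.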

BiLSTM module on the right

LSTM was proposed by Hochreiter et al. [48] to solve the gradient vanishing/exploding problem of RNN. However, LSTM can only obtain information from past words. For the task of determining the disease that an EMR is related to, it is very useful to obtain the past and future context information because each word of an EMR is semantically related to other words. The BiLSTM proposed by Dyer et al. [49] extended the unidirectional LSTM by introducing a second hidden layer, and the connections between hidden layers flow in reverse chronological order. Therefore, BiLSTM can be used to capture context information of an EMR. As shown in Fig. 2, on the right side of the framework, the BiLSTM contains two subnetworks: the forward LSTM is used for obtaining the forward sequence context \( {\overrightarrow{\mathbf{h}}}_t \), and the backward LSTM obtains the backward sequence context \( {\overleftarrow{\mathbf{h}}}_t \). The final hidden state ht of each word is the concatenation of \( {\overrightarrow{\mathbf{h}}}_t \) and \( {\overleftarrow{\mathbf{h}}}_t \).

Attention module in the middle

In our model, the attention module is used to learn which words or phrases are more important for the representation of an EMR. Therefore, the module consists of the attention mechanism at the phrase level and the one at the word level.

Attention at the phrase level

Z obtained by the left convolutional module is composed of N (1 ≤ N ≤ T − k + 1) rows. We call each row of Z a phrase vector, which contains the convolution results from the j filters performing convolution operations on a sequence of k word embeddings. Zi is the i-th row of Z. Different phrases usually have different contributions to the representation of the EMR. Thus, we establish the attention mechanism for each phrase vector Zi to generate the final attention representation. Zi is assigned an attention weight βi, and βi is defined as follows:

$$ {\mathbf{v}}_i=\tanh \left({\mathbf{W}}_r{\mathbf{Z}}_i+{\mathbf{b}}_r\right), $$
$$ {\beta}_i=\frac{\exp \left({\mathbf{v}}_i^{\top }{\mathbf{u}}_p\right)}{\sum \limits_{l=1}^N\exp \left({\mathbf{v}}_l^{\top }{\mathbf{u}}_p\right)}, $$

where Wr is a weight matrix, br is a bias vector, and up is a phrase-level context vector. vi is the feature representation of Zi, which is obtained by feeding Zi into a one-layer multilayer perceptron (MLP). βi is a standardized importance weight of Zi and N is the number of rows of the feature map Z obtained by the convolutional layer. The phrase context vector up is randomly initialized and updated during the training process. We aggregate the representations of those informative phrases to form the enhanced local phrase information of an EMR, which is represented as follows:

$$ {\mathbf{l}}_r=\sum \limits_{i=1}^N{\beta}_i\kern0.5em {\mathbf{Z}}_i. $$
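The three phrase-level equations can be traced with a small sketch: a tanh projection of each row, a softmax over the scores against the context vector, and a weighted sum of the rows. The function name `attention_pool` and all numeric values are assumptions for illustration; in the model, Wr, br and up are learned parameters. The word-level attention in the next subsection has the same form, with ht in place of Zi.

```python
import math

def attention_pool(rows, W, b, u):
    """Return (attention weights, weighted sum of rows).

    rows: N vectors (e.g. the phrase vectors Z_i)
    W, b: weight matrix and bias of the one-layer MLP
    u:    context vector
    """
    scores = []
    for z in rows:
        v = [
            math.tanh(sum(w * x for w, x in zip(w_row, z)) + b_j)
            for w_row, b_j in zip(W, b)
        ]
        scores.append(sum(v_j * u_j for v_j, u_j in zip(v, u)))

    # Softmax over the N scores; subtracting the max only improves stability.
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    dim = len(rows[0])
    pooled = [sum(wt * z[j] for wt, z in zip(weights, rows)) for j in range(dim)]
    return weights, pooled

# Toy example with N = 3 phrase vectors of dimension 2 (made-up values).
Z = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
W_r = [[1.0, 0.0], [0.0, 1.0]]
b_r = [0.0, 0.0]
u_p = [1.0, 1.0]

beta, l_r = attention_pool(Z, W_r, b_r, u_p)
print(beta, l_r)
```

In this toy case the third row scores highest against the context vector, so it receives the largest weight and dominates the pooled representation.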

Attention at the word level

Different words also contribute differently to the representation of an EMR. Therefore, we establish a word-level attention mechanism on the hidden states ht (1 ≤ t ≤ T) to generate the final attention representation. The attention weight at the word level is given as follows:

$$ {\mathbf{u}}_t=\tanh \left({\mathbf{W}}_c{\mathbf{h}}_t+{\mathbf{b}}_c\right), $$
$$ {\alpha}_t=\frac{\exp \left({\mathbf{u}}_t^{\top }{\mathbf{u}}_w\right)}{\sum \limits_{j=1}^T\exp \left({\mathbf{u}}_j^{\top }{\mathbf{u}}_w\right)}, $$

where Wc is a weight matrix, bc indicates a bias vector and uw is a word-level context vector. ut is a hidden representation of ht and αt is a normalized attention weight of ht. The important context information of the whole sentence is represented as lw,

$$ {\mathbf{l}}_w=\sum \limits_{t=1}^T{\alpha}_t{\mathbf{h}}_t. $$

MLP-based module

The CNN with phrase-level attention learns the enhanced local phrase information of the EMR, and the BiLSTM with word-level attention learns the enhanced context information of the entire EMR. To better integrate these two pieces of information, an MLP-based integration module is established. The MLP module consists of left and right branches. The left branch is the enhanced local phrase representation lr, the right branch is the enhanced context representation lw, and lF is the concatenation of lr and lw, defined as follows:

$$ {\mathbf{l}}_F=\left[{\mathbf{l}}_r,{\mathbf{l}}_w\right], $$

where [.,.] indicates the concatenation operation. lF goes through a one-layer MLP to obtain a combination representation, pF. The fully connected layer is applied to further fuse the features within pF to obtain the representation of the middle side, sF.

Combination strategy

As shown in Fig. 2, the left and right sides of the framework obtain more detailed features, which we call low-level features. The middle part is based on attention, which learns high-level features. We designed a combination strategy to obtain corresponding scores from different emphases. For the concatenation of low-level local features zC of the left side and high-level combination features pF of the middle part, the emphasis is placed on learning the local information of an EMR,

$$ {\mathbf{p}}_C=\left[{\mathbf{z}}_C,\kern0.5em {\mathbf{p}}_F\right]. $$

sC is obtained after pC goes through the fully connected layer, and sC contains local information and context information enhanced by the phrase-level and word-level attention mechanisms. Lower-level context features zB of the right side and high-level features pF of the middle are concatenated, and the emphasis is placed on learning the context information of an EMR,

$$ {\mathbf{p}}_B=\left[{\mathbf{z}}_B,\kern0.5em {\mathbf{p}}_F\right]. $$

pB also goes through a fully connected layer and outputs sB which contains the context information of an EMR and enhanced local information, and its dimension is the same as the number of disease labels. f is the final representation of an EMR, and it is a weighted sum of sC, sB, and sF. It is defined as follows:

$$ \mathbf{f}=\alpha {\mathbf{s}}_C+\beta {\mathbf{s}}_B+\gamma {\mathbf{s}}_F, $$

where α, β and γ control the contributions of sC, sB and sF, respectively. α is a hyperparameter, and the values of β and γ are calculated based on one half of 1 − α. f is fed into a softmax layer to obtain p,

$$ \mathbf{p}=\mathrm{softmax}\left(\mathbf{f}\right). $$

where p is a prediction probability distribution of C disease classes (C = 10). pi represents the probability that an EMR is related to the i-th disease.
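A minimal sketch of the weighted combination and the softmax step follows. It assumes that β and γ are each equal to (1 − α)/2, which is one reading of the description above and is not confirmed by the source. The score vectors are made up and cover only three classes for brevity.

```python
import math

def softmax(scores):
    """Numerically stable softmax."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def combine(s_C, s_B, s_F, alpha=0.3):
    """Weighted sum f = alpha*s_C + beta*s_B + gamma*s_F, followed by softmax.

    Assumption: beta = gamma = (1 - alpha) / 2.
    """
    beta = gamma = (1.0 - alpha) / 2.0
    f = [alpha * c + beta * b + gamma * m for c, b, m in zip(s_C, s_B, s_F)]
    return softmax(f)

# Made-up scores from the three branches for three candidate diseases.
s_C = [2.0, 0.5, 0.1]
s_B = [1.5, 1.0, 0.2]
s_F = [1.8, 0.7, 0.3]

p = combine(s_C, s_B, s_F)
predicted = max(range(len(p)), key=p.__getitem__)
print(p, predicted)
```

The predicted disease is the index with the highest probability; the same probabilities can also be used to rank candidate diseases.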

In our model, the cross-entropy loss between the ground truth distribution of disease labels and the estimated probability distribution p is calculated as follows:

$$ loss=-\sum \limits_{d\in T}\sum \limits_{c=1}^C{\mathbf{g}}_c(d)\log \left({\mathbf{p}}_c(d)\right), $$

where \( \mathbf{g}\in {\mathbb{R}}^C \) is a vector that contains the true classification labels. T represents the training sample set, and C is the number of diseases.
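The loss can be illustrated with a short sketch. It assumes that g(d) is a one-hot vector, that is, each EMR carries exactly one disease label, so only the probability of the true class contributes to the inner sum. The labels and probabilities are made up.

```python
import math

def cross_entropy(true_labels, predicted_probs):
    """Sum over samples d of -log p_c(d), where c is the true class of d."""
    loss = 0.0
    for label, probs in zip(true_labels, predicted_probs):
        # With a one-hot g(d), every term except the true class is zero.
        loss -= math.log(probs[label])
    return loss

# Two toy samples with C = 3 classes (made-up probabilities).
true_labels = [0, 2]
predicted_probs = [
    [0.50, 0.25, 0.25],
    [0.25, 0.25, 0.50],
]

loss = cross_entropy(true_labels, predicted_probs)
print(loss)
```

Each sample assigns probability 0.5 to its true class, so the loss is 2 log 2; it would drop to 0 if the true class received probability 1 in both samples.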


Evaluation metrics of the model

In general, we use accuracy when evaluating the performance of the classifier. Accuracy is defined as the ratio of the number of samples correctly classified by the classifier to the total number of samples for a given test dataset.

$$ Accuracy\kern0.5em =\frac{TP+ TN}{TP+ FP+ TN+ FN} $$

where true positive (TP): in the test set, the classifier correctly classifies the positive samples into positive classes, true negative (TN): in the test set, the classifier correctly classifies negative samples into negative classes, false positive (FP): in the test set, the classifier incorrectly classifies negative samples into positive classes, false negative (FN): in the test set, the classifier incorrectly classifies positive samples into negative classes. In terms of a specific disease, such as diabetes, an EMR with a label, diabetes, is a positive example. An EMR with any other disease labels is regarded as a negative example.

Accuracy alone is not sufficient to measure the performance of a classifier. As shown in Fig. 3, UTI is associated with only 1.6% of the EMRs in the dataset, whereas diabetes is associated with 30.3% of the EMRs. The number of EMRs therefore differs greatly from one disease to another. Figure 3a shows the number of EMRs related to each disease, and Fig. 3b shows the proportion of EMRs related to a specific disease among all the EMRs. For such an imbalance problem, the macro-average is also used to evaluate the performance of the model.

Fig. 3

The number and proportion of each disease in the EMR set. a Shows the number of EMRs related to each disease. b Shows proportion of EMRs related to a specific disease among all the EMRs

The macro-average calculates three values, \( {Precision}_{macro,i} \), \( {Recall}_{macro,i} \) and \( {f}_{macro,i} \), for each disease and averages the \( {f}_{macro,i} \) values of all the diseases. \( {Precision}_{macro,i} \) is the proportion of correctly identified positive samples (EMRs) of the i-th disease (called \( {d}_i \)) among the samples that are predicted as \( {d}_i \). It is calculated as follows:

$$ {Precision}_{macro,i}=\frac{TP_{macro}}{TP_{macro}+{FP}_{macro}}, $$

where \( {TP}_{macro} \) is the number of successfully identified positive samples of \( {d}_i \), and \( {FP}_{macro} \) is the number of samples that are misidentified as \( {d}_i \). \( {Recall}_{macro,i} \) is the proportion of correctly identified positive samples among all the \( {d}_i \)-related samples. It is defined as follows:

$$ {Recall}_{macro,i}=\frac{TP_{macro}}{TP_{macro}+{FN}_{macro}}, $$

where \( {FN}_{macro} \) is the number of \( {d}_i \)-related samples that are misidentified as other diseases. \( {f}_{macro,i} \) is the F-score value of \( {d}_i \), and it is the harmonic mean of \( {Precision}_{macro,i} \) and \( {Recall}_{macro,i} \); we obtain

$$ {f}_{macro,i}=\frac{2\times {Precision}_{macro,i}\times {Recall}_{macro,i}}{Precision_{macro,i}+{Recall}_{macro,i}}. $$

Finally, we calculate the average of all \( {f}_{macro,i} \) (1 ≤ i ≤ C) and obtain

$$ \mathrm{F}\hbox{-} {\mathrm{score}}_{\mathrm{macro}}=\frac{\sum \limits_{i=1}^C{f}_{macro,i}}{C}, $$

where C represents the number of diseases.
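The macro-average computation can be sketched as follows, treating each disease in turn as the positive class and all other diseases as negative. The labels are made up, and returning 0 when a denominator is zero is a convention chosen here for illustration, not something stated in the source.

```python
def macro_f_score(y_true, y_pred, classes):
    """Average of the per-class F-scores (one-vs-rest counting)."""
    f_scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            f = 2 * precision * recall / (precision + recall)
        else:
            f = 0.0
        f_scores.append(f)
    return sum(f_scores) / len(f_scores)

def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)

# Toy imbalanced example: four "diabetes" EMRs and two "UTI" EMRs.
y_true = ["diabetes", "diabetes", "diabetes", "diabetes", "UTI", "UTI"]
y_pred = ["diabetes", "diabetes", "diabetes", "UTI", "UTI", "diabetes"]

macro_f = macro_f_score(y_true, y_pred, ["diabetes", "UTI"])
acc = accuracy(y_true, y_pred)
print(macro_f, acc)
```

Here the per-class F-scores are 0.75 for diabetes and 0.5 for UTI, so the macro-average (0.625) is lower than the accuracy (about 0.667), reflecting the weaker result on the rarer class.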


To evaluate the performance of the proposed method, FCNBLA, we compare it with several state-of-the-art methods of disease prediction. We describe them in detail as follows.


TF-IDF is a commonly used weighting technique for information retrieval and data mining. The SVM-TFIDF baseline uses TF-IDF to extract key information and form the representations of EMRs. An SVM is then used to classify and predict the disease related to a specific EMR [23].


In the word embedding layer, the CNN baseline maps each word of an EMR into a word embedding, and all word embeddings form an embedding matrix. In the convolutional layer, the matrix is scanned with different filters to obtain different local representations. After max pooling is completed, the extracted representations are concatenated end to end. Finally, the fully connected layer and softmax layer are used to obtain the probability that the EMR is associated with a disease [10].


The RCNN baseline differs from the traditional CNN. It first applies a bidirectional recurrent structure to capture the contextual information to the greatest extent possible when learning word representations. Second, the max pooling layer is used to form a more effective semantic representation. The representation is utilized to predict the disease related to an EMR [45].


To use the context information between words in the sentence, we also established a baseline method based on BiLSTM. Each word in an EMR is mapped into a word embedding through the word embedding layer, the word sequence is inputted into BiLSTM to obtain the hidden representation of any word, and the association probability is obtained. We compared our method, FCNBLA, with the baseline.

Parameter setting

The word embeddings that are fed into the convolutional layer are the same as those that are fed into the BiLSTM layer. Our word embeddings are initialized with uniform samples from \( \left[-\sqrt{3/d},\kern0.5em +\sqrt{3/d}\right] \), where we set d = 300. In the convolutional module, we use three different filter heights, k ∈ {2, 3, 4}. The hidden layer dimension of the LSTM is 200, and the BiLSTM eventually outputs a 400-dimensional sentence representation. The Adam optimization algorithm is used to update the parameters, and the learning rate is set to 0.001. We apply a dropout strategy to the embedding layers of CNN and BiLSTM; the dropout rate is 0.2, and the batch size is 16. The value of α is 0.3. Early stopping is adopted with its value set to 20, and we train for 100 epochs.

For the support vector machine (SVM) method, the term frequency-inverse document frequency (TF-IDF) is used to extract features from EMRs. The document frequency is set to 5, which means that terms that appear in fewer than 5 documents are ignored. The value of n-gram ranges between 1 and 3. For the CNN, each word is also mapped to a 300-dimensional dense vector, which is randomly initialized. The heights of the filters are 4, 5, and 6, and each height has 128 filters. To ensure the fairness of the experiment, we also use randomly initialized word embeddings for RCNN, and the hidden layer size is 100. For the competing model BiLSTM, the hidden layer dimension of LSTM is set to 150. The learning rate of all competing models is 0.001, and they are trained for 100 epochs.

Our implementation uses PyTorch and Python 3.6 to train and optimize the neural networks, and we use GPU cards (Nvidia GeForce GTX 1080) to speed up the model training process.
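As a quick check of the numbers above, the following sketch works out the initialization range and the representation sizes implied by the stated settings. The variable names are illustrative, and the size of the pooled CNN output (one value per filter) is inferred from the description of max pooling rather than stated explicitly.

```python
import math

d = 300                          # embedding dimension
bound = math.sqrt(3.0 / d)       # uniform initialization range is [-bound, +bound]
variance = bound ** 2 / 3.0      # variance of a uniform distribution on [-bound, +bound]

filter_heights = [2, 3, 4]
filters_per_height = 50
# Inferred: one max-pooled value per filter, concatenated over all filters.
local_dim = len(filter_heights) * filters_per_height

lstm_hidden = 200
context_dim = 2 * lstm_hidden    # forward and backward hidden states concatenated

print(bound, variance, local_dim, context_dim)
```

With d = 300 the bound is 0.1, which gives each embedding component a variance of 1/d, and the BiLSTM output size of 400 matches the value reported in the text.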

Result comparisons with other methods

As shown in Table 2, our method achieves the best result on every evaluation metric. On the test set, our method achieves 92.78% accuracy. FCNBLA also performs best in terms of macro-average results. It achieves the highest \( {Precision}_{macro} \) (92.31%), \( {Recall}_{macro} \) (90.46%) and \( \mathrm{F}\hbox{-} {\mathrm{score}}_{\mathrm{macro}} \) (91.29%), and its \( \mathrm{F}\hbox{-} {\mathrm{score}}_{\mathrm{macro}} \) is 3.27, 2.37, 0.75 and 1.19% higher than those of SVM-TFIDF, CNN, RCNN and BiLSTM, respectively. The performance of SVM-TFIDF is worse than that of the other methods. A main reason is that SVM-TFIDF is a shallow model, which fails to deeply learn the complex feature representations of EMRs. CNN only focuses on the local information contained in several adjacent words, which makes its \( \mathrm{F}\hbox{-} {\mathrm{score}}_{\mathrm{macro}} \) lower than that of BiLSTM. RCNN is the second-best performing method, which suggests that both context information and local information are very important for the association between EMRs and diseases. BiLSTM is slightly lower than RCNN because it only learns the context information formed by word sequences.

Table 2 Prediction result of FCNBLA and its baselines on the test set

As shown in Table 3, the 10 diseases are listed on the left side in descending order of data volume (the specific quantity of data for each disease is shown in Table 1). We list the macro-average F-score value corresponding to each disease. FCNBLA achieved the highest F-score value for 8 of the 10 diseases. For the diseases with large quantities of data, such as diabetes, COPD, and arrhythmia, FCNBLA improves only slightly on the baseline models, by approximately 0.1 to 0.5%. However, there are significant improvements for the diseases with fewer data, such as UTI, gastritis, and gastric polyps, which improve by 2.17, 2.19, and 1.08%, respectively, compared to the best baseline model.

Table 3 Prediction result for each disease of FCNBLA and its baselines on the test set

RCNN performs the best for hypertension, where its \( \mathrm{F}\hbox{-} {\mathrm{score}}_{\mathrm{macro}} \) is only 0.13% higher than that of our model, which indicates that RCNN is just slightly better than our model for this disease. For asthma, the \( \mathrm{F}\hbox{-} {\mathrm{score}}_{\mathrm{macro}} \) of RCNN is 0.97% higher than that of our model. For each disease, we calculated the proportion of its EMRs that fall in each word-count range and listed the proportions in the supplementary table ST1. We found that EMRs with more than 500 words account for 72.77% of the total EMRs for asthma, while among the other diseases, the highest proportion of EMRs with more than 500 words is 31.03%. This suggests that RCNN performs better than our method and the other compared methods for EMRs with more than 500 words. The primary reason is that RCNN uses the context information on the left and right sides of a word to enhance the representation of the word, so less information is lost during the process of learning extremely long text.


Effect of attention at the phrase level and the word level

To validate the effects of phrase-level attention and word-level attention, we implemented an instance of FCNBLA that only has an attention mechanism at the phrase level (FCNA). Similarly, we constructed an instance that only has attention at the word level (FBLA) and another instance that has no attention (FNOA). As shown in Table 4, the \( \mathrm{F}\hbox{-} {\mathrm{score}}_{\mathrm{macro}} \) values of FCNA (89.49%) and FBLA (89.72%) are 0.59 and 0.82% higher than that of FNOA, respectively. Compared with FNOA, their accuracy is increased by 0.19 and 0.67%, respectively. This result indicates that establishing attention at both the phrase level and the word level is helpful for improving the performance of disease prediction.

Table 4 Prediction results of FCNBLA and its three instances FNOA, FCNA and FBLA

Phrase-level attention is exploited to enhance the local information, and word-level attention is used to capture the context information. In terms of \( \mathrm{F}\hbox{-} {\mathrm{score}}_{\mathrm{macro}} \) and accuracy, FBLA is slightly higher than FCNA. This indicates that the context information is more effective than the local information in enhancing EMR representations. A possible reason is that the learned phrase information reflects local features of EMRs, whereas a comprehensive understanding of the context relationships among all the words can extract more information from a given EMR. Compared with FCNA and FBLA, the \( \mathrm{F}\hbox{-} {\mathrm{score}}_{\mathrm{macro}} \) of FCNBLA is increased by 1.80 and 1.57%, and its accuracy is increased by 1.10 and 0.62%, respectively. This confirms that it is necessary to introduce both attention mechanisms.

Effect of the combination features of the middle module

To verify the effect of using the combination features learned by the CNN module and BiLSTM module, we remove the entire middle module based on MLP. The new instance is referred to as CNBL. CNBL consists of the left side and the right side. The local representation is learned by the left side, and the context representation is learned by the right side. Similar to the integration strategy of the three sides, the final prediction is obtained by integrating these two sides. As shown in Table 5, FCNBLA is 1.77 and 0.92% higher than CNBL on the \( \mathrm{F}\hbox{-} {\mathrm{score}}_{\mathrm{macro}} \) and accuracy, which confirms the importance of the middle module for deeply combining local information and context information in terms of performance improvement.

Table 5 Prediction results of FCNBLA and its instance CNBL

Effect of our pairwise combination strategy

To verify the effect of our pairwise combination strategy, we implement an instance of FCNBLA, which is called TCNBLA. TCNBLA consists of the left side, the right side and the middle. It concatenates the local information zC obtained by the left side, the context information zB obtained by the right side, and the enhanced combination information pF obtained by the middle module. Finally, the concatenation of the three pieces of information is fed into the fully connected and softmax layers to obtain a prediction result. As shown in Table 6, FCNBLA is 1.18 and 0.38% higher than TCNBLA on the \( \mathrm{F}\hbox{-} {\mathrm{score}}_{\mathrm{macro}} \) and accuracy, which indicates that learning the semantic representation of an EMR from different emphases plays an important role in improving performance.

Table 6 Prediction results of FCNBLA and its instance TCNBLA


A new method based on CNN and BiLSTM, FCNBLA, is developed for predicting the disease related to a given EMR. We establish attention mechanisms at the phrase and word levels to discriminate the different contributions of each phrase and word. This new framework is composed of three parts and is constructed for learning the local representation, context representation and combination representation enhanced by the attention mechanisms. In our experiments, the results show that FCNBLA is superior to other methods not only for macro-average but also for accuracy. Experimental results also confirm that phrase-level and word-level attention mechanisms and combination representation can enhance the inference of the disease related to a given EMR. FCNBLA may give scores for the diseases related to an EMR, and these scores are used to rank candidate diseases. FCNBLA can serve as a prediction tool to assist doctors in diagnosing diseases in patients.

Availability of data and materials

The datasets analyzed during the current study were downloaded from a public website. Our code is available to readers upon reasonable request.



  • CNN: Convolutional neural network
  • BiLSTM: Bidirectional long short-term memory
  • EMR: Electronic medical record
  • RCNN: Recurrent convolutional neural network
  • COPD: Chronic obstructive pulmonary disease
  • UTI: Urinary tract infections
  • MLP: Multilayer perceptron
  • SVM: Support vector machine
  • TF-IDF: Term frequency-inverse document frequency
  • TP: True positive
  • FP: False positive
  • FN: False negative



The authors thank the anonymous referees for their careful reading of our manuscript and their extensive comments.


The work was supported by the Natural Science Foundation of China (61972135), the Natural Science Foundation of Heilongjiang Province (LH2019F049, LH2019A029), the China Postdoctoral Science Foundation (2019M650069), the Heilongjiang Postdoctoral Scientific Research Starting Foundation (BHL-Q18104), the Fundamental Research Foundation of Universities in Heilongjiang Province for Technology Innovation (KJCX201805), and the Fundamental Research Foundation of Universities in Heilongjiang Province for Youth Innovation Team (RCYJTD201805). The funders did not play any role in the design of the study, data collection and analysis, or preparation of the manuscript.

Author information




TW, PX and ZLL designed the study. TW and ZLL implemented the model. TW, PX and ZLL performed the experiments and analyses. TW drafted the manuscript, and PX and TGZ revised it. All authors have read and approved the final version of this manuscript.

Corresponding author

Correspondence to Ping Xuan.

Ethics declarations

Ethics approval and consent to participate

The data that we used were obtained from public datasets. Therefore, ethics approval is not applicable to our study.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. The Creative Commons Public Domain Dedication waiver applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article

Wang, T., Xuan, P., Liu, Z. et al. Assistant diagnosis with Chinese electronic medical records based on CNN and BiLSTM with phrase-level and word-level attentions. BMC Bioinformatics 21, 230 (2020).


  • EMR-related disease prediction
  • Convolutional neural network
  • Bidirectional long short-term memory
  • Attention at phrase level
  • Attention at word level