Biomedical word sense disambiguation with bidirectional long short-term memory and attention-based neural networks
BMC Bioinformatics volume 20, Article number: 502 (2019)
Abstract
Background
In recent years, deep learning methods have been applied to many natural language processing tasks to achieve state-of-the-art performance. However, in the biomedical domain, they have not outperformed supervised word sense disambiguation (WSD) methods based on support vector machines or random forests, possibly due to inherent similarities of medical word senses.
Results
In this paper, we propose two deep-learning-based models for supervised WSD: a model based on a bidirectional long short-term memory (BiLSTM) network, and an attention model based on the self-attention architecture. Our results show that the BiLSTM neural network model with a suitable upper layer structure performs even better than the existing state-of-the-art models on the MSH WSD dataset, while our attention model was 3 or 4 times faster than our BiLSTM model with good accuracy. In addition, we trained “universal” models in order to disambiguate all ambiguous words together. That is, we concatenate the embedding of the target ambiguous word to the max-pooled vector in the universal models, acting as a “hint”. The results show that our universal BiLSTM neural network model yielded about 90 percent accuracy.
Conclusion
Deep contextual models based on sequential information processing methods are able to capture the relative contextual information from pre-trained input word embeddings, in order to provide state-of-the-art results for supervised biomedical WSD tasks.
Background
In the health and biomedical domain, valuable information can be mined from a huge amount of unstructured data, such as scientific literature, clinical narratives in the electronic health records, and health-related postings on social media [1]. Similar to natural language processing (NLP) in the general domain, knowledge discovery and information extraction require specialized tasks such as syntactic parsing, named entity recognition (NER), and relation extraction. In NER, an important step is to decide the correct sense of an ambiguous word or phrase based on its context. Otherwise, the accuracy of downstream NLP applications such as sentiment analysis and text classification will suffer.
In the biomedical NLP area, well-curated medical terminologies and lexicons lay a solid foundation. The Unified Medical Language System (UMLS), which consists of over 200 biomedical terminologies and ontologies, has more than ten million terms and three million concepts. Terms with the same meaning are mapped to the same concept. For example, myocardial infarction and heart attack are mapped to the same concept, which is assigned a concept unique identifier (CUI). Biomedical NER is usually realized by correctly recognizing and mapping an entity mentioned in the sentence to a concept in the UMLS. For instance, the term nursing has two concepts in the UMLS: Discipline of Nursing and Breast Feeding. In the sentence “Breastfeeding is in general safe but needs appropriate observation of the nursing infant.” nursing refers to Breast Feeding, whereas in another expression “Strategic research, technological innovation and nursing”, nursing refers to Discipline of Nursing.
Biomedical texts often contain a series of lexical ambiguities, such as abbreviations and polysemous terms. For instance, the acronym CRF may refer to chronic renal failure, or corticotropin-releasing factor. Some terms have different but very similar meanings. For instance, malaria may refer to the disease malaria, or the malaria vaccine. When extracting information from biomedical texts, selecting the correct meaning (“sense”) for an ambiguous term based on its context is called word sense disambiguation (WSD) [2].
Biomedical WSD has been a long-standing challenge for more than 20 years, and many biomedical WSD methods have been developed. As early as 2004, Liu et al. evaluated supervised methods including decision lists [3] and Naive Bayes. In 2006, Xu et al. [4] improved the supervised approaches and indicated that the error rate of supervised approaches is proportional to the similarity of senses. Also, it is very expensive to generate a labeled corpus. Researchers tried to reduce the labeling costs by several different approaches. Wang et al. [5] proposed an interactive learning method to reduce labeling expense while outperforming the active learning approach. Semi-supervised learning is another type of approach. In Liu et al. [6], labeled data is first generated automatically from UMLS and MEDLINE databases, then used to train supervised algorithms. Similarly, Yu et al. used MEDLINE abstracts to create labeled data for their supervised training algorithms [7].
Besides these efforts, Xu et al. leveraged knowledge from dictated dispatch discharge in the clustering analysis and estimated sense frequency for WSD [8]. In addition, Duque et al. completed the same task by incorporating an external knowledge resource. They leveraged co-occurrence information in a graph-based unsupervised WSD method [9]. Yepes et al. [10] conducted a study comparing four different knowledge-based methods. The best results were obtained via a WSD method using the semantic types assigned to the concepts in the UMLS Metathesaurus. The context of the ambiguous word and the semantic types of the candidate concepts are mapped to journal descriptors, and the journal descriptors are compared to choose among the candidate concepts. Sabbir et al. [11] used the concept mapping tool MetaMap to label a corpus of PubMed abstracts with UMLS CUIs. They used the Word2Vec algorithm [12] to generate concept embeddings, and then used cosine similarity and the KNN algorithm to disambiguate the words in the MSH WSD dataset, reporting state-of-the-art results achieved by an unsupervised system. However, Sabbir et al. pointed out that their approach may be considered weakly supervised because of the use of MetaMap. Rais et al. [13] introduced No Distance Sense Relate, a modification of the Sense Relate algorithm. No Distance Sense Relate ignores the distance of the context word from the word being disambiguated, so all the terms in the context have an equal weight. The No Distance Sense Relate method, evaluated on the MSH WSD dataset, consistently yielded a higher accuracy with a window size of 3. However, with a window size of 2, the Sense Relate method yielded a higher disambiguation accuracy than No Distance Sense Relate. It was concluded that, depending on the window size, the distance between the target word and the terms in the context can influence the accuracy of the model differently.
Recently, with the promise of deep neural networks in NLP tasks, specialized neural network architectures have been developed for WSD. Typical approaches include the recurrent convolutional neural networks evaluated by Festag and Spreckelsen [14], and the LSTM network proposed by Yepes [15]. These approaches indicate that a large amount of high-quality, annotated data is required to achieve satisfactory performance in WSD.
To infer the correct sense of an ambiguous word, existing WSD methods often leverage the information of the context. However, merely focusing on the co-occurrence of ambiguous words may not be sufficient to determine their correct senses. Consider the word contract in this example: “Since one of the workers contracted leukemia, the company hired a well-known law firm to protect itself in case of the potential lawsuit.” The presence of “workers”, “company”, “firm” and “law” may outweigh the word “leukemia”, suggesting that to contract in this context means making a legal agreement with someone, rather than its correct sense of catching or becoming ill with a disease. We thus believe that it is necessary to use a sequential information processing system to resolve this issue and capture semantic relations in the context. Moreover, the local context of the target ambiguous word may not provide enough information for WSD. For example, in order to decide the correct meaning (sense) of the word nursing in the sentence “The absolute need to articulate the gender issue in nutrition, nursing and medical academic curricula is stated.”, a large context is required even for humans. Typically, a suitable context should be part of the paragraph. Since most paragraphs in our dataset are within 200 words, we use the entire paragraph to infer the meaning of each ambiguous term.
In this work, we apply deep contextual representations [16] to establish two supervised WSD systems: (1) a WSD system based on a multi-layer bidirectional Long Short-Term Memory (BiLSTM) neural network model, and (2) a WSD system based on the self-attention model. Note that this work is based on our previous work [17]: We apply the same BiLSTM neural network model from [17], but we further build an attention model with the self-attention architecture introduced in [18]. Also, we further expand our previous research on universal WSD systems in this work. The contributions of this work are threefold:
First, we build a deep contextual representation of a target ambiguous word, using the output from two layers of the BiLSTM network or the attention architecture. In contrast to [16], in which weighted summation of the lower layers is applied, we perform a max-pooling operation to extract related features from the context. Our models take pre-trained word embeddings as inputs. However, note that our BiLSTM network and attention model are not pre-trained on any other general NLP datasets, but trained end-to-end on the WSD dataset.
Second, our contextual representation is not only “deep” but also “wide”: We use the outputs at multiple time steps, rather than only using the output at timestep t, where x_{t} is the target word embedding. Then, we perform the max-pooling operation along the timestep. As our results show, larger and wider contexts lead to significant improvements in prediction accuracy.
Third, we make efforts to develop universal models for biomedical WSD. All the existing methods (e.g., [14, 15]) build a model for each ambiguous word separately. In contrast, we design universal WSD models by concatenating the embedding of the target ambiguous word with the max-pooled output. Then we train both the BiLSTM neural network model and the self-attention model on all the words in the dataset. Experimental results show that this concatenation design significantly increases the prediction accuracy of our universal BiLSTM neural network model. Furthermore, the prediction accuracy on some words is higher when using the universal network than when using the word-specific ones.
The rest of this paper is organized as follows: We provide a detailed explanation of both of our models in the “Methods” section, in which all four structures and the hint layer are discussed in detail. In the “Results” section, the results and operating details of our experiments are provided. Then, we provide a comprehensive analysis of our models in the “Discussion” section, including the advantages and potential weaknesses of the BiLSTM neural network model against the self-attention model. Finally, we conclude this work with a summary and our future research plans in the last section.
Methods
In this section, we introduce our methods in detail. We provide two approaches: a WSD method based on a BiLSTM neural network, and a WSD method based on the self-attention model introduced in [18].
Note that our BiLSTM neural network model and our attention model share the same upper layer structure: We designed four different transformation structures to operate on the outputs of the two BiLSTM layers or the two attention layers (both our BiLSTM model and our attention model are always a stack of two identical layers). Then, a max-pooling layer operates on the chosen transformation structure to generate a dense embedding. Finally, an optional concatenation between the target ambiguous word embedding and the dense output embedding may be applied before the softmax output.
The novelty of our work lies in the four transformation structures and the optional concatenation in the upper layer. We will introduce this architecture together with the BiLSTM model in the first subsection. Then in the second subsection, we will provide a detailed introduction to the attention architecture.
WSD Method based on the BiLSTM Network
In this subsection, we shall first introduce the structure of an LSTM cell. Then, our adjustments to the LSTM output and the concatenation of the target word embedding to the max-pooled vector in the upper layer of our neural network model are introduced in detail.
Long Short-Term Memory Networks
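Long Short-Term Memory (LSTM) is a gated Recurrent Neural Network (RNN) introduced by Hochreiter and Schmidhuber in 1997 [19] and refined by Gers in 1999 [20]. The structure of an LSTM cell is shown in Fig. 1. Mathematically, the operation within an LSTM cell can be described, in its standard form, as:
\(\mathbf {f}_{t}=\sigma (\mathbf {W}_{f}[\mathbf {h}_{t-1},\mathbf {x}_{t}]+\mathbf {b}_{f})\)
\(\mathbf {i}_{t}=\sigma (\mathbf {W}_{i}[\mathbf {h}_{t-1},\mathbf {x}_{t}]+\mathbf {b}_{i})\)
\(\tilde {\mathbf {c}}_{t}=\tanh (\mathbf {W}_{c}[\mathbf {h}_{t-1},\mathbf {x}_{t}]+\mathbf {b}_{c})\)
\(\mathbf {c}_{t}=\mathbf {f}_{t}\odot \mathbf {c}_{t-1}+\mathbf {i}_{t}\odot \tilde {\mathbf {c}}_{t}\)
\(\mathbf {o}_{t}=\sigma (\mathbf {W}_{o}[\mathbf {h}_{t-1},\mathbf {x}_{t}]+\mathbf {b}_{o})\)
\(\mathbf {h}_{t}=\mathbf {o}_{t}\odot \tanh (\mathbf {c}_{t})\)
where [h_{t−1},x_{t}] is the concatenation of the previous output and the current input embedding, c_{t} is the cell state, b denotes a bias vector, and ⊙ denotes element-wise multiplication.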
Here, σ represents the sigmoid function: σ(x)=1/(1+ exp(−x)). Three “gates” operate in an LSTM cell: the forget gate, denoted as f_{t}; the input gate, i_{t}; and the output gate, o_{t}. Within an LSTM cell, the three gates operate with the trainable matrices W to keep “valuable” information from previous time steps and discard the less useful parts, as guided by the provided label. This operation is a recurrent process, which has been introduced in detail in our previous paper [17]. For brevity, we will not repeat the details here.
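As an illustration of the gating described above, the following is a minimal sketch of one step of a standard LSTM cell in plain Python. It is not the implementation used in this work (which was built in TensorFlow); the parameter names `W_f`, `W_i`, `W_c`, `W_o` and the biases are assumptions introduced for the example, and each weight matrix acts on the concatenation of the previous output and the current input.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def affine(W, b, v):
    """Compute W v + b for a matrix W (list of rows) and a vector v."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]


def lstm_step(x_t, h_prev, c_prev, params):
    """One step of a standard LSTM cell (illustrative; names are assumptions)."""
    v = h_prev + x_t  # concatenation [h_{t-1}, x_t]
    f = [sigmoid(a) for a in affine(params["W_f"], params["b_f"], v)]  # forget gate
    i = [sigmoid(a) for a in affine(params["W_i"], params["b_i"], v)]  # input gate
    o = [sigmoid(a) for a in affine(params["W_o"], params["b_o"], v)]  # output gate
    g = [math.tanh(a) for a in affine(params["W_c"], params["b_c"], v)]  # candidate
    c = [ft * cp + it * gt for ft, cp, it, gt in zip(f, c_prev, i, g)]
    h = [ot * math.tanh(ct) for ot, ct in zip(o, c)]
    return h, c


def zero_params(hidden, inputs):
    """All-zero parameters, useful only for checking shapes and gate behaviour."""
    params = {}
    for gate in "fico":
        params["W_" + gate] = [[0.0] * (hidden + inputs) for _ in range(hidden)]
        params["b_" + gate] = [0.0] * hidden
    return params
```

With all-zero parameters every gate equals 0.5 and the candidate is 0, so the new cell state is simply half of the previous one, which makes the role of the forget gate easy to see.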
Thanks to the ability to capture long-term semantic dependencies and the superior performance on long sequences, the LSTM is commonly used in many NLP tasks. In this work, we will use a specific type of LSTM, the Bidirectional LSTM (BiLSTM) [21]. In case of BiLSTM, the input sequence will be processed in both forward and backward directions, with independent parameters in each direction. The outputs at each timestep from both directions are concatenated and become the input of the BiLSTM in the next layer, in case of multiple layers. As such, the complete information about the whole input sequence will be captured by the neural network node at any timestep. In order to take advantage of this feature, we use BiLSTM networks to better capture the semantic relations on both sides of the target word.
Structure of the upper layer
We use the BiLSTM neural network model as the example to show the structure of the upper layer. As we mentioned above, the BiLSTM neural network model and the attention model share the same upper layer structure. In order to build the complete architecture of our attention model, one only needs to replace each BiLSTM layer with an attention layer, keeping all the other structures unchanged.
For the training of the BiLSTM neural network, we use 25 words before and after the target ambiguous word as the input in order to reduce training time. As shown in the “Results” section, the overall performance is improved when the network is trained with full paragraphs. Suppose the first layer output is Y=(y_{1},⋯,y_{T}), and the second layer output is Z=(z_{1},⋯,z_{T}), where y_{i} and z_{i} are vectors with the same dimension \(\mathcal {D}\), i.e., \(\mathbf {y}_{i},\mathbf {z}_{i}\in \mathbb {R}^{\mathcal {D}}\). After trying different layer settings, we found that a two-layer BiLSTM with dropout [22] provides the best performance.
We design in total four optional structures that operate on top of the BiLSTM to adjust its output. We use H to represent the output of each structure. These four structures can be described as follows:
 (i)
We directly use the output from the BiLSTM. That is, H=Z.
 (ii)
We perform weighted summation between Y and Z. That is, H=λY+(1−λ)Z, where λ∈[0,1] is a variable.
 (iii)
We concatenate Y and Z along time steps. That is, since both Y and Z are \(T\times \mathcal {D}\) tensors, H will be a \(2T\times \mathcal {D}\) tensor.
 (iv)
We concatenate Y and Z along each vector y and z. That is, H will be a \(T\times 2\mathcal {D}\) tensor.
After a specific upper layer structure is chosen, we perform a max-pooling operation along timesteps on H to get \(\mathbf {h} \in \mathbb {R}^{\mathcal {D}}\) (or \(\mathbb {R}^{2\mathcal {D}}\) in case of structure (iv)). That is, we pick the maximum value along the timesteps within each dimension \(d \in \mathcal {D} \ (\text {or} \ 2\mathcal {D})\). Based on these settings, we hope that the context vector h can capture the context information that is sufficient for disambiguation.
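The four structures and the max-pooling step can be sketched in a few lines of plain Python. This is only an illustration under simplifying assumptions: Y and Z are lists of T vectors of length D, the function names are hypothetical, and λ is passed in as a fixed number (`lam`) rather than being learned.

```python
def combine(Y, Z, structure, lam=0.5):
    """Build H from the two layer outputs Y and Z (each T x D, as lists of rows).

    `structure` is one of "i", "ii", "iii", "iv"; `lam` stands in for lambda
    in structure (ii) and is an assumed fixed value here.
    """
    if structure == "i":    # H = Z
        return [row[:] for row in Z]
    if structure == "ii":   # H = lam * Y + (1 - lam) * Z
        return [[lam * y + (1 - lam) * z for y, z in zip(yr, zr)]
                for yr, zr in zip(Y, Z)]
    if structure == "iii":  # concatenate along time steps: 2T x D
        return [row[:] for row in Y] + [row[:] for row in Z]
    if structure == "iv":   # concatenate along the feature axis: T x 2D
        return [yr + zr for yr, zr in zip(Y, Z)]
    raise ValueError("unknown structure: " + structure)


def max_pool_over_time(H):
    """Take the maximum over time steps within each dimension."""
    return [max(column) for column in zip(*H)]


# Tiny example with T = 2 and D = 2.
Y = [[1.0, 5.0], [3.0, 0.0]]
Z = [[2.0, -1.0], [0.0, 4.0]]
h_iii = max_pool_over_time(combine(Y, Z, "iii"))  # length D
h_iv = max_pool_over_time(combine(Y, Z, "iv"))    # length 2D
```

Structure (iii) pools over both layers jointly, so each dimension of h can come from either layer, whereas structure (iv) keeps the two layers in separate dimensions.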
In addition, an optional step \(\mathcal {C}\) is provided: The target word embedding x_{k} and the context vector h will be concatenated to form the context-word embedding ξ=[h,x_{k}]. Finally, the vector ξ (or h if the optional layer is not applied) passes through two dense layers with 256 and 64 hidden units respectively, before the softmax output. We will provide experimental results from neural network models with or without the optional layer \(\mathcal {C}\). The results confirm our assumption that the optional layer \(\mathcal {C}\) is more beneficial in case of the universal model than the word-specific model, which is discussed in the “Discussion” section. The complete network structure is shown in Fig. 2. For clarity, we only specify the version (iii) in it.
Training
In each training step, we give a paragraph (w_{1},⋯,w_{T}) with the target ambiguous word w_{t} marked out. There is only one target ambiguous word in each paragraph. For an ambiguous word w, its possible senses are labeled as M_{1},M_{2}, etc. The correct sense M_{i} of the ambiguous word w is given at the end of the paragraph. The label set {M_{i}} is shared by all ambiguous words, which means the label itself does not specify a UMLS concept, and contains no semantic information.
Then, the corresponding pre-trained word embedding (x_{1},⋯,x_{T}) is obtained from the embedding matrix accordingly. We either use the full paragraph as the input, or use a fixed-length input with 25 words before and after x_{t}, i.e., (x_{t−25},⋯,x_{t−1},x_{t}, x_{t+1},⋯,x_{t+25}). For simplicity, we mostly use X=(x_{1},⋯,x_{T}) to represent the sequence in this paper without specifying what kind of input method we use. We shall specify it when necessary.
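The fixed-length input can be illustrated with a small helper that cuts the window around the target position. The function name is hypothetical, and the handling of targets near the start or end of a paragraph (here the window is simply truncated, with no padding) is an assumption, since the text does not specify it.

```python
def context_window(tokens, target_index, width=25):
    """Return up to `width` tokens on each side of the target, plus the target.

    Also returns the position of the target inside the returned window.
    Truncation at paragraph boundaries is an assumption of this sketch.
    """
    start = max(0, target_index - width)
    end = min(len(tokens), target_index + width + 1)
    return tokens[start:end], target_index - start


# Example: a 100-token paragraph with the target in the middle.
paragraph = ["w%d" % i for i in range(100)]
window, position = context_window(paragraph, 50)
```

For a target far from both ends the window has 51 tokens and the target sits at index 25.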
Implementation details
We applied an exponentially decaying learning rate: Starting from 0.05, the learning rate decays every 2500 steps with a base of 0.96. We used the Adagrad optimizer due to its suitability for training on sparse data and its ability to perform more informed gradient-based learning [23]. In addition, we used an early stopping technique in order to make the learning process more time-efficient. Since the target ambiguous words in the MSH WSD dataset do not have the same number of training items, we decided to save a checkpoint whenever the lowest validation loss is noted. We restored the checkpoint and stopped the training if the prediction accuracy on the validation set did not improve after 5 epochs. This method applies dynamic training epochs and hence makes the training more flexible for different input files. All our models were implemented in TensorFlow [24].
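The schedule and the stopping rule can be sketched as follows. This is an illustrative reading rather than the training code itself: the staircase form of the decay (a drop at every multiple of 2500 steps) and the choice of validation loss as the monitored quantity are assumptions, and the function names are hypothetical.

```python
def learning_rate(step, initial=0.05, decay_rate=0.96, decay_steps=2500,
                  staircase=True):
    """Exponentially decayed learning rate.

    With staircase=True (assumed here) the rate drops once every
    `decay_steps` steps; otherwise it decays continuously.
    """
    exponent = step // decay_steps if staircase else step / decay_steps
    return initial * decay_rate ** exponent


def early_stopping(val_losses, patience=5):
    """Return (best_epoch, stop_epoch) for a sequence of validation losses.

    The best epoch is where a checkpoint would be saved; training stops once
    `patience` epochs pass without a new lowest loss (assumed criterion).
    """
    best_loss, best_epoch = float("inf"), -1
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss:
            best_loss, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return best_epoch, epoch
    return best_epoch, len(val_losses) - 1
```

Because the stopping epoch depends on the validation curve, words with small training files can stop early while larger ones train longer.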
WSD Methods based on the Attention Model
Our attention model has the same upper layers as our BiLSTM neural network model. That is, as we mentioned, we only use the attention layer to replace the BiLSTM network layer when switching from the BiLSTM neural network model to the attention model. The structures (i) through (iv), the max-pooling layer and the optional layer \(\mathcal {C}\) are all the same. We shall note that the attention architecture in this work is mainly based on the self-attention encoder and decoder in [18]. To be specific, we only apply the encoder architecture from [18], since our WSD task is not complicated enough to apply the decoder. One can find an overall discussion on the self-attention encoder and decoder models in [18].
We shall use five parts in this subsection to present the attention architecture used in this paper: Part one serves as a general introduction, indicating that our attention architecture is actually a stack of several identical attention layers (or identical encoder layers). Part two introduces the scaled dot-product attention, the core attention mechanism in an encoder layer. Part three indicates that in each encoder layer, a number of scaled dot-product attentions shall operate in parallel to provide multiple outputs, which will then be concatenated and projected into one final attention output. This process is called multi-head self-attention (or multi-head attention in short). Part four shows the structure of the feed-forward network, which is located on top of the multi-head attention within an encoder layer. Finally, part five indicates how the order of sequence of the input words is encoded into the input embeddings.
General attention architecture: A stack of identical layers
Our attention architecture is the encoder of the Transformer in [18]. We only use the encoder of the Transformer since our task is to provide the correct label on the sense of the ambiguous word based on the inputs, whose complexity is not high enough for a decoder. The complete encoder is a stack of two identical encoder layers. The structure of one encoder layer is shown in Fig. 3.
Similar to the BiLSTM model, the input to the attention architecture is pre-trained word embeddings X=(x_{1},⋯,x_{T}). Then, the attention architecture operates on the input to obtain an output Y=(y_{1},⋯,y_{T}). In our model, both the input embeddings and the output embeddings have the same dimensions \(d_{\mathbf {x}_{t}}=d_{\mathbf {y}_{t}}=d_{{model}}=200\).
Each layer of the encoder consists of two sublayers: A multi-head self-attention sublayer, and a position-wise fully connected feed-forward sublayer. Also as shown in Fig. 3, a residual connection is applied to each sublayer before a normalization. That is, suppose the input to one sublayer is X and the functional implementation of this sublayer is Sublayer(X). Then, the output of this sublayer is LayerNorm(X+Sublayer(X)).
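The residual-then-normalize pattern LayerNorm(X+Sublayer(X)) can be sketched as below. This is a simplified illustration: the learned gain and bias that layer normalization usually carries are omitted, the `eps` value is an assumption, and the function names are hypothetical.

```python
import math


def layer_norm(vector, eps=1e-6):
    """Normalize one vector to zero mean and unit variance.

    The learned gain and bias of a full layer normalization are omitted,
    and `eps` is an assumed small constant for numerical stability.
    """
    mean = sum(vector) / len(vector)
    variance = sum((v - mean) ** 2 for v in vector) / len(vector)
    return [(v - mean) / math.sqrt(variance + eps) for v in vector]


def residual_sublayer(X, sublayer):
    """Compute LayerNorm(X + Sublayer(X)) row by row for a T x d matrix X."""
    out = sublayer(X)
    return [layer_norm([x + o for x, o in zip(x_row, o_row)])
            for x_row, o_row in zip(X, out)]


# Example with an identity sublayer, so that X + Sublayer(X) = 2X.
X = [[1.0, 2.0, 3.0]]
normalized = residual_sublayer(X, lambda M: M)
```

The residual sum requires Sublayer(X) to have the same shape as X, which is why every sublayer in the encoder keeps the dimension d_{model}.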
Within the multi-head self-attention sublayer, the major attention operations are implemented by the mechanism called Scaled Dot-Product Attention, which is a self-attention mechanism. These structures are shown in Fig. 4.
We introduce the scaled dot-product attention in the next part, and subsequently the multi-head attention.
Scaled Dot-Product Attention
The initial inputs to the scaled dot-product attention are the input embeddings X=(x_{1},⋯,x_{T}). But X is not the direct input: Consider X=(x_{1},⋯,x_{T}) as the stack of each input embedding x_{t}, which makes X a T×d_{model} matrix. Then, three matrices are generated based on X as the direct inputs to scaled dot-product attention as:
∙ The Query Q=XW^{Q}, where W^{Q} is a d_{model}×d_{q} matrix and hence Q is a T×d_{q} matrix.
∙ The Key K=XW^{K}, where W^{K} is a d_{model}×d_{k} matrix and hence K is a T×d_{k} matrix. And the scaled dot-product attention follows d_{q}=d_{k}.
∙ The Value V=XW^{V}, where W^{V} is a d_{model}×d_{v} matrix and hence V is a T×d_{v} matrix.
We will introduce the operation of the scaled dot-product attention based on each row of the matrices Q, K and V. As such, the intuitive meaning of each matrix can be reflected. Take the first word embedding x_{1} as an example: We create its query vector q_{1}=x_{1}W^{Q}, its key vector k_{1}=x_{1}W^{K} and its value vector v_{1}=x_{1}W^{V}. Then, we take the inner product p_{t}=q_{1}·k_{t}^{T} of the query vector q_{1} with each of the key vectors k_{1},⋯,k_{T} generated via k_{t}=x_{t}W^{K} (Here, the capital T on k_{t}^{T} means vector transposition, which has nothing to do with time steps of the input).
Once scaled, the product p_{t} represents the amount of attention the word w_{1} shall put onto the word w_{t}. So, “query” means a “consultation” from a word, and “key” means the own property of a word. The inner product between the query vector of one word and the key vector of another represents how important the latter word is to the former one, according to the level of the match between the key and the query.
Then, a softmax is implemented on p_{1} through p_{T} after dividing them by \(\sqrt {d_{k}}\):
\(s_{t}=\frac {\exp (p_{t}/\sqrt {d_{k}})}{\sum _{j=1}^{T}\exp (p_{j}/\sqrt {d_{k}})}, \quad t=1,\cdots ,T.\)
Finally, a weighted summation is implemented between (s_{1},⋯,s_{T}) and (v_{1},⋯,v_{T}) to get the output z_{1} with respect to the input embedding x_{1} as: \(\mathbf {z}_{1}=\sum _{t=1}^{T}s_{t}\mathbf {v}_{t}\).
The vectors z_{2},⋯,z_{T} are generated similarly with respect to x_{2},⋯,x_{T}. Then, the stack of vectors Z=(z_{1},⋯,z_{T}) is the output matrix of the scaled dotproduct attention. Since vector z has dimension d_{v} as vector v, we have that the matrix Z is T×d_{v}.
In addition, the inner products between vectors q and k can be represented by matrix multiplication. Combining the softmax weighted summation, we can use the following concise formula to represent the operation of scaled dot-product attention:
\(\text {Attention}(\mathbf {Q},\mathbf {K},\mathbf {V})=\text {softmax}\left (\frac {\mathbf {Q}\mathbf {K}^{T}}{\sqrt {d_{k}}}\right)\mathbf {V}\)
where the softmax is applied to each row separately.
The reason to divide the dot product between query vectors and key vectors by \(\sqrt {d_{k}}\) is that the dot product grows large in magnitude if d_{k} is large, and a large magnitude of the dot product pushes the softmax function into a region with small gradient. As a result, the scaled dot-product attention divides the dot product between q and k by \(\sqrt {d_{k}}\) to control the magnitude.
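The whole operation, from the projections Q, K and V to the weighted sum of values, can be sketched in plain Python. This is a minimal illustration on nested lists, not the TensorFlow implementation used in this work; the helper names are hypothetical.

```python
import math


def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def softmax(values):
    """Numerically stable softmax over a list of numbers."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(X, W_q, W_k, W_v):
    """Return (Z, weights) where Z = softmax(Q K^T / sqrt(d_k)) V."""
    Q, K, V = matmul(X, W_q), matmul(X, W_k), matmul(X, W_v)
    d_k = len(K[0])
    K_transposed = [list(col) for col in zip(*K)]
    scores = matmul(Q, K_transposed)  # T x T inner products p_t
    weights = [softmax([p / math.sqrt(d_k) for p in row]) for row in scores]
    return matmul(weights, V), weights  # Z is T x d_v


# Toy example: T = 2, d_model = d_k = d_v = 2, identity projections.
identity = [[1.0, 0.0], [0.0, 1.0]]
Z, weights = scaled_dot_product_attention(identity, identity, identity, identity)
```

In this toy case each word attends mostly to itself, because its query matches its own key better than the other word's key.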
However, a single scaled dot-product attention will not be the final output of the attention mechanism in one identical encoder layer. Instead, the final output is based on a concatenation of multiple scaled dot-product attention mechanisms, which also generate the output tensor with the same shape as the input embeddings X.
Multi-head attention
Instead of implementing a single scaled dot-product attention, the authors of [18] implemented multiple of them in parallel. That is, multiple sets of queries, keys and values are generated based on the same input embeddings X, and then a scaled dot-product attention is implemented on each set in parallel. After that, the outputs from these scaled dot-product attentions are concatenated and projected linearly to get a final output. This process is shown on the right-hand side of Fig. 4.
That is,
\(\text {MultiHead}(\mathbf {X})=\text {Concat}(\text {head}_{1},\cdots ,\text {head}_{h})\mathbf {W}^{O}, \quad \text {head}_{i}=\text {Attention}(\mathbf {X}\mathbf {W}_{i}^{Q},\mathbf {X}\mathbf {W}_{i}^{K},\mathbf {X}\mathbf {W}_{i}^{V}),\)
where the projection matrices \(\mathbf {W}_{i}^{Q}\in \mathbb {R}^{d_{model}\times d_{q}}\), \(\mathbf {W}_{i}^{K}\in \mathbb {R}^{d_{model}\times d_{k}}\) and \(\mathbf {W}_{i}^{V}\in \mathbb {R}^{d_{model}\times d_{v}}\) for i=1,2,⋯,h are independent in different heads. The final projection matrix is \(\mathbf {W}^{O}\in \mathbb {R}^{h d_{v}\times d_{model}}\). Similar to the single scaled dot-product attention, we always have d_{q}=d_{k}.
In this work, we apply multiple h values. We always set d_{k}=d_{v}=d_{model}/h in the multi-head attention sublayer. No matter what values d_{k}, d_{v} are, the final output Z of the multi-head attention sublayer is always a T×d_{model} matrix. This is because the concatenation of the heads generates a T×hd_{v} matrix, which is multiplied by \(\mathbf {W}^{O}\in \mathbb {R}^{h d_{v}\times d_{model}}\) to produce the output \(\mathbf {Z}\in \mathbb {R}^{T\times d_{model}}\). As a result, Z has the same shape as the input embeddings X, so that the residual connection can be implemented.
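The shape argument above can be checked with a small sketch of multi-head attention in plain Python. The weights are random placeholders, the sizes (T=3, d_{model}=4, h=2) are toy values chosen for the example rather than the settings used in the experiments, and the helper names are hypothetical.

```python
import math
import random


def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def softmax(values):
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = len(K[0])
    scores = matmul(Q, [list(col) for col in zip(*K)])
    weights = [softmax([p / math.sqrt(d_k) for p in row]) for row in scores]
    return matmul(weights, V)


def multi_head_attention(X, heads, W_o):
    """Run one attention per head, concatenate the outputs, project with W_o."""
    outputs = [attention(matmul(X, W_q), matmul(X, W_k), matmul(X, W_v))
               for W_q, W_k, W_v in heads]
    concat = [[value for head in outputs for value in head[t]]
              for t in range(len(X))]  # T x (h * d_v)
    return matmul(concat, W_o)  # T x d_model


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
T, d_model, h = 3, 4, 2          # toy sizes, not the paper's settings
d_k = d_v = d_model // h
X = random_matrix(T, d_model, rng)
heads = [(random_matrix(d_model, d_k, rng),
          random_matrix(d_model, d_k, rng),
          random_matrix(d_model, d_v, rng)) for _ in range(h)]
W_o = random_matrix(h * d_v, d_model, rng)
Z = multi_head_attention(X, heads, W_o)  # same shape as X
```

Since d_{v}=d_{model}/h, the concatenated heads already have width d_{model}, and the projection W^{O} mixes information across heads without changing the shape.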
The multi-head attention is more beneficial than a single scaled dot-product attention for several reasons. By applying multiple attention heads, the final linear projection W^{O} can provide the result based on independent attention outputs, which will reduce the error rate. We will discuss this further in the “Discussion” section.
However, the final output from the multi-head attention sublayer is not directly used as the encoder output. A feed-forward neural network operates on the output from the multi-head attention sublayer in order to get a further filtered and projected result. We shall introduce the structure of the feed-forward network in part four of this subsection, and provide a brief analysis on why we need a feed-forward neural network on top of the attention mechanism in the “Discussion” section.
Position-wise feed-forward networks
Above the attention sublayer in one identical encoder, there is the position-wise feed-forward network, whose input is the output Z of the multi-head attention sublayer. The position-wise feed-forward network operates on each row z_{t} of Z (that is, on each position) separately and identically. This operation consists of two linear transformations and a ReLU activation in between. That is, suppose Z=(z_{1},⋯,z_{T}) with \(\mathbf {z}_{t}\in \mathbb {R}^{d_{model}}\). Then, we have that
\(\text {FFN}(\mathbf {z}_{t})=\max (0,\mathbf {z}_{t}\mathbf {W}_{1}+\mathbf {b}_{1})\mathbf {W}_{2}+\mathbf {b}_{2},\)
where the maximum (ReLU activation) is performed identically on each dimension.
While the linear transformations are the same for all z_{t}, the parameters {W_{1},W_{2},b_{1},b_{2}} can be different from layer to layer. We always have \(\mathbf {W}\in \mathbb {R}^{d_{model}\times d_{model}}\) and \(\mathbf {b}\in \mathbb {R}^{d_{model}}\), so that the final output FFN(Z) shall have the same dimensionality as the input embeddings X.
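A position-wise feed-forward network of this form, two linear maps with a ReLU in between applied to every position with the same parameters, can be sketched as follows. The function name and the toy parameter values are illustrative assumptions.

```python
def position_wise_ffn(Z, W1, b1, W2, b2):
    """Apply max(0, z W1 + b1) W2 + b2 to every row z of Z.

    Z is T x d_model; W1 and W2 are d_model x d_model (lists of rows);
    b1 and b2 have length d_model. The same parameters serve all positions.
    """
    output = []
    for z in Z:
        hidden = [max(0.0, sum(zi * w for zi, w in zip(z, column)) + bias)
                  for column, bias in zip(zip(*W1), b1)]
        output.append([sum(hj * w for hj, w in zip(hidden, column)) + bias
                       for column, bias in zip(zip(*W2), b2)])
    return output


# Toy parameters: identity weights, so the ReLU effect is easy to follow.
identity = [[1.0, 0.0], [0.0, 1.0]]
result = position_wise_ffn([[1.0, -2.0], [-3.0, 4.0]],
                           identity, [0.0, 0.0], identity, [1.0, 1.0])
```

With identity weights the negative entries are clipped to zero by the ReLU before the second bias is added, so `result` is [[2.0, 1.0], [1.0, 5.0]].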
Then, as we mentioned at the beginning of this subsection, the same upper layers as in the BiLSTM neural network model are applied on top of the attention architecture here. That is, the output FFN(Z) from each of the two identical encoder layers shall play the same role as the output from each BiLSTM layer, so that structure (i) through structure (iv) may be performed right after.
Note that the attention architecture introduced in the above four parts contains no convolutional or recurrent structure to deal with sequential information. As a result, an additional method is required to make use of the order of sequence of the input words. This method is called positional encoding, which is discussed in the final part of this subsection as follows.
Positional Encoding
In language modeling, the order of the input sequence usually contains important information. So a language modeling system should have an efficient mechanism to make use of the order of the input sequence. Although the attention architecture in [18] has no convolutional or recurrent structure to process the sequential information, an encoding method is applied to directly encode the order of input words into the input embeddings.
That is, for the input embeddings X=(x_{1},⋯,x_{T}), the positional embeddings PE=(pe_{1},⋯,pe_{T}) are generated with
\(pe_{t,2i}=\sin \left (t/10000^{2i/d_{model}}\right), \quad pe_{t,2i+1}=\cos \left (t/10000^{2i/d_{model}}\right),\)
where pe_{t,2i} denotes the (2i)-th dimension of the positional embedding pe_{t}, so that i indexes pairs of dimensions. Then a simple matrix addition X+PE provides the actual input embeddings to the attention architecture. In this way, the positional information is injected into the embedding by the sinusoid function.
The reason to choose sinusoid functions is based on the assumption that they would allow the model to easily learn to attend by relative positions, since for any fixed k, pe_{t+k} can be represented as a linear function of pe_{t}.
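The sinusoidal positional embeddings of [18] and their addition to the input embeddings can be sketched as below. Positions are counted from 0 in this sketch, which is an assumption (the text indexes words from 1), and the function names are hypothetical.

```python
import math


def positional_encoding(T, d_model):
    """Sinusoidal positional embeddings as a T x d_model list of rows.

    Even dimensions use sine and odd dimensions use cosine of the same angle.
    Positions start at 0 here (an assumption of this sketch).
    """
    pe = [[0.0] * d_model for _ in range(T)]
    for t in range(T):
        for even in range(0, d_model, 2):  # `even` plays the role of 2i
            angle = t / (10000 ** (even / d_model))
            pe[t][even] = math.sin(angle)
            if even + 1 < d_model:
                pe[t][even + 1] = math.cos(angle)
    return pe


def add_positions(X):
    """Return X + PE for a T x d_model embedding matrix X."""
    pe = positional_encoding(len(X), len(X[0]))
    return [[x + p for x, p in zip(x_row, p_row)] for x_row, p_row in zip(X, pe)]


encoded = add_positions([[0.0] * 4 for _ in range(3)])
```

Each pair of dimensions oscillates at its own wavelength, so nearby positions receive similar encodings in the slow dimensions and distinct ones in the fast dimensions.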
Results
In this section, we introduce our experimental settings and results, and compare our results with those from other papers. Both our BiLSTM model and our self-attention model perform well. In general, however, our BiLSTM model works better, providing state-of-the-art performance on the MSH WSD dataset.
We used the same experimental settings for both of our approaches. The word embeddings are pretrained with the skip-gram model of Mikolov et al. [12] on the joint dataset Wikipedia + PubMed + PMC. All our models were trained on the MSH WSD dataset, which consists of 203 separate files. Each file contains around 200 paragraphs, which form the training corpus for a specific biomedical ambiguous word whose location is marked in each paragraph. However, 17 of the 203 ambiguous words do not have a corresponding pretrained embedding. As a result, our training and evaluation were performed on the remaining 186 words.
We used the same training method for these two approaches. We trained a single BiLSTM neural network model on all 186 words simultaneously to get a universal WSD network; in other words, we merged all 186 datasets into a large one and trained the model on it. We also trained one BiLSTM neural network model on the dataset of each ambiguous word w to get 186 word-specific WSD networks in total. Similarly, we trained a universal WSD attention model on the merged dataset, and 186 word-specific WSD attention models, one on the dataset for each ambiguous word w.
We found that the model using the whole paragraph as the input yields better results than the one using 25 words before and after the target ambiguous word in the paragraph. However, for a deep neural network, training on the whole-paragraph input is much more expensive than training on the 25-25-word input. As a result, we trained most of our BiLSTM neural network models using the 25-25-word input. We then chose the neural network model with the best performance on the 25-25-word input and trained it on inputs consisting of whole paragraphs. On the other hand, training the attention model is much more efficient than training the deep networks. As a result, we always used the entire paragraph as the input for the attention model.
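The 25-25-word input can be sketched as a simple slicing step. This is an illustration only: the function name `context_window`, the choice to keep the target word itself inside the window, and the truncation at paragraph boundaries are assumptions, not a description of our preprocessing code.

```python
def context_window(tokens, target_index, width=25):
    """Return up to `width` tokens on each side of the target word.

    The target token itself is kept (an assumption), and the window is
    truncated when the target is close to either end of the paragraph.
    """
    start = max(0, target_index - width)
    end = min(len(tokens), target_index + width + 1)
    return tokens[start:end]

# Hypothetical paragraph of 100 tokens with the ambiguous word at position 50.
paragraph = [f"w{i}" for i in range(100)]
window = context_window(paragraph, target_index=50)
```

Under these assumptions a full window holds 51 tokens, while a target near the start of a paragraph yields a shorter input.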
When training both universal and word-specific models with both deep network and self-attention architectures, we always randomly picked 70% of the paragraphs as the training set, while 10% and 20% of the paragraphs were used as the validation and testing sets, respectively.
Based on the above settings and the early stopping technique, our best performance reached a test accuracy of 96.00%, which came from the word-specific BiLSTM neural network model with the timestep concatenation (structure (iii)) and without layer \(\mathcal {C}\), trained on the whole-paragraph inputs. Then, since structure (iii) provides the best result for the BiLSTM neural network model among all four structures, we only trained our attention model (both word-specific and universal) with structure (iii). After trying several different values of h in the multi-head attention, we found that the attention model is in general slightly less accurate than the neural network models, although its training is 3 to 4 times faster. The best test accuracy of the word-specific attention model is 93.94%, coming from the model with h=2 trained on the whole-paragraph inputs. We show the results from our word-specific models in Table 1.
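A random 70/10/20 split of this kind can be sketched as follows. The function name, the fixed seed, and the rounding of the split sizes are assumptions introduced for the example.

```python
import random

def split_paragraphs(paragraphs, seed=0, train=0.7, valid=0.1):
    """Shuffle the paragraphs and cut them into train/validation/test parts."""
    shuffled = list(paragraphs)
    random.Random(seed).shuffle(shuffled)      # seeded for reproducibility
    n_train = int(round(train * len(shuffled)))
    n_valid = int(round(valid * len(shuffled)))
    train_set = shuffled[:n_train]
    valid_set = shuffled[n_train:n_train + n_valid]
    test_set = shuffled[n_train + n_valid:]    # the remaining ~20%
    return train_set, valid_set, test_set

# A file with about 200 paragraphs, as in the MSH WSD dataset.
train_set, valid_set, test_set = split_paragraphs(range(200))
```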
Here, we use basic NN, sum NN, cctT NN and cctV NN to refer to the BiLSTM neural network model with structure (i), (ii), (iii), and (iv), respectively. All four of these models were trained with the 25-25-word input. The CctT NN wholeparagh refers to the BiLSTM neural network model with structure (iii) trained on the whole-paragraph input. Attention cctT wholeparagh refers to the word-specific attention model with structure (iii) trained on the whole-paragraph inputs. From Table 1, we can see that the optional layer \(\mathcal {C}\) actually reduces the accuracy of word-specific models, which will be further discussed in the “Discussion” section.
Finally, we randomly chose approximately 90 percent of the paragraphs from each file as our training set, and used the remaining 10 percent of the paragraphs as our testing set. Then, our best performing model, the BiLSTM neural network model under structure (iii) without layer \(\mathcal {C}\), was trained with the whole-paragraph inputs. The training always stops after 50 epochs, since there is no validation set for early stopping. With these settings, our testing accuracy reaches 97.14%. To the best of our knowledge, this result surpasses the existing state-of-the-art WSD performance on the MSH WSD dataset [15].
Since training the universal network is time-consuming, we only trained the universal BiLSTM neural network model with structure (iii), the best performing structure. The optional layer \(\mathcal {C}\) significantly improved the performance of our universal neural network model: After applying the optional layer \(\mathcal {C}\), the test accuracy of our universal neural network model increases from 80.72% to 88.75%. We believe that the target ambiguous word is reemphasized by the optional layer \(\mathcal {C}\), which significantly improves the prediction accuracy of the universal network. This issue will be discussed further in the “Discussion” section.
We also trained our universal attention models with structure (iii) on several different values of h in the multi-head attention. The one with h=4 provides the best test accuracy of 82.40% with layer \(\mathcal {C}\), and 63.50% without it. It is interesting that the best universal attention model has four heads in the multi-head attention architecture, while the best word-specific one only has two. We will provide our analysis of why the universal attention model needs more heads than the word-specific ones in the “Discussion” section.
Among the 203 biomedical ambiguous words in the MSH WSD dataset, 20 have often been used as comparison baselines for WSD models. For instance, Festag et al. showed their disambiguation result on these 20 words in [14], applying a Recurrent Convolutional Neural Network (RCNN) as their WSD model. Jimeno-Yepes et al. provided multiple approaches to biomedical WSD in [25] with these 20 words. Their best performance came from a supervised Naive Bayes (NB) model with the WEKA data mining package. We will compare the results from both our BiLSTM model and our attention model with the result from the RCNN model in [14], and the result from the supervised NB model in [25], which are shown in Tables 2 and 3.
Table 2 shows the results from our word-specific models (both BiLSTM neural networks and attention) and the two baselines. According to the experience described above, all our word-specific models were trained without layer \(\mathcal {C}\). Our word-specific attention model has h=2 in the multi-head attention. We can see that our word-specific BiLSTM neural network models with all four structures outperform the two baselines, and achieve accuracy higher than 95%. Our word-specific attention model performs almost as well as Baseline 2, and even better than Baseline 1. In comparison to the baselines, our word-specific BiLSTM neural network model with structure (iii) trained on whole-paragraph inputs achieves higher accuracy on some difficult words such as Phosphorylase, which have very similar senses and are therefore difficult to disambiguate [14]. As we can see, for these 20 words, our word-specific deep network model with structure (ii) performs on average better than that with structure (iii) by 0.5%. Nevertheless, this does not hold true when we average the results over all 186 words. Because of that, we also evaluated an ensemble of all four structures. This gave us an interesting insight into the dataset, which will be further explained in the next section.
Table 3 shows the results from our universal models (both BiLSTM neural networks and attention) and the two baselines. Both of our universal models are under structure (iii), with h=4 in the multi-head attention, and equipped with layer \(\mathcal {C}\). We can see that our universal deep network model performs almost as well as the model from Baseline 1. However, we found that our universal attention model failed to provide results with satisfying accuracy. It seems that the attention model is only suitable for word-specific tasks. We believe that this is because the attention structure is too simple to process the large, merged data containing multiple ambiguous words. We shall discuss this issue in detail in the next section.
Discussion
In this section, we provide further analysis and comparison of WSD methods based on the BiLSTM neural network model and the self-attention model. Then, we will discuss some issues of the dataset. We believe that some labels may not be accurate, which places an upper bound on the accuracy.
Why does LSTM cell work?
In this subsection, we provide mathematical analysis on how the gate operations in an LSTM cell resolve gradient vanishing (or blowup) along the time steps.
The purpose of gate operations in an LSTM cell is to ensure that the memory is not impacted by gradient vanishing (or blowup), which is the major defect of a typical RNN. That is, the gradient ∂o_{t+K}/∂c_{t} should not vanish (or blow up) when K is large. Applying the chain rule for partial derivatives to Eqs. (1), (2), (3), (4) and (5) in the “Methods” section, we have
It is easy to see that setting f_{t}≡1 is a simple way to avoid gradient vanishing. According to Eq. (2), the direct dependent variables of f_{t+k} are x_{t+k} and h_{t+k−1}. Due to the variance from the inputs and the complicated interactions among the gates in LSTM cells, vectors x_{t+k} and h_{t+k−1} possess time-step-independent distributions. Accordingly, f_{t+k} can be inferred to be time-step independent as well. Therefore, when K is large, the number of f_{t+k} with extremely large absolute value will be almost equal to the number of those with extremely small absolute value due to time-step independence, and hence they shall cancel each other in the product \(\prod _{k=1}^{K}\mathbf {f}_{t+k}\). As a result, ∂o_{t+K}/∂c_{t} will not vanish or blow up when K is large. This property guarantees that c_{t} can be efficiently updated by the cross-entropy error from many steps ahead, which explains why the LSTM cell can remember long-term dependencies.
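The role of the product \(\prod _{k=1}^{K}\mathbf {f}_{t+k}\) can be illustrated numerically. The sketch below assumes the standard LSTM cell update c_{t} = f_{t} ⊙ c_{t−1} + i_{t} ⊙ g_{t} (an assumption about the form of the cell equations) and looks only at the direct path through the cell state, along which the gradient is scaled by the elementwise product of the forget gates. The gate values are hypothetical constants chosen for the example.

```python
import numpy as np

def direct_path_factor(forget_gates: np.ndarray) -> np.ndarray:
    """Elementwise product of forget gates f_{t+1}, ..., f_{t+K}.

    `forget_gates` has shape (K, d); the result has shape (d,) and scales
    the gradient flowing from c_{t+K} back to c_t along the cell-state path.
    """
    return np.prod(forget_gates, axis=0)

K, d = 100, 4
always_open = np.ones((K, d))        # f_t identically 1
half_open = np.full((K, d), 0.5)     # hypothetical constant gate value

factor_open = direct_path_factor(always_open)   # stays at 1 for any K
factor_half = direct_path_factor(half_open)     # 0.5**100, about 7.9e-31
```

With f_{t}≡1 the factor stays at 1 however large K is, which is the simple case mentioned above; the second case shows how quickly the same product shrinks when every gate is held at a constant value below 1.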
Why does timestep concatenation work best?
In our experiments, we hope the contextual information can be captured by our BiLSTM network in an explicit process. In order to do so, we adapted the ideas behind the ELMo model in [16], combining with our own upper layer design to fit the WSD task better. As shown in the “Methods” and “Results” sections, the timestep concatenation structure performs the best.
Our explanation is that concatenating the BiLSTM outputs along the timestep better preserves both output layers, compared with the basic BiLSTM model and the weighted summation one. As a result, the max-pooling operation has a better chance to capture the information from the first layer output, which increases the utilization of the first layer nodes of the BiLSTM. In contrast, although the output vectors from both layers are also preserved by the concatenation along the vector, this concatenation increases the vector’s dimension and the max-pooling dimension. For this reason, redundant information may be included in the max-pooled vector, so that useful information is outweighed. The experimental results coincide with this analysis: Although both of the concatenation models perform better than the basic BiLSTM model and the weighted summation one, the timestep concatenation structure slightly outperforms the vector concatenation one.
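The difference between the two concatenation structures can be sketched with NumPy. The arrays `layer1` and `layer2` are random stand-ins for the outputs of the two BiLSTM layers, and max-pooling is assumed to run over the time axis; both are assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
T, d = 6, 4                                # hypothetical length and output size
layer1 = rng.normal(size=(T, d))           # stand-in for first-layer outputs
layer2 = rng.normal(size=(T, d))           # stand-in for second-layer outputs

# Structure (iii): concatenate along the timestep, then max-pool over time.
cct_t = np.concatenate([layer1, layer2], axis=0)    # shape (2T, d)
pooled_t = cct_t.max(axis=0)                        # shape (d,)

# Structure (iv): concatenate along the vector, then max-pool over time.
cct_v = np.concatenate([layer1, layer2], axis=1)    # shape (T, 2d)
pooled_v = cct_v.max(axis=0)                        # shape (2d,)
```

In this sketch the timestep concatenation keeps the pooled vector at size d, with each entry taken from whichever layer has the larger value, whereas the vector concatenation doubles the pooled dimension.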
Why do we desire an attention model?
As mentioned in the “Results” section, our self-attention model trains three to four times faster than our neural network model. Yet computational efficiency is not the only consideration for attention models. According to the initial work [18] on the self-attention model, three aspects are considered when evaluating a language processing system: the computational efficiency, the number of sequential operations required, and the path length that signals have to travel in the system. Since the LSTM neural network already possesses excellent path length, we do not need a self-attention model merely to improve the path length. So, we only discuss the first two aspects here.
When comparing the computational complexity of different structures, one usually takes the operations needed in each layer, i.e., the computational complexity per layer, as the measurement. According to the attention structure introduced in the “Methods” section, the major operations in an identical encoder layer come from the matrix multiplications. So by a routine calculation, we get the computational complexity per attention layer as \(\mathcal {O}(T^{2}\cdot d_{model})\), where T is the length of the input and d_{model} is the dimension of the embedding. On the other hand, the computational complexity per recurrent layer is \(\mathcal {O}(T\cdot {d_{model}}^{2})\), and the computational complexity per convolutional layer is \(\mathcal {O}(k\cdot T\cdot {d_{model}}^{2})\), where k is the kernel size of the convolutions. For most state-of-the-art language models, the embedding size d_{model} is around 150 to 500, which is usually larger than the average temporal length T of the inputs. In that case, we have T^{2}·d_{model} < T·d_{model}^{2} < k·T·d_{model}^{2} for k > 1. This relationship is consistent with the attention model operating faster, and the convolutional neural network operating slower, than the recurrent neural network.
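The per-layer operation counts above can be compared for concrete sizes. The values of T, d_model and k below are hypothetical, and the counts ignore constant factors, so this is only a rough sketch of the asymptotic comparison.

```python
def per_layer_cost(T: int, d_model: int, k: int = 3) -> dict:
    """Leading-order operation counts per layer, ignoring constant factors."""
    return {
        "self_attention": T ** 2 * d_model,
        "recurrent": T * d_model ** 2,
        "convolutional": k * T * d_model ** 2,
    }

# A short input (e.g. a 25-25-word window) with T < d_model.
short = per_layer_cost(T=51, d_model=200)

# A long input with T > d_model: the attention layer is no longer cheapest.
long = per_layer_cost(T=400, d_model=200)
```

For the short input the ordering matches the relationship in the text, while for the long input the attention count exceeds the recurrent one, which shows that the comparison depends on T being smaller than d_model.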
The number of sequential operations required refers to the number of operations that must be carried out one after another across the time steps in each layer. The recurrent operation has to be performed at each time step, giving an RNN \(\mathcal {O}(T)\) sequential operations. The convolutional operation with kernel width k and stride k brings \(\mathcal {O}(T/k)\) sequential operations to a typical CNN language model (unlike the stride-one convolutions common in vision, a CNN language model usually has a stride equal to the kernel width). In contrast, the matrix multiplication in an identical encoder layer covers all time steps at once, giving the self-attention model \(\mathcal {O}(1)\) sequential operations. As a result, the self-attention model has an advantage over RNN or CNN structures in terms of sequential operations. Since sequential operations limit how much of the computation can be carried out in parallel, an attention model can reduce the actual computing time required for training.
Therefore, although it is slightly less accurate than a deep recurrent neural network, an attention model remains desirable for word sense disambiguation.
Why does an attention model fail to yield better accuracy than deep networks?
Although efficient on sequential inputs, a self-attention model is limited in some respects. According to the experimental results, the major defect of an attention model is its unsatisfactory accuracy compared to deep neural network models.
We need to pay attention to two facts. First, in the case of word-specific WSD models, our best attention model yields a test accuracy that is about 2 percent lower than that of our best BiLSTM neural network model, and in the case of universal models, our attention model failed to provide a satisfactory result. Second, the best universal attention model has four heads in the multi-head attention architecture, while the best word-specific attention model only has two. We believe that these facts from our experiments suggest a plausible reason for the suboptimal accuracy of an attention model: The attention structure is too simple to learn complicated representations from big and complex data.
As analyzed in the subsection above, the complex structure of each LSTM cell enables the network to learn complicated temporal representations from the input data. The combination of various activation functions within each cell, as well as the stack of layers, enables the deep network to establish complicated decision regions in the input space. In contrast, the multi-head attention itself does not involve such activation functions: its major operations are parallel matrix multiplications. As a result, we believe that the representation from an attention model is quite restricted by the simplicity of its structure. This restricted ability in representation learning lets an attention model barely provide satisfying results on small, word-specific datasets, yet fail on the merged, complicated big dataset for universal learning, even with an increased number of heads in the multi-head attention architecture.
We can clearly see the tradeoff made by the attention model: sacrificing accuracy in order to obtain computational efficiency. Given the efficiency gained, it is an acceptable defect for the attention model to be less accurate than the deep network.
Why do we need a feedforward network on top of the attention structure?
Note that the only activation function in an identical encoder layer is the ReLU function in the feedforward sublayer on top of the self-attention structure. As in a convolutional neural network or a recurrent neural network, the activation functions serve as filters, which eliminate the redundant information and keep the relevant representations. As a result, the errors and redundant information from the self-attention sublayers can be filtered out by the ReLU gate in the feedforward sublayer, making the final output more useful for disambiguation. We believe that this is the major reason for the self-attention model to have a feedforward network on top of the attention sublayer.
It is often desirable to keep a balance between accuracy and computational efficiency. For this reason, the feedforward network in [18] consists of only two matrix transformations, of which the first is equipped with the ReLU function. As such, the activation function in the feedforward network serves as a filter to improve the performance, while avoiding increasing the computational complexity of the self-attention model too much. A deep feedforward sublayer, in contrast, would consume the computational efficiency obtained by the attention structure.
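A feedforward sublayer of this form, two matrix transformations with ReLU after the first, can be sketched in a few lines of NumPy. The bias terms and the toy weights are assumptions for the example; they are not the trained parameters of our model.

```python
import numpy as np

def feed_forward(x, W1, b1, W2, b2):
    """Position-wise feedforward sublayer: ReLU(x W1 + b1) W2 + b2."""
    hidden = np.maximum(0.0, x @ W1 + b1)   # ReLU zeroes the negative entries
    return hidden @ W2 + b2

# Toy example with identity weights so the filtering effect is visible.
x = np.array([[1.0, -1.0], [-2.0, 3.0]])    # two time steps, d_model = 2
W1 = W2 = np.eye(2)
b1 = b2 = np.zeros(2)
y = feed_forward(x, W1, b1, W2, b2)
```

With identity weights the output is simply the input with its negative entries set to zero, which is the filtering behaviour described above.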
Why is multi-head attention better than a single scaled dot-product attention?
We believe that the researchers in [18] concatenate the outputs of the single scaled dot-product attentions into a multi-head structure because the multi-head attention enables a “vote”, or averaging, over the outputs of the single attention structures, so that the chance of making mistakes is reduced.
In addition, running the single attention structures in parallel should enable the attention model to build flexible decision regions and learn complicated representations, which to a certain extent mitigates the limited ability of the attention model in representation learning. Meanwhile, the parallel computation in the multi-head structure avoids increasing the computational complexity: matrix multiplications can be performed in parallel in multiple units in a GPU. Thus, a machine equipped with multiple GPUs can speed up the attention model even more.
The above analysis compared our deep BiLSTM neural network model and our self-attention model on word sense disambiguation tasks. In general, a deep neural network model provides more accurate results, especially on big datasets, while a self-attention model runs much faster and can give satisfactory results, especially on small datasets. Researchers may take our analysis into consideration when choosing a WSD architecture for a specific dataset.
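A multi-head structure built from single scaled dot-product attentions can be sketched as follows. This follows the general formulation of [18] with softmax-normalized weights; the per-head projection matrices, the output projection `Wo`, and all sizes are hypothetical and randomly initialized, so this is a sketch of the mechanism rather than our trained model.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V):
    """Single attention head: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))   # (T, T), rows sum to 1
    return weights @ V

def multi_head_attention(X, Wq, Wk, Wv, Wo):
    """Run h heads independently, concatenate them, then project with Wo."""
    heads = [
        scaled_dot_product_attention(X @ q, X @ k, X @ v)
        for q, k, v in zip(Wq, Wk, Wv)
    ]
    return np.concatenate(heads, axis=-1) @ Wo

rng = np.random.default_rng(0)
T, d_model, h = 5, 8, 2
d_k = d_model // h
X = rng.normal(size=(T, d_model))
Wq = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
Wk = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
Wv = [rng.normal(size=(d_model, d_k)) for _ in range(h)]
Wo = rng.normal(size=(h * d_k, d_model))
out = multi_head_attention(X, Wq, Wk, Wv, Wo)
```

Each head in the list comprehension depends only on X and its own projections, which is why the heads can be computed in parallel.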
Is a universal WSD system possible?
As shown in the “Results” section, the optional layer \(\mathcal {C}\) significantly improves the performance of our universal WSD model. However, when we add the optional layer \(\mathcal {C}\) back to our word-specific models, it is surprising that the prediction accuracies are slightly reduced.
We believe that when training a universal network, the optional layer \(\mathcal {C}\) puts further emphasis on the target ambiguous word for the dense layer, thereby leading to a more meaningful representation of the contextual information in the dense layer. In contrast, when training on the dataset for a single ambiguous word, further emphasis is redundant and may actually overwhelm some useful information in the max-pooled vector h, therefore slightly disturbing the dense layer in generating representations of the contextual information.
As far as we know, our universal WSD network is the first universal model in biomedical word sense disambiguation. Our results indicate that a universal model is feasible for this task. We hope that our research can draw the research community’s attention to universal word sense disambiguation systems in the future.
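The operation performed by the optional layer \(\mathcal {C}\), concatenating the embedding of the target ambiguous word to the max-pooled vector h as a “hint”, can be sketched as below. The vector sizes and the function name `layer_c` are assumptions for illustration.

```python
import numpy as np

def layer_c(pooled: np.ndarray, target_embedding: np.ndarray) -> np.ndarray:
    """Append the target word's embedding to the max-pooled context vector."""
    return np.concatenate([pooled, target_embedding])

h_vec = np.array([0.2, 0.9, 0.1])        # hypothetical max-pooled vector h
target = np.array([0.5, -0.3])           # hypothetical embedding of the target word
dense_input = layer_c(h_vec, target)     # fed to the dense layer
```

In a universal model the appended part differs from one ambiguous word to another, whereas in a word-specific model it is the same for every example, which is consistent with the hint being redundant there.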
Is there any mislabeling in the MSH WSD dataset?
After looking into the prediction output from neural networks with all the four upper layers, we found an interesting phenomenon: When testing with some specific paragraphs, the less accurate upper layer structures (i, ii, iv) on average even outperformed the best one (iii).
To further examine this phenomenon, we created two ensemble methods on the neural network models. The first ensemble method lets the neural networks with structures (i) through (iv) take a majority vote on the output. The second method implements a weighted vote, where a network with better performance on a validation set has a larger weight. The ensemble methods resulted in a small gain in accuracy (up to 0.4%).
Then, we further looked into the mistakes made by the networks. It appears that all our models agree on the same wrong label in 190 cases (3% of the test set). Examination of these words and their corresponding paragraphs shows that the disambiguation in these cases is very challenging even for human experts. For example, the word Medullary has two concepts in the UMLS:
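The two voting schemes can be sketched as follows. The sense labels "M1" and "M2", the use of validation accuracy as the weight, and the tie-breaking rule (the first label encountered wins) are assumptions made for the example.

```python
from collections import Counter, defaultdict

def majority_vote(predictions):
    """Return the label predicted by most models (first seen wins ties)."""
    return Counter(predictions).most_common(1)[0][0]

def weighted_vote(predictions, weights):
    """Return the label with the largest total weight across models."""
    scores = defaultdict(float)
    for label, weight in zip(predictions, weights):
        scores[label] += weight
    return max(scores, key=scores.get)

# Hypothetical predictions of structures (i)-(iv) for one test paragraph.
preds = ["M1", "M2", "M2", "M1"]
val_acc = [0.95, 0.94, 0.96, 0.93]       # hypothetical validation accuracies

majority = majority_vote(["M1", "M2", "M1", "M1"])
weighted = weighted_vote(preds, val_acc)
```

In the second call the vote is split two against two, and the weights decide in favour of the label backed by the networks with higher validation accuracy.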
Adrenal Medulla, inner part of the adrenal gland, (Body Part, Organ, or Organ Component)
Medulla Oblongata, part of the brainstem, (Body Part, Organ, or Organ Component).
Both of these concepts are anatomical structures of the same semantic type and appear in similar contexts, rendering disambiguation challenging.
Other words can be argued to have more than one correct sense in the given paragraph. The word Laryngeal has two corresponding concepts in the UMLS:
Larynx, irregularly shaped, musculocartilaginous tubular structure, (Body Part, Organ, or Organ Component)
Laryngeal Prosthesis, a device, activated electronically or by expired pulmonary air, which simulates laryngeal activity and enables a laryngectomized person to speak, (Medical Device).
In the following paragraph from the dataset of Laryngeal:
“A comparative study of speech after total laryngectomy and total laryngopharyngectomy. Quality of voice is an important factor in the consideration of treatment for advanced laryngeal cancer. This prospective study compared the speaking proficiency of patients who used the Blom-Singer valve after total laryngectomy and after total laryngopharyngectomy with jejunal graft reconstruction with that of a group of normal subjects. The total laryngectomy group demonstrated excellent communication ability both face-to-face and on the telephone. They exhibited superior scores for objective intelligibility, subjective intelligibility, acceptability, and intonation when compared with the total laryngopharyngectomy group....”
This paragraph is labeled with the second meaning (Laryngeal Prosthesis) in the UMLS. However, one can argue that the first meaning (Larynx) would be more appropriate, which is the meaning predicted by our networks. These examples suggest that there may be an upper bound on the accuracy achievable on this dataset, which we think is around 97%.
Conclusions
In this paper, we propose both word-specific and universal WSD models using deep BiLSTM neural networks and self-attention architectures, with four different adjustments on the models’ outputs and an optional concatenation layer. Experiments showed that the contextual information is sufficient for word sense disambiguation, and can be efficiently learned by deep networks as well as attention architectures. Furthermore, obtaining an explicit and wide representation of the contextual information improves the disambiguation accuracy significantly. Finally, reemphasizing the target ambiguous word is crucial to our universal WSD system.
In addition, we believe that good word embeddings are essential to all NLP tasks. However, although it is a state-of-the-art language model, the skip-gram negative sampling (SGNS) model does not capture the senses of words when training its embeddings [12]. Therefore, we hope to establish a language model using deep neural networks, so that not only the embeddings of words but also the embeddings of senses are established. As such, further semantic information can be captured naturally by both the word embeddings and the deep network.
Availability of data and materials
The MSH WSD dataset used in this study is publicly available at https://wsd.nlm.nih.gov/collaboration.shtml. The code of the Skipgram model is available at http://evexdb.org/pmresources/vecspacemodels/. The source code of the models is available upon request.
Abbreviations
BiLSTM: Bidirectional long short-term memory
CNN: Convolutional neural network
CUI: Concept unique identifier
LSTM: Long short-term memory
NER: Named-entity recognition
NLP: Natural language processing
RNN: Recurrent neural network
SGNS: Skip-gram negative sampling
SVM: Support vector machine
UMLS: Unified medical language system
WSD: Word sense disambiguation
References
1. Savova GK, Coden AR, Sominsky IL, Johnson R, Ogren PV, Groen PCd, Chute CG. Word sense disambiguation across two domains: Biomedical literature and clinical notes. J Biomed Inform. 2008; 41(6):1088–100. https://doi.org/10.1016/j.jbi.2008.02.003.
2. Navigli R. Word sense disambiguation: A survey. ACM Comput Surv (CSUR). 2009; 41(2):10.
3. Liu H, Teller V, Friedman C. Research paper: A multi-aspect comparison study of supervised word sense disambiguation. J Am Med Inform Assoc JAMIA. 2004; 11(4):320–31.
4. Xu H, Markatou M, Dimova R, Liu H, Friedman C. Machine learning and word sense disambiguation in the biomedical domain: design and evaluation issues. BMC Bioinformatics. 2006; 7:334.
5. Wang Y, Zheng K, Xu H, Mei Q. Interactive medical word sense disambiguation through informed learning. J Am Med Inform Assoc. 2018; 25(7):800–8.
6. Liu H, Lussier YA, Friedman C. Disambiguating ambiguous biomedical terms in biomedical narrative text: An unsupervised method. J Biomed Inform. 2001; 34(4):249–61.
7. Yu H, Kim W, Hatzivassiloglou V, Wilbur WJ. Using medline as a knowledge source for disambiguating abbreviations and acronyms in full-text biomedical journal articles. J Biomed Inform. 2007; 40(2):150–9.
8. Xu H, Stetson PD, Friedman C. Combining corpus-derived sense profiles with estimated frequency information to disambiguate clinical abbreviations. In: AMIA Annual Symposium Proceedings. American Medical Informatics Association: 2012. p. 1004–13.
9. Duque A, Stevenson M, Martinez-Romo J, Araujo L. Co-occurrence graphs for word sense disambiguation in the biomedical domain. Artif Intell Med. 2018; 87:9–19.
10. Jimeno-Yepes A, Aronson AR. Knowledge-based biomedical word sense disambiguation: comparison of approaches. BMC Bioinformatics. 2010; 11(1).
11. Sabbir A, Jimeno-Yepes A, Kavuluru R. Knowledge-based biomedical word sense disambiguation with neural concept embeddings. In: 2017 IEEE 17th International Conference on Bioinformatics and Bioengineering (BIBE): 2017. p. 163–70. https://doi.org/10.1109/BIBE.2017.0061.
12. Mikolov T, Sutskever I, Chen K, Corrado GS, Dean J. Distributed representations of words and phrases and their compositionality. In: Advances in Neural Information Processing Systems: 2013. p. 3111–9.
13. Rais M, Lachkar A. Biomedical word sense disambiguation context-based: Improvement of senserelate method. In: 2016 International Conference on Information Technology for Organizations Development (IT4OD): 2016. p. 1–6. https://doi.org/10.1109/IT4OD.2016.7479309.
14. Festag S, Spreckelsen C. Word sense disambiguation of medical terms via recurrent convolutional neural networks. Stud Health Technol Inform. 2017; 236:8–15. IOS Press.
15. Yepes AJ. Word embeddings and recurrent neural networks based on long-short term memory nodes in supervised biomedical word sense disambiguation. J Biomed Inform. 2017; 73:137–47.
16. Peters ME, Neumann M, Iyyer M, Gardner M, Clark C, Lee K, Zettlemoyer L. Deep contextualized word representations. arXiv preprint arXiv:1802.05365. 2018. https://arxiv.org/abs/1802.05365.
17. Bis D, Zhang C, Liu X, He Z. Layered Multi-step Bidirectional Long Short-Term Memory Networks for Biomedical Word Sense Disambiguation. In: 2018 IEEE International Conference on Bioinformatics and Biomedicine. IEEE: 2018. p. 313–20.
18. Vaswani A, Shazeer N, Parmar N, Uszkoreit J, Jones L, Gomez A, Kaiser L, Polosukhin I. Attention is all you need: 2017. p. 5998–6008.
19. Hochreiter S, Schmidhuber J. Long short-term memory. Neural Comput. 1997; 9(8):1735–80.
20. Gers F, Schmidhuber J, Cummins F. Learning to forget: Continual prediction with lstm. Neural Comput. 2000; 12(10):2451–71.
21. Graves A, Schmidhuber J. Framewise phoneme classification with bidirectional lstm and other neural network architectures. Neural Netw. 2005; 18(5–6):602–10.
22. Zaremba W, Sutskever I, Vinyals O. Recurrent neural network regularization. arXiv preprint arXiv:1409.2329. 2014. https://arxiv.org/abs/1409.2329.
23. Bottou L, Olivier B. The tradeoffs of large scale learning. Adv Neural Inf Process Syst. 2008; 20:161–8.
24. Ramsunder B. Tensorflow tutorial. Presentation of Stanford machine learning course. https://cs224d.stanford.edu/lectures/CS224dLecture7.pdf. Accessed 1 Mar 2019.
25. Jimeno-Yepes AJ, McInnes BT, Aronson AR. Exploiting mesh indexing in medline to generate a data set for word sense disambiguation. BMC Bioinformatics. 2011; 12(1):223.
26. Graves A, Jaitly N. Towards end-to-end speech recognition with recurrent neural networks. In: Proc. 31st International Conference on Machine Learning, vol 32: 2014. p. 1764–72.
Acknowledgments
We are grateful to the authors of [18]. This work is based on both our previous work [17] and the encoder selfattention model in [18]. We are also grateful to the technical help from Shamik Bose.
About this supplement
This article has been published as part of BMC Bioinformatics Volume 20 Supplement 16, 2019: Selected articles from the IEEE BIBM International Conference on Bioinformatics & Biomedicine (BIBM) 2018: bioinformatics and systems biology. The full contents of the supplement are available online at https://bmcbioinformatics.biomedcentral.com/articles/supplements/volume20supplement16.
Funding
This study was supported by the NVIDIA GPU Grant and in part by the National Institute on Aging award R21AG061431 (PI: He/Bian). The content is solely the responsibility of the authors and does not necessarily represent the official views of NIH. The publication cost will be covered by the Open Access Fund of FSU Libraries.
Author information
Contributions
ZH, XL conceived the concept of this study. XL, CZ, DB, ZH designed the initial study protocol. CZ and DB built the models for word sense disambiguation and carried out the experiments. CZ and DB wrote the initial draft of the manuscript. All authors have provided feedback and edited the manuscript. All authors read and approved the final manuscript.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Canlin Zhang and Daniel Biś are equal contributors.
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Zhang, C., Biś, D., Liu, X. et al. Biomedical word sense disambiguation with bidirectional long shortterm memory and attentionbased neural networks. BMC Bioinformatics 20, 502 (2019). https://doi.org/10.1186/s1285901930798
Keywords
 Word sense disambiguation
 LSTM
 Self-attention
 Biomedical