In this section, a novel Multi-Probe Attention Neural Network (MPANN) is proposed for automatic COVID-19 semantic indexing. Figure 3 illustrates the architecture of the proposed method, a universal deep learning framework that integrates multiple sources of semantic evidence drawn from different biomedical aspects. The architecture of MPANN consists of four modules: MeSH Masking, Probe Encoding, Multi-Probe Attention, and Multi-view Classifier. The details are discussed as follows.
As shown in the figure, the proposed method introduces a masking mechanism that uses a KNN-based approach to identify the most similar articles in the training set for each input article. It then ranks and extracts the most frequent MeSH terms from these similar articles as the candidate MeSH terms for the target article, which significantly reduces the complexity of the indexing problem. The extracted candidate MeSH terms are then embedded and fed into the downstream neural networks.
Moreover, the proposed neural network takes multiple textual components from different semantic aspects, together with the extracted candidate terms, as inputs for each article. These inputs are treated as semantic probes and are encoded with word embeddings and transformer encoders to generate further feature representations.
Additionally, the proposed neural network employs an attention mechanism to automatically assign different attentive weights to the input probes and thereby attend to the most important semantic aspects of the input article. After feature extraction at both the term level and the document level, the resulting feature representations are used for the subsequent MeSH indexing prediction.
Finally, a linear multi-view classifier takes the extracted features from different semantic aspects to conduct the final MeSH classification, predicting a probability score for each candidate term. In the training phase, the binary cross-entropy loss is optimized with a gradient-based method to learn the model parameters. A more detailed description of the proposed method is provided in the following subsections.
MeSH masking
COVID-19 semantic indexing is an extreme multi-label classification problem, which requires assigning appropriate labels from more than twenty thousand MeSH terms to each input article. Reducing this high classification dimensionality is essential to the overall system performance. To tackle this problem, we employ a KNN algorithm to generate a refined subset of candidate terms for each input article; this generation procedure is referred to as MeSH Masking. The main reasons for using a small subset of candidate terms instead of the entire MeSH vocabulary are as follows: (i) since each article carries only around 13 MeSH annotations, there are far more negative terms than positive ones. Taking a recommended small subset of terms as candidates down-samples the negative examples, so the classifier only needs to concentrate on selecting the most suitable terms from a plausible subset; (ii) during training, a small subset of candidate terms narrows the prediction space, as the neural network does not need to score the entire term vocabulary, which saves both model storage and computation costs.
For each article, the title and abstract are first split into a sequence of tokens, and a word embedding matrix \({\varvec{E}}_{e} \in {\mathbb{R}}^{{|V_{e} | \times d_{e} }}\) is then used to convert all tokens into low-dimensional dense vectors, where |Ve| is the vocabulary size and de is the embedding size. Each input article can thus be represented by the sequence of word embeddings corresponding to its tokenized result, denoted as:
$${\varvec{D}} = [{\varvec{w}}_{1} , {\varvec{w}}_{2} , \ldots , {\varvec{w}}_{L} ] \in {\mathbb{R}}^{{L \times d_{e} }}$$
(1)
where D is the sequence of vectors representing the input article, L is the sequence length, and wi is the embedding vector of the word at position i. We further apply a KNN-driven strategy to choose the most similar articles from the training dataset for each input article. To this end, each article is represented by its Term Frequency-Inverse Document Frequency (TF-IDF) weighted word embeddings:
$${\varvec{d}} = \frac{{\mathop {\mathbf{\sum }}\nolimits_{i = 1}^{L} tfidf_{i} \cdot {\varvec{w}}_{i} }}{{\mathop {\mathbf{\sum }}\nolimits_{i = 1}^{L} tfidf_{i} }} \in {\mathbb{R}}^{{d_{e} }}$$
(2)
Cosine similarity is adopted to find the most similar articles from the training set for each input article:
$${\text{Similarity}}(i, j) = \frac{{{\varvec{d}}_{i}^{T} {\varvec{d}}_{j} }}{{\left\| {{\varvec{d}}_{i} } \right\| \cdot \left\| {{\varvec{d}}_{j} } \right\|}}$$
(3)
After finding the K nearest neighbors of each article, all MeSH terms attached to these neighbors are collected and ranked by frequency. The top M MeSH terms are then retained as the candidate terms for the input article.
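As a minimal sketch of this masking step, the following Python code illustrates one possible implementation with NumPy. It assumes `train_vecs` holds the TF-IDF-weighted embeddings of Eq. (2) for the training articles and `train_mesh` lists their gold MeSH annotations; the values of K and M are placeholders, not the settings used in the paper.

```python
import numpy as np
from collections import Counter

def mesh_masking(query_vec, train_vecs, train_mesh, k=20, m=256):
    """Select the top-M candidate MeSH terms for one article via KNN.

    query_vec  : (d_e,)   TF-IDF-weighted embedding of the input article, Eq. (2)
    train_vecs : (n, d_e) TF-IDF-weighted embeddings of all training articles
    train_mesh : list of n lists, gold MeSH terms of each training article
    """
    # Cosine similarity between the query and every training article, Eq. (3)
    sims = train_vecs @ query_vec
    sims /= (np.linalg.norm(train_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-12)

    # K nearest neighbours in the training set
    neighbours = np.argsort(-sims)[:k]

    # Collect their MeSH annotations and rank by frequency
    counts = Counter(term for idx in neighbours for term in train_mesh[idx])

    # Keep the top-M most frequent terms as the candidate subset
    return [term for term, _ in counts.most_common(m)]
```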
Probe encoding
Given the abundance of meaningful representations from different semantic aspects, we propose to exploit multiple context-aware inputs of each article as semantic probes to extract potential biomedical clues for MeSH recommendation. Specifically, we exploit four different semantic probes: Context Probe, Candidate Term Probe, Journal Probe, and Dynamic Topic Probe. We argue that each probe carries certain biomedical knowledge and enriches the semantic representation of each input article. The above-mentioned semantic probes are introduced as follows:
Context probe
For each input article, its word sequence is taken as the context probe, which conveys narrative textual information and offers implicit cues for determining MeSH recommendations. However, despite the meaningful representation provided by word embeddings, word vectors alone are less informative for text representation because they lack contextual comprehension. Therefore, a transformer encoder, which has shown promising results in many Natural Language Processing (NLP) areas [38,39,40], is adopted to read and encode the context probe, as shown at the bottom of Fig. 3. This encoder captures both explicit and implicit textual correlations between adjacent words. Specifically, each word in the context probe is represented by its hidden state generated from the encoder:
$${\varvec{t}}_{i} = {\text{Transformer}}(\theta ; w_{i} ) \in {\mathbb{R}}^{{d_{t} }}$$
(4)
where θ represents the parameters of the encoder, dt stands for the hidden size, and ti is the encoded hidden state of the i-th word. The entire context probe is then represented accordingly by the sequence of the encoded hidden states, which is denoted as follows:
$${\varvec{T}} = [{\varvec{t}}_{1} ,{\varvec{t}}_{2} , \ldots , {\varvec{t}}_{L} ]^{T} \in {\mathbb{R}}^{{L \times d_{t} }}$$
(5)
where \({\varvec{T}} \in {\mathbb{R}}^{{L \times d_{t} }}\) is an L-by-dt matrix stacking the hidden states of all words.
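For illustration only, a context probe encoder of this kind could be sketched in PyTorch as follows; the layer count, head number, and hidden sizes are arbitrary placeholders, and the projection layer mapping the embedding size de to the hidden size dt is an assumption.

```python
import torch
import torch.nn as nn

class ContextProbeEncoder(nn.Module):
    """Encode the word-embedded context probe D (Eq. 1) into hidden states T (Eq. 5)."""

    def __init__(self, d_e=300, d_t=256, n_heads=4, n_layers=2):
        super().__init__()
        self.proj = nn.Linear(d_e, d_t)  # map word embeddings to the hidden size d_t
        layer = nn.TransformerEncoderLayer(d_model=d_t, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, D, padding_mask=None):
        # D: (batch, L, d_e) word embeddings; padding_mask: (batch, L), True at padded positions
        x = self.proj(D)
        T = self.encoder(x, src_key_padding_mask=padding_mask)  # (batch, L, d_t)
        return T
```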
Candidate term probe
The MeSH Masking procedure yields a small subset of the M most relevant terms for recommendation, which are taken as the candidate term probes of each input article. This refined subset of candidate terms notably mitigates the noise introduced by the extremely unbalanced negative term samples and provides a plausible semantic scope of the topics the article addresses. In practice, each term is treated as a single probe and converted through an embedding matrix \({\varvec{E}}_{f} \in {\mathbb{R}}^{{|V_{f} | \times d_{f} }}\), where |Vf| is the vocabulary size and df is the embedding size. Since term names differ in word length, an RNN encoder is applied to obtain a fixed-length name representation. In addition, to enhance the term representation, five kinds of statistical indicators are concatenated to the name representation: (a) a vector of length 2 indicating whether the candidate term occurs in the title and its frequency; (b) a vector of length 4 indicating whether the candidate term occurs in the first sentence, last sentence, or middle part of the abstract and its frequency; (c) a vector of length 2 indicating whether the candidate term is recognized by the MTI Online System [13, 22] and its score; (d) a vector of length 2 indicating whether the term is recognized by KNN and its score; (e) a scalar indicating the global probability of the term occurring in the journal. The candidate term probes of the input article can finally be denoted as follows:
$${\varvec{H}} = [{\varvec{m}}_{1} , {\varvec{m}}_{2} , \ldots , {\varvec{m}}_{M} ]^{T} \in {\mathbb{R}}^{{M \times d_{f} }}$$
(6)
where mi is the probe representation of the i-th candidate term and M is the number of recommended terms after the MeSH Masking stage.
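The following PyTorch sketch shows how such a candidate term probe might be assembled: the term name is embedded with Ef, encoded by a GRU (one possible choice of RNN), and concatenated with the eleven statistical indicators (2 + 4 + 2 + 2 + 1) listed above; the final projection back to df is an assumption made here to keep the probe size consistent with Eq. (6). The journal probe described next can be encoded with the same GRU pattern.

```python
import torch
import torch.nn as nn

class TermProbeEncoder(nn.Module):
    """Encode one candidate MeSH term: GRU over its name plus statistical indicators."""

    def __init__(self, vocab_size, d_f=128, n_indicators=11):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_f)      # embedding matrix E_f
        self.rnn = nn.GRU(d_f, d_f, batch_first=True)   # fixed-length name representation
        self.fuse = nn.Linear(d_f + n_indicators, d_f)  # assumed projection back to d_f

    def forward(self, name_ids, indicators):
        # name_ids:   (batch, T)  token ids of the term name
        # indicators: (batch, 11) statistical features (a)-(e) described above
        _, h = self.rnn(self.embed(name_ids))           # h: (1, batch, d_f) final hidden state
        m = torch.cat([h.squeeze(0), indicators], dim=-1)
        return self.fuse(m)                             # (batch, d_f) probe representation m_i
```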
Journal probe
In addition to the Context Probe and Candidate Term Probe, the Journal Probe is another informative semantic probe for MPANN. In the scientific literature, articles tend to be published in journals devoted to distinct research topics, such as chemicals, cancers, or coronaviruses. This journal-specific information is important and provides essential cues for MeSH recommendations. To this end, each journal name occurring in the corpus is taken as a journal probe. Specifically, each word in the journal probe is converted into a low-dimensional dense vector using the embedding matrix \({\varvec{E}}_{j} \in {\mathbb{R}}^{{|V_{j} | \times d_{j} }}\), where |Vj| is the vocabulary size and dj is the embedding length. Since journal names differ in word length, an RNN encoder is applied to the word vectors, and its fixed-length final hidden state c is used to represent the journal probe.
Dynamic topic probe
Inspired by [25, 26], dynamic topic probes are also introduced to the multi-probe attention neural network. Although MeSH Masking sharply reduces the prediction space, some implicit yet general semantic aspects may still lie beyond the scope of the current candidate term probes. For instance, an article dedicated to the SARS-CoV-2 virus may also discuss general topics related to clinical treatments that are missing from the candidate terms. Therefore, to capture this potential and meaningful topic information, a new kind of dynamic topic probe is proposed to represent additional informative topic aspects contained in the article. Compared with the candidate term probes, which are explicitly related to specific topics of the input article, the dynamic topic probes are more relevant to general aspects of background knowledge beyond the candidate term probes. To this end, we employ the embedding matrix \({\varvec{E}}_{p} \in {\mathbb{R}}^{{|V_{p} | \times d_{p} }}\) to represent the i-th dynamic topic probe as a low-dimensional dense vector pi, where |Vp| is the vocabulary size and dp is the size of the embedding vector. Accordingly, the dynamic topic probes are learnable vectors among the model parameters, each carrying a certain aspect of general biomedical knowledge. Suppose N dynamic topic probes are assigned to an input article; the corresponding representation is then an N-by-dp matrix denoted as follows:
$${\varvec{P}} = [{\varvec{p}}_{1} , {\varvec{p}}_{2} , \ldots , {\varvec{p}}_{N} ]^{T} \in {\mathbb{R}}^{{N \times d_{p} }}$$
(7)
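Since the dynamic topic probes are model parameters rather than article-specific inputs, they can be realized as a learnable matrix; the sketch below assumes the N probes are shared across all articles in a batch, and the value of N is a placeholder.

```python
import torch
import torch.nn as nn

class DynamicTopicProbes(nn.Module):
    """N learnable topic vectors, Eq. (7), trained jointly with the rest of MPANN."""

    def __init__(self, n_topics=32, d_p=256):
        super().__init__()
        # P is a parameter matrix; each row p_i encodes one general biomedical aspect
        self.P = nn.Parameter(torch.randn(n_topics, d_p) * 0.02)

    def forward(self, batch_size):
        # The same topic probes are shared by every article in the batch
        return self.P.unsqueeze(0).expand(batch_size, -1, -1)  # (batch, N, d_p)
```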
Multi-probe attention
After encoding all the above-mentioned probes, we compute dot products between them to obtain attention weights for the different semantic aspects. Attentive feature representations at both the term level and the document level are then extracted for the downstream MeSH prediction. Specifically, we group the semantic probes into multiple pairs and calculate five types of attention to obtain the attentive features: Context-Term Attention, Journal-Term Attention, Journal-Context Attention, Journal-Topic Attention, and Context-Topic Attention.
Feature representation at term level
For feature representation at the term level, we separately represent and extract the attentive features by calculating Context-Term Attention and Journal-Term Attention. For Context-Term Attention, given the encoded context probes T and candidate term probes H, we first compute their attentive weight matrix G and then adopt a SoftMax function to get the normalized attention weights as follows:
$${\varvec{G}} = \left[ {{\varvec{Tm}}_{1} , {\varvec{Tm}}_{2} , \ldots ,{\varvec{Tm}}_{M} } \right]^{T} \in {\mathbb{R}}^{M \times L}$$
(8)
$${\varvec{\alpha}}_{i}^{G} = SoftMax\left( {{\varvec{Tm}}_{i} } \right) \in {\mathbb{R}}^{L}$$
(9)
$$SoftMax\left( {\varvec{G}} \right) = [{\varvec{\alpha}}_{1}^{G} ,{\varvec{\alpha}}_{2}^{G} , ..., {\varvec{\alpha}}_{M}^{G} ]^{T} \in {\mathbb{R}}^{M \times L}$$
(10)
where \({\varvec{\alpha}}_{i}^{G} \in [0, 1]^{L}\) is the i-th weight vector over the context probe T and \(\sum_{k = 1}^{L} \alpha_{ik}^{G} = 1\). The higher a weight value, the more attention is paid to the corresponding position of the probe. Each term-specific representation is then computed from the attentive weight vector and the textual probe:
$${\varvec{e}}_{i}^{G} = [{\varvec{\alpha}}_{i}^{G} ]^{T} {\varvec{T}} \in {\mathbb{R}}^{{d_{t} }}$$
(11)
where \({\varvec{e}}_{i}^{G}\) is the i-th term-specific representation. The term-aware contextual feature \({\varvec{e}}^{G} \in {\mathbb{R}}^{{d_{t} }}\) is the mean of these representations, i.e., \({\varvec{e}}^{G} = \frac{1}{M}\sum_{i = 1}^{M} {\varvec{e}}_{i}^{G}\).
For Journal-Term Attention, we calculate and extract the term-aware feature in the same way as follows:
$${\varvec{\alpha}}^{J} = SoftMax\left( {{\varvec{Hc}}} \right) \in {\mathbb{R}}^{M}$$
(12)
$${\varvec{e}}^{J} = [{\varvec{\alpha}}^{J} ]^{T} {\varvec{H}} \in {\mathbb{R}}^{{d_{m} }}$$
(13)
where \({\varvec{\alpha}}^{J} \in [0, 1]^{M}\) is the attention weight vector over the candidate term probes and \({\varvec{e}}^{J} \in {\mathbb{R}}^{{d_{m} }}\) is the resulting feature representation. The extracted feature vectors eG and eJ are concatenated into the vector rT as the term-level feature representation.
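A compact sketch of the term-level attention, following Eqs. (8)-(13), is given below; it assumes a single article (no batch dimension) and that all probe representations share the same dimensionality, so that the dot products are well defined.

```python
import torch
import torch.nn.functional as F

def term_level_features(T, H, c):
    """Term-level attentive features, Eqs. (8)-(13).

    T : (L, d) encoded context probe
    H : (M, d) candidate term probes
    c : (d,)   journal probe representation
    """
    # Context-Term Attention: one weight vector over the L context positions per term
    G = H @ T.t()                        # (M, L), row i is (T m_i)^T, Eq. (8)
    alpha_G = F.softmax(G, dim=-1)       # Eq. (10)
    E_G = alpha_G @ T                    # (M, d), row i is e_i^G of Eq. (11)
    e_G = E_G.mean(dim=0)                # term-aware contextual feature e^G

    # Journal-Term Attention: one weight vector over the M candidate terms
    alpha_J = F.softmax(H @ c, dim=-1)   # Eq. (12)
    e_J = alpha_J @ H                    # Eq. (13)

    return torch.cat([e_G, e_J], dim=-1)  # r^T, the term-level feature
```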
Feature representation at document level
In addition to feature extraction at the term level, we also extract features at the document level. In particular, we extract attentive features through Context-Topic Attention, Journal-Context Attention, and Journal-Topic Attention. Given the encoded probes T and P, we extract the topic-aware contextual feature by computing the Context-Topic Attention. The calculations are denoted as follows:
$${\varvec{U}} = [{\varvec{Tp}}_{1} , {\varvec{Tp}}_{2} , \ldots ,{\varvec{Tp}}_{N} ]^{T} \in {\mathbb{R}}^{N \times L}$$
(14)
$${\varvec{\alpha}}_{i}^{U} = SoftMax\left( {{\varvec{Tp}}_{i} } \right) \in {\mathbb{R}}^{L}$$
(15)
$${\varvec{e}}_{i}^{U} = [{\varvec{\alpha}}_{i}^{U} ]^{T} {\varvec{T}} \in {\mathbb{R}}^{{d_{t} }}$$
(16)
where U is the weight matrix, \({\varvec{\alpha}}_{i}^{U} \in [0, 1]^{L}\) is the weight vector over the context probe, and \(\sum_{k = 1}^{L} \alpha_{ik}^{U} = 1\); \({\varvec{e}}_{i}^{U}\) is the i-th topic-specific representation. The topic-aware contextual feature \({\varvec{e}}^{U} \in {\mathbb{R}}^{{d_{t} }}\) is the mean of these representations, i.e., \({\varvec{e}}^{U} = \frac{1}{N}\sum_{i = 1}^{N} {\varvec{e}}_{i}^{U}\).
Similarly, features encoded by Journal-Topic Attention and Journal-Context Attention are extracted in the same way as follows:
$${\varvec{\alpha}}^{S} = SoftMax\left( {{\varvec{Pc}}} \right) \in {\mathbb{R}}^{N}$$
(17)
$${\varvec{e}}^{S} = [{\varvec{\alpha}}^{S} ]^{T} {\varvec{P}} \in {\mathbb{R}}^{{d_{p} }}$$
(18)
$${\varvec{\alpha}}^{Q} = SoftMax\left( {{\varvec{Tc}}} \right) \in {\mathbb{R}}^{L}$$
(19)
$${\varvec{e}}^{Q} = [{\varvec{\alpha}}^{Q} ]^{T} {\varvec{T}} \in {\mathbb{R}}^{{d_{t} }}$$
(20)
where \({\varvec{\alpha}}^{S} \in [0, 1]^{N}\) and \({\varvec{\alpha}}^{Q} \in [0, 1]^{L}\) are the normalized weight vectors over the dynamic topic probes and the context probe, respectively; \({\varvec{e}}^{S} \in {\mathbb{R}}^{{d_{p} }}\) and \({\varvec{e}}^{Q} \in {\mathbb{R}}^{{d_{t} }}\) are the respective feature representations. The extracted feature vectors eU, eS, and eQ are concatenated into the vector rD, which serves as the document-level feature representation.
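Under the same assumptions as the term-level sketch, the document-level features of Eqs. (14)-(20) can be computed as follows.

```python
import torch
import torch.nn.functional as F

def document_level_features(T, P, c):
    """Document-level attentive features, Eqs. (14)-(20).

    T : (L, d) encoded context probe
    P : (N, d) dynamic topic probes
    c : (d,)   journal probe representation
    """
    # Context-Topic Attention: weights over the L context positions for each topic probe
    alpha_U = F.softmax(P @ T.t(), dim=-1)   # (N, L), Eq. (15)
    e_U = (alpha_U @ T).mean(dim=0)          # topic-aware contextual feature e^U

    # Journal-Topic Attention: weights over the N topic probes
    alpha_S = F.softmax(P @ c, dim=-1)       # Eq. (17)
    e_S = alpha_S @ P                        # Eq. (18)

    # Journal-Context Attention: weights over the L context positions
    alpha_Q = F.softmax(T @ c, dim=-1)       # Eq. (19)
    e_Q = alpha_Q @ T                        # Eq. (20)

    return torch.cat([e_U, e_S, e_Q], dim=-1)  # r^D, the document-level feature
```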
Multi-view classification
Benefiting from the attention mechanism, feature representations at both the term level and the document level are extracted. To compute the confidence of each MeSH recommendation, the feature representations rT and rD are concatenated into the final feature vector v, which is fed into a linear projection layer with a Sigmoid activation function. The final output \({\varvec{o}} \in {\mathbb{R}}^{M}\) gives the probability score of each corresponding MeSH term:
$${\varvec{o}} = \sigma ({\varvec{Wv}} + {\varvec{b}})$$
(21)
where \({\varvec{W}} \in {\mathbb{R}}^{{M \times d_{v} }}\) is the linear transformation matrix, \({\varvec{b}} \in {\mathbb{R}}^{M}\) is the bias, and σ is the Sigmoid activation function. The value M equals the number of candidate MeSH terms, and each output can be interpreted as the confidence score of the corresponding recommendation.
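A minimal sketch of this multi-view classification head, assuming rT and rD are the term-level and document-level feature vectors computed above:

```python
import torch
import torch.nn as nn

class MultiViewClassifier(nn.Module):
    """Linear projection with Sigmoid over the concatenated features, Eq. (21)."""

    def __init__(self, d_v, n_candidates):
        super().__init__()
        self.linear = nn.Linear(d_v, n_candidates)  # W and b

    def forward(self, r_T, r_D):
        v = torch.cat([r_T, r_D], dim=-1)           # final feature vector v
        return torch.sigmoid(self.linear(v))        # o: one score per candidate MeSH term
```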
To learn the parameters of the network, the binary cross-entropy loss is computed between the predicted scores and the gold MeSH annotations in the training set:
$${\mathcal{L}}_{j} = - (y_{j} \log (\hat{y}_{j} ) + (1 - y_{j} )\log (1 - \hat{y}_{j} ))$$
(22)
where \(y_{j} \in \{0, 1\}\) is the ground-truth label of the j-th MeSH term; yj = 0 means the j-th MeSH term is not annotated to the article by human indexers, while yj = 1 means it is annotated. The total loss is obtained by summing over all candidate terms:
$${\mathcal{L}} = \mathop \sum \limits_{j = 1}^{M} {\mathcal{L}}_{j}$$
(23)
The entire MPANN framework is trained end-to-end with a gradient-based optimization algorithm to minimize the loss \({\mathcal{L}}\).
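For illustration, one training step could be sketched as below, assuming `model` maps a batch of articles to the M candidate-term scores o and `batch["labels"]` holds the binary gold annotations; summing over the candidate terms follows Eq. (23), while summing additionally over the batch is an implementation assumption.

```python
import torch

def train_step(model, batch, optimizer):
    """One end-to-end training step of MPANN with binary cross-entropy, Eqs. (22)-(23)."""
    optimizer.zero_grad()
    scores = model(batch["inputs"])      # (batch, M) predicted probabilities o
    targets = batch["labels"].float()    # (batch, M) 0/1 gold MeSH annotations
    # reduction="sum" sums the per-term losses, following Eq. (23)
    loss = torch.nn.functional.binary_cross_entropy(scores, targets, reduction="sum")
    loss.backward()
    optimizer.step()
    return loss.item()
```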