 Research
 Open access
MCRWR: a new method to measure the similarity of documents based on semantic network
BMC Bioinformatics volume 23, Article number: 56 (2022)
Abstract
Background
Besides Boolean retrieval with medical subject headings (MeSH), PubMed provides users with an alternative way, called “Related Articles”, to access and collect relevant documents based on semantic similarity. To exploit this functionality more efficiently and accurately, we propose an improved algorithm that measures the semantic similarity of PubMed citations based on a MeSH-concept network model.
Results
Three article similarity networks are obtained using MeSH-concept random walk with restart (MCRWR), MeSH random walk with restart (MRWR) and PubMed related article (PMRA), respectively. The areas under the receiver operating characteristic (ROC) curve of MCRWR, MRWR and PMRA are 0.93, 0.90 and 0.67, respectively. The precisions of MCRWR and MRWR under various similarity thresholds are higher than that of PMRA. The mean P5 value of MCRWR is 0.742, which is much higher than those of MRWR (0.692) and PMRA (0.223). In the article semantic similarity network of “Genes & Function of organ & Disease” based on the MCRWR algorithm, four topics are identified that match the golden standard.
Conclusion
The MeSH-concept random walk with restart algorithm performs better in constructing an article semantic similarity network, which can reveal implicit semantic associations between documents. The efficiency and accuracy of retrieving semantically related documents are thereby substantially improved.
Background
The PubMed database is the largest biomedical database, containing more than 32 million citations and abstracts as of 2021. It is updated daily with 200–4000 citations and has an exponentially growing tendency [1]. Besides Boolean retrieval with a query, PubMed provides users with an alternative way called “Related Articles” on the result page, which recommends relevant documents based on semantic similarity. Recommending and retrieving relevant documents effectively and accurately from millions of scientific articles remains a very difficult and challenging task. It is therefore of great importance to explore a more accurate approach to measuring the similarity between documents based on semantic relations.
There is no doubt that the ideal method to compute the similarity of documents is based on full-text analysis [2]. However, considering the cost and limitations of obtaining full text, it may be more sensible to select the metadata of literature records, such as MeSH terms, titles, abstracts and references, to measure the similarity of documents.
Small explored and established relatedness between articles using co-citation analysis [3]. Co-citation occurs when two articles are cited together by a common article. The more frequently two articles are co-cited, the higher the co-citation intensity and the higher the similarity of the corresponding documents. However, since this approach ignored the content of articles, which is their fundamental nature, its precision was often not high.
Considering the content of articles, researchers have proposed various methods to calculate similarity based on semantic relations between documents. Chandrasekaran et al. [4] reviewed and traced the evolution of semantic similarity methods proposed over the years, categorizing them as knowledge-based, corpus-based, deep-neural-network-based, and hybrid methods. Among them, five popular approaches measure the similarity between documents based on titles and abstracts: cosine similarity using term frequency–inverse document frequency (TF-IDF cosine), latent semantic analysis (LSA), topic modeling, and two Poisson-based methods, best match 25 (BM25) and PMRA [5,6,7,8,9,10,11]. Researchers have shown that PMRA performs best among these five methods. PMRA was developed by integrating TF-IDF and document length into the BM25 algorithm, and it underpins the “Related Articles” feature of PubMed.
However, PMRA does not always recommend the relevant documents for the topic that users are interested in. For example, if two articles have similar word distributions, they would be inferred as relevant, disregarding the semantic differences between them. Conversely, PMRA may fail to recommend two articles that are highly relevant semantically but have different word distributions. Therefore, to develop a more precise method, we should not only take the word distribution into consideration, but also utilize the implicit semantic relations between pairs of terms extracted from articles. As the two most popular biomedical terminology thesauri, the MeSH thesaurus and the unified medical language system (UMLS) metathesaurus are used to explore the semantic relationships between terms [12, 13]. Castro and Berlanga studied literature similarity using semantic annotations from UMLS, demonstrating that semantic concepts are useful for optimizing the similarity measurement of documents [2]. Cui and Pan calculated the semantic similarities of documents using the semantic relations between MeSH terms and, in further studies, tried to construct a semantic similarity network of documents [14, 15]. Nevertheless, some studies held the different opinion that MeSH terms were not competitive, possibly because they failed to consider the implicit information between MeSH terms.
In order to measure the semantic similarity between documents in a specific domain more accurately, we propose a novel method that uses the random walk with restart (RWR) algorithm to walk the MeSH-concept similarity network, which is constructed using the semantic relations of MeSH terms and semantic concepts. The RWR algorithm [16,17,18] is usually used to predict the similarity and relatedness between nodes in a network by choosing a seed node, transiting randomly from the present node to neighboring nodes based on the probabilities between node pairs, and finally obtaining a convergent probability distribution. In the present study, MeSH terms and semantic concepts are extracted from the downloaded articles and used to represent the documents. According to the similarity of MeSH terms and the co-occurrence frequency between MeSH terms and semantic concepts, we construct a MeSH-concept similarity network and calculate the similarity between documents with the RWR algorithm. The proposed method is compared with the PMRA algorithm using the area under the ROC curve (AUC), the P5 value, precision and document clustering for evaluation.
Methods
We propose a new method called MeSH-concept random walk with restart (MCRWR) to measure the semantic similarity of documents in the Medline database. There are three steps. First, MeSH terms are extracted from articles and a MeSH similarity network is constructed based on the hierarchical relationships of the MeSH tree numbers that correspond to the MeSH terms in the MeSH thesaurus. Second, semantic concepts are extracted from the titles and abstracts of articles and identified in the UMLS (Unified Medical Language System) metathesaurus. A MeSH-concept similarity network is constructed by incorporating the semantic concepts into the MeSH similarity network. Finally, the semantic similarity between two articles is computed from the feature vectors generated from the MeSH-concept similarity network by the random walk with restart (RWR) algorithm. We downloaded 4240 articles from the 2005 TREC (Text Retrieval Conference) genome project as the corpus. The performance of our approach was evaluated by four measures (area under the ROC curve, precision, P5 value, and document clustering) and compared with the MeSH RWR (MRWR) algorithm and the PMRA (PubMed Related Article) algorithm.
In this study, we use the following four steps.

1.
Define and collect a corpus of documents.

2.
Extract and preprocess the MeSH terms and semantic concepts from the titles and abstracts of documents to generate term-document matrices.

3.
Calculate pairwise document-document similarities using the PMRA algorithm and the RWR algorithm, respectively.

4.
Evaluate the proposed algorithm (MCRWR) by the area under ROC curve, P5 value, precision and the effect of document clustering.
Study corpus
The corpus of this study comes from the 2005 text retrieval conference (TREC) Genome Project [19] (http://skynet.ohsu.edu/trecgen/data/2005/genomics.qrels.large.txt), which consists of a subset of MEDLINE from 1994 to 2003 and includes 34,633 articles. In 2005, five generic topic templates (GTTs) including semantic types were created by TREC to reflect biologists’ information needs. Each topic template has 10 topics. Domain experts judged the relevance between each topic and each candidate document, assigning a corresponding score: 0 represents “not relevant”, 1 represents “possibly relevant”, and 2 represents “definitely relevant”. In our study, if the topic-document relevance score is not equal to 0, we regard the document as belonging to the topic. If two documents share a common topic, we regard them as relevant; this defines the golden standard for evaluating the proposed MCRWR method. Among the 34,633 articles, 4491 articles have a topic-relevance score greater than 0. After removing invalid PubMed IDs (PMIDs), 4240 non-overlapping articles on 49 topics (87 articles belonging to 2 topics and 2 articles belonging to 3 topics) are included in the following study. Detailed information on the corpus is shown in Table 1.
Term-document matrices
In this study, the PMID is used as the unique identifier. MeSH terms, titles and abstracts are extracted from the corpus to generate term-document matrices. Compared with subheadings and non-major MeSH terms, major MeSH terms are more specific and better represent the thematic content of articles. Therefore, only major MeSH terms (hereafter referred to as MeSH) extracted from the documents with the MeSH thesaurus (2016 edition) are retained, and no further processing is performed.
In calculating the similarity between documents, the core components of PMRA are the frequencies of terms that co-occur in two documents and the importance of the co-occurring terms in their respective documents, while the RWR algorithm builds similarity networks from the MeSH terms and the semantic concepts in titles and abstracts. Considering these differences in the selection of textual features, different preprocessing methods are adopted for titles and abstracts. For PMRA, we delete the punctuation marks, numbers and redundant blanks in the titles and abstracts, then use the Porter algorithm [20] to extract stems and generate stem sets. The term frequency and inverse document frequency of every stem in each article are counted. For the RWR algorithm, we use Garcia’s approach [2], which adopts the concept mapping annotator (CMA) method [21], to identify semantic concepts from the UMLS metathesaurus (2016AB edition) in the titles and abstracts. CMA provides candidate concepts and corresponding matching scores when annotating semantic information, similar to MetaMap [22]. The threshold on matching scores is kept low to obtain as many semantic concepts as possible. Each identified concept is labeled by its Concept Unique Identifier (CUI) in the UMLS metathesaurus, and the TF-IDF of each CUI is counted. To make building the MeSH-concept network model easier, we filter the semantic concepts in each document. Because articles in Medline usually have 3–5 major MeSH terms, we choose a comparable number of concepts for the MeSH-concept network for balance; in this study, we retain only the top 5 concepts by TF-IDF. A semantic concept identified by CMA can be either a single word (e.g., Gene) or a phrase with a specific meaning (e.g., Gene Expression), so the matched concepts carry richer semantic information.
Three term-document matrices are built after text preprocessing: a MeSH term-document matrix, a stem-document matrix and a semantic concept-document matrix. The MeSH term-document matrix is a 0–1 matrix in which each row represents a MeSH term, each column represents a document, and an element is 1 if the term is assigned to the document. The other two matrices are similar, except that their elements are the TF-IDF scores of terms, calculated beforehand. To be consistent with the PMRA algorithm, we set the word frequency coefficient of stems and semantic concepts in titles to 2 and that in abstracts to 1. The stem-document matrix is used for the PMRA algorithm, and the other two term-document matrices are used for the RWR algorithm.
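As an illustrative sketch of this preprocessing step (the data layout and field names below are our own assumptions, not the original pipeline), the 0–1 MeSH matrix and the TF-IDF stem matrix, with title terms weighted by a factor of 2, could be built as follows:

```python
from collections import Counter
import math

def build_matrices(docs):
    """Build a 0-1 MeSH term-document matrix and a TF-IDF stem-document
    matrix in which title terms are weighted twice as much as abstract
    terms.  `docs` maps PMID -> {'mesh', 'title_stems', 'abstract_stems'}
    (hypothetical field names for illustration)."""
    n_docs = len(docs)
    # document frequency of each stem over the corpus
    df = Counter()
    for d in docs.values():
        df.update(set(d["title_stems"]) | set(d["abstract_stems"]))

    mesh_matrix, tfidf_matrix = {}, {}
    for pmid, d in docs.items():
        mesh_matrix[pmid] = {m: 1 for m in d["mesh"]}
        tf = Counter()
        for s in d["title_stems"]:
            tf[s] += 2          # title terms get frequency coefficient 2
        for s in d["abstract_stems"]:
            tf[s] += 1          # abstract terms get coefficient 1
        tfidf_matrix[pmid] = {s: k * math.log(n_docs / df[s])
                              for s, k in tf.items()}
    return mesh_matrix, tfidf_matrix
```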
Similarity of documents
Similarity of documents based on PMRA
The PMRA algorithm was developed on the basis of BM25; both algorithms assume that the word frequency of documents theoretically obeys a Poisson distribution. In measuring document similarity, PMRA first picks out the terms that co-occur in the titles, abstracts and MeSH terms of two documents, then calculates the weight of each term in its corresponding document (Eq. 1), and finally computes the similarity between the two documents based on the term weights (Eq. 2). For computational simplicity, only the stemmed terms from titles and abstracts are included in this study.
In Eq. 1, idf_{t} is the inverse document frequency of term t in document d. λ and μ are set to the default parameters (λ = 0.022, μ = 0.013) [10], representing the expected occurrence rates of term t depending on whether document d belongs to the topic. k is the term frequency of term t, and l is the length of document d.
In Eq. 2, N is the number of terms that co-occur in the titles and abstracts of documents d and c. All document similarity scores are normalized to [0, 1].
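Schematically, once the per-term weights of Eq. 1 are available (their exact functional form follows [10] and is not reproduced here), Eq. 2 amounts to summing the products of the weights of co-occurring terms. A minimal sketch under that assumption:

```python
def pmra_similarity(weights_d, weights_c):
    """Document-document similarity in the shape of Eq. 2: sum, over the
    terms co-occurring in both documents, of the product of the term's
    weight in each document (per-term weights computed per Eq. 1)."""
    shared = set(weights_d) & set(weights_c)
    return sum(weights_d[t] * weights_c[t] for t in shared)
```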
Similarity of documents based on RWR
Network modeling is performed to calculate the similarity between documents, and a flowchart of the procedure is presented in Fig. 1. Doc_{1−n} denotes the documents in the corpus, and M denotes the MeSH terms extracted from a document. Each document is represented by its MeSH terms to construct the MeSH term-document matrix, as shown in the upper left of Fig. 1. We calculate the similarity between MeSH terms based on their frequencies and the hierarchical relationships of the MeSH tree numbers that correspond to the MeSH terms in the MeSH thesaurus, obtaining the MeSH network. After that, semantic concepts are extracted from the titles and abstracts of articles and identified in the UMLS (Unified Medical Language System) metathesaurus. The MeSH-concept similarity network is constructed by adding semantic concepts to the MeSH similarity network as nodes. In the two similarity networks mentioned above, the nodes denote MeSH terms or semantic concepts. RWR is performed on the MeSH-concept similarity network, taking the MeSH terms and semantic concepts of each document as seed nodes to simulate a random walk over the whole network. The transfer probabilities over all nodes in the network are obtained for each document, and the similarity between documents is computed as the cosine of their transfer probability vectors. The document similarity network is then constructed, as presented in the lower right of Fig. 1.
MeSH similarity network
Jiang and Conrath argued that the similarity between two terms depends on the difference between their individual informativeness and their common informativeness [23]. The MeSH tree structure, a directed acyclic graph, has the following properties: (i) one MeSH tree number (a node) corresponds to one MeSH term; (ii) a lower MeSH tree number is more specific than an upper one. In other words, the meaning of a MeSH tree number becomes more specific from parent node to child node: the lower the MeSH tree number is located, the more informativeness it has. The amount of information of the root node in the MeSH tree is approximately 0. This method takes into account the hierarchical relationships between MeSH terms, which can be used to calculate their semantic association.
According to Zhu et al. [24], in the MeSH thesaurus each node corresponds to a MeSH tree number. We denote the set of all descendants of node v by des(v). For nodes v1 and v2, we use cca(v1, v2) to represent their closest common ancestor. For example, if v1 = C04.588.945.418.365 and v2 = C04.588.945.418.948.585, then des(v1) is the set of all descendants of node C04.588.945.418.365 (including node v1), and cca(v1, v2) = C04.588.945.418. Count(v) is the frequency of node v in the corpus, and N is the total number of MeSH terms appearing in the corpus. The informativeness of MeSH tree number v is calculated by Eq. 3. The similarity score between MeSH tree numbers v1 and v2 is computed by Eqs. 4 and 5. In this study, ε is set to 3 [24].
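A sketch of the informativeness computation follows. The exact form of Eq. 3 is given in [24]; the corpus-probability formulation below is one common information-content measure consistent with the stated properties (root ≈ 0, deeper nodes more informative) and is an assumption on our part:

```python
import math

def informativeness(v, count, descendants, n_total):
    """Information content of MeSH tree number v: the negative log of the
    corpus probability mass of v together with its descendants des(v).
    A root covering the whole corpus scores ~0; deeper, more specific
    nodes score higher.  Assumes v or a descendant occurs at least once.
    `count` maps tree numbers to Count(v); `descendants(v)` yields des(v)."""
    total = count.get(v, 0) + sum(count.get(u, 0) for u in descendants(v))
    return -math.log(total / n_total)
```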
Because each MeSH term may have one or more tree numbers, we use the average maximum match (AMM) [25] to calculate the similarity score between two MeSH terms. Assuming that the tree number set of MeSH term M1 is denoted as m and that of MeSH term M2 is denoted as n, the similarity of the two MeSH terms is measured by Eq. 6. That is, the similarity score between M1 and M2 is the sum of the maximum similarity between each tree number of M1 and all the tree numbers of M2, plus the sum of the maximum similarity between each tree number of M2 and all the tree numbers of M1, divided by the total number of tree numbers of M1 and M2 (m and n denote the numbers of MeSH tree numbers).
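The AMM of Eq. 6 can be sketched directly from this description, where `sim` is any tree-number similarity function (e.g., the score of Eqs. 4–5):

```python
def amm_similarity(tree_nums_m1, tree_nums_m2, sim):
    """Average maximum match (Eq. 6): for each tree number of one term,
    take its best match among the other term's tree numbers; sum both
    directions and divide by the total number of tree numbers (m + n)."""
    m, n = len(tree_nums_m1), len(tree_nums_m2)
    forward = sum(max(sim(a, b) for b in tree_nums_m2) for a in tree_nums_m1)
    backward = sum(max(sim(b, a) for a in tree_nums_m1) for b in tree_nums_m2)
    return (forward + backward) / (m + n)
```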
MeSH-concept similarity network
We use the MeSH similarity matrix as an adjacency matrix and create an undirected weighted network G(V, E), in which a node V_{i} ∈ V represents a MeSH term, an edge E_{i,j} ∈ E represents the relatedness of MeSH terms V_{i} and V_{j}, and the weight of the edge is the similarity score of V_{i} and V_{j}. For better network modeling performance, we add nodes and edges to expand the original MeSH similarity network into a MeSH-concept similarity network. Taking the five semantic concepts C_{i} = {C_{i1}, C_{i2}, C_{i3}, C_{i4}, C_{i5}} filtered from document d as an example, they are added into the network G (V = V ∪ C_{i}) as new nodes, and new edges are added between the five semantic concepts and the MeSH terms of document d; the weights of the new edges are the TF-IDF values (normalized to [0, 1]) of the five semantic concepts, respectively. Because some MeSH terms in a document may be identical to the identified semantic concepts, one MeSH term may end up linked to two or more identical semantic concept nodes. In that case, only the edge with the maximum weight between the MeSH term and the semantic concept is kept. In the processed MeSH-concept similarity network, the weights of edges between MeSH term nodes are the similarities of the two MeSH terms, and the weights of edges between MeSH term nodes and semantic concept nodes are the TF-IDF values of the corresponding semantic concepts. Assuming that there are L mutually distinct semantic concepts in the corpus, and that each MeSH term node is connected to an average of k semantic concepts, the expanded MeSH-concept similarity network has |V| + L nodes and |E| + k·|V| edges.
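The expansion step can be sketched with a plain dict-of-dicts adjacency (the original study's internal network representation is not specified at this level of detail, so this layout is an assumption):

```python
def add_concepts(adj, doc_mesh, doc_concepts):
    """Expand a MeSH similarity network (dict-of-dicts adjacency; edge
    weights = MeSH-MeSH similarity) with one document's top concepts.
    Each concept is linked to every MeSH term of the document, with the
    concept's normalized TF-IDF as the edge weight; when the edge already
    exists, only the maximum weight is kept."""
    for concept, tfidf in doc_concepts.items():
        adj.setdefault(concept, {})
        for mesh in doc_mesh:
            w = max(tfidf, adj[concept].get(mesh, 0.0))
            adj[concept][mesh] = w
            adj.setdefault(mesh, {})[concept] = w   # undirected edge
    return adj
```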
Document similarity network
The principle of the RWR algorithm is to walk randomly along the edges of the network, starting from seed nodes. From a node or a set of nodes, the walker either transits randomly to a neighboring node or returns directly to the starting node, according to certain probabilities. Studies have shown that the transition probability of each node reaches a stable state after iterating the random walk; the probability distribution over the network no longer changes even if the walk continues [18]. In this state, the stable transition probability of each node can be regarded as the similarity score between that node and the seed nodes. The RWR algorithm can be defined as follows:
where the column vector p^{t} represents the probability distribution at the t-th step of the walk, and α represents the probability of returning directly to the seed node, called the restart probability. The smaller α is, the wider the walker ranges over the network. W is the probabilistic transfer matrix, and W_{i,j} denotes the probability of transiting from node i to node j. The restart column vector q represents the probability of each node being a start node (seed node) in the initial state. In the process of the random walk, the iteration of Eq. 7 is performed until p converges, finally yielding a stable probability distribution between the seed nodes and the other nodes.
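The standard RWR update consistent with the symbols defined above, p^{t+1} = (1 − α)·W·p^{t} + α·q, can be sketched as follows (here W is assumed column-normalized and q uniform over the seed nodes; the study itself uses the dRWR function of the R package dnet):

```python
import numpy as np

def rwr(W, seeds, alpha=0.6, tol=1e-10, max_iter=1000):
    """Random walk with restart (Eq. 7): p <- (1-alpha)*W*p + alpha*q,
    where W is a column-normalized transfer matrix and q is uniform over
    the seed nodes.  Returns the convergent probability vector."""
    n = W.shape[0]
    q = np.zeros(n)
    q[seeds] = 1.0 / len(seeds)      # restart distribution on seed nodes
    p = q.copy()
    for _ in range(max_iter):
        p_next = (1 - alpha) * W @ p + alpha * q
        if np.abs(p_next - p).sum() <= tol:   # L1 convergence check
            return p_next
        p = p_next
    return p
```

With α = 0.6, the walker restarts often, keeping the probability mass concentrated near the seed nodes, as described above.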
In the present study, we use the dRWR function from the R package dnet to run the RWR process. Based on a term-document matrix, a 0–1 matrix indicating whether each MeSH term or concept occurs in a document, we apply dRWR to the MeSH-concept network, using the MeSH terms and semantic concepts of each document as a set of seed nodes for the random walk. Finally, the convergent transition probability distribution of each set of seed nodes is obtained, which can be regarded as the textual feature vector of the corresponding document. In this network, the weighted adjacency matrix serves as the probabilistic transfer matrix W; the restart vector q is set to 1 at the nodes corresponding to the document's MeSH terms and semantic concepts, and to 0 elsewhere. To prevent the walker from straying too far from the seed nodes, α is set to 0.6 after many experiments investigating the actual performance of the algorithm. In the iteration process, when the difference between p^{t+1} and p^{t} is less than or equal to 10^{−10}, p is considered convergent and the probability distribution of the seed nodes is obtained.
If we regard the MeSH-concept network as the textual feature space of the whole corpus, the convergent transfer probability vector of document d can be interpreted as the relatedness of every node in the network to document d. Thus, we take the convergent transfer probability vector as the feature vector of document d. RWR is performed to obtain the feature vectors p_{d1} and p_{d2} of documents d1 and d2, respectively; the similarity between the two documents is then calculated with the cosine similarity function (Eq. 8).
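Given the two convergent probability vectors, Eq. 8 is the usual cosine similarity; a minimal sketch:

```python
import numpy as np

def doc_similarity(p_d1, p_d2):
    """Cosine similarity (Eq. 8) between the convergent transfer
    probability vectors of two documents."""
    return float(np.dot(p_d1, p_d2) /
                 (np.linalg.norm(p_d1) * np.linalg.norm(p_d2)))
```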
The article similarity network based on MeSH-concept RWR is constructed in this way, and the same process is applied to the MeSH similarity network to create the article similarity network based on MeSH RWR.
Evaluation indicators
To demonstrate the practical value of using semantic concepts to enhance the MeSH similarity network, we run the RWR algorithm on both the MeSH-concept similarity network and the MeSH similarity network. To compare the performance of PMRA with that of the two RWR algorithms, taking the TREC golden standard as reference, we measure the area under the curve (AUC) at the corpus level, the precision at different similarity thresholds at the topic level, P5 values, and the clustering of documents based on the topological structure of the article similarity networks.
Disregarding the differences between document topics, the document pairs are divided into two categories according to the TREC golden standard: pairs within the same topic and pairs from different topics. Selecting a series of similarity thresholds, we draw the ROC curve with the true positive rate (TPR in Eq. 9) as the ordinate and the false positive rate (FPR in Eq. 9) as the abscissa. The ROC curve reveals the strengths and weaknesses of the algorithms at any similarity threshold: the larger the AUC, the better the algorithm performs.
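One point of the ROC curve (Eq. 9) at a given similarity threshold can be sketched as follows, where a pair is predicted positive ("same topic") if its score reaches the threshold:

```python
def roc_point(scores, labels, threshold):
    """TPR = TP/(TP+FN) and FPR = FP/(FP+TN) (Eq. 9) for one threshold.
    `scores` are pairwise similarities; `labels` are True for pairs
    within the same topic under the golden standard."""
    tp = sum(s >= threshold and y for s, y in zip(scores, labels))
    fp = sum(s >= threshold and not y for s, y in zip(scores, labels))
    fn = sum(s < threshold and y for s, y in zip(scores, labels))
    tn = sum(s < threshold and not y for s, y in zip(scores, labels))
    return tp / (tp + fn), fp / (fp + tn)
```

Sweeping the threshold over [0, 1] and plotting the resulting (FPR, TPR) points traces the ROC curve.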
Biologists tend to value quality over quantity when retrieving literature related to a topic of interest; PubMed, for example, displays only the top five “Related Articles”. Meanwhile, considering that similarities may differ between topics, we rank the documents for every topic by similarity score and compute the precision of the top five ranked documents (P5 in Eq. 10). The P5 value of a topic is the average of the P5 values of all documents in the topic, and the P5 value of the corpus is the mean of the P5 values of all topics. In addition, we select several similarity thresholds, regard a document-document pair as dissimilar if the similarity between them is lower than the threshold, and compute the precision for each topic and for the whole corpus (Precision in Eq. 10). The definitions of true positive (TP), false positive (FP), true negative (TN) and false negative (FN) are shown in the confusion matrix in Fig. 2. We use the Chi-square test to compare the performance of the different algorithms.
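The P5 computation for one query document can be sketched as (documents ranked by descending similarity; `relevant` is the topic's golden-standard document set):

```python
def p_at_5(ranked_pmids, relevant):
    """Precision of the top five ranked documents (P5 in Eq. 10):
    the fraction of the first five retrieved documents that are
    relevant under the golden standard."""
    top5 = ranked_pmids[:5]
    return sum(pmid in relevant for pmid in top5) / 5
```

Averaging this over all documents of a topic gives the topic's P5, and averaging over topics gives the corpus P5, as described above.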
Because the TREC corpus does not provide direct similarity scores but instead classifies documents by topic, we cluster the documents based on the topological structure of the document similarity network and compare the result with the golden standard. For better document clustering performance, we utilize the Infomap algorithm [26] from the Igraph package [27] in the R language [28], which is considered one of the most popular network community detection algorithms at present. The Infomap algorithm takes the coding length of a random walk path through the graph as the objective function to be optimized, transforming community detection in the network into a problem of information compression and coding [29].
Results
MeSH-concept similarity network
Basic information about the MeSH similarity network and the enhanced MeSH-concept similarity network is shown in Table 2. An important network attribute is the clustering coefficient, an indicator of the degree of clustering of nodes in the whole network: the bigger the clustering coefficient, the more closely the nodes are linked to each other. Both networks have distinct community structures and closely connected nodes (Fig. 3). In contrast, the MeSH-concept similarity network is larger in scale, but its degree of clustering decreases, with a lower clustering coefficient. The semantic concept nodes in the MeSH-concept similarity network surround the MeSH term nodes in a radial shape.
ROC curve, P5, and precision
According to the TREC golden standard, the document-document pairs in the document similarity matrix are divided into two categories: doc-doc pairs within the same topic and doc-doc pairs between different topics. By the Wilcoxon rank-sum test, there is a significant difference in similarity scores between the two categories of doc-doc pairs for all three algorithms (MCRWR, MRWR, PMRA) (P < 2.2E−16), which indicates that all three algorithms can identify, to a certain extent, whether documents belong to the same topic. The ROC curves of the three algorithms are shown in Fig. 4. The MCRWR algorithm has the largest AUC (0.93, 95% confidence interval (CI): 0.892–0.961), the MRWR algorithm is slightly inferior with AUC = 0.90 (95% CI: 0.874–0.932), while the AUC of the PMRA algorithm is the smallest (0.67, 95% CI: 0.653–0.71).
The mean P5 values of the 49 topics are shown in Fig. 5 and reported as mean ± standard deviation. The mean P5 value of MCRWR (0.742 ± 0.204) is clearly higher than those of MRWR (0.692 ± 0.226) and PMRA (0.223 ± 0.119). For topics containing fewer than 5 documents (e.g., 104, 133, 136), the mean P5 values of all three algorithms are less than 0.5. The mean P5 values of the PMRA algorithm are lower than those of the two RWR algorithms on all topics except topic 44, where PMRA equals MCRWR. Between the two RWR algorithms, the method based on the MeSH-concept similarity network performs better on 44 topics. The comparisons of the mean P5 values of the three algorithms are statistically significant with P < 0.001. However, P5 values only reveal the precision of the top 5 related documents; a drawback is that they say nothing about the precision over all related documents.
To strengthen the results, after comparing the areas under the ROC curves and the mean P5 values, we calculate the precision of every topic at similarity thresholds from 0 to 0.9. As shown in Fig. 6, the upper-left graph illustrates the average precision over all topics, and the other three subgraphs present the precision for three randomly chosen topics (138, 128, 114), respectively. The three curves are nearly the same at the initial point (threshold = 0) and increase as the similarity threshold grows. The precision curves of the two RWR algorithms lie above that of PMRA at all similarity thresholds. The precision curve of MeSH-concept RWR rises rapidly at similarity thresholds from 0 to 0.5, then approaches 1 with little further change. The precision of MeSH RWR is higher than that of MeSH-concept RWR at similarity thresholds of at most 0.1, and its precision curve increases steeply to its peak at similarity thresholds from 0.2 to 0.7. The PMRA precision curve rises faster as the similarity threshold increases, which indicates that the PMRA algorithm is sensitive to highly similar documents. The comparisons of the precisions of the three algorithms are statistically significant with P < 0.05 (P = 0.002 for the topic mean, P = 0.03 for topic 138, P = 0.003 for topic 128, and P = 0.001 for topic 114).
Network clustering
As a manageable test case with a suitable number of articles, we take 210 articles from 9 topics in the fourth semantic template, “Genes & Function of organ & Disease”, as nodes, and construct an undirected weighted similarity network based on MeSH-concept RWR, whose edge weights are the similarity scores between documents. As first constructed, the topological structure of the document similarity network is very hard to decipher and difficult to interpret. To elucidate the document clustering performance more clearly, we prune the network using a similarity threshold: given a threshold, if the similarity score between two nodes is lower than the threshold, the edge between them is deleted; otherwise it is retained. After testing many similarity thresholds, we find that 0.08 is a cutoff value that makes the network more interpretable. The appropriate threshold differs between datasets, so many tests are needed to find a suitable one. After pruning with the similarity threshold of 0.08, the processed network (see Fig. 7) contains 210 nodes and 4119 edges; nodes of the same color represent the same topic according to the golden standard. A total of eight communities are detected after network clustering. The document sets of four communities (2, 3, 7, 8) are identical to those of four topics in the golden standard. Community 4 is a subset of topic 132, and community 1 contains all the documents of topic 137 plus one document from topic 132. Community 5 includes all the documents of topic 130 and two papers from topic 133. Community 6 contains all the documents of topic 134 and three articles from topic 133. Specific information on the clustering results and the golden standard is presented in Tables 3 and 4, respectively.
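The pruning step described above is a simple threshold filter over the weighted edge list; a minimal sketch (0.08 is the cutoff found for this corpus; other corpora require their own tuning):

```python
def prune(edges, threshold=0.08):
    """Keep only edges whose similarity weight reaches the threshold.
    `edges` is a list of (node_u, node_v, weight) tuples; edges below
    the threshold are dropped so the community structure becomes legible."""
    return [(u, v, w) for u, v, w in edges if w >= threshold]
```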
Discussion
Several methods to calculate the similarity between MeSH terms were investigated in [30], along with measures to compute the similarity of documents based on the similarity of MeSH terms. However, that approach only considered the direct similarity between MeSH terms and ignored the indirect relationships of mutual transition between them, which may leave hidden semantic information between documents undiscovered. The RWR algorithm applied to the MeSH similarity network largely solves this problem of MeSH similarity transition. Figure 8 shows a simple MeSH similarity network. There is no link between “Osteochondritis” and “Breast Neoplasms”, but both are neighbors of “Breast Diseases”. If we arbitrarily considered them unrelated, we would ignore the bridging role of the term “Breast Diseases”. In the RWR method, when the random walker transits from the seed node “Osteochondritis” to other nodes, it can jump to “Breast Neoplasms” via the bridging term “Breast Diseases”. In addition, the more closely the seed MeSH terms of two articles are linked in the network, the higher the similarity between the two documents. In Fig. 8, for example, the similarity between article A (red nodes) and article B (yellow nodes) is obviously higher than the similarity between article A and article C (green nodes).
Because precision differs between topics, representing document similarity performance only by the macro-averaged mean precision across topics might introduce bias. Therefore, we drew boxplots of per-topic precision under different thresholds (Fig. 9). Clearly, in addition to the mean precision, the quartiles of per-topic precision for the RWR algorithm are also higher than those of the PMRA algorithm at the same similarity threshold. Precision differences among topics varied with the threshold, which might be due to differences in the number of documents and the actual content of the 49 topics in the corpus.
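The macro-averaging concern can be made concrete: a macro-average weights each topic equally, while a pooled (micro) average is dominated by large topics. A small sketch with hypothetical TP/FP counts, not the paper's measurements:

```python
def macro_precision(tp_fp_by_topic):
    """Mean of per-topic precisions: every topic counts equally."""
    precs = [tp / (tp + fp) for tp, fp in tp_fp_by_topic]
    return sum(precs) / len(precs)

def micro_precision(tp_fp_by_topic):
    """Pooled precision: large topics dominate the average."""
    tp = sum(t for t, _ in tp_fp_by_topic)
    fp = sum(f for _, f in tp_fp_by_topic)
    return tp / (tp + fp)

# Hypothetical (TP, FP) counts for three topics of very different sizes.
counts = [(90, 10), (8, 2), (1, 9)]
print(round(macro_precision(counts), 3))  # 0.6
print(round(micro_precision(counts), 3))  # 0.825
```

The gap between the two figures is exactly the bias the boxplots are meant to expose.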
The proposed method for measuring document similarity can not only retrieve the "Related Articles" of a given document but also perform unsupervised clustering of a document collection according to its semantic information. Specific communities can be detected from the topological structure of the document similarity network. Meanwhile, we can provide biologists with information navigation through the interconnections of document nodes in the network. Since our method is based on the semantic information of documents, identifying important nodes (such as nodes with high degree centrality) in the document network can provide strong supporting evidence for evaluating the importance of documents.
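For instance, ranking documents by degree centrality in the similarity network takes one call in `networkx` (a hypothetical toy graph for illustration; the paper's own analysis uses igraph):

```python
import networkx as nx

# Toy document similarity network with made-up weights.
g = nx.Graph()
g.add_weighted_edges_from([
    ("A", "B", 0.3), ("A", "C", 0.2), ("A", "D", 0.4),
    ("B", "C", 0.1), ("D", "E", 0.15),
])
# Well-connected nodes are candidates for "important" documents.
ranking = sorted(nx.degree_centrality(g).items(), key=lambda kv: -kv[1])
print(ranking[0][0])  # "A" has the most connections
```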
In this study, MeSH terms and semantic concepts identified from titles and abstracts were used as document feature vectors to measure document similarity, so the calculated similarity varies dynamically with the corpus, which is consistent with reality: with the birth of new knowledge and new literature, the similarity between documents is by no means fixed. It is well known that there are only coarse-grained hierarchical relationships between nodes in the MeSH classification system. This superficial annotation of semantic relationships greatly limits the performance of our algorithm. Moreover, when adding edges between semantic-concept nodes and MeSH-term nodes, we simply took TF-IDF as the edge weights and did not consider the semantic relationships between MeSH terms and semantic concepts. To compensate for these shortcomings, in future work we will utilize the 134 semantic types of the UMLS semantic network and its 54 kinds of semantic relationships to combine MeSH terms with semantic concepts [13]. By using fine-grained second-level mapping of semantic types and semantic relationships, we can analyze document similarities in greater depth.
Conclusion
We propose a new approach to measuring document similarity in a network semantic space by combining semantic information with a network model. In constructing the network model, combining semantic concepts with MeSH terms is better than using MeSH terms alone. MeSH-concept RWR outperforms PubMed's PMRA in AUC, P5, precision, and network clustering, which suggests that MeSH-concept RWR has great application prospects for classifying and clustering documents. Moreover, it also provides a new perspective for evaluating documents.
Availability of data and materials
The datasets used and analysed during the current study are available from the corresponding author on reasonable request.
Abbreviations
MeSH: Medical subject headings
MCRWR: MeSH-concept random walk with restart
MRWR: MeSH random walk with restart
PMRA: PubMed related article
ROC: Receiver operating characteristic
RWR: Random walk with restart
TF-IDF: Term frequency-inverse document frequency
LSA: Latent semantic analysis
UMLS: Unified medical language system
TREC: Text retrieval conference
GTTs: Generic topic templates
PMID: PubMed ID
CMA: Concept mapping annotator
CUI: Concept unique identifier
BM25: Best match 25
AUC: Area under the curve
TPR: True positive rate
AMM: Average maximum match
FPR: False positive rate
TP: True positive
FP: False positive
TN: True negative
FN: False negative
References
PubMed Overview. https://pubmed.ncbi.nlm.nih.gov/about/. Accessed 20 Mar 2021.
Garcia Castro LJ, Berlanga R, Garcia A. In the pursuit of a semantic similarity metric based on UMLS annotations for articles in PubMed Central Open Access. J Biomed Inform. 2015;57:204–18.
Small H. Co-citation in the scientific literature: a new measure of the relationship between two documents. J Am Soc Inform Sci. 1973;24(4):265–9.
Chandrasekaran D, Mago V. Evolution of semantic similarity—a survey. ACM Comput Surv. 2021;54(2):1–37.
Boyack KW, Newman D, Duhon RJ, Klavans R, Patek M, Biberstine JR, Schijvenaars B, Skupin A, Ma N, Börner K. Clustering more than two million biomedical publications: comparing the accuracies of nine text-based similarity approaches. PLoS ONE. 2011;6(3):e18029.
Salton G, Buckley C. Term-weighting approaches in automatic text retrieval. Inf Process Manag. 1988;24:513–23.
Deerwester S, Dumais ST, Furnas GW, Landauer TK, Harshman R. Indexing by latent semantic analysis. J Am Soc Inf Sci. 1990;41:391–407.
Blei DM, Ng AY, Jordan MI. Latent Dirichlet allocation. J Mach Learn. 2003;3:993–1022.
Sparck Jones K, Walker S, Robertson SE. A probabilistic model of information retrieval: development and comparative experiments Part 1. Inf Process Manag. 2000;36:779–808.
Sparck Jones K, Walker S, Robertson SE. A probabilistic model of information retrieval: development and comparative experiments Part 2. Inf Process Manag. 2000;36:809–40.
Lin J, Wilbur WJ. PubMed related articles: a probabilistic topic-based model for content similarity. BMC Bioinf. 2007;8:423.
Rogers F. Medical subject headings. Bull Med Libr Assoc. 1963;51:114–6.
Bodenreider O. The Unified Medical Language System (UMLS): integrating biomedical terminology. Nucleic Acids Res. 2004;32(Database issue):D267–70.
Pan XW, Yang Y, Cui L. Research review of scientific paper network and conception of constructing paper similarity network. J Med Inf. 2013;34(6):48–54.
Pan XW. Comparison and evaluation of content and semantic similarity article network construction methods. China Medical University, 2014.
Suratanee A, Plaimas K. DDA: a novel network-based scoring method to identify disease-disease associations. Bioinform Biol Insights. 2015;9:175–86.
Sun J, Shi H, Wang Z, Zhang C, Liu L, Wang L, He W, Hao D, Liu S, Zhou M. Inferring novel lncRNA-disease associations based on a random walk model of a lncRNA functional similarity network. Mol Biosyst. 2014;10(8):2074–81.
Lovász L. Random walks on graphs: a survey. Combinatorics. 1996: 353–398.
Hersh WR, Bhupatiraju RT. TREC genomics track overview. Trec Proc. 2005;2006:14–25.
Rijsbergen C, Robertson SE, Porter MF. New models in probabilistic information retrieval. 1980.
Berlanga R, Nebot V, Jimenez E. Semantic annotation of biomedical texts through concept retrieval. Procesamiento Del Lenguaje Natural. 2010;45:247–50.
MetaMap—a tool for recognizing UMLS concepts in text. https://lhncbc.nlm.nih.gov/ii/tools/MetaMap.html. Accessed 8 Oct 2021.
Jiang JJ, Conrath DW. Semantic similarity based on corpus statistics and lexical taxonomy. Rocling. 1997:11512–0.
Zhu SF, Zeng J, Mamitsuka H. Enhancing MEDLINE document clustering by incorporating MeSH semantic similarity. Bioinformatics. 2009;25(15):1944–51.
Wang JZ, Du Z, Payattakool R, Yu PS, Chen CF. A new method to measure the semantic similarity of GO terms. Bioinformatics. 2007;23(10):1274–81.
Rosvall M, Bergstrom CT. Maps of random walks on complex networks reveal community structure. Proc Natl Acad Sci USA. 2008;105(4):1118–23.
Csardi G, Nepusz T. The igraph software package for complex network research. InterJournal. 2006:1695.
R Core Team. R: a language and environment for statistical computing. 2013.
Rosvall M, Axelsson D, Bergstrom CT. The map equation. Eur Phys J Spec Top. 2009;178:13.
Zhou J, Shui Y, Peng S, Li X, Mamitsuka H, Zhu S. MeSHSim: an R/Bioconductor package for measuring semantic similarity over MeSH headings and MEDLINE documents. J Bioinform Comput Biol. 2015;13(6):1542002.
Acknowledgements
Not applicable.
Funding
Not applicable.
Author information
Authors and Affiliations
Contributions
XP and PH did all the computational work. XP and LC contributed to the writing of the manuscript. SL and LC conceived the study. All authors have read and approved the final manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Pan, X., Huang, P., Li, S. et al. MCRWR: a new method to measure the similarity of documents based on semantic network. BMC Bioinformatics 23, 56 (2022). https://doi.org/10.1186/s12859-022-04578-1