pyMeSHSim: an integrative python package for biomedical named entity recognition, normalization, and comparison of MeSH terms

Background Many disease-causing genes have been identified through different methods, but the biomedical named entities (bio-NEs) describing the disease phenotypes of these genes have not yet been uniformly annotated. Furthermore, semantic similarity comparison between two bio-NE annotations has become important for data integration and systems genetics analysis. Results The package pyMeSHSim recognizes bio-NEs by using MetaMap, which produces Unified Medical Language System (UMLS) concepts through natural language processing. To map the UMLS concepts to Medical Subject Headings (MeSH), pyMeSHSim embeds an in-house dataset containing the main headings (MHs), supplementary concept records (SCRs), and their relations in MeSH. Based on the dataset, pyMeSHSim implements four information content (IC)-based algorithms and one graph-based algorithm to measure the semantic similarity between two MeSH terms. To evaluate its performance, we used pyMeSHSim to parse OMIM and GWAS phenotypes. The introduction of SCRs and the curation strategy for non-MeSH-synonymous UMLS concepts improved the performance of pyMeSHSim in the recognition of OMIM phenotypes. In the curation of 461 GWAS phenotypes, pyMeSHSim showed recall > 0.94, precision > 0.56, and F1 > 0.70, demonstrating better performance than the state-of-the-art tools DNorm and TaggerOne in recognizing MeSH terms from short biomedical phrases. The semantic similarity between the MeSH terms recognized by pyMeSHSim and those assigned in previous manual work was calculated with both pyMeSHSim and another semantic analysis tool, meshes. The semantic similarities computed by the two tools were highly correlated (0.89–0.99). Conclusions The integrative MeSH tool pyMeSHSim, embedded with the MeSH MHs and SCRs, realizes bio-NE recognition, normalization, and comparison in biomedical text mining.


Background
Biomedical named entity (bio-NE) recognition, normalization, and comparison are fundamental tasks for extracting and utilizing valuable biomedical information from textual data. They are important to disease diagnosis [1], drug repositioning [2], overrepresentation analysis [3], and genetic analysis [4]. These functions are realized by identifying key entities in unstructured texts, mapping identified entities to a controlled vocabulary, and measuring the semantic similarity between the vocabulary terms [5].
Medical Subject Headings (MeSH) is a controlled vocabulary that can be used in bio-NE recognition, normalization, and comparison [6]. It consists of three main record types: descriptor records, qualifier records, and supplementary concept records (SCRs). MeSH is curated by the National Library of Medicine (NLM) and serves as the index system in PubMed/MEDLINE and other NLM databases. Since 2002, NLM has used the Medical Text Indexer (MTI) to provide MeSH-based indexing recommendations in bio-NE recognition for the literature [7]. Owing to its precise literature annotations, MeSH has become increasingly popular for normalizing bio-NEs, such as disease names, in public medical and genetic databases [8,9]. Like the structures of Gene Ontology [10] and Disease Ontology, the structure of MeSH is a directed acyclic graph [11], which allows the semantic similarity between two MeSH terms in the graph to be compared.
Several MeSH tools have been developed to realize bio-NE recognition, normalization, or comparison. For bio-NE recognition and normalization, NLM provides an online MeSH browser (https://meshb.nlm.nih.gov/search) to parse MeSH terms from input phrases. However, the browser neither tolerates even subtle differences between the input phrases and MeSH terms, nor supports batch processing. Although some bio-NE tools based on machine learning methods perform well on specific corpora, they were designed to recognize certain categories of MeSH terms, such as diseases and chemicals, from literature abstracts, and their performance on other categories of MeSH terms or on short biomedical phrases is unknown. For bio-NE comparison, meshes [12] and meshSim [13] have recently been developed to measure MeSH semantic similarity by using the R dataset MeSH.Hsa.eg.db [3] as the data framework. However, the lack of SCRs in this MeSH dataset limits the use of both meshes [12] and meshSim for comparing rare diseases such as "alzheimer's disease 7" and "Bardet-Biedl syndrome 11". Furthermore, there is still no integrated one-stop MeSH toolkit that realizes bio-NE recognition, normalization, and comparison.
To solve the above problems, we developed pyMeSHSim, an integrative Python package that realizes bio-NE recognition, normalization, and comparison for MeSH terms. It can directly parse MeSH terms from free biomedical texts and measure the semantic similarity between MeSH term pairs. Additionally, a lightweight comprehensive MeSH dataset was generated and embedded into pyMeSHSim as its data framework, which enables batch processing and the application of pyMeSHSim to both common diseases and rare diseases. [14].

Dataset construction
The many-to-one relationship between MeSH-synonymous UMLS concepts and MeSH MHs was curated from the table MRSAT in UMLS. For example, the MeSH MH "Alzheimer Disease" (D000544) includes seven MeSH concepts, each of which corresponds to several MeSH entry terms and a UMLS concept (Supplementary Table 1). In our dataset, we included the MeSH MHs and related UMLS concepts, while we excluded the MeSH concept and MeSH entry term information. Moreover, we curated the most useful "parent" and "child" relationships between MeSH MHs from the table MRREL in UMLS.
The one-to-one relationship between MeSH-synonymous UMLS concepts and SCRs was curated from the table MRSAT in UMLS. In our dataset, we included the SCRs and their corresponding UMLS concepts, as well as the "narrower" and "broader" relationships between SCRs and MeSH MHs curated from the table MRREL in UMLS.
The qualifier records and the MeSH descriptor records other than MeSH MHs were not included in our dataset. In this study, we use "MeSH term" to refer to a MeSH MH or an SCR.
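The relationships described in this section can be pictured as three lookup tables. The following is a minimal sketch only: the table names, the functions, the concept identifiers, and the SCR identifier are hypothetical placeholders chosen for illustration, and it does not reproduce the embedded dataset or its interface. Only D000544 ("Alzheimer Disease") is taken from the text.

```python
# Toy lookup tables. The CUI_* keys and "SCR_0001" are hypothetical
# placeholders, not real UMLS or MeSH identifiers.
CUI_TO_MH = {"CUI_A1": "D000544", "CUI_A2": "D000544"}  # many-to-one
CUI_TO_SCR = {"CUI_S1": "SCR_0001"}                      # one-to-one
SCR_BROADER = {"SCR_0001": ["D000544"]}                  # SCR -> broader MHs


def normalize(cui):
    """Return (term_id, term_type) for a MeSH-synonymous concept, else None."""
    if cui in CUI_TO_MH:
        return CUI_TO_MH[cui], "MH"
    if cui in CUI_TO_SCR:
        return CUI_TO_SCR[cui], "SCR"
    return None  # non-MeSH-synonymous concept: needs further processing


def to_main_headings(term_id, term_type):
    """Map a MeSH term to MHs: an MH maps to itself, an SCR to its broader MHs."""
    if term_type == "MH":
        return [term_id]
    return list(SCR_BROADER.get(term_id, []))
```

Keeping the SCR-to-MH "broader" relation as a separate table lets a caller choose whether to work with the specific SCR or with its more general MH.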

Bio-NE recognition and normalization
The bio-NE recognition was realized by MetaMap [15], a widely used biomedical natural language processing tool that recognizes UMLS concepts in free texts. Although machine learning methods might perform better than MetaMap in recommending MeSH MHs for MEDLINE citations, their use is constrained by the large amount of training data required to establish the model and by the potential imbalance of the training data [16]. Disease phenotypes from GWASdb [17], OMIM [18], and GAD [19] and drug indications in the public databases DrugBank [20] and TTD [21] could not provide the large amount of training data required by machine learning, whereas MetaMap requires no training data, which is its advantage. The UMLS concepts curated by MetaMap were then converted to MeSH terms based on our dataset. MeSH-synonymous UMLS concepts were directly converted to MHs or SCRs, while non-MeSH-synonymous UMLS concepts, treated as free texts, were first processed into MeSH-synonymous UMLS concepts and then converted to MHs or SCRs.

Bio-NE comparison
We compared the bio-NEs based on the similarity between their corresponding MeSH terms. Semantic similarity is usually calculated with a graph-based or an information content (IC)-based method. The graph-based method measures the node distance between two MeSH terms in the MeSH hierarchical structure, while the IC-based method depends on the specificity and informativeness of the MeSH terms [22].
We retrieved the number of publications indexed by each MeSH term using the NCBI E-utilities [23], and calculated the IC value of each term from these counts. In this calculation, D(d) denotes all the descendant terms of MeSH term d; freq(x) is the number of publications indexed by term x; N is the total number of publications indexed by MeSH; and IC(d) is the IC value of term d.
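The definitions above are consistent with the usual corpus-frequency form of information content. As a sketch only, assuming that D(d) is read as the set containing d together with all of its descendants and that the natural logarithm is used (neither assumption is confirmed by the text), the calculation can be written as:

```latex
p(d) = \frac{\sum_{x \in D(d)} \mathrm{freq}(x)}{N},
\qquad
IC(d) = -\log p(d)
```

Under this form, a term whose subtree indexes few publications has a small p(d) and therefore a high IC, which matches the notion of specificity used by the IC-based methods.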
We implemented four IC-based algorithms, $Sim_{Lin}$, $Sim_{Res}$, $Sim_{Rel}$, and $Sim_{JC}$, which correspond to Lin's [24], Resnik's [25], Schlicker's [26], and Jiang and Conrath's [27] algorithms, respectively. In these algorithms, $d_1$ and $d_2$ are MeSH terms, and the MICA (most informative common ancestor) is the common ancestor of the two selected MeSH terms with the maximal IC value among all their common ancestors. We set the IC of the MICA to 0 for MeSH terms from different categories, where the category is denoted by the first character of the tree number. For example, the MICA value between the MeSH terms "Tauopathies" (tree number: "C10.574.945") and "Schizophrenia" (tree number: "F03.700.750") is 0 because they belong to different categories ("C" for diseases vs "F" for psychiatry and psychology).
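The four IC-based measures can be sketched in Python as follows. The formulas are the forms commonly used in semantic-similarity software for Lin's, Resnik's, Schlicker's, and Jiang and Conrath's measures; they are an assumption here, since the exact expressions and any normalization used by pyMeSHSim are not reproduced in this text. The function names and the sibling tree number "C10.574.100" used in the usage note are hypothetical.

```python
import math


def category(tree_number):
    """MeSH category letter, i.e. the first character of a tree number."""
    return tree_number[0]


def mica_ic(common_ancestor_ics, tree_number_1, tree_number_2):
    """IC of the most informative common ancestor; 0 across categories."""
    if category(tree_number_1) != category(tree_number_2):
        return 0.0
    return max(common_ancestor_ics, default=0.0)


def sim_res(ic_mica):
    """Resnik: the IC of the MICA itself."""
    return ic_mica


def sim_lin(ic_1, ic_2, ic_mica):
    """Lin: MICA information relative to the information of both terms."""
    total = ic_1 + ic_2
    return 2.0 * ic_mica / total if total > 0 else 0.0


def sim_rel(ic_1, ic_2, ic_mica):
    """Schlicker: Lin's score weighted by 1 - p(MICA), with p = exp(-IC)."""
    return sim_lin(ic_1, ic_2, ic_mica) * (1.0 - math.exp(-ic_mica))


def sim_jc(ic_1, ic_2, ic_mica):
    """Jiang and Conrath: one minus the IC distance, capped at 1."""
    return 1.0 - min(1.0, ic_1 + ic_2 - 2.0 * ic_mica)
```

For example, `mica_ic([0.5, 1.0], "C10.574.945", "F03.700.750")` returns 0.0 because the two tree numbers start with different category letters, whereas replacing the second tree number with a hypothetical sibling such as "C10.574.100" returns the largest IC in the list.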
We also implemented the graph-based Wang's [28] algorithm. In this algorithm, $d$ is a MeSH term; $A(d)$ is the set of ancestors deduced from the tree numbers of $d$; $n_a$ is the number of edges between $d$ and $a$; $S_d(a)$ is the semantic contribution of $a$ to $d$; $SV_d$ is the total semantic contribution of all ancestors to $d$; $Sim_{Wang}(d_1, d_2)$ is the Wang's algorithm score between MeSH terms $d_1$ and $d_2$; and $\omega$ is a tuneable weight in the range [0, 1] used to measure the relation between two terms. In this study, we tuned $\omega$ from 0 to 1 with a step of 0.1 to test the robustness of our results (Supplementary Table 2, Supplementary figure 1A, 1B), and set it to 0.6, the value at which pyMeSHSim using Wang's algorithm had the highest correlation with meshes for all the algorithms.
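A minimal sketch of a Wang-style score computed directly on tree numbers is given below. It rests on several assumptions that the text does not confirm: the semantic contribution of an ancestor a to d is taken as ω raised to the number of edges n_a, the term itself is included with a contribution of 1, each term is represented by a single tree number, and ancestors are obtained by truncating the tree number at each dot. The function names and the tree number "C10.574.100" in the usage note are hypothetical.

```python
def semantic_contributions(tree_number, omega=0.6):
    """Map the term and each ancestor on its path to omega ** n_edges."""
    parts = tree_number.split(".")
    depth = len(parts)
    # parts[:i] is the node i levels deep; it lies (depth - i) edges above d.
    return {
        ".".join(parts[:i]): omega ** (depth - i)
        for i in range(depth, 0, -1)
    }


def sim_wang(tree_number_1, tree_number_2, omega=0.6):
    """Shared semantic contribution divided by the total semantic value."""
    s1 = semantic_contributions(tree_number_1, omega)
    s2 = semantic_contributions(tree_number_2, omega)
    shared = s1.keys() & s2.keys()
    numerator = sum(s1[a] + s2[a] for a in shared)
    return numerator / (sum(s1.values()) + sum(s2.values()))
```

With these assumptions, two tree numbers from different categories share no node and score 0, identical tree numbers score 1, and two hypothetical siblings such as "C10.574.945" and "C10.574.100" share "C10.574" and "C10", giving 1.92/3.92 at ω = 0.6.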
Notably, both the IC-based and graph-based methods depend on the tree number, and some MeSH terms have more than one tree number, which results in multiple similarity values for one pair of MeSH terms. We retained only the maximal similarity value between two MeSH terms.
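The rule of keeping the maximal value can be sketched as a loop over all tree-number pairs. In the sketch below, `max_similarity` and `shared_prefix_ratio` are hypothetical names, and `shared_prefix_ratio` is only a toy stand-in for a real pairwise measure so that the example is self-contained; the tree number "C10.574.100" is also hypothetical.

```python
from itertools import product


def shared_prefix_ratio(tree_number_1, tree_number_2):
    """Toy pairwise score: shared leading components over the longer path."""
    parts_1 = tree_number_1.split(".")
    parts_2 = tree_number_2.split(".")
    shared = 0
    for x, y in zip(parts_1, parts_2):
        if x != y:
            break
        shared += 1
    return shared / max(len(parts_1), len(parts_2))


def max_similarity(tree_numbers_1, tree_numbers_2, pair_similarity):
    """Score every tree-number pair and keep only the maximal value."""
    return max(
        (pair_similarity(a, b)
         for a, b in product(tree_numbers_1, tree_numbers_2)),
        default=0.0,
    )
```

Taking the maximum means that a term listed in several branches is compared through whichever branch brings it closest to the other term.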

Package detail
pyMeSHSim consists of three subpackages: (1) the metamapWrap subpackage, which recognizes bio-NEs in the text; (2) the data subpackage, which normalizes UMLS concepts into MeSH terms by using the embedded MeSH dataset; and (3) the Sim subpackage, which compares the semantics of MeSH terms by measuring the distance between them (Fig. 1). Detailed descriptions of the subpackages and their parameters are provided in the reference manual (Supplementary File 1, https://pymeshsim.readthedocs.io/en/latest/).

1) The metamapWrap subpackage
The bio-NE recognition and normalization of pyMeSHSim were realized by the metamapWrap subpackage, which is a wrapper for MetaMap [15]. The metamapWrap subpackage curated MeSH-synonymous UMLS concepts from free texts, including non-MeSH-synonymous UMLS concepts, and then converted the curated MeSH-synonymous UMLS concepts into the corresponding MeSH terms via the data subpackage. We set the parameters "-N -J semantic_type_list -R MSH -I -z -conj -Q 4 -silent --sldi" as the default of pyMeSHSim, where semantic_type_list is the list of disease-related semantic types (corresponding to "inpo,dsyn,phpr,anab,orgf,clna,hlca,genf,orga,neop,emod,inbe,lbtr,anst,npop,celc,cell,bpoc,acty,mobd,celf,evnt,sosy,patf,tisu,moft,fndg,bdsu,ortf,menp,acab,comd,sbst,cgab", as can be seen in the manuals). Users can customize the parameters to suit their needs.

Fig. 1 The components and workflow of pyMeSHSim. pyMeSHSim consists of three subpackages, including metamapWrap, data and Sim. In bio-NE recognition, metamapWrap curates the UMLS concepts from free text. In bio-NE normalization, data translates UMLS concepts to MeSH terms, and maps SCRs to MHs using selected records and relationships between records in MeSH. In bio-NE comparison, Sim uses IC-based and graph-based methods to measure semantic similarity between two bio-NEs

2) The data subpackage
The MeSH dataset was embedded into the data subpackage in bcolz format with a corresponding data interface (Supplementary Table 3). It included five tables: (1)

3) The Sim subpackage
The bio-NE comparison of pyMeSHSim was conducted with the Sim subpackage by measuring the distance between MeSH terms. Each narrower SCR record was converted into one or more broader MH terms before the measurement. Like the tool meshes, pyMeSHSim offers five representative semantic similarity measurements: four information content (IC)-based algorithms (Lin's, Resnik's, Schlicker's, and Jiang and Conrath's) and one graph-based algorithm (Wang's).

Evaluation with OMIM phenotypes
To test whether the introduction of SCRs and our curation strategy for non-MeSH-synonymous UMLS concepts improve the performance of pyMeSHSim in bio-NE recognition, we compared the genes annotated with MeSH MHs and SCRs from OMIM [18] phenotype-gene pairs. The OMIM phenotype-gene pairs were collected from the database disease-connect [30], which used MetaMap to process the disease phenotypes into MeSH-synonymous and non-MeSH-synonymous UMLS concepts. MeSH-synonymous UMLS concepts were directly converted into MHs and SCRs by using pyMeSHSim. Subsequently, SCRs were further converted into their "broader" MHs. Non-MeSH-synonymous UMLS concepts, as free texts, were processed into MeSH-synonymous UMLS concepts. Based on the source of their corresponding UMLS concepts, we classified OMIM phenotypes into MH, SCR, and non-MeSH groups. We then compared the genes corresponding to the same MHs across the three groups (Fig. 2). Genes without Entrez IDs were excluded, since Entrez IDs were required for the subsequent disease enrichment analysis. MHs with fewer than 10 genes in at least two groups were also excluded. After filtering, 36 MHs and 1498 MH-gene pairs (Supplementary Table 4) remained, including 761 MH-gene pairs from the MH group, 522 from the SCR group, and 215 from the non-MeSH group. About 87.5% of the MH-gene pairs in the SCR group were also present in the MH group, indicating a high overlap of genetic features between subtype diseases and their corresponding MH diseases, and validating the significance of SCRs in disease curation (Fig. 3). Additionally, a 59.5% overlap of MH-gene pairs was found between the non-MeSH group and the MH group, and a 10.7% overlap between the non-MeSH group and the SCR group, indicating the effectiveness of our curation strategy for non-MeSH-synonymous UMLS concepts.
To further validate the rationale for introducing SCRs and our curation strategy for non-MeSH-synonymous UMLS concepts, we hypothesized that the additional MH-gene pairs derived from SCRs and non-MeSH-synonymous UMLS concepts should improve the gene enrichment in the MH diseases. We retained the seven MHs with at least 5 non-overlapping MH-gene pairs in the SCR group and non-MeSH group, and tested the enrichment of genes corresponding to MHs in the diseases by using the UMLS-based disease enrichment analysis tool DOSE [31]. For each of the seven MHs, the addition of genes from the SCR and non-MeSH groups led to more significant enrichment in the disease mapped to the MH (Table 1). In particular, the addition of 50 genes of the MeSH MH Osteochondrodysplasias (D010009) from the SCR and non-MeSH groups to the 14 genes from the MH group led to a lower, more significant p value (6.57E-35 vs 8.87E-19) of enrichment in the disease Osteochondrodysplasias (Table 1), suggesting that the introduced SCRs and the curation strategy for non-MeSH-synonymous UMLS concepts contributed to the improved performance of pyMeSHSim in bio-NE recognition and normalization.

Evaluation with GWAS phenotypes
To evaluate the performance of pyMeSHSim in bio-NE recognition, we took the manual work of Nelson's group in parsing 461 GWAS phenotypes to MeSH terms as the gold standard, and compared the performance of pyMeSHSim with that of DNorm and TaggerOne, which are state-of-the-art machine learning-based tools for locating and identifying disease and chemical concepts [32][33][34]. DNorm and TaggerOne integrate different lexical resources as training data, and can recognize MeSH terms and OMIM terms from free text. In the performance comparison, we extracted only the MeSH results from these two tools. pyMeSHSim successfully recognized MeSH terms from 442 (96%) GWAS phenotypes, while DNorm and TaggerOne identified only 129 (28%) and 192 (42%), respectively (Supplementary Table 5). There were 158 phenotypes identified by pyMeSHSim but not by DNorm/TaggerOne. Regarding the categories of recognized MeSH terms, pyMeSHSim successfully identified terms in 15/17 categories, while DNorm and TaggerOne, which were designed for disease or chemical entity recognition, identified terms mainly in the "C" (Diseases) and "F" (Psychiatry and Psychology) categories (Supplementary Table 6). Even for phenotypes in the "C" category, pyMeSHSim (> 0.94) showed higher recall than DNorm (> 0.32) and TaggerOne (> 0.49) across all the similarity thresholds used to determine matches with Nelson's manual work as true positives (Supplementary Table 5, Fig. 4). Although the precision of pyMeSHSim (> 0.56) was lower than that of DNorm (> 0.62) and TaggerOne (> 0.64), the differences in precision were subtle when only perfect matches were considered (Table 2, Fig. 4), and the overall performance F1 of pyMeSHSim (> 0.70) was always higher than that of DNorm (> 0.42) and TaggerOne (> 0.55) (Fig. 4). The lower performance of DNorm and TaggerOne may be because they were not designed as MeSH term taggers.
Additionally, the recall, precision, and F1 were all higher for pyMeSHSim with SCRs than without SCRs, demonstrating the contribution of SCRs to the improved performance of pyMeSHSim in bio-NE recognition and normalization.
We then investigated the phenotypes in the "C" category that were tagged with the same MeSH term as Nelson's manual work specifically by pyMeSHSim or by DNorm/TaggerOne, and found 38 phenotypes specifically identified by pyMeSHSim (Supplementary Table 7), but only five by DNorm/TaggerOne (Supplementary Table 8). The 38 phenotypes specifically identified by pyMeSHSim included 26 phenotypes tagged with related MeSH terms by DNorm/TaggerOne (similarity Lin score > 0), and 12 missed by them. Among the 12 phenotypes, "Graves`disease" (D006111), "Paget's disease" (D010001), and "Behcet's disease" (D001528) might have been missed because of the special symbol "`". Meanwhile, the five phenotypes not perfectly identified by pyMeSHSim included three tagged with related MeSH terms by pyMeSHSim, and two missed by it ("Tumor biomarkers" and "Coronary artery calcification"). The phenotype "Tumor biomarkers" was correctly recognized by pyMeSHSim as D014408 (Tumor biomarkers), while it was tagged as D009369 (Neoplasms) by Nelson's group and DNorm. The other phenotype, "Coronary artery calcification", was mistakenly identified as D002113 (Calcification, Physiologic) by pyMeSHSim, while it was tagged as D061205 (Vascular Calcification) by Nelson's group and TaggerOne. The results of this error analysis demonstrated better performance of pyMeSHSim than DNorm and TaggerOne in recognizing MeSH terms from short biomedical phrases like GWAS phenotypes.

A match between the MeSH terms identified by the tools and Nelson's manual work was called a true positive or a false positive when their similarity was higher or lower than the determined threshold, respectively. When the similarity threshold is set to 1, only perfectly matched terms are considered true positives. The recall ($\frac{TP}{TP+FN}$), precision ($\frac{TP}{TP+FP}$), and F1 ($\frac{2 \times precision \times recall}{precision + recall}$) of the tools were calculated at each similarity threshold.

Recall $= \frac{TP}{TP+FN}$, where TP (true positive) is the number of phenotypes whose parsing results matched the manual work at the determined similarity threshold. The similarity between the MeSH terms identified by the two methods was measured with the Lin score, and a result was called a TP or an FP when the similarity was higher or lower than the determined threshold. FN (false negative) is the number of unrecognized phenotypes. Precision $= \frac{TP}{TP+FP}$, where FP (false positive) is the number of phenotypes whose parsing results mismatched the manual work at the determined similarity threshold.
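The threshold-based evaluation described above can be sketched as follows. The function name and the input layout (one Lin score per phenotype, with `None` for a phenotype that was not recognized) are assumptions made for illustration. Scores equal to the threshold are counted as true positives, which is also an assumption, chosen so that a threshold of 1 keeps perfect matches.

```python
def evaluate(lin_scores, threshold):
    """Return (recall, precision, f1) at one similarity threshold.

    lin_scores holds one entry per phenotype: the Lin score between the
    tool's term and the manual term, or None if nothing was recognized.
    """
    recognized = [s for s in lin_scores if s is not None]
    tp = sum(1 for s in recognized if s >= threshold)  # matched manual work
    fp = len(recognized) - tp                          # mismatched
    fn = len(lin_scores) - len(recognized)             # unrecognized
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    total = precision + recall
    f1 = 2.0 * precision * recall / total if total else 0.0
    return recall, precision, f1
```

Calling `evaluate` once per threshold gives the recall, precision, and F1 curves; raising the threshold can only move recognized phenotypes from TP to FP, so recall and precision never increase as the threshold rises.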
We further compared the parsing results of pyMeSHSim with Nelson's manual work, and found 114 phenotypes that were tagged differently (similarity Lin score = 0) and 17 that were missed by pyMeSHSim. The manual work preferred mapping the phenotypes to the disease category (C). For example, phenotypes like "Vitamin E levels", "Hematology traits" and "Pulmonary function" were parsed as "Vitamin E Deficiency" (D014811), "Hematologic Diseases" (D033461) and "Lung Diseases" (D008171) by Nelson's group, while they were identified as "Vitamin E" (D014810), "Hematology" (D006405) and "Lung" (D008168) by pyMeSHSim. However, such a preference in the manual work could lead to bias. For example, "Eye color", "Hair color" and "Serum urate" were parsed as "color vision defects", "hair diseases" and "urinary calculi" by Nelson's group, while they were parsed as "Color, Eye", "Color, Hair" and "Acid, Uric" by pyMeSHSim (Supplementary Table 5). Therefore, at least part of the parsing difference between the manual work and pyMeSHSim can be attributed to human bias in the manual work. Meanwhile, among the 17 phenotypes not recognized by pyMeSHSim, "IgG levels", "IgM levels", "IgE levels", "PR interval" and "QT interval" might have been missed because of the abbreviations they contain (Supplementary Table 5).
To test the semantic similarity function of pyMeSHSim, we calculated all the semantic similarities between the curated MeSH terms using pyMeSHSim and the latest semantic analysis tool, meshes (Supplementary Table 2). The similarity calculated by both packages was 1 when the MeSH terms were the same, and 0 when the MeSH terms were from different categories. For 55 GWAS phenotypes, pyMeSHSim and Nelson's group recognized different terms that belonged to the same category. pyMeSHSim succeeded in calculating the similarities between the term pairs of all 55 phenotypes, while meshes was only capable of comparing MH-MH pairs and failed to compare the SCR-MH pairs of 15 phenotypes (Supplementary Table 2). Of the 15 SCRs parsed by pyMeSHSim, 13 were mapped to the same MHs as those parsed by Nelson's group. For the remaining 40 term pairs, the correlation between the similarities calculated by pyMeSHSim and by meshes ranged from 0.89 (Rel) to 0.97 (Resnik) (Table 3, Supplementary Table 2, Supplementary figure 1B), demonstrating that the performance of pyMeSHSim in bio-NE comparison is similar to, if not better than, that of meshes.

Effectiveness of pyMeSHSim
pyMeSHSim aims to provide users with a one-stop MeSH toolkit for bio-NE recognition, normalization, and comparison, and we made multiple efforts to confirm its effectiveness: (i) we compared the performance of pyMeSHSim in bio-NE recognition and normalization with manual work in parsing GWAS phenotypes, and found high consistency between them, indicating the great potential of pyMeSHSim for aiding professional manual curation of bio-NEs; (ii) we compared the performance of pyMeSHSim in bio-NE recognition and normalization with that of two other tools based on machine learning methods, and showed higher recall and F1 for pyMeSHSim in parsing short biomedical phrases like GWAS phenotypes; (iii) we converted the OMIM phenotypes to MeSH terms using pyMeSHSim, and demonstrated improved effectiveness in bio-NE recognition and normalization by including SCRs in its embedded dataset; (iv) we compared the similarity measurements of pyMeSHSim and meshes, and showed comparable performance in bio-NE comparison.

Caveat
Considering that MeSH is one of the most widely used biomedical vocabularies, pyMeSHSim will further contribute to data integration. In addition, the introduction of SCRs into the embedded dataset enables pyMeSHSim to handle rare diseases in public databases like OMIM and Orphanet (www.orpha.net). However, whether general concepts such as MHs or specific concepts such as SCRs are preferable depends on the end use. Users should take care to select the appropriate terms when using pyMeSHSim.

Conclusions
We developed pyMeSHSim, an integrative, lightweight, and data-rich Python package for biomedical text mining. To the best of our knowledge, this is the first one-stop MeSH toolkit integrating the functions of bio-NE recognition, normalization, and comparison. pyMeSHSim is expected to be widely used as a powerful tool in bioinformatics, computational biology, and biomedical research.