SicknessMiner: a deep-learning-driven text-mining tool to abridge disease-disease associations

Background Blood cancers (BCs) are responsible for over 720 K yearly deaths worldwide. Their prevalence and mortality-rate uphold the relevance of research related to BCs. Despite the availability of different resources establishing Disease-Disease Associations (DDAs), the knowledge is scattered and not accessible in a straightforward way to the scientific community. Here, we propose SicknessMiner, a biomedical Text-Mining (TM) approach towards the centralization of DDAs. Our methodology encompasses Named Entity Recognition (NER) and Named Entity Normalization (NEN) steps, and the DDAs retrieved were compared to the DisGeNET resource for qualitative and quantitative comparison. Results We obtained the DDAs via co-mention using our SicknessMiner or gene- or variant-disease similarity on DisGeNET. SicknessMiner was able to retrieve around 92% of the DisGeNET results and nearly 15% of the SicknessMiner results were specific to our pipeline. Conclusions SicknessMiner is a valuable tool to extract disease-disease relationship from RAW input corpus. Supplementary Information The online version contains supplementary material available at 10.1186/s12859-021-04397-w.

from a plethora of sources to define related pathologies [3] or (ii) DisGeNET that defines DDAs as diseases that share, at least, a common gene or variant among the gene-disease associations considered in the database. DisGeNET comprises data retrieved from several approaches such as expert-curated resources, animal models, inferred data from Human Phenotype Ontology (HPO) and variant-related resources, and previously mentioned gene-disease associations [4].
In this work, our goal was to retrieve DDAs while lifting the strain on establishing or possessing prior gene-disease lists, and we concentrated herein on cancer with a particular focus on Blood Cancers (BCs). Cancer is a complex and multifactorial disease for which there is a widespread availability of related literature [5], useful for biomarker research [6], drug discovery [7], biological pathways detection [8], among others, highlighting this technique's positive contribution to a highly demanding field of study. The strength of mining the literature towards building disease networks was tapped into on multiple occasions as recently reviewed by Rodríguez-González [9]. In particular, BCs or hematologic cancers, which affect the production and function of blood cells encompassing the leukemia, lymphoma and multiple myeloma families, represent a large percentage of overall detected cancers (higher than 13%) and have a mortality-rate higher than 8% in the USA [10], with over 720 K deaths worldwide each year [11].
Biomedical text-mining, henceforth Text-Mining (TM), has already been used as a strategy to disclose analogous associations [12,13]. TM allows the retrieval of informative data from latent data within biomedical literature, namely toward data-driven analysis [14,15]. TM has greatly benefited from the takeoff of Artificial Intelligence (AI) algorithms that can be used for Natural Language Processing (NLP), an AI subfield dedicated to bridge the gap between human language and the language of computers. TM systems are most often built as pipelines that encompass a variable number of steps depending on the general goal of the research. Nonetheless, the initial steps of Named Entity Recognition (NER) and Named Entity Normalization (NEN) are generally needed. NER processes the input text into a set of entities, whilst NEN maps an entity into a concept in a terminology [16,17]. State-Of-The-Art methods (SOTA), such as Bidirectional Encoder Representations from Transformers for Biomedical Text Mining (BioBERT), provide excellent results for the NER step of TM, attaining human-like performance [17]. However, there is still a lack of NEN models that can tackle multiple categories as well as a need for combined NER and NEN strategies to be time-smart and enhance accuracy by integrating other data types such as ontologies directly upon the entity's retrieval step [14].
Herein, we present SicknessMiner, a leading-edge TM pipeline encompassing both NER and NEN steps. The pipeline uses SOTA models such as BioBERT for NER and NormCo for NEN. SicknessMiner does not require a specific input format and, as such, has broad applicability and is easy to use. SicknessMiner was fine-tuned for diseases and is able to return them as entities mapped to Medical Subject Headings (MeSH) terms with top performance. Both models of SicknessMiner were trained on the NCBI Disease dataset [18]. SicknessMiner demonstrates that our chosen DDA evaluation, ranking via disease-disease co-mention in a set of scientific publications, can be highly useful to find meaningful relations. This is a powerful finding, since this technique relies only on NER and NEN models. Such high-accuracy models can be easily constructed, widening SicknessMiner's applicability, as it does not require the existence of other ontologies or knowledge bases regarding each specific domain of application. To further evaluate our DDA results, we compared them with a well-known database, DisGeNET, which classifies DDAs based on the similarity of their gene- or variant-disease lists. Figure 1 illustrates the methodological pipeline followed to attain, validate and evaluate SicknessMiner.

Results
SicknessMiner's two modules, NER and NEN, were trained using a corpus of 793 PubMed abstracts with over 6.8 K disease mentions mapped to 790 unique concepts linked to either the MeSH or the Online Mendelian Inheritance in Man (OMIM) databases. The NER task consists of identifying entities of interest in a given corpus. For example, given a RAW text, the output of the NER module is a set of text spans where the entities (in our particular case, known diseases) can be found. After the NER step, the retrieved text spans serve as input to the NEN module, which maps each text span to an entry in a given ontology. If an entity is not found in the ontology, it is mapped as unknown. The main output of SicknessMiner consists of the set of entities found in a given corpus, with their respective IDs in the key ontology, as well as the place in the text where each entity was found.
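The two-step flow described above (spans from NER, then IDs from NEN, with "unknown" for unmapped entities) can be illustrated with a minimal sketch. This is not the SicknessMiner implementation: `toy_ner` and `toy_nen` are hypothetical stand-ins that use plain string lookup in place of BioBERT and NormCo, and the lexicon and the two-entry ontology are toy assumptions chosen only to show the shape of the output.

```python
from dataclasses import dataclass
from typing import Dict, List

UNKNOWN_ID = "unknown"  # label used when an entity has no entry in the ontology


@dataclass
class Entity:
    start: int  # character offset where the mention begins
    end: int    # character offset just past the mention
    text: str   # surface form as it appears in the input
    concept_id: str = UNKNOWN_ID


def toy_ner(text: str, lexicon: List[str]) -> List[Entity]:
    """Hypothetical stand-in for the NER model: case-insensitive lexicon search."""
    lowered = text.lower()
    found = []
    for term in lexicon:
        pos = lowered.find(term.lower())
        while pos != -1:
            found.append(Entity(pos, pos + len(term), text[pos:pos + len(term)]))
            pos = lowered.find(term.lower(), pos + 1)
    return sorted(found, key=lambda e: e.start)


def toy_nen(entities: List[Entity], ontology: Dict[str, str]) -> List[Entity]:
    """Hypothetical stand-in for the NEN model: exact lookup of the lowercased span."""
    for entity in entities:
        entity.concept_id = ontology.get(entity.text.lower(), UNKNOWN_ID)
    return entities


if __name__ == "__main__":
    text = "Anemia is common in multiple myeloma, often with fever."
    lexicon = ["anemia", "multiple myeloma", "fever"]
    # Toy ontology (assumption): "fever" is left out on purpose to show the unknown case.
    ontology = {"anemia": "D000740", "multiple myeloma": "D009101"}
    for entity in toy_nen(toy_ner(text, lexicon), ontology):
        print(entity)
```

Each returned record carries the span offsets, the surface text and the mapped ID, which mirrors the three pieces of information the pipeline output is described as containing.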
Both models of SicknessMiner were trained on the NCBI Disease dataset [18]. For the NER, we started the BioBERT model with the BioBERT v1.1 pre-trained weights [17] and fine-tuned them in the NCBI Disease train dataset using the default parameter of the system made available by Lee et al. [17]. For the NEN, we trained the NormCo [19] on the NCBI Disease train dataset using the default parameter of the system.

Fig. 1 SicknessMiner pipeline: a TM approach for DDAs. SicknessMiner is a two-step pipeline integrating subsequent modules for NER and NEN. First, from the RAW text input, an entity's list is plotted according to co-mentions with more than 1 BC type. To evaluate SicknessMiner, we used the BC5CDR evaluation kit. Finally, DDAs were doubly assessed via SicknessMiner and DisGeNET and further evaluation was performed for a better understanding of key improvements obtained herein

SicknessMiner evaluation
We chose a simple but efficient method for DDA evaluation: ranking via disease-disease co-mention. To evaluate the performance of our method on the NCBI Disease test set, we used the evaluation kit from the BC5CDR task to compute the precision, recall, and F1-score for both the NER and NEN steps. The BC5CDR evaluation kit is a program that receives a set of labelled text from the user and, using the predictions of the model as input, computes these three performance metrics in order to assess the model developed by the user. SicknessMiner attained a precision of 0.87, recall of 0.89 and F1-score of 0.88 for the NER module; and a precision of 0.80, recall of 0.83 and F1-score of 0.81 for the NEN module given a perfect NER. When the NER results were considered, the NEN module achieved a precision of 0.72, recall of 0.79 and F1-score of 0.76.

DDAs retrieval
We applied DDA retrieval to BCs as already mentioned. We queried PubMed for "((leukemia[Title/Abstract]) OR (multiple myeloma[Title/Abstract])) OR (lymphoma[Title/Abstract])" and retrieved over 390 K titles and abstracts. We merged all the results and submitted them to the SicknessMiner pipeline. To evaluate the results obtained by SicknessMiner, we compared them against DisGeNET with the Concept IDs for the BC types: C0023418, C0024299, and C0026764 for leukemia, lymphoma and multiple myeloma, respectively. DisGeNET was chosen for direct performance comparison because the existing alternative models present several difficulties: they are sometimes not publicly available, do not encompass the same group of diseases, or involve a complex methodological approach, which goes against our main aim of attaining an easy-to-use integrative pipeline [13,20-25]. Whilst DisGeNET retrieved a total of 57,624 co-mentions between 22,611 unique diseases and one of the diseases of interest (Leukemia, Lymphoma or Multiple Myeloma), SicknessMiner retrieved 12,263 co-mentions between 5443 unique diseases.

DDAs evaluation
SicknessMiner was qualitatively and quantitatively compared against the results obtained through DisGeNET. To this end, we plotted a graph for each system, where each node in the graph corresponds to a disease and each edge connects two related diseases (Figs. 2 and 3; full results tables are available as Additional file 1). In Fig. 2, every node in the graph is one of the top 100 DisGeNET hits with the most shared genes, connected to 2 or 3 BCs simultaneously. In Fig. 3, each graph node represents one of the top 100 hits regarding the number of co-mentions with the 3 BCs via SicknessMiner. To make sense of the totality of results and how the two systems compare, we also performed a quantitative analysis for each approach (Fig. 4). To improve readability, we filtered the 100 most relevant nodes. For DisGeNET, the process was straightforward, since it already provided the data related to our target diseases (Leukemia, Lymphoma or Multiple Myeloma). The sets of genes of the target and related diseases were sorted using the Jaccard index, and the top 100 diseases related to at least two target diseases were chosen. For SicknessMiner, we considered that two diseases were related if both of them appeared together in a paper title or abstract (i.e., they are co-mentioned). Afterwards, we ranked the co-mentions based on the number of times they occurred in the text set and chose the top 100 related to at least two target diseases. To attain realistic DDAs in DisGeNET, we required at least 20 shared genes to accept a DDA as positive. While SicknessMiner retrieved about 5% fewer DDAs overall (6455 vs 6087), 92% of the DDAs delivered by DisGeNET were also available in SicknessMiner. Also, close to 16% of the DDAs yielded by SicknessMiner were not identified by DisGeNET, in contrast to only 8% of the diseases from DisGeNET that were not identified by SicknessMiner (Fig. 4).
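The co-mention ranking described above can be sketched as follows. This is an illustrative reading of the text, not the authors' code: the function name `rank_comentions`, the representation of each title/abstract as a set of normalized disease labels, and the choice to ignore target-target pairs are all assumptions.

```python
from collections import Counter
from typing import Dict, Iterable, List, Set, Tuple


def rank_comentions(
    docs: Iterable[Set[str]],
    targets: Set[str],
    min_targets: int = 2,
    top_n: int = 100,
) -> List[Tuple[str, int]]:
    """Rank diseases by how often they are co-mentioned with the target diseases.

    Each element of `docs` is the set of disease labels found in one
    title/abstract. A disease is kept only if it is co-mentioned with at
    least `min_targets` distinct targets.
    """
    totals: Counter = Counter()           # disease -> number of co-mentions
    linked: Dict[str, Set[str]] = {}      # disease -> targets it co-occurs with
    for labels in docs:
        for target in labels & targets:
            for other in labels - targets:  # assumption: skip target-target pairs
                totals[other] += 1
                linked.setdefault(other, set()).add(target)
    ranked = [
        (disease, count)
        for disease, count in totals.most_common()
        if len(linked[disease]) >= min_targets
    ]
    return ranked[:top_n]


if __name__ == "__main__":
    targets = {"leukemia", "lymphoma", "multiple myeloma"}
    docs = [
        {"leukemia", "anemia"},
        {"lymphoma", "anemia"},
        {"leukemia", "fever"},
        {"leukemia", "anemia"},
    ]
    print(rank_comentions(docs, targets))
```

In this toy input, "fever" is dropped because it co-occurs with only one target, which corresponds to the "related to at least two target diseases" filter in the text.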
We also looked at the results from a biological angle since SicknessMiner's goal was to retrieve diseases from biomedical literature, hence, it is domain-specific in its model. The majority of edges detected, thus the majority of the DDAs reported, are some type of cancer. However, in the top 100 edges, SicknessMiner retrieved a higher number of DDAs/edges representing non-cancerous diseases than DisGeNET, 22% and 4%, respectively. In terms of unique non-cancerous DDAs, SicknessMiner could find 17 entities against 3 in DisGeNET. The retrieved non-cancerous DDAs are listed in Table 1.

Discussion
Herein, we presented SicknessMiner, a TM pipeline encompassing our view towards a useful NLP approach for biomedical texts. Our approach highlights the usefulness of TM pipelines as integrated sources of information and as end-to-end platforms easily accessible for beginner users. The use of individual modules in the pipeline allows us to easily replace and further optimize or introduce any enhancements independently, as new SOTA tools become available, which further improves the results of the pipeline as a whole. Furthermore, by using the BioBERT SOTA neural network module, we can take advantage of pre-trained weights that are made available by the community and that can be applied to different tasks besides NER (e.g. Relation Extraction and Question Answering) [17], some of which we would like to explore in future works. Moreover, SicknessMiner benefits from using BioBERT, since transfer learning is a powerful ML/DL technique that gains from any improved weights that arise in the community. By using only abstracts rather than full-texts, our aim was to lower the false-positive rate for non-existent DDAs that could occur from negative results published in the full-text reports. Also, this enables a reproducible effect by resorting only to publicly available data since full-text articles are often difficult to access due to the existing publication fees. Despite some authors pointing to higher performance when using full texts instead of abstracts-only TM [26], the fact that the access to papers is not universal justifies our choice to evaluate our results based on abstracts-only. Nonetheless, we also believe that to postulate new knowledge, one must always dedicate some human curatorship to the hypothesis at hand, which should resolve any errors brought by abstracts-only mining.

Fig. 4 SicknessMiner and DisGeNET comparison
With an easy-access GitHub repository (https://github.com/MoreiraLAB/SicknessMiner) and reproducible examples, SicknessMiner is a valuable tool for the NER and NEN steps of any TM pipeline for disease retrieval. Furthermore, SicknessMiner, via its co-mention analysis, is also able to successfully establish DDAs. Despite the fact that co-mention is often referred to as a lesser approach for establishing associations between mapped entities, SicknessMiner was able to recover close to 92% of DDAs from a well-established web server such as DisGeNET and still contribute 16% of new DDAs. These pave the way for further analysis to assess new contributions to the field of BCs and related diseases. DisGeNET also includes DDAs from TM approaches and yet its performance is not improved compared to SicknessMiner. Moreover, DisGeNET needs gene or variant lists related to a disease, which can be tiresome and costly to obtain. When analyzing our graphs (Figs. 2 and 3), the most noticeable aspect was that the majority of results obtained were cancers (for full results tables, see Supporting Information). Cancer is an intricate process that, despite our scientific advances, still lacks a complete and thorough understanding. Indeed, cancer can be seen as a systemic process that possesses many layers in need of a systems biology approach [27]. With this in mind, we focused instead on the non-cancerous diseases that were most frequently co-mentioned (SicknessMiner) or had the highest Jaccard index (DisGeNET). SicknessMiner was able to retrieve nearly 6 times more non-cancerous diseases. In line with the systemic take on cancer, we also believe that the results retrieved solely by SicknessMiner, such as "anemia", "fever", and "thrombocytopenia", fit this view and offer a preview of the usefulness of such approaches.
The only DDA reported by DisGeNET not present in the top 100 edges from SicknessMiner was rheumatoid arthritis, which was assessed as associated with both leukemia and multiple myeloma. However, SicknessMiner also retrieved the Autoimmune Diseases (AID) MeSH term, of which rheumatoid arthritis is one [28], among the top co-mentions. Finally, it is noteworthy that the top 30% of results retrieved from DisGeNET are some form or variation of the original query, such as "childhood leukemia" or "adult lymphoma", whereas SicknessMiner, comparatively, retrieves fewer of those variations of the original query, highlighting "genetic diseases, inborn", "epstein-barr virus infections", and "myelodysplastic syndromes" in the top 30 edges. Nonetheless, the broad scope of non-cancerous DDAs found by SicknessMiner can provide a wider, more thorough glimpse and complement research lines by providing new insights into relevant DDAs.
During our work, the major problem regarding the NEN step was the existence of several ontologies for diseases, the most prominent being DO [29], MeSH [30], and Unified Medical Language System Concept Unique Identifiers (UMLS CUI) [31]. In fact, results comparison can be hindered by the existence of several ontologies for which a direct correspondence or mapping is not always possible. A centralized ontology or the availability of an external mapping tool would be highly beneficial for TM pipelines, both by enabling comparison and by allowing the integration of data from different sources onto the same model. Hence, in the future, other categories will be included in our model to fully take advantage of such impressive results in the DDA category, and to pave a new way towards accessibility in the AI-OMICS-era.

Conclusion
Accurate TM solutions and software are scarce and still lack the inclusion of ML/DL models. Herein, we introduced SicknessMiner, a novel tool encompassing TM SOTA methods to perform both NER and NEN. SicknessMiner shows a very high performance and can retrieve 92% of all associations fetched by DisGeNET, a respected resource for DDAs. Despite the usefulness of obtaining relationships among diseases, most available models to tackle this problem tend to encompass a layer of ontology or some type of knowledge-based methodology. Our approach, despite seeming simpler, aims to harness the already available knowledge enclosed in scientific papers and uses text-mining tools to retrieve DDAs. We believe that through the development of SicknessMiner, we were able to build a comprehensive, highly upgradeable and customizable, easy-to-use TM pipeline to postulate new relevant DDAs.

TM approach: building SicknessMiner
We used SOTA modules as the basis for the two tasks: BioBERT [17] for NER and NormCo [19] for NEN. BioBERT is a Deep Neural Network (DNN) model, a biomedical domain application of the Bidirectional Encoder Representations from Transformers (BERT) model [32]. BERT models have shown promising performances in a variety of NLP tasks, including NER. BioBERT provides different sets of weights to initialize the neural network, but the best results require parameter training on a distribution similar to the target distribution. We fitted the BioBERT parameters on the NCBI Disease dataset. A similar approach was followed for the NEN task using as basis NormCo [19], a simple, yet powerful, model that uses Recurrent Neural Networks (RNN) to map the names of the entities onto the IDs in a given ontology. We started from a pre-trained set of weights, made available by the authors of NormCo, and fine-tuned the model on the NCBI Disease Corpus [18].
Both NER and NEN fine-tuning were performed following the experiments described in the corresponding papers, and the code provided by the authors. In particular, the NER system was implemented using TensorFlow, with an Adam optimizer with a decay rate of 0.01 and a learning rate of 5e-5, trained for 10 epochs. The NEN system was implemented using Torch, with the same optimizer. We used a learning rate of 5e-4 and 100 epochs. After each epoch, the model was evaluated against a validation set (a set of examples distinct from the training and test sets), and training was stopped if no improvement of the model was achieved after 15 epochs.
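The stopping rule described above (evaluate on the validation set after each epoch, stop after 15 epochs without improvement) can be sketched in a framework-agnostic way. This is a minimal illustration, not the training code of either system: `train_with_early_stopping` and its two callbacks are hypothetical names, and a higher validation score is assumed to be better.

```python
from typing import Callable, Tuple


def train_with_early_stopping(
    train_one_epoch: Callable[[], None],
    validate: Callable[[], float],
    max_epochs: int = 100,
    patience: int = 15,
) -> Tuple[int, float]:
    """Train for up to `max_epochs`, stopping once `patience` consecutive
    epochs pass without a new best validation score.

    Returns the epoch with the best validation score and that score.
    """
    best_score = float("-inf")
    best_epoch = 0
    for epoch in range(1, max_epochs + 1):
        train_one_epoch()
        score = validate()  # assumption: higher is better (e.g. F1-score)
        if score > best_score:
            best_score, best_epoch = score, epoch
        elif epoch - best_epoch >= patience:
            break  # no improvement for `patience` epochs
    return best_epoch, best_score


if __name__ == "__main__":
    # Simulated validation scores: improvement for 3 epochs, then a plateau.
    scores = iter([0.5, 0.6, 0.7] + [0.65] * 50)
    print(train_with_early_stopping(lambda: None, lambda: next(scores)))
```

In practice one would also save the model weights whenever a new best score is reached, so that the returned epoch corresponds to a recoverable checkpoint.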

SicknessMiner evaluation
The evaluation of SicknessMiner was performed in the test set of the NCBI Dataset using the BC5CDR evaluation kit to compute the precision, recall and F1-score for both NER and NEN. This evaluation kit is commonly used to evaluate such tasks since it employs the widely used BC5CDR corpus for results comparison [19]. The NEN was performed on the result of the NER module, which means that errors from NER may cascade to NEN. We opted to use the BC5CDR evaluation kit, instead of the reported evaluation result implemented by each system, to attain a consistent evaluation across all systems. The precision is given by the formula $P = \frac{tp}{tp + fp}$, where $tp$ is the number of true positives and $fp$ is the number of false positives. The recall is given by the formula $R = \frac{tp}{tp + fn}$, where $fn$ is the number of false negatives. Finally, the F1-score is given by the formula $F = \frac{2 \cdot P \cdot R}{P + R}$. Intuitively, the precision measures the confidence of the model in predicting the true examples, whilst the recall measures the capacity of the model to detect the true examples among all data; and the F1-score is a harmonic mean between the precision and recall, used to summarize both metrics in a single number, biased toward the smaller value between the two metrics.
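As a worked illustration of the three formulas above, the following sketch computes them directly from counts. It is not the BC5CDR evaluation kit; the function names are hypothetical, and returning 0.0 when a denominator is zero is a convention assumed here.

```python
from typing import Tuple


def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


def precision_recall_f1(tp: int, fp: int, fn: int) -> Tuple[float, float, float]:
    """Compute P = tp / (tp + fp), R = tp / (tp + fn) and their F1-score."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall, f1_score(precision, recall)


if __name__ == "__main__":
    # Toy counts, chosen only for illustration.
    print(precision_recall_f1(tp=8, fp=2, fn=2))
    # Harmonic mean of the NER precision and recall reported in the Results.
    print(round(f1_score(0.87, 0.89), 2))
```

Applying `f1_score` to the reported NER precision (0.87) and recall (0.89) gives a value that rounds to 0.88, consistent with the F1-score reported for the NER module.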

DDAs retrieval
The PubMed query used was "((leukemia[Title/Abstract]) OR (multiple myeloma[Title/Abstract])) OR (lymphoma[Title/Abstract])", giving a total of 390 K titles and abstracts (as of 21st of April 2021). SicknessMiner was compared to DisGeNET with the Concept IDs for the BC types: C0023418, C0024299, and C0026764 for leukemia, lymphoma and multiple myeloma, respectively. Upon results collection, and since DisGeNET and SicknessMiner use different ontologies to identify diseases, an extra mapping step was performed to match results from both. DisGeNET uses the UMLS CUI and SicknessMiner uses both MeSH and OMIM identifiers. One caveat of mapping at the NEN step is the mismatch among the ontologies used by different sources, which hinders the comparison and/or combination of results. To overcome this shortcoming and to easily compare SicknessMiner and DisGeNET, we converted the DisGeNET mappings, originally in UMLS CUI, to MeSH or OMIM identifiers. To increase the identification power of our approach and to include additional identifier descriptors, further mapping was performed by recursively searching the xml files desc2021, supp2021, pa2021 and qual2021 made available via FTP by NCBI (https://nlmpubs.nlm.nih.gov/projects/mesh/MESH_FILES/xmlmesh/). All performed mappings are also available on our GitHub repository.
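The identifier conversion step can be pictured as a table lookup that keeps track of what could not be mapped. The sketch below is illustrative only: `map_cui_to_mesh` is a hypothetical helper, the three-row table is a toy stand-in for the mapping files in the repository, and the MeSH descriptor IDs shown should be checked against those files rather than taken as authoritative.

```python
from typing import Dict, Iterable, List, Tuple


def map_cui_to_mesh(
    cuis: Iterable[str], cui_to_mesh: Dict[str, str]
) -> Tuple[Dict[str, str], List[str]]:
    """Convert UMLS CUIs to MeSH IDs, returning (mapped, unmapped)."""
    mapped: Dict[str, str] = {}
    unmapped: List[str] = []
    for cui in cuis:
        if cui in cui_to_mesh:
            mapped[cui] = cui_to_mesh[cui]
        else:
            unmapped.append(cui)  # kept so that the coverage of the mapping can be reported
    return mapped, unmapped


if __name__ == "__main__":
    # Toy mapping table (assumption): CUIs from the text, MeSH IDs for illustration.
    table = {
        "C0023418": "D007938",  # leukemia
        "C0024299": "D008223",  # lymphoma
        "C0026764": "D009101",  # multiple myeloma
    }
    # "C9999999" is a made-up CUI used to show the unmapped case.
    print(map_cui_to_mesh(["C0023418", "C0026764", "C9999999"], table))
```

Reporting the unmapped identifiers alongside the mapped ones makes it visible how much of the comparison between the two systems is lost to ontology mismatch.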

DDAs evaluation
SicknessMiner was qualitatively and quantitatively compared against the results obtained through DisGeNET by both graph representation and similarity/dissimilarity percentage calculations. The sets of genes of the target and related diseases were sorted using the Jaccard index, a measurement of similarity between two different sets ($A$ and $B$), given by the formula $J(A, B) = \frac{|A \cap B|}{|A \cup B|}$. Depending on the methodology, we considered that diseases were related in two different ways. For SicknessMiner, we considered that two diseases were positively related to each other whenever they were co-mentioned in a paper title and/or abstract. For DisGeNET, in contrast, we considered that two diseases were related whenever they shared at least 20 genes. Hence, relations were extracted for the entirety of the results, ranked in decreasing order of the number of co-mentions or shared genes for SicknessMiner and DisGeNET, respectively, and the top 100 most frequent relations were kept for further analysis.
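The gene-based criterion used for DisGeNET can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: `jaccard` and `related_by_genes` are hypothetical names, gene sets are plain Python sets of made-up symbols, and sorting by Jaccard index with the shared-gene count as a tie-breaker is one possible reading of the ranking described above.

```python
from typing import Dict, List, Set, Tuple


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard index |A ∩ B| / |A ∪ B| (0.0 when both sets are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def related_by_genes(
    target_genes: Set[str],
    disease_genes: Dict[str, Set[str]],
    min_shared: int = 20,
    top_n: int = 100,
) -> List[Tuple[str, int, float]]:
    """Keep diseases sharing at least `min_shared` genes with the target,
    ranked by Jaccard index (ties broken by number of shared genes)."""
    rows = []
    for disease, genes in disease_genes.items():
        shared = len(target_genes & genes)
        if shared >= min_shared:
            rows.append((disease, shared, jaccard(target_genes, genes)))
    rows.sort(key=lambda row: (row[2], row[1]), reverse=True)
    return rows[:top_n]


if __name__ == "__main__":
    # Made-up gene symbols G0..G29, for illustration only.
    target = {f"G{i}" for i in range(30)}
    candidates = {
        "disease A": {f"G{i}" for i in range(25)},  # 25 shared genes: kept
        "disease B": {f"G{i}" for i in range(10)},  # 10 shared genes: filtered out
    }
    print(related_by_genes(target, candidates))
```

The 20-gene threshold acts as a filter before ranking, so a disease with a small gene list cannot enter the top 100 merely because its few genes all overlap with the target.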