Mining on Alzheimer’s diseases related knowledge graph to identify potential AD-related semantic triples for drug repurposing

Background: To date, there are no effective treatments for most neurodegenerative diseases. Knowledge graphs can provide comprehensive and semantic representations for heterogeneous data, and have been successfully leveraged in many biomedical applications including drug repurposing. Our objective is to construct a knowledge graph from literature to study the relations between Alzheimer’s disease (AD) and chemicals, drugs and dietary supplements in order to identify opportunities to prevent or delay neurodegenerative progression. We collected biomedical annotations and extracted their relations using SemRep via SemMedDB. We used both a BERT-based classifier and rule-based methods during data preprocessing to exclude noise while preserving most AD-related semantic triples. The 1,672,110 filtered triples were used to train knowledge graph completion algorithms (i.e., TransE, DistMult, and ComplEx) to predict candidates that might be helpful for AD treatment or prevention.
Results: Among the three knowledge graph completion models, TransE outperformed the other two (MR = 10.53, Hits@1 = 0.28). We leveraged the time-slicing technique to further evaluate the prediction results. We found supporting evidence for most of the highly ranked candidates predicted by our model, which indicates that our approach can yield reliable new knowledge.
Conclusion: This paper shows that our graph mining model can predict reliable new relationships between AD and other entities (i.e., dietary supplements, chemicals, and drugs). The knowledge graph constructed can facilitate data-driven knowledge discoveries and the generation of novel hypotheses.
Supplementary information: The online version contains supplementary material available at 10.1186/s12859-022-04934-1.


Background
Neurodegenerative diseases are a heterogeneous group of disorders characterized by the progressive degeneration of the structure and function of the central or peripheral nervous system [1]. Common neurodegenerative diseases, such as Alzheimer's disease (AD) and related dementias (ADRD), are usually incurable, irreversible, and difficult to stop.
AD/ADRD are multi-factorial and complex neurodegenerative diseases characterized by progressive memory loss and severe dementia with neuropsychiatric symptoms [2]. An estimated 5.8 million Americans aged 65 and older (12.6%) were living with AD/ADRD in 2020, and this number is projected to reach 13.8 million by 2050 [3]. The high prevalence of AD/ADRD creates huge medical and social burdens. The total costs of health care, long-term care, and hospital services for all Americans with AD/ADRD were estimated at $305 billion in 2020 [3]. The high failure rate of AD/ADRD drug development amplifies these demographic and financial challenges. Given the increasing prevalence of the disease, finding innovative ways to develop effective drugs is an urgent need.
Drug repurposing is a strategy for identifying new uses of approved or investigational drugs that are outside the scope of their original medical indications [4]. There are three main computational methods for discovering drug repurposing evidence: network-based methods, text mining and natural language processing (NLP) based approaches, and machine learning-based approaches [5]. Inspired by the fact that biological entities in the same module of a biological network share similar characteristics, network-based approaches aim to find modules (subnetworks or cliques) according to the topological structure of the network. NLP approaches usually involve identifying biological entities and mining new knowledge from the scientific literature. Machine learning-based approaches apply models such as logistic regression, support vector machines (SVM), random forests (RF), and deep learning (DL) to identify drug repurposing signals.
The computational drug repurposing strategy offers various advantages over developing entirely new drugs, including a lower risk of failure and of unknown side effects or complications, efficient use of development funds, and shortened development timelines [6]. Developments in high-throughput screening technologies have catapulted computational drug repurposing to the forefront of attractive drug discovery approaches because the vast amounts of available data could potentially lead to new clues for drug repurposing that individual projects could not possibly reveal.
Knowledge graphs can provide comprehensive and semantic representations for heterogeneous data and have been successfully leveraged in many biomedical applications, including drug repurposing [7]. For example, several recent studies focused on knowledge graph-based approaches to drug repurposing for COVID-19 [8] [9] [10]. Sosa et al. applied knowledge graph embedding methods to drug repurposing for rare diseases [11]. Malas et al. leveraged the semantic properties of a knowledge graph to prioritize drug candidates for Autosomal Dominant Polycystic Kidney Disease (ADPKD) [12]. However, to the best of our knowledge, knowledge graph-based approaches have rarely been applied to AD/ADRD drug repurposing.
The objective of this paper is to study potential relations between Alzheimer's disease and dietary supplements, chemicals, and drugs using a knowledge graph-based approach. Studies have indicated that some drugs, chemicals, or food supplements could be related to preventing or delaying neurodegeneration and cognitive decline [13]. However, further research is needed to better understand the underlying mechanisms and to reveal potential interactions with clinical and pharmacokinetic factors. In this paper, we encode biomedical concepts and their rich relations into a knowledge graph through literature mining [14]. Literature mining is a data mining technique that identifies entities such as genes, diseases, and chemicals in the literature, discovers global trends, and facilitates hypothesis generation based on existing knowledge. It enables researchers to study a massive amount of literature quickly and to reveal hidden relations between entities that are hard to discover by manual analysis. More specifically, we introduce a biomedical knowledge graph that focuses on AD/ADRD and use it to discover underlying relations between chemicals, drugs, dietary supplements, and AD/ADRD. The construction of the knowledge graph and the use of graph embedding methods to score and predict candidates are described in the Methods section. We also present several rankings of candidates and comparisons of different graph embedding algorithms.
Table 1 shows the performance of three widely used graph completion methods trained on our knowledge graph: TransE is based on translational distance, while DistMult and ComplEx are based on semantic matching. The TransE model performs best among these graph embedding algorithms, with a Mean Rank (MR) of 10.53 and a Hits@10 of 0.58. We therefore used the TransE model to predict potential candidates. Specifically, the final model embeds nodes into 250-dimensional vectors and was trained with a learning rate of 0.01 and an L2 distance metric.
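As an illustration of how the two reported metrics are defined, the following minimal sketch computes Mean Rank and Hits@k from a list of ranks, where each rank is the position of the true entity among all scored candidates for one test triple. The function names and the example ranks are hypothetical; this is not the evaluation code or data used in this study.

```python
import numpy as np

def mean_rank(ranks):
    """Average position of the true entity (1 = best); lower is better."""
    return float(np.mean(ranks))

def hits_at_k(ranks, k):
    """Fraction of test triples whose true entity is ranked within the top k."""
    ranks = np.asarray(ranks)
    return float(np.mean(ranks <= k))

# Hypothetical ranks for eight test triples (illustration only).
example_ranks = [1, 3, 12, 2, 50, 7, 1, 10]

print("MR      =", mean_rank(example_ranks))
print("Hits@1  =", hits_at_k(example_ranks, 1))
print("Hits@10 =", hits_at_k(example_ranks, 10))
```

A single badly ranked triple (rank 50 here) pulls MR up considerably, while Hits@k only counts whether each rank falls inside the cutoff, which is why the two metrics are usually reported together.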

Prediction results
We found that some potential candidates might be relevant to AD prevention and treatment. Based on the training data and our scoring function, we identified the top-ranked subjects that connect with AD-related concepts through the predicates treat or prevent. Tables 2, 4, and 6 show the top 10 entities according to their numbers of appearances for the drug, chemical, and dietary supplement categories, respectively. Tables 3, 5, and 7 show the top 10 ranked triples according to the candidate scores for the three categories. Triples with relevant evidence from PubMed studies published earlier than 1/1/2019 are marked in bold. Triples that appeared only in studies published after 1/1/2019 are marked in italic. The clinical drug and chemical categories were extracted from the Unified Medical Language System (UMLS), and we used the Integrated Dietary Supplement Knowledge Base (iDISK) [15] as a reference for dietary supplements.

Clinical Drug
For the treatment relation, we were able to find evidence in the related literature and clinical trials supporting seven out of ten entities (Table 2) and six out of ten triples (Table 3). All drugs that appear in Table 4 also appear in Table 2, while Table 2 has some extra drugs: Local corticosteroid, Acyclovir, Metronidazole, Cam, and Dexamethasone. Specifically, corticosteroids might become part of a multi-agent regimen for Alzheimer's disease and also have applications for other neurodegenerative disorders [16]. Our model indicates that Valacyclovir, an antiviral medication, might also have an effect on AD/ADRD prevention. While we did not find evidence that Acyclovir is directly related to AD/ADRD, a recent study shows that Valacyclovir antiviral therapy could be used to reduce the risk of dementia [17]. A study demonstrated that antibiotic (ABX) cocktail-mediated perturbations (high-dose kanamycin, gentamicin, colistin, metronidazole, and vancomycin) of the gut microbiome in two independent transgenic lines lead to a reduction in Aβ deposition in male mice and underlie the observed reductions in brain amyloidosis, which is the hallmark of Alzheimer's Disease [18]. Tacrolimus [19] is in a phase two clinical trial, starting 12/1/2021, that investigates its neurobiological effect in persons with MCI and dementia. An early study also indicated that high doses of prednisolone have the effect of reducing amyloid, which resulted in some delay of cognitive decline [20][21]. Propranolol [22] has shown efficacy in reducing cognitive deficits in Alzheimer's transgenic mice. According to Joseph [16], a short pulse of high-dose intrathecal methylprednisolone, dexamethasone, or triamcinolone will result in a detectable slowing of Alzheimer's disease.
As for the prevent relation, we found evidence that supports seven of ten triple predictions (Table 3), and all drugs in this table also appear in Table 3. For example, a 2021 study shows that Amifostine, which appears in our top 4 triple predictions, could mitigate cognitive injury induced by heavy-ion radiation [23]. Betaine could be a promising candidate for arresting Hcy-induced AD-like pathological changes and memory deficits [24]. Mazurek et al. show that Oxytocin could interfere with the formation of memory in experimental animals and contribute to the memory disturbance associated with Alzheimer's disease [25].

Chemical
For the treat relationship prediction, we found supporting evidence for seven out of the top ten entities (Table 4) and eight out of the top ten triple predictions (Table 5). For the treat relations, Table 4 and Table 5 have some overlaps: Amifostine, Chlorhexidine, Amiloride, Etazolate, and Licopyranocoumarin. As discussed in the Clinical Drug section, Amifostine, which appears in our top 1 triple prediction, could mitigate cognitive injury induced by heavy-ion radiation [23]. Moreover, a study has shown that oral pathogens can in some circumstances reach the brain, potentially affecting memory and causing dementia [26]. Since Chlorhexidine can be used to reduce methicillin-resistant Staphylococcus aureus (MRSA) and improve oral health, it might be a potential candidate for the treatment of Alzheimer's Disease. Amiloride is an inhibitor of Na+/H+ exchangers (NHEs), which have been associated with the development of mental disorders or Alzheimer's disease [29]. In addition, we found an earlier clinical trial in which Etazolate was used to moderate AD [30]. Licopyranocoumarin, a compound from herbal medicine, was shown to have a neuroprotective effect in Parkinson's disease [31]. Dexrazoxane and Forskolin appear only in Table 4. A 2019 study implies that Dexrazoxane may serve as an effective neuroprotectant to treat neurodegeneration and has potential clinical value in terms of Parkinson's disease (PD) therapeutics [32]. Forskolin shows neuroprotective effects in APP/PS1 Tg mice and may be a promising drug for the treatment of patients with AD [33]. In addition, Tetracycline and Propargylamine show up only in Table 5. Several studies have mentioned the neuroprotective activity of Tetracycline and its derivatives [27] [28]. The beneficial effects of Propargylamine and its pro-survival/neurorescue inter-related activities relevant to Alzheimer's Disease were discussed in several studies [34] [35].
For the prevention relation, we found six out of ten triples that are related to AD, and all six corresponding chemicals also appear in Table 4. Recent studies show that antibiotic chemicals such as Fluoroquinolones, Amoxicillin, Clarithromycin, and Ampicillin can produce therapeutic effects in Alzheimer's Disease [36] [37]. Although we have not found that Cortisone has a direct effect on Alzheimer's Disease, common anti-inflammatory drugs do have some treatment effects [38]. An earlier study showed that allopurinol can be used to treat aggressive behaviour in patients with dementia [39]. In addition, Ceftriaxone (CEF) appears in Table 4. It significantly attenuated amyloid deposition and the neuroinflammatory response, and a study has confirmed the potential of CEF as a promising treatment against cognitive decline from the early stages of AD progression [40].

Dietary Supplement
Since there is little evidence that food can directly treat or prevent Alzheimer's Disease, we focus on the triples with affect relationships. Among the top 10 predictions in Table 7, we found that dietary fiber (three times), tea (three times), rice, and honey all have the potential to reduce the risk of AD/ADRD, and they also appear in Table 6. Dietary fiber may have a protective effect on brain Aβ burden in older adults, and this finding may assist in the development of diets that prevent AD onset [41]. Moreover, according to [42], green tea intake might reduce the risk of dementia and cognitive impairment. Another study shows that honey can be a rich source of cholinesterase inhibitors and therefore may play a role in AD treatment [43]. Previous studies have also shown that dietary choline intake (e.g., from eggs (egg yolk) and fruits) is associated with better cognitive performance [44]. Increasing dietary intake of minerals could also reduce the risk of dementia. For example, research found a link between potassium levels and the diagnosis of cognitive impairment in Mexican-Americans [45]. In addition, one recent study indicates that highly water pressurized brown rice could ameliorate cognitive dysfunction and reduce the levels of amyloid-β, a major protein responsible for AD/ADRD [46]. Coffee drinking may be associated with a decreased risk of dementia/AD, which may be mediated by caffeine and/or other mechanisms such as antioxidant capacity and increased insulin sensitivity [47]. Existing literature provides a reasonably strong scientific rationale to encourage testing whether ketamine (or its metabolites) has procognitive effects in Alzheimer's patients [48]. Last but not least, based on the available literature, a nutraceutical formulation containing N-acetylcysteine among other compounds has shown some pro-cognitive benefits in Alzheimer's patients [49].

Conclusion
In this study, we built a framework to construct and analyze a knowledge graph that links AD/ADRD-related biomedical knowledge from PubMed to facilitate drug repurposing. More specifically, we focused on identifying potentially new relationships between AD/ADRD and chemicals, drugs, and food supplements, respectively. Our analysis indicated that the pipeline can be used to identify biomedical concepts that are semantically close to each other as well as to reveal relationships between biomedical elements and diseases of interest. Linking sparse knowledge from the fast-growing literature would benefit the retrieval of existing knowledge and information, and may promote the uncovering of new knowledge. This framework is flexible and can be used for other applications such as multi-omics applications, therapeutic discovery, and clinical decision support for neurodegenerative diseases as well as other diseases. The knowledge graph we constructed can facilitate data-driven knowledge discovery and new hypothesis generation.
A breadth of possibilities exists to further improve this framework. First, our knowledge graph leveraged SemMedDB, an existing database that contains triples extracted from PubMed articles. While we tried to improve the accuracy using a BERT-based approach, other NLP techniques could be implemented to further improve the accuracy of information extraction. Second, in addition to including knowledge extracted from the literature, we could incorporate triples from well-acknowledged biomedical databases to further enrich the knowledge graph. Third, we leveraged three state-of-the-art knowledge graph embedding models in this research. In the future, we will investigate new strategies to extend embeddings to cope with sparse and unreliable data as well as multiple relationships. Last but not least, we focused only on the top 10 ranked triples for evaluation in this paper. We were able to identify supporting evidence for most of them, which indicates that our approach can yield reliable new knowledge. In addition, we incorporated only 2.8M triples in our knowledge graph due to computational resource limits; further investigation is needed on additional triples, which could potentially lead to new hypotheses for AD treatment and prevention.

Methods
We constructed a knowledge graph using biomedical concepts and relations extracted from PubMed literature using NLP tools. The extracted triples were then further filtered based on statistics and NLP models. The remaining subject-relation-object triples were used to build the knowledge graph. We then applied graph embedding algorithms to identify potential candidates for AD treatment and prevention. An overview of this pipeline is shown in Figure 1.

Data Collection and Relationship Extraction
To construct the knowledge graph, we obtained triples directly from SemMedDB [50], a database of triples that are automatically extracted from the biomedical literature by the Natural Language Processing (NLP) tool SemRep [51]. Subject and object arguments are normalized to concepts defined in the UMLS with concept unique identifiers (CUIs). The triples are in the form of subject-predicate-object.

Rule-based Filtering
The original data obtained directly from SemMedDB contained a large number of triples, but not all of them are useful for finding candidates for AD/ADRD treatment or prevention. We applied rules similar to those in [8] to exclude unrelated subject/object and predicate types. More specifically, we eliminated triples involving generic biomedical concepts such as Activities & Behaviors, Concepts & Ideas, Objects, Occupations, Organizations, and Phenomena. The remaining triples were filtered based on their degree centrality ($A_{in}$, $A_{out}$) and their $G^2$ score, which indicates the strength of association between a subject and an object. Specifically, the degree centrality ($A_{in}$, $A_{out}$) was calculated from the adjacency matrix $M$, and the $G^2$ score was calculated from the statistical relation between two contingency tables, the observation table and the expectation table [52], where $O_{ijk}$ represents the items in the observation table and $E_{ijk}$ represents the items in the expectation table. Finally, these three scores were normalized to [0, 1] and summed into a final score. To keep the knowledge graph at a size that the graph embedding algorithms could handle, we kept only about 2.5M triples. To ensure that AD-related triples were included in the knowledge graph, we kept all triples related to Alzheimer's disease terms in the UMLS during triple elimination using the above criteria. Table 9 in the additional file section summarizes the AD-related UMLS concepts we kept in this process. In the end, 2.8M triples remained in our knowledge graph.
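The filtering scores can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the textbook definitions $A_{in}(j) = \sum_i M_{ij}$, $A_{out}(i) = \sum_j M_{ij}$, and the log-likelihood ratio statistic $G^2 = 2 \sum_{i,j,k} O_{ijk} \ln(O_{ijk} / E_{ijk})$, and it assumes that the expectation table is built under independence of the three factors. All function names and toy inputs are hypothetical.

```python
import numpy as np

def degree_centrality(M):
    """Return (a_in, a_out) for adjacency matrix M, where M[i, j] = 1 means an edge i -> j."""
    M = np.asarray(M, dtype=float)
    a_in = M.sum(axis=0)   # column sums: edges arriving at each node
    a_out = M.sum(axis=1)  # row sums: edges leaving each node
    return a_in, a_out

def expected_under_independence(observed):
    """Expectation table for a 3-way table, assuming the three factors are independent."""
    O = np.asarray(observed, dtype=float)
    n = O.sum()
    m_i = O.sum(axis=(1, 2))
    m_j = O.sum(axis=(0, 2))
    m_k = O.sum(axis=(0, 1))
    return np.einsum("i,j,k->ijk", m_i, m_j, m_k) / n**2

def g2_score(observed, expected):
    """G^2 = 2 * sum(O * ln(O / E)); cells with O = 0 contribute nothing."""
    O = np.asarray(observed, dtype=float)
    E = np.asarray(expected, dtype=float)
    nonzero = O > 0
    return float(2.0 * np.sum(O[nonzero] * np.log(O[nonzero] / E[nonzero])))

def min_max(values):
    """Rescale values to [0, 1]; a constant input maps to all zeros."""
    v = np.asarray(values, dtype=float)
    span = v.max() - v.min()
    return np.zeros_like(v) if span == 0 else (v - v.min()) / span

def final_score(a_in, a_out, g2):
    """Sum of the three normalized scores, one entry per triple."""
    return min_max(a_in) + min_max(a_out) + min_max(g2)
```

How the node-level centralities are assigned to each triple (for example, out-degree of the subject and in-degree of the object) is not specified in the text, so `final_score` simply takes three per-triple arrays as input.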

Calibration using PubMedBERT
We leveraged about 6,000 annotations from a previous study [15] as the training data for fine-tuning PubMedBERT. These annotations were manually labeled with 1 or 0, where 1 indicates that the triple and its relationship exist and are correct, and 0 means that the triple does not exist or is incorrect. PubMedBERT took as text input the subject, object, and predicate type as well as the sentence from which they were extracted. The model obtained an F1 score of 0.82, a recall of 0.91, and a precision of 0.75 on the validation set, and an F1 score of 0.83, a recall of 0.89, and a precision of 0.78 on these annotations.
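For reference, the F1 score is the harmonic mean of precision and recall. The short sketch below (the function name is hypothetical) recomputes F1 from the precision and recall values reported above as a consistency check.

```python
def f1_from_precision_recall(precision, recall):
    """Harmonic mean of precision and recall; defined as 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Precision and recall as reported in the text.
validation_f1 = f1_from_precision_recall(0.75, 0.91)
annotation_f1 = f1_from_precision_recall(0.78, 0.89)

print(round(validation_f1, 2), round(annotation_f1, 2))
```

Both recomputed values round to the reported F1 scores. The high recall relative to precision is in line with the stated goal of excluding noise while preserving most AD-related triples.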

Graph Embedding Algorithms
Knowledge graph embedding is a promising approach to graph completion tasks [53]. It embeds entities and relations into vector space to evaluate, through a scoring function, the probability that a given triplet (h, r, t) is true. We leveraged three popular knowledge graph embedding methods, TransE, DistMult, and ComplEx, for our knowledge graph completion task. To train on this knowledge graph, the three models perform negative sampling by corrupting triplets (h, r, t) to form either (h', r, t) or (h, r, t'), where h' and t' are the negative samples. Therefore, if y = ±1 is the label for positive and negative triplets and f is the scoring function, then the logistic loss is computed according to [54] as:
$$L = \sum_{(h, r, t)} \log\left(1 + \exp\left(-y \cdot f(h, r, t)\right)\right)$$
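The negative sampling and loss described here can be sketched as follows. This is an illustrative sketch only, assuming the usual logistic loss $\sum \log(1 + \exp(-y \cdot f(h, r, t)))$ with labels $y = \pm 1$; the helper names, entity names, and relation name are hypothetical, and the training code actually used in this study may differ.

```python
import numpy as np

def corrupt(triple, entities, rng, corrupt_head):
    """Build a negative sample by replacing the head or the tail with another entity."""
    h, r, t = triple
    replaced = h if corrupt_head else t
    pool = [e for e in entities if e != replaced]
    new_entity = pool[int(rng.integers(len(pool)))]
    # Note: a corrupted triple may by chance be a true triple; this sketch does not filter those.
    return (new_entity, r, t) if corrupt_head else (h, r, new_entity)

def logistic_loss(scores, labels):
    """Sum over triples of log(1 + exp(-y * f)), with y = +1 (positive) or -1 (negative)."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=float)
    # logaddexp(0, x) evaluates log(1 + exp(x)) in a numerically stable way.
    return float(np.sum(np.logaddexp(0.0, -labels * scores)))

rng = np.random.default_rng(0)
entities = ["entity_a", "entity_b", "entity_c"]
positive = ("entity_a", "relation_x", "entity_b")
negative = corrupt(positive, entities, rng, corrupt_head=False)

print(negative)
print(logistic_loss([5.0, -5.0], [1, -1]))  # positives scored high, negatives low: small loss
```

The loss decreases when positive triplets receive high scores and corrupted triplets receive low scores, which is what pushes the embeddings to separate observed from unobserved relations.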

DistMult
Semantic matching models like DistMult [56] use similarity-based scoring functions that associate each entity with a vector to capture its latent semantics. In this model, each relation is represented as a diagonal matrix that models pairwise interactions between latent factors through a bilinear function, as shown in Table 8.

ComplEx
Since the scoring function of DistMult is symmetric in h and t, it cannot handle asymmetric relationships. Complex Embeddings (ComplEx) [57] introduces complex-valued embeddings to solve this problem. Specifically, the scoring function can be expanded as:
$$\mathrm{Re}\left(h^{\top} \mathrm{diag}(r)\, \bar{t}\right) = \mathrm{Re}\left(\sum_{i} h_i r_i \bar{t}_i\right)$$

Candidates scoring for repurposing
We focused on three kinds of predictions for the candidate selection in this research: dietary supplement candidates, chemical candidates, and clinical drug candidates. The clinical drug and chemical categories were extracted from the UMLS, and we used the iDISK [15] as a reference for dietary supplements. For each type of candidate, the model iterates over all possible triples $(h_i, r_j, t_k)$, where $h_i$ ranges over all nodes of the particular candidate type, $r_j$ over all relations, and $t_k$ over all nodes related to Alzheimer's Disease. In knowledge graph embedding-based approaches, the scoring function $\phi(h, r, t)$ is defined in terms of the embeddings of entities and relations; i.e., h, r, and t are embedded into vector space, and $\phi$ is defined in terms of operations over these embeddings. All three models project entities and relations to lower-dimensional embeddings but use different scoring functions. TransE uses the distance between the head embedding translated by the relation embedding and the tail embedding as the scoring function, while DistMult and ComplEx use bilinear maps to define their scoring functions. For drugs and chemicals, we used two types of relations (i.e., treat and prevent) for prediction, since the focus of the paper is drug repurposing. For dietary supplements, on the other hand, we focused on the "affect" relationship, since it might be relatively challenging to detect top-ranked direct relationships between dietary supplements and AD treatment/prevention.
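A minimal sketch of candidate scoring with the three scoring functions is given below. It assumes the commonly used forms (negative L2 distance for TransE, a diagonal bilinear product for DistMult, and the real part of the product with the conjugated tail for ComplEx); the candidate names and the two-dimensional embeddings are hypothetical toy values, not learned embeddings from this study.

```python
import numpy as np

def transe_score(h, r, t):
    """Negative L2 distance between the translated head (h + r) and the tail."""
    return -float(np.linalg.norm(h + r - t))

def distmult_score(h, r, t):
    """Bilinear score with a diagonal relation matrix; symmetric in h and t."""
    return float(np.sum(h * r * t))

def complex_score(h, r, t):
    """Real part of sum(h_i * r_i * conj(t_i)); not symmetric in h and t in general."""
    return float(np.real(np.sum(h * r * np.conj(t))))

def rank_candidates(head_embeddings, relation, tail, score_fn):
    """Score every candidate head against a fixed (relation, tail) and sort, best first."""
    scored = [(name, score_fn(vec, relation, tail)) for name, vec in head_embeddings.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# Toy embeddings (hypothetical): two candidate heads, one relation, one AD-related tail.
heads = {
    "candidate_a": np.array([0.9, 0.1]),
    "candidate_b": np.array([0.0, 0.0]),
}
relation = np.array([0.1, 0.1])
ad_node = np.array([1.0, 0.2])

ranking = rank_candidates(heads, relation, ad_node, transe_score)
print(ranking)
```

In this toy example `candidate_a` ranks first because its embedding plus the relation vector lands on the tail embedding. Swapping `score_fn` for `distmult_score` or `complex_score` (with complex-valued arrays) changes only the scoring rule, not the ranking procedure.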

Evaluation for drug repurposing
We leveraged the time-slicing technique that is commonly used in literature mining [58] to evaluate our triple prediction approach. We trained all three models using data published before 1/1/2019 to see whether we could predict triples that were first published after this date.
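The time-slicing split can be illustrated with the following sketch. It assumes each triple carries the publication date of the article it was extracted from; the record layout, function names, and toy triples are hypothetical, and a triple is assigned to a side of the cutoff by its earliest publication date.

```python
from datetime import date

def time_slice(dated_triples, cutoff):
    """Split triples into those first published before the cutoff and those first seen later."""
    first_seen = {}
    for triple, published in dated_triples:
        if triple not in first_seen or published < first_seen[triple]:
            first_seen[triple] = published
    train = {t for t, d in first_seen.items() if d < cutoff}
    # Assumption: a triple first published exactly on the cutoff date counts as held out.
    held_out = {t for t, d in first_seen.items() if d >= cutoff}
    return train, held_out

def recovered(predicted_triples, held_out):
    """Predictions that were unseen in training but appear in the later literature."""
    return [t for t in predicted_triples if t in held_out]

cutoff = date(2019, 1, 1)
records = [
    (("entity_a", "TREATS", "entity_x"), date(2017, 5, 1)),
    (("entity_a", "TREATS", "entity_x"), date(2020, 3, 1)),  # repeat of an already known triple
    (("entity_b", "PREVENTS", "entity_x"), date(2020, 6, 1)),
]
train, held_out = time_slice(records, cutoff)
print(train, held_out)
```

A triple that reappears after the cutoff but was already published earlier stays in the training set, so only genuinely new findings count as successful predictions.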

Figure 1
Figure 1 General Pipeline: The biological concepts in PubMed literature were extracted using NLP tools and built into a knowledge graph of subject-relation-object triples. Graph embedding algorithms were used to find potential candidates and complete the knowledge graph. The number of triples left is shown at each step.

Table 2
Rankings For Drugs

Table 4
Rankings For Chemicals

Table 6
Rankings For Dietary Supplements

Table 7
Rankings For Dietary Supplement Triples

Table 8
Scoring Function of Graph Embedding Algorithms

TransE
TransE [55] is one of the earliest translational distance models. The model projects heads, tails, and relations into the same space, where the relation is interpreted as a translation vector r so that the head and tail can be connected by the relation with low error. The score function is the negative of the distance of this error, as shown in Table 8. TransE does have disadvantages in dealing with 1-to-N, N-to-1, and N-to-N relations. For example, if Alzheimer's Disease could be affected by different food supplements, then the TransE model might learn similar embeddings for all these food supplements.
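The limitation on 1-to-N, N-to-1, and N-to-N relations can be made explicit with a short derivation. It assumes the TransE score is the negative L2 distance, $f(h, r, t) = -\lVert h + r - t \rVert_2$, and considers two heads $h_1$ and $h_2$ (for example, two food supplements) linked to the same tail $t$ by the same relation $r$.

```latex
% Assumption: f(h, r, t) = -\lVert h + r - t \rVert_2 (negative L2 distance).
\begin{aligned}
\lVert h_1 - h_2 \rVert_2
  &= \lVert (h_1 + r - t) - (h_2 + r - t) \rVert_2 \\
  &\le \lVert h_1 + r - t \rVert_2 + \lVert h_2 + r - t \rVert_2 \\
  &= -f(h_1, r, t) - f(h_2, r, t)
\end{aligned}
```

By the triangle inequality, the better both triples are scored (both scores close to zero), the closer the two head embeddings are forced to be, regardless of how different the two entities are in other respects.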