Random walk with restart on multilayer networks: from node prioritisation to supervised link prediction and beyond

Background Biological networks have proven invaluable ability for representing biological knowledge. Multilayer networks, which gather different types of nodes and edges in multiplex, heterogeneous and bipartite networks, provide a natural way to integrate diverse and multi-scale data sources into a common framework. Recently, we developed MultiXrank, a Random Walk with Restart algorithm able to explore such multilayer networks. MultiXrank outputs scores reflecting the proximity between an initial set of seed node(s) and all the other nodes in the multilayer network. We illustrate here the versatility of bioinformatics tasks that can be performed using MultiXrank. Results We first show that MultiXrank can be used to prioritise genes and drugs of interest by exploring multilayer networks containing interactions between genes, drugs, and diseases. In a second study, we illustrate how MultiXrank scores can also be used in a supervised strategy to train a binary classifier to predict gene-disease associations. The classifier performance are validated using outdated and novel gene-disease association for training and evaluation, respectively. Finally, we show that MultiXrank scores can be used to compute diffusion profiles and use them as disease signatures. We computed the diffusion profiles of more than 100 immune diseases using a multilayer network that includes cell-type specific genomic information. The clustering of the immune disease diffusion profiles reveals shared shared phenotypic characteristics. Conclusion Overall, we illustrate here diverse applications of MultiXrank to showcase its versatility. We expect that this can lead to further and broader bioinformatics applications.


Introduction
Random Walk is a powerful approach for analysing and exploring networks.By simulating the movement of a particle randomly traversing nodes and edges in a network, Random Walks are able to capture several topological and structural properties of networks [1], including connectivity [2], community structure [3], and node centrality [4].
Inspired from the PageRank algorithm [5], initially developed for ranking web pages in search results by simulating the behavior of an internet user following hyperlinks or restarting on arbitrary pages, Random Walk with Restart (RWR) was first introduced by Pan et al. [6].In the RWR approach, the random particle, at each step, can navigate from one node to one of its neighbors or restart its walk from a node randomly sampled from a set of seed nodes.As PageRank, this strategy prevents the walker from getting trapped in dead ends and allows a more comprehensive exploration of the network's topology [7].RWR, by enabling restart from one or several seed nodes, simulates a diffusion process in which the objective is to determine the steady state of an initial probability distribution [8].This steady state represents a measure of proximity between the seed(s) and all the network nodes, quantifying the extent to which the influence or information from the seed nodes has spread throughout the network.It overall identifies nodes that are closely connected to the seed(s) and provides valuable insights into the network's organisation.
In computational biology, RWR has been particularly useful for the exploration of large-scale interaction networks and to derive guilt-by-association knowledge.For instance, RWR strategies significantly outperformed local distance measures for the prediction of gene-disease associations [9].They have also been successfully applied to protein function prediction [10], identification of disease comorbidity [11], or drug-target interaction prediction [12].More recently, RWR have been applied to drug prioritisation and repurposing for SARS-CoV-2 [13,14].
Originally designed for investigating simple single-layer (i.e., monoplex) networks, RWR has been extended to navigate more complex networks, i.e. networks composed of multiple layers of interaction data.One such extension was proposed by Li and Patra [15] and introduced a RWR exploration of heterogeneous networks.They applied this approach to predict novel gene-phenotype relationships using a heterogeneous network composed of gene-gene interactions, phenotype-phenotype interactions, and known gene-phenotype associations.We introduced a RWR allowing the exploration of multiplex-heterogeneous networks, i.e., multiplex networks connected to each other by bipartite interactions [16].More recently, we developed MultiXrank, a RWR algorithm able to explore generic multilayer networks [17].We define a generic multilayer network as a multilayer network composed of any number and combination of multiplex and monoplex networks connected by bipartite interaction networks.In this multilayer framework, all the networks can also be weighted and/or directed.MultiXrank hence offers the opportunity to apply RWR on multilayer networks containing rich and complex interactions and fundamentally better suited for representing the multi-scale interactions observed in biological systems.In practice, MultiXrank outputs scores representing a measure of proximity between the seed(s) and all the nodes of the multilayer network.These output scores can then be used in a large number of downstream applications.We aim here to illustrate the versatility of the use of MultiXrank output scores.First, we show that MultiXrank can be used for node prioritisation.From a multilayer network containing gene, drug, and diseases interactions, we used MultiXrank scores to prioritise candidate drugs for leukaemia.We also used the large network assembled in the Hetionet project [18], encompassing nine distinct types of nodes (including genes, drugs, diseases, biological processes, and pharmacological classes), to prioritise drugs for epilepsy.Second, we show that MultiXrank scores can be used to train a supervised classifier to predict gene-disease associations.Finally, we show how MultiXrank can be used to compute and compare diffusion profiles obtained for immune diseases on a multilayer network containing genomic information extracted from Promoter Capture Hi-C (PCHi-C) [19] experiments in different hematopoietic cells [20].Overall, these diverse applications of MultiXrank demonstrate its versatility, both in the types of networks it can explore and the variety of downstream analyses that can be applied using its output scores.

Node prioritisation to study human genetic diseases
RWR approaches are frequently used to assess the proximity between seed node(s) and all the other nodes in a network.By leveraging the RWR output scores, nodes that are proximal to the seed node(s) can be prioritised.Several benchmark studies have shown that RWR-based node prioritisation often outperforms a range of other methodologies, including other diffusion-based, neighborhood-based, and embedding-based approaches [21][22][23].
We will illustrate this prioritisation strategy by exploring the heterogeneous and rich information contained in biological multilayer networks using MultiXrank to prioritise genes and drugs in leukaemia and epilepsy.

Prioritising genes and drugs of interest in leukaemia using MultiXrank on a gene and drug multilayer network
We first focused on leukaemia, a disease for which we can confront our predictions with the knowledge accumulated in the literature.We prioritised genes and drugs of interest for leukaemia based on MultiXrank output scores obtained from exploring a multilayer network composed of a gene multiplex network and a drug multiplex network, connected with a gene-drug bipartite network representing known drug-target associations (Materials and methods).
We selected two seeds associated with leukaemia.More precisely, we selected HRAS as gene seed.HRAS is a gene of the RAS gene family associated with a wide variety of tumors, in particular in myeloid leukaemia [24].We also selected a drug seed, Tipifarnib (DB04960), a drug investigated for the treatment of acute myeloid leukaemia and other types of cancer [25][26][27].Using these two nodes jointly as seeds is particularly relevant as HRAS is a farnesylated protein and Tipifarnib is a farnesyltransferase inhibitor [28].We applied MultiXrank (with the parameters specified in Additional file 1: Table S3) using these two seeds jointly and selected the top 10 highest-scoring gene and drug nodes (Additional file 1: Tables S4 and S5, respectively).We extracted the subnetwork connecting the seed nodes and the top 10 prioritised genes and drugs and their close neighborhood (Fig. 1).We observed that prioritised nodes are close to both seeds, with a maximum shortest path distance between a prioritised node and a seed node equal to 4 (Additional file 1: Tables S4 and S5).
A literature survey of these top-10 prioritised drugs and genes establishes known or suspected connections with leukaemia (Additional file 1: section 2.A).For instance, the top scoring gene, CYP3A4, is a drug-metabolising enzyme that has been shown to play a role in drug resistance in leukaemia [29].The second highest-scoring gene, FNTB, is coding the farnesyltransferase, and a target of Tipifarnib [30].Different genes related to signal transduction and known to be relevant for cancer, such as RAF1, RASGRP1, RASA1, or ARAF, are also identified among the top-scoring genes.Moreover, the top prioritised drug, Astemizole (DB00637), is a good candidate for leukaemia treatment as it's anti-leukaemic properties have been demonstrated in human leukaemic cells [31].Interestingly, Astemizole is metabolised by CYP3A4 [32], the top-scoring gene.
Additionally, we complemented this use case with another node prioritisation in leukaemia that uses two additional biological networks (Additional file 1: Section 1.D).Indeed, in the gene multiplex network, we added a fourth layer containing directed regulatory interactions between transcription factors and target genes inferred in leukaemia.We also included a directed monoplex network containing metabolic reactions.Multi-Xrank node prioritisation using the same seeds as the ones presented before lead to the same set of prioritised drugs, with minor ranking changes.For instance, Zoledronic acid (DB00399), previously ranked in the top 5 position, now holds the first position.Moreover, 8 out of the top 10 ranked genes from the previous analysis were also prioritised in the new results, with slight variations in their ranking.Two novel genes, ZBTB17 and Fig. 1 Subnetwork connecting the seed nodes (in red), the top 10 prioritised genes (diamonds) and drugs (dots) and their neighborhood.Vertex colors indicate the ranking of the nodes (except for the seed nodes, colored in red), with darker colors indicating better ranking.Edges are colored according to their provenance: gene multiplex (blue), drug multiplex (grey) and bipartite interactions (orange) GABPA, are now prioritised, with 3 rd and 6 th ranks, respectively.Interestingly, ZBTB17 codes a protein involved in the regulation of c-Myc, which is known to play a crucial role in leukemogenesis [33].GABPA codes a subunit of the GA-binding protein transcription factor, implicated in chronic myelogenous leukaemia development [34].The addition of a metabolic network as a new monoplex network in the multilayer network allowed us to also prioritise metabolic reactions.The top-ranked reactions were linked to Terpenoid backbone biosynthesis, Acylglycerides metabolism, Glycerophospholipid metabolism, Cholesterol metabolism, and Inositol phosphate metabolism.Notably, Cholesterol metabolism and Inositol phosphate metabolism have been documented as being altered in leukaemia [35,36].More broadly, disrupted lipid metabolism is recognised as a hallmark of cancer [37].This additional analysis demonstrates that MultiXrank can effectively consider directed regulations and interactions within the regulatory and metabolic reaction networks.This analysis is available as a Jupyter Notebook tutorial on GitHub (see section Availability of data and materials).

Prioritising genes and drugs of interest in Epilepsy using MultiXrank on a biomedical knowledge graph
We applied MultiXrank to prioritise candidate drugs for epilepsy, using as seed the epilepsy disease node (DOID:1826) in the large and heterogeneous knowledge graph assembled in the Hetionet project [18].This heterogeneous network is composed of eleven different types of nodes (Materials and Methods).We compared the drugs topscored by MultiXrank, which is fully unsupervised, with the drugs prioritised by the Hetionet strategy, a supervised machine learning approach based on a regularised logistic regression model [18].To evaluate the robustness of MultiXrank in relation to the choice of input parameters, we applied four distinct sets of parameters (Additional file 1: Table S6).
Most drugs are top-prioritised by both approaches.For instance, for one of the sets of parameters tested in MultiXrank (set of parameters number 4, Additional file 1: Table S6), 59% of the top-100 Hetionet prioritised drugs are also in the top-100 Multi-Xrank prioritised drugs, 79% are in the top-200 MultiXrank prioritised drugs, and 99% are in the top-500 MultiXrank prioritised drugs (Fig. 2).
We further checked the 41 drugs from the top-100 drugs identified by MultiXrank that are not prioritised by Hetionet (Additional file 1: S7).Interestingly, 3 of them (namely, Propofol, Vigabatrin and Diclofenac, respectively ranked 8, 23 and 49 by MultiXrank) have been tested in clinical trials for epilepsy, according to the DrugBank [38].After extracting the DrugBank Categories associated to those 41 drugs (Additional file 1: Table S7), we observed that 24 of them are classified as Cytochrome P-450 Substrates.A recent study has shown that spontaneous recurrent seizures in mice modify Cytochrome P-450 expression in the liver and hippocampus.The authors hypothesise that nuclear receptors or inflammatory pathways can be considered as candidates for Cytochrome P-450 regulation during seizures [39].Another study showed that Cytochrome P-450 enzymes can have a significant impact on the response to anti-epileptic drugs [40].The second most represented DrugBank Category in the list of the 41 drugs prioritised by MultiXrank but not by Hetionet was the category Agents that produce hypertension, which map to 18 drugs.A review of the existing literature regarding hypertension and epilepsy show that those two conditions often co-occur [41,42].Furthermore, the relationship between the two conditions could be bidirectional, meaning that they can influence and exacerbate each other [43].
These results indicate that MultiXrank can provide predictions complementary to the Hetionet supervised machine learning approach.In addition, MultiXrank predictions can be easily interpreted as the subnetworks underlying the top-scoring nodes can be easily extracted.

Supervised prediction of gene-disease associations
In a second study, we present a supervised approach to predict gene-disease associations.Predicting gene-disease associations is crucial for the diagnosis, understanding, and treatment of genetic diseases.Among available approaches to predict gene-disease associations, network-based methods have been particularly exploited and have demonstrated good performances [44].Among network-based approaches, RWR strategies have been shown to significantly outperform local distance measures [9].These network approaches were initially based mainly on unsupervised strategies, but an increasing number of methods are implementing supervised strategies [44].Here, we use the output scores of MultiXrank to train supervised XGBoost and Random Forest binary classifiers to predict gene-disease associations (Additional file 1: Fig. S5).
We used a multilayer network composed of a gene multiplex network and a disease monoplex network (Materials and Methods).These multiplex and monoplex networks are connected by a gene-disease bipartite network constructed with an outdated version of DisGeNET (v2.0, 2014, [45]).The edges of the bipartite network are weighted according to the support score provided by DisGeNET v2.0 (2014).We applied MultiXrank on  S6).The data on ictogenic properties has been sourced from the Hetionet study [18].AIGD are anti-ictogenic drugs that have a seizure suppressor effect (beige), IGD are ictogenic drugs (pink), and UKND are drugs with unknown effects (grey).The outer circle of the pie chart displays the number of AIGD, IGD, and UNKD drugs amongst the 100 drugs prioritised by Hetionet (e.g., 77 Hetionet prioritised drugs are AIGD).The center of the pie chart displays the number of drugs from each category that were also prioritised by MultiXrank in the top-100 (displayed in lighter shades), top-200 (displayed in middle shade), and top-500 drugs (displayed in darker shades).For instance, 55 top-100 MultiXrank prioritised drugs are AIGD and 64 top-200 MultiXrank prioritised drugs are AIGD.The drugs prioritised in the top-500 include the drugs prioritised in the top-200, and the top-200 includes the drugs prioritised in the top-100.The white part of the pie chart corresponds to the drugs prioritised by Hetionet that are not in our MultiXrank top-500 prioritised drugs the multilayer network described above, using the gene and disease nodes from each gene-disease association as seeds.The parameters used for running MultiXrank are detailed in Additional file 1: Table S8.We used both positive associations (i.e.true genedisease associations) and negative associations (i.e., random gene-disease pairs that are not associated according to DisGeNET).For each set of positive seeds (true gene-disease association), the gene-disease bipartite edge connecting the two seeds was removed from the bipartite network before training.We collected MultiXrank output scores obtained for all positive and negative gene-disease pairs of seeds and trained binary XGBoost and Random Forest classifiers with different parameters (Additional file 1: Table S9).We then tested the performance of the classifiers in predicting unseen gene-disease associations from the outdated version of DisGeNET that were kept out for testing as well as the gene-disease associations that have been added in the updated version of Dis-GeNET (v7.0, 2020, [46]).The full machine learning procedure is detailed in the Materials and Methods section.We also report the performances of our models in predicting DisGeNET v2.0 (2014) and DisGeNET v7.0 (2020) associations in Additional file 1 : Tables S9 and S10, respectively.For the prediction of unseen test DisGeNET v2.0 (2014) associations, the best classification performance was achieved with an XGBoost model taking class imbalance into account.This model reached a balanced accuracy of 0,85 and an F1-score of 0,79, showing the predictive potential of MultiXrank output scores.However, the prediction performance dropped considerably for predicting DisGeNET v7.0 (2020) associations (balanced accuracy 0,64 and F1-score 0,53).It should be noted that the MultiXrank scores used for the classification of DisGeNET v2.0 (2014) and Dis-GeNET v7.0 (2020) associations were calculated on the same network, constructed solely from the information contained in DisGeNET v2.0 (2014).Importantly, DisGeNET v2.0 (2014) reported only 381 654 gene-disease associations, whereas DisGeNET v7.0 (2020) reported 1 135 037 associations, which represents a threefold increase.Moreover, over the 21 666 genes reported in DisGeNET v7.0 (2020), only 14 255 appeared in DisGeNET v2.0 (2014).Similarly, only 38% of the 30 170 diseases reported in DisGeNET v7.0 (2020) were also reported in DisGeNET v2.0 (2014).The substantial increase in the amount of information contained in DisGeNET between 2014 and 2020 is potentially the cause of the significant decrease in classification performance.

Diffusion profiles comparison to unveil immune diseases similarities
The scores resulting from a random walk using a given seed can be regarded as a diffusion profile and represent a network-based molecular signature.For instance, RWRdiffusion profiles have been utilised to assess the signatures of drugs and diseases within a multi-scale interactome network, uncovering potential opportunities for drug repurposing [47].Diffusion profiles obtained starting from different seeds can indeed be compared to reveal signature proximities.
Here, we propose to compute and compare the diffusion profiles obtained using 131 immune diseases ( Additional file 1: Table S13) as seed in MultiXrank applied to several multilayer networks.We created eight hematopoietic cell-specific multilayer networks, each composed of four different monoplex/multiplex networks (Fig. 3, Materials and Methods).The first two monoplex/multiplex networks incorporate gene and disease interactions sourced from public databases.These two layers are the sames across all the eight multilayer networks.The remaining two layers encode Promoter Capture Hi-C (PCHi-C) fragment interactions and Topologically Associating Domain (TAD) interactions observed in various hematopoietic cell lines, extracted from a dataset generated in [20].It's important to note that the PCHI-C fragment layer and the TAD layer are unique to each hematopoietic cell line, and hence vary across the eight multilayer networks (Additional file 1: Section 1.C).
The strength of this multilayer network constructions lies in its capacity to combine non cell-specific generic gene and disease interactions with data regarding genomic interactions unique to hematopoietic cell lineages.In addition, the genomic interaction layers allow us to consider data representing the 3D conformation of DNA and non-coding regions of the genomes.This 3D conformation of DNA is a key to understanding, for instance, genomic structural variations that are key players in the study of diseases [48].We demonstrated that these genomic data maintain the signal of the hematopoietic cell type.Indeed, the PCHi-C fragments and TAD datasets capture the tree lineage of hematopoietic cells (Additional file 1: Section 4.A, Additional file 1: Fig. S6).We also demonstrate that this lineage signal is captured in the RWR scores obtained from applying MultiXrank to the eight multilayer networks (Additional file 1: Section 4.B, Additional file 1: Fig. s S7 and S8).
Here, we aim to apply MultiXrank on the eight hematopoietic multilayer networks using as seeds 131 different immune diseases to obtain the disease diffusion profiles.We consider the diffusion profiles as disease signatures.We will next cluster the immune diseases based on the similarity of their diffusion profiles.We hypothesise that such clustering can reveal potentially similar immune diseases.
To reveal similarities between the 131 immune diseases based on the diffusion profiles obtained on the eight multilayer networks, we first compute disease-disease distances for each cell type (i.e.hematopoietic multilayer network) and node type (i.e.Then, the disease-disease distance matrices obtained for the eight hematopoietic multilayer networks are fused (equation 2, Materials and Methods).This procedure produces 4 disease-disease integrated similarity matrices, one for each node type in the multilayer networks.These matrices are then used to cluster the immune diseases using a multiview clustering algorithm (Materials and Methods).The obtained clusters are detailed in Additional file 1: Table S12.Additionally, the 4 matrices are concatenated into a single matrix and projected into a 2D t-SNE (t-distributed stochastic neighbor embedding) [49] space.In this projection, we label each immune disease with their corresponding cluster (Fig. 4).
We conducted additional analysis to explore the composition of disease clusters and extract their essential characteristics: • Cluster 0 regroups 31 immune diseases.It is mainly composed of Inflammatory (e.g.Takayasu's arteritis, giant cell arteritis (temporal arteritis), autosomal recessive earlyonset inflammatory bowel disease), Autoinflammatory (e.g.tumor necrosis factor receptor-associated periodic syndrome (TRAPS), hyper-IgD syndrome, familial cold autoinflammatory syndrome), Autoimmune diseases (e.g: rheumatoid arthritis, type 1 diabetes, cicatricial pemphigoid, Hashimoto's thyroiditis).This grouping underscores the intricate relationship between inflammation, auto-inflammation and autoimmunity, as supported by the existing literature on inflammatory disorders.Indeed, numerous studies have postulated an immunological continuum linking monogenic autoinflammatory disorders with autoimmunity [50,51].• Cluster 1 encompasses 59 immune diseases.It groups conditions marked by immunodeficiencies, primary immunodeficiencies (including several types of Complement component deficiencies) and increased susceptibility to infections.Within this cluster, numerous diseases are linked to immunoglobulin-related abnormalities, including various forms of hypogammaglobulinemia, immunodeficiency with hyper IgM, as well as immunoglobulin A deficiency and agammaglobulinemia.• Cluster 2 accounts for 41 immune diseases.These diseases appear to be rather diverse, encompassing diseases that impact a wide range of bodily systems.Many types of leukaemias (e.g.acute myeloid leukaemia, chronic lymphocytic leukaemia, chronic myelogenous leukaemia), lymphomas (e.g.Hodgkin's lymphoma, non-Hodgkin lymphoma) and other blood-related diseases (e.g.pernicious anemia, myelodysplastic syndromes) are included in this cluster.Other systems affected by diseases from cluster 2 include the cardiovascular (e.g: congenital heart block), hepatic (e.g.glycogen storage disease type 1B), skeletal (e.g.cherubism), dermatological (e.g.lichen sclerosus, pruritic urticarial papules plaques of pregnancy), muscular (e.g.inclusion body myositis), neurological (e.g.Aicardi-Goutieres syndrome) and neuromuscular (e.g.stiff person syndrome) systems.However, many of these diseases can impact multiple systems (e.g.DiGeorge syndrome, Aicardi-Goutières syndrome, CHARGE syndrome, glycogen storage disease type 1B, Pearson syndrome).
Facing the apparent diversity of diseases included in cluster 2, comparatively to the well defined clusters 0 and 1, we further investigated the composition of cluster 2.
A literature review on the diseases included in this cluster shows that most of them are associated with blood diseases on one hand and cardiovascular system diseases on the other hand.Indeed, many of the diseases that are not directly associated with leukaemia or lymphoma appear to be comorbid to those cancers.For instance, elevated risks of lymphoma and leukaemia have been reported for patients with pernicious anemia [52] and myelodysplastic syndromes can evolve to acute myeloid leukaemia [53].Other diseases from cluster 2 that are considered associated with lymphoma and leukaemia in the literature include ataxia telangiectasia [54], Bloom syndrome [55], cartilage-hair hypoplasia [56], and Chediak-Higashi syndrome [57], among others.Moreover, many diseases from cluster 2 could be associated with cardiovascular diseases, according to the literature: congenital heart defects are observed in 50-85% of cases of CHARGE syndromes [58]; the TARP syndrome is associated with congenital heart defects [59]; congenital heart disease is a common feature in the 22q11.2deletion syndrome (DiGeorge syndrome) [60]; Parry Romberg syndrome is associated with hypertrophic cardiomyopathy and rheumatologic heart disease [61]; lichen sclerosus is associated with increased risk of cardiovascular comorbidities in female [62]; Pearson syndrome is often associated with cardiac conduction defects [63]; the association of inclusion body myositis and cardiac disease is debated [64]; a case of Melkersson-Rosenthal syndrome affecting cardiac connective tissues was reported in [65].Finally, we examined some diseases close to each other in the t-SNE space.We detail here 3 examples: • Chronic recurrent multifocal osteomyelitis (CRMO) (UMLS:C0410422, seed 2) and Majeed syndrome (UMLS:C1864997, seed 99) ( Fig. 4, cluster 0): CRMO is known as one of the major features of Majeed syndrome [66].• Netherton syndrome (UMLS:C0265962, seed 103) and Eosinophilic esophagitis (UMLS:C0341106, seed 5): (see Fig. 4, cluster 0): Eosinophilic esophagitis is observed in 44% of the people with Netherton syndrome [67].• Myasthenia gravis (UMLS:C0026896, seed 21) and Myositis (UMLS:C0027121, seed 22) (see Fig. 4, cluster 0): several cases of co-existence of Myasthenia Gravis and Myositis are reported in [68].
All of these observations suggest that the integrated MultiXrank scores effectively capture similarities in disease diffusion profiles, indicating shared phenotypic manifestations and potential comorbidity patterns among the diseases.

Conclusion
Multilayer networks provide a valuable framework for integrating a wide range of biological interactions involving diverse types of entities.Overall, multilayer network approaches have consistently demonstrated superior performance in a large variety of tasks compared to methods applied on monoplex networks.[16,17,[69][70][71] The Mul-tiXrank Random Walk with Restart algorithm can effectively explore such multilayer networks, offering opportunities for various analyses.In this study, we demonstrate the versatility of the MultiXrank algorithm through three distinct biological applications: prioritising drugs and genes in a disease context, predicting gene-disease associations, and comparing and clustering diseases.

Construction of the biological networks
The different studies presented in this manuscript explore different biological networks summarised in Additional file 1: Table S1.Detailed information on the biological networks used in each study are provided in Additional file 1: Section 1.
In study 1, for node prioritisation in leukaemia, we used a multilayer network composed of a gene and a drug multiplex networks, connected by bipartite gene-drug interactions (Additional file 1: Section 1.A, Additional file 1: Table S1).In the gene multiplex network, we encode 3 types of gene interactions: protein-protein interactions, molecular complexes and pathways gathered from public databases.In the drug multiplex network, we encode 4 types of drug interactions: adverse interactions, experimental drug combinations, computationally predicted drug interactions and pharmacological interactions.The two multiplex networksare connected with gene-drug associations extracted from the repoDB database [72].
In study 1, for node prioritisation in epilepsy, we used Hetionet [18], an network containing 11 types of nodes (Additional file 1: Section 1.B, Additional file 1: Table S1).The first multiplex network encodes 3 types of gene interactions: co-expression, physical interaction and regulation.The second and third monoplex networks encode for disease similarities and drug similarities, respectively.The three monoplex/multiplex networks are connected with bipartite gene-drug interactions, gene-disease interactions and disease-drug interactions.The multilayer network also contains other node types (including pathways, biological process and pharmacologic classes) that are connected to the gene, drug and disease networks via bipartite interactions.
In study 2, Gene-Disease associations were predicted from MultiXrank output scores obtained by exploring a multilayer network composed of a gene multiplex network and a disease monoplex network, connected by gene-disease bipartite interactions (Additional file 1: Section 1.A, Additional file 1: Table S1).The multiplex network encodes gene interactions, identical to the one used in node prioritisation for leukaemia.The monoplex network encode disease interactions.The gene-disease bipartite associations were extracted from an outdated version of the DisGeNET database (v2.0, 2014).
Finally, in study 3, we obtained MultiXrank scores from the exploration of eight hematopoietic multilayer networks (Additional file 1: Section 1.C, Additional file 1: Table S1).Those networks are composed of a disease monoplex network, a gene multiplex network and two genomic monoplex networks.The gene multiplex network encode gene interactions, and is identical to the one used in node prioritisation for leukaemia.The disease monoplex network encodes disease phenotypic proximities.The two remaining networks encode genomic information: PCHi-C fragment interactions and TAD interactions.Importantly, the gene and disease networks are identical in the eight hematopoietic networks, whereas the PCHi-C and TAD networks were computed for each hematopoietic cell type from a dataset obtained in [20].We describe in detail the processing pipeline applied to obtain the PCHi-C fragment and TAD layers in Additional file 1: Section 1.C.

Network visualisation
The network visualisation displayed in Fig. 1 was obtained with Cytoscape [73].

Supervised classification
We created the gene-disease associations dataset from an outdated version of DisGeNET (v2.0, 2014).We obtained 1914 gene-disease associations.We generated a negative dataset by randomly picking 3828 pairs of gene and disease nodes that are not considered associated in DisGeNET v2.0 (2014).For each positive and negative gene-disease association defined in the training dataset, we used both the gene and the disease nodes as seeds when running MultiXrank (parameters defined in Additional file 1: Table S8).Then, we trained Random Forest and XGBoost binary classifiers (Additional file 1: Table S9) based on the MultiXrank output scores for predicting gene-disease associations.The evaluation of the binary classifiers is done on an updated version of the genedisease associations dataset; DisGeNET v7.0 (2020).The test dataset contained 7218 novel positive gene-disease associations, and the negative dataset is randomly selected twice as many (i.e. 14 436) negative associations.We ran MultiXrank using as seeds the gene and disease nodes of each association of the evaluation dataset, using the same parameters used for obtaining MultiXrank scores for the training dataset (Additional file 1: Table S8).Then, the MultiXrank output scores were used as input of the previously computed Random Forest and XGBoost models (trained on the data obtained with the gene-disease association of DisGeNET v2.0 (2014)) to predict their label.Finally, we compared the predicted labels to the true labels.We report the results for each model in Additional file 1: Table S10.The full procedure is detailed in Additional file 1: Section 3 and Additional file 1: Fig. S5.

Disease-disease rank distances per cell type and per node type
To compute disease-disease distance within each hematopoietic multilayer network, we start by integrating the scores of MultiXrank for each node type independently.
First, we create a disease distance matrix for each hematopoietic multilayer network c and node type t using equation 1: with dis i and dis j , the two seeds (i.e.immune diseases) considered and N c,t k the number of nodes of type t in the multilayer network c.
The rank distances, dist dis i ,k and dist dis j ,k , are computed using the following equations: with r dis j dis i ,k , the rank of the node at position k for disease dis i in the list of scores for dis- ease dis j .with r dis i dis j k , the rank of the node at position k for disease dis j in the list of scores for dis- ease dis i .
The distance dist dis i ,k (respectively dist dis j ,k ) represents the square of the absolute dif- ference between the rank of the node at position k for disease dis i (resp.dis j ) and the rank of the same node for disease dis j (resp.dis i ).

Disease-disease integrated rank distances
We integrate disease-disease rank distances across hematopoietic multilayer networks for each node type using the following equation: with D t dis i ,dis j the integrated disease distances across hematopoietic multilayer networks for node type t, N c the total number of hematopoietic multilayer networks (i.e.eight) and D c,t dis i ,dis j the disease-disease distances computed for hematopoietic multilayer network c and node type t.

Multiview clustering
For clustering immune diseases based on the distance matrices D t dis i ,dis j obtained for each node type t, we employed the multiview spectral clustering algorithm [74] implemented in the mvlearn python package [75] with n_clusters = 3.

t-SNE projection
To visualise the immune disease clustering, we first concatenated the four immune disease distance matrices D t dis i ,dis j .This concatenated matrix was then projected in a twodimensional t-SNE (t-distributed stochastic neighbor embedding) space [49].The points were colored according to their assigned cluster, obtained from the multiview clustering.
sampled randomly.We next ran MultiXrank using as seeds the gene and disease nodes from each positive and negative gene-disease association, and saved the output scores in the matrix **.Finally, this ** matrix is used as an input of the previously trained random forest classifier to predict the labels of each gene-disease association.The predicted labels are then compared with the true known labels.The protocol is the same for two or three multiplex networks.Supplementary Figure S6.2D PCA projection of the Jaccard index similarities between the different hematopoietic cell types.Left: similarities computed on the PCHi-C fragment dataset.Right: similarities computed on the TAD dataset.Red: Lymphoid cells.Blue: Myeloid cells.Supplementary Figure S7.2D PCA projection of the similarity between the hematopoietic cell types.The tree lineage of hematopoietic cells is correctly found with the PCHi-C fragment, TAD and disease output scores of MultiXrank.However, the tree lineage is not recovered for the protein output scores.Supplementary Figure S8.2D PCA projection of hematopoietic cell similarities with respect to the integrated MultiXrank output scores effectively visualises the similarity between various hematopoietic cell types.In this projection, lymphoid cells (depicted in red) and myeloid cells (depicted in blue) exhibit a relative separation within the PCA space.Moreover, nCD4 and nCD8 cells are in close proximity to each other and relative proximity with nB cells.Furthermore, Macrophage (Mac0) and its precursor, Monocyte (Mon), appear closely situated.Additionally, Erythrocyte (Ery) and Megakaryocyte (MK), both stemming from the same progenitor, also exhibit high proximity to each other.Supplementary Figure S9.Workflow of the MultiXrank Python package.The networks and the parameters need to be ordered as in the tree file in the left panel.This tree file is used by MultiXrank to run the RWR on the multilayer network defined in the bipartite and multiplex folders with the parameters given in the "config.yml"and "seeds.txt"files.Finally, MultiXrank returns the output scores in an output folder as shown in the right panel, each multiplex/monoplex network is associated with a ranking distribution of the nodes involved in this specific network.A Conda environment is available on GitHub (https:// github.com/ anthb apt/ multi xrank) to install all dependencies needed by MultiXrank.

Fig. 2
Fig. 2 Ictogenic properties of the 100 drugs prioritised by Hetionet and overlap with the top 100, 200, and 500 drugs prioritised by MultiXrank (parameter set 4, see Additional file 1: TableS6).The data on ictogenic properties has been sourced from the Hetionet study[18].AIGD are anti-ictogenic drugs that have a seizure suppressor effect (beige), IGD are ictogenic drugs (pink), and UKND are drugs with unknown effects (grey).The outer circle of the pie chart displays the number of AIGD, IGD, and UNKD drugs amongst the 100 drugs prioritised by Hetionet (e.g., 77 Hetionet prioritised drugs are AIGD).The center of the pie chart displays the number of drugs from each category that were also prioritised by MultiXrank in the top-100 (displayed in lighter shades), top-200 (displayed in middle shade), and top-500 drugs (displayed in darker shades).For instance, 55 top-100 MultiXrank prioritised drugs are AIGD and 64 top-200 MultiXrank prioritised drugs are AIGD.The drugs prioritised in the top-500 include the drugs prioritised in the top-200, and the top-200 includes the drugs prioritised in the top-100.The white part of the pie chart corresponds to the drugs prioritised by Hetionet that are not in our MultiXrank top-500 prioritised drugs

Fig. 3
Fig. 3 Hematopoietic multilayer networks composed of two genomic layers built from PCHi-C and TAD data, of a gene multiplex network and of a disease monoplex network.The disease monoplex and the gene multiplex network are the same in all the hematopoietic multilayer networks.However, the PCHi-C and TAD layers are specific to each hematopoietic cell line.The black arrows represent the bipartite networks that connect two different types of nodes

Fig. 4
Fig. 4 t-SNE projection of the integrated distances between the 131 different immune diseases.Colors indicate the clusters in which the diseases were grouped according to the multiview spectral clustering algorithm.Points highlighted with grey background correspond to diseases cited in the main text i ,k + dist dis j ,k ( 2k+1 2 ) 2 dist dis i ,k = |(k − r dis j dis i ,k )| 2 dist dis j ,k = |(r dis i dis j ,k − k)| 2