 Research
 Open Access
PWN: enhanced random walk on a warped network for disease target prioritization
BMC Bioinformatics volume 24, Article number: 105 (2023)
Abstract
Background
Extracting meaningful information from unbiased high-throughput data has been a challenge in diverse areas. Specifically, in the early stages of drug discovery, a considerable amount of data is generated to understand disease biology when identifying disease targets. Several random walk-based approaches have been applied to solve this problem, but they still have limitations. Therefore, we suggest a new method that enhances the effectiveness of high-throughput data analysis with random walks.
Results
We developed a new random walk-based algorithm named prioritization with a warped network (PWN), which employs a warped network to achieve enhanced performance. Network warping is based on both internal and external features: graph curvature and prior knowledge.
Conclusions
We showed that these composite features synergistically increased the resulting performance when applied to random walk algorithms, which led to PWN consistently achieving the best performance among several other known methods. Furthermore, we performed subsequent experiments to analyze the characteristics of PWN.
Background
Deciphering target proteins for disease treatment has been an important challenge in medical care, as it is the first step in the drug discovery process and one that critically affects its success rate. To effectively solve this problem, we must first understand disease biology, and due to the increased accessibility of high-throughput technologies in recent years, diverse types of unbiased data have been generated for a range of diseases. However, the causes and consequences of disease states are concurrently reflected in unbiased high-throughput data that compare disease samples and normal samples. Thus, discriminating potential causes from widespread consequences is an essential task when using disease-perturbed data to prioritize targets to cure.
A well-constructed protein-protein interaction (PPI) network can help tackle this issue because it provides hints for dealing with massive amounts of high-throughput omics data by showing the overall landscape of protein relations. Several previous studies adopted random walk-based approaches for utilizing PPI networks to associate genes and diseases and reported encouraging results by suggesting a number of disease genes with literature evidence [1,2,3,4,5,6,7,8,9]. A biological network consists of nodes and edges, which represent biological entities and the relations between these entities, respectively. Since the information constituting the network is usually obscured in the given omics data, researchers try to integrate topological properties into the omics data analysis process to enhance the initial analysis results. One method that leverages the network is the random walk. A random walk diffuses the initial signal through its neighbors. Therefore, the diffused signal is heavily affected by its original signal and the signals derived from the direct neighbors, while other nodes also slightly affect the signal. Most of these studies provided a collective set of known disease genes to initiate random walk processes and obtain novel targets that reflected previously studied disease biology. In the same way, this approach can be applied to extract important genes from omics data. For example, differentially expressed genes derived from omics data can be used as the starting points of random walk processes [10,11,12].
Most random walk-based methods rely heavily on an unweighted and undirected network when they spread the information assigned to nodes; i.e., they do not make a distinction between different neighbors when choosing which neighbor to use to spread information, although the neighbors have different levels of biological importance. Therefore, one can expect that using a weighted and directed network to propagate more information through important edges can yield improved accuracy. Nevertheless, how to assign suitable edge properties remains a question. In the case of a PPI network, the network contains not just simple interactions between its constituents but much more information, such as the intra-connectivity of protein complexes or sets of proteins involved in the same pathway. This property implies that we must carefully landscape the PPI network to let information flow in the proper direction.
The most intuitive and direct way to achieve this is to use biological information related to the direction in which the researcher wants information to flow. The work of Hristov [11] is a representative example of this idea. In that study, the authors showed that the accuracy of a random walk algorithm can be improved by using proper prior knowledge. A noteworthy finding among their experiments was that, in the per-cancer-type target identification experiment, using cancer-specific prior knowledge gives more accurate results than using general cancer-related prior knowledge.
Another available option is using the network's own properties. We choose the curvature of a PPI network, which is based on local connectivity and other network-derived properties. Curvature is a concept originating from differential geometry that measures the rate of bending at a given point, or how much the region in question is warped from flat lines, flat surfaces, or flat manifolds. For instance, the curvature of a circle of radius \(r\) is \(r^{-1}\). Various types of curvatures have been suggested and studied: Gaussian curvature, geodesic curvature, and sectional curvature, to name a few [13]. The Ricci curvature, one of the most important variants, has been used in a wide range of fields, such as fluid mechanics, Einstein's general theory of relativity, and Perelman's proof of the Poincaré conjecture [14].
We thereby suggest a new algorithm named prioritization with a warped network (PWN), which can be applied to disease target identification methods using a series of high-throughput (“omics”) data based on a PPI network. PWN incorporates both network-dependent and network-independent features to warp the network by applying graph curvature and known disease genes, respectively.
Results
Overview of PWN
PWN is designed as an efficient variant of the random walk with restart (RWR) [15]. Unlike the usual RWR algorithm, which employs a simple unweighted network, PWN uses a weighted asymmetric network generated from an unweighted and undirected network. The weights come from two distinct features. One is an internal feature that depends on the network topology, and the other is an external feature that is independent of the given network (see Fig. 1 for a graphical overview).
PWN uses the graph Ricci curvature, which is highly related to local structural information, as the internal feature (see “Warping via an internal feature: graph curvature” section). While curvature was originally defined on smooth continuous domains, several researchers have extended this concept to discrete objects such as networks. The graph Ricci curvature [16,17,18,19] is one of these extensions, developed independently by [18] and [16]. The Ollivier–Ricci curvature is grounded in optimal transport theory, while the Forman–Ricci curvature was derived using the CW complexes introduced in homotopy theory. These types of curvatures have been applied in several recent graph-based machine learning algorithms [20,21,22].
Note that the edges in a dense complete graph (or clique) tend to have higher curvatures, which implies that higher curvatures can be observed on edges within protein complexes [23, 24]. In contrast, inter-complex edges have a higher probability of possessing lower curvatures. Therefore, the graph Ricci curvature can be used to overcome the indiscriminate nature of a random walk, as shown in Fig. 2. Consider a random walk starting from the purple node, where the left and right sides have different structures. Naturally, one would like to distinguish them in random walks, but the probabilities of being on the left side and the right side are both equal to \(5/10=50\%\) when using an unweighted graph. To solve this issue, one can inject the curvature into random walks so that the walk is affected by the local structure.
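This behavior can be illustrated with a small toy graph in the spirit of Fig. 2 (our own sketch, not the paper's code): the walk below re-weights each edge leaving the starting node by a sigmoid of its augmented Forman–Ricci curvature, so that \(\beta =0\) recovers the uniform walk while \(\beta >0\) favors the clique-like side.

```python
from math import exp

# Toy graph: node "p" touches a triangle-rich side (a1-a3, fully
# connected) and a star-like side (b1-b3, mutually unconnected).
edges = {("p", "a1"), ("p", "a2"), ("p", "a3"),
         ("a1", "a2"), ("a1", "a3"), ("a2", "a3"),
         ("p", "b1"), ("p", "b2"), ("p", "b3")}
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

def curvature(u, v):
    """Augmented Forman-Ricci curvature of edge (u, v):
    4 - deg(u) - deg(v) + 3 * (#triangles containing the edge)."""
    return 4 - len(adj[u]) - len(adj[v]) + 3 * len(adj[u] & adj[v])

def step_probs(src, beta):
    """Transition probabilities out of `src`, with each edge
    re-weighted by a sigmoid of its curvature; beta = 0 recovers
    the unweighted walk."""
    w = {v: 1.0 / (1.0 + exp(-beta * curvature(src, v))) for v in adj[src]}
    z = sum(w.values())
    return {v: wv / z for v, wv in w.items()}

uniform = step_probs("p", beta=0.0)   # every neighbor gets 1/6
warped = step_probs("p", beta=0.5)    # clique side is favored
```

With \(\beta =0\), every neighbor receives probability \(1/6\); with \(\beta =0.5\), the clique edges (curvature \(+1\) here) receive more mass than the star edges (curvature \(-3\)).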
In the context of a bioinformatics application, this implies that one can use curvature to control the amount of information that propagates through, or flows out from, protein complexes and/or hubs. Proteins influence each other both through physical links and through various non-physical relationships established before protein formation (at the level of DNA or RNA). It is therefore intuitive that network properties (e.g., hubs and curvature) affect the physical or non-physical (information transfer) relationships of protein-protein interactions [25, 26]. Note that most of the nodes belonging to protein complexes are connected to each other, so the edges in the complexes can have large positive curvatures. In contrast, the neighbors of a hub node are likely to not be connected, so the edges attached to the hub have large negative curvatures. In various PPI networks, hub nodes can be distinguished from other nodes by a color determined solely by the curvature. Using PPI networks in E. coli and humans, we confirmed that these properties are common. As shown in Fig. 3 and Additional file 1, the protein complexes appear red while the hub proteins appear blue.
In fact, various existing methods have already employed Ricci curvatures in biological applications. Sandhu et al. [22] used curvatures on gene co-expression networks to distinguish between cancer networks and normal networks. Coupled with side information, Sia et al. [27] constructed functional communities with a curvature-based community detection algorithm. Murgas et al. [28] applied PPI curvatures to single-cell RNA-seq data and obtained successful results on several tasks, including distinguishing pluripotent cells and distinguishing cancer cells. Zhu et al. [29] combined the curvature of a PPI network with a clustering technique to extract cancer subtypes from multi-omics data.
PWN is designed to manage the proportion of information circulating in and flowing out of certain regions by controlling this internal feature. We empirically show that this internal feature alone has little impact on the resulting performance (see “Effectiveness of internal features” section) but provides significant improvements when combined with the external feature (see “Effectiveness of the external feature” section).
We use the (augmented) Forman–Ricci curvature instead of the Ollivier–Ricci curvature. Although several studies have suggested that both types of curvature behave similarly on various biological networks (including PPI networks) [30, 31], computing the Forman–Ricci curvature is much faster than computing the Ollivier–Ricci curvature, especially on a large-scale network. Since PPI networks often have large numbers of nodes and edges (see “Collecting PPI networks” section), the Forman–Ricci curvature is preferable for us.
After warping the network using curvatures, the prior knowledge (see “Collecting the ground truth and constructing experiments” section) related to the given task is applied to the network as an external feature. While conventional random walk-based methods do not consider the prior knowledge from a given task, some types of modern algorithms, including machine learning methods, heavily employ prior knowledge from the context and encode that knowledge into their algorithms. When attempting to obtain a more appropriate result for a specific task, an algorithm reflecting prior knowledge would perform better than a general task-independent algorithm. Given this assumption, we use external data, which cannot be gathered from the network, to provide a clear guide for the propagation of information and enhance performance.
PWN warps the network by assigning higher weights to prior knowledge-related edges. Note that the prior knowledge is not guaranteed to cover the ground truth in its entirety, so PWN first spreads the prior knowledge so that the missing information can be covered (see “Warping via an external feature: prior knowledge” section).
uKIN, previously suggested by [11], also diffuses the prior information using a (shifted) Laplacian and multiplies the smoothed knowledge with the edge weights. However, uKIN occasionally encounters tuning limitations since the amount of shift is not bounded from above. If the optimal amount of shift is larger than expected, the search space of the tuned hyperparameters rarely contains the optimal region, resulting in suboptimal performance. Therefore, we use an alternative method based on an RWR [15], where the range of each hyperparameter is always bounded between \(0\) and \(1\) and is thus easier to tune.
Finally, the gene scores obtained from the unbiased omics data (see “Computing the initial gene scores” section) are diffused through the warped network, which gives the final gene scores (see “Score diffusion with a warped network” section). We add hyperparameters to control the amount of information spread during each step, which makes PWN more versatile and flexible.
Comparison between PWN and other methods
PWN can be used to identify targets when proper prior knowledge and gene scores are given. To demonstrate this, we designed a series of simulations using various cancer-related data from Homo sapiens, including The Cancer Genome Atlas (TCGA) and the Cancer Gene Census (CGC), and checked whether the methods can find known cancer targets.
We collect the data as follows (see “Dataset preparation for a simulation study” section for more details). First, we download public PPI databases and compute their initial gene scores using statistical tests performed on transcriptome data. Then, we collect the ground-truth genes and randomly divide them into two groups: one is an (optional) train set used by the diffusion methods, and the other is a test set required for performance measurement. From the collected data, we apply the various methods and measure the resulting performance metrics. We repeat this process multiple times to achieve a robust performance comparison.
Under this setup, we compare the performance of several methods. For baselines, we use the RWR [15], the RWR with GDC [32], uKIN [11], and mND [33], each of which is equipped with the default hyperparameters presented in its original paper. For PWN, we use \(\beta =0.5\), \(\gamma =0.5\) and \(q=0.3\) as the default parameters. Note that these values are not tuned or cherry-picked. Additionally, we add some variant methods to our experiments: the RWR with curvatures, uKIN with curvatures, and PWN without curvatures (see “Variants of PWN” section).
We evaluate the performance of each simulated trial and draw box plots for each method and metric, as shown in Fig. 4 and Tables 1 and 2, which clearly show that PWN outperforms the other methods. PWN consistently achieves the highest AveP, with PWN without curvature placing second. Additionally, the significance of the improvement is shown in Fig. 5, which shows that PWN is significantly better than every other baseline. Also note that uKIN with curvature performs better than vanilla uKIN. Contrary to our expectations, the RWR with curvatures does not work well; it is sometimes even inferior to the baseline.
In the remaining sections, we focus on analyzing the behavior of PWN, such as the effects of curvature (see “Effectiveness of internal features” section) and prior knowledge (see “Effectiveness of the external feature” and “Effectiveness of the amount of prior knowledge” sections). Additionally, we find that PWN has slightly more volatile performance than the other methods. The major cause of this variance is identified in the “Post hoc analysis of the large induced variance” section.
Effectiveness of internal features
To observe the pure influence of the curvature alone, we compare the methods that do not rely on prior knowledge. Three methods are chosen: the RWR, the RWR with GDC [32], and the RWR with curvatures. We plot the results in Fig. 6, which reveals that the effect of curvature might differ when different PPI networks are used. In the experiment using STRING, Prec@100 and Prec@200 decrease when \(\beta\) decreases. In contrast, Prec@100 and Prec@200 decrease when \(\beta\) increases if BioGRID is used. Note that AveP seems to remain the same in both cases.
This phenomenon occurs because the characteristics of the two networks are different, as clearly shown in Fig. 7. Both plots appear to be V-shaped, but some differences can also be identified.

BioGRID has longer tails; in other words, it contains nodes with more extremely negative average curvatures than those in STRING.

STRING contains nodes with sufficiently positive average curvatures, while BioGRID does not.
From these differences, we suspect that STRING contains more protein complexes than BioGRID, while BioGRID tends to have more hub proteins interacting with large numbers of neighbors (recall Fig. 2).
Furthermore, notice the difference in the relative densities of the priors. In STRING, the priors are concentrated near the origin and the positive-curvature region, while in BioGRID, the priors are spread more widely and tend to have more negative curvatures. These differences seem to have caused the difference in Fig. 7; if the priors are in the negative-curvature region (as in BioGRID), it would be advantageous to send prior knowledge towards it by setting \(\beta\) to a negative value, and vice versa.
Effectiveness of the external feature
In this experiment, we measure the performance achieved by PWN with varying hyperparameters so that we can understand the comprehensive effect of curvature and prior information. Figure 8 displays the results of an experiment conducted with uKIN as the baseline.
The most remarkable aspect of Fig. 8 is that the application of curvature information enhances performance when used with prior information. PWN with \(\beta =1/2\) always performs better than PWN without curvature, as seen when the AveP is calculated. However, a larger \(\beta\) usually harms the performance.
We suspect the following mechanism. Due to the nature of the sigmoid function, as \(\beta\) increases, the edge weights in the matrix \(K\) are squashed toward 0 (see Fig. 9). When an edge weight becomes 0, it is impossible for the prior knowledge to be diffused in that direction. Therefore, the prior knowledge does not spread well. Because of this phenomenon, a larger \(\beta\) interferes with the diffusion and harms the performance. Similarly, we suspect that a negative \(\beta\) also adversely affects the spread of prior knowledge.
Additionally, note that PWN works well when the restart probability \(\gamma\) satisfies \(0.3\le \gamma \le 0.7\), implying that a moderate level of smoothing is important.
Effectiveness of the amount of prior knowledge
We have argued that the combination of internal and external features can improve the resulting analysis performance. However, note that external information is not always available. It is important for PWN to work even when there is little prior knowledge; otherwise, PWN would not work in most real-world cases.
Thus, to determine the influence of the amount of prior knowledge, we plot the performance of PWN and uKIN under various amounts of prior knowledge in Fig. 10. We observe that PWN always performs better than the baseline as long as prior knowledge is given, regardless of the amount of information. Thus, one can expect a performance improvement even when little prior knowledge is available. In addition, the performance seems to increase linearly as the amount of prior knowledge increases. This implies that PWN fuses prior knowledge efficiently, since if information were lost, the increasing trend would likely be weak or absent.
Post hoc analysis of the large induced variance
Although the performance of PWN is superior to that of the other methods, we find that the variance of PWN is much larger than that of the other methods. We suspect that the large performance variance originates from the large variance of the smoothed prior knowledge, so we empirically verify this hypothesis by comparing the internal variances of uKIN and PWN using the same simulated dataset; the results are displayed in Fig. 11.
As the figure implies, the standard deviation of the smoothed prior information of PWN is much larger than that of uKIN. We conclude that this high variance makes the transition matrix and resulting gene scores highly volatile, and thus the performance metrics are affected.
Discussion
Through the various experiments, we find that PWN performs better than other available random walk methods, although some points related to the effects of using curvatures and the properties of PWN remain to be discovered.
Our first question concerns the reason that a side effect is induced when no prior knowledge is available. We suspect that the curvature itself might have no strong relation to the given task. Topological properties are not explicitly related to biological tasks, and no one can guarantee their effectiveness. Thus, using only topological properties might distort the diffusion process in an unwanted way. Furthermore, due to the massive number of edges, the dissonance becomes large and unrecoverable. As a result, the diffusion process becomes inefficient and might damage the resulting performance. One of our unexpected observations is that applying curvature in a negative sense (\(\beta <0\)) does not affect, or even increases, the accuracy.
Utilizing curvatures with prior knowledge, however, has a synergetic effect on the analysis. Note that the amount of prior knowledge is much smaller than the number of genes in the network; thus, applying prior knowledge removes the effects of most edges by suppressing every edge attached to meaningless nodes. From this observation, we hypothesize that the effectiveness of curvature is only revealed when the prior information leaves only meaningful and relevant edges; otherwise, the side effect of applying curvature is dominant because there are too many irrelevant edges.
Another noticeable point is related to the effects of the hyperparameters, as shown in Fig. 8. Recall that an extreme value of \(\beta\) or \(\gamma\) harms the performance. Although we have yet to find explicit evidence, we interpret this phenomenon as follows. In the case of \(\gamma\), we conjecture that excessive smoothing might leave only a slight disparity between the priors and non-priors, while the use of prior knowledge without smoothing can remove important edges related to unobserved or missing knowledge.
Furthermore, a large \(\beta\) can make PWN completely ignore low-curvature edges and cause inefficiency in the random walks. As mentioned above, most edges have low edge weights after applying PWN, since the prior information suppresses every edge attached to meaningless nodes. Furthermore, since the edges with large curvatures are attached to a few nodes that have many edges, the reconstructed network is very sparse and far from being a single connected network.
Future works
We tested PWN on three PPI networks (STRING, BioGRID, and IID), and all three experiments confirmed that PWN has the best performance. However, whether this holds for all PPI networks remains to be confirmed. Empirically, it can be verified using other PPI databases or other organisms.
Whether the default hyperparameters used in this paper operate universally well also needs to be verified. Our intuition is that tuning is necessary depending on the nature of the network. We also observed that a larger \(\beta\) worsens the performance, so a range of \(\beta\) in which PWN works well could be identified.
Although PWN is very effective, its results are more volatile than those of the other available baselines. We initially suspected that this problem is related to the prior smoothing process due to the following observations: the lack of a variance difference between PWN and PWN without curvature in Fig. 4, and the intuitive approximated computations (see Additional file 2). Figure 11 empirically supports this hypothesis and provides some hints for reducing the variance. A novel method that smooths the prior information more effectively would be needed to achieve a lower variance.
In this study, we used cancer driver genes that have already been experimentally proven as prior information. In many cases, significant prior knowledge of the disease exists, which will affect the performance and application of our algorithm. For example, prior knowledge by cancer type is additional information that could be considered. Interestingly, the more prior knowledge about a particular type of cancer is used, the better the genes responsible for that cancer can be discovered. This has already been demonstrated by another group using a similar random walk approach [11].
We tested our idea only on random walk algorithms, but it can easily be extended to other network-based algorithms, especially graph neural networks. Although most graph neural networks are based on message-passing architectures, there are cases where random walks are used directly, such as [34], and our idea seems more suitable there. In line with this, the experiment should be conducted on a task with more complex features and outputs, not just score aggregation.
Lastly, we plan to explore other network properties that have greater relevance to disease target identification and to employ them in high-throughput data analysis to achieve increased performance whether prior knowledge is available or not.
Conclusion
The random walk approach has become a popular tool in integrative analyses. The trends in recent work suggest that it will continue to be used and further refined as demands related to various data types arise. Several random walk methods have been developed to derive an effective procedure [11, 32]. We introduced PWN, a new method that combines a graph curvature approach for controlling the amount of information flowing in networks with prior knowledge to achieve enhanced prediction performance. We showed above that a synergetic effect was observed when the graph curvature approach and prior knowledge were applied simultaneously. Furthermore, our method achieved the largest performance gains relative to GDC [32] and uKIN [11]. In future work, we will also test whether PWN can successfully help analyze other biological data.
Methods
The Python package and related datasets and code used for our reproducible experiments are available on GitHub.^{Footnote 1}
Notations
Let \(G=(V,E)\) be an unweighted network denoting the interactions between nodes, where \(V=\{1,\dots ,n\}\) and \(E\subset V\times V\) are sets of nodes and edges, respectively. For \(e\in E\), \(\sigma _e\) and \(\delta _e\) denote the source and target node of edge \(e\), respectively, so \(\sigma _{(i,j)}=i\) and \(\delta _{(i,j)}=j\). Assume that \((i,j)\in E\iff (j,i)\in E\) and \(\not \exists i\in V: (i,i)\in E\), which means that \(G\) is undirected and has no self-loops. The neighbors of \(i\) are defined as \(N_{i}=\{k: (i, k)\in E\}\).
Let \(A\in {\{0, 1\}}^{n\times n}\) be the adjacency matrix of \(G\); i.e., \(A_{ij} = 1\) if \((i,j)\in E\) and \(A_{ij} = 0\) otherwise.
From the adjacency matrix \(A\), the degree matrix \(D\in \mathbb {N}^{n\times n}\) is defined as the diagonal matrix with \(D_{ii}=\sum _{j} A_{ij}\) and \(D_{ij}=0\) for \(i\ne j\).
PWN
Warping via an internal feature: graph curvature
First, we warp the unweighted adjacency matrix \(A\) using the network-related feature. We choose to use the augmented Forman–Ricci curvature [16, 17] \(\kappa _{e}\), which can be simply computed as
\(\kappa _{(i,j)} = 4 - |N_{i}| - |N_{j}| + 3\,|N_{i}\cap N_{j}|,\)
where \(|S|\) is the number of elements in the set \(S\). Then, we construct our first warped adjacency matrix \(K\in \mathbb {R}_{+}^{n\times n}\) by
\(K_{ij} = A_{ij}\,\operatorname{sigmoid}\!\left( \beta \,\frac{\kappa _{(i,j)}-\text{mean}(\kappa )}{\textrm{sd}(\kappa )}\right),\)
where \(\text{mean}(\kappa)\) and \(\textrm{sd}(\kappa )\) are the sample mean and sample standard deviation of \(\kappa _{e}\), respectively, and \(\beta \in \mathbb {R}\) is a hyperparameter for controlling the effect of curvatures.
Note that the range of \(\kappa _{e}\) may differ across various networks and can take extremely large or small values, so we first normalize the curvatures to prevent these potential problems. Additionally, note that \(\beta =0\) yields the original unweighted network.
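This warping step can be sketched in a few lines (a minimal sketch, assuming the sigmoid-of-standardized-curvature construction described in this section; the function names are ours):

```python
import numpy as np

def forman_curvatures(A):
    """Augmented Forman-Ricci curvature for every edge of the
    unweighted adjacency matrix A:
    kappa_(i,j) = 4 - deg_i - deg_j + 3 * |N_i & N_j|."""
    deg = A.sum(axis=1)
    tri = A @ A  # tri[i, j] = number of common neighbors of i and j
    return np.where(A > 0, 4 - deg[:, None] - deg[None, :] + 3 * tri, 0.0)

def warp_with_curvature(A, beta):
    """Build K: standardize the curvatures over the edge set, squash
    them with a sigmoid scaled by beta, and keep zeros off the edges."""
    kappa = forman_curvatures(A)
    vals = kappa[A > 0]
    z = (kappa - vals.mean()) / vals.std()
    return np.where(A > 0, 1.0 / (1.0 + np.exp(-beta * z)), 0.0)
```

Note that \(\beta =0\) assigns the same weight (0.5) to every edge, which is equivalent to the unweighted network once the walk is row-normalized.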
Warping via an external feature: prior knowledge
Let the set of prior nodes \(P\subset V\) be given, where each node in \(P\) is known to be related to a given task and independent of the network. We want to warp \(K\) again using \(P\) to reflect the prior knowledge. We define \(\phi \in \mathbb {R}_{+}^{n}\) as \(\phi _{i} = 1/|P|\) if \(i\in P\) and \(\phi _{i} = 0\) otherwise.
Then, we build a Markov kernel \(T\in \mathbb {R}_{+}^{n\times n}\) as follows:
\(T_{ij} = (1-\gamma )\,\frac{K_{ij}}{\sum _{k} K_{ik}} + \gamma \,\phi _{j},\)
where \(\gamma \in [0,1]\) is a hyperparameter named the restart probability. From the kernel \(T\), we compute a stationary distribution \(\pi \in \mathbb {R}_{+}^{n}\) and consider it as the smoothed prior knowledge.
\(\gamma = 1\) implies the use of prior knowledge without smoothing, while \(\gamma = 0\) involves fully smoothing the prior knowledge. Recall that when performing a restart, the kernel jumps to a random prior node that is drawn from some conditionally uniform distribution. This means that the method guarantees the equal use of the prior information.
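The smoothing can be sketched with a short power iteration (a sketch under our reading of the RWR-based smoothing: restart jumps land uniformly on the prior nodes; the function name and iteration count are our additions):

```python
import numpy as np

def smooth_priors(K, prior_idx, gamma, n_iter=200):
    """Smoothed prior pi: stationary distribution of a walk that,
    with probability gamma, restarts at a uniformly chosen prior
    node and otherwise follows the row-normalized warped network K."""
    n = K.shape[0]
    phi = np.zeros(n)
    phi[prior_idx] = 1.0 / len(prior_idx)  # uniform over the priors
    W = K / K.sum(axis=1, keepdims=True)   # row-stochastic transition
    pi = phi.copy()
    for _ in range(n_iter):
        pi = (1.0 - gamma) * (pi @ W) + gamma * phi
    return pi
```

Setting \(\gamma =1\) returns \(\phi\) itself (no smoothing), matching the boundary behavior described above.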
Finally, we compute the final weighted adjacency matrix \(A^{*}\in \mathbb {R}_{+}^{n\times n}\) by \(A_{ij}^{*} = K_{ij}\,\pi _{j}\).
Note that \(A^{*}\) is asymmetric, although \(A_{ji}^{*}> 0 \iff A_{ij}^{*} > 0\) is still satisfied. In other words, PWN assigns different weights to the same edge but in different directions, which implies that PWN converts undirected graphs to implicitly directed graphs.
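The asymmetry can be seen in a tiny numerical example, assuming the final weights take the product form \(A^{*}_{ij}=K_{ij}\,\pi _{j}\) (our reading of this warping step; the numbers are hypothetical):

```python
import numpy as np

# A 3-node path graph and a non-uniform smoothed prior.
K = np.array([[0.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [0.0, 1.0, 0.0]])
pi = np.array([0.5, 0.3, 0.2])

# Column j of A* is scaled by pi_j, so the two directions of the
# same undirected edge receive different weights.
A_star = K * pi[None, :]
```

Here \(A^{*}_{01}=0.3\) but \(A^{*}_{10}=0.5\): the weights differ per direction while the support stays symmetric.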
Score diffusion with a warped network
Let \(v^{(0)}\in \mathbb {R}^{n}\) be the initial gene scores obtained from the given omics data. We want to enhance \(v^{(0)}\) to \(v^{*}\) by injecting more information via the warped network \(A^{*}\) so that we can obtain more accurate and reliable scores. We choose to use an RWR as follows:
\(v^{(t+1)} = (1-q)\,{\left({D^{*}}^{-1} A^{*}\right)}^{\top } v^{(t)} + q\,v^{(0)}, \qquad v^{*} = \lim _{t\rightarrow \infty } v^{(t)},\)
where \(q\in [0,1]\) is the restart probability and \(D^{*}\) is the degree matrix of \(A^*\).
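The diffusion can be sketched as a standard RWR fixed-point iteration (assuming the row-normalized update written above; the function name and iteration count are ours):

```python
import numpy as np

def diffuse_scores(A_star, v0, q, n_iter=200):
    """Random walk with restart over the warped network: each step
    propagates the current scores along the row-normalized A* and
    mixes the initial scores v0 back in with probability q."""
    P = A_star / A_star.sum(axis=1, keepdims=True)  # D*^{-1} A*
    v = v0.copy()
    for _ in range(n_iter):
        v = (1.0 - q) * (P.T @ v) + q * v0
    return v
```

Since \(P\) is row-stochastic, each update preserves the total score mass, and \(q=1\) simply returns \(v^{(0)}\).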
Variants of PWN
For qualitative analysis purposes, we also consider the following variants of PWN. The first version is an RWR with curvatures, which considers the \(K\) in the “Warping via an internal feature: graph curvature” section as a weighted adjacency matrix and applies an RWR on \(K\). The second variant is uKIN with curvatures, which applies curvatures to uKIN as in PWN. The last version is PWN without curvatures; i.e., it uses the original adjacency matrix \(A\) instead of \(K\). Note that this is equivalent to PWN with \(\beta =0\).
Dataset preparation for a simulation study
Computing the initial gene scores
We then use unbiased high-throughput data to compute the initial gene scores. The high-throughput data are prepared from the TCGA data portal.^{Footnote 2} We download the transcriptome data containing cancer samples and normal samples for 12 cancer types (breast cancer, colon adenocarcinoma, head and neck squamous cell carcinoma, kidney chromophobe, kidney renal clear cell carcinoma, kidney renal papillary cell carcinoma, liver hepatocellular carcinoma, lung adenocarcinoma, lung squamous cell carcinoma, prostate adenocarcinoma, stomach adenocarcinoma, and thyroid cancer) while considering the balanced availability of the two sample types. Then, we identify differentially expressed genes by computing \(p\) values using the \(t\)-test and combine them using Fisher's method [35]. In contrast, since mND can handle multidimensional input scores, we do not merge the \(p\) values for mND. Finally, we convert the \(p\) values to \(Z\)-scores as \(Z=\Phi ^{-1}(1-p)=-\Phi ^{-1}(p)\), where \(\Phi\) is the cumulative distribution function of the standard normal distribution.
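The per-gene score computation can be sketched as follows (a stdlib-only illustration of Fisher's method and the \(Z\)-score conversion; for the even degrees of freedom arising here, the chi-square survival function has a closed form):

```python
from math import exp, log, factorial
from statistics import NormalDist

def fisher_combine(pvals):
    """Fisher's method: X = -2 * sum(ln p_i) follows a chi-square
    distribution with 2k degrees of freedom under the null; for even
    df the survival function is exp(-x/2) * sum_{i<k} (x/2)^i / i!."""
    k = len(pvals)
    x = -2.0 * sum(log(p) for p in pvals)
    return exp(-x / 2.0) * sum((x / 2.0) ** i / factorial(i) for i in range(k))

def p_to_z(p):
    """Convert a combined p value to a Z score: Z = Phi^{-1}(1 - p)."""
    return NormalDist().inv_cdf(1.0 - p)
```

For a single \(p\) value, `fisher_combine` is the identity, and `p_to_z(0.05)` gives roughly 1.645.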
Collecting PPI networks
Bajpai et al. [36] performed a comprehensive quantitative study comparing the performance and usage of different PPI databases and quantified the agreement between curated interactions shared across 16 major public databases. Among these, we exclude 6 databases^{Footnote 3} that cannot be downloaded and compare the statistics of the remaining 10, which are listed in Additional file 3. We select BioGRID [37] and STRING [38] for our main experiments, since they provide the largest sets of PPIs among the available primary and secondary databases, respectively. Furthermore, we only include STRING edges that have experimental evidence and a confidence score larger than 0.7. We also include additional results using IID [39] in Additional file 4.
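A sketch of this filtering step, assuming the column names of the STRING detailed-links download (where scores are scaled by 1000, so a 0.7 confidence threshold corresponds to 700):

```python
import pandas as pd

def filter_string_edges(links: pd.DataFrame) -> pd.DataFrame:
    """Keep edges with experimental evidence and combined score > 0.7."""
    mask = (links["experimental"] > 0) & (links["combined_score"] > 700)
    return links.loc[mask, ["protein1", "protein2"]]
```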
Collecting the ground truth and constructing experiments
For most of the experiments, we need both prior-knowledge genes and genes to be uncovered, where the former are used by the diffusion methods and the latter are required for performance measurement. We randomly divide the ground-truth genes to simulate this experimental design. If an experiment does not need any prior knowledge (see the "Effectiveness of internal features" section), we simply use all ground-truth genes as the genes to be uncovered.
We crawled the CGC list from the COSMIC [40] website^{Footnote 4} as our ground truth. The CGC list contains 723 known cancer driver genes, of which we remove 9 that are not present in the network. We then randomly divide the CGC genes into two subsets at a 2:8 ratio. The smaller train set is employed as prior knowledge for the tested methods, while the larger test set represents the relevant genes to be uncovered by the methods. For robustness, we repeat this process 30 times.
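The repeated 2:8 split can be sketched as follows; the seed and function name are illustrative:

```python
import random

def make_trials(cgc_genes, n_trials=30, train_frac=0.2, seed=0):
    """Randomly split the ground-truth genes into a prior-knowledge set
    (20%) and a held-out set (80%), repeated n_trials times."""
    rng = random.Random(seed)
    trials = []
    for _ in range(n_trials):
        genes = list(cgc_genes)
        rng.shuffle(genes)
        cut = int(len(genes) * train_frac)
        trials.append((set(genes[:cut]), set(genes[cut:])))
    return trials
```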
Performance measurement
We choose average precision (AveP) and precision at \(k\) (Prec@\(k\); \(k=100,200\)) as the performance metrics, defined as follows:

$$\mathrm{Prec@}k=\frac{1}{k}\sum _{i=1}^{k}\chi \left[ g_{i}\in R\right] ,\qquad \mathrm{AveP}=\frac{1}{|R|}\sum _{k=1}^{n}\chi \left[ g_{k}\in R\right] \cdot \mathrm{Prec@}k,$$

in which \(g_i\) denotes the gene ranked \(i\)-th by a method and \(R\) is the set of relevant genes,
where \(\chi\) is an indicator function that returns 1 if the given condition holds and 0 otherwise. Both are commonly used metrics for imbalanced settings in which the positive class is tiny. Note that AveP is an estimator of the area under the precision-recall curve (AUPRC) [41]. It is also preferable to computing the area directly, since the latter is known to give overly optimistic results due to curve interpolation [42], whereas AveP does not involve interpolation and thus avoids this problem.
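Both metrics can be computed directly from a ranked gene list; a minimal sketch:

```python
def average_precision(ranked, relevant):
    """AveP over a ranked gene list; an estimator of the AUPRC."""
    relevant = set(relevant)
    hits, score = 0, 0.0
    for i, g in enumerate(ranked, start=1):
        if g in relevant:
            hits += 1
            score += hits / i  # precision at the rank of this hit
    return score / len(relevant)

def precision_at_k(ranked, relevant, k):
    """Fraction of the top-k ranked genes that are relevant."""
    relevant = set(relevant)
    return sum(g in relevant for g in ranked[:k]) / k
```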
For a fair comparison, we use the exact same prior knowledge and the exact same set of relevant genes for every method in each trial. Additionally, we conduct one-sided paired \(t\)-tests on the performance metrics to verify that the achieved improvements are significant, and adjust the \(p\) values via Benjamini-Hochberg correction [43].
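A sketch of this significance testing, assuming per-trial metric values for PWN and each baseline method; the Benjamini-Hochberg step-up procedure is implemented by hand to keep the sketch self-contained:

```python
import numpy as np
from scipy import stats

def compare_methods(pwn_scores, baseline_scores_by_method):
    """One-sided paired t-tests (PWN > baseline) across trials,
    with Benjamini-Hochberg adjustment of the resulting p values."""
    pvals = np.array([
        stats.ttest_rel(pwn_scores, base, alternative="greater").pvalue
        for base in baseline_scores_by_method
    ])
    # Benjamini-Hochberg step-up: adjusted q(i) = min_{j>=i} p(j) * m / j.
    order = np.argsort(pvals)
    m = len(pvals)
    adj = np.empty(m)
    running_min = 1.0
    for rank, idx in enumerate(order[::-1]):  # from largest p downward
        r = m - rank  # 1-based rank of this p value
        running_min = min(running_min, pvals[idx] * m / r)
        adj[idx] = running_min
    return adj
```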
Availability of data and materials
The software and datasets generated and/or analyzed during the study are available on GitHub: https://github.com/Standigm/PWN.
Notes
HitPredict, UniHI, iRefWeb, GPSProt, hPRINT and HPRD.
Abbreviations
PPI: Protein–protein interaction
RWR: Random walk with restart
AveP: Average precision
AUPRC: Area under the precision-recall curve
Prec@\(k\): Precision at \(k\)
TCGA: The Cancer Genome Atlas
CGC: Cancer Gene Census
References
Zhu L, Su F, Xu Y, Zou Q. Network-based method for mining novel HPV infection related genes using random walk with restart algorithm. Biochimica et Biophysica Acta (BBA) Mol Basis Dis. 2018;1864(6):2376–83. https://doi.org/10.1016/j.bbadis.2017.11.021.
Li L, Wang Y, An L, Kong X, Huang T. A network-based method using a random walk with restart algorithm and screening tests to identify novel genes associated with Ménière's disease. PLoS ONE. 2017;12(8):e0182592. https://doi.org/10.1371/journal.pone.0182592.
Yepes S, Tucker MA, Koka H, Xiao Y, Jones K, Vogt A, Burdette L, Luo W, Zhu B, Hutchinson A, Yeager M, Hicks B, Freedman ND, Chanock SJ, Goldstein AM, Yang XR. Using whole-exome sequencing and protein interaction networks to prioritize candidate genes for germline cutaneous melanoma susceptibility. Sci Rep. 2020;10(1):17198. https://doi.org/10.1038/s41598-020-74293-5.
Zhang Y, Zeng T, Chen L, Ding S, Huang T, Cai YD. Identification of COVID-19 infection-related human genes based on a random walk model in a virus-human protein interaction network. Biomed Res Int. 2020;2020:1–7. https://doi.org/10.1155/2020/4256301.
Cui X, Shen K, Xie Z, Liu T, Zhang H. Identification of key genes in colorectal cancer using random walk with restart. Mol Med Rep. 2017;15(2):867–72. https://doi.org/10.3892/mmr.2016.6058.
Köhler S, Bauer S, Horn D, Robinson PN. Walking the interactome for prioritization of candidate disease genes. Am J Hum Genet. 2008;82(4):949–58. https://doi.org/10.1016/j.ajhg.2008.02.013.
Guo W, Shang DM, Cao JH, Feng K, He YC, Jiang Y, Wang S, Gao YF. Identifying and analyzing novel epilepsy-related genes using random walk with restart algorithm. Biomed Res Int. 2017;2017:1–13. https://doi.org/10.1155/2017/6132436.
Lu S, Yan Y, Li Z, Chen L, Yang J, Zhang Y, Wang S, Liu L. Determination of genes related to uveitis by utilization of the random walk with restart algorithm on a protein–protein interaction network. Int J Mol Sci. 2017;18(5):1045. https://doi.org/10.3390/ijms18051045.
Zhang J, Suo Y, Liu M, Xu X. Identification of genes related to proliferative diabetic retinopathy through RWR algorithm based on protein–protein interaction network. Biochimica et Biophysica Acta (BBA) Mol Basis Dis. 2018;1864(6, Part B):2369–75. https://doi.org/10.1016/j.bbadis.2017.11.017.
Laenen G, Thorrez L, Börnigen D, Moreau Y. Finding the targets of a drug by integration of gene expression data with a protein interaction network. Mol BioSyst. 2013;9(7):1676. https://doi.org/10.1039/c3mb25438k.
Hristov BH, Chazelle B, Singh M. uKIN combines new and prior information with guided network propagation to accurately identify disease genes. Cell Syst. 2020;10(6):470–479.e3. https://doi.org/10.1016/j.cels.2020.05.008.
Silverbush D, Sharan R. A systematic approach to orient the human protein–protein interaction network. Nat Commun. 2019;10(1):3015. https://doi.org/10.1038/s41467-019-10887-6.
do Carmo MP. Differential geometry of curves & surfaces. Revised & updated 2nd ed. Mineola, NY: Dover Publications; 2018.
Villani C. Optimal transport: old and new. Grundlehren der mathematischen Wissenschaften, vol. 338. Berlin: Springer; 2009.
Cowen L, Ideker T, Raphael BJ, Sharan R. Network propagation: a universal amplifier of genetic associations. Nat Rev Genet. 2017;18(9):551–62. https://doi.org/10.1038/nrg.2017.38.
Forman R. Bochner's method for cell complexes and combinatorial Ricci curvature. Discrete Comput Geom. 2003;29(3):323–74. https://doi.org/10.1007/s00454-002-0743-x.
Sreejith RP, Mohanraj K, Jost J, Saucan E, Samal A. Forman curvature for complex networks. J Stat Mech Theory Exp. 2016;2016(6):063206. https://doi.org/10.1088/1742-5468/2016/06/063206.
Ollivier Y. Ricci curvature of Markov chains on metric spaces. J Funct Anal. 2009;256(3):810–64. https://doi.org/10.1016/j.jfa.2008.11.001.
Ollivier Y. A survey of Ricci curvature for metric spaces and Markov chains. Probab Approach Geom. 2010;57:343–82. https://doi.org/10.2969/aspm/05710343.
Ni CC, Lin YY, Luo F, Gao J. Community detection on networks with Ricci flow. Sci Rep. 2019;9(1):9984. https://doi.org/10.1038/s41598-019-46380-9.
Ye Z, Liu KS, Ma T, Gao J, Chen C. Curvature graph network. In: International conference on learning representations; 2019.
Sandhu R, Georgiou T, Reznik E, Zhu L, Kolesov I, Senbabaoglu Y, Tannenbaum A. Graph curvature for differentiating cancer networks. Sci Rep. 2015;5(1):12323. https://doi.org/10.1038/srep12323.
Yu H, Paccanaro A, Trifonov V, Gerstein M. Predicting interactions in protein networks by completing defective cliques. Bioinformatics. 2006;22(7):823–9. https://doi.org/10.1093/bioinformatics/btl014.
Li XL, Foo CS, Tan SH, Ng SK. Interaction graph mining for protein complexes using local clique merging. Genome Inform. 2005;16(2):260–9. https://doi.org/10.11234/gi1990.16.2_260.
Barabási AL, Albert R. Emergence of scaling in random networks. Science. 1999;286(5439):509–12. https://doi.org/10.1126/science.286.5439.509.
Stumpf MPH, Wiuf C, May RM. Subnets of scale-free networks are not scale-free: sampling properties of networks. Proc Natl Acad Sci. 2005;102(12):4221–4. https://doi.org/10.1073/pnas.0501179102.
Sia J, Zhang W, Jonckheere E, Cook D, Bogdan P. Inferring functional communities from partially observed biological networks exploiting geometric topology and side information. Sci Rep. 2022;12(1):10883. https://doi.org/10.1038/s41598-022-14631-x.
Murgas KA, Saucan E, Sandhu R. Hypergraph geometry reflects higher-order dynamics in protein interaction networks. Sci Rep. 2022;12(1):20879. https://doi.org/10.1038/s41598-022-24584-w.
Zhu J, Tran AP, Deasy JO, Tannenbaum A. Multi-omic integrated curvature study on pan-cancer genomic data. bioRxiv. https://doi.org/10.1101/2022.03.24.485712.
Samal A, Sreejith RP, Gu J, Liu S, Saucan E, Jost J. Comparative analysis of two discretizations of Ricci curvature for complex networks. Sci Rep. 2018;8(1):8650. https://doi.org/10.1038/s41598-018-27001-3.
Pouryahya M, Mathews J, Tannenbaum A. Comparing three notions of discrete Ricci curvature on biological networks; 2017. https://doi.org/10.48550/arXiv.1712.02943. arXiv:1712.02943.
Gasteiger J, Weißenberger S, Günnemann S. Diffusion improves graph learning. In: Advances in neural information processing systems, vol. 32. Curran Associates, Inc.; 2019. https://proceedings.neurips.cc/paper/2019/hash/23c894276a2c5a16470e6a31f4618d73-Abstract.html. Accessed 02 Feb 2023.
Di Nanni N, Gnocchi M, Moscatelli M, Milanesi L, Mosca E. Gene relevance based on multiple evidences in complex networks. Bioinformatics. 2019. https://doi.org/10.1093/bioinformatics/btz652.
Gasteiger J, Bojchevski A, Günnemann S. Predict then propagate: graph neural networks meet personalized PageRank. https://openreview.net/forum?id=H1gL-2A9Ym. Accessed 01 Feb 2023.
Fisher RA. Statistical methods for research workers. 7th ed. Edinburgh: Oliver and Boyd; 1938.
Bajpai AK, Davuluri S, Tiwary K, Narayanan S, Oguru S, Basavaraju K, Dayalan D, Thirumurugan K, Acharya KK. Systematic comparison of the protein-protein interaction databases from a user's perspective. J Biomed Inform. 2020;103:103380. https://doi.org/10.1016/j.jbi.2020.103380.
Oughtred R, Stark C, Breitkreutz BJ, Rust J, Boucher L, Chang C, Kolas N, O'Donnell L, Leung G, McAdam R, Zhang F, Dolma S, Willems A, Coulombe-Huntington J, Chatr-aryamontri A, Dolinski K, Tyers M. The BioGRID interaction database: 2019 update. Nucleic Acids Res. 2019;47(D1):529–41. https://doi.org/10.1093/nar/gky1079.
Szklarczyk D, Gable AL, Lyon D, Junge A, Wyder S, Huerta-Cepas J, Simonovic M, Doncheva NT, Morris JH, Bork P, Jensen LJ, von Mering C. STRING v11: protein–protein association networks with increased coverage, supporting functional discovery in genome-wide experimental datasets. Nucleic Acids Res. 2019;47(D1):607–13. https://doi.org/10.1093/nar/gky1131.
Kotlyar M, Pastrello C, Sheahan N, Jurisica I. Integrated interactions database: tissue-specific view of the human and model organism interactomes. Nucleic Acids Res. 2016;44(D1):536–41. https://doi.org/10.1093/nar/gkv1115.
Tate JG, Bamford S, Jubb HC, Sondka Z, Beare DM, Bindal N, Boutselakis H, Cole CG, Creatore C, Dawson E, Fish P, Harsha B, Hathaway C, Jupe SC, Kok CY, Noble K, Ponting L, Ramshaw CC, Rye CE, Speedy HE, Stefancsik R, Thompson SL, Wang S, Ward S, Campbell PJ, Forbes SA. COSMIC: the catalogue of somatic mutations in cancer. Nucleic Acids Res. 2019;47(D1):941–7. https://doi.org/10.1093/nar/gky1015.
Boyd K, Eng KH, Page CD. Area under the precision-recall curve: point estimates and confidence intervals. In: Blockeel H, Kersting K, Nijssen S, Železný F, editors. Machine learning and knowledge discovery in databases. Lecture notes in computer science. Springer; 2013. p. 451–66. https://doi.org/10.1007/978-3-642-40994-3_29.
Davis J, Goadrich M. The relationship between precision-recall and ROC curves. In: Proceedings of the 23rd international conference on machine learning. ICML '06. New York: Association for Computing Machinery; 2006. p. 233–40. https://doi.org/10.1145/1143844.1143874.
Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc Ser B (Methodol). 1995;57(1):289–300. https://doi.org/10.1111/j.2517-6161.1995.tb02031.x.
Acknowledgements
Not applicable.
Funding
Not applicable.
Author information
Authors and Affiliations
Contributions
TK and HK conceived the overall study plan. HK and JH surveyed the available public data and collected them. SH developed the methodology, implemented the code, and performed the analysis. All authors participated in reviewing the previous research and writing the manuscript. All authors read and approved the final manuscript.
Corresponding authors
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Additional file 1.
The PPI network of Homo sapiens, colored by curvature.
Additional file 2.
Supplementary information for post hoc analysis purposes.
Additional file 3.
Summary statistics for the primary/secondary PPIs.
Additional file 4.
Additional results using IID.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Han, S., Hong, J., Yun, S.J. et al. PWN: enhanced random walk on a warped network for disease target prioritization. BMC Bioinformatics 24, 105 (2023). https://doi.org/10.1186/s12859-023-05227-x
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s12859-023-05227-x
Keywords
 Diseasetarget identification
 Proteinâ€“protein interaction
 Random walk
 Machine learning