DriverRWH: discovering cancer driver genes by random walk on a gene mutation hypergraph

Background Recent advances in next-generation sequencing technologies have helped investigators generate massive amounts of cancer genomic data. A critical challenge in cancer genomics is the identification of the few cancer driver genes whose mutations cause tumor growth. However, the majority of existing computational approaches underuse the co-occurrence mutation information of individuals, which is deemed to be important in tumorigenesis and tumor progression, resulting in a high false-positive rate.
Results To make full use of co-mutation information, we present DriverRWH, a random walk algorithm on a weighted gene mutation hypergraph model that uses somatic mutation data and molecular interaction network data to prioritize candidate driver genes. Applied to tumor samples of different cancer types from The Cancer Genome Atlas, DriverRWH shows significantly better performance than state-of-the-art prioritization methods in terms of the area under the curve scores and the cumulative number of known driver genes recovered among top-ranked candidate genes. In addition, DriverRWH discovers several potential drivers, which are enriched in cancer-related pathways. For more than half of the cancer types, approximately 50% of the top 30 ranked candidate genes are known driver genes. DriverRWH is also highly robust to perturbations in the mutation data and gene functional network data.
Conclusion DriverRWH is effective across various cancer types in prioritizing cancer driver genes and provides considerable improvement over other tools, with a better balance of precision and sensitivity. It can be a useful tool for detecting potential driver genes and can facilitate targeted cancer therapies.
Supplementary Information The online version contains supplementary material available at 10.1186/s12859-022-04788-7.

The Cancer Genome Atlas (TCGA) provides somatic mutation landscapes to better characterize the molecular signatures of cancer [3]. There is a consensus view on tumorigenesis that only a few mutational events occurring in a set of genes (called "cancer driver genes") affect the homeostatic development of a set of key cellular functions [4][5][6]. Discovery of these cancer driver genes across various tumor types is a key step in understanding tumor biology and developing targeted anticancer therapies.
A number of computational tools have been developed to identify cancer driver genes from multidimensional genomic data. Most of these tools can be classified into three categories based on their basic principles [7]. Frequency-based approaches, such as MutSigCV and MuSiC, assume that the most frequently mutated genes are more likely to be drivers [8,9]. Unfortunately, frequency-based methods are underpowered for uncovering driver genes with low recurrence [10]. Functional impact-based approaches, such as OncodriveFM, integrate multiple-domain information to predict the functional impact of single nucleotide variants (SNVs) [11,12]. However, most of these methods use machine learning-based models. Building either a gold-standard positive data set or a negative data set for such models is a difficult task, which restricts the use of these methods [10]. The third category comprises network-based methods. Motivated by the observation that mutations in a cancer genome tend to converge on a few biological pathways, these methods attempt to identify groups of driver genes based on prior knowledge of pathways and protein or genetic interactions [13][14][15][16][17]. DawnRank adopts the PageRank algorithm to rank potential drivers based on their impact on the overall differential expression of downstream genes [14]. HotNet2 uses a random walk with restart algorithm to identify mutated subnetworks; it considers the mutation frequency of each gene and of its network neighbors, and hub genes often receive high predicted scores [15]. Methods of this kind can identify driver genes with low recurrence and improve the accuracy of driver gene prediction to some extent [18].
Despite the rapid progress in computational approaches for prioritizing cancer driver genes since the advent of next-generation sequencing technologies, the false-positive rates of existing methods are still too high. In addition, there is evidence that the co-occurrence of driver gene mutations may play a key role in cancer initiation and progression [19][20][21]. Because the activation or inactivation of a single driver gene is usually not sufficient to induce tumorigenesis, multiple mutations in different driver genes have to cooperate to gradually transform normal cells into precursor lesions and subsequently into invasive and metastatic cancer [22][23][24][25]. In the majority of published methods, using single-gene mutation frequency as the input discards the co-occurring alteration information of individual tumors. In this study, we introduce a weighted hypergraph model and present a novel tool, DriverRWH, which integrates mutation profiles and PPI network data to predict driver genes. A hypergraph is a generalization of a simple graph in which edges, called hyperedges, may connect an arbitrary number of vertices. This makes hypergraphs suitable for representing high-order relations, and they can be used to model biological networks, data structures, computations and a variety of other systems [26][27][28]. Here, we used hyperedges to represent the co-existence of mutated genes in individual samples, so that the loss of co-occurring alteration information is avoided to a certain extent. We next specified the weights of the mutated genes in each hyperedge according to their interactions in the PPI network and constructed the weighted hypergraph. Thereafter, we generalized a random walk algorithm to the weighted hypergraph. Finally, we ranked all the candidate mutated genes for the given cancer type.
To verify our method, we applied DriverRWH to 31 cancer types from TCGA and found that it outperforms state-of-the-art tools for the majority of cancer types, regardless of which reference network is used. We also evaluated the robustness of our method and found that DriverRWH is highly robust to various data perturbations.

Overview
In this study, we proposed DriverRWH, which uses a random walk on a weighted hypergraph to prioritize driver genes (Fig. 1). First, for a given cancer type, a hypergraph was constructed based on the mutation profile, wherein tumor samples are represented as hyperedges and mutated genes as vertices. Second, following our hypothesis that a gene is more likely to be a driver gene if it is highly associated with other mutated genes, we differentiated the genes within the hyperedge of each sample according to their degrees in the corresponding subnetwork of the PPI network. Then, we adopted a probabilistic weighted random walk that takes advantage of the hypergraph structure, and carried out PPI network
Fig. 1 Overview of DriverRWH. A, B Construction of the weighted hypergraph model using the somatic mutation profiles of a given cancer type and a PPI network. Each sample is indicated by a colored circular area (hyperedge) that contains all the mutated genes (vertices) of the individual. Since the number of mutated genes varies among samples, the hyperedges contain different numbers of vertices. The weights of the vertices in each hyperedge are assigned according to their degrees in the background subnetwork. C Illustration of the random walk process on the hypergraph. For vertex u, a hyperedge incident with u is selected at random, and a destination vertex v is then selected according to the weights of the vertices in that hyperedge; this step is repeated iteratively. After some steps, the random walk stabilizes, producing a score for each mutated gene. Finally, all candidate mutated genes are ranked in descending order of their scores.

The DriverRWH algorithm
In the present model, the mutation data of a given cancer type and a PPI dataset are used as the input (Fig. 1A). As shown in Fig. 1B, a hypergraph consisting of the mutated genes of all samples was constructed. If a gene is mutated in a sample, it is represented as a vertex in the hyperedge corresponding to that sample. Without loss of generality, the hypergraph can be defined as $HG(V, E)$, where $V$ is the set of vertices and $E$ is the set of hyperedges. A hyperedge $e$ is a subset of $V$, satisfying $\bigcup_{e \in E} e = V$. Hyperedge $e$ is said to be incident with vertex $u$ if $u \in e$; thus, the incidence matrix $H \in \mathbb{R}^{|V| \times |E|}$ can be defined as follows:
$$h(u, e) = \begin{cases} 1, & u \in e \\ 0, & u \notin e \end{cases}$$
After construction of the hypergraph, a specific subnetwork is generated for each sample, based on the mutated genes and their interactions in the PPI network. Following our hypothesis that a gene is more likely to be a driver gene if it is highly associated with other mutated genes, a natural choice for the weight of a vertex in a hyperedge is its degree in the corresponding induced subnetwork of the PPI network.
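The hypergraph construction can be illustrated with a minimal Python sketch. The function name `build_incidence`, the dictionary input format and the toy mutation profile are assumptions made for illustration; they are not taken from the DriverRWH source code.

```python
def build_incidence(sample_to_genes):
    """Build the incidence matrix H of a gene mutation hypergraph.

    sample_to_genes maps a sample ID (hyperedge) to the set of genes
    (vertices) mutated in that sample. H[i][j] is 1 if gene i is mutated
    in sample j and 0 otherwise.
    """
    samples = sorted(sample_to_genes)
    genes = sorted({g for gs in sample_to_genes.values() for g in gs})
    gene_index = {g: i for i, g in enumerate(genes)}
    H = [[0] * len(samples) for _ in genes]
    for j, sample in enumerate(samples):
        for gene in sample_to_genes[sample]:
            H[gene_index[gene]][j] = 1
    return genes, samples, H


# Hypothetical toy mutation profile with three samples.
toy_profile = {
    "S1": {"TP53", "PIK3CA", "PTEN"},
    "S2": {"TP53", "KDR"},
    "S3": {"PIK3CA", "PTEN", "KDR"},
}
genes, samples, H = build_incidence(toy_profile)
```

Each column of `H` corresponds to one sample, so hyperedges of different sizes are stored without compressing the profile into per-gene mutation frequencies.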
Then, we developed a random walk process on the weighted hypergraph. Similar to a random walk on a simple graph, this walk is a Markov process that consists of transitions between vertices. Note that a transition on the hypergraph can occur only between two vertices that are incident to the same hyperedge, so the random walk on the hypergraph is defined as a two-step process. In the first step, the surfer selects a hyperedge $e$ incident with the current vertex $u$; thereafter, it selects a target vertex $v$ within the chosen hyperedge (Fig. 1C). A vertex that is isolated in the subnetwork still has the potential to be a driver gene, so it is assigned a small weight of 0.01. Let $N_e$ be the subnetwork containing the vertices in hyperedge $e$, and denote by $d_{N_e}(u)$ the degree of $u$ in this subnetwork. The weight of vertex $u$ in hyperedge $e$ is then
$$w(u, e) = \begin{cases} d_{N_e}(u), & u \in e \text{ and } d_{N_e}(u) > 0 \\ 0.01, & u \in e \text{ and } d_{N_e}(u) = 0 \\ 0, & u \notin e \end{cases}$$
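The per-sample vertex weights can be sketched as follows. This is a minimal illustration under stated assumptions: the PPI network is given as a list of undirected gene pairs, duplicate edges and self-loops are ignored, and the names `hyperedge_weights` and `toy_ppi` are hypothetical. Only the isolated-vertex weight of 0.01 comes from the text.

```python
ISOLATED_WEIGHT = 0.01  # weight of an isolated vertex, as stated in the text


def hyperedge_weights(mutated_genes, ppi_edges):
    """Weight each mutated gene of one sample by its degree in the
    PPI subnetwork induced by that sample's mutated genes."""
    genes = set(mutated_genes)
    degree = {g: 0 for g in genes}
    # Assumption: interactions are undirected; drop duplicates and self-loops.
    unique_edges = {frozenset(edge) for edge in ppi_edges}
    for edge in unique_edges:
        if len(edge) == 2 and edge <= genes:
            for g in edge:
                degree[g] += 1
    return {g: (d if d > 0 else ISOLATED_WEIGHT) for g, d in degree.items()}


# Hypothetical toy PPI network.
toy_ppi = [
    ("TP53", "PTEN"),
    ("TP53", "PIK3CA"),
    ("PTEN", "PIK3CA"),
    ("TP53", "EGFR"),
]
weights_s1 = hyperedge_weights({"TP53", "PIK3CA", "PTEN"}, toy_ppi)
weights_s2 = hyperedge_weights({"TP53", "PTEN", "KDR"}, toy_ppi)
```

Because the subnetwork is induced per sample, the same gene can receive different weights in different hyperedges: here TP53 has weight 2 in the first sample and 1 in the second.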
Thereafter, the surfer selects vertex $v$ with probability proportional to the weight of $v$ within the hyperedge. Notably, in our model, the weight of a vertex may differ from one hyperedge to another. Writing $h(u, e)$ for the entries of the incidence matrix $H$ and $w(v, e)$ for the weight of vertex $v$ in hyperedge $e$, the degrees of vertex $u$ and hyperedge $e$ in hypergraph $HG(V, E)$ can be defined as follows:
$$d(u) = \sum_{e \in E} h(u, e), \qquad \delta(e) = \sum_{v \in e} w(v, e)$$
With all the elements defined, we calculated the transition probability from vertex $u$ to vertex $v$ as follows:
$$p(u, v) = \sum_{e \in E} \frac{h(u, e)}{d(u)} \cdot \frac{w(v, e)}{\delta(e)}$$
which can also be written in matrix form:
$$P = D_u^{-1} H D_e^{-1} W^{T}$$
where $D_u \in \mathbb{R}^{|V| \times |V|}$ is the diagonal vertex degree matrix, $D_e \in \mathbb{R}^{|E| \times |E|}$ is the diagonal hyperedge degree matrix with elements $\delta(e)$, and $W \in \mathbb{R}^{|V| \times |E|}$ is the weighted incidence matrix of hypergraph $HG(V, E)$. Note that the transition matrix $P$ is stochastic; that is, each row sums to 1.
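The two-step transition can be sketched in plain Python. This sketch assumes that the hyperedge is chosen uniformly among those incident with the current vertex and that the destination vertex is chosen in proportion to its weight in that hyperedge, with the walker allowed to stay on the current vertex. The function name `transition_matrix` and the toy matrix are illustrative, not the authors' implementation.

```python
def transition_matrix(W):
    """Row-stochastic transition matrix of a weighted hypergraph walk.

    W[i][j] > 0 is the weight of vertex i in hyperedge j; W[i][j] == 0
    means vertex i is not in hyperedge j. Every vertex is assumed to
    belong to at least one hyperedge.
    """
    n, m = len(W), len(W[0])
    # Vertex degree: number of hyperedges incident with the vertex.
    d = [sum(1 for j in range(m) if W[i][j] > 0) for i in range(n)]
    # Hyperedge degree: total vertex weight inside the hyperedge.
    delta = [sum(W[i][j] for i in range(n)) for j in range(m)]
    P = [[0.0] * n for _ in range(n)]
    for u in range(n):
        for j in range(m):
            if W[u][j] > 0:
                for v in range(n):
                    P[u][v] += (1.0 / d[u]) * (W[v][j] / delta[j])
    return P


# Hypothetical weighted incidence matrix: 3 vertices, 2 hyperedges.
W_toy = [
    [1.0, 1.0],
    [1.0, 0.0],
    [0.0, 3.0],
]
P_toy = transition_matrix(W_toy)
```

For the toy matrix, vertex 0 moves through the first hyperedge half of the time and through the second otherwise, which gives the row (0.375, 0.25, 0.375); every row sums to 1.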
Furthermore, we implemented a random walk with restart on the hypergraph. All genes are considered to be potential driver genes and are assigned equal probabilities; i.e., in the initial normalized probability vector $\vec{v}^{(0)} \in \mathbb{R}^{|V| \times 1}$, each element equals $1/|V|$. Moreover, the restart probability at every step is set to $1 - \alpha$ $(0 < \alpha < 1)$. In this article, we set $\alpha$ to 0.2. Finally, the random walk formula can be expressed as follows:
$$\vec{v}^{(t+1)} = \alpha P^{T} \vec{v}^{(t)} + (1 - \alpha)\, \vec{v}^{(0)}$$
In the formula above, the $i$th element of $\vec{v}^{(t)}$ is the probability that the surfer is at vertex $i$ at step $t$. After a number of steps, the random walk becomes stable, and the stable state is denoted $\vec{v}^{(\infty)}$. The walk is considered stable when the L1-norm distance between $\vec{v}^{(t+1)}$ and $\vec{v}^{(t)}$ is smaller than a given cutoff value. In this paper, we set the cutoff to $10^{-6}$. The elements of the stabilized vector are defined as the DriverRWH scores, which reflect the role that the mutated genes play in cancer.
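The iteration can be sketched as follows, assuming the update multiplies the walk term by α and the restart term by 1 − α, consistent with the restart probability of 1 − α stated above. The function name, the `max_iter` safeguard and the toy transition matrix are illustrative assumptions.

```python
def random_walk_with_restart(P, alpha=0.2, tol=1e-6, max_iter=10000):
    """Iterate v <- alpha * P^T v + (1 - alpha) * v0 until the L1 change
    between successive vectors falls below tol.

    P is a row-stochastic transition matrix given as a list of rows.
    max_iter is a safeguard added for this sketch.
    """
    n = len(P)
    v0 = [1.0 / n] * n  # uniform initial (and restart) distribution
    v = v0[:]
    for _ in range(max_iter):
        nxt = [
            alpha * sum(P[u][i] * v[u] for u in range(n)) + (1 - alpha) * v0[i]
            for i in range(n)
        ]
        if sum(abs(a - b) for a, b in zip(nxt, v)) < tol:
            return nxt
        v = nxt
    return v


# Hypothetical row-stochastic transition matrix for three genes.
P_example = [
    [0.375, 0.25, 0.375],
    [0.5, 0.5, 0.0],
    [0.25, 0.0, 0.75],
]
scores = random_walk_with_restart(P_example)
ranking = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
```

Since each row of `P_example` sums to 1, the score vector stays a probability distribution at every step, and sorting the scores in descending order yields the gene ranking.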

Performance evaluation
To evaluate the method, an unbiased and comprehensive set of known cancer genes is needed. Unfortunately, such a gold-standard set is currently unavailable. As an alternative, we used four complementary cancer gene sets derived from various sources as the reference driver gene set for all cancer types. First, 616 cancer genes were downloaded from the Cancer Gene Census (CGC) database, which includes genes for which mutations have been causally implicated in cancer and is widely used as a gold-standard cancer gene set [32]. Second, the list of HiConf cancer gene panels consists of 99 driver genes that have previously been detected through genetic criteria and that could plausibly be detected with exome sequencing data [33]. The third set has 291 high-confidence cancer driver genes identified by a rule-based method (HCD) [34]. The fourth set contains 125 driver genes defined by the "20/20 rule", which identifies Mut-driver genes based on the characteristic mutational patterns of oncogenes and tumor suppressor genes [35]. Because each cancer gene set is biased toward particular features or study methods, we used the union of these four lists, with a total of 785 genes, as the reference driver gene set. This reduces, to some degree, the bias caused by using a single reference gene list. Using these reference driver genes as a benchmark, we generated receiver operating characteristic (ROC) curves and calculated the areas under the curve (AUCs) to evaluate the true-positive and false-positive rates. For practical reasons, only top-ranked candidate genes are likely to enter follow-up experimental validation. Because good prioritization over all genes does not guarantee good prioritization among the top-ranked candidates, we also assessed the number of known driver genes recovered in the top 20, 50, 100, 150 and 200 candidate genes.
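Both evaluation measures can be sketched in a few lines of Python. The AUC is computed here through its pairwise-comparison definition (the probability that a known driver scores higher than a non-driver, with ties counted as one half), which is one standard way to obtain it and not necessarily the procedure used in the paper. The function names and toy scores are hypothetical.

```python
def auc_score(scores, known_drivers):
    """AUC as the fraction of (driver, non-driver) pairs ranked correctly."""
    pos = [s for g, s in scores.items() if g in known_drivers]
    neg = [s for g, s in scores.items() if g not in known_drivers]
    if not pos or not neg:
        raise ValueError("need at least one driver and one non-driver")
    wins = sum(
        1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg
    )
    return wins / (len(pos) * len(neg))


def known_in_top_k(scores, known_drivers, ks=(20, 50, 100, 150, 200)):
    """Count the known drivers among the k highest-scoring genes for each k."""
    ranked = sorted(scores, key=scores.get, reverse=True)
    return {k: sum(1 for g in ranked[:k] if g in known_drivers) for k in ks}


# Hypothetical scores for four genes, two of which are known drivers.
toy_scores = {"A": 0.9, "B": 0.8, "C": 0.3, "D": 0.1}
toy_known = {"A", "C"}
toy_auc = auc_score(toy_scores, toy_known)
toy_counts = known_in_top_k(toy_scores, toy_known, ks=(1, 2, 3))
```

In this toy case three of the four driver/non-driver pairs are ordered correctly, so the AUC is 0.75, while the top-k counts show how many known drivers a follow-up experiment limited to k genes would cover.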
Because of the diversity of cancer types, we are more interested in tumor-specific drivers than in drivers common to all tumor types. We therefore downloaded the IntOGen database (https://www.intogen.org/download) [4]. This database harnesses the strengths of different driver prediction methods and provides a list of tumor-specific driver genes, which is considered to offer the best trade-off between sensitivity and specificity. The list covers 31 types of cancer, among which Kidney Chromophobe (KICH) has the fewest specific drivers (7) and Uterine Corpus Endometrial Carcinoma (UCEC) has the most (55). All of the above lists are shown in Additional file 3. From an application point of view, we should also assess the ability of our method to identify novel driver genes that have not been reported in IntOGen. Genes that appear in the top 200 candidate lists predicted by DriverRWH with both HumanNet and STRINGv10 but are not among the tumor-specific drivers were considered potential novel drivers. From a functional perspective, these genes were evaluated by biological analysis using the DAVID online database, CancerGeneNet and the iGMDR database [36][37][38][39].
We leveraged a literature mining method named CoCiter, which calculates the co-citation significance between predicted driver genes and the keywords (the cancer type, 'driver' and 'cancer'), to verify the top 30 significant genes [40]. A higher co-citation score indicates a stronger association between the genes and the key terms. Without loss of generality, we compared DriverRWH with 24 driver gene prediction methods across 31 cancer types. Some of these methods identify significant drivers by P-value (genes with FDR-adjusted P-value < 0.05), whereas the rest provide priority scores for candidate driver genes (the top 30 genes are selected as significant drivers). This cutoff is reasonable because the median number of significant genes reported by the other methods across all data sets is 30.
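The selection rule for potential novel drivers amounts to a set operation, sketched below. The function names, gene scores and the small value of `k` are hypothetical and chosen only to keep the example readable; the text uses the top 200 genes.

```python
def top_k(scores, k):
    """Return the set of the k highest-scoring genes."""
    return set(sorted(scores, key=scores.get, reverse=True)[:k])


def novel_candidates(scores_a, scores_b, known_specific, k=200):
    """Genes in the top k under both networks that are not already
    listed as tumor-specific drivers."""
    return (top_k(scores_a, k) & top_k(scores_b, k)) - set(known_specific)


# Hypothetical scores obtained with two background networks.
scores_string = {"TP53": 0.9, "KDR": 0.8, "G1": 0.7, "G2": 0.1}
scores_humannet = {"TP53": 0.95, "KDR": 0.6, "G2": 0.5, "G1": 0.2}
candidates = novel_candidates(
    scores_string, scores_humannet, known_specific={"TP53"}, k=2
)
```

Requiring agreement between the two networks keeps only candidates whose rank does not depend on a single background network.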

Known driver genes have higher degree in the PPI network
In DriverRWH, we hypothesized that a gene is more likely to be a cancer driver if it tends to be associated with other mutated genes in cancer. This hypothesis has already been proposed in some studies [15,41]. To further validate it, we analyzed the connectivity of mutated genes in the PPI network. For a given cancer type, we built the subnetwork of the PPI network induced by the genes mutated in any sample. The genes mutated at least once in a cancer type were divided into two groups according to whether they are in the reference driver gene set (the union of CGC, HiConf, HCD and Mut-driver, with a total of 785 genes): known driver genes and the others. We then calculated the degree of each vertex in the induced subnetwork. Taking the three cancer types LUSC, BRCA and UCEC for illustration, we found that the degrees of known driver genes were significantly larger than those of the other mutated genes (Fig. 2, P-value < 0.001). This result suggests that cancer driver genes are adjacent to more mutated genes than the other genes are. The same analysis using HumanNet is also available (Additional file 1: Fig S1).

Performance of DriverRWH
To evaluate the performance of our method, we assessed it from three aspects: prediction of known driver genes, functional enrichment analysis and literature mining analysis. First, we ran six prioritization methods, MutSigCV [8], DawnRank [14], MinNetRank [16], Subdyquency [17], Gravity [41] and OncodriveFML [42], on three cancer types, namely Lung squamous cell carcinoma (LUSC), Breast invasive carcinoma (BRCA), and Uterine Corpus Endometrial Carcinoma (UCEC) (see Additional file 4). To eliminate the bias introduced by the background network, we ran DriverRWH and the three other network-based methods (MinNetRank, Subdyquency and DawnRank) on the same network, using STRINGv10 and HumanNet in turn. Then, we compared DriverRWH with 24 other driver gene prediction tools to evaluate its performance across 31 cancer types. Lastly, we verified the robustness of our method by testing its performance on perturbed data, in which the mutation data and network data were randomly subsampled at different sizes.

Results for lung squamous cell carcinoma
Lung cancer is the leading cause of cancer death, accounting for 18.0% of cancer deaths [43]. In this study, we applied DriverRWH to 480 LUSC samples from the TCGA database. Using the reference driver genes as a benchmark, we generated receiver operating characteristic (ROC) curves. With STRINGv10 as the background network, DriverRWH outperforms the other six tools in terms of sensitivity and specificity in identifying known driver genes (Fig. 3A). We further assessed the predictive power for the top-ranked candidate genes. As shown in Fig. 3B, DriverRWH identified more known cancer driver genes among its top 20, 50, 100, 150 and 200 genes. Furthermore, more than half of the 20 top-ranked candidates retrieved by DriverRWH with the STRINGv10 network are known driver genes. When HumanNet was used, DriverRWH is still significantly better than the other methods (Additional file 1: Fig S2).
Figure caption (fragment): Bracketed digits indicate the numbers of known driver genes and of the other genes in the STRINGv10 subnetwork induced by the genes mutated in at least one tumor sample of a given cancer type.
Specifically, taking the top 30 candidate genes as significant drivers, we searched for these genes on the CoCiter website with the key terms 'Cancer', 'Driver' and 'Lung'. As Table 1 shows, some well-known driver genes such as TP53, PTEN and PIK3CA are near the top of the list. Although they are also identified by most of the other methods, they are ranked lower there than in our list. The well-known tumor suppressor TP53, whose mutation disrupts cell cycle arrest and apoptosis pathways in human cancer, ranks first in our method but 527th in the Gravity algorithm. PTEN, a recognized tumor suppressor gene with phosphatase activity, has been shown to be related to small cell lung cancer [49]. It is co-cited with 'Lung' and 'Cancer' 253 and 2597 times, respectively, and is regarded as a driver gene in 35 publications. PTEN ranks 16th in our list but 44th in MinNetRank and 588th in Gravity. Mutation of the PIK3CA gene can lead to abnormal enhancement of the catalytic activity of PI3Ks and promote carcinogenesis in lung cancer [49]. It ranks 7th in our method but 22nd in MutSigCV and OncodriveFML, and 473rd in MinNetRank. On the other hand, KDR (kinase insert domain-containing receptor), ranked 24th, has been reported to play a critical role in cancer metastasis and is used as a molecular target in cancer therapy [50]. Co-cited with 'Cancer' 207 times and with 'Lung' 105 times, KDR is not yet deemed a driver gene in lung cancer and can be considered a potential driver. A similar analysis based on HumanNet is also available (Additional file 1: Table S1). We performed GAD and KEGG pathway enrichment analysis and found that these significant driver genes are enriched in small cell lung cancer, the PI3K-Akt signaling pathway and other terms that are significantly related to lung cancer (Additional file 1: Fig S3). The hallmarks of cancer are defined as a set of crucial functional abilities acquired by human cells as they move from normalcy to neoplastic growth states [51].
We linked these significant drivers to the hallmarks of cancer using the CancerGeneNet online database, which calculates the shortest paths between genes and phenotypes [38]. Half of the top 30 genes could be associated with hallmarks of cancer. KDR, one of the potential drivers mentioned above, is linked to "Angiogenesis", "Cell Death", "Differentiation", "DNA Repair", "Glycolysis", "Immortality", "Inflammation", "Metastasis" and "Proliferation" (Additional file 5). To assess the drug sensitivity of these significant drivers, we performed gene-drug analysis using the online database iGMDR, which shows that 73.3% of the significant genes are druggable (Additional file 6).

Results for breast invasive carcinoma
Breast cancer is the most commonly diagnosed cancer, with an estimated 2.3 million new cases, accounting for 11.7% of all cancer cases in 2020 [43]. We used 791 BRCA samples from the TCGA database to construct the hypergraph. Compared with the other methods, DriverRWH shows the best performance in terms of ROC curves with both STRINGv10 and HumanNet (Fig. 4 and Additional file 1: Fig S4). Although DriverRWH identified fewer known driver genes than MutSigCV in the top 20 candidates, it predicted more known driver genes in the top 50, 100, 150 and 200 candidates (Fig. 4B).
We evaluated the capacity of DriverRWH to identify potential breast cancer driver genes. As before, we selected 61 genes that are among the 200 top-ranked candidate genes predicted with both HumanNet and STRINGv10 but are not in the tumor-specific driver list, and conducted GAD and pathway enrichment analysis on them. Notably, 29 genes (44.6%) are enriched for "CANCER" (P-value = 1.67 × 10⁻⁴, FDR = 1.67 × 10⁻⁴) and 12 (18.5%) are enriched for "breast cancer" (P-value = 2.15 × 10⁻⁵, FDR = 0.0087). In terms of pathways, these genes are significantly enriched in "Breast cancer". The top 25 pathways are shown in Additional file 1: Fig S5.
The CoCiter scores of the top 30 candidate genes predicted by DriverRWH using the STRINGv10 network are shown in Table 2. In particular, 8 of the top 10 candidate genes are known driver genes, including the acknowledged driver gene TP53 (ranked 1st) and the most recurrently mutated gene PIK3CA (ranked 2nd). KMT2C, which has high CoCiter scores, ranked 8th in DriverRWH, whereas it was not identified by MutSigCV or DawnRank and ranked 2121st in Gravity. AKT1, which co-appears with "Cancer" 1863 times and with "Breast" 477 times, ranked 10th in DriverRWH but only 1226th in Gravity and 2233rd in OncodriveFML. ERBB2, which ranked 16th in DriverRWH, is confirmed to be related to breast cancer, but it ranked 35th in OncodriveFML, 126th in MutSigCV, and 1465th in Gravity [52]. In addition, DriverRWH can identify some genes that are highly related to breast cancer but were not recognized by the other six methods. For instance, EGFR is one of the first identified important targets of novel antitumor agents; it co-occurs with "Breast" 722 times, with "Cancer" 4091 times, and with "Driver" 94 times [53]. MTOR ranked 22nd, co-appearing 321 times with "Breast", 1896 times with "Cancer", and 21 times with "Driver". A similar analysis based on HumanNet is also available (Additional file 1: Table S2). We performed GAD and pathway enrichment analysis of the top 30 candidate driver genes. The identified genes are enriched in "breast cancer" in GAD. In the KEGG enrichment analysis, these genes are significantly enriched in "Breast cancer", "Proteoglycans in cancer", "Endometrial cancer" and other pathways associated with breast cancer (Additional file 1: Fig S5). 66.7% of the candidate driver genes could be linked to hallmarks of

Results for uterine corpus cancer
Uterine corpus cancer is the sixth most common type of cancer and the second most common gynecological malignancy in women, with more than 417,000 new cases and 97,000 deaths worldwide in 2020 [54]. We used 448 patients with 40,543 candidate genes from the TCGA database. With the same reference driver genes as the benchmark, DriverRWH outperforms the other six prioritization methods when assessed by the ROC curve and by the percentage of known driver genes among the top candidate genes (Fig. 5 and Additional file 1: Fig S6).
For the discovery of potential drivers, we selected 41 genes with the same criteria described earlier, of which 22 genes (51.2%) are associated with cancer (P-value = 1.37 × 10⁻⁴, FDR = 1.37 × 10⁻⁴). These genes are significantly enriched in the PI3K-Akt signaling pathway and the MAPK signaling pathway, both of which play an important role in cellular growth and survival and have been implicated in endometrial cancer pathogenesis (Additional file 1: Fig S7) [55].
Considering the top 30 candidate drivers, Table 3 shows the CoCiter scores between these candidate genes and the terms "Endometrial", "Cancer" and "Drivers". The apoptosis-suppressing gene MTOR, which co-appears with "Endometrial" 63 times and with "Cancer" 1896 times, ranked 19th in DriverRWH but 112th, 182nd and 1380th in DawnRank, MutSigCV and OncodriveFML, respectively. Notch1, which is tumor-suppressive in human endometrial cancer cells [56], ranked 11th in DriverRWH but 61st in MutSigCV, 94th in Subdyquency, 2630th in OncodriveFML and 7054th in Gravity. Moreover, PRKDC, which ranked 20th in DriverRWH, has been shown to be significantly associated with a high mutation load [57]. Recent research suggests that a high mutation load is a predictive biomarker of response to immune checkpoint inhibitors in uterine corpus cancer [58]. A similar analysis based on HumanNet is also available (Additional file 1: Table S3). We performed GAD and pathway enrichment analysis of these candidate genes (Additional file 1: Fig S7). In the GAD enrichment analysis, these genes are enriched in "endometrial cancer" and related terms. In the pathway enrichment analysis, they are significantly enriched in "Endometrial cancer". 70% of the top-ranked genes have a shortest path to cancer phenotypes in the CancerGeneNet database. PRKDC is linked with "Angiogenesis", "Cell death", "Differentiation", "DNA repair", "Glycolysis", "Immortality", "Metastasis" and "Proliferation" (Additional file 5). 83.3% of these candidate genes have related drugs in the iGMDR online database (Additional file 6).

The stability of the performance across 31 cancer types
Furthermore, we compared the performance of DriverRWH with 24 up-to-date driver gene prediction methods to assess the stability of DriverRWH across 31 cancer types. For DriverRWH and the six methods mentioned above, which provide ranks of candidate driver genes, the top 30 genes were selected as significant drivers [59]. For methods that generate P-values, an adjusted P-value < 0.05 was used as the threshold to call driver genes [60,61]. The details of the tools and the criteria for candidate driver genes are provided in Additional file 7. Figure 6 displays the proportion of predicted driver genes present in the reference driver set across 31 cancer types, with methods ordered by their median. For more than half of the 31 cancer types, approximately 50% (median fraction 53.3%) of the top 30 candidate genes ranked by DriverRWH were known driver genes, which is significantly better than the results of the other methods.
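The quantity summarized per method, the fraction of predicted drivers found in the reference set and its median across cancer types, can be sketched as follows. The function names, gene lists and cancer type labels are hypothetical.

```python
from statistics import median


def fraction_known(predicted, reference):
    """Fraction of predicted driver genes present in the reference set."""
    predicted = list(predicted)
    if not predicted:
        return 0.0
    return sum(1 for g in predicted if g in reference) / len(predicted)


def median_fraction(predictions_by_cancer, reference):
    """Median of the per-cancer-type fractions for one method."""
    return median(
        fraction_known(genes, reference)
        for genes in predictions_by_cancer.values()
    )


# Hypothetical predictions of one method for three cancer types.
toy_reference = {"A", "B", "C"}
toy_predictions = {
    "TYPE1": ["A", "B", "Z"],
    "TYPE2": ["A", "Q"],
    "TYPE3": ["C"],
}
toy_median = median_fraction(toy_predictions, toy_reference)
```

For a ranking method, `predicted` would be its top 30 genes; for a P-value method, it would be the genes passing the adjusted P-value threshold, so the list length can differ between methods.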

Robustness of DriverRWH
To test the robustness of DriverRWH, we applied our method to perturbed data in which the mutation data and network data were randomly perturbed (Fig. 7). For the mutation data, two types of perturbation were applied: (1) randomly selecting 50% and 10% of the samples and (2) randomly selecting 50% and 10% of the original mutation information in the somatic mutation matrix. Over 20 repeats, when only 50% or 10% of the samples, or 50% of the mutation information, was used, there was no significant decrease in the AUC scores or in the cumulative number of recovered driver genes. When only 10% of the mutation information was retained, there was a slight decrease. Notably, the performance for the top 20 candidates always remained at a high level. For the network data, two forms of perturbation were also applied: (1) randomly selecting 50% and 10% of the original network information and (2) using PPI data with 50% and 10% noise added. Again, there was only a minor decrease in the AUC scores and the cumulative number of recovered cancer genes. A similar conclusion was reached when the robustness analysis was based on HumanNet (Additional file 1: Fig S8). These results suggest that perturbation of the mutation data and the network did not seriously affect the results, indicating that DriverRWH is highly robust to the quality of the input data.
Fig. 6 The performance of 25 driver gene prediction methods. Distribution of the fraction of predicted candidate driver genes present in the reference driver set across 31 cancer types
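The two mutation-data perturbations can be sketched as below. This is an assumption-laden illustration: "mutation information" is interpreted as individual (sample, gene) entries of the mutation matrix, the subsample size is rounded to the nearest integer, and a seeded generator is used for reproducibility. The network noise procedure is not specified in enough detail in the text to sketch here.

```python
import random


def subsample_samples(profile, fraction, rng):
    """Keep a random fraction of the samples (hyperedges)."""
    k = max(1, round(fraction * len(profile)))
    kept = rng.sample(sorted(profile), k)
    return {s: set(profile[s]) for s in kept}


def subsample_mutations(profile, fraction, rng):
    """Keep a random fraction of the (sample, gene) mutation entries."""
    pairs = sorted((s, g) for s, genes in profile.items() for g in genes)
    k = max(1, round(fraction * len(pairs)))
    perturbed = {}
    for s, g in rng.sample(pairs, k):
        perturbed.setdefault(s, set()).add(g)
    return perturbed


# Hypothetical mutation profile with four samples and six mutation entries.
toy_data = {"S1": {"A", "B"}, "S2": {"A"}, "S3": {"B", "C"}, "S4": {"C"}}
rng = random.Random(0)
half_samples = subsample_samples(toy_data, 0.5, rng)
half_mutations = subsample_mutations(toy_data, 0.5, rng)
```

Repeating either perturbation (20 times in the text) and rerunning the full pipeline on each perturbed profile gives the distribution of AUC scores and top-k counts used to judge robustness.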

Discussion
In recent years, many methods have been developed to distinguish driver genes from passengers. Limited by the design of the simple network model, most of them are incapable of expressing many-to-many association relationships. The mutation profile is usually compressed into per-gene mutation frequencies, resulting in the loss of co-mutation information for individual samples. In this study, we propose a network-based method, DriverRWH, which effectively integrates mutation and PPI network data to predict cancer driver genes. The novelty of our method lies in the introduction of a weighted hypergraph model, which is constructed to simultaneously capture two classes of relations among the mutated genes of individual samples: (1) high-order relations are captured by storing the hundreds of mutated genes of each sample in one hyperedge; (2) pairwise relations are captured by an induced subnetwork of the PPI network, generated from the same mutated genes by preserving these genes and their interactions in the background network. Our model retains the complete co-mutation relations of the mutated genes in individual tumors together with their interactions in the PPI network, which adequately reflects their inherent characteristics and avoids the loss of information. Taking advantage of the hypergraph structure, we extended the typical random walk process on a simple graph to a probabilistic weighted random walk on a hypergraph.
Using a reference driver gene set as a benchmark, DriverRWH consistently outperformed the other six state-of-the-art prioritization methods in terms of the ROC analysis, the ranks of driver genes and the cumulative number of known driver genes recovered among the top-ranked candidate genes. Moreover, DriverRWH can also discover new potential driver genes that are co-cited in the cancer-associated literature, and its high-ranking genes are enriched in significant cancer pathways. Finally, taking the top 30 genes as the predicted candidate driver genes allows DriverRWH to be compared with non-ranking methods. The results show that DriverRWH achieves higher performance than four prioritization methods and 19 other non-ranking methods across 31 cancer types. Despite these encouraging results, the current model has several limitations. First, for TCGA data, tumor heterogeneity may increase the data bias, and future work should aim to reduce false-positive discoveries by using single-cell genomics data. Second, DriverRWH relies on a broad-context molecular network that is still incomplete at present, so refined gene functional networks could improve the performance of our method in the near future. A cancer-specific network might better represent the natural interactions of genes in cancer and potentially provide a more reliable network. Third, our method focuses on general driver gene detection and does not aim to offer personalized diagnosis, which would be more useful in real applications. In the future, we plan to extend our method to discover drivers in a personalized manner.

Conclusions
Recently, many computational methods and tools have been proposed to identify driver genes. However, the long-tail distribution of gene mutation frequencies in cancer genomes remains a major concern. There are many widely accepted methods based on mutation frequencies, but they fail to comprehensively consider the co-mutation information of individuals. Because a hypergraph has the unique advantage of retaining complete co-occurrence information, we introduced hypergraph theory into driver gene prediction, thus compensating for the loss of co-mutation information in existing methods. For each hyperedge, the degrees of the vertices in the corresponding subnetwork of the PPI network were used to build the weighted hypergraph, through which we integrated the mutation data and the PPI data. Subsequently, motivated by the PageRank algorithm, we implemented a random walk with restart on the hypergraph and proposed a novel approach, DriverRWH, to prioritize mutated genes. As demonstrated in this paper, DriverRWH not only outperforms existing methods in the identification of known driver genes but is also capable of discovering potential driver genes. Furthermore, the model behaves robustly under perturbation of the mutation data and network data. Our results show that DriverRWH can be a useful tool for prioritizing driver genes. The source code of DriverRWH is freely available at https://github.com/ShandongUniversityZhanglab/DriverRWH.