Identification of driver genes based on gene mutational effects and network centrality

Background As one of the deadliest diseases in the world, cancer is driven by a few somatic mutations that disrupt the normal growth of cells, and leads to abnormal proliferation and tumor development. The vast majority of somatic mutations did not affect the occurrence and development of cancer; thus, identifying the mutations responsible for tumor occurrence and development is one of the main targets of current cancer treatments. Results To effectively identify driver genes, we adopted a semi-local centrality measure and gene mutation effect function to assess the effect of gene mutations on changes in gene expression patterns. Firstly, we calculated the mutation score for each gene. Secondly, we identified differentially expressed genes (DEGs) in the cohort by comparing the expression profiles of tumor samples and normal samples, and then constructed a local network for each mutation gene using DEGs and mutant genes according to the protein–protein interaction network. Finally, we calculated the score of each mutant gene according to the objective function. The top-ranking mutant genes were selected as driver genes. We name the proposed method as mutations effect and network centrality. Conclusions Four types of cancer data in The Cancer Genome Atlas were tested. The experimental data proved that our method was superior to the existing network-centric method, as it was able to quickly and easily identify driver genes and rare driver factors.

than 10 algorithms (such as CHASM, SIFT, and MutationAssessor); obtains 96 features in structure, evolution, and genes; and builds an algorithm based on machine learning prediction-driven missense mutations [31]. The FATHMM algorithm integrates homologous sequences and conserved protein domain information and uses a hidden Markov model-based algorithm to distinguish cancer-related amino acid mutations among passenger mutations [32,33]. The DriverML algorithm proposed by Han et al. used statistical methods to quantify the scores of different mutation types on protein function and then combined them with machine learning algorithms to identify cancer driver genes [34]. However, the method of training prediction models using machine learning has some shortcomings. For example, in predicting driver mutations, it is difficult to obtain high-quality positive and negative sample datasets, which is a significant challenge for machine learning-based algorithms.
The development of network analysis science, such as in the fields of complex systems, social networks, communication networks, and transportation networks, has inspired many bioinformatics researchers to use network analysis methods to study the functional mechanism of molecular systems. Pathway-and network-based methods can easily simplify biological entities and their interactions into nodes and edges, allowing the systematic study of the nature of complex diseases [35] and the diagnosis, prevention, and treatment of cancer. Moreover, network-and pathway-based strategies have become one of the most promising approaches for identifying driver mutations, and some researchers have found that genes work together to form biological networks, which can be used to identify driver genes. MEMO [36] relies on the predictive pathway or the mutual exclusion of driving mutations in the sub-net to try to find a small sub-net of genes belonging to the same pathway. PARADIGM-Shift [37] uses pathway-level information and other features to infer the dysfunction of mutations. Researchers have also attempted to use protein-protein interaction network (PPI) data to integrate different omics data. For example, HotNet2 [38] combined with PPI used hotspot diffusion to find the small sub-networks of frequent mutations. However, the authors tried to identify a cancer-driving module composed of many genes, rather than genes that are crucial for cancer development. A recently published method, DriverNet [39], identifies a simple set of mutated genes associated with genes that experience mRNA expression disorders in a PPI network. OncolMPACT [40] prioritizes mutated genes based on linkages to dysregulated genes in cancer using matched expression data. The VarWalker algorithm, through sample-specific gene screening, constructs a sample-specific network, and integrates and recognizes driver genes [41]. The DawnRank algorithm analyzes the effect of a mutant gene on its downstream genes in a molecular interaction network, and used the PageRank algorithm sequences the genes of a single sample, finally resulting in the identification of driver genes [3]. The DEOD algorithm integrates genomic mutation data, expression data, and PPI network data; constructs a directed weighted graph based on the method of partial covariance selection; and identifies driver genes that have a significant effect on the target gene [42]. MUFFINN [43] considers mutations in neighboring genes in a network in two different ways, either consider mutations in the most frequently mutated neighbor (DNmax) or to consider mutations in all direct neighbors with normalization by their degree connectivity (DNsum) showing good predictive performance in large candidate sets.
In recent years, researchers have also attempted to identify driver genes from the perspective of individual networks. For example, the SSN algorithm is based on individual network identification of driver genes, which uses the Pearson Correlation Coefficient (PCC) of sample expression data to construct individual networks and then, through statistical analysis, determine cancer driver genes or modules [44]. The HIT'n DRIVE algorithm integrates each patient's individual genomic mutation data and expression data to construct a network and identify the driver genes and modules that affect transcriptional changes based on the expected value of the shortest random walk length in the network [45]. From the perspective of individuals, Guo et al. successively proposed the SCS [46] and PNC [47] algorithms. The SCS algorithm integrates mutation data, expression data, and molecular network data of each patient sample, and uses the network control method to evaluate the individual genes. Driver genes are then identified based on the effect of gene mutations on gene expression [46]. The PNC algorithm uses paired samples to construct individual networks, and then uses structure-based network control principles to identify individual driver genes [47]. The PRODIGY algorithm proposed by Dinstag et al. integrates individual mutation and expression data with pathways and PPI network data, uses reward collection Steiner tree models to quantify the regulatory effects of mutant genes on pathways and recognize driver genes [48]. However, owing to incomplete data in gene interaction networks, the false positive rate of these existing methods is still very high; therefore, further improvement is needed, which brings challenges to network-based prediction methods.
To overcome false positives and improve prediction accuracy, in this study, we introduced semi-local centrality and considered mutational information between genes to identify mutant genes in tumors. Unlike DriverNet, we considered the structure of the genes in the network. The introduction of network centrality can lead to the identification of genes at key locations in the network. These genes may be driven by genes or regulatory genes. MUFFINN considers the direct neighbor information of mutated genes in the network, but ignores the information of the secondary neighbor. Based on this, our method considered not only the nearest and the next-nearest neighbors of node but also the interaction between mutant gene nodes. We processed the cancer coding region mutation data from TCGA into a gene-patient mutation matrix as well as calculated the gene mutation score and the Euclidean distance between two genes according to the matrix. Increasing evidence shows that miRNAs are widely involved in the occurrence of cancer [49,50]; therefore, we also performed gene expression analysis to obtain differentially expressed genes. Moreover, functional studies have suggested that driver mutations alter the expression of its downstream genes in the molecular interaction network [51]; therefore, we integrated differentially expressed genes and mutated genes into the PPI network and calculated the effect of the mutated genes based on the obtained local network. Experiment on TCGA datasets verified that our proposed mutations effect and network centrality (MENC) method was superior to the existing methods based on frequency and network centrality.

Results
Most existing network methods for identifying driver genes are based on global networks. These global networks increase computational complexity. In addition, the accuracy of these methods needs to be improved. Our method employed a novel scheme: we first calculated the effect of the mutation, and then identified a local network for each mutated gene. We used the objective function to calculate the effect of mutated genes in the local network and sort the mutated genes according to the score to determine the driver genes. The top-ranking genes were more likely to become driver genes, which are more interesting to researchers and can even advance to further biological experiments for verification. Therefore, in the comparison analysis, we only used the top 50 candidate genes. To show the advantages of our model, we analyzed four large-scale publicly available datasets, including glioblastoma (GBM), bladder cancer (BLCA), prostate cancer (PRAD), and ovarian cancer (OVARIAN). The experimental results showed that our method was better than not only the network-centric method but also other types of methods. More importantly, our method was also able to recognize rare driver genes.

Datasets and resources
In this study, we mainly used two types of data: coding region mutation data and gene expression data. In particular, the coding region mutation data included copy number variations (CNVs) and SNVs. These data were obtained from 328 GBM samples, 379 BLCA samples, 252 PRAD samples, and 316 OVARIAN samples, and downloaded from the TCGA data portal (https:// tcga-data. nci. nih. gov/ tcga/). We used only samples that included both of them. The PPI network we used was downloaded from the Human Protein Reference Database (HPRD) [52,53], which consists of 9617 genes and 74,078 edges. Table 1 shows the sample counts in the four cancers mapped on the PPI network mutated gene numbers and outlying gene numbers.
In the absence of basic facts, quantitative measurements using standard sensitivity/ specific benchmarking techniques are impractical. To help assess the quality of our results, we obtained a list of 616 known drivers from the Cancer Gene Census (CGC) database (09/26/2016) [54].

Comparison with network-centric approaches
To evaluate the method's ability to identify known driver genes, we compared our method with network centricity-based methods. As mentioned above, we used the CGC as an approximate benchmark for known driver genes. For comparison, we used the following three metrics (precision and recall rates and F1score) in this study: We compared our method with two main network-centrality-based methods, SCS [46] and MUFFINN [43]. MUFFINN considers mutational information among direct neighbors, either in the most frequently mutated neighbor (DNmax) or in all direct neighbors with normalization by their degree of connectivity (DNsum). The results are shown in Fig. 1. Here, we only show the results for two types of cancer (GBM and OVARIAN). As shown in the figure, our method performed better than SCS and MUFFINN. For GBM cancer, our method was not as effective as SCS in identifying the first 15 candidate driving genes, but our method showed a great improvement in the latter. MENC was significantly superior to the other methods for the other three cancers. The number of CGCs covered among the top 50 genes identified was 27 genes with our method, 24 with SCS, 12 with DNsum, and 21 with DNmax. Our method achieved the best results for the BLCA and PRAD cancer data.
For OVARIAN cancer, the top 50 genes analyzed by our method included 27 in the CGC database, while SCS had 17, DNsum had 10, and DNmax had 17. It can also be seen that the SCS method exhibits a large downward trend. The accuracy of the top 10 genes was 0.8, and the accuracy was reduced to below 0.4 in the top 30. Our method is relatively stable, and there is no significant decline. The results indicated that our method yielded reliable results for identifying driver genes. (1)

Comparison with other approaches
Because our method not only considers the characteristics of the network but also calculates the mutation scores and interaction of the genes, we also compared MENC with DriverNet [39], a frequency-based method, and OncolMPACT [40]. As shown in Fig. 2, in general, relative to CGC, our approach was superior to DriverNet, Frequency, and OncolMPACT in analyzing all cancer datasets. Although only the results of BLCA and PRAD cancers are shown here, the same good results were obtained for other cancer data, which are not shown here.

Novel and reliable driver genes found using our method
In addition to identifying frequently mutated driver genes, MENC can identify important rare driver genes. According to DawnRank's [3] definition of novel and important driver genes, genes meeting the following requirements are rare genes: (1) the ranking of the driver gene is based on patient population; (2) frequency of the mutation is less than 2% of the patient population in the mutation data; (3) the gene has not been identified as a driver gene by CGC.
In OVARIAN, 316 samples were analyzed. Using our method, nine rare driving factors were identified as the top 20 genes according to the above definition, seven of which were in included in CGC (see Table 2). Although some rare driver genes such as EGFR, EP300, and CREBBP have been found in DNMax and DNSum, they rank higher in our method. In addition, SRC (1.58% of cases) is usually associated with disease and may lead to the development of human malignancies [55]. FYN (0.95% of cases) and PRKCA (1.58% of cases) have not been listed as driving genes by CGC, but studies have found that they are associated with many cancers and overexpressed in cancer patients [56,57]. Fig. 2 The comparison of precision, recall and F1score for top ranking genes of MENC and other methods. The X axis represents the number of top ranking genes and the Y axis represents the score of the precision, recall and F1score respectively. The last row is the result of BLCA and the next row is the result of PRAD In BLCA, we identified 18 rare genes among 22 candidate driver genes (see Table 3), 12 of which were in CGC. For example, MENC recognized AKT1 (0.53% of cases) as a serine/threonine protein kinase, and its downstream proteins have been reported to be frequently activated in human cancers [58]. Most of the highest-ranked genes in BLCA are low-frequency mutant genes.
Considering that the identification of cancer driver genes is required for cancer treatment, we used the drug-genes interaction database (DGIdb) [59] and TARGET database [60] to determine whether our candidate driver genes are clinically relevant genes. The results are shown in Fig. 3. In all four cancer datasets, 80% or more candidate driver genes were identified as actionable targets. Approximately 40% of the genes were druggable. There is a partial intersection between the candidate genes and druggable genes. The union of the actionable and druggable genes in the four cancers

Enrichment analysis
To test the biological function of the MENC-predicted candidate drivers, we used the DAVID tool (v6.8) for KEGG pathway and GO function enrichment analyses.
For OVARIAN, the important candidates were mainly enriched in pathways in cancer, viral carcinogenesis, proteoglycans in cancer, prostate cancer, and pancreatic cancer. They were also involved in biological process such as positive regulation of transcription from RNA polymerase II promoter and signal transduction. Regarding cellular components, the identified candidates were enriched in the nucleus, nucleoplasm, cytosol, cytoplasm, and plasma membrane. Furthermore, with regards to important molecular functions, the candidate drivers were enriched in identical protein binding, DNA binding, and transcription factor binding.
In BLCA, KEGG analysis showed that the candidate genes were enriched in pathways in cancer, chemokine signaling pathway, and PI3K-Akt signaling pathway. GO analysis revealed that the candidate genes were enriched in signal transduction, positive regulation of transcription, and DNA template. As for cellular components, the candidates were enriched in the cytoplasm and nucleus. In terms of molecular functions, the candidates were enriched in protein binding, enzyme binding, and transcription factor activity.
In GBM, the candidates were enriched in pathways in cancer, viral carcinogenesis, and hepatitis B. In terms of biological processes, the candidate drivers were enriched in signal transduction, viral processes, and protein phosphorylation. With respect to cellular components, the candidates were enriched in the nucleus, plasma members, cytoplasm, and nucleoplasm. As for molecular functions, the candidates were enriched in enzyme binding, transcription factor activity, and sequence-specific DNA binding.
In PRAD, the enriched KEGG pathways were proteoglycans in cancer, thyroid hormone signaling pathway, and microRNAs in cancer. The enriched GO functions were negative regulation of the apoptotic process and protein phosphorylation. As for cellular components, the candidates were enriched in the cytosol, nucleus, and plasma membrane. In terms of molecular functions, the candidate drivers were enriched in protein binding, ATP binding, transcription factor binding, and kinase activity.

Discussion and conclusions
In this study, we proposed the MENC method for identification of driver genes. Our approach not only considered mutation frequency in patients but also integrated mutation and gene expression data into a gene-gene interaction network. We considered the nearest and next-nearest nodes from the source node when calculating the network centrality. When tested on the GBM and OVARIAN datasets, our method performed significantly better than the network-based SCS and MUFFIN methods. In addition, our method was superior to other methods such as DriverNet in analyzing the PRAD and BLCA datasets. Our method even identified rare driver genes.
Nevertheless, our approach had some limitations. For example, in clinical practice, precision medicine and personalized medicine are important for the diagnosis and treatment of patients. However, using the proposed method, we could not diagnose driver genes in the individual. In the future, we will propose a new approach to identify patientspecific and rare driver genes based on individual mutations and gene expression profiles in tumors.

Overview of the MENC approach
We proposed a new method that combined mutation and expression data into a PPI network, and adopted a combination of semi-local centrality and mutation effect function to identify the driver genes of cancer. The method consisted of three main steps. First, we integrated SNV and CNV data to obtain a mutation matrix, and calculated the gene mutation score (Eq. 2) and the Euclidean distance (Eq. 3) between two genes according to the matrix. Next, the mutation effect function between genes was calculated according to Eq. 4. In the second step, we compared the expression profiles of tumor samples with those of normal samples to identify DEGs. We subsequently constructed a semilocal network for each mutation gene using DEGs and mutation genes according to the PPI network. The third step was to calculate the local centrality and mutation effect of the mutated genes according to the target function (Eq. 5). The top-ranking genes were regarded as candidate driver genes. Our method considered the nearest and next-nearest nodes when calculating the local centrality. Compared with global centrality measures (e.g., betweenness centrality and closeness centrality), our local centrality measure had a much lower computational complexity. We also added the mutational effect function, as to not ignore some genes that have a low degree but may have a much higher influence than high-degree genes [61]. A flowchart of the method is shown in Fig. 4.

Calculation of gene mutation score and distance between genes
The downloaded TCGA coding region mutation data were summarized in a binary gene-patient matrix M, in which the rows represent the genes, and the columns represent the cancer samples (patients). For gene i, if the patient has SNVs or CNVs, M(i, j) = 1; otherwise, M(i, j) = 0. We used the MaxMIF [62] method to calculate the mutation score (Eq. 2). Based on the obtained gene-patient matrix, we calculated the mutation score of the gene. The mutation score M(i) for each gene i accounts for the contribution of its mutation to cancer, defined as follows: where K i is the set of patients with mutations in gene i. N k is the total number of mutated genes in sample k. N max is the maximum number of mutated genes in all samples. If gene i has no mutation in all samples, that is, K i is empty, then M(i) is assigned a background mutation score (BMS) that is no greater than any mutant gene.
We then calculated the Euclidean distance between two genes according to the distance formula (Eq. 3), where vector X, Y is the row vector of each gene in the genepatient matrix, and x i , y i is an element in the row vector. In this study, we also tried other distance formulas, such as Jaccard and Manhattan, and brought the distance obtained by each distance formula into the final objective function. We found that the obtained driver genes were the same; therefore, we chose the Euclidean distance in the experiment. (2)

Mutations effect function between genes
Reference MaxMIF measures the effect of interaction between two mutant genes on biological functions. In this experiment, we also used mutation impact function (MIF) values to calculate the effect of mutation between two genes. The value is driven by the gravity principle [63].
Here, M(i) and M(j) are the mutation scores of gene i and j, respectively. r ij is the reciprocal of the Euclidean distance between gene i and gene j. Euclidean distance measures the similarity of two vectors (the similarity of two genes on the patient set). Two genes with high mutation scores and high similarity had high MIF values.

Identification of DEGs and construction of local network
In this study, expression data were processed the same way as SCS data. To indicate the DEGs of each patient, we first calculated the log2 fold-change in gene expression between the paired tumor and normal samples. Genes with an absolute value greater than 1 were considered as DEGs. We then collected the DEGs from each patient to obtain the DEGs of the cohort. All patient mutation genes were selected from the mutation matrix. In addition, we downloaded the PPI network as an interaction graph between the mutated genes and DEGs. If there are edges of mutant genes and DEGs in the network, the two genes are connected to the semi-local network. We built a semi-local network where mutated genes were considered the source node and DEGs were the target nodes. Moreover, we only considered the role of the mutant in two steps, which reduced the computational complexity. After preprocessing the data, the next step was performed.

Calculation of driver gene scores
Unlike some existing network-based methods, we constructed a new semi-local intersection network for each mutated gene by merging mutant genes, DEGs, and HPRD networks. Referring to the metric of the network local centrality measure C L (v) in [61], C L (v) calculates the number of neighbors of node v and the neighbors of the neighbors. We have made corresponding improvements to this formula: when counting the number of neighbors of a node gene, we performed different calculations for the neighbors of the node that were mutations and DEGs. If the neighbor of the node was a mutated gene, we used the MIF between the genes multiplied by the degree of the node, and if the neighbor was the DEGs, only the degree of the node was calculated. See formula (5): where N(v)/N(w) represents the set of neighbors of node v/w. We calculated the local centrality of the mutated gene. For mutation i, if the mutated gene u/w was ligated, we also considered the mutation effect between them as a weight, calculated by c(u)/c(w). Therefore, we can identify drivers that are important in the network and have a strong effect on other genes. If the neighbor u/v is a DEG, calculated by b(u)/b(w), which only considered the centrality of the network. Our main idea was to accord the function as the effect score in a local network. The higher the score, the greater the effect of the mutated gene on the DEGs in the local network. The presence of genes is both a mutation and a differential expression. Therefore, these genes may be more important. Therefore, when a gene is differentially expressed, it acts as a target node. However, when mutated, it acts as a source node. The score for this type of gene increased. Using this model, we obtained a score for each mutant gene. Then, according to the scores, we ranked the mutation genes to identify influential genes. We assumed that the higher the ranking, the more likely it was to be a driver gene.

Authors' contributions
YYT carried out the experiments and analyses presented in this work and wrote the manuscript. PJW performed data analysis. JX and CHZ helped with the project design, edited the manuscript, and provided guidance and feedback throughout the project. RFC and JZ supervised YYT and PJW in collecting the data and participated in the discussion of experimental results with all authors. All authors read and approved the final manuscript.

Funding
The publication costs for this study were funded by the National Natural Science Foundation of China (Nos. U19A2064, 61873001, 61872220, and 61861146002). This study was supported by the Open Foundation of Engineering Research Center of Big Data Application in Private Health Medicine, Fujian Province University (KF2020006), and the Xinjiang Autonomous Region University Research Program (XJEDU2019Y002). The funding bodies played no role in the design of the study; collection, analysis, and interpretation of data; and writing of the manuscript.

Availability of data and materials
All datasets analyzed in the current study were downloaded from the TCGA data portal (https:// tcga-data. nci. nih. gov/ tcga/). The evaluation data set used was from the CGC gene list of the COSMIC database (https:// cancer. sanger. ac. uk/ cosmic), version number (09/26/2016).