Weighted minimum feedback vertex sets and implementation in human cancer genes detection

Background Recently, many computational methods have been proposed to predict cancer genes. One typical kind of method is to find the differentially expressed genes between tumour and normal samples. However, there are also some genes, for example, ‘dark’ genes, that play important roles at the network level but are difficult to find by traditional differential gene expression analysis. In addition, network controllability methods, such as the minimum feedback vertex set (MFVS) method, have been used frequently in cancer gene prediction. However, the weights of vertices (or genes) are ignored in the traditional MFVS methods, leading to difficulty in finding the optimal solution because of the existence of many possible MFVSs. Results Here, we introduce a novel method, called weighted MFVS (WMFVS), which integrates the gene differential expression value with MFVS to select the maximum-weighted MFVS from all possible MFVSs in a protein interaction network. Our experimental results show that WMFVS achieves better performance than using traditional bio-data or network-data analyses alone. Conclusion This method balances the advantages of differential gene expression analyses and network analyses, improves upon the low accuracy of differential gene expression analyses and decreases the instability of pure network analyses. Furthermore, WMFVS can be easily applied to various kinds of networks, providing a useful framework for data analysis and prediction. Supplementary Information The online version contains supplementary material available at 10.1186/s12859-021-04062-2.


Data sets
In this study, we used the directed human protein interaction network [20] for the analyses; it contains 6338 genes (vertices) and 34814 directed interactions (arcs). To evaluate the relative prediction accuracies for cancer genes between our methods and existing methods, we collected cancer-related gene sets from five public databases: ONGene [21], TSGene [22], CGC [23], NCG [24] and MSigDB C6 [25]. Since not all genes from these data sets are contained in the directed human protein interaction network, we kept only the genes common to both a given data set and the network. The sizes of these data sets are shown in Table 1.
In the rest of this paper, when we calculate the recall of various methods, we consider only the size of the common gene sets.

Weight definition
To define the weights of genes, we first downloaded the RNA-seq data from TCGA [26], which contains gene expression data from 1102 breast tumour samples and 113 normal samples. Next, the counts of level 3 RNASeqV2 data were processed and transformed before being used for further analysis [27]. Specifically, we used the fold change (FC) value (with the binary logarithm and absolute value) between tumour and normal samples as the weight of each vertex (gene). For a specific gene v, its weight is calculated by the following formula:

w_v = | log2( (Σ_i T_i / n) / (Σ_j N_j / m) ) |

where T_i is the expression value of tumour sample i, N_j is the expression value of a normal sample j, and n and m are the numbers of tumour and normal samples, respectively. Intuitively, a high FC value corresponds to a high possibility of a cancer gene. Thus, it is reasonable to use the FC values as the weights of genes.
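As a minimal sketch of this weight computation (the function name and sample values are illustrative; the real pipeline processes TCGA level 3 RNASeqV2 counts), the per-gene weight amounts to:

```python
import math

def fc_weight(tumour, normal):
    """Absolute binary-log fold change between the mean tumour and
    mean normal expression values, used as the vertex (gene) weight."""
    mean_t = sum(tumour) / len(tumour)   # mean over the n tumour samples
    mean_n = sum(normal) / len(normal)   # mean over the m normal samples
    return abs(math.log2(mean_t / mean_n))

# A gene expressed 4x higher in tumour samples gets weight log2(4) = 2
print(fc_weight([8.0, 8.0, 8.0], [2.0, 2.0]))  # → 2.0
```

Because of the absolute value, a gene expressed 4x *lower* in tumour samples receives the same weight, so both up- and down-regulated genes are favoured.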
For the genes that appear in the network but have no expression values in the TCGA data (only 143 genes, 2.3% of the network size; we call these weight-loss genes), we gave them default weights of 0 rather than ignoring them; thus, if such a gene is essential at the topological level, it still has the potential to be selected as a cancer gene, which may counteract the disadvantage of traditional differential expression-based methods in revealing dark genes and handling missing data. Finally, all 6338 genes in the graph were weighted. The topological structure of the graph remained the same as in the original protein interaction network.

Experiments and evaluation
The whole experimental process is shown in Fig. 1. First, we analysed the directed human protein interaction network with the traditional MFVS method and obtained a set of 463 vertices. Then, we applied our WMFVS method to the same network (with weights derived from the FC values). We also used the inverses of the weights as penalty values and applied them to our WFVS method.
Because the MFVS of a graph is not unique, evaluating only a single MFVS result would not be representative. Therefore, we calculated a set of random MFVSs by applying the WMFVS method with randomly shuffled gene weights. We initially planned to compute 1000 random MFVSs for analysis. However, since the Gurobi optimizer (version 8.1.0) does not always output a truly optimal solution (e.g., even when we restrict the size of the output to be exactly 463, the size of the MFVS, the output is sometimes smaller than 463), we filtered out the obviously incorrect results and verified all the remaining outputs as MFVSs. Finally, we obtained 875 approved random MFVSs (since some MFVSs may be lost in the ignore_w operation and the MFVSs are not distributed uniformly, not all possible MFVSs have the same probability of being selected).
The WMFVS and WFVS result data can be found in the supplementary data. The random MFVS data are available at https://github.com/lrming1993/WMFVS_codes.
To evaluate the results of these three methods, we first checked the graph-level results (see Table 2).
The run time reported for MFVS corresponds to the traditional non-weighted MFVS method. The total weight reported for MFVS is the average value over the 875 randomly weighted MFVSs.
As we expected, the WMFVS method obtained a better total weight than the traditional MFVS. However, the result of WMFVS is not always better than that of MFVS. The total weight of the output of the traditional MFVS method is random (the output is related to the graph structure but has no relevance to the vertex weights), so it is possible for MFVS to output a highly weighted vertex set, even higher than the weight of the calculated WMFVS (Gurobi may not always give a real optimal result because of its numerical instability). However, our WMFVS method clearly has better stability.

Fig. 1 The experiment flowchart. The red, blue and green lines correspond to the WMFVS, WFVS and random MFVS pipelines, respectively
The WFVS method returned an FVS with 528 vertices, which is approximately 14% larger than the size of the MFVS. The selected WFVS has a better average weight than both the MFVS and WMFVS. This result is consistent with our purpose for WFVSs, which focuses on the total weight rather than the size of the FVS.
Then, we used the five prepared cancer-related gene data sets to evaluate the results of these three methods. We verified the recall of the three FVS methods in the five data sets. The results are shown in Table 3 and Fig. 2.
We can see that WMFVS and WFVS have better recall than traditional MFVS in all five sets, which is a benefit of the well-defined gene weights (especially for WFVS). Furthermore, we calculated the p-values of WMFVS and WFVS for 875 random MFVSs (Table 4).
For a certain data set, denote the recall of WMFVS by R_WMFVS. The recalls of all random MFVSs compose a set R_random. Then the p-value of WMFVS is calculated by the following formula:

p_WMFVS = |{R ∈ R_random : R ≥ R_WMFVS}| / |R_random|

The calculation of the p-value of WFVS is the same as above.
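A one-sided empirical p-value of this kind can be sketched as follows (function name and sample recalls are illustrative):

```python
def empirical_p_value(observed, random_values):
    """Fraction of random MFVS recalls that are at least as large as
    the observed recall (a one-sided empirical p-value)."""
    hits = sum(1 for r in random_values if r >= observed)
    return hits / len(random_values)

# One of four shuffled-weight recalls matches or beats the observed recall
print(empirical_p_value(0.30, [0.25, 0.28, 0.31, 0.26]))  # → 0.25
```

A small p-value therefore means that very few randomly weighted MFVSs recall as many known cancer genes as the WMFVS (or WFVS) does.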
Next, as control methods, we considered several other approaches to cancer gene prediction:

(1) Randomly select 463 genes (repeat the selection 100 times and take the average performance).
(2) Select the 463 highest-weighted genes, which is a traditional differential expression-based method.
(3) Select the set of genes that appear in at least 49.5% of the random MFVSs (we used 49.5% because it yields exactly 463 genes).
Method (2) uses only the weights for classification (i.e., a pure differential expression analysis method), while method (3) uses only graph-theoretic results (i.e., a pure network analysis method). Method (3) selects the genes that appear most frequently across the random MFVSs; intuitively, these genes should have great significance in the graph topology. The recalls and precisions of all these methods are listed in Table 5; see also Fig. 3.

Performance and enrichment score
In ONGene, TSGene and MSigDB, both WMFVS and WFVS have good p-values, but for CGC and NCG, the p-values are relatively high. One major reason is the degree of correlation between the classification metric of each data set and the defined gene weights. To analyse this correlation, we utilized the enrichment score (ES) from GSEA [25], which reflects the degree to which a set S is overrepresented at the extremes (top or bottom) of an entire ranked list. First, we sorted all the genes from the network by weight from high to low. Then, for a certain cancer gene set S, we traversed the sorted gene list, increasing a running-sum statistic when we encountered a gene in S and decreasing it when we encountered a gene not in S. We modified the increment and decrement values to ensure that the running sum was 0 at the end of the gene list. The enrichment scores of the five data sets are shown in Fig. 4. It is easy to see that the ONGene, TSGene and MSigDB data sets are significantly enriched at the tops of the lists. Although NCG seems enriched at the top, its ES is relatively low; the ES of CGC is even lower than that of NCG. The best-enriched data set is MSigDB. Since this data set was constructed directly from microarray gene expression data from cancer gene perturbations, it is closely related to differential expression values. The ES values explain the different performances of WMFVS and WFVS on different data sets.

Table 5 and Fig. 3 show that, except in MSigDB, WMFVS has the best precision and WFVS has the best recall. In MSigDB, cancer genes are closely related to the differential expression values of genes in breast cancer, leading to a precision of 87.5% for the simple weight-based method (i.e., method (2)). In this case, integrating the network structure may decrease the precision. However, in most cases, it is hard to find such a closely related metric for classification.
We can observe that in other data sets, method (2) performs worse than the other methods. The results support the effectiveness of our WMFVS and WFVS methods.
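The running-sum construction described above can be sketched as follows (a simplified, unweighted version of the GSEA statistic; gene names are illustrative, and the increments are scaled so the curve ends at exactly 0):

```python
def enrichment_curve(ranked_genes, gene_set):
    """Running sum over a weight-ranked gene list: step up on members
    of gene_set, down otherwise, with steps scaled to end at 0."""
    members = set(gene_set)
    n_hit = sum(1 for g in ranked_genes if g in members)
    n_miss = len(ranked_genes) - n_hit
    step_up, step_down = 1.0 / n_hit, 1.0 / n_miss
    total, curve = 0.0, []
    for g in ranked_genes:
        total += step_up if g in members else -step_down
        curve.append(total)
    return curve  # max(curve) plays the role of the enrichment score

# Both set members sit at the top of the ranking, so the curve peaks early
print(enrichment_curve(["g1", "g2", "g3", "g4"], {"g1", "g2"}))
```

A set enriched at the top of the weight ranking produces an early, high peak, while a set spread uniformly through the list stays near 0.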

Dark genes
As mentioned previously, traditional differential expression-based methods are not able to find graph-level important genes that have low differential expression values, i.e., dark genes. In our research, we defined a dark gene as a gene that has a relatively low weight (i.e., a low differential expression value) but is recorded as a cancer gene in the cancer gene database(s). Specifically, we first derived the differentially expressed genes (DEGs) by using the criteria |log2 FC| ≥ 1 and adjusted p-value ≤ 0.05 on the TCGA breast cancer RNA-seq data, where FC is the fold change value of a certain gene. Based on these criteria, we found 4245 DEGs (called the DEG set). Next, we curated the dark gene set from each cancer gene data set by excluding these DEGs.
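Using the two criteria above, the DEG/dark-gene split can be sketched like this (gene symbols and values are made up; the real analysis uses the TCGA breast cancer data):

```python
def split_dark_genes(cancer_genes, log2_fc, adj_p):
    """Partition a cancer gene set into DEGs (|log2 FC| >= 1 and
    adjusted p-value <= 0.05) and 'dark' genes failing those criteria.
    log2_fc and adj_p are dicts keyed by gene symbol."""
    degs, dark = set(), set()
    for gene in cancer_genes:
        if abs(log2_fc.get(gene, 0.0)) >= 1 and adj_p.get(gene, 1.0) <= 0.05:
            degs.add(gene)
        else:
            dark.add(gene)
    return degs, dark

degs, dark = split_dark_genes(
    {"GENE_A", "GENE_B"},
    {"GENE_A": 2.3, "GENE_B": 0.2},    # log2 fold changes (illustrative)
    {"GENE_A": 0.001, "GENE_B": 0.6},  # adjusted p-values (illustrative)
)
print(degs, dark)  # GENE_A passes the DEG criteria; GENE_B is dark
```

Note the defaults: a gene absent from the expression tables (a weight-loss gene) fails both criteria and therefore falls into the dark set, matching how such genes are treated in the paper.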
In our experiments, we further selected the top 463 highest-weighted genes (i.e., the most differentially expressed genes; called the top-463 DEG set) to avoid an unbalanced gene number in comparison with the WMFVSs and WFVSs identified by the WMFVS and WFVS methods, respectively. For each of the cancer gene data sets, the precisions of the all-DEG set, the top-463 DEG set, WMFVS and WFVS are shown in Fig. 5. Figure 5 shows that our WMFVS and WFVS methods display better precision than the traditional DEG-based methods (i.e., the all-DEG set and the top-463 DEG set) in four of the five cancer gene data sets. Moreover, approximately 60-70% of the genes are dark genes, which were detected by our WMFVS and WFVS methods but ignored by the traditional DEG method. Even for the MSigDB C6 data set, which was generated directly from microarray data or from internal unpublished profiling experiments involving the perturbation of known cancer genes, the WMFVS and WFVS methods retain a good ability to detect dark genes. In summary, our WMFVS and WFVS methods have an advantage in identifying dark genes that are hard to find with traditional DEG methods.

Missing-data cases
In this study, to retain the topological structure of the network, the weight-loss genes were assigned default weights of 0 rather than being removed. By further analysis, we found 3 weight-loss genes (i.e., CDC2, ZBTB8 and TADA3L) included in the WMFVS result, 7 weight-loss genes (i.e., CDC2, ZBTB8, RhoGDI, TADA3L, RNF12, NP and MAP3K7IP1) contained in at least one of the 875 random MFVS results, and no weight-loss genes in the WFVS result. In particular, CDC2 and ZBTB8 were included in all the random MFVS results as well as in the WMFVS result. The CDC2 gene corresponds to the highly conserved protein CDK1, which functions as a serine/threonine kinase and is a key player in cell cycle regulation [28]. CDC2 is also considered a cancer-related gene whose overexpression may play an important role in human breast carcinogenesis [29]. While little is known about the ZBTB8 gene, another protein of the same ZBTB family, ZBTB7A, has been reported to be highly expressed in cancer tissue and in the breast cancer cell lines MDA-MB-231 and MCF-7 [30], suggesting that ZBTB8 may act as a transcriptional repressor or be involved in tumorigenesis. The uncovering of the CDC2 and ZBTB8 genes illustrates that the WMFVS method may address the disadvantage of traditional DEG methods in missing-data cases.

Conclusion
We present several new methods for cancer gene prediction. Our WMFVS method uses differential gene expression to select MFVSs, improving the stability of the general MFVS algorithm and obtaining a much better result than the differential gene expression-based method when the weights of the genes are well defined. Our WFVS method is a variant of WMFVS that aims to find an FVS of the network with the maximum total weight. This method obtains better recall than WMFVS by sacrificing precision. Thus, generally, if the researcher wants to reveal as many potential cancer genes as possible, WFVS is better; if the researcher prefers better precision, then WMFVS is better. Furthermore, since WFVS ignores the restriction on the output size, it focuses more on the vertex weights than WMFVS. Therefore, if the researcher has good confidence in the weight definition, i.e., the weights are closely related to the classification, WFVS will give a better result than WMFVS. We can see this in the data analyses on the MSigDB data set, which has the highest enrichment score on our defined weights. However, in many cases, since we are not sure whether the defined weights are closely related to the classification, using WMFVS will maintain better precision for the prediction.

WMFVS and WFVS take advantage of both bio-data and the network structure. They can be useful in novel cancer gene prediction and evaluation, and the same idea may also be applied to other bioinformatics problems. The main challenge of our methods is the definition of the weights. WMFVS and WFVS can perform very well when the weights are well defined but may display limited performance when the weights are not directly related to the category. Another issue concerns graph compression.
In our experiments, the traditional MFVS method analysed the compressed graph (with the ignore operation; see details in the next section), which contained 660 vertices and 5604 arcs, and it was efficient and took only approximately 4 seconds to obtain the result. The input graph of WMFVS and WFVS was compressed using the limited ignore_w operation (see details in the next section), which contained 2348 vertices and 17283 arcs. Because of the different input scales, WMFVS and WFVS were not as efficient as the simple MFVS method, although the time costs were still acceptable. The development of new algorithms for weighted graph compression is left as future work.

Graph compression
In biological networks, a network usually contains tens of thousands of vertices and hundreds of thousands of arcs. In many cases, processing a large network is not practical because of the NP-hardness of the MFVS problem [12]. Generally, we can compress the original graph to a simpler graph that maintains (or can restore) the size of the MFVS of the original graph.
In the following sections, we define v.suc and v.pre as the sets of successors and predecessors of vertex v, respectively. Let v_i be a vertex in a network S. Consider the following three cases [18]:

C1. v_i ∈ v_i.suc, i.e., v_i has a self-loop; then, v_i must be in every FVS, since otherwise the self-loop cannot be removed.
C2. v_i.suc = ∅ (or v_i.pre = ∅); then, v_i is not in any MFVS, since it is not in any cycle.
C3. |v_i.suc| = 1 (or |v_i.pre| = 1); let v_j be the only successor (or predecessor, respectively) of v_i; then, any cycle containing v_i also contains v_j.
For C1, we use a temporary list M to record v i ; we add v i to M and remove v i and all its incoming and outgoing arcs from the graph. We use remove(v i ) to denote this removing process. For C2, since v i is not in any MFVS, we can safely use remove(v i ) without any change to the possible MFVSs.
For C3, assume v_i is in some cycle c. If we attempt to break c by removing v_i, then it is equally good (sometimes better) to remove v_j rather than v_i. Here, we connect all predecessors of v_i to all its successors and then use remove(v_i). We denote this connecting-and-removing operation by ignore(v_i, S), where S is the current graph to which v_i belongs and S.E is the arc set of S. To compress a graph S, we repeatedly apply the reductions for C1-C3 until S can no longer be modified. Furthermore, we use the strongly connected components (SCCs) [17, 19] to reduce the arcs. Since an arc between two SCCs is not in any cycle, the deletion of these arcs will not change any MFVS. We use compress_scc(S) to denote the operation that removes all arcs between two different SCCs in S. The whole compression procedure returns a list M containing the vertices that are always in any MFVS, and the union of M and any MFVS of the compressed graph is an MFVS of the original graph.
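A minimal Python sketch of the C1-C3 reductions is given below (the dict-of-sets graph representation is our own choice; compress_scc and the weighted ignore_w variant are omitted):

```python
def remove(v, suc, pre):
    """remove(v): delete v and all its incoming and outgoing arcs."""
    for u in pre[v]:
        if u != v:
            suc[u].discard(v)
    for w in suc[v]:
        if w != v:
            pre[w].discard(v)
    del suc[v], pre[v]

def compress(suc, pre):
    """Apply the C1-C3 reductions until the graph stops changing.
    Returns M, the vertices that must belong to every FVS (C1)."""
    M = []
    changed = True
    while changed:
        changed = False
        for v in list(suc):
            if v not in suc:
                continue                      # removed earlier in this pass
            if v in suc[v]:                   # C1: self-loop, v in all FVSs
                M.append(v)
                remove(v, suc, pre)
                changed = True
            elif not suc[v] or not pre[v]:    # C2: v lies on no cycle
                remove(v, suc, pre)
                changed = True
            elif len(suc[v]) == 1 or len(pre[v]) == 1:  # C3: ignore(v)
                for u in pre[v]:              # bypass v: link pre to suc
                    for w in suc[v]:
                        suc[u].add(w)
                        pre[w].add(u)
                remove(v, suc, pre)
                changed = True
    return M

# A 3-cycle a->b->c->a plus an isolated sink d collapses completely;
# the single vertex left in M is an MFVS of the original cycle.
suc = {"a": {"b"}, "b": {"c"}, "c": {"a"}, "d": set()}
pre = {"a": {"c"}, "b": {"a"}, "c": {"b"}, "d": set()}
print(compress(suc, pre), suc)
```

When the C3 bypass connects a vertex to itself, the resulting self-loop is later caught by C1, which is how a cycle eventually deposits one representative vertex into M.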
Note that not all MFVSs of the original graph can be obtained from the above method. Some MFVSs are lost in the ignore operation, and in a weighted MFVS problem, a lost MFVS may be the one with the maximum weight. For the weighted case, we modify the ignore operation to consider the weights of vertices (only for positive-weighted cases), where v.w denotes the weight of vertex v: the modified operation, ignore_w(v, S), applies ignore(v, S) only when v.w is not greater than the weight of the single predecessor (or successor) of v. This ensures that the maximum-weight MFVS (the WMFVS) will not be lost: if v has only one predecessor and the weight of v is less than that of the predecessor, then v does not belong to any WMFVS, since replacing v with its predecessor yields an MFVS of no smaller total weight. The proof is similar when v has only one successor and the weight of v is less than that of the successor.

ILP formulation for MFVS and WMFVS
After the compressing procedure, if the compressed graph is not empty, we can use an ILP method [17] to solve the remaining MFVS problem. For each remaining vertex v_i, we add two variables x_i (Boolean) and k_i (integer), where x_i denotes whether v_i is included in the output MFVS result and k_i is an auxiliary variable used in the ILP. The ILP formulation is as follows:

ILP1:
Minimize Σ x_i
Subject to k_i − k_j + n x_i ≥ 1 ∀(v_i, v_j) ∈ E, where 0 ≤ k_i ≤ n − 1 and x_i is Boolean,

where E is the arc set and n is the number of vertices of the remaining graph. These constraints ensure that the selected vertices compose an FVS of S, while the objective function means that the selected FVS has a minimum size, i.e., it is an MFVS. Now we consider the weighted case of the MFVS problem. Given a graph S, where each vertex v_i ∈ S.V has a weight v_i.w (in what follows, we use w_i to denote v_i.w if there is no ambiguity), the WMFVS problem is to find an MFVS of S that has the maximum total weight. Assuming we already know the size s of the MFVS (by ILP1 or some estimation method such as that of [31] or [32]), the following formulation optimizes the selected MFVS as a WMFVS:

ILP2:
Maximize Σ w_i x_i
Subject to Σ x_i = s and k_i − k_j + n x_i ≥ 1 ∀(v_i, v_j) ∈ E, where 0 ≤ k_i ≤ n − 1 and x_i is Boolean.

The constraint Σ x_i = s ensures that the selected FVS is an MFVS, while the objective function selects the maximum-weight MFVS among all possible MFVSs.
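On toy instances, the intent of these formulations can be checked with a brute-force search over vertex subsets (exponential, for illustration only; it is not part of the actual Gurobi-based pipeline):

```python
from itertools import combinations

def is_acyclic(vertices, arcs):
    """Kahn-style topological check on the subgraph induced by `vertices`."""
    indeg = {v: 0 for v in vertices}
    adj = {v: [] for v in vertices}
    for u, v in arcs:
        if u in indeg and v in indeg:
            adj[u].append(v)
            indeg[v] += 1
    queue = [v for v in vertices if indeg[v] == 0]
    seen = 0
    while queue:
        u = queue.pop()
        seen += 1
        for w in adj[u]:
            indeg[w] -= 1
            if indeg[w] == 0:
                queue.append(w)
    return seen == len(indeg)

def brute_force_wmfvs(vertices, arcs, weight):
    """Among all minimum-size FVSs (i.e., MFVSs), return one with
    maximum total weight -- a WMFVS of a tiny graph."""
    for size in range(len(vertices) + 1):      # smallest size first
        mfvss = [set(c) for c in combinations(vertices, size)
                 if is_acyclic(set(vertices) - set(c), arcs)]
        if mfvss:
            return max(mfvss, key=lambda f: sum(weight[v] for v in f))

# Two 2-cycles sharing vertex b: {b} is the unique MFVS, hence the WMFVS
arcs = [("a", "b"), ("b", "a"), ("b", "c"), ("c", "b")]
print(brute_force_wmfvs(["a", "b", "c"], arcs, {"a": 1, "b": 5, "c": 1}))
```

With two disjoint 2-cycles, the MFVS has size 2 and is no longer unique; the weight then acts as the tie-breaker, which is exactly the role of the ILP2 objective.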

Maximum-weight FVS
In the WMFVS problem, we first restrict the size of the FVS to be minimal and then select the maximum-weight MFVS as the objective. However, sometimes the weight may be more important than the size of an FVS. As an example, in Fig. 6, the WMFVS is {b} , which has a total weight of −20 . If we do not restrict the minimum size of the set, the FVS {a, c} , which has weight −4 , seems better.
Here we define a variant of the WMFVS problem, which ignores the exact size of the output vertex set, as follows: given a graph S, where each vertex v_i ∈ S.V has a weight v_i.w (or w_i), the weighted FVS (WFVS) problem is to find an FVS of S that has the maximum total weight. We can use an ILP similar to ILP2 to solve the WFVS problem.

Fig. 6 A simple example. In this case, the total weight may be more important than the size of an FVS

ILP3:
Maximize Σ w_i x_i
Subject to k_i − k_j + n x_i ≥ 1 ∀(v_i, v_j) ∈ E, where 0 ≤ k_i ≤ n − 1 and x_i is Boolean.
However, simply removing the constraint x i = s may lead to a trivial solution when the weights of the vertices are positive, since the set of all vertices will always be a WFVS.
Here we consider two methods to avoid the trivial solution:

1. Modify all weights to be negative. Assume the maximum weight of the vertices is w_m; then, for each weight w_i, modify it to w_i := w_i − w_m − ε, where ε is a small positive constant that ensures all weights are negative. The ILP is the same as ILP3.
2. Reverse the weights into penalty values by taking the inverse of each w_i, i.e., p_i := 1/w_i, and then modify ILP3 as follows:

ILP3':
Minimize Σ p_i x_i
Subject to k_i − k_j + n x_i ≥ 1 ∀(v_i, v_j) ∈ E, where 0 ≤ k_i ≤ n − 1 and x_i is Boolean.

In our research, we examined both ways of calculating the weights in the WFVS method. We found that the first modification was more unstable in the ILP process, i.e., obviously wrong ILP results appeared more often. Thus, we chose the second method to compute the weights in the WFVS method; i.e., we reversed the weights into penalty values, which are always positive.
In the second method, we need to avoid a ‘division by zero’ error for the zero-weight (weight-loss) genes. To this end, we used a simple heuristic: let l be a large number (in our program, we used 65536); then, the penalty is calculated as p_i = 1/w_i if w_i ≠ 0, and p_i = l otherwise.
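Our reading of this heuristic (substituting l for the undefined inverse of a zero weight is our interpretation) can be sketched as:

```python
def penalty(weight, l=65536):
    """Inverse-weight penalty for ILP3'; the large constant l stands in
    for the undefined inverse when the weight is 0 (weight-loss genes)."""
    return 1.0 / weight if weight != 0 else float(l)

print(penalty(0.5))  # → 2.0
print(penalty(0.0))  # → 65536.0
```

Since all penalties are positive, adding a vertex to the FVS always costs something under ILP3', which is what rules out the trivial all-vertices solution.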

Experimental environment
We implemented all the methods in Python 3.7.0 with an Intel(R) Core(TM) i7-7700 CPU and 32.0 GB RAM. The compress_scc procedure uses Gabow's algorithm [33]. The ILP processing is based on Gurobi 8.1.0 [34].
