 Research
 Open Access
 Published:
GraceAKO: a novel and stable knockoff filter for variable selection incorporating gene network structures
BMC Bioinformatics volume 23, Article number: 478 (2022)
Abstract
Motivation
Variable selection is a common statistical approach to identifying genes associated with clinical outcomes of scientific interest. There are thousands of genes in genomic studies, while only a limited number of individual samples are available. Therefore, it is important to develop a method to identify genes associated with outcomes of interest that can control finitesample false discovery rate (FDR) in highdimensional data settings.
Results
This article proposes a novel method named GraceAKO for graphconstrained estimation (Grace), which incorporates aggregation of multiple knockoffs (AKO) with the networkconstrained penalty. GraceAKO can control FDR in finitesample settings and improve model stability simultaneously. Simulation studies show that GraceAKO has better performance in finitesample FDR control than the original Grace model. We apply GraceAKO to the prostate cancer data in The Cancer Genome Atlas program by incorporating prostatespecific antigen (PSA) pathways in the Kyoto Encyclopedia of Genes and Genomes as the prior information. GraceAKO finally identifies 47 candidate genes associated with PSA level, and more than 75% of the detected genes can be validated.
Background
Identifying genes and pathways associated with complex traits is a primary focus for advancing the scientific understanding in genomic studies, for which extensive clinical experiments and genetic counseling are required [1]. A major challenge is that genomic data is usually highdimensional but with a limited sample size. Currently, there are rich public genomic data, such as the Kyoto Encyclopedia of Genes and Genomes (KEGG, https://www.genome.jp/kegg/), which provides generegulatory pathways including biological regulatory relationships between genes or gene products. These generegulatory pathways form a network that can be represented as a graphical structure, where the vertices are genes or gene products, and the edges are generegulatory pathways. Inspired by the nature of genomic data, graphconstrained estimation (Grace) offers a novel perspective on variable selection for genomic data by taking graphical structures into account. The predictors in the network are geneexpression data linked by generegulatory pathways for a specific clinical outcome, which can also be called graphstructured covariates [2]. These graphical structures, such as generegulatory pathways, can improve the sensitivity of detecting pathways [3]. Therefore, [4] developed a Grace model, networkconstrained regularization procedure by encoding the graphical structures into a Laplacian matrix to incorporate this kind of prior biological information into regression analysis. The networkconstrained regularization procedure includes the Lasso [5], and the elastic net regulation procedure [6] as special cases. Particularly, the networkconstrained penalty may be transformed to a Lassotype problem by constructing an augment dataset, which can also retain the automatic variables selection property [4]. The augment dataset can extend the sample size from n to \(n+p\), allowing the model to choose p variables despite the fact that \(n\ll p\). Moreover, because the loss function of the networkconstrained penalty is a convex function, it can ensure the grouping effect of the regression in the case of identical predictors. In accordance with the theorem of [4], the quantitative description of the grouping effect is measured beyond a half of the elastic net model \(\frac{1}{2 \lambda _2} \sqrt{2(1\rho )}\), where \(\lambda _2\) is a fixed scalar and \(\rho\) is the correlation between the vertices, which means that if two vertices are highly correlated (e.g., \(\rho =1\)), the difference between their coefficient paths could be almost 0. According to [1], the optimal policy for this variable selection effort would be to identify significant relevant genes and provide error control for these discoveries (both genes and variants). However, these conventional regression approaches only control false discovery rate (FDR) asymptotically with no guarantee in finitesample settings. Specifically, there is a lack of effective approaches for variable selection which can not only integrate the graphical structure but also provide a guarantee of finitesample FDR control.
To address this challenge, we propose GraceAKO, a novel method for identifying genes associated with complex traits of scientific interest that integrates the core concept of aggregation of multiple knockoffs (AKO, [7]) with the networkconstrained regularization procedure [4]. The salient idea of knockoff inference is to generate knockoff variables by mimicking the correlation structure of the original variables without considering the response variable (conditionally on the original variables) [8]. ModelX knockoffs, as opposed to fixedX knockoffs, regarded the original variables as random and relied on the specific stochastic properties of the linear model, thus extending knockoff inference to highdimensional data [8]. These knockoff variables are applied to control finitesample FDR served as negative controls so that the original variables are selected if they are considerably more connected with the response variable than their knockoff variables. Specifically, the knockoff inference uses various types of feature statistics to determine which variables are significant and which are not. The feature statistics impose a flipsign property, which implies that swapping the variables and their knockoffs alters the sign of the feature statistics. The methods for constructing the feature statistics are i.i.d. random for the “null hypothesis” whose coefficients are zero [8, 9]. To control FDR, [9] developed a datadependent threshold whose derivation formula may be regarded as an estimate of the fraction of false discoveries. In addition, variables whose feature statistics are larger than the threshold may be selected, and estimations of the FDR can be converted to provide finitesample FDR control with a high degree of accuracy. However, modelX knockoffs were generated using Monte Carlo sampling, which made it challenging to reproduce the results. Thus, multiple knockoffs were proposed to address this limitation, allowing for a more stable finitesample FDR control and more reproducible findings [7, 10, 11]. In particular, statistical aggregation is a typical statistical approach to solve instability by aggregating modelX knockoffs inference. Aggregation of multiple knockoffs (AKO) was proposed by [7], which rested on a reformulation of modelX knockoffs to introduce an intermediate feature statistic. It brought the idea from [12] to replicate modelX knockoff procedure multiple times, and then performed statistical aggregation to generate new intermediate feature statistics. Hence, it is more stable than modelX knockoff filter.
Specifically, the key contribution of our proposed GraceAKO is that we integrate the graphical structure to conduct variable selection with finitesample FDR control. Based on the knockoff inference property, we update the Laplacian matrix in the networkconstrained penalty and perform variable selection with the original explanatory variables and their knockoffs. The primary steps of GraceAKO are summarized as follows. First, we generate modelX knockoffs according to the correlation structure of the graphstructured covariates [8] and encode the graphical structures into a Laplacian matrix [13]. Second, we simultaneously fit the graphstructured covariates and their modelX knockoffs into the networkconstrained regularization procedure to multiple feature statistics: Lasso coefficientdifferences (LCDs). Third, we repeat the above procedures multiple times and employ the statistical aggregation approach [12] to transform the LCDs into new intermediate feature statistics, Aggregated Grace Coefficients (AGCs). Fourth, we conduct the Benjamini–Hochberg (BH) procedure [14] on the multiple AGCs to select the candidate variables. In our simulation studies, we show that GraceAKO has satisfactory performance, allowing for higher reproducibility of results, and can control the FDR in finitesample settings. We further analyze a prostate cancer data set from The Cancer Genome Atlas (TCGA) program using GraceAKO, and then identify 47 candidate genes, of which 75% were also found in the previous literature.
The remainder of this article is organized as follows. In “Method” section, we describe the method of GraceAKO under the linear regression framework. In “Results” section, we assess the performance of GraceAKO using simulation studies. In “Application to the TCGA Prostate Cancer Data” section, we apply GraceAKO to a prostate cancer data set from the TCGA program by incorporating the KEGG pathways as prior information. In “Conclusions” section, we briefly summarize our method.
Method
In genomic studies, we usually apply a regression model to identify genes and pathways associated with the trait of interest by linking highdimensional data (e.g., microarray geneexpression data) to the trait. Consider the following linear model where \({\varvec{X}}\) is a \(n\times p\) design matrix with n observations and p predictors, and \({\textbf{y}}\) is the response:
where the design matrix \({\varvec{X}} = ({\varvec{x}}_1, \cdots , {\varvec{x}}_p)\) and \({\textbf{x}}_j = ({\text {x}}_{1j}, {\text {x}}_{2j}, \cdots , {\text {x}}_{nj}), j = 1,\cdots ,p\) represents a vector of the graphstructured covariates from genomic data (e.g., geneexpression data), and the coefficient \({\boldsymbol{\beta }}\) represents the contribution of the graphstructured covariates to the trait of interest, and \({\boldsymbol{\epsilon }}\) is a vector of random errors. We further assume that the response is centred and the predictors are standardized,
where [p] denotes the set including \(\{1,2, \cdots , p\}\). The main goal is to estimate \({\boldsymbol{\beta }}= (\beta _0, \beta _1,\cdots , \beta _p)\) and select the predictors with nonzero contribution. The regulatory relationships of biological networks for some complex traits, represented as graphical structures between genes or gene products in genomic studies, shed light on underlying biological knowledge, where the covariates are the graph’s nodes and the edges indicate functional relationships between two genes. The biological networks can be utilized to identify the differentially expressed genes [15, 16]. Specifically, in such a graphical structure, the genes are linked by edges with certain probabilities, where the edge probability is interpreted as a weight in an undirected graph to form a weighted graph [2]. To incorporate the prior information about the biological networks, [4] proposed a networkconstrained regularization criterion:
where \({\varvec{L}}\) is a nonnegative Laplacian matrix of a weighted graph containing biological networks information, and \(\cdot _1\) indicates the \(L_1\) norm, and \(\lambda _1\) is the Lasso penalty [5], and \(\lambda _2\) is the penalty for the Laplacian matrix \({\varvec{L}}\). Specifically, the Laplacian matrix was first introduced by [13], which included numerous properties of the graph by its consistent set of the eigenvalues or spectrum. When p is large, the model (1) is treated as “sparse”, in which most elements of the coefficient \({\boldsymbol{\beta }}\) are zero [2]. In equation (2), the \(L_1\) norm deals with sparse matrices, and \({\boldsymbol{\beta }}^T{\varvec{L}}{\boldsymbol{\beta }}\) induces a smooth solution of coefficients of the graphstructured covariates. Additionally, \({\varvec{L}}\) depicts the graphical structure assuming that set V includes vertices corresponding to the graphstructured covariates, and W is the weights of the edges in which w(u, v) denotes a weight of the edge between the graphstructured covariates u and v, and the degree of vertex v is represented as \(d_{v}=\sum _{u \sim v} w(u, v)\). In the genomic data, w(u, v) quantifies the uncertainty of the edge between two vertices, such as the probability of an edge connecting two graphstructured covariates when the graphical structure is constructed from data. Motivated by [13], we apply the normalized \({\varvec{L}}\) [4]:
Therefore, we can rewrite the second penalty term \({\boldsymbol{\beta }}^T {\varvec{L}} {\boldsymbol{\beta }}\) of Eq. (2) as follows [4]:
The networkconstrained regularization procedure integrates the known biological network’s information for variable selection. By introducing the networkconstrained penalty, the networkconstrained regularization procedure was able to identify more interpretable genes and subnetworks related with the outcome of interest, while simultaneously inducing sparsity and smoothness of the biological network and the coefficients. However it does not ensure false discovery rate (FDR) control in finitesample settings [4]. The networkconstrained regularization procedure only has the asymptotic property only when \(n\rightarrow \infty\) and p is fixed and suffers from identifying numerous false positive discoveries when p is large and the number of samples n is limited. To address this issue, we introduce knockoff inference and multiple knockoffs to control finitesample FDR to achieve a stable performance. The knockoff filter procedure was first proposed in [17], and [8] further proposed modelX knockoff filter to extend its application to highdimensional data. ModelX knockoffs, \(\tilde{{\varvec{X}}}\) are generated from the original data by Monte Carlo sampling and retain the same data structure as the originals \({\varvec{X}}\), in which \({\varvec{X}}\) and \(\tilde{{\varvec{X}}}\) are pairwise exchangeable. We summarize the properties of modelX knockoffs as in [8]:

(1)
Swapping the locations of related elements, \({\textbf{x}}_j\) and \(\tilde{{\textbf{x}}_j}\), would not change the joint distribution of \(({\varvec{X}}, {\boldsymbol{\tilde{X}}})\) conditional on \({\textbf{y}}\).

(2)
Once the original covariates \({\varvec{X}}\) are known, their modelX knockoffs, \(\tilde{{\varvec{X}}}\) provides no extra information on the response variable \({\textbf{y}}\).
The knockoffs filter is a cheap method to control finitesample FDR since it does not require strong assumptions about the design matrix \({\varvec{X}}\). Due of the random nature of modelX knockoffs sampling, however, the outcome would be unstable and cannot be guaranteed to be reproduced. To increase the stability, multiple knockoffs approaches were developed, and aggregation of multiple knockoffs (AKO) is one of them [7]. In this article, we present a novel method for variable selection termed GraceAKO, which combines the biological network information for improved variable selection with finitesample FDR control using knockoff filter technique. Our proposed GraceAKO has four major steps.
First, we generate modelX knockoffs, \({\varvec{ \tilde{X}}}\) from the original data matrix \({\varvec{X}}\) using R package “knockoff” [8]. \(({\varvec{X}},{\varvec{ \tilde{X}}})\) is further regarded as a new design matrix of the predictors. Based on the properties of modelX knockoffs whose correlation structure is the same as that of the original variables, we generate a new normalized Laplacian matrix \({\varvec{L}}\). In summary, we generate knockoff variables and update the Laplacian matrix to include the graphical structure of knockoff variables in this step. The knockoff variables are introduced into the model based on their properties.
Second, the response variable \({\varvec{y}}\) and the new predictors \(({\varvec{X}},{\varvec{ \tilde{X}}})\) are fitted in Eq. (2). Inspired by [4], a natural solution of GraceAKO is equivalent to the following optimization problem:
subject to:
where \(\alpha = \lambda _2/(\lambda _1+\lambda _2)\), and t is a constant value, and \(({\boldsymbol{\beta }},\tilde{{\boldsymbol{\beta }}})^T\) is a vector of the new coefficients. Furthermore, in the function \((1\alpha )\sum ^p_{j=1}\Vert \beta _j\Vert _1+\alpha \sum _{u\sim v}(\frac{\beta _u}{\sqrt{d_u}}\frac{\beta _v}{\sqrt{d_v}})^2w(u,v)\), \(\Vert \cdot \Vert _1\) deals with sparse data matrix consistent with its function in the networkconstrained penalty, and the second term penalizes the weighted sum of the squares of the difference of coefficients between the graphstructured covariates, which is scaled by the degree of the associated vertices in the network. Specifically, when two genes are connected, it is expected that their coefficients would be similar rather than identical, which is accomplished by applying the second term of the penalty [4].
To identify relevant variables, we compute the Lasso coefficientdifference (LCD) as the feature statistic to measure the evidence against null hypothesis (\(\beta _j = 0\)) [8]:
where \({\hat{\beta }}_j\), and \({\hat{\beta }}_{j+p}\) are the estimated coefficients of \({\textbf{x}}_j\) and \({\boldsymbol{\tilde{x}}}_j\), respectively. Due to the symmetric distribution of \(W_j\) under the null, \(W_j\) equally takes on positive and negative values [8]. Moreover, a large positive value of \(W_j\) suggests that the distribution of \({\textbf{y}}\) is statistically dependent on \({\textbf{x}}_j\) and that there is a strong probability that \({\textbf{x}}_j\) is a relevant gene associated with the response \({\textbf{y}}\). In this step, the networkconstrained penalty is integrated with knockoff variables to control finitesample FDR. The natural solution of GraceAKO follows the same optimization problem as the networkconstrained regularization procedure. However, the modelX knockoffs procedure only generates knockoff variables once by Monte Carlo sampling, which leads to instability. To solve this challenge, we repeat the knockoff generation process multiple times and apply the statistical aggregation strategy to increase stability [12].
Third, following [7], GraceAKO transforms the feature statistic \(W_j\) into a new intermediate feature statistic \(q_j\):
We repeat the aforementioned steps B times, including the generation of knockoffs and the calculation of the intermediate feature statistic \(q_j\), to generate a \(B \times p\) matrix of the intermediate feature statistics. Then, we propose a new feature statistic, Aggregated Grace Coefficient (AGC), which is derived by applying the quantile aggregation algorithm [12] to the \(B\times p\) matrix:
where \(\gamma\) is the quantile point, and \(Q(\cdot )\) denotes the quantile function, and B is the prespecified replication times. To summarize the third step, we generate modelX knockoffs B times and then use the quantile aggregation procedure to yield AGCs, \({\varvec{{\bar{q}}}}\). The quantile aggregation approach introduces intermediate feature statistics and aggregated feature statistics, which are based on the concept of statistical aggregation [12] to improve stability.
Fourth, we apply the Benjamini–Hochberg procedure (BH) [14] to the AGCs, \(\bar{{\varvec{q}}} = ({\bar{q}}_1, {\bar{q}}_2, \cdots , {\bar{q}}_p)\) to compute a datadependent threshold:
where \(\alpha\) is the userspecified nominal FDR level. We finally choose the candidate variables satisfying the following requirement:
In this article, we measure the performance using the modified false discovery rate (mFDR):
where \(S_0 = \{j \in [p]: \beta _j = 0\}\) includes the predictors that have no effect on the trait of interest. The implementation details are available in Algorithm 1.
As an illustration in [8], the primary objective of the knockoff filter procedure was to build an as permissive as possible datadependent threshold. The threshold can provide a controllable estimation of FDR to ensure modelX knockoffs property. [7] was based on a reformulation of the original knockoff inference and developed an intermediate feature statistics to replace the feature statistics of [8]. Specifically, AKO still provided a datadependent threshold based on the specification in [8] which is used by GraceAKO.
Results
Simulation Studies
In this section, we performed a wide range of simulation studies to evaluate our proposed method, GraceAKO, and compared it to the networkconstrained regularization procedure (namely Grace) [4]. We supposed that there were 10 transcription factors (TFs) and each regulated 10 genes. The graphical structure included g unconnected regulatory modules with p genes in total and edges linked each of the TFs and 10 genes regulated. Here we denoted the first 44 genes related to the response variable \({\textbf{y}}\). We assumed that the data were simulated from the following settings:

(1)
\({\textbf{y}} = {\textbf{X}}{\boldsymbol{\beta }} + {\boldsymbol{\epsilon }}\), where
\({\boldsymbol{\beta }} =(5, \overbrace{ \frac{5}{\sqrt{10}},\cdot \cdot \cdot , \frac{5}{\sqrt{10}}}^{10}, 5, \overbrace{ \frac{5}{\sqrt{10}},\cdot \cdot \cdot , \frac{5}{\sqrt{10}}}^{10}, 3, \overbrace{ \frac{3}{\sqrt{10}},\cdot \cdot \cdot , \frac{3}{\sqrt{10}}}^{10}, 3, \overbrace{ \frac{3}{\sqrt{10}},\cdot \cdot \cdot , \frac{3}{\sqrt{10}}}^{10}, 0,\cdot \cdot \cdot 0)\), and \({\boldsymbol{\epsilon }}\) was generated from \(N(0, {\boldsymbol{\sigma }}^2)\);

(2)
The noise level was denoted as \(\sigma ^{2}=\sum _{u}^p \beta _{u}^{2} / 4\);

(3)
For each expression level TF, X was drawn from normal distribution: \({\varvec{X}}_{g}\sim N(0, 1)\), and conditional on the TF, the expression levels of genes which regulated to the specific TF were drawn from a conditional normal distribution with correlation of 0.7.
We set \(n = (100, 200, 300)\), \(g = (10, 20, 40, 60)\) and \(p = (110, 220, 440, 660)\), where g represented the total number of unconnected regulatory modules, to simulate 12 settings in total. Figure 1 shows the submatrix in order to provide a more accurate depiction of the graphical structure, given that every 11 of the first 44 variables form an identical subnetwork. As the empirical research in [7] demonstrated, performance was stable and robust once the iteration B approached 25 times. In the meanwhile, FDR could be empirically controlled under the prespecified level. Therefore, we fixed \(B = 25\), and \(\gamma = 0.1\) in all scenarios. Additionally, we set the mFDR control level \(\alpha\) at 0.1. According to [4], Grace performed better than the Lasso [5] and the elastic net [6] in term of the combination of \(\Vert \cdot \Vert _1\) and \(\Vert \cdot \Vert _2\) and the Laplacian matrix \({\varvec{L}}\). We thus focused on the comparison of GraceAKO and Grace in our simulation studies. We assessed the regression performance using the average values of mFDR, standard errors (calculated by Monte Carlo method), and the number of variables selected.
To tune the parameters, we applied 10fold crossvalidation. The values of tuning parameter \(\lambda _{{\varvec{L}}}\) were specified from the range of 0.1 to 2.0 with a step size of 0.1, which ensured that the \({\varvec{L}}\) matrix was nonnegative. Moreover, the values of \(\lambda _1\) were drawn from the range of 110 to 200 with a step size of 5. The validation set of \(\lambda _2\) ranged from 1 to 10 with a step size of 1. Each simulation setting was repeated for 30 times.
As Table 1 showed, the mFDR values of GraceAKO were controlled at the prespecified level, and Grace could not control mFDR in finitesample settings. Furthermore, as the data dimension p increased with a fixed sample size n, the mFDR for Grace increased which even reached at 0.67. However, GraceAKO could always control the mFDR under 0.1. Moreover, the standard errors for GracAKO were always smaller than Grace’s, which indicated that GraceAKO could also improve the stability. In Fig. 2, we depicted a boxplot for the findings when \(n =300\). The upper figure’s black line represented the prespecified FDR level. Grace, in particular, had inflated power at cost of high mFDR.
As shown in Table 2, the number of selected variables varied significantly between these two models. The number of true variables was fixed at 44. When n and p increased, Grace selected many false discoveries. Notably, when \(n = 300\) and \(p=660\), Grace identified 138 candidate variables. This was the reason why Grace’s power was inflated in Fig. 2. GraceAKO showed considerably more stable performance, and it can guarantee finitesample FDR control under all data settings with slightly conservative power.
We further assessed the robustness and single knockoff (modelX knockoffs) performance on simulated data (\(n = 100\), \(p=110\), and a prespecified mFDR = 0.1). To evaluate the robustness of GraceAKO, we randomly selected 20 vertices from the first 44 elements (true candidate genes) of the Laplacian matrix and set their degrees to zero. Additionally, we set the first and third TFs to have false degrees of 1 and 4, respectively. The mFDR and TPP of GraceAKO were 0.013 and 0.516, respectively. It indicated that GraceAKO could still control the mFDR under the prespecified FDR level (mFDR \(= 0.1\)), despite the fact that some information of graphical structure was misspecified for the true variables. Moreover, prior researches demonstrated that the findings might be robust to the misspecification of the graphical structure [4, 15, 16]. As there were few genes associated with the response variable, the majority of coefficients would be zero. In addition, we examined the performance of Grace incorporating with modelX knockoffs, termed as GraceKO over 200 simulations. We conducted simulations with 200 replications when \(n = 100\) and \(p = 110\). The results were reported in Table 3. We observed that GraceKO and GraceAKO were both able to control the mFDR under the prespecified mFDR level, and GraceAKO shown more stable performance with a lower standard deviation. Moreover, GraceKO performed more conservatively than GraceAKO, which was able to identify fewer candidate genes. Furthermore, we also assessed the computational cost of GraceAKO when \(n=100\) and \(p=110\). The knockoff generation and inference steps of GraceAKO could be conducted by parallel running, which took about 30 seconds with a server of Intel Xeon Silver 4116 CPU 2.10 GHz and 64 GB RAM memory. Consequently, we concluded that GraceAKO was robust to the incorrect information of the graphical structure and was more powerful in identifying candidate variables.
Application to the TCGA Prostate Cancer Data
To demonstrate the usefulness of GraceAKO, we applied it to a geneexpression data of prostatespecific antigen (PSA) level from The Cancer Genome Atlas (TCGA) program. The TCGA program is a landmark cancer genomics program with over 11,000 cases of primary cancer samples spanning 33 cancer types [18]. Additionally, the Kyoto Encyclopedia of Genes and Genomes (KEGG), as a public database, contains rich information about the graphical structures of genes [19]. It contributes to understanding various aspects of biological systems and pathways. Prostate cancer is the most common malignancy in midaged males and the second leading malignancy [20]. This external graphical structure is represented by a penalty weight matrix, which is the Laplacian matrix \({\varvec{L}}\) constructed in equation (3). Ref. [21] indicated that metastatic prostate cancer remained incurable even in patients who finished intensive multimodal therapy. It is an urgent challenge to propose a novel approach for disease management via identifying prognostic determinant genes. Moreover, [22] indicated that the statistical significance of differential expression might require abundant experiments, and the probability of type I error increased as the multiple hypotheses were tested. In this article, we thus were more concerned with the accuracy of gene selection in tumor investigations (assessed by mFDR) than the power of detection.
We first removed the samples with missing measurements and then encoded PSA pathways [23] from the KEGG to construct the normalized Laplacian matrix for Grace and GraceAKO. We finally obtained the data with sample size \(n=339\) and dimension \(p=5,947\). We denoted the PSA level as our response. To reduce computational demands, we first performed variable screening via correlation learning following [24]. The final sample size and dimension were \(n=339\) and \(p=600\), respectively, consisting of the data structure in simulation studies. We then standardized the explanatory variables and centered the response. The parameter \(\lambda _1\) ranged from 1 to 40 length out as 4, and \(\lambda _2\) was among 1, 2, 3, 4, and 5. We denoted the target FDR level at 0.2 and \(B = 25\) for the iterations of AKO procedure. The quantile point was \(\gamma = 0.3\). For a fair comparison, we both conducted 10fold crossvalidation to select the tuning parameters. We showed the details in Fig. 3.
There were 90 genes selected by Grace in total, and 47 genes selected by GraceAKO. Ref. [2] indicated that Grace might lose some accuracy when the coefficients’ signs are different. After checking the intersection of variable selection sets, we found that the genes selected by GraceAKO were a subset of Grace’s. However, due to the inflated mFDR in the simulation studies, the results of Grace might identify some false discoveries. Moreover, previous studies confirmed 35 out of 47 candidate genes detected by GraceAKO were related to prostate cancer. Some details of them were listed in Table 4. For example, IL9 was recently assigned as an essential gene in tumor immunity, and AK6 was already regarded as a biomarker in prostate cancer treatments [25, 26]. HLADRB5 was highly related to MHCII genes, which were the related genes of prostate cancer, and so was the CHRNB2 [27]. Moreover, a high concentration of IFNA2 was related to advanced prostate carcinoma [28]. PLK1 was overexpressed in many cancers, including prostate cancer, and scientists found that translation of the PP2APLK1 SDL interaction caused the expression of PLK1 and PP2A, which were commonly regarded as a biomarker in the cancer cells [29]. Ref. [30] indicated the percentage of the case with alteration of ANAPC7 achieved beyond 5.21 percentage points. The results demonstrated that the APC/C had a profound effect on cancer survival. MYC in 2010 had already been confirmed to be affected by the loss of the tumor in [31].
Furthermore, there were also some subnetworks in our findings. H2BC9, H2AC14, and H2AC7 consisted of a small subnetwork. Moreover, IFNA7, IL9, IL5RA, IFNA2, PDGFB, and KIT comprised a subnetwork of validated pathways. Except for KIT, the remaining genes were investigated as possible prostate cancer therapy genes. Additionally, there was a pathway between CAMK4 and PRKACF, where CAMKK2 was a significant androgen receptor target for prostate cancer tumor growth, according to [32]. MYC was linked with RXRA. In [33], RXRA, which was discovered as a novel target of miR191, was conserved in a cell line derived from radio recurrent prostate cancer.
Conclusions
This article introduces GraceAKO to perform variable selection by incorporating the networkconstrained penalty [4] and AKO [7]. In contrast to the conventional variable selection process, GraceAKO applies a normalized Laplacian matrix to encode the graphical structures between the potential genes or the gene products. It applies the \(L_1\) penalty to selected variables and the \(L_2\) penalty to degreescaled differences of coefficients concerning the graphical structure. Moreover, our proposed GraceAKO guarantees of FDR control in finitesample settings by identifying variable employing multiple knockoff variables. GraceAKO addresses the instability of modelX knockoffs by incorporating a statistical aggregation procedure and introducing a new feature statistics AGC. The simulation results indicated that GraceAKO had superior performance in finitesample FDR control in a wide range of simulation settings. In order to control the finitesample FDR, GraceAKO would be slightly conservative in terms of power [10]. Furthermore, the proposed general framework for variable selection with finitesample FDR control can be broadly extended to other existing penalties (e.g., the Lasso penalty [5] and the elastic net penalty [6]).
Availability of data and materials
The data generated and analysis code are available in the GitHub repository https://github.com/mxxptian/GraceAKO.
References
Katsevich E, Sabatti C. Multilayer knockoff filter: controlled variable selection at multiple resolutions. Ann Appl Stat. 2019;13(1):1.
Li C, Li H. Variable selection and regression analysis for graphstructured covariates with an application to genomics. Ann Appl Stat. 2010;4(3):1498–516.
Rahnenführer J, Domingues FS, Maydt J, Lengauer T. Calculating the statistical significance of changes in pathway activity from gene expression data. Stat Appl Genetics Mol Biol. 2004;3(1).
Li C, Li H. Networkconstrained regularization and variable selection for analysis of genomic data. Bioinformatics. 2008;24(9):1175–82.
Tibshirani R. Regression shrinkage and selection via the lasso. J R Stat Soc Ser B (Methodol). 1996;58(1):267–88.
Zou H, Hastie T. Regularization and variable selection via the elastic net. J R Stat Soc Ser B (Stat Methodol). 2005;67(2):301–20.
Nguyen TB, Chevalier JA, Thirion B, Arlot S. Aggregation of multiple knockoffs. In: International Conference on Machine Learning. PMLR; 2020. p. 7283–93.
Candes E, Fan Y, Janson L, Lv J. Panning for gold:’modelX’knockoffs for high dimensional controlled variable selection. J R Stat Soc Ser B (Stat Methodol). 2018;80(3):551–77.
Barber RF, Candès EJ. Controlling the false discovery rate via knockoffs. Ann Stat. 2015;43(5):2055–85.
Gimenez JR, Zou J. Improving the stability of the knockoff procedure: multiple simultaneous knockoffs and entropy maximization. In: Proceedings of the twentysecond international conference on artificial intelligence and statistics. 2019 16–18 Apr;89:2184–2192. http://proceedings.mlr.press/v89/gimenez19b.html.
Emery K, Keich U. Controlling the FDR in variable selection via multiple knockoffs. arXiv eprints. 2019. arXiv:1911.09442.
Meinshausen N, Meier L, Bühlmann P. Pvalues for highdimensional regression. J Am Stat Assoc. 2009;104(488):1671–81.
Chung FRK. Spectral Graph Theory. vol 92. American Mathematical Society, Providence. 1997. https://books.google.co.jp/books?id=YUc38_MCuhAC.
Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J Roy Stat Soc Ser B (Methodol). 1995;57(1):289–300.
Wei Z, Li H. A Markov random field model for networkbased analysis of genomic data. Bioinformatics. 2007;23(12):1537–44.
Wei Z, Li H. A hidden spatialtemporal Markov random field model for networkbased analysis of time course gene expression data. Ann Appl Stat. 2008;2(1):408–29.
Barber RF, Candès EJ. Controlling the false discovery rate via knockoffs. Ann Stat. 2015;43(5):2055–85.
Koboldt D, Fulton R, McLellan M, Schmidt H, KalickiVeizer J, McMichael J, et al. Comprehensive molecular portraits of human breast tumours. Nature. 2012;490(7418):61–70.
Kanehisa M, Goto S. KEGG: Kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 2000;28(1):27–30.
Stangelberger A, Waldert M, Djavan B. Prostate cancer in elderly men. Rev Urol. 2008;10(2):111–9.
Wang G, Zhao D, Spring DJ, DePinho RA. Genetics and biology of prostate cancer. Genes Dev. 2018;32(17–18):1105–40.
Reiner A, Yekutieli D, Benjamini Y. Identifying differentially expressed genes using false discovery rate controlling procedures. Bioinformatics. 2003;19(3):368–75.
Kim I, Choi S, Kim S. BRCAPathway: a structural integration and visualization system of TCGA breast cancer data on KEGG pathways. BMC Bioinformatics. 2018;19(1):42. https://doi.org/10.1186/s1285901820166.
Fan J, Lv J. Sure independence screening for ultrahigh dimensional feature space. J R Stat Soc Ser B Stat Methodol. 2008;70(5):849–911.
Wan J, Wu Y, Ji X, Huang L, Cai W, Su Z, et al. IL9 and IL9producing cells in tumor immunity. Cell Commun Signal. 2020;18(1):50.
Liu C, Zhang L, Huang Y, Lu K, Tao T, Chen S, et al. MicroRNA328 directly targets p21activated protein kinase 6 inhibiting prostate cancer proliferation and enhancing docetaxel sensitivity. Mol Med Rep. 2015;12(5):7389–95.
Axelrod ML, Cook RS, Johnson DB, Balko JM. Biological consequences of MHCII expression by tumor cells in cancer. Clin Cancer Res. 2019;25(8):2392–402.
Erb HH, Langlechner RV, Moser PL, Handle F, Casneuf T, Verstraeten K, et al. IL6 sensitizes prostate cancer to the antiproliferative effect of IFNα2 through IRF9. Endocr Relat Cancer. 2013;20(5):677.
Cunningham CE, Li S, Vizeacoumar FS, Bhanumathy KK, Lee JS, Parameswaran S, et al. Therapeutic relevance of the protein phosphatase 2A in cancer. Oncotarget. 2016;7(38):61544–61.
Melloy P. The AnaphasePromoting Complex: a key mitotic regulator associated with somatic mutations occurring in cancer. Genes Chromosomes Cancer. 2019;59(3):189–202.
Koh CM, Bieberich CJ, Dang CV, Nelson WG, Yegnasubramanian S, De Marzo AM. MYC and prostate cancer. Genes Cancer. 2010;1(6):617–28.
Dadwal UC, Chang ES, Sankar U. Androgen receptorCaMKK2 axis in prostate cancer and bone microenvironment. Front Endocrinol (Lausanne). 2018;9:335.
Ray H, Haughey C, Hoey C, Jeon J, Murphy R, et al. miR191 promotes radiation resistance of prostate cancer through interaction with RXRA. Cancer Lett. 2020;473:107–17.
Acknowledgements
Publication made possible in part by support from the HKU Libraries Open Access Author Fund sponsored by the HKU Libraries.
Funding
This work was supported, in part, by the Hong Kong Research Grants Council (RGC) Early Career Scheme 2021/22 (project number 27305221).
Author information
Authors and Affiliations
Contributions
Y.D.Z. and Z.L. conceived and supervised the study. P.T. and Y.H. implemented the software and conducted analysis. P.T. wrote the first draft of the manuscript. P.T., Z.L. and Y.D.Z. revised the manuscript. All authors approved the final manuscript.
Corresponding authors
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Tian, P., Hu, Y., Liu, Z. et al. GraceAKO: a novel and stable knockoff filter for variable selection incorporating gene network structures. BMC Bioinformatics 23, 478 (2022). https://doi.org/10.1186/s1285902205016y
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s1285902205016y
Keywords
 False discovery rate
 Highdimensional data
 Variable selection
 Graphconstrained