 Research
 Open Access
 Published:
Overlapping group screening for detection of geneenvironment interactions with application to TCGA highdimensional survival genomic data
BMC Bioinformatics volume 23, Article number: 202 (2022)
Abstract
Background
In the context of biomedical and epidemiological research, geneenvironment (GE) interaction is of great significance to the etiology and progression of many complex diseases. In highdimensional genetic data, two general models, marginal and joint models, are proposed to identify important interaction factors. Most existing approaches for identifying GE interactions are limited owing to the lack of robustness to outliers/contamination in response and predictor data. In particular, rightcensored survival outcomes make the associated feature screening even challenging. In this article, we utilize the overlapping group screening (OGS) approach to select important GE interactions related to clinical survival outcomes by incorporating the gene pathway information under a joint modeling framework.
Results
Simulation studies under various scenarios are carried out to compare the performances of our proposed method with some commonly used methods. In the real data applications, we use our proposed method to identify GE interactions related to the clinical survival outcomes of patients with head and neck squamous cell carcinoma, and esophageal carcinoma in The Cancer Genome Atlas clinical survival genetic data, and further establish corresponding survival prediction models. Both simulation and real data studies show that our method performs well and outperforms existing methods in the GE interaction selection, effect estimation, and survival prediction accuracy.
Conclusions
The OGS approach is useful for selecting important environmental factors, genes and GE interactions in the ultrahigh dimensional feature space. The prediction ability of OGS with the Lasso penalty is better than existing methods. The same idea of the OGS approach can apply to other outcome models, such as the proportional odds survival time model, the logistic regression model for binary outcomes, and the multinomial logistic regression model for multiclass outcomes.
Background
It is believed that in the development of complex diseases such as cancer, diabetes, and so on, geneenvironment (GE) interaction plays a critical role beyond the main genetic (G) or environmental (E) factors ([1, 2] and so on). For example, Batchelor et al. [3] showed that the interaction between the gene TP53 and age affects the prognosis of glioblastoma. As a consequence, incorporating significant GE interaction factors into a survival prediction model would enhance the performance of the later.
In the setting of highdimensional genetic data analysis, there exist two ways to identification of important GE interactions: the marginal and joint analyses [4]. The marginal analysis considers only one gene at a time, and fits a model consisting of multiple E factors, this gene, and its interaction with E factors. the other performs joint analysis and considers all genes in a single model.
In the framework of marginal analysis of highdimensional genetic data, for each gene, a model consisting of multiple E factors, a single gene itself, and its interaction with E factors is fitted. Specifically, the conceptual marginal model is “Outcome ~ Es + G + G*(Es)”, where the outcome variable can be a continuous, categorical, or survival time phenotype, Es represents a set of environmental factors such as environmental exposures, demographic, clinical, and socioeconomic variables, and G*(Es) represents the interaction between the G factor and all E factors under consideration. The significant GE interactions can be selected based on the corresponding marginal pvalues. Since the marginal model is lowdimensional, its main advantage is its computational stability and conceptual simplicity. Therefore, marginal programs are popular in the fields of bioinformatics and biomedicine. However, a common limitation of traditional methods of marginal analysis is its lack of robustness. In practical genetic studies, Xu et al. [5] pointed out that longtailed distributions and contamination in prognosis response and predictors are not uncommon. In addition, human input errors may also lead to longtailed distributions and contamination. In Fig. 1, we displayed The Cancer Genome Atlas (TCGA) clinical survival data for esophageal carcinoma (ESCA) and head and neck squamous cell carcinoma (HNSCC) to show the longtailed distribution phenomenon. Moreover, censored survival outcomes make the relevant feature screening difficult.
On the other hand, models in the framework of joint analysis better describe disease biology given the fact that complex diseases are related to the combined effects of multiple genetic biomarkers. The conceptual joint model is “Outcome ~ Es + Gs + (Gs)*(Es)”, where Gs represents a set of G factors, including gene expressions, SNPs and other types of molecular measurements, and (Gs)*(Es) represents the interactions between all G and E factors. In this article, we focus on the joint analysis framework. A common challenge of joint analysis is its high dimensionality, which makes it difficult to identify significant interaction effects. Moreover, rightcensored survival outcomes and contaminated biomarker data make the task even challenging.
For survival outcomes, popular models include the accelerated failure time (AFT) model and Cox’s model. Based on the AFT model, several robust joint regression methods have been proposed. The Penalized trimmed regression (PTReg) method [6] uses the trimmed regression to account for longtailed distribution/contamination in prognosis response and predictors, and Wu et al. [7] incorporates the G structure into the joint modeling. These methods conduct regularized estimation and selection based on the minimax concave penalty (MCP) penalty and utilize a decomposition technique to explain the interaction hierarchy. Their main potential disadvantage is that the model size is much larger than the sample size, and the statistical power under the penalized regression frameworks may be suboptimal [8]. In addition, since the gene expression data is often contaminated, the traditional Pearson correlation or Gaussian graphical models may not be a suitable measure to quantify the correlation among genes [9].
Based on the above rationale, we plan to adopt a twostep screening approach to detect GE interactions by incorporating biological pathways information. The proposed method uses annotated gene sets collected in the molecular signatures database [10], which can be downloaded from the website http://www.broadinstitute.org/gsea/msigdb. Wang and Chen [11] described the idea of an overlapping group screening procedure that aims at genegene interaction selection, called the OGS method, for survival prediction based on the Cox model. In this work, we extend and modified the OGS method to detect GE interactions, and show that OGS has several advantages: (1) it can alleviate the collinearity problem in regression analysis due to the correlation between biomarkers in the same gene/pathway; (2) it can significantly reduce the search space for interaction effects by using the feature grouping structure; and (3) it can significantly improve the model selection performed by penalized regression in an ultrahigh dimensional feature space.
Simulation studies under various scenarios reveal that our method works well and outperforms existing methods in the model selection, estimation, and prediction accuracy. In the real data application, we combine gene expression profile data with prior pathway information from the Gene Ontology biological process (GOBP) database and use the OGS approach to select several important environmental factors, genes, and GE interactions that are associated with clinical survival outcomes of patients with HNSCC, and ESCA using TCGA clinical survival genetic data [12]. Using the pathway information available from the GOBP database to group genes into several pathways, we further conduct accurate survival predictions based on the selected main and interacting biomarkers.
Methods
We consider a study with \(N\) independent subjects. For a subject \({\text{i}}\), suppose that there are \(q\) environmental/clinical variables \({\varvec{e}}_{i} = \left( {e_{i1} , \cdots ,e_{iq} } \right)^{^{\prime}}\), and \(p\) genes \({\varvec{x}}_{{\varvec{i}}} = \left( {x_{i1} , \cdots ,x_{ip} } \right)^{^{\prime}}\) assigned to \(G\) possibly overlapping pathways; that is, a given gene may belong to multiple pathways. The pathway information accounts for the natural hierarchical structure of genes, and the overlapping pathways commonly exist in the gene expression data. Our aim is to determine the main features (genes and environment) and their interactions related to clinical survival outcomes, while taking into account the pathway information.
For a subject \(i\), assume the survival outcome \(t_{i}\) is related to the environmental/clinical variables \({\varvec{e}}_{i}\), gene expression covariates \({\varvec{x}}_{{\varvec{i}}}\), and their componentwise interactions \({\varvec{u}}_{i} = \left( {e_{i1} x_{i1} , \ldots , e_{i1} x_{ip} , e_{i2} x_{i1} , \ldots , e_{iq} x_{ip} } \right)^{^{\prime}}\) through the Cox regression model. In the Cox regression framework, the hazard function at time \(t\) for subject \(i^{\prime}{\text{s}}\) survival given the covariates is modeled as.
\(\lambda \left( {t{\varvec{e}}_{i} , {\varvec{x}}_{i} ,{\varvec{u}}_{i} } \right) = \lambda_{0} \left( t \right)exp\left( {{\varvec{e}}_{i}^{^{\prime}} {\varvec{\alpha}} + {\varvec{x}}_{i}^{\user2{^{\prime}}} {\varvec{\beta}} + {\varvec{u}}_{i}^{\user2{^{\prime}}} {\varvec{\eta}}} \right)\),where \(\lambda_{0} \left( t \right)\) is a nonnegative deterministic baseline hazard function and \(\left( {{\varvec{\alpha}}, {\varvec{\beta}},{\varvec{\eta}}} \right)\) are corresponding parameters. Usually the survival outcome is subject to censoring, and we use \(\delta_{i}\) to denote whether subject \(i^{\prime}{\text{s}}\) survival time is observed or censored.
Incorporating the grouping (pathway) information into the modeling process may improve the interpretability and prediction accuracy of the model. When groups overlap with each other, special techniques are required to account for the overlapping grouping information. According to Jacob et al. [13], we decompose the original coefficient vector into the sum of groupspecific potential effects, that is, \({\varvec{\beta}} = \mathop \sum \limits_{j = 1}^{G} {\varvec{\gamma}}^{j}\) where \({\varvec{\gamma}}^{j} = \left( {\gamma_{1}^{j} , \cdots ,\gamma_{p}^{j} } \right)^{^{\prime}}\) is the latent coefficient vector for group \(j\). For \(j = 1, \ldots ,G\) and \(k = 1, \ldots ,p\), we set \(\gamma_{k}^{j} = 0\) if gene k does not belong to group \(j\). Redefine the latent coefficient \({\varvec{\gamma}}^{j}\) by removing the zero elements therein, and form the latent coefficient vector \({\varvec{\gamma}}\) by stacking the vectors \({\varvec{\gamma}}^{1} , \ldots , {\varvec{\gamma}}^{G}\). Let \(d\) be the length of \(\user2{ \gamma }\). We can then rewrite \({\varvec{\beta}} = \user2{S\gamma }\), where \({\varvec{S}}\) is a \(p \times d\) matrix whose elements are 1 or 0. A simple example for illustration is given in Additional file 1: Appendix S1.
On the basis of the coefficient decomposition, the original regression model can be transformed into a new model, that is,\({\varvec{X}}_{N \times p} {\varvec{\beta}}_{p \times 1} = {\varvec{X}}_{N \times p} {\varvec{S}}_{p \times d} {\varvec{\gamma}}_{d \times 1} = \tilde{\user2{X}}_{N \times d} {\varvec{\gamma}}_{d \times 1}\), where \({ }{\varvec{X}} = \left( {{\varvec{x}}_{1} , .{ }.{ }.{ },{\varvec{x}}_{N} } \right)^{^{\prime}}\). Equivalently, this new model can be constructed by duplicating the columns of overlapping variables in the original design matrix. For the new transformed model, the hazard function for subject \(i\) in the Cox regression model is reexpressed as
The method (OGS) for GE interaction selection
We apply the OGS method to the environment and gene expression profile data with clinical survival trait to detect important main effects as well as interactions by incorporating prior pathway information. The steps of the OGS algorithm for GE interaction selection are described as follows.
Step1 We utilize the overlapping group Cox regression model to identify the candidate pathways based on the latent effect approach, which can be performed by the R package “grpregOverlap” [14]. We define \(\hat{M}_{main}\) as the selected set of pathways, and \(A = \left {\hat{M}_{main} } \right\) as the size of \(\hat{M}_{main}\).
Step 2 We utilize the sequence kernel association test (SKAT) to obtain the groupspecific significance, where each group is formed by the interaction between the genes of each candidate pathway selected in the first step and the environmental factors in Es, where Es is a set of environmental factors. Following Chen et al. [15], the SKAT statistic under the Cox regression model is defined as
Here \({\varvec{m}}\) is the vector of martingale residuals estimated from the null model by regressing survival outcomes on only the environmental covariates Es without considering the gene expression data; \({\varvec{R}}_{\left( k \right)} = \left[ {r_{\left( k \right)ij} } \right]_{N \times l}\), where \(l\) is the number of GE interaction pairs in the candidate pathway group \(k\), \(r_{\left( k \right)ij}\) is the \(j\)th GE interaction pair of \(i\)th subject in the candidate pathway group \(k\), and \({\varvec{W}}_{\left( k \right)}\) is a diagonal weight matrix that contains the weights of the \(l\) interaction pairs in the candidate pathway group \(k\). Suitable weights can improve the testing power [16]. Following [16], we consider an unsupervised weight manner that is defined as
where \(v_{i} = \frac{{Var\left( {r_{\left( k \right) \cdot i} } \right)}}{{\mathop \sum \nolimits_{j} Var\left( {r_{\left( k \right) \cdot j} } \right)}}\). That is, the square of the weight is a beta probability density function with specific parameters 1 and 25, evaluated at the ratio of the sample variance of the ith variable in the data to that of all variables.
Based on the null model by regressing survival outcomes on only the environmental covariates Es without gene covariates, let \({\varvec{E}}\) is an \(N \times q\) design matrix for the \(q\) environmental covariates, and \({\varvec{V}} = {\text{diag}}\left( {c_{1} , \ldots ,c_{N} } \right)  {\varvec{PP}}^{^{\prime}}\), where \({\varvec{P}}\) is an \(N \times \nu\) matrix with element \(p_{ij}\) the baseline hazard for individual \(i\) at ordered failure time \(t_{\left( j \right)}\),\({ }j = 1, \ldots ,\nu ,\) and \(c_{i}\) the cumulative hazard for individual \(i\) at observed time \(t_{i}\).
Let \(\sum_{\left( k \right)} = {\varvec{W}}_{\left( k \right)} {\varvec{R}}_{\left( k \right)}^{^{\prime}} \left( {{\varvec{V}}  {\varvec{VE}}\left( {\user2{E^{\prime}VE}} \right)^{  1} {\varvec{E}}^{^{\prime}} {\varvec{V}}} \right){\varvec{R}}_{\left( k \right)} {\varvec{W}}_{\left( k \right)}\) be the covariance matrix of the vector \({\varvec{W}}_{\left( k \right)} {\varvec{R}}_{\left( k \right)} {\varvec{m}}\) under the null hypothesis of all geneenvironment interaction pairs in the candidate pathway group \(k\) having null effects. Under the null hypothesis, the SKAT statistic follows a weighted sum of chisquare distribution:
where \(\lambda_{\left( k \right)j}\),j \(=\) l,…, \(l\) are the eigenvalues of \(\sum_{\left( k \right)}\), and \(\chi_{1,j}^{2}\)'s are independent 1df central chisquare random variables.
We use the Davies method [17] to approximate the tail probability of the mixture chisquare distribution, which can be calculated by the R package “CompQuadForm” [18]. Generally speaking, the Davies method is accurate [19]. The pvalues \(\left\{ {p_{1, \ldots ,} } \right.\left. {p_{A} } \right\}\) are used as our group screening measure; a smaller pvalue corresponds to a higher group importance and therefore leads to a higher priority of selection.
Step 3 In the third step, we select significant GE interactions based on the permutation procedure with the cutoff point determined by the softthresholding rule, where the permutation is applied to the covariate matrix consisting of both genes and environmental covariates. We randomly permute the original data \(\left\{ {Y_{i} ,{\varvec{e}}_{i} } \right.,\left. {{\varvec{x}}_{i} } \right\}\) to form the permuted data.
\(\left\{ {Y_{i} ,{\varvec{e}}_{\pi \left( i \right)} } \right.,\left. {{\varvec{x}}_{\pi \left( i \right)} } \right\}\) following the null model, where \(Y_{i} = \left( {t_{i} ,\delta_{i} } \right)\) is the survival outcome, and \({ }\left\{ {\pi \left( 1 \right),. . .,\left. {\pi \left( N \right)} \right\}} \right.\) is a random permutation of the index. Then we apply again the SKAT test for each of the candidate pathway groups with the permuted data to obtain the group screening measures (pvalues) \(\left\{ {p_{1}^{*} } \right.,. . .,\left. {p_{A}^{*} } \right\}\) and the desired threshold \(\tau\) is obtained by taking the minimum of \(\left\{ {p_{1}^{*} } \right.,. . .,\left. {p_{A}^{*} } \right\}\).
To obtain a stable threshold, we repeat the above permutation process more times and define a cutoff point to select candidate pathway groups by using the median of the obtained desired thresholds, that is \(C_{int} =\) median \(\left\{ {\tau_{1} } \right.,. . .,\left. {\tau_{I} } \right\}\). We adopt \(C_{int}\) to select candidate pathway groups, i.e.
is our selected set of candidate pathway groups. In practice, we take \({ }I\) as 30.
Note that he permutation procedure used to determine a datadriven threshold was similar to that proposed by Fan et al. [20], which implicitly assumes that the censoring mechanism is independent of all covariates. This stronger assumption on censoring mechanism will not invalidate the permutation procedure since what we indeed require for the null hypothesis is that the tuple of time and censoring indicator is independent of all the covariates.
Step 4 Finally, in the framework of joint modeling, based on environmental covariates, and selected genes and GE interactions, a penalized regression with an appropriate penalty is used to establish the final survival prediction model. Therefore, we apply the penalized Cox’s regression together with the Ridge or Lasso penalty to build the final prediction model based on all environmental variables, genes in \({ }\hat{M}_{main}\) and GE interactions in \(\hat{M}_{int}\). The penalized Cox regression model with the Ridge or Lasso penalty can be obtained through the R package “glmnet” [21].
In the first step and the second step of the new OGS method, as the original OGS method in Wang and Chen [11], we still apply the overlapping group selection method to identify the causal pathways and the SKAT test to obtain the groupspecific significance. However, the new OGS method improves the original one [11] by using an unsupervised manner for weight construction in the second step of the OGS procedure. In the third step of the new OGS method, we perform multiple permutations to obtain a stable threshold for interaction group selection, where the permutation process is the same as in the original OGS, except that permeation is now applied to a covariate matrix consisting of genes and environmental factors. Finally, the penalized Cox’s regression with the Ridge or Lasso penalty is still applied to build the final survival prediction model based on the environmental factors, the selected genes and the selected GE interactions. These modifications bring better performance for model selection, estimation, and prediction.
Results
Comparison with alternative methods in variable selection, estimation, and prediction
In the following simulations, we study the performances of the proposed OGS approach in variable selection, estimation and prediction, and compare them with the performance of the “Oracle”, “SIS Lasso”, “Ordinary Lasso” and “GSIS SCAD” methods. The “Oracle” method is based on the underlying true model, which is known in the simulations but unknown in real applications. The “SIS Lasso” method [8] uses univariate Cox’s regression to select environmental variables, genes, and GE interactions one by one, with a prefixed number \(\left( {\frac{N}{\log \left( N \right)}} \right)\) of topranked predictors as our candidate model, and then includes the selected variables in a penalized Cox regression model with the Lasso penalty to form the final prediction model. The “Ordinary Lasso” method is the penalized Cox regression model with the Lasso penalty considering all environmental variables, genes, and GE interactions in the model. The “GSIS SCAD” method is an overlapping group Cox regression model with the SCAD penalty based on the latent effect approach, which can be performed by the R package “grpregOverlap” [14].
For performance comparison, we adopt the root mean squared error (RMSE) to measure estimation accuracy, defined as
where \(S\) is the size of the full model including all main and interaction covariates and \({\varvec{\theta}}^{\prime } = \left( {{\varvec{\alpha}}^{\prime } ,{\varvec{\beta}}^{\prime } ,{\varvec{\eta}}^{\prime } } \right)\).
To evaluate the estimation performance, we report RMSE.M, the mean of the root mean square errors of 200 simulations. To evaluate the performance of the selection accuracy, we consider various criteria: P.int is the proportion of the underlying effective GE interaction variables contained by the selected GE interaction variables; Sen. is the sensitivity, defined as the proportion of the underlying effective variables being selected; Spe. is the specificity, defined as the proportion of the underlying ineffective variables not being selected. We also report the median size of the selected models, S.model, in 200 simulations. To evaluate the performance of survival prediction, we consider three measures of prediction accuracy: the deviance, the cindex proposed by Harrell et al. [22] and timedependent AUC proposed by Blanche et al. [23]; smaller deviance or larger cindex and timedependent AUC corresponds to better prediction accuracy. The median values of these measure over 200 simulations are reported.
Let \(\hat{\user2{\theta }}^{\user2{^{\prime}}} = \left( {\hat{\user2{\alpha }}^{\user2{^{\prime}}} ,\hat{\user2{\beta }}^{\user2{^{\prime}}} ,\hat{\user2{\eta }}^{\user2{^{\prime}}} } \right)\) an estimator of the (penalized) Cox regression parameter in a prediction model obtained from the training dataset. Let \({ }\left( {t_{i}^{\user2{*}} ,\delta_{i}^{*} ,{\varvec{e}}_{i}^{\user2{*}} ,{\varvec{x}}_{i}^{\user2{*}} ,{\varvec{w}}_{i}^{\user2{*}} } \right)\) be the survival and covariate data of subject \(i{ }\) in the test data. Define \(\left( {{\varvec{e}}_{i}^{*\prime } ,{\varvec{x}}_{i}^{*\prime } ,{\varvec{w}}_{i}^{*\prime } } \right)\hat{\user2{\theta }}\) as the prognosis index (PI) value for subject \(i\) in the test data. The Cox test is defined as the pvalue of PI when PI is used as the covariate in the univariate Cox model for survival outcomes in the test data. Similarly, the LRtest is the pvalue of the logrank test for the null hypothesis of equal survival between the “good” and “poor” prognostic groups in the test data, where the “good” and “poor” prognostic groups are classified according to whether the PI value is higher or lower than the median PI value in the test data. Smaller Coxtest and LRtest values correspond to better predictive power.
In simulations we consider survival data with a cohort size 300 in the training set, where each subject’s survival time follows the Cox proportional hazards model
with the covariates \({\text{e}}\) and \({\varvec{x}}\) jointly following a multivariate standard normal distribution with correlation \(corr\left( {{\varvec{e}}_{ \cdot j} ,{\varvec{e}}_{ \cdot k} } \right) = 0.3^{{\left {j  k} \right}}\) and \(corr\left( {{\varvec{x}}_{ \cdot j} ,{\varvec{x}}_{ \cdot k} } \right) = 0.5^{{\left {j  k} \right}}\), and \(corr\left( {{\varvec{e}}_{ \cdot j} ,{\varvec{x}}_{ \cdot k} } \right) = 0\) for all \(j, k\). The censoring time distribution follows a uniform distribution. We then generate survival data, independent of the training data, with a cohort size 100 as the test data to assess the prediction accuracy for different methods.
In this simulation study, we consider 5 environmental variables and assume that the first 4 are related to the survival outcome, and the corresponding effects are 1.5, 2.25, 3, 1.5. On the other hand, the gene covariates considered contain 25 groups that have different group sizes (the numbers of genes) and may share with each other some of the genes. The group sizes and the overlapping structure (i.e. the number of the shared genes between two overlapping groups) are shown in Table 1, where the overlapping groups are shown side by side. For example, group 1 contains 3 genes, as group 2 does, but the two groups contain only 5 unique genes, and 1 gene is shared between the two groups. As a result, there are a total of 500 genes and 632 groupspecific latent effects (see “Methods” section) in this example. Figure 2 displays the gene network structure. Groups 1, 7, 13, and 19 are set to be effective, and genes in each of them have constant latent effects of 3, 3, 2, and − 2, respectively. In addition, effective interactions (E1 * G22, E1 * G24, E2 * G26) with the corresponding effects \(\left( {1.5,1.5,2} \right)\) and (E2 * G78, E3 * G83, E3 * G88) with the corresponding effects \(\left( {  1,  1.5,  2} \right)\) are in group 7 and group 13, respectively. The number of effective environment, gene, or GE interaction factors is 91 among a total of 3,005 such factors. We examine the performances of different methods under a censoring rate of 30%, 50%, or 70%. We also conduct further simulations to demonstrate the performances of the new proposal, whose details and results can be seen in Additional file 1: Appendix S2.
Summary of simulation results
From the simulation results shown in Table 2 and Additional file 1: Table S3 in Appendix S2 where the gene network structure is complex, we see that the OGS method using the Lasso or Ridge penalty performs substantially better than the “SIS Lasso”, “Ordinary Lasso” and “GSIS SCAD” methods in variable selection, effect estimation, and survival prediction. On the other hand, simulation results shown in Additional file 1: Table S3 of Appendix S2 where a simpler gene structure is considered, the performance of OGS with Lasso or Ridge penalty is worse than that of the “Ordinary Lasso” method when the censoring rate is 30%; while when the censoring rate is higher (50% or 70%), the OGS with the Lasso or Ridge penalty performs better than the “Ordinary Lasso”.
Furthermore, further simulation studies with a small cohort size are conducted under the scenario where all simulated settings are the same as those in the previous simulation study except for a cohort size defined as 150/50 (training/testing). We still obtain similar numeric results patterns; these corresponding results are shown in Additional file 1: Tables S5–S7 in Appendix S3. In addition, we also conduct a simulation study that the overlapping genes occur among three groups instead of just two, where we still obtain result patterns similar to those under two groups; please see Additional file 1: Table S8 in Appendix S2.
Real data application: TCGA HNSCC data
The TCGA HNSCC RNASeq expression data, together with the phenotype data containing the survival time and censoring status data, can be downloaded from the R package’TCGAbiolinks’ [24], or’UCSCXenaTools’ [25]. After excluding patients with missing survival time data, our analysis is focused on the subset of the TCGA HNSCC data with 517 patients and 20,501 gene expression variables. The censoring rate of the survival time in the data is about 58%. The TCGA HNSCC clinical information data can be obtained from the’FireBrowse’ database [26].
Since the number of cancerrelated genes is expected to be limited, we conduct prescreening using nonparametric inverse probabilityofcensoring weighted (IPCW) Kendall’s tau correlation [27], which can also improve stability for feature selection. The top 2000 genes with the largest absolute IPCW Kendall’s tau correlation are selected for downstream analysis.
The five E factors analyzed including AJCC pathologic stage nodes, AJCC pathologic stage tumor, age, gender, and ICD O3 site. Summary information for these clinical variables is reported in the Table 3. Some of the clinical variables contain missing values, and we use the sparse boosting method [28] in the R package “GEInter” [29] to perform multiple imputation for the missing values in the clinical variables.
The PTReg method [5] was developed to conduct robust joint analysis using penalized trimmed regression with the MCP penalty under the AFT model for the rightcensored survival outcome. We are interested in comparing the PTReg approach with our proposed OGS approach in the real data application. The whole 12,005 main and GE interaction predictors are considered for the “SIS Lasso”, “Ordinary Lasso”, and “PTReg” methods. For the OGS method, among the 2000 preselected genes, prior pathway information for 1489 genes, which are mapped into 6015 pathways based on the GO biological process database, is utilized. The 511 genes that are not mapped into any pathways in the GO biological process database are either discarded or put together as a group for the latent effect analysis in the OGS method, leading to a total of 8939 or 12,005 main and GE interaction effects considered.
We take ten random splits of the whole data into 413:104 training/test sets to evaluate the performances of all the methods considered in the TCGA HNSCC data application. Table 4 reports the median of the survival prediction results over the ten folds when the 511 ungrouped genes are discarded from analysis. We see that the performance of the OGS method with Ridge or Lasso penalty is better than the “SIS Lasso”, “Ordinary Lasso”, and “PTReg” methods. The OGS approach putting the 511 ungrouped genes together as an additional group results in the same prediction model as the one discarding the ungrouped genes. Also, the OGS analysis results based on the pathway information obtained from other annotated gene set databases, including GO cellular component (GOCC), GO molecular function (GOmf), and KEEG, are compared with the other methods for survival prediction in the TCGA HNSCC data, as shown in Additional file 1: Table S9. These additional results based on pathway information from alternative gene set databases still reveal that the OGS approach performs better than the other methods.
Based on one random split of the data, Fig. 3 displays the Kaplan–Meier survival curves of the “good” and “poor” prognosis groups in the test data. It can be seen that the OGS method separates the two groups better than other methods. When applying the OGS with the Lasso penalty to the entire data based on the GO biological process database, we identify several important GE interaction effects, and obtain the corresponding parameter estimates, as shown in Additional file1: Table S10. We note that the clinical variable “Age” interacts with several genes, and most of these genes, such as “CAMP” [30], “DEFB1” [31], “MAP2K7” [32] have been shown to be related to HNSCC. And “Age” factor has been shown to be related to HNSCC [33].
Real data application: TCGA ESCA data
The TCGA ESCA RNASeq expression data, together with the phenotype data containing the survival time and censoring status data, can be downloaded from the R package’TCGAbiolinks’ [24], or’UCSCXenaTools’ [25]. After excluding patients with missing survival time data, our analysis is focused on the subset of the TCGA ESCA data with 368 patients and 20,501 gene expression variables. The censoring rate of the survival time in the data is about 58%. The TCGA ESCA clinical information data can be obtained from the’FireBrowse’ database [26].
Since the number of cancerrelated genes is expected to be limited, we conduct prescreening using nonparametric inverse probabilityofcensoring weighted (IPCW) Kendall’s tau correlation [27], which can also improve stability for feature selection. The top 2000 genes with the largest absolute IPCW Kendall’s tau correlation are selected for downstream analysis. The seven clinical variables whose E effects are analyzed include age, gender, esophageal tumor central location, person neoplasm cancer status, race, BMI, and AJCC pathologic stage, and their summary information is reported in the Table 5. Some of the clinical variables contain missing values, and we use the sparse boosting method in the R package “GEInter” to perform multiple imputation for the missing values in the clinical variables. Based on the GO biological process database, 1458 genes among the top 2000 genes are mapped into 4360 pathways and such prior pathway information is utilized in the OGS method. Excluding the genes without being mapped into any pathway, there are a total of 11,671 main and GE interaction covariates in the proposed OGS method. On the other hand, a total of 16,007 main and GE interaction predictors are considered in the “SIS Lasso”, “Ordinary Lasso”, and “PTReg” methods.
We take ten random splits of the whole TCGA ESCA data into 294:74 training/test sets to evaluate the performances of all methods for survival prediction in the TCGA ESCA data. Table 6 reports the median of the survival prediction results among the ten folds. We see that the performance of the OGS method with the Ridge or Lasso penalty is better than the “SIS Lasso”, “Ordinary Lasso”, and “PTReg” methods. In addition to the OGS analysis discarding the 542 genes without mapped pathways in the GO biological process database, we also perform the OGS analysis putting the unmapped genes together as an additional group, and the two different implements of the OGS method result in the same prediction model. Also, different annotated gene sets databases, including GOCC, GOMF, and KEEG, are also used in the OGS approach to catch pathway information. As shown in Additional file 1: Table S11. the OGS method still outperforms than the other methods using such alternative pathway information.
Based on one random split of the data, Fig. 4 displays the Kaplan–Meier survival curves for the “good” and “poor” prognosis groups in the test data. It is seen that the two survival curves are better separated by the OGS approach than other methods. When applying the OGS with the Lasso penalty for whole data based on the GO biological process database, we identify and estimate several important GE interaction effects, which are shown in Additional file 1: Table S12. We note that the clinical variable “Age” interacts with several genes, and most of these genes, such as “CD40LG” [34], “DEK” [35], “IL6” [36] have been shown to be related to HNSCC. And two “Weight” and “Age” factors have been shown to be related to HNSCC ([37, 38]).
Conclusion
In this article, we propose a twostage overlapping group screening procedure to identify important main and geneenvironment (GE) interaction effects efficiently for survival prediction. In the first stage, the new proposal utilizes the latent effect approach to identify candidate gene pathways for survival prediction, adjusting for the E and GE interaction factors. Different gene pathways are allowed to overlap with each other, i.e., to share common genes. In the second stage, we utilize the SKAT approach [15], which is a popular group testing approach, to obtain the grouplevel pvalue of each candidate gene pathway as well as the associated GE factors, adjusting for the E factors. A pathway as well as the associated GE factors is then selected when their grouplevel pvalue is smaller than the one under covariate (both G and E factors) permutation. The final survival prediction model is constructed by a Cox model based on the E factors, the selected gene pathways as well as the associated GE factors, subject to the Ridge or Lasso penalty. Simulation and real data studies demonstrate that, compared with the analysis that ignores pathway information, the new proposal can significantly improve the accuracy of gene and geneenvironment interaction selection, as well as the resulting survival predictions.
The new OGS method aims at geneenvironment interaction, while the OGS in Wang and Chen [11] aims at gene–gene interaction. The new OGS method improves the original one [11] by using an unsupervised manner for weight construction in step 2 of the OGS procedure, and performing multiple permutations to obtain a stable threshold for interaction group selection in step 3 of the OGS procedure. These modifications bring better performance for model selection, estimation, and prediction.
Discussion
The OGS method is flexible. Although we focus on survival prediction based on the Cox proportional hazards model, the same idea can straightforwardly apply to other outcome models, such as the proportional odds survival time model, the logistic regression model for binary outcomes, and the multinomial logistic regression model for multiclass outcomes. For example, the SKAT statistic involved in the OGS method can be modified simply by using the residuals from the alternative model under consideration.
Since the gene data is highdimensional, following the conventional feature screening idea, the initial step of the OGS method is to use some univariate approach to screen gene variables for downstream analysis. Such a supervised screening procedure is common (e.g., Fan et al. [20], Xu et al. [6], and Xu et al. [5]) in literature, and is in fact conducted after splitting the whole sample into the training and testing subsamples. In other words, when we evaluate the prediction performance using the test sample, the effect of supervised feature screening procedure has been taken into account and the evaluation is fair. We use the nonparametric inverse probability of censoring weighted (IPCW) Kendall’s tau correlation [27] to select the top 2,000 genes for downstream analysis. The IPCW Kendall’s measure it can be applied to a wide range of survival models, and the Kendall’s tau measure is not influenced by outliers, which is a major concern in gene expression data where contaminated data are common.
As in Jacob et al. [13] and Zeng and Breheny [14], the latent effect model is indeed overparameterized, and the effects of each gene decomposed into the pathways are latent and unobserved. Owing to this nature of over parametrization, the penalized regression (group lasso) method is needed and employed for parameter estimation. Using such overparametrization and penalized regression techniques, it is helpful to identify groupspecific effects from the original Cox model regression parameters.
The OGS method employs the latent effect approach to extract gene network structure information in terms of gene pathways. This requires a predesignated gene group (pathway) structure and is limited to genes that can be assigned to at least one group (pathway). It is interesting to study how to relax these restrictions to improve the performance of feature selection and survival prediction in the presence some covariate network structure.
In fact, the OGS procedure does not respect the hierarchy between main and interaction effects. We agree with that, if the hierarchy principle can be incorporated, the accuracy of interaction selection may have improved strength. Wu et al. [7] utilize a decomposition technique to explain the interaction hierarchy, and such decomposition technique may be incorporated into the OGS procedure as a further extension.
Moreover, the OGS method does not select at both the pathway level and the gene level simultaneously. How to improve the OGS in selecting pathways and genes simultaneously will be investigated in our future work. The last step of the OGS is to apply penalized Cox regression together with Ridge or Lasso penalty to build the final prediction model, we can try to combine the other penalties like MCP, Adaptive Lasso to enhance the robustness of the estimation of the OGS (Jiang et al. [39], Ren et al. [40]). This issue will also be investigated in our future work.
Availability of data and materials
The TCGA ESCA, and HNSCC genomic data with survival traits and pathway information database analyzed during this study are all available at figshare website https://doi.org/10.6084/m9.figshare.16816654.v6. R codes for the simulation studies and real data are available at figshare website https://doi.org/10.6084/m9.figshare.16816303.v3.
Abbreviations
 AFT:

Accelerated failure time
 AJCC:

American joint committee on cancer
 AUC:

Area under the curve
 BMI:

Body mass index
 ESCA:

Esophageal carcinoma
 GEInter:

Geneenvironment interaction
 GE:

Geneenvironment
 GOBP:

Gene ontology biological process
 GOCC:

Gene ontology cellular component
 GOMF:

Gene ontology molecular function
 GSIS:

Grouping sure independence screening
 HNSCC:

Head and neck squamous cell carcinoma
 IPCW:

Inverse probabilityofcensoring weight
 Lasso:

Least absolute shrinkage and selection operator
 LRtest:

Logrank test
 MCP:

Minimax concave penalty
 OGS:

Overlapping grouped screening
 PTReg:

Penalized trimmed regression
 PI:

Predictor index
 RMSE:

Root mean squared error
 SCAD:

Smoothly clipped absolute deviation
 SIS:

Sure independence screening
 SKAT:

Sequence kernel association test
 SNPs:

Singlenucleotide polymorphisms
 TCGA:

The cancer genome atlas
References
McAllister K, Mechanic LE, Amos C, et al. Current challenges and new opportunities for geneenvironment interaction studies of complex diseases. Am J Epidemiol. 2017;186(7):753–61. https://doi.org/10.1093/aje/kwx227.
Thomas D. Gene–environmentwide association studies: emerging approaches. Nat Rev Genet. 2010;11(4):259–72. https://doi.org/10.1038/nrg2764.
Batchelor TT, Betensky RA, Esposito JM, et al. Agedependent prognostic effects of genetic alterations in glioblastoma. Clin Cancer Res. 2004;10(1 Pt 1):228–33. https://doi.org/10.1158/10780432.ccr08413.
Zhou F, Ren J, Lu X, Ma S, Wu C. Geneenvironment interaction: a variable selection perspective. Methods Mol Biol. 2021;2212:191–223. https://doi.org/10.1007/9781071609477_13.
Xu Y, Wu M, Zhang Q, Ma S. Robust identification of geneenvironment interactions for prognosis using a quantile partial correlation approach. Genomics. 2019;111(5):1115–23. https://doi.org/10.1016/j.ygeno.2018.07.006.
Xu Y, Wu M, Ma S, Ahmed SE. Robust geneenvironment interaction analysis using penalized trimmed regression. J Stat Comput Simul. 2018;88(18):3502–28. https://doi.org/10.1080/00949655.2018.1523411.
Wu M, Zhang Q, Ma S. Structured geneenvironment interaction analysis. Biometrics. 2020;76(1):23–35. https://doi.org/10.1111/biom.13139.
Fan J, Lv J. Sure independence screening for ultrahigh dimensional feature space (with discussion). J R Stat Soc B. 2008;70(5):849–911. https://doi.org/10.1111/j.14679868.2008.00674.x.
Wang JH, Chen YH. Networkadjusted Kendall’s Tau measure for feature screening with application to highdimensional survival genomic data [published online ahead of print, 2021 Jan 29]. Bioinformatics. 2021;37:15. https://doi.org/10.1093/bioinformatics/btab064.
Subramanian A, Tamayo P, Mootha VK, et al. Gene set enrichment analysis: a knowledgebased approach for interpreting genomewide expression profiles. Proc Natl Acad Sci USA. 2005;102(43):15545–50. https://doi.org/10.1073/pnas.0506580102.
Wang JH, Chen YH. Overlapping group screening for detection of genegene interactions: application to gene expression profiles with survival trait. BMC Bioinform. 2018;19:335. https://doi.org/10.1186/s1285901823722.
Cancer Genome Atlas Research Network. Comprehensive genomic characterization defines human glioblastoma genes and core pathways. Nature. 2008;455(7216):1061–1068. https://doi.org/10.1038/nature07385
Jacob L, Obozinski G, Vert JP. Group lasso with overlap and graph lasso. In: Proceedings of the 26th annual international conference on machine learning. Montreal: ACM; 2009. p. 433–40.
Zeng Y, Breheny P. Overlapping group logistic regression with applications to genetic pathway selection. Cancer Inform. 2016;15:179–87. https://doi.org/10.4137/CIN.S40043.
Chen H, Lumley T, Brody J, et al. Sequence kernel association test for survival traits. Genet Epidemiol. 2014;38(3):191–7. https://doi.org/10.1002/gepi.21791.
Wu MC, Lee S, Cai T, Li Y, Boehnke M, Lin X. Rarevariant association testing for sequencing data with the sequence kernel association test. Am J Hum Genet. 2011;89(1):82–93. https://doi.org/10.1016/j.ajhg.2011.05.029.
Davies RB, Algorithm AS. 155: The distribution of a linear combination of random variables. J R Stat Soc Ser C Appl Stat. 1980;29(3):323–33.
Duchesne P, Lafaye De Micheaux P. Computing the distribution of quadratic forms: Further comparisons between the LiuTangZhang approximation and exact methods. Comput Stat Data Anal. 2010;54(4):858–62.
Wu B, Guan W, Pankow JS. On efficient and accurate calculation of significance Pvalues for sequence kernel association testing of variant set. Ann Hum Genet. 2016;80(2):123–35. https://doi.org/10.1111/ahg.12144.
Fan J, Feng Y, Song R. Nonparametric independence screening in sparse ultrahigh dimensional additive models. J Am Stat Assoc. 2011;106(494):544–57. https://doi.org/10.1198/jasa.2011.tm09779.
Simon N, Friedman J, Hastie T, Tibshirani R. Regularization paths for cox’s proportional hazards model via coordinate descent. J Stat Softw. 2011;39(5):1–13. https://doi.org/10.18637/jss.v039.i05.
Harrell FE Jr, Lee KL, Mark DB. Multivariable prognostic models: issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors. Stat Med. 1996;15(4):361–87. https://doi.org/10.1002/(SICI)10970258(19960229)15:4%3c361::AIDSIM168%3e3.0.CO;24.
Blanche P, Dartigues JF, JacqminGadda H. Estimating and comparing timedependent areas under receiver operating characteristic curves for censored event times with competing risks. Stat Med. 2013;32(30):5381–97.
Colaprico A, Silva TC, Olsen C, et al. TCGAbiolinks: an R/Bioconductor package for integrative analysis of TCGA data. Nucleic Acids Res. 2016;44(8): e71. https://doi.org/10.1093/nar/gkv1507.
Wang S, Liu X. The UCSCXenaTools R package a toolkit for accessing genomics data from UCSC Xena platform, from cancer multiomics to singlecell RNAseq. J Open Sour Softw. 2019;4(40):1627. https://doi.org/10.21105/joss.01627.
Deng M, Brägelmann J, Kryukov I, SaraivaAgostinho N, Perner S 2017 FirebrowseR: an R client to the Broad Institute's Firehose Pipeline. Database (Oxford); 2017. https://doi.org/10.1093/database/baw160
Wang JH, Chen YH. Interaction screening by Kendall’s partial correlation for ultrahighdimensional data with survival trait. Bioinformatics. 2020;36(9):2763–9. https://doi.org/10.1093/bioinformatics/btaa017.
Wu M, Ma S. Robust semiparametric geneenvironment interaction analysis using sparse boosting. Stat Med. 2019;38(23):4625–41. https://doi.org/10.1002/sim.8322.
Wu M, Qin X, Ma S. GEInter: an R package for robust geneenvironment interaction analysis. Bioinformatics. 2021;37(20):3691–2. https://doi.org/10.1093/bioinformatics/btab318.
Huang C, Chen L, Savage SR, et al. Clinical proteomic tumor analysis consortium proteogenomic. insights into the biology and treatment of HPVnegative head and neck squamous cell carcinoma. Cancer Cell. 2021;39(3):361–79. https://doi.org/10.1016/j.ccell.2020.12.007.
Li L, Wang XL, Lei Q, et al. Comprehensive immunogenomic landscape analysis of prognosisrelated genes in head and neck cancer. Sci Rep. 2020;10(1):6395. https://doi.org/10.1038/s41598020631488.
Ren Z, Zhang L, Ding W, et al. Development and validation of a novel survival model for head and neck squamous cell carcinoma based on autophagyrelated genes. Genomics. 2021;113(1 Pt 2):1166–75. https://doi.org/10.1016/j.ygeno.2020.11.017.
Pérez Sayáns M, Chamorro Petronacci CM, Lorenzo Pouso AI, et al. Comprehensive genomic review of TCGA head and neck squamous cell carcinomas (HNSCC). J Clin Med. 2019;8(11):1896. https://doi.org/10.3390/jcm8111896.
Wang Q, Li P, Wu W. A systematic analysis of immune genes and overall survival in cancer patients. BMC Cancer. 2019;19(1):1225.
Matrka MC, Cimperman KA, Haas SR, et al. Dek overexpression in murine epithelia increases overt esophageal squamous cell carcinoma incidence. PLoS Genet. 2018;14(3): e1007227.
Qiao Y, Zhang C, Li A, et al. IL6 derived from cancerassociated fibroblasts promotes chemoresistance via CXCR7 in esophageal squamous cell carcinoma. Oncogene. 2018;37:873–83.
LopezAlvarenga JC, Vargas JA, Lopez LH, et al. Effect of body weight and esophageal damage on the severity of gastroesophageal reflux symptoms. Mexican GERD working group. Arch Med Res. 2009;40(7):576–81. https://doi.org/10.1016/j.arcmed.2009.08.003.
Codipilly DC, Sawas T, Dhaliwal L, et al. Epidemiology and outcomes of youngonset esophageal adenocarcinoma: an analysis from a populationbased database. Cancer Epidemiol Biomark Prev Publ Am Assoc Cancer Res Cospons Am Soc Prev Oncol. 2021;30(1):142–9. https://doi.org/10.1158/10559965.EPI200944.
Jiang Y, Huang Y, Du Y, Zhao Y, Ren J, Ma S, Wu C. Identification of prognostic genes and pathways in lung adenocarcinoma using a Bayesian approach. Cancer Inform. 2017;16:1176935116684825.
Ren J, Zhou F, Li X, Ma S, Jiang Y, Wu C. Robust Bayesian variable selection for gene–environment interactions. Biometrics. 2022. https://doi.org/10.1111/biom.13670.
Acknowledgements
We are very grateful to the associate editor and the referees for their very valuable comments and suggestions that helped to improve the manuscript.
Funding
This research is supported by the grant MOST 1102118M035001MY2 from the Ministry of Science and Technology of Republic of China (Taiwan).
Author information
Authors and Affiliations
Contributions
Conceived and designed the experiments: JH. Analyzed the data: JH, KH. Wrote the first draft of the manuscript: JH. Contributed to the writing of the manuscript: JH, YH. Agree with manuscript results and conclusions: JH, KH, YH. Jointly developed the structure and arguments for the paper: JH, KH, YH. Made critical revisions and approved final version: JH, YH. All authors reviewed and approved of the final manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Additional file 1:
The full detail of the latent effect approach, a series of simulation studies, simulated settings where some genes are shared by three groups, and the real data analysis.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Wang, JH., Wang, KH. & Chen, YH. Overlapping group screening for detection of geneenvironment interactions with application to TCGA highdimensional survival genomic data. BMC Bioinformatics 23, 202 (2022). https://doi.org/10.1186/s12859022047507
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s12859022047507
Keywords
 Geneenvironment interaction
 Joint model
 Lasso
 Overlapping group screening
 Survival prediction
 TCGA