 Methodology article
 Open Access
 Published:
Overlapping group screening for detection of genegene interactions: application to gene expression profiles with survival trait
BMC Bioinformatics volume 19, Article number: 335 (2018)
Abstract
Background
The development of a disease is a complex process that may result from joint effects of multiple genes. In this article, we propose the overlapping group screening (OGS) approach to determining active genes and genegene interactions incorporating prior pathway information. The OGS method is developed to overcome the challenges in genomewide data analysis that the number of the genes and genegene interactions is far greater than the sample size, and the pathways generally overlap with one another. The OGS method is further proposed for patients’ survival prediction based on gene expression data.
Results
Simulation studies demonstrate that the performance of the OGS approach in identifying the true main and interaction effects is good and the survival prediction accuracy of OGS with the Lasso penalty is better than the ordinary Lasso method. In real data analysis, we identify several significant genes and/or epistasis interactions that are associated with clinical survival outcomes of diffuse large Bcell lymphoma (DLBCL) and nonsmallcell lung cancer (NSCLC) by utilizing prior pathway information from the KEGG pathway and the GO biological process databases, respectively.
Conclusions
The OGS approach is useful for selecting important genes and epistasis interactions in the ultrahigh dimensional feature space. The prediction ability of OGS with the Lasso penalty is better than existing methods. The OGS approach is generally applicable to various types of outcome data (quantitative, qualitative, censored event time data) and regression models (e.g. linear, logistic, and Cox’s regression models).
Background
Discovering important pathways, genes, and genegene interactions that account for the phenotype of interest has continued to be a key challenge in genomewide expression analysis [1]. Under this highdimensional data setting, single and multiple biomarker (e.g. gene) tests commonly used usually have limited power to detect causal biomarkers associated with the clinical phenotypes. To improve the power, analyses incorporating external biological information have been proposed. For example, genebased analyses group the singlenucleotide polymorphisms (SNPs) under study into genes, and pathwaybased analyses group the genes under study into some biologically meaningful pathways; both types of multiple biomarker analyses have shown to be effective in detecting causal association signals and become increasingly popular. The analyses incorporating external biological information are particularly useful for detecting interaction effects among biomarkers, since the number of interaction effects grows quickly with the number of biomarkers and hence traditional statistical tests lose power.
To identify causal interaction effects of singlenucleotide polymorphisms (SNPs) on a quantitative or disease trait, Fang et al. [2] develop a twostage grouped sure independence screening (TSGSIS) procedure using genebased SNP sets. Simulation studies demonstrate that the performance of TSGSIS is better than some existing approaches, including the extended SVM [3] and the TSSIS method [4] without incorporating gene set information. A potential drawback for the TSGSIS method is that, it is developed in the setting where the groups (gene sets) are nonoverlapping and does not pay attention to settings with overlapping groups, which would be encountered in pathwaybased analyses where different pathways may involve some common genes. Besides, TSGSIS is focused specifically on the quantitative/qualitative outcome modeled by linear/logistic regression, and its application to the survival outcome has not been examined.
In this work, we propose the overlapping group screening (OGS) method, which is an extension and improvement of TSGSIS to accommodate overlapping group structures. Following the latent effect approach of Jacob et al. [5], we decompose the original biomarker effects into a sum of groupspecific latent effects, so that the original overlapping group structure can be transformed into a new nonoverlapping group structure. The latent effect approach has also been applied by Zeng and Breheny [6], Zhang et al. [7] and Tang et al. [8] to joint selection of genes and genetic pathways.
In addition, to perform association analyses with general types of traits including survival endpoints, OGS employs the sequence kernel association test (SKAT) proposed by Chen et al. [9] as the group screening criterion. SKAT is a supervised, flexible, and computationally efficient regression method to test for association between genetic variants/gene expressions in a region and a quantitative/qualitative/survival trait [10]. In particular, SKAT can quickly compute pvalues analytically by fitting the null model only once, and hence can be conveniently applied to genomewide data. Further, we utilize a datadriven thresholding strategy of Fan et al. [11] for screening candidate biomarkers/features, where we permute randomly the original biomarker data among subjects to decouple the association between the biomarker and outcome data, such that the permuted data follow the null distribution, from which a cutoff value for the SKAT pvalue to determine significance can be determined. After screening candidate biomarkers by the SKAT pvalues, we apply the Ridge or Lasso penalized regression method [12] to build the prediction model in OGS. The Lasso penalty, in particular, allows for automatic variable selection, which are commonly employed in highdimensional data such as genomewide data analysis.
We note that OGS maintains the advantages of TSGSIS, namely: (i) it can mitigate the issue of colinearity in regression analyses owing to correlations among biomarkers in the same gene/pathway, and (ii) it can substantially reduce the search space for interaction effects by utilizing the feature grouping structure.
The other objective of this article is to predict survival outcomes based on gene expression profiles, a topic which has received much attention in the recent decade ([13,14,15] and so on). Zhang et al. [16] indicate that one of the main shortcomings of the past studies is the failure to incorporate prior biological information into the prediction model, which may in turn lead to inaccurate prognosis and prediction. The survival prediction based on OGS addresses this problem. Simulation studies demonstrate that the OGS approach not only identifies correctly the causal biological pathways and epistasis, but also improves survival prediction compared with the alternative analyses that ignore the pathway information.
In the real data application, we utilize OGS to select several causal genes and epistasis that are associated with clinical survival outcomes of diffuse large Bcell lymphoma (DLBCL) and nonsmallcell lung cancer (NSCLC) patients. In these applications, we combine gene expression profile data with prior pathway information from the KEGG pathway database (for DLBCL) and the Gene Ontology (GO) biological process database (for NSCLC), which are popular public databases providing information on discovered pathways and their involved genes [17]. We use the pathway information available from these two databases to assign genes into groups based on the specific pathways in which they are involved, and conduct survival prediction based on the selected genes and genegene interactions.
Motivation
Suppose that there are q genes assigned to G possibly overlapping pathways, namely, a given gene may belong to more than one pathway. The schematic plot in Fig. 1 displays the natural hierarchal structure of genes related to pathways and shows the overlapping pathway structure present in the gene expression data. Each gene can belong to one or multiple pathways. It is of interest to identify genes, as well as their interactions, that are associated with the clinical survival outcome.
Survival model
Let X denote the N × q dimensional covariate matrix of the gene expression profiles with \( \mathrm{X}={\left({\mathrm{x}}_1,\mathrm{L},{\mathrm{x}}_N\right)}^{/}={\left(\begin{array}{ccc}{x}_{11}& \cdots & {x}_{1q}\\ {}\vdots & \ddots & \vdots \\ {}{x}_{N1}& \cdots & {x}_{Nq}\end{array}\right)}_{N\times q} \), where x_{ij} denotes the expression level of the jth gene of the ith subject. Assume the survival outcome T_{i} is related to the gene expression covariates x_{i} through a Cox’s regression model. In the Cox’s regression framework, the hazard function at time t for subject i‘s survival given the covariates is modeled as
where λ_{0}(t) is a nonnegative deterministic baseline hazard function and β = (β_{1}, ⋯, β_{q})^{/} is the logarithm of the risk ratio. Based on the Cox’s model, the survival function of subject i given his/her expression profile x_{i} is given by \( P\left({T}_i>t{\mathbf{x}}_i\right)=S\left(t{\mathbf{x}}_i\right)={S}_0{(t)}^{\exp \left({\mathbf{x}}_i^{/}\boldsymbol{\upbeta} \right)} \) with \( {S}_0(t)=\exp \left[{\int}_0^t{\lambda}_0(s) ds\right] \) the baseline survival. Usually the survival outcome is subject to censoring, and we use \( {t}_i^{\ast } \) to denote the observed survival time of subject i, and \( {\delta}_i^{\ast } \) is the indicator of whether the survival time of subject i is censored.
In practice, we can check the Cox’s model assumption by existing approaches, such as statistical tests and graphical diagnostics based on the Schoenfeld residuals [18].
Latent effect approach
Incorporating the grouping (pathway) information into the modeling process has the potential to improve the interpretability and the accuracy of the model. When the groups overlap one another, special techniques are required to adequately account for the overlapping grouping information. According to Jacob et al. [5], we decompose the original coefficient vector into a sum of groupspecific latent effects, namely, \( \boldsymbol{\upbeta} =\sum \limits_{j=1}^G{\boldsymbol{\upgamma}}^j \), where \( {\boldsymbol{\upgamma}}^j={\left({\gamma}_1^j,\mathrm{L},{\gamma}_q^j\right)}^{/} \) is the latent coefficient vector for group j. Here is a simple example for illustration [6]. Suppose that there are four genes that are involved in the four pathways, P1 = {g1, g2}, P2 = {g2, g3}, P3 = {g1, g3} and P4 = {g3, g4}, the original coefficient β can be decomposed as β = [β_{1}, β_{2}, β_{3}, β_{4}]^{/}
\( =\left[\begin{array}{l}{\gamma}_1^1\\ {}{\gamma}_2^1\\ {}{\gamma}_3^1\\ {}{\gamma}_4^1\end{array}\right]+\left[\begin{array}{l}{\gamma}_1^2\\ {}{\gamma}_2^2\\ {}{\gamma}_3^2\\ {}{\gamma}_4^2\end{array}\right]+\left[\begin{array}{l}{\gamma}_1^3\\ {}{\gamma}_2^3\\ {}{\gamma}_3^3\\ {}{\gamma}_4^3\end{array}\right]+\left[\begin{array}{l}{\gamma}_1^4\\ {}{\gamma}_2^4\\ {}{\gamma}_3^4\\ {}{\gamma}_4^4\end{array}\right] \) \( =\left[\begin{array}{l}{\gamma}_1^1\\ {}{\gamma}_2^1\\ {}0\\ {}0\end{array}\right]+\left[\begin{array}{l}0\\ {}{\gamma}_2^2\\ {}{\gamma}_3^2\\ {}0\end{array}\right]+\left[\begin{array}{l}{\gamma}_1^3\\ {}0\\ {}{\gamma}_3^3\\ {}0\end{array}\right]+\left[\begin{array}{l}0\\ {}0\\ {}{\gamma}_3^4\\ {}{\gamma}_4^4\end{array}\right] \)
Based on the coefficient decomposition, the original regression model can be transformed into a new model, i.e. \( {\mathbf{X}}_{N\times q}{\boldsymbol{\upbeta}}_{q\times 1}={\mathbf{X}}_{N\times q}{\mathbf{S}}_{q\times u}{\boldsymbol{\upgamma}}_{u\times 1}={\overset{\sim }{\mathbf{X}}}_{N\times u}{\boldsymbol{\upgamma}}_{u\times 1} \). Equivalently, this new model can be constructed by duplicating the columns of overlapped variables in the raw design matrix. For the new transformed model, the hazard function for subject i in the Cox’s regression model is reexpressed as
Method (OGS)
We propose the OGS method and apply it to the gene expression profile data with clinical survival trait to detect causal genes and epistasis interactions by incorporating prior pathway information. We standardized all the predictors before performing the OGS approach. The steps of the OGS algorithm are described as follows.
Step1: Based on the latent effect approach, we utilize the overlapping group Cox’s regression model to identify the causal pathways, which can be computed by the R package “grpregOverlap” [6]. We define \( {\hat{\mathrm{M}}}_{main} \) as the selected set of causal pathways, and \( A=\left{\hat{\mathrm{M}}}_{main}\right \) as the size of \( {\hat{\mathrm{M}}}_{main} \).
Step 2: Consider genegene interaction pairs between gene pairs from one causal pathway or two different causal pathways in \( {\hat{\mathrm{M}}}_{main} \), as well as gene pairs between one pathway in \( {\hat{\mathrm{M}}}_{main} \) and one noncausal pathway outside \( {\hat{\mathrm{M}}}_{main} \). The interaction between two pathways is also termed “crosstalk” of pathways [19]. For groups of genegene interaction pairs from each of the candidate pathways or from each two crosstalk pathways, apply the SKAT test to obtain the groupspecific significance. Detail about the groupspecific SKAT test is given in the next section.
Step 3: We randomly permute the original genotype matrix x_{i} to form the permuted data {Y_{i}, x_{π(i)}} following the null model, where {π(1), ⋯, π(N)} is a random permutation of the index. Then apply again the SKAT test for each of the pathway interaction groups with the permuted data to obtain the group screening measures (pvalues) \( \left\{{p}_1^{\ast },\cdots, {p}_B^{\ast}\right\} \). We adopt \( {C}_{\mathrm{int}}=\min \left\{{p}_1^{\ast },\cdots, {p}_B^{\ast}\right\} \) as a cutoff point to select candidate pathway interactions, i.e.
is our selected set of pathway interactions.
Step 4: Apply the penalized Cox’s regression with the Ridge, or Lasso penalty to build the final prediction model based on genes in \( {\hat{\mathrm{M}}}_{main} \) and genepair interactions in \( {\hat{\mathrm{M}}}_{\mathrm{int}} \). Note that when applying the Lasso penalty, some of the genes/gene pairs in \( {\hat{\mathrm{M}}}_{main} \)/\( {\hat{\mathrm{M}}}_{\mathrm{int}} \) may be removed since the Lasso penalty can set some of the coefficients exactly to 0, while when applying the Ridge penalty, all of the candidate genes and gene pairs are retained. The penalized Cox’s model with the Ridge and Lasso penalties can be obtained by the R package “glmnet” [12].
Groupspecific test (SKAT)
Following Chen et al. [9], the groupspecific SKAT statistic under the Cox’s regression model is given as
Here, \( B=A+{C}_2^A+\left(GA\right)\times A \) is the total number of groups of pathway interaction, m is the vector of martingale residuals estimated from the null model without considering the gene expression data, R_{(k)} = [r_{(k)ij}]_{N × l},
where l is the number of genegene interaction pairs in the pathway interaction group k, r_{(k)ij} is the jth genegene interaction pair of ith subject in the pathway interaction group k, and W_{(k)} is a diagonal weight matrix that contains the weights of the l interaction pairs in the pathway interaction group k. Suitable weights can improve the testing power [10]. We utilize the penalized Cox’s partial likelihood approach with the Ridge penalty to estimate effect sizes for genegene interaction pairs in each pathway interaction group, and take the square root of the absolute estimated coefficients as our weights, i.e.,
Based on null model without gene covariates, let V = diag (e_{1}, …, e_{N}) − PP^{/}, where P is an N × v matrix with element p_{ij}the baseline hazard for individual i at ordered failure time \( {t}_{(j)}^{\ast },j=1,\cdots, v \), and e_{i} the cumulative hazard for individual i at observed time \( {t}_i^{\ast } \). Let \( {\Sigma}_{(k)}={\mathbf{W}}_{(k)}{\mathbf{R}}_{(k)}^{/}{\mathbf{VR}}_{(k)}{\mathbf{W}}_{(k)} \) be the covariance matrix of the vector \( {\mathbf{W}}_{(k)}{\mathbf{R}}_{(k)}^{/}\mathbf{m} \) under the null hypothesis of all genegene interaction pairs in the pathway interaction group k having null effects. Under the null hypothesis, the SKAT statistic follows a mixture chisquare distribution:
where λ_{(k)j}, j = 1, ⋯, l are the eigenvalues of Σ_{(k)}, and \( {\chi}_{1,j}^2 \)‘s are independent 1df central chisquare random variables.
We use the Davies method [20] to approximate the tail probability (pvalue) of the mixture chisquare distributions, which can be computed by R package “CompQuadForm” [21]. In general, the Davies method is accurate [22]. The pvalues {p_{1}, ⋯, p_{B}} for the pathway interaction groups serve as our group screening measures; a smaller pvalue corresponds to higher significance of the group and hence leads to higher priority to be selected.
Results
In the following simulations, we investigate the performances of the proposed OGS approach in variable selection, estimation, and prediction, and compared them with those from the “Oracle”, “Univariate Selection”, “Ordinary Lasso”, and “TSGSIS Lasso” methods. The “Oracle” method is based on the underlying true model, which is known in simulations but unknown in real applications. The “Univariate Selection” method selects the genes and genepairs one by one via univariate regression, with controlled false discovery rate (< 0.05), and the selected variables are included in a multivariate Cox’s regression model to form the final prediction model. The “Ordinary Lasso” method is the penalized Cox’s regression model with the covariates of gene expressions from all genes and genepair interactions and with the Lasso penalty. The “TSGSIS Lasso” method is essentially proposed by Fang et al. [2], except that we apply the SKAT test to obtain the groupspecific significance.
For performance comparison, we obtain the root mean squared error (RMSE) to measure estimation accuracy, defined as
where S is full model size including all main and interaction covariates. Over 500 simulations, we report the median value RMSE.M of RMSE over simulations. We also report the following proportions in 500 simulations as performance measures for variable selection: T.model is the proportion where the selected model includes the underlying effective variables, including both the main and interaction terms; Tint.model is the proportion where the selected model includes the underlying effective genegene interaction terms; Sen. is the sensitivity, i.e., the proportion of the underlying effective variables being selected; Spe. is the specificity, i.e., the proportion of the underlying ineffective variables not being selected. We also report the median size S.model of the selected model over 500 simulations. For assessing the performance in survival prediction, we report two measures of prediction accuracy: the deviance and cindex proposed by Harrell et al. [23] and smaller deviance/larger cindex corresponds to better prediction accuracy. The median values of deviance and cindex over 500 simulations are reported.
Also, let \( \hat{\boldsymbol{\upbeta}} \) be an estimator of the (penalized) Cox’s regression parameter in a prediction model obtained from the training dataset and \( \left({t}_i^{\ast },{\delta}_i^{\ast },{\mathbf{x}}_i^{\ast}\right) \) the survival and covariate data of subject i in the test data. Define \( {\mathbf{x}}_i^{\ast}\hat{\boldsymbol{\upbeta}} \) as the prognosis index (PI) value for subject i. The prediction accuracy measure of Coxtest is defined as the pvalue of PI when PI is used as the covariate in the univariate Cox model for the survival outcome in the test data. A smaller value of Coxtest (pvalue) would suggest better prediction accuracy. Similarly, the prediction accuracy measure of LRtest is the pvalue of the logrank test for the null hypothesis of equality of the survival between the “poor” and “good” prognosis groups in the test data, which are formed according to whether the PI value is higher or lower than the median PI value. A smaller LRtest value corresponds to better prediction power.
We consider survival data with a cohort size 500 as the training set, where each subject’s survival time follows the Cox’s proportional hazards model
with β measuring the logrelative risk with respect to the covariates and the covariates x jointly following a multivariate standard normal distribution with correlation corr(x_{⋅j}, x_{⋅k}) = 0.5^{j − k}. The censoring time distribution follows a uniform U(0, 1) distribution. We then generate survival data, independent of the training data, with a cohort of size 100 as the test data to assess the prediction accuracy for different methods.
Simulation setting 1
In this simulation study, the design matrix consists of 5 groups with each group having different group sizes. The group size (number of genes in each pathway) and the overlapping structure (number of genes shared by two overlapping pathways) are shown in Table 1.
For example, pathways 1 and 2 contain 7 and 14 genes, respectively. The two groups contain 18 unique genes, and 3 genes are shared by the two groups. As a result, there are 81 genes (q = 81) and 105 latent effects in this example. Fig. 2 shows the gene indices of the pathways. Pathways 2 and 4 are effective, and genes in each of them have constant latent effects of 4.5 and − 3, respectively. Three types of genegene interactions are considered: (1) genegene interactions (x_{⋅8} × x_{⋅9}, x_{⋅10} × x_{⋅11}, x_{⋅12} × x_{⋅13}) within pathway 2 with effects(6, 6, 6), (2) genegene interaction (x_{⋅36} × x_{⋅66}, x_{⋅38} × x_{⋅68}, x_{⋅40} × x_{⋅70}) across pathways 4 and 5 with effects(−6, −6, −6), and (3) coexistence of interactions (1) and (2). The number of effective genes and genepair interactions is 45 or 48 among the total 3321 genes and genepairs. We examine performances of different methods under a censoring rate of 50% or 65%.
Simulation setting 2
In this simulation study, the design matrix consists of 24 groups with each group having different group sizes, ranging from 3 to 60 (genes). The group size and the overlapping structure are shown in Table 2.
For example, pathway 4 contains 6 genes, as group 5 does, and the two groups contain 10 unique genes, and 2 genes are shared by the two groups. As a result, there are 462 genes (q = 462) and 594 latent effects in this example. Fig. 3 shows the gene indices of the pathways. Pathways 1, 7, 13, and 19 are effective, and genes in each of them have constant latent effects of 4.5, − 3, − 3, and 1.5, respectively. As above, three types of genegene interactions are considered: (1) genegene interactions (x_{⋅22} × x_{⋅23}, x_{⋅24} × x_{⋅25}, x_{⋅26} × x_{⋅27}) within pathway 7 with effects(4, 4, 4), (2) genegene interaction (x_{⋅81} × x_{⋅101}, x_{⋅82} × x_{⋅102}, x_{⋅83} × x_{⋅103}) across pathways 13 and 14 with effects(−4, −4, −4), and (3) coexistence of interactions (1) and (2). The number of effective genes and genepair interactions is 84 or 87 among the total 106,953 genes and genepairs. We examine different methods under a censoring rate of 50% or 65%.
Summary of simulation results
From the simulation results shown in Tables 3, 4, 5, 6, 7, and 8, the OGS method using the Lasso penalty outperforms the OGS method using the Ridge penalty. Also, compared to the existing methods, OGS with the Lasso penalty performs substantially better than the Univariate Selection and the TSGSIS with the Lasso penalty methods in variable selection (T.model, Tint.model, Sen., Spe.), estimation (RMSE.M), and prediction (Deviance, cindex). When the number of groups (pathways) and the group size (number of genes) are smaller (Setting 1) and the censoring rate is relatively lower (50%), the ordinary Lasso also performs well in variable selection and survival prediction; while in other cases, the ordinary Lasso is less competitive than the proposed OGS method with the Lasso penalty in variable selection, estimation, and survival prediction. Comparing Tables 3, 4, and 5, or Tables 6, 7, and 8, we see that the pattern of interactions, namely whether the genegene interactions occur within the same pathway or not, does not affect much the performance of the proposed OGS method, in particular for survival prediction.
The DLBCL analysis
The DLBCL data [24] contain two sets of gene expression data, CHOP and RCHOP. The CHOP dataset is under a combination chemotherapy with cyclophosphamide, doxorubicin, vincristine and prednisone; RCHOP is under the current golden standard treatment, the rituxima immunotherapy in addition to the chematherapy in CHOP. The CHOP and RCHOP datasets consist of censored survival outcomes from 181 and 233 patients, respectively, with gene expression data from the same 3833 genes after the filtering process. The censoring rates are 42% and 74% in the CHOP and RCHOP datasets, respectively. These two microarray datasets can be downloaded from the R package “bujar” [25]. In our analysis, we divide randomly the patients into 207:207 training/test datasets from the pool of RCHOP and CHOP datasets. There were no significant differences in clinical survival outcome between subjects in the two datasets.
We apply the proposed OGS approach to the DLBCL data with the prior pathway information obtained from the KEGG pathway database. The following analysis is based on the 451 genes mapped into 165 pathways in the DLBCL data, which result in 101,926 main and twoway interaction covariates.
In Steps 1–3 of the OGS approach, we identify 6 significant pathways and 2 significant crosstalk pathway interactions. In Step 4 of the OGS method, the Cox’s model with the Ridge or Lasso penalty is applied to the training data to establish the final prediction model. In particular, the OGS method with the Lasso penalty leads to a prediction model with 5 main and 10 twoway interaction covariates. The “Univariate Selection” and “Ordinary Lasso” methods are applied directly to the whole 101,926 covariates in the training data to build the prediction models. The “Overlap Lasso” method is obtained by applying the R package “grpregOverlap” [6], which performs group selection among overlapping groups with the Lasso penalty but without considering interactions among features.
Table 9 displays several survival prediction accuracy measures for different approaches in the test data. We see that the OGS method with the Lasso penalty has better performances compared to existing methods in the test data. Fig. 4 displays the KaplanMeier survival curves for the “good” (blue) and “poor” (red) prognosis groups in the test data, which are formed according to whether the prognosis index (PI) value is lower or higher than the median PI value (see the Results section for detail). It is seen that that the two survival curves are better separated by the OGS approach than by the existing methods.
In DLBCL data, we discard 3382 genes that are not mapped into any pathways in the KEGG pathway database based on the latent effect approach. We also perform the other OGS analysis putting the 3382 ungrouped genes together as an additional group. The results from such an analysis are similar to those presented here.
The NSCLC analysis
The NSCLC data of Chen et al. [14] is available from NCBI with accession number GSE4882. The data contain censored survival outcomes from 125 lung cancer patients and their gene expression profiles for 672 genes. The censoring rate is 65%. Following Emura et al. [13], we consider the subset consisting of 485 genes, and, following Chen et al. [14], we divide the patients into 63:62 training/test datasets.
Based on the GO biological process database, prior pathway information for 251 genes mapped into 344 pathways are utilized, which lead to a total number of 31,626 main and twoway interaction covariates. Using the OGS approach, we identify 2 significant pathways but no significant pathway interaction, and the final prediction model obtained by the Lasso method includes main effects from two genes, DUSP6 and LCK. Indeed, the two genes are also included in the fivegene signature by Chen et al. [14], and are found to be strongly associated with lung cancer in other literatures ([26,27,28] and so on).
Table 10 shows the prediction accuracy measures for patients’ survival in the test sample of the NSCLC data, where the measure LRtest_3 is the pvalue of the logrank test for equality of survival distributions among the three prognosis groups divided by the tertiles of the PI values in the test sample. Fig. 5 displays the three KaplanMeier survival curves for three prognosis groups (“good”, “medium”, “poor” groups according tertiles of the PI values) in the test sample of the NSCLC data (in this case the LRtest for the two prognosis groups divided by the median PI is less significant. Fig. 6 displays the two KaplanMeier survival curves for the two prognosis groups). In all these measures, the OGS method with the Lasso penalty performs better than the Ordinary Lasso.
Besides, we also apply the 10fold crossvalidation method to evaluate the performance of the OGS method for survival prediction in the NSCLC data. In the 10fold crossvalidation process, most of the time the OGS still identifies the same prediction model containing the main effects of DUSP6 and LCK genes. Table 11 shows the performances of the OGS method in the NSCLC data with the performance evaluation based on the 10fold crossvalidation, i.e., the average of the results among 10 folds. We see that the performance patterns are similar to those in Table 10, and the OGS with the Lasso penalty still outperforms the other methods.
In NSCLC data, we discard 234 genes that are not mapped into any pathways in the GO biological process database based on the latent effect approach. The OGS approach for putting the 234 ungrouped genes together as an additional group results in the same prediction model as the one presented above.
Discussion
The OGS procedure can further adjust for confounding covariates (e.g. environmental factors) when all the models involved, including the null model without using gene expression covariate data, further adjust for the confounding variables; see [9, 10] for the SKAT statistics with confounding covariates for quantitative, qualitative and survival traits.
In this article, we consider twoway and multiplicative interactions as a simple way to implement interaction assessments. Examination of higherorder and general forms of interaction is challenging and deserves further research. Besides, the OGS method employs the latent effect approach to deal with the overlapping structure among pathways. This approach requires the gene grouping (pathway) structure to be prespecified and is restricted to genes that can be assigned to at least one group (pathway). It is interesting to study how these restrictions can be relaxed to improve the performances of gene selection and survival prediction. Yu and Liu [29] propose a procedure for sparse regression incorporating a comprehensive graphical structure (SRIG) among predictors, and we would like to extend the current proposal by employing the SRIG approach.
The idea of group screening procedure we propose can also be applied to detect geneenvironment interactions. In the first step, we still apply the overlapping group method to identify the causal pathways \( {\hat{\mathrm{M}}}_{main} \). In the second step, we apply the SKAT test to obtain the groupspecific significance, where each of the groups are formed by the interactions between one gene from each of the causal pathways in \( {\hat{\mathrm{M}}}_{main} \) and one environment factor in Z, where Z is the set of environment covariates whose interactions with genes are of interest. In step 3, we select significant geneenvironment interactions, where the permutation procedure and the cutoff determination are the same as those in the original OGS, except that now the permeation is applied to the covariate matrix consisting of both gene and environmental covariates. Finally, the penalized regression with the Ridge or Lasso penalty is still applied to build the final prediction model based on the genes in \( {\hat{\mathrm{M}}}_{main} \), the environmental covariates, and the selected geneenvironment interactions. We plan to study extensions of the OGS method, including the extension to geneenvironment interactions, in our future research.
In this work we focus on survival prediction based on the Cox’s proportional hazards model. In the case where the proportional hazards assumption is not appropriate, an alternative model, such as the proportional odds model, that proves to be appropriate can be used instead in the OGS procedure proposed. The required modification with models alternative to the proportional hazards is quite straightforward. For example, the SKAT statistic involved in OGS can be simply modified by using residuals from the alternative model considered.
Conclusions
It has been a longlasting interest in the bioinformatics field for detecting the pairwise genegene interactions. In this paper we propose an overlapping group screening procedure to identify causal genes and genegene interactions efficiently by incorporating prior pathway information, where the pathways involved are allowed to overlap one another. Specifically, we utilize the gene pathway information via the latent effect approach which formally accounts for the possibly overlapping grouping structure. In addition, we utilize the SKAT testing approach to perform powerful screening of main and interaction effects. Simulation and real data studies demonstrate that the new proposal can substantially improve the accuracy of gene and genegene interaction selection and hence lead to more accurate survival prediction compared with the common analyses that ignore the pathway information. We provide an R package “OGS” to perform Steps 1–3 of the proposed OGS method, together with the reference manual describing how to perform “OGS” and the code used in our simulation. Please see Additional Files 2 and 3 for detail.
The OGS approach is general in that they can accommodate various types of clinical outcomes and regression models, such as quantitative, qualitative, and survival outcomes modeled by linear, logistic, and Cox’s regression models, respectively. In the paper the OGS approach based on the Cox’s model for gene selection, effect estimation, and survival outcome prediction has been examined. The OGS methods for continuous and binary outcomes based respectively on the linear and logistic regression models are discussed in Additional File 1. The extension of OGS to more flexible models, such as those based on the kernel methods [30], deserves further research and will be studied in our future work.
The importance of genegene interactions have been discussed widely in literature. For example, Cordell [31] discussed the need of considering genegene interactions in genetic studies of complex diseases. Fang et al. [2] identified and confirmed important genegene interactions related to rheumatoid arthritis. We believe that the proposed overlapping group screening (OGS) approach provides an useful tool to this important task in delineating the underlying disease etiology.
Abbreviations
 DLBCL:

Diffuse large Bcell lymphoma
 Lasso:

Least absolute shrinkage and selection operator
 LRtest:

Logrank test
 NSCLC:

Nonsmallcell lung cancer
 OGS:

Overlapping group screening
 PI:

Predictor index
 RMSE:

Root mean squared error
 SKAT:

Sequence kernel association test
 SNPs:

Singlenucleotide polymorphisms
 SRIG:

sparse regression incorporating graphical structure
 SVM:

Support vector machine
 TSGSIS :

Twostage grouped sure independence screening
 TSSIS:

Twostage sure independence screening
References
 1.
Huang YT, VanderWeele TJ, Lin X. Joint analysis of snp and gene expression data in genetic association studies of complex diseases. Ann Appl Stat. 2014;8(1):352–76.
 2.
Fang YH, Wang JH, Hsiung CA. TSGSIS: a highdimensional grouped variable selection approach for detection of wholegenome SNP–SNP interactions. Bioinformatics. 2017;33(22):3595–602.
 3.
Fang YH, Chiu YF. SVMbased generalized multifactor dimensionality reduction approaches for detecting genegene interaction in family studies. Genet Epidemiol. 2012;36(2):88–98.
 4.
Li J, Zhong W, Li R, Wu R. A fast algorithm for detecting genegene interactions in genomewide association studies. Appl Stat. 2014;8(4):2292–318.
 5.
Jacob L, Obozinski G, Vert JP. Group lasso with overlap and graph lasso. In: Proceedings of the 26th annual international conference on machine learning. Montreal: ACM; 2009. p. 433–40.
 6.
Zeng Y, Breheny P. Overlapping group logistic regression with applications to genetic pathway selection. Cancer inform. 2016;15:179–87.
 7.
Zhang L, Morris JS, Zhang L, Orlowski RZ, Baladandayuthapani V. Bayesian joint selection of genes and pathways: applications in multiple myeloma genomics. Cancer inform. 2014;13:113–23.
 8.
Tang Z, Shen Y, Li Y, Zhang X, Wen J, et al. Group spikeandslab lasso generalized linear models for disease prediction and associated genes detection by incorporating pathway information. Bioinformatics. 2018;34(6):901–10.
 9.
Chen H, Lumley T, Brody J, HeardCosta NL, Fox CS, Cupples LA, Dupuis J. Sequence kernel association test for survival traits. Genet Epidemiol. 2014;38(3):191–7.
 10.
Wu MC, Lee S, Cai T, Li Y, Boehnke M, Lin X. Rarevariant association testing for sequencing data with sequence kernel association test. Am J Hum Genet. 2011;89(1):82–93.
 11.
Fan J, Feng Y, Song R. Nonparametric independence screening in sparse ultrahighdimensional additive models. J Am Stat Assoc. 2011;106(494):544–57.
 12.
Simon N, Friedman J, Hastie T, Tibshirani R. Regularization paths for Cox’s proportional hazards model via coordinate descent. J Stat Softw. 2011;39(5):1–13.
 13.
Emura T, Chen YH, Chen HY. Survival prediction based on compound covariate under cox proportional hazard models. PLoS One. 2012;7(10):1–12.
 14.
Chen HY, Yu SL, Chen CH, Chang GC, Chen CY, et al. A fivegene signature and clinical outcome in nonsmallcell lung cancer. N Engl J Med. 2007;356(1):22–0.
 15.
Bovelstad HM, Nygard S, Storvold HL, Aldrin M, Borgan O, et al. Predicting survival from microarray data a comparative study. Bioinformatics. 2007;23(16):2080–7.
 16.
Zhang X, Li Y, Akinyemiju T, Ojesina AI, Buckhaults P, Liu N, et al. Pathwaystructured predictive model for cancer survival prediction: a twostage approach. Genetics. 2017;205(1):89–100.
 17.
Subramanian A, Tamayo P, Mootha VK, Mukheriee S, Ebert BL, et al. Gene set enrichment analysis: a knowledgebased approach for interpreting genomewide expression profiles. PNAS. 2005;102(43):15545–50.
 18.
Therneau TM, Grambsch PM. Modeling survival data: extending the cox model, 1^{st} Ed. New York: SpringerVerlag; 2000.
 19.
Donaldson R, Calder M. Modeling and analysis of biochemical signalling pathway crosstalk. Computer Science. 2011;18:1–15.
 20.
Davies RB, Algorithm AS. 155: The distribution of a linear combination of X ^{2} random variables. J R Stat Soc Ser C Appl Stat. 1980;29(3):323–33.
 21.
Duchesne P, Lafaye De Micheaux P. Computing the distribution of quadratic forms: Further comparisons between the LiuTangZhang approximation and exact methods. Comput Stat Data Anal. 2010;54(4):858–62.
 22.
Wu B, Guan W, Pankow JS. On efficient and accurate calculation of significance pvalues for sequence kernel association testing of variant set. Ann Hum Genet. 2016;80(2):123–35.
 23.
Harrell FE, Lee KL, Mark DB. Multivariate prognostic models: issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors. Stat. in Med. 1996;15(4):361–87.
 24.
Lenz, et al. Stromal gene signatures in largeBcell lymphomas. N Engl J Med. 2008;359(22):2313–23.
 25.
Wang Z. bujar: BuckleyJames regression for survival data with highdimensional covariates. R packages version 0.2–1. 2015.
 26.
Skrzypski M, Dziadziuszko R, Jassem E, SzymanowskaNarloch A, Gulida G, et al. Main histologic types of nonsmallcell lung cancer differ in expression of prognosisrelated genes. Clin Lung Cancer. 2013;14(6):666–73.
 27.
Chen YC, Chang TC, Ke WC, Chiu HW. Cancer adjuvant chemotherapy strategic classification by artificial neural network with gene expression data: An example for nonsmall cell lung cancer. J Biomed Inform. 2015;56:1–7.
 28.
Shao WL, Wang DY, He JX. The role of gene expression profiling in earlystage nonsmall cell lung cancer. J Thorac Dis. 2010;2(2):89–99.
 29.
Yu G, Liu Y. Sparse regression incorporating graphical structure among predictors. J Am Stat Assoc. 2016;111(514):707–20.
 30.
Sinnott JA, Cai T. Pathway aggregation for survival prediction via multiple kernel learning. Stat Med. 2018;37(16):2501–15.
 31.
Cordell HJ. Detecting genegene interactions that underlie human diseases. Nat Rev. Genet. 2009;10(6):392–404.
Acknowledgements
We are very grateful to the AE and referees for their very valuable comments that helped to improve the manuscript. We would like to thank Dr. TY Chen and Dr. YP Lin for helpful discussions.
Funding
This research is supported by the Ministry of Science and Technology of Taiwan under the grant MOST 104–2118M001006MY3. The funding body did not play any roles in the design of the study and collection, analysis, and interpretation of data and in writing the manuscript.
Availability of data and materials
DLBCL with CHOP and RCHOP microarray data analyzed during this study are included in this published article [24] and stored in the R package “bujar” [25].
Author information
Affiliations
Contributions
Conceived and designed the experiments: JH, YH. Analyzed the data: JH. Wrote the first draft of the manuscript: JH, YH. Contributed to the writing of the manuscript: JH, YH. Agree with manuscript results and conclusions: JH, YH. Jointly developed the structure and arguments for the paper: JH, YH. Made critical revisions and approved final version: JH, YH. Both authors reviewed and approved of the final manuscript.
Corresponding author
Correspondence to YiHau Chen.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Additional files
Additional file 1:
The full detail and performances of the OGS approach for survival, continuous and binary outcomes, and settings where some of genes are shared by three groups (pathways). (DOC 317 kb)
Additional file 2:
An R package “OGS”, which is a Windows binaries zip file. (ZIP 29 kb)
Additional file 3:
A reference manual for the “OGS” package. (PDF 77 kb)
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Wang, J., Chen, Y. Overlapping group screening for detection of genegene interactions: application to gene expression profiles with survival trait. BMC Bioinformatics 19, 335 (2018). https://doi.org/10.1186/s1285901823722
Received:
Accepted:
Published:
Keywords
 Genegene interaction
 Lasso
 Overlapping group
 Survival prediction