Gsslasso Cox: a Bayesian hierarchical model for predicting survival and detecting associated genes by incorporating pathway information

Background Group structures among genes, encoded in functional relationships or biological pathways, are valuable and unique features in large-scale molecular data for survival analysis. However, most previous approaches for molecular data analysis ignore such group structures. It is desirable to develop powerful analytic methods that incorporate pathway information for predicting disease survival outcomes and detecting associated genes. Results We here propose a Bayesian hierarchical Cox survival model, called the group spike-and-slab lasso Cox (gsslasso Cox), for predicting disease survival outcomes and detecting associated genes by incorporating group structures of biological pathways. Our hierarchical model employs a novel prior on the coefficients of genes, the group spike-and-slab double-exponential distribution, to integrate group structures and to adaptively shrink the effects of genes. We have developed a fast and stable deterministic algorithm to fit the proposed models. We performed extensive simulation studies to assess the model fitting properties and the prognostic performance of the proposed method, and also applied our method to analyze three cancer data sets. Conclusions Both the theoretical and empirical studies show that the proposed method induces weaker shrinkage on predictors in an active pathway, thereby incorporating the biological similarity of genes within the same pathway into the hierarchical modeling. Compared with several existing methods, the proposed method more accurately estimates gene effects and better predicts survival outcomes. For the three cancer data sets, the results show that the proposed method generates more powerful models for survival prediction and detecting associated genes. The method has been implemented in the freely available R package BhGLM at https://github.com/nyiuab/BhGLM.
Electronic supplementary material The online version of this article (10.1186/s12859-019-2656-1) contains supplementary material, which is available to authorized users.


Background
Survival prediction from high-dimensional molecular data is an active topic in the fields of genomics and precision medicine, especially for various cancer studies. Large-scale omics data provide extraordinary opportunities for detecting biomarkers and building accurate prognostic and predictive models. However, such high-dimensional data also introduce statistical and computational challenges. Tibshirani [1,2] proposed a novel penalized method, the lasso, for variable selection in high-dimensional data, which has attracted considerable attention in modern statistical research. Thereafter, several other penalized methods were developed, such as the minimax concave penalty (MCP) of Zhang [3,4] and the smoothly clipped absolute deviation (SCAD) penalty of Fan and Li [5]. These penalization approaches have been widely applied for disease prediction and prognosis using large-scale molecular data [6][7][8][9][10][11].
Furthermore, group structures among molecular variables have been noted in analysis. For example, genes can be grouped into known biological pathways or functionally similar sets. Genes within the same biological pathway may be biologically related and statistically correlated. Incorporating such biological grouping information into statistical modeling can improve the interpretability and efficiency of the models. Several penalization methods have been proposed to utilize grouping information, such as the group lasso [12], sparse group lasso (SGL) [13,14], group bridge [15], composite MCP [16], the composite absolute penalty [17], the group exponential lasso [18], group variable selection via a convex log-exp-sum penalty [19], and a doubly sparse approach for group variable selection [20]. Some of these methods perform group-level selection, including or excluding an entire group of variables; others perform bi-level selection, achieving sparsity within each group. Huang et al. [21] and Ogutu et al. [22] reviewed these penalization methods for prediction and highlighted some issues for further study.
Ročková and George [23,24] recently proposed a new Bayesian approach, called the spike-and-slab lasso, for high-dimensional normal linear models using the spike-and-slab mixture double-exponential prior distribution. Based on the Bayesian framework, we have recently incorporated the spike-and-slab mixture double-exponential prior into generalized linear models (GLMs) and Cox survival models, and developed the spike-and-slab lasso GLMs and Cox models for predicting disease outcomes and detecting associated genes [25,26]. More recently, we have developed the group spike-and-slab lasso GLMs [27] to incorporate biological pathways.
In this article, we develop the group spike-and-slab lasso Cox model (gsslasso Cox) for predicting disease survival outcomes and detecting associated genes by incorporating biological pathway information. We propose an efficient algorithm to fit the group spike-and-slab lasso Cox model by integrating Expectation-Maximization (EM) steps into the extremely fast cyclic coordinate descent algorithm. The novelty is incorporating group or biological pathway information into the spike-and-slab lasso Cox model for predicting disease survival outcomes and detecting associated genes. The performance of the proposed method was evaluated via extensive simulations and compared with several commonly used methods. The proposed procedure was also applied to three cancer data sets with thousands of gene expression values and their pathway information. Our results show that the proposed method not only generates powerful prognostic models for survival prediction, but also excels at detecting associated genes.

Methods
The group spike-and-slab lasso Cox models

In the Cox survival model, the outcome for individual i is y_i = (t_i, d_i), where t_i is the observed survival time and the censoring indicator d_i equals 1 if t_i is uncensored and 0 if it is censored. Denoting the true survival time for individual i by T_i, we have T_i = t_i when d_i = 1, whereas T_i > t_i when d_i = 0. The predictor variables include numerous molecular predictors (e.g., gene expression) and some relevant demographic/clinical covariates. Assume that the predictors can be organized into G groups (e.g., biological pathways) based on existing biological knowledge. Note that the groups may overlap: one or more genes can belong to two or more biological pathways. Following the idea of the overlap group lasso [28][29][30][31], we performed a restructuring step that replicates a variable in every group in which it appears, thereby expanding the vector of predictors.
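The restructuring step for overlapping groups can be illustrated with a short sketch. The method itself is implemented in the R package BhGLM; the Python below is only illustrative, and the helper name `expand_overlapping_groups` is hypothetical.

```python
import numpy as np

def expand_overlapping_groups(X, groups):
    """Replicate each predictor column once for every group it appears in,
    following the overlap-group-lasso restructuring idea.

    X      : (n, p) array of predictors
    groups : list of lists; groups[g] holds the column indices in group g
    Returns the expanded design matrix and a group label per expanded column.
    """
    cols, labels = [], []
    for g, members in enumerate(groups):
        for j in members:
            cols.append(X[:, j])   # a column shared by two pathways is duplicated
            labels.append(g)
    return np.column_stack(cols), np.array(labels)
```

After this step every expanded column belongs to exactly one group, so non-overlapping group machinery applies directly.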
In the Cox proportional hazards model, the hazard function of survival time T is usually assumed to take the form [32,33]:

h(t | X) = h_0(t)exp(Xβ)  (1)

where the baseline hazard function h_0(t) is unspecified, X and β are the vectors of explanatory variables and coefficients, respectively, and Xβ is the linear predictor, also called the prognostic index. Fitting the classical Cox model amounts to estimating β by maximizing the partial log-likelihood [34]:

pl(β) = Σ_{i: d_i = 1} [ X_i β − log Σ_{i′ ∈ R(t_i)} exp(X_{i′}β) ]  (2)

where R(t_i) is the risk set at time t_i. In the presence of ties, the partial log-likelihood can be approximated by the Breslow or the Efron method [35,36]. The standard algorithm for maximizing the partial log-likelihood is the Newton-Raphson algorithm [32,37]. For high-dimensional and/or correlated data, the classical model fitting is often unreliable. This problem can be addressed by Bayesian hierarchical modeling or penalization approaches [31,38,39]. We here propose a Bayesian hierarchical modeling approach, which allows us to simultaneously analyze numerous predictors and, more importantly, provides an efficient way to incorporate group information. Our hierarchical Cox models employ the spike-and-slab mixture double-exponential (de) prior on the coefficients:

β_j | γ_j, s_0, s_1 ~ de(0, S_j) = (1 / 2S_j) exp(−|β_j| / S_j),  with S_j = (1 − γ_j)s_0 + γ_j s_1

where s_0 and s_1 are preset scale parameters, small and relatively large, respectively (0 < s_0 < s_1), inducing strong or weak shrinkage on β_j, and γ_j is an indicator variable taking the value 1 or 0. Equivalently, this prior can be expressed as (1 − γ_j) de(0, s_0) + γ_j de(0, s_1), a mixture of the shrinkage prior de(0, s_0) and the weakly informative prior de(0, s_1), which are the spike and slab components of the prior distribution, respectively. We incorporate the group structure by proposing a group-specific Bernoulli distribution for the indicator variables.
For predictors in group g, the indicator variables are assumed to follow a Bernoulli distribution with group-specific probability θ_g:

γ_j | θ_g ~ Bernoulli(θ_g), for j in group g

If group g includes important predictors, the parameter θ_g will be estimated to be relatively large, implying that other predictors in the group are more likely to be important. The group-specific Bernoulli prior therefore serves to incorporate the biological similarity of genes within the same pathway into the hierarchical model. For the probability parameters, we adopt a beta prior, θ_g ~ Beta(a, b); setting a = b = 1 yields the uniform hyperprior θ_g ~ U(0, 1), which is used in later sections to illustrate our method. Hereafter, the above hierarchical Cox models are referred to as the group spike-and-slab lasso Cox model.
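The mixture prior density can be checked numerically with a minimal sketch. This is illustrative Python (the method is implemented in R), and the scale values used are arbitrary examples, not the paper's tuned settings.

```python
import numpy as np

def de_pdf(beta, scale):
    """Double-exponential (Laplace) density: de(beta | 0, s) = exp(-|beta|/s) / (2s)."""
    return np.exp(-np.abs(beta) / scale) / (2.0 * scale)

def spike_slab_de_pdf(beta, theta_g, s0=0.05, s1=1.0):
    """Marginal spike-and-slab prior: (1 - theta_g) de(0, s0) + theta_g de(0, s1)."""
    return (1.0 - theta_g) * de_pdf(beta, s0) + theta_g * de_pdf(beta, s1)
```

With theta_g = 0 the prior reduces to the spike de(0, s_0); with theta_g = 1 it reduces to the slab de(0, s_1).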

The EM coordinate descent algorithm
We have developed a fast deterministic algorithm, called the EM coordinate descent algorithm, to fit the spike-and-slab lasso Cox models by estimating the posterior modes of the parameters [26]. The EM coordinate descent algorithm incorporates EM steps into the cyclic coordinate descent procedure for fitting penalized lasso Cox models, and has been shown to be fast and efficient for analyzing high-dimensional survival data [26]. We here extend the EM coordinate descent algorithm to fit the group spike-and-slab lasso Cox models. We derive the algorithm based on the log joint posterior density of the parameters ϑ = (β, γ, θ):

log p(β, γ, θ | t, d) = log p(t, d | β, h_0) + log p(β | γ) + log p(γ | θ) + log p(θ) + C

The log-likelihood function, log p(t, d | β, h_0), is proportional to the partial log-likelihood pl(β) defined in Eq. (2), or to the Breslow or Efron approximation in the presence of ties [35,36], if the baseline hazard function h_0 is replaced by the Breslow estimator [37,40]. Therefore, the log joint posterior density can be expressed as

log p(β, γ, θ | t, d) = pl(β) − Σ_{j=1}^{J} |β_j| / S_j + Σ_{j=1}^{J} [γ_j log θ_g + (1 − γ_j) log(1 − θ_g)] + Σ_{g=1}^{G} [(a − 1) log θ_g + (b − 1) log(1 − θ_g)] + C

where pl(β) is the partial log-likelihood described in (2), and S_j = (1 − γ_j)s_0 + γ_j s_1.
In the EM coordinate descent algorithm, the indicator variables γ_j are treated as 'missing values', and the parameters (β, θ) are estimated by averaging over the missing values with respect to their posterior distributions. In the E-step, the expectation of the log joint posterior density is calculated with respect to the conditional posterior distributions of the missing data. For predictors in group g, the conditional posterior expectation of the indicator variable γ_j can be derived as

p_j = p(γ_j = 1 | β_j, θ_g) = θ_g de(β_j | 0, s_1) / [θ_g de(β_j | 0, s_1) + (1 − θ_g) de(β_j | 0, s_0)]  (7)

where p(γ_j = 1 | θ_g) = θ_g, p(γ_j = 0 | θ_g) = 1 − θ_g, p(β_j | γ_j = 1, s_1) = de(β_j | 0, s_1) and p(β_j | γ_j = 0, s_0) = de(β_j | 0, s_0). The conditional posterior expectation of S_j^{−1} is then obtained by

E(S_j^{−1} | β_j, θ_g) = p_j / s_1 + (1 − p_j) / s_0  (8)

From Eqs. (7) and (8), we can see that the estimates of p_j and S_j are larger for larger coefficients β_j, leading to different shrinkage for different coefficients.
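The E-step quantities p_j and E(S_j^{−1}) can be sketched directly from the formulas above. This is an illustrative Python sketch, not the BhGLM implementation; the scale values in the test are arbitrary.

```python
import numpy as np

def de_pdf(beta, scale):
    """Double-exponential density de(beta | 0, s) = exp(-|beta|/s) / (2s)."""
    return np.exp(-np.abs(beta) / scale) / (2.0 * scale)

def e_step(beta, theta, s0, s1):
    """Conditional posterior inclusion probabilities p_j and E[S_j^{-1}]."""
    d1 = theta * de_pdf(beta, s1)        # slab component, weighted by theta_g
    d0 = (1.0 - theta) * de_pdf(beta, s0)  # spike component
    p = d1 / (d1 + d0)                   # Eq. (7)
    inv_S = p / s1 + (1.0 - p) / s0      # Eq. (8)
    return p, inv_S
```

Large |β_j| drives p_j toward 1 (weak shrinkage via the slab scale), while β_j near 0 drives p_j toward 0 (strong shrinkage via the spike scale).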
In the M-step, the parameters (β, θ) are updated by maximizing the expectation of the log joint posterior density with γ_j and S_j^{−1} replaced by their conditional posterior expectations. From the log joint posterior density, β and θ can be updated separately, because the coefficients β are involved only in pl(β) − Σ_{j=1}^{J} S_j^{−1} |β_j|, and the probability parameters θ only in Σ_{j=1}^{J} [γ_j log θ_g + (1 − γ_j) log(1 − θ_g)] + Σ_{g=1}^{G} [(a − 1) log θ_g + (b − 1) log(1 − θ_g)]. Therefore, the coefficients β are updated by maximizing the expression

Q_1(β) = pl(β) − Σ_{j=1}^{J} Ŝ_j^{−1} |β_j|

where Ŝ_j^{−1} is the conditional posterior expectation of S_j^{−1} as derived above. Given the scale parameters S_j, the term Σ_{j=1}^{J} Ŝ_j^{−1} |β_j| serves as the L1 lasso penalty with Ŝ_j^{−1} as the penalty factors, and thus the coefficients can be updated by maximizing Q_1(β) using the cyclic coordinate descent algorithm, which is extremely fast and can estimate some coefficients to be exactly zero [31,41]. The probability parameters {θ_g} are updated by maximizing the expression

Q_2(θ) = Σ_{j=1}^{J} [p_j log θ_g + (1 − p_j) log(1 − θ_g)] + Σ_{g=1}^{G} [(a − 1) log θ_g + (b − 1) log(1 − θ_g)]

which yields

θ̂_g = (Σ_{j ∈ group g} p_j + a − 1) / (J_g + a + b − 2)

where J_g is the number of predictors belonging to group g.
We assess convergence by the criterion |d^{(t)} − d^{(t−1)}| / (0.1 + |d^{(t)}|) < ε, where d^{(t)} is the estimate of the deviance at the t-th iteration, and ε is a small value (say 10^{−5}).

Evaluation of predictive performance
We can use several measures of the performance of a fitted group lasso Cox model, including the partial log-likelihood (PL), the concordance index (C-index), the survival curves, and the survival prediction error [37]. The partial log-likelihood measures the overall quality of a fitted Cox model, and thus is usually used to choose an optimal model [37,41,42]. The standard way to evaluate a model is to fit it on one data set and then calculate the above measures on independent data. A variant of cross-validation [31,43], called the pre-validation method, was used in the present study to evaluate the performance. The data are randomly split into K subsets of roughly equal size, and (K − 1) subsets are used to fit a hierarchical Cox model, yielding the coefficient estimates β̂^{(−k)} from the data excluding the k-th subset.
The prognostic indices η̂^{(k)} = X^{(k)} β̂^{(−k)}, called the cross-validated or pre-validated prognostic indices, are calculated for all individuals in the k-th subset of the data. Cycling through all K parts yields cross-validated prognostic indices η̂_i for all individuals. Then (t_i, d_i, η̂_i) is used to compute the several measures described above. The cross-validated prognostic value for each patient is derived independently of that patient's observed response, so the 'pre-validated' dataset (t_i, d_i, η̂_i) can essentially be treated as a 'new dataset'. This procedure provides a valid assessment of the predictive performance of the model [31,43].
The cross-validated partial likelihood (CVPL) is calculated as

CVPL = Σ_{k=1}^{K} [ pl(β̂^{(−k)}) − pl^{(−k)}(β̂^{(−k)}) ]

where β̂^{(−k)} is the estimate of β from all the data except the k-th part, pl(β̂^{(−k)}) is the partial log-likelihood over all the data points, and pl^{(−k)}(β̂^{(−k)}) is the partial log-likelihood excluding part k of the data. By subtracting the log partial likelihood evaluated on the non-left-out data from that evaluated on the full data, we make efficient use of the death times of the left-out data in relation to the death times of all the data.
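The CVPL calculation can be sketched as follows. This is an illustrative Python sketch under simplifying assumptions: `cox_pl` implements the Breslow partial log-likelihood without tie correction, and `fold_estimates` stands in for coefficient vectors already fitted with each fold held out.

```python
import numpy as np

def cox_pl(eta, time, event):
    """Breslow partial log-likelihood (no tie correction) for linear predictor eta."""
    pl = 0.0
    for i in np.where(event == 1)[0]:
        at_risk = time >= time[i]                    # risk set R(t_i)
        pl += eta[i] - np.log(np.exp(eta[at_risk]).sum())
    return pl

def cvpl(X, time, event, fold_estimates):
    """CVPL = sum_k [ pl(beta^(-k)) - pl^(-k)(beta^(-k)) ].

    fold_estimates: list of (beta_minus_k, test_idx) pairs, one per fold,
    where beta_minus_k was fit on the data excluding test_idx.
    """
    total = 0.0
    for beta_mk, test_idx in fold_estimates:
        eta = X @ beta_mk
        train = np.ones(len(time), bool)
        train[test_idx] = False
        total += cox_pl(eta, time, event) - cox_pl(eta[train], time[train], event[train])
    return total
```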

Selecting optimal scale values
The spike-and-slab double-exponential prior requires two preset scale parameters (s_0, s_1). Following previous studies [24][25][26], we set the slab scale s_1 to a relatively large value (e.g., 1) and consider a sequence of L decreasing values {s_0^l} for the spike scale s_0. We then fit L models with scales {(s_0^l, s_1); l = 1, …, L} and select an optimal model using the method described above. This procedure is similar to the lasso implemented in the widely used R package glmnet, which quickly fits the lasso Cox models over a grid of values of λ covering its entire range, giving a sequence of models for users to choose from [31,41].
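The grid search over the spike scale can be sketched as below. This is illustrative Python; `fit_and_score` is a hypothetical user-supplied routine standing in for fitting the gsslasso Cox model at a given (s_0, s_1) and returning its CVPL.

```python
import numpy as np

def select_spike_scale(fit_and_score, s0_grid, s1=1.0):
    """Fit one model per spike scale and keep the one with the largest CVPL.

    fit_and_score(s0, s1) -> CVPL is supplied by the user.
    """
    scores = [fit_and_score(s0, s1) for s0 in s0_grid]
    best = int(np.argmax(scores))
    return s0_grid[best], scores

# A decreasing-resolution grid like the one used in the real-data analysis:
s0_grid = [k * 0.01 for k in range(1, 10)]   # {0.01, ..., 0.09}
```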

Implementation and software package
We have incorporated the method proposed in this study into the function bmlasso() in our R package BhGLM [44]. The package BhGLM also includes several other functions for summarizing and evaluating predictive performance, such as summary.bh, cv.bh, and predict.bh. The functions in the package are very fast, usually taking several minutes to fit and evaluate a model with thousands of variables. The package BhGLM is freely available from https://github.com/nyiuab/BhGLM.

Simulation studies
We assessed the proposed approach by extensive simulations, comparing it with the lasso implemented in the R package glmnet and with several penalization methods that can incorporate group information, including sparse group lasso (SGL) in the R package SGL, and overlap group lasso (grlasso), overlap group MCP (grMCP), overlap group SCAD (grSCAD), and overlap group composite MCP (cMCP) in the R package grpregOverlap [45]. Our simulation method was similar to our previous work [26,27]. We considered five simulation scenarios with different complexities, including non-overlap or overlap groups, group sizes, numbers of non-null groups, and correlation coefficients (r) (Table 1). In simulation scenarios 2-5, overlap structures were considered. To handle the overlap structures, we duplicated overlapping predictors into every group to which they belong [28,30]. In each scenario, we simulated two data sets, using the first as the training data to fit the models and the second as the test data to evaluate the predictive values. We replicated the simulation 100 times and summarized the results over these replicates. In simulation scenario 6, we varied the effect size of the non-zero coefficient β_5 from −2 to 2; the other settings were the same as in scenario 2. The purpose of this simulation was to examine the profile of the prior scale as the effect size varies. Each simulated dataset included n = 500 observations, with a censored survival response y_i and a vector of m = 1000 continuous predictors X_i = (x_i1, …, x_im). We assumed 20 groups, each including about 50 predictors. For example, groups 1 and 2 included variables (x_1, …, x_50) and (x_51, …, x_100), respectively.
The vector X_i was randomly sampled from the multivariate normal distribution N_1000(0, Σ), where the covariance matrix Σ was set to produce the varied within-group correlation and overlap structures of the different simulation scenarios. Predictors within a group were assumed to be correlated with each other, and predictors in different groups were assumed to be independent. The correlation coefficient r was generally set to 0.5.
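The block correlation structure can be sketched as below. This is an illustrative Python sketch of the non-overlapping case (equicorrelation r within each group of 50, independence across groups); the helper name is hypothetical.

```python
import numpy as np

def block_corr_cov(m=1000, group_size=50, r=0.5):
    """Block-diagonal correlation matrix: r within each group, 0 across groups."""
    Sigma = np.zeros((m, m))
    for start in range(0, m, group_size):
        block = np.full((group_size, group_size), r)
        np.fill_diagonal(block, 1.0)                 # unit variances on the diagonal
        Sigma[start:start + group_size, start:start + group_size] = block
    return Sigma
```

Sampling X then amounts to drawing rows from N_m(0, Σ), e.g. with `numpy.random.Generator.multivariate_normal`.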
To simulate the censored survival response, following the method of Simon et al. [41], we generated the "true" survival time T_i for each individual from the exponential distribution T_i ~ Expon(exp(Σ_{j=1}^{m} x_ij β_j)) and the censoring time C_i from the exponential distribution C_i ~ Expon(exp(r_i)), where r_i was randomly sampled from a standard normal distribution. The observed censored survival time was set to the minimum of the "true" survival and censoring times, t_i = min(T_i, C_i), and the censoring indicator d_i was set to 1 if C_i > T_i and 0 otherwise. Our simulation scenarios resulted in different censoring ratios, but generally below 50%. For all the scenarios, we set eight coefficients to be non-zero and the others to zero.
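The survival-response generation above can be sketched as follows (illustrative Python; note that NumPy parameterizes the exponential distribution by its scale, the reciprocal of the rate exp(X_i β)).

```python
import numpy as np

def simulate_survival(X, beta, rng):
    """T_i ~ Expon(rate = exp(X_i beta)); C_i ~ Expon(rate = exp(r_i)), r_i ~ N(0, 1);
    observed t_i = min(T_i, C_i), d_i = 1 if C_i > T_i else 0."""
    n = X.shape[0]
    T = rng.exponential(1.0 / np.exp(X @ beta))                 # scale = 1 / rate
    C = rng.exponential(1.0 / np.exp(rng.standard_normal(n)))   # random censoring
    t = np.minimum(T, C)            # observed (possibly censored) time
    d = (C > T).astype(int)         # 1 = event observed, 0 = censored
    return t, d
```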

Scenario 1: Non-overlap group
In this scenario, the groups are disjoint, with no overlap among them. The eight non-zero predictors, {x_5, x_20, x_40}, {x_210, x_220, x_240}, and {x_975, x_995}, were simulated to be included in three groups: groups 1, 5, and 20 (Table 1). Each group size is 50, i.e., each group includes 50 predictors.

Scenario 2: Overlap grouping

In this scenario, an overlapping group structure was considered; only the last group is disjoint. For example, five predictors (x_46, x_47, x_48, x_49, x_50) belong to both group 1 and group 2. The settings for the eight non-zero predictors and their effect sizes are the same as in scenario 1, and each group size is still 50.

Scenario 3: Varying group sizes
Group size means the number of predictors included in a group; a larger group includes relatively more predictors and may affect the model fitting. In this scenario, we assumed two groups, groups 1 and 11, containing the non-zero predictors {x_1, x_2, x_3, x_4} and {x_501, x_502, x_503, x_504}, respectively. Other settings were similar to scenario 2. To investigate the effect of group size on model fitting, we simulated different group sizes as follows:

(1) Only the four non-zero predictors included in groups 1 and 11: {x_1–x_4}, {x_5–x_100}, {x_96–x_150}, …, {x_501–x_504}, {x_505–x_600}, …, {x_896–x_950}, {x_951–x_1000}.

(2) 20 predictors included in groups 1 and 11: {x_1–x_20}, {x_21–x_100}, {x_96–x_150}, …, {x_501–x_520}, {x_521–x_600}, …, {x_896–x_950}, {x_951–x_1000}.

(3) 50 predictors included in groups 1 and 11: {x_1–x_50}, {x_46–x_100}, {x_96–x_150}, …, {x_501–x_550}, {x_546–x_600}, …, {x_896–x_950}, {x_951–x_1000}.

Scenario 4: Varying the number of non-null groups

The true non-zero predictors are included in some groups, while the zero predictors belong to the other groups. Groups that include non-zero predictors are called non-null groups. The number of non-null groups may also affect the model fitting. To evaluate this effect, we varied the number of non-null groups (8, 3, or 1), as follows: (1) there are 8 non-null groups including non-zero coefficients: {x_5

Scenario 5: Varying the correlation within group
To evaluate the effect of correlation within a group, we set different within-group correlation coefficients: r = 0.0, 0.5, and 0.7. Other settings were the same as in scenario 2.
Scenario 6: Self-adaptive shrinkage with varying effect size

The distinctive feature of the proposed spike-and-slab prior is self-adaptive shrinkage. To show this property, we performed an additional simulation study based on scenario 1. We fixed the prior scales (s_0, s_1) = (0.02, 1) and varied the effect size of the first simulated non-zero predictor (x_5) from −2 to 2. We recorded the scale parameters for this non-zero predictor (x_5), for the nearby zero-effect predictor (x_6), and for the non-zero predictor (x_20) with simulated effect size −0.7. These three predictors belong to the same group.

Real data analysis
We applied the proposed gsslasso Cox model to analyze three real datasets: ovarian cancer (OV), lung adenocarcinoma (LUAD), and breast cancer. The whole-genome expression data were downloaded from The Cancer Genome Atlas (TCGA, http://cancergenome.nih.gov/) (updated in June 2017). We first cleaned the data to obtain clear survival information and the potential genes to be included in further analysis; the details of the three datasets and the cleaning procedure are described in the paragraphs below. Second, several genome annotation tools were used to construct the pathway information. All the genes were mapped to KEGG pathways using the R/Bioconductor packages mygene, clusterProfiler, and AnnotationDbi [46]. The mygene package was used to convert gene names to ENTREZ gene IDs, and the clusterProfiler package was used to retrieve pathway/group information for genes from the ENTREZ IDs.
AnnotationDbi was used primarily to create mapping objects that allow easy access from R to underlying annotation databases, in the present study KEGG. Using these packages, we mapped the genes into pathways and obtained the group structure information for further analysis. Only the genes included in pathways were used in further analysis. Third, the proposed method and the several penalization approaches used in the simulation study above were applied to analyze the survival data with thousands of genes and pathway/group information. We performed 10-fold cross-validation with 10 replicates to evaluate the predictive values of the several models. After model fitting, the genes with non-zero coefficient estimates were taken as the detected genes.

TCGA ovarian cancer dataset (mRNA sequencing data)
This dataset contains mRNA expression data and the relevant clinical outcomes for ovarian cancer (OV) from TCGA. The raw expression dataset includes 304 patients and 20,503 genes after removing duplicated and unknown gene names. The raw clinical data included 586 patients; we cleaned the clinical survival data from several clinical files and obtained 582 patients with clear survival information. Merging the individuals with both gene expression data and survival information yielded 304 patients with 20,503 genes for further analysis. First, we filtered out genes with expression values less than 10. Then, genes with more than 30% zero expression values in the dataset were removed. Furthermore, we calculated the coefficient of variation (CV) of the expression values for each gene, and kept the genes with CV above the 20% quantile. After these steps, 304 patients with 14,265 genes were included in our analysis. The censoring ratio was 39.5%. We mapped these genes to 271 pathways including 4260 genes.
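The three-step gene filter can be sketched as below. This is an illustrative Python sketch, not the authors' pipeline; in particular it assumes the first step drops genes whose mean expression is below 10 (the paper's phrasing leaves the exact rule ambiguous), and the helper name is hypothetical.

```python
import numpy as np

def filter_genes(expr, min_expr=10, max_zero_frac=0.30, cv_quantile=0.20):
    """expr: (patients x genes) array. Returns a boolean keep-mask over genes.

    Step 1: drop genes with mean expression < min_expr (assumed interpretation).
    Step 2: drop genes with more than max_zero_frac zero values.
    Step 3: drop genes in the lowest cv_quantile of the coefficient of variation,
            computed among the genes still retained.
    """
    keep = expr.mean(axis=0) >= min_expr
    keep &= (expr == 0).mean(axis=0) <= max_zero_frac
    cv = np.where(keep, expr.std(axis=0) / np.maximum(expr.mean(axis=0), 1e-12), np.nan)
    keep &= cv > np.nanquantile(cv, cv_quantile)
    return keep
```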
TCGA lung adenocarcinoma dataset (mRNA sequencing data)

The raw expression data contain 578 patients and 20,530 genes; after removing duplicated and unknown gene names, 516 patients with 20,501 genes remained for further analysis. The raw clinical data included 521 patients; after cleaning for clear survival records, 497 patients remained. We then merged the clinical and expression data, obtaining 491 patients with 20,501 genes for quality control. Similar to the steps for the ovarian cancer dataset, we filtered out genes with expression values less than 10, removed genes with more than 30% zero expression values, calculated the coefficient of variation (CV) of the expression values for each gene, and kept the genes with CV above the 20% quantile. After these steps, 491 patients with 14,143 genes were included in our analysis. The censoring ratio was 68.4%. We mapped these genes to 274 pathways including 4266 genes.

TCGA breast cancer dataset (mRNA sequencing data)
The raw expression data contain 1220 patients and 20,530 genes; after removing duplicated and unknown gene names, 1097 patients with 20,503 genes remained for further analysis. The raw clinical data included 1097 patients; after cleaning for clear survival records, 1084 patients remained. We then merged the clinical and expression data, obtaining 1082 patients with 20,503 genes for quality control. Applying the same steps to the breast cancer dataset, we filtered out genes with expression values less than 10, removed genes with more than 30% zero expression values, calculated the coefficient of variation (CV) of the expression values for each gene, and kept the genes with CV above the 20% quantile. After these steps, 1082 patients with 14,077 genes were included in our analysis. The censoring ratio was 86.0%. We mapped these genes to 275 pathways including 4385 genes.

Simulation results
Predictive performance

Tables 2 and 3 summarize the CVPL (cross-validated partial likelihood) and C-index in the test data over 100 replicates for scenarios 1-5. We observed that the group spike-and-slab lasso Cox model performed similarly to cMCP and outperformed the other methods under the different simulation scenarios. These results suggest that, with complex group structures, the proposed method performs well.

Accuracy of parameter estimates
To evaluate the accuracy of parameter estimation, we summarized the average numbers of non-zero coefficients and the mean absolute errors (MAE) of the coefficient estimates, defined as MAE = Σ_j |β̂_j − β_j| / m, in Tables 4 and 5 for the different scenarios. The detected number of non-zero coefficients was very close to the preset number 8, and the MAE values were very small for the proposed method under the different scenarios. The performances of the group spike-and-slab lasso Cox and cMCP were consistently better than those of the other methods for all five scenarios, and the proposed method was slightly better than cMCP. These results suggest that the proposed method yields the lowest false-positive rates and nearly unbiased estimation.
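The MAE criterion above is a one-liner; a minimal sketch for completeness (illustrative Python, hypothetical helper name):

```python
import numpy as np

def mae(beta_hat, beta_true):
    """Mean absolute error of the coefficient estimates: sum_j |b_hat_j - b_j| / m."""
    beta_hat, beta_true = np.asarray(beta_hat), np.asarray(beta_true)
    return np.abs(beta_hat - beta_true).sum() / beta_true.size
```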
The estimates of coefficients from the group spike-and-slab lasso Cox and the other methods over 100 replicates are shown in Fig. 1 and Additional file 1: Figure S1, Additional file 2: Figure S2, Additional file 3: Figure S3, Additional file 4: Figure S4, Additional file 5: Figure S5, Additional file 6: Figure S6 and Additional file 7: Figure S7 for the different scenarios. The group spike-and-slab lasso Cox method produced estimates close to the simulated values for all the coefficients. This is expected, because the spike-and-slab prior induces weak shrinkage on larger coefficients and strong shrinkage on zero coefficients. In contrast, the other methods except cMCP applied strong shrinkage to all the coefficients, so that the non-zero coefficients were shrunk and underestimated relative to their true values. In addition, higher false-positive rates (grey bars) were observed for all methods except the group spike-and-slab lasso Cox and cMCP.

The self-adaptive shrinkage feature
To show the self-adaptive shrinkage feature, we performed simulation scenario 6. Figure 2 shows the adaptive shrinkage on the non-zero coefficient of x_5 as the effect size varies. It clearly shows that the proposed spike-and-slab lasso Cox approach is self-adaptive and flexible, without affecting the nearby zero-effect predictor (x_6) or the non-zero predictor (x_20) belonging to the same group.

Real data analysis results
About one third of the genes were mapped to pathways for the three real datasets; the remaining genes were put together as an additional group. The detailed information on genes shared by different pathways is listed in Additional file 8: S1, S2, and S3 for ovarian cancer, lung cancer, and breast cancer, respectively. The real data analysis builds a survival model for predicting the overall survival outcome by integrating gene expression data and pathway information. Prior to fitting the models, we standardized all the predictors to a common scale using the covariate() function in the BhGLM package. The prior distribution has two preset parameters, (s_0, s_1). In the real data analysis, we fixed the slab scale s_1 at 1 and varied the spike scale s_0 over {k × 0.01; k = 1, …, 9}, leading to 9 models. The optimal spike scale s_0 was selected by 10-fold cross-validation with 10 replicates according to the CVPL. Using the optimal s_0, we performed further real data analysis. For comparison, we also analyzed the data using the several existing methods described in the simulation studies, performing 10-fold cross-validation with 10 replicates to evaluate the predictive values of the models. Table 6 summarizes the measures of prognostic performance on the three data sets using only the genes included in pathways. For all the data sets, the proposed group spike-and-slab lasso Cox model performed better than the other methods. Additional file 9 shows the measures of performance on the three data sets using all genes, with the genes not mapped to any pathway put together as an additional group; the prediction performance of the proposed method was still better than that of the other methods.

[Table notes: Values in parentheses are standard deviations (Tables 2 and 3) or standard errors (Tables 4 and 5). "gsslasso" denotes the proposed group spike-and-slab lasso Cox. The slab scale s_1 is 1 in all analyses. The optimal s_0 is 0.02 and 0.03 for the gsslasso Cox under scenarios 1 and 2, respectively. For scenarios with overlap structures, SGL was not used for comparison since it cannot directly handle overlaps. In scenario 3, group size "4/50" denotes four non-zero coefficients embedded in a group of 50 predictors, and similarly for "4/20" and "4/4"; the optimal s_0 is 0.02 for the different group-size settings. In scenario 4, "8/20" denotes 8 non-null groups among 20 groups, each non-null group including at least one non-zero coefficient; the optimal s_0 is 0.02 for the three settings. In scenario 5, the optimal s_0 is 0.02, 0.03, and 0.04 for within-group correlation coefficients 0.0, 0.5, and 0.7, respectively.]
The pathway enrichment analyses for the detected genes are summarized in Additional file 10: S4-S6. Additional file 11: S7 presents the genes detected by the proposed and existing methods, along with plots of their standardized effect sizes for the three real data sets. Many genes were detected in common across methods. For the ovarian cancer dataset, the genes CYP2R1 and HLA-DOB detected by the proposed gsslasso method were also detected by both the lasso and cMCP methods. For the lung cancer dataset, several genes detected by the proposed gsslasso method, such as VDAC1, EHHADH, ACAT2, KIT, CCND1, PIK3R1, NRAS, GNPNAT1, and KYNU, were also detected by other methods. For the breast cancer dataset, two genes detected by the proposed gsslasso method, HSPA1A and ABCB5, were also detected by other methods. We found that most of the genes detected by the proposed method have been associated with cancers in previous studies. HABP2, detected in ovarian cancer, is associated with familial nonmedullary thyroid cancer [47]. CYP24A1, the main enzyme responsible for the degradation of active vitamin D, plays an important role in many cancer-related cellular processes.
The associations between CYP24A1 polymorphisms and cancer susceptibility have been evaluated in many studies [48]. Keratin 8 (KRT8) plays an essential role in the development and metastasis of multiple human cancers; a recent study suggested that KRT8 upregulation promotes tumor metastasis and is associated with poor prognosis in clear cell renal cell carcinoma and gastric cancer [49,50]. E2F7, detected in lung cancer, has been implicated in several cancer studies and may act as an independent prognostic factor in breast cancer, squamous cell carcinoma, and gliomas [51-53]. Most of the genes detected by the proposed method in the three real datasets have previously been found to be associated with different cancers. These results may provide useful information for further studies.

Discussion
The group structures among various features arise naturally in many biological and medical research settings, especially in large-scale omics data. Such grouping information is biologically meaningful and intrinsically encoded in the data, so it is desirable to incorporate it into data analysis. Various penalization methods have been designed for such situations [13,14,30,54,55]. Recently, we developed a novel hierarchical modeling approach, the group spike-and-slab lasso GLMs, to integrate variable group information for gene detection and prognostic prediction [27]. In this study, we extended the method to the Cox proportional hazards model for analyzing censored survival data.

Fig. 1 The parameter estimates averaged over 100 replicates for the group spike-and-slab lasso Cox (gsslasso), lasso, grlasso, grMCP, grSCAD, SGL, and cMCP methods under Scenario 1. Blue circles are the simulated non-zero values; black points and lines represent the estimated values and interval estimates of the coefficients.
Similar to the group spike-and-slab lasso GLMs, the key to our group spike-and-slab lasso Cox is the group spike-and-slab double-exponential prior. This prior has significant advantages in variable selection and parameter estimation: it induces weak shrinkage on larger coefficients and strong shrinkage on irrelevant coefficients. In contrast, other methods usually apply a strong amount of shrinkage to all coefficients, resulting in solutions in which non-zero coefficients are shrunk and underestimated. The proposed group spike-and-slab prior allows the model to incorporate the biological similarity of genes within the same pathway into the analysis.
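To make the adaptive-shrinkage mechanism concrete, here is a minimal Python sketch. The conditional inclusion probability and the harmonic-mean effective scale follow the general spike-and-slab lasso formulation; the exact BhGLM internals may differ, so treat this as an illustration rather than the paper's algorithm.

```python
import numpy as np

def de_pdf(beta, s):
    """Double-exponential (Laplace) density with mean 0 and scale s."""
    return np.exp(-np.abs(beta) / s) / (2.0 * s)

def adaptive_scale(beta, theta, s0=0.02, s1=1.0):
    """Conditional inclusion probability and resulting effective penalty scale.
    Small |beta| -> p near 0 -> scale near s0 (strong shrinkage);
    large |beta| -> p near 1 -> scale near s1 (weak shrinkage)."""
    slab, spike = theta * de_pdf(beta, s1), (1 - theta) * de_pdf(beta, s0)
    p = slab / (slab + spike)
    scale = 1.0 / (p / s1 + (1 - p) / s0)   # harmonic mixture of the two scales
    return p, scale
```

For example, with a group probability theta = 0.1, a coefficient of 1.0 is assigned an effective scale near s1 = 1 (weak shrinkage), while a coefficient of 0.01 gets a scale near s0 = 0.02 (strong shrinkage toward zero).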
The spike-and-slab prior depends on the spike and slab scale parameters. Our previous studies suggested that the slab scale s1 has little influence on model fitting, while the spike scale s0 strongly affects model performance [25,26]. A slab scale s1 that introduces a weak amount of shrinkage helps include relevant variables in the model; we therefore set s1 = 1 in our analyses. We evaluated the performance of the proposed model over a grid of spike scale s0 values from a reasonable range, e.g., (0, 0.1), and then selected an optimal value using cross-validation. This is a path-following strategy for fast dynamic posterior exploration of the proposed models, similar to the approach of Ročková and George [24,56].

Table 6 The measures of the optimal group spike-and-slab lasso (gsslasso) Cox and lasso Cox models for the TCGA ovarian cancer, lung adenocarcinoma (LUAD), and breast cancer datasets with pathway genes, by 10 times 10-fold cross-validation. Note: Values in the parentheses are standard errors. For the group spike-and-slab lasso model, the optimal s0 = 0.03 for all three data sets. In the TCGA ovarian cancer data, 4260 genes were mapped into 271 pathways, and the analyses were performed on the genes included in these pathways; the same holds for the other two datasets.

Additional file 12: Figure S8 a and b show the solution paths of the proposed model and the lasso Cox model under Scenario 2, and Figure S8 c and d show the profiles of the cross-validated partial log-likelihood from 10-fold cross-validation for the proposed model. These profiles help choose the optimal tuning parameter. Similar to the lasso, the spike-and-slab lasso Cox follows a path-following strategy for fast dynamic posterior exploration. However, its solution path is essentially different from that of the lasso: for the lasso Cox model, the number of non-zero coefficients can be small, or even zero, if a strong penalty is adopted.
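The path-following strategy itself can be sketched generically: fit the model at each spike scale in turn, warm-starting each fit from the previous solution. In this sketch `fit_at` is a hypothetical placeholder for one penalized Cox fit, not a BhGLM function.

```python
def path_following(fit_at, s0_grid, beta_init):
    """Fit a sequence of models along the spike-scale grid, warm-starting
    each fit at the previous solution (dynamic posterior exploration)."""
    path, beta = [], beta_init
    for s0 in s0_grid:
        beta = fit_at(s0, beta)   # one model fit, initialized at the previous solution
        path.append((s0, beta))
    return path
```

Warm starts make each successive fit cheap, which is why a full grid of nine models remains fast to explore.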
However, in the spike-and-slab lasso Cox model, larger coefficients are always included in the model with weak shrinkage, while irrelevant coefficients are removed (grey path in Additional file 12: Figure S8 a). Another feature of the proposed spike-and-slab prior is bi-level selection, which is capable of selecting important groups as well as important individual variables within those groups. Several methods perform bi-level selection, including the cMCP method [16], SGL [14], and the group exponential lasso [18]. The underlying assumption is that the model is sparse at both the group and individual-variable levels. The proposed group spike-and-slab lasso Cox model can efficiently perform bi-level selection. At the group level, the importance of a group is controlled by the group-specific probability θg. Within a group, the spike-and-slab prior performs variable selection by shrinking irrelevant or small-effect coefficients exactly to zero, without affecting the prediction performance.
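As a simplified illustration of the group-level mechanism, θg can be estimated by averaging the coefficient-level inclusion probabilities within group g. This is a simplification of the EM-type update (the actual model places a prior on θg), so the sketch below is illustrative only.

```python
import numpy as np

def update_group_probs(p, groups):
    """Group-specific inclusion probabilities theta_g, estimated here as the
    within-group average of coefficient-level inclusion probabilities p_j."""
    return {g: float(np.mean([p[j] for j in idx])) for g, idx in groups.items()}
```

A group containing at least one clearly non-zero coefficient receives a larger θg, which in turn weakens the shrinkage applied to all of its members.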
The extensive simulation studies show that the prediction performance of the proposed method is consistently slightly better than that of the cMCP method, and significantly better than all other methods across the different scenarios. In the real data analysis, the prediction accuracy of the proposed method incorporating pathway information was only slightly improved over the existing methods. This might be mainly due to the complex genetic components involved in the expression data, such as haplotype blocks, subnetworks, and interactions among genes; the present model, under the linearity assumption, may not capture these complexities. More sophisticated strategies, such as defining more precise biological grouping information, could potentially further improve the models and enhance prediction accuracy.
There are several possible extensions of the proposed method. For example, it can be extended to incorporate multi-level group structures, such as a three-level SNP-gene-pathway structure. In addition, the proposed model uses the spike-and-slab mixture double-exponential prior; other priors with a spike at zero and heavier tails, such as the Cauchy distribution, a special case of the Student-t distribution, could be investigated. The theoretical and empirical properties of such priors differ, which may lead to further interesting results.
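The "heavier tails" point can be illustrated numerically: far from zero, a standard Cauchy density is orders of magnitude larger than a standard double-exponential density, so a Cauchy slab would shrink very large coefficients even less. A small sketch:

```python
import math

def de_pdf(x, s=1.0):
    """Double-exponential (Laplace) density, mean 0, scale s."""
    return math.exp(-abs(x) / s) / (2.0 * s)

def cauchy_pdf(x, s=1.0):
    """Cauchy density, location 0, scale s."""
    return 1.0 / (math.pi * s * (1.0 + (x / s) ** 2))

# Tail comparison at |x| = 10: the Cauchy density dominates the
# double-exponential density by roughly two orders of magnitude.
ratio = cauchy_pdf(10.0) / de_pdf(10.0)
```

At the origin the ordering reverses: the double-exponential has a taller spike, which is the property that drives strong shrinkage of irrelevant coefficients.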