 Methodology article
 Open Access
 Published:
Gsslasso Cox: a Bayesian hierarchical model for predicting survival and detecting associated genes by incorporating pathway information
BMC Bioinformatics volume 20, Article number: 94 (2019)
Abstract
Background
Group structures among genes encoded in functional relationships or biological pathways are valuable and unique features in largescale molecular data for survival analysis. However, most of previous approaches for molecular data analysis ignore such group structures. It is desirable to develop powerful analytic methods for incorporating valuable pathway information for predicting disease survival outcomes and detecting associated genes.
Results
We here propose a Bayesian hierarchical Cox survival model, called the group spikeandslab lasso Cox (gsslasso Cox), for predicting disease survival outcomes and detecting associated genes by incorporating group structures of biological pathways. Our hierarchical model employs a novel prior on the coefficients of genes, i.e., the group spikeandslab doubleexponential distribution, to integrate group structures and to adaptively shrink the effects of genes. We have developed a fast and stable deterministic algorithm to fit the proposed models. We performed extensive simulation studies to assess the model fitting properties and the prognostic performance of the proposed method, and also applied our method to analyze three cancer data sets.
Conclusions
Both the theoretical and empirical studies show that the proposed method can induce weaker shrinkage on predictors in an active pathway, thereby incorporating the biological similarity of genes within a same pathway into the hierarchical modeling. Compared with several existing methods, the proposed method can more accurately estimate gene effects and can better predict survival outcomes. For the three cancer data sets, the results show that the proposed method generates more powerful models for survival prediction and detecting associated genes. The method has been implemented in a freely available R package BhGLM at https://github.com/nyiuab/BhGLM.
Background
Survival prediction from highdimensional molecular data is an active topic in the fields of genomics and precision medicine, especially for various cancer studies. Largescale omics data provide extraordinary opportunities for detecting biomarkers and building accurate prognostic and predictive models. However, such highdimensional data also introduce statistical and computational challenges. Tibshirani [1, 2] has proposed a novel penalized method, lasso, for variable selection in highdimensional data, which has attracted considerable attention in modern statistical research. Thereafter, several penalized methods were developed, like minimax concave penalty (MCP) method by Zhang [3, 4], smoothly clipped absolute deviation (SCAD) penalty method by Fan and Li [5]. These penalization approaches have been widely applied for disease prediction and prognosis using largescale molecular data [6,7,8,9,10,11].
Furthermore, the group structures among molecular variables was noticed in analysis. For example, genes can be grouped into known biological pathways or functionally similar sets. Genes within a same biological pathway may be biologically related and statistically correlated. Incorporating such biological grouping information into statistical modeling can improve the interpretability and efficiency of the models. Several penalization methods have been proposed had been proposed to utilize the grouping information, such as group Lasso method [12], sparse group lasso (SGL) [13, 14]. group bridge [15], composite MCP [16], composite absolute penalty method [17], group exponential Lasso [18], group variable selection via convex logexpsum penalty method [19], and doubly sparse approach for group variable selection [20]. Some of these methods perform group level selection, including or excluding an entire group of variables. Others can perform bilevel selection, achieving sparsity within each group. Huang et al. [21] and Ogutu et al. [22] reviewed these penalization methods in prediction and highlighted some issues for further study.
Ročková and George [23, 24] recently proposed a new Bayesian approach, called the spikeandslab lasso, for highdimensional normal linear models using the spikeandslab mixture doubleexponential prior distribution. Based on the Bayesian framework, we have recently incorporated the spikeandslab mixture doubleexponential prior into generalized linear models (GLMs) and Cox survival models, and developed the spikeandslab lasso GLMs and Cox models for predicting disease outcomes and detecting associated genes [25, 26]. More recently, we have developed the group spikeandslab lasso GLMs [27] to incorporate biological pathways.
In this article, we aim to develop the group spikeandslab lasso Cox model (gsslasso Cox) for predicting disease survival outcomes and detecting associated genes by incorporating biological pathway information. An efficient algorithm was proposed to fit the group spikeandslab lasso Cox model by integrating ExpectationMaximization (EM) steps into the extremely fast cyclic coordinate descent algorithm. The novelty is incorporating group or biological pathway information into the spikeandslab lasso Cox model for predicting disease survival outcomes and detecting associated genes. The performance of the proposed method was evaluated via extensive simulations and comparing with several commonly used methods. The proposed procedure was also applied to three cancer data sets with thousands of gene expression values and their pathways information. Our results show that the proposed method not only generates powerful prognostic models for survival prediction, but also excels at detecting associated genes.
Methods
The group spikeandslab lasso Cox models
In Cox survival model, variables y_{i} = (t_{i}, d_{i}) for each individual is the survival outcome. The censoring indicator d_{i} takes 1 if the observed survival time t_{i} for individual i is uncensored. The d_{i} takes 0 if it is censored. For individual i, the true survival time is assumed by T_{i}. Therefore, when T_{i} = t_{i}, d_{i} = 1, whereas when T_{i} > t_{i}, d_{i} = 0. The predictor variables include numerous molecular predictors (e.g., gene expression) and some relevant demographic/clinical covariates. Assume that the predictors can be organized into G groups (e.g., biological pathways) based on existing biological knowledge. It should be indicated that the group could overlap each other. For example, one or some genes can belong to two or more biological pathway. Following the idea of overlap group lasso [28,29,30,31], we performed a restructure step by replicating a variable in whatever group it appears to expand the vector of predictors.
In Cox proportional hazards model, it usually assumes that the hazard function of survival time T takes the form [32, 33]:
where the baseline hazard function h_{0}(t) is unspecified, X and β are the vectors of explanatory variables and coefficients, respectively, and Xβ is the linear predictor or called the prognostic index.
Fitting classical Cox models is to estimate β by maximizing the partial loglikelihood [34]:
where R(t_{i}) is the risk set at time t_{i}. In the presence of ties, the partial loglikelihood can be approximated by the Breslow or the Efron methods [35, 36]. The standard algorithm for maximizing the partial loglikelihood is the NewtonRaphson algorithm [32, 37].
For high dimensional and/or correlated data, the classical model fitting is often unreliable. The problem can be solved by using Bayesian hierarchical modeling or penalization approaches [31, 38, 39]. We here propose a Bayesian hierarchical modeling approach, which allows us to simultaneously analyze numerous predictors and more importantly provides an efficient way to incorporate group information. Our hierarchical Cox models employ the spikeandslab mixture doubleexponential (de) prior on the coefficients:
where, s_{0} and s_{1} are the preset scale parameters, which are small and relatively large (0 < s_{0} < s_{1}), inducing strong or weak shrinkage on β_{j}, respectively. γ_{j} is the indicator variable: γ_{j} = 1 or 0. Equivalently, this prior can be expressed as (1  γ_{j}) de(0, s_{0}) + γ_{j} de(0, s_{1}), a mixture of the shrinkage prior de(0, s_{0}) and the weakly informative prior de(0, s_{1}), which are the spike and slab components of the prior distribution, respectively.
We incorporate the group structure by proposing a groupspecific Berllouli distribution for the indicator variables. For predictors in group g, the indicator variables are assumed to follow the Berllouli distribution with the groupspecific probability θ_{g}:
If group g includes important predictors, the parameter θ_{g} will be estimated to be relatively large, implying other predictors in the group more likely to be important. Therefore, the groupspecific Berllouli prior plays a role on incorporating the biological similarity of genes within a same pathway into the hierarchical model. For the probability parameters, we adopt a beta prior, θ_{g}~beta(a, b), setting a = b = 1 yielding the uniform hyper prior θ_{g}~U(0, 1) that will be used in later sections to illustrate our method. Hereafter, the above hierarchical Cox models are referred to as the group spikeandslab lasso Cox model.
The EM coordinate descent algorithm
We have developed a fast deterministic algorithm, called the EM coordinate descent algorithm to fit the spikeandslab lasso Cox models by estimating the posterior modes of the parameters [26]. The EM coordinate descent algorithm incorporates EM steps into the cyclic coordinate descent procedure for fitting the penalized lasso Cox models, and has been shown to be fast and efficient for analyzing highdimensional survival data [26]. We here extend the EM coordinate descent algorithm to fit the group spikeandslab lasso Cox models. We derive the algorithm based on the log joint posterior density of the parametersϑ = (β, γ, θ):
The loglikelihood function, logp(t, d β, h_{0}), is proportional to the partial loglikelihood pl(β) defined in Eq. (2) or the Breslow or the Efron approximation in the presence of ties [35, 36], if the baseline hazard function h_{0} is replaced by the Breslow estimator [37, 40]. Therefore, the log joint posterior density can be expressed as
where pl(β) is the partial likelihood described in (2), and S_{j} = (1 − γ_{j})s_{0} + γ_{j}s_{1}.
In EM coordinate decent algorithm, the indicator variables γ_{j} were treated the as ‘missing values’. The parameters (β, θ) were estimated by averaging the missing values over their posterior distributions. For the Estep, the expectation of the log joint posterior density was calculated with respect to the conditional posterior distributions of the missing data. For predictors in group g, the conditional posterior expectation of the indicator variable γ_{j} can be derived as
where p(γ_{j} = 1 θ_{g}) = θ_{g}, p(γ_{j} = 0 θ_{g}) = 1 − θ_{g}, p(β_{j} γ_{j} = 1, s_{1}) = de(β_{j} 0, s_{1}) and p(β_{j} γ_{j} = 0, s_{0}) = de(β_{j} 0, s_{0}). Therefore, the conditional posterior expectation of \( {S}_j^{1} \) can be obtained by
From Eqs. (7) and (8), we can see that the estimates of p_{j} and S_{j} are larger for larger coefficientsβ_{j}, leading to different shrinkage for different coefficients.
For the Mstep, parameters (β, θ) were updated by maximizing the posterior expectation of the log joint posterior density with γ_{j} and \( {S}_j^{1} \) replaced by their conditional posterior expectations. From the log joint posterior density, we can see that β and θ can be updated separately, because the coefficients β are only involved in\( pl\left(\beta \right){\sum}_{j=1}^J{S}_j^{1}\left{\beta}_j\right \) and the probability parameter θ is only in \( {\sum}_{j=1}^J\left({\gamma}_j\log {\theta}_g+\left(1{\gamma}_j\right)\log \left(1{\theta}_g\right)\right)+{\sum}_{g=1}^G\left(\left(a1\right)\log {\theta}_g+\left(b1\right)\log \left(1{\theta}_g\right)\right) \). Therefore, the coefficients β are updated by maximizing the expression:
where \( {\widehat{S}}_j^{1} \) is the conditional posterior expectation of \( {S}_j^{1} \) as derived above. Given the scale parameters S_{j}, the term \( {\sum}_{j=1}^J{\widehat{S}}_j^{1}\left{\beta}_j\right \) serves as the L_{1} lasso penalty with \( {\widehat{S}}_j^{1} \) as the penalty factors, and thus the coefficients can be updated by maximizing Q_{1}(β) using the cyclic coordinate decent algorithm, which is extremely fast and can estimate some coefficients exactly to zero [31, 41]. The probability parameters {θ_{g}} are updated by maximizing the expression:
We can easily obtain:
where J_{g} is the number of predictors belonging to group g.
Totally, the framework of the proposed EM coordinate decent algorithm was summarized as follows:

1)
Choose a starting value for β^{0}, and \( {\theta}_g^0 \). For example, we can initialize β^{0} = 0, and \( {\theta}_g^0=0.5 \).

2)
For t = 1, 2, 3, …,
Estep: Update γ_{j} and \( {S}_j^{1} \) by their conditional posterior expectations.
Mstep:

a)
Update β using the cyclic coordinate decent algorithm;

b)
Update (θ_{1}, ⋯, θ_{G}) by Eq. (11).
We assess convergence by the criterion: ∣d^{(t)} − d^{(t − 1)} ∣ /(0.1− d^{(t)} ) < ε, where d^{(t)} = − 2pl(β^{(t)}) is the estimate of deviance at the t^{th} iteration, and ε is a small value (say 10^{− 5}).
Evaluation of predictive performance
We can use several ways to measure the performance of a fitted group lasso Cox model, including the partial loglikelihood (PL), the concordance index (Cindex), the survival curves, and the survival prediction error [37]. The partial loglikelihood function measures the overall quality of a fitted Cox model, and thus is usually used to choose an optimal model [37, 41, 42]. The standard way to evaluate the performance of a model is to fit the model using a data set and then calculate the above measures with independent data. A variant of crossvalidation [31, 43], called prevalidation method was used in the present study to evaluate the performance. The data was randomly split to K subsets of roughly the same size. The (K – 1) subsets was used to fit a hierarchical Cox model. The estimate of coefficients denoted as \( {\widehat{\beta}}^{\left(k\right)} \) from the data excluding the kth subset. The prognostic indices \( {\widehat{\eta}}_{(k)}={X}_{(k)}{\widehat{\beta}}^{\left(k\right)} \), called the crossvalidated or prevalidated prognostic index, were calculated for all individuals in the kth subset of the data. Crossvalidated prognostic indices \( {\widehat{\eta}}_i \) for all individuals can be calculated by cycling through all the K parts. Then, (\( {t}_i,{d}_i,{\widehat{\eta}}_i \)) was used to compute the several measures described above. We can see that the crossvalidated prognostic value for each patient is derived independently of the observed response of the patient. Therefore, the ‘prevalidated’ dataset (\( {t}_i,{d}_i,{\widehat{\eta}}_i \)) can essentially be treated as a ‘new dataset’. This procedure provides valid assessment of the predictive performance of the model [31, 43].
Moreover, we also use an alternative way to evaluate the partial loglikelihood, i.e., the socalled crossvalidated partial likelihood (CVPL), defined as [37, 41, 42].
where \( {\widehat{\beta}}_{\left(k\right)} \) is the estimate of β from all the data except the kth part, \( pl\left({\widehat{\beta}}_{\left(k\right)}\right) \) is the partial likelihood of all the data points and \( {pl}_{\left(k\right)}\left({\widehat{\beta}}_{\left(k\right)}\right) \) is the partial likelihood excluding part k of the data. By subtracting the logpartial likelihood evaluated on the nonleft out data from that evaluated on the full data, we can make efficient use of the death times of the left out data in relation to the death times of all the data.
Selecting optimal scale values
The spikeandslab doubleexponential prior requires two preset scale parameters (s_{0}, s_{1}). Following the previous studies [24,25,26], we set the slab scale s_{1} to be relatively large (e.g., 1), and consider a sequence of L decreasing values {\( {s}_0^l \)}: \( {s}_1>{s}_0^1>{s}_0^2>\cdots >{s}_0^L>0 \), for the spike scale s_{0}. We then fit L models with scales {\( \left({s}_0^l,{s}_1\right);l=1,\cdots, L \)} and select an optimal model using the method described above. This procedure is similar to the lasso implemented in the widelyused R package glmnet, which quickly fits the lasso Cox models over a grid of values of λ covering its entire range, giving a sequence of models for users to choose from [31, 41].
Implementation and software package
We have incorporated the method proposed in this study into the function bmlasso() in our R package BhGLM [44]. The package BhGLM also includes several other functions for summarizing and evaluating the predictive performance, like summary.bh, cv.bh predict.bh. The function in the package is very fast, usually taking several minutes for fitting and evaluating a model with thousands of variables. The package BhGLM is freely available from https://github.com/nyiuab/BhGLM.
Simulation study and real data analysis
Simulation studies
We assessed the proposed approach by extensive simulations, and compared with the lasso implemented in the R package glmnet and several penalization methods that can incorporate group information, including sparse group lasso (SGL) in the R package SGL, overlap group lasso (grlasso), overlap group MCP (grMCP), overlap group SCAD (grSCAD), and overlap group composite MCP (cMCP) in the R package grpregOverlap [45]. Our simulation method was similar to our previous work [26, 27]. We considered five simulation scenarios with different complexities, including nonoverlap or overlap groups, group sizes, number of nonnull groups, and correlation coefficients (r) (Table 1). In simulation scenario 2–5, overlap structures were considered. To handle the overlap structures, we duplicated overlapping predictors into groups that predictors belong to [28, 30]. In each scenario, we simulated two data sets, and used the first one as the training data to fit the models and the second one as the test data to evaluate the predictive values. We replicated the simulation 100 times and summarized the results over these replicates. In simulation scenario 6, we vary the effect size of the nonzero coefficient β_{5}, from − 2 to 2. Other simulation setting are the same with scenario 2. The purpose of this simulation is to see the profile of prior scale along with varying effect size.
Each simulated dataset included n = 500 observations, with a censored survival response y_{i} and a vector of m = 1000 continuous predictorsX_{i} = (x_{i1}, …, x_{im}). We assumed 20 groups. Each group included about 50 predictors. For example, group 1 and 2 included variables (x_{1}, …, x_{50}) and (x_{51}, …, x_{100}), respectively. The vector X_{i} was randomly sampled from multivariate normal distributionN_{1000}(0, Σ), where the covariance matrix Σ was set to account for varied grouped correlation and overlapped structures under different simulation scenarios. We simulated several scenarios. The predictors were assumed to be correlated each other with in group and those predictors in different groups were assumed to be independent. The correlation coefficient r was generally set to be 0.5.
To simulate the censored survival response, following the method of Simon [41], we generated the “true” survival time T_{i} for each individual from the exponential distribution: \( {T}_i\sim \mathrm{Expon}\left(\exp \left({\sum}_{j=1}^m{x}_{ij}{\beta}_j\right)\right) \) and the censoring time C_{i} for each individual from the exponential distribution: C_{i}~Expon(exp(r_{i})), where r_{i} were randomly sampled from a standard normal distribution. The observed censored survival time t_{i} was set to be the minimum of the “true” survival and censoring times, t_{i} = min(T_{i}, C_{i}), and the censoring indicator d_{i} was set to be 1 if C_{i} > T_{i} and 0 otherwise. Our simulation scenarios resulted in different censoring ratios, but generally below 50%. For all the scenarios, we set eight coefficients to be nonzero and the others to be zero.
Scenario 1: Nonoverlap group
In this scenario, each group is independent. There was no any overlap among groups. Eight nonzero predictors{x_{5}, x_{20}, x_{40}}, {x_{210}, x_{220}, x_{240}}, {x_{975}, x_{995}} were simulated to be included into three groups, group 1, 5, and 20 (Table 1). The group sizes is 50, including 50 predictors, presented as below:
Group ID:  1  2  …  5  …  19  20 
Group setting:  x_{1} − x_{50}  x_{51} − x_{100}  x_{201} − x_{250}  x_{901} − x_{950}  x_{951} − x_{1000} 
Scenario 2: Overlap grouping
In this scenario, overlapped grouping structure was considered. Only the last group is independent. For example, for group 1 and group 2, there were five predictors (x_{46}, x_{47}, x_{48}, x_{49}, x_{50}) belong to two groups. The setting for eight nonzero predictors and their effect sizes are the same with scenario 1. The group sizes is still 50. The overlap structure are presented below:
Group ID:  1  2  3  …  19  20 
Group setting:  x_{1} − x_{50}  x_{46} − x_{100}  x_{96} − x_{150}  x_{896} − x_{950}  x_{951} − x_{1000} 
Scenario 3: Varying group sizes
Group size means the number of predictors included in a group. A big group size means the group included relative more predictors. The group size may affect the model fitting. In this scenario, we assumed two groups, group 1 and 11, including nonzero predictors, {x_{1}, x_{2}, x_{3}, x_{4}} and {x_{501}, x_{502}, x_{503}, x_{504}}, respectively. Other simulation setting are similar with scenario 2. To investigate the group size effect on model fitting, we simulated different group size as below:

(1).
only four nonzero predictors included in group 1 and 11:
Group ID:  1  2  3  …  11  12  …  19  20 
Group setting:  x_{1}  x_{4}  x_{5}  x_{100}  x_{96}  x_{150}  x_{501}  x_{504}  x_{505}  x_{600}  x_{896}  x_{950}  x_{951}  x_{1000} 

(2).
20 predictors included in group 1 and 11:
Group ID:  1  2  3  …  11  12  …  19  20 
Group setting:  x_{1}  x_{20}  x_{21}  x_{100}  x_{96}  x_{150}  x_{501}  x_{520}  x_{521} x _{600}  x_{896}  x_{950}  x_{951}  x_{1000} 

(3).
50 predictors included in group 1 and 11:
Group ID:  1  2  3  …  11  12  …  19  20 
Group setting:  x_{1}  x_{50}  x_{46}  x_{100}  x_{96}  x_{150}  x_{501}  x_{550}  x_{546}  x_{600}  x_{896}  x_{950}  x_{951}  x_{1000} 
Scenario 4: Varying the number of nonnull group
The true nonzero predictors may be included in some groups. Other zero predictors belong to other groups. These groups included nonzero predictors called nonnull group. The number of nonnull group may also affect the model fitting. To evaluate the group number effect, we varied the number of nonnull groups, as following:

(1).
There are 8 nonnull groups including nonzero coefficients: {x_{5}}, {x_{55}}, {x_{305}}, {x_{355}}, {x_{505}}, {x_{555}}, {x_{905}}, and {x_{955}};

(2).
There are 3 nonnull groups including nonzero coefficients: {x_{5}, x_{15}, x_{25}}, {x_{355}, x_{365}, x_{375}}, and {x_{905}, x_{915}};

(3).
There is only 1 nonnull group including nonzero coefficients: {x_{5}, x_{10}, x_{15}, x_{20}, x_{25}, x_{30}, x_{35}, x_{40}}. The overlap settings were the same with scenario 2. The group number and effect sizes of these nonzero coefficients are shown in Table 1.
Scenario 5: Varying the correlation within group
To evaluate the effect of correlation within group, we set different correlation coefficients within a group: r = 0.0, 0.5, and 0.7. Other settings were the same with scenario 2.
Scenario 6: Selfadaptive shrinkage on varying the effect size
The significant feature of the proposed spikeandslab prior is the selfadaptive shrinkage. To show this property, we performed additional simulation study based on Scenario 1. We fixed the prior scale (s_{0}, s_{1}) = (0.02, 1) and varied the effect size of the first simulated nonzero predictor (x_{5}) from (− 2, 2). We recorded the scale parameters for this nonzero predictor (x_{5}) and nearby zero effect predictor (x_{6}), and nonzero predictor (x_{20}) with the simulated effect size − 0.7. These three predictors belong to the same group.
Real data analysis
We applied the proposed gsslasso Cox model to analyze three real datasets, ovarian cancer (OV), lung adenocarcinoma (LUAD), and breast cancer. The whole genome expression data were downloaded from The Cancer Genome Atlas (TCGA, http://cancergenome.nih.gov/) (updated at June 2017). We firstly clean the data to get the clear survival information and potential genes involved in further analysis. The details of the three datasets and clean procedure are described below paragraphs. Secondly, several genome annotation tools were used to construct the pathways information. All the genes were mapped to KEGG pathways by using R/bioconductor packages: mygene, clusterProfiler and AnnotationDbi [46]. The R/Bioconductor mygene package was used to convert gene names to gene ENTREZ ID. The clusterProfiler package was used to get pathway/group information for genes, by loading the gene ENTREZ ID. AnnotationDbi was used primarily to create mapping objects that allow easy access from R to underlying annotation databases, like KEGG in the present study. By using these packages, we mapped the genes into pathways, and got group structure information for further analysis. Only the gene included in pathways were used in further analysis. Thirdly, the proposed method and several penalization approaches used in above simulation study were applied to analyze the survival data with thousands of genes and pathway/group information. We performed 10fold crossvalidation with 10 replicates to evaluate the predictive values of the several models. After model fitting, the nonzero parameters were the detected genes.
TCGA ovarian cancer dataset (mRNA sequencing data)
This dataset contains mRNA expression data and relevant clinical outcome for ovarian cancer (OV) from TCGA. The raw dataset includes 304 patients and 20,503 genes after removing the duplication and unknown gene names. The raw clinical data included 586 patients. We cleaned the clinical survival data from several clinical files, and obtained 582 patients with clear survival information. We merged the individuals both with gene expression data and survival information, and obtained 304 patients with 20,503 genes for further analysis. First, we filtered the genes with expressions values less than 10. Then, genes with more than 30% of zero expression values in the dataset were removed. Furthermore, we calculated the coefficient of variance (CV) of expression values for each gene, and kept the genes with CV of larger than 20% quantile. After these steps, 304 patients with 14,265 genes were included in our analysis. The censoring ratio was 39.5%.We mapped these genes to 271 pathways including 4260 genes.
TCGA lung adenocarcinoma dataset (mRNA sequencing data)
The raw expression data contains 578 patients and 20,530 genes. After removing the duplication and unknown gene names, there are 516 patients with 20,501 used for further analysis. The raw clinical data included 521 patients. We cleaned the clinical data with clear survival records, and included 497 patients in our analysis. We then merged the clinical data and expression data, and obtained 491 patients for with 20,501 genes for quality control. Similar with the steps for ovarian cancer dataset, we filtered the genes with expressions values less than 10. Then, we removed genes with more than 30% of zero expression values in the dataset. Furthermore, we calculated the coefficient of variance (CV) of expression values for each gene, and kept the genes with CV of larger than 20% quantile. After these steps, 491 patients with 14,143 genes were included in our analysis. The censoring ratio was 68.4%. We mapped these genes to 274 pathways including 4266 genes.
TCGA breast cancer dataset (mRNA sequencing data)
The raw expression data contains 1220 patients and 20,530 genes. After removing the duplication and unknown gene names, there are 1097 patients with 20,503 used for further analysis. The raw clinical data included 1097 patients. We cleaned the clinical data with clear survival records, and included 1084 patients in our analysis. We then merged the clinical data and expression data, and obtained 1082 patients for with 20,503 genes for quality control. The same steps used here for breast cancer dataset, we filtered the genes with expressions values less than 10, and removed genes with more than 30% of zero expression values in the dataset. Furthermore, we calculated the coefficient of variance (CV) of expression values for each gene, and kept the genes with CV of larger than 20% quantile. After these steps, 1082 patients with 14,077 genes were included in our analysis. The censoring ratio was 86.0%. We mapped these genes to 275 pathways including 4385 genes.
Results
Simulation results
Predictive performance
Tables 2 and 3 summarizes the CVPL (crossvalidated partial likelihood) and Cindex in the testing data over 100 replicates for Scenarios 1–5. We observed that the group spikeandslab lasso Cox model performed similarly with cMCP and outperformed other methods, under different simulation scenarios. These results suggested that, with complex group structures, the proposed method could perform well.
Accuracy of parameter estimates
To evaluate the accuracy of parameters estimation, we summarized the average numbers of nonzero coefficients and the mean absolute errors (MAE) of coefficient estimates, defined as MAE = \( \sum \left{\widehat{\beta}}_j{\beta}_j\right/m \), in Tables 4 and 5 for different scenarios. It was found that the dected number of nullzero coefficients were very close preset number 8, and the values of MAE were very small for the proposed method under different scenarios. The performances of the group spikeandslab lasso Cox and cMCP were consistently better than the other methods for all the five scenarios, and the proposed method was slightly better than cMCP. These results suggested that the proposed method can generate lowest false positive and unbiased estimation.
The estimates of coefficients from the group spikeandslab lasso Cox and the other methods over 100 replicates are shown in Fig. 1 and Additional file 1: Figure S1, Additional file 2: Figure S2, Additional file 3: Figure S3, Additional file 4: Figure S4, Additional file 5: Figure S5, Additional file 6: Figure S6 and Additional file 7: Figure S7 for different scenarios. It can be seen that the group spikeandslab lasso Cox method produced estimates close to the simulated values for all the coefficients. This is expected, because the spikeandslab prior can induce weak shrinkage on larger coefficients and strong shrinkage on zero coefficients. In contrast, other methods except for cMCP, gave a strong shrinkage amount on all the coefficients and resulted in the solutions that nonzero coefficients were shrunk and underestimated compared to true values. In addition, higher false positives (grey bars) were observed, except for the group spikeandslab lasso Cox and cMCP methods.
The selfadaptive shrinkage feature
To show the selfadaptive shrinkage feature, we performed simulation 6. Figure 2 shows the adaptive shrinkage amount on nonzero coefficients x_{5}, along with the varying effect size. It clearly shows that the proposed spikeandslab lasso Cox model approach has selfadaptive and flexible characteristics, without affecting the nearby zero coefficient (x_{6}) and nonzero variable (x_{20}) belong to the same group.
Real data analysis results
There were about one third genes were mapped to pathways for the above three real datasets. The rest genes were put together as an additional group. The detailed information of genes shared by different pathways is listed in Additional file 8: S1, S2, and S3, for ovarian cancer, lung cancer and breast cancer, respectively.
Real data analysis is to build a survival model for predicting the overall survival outcome by integrating gene expression data and pathway information. We standardized all the predictors to use a common scale for all predictors, prior to fitting the models, using the function covariate() function in BhGLM package. In our prior distribution, there were to preset parameters, (s_{0}, s_{1}). In our real data analysis, we fixed the slab scale s_{1} to 1, and varied the spike scale s_{0} values by: {k × 0.01; k = 1, …, 9}, leading to 9 models. The optimal spike scale s_{0} was select by the 10fold 10time crossvalidation according to the CVPL. Using the optimal s_{0}, we performed further real data analysis. For comparison, we also analyzed the data using the several existing methods as described in the simulation studies.
We performed 10fold crossvalidation with 10 replicates to evaluate the predictive values of the several models. Table 6 summarizes the measures of the prognostic performance on these three data sets, by only using the genes included in pathway. For all the data sets the proposed group spikeandslab lasso Cox model performed better than the other methods. The above results used only genes mapped in pathways. Additional file 9 shows the measures of the performance on these three data sets, by using the all genes. The genes which were not mapped into any pathway were put together as an additional group. We can see that the prediction performance of the proposed method were still better than the other methods.
The pathway enrichment analyses for these detected genes were summarized in Additional file 10: S4S6. Additional file 11: S7 presents the genes detected by the proposed and existed methods. Their standardized effects size were also plotted for the three real data sets. There were many common gene among different methods. For ovarian cancer dataset, the genes CYP2R1 and HLADOB detected by the proposed gsslasso method, were also detected by both lasso and cMCP methods. For Lung cancer dataset, several genes detected by the proposed gsslasso method, such as VDAC1, EHHADH, ACAT2, KIT, CCND1, PIK3R1, NRAS, GNPNAT1, and KYNU, were also detected by other method. For, breast cancer dataset, two genes HSPA1A and ABCB5 detected by the proposed gsslasso method were also detected by other method.
We found that most of the genes detected by the proposed method were associated with cancers in previous studies. HABP2, detected in ovarian cancer, was associated with familial nonmedullary thyroid cancer [47]. CYP24A1, the main enzyme responsible for the degradation of active vitamin D, plays an important role in many cancer related cellular processes. The associations between CYP24A1 polymorphisms and cancer susceptibility had been evaluated by many studies [48]. Keratin 8 (KRT8) plays an essential role in the development and metastasis of multiple human cancers. A recent study suggested that in clear cell renal cell carcinoma and gastric cancer, KRT8 upregulation promotes tumor metastasis and associated with poor prognosis [49, 50]. E2F7, detected in lung cancer, involved in several cancer studies, which might act as an independent prognostic factor for breast cancer, and Squamous Cell Carcinoma, and gliomas [51,52,53]. Most of the genes detected by the proposed method in the three real datasets had been found associated with different cancers. These results may provide some interesting information for further studies.
Discussion
The group structures among various features arise naturally in many biological and medical researches, especially in largescale omics data. Such grouping information is biologically meaningful and intrinsically encoded in the biological data. Thus it is desirable to incorporate the grouping information into data analysis. Various penalization methods have been designed for such situations [13, 14, 30, 54, 55]. Recently, we have developed a novel hierarchical modeling approach, the group spikeandslab lasso GLMs, to integrate the variable group information for gene detection and prognostic prediction [27]. In this study, we extended the method to Cox proportional hazards model for analyzing censored survival data.
Similar to the group spikeandslab lasso GLMs, the key to our group spikeandslab lasso Cox is the group spikeandslab doubleexponential prior. This prior has significant advantage in variable selection and parameter estimation. It induces weak shrinkage on larger coefficients and strong shrinkage on irrelevant coefficients. In contrast, other methods usually gave a strong shrinkage amount on all the coefficients and resulted in the solutions that nonzero coefficients were shrunk and underestimated. The proposed group spikeandslab prior allows the model to incorporate the biological similarity of genes within a same pathway into the analysis.
The spikeandslab prior depends on the spike and slab scale parameters. Our previous study suggested that slab scale s_{1} had little influence on model fitting, while the spike scale s_{0} strongly affected model performance [25, 26]. A slab scale s_{1} value introducing weak shrinkage amount would be helpful to include relevant variables into the model. Therefore, we set s_{1} = 1 in our analysis. We evaluated the performance of the proposed model on a grid of values of spike scale s_{0} from a reasonable range, e.g., (0, 0.1), and then selecting an optimal value using crossvalidation. This is a pathfollowing strategy for fast dynamic posterior exploration of the proposed models, which is similar to the approach of Ročková and George [24, 56]. Additional file 12: Figure S8 a and b show the solution paths under Scenario 2, for the proposed model and the lasso Cox model. Additional file 12: Figure S8 c and d show the profiles of crossvalidated palatial loglikelihood by 10fold crossvalidation for the proposed model. These profiles would help to choose optimal tuning parameters. It could be found that, similar to the lasso, the spikeandslab lasso Cox is a pathfollowing strategy for fast dynamic posterior exploration. However, the solution path is essentially different from that of the lasso model. For the lasso cox model, the number of nonzero coefficient could be a few, even zero if a strong penalty is adopted. However, in the spikeandslab lasso Cox model, larger coefficients will be always included in the model with weak shrinkage, while irrelevant coefficients are removed (grey path in Additional file 12: Figure S8 a).
Another feature of the proposed spikeandslab prior is bilevel selection, which is capable of selecting important groups as well as important individual variables within those groups. Several methods perform bilevel selection, including cMCP method [16], SGL [14], and group exponential lasso [18]. The underlying assumption is that the model is sparse at both the group and individual variable levels. The proposed group spikeandslab lasso Cox model can efficiently perform bilevel selection. In group level, the importance of a group is controlled by the groupspecific probability θ_{g}. Within a group, the spikeandslab prior allows to perform variable selection by shrinking irrelevant or small effect coefficients exactly to zero, without affecting the prediction performance.
The extensive simulation studies show that the prediction performance of the proposed method is always slightly better than cMCP method, and significantly better than all other methods under different scenarios. In the real data analysis, the prediction accuracy of the proposed method incorporating pathway information was slightly improved compared with the existing methods. This might be mainly due to the complex genetic components involved in the expression data, like haplotype blocks, subnetworks, and interaction among the genes. The present model under the linear assumption may not capture these complexities. More sophisticated strategies could potentially enhance prediction accuracy and further improve the models, by defining more precise biological grouping information.
There are several further extensions of the proposed method. For example, it can also be extended to incorporate multiple level group structure, like threelevel group structure, i.e. SNPgenepathway. In addition, the proposed model takes the spikeandslab mixture doubleexponential prior. Other priors with a spike at zero and includes heavier tails could be investigated, like Cauchy distribution, a special case of Studentt distribution. The theoretical and empirical properties of other priors are different, which may introduce more interesting results.
Conclusion
Incorporating biological group structure in highdimensional molecular data analysis can improve the accuracy of disease prediction and power of gene detection. We propose a new hierarchical Cox model, gsslasso Cox, for incorporating biological pathway information for predicting disease survival outcomes and detecting associated genes. We develop a fast and stable deterministic algorithm to fit the proposed models. Extensive simulation studies and real applications show that compared with several existing methods, the proposed approach provides more accurate parameter estimation and survival prediction. The proposed method has been implemented in a freely available R package BhGLM.
Abbreviations
 cMCP:

Composite minimax concave penalty
 CVPL:

Crossvalidated partial likelihood
 grlasso:

Group lasso
 grMCP:

Group minimax concave penalty
 grSCAD:

Group smoothly clipped absolute deviation
 gsslasso cox:

Group spikeandslab lasso cox model
 lasso:

Least absolute shrinkage and selection operator
References
Tibshirani R. Regression shrinkage and selection via the lasso. J Royal Statistical Soc Series B. 1996;58:267–88.
Tibshirani R. The lasso method for variable selection in the Cox model. Stat Med. 1997;16(4):385–95.
Zhang C. Penalized linear unbiased selection. Rutgers University: Department of Statistics and Bioinformatics; 2007. Technical Report #2007–2003
Zhang CH. Nearly unbiased variable selection under minimax concave penalty; 2010. p. 894–942.
Fan J, Li R. Variable selection via nonconcave penalized likelihood and its Oracle properties. J Am Stat Assoc. 2001;96(456):1348–60.
Zhang W, Ota T, Shridhar V, Chien J, Wu B, Kuang R. Networkbased survival analysis reveals subnetwork signatures for predicting outcomes of ovarian cancer treatment. PLoS Comput Biol. 2013;9(3):e1002975.
Yuan Y, Van Allen EM, Omberg L, Wagle N, AminMansour A, Sokolov A, Byers LA, Xu Y, Hess KR, Diao L, et al. Assessing the clinical utility of cancer genomic and proteomic data across tumor types. Nat Biotechnol. 2014;32(7):644–52.
Sohn I, Sung CO. Predictive modeling using a somatic mutational profile in ovarian high grade serous carcinoma. PLoS One. 2013;8(1):e54089.
Rapaport F, Zinovyev A, Dutreix M, Barillot E, Vert JP. Classification of microarray data using gene networks. BMC Bioinformatics. 2007;8(1):1–15.
Barillot E, Calzone L, Hupe P, Vert JP, Zinovyev A. Computational systems biology of Cancer Chapman & Hall/CRC Mathematical & Computational Biology; 2012.
Zhao Q, Shi X, Xie Y, Huang J, Shia B, Ma S. Combining multidimensional genomic measurements for predicting cancer prognosis: observations from TCGA. Brief Bioinform. 2015;16(2):291–303.
Yuan M, Lin Y. Model selection and estimation in regression with grouped variables. J R Stat Soc Ser B. 2006;68(1):49–67.
Friedman J, Hastie T, Tibshirani R. A note on the group lasso and a sparse group lasso. Stanford University: Technical report, Department of Statistics; 2010.
Simon N, Friedman J, Hastie T, Tibshirani R. A sparsegroup lasso. J Comput Graph Stat. 2013;22(2):231–45.
Huang J, Ma S, Xie H, Zhang CH. A group bridge approach for variable selection. Biometrika. 2009;96(2):339–55.
Breheny P, Huang J. Penalized methods for bilevel variable selection. Statistics and its interface. 2009;2(3):369–80.
Zhao P, Rocha G, Yu B. The composite absolute penalties family for grouped and hierarchical variable selection. Ann Stat. 2009;37(6A):3468–97.
Breheny P. The group exponential lasso for bilevel variable selection. Biometrics. 2015;71(3):731–40.
Chen Y, Du P, Wang Y. Variable selection in linear models. Wiley Interdisciplinary Reviews: Computational Statistics. 2014;6(1):1–9.
Kwon S, Ahn J, Jang W, Lee S, Kim Y. A doubly sparse approach for group variable selection. Ann Inst Stat Math. 2017;69(5):997–1025.
Huang J, Breheny P, Ma S. A selective review of group selection in highdimensional models. Stat Sci. 2012;27(4).
Ogutu JO, Piepho HP. Regularized group regression methods for genomic prediction: bridge, MCP, SCAD, group bridge, group lasso, sparse group lasso, group MCP and group SCAD. BMC Proc. 2014;8(Suppl 5):S7.
Ročková V, George EI. Bayesian penalty mixing: the case of a nonseparable penalty. In: Frigessi A, Bühlmann P, Glad IK, Langaas M, Richardson S, Vannucci M, editors. Statistical analysis for highdimensional data: the Abel symposium, vol. 2014. Cham: Springer International Publishing; 2016. p. 233–54.
Ročková V, George EI: The spikeandslab lasso. J Am Stat Assoc 2016:Online, DOI: https://doi.org/10.1080/01621459.01622016.01260469.
Tang Z, Shen Y, Zhang X, Yi N. The spikeandslab lasso generalized linear models for prediction and associated genes detection. Genetics. 2017;205(1):77–88.
Tang Z, Shen Y, Zhang X, Yi N. The spikeandslab lasso Cox model for survival prediction and associated genes detection. Bioinformatics. 2017;33(18):2799–807.
Tang Z, Shen Y, Li Y, Zhang X, Wen J, Qian C, Zhuang W, Shi X, Yi N. Group spikeandslab lasso generalized linear models for disease prediction and associated genes detection by incorporating pathway information. Bioinformatics. 2018;34(6):901–10.
Silver M, Montana G. Alzheimer’s disease neuroimaging I: fast identification of biological pathways associated with a quantitative trait using group lasso with overlaps. Stat Appl Genet Mol Biol. 2012;11(1):Article 7.
Silver M, Chen P, Li R, Cheng CY, Wong TY, Tai ES, Teo YY, Montana G. Pathwaysdriven sparse regression identifies pathways and genes associated with highdensity lipoprotein cholesterol in two Asian cohorts. PLoS Genet. 2013;9(11):e1003939.
Jacob L, Obozinski G, Vert JP. Group lasso with overlap and graph lasso. In: Proceedings of the 26th Annual International Conference on Machine Learning. Montreal, Quebec, Canada: 1553431: ACM; 2009. p. 433–40.
Hastie T, Tibshirani R, Wainwright M. Statistical learning with sparsity  the lasso and generalization. New York: CRC Press; 2015.
Klein J, Moeschberger M. Survival Analysis. New York: SpringerVerlag; 2003.
Ibrahim J, Chen MH, Debajyoti S. Bayesian survival analysis. New York: SpringerVerlag; 2001.
Cox DR. Regression models and life tables. J R Stat Soc. 1972;34:187–220.
Breslow NE. Contribution to the discussion of the paper by D. R. Cox. J Royal Stat Soc B. 1972;34:216–7.
Efron B. The efficiency of Cox's likelihood function for censored data. J Am Stat Assoc. 1977;72:557–65.
van Houwelinggen HG, Putter H. Dynamic prediction in clinical survival analysis. Boca Raton: CRC Press; 2012.
Gelman A, Hill J. Data analysis using regression and multilevel/hierarchical models. New York: Cambridge University Press; 2007.
Gelman A, Carlin JB, Stern HS, Dunson DB, Vehtari A, Rubin DB. Bayesian data analysis. Third ed. New York: Chapman & Hall/CRC Press; 2014.
Breslow N. Covariance analysis of censored survival data. Biometrics. 1974;30:89–99.
Simon N, Friedman J, Hastie T, Tibshirani R. Regularization paths for Cox's proportional hazards model via coordinate descent. J Stat Softw. 2011;39(5):1–13.
van Houwelingen HC, Bruinsma T, Hart AA, Van’t Veer LJ, Wessels LF. Crossvalidated Cox regression on microarray gene expression data. Stat Med. 2006;25(18):3201–16.
Tibshirani RJ, Efron B. Prevalidation and inference in microarrays. Stat Appl Genet Mol Biol. 2002;1:1–18.
Yi N, Tang Z, Zhang X, Guo B. BhGLM: Bayesian hierarchical GLMs and survival models, with applications to genomics and epidemiology. Bioinformatics. 2018. https://doi.org/10.1093/bioinformatics/bty803.
Zeng Y, Breheny P. Overlapping group logistic regression with applications to genetic pathway selection. Cancer Informat. 2016;15:179–87.
Yu G, Wang LG, Han Y, He QY. clusterProfiler: an R package for comparing biological themes among gene clusters. Omics : a journal of integrative biology. 2012;16(5):284–7.
Gara SK, Jia L, Merino MJ, Agarwal SK, Zhang L, Cam M, Patel D, Kebebew E. Germline HABP2 mutation causing familial nonmedullary thyroid Cancer. N Engl J Med. 2015;373(5):448–55.
Zhu M, Qiu S, Zhang X, Wang Y, Souraka TDM, Wen X, Liang C, Tu J. The associations between CYP24A1 polymorphisms and cancer susceptibility: a metaanalysis and trial sequential analysis. Pathology  Research and Practice. 2018;214(1):5363.
Tan HS, Jiang WH, He Y, Wang DS, Wu ZJ, Wu DS, Gao L, Bao Y, Shi JZ, Liu B, et al. KRT8 upregulation promotes tumor metastasis and is predictive of a poor prognosis in clear cell renal cell carcinoma. Oncotarget. 2017;8(44):76189–203.
Fang J, Wang H, Liu Y, Ding F, Ni Y, Shao S. High KRT8 expression promotes tumor progression and metastasis of gastric cancer. Cancer Sci. 2017;108(2):178–86.
Chu J, Zhu Y, Liu Y, Sun L, Lv X, Wu Y, Hu P, Su F, Gong C, Song E, et al. E2F7 overexpression leads to tamoxifen resistance in breast cancer cells by competing with E2F1 at miR15a/16 promoter. Oncotarget. 2015;6(31):31944–57.
Yin W, Wang B, Ding M, Huo Y, Hu H, Cai R, Zhou T, Gao Z, Wang Z, Chen D. Elevated E2F7 expression predicts poor prognosis in human patients with gliomas. J Clin Neurosci. 2016;33:187–93.
HazarRethinam M, de Long LM, Gannon OM, Boros S, Vargas AC, Dzienis M, Mukhopadhyay P, SaenzPonce N, Dantzic DDE, Simpson F, et al. RacGAP1 is a novel downstream effector of E2F7dependent resistance to doxorubicin and is prognostic for overall survival in squamous cell carcinoma. Mol Cancer Ther. 2015;14(8):1939–50.
Meier L, van de Geer S, Bühlmann P. The group lasso for logistic regression. J Royal Stat Soc Series B. 2008;70(1):53–71.
Zhou N, Zhu J. Group variable selection via a hierarchical lasso and its Oracle property; 2011.
Ročková V, George EI. EMVS: the EM approach to Bayesian variable selection. J Am Stat Assoc. 2014;109(504):828–46.
Acknowledgements
We acknowledge the contributions of the TCGA Research Network.
Funding
This work was supported in part by the National Natural Science Foundation of China (81773541 and 81573253), funds from the Priority Academic Program Development of Jiangsu Higher Education Institutions at Soochow University, grants from China Scholarship Council, Jiangsu Provincial Key Project in Research and Development of Advanced Clinical Technique (BL2018657) to ZT, USA National Institutes of Health (R03DE024198, R03DE025646) to NY. The funding body did not played any roles in the design of the study and collection, analysis, and interpretation of data and in writing the manuscript.
Availability of data and materials
All data for the manuscript will be available without restriction at website: https://cancergenome.nih.gov/. The package BhGLM is freely available from https://github.com/nyiuab/BhGLM.
Author information
Authors and Affiliations
Contributions
Study conception and design: NY, ZXT. Simulation study and data summary: ZXT, NY. Real data analysis and interpretation: ZXT, YS, SFL, XZ, ZY, BG, JC, NY. Drafting of manuscript: ZXT, YS, SFL, XZ, ZY, BG, JC, NY. All authors read and approved the final version of the manuscript.
Corresponding authors
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not Applicable.
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Additional files
Additional file 1:
Figure S1. The parameter estimation averaged over 100 replicates for the group spikeandslab lasso Cox (gsslasso), the lasso, grlasso, grMCP, grSCAD, and cMCP methods for Scenario 2. Blue cycles denote the simulated nonzero values. Black points and lines represent the estimated values and the interval estimates of coefficients. (PDF 523 kb)
Additional file 2:
Figure S2. The parameter estimation averaged over 100 replicates for the group spikeandslab lasso Cox (gsslasso), the lasso and grlasso methods for Scenario 3. Blue cycles denote the simulated nonzero values. Black points and lines represent the estimated values and the interval estimates of coefficients. The main title of each plot denotes the varying group size for scenario 3. (PDF 778 kb)
Additional file 3:
Figure S3. The parameter estimation averaged over 100 replicates for grMCP, grSCAD, and cMCP methods for Scenario 3. Blue cycles denote the simulated nonzero values. Black points and lines represent the estimated values and the interval estimates of coefficients. The main title of each plot denotes the varying group size for Scenario 3. (PDF 767 kb)
Additional file 4:
Figure S4. The parameter estimation averaged over 100 replicates for the group spikeandslab lasso Cox (gsslasso), the lasso and grlasso methods for Scenario 4. Blue cycles denote the simulated nonzero values. Black points and lines represent the estimated values and the interval estimates of coefficients. The main title of each plot denotes the varying the number of nonnull group for Scenario 4. (PDF 791 kb)
Additional file 5:
Figure S5. The parameter estimation averaged over 100 replicates for grMCP, grSCAD, and cMCP for Scenario 4. Blue cycles denote the simulated nonzero values. Black points and lines represent the estimated values and the interval estimates of coefficients. The main title of each plot denotes the varying the number of nonnull group for Scenario 4. (PDF 783 kb)
Additional file 6:
Figure S6. The parameter estimation averaged over 100 replicates for the group spikeandslab lasso Cox (gsslasso), the lasso and grlasso methods for Scenario 5. Blue cycles denote the simulated nonzero values. Black points and lines represent the estimated values and the interval estimates of coefficients. The main title of each plot denotes the varying the number of nonnull group for Scenario 5. (PDF 1054 kb)
Additional file 7:
Figure S7. The parameter estimation averaged over 100 replicates for grMCP, grSCAD, and cMCP for Scenario 5. Blue cycles denote the simulated nonzero values. Black points and lines represent the estimated values and the interval estimates of coefficients. The main title of each plot denotes the varying the number of nonnull group for Scenario 5. (PDF 778 kb)
Additional file 8:
S1, S2, and S3. The detailed information of genes shared by different pathways for ovarian cancer, lung cancer and breast cancer, respectively. (ZIP 213 kb)
Additional file 9:
Table S1. The measures of optimal group spikeandslab lasso (gsslasso) cox and the lasso cox models for TCGA ovarian cancer, lung adenocarcinoma (LUAD) and breast cancer dataset with all genes by 10 times 10fold cross validation. (DOCX 20 kb)
Additional file 10:
S4, S5 and S6. The pathway enrichment analyses for these detected genes for ovarian cancer, lung cancer and breast cancer, respectively. (ZIP 15 kb)
Additional file 11:
S7. The detected genes and their standardized effect sizes estimated by the group spikeandslab lasso Cox model and five existed methods for TCGA real datasets. (PDF 1340 kb)
Additional file 12:
Figure S8. The solution path and crossvalidated partial loglikelihood profiles of the group spikeandslab lasso Cox (a, c) and the lasso (b, d) based on the Scenario 2. The colored points on the solution path represent the estimated values of assumed eight nonzero coefficients, and the circles represent true nonzero coefficients. Vertical lines correspond to the optimal models. (PDF 1052 kb)
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Tang, Z., Lei, S., Zhang, X. et al. Gsslasso Cox: a Bayesian hierarchical model for predicting survival and detecting associated genes by incorporating pathway information. BMC Bioinformatics 20, 94 (2019). https://doi.org/10.1186/s1285901926561
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s1285901926561
Keywords
 Cox survival models
 Grouped predictors
 Hierarchical modeling
 Lasso
 Pathway
 Spikeandslab prior