Combining heterogeneous subgroups with graph-structured variable selection priors for Cox regression

Background Important objectives in cancer research are the prediction of a patient’s risk based on molecular measurements such as gene expression data and the identification of new prognostic biomarkers (e.g. genes). In clinical practice, this is often challenging because patient cohorts are typically small and can be heterogeneous. In classical subgroup analysis, a separate prediction model is fitted using only the data of one specific cohort. However, this can lead to a loss of power when the sample size is small. Simple pooling of all cohorts, on the other hand, can lead to biased results, especially when the cohorts are heterogeneous. Results We propose a new Bayesian approach suitable for continuous molecular measurements and survival outcome that identifies the important predictors and provides a separate risk prediction model for each cohort. It allows sharing information between cohorts to increase power by assuming a graph linking predictors within and across different cohorts. The graph helps to identify pathways of functionally related genes and genes that are simultaneously prognostic in different cohorts. Conclusions Results demonstrate that our proposed approach is superior to the standard approaches in terms of prediction performance and increased power in variable selection when the sample size is small. Supplementary Information The online version contains supplementary material available at 10.1186/s12859-021-04483-z.

In the Bayesian framework, different types of variable selection priors have been proposed also with application to the Bayesian Cox model.One common choice is the use of shrinkage priors such as the Bayesian lasso as an analog to the frequentist penalized likelihood approach [20,26,41].A popular alternative are "spike-and-slab" priors that use latent indicators for variable selection and a mixture distribution for the regression coefficients [14,35].In general, the regression coefficients are modeled independently.However, with applications to molecular data, it can be reasonable to consider structural information between covariates, since the effect on a clinical outcome is typically not caused by single genes acting in isolation, but rather by changes in a regulatory or functional pathway of interacting genes.Several authors have dealt with this problem by using a Markov random field (MRF) prior to incorporate structural information on the relationships among the covariates into variable selection [21,28,33,34].Alternatively, Chakraborty and Lozano [5] propose a Graph Laplacian prior for modeling the dependence structure between the regression coefficients through their precision matrix.
When the data are heterogeneous and consists of known subpopulations with possibly different dependence structures, estimating one joint graphical model would hide the underlying heterogeneity while estimating separate models for each subpopulation would neglect common structure.For this situation, Danaher et al. [8] use an extension of the frequentist graphical lasso with either a group or fused lasso type penalty for joint structure learning.Saegusa and Shojaie [30] propose a weighted Laplacian shrinkage penalty where the weights represent the degree of similarity between subpopulations.Bayesian approaches for sharing common structure in the joint inference of multiple graphical models have also been developed [24,27,40].Peterson et al. [27] use an MRF prior for the graph structures with pairwise similarities between different graphs.However, all these methods have in common that they focus on structure learning only and do not take into account the relationship between (structured) covariates and a clinical outcome as in the context of regression modeling.
We consider the situation that molecular measurements and a survival outcome are available for various, possibly heterogeneous patient subgroups or cohorts such as in a multicenter study.In the following, we use the term "subgroup" for different pre-known patient cohorts or data sets.In classical subgroup analysis, only the data of the subgroup of interest is used to build a risk prediction model for this specific subgroup.This may lead to a loss of power or unstable results with high variance especially in small subgroups.Thus, it is tempting to simply pool all data to increase the sample size.This approach, however, can result in biased estimates when the subgroups are heterogeneous regarding their effects and subgroup-specific effects may get lost.We aim at sharing information between subgroups to increase power when this is supported by the data.Our approach provides a separate risk prediction model for each subgroup that allows the identification of common as well as subgroup-specific effects and has improved prediction accuracy and variable selection power compared to the two standard approaches.
Some frequentist approaches tackle this problem by suggesting a penalized Cox regression model with a weighted version of the partial likelihood that includes patients of all subgroups but assigns them (individual) weights.Weyer and Binder [37] propose the use of fixed weights.This idea is extended by Richter et al. [29] using model-based optimization for tuning of the weights to obtain the best combination of fixed weights regarding prediction accuracy.[23] estimate individual weights from the data such that they represent the probability of belonging to a specific subgroup.
In this paper, we use a Bayesian approach and borrow information across subgroups through graph-structured selection priors instead of weights in the likelihood.We propose an extension of the Bayesian Cox model with "spike-and-slab" prior for variable selection by Treppmann et al. [35] in the sense that we incorporate graph information between covariates into variable selection via an MRF prior instead of modeling the regression coefficients independently.The graph is not known a priori and inferred simultaneously with the important predictors.Its structure can be partitioned into subgraphs linking covariates within or across different subgroups.Thus, representing conditional dependencies between genes (i.e.pathways) and similarities between subgroups by genes being simultaneously prognostic in different subgroups.

Methods
First, the general methods are described that are required for our proposed Bayesian model introduced later in this section.

The Bayesian Cox proportional hazards model
Assume the observed data of patient m consist of the tuple ( tm , δ m ) and the covariate vector The Cox proportional hazards model [7] models the hazard rate h(t|x m ) of an individual m at time t.It consists of two terms, the non-parametric baseline hazard rate h 0 (t) and a parametric form of the covariate effect: where β = (β 1 , . . ., β p ) ′ is the unknown parameter vector that represents the strength of influence of the covariates on the hazard rate.Under the Cox model, the joint survival probability of n patients given x is where t is the vector of the individual observed times for all patients and T the vector of corresponding random variables.One of the most popular choices for the cumulative baseline hazard function H 0 (t) is a gamma process prior where H * (t) is an increasing function with H * (0) = 0 .H * can be considered as an initial guess of H 0 and a 0 > 0 describes the weight that is given to H * (t) [20].Lee et al. [20] propose a Weibull distribution H * (t) = ηt κ with fixed hyperparameters η and κ .Follow- ing Zucknick et al. [41], we obtain estimates of η and κ from the training data by fitting a parametric Weibull model without covariates to the survival data.We choose a 0 = 2 in accordance with the authors.
In practice the presence of ties is very common, leading to the grouped data likelihood described in Ibrahim et al. [17, chapter 3.2.2].A finite partition of the time axis is constructed with 0 = c 0 < c 1 < . . .< c J and c J > tm for all m = 1, . . ., n .The observed time tm of patient m falls in one of the J disjoint intervals I g = (c g−1 , c g ] , g = 1, . . ., J .Assume the observed data D = {(x, R g , D g ) : g = 1, . . ., J } are grouped within I g , where R g and D g are the risk and failure sets corresponding to interval g.Let h g = H 0 (c g ) − H 0 (c g−1 ) be the increment in the cumulative baseline hazard in interval I g , g = 1, . . ., J .From the gamma process prior of H 0 follows that the h g 's have independent gamma distributions The conditional probability that the observed time of patient m falls in interval I g is given by with h = (h 1 , . . ., h J ) ′ .The resulting grouped data likelihood is defined as [17, chapter 3.2.2].

Stochastic search variable selection
The stochastic search variable selection (SSVS) procedure by George and McCulloch [14] uses latent indicators for variable selection and models the regression coefficients as a mixture of two normal distributions with different variances This prior allows the β i 's to shrink towards zero.Due to the shape of the two-component mixture distribution, it is called spike-and-slab prior.The latent variable γ i indicates the inclusion ( γ i = 1 ) or exclusion ( γ i = 0 ) of the i-th variable and specifies the variance of the normal distribution.τ i (> 0) is set small so that β i is likely to be close to zero if γ i = 0 .c i (> 1) is chosen sufficiently large to inflate the coefficients of selected variables and to make their posterior mean values likely to be non-zero.In general, the variances of the regression coefficients are assumed to be constant: τ i ≡ τ and c i ≡ c for all i = 1, . . ., p.
The standard prior for γ = (γ 1 , . . ., γ p ) ′ is a product of independent Bernoulli distributions with prior inclusion probability π γ = P(γ i = 1) .Typically, these prior inclusion prob- abilities are chosen to be the same for all variables and often with π γ set to a fixed value.

Graphical models
A graphical model is a statistical model that is associated with a graph summarizing the dependence structure in the data.The nodes of a graph represent the random variables of interest and the edges of a graph describe conditional dependencies among the variables.Structure learning implies the estimation of an unknown graph.Recent applications are mainly driven by biological problems that involve the reconstruction of gene regulatory networks and the identification of pathways of functionally related genes from their expression levels.A graph is called undirected, when its edges are unordered pairs of nodes instead of ordered pairs with edges pointing from one node to the other (directed graph).
When the variables are continuous measurements and assumed to be multivariate normal a common choice are Gaussian models [11].We assume that the vector of random variables X m = (X m1 , . . ., X mp ) ′ for patient m, m = 1, . . ., n follows a multivariate normal distribution with mean vector 0 and covari- ance matrix .The inverse of the covariance matrix is referred to as precision matrix −1 = = (ω ij ) i,j=1,...,p , with symmetric and positive definite.Let X ∈ R n×p be the data matrix consisting of n independent patients and S = 1 n X ′ X the sample covariance matrix.In graphical models, a graph G is used to represent conditional dependence relationships among random variables X .Let G = (V , E) be an undirected graph, where V = {1, . . ., p} is a set of nodes (e.g.genes) and E ⊂ V × V is a set of edges (e.g. relations between genes) with edge (i, j) ∈ E ⇔ (j, i) ∈ E .G can be indexed by a set of p(p − 1)/2 binary variables G = (g ij ) i<j ∈ {0, 1} p×p with g ij = 1 or 0 when edge (i, j) belongs to E or not.The symmet- ric matrix G is termed adjacency matrix representation of the graph.The graph structure implies constraints on the precision matrix such that g ij = 0 ⇔ (i, j) / ∈ E ⇔ ω ij = 0 , meaning that variables i and j are conditionally independent given all remaining variables [11,36].We use the approach for structure learning by Wang [36] that is based on continuous spike-and-slab priors for the elements of the precision matrix and latent indicators for the graph structure.The approach induces sparsity and is efficient due to a block Gibbs sampler and no approximation of the normalizing constant.The corresponding hierarchical model is defined as where C(G, ν 0 , ν 1 , ) and C(θ) are the normalizing constants, and θ = {ν 0 , ν 1 , , π G } is the set of all parameters with ν 0 > 0 small, ν 1 > 0 large, > 0 and π G ∈ (0, 1) . 1 { ∈M + } restricts the prior to the space of symmetric-positive definite matrices.A small value for ν 0 ( g ij = 0 ) means that ω ij is small enough to bet set to zero.A large value for ν 1 ( g ij = 1 ) allows ω ij to be substantially different from zero.The binary latent variables G = (g ij ) i<j ∈ {0, 1} p(p−1)/2 serve as edge inclusion indicators.Wang [36] proposes the following fixed hyperparameters π G = 2 p−1 , ν 0 ≥ 0.01 , ν 1 ≤ 10 and = 1 as resulting in good convergence and mixing.

The proposed Bayesian subgroup model
We assume the entire data consists of S predefined subgroups of patients (different cohorts or data sets), where for each patient the subgroup affiliation is known and unique.This information, which specific subgroup a patient belongs to, is available in the data.

Likelihood
Let X s ∈ R n s ×p be the gene expression (covariate) matrix for subgroup s, s = 1, . . ., S , consisting of n s independent and identically distributed observations.For patient m in subgroup s the vector of random variables X s,m = (X s,m1 , . . ., X s,mp ) ′ is assumed to fol- low a multivariate normal distribution with mean vector 0 and unknown precision matrix ss = −1 s , m = 1, . . ., n s .We consider the outcome Y s = (Y s,1 , . . ., Y s,n s ) ′ with Y s,m = ( Ts,m , δ s,m ) as well as the predictors X s , to be random variables.Thus, the likelihood for subgroup s is the joint distri- bution p(Y s , X s ) = p(Y s |X s ) • p(X s ) .The conditional distribution p(Y s |X s ) corresponds to the grouped data likelihood of the Bayesian Cox proportional hazards model at the beginning of this section [20] for subgroup s where D s = {(x s , R s,g , D s,g ) : g = 1, . . ., J s } are the observed data in subgroup s, with R s,g the risk and D s,g the failure sets corresponding to interval I s,g = (c s,g−1 , c s,g ] , g = 1, . . ., J s .The increment in the cumulative baseline hazard for subgroup s in interval I s,g is termed h s,g = H 0 (c s,g ) − H 0 (c s,g−1 ) .β s is the p-dimensional vector of regression coefficients for subgroup s.
The marginal distribution of X s is multivariate normal with S s = X ′ s X s The joint likelihood across all subgroups is the product of the subgroup likelihoods

Prior on the parameters h s and β s of the Cox model
The prior for the increment in the cumulative baseline hazard in subgroup s follows independent gamma distributions with a Weibull distribution H * (c s,g ) = η s c κ s s,g , g = 1, . . ., J s , s = 1, . . ., S [20].We choose the hyperparameters a 0 , η s and κ s to be fixed and in accordance with Lee et al. [20] and Zucknick et al. [41].We set a 0 = 2 and estimate the hyperparameters η s and κ s from the (training) data by fitting a parametric Weibull model without covariates to the survival data of subgroup s.
We perform variable selection using the SSVS approach by George and McCulloch [14] described earlier in this section.The prior of the regression coefficients β s,i in subgroup s conditional on the latent indicator γ s,i is defined as a mixture of two nor- mal distributions with small ( τ 2 ) and large ( c 2 τ 2 ) variance The latent indicator variable γ s,i indicates the inclusion ( γ s,i = 1 ) or exclusion ( γ s,i = 0 ) of variable i in the model for subgroup s.We assume equal variances for all regression coefficients.We set the hyperparameters to the fixed values τ = 0.0375 and c = 20 following Treppmann et al. [35].This choice corresponds to a standard deviation of c • τ = 0.75 and a 95% probability interval of [−1.47, 1.47] for p(β s,i |γ s,i = 1).

Prior on γ linking variable and graph selection
The standard prior for the binary variable selection indicators γ s,i is a product of inde- pendent Bernoulli distributions as utilized by Treppmann et al. [35].However, this does not consider information from other subgroups and relationships between covariates.For this situation, we propose a Markov random field (MRF) prior for the latent variable selection indicators that incorporates information on the relationships among the covariates as described by an undirected graph.This prior assumes that neighboring covariates in the graph are more likely to have a common effect and encourages their joint inclusion.The MRF prior for γ given G is defined as where γ = (γ 1,1 , . . ., γ 1,p , . . ., γ S,1 , . . ., γ S,p ) ′ is a pS-dimensional vector of variable inclu- sion indicators, G is a symmetric (pS × pS) adjacency matrix representation of the graph, and a, b are scalar hyperparameters.
The hyperparameter a influences the overall variable inclusion probability and controls the sparsity of the model, with smaller values resulting in sparser models.Without loss of generality a < 0 .The hyperparameter b > 0 determines the prior belief in the strength of relatedness between pairs of neighboring variables in the graph and controls the probability of their joint inclusion.Higher values of b encourage the selection of variables with neighbors already selected into the model.The idea becomes more evident by looking at the conditional probability An MRF prior for variable selection has also been used by other authors [21,28,33,34].However, unlike us, they do not address the problem of borrowing information across subgroups by linking covariates in a graph.
We propose a joint graph with possible edges between all pairs of covariates within each subgroup and edges between the same covariates in different subgroups.The elements g rs,ij in the adjacency matrix of the graph G represent the presence ( g rs,ij = 1 ) or absence ( g rs,ij = 0 ) of an edge between nodes (genes) i and j in subgroups r and s.They can be viewed as latent binary indicator variables for edge inclusion.The adjacency matrix in the present model is defined as ) i<j is the matrix of latent edge inclusion indicators within subgroup s and G rs = (g rs,ii ) r<s is the matrix of latent edge inclusion indicators between subgroups r and s with r, s = 1, . . ., S , r < s , i, j = 1, . . ., p , i < j.
Thus, within each subgroup s we assume a standard undirected graph with possible edges between all pairs of genes representing conditional dependencies as in a functional or regulatory pathway.Between different subgroups we only allow for relations between the same gene in different subgroups (different genes in different subgroups are assumed to be unconnected).This allows sharing information between subgroups and prognostic genes shared by different subgroups have a higher inclusion probability.To visualize this idea, Fig. 1 shows an example network consisting of two subgroups, each with five predictors.

Graph selection prior on and G
We infer the unknown graph and precision matrix using the structure learning approach for Gaussian graphical models by Wang [36].The precision matrix of subgroup s corresponding to subgraph G ss is denoted by ss = (ω ss,ij ) i<j .
We assume the binary edge inclusion indicators within subgroup s ( g ss,ij ) as well as between subgroups r and s ( g rs,ii ) to be independent Bernoulli a priori with fixed prior probability of edge inclusion π G ∈ (0, 1).

Posterior inference
The joint posterior distribution for the set of all parameters θ = {h, β, γ , G, �} is propor- tional to the product of the joint likelihood and the prior distributions of the parameters in all subgroups

Markov Chain Monte Carlo sampling
Markov Chain Monte Carlo (MCMC) simulations are required to obtain a posterior sample of the parameters.The different parameters are updated iteratively according to their conditional posterior distributions using a Gibbs sampler.A brief outline of the MCMC sampling scheme is given in the following.More details are provided in the Appendix.Starting with an arbitrary set of initial values for the parameters, the MCMC algorithm runs with a reasonably large number of iterations to obtain a representative sample from the posterior distribution.All subsequent results are based on single MCMC chains, each with 20 000 iterations in total and a burn-in period of 10 000 iterations.As starting values we choose an empty model with: We assessed convergence of each MCMC chain by looking at autocorrelations, trace plots and running mean plots of the regression coefficients.In addition, we ran several independent MCMC chains with different starting values to ensure that the chains and burn-in period were long enough to reach (approximate) convergence.

Posterior estimation and variable selection
We report the results of the Cox models in terms of marginal and conditional posterior means and standard deviations of the estimated regression coefficients, as well as posterior selection probabilities.After removal of the burn-in samples, the remaining MCMC samples serve as draws from the posterior distribution to calculate the empirical estimates.These estimates are then averaged across all training sets for each variable separately.
The strategy for variable selection follows Treppmann et al. [35].First, the mean model size m * is computed as the average number of included variables across all MCMC itera- tions after the burn-in.Then the m * variables with the highest posterior selection prob- ability are considered as the most important variables and selected in the final model.
We visually assess the inferred graph by the marginal posterior probabilities of the pairwise edge inclusion indicators.High probabilities suggest that an edge exists between two covariates (nodes).We consider the presence of an edge as a continuous parameter rather than choosing a cutoff for binary decision.

Prediction
We use training data for model fitting and posterior estimation and test data to assess model performance.We evaluate the prediction performance of the Cox models by the integrated Brier score.
The expected Brier score can be interpreted as a mean square error of prediction.It measures the inaccuracy by comparing the estimated survival probability Ŝ(t|x m ) of a patient m, m = 1, .., n , with the observed survival status 1( tm > t) and the squared residuals are weighted using inverse probability of censoring weights to adjust for the bias caused by the presence of censoring in the data.Ĉ(t) is the Kaplan- Meier estimator of the censoring times [3,31].
The predictive performance of competing survival models can be compared by plotting the Brier score over time (prediction error curves).Alternatively, prediction error curves can be summarized in one value with the integrated Brier score as a measure of inaccuracy over a time interval rather than at single time points [15]

Median probability model and Bayesian model averaging
For the calculation of the prediction error, we account for the uncertainty in model selection by two different approaches: the Median Probability Model (MPM) [1] and an approximation to Bayesian Model Averaging (BMA) [16].After removal of the burnin samples, we compute the Brier score over the "best" selected models.According to the BMA approach we choose the top 100 models with the largest log-likelihood values to obtain the marginal posterior means of the regression coefficients, which in turn are required for the risk score.Our choice of the number of top models for BMA approximation is based on visual assessment of the MCMC frequencies of the different top-selected models.However, the number of models could be optimized.For the MPM approach we select all covariates with a mean posterior selection probability larger than 0.5.For these variables we calculate the marginal posterior means of the regression coefficients and the corresponding risk score.

Simulation study
In The priors for variable selection and structure learning are specified as follows.We set the hyperparameter of the Bernoulli distribution to π γ = 0.02 , matching the prior prob- ability of variable inclusion in the MRF prior of the CoxBVS-SL model.Based on a sensitivity analysis, we choose the hyperparameters of the MRF prior as a = −4 and b = 1 .When the graph G contains no edges or b = 0 then the prior variable inclusion probabil- ity is exp(a) (1+exp(a)) ≈ 0.018 .This probability increases when b > 0 is combined with a non- empty graph.
For the sensitivity analysis of a and b we considered in total 36 combinations of the following hyperparameter values: a ∈ {−4, −3.75, −3.5, . . ., −2.25, −2} and b ∈ {0.25, 0.5, 0.75, 1} and simulated the data according to scenario I with n = p = 100 .Visual assessment of the results showed that they were relatively robust without major differences between the parameter combinations (Additional file 1: Fig. S1).Therefore, we selected the combination of values for a and b based on a compromise between variable selection accuracy (trade-off between large probability of true positive and small probability of false positive selections) and prediction performance.
The remaining hyperparameters for G and ss are chosen as ν 0 = 0.1, ν 1 = 10 , = 1 and π G = 2/(p − 1) , in accordance with Wang [36] and Peterson et al. [28].Wang [36] extensively studied the impact of different parameter combinations on the structure learning results, reporting that the results were relatively insensitive to the choice of = 1 .He recommended a range for the parameters ν 0 and ν 1 as providing good conver- gence and mixing.Based on his recommendation, we performed a sensitivity analysis in previous simulations to confirm that the parameter range is also appropriate for our experiments.All tested parameter combinations provided reasonable variable selection results with only small differences, which led us to choose one of the best performing combinations in terms of variable selection accuracy.
In the following simulations, we examine varying numbers of genomic covariates p and sample sizes n, with a focus on small sample sizes relative to the number of variables which is characteristic for gene expression data.We standardize the genomic covariates before model fitting and evaluation to have zero mean and unit variance.Parameters of the training data (mean and standard deviation of each variable) are used to scale the training and test data.For the standard subgroup model and the proposed model we standardize each subgroup separately, whereas for the combined model we pool training data of all subgroups.
For Bayesian inference, typically one training data set is used for posterior estimation and an independent test data set for model evaluation.However, results have shown some variation due to the data draw.Therefore, in the following, simulation of training and test data is repeated ten times for each simulation scenario.
In a second simulation set up we use two different hyperparameters b for the subgraphs G ss , s = 1, 2 and G 12 in the MRF prior of the CoxBVS-SL model and compare the prediction performance with the Sub-struct model.In the latter G 12 is an empty graph and only information of G ss is included in the MRF prior.We use the same training and test data as before but only consider simulation scenarios with p = 100.

Data simulation
Training and test data, each consisting of n samples and p genomic covariates, are simulated from the same distribution as described in the following.We consider two subgroups that differ only in their relationship between genomic covariates and survival endpoint ( β s , s = 1, 2 ), and in the parameters for the simulation of survival data.We generate gene expression data from the same multivariate normal distribution with mean vector 0 and precision matrix .The precision matrix is defined such that the vari- ance of each gene is 1 and partial correlations exist only between the first nine prognostic genes.Within the three blocks of prognostic genes determined by the same effect (gene 1 to 3, gene 4 to 6, and gene 7 to 9) we assume pairwise partial correlations of 0.5.All remaining genes are assumed to be uncorrelated.
We simulate survival data from a Weibull distribution according to Bender et al. [2], with scale η s and shape κ s parameters estimated from two gene expression cancer cohorts [4,10] to obtain realistic survival times.Therefore, we compute survival probabilities at 3 and 5 years using the Kaplan-Meier estimator, separately for both cohorts.The corresponding probabilities are 57% and 75% for 3-years survival, and 42% and 62% for 5-years survival, respectively.Event times for subgroup s are simulated from with true effects β s ∈ R p , s = 1, 2 .We randomly draw non-informative censoring times C s from a Weibull distribution with the same Weibull parameters as for the event times, resulting in approximately 50% censoring rates in both subgroups.The individual observed event indicators and times until an event or censoring are defined as δ s = 1(T s ≤ C s ) and T s = min(T s , C s ) , s = 1, 2.
We choose the true effects of the genomic covariates on survival outcome as stated in Table 1.Genes 1, 2, 3 and 7, 8, 9 are subgroup-specific, while genes 4, 5 and 6 have the same effect in both subgroups.All remaining genes represent noise and have no effect in both subgroups.

Simulation results I
We consider three low-dimensional settings with p = 20 genes and n = 50, 75, 100 sam- ples in each subgroup, as well as five high-dimensional settings with p = 100 and sam- ple sizes n = 50, 75, 100, 150 .We also tested p = 100 and n = 125 , but as expected, the results always lay between the results for n = 100 and n = 150 .For this reason, they are not shown here.We compare our proposed model (CoxBVS-SL) to the standard subgroup model (Subgroup) and the standard combined or pooled model (Pooled) regarding variable selection accuracy and prediction performance.
Posterior selection probabilities for each gene are computed based on all iterations after the burn-in and averaged across all training data sets.The resulting mean posterior selection probabilities of the first nine genes in subgroup 1 are depicted in Fig. 2 (and in Additional file 1: Fig. S2, for subgroup 2).Across all simulation scenarios, the CoxBVS-SL model has more power for the selection of prognostic genes compared to the two standard approaches, and at the same time, does not erroneously select noise genes (false positives) as the Pooled model.As expected, with larger n, power and accuracy in variable selection increase for both, the CoxBVS-SL and the Subgroup model.The Pooled model only correctly identifies the joint effects of genes 4, 5 and 6 but fails to detect subgroup-specific effects.
Posterior estimates of the regression coefficients βj of the first nine genes in subgroup 1 are shown in Fig. 3 for conditional posterior means (conditional on γ = 1 ) and in Addi- tional file 1: Fig. S3 for marginal posterior means (independent of γ ), both along with standard deviations.The corresponding results for subgroup 2 are depicted in Additional file 1: Figs.S4 and S5.For n < 100 the conditional posterior means of the prognos- tic genes are less shrunk than the marginal posterior means.Results of the CoxBVS-SL model and the Subgroup model are very similar, whereas the Pooled model averages p=20, n=50 p=20, n=75 p=20, n=100 p=100, n=50 p=100, n=75 p=100, n=100 p=100, n=150 effects across subgroups leading to biased subgroup-specific effects and more false positives.Surprisingly, the joint effects of genes 4, 5 and 6 are also more precisely estimated (less shrunk) by CoxBVS-SL and Subgroup compared to Pooled.We assess prediction performance by the integrated Brier score (IBS), computed based on the Median Probability Model (MPM, Fig. 4 for subgroup 1 and Additional file 1: Fig. S7, for subgroup 2) and the Bayesian Model Averaging (BMA, Additional file 1: Fig. S6, for subgroup 1 and Additional file 1: Fig. S8, for subgroup 2).The Pooled model has the worst prediction accuracy.In the case of MPM, CoxBVS-SL performs clearly better than Subgroup, for BMA both models are competitive.
Inference of the graph showed relatively high accuracy for learning the conditional dependence structure among genes within subgroups and for detecting joint effects across different subgroups.The block correlation structure between the prognostic genes within each subgroup is correctly estimated by the precision matrix and the subgraph G ss , s = 1, 2 in the CoxBVS-SL model (see Additional file 1: Fig. S9).Inference of the subgraph G 12 linking both subgroups improves with increasing sample size.The corresponding marginal posterior edge inclusion probabilities of the prognostic genes with joint effects (genes 4, 5 and 6) are larger than for the remaining genes, which p=20, n=50 p=20, n=75 p=20, n=100 p=100, n=50 p=100, n=75 p=100, n=100 p=100, n=150  . . .p becomes more evident for increasing n (see Additional file 1: Fig. S10).Findings support the assumption that incorporating network information into variable selection may increase power to detect associations with the survival outcome and improve prediction accuracy.

Simulation results II
Next, we study the effect of two different hyperparameters b in the MRF prior of the CoxBVS-SL model with respect to variable selection and prediction performance.The new hyperparameter b 1 = 1 corresponds to the subgraphs G ss , s = 1, 2 within each sub- group and b 2 = 1, 1.5, 2, 2.5, 3 to the subgraph G 12 linking both subgroups.By choosing a larger value for b 2 , we give G 12 more weight in the MRF prior and thus, increase the prior variable inclusion probability for genes being simultaneously selected in both subgroups and having a link in G 12 .
We compare the results of CoxBVS-SL with varying b 2 to the results of the Sub-struct model where b 2 = 0 and only information of G ss , s = 1, 2 is included in the MRF prior.In this comparison we investigate how much information is added by G 12 over G ss .For the other hyperparameters we use the same values as in the previous simulations.We apply all models to the same training and test data sets as before but only consider simulation scenarios with p = 100 and n = 50, 75, 100, 125, 150.
Figure 5 shows the mean posterior selection probabilities of the first nine genes in subgroup 1 (subgroup 2 is presented in Additional file 1: Fig. S11).The results of Sub-struct are similar to CoxBVS-SL with b 2 = 1 .Increasing values of b 2 lead to larger posterior variable inclusion probabilities, however, not only for the prognostic genes (see genes 7, 8 and 9 in subgroup 1).This means more power for the correct identification of prognostic genes when n ≤ p , but on the other hand, a tendency towards more false positives.
Posterior estimates of the regression coefficients βj are very similar for all models.Figure 6 shows the conditional posterior means (conditional on γ = 1 ) and Additional file 1: Fig. S12 the marginal posterior means (independent of γ ) along with standard deviations of the first nine genes in subgroup 1.The corresponding results of subgroup 2 are depicted in Additional file 1: Figs.S13 and S14.
We assess prediction performance in terms of the integrated Brier score (IBS), computed based on the Median Probability Model (Fig. 7) and the Bayesian Model Averaging (Additional file 1: Fig. S15).Larger values of b 2 tend to lead to a slightly better prediction performance of CoxBVS-SL compared to Sub-struct when n < p .When the sample size is large, the prediction accuracy of all models is similarly good.
Additional file 1: Fig. S16 compares the results of the subgraph G 12 for varying b 2 in CoxBVS-SL.For larger values of b 2 the marginal posterior edge inclusion probabilities of the prognostic genes with joint effects (genes 4, 5 and 6) increase, as expected, since they are given a higher weight in the prior.However, when b 2 = 3 we also notice a minor increase of the marginal posterior edge inclusion probabilities of the other six prognostic genes with subgroup-specific effects.

Case study based on Glioblastoma protein expression data
In this section we compare CoxBVS-SL with varying b 2 to both standard models, Pooled and Subgroup.We use the Glioblastoma protein expression data from Peterson et al. [28], comprising 212 samples with survival data (159 events) and p = 187 proteins.For reasons of computation time, we use only p = 20 proteins and standard- ize the protein expression data as described in the previous section.In contrast to the

Table 2 Effects in simulation II
Simulated effects in both subgroups.Groups of proteins with the same effect are defined by different phosphorylation sites (or isoforms) of the same protein so that they can learn from each other For the survival endpoint we simulate the event times T s and censoring times C s , respectively, in subgroup s from a Weibull distribution with scale and shape parameters estimated by the Kaplan-Meier estimator of the true event and censoring times, respectively, in the specific subgroup.The individual observed event indicators and survival times until an event or censoring are defined as δ s = 1(T s ≤ C s ) and t s = min(T s , C s ) , resulting in approximately 42% censoring rates in both subgroups.The effects in subgroup s = 1 and s = 2 that we assume for the simulation of survival data are depicted in Table 2.
We repeatedly randomly split the complete data into training (with proportion 0.8) and test sets, stratified by subgroup and event indicator.In total, we generate ten training data sets for model fitting and ten test data sets for evaluation of the prediction performance.
We choose the hyperparameters in accordance with the case study in Peterson et al. [28] as follows.For the two standard models a prior probability of variable inclusion of 0.2 is assumed.In the CoxBVS-SL model we set the hyperparameters of the precision matrix and graph to ν 0 = 0.

Results of the case study
When either b 1 or b 2 increases the mean posterior selection probabilities of all pro- teins increase too (Fig. 8).The Subgroup and CoxBVS-SL model with b 1 = b 2 = 0.5 perform similarly.They correctly identify the subgroup-specific effects of the first six proteins and do not falsely select any noise proteins.Interestingly, the effects of proteins AMPK and Annexin (ID 7 and 8), going in opposite directions for both subgroups, as well as the joint effects of proteins GSK3 are not all identified.There are a few false negatives.The Pooled model, in contrast, shows a clear bias for the subgroup-specific and opposite effects.The effects are averaged across both subgroups,

Discussion
We consider the situation of different, possibly heterogeneous patients subgroups (preknown cohorts or data sets) with survival endpoint and continuous molecular measurements such as gene expression data.When building a separate risk prediction model for each subgroup, it is important to consider heterogeneity but at the same time it can be reasonable to allow sharing information across subgroups to increase power, in particular when the sample sizes are small.For this situation we propose a hierarchical Cox model with stochastic search variable selection prior.To achieve higher power in variable selection and better prediction performance, we use an MRF prior instead of the standard Bernoulli prior for the latent variable selection indicators γ .The MRF prior leads to higher selection probabilities for genes that are related in an undirected graph.We use this graph to link genes across different subgroups and thereby borrow information between subgroups.Genes that are simultaneously prognostic in different subgroups have a higher probability of being selected into the respective subgroup Cox models.As a side aspect, the graph in the MRF prior also allows us to estimate a network between genes within each subgroup providing indications of functionally related genes and pathways.Here, genes that are conditionally dependent have a higher selection probability.
In the simulations and the case study we compared our proposed CoxBVS-SL model to the standard approach with independent Bernoulli prior for γ represented by the Sub- group and Pooled model.Simulations showed that the Pooled model performed worst in terms of variable selection and prediction accuracy.It averaged the effects across both subgroups and thus, led to biased estimates.CoxBVS-SL had more power in variable selection and slightly better prediction performance than Subgroup when the sample size was small.For n > p both models were competitive.However, in the case study the CoxBVS-SL and Subgroup model performed similarly well (Pooled was again clearly worse).The reason for this may be that the sample sizes in both subgroups were relatively large, in particular n > p.
In the MRF prior of our proposed CoxBVS-SL model we specify one unique hyperparameter b for both, the connection levels of covariates within and between subgroups.Since this assumption may be inadequate, we considered further simulations where we studied the effect of increasing values of b 2 , representing the weight that is given to the subgraph G 12 in the MRF prior of CoxBVS-SL, and compared the results to the Sub- struct model where b 2 = 0 .When b 2 was small, CoxBVS-SL and Sub-struct performed very similarly.Thus, the subgraph linking both subgroups had only a small influence on the results compared to the conditional dependencies among covariates within each subgroup (subgraphs G 11 and G 22 ).For larger values of b 2 prediction performance slightly improved and power in variable selection increased but on the other hand, there was a tendency towards false positive variables.By using different hyperparameters b 1 and b 2 , we can vary the strength of connection between pairs of covariates within and between subgroups.However, we still assume that all pairs of subgroups are equally-likely linked a priori.In our situation this assumption is justified since we have no prior knowledge of the amount of shared, similar effects between subgroups.If prior information on the heterogeneity structure between subgroups (similar effects) is available, it can be incorporated into the MRF prior or the graph prior.
The problem of different connection levels of covariates within and between subgroups can in a similar way be approached by the hyperparameter π G in the graph prior, instead of the hyperparameter b in the MRF prior.In previous simulations (data not shown) we increased the weight for G 12 by choosing a larger value for the prior probability of edge inclusion π G for the corresponding edge inclusion indicators g 12,ii , i = 1, . . ., p .This led to larger posterior edge selection probabilities, however, for all genes and not only the ones with joint effects.The variable selection results did not change remarkably.We could observe a small increase in power for all genes which again implied a tendency towards false positives.
Due to computation time, we have included only up to 200 variables so far and the analysis of many thousands of genes is not (yet) feasible.An advantage of the CoxBVS-SL model is that it does not require prior knowledge of the graph among the covariates within and between subgroups.It accounts for uncertainty over both variable and graph selection.In situations where pathway information is available and the graph structure is known, it is possible to incorporate this structural information in the MRF prior via a fixed graph.
We assume that subgroups are pre-specified with the subgroup affiliation of each patient being unique and fixed.However, in situations with unknown subgroups the latent subgroup structure would first need to be determined using methods such as clustering.A wide variety of approaches have been proposed for the clustering of molecular data such as gene expression profiles [9,13,18,25,39] with extensions to sparse clustering [32,38] and integrative clustering of multiple omics data types [6,19].

Conclusions
To our knowledge, we propose the first completely Bayesian approach to combine different, possibly heterogeneous subgroups/cohorts in Cox regression with variable selection.We offer a solution for sharing information across the subgroups to increase power in variable selection and improve prediction performance.
We were able to demonstrate the superiority of our proposed CoxBVS-SL model over the two standard approaches.The standard Pooled model always performed worst, whereas the CoxBVS-SL model outperformed the standard Subgroup model when n ≤ p and otherwise was competitive.This suggests that incorporating network information into variable selection can increase power to identify the prognostic covariates and improve prediction performance.We showed that a proper choice of the hyperparameter b (and a) in the MRF prior is crucial for the results of the graph and the Cox model.
Our proposed model does not require prior knowledge of the dependency structure between covariates within subgroups and the heterogeneity structure between subgroups (i.e., of the amount of shared, similar effects).In the absence of any prior structural information, we assume that all pairs of covariates within and between subgroups are equallylikely linked a priori, and we allow inference of the corresponding unknown graphical structures.In situations where prior structural information is available, for example pathway information or degree of heterogeneity between subgroups, this information can be incorporated into our model.We presented a way to assign different connection levels to covariates within and between subgroups by using different hyperparameters b in the MRF prior.Alternatively, one could use fixed edges in the graph or varying prior edge selection probabilities.
The discovery of graphical structure is an additional benefit of our proposed model.However, our focus is on prediction performance, unbiased effect estimation and variable selection in the Cox model.Our proposed CoxBVS-SL model showed improved results in the situation of small sample sizes which is an important problem, not only in clinical applications.
Then the conditional distribution is ( * 1 ) ∝ G(v| n s 2 + 1, s 22 + 2 ), ( * 2 ) ∝ N (u| − C s 12 , C). Permuting any column in ss to be updated to the last one leads to a block Gibbs sampler for the update of ss .
Step 2: Update of G Update all elements in G iteratively with Gibbs sampler from their conditional dis- tributions.All elements g rs,ij are assumed independent Bernoulli a priori with p(g rs,ij = 1) = π G and p(g rs,ij = 0) = 1 − π G .
Update g rs,ii , r, s = 1, . . ., S , r < s , i = 1, . . ., p (edges between the same gene in dif- ferent subgroups) from the conditional distribution where G −rs,ii denotes all elements in G except for g rs,ii .Accept g rs,ii = 1 with probability where This means, update g rs,ii as follows: g rs,ii = 1, if u < w a w a +w b , u ∼ U[0, 1] 0, else .Update g ss,ij , s = 1, . . ., S , i, j = 1, . . ., p , i < j (edges between different genes in the same subgroup) from the conditional distribution where G −ss,ij denotes all elements in G except for g −ss,ij .Accept g ss,ij = 1 with probability is the matrix of (genomic) covariates.tm = min(T m , C m ) denotes the observed time of patient m, with T m the event time and C m the censoring time.δ m = 1(T m ≤ C m ) indicates whether a patient experienced an event ( δ m = 1 ) or was right-censored ( δ m = 0).

Fig. 1
Fig. 1 Example graph.Illustration of the proposed graph for S = 2 subgroups, each with p = 5 genomic predictors (nodes).Possible edges between two nodes are marked by dashed lines

Fig. 2
Fig. 2 Mean posterior probabilities of variable selection in simulation I. Mean posterior selection probabilities of the first nine genes in subgroup 1 (averaged across the ten training sets).The colors represent the different models and the plot symbol indicates whether a gene is selected on average or not

Fig. 3
Fig. 3 Posterior effect estimates in simulation I. Conditional posterior means (conditional on γ = 1 ) and standard deviations (SD) of the regression coefficients of the first nine genes in subgroup 1 (averaged across the ten training sets)

Fig. 7 Fig. 8 Fig. 9 IBSFig. 10
Fig. 7 Prediction performance in simulation II.Integrated Brier Scores (IBS) across all ten test sets for subroup 1 (left) and 2 (right) (based on the Median Probability Model).The black triangle within each boxplot represents the mean value

1
[36]subgroup s = 1, ..., S update ss with the block Gibbs sampler proposed by Wang[36].2 Update all elements in G iteratively with Gibbs sampler from the conditional distri- butions p(g ss,ij = 1|G −ss,ij , ω ss,ij , γ ) as well as p(g rs,ii = 1|G −rs,ii , γ ) , where G −rs,ii ( G −ss,ij ) denotes all elements in G except for g rs,ii ( g ss,ij ). 3 Update all elements in γ iteratively with Gibbs sampler from the conditional distribu- tions p(γ s,i = 1|γ −s,i , G, β s,i ) , where γ −s,i denotes all elements in γ except for γ s,i .4 Update β s,i from the conditional distribution p(β s,i |β s,−i , γ s , h s , D s ) , s = 1, . . ., S , i = 1, . . ., p , using a random walk Metropolis-Hastings algorithm with adaptive jumping rule as proposed by Lee et al. [20].β s,−i includes all elements in β s except for β s,i .5 The conditional distribution p(h s,g |h s,−g , β s , γ s , D s ) for the update of h s,g can be well approximated by the gamma distribution where d s,g is the number of events in interval g for subgroup s and h s,−g denotes the vector h s without the g-th element, g = 1, . . ., J s , s = 1, . . ., S [17, chapter 3.2.2].
[35]following, we compare the performance of our proposed model, referred to as CoxBVS-SL (for Cox model with Bayesian Variable Selection and Structure Learning, as an extension of the model by Treppmann et al.[35]), to a standard subgroup model and a combined model.The combined model pools data from all subgroups and treats them as one homogeneous cohort, whereas the subgroup model only uses information in the subgroup of interest and ignores the other subgroups.Both standard approaches follow the Bayesian Cox model proposed by Treppmann et al.[35]with stochastic search variable selection and independent Bernoulli priors for the variable inclusion indicators γ.