First, we describe the general methods that are required for our proposed Bayesian model, which is introduced later in this section.
The Bayesian Cox proportional hazards model
Assume the observed data of patient m consist of the tuple \(({\tilde{t}}_m, \delta_m)\) and the covariate vector \({\varvec{x}}_m=(x_{m1},\ldots ,x_{mp})' \in {\mathbb {R}}^{p}\), \(m=1,\ldots ,n\). \({\varvec{x}}\in {\mathbb {R}}^{n\times p}\) is the matrix of (genomic) covariates. \({\tilde{t}}_m=\min (T_m,C_m)\) denotes the observed time of patient m, with \(T_m\) the event time and \(C_m\) the censoring time. \(\delta_m = {\mathbbm {1}}(T_m \le C_m)\) indicates whether a patient experienced an event (\(\delta_{m}=1\)) or was right-censored (\(\delta_{m}=0\)).
The Cox proportional hazards model [7] models the hazard rate \(h(t|{\varvec{x}}_m)\) of an individual m at time t. It consists of two terms, the non-parametric baseline hazard rate \(h_0(t)\) and a parametric form of the covariate effect:
$$\begin{aligned} h(t|{\varvec{x}}_m)= h_0(t) \cdot \exp ({\varvec{\beta}}' {\varvec{x}}_m ) = h_0(t) \cdot \exp \left( \sum_{i=1}^{p} \beta_i x_{mi} \right) , \end{aligned}$$
where \({\varvec{\beta}}=(\beta_1,\ldots ,\beta_p)'\) is the unknown parameter vector that represents the strength of influence of the covariates on the hazard rate.
Under the Cox model, the joint survival probability of n patients given \({\varvec{x}}\) is
$$\begin{aligned} P(\tilde{{\varvec{T}}}>\tilde{{\varvec{t}}}|{\varvec{x}}, {\varvec{\beta}}, H_0) = \exp \Big ( - \sum_{m=1}^{n} \exp ({\varvec{\beta}}'{\varvec{x}}_m) H_0({\tilde{t}}_m) \Big ) , \end{aligned}$$
where \(\tilde{{\varvec{t}}}\) is the vector of the individual observed times for all patients and \(\tilde{{\varvec{T}}}\) the vector of corresponding random variables. One of the most popular choices for the cumulative baseline hazard function \(H_0(t)\) is a gamma process prior
$$\begin{aligned} H_0 \sim {\mathcal{GP}}(a_0H^{*}, a_0) , \end{aligned}$$
where \(H^{*}(t)\) is an increasing function with \(H^{*}(0)=0\). \(H^{*}\) can be considered as an initial guess of \(H_0\) and \(a_0>0\) describes the weight that is given to \(H^{*}(t)\) [20]. Lee et al. [20] propose a Weibull distribution \(H^{*}(t) = \eta t^{\kappa}\) with fixed hyperparameters \(\eta\) and \(\kappa\). Following Zucknick et al. [41], we obtain estimates of \(\eta\) and \(\kappa\) from the training data by fitting a parametric Weibull model without covariates to the survival data. We choose \(a_0=2\) in accordance with the authors.
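As an illustration of this hyperparameter estimation step, the following is a minimal sketch in Python (hypothetical function names, not the authors' implementation) of fitting an intercept-only Weibull model with cumulative hazard \(H^{*}(t)=\eta t^{\kappa}\) to right-censored training data by maximum likelihood:

```python
# Minimal sketch: estimate the Weibull hyperparameters eta and kappa of
# H*(t) = eta * t^kappa by fitting a covariate-free Weibull model to
# right-censored survival data via maximum likelihood.
import numpy as np
from scipy.optimize import minimize

def weibull_neg_loglik(log_params, time, event):
    """Negative log-likelihood of a Weibull model with cumulative hazard eta * t^kappa."""
    eta, kappa = np.exp(log_params)          # optimize on the log scale to keep eta, kappa > 0
    log_hazard = np.log(eta * kappa) + (kappa - 1.0) * np.log(time)
    cum_hazard = eta * time ** kappa
    # sum over patients: delta_m * log h(t_m) - H(t_m)
    return -np.sum(event * log_hazard - cum_hazard)

def fit_weibull_baseline(time, event):
    """Return (eta_hat, kappa_hat) for the covariate-free Weibull model."""
    res = minimize(weibull_neg_loglik, x0=np.zeros(2), args=(time, event), method="Nelder-Mead")
    return tuple(np.exp(res.x))

# Toy usage with simulated right-censored data (for illustration only)
rng = np.random.default_rng(1)
t_true = rng.weibull(1.5, size=200) * 2.0
c_time = rng.exponential(3.0, size=200)
time, event = np.minimum(t_true, c_time), (t_true <= c_time).astype(float)
eta_hat, kappa_hat = fit_weibull_baseline(time, event)
```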
In practice, the presence of ties is very common, leading to the grouped data likelihood described in Ibrahim et al. [17, chapter 3.2.2]. A finite partition of the time axis is constructed with \({0=c_0<c_1<\ldots <c_J}\) and \(c_J>{\tilde{t}}_m\) for all \({m=1,\ldots ,n}\). The observed time \({\tilde{t}}_m\) of patient m falls in one of the J disjoint intervals \({I_g=(c_{g-1},c_g]}\), \({g=1,\ldots ,J}\). Assume the observed data \({{\mathfrak{D}}=\{({\varvec{x}}, {\mathcal{R}}_g, {\mathcal{D}}_g): g=1,\ldots ,J\}}\) are grouped within \(I_g\), where \({\mathcal{R}}_g\) and \({\mathcal{D}}_g\) are the risk and failure sets corresponding to interval g. Let \({h_g = H_0(c_g) - H_0(c_{g-1})}\) be the increment in the cumulative baseline hazard in interval \(I_g\), \({g=1,\ldots ,J}\). From the gamma process prior of \(H_0\) it follows that the \(h_g\)’s have independent gamma distributions
$$\begin{aligned} h_g \sim {\mathcal{G}}(\alpha_{0,g}-\alpha_{0,g-1}, a_0) \, , \quad \text {with} \quad \alpha_{0,g}=a_0H^{*}(c_g) \, . \end{aligned}$$
The conditional probability that the observed time of patient m falls in interval \(I_g\) is given by
$$\begin{aligned} P({\tilde{T}}_m \in I_g|{\varvec{h}})&= \exp \Big ( -\exp ({\varvec{\beta}}'{\varvec{x}}_m) \sum_{j=1}^{g-1} h_j \Big ) \cdot \Big [ 1 - \exp \big ( -h_g \exp ({\varvec{\beta}}'{\varvec{x}}_m) \big ) \Big ], \end{aligned}$$
with \({\varvec{h}} = (h_1,\ldots ,h_J)'\). The resulting grouped data likelihood is defined as
$$\begin{aligned}{}&L({\mathfrak{D}}|{\varvec{\beta}}, {\varvec{h}}) \\&\quad \propto \prod_{g=1}^{J} \left[ \exp \Big ( -h_g \! \! \sum_{k \in {\mathcal{R}}_g-{\mathcal{D}}_g} \! \! \exp ({\varvec{\beta}}'{\varvec{x}}_k) \Big ) \prod_{l \in {\mathcal{D}}_g} \Big [ 1-\exp \big ( -h_g \exp ({\varvec{\beta}}' {\varvec{x}}_l) \big ) \Big ] \right] \end{aligned}$$
[17, chapter 3.2.2].
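For concreteness, the following is a minimal sketch (hypothetical names, not the authors' code) of how this grouped-data log-likelihood can be evaluated for given \({\varvec{\beta}}\) and \({\varvec{h}}\), assuming the risk and failure sets are supplied as integer index arrays:

```python
# Minimal sketch of the grouped-data log-likelihood L(D | beta, h)
# following Ibrahim et al. [17, chapter 3.2.2].
import numpy as np

def grouped_data_loglik(beta, h, X, risk_sets, failure_sets):
    """
    beta         : (p,) regression coefficients
    h            : (J,) increments of the cumulative baseline hazard per interval
    X            : (n, p) covariate matrix
    risk_sets    : list of J integer index arrays, patients at risk in interval g
    failure_sets : list of J integer index arrays, patients failing in interval g
    """
    lin_pred = X @ beta                      # beta' x_m for every patient
    loglik = 0.0
    for g, (R_g, D_g) in enumerate(zip(risk_sets, failure_sets)):
        survivors = np.setdiff1d(R_g, D_g)   # R_g - D_g: at risk but no event in interval g
        # contribution of patients surviving interval g
        loglik += -h[g] * np.sum(np.exp(lin_pred[survivors]))
        # contribution of patients failing in interval g
        loglik += np.sum(np.log1p(-np.exp(-h[g] * np.exp(lin_pred[D_g]))))
    return loglik
```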
Stochastic search variable selection
The stochastic search variable selection (SSVS) procedure by George and McCulloch [14] uses latent indicators for variable selection and models the regression coefficients as a mixture of two normal distributions with different variances
$$\begin{aligned} \beta_i|\gamma_i \sim (1-\gamma_i) \cdot {\mathcal{N}}(0, \tau_i^{2}) + \gamma_i \cdot {\mathcal{N}}(0, c_i^{2} \tau_i^{2}) \, , \quad i=1,\ldots ,p \, . \end{aligned}$$
This prior allows the \(\beta_i\)’s to shrink towards zero. Due to the shape of the two-component mixture distribution, it is called spike-and-slab prior. The latent variable \(\gamma_i\) indicates the inclusion (\(\gamma_i=1\)) or exclusion (\(\gamma_i=0\)) of the i-th variable and specifies the variance of the normal distribution. \(\tau_i~(>0)\) is set small so that \(\beta_i\) is likely to be close to zero if \(\gamma_i=0\). \(c_i~(>1)\) is chosen sufficiently large to inflate the coefficients of selected variables and to make their posterior mean values likely to be non-zero. In general, the variances of the regression coefficients are assumed to be constant: \(\tau_i \equiv \tau\) and \(c_i \equiv c\) for all \(i=1,\ldots ,p\).
The standard prior for \({\varvec{\gamma}}=(\gamma_1,\ldots ,\gamma_p)'\) is a product of independent Bernoulli distributions
$$\begin{aligned} p({\varvec{\gamma}}) = \prod_{i=1}^{p} \pi_{\gamma}^{\gamma_i} \cdot (1-\pi_{\gamma})^{1-\gamma_i} , \end{aligned}$$
with prior inclusion probability \(\pi_{\gamma}=P(\gamma_i=1)\). Typically, these prior inclusion probabilities are chosen to be the same for all variables and often with \(\pi_{\gamma}\) set to a fixed value.
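To illustrate how this mixture prior drives selection, the following sketch computes the conditional probability \(P(\gamma_i=1|\beta_i)\) that a standard SSVS Gibbs step would use. The function name is hypothetical, and \(\pi_{\gamma}=0.5\) is only an illustrative default; \(\tau\) and \(c\) are set to the values used later in this paper.

```python
# Minimal sketch (standard SSVS Gibbs step, not code from the paper) of the
# conditional inclusion probability P(gamma_i = 1 | beta_i) under the spike-and-slab prior.
import numpy as np
from scipy.stats import norm

def inclusion_probability(beta_i, tau=0.0375, c=20.0, pi_gamma=0.5):
    slab = pi_gamma * norm.pdf(beta_i, loc=0.0, scale=c * tau)       # gamma_i = 1 component
    spike = (1.0 - pi_gamma) * norm.pdf(beta_i, loc=0.0, scale=tau)  # gamma_i = 0 component
    return slab / (slab + spike)

# A coefficient close to zero favours the spike, a larger one favours the slab:
print(inclusion_probability(0.01), inclusion_probability(0.5))
```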
Graphical models
A graphical model is a statistical model that is associated with a graph summarizing the dependence structure in the data. The nodes of the graph represent the random variables of interest and the edges describe conditional dependencies among the variables. Structure learning refers to the estimation of an unknown graph. Recent applications are mainly driven by biological problems such as the reconstruction of gene regulatory networks and the identification of pathways of functionally related genes from their expression levels. A graph is called undirected when its edges are unordered pairs of nodes; in a directed graph, edges are ordered pairs pointing from one node to the other. When the variables are continuous measurements and assumed to be multivariate normal, Gaussian graphical models are a common choice [11].
We assume that the vector of random variables \({\varvec{X}}_m = (X_{m1},\ldots ,X_{mp})'\) for patient m, \(m=1,\ldots ,n\) follows a multivariate normal distribution with mean vector \({\varvec{0}}\) and covariance matrix \(\varvec{\Sigma}\). The inverse of the covariance matrix is referred to as precision matrix \({\varvec{\Sigma}}^{-1}={\varvec{\Omega}}=(\omega_{ij})_{i,j=1,\ldots ,p}\), with \(\varvec{\Omega}\) symmetric and positive definite. Let \({\varvec{X}} \in \mathbb {R}^{n \times p}\) be the data matrix consisting of n independent patients and \({\varvec{S}}=\frac{1}{n}{\varvec{X}}'{\varvec{X}}\) the sample covariance matrix.
In graphical models, a graph \({\widetilde{G}}\) is used to represent conditional dependence relationships among random variables \({\varvec{X}}\). Let \({\widetilde{G}}=(V,E)\) be an undirected graph, where \(V=\{ 1,\ldots ,p\}\) is a set of nodes (e.g. genes) and \(E \subset V \times V\) is a set of edges (e.g. relations between genes) with edge \({(i,j)\in E \Leftrightarrow (j,i)\in E}\). \({\widetilde{G}}\) can be indexed by a set of \(p(p-1)/2\) binary variables \({{\varvec{G}}=(g_{ij})_{i<j} \in \{0,1\}^{p \times p}}\) with \(g_{ij}=1\) or 0 when edge (i, j) belongs to E or not. The symmetric matrix \({\varvec{G}}\) is termed adjacency matrix representation of the graph. The graph structure implies constraints on the precision matrix \({\varvec{\Omega}}\) such that \(\, g_{ij}=0 \, \Leftrightarrow \, (i,j)\notin E \, \Leftrightarrow \, \omega_{ij}=0\), meaning that variables i and j are conditionally independent given all remaining variables [11, 36].
We use the structure learning approach by Wang [36], which is based on continuous spike-and-slab priors for the elements of the precision matrix and latent indicators for the graph structure. The approach induces sparsity and is computationally efficient because it relies on a block Gibbs sampler and does not require approximation of the normalizing constant. The corresponding hierarchical model is defined as
$$\begin{aligned}{}&p({\varvec{\Omega}}|{\varvec{G}},\theta ) = C({\varvec{G}}, \nu_0, \nu_1,\lambda )^{-1} \prod_{i<j} {\mathcal{N}}(\omega_{ij}|0,\nu_{g_{ij}}^{2}) \prod_{i} \text {Exp}(\omega_{ii}|\frac{\lambda}{2}) {\mathbbm {1}}_{ \{{\varvec{\Omega}} \in {\mathcal{M}}^{+}\}} \\&p({\varvec{G}}|\theta ) = C(\theta )^{-1} C({\varvec{G}}, \nu_0, \nu_1,\lambda ) \prod_{i<j} \big ( \pi_G^{g_{ij}} (1-\pi_G)^{1-g_{ij}} \big ) , \end{aligned}$$
where \(C({\varvec{G}}, \nu_0, \nu_1,\lambda )\) and \(C(\theta )\) are the normalizing constants, and \({\theta =\{\nu_0, \nu_1,\lambda , \pi_G \}}\) is the set of all parameters with \(\nu_0 > 0\) small, \(\nu_1 > 0\) large, \(\lambda >0\) and \(\pi_G \in (0,1)\). \({\mathbbm {1}}_{\{{\varvec{\Omega}} \in {\mathcal{M}}^{+} \}}\) restricts the prior to the space of symmetric positive definite matrices. A small value for \(\nu_0\) (\(g_{ij}=0\)) means that \(\omega_{ij}\) is small enough to be set to zero. A large value for \(\nu_1\) (\(g_{ij}=1\)) allows \(\omega_{ij}\) to be substantially different from zero. The binary latent variables \({{\varvec{G}}=(g_{ij})_{i<j} \in \{0,1\}^{p(p-1)/2}}\) serve as edge inclusion indicators. Wang [36] recommends the fixed hyperparameters \(\pi_G=\frac{2}{p-1}\), \(\nu_0 \ge 0.01\), \(\nu_1 \le 10\) and \(\lambda =1\), which result in good convergence and mixing.
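As an illustration (not Wang's implementation), the following sketch evaluates the unnormalized log prior density of \({\varvec{\Omega}}\) given a graph \({\varvec{G}}\) under this spike-and-slab formulation; here \(\nu_0\) and \(\nu_1\) are treated as standard deviations, and the default values are assumptions within the recommended ranges.

```python
# Minimal sketch of the unnormalised log prior density of a precision matrix Omega
# given a graph G under the continuous spike-and-slab prior of Wang [36].
import numpy as np

def log_prior_omega(Omega, G, v0=0.01, v1=10.0, lam=1.0):
    """Omega: (p, p) symmetric positive definite; G: (p, p) 0/1 adjacency matrix."""
    if np.any(np.linalg.eigvalsh(Omega) <= 0):
        return -np.inf                          # outside the cone of positive definite matrices
    iu = np.triu_indices_from(Omega, k=1)
    v = np.where(G[iu] == 1, v1, v0)            # slab sd v1 if edge present, spike sd v0 if not
    off_diag = -0.5 * np.sum(Omega[iu] ** 2 / v ** 2) - np.sum(np.log(v))
    diag = -0.5 * lam * np.sum(np.diag(Omega))  # Exp(lambda/2) prior on the diagonal entries
    return off_diag + diag
```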
The proposed Bayesian subgroup model
We assume that the entire data set consists of S predefined subgroups of patients (e.g. different cohorts or data sets) and that the subgroup affiliation of each patient is known and unique.
Likelihood
Let \({\varvec{X}}_s \in \mathbb {R}^{n_s \times p}\) be the gene expression (covariate) matrix for subgroup s, \({s=1,\ldots ,S}\), consisting of \(n_s\) independent and identically distributed observations. For patient m in subgroup s the vector of random variables \({{\varvec{X}}_{s,m} = (X_{s,m1},\ldots ,X_{s,mp})'}\) is assumed to follow a multivariate normal distribution with mean vector \({\varvec{0}}\) and unknown precision matrix \({\varvec{\Omega}}_{ss}={\varvec{\Sigma}}_s^{-1}\), \(m=1,\ldots ,n_s\).
We consider both the outcome \({\varvec{Y}}_s=(Y_{s,1},\ldots ,Y_{s,n_s})'\), with \(Y_{s,m}=({\tilde{T}}_{s,m}, \delta_{s,m})\), and the predictors \({\varvec{X}}_s\) to be random variables. Thus, the likelihood for subgroup s is the joint distribution \({p({\varvec{Y}}_s, {\varvec{X}}_s)=p({\varvec{Y}}_s|{\varvec{X}}_s) \cdot p({\varvec{X}}_s)}\). The conditional distribution \(p({\varvec{Y}}_s|{\varvec{X}}_s)\) corresponds to the grouped data likelihood of the Bayesian Cox proportional hazards model introduced at the beginning of this section [20], applied to subgroup s
$$\begin{aligned}{}&L({\mathfrak{D}}_s|{\varvec{\beta}}_s, {\varvec{h}}_s) \\&\quad \propto \prod_{g=1}^{J_{s}} \left[ \exp \Big ( -h_{s,g} \sum_{k \in {\mathcal{R}}_{s,g}-{\mathcal{D}}_{s,g}} \exp ({\varvec{\beta}}_s'{\varvec{x}}_{s,k}) \Big ) \prod_{l \in {\mathcal{D}}_{s,g}} \Big [ 1-\exp \big ( -h_{s,g} \exp ({\varvec{\beta}}_s' {\varvec{x}}_{s,l} ) \big ) \Big ] \right] , \end{aligned}$$
where \({\mathfrak{D}}_s=\{({\varvec{x}}_s, {\mathcal{R}}_{s,g}, {\mathcal{D}}_{s,g}): g=1,\ldots ,J_s\}\) are the observed data in subgroup s, with \({\mathcal{R}}_{s,g}\) the risk and \({\mathcal{D}}_{s,g}\) the failure sets corresponding to interval \({I_{s,g}=(c_{s,g-1},c_{s,g}]}\), \({g=1,\ldots ,J_s}\). The increment in the cumulative baseline hazard for subgroup s in interval \(I_{s,g}\) is termed \({h_{s,g} = H_0(c_{s,g}) - H_0(c_{s,g-1})}\). \({\varvec{\beta}}_s\) is the p-dimensional vector of regression coefficients for subgroup s.
The marginal distribution of \({\varvec{X}}_s\) is multivariate normal with \({\varvec{S}}_s={\varvec{X}}_s' {\varvec{X}}_s\)
$$\begin{aligned} p({\varvec{X}}_s|{\varvec{\Omega}}_{ss})&\propto \prod_{m=1}^{n_s} |{\varvec{\Omega}}_{ss}|^{1/2} \exp \big ( -\frac{1}{2} {\varvec{X}}_{s,m}' {\varvec{\Omega}}_{ss} {\varvec{X}}_{s,m} \big ) \\&= |{\varvec{\Omega}}_{ss}|^{n_s/2} \exp \big ( -\frac{1}{2} \underbrace{\sum_{m=1}^{n_s} {\varvec{X}}_{s,m}' {\varvec{\Omega}}_{ss} {\varvec{X}}_{s,m}}_{=\text {tr}({\varvec{S}}_s {\varvec{\Omega}}_{ss})} \big ) . \end{aligned}$$
The joint likelihood across all subgroups is the product of the subgroup likelihoods
$$\begin{aligned} \prod_{s=1}^{S} L({\mathfrak{D}}_s|{\varvec{\beta}}_s, {\varvec{h}}_s) \cdot p({\varvec{X}}_s|{\varvec{\Omega}}_{ss}) . \end{aligned}$$
Prior specifications
Prior on the parameters \({\varvec{h}}_s\) and \({\varvec{\beta}}_s\) of the Cox model
The prior for the increment in the cumulative baseline hazard in subgroup s follows independent gamma distributions
$$\begin{aligned} h_{s,g} \sim {\mathcal{G}}(a_0 (H^{*}(c_{s,g})-H^{*}(c_{s,g-1})), a_0) , \end{aligned}$$
with a Weibull distribution \(H^{*}(c_{s,g}) = \eta_s c_{s,g}^{\kappa_s}\), \(g=1,\ldots ,J_s\), \(s=1,\ldots ,S\) [20]. We choose the hyperparameters \(a_0\), \(\eta_s\) and \(\kappa_s\) to be fixed and in accordance with Lee et al. [20] and Zucknick et al. [41]. We set \(a_0=2\) and estimate the hyperparameters \(\eta_s\) and \(\kappa_s\) from the (training) data by fitting a parametric Weibull model without covariates to the survival data of subgroup s.
We perform variable selection using the SSVS approach by George and McCulloch [14] described earlier in this section. The prior of the regression coefficients \(\beta_{s,i}\) in subgroup s conditional on the latent indicator \(\gamma_{s,i}\) is defined as a mixture of two normal distributions with small (\(\tau^{2}\)) and large (\(c^{2} \tau^{2}\)) variance
$$\begin{aligned} \beta_{s,i}|\gamma_{s,i} \sim (1-\gamma_{s,i}) \cdot {\mathcal{N}}(0, \tau^{2}) + \gamma_{s,i} \cdot {\mathcal{N}}(0, c^{2} \tau ^{2}) \, , \quad i=1,\ldots ,p . \end{aligned}$$
The latent indicator variable \(\gamma_{s,i}\) indicates the inclusion (\(\gamma_{s,i}= 1\)) or exclusion (\({\gamma_{s,i}= 0}\)) of variable i in the model for subgroup s. We assume equal variances for all regression coefficients. We set the hyperparameters to the fixed values \({\tau =0.0375}\) and \(c=20\) following Treppmann et al. [35]. This choice corresponds to a standard deviation of \(c \cdot \tau = 0.75\) and a 95% probability interval of \([-1.47,1.47]\) for \(p(\beta_{s,i}|\gamma_{s,i}= 1)\).
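This interval follows directly from the slab standard deviation:
$$\begin{aligned} c \cdot \tau = 20 \cdot 0.0375 = 0.75 \, , \qquad 0 \pm 1.96 \cdot 0.75 \approx \pm 1.47 \, . \end{aligned}$$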
Prior on \({\varvec{\gamma}}\) linking variable and graph selection
The standard prior for the binary variable selection indicators \(\gamma_{s,i}\) is a product of independent Bernoulli distributions as utilized by Treppmann et al. [35]. However, this does not consider information from other subgroups and relationships between covariates. For this situation, we propose a Markov random field (MRF) prior for the latent variable selection indicators that incorporates information on the relationships among the covariates as described by an undirected graph. This prior assumes that neighboring covariates in the graph are more likely to have a common effect and encourages their joint inclusion. The MRF prior for \({\varvec{\gamma}}\) given \({\varvec{G}}\) is defined as
$$\begin{aligned} p({\varvec{\gamma}}|{\varvec{G}})&= \frac{ \exp ( a {\varvec{1}}_{pS}' {\varvec{\gamma}} + b {\varvec{\gamma}}'{\varvec{G}} {\varvec{\gamma}})}{\sum_{{\varvec{\gamma}} \in \{0,1\}^{pS}} \exp ( a {\varvec{1}}_{pS}' {\varvec{\gamma}} + b {\varvec{\gamma}}'{\varvec{G}} {\varvec{\gamma}})} \propto \exp ( a {\varvec{1}}_{pS}' {\varvec{\gamma}} + b {\varvec{\gamma}}'{\varvec{G}} {\varvec{\gamma}}) , \end{aligned}$$
where \({\varvec{\gamma}}=(\gamma_{1,1},\ldots ,\gamma_{1,p},\ldots ,\gamma_{S,1},\ldots ,\gamma_{S,p})'\) is a pS-dimensional vector of variable inclusion indicators, \({\varvec{G}}\) is a symmetric \((pS \times pS)\) adjacency matrix representation of the graph, and a, b are scalar hyperparameters.
The hyperparameter a influences the overall variable inclusion probability and controls the sparsity of the model, with smaller values resulting in sparser models. Without loss of generality \(a<0\). The hyperparameter \(b>0\) determines the prior belief in the strength of relatedness between pairs of neighboring variables in the graph and controls the probability of their joint inclusion. Higher values of b encourage the selection of variables with neighbors already selected into the model. The idea becomes more evident by looking at the conditional probability
$$\begin{aligned} p(\gamma_{s,i}|{\varvec{\gamma}}_{-(s,i)},{\varvec{G}}) = \frac{\exp \left( a \gamma_{s,i} + 2b \gamma_{s,i} \cdot (\sum_{j\ne i} \gamma_{s,j} g_{ss,ij} + \sum_{r\ne s} \gamma_{r,i} g_{rs,ii}) \right) }{1+\exp \left( a +2b \cdot (\sum_{j\ne i} \gamma_{s,j} g_{ss,ij} + \sum_{r\ne s} \gamma_{r,i} g_{rs,ii}) \right) } . \end{aligned}$$
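A minimal sketch of this conditional inclusion probability is given below (illustrative only; the values of a and b are hypothetical, and the stacked indexing follows the vectorization of \({\varvec{\gamma}}\) described above):

```python
# Minimal sketch of the conditional inclusion probability under the MRF prior
# for one element gamma_{s,i} of the stacked indicator vector.
import numpy as np

def mrf_conditional_prob(idx, gamma, G, a=-3.0, b=0.5):
    """
    idx   : position of gamma_{s,i} in the stacked (p*S,) vector gamma
    gamma : (p*S,) vector of 0/1 inclusion indicators
    G     : (p*S, p*S) symmetric adjacency matrix of the joint graph (zero diagonal)
    a, b  : MRF hyperparameters (a < 0 controls sparsity, b > 0 rewards selected neighbours)
    """
    neighbour_sum = G[idx] @ gamma            # sum of gamma over the neighbours of (s, i)
    eta = a + 2.0 * b * neighbour_sum
    return 1.0 / (1.0 + np.exp(-eta))         # logistic form of the conditional probability
```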
An MRF prior for variable selection has also been used by other authors [21, 28, 33, 34]. However, unlike us, they do not address the problem of borrowing information across subgroups by linking covariates in a graph.
We propose a joint graph with possible edges between all pairs of covariates within each subgroup and edges between the same covariates in different subgroups. The elements \(g_{rs,ij}\) in the adjacency matrix of the graph \({\varvec{G}}\) represent the presence (\(g_{rs,ij}=1\)) or absence (\(g_{rs,ij}=0\)) of an edge between nodes (genes) i and j in subgroups r and s. They can be viewed as latent binary indicator variables for edge inclusion. The adjacency matrix in the present model is defined as
$$\begin{aligned} {\varvec{G}}= \left( \begin{array}{@{}*{4}{c}@{}} {\varvec{G}}_{11} &{} {\varvec{G}}_{12} &{} \ldots &{} {\varvec{G}}_{1S} \\ {\varvec{G}}_{12} &{} {\varvec{G}}_{22} &{} \ldots &{} {\varvec{G}}_{2S} \\ \vdots &{} \vdots &{} \ddots &{} \vdots \\ {\varvec{G}}_{1S} &{} {\varvec{G}}_{2S} &{} \ldots &{} {\varvec{G}}_{SS} \\ \end{array} \right) . \end{aligned}$$
\({\varvec{G}}_{ss}=(g_{ss,ij})_{i<j}\) is the matrix of latent edge inclusion indicators within subgroup s
$$\begin{aligned} {\varvec{G}}_{ss} = \begin{pmatrix} \, 0 \quad &{} g_{ss,12} &{} \ldots &{} g_{ss,1(p-1)} &{} g_{ss,1p} \\ g_{ss,12} &{} \, 0 \quad &{} \ddots &{} &{} g_{ss,2p} \\ \vdots &{} \ddots &{} \ddots &{} \ddots &{} \vdots \\ g_{ss,1(p-1)} &{} &{} \ddots &{} \, 0 \quad &{} g_{ss,(p-1)p} \\ g_{ss,1p} &{} g_{ss,2p} &{} \ldots &{} g_{ss,(p-1)p} &{} \, 0 \quad \\ \end{pmatrix} , \end{aligned}$$
and \({\varvec{G}}_{rs}=(g_{rs,ii})_{r<s}\) is the matrix of latent edge inclusion indicators between subgroups r and s
$$\begin{aligned} {\varvec{G}}_{rs} = \text {diag}(g_{rs,11},\ldots ,g_{rs,pp}) , \end{aligned}$$
with \(r,s=1,\ldots ,S\), \(r<s\), \(i,j=1,\ldots ,p\), \(i<j\).
Thus, within each subgroup s we assume a standard undirected graph with possible edges between all pairs of genes, representing conditional dependencies as in a functional or regulatory pathway. Between different subgroups we only allow for edges between the same gene in different subgroups (different genes in different subgroups are assumed to be unconnected). This allows information to be shared between subgroups, so that prognostic genes shared by several subgroups receive a higher inclusion probability. To visualize this idea, Fig. 1 shows an example network consisting of two subgroups, each with five predictors.
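The following sketch (illustrative, hypothetical helper name) assembles such a joint adjacency matrix from within-subgroup blocks \({\varvec{G}}_{ss}\) and diagonal between-subgroup blocks \({\varvec{G}}_{rs}\), here for a two-subgroup, five-gene setting as in Fig. 1 with arbitrary example edges:

```python
# Minimal sketch of assembling the joint (pS x pS) adjacency matrix from
# within-subgroup blocks G_ss and diagonal between-subgroup blocks G_rs.
import numpy as np

def build_joint_adjacency(G_within, G_between):
    """
    G_within  : list of S symmetric (p, p) 0/1 matrices with zero diagonal (within-subgroup edges)
    G_between : dict mapping (r, s) with r < s to a (p,) 0/1 vector of same-gene edges
    """
    S, p = len(G_within), G_within[0].shape[0]
    G = np.zeros((p * S, p * S), dtype=int)
    for s in range(S):
        G[s*p:(s+1)*p, s*p:(s+1)*p] = G_within[s]
    for (r, s), diag in G_between.items():
        block = np.diag(diag)
        G[r*p:(r+1)*p, s*p:(s+1)*p] = block
        G[s*p:(s+1)*p, r*p:(r+1)*p] = block   # keep the joint matrix symmetric
    return G

# Example: two subgroups, five genes each, with a few arbitrary edges
p, S = 5, 2
G11 = np.zeros((p, p), dtype=int); G11[0, 1] = G11[1, 0] = 1
G22 = np.zeros((p, p), dtype=int); G22[2, 3] = G22[3, 2] = 1
G = build_joint_adjacency([G11, G22], {(0, 1): np.array([1, 1, 0, 0, 1])})
```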
Graph selection prior on \({\varvec{\Omega}}\) and \({\varvec{G}}\)
We infer the unknown graph and precision matrix using the structure learning approach for Gaussian graphical models by Wang [36]. The precision matrix of subgroup s corresponding to subgraph \({\varvec{G}}_{ss}\) is denoted by \({\varvec{\Omega}}_{ss} = (\omega_{ss,ij})_{i<j}\).
The corresponding prior is defined by
$$\begin{aligned} p({\varvec{\Omega}}_{ss}|{\varvec{G}}_{ss},\nu_0,\nu_1,\lambda ) \propto \prod_{i<j} {\mathcal{N}}(\omega_{ss,ij}|0,\nu_{g_{ss,ij}}^{2}) \prod_{i} \text {Exp}(\omega_{ss,ii}|\frac{\lambda }{2}) {\mathbbm {1}}_{\{{\varvec{\Omega}}_{ss} \in {\mathcal{M}}^{+} \}}, \end{aligned}$$
with fixed hyperparameters \(\nu_0>0\) small, \(\nu_1>0\) large and \(\lambda >0\).
We assume the binary edge inclusion indicators within subgroup s (\(g_{ss,ij}\)) as well as between subgroups r and s (\(g_{rs,ii}\)) to be independent Bernoulli a priori
$$\begin{aligned} p({\varvec{G}}|\pi_G) \propto \prod_{s} \prod_{i<j} \big [ \pi_G^{g_{ss,ij}}(1-\pi_G)^{1-g_{ss,ij}} \big ] \cdot \prod_{r<s} \prod_{i} \big [ \pi_G^{g_{rs,ii}}(1-\pi_G)^{1-g_{rs,ii}} \big ] , \end{aligned}$$
with fixed prior probability of edge inclusion \(\pi_G \in (0,1)\).
Posterior inference
The joint posterior distribution for the set of all parameters \(\theta = \{ {\varvec{h}}, {\varvec{\beta}}, {\varvec{\gamma}}, {\varvec{G}}, {\varvec{\Omega}} \}\) is proportional to the product of the joint likelihood and the prior distributions of the parameters in all subgroups
$$\begin{aligned}{}&p({\varvec{h}}, {\varvec{\beta}}, {\varvec{\gamma}}, {\varvec{G}}, {\varvec{\Omega}}|{\mathfrak{D}},{\varvec{X}})\\&\propto \prod_{s=1}^{S} \Big [ L({\mathfrak{D}}_s|{\varvec{\beta}}_s,{\varvec{h}}_s) \cdot p({\varvec{X}}_s|{\varvec{\Omega}}_{ss}) \Big ] \\&\cdot \prod_{s=1}^{S} \Big [ p({\varvec{\Omega}}_{ss}|{\varvec{G}}_{ss}) \cdot p({\varvec{G}}) \cdot p({\varvec{\gamma}}|{\varvec{G}}) \cdot \prod_{i=1}^{p} p(\beta_{s,i}|\gamma_{s,i}) \cdot \prod_{g=1}^{J_s} p(h_{s,g}|{\varvec{\beta}}_s) \Big ] . \end{aligned}$$
Markov Chain Monte Carlo sampling
Markov Chain Monte Carlo (MCMC) simulations are required to obtain a posterior sample of the parameters. The different parameters are updated iteratively according to their conditional posterior distributions using a Gibbs sampler. A brief outline of the MCMC sampling scheme is given in the following. More details are provided in the Appendix.
1. For subgroup \(s=1,\ldots ,S\) update \({\varvec{\Omega}}_{ss}\) with the block Gibbs sampler proposed by Wang [36].
2. Update all elements in \({\varvec{G}}\) iteratively with a Gibbs sampler from the conditional distributions \(p(g_{ss,ij} =1| {\varvec{G}}_{-ss,ij}, \omega_{ss,ij}, {\varvec{\gamma}})\) as well as \({p(g_{rs,ii}=1 | {\varvec{G}}_{-rs,ii}, {\varvec{\gamma}})}\), where \({\varvec{G}}_{-rs,ii}\) (\({\varvec{G}}_{-ss,ij}\)) denotes all elements in \({\varvec{G}}\) except for \(g_{rs,ii}\) (\(g_{ss,ij}\)).
3. Update all elements in \({\varvec{\gamma}}\) iteratively with a Gibbs sampler from the conditional distributions \(p(\gamma_{s,i} =1 | {\varvec{\gamma}}_{-s,i}, {\varvec{G}}, \beta_{s,i})\), where \({\varvec{\gamma}}_{-s,i}\) denotes all elements in \({\varvec{\gamma}}\) except for \(\gamma_{s,i}\).
4. Update \(\beta_{s,i}\) from the conditional distribution \(p(\beta_{s,i}|{\varvec{\beta}}_{s,-i}, {\varvec{\gamma}}_s, {\varvec{h}}_s, {\mathfrak{D}}_s)\), \({s=1,\ldots ,S}\), \({i=1,\ldots ,p}\), using a random walk Metropolis-Hastings algorithm with adaptive jumping rule as proposed by Lee et al. [20]. \({\varvec{\beta}}_{s,-i}\) includes all elements in \({\varvec{\beta}}_{s}\) except for \(\beta_{s,i}\).
5. The conditional distribution \(p(h_{s,g}|{\varvec{h}}_{s,-g}, {\varvec{\beta}}_s, {\varvec{\gamma}}_s, {\mathfrak{D}}_s)\) for the update of \(h_{s,g}\) can be well approximated by the gamma distribution
$$\begin{aligned} h_{s,g} | {\varvec{h}}_{s,-g}, {\varvec{\beta}}_s, {\varvec{\gamma}}_s, {\mathfrak{D}}_s \overset{\text {approx.}}{\sim } {\mathcal{G}}\Big ( a_0 \big ( H^{*}(c_{s,g}) - H^{*}(c_{s,g-1}) \big ) + d_{s,g} \, , \; a_0 + \! \! \sum_{k \in {\mathcal{R}}_{s,g}-{\mathcal{D}}_{s,g}} \! \! \exp ({\varvec{\beta}}_s'{\varvec{x}}_{s,k}) \Big ) , \end{aligned}$$
where \(d_{s,g}\) is the number of events in interval g for subgroup s and \({\varvec{h}}_{s,-g}\) denotes the vector \({\varvec{h}}_{s}\) without the g-th element, \(g=1,\ldots ,J_s\), \(s=1,\ldots ,S\) [17, chapter 3.2.2].
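A minimal sketch of this approximate Gibbs update follows (illustrative, hypothetical function name; note that NumPy parametrizes the gamma distribution by shape and scale, i.e. the inverse rate):

```python
# Minimal sketch of the approximate Gibbs update for the baseline hazard increment h_{s,g}.
import numpy as np

def sample_h_sg(rng, a0, Hstar_cg, Hstar_cg_prev, d_sg, lin_pred_survivors):
    """
    Hstar_cg, Hstar_cg_prev : H*(c_{s,g}) and H*(c_{s,g-1})
    d_sg                    : number of events in interval g of subgroup s
    lin_pred_survivors      : beta_s' x_{s,k} for all k in R_{s,g} - D_{s,g}
    """
    shape = a0 * (Hstar_cg - Hstar_cg_prev) + d_sg
    rate = a0 + np.sum(np.exp(lin_pred_survivors))
    return rng.gamma(shape, 1.0 / rate)       # scale = 1 / rate
```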
Starting with an arbitrary set of initial values for the parameters, the MCMC algorithm runs with a reasonably large number of iterations to obtain a representative sample from the posterior distribution. All subsequent results are based on single MCMC chains, each with 20 000 iterations in total and a burn-in period of 10 000 iterations. As starting values we choose an empty model with:
$$\begin{aligned} {\varvec{G}}^{(0)}&={\varvec{0}}_{pS\times pS} \, ,\\ {\varvec{\Sigma}}_s^{(0)}&={\varvec{I}}_{p\times p} \quad \text {and} \quad {\varvec{\Omega}}_{ss}^{(0)}=({\varvec{\Sigma}}_s^{(0)})^{-1} \quad \text {for} \; s=1,\ldots ,S \, ,\\ {\varvec{\gamma}}_s^{(0)}&=(0,\ldots ,0)' \quad \text {for} \; s=1,\ldots ,S \, ,\\ \beta_{s,i}^{(0)} &\sim {\mathcal{U}}[-0.02, 0.02] \quad \text {for} \; i=1,\ldots ,p , \; s=1,\ldots ,S \, ,\\ h_{s,g}^{(0)} &\sim {\mathcal{G}}(1,1) \quad \text {for} \; s=1,\ldots ,S , \; g=1,\ldots ,J_s \, . \end{aligned}$$
We assessed convergence of each MCMC chain by looking at autocorrelations, trace plots and running mean plots of the regression coefficients. In addition, we ran several independent MCMC chains with different starting values to ensure that the chains and burn-in period were long enough to reach (approximate) convergence.
Posterior estimation and variable selection
We report the results of the Cox models in terms of marginal and conditional posterior means and standard deviations of the estimated regression coefficients, as well as posterior selection probabilities. After removal of the burn-in samples, the remaining MCMC samples serve as draws from the posterior distribution to calculate the empirical estimates. These estimates are then averaged across all training sets for each variable separately.
The strategy for variable selection follows Treppmann et al. [35]. First, the mean model size \(m^{*}\) is computed as the average number of included variables across all MCMC iterations after the burn-in. Then the \(m^{*}\) variables with the highest posterior selection probability are considered as the most important variables and selected in the final model.
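A minimal sketch of this selection rule, assuming a matrix of post-burn-in indicator samples (hypothetical function name):

```python
# Minimal sketch of the selection rule following Treppmann et al. [35]: select the m*
# variables with the highest posterior selection probability, where m* is the average
# model size over the post-burn-in MCMC iterations.
import numpy as np

def select_variables(gamma_samples):
    """gamma_samples: (n_iter, p) array of 0/1 inclusion indicators after burn-in."""
    mean_model_size = int(round(gamma_samples.sum(axis=1).mean()))   # m*
    selection_prob = gamma_samples.mean(axis=0)                      # posterior selection probabilities
    ranked = np.argsort(selection_prob)[::-1]                        # variables sorted by probability
    return ranked[:mean_model_size], selection_prob
```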
We visually assess the inferred graph by the marginal posterior probabilities of the pairwise edge inclusion indicators. High probabilities suggest that an edge exists between two covariates (nodes). We consider the presence of an edge as a continuous parameter rather than choosing a cutoff for binary decision.
Prediction
We use training data for model fitting and posterior estimation and test data to assess model performance. We evaluate the prediction performance of the Cox models by the integrated Brier score.
The expected Brier score can be interpreted as a mean square error of prediction. It measures the inaccuracy by comparing the estimated survival probability \({\hat{S}}(t|{\varvec{x}}_m)\) of a patient m, \(m=1,\ldots ,n\), with the observed survival status \({\mathbbm {1}}({\tilde{t}}_m > t)\)
$$\begin{aligned} \widehat{\textit{BS}}(t)= \frac{1}{n} \sum_{m=1}^{n} {\hat{w}}_m(t) \cdot \left( {\mathbbm {1}}({\tilde{t}}_m > t) -{\hat{S}}(t|{\varvec{x}}_m) \right) ^{2} \end{aligned}$$
and the squared residuals are weighted using inverse probability of censoring weights
$$\begin{aligned} {\hat{w}}_m(t) = \frac{ {\mathbbm {1}}({\tilde{t}}_m \le t) \delta_m}{{\hat{C}}({\tilde{t}}_m)} + \frac{{\mathbbm {1}}({\tilde{t}}_m > t)}{{\hat{C}}(t)} \end{aligned}$$
to adjust for the bias caused by the presence of censoring in the data. \({\hat{C}}(t)\) is the Kaplan-Meier estimator of the censoring times [3, 31].
The predictive performance of competing survival models can be compared by plotting the Brier score over time (prediction error curves). Alternatively, prediction error curves can be summarized in one value with the integrated Brier score as a measure of inaccuracy over a time interval rather than at single time points [15]
$$\begin{aligned} \textit{IBS}(t^*)= \frac{1}{t^*} \int_0^{t^*} \textit{BS}(t) \text {d}t , \quad t^*> 0 . \end{aligned}$$
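For illustration, a minimal sketch (not the evaluation code used here) of the IPCW-weighted Brier score and its integrated version; cens_km is an assumed callable returning the Kaplan-Meier estimate of the censoring distribution:

```python
# Minimal sketch of the IPCW Brier score at time t and the integrated Brier score over [0, t*].
import numpy as np

def brier_score(t, time, event, surv_prob_at_t, cens_km):
    """
    time, event    : observed times and event indicators of the n test patients
    surv_prob_at_t : (n,) model-based estimates S_hat(t | x_m)
    cens_km        : callable returning the Kaplan-Meier estimate of the censoring distribution
    """
    still_alive = (time > t).astype(float)
    weights = (event * (time <= t)) / cens_km(time) + still_alive / cens_km(t)
    return np.mean(weights * (still_alive - surv_prob_at_t) ** 2)

def integrated_brier_score(t_grid, bs_values):
    """Approximate IBS(t*) = (1/t*) * integral_0^{t*} BS(t) dt by the trapezoidal rule."""
    return np.trapz(bs_values, t_grid) / t_grid[-1]
```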
Median probability model and Bayesian model averaging
For the calculation of the prediction error, we account for the uncertainty in model selection by two different approaches: the Median Probability Model (MPM) [1] and an approximation to Bayesian Model Averaging (BMA) [16]. After removal of the burn-in samples, we compute the Brier score over the “best” selected models. For the BMA approach, we choose the top 100 models with the largest log-likelihood values to obtain the marginal posterior means of the regression coefficients, which in turn are required for the risk score. Our choice of the number of top models for the BMA approximation is based on a visual assessment of the MCMC frequencies of the different top-selected models; however, this number could be further optimized. For the MPM approach, we select all covariates with a mean posterior selection probability larger than 0.5 and calculate the marginal posterior means of their regression coefficients and the corresponding risk score.
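A minimal sketch of both strategies (illustrative, hypothetical function names), assuming matrices of post-burn-in samples of the indicators, coefficients and per-iteration model log-likelihoods:

```python
# Minimal sketch of the two model-selection strategies used for prediction: the median
# probability model (MPM) and an approximation to Bayesian model averaging (BMA).
import numpy as np

def mpm_coefficients(gamma_samples, beta_samples):
    """Keep covariates with posterior selection probability > 0.5; average their coefficients."""
    keep = gamma_samples.mean(axis=0) > 0.5
    beta_hat = np.where(keep, beta_samples.mean(axis=0), 0.0)
    return beta_hat, keep

def bma_coefficients(loglik_samples, beta_samples, n_top=100):
    """Average coefficients over the n_top MCMC models with the largest log-likelihood."""
    top = np.argsort(loglik_samples)[::-1][:n_top]
    return beta_samples[top].mean(axis=0)

# The resulting risk score for a patient with covariates x is then x @ beta_hat.
```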