Open Access
A hierarchical spike-and-slab model for pan-cancer survival using pan-omic data
BMC Bioinformatics volume 23, Article number: 235 (2022)
Abstract
Background
Pan-omics, pan-cancer analysis has advanced our understanding of the molecular heterogeneity of cancer. However, such analyses have been limited in their ability to use information from multiple sources of data (e.g., omics platforms) and multiple sample sets (e.g., cancer types) to predict clinical outcomes. We address the issue of prediction across multiple high-dimensional sources of data and sample sets by using molecular patterns identified by BIDIFAC+, a method for integrative dimension reduction of bidimensionally-linked matrices, in a Bayesian hierarchical model. Our model performs variable selection through spike-and-slab priors that borrow information across clustered data. We use this model to predict overall patient survival from the Cancer Genome Atlas with data from 29 cancer types and 4 omics sources and use simulations to characterize the performance of the hierarchical spike-and-slab prior.
Results
We found that molecular patterns shared across all or most cancers were largely not predictive of survival. However, our model selected patterns unique to subsets of cancers that differentiate clinical tumor subtypes with markedly different survival outcomes. Some of these subtypes were previously established, such as subtypes of uterine corpus endometrial carcinoma, while others may be novel, such as subtypes within a set of kidney carcinomas. Through simulations, we found that the hierarchical spike-and-slab prior performs best in terms of variable selection accuracy and predictive power when borrowing information is advantageous, but also offers competitive performance when it is not.
Conclusions
We address the issue of prediction across multiple sources of data by using results from BIDIFAC+ in a Bayesian hierarchical model for overall patient survival. By incorporating spike-and-slab priors that borrow information across cancers, we identified molecular patterns that distinguish clinical tumor subtypes within a single cancer and within a group of cancers. We also corroborate the flexibility and performance of using spike-and-slab priors as a Bayesian variable selection approach.
Background
Motivating application
Since its completion in 2018, The Cancer Genome Atlas (TCGA) database has become a cornerstone for studying the relationship between cancer molecular heterogeneity and clinical outcomes. TCGA contains data from multiple “omics” sources, including the genome, transcriptome, proteome, and epigenome, from over 10,000 patients across 33 types of cancer [1], opening the door to pan-omics, pan-cancer research. Changes to genomic function that affect cancer development and behavior occur at multiple omics levels, motivating several pan-omics studies that have discovered vast molecular variation across multiple levels within a single cancer type [2,3,4]. Meanwhile, pan-cancer research has been motivated by discoveries of the same genomic changes affecting tumors from different tissues-of-origin [5]. These discoveries suggest the importance of considering multiple omics sources and multiple cancer types at once to holistically characterize cancer’s etiological landscape.
One approach to studying molecular heterogeneity across both omics sources and cancer types is BIDIFAC+, a method of simultaneous factorization and decomposition of variation across bidimensionally linked matrices [6]. BIDIFAC+ identifies latent factors, analogous to principal components, that may be shared across any number of omics platforms or sample sets. These components describe patterns of variability across these combinations of omics sources and patient groups. When applied to TCGA data, BIDIFAC+ revealed patterns of variability shared by mRNA, miRNA, methylation, and protein data driving heterogeneity across multiple cancers [6]. However, these results were solely exploratory, and did not consider prediction of important clinical endpoints. Our goal is to assess the prognostic value and clinical relevance of pan-omic patterns of molecular variability identified by BIDIFAC+. To do so, we sought to use a comprehensive model for overall survival that flexibly borrows information across the different types of cancer.
Components of our pan-cancer, pan-omics analysis
Our approach builds on two active areas of statistical methodology: prediction via integrative dimension reduction (described in the “Prediction via bidimensional dimension reduction” section) and structured Bayesian variable selection (described in the “Bayesian hierarchical spike-and-slab survival model” section).
Prediction via bidimensional dimension reduction
Predictive modeling in the case of a single high-dimensional dataset often begins by first applying a method such as principal components analysis (PCA) to obtain a small set of latent variables (i.e., components) that explain variation in the data [7]. These components can be used for predictive modeling using classical approaches [8]. However, PCA does not translate smoothly to the multi-source (e.g., multi-omics) context. In this context, one may use the results of multi-source integrative methods, like joint and individual variation explained (JIVE) [9], structural learning and integrative decomposition (SLIDE) [10], or generalized integrative principal components analysis (GIPCA) [11]. These methods identify components that are shared across or specific to multiple sources, which has been shown to improve power and interpretation for multi-omics predictive models over ad hoc applications of PCA [12]. However, these approaches do not apply when there are multiple sources of covariates and multiple sample sets, as is the case in the pan-omics, pan-cancer setting. This article addresses the issue of prediction across multiple sources of data and multiple sample sets by using components identified by BIDIFAC+ in a predictive model. BIDIFAC+ identifies components that may be shared across any number of sources (e.g., omics platforms) and any number of sample sets (e.g., cancer types). In particular, we use BIDIFAC+ components from bidimensional integration of multiple omics sources and multiple cancer types to model TCGA patients’ overall survival (OS).
Bayesian hierarchical spike-and-slab survival model
In order to model the relationship between patient OS and components from BIDIFAC+ dimension reduction, we consider a Bayesian hierarchical survival regression framework. Bayesian hierarchical regression has been used previously for pan-cancer survival modeling [13], and is attractive in this context because it facilitates borrowing of information across cancer types while allowing a different survival model for each cancer. This feature of our approach is motivated by the assumption that molecular patterns may drive heterogeneity in more than one cancer. However, our model is also flexible enough to allow the effect of these patterns to differ according to the cancer type. Accommodating a censored outcome is straightforward in this framework, which has been demonstrated in prior work [13, 14].
Many genomic components have little relation to clinical outcomes, and so we pursued a sparse model that accommodates variable selection within the hierarchical framework. There is an extensive literature on Bayesian approaches to variable selection. Mitchell and Beauchamp [15], George and McCulloch [16], and Kuo and Mallick [17] are foundational but differing perspectives on spike-and-slab variable selection. The spike-and-slab approach is unique in providing an “included/excluded” interpretation for each predictor through the use of indicator variables that turn on and off each coefficient. In contrast, other Bayesian variable selection approaches adaptively shrink coefficients of uninformative predictors towards zero. Examples of such priors include the Bayesian lasso [18], the Bayesian elastic net [19], and the horseshoe prior [20].
These Bayesian variable selection methods have been used in hierarchical models of many forms. Yang et al. [21] propose using spike-and-slab priors to identify important groups of covariates in nonparametric regression models and seemingly unrelated regression models. Zhang et al. [22] propose a variable selection approach which identifies groups of covariates to include in the model and estimates lasso solutions for coefficients in selected groups. These methods operate on a single sample set where inducing sparsity at the group level on covariates is desired. In contrast, Suo et al. [23] and Mousavi et al. [24] demonstrate using spike-and-slab priors for variable selection on a single covariate set shared by multiple sample sets for classification. Hierarchical variable selection has also been considered in Bayesian survival models, as is done in Lee et al. [25] and Lee et al. [26], which both present the use of spike-and-slab priors in proportional hazards models. Maity et al. [14] consider the pan-cancer context and induce sparsity in a Bayesian model for survival using hierarchical horseshoe priors.
To induce sparsity in our context across multiple cancer types (i.e., groups) we extend George and McCulloch [16]’s definition of a spike-and-slab prior in three ways: (a) we allow the possibility that a predictor is included for one sample group but not another, (b) we allow the slab distribution’s location and scale to be inferred hierarchically based on data from groups for which the covariate is included, and (c) we impose a prior on the inclusion probabilities of each covariate to borrow information across groups. These modifications adapt the original formulation to borrow information across groups without compromising the flexibility that covariate inclusion and coefficient estimation can differ between groups. This approach is attractive when groups offer agreeable information about shared covariates, and it is advantageous to borrow information to increase power. Our model accommodates a potentially censored outcome, offering a translational approach that isolates predictors informative of survival.
Our context differs from Yang et al. (2020) [21] and Zhang et al. (2014) [22], who study variable selection on a shared covariate set for a single sample set. Our approach also differs from Mousavi et al. (2014) [24] and Suo et al. (2013) [23], who study variable selection for multiple sample sets but require the same predictors be included for all sample sets. The approach of Maity et al. (2020) [14] most closely resembles ours by borrowing strength across multiple sample sets to select informative predictors for patient survival but allowing for differences in the model for each sample set. However, we consider a spike-and-slab prior as opposed to a horseshoe prior to facilitate an “inclusion/exclusion” interpretation for each predictor with accompanying posterior probabilities, which is useful in this context. Our work also differs in that we allow for the survival model for each cancer type to depend on a different set of predictors.
The rest of our article is organized as follows. In the “Methods” section we describe our methods in detail, including an introduction to the BIDIFAC+ method, the spike-and-slab prior, and our hierarchical extensions. In the “Application to pan-cancer, pan-omics data” section, we apply our Bayesian model to TCGA data to predict patient OS using patterns of variability identified by BIDIFAC+ and investigate the clinical relevance of the results. In the “Simulation study” section, we present a simulation study evaluating the flexibility of the hierarchical spike-and-slab prior in the context of our data application. We provide a discussion of the results and suggestions for future work in the “Discussion” section and concluding thoughts in the “Conclusion” section.
Methods
Here we describe our Bayesian hierarchical model with spike-and-slab priors. We first provide an overview of bidimensionally-linked data and the BIDIFAC+ method in the “BIDIFAC+ for bidimensionally-linked data” section. We then describe the classical spike-and-slab model in the “Spike-and-slab priors” section. In the “Hierarchical extensions” section, we describe how to extend the spike-and-slab prior to borrow information across grouped data. Then, we outline our full model for the survival outcome in the “Extensions to survival data” section. We describe computation of the posterior predictive log-likelihood for model assessment in the “Posterior predictive likelihood validation” section.
BIDIFAC+ for bidimensionally-linked data
Here we briefly introduce the BIDIFAC+ method. BIDIFAC+ [6] is an exploratory factorization method for bidimensionally-linked data. Bidimensionally-linked data is a data structure consisting of several datasets linked by their rows (e.g., shared omics features) and their columns (e.g., shared patient cohorts), as visualized in Fig. 1. BIDIFAC+ decomposes bidimensionally-linked data into a series of low-rank modules or matrices explaining structured variability shared either globally (across all datasets), across the row sets, across the column sets, or unique to each dataset. To illustrate the BIDIFAC+ decomposition, define the following concatenated matrix:
where each \(\mathbf {X}_{ji}\) represents a dataset from omics platform \(j,j=1,\dots , J\) and patient cohort \(i, i=1,\dots , I\). The BIDIFAC+ factorization is:
\(\mathbf {S}_{\cdot \cdot }^{(k)}\) is a low-rank module which corresponds to structured variation that exists on a subset of the J omics sources and I patient groups and \(\mathbf {E}_{\cdot \cdot }\) represents random noise. The submatrix \(\mathbf {S}_{ji}^{(k)}\) of \(\mathbf {S}_{\cdot \cdot }^{(k)}\) is nonzero if the kth low-rank module explains variability in omics source j and cancer i, and is entirely 0 otherwise. The number of low-rank modules, K, is either set to \(K = (2^I-1)(2^J-1)\) to enumerate all combinations of omics platforms and cohorts, or may be chosen such that the modules explain a predetermined amount of variability in the data.
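For intuition on the module count, the enumeration formula can be checked directly; this is a small illustrative sketch (the function name is ours, not from the paper's code):

```python
# Number of low-rank modules when enumerating every nonempty subset of the
# J omics sources crossed with every nonempty subset of the I cohorts:
# K = (2^I - 1) * (2^J - 1).
def n_modules(I: int, J: int) -> int:
    return (2**I - 1) * (2**J - 1)

# Two cohorts and two omics platforms already give 9 modules; full
# enumeration quickly becomes infeasible as I grows, which is why a
# variance-explained criterion is used instead for large I.
print(n_modules(2, 2))   # 9
print(n_modules(29, 4))  # enumeration over 29 cancers and 4 platforms
```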
To derive predictors from these low-rank modules, we obtain the sample scores via the SVD of each \(\mathbf {S}_{\cdot \cdot }^{(k)}\). These sample scores reflect how the identified multi-omic patterns are expressed in samples across patient cohorts. To understand the biological relevance of these multi-omic patterns, we may investigate the loadings from this SVD, which map the latent patterns to the observed feature space. Investigating the loadings reveals the observed molecular features that are most relevant to the identified multi-omic pattern.
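The score and loading computation can be sketched for a single toy module; all names and dimensions here are illustrative, not taken from the paper's software:

```python
import numpy as np

rng = np.random.default_rng(0)
# Toy low-rank module S (features x samples), rank 2, standing in for one
# BIDIFAC+ module S^{(k)}.
S = rng.normal(size=(100, 2)) @ rng.normal(size=(2, 40))

U, d, Vt = np.linalg.svd(S, full_matrices=False)
r = int(np.sum(d > 1e-8))            # numerical rank of the module
scores = d[:r, None] * Vt[:r, :]     # singular value x right singular vector: sample scores
loadings = U[:, :r]                  # map latent patterns back to observed features

# scores[c] gives, for component c, one predictor value per sample,
# and loadings @ scores reconstructs the module.
print(scores.shape)    # (2, 40)
```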
Spike-and-slab priors
We now introduce the spikeandslab prior in its general form. Consider the ordinary linear model for an outcome \(y_i\) given covariates \(\{X_{i\ell }\}_{\ell =1}^L\),
for subjects \(i=1,\ldots ,I\). The classical spikeandslab model considered by George and McCulloch [16] imposes the following prior on the coefficients \(\beta _\ell\):
where \(\tau _\ell ^2\) is chosen to be small and \(c_\ell ^2\) is chosen to be large. The indicator \(\gamma _\ell\) reflects from which distribution \(\beta _\ell\) is generated: if \(\gamma _\ell =1\), \(\beta _\ell\) is generated from the slab, \(\hbox {N}(0, c_\ell ^2 \tau _\ell ^2)\), and if \(\gamma _\ell =0\), \(\beta _\ell\) is generated from the spike, \(\hbox {N}(0, \tau _\ell ^2)\). Practically, \(\gamma _\ell\) indicates whether covariate \(\ell\) has a nonnegligible contribution to the predictive model. The prior encourages sparsity via the spike, and shrinks coefficients under the slab towards zero. Uncertainty in model selection is easy to interpret via the posterior probabilities of each \(\gamma _\ell\).
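As a quick illustration of this two-component mixture, one can simulate draws from the prior; the hyperparameter values below are arbitrary, chosen only to make the spike and slab visibly different:

```python
import numpy as np

rng = np.random.default_rng(1)
L, tau2, c2, p = 1000, 0.01, 100.0, 0.2   # illustrative hyperparameters

# gamma_l ~ Bernoulli(p); beta_l | gamma_l = 1 ~ N(0, c2 * tau2) (the slab),
# beta_l | gamma_l = 0 ~ N(0, tau2) (the spike).
gamma = rng.binomial(1, p, size=L)
sd = np.where(gamma == 1, np.sqrt(c2 * tau2), np.sqrt(tau2))
beta = rng.normal(0.0, sd)

# Coefficients drawn from the spike are tightly concentrated near zero,
# while slab draws are diffuse.
print(gamma.sum())   # number of coefficients drawn from the slab
```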
Hierarchical extensions
Now, assume the data are grouped or clustered, e.g., by genetic strain or cancer type. We index each group by i, \(i=1,\dots , I\) and index subjects within each group by j, \(j=1,\dots , n_i\) where \(n_i\) is the sample size for group i. Consider L covariates \(\{X_{1}, X_{2}, \dots , X_{L}\}\), where a subset of the L covariates is available for each group. Let \(S_i = \{\ell : X_\ell \text { exists for group } i \}\) be the indices for covariates measured on group i. Let \(y_{ij}\) be the response for the jth subject in the ith group, \(j = 1, \dots , n_i\), \(i = 1, \dots , I\). Specify a linear model for \(y_{ij}\) as follows:
where \(\epsilon _{ij}\) are iid random variables such that \(\mathbb {E}(\epsilon _{ij}) = 0\) and \(\hbox {Var}(\epsilon _{ij}) = \sigma ^2\). This framework not only allows for covariate sets to differ between groups, but also allows the effect of each predictor to vary by group, where the partial effect of predictor \(\ell\) for group i is given by \(\beta _{i\ell }\). We allow for the possibility that a predictor may have no effect on group i’s outcome through the use of spike-and-slab variable selection. We extend George and McCulloch’s implementation of a spike-and-slab prior (4) by inferring the distribution of the slab hierarchically (with a possibly nonzero mean) while allowing for differential inclusion across groups. The hierarchical structure is also extended to the inclusion probabilities. We define our spike-and-slab prior as follows:
where \(\ell = 1,\dots , L\) and \(z^2\) is chosen to be very small. Here, \(\gamma _{i\ell }\) is an inclusion indicator that reflects whether the coefficient for covariate \(\ell\) comes from the spike or the slab distribution for group i. If \(\gamma _{i\ell } = 1\), then \(\beta _{i\ell }\) is generated from the slab, \(\hbox {Normal}(\tilde{\beta }_\ell , \lambda ^2_\ell )\), and if \(\gamma _{i\ell } = 0\) then \(\beta _{i\ell }\) is generated from the spike, \(\hbox {Normal}\left( 0, z^2\right)\). Data from clusters for which covariate \(\ell\) is generated from the slab are used to infer the mean \(\tilde{\beta }_\ell\) and variance \(\lambda ^2_\ell\) of the slab distribution. This may increase our power to infer covariate \(\ell\)’s effect if the groups provide concordant information. We apply a Beta prior to the inclusion probability \(\pi _\ell\) for covariate \(\ell\), completing a fully Bayesian framework. A unique inclusion probability for each predictor induces correlation between selected predictors across the I groups. Consequently, inference on \(\pi _\ell\) reflects the proportion of groups for which covariate \(\ell\) has predictive power.
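A generative sketch of this hierarchy is below, with made-up hyperparameters; for brevity we fix \(\lambda^2_\ell\) at a constant rather than placing a prior on it:

```python
import numpy as np

rng = np.random.default_rng(2)
I, L, z2 = 5, 8, 1e-4   # groups, covariates, spike variance (illustrative)

# For each covariate l:
#   pi_l ~ Beta(1, 1)              shared inclusion probability across groups
#   beta_tilde_l ~ N(0, 1)         slab mean (lambda2_l fixed here)
# For each group i:
#   gamma_il ~ Bernoulli(pi_l)
#   beta_il ~ N(beta_tilde_l, lambda2_l) if gamma_il = 1, else N(0, z2)
pi = rng.beta(1, 1, size=L)
beta_tilde = rng.normal(0.0, 1.0, size=L)
lambda2 = np.full(L, 0.25)

gamma = rng.binomial(1, pi, size=(I, L))
beta = np.where(gamma == 1,
                rng.normal(beta_tilde, np.sqrt(lambda2), size=(I, L)),
                rng.normal(0.0, np.sqrt(z2), size=(I, L)))
# Columns of gamma with many 1s mimic covariates predictive in most cancers.
```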
Extensions to survival data
We now outline our full hierarchical spike-and-slab survival model with the likelihood and hyperparameters we use in our data application and simulations. Let \(y_{ij}\) be the survival time, which may or may not be observed due to censoring, for the jth subject in the ith group, \(j = 1, \dots , n_i\), \(i = 1, \dots , I\). Let \(S_i = \{ \ell : X_\ell \text { exists for group } i \}\) be the set of covariate indices available for group i. Groups i and \(i'\), \(i \ne i'\), need not share all of the same covariates. Further, define
where \(y_{ij}^c\) is the censoring time for the jth subject in the ith cancer type. Then, our hierarchical spike-and-slab model is
where \(\ell \in S_i\) for the ith cancer type and j indexes the subject within the ith cancer type. We selected a log-normal likelihood because previous work demonstrated it outperforms other parametric models for TCGA pan-cancer survival [13, 14]. The priors for \(\tilde{\beta }_0\), \(\lambda ^2_0\), \(\tilde{\beta }_\ell\), and \(\lambda ^2_\ell\) were chosen to be sufficiently uninformative and to match the scale of the data, though our later data application results in the “Application to pan-cancer, pan-omics data” section appeared to be insensitive to the choice of hyperparameters in these priors. The spike variance was arbitrarily set at \(\frac{1}{10000}\) and results were not sensitive to this choice. We implement our model using an in-house Gibbs sampling algorithm, in which the unobserved outcomes \(y_{ij}\) are simulated from their full conditional distribution when the observation is censored (\(y_{ij}^*=y_{ij}^c)\). All full conditional distributions for the censored survival model used for our data application in the “Application to pan-cancer, pan-omics data” section are provided in Additional file 1: Appendix A.
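The Gibbs step for a censored observation can be sketched as a truncated-normal draw on the log scale: under the log-normal likelihood, the latent log survival time has full conditional \(\hbox {N}(\mu, \sigma^2)\) truncated below at the observed log censoring time. This is an illustrative reimplementation (function name and values are ours), not the paper's in-house sampler:

```python
import numpy as np
from statistics import NormalDist

def impute_censored(mu, sigma, log_censor, rng):
    """One Gibbs update for a censored observation in a log-normal survival
    model: draw the latent log survival time from N(mu, sigma^2) truncated
    below at the observed log censoring time, via the inverse CDF."""
    nd = NormalDist(mu, sigma)
    lo = nd.cdf(log_censor)      # probability mass below the censoring time
    u = rng.uniform(lo, 1.0)     # uniform draw restricted to the surviving tail
    return nd.inv_cdf(u)         # inverse-CDF sample from the truncated normal

rng = np.random.default_rng(3)
draw = impute_censored(mu=2.0, sigma=0.5, log_censor=2.5, rng=rng)
print(draw)   # always at or above the log censoring time of 2.5
```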
Posterior predictive likelihood validation
We assess our hierarchical spike-and-slab approach and related models via 5-fold cross validation of the log-posterior predictive likelihood. The log-posterior predictive likelihood is a measure of the predictive accuracy of the model. We use this measure in a cross validation framework as a holistic way to assess, on an independent test set, the predictive ability of a model fit to training data. The log-posterior predictive likelihood is defined as follows. For \(k=1,\dots , 5\), consider the kth training-test set split, \(\vec {Y} = \{\vec {Y}^{\text {train}}_k, \vec {Y}^{\text {test}}_k \}\). Let \(p(y \mid \Theta _0, X)\) be the log-normal probability density for survival time, given all model parameters \(\Theta _0\) and covariates X. Define \(X^{\text {train}}_k\) and \(X^{\text {test}}_k\) as the training and test set splits of the covariates corresponding to the kth fold’s split of \(\vec {Y}\). On each training fold, we fit the model and generated posterior samples for each parameter. For each posterior sample t after burn-in and thinning, we computed
where \(\Theta _0^t\) is the vector of tth-iteration posterior samples for the parameters of the probability distribution of survival, based on the kth fold of the training data. After computing this quantity for each iteration, we computed an estimate of the out-of-sample posterior predictive likelihood:
where T is the number of sampling iterations after burn-in and thinning. The log-posterior likelihood measures how well a model fits the observed data, with a higher value indicating better fit. After running each model on the training fold and computing the log-posterior predictive likelihood on the corresponding test fold, we took the average of each model’s log-posterior likelihoods to determine which framework provided the best fit.
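One numerically stable way to compute such an estimate from saved posterior draws is a log-mean-exp over iterations. The sketch below assumes a matrix of per-draw, per-observation log densities \(\log p(y_j \mid \Theta_0^t)\) and may differ in detail from the authors' exact averaging:

```python
import numpy as np

def log_posterior_predictive(loglik_matrix):
    """Estimate the out-of-sample log posterior predictive likelihood from a
    (T x n_test) matrix of per-draw, per-observation log densities.
    For each held-out observation, average the predictive density over the
    T posterior draws via a stabilized log-mean-exp, then sum over the
    test set."""
    m = loglik_matrix.max(axis=0)                          # stabilize the exponent
    lme = m + np.log(np.exp(loglik_matrix - m).mean(axis=0))
    return lme.sum()

# Toy check: with identical draws, the estimator reduces to the plain sum
# of the per-observation log densities.
ll = np.full((4, 3), -1.0)
print(log_posterior_predictive(ll))   # -3.0
```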
Results
Application to pan-cancer, pan-omics data
Here we describe the application of the hierarchical spike-and-slab model to TCGA data to characterize the clinical relevance of underlying genomic components. To do so, we model patient OS because it is clearly defined, clinically important, and available for most subjects [27]. The model predictors are derived from applying BIDIFAC+ [6] to TCGA pan-omics, pan-cancer data as explained below. Additional methodological details are provided in the “Methods” section and the Additional file.
Our data were originally curated for use in Hoadley et al.'s [28] pan-cancer clustering analysis. These data consisted of 29 cancer types and 4 omics platforms. The cancer types are primarily defined by their tissue-of-origin, and we denote each type by its TCGA study abbreviation, e.g., BRCA for breast invasive carcinoma and ESCA for esophageal carcinoma. The omics platforms include (1) RNA-Seq data for 20,531 genes, (2) miRNA-Seq data for 743 miRNAs, (3) DNA methylation levels for 22,601 CpG sites, and (4) reverse-phase protein array data for 198 proteins. BIDIFAC+ decomposes the data into a sum of low-rank modules, each corresponding to structured variation that exists on a subset of the 4 omics platforms and the 29 cancers. Using this method, [6] identified 50 low-rank modules from which we derived predictors for our model.
We obtained predictors from the BIDIFAC+ results by computing the singular value decomposition (SVD) of each low-rank module to identify underlying components (analogous to principal components) that are specific to a subset of omics platforms and cancer types. For each module’s SVD, we took the product of each singular value with its corresponding right singular vector. This product gives us the component scores for each subject, which will serve as predictors in our survival model. In this context, the BIDIFAC+ components are assumed to be independent and roughly orthogonal. Since BIDIFAC+ can produce components that explain negligible variation in the data, we did not want to consider these as possible predictors in our predictive model. We would not expect these components to explain much variability in OS, and they would introduce unnecessary noise. To ensure we consider predictors with the highest likelihood of explaining variation in survival, we developed selection criteria that precede our modeling step. Our inclusion criteria were as follows:

1. Include the first component from the SVD of each low-rank module. This component explains the most variation within each module.
2. Include any other components whose ratio of eigenvalue (squared singular value) to total variability in the original multi-source, multi-cancer data was greater than 0.01. This amounts to selecting predictors that explain at least 1% of the variation in the original pan-omic, pan-cancer data.
The 0.01 threshold could be adjusted in future studies but it yielded a manageable number of possible model predictors for our purposes. In sum, we considered 66 components derived from the 50 modules as predictors of OS. We refer to each of these predictors by the module from which it was derived and the index of its corresponding right singular vector from the module’s SVD, e.g., predictor 5.1 is the first component from module 5.
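The two inclusion rules above can be sketched as a small filter over each module's singular values; the names are illustrative, and `thresh` mirrors the paper's 0.01 cutoff:

```python
import numpy as np

def select_components(singular_values_per_module, total_variation, thresh=0.01):
    """Apply the two inclusion rules to each module's singular values:
    always keep the first component, and additionally keep any component
    whose eigenvalue (squared singular value) exceeds `thresh` of the total
    variation in the original data. Returns (module, component) index pairs."""
    selected = []
    for m, d in enumerate(singular_values_per_module, start=1):
        for c, s in enumerate(np.asarray(d), start=1):
            if c == 1 or s**2 / total_variation > thresh:
                selected.append((m, c))
    return selected

# Toy example: two modules; total variation 100, so the cutoff is an
# eigenvalue greater than 1.
print(select_components([[3.0, 1.5, 0.2], [0.5]], total_variation=100.0))
# [(1, 1), (1, 2), (2, 1)]
```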
To complete our model, we also included a model intercept for each cancer and patient age at the time of diagnosis as a predictor. In our previous work, we showed that age has a strong effect on overall survival in 27 of the 29 cancers considered here [13]. We standardized all predictors to have mean 0 and standard deviation 1 to facilitate comparisons of covariate effects on survival.
We obtained clinical data from the TCGA Clinical Data Resource (TCGACDR) [27]. Before running any analyses, we removed subjects who were missing both a survival time and a censoring time, removed subjects who had survival times that were negative or zero, and removed subjects missing a value for age. After filtering, we retained 6856 subjects across 29 cancer types with data from the 4 omics platforms.
We first assessed which of the following model frameworks provided the best predictive performance on the TCGA data factorized by BIDIFAC+. In parentheses, we give the name by which each model is henceforth referred.

1. A hierarchical spike-and-slab model, our proposed model, described in the “Extensions to survival data” section (“hierarchical”)
2. A model with only a random intercept for each cancer type and no covariates (“null model”)
3. A hierarchical model with no spike-and-slab component, in which all covariates are included (“full model”)
4. A hierarchical model with a spike-and-slab component and prior inclusion probabilities fixed at 0.5 [“fixed (0.5)”]
5. A hierarchical spike-and-slab model where a single inclusion probability, \(\pi\), is shared for all covariates and all cancer types (as opposed to inferring an inclusion probability, \(\pi _\ell\), for each covariate) with a uniform prior \(\pi \sim \text{ Beta }(1,1)\) (“shared model”)
6. A joint model in which the proposed hierarchical spike-and-slab model is applied to the 29 cancer types appended together row-wise. This model treats all cancer types as one cancer (“joint model”)
7. The proposed hierarchical spike-and-slab model applied to each of the 29 cancer types separately. The model is fit to each cancer type assuming the cancers are independent of one another (“separate model”)
8. The hierarchical horseshoe accelerated failure time model (“hsaft”) proposed by [14]. We implemented this model using the PanCanVarSel R package provided by the authors [29].
We compared how these models fit the data using 5-fold cross validation of the log-posterior predictive likelihood, as described in the “Posterior predictive likelihood validation” section, averaging each model’s test-fold log-posterior likelihoods to determine which framework provided the best fit.
Our model selection results can be found in Table 1. While the hierarchical and shared models were quite close in fit and predictive accuracy for overall patient survival, we proceeded with the proposed hierarchical model for the rest of our analysis.
We ran the hierarchical spike-and-slab model on the factorized TCGA data for 100,000 iterations with a 50,000-iteration burn-in and 10-iteration thinning. Multiple runs of the model with different initial values gave similar results, suggesting that convergence was satisfactory. We display the variable selection results in Fig. 2 via a heatmap of the posterior inclusion probabilities; an interactive version of this heatmap with links to additional plots displaying components with high posterior probabilities is available at www.ericfrazerlock.com/PanTCGA_Inclusion_Heatmap.pdf. The posterior inclusion probability for covariate \(\ell\) in cancer i is the average of its inclusion indicators generated by our model after burn-in and thinning. Age was included for every cancer type with uniformly high probability, while the inclusion of pan-omic components was comparatively sparse. BIDIFAC+ predictors that capture molecular variation across all or most cancer types were mostly not included by our model. However, certain BIDIFAC+ predictors were identified as predictive of patient survival with high probability; these are summarized in Table 2, ordered by descending posterior inclusion probability. In total, our hierarchical spike-and-slab model selected 24 BIDIFAC+ components across 17 cancer types, based on a posterior inclusion probability above 0.5.
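Computing a posterior inclusion probability from saved indicator draws is a simple average after discarding the burn-in and thinning the chain; a minimal sketch with made-up dimensions:

```python
import numpy as np

def posterior_inclusion_prob(gamma_draws, burnin, thin):
    """Posterior inclusion probabilities from saved indicator draws: drop the
    burn-in iterations, thin the remainder, then average the binary
    indicators gamma_{il}. `gamma_draws` has shape (n_iter, I, L)."""
    kept = gamma_draws[burnin::thin]
    return kept.mean(axis=0)

# Toy chain: 6 saved draws for I = 1 cancer and L = 1 covariate.
g = np.array([1, 1, 0, 1, 0, 1]).reshape(6, 1, 1)
print(posterior_inclusion_prob(g, burnin=2, thin=1))   # [[0.5]]
```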
Note that the sign of the effects and credible intervals are not immediately interpretable because the identified components (given by singular vectors of an SVD) are uniquely defined only up to their sign. However, we can interpret the scale of the effect. Motivated by these results, we chose to investigate more deeply the components included for uterine corpus endometrial carcinoma (UCEC), brain lower grade glioma (LGG), kidney renal papillary cell carcinoma (KIRP), kidney renal clear cell carcinoma (KIRC), and kidney chromophobe (KICH) to understand the clinical relevance of the pan-omic components selected. In what follows, we describe our investigation into the clinical significance of these components. While the figures and discussion describe the marginal effects of components, bear in mind that because our model is multivariate the identification of a component’s predictive power for survival is relative to the information contained in other model predictors. We describe here our investigation into components 16.1 for UCEC, 7.2 for LGG, and 11.1 for KIRP and KIRC; we explore the inclusion of additional components for UCEC, LGG, and KICH in the Additional file 1: Appendix B.
BIDIFAC+ component 16.1 was identified as predictive of survival by the model with near certainty (Table 2). We investigated whether this component was associated with UCEC’s three histological subtypes: endometrioid, serous, and mixed serous and endometrioid [30]. We examined this using histological labels provided in TCGA-CDR [27]. Based on the kernel density estimation (KDE) graph shown in Fig. 3a, the three UCEC histological subtypes cluster distinctly along component 16.1. This suggests that this pattern of variation is primarily driven by distinctions between the three types of UCEC tumors. The Kaplan–Meier survival figure provided in Fig. 3b shows divergent survival outcomes for the three subtypes. [30] found that serous and serous-like tumors show extensive somatic copy number alterations (SCNAs), while endometrioid tumors do so to a lesser degree, and observed that SCNAs roughly correlated with progression-free survival. While we modeled OS, this may be an underlying latent variable.
BIDIFAC+ component 7.2 was identified as predictive of survival in LGG subjects with a posterior probability of 0.92. We considered its association with the mutation status of the genes IDH1 and IDH2 and deletion status in chromosome arms 1p and 19q (1p/19q codeletion) using data from [31]. Mutations in these genes define most cases of LGG and contribute to an LGG subtype associated with better survival [31]. Subjects with wild-type IDH (no IDH mutation) clustered distinctly along component 7.2 (Fig. 4a), and the Kaplan–Meier curves in Fig. 4b display divergent survival patterns among the three IDH mutation groups. This suggests that this pattern of variation is linked to IDH mutation status, which correlates with patient survival.
Lastly, component 11.1 was associated with survival for KIRP and KIRC subjects. Using the classification scheme from TCGA’s pan-renal project, samples were documented as either KIRP or KIRC, with KIRP further subdivided into type I, type II, and CIMP; any KIRP patients who did not fit into these categories were left unclassified. CIMP refers to a CpG island methylator phenotype [32], and types I and II are characterized by specific genetic mutations [33]. The CIMP subgroup is known to have the poorest survival of all renal cancers [32], which prompted us to examine whether this pattern of variation captured this distinction using data from [32]. Figure 5a shows that KIRP subjects classified as CIMP cluster separately from the other three subgroups. Kaplan–Meier curves emphasize the stark survival difference between CIMP and the remaining KIRP and KIRC subjects; KIRC also shows poorer survival compared to KIRP types I, II, and unclassified subjects (Fig. 5b). While component 11.1 appears to be associated with KIRP clinical subtypes (e.g., CIMP), it is unclear to which characteristics of KIRC this predictor is linked. The presence and clinical relevance of the CIMP phenotype have been well studied for KIRP, and our analysis suggests that similar, clinically relevant distinctions exist within KIRC.
Simulation study
We now present a simulation study comparing different approaches to hierarchical variable selection with the spike-and-slab prior. The primary goal of the study was to characterize how modifications to the hierarchical variable selection component of our proposed model perform under different data-generating schemes, specifically in the context of our data application. We compared our proposed model under various data-generating schemes to the seven other modeling frameworks considered in the “Application to pan-cancer, pan-omics data” section.
We designed the data-generating schemes to mimic our TCGA data application in the “Application to pan-cancer, pan-omics data” section by generating groups of the same sample size, with the same number of covariates for each group, and by randomly right-censoring subjects. The degree of overlap across groups for each covariate matches that in our data application, which allows for covariates shared across all groups, covariates shared across subsets of groups, and covariates present in only one group. Each model assumed a log-normal outcome, and approximately 50% of subjects were censored.
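As a concrete illustration, right-censored log-normal survival data for one group could be generated as in the following Python sketch. The function name, the censoring mechanism (censoring times drawn around a quantile of the log survival times to hit roughly the target censoring rate), and the parameter values are our own illustrative choices, not the paper's exact scheme:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_group(n, X, beta, sigma=1.0, censor_rate=0.5):
    """Simulate right-censored log-normal survival times for one group.

    X: (n, p) covariate matrix; beta: (p,) regression effects.
    Returns observed times y and event indicators delta (1 = event observed).
    """
    log_t = X @ beta + rng.normal(0.0, sigma, size=n)  # log survival times
    t = np.exp(log_t)
    # draw censoring times centered so that roughly `censor_rate`
    # of subjects end up censored (an approximation, not exact)
    c = np.exp(rng.normal(np.quantile(log_t, 1 - censor_rate), sigma, size=n))
    y = np.minimum(t, c)            # observed time: min of event and censoring
    delta = (t <= c).astype(int)    # 1 if the event was observed
    return y, delta

n, p = 100, 5
X = rng.normal(size=(n, p))
beta = rng.normal(size=p)
y, delta = simulate_group(n, X, beta)
```

In the actual study this would be repeated per group, with group-specific covariate sets and shared or group-specific effects as dictated by the scenario.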
We considered six data-generating scenarios:

1. Each covariate is included for all groups for which it is available with probability 0.5, or excluded for all such groups with probability 0.5.
2. Each covariate is included for all groups for which it is available with probability 0.1, or excluded for all such groups with probability 0.9.
3. Each covariate is included independently for each group with probability 0.5, i.e., no true hierarchical structure.
4. Each covariate is included independently for each group with probability 0.1.
5. All covariates are included in the model.
6. All covariates are excluded from the model.
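The distinction between the all-or-nothing mechanisms (scenarios 1 and 2) and the independent per-group mechanisms (scenarios 3 and 4) can be sketched as follows. This is a hypothetical helper, not the paper's generating code; `avail` marks which groups have a given covariate:

```python
import numpy as np

rng = np.random.default_rng(1)

def draw_inclusion(avail, scenario):
    """Draw true inclusion indicators for one covariate across groups.

    avail: boolean array; avail[i] is True if group i has this covariate.
    Returns gamma, with gamma[i] = 1 if the covariate is truly included
    (nonzero effect) for group i.
    """
    gamma = np.zeros(avail.shape, dtype=int)
    if scenario in (1, 2):       # all-or-nothing across available groups
        p = 0.5 if scenario == 1 else 0.1
        gamma[avail] = int(rng.random() < p)
    elif scenario in (3, 4):     # independent draw for each group
        p = 0.5 if scenario == 3 else 0.1
        gamma[avail] = (rng.random(avail.sum()) < p).astype(int)
    elif scenario == 5:          # all covariates included
        gamma[avail] = 1
    # scenario 6: all covariates excluded, so gamma stays 0
    return gamma
```

Under scenarios 1 and 2 the groups carry concordant information about each covariate, which is exactly the setting where a hierarchical prior that borrows strength should help.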
We compared the performance of the eight models using two metrics: the mean sum-of-squared deviations (SSD) between the true inclusion indicator and the posterior inclusion probability estimated by each model, and the log-posterior predictive likelihood defined in the “Posterior predictive likelihood validation” section. The mean SSD measures variable selection accuracy, while the log-posterior predictive likelihood measures predictive accuracy.
We now define the mean sum-of-squared deviations. Let \(\gamma _{i\ell }\) denote the true inclusion indicator for covariate \(\ell\) available for group i, and let \(\hat{\gamma }_{i\ell }\) be the corresponding posterior inclusion probability estimated by the model; \(\hat{\gamma }_{i\ell }\) is computed by averaging the inclusion indicators from each model iteration after burn-in and thinning. The mean SSD for model k, \(k=1,\dots , 8\), is

$$\text {SSD}_k = \frac{1}{M} \sum _{i=1}^{29} \sum _{\ell =1}^{S_i} \left( \gamma _{i\ell } - \hat{\gamma }_{i\ell } \right) ^2,$$

where \(S_i\) is the number of covariates available for group i and \(M = \sum _{i=1}^{29} S_i\) is the total number of regression coefficients in the model. A lower mean SSD reflects better variable selection accuracy. We averaged mean SSDs across simulation replications. We do not compute the SSD metric for the hsaft model because the horseshoe prior does not inherently model a probability that a given predictor is included or excluded.
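A minimal sketch of this metric in Python, assuming per-group arrays of true indicators and posterior inclusion probabilities (the function and variable names are illustrative):

```python
import numpy as np

def mean_ssd(gamma_true, gamma_hat):
    """Mean sum-of-squared deviations between true inclusion indicators
    and posterior inclusion probabilities.

    gamma_true, gamma_hat: lists with one array per group, since groups
    may have different numbers of available covariates.
    """
    sq = sum(np.sum((g - gh) ** 2) for g, gh in zip(gamma_true, gamma_hat))
    M = sum(g.size for g in gamma_true)  # total number of coefficients
    return sq / M
```

Here each entry of `gamma_hat` would be the mean of the sampled inclusion indicators retained after burn-in and thinning.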
To compare all eight models based on their log-posterior predictive likelihoods, we generated a training data set under each data-generating condition. After fitting each model to the training data, we computed the log-posterior predictive likelihood on a test data set generated under the same condition and with the same true parameter values as the training data. We averaged the resulting log-posterior predictive likelihoods across simulation replications.
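One plausible form of this computation under a log-normal model with right censoring is sketched below. The paper's exact definition appears in its “Posterior predictive likelihood validation” section, which is not reproduced here, so the function and argument names are our own illustrative choices: observed events contribute the log-density of the observed time, censored subjects contribute the log survival probability, and the per-draw likelihoods are averaged on the log scale over posterior samples.

```python
import numpy as np
from scipy.special import logsumexp
from scipy.stats import norm

def log_post_pred(y, delta, eta_draws, sigma_draws):
    """Log posterior predictive likelihood of right-censored test data
    under a log-normal survival model.

    y: observed times; delta: 1 = event observed, 0 = censored.
    eta_draws: (T, n) posterior draws of each subject's linear predictor;
    sigma_draws: (T,) posterior draws of the error standard deviation.
    """
    z = np.log(y)                  # work on the log-time scale
    T = eta_draws.shape[0]
    ll = np.empty(T)
    for t in range(T):
        # log-normal log-density of y and log survival probability
        lpdf = norm.logpdf(z, eta_draws[t], sigma_draws[t]) - z
        lsf = norm.logsf(z, eta_draws[t], sigma_draws[t])
        ll[t] = np.sum(np.where(delta == 1, lpdf, lsf))
    # log of the Monte Carlo average of per-draw predictive likelihoods
    return logsumexp(ll) - np.log(T)
```

Higher values indicate that the fitted model assigns greater predictive probability to the held-out test data.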
We designed our simulation as follows. For each of 100 replications:

1. Run each of the eight considered models under the six data-generating conditions.
2. For each condition, generate the data accordingly and run each model for 10,000 iterations.
3. After a 5000-iteration burn-in and 10-iteration thinning, compute the mean SSD.
4. Generate a test data set under the same data-generating conditions. Using the posterior samples retained in step 3, compute the log-posterior predictive likelihood.
5. At the end of the simulation, average across all simulation replications.
We used 100 replications to ensure consistent results when the simulation study was repeated. The resulting mean SSDs and log-posterior predictive likelihoods are shown in Tables 3 and 4, respectively, where we bold the best-performing model. In both tables, we use pairwise t-tests as a simple way to assess whether the observed differences in performance across the simulation replications are statistically significant. If the performance of two models was not significantly different under a particular data-generating condition, they were considered to have performed equally well.
The proposed hierarchical spike-and-slab model performs best under conditions (1) and (2) because each group affords concordant information about each predictor; in this case, it is beneficial to borrow strength across groups when estimating the prior inclusion probability. Conditions (3) through (6) are not specifically suited to the proposed model, but its mean SSDs and posterior likelihoods show that it remains competitive. Under condition (3), when the prior inclusion probability was 0.5 for all covariates, the fixed (0.5) model naturally performs best. The shared model performs well under both (3) and (4) because there was no added benefit to estimating the prior inclusion probability separately for each covariate. Conditions (5) and (6) were the most extreme, under which all or none of the covariates were included, respectively; under these conditions, the full model (a hierarchical model without a spike-and-slab component) and the null model perform best, respectively. As in (3) and (4), the shared model is competitive because there was no added advantage to estimating a prior inclusion probability for each covariate; however, our proposed model remained comparable in selection and prediction accuracy. Under conditions (1), (2), (5), and (6), the joint model performs variable selection comparably well because the groups do not differ in their inclusion patterns; however, its predictive performance across all conditions is markedly poorer. The hsaft model offers relatively competitive predictive performance under all conditions, despite the generative model not matching its assumptions.
This simulation study demonstrates that the hierarchical spike-and-slab prior described in the “Methods” section is competitive in its ability to correctly identify which covariates to include under the data-generating scenarios considered here. While it offers the flexibility to estimate a different inclusion probability for each covariate and to pool information across groups, it also performs well when borrowing strength is not advantageous. Under each condition, the proposed model fit the test data well, exhibiting its strength in identifying a model with predictive power. While it may not be the optimal model under every condition, its performance was consistent with models that were tailored to perform well, making it a flexible option for an array of underlying data-generating mechanisms.
Discussion
In this article, we address prediction across multiple sources of data and multiple sample sets by using molecular patterns of variability identified by BIDIFAC+ in a Bayesian hierarchical model. Our model uses spike-and-slab priors that borrow information across groups in determining the model’s sparsity structure. This spike-and-slab framework increases power to detect covariate inclusion and covariate effects while remaining flexible enough to allow each group to have a different covariate set. It also induces correlation under the posterior between selected predictors. An additional advantage of this prior is the natural “inclusion/exclusion” interpretation that other shrinkage methods, like the horseshoe prior [20] or the Bayesian lasso [18], lack.
We apply this model to TCGA data, using BIDIFAC+ components from the integration of pan-omics, pan-cancer data as predictors for overall patient survival. Predictive modeling using these data expands upon the exploratory work of [6] and contributes to the body of research on prediction using multi-source, multi-sample-set data. Factorizing the pan-omic, pan-cancer data prior to predictive modeling is useful because the original genomic data are very high-dimensional, which presents issues of multicollinearity for modeling. Our model selected few of the components that explain variability across a large number of cancers. However, it did identify several molecular patterns within smaller subsets of cancer types that are strongly informative of survival, including clinically relevant molecular distinctions that have been previously established (e.g., subtypes within UCEC and LGG) and similar effects across cancer types that warrant further investigation (e.g., for the kidney cancers). In our context, we assumed BIDIFAC+ components were independent and orthogonal; however, this approach could be extended to incorporate correlation among the components if a correlation structure is known a priori. Other worthwhile future directions include considering different parametric assumptions for the survival model and relaxing the assumption that the error variance is shared across groups. Additionally, one could consider nonlinear models and generalized additive models in this context. Including other clinical covariates, like stage and grade, might elucidate the effect of BIDIFAC+ predictors on overall survival; however, stage and grade are not uniformly defined across cancer types, which presents a challenge for their use in pan-cancer clinical modeling. Alternatively, one may consider other clinical variables as the response, like progression-free survival, but such data are not as widely available for the TCGA cohort [27].
Patterns of variation identified by BIDIFAC+ on other omics sources, such as copy number variation, could also be considered as predictors.
We also present results from a simulation study evaluating the performance of modifications to the variable selection component of our proposed Bayesian model in our data application context. The goal of this study was to characterize the flexibility of our hierarchical spike-and-slab prior in fitting a diverse array of data-generating schemes that mimic our application’s group structure. The simulation study showed that this prior was competitive under all six data-generating conditions considered. This study could be expanded to compare the proposed model to other survival models, such as the proportional hazards model or models assuming different parametric survival distributions. Incorporating other Bayesian variable selection methods, like the horseshoe prior [20] and the Bayesian lasso [18], into these models would also be worthwhile for comparison.
For our TCGA application, we fit our model multiple times from random starting values and observed consistent results across these randomly initialized chains, suggesting that the sampler converges and appropriately covers models with high posterior probability. However, in other scenarios a potential drawback of the spike-and-slab approach is that, in the presence of many predictors, the posterior sampling algorithm may not cover all possible models. Another potential drawback of our approach is that it does not explicitly incorporate biological relationships between the omics features that are known a priori, and the associations captured are not necessarily causal. In general, more work can be devoted to devising variable selection methods that borrow information across grouped data. It would be valuable to evaluate and compare the performance of these extensions, in addition to the spike-and-slab model we discuss here, to characterize their relative advantages and disadvantages in a hierarchical setting.
Conclusion
We expand upon the exploratory results of bidimensional integration of multiple sources of data and multiple sample sets by using BIDIFAC+ components in a Bayesian hierarchical survival model with spike-and-slab priors. Our results show that molecular patterns of variability identified by BIDIFAC+ were predictive of survival in subsets of cancers available in TCGA and may offer insight into novel clinical subtypes. We also show that the spike-and-slab prior is a suitable option for Bayesian variable selection on grouped data under several different data-generating scenarios.
Availability of data and materials
The code and data generated and analysed during the current study are available in the HierarchicalSS_PanCanPanOmics repository, https://github.com/sarahsamorodnitsky/HierarchicalSS_PanCanPanOmics/.
Abbreviations
BIDIFAC: Bidimensional integrative factorization
TCGA: The Cancer Genome Atlas
mRNA: Messenger ribonucleic acid
miRNA: Micro ribonucleic acid
PCA: Principal components analysis
JIVE: Joint and individual variation explained
SLIDE: Structural learning and integrative decomposition
GIPCA: Generalized integrative principal components analysis
OS: Overall survival
BRCA: Breast cancer invasive carcinoma
ESCA: Esophageal carcinoma
DNA: Deoxyribonucleic acid
SVD: Singular value decomposition
TCGA-CDR: TCGA Clinical Data Resource
hsaft: Horseshoe shrinkage accelerated failure time
UCEC: Uterine corpus endometrial carcinoma
LGG: Brain lower grade glioma
KIRP: Kidney renal papillary cell carcinoma
KIRC: Kidney renal clear cell carcinoma
KICH: Kidney chromophobe
SSD: Sum of squared deviations
CpG: Cytosine-phosphate-guanine
CIMP: CpG island methylator phenotype
References
Hutter C, Zenklusen JC. The cancer genome atlas: creating lasting value beyond its data. Cell. 2018;173(2):283–5.
TCGA Research Network et al. Comprehensive molecular portraits of human breast tumors. Nature. 2012; 490(7418):61.
TCGA Research Network et al. Comprehensive molecular profiling of lung adenocarcinoma. Nature. 2014;511(7511):543.
Verhaak RG, Hoadley KA, Purdom E, Wang V, Qi Y, Wilkerson MD, Miller CR, Ding L, Golub T, Mesirov JP, et al. Integrated genomic analysis identifies clinically relevant subtypes of glioblastoma characterized by abnormalities in PDGFRA, IDH1, EGFR, and NF1. Cancer Cell. 2010;17(1):98–110.
Weinstein JN, Collisson EA, Mills GB, Shaw KRM, Ozenberger BA, Ellrott K, Shmulevich I, Sander C, Stuart JM, Network CGAR, et al. The cancer genome atlas pan-cancer analysis project. Nat Genet. 2013;45(10):1113–20.
Lock EF, Park JY, Hoadley KA. Bidimensional linked matrix factorization for pan-omics pan-cancer analysis. Ann Appl Stat. 2022;16(1):193–215. https://doi.org/10.1214/21-AOAS1495.
Massy WF. Principal components regression in exploratory statistical research. J Am Stat Assoc. 1965;60(309):234–56.
Bair E, Hastie T, Paul D, Tibshirani R. Prediction by supervised principal components. J Am Stat Assoc. 2006;101(473):119–37.
Lock EF, Hoadley KA, Marron J, Nobel AB. Joint and individual variation explained (JIVE) for integrated analysis of multiple data types. Ann Appl Stat. 2013;7(1):523.
Gaynanova I, Li G. Structural learning and integrative decomposition of multiview data. Biometrics. 2019;75(4):1121–32.
Zhu H, Li G, Lock EF. Generalized integrative principal component analysis for multitype data with blockwise missing structure. Biostatistics. 2020;21(2):302–18.
Kaplan A, Lock EF. Prediction with dimension reduction of multiple molecular data sources for patient survival. Cancer Inf. 2017;16:1–11.
Samorodnitsky S, Hoadley KA, Lock EF. A pan-cancer and polygenic Bayesian hierarchical model for the effect of somatic mutations on survival. Cancer Inf. 2020;19:1176935120907399.
Maity AK, Bhattacharya A, Mallick BK, Baladandayuthapani V. Bayesian data integration and variable selection for pan-cancer survival prediction using protein expression data. Biometrics. 2020;76(1):316–25.
Mitchell TJ, Beauchamp JJ. Bayesian variable selection in linear regression. J Am Stat Assoc. 1988;83(404):1023–32.
George EI, McCulloch RE. Variable selection via gibbs sampling. J Am Stat Assoc. 1993;88(423):881–9.
Kuo L, Mallick B. Variable selection for regression models. Sankhyā Indian J Stat Ser B. 1998;65–81.
Park T, Casella G. The Bayesian lasso. J Am Stat Assoc. 2008;103(482):681–6.
Li Q, Lin N, et al. The Bayesian elastic net. Bayesian Anal. 2010;5(1):151–70.
Carvalho CM, Polson NG, Scott JG. The horseshoe estimator for sparse signals. Biometrika. 2010;97(2):465–80.
Yang X, Narisetty NN, et al. Consistent group selection with Bayesian high dimensional modeling. Bayesian Anal. 2020;15(3):909–35.
Zhang L, Baladandayuthapani V, Mallick BK, Manyam GC, Thompson PA, Bondy ML, Do KA. Bayesian hierarchical structured variable selection methods with application to molecular inversion probe studies in breast cancer. J R Stat Soc Ser C. 2014;63(4):595–620.
Suo Y, Dao M, Tran T, Srinivas U, Monga V. Hierarchical sparse modeling using spike and slab priors. In: 2013 IEEE international conference on acoustics, speech and signal processing; 2013. pp. 3103–7. IEEE.
Mousavi HS, Srinivas U, Monga V, Suo Y, Dao M, Tran TD. Multitask image classification via collaborative, hierarchical spikeandslab priors. In: 2014 IEEE international conference on image processing (ICIP); 2014. pp. 4236–40. IEEE.
Lee KE, Mallick BK. Bayesian methods for variable selection in survival models with application to DNA microarray data. Sankhyā Indian J Stat. 2004;756–778.
Lee KE, Kim Y, Xu R. Bayesian variable selection under the proportional hazards mixedeffects model. Comput Stat Data Anal. 2014;75:53–65.
Liu J, Lichtenberg T, Hoadley KA, Poisson LM, Lazar AJ, Cherniack AD, Kovatich AJ, Benz CC, Levine DA, Lee AV, et al. An integrated TCGA pan-cancer clinical data resource to drive high-quality survival outcome analytics. Cell. 2018;173(2):400–16.
Hoadley KA, Yau C, Hinoue T, Wolf DM, Lazar AJ, Drill E, Shen R, Taylor AM, Cherniack AD, Thorsson V, et al. Cell-of-origin patterns dominate the molecular classification of 10,000 tumors from 33 types of cancer. Cell. 2018;173(2):291–304.
Maity AK, Bhattacharya A, Mallick BK, Baladandayuthapani V. Bayesian data integration and variable selection for pan-cancer survival prediction using protein expression data. R package version 0.0.3; 2019.
Levine DA, Network CGAR, et al. Integrated genomic characterization of endometrial carcinoma. Nature. 2013;497(7447):67–73.
TCGA Research Network. Comprehensive, integrative genomic analysis of diffuse lowergrade gliomas. New Engl J Med. 2015;372(26):2481–98.
Ricketts CJ, De Cubas AA, Fan H, Smith CC, Lang M, Reznik E, Bowlby R, Gibb EA, Akbani R, Beroukhim R, et al. The cancer genome atlas comprehensive molecular characterization of renal cell carcinoma. Cell Rep. 2018;23(1):313–26.
TCGA Research Network. Comprehensive molecular characterization of papillary renalcell carcinoma. New Engl J Med. 2016;374(2):135–45.
Acknowledgements
Not applicable.
Funding
This work was supported by the National Institutes of Health (NIH) National Cancer Institute (NCI) grant R21CA231214, and National Institute of General Medical Sciences (NIGMS) grant R01GM130622.
Author information
Authors and Affiliations
Contributions
SS ran all analyses and prepared the manuscript. KAH provided insightful perspective on the TCGA data and results. EFL conceived of the project idea and developed the model with SS. All read and edited the manuscript. All authors read and approved the final manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
None declared.
Additional information
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Additional file
Additional file 1.
This supplementary document provides additional details on the Gibbs sampling algorithm for posterior computation, further details on the TCGA data application, and an additional simulation study to validate the model and sampling algorithm.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Samorodnitsky, S., Hoadley, K.A. & Lock, E.F. A hierarchical spike-and-slab model for pan-cancer survival using pan-omic data. BMC Bioinformatics 23, 235 (2022). https://doi.org/10.1186/s12859-022-04770-3
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s12859-022-04770-3
Keywords
 Bayesian hierarchical modeling
 Bidimensionally-linked matrices
 Pan-omics
 Pan-cancer
 Spike-and-slab priors
 Survival analysis
 The Cancer Genome Atlas (TCGA)