Volume 11 Supplement 8
Proceedings of the Neural Information Processing Systems (NIPS) Workshop on Machine Learning in Computational Biology (MLCB)
Infinite mixture-of-experts model for sparse survival regression with application to breast cancer
 Sudhir Raman^{1},
 Thomas J Fuchs^{2, 3},
 Peter J Wild^{4},
 Edgar Dahl^{5},
 Joachim M Buhmann^{2, 3} and
 Volker Roth^{1}
DOI: 10.1186/1471-2105-11-S8-S8
© Raman et al; licensee BioMed Central Ltd. 2010
Published: 26 October 2010
Abstract
Background
We present an infinite mixture-of-experts model to find an unknown number of subgroups within a given patient cohort based on survival analysis. The effect of patient features on survival is modeled using Cox's proportional hazards model, which yields a non-standard regression component. The model is able to find key explanatory factors (chosen from main effects and higher-order interactions) for each subgroup by enforcing sparsity on the regression coefficients via the Bayesian Group-Lasso.
Results
Simulated examples justify the need for such an elaborate framework for identifying subgroups along with their key characteristics, compared with simpler models. When applied to a breast-cancer dataset consisting of survival times and protein expression levels of patients, the model identifies two distinct subgroups with different survival patterns (low-risk and high-risk), along with the respective sets of compound markers.
Conclusions
The unified framework presented here, combining elements of cluster and feature detection for survival analysis, is a powerful tool for analyzing survival patterns within a patient group. The model also demonstrates the feasibility of analyzing complex interactions, which can contribute to the definition of novel prognostic compound markers.
Background
h(t | x) = h_{0}(t) exp(x^{ T }β),

where h_{0}(t) is the baseline hazard function (the instantaneous risk of death at time t given survival up to time t), x is the vector of covariates and β is a vector of regression coefficients. In this paper, we focus on covariates which are categorical in nature, since this is a frequently encountered case in biological applications.
In the past, such models have been extended to a mixture model (mixture of survival experts) in order to find subgroups in data with respect to survival time, along with measuring the effect of covariates within each subgroup. In this context, Rosen and Tanner [2] define a finite mixture-of-experts (MOE) model by maximizing the partial likelihood for the regression coefficients and by using heuristics to resolve the number of experts in the model. A more recent attempt at this analysis, carried out by [3], uses a maximum likelihood approach to infer the parameters of the model and the Akaike information criterion (AIC) to determine the number of mixture components. A Bayesian version of the mixture model has been investigated by [4], which analyzes the model with respect to time but does not capture the effect of covariates. On the other hand, the work by [5] performs variable selection based on the covariates but ignores the clustering aspect of the modeling. Similarly, [6] defines an infinite mixture model but does not include a mixture of experts, hence assuming all the covariates to be generated from the same distribution, and also assumes a common shape parameter for the Weibull distribution.
In this paper, we unify the various important elements of this analysis into a Bayesian infinite mixture-of-experts (MOE) framework to model survival time, while capturing the effect of covariates and dealing with an unknown number of mixing components. The number of experts is inferred using a Dirichlet process prior on the mixing proportions, which overcomes the issue of deciding the number of mixture components beforehand [7]. The regression component, introduced via the proportional hazards model, is non-standard since the Weibull distribution is not part of the exponential family of distributions due to the lack of fixed-length sufficient statistics. Another novel feature of this framework is the addition of sparsity constraints on the regression coefficients β in order to determine the key explanatory factors (covariates) for each mixture component. Since the covariates are discrete in nature, each variable is transformed to a group of dummy variables, and sparsity is achieved by applying a Bayesian version of the Group-Lasso (as described in [8] and [9]), which is based on a sparse constraint for grouped coefficients [10]. We demonstrate the ability of the model to recover the right sparsity pattern with simulated examples. In a related work, [11] show sparsistency (sparse pattern consistency) of the lasso in the limit of large observations. The following sections describe all the components of this unified framework, with some results on a breast-cancer dataset.
Methods
In this section, we build up the overall model incrementally, starting with a regression model for survival analysis and then attaching a clustering model to it. This also highlights the incremental nature of the inference algorithm presented.
Bayesian survival regression
L(α_{ w }, λ_{ w } | t) = ∏_{i=1}^{N} f(t_{ i })^{δ_i} S(t_{ i })^{1 - δ_i},

where N is the number of observations, f and S denote the Weibull density and survival functions, and δ_{ i } = 0 when the i^{ th } observation is censored, 1 otherwise. Further, to model the effect of covariates x on the distribution over time, we apply Cox's proportional hazards model. Under this model, the covariates are assumed to have a multiplicative effect on the hazard function:
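The censored likelihood above can be sketched in a few lines of Python. This is a minimal illustration, not the paper's implementation; the Weibull is written in its scale/shape parameterization, so the correspondence with (α_w, λ_w) is an assumption.

```python
import math

def weibull_log_likelihood(times, deltas, shape, scale):
    """Censored Weibull log-likelihood: uncensored observations (delta = 1)
    contribute the log-density log f(t); censored ones (delta = 0) the
    log-survival log S(t)."""
    ll = 0.0
    for t, d in zip(times, deltas):
        log_surv = -((t / scale) ** shape)                  # log S(t)
        log_dens = (math.log(shape / scale)
                    + (shape - 1) * math.log(t / scale)
                    + log_surv)                             # log f(t)
        ll += d * log_dens + (1 - d) * log_surv
    return ll
```

A censored observation thus only tells the model that the patient survived at least until t, which is exactly how the δ_i enter the product above.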
h(t | x) = h_{0}(t) exp(f(x, β)), (6)
where h_{0}(t) is the baseline hazard function, x is the vector of covariates and β is a vector of regression coefficients. In our model, we assume the function f to be a linear predictor, i.e. f(x, β) = η = x^{ T }β. We also consider higher-order interactions (first-order: pairs of features; second-order: triplets of features) instead of modeling just the main effects (individual features). Further flexibility is added to the linear predictor by adding a random effect in the following manner:
η = x^{ T }β + ε, where ε ~ N(0, σ^{2}). (7)
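The interaction expansion mentioned above can be made concrete with a short sketch (illustrative only; the paper's design matrix is built from contrast-coded variables, which this sketch abstracts away): expanding 7 variables up to second-order interactions yields 7 + 21 + 35 = 63 terms, matching the simulation setup in the Results section.

```python
from itertools import combinations

def expand_interactions(x, max_order=3):
    """Expand a covariate vector into main effects plus all products of
    2 (first-order) and 3 (second-order) variables, up to max_order."""
    p = len(x)
    terms = []
    for order in range(1, max_order + 1):
        for idx in combinations(range(p), order):
            prod = 1
            for j in idx:
                prod *= x[j]
            terms.append(prod)
    return terms
```

With binary covariates, an interaction term such as (X_1 : X_3 : X_4) is simply the product of the three dummies, which is 1 only when all three features are present.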
We note that although most parts of the model described so far resemble an enhancement of a generalized linear model (GLM) (see [13]) called a random-intercept model, it is not strictly a GLM, since the Weibull distribution lacks fixed-length sufficient statistics and is therefore not, in a strict sense, part of the exponential family of distributions unless the shape parameter is known. Despite this, it is still possible to define a joint conjugate prior for the two parameters (α_{ w }, λ_{ w }) ([14]), as explained in the subsection on priors, eq. (10). In order to provide a full Bayesian treatment of the model, we define suitable conjugate priors for the other parameters of the model, namely σ and β.
Contrast coding
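As a minimal sketch of how a categorical covariate is turned into a group of coded variables (using simple dummy coding with a reference level, the simplest contrast scheme; the exact contrast matrix used in the paper may differ): a variable with K levels becomes a group of K − 1 columns, and that group is what the Group-Lasso later penalizes jointly.

```python
def dummy_code(values, levels):
    """Dummy-code a categorical variable: the first level serves as the
    reference and each remaining level gets an indicator column. The
    resulting columns form one group for the Group-Lasso penalty."""
    ref, *rest = levels
    return [[1 if v == lvl else 0 for lvl in rest] for v in values]
```

For example, a three-level marker (low/medium/high) yields two columns per observation, and either both coefficients of the group are kept or both are shrunk to zero.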
Priors
One of the major requirements of the model is to find the key explanatory factors from data. To achieve this goal, we need to apply sparsity constraints on the regression coefficients β to identify the key interactions. As described, the coding procedure gives rise to groups of contrastcoded variables. This transformation of data leads to the task of inferring sparsity on a group level, i.e. on grouped dummy variables, where each group represents a single variable in the original formulation.
where G is the number of groups, p_{ g } is the size of group g, ρ and σ^{2} play the role of the Lagrange parameter in the classical Group-Lasso, and each β_{ g } is a scale mixture of multivariate Gaussians. Based on (9), the marginal pdf of β_{ g } can be derived analytically as a product of multivariate Laplacians (for details, see [8]).
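Since eq. (9) is given here only by reference, the following is a reconstruction of the standard Bayesian Group-Lasso hierarchy in the spirit of [8] and [17]; the exact parameterization in the original may differ:

```latex
% Standard Bayesian Group-Lasso hierarchy (reconstruction; eq. (9) in the
% original may use a slightly different parameterization):
\beta_g \mid \tau_g^2, \sigma^2 \;\sim\; \mathcal{N}\!\bigl(0,\; \sigma^2 \tau_g^2 I_{p_g}\bigr),
\qquad
\tau_g^2 \;\sim\; \mathrm{Gamma}\!\Bigl(\tfrac{p_g + 1}{2},\; \tfrac{\rho^2}{2}\Bigr),
\qquad g = 1, \dots, G.
```

Integrating out the τ_g² yields the multivariate Laplacian marginal on each β_g, whose mode structure induces the group-wise sparsity.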
where a, b, c > 0 and d allows us to deal with the lack of fixed-length sufficient statistics.
Posteriors
In practice, sampling directly from the posterior distribution is not possible, hence we propose a Gibbs sampling strategy for stochastic integration. The sampling process further enables this procedure to be incorporated very naturally as another step in the clustering algorithm discussed in the next section. Additionally, for the lasso model, the Blocked-Gibbs sampler has been shown to be geometrically ergodic in [17], hence the convergence of the Gibbs sampler is expected to be rapid. Multiplying the priors with the likelihood and rearranging the relevant terms yields the full conditional posteriors, which are needed in the Gibbs sampler for carrying out the stochastic integrations. The posteriors for σ, β, ρ and the remaining hyperparameters are exactly as defined in [8]. The conditional posterior of η_{ i } is difficult to sample from since it is not of standard form. However, since this conditional posterior is log-concave, we propose the use of a Laplace approximation, similar to that in [18], which approximates it by a Normal distribution and simplifies sampling considerably. Although alternatives exist in the form of adaptive-rejection sampling, the Laplace approximation gives indistinguishable results while speeding up computations considerably.
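The Laplace-approximation step can be sketched as follows (a generic sketch, assuming the caller supplies the gradient and Hessian of the log-concave conditional; the paper's exact conditional for η_i is not reproduced here): Newton's method finds the posterior mode, and a Normal centered at the mode with variance given by the negative inverse Hessian serves as the approximate conditional to draw from.

```python
import math
import random

def laplace_sample(logp_grad, logp_hess, eta0, n_newton=50, tol=1e-10):
    """Approximate a log-concave 1-D conditional by a Normal: locate the
    mode via Newton's method, then draw from N(mode, -1 / logp''(mode))."""
    eta = eta0
    for _ in range(n_newton):
        step = logp_grad(eta) / logp_hess(eta)
        eta -= step
        if abs(step) < tol:
            break
    var = -1.0 / logp_hess(eta)   # log-concavity makes the Hessian negative
    return eta, random.gauss(eta, math.sqrt(var))
```

For a Normal conditional the approximation is exact; for the log-concave η_i conditional it is a close surrogate, which is why the results are indistinguishable from adaptive-rejection sampling.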
where P_{ y } is the product of the t_{ i }'s for which δ_{ i } = 1 and (·) denotes all remaining unknown parameters. This marginal is a non-standard distribution, and sampling is done via a discretized version of it.
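Sampling from a discretized version of a non-standard one-dimensional distribution can be done by evaluating the unnormalised density on a grid and drawing from the resulting discrete distribution (an illustrative sketch; grid placement and resolution are choices left open by the text).

```python
import random

def grid_sample(unnorm_pdf, grid, rng=random.random):
    """Draw from a non-standard 1-D distribution: evaluate the unnormalised
    density on a grid, then inverse-CDF sample the discretised version."""
    weights = [unnorm_pdf(g) for g in grid]
    u = rng() * sum(weights)
    acc = 0.0
    for g, w in zip(grid, weights):
        acc += w
        if u <= acc:
            return g
    return grid[-1]
```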
Infinite mixture of survival experts
This representation allows us to visualize each mixture component as a joint distribution over (x, t). The distribution over x is modeled as a Normal distribution, as shown in Figure 2. The standard joint conjugate prior, the Normal-Inv-χ^{2}, is applied to the parameters. The posterior conditionals are also of standard form and can hence be easily incorporated into the Gibbs sampling scheme introduced in the previous section. To complete the Bayesian picture, we need to place a suitable prior on the mixing proportions c. In a finite MOE model, a Dirichlet distribution is the standard conjugate prior for the mixing proportions. All other parameters and priors, based on the modeling of (x, t), follow from the previous section.
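The Normal-Inv-χ² conjugate update referred to above has a closed form (the textbook version, e.g. Gelman et al. [16]; the hyperparameter names below are illustrative):

```python
def normal_inv_chi2_update(mu0, kappa0, nu0, sigma0_sq, data):
    """Conjugate posterior update for a Normal likelihood with a
    Normal-Inv-chi^2 prior on (mu, sigma^2): returns the updated
    hyperparameters (mu_n, kappa_n, nu_n, sigma_n^2)."""
    n = len(data)
    xbar = sum(data) / n
    ss = sum((x - xbar) ** 2 for x in data)
    kappa_n = kappa0 + n
    mu_n = (kappa0 * mu0 + n * xbar) / kappa_n
    nu_n = nu0 + n
    sigma_n_sq = (nu0 * sigma0_sq + ss
                  + kappa0 * n * (xbar - mu0) ** 2 / kappa_n) / nu_n
    return mu_n, kappa_n, nu_n, sigma_n_sq
```

Because the update is closed-form, drawing the component-wise parameters of the x-distribution is a cheap, standard step inside each Gibbs sweep.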
Algorithm 1 Blocked Gibbs Sampling for a Truncated Dirichlet process
1: Input: N observations D = (x_{ i }, t_{ i }).
2: Initialize: c_{ i } = random cluster assignments and parameters.
3: Draw from the posterior of the joint distribution p(π, Φ*, c) by drawing from the conditionals:
4: while NotConverged do
5:   Sample Φ* | π, c, D — carried out individually for each parameter in the model, conditioned on the rest.
6:   Sample c | Φ*, π, D — for i = 1,…, N, draw values c_{ i } = 1,…, M.
7:   Sample π | Φ*, c, D — the mixture proportions are drawn from the posterior P(π | α)P(c | π).
8: end while
Markov chain Monte Carlo (MCMC) sampling for inference and parameter estimation. Inference for the infinite mixture-of-experts model is carried out by MCMC sampling of the posterior distribution. Although non-conjugate versions of Dirichlet process algorithms exist (as given in [21]) which could be applied for inference, for practical reasons we use a truncated version of the Dirichlet process, the Dirichlet-Multinomial allocation model [22], by specifying an upper bound on the maximum number of clusters based on prior knowledge of the particular application. It serves as a good approximation to the DP measure and results in a finite-sum random probability measure which is computationally convenient and easy to implement. More specifically, we carry out Blocked-Gibbs sampling on a truncated Dirichlet process (see Algorithm 1 for details). After initializing all the parameters, the sampling algorithm is executed until convergence, which can be determined with the run-length control diagnostics explained in [23] or fixed to a maximum number of iterations based on studying the trace plots of the sampling process in simulations.
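Step 7 of Algorithm 1, the mixing-proportion draw under the Dirichlet-Multinomial allocation model, can be sketched as follows (a generic sketch of the standard truncated-DP update, with M components and concentration α; the remaining steps depend on the model-specific conditionals and are omitted):

```python
import random

def sample_mixing_proportions(counts, alpha, rng=random):
    """Draw pi | c for a truncated DP with M components: the posterior is
    Dirichlet(alpha/M + n_1, ..., alpha/M + n_M), sampled here via
    normalised Gamma draws (the standard Dirichlet construction)."""
    M = len(counts)
    draws = [rng.gammavariate(alpha / M + n, 1.0) for n in counts]
    total = sum(draws)
    return [d / total for d in draws]
```

Empty components keep a small but nonzero weight (alpha/M), which is what lets the sampler re-open clusters and effectively infer the number of experts.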
Results and discussion
Simulations. In order to demonstrate the effectiveness of the model, experiments were carried out on simulated data. The first experiment shows the capability of the model to correctly identify two subgroups in the data, along with the key explanatory factors in each group. A dataset of size 150 was generated from two equally proportioned clusters, with (5, 5) and (1, 1) as the shape and scale parameters of the respective Weibull distributions. The features consisted of 7 variables expanded up to second-order interactions (63 terms). For the first cluster, the significant factors included the main effects X_{1}, X_{3} and X_{4}, all first-order interactions among these three variables, i.e. (X_{1} : X_{3}), (X_{1} : X_{4}), (X_{3} : X_{4}), and the second-order interaction (X_{1} : X_{3} : X_{4}). Similarly, for the second cluster, the significant factors included the main effects X_{2}, X_{6} and X_{7}, all first-order interactions among these three variables, i.e. (X_{2} : X_{6}), (X_{2} : X_{7}), (X_{6} : X_{7}), and the second-order interaction (X_{2} : X_{6} : X_{7}).
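The simulated-data setup can be reproduced in outline as follows (a sketch under stated assumptions: binary covariates and a fixed seed are illustrative choices; the original covariate-generation scheme is not fully specified in the text, and note that Python's `weibullvariate` takes scale first, then shape):

```python
import random

def simulate_survival(n=150, seed=0):
    """Two equally proportioned clusters with Weibull (shape, scale)
    parameters (5, 5) and (1, 1), plus 7 binary covariates per patient."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        c = rng.randrange(2)                       # equal mixing proportions
        scale, shape = (5.0, 5.0) if c == 0 else (1.0, 1.0)
        t = rng.weibullvariate(scale, shape)       # survival time
        x = [rng.randrange(2) for _ in range(7)]   # 7 binary covariates
        data.append((c, t, x))
    return data
```

Expanding the 7 covariates to second order then yields the 63-term design on which the sparsity pattern is recovered.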
Application to breast-cancer dataset. The dataset consists of measured intensity levels obtained from tissue microarrays for the following markers: karyopherin alpha-2 (KPNA2), nuclear staining for p53, the anti-cytokeratin CK5/6, the fibrous structural protein Collagen VI, the inter-α-trypsin inhibitor ITIH5, the estrogen receptor (ER) and the human epidermal growth factor receptor HER2. From these categorical variables we constructed covariates arranged in a design matrix which includes all dummy-coded interactions up to second order.
Cross-validation experiments were conducted for both the MOE and the single-cluster model; they showed similar trends, but without clear significance. Although this dataset is one of the biggest of its kind, the rather low number of samples (270 patients) remains the main challenge in these scenarios. A further difficulty is the large fraction of censored patients (60%), a common problem in long-term retrospective studies.
The high-risk patient cluster is characterized by a global under-expression of ER and over-expression of essentially all other markers, in particular KPNA2, CK5/6 and HER2. Over-expression of the latter two markers clearly identifies this cluster as a collection of basal- and HER2-type breast-cancer patients. The occurrence of KPNA2 in the high-risk group is also in accordance with previous studies: KPNA2 is a member of the karyopherin (importin) family, part of the nuclear transport protein complex. Its over-expression has been reported in several gene-expression signatures in breast cancer and other cancer types, and it has previously been identified as a possible prognostic marker in breast cancer [24].
Interpretation of interaction terms
Significant terms in the high-risk cluster (↘ low expression, ↗ high expression):
ER↘
ER↘ : CK5/6↘
KPNA2↘ : p53↘ : Collagen VI↘
ITIH5↗ : HER2↗
ER↘ : Collagen VI↘ : HER2↗
ER↘ : KPNA2↘ : ITIH5↗
ER↘ : p53↗ : CK5/6↘
ER↘ : KPNA2↘ : Collagen VI↘
The observation that high-order interaction terms seem to be even more indicative than the individual main effects is a highly interesting result of this study, which may lead to the definition of novel prognostic markers for better differentiation between high-risk patients. Together with our medical partners, we are currently testing these new hypothetical compound markers.
The low-risk cluster has a clear luminal-type signature (strong ER response). Hardly any significant patterns can be identified, which, however, is quite understandable given that the survival curve is almost flat for these patients: in the proportional hazards model the individual covariates influence the "passage of time", and a flat curve basically means that there is almost no intra-class variation that could be explained by individual covariate effects.
Conclusions
We have introduced a fully Bayesian infinite mixture-of-experts survival model which extends classical approaches by including feature selection for contrast-coded categorical variables. Random links and a mixture-of-experts architecture allow for both stochastic and model-driven deviations from the underlying parametric survival model. The inherent clustering property of the final model makes it possible to identify patient groups which are homogeneous with respect to the predictive power of their covariates for the observed survival times. The built-in Bayesian feature selection mechanism reveals cluster-specific explanatory factors and interactions. Due to the Bayesian treatment within a suitably expanded model, posterior samples can be generated efficiently, which makes it possible to assess statistical significance based on a very large number of draws.
Applied to survival data from a breast-cancer study, the model identified two stable patient clusters that show a clear distinction in terms of survival probability. Several strong high-order interactions between marker proteins were detected which carry more information about the survival targets than the markers themselves. Not only does this result confirm earlier studies, it also shows that the analysis of complex interactions is feasible and may lead to the definition of novel prognostic markers. We are currently conducting new experiments to test these new hypothetical compound markers.
List of abbreviations
 AIC: Akaike information criterion
 MOE: mixture of experts
 GLM: generalized linear model
 MCMC: Markov chain Monte Carlo
 DP: Dirichlet process
Declarations
Acknowledgements
The work was supported by a grant of the Swiss SystemsX.ch Initiative (Swiss National Science Foundation) to the project ”LiverX” (Competence Center for Systems Physiology and Metabolic Diseases). We also acknowledge financial support from the FET programme within the EU FP7, under the SIMBAD project (Contract 213250).
Authors’ Affiliations
References
 Klein JP, Moeschberger ML: Survival Analysis: Techniques for Censored and Truncated Data. 1997, Springer-Verlag: New York.
 Rosen O, Tanner M: Mixtures of Proportional Hazards Regression Models. Statistics in Medicine. 1999, 18: 1119-1131. 10.1002/(SICI)1097-0258(19990515)18:9<1119::AID-SIM116>3.0.CO;2-V.
 Ando T, Imoto S, Miyano S: Kernel Mixture Survival Models for Identifying Cancer Subtypes, Predicting Patient's Cancer Types and Survival Probabilities. Genome Informatics. 2004, 15 (2): 201-210.
 Kottas A: Nonparametric Bayesian Survival Analysis using Mixtures of Weibull Distributions. Journal of Statistical Planning and Inference. 2006, 136 (3): 578-596. 10.1016/j.jspi.2004.08.009.
 Ibrahim JG, Chen MH, MacEachern SN: Bayesian Variable Selection for Proportional Hazards Models. The Canadian Journal of Statistics. 1999, 27 (4): 701-717. 10.2307/3316126.
 Paserman MD: Bayesian Inference for Duration Data with Unobserved and Unknown Heterogeneity: Monte Carlo Evidence and an Application. 2004, IZA Discussion Papers 996, Institute for the Study of Labor (IZA).
 Rasmussen CE, Ghahramani Z: Infinite Mixtures of Gaussian Process Experts. Advances in Neural Information Processing Systems 14. 2002, MIT Press, 881-888.
 Raman S, Fuchs T, Wild P, Dahl E, Roth V: The Bayesian Group-Lasso for Analyzing Contingency Tables. Proceedings of the 26th International Conference on Machine Learning. 2009, Omnipress, 881-888.
 Raman S, Roth V: Sparse Bayesian Regression for Grouped Variables in Generalized Linear Models. Proceedings of the 31st DAGM Symposium on Pattern Recognition. 2009, Springer-Verlag, 242-251.
 Yuan M, Lin Y: Model Selection and Estimation in Regression with Grouped Variables. Journal of the Royal Statistical Society, Series B. 2006, 68: 49-67. 10.1111/j.1467-9868.2005.00532.x.
 Ravikumar P, Liu H, Lafferty J, Wasserman L: SpAM: Sparse Additive Models. Advances in Neural Information Processing Systems 20. 2007, MIT Press.
 Ibrahim JG, Chen MH, Sinha D: Bayesian Survival Analysis. 2001, Springer-Verlag: New York.
 McCullagh P, Nelder J: Generalized Linear Models. 1983, Chapman & Hall.
 Fink D: A Compendium of Conjugate Priors. In progress report: Extension and Enhancement of Methods for Setting Data Quality Objectives. Technical Report. 1995.
 Everitt B: The Analysis of Contingency Tables. 1997, Chapman & Hall.
 Gelman A, Carlin J, Stern H, Rubin D: Bayesian Data Analysis. 1995, Chapman & Hall.
 Kyung M, Gill J, Ghosh M, Casella G: Penalized Regression, Standard Errors and Bayesian Lassos. Bayesian Analysis. 2010, 5 (2): 369-412.
 Green P, Park T: Bayesian Methods for Contingency Tables using Gibbs Sampling. Statistical Papers. 2004, 45: 33-50. 10.1007/BF02778268.
 Jacobs RA, Jordan MI, Nowlan SJ, Hinton GE: Adaptive Mixtures of Local Experts. Neural Computation. 1991, 3: 79-87. 10.1162/neco.1991.3.1.79.
 Kim S, Smyth P, Stern H: A Nonparametric Bayesian Approach to Detecting Spatial Activation Patterns in fMRI Data. Proceedings of the 9th International Conference on Medical Image Computing and Computer Assisted Intervention. 2006, 217-224.
 Neal RM: Markov Chain Sampling Methods for Dirichlet Process Mixture Models. Journal of Computational and Graphical Statistics. 2000, 9: 249-265. 10.2307/1390653.
 Ishwaran H, Zarepour M: Exact and Approximate Sum Representations for the Dirichlet Process. The Canadian Journal of Statistics. 2002, 30: 269-283. 10.2307/3315951.
 Raftery A, Lewis S: One Long Run with Diagnostics: Implementation Strategies for Markov Chain Monte Carlo. Statistical Science. 1992, 7: 493-497. 10.1214/ss/1177011143.
 Dahl E, Kristiansen G, Gottlob K, Klaman I, Ebner E, Hinzmann B, Hermann K, Pilarsky C, Dürst M, Klinkhammer-Schalke M, Blaszyk H, Knuechel R, Hartmann A, Rosenthal A, Wild PJ: Molecular Profiling of Laser-Microdissected Matched Tumor and Normal Breast Tissue Identifies Karyopherin α2 as a Potential Novel Prognostic Marker in Breast Cancer. Clinical Cancer Research. 2006, 12: 3950-3960. 10.1158/1078-0432.CCR-05-2090.
Copyright
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.