Survival models with preclustered gene groups as covariates
© Kammers et al; licensee BioMed Central Ltd. 2011
Received: 20 June 2011
Accepted: 16 December 2011
Published: 16 December 2011
Skip to main content
© Kammers et al; licensee BioMed Central Ltd. 2011
Received: 20 June 2011
Accepted: 16 December 2011
Published: 16 December 2011
An important application of high dimensional gene expression measurements is the risk prediction and the interpretation of the variables in the resulting survival models. A major problem in this context is the typically large number of genes compared to the number of observations (individuals). Feature selection procedures can generate predictive models with high prediction accuracy and at the same time low model complexity. However, interpretability of the resulting models is still limited due to little knowledge on many of the remaining selected genes. Thus, we summarize genes as gene groups defined by the hierarchically structured Gene Ontology (GO) and include these gene groups as covariates in the hazard regression models. Since expression profiles within GO groups are often heterogeneous, we present a new method to obtain subgroups with coherent patterns. We apply preclustering to genes within GO groups according to the correlation of their gene expression measurements.
We compare Cox models for modeling disease free survival times of breast cancer patients. Besides classical clinical covariates we consider genes, GO groups and preclustered GO groups as additional genomic covariates. Survival models with preclustered gene groups as covariates have similar prediction accuracy as models built only with single genes or GO groups.
The preclustering information enables a more detailed analysis of the biological meaning of covariates selected in the final models. Compared to models built only with single genes there is additional functional information contained in the GO annotation, and compared to models using GO groups as covariates the preclustering yields coherent representative gene expression profiles.
We present prediction models for survival times built from high dimensional gene expression data. The challenge is to construct models that are complex enough to have high prediction accuracy but that at the same time are simple enough to allow biological interpretation. Univariate approaches use single genes as covariates in survival time models, whereas multivariate models perform dimension reduction through gene selection (see, e.g., ). In addition, the combination of clinical data and gene expression data is a hot topic of research [2, 3] and is included in our model building procedure. Analysis of the prognostic index  and the Brier Score [5, 6] can be used to assess the predictive performance of the models.
Here, we present models with higher interpretability by combining genes to gene groups (e.g. biological processes) and then using these groups as covariates in the survival models. The hierarchically ordered 'GO groups' (Gene Ontology) are particularly suitable . The Gene Ontology (GO) project provides structured, controlled vocabularies and classifications according to molecular and cellular biology. The current ontologies of the GO project are biological process, molecular function, and cellular component. These three areas are considered rather independent of each other and we make use of the biological process ontology.
A problem when relating gene groups with gene expression profiles is that the genes in each gene group may have different expression profiles: interesting subgroups may not be detected due to heterogeneous or anti-correlated expression profiles within one gene group. We propose to cluster the expression profiles of genes in every gene group and preselect relevant clusters (preclustering).
For statistical analysis, the Cox regression model  is a well-known method for modeling censored survival data. It can be used for identifying covariates that are significantly correlated with survival times. Due to the high-dimensional nature of microarray data we cannot obtain the parameter estimates directly with the Cox log partial likelihood approach. However, we can combine the Cox model with selection and shrinkage procedures and compare the prediction performance of the obtained models. Based on these models statistical selection procedures are applied. Univariate selection and forward selection have been shown to have problematic performance in highdimensional settings. Therefore we do not show their results in this work. We focus on presenting the results for ridge regression  and lasso regression [10, 11] as shrinkage methods. Note that lasso regression is a variable selection method as well.
In order to integrate the clinical information and microarray data in survival models properly, it is a common approach to handle the clinical covariates as unpenalized mandatory variables [3, 12]. In addition to the genomic information, clinical covariates like age, tumor size and tumor stage may be important predictors for survival times of patients. These approaches show that the combination of genomic and clinical information may also improve predictions.
Our aim is the combination of methods for survival prediction with biological a priori knowledge. On real gene expression data sets we evaluate the potential of including preclustered gene groups as covariates in survival models. Models built with gene groups alone have equal or decreased prediction accuracy since many genes are not yet annotated to their corresponding functions. However, we will show that after adding the preclustering information to the gene groups the resulting models have improved interpretability while prediction performance remains stable.
In the next chapter we introduce the methods for analyzing survival data, for preclustering genes, for model selection, and for evaluating the prediction accuracy of the resulting survival models. Then we present and discuss results on two real gene expression data sets.
We first present the notation, the Cox model and how the covariates are defined that are used in the Cox models - especially the preclustering Algorithm is presented. Then we describe the log partial likelihood concept derived for the Cox model and introduce model selection/shrinkage methods. Since most methods for dimension reduction or shrinkage require the selection of a tuning parameter that determines the amount of shrinkage, finally, we describe how to choose the tuning parameter for each method.
For patient i, this expression contains the possibly censored failure time t i , the (non-censoring-)indicator δ i (equal to 1 if t i is a true survival time and to 0 if it is censored) and the vector of gene (or summarized gene group) expression values x i .
Further, R(t i ) is the risk set at time t i ; this is the set of all patients who have not yet failed nor been censored. The value of β′x i is called prognostic index or risk score of patient i.
In the following, we assume that the data consists of two different categories of covariates
clinical covariates Z = (Z 1 , ..., Z q ): e.g. tumor size, tumor grade, age
genomic covariates X = (X 1, ..., X p ): gene expression values of single genes or combined gene expression values for gene groups.
The second model consists of p genomic covariates X = (X 1,..., X p ). In our genetic regression models we use single genes, gene groups as well as preclustered gene groups as covariates. A gene group must be appropriately summarized in order to obtain one representative value for each individual (patient). We summarize the gene expression measurements from all genes belonging to one GO group or cluster via the first principle component of all genes that belong to this gene group. In the following we will consider three types of genomic models:
preclustered GO groups.
Due to the small number of clinical covariates, the shrinkage and dimension reduction procedures will only be applied to the genomic covariates.
where d(i, j) is the dissimilarity of the ith and jth gene, the two following steps are carried out iteratively until convergence, starting with K sequentially selected genes as initial solution:
Build: Select sequentially K initial clusters and assign each gene to its closest medoid.
Swap: Minimize the objective function (5) by switching medoids with other genes of the same cluster.
of the ith and jth gene with the gene expressions x i and x j is based on their Pearson correlation. This yields small dissimilarities between positively correlated genes and large values for negatively correlated genes, respectively.
Here, the values C i are the elements of the lower triangle of the correlation matrix of the N j genes within a single cluster. The maximum mean ICC among the K = 2,..., N - 1 possible cluster configurations corresponds to the optimal number of clusters within one GO group.
For comparing our results to those being published in the literature, we make use of the following two most established and successful shrinkage procedures: L 1 (lasso) and L 2 (ridge) penalized regression. Univariate and forward stepwise selection do not produce satisfactory results for our high dimensional settings. We have compared these two methods in our analysis, and in agreement with previous results from Boevelstad et al. [4, 12] prediction performance was always worse (data not shown). We present the methods for the model containing clinical and genomic information.
L1 (lasso) and L2 (ridge) penalized estimation methods shrink the estimates of the regression coefficients towards zero relative to the maximum likelihood estimates. Both methods are similar in nature, but the results of L1 and L2 penalization can be very different. We perform the penalization only on the high-dimensional genomic covariates, the clinical covariates are handled as unpenalized mandatory variables.
The lasso shrinks the regression coefficients toward zero by penalizing the absolute values instead of their squares. The penalized log partial likelihood thus becomes .
Ridge regression  shrinks the regression coefficients by imposing a penalty on their squared values. The regression coefficients are estimated by maximizing the penalized log partial likelihood where is the penalty term and l(β, γ) is given by (2). Applying an L 2 penalty tends to result in many small but non-zero regression coefficients, whereas penalizing with the absolute values has the effect that many regression coefficients are shrunk exactly to zero. Thus the lasso also is a variable selection method.
We applied both methods using the R package penalized . In both methods the tuning parameter λ controls the amount of shrinkage and is obtained again by cross-validation.
where l(β, γ) denotes the log partial likelihood given in (2) and l m (β, γ) the log partial likelihood when the mth fold (m = 1,..., M) is left out.
The difference of the two terms compared in the formula is that in the right term the likelihood is evaluated without the mth fold, and on the left side it is evaluated with all patients. In both cases the parameters β and γ are estimated without the mth fold. The estimate of β and γ when the mth fold is left out is denoted by and . The optimal value of λ is chosen to maximize the sum of the contributions of each fold to the log partial likelihood.
Next we describe how we evaluate the prediction performance of the models. We make use of three different model evaluation criteria. The whole procedure is applied to two well-known data sets. The basic idea is to split the data into a training set for model fitting and a test set for model evaluation, i.e. for determining the prediction performance. It is important to note that we have to consider several splits of the data into training and test sets due to the extreme dependence of the results on such a split (cf. [4, 17]).
In order to obtain a fair comparison of the prediction methods, we divide the data 100 times at random in a training and test set at the ratio of 2:1. After computing the optimal tuning parameter by 10-fold cross-validation using the training data, we estimate the regression coefficients and on the whole training data set. For each split into training data and test data, we calculate on the test set the three evaluation criteria explained in the next subsections. The results are compared with the help of boxplots and prediction error curves.
We assign patients to subgroups based on their prognosis, into one with good and one with bad prognosis. If the prognostic index of patient i is higher, the survival time is expected to be shorter. For this reason, a patient i in the test set is assigned to the high-risk group if its prognostic index is above the median of all prognostic indices calculated on the test set. We apply a logrank test on the two prognostic groups and use the p-value as an evaluation criterion for the usefulness of the grouping. Boevelstad  points out that a disadvantage of this criterion is that it does not consider the ranking of the patients within the groups and it may not be biologically meaningful.
The prognostic index is used as a single continuous covariate in a Cox model. We fit the model . Using the likelihood ratio test, we test the null hypothesis α = 0 versus α ≠ 0 and assess the prediction performance with the obtained p-value. A small p-value indicates ability of the prognostic index to discriminate between short and long survivors.
is a score for the average predition performance for all time points in the interval [0, t *]. In accordance with , we calculate the IBS for the two data sets for t * = 10 years due to high censoring after 10 years of survival.
For investigating the relationship between microarray gene expression data and censored survival data, we analyze two published breast cancer data sets with the methods described above. In this section, we present the results for the evaluation procedure applied to these two data sets. Standard approaches focus on single genes as covariates [1, 4, 19]. We integrate additional biological knowledge by building models with preclustered GO groups as covariates. In order to assess the merit of this approach, we also present results for models using only genes or only GO groups as explanatory variables and combine the genomic information with the clinical data. In order to obtain a fair comparison of models with different types of genomic covariates, we only use those genes that are annotated to GO groups. We have to consider several splits of the data into training and test set due to the dependence of the results on such a split. We first present detailed results for one specific random split, then we present comprehensive results summarizing 100 random splits.
At this point we want to highlight that the proposed methods are computationally intensive. Due to the nested cross-validation procedure for obtaining the optimal tuning parameter λ and the preclustering approach, we performed all computations on the LiDOng high performance computing cluster of TU Dortmund University with 432 nodes and up to 64 GB RAM per node. The calculation takes several weeks to accumulate all results for one high dimensional data set.
The Dutch breast cancer (DBC) data set is a subset of the original data set with 24 885 gene expression measurements from n = 295 women with breast cancer . After data pre-processing as proposed by  our analysis is performed with 1 890 genes, that are annotated to at least one GO group, according to the biological process ontology. We obtained the data from the website https://www.msbi.nl/dnn/People/Houwelingen.aspx. In total, there are 5 560 GO groups to which at least one of these genes is annotated. The mean number of genes included in these GO groups is approximately 17 genes, 90% of all GO groups contain at most 30 genes. For 79 patients an event was observed. The clinical covariates are age, size, nodes and grade.
The Mainz cohort (MC) study consists of n = 200 node-negative breast cancer patients who were treated at the Department of Obstetrics and Gynecology of the Johannes Gutenberg University Mainz between the years 1988 and 1998 . All patients underwent surgery and did not receive any systemic therapy in the adjuvant setting. Gene expression profiling of the patients' RNA was performed using the Affymetrix HG-U133A array, containing 22 283 probe sets, and the GeneChip System. The normalization of the raw data was done using RMA from the Bioconductor package affy. The raw. cel files are deposited at the NCBI GEO data repository with accession number GSE11121. For covariates in the survival models, 17 834 genes and 8 587 GO groups are available. The mean number of genes included in these GO groups is approximately 102 genes, 90% of all GO groups contain at most 146 genes. There have been 47 events observed. The clinical data covers age at diagnosis, tumor size and grade as well as the estrogen receptor status.
One random split into training and test data for the Mainz cohort study
This example indicates that the predictive performance of models built with GO groups alone and of models with preclustered GO groups is comparable with classical models using only genes as covariates. The p-values for model assessment are similar, but in addition, we have more information in the final model; annotations of preclustered GO groups can help clinicians to investigate the selected genes according to their biological function.
For all three types of genomic covariates the two prognostic groups are clearly separated on the test data, with significant differences in overall survival (p <0.02) between the high-risk group and the low-risk group. The separation between the two groups is best when using a model containing preclustered GO groups (p = 0.0092).
We have observed high variability of the chosen tuning parameters and the parameter estimates depending on the split into training and test data. In order to quantify which covariates are consistently selected in different splits and how stable the evaluation measures are, we calculated results for 100 random splits and compared the selected genes and GO groups.
Rows of the plots correspond to two model evaluation criteria, the prognostic index and the Integrated Brier Score, and the columns correspond to two types of models: the genomic model and the genomic model with clinical covariates. Results for the logrank test are nearly the same as for the prognostic index and therefore not shown here. In each plot we show the results for the two model selection methods. The p-values for the prognostic index are shown on the - log10 scale, thus a value of 2, e.g., corresponds to a p-value of 0.01. Small values for the integrated Brier Score correspond to good prediction performance. For both evaluation criteria in all plots the horizontal line at the median indicates the reference model containing only clinical information.
The following main statements can be deduced from the plots:
Lasso regression with preclustered GO groups has the best prediction performance for the DBC data set, see the median of the p-values across 100 splits in Figure 2. In the Mainz cohort study, we see the same result for the genomic model using the Brier Score for evaluation (see Figure 3).
Methods using GO groups or preclustered GO groups as covariates perform in general as well as models using only genes.
The prognostic index and the Brier Score yield similar results.
Top 10 selected covariates for preclustered GO-groups according to 100 splits into training and test data
macromolecule metabolic process
response to stimulus
multicellular organismal process
multicellular organismal process
regulation of cellular process
macromolecule metabolic process
response to stimulus
response to chemical stimulus
The typical challenge when relating survival times to gene expression measurements is a relatively small number of individuals compared to a large number of predictors. In this case the use of classical approaches is not possible. In accordance with , the lasso regression method seems most suitable and promising: its prediction performance is slightly better compared to ridge regression and the solution is sparse  and . show that ridge regression performs better than all the other methods. In our analysis, ridge regression leads in general to comparable but not better results compared to the lasso. However, an important disadvantage of this method is that it does not select variables. We observe relevant differences between high-risk and low-risk patients, but there are too many genes or GO groups to be further investigated. The preclustering approach is beneficial concerning prediction performance in the lasso setting and leads to comparable results in the other models. However, a main benefit of preclustering is that we detect genes with similar expression patterns and that these gene subgroups are correlated with survival. In addition, we can have a detailed view on the GO groups containing the preclustered subgroups. Table 2 shows that the cluster sizes as well as the corresponding GO groups are quite large. However, in this case the selection of the top 4 clusters is quite stable. For gaining further biological insight a more detailed analysis of the composition of these clusters is required and promising.
In terms of the Brier Score, we showed that the prediction performance of models using clinical, genomic or both information is comparable. It seems that these different kind of covariates contain an overlap of information for predicting survival.
Our comparative study shows that different model selection procedures can be used to identify genes and (preclustered) GO groups related to survival outcomes and to build models for predicting survival times of future patients.
The integration of GO groups is useful, since they contain aggregated information of biological function and thus are often more informative than single genes. It is encouraging that in terms of prediction performance, our results obtained with (preclustered) GO groups as predictors are comparable to those using only genes as predictors. Thus the potentially improved interpretability makes these models with GO groups competitive. We demonstrated that this result holds true also for models using GO groups and not only genes. Our agenda in the present work was:
Constructing models with a relatively small subset of relevant covariates that are enriched with additional gene group information in terms of the Gene Ontology.
Presenting a new approach of preclustering genes from one functional group due to different expression profiles within one GO group.
Comparing prediction rules for the three types of covariates (genes, gene groups, preclustered gene groups).
Adding clinical information and comparing the results to single use of genomic data.
The next step for improving our models is to integrate more detailed information concerning the hierarchically structured gene ontology. For coping with high correlations between GO groups one can follow the approach of  where correlations between neighboring GO groups in the GO graph are iteratively removed. Finally, in future projects, the biological interpretation of the identified gene groups will include not only the interpretation of the (preclustered) GO groups according to overall function, but it is also helpful to take a closer look at the single genes contained in these gene groups.
We make use of the R package penalized  that provides algorithms for penalized estimation in Cox proportional hazards models. The package is freely available from http://cran.r-project.org. R code for model selection and evaluation is available at http://www.statistik.tu-dortmund.de/survivalGO.html.
The work on this paper has been supported by the German Research Foundation (DFG) within the Research Training Group "Statistical Modelling", project C2 and the Collaborative Research Center SFB 876 "Providing Information by Resource-Constrained Analysis", project A3.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.