 Methodology Article
 Open Access
Boosting the discriminatory power of sparse survival models via optimization of the concordance index and stability selection
 Andreas Mayr^{1, 2}Email author,
 Benjamin Hofner^{1} and
 Matthias Schmid^{2}
https://doi.org/10.1186/s1285901611498
© The Author(s) 2016
 Received: 19 November 2015
 Accepted: 13 July 2016
 Published: 22 July 2016
Abstract
Background
When constructing new biomarker or gene signature scores for timetoevent outcomes, the underlying aims are to develop a discrimination model that helps to predict whether patients have a poor or good prognosis and to identify the most influential variables for this task. In practice, this is often done fitting Cox models. Those are, however, not necessarily optimal with respect to the resulting discriminatory power and are based on restrictive assumptions. We present a combined approach to automatically select and fit sparse discrimination models for potentially highdimensional survival data based on boosting a smooth version of the concordance index (Cindex). Due to this objective function, the resulting prediction models are optimal with respect to their ability to discriminate between patients with longer and shorter survival times. The gradient boosting algorithm is combined with the stability selection approach to enhance and control its variable selection properties.
Results
The resulting algorithm fits prediction models based on the rankings of the survival times and automatically selects only the most stable predictors. The performance of the approach, which works best for small numbers of informative predictors, is demonstrated in a large scale simulation study: Cindex boosting in combination with stability selection is able to identify a small subset of informative predictors from a much larger set of noninformative ones while controlling the perfamily error rate. In an application to discover biomarkers for breast cancer patients based on gene expression data, stability selection yielded sparser models and the resulting discriminatory power was higher than with lasso penalized Cox regression models.
Conclusion
The combination of stability selection and Cindex boosting can be used to select small numbers of informative biomarkers and to derive new prediction rules that are optimal with respect to their discriminatory power. Stability selection controls the perfamily error rate which makes the new approach also appealing from an inferential point of view, as it provides an alternative to classical hypothesis tests for single predictor effects. Due to the shrinkage and variable selection properties of statistical boosting algorithms, the latter tests are typically unfeasible for prediction models fitted by boosting.
Keywords
 Timetoevent data
 Boosting
 Stability selection
 Concordance index
 Variable selection
 Highdimensional data
Background
In the evaluation of biomarkers and gene signatures for survival data, the focus is often on the ability of a new marker combination to discriminate between patients with larger and smaller survival times [1–3]. For example, one is often interested in predicting whether patients survive a specific time point of interest, e.g., five years after baseline examination. In practice, prediction models are often derived using Cox regression, which, however, suffers from restrictive regularity assumptions such as the proportional hazards assumption. It is well known that, if violated, these assumptions may cause Cox regression to result in suboptimal model fits with a decreased prediction accuracy [4].
Despite its limitations, Cox regression remains the predominant technique for modeling survival data in biostatistics (see [3, 5] for recent examples). In fact, most attempts to relax the proportional hazards assumption (such as stratification of the baseline hazard and timevarying coefficients) retain the basic properties and limitations of the Cox model. Similarly, regularization schemes for survival models (such as penalized Cox regression [6, 7], and univariate preselection of markers [8]) are usually based on Cox modeling. On the other hand, nonCoxbased approaches from the machine learning field such as support vector machines for survival data [9] or random survival forests [10] have the problem that they lead to black box predictions with limited interpretability.
Despite its good performance, especially in situations where the proportional hazards assumption is violated, Cindex boosting has the drawback that variable selection cannot be accomplished as easily as with traditional boosting algorithms designed for the calibration of a prediction model (such as LogitBoost [18, 19], which optimizes the conditional probability estimates for a binary outcome). In fact, the discriminatory nature of the Cindex, which evaluates the ranking of the values of η but does not involve probability estimation, has been observed to be relatively insensitive to overfitting, making traditional regularization approaches for boosting (such as early stopping [20]) infeasible. This observation coincides with a recent result by Wyner et al. [21] who demonstrated that overfitting in boosting models for binary outcomes is unlikely to happen as long as discriminatory measures (such as the percentage of observations correctly classified) are used for evaluation. While resistance to overfitting is often considered to be an advantage in machine learning research, it also implies that sparse prediction rules, which are desirable in biomedical applications for reasons of interpretability and generalizability [22], are difficult to obtain.
To address this problem, we propose a new variable selection technique for Cindex boosting that is able to identify the most influential and stable predictors for survival. The method is based on the stability selection approach proposed by Meinshausen and Bühlmann [23], which has recently been enhanced [24] and adapted to gradient boosting estimation [25, 26]. The idea of stability selection is to fit the model to a high number of subsets of the original data. One then determines the average number of subsets in which a variable was selected. Variables where the selection frequency exceeds a certain threshold are considered to be stable. Importantly, variable selection is accomplished via controlling the perfamily error rate (PFER) of the predictor variables selected for inclusion in the boosting model. As a consequence, the sparsity of the resulting prediction model is governed by the desired level of error control, and resistance to overfitting is no longer an issue. Using a comprehensive simulation study, as well as a gene expression data set on lymph node negative breast cancer collected by Desmedt et al. [27], we will demonstrate that stability selection can also be adapted to perform variable selection in Cindex boosting. In particular, our results suggest that the new method is able to both optimize the Cindex and to identify the most relevant predictors for survival at the desired error level.
Methods
The Cindex for survival data
The truncated Cindex is an alternative to the Cindex in situations where the right tail of the estimated survival function of T is unstable [1, 28–30]. While we do not explicitly consider the truncated version of the Cindex in this work, our methodology easily extends to truncated time ranges of the form [0,τ] (see below).
Although following a relatively simple and straightforward definition, in practice the estimation of the Cindex becomes problematic in samples with censoring. Some estimators proposed in the literature omit observation pairs where the smaller survival time was censored, however this can lead to biased results [31]. Others rely on the assumptions of a Cox proportional hazards model [32, 33] which becomes problematic in settings were those are not fulfilled. For an overview and comparison of different estimators for the Cindex and other discriminatory measures for survival data see Schmid and Potapov [13].
The term \(\frac {\Delta _{j}}{\hat {G}(\tilde {T}_{j})^{2}}\) accounts for the inverse probability that observation j is censored. Δ _{ j } represents the censoring indicator, \(\tilde {T}\) are observed survival times subject to censoring and \(\hat {G}(\cdot)\) denotes the KaplanMeier estimator of the unconditional survival function for the censoring time T _{cens} (estimated from the learning data via the observed \(\tilde {T}\) and taking Δ _{ j } as event indicator).
Of note, the estimator \(\widehat {C}_{\text {Uno}}(T, \eta)\) is a consistent estimator of the Cindex if censoring is independent of T (coarsening completely at random, [28, 30]). If censoring is independent of T conditional on η (coarsening at random), the terms \(\hat {G}(\cdot)\) in the definition of \(\widehat {C}_{\text {Uno}}(T, \eta)\) can be replaced by conditional terms \(\hat {G}(\cdot  \eta)\) that are derived from a survival model for the censoring distribution [29, 30]. Wang and Long (2016) also analyzed the properties of \(\widehat {C}_{\text {Uno}}(T, \eta)\) in situations where censoring is not independent of T.
Boosting the Cindex
To find the optimal predictor η with respect to the Cindex, we adapt a componentwise gradient boosting algorithm [35] with simple linear models as baselearners. Boosting originally emerged from machine learning, but during the last 15 years has evolved into a powerful tool to fit statistical models (“statistical boosting”, [36, 37]). The basic idea is to apply simple regression functions as baselearners (in the easiest case simple linear models) and iteratively fit them onebyone to the negative gradient of a loss function. In every boosting iteration only the bestfitting baselearner is included in the model, effectively leading to variable selection.
The loss function defines the type of regression setting the additive predictor is optimized for. The L _{2} squared error loss leads to classical regression of the mean [38], the L _{1} loss to median regression which can be extended to quantile regression via the checkfunction [39]. Incorporating the negative loglikelihood as loss function allows to fit classical generalized linear or additive models (GLMs or GAMs, [35]). For an overview of different loss functions for gradient boosting and their implementation see Hofner et al. [40].
The variable selection properties of statistical boosting algorithms are controlled by the stopping iteration m _{stop} [20]. If the algorithm is stopped before convergence (early stopping), variables that have never been selected up to this iteration are effectively excluded from the final model. The stopping iteration m _{stop} is typically chosen such that it optimizes the prediction accuracy on separate test data generated via resampling techniques (e.g., bootstrapping or subsampling).
In case of Cindex boosting, this common procedure, however, becomes problematic as the rankbased loss function is rather robust against overfitting and early stopping is hardly possible. An optimal m _{stop} often cannot be determined in this case. Similar results have been described for binary outcomes if discriminatory measures are used to evaluate the prediction performance [21]. In case of Cindex boosting, in many practical settings it hence makes sense to run the algorithm until convergence and omit the optimization of m _{stop} (see [14]).
Stability selection
To ensure the selection of the most influential predictors despite this resistance to overfitting, we incorporate the stability selection approach by Meinshausen and Bühlmann [23] which was later refined by Shah and Samworth [24]. Stability selection is a generic method that applies to a wide range of statistical estimation techniques which conduct variable selection [42], including penalized regression approaches such as lasso [43], boosting [18] or tree based approaches such as random forests [44].
Shah and Samworth [24] propose to use 2·B complementary pairs, i.e., use the subsample as well as its complement. With additional assumptions on the distribution of the selection frequencies (unimodality or rconcavity), tighter error bounds can be derived [24]. This rconcavity can be seen as an interpolant between unimodality and logconcavity. With r=−∞ rconcavity equals the unimodality assumption and with r=0 logconcavity is assumed (for a thorough definition see [24]). Error bounds with unimodality assumption are tighter than the standard error bounds from the equation above, but not as tight as error bounds with rconcavity assumption. Usually, both assumptions hold [24].
The selection of the parameters q, π _{thr} and PFER are crucial for the performance of stability selection. In general, we advice to choose q large enough to select all influential variables but small enough to reflect the researchers believe in the sparsity of the resulting model. In a sensible range, the actual size of q is of minor importance. Similarly, Meinshausen and Bühlmann [23] found that the actual choice of the threshold π _{thr} is of minor importance as long as it is in a sensible range (∈(0.6,0.9)). Note that for a fixed q it is computationally very easy to change either the threshold or the PFER as the resampling results can be reused. Hence, for fixed q different thresholds (corresponding to different levels of error control) can be easily compared. Larger thresholds lead to sparser models, while thresholds close to 0.5 lead to models which are less sparse. This is also reflected in the upper bound for the PFER which decreases with increasing threshold. Selection frequencies resulting from stability selection can also be used as a descriptive statistic to assess which variables are selected with high frequencies and which variables are rarely selected.
If error control is of primary interest, we advice to chose q and the upper bound for the PFER. The PFER should be chosen such that α≤PFER≤m·α, with significance level α and m hypothesis tests. This provides a good rationale for a sensible error control with the extreme cases of FWERcontrol (familywise error rate; PFER=α) and no multiplicity adjustment (PFER=m·α).
For an indepth overview of stability selection in the context of boosting, see Hofner et al. [26].
Implementation
All presented methods are made available for the open source statistical programming environment R [45]. The algorithm for boosting the Cindex is implemented via the Cindex() family for the addon package mboost. Stability selection is implemented via the stabsel() function from the stabs [46] package, which is also incorporated in mboost. It provides an implementation of the classical approach [23] and the extended sampling scheme using complementary pairs [24]. For evaluating the discriminatory power of the resulting models on test data, Uno’s estimator for the Cindex the is provided with the UnoC() function of the survAUC [47] package. A workedout example on how to apply these function in practice is provided in Additional file 1, the Rcode to reproduce the analyses of this article is included as Additional file 2. In order to benchmark our results, we used the competing Cox lasso approach implemented in the glmnet package [48] which also can be combined with stability selection via stabs. Note that also other implementations for boosting survival models are available in the R framework (gbm [49], CoxBoost [50]) as well as methods depending on the Brier score [51], like the peperr [52] and the pec [53] packages.
Results
Simulation study

To analyze if the algorithm is able to correctly identify a small subset p _{inf} of informative variables from a larger set of p possible predictors in settings of p>n.

To investigate the impact of the two parameters which have to be specified for stability selection, namely the number of selected variables q per boosting run and the threshold π _{thr} for the necessary selection probability.

To compare the resulting discriminatory power of the final models (containing only stable predictors) with the ones from Cindex boosting without stability selection and the competing Cox lasso approach.

To check the performance of our approach in scenarios where the proportional hazards assumption does not hold.
Additionally to the survival times T, we generated for every observation i=1,...,n an additional censoring time T _{censi } and defined the observed survival time by \(\tilde {T}_{i} := \min (T_{i}, T_{\text {cens}i})\) leading to independent censoring of on average 50 % of the observations. The sample size remained fixed with n=200 observations. For stability selection we used 2·B=100 complementary subsamples and computed the error bounds under the rconcavity assumption (cf., [26]). The final models containing only the selected stable variables were fitted with a fixed m _{stop}=1000. We compared the performance of this approach also with Cindex boosting on all p predictors (also without tuning, but with fixed m _{stop}=10000) and the Cox lasso. For the latter, the shrinkage parameter was optimized via 10fold crossvalidation.
Variable selection
Variable selection results from 100 simulation runs: median number of true positives  false positives and calculated upper bound for the perfamilyerror rate (PFER, in brackets) for different values of q and π _{thr}
Cindex boosting  Cox  

p  p _{inf}  n  PHviol  q  π _{thr} = 0.5  π _{thr} = 0.6  π _{thr} = 0.7  π _{thr} = 0.8  π _{thr} = 0.9  without π _{thr}  lasso 
1000  4  200  false  100  4 8 (24.8)  4 3 (11.4)  4 1 (4.27)  4 0 (1.92)  4 0 (0.75)  4 180  4 36 
50  4 1 (5.20)  4 0 (2.61)  4 0 (0.97)  4 0 (0.43)  4 0 (0.17)  
20  4 0 (0.61)  4 0 (0.33)  3 0 (0.14)  3 0 (0.06)  2 0 (0.02)  
15  4 0 (0.32)  3 0 (0.17)  3 0 (0.08)  3 0 (0.04)  2 0 (0.01)  
10  3 0 (0.13)  3 0 (0.07)  3 0 (0.04)  2 0 (0.02)  2 0 (0.01)  
5  2 0 (0.03)  2 0 (0.02)  2 0 (0.01)  2 0 (0.00)  1 0 (0.00)  
500  4  200  false  100  4 14 (51.9)  4 5 (27.9)  4 2 (10.4)  4 0 (4.73)  4 0 (1.87)  4 166  4 31 
50  4 3 (12.4)  4 1 (5.71)  4 0 (2.13)  4 0 (0.96)  4 0 (0.38)  
20  4 0 (1.55)  4 0 (0.82)  4 0 (0.30)  3 0 (0.14)  3 0 (0.05)  
15  4 0 (0.79)  4 0 (0.44)  3 0 (0.17)  3 0 (0.07)  3 0 (0.03)  
10  4 0 (0.31)  3 0 (0.16)  3 0 (0.07)  3 0 (0.03)  2 0 (0.01)  
5  3 0 (0.07)  3 0 (0.03)  2 0 (0.02)  2 0 (0.01)  1 0 (0.00)  
500  4  200  true  100  413 (51.9)  45 (27.9)  42 (10.4)  40 (4.73)  40 (1.87)  4 171  4 36 
50  42 (12.4)  41 (5.71)  40 (2.13)  40 (0.96)  40 (0.38)  
20  40 (1.55)  40 (0.82)  40 (0.30)  40 (0.14)  30 (0.05)  
15  40 (0.79)  40 (0.44)  40 (0.17)  30 (0.07)  30 (0.03)  
10  40 (0.31)  40 (0.16)  30 (0.07)  30 (0.03)  20 (0.01)  
5  30 (0.07)  30 (0.03)  20 (0.02)  20 (0.01)  10 (0.00)  
50  4  200  false  20  4 7 (50.0)  4 4 (50.0)  4 2 (6.33)  4 1 (3.06)  4 0 (1.25)  4 43  4 14 
15  4 3 (50.0)  4 2 (8.12)  4 1 (2.88)  4 0 (1.34)  4 0 (0.54)  
10  4 1 (5.19)  4 0 (2.79)  4 0 (1.04)  4 0 (0.47)  4 0 (0.19)  
5  4 0 (1.24)  4 0 (0.57)  4 0 (0.21)  4 0 (0.10)  3 0 (0.04)  
500  12  200  false  100  1212 (51.9)  124 (27.9)  121 (10.4)  110 (4.73)  90 (1.87)  12 150  12 78 
50  92 (12.4)  80 (5.71)  70 (2.13)  60 (0.96)  30 (0.38)  
20  50 (1.55)  40 (0.82)  30 (0.30)  20 (0.14)  10 (0.05)  
15  40 (0.79)  30 (0.44)  20 (0.17)  10 (0.07)  00 (0.03)  
10  30 (0.31)  20 (0.16)  10 (0.07)  00 (0.03)  00 (0.01)  
500  40  200  false  200  1713 (500)  125 (500)  82 (63.3)  40 (30.6)  10 (12.5)  35 139  9 12 
100  1612 (51.9)  114 (27.9)  72 (10.4)  40 (4.73)  10 (1.87)  
50  62 (12.4)  41 (5.71)  20 (2.13)  10 (0.96)  00 (0.38)  
25  20 (2.6)  10 (1.3)  00 (0.48)  00 (0.21)  00 (0.08) 
For given q, the parameter π _{thr} controls the sparsity of the resulting models: For p=1000, q=100 and p _{inf}=4, for example, on average eight variables were falsely selected with a threshold value of π _{thr}=0.5. This number decreased over three (π _{thr}=0.6), and one (π _{thr}=0.7) to zero for higher threshold values π _{thr}. Thus, for threshold values of π _{thr}≥0.8 only the four informative predictors were included in the final model.
Comparing the results for p _{inf}=4 and different numbers of predictors p, it gets clear that the optimal combination of q and π _{thr} depends not only on the number of true informative variables but also on p. For larger numbers of p, q should also be larger to give the algorithm the chance to select enough variables on each subsample so that the informative ones pass the threshold: For p=50 this could be achieved already with q=5; for p=1000 at least q=15 is necessary (better results for q=50 or higher). This interdependence between q, p and π _{thr} can be also observed via the computed upper bound for the PFER (following the error bounds provided in [24]). It has to be noted, however, that on average much less variables were falsely selected in practice than could be in theory (following the upper bound of the PFER). This indicates that the error bound is conservatively controlled.
For higher numbers of informative variables p _{inf} the algorithm had more problems identifying the correct ones. For p _{inf}=12, the number of selected variables per subsample has to be increased to q=100 to incorporate all true informative ones. For smaller values of q, even for π _{thr}=0.5 only parts of the true predictors were selected; however, stability selection still mostly prevented incorporating false positives. The competing Cox lasso approach, in contrast, also on average achieved to identify the true p _{inf}=12, but additionally included large numbers of noninformative variables in the final model. For p _{inf}=40, the picture became more extreme: Now both approaches, Cindex boosting with stability selection and the Cox lasso were no longer able to select the correct predictors. Only for q=200 and π _{thr}=0.5 on average 17 out of 40 predictors were correctly identified by Cindex boosting with stability selection (13 false positives), the Cox lasso incorporated 9 true predictors in the model (12 false positives). Cindex boosting without stability selection in this case correctly identified 35 predictors.
Discriminatory power
Resulting discriminatory power of Cindex boosting in combination with stability selection for different values of q and π _{thr} compared to the competing Cox lasso approach
Cindex boosting  Cox  

p  p _{inf}  n  PHviol  q  π _{thr} = 0.5  π _{thr} = 0.6  π _{thr} = 0.7  π _{thr} = 0.8  π _{thr} = 0.9  without π _{thr}  lasso 
1000  4  200  false  100  0.8150  0.8286  0.8358  0.8393  0.8396  0.7889  0.8148 
50  0.8343  0.8365  0.8381  0.8357  0.8253  
20  0.8324  0.8252  0.7829  0.7662  0.7394  
15  0.8309  0.7813  0.7694  0.7519  0.7340  
10  0.7799  0.7683  0.7519  0.7426  0.6202  
5  0.7497  0.7426  0.7323  0.6176  0.5993  
500  4  200  false  100  0.7998  0.8179  0.8305  0.8361  0.8391  0.7735  0.8161 
50  0.8268  0.8332  0.8375  0.8388  0.8340  
20  0.8358  0.8351  0.8309  0.7744  0.7607  
15  0.8346  0.8314  0.7835  0.7672  0.7521  
10  0.8279  0.7801  0.7672  0.7587  0.7400  
5  0.7627  0.7587  0.7444  0.7347  0.6154  
500  4  200  true  100  0.8304  0.8481  0.8612  0.8656  0.8671  0.7886  0.8345 
50  0.8555  0.8635  0.8664  0.8668  0.8664  
20  0.8657  0.8654  0.8626  0.8477  0.7662  
15  0.8654  0.8626  0.8554  0.7743  0.7573  
10  0.8598  0.8442  0.7757  0.7614  0.7360  
5  0.7660  0.7573  0.7391  0.7275  0.6219  
50  4  200  false  20  0.8183  0.8248  0.8303  0.8333  0.8358  0.7939  0.8256 
15  0.8268  0.8298  0.8329  0.8353  0.8370  
10  0.8314  0.8348  0.8366  0.8370  0.8366  
5  0.8373  0.8353  0.8324  0.8247  0.7662  
500  12  200  false  100  0.9109  0.9218  0.8996  0.8639  0.8081  0.8852  0.8834 
50  0.7991  0.7880  0.7451  0.7089  0.6482  
20  0.6954  0.6609  0.6239  0.5698  –  
15  0.6664  0.6274  0.5830  0.5549  –  
10  0.6275  0.5848  0.5610  –  –  
500  40  200  false  200  0.6416  0.6269  0.6088  0.5755  0.5344  0.6983  0.5782 
100  0.6373  0.6245  0.6028  0.5706  0.5308  
50  0.5907  0.5703  0.5407  0.5129  –  
25  0.5411  0.5269  –  –  – 
Putting the resulting discriminatory power in Table 2 into relation with the bounds of the PFER provided in Table 1, in our simulation settings with p _{inf}=4 the best results were achieved with a PFER (expected number of false positives) of 1 to 4. For p _{inf}=12 and p _{inf}=40 better results were achieved when the PFER reaches or exceeds the number of truly informative predictors p _{inf}.
The final models of the competing Coxlasso approach on average led to a slightly lower discriminatory power than the models from Cindex boosting with stability selection, although in many scenarios the true informative predictors had been correctly identified. Similar to our approach, the Cox lasso also yielded the poorest results for the simulation setting with p _{inf}=40 (\(\hat {C}_{\text {Uno}} = 0.5782\)) where it was clearly outperformed even by Cindex boosting without stability selection.
Breast cancer data
We analysed the performance of our approach also on data to build a gene signature for the prediction of the development of distant metastases in breast cancer patients. The data set (n=196) was collected by Desmedt et al. [27] to validate a 76gene expression signature proposed by Wang et al. [55]. In addition to the expression levels of the 76 genes, four clinical predictor variables were considered (tumor size, estrogen receptor (ER) status, tumor grade and age). Observed metastasisfree survival ranged from 125 days to 3652 days, with 79.08 % of the survival times being censored. The data set is available on GEO (http://www.ncbi.nlm.nih.gov/geo, access number GSE 7390).
To generate independent data sets for model fitting and evaluation, we constructed 100 training and test samples via stratified subsampling (stratified for censoring). On each of the training samples, we fitted Cindex boosting and also Cox lasso models with and without stability selection (with \(q = \frac {p}{2}\) and different values of π _{thr}). The selected genes were afterwards included together with the clinical variables in prediction models that were again fitted either via Cindex boosting or via Cox proportional hazard models.
Variable selection
Discriminatory power
Discussion
The numerical results from the simulation study and the breast cancer data suggest that Cindex boosting in combination with stability selection is able to correctly identify small numbers of influential predictors in potentially highdimensional settings.
Regarding discriminatory power, Cindex boosting outperformed common Coxbased penalization approaches both in the simulations and in the breast cancer application. This finding is not surprising, as our approach – in contrast to Cox regression – is specifically tailored to optimize the ability of the model to differentiate between observations with smaller and larger survival times.
On the other hand, we emphasize that our approach is particular favorable for identifying sparse models, the additional sparsity resulting from stability selection does not necessarily lead to more accurate predictions. While in the simulation study, where the algorithm was confronted in most scenarios with very few informative variables and a much larger set of completely noninformative ones, the additional stability selection also led to a higher discriminatory power than standard Cindex boosting, this result was not confirmed in the breast data application: It can be assumed that most of the 76 preselected genes will at least have a minor effect on the survival outcome [55]. Incorporating stability selection in this setting led to sparser models (Fig. 1), but with higher threshold values π _{thr} the discriminatory power decreased (Fig. 2). In fact, also the results of our simulation study have shown that for larger true models stability selection with a very strict level of error control seems to discard predictor variables that have small but nonnegligible contributions to prediction accuracy. In these cases, a higher discriminatory power was achieved without the incorporation of stability selection. One could hence argue, that increasing interpretability via sparsity and getting the highest possible discriminatory power are two different goals that may not always be achievable at the same time (cf., Hothorn [56]).
In addition to the Cindex considered in this work (see Chen et al. [57] for a similar algorithm without stability selection), various other approaches to evaluate the prediction accuracy of a survival model exist. For example, a wellestablished approach is to evaluate measures that emulate the R ^{2} coefficient of explained variation by relating the likelihood of the prediction model to the respective likelihood of a null model that does not include the marker η [58, 59]. In contrast to the Cindex, these measures are likelihoodbased (or, in case of the Cox model, based on the partial likelihood) and are therefore dependent on the correct specification of the survival model under consideration. Another popular approach is to consider scoring rules for survival data [51, 60], which measure prediction error by the distance between the predicted and the observed survival functions of the observations in a sample. An oftenused scoring rule is the Brier score, which evaluates the squared distance between survival functions [51]. Because scoring rules are based on probability estimates of the individualspecific survival functions, whereas the Cindex is solely based on the rankings of the survival times and the marker values, the two approaches share properties that are similar to the calibration and discrimination approaches, respectively, considered in binary classification (e.g., [61]).
Conclusion
The methodology proposed in this paper addresses the problem of variable selection in Cindex boosting. By combining gradient boosting with stability selection, we constructed a subsamplingbased estimation procedure that incorporates only the most “stable” predictor variables while controlling the perfamily error rate. This property is of considerable interest in biomedical research, as the identification of a small subset of important (here, stable) markers is often considered to be a key issue in prediction modeling. As pointed out by many authors (e.g., [22]), sparse prediction models containing only a moderate number of covariates are desirable in practice for reasons of interpretability. Furthermore, measuring biomarkers is often costly, so that the implementation of a prediction model in clinical practice crucially depends on the level of sparsity of the model.
The combination of gradient boosting and stability selection may also be considered appealing from an inferential point of view. Because statistical inference in boosting models is challenging due to the partly unknown convergence properties of the algorithm and the various regularization schemes involved, very few approaches to derive covariatewise hypothesis tests and pvalues exist [62, 63]. Via stability selection, one can also compute the percomparison error rate [64] which can be interpreted as a standard overall pvalue with multiplicity correction (for details see [26]). Therefore, by controlling the number of falsely selected predictor variables, stability selection provides an alternative to covariatewise tests for assessing the relevance of predictor variables via inferential procedures.
Abbreviations
AUC, area under the receiver operating characteristics curve; Cindex, concordance index; PFER, perfamily error rate; FWER, familywise error rate; ROC, receiveroperating characteristics curve
Declarations
Acknowledgements
None.
Funding
The work on this article was supported by the German Research Foundation (DFG), grant SCHM 2966/12 and the Interdisciplinary Center for Clinical Research (IZKF) of the FriedrichAlexanderUniversity ErlangenNürnberg (Project J49). The authors additionally acknowledge support by Deutsche Forschungsgemeinschaft and FriedrichAlexanderUniversity ErlangenNürnberg (FAU) within the funding program Open Access Publishing. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Availability of data and material
The dataset supporting the conclusions of this article is available in the GEO repository (http://www.ncbi.nlm.nih.gov/geo, access number GSE 7390).
Authors’ contributions
Conceived and designed the experiments: AM, BH, MS. Analyzed the data: AM, BH, MS. Wrote the manuscript: AM, BH, MS. All authors read and approved the final manuscript;
Authors’ information
None.
Competing interests
The authors declare that they have no competing interests.
Consent for publication
Not applicable.
Ethics approval and consent to participate
Not applicable as data example is reanalysis of publicly available data.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
Authors’ Affiliations
References
 Heagerty PJ, Zheng Y. Survival model predictive accuracy and ROC curves. Biometrics. 2005; 61(1):92–105.View ArticlePubMedGoogle Scholar
 Pepe MS, Zheng Y, Jin Y, Huang Y, Parikh CR, Levy WC. Evaluating the ROC performance of markers for future events. Lifetime Data Anal. 2008; 14(1):86–113.View ArticlePubMedGoogle Scholar
 Tournoud M, Larue A, Cazalis MA, Venet F, Pachot A, Monneret G, Lepape A, Veyrieras JB. A strategy to build and validate a prognostic biomarker model based on rtqpcr gene expression and clinical covariates. BMC Bioinformatics. 2015; 16(1):106.View ArticlePubMedPubMed CentralGoogle Scholar
 Schmid M, Kestler HA, Potapov S. On the validity of timedependent AUC estimators. Brief Bioinform. 2015; 16:153–68.View ArticlePubMedGoogle Scholar
 Weyer V, Binder H. A weighting approach for judging the effect of patient strata on highdimensional risk prediction signatures. BMC Bioinformatics. 2015; 16(1):294.View ArticlePubMedPubMed CentralGoogle Scholar
 Tibshirani R, et al.The lasso method for variable selection in the Cox model. Stat Med. 1997; 16(4):385–95.View ArticlePubMedGoogle Scholar
 Goeman JJ. l _{1} penalized estimation in the cox proportional hazards model. Biom J. 2010; 551(1):70–84.Google Scholar
 Witten DM, Tibshirani R. Survival analysis with highdimensional covariates. Stat Methods Med Res. 2010; 19(1):29–51.View ArticlePubMedGoogle Scholar
 Van Belle V, Pelckmans K, Van Huffel S, Suykens JA. Support vector methods for survival analysis: A comparison between ranking and regression approaches. Artif Intell Med. 2011; 53:107–18.View ArticlePubMedGoogle Scholar
 Ishwaran H, Kogalur UB, Blackstone EH, Lauer MS. Random survival forests. Ann. Appl. Stat. 2008; 2(3):841–60.View ArticleGoogle Scholar
 Harrell FE, Califf RM, Pryor DB, Lee KL, Rosati RA. Evaluating the yield of medical tests. J Am Med Assoc. 1982; 247(18):2543–6.View ArticleGoogle Scholar
 Harrell FE, Lee KL, Califf RM, et al.Regression modeling strategies for improved prognostic prediction. Stat Med. 1984; 3(2):143–52.View ArticlePubMedGoogle Scholar
 Schmid M, Potapov S. A comparison of estimators to evaluate the discriminatory power of timetoevent models. Stat Med. 2012; 31(23):2588–609.View ArticlePubMedGoogle Scholar
 Mayr A, Schmid M. Boosting the concordance index for survival data – a unified framework to derive and evaluate biomarker combinations. PloS ONE. 2014; 9(1):84483.View ArticleGoogle Scholar
 Kattan MW. Evaluating a new markers predictive contribution. Clin Cancer Res. 2004; 10(3):822–4.View ArticlePubMedGoogle Scholar
 Pencina MJ, D’Agostino RB. Overall c as a measure of discrimination in survival analysis: model specific population value and confidence interval estimation. Stat Med. 2004; 23(13):2109–23.View ArticlePubMedGoogle Scholar
 D’Agostino R, Nam BH. Evaluation of the performance of survival analysis models: discrimination and calibration measures. Handb Stat. 2004; 23:1–25.View ArticleGoogle Scholar
 Friedman JH, Hastie T, Tibshirani R. Additive logistic regression: A statistical view of boosting (with discussion). Ann Stat. 2000; 28:337–407.View ArticleGoogle Scholar
 Cai YD, Feng KY, Lu WC, Chou KC. Using logitboost classifier to predict protein structural classes. J Theor Biol. 2006; 238(1):172–6.View ArticlePubMedGoogle Scholar
 Mayr A, Hofner B, Schmid M. The importance of knowing when to stop – a sequential stopping rule for componentwise gradient boosting. Methods Inf Med. 2012; 51(2):178–86.View ArticlePubMedGoogle Scholar
 Wyner AJ, Olson M, Bleich J, Mease D. Explaining the success of adaboost and random forests as interpolating classifiers. 2015. arXiv preprint arXiv:1504.07676. http://arxiv.org/abs/1504.07676.
 Wyatt JC, Altman DG. Commentary: Prognostic models: Clinically useful or quickly forgotten?Br Med J. 1995; 311:1539–41.View ArticleGoogle Scholar
 Meinshausen N, Bühlmann P. Stability selection (with discussion). J R Stat Soc Ser B. 2010; 72:417–73.View ArticleGoogle Scholar
 Shah RD, Samworth RJ. Variable selection with error control: Another look at stability selection. J R Stat Soc Ser B Stat Methodol. 2013; 75(1):55–80.View ArticleGoogle Scholar
 Schmid M, Hothorn T, Krause F, Rabe C. A PAUCbased estimation technique for disease classification and biomarker selection. Stat Appl Genet Mol Biol. 2012; 11(5). doi:http://dx.doi.org/10.1515/15446115.1792.
 Hofner B, Boccuto L, Göker B. Controlling false discoveries in highdimensional situations: Boosting with stability selection. BMC Bioinformatics. 2015; 16(144). doi:http://dx.doi.org/10.1186/s1285901505753.
 Desmedt C, Piette F, Loi S, Wang Y, Lallemand F, HaibeKains B, Viale G, Delorenzi M, Zhang Y, d’Assignies MS, Bergh J, Lidereau R, Ellis P, Harris AL, Klijn JGM, Foekens JA, Cardoso F, Piccart MJ, Buyse M, Sotiriou C. Strong time dependence of the 76gene prognostic signature for nodenegative breast cancer patients in the TRANSBIG multicenter independentvalidation series. Clin Cancer Res. 2007; 13:3207–214.View ArticlePubMedGoogle Scholar
 Uno H, Cai T, Pencina MJ, D’Agostino RB, Wei LJ. On the Cstatistics for evaluating overall adequacy of risk prediction procedures with censored survival data. Stat Med. 2011; 30(10):1105–17.PubMedPubMed CentralGoogle Scholar
 Gerds TA, Kattan MW, Schumacher M, Yu C. Estimating a timedependent concordance index for survival prediction models with covariate dependent censoring. Stat Med. 2013; 32(13):2173–84.View ArticlePubMedGoogle Scholar
 Wang M, Long Q. Addressing issues associated with evaluating prediction models for survival endpoints based on the concordance statistic. Biometrics. 2016. doi:http://dx.doi.org/10.1111/biom.12470.
 Antolini L, Boracchi P, Biganzoli E. A timedependent discrimination index for survival data. Stat Med. 2005; 24(24):3927–44.View ArticlePubMedGoogle Scholar
 Gönen M, Heller G. Concordance probability and discriminatory power in proportional hazards regression. Biometrika. 2005; 92(4):965–70.View ArticleGoogle Scholar
 Song X, Zhou XH. A semiparametric approach for the covariate specific ROC curve with survival outcome. Stat Sinica. 2008; 18(947965):84.Google Scholar
 van der Laan MJ, Robins JM. Unified methods for censored longitudinal data and causality. New York: Springer; 2003.View ArticleGoogle Scholar
 Bühlmann P, Hothorn T. Boosting algorithms: Regularization, prediction and model fitting (with discussion). Stat Sci. 2007; 22:477–522.View ArticleGoogle Scholar
 Mayr A, Binder H, Gefeller O, Schmid M. The evolution of boosting algorithms  from machine learning to statistical modelling. Methods Inf Med. 2014; 53(6):419–27.View ArticlePubMedGoogle Scholar
 Mayr A, Binder H, Gefeller O, Schmid M. Extending statistical boosting  an overview of recent methodological developments. Methods Inf Med. 2014; 53(6):428–35.View ArticlePubMedGoogle Scholar
 Bühlmann P, Yu B. Boosting with the L _{2} loss: Regression and classification. J Am Stat Assoc. 2003; 98:324–38.View ArticleGoogle Scholar
 Fenske N, Kneib T, Hothorn T. Identifying risk factors for severe childhood malnutrition by boosting additive quantile regression. J Am Stat Assoc. 2011; 106(494):494–510.View ArticleGoogle Scholar
 Hofner B, Mayr A, Robinzonov N, Schmid M. Modelbased boosting in R: A handson tutorial using the R package mboost. Comput Stat. 2014; 29:3–35. doi:http://dx.doi.org/10.1007/s0018001203825.View ArticleGoogle Scholar
 Ma S, Huang J. Regularized ROC method for disease classification and biomarker selection with microarray data. Bioinformatics. 2005; 21(24):4356–62.View ArticlePubMedGoogle Scholar
 Shankar J, Szpakowski S, Solis NV, Mounaud S, Liu H, Losada L, Nierman WC, Filler SG. A systematic evaluation of highdimensional, ensemblebased regression for exploring large model spaces in microbiome analyses. BMC Bioinformatics. 2015; 16(1):31.View ArticlePubMedPubMed CentralGoogle Scholar
 Tibshirani R. Regression shrinkage and selection via the lasso. J R Stat Soc Series B. 1996; 58(1):267–88.Google Scholar
 Breiman L. Random forests. Mach Learn. 2001; 45:5–32.View ArticleGoogle Scholar
 R Development Core Team. R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing; 2015. http://www.Rproject.org.Google Scholar
 Hofner B, Hothorn T. Stabs: Stability Selection with Error Control. 2015. R package version 0.51. http://CRAN.Rproject.org/package=stabs.
 Potapov S, Adler W, Schmid M. survAUC: Estimators of Prediction Accuracy for Timetoevent Data. 2012. R package version 1.05. http://CRAN.Rproject.org/package=survAUC.
 Simon N, Friedman J, Hastie T, Tibshirani R, et al.Regularization paths for cox’s proportional hazards model via coordinate descent. J Stat Softw. 2011; 39(5):1–13.View ArticlePubMedPubMed CentralGoogle Scholar
 Ridgeway G. gbm: Generalized Boosted Regression Models. 2010. R package version 1.63.1. http://CRAN.Rproject.org/package=gbm.
 Binder H. CoxBoost: Cox Models by Likelihoodbased Boosting for a Single Survival Endpoint or Competing Risks. 2013. R package version 1.4. http://CRAN.Rproject.org/package=CoxBoost.
 Graf E, Schmoor C, Sauerbrei W, Schumacher M. Assessment and comparison of prognostic classification schemes for survival data. Stat Med. 1999; 18(1718):2529–45.View ArticlePubMedGoogle Scholar
 Porzelius C, Binder H. Peperr: Parallelised Estimation of Prediction Error. 2013. R package version 1.17. http://CRAN.Rproject.org/package=peperr.
 Mogensen UB, Ishwaran H, Gerds TA. Evaluating random forests for survival analysis using prediction error curves. J Stat Softw. 2012; 50(11):1–23.View ArticlePubMedPubMed CentralGoogle Scholar
 Klein JP, Moeschberger ML. Survival analysis: techniques for censored and truncated data, 2nd edn. New York: Springer; 2003.Google Scholar
 Wang Y, Klijn JG, Zhang Y, Sieuwerts AM, Look MP, Yang F, Talantov D, Timmermans M, Meijervan Gelder ME, Yu J, Jatkoe T, Berns EM, Atkins D, Foekens JA. Geneexpression profiles to predict distant metastasis of lymphnodenegative primary breast cancer. Lancet. 2005; 365(9460):671–9.View ArticlePubMedGoogle Scholar
 Hothorn T. Discussion: Stability selection. J R Stat Soc Ser B. 2010; 72:463–4.Google Scholar
 Chen Y, Jia Z, Mercola D, Xie X. A gradient boosting algorithm for survival analysis via direct optimization of concordance index. Comput Math Methods Med. 2013; 2013. doi:http://dx.doi.org/10.1155/2013/873595.
 Kent JT, O’Quigley J. Measures of dependence for censored survival data. Biometrika. 1988; 75(3):525–34.View ArticleGoogle Scholar
 O’Quigley J, Xu R, Stare J. Explained randomness in proportional hazards models. Stat Med. 2005; 24(3):479–89.View ArticlePubMedGoogle Scholar
 Schmid M, Hielscher T, Augustin T, Gefeller O. A robust alternative to the Schemper–Henderson estimator of prediction error. Biometrics. 2011; 67(2):524–35.View ArticlePubMedGoogle Scholar
 Casalicchio G, Bischl B, Boulesteix AL, Schmid M. The residualbased predictiveness curve: A visual tool to assess the performance of prediction models. Biometrics. 2015. doi:http://dx.doi.org/10.1111/biom.12455.
 Boulesteix AL, Hothorn T. Testing the additional predictive value of highdimensional molecular data. BMC Bioinformatics. 2010; 11(78). doi:http://dx.doi.org/10.1186/147121051178.
 Mayr A, Schmid M, Pfahlberg A, Uter W, Gefeller O. A permutation test to analyse systematic bias and random measurement errors of medical devices via boosting location and scale models. Stat Methods Med Res. 2015. [Epub ahead of print].Google Scholar
 Dudoit S, Shaffer JP, Boldrick JC. Multiple hypothesis testing in microarray experiments. Stat Sci. 2003; 18(1):71–103.View ArticleGoogle Scholar