BMC Bioinformatics BioMed Central Methodology article Flexible boosting of accelerated failure time models

Background When boosting algorithms are used for building survival models from high-dimensional data, it is common to fit a Cox proportional hazards model or to use least squares techniques for fitting semiparametric accelerated failure time models. There are cases, however, where fitting a fully parametric accelerated failure time model is a good alternative to these methods, especially when the proportional hazards assumption is not justified. Boosting algorithms for the estimation of parametric accelerated failure time models have not been developed so far, since these models require the estimation of a model-specific scale parameter which traditional boosting algorithms are not able to deal with. Results We introduce a new boosting algorithm for censored time-to-event data which is suitable for fitting parametric accelerated failure time models. Estimation of the predictor function is carried out simultaneously with the estimation of the scale parameter, so that the negative log likelihood of the survival distribution can be used as a loss function for the boosting algorithm. The estimation of the scale parameter does not affect the favorable properties of boosting with respect to variable selection. Conclusion The analysis of a high-dimensional set of microarray data demonstrates that the new algorithm is able to outperform boosting with the Cox partial likelihood when the proportional hazards assumption is questionable. In low-dimensional settings, i.e., when classical likelihood estimation of a parametric accelerated failure time model is possible, simulations show that the new boosting algorithm closely approximates the estimates obtained from the maximum likelihood method.


Background
Predicting the expected time to event from a high-dimensional set of predictor variables has become increasingly important in the last years. A particularly interesting problem in this context is the analysis of studies relating patients' genotypes, for example measured via gene expression levels, to a clinical outcome such as "disease free survival" or "time to progression". Survival models of this type share the common problems that are typical for the analysis of gene expression data: Sample sizes are small while the number of potential predictors (i.e., gene expression levels) is extremely large. As a consequence, standard estimation techniques can not be applied any more.
For these reasons, a variety of new methods for obtaining survival predictions from high-dimensional data have been suggested in the literature. Most of these methods are focused on the Cox proportional hazards model [1], while some other methods have been developed for fitting semiparametric accelerated failure time (AFT) models [2] in high-dimensional settings. Tibshirani [3], Gui and Li [4], Park and Hastie [5], and Zou [6] introduced Lassolike algorithms for minimizing L 1 penalized versions of the Cox partial likelihood. Due to the structure of the L 1 penalty, these methods have the advantage that variable selection is carried out simultaneously with parameter estimation. Li and Luan [7] introduced a technique for maximizing the L 2 penalized Cox partial likelihood, which (in contrast to methods using the L 1 penalty) does not carry out variable selection but includes all predictors. On the other hand, several authors developed methods for fitting semiparametric AFT models in high-dimensional data settings: While Huang and Harrington [8] and Wang et al. [9] considered modifications of the Buckley-James method [10], Huang et al. [11] and Datta et al. [12] developed algorithms based on a censoring-adjusted penalized least squares loss function. In addition to penalized estimation techniques, there are various strategies for reducing the dimensionality of microarray data before building an unpenalized survival model, see Schumacher et al. [13], Bovelstad et al. [14], and van Wieringen et al. [15] for overviews of this topic.
In this paper the focus is on gradient boosting [16,17], which is an alternative method for fitting survival models in high-dimensional data settings. Similar to the Lasso, boosting has a built-in variable selection mechanism which is carried out simultaneously with the estimation of the prediction function. Although being connected to the Lasso (see Efron et al. [18]), boosting algorithms are not primarily designed for the maximization of penalized (partial) likelihood functions but can rather be interpreted as a method for minimizing convex loss functions via gradient descent techniques. In each step of a boosting algorithm, a so-called base-learner (e.g., a linear regression model) is fitted to the negative gradient of a pre-specified loss function. The current estimate of the prediction function is then updated with the newly obtained estimate of the gradient. As the base-learner can be modified such that only one covariate is used for estimating the gradient in each step (leading to the so-called component-wise baselearner), variable selection is carried out at each iteration.
Originally introduced for classification problems by Freund and Schapire [19,20], boosting has developed into a computationally efficient technique for fitting many types of regression models in high-dimensional data settings. In principle, the boosting loss function can be any negative log likelihood function of some exponential family. Vari-ous authors have shown that the method tends to be promising with respect to both variable selection and prediction accuracy [16,17,21,22].
Concerning the prediction of survival outcomes from gene expression data, the development of boosting algorithms has so far focused on the same model families as L 1 and L 2 penalized estimation techniques, namely the Cox proportional hazards model and the semiparametric AFT model. Ridgeway [23], Li and Luan [24], and Binder and Schumacher [25] used boosting algorithms with the negative partial log likelihood loss function while Hothorn et al. [26] introduced a boosting algorithm for fitting semiparametric AFT models. Similar to Huang et al. [11], Hothorn et al. [26] used censoring weights for adjusting the least squares objective function.
While the proportional hazards model is the most frequently used survival model in biostatistics and while fitting semiparametric AFT models is a robust method for survival prediction when there is unknown heterogeneity in the data, application of both types of analyses is still inadequate in a number of situations. Instead, it might be advisable to fit a fully parametric AFT model with an explicitly specified distribution of the survival outcome, such as the Weibull or the log-logistic distribution [2].
As an example we consider a high-dimensional set of microarray data originally analyzed in a classification context by Barrier et al. [27]. The authors collected a sample of 50 patients operated on for a stage II colon adenocarcinoma. 25 patients developed a metachronuous metastasis, whereas the other 25 patients remained disease free for at least 60 months. For each patient the expressions of 22,283 genes were obtained. Barrier et al. [27] selected a sample consisting of the 30 most differentially expressed genes between the disease and the disease-free group. By applying diagonal linear discriminant analysis based on the expressions of the 30 genes, the authors achieved a high prediction accuracy (76.3%) for the two patient groups. In the following we will address the equally important problem of modeling the time to development of metachronuous metastasis.
Figs. 1 and 2 show the results of a preliminary survival analysis of the Barrier stage II colon cancer data: Here only the 14 most differentially expressed genes between the disease and the disease-free group (i.e., the genes with pvalues less than or equal to 0.005) were used as predictor variables. The Cox-Snell residuals shown in Fig. 1 suggest that a parametric AFT model with either a log-logistic or a lognormal distribution fits the data best. The Cox proportional hazards model does not seem to be appropriate, which is further confirmed by the results shown in Fig. 2: Here the parameters of a stratified Cox model were esti-mated, where the strata were defined by splitting the values of the most overexpressed gene in the disease group at their median. The two baseline hazard functions shown in Fig. 2 cross, so the proportional hazards assumption seems to be violated.
Although the preliminary analysis is by no means sufficient for building a survival model from the Barrier data, we have nevertheless gained valuable insight into the distribution of the survival times. How should we incorporate this "prior knowledge" into a boosting algorithm? Fig. 2 suggests that using the Cox partial likelihood as a loss function for boosting is at least questionable, since the Cox model is known to be sensitive with respect to violations of the proportional hazards assumption (see Schemper [28]). On the other hand, Fig. 1 suggests that the survival times of the Barrier data either follow a loglogistic distribution or a lognormal distribution, so fitting a semiparametric AFT model without any distributional assumption might be inefficient (given that we only have 50 observations with 50% of them being censored). In order to account for these issues, we focus on the development of a boosting algorithm for parametric AFT models. An algorithm of this type has not yet been developed in the literature, since AFT models include a scale parameter for modeling the variance of the error distribution in the regression equation. With maximum likelihood estimation of AFT models, this scale parameter is estimated Cox-Snell residuals obtained from fitting various survival models to the Barrier data Figure 1 Cox-Snell residuals obtained from fitting various survival models to the Barrier data. The upper left panel shows the Cox-Snell residuals of a semiparametric Cox model vs. the Nelson-Aalen estimate of their cumulative hazard function. Estimates were obtained from fitting a Cox proportional hazards model to the Barrier data via maximization of the partial log likelihood. The 14 most differentially expressed genes between the disease and the disease-free group were used as predictor variables. The other panels show the Cox-Snell residuals (together with their cumulative hazard function) obtained from fitting various parametric AFT models to the same data via maximum likelihood estimation. Obviously, the lines corresponding to the Cox-Snell residuals of the log-logistic and lognormal models are closest to the line through the origin, indicating that these models fit the data best. By contrast, the Cox model and the Weibull model (which both assume proportional hazards) do not seem to fit the data well, indicating that the proportional hazards assumption is violated.

Lognormal model
simultaneously with the regression parameters. Boosting algorithms, however, have so far not been able to deal with scale parameters but only with the prediction function, i.e., with the regression parameters. This is the reason why it has not been possible yet to use the negative log likelihood function of an AFT model as a loss function for boosting. In fact, boosting Cox models and semiparametric AFT models was only possible because these methods do not include a scale parameter.
In the following we will introduce a new boosting algorithm that allows for simultaneous estimation of both the prediction function and the scale parameter in parametric AFT models. This algorithm uses the negative log likelihood of the AFT model as a loss function and works in a stepwise fashion. After starting with some offset values of the regression and scale parameters, a component-wise base-learning procedure is applied to the negative gradient in each iteration, and an update of the prediction function is obtained. Afterwards, the scale parameter is reestimated in each iteration by plugging the current estimate of the prediction function into the loss function and by minimizing the loss function over the scale parameter. As a base-learning procedure we use the classical linear least squares approach, i.e., the final model can be interpreted as a survival fit depending on a linear predictor.
The characteristics of the new algorithm are first investigated by means of a simulation study with artificial data. It is shown that variable selection is carried out efficiently, and that the algorithm has a high predictive power if compared to the "null models" with no covariates at all. Also, the regression estimates of boosting in low-dimensional settings turn out to be almost identical to the corresponding maximum likelihood estimates. To show the practical applicability of the new algorithm, we carry out a detailed study on the above introduced data set by Barrier et al. [27]. Evaluating the performance measures of this study suggests that boosting with the negative log likelihood of a parametric AFT model is able to outperform boosting with the Cox partial log likelihood, at least in case of the Barrier data.

Estimation problem
Consider a set of realizations of i.i.d. random variables (T 1 , C 1 , X 1 ),..., (T n , C n , X n ), where T 1 ,..., T n are one-dimensional survival times, C 1 ,..., C n are one-dimensional censoring times, and X 1 ,..., X n are p-dimensional vectors of covariates. It is assumed that some of the survival times This corresponds to the classical case of "random censoring". The objective is to fit the accelerated failure time model where (T, X) follows the same distribution as each of the (T i , X i ), i = 1,...,n, and where W is a random noise variable independent of X. The function f is an unknown prediction function of log(T), and σ is an unknown scale parameter that controls the amount of noise added to the prediction function f. Typically, f is a linear or additive function of the covariates X.   standard extreme value distribution, T (given X) follows a Weibull distribution with parameters λ:= exp(-f/σ) and α:= 1/σ (cf. Klein and Moeschberger [29]). Other popular examples of parametric AFT models are the log-logistic distribution for T|X (with W following a standard logistic distribution) and the lognormal distribution for T|X (with W following a standard normal distribution). It is important to note that the latter two models explicitly allow for non-proportional hazards. The Weibull distribution is a parametric example of the well-known Cox proportional hazards model (which does not necessarily assume a regression equation such as (1) but is instead based on a semiparametric model for the hazard function of T|X).
Classically, the estimation of f and σ in a parametric AFT model is performed by maximizing the log likelihood function Semiparametric AFT models leave the parameter σ and the distribution of the noise variable W in Model (1) unspecified. Instead of using maximum likelihood techniques, f is estimated via minimization of a censoring-adjusted least squares criterion [2,10,30].

FGD boosting with component-wise linear least squares and scale parameter estimation
In the following we will use boosting techniques for obtaining precise estimates of Model (1) when the number of covariates is large. We start by considering a boosting algorithm known as "component-wise linear functional gradient descent (FGD)" [16,17,21,31]. The objective of FGD is to obtain the real-valued prediction function where the one-dimensional function ρ is a convex loss function that is assumed to be differentiable with respect to f. Estimation of f* is performed by minimizing the empirical risk with respect to f. Component-wise FGD works as follows: The above algorithm can be interpreted as a negative gradient descent algorithm in function space. In each step, an estimate of the true negative gradient of the loss function is added to the current estimate of f*. Thus, a structural (regression) relationship between Y and the covariate vector X is established (for details on the characteristics of FGD we refer to Bühlmann and Hothorn [22]). Moreover, FGD carries out variable selection in each iteration, as only one component of X is selected in Step 4. This property makes the algorithm applicable even if p > n. Due to the additive structure in Step 5, the final boosting estimate at iteration m stop can be interpreted as a linear model fit but will typically depend on only a subset of the p components of X.
As we want to use FGD for the estimation of parametric AFT models, we set ρ equal to the negative log likelihood function specified in (2). However, in this case, the standard FGD algorithm presented above can not be applied, as (2) includes an additional scale parameter σ that has to be estimated simultaneously with f. We therefore extend the classical FGD algorithm in the following way: ,..., It is easily seen that the modified FGD algorithm estimates f* and σ in a stepwise fashion: In Steps 3 and 4, f* is estimated for a given value of σ, while in Step 5, σ is estimated given the current estimate of f*. It is also clear from Step 4 that the built-in variable selection mechanism of FGD is not affected by the additional estimation of the scale parameter σ. The value of the stopping iteration m stop is the main tuning parameter of FGD. In the following we will throughout use five-fold cross-validation for determining the value of m stop , i.e., m stop is the iteration with lowest predictive risk. The choice of the step-length factor ν has been shown to be of minor importance for the predictive performance of a boosting algorithm. The only requirement is that the value of ν is "small", such that a stagewise adaption of the true prediction function f* is possible (see Bühlmann and Hothorn [22]). We therefore set ν = 0.1. As a goodness-of-fit criterion in Step 4 we use the R 2 measure of explained variation which is known from linear regression.

Measures for the predictive power of boosting techniques
After having built a survival model from a set of microarray data, it is essential to evaluate the predictive performance of the model. To this purpose, various measures of predictive accuracy have been proposed in the literature [32][33][34][35][36][37]. Since none of these measures has been adopted as a standard so far, we use the following strategy for measuring the performance of a survival prediction rule:

a) Log-Likelihood
If the objective is to compare prediction rules obtained from the same model family (such as the family of Weibull distributions), we use the predictive log likelihood or partial log likelihood as a measure of predictive accuracy. The predictive log likelihood or partial log likelihood is defined as follows: Suppose that the parameter vector θ of a survival model has been estimated from a training sample, and that a test sample (T k , C k , X k ), k = 1,...,n test , is used for evaluating the predictive accuracy of the model. Denote the log likelihood or partial log likelihood function of the survival model by l θ (T, C, X). The predictive log likelihood is then given by where the parameter estimate has been plugged into the log likelihood or partial log likelihood of the test sample. In case of a parametric AFT model we have θ = (f, σ), whereas in case of a Cox model, θ is the prediction function f used for modeling the hazard rate of T|X. In a boosting context it is particularly advisable to use (4) as a measure of predictive accuracy, since the negative log likelihood or partial log likelihood is used as a loss function for the corresponding boosting algorithms. As a consequence, the loss function is measured on the same scale as the predictive accuracy (4).

b) Brier-Score
If the objective is to compare prediction rules obtained from different model families or estimation techniques, it is no longer possible to use the predictive log likelihood as a measure of predictive accuracy. This is because the likelihood and partial likelihood functions of different model families are usually measured on different scales, so they can not be compared directly. Moreover, many nonparametric and semiparametric estimation techniques do not involve likelihood functions at all, implying that a predictive likelihood value can not be computed.
In case of the Barrier data we will compare the performance of various parametric and semiparametric estimation techniques by using the Brier score [34,37] as a measure of predictive accuracy. The Brier score is defined as the time-dependent squared distance between the predicted survival probability and the true state of an observation. Since the Brier score is not based on any specific model assumptions, the predictive accuracy of various model families and estimation techniques can be measured on the same scale. Also, possible misspecifications of a survival model are taken into account [37]. In this paper we use the methodology of Gerds and Schumacher [38] who combined the evaluation of the Brier score with the 0.632+ estimator developed by [39]. This methodology has been used previously for predicting survival outcomes from microarray data sets [25]. Denote the cross-validated error rate, i.e., the mean or median of the out-of-bag error rates Err b (t), b = 1,...,B, by .
Both (5) and (7) are estimators of the true prediction error E[Γ(t) -S(t, X)] 2 , where S(t, X) is the true survival function of T|X at time point t. Since (7) is an upward-biased estimator and (5) is a downward-biased estimator of the true prediction error (see Gerds and Schumacher [38]), we use a linear combination of err(t) and as the overall measure of accuracy at time point t: Defining ω(t) := 0.632/(1 -0.368R(t)) with and yields the 0.632+ estimator of the Brier score at time point t. For a detailed derivation of (9) we refer to Gerds and Schumacher [38] and Efron and Tibshirani [39]. In case of the Barrier data, we evaluate (9) at each time point t > 0 up to a survival time of five years (i.e., t max = 60 months). This yields a prediction error curve depending on t, where 0 <t ≤ t max .

Results and Discussion
In this section we investigate the properties of the new boosting algorithm for parametric AFT models. We first conduct a simulation study with artificial data. Afterwards, we investigate the ability of boosting to predict survival outcomes from gene expression data. This is done by conducting a benchmark study on the Barrier stage II colon cancer data. All computations were carried out with the R system for statistical computing (version 2.6.2, [40]) using a modification of the glmboost() function in package mboost (version 1.0-1, [41]).

Simulation study with artificial data
We carried out a simulation study on linear AFT models with five covariates, i.e., we considered the model log(T) = β 1 X 1 + ... + β 1 X 5 + σW, where X 1 ,...,X 5 are jointly normally distributed covariates with parameters E(X k ) = 0, var(X k ) = 1, and cov(X k , X l ) = 0.5, k, l = 1,..  In a first step, we compared the estimates of β obtained from the new boosting algorithm with the corresponding estimates obtained from standard maximum likelihood estimation. For each of the three distributions of W, we generated 50 i.i.d. data sets (of size n = 100 each) from Model (10). Censoring was introduced to the data by simulating 50 additional i.i.d. data sets (of size 100 each) from Model (10). The values of the dependent variables in these additional data sets constituted the censoring times of the corresponding first 50 data sets. This implied that (a) the censoring times followed the same distribution as the survival times but were independent from the latter, and that (b) about 50% of the observations in each data set were censored on average. The new boosting algorithm with the corresponding negative log likelihood loss function was applied to each data set. Note that the final boosting estimate can be interpreted as a linear model fit (depending on a parameter estimate of β), since we used component-wise linear base-learners. We additionally used standard maximum likelihood techniques for estimating the parameters of Model (10) from the 50 data sets.
The boxplots of the parameter estimates of the Weibull model are shown in Fig. 3. Obviously, the parameter estimates obtained from boosting are very similar to the parameter estimates obtained from maximum likelihood estimation. Similar results were obtained for the log-logistic and lognormal models. This can also be seen from the Spearman correlations between the parameter estimates shown in Table 1. We next compared the ability of boosting and of standard maximum likelihood techniques to select a subset of informative covariates out of a larger set of covariates. To this purpose, we extended the set of variables used for the previous simulation study by an additional set of 15 independent standard normally distributed covariates with parameters β 6 = ... = β 20 = 0. Fig. 4 shows the resulting parameter estimates obtained from boosting and from maximum likelihood estimation. Obviously, all estimates of the "non-informative" param-eters β 6 ,...,β 20 are close to the true value of 0. In 33.7% of all 15 · 50 cases, the boosting parameter estimates of β 6 ,...,β 20 are exactly 0, which means that non-informative covariates have not been selected. The corresponding percentage rates were 32.1% and 21.5% for the log-logistic and the lognormal model, respectively. In addition, the boosting parameter estimates corresponding to β 6 ,...,β 20 have a smaller variance than the corresponding maximum likelihood estimates. For these reasons, boosting seems to be preferable to likelihood estimation when it comes to variable selection. It can also be seen from Fig. 4 that the boosting estimates of the non-zero parameters β 1 ,...,β 5 are (slightly) downward-biased. This shrinking effect reflects the well-known bias-variance trade-off which is common to boosting techniques.
We finally investigated the predictive performance of the new boosting algorithm and of maximum likelihood estimation techniques. To this purpose, we used the model estimates obtained from the 50 data sets for predicting the likelihood of 50 additionally generated i.i.d. test samples of sample size ntest = 100 each (which also followed Model (10)). The corresponding predictive log likelihood values, which are shown in Fig. 5, suggest that the predic-var( ) var( ) var( ) . , β β β β σ 1 1 1 5 1 1 1 5 Boxplots of the Weibull parameter estimates when 5 inform-ative covariates are present

Coefficient estimates, 5 informative covariates
tive performance of boosting is better than the predictive performance of maximum likelihood estimation. A Wilcoxon paired-sample test for the predictive log likelihood values of boosting and maximum likelihood estimation resulted in a p-value of less than 0.001. For the log-logistic and lognormal models, the respective p-values were also smaller than 0.001.

Benchmark experiment on stage II colon cancer data
In order to show the practical applicability of the new boosting algorithm for parametric AFT models, we conducted a benchmark study on the Barrier stage II colon cancer data [27]. Again we used (1) the negative Weibull log likelihood, (2) the negative log-logistic log likelihood, and (3) the negative lognormal log likelihood as loss functions for boosting. To compare boosting for parametric AFT models with other estimation techniques, we additionally fitted survival models by using (4) L 2 Boosting for semiparametric AFT models [26], (5) L 1 penalized estimation for semiparametric AFT models [11] (using the lars() function in R package lars [42]), (6) gradient boosting with the negative Cox partial log likelihood loss [23], (7) L 1 penalized Cox partial likelihood estimation (using the coxpath() function in R package glmpath [43]), and (8) nonparametric estimation of the survival function via the Kaplan-Meier estimator. The latter estimator corresponds to the non-informative "null model" with no covariates at all.
We applied the eight estimation techniques to 50 bootstrap training samples generated from the Barrier data. Prediction errors were obtained by using the 0.632+ methodology, as described in the Methods section. For computational reasons we reduced the predictor space for the L 1 penalized methods (5) and (7), i.e., we used the Boxplots of the predictive Weibull log likelihood estimates  Boxplots of the Weibull parameter estimates when 5 inform-ative and 15 additional non-informative covariates are present  Coefficient estimates, 15 additional covariates 10,218 most differentially expressed genes between the disease and the disease-free group instead of all 22,283 genes. The subset of 10,218 genes corresponds to those genes whose differences between the disease and the disease-free group are significant at a level of α = 0.2. This reduction of the predictor space did not have any negative effect on the predictive performance of the two L 1 penalized techniques (see later). The tuning parameters of all eight methods were determined by using five-fold cross validation.
The survival functions of the eight methods, which are needed for computing the Brier score, were estimated as follows: For the "parametric" boosting methods (1), (2), and (3), we estimated the survival functions by plugging the estimated prediction function into the Weibull, loglogistic, and lognormal survival functions, respectively. In case of L 2 Boosting and L 1 penalized estimation for semiparametric AFT models (methods (4) and (5)) we made use of the following relationship: where ε := σW. We then estimated the probability on the right-hand side of (12) by applying the Kaplan-Meier estimator to the residuals obtained from the bootstrap training samples and by evaluating the Kaplan-Meier survival functions at "time points" . In case of boosting with the Cox partial log likelihood loss and L 1 penalized Cox regression (methods (6) and (7)) we used the estimated survival function where is the Breslow estimator of the cumulative baseline hazard of the survival times. Fig. 6 shows the median prediction error curves corresponding to the parametric AFT boosting techniques (1), (2), and (3). Obviously, the prediction error curves cross, so the predictive performance of the models depends on the time point t under consideration. However, for most values of t, boosting with the negative log-logistic or negative lognormal log likelihood loss yields the smallest prediction error. In Fig. 7 the median prediction error curve corresponding to the log-logistic model is compared to the median prediction error curves obtained from the two techniques for semiparametric AFT models (4) and (5).
Here the parametric AFT model seems to have the best predictive performance, indicating that the efficiency loss due to the semiparametric estimation of the AFT model is relatively large. In Fig. 8 the median prediction error curve of the log-logistic model is compared to the median prediction error curves obtained from the two partial log likelihood techniques (6) and (7), as well as to the median prediction error curve obtained from the "null model" (8). Again we see that boosting with the negative loglogistic log likelihood has the best predictive performance.
We finally computed the Cox-Snell residuals from the boosting estimates based on the complete data set. The residuals, which are shown in Fig. 9, confirm the results obtained from the preliminary analysis of the Barrier data presented in the Background section: Similar to Fig. 1, we see that the log-logistic model fits the data well, while the Cox proportional hazards model seems to be inadequate here.

Conclusion
By introducing a boosting algorithm for parametric AFT models we have extended the methodology for modeling survival times in high-dimensional data settings. Boosting algorithms for survival data have previously been developed for the Cox proportional hazards model [23,24] and for semiparametric AFT models [26]. An extension of these algorithms to parametric AFT models has not been Analysis of the Barrier stage II colon cancer data -prediction error curves for various parametric AFT models Figure 6 Analysis of the Barrier stage II colon cancer dataprediction error curves for various parametric AFT models. Prediction error curves obtained from boosting with the negative log-logistic log likelihood, boosting with the negative Weibull log likelihood, and boosting with the negative lognormal log likelihood. possible so far, since parametric AFT models depend on a scale parameter which has to be estimated simultaneously with the prediction function. To overcome this problem we have developed a flexible boosting algorithm which is able to deal with loss functions depending on both the prediction function and a scale parameter. As a consequence, the negative log likelihood function of an AFT model can be used as a loss function for the component-wise functional gradient descent boosting algorithm.
The simulation study on various parametric AFT models has shown that the favorable properties of boosting with respect to variable selection are left untouched by the additional estimation of the scale parameter. Moreover, when sample sizes are small (such that a semiparametric estimation of AFT models becomes inefficient) or when the proportional hazards assumption of the survival times is violated, boosting techniques for parametric AFT models seem to lead to an increase in prediction accuracy. This is suggested by the results obtained from the benchmark study on Barrier's stage II colon cancer data. We finally point out that we have so far focused exclusively on boosting with the component-wise linear base-learning procedure. This procedure fits a set of simple linear regression models to the negative gradient in each boosting iteration.
We have used component-wise linear least squares mainly because of their computational efficiency. In fact, component-wise linear least squares base-learners are highly efficient even when the number of predictors is extremely large (such as in case of Barrier's stage II colon cancer data, where p > 22,000). In principle, however, the new boosting algorithm can also be extended to additive models with smooth components. This can, for example, be done by using component-wise smoothing splines (see Bühlmann and Yu [17]) or P-splines as base-learners. Moreover, the boosting algorithm developed in this paper is not exclusively designed for fitting parametric AFT models but could also be used in combination with other likelihoodbased loss functions depending on a scale parameter. The properties of these extensions still have to be investigated and constitute an issue for future research.

Authors' contributions
MS developed the algorithm, conducted the benchmark experiments, and wrote the initial version of the manuscript. TH is the primary author of the mboost package, contributed to the design of the benchmark study and revised the manuscript.
Analysis of the Barrier stage II colon cancer data -prediction error curves for various survival models Figure 8 Analysis of the Barrier stage II colon cancer dataprediction error curves for various survival models. Prediction error curves obtained from boosting with the negative log-logistic log likelihood, boosting with the negative Cox partial log likelihood, L 1 penalized estimation of a Cox proportional hazards model (CoxPath), and nonparametric estimation via the Kaplan-Meier estimator. Analysis of the Barrier stage II colon cancer data -prediction error curves for parametric and semiparametric AFT models Figure 7 Analysis of the Barrier stage II colon cancer dataprediction error curves for parametric and semiparametric AFT models. Prediction error curves obtained from boosting with the negative log-logistic log likelihood, L 2 Boosting for semiparametric AFT models, and L 1 penalized estimation for semiparametric AFT models (Lasso).