Estimation problem
Consider a set of realizations of i.i.d. random variables (T_{1}, C_{1}, X_{1}),..., (T_{n}, C_{n}, X_{n}), where T_{1},..., T_{n} are one-dimensional survival times, C_{1},..., C_{n} are one-dimensional censoring times, and X_{1},..., X_{n} are p-dimensional vectors of covariates. It is assumed that some of the survival times T_{1},..., T_{n} are right-censored, so that only the random variables {\tilde{Y}}_{i} := min{T_{i}, C_{i}}, i = 1,..., n, are observable. This implies that the available data consist of realizations of {\tilde{Y}}_{i} and of the set of censoring indicators δ_{i} ∈ {0, 1}, where δ_{i} = 0 if observation i is censored and δ_{i} = 1 if the complete survival time T_{i} has been observed. It is further assumed that the C_{i}, i = 1,...,n, are independent of (T_{i}, X_{i}), i = 1,...,n. This corresponds to the classical case of "random censoring". The objective is to fit the accelerated failure time (AFT) model
\log(T) = f(X) + \sigma W,
(1)
where (T, X) follows the same distribution as each of the (T_{i}, X_{i}), i = 1,...,n, and where W is a random noise variable independent of X. The function f is an unknown prediction function of log(T), and σ is an unknown scale parameter that controls the amount of noise added to the prediction function f. Typically, f is a linear or additive function of the covariates X.
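To make the data structure concrete, the following sketch simulates right-censored observations from Model (1) with a Weibull noise distribution. The sample size, the sparse linear choice of f, and the exponential censoring distribution are illustrative assumptions of ours, not part of the model.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 200, 10
sigma = 0.5

# Covariates and a sparse linear prediction function f(X) = X @ beta
X = rng.normal(size=(n, p))
beta = np.zeros(p)
beta[:3] = [1.0, -0.5, 0.5]
f_X = X @ beta

# Weibull AFT: W follows a standard (minimum) extreme value distribution,
# since log(U) for U ~ Exp(1) has exactly that distribution.
W = np.log(rng.exponential(size=n))
T = np.exp(f_X + sigma * W)            # survival times, T | X is Weibull

# Censoring times independent of (T, X): the case of "random censoring"
C = rng.exponential(scale=np.exp(np.median(f_X)) * 3, size=n)

Y_tilde = np.minimum(T, C)             # observable follow-up times
delta = (T <= C).astype(int)           # 1 = event observed, 0 = censored
```

Only `Y_tilde`, `delta`, and `X` would be available to the analyst; `T` and `C` are never jointly observed.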
The distributional assumption on the noise variable W determines the distributional form of the survival time T. If this distribution is fully specified, we refer to Model (1) as a parametric AFT model. For example, if W follows a standard extreme value distribution, T (given X) follows a Weibull distribution with parameters λ := exp(-f/σ) and α := 1/σ (cf. Klein and Moeschberger [29]). Other popular examples of parametric AFT models are the log-logistic distribution for T|X (with W following a standard logistic distribution) and the log-normal distribution for T|X (with W following a standard normal distribution). It is important to note that the latter two models explicitly allow for non-proportional hazards. The Weibull distribution is a parametric example of the well-known Cox proportional hazards model (which does not necessarily assume a regression equation such as (1) but is instead based on a semiparametric model for the hazard function of T|X).
Classically, the estimation of f and σ in a parametric AFT model is performed by maximizing the log likelihood function
\sum_{i=1}^{n} \delta_i \left[ -\log\sigma + \log f_W\!\left(\frac{Y_i - f(X_i)}{\sigma}\right) \right] + (1 - \delta_i) \left[ \log S_W\!\left(\frac{Y_i - f(X_i)}{\sigma}\right) \right],
(2)
where Y_{i} := log(min{T_{i}, C_{i}}) is the logarithm of {\tilde{Y}}_{i}. The functions f_{W} and S_{W} denote the probability density and survival functions of the noise variable W, respectively (cf. Klein and Moeschberger [29], Chapter 12).
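As an illustration, the negative of (2) can be evaluated in closed form for the Weibull AFT model, where the standard extreme value noise has f_W(w) = exp(w - e^w) and S_W(w) = exp(-e^w). The function below is a minimal sketch under these assumptions, not code from the original analysis.

```python
import numpy as np

def neg_log_lik_weibull_aft(Y, f_X, delta, sigma):
    """Negative of the log likelihood (2) for the Weibull AFT model.
    Y     : log observed times, Y_i = log(min{T_i, C_i})
    f_X   : current values of the prediction function f(X_i)
    delta : censoring indicators (1 = event observed, 0 = censored)
    sigma : scale parameter"""
    w = (Y - f_X) / sigma
    log_f_W = w - np.exp(w)        # log density of the extreme value noise
    log_S_W = -np.exp(w)           # log survival function of the noise
    ll = np.sum(delta * (-np.log(sigma) + log_f_W) + (1 - delta) * log_S_W)
    return -ll
```

Maximum likelihood estimation of (f, σ) amounts to minimizing this quantity.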
Semiparametric AFT models leave the parameter σ and the distribution of the noise variable W in Model (1) unspecified. Instead of using maximum likelihood techniques, f is estimated via minimization of a censoring-adjusted least squares criterion [2, 10, 30].
FGD boosting with componentwise linear least squares and scale parameter estimation
In the following we will use boosting techniques for obtaining precise estimates of Model (1) when the number of covariates is large. We start by considering a boosting algorithm known as "componentwise linear functional gradient descent (FGD)" [16, 17, 21, 31]. The objective of FGD is to obtain the real-valued prediction function
f^* := \operatorname{argmin}_{f(\cdot)} \mathbb{E}\left[\rho(Y, f(X))\right],
(3)
where the one-dimensional function ρ is a convex loss function that is assumed to be differentiable with respect to f. Estimation of f* is performed by minimizing the empirical risk \sum_{i=1}^{n} \rho(Y_i, f(X_i)) with respect to f. Componentwise FGD works as follows:

1.
Initialize the n-dimensional vector {\widehat{f}}^{[0]} with an offset value, e.g., {\widehat{f}}^{[0]}\equiv 0. Set m = 0.

2.
Increase m by 1. Compute the negative gradient -\frac{\partial}{\partial f}\rho(Y, f) and evaluate it at {\widehat{f}}^{[m-1]}(X_{i}), i = 1,...,n. This yields the negative gradient vector
U^{[m-1]} = \left(U_i^{[m-1]}\right)_{i=1,\ldots,n} := \left(-\frac{\partial}{\partial f}\rho(Y, f)\Big|_{Y=Y_i,\, f=\widehat{f}^{[m-1]}(X_i)}\right)_{i=1,\ldots,n}.

3.
Fit the negative gradient vector U^{[m-1]} to each of the p components of X (i.e., to each covariate) separately, using p simple linear regression fits. This yields p estimates of the negative gradient vector U^{[m-1]}.

4.
Select the component of X which fits U^{[m-1]} best according to a pre-specified goodness-of-fit criterion. Set {\widehat{U}}^{[m-1]} equal to the fitted values from the corresponding best model fitted in Step 3.

5.
Update {\widehat{f}}^{[m]} = {\widehat{f}}^{[m-1]} + \nu\,{\widehat{U}}^{[m-1]}, where 0 < ν ≤ 1 is a real-valued step length factor.

6.
Iterate Steps 2–5 until m = m_{stop} for some stopping iteration m_{stop}.
The above algorithm can be interpreted as a negative gradient descent algorithm in function space. In each step, an estimate of the true negative gradient of the loss function is added to the current estimate of f*. Thus, a structural (regression) relationship between Y and the covariate vector X is established (for details on the characteristics of FGD we refer to Bühlmann and Hothorn [22]). Moreover, FGD carries out variable selection in each iteration, as only one component of X is selected in Step 4. This property makes the algorithm applicable even if p > n. Due to the additive structure in Step 5, the final boosting estimate at iteration m_{stop} can be interpreted as a linear model fit but will typically depend on only a subset of the p components of X.
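Steps 1–6 can be sketched as follows. For concreteness, the negative gradient is supplied as a callback, the base learners are simple linear fits without intercept (an assumption made here for brevity; it corresponds to centered covariates), and the goodness-of-fit criterion is implemented via the residual sum of squares, which is equivalent to selecting the largest R^2.

```python
import numpy as np

def componentwise_fgd(X, Y, neg_gradient, m_stop=100, nu=0.1):
    """Componentwise linear FGD (Steps 1-6) for a differentiable loss;
    neg_gradient(Y, f) must return the negative gradient vector U."""
    n, p = X.shape
    f_hat = np.zeros(n)                       # Step 1: offset f-hat^[0] = 0
    coef = np.zeros(p)                        # accumulated linear model fit
    for m in range(m_stop):                   # Step 6: iterate until m_stop
        U = neg_gradient(Y, f_hat)            # Step 2
        best_j, best_rss, best_b = 0, np.inf, 0.0
        for j in range(p):                    # Step 3: p simple linear fits
            x = X[:, j]
            b = (x @ U) / (x @ x)
            rss = np.sum((U - b * x) ** 2)
            if rss < best_rss:                # Step 4: smallest RSS = largest R^2
                best_j, best_rss, best_b = j, rss, b
        f_hat = f_hat + nu * best_b * X[:, best_j]   # Step 5
        coef[best_j] += nu * best_b
    return f_hat, coef
```

With the L2 loss ρ(Y, f) = (Y - f)^2/2, the negative gradient is simply the residual vector Y - f, and the returned `coef` is the sparse linear model fit described above.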
As we want to use FGD for the estimation of parametric AFT models, we set ρ equal to the negative log likelihood function specified in (2). However, in this case, the standard FGD algorithm presented above can not be applied, as (2) includes an additional scale parameter σ that has to be estimated simultaneously with f. We therefore extend the classical FGD algorithm in the following way:

1.
Initialize the n-dimensional vector {\widehat{f}}^{[0]} with an offset value, e.g., {\widehat{f}}^{[0]}\equiv 0. In addition, initialize the one-dimensional scale parameter {\widehat{\sigma}}^{[0]} with an offset value, e.g., {\widehat{\sigma}}^{[0]} = 1. Set m = 0.

2.
Increase m by 1. Compute the negative gradient -\frac{\partial}{\partial f}\rho(Y, f, \sigma) and evaluate it at {\widehat{f}}^{[m-1]}(X_{i}), i = 1,...,n, and at {\widehat{\sigma}}^{[m-1]}. This yields the negative gradient vector
U^{[m-1]} = \left(U_i^{[m-1]}\right)_{i=1,\ldots,n} := \left(-\frac{\partial}{\partial f}\rho(Y, f, \sigma)\Big|_{Y=Y_i,\, f=\widehat{f}^{[m-1]}(X_i),\, \sigma=\widehat{\sigma}^{[m-1]}}\right)_{i=1,\ldots,n}.

3.
Fit the negative gradient vector U^{[m-1]} to each of the p components of X (i.e., to each covariate) separately, using p simple linear regression fits. This yields p estimates of the negative gradient vector U^{[m-1]}.

4.
Select the component of X which fits U^{[m-1]} best according to a pre-specified goodness-of-fit criterion. Set {\widehat{U}}^{[m-1]} equal to the fitted values from the corresponding best model fitted in Step 3.

5.
Update {\widehat{f}}^{[m]} = {\widehat{f}}^{[m-1]} + \nu\,{\widehat{U}}^{[m-1]}, where 0 < ν ≤ 1 is a real-valued step length factor. Plug {\widehat{f}}^{[m]} into the empirical risk function \sum_{i=1}^{n} \rho(Y_i, f(X_i), \sigma) and minimize the empirical risk over σ. This yields the scale parameter estimate {\widehat{\sigma}}^{[m]}.

6.
Iterate Steps 2–5 until m = m_{stop} for some stopping iteration m_{stop}.
It is easily seen that the modified FGD algorithm estimates f* and σ in a stepwise fashion: In Steps 3 and 4, f* is estimated for a given value of σ, while in Step 5, σ is estimated given the current estimate of f*. It is also clear from Step 4 that the built-in variable selection mechanism of FGD is not affected by the additional estimation of the scale parameter σ. The value of the stopping iteration m_{stop} is the main tuning parameter of FGD. In the following we will throughout use five-fold cross-validation for determining the value of m_{stop}, i.e., m_{stop} is the iteration with the lowest predictive risk. The choice of the step length factor ν has been shown to be of minor importance for the predictive performance of a boosting algorithm. The only requirement is that the value of ν is "small", such that a stagewise adaption of the true prediction function f* is possible (see Bühlmann and Hothorn [22]). We therefore set ν = 0.1. As a goodness-of-fit criterion in Step 4 we use the R^{2} measure of explained variation which is known from linear regression.
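The extended algorithm can be sketched for the Weibull AFT loss as follows. The negative gradient used in Step 2 is derived from the negative of (2), and the σ-minimization in Step 5 is carried out by a grid search, which is our crude stand-in for a one-dimensional numerical optimizer; the grid limits are arbitrary assumptions.

```python
import numpy as np

def weibull_risk(Y, f_hat, delta, sigma):
    """Empirical risk: negative log likelihood (2) for the Weibull AFT model."""
    w = (Y - f_hat) / sigma
    return np.sum(delta * (np.log(sigma) - w + np.exp(w))
                  + (1 - delta) * np.exp(w))

def fgd_weibull_aft(X, Y, delta, m_stop=100, nu=0.1):
    """Componentwise FGD with simultaneous scale parameter estimation."""
    n, p = X.shape
    f_hat = np.zeros(n)                       # Step 1: f-hat^[0] = 0
    sigma_hat = 1.0                           # Step 1: sigma-hat^[0] = 1
    coef = np.zeros(p)
    sigma_grid = np.geomspace(0.1, 3.0, 100)  # arbitrary search grid for sigma
    for m in range(m_stop):
        # Step 2: negative gradient of the Weibull AFT loss with respect to f,
        # U_i = (exp(w_i) - delta_i) / sigma with w_i = (Y_i - f_i) / sigma
        w = (Y - f_hat) / sigma_hat
        U = (np.exp(w) - delta) / sigma_hat
        # Steps 3-4: componentwise linear fits, pick smallest RSS (largest R^2)
        best_j, best_rss, best_b = 0, np.inf, 0.0
        for j in range(p):
            x = X[:, j]
            b = (x @ U) / (x @ x)
            rss = np.sum((U - b * x) ** 2)
            if rss < best_rss:
                best_j, best_rss, best_b = j, rss, b
        # Step 5: update f, then minimize the empirical risk over sigma
        f_hat = f_hat + nu * best_b * X[:, best_j]
        coef[best_j] += nu * best_b
        risks = [weibull_risk(Y, f_hat, delta, s) for s in sigma_grid]
        sigma_hat = float(sigma_grid[np.argmin(risks)])
    return coef, sigma_hat
```

Because each update adds a multiple of a single covariate, the final fit is again a sparse linear model, and the scale estimate is refreshed after every boosting step.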
Measures for the predictive power of boosting techniques
After having built a survival model from a set of microarray data, it is essential to evaluate the predictive performance of the model. For this purpose, various measures of predictive accuracy have been proposed in the literature [32–37]. Since none of these measures has been adopted as a standard so far, we use the following strategy for measuring the performance of a survival prediction rule:
a) Log-Likelihood
If the objective is to compare prediction rules obtained from the same model family (such as the family of Weibull distributions), we use the predictive log likelihood or partial log likelihood as a measure of predictive accuracy. It is defined as follows: Suppose that the parameter vector θ of a survival model has been estimated from a training sample, and that a test sample (T_{k}, C_{k}, X_{k}), k = 1,...,n_{test}, is used for evaluating the predictive accuracy of the model. Denote the log likelihood or partial log likelihood function of the survival model by l_{θ}(T, C, X). The predictive log likelihood is then given by
l_{\text{pred}}(\widehat{\theta}) := \sum_{k=1}^{n_{\text{test}}} l_{\widehat{\theta}}(T_k, C_k, X_k),
(4)
where the parameter estimate \widehat{\theta} has been plugged into the log likelihood or partial log likelihood of the test sample. In case of a parametric AFT model we have θ = (f, σ), whereas in case of a Cox model, θ is the prediction function f used for modeling the hazard rate of T|X. In a boosting context it is particularly advisable to use (4) as a measure of predictive accuracy, since the negative log likelihood or partial log likelihood is used as a loss function for the corresponding boosting algorithms. As a consequence, the loss function is measured on the same scale as the predictive accuracy (4).
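For a parametric AFT model, evaluating (4) simply amounts to re-evaluating the log likelihood (2) on the test sample with the training-sample estimates plugged in. The sketch below does this for the Weibull case; the function name and interface are our own.

```python
import numpy as np

def predictive_log_lik(Y_test, f_test, delta_test, sigma_hat):
    """Predictive log likelihood (4) for the Weibull AFT model: the
    training-sample estimates (f-hat evaluated at the test covariates,
    and sigma-hat) are plugged into the test-sample log likelihood (2).
    Y_test is on the log-time scale."""
    w = (Y_test - f_test) / sigma_hat
    return np.sum(delta_test * (-np.log(sigma_hat) + w - np.exp(w))
                  + (1 - delta_test) * (-np.exp(w)))
```

Larger values indicate better predictive accuracy, so models fitted by minimizing the negative log likelihood are evaluated on exactly the scale they were trained on.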
b) Brier Score
If the objective is to compare prediction rules obtained from different model families or estimation techniques, it is no longer possible to use the predictive log likelihood as a measure of predictive accuracy. This is because the likelihood and partial likelihood functions of different model families are usually measured on different scales, so they can not be compared directly. Moreover, many nonparametric and semiparametric estimation techniques do not involve likelihood functions at all, implying that a predictive likelihood value can not be computed.
In case of the Barrier data we will compare the performance of various parametric and semiparametric estimation techniques by using the Brier score [34, 37] as a measure of predictive accuracy. The Brier score is defined as the time-dependent squared distance between the predicted survival probability and the true state of an observation. Since the Brier score is not based on any specific model assumptions, the predictive accuracy of various model families and estimation techniques can be measured on the same scale. Also, possible misspecifications of a survival model are taken into account [37]. In this paper we use the methodology of Gerds and Schumacher [38], who combined the evaluation of the Brier score with the 0.632+ estimator developed by Efron and Tibshirani [39]. This methodology has been used previously for predicting survival outcomes from microarray data sets [25].
We first draw B bootstrap samples of size n from the data, where the bootstrapped observations constitute the training data sets and the out-of-bootstrap observations constitute the test data sets. Define Γ_{i}(t) := I(T_{i} > t), i = 1,...,n, for each time point t > 0. Further denote by \widehat{S}(t, X_{i}), i = 1,...,n, the survival function of observation i at time point t estimated from the complete data set. We compute the apparent error rate
\text{err}(t) := \frac{1}{n} \sum_{i=1}^{n} \left(\Gamma_i(t) - \widehat{S}(t, X_i)\right)^2 W_i(t, \widehat{G}),
(5)
where
W_i(t, \widehat{G}) := \frac{I(\tilde{Y}_i \le t)\,\delta_i}{\widehat{G}(\tilde{Y}_i)} + \frac{I(\tilde{Y}_i > t)}{\widehat{G}(t)}
(6)
are weights that are introduced to account for censoring [37]. The expression \widehat{G} denotes the Kaplan-Meier estimate of the distribution of the censoring times C_{i}, i = 1,...,n. For each bootstrap sample and for each time point t we further compute the out-of-bag error rates
\text{Err}_b(t) := \frac{1}{n_b} \sum_{i \in \mathcal{B}_b} \left(\Gamma_i(t) - \widehat{S}_b(t, X_i)\right)^2 W_i(t, \widehat{G}), \quad b = 1, \ldots, B,
(7)
where n_{b} is the cardinality of the set \mathcal{B}_b of out-of-bootstrap observations corresponding to bootstrap sample b, b = 1,..., B. The expression {\widehat{S}}_{b}(t, X_{i}) is the estimated survival function of the out-of-bootstrap observation i at time point t obtained from the bootstrap training sample b. Denote the cross-validated error rate, i.e., the mean or median of the out-of-bag error rates Err_{b}(t), b = 1,...,B, by {\text{Err}}_{\mathcal{B}}(t).
Both (5) and (7) are estimators of the true prediction error E[Γ(t) - S(t, X)]^{2}, where S(t, X) is the true survival function of T|X at time point t. Since (7) is an upward-biased estimator and (5) is a downward-biased estimator of the true prediction error (see Gerds and Schumacher [38]), we use a linear combination of err(t) and {\text{Err}}_{\mathcal{B}}(t) as the overall measure of accuracy at time point t:
\text{Err}(t) := [1 - \omega(t)]\,\text{err}(t) + \omega(t)\,{\text{Err}}_{\mathcal{B}}(t).
(8)
Defining ω(t) := 0.632/(1 - 0.368 R(t)) with R(t) := \frac{{\text{Err}}_{\mathcal{B}}(t) - \text{err}(t)}{\text{NoInferr}(t) - \text{err}(t)} and
\text{NoInferr}(t) := \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n} \left(\Gamma_i(t) - \widehat{S}(t, X_j)\right)^2 W_i(t, \widehat{G})
(9)
yields the 0.632+ estimator of the Brier score at time point t. For a detailed derivation of (9) we refer to Gerds and Schumacher [38] and Efron and Tibshirani [39]. In case of the Barrier data, we evaluate (8) at each time point t > 0 up to a survival time of five years (i.e., t_max = 60 months). This yields a prediction error curve depending on t, where 0 < t ≤ t_max.
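The complete 0.632+ computation at a fixed time point t can be sketched as follows, given precomputed survival predictions. The Kaplan-Meier estimator for the censoring distribution is implemented from scratch here, and the handling of ties and of values of R(t) outside [0, 1] is simplified relative to Efron and Tibshirani [39]; all function names and interfaces are our own.

```python
import numpy as np

def km_censoring(Y_tilde, delta, t):
    """Kaplan-Meier estimate G-hat(t) of the censoring distribution
    (the 'events' for G are the censored observations, delta == 0)."""
    G = 1.0
    for s in np.sort(np.unique(Y_tilde[(delta == 0) & (Y_tilde <= t)])):
        at_risk = np.sum(Y_tilde >= s)
        d = np.sum((Y_tilde == s) & (delta == 0))
        G *= 1.0 - d / at_risk
    return G

def ipcw_weights(Y_tilde, delta, t):
    """Censoring weights (6): uncensored events up to t are weighted by
    1/G-hat(Y_i); observations still at risk at t by 1/G-hat(t)."""
    W = np.zeros(len(Y_tilde))
    event_before = (Y_tilde <= t) & (delta == 1)
    W[event_before] = np.array(
        [1.0 / km_censoring(Y_tilde, delta, y) for y in Y_tilde[event_before]])
    W[Y_tilde > t] = 1.0 / km_censoring(Y_tilde, delta, t)
    return W

def brier_632plus(t, Y_tilde, delta, S_full, S_oob, oob_sets):
    """0.632+ estimator (8) of the Brier score at time t.
    S_full[i]   : S-hat(t, X_i) estimated from the complete data.
    S_oob[b]    : S-hat_b(t, X_i) estimated from bootstrap sample b.
    oob_sets[b] : indices of the out-of-bootstrap observations of sample b."""
    n = len(Y_tilde)
    # For observations with positive weight, I(Y_tilde_i > t) equals the
    # true state Gamma_i(t) = I(T_i > t).
    Gamma = (Y_tilde > t).astype(float)
    W = ipcw_weights(Y_tilde, delta, t)
    err = np.mean((Gamma - S_full) ** 2 * W)                            # (5)
    Err_B = np.mean([np.mean((Gamma[idx] - S_oob[b][idx]) ** 2 * W[idx])
                     for b, idx in enumerate(oob_sets)])                # (7)
    noinf = np.mean([(Gamma[i] - S_full[j]) ** 2 * W[i]
                     for i in range(n) for j in range(n)])              # (9)
    R = (Err_B - err) / (noinf - err) if noinf != err else 0.0
    omega = 0.632 / (1.0 - 0.368 * R)
    return (1.0 - omega) * err + omega * Err_B                          # (8)
```

Evaluating `brier_632plus` over a grid of time points 0 < t ≤ t_max traces out the prediction error curve described above.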