 Methodology article
Flexible boosting of accelerated failure time models
BMC Bioinformatics volume 9, Article number: 269 (2008)
Abstract
Background
When boosting algorithms are used for building survival models from high-dimensional data, it is common to fit a Cox proportional hazards model or to use least squares techniques for fitting semiparametric accelerated failure time models. There are cases, however, where fitting a fully parametric accelerated failure time model is a good alternative to these methods, especially when the proportional hazards assumption is not justified. Boosting algorithms for the estimation of parametric accelerated failure time models have not been developed so far, since these models require the estimation of a model-specific scale parameter which traditional boosting algorithms are not able to deal with.
Results
We introduce a new boosting algorithm for censored time-to-event data which is suitable for fitting parametric accelerated failure time models. Estimation of the predictor function is carried out simultaneously with the estimation of the scale parameter, so that the negative log likelihood of the survival distribution can be used as a loss function for the boosting algorithm. The estimation of the scale parameter does not affect the favorable properties of boosting with respect to variable selection.
Conclusion
The analysis of a high-dimensional set of microarray data demonstrates that the new algorithm is able to outperform boosting with the Cox partial likelihood when the proportional hazards assumption is questionable. In low-dimensional settings, i.e., when classical likelihood estimation of a parametric accelerated failure time model is possible, simulations show that the new boosting algorithm closely approximates the estimates obtained from the maximum likelihood method.
Background
Predicting the expected time to an event from a high-dimensional set of predictor variables has become increasingly important in recent years. A particularly interesting problem in this context is the analysis of studies relating patients' genotypes, for example measured via gene expression levels, to a clinical outcome such as "disease-free survival" or "time to progression". Survival models of this type share the problems that are typical for the analysis of gene expression data: sample sizes are small while the number of potential predictors (i.e., gene expression levels) is extremely large. As a consequence, standard estimation techniques can no longer be applied.
For these reasons, a variety of new methods for obtaining survival predictions from high-dimensional data have been suggested in the literature. Most of these methods focus on the Cox proportional hazards model [1], while some other methods have been developed for fitting semiparametric accelerated failure time (AFT) models [2] in high-dimensional settings. Tibshirani [3], Gui and Li [4], Park and Hastie [5], and Zou [6] introduced Lasso-like algorithms for minimizing L_{1} penalized versions of the Cox partial likelihood. Due to the structure of the L_{1} penalty, these methods have the advantage that variable selection is carried out simultaneously with parameter estimation. Li and Luan [7] introduced a technique for maximizing the L_{2} penalized Cox partial likelihood, which (in contrast to methods using the L_{1} penalty) does not carry out variable selection but includes all predictors. On the other hand, several authors developed methods for fitting semiparametric AFT models in high-dimensional data settings: while Huang and Harrington [8] and Wang et al. [9] considered modifications of the Buckley-James method [10], Huang et al. [11] and Datta et al. [12] developed algorithms based on a censoring-adjusted penalized least squares loss function. In addition to penalized estimation techniques, there are various strategies for reducing the dimensionality of microarray data before building an unpenalized survival model; see Schumacher et al. [13], Bovelstad et al. [14], and van Wieringen et al. [15] for overviews of this topic.
In this paper the focus is on gradient boosting [16, 17], which is an alternative method for fitting survival models in high-dimensional data settings. Similar to the Lasso, boosting has a built-in variable selection mechanism which is carried out simultaneously with the estimation of the prediction function. Although connected to the Lasso (see Efron et al. [18]), boosting algorithms are not primarily designed for the maximization of penalized (partial) likelihood functions but can rather be interpreted as a method for minimizing convex loss functions via gradient descent techniques. In each step of a boosting algorithm, a so-called base-learner (e.g., a linear regression model) is fitted to the negative gradient of a pre-specified loss function. The current estimate of the prediction function is then updated with the newly obtained estimate of the gradient. As the base-learner can be modified such that only one covariate is used for estimating the gradient in each step (leading to the so-called componentwise base-learner), variable selection is carried out in each iteration.
Originally introduced for classification problems by Freund and Schapire [19, 20], boosting has developed into a computationally efficient technique for fitting many types of regression models in high-dimensional data settings. In principle, the boosting loss function can be any negative log likelihood function of some exponential family. Various authors have shown that the method tends to be promising with respect to both variable selection and prediction accuracy [16, 17, 21, 22].
Concerning the prediction of survival outcomes from gene expression data, the development of boosting algorithms has so far focused on the same model families as L_{1} and L_{2} penalized estimation techniques, namely the Cox proportional hazards model and the semiparametric AFT model. Ridgeway [23], Li and Luan [24], and Binder and Schumacher [25] used boosting algorithms with the negative partial log likelihood loss function while Hothorn et al. [26] introduced a boosting algorithm for fitting semiparametric AFT models. Similar to Huang et al. [11], Hothorn et al. [26] used censoring weights for adjusting the least squares objective function.
The proportional hazards model is the most frequently used survival model in biostatistics, and fitting semiparametric AFT models is a robust method for survival prediction when there is unknown heterogeneity in the data. Nevertheless, both types of analysis may be inappropriate in a number of situations. Instead, it might be advisable to fit a fully parametric AFT model with an explicitly specified distribution of the survival outcome, such as the Weibull or the log-logistic distribution [2].
As an example we consider a high-dimensional set of microarray data originally analyzed in a classification context by Barrier et al. [27]. The authors collected a sample of 50 patients operated on for a stage II colon adenocarcinoma. Of these patients, 25 developed a metachronous metastasis, whereas the other 25 remained disease-free for at least 60 months. For each patient the expression levels of 22,283 genes were obtained. Barrier et al. [27] selected a subset consisting of the 30 most differentially expressed genes between the disease and the disease-free group. By applying diagonal linear discriminant analysis based on the expressions of these 30 genes, the authors achieved a high prediction accuracy (76.3%) for the two patient groups. In the following we will address the equally important problem of modeling the time to development of metachronous metastasis.
Figs. 1 and 2 show the results of a preliminary survival analysis of the Barrier stage II colon cancer data: here only the 14 most differentially expressed genes between the disease and the disease-free group (i.e., the genes with p-values less than or equal to 0.005) were used as predictor variables. The Cox-Snell residuals shown in Fig. 1 suggest that a parametric AFT model with either a log-logistic or a log-normal distribution fits the data best. The Cox proportional hazards model does not seem to be appropriate, which is further confirmed by the results shown in Fig. 2: here the parameters of a stratified Cox model were estimated, where the strata were defined by splitting the values of the most overexpressed gene in the disease group at their median. The two baseline hazard functions shown in Fig. 2 cross, so the proportional hazards assumption seems to be violated.
Although the preliminary analysis is by no means sufficient for building a survival model from the Barrier data, we have nevertheless gained valuable insight into the distribution of the survival times. How should we incorporate this "prior knowledge" into a boosting algorithm? Fig. 2 suggests that using the Cox partial likelihood as a loss function for boosting is at least questionable, since the Cox model is known to be sensitive to violations of the proportional hazards assumption (see Schemper [28]). On the other hand, Fig. 1 suggests that the survival times of the Barrier data follow either a log-logistic or a log-normal distribution, so fitting a semiparametric AFT model without any distributional assumption might be inefficient (given that we only have 50 observations with 50% of them being censored). In order to account for these issues, we focus on the development of a boosting algorithm for parametric AFT models. An algorithm of this type has not yet been developed in the literature, since AFT models include a scale parameter for modeling the variance of the error distribution in the regression equation. With maximum likelihood estimation of AFT models, this scale parameter is estimated simultaneously with the regression parameters. Boosting algorithms, however, have so far not been able to deal with scale parameters but only with the prediction function, i.e., with the regression parameters. This is the reason why it has not been possible yet to use the negative log likelihood function of an AFT model as a loss function for boosting. In fact, boosting Cox models and semiparametric AFT models has only been possible because these methods do not involve a scale parameter.
In the following we will introduce a new boosting algorithm that allows for simultaneous estimation of both the prediction function and the scale parameter in parametric AFT models. This algorithm uses the negative log likelihood of the AFT model as a loss function and works in a stepwise fashion. After starting with offset values of the regression and scale parameters, a componentwise base-learning procedure is applied to the negative gradient in each iteration, and an update of the prediction function is obtained. Afterwards, the scale parameter is re-estimated in each iteration by plugging the current estimate of the prediction function into the loss function and minimizing the loss function over the scale parameter. As the base-learning procedure we use the classical linear least squares approach, i.e., the final model can be interpreted as a survival fit depending on a linear predictor.
The characteristics of the new algorithm are first investigated by means of a simulation study with artificial data. It is shown that variable selection is carried out efficiently, and that the algorithm has a high predictive power compared to "null models" with no covariates at all. Also, the regression estimates obtained from boosting in low-dimensional settings turn out to be almost identical to the corresponding maximum likelihood estimates. To demonstrate the practical applicability of the new algorithm, we carry out a detailed study on the data set of Barrier et al. [27] introduced above. The performance measures of this study suggest that boosting with the negative log likelihood of a parametric AFT model is able to outperform boosting with the Cox partial log likelihood, at least in case of the Barrier data.
Methods
Estimation problem
Consider a set of realizations of i.i.d. random variables (T_1, C_1, X_1),..., (T_n, C_n, X_n), where T_1,..., T_n are one-dimensional survival times, C_1,..., C_n are one-dimensional censoring times, and X_1,..., X_n are p-dimensional vectors of covariates. It is assumed that some of the survival times T_1,..., T_n are right-censored, so that only the random variables \tilde{Y}_i := \min\{T_i, C_i\}, i = 1,..., n, are observable. This implies that the available data consist of realizations of \tilde{Y}_i and of the censoring indicators δ_i ∈ {0, 1}, where δ_i = 0 if observation i is censored and δ_i = 1 if the complete survival time T_i has been observed. It is further assumed that the C_i, i = 1,...,n, are independent of (T_i, X_i), i = 1,...,n. This corresponds to the classical case of "random censoring". The objective is to fit the accelerated failure time model

log(T) = f(X) + σW,   (1)
where (T, X) follows the same distribution as each of the (T_{ i }, X_{ i }), i = 1,...,n, and where W is a random noise variable independent of X. The function f is an unknown prediction function of log(T), and σ is an unknown scale parameter that controls the amount of noise added to the prediction function f. Typically, f is a linear or additive function of the covariates X.
The distributional assumption on the noise variable W determines the distribution of the survival time T. If this distribution is fully specified, we refer to Model (1) as a parametric AFT model. For example, if W follows a standard extreme value distribution, T (given X) follows a Weibull distribution with parameters λ := exp(−f/σ) and α := 1/σ (cf. Klein and Moeschberger [29]). Other popular examples of parametric AFT models are the log-logistic distribution for T|X (with W following a standard logistic distribution) and the log-normal distribution for T|X (with W following a standard normal distribution). It is important to note that the latter two models explicitly allow for non-proportional hazards. The Weibull model, by contrast, is a parametric version of the well-known Cox proportional hazards model (which does not necessarily assume a regression equation such as (1) but is instead based on a semiparametric model for the hazard function of T|X).
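To make the Weibull correspondence explicit, a short calculation (using the standard extreme value survival function S_W(w) = exp(−e^w) and the Weibull parametrization S(t) = exp(−λ t^α) of Klein and Moeschberger [29]) reads

P(T > t \mid X) = P(f(X) + σW > \log t \mid X) = S_W\!\left(\frac{\log t - f(X)}{σ}\right) = \exp\!\left(-e^{(\log t - f(X))/σ}\right) = \exp(-λ t^{α}),

with λ = exp(−f(X)/σ) and α = 1/σ.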
Classically, the estimation of f and σ in a parametric AFT model is performed by maximizing the log likelihood function

l(f, σ) = \sum_{i=1}^{n} \left[ δ_i \left( \log f_W\!\left( \frac{Y_i - f(X_i)}{σ} \right) - \log σ \right) + (1 - δ_i) \log S_W\!\left( \frac{Y_i - f(X_i)}{σ} \right) \right],   (2)

where Y_i := log(min{T_i, C_i}) is the logarithm of \tilde{Y}_i. The functions f_W and S_W denote the probability density and survival functions of the noise variable W, respectively (cf. Klein and Moeschberger [29], Chapter 12).
Semiparametric AFT models leave the parameter σ and the distribution of the noise variable W in Model (1) unspecified. Instead of using maximum likelihood techniques, f is estimated via minimization of a censoring-adjusted least squares criterion [2, 10, 30].
FGD boosting with componentwise linear least squares and scale parameter estimation
In the following we will use boosting techniques for obtaining precise estimates of Model (1) when the number of covariates is large. We start by considering a boosting algorithm known as "componentwise linear functional gradient descent (FGD)" [16, 17, 21, 31]. The objective of FGD is to obtain the real-valued prediction function

f* := argmin_{f(·)} E[ρ(Y, f(X))],   (3)
where ρ is a one-dimensional convex loss function that is assumed to be differentiable with respect to f. Estimation of f* is performed by minimizing the empirical risk \sum_{i=1}^{n} ρ(Y_i, f(X_i)) with respect to f. Componentwise FGD works as follows:

1.
Initialize the n-dimensional vector \hat{f}^{[0]} with an offset value, e.g., \hat{f}^{[0]} ≡ 0. Set m = 0.

2.
Increase m by 1. Compute the negative gradient -\frac{\partial}{\partial f} ρ(Y, f) and evaluate it at \hat{f}^{[m-1]}(X_i), i = 1,...,n. This yields the negative gradient vector

U^{[m-1]} = \left( U_i^{[m-1]} \right)_{i=1,...,n} := \left( - \left. \frac{\partial}{\partial f} ρ(Y, f) \right|_{Y = Y_i,\, f = \hat{f}^{[m-1]}(X_i)} \right)_{i=1,...,n}.
3.
Fit the negative gradient vector U^{[m-1]} separately to each of the p components of X (i.e., to each covariate), using the simple linear regression estimator p times. This yields p estimates of the negative gradient vector U^{[m-1]}.

4.
Select the component of X which fits U^{[m-1]} best according to a pre-specified goodness-of-fit criterion. Set \hat{U}^{[m-1]} equal to the fitted values from the corresponding best model fitted in Step 3.

5.
Update \hat{f}^{[m]} = \hat{f}^{[m-1]} + ν \hat{U}^{[m-1]}, where 0 < ν ≤ 1 is a real-valued step-length factor.

6.
Iterate Steps 2–5 until m = m_{stop} for some stopping iteration m_{stop}.
The above algorithm can be interpreted as a negative gradient descent algorithm in function space. In each step, an estimate of the true negative gradient of the loss function is added to the current estimate of f*. Thus, a structural (regression) relationship between Y and the covariate vector X is established (for details on the characteristics of FGD we refer to Bühlmann and Hothorn [22]). Moreover, FGD carries out variable selection in each iteration, as only one component of X is selected in Step 4. This property makes the algorithm applicable even if p > n. Due to the additive structure in Step 5, the final boosting estimate at iteration m_{stop} can be interpreted as a linear model fit but will typically depend on only a subset of the p components of X.
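As an illustration of Steps 3 and 4, the componentwise linear base-learner can be sketched in a few lines of R (an illustration only, not the implementation used in the mboost package; the function name componentwise_lm is ours):

```r
## Componentwise linear base-learner: fit the current negative gradient vector u
## to each column of the covariate matrix x by simple linear regression and
## return the fitted values of the best-fitting component (largest R^2).
componentwise_lm <- function(u, x) {
  p <- ncol(x)
  r2 <- numeric(p)
  fits <- vector("list", p)
  for (j in seq_len(p)) {
    fit <- lm(u ~ x[, j])        # Step 3: simple linear regression on covariate j
    fits[[j]] <- fitted(fit)
    r2[j] <- summary(fit)$r.squared
  }
  best <- which.max(r2)          # Step 4: R^2 as goodness-of-fit criterion
  list(fitted = fits[[best]], selected = best)
}
```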
As we want to use FGD for the estimation of parametric AFT models, we set ρ equal to the negative log likelihood function specified in (2). However, in this case, the standard FGD algorithm presented above can not be applied, as (2) includes an additional scale parameter σ that has to be estimated simultaneously with f. We therefore extend the classical FGD algorithm in the following way:

1.
Initialize the n-dimensional vector \hat{f}^{[0]} with an offset value, e.g., \hat{f}^{[0]} ≡ 0. In addition, initialize the one-dimensional scale parameter \hat{σ}^{[0]} with an offset value, e.g., \hat{σ}^{[0]} = 1. Set m = 0.

2.
Increase m by 1. Compute the negative gradient -\frac{\partial}{\partial f} ρ(Y, f, σ) and evaluate it at \hat{f}^{[m-1]}(X_i), i = 1,...,n, and at \hat{σ}^{[m-1]}. This yields the negative gradient vector

U^{[m-1]} = \left( U_i^{[m-1]} \right)_{i=1,...,n} := \left( - \left. \frac{\partial}{\partial f} ρ(Y, f, σ) \right|_{Y = Y_i,\, f = \hat{f}^{[m-1]}(X_i),\, σ = \hat{σ}^{[m-1]}} \right)_{i=1,...,n}.
3.
Fit the negative gradient vector U^{[m-1]} separately to each of the p components of X (i.e., to each covariate), using the simple linear regression estimator p times. This yields p estimates of the negative gradient vector U^{[m-1]}.

4.
Select the component of X which fits U^{[m-1]} best according to a pre-specified goodness-of-fit criterion. Set \hat{U}^{[m-1]} equal to the fitted values from the corresponding best model fitted in Step 3.

5.
Update \hat{f}^{[m]} = \hat{f}^{[m-1]} + ν \hat{U}^{[m-1]}, where 0 < ν ≤ 1 is a real-valued step-length factor. Plug \hat{f}^{[m]} into the empirical risk function \sum_{i=1}^{n} ρ(Y_i, f(X_i), σ) and minimize the empirical risk over σ. This yields the scale parameter estimate \hat{σ}^{[m]}.

6.
Iterate Steps 2–5 until m = m_{stop} for some stopping iteration m_{stop}.
It is easily seen that the modified FGD algorithm estimates f* and σ in a stepwise fashion: in Steps 3 and 4, f* is estimated for a given value of σ, while in Step 5, σ is estimated given the current estimate of f*. It is also clear from Step 4 that the built-in variable selection mechanism of FGD is not affected by the additional estimation of the scale parameter σ. The value of the stopping iteration m_{stop} is the main tuning parameter of FGD. Throughout the following, we use five-fold cross-validation for determining the value of m_{stop}, i.e., m_{stop} is the iteration with the lowest predictive risk. The choice of the step-length factor ν has been shown to be of minor importance for the predictive performance of a boosting algorithm. The only requirement is that the value of ν is "small", such that a stagewise adaptation of the true prediction function f* is possible (see Bühlmann and Hothorn [22]). We therefore set ν = 0.1. As the goodness-of-fit criterion in Step 4 we use the R^{2} measure of explained variation known from linear regression.
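To make the complete procedure concrete, the following R sketch re-implements the algorithm for the Weibull loss. It is an illustration only (function and variable names are ours), not the modified glmboost() implementation used for the analyses below. For the standard extreme value distribution we have f_W(w) = exp(w − e^w) and S_W(w) = exp(−e^w), so the loss derived from (2) reduces to −δ_i z_i + exp(z_i) + δ_i log(σ) per observation, with z_i = (Y_i − f(X_i))/σ, and the negative gradient in Step 2 becomes (exp(z_i) − δ_i)/σ.

```r
## Illustrative componentwise FGD for a Weibull AFT model with simultaneous
## scale parameter estimation (a sketch, not the authors' modified glmboost()).
## y: log observed times, i.e., log(min(T, C));  delta: censoring indicators;
## x: n x p covariate matrix;  mstop, nu: number of iterations and step length.
boost_weibull_aft <- function(y, delta, x, mstop = 100, nu = 0.1) {
  n <- nrow(x); p <- ncol(x)
  beta   <- numeric(p)          # linear-model representation of f-hat
  offset <- mean(y)             # simple location offset for f-hat^[0]
  f      <- rep(offset, n)
  sigma  <- 1                   # offset value for the scale parameter

  ## empirical risk = negative Weibull log likelihood (2) as a function of sigma;
  ## per observation: -delta*z + exp(z) + delta*log(sigma), z = (y - f)/sigma
  risk <- function(s) {
    z <- (y - f) / s
    sum(-delta * z + exp(z) + delta * log(s))
  }

  for (m in seq_len(mstop)) {
    ## Step 2: negative gradient of the loss, evaluated at (f-hat, sigma-hat)
    z <- (y - f) / sigma
    u <- (exp(z) - delta) / sigma

    ## Steps 3-4: componentwise linear least squares (written for clarity, not speed)
    r2 <- apply(x, 2, function(xj) summary(lm(u ~ xj))$r.squared)
    j  <- which.max(r2)
    cf <- unname(coef(lm(u ~ x[, j])))

    ## Step 5: update the prediction function, then re-estimate sigma
    offset  <- offset + nu * cf[1]
    beta[j] <- beta[j] + nu * cf[2]
    f       <- offset + drop(x %*% beta)
    sigma   <- optimize(risk, interval = c(1e-3, 10))$minimum
  }
  list(coefficients = beta, offset = offset, sigma = sigma)
}
```

In practice, m_{stop} would be chosen by five-fold cross-validation of the predictive risk as described above; the log-logistic and log-normal versions are obtained by replacing the extreme value density and survival function with the corresponding logistic and normal ones.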
Measures for the predictive power of boosting techniques
After having built a survival model from a set of microarray data, it is essential to evaluate the predictive performance of the model. To this purpose, various measures of predictive accuracy have been proposed in the literature [32–37]. Since none of these measures has been adopted as a standard so far, we use the following strategy for measuring the performance of a survival prediction rule:
a) Log likelihood
If the objective is to compare prediction rules obtained from the same model family (such as the family of Weibull distributions), we use the predictive log likelihood or partial log likelihood as a measure of predictive accuracy. It is defined as follows: suppose that the parameter vector θ of a survival model has been estimated from a training sample, and that a test sample (T_k, C_k, X_k), k = 1,...,n_{test}, is used for evaluating the predictive accuracy of the model. Denote the log likelihood or partial log likelihood function of the survival model by l_θ(T, C, X). The predictive log likelihood is then given by

\sum_{k=1}^{n_{test}} l_{\hat{θ}}(T_k, C_k, X_k),   (4)
where the parameter estimate \hat{θ} has been plugged into the log likelihood or partial log likelihood of the test sample. In case of a parametric AFT model we have θ = (f, σ), whereas in case of a Cox model, θ is the prediction function f used for modeling the hazard rate of T|X. In a boosting context it is particularly advisable to use (4) as a measure of predictive accuracy, since the negative log likelihood or partial log likelihood is used as a loss function for the corresponding boosting algorithms. As a consequence, the loss function is measured on the same scale as the predictive accuracy (4).
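For the Weibull model, for instance, (4) can be evaluated along the following lines (a sketch; ytest, dtest, and fhat_test denote the log survival times, censoring indicators, and predicted values of f for the test sample, and are our own names):

```r
## Predictive Weibull log likelihood (4) on a test sample:
## per observation, delta*(z - exp(z) - log(sigma)) + (1 - delta)*(-exp(z)),
## with z = (log(time) - f-hat)/sigma (standard extreme value density and survival).
predloglik_weibull <- function(ytest, dtest, fhat_test, sigma) {
  z <- (ytest - fhat_test) / sigma
  sum(dtest * (z - exp(z) - log(sigma)) + (1 - dtest) * (-exp(z)))
}
```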
b) Brier score
If the objective is to compare prediction rules obtained from different model families or estimation techniques, it is no longer possible to use the predictive log likelihood as a measure of predictive accuracy. This is because the likelihood and partial likelihood functions of different model families are usually measured on different scales, so they can not be compared directly. Moreover, many nonparametric and semiparametric estimation techniques do not involve likelihood functions at all, implying that a predictive likelihood value can not be computed.
In case of the Barrier data we will compare the performance of various parametric and semiparametric estimation techniques by using the Brier score [34, 37] as a measure of predictive accuracy. The Brier score is defined as the time-dependent squared distance between the predicted survival probability and the true state of an observation. Since the Brier score is not based on any specific model assumptions, the predictive accuracy of various model families and estimation techniques can be measured on the same scale. Also, possible misspecifications of a survival model are taken into account [37]. In this paper we use the methodology of Gerds and Schumacher [38], who combined the evaluation of the Brier score with the 0.632+ estimator developed by Efron and Tibshirani [39]. This methodology has been used previously for predicting survival outcomes from microarray data sets [25].
We first draw B bootstrap samples of size n from the data, where the bootstrapped observations constitute the training data sets and the out-of-bootstrap observations constitute the test data sets. Define Γ_i(t) := I(T_i > t), i = 1,...,n, for each time point t > 0. Further denote by \hat{S}(t, X_i), i = 1,...,n, the survival function of observation i at time point t estimated from the complete data set. We compute the apparent error rate

err(t) := \frac{1}{n} \sum_{i=1}^{n} w_i(t) \left( Γ_i(t) - \hat{S}(t, X_i) \right)^2,   (5)

where

w_i(t) := \frac{I(\tilde{Y}_i ≤ t)\, δ_i}{\hat{G}(\tilde{Y}_i)} + \frac{I(\tilde{Y}_i > t)}{\hat{G}(t)}   (6)

are weights that are introduced to account for censoring [37]. The expression \hat{G} denotes the Kaplan-Meier estimate of the distribution of the censoring times C_i, i = 1,...,n. For each bootstrap sample and for each time point t we further compute the out-of-bag error rates

Err_b(t) := \frac{1}{n_b} \sum_{i ∈ \mathcal{B}_b} w_i(t) \left( Γ_i(t) - \hat{S}_b(t, X_i) \right)^2,   (7)

where n_b is the cardinality of the set \mathcal{B}_b of out-of-bootstrap observations corresponding to bootstrap sample b, b = 1,..., B. The expression \hat{S}_b(t, X_i) is the estimated survival function of the out-of-bootstrap observation i at time point t obtained from the bootstrap training sample b. Denote the cross-validated error rate, i.e., the mean or median of the out-of-bag error rates Err_b(t), b = 1,...,B, by \text{Err}_{\mathcal{B}}(t).
Both (5) and (7) are estimators of the true prediction error E[Γ(t) − S(t, X)]^2, where S(t, X) is the true survival function of T|X at time point t. Since (7) is an upward-biased estimator and (5) is a downward-biased estimator of the true prediction error (see Gerds and Schumacher [38]), we use a linear combination of err(t) and \text{Err}_{\mathcal{B}}(t) as the overall measure of accuracy at time point t:

\text{Err}^{0.632+}(t) := (1 - ω(t))\, \text{err}(t) + ω(t)\, \text{Err}_{\mathcal{B}}(t).   (8)

Defining ω(t) := 0.632 / (1 − 0.368 R(t)) with R(t) := \frac{\text{Err}_{\mathcal{B}}(t) - \text{err}(t)}{\text{NoInf}(t) - \text{err}(t)}, where NoInf(t) denotes the no-information error rate of Efron and Tibshirani [39] adapted to censored data, yields the 0.632+ estimator (8) of the Brier score at time point t. For a detailed derivation of this estimator we refer to Gerds and Schumacher [38] and Efron and Tibshirani [39]. In case of the Barrier data, we evaluate (8) at each time point t > 0 up to a survival time of five years (i.e., t_{max} = 60 months). This yields a prediction error curve depending on t, where 0 < t ≤ t_{max}.
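For a single time point t, the apparent error rate (5) with the censoring weights (6) can be computed along the following lines (a sketch using the survival package; the function and argument names, e.g. brier_ipcw and predsurv, are ours, and we ignore the technical distinction between \hat{G}(\tilde{Y}_i) and its left-hand limit):

```r
library(survival)

## IPCW Brier score at time point t, cf. (5) and (6):
## time and status are the observed follow-up times and censoring indicators,
## predsurv holds the model-based survival probabilities S-hat(t, X_i).
brier_ipcw <- function(t, time, status, predsurv) {
  ## reverse Kaplan-Meier estimate of the censoring distribution G
  km_cens <- survfit(Surv(time, 1 - status) ~ 1)
  G <- stepfun(km_cens$time, c(1, km_cens$surv))
  state <- as.numeric(time > t)                  # Gamma_i(t) = I(T_i > t)
  ## weights: uncensored events before t get 1/G(Y_i), observations still
  ## at risk at t get 1/G(t), observations censored before t get weight 0
  w <- ifelse(time <= t & status == 1, 1 / G(time),
              ifelse(time > t, 1 / G(t), 0))
  mean(w * (state - predsurv)^2)
}
```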
Results and Discussion
In this section we investigate the properties of the new boosting algorithm for parametric AFT models. We first conduct a simulation study with artificial data. Afterwards, we investigate the ability of boosting to predict survival outcomes from gene expression data. This is done by conducting a benchmark study on the Barrier stage II colon cancer data. All computations were carried out with the R system for statistical computing (version 2.6.2, [40]) using a modification of the glmboost() function in package mboost (version 1.0–1, [41]).
Simulation study with artificial data
We carried out a simulation study on linear AFT models with five covariates, i.e., we considered the model

log(T) = β_1 X_1 + ... + β_5 X_5 + σW,   (10)
where X_1,...,X_5 are jointly normally distributed covariates with parameters E(X_k) = 0, var(X_k) = 1, and cov(X_k, X_l) = 0.5, k, l = 1,...,5, k ≠ l. The values of the noise variable W were drawn either from a standard extreme value distribution (corresponding to the Weibull model), from a standard logistic distribution (corresponding to the log-logistic model), or from a standard normal distribution (corresponding to the log-normal model). For each of the three distributions the scale parameter σ was chosen such that

var(f(X)) / var(log(T)) = 0.8,   (11)

i.e., the linear predictor f(X) accounted for 80% of the variance of log(T). The parameter vector β := (β_1,...,β_5)^⊤ was set equal to β = (0.5, 0.25, −0.25, −0.5, 0.5)^⊤.
In a first step, we compared the estimates of β obtained from the new boosting algorithm with the corresponding estimates obtained from standard maximum likelihood estimation. For each of the three distributions of W, we generated 50 i.i.d. data sets (of size n = 100 each) from Model (10). Censoring was introduced by simulating 50 additional i.i.d. data sets (of size 100 each) from Model (10); the values of the dependent variables in these additional data sets constituted the censoring times of the corresponding first 50 data sets. This implied that (a) the censoring times followed the same distribution as the survival times but were independent of the latter, and that (b) about 50% of the observations in each data set were censored on average. The new boosting algorithm with the corresponding negative log likelihood loss function was applied to each data set. Note that the final boosting estimate can be interpreted as a linear model fit (depending on a parameter estimate of β), since we used componentwise linear base-learners. We additionally used standard maximum likelihood techniques for estimating the parameters of Model (10) from the 50 data sets.
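The Weibull variant of this design can be reproduced roughly as follows (a sketch; the random seed and implementation details are ours, and log(rexp(n)) yields standard extreme value noise):

```r
library(MASS)    # for mvrnorm()

set.seed(1)
beta  <- c(0.5, 0.25, -0.25, -0.5, 0.5)
Sigma <- matrix(0.5, 5, 5); diag(Sigma) <- 1        # cov(X_k, X_l) = 0.5, var(X_k) = 1

## scale parameter such that f(X) explains 80% of var(log(T)), cf. (11):
## var(f(X)) = beta' Sigma beta and var(W) = pi^2/6 for the extreme value distribution
vf    <- drop(t(beta) %*% Sigma %*% beta)
sigma <- sqrt(0.2 * vf / (0.8 * pi^2 / 6))

simdata <- function(n) {
  x    <- mvrnorm(n, mu = rep(0, 5), Sigma = Sigma)
  logT <- drop(x %*% beta) + sigma * log(rexp(n))   # Weibull survival times
  ## independent censoring times drawn from the same marginal distribution
  xc   <- mvrnorm(n, mu = rep(0, 5), Sigma = Sigma)
  logC <- drop(xc %*% beta) + sigma * log(rexp(n))
  data.frame(x, time = exp(pmin(logT, logC)), status = as.numeric(logT <= logC))
}
dat <- simdata(100)
```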
The boxplots of the parameter estimates of the Weibull model are shown in Fig. 3. Obviously, the parameter estimates obtained from boosting are very similar to the parameter estimates obtained from maximum likelihood estimation. Similar results were obtained for the log-logistic and log-normal models. This can also be seen from the Spearman correlations between the parameter estimates shown in Table 1. We next compared the ability of boosting and of standard maximum likelihood techniques to select a subset of informative covariates out of a larger set of covariates. For this purpose, we extended the set of variables used in the previous simulation study by an additional set of 15 independent standard normally distributed covariates with parameters β_6 = ... = β_20 = 0. Fig. 4 shows the resulting parameter estimates obtained from boosting and from maximum likelihood estimation. Obviously, all estimates of the "non-informative" parameters β_6,...,β_20 are close to the true value of 0. In 33.7% of all 15 · 50 cases, the boosting parameter estimates of β_6,...,β_20 are exactly 0, which means that the corresponding non-informative covariates have not been selected. The corresponding percentage rates were 32.1% and 21.5% for the log-logistic and the log-normal model, respectively. In addition, the boosting parameter estimates corresponding to β_6,...,β_20 have a smaller variance than the corresponding maximum likelihood estimates. For these reasons, boosting seems to be preferable to likelihood estimation when it comes to variable selection. It can also be seen from Fig. 4 that the boosting estimates of the non-zero parameters β_1,...,β_5 are (slightly) downward-biased. This shrinkage effect reflects the well-known bias-variance trade-off which is common to boosting techniques.
We finally investigated the predictive performance of the new boosting algorithm and of maximum likelihood estimation. For this purpose, we used the model estimates obtained from the 50 data sets to compute the predictive log likelihood (4) on 50 additionally generated i.i.d. test samples of size n_{test} = 100 each (which also followed Model (10)). The corresponding predictive log likelihood values, which are shown in Fig. 5, suggest that the predictive performance of boosting is better than that of maximum likelihood estimation. A Wilcoxon paired-sample test for the predictive log likelihood values of boosting and maximum likelihood estimation resulted in a p-value of less than 0.001. For the log-logistic and log-normal models, the respective p-values were also smaller than 0.001.
Benchmark experiment on stage II colon cancer data
In order to show the practical applicability of the new boosting algorithm for parametric AFT models, we conducted a benchmark study on the Barrier stage II colon cancer data [27]. Again we used (1) the negative Weibull log likelihood, (2) the negative log-logistic log likelihood, and (3) the negative log-normal log likelihood as loss functions for boosting. To compare boosting for parametric AFT models with other estimation techniques, we additionally fitted survival models by using (4) L_{2} boosting for semiparametric AFT models [26], (5) L_{1} penalized estimation for semiparametric AFT models [11] (using the lars() function in R package lars [42]), (6) gradient boosting with the negative Cox partial log likelihood loss [23], (7) L_{1} penalized Cox partial likelihood estimation (using the coxpath() function in R package glmpath [43]), and (8) nonparametric estimation of the survival function via the Kaplan-Meier estimator. The latter estimator corresponds to the non-informative "null model" with no covariates at all.
We applied the eight estimation techniques to 50 bootstrap training samples generated from the Barrier data. Prediction errors were obtained by using the 0.632+ methodology described in the Methods section. For computational reasons we reduced the predictor space for the L_{1} penalized methods (5) and (7), i.e., we used the 10,218 most differentially expressed genes between the disease and the disease-free group instead of all 22,283 genes. The subset of 10,218 genes corresponds to those genes whose differences between the disease and the disease-free group are significant at a level of α = 0.2. This reduction of the predictor space did not have any negative effect on the predictive performance of the two L_{1} penalized techniques (see below). The tuning parameters of all eight methods were determined by using five-fold cross-validation.
The survival functions of the eight methods, which are needed for computing the Brier score, were estimated as follows: for the "parametric" boosting methods (1), (2), and (3), we estimated the survival functions by plugging the estimated prediction function \hat{f} into the Weibull, log-logistic, and log-normal survival functions, respectively. In case of L_{2} boosting and L_{1} penalized estimation for semiparametric AFT models (methods (4) and (5)) we made use of the following relationship:

S(t, X) = P(T > t | X) = P(f(X) + ε > log(t) | X) = P(ε > log(t) − f(X) | X),   (12)
where ε := σW. We then estimated the probability on the right-hand side of (12) by applying the Kaplan-Meier estimator to the residuals \hat{ε}_i obtained from the bootstrap training samples and by evaluating the resulting Kaplan-Meier survival functions at the "time points" log(t) − \hat{f}(X_i). In case of boosting with the Cox partial log likelihood loss and L_{1} penalized Cox regression (methods (6) and (7)) we used the estimated survival function

\hat{S}(t, X) = \exp\!\left( -\hat{Λ}_0(t) \exp(\hat{f}(X)) \right),   (13)

where \hat{Λ}_0(t) is the Breslow estimator of the cumulative baseline hazard of the survival times.
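As an illustration of how (12) is turned into survival probabilities, the following sketch applies the Kaplan-Meier estimator to the (exponentiated) residuals of an AFT fit; the function and argument names are ours:

```r
library(survival)

## Estimate S(t, X) = P(eps > log(t) - f(X)) as in (12):
## apply the Kaplan-Meier estimator to exp(eps-hat) (kept positive for Surv())
## and read off its survival probability at exp(log(t) - f-hat) = t * exp(-f-hat).
## y, status, fhat: log times, censoring indicators, and fitted f on the training data;
## newf: predicted values of f for the observations to be predicted.
aft_survprob <- function(t, newf, y, status, fhat) {
  res  <- exp(y - fhat)                    # exponentiated residuals
  km   <- survfit(Surv(res, status) ~ 1)   # Kaplan-Meier of the residuals
  Sres <- stepfun(km$time, c(1, km$surv))  # survival function of exp(eps-hat)
  Sres(t * exp(-newf))                     # S-hat(t, X)
}
```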
Fig. 6 shows the median prediction error curves corresponding to the parametric AFT boosting techniques (1), (2), and (3). Obviously, the prediction error curves cross, so the predictive performance of the models depends on the time point t under consideration. However, for most values of t, boosting with the negative log-logistic or negative log-normal log likelihood loss yields the smallest prediction error. In Fig. 7 the median prediction error curve corresponding to the log-logistic model is compared to the median prediction error curves obtained from the two techniques for semiparametric AFT models (4) and (5). Here the parametric AFT model seems to have the best predictive performance, indicating that the efficiency loss due to the semiparametric estimation of the AFT model is relatively large. In Fig. 8 the median prediction error curve of the log-logistic model is compared to the median prediction error curves obtained from the two partial log likelihood techniques (6) and (7), as well as to the median prediction error curve obtained from the "null model" (8). Again we see that boosting with the negative log-logistic log likelihood has the best predictive performance.
We finally computed the Cox-Snell residuals from the boosting estimates based on the complete data set. The residuals, which are shown in Fig. 9, confirm the results obtained from the preliminary analysis of the Barrier data presented in the Background section: similar to Fig. 1, we see that the log-logistic model fits the data well, while the Cox proportional hazards model seems to be inadequate here.
Conclusion
By introducing a boosting algorithm for parametric AFT models we have extended the methodology for modeling survival times in high-dimensional data settings. Boosting algorithms for survival data have previously been developed for the Cox proportional hazards model [23, 24] and for semiparametric AFT models [26]. An extension of these algorithms to parametric AFT models has not been possible so far, since parametric AFT models depend on a scale parameter which has to be estimated simultaneously with the prediction function. To overcome this problem we have developed a flexible boosting algorithm which is able to deal with loss functions depending on both the prediction function and a scale parameter. As a consequence, the negative log likelihood function of an AFT model can be used as a loss function for the componentwise functional gradient descent boosting algorithm.
The simulation study on various parametric AFT models has shown that the favorable properties of boosting with respect to variable selection are left untouched by the additional estimation of the scale parameter. Moreover, when sample sizes are small (such that a semiparametric estimation of AFT models becomes inefficient) or when the proportional hazards assumption of the survival times is violated, boosting techniques for parametric AFT models seem to lead to an increase in prediction accuracy. This is suggested by the results obtained from the benchmark study on Barrier's stage II colon cancer data. We finally point out that we have so far focused exclusively on boosting with the componentwise linear base-learning procedure. This procedure fits a set of simple linear regression models to the negative gradient in each boosting iteration. We have used componentwise linear least squares mainly because of its computational efficiency. In fact, componentwise linear least squares base-learners are highly efficient even when the number of predictors is extremely large (such as in case of Barrier's stage II colon cancer data, where p > 22,000). In principle, however, the new boosting algorithm can also be extended to additive models with smooth components. This can, for example, be done by using componentwise smoothing splines (see Bühlmann and Yu [17]) or P-splines as base-learners. Moreover, the boosting algorithm developed in this paper is not exclusively designed for fitting parametric AFT models but could also be used in combination with other likelihood-based loss functions depending on a scale parameter. The properties of these extensions still have to be investigated and constitute an issue for future research.
References
Cox DR: Regression Models and Life Tables (with Discussion). Journal of the Royal Statistical Society, Series B 1972, 34(2):187–220.
James I: Accelerated Failure-Time Models. In Encyclopedia of Biostatistics. Edited by: Armitage P, Colton T. John Wiley & Sons, Chichester; 1998:26–30.
Tibshirani R: The Lasso Method for Variable Selection in the Cox Model. Statistics in Medicine 1997, 16(4):385–395. 10.1002/(SICI)1097-0258(19970228)16:4<385::AID-SIM380>3.0.CO;2-3
Gui J, Li H: Penalized Cox Regression Analysis in the High-Dimensional and Low-Sample Size Settings, with Applications to Microarray Gene Expression Data. Bioinformatics 2005, 21(13):3001–3008. 10.1093/bioinformatics/bti422
Park MY, Hastie T: L_{1}-Regularization Path Algorithm for Generalized Linear Models. Journal of the Royal Statistical Society, Series B 2007, 69(4):659–677. 10.1111/j.1467-9868.2007.00607.x
Zou H: A Note on Path-Based Variable Selection in the Penalized Proportional Hazards Model. Biometrika 2008, 95: 241–247. 10.1093/biomet/asm083
Li H, Luan Y: Kernel Cox Regression Models for Linking Gene Expression Profiles to Censored Survival Data. Pac Symp Biocomput 2003, 8: 65–76.
Huang J, Harrington D: Iterative Partial Least Squares with Right-Censored Data Analysis: A Comparison to Other Dimension Reduction Techniques. Biometrics 2005, 61: 17–24. 10.1111/j.0006-341X.2005.040304.x
Wang S, Nan B, Zhu J, Beer DG: Doubly Penalized Buckley-James Method for Survival Data with High-Dimensional Covariates. Biometrics 2008, 64: 132–140. 10.1111/j.1541-0420.2007.00877.x
Buckley J, James I: Linear Regression with Censored Data. Biometrika 1979, 66(3):429–436. 10.1093/biomet/66.3.429
Huang J, Ma S, Xie H: Regularized Estimation in the Accelerated Failure Time Model with High Dimensional Covariates. Biometrics 2006, 62(3):813–820. 10.1111/j.1541-0420.2006.00562.x
Datta S, Le-Rademacher J, Datta S: Predicting Patient Survival from Microarray Data by Accelerated Failure Time Modeling Using Partial Least Squares and LASSO. Biometrics 2007, 63: 259–271. 10.1111/j.1541-0420.2006.00660.x
Schumacher M, Binder H, Gerds T: Assessment of Survival Prediction Models Based on Microarray Data. Bioinformatics 2007, 23(14):1768–1774. 10.1093/bioinformatics/btm232
Bovelstad HM, Nyga S, Storvold HL, Aldrin M, Borgan O, Frigessi A, Lingjaerde OC: Predicting Survival from Microarray Data – a Comparative Study. Bioinformatics 2007, 23(16):2080–2087. 10.1093/bioinformatics/btm305
van Wieringen WN, Kun D, Hampel R, Boulesteix AL: Survival Prediction Using Gene Expression Data: A Review and Comparison.2007. [http://www.slcmsr.net/boulesteix/papers/survival.pdf]
Friedman JH, Hastie T, Tibshirani R: Additive Logistic Regression: A Statistical View of Boosting (with Discussion). The Annals of Statistics 2000, 28(2):337–407. 10.1214/aos/1016218223
Bühlmann P, Yu B: Boosting with the L_{2}Loss: Regression and Classification. Journal of the American Statistical Association 2003, 98(462):324–338. 10.1198/016214503000125
Efron B, Hastie T, Johnstone I, Tibshirani R: Least Angle Regression. The Annals of Statistics 2004, 32(2):407–499. 10.1214/009053604000000067
Freund Y, Schapire R: Experiments with a New Boosting Algorithm. In Proceedings of the Thirteenth International Conference on Machine Learning Theory. San Francisco: Morgan Kaufmann Publishers Inc; 1996.
Freund Y, Schapire R: A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting. Journal of Computer and System Sciences 1997, 55: 119–139. 10.1006/jcss.1997.1504
Bühlmann P: Boosting for High-Dimensional Linear Models. The Annals of Statistics 2006, 34(2):559–583. 10.1214/009053606000000092
Bühlmann P, Hothorn T: Boosting Algorithms: Regularization, Prediction and Model Fitting (with Discussion). Statistical Science 2007, 22(4):477–522. 10.1214/07-STS242
Ridgeway G: The State of Boosting. Computing Science and Statistics 1999, 31: 172–181.
Li H, Luan Y: Boosting Proportional Hazards Models using Smoothing Splines, with Applications to High-Dimensional Microarray Data. Bioinformatics 2005, 21(10):2403–2409. 10.1093/bioinformatics/bti324
Binder H, Schumacher M: Allowing for Mandatory Covariates in Boosting Estimation of Sparse High-Dimensional Survival Models. BMC Bioinformatics 2008, 9:14.
Hothorn T, Bühlmann P, Dudoit S, Molinaro A, van der Laan MJ: Survival Ensembles. Biostatistics 2006, 7(3):355–373. 10.1093/biostatistics/kxj011
Barrier A, Boelle PY, Roser F, Gregg J, Tse C, Brault D, Lacaine F, Houry S, Huguier M, Franc B, Flahault A, Lemoine A, Dudoit S: Stage II Colon Cancer Prognosis Prediction by Tumor Gene Expression Profiling. Journal of Clinical Oncology 2006, 24(29):4685–4691. 10.1200/JCO.2005.05.0229
Schemper M: Cox Analysis of Survival Data with Non-Proportional Hazard Functions. The Statistician 1992, 41(4):455–465. 10.2307/2349009
Klein JP, Moeschberger ML: Survival Analysis – Techniques for Censored and Truncated Data. New York: Springer; 1997.
van der Laan MJ, Robins JM: Unified Methods for Censored Longitudinal Data and Causality. New York: Springer; 2003.
Friedman JH: Greedy Function Approximation: A Gradient Boosting Machine. The Annals of Statistics 2001, 29(5):1189–1232. 10.1214/aos/1013203451
Korn EL, Simon R: Measures of Explained Variation for Survival Data. Statistics in Medicine 1990, 9(5):487–503. 10.1002/sim.4780090503
Schemper M, Henderson R: Predictive Accuracy and Explained Variation in Cox Regression. Biometrics 2000, 56: 249–255. 10.1111/j.0006-341X.2000.00249.x
Graf E, Schmoor C, Sauerbrei W, Schumacher M: Assessment and Comparison of Prognostic Classification Schemes for Survival Data. Statistics in Medicine 1999, 18(17–18):2529–2545. 10.1002/(SICI)1097-0258(19990915/30)18:17/18<2529::AID-SIM274>3.0.CO;2-5
Schemper M: Predictive Accuracy and Explained Variation. Statistics in Medicine 2003, 22(14):2299–2308. 10.1002/sim.1486
Heagerty PJ, Zheng Y: Survival Model Predictive Accuracy and ROC Curves. Biometrics 2005, 61: 92–105. 10.1111/j.0006-341X.2005.030814.x
Gerds TA, Schumacher M: Consistent Estimation of the Expected Brier Score in General Survival Models with RightCensored Event Times. Biometrical Journal 2006, 48(6):1029–1040. 10.1002/bimj.200610301
Gerds TA, Schumacher M: EfronType Measures of Prediction Error for Survival Analysis. Biometrics 2007, 63(4):1283–1287.
Efron B, Tibshirani R: Improvements on Cross-Validation: The 0.632+ Bootstrap Method. Journal of the American Statistical Association 1997, 92(2):548–560. 10.2307/2965703
R Development Core Team: R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria; 2008. ISBN 3-900051-07-0 [http://www.R-project.org]
Hothorn T, Bühlmann P, Kneib T, Schmid M: mboost: Model-Based Boosting. 2007. R package version 1.0–1 [http://R-Forge.R-project.org]
Hastie T, Efron B: lars: Least Angle Regression, Lasso and Forward Stagewise. 2007. R package version 0.9–7 [http://www.r-project.org/]
Park MY, Hastie T: glmpath: L1 Regularization Path for Generalized Linear Models and Cox Proportional Hazards Model. 2007. R package version 0.94 [http://www.r-project.org/]
Acknowledgements
We gratefully acknowledge support by Deutsche Forschungsgemeinschaft, grant HO 3242/13.
Authors' contributions
MS developed the algorithm, conducted the benchmark experiments, and wrote the initial version of the manuscript. TH is the primary author of the mboost package, contributed to the design of the benchmark study and revised the manuscript.
Rights and permissions
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Cite this article
Schmid, M., Hothorn, T. Flexible boosting of accelerated failure time models. BMC Bioinformatics 9, 269 (2008). https://doi.org/10.1186/147121059269
DOI: https://doi.org/10.1186/147121059269