We first present the notation, the Cox model, and the definition of the covariates used in the Cox models; in particular, we present the preclustering algorithm. We then describe the log partial likelihood derived for the Cox model and introduce model selection and shrinkage methods. Since most methods for dimension reduction or shrinkage require the selection of a tuning parameter that determines the amount of shrinkage, we finally describe how this tuning parameter is chosen for each method.

### Cox model

In the following, we assume that we have a sample of *n* patients with a (possibly right-censored) survival time as the response. In order to cope with censored survival time data we use the Cox model, also known as the proportional hazards regression model [8]. Cox suggested that the risk of an event (e.g. cancer recurrence, death or any date of interest) at time *t* for a patient with given covariate vector *x* = (*x*_{1},..., *x*_{p}) is modeled as

$$ h(t \mid x) = h_0(t)\, \exp(\beta' x) \qquad (1) $$

where *h*_{0}(·) is an arbitrary baseline hazard function and *β* = (*β*_{1},..., *β*_{p}) a vector of regression coefficients. In the classical setting with *n > p*, the regression coefficients are estimated by maximizing the log partial likelihood

$$ l(\beta) = \sum_{i=1}^{n} \delta_i \left[ \beta' x_i - \log \sum_{j \in R(t_i)} \exp(\beta' x_j) \right]. \qquad (2) $$

For patient *i*, this expression contains the possibly censored failure time *t*_{i}, the (non-censoring) indicator *δ*_{i} (equal to 1 if *t*_{i} is a true survival time and to 0 if it is censored) and the vector of gene (or summarized gene group) expression values *x*_{i}.

Further, *R*(*t*_{i}) is the risk set at time *t*_{i}; this is the set of all patients who have not yet failed nor been censored. The value of *β*′*x*_{i} is called the *prognostic index* or *risk score* of patient *i*.
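The analyses in this paper were carried out in R; as a purely illustrative sketch, the log partial likelihood and the prognostic index above can be computed directly for a toy data set (patient data and covariate values below are invented for illustration):

```python
import math

def log_partial_likelihood(beta, times, events, X):
    """Cox log partial likelihood: for each event, the contribution is
    beta'x_i - log( sum over the risk set R(t_i) of exp(beta'x_j) )."""
    def lp(x):  # linear predictor, i.e. the prognostic index beta'x
        return sum(b * v for b, v in zip(beta, x))
    ll = 0.0
    for i, (t_i, d_i) in enumerate(zip(times, events)):
        if d_i == 0:          # censored observations contribute no event term
            continue
        # risk set R(t_i): all patients still under observation at t_i
        risk = [j for j, t_j in enumerate(times) if t_j >= t_i]
        ll += lp(X[i]) - math.log(sum(math.exp(lp(X[j])) for j in risk))
    return ll

# toy data: 4 patients, 1 genomic covariate
times  = [2.0, 3.0, 5.0, 7.0]
events = [1, 0, 1, 1]          # delta_i: 1 = event observed, 0 = censored
X      = [[0.5], [1.0], [-0.3], [0.2]]
print(log_partial_likelihood([0.8], times, events, X))
```

With *β* = 0 every event at time *t*_{i} contributes −log|*R*(*t*_{i})|, which is a convenient sanity check for such an implementation.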

### Definition of covariates

In the following, we assume that the data consist of two different categories of covariates:

- clinical covariates *Z* = (*Z*_{1},..., *Z*_{q}): e.g. tumor size, tumor grade, age;
- genomic covariates *X* = (*X*_{1},..., *X*_{p}): gene expression values of single genes or combined gene expression values for gene groups.

For a detailed analysis we consider three different types of Cox models. We start with the simple model using only the clinical covariates,

$$ h(t \mid z) = h_0(t)\, \exp(\gamma' z). \qquad (3) $$

The second model consists of *p* genomic covariates *X* = (*X*_{1},..., *X*_{p}). In our genomic regression models we use single genes, gene groups, as well as preclustered gene groups as covariates. A gene group must be appropriately summarized in order to obtain one representative value for each individual (patient). We summarize the gene expression measurements of all genes belonging to one GO group or cluster via the first principal component of these genes. In the following we will consider three types of genomic models:

- genes
- GO groups
- preclustered GO groups.
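The first-principal-component summary of a gene group can be sketched in a few lines; the following Python illustration (the paper's analyses were done in R) computes the leading eigenvector of the within-group covariance matrix by power iteration and projects each patient onto it, using invented toy expression values:

```python
import math

def first_pc_scores(expr, iters=200):
    """Summarize a gene group by its first principal component.
    expr: one row per patient, one column per gene in the group.
    Returns one PC score per patient (projection on the first PC)."""
    n, p = len(expr), len(expr[0])
    means = [sum(row[k] for row in expr) / n for k in range(p)]
    centered = [[row[k] - means[k] for k in range(p)] for row in expr]
    # sample covariance matrix of the genes (p x p)
    cov = [[sum(centered[i][a] * centered[i][b] for i in range(n)) / (n - 1)
            for b in range(p)] for a in range(p)]
    # power iteration for the leading eigenvector
    v = [1.0] * p
    for _ in range(iters):
        w = [sum(cov[a][b] * v[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # PC score: projection of each centered patient profile on v
    return [sum(centered[i][k] * v[k] for k in range(p)) for i in range(n)]

# toy group of 3 positively correlated genes measured on 5 patients
expr = [[1.0, 1.1, 0.9],
        [2.0, 2.2, 1.8],
        [3.1, 2.9, 3.0],
        [0.5, 0.4, 0.6],
        [2.5, 2.6, 2.4]]
scores = first_pc_scores(expr)
```

The resulting scores serve as the single representative covariate *X*_{j} for that group; note that the sign of a principal component is arbitrary, which does not matter inside the Cox model since the coefficient absorbs it.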

In the last step, we combine the genomic models with the clinical model, which can be written as

$$ h(t \mid x, z) = h_0(t)\, \exp(\beta' x + \gamma' z). \qquad (4) $$

Due to the small number of clinical covariates, the shrinkage and dimension reduction procedures will only be applied to the genomic covariates.

### Preclustering with PAM

In order to find *K* homogeneous subgroups of genes within one GO group containing *N* genes, we use the partitioning around medoids (PAM) cluster analysis (cf. [13]). The PAM procedure is based on the search for *K* representative genes, the medoids *m*_{1},..., *m*_{K}, among the *N* genes to be clustered. To achieve the goal of finding *K* medoids that minimize the sum of dissimilarities of the genes to their closest medoid,

$$ \sum_{i=1}^{N} \min_{k = 1,\dots,K} d(i, m_k), \qquad (5) $$

where *d*(*i, j*) is the dissimilarity of the *i*th and *j*th gene, the two following steps are carried out iteratively until convergence, starting with *K* sequentially selected genes as the initial solution:

1. Build: Select sequentially *K* initial medoids and assign each gene to its closest medoid.
2. Swap: Minimize the objective function (5) by switching medoids with other genes of the same cluster.

To find correlated subgroups, the dissimilarity

$$ d(i, j) = 1 - \operatorname{corr}(x_i, x_j) $$

of the *i*th and *j*th gene with the gene expressions *x*_{i} and *x*_{j} is based on their Pearson correlation corr(·, ·). This yields small dissimilarities between positively correlated genes and large values for negatively correlated genes, respectively.
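A minimal sketch of the build and swap steps with this correlation-based dissimilarity is shown below in Python (the actual analyses used the R implementation of PAM [13]; this toy version swaps a medoid with any non-medoid gene, a slight simplification of the step described above, and the expression values are invented):

```python
import math

def pearson(x, y):
    """Pearson correlation of two equally long expression vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) *
                    sum((b - my) ** 2 for b in y))
    return num / den

def pam(expr, K):
    """Toy PAM with d(i, j) = 1 - corr(x_i, x_j).
    expr: one row per gene. Returns (medoids, assignment)."""
    N = len(expr)
    d = [[1.0 - pearson(expr[i], expr[j]) for j in range(N)] for i in range(N)]

    def cost(meds):  # objective (5): total dissimilarity to closest medoid
        return sum(min(d[i][m] for m in meds) for i in range(N))

    # Build: sequentially select K initial medoids greedily
    medoids = []
    for _ in range(K):
        best = min((g for g in range(N) if g not in medoids),
                   key=lambda g: cost(medoids + [g]))
        medoids.append(best)
    # Swap: replace a medoid by a non-medoid while the cost decreases
    improved = True
    while improved:
        improved = False
        for mi in range(K):
            for g in range(N):
                if g in medoids:
                    continue
                cand = medoids[:mi] + [g] + medoids[mi + 1:]
                if cost(cand) < cost(medoids):
                    medoids, improved = cand, True
    assign = [min(medoids, key=lambda m: d[i][m]) for i in range(N)]
    return medoids, assign

# toy GO group: genes 0,1 rise over the samples, genes 2,3 fall
expr = [[1, 2, 3, 4], [2, 4, 6, 8], [4, 3, 2, 1], [8, 6, 4, 2]]
medoids, assign = pam(expr, 2)
```

On this toy group the two positively correlated gene pairs end up in separate clusters, as intended by the dissimilarity choice.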

The number of clusters *K* for the PAM algorithm has to be chosen in advance. To find tight clusters of highly correlated genes, [14] suggest using the intra-cluster correlation (ICC),

$$ \mathrm{ICC} = \frac{2}{N_j (N_j - 1)} \sum_{i} C_i. $$

Here, the values *C*_{i} are the elements of the lower triangle of the correlation matrix of the *N*_{j} genes within a single cluster. The maximum mean ICC among the *K* = 2,..., *N* − 1 possible cluster configurations corresponds to the optimal number of clusters within one GO group.
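Reading the ICC as the mean of the lower-triangle correlations, one plausible interpretation of the description above, it can be computed as follows (illustrative Python, toy data; the exact form used in [14] may differ in detail):

```python
import math

def pearson(x, y):
    """Pearson correlation of two equally long expression vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = math.sqrt(sum((a - mx) ** 2 for a in x) *
                    sum((b - my) ** 2 for b in y))
    return num / den

def intra_cluster_correlation(cluster):
    """Mean of the lower-triangle elements C_i of the correlation
    matrix of the N_j genes in one cluster."""
    lower = [pearson(cluster[a], cluster[b])
             for a in range(len(cluster)) for b in range(a)]
    return sum(lower) / len(lower)

# toy cluster of three perfectly correlated genes -> ICC = 1
icc = intra_cluster_correlation([[1, 2, 3], [2, 4, 6], [0, 1, 2]])
```

Averaging the per-cluster ICC values over all clusters of a configuration and maximizing over *K* then yields the optimal number of clusters.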

### Methods for dimension reduction

For comparing our results to those published in the literature, we make use of the two most established and successful shrinkage procedures: *L*_{1} (lasso) and *L*_{2} (ridge) penalized regression. Univariate and forward stepwise selection do not produce satisfactory results in our high-dimensional settings: we compared these two methods in our analysis and, in agreement with previous results from Boevelstad et al. [4, 12], their prediction performance was always worse (data not shown). We present the methods for the model containing clinical and genomic information.

*L*_{1} (lasso) and *L*_{2} (ridge) penalized estimation methods shrink the estimates of the regression coefficients towards zero relative to the maximum likelihood estimates. Both methods are similar in nature, but the results of *L*_{1} and *L*_{2} penalization can be very different. We perform the penalization only on the high-dimensional genomic covariates; the clinical covariates are handled as unpenalized mandatory variables.

The lasso [11] shrinks the regression coefficients toward zero by penalizing their absolute values instead of their squares. The penalized log partial likelihood thus becomes

$$ l(\beta, \gamma) - \lambda \sum_{k=1}^{p} |\beta_k|. $$

Ridge regression [9] shrinks the regression coefficients by imposing a penalty on their squared values. The regression coefficients are estimated by maximizing the penalized log partial likelihood

$$ l(\beta, \gamma) - \lambda \sum_{k=1}^{p} \beta_k^2, $$

where $\lambda \sum_{k=1}^{p} \beta_k^2$ is the penalty term and *l*(*β*, *γ*) is given by (2). Applying an *L*_{2} penalty tends to result in many small but non-zero regression coefficients, whereas penalizing with the absolute values has the effect that many regression coefficients are shrunk exactly to zero. Thus the lasso is also a variable selection method.

We applied both methods using the R package penalized [15]. In both methods, the tuning parameter *λ* controls the amount of shrinkage and is chosen by cross-validation, as described in the following section.
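The qualitative difference between the two penalties can be seen in a deliberately simplified one-dimensional problem: replacing the log partial likelihood by a quadratic loss, the lasso solution is the soft-thresholding operator while the ridge solution is proportional shrinkage. This is only a didactic Python sketch, not the fitting procedure of the penalized package:

```python
def lasso_1d(z, lam):
    """argmin_b 0.5*(b - z)**2 + lam*|b|  ->  soft-thresholding:
    small unpenalized estimates are set exactly to zero."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0

def ridge_1d(z, lam):
    """argmin_b 0.5*(b - z)**2 + 0.5*lam*b**2  ->  proportional
    shrinkage: coefficients become small but stay non-zero."""
    return z / (1.0 + lam)

# five unpenalized coefficient estimates (invented values)
zs = [3.0, 1.2, 0.4, -0.1, -2.5]
lam = 0.5
lasso = [lasso_1d(z, lam) for z in zs]
ridge = [ridge_1d(z, lam) for z in zs]
```

With λ = 0.5 the lasso zeroes out the two smallest coefficients while the ridge merely rescales all five, mirroring the variable-selection property noted above.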

### Choosing the tuning parameter

The model complexity of the prediction methods depends on a tuning parameter *λ*. We use *M*-fold cross-validation as proposed by [16] for estimating *λ*. The *M*-fold cross-validated log partial likelihood (CVPL) is given by

$$ \mathrm{CVPL}(\lambda) = \sum_{m=1}^{M} \left[ l\bigl(\hat{\beta}_{(-m)}, \hat{\gamma}_{(-m)}\bigr) - l_m\bigl(\hat{\beta}_{(-m)}, \hat{\gamma}_{(-m)}\bigr) \right], $$

where *l*(*β*, *γ*) denotes the log partial likelihood given in (2) and *l*_{m}(*β*, *γ*) the log partial likelihood when the *m*th fold (*m* = 1,..., *M*) is left out.

The difference between the two terms in the formula is that in the right term the likelihood is evaluated without the *m*th fold, while in the left term it is evaluated on all patients; in both cases the parameters *β* and *γ* are estimated without the *m*th fold. The estimates of *β* and *γ* when the *m*th fold is left out are denoted by $\hat{\beta}_{(-m)}$ and $\hat{\gamma}_{(-m)}$. The optimal value of *λ* is chosen to maximize the sum of the contributions of each fold to the log partial likelihood.
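The CVPL computation can be sketched for a single covariate; everything here is illustrative (the real fitting used the penalized package in R): the "optimizer" is a coarse grid search, the penalty is a ridge term, and the data are invented. Each fold contributes the full-data likelihood minus the training likelihood, both evaluated at the training estimate:

```python
import math

def coxlik(beta, data):
    """Log partial likelihood for one covariate; data = [(t, delta, x), ...]."""
    ll = 0.0
    for t_i, d_i, x_i in data:
        if d_i == 0:
            continue
        risk = [x for t, _, x in data if t >= t_i]
        ll += beta * x_i - math.log(sum(math.exp(beta * x) for x in risk))
    return ll

def fit_ridge(data, lam, grid):
    """Maximize coxlik(beta) - lam*beta**2 over a coarse beta grid
    (a toy stand-in for a Newton-type optimizer)."""
    return max(grid, key=lambda b: coxlik(b, data) - lam * b * b)

def cvpl(data, lam, M=3, grid=None):
    """M-fold CVPL: sum over folds of l(beta_hat_(-m)) - l_m(beta_hat_(-m)),
    where l uses all patients and l_m leaves out the m-th fold."""
    grid = grid or [i / 20.0 for i in range(-40, 41)]
    folds = [data[m::M] for m in range(M)]
    total = 0.0
    for m in range(M):
        train = [obs for k, fold in enumerate(folds) if k != m for obs in fold]
        b = fit_ridge(train, lam, grid)          # estimated without fold m
        total += coxlik(b, data) - coxlik(b, train)
    return total

# invented data: (time, event indicator, covariate value)
data = [(1, 1, 2.0), (2, 1, 1.5), (3, 1, 1.0),
        (4, 1, 0.5), (5, 1, 0.0), (6, 0, -0.5)]
best_lam = max([0.01, 0.1, 1.0, 10.0], key=lambda l: cvpl(data, l))
```

The last line mirrors the selection rule of the text: the λ maximizing the summed fold contributions is chosen.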

### Evaluation

Next we describe how we evaluate the prediction performance of the models. We make use of three different model evaluation criteria. The whole procedure is applied to two well-known data sets. The basic idea is to split the data into a training set for model fitting and a test set for model evaluation, i.e. for determining the prediction performance. It is important to note that we have to consider several splits of the data into training and test sets due to the extreme dependence of the results on such a split (cf. [4, 17]).

### Logrank Test

We assign patients to subgroups based on their prognosis: one with *good* and one with *bad* prognosis. The higher the prognostic index $\hat{\beta}' x_i$ of patient *i*, the shorter its expected survival time. For this reason, a patient *i* in the test set is assigned to the *high-risk* group if its prognostic index is above the median of all prognostic indices calculated on the test set. We apply a logrank test to the two prognostic groups and use the *p*-value as an evaluation criterion for the usefulness of the grouping. Boevelstad [4] points out that a disadvantage of this criterion is that it does not consider the ranking of the patients within the groups and that it may not be biologically meaningful.
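The median split itself is a one-liner in spirit; the following Python sketch (illustrative only, with invented prognostic indices; the logrank test itself would be carried out with standard survival software such as R's survdiff) shows the group assignment rule:

```python
def median_split(prog_index):
    """Assign test-set patients to risk groups: 'high' if the
    prognostic index exceeds the median of all test-set indices."""
    s = sorted(prog_index)
    n = len(s)
    med = (s[n // 2 - 1] + s[n // 2]) / 2 if n % 2 == 0 else s[n // 2]
    return ['high' if pi > med else 'low' for pi in prog_index]

# invented prognostic indices for four test patients
groups = median_split([0.1, 0.5, 0.9, 1.3])
```

By construction the split yields two groups of (nearly) equal size, which is what makes the subsequent logrank comparison balanced.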

### Prognostic Index

The prognostic index $\hat{\beta}' x_i$ is used as a single continuous covariate in a Cox model: we fit the model $h(t \mid x_i) = h_0(t)\, \exp(\alpha\, \hat{\beta}' x_i)$. Using the likelihood ratio test, we test the null hypothesis *α* = 0 versus *α* ≠ 0 and assess the prediction performance with the obtained *p*-value. A small *p*-value indicates the ability of the prognostic index to discriminate between short and long survivors.

### Brier Score

The prediction performance can also be assessed based on the (integrated) Brier score that was introduced in the survival context by [5]. The consistent estimate of the expected Brier score *BS*(*t*) is defined as a function of time *t* > 0 by

$$ \widehat{BS}(t) = \frac{1}{n} \sum_{i=1}^{n} \left[ \frac{\hat{S}(t \mid x_i)^2\, I(t_i \le t,\ \delta_i = 1)}{\hat{G}(t_i)} + \frac{\bigl(1 - \hat{S}(t \mid x_i)\bigr)^2\, I(t_i > t)}{\hat{G}(t)} \right], $$

where $\hat{S}(t \mid x_i)$ stands for the estimated survival probability of patient *i* and $\hat{G}(\cdot)$ denotes the Kaplan-Meier estimate of the censoring distribution. The estimation of $\hat{S}(t \mid x_i)$ is performed via the Breslow estimator of the cumulative baseline hazard function (see, e.g., [18], Chapter 8.8). Good predictions at time *t* are reflected by small Brier scores. Note that the Brier score *BS*(*t*) is dependent on the point in time *t*. The integrated Brier score (IBS), given by

$$ \mathrm{IBS} = \frac{1}{t^*} \int_0^{t^*} \widehat{BS}(t)\, dt, $$

is a score for the average prediction performance over all time points in the interval [0, *t*^{*}]. In accordance with [6], we calculate the IBS for the two data sets with *t*^{*} = 10 years due to high censoring after 10 years of survival.
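A simplified Python illustration of the Brier score estimate follows; it takes the predicted survival probabilities as given (in the paper they come from the Breslow estimator), uses invented toy data, and ignores the left-limit subtlety of evaluating the censoring survivor function at *t*_{i}:

```python
def km_censoring(times, events):
    """Kaplan-Meier estimate G(t) of the censoring distribution:
    censorings (delta = 0) are treated as the 'events'."""
    def G(t):
        surv = 1.0
        for u in sorted(set(times)):
            if u > t:
                break
            at_risk = sum(1 for s in times if s >= u)
            cens = sum(1 for s, d in zip(times, events) if s == u and d == 0)
            surv *= 1.0 - cens / at_risk
        return surv
    return G

def brier_score(t, times, events, surv_pred):
    """Estimated expected Brier score BS(t) with inverse-probability-of-
    censoring weights; surv_pred[i] = predicted S(t | x_i)."""
    G = km_censoring(times, events)
    n = len(times)
    total = 0.0
    for t_i, d_i, s_i in zip(times, events, surv_pred):
        if t_i <= t and d_i == 1:        # event before t: target is 0
            total += s_i ** 2 / G(t_i)
        elif t_i > t:                    # still alive at t: target is 1
            total += (1.0 - s_i) ** 2 / G(t)
        # censored before t: weight 0, no contribution
    return total / n

# toy data without censoring, predicted survival probabilities invented
bs = brier_score(2.5, [1, 2, 3, 4], [1, 1, 1, 1], [0.2, 0.4, 0.6, 0.8])
```

Integrating such values of BS(*t*) over a grid of time points up to *t*^{*} and dividing by *t*^{*} then gives the IBS.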