Comparative optimism in models involving both classical clinical and gene expression information

Background In cancer research, most clinical variables have already been investigated and are now well established. The use of transcriptomic variables has raised two problems: restricting their number and validating their significance. Thus, their contribution to prognosis is currently thought to be overestimated. The aim of this study was to determine to what extent optimism concerning current transcriptomic models may lead to overestimation of the contribution of transcriptomic variables to survival prognosis. Results To achieve this goal, Cox proportional hazards models that adjust for clinical and transcriptomic variables were built. As the relevance of the clinical variables had already been established, they were not submitted to selection. As for genes, they were selected using both univariate and multivariate methods. Optimism and the contribution of clinical and transcriptomic variables to prognosis were compared through simulations and by using the Kent and O'Quigley ρ2 measure of dependence. We showed that the optimism relative to clinical variables was low because these are no longer submitted to selection of relevant variables. In contrast, for genes, the selection process introduced high optimism, which increased when the proportion of genes of interest decreased. However, this optimism can be decreased by increasing the number of samples. Conclusion Two phenomena have to be taken into account by comparing the predictive power and optimism of clinical variables and those of genes: overestimation for genes due to the selection process and underestimation for clinical variables due to the omission of relevant genes. In comparison with genes, the predictive value of validated clinical variables is not overestimated, which should be kept in mind in future studies involving both clinical and transcriptomic variables.


Background
In the field of cancer research, classical clinical variables have long been used as prognostic markers. Indeed, many strong clinical determinants that explain most of the prognosis have already been identified. Nevertheless, certain characteristics of cancer are still poorly understood, and these need to be elucidated to improve treatment. Thus, cancer research is making use of new technologies, especially microarrays, and survival analysis methods have been extended to take into account the potential informa-tion from microarray analysis with the hope that this transcriptomic information will supersede clinical data. For example, Shipp et al. [1] showed that a 13-genes-signature-based outcome predictor provided additional information not reflected in the clinical prognostic model based on the International Prognostic Index. Today, cancer clinicians would like to combine genes and classical clinical variables in the same models to improve assessment of cancer prognosis. In the recent years, some authors addressed this question in two ways. The first way is to involve classical clinical and transcriptomic variables in the same models [2][3][4]. However, not all authors took into account the particularity of each type of variable. The second way is to consider the additional predictive value of genes. In this context, Tibshirani and Efron [5] warned in 2002 against too premature conclusions regarding the predictive power of transcriptomic variables. Using the breast study from Van't Veer [6], those authors showed that the effects of genes were overestimated with regard to the classical clinical variables like tumor grade or size. They proposed a "pre-validation" strategy to correct the artificial importance given to genes.
In the very recent literature, few authors joined those two ways in unique models. Thus, Binder and Schumacher [7] proposed an offset-based boosting approach in the context of survival data. This approach allows also answering the question whether prediction is improved or not by adding transcriptomic variables to classical clinical variables in the same model. Shortly later, Boulesteix et al. [8] proposed a more general approach than that of Binder and Schumacher, in the sense that it was not limited to survival data. This approach is based on PLS, random forest, and the pre-validation strategy suggested by Tibshirani and Efron.
The interesting thing in the two above approaches is that they allow simultaneously to construct a classifier combining both types of variables and to determine whether microarray data present additional predictive value.
In agreement with this consideration, we think it is of major importance when designing statistical models to keep in mind that the characteristics of transcriptomic variables are completely different from those of clinical variables. Specific clinical variables have already been validated in many large studies. Thus, most of these clinical variables are no longer included in the selection process (e.g. the estrogen receptor status in breast cancer or the international prognostic index in lymphoma). In contrast, the selection step is still needed for genes, and various issues are still a matter of debate. First, fewer studies have been conducted on transcriptomic variables than on clinical variables; thus, there are fewer datasets available to repeat the analyses and validate the relationships. In most cases, genes selected in a single study are assumed to have a general prognostic value. This selection is presented as a benchmark for the disease without external validation studies on new datasets. Second, whenever available, these datasets are rather small compared to the number of genes under study. Considering the high number of variables and the relatively low number of observations, microarray data can easily lead to a high number of falsepositive variables. By chance alone, many genes may be found significantly associated with the outcome even though most of them may not actually be linked to prognosis.
Some publications have clarified certain additional issues related to the selection process in microarray analysis. Ein-Dor et. al showed that the final gene signature depends highly on the subset of patients used for gene selection [9]. Later, the same team pointed out that the reproducibility of a signature depends on the number of samples used for the analysis [10]. Other teams were interested in the False Discovery Rate (FDR); that is, the expected proportion of false positives among the genes declared as significant. When looking for differential genes, Pawitan et. al showed that the FDR is mostly influenced by the proportion of truly differentially expressed genes and by the sample size [11].
The same problems are met in survival studies where the construction of transcriptomic models raises simultaneously the problem of restricting the number of genes to include and that of validating the selected genes. When a model is too complex -i.e., the number of free parameters to estimate is too high given the information contained in the data -the strength of that model will be exaggerated due to overfitting. Some conclusions of the analysis may be due to noise or to some spurious associations between the covariates and the outcome. In this case, the model has high adequacy and predictive accuracy for the dataset on which it was built but is not able to accurately predict the outcome with new datasets. Thus, the ability of the model to predict outcome with new datasets is overestimated: this is called optimism. The main objective of this article is to quantify to what extent the optimism of transcriptomic models induced by the selection process may lead to overestimation of the contribution of transcriptomic variables to prognosis, especially in comparison with clinical variables when they are included in a single model. To be able to control the contribution of the two types of variables, the following study was conducted on simulated datasets.

Methods
To compare optimism relative to clinical and transcriptomic models within the context of survival, the study was based on simulated datasets that included both clinical and transcriptomic variables. We wanted to simulate the "real situation" faced by clinicians and statisticians: classical clinical variables are already validated, whereas transcriptomic variables are still in the selection and validation process. Regarding clinical variables, only validated ones are considered to build combined classifiers, while regarding genes, many of the considered ones are still superfluous noise genes.
To analyze one dataset, the following procedure was employed: (1) The variables of interest were first identified; (2) Once clinical variables or genes had been chosen, Cox proportional hazards models including both types of variables were constructed; (3) To measure the predictive information contained in each survival model, the ρ 2 measure of dependence from Kent and O'Quigley was used so that optimism for both types of variable could be compared [12,13].
R and S-Plus codes used in our analyses are available at ftp://pbil.univ-lyon1.fr/pub/logiciel/Optimism.

Simulation of the datasets
A classical way to link variables to censored survival data is to use the Cox proportional hazards model. Let us define X an (n, m) matrix of m variables for n individuals. For each of the n patients, the follow-up times were noted t 1 ,..., t n as were the event-indicators d 1 ,..., d n with d i = 1 if the event occurred and d i = 0 if it did not occur. At time t, the Cox proportional model is given by: where λ 0 (t) is a baseline hazard function, β = {β 1 ,..., β m } is the vector of parameters and X 1 ,..., X m are the vectors of length n describing each of the m variables for the n patients.
Through this model, we simulated a virtual population of size n in which the m variables consisted of both clinical and transcriptomic variables. The simulation process was inspired from the simulation study from Gui and Li [14], which R code is publicly available.
More precisely, each patient was described by two clinical variables, p genes and survival information. The aim was to estimate in a single model the relationship between the two types of variables and survival times.
Clinical variables were simulated using binomial distributions with probabilities 0.5 and 0.4, respectively, as parameters for success (e.g. the positive vs. negative estrogen status). Normal distributions N(0, 1) were assumed for the transcriptomic variables. A Weibull distribution with shape parameter 5 and scale parameter 2 was used for the baseline function. For censoring times, a uniform U(0, 8) was used, leading to about 40% censoring. The underlying model was: where X C and X T were respectively the matrix describing the clinical and the transcriptomic variables.
As there were two clinical variables, X C was an (n, 2) matrix. As for genes, p were under study, leading to an (n, p) matrix X T . Only p 1 of the p genes were considered as related to survival; the p 0 remaining genes were under the null hypothesis H 0 of no association with survival. Note that p = p 1 + p 0 and m = p + 2. The relevance of most of the clinical variables had already been established through several studies. The two clinical variables were then considered significant, and coefficients for both of these variables were set at 0.8: β C, i = 0.8, i = 1, 2. Coefficients for transcriptomic variables related to survival were set at 0.2: β T, i = 0.2, i = 1,..., p 1 and the remaining p 0 were set at 0: β T, For a fixed set of parameters p and p 1 , r = 60 training sets of n patients were simulated according to the design described above. For each of these training sets, 50 corresponding test sets were drawn following the same design. This overall process was performed varying n, p and p 1 sequentially. The number of patients n was considered in {50, 100, 200, 400}, p was considered in {500, 1000, 2000, 4000} and p 1 was considered in {5, 10, 20}.
A single Cox proportional hazards model involving both clinical and transcriptomic variables could then be estimated for each of these simulated datasets, as described below.

Variable selection and model construction Cox proportional hazards model
In the traditional Cox model, the vector of parameters is such that it maximizes the following Cox partial likelihood (PL): where D is the set of indices of the events and R k is the set of indices of the individuals at risk at time t k .
In our case, there were p + 2 parameters to estimate. This leads to a huge number of variables in comparison with the number of individuals; the high dimensional space of the transcriptomic predictors thus precludes the use of the standard maximum Cox partial likelihood method to estimate the parameters. Several methods have been pro- posed in the literature to deal with this high dimension issue in survival models involving genes.

Adaptation to high-dimensional data
The first solution aims at selecting a lower subset of genes according to the relevance of each gene taken separately. This approach takes each feature in an univariate way and also does not take into account the interactions between genes. We used the log-rank statistic to order genes; the number of genes to be involved in the model was chosen a priori. We retained the 20 genes with the highest statistical values, so that the number of selected genes was in the same order of magnitude as in the approach which follows. The second solution is a multivariate selection method that simultaneously selects genes and estimates their effect on survival. One way to do this is to maximize the partial likelihood under constraints using L1 or L2 penalization. Contrary to the L2 penalization [15,16], which uses all genes in the prediction, only some genes are used in the prediction with the L1 penalization [17]. The threshold gradient descent (TGD) method proposed by Friedman and Popescu [18] allows a compromise between the L1 and L2 penalizations. Through the choice of a defined threshold, it approximates the L2 (low threshold) and L1 (high threshold) penalized estimations. Gui and Li [19] extended the TGD to the survival model and demonstrated the ability of their approach to select relevant genes and to provide good predictive performance. We therefore used this model to select the genes to include in our models.
Briefly, the TGD method is based on the gradient method, which is classically employed to determine the minimum of a loss function. With this method, the parameters vector is derived in a sequential manner following the direction of the negative gradient of the loss function, here the partial log-likelihood noted l(β T ) and defined as l(β T ) = -logPL(β T ). The negative gradient is defined as: g(ν) = -∂l/ ∂β T Starting with = 0, the vector of estimated parameters is then updated at each iteration: The parameter ν, which begins at zero, controls the Although the TGD method combines selection of genes and estimation of their effect on survival, we used it only for selection purposes; that is, to select genes irrespective of their estimated coefficients.
Thus, we used two approaches to select genes: a univariate approach using the log-rank, and a multivariate approach based on the TGD. As for clinical variables, they were considered as validated and, thus, directly included in the final model.
The optimism arising respectively from the clinical and the transcriptomic variables was then estimated and compared as follows.

Comparison of the contribution of the variables to the prognosis ρ 2 as a measure of explained variability
Different criteria allow selection and comparison of models based on their capacity to predict the outcome of individuals who did not participate in the model building. Among them, explained variability reflects the robustness of the model, and its efficiency in predicting outcomes on new datasets. It measures the information given by some variables involved in a specific model.
In linear regression, the coefficient of determination R 2 quantifies the proportion of variability in a data set that Cox model on the test set to compute , we are able to evaluate the predictive information actually given in the test set by the genes selected on the training set. A lack of re-estimation of the coefficients would have assumed that the predictive power of the genes selected on the training set-given by the coefficients of the Cox model estimated on the training set-is equal on the test set. The selection process introduces much optimism. The latter must be taken into account by re-estimating the coefficients on the test set. By doing so, the predictive information of the selected genes will be closer to their effective predictive information in the test set.
These measures were estimated for both the clinical and the transcriptomic variables.

Optimism quantification
Optimism was quantified by computing relative differences between the various ρ 2 , as described in equations 4 to 6. By comparing ρ 2 values estimated in the training and the test sets, Δ TrTe (Equation 4) shows the error made by considering that the signature given by one dataset is the real signature and delivers the same information on other datasets. In other words, it gives the difference between the predictive information anticipated on one dataset and the effective information on another.

Results
Results are shown through boxplots. We found this way of representation well-suited to show the distribution of the various differences obtained over each set of 60 training datasets. Each point contributing to the boxplot corresponds to one measure of Δ, one Δ being computed for each of the 60 training sets.

Number of patients
Results are shown in an example with p = 1000 genes. For the clarity of the figure, only results obtained with p 1 = 10 are shown. They remain the same with p 1 = 5 or p 1 = 20. Regarding clinical variables, Δ TrTe varied around zero and overall did not depend on the sample size. Regarding genes, the difference decreased with increasing sample size; the difference never reached zero in the multivariate case. The decreasing effect was even stronger with the univariate selection method. This can be explained by the fact that the number of selected genes depended on the test set for the multivariate method whereas it is fixed a priori in the univariate method; in the former, the number of selected genes also varied and we observed that this number increased with the number of patients. The TGD selected genes that were not truly related to survival (False Positives) contributed to the computation of although they were noise. As a consequence Δ TrTe had higher values for the transcriptomic model in multivariate selection than in univariate selection, even for n = 400 patients. These results show that transcriptomic and clinical variables have different behaviors. The predictive power of genes selected on one dataset is overestimated with regard to the predictive power they would have with other datasets. On the contrary, the predictive power of clinical variables is the same with both training and test sets. As a result, the two types of variables cannot be interpreted the same way. higher p 1 , the greater the number of genes missed at selection. Note that, on the contrary, including non-relevant genes in the model does not bias the estimation of the other covariates.
Regarding genes, Δ TrPop tended to zero when the number of patients increased; the higher the number of patients, the nearer the adjusted was to the expected .
When the sample size is too small, the highly predictive information assumed to be given by the selected genes is far from true information. We were also interested in differences between adjusted and (figure 3).
Regarding the clinical variables, the results were the same as for the previous differences.
Regarding genes, by using the ones selected with the training set in the test set, the differences Δ TePop were mainly negative when the sample size exceeded 50 patients: this means that the selected genes were not able to report the true information contained in the dataset. In other words, the genes selected on the previous dataset had no predictive power on other datasets because they were not relevant.
When selecting the p 2 genes, these can be either really  figure 4 shows that increasing n, remained of the same order for all selected genes whereas the ρ 2 due to the TP increased. In contrast, the right panel of figure 4 shows that for all selected genes or for TP only evolved in the same way. In cases with 50 or 100 patients, there were no TP; was also only due to noise; this cannot be seen when using only one dataset for a study and may lead to incorrect interpretation of noise as information. The high values observed for all the genes for the 50 patients are due to the fact that there were too few patients to get good estimations.

Total number of genes
The need for a lot of individuals is valid whatever the number of genes. However, for a fixed number of individuals, the total number of features under study also has an impact on optimism. Results are shown in an example with p = 100 patients. Figure 5 shows the values of the differences between and in the transcriptomic model. When the total number of genes increased, Δ TrTe increased too. When there were too few genes of interest, it was difficult for the selection method to find the relevant ones (true positives). Genes selected on the training set had no predictive power on the test sets, which can be explained as follows. The more genes there are, the higher the optimism: the greater the number of genes under study, the more overestimated is the predictive power of the transcriptomic model. The high value of is due to noise and not to real information. The study of Δ TePop , indicated that genes selected on the training set are not able to relay the predictive power really contained in the test set when the number of genes truly related to survival is too small relatively to the total number of genes. Indeed, differences were negative, and increased with increasing p 1 .

Discussion
Genes are not yet validated as predictors of outcome. They are selected on a single dataset and assumed to have the same predictive power with other datasets. But results show that this predictive power is overestimated in the case of genes. This overestimation is even more significant when the number of patients is low and the total number of genes high. This is due to two phenomena which overlap: the gene selection mechanism and the power problem. When there are too few patients the genes that are selected are not the true ones due to a lack of power. This problem is not encountered for clinical variables, for which the selection process is over because they have been already validated. The same problem arises when there are too many genes relatively to the number of genes truly related to survival. This problem is difficult to solve in real life studies and must be kept in mind. Indeed, the number of genes of interest is not known in advance.
Many papers dealt with the choice of the best method for gene selection. Our aim was not to study a specific method but rather to study what happens once genes have been selected. However, some comments may be made as to the TGD. First, from one dataset to another, there is considerable variety in the number and identity of genes in a selected set. Nevertheless, the number of true positives, which increases with the number of samples, is more stable. Note that the number of selected genes depends on the choice of the parameter τ. With a fixed value of τ and p 1 , the conclusions would be the same whatever the choice of this parameter: optimism increases with the number of patients and decreases with the number of genes involved in the study. Second, one point that appears to us as a drawback of this method is that it gives very low coefficient estimations, even for true positives. In our case, we only used the TGD for selection purposes, and coefficients of selected genes were re-estimated in a new Cox model. However, we noticed that some of these genes had very low estimated coefficients on the training sets, even more so with a low number of samples. We may wonder why these genes were selected.
To answer the question of comparative optimism of clinical and transcriptomic variables, we worked on simulated datasets reflecting the real situation encountered by clinicians and statisticians. Thus, we simulated only two true clinical variables, but many superfluous genes. Our conclusions depend on this choice of the simulation setup. It is not the nature (classical clinical or transcriptomic) of the two types of variables that explains the difference in the introduced optimism but their status (selected and validated or not): few validated variables for clinical variables and many variables under selection for genes, of which noisy variables. It is clear that with 50 clinical variables and only 5 biologically pre-selected relevant cancer genes, the situation would be reversed.
Concerning the simulation process, as the real correlation structure of genes is not well known, we chose not to include it in this work. Moreover, clinical variables and gene-expressions were all modeled to be independent. Future studies will aim to model dependence structures between the two types of variables and genes.

Conclusion
By comparing the predictive power and optimism from clinical variables with genes two phenomena have to be taken into account: overestimation for genes due to the selection process and underestimation for clinical variables due to the omission of relevant genes. By including clinical and transcriptomic variables in the same model these results must be kept in mind. The predictive power of the clinical variables must not be neglected. In comparison with genes, their importance is not overestimated, which gives the feeling that they have less influence. In reality, part of their impact is hidden by the optimism encountered for genes.