Bagging survival tree procedure for variable selection and prediction in the presence of nonsusceptible patients
 Cyprien Mbogning^{1, 2}Email author and
 Philippe Broët^{1, 2, 3, 4}
https://doi.org/10.1186/s128590161090x
© Mbogning and Broët. 2016
Received: 22 December 2015
Accepted: 21 May 2016
Published: 7 June 2016
Abstract
Background
For clinical genomic studies with highdimensional datasets, treebased ensemble methods offer a powerful solution for variable selection and prediction taking into account the complex interrelationships between explanatory variables. One of the key component of the treebuilding process is the splitting criterion. For survival data, the classical splitting criterion is the Logrank statistic. However, the presence of a fraction of nonsusceptible patients in the studied population advocates for considering a criterion tailored to this peculiar situation.
Results
We propose a bagging survival tree procedure for variable selection and prediction where the survival treebuilding process relies on a splitting criterion that explicitly focuses on timetoevent survival distribution among susceptible patients.
A simulation study shows that our method achieves good performance for the variable selection and prediction. Different criteria for evaluating the importance of the explanatory variables and the prediction performance are reported. Our procedure is illustrated on a genomic dataset with gene expression measurements from early breast cancer patients.
Conclusions
In the presence of nonsusceptible patients among the studied population, our procedure represents an efficient way to select eventrelated explanatory covariates with potential higherorder interaction and identify homogeneous groups of susceptible patients.
Keywords
Bagging Survival tree Highdimensional data Nonsusceptible individuals GenomicBackground
Since the inception of largescale genomic technologies, there has been a growing interest in analyzing the prognostic and predictive impact of highdimensional genomic markers. However, the extremely large number of potential interaction terms prevent from being specified in advance and incorporated in classical survival models. In this context, treebased recursive partitioning methods such as CART (Classification And Regression Tree [1]) provide wellsuited and powerful alternatives. This nonparametric methodology partitions recursively the predictor space into disjoint subregions (socalled terminal nodes or leaves) that are near homogeneous according to the outcome of interest. This framework is particularly wellsuited to detect relevant interactions and produce prediction in highdimensional settings.
Since the first extension of CART to censored data (termed as survival trees) proposed by Gordon and Olshen [2], many new methods have been proposed so far (for a review see [3]). Broadly speaking, the key components for a survival tree are: the splitting criterion, the prediction measure, the pruning and tree selection rules. The splitting survivaltree criteria rely either on minimizing the withinnode homogeneity or maximizing the betweennode heterogeneity. They are based on various quantities such as the distance between KaplanMeier survival curves [2], likelihoodrelated functions (e.g. [4]) or score statistics (e.g. [5]) such as weighted or unweighted Logrank test statistics. The final prediction measure, within each terminal node, is typically based on nonparametric estimations of either the cumulative hazard function or the survival function. The pruning and selection rules are applied to find the appropriate subtree and avoid overfitting.
However, the wellknown instability of treebased structures has led to the development of socalled survival ensemble methods such as bagging survival tree and random survival forest [6, 7]. The main idea is that the combination of several survival tree predictors has better predicting power that each individual tree predictor. The general strategy is to draw bootstrap samples from the original observations and to grow the maximal tree for each of these samples. This strategy also circumvents the problem of pruning and selection since each tree is grown full size. The final prediction is obtained by averaging the predictions from each individual trees. In practice, the bagging can be viewed as a special case of random survival forests where all the covariates are considered as relevant candidates at each node. These methods also provide a way to define various variable importance measures that can be used for variable selection.
Even though survival trees are nonparametric methods, their constructions rely heavily on the chosen modelrelated splitting criteria that are based on either parametric or semiparametric modeling assumptions (e.g. [4, 8]). Thus, for a particular problem, the choice of the splitting criteria is crucial to the performance of the tree regarding variable selection and prediction [9]. This problem is particularly appealing in the context of survival data with nonsusceptible individuals where the investigator is interested in identifying homogeneous subgroups according to the timetoevent outcome among the individuals who are susceptible to experience the event of interest. In clinical oncology, these nonsusceptible individuals (sometimes referred as longterm survivors or cured patients) are those who have been successfully cured from the disease by the primary treatment. For infectious and immune diseases, these individuals are those who are resistant to certain pathogens or tolerant to specific antigens. In such mixed population, none of the classically used splitting criteria explicitly focuses on the timetoevent survival distribution among susceptible individuals, which raises some open questions about their performance.
In the literature, various survival models taking into account for a fraction of nonsusceptible patients (also called “improper survival distribution” models) have been proposed. The oldest framework relies on twocomponent mixture models which explicitly assumes that the population under study is a mixture of two subpopulations of patients (susceptible/nonsusceptible) in a parametric or semiparametric modeling approach (for a review, see [10]). A different framework proposed more recently defines the cumulative hazard risk as a bounded increasing positive function that can be interpreted from either a mechanistic model (as first introduced by [11] in oncology) or a latent activation scheme [12].
In this work, our aim is to unravel complex interactions between genomic factors that act on the timetoevent distribution among susceptible patients while adjusting for the confounding effect associated to the existence of a fraction of susceptible patients in the population under study.
Thus, we propose a bagging survival tree procedure for variable selection and prediction which is tailored to this situation. The strategy relies on an improper survival modeling which considers a linear part for taking into account for known confounders associated with the nonsusceptible fraction and a tree structure for the eventrelated explanatory variables. The building of the survival trees rely on a modelbased splitting criteria that explicitly focuses on susceptible patients. The considered splitting criterion is linked to a recently proposed modelbased discrimination index that quantifies the ability of a variable to separate susceptible patients according to their timetoevent outcome [13].
Next, the splitting criteria and the general procedure are presented. We then compare the results obtained with this procedure to those obtained with the classical Logrank statistic as the splitting criteria. We illustrate the clinical interest of this procedure for selection and prediction among patients with earlystage breast carcinoma for whom gene expression measurements have been collected. We conclude with a discussion on the practical use of the procedure, its limitations and the potential extensions.
Methods
Notations and improper survival model
Let the continuous random variables T and C be the true event and censoring times. Let X=m i n(T,C) be the observed time of followup, δ=1_{(X=T)} the indicator of event and Y(t)=1 _{(X≥t)} the at risk indicator at time t. Here, we consider that for nonsusceptible individuals T=∞ _{+}. Thus, the survival function S(t) of T is said to be improper with S(∞ _{+})>0. The hazard function (or the instantaneous event rate) of T is noted: λ(t)=f(t)/S(t), where f(t) is the density function of T. The corresponding cumulative hazard function is noted \(\Lambda (t)={\int _{0}^{t}}\lambda (s)ds\) with a finite positive limit θ such as Λ(t=∞ _{+})=θ<∞ _{+}. Let Z=(Z _{1},Z _{2}) be the (m _{1}+m _{2})dimensional vector of covariates where Z _{1} is the m _{1}dimensional subvector of known confounding covariates linked to the nonsusceptible state and Z _{2} is the m _{2}dimensional subvector of explanatory covariates of interest (associated with the timetoevent outcome).
Here, the cumulative hazard function is modeled such as: \(\Lambda \left (tZ_{1i},W_{il}^{\Phi (Z_{\mathbf {2}})}\right)=\theta e^{\alpha ^{T}Z_{1i}}\left \{1\exp \left [ H \left (t;W_{il}^{\Phi (Z_{\mathbf {2} })}\right) \right ] \right \} \) where H(t) is an unspecified continuous positive function increasing from zero to infinity which formulates the shape of the timetoevent survival distribution for each terminal node. Thus, the cumulative hazard function \(\Lambda \left (tZ_{1i},W_{il}^{\Phi (Z_{\mathbf {2}})}\right)\) is bounded, increases with t and reaches its maximum for \(\phantom {\dot {i}\!}\theta e^{\alpha ^{T}Z_{1i}}\) where α is an unknown vector of parameters associated to Z _{1} and θ is a positive parameter.
where \(h(t)=\frac {\partial H\left (t\right) }{\partial t}\) and γ is an unknown parameter associated with variable Z ^{∗}.
Splitting criterion
The classical use of Logrank related statistics in survival trees relies on the fact that these statistics are considered as betweennode heterogeneity criteria.
In the context of a mixed population (nonsusceptible/susceptible), we have proposed [13] a pseudoR2 criterion that can be interpreted in terms of percentage of separability obtained by a variable according to timetoevent outcomes of susceptible patients. This criterion represents a good candidate for the splitting process.
In the following, we give the formula of the splitting criterion through its relationship with the partial loglikelihood score.
The practical expression of U and V are obtained by replacing Λ _{0}, θ, and α by their respective estimators \(\hat {\Lambda }_{0}\), \(\hat {\theta }\), and \(\hat {\alpha }\). Here, \(\widehat {\Lambda }_{0}\) is the leftcontinuous version of the Breslow’s estimator [16, 17]. The estimated quantity \(\widehat {\theta }\) is equal to \(\hat {\Lambda }_{0}(t_{\max })\) where t _{max} is the last observed failure time and \( \widehat {\alpha }\) is the maximum partial likelihood estimator of α under the null hypothesis (γ=0).
The quantity \(S=\frac {U^{2}}{V}/K\) where K is the total number of distinct event times is a pseudoR2 measure [13]. This criterion is unitless, ranges from zero to one and increases with the effect of the splitting variable. It is also not affected by the censoring, the sample size and the nonsusceptible fraction. To a factor K, this criterion can also be interpreted as the robust score statistic obtained from the partial loglikelihood under the improper survival model [15].
Bagging procedure and prediction estimate
We consider a learning set \(\mathcal {L}\), consisting of n independent observations: \(\mathcal {L}=\{{(X_{i},\delta _{i},Z_{i})},\ i=1,\ldots,n\}\). Let \(\mathcal {L}_{b}^{\ast }\) (b=1,…,B) denotes the b^{th} bootstrap sample of the training set \(\mathcal {L}\) obtained by drawing with replacement n elements of \(\mathcal {L}\). According to random sampling of observations with replacement, an average of 36.8 % are not part of \(\mathcal {L}_{b}^{\ast }\). Let OOB \(_{b}= \mathcal {L}\setminus \mathcal {L}_{b}^{\ast }\) be the set of these elements. The observations in OOB _{ b } are not used to construct the predictor \(\mathcal {P}_{b}\); they constitute for this predictor the socalled Out Of Bag (OOB) sample.
 1.Repeat for b=1,…,B

Take a bootstrap replicate \(\mathcal {L}_{b}^{\ast }\) of the training set \(\mathcal {L}\)

Build a survival tree such as: ∗ For each split candidate variable Z ^{∗} (based on the information from Z _{2}) compute the corresponding splitting criterion S(Z ^{∗}) presented above. ∗ Do the same procedure for all the split candidate variables. ∗ Find the best split S ^{∗} which is the one having the maximum value over all the candidates. Then, a new node is built and the observations are splitted accordingly. ∗ Iterate the process until each node reaches a predefined minimum node size or be homogeneous. ∗ Construct the final tree denoted \(\mathcal {T}^{b }\left (W^{b }\right) \) where \(W_{l}^{\left (b\right)} (l = 1,\ldots,L(b))\) is a vector of indicator variables representing the L(b) leaves of the tree such that \(W_{il}^{\left (b\right)} = 1\) if the i ^{ t h } observation belongs to the l ^{ t h } terminal node of \(\mathcal {T}^{b }\), and 0, otherwise.

Calculate the cumulative hazard function (CHF) estimator for each terminal leave of each bootstrap tree \(\mathcal {T}^{b}\). ∗ The Breslowtype estimator of the baseline cumulative hazard [16, 17] in a terminal node l of the tree \(\mathcal {T}^{b}\) is computed aswhere \(\hat {\alpha }\) is the partial loglikelihood estimator obtained using all the learning data from the tree \(\mathcal {T}^{b }\). ∗ The NelsonAalen estimator of the baseline cumulative hazard [18, 19] in a terminal node l of the tree \(\mathcal {T}^{b }\) is computed as:$${} \begin{aligned} \hat{\Lambda}_{l}^{b}\left(t\right) &=\hat{\Lambda}\left(tW_{l}^{\left(b\right) }=1\right)\\ &=\sum_{i;t_{i}\leq t}^{n}1_{\left\{ W_{il}^{\left(b\right) }=1\right\} }\left(\frac{\delta_{i}}{\sum_{j = 1}^{n}1_{\left\{ W_{jl}^{\left(b\right) }=1\right\} }{Y_{j}^{l}}\left(t_{i}\right) e^{\hat{\alpha}Z_{1j}}}\right) \end{aligned} $$$${} \begin{aligned} \hat{\Lambda}_{l}^{b}\left(t\right) &=\hat{\Lambda}\left(tW_{l}^{\left(b\right) }=1\right)\\ &=\sum_{i;t_{i}\leq t}^{n}1_{\left\{ W_{il}^{\left(b\right) }=1\right\} }\left(\frac{\delta_{i}}{\sum_{j = 1}^{n}1_{\left\{ W_{jl}^{\left(b\right) }=1\right\} }{Y_{j}^{l}}\left(t_{i}\right) }\right) \end{aligned} $$

 2.
Compute the CHF prediction estimator:
The CHF prediction estimator for a new patient j with covariate Z _{ j } is computed as follows. The patient’s covariates \(Z_{2_{j}}\) are dropped down each tree. Then, the prediction is obtained as the weighted average of the estimated CHF over the learning datasets with the same membership terminal node assignment than the new case:where L(b) is the number of leaves nodes of the tree \(\mathcal {T}^{b}\)$$\hat{\Lambda}\left(t\right Z_{j}) = \frac{1}{B}\sum_{b = 1}^{B} \sum_{l = 1}^{L(b)}1_{\left\{ W_{jl}^{\left(b\right)} = 1\right\} }e^{\hat{\alpha}Z_{1_{j}}}\hat{\Lambda}_{l}^{b}\left(t\right) $$
Measures of prediction accuracy
Importance score
The choice of a measure of importance for a variable can rely on either the prediction capacity or the discriminative ability of the variable through the tree structure. Here, we consider the following importance scores.
Index importance score (IIS)
Depth and index importance score (DIIS)
The second criteria is inspired from the Depth Importance measure that has been introduced by Chen et al. [23]. This measure is similar to the Index Importance Score but also considers the location of the splitting.
Permutation prediction importance score (PPIS)
Large values of PPIS indicate a strong predictive ability whereas values close to zero indicate a poor predictive ability. In the following, we will denote PPISNA and PPISBRE the scores obtained using the NelsonAalen and the Breslow estimators, respectively.
Basket of important variables
For selecting a subset (hereinafter referred as a basket) of the most important variables, the main problem is to choose a threshold value for the previous scores. Several performancebased approaches have been proposed in the literature to deal with the variable selection in Random Forests comparing either OOB or crossvalidated errors of a set of nested models. Most of these procedures share the same methodological scheme and differ only in minor aspects (for a few see [24–26]). However, for survival data there is no consensus about which measure of prediction error is the most appropriate. Thus, each measure leads to a particular estimation of the prediction error that ultimately leads to select different subset of variables. Rather than using performancebased approaches, we propose hereafter to consider a strategy based on a testing procedure using a topological index which allows to select a basket of important variables.
In the following and without loss of generality, we suppose that the index score of interest is the IIS.

Step 1: Use the learning set \(\mathcal {L}\) to build the bagging predictor as describe in “Bagging procedure and prediction estimate” Section. Compute for each competing variable \(Z_{2_{j}}\) the index score of importance I I S _{ j } as describe in “Importance score” Section.

Step 2: Let σ:{1,…,n}→{1,…,n} be a random permutation of the set {1,…,n}; let \(\mathcal {L}_{\sigma } = \{{(X_{\sigma (i)},\delta _{\sigma (i)},Z_{1\sigma (i)}, Z_{2i})},\ i=1,\ldots,n\}\) be a partial permutation of \(\mathcal {L}\). Use the learning set \(\mathcal {L}_{\sigma }\) to build another bagging predictor using the same procedure as in the first step and compute again for each competing variable \(Z_{2_{j}}\) the index score of importance \( II{S_{j}^{0}} \).

Step 3: Repeat Step 2 a number Q of times.

Step 4: Compute the Pvalues for each competing variable \(Z_{2_{j}}\) as follows:$$p_{j} = \frac{1}{Q}\sum_{q = 1}^{Q} \mathbf{1}_{\{IIS_{jq}^{0} \geq IIS_{j}\} } $$

Step 5: Using a Bonferroni procedure for multiple comparisons, the selected variables are those fulfilling the conditions p _{ j }≤α/m _{2}; j=1,…,m _{2}.
This procedure is conceptually similar to the one proposed by [27] to correct the bias of the socalled Gini importance in a RandomForest framework. Nevertheless, in our framework, we have to take into account the covariables Z _{1} associated to the nonsusceptible individuals. For this purpose, the permutation scheme used in Step 2 ensures that the existing relationship between the timetoevent observations and the covariates Z _{1} is not distorted under the null hypothesis.
Results and discussion
Simulation scheme
In order to evaluate the performance of the bagging survival strategy relying either on the classical adjusted Logrank splitting criterion (denoted LR) or the proposed pseudoR2 criterion (denoted R2), we performed a simulation study as follows.
with e ^{ α }=1.25, λ _{0}=1, γ _{1}=0.6, γ _{2}=1.8, γ _{3}=0.45, γ _{4}=0.35, γ _{5}=2. The Bernoulli variables G _{1},…,G _{5} related to the timetoevent variable T are generated using the following scheme:
\(G_{i}\sim \mathcal {B}(\nu _{i}), \text {\ for\ }i = 1,\dots,5\) with ν _{1}=0.5; ν _{2}=0.6; ν _{3}=0.5; ν _{4}=0.3; ν _{5}=0.65.
We considered eight different scenarios with, for each, three different values for the number of noise or noninformative covariables (10, 100 and 500), that are independent Bernoulli variables with π=0.5. Thus, a total of 24 different simulation sets were generated. The first four scenarios are based on model (3), with N=250 individuals, a proportion of nonsusceptible patients of 25 and 50 %, and the rate of censoring observations within the susceptible population of 10 and 25 %. The last four scenarios are also based on model (3), but with N=500 individuals and the same setting as the previous ones.
Simulation scenarios for the evaluation of the importance scores and the prediction accuracy
Scenario  N  plateau  censoring  Noise covariables 

1 (a)  10  
1 (b)  250  25 %  10 %  100 
1 (c)  500  
2 (a)  10  
2 (b)  250  25 %  25 %  100 
2 (c)  500  
3 (a)  10  
3 (b)  250  50 %  10 %  100 
3 (c)  500  
4 (a)  10  
4 (b)  250  50 %  25 %  100 
4 (c)  500  
5 (a)  10  
5 (b)  500  25 %  10 %  100 
5 (c)  500  
6 (a)  10  
6 (b)  500  25 %  25 %  100 
6 (c)  500  
7 (a)  10  
7 (b)  500  50 %  10 %  100 
7 (c)  500  
8 (a)  10  
8 (b)  500  50 %  25 %  100 
8 (c)  500 
For all scenarios, LR and R2 are adjusted criteria for the known confounding factor G1 linked to the nonsusceptible state. We also evaluate the prediction accuracy using either the NelsonAalen (denoted NA) or the Breslow (denoted BRE) estimators. For prediction accuracy, we present the results obtained from the integrated Brier score (IBS). For the variable selection, we present the results obtained with IIS, DIIS and PPIS criteria.
For each scenario, we have generated 50 data sets. The bagging procedure with 400 trees was then applied to each data set with the two proposed splitting criteria. We then obtained 50 estimates of the Out Of Bag Integrated Brier Score for each method and each scenario.
We considered an additional scenario designed to mimic a data set that would reflect a situation, such as the one presented in our example, where variables are functionally related through groups (e.g. biological pathway). In practice, we generated correlated variables divided in five blocks of various sizes (ranging from 10 to 30 %) with correlations ranging between −0.2 to 0.3. We considered a situation with 500 individuals, a proportion of nonsusceptible patients of 25 %, a rate of censoring observations within the susceptible population of 10 and 25 % and two different values for the number of noninformative covariables (100 and 500).
Simulation results
Prediction results
In the first scenario (first column from left to right of Fig. 3), the OOBIBS are consistently and slightly lower than their counterparts of scenario 2 (second column from left to right of Fig. 3). This was expected because of the increase in censoring proportion among the susceptible population from 10 % in scenario 1 to 25 % in scenario 2.
The OOBIBS obtained using our proposed “pseudoR2” splitting criterion are better (lower median value with a smaller variability) than those obtained with the stratified Logrank criterion. The “PseudoR2” consistently outperforms the Logrank in term of prediction accuracy for the first two scenarios. For these scenarios, the results obtained with BRE and NA estimators are comparable. The impact of the additional noise variables on the prediction accuracy seems insignificant.
The same remarks can be made for scenarios 3 and 4 using the last two columns from left to right of Fig. 3. The only additional information here is an increase of the global magnitude of the OOBIBS from the first two columns of Fig. 3 to the last two columns. This is mainly due to the decrease of the proportion of susceptible population from 75 % in scenarios 1–2 to 50 % in scenarios 34, leading to a decrease of the number of events observed. These scenarios are more challenging than the previous ones.
The results of scenarios 5–8 (Fig. 4) are slightly better than those of scenarios 1–4. This is mainly due to the increase in the number of individuals from 250 to 500.
Importance scores results
Figure 6 shows that in the simple Scenarios 1a and 2a with only 10 noise variables (blue color within Fig. 6), the pseudoR2 splitting criterion attempts a clear discrimination between explanatory variables and noise variables, regardless of the considered importance scores. The same remark can not be made for the Logrank splitting criterion where the PPIS index discriminates only one variable while the IIS and DIIS indexes attempted to discriminate three explanatory variables from the noise ones.
In the more challenging scenarios 1b and 2b with 100 noise variables (red color within Fig. 6), the PPIS behaves poorly with the Logrank splitting criterion while the pseudoR2 splitting criterion behaves well in discriminating the explanatory variables from the 100 noise variables. Nevertheless, the performances are quite similarly between the two splitting criterion with regard to IIS and DIIS.
In the most challenging scenarios 1c and 2c with 500 noise variables (green color within Fig. 6) we observe a little deterioration of performances, mainly for the PPIS index. The Logrank splitting criterion behaves poorly for all the indexes, while the IIS and DIIS for the pseudoR2 splitting criterion still attempts a discrimination at low level compare to the previous ones.
The results of scenarios 3–4 are displayed in Fig. 7, where almost the half of the population is nonsusceptible. Combining this amount of plateau with censoring observations results in very few events observed in the scrutinized population. Compare to the previous scenarios, the results are quite similar for the pseudoR2 splitting criterion with indexes IIS and DIIS. Nevertheless, the figure suggests a decrease in performances for indexes PPISNA and PPISBRE. Overall the Logrank splitting criterion performs very poorly regardless of the indexes.
The results of scenarios 5–6 are displayed in Fig. 8. These scenarios give more power for identify explanatory variables than the previous scenarios 1–4, since there is an increase in the population size by a factor of 2. As expected, the results are slightly better than all the results of scenarios 1–4. The pseudoR2 splitting criterion allows a clear discrimination between the noise variables and the explanatory ones despite a very little degradation for PPIS indexes when the censoring rate increases and the number of noise variables is very high.
The results of scenarios 7–8 are displayed in Fig. 9. These results are quite similar to those of scenarios 5–6 for the pseudoR2 criterion, mainly for the IIS and DIIS indexes despite the increase of the fraction of nonsusceptible individuals. Also, the PPIS performs poorly with a high number of noise variables.
We investigated other scenarios with different values for the parameters related to the explanatory variables that lead to the same trends (results not shown). We also analyzed a scenario (results not shown) with a very small plateau value (5 %). As expected, our procedure outperforms the adjusted Logrank splitting method in terms of prediction accuracy but these gains are smaller than those obtained for higher “plateau” value. This is not surprising since the adjusted Logrank criteria can be seen as the limiting case of our criteria in which all the patients are susceptible. Thus, large power gains are anticipated in a situation where a nonnegligible fraction of nonsusceptible patients is expected. However, if the plateau value is very small but identical for all individuals, then the classical unadjusted Logrank criteria should be more efficient.
Analysis of breast cancer data
Description of the data
We used bioclinical data extracted from two genomic datasets (GSE2034, GSE2990) publicly available on the GEO (Gene Expression Omnibus) website (http://www.ncbi.nlm.nih.gov/geo/). The GSE2034 dataset corresponds to the expression microarray study conducted by Wang et al. [28] and the GSE2990 dataset to the one conducted by Sotiriou et al. [29]. Both studies investigate the prognostic effect of gene expression changes on the outcome of patients with primary breast cancer. For gene expression analyses, Affymetrix Human Genome U133A Array were used in both studies and estrogenreceptor (ER) status (positive/negative) was available. The clinical outcome considered was the distant metastasisfree survival. Distant metastasisfree survival was defined as the interval from the date of inclusion to the first occurrence of metastasis or last followup.
For these two early breast cancer series, surgical resection can be considered as effective at eliminating the tumor burden for a nonnegligible proportion of patients whereas, for the others, it leads to a lower tumor burden and thereby prolonged survival without distant relapse. Thus, a nonsusceptible fraction exists, and having a large number of patients followed up more than a decade after the primary treatment allows for an interpretable time sequence for tumor relapse.
For this work, we decided to investigate the impact of estrogenrelated genes in predicting metastasis among patients with ERpositive tumors.
The gene expression datasets of the two series were analyzed after a joint quantile normalization. Here, we focused on estrogenrelated genes that were defined as those demonstrating, on the whole dataset, a significant gene expression changes between ERpositive and ERnegative for a familywise error rate of 1 % (Bonferroni procedure). This selection led to the selection of 1,265 genes. We then selected patients with ERpositive tumors with a total of 294 patients (209 from GSE2034 and 85 from GSE2990) and investigated the effect of estrogenrelated genes on the occurrence of distant metastasis.
In order to take into account the difference in the proportion of nonsusceptible patients between the two series, we included this variable as a confounding variable.
Results
When looking to the first ten genes, no gene was selected in common between the adjusted Logrank and the pseudoR2 criterion. The first five topgenes selected with the pseudoR2 criterion are: CBX7, NUTF2, AGO2, RPS4X and TTK.
The CBX7 (Polycomb protein chromobox homolog 7) gene is involved in several biologic processes and recent works indicate a critical role in cancer progression. A relationship between the downregulation of CBX7 expression and the tumor aggressiveness and poor prognosis has been reported in different cancer. Preliminary studies also indicate a potential role in the modulation of response to therapy [30].
The NUTF2/NTF2 (nuclear transport factor 2) gene encodes a small binding protein. The main function of NTF2 is to facilitate transport of certain proteins into the nucleus. It is also involved in regulating multiple processes, including cell cycle and apoptosis.
The AGO2 (Argonaute 2) gene is a central component of RNAinduced silencing complex which plays critical roles in cancer process through proliferation, metastasis and angiogenesis. AGO2 has been found overexpressed in various carcinomas and associated with tumor cell growth and poor prognosis [31].
The RPS4X (Xlinked ribosomal protein S4) gene is involved in cellular translation and proliferation. Low RPS4X expression has been shown to be associated with poor prognosis in bladder, ovarian and colon cancer. Level of RPS4X is also a good indicator for resistance to platinumbased therapy and a prognostic marker for ovarian cancer. More recently, RPS4X has been identified as a partner of the overexpressed multifunctional protein YB1 in several breast cancer cells. Depletion of RPS4X results in consistent resistance to cisplatin in such cell lines [32].
TTK (threonine tyrosine kinase, also known as Mps1) gene is essential for alignment of chromosomes to the metaphase plate and genomic integrity during cell. TTK gene has been identified as one of the top 25 genes overexpressed in tumors with chromosomal instability and aneuploidy [33]. TTK is overexpressed in a various solid cancers, and elevated levels of TTK correlate with high histological grade in tumors and poor patient outcome.
We also examined the Importance scores and 50 estimates of the importance scores for each procedures were computed. The mean of the 50 values is presented in Fig. 12 for the top 30 variables.
Discussion
The discovery and predictive use of eventrelated markers have to face two main challenges that are the search over markers acting in complex networks of interactions and the potential presence of nonsusceptible patients in the studied population. In this work, we proposed a new bagging survival procedure devoted to this task. The strategy relies on an improper survival modelisation which considers a linear part for taking into account for known confounders associated with the nonsusceptible fraction and a tree structure for the eventrelated explanatory variables. The proposed treestructured modeling differs from the treeaugmented Cox proportional hazards model proposed by Sun et al. [34] in that it is explicitly tailored for mixture population. Moreover, our procedure relies on the use of a splitting criteria which can be interpreted as a timetoevent discrimination index suited to mixed population.
The results of our simulation study show the good behavior of our bagging procedure based on the pseudoR2 criterion as compared to the one relying on the classical Logrank statistic. For prediction, even though differences between the procedures are small, better predictions were obtained with the proposed procedure. If a difference between the fractions of nonsusceptible individuals is expected then the estimators that use the Breslow estimate should be preferred over those using the NelsonAalen estimate.
For variable selection, even in the presence of a high number of nuisance variables, our procedure is able to select the explanatory variables. The performance is obviously better when the number of events which can occur among susceptible patients is increasing. Based on our simulation study, we recommend the IIS or the DDIS criteria. These criteria rely on the discriminative performance of each splitting variable with or without the information related to the depth of the split. By contrast, the PPIS criterion which relies on prediction error is highly dependent on the censoring rate and the number of noise variables. Moreover, it is wellknown that there is no consensus on which prediction error criterion should be used for survival data.
The search for markers that predict distant relapse in hormone receptorpositive treated patients is still an intensive area of study. In the analysis of the two series of earlystage breast cancer presented in this article, the proposed procedure is particularly appealing since the majority of the patients are amenable to cure and then will never recur from the disease. The fraction of nonsusceptible patients being clearly different between these two studies, we consider the study as a confounder variable. We obtain a selection of topgenes which is different from those obtained with the classical Logrank statistic. The five top genes selected with our procedure are related to cancer and most of them have only been recently reported to be associated with prognosis. In breast cancer, we know that various pathways related to the tumor process are activated and that there is no unique selection of prognostic factors. However, since our main aim is to select the more powerful set of predictors and obtain the highest prediction, our procedure should be preferred. This modelbased selection which takes into account the highorder interactions and focuses on susceptible patients shed light on new markers that could serve as potential drug targets for new therapies.
In this work, we assumed that the hazard functions for the susceptible individuals between two child nodes are conditionally proportional given the node but the proportionality for any two nodes from different parents is not required. Postulating a proportional hazards structure within the whole tree could be an option which requires further development and evaluation. Here, we also considered the case with known confounding variables which is frequently encountered in biomedical research. For a different purpose, we could however consider extending the procedure to unknown confounding variables. Further works are however needed to cope with the potential degree of nonidentifiability between failure time distribution of susceptible individuals and the proportion of nonsuceptibles individuals.
Conclusion
In the presence of a mixed population with nonsusceptible patients, our results show that our bagging survival procedure with the proposed splitting criterion has good performance for prediction and variable selection. For measuring variable importance, we recommend the use of either the proposed Index Importance Score or the Depth and Index Importance Score.
The proposed treebuilding process, which relies on a modelbased splitting criteria, can be considered as a convenient hybrid solution that combines multiplicative intensity model and treestructured modeling. We believe that the proposed survival bagging procedure is very appealing for many clinical genomic studies in which a fraction of nonsusceptible individuals is commonly encountered. This procedure has been implemented in a R package called iBST (improper Bagging Survival Tree) and will be available soon on the CRAN repository.
Endnotes
Not applicable.
Abbreviations
CART, classification and regression tree; CHF, cumulative hazard function; DIIS, depth and index importance score; ER, estrogen receptor; GEO, gene expression omnibus; IBS, integrated brier score; iBST, improper Bagging Survival Tree; IIS, index importance score; LR, log rank; OOB, out of bag; OOBIBS, out of bag integrated brier score; PPIS, permutation prediction importance score; PPISNA, permutation prediction importance score with Nelson Aalen; PPISBRE, permutation prediction importance score with Breslow.
Declarations
Acknowledgements
The research leading to these results was conducted as part of the ABIRISK consortium (AntiBiopharmaceutical Immunization: Prediction and analysis of clinical relevance to minimize the risk). For further information please refer to www.abirisk.eu. We thank the members of the Abirisk WP4, Julie Davidson and Agnès HincelinMery for their great support.
The research leading to these results has received support from the Innovative Medicines Initiative Joint Undertaking under grant agreement n^{o} [115303], resources of which are composed of financial contribution from the European Union’s Seventh Framework Programme (FP7/20072013) and EFPIA companies’ in kind contribution.
Funding
The research leading to these results has received support from the Innovative Medicines Initiative Joint Undertaking under grant agreement n^{o} [115303], resources of which are composed of financial contribution from the European Union’s Seventh Framework Programme (FP7/20072013) and EFPIA companies’ in kind contribution.
Availability of data and material
The data sets supporting the results of this article are publicly available at the GEO (Gene Expression Omnibus) database http://www.ncbi.nlm.nih.gov/geo/, with accession code GSE2034 and GSE2990.
Authors’ contributions
Both authors participated in the design, implementation and evaluation of the procedure. Both authors wrote, read and approved the final manuscript.
Competing interests
The authors declare that they have no competing interests.
Consent for publication
Not applicable.
Ethics approval and consent to participate
Not applicable.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License(http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
Authors’ Affiliations
References
 Breiman L, Olshen JH, Stone CJ. Classification and Regression Trees. Belmont: Wadsworth International Group; 1984.Google Scholar
 Gordon L, Olshen R. Treestructured survival analysis. Cancer Treat Rep. 1985; 69(10):1065–9.PubMedGoogle Scholar
 BouHamad I, Larocque D, BenAmeur H. A review of survival trees. Stat Surv. 2011; 5:44–71.View ArticleGoogle Scholar
 Davis RB, Anderson JR. Exponential survival trees. Stat Med. 1989; 8(8):947–61.View ArticlePubMedGoogle Scholar
 LeBlanc M, Crowley J. Relative risk trees for censored survival data. Biometrics. 1992; 48(2):411–25.View ArticlePubMedGoogle Scholar
 Hothorn T, Lausen B, Benner A, RadespielTröger M. Bagging survival trees. Stat Med. 2004; 23(1):77–91.View ArticlePubMedGoogle Scholar
 Ishwaran H, Kogalur UB, Blackstone EH, Lauer MS. Random survival forests. Ann Appl Stat. 2008;:841–60.Google Scholar
 Leblanc M, Crowley J. Survival trees by goodness of split. J Am Stat Assoc. 1993; 88(422):457–67.View ArticleGoogle Scholar
 Shimokawa A, Kawasaki Y, Miyaoka E. A comparative study on splitting criteria of a survival tree based on the cox proportional model. J Biopharm Stat. 2016; 26(2):386–401.View ArticlePubMedGoogle Scholar
 Maller RA, Zhou S. Testing for the presence of immune or cured individuals in censored survival data. Biometrics. 1995; 51(4):1197–205.View ArticlePubMedGoogle Scholar
 Tsodikov A, Ibrahim J, Yakovlev A. Estimating cure rates from survival data. J Am Stat Assoc. 2003; 98(464):1063–1078.View ArticlePubMedPubMed CentralGoogle Scholar
 Cooner F, Banerjee S, Carlin BP, Sinha D. Flexible cure rate modeling under latent activation schemes. J Am Stat Assoc. 2007; 102(478).Google Scholar
 Rouam S, Broët P. A discrimination index for selecting markers of tumor growth dynamic across multiple cancer studies with a cure fraction. Genomics. 2013; 102(2):102–11.View ArticlePubMedGoogle Scholar
 Fleming TR, Harrington DP. Counting Processes and Survival Analysis vol. 169. New York: Wiley; 2011.Google Scholar
 Lin DY, Wei LJ. The robust inference for the cox proportional hazards model. J Am Stat Assoc. 1989; 84(408):1074–8.View ArticleGoogle Scholar
 Breslow N. Discussion on ‘regression models and lifetables’(by dr cox). J Roy Statist Soc Ser B. 1972; 34:216–7.Google Scholar
 Breslow N. Covariance analysis of censored survival data. Biometrics. 1974; 30(1):89–99.View ArticlePubMedGoogle Scholar
 Nelson W. Theory and applications of hazard plotting for censored failure data. Technometrics. 1972; 14(4):945–66.View ArticleGoogle Scholar
 Nelson W. Hazard plotting for incomplete failure data. J Qual Technol. 1969; 1(1):27–52.Google Scholar
 Korn EL, Simon R. Measures of explained variation for survival data. Stat Med. 1990; 9(5):487–503.View ArticlePubMedGoogle Scholar
 Altman DG, Royston P. What do we mean by validating a prognostic model?Stat Med. 2000; 19(4):453–73.View ArticlePubMedGoogle Scholar
 Graf E, Schmoor C, Sauerbrei W, Schumacher M. Assessment and comparison of prognostic classification schemes for survival data. Stat Med. 1999; 18(1718):2529–45.View ArticlePubMedGoogle Scholar
 Chen X, Liu CT, Zhang M, Zhang H. A forestbased approach to identifying gene and gene–gene interactions. Proc Natl Acad Sci. 2007; 104(49):19199–203.View ArticlePubMedPubMed CentralGoogle Scholar
 Jiang H, Deng Y, Chen HS, Tao L, Sha Q, Chen J, Tsai CJ, Zhang S. Joint analysis of two microarray geneexpression data sets to select lung adenocarcinoma marker genes. BMC Bioinforma. 2004; 5(1):81.View ArticleGoogle Scholar
 DiazUriarte R, Alvarez de Andrés S. Gene selection and classification of microarray data using random forest. BMC Bioinforma. 2006; 7(3).Google Scholar
 Genuer R, Poggi JM, TuleauMalot C. Variable selection using random forests. Pattern Recogn Lett. 2010; 31(14):2225–36.View ArticleGoogle Scholar
 Altmann A, Toloşi L, Sander O, Lengauer T. Permutation importance: a corrected feature importance measure. Bioinformatics. 2010; 26(10):1340–7.View ArticlePubMedGoogle Scholar
 Wang Y, Klijn J, Zhang Y, Sieuwerts A, Look M, Yang F, Talantov D, Timmermans M, Meijervan Gelder M, Yu J, Jatkoe T, Berns E, Atkins D, Foekens J. Geneexpression profiles to predict distant metastasis of lymphnodenegative primary breast cancer. Lancet. 2005; 19(4):671–9.View ArticleGoogle Scholar
 Sotiriou C, Wirapati P, Loi S, Harris A, Fox S, Smeds J, Nordgren H, Farmer P, Praz V, HaibeKains B, et al. Gene expression profiling in breast cancer: understanding the molecular basis of histologic grade to improve prognosis. J Natl Cancer Inst. 2006; 98(4):262–72.View ArticlePubMedGoogle Scholar
 Pallante P, Forzati F, Federico A, Arra C, Fusco A. Polycomb protein family member cbx7 plays a critical role in cancer progression. Am J Cancer Res. 2015; 5(5):1594.PubMedPubMed CentralGoogle Scholar
 Ye Z, Jin H, Qian Q. Argonaute 2: A novel rising star in cancer research. J Cancer. 2015; 6(9):877.View ArticlePubMedPubMed CentralGoogle Scholar
 Garand C, Guay D, Sereduk C, Chow D, Tsofack SP, Langlois M, Perreault È, Yin HH, Lebel M. An integrative approach to identify yb1interacting proteins required for cisplatin resistance in mcf7 and mdamb231 breast cancer cells. Cancer Sci. 2011; 102(7):1410–7.View ArticlePubMedGoogle Scholar
 Carter SL, Eklund AC, Kohane IS, Harris LN, Szallasi Z. A signature of chromosomal instability inferred from gene expression profiles predicts clinical outcome in multiple human cancers. Nat Genet. 2006; 38(9):1043–8.View ArticlePubMedGoogle Scholar
 Su X, Tsai CL. Treeaugmented cox proportional hazards models. Biostatistics. 2005; 6(3):486–99.View ArticlePubMedGoogle Scholar