Combining techniques for screening and evaluating interaction terms on highdimensional timetoevent data
 Murat Sariyar^{1, 2}Email author,
 Isabell Hoffmann^{1} and
 Harald Binder^{1}
https://doi.org/10.1186/147121051558
© Sariyar et al.; licensee BioMed Central Ltd. 2014
Received: 22 October 2013
Accepted: 28 January 2014
Published: 26 February 2014
Abstract
Background
Molecular data, e.g. arising from microarray technology, is often used for predicting survival probabilities of patients. For multivariate risk prediction models on such highdimensional data, there are established techniques that combine parameter estimation and variable selection. One big challenge is to incorporate interactions into such prediction models. In this feasibility study, we present building blocks for evaluating and incorporating interactions terms in highdimensional timetoevent settings, especially for settings in which it is computationally too expensive to check all possible interactions.
Results
We use a boosting technique for estimation of effects and the following building blocks for preselecting interactions: (1) resampling, (2) random forests and (3) orthogonalization as a data preprocessing step. In a simulation study, the strategy that uses all building blocks is able to detect true main effects and interactions with high sensitivity in different kinds of scenarios. The main challenge are interactions composed of variables that do not represent main effects, but our findings are also promising in this regard. Results on real world data illustrate that effect sizes of interactions frequently may not be large enough to improve prediction performance, even though the interactions are potentially of biological relevance.
Conclusion
Screening interactions through random forests is feasible and useful, when one is interested in finding relevant twoway interactions. The other building blocks also contribute considerably to an enhanced preselection of interactions. We determined the limits of interaction detection in terms of necessary effect sizes. Our study emphasizes the importance of making full use of existing methods in addition to establishing new ones.
Keywords
Boosting Highdimensional data Model selection Model complexity Prediction error curves Random forest Time to event settingsBackground
As more and more highdimensional molecular data is amassed, the importance of biomarker research increases. Specifically, predictive biomarkers are usually wanted in order to predict risks associated with diseases. When building multivariate risk prediction models for finding such biomarkers, it is desirable to produce sparse models. The sparsity of the resulting models facilitates the biological and statistical interpretation [1–4]. Approaches such as componentwise boosting [5] or the LASSO [6–8] achieve sparsity by performing variable selection and parameter estimation simultaneously. There are two frequently occurring problems in this context: first, lack of reproducibility of variable selections across different studies, for example concerning gene expression data [9–11]; second, no established approaches to account for interactions. The latter deficit can lead to selection of wrong variables or biased parameter estimations. The first problem, i.e. the inability to confirm most of the published gene related signatures, has led to doubts whether signatures should be produced at all. However, the failure of finding stable signatures could to some extent be ascribed to inadequate modeling. Approaches that are more comprehensive are necessary, for example, combining molecular data with annotation and clinical information [9, 12–15]. One ingredient should be to incorporate promising interactions in the model. Many tools for modeling interactions exist, but, as far as we know, no systematic investigations of potential building blocks are available.
Examples for promising modeling strategies that can account for interactions are penalized regression models [16, 17], logic regression [18, 19], multifactordimensionality reduction [20, 21] or random forests [22, 23]. For a comprehensive review regarding interaction (pre) selection approaches, we refer to [24]. Logic regression and multifactordimensionality reduction are primarily destined for discrete marker data, e.g., for single nucleotide polymorphism data. In contrast to that, penalized or regularized regression models cover more general types of data. Their main property is to put a penalty on the model parameters, which correspond to marker effects, for estimation. The usage, for example, of an L_{1}penalization forces most of the estimated parameters to be zero, i.e., the values of the corresponding covariates do not influence predictions obtained from the fitted model. Even though these models are primarily used for main effect selections, there is an increasing interest in incorporating interactions [25–27]. When there is no a priori knowledge, such approaches either require the interactions to be formed by variables that represent main effects or that interaction terms are created by combining the covariates in a certain way, e.g., by producing all distinct twoway interactions (or by coarsening the input space before producing the interactions [28]). The first route can lead to false negatives even if the true interactions have relevant marginal effects, and the second one neglects the fact that it is frequently either not feasible or computationally too expensive to consider all possible interactions. Altogether, this means that a screening method for promising interaction terms is in most cases necessary, especially for higher order interactions. The potential of random forests to provide nonparametrical means for handling various kinds of interaction structures makes them attractive as an interaction screening method for penalized regression models. However, apart from some interesting theoretical results (see [29, 30]) and positive empirical findings regarding prediction performances (e.g., [31–33]), the ability to extract information from random forests is considered problematic. The main objection is that established variable importance measures seem to be unable to detect relevant interaction effects in the absence of strong marginal components [34–36].
Variable importance measures (VIMs) for random forests are meant to extract the information contained in forests. Established VIMs are the Gini, the permutation accuracy, or the minimal depth importance (see [37–39]). The first measure uses the mean improvements in the Gini index in a forest related to the investigated variable. Permutation accuracy importance measures the change in prediction accuracy of the forest when the values of a variable are permutated randomly, and minimal depth importance is roughly related to the mean minimum distance (the depth) from the root node to the investigated variable. These measures can also be used for finding interactions in the forest. For example, the permutation accuracy importance can easily be extended such that the values of two variables are permutated randomly [37]. These variable importance measures lead to a ranking of variables, in which interaction information is assumed to enter in some way. Whether these interactions are statistically relevant can be evaluated by penalized regression models. Hence, a comprehensive evaluation can consist of two parts: extracting interaction terms based on random forest information and estimating a statistical regression model based on all available variables and identified interaction terms.
In this paper, we show building blocks for evaluating and incorporating interactions terms in highdimensional timetoevent settings, in particular for settings in which it is very computationally expensive to check all possible interactions with an exhaustive search algorithm. The main ingredients are random survival forests (RSF), a specific adaptation of random forests to timetoevent settings, and an incremental stagewise forward regression technique, called CoxBoost [40–42]. CoxBoost is a boosting technique based on the Cox proportional hazards model and combines variable selection with model estimation. For this purpose, it uses a penalized version of the partial loglikelihood and applies componentwise boosting. We investigate the effect of a combination of these approaches, the additional contribution of resampling, and the advantage of a special data preprocessing step. This work is a feasibility study; hence, we are first of all interested in investigating how several components can contribute to the solution of the interaction finding problem. The specific choice of the investigated tools is justified by their specific properties; however, there are alternatives to our decisions (see above). We are interested in predicting risks within timetoevent settings and we use methods established in these settings. In this context, we rely on some assumptions, such as the proportional hazard assumption.
In the next section, we present details of CoxBoost and RSF together with corresponding VIMs. After presenting evaluation tools and our interaction detection strategy, we outline a simulation design for the evaluation. In the Results Section, the findings of the simulation study are shown, and we illustrate our approach on two realworld applications. Finally, we describe limitations of the study and summarize our findings in the Conclusion Section.
Methods
Timetoevent or survival data for n investigated entities is typically given as a set of triples z_{ i }=(t_{ i },δ_{ i },x_{ i }),i=1,…,n. The first component is the observed time for each entity i and is given by t_{ i }= min(T_{ i },C_{ i }), where T_{ i } is the event time and C_{ i } is the censoring time from which on the entity is no longer observed. The second component is the event indicator δ_{ i }, which takes the value 1 if an event has occurred at the observed time (T_{ i }≤C_{ i }) and 0 if the event time is censored (T_{ i }>C_{ i }). The third element, x_{ i }, is the vector of values of the p covariates observed at baseline.
CoxBoost
while ${\widehat{\beta}}_{j}^{\left(k\right)}={\widehat{\beta}}_{j}^{(k1)}$ for all covariates j≠j^{∗}. The tuning parameter ρ is typically set to $\sum _{i}{\delta}_{i}\xb7(\frac{1}{\nu}1)$, with ν∈(0,1] as the relative step size factor. The number of boosting update steps can be determined by a crossvalidation procedure.
One salient feature of this forward stagewise regression technique is that it inherently avoids ’breaking up a large main effect coefficient into a sum of smaller pieces’ in contrast to, for example, nonboosted regression models with L_{2}penalization (see [16]). In addition to that, CoxBoost has many extensions. It is, for example, possible to force the inclusion of a number of covariates into the model by suspending penalization for them [15]. This is relevant for settings with few clinical covariates and a large number of molecular variables. In this case, the coefficient estimates of the mandatory covariates are updated before the other covariates. Further, more than one coefficient can be updated in each boosting step, or the penalization parameter can vary from step to step. CoxBoost and all these features are implemented as an Rpackage, correspondingly called CoxBoost[46].
Random forests
Random forests are ensembles of – usually binary – classification or regression trees [22]. Usually unpruned trees are generated based on resamples of the original data and a random component in the splitting procedure, which implies that in every knot splitting is based on the number mtry of randomly selected variables. Unpruned trees in the context of random forests are rather unproblematic in terms of overfitting on training data; however, they can have detriment effects on the consistency of the response estimations [29, 47]. Each path in such generated trees represents a sequence of splits that leads to the response of cases corresponding to that path. The final model response is determined by aggregation, e.g. averaging the responses of a case over all trees.
Random forests can detect and deal with small effects, interactions and nonlinear associations, making no assumptions about the corresponding functional form [48]. All of these characteristics are also valid for trees. However, one important rationale behind random forests is the decorrelation of information that is represented in single trees, which reduces the corresponding variances – a bagging phenomenon [30, 49] – and the grouping property of trees. The latter property relates to the fact that a split on a variable from a cluster of correlated variables is frequently followed by splits of other members of that group [50]. A further advantage of forests over trees is that they can approximate smooth functions without the necessity of having a large number of leaves in a tree, due to the smoothing effect of the bagging phenomenon [51]. Random forests perform relatively well off the shelf [52] with the defaultvalues for mtry (= ) and for the number of trees in a forest (=1000).
where d_{l,h} and Y_{l,h} are the number of deaths and entities at risks at time t_{l,h}. RSF are implemented in the Rpackage randomSurvivalForest[56].
Variable importance measures for random forests
Various variable importance measures (VIMs) can be used for selecting variables. There are two wellknown VIMs: Gini importance and permutation accuracy importance (PAM). Another VIM is the mean minimaldepth measure, which has been proposed recently. Roughly, it measures the shortest distance (depth) from the root node to the parent node of the maximal subtree (the largest subtree whose root node splits with respect to the variable investigated). For further details, we refer to [50]. Different VIMs can produce different rankings; for example, the Gini importance was found to be highly affected by selection bias, e.g., continuous variables are preferred to categorical variables with only few categories [38]. In the following, we focus on PAM, because it is widely accepted and relates to the concept of simulating a null distribution (necessary for computing pvalues), even though we are aware of potential problems [50, 57]. For further information regarding VIMs, we refer to [58] and [59].
There are two versions of PAM. In its common version it is computed with respect to random permutations of the components of x_{ j }=(x_{1j},…,x_{ nj })^{ T }, which breaks the association of x_{ j } with the response and all variables. In a more sophisticated variant, which is unique to RSF with respect to survival data [50], the vector x_{ i }=(x_{i1},…,x_{ ip }) related to entity i is dropped down in all trees, in which it was out of bag in the training process; whenever a split node for an investigated variable is encountered, the corresponding vector x_{ i } is randomly assigned to one of the daughter nodes. In both variants, the variable importance results from the prediction error of the altered forest minus the prediction error of the nonaltered forest. The larger the importance values of a variable, the higher its value for prediction. It is important to notice that PAM is tied to the error measure used. One frequently used error measure for RSF, which we use here as well, is Harrell’s concordance index, which measures the discrimination ability of a model [60].
Tools for finding effects in timetoevent data
In highdimensional settings, the problems of extracting relevant information by regression models are aggravated compared to the lowdimensional counterparts. For example, even if stepwise regression introduces biases related to multiple test problems (see, for example, [61, 62]), it nevertheless provides a means for tackling variable selection issues in a comprehensive manner. It is therefore crucial to investigate mechanisms and measures for an adequate model selection on highdimensional data. Three issues have to be addressed simultaneously: (1) a sparse variable selection, (2) representing the relevant structure in the data, and (3) good prediction performance. We try to tackle these issues and in particular concentrate on integrating substantial interactions into the model.
The likelihoodbased boosting algorithm promises sparse and stable variable selection, which is a consequence of simultaneous selection and estimation in a multivariable model. Naturally, variable selection stability also depends on the quality of the data (see, for example, [63]), and for obtaining highquality molecular data frequently appropriate preprocessing steps are necessary, e.g., background correction and normalization. Concerning the other two issues (representing the relevant structure in the data and good prediction performance) Yang [64] strikingly demonstrates that best predictive models usually contain irrelevant features and important features often do not lead to best prediction performances (see also [65, 66]). Whenever we encounter the tradeoff between relevance and usefulness for prediction, we prioritize ’finding relevant variables’ over prediction performance.
The models are evaluated within a resample procedure for estimating sensitivity and stability. As a performance measure adapted for timetoevent endpoints, we use the Brier score [67, 68]. The Brier score is a strictly proper scoring rule, i.e. it is optimal only at the true probability model (see [69]). For example, the area under the curve (AUC) is not a strictly proper rule, because it can lead to optimal values for different probability models (slight changes of probabilities often do not matter). Two common resampling techniques are crossvalidation (CV) and bootstrapping. Crossvalidation partitions the data into folds and evaluates prediction performance on every single fold with models fitted to the data from the remaining folds; a more precise characterization for CV is therefore ’subsample technique’. Both techniques can cause problems (see [38, 70]), and we decided to use subsampling with splits of relative size 0.632 to (10.632), because this seems to work well in many settings [71, 72]. Such a subsampling procedure is roughly comparable to a 3fold CV (see [73, 74]).
Here, B is the number of resamples, n the number of rows, and ${\mathcal{I}}_{b}$ the indices of those cases that are included in the resample b.
Assembling of building blocks into an interaction detection strategy
Our comprehensive strategy consists of three parts: (a) first main effect detection, (b) preselection of interactions terms,(c) final model selection. Parts (a) and (c) use CoxBoost and are fixed. Here, we rely on the ability of CoxBoost to produce sparse models and to include important variables. In part (b), we consider the following building blocks: (BB1) subsampling, (BB2) random forests, and (BB3) orthogonalization as a data preprocessing step. Different decisions concerning the building blocks lead to flexibility in part (b). When combining building blocks into comprehensive strategies, overfitting to the data at hand and overoptimism could occur [76]. One way to account for that – besides the usage of independent validation data sets – is to evaluate the contribution of the building blocks to the results.
The use of an outer subsampling for interaction finding has the aim of enhancing the credibility of interaction information. Specifically, we use the variable inclusion frequency (VIFs), i.e. the proportion of times that the variable appeared in the model, for assessing the relevance of an interaction term. For example, when using random forests, this means that the number of random forests in which interaction terms are deemed relevant is the basis for a preselection of interactions. Here, an interaction term is assessed as relevant if both underlying variables have PAM values larger than zero in a random forest (typically, there are many variables with PAM values ≤0). In other words, variables have to be simultaneously important for a random forest.
 1.
Specify: Indices of clinical covariates or other known main effect variables, number S of subsamples for preselecting interactions, and number R of preselected interaction terms. In case of identical VIF values for the Rth and (R+1)th found interaction, all interactions with that VIF value are included as well.
 2.Subsample the original data set Z in relation 0.632 to (10.632), leading to the data sets Z _{ b } and ${Z}_{{b}^{\prime}}$.
 (a)
First pass main effects detection: Run CoxBoost on Z _{ b }, possibly incorporating clinical covariates $\left\{{x}_{k}\rightk\in \mathcal{K}\}$ without penalization. This leads to the model CoxBoostM and a list of main effects, given by the index set . Main effects and clinical covariates are used for orthogonalization (if this preprocessing step is considered) in the preselection step and as unpenalized variables in the final CoxBoost model.
 (b)
Preselection of interaction terms: If $\mathcal{\mathcal{M}}\cup \mathcal{K}$ is nonempty, regress all covariates with indices $\{1,\dots ,p\}\setminus (\mathcal{\mathcal{M}}\cup \mathcal{K})$ on the variables in $(\mathcal{\mathcal{M}}\cup \mathcal{K})$. Subsequently, compute the corresponding residuals of the covariates, which leads to the data matrix ${\stackrel{~}{Z}}_{b}$ (building block (BB3)). Subsample S times data from ${\stackrel{~}{Z}}_{b}$ – from Z _{ b }, when is empty – in relation of 0.632 to (10.632) and generate RSF on each larger subsample (building blocks (BB1) and (BB2)). Construct interaction terms by all pairs of variables with PAM values greater 0 on every subsample and compute VIFs of the interactions terms at the end of the subsampling process. Select the R most frequent pairs.
 (c)
Final model: covariates are ${x}_{k},k\in \mathcal{K}$, ${x}_{i},i\in \mathcal{\mathcal{M}}$, and the R selected cross product terms of (b). Run CoxBoost on Z _{ b } with these covariates without penalization for covariates with indices in , leading to model cb _{fin}.
 (d)
Compute prediction error: Apply cb _{fin} on ${Z}_{b}^{\prime}$ and compute the Brier score.
 (a)

b1 Do the same as in rsfVIFres, but without orthogonalization. (rsfVIF)

b2 Replace rsf in rsfVIF by CoxBoost: subsample S times data from Z_{ b } in relation 0.632 to (10.632) and run CoxBoost on each of the larger data sets. Finally, compute VIFs related to the variables selected by CoxBoost in each subsample, and create R pairs, i.e. interaction terms, related to the variables with the highest VIFs. (cbVIF)

b3 Omit subsampling in cbVIF: compute all distinct cross product terms of covariates with indices in . Here, S is superfluous and R is not needed, if $\left\sqrt{\left\mathcal{\mathcal{M}}\right}\right\le R$; otherwise, select randomly R interactions from all cross product terms. (cbcrossp)
There are many more alternatives, which are not considered due to a limited space and for reasons of clarity.
Simulation design
For a systematic analysis of the building blocks and the corresponding interaction detection strategies, a timetoevent simulation study was conducted. Here, we define interactions as effects based on multiplicative combinations of variables. The main interest concerns the ability of the strategies to find relevant interactions and especially those that might be difficult to detect, i.e., variables in interactions are not members of the set of true main effects. The secondary focus is on the prediction performance, which highly depends on the effect sizes of main effect variables and interaction terms.
The effect sizes of nonzero effects in each scenario
Scenarios:  Effect size ME  Effect size Int  Corr value  Block 

size  
Sim42  (3, 3, 3, 3)  (5,5)  
Sim22_1.0  (0.9, 0.9)  (1.0, 1.0)  
Sim22_0.5  (0.9, 0.9)  (0.5, 0.5)  
Sim22_0.25  (0.9, 0.9)  (0.25, 0.25)  
Sim22_1.5  (0.9, 0.9)  (1.5, 1.5)  
Sim22_2.0  (0.9, 0.9)  (2.0, 2.0)  
Sim22_2.5  (0.9, 0.9)  (2.5, 2.5)  
Sim22_bin  (0.9, 0.9)  (1.0, 1.0)  
Sim22_corr01  (0.9, 0.9)  (1.0, 1.0)  0.1  5 
Sim22_corr03  (0.9, 0.9)  (1.0, 1.0)  0.3  5 
Sim22_corr05  (0.9, 0.9)  (1.0, 1.0)  0.5  5 
Sim22_corr07  (0.9, 0.9)  (1.0, 1.0)  0.7  5 
Sim42 refers to the simulation case with 4 main effects and 2 interactions that are composed of the main effects and Sim22_x to the cases with 2 main effects and 2 interactions that are not related to these main effects; in other words: they are composed of variables that have zero effect sizes. If x is numeric, it gives the uniform effect size; x= "bin" denotes the case of interactions composed of binary variables, and x beginning with "corr" relates to cases with variables that are blockcorrelated with a uniform correlation coefficient c, c∈{0.1,0.3,0.5,0.7}, across the 200 fiverblocks, i.e., values of variables are sampled from a 5dimensional normal distribution with the same variance matrix (the same correlation value at the offdiagonals and variance of 1) over all blocks. Here, main effects and variables in interactions terms stem from different blocks.
In scenarios Sim42 and Sim22_1.0, all interaction detection strategies are considered for evaluating the effect of the building blocks. The other scenarios are used to investigate the behavior and the limits of rsfVIFres. The simple scenario Sim42 is used for ascertaining that the strategies are capable of finding the relevant main effects. Scenario Sim22_1.0 is the reference scenario for the scenarios with nonsmooth interactions, i.e. interactions incorporating binary covariates, and correlated variables.
The performance of a strategy is measured by the number of correct nonzero variables in the models, i.e. the variable selection sensitivity with respect to the main effects and the interaction terms, and by the prediction error (Brier score). Specificity values or predictive values are not separately listed in the result tables. However, these measures can be deduced from the sensitivity values and the number of selected variables.
In other words, rIPEC gives the relative improvement of prediction performance of strategy S_{ i } compared to the prediction performance of the KaplanMeier.
Results
Simulation study
The simulation was conducted in R3.0.2 with following main settings for the model implementations used in our strategies. Parameters not listed are considered secondary and were set to their default values:
CoxBoost penalty: (Number of events) $\xb7(\frac{1}{0.05}1)$. Penalty value for the updates in each boosting step. standardize: TRUE. Covariates are standardized. stepno: As computed by cv.CoxBoost. Number of boosting steps.
RSF mtry: Square root of the number of variables (default value). ntree: 1000. Number of trees grown (default value).
Results of the simulation study for all strategies in scenarios Sim42 and Sim22_1.0
IntScreen  IntSensiA  VarsTotal  MainSensi  IntSensi  rIPEC  

CoxBoostM  Final Model  
Scenario Sim42  
(cbcrossp)  213.78  0.7 (0.05)  14.3 (6.29)  0.845 (0.04)  0.7 (0.05)  0.12 (0.12)  0.4 (0.19) 
(cbVIF)  1573.06  0.88 (0.03)  19.42 (7.89)  0.845 (0.04)  0.88 (0.03)    0.43 (0.15) 
(rsfVIF)  24557.74  0.95 (0.02)  34.74 (7.97)  0.82 (0.04)  0.94 (0.02)    0.37 (0.15) 
(rsfVIFres)  25740.32  0.97 (0.02)  32.02 (6.56)  0.845(0.05)  0.97 (0.02)  0.12 (0.122)  0.4 (0.12) 
Scenario Sim22_1.0  
(cbcrossp)  89.56  0 (0)  17.28 (8.18)  0.95 (0.02)  0 (0)  0.12 (0.09)  0.09 (0.13) 
(cbVIF)  2058.5  0 (0)  19.12 (12.19)  0.87 (0.03)  0 (0)    0.09 (0.1) 
(rsfVIF)  17511.96  0.07 (0.01)  12.64 (10.85)  0.73 (0.07)  0.06 (0.02)    0.08 (0.08) 
(rsfVIFres)  17701.72  0.4 (0.05)  19.9 (12.93)  0.81(0.04)  0.39(0.05)  0.12 (0.09)  0.14 (0.14) 
In scenario Sim42, use of random forests generate models with more than 30 variables in the mean (about twice the number seen Sim22_1.0), whereas the other two strategies result in less than 20 variables on average. This means that there are many false positive findings when using RSF, which has a negative impact on the rIPEC compared to the cbstrategies. On the other hand, rsfVIFres leads to the largest sensitivity values for main effects and interactions. Hence, there is a tradeoff between sensitivity and prediction performance. For all preselection variants it seem that when true interaction are preselected (see IntSensiA), then almost all of them are selected in the final model. Comparing cbcrossp with cbVIF, we see that subsampling can increase IntSensi without decreasing MainSensi, and this leads to the best rIPEC value in scenario Sim42. Use of random forest instead of CoxBoost for interaction preselection (rsfVIF) increases IntSensi but leads to a slight reduction of MainSensi. The MainSensi and IntSensi values of rsfVIFres indicate that orthogonalization is not only important for further increasing IntSensi but also for a higher MainSensi value compared to rsfVIF. Overall, in this scenario, subsampling is important and random forests should be applied on orthogonalized data for achieving the largest sensitivity values but even then, prediction performance cannot be improved compared to interaction preselection with CoxBoost.
Scenario Sim22_1.0 exhibits some differences to Sim42. First, all strategies lead to similar and moderate numbers of total variables in the final model. Hence, high IntScreen values in RSF strategies do not result in more false positives than interaction preselection variants that use CoxBoost. CoxBoost is not able to preselect true interactions, with or without subsampling. Subsampling even reduces MainSensi values. Use of random forest further decreases MainSensi with a little compensation of increased IntSensi but at an interchange rate that makes a further reduction of rIPEC possible. Again, RSF has to be applied to the preprocessed data (rsfVIFres) for increasing MainSensi and IntSensi compared to rsfVIF. Now, the increase in IntSensi is drastic, which leads to the best rIPEC value in this scenario. The IntSensi value is still moderate; however, one should bear in mind that the interactions are built by variables that do not represent main effects. This might be particularly relevant for real world applications: even a moderate variable inclusion frequency of an interaction term could indicate an important interaction if the underlying variables are irrelevant as main effects.
Both scenarios show that all building blocks are important, and in particular orthogonalization is important before applying RSF, i.e. disentangling information beforehand is crucial for preselecting interactions. CoxBoost is unlikely to benefit from such a preprocessing because it already applies some sort of orthogonalization during fitting (further experiments also point in that direction; data not shown).
That IntSensiA is often similar to IntSensi in both scenarios means that one can rely on the ability of CoxBoost to choose the right interaction terms out of those presented, regardless of IntScreen. Hence, it seems that the parameter R (number of selected interactions) can be quite high. For assessing the effect of R, we additionally investigated the same scenarios rsfsVIFres with R=1000 (see Additional file 1: Table S1). With this reduced Rvalue, sensitivities, VarsTotal, and rIPEC decreased; the latter two measures were in particular reduced in scenario Sim22_1.0. Thus, in case of doubt, R should be set to a larger value.
We also investigated the behavior of the parameter estimates of the main effects and the interaction terms. In no case did a true effect receive a wrong sign. In the mean, shrinkage of the coefficients was stronger in scenario Sim42 than in Sim22_1.0. This has two reasons: higher absolute values of the true coefficients and increased number of nonzero coefficients. As the results show, this increased shrinkage is not relevant for the sensitivity of the detection strategies. One can try to reduce shrinkage by reducing the value of the penalty parameter or manually increasing the number of step sizes; however, we would not recommend such intervention in a highdimensional setting without good reasons (see below).
Results of the simulation study for rsfVIFres in scenarios Sim22_0.25  Sim22_2.5 and Sim_bin
Scenario  IntScreen  IntSensiA  VarsTotal  MainSensi  IntSensi  rIPEC  

CoxBoostM  Final Model  
Strategy rsfVIFres  
Sim22_2.5  15231.62  0.3 (0.05)  16.04 (14.72)  0.29 (0.05)  0.3 (0.05)  0 (0.06)  0.1 (0.16) 
Sim22_2.0  17200.98  0.33 (0.05)  18.28 (14.41)  0.51 (0.05)  0.33 (0.05)  0.01 (0.06)  0.14 (0.19) 
Sim22_1.5  16878.62  0.34 (0.05)  20.38 (14.57)  0.67 (0.05)  0.34 (0.05)  0.05 (0.08)  0.12 (0.15) 
Sim22_1.0  17701.72  0.4 (0.05)  19.9 (12.93)  0.81 (0.04)  0.39 (0.05)  0.12 (0.09)  0.14 (0.14) 
Sim22_0.5  18613.08  0.43 (0.05)  18.04 (10.92)  0.93 (0.03)  0.13 (0.03)  0.21 (0.1)  0.17 (0.1) 
Sim22_0.25  20404.02  0.46 (0.05)  20.4 (11.63)  0.98 (0.01)  0 (0)  0.25 (0.11)  0.2 (0.11) 
Sim22_bin  22847.06  0.42 (0.05)  26.22 (10.9)  0.49(0.07)  0.34(0.05)  0.2 (0.07)  0.15 (0.08) 
Results of the simulation study for rsfVIFres in scenarios with correlated variables
IntScreen  IntSensiA  VarsTotal  MainSensi  IntSensi  rIPEC  

CoxBoostM  Final Model  
Sim22_corr01  19779.06  0.22 (0.04)  17.76 (13.25)  0.75 (0.04)  0.21 (0.04)  0.11 (0.07)  0.1 (0.11) 
Sim22_corr03  17961.56  0.23 (0.04)  12.74 (12.85)  0.46 (0.05)  0.18 (0.04)  0.06 (0.07)  0.03 (0.08) 
Sim22_corr05  15953.22  0.15 (0.04)  6.66 (8.6)  0.14 (0.03)  0.06 (0.02)  0.03 (0.06)  0 (0.03) 
Sim22_corr07  14174.12  0.15 (0.04)  7.26 (8.49)  0.06 (0.02)  0.04 (0.02)  0.03 (0.2)  0.01 (0.3) 
Real data illustrations
Diffuse largeBcell lymphoma data
In order to illustrate how rsfVIFres can be applied on real data, we first analyzed the wellknown Rosenwald data [81]. This data set was used to link 7399 (‘lymphochip’ cDNA microarray) gene expression features of 240 patients with diffuse largeBcell lymphoma (DLBCL) to the time of their death. DLBCL is an aggressive malignancy of mature B lymphocytes with a high rate of remissions. The objective of the Rosenwald study was to devise a molecular profile that accounts for the underlying heterogeneity, predicts survival and can be used for assessing the effect of the related therapies. Overall, 138 deaths were observed, with a five year overall survival of 48%. The 7399 features measured at baseline represent 4128 genes. An established clinical predictor, the International Prognostic Index (IPI  a combination of five clinical features), is available for n=222 patients, which will be considered for the analysis in the following, i.e. $\mathcal{K}=\left\{\text{Indices}\right(\text{IPI}\left)\right\}$. For further details and an overview with respect to various strategies for analyzing this and related data sets, we refer to [82].

Hs.184298 (0.66), Hs.99741 (0.54), Hs.85769 (0.44) and

Hs.76807:Hs.84298 (0.10), Hs.79428:Hs.193736 (0.08), Hs.20191:Hs.99597 (0.06)
Neuroblastoma data

Hs.496658 (0.68), Hs.491494 (0.58), Hs.584827(0.54) and

Hs.496658:Hs.148989 (0.28), Hs.496658:Hs.371249 (0.18), Hs.496658:Hs.532824 (0.18)
Discussion
From the results of the simulation study, we conclude that random forests can provide relevant interaction information. If the interaction is strong enough, the marginal effects of the underlying variables are at a level such that they are frequently selected as split variables in the random forest generation process. Further, the results indicate that disentangling information also is important for achieving good results. The reason behind this might be that variables associated to main effects can mask interactions in random forests, which affects the split variable selection process. Disentanglement of information specifically means to transform variables to be orthogonal to those with indices in $(\mathcal{\mathcal{M}}\cup \mathcal{K})$. When the number of estimated main effects in CoxBoostM is too large (rule of thumb: more than about $\frac{1}{10}\xb7n$[90]), the corresponding regressions can be unreliable. In this case, we would recommend focusing on those main effects with the largest absolute coefficient estimates in CoxBoostM. Another possibility is to use the linear predictor $\text{lp}=\sum _{i\in \mathcal{\mathcal{M}}}{\beta}_{i}{x}_{i}$ for the regression ${x}_{\mathit{\text{jk}}}=\alpha \xb7{\text{lp}}_{k}+{\epsilon}_{\mathit{\text{jk}}},\phantom{\rule{1em}{0ex}}j\notin \mathcal{\mathcal{M}},k=1,\dots ,n$. In both cases, the orthogonalization is imperfect and results (not shown) based on the latter variant indicated that sensitivities related to interactions are considerably lower than with the strategy for computing residuals proposed here. However, an alternative should be taken into consideration, when the number of cases is small (e.g., smaller than 50).
The scenarios with correlation and nonsmooth interactions show that the preselection of interactions is less affected than the final CoxBoost model. For nonsmooth interactions, this was expected due to the nonsmooth nature of individual trees in random forests, but the effects on correlated data indicate that the preselection of interactions in rsfVIFres is quite robust. One further interpretation of the simulation study is that moderate variable inclusion frequency of an interaction term (e.g., 10%−30%) still could indicate an important interaction. The real data example showed that uncertainty related to the reliability of the findings can make it necessary to consider and contextualize as much information as possible. Specifically, increasing the step size of CoxBoost from its optimal value to 500 in the Rosenwald data was an attempt to reduce the uncertainty. Due to the considerable deterioration of prediction performance, the decision on the importance of the identified interactions was based on additional biological knowledge. There is no absolute threshold with respect to a decrease in prediction performance that makes the results definitely unreliable. The results showed that detection of true interaction and main effects can be accompanied by deteriorated or bad prediction performances due to the increase in false positives. It always depends on the subjectmatter question whether a certain level of prediction performance is deemed necessary. If biology can help sorting out the true effects, concerns related to prediction performance even might be considered secondary.
The results showed that the number of preselected interactions R must be large enough (>>1000 in our data sets) for guaranteeing that the screening process is able to preselect relevant effects. CoxBoost was frequently able to select the right variables out of ten thousands of variables. This is a feature of many other (L_{1}) regularized regression techniques such as the LASSO (see also [91, 92]), which (under sparsity assumptions) also are consistent for variable selection, even when the number of variables p is as large as exp(n^{ α }) for some 0<α<1 [93]. Empirically determining an optimal R is nevertheless difficult. This issue certainly needs further scientific investigations.
Limitations
Due to the focus of this paper and the limited space, our study has several limitations. First, only twoway interactions were considered in the interaction screening process. In realworld data, all kind of multifactor and nonlinear interactions can be expected. Second, the simulation scenarios are limited in their scope, because we focused on one critical issue: the effects of the building blocks when interactions are built from variables that do not represent main effects. Although, we also investigated simulations scenarios with correlations, further investigations of informative correlations and more complex correlations structures are relevant.
Third, the realdata applications showed that the strategies cannot be used in an automatic way. Decisions related to the choice of some parameter values (e.g., number of subsamples S or indices of unpenalized variables ), interpretation of the results, and further processing of these results have to be based on subjectmatter knowledge and the specific application. Such requirements could discourage a user from using rsfVIFres. Nevertheless, it should be clear that assessing the necessity of considering interactions is not trivial, even for the simplest case of genegene interactions and therefore informed decisions are crucial.
Fourth, there are open questions such as the specific value for R or alternatives to the building blocks presented in the paper. There are several routes for extending our proposal or replacing components in it. Fifth, we only considered proportional hazard models and simulated data from such models. It is important to consider departures from the related assumptions in future studies, for example by considering timedependent effects. Finally, the real data examples only considered microarray data. Recent sequencing approaches, such as the RNASeq technology, are gaining more and more ground and should be targeted as well.
Conclusion
Our aim in this study was to build a strategy for incorporating twoway interactions into multivariate risk prediction models that are built on highdimensional molecular data. When it is either not feasible or computationally too expensive to consider all possible interactions, screening is necessary in case of no a priori knowledge. We presented three important building blocks for such a screening strategy: subsampling, random forests, and orthogonalization of the data, and concluded that all building blocks are important. Our decision for using random forests for screening interactions has one main reason: the promise of random forests to capture various kinds of relevant interaction structures. CoxBoost was used, because it usually produces sparse risk prediction models. We assumed that a combination of these two approaches could be fruitful due to their complementary character. However, components can be separately replaced by other ones, for example random forests by multifactordimensionality reduction, and such flexibility seems necessary, because no specific combination of building blocks will perform well on every kind of data.
The results show that screening interactions through random forests is feasible and useful, when one is interested in finding relevant twoway interactions. Effect sizes of the interactions should be large enough in order to guarantee useful results. When the underlying variables do not represent main effects, sensitivities related to variable and interaction selection are moderate (≤40%). The results of the simulation study indicates that making all variables orthogonal to those with indices in $(\mathcal{\mathcal{M}}\cup \mathcal{K})$ could enable random forests to preselect relevant interaction effects even in the absence of strong marginal components.
The real data applications showed that not only preprocessing and a combination of different tools are important for interaction detection but also an intelligent postprocessing. Our final conclusion is that in addition to focusing on establishing new methods, it is important to make full use of existing ones.
Declarations
Acknowledgements
We thank the editor and two anonymous reviewers for their constructive comments, which helped us to improve the manuscript significantly. Grateful acknowledgement goes also to Dr. Johanna Mazur for proofreading.
Authors’ Affiliations
References
 Fan J, Lv J: A selective overview of variable selection in high dimensional feature space (invited review article). Stat Sinica. 2010, 20: 101148.Google Scholar
 Fan J, Samworth R, Wu Y: Ultrahigh dimensional feature selection: beyond the linear model. J Mach Learn Res. 2009, 10: 20132038.PubMed CentralPubMedGoogle Scholar
 Guyon I: An introduction to variable and feature selection. J Mach Learn Res. 2003, 3: 11571182.Google Scholar
 Buhlmann P, van de Geer S: Statistics for HighDimensional Data: Methods, Theory and Applications. 2011, New York: SpringerView ArticleGoogle Scholar
 Buhlmann P, Hothorn T: Boosting algorithms: regularization, prediction and model fitting. Stat Sci. 2007, 22 (4): 477505. 10.1214/07STS242.View ArticleGoogle Scholar
 Tibshirani R: Regression shrinkage and selection via the Lasso. J R Stat Soc (Series B). 1996, 58: 267288.Google Scholar
 Tibshirani R: The Lasso method for variable selection in the Cox model. Stat Med. 1997, 16 (4): 385395. 10.1002/(SICI)10970258(19970228)16:4<385::AIDSIM380>3.0.CO;23.View ArticlePubMedGoogle Scholar
 Park MY, Hastie T: L1regularization path algorithm for generalized linear models. J R Stat Soc: Series B (Stat Methodol). 2007, 69 (4): 659677. 10.1111/j.14679868.2007.00607.x.View ArticleGoogle Scholar
 Chen HC, Chen J: Assessment of reproducibility of cancer survival risk predictions across medical centers. BMC Med Res Methodol. 2013, 13: 2510.1186/147122881325.View ArticlePubMed CentralPubMedGoogle Scholar
 Huang Y, Gottardo R: Comparability and reproducibility of biomedical data. Brief Bioinform. 2013, 14 (4): 391401. 10.1093/bib/bbs078.View ArticlePubMed CentralPubMedGoogle Scholar
 Nilsson R, Bjorkegren J, Tegner J: On reliable discovery of molecular signatures. BMC Bioinformatics. 2009, 10: 3810.1186/147121051038.View ArticlePubMed CentralPubMedGoogle Scholar
 Lee Y, Scheck A, Cloughesy T, Lai A, Dong J, Farooqi H, Liau L, Horvath S, Mischel P, Nelson S: Gene expression analysis of glioblastomas identifies the major molecular basis for the prognostic benefit of younger age. BMC Med Genom. 2008, 1: 5210.1186/17558794152.View ArticleGoogle Scholar
 Bovelstad H, Nygard S, Borgan O: Survival prediction from clinicogenomic models  a comparative study. BMC Bioinformatics. 2009, 10: 41310.1186/1471210510413.View ArticlePubMed CentralPubMedGoogle Scholar
 Kammers K, Lang M, Hengstler J, Schmidt M, Rahnenfuhrer J: Survival models with preclustered gene groups as covariates. BMC Bioinformatics. 2011, 12: 47810.1186/1471210512478.View ArticlePubMed CentralPubMedGoogle Scholar
 Binder H, Schumacher M: Allowing for mandatory covariates in boosting estimation of sparse highdimensional survival models. BMC Bioinformatics. 2008, 9: 1410.1186/14712105914.View ArticlePubMed CentralPubMedGoogle Scholar
 Park MY, Hastie T: Penalized logistic regression for detecting gene interactions. Biostatistics. 2008, 9: 3050. 10.1093/biostatistics/kxm010.View ArticlePubMedGoogle Scholar
 Wu TT, Chen YF, Hastie T, Sobel E, Lange K: Genomewide association analysis by lasso penalized logistic regression. Bioinformatics. 2009, 25 (6): 714721. 10.1093/bioinformatics/btp041.View ArticlePubMed CentralPubMedGoogle Scholar
 Dinu I, Mahasirimongkol S, Liu Q, Yanai H, Sharaf Eldin N, Kreiter E, Wu X, Jabbari S, Tokunaga K, Yasui Y: SNPSNP Interactions discovered by logic regression explain Crohn’s disease genetics. PLoS ONE. 2012, 7 (10): e4303510.1371/journal.pone.0043035.View ArticlePubMed CentralPubMedGoogle Scholar
 Schwender H, Ickstadt K: Identification of SNP interactions using logic regression. Biostatistics. 2008, 2007: 9187.Google Scholar
 Ritchie MD, Hahn LW, Roodi N, Bailey LR, Dupont WD, Parl FF, Moore JH: Multifactordimensionality reduction reveals highorder interactions among estrogenmetabolism genes in sporadic breast cancer. Am J Hum Genet. 2001, 69: 138147. 10.1086/321276.View ArticlePubMed CentralPubMedGoogle Scholar
 Hahn LW, Ritchie MD, Moore JH: Multifactor dimensionality reduction software for detecting genegene and geneenvironment interactions. Bioinformatics. 2003, 19 (3): 376382. 10.1093/bioinformatics/btf869.View ArticlePubMedGoogle Scholar
 Breiman L: Random forests. Mach Learn. 2001, 45: 532. 10.1023/A:1010933404324.View ArticleGoogle Scholar
 Jiang R, Tang W, Wu X, Fu W: A random forest approach to the detection of epistatic interactions in casecontrol studies. BMC Bioinformatics. 2009, 10 (Suppl 1): 112. 10.1186/1471210510S1S1.View ArticleGoogle Scholar
 UpstillGoddard R, Eccles D, Fliege J, Collins A: Machine learning approaches for the discovery of genegene interactions in disease data. Brief Bioinform. 2013, 14 (2): 251260. 10.1093/bib/bbs024.View ArticlePubMedGoogle Scholar
 Gao H, Wu Y, Li J, Li H, Li J, Yang R: Forward LASSO analysis for highorder interactions in genomewide association study. Brief Bioinform. 2013, Jun 17. [Epub ahead of print]Google Scholar
 Bien J, Simon N, Tibshirani R: A lasso for hierarchical testing of interactions. Tech. rep., Department of Computer Science, Michigan State University; 2012. [http://wwwstat.stanford.edu/~tibs/research.html]
 Pashova H, LeBlanc M, Kooperberg C: Boosting for detection of geneenvironment interactions. Stat Med. 2013, 32 (2): 255266. 10.1002/sim.5444.View ArticlePubMed CentralPubMedGoogle Scholar
 duVerle DA, Takeuchi I, MurakamiTonami Y, Kadomatsu K, Tsuda K: Discovering combinatorial interactions in survival data. Bioinformatics. 2013, 29 (23): 30533059. 10.1093/bioinformatics/btt532.View ArticlePubMed CentralPubMedGoogle Scholar
 Biau G, Devroye L: On the layered nearest neighbour estimate, the bagged nearest neighbour estimate and the random forest method in regression and classification. J Multivariate Anal. 2010, 101 (10): 24992518. 10.1016/j.jmva.2010.06.019.View ArticleGoogle Scholar
 Biau G: Analysis of a random forests model. J Mach Learn Res. 2012, 98888: 10631095.Google Scholar
 Teng S, Luo H, Wang L: Random forestbased prediction of protein sumoylation sites from sequence features. Proceedings of the First ACM International Conference on Bioinformatics and Computational Biology. 2010, New York: ACM, 120126.View ArticleGoogle Scholar
 Qi Y, BarJoseph Z, KleinSeetharaman J: Evaluation of different biological data and computational classification methods for use in protein interaction prediction. Proteins: Struct Funct Bioinformatics. 2006, 63 (3): 490500. 10.1002/prot.20865.View ArticleGoogle Scholar
 Lunetta K, Hayward LB, Segal J, Van Eerdewegh P: Screening largescale association study data: exploiting interactions using random forests. BMC Genetics. 2004, 5: 3210.1186/14712156532.View ArticlePubMed CentralPubMedGoogle Scholar
 Winham S, Colby C, Freimuth R, Wang X, de Andrade M, Huebner M, Biernacka J: SNP interaction detection with Random forests in highdimensional genetic data. BMC Bioinformatics. 2012, 13: 16410.1186/1471210513164.View ArticlePubMed CentralPubMedGoogle Scholar
 Hapfelmeier A, Ulm K: A new variable selection approach using random forests. Comput Stat Data Anal. 2013, 60 (0): 5069.View ArticleGoogle Scholar
 Yoshida M, Koike A: SNPInterForest: A new method for detecting epistatic interactions. BMC Bioinformatics. 2011, 12: 46910.1186/1471210512469.View ArticlePubMed CentralPubMedGoogle Scholar
 Ishwaran H: Variable importance in binary regression trees and forests. Electron J Stat. 2007, 1: 519537. 10.1214/07EJS039.View ArticleGoogle Scholar
 Strobl C, Boulesteix AL, Zeileis A, Hothorn T: Bias in random forest variable importance measures: illustrations, sources and a solution. BMC Bioinformatics. 2007, 8: 2510.1186/14712105825.View ArticlePubMed CentralPubMedGoogle Scholar
 Hapfelmeier A, Hothorn T, Ulm K, Strobl C: A new variable importance measure for random forests with missing data. Stat Comput. 2012, 114.Google Scholar
 Tutz G, Binder H: Generalized additive modeling with implicit variable selection by likelihoodbased boosting. Biometrics. 2006, 62 (4): 961971. 10.1111/j.15410420.2006.00578.x.View ArticlePubMedGoogle Scholar
 Porzelius C, Schumacher M, Binder H: Sparse regression techniques in lowdimensional survival data settings. Stat Comput. 2010, 20 (2): 151163. 10.1007/s1122200991556.View ArticleGoogle Scholar
 Binder H, Schumacher M: Incorporating pathway information into boosting estimation of highdimensional risk prediction models. BMC Bioinformatics. 2009, 10: 1810.1186/147121051018.View ArticlePubMed CentralPubMedGoogle Scholar
 Tutz G, Binder H: Boosting ridge regression. Comput Stat Data Anal. 2007, 51 (12): 60446059. 10.1016/j.csda.2006.11.041.View ArticleGoogle Scholar
 Cox DR: Regression models and lifetables. J R Stat Soci. Series B (Methodological). 1972, 34 (2): 187220.Google Scholar
 Binder H, Allignol A, Schumacher M, Beyersmann J: Boosting for highdimensional timetoevent data with competing risks. Bioinformatics. 2009, 25 (7): 890896. 10.1093/bioinformatics/btp088.View ArticlePubMedGoogle Scholar
 Binder H: CoxBoost: Cox models by likelihood based boosting for a single survival endpoint or competing risks. 2013, [R package version 1.4]Google Scholar
 Biau G, Devroye L, Lugosi G: Consistency of random forests and other averaging classifiers. J Mach Learn Res. 2008, 9: 20152033.Google Scholar
 Reif DM, Motsinger AA, McKinney BA, Crowe JE, Moore JH: Feature Selection using a Random Forests Classifier for the Integrated Analysis of Multiple Data Types. Computational Intelligence and Bioinformatics and Computational Biology. 2006, New York: IEEE, 18.Google Scholar
 Chen X, Ishwaran H: Random forests for genomic data analysis. Genomics. 2012, 99 (6): 323329. 10.1016/j.ygeno.2012.04.003.View ArticlePubMed CentralPubMedGoogle Scholar
 Ishwaran H, Kogalur UB, Gorodeski EZ, Minn A, Lauer MS: Highdimensional variable selection for survival data. J Am Stat Assoc. 2010, 105 (489): 205217. 10.1198/jasa.2009.tm08622.View ArticleGoogle Scholar
 Buhlmann P, Yu B: Analyzing Bagging. Ann Stat. 2002, 30: 927961.View ArticleGoogle Scholar
 Svetnik V, Liaw A, Tong C, Culberson JC, Sheridan RP, Feuston BP: Random forest: a classification and regression tool for compound classification and QSAR modeling. J Chem Inf Comput Sci. 2003, 43 (6): 19471958. 10.1021/ci034160g.View ArticlePubMedGoogle Scholar
 Ishwaran H, Kogalur UB, Blackstone EH, Lauer MS: Random survival forests. Ann Appl Stat. 2008, 2 (3): 841860. 10.1214/08AOAS169.View ArticleGoogle Scholar
 Segal MR: Regression trees for censored data. Biometrics. 1988, 44: 3547. 10.2307/2531894.View ArticleGoogle Scholar
 Ishwaran H, Kogalur UB, Chen X, Minn AJ: Random survival forests for highdimensional data. Stat Anal Data Mining. 2011, 4: 115132. 10.1002/sam.10103.View ArticleGoogle Scholar
 Ishwaran H, Kogalur U: Random Forests for Survival, Regression and Classification (RFSRC). 2013, [R package version 1.3]Google Scholar
 Strobl C, Boulesteix AL, Kneib T, Augustin T, Zeileis A: Conditional variable importance for random forests. BMC Bioinformatics. 2008, 9 (307):Google Scholar
 Genuer R, Poggi JM, TuleauMalot C: Variable selection using random forests. Pattern Recognit Lett. 2010, 31 (14): 22252236. 10.1016/j.patrec.2010.03.014.View ArticleGoogle Scholar
 Boulesteix AL, Janitza S, Kruppa J, König IR: Overview of randomforestmethodology and practical guidance with emphasis on computational biology and bioinformatics. Wiley Interdiscipl Rev : Data Mining Knowl Discov. 2012, 2 (6): 493507. 10.1002/widm.1072.Google Scholar
 Ishwaran H, Kogalur U: Random survival forests for R. R News. 2007, 7 (2): 2531.Google Scholar
 Cook RD, Weisberg S: Applied Regression Including Computing and Graphics. 1999, New York: WileyInterscienceView ArticleGoogle Scholar
 Harrell FE: Regression Modeling Strategies: With Applications to Linear Models, Logistic Regression, and Survival Analysis. 2010, New York: SpringerGoogle Scholar
 Starmans M, Pintilie M, John T, Der S, Shepherd F, Jurisica I, Lambin P, Tsao MS, Boutros P: Exploiting the noise: improving biomarkers with ensembles of data analysis methodologies. Genome Med. 2012, 4 (11): 8410.1186/gm385.View ArticlePubMed CentralPubMedGoogle Scholar
 Yang Y: Prediction/Estimation with simple linear models: is it really that simple?. Econometric Theory. 2007, 23: 136.View ArticleGoogle Scholar
 Blum A, Langley P: Selection of relevant features and examples in machine learning. Artif Intell. 1997, 97: 245271. 10.1016/S00043702(97)000635.View ArticleGoogle Scholar
 Kohavi R, John GH: Wrappers for feature subset selection. Artif Intell. 1997, 97: 273324. 10.1016/S00043702(97)00043X.View ArticleGoogle Scholar
 Brier GW: Verification of forecasts expressed in terms of probability. Mon Weather Rev. 1950, 78: 13. 10.1175/15200493(1950)078<0001:VOFEIT>2.0.CO;2.View ArticleGoogle Scholar
 Gerds TA, Schumacher M: Consistent estimation of the expected Brier score in general survival models with rightcensored event times. Biom J. 2006, 48 (6): 10291040. 10.1002/bimj.200610301.View ArticlePubMedGoogle Scholar
 Gneiting T, Raftery AE: Strictly proper scoring rules, prediction, and estimation. J Am Stat Assoc. 2004, 102: 359378.View ArticleGoogle Scholar
 Efron B, Tibshirani R: Improvements on crossvalidation: the.632+ bootstrap method. J Am Stat Assoc. 1997, 92 (438): 548560.Google Scholar
 Binder H, Schumacher M: Adapting prediction error estimates for biased complexity selection in highdimensional bootstrap samples. Stat Appl Genet Mol Biol. 2008, 7: 128.Google Scholar
 Porzelius C, Schumacher M, Binder H: The benefit of databased model complexity selection via prediction error curves in timetoevent data. Comput Stat. 2011, 26 (2): 293302. 10.1007/s0018001102366.View ArticleGoogle Scholar
 Kohavi R: A Study of CrossValidation and Bootstrap for Accuracy Estimation and Model Selection. Proceedings of the 14th International Joint Conferences on Artificial Intelligence. 1995, Montreal: Morgan Kaufmann, Los Altos, 11371143.Google Scholar
 Abraham G, Kowalczyk A, Loi S, Haviv I, Zobel J: Prediction of breast cancer prognosis using gene set statistics provides signature stability and biological context. BMC Bioinformatics. 2010, 11: 27710.1186/1471210511277.View ArticlePubMed CentralPubMedGoogle Scholar
 Graf E, Schmoor C, Sauerbrei W, Schumacher M: Assessment and comparison of prognostic classification schemes for survival data. Stat Med. 1999, 18 (1718): 25292545. 10.1002/(SICI)10970258(19990915/30)18:17/18<2529::AIDSIM274>3.0.CO;25.View ArticlePubMedGoogle Scholar
 Jelizarow M, Guillemot V, Tenenhaus A, Strimmer K, Boulesteix AL: Overoptimism in bioinformatics: an illustration. Bioinformatics. 2010, 26 (16): 19901998. 10.1093/bioinformatics/btq323.View ArticlePubMedGoogle Scholar
 Bender R, Augustin T, Blettner M: Generating survival times to simulate Cox proportional hazards models. Stat Med. 2005, 24 (11): 17131723. 10.1002/sim.2059.View ArticlePubMedGoogle Scholar
 Nicodemus KK, Malley J, Strobl C, Ziegler A: The behavior of random forest permutationbased variable importance measures under predictor correlation. BMC Bioinformatics. 2010, 11: 11010.1186/1471210511110.View ArticlePubMed CentralPubMedGoogle Scholar
 Nicodemus K: Letter to the editor: On the stability and ranking of predictors from random forest variable importance measures. Brief Bioinform. 2011, 12 (4): 369373. 10.1093/bib/bbr016.View ArticlePubMed CentralPubMedGoogle Scholar
 Nicodemus KK, Malley J, Strobl C, Ziegler A: The behavior of random forest permutationbased variable importance measures under predictor correlation. BMC Bioinformatics. 2010, 11: 11010.1186/1471210511110.View ArticlePubMed CentralPubMedGoogle Scholar
 Rosenwald A, Wright G, Chan WC, Connors JM, Campo E, Fisher RI, Gascoyne RD, MullerHermelink HK, Smeland EB, et al: The use of molecular profiling to predict survival after chemotherapy for diffuse largeBcell lymphoma. New Engl J Med. 2002, 346 (25): 19371947. 10.1056/NEJMoa012914.View ArticlePubMedGoogle Scholar
 Segal MR: Microarray gene expression data with linked survival phenotypes: diffuse largeBcell lymphoma revisited. Biostatistics. 2006, 7 (2): 268285.View ArticlePubMedGoogle Scholar
 Zhang L, Li L, Liu H, Borowitz JL, Isom GE: BNIP3 mediates cell death by different pathways following localization to endoplasmic reticulum and mitochondrion. J Fed Am Soc Exp Biol. 2009, 23 (10): 340514.Google Scholar
 Wong CS, Moller A: Siah: A promising anticancer target. Cancer Res. 2013, 73 (8): 24002406. 10.1158/00085472.CAN124348.View ArticlePubMedGoogle Scholar
 Crow MT: Hypoxia, BNIP3 Proteins, and the mitochondrial death pathway in Cardiomyocytes. Circ Res. 2002, 91 (3): 183185. 10.1161/01.RES.0000030195.38795.CF.View ArticlePubMedGoogle Scholar
 Nakayama K, Ronai Z: Siah: new players in the cellular response to hypoxia. Cell Cycle. 2004, 3 (11): 13457. 10.4161/cc.3.11.1207.View ArticlePubMedGoogle Scholar
 Chinnadurai G, Vijayalingam S, Gibson SB: BNIP3 subfamily BH3only proteins: mitochondrial stress sensors in normal and pathological functions. Oncogene. 2008, 27 (Suppl 1): S11427.View ArticlePubMed CentralPubMedGoogle Scholar
 House CM, Moller A, Bowtell DD: Siah Proteins: novel drug targets in the Ras and Hypoxia pathways. Cancer Res. 2009, 69 (23): 88358838. 10.1158/00085472.CAN091676.View ArticlePubMedGoogle Scholar
 Oberthuer A, Kaderali L, Kahlert Y, Hero B, Westermann F, Berthold F, Brors B, Eils R, Fischer M: Subclassification and individual survival time prediction from gene expression data of neuroblastoma patients by using CASPAR. Clin Cancer Res. 2008, 14 (20): 65906601. 10.1158/10780432.CCR074377.View ArticlePubMedGoogle Scholar
 Harrell FE: Regression Modeling Strategies: With Applications to Linear Models, Logistic Regression and Survival Analysis. 2001, New York: SpringerView ArticleGoogle Scholar
 Efron B, Hastie T, Johnstone I, Tibshirani R: Least angle regression. Ann Stat. 2004, 32 (2): 407499. 10.1214/009053604000000067.View ArticleGoogle Scholar
 Hesterberg T, Choi NH, Meier L, Fraley C: Least angle and l1 penalized regression: A review. Stat Surv. 2008, 2 (2008): 6193.View ArticleGoogle Scholar
 Zhao P, Yu B: On model selection consistency of Lasso. J Mach Learn Res. 2006, 7: 25412563.Google Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.