Random generalized linear model: a highly accurate and interpretable ensemble predictor
 Lin Song^{1, 2},
 Peter Langfelder^{1} and
 Steve Horvath^{1, 2}
DOI: 10.1186/1471-2105-14-5
© Song et al.; licensee BioMed Central Ltd. 2013
Received: 1 October 2012
Accepted: 3 January 2013
Published: 16 January 2013
Abstract
Background
Ensemble predictors such as the random forest are known to have superior accuracy, but their black-box predictions are difficult to interpret. In contrast, a generalized linear model (GLM) is very interpretable, especially when forward feature selection is used to construct the model. However, forward feature selection tends to overfit the data and leads to low predictive accuracy. Therefore, it remains an important research goal to combine the advantages of ensemble predictors (high accuracy) with the advantages of forward regression modeling (interpretability). To address this goal, several articles have explored GLM-based ensemble predictors. Because limited evaluations suggested that these ensemble predictors were less accurate than alternative predictors, they have received little attention in the literature.
Results
Comprehensive evaluations involving hundreds of genomic data sets, the UCI machine learning benchmark data, and simulations are used to give GLM-based ensemble predictors a new and careful look. A novel bootstrap-aggregated (bagged) GLM predictor that incorporates several elements of randomness and instability (random subspace method, optional interaction terms, forward variable selection) often outperforms a host of alternative prediction methods including random forests and penalized regression models (ridge regression, elastic net, lasso). This random generalized linear model (RGLM) predictor provides variable importance measures that can be used to define a “thinned” ensemble predictor (involving few features) that retains excellent predictive accuracy.
Conclusion
RGLM is a state-of-the-art predictor that shares the advantages of a random forest (excellent predictive accuracy, feature importance measures, out-of-bag estimates of accuracy) with those of a forward-selected generalized linear model (interpretability). These methods are implemented in the freely available R software package randomGLM.
Background
Prediction methods (also known as classifiers, supervised machine learning methods, regression models, prognosticators, diagnostics) are widely used in biomedical research. For example, reliable prediction methods are essential for accurate disease classification, diagnosis and prognosis. Since prediction methods based on multiple features (also known as covariates or independent variables) can greatly outperform predictors based on a single feature [1], it is important to develop methods that can optimally combine features to obtain high accuracy. Introductory text books describe well known prediction methods such as linear discriminant analysis (LDA), K-nearest neighbor (KNN) predictors, support vector machines (SVM) [2], and tree predictors [3]. Many publications have evaluated popular prediction methods in the context of gene expression data [4-9].
Ensemble predictors are particularly attractive since they are known to lead to highly accurate predictions. An ensemble predictor generates and integrates multiple versions of a single predictor (often referred to as base learner), and arrives at a final prediction by aggregating the predictions of multiple base learners, e.g. via plurality voting across the ensemble. One particular approach for constructing an ensemble predictor is bootstrap aggregation (bagging) [10]. Here multiple versions of the original data are generated through bootstrapping, where observations from the training set are randomly sampled with replacement. An individual predictor (e.g. a tree predictor) is fitted on each bootstrapped data set. Thus, 100 bootstrapped data sets (100 bags) will lead to an ensemble of 100 tree predictors. In case of a class outcome (e.g. disease status), the individual predictors “vote” for each class and the final prediction is obtained by majority voting.
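The bagging-and-voting scheme just described can be sketched in a few lines (an illustrative sketch only; the toy decision-stump base learner, the function names, and the data are ours and stand in for the tree or GLM base learners discussed in the text):

```python
import numpy as np

rng = np.random.default_rng(0)

def stump(Xb, yb):
    """Toy base learner: a decision stump on feature 0, thresholded at the
    bag's mean. It stands in for a tree or forward-selected GLM."""
    thr = Xb[:, 0].mean()
    return lambda X: (X[:, 0] > thr).astype(int)

def bagged_predict(X_train, y_train, X_test, base_learner, n_bags=100):
    """Bootstrap aggregation: fit one base learner per bootstrap sample
    (rows drawn with replacement), then combine the individual class
    votes by majority voting."""
    votes = np.zeros(len(X_test))
    for _ in range(n_bags):
        idx = rng.integers(0, len(X_train), size=len(X_train))  # with replacement
        model = base_learner(X_train[idx], y_train[idx])
        votes += model(X_test)                 # each base learner votes 0/1
    return (votes > n_bags / 2).astype(int)    # majority vote
```

With 100 bags and a tree base learner this is Breiman-style bagging; RGLM replaces the stump with a forward-selected GLM fitted on a random feature subspace of each bag.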
Breiman (1996) showed that bagging weak predictors (e.g. tree predictors or forward selected linear models) often yields substantial gains in predictive accuracy [10]. Yet ensemble predictors are only very rarely used for predicting clinical outcomes. This fact points to a major weakness of ensemble predictors: they typically lead to “black box” predictions that are hard to interpret in terms of the underlying features. Clinicians and epidemiologists prefer forward selected regression models since the resulting predictors are highly interpretable: a linear combination of relatively few features can be used to predict the outcome or the probability of an outcome. But the sparsity afforded by forward feature selection comes at an unacceptably high cost: forward variable selection (and other variable selection methods) often greatly overfits the data, which results in unstable and inaccurate predictors [11, 12]. Ideally, one would want to combine the advantages of ensemble predictors with those of forward selected regression models. As discussed below, multiple articles describe ensemble predictors based on linear models, including the seminal work by Breiman [10], who evaluated a bagged forward selected linear regression model. However, the idea of bagging forward selected linear models (or other GLMs) appears to have been set aside as new ensemble predictors, such as the random forest, became popular. A random forest (RF) predictor not only bags tree predictors but also introduces an element of randomness by considering only a randomly selected subset of features at each node split [13]. The number of randomly selected features, mtry, is the only parameter of the random forest predictor. The random forest predictor has deservedly received a lot of attention for the following reasons: First, the bootstrap aggregation step allows one to use out-of-bag (OOB) samples to estimate the predictive accuracy.
The resulting OOB estimate of the accuracy often obviates the need for cross-validation and other resampling techniques. Second, the RF predictor provides several measures of feature (variable) importance. Several articles explore the use of these importance measures to select genes [5, 13, 14]. Third, it can be used to define a dissimilarity measure useful in clustering applications [13, 15]. Fourth, and most importantly, the RF predictor has superior predictive accuracy. It performs as well as alternatives in cancer gene expression data sets [5] but it really stands out when applied to the UCI machine learning benchmark data sets, where it is as good as (if not better than) many existing methods [13]. While we confirm the truly outstanding predictive performance of the RF, the proposed RGLM method turns out to be even more accurate than the RF (e.g. across the disease gene expression data sets). Breiman and others have pointed out that the black box predictions of the RF predictor can be difficult to interpret. For this reason, we wanted to give bagged forward selected generalized linear regression models another careful look. After exploring different approaches for injecting elements of randomness into the individual GLM predictors, we arrived at a new ensemble predictor, referred to as the random GLM predictor, with an astonishing predictive performance. An attractive aspect of the proposed RGLM predictor is that it combines the advantages of the RF with those of a forward selected GLM. As the name generalized linear model indicates, it can be used for a general outcome such as a binary outcome, a multiclass outcome, a count outcome, or a quantitative outcome. We show that several incremental (but important) changes to the original bagged GLM predictor by Breiman add up to a qualitatively new predictor (referred to as the random GLM predictor) that performs at least as well as the RF predictor on the UCI benchmark data sets.
While the UCI data are the benchmark data for evaluating predictors, only a dozen such data sets are available for binary outcome prediction. To provide a more comprehensive empirical comparison of the different prediction methods, we also consider over 700 comparisons involving gene expression data. In these genomic studies, the RGLM method turns out to be slightly more accurate than the considered alternatives. While the improvements in accuracy afforded by the RGLM are relatively small they are statistically significant.
This article is organized as follows. First, we present a motivating example that illustrates the high prediction accuracy of the RGLM. Second, we compare the RGLM with other state-of-the-art predictors when it comes to binary outcome prediction. Toward this end, we use the UCI machine learning benchmark data, over 700 empirical gene expression comparisons, and extensive simulations. Third, we compare the RGLM with other predictors for quantitative (continuous) outcome prediction. Fourth, we describe several variable importance measures and show how they can be used to define a thinned version of the RGLM that uses only a few important features. Even for data sets comprised of thousands of gene features, the thinned RGLM often involves fewer than 20 features and is thus more interpretable than most ensemble predictors.
Methods
Construction of the RGLM predictor
RGLM is an ensemble predictor based on bootstrap aggregation (bagging) of generalized linear models whose features (covariates) are selected using forward regression according to the AIC criterion. GLMs comprise a large class of regression models, e.g. linear regression for a normally distributed outcome, logistic regression for a binary outcome, multinomial regression for a multiclass outcome and Poisson regression for a count outcome [16]. Thus, RGLM can be used to predict binary, continuous, count, and other outcomes for which generalized linear models can be defined. The “randomness” in RGLM stems from two sources. First, a nonparametric bootstrap procedure is used which randomly selects samples with replacement from the original data set. Second, a random subset of features (specified by the input parameter nFeaturesInBag) is selected for each bootstrap sample. This amounts to a random subspace method [17] applied to each bootstrap sample separately.
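The two sources of randomness can be sketched together (an illustrative sketch; the function name is ours, while n_features_in_bag mirrors the package's nFeaturesInBag parameter):

```python
import numpy as np

rng = np.random.default_rng(1)

def draw_bag(X, y, n_features_in_bag):
    """One RGLM-style bag: bootstrap the samples (rows, drawn with
    replacement), then restrict to a random subspace of the features
    (columns, drawn without replacement), as in the random subspace
    method applied to each bootstrap sample separately."""
    n, p = X.shape
    rows = rng.integers(0, n, size=n)                            # bootstrap
    cols = rng.choice(p, size=n_features_in_bag, replace=False)  # random subspace
    return X[np.ix_(rows, cols)], y[rows], cols
```

Each bag's reduced matrix would then be handed to forward GLM selection; the returned column indices record which features were eligible in that bag.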
Importantly, RGLM also has a parameter maxInteractionOrder (default value 1) for creating interactions up to a given order among features in the model construction. For example, RGLM.inter2 results from setting maxInteractionOrder=2, i.e. considering pairwise (also known as 2-way) interaction terms. As an example, consider the case when only pairwise interaction terms are used. For each bag, a random set of features is selected (as in the random subspace method, RSM) from the original covariates, i.e. covariates without interaction terms. Next, all pairwise interactions among the nFeaturesInBag randomly selected features are generated. Then the usual RGLM candidate feature selection steps are applied to the combined set of pairwise interaction terms and the nFeaturesInBag randomly selected features per bag, resulting in nCandidateCovariates top-ranking features per bag, which are subsequently subjected to forward feature selection.
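Generating the pairwise terms amounts to taking element-wise products of all feature pairs; a minimal sketch (the function name is ours):

```python
import numpy as np
from itertools import combinations

def add_pairwise_interactions(X):
    """Append all 2-way interaction terms (element-wise products of
    feature pairs) to the feature matrix, as when maxInteractionOrder=2."""
    pairs = [X[:, i] * X[:, j] for i, j in combinations(range(X.shape[1]), 2)]
    return np.column_stack([X] + pairs) if pairs else X
```

Starting from N selected columns this yields N + N(N-1)/2 = N(N+1)/2 columns in total, which is the term count appearing in the default settings of nFeaturesInBag below.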
These methods are implemented in our R software package randomGLM which allows the user to input a training set and optionally a test set. It automatically outputs outofbag estimates of the accuracy and variable importance measures.
Parameter choices for the RGLM predictor
Default setting of nFeaturesInBag
Interaction order  N  nFeaturesInBag/N  N^{∗}  nFeaturesInBag/N^{∗}

No interaction  1−10  1  1−10  1
  11−300  1.0276−0.00276N  11−300  1.0276−0.00276N^{∗}
  >300  0.2  >300  0.2
2-way interaction  1−4  1  1−10  1
  5−24  1.0276−0.00276N(N+1)/2  11−300  1.0276−0.00276N^{∗}
  >24  0.2  >300  0.2
3-way interaction  1−3  1  1−10  1
  4−12  1.0276−0.00276(N^{3}+5N)/6  11−300  1.0276−0.00276N^{∗}
  >12  0.2  >300  0.2
Here N denotes the number of original features and N^{∗} the total number of features including interaction terms; N(N+1)/2 and (N^{3}+5N)/6 count the terms generated from N features with 2-way and 3-way interactions, respectively.
Relationship with related prediction methods
 1.
RGLM allows for interaction terms between features, which can greatly improve the performance on some data sets (in particular the UCI benchmark data sets). We refer to RGLM involving two-way or three-way interactions as RGLM.inter2 and RGLM.inter3, respectively.
 2.
RGLM has a parameter nFeaturesInBag that allows one to restrict the number of features used in each bootstrap sample. This parameter is conceptually related to the mtry parameter of the Random Forest predictor. In essence, this parameter allows one to use a random subspace method (RSM, [17]) in each bootstrap sample.
 3.
RGLM has a parameter nCandidateCovariates that allows one to restrict the number of features in forward regression, which not only has computational advantages but also introduces additional instability into the individual predictors, which is a desirable characteristic of an ensemble predictor.
 4.
RGLM optimizes the AIC criterion during forward selection.
 5.
RGLM has a “thinning threshold” parameter which allows one to reduce the number of features involved in prediction while maintaining good prediction accuracy. Since a thinned RGLM involves far fewer features, it facilitates understanding of how the ensemble arrives at its predictions.
RGLM is not only related to bagging but also to the random subspace method (RSM) proposed by [17]. In the RSM, the training set is also repeatedly modified as in bagging, but this modification is performed in the feature space (rather than the sample space). In the RSM, a subset of features is randomly selected, which amounts to restricting attention to a subspace of the original feature space. As one of its construction steps, RGLM uses an RSM on each bootstrap sample. Future research could explore whether random partitions as opposed to random subspaces would be useful for constructing an RGLM. Random partitions of the feature space are similar to random subspaces but they divide the feature space into mutually exclusive subspaces [19, 20]. Random partition based predictors have been shown to perform well in high-dimensional data [19]. Both RSM and random partitions have more general applicability than RGLM since these methods can be used with any base learner. There is a vast literature on ensemble induction methods, but a property worth highlighting is that RGLM uses forward variable selection of GLMs. Recall that RGLM goes through the following steps: 1) bootstrap sampling, 2) RSM (and optionally creating interaction terms), 3) forward variable selection of a GLM, 4) aggregation of votes. Empirical studies involving different base learners (other than GLMs) have shown that combining bootstrap sampling with RSM (steps 1 and 2) leads to ensemble predictors with performance comparable to that of the random forest predictor [21].
Another prediction method, random multinomial logit model (RMNL), also shares a similar idea with RGLM. It was recently proposed for multiclass outcome prediction [18]. RMNL bags multinomial logit models with random feature selection in each bag. It can be seen as a special case of RGLM, except that no forward model selection is carried out.
Software implementation
The RGLM method is implemented in the freely available R package randomGLM. The R function randomGLM allows the user to output training set predictions, out-of-bag predictions, test set predictions, coefficient values, and variable importance measures. The predict function can be used to arrive at test set predictions. Tutorials can be found at the following webpage: http://labs.genetics.ucla.edu/horvath/RGLM.
Short description of alternative prediction methods
Forward selected generalized linear model predictor (forwardGLM)
We denote by forwardGLM the (single) generalized linear model predictor whose covariates were selected using forward feature selection (according to the AIC criterion). Thus, forwardGLM does not involve bagging or random feature selection and is not an ensemble predictor.
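For a Gaussian outcome, forward selection under AIC can be sketched as follows (an illustrative ordinary-least-squares version; the randomGLM package relies on R's stepwise machinery, and the function names here are ours):

```python
import numpy as np

def aic_linear(X, y):
    """AIC of an ordinary least squares fit: n*log(RSS/n) + 2k, where k
    counts the fitted coefficients (including the intercept)."""
    n = len(y)
    Xd = np.column_stack([np.ones(n), X])       # add intercept column
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    rss = float(np.sum((y - Xd @ beta) ** 2))
    return n * np.log(rss / n) + 2 * Xd.shape[1]

def forward_select(X, y):
    """Greedy forward selection: starting from the intercept-only model,
    repeatedly add the feature that lowers AIC the most; stop when no
    remaining feature improves the criterion."""
    selected, remaining = [], list(range(X.shape[1]))
    best = aic_linear(X[:, selected], y)        # intercept-only model
    while remaining:
        aic, j = min((aic_linear(X[:, selected + [k]], y), k) for k in remaining)
        if aic >= best:
            break
        best = aic
        selected.append(j)
        remaining.remove(j)
    return selected
```

The AIC penalty of 2 per added coefficient is what keeps the per-bag models sparse; within RGLM this routine would be run on each bag's candidate covariates.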
Random forest (RF)
RF is an ensemble predictor that consists of a collection of decision trees which vote for the class of observations [13]. The RF is known for its outstanding predictive accuracy. We used the randomForest R package in our studies. We considered two choices for the RF parameter mtry: i) the default RF predictor where mtry equals the square root of the number of features and ii) RFbigmtry where mtry equals the total number of features. We always generated at least 500 trees per forest but used 1000 trees when calculating variable importance measures.
Recursive partitioning and regression trees (Rpart)
Classification and regression trees were generated using the default settings of the rpart R package. Tree methods are described in [3].
Linear discriminant analysis (LDA)
LDA aims to find a linear combination of features (referred to as discriminant variables) to predict a binary outcome (reviewed in [22, 23]). We used the lda R function in the MASS R package with parameter choice method=moment.
Diagonal linear discriminant analysis (DLDA)
DLDA is similar to LDA but it ignores the correlation patterns between features. While this is often an unrealistic assumption, DLDA (also known as gene voting) has been found to work well in gene expression applications [4]. Here we used the default parameters from the supclust R package [24].
K nearest neighbor (KNN)
We used the knn R function in the class R package [22, 23], choosing the number k of nearest neighbors using 3-fold cross-validation (CV).
Support vector machines (SVM)
We used the default parameters from the e1071 R package to fit SVMs [2]. Additional details can be found in [25].
Shrunken centroids (SC)
The SC predictor is known to work well in the context of gene expression data [26]. Here we used the implementation in the pamr R package [26] which chose the optimal level of shrinkage using cross validation.
Penalized regression models
Various convex penalties can be applied to generalized linear models. We considered ridge regression [27] corresponding to an ℓ_{2} penalty, the lasso corresponding to an ℓ_{1} penalty [28], and the elastic net corresponding to a linear combination of ℓ_{1} and ℓ_{2} penalties [29]. We used the glmnet R function from the glmnet R package [30, 31] with alpha parameter values of 0, 1, and 0.5, respectively. glmnet also involves another parameter (lambda), which was chosen as the median of the lambda sequence output by glmnet. For the UCI benchmark data sets, pairwise interactions between features were considered.
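These three penalties are members of a single family indexed by glmnet's alpha (α) parameter: the elastic net penalty added to the GLM loss is

```latex
\lambda \left[ \frac{1-\alpha}{2} \lVert \beta \rVert_2^2 + \alpha \lVert \beta \rVert_1 \right]
```

so that α = 0 recovers ridge regression, α = 1 the lasso, and α = 0.5 an equal mixture of the two (the elastic net setting used above), while λ controls the overall penalty strength.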
20 diseaserelated gene expression data sets
Description of the 20 disease expression data sets
Data set  Samples  Features  Reference  Data set ID  Binary outcome 

adenocarcinoma  76  9868  [32]  NA  most prevalent class vs others 
brain  42  5597  [33]  NA  most prevalent class vs others 
breast2  77  4869  [34]  NA  most prevalent class vs others 
breast3  95  4869  [34]  NA  most prevalent class vs others 
colon  62  2000  [35]  NA  most prevalent class vs others 
leukemia  38  3051  [36]  NA  most prevalent class vs others 
lymphoma  62  4026  [37]  NA  most prevalent class vs others 
NCI60  61  5244  [38]  NA  most prevalent class vs others 
prostate  102  6033  [39]  NA  most prevalent class vs others 
srbct  63  2308  [40]  NA  most prevalent class vs others 
BrainTumor2  50  10367  [41]  NA  Anaplastic oligodendrogliomas vs Glioblastomas 
DLBCL  77  5469  [42]  NA  follicular lymphoma vs diffuse large Bcell lymphoma 
lung1  58  10000  [43]  GSE10245  Adenocarcinoma vs Squamous cell carcinoma 
lung2  46  10000  [44]  GSE18842  Adenocarcinoma vs Squamous cell carcinoma 
lung3  71  10000  [45]  GSE2109  Adenocarcinoma vs Squamous cell carcinoma 
psoriasis1  180  10000  GSE13355  lesional vs healthy skin  
psoriasis2  82  10000  [48]  GSE14905  lesional vs healthy skin 
MSstage  26  10000  [49]  EMTAB69  relapsing vs remitting RRMS 
MSdiagnosis1  27  10000  [50]  GSE21942  RRMS vs healthy control 
MSdiagnosis2  44  10000  [49]  EMTAB69  RRMS vs healthy control 
Empirical gene expression data sets
For all data sets below, we considered 100 randomly selected gene traits, i.e. 100 randomly selected probes. They were directly used as continuous outcomes or dichotomized according to the median value (top half = 1, bottom half = 0) to generate binary outcomes. For all data sets except “Brain cancer”, 2/3 of the observations (arrays) were randomly chosen as the training set, while the remaining samples were used as the test set. We focused on the 5000 genes (probes) with the highest mean expression levels in each data set.
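The median split used to create the binary outcomes is a one-liner (an illustrative sketch; here values tied with the median fall into the bottom half):

```python
import numpy as np

def dichotomize(y):
    """Binary outcome from a continuous gene trait:
    above the median (top half) = 1, otherwise (bottom half) = 0."""
    return (np.asarray(y) > np.median(y)).astype(int)
```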
Brain cancer data sets
These two related data sets contain 55 and 65 microarray samples of glioblastoma (brain cancer) patients, respectively. Gene expression profiles were measured using Affymetrix U133 microarrays. A detailed description can be found in [51]. The first data set (comprised of 55 samples) was used as the training set, while the second data set (comprised of 65 samples) was used as the test set.
SAFHS blood lymphocyte data set
This data set [52] was derived from blood lymphocytes of randomly ascertained participants enrolled in the San Antonio Family Heart Study. Gene expression profiles were measured with the Illumina Sentrix Human Whole Genome (WG6) Series I BeadChips. After removing potential outliers (based on low interarray correlations), 1084 samples remained in the data set.
WB whole blood gene expression data set
This is the whole blood gene expression data from healthy controls. Peripheral blood samples from healthy individuals were analyzed using Illumina Human HT12 microarrays. After preprocessing, 380 samples remained in the data set.
Mouse tissue gene expression data sets
The 4 tissue-specific gene expression data sets were generated by the lab of Jake Lusis at UCLA. These data sets measure gene expression levels (Agilent array platform) from adipose (239 samples), brain (221 samples), liver (272 samples) and muscle (252 samples) tissue of mice from the B×H F_{2} mouse intercross described in [53, 54]. In addition to gene traits, we also predicted 21 quantitative mouse clinical traits including mouse weight, length, abdominal fat, other fat, total fat, adiposity index (total fat ∗100/weight), plasma triglycerides, total plasma cholesterol, high-density lipoprotein fraction of cholesterol, plasma unesterified cholesterol, plasma free fatty acids, plasma glucose, plasma low-density lipoprotein and very low-density lipoprotein cholesterol, plasma MCP1 protein levels, plasma insulin, plasma glucose-insulin ratio, plasma leptin, plasma adiponectin, aortic lesion size (measured by histological examination using a semi-quantitative scoring method), aneurysms (semi-quantitative scoring method), and aortic calcification in the lesion area.
Machine learning benchmark data sets
Description of the UCI benchmark data
Data set  Samples  Features 

BreastCancer  699  9 
HouseVotes84  435  16 
Ionosphere  351  34 
diabetes  768  8 
Sonar  208  60 
ringnorm  300  20 
threenorm  300  20 
twonorm  300  20 
Glass  214  9 
Satellite  6435  36 
Vehicle  846  18 
Vowel  990  10 
Simulated gene expression data sets
We simulated an outcome variable y and gene expression data that contained 5 modules (clusters). Only 2 of the modules were comprised of genes that correlated with the outcome y. 45% of the genes were background genes, i.e. these genes were outside of any module. The simulation scheme is detailed in Additional file 1 and implemented in the R function simulateDatExpr5Modules from the WGCNA R package [55]. This R function was used to simulate pairs of training and test data sets. The simulation study was used to evaluate prediction methods for continuous outcomes and for binary outcomes. For binary outcome prediction, the continuous outcome y was thresholded according to its median value.
We considered 180 different simulation scenarios involving varying sizes of the training data (50, 100, 200, 500, 1000 or 2000 samples) and varying numbers of genes (60, 100, 500, 1000, 5000 or 10000 genes) that served as features. Test sets contained the same number of genes as the corresponding training set and 1000 samples. For each simulation scenario, we simulated 5 replicates resulting from different choices of the random seed.
Results
Motivating example: diseaserelated gene expression data sets
We compare the prediction accuracy of RGLM with that of other widely used methods on 20 gene expression data sets involving human disease related outcomes. Many of the 20 data sets (Table 2) are well known cancer data sets, which have been used in other comparative studies [4, 5, 56, 57]. A brief description of the data sets can be found in Methods.
Prediction accuracy in the 20 disease gene expression data sets
Data set  RGLM  RF  RFbigmtry  Rpart  LDA  DLDA  KNN  SVM  SC 

adenocarcinoma  0.842  0.842  0.842  0.737  0.842  0.744  0.842  0.842  0.803 
brain  0.881  0.810  0.833  0.762  0.810  0.929  0.881  0.786  0.929 
breast2  0.623  0.610  0.636  0.584  0.610  0.636  0.584  0.558  0.636 
breast3  0.705  0.695  0.716  0.611  0.695  0.705  0.669  0.674  0.700 
colon  0.855  0.823  0.823  0.726  0.855  0.839  0.774  0.774  0.871 
leukemia  0.921  0.895  0.921  0.816  0.868  0.974  0.974  0.763  0.974 
lymphoma  0.968  1.000  1.000  0.903  0.960  0.984  0.984  1.000  0.984 
NCI60  0.902  0.869  0.869  0.738  0.885  0.902  0.852  0.869  0.918 
prostate  0.931  0.892  0.902  0.853  0.873  0.627  0.804  0.853  0.912 
srbct  1.000  0.944  0.984  0.921  0.857  0.905  0.952  0.873  1.000 
BrainTumor2  0.760  0.750  0.740  0.620  0.760  0.700  0.700  0.660  0.720 
DLBCL  0.909  0.851  0.883  0.831  0.922  0.779  0.870  0.792  0.857 
lung1  0.931  0.931  0.931  0.828  0.914  0.931  0.931  0.897  0.914 
lung2  0.935  0.935  0.935  0.826  0.957  0.978  0.935  0.848  0.978 
lung3  0.901  0.901  0.887  0.803  0.873  0.859  0.831  0.859  0.887 
psoriasis1  0.989  0.994  0.989  0.978  0.994  0.989  0.989  0.983  0.989 
psoriasis2  0.963  0.988  0.976  0.963  0.976  0.963  0.963  0.963  0.963 
MSstage1  0.846  0.846  0.846  0.423  0.769  0.769  0.808  0.769  0.769 
MSdiagnosis1  0.963  0.926  0.926  0.556  0.889  0.889  0.963  0.926  0.926 
MSdiagnosis2  0.591  0.614  0.614  0.568  0.545  0.568  0.568  0.568  0.523 
Mean accuracy  0.871  0.856  0.863  0.752  0.843  0.833  0.844  0.813  0.863
Rank  1  4  2.5  9  6  7  5  8  2.5 
P-value  NA  0.029  0.079  0.00014  0.0075  0.05  0.014  0.00042  0.37
As seen from Table 4, RGLM achieves the highest mean accuracy in these disease data sets, followed by RFbigmtry and SC. Note that the standard random forest predictor (with the default parameter choice) performs worse than RGLM. The accuracy difference between RGLM and alternative methods is statistically significant (Wilcoxon signed rank test p<0.05) for all predictors except RFbigmtry, DLDA and SC. Since RFbigmtry is an ensemble predictor that relies on thousands of features, it would be difficult to interpret its predictions in terms of the underlying genes.
Our evaluations focused on the accuracy (and misclassification error). However, a host of other accuracy measures could be considered. Additional file 2 presents the results for sensitivity and specificity. The top 3 methods with highest sensitivity are: RF (median sensitivity =0.969), SVM (0.969) and RGLM (0.960). The top 3 methods with highest specificity are: SC (0.900), RGLM (0.857) and KNN (0.848).
A strength of this empirical comparison is that it involves clinically or biologically interesting data sets but a severe limitation is that it only involves 20 comparisons. Therefore, we now turn to more comprehensive empirical comparisons.
Binary outcome prediction
Empirical study involving dichotomized gene traits
The fact that RFbigmtry is more accurate in this situation than the default version of RF probably indicates that relatively few genes are informative for predicting a dichotomized gene trait. Also note that RGLM is much more accurate than the unbagged forward selected GLM which reflects that forward selection greatly overfits the training data. In conclusion, these comprehensive gene expression studies show that RGLM has outstanding prediction accuracy.
Machine learning benchmark data analysis
Prediction accuracy in the UCI machine learning benchmark data
Data set  RGLM  RGLM.inter2  RF  RFbigmtry  Rpart  LDA  DLDA  KNN  SVM  SC 

BreastCancer  0.964  0.959  0.969  0.961  0.941  0.957  0.959  0.966  0.967  0.956 
HouseVotes84  0.961  0.963  0.958  0.954  0.954  0.951  0.914  0.924  0.958  0.938 
Ionosphere  0.883  0.946  0.932  0.917  0.875  0.863  0.809  0.849  0.940  0.829 
diabetes  0.768  0.759  0.759  0.754  0.741  0.768  0.732  0.740  0.757  0.743 
Sonar  0.769  0.837  0.817  0.788  0.707  0.726  0.697  0.812  0.822  0.726 
ringnorm  0.577  0.973  0.940  0.910  0.770  0.567  0.570  0.590  0.977  0.535 
threenorm  0.803  0.827  0.807  0.777  0.653  0.817  0.825  0.815  0.853  0.817 
twonorm  0.937  0.953  0.947  0.920  0.733  0.957  0.960  0.947  0.953  0.960 
Glass  0.636  0.743  0.827  0.799  0.729  0.659  0.531  0.808  0.748  0.645 
Satellite  0.986  0.987  0.988  0.985  0.961  0.985  0.734  0.990  0.988  0.803 
Vehicle  0.965  0.986  0.986  0.973  0.944  0.967  0.729  0.909  0.974  0.752 
Vowel  0.936  0.986  0.983  0.976  0.950  0.938  0.853  0.999  0.991  0.909 
Mean accuracy  0.849  0.910  0.909  0.893  0.830  0.846  0.776  0.862  0.911  0.801
Rank  6  2  2  4  8  7  10  5  2  9 
P-value  0.0093  NA  0.26  0.042  0.00049  0.0093  0.0067  0.11  0.96  0.0015
Overall, we find that RGLM.inter2 ties with SVM (diff = −0.001, p = 0.96) and RF (diff = 0.001, p = 0.26) for first place in the benchmark data. As can be seen from Additional file 3, RGLM.inter2 achieves the highest sensitivity and specificity, which also supports its good performance on the benchmark data sets.
A potential limitation of these comparisons is that we considered pairwise interaction terms for the RGLM predictor but not for the other predictors. To address this issue, we also considered pairwise interactions among features for the other predictors. Additional file 4 shows that no method surpasses RGLM.inter2 when pairwise interaction terms are considered. In particular, interaction terms between features do not improve the performance of the random forest predictor. A noteworthy disadvantage of RGLM.inter in the case of many features is the computational burden that may result from adding interaction terms. In applications where interaction terms would be needed for RGLM, faster alternatives (e.g. RF) remain an attractive choice.
Simulation study involving binary outcomes
Continuous outcome prediction
In the following, we show that RGLM also performs exceptionally well when dealing with continuous quantitative outcomes. We compare RGLM not only to a standard forward selected linear model predictor (forwardGLM) but also to a random forest predictor (for a continuous outcome). We do not report the findings for the k-nearest neighbor predictor of a continuous outcome since it performed much worse than the above mentioned approaches in our gene expression applications (the accuracy of a KNN predictor was lower by about 30 percent). We again split the data into training and test sets. We use the correlation between test set predictions and truly observed test set outcomes as a measure of predictive accuracy. Note that this correlation coefficient can take on negative values (in the case of a poorly performing prediction method).
Empirical study involving continuous gene traits
Mouse tissue expression data involving continuous clinical outcomes
Simulation study involving continuous outcomes
Comparing RGLM with penalized regression models
As a caveat, we mention that cross-validation methods were not used to inform the parameter choices of the penalized regression models since the RGLM predictor was also not allowed to fine-tune its parameters. By only using default parameter choices we ensure a fair comparison. In a secondary analysis, however, we allowed the penalized regression models to use cross-validation for informing the choice of the parameters. While this slightly improved the performance of the penalized regression models (data not shown), it did not affect our main conclusion: RGLM outperforms penalized regression models in these comparisons.
Feature selection
Here we briefly describe how RGLM naturally gives rise to variable (feature) importance measures. We compare the variable importance measures of RGLM with alternative approaches and show how variable importance measures can be used for defining a thinned RGLM predictor with few features.
Variable importance measure
There is a vast literature on using ensemble predictors and bagging for selecting features. For example, Meinshausen and Bühlmann describe “stability selection” based on variable selection employed in regression models [62]: the method involves repeated subsampling, and variables that occur in a large fraction of the resulting selection sets are chosen. Li et al. use a random k-nearest neighbor predictor (RKNN) to carry out feature selection [57]. The Entropy-based Recursive Feature Elimination (ERFE) method of Furlanello et al. ranks features in high dimensional microarray data [63]. RGLM, like many ensemble predictors, gives rise to several measures of feature (variable) importance. For example, the number of times a feature is selected by the forward GLM across bags, timesSelectedByForwardRegression, is a natural measure of variable importance (similar to that used in stability selection [62]). Another variable importance measure is the number of times a feature is selected as a candidate covariate for forward regression, timesSelectedAsCandidates. Note that both timesSelectedByForwardRegression and timesSelectedAsCandidates are bounded above by nBags. Finally, one can use the sum of absolute GLM coefficient values, sumAbsCoefByForwardRegression, as a variable importance measure. We prefer timesSelectedByForwardRegression since it is more intuitive and points to the features that directly contribute to outcome prediction.
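The timesSelectedByForwardRegression measure can be sketched as follows. This is our own minimal re-implementation of the idea (the actual package is written in R); forward selection here uses a simple AIC criterion for a Gaussian linear model, and all names and parameter values are illustrative:

```python
import numpy as np

def aic_gaussian(y, y_hat, k):
    """AIC (up to a constant) for a Gaussian linear model with k coefficients."""
    n = len(y)
    rss = np.sum((y - y_hat) ** 2)
    return n * np.log(rss / n) + 2 * k

def forward_select(X, y, candidates, max_steps=5):
    """Greedy forward selection of features that minimizes AIC."""
    selected = []
    # AIC of the intercept-only model
    best_aic = aic_gaussian(y, np.full_like(y, y.mean()), 1)
    for _ in range(max_steps):
        best_j, best_trial = None, best_aic
        for j in candidates:
            if j in selected:
                continue
            cols = selected + [j]
            A = np.column_stack([np.ones(len(y)), X[:, cols]])
            beta, *_ = np.linalg.lstsq(A, y, rcond=None)
            trial = aic_gaussian(y, A @ beta, len(cols) + 1)
            if trial < best_trial:
                best_j, best_trial = j, trial
        if best_j is None:  # no remaining candidate improves AIC
            break
        selected.append(best_j)
        best_aic = best_trial
    return selected

rng = np.random.default_rng(1)
n, p, n_bags, n_candidates = 120, 30, 50, 10
X = rng.normal(size=(n, p))
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] + rng.normal(size=n)  # features 0, 1 informative

times_selected = np.zeros(p, dtype=int)
for _ in range(n_bags):
    boot = rng.integers(0, n, size=n)                        # bootstrap sample (bag)
    cand = rng.choice(p, size=n_candidates, replace=False)   # random subspace
    for j in forward_select(X[boot], y[boot], list(cand)):
        times_selected[j] += 1
# times_selected plays the role of timesSelectedByForwardRegression:
# it cannot exceed n_bags, and informative features dominate the ranking.
```

Note the two sources of randomness per bag (bootstrap resampling and the random subspace of candidate covariates), which together make the selection counts a meaningful importance measure.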
Leo Breiman already pointed out that random forests can be used for feature selection in genomic applications. Díaz-Uriarte et al. proposed a related gene selection method based on the RF which yields small sets of genes [5]. This RF based gene selection method avoids returning sets of highly correlated genes because such genes would be redundant for predictive purposes. Since the RGLM based importance measure timesSelectedByForwardRegression is expected to favor genes that are highly associated with the outcome, it comes as no surprise that only a few genes selected by the procedure of Díaz-Uriarte et al. turn out to have a top ranking in terms of the RGLM measure timesSelectedByForwardRegression (Additional file 5). It is beyond our scope to provide a thorough evaluation of the different variable selection approaches and we refer the reader to the literature, e.g. [5, 64]. While our studies show that the RGLM based variable importance measures have some relationship to other measures, they are sufficiently different to warrant a thorough evaluation in future comparison studies.
RGLM predictor thinning based on a variable importance measure
Both RGLM and the random forest have superior prediction accuracy but they differ with respect to how many features they use. Recall that the random forest is composed of individual trees, each constructed by repeated node splits. The number of features considered at each node split is determined by the RF parameter mtry, whose default value is the square root of the total number of features. In case of the 4999 gene features in our empirical studies, the default value is mtry = 71. For RFbigmtry, we choose all possible features, i.e. mtry = 4999. We find that a random forest predictor typically uses more than 40% of the features (i.e. more than 2000 genes) in the empirical studies. In contrast, RGLM typically involves only a few hundred genes in these studies. There are several reasons why RGLM uses far fewer features in its construction. First and foremost, it uses forward selection (coupled with the AIC criterion) to select features in each bag. Second, the number of candidate covariates considered for forward regression is chosen to be low, i.e. nCandidateCovariates = 50.
In RGLM, the number of times a feature is selected by the forward regression models across all bags, timesSelectedByForwardRegression, follows a highly skewed distribution: only a few features are repeatedly selected into the model while most features are selected only once (if at all). It stands to reason that an even sparser, highly accurate predictor can be defined by refitting the GLM on each bag without these rarely selected features. We refer to this feature removal process as RGLM predictor thinning: features whose value of timesSelectedByForwardRegression lies below a prespecified thinning threshold are removed from the model fit a posteriori.
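The thinning step can be sketched as follows. This is an illustrative toy version, not the randomGLM implementation: the per-bag forward-selection results are stubbed out, selection counts are tallied, features below the thinning threshold are dropped, and each bag's linear model is refit on the surviving features before averaging:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, n_bags = 100, 20, 30
X = rng.normal(size=(n, p))
y = 2.5 * X[:, 0] + rng.normal(size=n)  # feature 0 is informative

# Stand-in for the per-bag forward-selection results: here each bag
# "selects" the informative feature plus one random noise feature.
bag_features = [[0, int(rng.integers(1, p))] for _ in range(n_bags)]

# Selection counts across bags (analogue of timesSelectedByForwardRegression)
counts = np.bincount([j for feats in bag_features for j in feats], minlength=p)

def thin_and_refit(threshold):
    """Drop rarely selected features, refit each bag's model, average predictions."""
    preds = np.zeros(n)
    for feats in bag_features:
        kept = [j for j in feats if counts[j] >= threshold]
        A = np.column_stack([np.ones(n)] + [X[:, j] for j in kept])
        beta, *_ = np.linalg.lstsq(A, y, rcond=None)
        preds += A @ beta
    return preds / n_bags

full = thin_and_refit(0)     # no thinning
thinned = thin_and_refit(5)  # only features selected in >= 5 bags survive
# Accuracy (correlation with y) barely changes, but far fewer features remain.
acc_full = np.corrcoef(full, y)[0, 1]
acc_thinned = np.corrcoef(thinned, y)[0, 1]
```

Because the rarely selected features are mostly noise, the thinned ensemble retains essentially the same accuracy with a much smaller feature set, mirroring the behavior described for the empirical data.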
Interestingly, the accuracy diminishes very slowly for initial, low threshold values, yet even low threshold values lead to a markedly sparser ensemble predictor (Figure 9 (B)). In other words, the average fraction of features (genes) remaining in the thinned RGLM declines drastically as the thinning threshold increases.
where $x = \text{thinning threshold} / \text{nBags}$ and e denotes Euler’s number, e ≈ 2.718. Equation 1 was found by log transforming the data and using optimization approaches for estimating the parameters; no mathematical derivation was used. One can easily show that F(x) (Equation 1) is a monotonically decreasing function, and it accurately describes the proportion of remaining features, as can be seen from Figure 9 (B). Since the proportion of remaining variables depends not only on the thinning threshold but also on the number of bags nBags, we also study how these results depend on the choice of nBags. Toward this end, we varied nBags from 20 to 500 for predicting the 100 dichotomized gene traits in the mouse adipose data set. Additional file 6 shows that the predicted values (red curve) based on Equation 1 overlap almost perfectly with the observed values (black curve) for all considered choices of nBags, which indicates that Equation 1 accurately estimates the proportion of remaining features for a range of different values of nBags.
Our results demonstrate that the number of required features decreases rapidly even for low values of the thinning threshold, without compromising the prediction accuracy of the thinned predictor. Figure 9 (C) shows that a thinning threshold of 20 leads to a thinned predictor whose accuracy is negligibly lower (difference in median accuracy = 0.009) than that of the original RGLM predictor while involving less than 20% of the original number of variables. Recall that even the original number of variables is markedly lower than that of the RF predictor. These results demonstrate that the thinned RGLM combines the advantages of an ensemble predictor (high accuracy) with those of a forward selected GLM (few features, interpretability).
RGLM thinning versus RF thinning
Discussion
Why was the RGLM not discovered earlier?
Additional reasons why the merits of RGLM have not been recognized earlier may be the following. First, it may be a historical accident: bagging was quickly overshadowed by other, seemingly more accurate ways of constructing ensemble predictors, such as boosting [73] and the RF [13], both of which have markedly better performance on the UCI benchmark data. We find that RGLM.inter2 ties with SVM and RF for the top spot on the UCI benchmark data (Table 5). Incidentally, RGLM performs significantly better than SVM and RF on the disease data sets (Table 4) and in the 700 gene expression comparisons (Figure 2).
Second, previous comparisons of bagged predictors in the context of genomic data were based on limited empirical evaluations: many involved fewer than 20 microarray data sets [4, 5]. While these comparisons involved clinically important data sets from cancer applications, the studies were simply not comprehensive enough.
Third, previous studies probably did not consider enough bootstrap samples (bags). While previous studies used 10 to 50 bags, we always used 100 bags when constructing the RGLM. To illustrate how prediction accuracy depends on the number of bags, we evaluated the brain cancer data with 1 to 500 bags using 5 gene traits randomly selected from those used in our binary and continuous outcome predictions, respectively. The results are shown in Additional file 7. Most of the improvement is gained in the first several dozen bags; 100 bags is generally enough, although fluctuations remain. More bags may lead to slightly better predictions but at the expense of longer computation time.
Strengths and limitations
RGLM shares many advantages of bagged predictors, including a nearly unbiased estimate of the prediction accuracy (the out-of-bag estimate) and several variable importance measures. While our empirical studies focus on binary and continuous outcomes, it is straightforward to define RGLM for count outcomes (resulting in a random Poisson regression model) and for multiclass outcomes (resulting in a random multinomial regression model).
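The out-of-bag (OOB) idea can be sketched as follows. In this toy version (ours, not the package's code), each observation is predicted only by the bags whose bootstrap sample did not contain it, so the resulting accuracy estimate is nearly unbiased:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, n_bags = 150, 8, 100
X = rng.normal(size=(n, p))
y = X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=n)

A = np.column_stack([np.ones(n), X])  # design matrix with intercept
oob_sum = np.zeros(n)
oob_cnt = np.zeros(n)
for _ in range(n_bags):
    boot = rng.integers(0, n, size=n)              # bootstrap sample (bag)
    oob = np.setdiff1d(np.arange(n), boot)         # observations left out of this bag
    beta, *_ = np.linalg.lstsq(A[boot], y[boot], rcond=None)
    oob_sum[oob] += A[oob] @ beta
    oob_cnt[oob] += 1

# Aggregate predictions over the bags that did NOT train on each observation
have = oob_cnt > 0
oob_pred = oob_sum[have] / oob_cnt[have]
oob_accuracy = np.corrcoef(oob_pred, y[have])[0, 1]
```

On average each observation is out-of-bag in about 37% of the bags, so with 100 bags every observation receives an OOB prediction and no separate test set is needed.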
A noteworthy limitation of RGLM is its computational complexity, since the forward selection process (e.g. by the function stepAIC [22] from the MASS R package) is particularly time-consuming. The total time depends on the number of candidate features, the order of interaction terms, and the number of bags. Our R implementation allows the user to apply parallel processing to speed up the calculations.
Our empirical studies demonstrate that RGLM compares favorably with the random forest, support vector machines, penalized regression models, and many other widely used prediction methods. As a caveat, we mention that we used default parameter settings for each of these methods in order to ensure a fair comparison. Future studies could evaluate how these prediction methods compare when resampling schemes (e.g. cross validation) are used to inform parameter choices. Our randomGLM R package will allow the reader to carefully evaluate the method.
Conclusions
Since individual forward selected GLMs are highly interpretable, the resulting ensemble predictor is more interpretable than an RF predictor. Our empirical studies (20 disease related gene expression data sets, 700 gene expression trait data, the UCI benchmark data) clearly highlight the outstanding prediction accuracy afforded by the RGLM. High accuracies are achieved not only in genomic data sets (many features, small sample size) but also in the UCI benchmark data (few features, large sample size).
Abbreviations
 RGLM: Random generalized linear model
 RGLM.inter2: RGLM considering pairwise interactions between features
 RGLM.inter3: RGLM considering two-way and three-way interactions between features
 forwardGLM: Forward selected generalized linear model
 RF: Random forest with default mtry
 RFbigmtry: Random forest with mtry equal to the total number of features
 GLM: Generalized linear model
 Rpart: Recursive partitioning
 LDA: Linear discriminant analysis
 DLDA: Diagonal linear discriminant analysis
 KNN: K-nearest neighbor
 SVM: Support vector machine
 SC: Shrunken centroids
 RSM: Random subspace method
 RMNL: Random multinomial logit model
 RKNN: Random k-nearest neighbor
 ERFE: Entropy-based recursive feature elimination
 AIC: Akaike information criterion
 aMV: Adjusted majority vote
Declarations
Acknowledgements
We acknowledge grant support from 1R01DA03091301, P50CA092131, P30CA16042, UL1TR000124. We acknowledge the efforts of IGC and expO in providing data set GSE2109.
Authors’ Affiliations
References
 Pinsky P, Zhu C: Building multimarker algorithms for disease prediction: the role of correlations among markers. Biomarker Insights. 2011, 6: 83-93.
 Vapnik V: The Nature of Statistical Learning Theory. 2000, New York: Springer
 Breiman L, Friedman J, Stone C, Olshen R: Classification and Regression Trees. 1984, California: Wadsworth International Group
 Dudoit S, Fridlyand J, Speed TP: Comparison of discrimination methods for the classification of tumors using gene expression data. J Am Stat Assoc. 2002, 97 (457): 77-87. 10.1198/016214502753479248.
 Díaz-Uriarte R, Álvarez de Andrés S: Gene selection and classification of microarray data using random forest. BMC Bioinformatics. 2006, 7: 3. 10.1186/1471-2105-7-3. [http://www.biomedcentral.com/1471-2105/7/3]
 Pirooznia M, Yang J, Yang MQ, Deng Y: A comparative study of different machine learning methods on microarray gene expression data. BMC Genomics. 2008, 9 (Suppl 1): S13. 10.1186/1471-2164-9-S1-S13. [http://www.biomedcentral.com/1471-2164/9/S1/S13]
 Caruana R, Niculescu-Mizil A: An empirical comparison of supervised learning algorithms. Proceedings of the 23rd International Conference on Machine Learning, ICML '06. 2006, New York, NY, USA: ACM, 161-168. [http://doi.acm.org/10.1145/1143844.1143865]
 Statnikov A, Wang L, Aliferis C: A comprehensive comparison of random forests and support vector machines for microarray-based cancer classification. BMC Bioinformatics. 2008, 9 (1): 319. 10.1186/1471-2105-9-319. [http://www.biomedcentral.com/1471-2105/9/319]
 Caruana R, Karampatziakis N, Yessenalina A: An empirical evaluation of supervised learning in high dimensions. Proceedings of the 25th International Conference on Machine Learning, ICML '08. 2008, New York, NY, USA: ACM, 96-103. [http://doi.acm.org/10.1145/1390156.1390169]
 Breiman L: Bagging predictors. Machine Learning. 1996, 24: 123-140.
 Derksen S, Keselman HJ: Backward, forward and stepwise automated subset selection algorithms: frequency of obtaining authentic and noise variables. British J Mathematical Stat Psychology. 1992, 45 (2): 265-282. 10.1111/j.2044-8317.1992.tb00992.x.
 Harrell FJ, Lee K, Mark D: Multivariable prognostic models: issues in developing models, evaluating assumptions and adequacy, and measuring and reducing errors. Stat Med. 1996, 15: 361-387. 10.1002/(SICI)1097-0258(19960229)15:4<361::AID-SIM168>3.0.CO;2-4.
 Breiman L: Random forests. Machine Learning. 2001, 45: 5-32. 10.1023/A:1010933404324.
 Svetnik V, Liaw A, Tong C, Wang T: Application of Breiman's random forest to modeling structure-activity relationships of pharmaceutical molecules. Multiple Classifier Systems, Fifth International Workshop, MCS 2004, Proceedings, Cagliari, Italy. Lecture Notes in Computer Science. Edited by: Roli F, Kittler J, Windeatt T. 2004, Springer Berlin / Heidelberg, 334-343.
 Shi T, Horvath S: Unsupervised learning with random forest predictors. J Comput Graphical Stat. 2006, 15: 118-138. 10.1198/106186006X94072.
 McCullagh P, Nelder J: Generalized Linear Models. 2nd edition. 1989, London: Chapman and Hall/CRC
 Ho TK: The random subspace method for constructing decision forests. IEEE Trans Pattern Anal Machine Intelligence. 1998, 20 (8): 832-844. 10.1109/34.709601.
 Prinzie A, Van den Poel D: Random forests for multiclass classification: random multinomial logit. Expert Syst Appl. 2008, 34 (3): 1721-1732. 10.1016/j.eswa.2007.01.029.
 Ahn H, Moon H, Fazzari MJ, Lim N, Chen JJ, Kodell RL: Classification by ensembles from random partitions of high-dimensional data. Comput Stat Data Anal. 2007, 51 (12): 6166-6179. 10.1016/j.csda.2006.12.043.
 Moon H, Ahn H, Kodell RL, Baek S, Lin CJ, Chen JJ: Ensemble methods for classification of patients for personalized medicine with high-dimensional data. Artif Intelligence Med. 2007, 41 (3): 197-207. 10.1016/j.artmed.2007.07.003.
 Panov P, Džeroski S: Combining bagging and random subspaces to create better ensembles. Proceedings of the 7th International Conference on Intelligent Data Analysis, IDA '07. 2007, Berlin, Heidelberg: Springer-Verlag, 118-129.
 Venables W, Ripley B: Modern Applied Statistics with S. 4th edition. 2002, New York: Springer
 Ripley B: Pattern Recognition and Neural Networks. 1996, UK: Cambridge University Press
 Dettling M, Bühlmann P: Supervised clustering of genes. Genome Biol. 2002, 3 (12): research0069.1-research0069.15. 10.1186/gb-2002-3-12-research0069.
 Chang C, Lin C: LIBSVM: a library for support vector machines. [http://www.csie.ntu.edu.tw/~cjlin/libsvm]
 Tibshirani R, Hastie T, Narasimhan B, Chu G: Diagnosis of multiple cancer types by shrunken centroids of gene expression. Proc Natl Acad Sci USA. 2002, 99 (10): 6567-6572. 10.1073/pnas.082099299.
 Draper N, Smith H, Pownell E: Applied Regression Analysis. Volume 3. 1966, New York: Wiley
 Tibshirani R: Regression shrinkage and selection via the lasso. J R Stat Soc Ser B (Methodological). 1996, 58: 267-288.
 Zou H, Hastie T: Regularization and variable selection via the elastic net. J R Stat Soc Ser B (Statistical Methodology). 2005, 67 (2): 301-320. 10.1111/j.1467-9868.2005.00503.x.
 Friedman J, Hastie T, Tibshirani R: Regularization paths for generalized linear models via coordinate descent. J Stat Software. 2010, 33: 1.
 Simon N, Friedman JH, Hastie T, Tibshirani R: Regularization paths for Cox's proportional hazards model via coordinate descent. J Stat Software. 2011, 39 (5): 1-13. [http://www.jstatsoft.org/v39/i05]
 Ramaswamy S, Ross KN, Lander ES, Golub TR: A molecular signature of metastasis in primary solid tumors. Nat Genet. 2003, 33: 49-54. 10.1038/ng1060.
 Pomeroy SL, Tamayo P, Gaasenbeek M, Sturla LM, Angelo M, McLaughlin ME, Kim JYH, Goumnerova LC, Black PM, Lau C, Allen JC, Zagzag D, Olson JM, Curran T, Wetmore C, Biegel JA, Poggio T, Mukherjee S, Rifkin R, Califano A, Stolovitzky G, Louis DN, Mesirov JP, Lander ES, Golub TR: Prediction of central nervous system embryonal tumour outcome based on gene expression. Nature. 2002, 415 (6870): 436-442. 10.1038/415436a.
 van't Veer L, Dai H, van de Vijver M, He Y, Hart A, Mao M, Peterse H, van der Kooy K, Marton M, Witteveen A, Schreiber G, Kerkhoven R, Roberts C, Linsley P, Bernards R, Friend S: Gene expression profiling predicts clinical outcome of breast cancer. Nature. 2002, 415: 530-536. 10.1038/415530a.
 Alon U, Barkai N, Notterman DA, Gish K, Ybarra S, Mack D, Levine AJ: Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc Natl Acad Sci USA. 1999, 96: 6745-50. 10.1073/pnas.96.12.6745.
 Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA, Bloomfield CD, Lander ES: Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science. 1999, 286 (5439): 531-7. 10.1126/science.286.5439.531.
 Alizadeh AA, Eisen MB, Davis RE, Ma C, Lossos IS, Rosenwald A, Boldrick JC, Sabet H, Tran T, Yu X, Powell JI, Yang L, Marti GE, Moore T, Hudson J, Lu L, Lewis DB, Tibshirani R, Sherlock G, Chan WC, Greiner TC, Weisenburger DD, Armitage JO, Warnke R, Levy R, Wilson W, Grever MR, Byrd JC, Botstein D, Brown PO, Staudt LM: Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature. 2000, 503-511.
 Ross DT, Scherf U, Eisen MB, Perou CM, Rees C, Spellman P, Iyer V, Jeffrey SS, Van de Rijn M, Waltham M, Pergamenschikov A, Lee JC, Lashkari D, Shalon D, Myers TG, Weinstein JN, Botstein D, Brown PO: Systematic variation in gene expression patterns in human cancer cell lines. Nat Genet. 2000, 24 (3): 227-235. 10.1038/73432.
 Singh D, Febbo PG, Ross K, Jackson DG, Manola J, Ladd C, Tamayo P, Renshaw AA, D'Amico AV, Richie JP, Lander ES, Loda M, Kantoff PW, Golub TR, Sellers WR: Gene expression correlates of clinical prostate cancer behavior. Cancer Cell. 2002, 1 (2): 203-209. 10.1016/S1535-6108(02)00030-2.
 Khan J, Wei JS, Ringnér M, Saal LH, Ladanyi M, Westermann F, Berthold F, Schwab M, Antonescu CR, Peterson C, Meltzer PS: Classification and diagnostic prediction of cancers using gene expression profiling and artificial neural networks. Nat Med. 2001, 7 (6): 673-679. 10.1038/89044.
 Nutt CL, Mani DR, Betensky RA, Tamayo P, Cairncross JG, Ladd C, Pohl U, Hartmann C, McLaughlin ME, Batchelor TT, Black PM, von Deimling A, Pomeroy SL, Golub TR, Louis DN: Gene expression-based classification of malignant gliomas correlates better with survival than histological classification. Cancer Res. 2003, 63 (7): 1602-1607.
 Shipp MA, Ross KN, Tamayo P, Weng AP, Kutok JL, Aguiar RC, Gaasenbeek M, Angelo M, Reich M, Pinkus GS, Ray TS, Koval MA, Last KW, Norton A, Lister TA, Mesirov J, Neuberg DS, Lander ES, Aster JC, Golub TR: Diffuse large B-cell lymphoma outcome prediction by gene-expression profiling and supervised machine learning. Nat Med. 2002, 8: 68-74. 10.1038/nm0102-68.
 Kuner R, Muley T, Meister M, Ruschhaupt M, Buness A, Xu EC, Schnabel P, Warth A, Poustka A, Sültmann H, Hoffmann H: Global gene expression analysis reveals specific patterns of cell junctions in non-small cell lung cancer subtypes. Lung Cancer. 2009, 63: 32-38. 10.1016/j.lungcan.2008.03.033.
 Sanchez-Palencia A, Gomez-Morales M, Gomez-Capilla JA, Pedraza V, Boyero L, Rosell R, Fárez-Vidal M: Gene expression profiling reveals novel biomarkers in non-small cell lung cancer. Int J Cancer. 2011, 129 (2): 355-364. 10.1002/ijc.25704.
 Clinically annotated tumor database: [https://expo.intgen.org/geo/]
 Swindell WR, Johnston A, Carbajal S, Han G, Wohn C, Lu J, Xing X, Nair RP, Voorhees JJ, Elder JT, Wang XJ, Sano S, Prens EP, DiGiovanni J, Pittelkow MR, Ward NL, Gudjonsson JE: Genome-wide expression profiling of five mouse models identifies similarities and differences with human psoriasis. PLoS ONE. 2011, 6 (4): e18266. 10.1371/journal.pone.0018266.
 Nair RP, Duffin KCC, Helms C, Ding J, Stuart PE, Goldgar D, Gudjonsson JE, Li Y, Tejasvi T, Feng BJJ, Ruether A, Schreiber S, Weichenthal M, Gladman D, Rahman P, Schrodi SJ, Prahalad S, Guthery SL, Fischer J, Liao W, Kwok PYY, Menter A, Lathrop GM, Wise CA, Begovich AB, Voorhees JJ, Elder JT, Krueger GG, Bowcock AM, Abecasis GR: Collaborative Association Study of Psoriasis: Genome-wide scan reveals association of psoriasis with IL-23 and NF-kappaB pathways. Nat Genet. 2009, 41 (2): 199-204. 10.1038/ng.311.
 Yao Y, Richman L, Morehouse C, de los Reyes M, Higgs BW, Boutrin A, White B, Coyle A, Krueger J, Kiener PA, Jallal B: Type I interferon: potential therapeutic target for psoriasis?. PLoS ONE. 2008, 3 (7): e2737. 10.1371/journal.pone.0002737.
 Brynedal B, Khademi M, Wallström E, Hillert J, Olsson T, Duvefelt K: Gene expression profiling in multiple sclerosis: a disease of the central nervous system, but with relapses triggered in the periphery?. Neurobiology of Disease. 2010, 37 (3): 613-621. 10.1016/j.nbd.2009.11.014.
 Kemppinen AK, Kaprio J, Palotie A, Saarela J: Systematic review of genome-wide expression studies in multiple sclerosis. BMJ Open. 2011, 1. [http://bmjopen.bmj.com/content/1/1/e000053.abstract]
 Horvath S, Zhang B, Carlson M, Lu K, Zhu S, Felciano R, Laurance M, Zhao W, Shu Q, Lee Y, Scheck A, Liau L, Wu H, Geschwind D, Febbo P, Kornblum H, Cloughesy TF, Nelson S, Mischel P: Analysis of oncogenic signaling networks in glioblastoma identifies ASPM as a novel molecular target. Proc Natl Acad Sci USA. 2006, 103 (46): 17402-17407. 10.1073/pnas.0608396103.
 Göring HHH, Curran JE, Johnson MP, Dyer TD, Charlesworth J, Cole SA, Jowett JBM, Abraham LJ, Rainwater DL, Comuzzie AG, Mahaney MC, Almasy L, MacCluer JW, Kissebah AH, Collier GR, Moses EK, Blangero J: Discovery of expression QTLs using large-scale transcriptional profiling in human lymphocytes. Nat Genet. 2007, 39: 1208-1216. 10.1038/ng2119.
 Ghazalpour A, Doss S, Zhang B, Plaisier C, Wang S, Schadt E, Thomas A, Drake T, Lusis A, Horvath S: Integrating genetics and network analysis to characterize genes related to mouse weight. PLoS Genetics. 2006, 2 (2): e8. 10.1371/journal.pgen.0020008.
 Fuller T, Ghazalpour A, Aten J, Drake T, Lusis A, Horvath S: Weighted gene coexpression network analysis strategies applied to mouse weight. Mamm Genome. 2007, 18 (6-7): 463-472. 10.1007/s00335-007-9043-3.
 Langfelder P, Horvath S: WGCNA: an R package for weighted correlation network analysis. BMC Bioinformatics. 2008, 9: 559. 10.1186/1471-2105-9-559.
 Statnikov A, Aliferis CF, Tsamardinos I, Hardin D, Levy S: A comprehensive evaluation of multicategory classification methods for microarray gene expression cancer diagnosis. Bioinformatics. 2005, 21 (5): 631-643. 10.1093/bioinformatics/bti033.
 Li S, Harner EJ, Adjeroh D: Random KNN feature selection - a fast and stable alternative to random forests. BMC Bioinformatics. 2011, 12: 450. 10.1186/1471-2105-12-450.
 Chang CC, Lin CJ: Training ν-support vector classifiers: theory and algorithms. Neural Comput. 2001, 13 (9): 2119-2147. 10.1162/089976601750399335.
 Yang F, Wang HZ, Mi H, Lin CD, Cai WW: Using random forest for reliable classification and cost-sensitive learning for medical diagnosis. BMC Bioinformatics. 2009, 10 (Suppl 1): S22. 10.1186/1471-2105-10-S1-S22.
 Lopes F, Martins D, Cesar R: Feature selection environment for genomic applications. BMC Bioinformatics. 2008, 9 (1): 451. 10.1186/1471-2105-9-451.
 Frank A, Asuncion A: UCI Machine Learning Repository. 2010, [http://archive.ics.uci.edu/ml]
 Meinshausen N, Bühlmann P: Stability selection. J R Stat Soc Ser B (Statistical Methodology). 2010, 72 (4): 417-473. 10.1111/j.1467-9868.2010.00740.x.
 Furlanello C, Serafini M, Merler S, Jurman G: An accelerated procedure for recursive feature ranking on microarray data. Neural Networks. 2003, 16: 641-648. 10.1016/S0893-6080(03)00103-5.
 Saeys Y, Inza I, Larranaga P: A review of feature selection techniques in bioinformatics. Bioinformatics. 2007, 23 (19): 2507-2517. 10.1093/bioinformatics/btm344.
 Perlich C, Provost F, Simonoff JS: Tree induction vs. logistic regression: a learning-curve analysis. J Machine Learning Res. 2003, 4: 211-255.
 Arena V, Sussman N, Mazumdar S, Yu S, Macina O: The utility of structure-activity relationship (SAR) models for prediction and covariate selection in developmental toxicity: comparative analysis of logistic regression and decision tree models. SAR and QSAR in Environ Res. 2004, 15: 1-18. 10.1080/1062936032000169633.
 Pino-Mejias R, Carrasco-Mairena M, Pascual-Acosta A, Cubiles-De-La-Vega MD, Munoz-Garcia J: A comparison of classification models to identify the Fragile X Syndrome. J Appl Stat. 2008, 35 (3): 233-244. 10.1080/02664760701832976.
 van Wezel M, Potharst R: Improved customer choice predictions using ensemble methods. Eur J Operational Res. 2007, 181: 436-452. 10.1016/j.ejor.2006.05.029.
 Wang G, Hao J, Ma J, Jiang H: A comparative assessment of ensemble learning for credit scoring. Expert Syst Appl. 2011, 38: 223-230. 10.1016/j.eswa.2010.06.048.
 Shadabi F, Sharma D: Comparison of artificial neural networks with logistic regression in prediction of kidney transplant outcomes. Proceedings of the 2009 International Conference on Future Computer and Communication, ICFCC '09. 2009, Washington, DC, USA: IEEE Computer Society, 543-547.
 Sohn S, Shin H: Experimental study for the comparison of classifier combination methods. Pattern Recognit. 2007, 40: 33-40. 10.1016/j.patcog.2006.06.027.
 Bühlmann P, Yu B: Analyzing bagging. Ann Stat. 2002, 30: 927-961.
 Freund Y, Schapire RE: A decision-theoretic generalization of on-line learning and an application to boosting. Proceedings of the Second European Conference on Computational Learning Theory, EuroCOLT '95. 1995, London, UK: Springer-Verlag, 23-37.
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.