Random forest versus logistic regression: a large-scale benchmark experiment

Background and goal
The Random Forest (RF) algorithm for regression and classification has considerably gained popularity since its introduction in 2001. Meanwhile, it has grown into a standard classification approach competing with logistic regression (LR) in many innovation-friendly scientific fields.

Results
In this context, we present a large-scale benchmarking experiment based on 243 real datasets comparing the prediction performance of the original version of RF with default parameters and LR as binary classification tools. Most importantly, the design of our benchmark experiment is inspired by clinical trial methodology, thus avoiding common pitfalls and major sources of bias.

Conclusion
RF performed better than LR according to the considered accuracy measure in approximately 69% of the datasets. The mean difference between RF and LR was 0.029 (95% CI = [0.022, 0.038]) for the accuracy, 0.041 (95% CI = [0.031, 0.053]) for the Area Under the Curve, and −0.027 (95% CI = [−0.034, −0.021]) for the Brier score, all measures thus suggesting a significantly better performance of RF. As a side result of our benchmarking experiment, we observed that the results were noticeably dependent on the inclusion criteria used to select the example datasets, emphasizing the importance of clear statements regarding this dataset selection process. We also stress that neutral studies similar to ours, based on a high number of datasets and carefully designed, will be necessary in the future to evaluate further variants, implementations or parameters of random forests which may yield improved accuracy compared to the original version with default values.

Electronic supplementary material
The online version of this article (10.1186/s12859-018-2264-5) contains supplementary material, which is available to authorized users.


Additional file 3: Results on partial dependence
We further investigate the behavior of logistic regression (LR) and random forest (RF) on a few interesting example datasets from OpenML by considering partial dependence plots, as we did in subsection 2.3 for simulated datasets. More precisely, the aim of these additional analyses is to assess whether differences in performance (between LR and RF) are related to differences in partial dependence plots. After getting a global picture for all datasets included in our study, we inspect three interesting "extreme cases" more closely. For this purpose we need a measure quantifying the difference between the partial dependence plots of two methods (here, LR and RF). Since we did not find such a measure in the literature, we suggest a simple approach in the next section.

Measuring differences in partial dependences
For feature X_j (j ∈ {1, …, p}), let u_{i,j}, i ∈ {1, …, 10}, denote the uniform grid on which the partial dependence is computed, with u_{1,j} = min(X_j) and u_{10,j} = max(X_j). Let PD^{RF}_{i,j} and PD^{LR}_{i,j} denote the corresponding values of the partial dependence at point u_{i,j} for RF and LR, respectively. Our ad-hoc measure is based on the absolute difference |PD^{RF}_{i,j} − PD^{LR}_{i,j}| between these two quantities. To give more importance to ranges of X_j with many observations, these differences are weighted by the proportion W_{i,j} of observations of feature X_j that are closer to point u_{i,j} than to any other grid point (note that Σ_{i=1}^{10} W_{i,j} = 1).
Finally, to obtain a measure of the difference of partial dependence plots over the p features, each feature is weighted by its relative importance R_j in order to give more weight to informative features. The relative importance R_j is defined as the variable importance of feature X_j (or 0 if this variable importance is negative) divided by the sum of the variable importances of all features.
Our simple measure of the differences between the partial dependences of RF and LR for a dataset of interest is thus defined as

∆PartialDependence = Σ_{j=1}^{p} R_j Σ_{i=1}^{10} W_{i,j} |PD^{RF}_{i,j} − PD^{LR}_{i,j}|.

Difference in accuracies vs. difference in partial dependences for the 243 datasets
When displaying the scatterplot of ∆acc vs. ∆PartialDependence for the 243 datasets included in our study, no clear trend can be identified. We subsequently select three "extreme" cases from OpenML and inspect them more closely.
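The measure can be sketched in a few lines of NumPy. The function below is an illustrative implementation under the definitions above, not the authors' code: the curves `pd_rf` and `pd_lr` are assumed to be precomputed arrays of shape (p, 10), and `importances` a vector of variable importances (e.g. from an RF).

```python
import numpy as np

def pd_difference(X, pd_rf, pd_lr, importances):
    """Ad-hoc measure: sum_j R_j * sum_i W_{i,j} * |PD^RF_{i,j} - PD^LR_{i,j}|.

    Assumes at least one importance is positive (otherwise R_j is undefined).
    """
    n, p = X.shape
    # Relative importance R_j: negative importances set to 0, then normalised.
    r = np.clip(importances, 0.0, None)
    r = r / r.sum()
    total = 0.0
    for j in range(p):
        # Uniform 10-point grid u_{1,j} = min(X_j), ..., u_{10,j} = max(X_j).
        grid = np.linspace(X[:, j].min(), X[:, j].max(), 10)
        # W_{i,j}: proportion of observations closest to grid point u_{i,j}.
        nearest = np.abs(X[:, j][:, None] - grid[None, :]).argmin(axis=1)
        w = np.bincount(nearest, minlength=10) / n
        total += r[j] * np.sum(w * np.abs(pd_rf[j] - pd_lr[j]))
    return total

# Toy demo with placeholder curves: identical curves give a difference of 0.
X_demo = np.random.default_rng(0).normal(size=(100, 3))
curves = np.zeros((3, 10))
print(pd_difference(X_demo, curves, curves, np.array([0.5, 0.3, 0.2])))  # 0.0
```

Since the W_{i,j} sum to 1 within each feature and the R_j sum to 1 across features, the result is a weighted average of pointwise curve differences; when the partial dependences are probabilities, it lies in [0, 1].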
As a first extreme case (Case 1), we select a dataset with low |∆acc| and high ∆PartialDependence. The second extreme case (Case 2) shows both low |∆acc| and low ∆PartialDependence. The third extreme case (Case 3) shows a very high ∆acc and a high ∆PartialDependence. These three datasets are investigated in detail below.

In Case 1, p is large and no feature has a relative importance exceeding 1.8%. It seems that the dataset does not contain enough useful information for classification, hence the relatively poor accuracies of both RF and LR. It can be seen from Figure 1 (top-right panel) that the two main features are highly correlated and insufficient to separate the two classes (depicted as blue and red points, respectively). LR does not converge and yields incoherent partial dependence patterns. RF seems to be more robust to this lack of information and to better extract information from the two best features, which is however insufficient to improve accuracy, hence the similar accuracies of RF and LR.

In Case 2, the two models are very close. This is due to the linearity of the problem, as can be seen from Figure 2 (top-right panel). In this easy scenario, both algorithms perform equally well, close to perfect classification. It can be seen from Figure 2 (top-left and bottom-right panels) that the RF and LR partial dependences are nearly indistinguishable for the two main features 'northing' and 'isns'.

In Case 3, p = 2, so that we can visualize the whole dataset as a 2D representation in Figure 3 (top-right panel). ∆acc is large, i.e. RF performs substantially better than LR. We can clearly see a dependency in Figure 3 that explains the better performance of RF. This dependency can also be seen in the difference between the partial dependences of RF and LR, especially for feature V2.
This extreme case illustrates the better behaviour of RF in the case of non-linear dependency structures (as also previously outlined through our simple simulation in Section 2.3).
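A minimal illustration of this point (a hedged sketch, not the paper's Section 2.3 simulation or the Case 3 dataset): the label below depends on a pure interaction between two features, which a linear-in-features LR cannot represent, while RF recovers the structure from the data.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(2000, 2))
# XOR-like non-linear rule: class depends on the sign of the product.
y = (X[:, 0] * X[:, 1] > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

acc_rf = RandomForestClassifier(random_state=0).fit(X_tr, y_tr).score(X_te, y_te)
acc_lr = LogisticRegression(max_iter=1000).fit(X_tr, y_tr).score(X_te, y_te)
# RF scores close to 1.0, while LR stays near the chance level of ~0.5,
# since no linear boundary separates the two classes here.
```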