Intervention in prediction measure: a new approach to assessing variable importance for random forests

Background: Random forests are a popular method in many fields, since they can be successfully applied to complex data with a small sample size, complex interactions and correlations, mixed-type predictors, etc. Furthermore, they provide variable importance measures that aid qualitative interpretation and the selection of relevant predictors. However, most of these measures rely on the choice of a performance measure, and measures of prediction performance are not unique; in some cases, such as multivariate response random forests, there is not even a clear definition.

Methods: A new alternative importance measure, called the Intervention in Prediction Measure (IPM), is investigated. It depends on the structure of the trees, without depending on performance measures. It is compared with other well-known variable importance measures in different contexts, such as a classification problem with variables of different types, another classification problem with correlated predictor variables, and problems with multivariate responses and predictors of different types.

Results: Several simulation studies are carried out, showing the new measure to be very competitive. In addition, it is applied in two well-known bioinformatics applications previously used in other papers. Improvements in performance are also obtained for these applications by the use of this new measure.

Conclusions: This new measure is expressed as a percentage, which makes it attractive in terms of interpretability. It can be used with new observations. It can be defined globally, for each class (in a classification problem) and case-wise. It can easily be computed for any kind of response, including multivariate responses. Furthermore, it can be used with any algorithm employed to grow each individual tree. It can be used in place of (or in addition to) other variable importance measures.
Electronic supplementary material The online version of this article (doi:10.1186/s12859-017-1650-8) contains supplementary material, which is available to authorized users.


Scenario 1
Figures S1 and S2 are analogous to Figure 1 in the manuscript, while Figures S3 and S4 correspond to Figure 2 in the manuscript, with sample sizes n = 50 and n = 500, respectively. With n = 50, results are similar to those in the manuscript with n = 120. With n = 500, trees are grown deeper and many terminal nodes contain only a few observations. In such situations, it is desirable to have the terminal node size grow with the sample size [3]. In the previous figures, this was not taken into consideration, and IPM results have been affected, as IPM is based only on the tree structure, not on performance. In those cases, the depth of the trees in random forests may regulate overfitting [4]. If, for example, the maximum depth (maxdepth) of the trees is restricted to 3 (this parameter has not been tuned), results for IPM improve radically. The average ranking of variables for IPM (CIT-RF, mtry = 5, maxdepth = 3) in the case of n = 500 is: 3.28 (X1), 1.01 (X2), 3.63 (X3), 3.47 (X4) and 3.61 (X5), i.e. X2 ranks first on 99% of occasions and second on 1% of occasions. Therefore, the ranking configuration is nearly perfect.

For comparison, Table S1 shows the ranking distribution of X2 for VIMs applied to Scenario 1 with n = 120, as in the manuscript, whereas the average rankings for each variable are shown in Table S2.

Table S1: Ranking distribution (in percentage) of X2 for VIMs in Scenario 1 with n = 120. The most frequent position for each method is marked in bold font. Note that X2 should ideally rank first on 100% of occasions.

Scenario 2

With mtry = 12, the smaller sample size affects the methods differently. PVIM-CIT-RF and CPVIM assign less importance to X5 and X6 than to the irrelevant variable X4. MD considers X5 and X6 more important than X4, but the importance of X1 and X2 is not as high as expected, and it is too similar to the importance given to (the less important) X3 and the irrelevant X4.
IPM-CIT-RF shows a ranking pattern intermediate between these two behaviors: the one represented by PVIM-CIT-RF and CPVIM, and the one represented by MD.
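The average rankings and ranking distributions reported throughout this supplement are simple summaries over the simulated data sets. A minimal sketch of how such summaries are obtained (the per-dataset positions below are made up for illustration, not values from the tables):

```python
from collections import Counter

# Hypothetical positions obtained by one variable across 10 simulated data
# sets (illustrative only; the paper uses 100 data sets).
positions_of_x2 = [1, 1, 1, 2, 1, 1, 1, 1, 1, 1]

# Average ranking, as reported in Tables S2-S6.
average_rank = sum(positions_of_x2) / len(positions_of_x2)

# Ranking distribution in percentages, as in Table S1.
counts = Counter(positions_of_x2)
distribution = {pos: 100 * c / len(positions_of_x2)
                for pos, c in sorted(counts.items())}

print(average_rank)   # -> 1.1
print(distribution)   # -> {1: 90.0, 2: 10.0}
```

An ideal configuration, such as the one described for IPM above, corresponds to a distribution concentrated on the theoretically correct position.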
As regards the results with n = 500, on the one hand, the higher sample size affects the behavior of the methods in three different ways with mtry = 3. Results with PVIM-CIT-RF, PVIM-CART-RF and GVIM are the least successful, because they assign roughly the same importance to the irrelevant variable X4 as to the important predictors X5 and X6. The opposite behavior is found for CPVIM, which is the method with the biggest difference in importance between X4 and the group formed by X5 and X6. However, CPVIM gives less importance to the relevant predictors X1 and X2, even though they are as important as X5 and X6. IPM (CIT-RF and CART-RF) and MD show a profile similar to CPVIM, but they give more importance to X1 and X2, and less importance to X5 and X6, than CPVIM does. On the other hand, with mtry = 12 the methods show similar ranking patterns among them, but the methods whose rankings are most similar to the theoretical one are IPM with CIT-RF and CART-RF. The dissimilarity is computed as the sum of the absolute differences between the average ranking of each method and the theoretical ranking.
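The dissimilarity just described can be sketched as follows; the rankings in the example are made up for illustration and are not values from the tables:

```python
# L1 dissimilarity between a method's average ranking and the theoretical one:
# sum over variables of |average rank - theoretical rank|.

def ranking_dissimilarity(average_ranking, theoretical_ranking):
    """Sum of absolute differences between average and theoretical rankings."""
    return sum(abs(a - t) for a, t in zip(average_ranking, theoretical_ranking))

# Hypothetical example with 4 predictors: the first two are the important
# ones, so they theoretically share positions 1-2 and the rest share 3-4
# (tied variables receive the average of the tied positions).
theoretical = [1.5, 1.5, 3.5, 3.5]
method_avg  = [1.8, 1.4, 3.2, 3.6]

print(round(ranking_dissimilarity(method_avg, theoretical), 6))  # -> 0.8
```

A smaller value indicates a ranking pattern closer to the theoretical one.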

Scenarios 3 and 4
Tables S3, S4, S5 and S6 show the average ranking (over the 100 data sets) for each method in Scenarios 3 and 4 with n = 50 and n = 500. They are analogous to Tables 8 and 9 in the manuscript, which used n = 120. For Scenario 3 with n = 50 and mtry = 3, MD places X1 or X2 in third position on 50% of occasions; its results deteriorate more with a smaller sample size than those of the other methods. X5 (related to X2) is given a lower ranking by all methods than when n = 120 was considered, but in general the same patterns as in Table 8 are observed. However, with n = 500, results for IPM are closer to those theoretically expected (X1 and X2 in first position, while the rest of the variables are irrelevant and should rank in 5th position). For MD with n = 500, the pattern observed with n = 120 in Table 8 is now more evident. MD shows a bias on the irrelevant categorical predictor X7, which is always ranked in 7th position, and also on X5 (irrelevant but related to X2), which is always ranked in 3rd position. The other irrelevant variables X3, X4 and X6 rank in 5th position.
In Scenario 4 with n = 50, results are affected by the small sample size and the special configuration. Remember that variable X2 is irrelevant when X1 = 1, which is the most frequent value (60%). In other words, X2 is expected to intervene in the generation of only approximately 20 (50 × 0.4) samples. With MD, the rank of X2 rises, approaching the rank given to X5. The same ranking pattern as in the n = 120 case is observed for the rest of the variables, including the bias for X7. For IPM, the rank of X2 rises considerably with n = 50: X2 ranks in second position on 22% of occasions, and in each position from third to seventh on around 15% of occasions. Note that when n = 50, the size of the OOB sample is around 18 (50 × (1 − 0.632)), so only around 7 (18 × 0.4) samples have X1 = 0, where X2 participates in the generation of the responses. The sample size of the in-bag observations, which build the trees, with X1 = 0 is approximately 13 (50 × 0.632 × 0.4). Therefore, IPM is estimated with a very small sample, and a small sample size is a source of variance [1]. To solve this issue and increase the sample size, trees have been computed with all available observations (fraction = 0.99 has been considered in the function cforest of the R package party [2]), as IPM is not used for prediction. Then, all observations have been used for estimating IPM. Results of this configuration appear in the last row of Table S5, which gives an average ranking of 3.64 for X2, while the average ranking for the irrelevant variables is around 4.5. However, this global information can easily be disaggregated by groups with IPM, supplying interesting information. The average IPM values were 54% for X1, 15% for X2 and around 6% for each of the other variables. For samples with X1 = 0, the average IPM values were 75% for X1 and 25% for X2, and null for the rest of the variables. Therefore, IPM discards the irrelevant variables.
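The back-of-the-envelope sample sizes used in the argument above follow directly from the bootstrap in-bag fraction and the proportion of observations with X1 = 0:

```python
# Expected effective sample sizes in Scenario 4 with n = 50.
# 0.632 is the expected in-bag fraction of a bootstrap/sub-sample, and 0.4 is
# the proportion of observations with X1 = 0 (both values from the text).
n = 50
in_bag_fraction = 0.632
p_x1_zero = 0.4

oob_size = n * (1 - in_bag_fraction)                   # OOB observations
oob_with_x1_zero = oob_size * p_x1_zero                # OOB cases where X2 matters
in_bag_with_x1_zero = n * in_bag_fraction * p_x1_zero  # in-bag cases with X1 = 0

print(round(oob_size, 1))             # -> 18.4
print(round(oob_with_x1_zero, 1))     # -> 7.4
print(round(in_bag_with_x1_zero, 1))  # -> 12.6
```

With roughly a dozen relevant observations per tree, the high variance of the IPM estimate for X2 at n = 50 is unsurprising, which motivates the fraction = 0.99 workaround described above.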
Results for VIMs with n = 500 in Scenario 4 are similar to those in the manuscript with n = 120, except for IPM, for the reason explained in Section 1. The results for IPM (CIT-RF, mtry = 7, maxdepth = 3) are incorporated into Table S6. When the depth of the trees is limited to avoid overfitting in the large-sample setting, the results for IPM are again very good. In fact, it is very reasonable that X1 ranks first and X2 second on 81% of occasions, and X2 ranks first and X1 second on 19% of occasions, according to the structure of Scenario 4, since X1 is only important for part of the sample, while both X1 and X2 are important for the other part of the sample.
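The depth restriction discussed above is available in most random forest implementations. As a minimal sketch on synthetic data, using scikit-learn's RandomForestClassifier purely as a stand-in (the paper itself uses the R package party, where the analogous control is the maxdepth parameter):

```python
# Depth-limited random forest: a stand-in illustration of the maxdepth = 3
# restriction discussed above, using scikit-learn instead of R/party.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, not the paper's simulation design.
X, y = make_classification(n_samples=500, n_features=5, random_state=0)

# max_depth=3 plays the role of maxdepth = 3; max_features is sklearn's mtry.
rf = RandomForestClassifier(n_estimators=100, max_depth=3, max_features=5,
                            random_state=0)
rf.fit(X, y)

# No individual tree exceeds the requested depth.
print(max(est.get_depth() for est in rf.estimators_))
```

Limiting depth forces the splits near the root, which is exactly the part of the tree structure that IPM summarizes, so the restriction directly stabilizes the measure for large n.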