Table 3 Benefit from feature selection.

From: A comparison of random forest and its Gini importance with standard chemometric methods for the feature selection and classification of spectral data

                    | univariate selection          | multivariate selection        | multivariate selection
                    |                               | (Gini importance)             | (PLS/PC)
                    | PLS       PC        RF        | PLS       PC        RF        | PLS       PC        RF
MIR BSE     orig    | 10.0 (6)  10.0 (4)  1.5 (7)   | 10.0 (6)  10.0 (6)  2.0 (6)   | 3.0 (13)  0.9 (80)  0.5 (51)
            binned  | 6.0 (5)   7.0 (5)   2.2 (9)   | 10.0 (9)  10.0 (6)  3.0 (9)   | 10.0 (5)  9.0 (4)   0.4 (51)
MIR wine    French  | 4.0 (3)   3.0 (2)   0.7 (64)  | 5.0 (3)   3.0 (1)   3.0 (26)  | 0.0 (100) 0.6 (33)  0.0 (64)
            grape   | 8.0 (2)   8.0 (21)  0.6 (64)  | 10.0 (4)  10.0 (5)  2.0 (11)  | 4.0 (1)   6.0 (1)   0.0 (64)
NMR tumor   all     | 1.0 (80)  0.5 (11)  4.0 (6)   | 4.0 (51)  0.0 (100) 2.0 (6)   | 0.8 (11)  0.0 (100) 0.3 (80)
            center  | 2.0 (7)   0.4 (6)   0.8 (86)  | 2.0 (26)  0.2 (64)  0.7 (41)  | 0.0 (100) 0.7 (13)  0.3 (80)
NMR candida 1       | 0.5 (80)  0.0 (80)  0.8 (80)  | 0.0 (100) 0.0 (100) 0.8 (41)  | 0.4 (64)  0.0 (100) 0.4 (9)
            2       | 0.4 (80)  0.9 (64)  0.0 (80)  | 2.0 (64)  0.4 (26)  0.0 (100) | 2.0 (21)  1.0 (21)  0.4 (41)
            3       | 0.0 (100) 0.0 (100) 0.0 (80)  | 2.0 (80)  0.6 (80)  2.0 (26)  | 2.0 (80)  0.0 (100) 0.7 (41)
            4       | 0.8 (80)  0.0 (100) 0.0 (80)  | 2.0 (80)  1.0 (80)  2.0 (64)  | 0.7 (33)  0.0 (100) 0.3 (32)
            5       | 0.0 (100) 0.0 (100) 0.7 (80)  | 0.0 (100) 0.4 (80)  1.0 (64)  | 0.7 (64)  0.7 (80)  0.4 (21)
  1. Significance of the accuracy improvement obtained with feature selection, compared to using the full set of features, and (in parentheses) the percentage of original features used in the classification with maximum accuracy. The significance is given as -log10(p), where p is the p-value of a paired Wilcoxon test on the 100 hold-outs of the cross-validation (see text). For comparison, -log10(0.05) = 1.3 and -log10(0.001) = 3; the value of 6.0 reported for MIR BSE binned (second row, first column) indicates a highly significant improvement in classification accuracy, corresponding to a p-value of 10^-6.
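
The conversion between a p-value and the significance score used in the table can be sketched in a few lines of Python. This is only an illustration of the arithmetic in the note above, not the authors' code; the function names `significance` and `p_value_from` are hypothetical, and the paired Wilcoxon test that produces the p-values is not reproduced here.

```python
import math


def significance(p_value: float) -> float:
    """Return -log10(p), the score reported in the table (hypothetical helper)."""
    return -math.log10(p_value)


def p_value_from(score: float) -> float:
    """Invert the score back to a p-value, p = 10**(-score) (hypothetical helper)."""
    return 10.0 ** (-score)


# Reference points quoted in the table note.
print(round(significance(0.05), 1))   # 1.3
print(round(significance(0.001), 1))  # 3.0

# A score of 6.0 corresponds to a p-value of about 1e-06.
print(p_value_from(6.0))
```

Read this way, any entry above 1.3 is significant at the 5% level, and a score of 0.0 corresponds to p = 1, that is, no detectable benefit from feature selection.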