Table 3 Stability of variable (gene) selection evaluated using 200 bootstrap samples. "# Genes": number of genes selected on the original data set. "# Genes boot.": median (1st quartile, 3rd quartile) of the number of genes selected from the bootstrap samples. "Freq. genes": median (1st quartile, 3rd quartile) of the frequency with which each gene selected on the original data set appears among the genes selected from the bootstrap samples. Parameters for backwards elimination with random forest: mtryFactor = 1, s.e. = 0, ntree = 2000, ntreeIterat = 1000, fraction.dropped = 0.2.

From: Gene selection and classification of microarray data using random forest

| Data set | Error | # Genes | # Genes boot. | Freq. genes |
|---|---|---|---|---|
| Backwards elimination of genes from random forest | | | | |
| s.e. = 0 | | | | |
| Leukemia | 0.087 | 2 | 2 (2, 2) | 0.38 (0.29, 0.48)¹ |
| Breast 2 cl. | 0.337 | 14 | 9 (5, 23) | 0.15 (0.1, 0.28) |
| Breast 3 cl. | 0.346 | 110 | 14 (9, 31) | 0.08 (0.04, 0.13) |
| NCI 60 | 0.327 | 230 | 60 (30, 94) | 0.1 (0.06, 0.19) |
| Adenocar. | 0.185 | 6 | 3 (2, 8) | 0.14 (0.12, 0.15) |
| Brain | 0.216 | 22 | 14 (7, 22) | 0.18 (0.09, 0.25) |
| Colon | 0.159 | 14 | 5 (3, 12) | 0.29 (0.19, 0.42) |
| Lymphoma | 0.047 | 73 | 14 (4, 58) | 0.26 (0.18, 0.38) |
| Prostate | 0.061 | 18 | 5 (3, 14) | 0.22 (0.17, 0.43) |
| Srbct | 0.039 | 101 | 18 (11, 27) | 0.1 (0.04, 0.29) |
| s.e. = 1 | | | | |
| Leukemia | 0.075 | 2 | 2 (2, 2) | 0.4 (0.32, 0.5)¹ |
| Breast 2 cl. | 0.332 | 14 | 4 (2, 7) | 0.12 (0.07, 0.17) |
| Breast 3 cl. | 0.364 | 6 | 7 (4, 14) | 0.27 (0.22, 0.31) |
| NCI 60 | 0.353 | 24 | 30 (19, 60) | 0.26 (0.17, 0.38) |
| Adenocar. | 0.207 | 8 | 3 (2, 5) | 0.06 (0.03, 0.12) |
| Brain | 0.216 | 9 | 14 (7, 22) | 0.26 (0.14, 0.46) |
| Colon | 0.177 | 3 | 3 (2, 6) | 0.36 (0.32, 0.36) |
| Lymphoma | 0.042 | 58 | 12 (5, 73) | 0.32 (0.24, 0.42) |
| Prostate | 0.064 | 2 | 3 (2, 5) | 0.9 (0.82, 0.99)¹ |
| Srbct | 0.038 | 22 | 18 (11, 34) | 0.57 (0.4, 0.88) |
| Alternative approaches | | | | |
| SC.s | | | | |
| Leukemia | 0.062 | 82² | 46 (14, 504) | 0.48 (0.45, 0.59) |
| Breast 2 cl. | 0.326 | 31 | 55 (24, 296) | 0.54 (0.51, 0.66) |
| Breast 3 cl. | 0.401 | 2166 | 4341 (2379, 4804) | 0.84 (0.78, 0.88) |
| NCI 60 | 0.246 | 5118³ | 4919 (3711, 5243) | 0.84 (0.74, 0.92) |
| Adenocar. | 0.179 | 0 | 9 (0, 18) | NA (NA, NA) |
| Brain | 0.159 | 4177 | 1257 (295, 3483) | 0.38 (0.3, 0.5) |
| Colon | 0.122 | 15 | 22 (15, 34) | 0.8 (0.66, 0.87) |
| Lymphoma | 0.033 | 2796 | 2718 (2030, 3269) | 0.82 (0.68, 0.86) |
| Prostate | 0.089 | 4 | 3 (2, 4) | 0.72 (0.49, 0.92) |
| Srbct | 0.025 | 37⁴ | 18 (12, 40) | 0.45 (0.34, 0.61) |
| NN.vs | | | | |
| Leukemia | 0.056 | 512 | 23 (4, 134) | 0.17 (0.14, 0.24) |
| Breast 2 cl. | 0.337 | 88 | 23 (4, 110) | 0.24 (0.2, 0.31) |
| Breast 3 cl. | 0.424 | 9 | 45 (6, 214) | 0.66 (0.61, 0.72) |
| NCI 60 | 0.237 | 1718 | 880 (360, 1718) | 0.44 (0.34, 0.57) |
| Adenocar. | 0.181 | 9868 | 73 (8, 1324) | 0.13 (0.1, 0.18) |
| Brain | 0.194 | 1834 | 158 (52, 601) | 0.16 (0.12, 0.25) |
| Colon | 0.158 | 8 | 9 (4, 45) | 0.57 (0.45, 0.72) |
| Lymphoma | 0.04 | 15 | 15 (5, 39) | 0.5 (0.4, 0.6) |
| Prostate | 0.081 | 7 | 6 (3, 18) | 0.46 (0.39, 0.78) |
| Srbct | 0.031 | 11 | 17 (11, 33) | 0.7 (0.66, 0.85) |
  1. Only two genes are selected from the complete data set; the values are the actual frequencies of those two genes.
  2. [33] select 21 genes after visually inspecting the plot of cross-validation error rate vs. amount of shrinkage and number of genes. Their procedure is hard to automate, so estimates of its error rate are very difficult to obtain.
  3. [31] report obtaining more than 2000 genes when using shrunken centroids with this data set and show that the minimum error rate is achieved with about 5000 genes.
  4. [33] select 43 genes. The difference is likely due to differences in the random partitions for cross-validation. When the gene selection process is repeated 100 times with the full data set, the median, 1st quartile, and 3rd quartile of the number of selected genes are 13, 8, and 147. For these data, [31] obtain 72 genes with shrunken centroids, which also falls within the above interval.
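
The two stability summaries defined in the caption ("# Genes boot." and "Freq. genes") can be illustrated with a minimal sketch. This is not the authors' code: the function names `quartile_summary` and `stability_summary` are hypothetical, the gene selection step itself is assumed to have been run already (once on the original data and once per bootstrap sample), and the quartile interpolation rule used here is an assumption that may differ from the one behind the published values.

```python
from statistics import quantiles


def quartile_summary(values):
    """Return (median, 1st quartile, 3rd quartile), or None with fewer than two values."""
    if len(values) < 2:
        return None
    # Assumption: linear interpolation between order statistics ("inclusive" method).
    q1, med, q3 = quantiles(values, n=4, method="inclusive")
    return med, q1, q3


def stability_summary(original_genes, bootstrap_selections):
    """Summarise selection stability.

    original_genes: genes selected on the original data set.
    bootstrap_selections: one collection of selected genes per bootstrap sample.
    """
    boot_sets = [set(sel) for sel in bootstrap_selections]
    n_boot = len(boot_sets)

    # "# Genes boot.": distribution of the number of genes selected per bootstrap sample.
    sizes = [len(s) for s in boot_sets]

    # "Freq. genes": for each gene selected on the original data, the fraction of
    # bootstrap samples in which that gene is selected again.
    freq_per_gene = {
        gene: sum(gene in s for s in boot_sets) / n_boot
        for gene in sorted(set(original_genes))
    }

    return {
        "n_genes": len(freq_per_gene),
        "n_genes_boot": quartile_summary(sizes),
        "freq_per_gene": freq_per_gene,
        # None when fewer than two genes were selected on the original data,
        # which corresponds to the NA entries in the table.
        "freq_summary": quartile_summary(list(freq_per_gene.values())),
    }


if __name__ == "__main__":
    # Toy data with made-up gene names, for illustration only.
    original = ["g1", "g2"]
    boots = [["g1", "g2"], ["g1"], ["g1", "g3", "g4"], ["g2", "g5"]]
    print(stability_summary(original, boots))
```

When only a couple of genes are selected on the original data, the per-gene frequencies in `freq_per_gene` are more informative than their quartiles, which is in the spirit of footnote 1.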