Figure 1 | BMC Bioinformatics

Figure 1

From: GeneSrF and varSelRF: a web-based tool and R package for gene selection and classification using random forest

Example output. Some figures from the output of the web-based application (see [24]). a) Out-of-bag error rate vs. the number of genes in the class prediction model, for both the complete, original data set (red line) and the 200 bootstrap samples (black lines). These figures can help identify the best number of genes in the class prediction model. It seems that we can do fairly well using just 2 genes in our model. This is the conclusion we reach both with the complete, original data set and with the bootstrap samples. b) Probability of class membership of each sample, from out-of-bag samples (i.e., bootstrap runs where the sample was not included in the training group). Most samples are well classified, especially those from class ALL (their average out-of-bag probability of membership in their true class is larger than 0.75). c) Importance spectrum plots can help decide on the number of "relevant variables": we compare the variable importance plots from the original data with variable importance plots that are generated when the class labels and the predictors are independent (class labels are randomly permuted). In this case the first 30 variables have importances well above those from sets with randomly permuted class labels. d) Selection probability plots: for each of the top-ranked genes from the original sample, the probability that it is included among the top-ranked k genes (blue: k = 20; red: k = 100) from the 200 bootstrap samples. Thus, these plots can be a measure of our confidence in the stability of selecting the k top-ranked genes. In this case, with k = 20 only the two or three most important genes are repeatedly chosen among the best 20. If we select the first 100 genes, the 30 best-ranked ones appear in at least 75% of the bootstrap samples.
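
The selection probability described for panel d) can be illustrated with a minimal sketch. This is not the code used by GeneSrF or varSelRF; the function name `selection_probability`, the gene labels, and the toy rankings are all hypothetical, and the sketch assumes each bootstrap sample has already produced a list of genes ordered from most to least important.

```python
from typing import Dict, Sequence


def selection_probability(
    original_ranking: Sequence[str],
    bootstrap_rankings: Sequence[Sequence[str]],
    k: int,
) -> Dict[str, float]:
    """For each of the top-k genes in the original ranking, return the
    fraction of bootstrap rankings whose own top k contains that gene."""
    if k <= 0:
        raise ValueError("k must be positive")
    if not bootstrap_rankings:
        raise ValueError("need at least one bootstrap ranking")
    # Top-k gene set of every bootstrap sample, built once for fast lookup.
    top_sets = [set(ranking[:k]) for ranking in bootstrap_rankings]
    n = len(top_sets)
    return {
        gene: sum(gene in top for top in top_sets) / n
        for gene in original_ranking[:k]
    }


# Toy data: hypothetical gene names and rankings, not taken from the paper.
original = ["g1", "g2", "g3", "g4"]
bootstraps = [
    ["g1", "g2", "g4", "g3"],
    ["g1", "g3", "g2", "g4"],
    ["g2", "g1", "g3", "g4"],
    ["g1", "g4", "g2", "g3"],
]
probs = selection_probability(original, bootstraps, k=2)
print(probs)
```

In this toy case `g1` is in the top 2 of every bootstrap ranking while `g2` is in the top 2 of only half of them, which is the kind of contrast the blue (k = 20) and red (k = 100) curves summarize for the real data.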
