
Table 1 Comparative analysis of R-based methods for gene selection

From: DFP: a Bioconductor package for fuzzy profile identification and gene reduction of microarray data

iterativeBMA [11]
  Method: Bayesian model averaging (BMA) approach over the underlying classification model (logistic regression)
  Type of classification: Multiclass
  Dependence among features: Multivariate
  Remarks:
  - The method facilitates biological interpretation by producing posterior probabilities of selected genes and models. BMA accounts for the uncertainty about the best set to choose by averaging over multiple models
  - The R package is available from Bioconductor
  - The method requires a limit on the maximum number of relevant genes to be selected, and the final results are conditioned by an initial selection based on a univariate gene selection method

varSelRF [12, 13]
  Method: varSelRF uses the measures of variable importance (related to the classification) provided directly by the Random Forest algorithm
  Type of classification: Multiclass
  Dependence among features: Multivariate
  Remarks:
  - The method does not require pre-specifying the number of genes to be selected, but rather chooses the number of genes adaptively
  - The R package is available from CRAN and its implementation takes advantage of computing clusters and multicore processors
  - varSelRF is biased toward identifying small sets of genes that still achieve good predictive performance (thus, highly correlated genes will not be selected, since they are considered redundant)

R-SVM [14]
  Method: R-SVM uses a contribution factor for each feature (computed from the weights of the SVM classifier)
  Type of classification: Binary classification
  Dependence among features: Multivariate
  Remarks:
  - The algorithm is based on the repeated application of the SVM classifier over progressively smaller sets of genes (where genes are excluded according to the defined contribution factor) until a satisfactory solution is achieved. The number of iterations and the number of features to be selected in each iteration are largely ad hoc
  - The R-SVM method is only suitable for binary classification

ttest [genefilter]
  Method: t-test
  Type of classification: Binary classification
  Dependence among features: Univariate
  Remarks:
  - The computational effort is lower than that of multivariate methods
  - The genefilter package is available from Bioconductor
  - It is sensitive to outliers, which are frequent in microarray data
  - It requires the expression levels to be normally distributed within both classes

DFP
  Method: Genes are selected based on the fuzzy pattern induced for each class
  Type of classification: Multiclass
  Dependence among features: Univariate
  Remarks:
  - It does not require any assumption about the distribution of the expression levels
  - It accounts for the noise in the data because, as a fuzzy-based method, it deals with linguistic categories instead of raw data
  - The implementation is computationally efficient and available from Bioconductor
  - The DFP method does not take into account that features influence a biological outcome in the context of networks of interacting genes
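
To make the "univariate" entries in the table concrete, the following is a minimal sketch of t-test-based gene ranking for a binary classification problem, written in plain Python. It is not the genefilter implementation: the function names (`welch_t`, `rank_genes`), the use of Welch's form of the statistic, and the toy expression values are all assumptions made for illustration. Each gene is scored on its own, without reference to any other gene, which is what distinguishes a univariate filter from the multivariate methods in the table.

```python
import math
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic for two samples (unequal variances allowed)."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se


def rank_genes(expr, labels):
    """Rank genes by |t|, scoring each gene independently (univariate).

    expr   -- dict: gene name -> list of expression values, one per sample
    labels -- list of 0/1 class labels, aligned with the samples
    """
    scores = {}
    for gene, values in expr.items():
        a = [v for v, y in zip(values, labels) if y == 0]
        b = [v for v, y in zip(values, labels) if y == 1]
        scores[gene] = abs(welch_t(a, b))
    # Largest absolute t statistic first.
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical toy data: 3 genes, 6 samples (3 per class).
labels = [0, 0, 0, 1, 1, 1]
expr = {
    "geneA": [1.0, 1.2, 0.9, 3.1, 2.9, 3.0],  # classes clearly separated
    "geneB": [2.0, 2.1, 1.9, 2.0, 2.2, 1.8],  # no class difference
    "geneC": [1.0, 1.1, 0.9, 1.0, 1.2, 9.0],  # one outlier in class 1
}
ranking = rank_genes(expr, labels)
print(ranking)
```

In this toy example, geneC illustrates the outlier sensitivity noted in the remarks: a single extreme value shifts the class mean but also inflates the class variance, so the gene's score is driven by one sample rather than by a consistent difference between classes.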