Table 1 Comparative analysis of R-based methods for gene selection

From: DFP: a Bioconductor package for fuzzy profile identification and gene reduction of microarray data

 

| | iterativeBMA[11] | varSelRF[12, 13] | R-SVM[14] | ttest [genefilter] | DFP |
|---|---|---|---|---|---|
| Method | Bayesian model averaging (BMA) approach over the underlying classification model (logistic regression) | varSelRF uses the measures of variable importance (related to the classification) provided directly by the Random Forest algorithm | R-SVM uses a contribution factor of each feature (computed from the weights of the SVM classifier) | t-test | The selected genes are based on the induced fuzzy pattern for each class |
| Type of classification | Multiclass | Multiclass | Binary classifications | Binary classifications | Multiclass |
| Dependence among features | Multivariate | Multivariate | Multivariate | Univariate | Univariate |

Remarks

iterativeBMA: The method facilitates biological interpretation by producing posterior probabilities of selected genes and models. BMA accounts for the uncertainty about the best set to choose by averaging over multiple models. The R package is available from Bioconductor. The method requires a limit on the maximum number of relevant genes to be selected, and the final results are conditioned by an initial selection based on a univariate gene selection method.

varSelRF: The method does not require pre-specifying the number of genes to be selected, but rather chooses the number of genes adaptively. The R package is available from CRAN, and its implementation takes advantage of computing clusters and multicore processors. varSelRF is biased toward identifying small sets of genes that can still achieve good predictive performance (thus, highly correlated genes will not be selected, since they are considered redundant).

R-SVM: The algorithm is based on the repeated application of the SVM classifier over progressively smaller sets of genes (where genes are excluded according to the defined contribution factor) until a satisfactory solution is achieved. The number of iterations and the number of features to be selected in each iteration are chosen ad hoc. The R-SVM method is only suitable for binary classifications.

ttest [genefilter]: The computational effort is smaller than that of multivariate methods. The genefilter package is available from Bioconductor. The test is sensitive to outliers, which are frequent in microarray data. It requires the expression levels to be normally distributed within both classes.

DFP: The method does not require any assumption about the distribution of the expression levels, and it accounts for the noise in the data because, as a fuzzy-based method, it deals with linguistic categories instead of raw data. The implementation is computationally efficient and available from Bioconductor. The DFP method does not take into consideration that features influence a biological outcome in the context of networks of interacting genes.
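
The remarks on the t-test describe a univariate filter that scores each gene independently and is sensitive to outliers. The following is a minimal sketch of that idea, not the genefilter implementation or its API: it uses plain Python with Welch's t-statistic, and the function names `welch_t` and `rank_genes`, the gene names, and the toy expression values are all assumptions made up for illustration.

```python
import math
import statistics


def welch_t(a, b):
    """Welch's t-statistic for two independent samples (unequal variances).

    Assumes each sample has at least two values and that the two
    sample variances are not both zero.
    """
    var_a = statistics.variance(a)  # sample variance (n - 1 denominator)
    var_b = statistics.variance(b)
    std_err = math.sqrt(var_a / len(a) + var_b / len(b))
    return (statistics.mean(a) - statistics.mean(b)) / std_err


def rank_genes(expr, labels):
    """Rank genes by |t|, largest first.

    expr   : dict mapping a gene name to its expression values, one per sample
    labels : list of 0/1 class labels, aligned with the samples
    Each gene is scored on its own (univariate), ignoring all other genes.
    """
    scores = {}
    for gene, values in expr.items():
        class0 = [v for v, y in zip(values, labels) if y == 0]
        class1 = [v for v, y in zip(values, labels) if y == 1]
        scores[gene] = abs(welch_t(class0, class1))
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical toy data: four samples per class.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
expr = {
    "geneB": [2.0, 2.1, 1.9, 2.0, 2.0, 1.9, 2.1, 2.0],  # no class difference
    "geneA": [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1],  # clear class difference
}
ranking = rank_genes(expr, labels)

# Same informative gene, with and without a single outlier in class 0.
t_clean = welch_t([1.0, 1.2, 0.9, 1.1], [3.0, 3.2, 2.9, 3.1])
t_outlier = welch_t([1.0, 1.2, 0.9, 9.0], [3.0, 3.2, 2.9, 3.1])

print(ranking)
print(abs(t_clean), abs(t_outlier))
```

In this toy example a single extreme value in one class is enough to collapse the score of an otherwise clearly discriminating gene, which is the outlier sensitivity the table attributes to the t-test. Methods that work on discretized or linguistic categories rather than raw values, as the DFP remarks describe, are less exposed to this effect.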