Selection of a subset of important features (variables) is crucial for modeling high-dimensional data in bioinformatics. For example, microarray gene expression data may include *p* ≥ 10,000 genes, while the sample size, *n*, is much smaller, often less than 100. A model cannot be built directly, since the model complexity exceeds the sample size: linear discriminant analysis, for instance, can only fit a linear model with up to *n* parameters. Such a model provides a perfect fit to the training data, but it has no predictive power. This "*small n, large p* problem" has attracted a lot of research attention, aimed at removing nonessential or noisy features from the data, and thus determining a relatively small number of features which can largely explain the observed data and the related biological processes.
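The lack of predictive power under *p* > *n* can be seen in a small numerical sketch (the dimensions and random data below are illustrative choices of ours): with more features than samples, ordinary least squares interpolates even pure-noise labels.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 100                     # far more features than samples
X = rng.normal(size=(n, p))
y = rng.normal(size=n)             # pure noise: no real signal to learn

# With p > n the design matrix has full row rank (almost surely), so
# ordinary least squares can interpolate any labels exactly.
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
train_mse = np.mean((X @ beta - y) ** 2)
print(f"training MSE: {train_mse:.2e}")   # essentially zero: perfect fit, no predictive power
```

The fitted model reproduces the noise labels exactly on the training set, which is precisely why a perfect in-sample fit carries no predictive value here.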

Though much work has been done, feature selection remains an active research area. The significant interest is attributable to its many benefits. As enumerated in [1], these include (i) reducing the computational cost of prediction; (ii) removing information redundancy (cost savings); (iii) avoiding overfitting; and (iv) easing interpretation. In general, for a fixed sample size, the fewer the features included, the more samples are available per feature, and the lower the generalization error tends to be. This is sometimes referred to as the *Occam's razor* principle [2]. Here we give a brief summary of feature selection; for a recent review, see [3]. Broadly, feature selection techniques can be grouped into three classes.

*Class I: Internal variable selection*. This class mainly consists of *Decision Trees* (DT) [4], in which a variable is selected and split at each node by maximizing the purity of its descendant nodes. Variable selection is thus performed as part of tree building. Decision trees have the advantage of being easy to interpret, but they suffer from the instability of their hierarchical structure: errors at ancestor nodes propagate to multiple descendant nodes and thus have an inflated effect. Even worse, a minor change at the root may change the tree structure significantly. An improved method based on decision trees is Random Forests [5], which grows a collection of trees by bootstrapping the samples and using a random selection of the variables. This approach decreases the prediction variance of a single tree. However, Random Forests does not by itself remove variables, as any variable may appear in multiple trees; it does, though, provide a variable ranking mechanism that can be used to select important variables.
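The purity-maximizing split at a tree node can be sketched as follows (a minimal illustration using Gini impurity; the toy data and function names are ours, not taken from [4] or [5]):

```python
import numpy as np

def gini(labels):
    """Gini impurity of a label array: 1 - sum of squared class proportions."""
    _, counts = np.unique(labels, return_counts=True)
    prop = counts / counts.sum()
    return 1.0 - np.sum(prop ** 2)

def best_split(x, y):
    """Find the threshold on a single variable that maximizes the
    impurity decrease (purity gain) of the two descendant nodes."""
    best_gain, best_t = 0.0, None
    for t in np.unique(x)[:-1]:
        left, right = y[x <= t], y[x > t]
        w = len(left) / len(y)
        gain = gini(y) - (w * gini(left) + (1 - w) * gini(right))
        if gain > best_gain:
            best_gain, best_t = gain, t
    return best_t, best_gain

x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
y = np.array([0, 0, 0, 1, 1, 1])
t, g = best_split(x, y)
print(t, g)   # splits cleanly between the two classes at x <= 3
```

A tree builder evaluates this over every candidate variable at every node, which is why the variable with the largest purity gain is "selected" as a byproduct of growing the tree.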

*Class II: Variable filtering*. This class encompasses a variety of filters that are principally used for classification problems. Unlike the other two classes, no specific type of model need be invoked in the filtering process: a filter is a statistic defined on a random variable over multiple populations, and with the choice of a threshold, some variables can be removed. Such filters include *t*-statistics, *F*-statistics, Kullback-Leibler divergence, Fisher's discriminant ratio, mutual information [6], information-theoretic networks [7], maximum entropy [8], the maximum information compression index [9], Relief [10, 11], correlation-based filters [12, 13], relevance and redundancy analysis [14], etc.
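As an illustration of the filtering idea, the following sketch ranks features by the absolute value of a pooled-variance two-sample *t*-statistic and keeps the top *k* (the data, the planted signal, and the cutoff are hypothetical choices of ours):

```python
import numpy as np

def t_filter(X, y, k):
    """Rank features by absolute two-sample t-statistic and keep the top k.
    X: (n_samples, n_features); y: binary labels in {0, 1}."""
    a, b = X[y == 0], X[y == 1]
    na, nb = len(a), len(b)
    # pooled-variance two-sample t-statistic, computed per feature
    sp2 = ((na - 1) * a.var(axis=0, ddof=1) +
           (nb - 1) * b.var(axis=0, ddof=1)) / (na + nb - 2)
    t = (a.mean(axis=0) - b.mean(axis=0)) / np.sqrt(sp2 * (1 / na + 1 / nb))
    return np.argsort(-np.abs(t))[:k]

rng = np.random.default_rng(1)
X = rng.normal(size=(40, 500))
y = np.repeat([0, 1], 20)
X[y == 1, 0] += 3.0            # plant a class difference in feature 0 only
idx = t_filter(X, y, 5)
print(idx)                     # feature 0 should rank first
```

Because each feature is scored independently of any model, such filters scale easily to tens of thousands of genes, at the cost of ignoring interactions among features.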

*Class III: Wrapped methods*. These techniques wrap a model into a search algorithm [15, 16]. This class includes forward/backward and stepwise selection using a defined criterion, for instance, partial *F*-statistics, Akaike's Information Criterion (AIC) [17], or the Bayesian Information Criterion (BIC) [18]. In [19], sequential projection pursuit (SPP) was combined with partial least squares (PLS) analysis for variable selection. Wrapped feature selection based on Random Forests has also been studied [20, 21]. Random Forests offers two measures of variable importance, namely mean decrease accuracy (MDA) and mean decrease Gini (MDG). Both measures are, however, biased [22]. One study shows that MDG is more robust than MDA [23]; another shows the contrary [24]. Our experiments show that both measures give very similar results, so in this paper we present results only for MDA. The R package **varSelRF** developed in [21] is used in this paper for comparisons; we call this method RF-FS, or simply RF when there is no confusion. Given the hierarchical structure of the trees in the forest, stability remains a problem.
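A minimal sketch of a wrapped method is greedy forward selection under AIC for an ordinary least squares model (a generic illustration, not the varSelRF procedure; the Gaussian-error AIC below is stated up to an additive constant):

```python
import numpy as np

def aic_ols(X, y):
    """AIC of an ordinary-least-squares fit with Gaussian errors,
    up to an additive constant: n*log(RSS/n) + 2*(k + 1)."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss = np.sum((y - X @ beta) ** 2)
    return n * np.log(rss / n) + 2 * (k + 1)

def forward_select(X, y):
    """Greedy forward selection: repeatedly add the candidate feature
    that most lowers AIC; stop when no addition improves it."""
    p = X.shape[1]
    selected, remaining, best_aic = [], list(range(p)), np.inf
    while remaining:
        aic, j = min((aic_ols(X[:, selected + [j]], y), j) for j in remaining)
        if aic >= best_aic:
            break
        best_aic = aic
        selected.append(j)
        remaining.remove(j)
    return selected

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 10))
y = 2 * X[:, 3] - X[:, 7] + 0.1 * rng.normal(size=100)   # only features 3 and 7 carry signal
sel = forward_select(X, y)
print(sel)   # the informative features 3 and 7 are picked up first
```

The model is refit once per candidate per step, which illustrates why wrapped methods are far more expensive than filters, especially when the wrapped model is itself an ensemble such as Random Forests.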

Filter approaches have the advantage of being simple to compute and very fast, which makes them well suited to pre-screening rather than to building the final model. Conversely, wrapped methods are suitable for building the final model, but are generally slower.

Recently, *Random KNN* (RKNN), specifically designed for classification of high-dimensional datasets, was introduced in [25]. RKNN is a generalization of the *k*-nearest neighbor (KNN) algorithm [26–28] and therefore enjoys KNN's many advantages. In particular, KNN is a nonparametric classification method: it does not assume any parametric form for the distribution of the measured random variables. Owing to this flexibility, it is usually a good classifier in situations where the joint distribution is unknown or hard to model parametrically, as is especially the case for high-dimensional datasets. Another important advantage of KNN is that missing values can be easily imputed [29, 30]. Troyanskaya *et al*. [30] also showed that KNN is generally more robust and more sensitive than other popular classifiers. In [25] it was shown that RKNN leads to a significant performance improvement in terms of both computational complexity and classification accuracy. In this paper, we present a novel feature selection method, RKNN-FS, using the new classification and regression method, RKNN. Our empirical comparison with the Random Forests approach shows that RKNN-FS is a promising approach to feature selection for high-dimensional data.
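To convey the flavor of RKNN, the following is a loose sketch of the random-KNN idea: an ensemble of KNN classifiers, each trained on a small random subset of the features and combined by majority vote, together with a crude per-feature score. The subset size, ensemble size, and scoring rule here are our simplifications, not the exact procedure of [25].

```python
import numpy as np

def knn_predict(Xtr, ytr, Xte, k=3):
    """Brute-force k-nearest-neighbor classification by majority vote."""
    d = ((Xte[:, None, :] - Xtr[None, :, :]) ** 2).sum(axis=-1)
    nn = np.argsort(d, axis=1)[:, :k]
    return np.array([np.bincount(ytr[idx]).argmax() for idx in nn])

def random_knn(Xtr, ytr, Xte, r=100, m=3, k=3, seed=0):
    """r base KNN classifiers, each on m randomly drawn features,
    aggregated by majority vote. Also returns a crude per-feature
    score: the mean training accuracy of the subsets that used it."""
    rng = np.random.default_rng(seed)
    p = Xtr.shape[1]
    votes = np.empty((len(Xte), r), dtype=int)
    score, hits = np.zeros(p), np.zeros(p)
    for i in range(r):
        S = rng.choice(p, size=m, replace=False)   # random feature subset
        votes[:, i] = knn_predict(Xtr[:, S], ytr, Xte[:, S], k)
        acc = (knn_predict(Xtr[:, S], ytr, Xtr[:, S], k) == ytr).mean()
        score[S] += acc
        hits[S] += 1
    importance = score / np.maximum(hits, 1)
    majority = np.array([np.bincount(v).argmax() for v in votes])
    return majority, importance

rng = np.random.default_rng(3)
p = 10
Xtr, ytr = rng.normal(size=(30, p)), np.repeat([0, 1], 15)
Xte, yte = rng.normal(size=(20, p)), np.repeat([0, 1], 10)
Xtr[ytr == 1, 0] += 4.0            # only feature 0 separates the classes
Xte[yte == 1, 0] += 4.0
pred, imp = random_knn(Xtr, ytr, Xte)
print((pred == yte).mean(), imp.argmax())
```

Because each base classifier sees only a few features, the ensemble sidesteps the distance concentration that plagues KNN in very high dimensions, and features that consistently appear in accurate subsets can be ranked for selection, which is the intuition behind RKNN-FS.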