- Methodology article
- Open access

# Feature weight estimation for gene selection: a local hyperlinear learning approach

*BMC Bioinformatics*
**volume 15**, Article number: 70 (2014)

## Abstract

### Background

Modeling high-dimensional data involving thousands of variables is particularly important for gene expression profiling experiments; nevertheless, it remains a challenging task. One of the challenges is to implement an effective method for selecting a small set of relevant genes buried in high-dimensional irrelevant noise. RELIEF is a popular and widely used approach for feature selection owing to its low computational cost and high accuracy. However, RELIEF-based methods suffer from instability, especially in the presence of noisy and/or high-dimensional outliers.

### Results

We propose an innovative feature weighting algorithm, called LHR, to select informative genes from highly noisy data. LHR is based on RELIEF for feature weighting using classical margin maximization. The key idea of LHR is to estimate the feature weights through local approximation rather than the global measurement typically used in existing methods. The weights obtained by our method are very robust to degradation by noisy features, even those of vast dimension. To demonstrate the performance of our method, extensive experiments involving classification tests have been carried out on both synthetic and real microarray benchmark datasets by combining the proposed technique with standard classifiers, including the support vector machine (SVM), *k*-nearest neighbor (KNN), hyperplane *k*-nearest neighbor (HKNN), linear discriminant analysis (LDA) and naive Bayes (NB).

### Conclusion

Experiments on both synthetic and real-world datasets demonstrate the superior performance of the proposed feature selection method combined with supervised learning in three aspects: 1) high classification accuracy, 2) excellent robustness to noise and 3) good stability with respect to various classification algorithms.

## Background

Feature weighting is an important step in the preprocessing of data, especially in gene selection for cancer classification. The growing abundance of genome-wide sequence data made possible by high-throughput technologies has sparked widespread interest in linking sequence information to biological phenotypes. However, the expression data usually consist of vast numbers of genes (≥10,000) but only a small number of samples. Therefore, feature selection is necessary for solving such problems. Reducing the dimensionality of the feature space and selecting the most informative genes for effective classification with new or existing classifiers are commonly adopted techniques in empirical studies.

In general, the feature weights are obtained by assigning a continuous relevance value to each feature via a learning algorithm that focuses on the context or domain knowledge. The feature weighting procedure is particularly useful for instance-based learning models, in which a distance metric is typically constructed using all features. Moreover, feature weighting can reduce the risk of overfitting by removing noisy features, thereby improving the predictive accuracy. Existing feature selection methods broadly fall into two categories: wrapper and filter methods. Wrapper methods use the predictive accuracy of predetermined classification algorithms (called base classifiers), such as the support vector machine (SVM), as the criterion for determining the goodness of a subset of features [1, 2]. Filter methods select features according to discriminant criteria based on the characteristics of the data, independent of any classification algorithms [3–5]. Commonly used discriminant criteria include entropy measurements [6], Fisher ratio measurements [7], mutual information measurements [8–10], and RELIEF-based measurements [11, 12].

As a result of emerging needs in the biomedical and bioinformatics fields, researchers are particularly interested in algorithms that can process data containing features with large (or even huge) dimensions, for example, microarray data in cancer research. Therefore, filter methods are widely used owing to their efficient computation. Of the existing filter methods for feature weighting, the RELIEF algorithm [13] is considered to be one of the most successful owing to its simplicity and effectiveness. The main idea behind RELIEF is to iteratively update feature weights using a distance margin to estimate the difference between neighboring patterns. The algorithm has been further generalized, under the name RELIEF-F, to average multiple nearest neighbors, instead of just one, when computing sample margins [13]. Sun et al. showed that RELIEF-F achieves significant improvement in performance over the original RELIEF. Sun also systematically proved that RELIEF is indeed an online algorithm for a convex optimization problem [11]. By maximizing the averaged margin of the nearest patterns in the feature scaled space, RELIEF can estimate the feature weights in a straightforward and efficient manner. Based on this theoretical framework, I-RELIEF, an outlier removal scheme, can be applied since the margin averaging is sensitive to large variations [11].

To accomplish sparse feature weighting, the authors incorporated an *l*_{1} penalty into the optimization of I-RELIEF [12].

In this paper, we propose a new feature weighting scheme within the RELIEF framework. The main contribution of the proposed algorithm is that the feature weights are estimated from local patterns approximated by a locally linear hyperplane, and thus we call the proposed algorithm LH-RELIEF, or LHR for short. It is shown that the proposed feature weighting scheme achieves good performance when combined with standard classification models, such as the support vector machine (SVM), naive Bayes (NB) [14], *k*-nearest neighbors (KNN), linear discriminant analysis (LDA) [15] and hyperplane *k*-nearest neighbor (HKNN) [16]. The superior classification accuracy and excellent robustness to data heavily contaminated by noise make the proposed method promising for use in bioinformatics, where data are severely degraded by background artefacts owing to sampling bias or the high degree of redundancy, such as in the simultaneous parallel sequencing of large/huge numbers of genes.

The advantages of our method are as follows: (1) The gene selection process considers the discriminative power of multiple similar genes that are conditional on their linear combinations. This allows joint interactions between genes to be fully incorporated to reflect the importance of similar genes; (2) LHR assigns weights to genes and thus allows the selection of important genes that can accurately classify samples; (3) Using the genes selected by LHR, classic classifiers including NB, LDA, SVM, HKNN and KNN achieved accuracy comparable or even superior to that reported in the literature. This confirms that incorporating interactions among similar genes in feature weight estimation under local linear assumptions not only conveys information about the underlying bio-molecular reaction mechanisms, but also provides high gene selection accuracy.

## Results and discussion

To evaluate the performance of the proposed LHR, we conducted extensive experiments on different datasets. First, we performed experiments on synthetic data from the well-known Fermat’s spiral problem [17]. We then tested it on nine medium to large benchmark microarray datasets, all of which have been used to investigate the relationship between cancers and gene expression.

### Evaluation methods

In this study, we tested the performance of the proposed LHR by combining it with standard classifiers, including NB, KNN, SVM, and HKNN [16]. We applied leave-one-out cross-validation (LOOCV) or 10-fold cross-validation (CV) to evaluate classification accuracy. LOOCV provides an unbiased estimate of the generalization error for stable classifiers such as KNN. Using LOOCV, each sample in the dataset was predicted by the model built from the rest of the samples, and the accuracy of each prediction was included in the final measurement. Using the 10-fold CV scheme, the dataset was randomly divided into ten equal subsets. At each turn, nine subsets were used to construct the model while the remaining subset was used for prediction. The average accuracy over the 10 iterations was recorded as the final measurement. For classifiers with tuning parameters (such as the SVM), the optimal parameters were first estimated with 5-fold CV using the training data and then used in the modeling. To simplify the comparison, some of the accuracy results were taken from the literature.
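The two evaluation schemes differ only in the number of folds, which the following minimal sketch tries to make concrete. It is an illustration rather than the evaluation code used in the study: the 1-nearest-neighbor classifier, the toy Gaussian data, and the function names `one_nn_predict` and `kfold_accuracy` are all assumptions introduced here.

```python
import numpy as np

def one_nn_predict(X_train, y_train, X_test):
    """Predict labels with a plain 1-nearest-neighbor rule (Euclidean distance)."""
    # Pairwise squared distances, shape (n_test, n_train).
    d2 = ((X_test[:, None, :] - X_train[None, :, :]) ** 2).sum(axis=2)
    return y_train[d2.argmin(axis=1)]

def kfold_accuracy(X, y, n_folds, seed=0):
    """Average accuracy over n_folds random folds; n_folds == len(y) gives LOOCV."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), n_folds)
    scores = []
    for test_idx in folds:
        train_mask = np.ones(len(y), dtype=bool)
        train_mask[test_idx] = False  # hold out the current fold
        pred = one_nn_predict(X[train_mask], y[train_mask], X[test_idx])
        scores.append(np.mean(pred == y[test_idx]))
    return float(np.mean(scores))

# Toy data: two well-separated Gaussian blobs (illustrative only).
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(-3, 1, (30, 2)), rng.normal(3, 1, (30, 2))])
y = np.array([0] * 30 + [1] * 30)

acc_10fold = kfold_accuracy(X, y, n_folds=10)
acc_loocv = kfold_accuracy(X, y, n_folds=len(y))
```

Setting `n_folds` equal to the number of samples makes every fold a single held-out sample, which is exactly the LOOCV scheme described above.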

### Parameter settings

LHR takes two parameters: the number of nearest neighbors (*k*) and the regularization constant (*λ*). The choice of *k* depends on the sample size. For small samples, *k* should be small, such as 3 or 5, whereas for large samples, *k* should be set to a larger value, such as 10 or 20. Performance generally improves as *k* increases; however, beyond a certain threshold, larger values of *k* may not lead to any further improvement [18]. As a rule of thumb, *k* can be set to an odd number such as 7. *λ* keeps the matrix to be inverted from becoming singular and is generally a tiny constant. In our experiments, we set *λ*=10^{-3}.

### Synthetic experiments on Fermat’s spiral problem

In the first experiment, we tested the performance of the proposed method on the well-known Fermat’s spiral problem. The test dataset consists of two classes with 200 samples for each class. The labels of the spiral are completely determined by its first two features. The shape of the Fermat’s spiral distribution is shown in Figure 1(a). Heuristically, the label of a sample can easily be inferred from its local neighbors. Classification based on local information therefore gives a more accurate result than prediction (or classification) based on global measurement, since the latter is sensitive to noise degradation. To test the stability and robustness of LHR, irrelevant features following the standard normal distribution were added to the spiral for classification testing. The dimensions of the irrelevant features were set to {0,1000,2000,3000,4000,5000,6000,7000,8000,9000,10000}. To compare the ability to recover informative features, both the I-RELIEF and LOGO algorithms were also used because of their intrinsic closeness to LHR. The three feature weighting schemes were first applied to rank the importance of the features. Only the top five ranked features were retained to test the robustness of the feature selection schemes under noise contamination. Performance comparisons were conducted on the truncated dataset using five classic classifiers: SVM, LDA, NB, KNN, and HKNN. For each experiment, both 10-fold CV and LOOCV were used to evaluate the classification accuracy. To eliminate statistical variations, we repeated the experiments ten times on each dataset and recorded the average classification errors. The detailed numerical results are given in Tables 1 and 2 for 10-fold CV and LOOCV, respectively. To visualize the results, box plots of the accuracy distributions under 10-fold CV and LOOCV are shown in Figure 1(b) and (c), respectively. Each plot represents the classification accuracy for a single dataset.
Figure 1(b) shows the 10-fold CV accuracy for each of the five classifiers against the dimensions of the noisy features. Figure 1(c) shows the LOOCV accuracy values against the dimensions of the noisy features. We use dark colors to denote the accuracy results achieved using I-RELIEF and LOGO, while a light color is used for those by LHR. In most cases, the performance of LHR coupled with various classifiers is superior to that of both I-RELIEF and LOGO, and thus the corresponding box plot lies above the ones for I-RELIEF and LOGO.
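A dataset of the kind described above can be generated along the following lines. This is a hedged sketch: the text does not give the spiral parameterization, so the two-arm form *r* = ±√*θ*, the angle range, and the function name `make_spiral` are assumptions made here for illustration.

```python
import numpy as np

def make_spiral(n_per_class=200, n_noise=1000, seed=0):
    """Two-arm Fermat spiral (r = +/- sqrt(theta)) plus irrelevant N(0, 1) features."""
    rng = np.random.default_rng(seed)
    # The angle range below is an assumption; the source does not specify it.
    theta = rng.uniform(0.0, 3.0 * np.pi, size=n_per_class)
    r = np.sqrt(theta)
    arm = np.column_stack([r * np.cos(theta), r * np.sin(theta)])
    informative = np.vstack([arm, -arm])  # second class is the mirrored arm
    # Irrelevant features drawn from the standard normal distribution.
    noise = rng.standard_normal((2 * n_per_class, n_noise))
    X = np.hstack([informative, noise])
    y = np.array([0] * n_per_class + [1] * n_per_class)
    return X, y

X, y = make_spiral(n_noise=1000)
```

Only the first two columns of `X` carry label information; varying `n_noise` over the dimensions listed above reproduces the structure of the robustness test, though not necessarily its exact data.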

The line graph of the average performance confirms that the proposed method is more robust to noise than I-RELIEF and LOGO. In both CV experiments, we observed that the performance of the three methods was very similar when the dimension of the irrelevant features was small. For example, with zero irrelevant features, i.e., no noisy features, the classification results of the five classifiers were very similar. The average accuracy is 75.2% for LHR and 75.4% for 10-fold CV, 72.3% for LHR and 72.0% for LOOCV. However, as the dimension of the irrelevant features increases, the performance of both I-RELIEF and LOGO is severely degraded by the noisy features. In comparison, the performance of LHR is very stable and superior to that of the other combinations. In both experiments, the overall accuracy of LHR is better than that of I-RELIEF and LOGO. We also observed that the accuracies obtained with LOGO, when combined with the five classifiers, had a small variance. This property implies that LOGO can derive features that are less dependent on the classification model, and thus less redundant, than those derived by LHR and I-RELIEF.

### Empirical large/huge microarray datasets

In the second experiment, we tested the performance of the proposed algorithm on nine binary microarray datasets. The benchmark datasets, which have been widely used to test a variety of algorithms, are all related to human cancers, including the central nervous system, colorectal, diffuse large B-cell lymphoma, leukemia, lung, and prostate tumors. Characteristics of the datasets are summarized in Table 3.

We note that most of the test datasets have small sample sizes (fewer than 100). This makes it difficult to evaluate the performance of classifiers using standard *k*-fold CV schemes. In this experiment, the LOOCV method was used instead to estimate the accuracy of the classifiers. Each sample in the dataset was predicted by a classifier constructed using the rest of the samples. To assess the generality of the selected informative genes, classic classifiers including LDA, KNN, NB, HKNN and SVM were tested on the selected genes. The experimental results are summarized in Table 4. Note that some of the results were taken directly from the literature.

For individual datasets, LHR outperformed or achieved comparable performance to the best result reported in the literature. For the CNS data, LHR-SVM, LHR-LDA and LHR-HKNN achieved superior performance with almost 100% accuracy, which is much higher than the second best performance by k-TSP [19]. For the colon data, although the accuracy of the LHR-based classifiers is worse than that of BMSF-SVM, IVGA-SVM and LOGO, the accuracies of all five classifiers are similar. This implies that the selected genes are very robust to the choice of classifier. Similar results are observed on the DLBCL, prostate2 and prostate3 datasets. For the GCM, leukemia, lung and prostate1 datasets, the LHR-based classifier was ranked either first or second. The selected genes tested by the five classifiers show similar performance on the leukemia, lung and prostate1 datasets. For the prostate2 data, BMSF-SVM realized remarkably good accuracy, although the results using the other three classifiers with BMSF feature selection are less impressive. LOGO also performed well, yet its average is below that of LHR. In comparison, the performance using LHR feature selection is fairly stable. For the prostate3 data, LOGO-based classifiers performed very well, while the LHR-based ones were slightly less accurate than the top ones. Compared with LOGO in terms of the ability to select informative genes, the proposed algorithm achieved comparable performance, reaching a classification accuracy of 97.39%, slightly below the 97.61% of LOGO.

When considering the average accuracy for each algorithm across all cancer datasets, the methods with the highest average accuracy are LOGO-HKNN, BMSF-SVM, LHR-KNN/LOGO-KNN, LHR-SVM and LHR-HKNN. The proposed scheme has a slightly lower average accuracy than BMSF-SVM and LOGO-HKNN, but a higher accuracy than the others. However, the values for mean ± standard deviation of the averaged accuracy are 96.65±0.725 for LHR, 97.61±1.5 for LOGO and 94.88±2.191 for BMSF. This shows that the proposed LHR outperforms both LOGO and BMSF in terms of overall accuracy as well as confirming its excellent stability in terms of the choice of classification method.

### Comparison with standard feature selection methods

For comparison with other feature selection models, eleven standard techniques were tested as well as the proposed LHR. The selected techniques include *t*-statistic (*t*-stat), twoing rule (TR), information gain (IG), Gini index (Gini), max minority (MaxM), sum minority (SumM), sum of variances (SumV), one-dimensional support vector machine (OSVM), minimum redundancy maximum relevance (mRMR) [27] and I-RELIEF [28]. The code for the first eight schemes is available through RankGene at http://genomics10.bu.edu/yangsu/rankgene. The code for mRMR is available at http://penglab.janelia.org/proj/mRMR/, where two implementations of mRMR, namely MID and MIQ, are provided. The I-RELIEF package is available at http://plaza.ufl.edu/sunyijun/ [28].

It has been suggested by the author in [25, 27] that accurate discretization could improve the performance of mRMR. The author also reported consistent results when the expression values are transformed into 2 or 3 states using *μ*±*k* *σ* with *k* ranging from 0.5 to 2, and where *μ* and *σ* are gene specific mean and standard deviation, respectively (http://penglab.janelia.org/proj/mRMR/FAQ_mrmr.htm). In our experiments, we followed the transformation rule suggested in [25] to simplify the comparison. Expression values greater than *μ*+*σ* were set to 1; values between *μ*-*σ* and *μ*+*σ* were set to 0; and values less than *μ*-*σ* were set to -1.
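The three-state transformation described above can be sketched as follows. The layout (samples as rows, genes as columns), the use of the population standard deviation, and the function name `discretize_three_states` are assumptions made for this illustration; the toy matrix is not real expression data.

```python
import numpy as np

def discretize_three_states(expr):
    """Map each gene (column) to {-1, 0, 1} using its own mean and standard deviation."""
    mu = expr.mean(axis=0)     # gene-specific mean
    sigma = expr.std(axis=0)   # gene-specific (population) standard deviation
    states = np.zeros(expr.shape, dtype=int)
    states[expr > mu + sigma] = 1    # over-expressed relative to the gene's own scale
    states[expr < mu - sigma] = -1   # under-expressed relative to the gene's own scale
    return states

# Toy matrix: 5 samples (rows) by 2 genes (columns), illustrative only.
expr = np.array([[-10.0, 10.0],
                 [0.0, 10.0],
                 [0.0, 10.0],
                 [0.0, 10.0],
                 [10.0, 50.0]])
states = discretize_three_states(expr)
```

Because the thresholds are computed per gene, genes measured on very different scales end up on the same three-level coding.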

In each experiment, a feature selection scheme was first used to select the informative genes, followed by classification tests on the truncated dataset. For a fair comparison, we set the number of informative genes for each feature selection scheme to be the same as that determined by LHR, which usually finds a relatively small number of genes (fewer than 30). This allowed us to examine whether the limited number of informative genes generated by LHR had more discriminative power than those generated by the other methods.

The LOOCV accuracy for each of the five classification algorithms (LDA, NB, SVM, KNN, and HKNN) is reported in Table 5. The number of genes selected by LHR is listed in the second column and the same number is used to create the truncated data for the other feature selection schemes. In most cases, the variables selected by LHR achieved the best or second-best LOOCV accuracy when coupled with the five classifiers. To investigate the extent of the information conveyed by the selected genes, we created a box plot of the LOOCV accuracy for the five classification algorithms (LDA, SVM, KNN, NB, and HKNN) on each of the tested datasets in Figure 2. A remarkable characteristic of the proposed LHR is its low dependence on the classifiers, which gives the corresponding box plot a narrower spread than those of the other methods, as shown in Figure 2. This property implies that the genes selected by LHR are highly informative, and thus the discriminative performance is robust to the choice of classifier.

### Computation complexity

Solving the LHR algorithm involves a quadratic minimization problem (Eq. (4)) for each sample. Therefore, it incurs a much higher computational cost than linear methods such as I-RELIEF and LOGO. Because the matrix **H**^{T}**WH** in Eq. (4) is positive-definite and small, the minimization problem of Eq. (4) can be solved in polynomial time (*O*(*n*^{3}) for *n* NNs of a sample). Thus, the complexity of each iteration is approximately *O*(*n*^{3}∗*N*) times higher than that of I-RELIEF.

## Conclusions

In this paper, we proposed a new feature weighting scheme to overcome the common drawbacks of the RELIEF family. The nearest miss and hit subsets are approximated by constructing a local hyperplane. Feature weight updating is then achieved by measuring the margin between the sample and its hyperplane in a general RELIEF framework. The main contribution of the new variation is that the margin is more robust to noise and outliers than those of earlier works. Therefore, the feature weights can characterize the local structure more accurately. Experimental results on both synthetic and real-world microarray datasets validated our findings when combining the proposed method with five classic classifiers. The proposed weighting scheme is superior in terms of classification error on most test datasets. Extensive experiments demonstrated that the proposed scheme has three remarkable characteristics: 1) high accuracy in classification, 2) excellent robustness to noise and 3) good stability with respect to various classification algorithms.

## Methods

### RELIEF

The RELIEF algorithm has been successfully applied in feature weighting owing to its simplicity and effectiveness [12, 13]. The main idea of RELIEF is the iterative adjustment of feature weights according to their ability to discriminate among neighboring patterns. Mathematically, suppose that **X** = {**x**_{1}, **x**_{2}, ⋯, **x**_{N}}_{d×N} is a randomly selected sample matrix of binary class data where each sample **x** has *d* dimensions, **x** = {*x*_{1}, *x*_{2}, ⋯, *x*_{d}}. One can estimate the two nearest neighbors, where one is from the same class (called *the nearest hit* or NH) and the other is from a different class (called *the nearest miss* or NM). Then, weight *w*_{f} of the *f*-th (*f* = 1, 2, ⋯, *d*) feature is updated by the heuristic estimation:

$$w_f = w_f + |x_f - \mathit{NM}_f| - |x_f - \mathit{NH}_f| \qquad (1)$$

where *NM*_{f} and *NH*_{f} denote the *f*-th coordinate values of vectors *NM* and *NH*, respectively. Since no exhaustive or iterative search is needed for RELIEF updates, this scheme is very efficient in processing data with huge dimensions. Thus, it is particularly promising for large-scale problems such as the analysis of microarray data [3, 12, 27]. Sun generalized the update scheme to compute the maximum expected margin **E**[*ρ*(**w**)] by scaling the features [11, 12] to overcome the drawbacks of RELIEF, such as the lack of outlier detection and inaccurate updates:

with $\mathbf{z}_n = \sum_{\mathbf{x}_n \in \mathit{NM}(\mathbf{x}_i)} P(\mathbf{x}_n = \mathit{NM}(\mathbf{x}_i) \mid \mathbf{w})\,|\mathbf{x}_n - \mathbf{x}_i| - \sum_{\mathbf{x}_n \in \mathit{NH}(\mathbf{x}_i)} P(\mathbf{x}_n = \mathit{NH}(\mathbf{x}_i) \mid \mathbf{w})\,|\mathbf{x}_n - \mathbf{x}_i|$, where *NM*(**x**_{i}) = {**x**_{n} : 1 ≤ *n* ≤ *N*, *y*_{i} ≠ *y*_{n}} and *NH*(**x**_{i}) = {**x**_{n} : 1 ≤ *n* ≤ *N*, *y*_{i} = *y*_{n}} are index sets of the nearest miss and the nearest hit for the sample **x**_{i}. *N* is the sample size. *P*(**x**_{n} = *NM*(**x**_{i}) | **w**) (or *P*(**x**_{n} = *NH*(**x**_{i}) | **w**)) is the probability of a sample **x**_{n} being in the set *NM*(**x**_{i}) (or *NH*(**x**_{i})) in the feature space scaled by weights **w**. Though the probability distributions are initially unknown, they can be estimated through kernel density estimation [29]. The authors called this method I-RELIEF and showed that it achieved significant performance improvement over the traditional models. Classification of a feature scaled dataset achieved higher accuracy than standard techniques such as the SVM [1, 2, 30] and NN model [31]. Feature weighting is also robust to noisy features. To obtain a sparse and economic feature weighting, Sun incorporated the *l*_{1} penalty into the optimization of I-RELIEF and named the algorithm LOGO (*fit locally and think globally*) [12]. Extensive experiments have demonstrated that LOGO can accurately grasp the intrinsic structure of the data and match nicely with classic classification models.
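The basic RELIEF update described at the start of this subsection, in which each feature weight grows by the coordinate distance to the nearest miss and shrinks by the coordinate distance to the nearest hit, can be sketched as below. This is a minimal illustration of classic RELIEF only, not of I-RELIEF or LOGO. The L1 neighbor search, the row-per-sample layout (the transpose of the *d*×*N* matrix used in the text), the function name `relief_weights`, and the toy data are assumptions made here.

```python
import numpy as np

def relief_weights(X, y, n_iter=100, seed=0):
    """Basic RELIEF: w_f += |x_f - NM_f| - |x_f - NH_f| for randomly drawn samples."""
    rng = np.random.default_rng(seed)
    n_samples, n_features = X.shape
    w = np.zeros(n_features)
    for _ in range(n_iter):
        i = rng.integers(n_samples)
        dist = np.abs(X - X[i]).sum(axis=1)  # L1 distance to every sample
        dist[i] = np.inf                     # never pick the sample itself
        same = (y == y[i])
        hit = np.where(same, dist, np.inf).argmin()    # nearest hit (same class)
        miss = np.where(~same, dist, np.inf).argmin()  # nearest miss (other class)
        w += np.abs(X[i] - X[miss]) - np.abs(X[i] - X[hit])
    return w / n_iter

# Toy data: feature 0 separates the classes, feature 1 is pure noise.
rng = np.random.default_rng(1)
y = np.repeat([0, 1], 50)
X = np.column_stack([4.0 * y + rng.normal(0, 0.5, 100), rng.normal(0, 1, 100)])
w = relief_weights(X, y, n_iter=200)
```

On data like this, the informative feature should receive a clearly larger weight than the noise feature, since its nearest-miss distances are large while its nearest-hit distances stay small.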

However, the expectation in Eq. (2) is obtained by averaging the nearest neighbors. Therefore, feature weight estimation may be less accurate if the samples contain many outliers or most of the features are irrelevant. In both cases, the distance between the tested sample and its nearest neighbor is large. It follows that such an averaging operation introduces a large bias into the margin estimation. Although the influence of abnormal samples can be reduced by introducing kernel distribution estimation [11, 12], this in turn introduces additional free parameters. Moreover, probability estimation via kernel approximation is sensitive to the sample size [28]. This limits its use in empirical applications such as the analysis of microarray data, in which the number of samples is notoriously much smaller than the number of features [32]. In this paper, we propose using a local hyperplane to approximate the set of the nearest hit and miss, and then estimate the feature weight by maximizing the expected margin defined by the hyperplane. The advantage of this approximation is that the hyperplane is more robust to noisy feature degradation than averaging all the neighbors [11–13].

### Local hyperplane conditional on feature weight

Processing high-dimensional data by mapping the data of interest into an embedded non-linear manifold within the higher-dimensional space has attracted wide interest in machine learning. The local hyperplane approximation shares similar merits with local linear embedding methods [12, 26, 33]. It assumes that the samples’ structure is locally linear and therefore each sample lies on a local linear hyperplane spanned by its nearest neighbors. Mathematically, let us assume that the feature weights $\mathbf{w} \doteq \{w_1, w_2, \cdots, w_I\}$ are known in advance. Thus, sample **x** can be represented by a local hyperplane of class *c*, conditional on the feature weight **w**, as:

where **H** is an *I*×*n* matrix comprising *n* NNs of sample **x**: **H** = {**h**_{1}, **h**_{2}, ⋯, **h**_{n}}, with **h**_{i} being the *i*-th nearest neighbor (called the *prototype*) of class *c*. **W** is a diagonal matrix with diagonal element *w*_{i} being the weight of the *i*-th feature. The parameters **α** = (*α*_{1}, …, *α*_{n})^{T} are the weights of the prototypes {**h**_{i}, *i* = 1, 2, …, *n*}. These can be viewed as the spanning coefficients of subspace *LH*_{c}(**x**). Therefore, the hyperplane can be represented as: {· | **Hα** = *α*_{1}**Wh**_{1} + *α*_{2}**Wh**_{2} + … + *α*_{n}**Wh**_{n}}. The projection *LH*_{c}(**x**) of **x** onto the hyperplane can be computed by minimizing the distance between sample **x** and the hyperplane, both of which are dependent on the feature weight. Therefore, the value of **α** can be estimated as:

The regularization parameter *λ* is used to emphasize the “smoothing” effect of the optimum solution, which degenerates to a unit vector in certain extreme cases.
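The coefficient estimate described above can be illustrated with a small numerical sketch. It assumes, as one plausible reading consistent with the **H**^{T}**WH** matrix mentioned in the complexity discussion, that **α** minimizes the ridge-regularized weighted least-squares objective (**x** − **Hα**)^{T}**W**(**x** − **Hα**) + *λ***α**^{T}**α**, whose closed form is **α** = (**H**^{T}**WH** + *λ***I**)^{-1}**H**^{T}**Wx**. The exact objective used by the authors is given in the additional files and may differ; the function name `hyperplane_coefficients` and the toy prototypes are hypothetical.

```python
import numpy as np

def hyperplane_coefficients(x, H, w, lam=1e-3):
    """Ridge-regularized coefficients alpha for projecting x onto the span of H's columns.

    x: (I,) sample; H: (I, n) prototypes as columns; w: (I,) non-negative feature weights.
    Assumed objective: (x - H a)^T W (x - H a) + lam * a^T a, with W = diag(w).
    """
    W = np.diag(w)
    A = H.T @ W @ H + lam * np.eye(H.shape[1])  # n x n, positive definite for lam > 0
    b = H.T @ W @ x
    return np.linalg.solve(A, b)                # O(n^3) dense solve

# Toy check: x is an exact combination of the two prototypes.
H = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
x = H @ np.array([0.3, 0.7])
alpha = hyperplane_coefficients(x, H, w=np.ones(3), lam=1e-8)
projection = H @ alpha
```

With a very small *λ* the recovered coefficients are close to the true combination; a larger *λ* shrinks them toward zero, which is the stabilizing effect that keeps the *n*×*n* system invertible when the prototypes are nearly collinear.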

We propose using a hyperplane to represent the set of the nearest miss *NM*(**x**) and nearest hit *NH*(**x**) for a given sample **x**. The advantage of the representation is the robust characterization of the local sample patterns. Then the distances between the sample and its *NH* (or *NM*) set can be estimated from the local hyperplane rather than averaging across all samples within the set. Therefore, we redefine the margin for a sample **x**_{n} as $\rho_n \doteq d(\mathbf{x}_n - \mathit{LH}_{\mathit{NM}}(\mathbf{x}_n)) - d(\mathbf{x}_n - \mathit{LH}_{\mathit{NH}}(\mathbf{x}_n))$. The feature weights are then estimated by maximizing the total margin:

where vector **z**_{n} is defined as: $\mathbf{z}_n = \frac{1}{N}\sum_{n=1}^{N}\left(\sum_{i=1}^{I} |x_n^{(i)} - \boldsymbol{\alpha} H_{\mathit{NM}}^{(i)}(\mathbf{x}_n)| - \sum_{i=1}^{I} |x_n^{(i)} - \boldsymbol{\beta} H_{\mathit{NH}}^{(i)}(\mathbf{x}_n)|\right)$, where **H**_{NM}(**x**_{n}) and **H**_{NH}(**x**_{n}) are the nearest neighbors of the set of the nearest miss and hit of sample **x**_{n}. **α**_{n} and **β**_{n} are the coefficients for spanning hyperplanes $\mathit{LH}_{\mathit{NM}}^{(n)}$ and $\mathit{LH}_{\mathit{NH}}^{(n)}$. **w** is a vector with its *i*-th element **w**(*i*) being the weight of the *i*-th feature, for *i* = 1, 2, …, *I*. To solve the optimization problem of Eq. (5), the parameters **α**_{n} and **β**_{n}, which are dependent on the nearest neighbors, must be estimated. The main problem with this estimation, however, is that the nearest neighbors of a given sample are unknown before learning. In the presence of many thousands of irrelevant features, the nearest neighbors defined in the original feature space can be completely different from those in the induced (weighted) feature space. To address these difficulties, we use an iterative algorithm, similar to the Expectation Maximization algorithm and I-RELIEF [11], to estimate the feature weights. The detailed numerical solution is provided in Additional file 1: S.1. The pseudo-code for LH-RELIEF is summarized in Additional file 2: S.2.
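For a fixed weight vector, the per-feature margin of one sample can be sketched as below. This is not the authors' implementation (their numerical solution and pseudo-code are in the additional files). It assumes a weighted L1 distance for the neighbor search, as in I-RELIEF, and a ridge-regularized closed form for the spanning coefficients; the helper names `ridge_coeffs`, `nearest_columns` and `margin_vector` and the toy data are hypothetical. The iterative re-estimation of **w** is deliberately left out.

```python
import numpy as np

def ridge_coeffs(x, H, w, lam=1e-3):
    """Assumed closed form: (H^T W H + lam I)^{-1} H^T W x with W = diag(w)."""
    HW = H.T * w  # equals H.T @ diag(w) by broadcasting
    return np.linalg.solve(HW @ H + lam * np.eye(H.shape[1]), HW @ x)

def nearest_columns(x, candidates, w, k):
    """The k rows of `candidates` closest to x in w-weighted L1 distance, as columns."""
    dist = (np.abs(candidates - x) * w).sum(axis=1)
    return candidates[np.argsort(dist)[:k]].T

def margin_vector(X, y, n, w, k=3, lam=1e-3):
    """Per-feature margin |x_n - H_NM alpha| - |x_n - H_NH beta| for sample n."""
    x = X[n]
    others = np.arange(len(y)) != n  # exclude the sample itself from its hits
    H_nh = nearest_columns(x, X[others & (y == y[n])], w, k)
    H_nm = nearest_columns(x, X[y != y[n]], w, k)
    alpha = ridge_coeffs(x, H_nm, w, lam)  # coefficients for the miss hyperplane
    beta = ridge_coeffs(x, H_nh, w, lam)   # coefficients for the hit hyperplane
    return np.abs(x - H_nm @ alpha) - np.abs(x - H_nh @ beta)

# Toy data: class 0 lies along the first axis, class 1 in an orthogonal direction.
X = np.array([[1.0, 0.0, 0.0], [2.0, 0.0, 0.0], [3.0, 0.0, 0.0],
              [0.0, 1.0, 1.0], [0.0, 2.0, 2.0], [0.0, 3.0, 3.0]])
y = np.array([0, 0, 0, 1, 1, 1])
w = np.ones(3)
z = margin_vector(X, y, n=0, w=w, k=2)
margin = float(w @ z)
```

In this toy case the hit hyperplane reproduces the sample almost exactly while the miss hyperplane cannot represent it at all, so the margin is positive and concentrated on the first feature.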

## Availability of supporting data

The Matlab code tested on the Fermat’s spiral and the cancer microarray datasets is available at http://sunflower.kuicr.kyoto-u.ac.jp/~ruan/LHR/.

## References

Duan K-BB, Rajapakse JC, Wang H, Azuaje F: Multiple SVM-RFE for gene selection in cancer classification with expression data. IEEE Trans Nanobiosci. 2005, 4 (3): 228-234. 10.1109/TNB.2005.853657.

Guyon I, Weston J, Barnhill S, Vapnik V: Gene selection for cancer classification using support vector machines. Mach Learn. 2002, 46: 389-422. 10.1023/A:1012487302797.

Ding C, Peng H: Minimum redundancy feature selection from microarray gene expression data. J Bioinform Comput Biol. 2005, 3 (2): 185-205. 10.1142/S0219720005001004.

Guyon I: An introduction to variable and feature selection. J Mach Learn Res. 2003, 3: 1157-1182.

Huang CJ, Yang DX, Chuang YT: Application of wrapper approach and composite classifier to the stock trend prediction. Expert Syst Appl. 2008, 34 (4): 2870-2878. 10.1016/j.eswa.2007.05.035.

Koller D, Sahami M: Toward optimal feature selection. Proceedings of the Thirteenth International Conference on Machine Learning. Edited by: Saitta L. 1996, Morgan Kaufmann Press, 284-292.

Jain AK, Duin RPW, Mao J: Statistical pattern recognition: a review. IEEE Trans Pattern Anal Mach Intell. 2000, 22 (1): 4-37. 10.1109/34.824819.

Kwak N, Choi C-H: Input feature selection by mutual information based on parzen window. IEEE Trans Pattern Anal Mach Intell. 2002, 24: 1667-1671. 10.1109/TPAMI.2002.1114861.

Brown G: Some thoughts at the interface of ensemble methods and feature selection. Multiple Classifier Systems. Edited by: Neamat EG, Josef K, Fabio R. 2010, Springer Press, 314-314.

Brown G: An information theoretic perspective on multiple classifier systems. Multiple Classifier Systems. Edited by: Jón B, Josef K, Fabio R. 2009, Springer Press, 344-353.

Sun Y: Iterative relief for feature weighting: Algorithms, theories, and applications. IEEE Trans Pattern Anal Mach Intell. 2007, 29 (6): 1035-1051.

Sun Y, Todorovic S, Goodison S: Local-learning-based feature selection for high-dimensional data analysis. IEEE Trans Pattern Anal Mach Intell. 2010, 32 (9): 1610-1626.

Kononenko I: Estimating attributes: analysis and extensions of RELIEF. European Conference on Machine Learning. Edited by: Francesco B, Luc D-R. 1994, Berlin Heidelberg: Springer Press, 171-182.

Li T, Zhang C, Ogihara M: A comparative study of feature selection and multiclass classification methods for tissue classification based on gene expression. Bioinformatics. 2004, 20 (15): 2429-2437. 10.1093/bioinformatics/bth267.

Wu MC, Zhang L, Wang Z, Christiani DC, Lin X: Sparse linear discriminant analysis for simultaneous testing for the significance of a gene set/pathway and gene selection. Bioinformatics. 2009, 25 (9): 1145-1151. 10.1093/bioinformatics/btp019.

Vincent P, Bengio Y: K-local hyperplane and convex distance nearest neighbor algorithms. Advances in Neural Information Processing Systems. Edited by: Thomas G, Sue B, Zoubin G. 2001, MIT Press, 985-992.

Sun Y, Wu D: A relief based feature extraction algorithm. SDM. Edited by: Apte C, Park H, Wang K, Zaki J-M. 2008, SIAM Press, 188-195.

Hall P, Park BU, Samworth RJ: Choice of neighbor order in nearest-neighbor classification. Ann Stat. 2008, 36 (5): 2135-2152. 10.1214/07-AOS537.

Tan AC, Naiman DQ, Xu L, Winslow RL, Geman D: Simple decision rules for classifying human cancers from gene expression profiles. Bioinformatics. 2005, 21 (20): 3896-3904. 10.1093/bioinformatics/bti631.

Geman D, Christian A, Naiman DQ, Winslow RL: Classifying gene expression profiles from pairwise mRNA comparisons. Stat Appl Genet Mol Biol. 2004, 3 (1): 1071-1077.

Chopra P, Lee J, Kang J, Lee S: Improving cancer classification accuracy using gene pairs. PloS One. 2010, 5 (12): e14305. 10.1371/journal.pone.0014305.

Dagliyan O, Uney Y-F, Kavakli I-H, Turkay M: Optimization based tumor classification from microarray gene expression data. PloS One. 2011, 6 (2): e14579. 10.1371/journal.pone.0014579.

Zheng CH, Chong YW, Wang HQ: Gene selection using independent variable group analysis for tumor classification. Neural Comput Appl. 2011, 20 (2): 161-170. 10.1007/s00521-010-0513-2.

Zhang JG, Deng HW: Gene selection for classification of microarray data based on the Bayes error. BMC Bioinformatics. 2007, 8 (1): 370-378. 10.1186/1471-2105-8-370.

Zhang H, Wang H, Dai Z, Chen M-s, Yuan Z: Improving accuracy for cancer classification with a new algorithm for genes selection. BMC Bioinformatics. 2012, 13 (1): 1-20. 10.1186/1471-2105-13-1.

Roweis ST, Saul LK: Nonlinear dimensionality reduction by locally linear embedding. Science. 2000, 290 (5500): 2323-2326. 10.1126/science.290.5500.2323.

Peng YH: A novel ensemble machine learning for robust microarray data classification. Comput Biol Med. 2006, 36: 553-573. 10.1016/j.compbiomed.2005.04.001.

Girolami M, He C: Probability density estimation from optimally condensed data samples. IEEE Trans Pattern Anal Mach Intell. 2003, 25: 1253-1264. 10.1109/TPAMI.2003.1233899.

Christopher A, Andrew M, Stefan S: Locally weighted learning. Artif Intell Rev. 1997, 11: 11-73. 10.1023/A:1006559212014.

Statnikov A, Wang L, Aliferis CF: A comprehensive comparison of random forests and support vector machines for microarray-based cancer classification. BMC Bioinformatics. 2008, 9: 319-328. 10.1186/1471-2105-9-319.

Shakhnarovich G, Darrell T, Indyk P: Nearest-neighbor methods in learning and vision. IEEE Trans Neural Netw. 2008, 19 (2): 377-

Fraley C, Adrian E-R: Model-based clustering, discriminant analysis, and density estimation. J Am Stat Assoc. 2002, 97 (458): 611-631. 10.1198/016214502760047131.

Pan Y, Ge SS, Al Mamun A: Weighted locally linear embedding for dimension reduction. Pattern Recognit. 2009, 42 (5): 798-811. 10.1016/j.patcog.2008.08.024.

## Acknowledgements

The authors would like to thank Dr. Y. Sun for the source code of I-RELIEF and LOGO, and Dr. H.Y. Zhang for her valuable comments on the BMSF method. This work was partially supported by ICR-KU International Short-term Exchange Program for Young Researchers in design and analysis of computational experiments. HM was supported by BGI-SCUT Innovation Fund Project (SW20130803), National Nature Science Foundation of China (61372141) and the Fundamental Research Fund for the Central Universities (2013ZM0079).

## Additional information

### Competing interests

The authors declare that they have no competing interests.

### Authors’ contributions

HM designed the LHR algorithm, participated in the numerical experiments and drafted the manuscript. PY participated in the numerical experiments. MN and TA participated in the design of the study, and TA helped to draft the manuscript. All authors read and approved the final manuscript.

## Rights and permissions

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

## About this article

### Cite this article

Cai, H., Ruan, P., Ng, M. *et al.* Feature weight estimation for gene selection: a local hyperlinear learning approach.
*BMC Bioinformatics* **15**, 70 (2014). https://doi.org/10.1186/1471-2105-15-70

DOI: https://doi.org/10.1186/1471-2105-15-70