Optimal combination of feature selection and classification via local hyperplane based learning strategy
© Cheng et al. 2015
Received: 5 November 2014
Accepted: 29 May 2015
Published: 10 July 2015
Classifying cancers by gene selection is among the most important and challenging procedures in biomedicine. A major challenge is to design an effective method that eliminates irrelevant, redundant, or noisy genes from the classification, while retaining all of the highly discriminative genes.
We propose a gene selection method, called local hyperplane-based discriminant analysis (LHDA). LHDA adopts two central ideas. First, it uses a local approximation rather than global measurement; second, it embeds a recently reported classification model, K-Local Hyperplane Distance Nearest Neighbor(HKNN) classifier, into its discriminator. Through classification accuracy-based iterations, LHDA obtains the feature weight vector and finally extracts the optimal feature subset. The performance of the proposed method is evaluated in extensive experiments on synthetic and real microarray benchmark datasets. Eight classical feature selection methods, four classification models and two popular embedded learning schemes, including k-nearest neighbor (KNN), hyperplane k-nearest neighbor (HKNN), Support Vector Machine (SVM) and Random Forest are employed for comparisons.
The proposed method yielded comparable to or superior performances to seven state-of-the-art models. The nice performance demonstrate the superiority of combining feature weighting with model learning into an unified framework to achieve the two tasks simultaneously.
DNA microarray datasets can simultaneously determine the expression levels of thousands of genes . For application purposes, these gene expression data must then be classified into various categories . Together with classification methods, microarray technology has successfully guided clinical management decisions for individual patients, such as oncology [3, 4]. However, the sample size of the genetic dataset is usually much smaller than the number of genes, which extends into thousands or even tens of thousands . Such limited availability of high-dimensional samples is particularly problematic for standard classification models. Feature selection technology, which seeks to eliminate irrelevant, redundant, and noisy genes while retaining all the highly discriminative genes, presents as an effective means of resolving this problem.
Various feature selection or dimensionality reduction methods have been proposed throughout the past decades. Among the most well-known unsupervised methods is Principal Component Analysis (PCA)  which preserves as much variance in the data as possible. Feature selection techniques can be broadly categorized into three groups; filter, wrapper and hybrid [7, 8]. The filter methods, such as Relief  and Mutual Information , identify feature subsets from the original feature set based on specific evaluation criteria that are independent of a learning algorithm. The wrapper methods use the classifier to evaluate the performance of each subset with a search algorithm. However, filter methods yield poor performance because they ignore classifier interactions, whereas wrapper methods are very computationally expensive. Hybrid methods  combine the advantages of both techniques to achieve nice learning performance with a predetermined learning algorithm and a reduced complexity.
Another type of feature selection model is discriminant analysis, which typically aims to minimize the margin between the inter-class and intra-class distances. For example, Fisher linear discriminant analysis (FLDA) searches for the embedding transformation that maximizes the between-class scatter while minimizing the within-class scatter. Recent research has concentrated on boosting the discriminative potential of these algorithm by exploiting the local data structure. Motivated by the great success of manifold local learning, researchers have proposed localized discriminant models such as locality preserving projections (LPP) , local discriminant embedding (LDE) , marginal Fisher analysis (MFA)  and locally linear discriminant analysis (LLDA) .
To fulfill data mining tasks, feature selection is usually followed by classification or clustering to reveal the intrinsic data structure. Although a few classification methods such as support vector machine (SVM)  could achieve the task of feature selection simultaneously, they are usually performed by separate algorithms. Such loose connection compromises the accuracy of the methods. Recently, some researchers have embedded the classifier into the discriminant analysis, and have reported remarkable experimental results. For example, a local mean-based nearest neighbor discriminant analysis (LM-NNDA) model was designed to construct classification rule in guiding the discriminator . By optimizing a linear discriminant projection based on one nearest-neighbor (1-NN) classification scheme, the authors[18, 19] achieved both high classification accuracy and fast computational speed.
The present paper introduces a novel discriminant analysis model, named local hyperplane-based discriminant analysis (LHDA). This model optimizes the performance by combining feature selection with an effective classification scheme, namely, the K-Local Hyperplane Distance Nearest Neighbor (HKNN) classifier . By minimizing the leave-one-out-cross-validation (LOOCV) error rate within the training phase, LHDA is shown to be optimally matched to the classifier of HKNN. The competitive performance of our method relative to established approaches is demonstrated in extensive experiments on synthetic and empirical datasets.
The advantages of our method are in three aspects: (1) Selection of the informative gene is conditioned on its linear combinations of similar peers, thus fully exploiting their joint discrimination power; (2) Incorporating the feature weighting within classifier learning process yields accurate feature weight and optimal classification performance simultaneously, thus fulfill the two important analysis task in a dynamic and tight way; (3) The superior performance of LHDA over its peers confirms that incorporation of interactions among similar genes in feature weighting estimation under local linear approximation, as well as relating the two tasks of feature selection and classification into an unified model not only revealing the informative genes, but also provides nice classification performance.
Results and discussion
The performance of LHDA was evaluated in extensive experiments on various datasets. The first experiment was conducted on the famous Fermat’s Spiral synthetic data, which demonstrates the accuracy and robustness of LHDR in terms of feature weighting and classification, even when the data are highly degraded by noise. The second experiment was an empirical validation on 13 benchmark UCI datasets , which have low/median feature dimensions. The third experiment was conducted on practical 20 microarray datasets, which are characterized by large feature dimensions. Both the UCI datasets and microarray datasets were extensively tested in machine learning.
Several state-of-art classification algorithm, including KNN , HKNN , SVM with linear (linear-SVM) and radical basis kernel (rbf-SVM) , were employed when comparing performance after feature selecting. Comparisons were also made against the discriminate analysis models LSDA , LDPP [18, 19] and LM-MNDA , and a well-known feature selection method called I-Relief [9, 24]. All four of these established models quantify the importance of features by incorporating local structures. In the final experiment, the algorithm was compared against eight standard feature selection methods combined with independent classification models. The performance of the classifiers was quantified by Leave-One-Out Cross-Validation (LOOCV), 10-fold cross validation (10-fold-CV) and inner Leave-One-Out Cross-Validation loop (inner LOOCV loop). In the LOOCV scheme, each sample in the dataset was predicted by building a model from the remaining samples and recording the accuracy of each model. In 10-fold-CV, the dataset was randomly divided into ten equally sized subsets. Nine of these subsets were used in the model construction and the remaining subset was used for prediction. In order to reduce the over-fitting problem as well as overcoming learning bias, an inner LOOCV scheme was used. Within the framework, each test sample is firstly removed from the dataset, resulting in a new training set. Then the whole learning process is carried out on the training set and tested on the left sample. The procedure is repeated for all tested samples and their averaged performance is calculated to quantify the performance of the learning model.
Synthetic experiment on Fermat’s Spiral
Summary the speed of five local based methods
Irrelevant feature NO.
Experiments on UCI datasets
The second experiment was conducted on 13 datasets downloaded from the UCI Machine Learning Repository . Most of the tested datasets have low-dimensional features and are widely used in variousclassification model evaluation. For each dataset, the aforementioned feature selection methods were firstly used to have the projections, which were then used to scale the raw datasets into the feature space. Four benchmark classification models, including KNN, HKNN, linear-SVM and rbf-SVM, were employed to evaluate the performance of five feature selection methods; LHDA, LDPP , LSDA , LM-NNDA , I-Relief [9, 24]. The results of 10-fold-CV are summarized in Additional file 1: Table S3. The top performers were I-Relief and LSDA coupled to the rbf-SVM classifier, with average accuracies of 88.09 % and 87.78 %, respectively. The performance of LHDA is only marginally below that of I-Relief and LSDA.
However, in the LOOCV evaluation, LHDA trumped the benchmark algorithms, achieving an average accuracy of 84.72 % (see Additional file 1: Table S4). This result is anticipated because our model is optimized to achieve minimization of the LOO errors. The second-best performer was LSDA combined with the classic rbf-SVM (average accuracy =84.70 %). When counting the win/loss/tie number of LHDA over the others, LHDA-HKNN obtained the best performances, which is higher than the others.
To further evaluate LHDA method, we conduct inner LOOCV loop testing on the 13 UCI dataset, the experimental results are summarized in Additional file 1: Table S5. The proposed model of LHDA, coupling with HKNN and rbf-SVM achieved the optimal and suboptimal performance in terms of averaged accuracy. If one counts the win/loss/tie number of LHDA over the others, the proposed LHDA also obtained remarkable performance.
Experiments on microarray datasets
Summary of the tested microarray datasets
We hope to demonstrate that LHDA selects the highly informative and diagnostic genes from each dataset. To this end, we combined LHDA and the benchmark algorithms with various classification models and quantified the information conveyed by the selected genes by the classification accuracy. Similar to the second experiment, the projections were first obtained by four feature selection methods; LDPP , LSDA , LM-NNDA , I-Relief  and the proposed LHDA. In the subsequent classification experiments, the KNN, HKNN and SVM classifiers were applied to the feature-weighted space. Because the number of samples was limited, the performances were evaluated by the LOOCV scheme alone.
The experimental results are summarized in Additional file 1: Table S6. In majority cases, the best results were yielded by the proposed LHDA model. Indeed, the classification accuracy of LHDA was 100 % in 10 of the 20 datasets. LHDA was especially proficient at selecting genes implicated in adenocarcinoma and colon cancer, with respective classification accuracies of 98.68 % and 95.16 % in linear-SVM. Overall, LHDA tested by linear-SVM achieved remarkably high rankings in 11 out of 20 datasets. Moreover, the averaged accuracies after four classifiers for each dataset reflect the accuracy of feature weighting, shown in the last column for the five feature weighing method. The highest accuracy after the five methods on each dataset was highlighted in bold. One may note that the LHDA ranked in the top of eleven times among twenty datasets, demonstrating that the feature weighting obtained could quantify the intrinsic structures adequately.
As shown in the last row of Additional file 1: Table S6, the highest and second-highest average performance was achieved by LHDA coupled to linear-SVM and HKNN, respectively. The top five methods, in order of decreasing average accuracy, were LHDA with linear-SVM (97.82 %), LHDA-HKNN (96.95 %), LSDA-HKNN (96.74 %), LDPP with rbf-SVM (96.60 %), and LHDA with rbf-SVM (96.37 %). The accuracies of these five top-ranking combinations were quite close. The proposed method yielded the highest average accuracy, implying that the discriminative power of LHDA is at least as high as other state-of-the-art methods. In order to further evaluate the proposed method, confusion matrices of the classification results for the aformentioned feature selection methods were drawn, shown in Additional file 1: Table S7-S10.
Comparison with standard feature selection methods
To further demonstrate the accuracy of feature weights obtained by the proposed LHDA, we compared it against eight baseline feature selection models, namely, information gain (IG), twoing rule (TR), Gini index (Gini), sum minority (SumM), sum of variances (SumV), max minority (MaxM), t-statistic (t-test) and one-dimensional support vector machine (OSVM). The algorithm codes for these eight schemes are available through RankGene at http://genomics10.bu.edu/yangsu/rankgene. The proposed LHDA was also compared with two state-of-the-art embedded methods, Support Vector Machine - Recursive Feature Elimination(SVM-RFE)  and Random Forest .
In the two experiments, informative gene subsets were first identified by each feature selection method, and were then evaluated by the four classification models, KNN, HKNN, linear-SVM and rbf-SVM. In the first experiment, the number of informative genes was set to equal to the number found by LHDA. This configuration enables a simple subjective comparison and allows us to investigate the discriminative power given a limited number of informative genes. The LOOCV accuracy is reported in Additional file 1: Table S11. The second row of this table states the number of informative genes found by LHDA. LHDA delivered superior average accuracy performance over the other tested methods, and significantly outperformed the second-most accurate method. Again, the highest and next-highest performance was achieved by LHDA coupled to two of the four classifiers.
To test the performance of the embedding models, the classical methods of SVM-RFE and Random Forest were employed for comparison. The experimental results are summarized in Additional file 1: Table S12. One may note that the performance of the three embedding methods are very close. The averaged accuracies of feature weighting after the three classifiers on each dataset were reported in the last column. It suggested that the LHDA ranked in the top of ten times among twenty datasets. When testing the feature weights obtained from LHDA by classification models of HKNN and linear-SVM, both of which achieved remarkably high rankings in 12 out of 20 datasets. In comparison, the linear SVM-RFE archived the second rank of 9 out of 20 datasets. Finally, the LHDA defeated the SVM-RFE by achieving slightly highest averaged performance when both of them are tested by linear-SVM, as shown in the last row of Additional file 1: Table S12.
Summary the speed of three embedded methods
Irrelevant feature NO.
In this work, we proposed a new discriminant analysis model. The proposed LHDA uniquely incorporates both the feature weight and local structure to guide data classification. Optimal feature weights (in terms of LOO) are obtained by minimizing the penalized optimization problem. The proposed LHDA therefore achieves both accurate feature weight estimation and robust supervised classification simultaneously. In addition, LHDA preferentially selects the highly informative and discriminative features from datasets, boosting the performance of HKNN and other classification models. A numerical scheme for efficient minimization was developed, and the method was evaluated in extensive synthetic, median- and high-dimensional biomedical data. Four benchmark classification models and twelve widely recognized feature selection methods were employed for comparisons. The performance ability of LHDA was equal to or superior to other state-of-the-art methods, as demonstrated in rigorous quantitative analyses.
Notation and problem description
The purposes of this paper is to establish a model which achieves both the supervised classification for a new sample x and its feature weight estimation of w. To achieve the goal, a local hyperplane based discriminant analysis model (LHDA) is proposed. The aim of LHDA is to optimize a classification model, namely, the feature-weighted hyperplane k-nearest neighborhood (FHKNN) model, within a feature-scaled space to simultaneously achieve the feature estimation and supervised classification. Therefore, LHDA consists of two steps, supervised classification via FHKNN and feature estimation through local learning. We shall describe the two phases individually.
Feature weighted hyperplane KNN model (FHKNN)
The dimensionality of high-dimensional data is usually reduced by an appropriate technique prior to data processing. Mapping the data of interest into an embedded non-linear manifold within the higher-dimensional space has gained wide recognitions in machine learning[12, 15]. The local hyperplane approximation adopted in the present paper maintains the robustness of local linear embedding models. It assumes that sample structure is locally linear and therefore lies in a locally linear hyperplane.
where the vector z=(w T |x−h 1|,w T |x−h 2|,…,w T |x−h k |).
Feature estimation though local hyperplane approximation
where λ is a trade-off term that penalizes the sparsity of the feature vector.
The proposed algorithm is similar to the expectation maximization (EM) scheme. For a given feature weight w, the spanning coefficients α and β for each sample are calculated, which are then used to correct the estimation of the feature weight w. A pseudo code for the algorithm is presented in Fig. 3.
The LHDA algorithm embeds the local data structure into the classification by minimizing its error in feature weighted space. It proceeds through two steps; approximating the local hyperplane of each sample and solving a minimizing problem to obtain the feature vector. The computational complexities of the hyperplane approximation and minimization steps in each iteration are O(c k N D) and O(N D), respectively. Here, c is the number of data’s class, k is the number of nearest neighbor we choose, N is the number of samples and D is the feature dimensionality.
Availability of supporting data
The Matlab code used to tested on the Fermat Spirals and cancer microarray datasets are all available at http://pan.baidu.com/s/1hq8Bk2o.
The authors would like to thank Dr. Y. Sun for the source code of I-RELIEF and Miss Peiyin Ruan for her help in compiling the codes from RankGene.
This work was supported by BGI-SCUT Innovation Fund Project (SW20130803), National Nature Science Foundation of China (61372141), the Fundamental Research Fund for the Central Universities (2013ZM0079), Guangdong Natural Science Foundation Grant (S2013010016852) and UIC internal grant.
- Rhodes DR, Yu J, Shanker K, Deshpande N, Varambally R, Ghosh D, Barrette T. Large-scale meta-analysis of cancer microarray data identifies common transcriptional profiles of neoplastic transformation and progression In: Randy S, editor. Proceedings of the National Academy of Sciences of the United States of America. National Academy of Sciences Press: 2004. p. 9309–9314.Google Scholar
- Chang HY, Nuyten DSA, Sneddon JB, Hastie T, Tibshirani R, Sørlie T. Robustness, scalability, and integration of a wound-response gene expression signature in predicting breast cancer survival In: Randy S, editor. Proceedings of the National Academy of Sciences of the United States of America. National Academy of Sciences Press: 2005. p. 3738–3743.Google Scholar
- Yang K, Cai Z, Li J, Lin G. A stable gene selection in microarray data analysis. BMC Bioinformatics. 2006; 7:228–235.View ArticlePubMedPubMed CentralGoogle Scholar
- Ni B, Liu J. A hybrid filter/wrapper gene selection method for microarray classification In: Daniel Y, Xizhao W, Jianbo S, editors. Proceedings of 2004 International Conference on Machine Learning and Cybernetics. IEEE Press: 2004. p. 2537–2542.Google Scholar
- Saeys Y, Inza I, Larrañaga P. A review of feature selection techniques in bioinformatics. Bioinformatics. 2007; 23:2507–2517.View ArticlePubMedGoogle Scholar
- Abdi H, Williams LJ. Principal component analysis. Wiley Interdisciplinary Reviews: Computational Statistics. 2010; 2(4):433–459.View ArticleGoogle Scholar
- Pok G, Liu Steve J-C, Ryu KH. Effective feature selection framework for cluster analysis of microarray data. Bioinformatics. 2010; 4:385–392.Google Scholar
- Talavera L. An evaluation of filter and wrapper methods for feature selection in categorical clustering In: Famili A, editor. Advances in Intelligent Data Analysis VI. Berlin Heidelberg Press: 2005. p. 440–451.Google Scholar
- Sun Y. Iterative relief for feature weighting: algorithms, theories, and applications. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2007; 29:1035–1051.View ArticlePubMedGoogle Scholar
- Brown G. Some thoughts at the interface of ensemble methods and feature selection In: Neamat EG, Josef K, Fabio R, editors. Multiple Classifier Systems. Springer Press: 2010. p. 314–314.Google Scholar
- Kim Y, Street WN, Menczer F. Efficient dimensionality reduction approaches for feature selection In: Arivazhagan S, editor. International Conference on Conference on Computational Intelligence and Multimedia Applications. IEEE Press: 2007. p. 121–127.Google Scholar
- He X, Yan S, Hu Y, Niyogi P, Zhang H-J. Face recognition using laplacianfaces. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2005; 27:328–340.View ArticlePubMedGoogle Scholar
- Roweis ST, Saul LK. Nonlinear dimensionality reduction by locally linear embedding. Science. 2000; 290:2323–2326.View ArticlePubMedGoogle Scholar
- Yan S, Xu D, Zhang B, Zhang H-J, Yang Q, Lin S. Graph embedding and extensions: a general framework for dimensionality reduction. IEEE Trans Pattern Anal Mach Int. 2007; 29:40–51.View ArticleGoogle Scholar
- Kim T-K, Kittler J. Locally linear discriminant analysis for multimodally distributed classes for face recognition with a single model image. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2005; 27:318–327.View ArticlePubMedGoogle Scholar
- Cortes C, Vapnik V. Support-vector networks. Machine Learning. 1995; 3:273–297.Google Scholar
- Yang J Zhang, Yang J-y, Zhang D. From classifiers to discriminators: A nearest neighbor rule induced discriminant analysis. Pattern Recognition. 2011; 44:1387–1402.View ArticleGoogle Scholar
- Villegas M, Paredes R. Dimensionality reduction by minimizing nearest-neighbor classification error. Pattern Recognition Letters. 2011; 32:633–639.View ArticleGoogle Scholar
- Villegas M, Paredes R. Simultaneous learning of a discriminative projection and prototypes for nearest-neighbor classification. IEEE Conference on Computer Vision and Pattern Recognition. 2008:1–8.Google Scholar
- Vincent P, Bengio Y. K-local hyperplane and convex distance nearest neighbor algorithms In: Thomas G, Sue B, Zoubin G, editors. Advances in Neural Information Processing Systems. MIT Press: 2001. p. 985–992.Google Scholar
- Kim T-K, Kittler J. UCI machine learning repository. University of California Irvine School of Information Andcomputer Sciences. 2007.Google Scholar
- Aha DW, Kibler D, Albert MK. Instance-based learning algorithms. Machine Learning. 1991; 1:37–66.Google Scholar
- Cai D, He X, Zhou K, Han J, Bao H. Locality sensitive discriminant analysis In: Veloso M, editor. Proceedings of the 20th International Joint Conference on Artificial Intelligence. MIT Press: 2007. p. 708–713.Google Scholar
- Sun Y, Todorovic S, Goodison S. Local-learning-based feature selection for high-dimensional data analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence. 2010; 32:1610–1626.View ArticlePubMedPubMed CentralGoogle Scholar
- Cai H, Ng M. Feature weighting by relief based on local hyperplane approximation In: Pang-Ning T, editor. Advances in Knowledge Discovery and Data Mining. Springer Press: 2012. p. 335–346.Google Scholar
- Duan KB, Rajapakse JC, Wang H, Azuaje F. Multiple svm-rfe for gene selection in cancer classification with expression data. IEEE Transactions on NanoBioscience. 2005; 4:228–234.View ArticlePubMedGoogle Scholar
- Liaw A, Wiener M. Classification and regression by randomforest. R news. 2002; 2:18–22.Google Scholar
- Meier L, Van De Geer S, Bühlmann P. The group lasso for logistic regression. J R Stat Soc Series B (Statistical Methodology). 2008; 70:53–71.View ArticleGoogle Scholar
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.