Improving feature selection performance using pairwise pre-evaluation

Background: Biological data such as microarrays contain a huge number of features. Thus, it is necessary to select a small number of novel features to characterize the entire dataset. Producing an ideal feature subset would require evaluating every possible combination of features, but this is infeasible with currently available computing power. Feature selection or feature subset selection provides a sub-optimal solution within a reasonable amount of time. Results: In this study, we propose an improved feature selection method that uses information based on all the pairwise evaluations for a given dataset. We modify the original feature selection algorithms to use this pre-evaluation information. The pre-evaluation captures the quality of each pair of features and the interaction between them. Using the top-ranking feature pairs in the selection process should improve the resulting feature subset. Conclusions: Experimental results demonstrated that the proposed method improved the quality of the feature subset produced by the modified feature selection algorithms. The proposed method can be applied to microarray and other high-dimensional data.


Background
Microarray gene expression data contain thousands to tens of thousands of genes (features). Biologists are interested in identifying the expressed genes that correlate with a specific disease, or genes with strong interactions. The high dimensionality of microarray data is a challenge for computational analysis. Feature selection by data mining may provide a solution because it can deal with high dimensional datasets [1].
The goal of feature selection is to find the best subset, i.e., one that has fewer dimensions but also contributes to higher prediction accuracy. This reduces the execution time of the learning algorithms and improves the prediction accuracy. A simplistic way of obtaining the optimal subset of features is to evaluate and compare all of the possible feature subsets and select the one that yields the highest prediction accuracy.
However, as the number of features increases, the number of possible subsets grows exponentially. For example, using a dataset with 1000 features, the number of all possible feature subsets is $2^{1000} \approx 1.07 \times 10^{301}$, which means that it is virtually impossible to evaluate them in a reasonable time. Even if the problem space is reduced from 1000 to 100 features, the number of subsets for evaluation is $2^{100} \approx 1.27 \times 10^{30}$, which will still require a long computational time. Therefore, it is practically impossible to calculate and compare all of the possible feature subsets because of the prohibitive computational cost.
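The arithmetic above can be checked with a short Python sketch. The function names and the rate of one billion subset evaluations per second are illustrative assumptions, not figures from this study.

```python
def count_subsets(n_features: int) -> int:
    """Return the number of subsets of a set with n_features elements (2**n)."""
    return 2 ** n_features


def years_to_enumerate(n_features: int, evals_per_second: float = 1e9) -> float:
    """Rough wall-clock estimate under a hypothetical evaluation rate."""
    seconds = count_subsets(n_features) / evals_per_second
    return seconds / (365.25 * 24 * 3600)


for n in (100, 1000):
    # Python integers are exact; the format spec only rounds for display.
    print(f"{n} features -> {count_subsets(n):.2e} subsets")
```

Even at the assumed rate, `years_to_enumerate(100)` is on the order of 10^13 years, which illustrates why exhaustive evaluation is ruled out.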
Various approaches have been proposed to deal with feature selection from high dimensional datasets [2,3], which can be divided into two general categories: the filter approach and feature subset selection. In the filter approach, each feature is evaluated using a specific evaluation measure, such as correlation, entropy, or consistency, to choose the best n features for further classification analysis. Feature selection based on a distance discriminant (FSDD) [4], Relief [5], chi-squared [6,7], and gain ratio [8] are filter approaches. FSDD identifies features that provide good separability among classes. The Relief algorithm randomly selects an instance and identifies its nearest neighbors, i.e., one from its own class and the others from the other classes. The quality estimate of each attribute is then updated according to how well that attribute distinguishes the instance from its closest neighbors. The chi-squared test is a well-known hypothesis testing method for discrete data, which evaluates the association between two variables and determines whether they are independent or correlated. The gain ratio is defined as the ratio between the information gain and the intrinsic value, and the features with a higher gain ratio are selected.
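As an illustration of a filter measure, the gain ratio defined above can be sketched in a few lines of Python. This is a minimal sketch that assumes the feature has already been discretized (gene expression values are continuous and would need binning first); the function names are hypothetical and do not come from the FSelector package.

```python
from collections import Counter
from math import log2


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    total = len(values)
    return -sum((c / total) * log2(c / total) for c in Counter(values).values())


def gain_ratio(feature, labels):
    """Information gain of `feature` about `labels`, divided by the intrinsic value."""
    total = len(labels)
    conditional = 0.0  # H(labels | feature)
    for value in set(feature):
        subset = [lab for f, lab in zip(feature, labels) if f == value]
        conditional += len(subset) / total * entropy(subset)
    info_gain = entropy(labels) - conditional
    intrinsic = entropy(feature)  # entropy of the split itself
    # A constant feature has zero intrinsic value; treat it as uninformative.
    return info_gain / intrinsic if intrinsic > 0 else 0.0


labels = [0, 0, 1, 1]
print(gain_ratio(["low", "low", "high", "high"], labels))  # perfectly informative
print(gain_ratio(["low", "high", "low", "high"], labels))  # uninformative
```

A filter method would compute such a score independently for every feature and keep the n highest-scoring ones, which is why interactions between features are not captured.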
Filter methods are efficient in terms of computational time, but they do not consider the interactions among the features. In particular, during gene expression data analysis, gene-gene interactions are an important issue that cannot be ignored. Feature subset selection is a better approach to this analysis [9] because it evaluates a set of features instead of each feature in a dataset. Therefore, the interactions among features can be measured in a natural manner using this approach. An important issue during feature subset selection is how to choose a reasonable number of subsets from all the subsets of features, and several heuristic methods have been proposed. Forward search [10] starts from an empty set and sequentially adds the feature x that maximizes the evaluation value when combined with the feature subset that has already been selected. By contrast, backward elimination [10] starts from the full set and sequentially removes the feature x whose removal least reduces the evaluation value. Hill climbing [10] starts with a random attribute set, evaluates all of its neighbors, and chooses the best. Best first search [10] is similar to forward search, but at each step it expands the best node among all of those that have already been evaluated. If no better node is found, the selection of the best node is repeated up to approximately max.backtracks times. Minimum redundancy maximum relevance feature selection (MRMR) [11] combines forward search with redundancy evaluation.
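The forward search heuristic described above can be sketched as follows. This is a generic greedy sketch rather than the implementation used in this study: `evaluate` stands for any subset evaluation function, and the toy scoring function and gene names are invented for illustration.

```python
def forward_search(features, evaluate):
    """Greedily add the feature that most improves evaluate(subset)."""
    chosen, best_score = [], float("-inf")
    remaining = list(features)
    while remaining:
        # Score every candidate together with the features chosen so far.
        scores = {f: evaluate(chosen + [f]) for f in remaining}
        candidate = max(scores, key=scores.get)
        if scores[candidate] <= best_score:
            break  # no remaining feature improves the evaluation value
        chosen.append(candidate)
        remaining.remove(candidate)
        best_score = scores[candidate]
    return chosen, best_score


# Hypothetical evaluation: summed per-gene usefulness minus a size penalty.
TOY_WEIGHTS = {"g1": 0.6, "g2": 0.3, "g3": 0.05}


def toy_evaluate(subset):
    return sum(TOY_WEIGHTS[f] for f in subset) - 0.1 * len(subset)


subset, score = forward_search(TOY_WEIGHTS, toy_evaluate)
print(subset, round(score, 3))
```

With n features, each pass evaluates at most n subsets, so the search costs on the order of n² evaluations instead of 2^n. Backward elimination follows the same pattern but starts from the full set and removes one feature per pass.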
Many feature (subset) selection methods have been proposed and applied to microarray analysis [12][13][14][15] and medical image analysis [16,17]. Feature subset selection is a better approach for gene expression data than the filter approach, but it cannot evaluate all subsets of features because of the computational cost involved. Previous experimental results indicate that all pairs of features can be evaluated within a reasonable time after appropriate preprocessing of all the features. Thus, the interaction between two features can be measured based on the classification accuracy obtained for that pair of features. Feature selection should be improved by applying this information in the filter method and feature subset selection approaches.
In the present study, we propose a method for improving the performance of feature selection algorithms using a b  The results obtained in various experiments using microarray datasets confirmed that the proposed approach performs better than the original feature selection approach.

Methods
Before describing the proposed approach, we need to define some notations. The input of feature selection is a dataset DS, which has N features and class labels CL  In the experiments, each dataset contained about 12000-15000 features. A mutual information test was performed for all of the features in a dataset and the best 1000 features were chosen in the pre-filtering step. In the proposed method, the input dataset DS for feature selection is this pre-filtered dataset. The     Figure 2 describes the pseudo-code used to derive COMBN. After producing COMBN, four filter algorithms, two feature subset selection algorithms, and MRMR are modified so the pairwise classification table is used in the original algorithms. Table 1 summarizes the modified feature selection algorithms.
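The construction of a pairwise classification table such as COMBN can be sketched as follows. This sketch is not the pseudo-code of Fig. 2: the function names and toy data are hypothetical, and a leave-one-out 1-nearest-neighbour accuracy stands in for the SVM evaluation so that the example needs only the standard library.

```python
from itertools import combinations


def loo_1nn_accuracy(rows, labels, cols):
    """Leave-one-out accuracy of a 1-nearest-neighbour classifier on `cols` only."""
    correct = 0
    for i, row in enumerate(rows):
        nearest = min(
            (j for j in range(len(rows)) if j != i),
            key=lambda j: sum((row[c] - rows[j][c]) ** 2 for c in cols),
        )
        correct += labels[nearest] == labels[i]
    return correct / len(rows)


def pairwise_table(rows, labels, evaluate=loo_1nn_accuracy):
    """Evaluate every feature pair and return (pair, score) sorted best first."""
    n_features = len(rows[0])
    table = [((a, b), evaluate(rows, labels, (a, b)))
             for a, b in combinations(range(n_features), 2)]
    table.sort(key=lambda item: item[1], reverse=True)
    return table


# Toy data: the class depends on features 0 and 1 jointly (an XOR-like
# interaction); feature 2 is constant and therefore uninformative.
rows = [(0.0, 0.0, 0.5), (0.1, 0.1, 0.5), (1.0, 1.0, 0.5), (0.9, 0.9, 0.5),
        (0.0, 1.0, 0.5), (0.1, 0.9, 0.5), (1.0, 0.0, 0.5), (0.9, 0.1, 0.5)]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

for pair, score in pairwise_table(rows, labels):
    print(pair, score)
```

For N features the table has N(N-1)/2 entries, i.e., 499,500 pairs for the 1000 pre-filtered features. The pair evaluations are independent of one another, so they can be distributed across cores.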
The modification of the original feature selection algorithms is similar in most cases. Therefore, we present the pseudo-code for three selected algorithms, where Figs. 3, 4 and 5 show the pseudo-codes of the original and modified algorithms. Figure 3 presents the Chi-squared pseudo-code as an example for the filter method. The original Chi-squared      The pseudo-codes of the original and modified forward search algorithm (Fig. 4) illustrate how the feature subset selection methods are modified. The original forward search algorithm first finds a single feature with the highest evaluation value based on the eval() function and adds it to CHOSEN. In the second step, it repeatedly finds the next feature that obtains the highest evaluation value together with the feature(s) in CHOSEN until no more features can increase the evaluation accuracy (lines 14 and 15). Various methods are available for implementing the eval() function, but we employ SVM classification as the evaluation function. The modified algorithm finds the best two features from COMBN in the finding loop (line 9), whereas the original algorithm searches for a single feature from the feature list of DS. This idea can be applied to other feature subset selection algorithms. Figure 5 summarizes the pseudo-code for the original and modified MRMR algorithms. MRMR adopts the forward search method and evaluates the redundancy between target features, but there is no breaking condition for finding the feature subset. Therefore, it has   Fig. 4. However, the eval() function in Fig. 4 is substituted by the mrmr() function and the breaking conditions in Fig. 4 are omitted (see lines 14 and 15 for the original forward search).
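One possible reading of the modified forward search is sketched below. It is not the pseudo-code of Fig. 4. It assumes that candidates are drawn from the top-ranked pairs of a pairwise table rather than from single features. The `top_k` cut-off, the function names, and the toy scores are all hypothetical.

```python
def pairwise_forward_search(pair_table, evaluate, top_k=10):
    """Greedy search whose candidates are feature pairs instead of single features."""
    candidates = [pair for pair, _ in pair_table[:top_k]]
    chosen, best_score = [], float("-inf")
    while candidates:
        # Score each candidate pair merged with the features chosen so far.
        scores = {pair: evaluate(sorted(set(chosen) | set(pair)))
                  for pair in candidates}
        best_pair = max(scores, key=scores.get)
        if scores[best_pair] <= best_score:
            break  # no pair increases the evaluation value
        chosen = sorted(set(chosen) | set(best_pair))
        best_score = scores[best_pair]
        candidates.remove(best_pair)
    return chosen, best_score


# Hypothetical pre-evaluated table (best pair first) and toy evaluation.
PAIR_TABLE = [(("g1", "g2"), 0.9), (("g1", "g3"), 0.8), (("g3", "g4"), 0.6)]
PAIR_WEIGHTS = {"g1": 0.6, "g2": 0.3, "g3": 0.05, "g4": 0.02}


def toy_subset_score(subset):
    return sum(PAIR_WEIGHTS[f] for f in subset) - 0.1 * len(subset)


print(pairwise_forward_search(PAIR_TABLE, toy_subset_score))
```

Restricting the candidates to highly ranked pairs means that two features which are only useful together can enter the subset in one step, which a single-feature search may miss.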
After obtaining the selected feature subsets produced by several algorithms, a classification test was performed using SVM and KNN because they are recognized for their good performance. The leave-one-out cross-validation test was used to avoid the overfitting problem. The FSelector package [18] in R (http://www.r-project.org) was used to test the original feature selection algorithms. FSDD and MRMR are not supported by the FSelector package, so they were implemented using R.

Results
To compare the original and proposed feature selection algorithms, we used five microarray datasets from the Gene Expression Omnibus (GEO) website (http://www.ncbi.nlm.nih.gov/geo/), which provides accession IDs for GEO datasets. A brief description of the datasets is provided in Table 2.  Tables 3, 4, 5, 6 and 7 and Figs. 6, 7, 8, 9 and 10 show the experimental results obtained by the filter methods and MRMR to compare the classification accuracy of the original feature selection algorithms and proposed methods. The filter methods evaluate each feature and the user must select the best n features from the evaluation results. For most of the datasets and with various numbers of selected features, the proposed modified algorithms obtained higher classification accuracy than the original methods. In some cases for FSDD and Relief, the original algorithms were marginally more accurate than the proposed methods with the KNN test. With the SVM test, the proposed methods always improved the classification accuracy, excluding one result obtained by Relief. In general, the SVM test yielded greater improvements than the KNN test, possibly because the pairwise classification table was produced using the SVM; the KNN test might have shown greater improvements if KNN had been used to build the table instead. In general, the proposed method increased the classification accuracy by 2-11 % and it was most accurate when the number of features selected was 25.
Tables 8 and 9, and Figs. 11 and 12 show the experimental results obtained by the feature subset selection algorithms. In the case of forward search (Table 8 and Fig. 11), the SVM test showed a marginal improvement in the classification accuracy compared with the original method, whereas the KNN test showed a decrease in accuracy. The difference between KNN and SVM may have been due to the method employed for the preparation of the pairwise classification table. Thus, if the eval() function in Figs. 2 and 4 had been changed to KNN, the results in Fig. 11(a) might have been different.
The proposed method improved the accuracy of the filter methods more markedly than that of the feature subset selection methods. The filter methods only evaluate each feature and they do not consider interactions between features, whereas feature subset selection methods consider feature interactions. Therefore, the proposed method performed well with the filter methods. The proposed method selected more features than the original algorithms and improved the classification accuracy (Table 8). In the case of backward elimination (Table 9 and Fig. 12), the original algorithm did not reduce the number of features, whereas the proposed method reduced the initial 1000 features by 90 %. The proposed method removed a large number of features, but the KNN and SVM tests improved the   table and Table 10 summarizes the computational time needed for this step. The average time was 63.1 min. This step is performed only once for a given dataset and it is not a great burden for the overall feature selection process. Table 11 summarizes the computational time required by various algorithms using the GDS1027 dataset. The proposed modified algorithms were faster than the original algorithms in the case of Relief, forward search, and MRMR, but slower for FSDD and Chi-squared. In general, the proposed algorithms produced the results within a reasonable amount of time.

Discussion
The proposed algorithms are useful but their implementation may be a difficult task for users. Thus, to facilitate further research, we have built an R package called "fsPair" and posted it on the web site (http://bitl.dankook.ac.kr/biosw/pairwise). This package includes executable code, source code, a user manual, usage examples, and a sample dataset. We have added three more classifiers, i.e., random forest, naive Bayes, and neural network. We have also added multi-core parallelism to allow the rapid generation of pairwise classification tables. Users are free to download this package and test the proposed feature selection methods using their own datasets.
Next, we consider the application of the proposed methods to the solution of real problems. Kurgan et al. [19] proposed a method for cardiac diagnosis using single photon emission computed tomography (SPECT) images, where they built the SPECTF dataset containing 44 features and 267 instances. Each of the features contained values extracted from a specific region of interest. Each of the patients (instances) was classified according to two categories: normal and abnormal. They aimed to produce a good classifier for diagnosing the problem. The accuracy of their proposed CLIP3 algorithm was 77 %. We tried to find "marker features" that might be helpful for cardiac diagnosis. Thus, using our fsPair package and the original algorithms, we tested different combinations of feature selection algorithms and classifiers, and Table 12 summarizes the results obtained. Using the SPECTF dataset, the results produced by the original and modified algorithms differed little because the dataset had a small number of features. However, the proposed algorithms selected a smaller number of features than the original algorithms while achieving similar accuracy.
For example, the original algorithms had the best accuracy using MRMR and random forest with 15 features, whereas the modified algorithms had the best accuracy using FSDD and random forest with five features. Thus, five features referred to as F21S, F17R, F20S, F3S, F13S, and F8S are highly informative features for cardiac diagnosis. We performed a bootstrap test using the five features from the dataset and a very good area under the receiver operating characteristic curve (AUC) score was obtained, as shown in Fig. 13. This suggests that the five features selected may be of practical value for future diagnosis.

Conclusions
Feature (subset) selection has various applications in bioinformatics. However, the selection of a novel feature set from a huge number of features is a critical issue, which involves the evaluation of each feature, feature interaction, and redundancy in the features. In this study, we proposed a method that improves the quality of feature selection. Using information about the interactions between two features is very helpful for enhancing the original feature selection algorithms. If computational power increases in the future, then information about the interactions between three or more features in a given dataset could further improve the feature selection process. The generation of interaction information is another issue. In this study, we used the classification accuracy as an evaluation measure for interaction, but the evaluation measure could be changed if the aim of feature selection is not classification. The proposed method does not consider redundancy among the selected features. Thus, the addition of a redundancy removal algorithm