Genetic algorithm-based feature selection with manifold learning for cancer classification using microarray data
BMC Bioinformatics volume 24, Article number: 139 (2023)
Abstract
Background
Microarray data have been widely utilized for cancer classification. The main characteristic of microarray data is “large p and small n”: the data contain a small number of subjects but a large number of genes. This imbalance may affect the validity of the classification. Thus, there is a pressing demand for techniques that can select genes relevant to cancer classification.
Results
This study proposed a novel feature (gene) selection method, IsoGA, for cancer classification. IsoGA hybridizes the manifold learning algorithm Isomap with the genetic algorithm (GA) to account for the latent nonlinear structure of the gene expression in the microarray data. The Davies–Bouldin index is adopted to evaluate the candidate solutions after Isomap mapping and to avoid the classifier dependency problem. Additionally, a probability-based framework is introduced to reduce the possibility of genes being randomly selected by GA. The performance of IsoGA was evaluated on eight benchmark microarray datasets of cancers. IsoGA outperformed other benchmarking gene selection methods, leading to good classification accuracy with fewer critical genes selected.
Conclusions
The proposed IsoGA method can effectively select fewer but critical genes from microarray data to achieve competitive classification performance.
Introduction
DNA microarray data have important applications in clinical decision support, such as diagnosis of disease (e.g., cancer) and prediction of clinical outcomes [1,2,3]. In recent decades, advances in DNA microarrays have enabled researchers to have a global view of cells. A DNA microarray can measure the expression of thousands of genes simultaneously and help researchers to investigate the biological state of a cell [4]. Such high-throughput expression profiling can be used to distinguish subjects with cancer from those without or to classify tumor samples into different grades of cancer [1, 3]; these two applications are called cancer classification in this article. Owing to the high expense of collecting microarray data with a high-dimensional feature space (\(p\)), only a limited number of samples (\(n\)) are available from the population of subjects, which leads to the curse of dimensionality, also known as the “large p, small n” problem [5, 6]. The high-dimensional gene feature space renders conventional statistical methods invalid. Even if some methods can handle high-dimensional data, the inclusion of genes not related to cancer can degrade the accuracy of cancer classification [6, 7]. Thus, selecting a subset of genes relevant to cancer classification from microarray data (i.e., dimensionality reduction) is a crucial and pressing need. Various methods of performing dimensionality reduction have been proposed, and these methods can be generally grouped into feature extraction and feature selection [5]. Feature extraction methods project or compress the original features to create fewer new variables. The major drawback of these methods is that the interpretability of the variables can be lost during the projection. Alternatively, feature selection methods identify the most critical subset of features by removing the noisy features from the entire microarray data; thus, the characteristics and interpretability of the data are preserved.
Hira and Gillies [5] provided more detailed discussions on the advantages and disadvantages of these methods.
Feature selection can be considered an optimization problem, and its methods fall into four groups: filter, wrapper, hybrid, and embedded methods [8]. In recent decades, wrapper feature selection methods with metaheuristics as search strategies have become increasingly popular in microarray data analysis [8, 9]. Metaheuristic algorithms have advantages in fast convergence, excellent search ability, and high population diversity. They are superior to other methods in readability and interpretability and avoid premature convergence or falling into local optima [10]. On microarray data, metaheuristics-based methods can search for the optimal subset of genes more efficiently by using a specific fitness function to evaluate the candidate subsets of genes, and these methods can be combined with many classifiers for cancer classification [10]. Recently, many enhancements of the metaheuristic algorithms have been proposed by mimicking the behaviors of organisms in nature. Examples include artificial bee colony (ABC) [11, 12], cuckoo search (CS) [13, 14], bacterial colony optimization (BCO) [15], chimp optimization algorithm (ChOA) [16], forest optimization algorithm (FOA) [17], and genetic algorithm (GA) [18]. These enhancements are based on bio-inspired optimization [19] and showed good performance in gene selection [10]. However, an individual algorithm usually has inherent limitations. Thus, a hybrid feature selection method is usually adopted to achieve better performance [20, 21]. A hybrid method combines filter- and wrapper-based methods for feature selection. Therefore, the hybrid method typically achieves the high accuracy characteristic of wrappers and the high efficiency characteristic of filters [22]. Metaheuristic algorithms can be hybridized with feature extraction methods (e.g., the hybridization between ABC and independent component analysis [23]) or optimization methods (e.g., the binary particle swarm optimization and sine cosine algorithm [24]).
Many metaheuristics-based hybrid methods adopted GA, a method inspired by the evolutionary process of natural selection, to improve performance in feature selection [20, 25, 26]. For example, Alshamlan et al. [27] developed the genetic bee colony (GBC) algorithm by combining GA with the ABC algorithm [28]. Das et al. embedded the Harmony Search (HS) algorithm with GA [29]. However, the present metaheuristic-based hybrid methods have several shortcomings:
1. Classifier dependency: These methods use fitness values that include the classification accuracy of a specific classifier, which can lead to classifier dependency because the metaheuristic algorithm aims to optimize the classification accuracy [30, 31].
2. Randomness: In preliminary experiments, we found that even when the same algorithms and objective functions are used on the same dataset, randomness in the algorithms could result in quite different subsets of genes being selected when the analysis is repeated. Thus, it is necessary to employ a feature selection method that reduces the impact of algorithmic randomness.
3. Linear space assumption: Most metaheuristic methods use linear distances to evaluate candidate subsets of genes. For example, Garro et al. [32] introduced a classification method that utilizes the ABC algorithm with a classification error function for feature selection and multiple artificial neural networks to evaluate gene subsets. This approach is based on the assumption that gene expression vectors are distributed in a linear Euclidean space. However, this assumption does not always hold in practice [20]. Since genes are dynamically linked with each other, it is reasonable to assume that gene expression features lie in a nonlinear space. Thus, nonlinear algorithms, such as manifold learning, should be more appropriate for dimensionality reduction and fitness evaluation [33]. Among the nonlinear manifold learning methods, isometric feature mapping (Isomap) performs well in preserving the underlying data structure and could improve the classification accuracy [34, 35].
To solve the aforementioned issues, we propose a method called IsoGA, which hybridizes Isomap and GA to select the optimal subset of genes, i.e., the genes most helpful to cancer classification. The key ideas in the proposed method are as follows. Isomap is used to map high-dimensional nonlinear microarray data to a low-dimensional linear space. The correlation of gene subsets and cancer subtypes is measured by the Davies–Bouldin (DB) index [36] to reflect the clarity of division between samples of different classes in the mapped dataset. A feature selection framework with IsoGA inserted is proposed to reduce the influence of randomness. In this framework, the GA search is repeated several times to select the feature subset that optimizes the fitness function, and a new set containing the common features selected more than a specified threshold number of times is used in the final classifier. The threshold is calculated based on the binomial distribution and the total number of genes in the microarray data. The threshold ensures that IsoGA can select a reasonable number of cancer-related genes from datasets of various dimensionality \(p\). By selecting a smaller subset of genes, the proposed method is expected to improve cancer classification accuracy on microarray data.
Methods
Notation
The dataset adopted in this study can be denoted as \(\left(X, {\varvec{y}}\right)=\{({{\varvec{x}}}_{i}, {y}_{i})\mid i=1,\dots ,n\}\), where \({{\varvec{x}}}_{i}=({x}_{i1},\dots ,{x}_{ip})\), \({\varvec{y}}={\left({y}_{1},\dots ,{y}_{n}\right)}^{T}\), and \({y}_{i}\in \left\{1,\dots ,C\right\}\) indicates the class label of \({{\varvec{x}}}_{i}\), where \(C\) denotes the number of classes. Let \(p\) be the total number of features and \(n\) be the total number of samples.
In five-fold cross-validation, we chose each of the five folds in turn as the test set \(({X}_{te}, {{\varvec{y}}}_{te})\), with the other four folds as the training set \(({X}_{tr}, {{\varvec{y}}}_{tr})\). For each training set, we generated the \(b\)th pair of bootstrap samples \(({X}_{tr}^{\left(b\right)}, {{\varvec{y}}}_{tr}^{\left(b\right)})\) and \(({X}_{val}^{\left(b\right)}, {{\varvec{y}}}_{val}^{\left(b\right)})\) as the training and validation sets, respectively.
Let each candidate solution be \({{\varvec{s}}}_{i}\), \(i=1, \dots , pop.size\) (where \(pop.size\) is the population size), and let \(|{\varvec{s}}|\) be the size of a solution.
Isometric feature mapping (Isomap)
In 2000, Tenenbaum et al. [37] proposed a framework that uses local metric information to learn the underlying global geometry of the data for nonlinear dimensionality reduction, referred to as Isomap. Isomap is a generalization of the conventional multidimensional scaling (MDS) algorithm to nonlinear manifolds [35]. MDS keeps the Euclidean distances between data points in the observation space as consistent as possible with those in the target space, and it assumes that the manifold is linearly or approximately linearly embedded in a high-dimensional observation space [38]. Isomap, by contrast, attempts to keep the geodesic distance on the manifold in the high-dimensional observation space consistent with the Euclidean distance in the target space.
The most significant difference in the calculation process between Isomap and MDS is the calculation of the distance matrix. MDS calculates the distance matrix of the data in a high-dimensional space based on the Euclidean distance, while Isomap calculates the distance matrix based on an approximation of the geodesic distance. The geodesic distance is approximated as the shortest path between two points along the nonlinear manifold surface.
The pseudocode of the Isomap algorithm is presented in Fig. 1.
An Isomap process maps the original high-dimensional data \(X\), which includes \(n\) samples in \({\mathbb{R}}^{p}\), to the low-dimensional data \(\widetilde{X}\) in the target space \({\mathbb{R}}^{d} (d<p)\).
Two parameters need to be determined: \(k\), the number of neighbors in the \(k\)-nearest neighborhood graph, and \(d\), the dimensionality of the target space.
First, for each data point, the nearest \(k\) points are connected by edges to construct a neighborhood graph \(G\). The weight of each edge \({e}_{ij}\) is the Euclidean distance \(\Vert {{\varvec{x}}}_{i}-{{\varvec{x}}}_{j}\Vert\), \(i,j=1, \dots , n\).
Then, the geodesic distance between each pair is estimated by determining the shortest path in the neighborhood graph \(G\). Here, the Warshall–Floyd algorithm is adopted to search for the shortest path. After this step, the estimated geodesic distance matrix \(D={({d}_{ij})}_{n\times n}\) contains the shortest path distances between all pairs of data points. To ensure the symmetry of the distance matrix \(D\), if there is a case where one point is the nearest neighbor of another point while the latter is not the nearest neighbor of the former, then the former would be connected to the latter [39].
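The two steps above (neighborhood graph, then shortest paths) can be sketched in a few lines. This is a minimal illustration in plain Python, not the RDRToolbox implementation used in the study; the function name `geodesic_distances` and the toy points are assumptions made for the example, and ties between equally distant neighbors are broken by index order.

```python
import math

def geodesic_distances(points, k):
    """Approximate geodesic distances: symmetric k-NN graph + Floyd-Warshall."""
    n = len(points)
    euclid = [[math.dist(a, b) for b in points] for a in points]
    # Start with no edges: distance 0 on the diagonal, infinity elsewhere.
    d = [[0.0 if i == j else math.inf for j in range(n)] for i in range(n)]
    for i in range(n):
        # Indices of the k nearest neighbours of point i (excluding i itself).
        neighbours = sorted((j for j in range(n) if j != i),
                            key=lambda j: euclid[i][j])[:k]
        for j in neighbours:
            # Add the edge in both directions so the matrix stays symmetric.
            d[i][j] = d[j][i] = euclid[i][j]
    # Floyd-Warshall: relax every pair (i, j) through every intermediate m.
    for m in range(n):
        for i in range(n):
            for j in range(n):
                if d[i][m] + d[m][j] < d[i][j]:
                    d[i][j] = d[i][m] + d[m][j]
    return d

# Four points along an L-shaped path (illustrative data only).
pts = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (1.0, 2.0)]
D = geodesic_distances(pts, k=1)
print(D[0][3], math.dist(pts[0], pts[3]))
```

In this toy case the path distance between the two end points follows the bend, so it is longer than the straight-line Euclidean distance, which is the behavior Isomap relies on. The triple loop is cubic in the number of samples, which is acceptable for microarray data because \(n\) is small.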
The following steps are the same as those used in classical MDS. The inner product matrix can be calculated as:
\[K=-\frac{1}{2}{J}_{n}{D}^{2}{J}_{n}\]
where \({J}_{n}={I}_{n}-\frac{1}{n}{1}_{n}{1}_{n}^{T}\) is the centering matrix, \({D}^{2}={({d}_{ij}^{2})}_{n\times n}\), \({I}_{n}=diag(1,1,\dots ,1)\) is the identity matrix of size \(n\), and \({1}_{n}={(1,1,\dots ,1)}^{T}\) is the all-ones vector of size \(n\).
Next, we conduct the eigenvalue decomposition on \(K\) to obtain the eigenvector matrix \(V\) and the eigenvalue matrix \(\Lambda\):
\[K=V\Lambda {V}^{T}\]
For the determined target dimensionality \(d\), we take the first \(d\) eigenvalues and corresponding eigenvectors to calculate the coordinate matrix \(\widetilde{X}\) of the target space \({\mathbb{R}}^{d}\).
Proposed Isomap-embedded GA (IsoGA) method
The pseudocode of our feature selection framework and IsoGA are presented in Figs. 2 and 3. A flowchart of our proposed feature selection framework is illustrated in Fig. 4.
The basic idea of GA is to imitate the natural selection process, where individuals with high fitness survive, while those with low fitness are eliminated. After several generations, the individual with the highest fitness is finally obtained, which represents the optimal solution to the challenge of interest. Therefore, the fitness value for optimization in the GA is a key parameter, whose choice is related to the judgement of the feature subset.
Here, all candidate feature subsets are binary-coded for each individual, where “1” and “0” denote that the feature at the corresponding position is selected and excluded, respectively. Based on the results of prior testing, we set the number of features selected by each individual to 30, i.e., each individual contains exactly 30 bits set to “1”.
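As a small illustration of this encoding, the sketch below builds a random individual with a fixed number of selected genes and decodes it back to gene indices. The helper names `random_individual` and `selected_genes`, and the gene count of 2000, are hypothetical; the study itself relied on the kofnGA package in R for the fixed-size subset search.

```python
import random

def random_individual(n_genes, subset_size, rng):
    """Binary-coded individual with exactly `subset_size` bits set to 1."""
    chosen = set(rng.sample(range(n_genes), subset_size))
    return [1 if g in chosen else 0 for g in range(n_genes)]

def selected_genes(individual):
    """Decode an individual into the indices of the selected genes."""
    return [g for g, bit in enumerate(individual) if bit == 1]

rng = random.Random(0)               # fixed seed so the example is repeatable
individual = random_individual(2000, 30, rng)
print(sum(individual), selected_genes(individual)[:5])
```

Keeping the number of ones fixed means that every candidate has the same subset size, so fitness values are compared between subsets of equal size.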
We define the fitness function as the DB index of \({\widetilde{X}}^{(s)}\), where \({X}^{(s)}\) is the subset of \(X\) that includes only the features belonging to \({\varvec{s}}\), and \({\widetilde{X}}^{(s)}\) is the matrix obtained after mapping from the \(p\)-dimensional to the \(d\)-dimensional space by the Isomap algorithm.
The DB index is based on the following ideas: an accurate classification should have high interclass and low intraclass dispersions; that is, the ratio of intraclass dispersion to interclass dispersion should be small. As such, the smaller the DB index, the clearer the division of the data. Thus, the optimal solution with the smallest DB index is the feature subset for which each data class can be most clearly partitioned after the Isomap dimensionality reduction.
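To make the ratio concrete, the following sketch computes a DB index from class labels using the common textbook form: the within-class scatter is the mean distance to the class centroid, and each class contributes its worst ratio of summed scatters to centroid distance. The function name `davies_bouldin` and the toy coordinates are assumptions for illustration; the exact variant used in the study may differ.

```python
import math

def davies_bouldin(points, labels):
    """DB index of labelled points (smaller means clearer class separation)."""
    classes = sorted(set(labels))
    centroids, scatter = {}, {}
    for c in classes:
        members = [p for p, y in zip(points, labels) if y == c]
        centroid = tuple(sum(col) / len(members) for col in zip(*members))
        centroids[c] = centroid
        # Intra-class dispersion: mean distance of members to their centroid.
        scatter[c] = sum(math.dist(p, centroid) for p in members) / len(members)
    worst_ratios = []
    for i in classes:
        # Worst-case (largest) ratio of intra- to inter-class dispersion.
        worst_ratios.append(max(
            (scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
            for j in classes if j != i))
    return sum(worst_ratios) / len(classes)

pts = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0)]
print(davies_bouldin(pts, [0, 0, 1, 1]))   # classes form two compact groups
print(davies_bouldin(pts, [0, 1, 0, 1]))   # classes are interleaved
```

The first labelling gives a much smaller value than the second, matching the statement that a smaller DB index indicates a clearer division of the classes.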
We assume that clearer partitioning means that the contained features contribute more to the classification, so that more accurate classification results can be obtained. To verify this assumption, we performed simulation experiments using a support vector machine (SVM) classifier. We randomly drew 500 feature subsets of size 30 for each dataset (a detailed description of the datasets is provided in Datasets and preprocessing). The DB index of each random subset was calculated after the dimensionality reduction using Isomap. The micro-AUC (introduced in Evaluation metrics) value of the test set was calculated, and scatter plots of the results are shown in Additional file 1: Fig. S1.
The majority of the datasets indicate a negative correlation; however, the data points are sparsely scattered on both sides of the regression line. This result suggests that even if the DB index is minimal, it does not necessarily mean that the classification performance is the best; however, directly searching for the feature subset with the highest accuracy would be very time-consuming. Therefore, it is a reasonable and feasible solution to considerably narrow down the search scope by determining the smallest DB index.
After each GA search is completed, the SVM prediction accuracy on the validation set is calculated only for the best 10 individuals in the last generation, and the individual with the highest accuracy is selected as the optimal solution \({{\varvec{s}}}_{opt}\) for this GA search.
Owing to the random GA search process, not all genes in the optimal subset obtained in each search are relevant and informative for cancer classification. To obtain the genes that are not randomly selected, we set a threshold, \(\theta\). If a gene is selected more than \(\theta\) times across the 10 GA searches, it is included in the best gene subset \({{\varvec{s}}}_{best}\).
Finally, we adopted the classifiers to evaluate the obtained best gene subset \({{\varvec{s}}}_{best}\).
Regarding the classifier selection, SVM has demonstrated a better performance than the other existing machine learning algorithms in current research on two-class and multi-class microarray classification problems [40]. The features of SVMs include flexibility in the choice of similarity functions, the ability to handle data with large feature spaces, and the ability to obtain sparse solutions, making them suitable for gene expression data analysis [41]. Therefore, we chose the SVM as one of the major classifiers in this study.
The artificial neural network is an algorithm that simulates the structure and activity of neurons in the human brain. It comprises a series of neurons and connected layers. Backpropagation (BP) is the most popular algorithm for training a neural network by adjusting the synaptic weights [32].
The radial basis function kernel support vector machine (RBF-SVM) and the neural network trained by resilient backpropagation with weight backtracking (Rprop + NN) are used as classifiers to evaluate the performance of the selected feature subsets.
As illustrated in Fig. 4, a five-fold cross-validation test was performed. The entire training set \({X}_{tr}\) is adopted for parameter tuning and feature selection, as well as for the learning process of classifiers, and the test set is used to test the accuracy of the classification results. The details of the cross-validation test are described in Nested cross-validation.
We use the kofnGA package [42] and RDRToolbox package [43] in R to implement the genetic algorithm feature selection and Isomap algorithm, respectively. All the experiments are performed in the R environment.
Parameter selection and tuning
In the calculation process, the hyperparameters d and k are required as the input to the Isomap algorithm.
The parameter d is the dimensionality of the target space, which should be equal to the potential intrinsic dimensionality of data in the ideal case; however, the intrinsic dimensionality depends on the dataset and is difficult to determine in advance. The maximum likelihood dimensions estimator (MLDE) [44] method is used to automatically determine the dimensions of the target space of the Isomap algorithm.
For the parameter k of Isomap, we optimized the value of k using a grid search with a search range of [5, 20]. The k with the smallest DB index value after the Isomap dimensionality reduction is regarded as the optimal value.
After the parameter tuning with the entire training set \({X}_{tr}\), 10 pairs of training and validation sets were randomly generated by the bootstrap bagging method. For each training set \({X}_{tr}^{\left(b\right)}\), an IsoGA was performed, and finally, 10 optimal gene subsets were obtained.
To determine the threshold \(\theta\), we required that the probability of a gene being selected more than \(\theta\) times purely at random be less than 5%; the resulting value depends on the number of genes in each dataset.
For simplicity of calculation, if the selection of gene subsets is random, we assume that all genes will be selected with the same probability. Each GA search can be regarded as a Bernoulli trial, and the probability of being selected in each trial can be calculated as \(p=v/n.gene\) (where \(v\) is the size of the optimal subset and \(n.gene\) is the number of genes). Then, the number of times a gene is selected in 10 GA runs (\(X\)) follows the binomial distribution: \(X\sim B(10,p)\). The probability of a gene being selected \(\theta\) times is:
\[P\left(X=\theta \right)=\left(\begin{array}{c}10\\ \theta \end{array}\right){p}^{\theta }{\left(1-p\right)}^{10-\theta }\]
According to the number of genes in the dataset, we calculated the minimum \(\theta\) value that can make the probability \(\sum_{k=0}^{\theta }P\left(X=k\right)\) more than 0.95 as the threshold to obtain the best gene subset. This ensures that a gene selected more than θ times owing to randomness is a small probability event with a probability of less than 5%. Here, we consider that this gene is not selected randomly but correlates with cancer classification.
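The threshold rule can be sketched directly from the binomial model. The code below is an illustration under stated assumptions, not the authors' R code: the function names `selection_threshold` and `stable_genes` are hypothetical, and the gene counts of 2000 and 600 are made-up examples rather than values from Table 4.

```python
from collections import Counter
from math import comb

def selection_threshold(subset_size, n_genes, n_runs=10, alpha=0.05):
    """Smallest theta with P(X <= theta) > 1 - alpha for X ~ B(n_runs, p)."""
    p = subset_size / n_genes        # chance of a gene being picked in one run
    cumulative = 0.0
    for theta in range(n_runs + 1):
        cumulative += comb(n_runs, theta) * p**theta * (1 - p)**(n_runs - theta)
        if cumulative > 1 - alpha:
            return theta
    return n_runs

def stable_genes(selected_subsets, theta):
    """Genes selected in more than `theta` of the GA runs."""
    counts = Counter(g for subset in selected_subsets for g in set(subset))
    return sorted(g for g, c in counts.items() if c > theta)

print(selection_threshold(30, 2000))   # p = 0.015
print(selection_threshold(30, 600))    # p = 0.05
runs = [["g1", "g2"], ["g1", "g3"], ["g1", "g2"]]
print(stable_genes(runs, theta=1))
```

With fewer genes remaining after preprocessing, \(p\) grows and the threshold rises, so a gene must recur in more runs before it is treated as non-random.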
We applied the grid search method to optimize the parameters of each classifier. The parameter tuning ranges of RBF-SVM and Rprop + NN are provided in Tables 1 and 2, respectively.
For each parameter combination, we performed two three-fold cross-validations to measure the average prediction accuracy.
Computational complexity analysis
The proposed model is a hybrid method, and we discuss the computational complexity separately for each algorithm used in it. We can determine the complexity of MLDE, the DB index, and Isomap based on previous studies [34, 45,46,47]. As a result, the computational complexity of the proposed IsoGA is \(O\left({n}^{3}\right)\) and the complexity of parameter selection for Isomap is \(O\left(\mathrm{log}n\right)+O(p)\) (see Additional file 1 for a more detailed explanation).
The computational complexity of the two classifiers used in this study is not discussed here, as they are not part of our proposed IsoGA method and can be substituted with other classifiers.
Datasets and preprocessing
This study used eight benchmark cancer microarray datasets to evaluate the performance of the proposed method. We used the datasets processed by Zhu et al. [40], which were originally published in [48, 49]. A summary is presented in Table 3. These datasets include cancer types such as breast, central nervous system, colon, leukemia, lung, lymphoid, and small round blue cell tumor. The number of features ranges from 2,000 to over 24,000, and the target variables include both binary and multiclass classification situations, ranging from prognostic status to cancer subtype classification.
The Lymphoma dataset contains missing values, and the genes with missing values were removed. In addition, a few genes in the Breast and Lymphoma datasets had the same expression values. Such genes are meaningless for classification, so they were removed directly. A statistical summary of the final datasets after removal is provided in Table 3.
Because the various gene expressions in the datasets can affect the classification performance, the datasets were standardized. The samples containing several outliers were removed.
Owing to several irrelevant and redundant features in the microarray data [40], the GA search space becomes vast, thereby decreasing search efficiency and computational speed. Although GA has good global search performance, the existence of several redundant features significantly increases the randomness of the GA search.
Therefore, we calculated the information gain between the target variable and each gene. Information gain is a measure based on entropy; higher information gain means a higher correlation between a feature and the class label [50]. We found that the information gain of a vast number of genes was 0, which means that knowing the expression of these genes provides no information about the class label. Therefore, we removed these genes from the preprocessed datasets. The gene numbers after removal and the thresholds \(\theta\) for the best gene subset selection for each dataset are presented in Table 4.
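The filter can be illustrated with a short entropy calculation. The text does not state how continuous expression values were discretized, so the sketch below simply assumes that each gene has already been binned into categories (here "low" and "high"); that binning, the function names, and the toy data are all assumptions for the example.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature_bins, labels):
    """Entropy of the labels minus their entropy conditional on the feature."""
    n = len(labels)
    conditional = 0.0
    for b in set(feature_bins):
        subset = [y for x, y in zip(feature_bins, labels) if x == b]
        conditional += len(subset) / n * entropy(subset)
    return entropy(labels) - conditional

labels = [0, 0, 1, 1]
informative = ["low", "low", "high", "high"]    # bins track the class exactly
uninformative = ["low", "high", "low", "high"]  # bins are unrelated to class
print(information_gain(informative, labels))
print(information_gain(uninformative, labels))
```

A gene whose bins carry no information about the class gets a gain of zero and would be dropped before the GA search, which shrinks the search space.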
Evaluation methods
Evaluation metrics
The accuracy (\(Acc\)) is commonly adopted as the classifier evaluation index for classification problems and is defined as follows:
\[Acc=\frac{TP+TN}{P+N}\]
where P and N are the numbers of positive and negative samples, respectively, while TP and TN denote the numbers of positive and negative samples that were correctly predicted by the classifier.
One disadvantage of \(Acc\) is that it depends on the choice of the classification threshold when the output of the classifier is the probability of each class. The area under the receiver operating characteristic curve (AUC), which is not affected by the threshold, is a better choice.
In this study, however, several datasets have more than two classes, to which the standard AUC is not directly applicable. Therefore, the average accuracy, macro-AUC, and micro-AUC were all used to evaluate the classifier performance. The macro approach averages the values of a metric \(M\) computed for each class, while the micro approach aggregates the contingency tables of all classes and then computes \(M\) once across all classes [51].
Here, metric M is the AUC. As there is no consensus about macro and micro approaches [51], both metrics are considered in this study.
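The difference between the two averages can be sketched with a rank-based AUC. This is a hedged illustration: it assumes the common one-vs-rest convention, in which the macro value averages the per-class AUCs and the micro value pools every (sample, class) score into one binary problem. The function names and the toy probabilities are made up, and the study's R computation may differ in detail.

```python
def auc(scores, positives):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for s, y in zip(scores, positives) if y]
    neg = [s for s, y in zip(scores, positives) if not y]
    # A tie between a positive and a negative score counts as half a win.
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

def macro_micro_auc(prob, labels, n_classes):
    """One-vs-rest macro- and micro-averaged AUC from class probabilities."""
    per_class = [auc([row[c] for row in prob], [y == c for y in labels])
                 for c in range(n_classes)]
    macro = sum(per_class) / n_classes
    # Micro: pool all (sample, class) pairs into a single binary problem.
    pooled_scores = [row[c] for row in prob for c in range(n_classes)]
    pooled_truth = [y == c for y in labels for c in range(n_classes)]
    return macro, auc(pooled_scores, pooled_truth)

prob = [[0.8, 0.1, 0.1], [0.2, 0.7, 0.1], [0.1, 0.2, 0.7]]
print(macro_micro_auc(prob, [0, 1, 2], n_classes=3))
```

The macro average weights every class equally, whereas the micro average weights every sample equally, so the two can diverge when class sizes are unbalanced.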
Nested cross-validation
In this study, the nested cross-validation method was adopted, in which the outer and inner loops were cross-validated separately (Fig. 5).
For the entire dataset, we used a stratified sampling method to divide it into five folds. One was used as a test set, while the remaining four folds were used as the training set. The sample proportions of different classes in each fold were consistent with those of the population.
Simultaneously, validation is necessary for the inner loop (i.e., parameter tuning). Because the number of samples was small, we adopted the stratified bootstrap aggregating (bagging) method to explore the optimal feature subset on the inner side for the GA search process.
The stratified bagging method uses random sampling with replacement to sample each class of data separately. Then, the sampling results of all classes are combined to generate the in-bag set. The out-of-bag samples, which are not selected, are used as the validation set, and they contain no duplicates. We set the bootstrap sample size to be the same as the original dataset (i.e., the entire training set) and repeated the sampling 10 times. Random simulation results indicate that all samples can be selected into the training set at least once after 10 rounds of bagging.
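A minimal sketch of one stratified bagging round follows, working on sample indices only. The function name `stratified_bootstrap` and the class sizes in the example are assumptions for illustration.

```python
import random
from collections import defaultdict

def stratified_bootstrap(labels, rng):
    """Return (in_bag, out_of_bag) index lists for one stratified bootstrap."""
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    in_bag = []
    for members in by_class.values():
        # Sample with replacement within each class, keeping the class size.
        in_bag.extend(rng.choices(members, k=len(members)))
    chosen = set(in_bag)
    # Samples never drawn form the validation (out-of-bag) set.
    out_of_bag = [i for i in range(len(labels)) if i not in chosen]
    return in_bag, out_of_bag

labels = [0] * 12 + [1] * 8          # illustrative class sizes
rng = random.Random(1)
for b in range(3):
    in_bag, oob = stratified_bootstrap(labels, rng)
    print(b, len(in_bag), len(oob))
```

Because sampling is done within each class, the in-bag set keeps the class proportions of the training set, while the size of the out-of-bag set varies from round to round.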
Ranking score
Feature selection aims to build a higher-accuracy model with fewer features. When there is no significant difference in accuracy, we tend to consider that using fewer features is better. Therefore, we adopted the ranking score \(\mathcal{R}\) here to compare and evaluate the feature selection methods comprehensively.
We calculated the ranking of each model for each metric. The higher the classification performance, the higher the ranking; simultaneously, the fewer the number of features selected, the higher the ranking. The ranking score \(\mathcal{R}\) is defined as follows:
where \({r}_{i}^{j}\) denotes the sum of rankings of the \(i\)th metric of the \(j\)th model over all datasets, and \({r}_{\mathbf{s}}\) denotes the sum of rankings of \(\mathbf{s}\) over all datasets.
Model comparison
We compare different feature selection methods from two aspects. First, to verify the effectiveness of the Isomap algorithm in our proposed framework, the MDS-embedded GA (MDS-GA) method and the GA method without any dimensionality reduction were also evaluated. All methods used the same hierarchical five-fold cross-validation training and test sets to ensure fair comparisons.
We then considered the CER-ABC feature selection process proposed by Garro et al. [32] and the Markov-embedded genetic algorithm (MBEGA) for gene selection proposed by Zhu et al. [40] as competitive models. These two methods have shown promising performance in the gene selection of microarray data.
In brief, the CER-ABC method uses the artificial bee colony (ABC) algorithm, which, like the genetic algorithm, is one of the most popular metaheuristic algorithms [10], as the optimization technique and the classification error function (CER) as the fitness function. We use the metaheuristicOpt [52] package in R to implement the ABC algorithm. The parameters of ABC are set according to the reported parameters in the original paper, and default parameters are applied for unreported ones. For the threshold \(th\), i.e., the probability that a gene can be selected, we used the value that achieved the highest accuracy for each dataset to obtain the feature subset as the result of this method. Because our goal is to compare the effectiveness of feature selection, we only utilized the feature selection results of CER-ABC and assessed them using the same classifiers and tuning approach as IsoGA.
Like our proposed method, MBEGA [40] is a GA-based gene selection method. We compared our method with MBEGA and used the same datasets as those used in developing MBEGA. We relied on the published results of MBEGA for comparison without implementing this method ourselves.
Results
Results based on the proposed framework
As described above, we first verified the effectiveness of the Isomap algorithm within the same framework. The subset of the entire training set that contains only the genes selected by the feature selection method is used to train the classifier, while the test set is used to evaluate the performance of the trained classifier.
Accordingly, three models were tested, i.e., IsoGA, MDS-GA, and GA. All models follow the proposed framework. The following part presents the performance of RBF-SVM and Rprop + NN trained on the feature subsets selected by each method on each dataset, using two evaluation indicators, macro-AUC and micro-AUC.
Tables 5 and 6 present the average macro- and micro-AUC values of the RBF-SVM classifier and their corresponding standard deviations in the outer five-fold cross-validation. The best results obtained for each dataset are indicated in bold.
The proposed IsoGA achieved the highest average macro- and micro-AUC values for the Breast, Leukemia, and Lymphoma datasets. The average value of the micro-AUC of Leukemia and Lymphoma was 1, and the standard deviation was 0.
The highest average macro- and micro-AUC values were attained using MDS-GA for the CNS, Colon, and Lung datasets. All three methods showed similar performances on the SRBCT dataset.
We performed the Wilcoxon signed-rank test to test the statistical significance of the results of the five folds obtained by the different methods. The result of IsoGA was used as a benchmark to test whether MDS-GA and GA were statistically significantly different from it. We also calculated the ranking of each method in terms of macro- and micro-AUC (indicated as Marank and Mirank, respectively). These results can be found in Additional file 1: Table S1.
Based on the Wilcoxon signed-rank test results, we found that most of the differences among the three methods were not statistically significant. According to the sum of rankings, the sum of the AUC rankings of the MDS-GA method for the two classifiers is higher. Therefore, in the proposed GA-based feature selection framework, the subset of genes selected by the IsoGA method had a slightly lower classification performance than MDS-GA on the RBF-SVM classifier.
Similarly, Tables 7 and 8 present the average macro- and micro-AUC values of the Rprop + NN classifier and their corresponding standard deviations in the outer five-fold cross-validation. The best results obtained for each dataset are indicated in bold.
According to the macro-AUC values of the Rprop + NN classifier, the subset of genes selected by the IsoGA method outperformed the other two methods on five datasets: Breast, Colon, Leukemia, Lung, and MLL.
Based on the micro-AUC values, the performance on four datasets, Breast, Colon, Leukemia, and MLL, was better than that of the other two methods. In addition, although not the highest, the results for Lung and SRBCT datasets, 0.975 and 0.991, respectively, can be considered very close to the optimal results of 0.974 and 0.997, with a marginal difference.
Similarly, the ranking of the performance of each method and the p value of the Wilcoxon signed-rank test on the different datasets are shown in Additional file 1: Table S2.
According to the Wilcoxon signed-rank test results, most of the differences between the three methods were not statistically significant; however, the IsoGA method had the highest sum of rankings in the overall AUC rankings for the two classifiers.
Overall, in the proposed GA-based feature selection framework, the Rprop + NN classifier trained on the subset of genes selected by the IsoGA method outperformed those trained on the subsets selected by the MDS-GA and GA methods.
Because the primary aim of feature selection is to reduce the data dimensionality, it is better to select fewer genes when there is no significant improvement in classification accuracy. The average number of genes selected by each method, \(s\), is summarized in Table 9. The optimal result obtained on each dataset, i.e., the minimum average size, is shown in bold.
The ranking score \(\mathcal{R}\) of the classification performance of the two classifiers and the ranking of the selected feature subset sizes of these three methods are provided in Table 10.
In summary, the proposed IsoGA method achieved the best overall performance (\(\mathcal{R}=22.2\)), indicating that it can select fewer genes while achieving a high classification accuracy.
Comparison with other existing methods
Because the performance metric adopted in these comparison models is the average classification accuracy, and we did not have access to the MBEGA code to calculate its macro- and micro-AUC, we compared the results based on the average accuracy.
Wilcoxon signed-rank tests were performed to compare the accuracy results from the outer fivefold cross-validation of the models on each dataset (Table 11), with separate tests for each classifier. IsoGA was taken as the reference. If the average classification accuracy of IsoGA was higher than that of the compared method, a one-sided test was performed: a p value below the given significance level indicates that the result of IsoGA is significantly higher, and otherwise no significant difference is indicated. If the average classification accuracy of IsoGA was lower, a two-sided test was performed: a p value below the given significance level indicates that the result of IsoGA is significantly different from that of the compared method, and otherwise no significant difference is indicated. “**” denotes a significance level of 0.05, and “*” denotes a significance level of 0.1.
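The decision rule described above can be sketched as follows. This is an illustrative, standard-library-only implementation of the exact Wilcoxon signed-rank test (enumerating all sign assignments, which is cheap for five folds); it is not the code used in the study. The function names and the fold accuracies are hypothetical, and treating equal mean accuracies as the two-sided case is an assumption, since the text does not cover that case.

```python
from itertools import product
from typing import Sequence, Tuple


def wilcoxon_signed_rank(
    x: Sequence[float], y: Sequence[float], alternative: str = "two-sided"
) -> Tuple[float, float]:
    """Exact Wilcoxon signed-rank test; returns (W+, p value)."""
    # Round to suppress floating-point noise, then drop zero differences.
    d = [round(a - b, 12) for a, b in zip(x, y)]
    d = [v for v in d if v != 0]
    n = len(d)
    if n == 0:
        return 0.0, 1.0
    # Average ranks of |d| (ties share the mean of their rank positions).
    order = sorted(range(n), key=lambda i: abs(d[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(d[order[j + 1]]) == abs(d[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    w_plus = sum(r for r, v in zip(ranks, d) if v > 0)
    # Exact null distribution: every rank is positive or negative with prob. 1/2.
    null = [
        sum(r for r, s in zip(ranks, signs) if s)
        for signs in product((0, 1), repeat=n)
    ]
    p_greater = sum(w >= w_plus for w in null) / len(null)
    p_less = sum(w <= w_plus for w in null) / len(null)
    if alternative == "greater":
        return w_plus, p_greater
    if alternative == "less":
        return w_plus, p_less
    return w_plus, min(1.0, 2 * min(p_greater, p_less))


def compare_to_reference(
    ref_acc: Sequence[float], other_acc: Sequence[float], alpha: float = 0.05
) -> Tuple[str, float, bool]:
    """One-sided test if the reference mean is higher, two-sided otherwise."""
    mean_ref = sum(ref_acc) / len(ref_acc)
    mean_other = sum(other_acc) / len(other_acc)
    # Assumption: equal means fall into the two-sided branch.
    alternative = "greater" if mean_ref > mean_other else "two-sided"
    _, p = wilcoxon_signed_rank(ref_acc, other_acc, alternative)
    return alternative, p, p < alpha


# Hypothetical fold accuracies for a reference method and a compared method.
ref_folds = [0.95, 0.90, 0.92, 0.97, 0.93]
other_folds = [0.90, 0.88, 0.91, 0.90, 0.89]
print(compare_to_reference(ref_folds, other_folds))
```

One practical consequence is worth keeping in mind when reading such tables: with only five paired folds, the smallest p value the exact test can return is 1/32 for a one-sided test and 1/16 for a two-sided test, so a two-sided comparison cannot reach the 0.05 level.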
The results in Table 11 indicate that the proposed IsoGA method achieved the best average accuracy on the RBF-SVM classifier for five datasets (Breast, Leukemia, Lymphoma, MLL, and SRBCT). The maximum average accuracy achieved on each dataset is shown in bold.
For the CNS dataset, the gene subset selected by the CERABC algorithm achieved the best prediction accuracy on the RBF-SVM classifier, and for the Colon and Lung datasets, the MBEGA method achieved the highest accuracy; however, the optimal gene subsets selected by IsoGA for CNS and Colon were the smallest. Only for the Lung dataset did the MBEGA method select the fewest genes while also achieving the highest accuracy.
To comprehensively compare these models, the rankings of the average prediction accuracy \({\overline{r} }^{Acc}\) and selected gene subset sizes \({r}_{{\varvec{s}}}\) are summarized in Table 12. As the results of the MBEGA method are based solely on the SVM classifier, the results of the SVM are considered in calculating the average accuracy ranking.
According to the results, the proposed IsoGA method achieved the highest ranking score (\(\mathcal{R}\) = 30.5), representing the best overall combination of classification performance and small gene subset size.
Feature selection results and visualization
Visualizing the dimensionality-reduced dataset is an intuitive way to verify whether the selected feature subsets are related to cancer classification and to compare classification performance.
We show the visualization results of two datasets, Leukemia and Lung (Fig. 6). The results of the other datasets can be found in Additional file 1: Fig. S2. The upper panel illustrates the results of Isomap dimensionality reduction using all genes, and the lower panel illustrates the results of Isomap dimensionality reduction using only the subset of genes selected by the IsoGA method.
Even when all genes are used, Isomap yields fairly clear results after dimensionality reduction, which supports our hypothesis that microarray data are distributed on a nonlinear structure.
Using the genes selected by IsoGA, the data points of each class are separated more clearly. This indicates that the proposed feature selection framework can effectively remove noise and redundancy that are irrelevant to classification, and as a result it produces visualizations that are easier to understand and explain.
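For readers unfamiliar with how Isomap captures nonlinear structure, the sketch below illustrates only its first two steps in plain Python: building a symmetric k-nearest-neighbor graph and computing shortest-path (geodesic) distances over it. The final step of Isomap, classical multidimensional scaling on these distances, is omitted. This is a simplified illustration under assumed names (`geodesic_distances`, `k`), not the implementation used in the study.

```python
import math
from typing import List, Sequence


def euclidean(a: Sequence[float], b: Sequence[float]) -> float:
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def geodesic_distances(points: List[Sequence[float]], k: int) -> List[List[float]]:
    """Approximate geodesic distances via a kNN graph and Floyd-Warshall."""
    n = len(points)
    inf = float("inf")
    dist = [[inf] * n for _ in range(n)]
    for i in range(n):
        dist[i][i] = 0.0
        # Connect each point to its k nearest neighbors (Euclidean edges).
        neighbors = sorted(
            (euclidean(points[i], points[j]), j) for j in range(n) if j != i
        )
        for d, j in neighbors[:k]:
            dist[i][j] = min(dist[i][j], d)
            dist[j][i] = min(dist[j][i], d)  # keep the graph symmetric
    # Floyd-Warshall: shortest paths through the neighborhood graph.
    for m in range(n):
        for i in range(n):
            for j in range(n):
                via = dist[i][m] + dist[m][j]
                if via < dist[i][j]:
                    dist[i][j] = via
    return dist


# Toy example: 11 points on a unit half-circle (a one-dimensional manifold in 2D).
arc = [(math.cos(math.pi * i / 10), math.sin(math.pi * i / 10)) for i in range(11)]
geo = geodesic_distances(arc, k=2)
print("geodesic end-to-end:", geo[0][10])
print("Euclidean end-to-end:", euclidean(arc[0], arc[10]))
```

In this toy case the geodesic distance between the two ends of the arc is close to the arc length π, while the straight-line distance is 2. A method based on Euclidean distances alone would therefore underestimate how far apart the end points are along the manifold.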
Discussions
In this study, we proposed a novel feature selection method called IsoGA and a framework based on it. The proposed method could select a smaller subset of critical genes and improve the accuracy of cancer classification. It takes into account the nonlinear structure of gene expression in microarray data and uses Isomap for dimensionality reduction and fitness evaluation. Moreover, the proposed framework reduced the randomness of the GA search by repeating the search process and selecting features based on a specified threshold. Thus, more noisy features could be removed, leaving a limited number of potentially cancer-related genes for cancer classification, and the overall accuracy of the classifiers was improved. We found that IsoGA exhibited efficient gene selection performance and achieved high accuracy in cancer classification. We also found that a nonlinear method might be a better choice for dimensionality reduction in microarray data.
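The idea of repeating the search and keeping only genes that pass a threshold can be sketched as below. The null model is an assumption made purely for illustration: in each run, a gene is taken to be picked by chance with probability equal to the average subset size divided by the total number of genes, and a gene is kept if the binomial probability of seeing it at least as often as observed is below the threshold. The exact statistic used in the study may differ; all names and the toy data are hypothetical.

```python
from math import comb
from typing import Dict, Iterable, List


def binomial_tail(count: int, runs: int, p0: float) -> float:
    """P(X >= count) for X ~ Binomial(runs, p0)."""
    return sum(
        comb(runs, j) * p0**j * (1 - p0) ** (runs - j)
        for j in range(count, runs + 1)
    )


def stable_genes(
    run_subsets: List[Iterable[int]], n_genes: int, alpha: float = 0.05
) -> List[int]:
    """Genes selected more often across runs than chance would explain."""
    subsets = [set(s) for s in run_subsets]
    runs = len(subsets)
    # Assumed null: chance selection probability = mean subset size / n_genes.
    p0 = sum(len(s) for s in subsets) / (runs * n_genes)
    counts: Dict[int, int] = {}
    for subset in subsets:
        for gene in subset:
            counts[gene] = counts.get(gene, 0) + 1
    return sorted(
        gene for gene, c in counts.items() if binomial_tail(c, runs, p0) < alpha
    )


# Toy example: five hypothetical GA runs over 20 genes (indices 0..19).
toy_runs = [{0, 1, 5}, {0, 1, 7}, {0, 2, 9}, {0, 1, 11}, {0, 3, 1}]
selected = stable_genes(toy_runs, n_genes=20)
print(selected)
```

In the toy data, genes 0 and 1 recur in almost every run and are retained, whereas genes that appear only once are indistinguishable from chance selections and are dropped.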
The originality and significance of this study are summarized as follows:

1. This study hybridized the manifold learning algorithm Isomap with GA for feature selection. This hybridization takes into account the nonlinear structure of microarray data. Isomap maps sample points distributed in a non-Euclidean space to a low-dimensional Euclidean space by calculating the geodesic distances between sample points. Comparison results showed that GA combined with Isomap achieved the highest ranking score \(\mathcal{R}\), indicating the best feature selection performance, compared with GA combined with a linear dimensionality reduction method or GA without dimensionality reduction.

2. This study introduced an innovative approach to evaluating the correlation between feature subsets and cancer subtypes in GA. Instead of relying on classifier accuracy, we used the clarity of division between samples of different classes. The fitness of a solution in the GA search is evaluated by the DB index, which avoids classifier dependency, so the selected genes can be used with any classifier. The DB index enables inferences about the appropriateness of a data partition and helps to assess which subset of genes can effectively partition samples with different labels. However, as noted by Thomas et al. [53], the DB index evaluates the distance between clusters using Euclidean distance and does not consider the geometry of the spatial distribution of clusters. To address this limitation, Isomap is used in the proposed method to map the nonlinear microarray data to a low-dimensional linear space, considering the underlying geometry of the data distribution.

3. The proposed feature selection framework aims to mitigate the impact of algorithmic randomness in selecting features. Although the good global search performance of GA benefits from random mutation, this randomness can lead to irrelevant features being included in the optimal feature subset. Therefore, we introduced a statistical method that aggregates the outputs of multiple GA searches, and only genes with a probability of less than 5% of being randomly selected are included in the optimal subset. The comparison results show that this improvement selects fewer genes while obtaining the same or even higher accuracy. This indicates that the proposed framework can mitigate the randomness of the metaheuristic algorithm.
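As an illustration of the classifier-independent fitness mentioned in the second point, the following is a minimal sketch of the standard Davies–Bouldin index in plain Python, with the class labels used as cluster assignments. In the proposed method the index would be computed on the Isomap-embedded samples; here the input is simply assumed to be low-dimensional coordinates. The function name and toy points are hypothetical.

```python
import math
from typing import Dict, Hashable, List, Sequence


def davies_bouldin(points: List[Sequence[float]], labels: Sequence[Hashable]) -> float:
    """Davies-Bouldin index with Euclidean distances; lower means better separation."""
    clusters: Dict[Hashable, List[Sequence[float]]] = {}
    for p, y in zip(points, labels):
        clusters.setdefault(y, []).append(p)
    if len(clusters) < 2:
        raise ValueError("at least two classes are required")
    # Centroid of each class.
    centroids = {
        y: [sum(coord) / len(ps) for coord in zip(*ps)] for y, ps in clusters.items()
    }
    # Within-class scatter: mean distance of members to their centroid.
    scatter = {
        y: sum(math.dist(p, centroids[y]) for p in ps) / len(ps)
        for y, ps in clusters.items()
    }
    keys = list(clusters)
    total = 0.0
    for i in keys:
        # Worst-case similarity of class i to any other class.
        # Note: coinciding centroids would cause a division by zero here.
        total += max(
            (scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
            for j in keys
            if j != i
        )
    return total / len(keys)


# Toy example: the same four points under a good and a poor labeling.
toy_points = [(0.0, 0.0), (0.0, 1.0), (5.0, 0.0), (5.0, 1.0)]
db_separated = davies_bouldin(toy_points, [0, 0, 1, 1])
db_mixed = davies_bouldin(toy_points, [0, 1, 0, 1])
print(db_separated, db_mixed)
```

A gene subset whose embedding separates the classes well gives a small index, whereas one that mixes the classes gives a large index, so a GA would minimize this value (or maximize a decreasing transform of it) without ever training a classifier.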
The classification performance of the proposed method was compared with that of other existing methods on eight microarray datasets of different cancers. IsoGA achieved the highest ranking score \(\mathcal{R}\), indicating that highly accurate classification can be achieved with a smaller gene subset. Prior to applying the proposed models, these datasets were preprocessed by removing missing values and outliers, and uninformative features were filtered out using information gain, because some of the datasets are multiclass and the t-test is not applicable to them. IsoGA improves classification accuracy and preserves data interpretability. It has general applicability in that it can be extended to various classifiers. Although RBF-SVM and Rprop + NN were used in this study, IsoGA could be combined with many other classifiers for cancer classification since the feature selection is independent of the classifier.
However, there are several limitations. Firstly, we did not consider potential similarity and interaction among the genes, which may affect the stability of the feature selection algorithm and the classification performance. Understanding these factors requires knowledge of biology and disease, which is beyond the scope of this study. Secondly, although Isomap is an effective method in various domains, it still has some shortcomings, such as topological instability and an inability to handle nonconvex manifolds [54]. Isomap is also an unsupervised dimensionality reduction technique, so it cannot use class label information or embed new data points for testing or validation. Some extended Isomap-based methods have been proposed to solve this problem. For example, Multi-manifold Discriminant Isomap (MMD-Isomap) [55] and semi-supervised discriminant Isomap (SSD-Isomap) [56] may provide a better solution. Since validation of Isomap is not necessary in our proposed framework, these extended methods are not considered here. Lastly, we assumed that gene microarray data are more likely to lie in a nonlinear space. However, the distribution of real-world gene expression data is far more complex, and it is difficult to verify the nonlinear space assumption. Nevertheless, the comparison results suggested that a nonlinear distance provides a better fit than a linear one.
Conclusions
In this study, we proposed a GA-based feature selection framework called IsoGA to select the optimal subset of genes in microarray data. The framework embedded the Isomap algorithm for nonlinear dimensionality reduction and selected the genes that met a given probability-based threshold as the best for classification. IsoGA exhibited efficient gene selection performance and achieved high accuracy in cancer classification.
Availability of data and materials
The microarray datasets analyzed in this study are available at http://csse.szu.edu.cn/staff/zhuzx/Datasets.html.
References
Daoud M, Mayo M. A survey of neural network-based cancer prediction models from microarray data. Artif Intell Med. 2019;1(97):204–14.
Colombo PE, Milanezi F, Weigelt B, Reis-Filho JS. Microarrays in the 2010s: the contribution of microarray-based gene expression profiling to breast cancer classification, prognostication and prediction. Breast Cancer Res. 2011;13(3):1–15. https://doi.org/10.1186/bcr2890.
Tarca AL, Romero R, Draghici S. Analysis of microarray experiments of gene expression profiling. Am J Obstet Gynecol. 2006;195(2):373–88.
Piatetsky-Shapiro G, Tamayo P. Microarray data mining: facing the challenges. ACM SIGKDD Explor Newsl. 2003;5(2):1–5.
Hira ZM, Gillies DF. A review of feature selection and feature extraction methods applied on microarray data. Adv Bioinform. 2015;2015:198363. https://doi.org/10.1155/2015/198363.
Huynh PH, Nguyen VH, Do TN. Improvements in the Large p, Small n Classification Issue. SN Comput Sci. 2020;1(4):1–19. https://doi.org/10.1007/s42979020002102.
Osareh A, Shadgar B. Microarray data analysis for cancer classification. In: 2010 5th international symposium on health informatics and bioinformatics, HIBIT 2010. 2010. p. 125–32.
Alhenawi E, Al-Sayyed R, Hudaib A, Mirjalili S. Feature selection methods on gene expression microarray data for cancer classification: a systematic review. Comput Biol Med. 2022;1(140):105051.
Sharma M, Kaur P. A comprehensive analysis of nature-inspired meta-heuristic techniques for feature selection problem. Arch Comput Methods Eng. 2021;28(3):1103–27. https://doi.org/10.1007/s11831020094126.
Shukla AK, Tripathi D, Reddy BR, Chandramohan D. A study on metaheuristics approaches for gene selection in microarray data: algorithms, applications and open challenges [Internet]. In: Evolutionary intelligence, vol. 13. Springer; 2020. p. 309–29. https://doi.org/10.1007/s12065019003066
Schiezaro M, Pedrini H. Data feature selection based on artificial bee colony algorithm. EURASIP J Image Video Process. 2013;47:1–8.
Musheer RA, Verma CK, Srivastava N. Novel machine learning approach for classification of high-dimensional microarray data. Soft Comput. 2019;23(24):13409–21. https://doi.org/10.1007/s00500019038797.
Aziz RM. Application of nature inspired soft computing techniques for gene selection: a novel frame work for classification of cancer. Soft Comput. 2022;26(22):12179–96. https://doi.org/10.1007/s00500022070329.
Aziz RM. Cuckoo search-based optimization for cancer classification: a new hybrid approach. J Comput Biol. 2022;29(6):565–84. https://doi.org/10.1089/cmb.2021.0410.
Wang H, Jing X, Niu B. A discrete bacterial algorithm for feature selection in classification of microarray gene expression cancer data. Knowl Based Syst. 2017;126:8–19.
Pashaei EE, Pashaei EE. An efficient binary chimp optimization algorithm for feature selection in biomedical data classification. Neural Comput Appl. 2022;34(8):6427–51. https://doi.org/10.1007/s00521021067750.
Nouri-Moghaddam B, Ghazanfari M, Fathian M. A novel multi-objective forest optimization algorithm for wrapper feature selection. Expert Syst Appl. 2021;1(175):114737.
Holland JH. Genetic algorithms. Sci Am. 1992;267(1):66–73.
Rai D, Garg AK, Tyagi K. Bio-inspired optimization techniques. ACM SIGSOFT Softw Eng Notes. 2013;38(4):1–7. https://doi.org/10.1145/2492248.2492271.
Oh IS, Lee JS, Moon BR. Hybrid genetic algorithms for feature selection. IEEE Trans Pattern Anal Mach Intell. 2004;26(11):1424–37.
Hsu HH, Hsieh CW, Lu MD. Hybrid feature selection by combining filters and wrappers. Expert Syst Appl. 2011;38(7):8144–50.
Jović A, Brkić K, Bogunović N. A review of feature selection methods with applications. In: 2015 38th international convention on information and communication technology, electronics and microelectronics, MIPRO 2015—proceedings. 2015. p. 1200–5.
Aziz R, Verma CK, Srivastava N. A novel approach for dimension reduction of microarray. Comput Biol Chem. 2017;1(71):161–9.
Kumar L, Bharti KK. A novel hybrid BPSO–SCA approach for feature selection. Nat Comput. 2021;20(1):39–61. https://doi.org/10.1007/s1104701909769z.
Aziz RM. Nature-inspired metaheuristics model for gene selection and classification of biomedical microarray data. Med Biol Eng Comput. 2022;60(6):1627–46. https://doi.org/10.1007/s11517022025557.
Liu XY, Liang Y, Wang S, Yang ZY, Ye HS. A hybrid genetic algorithm with wrapper-embedded approaches for feature selection. IEEE Access. 2018;27(6):22863–74.
Alshamlan HM, Badr GH, Alohali YA. Genetic Bee Colony (GBC) algorithm: a new gene selection method for microarray cancer classification. Comput Biol Chem. 2015;1(56):49–60.
Aziz R, Verma CK, Srivastava N. Artificial neural network classification of high dimensional data with novel optimization approach of dimension reduction. Ann Data Sci. 2018;5(4):615–35. https://doi.org/10.1007/s4074501801552.
Das K, Mishra D, Shaw K. A metaheuristic optimization framework for informative gene selection. Inform Med Unlocked. 2016;4:10–20.
Aziz R, Verma CK, Srivastava N. Dimension reduction methods for microarray data: a review. AIMS Bioeng. 2017;4(1):179–97. https://doi.org/10.3934/bioeng.2017.1.179.
Karegowda AG, Jayaram MA, Manjunath AS. Feature subset selection problem using wrapper approach in supervised learning. Int J Comput Appl. 2010;1(7):13–7.
Garro BA, Rodríguez K, Vázquez RA. Classification of DNA microarrays using artificial neural networks and ABC algorithm. Appl Soft Comput J. 2016;1(38):548–60.
Nilsson J. Manifold learning in computational biology [Internet]. Centre for Mathematical Sciences, Lund University; 2008. https://portal.research.lu.se/en/publications/manifoldlearningincomputationalbiology.
Bartenhagen C, Klein HU, Ruckert C, Jiang X, Dugas M. Comparative study of unsupervised dimension reduction techniques for the visualization of microarray gene expression data. BMC Bioinform. 2010;11(1):1–11.
Orsenigo C, Vercellis C. A comparative study of nonlinear manifold learning methods for cancer microarray data classification. In: Expert systems with applications, vol. 40. 2013. p. 2189–97.
Davies DL, Bouldin DW. A cluster separation measure. IEEE Trans Pattern Anal Mach Intell. 1979;PAMI-1(2):224–7.
Tenenbaum JB, De Silva V, Langford JC. A global geometric framework for nonlinear dimensionality reduction. Science. 2000;290(5500):2319–23.
De Silva V, Tenenbaum JB. Global versus local methods in nonlinear dimensionality reduction. In: Advances in neural information processing systems, vol. 15. 2002.
Christoph B. A package for nonlinear dimension reduction with Isomap and LLE [Internet]. GitHub. 2019. https://github.com/Budheimer/RDRToolbox/blob/master/R/Isomap.R.
Zhu Z, Ong YS, Dash M. Markov blanket-embedded genetic algorithm for gene selection. Pattern Recogn. 2007;40(11):3236–48.
Brown MPS, Grundy WN, Lin D, Cristianini N, Sugnet C, Ares M Jr, et al. Support vector machine classification of microarray gene expression data [Internet]. University of California, Santa Cruz, Technical Report UCSC-CRL-99-09. 1999. https://www.soe.ucsc.edu/research/technicalreports/UCSCCRL9909.
Wolters MA. A genetic algorithm for selection of fixed-size subsets with application to design problems. J Stat Softw. 2015;24(68):1–18.
Bartenhagen C. RDRToolbox: a package for nonlinear dimension reduction with Isomap and LLE. R package version 1.48.0. 2022.
Levina E, Bickel P. Maximum likelihood estimation of intrinsic dimension. Advances in neural information processing systems; 2004. vol. 17. p. 1–8.
Hino H. ider: intrinsic dimension estimation with R. R J. 2017;9(2):329.
Beygelzimer A, Kakadet S, Langford J, Arya S, Mount D, Li S. FNN: fast nearest neighbor search algorithms and applications. R package version. 2022;1(1):1–17.
Muravyov S, Antipov D, Buzdalova A, Filchenkov A. Efficient computation of fitness function for evolutionary clustering. Mendel. 2019;25(1):87–94.
Li T, Zhang C, Ogihara M. A comparative study of feature selection and multiclass classification methods for tissue classification based on gene expression. Bioinformatics. 2004;20(15):2429–37.
Li J, Liu H. Kent ridge biomedical data set repository. Institute for Infocomm Research. 2002.
Zhang G, Hou J, Wang J, Yan C, Luo J. Feature selection for microarray data classification using hybrid information gain and a modified binary krill herd algorithm. Interdiscip Sci Comput Life Sci. 2020;12(3):288–301. https://doi.org/10.1007/s1253902000372w.
Gibaja E, Ventura S. A tutorial on multi-label learning. ACM Comput Surv. 2015. https://doi.org/10.1145/2716262.
Riza LS, Iip, Nugroho EP, Prabowo MBA, Junaeti E, Abdullah AG. metaheuristicOpt: metaheuristic for optimization. R package version 10 0, 2017. 2019;1–48. https://cran.r-project.org/package=metaheuristicOpt.
Thomas JCR, Peñas MS, Mora M. New version of Davies–Bouldin Index for clustering validation based on cylindrical distance. In: Proceedings—international conference of the Chilean computer science society, SCCC. IEEE Computer Society; 2013. p. 49–53.
Anowar F, Sadaoui S, Selim B. Conceptual and empirical comparison of dimensionality reduction algorithms (PCA, KPCA, LDA, MDS, SVD, LLE, ISOMAP, LE, ICA, t-SNE). Comput Sci Rev. 2021;1(40):100378.
Yang B, Xiang M, Zhang Y. Multi-manifold discriminant Isomap for visualization and classification. Pattern Recogn. 2016;1(55):215–30.
Huang R, Zhang G, Chen J. Semi-supervised discriminant Isomap with application to visualization, image retrieval and classification. Int J Mach Learn Cybern. 2019;10(6):1269–78. https://doi.org/10.1007/s1304201808096.
Acknowledgements
Not applicable.
Funding
This work was supported by the Japan Society for the Promotion of Science KAKENHI grants 20H05967, 20K21827, and 21H05052.
Author information
Contributions
ZW and TT conceived the study. ZW conducted the experiments and drafted the manuscript. YZ and YST suggested the study and revised the manuscript. JS helped revise the manuscript. TS revised the manuscript and managed the project funding. All authors read and approved the final manuscript.
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
JS is an Associate Editor of BMC Bioinformatics. Other authors declare that they have no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Additional file 1.
This supplementary file includes a detailed description of the computational complexity analysis, as well as modeling details and results: (1) Table S1: The Rank of Macro-AUC and Micro-AUC of RBF-SVM Classification on Microarray Datasets and the P Value of Wilcoxon Signed-Rank Test; (2) Table S2: The Rank of Macro-AUC and Micro-AUC of Rprop + NN Classification on Microarray Datasets and the P Value of Wilcoxon Signed-Rank Test; (3) Table S3: The parameter selection and tuning range; (4) Figure S1: The regression fitting results of the classification accuracy of gene subsets with different DB values; (5) Figure S2: Visualization results of each dataset.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Wang, Z., Zhou, Y., Takagi, T. et al. Genetic algorithm-based feature selection with manifold learning for cancer classification using microarray data. BMC Bioinformatics 24, 139 (2023). https://doi.org/10.1186/s12859023052673
DOI: https://doi.org/10.1186/s12859023052673