A multi-filter enhanced genetic ensemble system for gene selection and sample classification of microarray data
© Yang et al. 2010
Published: 18 January 2010
Feature selection techniques are critical to the analysis of high dimensional datasets. This is especially true in gene selection from microarray data, which commonly have an extremely high feature-to-sample ratio. In addition to essential objectives such as reducing data noise, reducing data redundancy, improving sample classification accuracy, and improving model generalization, feature selection also helps biologists focus on the selected genes to further validate their biological hypotheses.
In this paper we describe an improved hybrid system for gene selection. It is based on a recently proposed genetic ensemble (GE) system. To enhance the generalization property of the selected genes or gene subsets and to overcome the overfitting problem of the GE system, we devised a mapping strategy to fuse the goodness information of each gene provided by multiple filtering algorithms. This information is then used for initialization and mutation operation of the genetic ensemble system.
We used four benchmark microarray datasets (including both binary-class and multi-class classification problems) for proof of concept and model evaluation. The experimental results indicate that the proposed multi-filter enhanced genetic ensemble (MF-GE) system is able to improve sample classification accuracy, generate more compact gene subsets, and converge to the selection results more quickly. The MF-GE system is very flexible, as various combinations of multiple filters and classifiers can be incorporated based on the data characteristics and the user preferences.
Feature selection is an important process for high dimensional data analysis. With the advancement of new high-throughput biotechnologies, feature selection quickly found its use in the analysis of the massive quantity of generated data. Gene selection in microarray data is one such crucial application because microarray datasets inherently have a high feature-to-sample ratio, i.e., several thousand features (genes) with only a few dozen samples. To identify biologically significant biomarkers and to improve the ability to diagnose new cases, robust and scalable feature selection methods play a critical role.
Unlike filters, the "wrapper" approach evaluates a selected gene subset according to its power to improve sample classification accuracy. The classification is thus "wrapped" in the gene selection process, as depicted in Figure 1(b). Classical wrapper algorithms include forward selection and backward elimination. Recently, evolutionary algorithms such as the Genetic Algorithm (GA) and Evolution Strategy (ES) have been introduced as more advanced wrapper algorithms for the analysis of microarray datasets [9–12]. Unlike classical wrappers, which select genes incrementally, GA selects genes nonlinearly by creating gene subsets randomly. Furthermore, GA is efficient in exploring large search spaces for solving combinatorial problems. This makes it a promising solution for gene selection in microarray data. Nevertheless, wrapper approaches like GA have long been criticized for suffering from overfitting, because an inductive algorithm is usually used as the sole criterion in gene subset evaluation. In other words, using a given inductive algorithm as the sole optimization guide leads the system to blindly seek high classification accuracy on the training data, which may give poor generalization on unseen data.
The third group of selection schemes is known as embedded approaches, which use the inductive algorithm itself as the feature selector as well as the classifier. As illustrated in Figure 1(c), feature selection is actually a by-product of the classification process. Examples are classification trees such as ID3 and C4.5. However, the drawback of embedded methods is that they are generally greedy, using only top ranked genes to perform sample classification in each step while an alternative split may perform better. Furthermore, additional steps are required to extract the selected genes from the embedded algorithms.
To address the drawbacks of each method while attempting to take advantage of their strengths, various hybrid algorithms have been proposed. Yang et al. pointed out that no single filter algorithm is universally optimal and that there is seldom any basis or guidance for choosing a particular filter for a given dataset. They proposed a hybrid method which synthesizes several different filters using a specially designed distance. Their experimental results indicate that including multiple sources of information is an advantage in improving prediction accuracy. However, this approach, too, did not incorporate classification information, which could be very useful in obtaining more accurate sample classification results.
Since relying on a single classifier often gives biased and overfitted classification results, designing multiple classifier systems to weigh the classification hypotheses has also received much attention [18, 19]. To incorporate the benefits of GA in evaluating features by groups and in extracting nonlinear relationships from associated features, we recently proposed a genetic ensemble (GE) framework for feature selection. By applying multiagent techniques for hybrid system composition under the proposed genetic framework, we found a GE combination which is superior to many alternatives in the context of microarray data analysis. In that system, multiple classifiers were applied to evaluate the goodness of gene subsets, and the system works iteratively, collecting multiple gene subsets as candidate sample classification profiles. The preliminary experimental results suggest that the GE system is able to improve both the sample classification accuracy and the reproducibility of the gene selection results, the latter of which is often overlooked.
To further improve the generalization of the selected genes and gene subsets on unseen data, in this study we incorporate multiple filtering algorithms into the GE system. This more advanced system is named the multi-filter enhanced genetic ensemble system, or MF-GE for short. A novel mapping strategy is developed to fuse the evaluation scores from multiple filters, and this strategy is incorporated into the GE system for gene selection and classification. Thus the initialization and mutation processes of the original genetic ensemble system are governed by the knowledge generated from multiple filtering algorithms.
We compare the MF-GE system with the original GE system and the GA/KNN hybrid proposed by Li et al., which is similar to GE except that the optimization is guided by k-nearest neighbor classifiers. In addition, the Gain Ratio filtering algorithm (which is commonly employed for gene selection of microarray datasets) is used as an additional yardstick. We found that the improved system is able to produce higher classification accuracy, generate more compact gene subsets, and converge to the selection results more quickly. More importantly, the proposed multi-filter mapping component and the genetic ensemble component are very flexible, allowing any filters or classifiers with new capabilities to be added to the system and those no longer used to be removed, based on the data requirements or user preferences.
An intermediate step called "score mapping" serves as the bridge between the filtering process and the wrapper process. It is described in detail in the next subsection.
where $\bar{x}_k$ is the average expression value of the kth gene among all samples, $\sigma_k$ is the standard deviation of the kth gene among all samples, and $x'_{ik}$ is the transformed value of $x_{ik}$, which denotes the expression value of the kth gene in sample i.
After the softmax scaling process, the evaluation scores from the different filtering algorithms are summed into a total score for each gene, which indicates the overall score of that gene under the evaluation of multiple filtering algorithms. The total scores are then multiplied by 10 and rounded to integers. Scores smaller than 1 are set to 1 to make sure all candidate genes are included in the wrapper selection process. The final step is the score-to-frequency mapping, which converts the integer score of each gene into the number of times the gene appears in the candidate gene pool (we call this pool a gene frequency map). The random processes of "chromosome" initialization and "chromosome" mutation in the genetic ensemble system are then conducted based on this gene frequency map.
It is readily noticed that genes with higher overall evaluation scores will appear in the gene frequency map more frequently, thus, will have a better chance to be chosen in the initialization step and the mutation step. In this way, multiple filter information is fused into the gene selection process, which helps to integrate information of data characteristics from different aspects.
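The score mapping described above can be sketched as follows. This is a minimal illustration, not the authors' Java implementation: it assumes the logistic form of softmax scaling applied to each filter's score vector, and the function names (`softmax_scale`, `gene_frequency_map`, `sample_gene`) and the zero-based gene indices are hypothetical.

```python
import math
import random
from statistics import mean, pstdev


def softmax_scale(scores):
    """Squash one filter's raw scores into (0, 1) with a logistic transform.

    Assumption: the scaling is 1 / (1 + exp(-(s - mean) / std)).
    """
    mu = mean(scores)
    sigma = pstdev(scores) or 1.0  # guard against a constant score vector
    return [1.0 / (1.0 + math.exp(-(s - mu) / sigma)) for s in scores]


def gene_frequency_map(filter_scores):
    """Fuse several filters into a candidate pool of gene indices.

    filter_scores maps a filter name to a list of raw scores, one per gene.
    A gene with fused integer score f appears f times in the returned pool.
    """
    scaled = [softmax_scale(s) for s in filter_scores.values()]
    n_genes = len(scaled[0])
    pool = []
    for g in range(n_genes):
        total = sum(column[g] for column in scaled)
        # Multiply by 10, round to an integer, and floor at 1 so that
        # every gene stays selectable. Python rounds exact halves to even.
        freq = max(1, round(total * 10))
        pool.extend([g] * freq)
    return pool


def sample_gene(pool, rng):
    """Draw one gene; genes with higher fused scores are drawn more often."""
    return rng.choice(pool)
```

Drawing uniformly from the pool is what biases initialization and mutation toward highly ranked genes without excluding any candidate.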
In this subsection, we introduce five filtering algorithms incorporated in our MF-GE hybrid system for experiments. All these filtering algorithms have been routinely applied for gene selection of microarray data.
where $c_i$ ($i = 1, \ldots, m$) denotes the possible classes of the samples from a dataset, while $g$ is the gene that has a set of possible values denoted as $V$. $N(g = v, c_i)$ and $E(g = v, c_i)$ are the observed and the expected co-occurrence of $g = v$ with the class $c_i$, respectively.
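The quantities defined above match the standard chi-square statistic, which sums the squared gap between observed and expected co-occurrence counts, divided by the expected count, over all gene values and classes. The sketch below assumes the gene has already been discretized and that the expected count is taken from the product of the marginals (the usual independence assumption); the function name is hypothetical.

```python
from collections import Counter


def chi_square_score(values, labels):
    """Chi-square statistic between a discretized gene and the class labels."""
    n = len(values)
    value_counts = Counter(values)           # how often each gene value v occurs
    class_counts = Counter(labels)           # how often each class c_i occurs
    observed = Counter(zip(values, labels))  # N(g = v, c_i)
    score = 0.0
    for v, n_v in value_counts.items():
        for c, n_c in class_counts.items():
            expected = n_v * n_c / n         # E(g = v, c_i) under independence
            score += (observed[(v, c)] - expected) ** 2 / expected
    return score
```

A gene whose values split the classes perfectly receives the largest score, while a gene whose values are spread evenly across classes scores zero.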
where $\mathrm{diff}(g, S_1, S_2)$ calculates the difference between the values of the gene $g$ for two samples ($S_1$ and $S_2$), $R_i$ denotes the ith randomly selected sample from the dataset, while $S_d$ and $S_s$ denote the nearest sample from a different class to $R_i$ and the nearest sample from the same class as $R_i$, respectively. $N(\cdot)$ is a normalization function which keeps the value of $W(g)$ in the interval [-1, 1].
where $H(\cdot)$ is the information entropy function. $H(class)$ and $H(g)$ give the entropy values of the class and a given gene, while $H(class|g)$ gives the conditional entropy of the class given the gene.
where $S_v$ is the subset of $S$ for which gene $g$ has value $v$.
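The entropy terms defined above are the building blocks of the usual entropy-based filters. The sketch below shows textbook forms of information gain, symmetrical uncertainty, and gain ratio for a discretized gene; only Gain Ratio is named elsewhere in this text, so treating the other two as part of the filter set is an assumption, and all function names are hypothetical.

```python
import math
from collections import Counter


def entropy(items):
    """Shannon entropy H(.) in bits of a list of discrete values."""
    n = len(items)
    return -sum((c / n) * math.log2(c / n) for c in Counter(items).values())


def conditional_entropy(labels, values):
    """H(class | g): class entropy within each subset S_v, weighted by |S_v| / |S|."""
    n = len(labels)
    h = 0.0
    for v, n_v in Counter(values).items():
        subset = [lab for lab, x in zip(labels, values) if x == v]
        h += (n_v / n) * entropy(subset)
    return h


def information_gain(values, labels):
    """H(class) - H(class | g)."""
    return entropy(labels) - conditional_entropy(labels, values)


def symmetrical_uncertainty(values, labels):
    """2 * InfoGain / (H(class) + H(g)); defined as 0 when both entropies are 0."""
    denom = entropy(labels) + entropy(values)
    return 2.0 * information_gain(values, labels) / denom if denom else 0.0


def gain_ratio(values, labels):
    """InfoGain divided by the split information, i.e. the entropy of g itself."""
    split_info = entropy(values)
    return information_gain(values, labels) / split_info if split_info else 0.0
```

All three scores share the same information gain term and differ only in how they normalize it, which is why they can rank the same genes differently.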
It is clear that each algorithm uses a different criterion in evaluating the worth of the candidate genes in microarray datasets. When combined, candidate genes are assessed from many different aspects.
An ensemble of classifiers has recently been suggested as a promising measure to overcome the limitations of individual classifiers. In our previous study, we demonstrated that, if combined properly, multiple classifiers can achieve higher sample classification accuracy and more reproducible feature selection results. Therefore, selecting classification algorithms and developing suitable integration strategies are the key to a successful ensemble. What characteristics should we promote in the ensemble construction? The basic concerns are that the classifiers should be as accurate and diverse as possible, and that the individual classifiers should be relatively computationally efficient. With these criteria in mind, we evaluated different compositions under the genetic architecture within a multiagent framework. A hybrid of five classifiers, namely, decision tree (DT), random forest (RF), 3-nearest neighbors (3NN), 7-nearest neighbors (7NN), and naive Bayes (NB), was identified as better in terms of sample classification and stability than many alternatives. Furthermore, two integration strategies, namely, blocking and majority voting, have been employed for ensemble construction.
The blocking strategy optimizes the target gene subset by improving the sample classification accuracy using the whole ensemble rather than one specific inductive algorithm. This formulation adds multiple test conditions to the algorithm, so the gene subset optimized under this criterion is not tied to any specific classifier and generalizes better. Moreover, genes selected with this strategy are more likely to have real relevance to the biological trait of interest. Majority voting combines multiple classifiers and tries to optimize the target feature set into a superior set that produces a high consensus classification. This part of the function implicitly promotes selected genes that create diverse classifiers, which in turn leads to high sample classification accuracy.
where $Se_j$ is the sensitivity value calculated as the percentage of true positive classifications among the samples in class j, $N_j$ denotes the total number of samples in class j, and $m$ is the total number of classes.
where the empirical coefficients $w_1$ and $w_2$ specify the contribution weights of each term.
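One way to read the definitions above is as a weighted sum of a blocking term and a majority voting term, each scored by the class-averaged sensitivity. The sketch below follows that reading under stated assumptions: the blocking term is taken as the mean of the individual classifiers' balanced accuracies, the voting term scores the consensus prediction, and all function names are hypothetical.

```python
from collections import Counter


def balanced_accuracy(y_true, y_pred):
    """Mean over classes of Se_j = (true positives in class j) / N_j."""
    classes = sorted(set(y_true))
    sensitivities = []
    for c in classes:
        n_c = sum(1 for t in y_true if t == c)
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        sensitivities.append(tp / n_c)
    return sum(sensitivities) / len(classes)


def majority_vote(predictions):
    """Combine per-classifier prediction lists into one consensus list.

    Ties go to the label seen first, which cannot occur with an odd
    number of classifiers on a two-class problem.
    """
    return [Counter(column).most_common(1)[0][0] for column in zip(*predictions)]


def ensemble_fitness(y_true, predictions, w1=0.5, w2=0.5):
    """w1 * blocking term + w2 * majority voting term."""
    blocking = sum(balanced_accuracy(y_true, p) for p in predictions) / len(predictions)
    voting = balanced_accuracy(y_true, majority_vote(predictions))
    return w1 * blocking + w2 * voting
```

With equal weights, a gene subset is rewarded both for working across every classifier and for producing a strong consensus, so a subset that only suits one classifier scores lower.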
This section describes the experimental settings and presents the experimental results.
The "Leukemia" dataset investigates the expression of two different subtypes of leukemia (47 ALL and 25 AML), and the "Colon" dataset contains expression patterns of 22 normal and 40 cancerous tissues. The "Liver" dataset has 82 samples labeled as hepatocellular carcinoma (HCC) and another 75 samples labeled as non-tumor. The task for these three datasets is to identify a small group of genes which can distinguish samples from the two classes. The "MLL" dataset provides a multi-class classification problem. The task is to discriminate each class using a selected gene profile. These four datasets cover the general situations in gene selection and sample classification of microarray datasets.
In order to objectively differentiate and compare the power of different feature selection algorithms, we applied a double cross validation process. That is, each dataset is partitioned by an external cross validation and an internal cross validation. The gene selection process is conducted on the internal cross validation sets while the external cross validation sets are used for evaluating the selection results.
Standardize the gene expression levels of the dataset with the mean of 0 and the variance of 1.
Normalize the gene expression levels of the dataset into [0, 1].
Split each dataset into external train sets and external test sets with an external 3-fold stratified cross validation.
Rank each gene in the external train sets with the between-group to within-group sum of squares ratio (BSS/WSS).
Pre-filter the external train sets by selecting the top 200 genes from the ranking list.
Split the external train sets into internal train sets and internal test sets with an internal 3-fold stratified cross validation.
The gene score calculation is conducted by using the internal train sets while the wrapper selection is performed using internal train sets and internal test sets collaboratively. The external test sets are reserved for the evaluation of the selected genes on unseen data classification, and are excluded from pre-filtering as well as the gene selection processes.
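The double cross validation procedure above can be sketched in plain Python. This is an illustrative outline, assuming the expression matrix `X` is a list of sample rows and `y` a list of class labels; the helper names are hypothetical, and standardization and normalization are omitted for brevity.

```python
import random
from collections import defaultdict


def stratified_folds(labels, k, rng):
    """Split sample positions into k folds while preserving class proportions."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


def bss_wss(column, labels):
    """Between-group to within-group sum of squares ratio for one gene."""
    overall = sum(column) / len(column)
    groups = defaultdict(list)
    for x, label in zip(column, labels):
        groups[label].append(x)
    bss = sum(len(v) * (sum(v) / len(v) - overall) ** 2 for v in groups.values())
    wss = sum((x - sum(v) / len(v)) ** 2 for v in groups.values() for x in v)
    return bss / wss if wss else float("inf")


def double_cv_splits(X, y, top=200, k=3, seed=0):
    """Yield (train_idx, test_idx, kept_genes, inner_folds) per external fold."""
    rng = random.Random(seed)
    outer = stratified_folds(y, k, rng)
    n_genes = len(X[0])
    for i, test_idx in enumerate(outer):
        train_idx = [j for f, fold in enumerate(outer) if f != i for j in fold]
        y_train = [y[j] for j in train_idx]
        # Pre-filter on the external train set only; the external test set
        # never influences the ranking.
        ranked = sorted(
            range(n_genes),
            key=lambda g: bss_wss([X[j][g] for j in train_idx], y_train),
            reverse=True,
        )
        kept = ranked[:top]
        # The internal folds partition the external train set; map fold
        # positions back to the original sample indices.
        inner = [[train_idx[p] for p in fold]
                 for fold in stratified_folds(y_train, k, rng)]
        yield train_idx, test_idx, kept, inner
```

Keeping the ranking inside the external loop is the point of the design: any pre-filtering done on the full dataset would leak information from the external test sets into gene selection.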
For the genetic ensemble component, a set of initial tests was conducted to evaluate different parameter configurations, from which parameter values were chosen and fixed for the subsequent experiments.
Genetic ensemble settings
Selection: Tournament Selection (3)
Crossover: Single Point (0.7)
Mutation: Multi-Point (0.1 & 0.25)
Fitness weights: w1 = 0.5, w2 = 0.5
In our parameter tuning experiments, the average gene subset size is between 2 and 10. Thus, the GA chromosome is represented as a string of size 15. In chromosome coding, each position either specifies the id of a selected gene or is assigned a "0" to denote that no gene is selected at that position. This gives a population of gene subsets of different sizes, with a maximum size of 15.
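The chromosome coding can be illustrated with a short sketch. It assumes gene ids start at 1 so that 0 is free to mean "no gene", that duplicate ids collapse to a single gene when decoded, and that `pool` is a candidate list in which better genes appear more often; the parameter `p_empty`, the per-position mutation rule, and the function names are hypothetical.

```python
import random

CHROMOSOME_LENGTH = 15  # fixed string length stated in the text
EMPTY = 0               # "0" marks a position where no gene is selected


def random_chromosome(pool, rng, p_empty=0.5):
    """Draw each position from the candidate pool, or leave it empty.

    p_empty is a hypothetical parameter; the text does not say how often
    positions start empty.
    """
    return [EMPTY if rng.random() < p_empty else rng.choice(pool)
            for _ in range(CHROMOSOME_LENGTH)]


def decode(chromosome):
    """Return the selected gene subset: unique non-zero ids, sorted."""
    return sorted({gene for gene in chromosome if gene != EMPTY})


def mutate(chromosome, pool, rng, rate=0.1):
    """Resample each position with probability `rate` (assumed per position)."""
    return [rng.choice(pool + [EMPTY]) if rng.random() < rate else gene
            for gene in chromosome]
```

Because empty positions are allowed, a fixed-length string can represent subsets of any size up to 15, which lets the search shrink or grow a subset without changing the encoding.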
Classifiers and filters are created using Weka, a machine learning suite which provides implementations of various popular machine learning and data mining algorithms. Specifically, the J48 algorithm is used to create the classification tree. The random forest algorithm is applied with 7 trees, while the k-nearest neighbor and naive Bayes classifiers are adopted with default parameters. Each filtering algorithm is invoked to evaluate each candidate gene and is called from our main code through the class API of Weka.
The GA/KNN code was downloaded from the author's web site. A chromosome length of 15, 1000 iterations, and majority voting with k = 3 for the kNN were used. For each dataset, GA/KNN requires a pre-specified selection cut-off threshold. Therefore, different thresholds were used according to their classification power on different datasets.
The first set of experiments is set out to compare the classification accuracy of the selected gene sets from MF-GE hybrid with GE, GA/KNN, and Gain Ratio filter algorithm. Instead of trying to achieve the highest classification accuracy, we focus on differentiating the classification power of different gene selection algorithms. The ranking and classification of each dataset are repeated 5 times and each time the top 5, 10, 15, and 20 genes are used for sample classification. We report the average of the classification results.
Classification comparison of different gene ranking algorithms using Leukemia dataset
78.55 ± 2.96
83.04 ± 1.56
84.51 ± 2.53
91.75 ± 0.99
90.82 ± 1.87
92.35 ± 0.70
93.74 ± 1.27
94.30 ± 1.73
95.48 ± 0.95
89.43 ± 1.10
90.45 ± 2.04
90.86 ± 1.26
90.28 ± 1.33
96.20 ± 0.93
96.27 ± 1.65
93.29 ± 1.29
95.33 ± 0.96
96.23 ± 1.26
Classification comparison of different gene ranking algorithms using Colon dataset
62.43 ± 2.78
73.08 ± 2.77
76.64 ± 1.53
73.48 ± 2.09
71.86 ± 2.02
74.35 ± 2.01
73.83 ± 1.57
75.43 ± 0.92
77.01 ± 2.09
67.62 ± 1.45
68.39 ± 1.76
68.78 ± 2.32
72.12 ± 1.68
76.46 ± 2.14
75.07 ± 2.38
73.37 ± 1.84
75.81 ± 2.00
76.98 ± 1.06
Classification comparison of different gene ranking algorithms using Liver dataset
88.33 ± 0.94
87.09 ± 0.79
88.19 ± 0.56
90.31 ± 1.11
91.87 ± 0.94
93.13 ± 1.18
90.46 ± 0.65
93.57 ± 0.57
93.39 ± 0.79
89.53 ± 0.56
91.91 ± 0.69
92.54 ± 0.57
90.85 ± 0.51
92.70 ± 0.67
93.63 ± 0.64
91.60 ± 0.36
93.37 ± 0.46
93.80 ± 0.47
Classification comparison of different gene ranking algorithms using MLL dataset
72.89 ± 2.08
78.27 ± 3.10
81.54 ± 1.67
88.07 ± 1.05
88.20 ± 1.41
89.74 ± 0.60
88.22 ± 1.30
86.18 ± 1.39
88.14 ± 1.09
86.72 ± 1.03
85.02 ± 1.49
86.69 ± 1.98
89.62 ± 0.67
90.68 ± 1.28
91.50 ± 0.67
88.38 ± 0.97
89.02 ± 1.71
91.08 ± 0.96
An obvious question is whether such improvements with multiple filters justify the additional computational expense. This question can be answered from two aspects. Firstly, the multi-filter score calculation in the MF-GE system is done only once, at the start of the algorithm. This step is not involved in the genetic iteration and optimization processes. Therefore, incorporating this initial information is computationally inexpensive. Secondly, by closely observing the classification results produced by individual classifiers, we can see that the MF-GE system achieved better classification results than the alternative methods in almost all cases, regardless of which inductive algorithm is used for evaluation. Moreover, this improvement is consistent across all datasets used for evaluation. This demonstrates that the gene subsets selected by the MF-GE system generalize better and thus are more informative for unseen data classification. From the biological perspective, the selected genes and gene subsets are more likely to have a genuine association with the disease of interest. Hence, they are more valuable for future biological analysis.
The second set of experiments compares the average generation of convergence (termination generation) and the average gene subset size collected in each iteration of the MF-GE and the original GE hybrid. We chose these two criteria because the biological relationship with the target disease is more easily identified when the number of selected genes is small, and a shorter average termination generation implies that the method is more efficient in terms of computational time.
Generation of convergence & subset size for each dataset using MF-GE and GE
Average Generation of Convergence
1 × 10^-2
Average Subset Size
4 × 10^-3
Average Generation of Convergence
5 × 10^-2
Average Subset Size
3 × 10^-3
Average Generation of Convergence
1 × 10^-1
Average Subset Size
1 × 10^-3
Average Generation of Convergence
8 × 10^-2
Average Subset Size
3 × 10^-2
Top 5 genes with the highest selection frequency of each microarray data
TCF3 Transcription factor 3 (E2A immunoglobulin enhancer binding factors E12/E47)
CCND3 Cyclin D3
CST3 Cystatin C (amyloid angiopathy and cerebral hemorrhage)
P03001 TRANSCRIPTION FACTOR IIIA
S-100P PROTEIN (HUMAN)
Human desmin gene, complete cds
H. sapiens mRNA for GCAP-II/uroguanylin precursor
COLLAGEN ALPHA 2(XI) CHAIN (Homo sapiens)
Plasmalemma vesicle associated protein (PLVAP)
PDZ domain containing 11 (PDZD11)
Shisa homolog 5 (Xenopus laevis) (SHISA5)
Basic leucine zipper nuclear factor 1 (BLZF1)
Ficolin (collagen/fibrinogen domain containing lectin) 2 (hucolin) (FCN2)
vicpro2.D07.r Homo sapiens cDNA, 5' end
Human common acute lymphoblastic leukemia antigen (CALLA) mRNA, complete cds
Homo sapiens myosin light chain kinase (MLCK) mRNA, complete cds
H. sapiens mRNA for Tcell leukemia
Human leukemogenic homolog protein (MEIS1) mRNA, complete cds
Traditionally, filter and wrapper algorithms are treated as competitors in gene selection for data classification. In this study, we embrace an alternative view and attempt to combine them as the building blocks of a more advanced hybrid system. The proposed MF-GE system applies several novel integration ideas to strengthen the advantages of each component while avoiding their weaknesses. The experimental results indicate the following:
By fusing the evaluation feedback of multiple filtering algorithms, the system does not only greedily seek high classification accuracy on the training dataset, but also takes other characteristics of the data into consideration. The overfitting problem can then be circumvented and better generalization of the selected genes and gene subsets can be achieved.
By weighing the goodness of each candidate gene from multiple aspects, we reduce the chance of identifying false-positive genes while producing more compact gene subsets. This is useful since future biological experiments can be more easily conducted to validate the importance of the selected genes.
With the use of multiple filtering information, the MF-GE system is able to converge more quickly without sacrificing sample classification accuracy, and thus saves computational expense.
The MF-GE system provides an effective means of incorporating different algorithm components. It allows any filters or classifiers with new or special capabilities to be added to the system, and those that are no longer useful or appropriate to be removed, based on the data requirements or user preferences. Finally, the MF-GE hybrid system is implemented in Java and is freely available from the project homepage.
PY is supported by a NICTA International Postgraduate Award (NIPA) and a NICTA Research Project Award (NRPA).
This article has been published as part of BMC Bioinformatics Volume 11 Supplement 1, 2010: Selected articles from the Eighth Asia-Pacific Bioinformatics Conference (APBC 2010). The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/11?issue=S1.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.