Volume 11 Supplement 1

Selected articles from the Eighth Asia-Pacific Bioinformatics Conference (APBC 2010)

Open Access

A multi-filter enhanced genetic ensemble system for gene selection and sample classification of microarray data

BMC Bioinformatics201011(Suppl 1):S5

DOI: 10.1186/1471-2105-11-S1-S5

Published: 18 January 2010



Feature selection techniques are critical to the analysis of high dimensional datasets. This is especially true in gene selection from microarray data which are commonly with extremely high feature-to-sample ratio. In addition to the essential objectives such as to reduce data noise, to reduce data redundancy, to improve sample classification accuracy, and to improve model generalization property, feature selection also helps biologists to focus on the selected genes to further validate their biological hypotheses.


In this paper we describe an improved hybrid system for gene selection. It is based on a recently proposed genetic ensemble (GE) system. To enhance the generalization property of the selected genes or gene subsets and to overcome the overfitting problem of the GE system, we devised a mapping strategy to fuse the goodness information of each gene provided by multiple filtering algorithms. This information is then used for initialization and mutation operation of the genetic ensemble system.


We used four benchmark microarray datasets (including both binary-class and multi-class classification problems) for concept proving and model evaluation. The experimental results indicate that the proposed multi-filter enhanced genetic ensemble (MF-GE) system is able to improve sample classification accuracy, generate more compact gene subset, and converge to the selection results more quickly. The MF-GE system is very flexible as various combinations of multiple filters and classifiers can be incorporated based on the data characteristics and the user preferences.


Feature selection is an important process for high dimensional data analysis. With the advancement of new high-throughput bio-technologies, feature selection quickly found its use in the analysis of the massive quantity of generated data [1]. The gene selection in microarray data is one of such crucial applications because microarray datasets inherently have high feature-to-sample ratio, i.e., several thousands of features (genes) with only a few dozen of samples [2]. To identify biologically significant biomarkers and to improve the ability in new case diagnosis, robust and scalable feature selection methods play a critical role.

Currently, three major types of feature selection models have been intensively utilized for gene selection and dimension reduction in microarray data. The first type is known as "filter" approach. Typically, filtering algorithms do not optimize the classification accuracy of the classifier directly, but attempt to select genes with certain kind of evaluation criterion. Examples are χ2-statistic [3], t-statistic [4], ReliefF [5], Information Gain, and Gain Ratio [6]. With the filter approach the gene selection process and the classification process are thus separated, as shown in Figure 1(a). The advantages are that the algorithms are often fast and the selected genes are better generalized to unseen data classification. However, to ignore the effects of the selected gene subset on the performance of the classifier may cause crucial information being lost for accurate sample discrimination and target gene identification [7]. More importantly, filtering algorithms often treat each gene independently. Nevertheless, genes are commonly connected by various bio-pathways and functioning as groups. Such one gene at a time methods often miss important bio-pathway information.
Figure 1

Different types of feature selection algorithms. (a)Filter approach (b)Wrapper approach (c)Embedded approach.

Different from filters, the "wrapper" approach evaluates the selected gene subset according to their power to improve sample classification accuracy [7]. The classification thus is "wrapped" in the gene selection process, as depicted in Figure 1(b). Classical wrapper algorithms include forward selection and backward elimination [8]. Recently, evolutionary based algorithms such as Genetic Algorithm (GA) and Evolution Strategy (ES) have been introduced as more advanced wrapper algorithms for the analysis of microarray datasets [912]. Unlike classical wrappers which select genes incrementally [13], GA selects genes nonlinearly by creating gene subset randomly. Furthermore, GA is efficient in exploring large searching space for solving combinatorial problems [14]. This makes it a promising solution for gene selection in microarray data. Nevertheless, wrapper approaches like GA have long been criticized for suffering from overfitting [1] because an inductive algorithm is usually used as the sole criterion in gene subset evaluation. In other words, the use of a given inductive algorithm as the sole optimization guide leads the system to seek for high classification accuracy on training data blindly which may give poor generalization property on unseen data classification.

The third group of selection scheme is known as embedded approaches, which use the inductive algorithm itself as the feature selector as well as classifier. As illustrated in Figure 1(c), feature selection is actually a by-product of the classification process. Example are classification trees such as ID3 [15] and C4.5 [16]. However, the drawback of embedded methods is that they are generally greedy based [8], using only top ranked genes to perform sample classification in each step while an alternative split may perform better. Furthermore, additional steps are required to extract the selected genes from the embedded algorithms.

To address the drawbacks of each method while attempt to take advantage of their strengthes, various hybrid algorithms have been proposed. In [17], Yang et al. pointed out that no one filter algorithm is universally optimal and there is seldom any basis or guidance to the choice of a particular filter for a given dataset. They proposed a hybrid method which synthesizes several different filters using a special designed distance. Their experimental results indicate that including multiple source of information is an advantage in improving prediction accuracy. However, this approach, too, did not incorporate classification information which could be very useful in obtaining more accurate sample classification result.

Since relying on a single classifier often gives bias and overfitted classification results, designing multiple classifier system to weigh the classification hypotheses also received much attention [18, 19]. To incorporate the benefits of GA in evaluating features by groups and in extracting nonlinear relationship from associated features, we recently proposed a genetic ensemble (GE) framework for feature selection [20]. By applying multiagent techniques for hybrid system composition under the proposed genetic framework, we found a GE combination, which is superior to many alternatives in the context of microarray data analysis [21]. In that system multiple classifiers were applied to evaluate the goodness of gene subsets, and the system works in an iterative way, collecting multiple gene subsets as candidate sample classification profiles. The preliminary experimental results suggest that the GE system is able to improve the sample classification accuracy and the reproducibility of the gene selection results which is often overlooked [22].

To further improve the generalization property of the selected genes and gene subsets on unseen data classification, in this study, we incorporate multiple filtering algorithms into the GE system. This more advanced system is named the multi-filter enhanced genetic ensemble system, or MF-GE for short. A novel mapping strategy for multiple filtering information fusion is developed to fuse the evaluation scores from multiple filters, and this strategy is incorporated into the GE system for gene selection and classification. Thus the initialization and mutation processes of the original genetic ensemble system is governed by the knowledge generated from multiple filtering algorithms.

We compare the MF-GE system with the original GE system and the GA/KNN hybrid proposed by Li et al. [9] which is similar to GE except that the optimization is guided by k-nearest neighbor classifiers. Also, Gain Ratio filtering algorithm (which is commonly employed for gene selection of microarray datasets) is used as an additional yardstick. We found that this improved system is able to produce higher classification accuracy, generate more compact gene subset, and converge to the selection results more quickly. More importantly, the proposed multi-filter mapping component and the genetic ensemble component are very flexible, allowing any filters/classifiers with new capabilities to be added to the system and those no longer used to be deleted from the system based on the data requirements or user preferences.


The MF-GE hybrid approach

System overview

A flow chart of the proposed MF-GE hybrid system is illustrated in Figure 2. In this system the gene selection process is sequentially divided into two phases, i.e., "filtering process" and "wrapper process". In the filtering process, multiple filtering algorithms are applied to give scores for each candidate gene in the microarray dataset. The scores of each gene are then integrated for wrapper process. In the wrapper process, the genetic ensemble algorithm is used to select discriminative genes using the information provided by the filtering process. The detail of this genetic ensemble algorithm is described in [20, 21]. Basically, a multiple objective GA (MOGA) is utilized as the gene combination search engine while an ensemble of the classifiers is used as the gene subsets evaluation component to provide feedback for gene subsets optimization. The algorithm executes iteratively, collecting multiple gene subsets. The final collections are ranked and the top genes are used for sample classification.
Figure 2

The flow chart of the MF-GE hybrid system for gene selection and classification of microarrays.

An intermediate step called "score mapping" serves as the synergy between the filtering process and the wrapper process. It is described in details in the next subsection.

Multi-filter score mapping

Traditionally, filtering algorithms select differential genes independently for the classification process. However, such information could be beneficial if appropriately integrated into the wrapper procedure. To fuse the evaluation information from multiple filtering algorithms, we developed a multi-filter score mapping strategy which serves as the connection between the filtering process and the wrapper process. An example of this mapping process with two filters is depicted in Figure 3.
Figure 3

An example of multiple filter score mapping strategy for evaluation information fusion.

The process starts by calculating scores for each candidate gene with different filtering algorithms. The evaluation scores obtained from different filtering algorithms are then integrated. One issue in integrating multiple scores is that different filtering algorithms often provide evaluation scores with different scales. In order to combine the evaluation results of multiple filters, we must transform the evaluation scores into a common scale. Therefore, the softmax scaling process is adopted to squash the gene evaluation results of each filtering algorithm into the range of 0[1]. The calculation is as follows:
in which

where https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-11-S1-S5/MediaObjects/12859_2010_Article_3955_IEq1_HTML.gif is the average expression value of the k th gene among all samples, σ k is the standard deviation of the k th gene among all samples, and https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-11-S1-S5/MediaObjects/12859_2010_Article_3955_IEq2_HTML.gif is the transformed value of x ik which denotes the expression value of the k th gene in sample i.

After the softmax scaling process, the evaluation scores with different filtering algorithms are summed up to a set of total score which indicates the overall score of each gene under the evaluation of multiple filtering algorithms. The total scores are then timed with 10 and rounded into integer. Those with scores smaller than 1 are set to score of 1 to make sure all candidate genes are included in the wrapper selection process. The final step is the score-to-frequency mapping step which transfers the given integer of each gene into the appearance frequency of this gene in the transferred candidate gene pool (we call it a gene frequency map). The random processes of "chromosome" initialization and the "chromosome mutation" of the genetic ensemble system are then conducted based on this gene frequency map.

It is readily noticed that genes with higher overall evaluation scores will appear in the gene frequency map more frequently, thus, will have a better chance to be chosen in the initialization step and the mutation step. In this way, multiple filter information is fused into the gene selection process, which helps to integrate information of data characteristics from different aspects.

Filters and classifiers

Filter components

In this subsection, we introduce five filtering algorithms incorporated in our MF-GE hybrid system for experiments. All these filtering algorithms have been routinely applied for gene selection of microarray data.


When used for gene evaluation, χ2-statistic can be considered as to calculate the occurrence of a particular value of a gene and the occurrence of a class associated with this value. Formally, the discriminative power of a gene is quantified as follows:

where c i , (i = 1, ..., m) denotes the possible classes of the samples from a dataset, while g is the gene that has a set of possible values denoted as V . N(g = v, c i ) and E(g = v, c i ) are the observed and the expected co-occurrence of g = v with the class c i , respectively.


ReliefF is a widely used filtering algorithm. In microarray data classification context, the algorithm selects genes that have high resolution distinguishing samples which have similar expression patterns. The formula used by ReliefF to compute the weight or "importance" of a gene g is as follows:

where diff(g, S1, S2) calculates the difference between the values of the gene g for two samples (S1 and S2), https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-11-S1-S5/MediaObjects/12859_2010_Article_3955_IEq3_HTML.gif denotes the i th randomly selected samples from the dataset, while S d and S s denote nearest sample from a different class to https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-11-S1-S5/MediaObjects/12859_2010_Article_3955_IEq3_HTML.gif and nearest sample from the same class to https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-11-S1-S5/MediaObjects/12859_2010_Article_3955_IEq3_HTML.gif , respectively. N(.) is a normalization function which keeps the value of W(g) to be in the interval [-1, 1].

Symmetrical Uncertainty

The Symmetrical Uncertainty method evaluates the worth of an gene by measuring the symmetrical uncertainty with respect to the sample class [23]. Each gene is evaluated as follows:

where H(.) is the information entropy function. H(class) and H(g) give the entropy values of the class and a given gene, while H(class|g) gives the entropy value of a gene with respect to the class.

Information Gain

Information Gain is a statistic measure often used in nodes selection for decision tree construction. It measures the number of bits of information provided in class prediction by knowing the value of feature [3]. Let c i belong to a set of discrete classes (1, ..., m). V be the set of possible values for a given gene g. The information gain of a gene g is defined as follows:

Gain Ratio

The final filtering algorithm used in the hybrid system is Gain Ratio. Gain Ratio incorporates "split information" of features into Information Gain statistic. The "split information" of a gene is obtained by measuring how broadly and uniformly it splits the data [24]. Let's consider again a microarray dataset has a set of classes denoted as c i , (i = 1, ..., m), and each gene g has a set of possible values denoted as V . The discriminative power of a gene g is given as:
in which:

where S v is the subset of S of which gene g has value v.

It is clear that each algorithm uses a different criterion in evaluating the worth of the candidate genes in microarray datasets. When combined, candidate genes are assessed from many different aspects.

Classifier components

Ensemble of classifiers has recently been suggested as a promising measure to overcome the limitation of individual classifier [25]. In our previous study, we demonstrated that if combined properly, multiple classifiers can achieve higher sample classification accuracy and more reproducible feature selection results [20]. Therefore, selecting classification algorithms and developing suitable integration strategies are the key to a successful ensemble. What characteristics should we promote in the ensemble construction? The basic concerns are that they should be as accurate and diverse as possible [26], and the individual classifiers should be relatively computationally efficient. With these criteria in mind, we evaluated different composition under the genetic architecture within a multiagent framework [21]. A hybrid of five classifiers, namely, decision tree (DT), random forest (RF), 3-nearest neighbors (3NN), 7-nearest neighbors (7NN), and naive bayes (NB) is identified to be better in terms of sample classification and stability than many alternatives. Furthermore, two integration strategies, namely, blocking and majority voting have been employed for ensemble construction.

The blocking strategy optimizes the target gene subset by improving the sample classification accuracy using the whole ensemble rather than one specific inductive algorithm. This formulation adds multiple test conditions into the algorithm, and the gene subset optimized under this criterion will not tie to any specific classifier, but have a more generalization nature. Moreover, genes selected with this strategy are more likely to have real relevance to the biological trait of interest [27]. The majority voting combines multiple classifiers and tries to optimize the target feature set into a superior set in producing high consensus classification [28]. This part of the function promotes the selected genes in creating diverse classifiers implicitly, which in turn leads to the high sample classification accuracy [29].

The fitness functions derived from blocking (fitness b (s)) and majority voting (fitness v (s)) are defined as follows:
where k is the size of the majority voting V k (.), h i (s), (i = 1, ..., L) is the classification hypothesis generated by classifier i in the ensemble while classifying dataset using gene subset s, y is the class label of samples, and BC(.) is the balanced classification accuracy which is calculated as follows:

where Se j is the sensitivity value calculated as the percentage of the number of true positive classification ( https://static-content.springer.com/image/art%3A10.1186%2F1471-2105-11-S1-S5/MediaObjects/12859_2010_Article_3955_IEq4_HTML.gif ) of samples in class j, N j denotes the total number of samples in class j, and m is the total number of classes.

Finally, the fitness function of the MOGA is defined as follows:

where the empirical coefficients w1 and w2 specify the contribution weights of each term.

Results and discussion

This section describes the experimental settings and presents the experimental results.

Experimental settings

Datasets and data pre-processing

We gathered four benchmark microarray datasets for system evaluation, including binary-class and multi-class classification problems. Table 1 summarizes each dataset.
Table 1

Microarray datasets for evaluation











# Sample





# Gene





# Class






ALL: 47

TUM: 40

HCC: 82

ALL: 24


AML: 25

NOR: 22

NON: 75

MLL: 20



AML: 28

The "Leukemia" dataset [30] investigates the expression of two different subtypes of leukemia (47 ALL and 25 AML), and the "Colon" dataset [31] contains expression patterns of 22 normals and 40 cancerous tissues. The "Liver" dataset [32] has 82 samples labeled as Hepatocellular carcinoma (HCC) and other 75 samples labeled as Non-tumor. The task for these three datasets is to identify a small group of genes which can distinguish samples from two classes. The "MLL" dataset [33] provides a multi-classes classification problem. The task is to discriminate each class using a selected gene profile. These four datasets cover the general situations in gene selection and sample classification of microarray datasets.

In order to objectively differentiate and compare the power of different feature selection algorithms, we applied a double cross validation process. That is, each dataset is partitioned by an external cross validation and an internal cross validation. The gene selection process is conducted on the internal cross validation sets while the external cross validation sets are used for evaluating the selection results.

Data normalization and pre-processing are of great importance and can have heavy influence on the success of the overall analysis. Based on the previous studies, only a few dozens of genes (or even only a few genes) are needed for sample classification in general [34, 35]. Therefore, for each microarray dataset 200 genes are pre-filtered from the external train sets, which are then suitable for follow up precise gene selection. Specifically, we apply the following pre-processing steps:
  1. 1.

    Standardize the gene expression levels of the dataset with the mean of 0 and the variance of 1.

  2. 2.

    Normalize the gene expression levels of the dataset into [0, 1].

  3. 3.

    Split each dataset into external train sets and external test sets with an external 3-fold stratified cross validation.

  4. 4.

    Rank each gene in the external train sets with the between-group to within-group sum of square ratio (BSS/WSS) [36].

  5. 5.

    Pre-filter the external train sets by selecting the top 200 genes from the ranking list.

  6. 6.

    Split the external train sets into internal train sets and internal test sets with an internal 3-fold stratified cross validation.


The gene score calculation is conducted by using the internal train sets while the wrapper selection is performed using internal train sets and internal test sets collaboratively. The external test sets are reserved for the evaluation of the selected genes on unseen data classification, and are excluded from pre-filtering as well as the gene selection processes.


For the genetic ensemble component, a set of initial tests is conducted to evaluate different parameter configurations, from which parameter values are chosen and fixed for the latter experiments.

The iteration of the genetic ensemble procedure is set to 100. Within each iteration, the population size of GA is 100. These 100 populations are divided into two niches each with 50, and is evolved separately. After every 10 generations, the favorite chromosomes from each niche are exchanged to the other. The probability of crossover p c is 0.7. A novel mutation strategy is implemented to allow multiple mutations, that is, when a single mutation happened (with the probability of 0.1) on a chromosome, another single point mutation may happen on the same chromosome with the probability of 0.25 and so on. The selection method is the tournament selection with the candidate size of 3, and the contribution weights of w1 and w2 are set to 0.5. Lastly, the termination condition for each iteration is either that the termination generation of 100th is reached or the similarity of the population converges to 90%. Table 2 summarizes the parameter settings.
Table 2

Genetic ensemble settings



Fitness Function




Population Size




Chromosome Size



Multiple Conditions


Tournament Selection (3)


Single Point (0.7)


Multi-Point (0.1 & 0.25)

Contribution Weight

w1 = 0.5, w2 = 0.5

In our parameter tuning experiments, the average gene subset size is within 2 to 10. Thus, the GA chromosome is represented as a string of size 15. In chromosome coding, each position is used to specify the id of a selected gene or assigned a "0" to denote no gene is selected at the current position. This gives a population of gene subsets of different sizes with a maximum of 15.

Classifiers and filters are created by using Waka - a machine learning suite which provides the implementation of various popular machine learning and data mining algorithms [23]. In specific, J48 algorithm is used to create classification tree. Random forest algorithm with size of 7 trees is applied, while k-nearest neighbor and naive bayes classifiers are adopted with default parameters. Each filtering algorithm is provoked for evaluation of each candidate gene and integrated from our main code through the class API of Waka.

The GA/KNN code were downloaded from the author's web site [37]. The chromosome length of 15, the iteration of 1000, and the majority voting with k = 3 of the k NN were used. For each dataset, GA/KNN requires a pre-specified selection threshold of cut-off. Therefore, different thresholds were used according to their classification power on different datasets.


The first set of experiments is set out to compare the classification accuracy of the selected gene sets from MF-GE hybrid with GE, GA/KNN, and Gain Ratio filter algorithm. Instead of trying to achieve the highest classification accuracy, we focus on differentiating the classification power of different gene selection algorithms. The ranking and classification of each dataset are repeated 5 times and each time the top 5, 10, 15, and 20 genes are used for sample classification. We report the average of the classification results.

The evaluation results obtained from different microarray datasets are depicted in Tables 3, 4, 5, 6, respectively. In each table, the classification results using each individual classifier as well as the mean and the majority voting of them are listed. It is easy to see that the MF-GE system has a higher average classification accuracy for all datasets. For example, 1.20%, 1.33%, 0.75%, and 1.85% improvements of mean over the original GE (which is the second best in average over all datasets) are obtained using the MF-GE system for Leukemia, Colon, Breast, and MLL, respectively. Given the fact that the GE part of these two algorithms are the same, the natural explanation of the improvement is attributed to the fusion of multiple filter information.
Table 3

Classification comparison of different gene ranking algorithms using Leukemia dataset





Gain Ratio







78.55 ± 2.96

83.04 ± 1.56

84.51 ± 2.53


Random Forests


91.75 ± 0.99

90.82 ± 1.87

92.35 ± 0.70


3-Nearest Neighbor


93.74 ± 1.27

94.30 ± 1.73

95.48 ± 0.95


7-Nearest Neighbor


89.43 ± 1.10

90.45 ± 2.04

90.86 ± 1.26


Naive Bayes


90.28 ± 1.33

96.20 ± 0.93

96.27 ± 1.65








Majority Voting


93.29 ± 1.29

95.33 ± 0.96

96.23 ± 1.26

Table 4

Classification comparison of different gene ranking algorithms using Colon dataset





Gain Ratio







62.43 ± 2.78

73.08 ± 2.77

76.64 ± 1.53


Random Forests


73.48 ± 2.09

71.86 ± 2.02

74.35 ± 2.01


3-Nearest Neighbor


73.83 ± 1.57

75.43 ± 0.92

77.01 ± 2.09


7-Nearest Neighbor


67.62 ± 1.45

68.39 ± 1.76

68.78 ± 2.32


Naive Bayes


72.12 ± 1.68

76.46 ± 2.14

75.07 ± 2.38








Majority Voting


73.37 ± 1.84

75.81 ± 2.00

76.98 ± 1.06

Table 5

Classification comparison of different gene ranking algorithms using Liver dataset





Gain Ratio







88.33 ± 0.94

87.09 ± 0.79

88.19 ± 0.56


Random Forests


90.31 ± 1.11

91.87 ± 0.94

93.13 ± 1.18


3-Nearest Neighbor


90.46 ± 0.65

93.57 ± 0.57

93.39 ± 0.79


7-Nearest Neighbor


89.53 ± 0.56

91.91 ± 0.69

92.54 ± 0.57


Naive Bayes


90.85 ± 0.51

92.70 ± 0.67

93.63 ± 0.64








Majority Voting


91.60 ± 0.36

93.37 ± 0.46

93.80 ± 0.47

Table 6

Classification comparison of different gene ranking algorithms using MLL dataset





Gain Ratio







72.89 ± 2.08

78.27 ± 3.10

81.54 ± 1.67


Random Forests


88.07 ± 1.05

88.20 ± 1.41

89.74 ± 0.60


3-Nearest Neighbor


88.22 ± 1.30

86.18 ± 1.39

88.14 ± 1.09


7-Nearest Neighbor


86.72 ± 1.03

85.02 ± 1.49

86.69 ± 1.98


Naive Bayes


89.62 ± 0.67

90.68 ± 1.28

91.50 ± 0.67








Majority Voting


88.38 ± 0.97

89.02 ± 1.71

91.08 ± 0.96

An apparent question is that whether such improvements with multiple filters justify the additional computational expenses? This question can be answered from two aspects. Firstly, the multi-filter score calculation in the MF-GE system is done only once at the start of the algorithm. This step will not be involved in the genetic iteration and optimization processes. Therefore, it is computationally efficient to incorporate these initial information. Secondly, by closely observing the classification results produced by individual classifiers, we can see that the MF-GE system achieved better classification results in almost all cases than those alternative methods, regardless which inductive algorithm is used for evaluation. Moreover, such improvement is consistent throughout all datasets used for evaluation. This demonstrates that the gene subsets selected by the MF-GE system have a better generalization property and thus are more informative for unseen data classification. From the biological perspective, the selected genes and gene subsets are more likely to have genuine association with the disease of interest. Hence, they are more valuable for future biological analysis.

Figure 4 gives the comparison of the mean classification accuracy and the majority voting accuracy of these five classifiers with different gene ranking methods in each microarray dataset. In all cases, integrating classifiers with majority voting gives better classification results than the average of individuals. Therefore, majority voting can be considered as a useful classifier integration method for improving the overall classification accuracy. Figure 5 depicts the multi-filter scores of the 200 genes pre-filtered by BSS/WSS. It is evident that many genes with relatively low BSS/WSS ranking have shown very high multi-filter scores. Interestingly, in colon dataset, genes are fractured into two groups with respect to the multi-filter scores. It is interesting to conduct further study on finding the causality of such inconsistency.
Figure 4

Sample classification. The comparison of average classification and majority voting classification of the five classifiers with different gene selection methods in each microarray dataset.

Figure 5

Multi-filter scores of the 200 genes pre-filtered by BSS/WSS.

The second set of experiments is conducted to compare the average generation of convergence (termination generation) and the average gene subset size collected in each iteration of the MF-GE and the original GE hybrid. We formulate these two criteria for comparison because the biological relationship with the target disease is more easily identified when the number of the selected genes is small [38], and a shorter average termination generation implies that the method is more efficient in terms of computational time.

As illustrated in Table 7, it is clear that the MF-GE system is capable of converging more quickly while also generating smaller gene subsets. Specifically, the average gene subset size given by MF-GE is about 0.4 to 0.7 of a gene less than those of GE, while the average generation of convergence is about 1 to 2 generations faster. Essentially, the improvement on producing more compact gene subsets is more significant as demonstrated by the P-Value of the one tail student t-test. The results are also visualized in Figure 6 and Figure 7 using box plotting. One interesting finding is that those figures indicate a dataset-depended relationship, that is, the optimal subset size and the convergence of the genetic component is partially determined by the given dataset. Nevertheless, significant improvements can be achieved by fusion prior data information into the system.
Table 7

Generation of convergence & subset size for each dataset using MFGE and GE


Comparison Criterion





Average Generation of Convergence



1 × 10-2


Average Subset Size



4 × 10-3


Average Generation of Convergence



5 × 10-2


Average Subset Size



3 × 10-3


Average Generation of Convergence



1 × 10-1


Average Subset Size



1 × 10-3


Average Generation of Convergence



8 × 10-2


Average Subset Size



3 × 10-2

*P-Values are calculated using student t-test with one tail.

Figure 6

Average gene subset size selected by GE and MF-GE with each microarray dataset.

Figure 7

Average generation of convergence of GE and MF-GE with each microarray dataset.

Lastly, in Table 8, we list the top 5 genes with the highest selection frequency of each microarray dataset respectively.
Table 8

Top 5 genes with the highest selection frequency of each microarray data


Accession Num

Gene Description






TCF3 Transcription factor 3 (E2A immunoglobulin enhancer binding factors E12/E47)



Nucleoside-diphosphate kinase



CCND3 Cyclin D3



CST3 Cystatin C (amyloid angiopathy and cerebral hemorrhage)









Human desmin gene, complete cds



H. sapiens mRNA for GCAP-II/uroguanylin precursor






Plasmalemma vesicle associated protein (PLVAP)



PDZ domain containing 11 (PDZD11)



Shisa homolog 5 (Xenopus laevis) (SHISA5)



Basic leucine zipper nuclear factor 1 (BLZF1)



Ficolin (collagen/fibrinogen domain containing lectin) 2 (hucolin) (FCN2)



vicpro2.D07.r Homo sapiens cDNA, 5' end



Human common acute lymphoblastic leukemia antigen (CALLA) mRNA, complete cds



Homo sapiens myosin light chain kinase (MLCK) mRNA, complete cds



H. sapiens mRNA for Tcell leukemia



Human leukemogenic homolog protein (MEIS1) mRNA, complete cds


Traditionally, filter and wrapper algorithms are treated as competitors in gene selection for data classification. In this study, we embrace an alternative view and attempt to combine them as the building blocks of a more advanced hybrid system. The proposed MF-GE system applied several novel integration ideas to strengthen the advantages of each component while avoiding their weaknesses. The experimental results indicate the followings:

  • By fusing evaluation feedbacks of multiple filtering algorithms the system does not only seek for high classification accuracy of training dataset greedily, but takes into consideration other characteristics of the data as well. The overfitting problem can then be circumvented and a better generalization of the selected gene and gene subsets can be achieved.

  • By weighing the goodness of each candidate gene from multiple aspects, we reduce the chance of identifying false-positive gene while producing more compact gene subset. This is useful since future biological experiment can be more easily conducted to validate the importance of the selected genes.

  • With the use of multiple filtering information, the MF-GE is able to converge more quickly without sacrificing the sample classification accuracy and thus saves computational expenses.

The MF-GE system provides an effective measure for incorporating different algorithm components. It allows any filters or classifiers with new or special capabilities to be added to the system and those no longer useful or inappropriate to be deleted from the system based on the data requirements or user preferences. Finally, the MFGE hybrid system is implemented in Java and is freely available from the project homepage [39].



PY is supported by a NICTA International Postgraduate Award (NIPA) and a NICTA Research Project Award (NRPA).

This article has been published as part of BMC Bioinformatics Volume 11 Supplement 1, 2010: Selected articles from the Eighth Asia-Pacific Bioinformatics Conference (APBC 2010). The full contents of the supplement are available online at http://​www.​biomedcentral.​com/​1471-2105/​11?​issue=​S1.

Authors’ Affiliations

School of Information Technologies (J12), The University of Sydney
NICTA, Australian Technology Park
Faculty of Computer and Information Science, Southwest University
School of Information Technology, Deakin University
Sydney Bioinformatics, The University of Sydney
Centre for Mathematical Biology, The University of Sydney
Centre for Distributed and High Performance Computing, The University of Sydney


  1. Saeys Y, Lnza I, Larrañaga P: A review of feature selection techniques in bioinformatics. Bioinformatics 2007, 23(19):2507–2517. 10.1093/bioinformatics/btm344View ArticlePubMed
  2. Somorjai RL, Dolenko B, Baumgartner R, Crow JE, Moore JH: Class prediction and discovery using gene microarray and proteomics mass spectroscopy data: curses, caveats, cautions. Bioinformatics 2003, 19: 1484–1491. 10.1093/bioinformatics/btg182View ArticlePubMed
  3. Wang Y, Makedon F, Ford J, Pearlman J: Hykgene: a hybrid approach for selecting marker genes for phenotype classification using microarray gene expression data. Bioinformatics 2005, 21: 1530–1537. 10.1093/bioinformatics/bti192View ArticlePubMed
  4. Jafari P, Azuaje F: An assessment of recently published gene expression data analyses: reporting experimental design and statistical factors. BMC Med Inform Decis Mak 2006, 6: 27. 10.1186/1472-6947-6-27PubMed CentralView ArticlePubMed
  5. Robnik-Šikonja M, Kononenko I: Theoretical and empirical analysis of relieff and rrelieff. Machine Learning 2003, 53: 23–69. 10.1023/A:1025667309714View Article
  6. Su Y, Murali T, Pavlovic V, Schaffer M, Kasif S: Rankgene: identification of diagnostic genes based on expression data. Bioinformatics 2003, 19: 1578–1579. 10.1093/bioinformatics/btg179View ArticlePubMed
  7. Kohavi R, John G: Wrapper for feature subset selection. Artificial Intelligence 1997, 97: 273–324. 10.1016/S0004-3702(97)00043-XView Article
  8. Blum A, Langley P: Selection of relevant features and examples in machine learning. Artificial Intelligence 1997, 97: 245–271. 10.1016/S0004-3702(97)00063-5View Article
  9. Li L, Weinberg C, Darden T, Pedersen L: Gene selection for sample classification based on gene expression data: study of sensitivity to choice of parameters of the GA/KNN method. Bioinformatics 2001, 17: 1131–1142. 10.1093/bioinformatics/17.12.1131View ArticlePubMed
  10. Ooi C, Tan P: Genetic algorithms applied to multi-class prediction for the analysis of gene expression data. Bioinformatics 2003, 19: 37–44. 10.1093/bioinformatics/19.1.37View ArticlePubMed
  11. Jirapech-Umpai T, Aitken S: Feature selection and classification for microarray data analysis: Evolutionary methods for identifying predictive genes. BMC Bioinformatics 2005, 6: 146. 10.1186/1471-2105-6-148View Article
  12. Liu J, Cutler G, Li W, Pan Z, Peng S, Hoey T, Chen L, Ling X: Multiclass cancer classification and biomarker discovery using GA-based algorithm. Bioinformatics 2005, 21: 2691–2697. 10.1093/bioinformatics/bti419View ArticlePubMed
  13. Inza I, Sierra B, Blanco R, Larrañaga P: Gene selection by sequential search wrapper approaches in microarray cancer class prediction. Journal of Intelligent and Fuzzy Systems 2002, 12: 25–33.
  14. Kudo M, Sklansky J: Comparison of algorithms that select features for pattern classifiers. Pattern Recognition 2000, 33: 25–41. 10.1016/S0031-3203(99)00041-2View Article
  15. Quinlan JR: Induction of decision trees. Machine Learning 2004, 1: 81–106.
  16. Quinlan JR: C4.5: programs for machine learning. San Mateo, CA: Morgan Kaufmann; 1993.
  17. Yang YH, Xiao Y, Segal MR: Identifying differentially expressed genes from microarray experiments via statistic synthesis. Bioinformatics 2005, 21(7):1084–1093. 10.1093/bioinformatics/bti108View ArticlePubMed
  18. Hassan M, Hossain M, Bailey J, Macintyre G, Ho J, Ramamohanarao K: A voting approach to identify a small number of highly predictive genes using multiple classifiers. BMC Bioinformatics 2009, 10(Suppl 1):S19. 10.1186/1471-2105-10-S1-S19PubMed CentralView ArticlePubMed
  19. Liu B, Cui Q, Jiang T, Ma S: A combinational feature selection and ensemble neural network method for classification of gene expression data. BMC Bioinformatics 2004, 5: 136. 10.1186/1471-2105-5-136PubMed CentralView ArticlePubMed
  20. Zhang Z, Yang P: An ensemble of classifiers with genetic algorithm based feature selection. IEEE Intelligent Informatics Bulletin 2008, 9: 18–24.
  21. Zhang Z, Yang P, Wu X, Zhang C: An agent-based hybrid system for microarray data analysis. IEEE Intelligent Systems 2009, 24(5):53–63. 10.1109/MIS.2009.92View Article
  22. Saeys Y, Abeel T, Peer Y: Robust feature selection using ensemble feature selection techniques. In Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases. Part II. Volume 5212. Springer; 2008:313–325. full_textView Article
  23. Witten IH, Frank MD: Data Mining: Practical Machine Learning Tools and Techniques. Second edition. Elsevier; 2005.
  24. Mitchell T: Machine Learning. McGraw Hill; 1997.
  25. Dietterich TG: Ensemble methods in machine learning. In Proceedings of Multiple Classifier System. Volume 1857. Springer; 2000:1–15. full_textView Article
  26. Tsymbal A, Pechenizkiy M, Cunningham P: Diversity in search strategies for ensemble feature selection. Information Fusion 2005, 6: 83–98. 10.1016/j.inffus.2004.04.003View Article
  27. Bontempi G: A blocking strategy to improve gene selection for classification of gene expression data. IEEE/ACM Transactions on Computational Biology and Bioinformatcis 2007, 4: 293–300. 10.1109/TCBB.2007.1014View Article
  28. Lam L, Suen Y: Application of majority voting to pattern recognition: an analysis of its behaviour and performance. IEEE Transactions on Systems, Man, and Cybernetics 1997, 27: 553–568. 10.1109/3468.618255View Article
  29. Ruta D, Gabrys B: Application of the evolutionary algorithms for classifier selection in multiple classifier systems with majority voting. Proceedings of MCS 2001, LNCS 2096 2001, 399–408.
  30. Golub T, Slonim D, Tamayo T, Huard C, Gaasenbeek M, Mesirov J, Coller H, Loh M, Downing J, Caligiuri M, et al.: Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 1999, 286: 531–537. 10.1126/science.286.5439.531View ArticlePubMed
  31. Alon U, Barkai N, Notterman D, Gish K, Ybarra S, Mack D, Levine A: Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. PNAS 1999, 96: 6745–6750. 10.1073/pnas.96.12.6745PubMed CentralView ArticlePubMed
  32. Chen X, Cheung S, So S, Fan S, Barry C, Higgins J, Lai K, Ji J, Dudoit S, Ng I, et al.: Gene expression patterns in human liver cancers. Molecular Biology of the Cell 2002, 13: 1929–1939. 10.1091/mbc.02-02-0023.PubMed CentralView ArticlePubMed
  33. Armstrong S, Staunton J, Silverman L, Pieters R, den Boer M, Minden M, Sallan S, Lander E, Golub T, Korsmeyer S: MLL translocations specify a distinct gene expression profile that distinguishes a unique leukemia. Nature Genetics 2001, 30: 41–47. 10.1038/ng765View ArticlePubMed
  34. Hua J, Xiong Z, Lowey J, Suh E, Dougherty E: Optimal number of features as a function of sample size for various classification rules. Bioinformatics 2005, 21: 1509–1515. 10.1093/bioinformatics/bti171View ArticlePubMed
  35. Li W, Yang Y: How many genes are needed for a discriminant microarray data analysis? Proceedings of Critical Assessment of Microarray Data Analysis 2000, 137–150.
  36. Dudoit S, Fridlyand J, Speed T: Comparison of discrimination methods for the classification of tumors using gene expression data. Journal of the American Statistical Association 2002, 97: 77–87. 10.1198/016214502753479248View Article
  37. GA/KNN software usage agreement and download[http://​www.​niehs.​nih.​gov/​research/​resources/​software/​gaknn/​]
  38. Ding C, Peng H: Minimum redundancy feature selection from microarray gene expression data. Journal of Bioinformatics and Computational Biology 2005, 3(2):185–205. 10.1142/S0219720005001004View ArticlePubMed
  39. MFGE project homepage[http://​www.​cs.​usyd.​edu.​au/​~yangpy/​software/​MFGE]


© Yang et al; licensee BioMed Central Ltd. 2010

This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://​creativecommons.​org/​licenses/​by/​2.​0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.