The MF-GE hybrid approach
System overview
A flow chart of the proposed MF-GE hybrid system is shown in Figure 2. The gene selection process is divided into two sequential phases, a "filtering process" and a "wrapper process". In the filtering process, multiple filtering algorithms are applied to score each candidate gene in the microarray dataset. The scores of each gene are then integrated for the wrapper process. In the wrapper process, the genetic ensemble algorithm selects discriminative genes using the information provided by the filtering process. The details of this genetic ensemble algorithm are described in [20, 21]. Briefly, a multi-objective GA (MOGA) serves as the search engine over gene combinations, while an ensemble of classifiers evaluates each gene subset and provides feedback for gene subset optimization. The algorithm executes iteratively, collecting multiple gene subsets. The genes in the final collections are ranked, and the top-ranked genes are used for sample classification.
An intermediate step called "score mapping" serves as the link between the filtering process and the wrapper process. It is described in detail in the next subsection.
Multi-filter score mapping
Traditionally, filtering algorithms select differentially expressed genes independently of the classification process. However, the evaluation information they produce could be beneficial if appropriately integrated into the wrapper procedure. To fuse the evaluation information from multiple filtering algorithms, we developed a multi-filter score mapping strategy, which serves as the connection between the filtering process and the wrapper process. An example of this mapping process with two filters is depicted in Figure 3.
The process starts by calculating scores for each candidate gene with different filtering algorithms. The evaluation scores obtained from the different filtering algorithms are then integrated. One issue in integrating multiple scores is that different filtering algorithms often provide evaluation scores on different scales. To combine the evaluation results of multiple filters, we must transform the evaluation scores into a common scale. Therefore, the softmax scaling process is adopted to squash the gene evaluation results of each filtering algorithm into the range [0, 1]. The calculation is as follows:
$$x'_{ik} = \frac{1}{1 + \exp(-y)}$$
in which
$$y = \frac{x_{ik} - \bar{x}_k}{\sigma_k}$$
where $\bar{x}_k$ is the average expression value of the $k$th gene among all samples, $\sigma_k$ is the standard deviation of the $k$th gene among all samples, and $x'_{ik}$ is the transformed value of $x_{ik}$, which denotes the expression value of the $k$th gene in sample $i$.
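The softmax scaling step can be illustrated with a minimal sketch. It assumes the standard form of the transformation (standardise by mean and standard deviation, then apply the logistic function); the function name `softmax_scale`, the use of the population standard deviation, and the handling of constant input are illustrative assumptions, not details given in the text.

```python
import math

def softmax_scale(values):
    """Standardise a list of numbers, then squash each into (0, 1)
    with the logistic function 1 / (1 + exp(-y))."""
    n = len(values)
    mean = sum(values) / n
    # Population standard deviation (assumption: the estimator is not specified).
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    if std == 0:
        # Assumption: constant input carries no ranking information, map to 0.5.
        return [0.5] * n
    return [1.0 / (1.0 + math.exp(-(v - mean) / std)) for v in values]

# Two inputs on very different scales end up on the same common scale.
scaled = softmax_scale([1.0, 2.0, 3.0])
rescaled = softmax_scale([100.0, 200.0, 300.0])
```

Because the values are standardised first, the result depends only on each value's position relative to the mean, which is what makes scores from different filters comparable.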
After the softmax scaling process, the evaluation scores from the different filtering algorithms are summed to give a total score for each gene, which indicates its overall score under the evaluation of multiple filtering algorithms. The total scores are then multiplied by 10 and rounded to integers. Scores smaller than 1 are set to 1 to ensure that all candidate genes are included in the wrapper selection process. The final step is the score-to-frequency mapping step, which converts the integer score of each gene into the number of times this gene appears in the transformed candidate gene pool (which we call a gene frequency map). The random processes of "chromosome" initialization and "chromosome" mutation in the genetic ensemble system are then conducted based on this gene frequency map.
Genes with higher overall evaluation scores appear in the gene frequency map more frequently and thus have a better chance of being chosen in the initialization step and the mutation step. In this way, information from multiple filters is fused into the gene selection process, which helps to integrate information about data characteristics from different aspects.
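The score mapping steps described above (scale, sum, multiply by 10, round, floor at 1, expand into a pool) can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the names `softmax_scale` and `gene_frequency_map`, the toy filter scores, and the way a chromosome is drawn from the pool are all hypothetical.

```python
import math
import random

def softmax_scale(scores):
    """Standardise, then apply the logistic function (assumed form of softmax scaling)."""
    n = len(scores)
    mean = sum(scores) / n
    std = math.sqrt(sum((s - mean) ** 2 for s in scores) / n)
    if std == 0:
        return [0.5] * n
    return [1.0 / (1.0 + math.exp(-(s - mean) / std)) for s in scores]

def gene_frequency_map(filter_scores):
    """filter_scores: one list of raw scores per filter, indexed by gene.
    Returns a pool in which gene index g appears once per unit of integer score."""
    scaled = [softmax_scale(scores) for scores in filter_scores]
    pool = []
    for g in range(len(filter_scores[0])):
        total = sum(per_filter[g] for per_filter in scaled)
        # Multiply by 10, round, and floor at 1 so every gene stays selectable.
        # Note: Python's round() rounds exact halves to the nearest even integer.
        copies = max(1, round(total * 10))
        pool.extend([g] * copies)
    return pool

# Toy scores for four genes from two hypothetical filters on different scales.
filter_a = [0.1, 5.0, 9.0, 0.2]
filter_b = [10.0, 300.0, 250.0, 5.0]
pool = gene_frequency_map([filter_a, filter_b])

# Drawing uniformly from the pool favours high-scoring genes. Duplicates are
# possible here; how they are handled is not specified in the text.
rng = random.Random(0)
chromosome = [rng.choice(pool) for _ in range(3)]
```

Sampling uniformly from the expanded pool is equivalent to sampling genes with probability proportional to their integer scores, which keeps the initialization and mutation steps simple.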
Filters and classifiers
Filter components
In this subsection, we introduce five filtering algorithms incorporated in our MF-GE hybrid system for experiments. All these filtering algorithms have been routinely applied for gene selection of microarray data.
χ2-statistic
When used for gene evaluation, the χ2-statistic can be viewed as measuring how often a particular value of a gene occurs together with a particular class. Formally, the discriminative power of a gene is quantified as follows:
$$\chi^2(g) = \sum_{v \in V} \sum_{i=1}^{m} \frac{\left(N(g = v, c_i) - E(g = v, c_i)\right)^2}{E(g = v, c_i)}$$
where $c_i$ ($i = 1, ..., m$) denotes the possible classes of the samples from a dataset, while $g$ is the gene that has a set of possible values denoted as $V$. $N(g = v, c_i)$ and $E(g = v, c_i)$ are the observed and the expected co-occurrence of $g = v$ with the class $c_i$, respectively.
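A small sketch may help make the χ2 score concrete. It assumes that gene values have already been discretised and that the expected co-occurrence is the usual independence estimate, E(g = v, c_i) = N(g = v) · N(c_i) / N; neither detail is stated in the text, and the function name `chi_square` and the toy data are illustrative.

```python
from collections import Counter

def chi_square(gene_values, labels):
    """Chi-square score of one discretised gene against the class labels."""
    n = len(labels)
    value_counts = Counter(gene_values)
    class_counts = Counter(labels)
    observed = Counter(zip(gene_values, labels))
    score = 0.0
    for v, n_v in value_counts.items():
        for c, n_c in class_counts.items():
            # Expected co-occurrence if gene value and class were independent.
            expected = n_v * n_c / n
            score += (observed[(v, c)] - expected) ** 2 / expected
    return score

labels = ["tumour", "tumour", "normal", "normal"]
informative = ["high", "high", "low", "low"]    # tracks the class exactly
uninformative = ["high", "low", "high", "low"]  # independent of the class

score_informative = chi_square(informative, labels)      # 4.0
score_uninformative = chi_square(uninformative, labels)  # 0.0
```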
ReliefF
ReliefF is a widely used filtering algorithm. In the context of microarray data classification, the algorithm selects genes whose values discriminate well between samples that have similar expression patterns. The formula used by ReliefF to compute the weight or "importance" of a gene g is as follows:
$$W(g) = N\left(\sum_{i} \left[\mathrm{diff}(g, S_i, S_d) - \mathrm{diff}(g, S_i, S_s)\right]\right)$$
where diff(g, S1, S2) calculates the difference between the values of the gene g for two samples (S1 and S2), $S_i$ denotes the $i$th randomly selected sample from the dataset, while $S_d$ and $S_s$ denote the nearest sample from a different class to $S_i$ and the nearest sample from the same class as $S_i$, respectively. $N(\cdot)$ is a normalization function that keeps the value of $W(g)$ in the interval [-1, 1].
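The weight update can be sketched in a simplified Relief style, using one nearest sample from the same class and one from a different class per randomly selected sample, as the description above suggests. Full ReliefF implementations typically average over several neighbours. The function name `relief_weights`, the range-normalised `diff`, and the choice of dividing by the number of iterations as the normalisation N(.) are assumptions made for illustration.

```python
import random

def relief_weights(samples, labels, n_iter, seed=0):
    """Simplified Relief-style gene weights (single nearest hit and miss).
    Assumes every class contains at least two samples."""
    rng = random.Random(seed)
    n_genes = len(samples[0])
    # Per-gene value range, used so that diff() lies in [0, 1].
    spans = []
    for g in range(n_genes):
        column = [s[g] for s in samples]
        spans.append((max(column) - min(column)) or 1.0)

    def diff(g, a, b):
        return abs(a[g] - b[g]) / spans[g]

    def distance(a, b):
        return sum(diff(g, a, b) for g in range(n_genes))

    weights = [0.0] * n_genes
    for _ in range(n_iter):
        i = rng.randrange(len(samples))
        s_i = samples[i]
        same = [s for j, s in enumerate(samples) if j != i and labels[j] == labels[i]]
        other = [s for j, s in enumerate(samples) if labels[j] != labels[i]]
        s_s = min(same, key=lambda s: distance(s_i, s))   # nearest, same class
        s_d = min(other, key=lambda s: distance(s_i, s))  # nearest, different class
        for g in range(n_genes):
            weights[g] += diff(g, s_i, s_d) - diff(g, s_i, s_s)
    # Assumed normalisation: averaging keeps each weight within [-1, 1].
    return [w / n_iter for w in weights]

# Gene 0 separates the two classes; gene 1 does not.
samples = [[0.1, 0.5], [0.2, 0.9], [0.9, 0.4], [1.0, 0.8]]
labels = [0, 0, 1, 1]
weights = relief_weights(samples, labels, n_iter=20)
```

In this toy dataset the gene that separates the classes receives a positive weight, while the gene that varies within classes receives a negative one.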
Symmetrical Uncertainty
The Symmetrical Uncertainty method evaluates the worth of a gene by measuring its symmetrical uncertainty with respect to the sample class [23]. Each gene is evaluated as follows:
$$SU(g) = 2 \times \frac{H(class) - H(class \mid g)}{H(class) + H(g)}$$
where $H(\cdot)$ is the information entropy function. $H(class)$ and $H(g)$ give the entropy values of the class and a given gene, while $H(class \mid g)$ gives the conditional entropy of the class given the gene.
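As an illustration, symmetrical uncertainty can be computed for a discretised gene with the standard definition SU = 2 (H(class) - H(class|g)) / (H(class) + H(g)). The sketch below assumes discrete gene values and base-2 logarithms; the helper names are hypothetical.

```python
import math
from collections import Counter

def entropy(items):
    """Shannon entropy (in bits) of a list of discrete values."""
    n = len(items)
    return -sum((c / n) * math.log2(c / n) for c in Counter(items).values())

def conditional_entropy(labels, gene_values):
    """H(class | g): class entropy within each gene value, weighted by P(g = v)."""
    n = len(labels)
    total = 0.0
    for v, n_v in Counter(gene_values).items():
        subset = [c for c, x in zip(labels, gene_values) if x == v]
        total += (n_v / n) * entropy(subset)
    return total

def symmetrical_uncertainty(gene_values, labels):
    h_class = entropy(labels)
    h_gene = entropy(gene_values)
    if h_class + h_gene == 0:
        return 0.0  # assumption: define SU as 0 when both entropies vanish
    gain = h_class - conditional_entropy(labels, gene_values)
    return 2.0 * gain / (h_class + h_gene)

labels = ["tumour", "tumour", "normal", "normal"]
su_informative = symmetrical_uncertainty(["high", "high", "low", "low"], labels)
su_uninformative = symmetrical_uncertainty(["high", "low", "high", "low"], labels)
```

The score is 1 for a gene that determines the class exactly and 0 for a gene that is independent of it.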
Information Gain
Information Gain is a statistical measure often used for node selection in decision tree construction. It measures the number of bits of information gained for class prediction by knowing the value of a feature [3]. Let $c_i$ belong to a set of discrete classes (1, ..., m), and let $V$ be the set of possible values for a given gene $g$. The information gain of a gene $g$ is defined as follows:
$$InfoGain(g) = -\sum_{i=1}^{m} P(c_i) \log P(c_i) + \sum_{v \in V} \sum_{i=1}^{m} P(g = v) P(c_i \mid g = v) \log P(c_i \mid g = v)$$
Gain Ratio
The final filtering algorithm used in the hybrid system is Gain Ratio. Gain Ratio incorporates the "split information" of features into the Information Gain statistic. The "split information" of a gene is obtained by measuring how broadly and uniformly it splits the data [24]. Consider again a microarray dataset $S$ with a set of classes denoted as $c_i$ ($i = 1, ..., m$), in which each gene $g$ has a set of possible values denoted as $V$. The discriminative power of a gene $g$ is given as:
$$GainRatio(g) = \frac{InfoGain(g)}{SplitInfo(g)}$$
in which:
$$SplitInfo(g) = -\sum_{v \in V} \frac{|S_v|}{|S|} \log \frac{|S_v|}{|S|}$$
where $S_v$ is the subset of $S$ of which gene $g$ has value $v$.
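The relationship between Information Gain and Gain Ratio can be illustrated with a short sketch. It assumes discretised gene values, base-2 logarithms, and the standard definitions (information gain as the reduction in class entropy, split information as the entropy of the gene's own value distribution); the function names and toy data are hypothetical.

```python
import math
from collections import Counter

def entropy(items):
    """Shannon entropy (in bits) of a list of discrete values."""
    n = len(items)
    return -sum((c / n) * math.log2(c / n) for c in Counter(items).values())

def info_gain(gene_values, labels):
    """Class entropy minus the class entropy remaining after splitting on the gene."""
    n = len(labels)
    remainder = 0.0
    for v, n_v in Counter(gene_values).items():
        subset = [c for c, x in zip(labels, gene_values) if x == v]
        remainder += (n_v / n) * entropy(subset)
    return entropy(labels) - remainder

def gain_ratio(gene_values, labels):
    # Split information is the entropy of the partition induced by the gene.
    split_info = entropy(gene_values)
    if split_info == 0:
        return 0.0  # assumption: a single-valued gene gets a ratio of 0
    return info_gain(gene_values, labels) / split_info

labels = ["tumour", "tumour", "normal", "normal"]
two_level = ["high", "high", "low", "low"]
four_level = ["a", "b", "c", "d"]  # a different value for every sample

ig_two, ig_four = info_gain(two_level, labels), info_gain(four_level, labels)
gr_two, gr_four = gain_ratio(two_level, labels), gain_ratio(four_level, labels)
```

Both toy genes reach the same information gain, but the gene that fragments the data into four singleton groups is penalised by its larger split information and receives half the gain ratio.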
It is clear that each algorithm uses a different criterion in evaluating the worth of the candidate genes in microarray datasets. When combined, candidate genes are assessed from many different aspects.
Classifier components
An ensemble of classifiers has recently been suggested as a promising way to overcome the limitations of individual classifiers [25]. In our previous study, we demonstrated that, if combined properly, multiple classifiers can achieve higher sample classification accuracy and more reproducible feature selection results [20]. Therefore, selecting classification algorithms and developing suitable integration strategies are key to a successful ensemble. What characteristics should we promote in ensemble construction? The basic concerns are that the individual classifiers should be as accurate and diverse as possible [26], and that they should be relatively computationally efficient. With these criteria in mind, we evaluated different compositions under the genetic architecture within a multiagent framework [21]. A hybrid of five classifiers, namely decision tree (DT), random forest (RF), 3-nearest neighbors (3NN), 7-nearest neighbors (7NN), and naive Bayes (NB), was identified as better in terms of sample classification and stability than many alternatives. Furthermore, two integration strategies, namely blocking and majority voting, have been employed for ensemble construction.
The blocking strategy optimizes the target gene subset by improving the sample classification accuracy of the whole ensemble rather than that of one specific inductive algorithm. This formulation adds multiple test conditions to the algorithm, so the gene subset optimized under this criterion is not tied to any specific classifier and generalizes better. Moreover, genes selected with this strategy are more likely to have real relevance to the biological trait of interest [27]. Majority voting combines multiple classifiers and tries to optimize the target feature set into one that produces high-consensus classification [28]. This part of the fitness function implicitly promotes genes that create diverse classifiers, which in turn leads to high sample classification accuracy [29].
The fitness functions derived from blocking ($fitness_b(s)$) and majority voting ($fitness_v(s)$) are defined as follows:
$$fitness_b(s) = \frac{\sum_{i=1}^{L} BC(h_i(s), y)}{L}$$
and
$$fitness_v(s) = BC(V_k(h_1(s), ..., h_L(s)), y)$$
where $k$ is the size of the majority voting $V_k(\cdot)$, $h_i(s)$ ($i = 1, ..., L$) is the classification hypothesis generated by classifier $i$ in the ensemble while classifying the dataset using gene subset $s$, $y$ is the class label of the samples, and $BC(\cdot)$ is the balanced classification accuracy, which is calculated as follows:
$$BC = \frac{\sum_{j=1}^{m} Se_j}{m}$$
and
$$Se_j = \frac{N^{TP}_j}{N_j} \times 100$$
where $Se_j$ is the sensitivity value calculated as the percentage of the number of true positive classifications ($N^{TP}_j$) of samples in class $j$, $N_j$ denotes the total number of samples in class $j$, and $m$ is the total number of classes.
Finally, the fitness function of the MOGA is defined as follows:
$$fitness(s) = w_1 \times fitness_b(s) + w_2 \times fitness_v(s)$$
where the empirical coefficients $w_1$ and $w_2$ specify the contribution weights of each term.
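The two fitness terms and their weighted combination can be sketched as follows. The sketch assumes that the blocking term is the mean balanced accuracy of the individual classifiers and that the voting term is the balanced accuracy of a simple plurality vote over all classifiers. The function names, the toy predictions, and the weights w1 = w2 = 0.5 are placeholders, since the text does not report the values used.

```python
from collections import Counter

def balanced_accuracy(predicted, actual):
    """Mean per-class sensitivity, expressed as a percentage."""
    classes = sorted(set(actual))
    sensitivities = []
    for c in classes:
        n_c = sum(1 for y in actual if y == c)
        true_pos = sum(1 for p, y in zip(predicted, actual) if y == c and p == c)
        sensitivities.append(100.0 * true_pos / n_c)
    return sum(sensitivities) / len(classes)

def majority_vote(hypotheses):
    """Per-sample plurality vote; ties go to the label encountered first."""
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*hypotheses)]

def fitness(hypotheses, actual, w1=0.5, w2=0.5):
    # Blocking term: average balanced accuracy over the individual classifiers.
    fitness_b = sum(balanced_accuracy(h, actual) for h in hypotheses) / len(hypotheses)
    # Voting term: balanced accuracy of the ensemble's consensus prediction.
    fitness_v = balanced_accuracy(majority_vote(hypotheses), actual)
    return w1 * fitness_b + w2 * fitness_v

# Predictions of three hypothetical classifiers for one gene subset.
actual = [0, 0, 1, 1]
hypotheses = [
    [0, 0, 1, 0],
    [0, 1, 1, 1],
    [0, 0, 0, 1],
]
score = fitness(hypotheses, actual)  # 0.5 * 75.0 + 0.5 * 100.0 = 87.5
```

In this toy case each classifier makes one error on a different sample, so the vote is perfect while the blocking term stays at 75, which shows why the two terms reward different properties of a gene subset.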