Algorithms for microarray data analysis typically focus on obtaining a set of genes that can distinguish between the different classes in a given sample set. Thus, the primary concern is to ensure the relevance of the genes to the classes under consideration.
Given a microarray data set with m samples belonging to k known classes and n genes, we want to select those genes which are able to predict the differences in the gene expression patterns between the sample classes. Define c = (c1, c2, ..., cm), |c| = k, as the vector labeling the classes of the samples and gi, i = 1, ..., n, as the gene expression profile of gene i. Let F = {g1, g2, ..., gn} be the feature set of all genes and let S be the set of selected genes. Then, the feature set selection problem can be defined as follows:
Problem 1
Select a set S of genes, S ⊂ F, such that ∀ gene s ∈ S the relevance of s with c is maximized.
However, the feature set of genes selected in this way will contain a number of redundant genes, some with little relevance to the classes. This is because the presence of genes that are closely related to each other implies that genes orthogonal to those in the selected set may be left out of the final feature set. Moreover, the presence of genes with little relevance to the classes reduces the "useful information".
Ideally, the selected genes should have high relevance with the classes while the redundancy among them is low. Most previous studies emphasized the selection of highly relevant genes; Ding et al. [20] addressed the issue of redundancy among the selected genes. Genes with high relevance are expected to be able to predict the classes of the samples, but this prediction power is reduced if many redundant genes are selected. In contrast, a feature set containing genes with not only high relevance to the classes but also low mutual redundancy is more effective in its prediction capability.
Problem formulation
To assess the effectiveness of the genes, both the relevance and the redundancy need to be measured quantitatively. An entropy-based correlation measure is chosen here. According to Shannon's information theory [31], the entropy of a discrete random variable X can be defined as:
H(X) = -∑x p(x) log p(x) (1)
Entropy measures the uncertainty of a random variable. For the measurement of the interdependency of two random variables X and Y, some researchers [20, 21] used mutual information, which is defined as:
I(X, Y) = H(X) + H(Y) - H(X, Y) (2)
In order to ensure that values for different variable pairs are comparable and have similar effects, a normalized mutual information is used as the measure, defined as:
U(X, Y) = 2 I(X, Y) / (H(X) + H(Y)) (3)
U(X, Y) is symmetrical and ranges from 0 to 1, with the value 1 indicating that the knowledge of one variable completely predicts the other (high mutual relevance) while the value 0 indicates that X and Y are independent (low mutual relevance).
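As a concrete illustration, the following Python sketch computes the quantities of Equations (1)-(3) for discrete-valued vectors (continuous expression values would first be discretized, e.g. by binning). The function names and the choice of base-2 logarithms are illustrative, not part of the original method.

import numpy as np
from collections import Counter

def entropy(x):
    """Shannon entropy H(X) of a discrete-valued sequence (Equation 1)."""
    n = len(x)
    probs = np.array([count / n for count in Counter(x).values()])
    return -np.sum(probs * np.log2(probs))

def mutual_information(x, y):
    """I(X, Y) = H(X) + H(Y) - H(X, Y) (Equation 2)."""
    joint = list(zip(x, y))          # joint outcomes of (X, Y)
    return entropy(x) + entropy(y) - entropy(joint)

def normalized_mi(x, y):
    """Normalized, symmetric mutual information U(X, Y) in [0, 1] (Equation 3)."""
    hx, hy = entropy(x), entropy(y)
    if hx + hy == 0:                 # both variables are constant
        return 0.0
    return 2.0 * mutual_information(x, y) / (hx + hy)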
The mutual relevance between gi and c can then be modeled by U(gi, c), while the dependency between two genes gi and gj is U(gi, gj).
The total relevance of all selected genes is given by
J1 = ∑gi∈S U(gi, c) (4)
The total redundancy among the selected genes is given by
J2 = ∑gi,gj∈S U(gi, gj) (5)
Therefore, the problem of selecting genes can be reformulated as follows:
Problem 2
Select a set S of genes, S ⊂ F, such that ∀ gi ∈ S, the total relevance of all the selected genes with c, J1, is maximized while the total redundancy among all the selected genes gi ∈ S, J2, is minimized.
This is a two-objective optimization problem. A simple way to solve it is to combine the two objectives into one:
max J = J1 - β J2 (6)
where β is a weight parameter.
Algorithm
To solve the above problem, Battiti [21] proposed a greedy algorithm. The procedure can be described as follows (see Figure 7):
1. Initialization: F ← all genes, S ← ∅.
2. First gene: select the gene i that has the highest relevance U(gi, c); S ← {gi}, F ← F \ {gi}.
3. Remaining genes: from F, select the gene j that maximizes U(gj, c) - β ∑gi∈S U(gj, gi); S ← S ⋃ {gj}, F ← F \ {gj}.
4. Repeat step 3 until the desired number of genes is obtained.
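Assuming the normalized_mi helper sketched above, the greedy procedure might be implemented as follows; the function name greedy_select and its parameters are illustrative rather than taken from [21]:

def greedy_select(genes, c, num_genes, beta):
    """Greedy gene selection in the style of the procedure above.

    genes : dict mapping gene id -> discretized expression vector
    c     : class-label vector of the samples
    """
    F = set(genes)                                     # candidate pool
    relevance = {i: normalized_mi(genes[i], c) for i in F}
    first = max(F, key=relevance.get)                  # step 2: most relevant gene
    S = [first]
    F.remove(first)
    while len(S) < num_genes and F:
        # step 3: relevance minus beta-weighted redundancy with selected genes
        def score(j):
            redundancy = sum(normalized_mi(genes[j], genes[i]) for i in S)
            return relevance[j] - beta * redundancy
        best = max(F, key=score)
        S.append(best)
        F.remove(best)
    return S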
The maximization problem (6) can also be reformulated as a binary optimization problem. Let xi be a binary variable taking the value 1 if gene i is selected and 0 otherwise. Equation (6) can then be rewritten as:
max ∑i U(gi, c) xi - β ∑i,j U(gi, gj) xi xj (7)
It can be further rewritten in matrix form:
max Uc^T x - β x^T Up x (8)
where Uc is the relevance vector and Up is the matrix of pairwise redundancies.
Beasley et al. [32] discussed several heuristic algorithms for solving such binary quadratic programming problems. A heuristic simulated annealing method was employed here to solve the problem; pseudocode for simulated annealing can be obtained from [32].
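The specific simulated annealing variant of [32] is not reproduced here; the following is a minimal, generic simulated annealing sketch for objective (8), in which the swap-based neighborhood move, the geometric cooling schedule, and all parameter defaults are our own illustrative choices:

import numpy as np

def simulated_annealing(Uc, Up, beta, n_select, T0=1.0, cooling=0.95, steps=5000):
    """Heuristic maximization of Uc^T x - beta * x^T Up x over binary x
    with a fixed number of selected genes (Equation 8)."""
    rng = np.random.default_rng(0)
    Uc, Up = np.asarray(Uc), np.asarray(Up)
    n = len(Uc)
    x = np.zeros(n, dtype=int)
    x[rng.choice(n, n_select, replace=False)] = 1      # random initial selection

    def objective(x):
        return Uc @ x - beta * x @ Up @ x

    f = objective(x)
    best_x, best_f = x.copy(), f
    T = T0
    for _ in range(steps):
        # neighbor move: swap one selected gene with one unselected gene
        i = rng.choice(np.flatnonzero(x == 1))
        j = rng.choice(np.flatnonzero(x == 0))
        x[i], x[j] = 0, 1
        f_new = objective(x)
        if f_new >= f or rng.random() < np.exp((f_new - f) / T):
            f = f_new                                  # accept the move
            if f > best_f:
                best_x, best_f = x.copy(), f
        else:
            x[i], x[j] = 1, 0                          # reject: undo the swap
        T *= cooling                                   # geometric cooling schedule
    return best_x, best_f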
There are, however, limitations to both approaches. The solution obtained for Problem 2 may be a local optimum, which could result in a sub-optimal feature set and thereby affect the prediction accuracy. In order to expand the search space, an iterative procedure was adopted. The data was first clustered and partitioned into K groups, C1, C2, ..., CK, using k-means clustering; the idea was to group genes with similar expression patterns together. The greedy or heuristic simulated annealing procedure was then applied to select a subset of genes, Sk, from each partition k, such that the selected genes had low mutual relevance with respect to each other while at the same time having maximal relevance with the different classes. The genes selected from each subset were then combined into a single gene set, that is, S = S1 ⋃ S2 ⋃ ... ⋃ SK.
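A possible realization of this partition-and-select step, assuming scikit-learn's KMeans and the greedy_select sketch above, is shown below; the per-cluster gene budget genes_per_cluster is an assumption, since the text does not fix how many genes are drawn from each partition:

import numpy as np
from sklearn.cluster import KMeans

def partitioned_select(expr, c, K, genes_per_cluster, beta):
    """Cluster gene profiles into K groups, then select a subset from each.

    expr : (n_genes, n_samples) array; rows are (discretized) gene profiles.
    Returns the union S = S1 ⋃ S2 ⋃ ... ⋃ SK of the per-cluster selections.
    """
    labels = KMeans(n_clusters=K, n_init=10, random_state=0).fit_predict(expr)
    S = []
    for k in range(K):
        members = np.flatnonzero(labels == k)          # gene indices in cluster k
        cluster_genes = {int(i): expr[i] for i in members}
        S.extend(greedy_select(cluster_genes, c, genes_per_cluster, beta))
    return S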
The final set of genes is selected by carrying out leave-one-out cross validation (LOOCV). In each run, one sample is held out for testing while the remaining N - 1 samples are used to train the classifier. The genes are selected by the algorithm using the training samples and are then used to classify the test sample. The overall accuracy rate is calculated from the correctness of the classification of each test sample. To gain a deeper understanding of the selected genes, those genes found in common across all N runs of the LOOCV experiment are listed for further investigation. The gene selection process is repeated by selecting from this feature set a subset of genes that gives a classification error below a user-defined threshold ε. The nearest neighbor (k-NN) classification method is used to assess the discriminative power of the genes selected by the method. The process stops when the error becomes greater than ε. The full algorithm is presented in Figure 8.
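The LOOCV evaluation might be organized as in the following sketch, which assumes scikit-learn's KNeighborsClassifier and treats the selection procedure as a pluggable select_fn; all names and defaults are illustrative:

import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def loocv_accuracy(expr, c, select_fn, k_neighbors=3):
    """Leave-one-out cross validation of a gene-selection procedure.

    expr : (n_genes, n_samples) array; c : (n_samples,) class labels.
    select_fn : callable (train_expr, train_labels) -> list of gene indices.
    Returns the LOOCV accuracy and the genes selected in every run.
    """
    n_samples = expr.shape[1]
    correct, per_run_genes = 0, []
    for held_out in range(n_samples):
        train = np.array([s for s in range(n_samples) if s != held_out])
        genes = select_fn(expr[:, train], c[train])    # select on training data only
        per_run_genes.append(set(genes))
        clf = KNeighborsClassifier(n_neighbors=k_neighbors)
        clf.fit(expr[np.ix_(genes, train)].T, c[train])
        pred = clf.predict(expr[genes, held_out].reshape(1, -1))
        correct += int(pred[0] == c[held_out])
    common = set.intersection(*per_run_genes)          # genes common to all N runs
    return correct / n_samples, common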