A stable gene selection in microarray data analysis

Background Microarray data analysis is notorious for involving a huge number of genes compared to a relatively small number of samples. Gene selection is to detect the most significantly differentially expressed genes under different conditions, and it has been a central research focus. In general, a better gene selection method can improve the performance of classification significantly. One of the difficulties in gene selection is that the numbers of samples under different conditions vary a lot. Results Two novel gene selection methods are proposed in this paper, which are not affected by the unbalanced sample class sizes and do not assume any explicit statistical model on the gene expression values. They were evaluated on eight publicly available microarray datasets, using leave-one-out cross-validation and 5-fold cross-validation. The performance is measured by the classification accuracies using the top ranked genes based on the training datasets. Conclusion The experimental results showed that the proposed gene selection methods are efficient, effective, and robust in identifying differentially expressed genes. Adopting the existing SVM-based and KNN-based classifiers, the selected genes by our proposed methods in general give more accurate classification results, typically when the sample class sizes in the training dataset are unbalanced.


Background
DNA microarray is a technology that can simultaneously measure the expression levels of thousands of genes in a single experiment. It is commonly used for comparing the gene expression levels in tissues under different conditions, such as wild-type versus mutant, or healthy versus diseased [1]. Some of the genes are expected to be differentially modulated in tissues under different conditions, with their expression levels increased or decreased to signify the experimental conditions. These discriminatory genes are very useful in clinical applications such as recog-nizing diseased profiles. However, due to high cost, the number of experiments that can be used for classification purpose is usually limited. This small number of experiments, compared to the large number of genes in an experiment, wakes up "the curse of dimensionality" and challenges the classification task and other data analysis in general. It is well-known that quite a number of genes are house-keeping genes and many others could be unrelated to the classification task [2]. Therefore, an important step to effective classification is to identify the discriminatory genes thus to reduce the number of genes used for classification purpose. This step of discriminatory gene identification is generally referred to as gene selection. Gene selection is a pre-requisite in many applications [3]. It should be noted that, often, the number of unrelated genes is much larger than the number of discriminatory genes.
There are a variety of gene selection methods proposed in the last a few years [2,4,5]. Among them, some methods assume explicit statistical models on the gene expression data. For example, Baldi and Long [4] developed a Gaussian gene-independent model on the gene expression data, and implemented a t-test combined with a full Bayesian treatment for gene selection. These methods assuming certain models are referred to as model-based gene selection methods. Other methods do not assume any specific distribution model on the gene expression data and they are referred to as model-free gene selection methods. For example, Xiong et al. [2] suggested a method to select genes through the space of feature subsets using classification errors. Guyon et al. [5] proposed a gene selection approach utilizing support vector machines based on recursive feature elimination. It has been reported that the results of model-free gene selection methods may be influenced by the classification methods chosen for scoring the genes [6]. Nonetheless, modelbased gene selection methods lack of adaptability, because it is unlikely possible to construct a universal probabilistic analysis model that is suitable for all kinds of gene expression data, where noise and variance may vary dramatically across different gene expression data [6]. In this sense, model-free gene selection methods are more desirable than model-based ones.
Gene selection is to provide a subset of a small number of discriminatory, or the most relevant, genes that can effectively recognize the class to which a testing sample belongs. That is, it is to provide a classifier such that the classification error is minimized. The known dataset that is used for learning the classifier, or equivalently for gene selection, is referred to as the training dataset. In a training dataset, every sample is labeled with its known class. Notice that if one class is significantly larger than the others, then the trained classifier might be biased. Therefore, the desired gene selection methods are those that are not affected by the sizes of classes in the training dataset. A gene selection method is called stable if the selected genes are kept the same when duplicating all the samples in any class in the training dataset.
In this paper, we propose two novel gene scoring functions s 1 (·) and s 2 (·) to design two stable gene selection methods GS1 and GS2 [see Additional file 5], respectively, to be detailed in the Methods section. According to the classification scheme proposed in [6], our proposed gene selection methods are single gene scoring approaches. These two gene scoring functions non-trivially incorporate the means and the variations of the expression values of genes in the samples belonging to a common class, based on a very general assumption that discriminatory genes are those having different means across different classes, small intra-class variations and relatively large interclass variations. This spherical data assumption does not involve any specific statistical model, and in this sense, the resultant gene selection methods GS1 and GS2 could be regarded as model-free. They are also shown to be stable theoretically.

Results and discussion
We have applied our gene selection methods GS1 and GS2 based on the gene scoring functions s 1 (·) and s 2 (·), respectively, to a total of 8 publicly available microarray datasets [7]: the leukemia (LEU) dataset [8], the small round blue cell tumors (SRBCT) dataset [9], the malignant glioma (GLIOMA) dataset [10], the human lung carcinomas (LUNG) dataset [11], the human carcinomas (CAR) dataset [12], the mixed-lineage leukemia (MLL) dataset [13], the prostate (PROSTATE) dataset [14], and the diffuse large Bcell lymphoma (DLBCL) dataset [15], the first two of which have been used for several similar testings of gene selection methods. On each dataset, one portion was used as the training dataset for our methods to score the genes and the other portion was used as the testing dataset. For each specified number x we reported the classification accuracy, on the testing dataset, of the classifier based on the top ranked x genes using the training dataset. The quality of these top ranked x genes is justified on two aspects: 1) the classification accuracy of the resultant classifier on the testing datasets, and 2) for the first two datasets LEU and SRBCT, the stability when the training datasets were partially changed. All the experiments were conducted in MATLAB [16] environment on a Pentium IV PC with a 2.4 GHz processor and a 512 MB RAM.

The datasets
The LEU dataset contains in total 72 samples in two classes, acute lymphoblastic leukemia (ALL) and acute myeloid leukemia (AML), which contain 47 and 25 samples, respectively. Every sample contains 7,129 gene expression values. We adopted the pretreatment method proposed in [1] to remove about half the genes and subsequently every sample contains only 3, 571 gene expression values.
The SRBCT dataset contains in total 83 samples in four classes, the Ewing family of tumors (EWS), Burkitt lymphoma (BL), neuroblastoma (NB) and rhabdomyosarcoma (RMS) [9]. Every sample in this dataset contains only 2,308 gene expression values. Among the 83 samples, 29, 11, 18, and 25 samples belong to classes EWS, BL, NB and RMS, respectively.
The GLIOMA dataset [10] contains in total 50 samples in four classes, cancer glioblastomas (CG), non-cancer glioblastomas (NG), cancer oligodendrogliomas (CO) and non-cancer oligodendrogliomas (NO), which have 14,14, 7,15 samples, respectively. Each sample has 12625 genes. We adopted a standard filtering method [10], that is, genes with minimal variations across the samples were removed. For this dataset, intensity thresholds were set at 20 and 16,000 units. Genes whose expression levels varied < 100 units between samples, or varied < 3 fold between any two samples, were excluded. After preprocessing, we obtained a dataset with 50 samples and 4433 genes.
The LUNG dataset [11] contains in total 203 samples in five classes, adenocarcinomas, squamous cell lung carcinomas, pulmonary carcinoids, small-cell lung carcinomas and normal lung, which have 139, 21, 20, 6,17 samples, respectively. Each sample has 12600 genes. The genes with standard deviations smaller than 50 expression units were removed and we obtained a dataset with 203 samples and 3312 genes [11].
The CAR dataset [12] contains in total 174 samples in eleven classes, prostate, bladder/ureter, breast, colorectal, gastroesophagus, kidney, liver, ovary, pancreas, lung adenocarcinomas, and lung squamous cell carcinoma, which have 26, 8,26,23,12,11,7,27,6,14,14 samples, respectively. Each sample contains 12533 genes. In our experiment, we preprocessed dataset as described in [12]. We included only those probe sets whose maximum hybridization intensity (AD) in at least one sample was 200, all AD values ≤ 20, including negative AD values, were raised to 20, and the data were log transformed. After preprocessing, we obtained a dataset with 174 samples and 9182 genes.
The MLL dataset [13] contains in total 72 samples in three classes, acute lymphoblastic leukemia (ALL), acute myeloid leukemia (AML), and mixed-lineage leukemia gene (MLL), which have 24, 28, 20 samples, respectively. In our experiment, intensity thresholds were set at 100 -16000 units. Then the relative variation of expressions for each gene was determined by dividing the maximum expression for the gene among all samples (max) over the minimum expression (min), i.e. max/min. We filtered out the genes with max/min ≤ 5 or (max -min) ≤ 500, that is, for max/ min fold variation, we excluded genes with less than 5fold variation and, for (max -min) absolute variation, we excluded genes with less than 500 units absolute. After preprocessing, we obtained a dataset with 72 samples and 8685 genes.
The PROSTATE dataset [14] contains in total 102 samples in two classes tumor and normal, which have 52 and 50 samples, respectively. The original dataset contains 12600 genes. In our experiment, intensity thresholds were set at 100 -16000 units, the same as in the MLL dataset. Then we filtered out the genes with max/min ≤ 5 or (max -min) ≤ 50. After preprocessing, we obtained a dataset with 102 samples and 5966 genes.
The DLBCL dataset [15] contains in total 77 samples in two classes, diffuse large B-cell lymphomas (DLBCL) and follicular lymphoma (FL) which have 58 and 19 samples, respectively. The original dataset contains 7129 genes. We set intensity thresholds at 20 -16000 units, then we filtered out genes with max/min ≤ 3 or (max -min) ≤ 100. After preprocessing, we obtained a dataset with 77 samples and 6285 genes.
Among the above 8 datasets [see Additional files 3 and 4], the first two, LEU and SRBCT, have been used frequently for testing gene selection methods and classifiers. For each of them, if we use the ratio of the largest class size divided by the smallest class size to represent the level of unbalance, then the fifth dataset CAR is the most unbalanced. In our experiments, we have run every gene selection method on each of the 8 datasets to rank the genes, and for every x ≤ 100, the classification accuracies of the classifier built using the top ranked x genes have been collected [see Additional file 1]. We chose to present part of the classification accuracies on datasets SRBCT and CAR in details (as plots) and to present only three values for x, 30, 60, and 100, for all eight datasets (as tables).

Classification accuracies
Using the top ranked genes selected by a gene selection method, together with their expression values in the training dataset, one can build a classifier that will decide for each testing sample the class it belongs to. Only the expression values for those selected genes in the testing sample are used for such a decision making. This is a standard way to test the quality of those selected genes, to examine how well the resulting classifier performs. Note that testing samples are not included in the training dataset. To this purpose, we define the classification accuracy to be the percentage of the correct decisions made by the classifier on the testing samples. We have compared our methods GS1 and GS2 with two other model-free gene selection methods Cho's [17] and F-test [1]. In our experiments, we have adopted two ways to build a classifier using the selected genes, one is through Support Vector Machines (SVMs) [5] and the other is through K-Nearest-Neighbor (KNN) search [1]. Essentially, SVMs compute a decision plane to separate the set of chips (in the training dataset) having different class memberships, and use this plane to predict the class memberships for testing chips. There are a number of kernels used in SVMs models for decision plane computing and we chose a linear kernel as described in [5]. A KNN classifier ascertains the class of a testing sample by analyzing its K nearest neighbors in the training dataset [1]. We chose the Euclidean distance in our KNN classifier with K = 5 and predicted the class by majority vote [1]. The SVM we used in MATLAB is from [18] and we coded the KNN by ourselves. For ease of presentation, the achieved classifiers are referred to as the SVM-classifier and the KNN-classifier, respectively. Figures 1 and 2 plot the classification accuracies of the SVM-classifier based on four gene selection methods GS1, GS2, Cho's, and F-test, on the SRBCT and CAR datasets, respectively. These classification accuracies were obtained through Leave-One-Out (LOO) cross validation. In LOO cross validation, one sample was left out as a testing sample and the remaining were used as the training dataset, and this was done for every sample in the dataset. We have also conducted 5-Fold cross validation, in which each dataset was randomly partitioned into 5 parts of equal size and in every experiment four parts were used as the training dataset (the fifth part was used as the testing dataset). This was done for every four parts in the dataset and the process (that is, random partition, training, and testing) was repeated for 100 times. The average accuracy over all these 500 testing datasets was taken as the 5-Fold cross validation classification accuracy. All plots of the 5-Fold (and the other LOO) cross validation classification accuracies of the SVM-classifier and the KNN-classifier based on four gene selection methods GS1, GS2, Cho's, and Ftest, on the eight datasets are included in Additional file 1. Especially, columns 2-4 (and 6-8) in Tables 1, 2, 3, 4, 5, 6, 7, 8 record these cross validation classification accuracies, for only three numbers of top ranked genes, that is, 30, 60, and 100. Column 5 (and column 9) records the highest cross validation classification accuracies on these eight datasets ever achieved by the SVM-classifier and the KNN-classifier, respectively, as well as the associated numbers of selected genes (no more than 100 genes were used).
Note that in the 5-fold cross validation, the classification accuracy is calculated as the average of 500 classification accuracies on 500 testing datasets. We have also collected their standard deviations [see Additional file 2]. For three numbers 30, 60, and 100, the standard deviations are included in Tables 1, 2, 3, 4, 5, 6, 7, 8. Essentially, all these four gene selection methods, GS1, GS2, Cho's, and F-test, have very close standard deviations, and these standard deviations seem to be independent of classifier and dataset. Consequently, looking at all the classification accuracies as shown in Figures 1, 2 and Tables 1, 2, 3, 4, 5, 6, 7, 8, one general conclusion is that our gene selection methods, GS1 and GS2, perform at least comparably well to Ftest and Cho's, on all 8 datasets using both the SVM-classifier and the KNN-classifier. Typically, our methods outperform significantly the other two methods on datasets SRBCT, GLIOMA, LUNG, and CAR, which have unbalanced class sizes.

Stability of the gene selection methods
Given a training dataset (in this case, we take the whole dataset as the training dataset), to test the stability of a gene selection method we duplicated all the samples in one class to produce a duplicated dataset. Our gene selection methods GS1 and GS2 are shown to be stable theoretically (cf. Methods section) and therefore are not subjects to such a test. For each of Cho's and F-test, it was applied on the duplicated datasets to report the same The leave-one-out cross validation classification accuracies of the SVM-classifier on four gene selection methods, GS1, GS2, Cho's, and F-test, on the SRBCT dataset Figure 1 The leave-one-out cross validation classification accuracies of the SVM-classifier on four gene selection methods, GS1, GS2, Cho's, and F-test, on the SRBCT dataset. By GS2 By GS1 By Cho's By F-test numbers of genes as it was applied to the original training dataset, and then the percentages of the common genes were recorded. Note that the LEU dataset and the SRBCT dataset give 2 and 4 duplicated datasets, respectively. Table 9 collects these percentages.
We have also performed a similar experiment to test the stability when a small portion of the samples were removed from the training dataset. For each class in a training dataset, it was divided into three parts of equal size and every time one part was removed from the dataset to obtain a reduced dataset. Then again, the percentages of the common selected genes using the original dataset and the reduced datasets were recorded. We tried in total 1000 random divisions and the average of 3000 percentages was taken to be the stability for this class. Table 10 collects The leave-one-out cross validation classification accuracies of the SVM-classifier on four gene selection methods, GS1, GS2, Cho's, and F-test, on the CAR dataset Figure 2 The leave-one-out cross validation classification accuracies of the SVM-classifier on four gene selection methods, GS1, GS2, Cho's, and F-test, on the CAR dataset.

Conclusion
In this paper, we proposed two stable gene selection methods GS1 and GS2, which could also be regarded as model-free. From the comparisons made to one most recent method Cho's and one most classic method F-test on eight publicly available datasets, GS1 and GS2 selected genes whose resultant classifiers achieved at least equally good and most of the time better classification accuracies. Both leave-one-out and 5-fold cross validations confirmed our conclusion. We haven't run any biological experiments to verify each of the top ranked genes by our methods yet inconsistent to other methods. Nonetheless, we believe that our methods would be good potential substitutes to the ones currently in use as our methods are model-free and stable.

Methods
Assume in the training dataset there are in total p genes in the microarray chips, and assume we have in total n chips/ samples that have been grouped into L classes. Let a ij denote the expression value of gene j in sample i. The training dataset can thus be represented as a matrix We will define two gene scoring functions using entry values in matrix A n×p . These two scoring functions might be considered to better use the means and the variations of the gene expression values.  Intuitively, gene j has a strong discriminating power if the means , k = 1, 2,..., L, differ significantly and , k = 1, 2,..., L, indicating the intra-class variations, are all small.
For example, suppose we have a microarray expression matrix A 12×4 as shown, in which 12 samples have been known in 3 classes: Then the centroid matrix is and the mean of the centroids is = (1.2, 1.2, 1.2, 1.0).
The deviation matrix X 12×4 is and the intra-class average deviations are Figures 3, 4, 5, 6 illustrate the expression values of these four genes across all 12 samples, with the intra-class means and average deviations also shown. There are three key ideas in our design of gene scoring functions, which will be exemplified through these four genes. First of all, gene 1 has quite different mean expression values across three classes, compared to gene 4 that has the same means. Therefore, gene 1 is intuitively better than gene 4 in terms of discriminating power. Note that the goal of gene selection is to select genes that have significantly different means across different classes. For each gene j, the quantity is the mean of all the centroids on gene j and it represents all the samples. is stable, that is, it would not change when the samples in one class are duplicated (since the number of classes, L, and all the means, , for k = 1, 2,..., L, do not change). We define the scatter of gene j to capture the inter-class variations, which takes in as a component: in which the square root is the standard (estimated) deviation of all the centroids on gene j. Clearly seen, scatter(j) is a stable function. More discriminatory genes are expected to have bigger scatter-values. In the following, we prove an upper bound and a lower bound for scatter(j).

holds, and it becomes equality if and
only if a 1 = a 2 = ... = a n .
Inequality S + ≤ a n -a 1 holds again if = 0, since (a i -) 2 ≤ (a n -a 1 ) 2 for every i. Therefore, we may assume that a 1 <a 2 < ... <a n . A similar enlarging process gives S ≤ max{a n -, -a 1 }. Since and we conclude that S + ≤ a n -a 1 .
According to Lemma 2, the following theorem on the bounds on scatter(j) holds.
The plot of the expression values of gene 2 across all 12 samples in the example dataset, with both intra-class means and aver-age deviations calculated Figure 4 The plot of the expression values of gene 2 across all 12 samples in the example dataset, with both intra-class means and average deviations calculated. Intuitively, d 2 (j) includes more details in its calculation than d 1 (j) does. In the above example, gene 2 and gene 3 have the same mean expression values across all three classes: 1j = 2j = 3j . Therefore, we have d 1 (2) = d 1 (3) but d 2 (2) <d 2 (3). Since intuitively gene 2 has a stronger separability than gene 3, d 2 (j) could be even more sensitive than d 1 (j).
The above two functions d 1 (·) and d 2 (·) basically consider the means of intra-class variations. The following two functions δ 1 (j) and δ 2 (j) are introduced to capture the variations of intra-class deviations, corresponding to d 1 (·) and d 2 (·), respectively:

Theorem 4
PROOF. The proof is easily done by simplifying the definition formulae for δ 1 (j) and δ 2 (j).
Similar to functions d 1 (j) and d 2 (j), for an ideal differentially expressed gene j, both δ 1 (j) and δ 2 (j) are expected to have small values. Moreover, similar to the relation between d 1 (j) and d 2 (j), δ 2 (j) is considered more sensitive than δ 1 (j). We define function compact k (j) = d k (j) + δ k (j), for k = 1, 2, to evaluate the intra-class variations for gene j. And we define the gene scoring function s k (j) = compact k (j)/scatter(j) to rank the genes according to their differentiability. Note that a smaller value of s k (j) indicates a higher differentiability.
We denote the gene selection method using compact 1 (j) = d 1 (j) + δ 1 (j) as GS1, and the other using compact 2 (j) = d 2 (j) + δ 2 (j) as GS2. Both GS1 and GS2 are model-free and stable. In each of them, the scores for all genes are calculated and genes are sorted in non-decreasing order of their scores. Since the number of genes, p, is typically much larger than the number of samples, n, the overall running time to compute this order is O(p log p). In practice, there The plot of the expression values of gene 3 across all 12 samples in the example dataset, with both intra-class means and aver-age deviations calculated Figure 5 The plot of the expression values of gene 3 across all 12 samples in the example dataset, with both intra-class means and average deviations calculated. are several ways to select the informative genes using this order. For example, one may select the top ranked x genes for further analysis, or the top ranked x% genes, or all the genes with score no larger than some constant, among others.

F-test method
F-test method [1,19] is also a single gene scoring approach. Besides the notations used in our methods, it uses to denote the variance of expression value of gene j in the k-th class: and to denote the variance in the whole dataset. Gene j has a score defined to be:

Cho's method
Using the same notations used as in the above, Cho's method [17] defines a weight factor w i for sample i, which is if sample i belongs to class k. Let . The weighted mean(j) for gene j is defined as The weighted standard deviation is defined as  The plot of the expression values of gene 4 across all 12 samples in the example dataset, with both intra-class means and aver-age deviations calculated Figure 6 The plot of the expression values of gene 4 across all 12 samples in the example dataset, with both intra-class means and average deviations calculated.