Discovery of dominant and dormant genes from expression data using a novel generalization of SNR for multi-class problems

Background The Signal-to-Noise-Ratio (SNR) is often used for identification of biomarkers for two-class problems and no formal and useful generalization of SNR is available for multiclass problems. We propose innovative generalizations of SNR for multiclass cancer discrimination through introduction of two indices, Gene Dominant Index and Gene Dormant Index (GDIs). These two indices lead to the concepts of dominant and dormant genes with biological significance. We use these indices to develop methodologies for discovery of dominant and dormant biomarkers with interesting biological significance. The dominancy and dormancy of the identified biomarkers and their excellent discriminating power are also demonstrated pictorially using the scatterplot of individual gene and 2-D Sammon's projection of the selected set of genes. Using information from the literature we have shown that the GDI based method can identify dominant and dormant genes that play significant roles in cancer biology. These biomarkers are also used to design diagnostic prediction systems. Results and discussion To evaluate the effectiveness of the GDIs, we have used four multiclass cancer data sets (Small Round Blue Cell Tumors, Leukemia, Central Nervous System Tumors, and Lung Cancer). For each data set we demonstrate that the new indices can find biologically meaningful genes that can act as biomarkers. We then use six machine learning tools, Nearest Neighbor Classifier (NNC), Nearest Mean Classifier (NMC), Support Vector Machine (SVM) classifier with linear kernel, and SVM classifier with Gaussian kernel, where both SVMs are used in conjunction with one-vs-all (OVA) and one-vs-one (OVO) strategies. We found GDIs to be very effective in identifying biomarkers with strong class specific signatures. With all six tools and for all data sets we could achieve better or comparable prediction accuracies usually with fewer marker genes than results reported in the literature using the same computational protocols. The dominant genes are usually easy to find while good dormant genes may not always be available as dormant genes require stronger constraints to be satisfied; but when they are available, they can be used for authentication of diagnosis. Conclusion Since GDI based schemes can find a small set of dominant/dormant biomarkers that is adequate to design diagnostic prediction systems, it opens up the possibility of using real-time qPCR assays or antibody based methods such as ELISA for an easy and low cost diagnosis of diseases. The dominant and dormant genes found by GDIs can be used in different ways to design more reliable diagnostic prediction systems.


Background
Many studies have investigated the mechanism of carcinogenesis by analyzing the gene expression profiles from microarray data. Accurate diagnosis of different categories of cancers or identification of subgroups with homogeneous molecular signature is important for proper treatment and prognosis. The application of gene expression data for these tasks is challenging because of its high dimensional nature and the noisy characteristics. Since the number of genes contained in each chip is far exceeding the number of available samples, the standard statistical methods for classification often do not work well. Therefore, identification of informative genes related to a set of diseases is an important subject in the field of biomedical informatics at least for two reasons: understanding the roles played by the genes in cancer biology and development of tools for efficient and accurate diagnostic prediction.
Many novel classification, clustering and prediction methodologies have been suggested to analyze gene expression data [1][2][3][4]. Here we need to deal with two problems: identification of marker genes (this is a problem of dimensionality reduction) and use of the marker genes for designing a diagnostic prediction system. For the second problem many machine learning tools, such as Neural Networks, Decision Trees, Nearest Neighbor Classifier, Naive Bayes classifier, Support Vector Machines have been used [5][6][7][8].
For the problem of gene selection also many methods have been proposed [2,4,[7][8][9][10]. Gene selection methods can further be grouped into two categories: linear methods and non-linear methods.
The linear methods are very intuitive which exploit the linear relation between expression levels and the status of the disease. In other words, for a two-class problem, say Acute Lymphoblastic Leukemia (ALL) and Acute Myelogenous Leukemia (AML), the high expression level may correspond to ALL while a low expression level may correspond to AML or vice versa. Two such indices are Signalto-Noise ratio (SNR) [2] and correlation [7]. The SNR for a gene g is defined as SNR(g) = (μ 1 (g) -μ 2 (g))/(σ 1 (g) + σ 2 (g)), where μ i (g) and σ i (g) are the mean and standard deviation of expression levels of a gene g for samples in class i (i = 1, 2), respectively. The authors in [7] adopted several formulae (Euclidean distance, Pearson correlation, SNR, etc.) for measuring the similarity between the expression levels of a gene g and an ideal gene g ideal in a 2class problem, where an ideal gene pattern was defined by g ideal = (g ideal,1 , ʜ, g ideal,G ), g ideal,j = 1, if the jth sample is from class 1, otherwise g ideal,j = 0; ∀ j = 1, ʜ, G. The ideal values can also be taken as 0 for class 1 and 1 for the class 2. Let x g be the vector consisting of the expression values for a gene g for all samples. Now the Pearson correlation or cosine similarity between the two vectors g ideal and x g can be used to rank the genes. Although very intuitive, these methods are neither easy to generalize to multiclass, nor such methods can take into account non-linear interaction between genes. The BW ratio [4] is a linear index that can be used for multiclass problems, but it is less intuitive and it is not easy to visualize its behavior.
Note that, there have been attempts to adapt two-class methods such as correlation for multiclass problems using the one-vs-all strategy [11]. In [11], first a set of genes is selected based on ANOVA. Then using this shortlisted genes, a set of important genes is identified for each class by casting the problem into a two class problem. We call these method as ANOVA+Correlation method. For example, in a k-class problem, to get a set of important genes for class c, samples from class c are considered from class 1 and all samples from the remaining classes pooled together are treated as class 2. Then the correlation, as explained, in the previous paragraph is computed. Such a method will select strong marker genes, but may also select poor ones because the pooled class will have a much stronger and undesirable effect on the correlation than the class under consideration. Similarly, using the OVA strategy the SNR can also be used to select genes for a multiclass problem [12]. We shall call this method as OVA.SNR. In the OVA.SNR approach, for a k-class problem, to select useful genes, say, for class 1, the data set is divided into two groups, data from class 1 and data from the the remaining 2 to k classes. Although such methods may find useful genes, in this case, the mean and standard deviation of the pooled group may not (usually will not) represent any useful information about the remaining classes. For example, in a 3-class problem, suppose for a gene, samples from each of the three classes are normally distributed (this is an assumption made while using ANOVA type tests). For simplicity, suppose we have n samples from each of the three classes and the mean and standard deviation computed from these samples for the three classes are μ i , σ i ; i = 1, ʜ, 3, respectively. In the OVA.SNR approach, the mean of the second group, does not represent the central tendency of the pooled group and hence it does not represent any useful information about the structure of the remaining two classes. Moreover, when samples from class 2 and class 3 are normally distributed with two different means, the pooled samples will not be normally distributed. Hence, OVA schemes, which use mean of the pooled class, for gene selection is not conceptually appealing, although such approaches may find useful discriminatory genes. On the other hand, the non-linear methods can take into account non-linear interaction between genes. There are several such methods, for example, online feature selection using neural network [10], SVM-based recursive feature elimination (SVM-RFE) [9], and the maximum margin criterion-based recursive feature elimination (MMC-RFE) [8]. In [10], the authors have considered the non-linear interaction between genes as well as that between genes and the tool used for gene selection. Although in [10] they have successfully discovered a small set of biomarkers for accurate prediction of cancer subgroups, the behavior of non-linearly interacting genes is less interpretable than the linearly interacting genes for making simple decision rules. The SVM-RFE is a quite popular method of feature selection in an iterative manner. This method makes use of repeated training of a SVM classifier with a progressively reduced set of features. In every iteration, some of the less important features are removed. For a two-class problem, the SVM classifier finds the weight vector, w ∈ R p , p is the number of genes, associated with the hyperplane that maximizes the margin of separation. The SVM-REF algorithm, trains SVM with all available genes first and finds the optimal weight vector w ∈ R p . Then it computes a Ranking Criterion, RC, for each gene. A possible choice of RC is (w i ) 2 . Then either a single gene (or a set of genes) with the smallest values of RC is removed and the process is then repeated with the reduced set of genes.
Here we aim to develop a gene selection method which is intuitive, can find useful marker genes and can be viewed as a true generalization of SNR. The GDI is akin to the SNR, which is widely used in two-class gene selection problems [2], but GDI can be applied to multicategory problems, and identifies dominant and dormant genes. We define two indices named, Gene Dominant Index (GDI -Dom ) and Gene Dormant Index (GDI Dor ). The GDI Dom leads to the novel concept of Dominant Genes while the other index leads to the concept of Dormant Genes. A dominant gene is over-expressed in only one of the classes and under-expressed in the remaining classes, and thus has a very strong class specific signature. A dormant gene, on the other hand, is under-expressed in only one of the classes but over-expressed in the remaining classes, and thus also has a strong class specific signature. Clearly, dominant or dormant genes are good biomarkers, if they exist, and they are likely to play key roles in identifying sub-types/classes of disease. In order to reduce the effect of the finite sample size, we randomly select a part of the data to find a list of dominant and dormant genes. This process of random partition of data and computation of GDIs are repeated 100 times. To compare the performance of our methods, we shall use six classifiers for diagnostic prediction: NMC, NNC, SVM with linear kernel, and SVM with Gaussian kernel. Each of the two SVMs is realized using both the OVA and OVO strategies and this makes the total number of classifiers to six. Our method is tested on four multi-class cancer data sets. We shall see later that our proposed methods can find a small set of discriminating biomarkers with excellent prediction accuracy.

Results and discussion
Four multicategory microarray gene expression data sets, namely, SRBCT (Small Round Blue Cell Tumors) [13], Leukemia [14], CNS (Central Nervous System Tumors) [15], and Lung Cancer [16] are used in this study for detailed analysis. We divide our discussion into three subsections, the biological relevance of some of the dominant/dormant genes, visual assessment of the dominant/ dormant marker genes, and comparison of classifier performance. The results obtained using SRBCT, Leukemia, and CNS are compared with those in [8]. The Lung Cancer data set (not used in [8]) is further used to show the effectiveness of our method. Details of the data sets can be found in Materials and Methods. We have followed the same experimental protocols as in [8] to make a proper comparison. Additionally, we have implemented the multiclass version correlation based method (ANOVA+Correlation) and SNR (OVA.SNR) for comparison of performance.
AF1Q, (d) FGFR4. The gene FCGRT (Fc fragment of IgG, receptor) has an EWS (Ewing sarcomas) specific signature because it is moderate to highly upregulated for the EWS group and is downregulated for the other three groups. This gene is known to play significant roles in other types of cancers too. For example, in [17] authors suggested a set of 26 prognostic genes that can provide predictive information on the survival of patients suffering from lung cancer. They found that a higher expression level of FCGRT relates to a better survival outcome. The WAS gene belongs to the set of Human Cancer Genes [18]. It has a very strong BL (Burkitt lymphomas) class specific signature, and it is also found important by others in the context of the SRBCT data set [19]. The relationship of WAS to Burkitt Lymphomas is also reported in [20]. The deficiency of WAS gene causes the Wiskott-Aldrich syndrome, which is an X-linked hereditary disease associating primary immunodeficiency, thrombocytopenia, an increased risk of autoimmune diseases and malignancies, particularly non-Hodgkin's lymphoma (NHL) [21][22][23][24]. In patients with Wiskott-Aldrich syndrome, a higher rate of malignancy has been observed, particularly in Epstein-Barr Virus-related brain tumor, leukemia and lymphoma http://www.stjude.org. Amongst the different kinds of tumors, the most frequently associated one with Wiskott-Aldrich syndrome is the NHL tumor (it is about 76%).
The other kinds of tumors associated with WAS include, Hodgkin's disease, glioma, and testicular carcinoma [21,24]. Although NHL is the most common type of malignancy found in WAS and BL represents 40% to 50% of all NHL cases in childhood, BL has hardly been reported in WAS. But a case of BL with WAS is reported in [20]. In [24], authors reported Malignant B Cell Non-Hodgkin's Lymphoma of the Larynx with Wiskott-Aldrich syndrome. All these clearly establishes the important role of WAS not only in BL, but in other types of malignancies too.
The ALL1-fused gene from chromosome 1q (AF1Q) is one of the dominant genes found by our method for the neuroblastoma (NB) group. Many authors have reported this gene to play important roles in cancer [25,26]. As revealed Using the training data, for each class the top five most frequently occurring genes are selected. Using the training data, for each class the top five most frequently occurring genes are selected. Using the training data, for each class the top five most frequently occurring genes are selected.
by Fig. 1(c), AF1Q is moderate to highly express for the neuroblastoma cases, while it exhibits low expression values for the other three groups of the SRBCT.
As discussed in [10,27], FGFR4 carries out the signal transduction to the intracellular environment in cellular prolif-eration, differentiation and migration. Overexpression of FGFR4 is found in various cancers, such as of pituitary, prostate, thyroid [28][29][30], but in normal tissues, FGFR4 expression is hardly noticeable. In our study with the SRBCT, we noticed a very strong RMS (rhabdomyosarcomas) specific signature, very high expression levels of FGFR4 for the RMS group, but for the other groups it is practically unexpressed. However, in lung adenoarcinoma, FGFR4 is found to be downregulated [31].
The second and third most dominant genes for the EWS class are Follicular lymphoma variant translocation 1 (FVT1) and CAV1 (caveolin 1, caveolae protein, 22 kD). According to [32] FVT1 is found to be weakly expressed in normal hematopoietic tissues, but is shown to exhibit a very high rate of transcription in some T-cell malignancies and in phytohemagglutinin-stimulated lymphocytes.
Becuase of the proximity of FVT1 to BCL2, authors in [32] also have indicated that both genes may involve in the tumoral process. For the present data set, it exhibits a very strong EWS specific signature. Its expression is practically absent for RMS, NB and NHL groups, but it is highly expressed for the EWS group. The gene CAV1 is also a biologically informative gene. In our study we found CAV1 to be upregulated for the EWS group. According to [33], CAV1 is down-regulated in oncogene-transformed and tumor-derived cells and it is an essential structural constituent of caveolae that plays important roles in mitogenic signaling and oncogenesis. Many studies have reported CAV1 as a candidate tumor suppressor gene [34][35][36]. It has been established that CAV1 has tumor suppressor activity in human cancers, including breast cancer [33,37], ovarian cancer [38], and lung cancer [39]. But in [40], they showed that CAV1 is over-expressed in human gastric cancer cell line GTL-16. Also, for diffuse large B-cell lymphoma [41] and prostate cancer [42], CAV1 is identified to serve as a diagnostic and prognostic marker. For the Lung Cancer data set in this study we have found CAV1 as a good dominant gene for the normal tissue group and this is in conformity with the fact that CAV1 also plays the role of a tumor suppressor. This is also consistent with down-regulation of CAV1 in human lung carcinoma [39]. Thus, CAV1 plays an important role in cancer biology.
According to Table 2, we shall now discuss the importance of some dominant and dormant genes for the Leukemia data set. The MME (membrane metallo-endopeptidase), also known as CD10, is the most important dominant gene for the ALL group as found by GDI Dom . MME is found to play different roles in different types of cancers. In [43], authors suggested that the functions of MME vary with tissue types and disease states. For example, in hepatocellular and thyroid carcinoma MME exhibits higher expression levels [44,45], while in poorly differentiated tumors in the colon and stomach MME shows low expression levels [46]. According to [47], MME is downregulated in the ALL samples with MLL (Mixed-Lineage Leukemia) rearrangements compared to ALL without MLL rearrangements. In our study we have found MME to be highly expressed in ALL while for the MLL and AML groups it is moderately expressed.
The top two dominant genes for the MLL group found by GDI are MBNL1 and MEIS1. In our study, we have found that expression levels of both MBNL1 and MEIS1 are higher for the MLL group than the other two groups. In [48], authors have found upregulation of these two genes in the ALL and AML groups with MLL chimeric fusion genes. It is interesting to know that by just using three genes MME and MBNL1 and MEIS1, one can do a good job of discrimination between the three types of leukemia (results not shown); of course, three dominant genes, one from each class can do an excellent job of classification too. In our investigation with the CNS data set, as shown in Table 3, the transcriptional repressor, insulinoma-associated 1 (INSM1) is found to be one of the dominant genes for the MD (medulloblastomas) group. Different investigations have found this gene to play roles in tumors of neuroendocrine origin. In [59], they reported INSM1 as one of the important genes in discriminating pancreatic adenocarcinomas and islet cell tumors from normal pancreatic tissues. The gene INSM1 is also found to be overexpressed in small-cell lung cancer (SCLC), SCLC cell lines as well as in medullary thyroid carcinoma, insulinoma, and pituitary tumors [16,60,61].
As shown in Table 4

Visual assessment of the dominant/dormant marker genes
In the next section we shall demonstrate the utility of the identified genes through performance comparison with different classifiers. But classifier performance is an indi-  Fig. 1 display the four most dominant (one for each class) genes for the SRBCT data set. As expected, the dominant gene for a class appears with high expression values in the samples from that class, but with low expression values in the samples of the other classes/subgroups. As an example, for the SRBCT data set the most dominant gene, FCGRT (Image: 770394), for the Ewing Sarcoma is highly expressed for the EWS group while, it is practically unexpressed for the other three SRBCT classes ( Fig. 1(a)). Similarly, Fig. 1 shows that for the Burkitt Lymphomas (BL) the most dominant gene, WAS (Image: 236282), is over-expressed for the BL samples but under-expressed for the other classes. Figure 2 depicts that for the SRBCT the dormant genes for all four classes are not very good and that explains the poor performance of the classifiers discussed later. In Fig.  2 we find that the most dormant gene, CDK6 (Image: 295985), for the EWS is completely unexpressed for the EWS samples while it is moderately expressed for the remaining three classes. Of the remaining three classes, the average expression level for the BL group is the closest to that of EWS group. Although, from pattern recognition point of view, this gene can distinguish EWS from the other three classes, since the difference between the average expression levels for EWS and BL groups is not high, this gene may not be considered a very good dormant gene. In some cases, the identified dormant genes may not even be good from pattern recognition point of view also. As an example, consider Fig. 2 Fig. 3(a) depicts that the gene MME has a very strong ALL specific signature and Fig. 3(c) representing CHRFAM7A has a strong signature for the AML group; while the gene MBNL1 ( Fig. 3(b)) although has an MLL specific signature, it is not as strong as that of the other two genes. Fig. 4 depicts the scatterplots of the most dormant genes for Leukemia data set. Here we find that for majority of the samples in the ALL group, the most dormant gene, LGALS1, takes low expression values compared to the samples from the other two groups. In this case the separation between the average expression values between the ALL and AML groups is quite high making it a good dormant gene. This is also revealed by the GDI values of 1.66. Similarly, for the AML class, the most dormant gene, MEF2A, is downregulated for the AML group, while it is upregulated for the remaining groups (the average GDI value is 1.89). Thus, this gene can also be considered a good dormant gene.   Fig. 9(a). In Fig. 9(a), for the training data different classes are represented by different shapes with different colors. For the independent test data, we use the same shapes but filled in with colors. For example, if the training data from a class is represented by red empty square, then the test data from the same class will be represented by filled in red square. Figure 9(a) reveals that samples from different classes form nice clusters both for the training and independent data sets, although the gene selection is done exclusively based on the training set. case. In the next section, we will demonstrate that even with the dormant genes all six classifiers perform quite well. This might mean that if we would use a higher dimensional Sammon's plot we might obtain a better separability between classes.

Comparison of classifier performance
We conduct our experiments to examine the results using six distinct classifiers (three of them are used in [8] On the right side of Figs. 13, 14, 15, for an easy reference, we also include the relevant summary of the prediction results in [8] using different gene selection methods. Here we display the prediction result in bold if it is better than the best classification error reported in [8] and uses less (or equal) number of genes than that in [8]. For the SRBCT data set, with only three dominant genes from each class, the performance of all six classifiers are better than the best performance reported by Niijima et al. [8] using 20 genes by their eight classifiers (as shown in Fig.13). This may be taken as an indicator of strong dominancy of the selected genes. On the other hand, the performance of the dormant genes are not very good signifying absence of good dormant genes which is also confirmed by Fig. 9(b). Although, the performance of the Sammon's plots for the SRBCT data set using the training and independent data together Figure 9 Sammon's plots for the SRBCT data set using the training and independent data together. For the training data different classes are represented by different shapes with different colors. For the independent test data, the same shapes are used but filled in with colors; e.g., the training data from the EWS class is represented by black empty square and the test data from the same class, EWS, is represented by filled in black square. selected genes. Note that, in [8] using 20 genes the best classification error achieved on the test data is 5.8%; while in our case even with just 12 dominant genes all six classifiers can produce very comparable test accuracies with that of the best results in [8] using 20 genes. The performance of our four SVM classifiers using just 3 genes (one from each class) is better than that of both SVM classifiers in [8] using 20 genes. This clearly indicates the quality of the dominant marker genes identified by GDI Dom . Unlike the SRBCT, for this data set, the dormant genes also have good discriminating power. In fact, with 15 dormant genes the classification error rates for our two non-SVM classifiers are comparable to the best classier performance in [8] using 20 genes; while the performance of our four SVM classifiers is significantly better than that of the remaining four classifiers in [8]. In these figures "Combination" refers to using both sets of dominant and dormant genes together to design the classifier. In Fig. 14 Sammon's plots for the Leukemia data set using the training and independent data together Figure 10 Sammon's plots for the Leukemia data set using the training and independent data together. For the training data different classes are represented by different shapes with different colors. For the independent test data, the same shapes are used but filled in with colors; e.g., the training data from the ALL class is represented by black empty square and the test data from the same class, ALL, is represented by filled in black square. Since there is an independent data set for each of the SRBCT and Leukemia, we have used the selected dominant/dormant genes in Table 1 and Table 2 to examine the prediction performance on those independent data sets. For these two data sets, all samples in the training data are used to train different classifiers using the selected genes with m = 1 to 5 folds. Then the trained classifiers are used to evaluate their performance on the independent test data set. Here we have normalized the expression value of each gene to [0,1] across samples considering both training and independent data sets. Note that, for the SVM classifier we need to choose some hyper-parameters. As done for other experiments, the training data set is randomly divided into training and validation sets of equal size. Then the validation set is used to choose the hyper-parameters. The classifier thus designed is tested on the independent data set. Like other experiments, here too the training-validation partition is repeated 100 times and the average number of misclassification and its standard deviation on the independent test data are reported in Figs. 13 and 14. From Fig. 13 we find that even just with 4 dominant genes the performance of all classifiers on the independent test set is quite good. The effect of the use of combined gene is very prominent for the SRBCT data set. For all folds 1 to 5, the performance of all classifiers on the independent test set is excellent. For the Leukemia data set also with just 3 dominant genes, the six classifiers make 2-3 mistakes and with just six genes all six classifiers result in around zero misclassification on the independent test data (Fig. 14). The classification performance of the dormant genes on the independent data is very good too. In this case, the performance of all six classifiers with 3 dormant genes is better than the performance of the classifiers with 3 dominant genes. For this data set, the performance of all six classifiers using dominant and dormant genes together on the test data is excellent too.
In Fig. 16, we examine the prediction performance for the Lung Cancer data set (not used in [8]) using the same six classifiers with different number of dominant or dormant genes selected by the proposed method. For this data set we compare our results with those in [5]. In [5] three non-SVM classifiers and five SVM classifiers are used. Figure 16 reveals that for three non-SVM classifiers (KNN, NN and PNN) using all 12600 genes, the prediction errors reported in [5] vary between 10.36% ~14.34%, while using just 5 dominant genes, with one gene per class, the performance of our six classifiers are quite good and are comparable or better than that of the three non-SVM classifiers. With just 20 dominant genes (four genes per class), the test error rates of our six classifiers vary between 5.8% and 7.8% while the best accuracy reported in [5] is 3.35% but the method in [5] use all 12600 genes. Here although our best result is about 2~3% lower than that of the best Sammon's plots for the CNS data set  result in [5] (see, far right side of Fig. 16), the evaluation criteria and computational protocols are not the same. For example, we have used only 5-25 (less that 0.20% of the 12600) genes, while in [5]all 12600 genes are used; we have generated statistics about test accuracies using 100 sets (generated by resampling), while in [5] a 10 fold cross-validation is used; for the SVM classifier we have used the most simple linear kernel and the nonlinear Gaussian kernel for comparison, while in [5] authors have used nonlinear polynomial kernel and several other sophisticated classifiers such as back-propagation neural networks, and probabilistic neural networks.
In order to look at the statistical significance of the average GDI values of the dominant and dormant genes identified based on 100 data splitting experiments, here we further perform the permutation test 500 times (Details about the procedure can be found in the Materials and Methods section). These results are summarized in Tables 1, 2, 3, 4. From these tables we find that each of the selected dominant/dormant genes in every data set has a highly reliable p-and q-values. Especially, for those selected dominant/ dormant genes, which appeared with very high frequencies, the p-and q-values are practically zero (0). Hence, from a statistical viewpoint, our method can recognize genes with trustworthy class-specific characteristics. Such genes can be used to design more reliable diagnostic systems.
In addition, we have checked the literature for other similar methods for identifying marker genes associated with one class in a multiclass environment. In this context, Pavlidis and Noble [11] use ANOVA and Correlation together. We call this scheme as ANOVA+Correlation scheme. In [12] SNR is used for preliminary screening of genes which is followed by the use of a SVM based technique. Both these schemes for multiclass analysis use a one-versus-all (OVA) approach. We have implemented the ANOVA+Correlation scheme and also used SNR with OVA strategy to select class specific genes. The later method is referred to as "OVA.SNR". As revealed by Tables  5 and 6, all three gene selection methods (GDI.Dominant, OVA.SNR and ANOVA+Correlation) produce comparable results.
In this context it is worth emphasizing that many genes may have discriminating power and hence can be considered marker genes but the dominant and dormant genes are special types of markers and all marker genes are not necessarily dominant/dormant genes. GDI is designed to identify dominant/dormant genes, if present. Moreover, any method of gene selection should be theoretically/con-Sammon's plots for the Lung Cancer data set  ceptually appealing. Use of the OVA strategy may select useful genes for classification but it is not conceptually/ theoretically appealing and may lead to potential problems. We have already explained it once and we again reemphasize it here. In the OVA.SNR approach, for a kclass problem, to select marker genes, for a class, say class c, we divide the data set into two groups, data from class c and the pooled data from the remaining k -1 classes. Clearly, the mean and standard deviation of the pooled group do not represent any useful information about the remaining classes. For example, the mean of k -1 pooled classes may fall in a region which may not even have any data points in its neighborhood. Moreover, use of statistics like t-statistic makes certain assumptions about the distribution of data in each class. Even if the assumptions are satisfied for each class, it may not (usually will not) be satisfied for the pooled class. The pooling of samples will also affect the ANOVA+Correlation method. The adverse influence of pooling samples from different classes will become more serious if there are several classes. In such a case, the pooled group will be of much higher size than any individual group and hence its influence will also be stronger.
Consequently, this may make the correlation based method fail to recognize overlapped structure between expression levels from different classes. Thus, use of such OVA scheme for gene selection is not conceptually appealing. But this must not be taken to infer that OVA.SNR or ANOVA+Correlation will not be able to select useful genes, nor our intention is to claim that GDI will not select poor genes.  Number of independent test samples is 20.
Number of training samples used to design the system to test on the independent set is 63. We now illustrate with a synthetic data set that SNR (OVA) can lead to false positive dominant/dormant genes. Figure 17 shows the expression values of a five-class data where each class has the same number of samples and roughly the same standard deviation. It is clear that this gene is not a dominant gene. The GDI value for the black class is 0.61 while SNR (OVA) for the same class is 1.54. Note that, since we are using one-versus-all philosophy, a SNR value of 1.54 is expected to be much more significant than a GDI value of 0.61. The most significant difference between SNR (OVA) and GDI methods is that the value of SNR (OVA) is influenced by samples from all other group while GDI uses a comparison of only two selected groups with the highest mean values.
Depending on the data sets the behavior of these three methods (GDI.Dominant, OVA.SNR and ANOVA+Correlation) may be similar in terms of classifier performance, but dominant/dormant genes identified may be different. An important distinctive feature of GDI over OVA is that it can effectively reduce the false positive cases. Even though, the set of top significant dominant/dormant genes from each class identified by these three methods may be similar, the importance (priority) of those genes as dominant/dormant genes may be different.
"Combination" refers to using both sets of dominant and dormant genes together for the classifier. Hence, the number of selected genes in the "Combination" case for each fold is {6, 12, 18, 24, 30}.
Number of independent test samples is 15.
Number of training samples used to design the system to test on the independent set is 57.   "Combination" refers to using both sets of dominant and dormant genes together for the classifier. Hence, the number of selected genes in the "Combination" case for each fold is {10, 20, 30, 40, 50}.
Statnikov et al. [5] had reported Accuracy (%) for the Lung Cancer data set by using all genes as input features fed for several classifiers. Here we used the 1-Accuracy (%) as the error rate listed above for comparison. ever, strong dormant genes may not always be available, but if they exist, they are also quite useful for diagnosis. Based on the GDI values we have proposed a mechanism for selecting a set of useful biomarkers that may play significant roles in the biology of the set of diseases (here cancers) under consideration and can be used to design useful diagnostic prediction systems. It is possible to design other methods of discovering biomarkers using the GDIs. Our experimental results suggest that the proposed method can identify good biomarkers.
In order to establish the utility of the dominant and dormant genes we have considered four multi-category cancer data sets. First, we have analyzed the roles of some of the dominant and dormant genes in cancer biology. Then we have used visual assessment techniques to assess the level of dominancy/dormancy of the genes. For these we have used scatterplot of individual gene to assess each gene separately and also have used Sammon's projection to get an idea about the overall quality (discriminating power) of a set of genes selected by our GDIs. These plots have clearly revealed the class specific signatures of the genes selected by the GDIs.
To further demonstrate the utility of the identified genes, we have used six classifiers. Our experiments show that a few dominant genes can yield very good prediction accuracies. We have compared our results with published results and found that the dominant genes identified by our method can result in a comparable performance usually with fewer genes than other methods. But as explained earlier, it may be difficult to find strong dor-
Leukemia data set [14] This Affymetrix high-density oligonucleotide array data set contains 57 samples from 3 classes of leukemia: 20 acute lymphoblastic leukemia (ALL), 17 mixed-lineage leukemia (MLL), 20 acute myelogenous leukemia (AML), each with 12582 genes. In addition, an independent data set with 4 ALL, 3 MLL, and 8 AML samples are further used in the validation process. Both data sets are available at http://www.broad.mit.edu/cgi-bin/cancer/datasets.cgi.
CNS data set [15] This is also an Affymetrix high-density oligonucleotide microarray data set containing 42 samples from 5 different tumors of the

Preprocessing
For the Leukemia and CNS data sets, in the preprocessing step the gene expression values smaller than 100 are raised to 100; while expression values greater than 16000 are set to 16000, and then the expression values are subjected to a base 10 logarithmic transformation. After that, the distribution of gene expression values in each sample is adjusted to zero mean and unit variance. For the SRBCT data set, we do not make any change to the gene expression values as that had already been preprocessed in the original data source [13]. For these three data sets, we adopt the same data preprocessing protocols as in [8]. For the Lung Cancer data set (not used in [8]), we use the same preprocessed data as used in [16] without doing any additional preprocessing.

Experiment design
In order to confirm the efficacy of our proposed new gene selection method and to make proper comparisons, we followed the same experimental protocols as used in [8].
First, for gene selection, in addition to our proposed method, we have used two gene selection strategies mentioned in [8]: maximum margin criterion-based recursive feature elimination (MMC-RFE) and support vector machine-based recursive feature elimination (SVM-RFE).
Here we have not implemented the MMC-RFE and SVM-RFE algorithms, but simply extracted the results from [8].
Second, for evaluating the performance of each gene selection method, we have used the repeated random splitting methodology utilized in [8] in which the samples (not including the independent test data that are available for the SRBCT and Leukemia) are partitioned randomly into a training set and a test set such that the training and test sets maintain the same proportions of samples from different classes as in the whole data set. The training set consists of two-thirds of the entire sample set, and test set consists of the remaining one-third of the samples. This random training-test splitting is repeated 100 times. For each such random training-test splitting (called outer level splitting), we again randomly split the training set 100 times to produce a smaller training set. In this inner-level splitting, we use three-fourth of the training data for finding dominant and dormant genes, which are then used to evaluate the performance of classifiers on the outer level test data. This classifier performance evaluation process is explained using the following step-algorithm, Classifier Performance Evaluation.
1.1 Partition the data set X into XTR (training set) and XTS (test set), such that XTR = X, XTS = X -XTR, r <s; for example, here we use r = 2, s = 3, XTR = X.    [8]): the Nearest Mean Classifier, the Nearest Neighbor Classifier, and four kinds of the Support Vector Machine Classifiers. The adopted SVM classifiers include the one-versus-one SVM with linear kernel (OVO.SVM-L), the one-versus-one SVM with Gaussian kernel (also called SVM with Radial Basis Function, OVO.SVM-R), the one-versus-all SVM with linear kernel (OVA.SVM-L), and the one-versus-all SVM with Gaussian kernel (OVA.SVM-R). Note that, only the SVM.OVA-L was used in [8]. We have implemented the NMC and NNC classifiers; while for application of SVM to multi-class problems, we have used the e1071 library of R http://www.r-project.org which is based on the LIBSVM http://www.csie.ntu.edu.tw/~cjlin/libsvm/. For SVMs, the training data are further randomly split into two equal parts (training and validation) for determining the optimal hyper-parameters for the SVM classifiers. The optimal hyper-parameters are then used to design SVM classifiers with the training data and their performance is evaluated on the test data. Here for C (the constant for regularization), we use four choices {1, 10, 100, 1000} and for the spread of Gaussian kernel γ, we consider eight choices {0.0001, 0.001, 0.01, 0.1, 1, 10, 100, 1000}.

Gene dominant and dormant indices (GDI)
As we mentioned in Background, our main contribution is to develop a gene evaluation index, called "Gene Dominant/Dormant Index (GDI)", to select significant genes for multicategory classification problems. This GDI concept is similar in spirit to the Signal-to-Noise ratio (SNR), broadly adopted for gene selection in two-class problems [2], but the GDI can be applied to multicategory problems. Moreover, GDI further helps to identify dominant and dormant genes as defined next.

Dominant Gene
A gene that is over-expressed in only one of the classes and under-expressed in the remaining classes. Thus a dominant gene is defined with respect to a set of diseases/ classes and it has a very strong class specific signature.

Dormant Gene
A gene that is under-expressed in only one of the classes but over-expressed in the remaining classes. Thus a dor-mant gene is also defined with respect to a set of diseases/ classes and it also has a strong class specific signature.
From the above definitions, it is clear that dominant genes, if any, will be good biomarkers because such genes are expected to play active roles for the disease. It also appears that finding a dominant gene may not be a difficult task, particularly for a given set of cancers, because usually some genes will be highly expressed for a particular type of cancer. But dormant genes may not always be available in a given set of diseases as the requirements of dormant genes are harder to satisfy. It is easy to visualize that both dormant genes and dominant genes will have high discriminating power. Moreover, one can design a diagnostic system using the dominant genes and then can authenticate the decisions using information available with the dormant genes. These can lead to more reliable diagnostic systems. In simulation results we demonstrate that we can make more accurate prediction for several multiclass problems based on dominant or dormant genes selected by the GDI criterion (compared to two existing gene selection methods for multiple classes, such as SVM-RFE [8] and MMC-RFE [8]). For an easy understanding, Fig. 18 depicts the steps involved in the computation of GDI, which are explained next.

Normalization
The expression values of each gene are normalized in the range from 0 to 1 across samples. This step preserves the richness in the original expression values for each gene among the samples and helps us to easily visualize the distribution of expression values for the dominant or dormant genes.

Computation of mean and standard deviation
For each gene, the mean and standard deviation of the gene expression values in each class are calculated. Let the mean and standard deviation for gene i in class j be μ ij , σ ij .

Sorting of the mean values
For notational simplicity, to explain the computation of the GDI for gene i, we ignore the index i. We sort μ j ; j = 1, ..., k in descending order. Let the sorted mean values be μ j(s) ; j = 1, ..., k. Suppose μ 1(s) is the mean for class m. This means that the gene under consideration is most highly expressed in class m. Similarly, if μ 2(s) corresponds to class r, then if we exclude class m, then amongst the remaining classes this gene has the highest expression level on average in class r. Thus, if the gene under consideration has a distinct class specific signature, then μ 1(s) and μ 2(s) must be well separated and if that is not so, then this gene cannot be a dominant gene. Note that, to make this conclusion, we do not need to look at the mean values corresponding to other classes. We can do so because we have sorted the class means in descending order.

Computation of GDI for dominant genes
Now we define the GDI Dom for the gene under consideration as: As discussed above, the index at Equation 1 can be computed for each gene and then the GDI Dom values can be sorted in descending order. A higher value of GDI Dom indicates that the gene for the m-th class is significantly overexpressed compared to the r-th class and obviously it is more strongly over-expressed compared to the remaining classes. Thus, it is a dominant gene for class m or 1(s). Dominant genes, if exist, will appear at the top of the sorted list. A set of genes can then be selected from this sorted list for further processing. Note that, for a two class problem, although we do not use the absolute value in the numerator, because of the sorting, Equation 1 is exactly the same as that of Golub's SNR index [2]. In other words, the GDI Dom can be viewed as true generalization of Golub's SNR for a multiclass problem.

Computation of GDI for dormant genes
However, the GDI Dom in Equation 1 will not be able to find the dormant genes, if any. In order to find the dormant genes we can proceed as follows. If the gene under consideration is a dormant one, then it will be unexpressed for one class but at least moderately expressed for all of the remaining classes. In this case, (μ k-1(s) -μ k(s) ) should be considerably high, where μ k(s) is the last value in the sorted sequence; in other words, it is the mean expression level for the class in which the gene under consideration is least expressed. Thus, we define the GDI Dor for identifying the dormant gene as Note that, Equation 1 uses the class mean values and standard deviations of the top two classes in the sorted list while Equation 2 uses the class means and standard deviations corresponding to the last two values in the sorted list. Consequently, if GDI Dor is significantly high for a gene, then this gene is a dormant gene for the class represented by k(s).
It is easy to see that for a two class problem, GDI Dor reduces to the SNR of [2]. Thus both GDI Dom and GDI Dor can be viewed as generalizations of SNR. We can combine Equations 1 and 2 and write in a convenient manner as in Equation 3.
In Equation 3 when x = Dom, p and q correspond to the top two classes, respectively, in the sorted list and when x = Dor, then p and q correspond to the last two classes in the sorted list, respectively.
We want to emphasize that a dominant gene is dominant for a class with respect to the given set of classes/groups under consideration. For example, given the SRBCT group, a gene may be dominant for the Neuroblastoma class implying that this gene is highly expressed for the Neuroblastoma cases but unexpressed for the other three types of childhood cancers. Now if we augment the set of four childhood cancers by one more type, then this particular gene may not remain dominant with respect to the group of five childhood cancers. Similar is the case with dormant genes.

Finding a list of dominant/dormant genes for each class
After calculating the GDI Dom values of all genes, a list of dominant genes for each class can be obtained as follows. For each gene, the GDI Dom is associated to the class represented by 1(s); in other words, it is associated to the class corresponding to the top element in the sorted list. In this way, every gene is associated with a class and a value of dominancy as expressed by GDI Dom . We can now sort the genes associated with a particular class according to the GDI Dom values. In this way we get a sorted list for each class. We can now select useful genes for a class from the top of the list. Clearly, when selecting the dominant genes, the higher the GDI Dom , the more dominant the gene is. A similar procedure can be applied for the generation of a list of dormant genes for each class using the GDI Dor values.

Gene selection strategy
If we use several dominant (or dormant or both kinds of) genes from each class ranked according to GDI Dom values to design diagnostic systems, we are expected to get sufficient discriminating power for all classes in multi-class discrimination problems. But since in each resampling experiment we may get a different set of dominant (dormant) genes for a class, it would be better to aggregate the output of several resampling experiments. Different strategies are possible for this. Next we propose one such strategy:

Frequency-based method
The gene selection scheme is displayed in Algorithm Gene Selection. It proceeds as follows. In each of the 100 trials, we select the top m (= 5) dominant (dormant) genes for each class to compute the frequency with which each such gene appears as a candidate gene for a class. A good dominant (dormant) gene is likely to appear more frequently. In order to find the set of interesting (marker) genes for each class we select the top five most frequently occurring genes. However, some class may have more than five genes with strong class specific signatures. If that happens, we should include those genes also if our goal is to find the set of interesting (marker) genes, not just designing of a classifier. Hence, in addition to the top five genes, if there are other genes with frequency of appearance 50 or more (in 100 trials) we also consider those genes important. In this manner we find a set of genes that may be biologically interesting. But all these genes may not be necessary for designing a classifier, because for a k-class discrimination, even a set of less than k good genes may be adequate. Tables 1, 2, 3, 4 are generated by this scheme.
1.1 Partition the data set X into XTR and XTS, such that XTR = X, XTS = X -XTR, p <q; here we use p = 2, q = 3, XTR = X.
1.2 Use XTR to compute GDIs for each gene.

Permutation test to assess statistical significance of GDI indices
To assess the statistical significance of the GDI indices associated with the identified dominant and dormant genes, a permutation test has been performed. The procedure followed is summarized below. Both un-adjusted pvalues and q-values adjusted for multiple comparisons are computed. Let G be the total number of genes and S be the total number of sample points.
(1) Given an expression matrix D (x gs is the expression intensity of gene g and sample unit s; 1 ≤ g ≤ G, 1 ≤ s ≤ S) with class labels (y s , 1 ≤ s ≤ S), we compute the gene dominant index GDI Dom , m g and gene dormant index GDI Dor , r g , for each gene g.
(2) Randomly permute the class labels y s for B times. In the bth permutation (1 ≤ b ≤ B), compute , the new GDI Dom and , the new GDI Dor for gene g using the expression matrix D and the permuted labels .