A unified framework for finding differentially expressed genes from microarray experiments
 Jahangheer S Shaik^{1} and
 Mohammed Yeasin^{1, 2, 3, 4, 5}
DOI: 10.1186/1471-2105-8-347
© Shaik and Yeasin; licensee BioMed Central Ltd. 2007
Received: 29 March 2007
Accepted: 18 September 2007
Published: 18 September 2007
Abstract
Background
This paper presents a unified framework for finding differentially expressed genes (DEGs) from microarray data. The proposed framework has three interrelated modules: (i) gene ranking, (ii) significance analysis of genes and (iii) validation. The first module uses two gene selection algorithms, namely (a) two-way clustering and (b) combined adaptive ranking, to rank the genes. The second module converts the gene ranks into p-values using an R-test and fuses the two sets of p-values using Fisher's omnibus criterion. The DEGs are then selected using false discovery rate (FDR) analysis. The third module performs a three-fold validation of the obtained DEGs. The robustness of the proposed unified framework in gene selection is first illustrated using FDR analysis. In addition, clustering-based validation of the DEGs is performed by employing an adaptive subspace-based clustering algorithm on the training and the test datasets. Finally, a projection-based visualization is performed to validate the DEGs obtained using the unified framework.
Results
The performance of the unified framework is compared with well-known ranking algorithms such as t-statistics, Significance Analysis of Microarrays (SAM), Adaptive Ranking, Combined Adaptive Ranking and Two-way Clustering. The performance curves obtained using 50 simulated microarray datasets for each of two different distributions indicate the superiority of the unified framework over the other reported algorithms. Further analyses on 3 real cancer datasets and 3 Parkinson's datasets show a similar improvement in performance. First, a three-fold validation process is provided for the two-sample cancer datasets. In addition, the analysis on 3 sets of Parkinson's data is performed to demonstrate the scalability of the proposed method to multi-sample microarray datasets.
Conclusion
This paper presents a unified framework for the robust selection of genes from two-sample as well as multi-sample microarray experiments. The two different ranking methods used in module 1 bring diversity to the selection of genes. The conversion of ranks to p-values, the fusion of p-values and the FDR analysis aid in the identification of significant genes, which cannot be judged based on gene ranking alone. The three-fold validation, namely robustness in the selection of genes using FDR analysis, clustering and visualization, demonstrates the relevance of the DEGs. Empirical analyses on 50 artificial datasets and 6 real microarray datasets illustrate the efficacy of the proposed approach. The analyses on 3 cancer datasets demonstrate the utility of the proposed approach on microarray datasets with two classes of samples. The scalability of the proposed unified approach to multi-sample (more than two sample classes) microarray datasets is addressed using three sets of Parkinson's data. Empirical analyses show that the unified framework outperformed other gene selection methods in selecting differentially expressed genes from microarray data.
Background
High-throughput experiments such as DNA microarrays have become one of the most popular biotechnologies for monitoring the expression levels of thousands of genes simultaneously. Microarray experiments produce expression profiles measured under given experimental conditions, and the profiles are normally labeled on the basis of external information such as the clinical identification of samples or the expression of genes with respect to time [1]. By analyzing microarray expression profiles one can deduce information that provides significant understanding of the mechanism of the disease under study. Sophisticated statistical techniques are required to extract relevant genes from the enormous amount of microarray data. Gene selection can be challenging because microarray data is skewed, with a large number of genes in one dimension and only a few samples in the other. There is also a large volume of biological and technical noise that must be normalized to generate a more uniform measure.
Gene selection is typically performed using one of the following criteria: (i) finding differential expression of genes individually (statistics-based gene selection) or (ii) finding co-expressed genes offering high discrimination between two classes of samples (clustering-based gene selection). These criteria lead to different computational procedures for the selection of differentially expressed genes (DEGs). A plethora of mathematical techniques have been developed for finding DEGs in microarray data, for example, [1–4]. The performances of these methods are hard to quantify and compare as they yield significantly different results on the same dataset. This problem can be attributed to the assumptions behind the methods employed for ranking as well as to the unique characteristics of microarray data. It is widely acknowledged that no single method is adequate to produce the desired result, and that the fusion of algorithms that are diverse in nature may lead to it [5]. This paper proposes a gene selection method that blends clustering-based and statistics-based ranking. The genes are first ranked by two-way clustering and by statistics-based ranking. These ranks are converted into p-values using an R-test and fused using Fisher's omnibus criterion. The significance of the genes is then analyzed by performing false discovery rate (FDR) analysis.
The clustering-based ranking is performed using two-way clustering. The two-way clustering framework involves clustering the genes into relevant groups and then clustering the samples using the gene groups. Many different frameworks have been proposed for two-way clustering of microarray data. For example, Getz et al. [2] proposed a procedure called coupled two-way clustering, which iteratively applies one-way clustering within the subgroups of gene and tissue clusters from the previous iteration. Tang et al. [6] reported an interrelated clustering framework based on an iterative process that uses heuristics to define the number of clusters. McLachlan et al. [7] assume a model of distribution to cluster the genes. The unique characteristics of microarray data limit the utility of some of these methods.
The performance of a two-way clustering framework also depends on the underlying clustering algorithm. A plethora of clustering methods have been proposed for mining microarray data [8–10]. They include, but are not limited to, hierarchical methods [8], self-organizing maps [9], k-means clustering [10] and their variations. This paper employs an adaptive subspace iteration (ASI) based algorithm for clustering microarray data (see methods). This algorithm is well suited to handling a large number of data points. The centroids of the clusters are available as one of the outputs; hence, new data points may be assigned to relevant clusters with ease (dynamic data clustering). As the results suggest, this computationally fast algorithm complements the two-way clustering framework employed in this paper.
The performance of the statistics-based algorithms, on the other hand, depends on the number of available samples. If the samples are few, which is true for microarray data, it is difficult to assume a distribution for the data. Ranking functions based on the mean and sample variance yield inaccurate results due to the high level of noise. The statistical methods for finding DEGs were initially based on the two-sample t-test [11] obtained by pooling the variances from the two cases [12]. The estimates used there are based on the assumption that a large number of samples is available for statistical analysis. Tusher et al. [13] pointed out that small sample variance estimates (not much variation among the samples) yield false alarms for DEGs. They introduced an additive constant to the sample variance to reduce the false detection rate. Jeffery et al. [14] later proposed estimating this parameter as the 90^{th} percentile of the sum of gene-specific global standard errors. Mukherjee et al. [3] proposed the notion of reproducibility to minimize the expected loss in the determination of test statistics. The mean is often not a good representative of all the samples and may be corrupted by outliers. Therefore, Shaik et al. incorporate the Hausdorff distance into the combined adaptive ranking function to cope with the unique characteristics of the datasets and to improve the robustness of the ranking algorithm [4]. Most of these methods provide the user with only the ranks of the genes, and the significance of the genes cannot be determined from the ranks alone.
The p-values are an indication of the significance of the genes based on differential expression. This is important for feature selection studies because p-values make it possible to judge whether a gene is deemed significant merely by chance and hence to control the false discovery rate (FDR). There are several significant studies that focus on this important issue [15–18]. However, most gene selection methods provide the user with only the ranking of the genes [3, 7, 13, 14, 19]. The ranks may be used to sort the genes by differential expression from highest to lowest, but they do not indicate the significance of the genes. The non-availability of p-values therefore poses problems in gene selection. The availability of p-values enables controlling the false discovery rate, that is, accepting a minimum number of false positives relative to the number of rejected hypotheses. An R-test is performed in this paper to convert ranks to p-values [20].
The validation of DEGs is a challenging research issue. This paper uses three different methods, namely FDR analysis, clustering-based validation and visualization-based validation, to validate the DEGs. The ASI-based clustering algorithm [21] is employed for the clustering-based validation. The steps involved in clustering-based validation can be summarized as follows:

Find the differentially expressed genes between sample classes using the training set.

Apply the ASI algorithm to cluster the training samples using the DEGs as features and verify whether the clusters are consistent with the different classes.

Apply the ASI algorithm to cluster the test samples with the DEGs as features.

Compare the obtained clusters with the class label information of the test classes.

Repeat the process using all the ranking functions, namely t-statistics [11], SAM, adaptive ranking [3], combined adaptive ranking [4], two-way clustering [21] and the proposed unified ranking.
The application of projection-based techniques by Shaik et al. [22, 23] for the visualization of microarray data is found to be well suited for multi-class microarray datasets. In this paper, the 3D star coordinate projection (3DSCP) algorithm originally proposed in [22] is used for the visual validation of the DEGs. The key idea behind applying visualization algorithms to the validation of DEGs is that if the DEGs are used as features to project the samples, the samples of different classes should be projected to distinct locations in the projected space; otherwise, a random projection pattern is observed [24]. The 3DSCP algorithm is provided in additional file 1.
Methods
This section discusses the basic formulation of the individual modules of the unified framework. An overview of the unified framework is presented first.
Unified framework
The two sets of p-values are fused into a unified score using Fisher's omnibus criterion, which in its standard form reads
$U = -2\sum_{k=1}^{N}\ln P_{k}. \quad (1)$
Here, 'N' is the number of p-value sets (2 in this case) and 'P_{k}' is the set of p-values obtained using ranking procedure 'k'. The resultant score 'U' follows a χ^{2} distribution with 2N degrees of freedom. The scores are hence compared with the χ^{2} distribution to obtain their significance at an appropriate significance level α. The level α is decided based on false discovery rate (FDR) analysis such that the expected percentage of false positives is minimal. The selected genes are further validated using several validation measures.
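The fusion step can be sketched in a few lines of Python. This is a minimal illustration, assuming the usual form of Fisher's statistic (minus twice the sum of the log p-values) and independent p-values; the function name `fisher_omnibus` is hypothetical and not taken from the paper. It uses the closed-form upper tail of the χ^{2} distribution that holds for an even number of degrees of freedom, so no external library is required.

```python
import math


def fisher_omnibus(p_values):
    """Combine the p-values of one gene with Fisher's method (a sketch).

    Returns (U, combined_p): U = -2 * sum(ln p_k), and combined_p is the
    upper-tail probability of a chi-square with 2N degrees of freedom.
    """
    n = len(p_values)
    if n == 0 or any(not (0.0 < p <= 1.0) for p in p_values):
        raise ValueError("p-values must lie in (0, 1]")
    u = -2.0 * sum(math.log(p) for p in p_values)
    # For even degrees of freedom 2N the chi-square survival function is
    # exp(-x/2) * sum_{i=0}^{N-1} (x/2)^i / i!
    half = u / 2.0
    term, total = 1.0, 1.0
    for i in range(1, n):
        term *= half / i
        total += term
    return u, math.exp(-half) * total


# Hypothetical p-values of one gene from the two ranking procedures.
u_score, p_unified = fisher_omnibus([0.01, 0.02])
```

With N = 2 the combined p-value reduces to p1·p2·(1 − ln(p1·p2)), which makes it easy to check the result by hand.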
The proposed framework for selecting and validating DEGs can be succinctly summarized as follows:

Rank the genes using the two-way clustering framework.

Rank the genes using a statistics-based ranking method.

Convert the ranks to p-values using the R-test [20].

Combine the p-values of both gene selection methods using Fisher's omnibus criterion to obtain the unified score as shown in Eq. 1.

Select the DEGs based on FDR analysis.

Validate the selected DEGs.
Module 1: Gene ranking
The marker genes are generally ranked based on two criteria: (i) finding differential expression of genes individually or (ii) finding co-expressed genes offering high discrimination between two classes of tissues. These criteria lead to different computational procedures for selecting DEGs, as shown below:
Two-way clustering
Here, 'M' is the total number of gene clusters, 'L' is the number of different labels according to the ground truth, C_{i} are the samples that are part of cluster 'i', G_{j} is the group of samples having label 'j' according to the ground truth and (C_{i} ∩ G_{j}) is the maximum consistency between any of the sample clusters C_{i} and the samples 'G' with labels 'j' according to the ground truth.
Adaptive subspace iteration algorithm
The adaptive subspace iteration (ASI) is a subspace based method to cluster the data. It involves an optimization process that iteratively identifies the subspace structure. The following notations are used in the algorithm:

D_{n×m} is the data matrix that contains the microarray data with 'n' genes and 'm' samples. Also, assume that each macro cluster is divided into 'k' micro clusters at each resolution level (cf. Fig. 2).

The matrix M_{n×k} is the membership matrix. Each gene has 'k' memberships corresponding to the 'k' clusters. The cluster to which the gene belongs has a membership of 1 and the rest of the memberships are 0. This enables hard clustering of the genes.

Let S_{m×k} be the subspace structure associated with each gene cluster. This subspace has adequate information about the most informative genes in the cluster. The columns of S determine the relevance of each sample in the formation of a cluster. Hence, (DS)_{n×k} represents the projection of the data onto the subspaces.

Let 'C' be the projection of the centroid of each gene cluster onto the subspaces given by S_{m×k}. This enables calculating the distances between the genes in the subspace and each of the centroids in the subspace to determine the relevance of each gene to each of the clusters. The relationship between 'C', 'S' and 'M' is given by
$C = (M^{T}M)^{-1}M^{T}DS. \quad (3)$
Here, (M^{T}M) provides the size of the clusters (number of genes in a cluster). The diagonal elements provide the size of each cluster and the off-diagonal elements are zero. Hence, the 'C' matrix calculates the mean of each gene cluster to estimate the centroids. These centroids are projected onto the subspace as shown in Eq. 3.
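Because M is a 0/1 membership matrix, Eq. 3 amounts to averaging the projected genes of each cluster. The following pure-Python sketch illustrates this; the function name `projected_centroids` and the representation of M as a list of cluster indices are assumptions made for the example, not part of the original formulation.

```python
def projected_centroids(projected, labels, k):
    """Row j of the result is the mean of the projected genes in cluster j.

    `projected` plays the role of DS (one row per gene) and `labels[i]`
    is the column of M that holds the 1 for gene i.
    """
    dims = len(projected[0])
    sums = [[0.0] * dims for _ in range(k)]
    sizes = [0] * k                      # diagonal of M^T M
    for row, j in zip(projected, labels):
        sizes[j] += 1
        for d, value in enumerate(row):
            sums[j][d] += value          # accumulates M^T (DS)
    if 0 in sizes:
        raise ValueError("every cluster needs at least one gene")
    # Dividing by the cluster sizes applies (M^T M)^{-1}.
    return [[s / sizes[j] for s in sums[j]] for j in range(k)]


# Three genes in a 2-D subspace; the first two belong to cluster 0.
centroids = projected_centroids([[1.0, 0.0], [3.0, 2.0], [0.0, 4.0]], [0, 0, 1], 2)
```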
$= \mathrm{tr}(DSS^{T}D^{T}) - \mathrm{tr}(D^{T}S^{T}MC) \quad (7)$
Here, the first component (DSS^{T}D^{T}) = (DS)(DS)^{T} gives the total deviance of the data in the subspace. The second component (D^{T}S^{T}MC) = (((D^{T}S^{T})M)C) first projects the data onto the subspace as given by (D^{T}S^{T}). Further, the sum of distances between the centroids is estimated using (((D^{T}S^{T})M)C). The objective function shown in Eq. 7 is minimized by maximizing the distances between the centroids of the individual clusters.
The objective function in (7) is minimized by considering the first 'k' eigenvectors of (D^{T}(M(M^{T}M)^{-1}M^{T} - I)D)_{1:k} [26]. Therefore, 'S' is updated using Eq. 8:
$S = (D^{T}(M(M^{T}M)^{-1}M^{T} - I)D)_{k}. \quad (8)$
Please note that this feature provides dimensionality reduction, and all further computations are performed in the reduced subspace. The output of the algorithm is 'M' and 'S'. Here, 'M' offers the cluster memberships and 'S' offers the weights of the samples forming the clusters defined by the matrix 'M'. Based on the membership, the relevance of a gene to a cluster may be estimated: a membership of 0 indicates no relevance, whereas a membership of 1 indicates that the gene belongs to that cluster.
ASI algorithm
Begin clustering
Step 1: Begin initialization
    Initialize 'M' with zeros and randomly place a 1 in each row;
    Initialize 'S' with random values such that each column adds up to 1;
End initialization
Step 2: Project the centroids of each cluster onto the subspaces using Eq. (3);
Step 3: Compute the initial optimization value 'O_{0}' using the objective function of Eq. (7);
Step 4: Perform optimization by iteratively updating M, C and S;
Begin optimization
While (O_{1} < O_{0}) // continue as long as the cluster compactness increases
    Step 4-1: Update 'M' using the formula in Eq. (5);
    Begin loop // iterate over all the features
        // update the memberships by finding the relevance of a gene to each of the updated cluster centroids, as shown in Eq. 9
        M(i, j) = ((DS)_{i,j} - C_{:,j}); j = 1...k
        Min(M(i, j)) = 1; j = 1...k
    End loop
    Step 4-3: Update 'S' using the formula in Eq. (8);
    Step 4-4: Repeat Step 2;
    Step 4-5: Compute 'O_{1}' using Eq. (7);
    Step 4-6: If (O_{1} < O_{0}) // check the terminating condition
        O_{0} = O_{1};
End optimization
End clustering
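The membership update inside the loop can be illustrated as follows. This is a sketch rather than the authors' implementation: it assumes that the relevance of a gene to a cluster is measured by the squared Euclidean distance between the projected gene and the projected centroid, and the function name `update_memberships` is hypothetical.

```python
def update_memberships(projected, centroids):
    """Hard-assign each projected gene to its nearest centroid.

    `projected` plays the role of DS (one row per gene) and `centroids`
    the role of C (one row per cluster). Returns a 0/1 membership matrix.
    """
    k = len(centroids)
    membership = []
    for row in projected:
        # Assumed relevance measure: squared Euclidean distance.
        dists = [sum((a - b) ** 2 for a, b in zip(row, c)) for c in centroids]
        nearest = dists.index(min(dists))
        membership.append([1 if j == nearest else 0 for j in range(k)])
    return membership


# Three genes and two centroids in a 2-D subspace.
M = update_memberships([[0.0, 0.0], [5.0, 5.0], [1.0, 0.0]],
                       [[0.0, 0.0], [5.0, 5.0]])
```

Each row of the returned matrix contains exactly one 1, which matches the hard clustering described for 'M'.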
Davies-Bouldin index
The Davies-Bouldin index is a cluster validation metric [25]. It measures the homogeneity of the clusters as the ratio of the sum of intra-cluster scatters to the inter-cluster scatter. The intra-cluster scatter is a measure of the spread of a cluster, whereas the inter-cluster scatter is a measure of the distinctiveness of the clusters. Therefore, the lower the ratio of intra-cluster scatter to inter-cluster scatter, the better.
Here, 'c' is the number of clusters and 'DB' is the measure of homogeneity of the clusters. The lower the 'DB', the more homogeneous the clusters.
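As an illustration, the index can be computed as below. The sketch assumes the commonly used form of the Davies-Bouldin index, in which the scatter of a cluster is the mean Euclidean distance of its members to the centroid and the index averages, over clusters, the worst ratio (s_i + s_j)/d(c_i, c_j); the exact variant used in the paper is not recoverable from the text, and the function name `davies_bouldin` is hypothetical.

```python
import math


def davies_bouldin(points, labels):
    """Davies-Bouldin index of a hard clustering (lower is better)."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("need at least two clusters")
    centroids, scatter = {}, {}
    for lab, members in clusters.items():
        dims = len(members[0])
        c = [sum(m[d] for m in members) / len(members) for d in range(dims)]
        centroids[lab] = c
        # Intra-cluster scatter: mean distance of the members to the centroid.
        scatter[lab] = sum(math.dist(m, c) for m in members) / len(members)
    total = 0.0
    for i in clusters:
        # Worst-case ratio of summed scatters to centroid separation.
        total += max(
            (scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
            for j in clusters if j != i
        )
    return total / len(clusters)


# Two tight, well-separated clusters.
db = davies_bouldin([[0, 0], [0, 2], [10, 0], [10, 2]], [0, 0, 1, 1])
```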
Combined adaptive ranking
Here, 'd' is the difference of means for the 'mean method' and the Hausdorff distance between different samples for the 'Hausdorff distance method'. The $\widehat{\sigma}$ is the square root of the sample variance.
S_{1} = R_{1}(1:h) and S_{2} = R_{2}(1:h). (15)
A consistency measure is obtained by comparing these two sets:
Co = S_{1} ∩ S_{2}. (16)
Here, W_{1} and W_{2} represent the confidence in the rankings R_{M} and R_{H} obtained using the consistency 'Co' obtained from Eq. 16.
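Eqs. 15 and 16 can be read as taking the top 'h' genes of each ranking and intersecting them. A small sketch, in which the gene identifiers and the function name `top_h_consistency` are hypothetical and the rankings are assumed to be lists ordered from most to least differentially expressed:

```python
def top_h_consistency(rank_mean, rank_hausdorff, h):
    """Genes shared by the top-h entries of two rankings (Eqs. 15-16)."""
    s1 = rank_mean[:h]                   # S1 = R1(1:h)
    s2 = set(rank_hausdorff[:h])         # S2 = R2(1:h)
    return [g for g in s1 if g in s2]    # Co, in the order of the first ranking


co = top_h_consistency(["g3", "g1", "g7", "g2"], ["g1", "g9", "g3", "g2"], 3)
```

The sketch stops at the consistency set; how 'Co' is turned into the weights W_{1} and W_{2} is not shown here.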
Module 2: Significance analysis of ranking
The ranking algorithms of module 1 provide the user with only relative ranks, which do not indicate the significance of the rankings. Therefore, these ranks must be converted to p-values to assess significance.
Convert the scores to p-values
The R-test described in [20] is followed in this paper to convert scores to p-values. This is formulated as a hypothesis testing problem. Let 'I' be the informative genes and 'UI' be the non-informative genes. The null hypothesis is that the gene is not informative and the alternative hypothesis is that the gene is informative. The distribution of the statistic under the null hypothesis is obtained as follows (please see [20] for more details):

Obtain the ranks of the genes from the scores over 'I' bootstrap iterations. The value I = 25 is often adequate, as indicated in [20].

Construct the distribution of the statistic under the null hypothesis from the consistently high-ranked (insignificant) genes.

Obtain the median rank (r) of each gene (in [20] the mean rank was computed).

Obtain the p-value of each gene by finding p(r_{i} | g ∈ UI), i.e. the probability of the rank 'r_{i}' of the gene given that the gene belongs to the null hypothesis. The null hypothesis is that the gene is uninformative; therefore, the lower the probability under the null hypothesis, the more significant the gene.
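The steps above can be sketched as an empirical p-value computation. This is only an illustration under loud assumptions and not the R-test of [20]: rank 1 is taken to be the most differentially expressed gene, the set of null (uninformative) genes is supplied by the caller, and the p-value is estimated as the add-one-corrected fraction of null genes whose median rank is at least as good as that of the gene in question. The function name `ranks_to_p_values` is hypothetical.

```python
import statistics


def ranks_to_p_values(bootstrap_ranks, null_genes):
    """Empirical p-values from bootstrap ranks (a sketch, not the R-test).

    bootstrap_ranks: dict gene -> list of ranks over bootstrap iterations
                     (rank 1 = most differentially expressed; assumption).
    null_genes:      genes assumed uninformative (consistently high ranks).
    """
    median_rank = {g: statistics.median(r) for g, r in bootstrap_ranks.items()}
    null_ranks = [median_rank[g] for g in null_genes]
    n = len(null_ranks)
    p_values = {}
    for gene, r in median_rank.items():
        # Null genes ranked at least as well as this gene.
        at_least_as_good = sum(1 for x in null_ranks if x <= r)
        p_values[gene] = (at_least_as_good + 1) / (n + 1)  # add-one correction
    return p_values


# Hypothetical ranks of four genes over three bootstrap iterations.
p = ranks_to_p_values(
    {"a": [1, 2, 1], "b": [8, 9, 10], "c": [9, 10, 8], "d": [10, 8, 9]},
    null_genes=["b", "c", "d"],
)
```

With so few null genes the smallest attainable p-value is 1/(n + 1), so in practice the null set would need to be large.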
False discovery rate (FDR) analysis
The FDR provides the expected proportion of false positives among the selected DEGs where the number of selected genes is greater than 0. In this paper α is selected such that FDR is minimized. If the ground truth information about the DEGs is available, the performance of ranking algorithms may be compared using the performance analysis curves.
Performance analysis curves
The performance analysis curves are employed to study the performance of different ranking algorithms. The problem at hand is a binary classification problem in which a gene is either differentially expressed or not differentially expressed. The classifier can produce four possible outcomes, viz. true positives (TPs), false positives (FPs), true negatives (TNs) and false negatives (FNs). The TPs are the true DEGs among the selected DEGs S_{g}, and the FPs are the true non-DEGs among S_{g}. Similarly, the TNs are the true non-DEGs among the genes deemed insignificant by the algorithm, whereas the FNs are the true DEGs among the genes deemed insignificant. If the labels for the genes (differentially expressed/non-differentially expressed) are available, which is true for the artificial microarray datasets employed in this study, it is possible to accurately find the TPs, FPs, TNs and FNs. The plot of TPF vs FPF hence enables the comparison of the performance of the various classifiers employed in the study. Each of the 50 artificial datasets employed in this paper is used as an instance for building the performance curves: the TP, FP, TN and FN counts are accumulated over the 50 artificial datasets. The true positive fraction (TPF) is obtained using the formula TPF = TP/(TP+FN) and the false positive fraction (FPF) using the formula FPF = FP/(FP+TN). These TPFs and FPFs are plotted to build the performance analysis curves.
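The pooled TPF and FPF for one operating point can be computed as in the sketch below, where the function name `pooled_rates` and the representation of genes as integer indices are assumptions made for the example.

```python
def pooled_rates(selected_sets, true_deg_sets, n_genes):
    """Pool TP/FP/TN/FN over several datasets and return (TPF, FPF).

    selected_sets: one collection of selected gene indices per dataset.
    true_deg_sets: one collection of ground-truth DEG indices per dataset.
    n_genes:       number of genes in each dataset.
    """
    tp = fp = tn = fn = 0
    for selected, truth in zip(selected_sets, true_deg_sets):
        selected, truth = set(selected), set(truth)
        tp += len(selected & truth)            # true DEGs that were selected
        fp += len(selected - truth)            # non-DEGs that were selected
        fn += len(truth - selected)            # true DEGs that were missed
        tn += n_genes - len(selected | truth)  # non-DEGs correctly rejected
    return tp / (tp + fn), fp / (fp + tn)


# One dataset with 2050 genes, genes 0-49 being the true DEGs.
selected = list(range(40)) + [100, 101]
tpf, fpf = pooled_rates([selected], [range(50)], 2050)
```

Repeating the computation for a sequence of significance thresholds gives the points of one performance curve.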
Artificial microarray datasets
Two different models are employed to generate artificial microarray datasets, viz. (i) the lognormal model [27] and (ii) the asymmetric Laplace distribution [28]. Each artificial dataset is created to have 2050 genes with 10 samples under each of the two conditions. The first 50 genes are rendered differentially expressed and the remaining 2000 are rendered non-DEGs. This process provides class labels for the genes (DEGs or non-DEGs) in each artificially generated microarray dataset, which can be used as ground truth to quantitatively assess the performance of the different algorithms used in this study.
Lognormal model
Table 1. Parameters for Generating Artificial Microarray Datasets

| Gene type | Parameter | Normal tissues (condition 1) | Abnormal tissues (condition 2) |
|---|---|---|---|
| Non-DEGs | mean | 0 | 0 |
| Non-DEGs | variance | Gamma distribution with mean 2, variance 2 | Gamma distribution with mean 2, variance 2 |
| DEGs | mean | 0 | Normal distribution with mean 3, variance 1 |
| DEGs | variance | Gamma distribution with mean 3, variance 2 | Gamma distribution with mean 2, variance 2 |
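
A dataset with these parameters could be simulated along the following lines. This is a sketch under several assumptions that are not spelled out in the text: the values are generated on the log scale (so they are normally distributed), one variance is drawn per gene and condition, non-DEGs share the same variance in both conditions, and the gamma distributions are parameterized by shape = mean²/variance and scale = variance/mean. The function names are hypothetical.

```python
import math
import random


def gamma_from_mean_var(rng, mean, var):
    """Draw from a gamma distribution specified by its mean and variance."""
    shape, scale = mean * mean / var, var / mean
    return rng.gammavariate(shape, scale)


def simulate_dataset(n_degs=50, n_non_degs=2000, n_samples=10, seed=0):
    """Return (data, is_deg); each row holds condition 1 then condition 2."""
    rng = random.Random(seed)
    data, is_deg = [], []
    for g in range(n_degs + n_non_degs):
        deg = g < n_degs                       # the first genes are the DEGs
        if deg:
            mu1, mu2 = 0.0, rng.gauss(3.0, 1.0)
            v1 = gamma_from_mean_var(rng, 3.0, 2.0)
            v2 = gamma_from_mean_var(rng, 2.0, 2.0)
        else:
            mu1 = mu2 = 0.0
            v1 = v2 = gamma_from_mean_var(rng, 2.0, 2.0)
        cond1 = [rng.gauss(mu1, math.sqrt(v1)) for _ in range(n_samples)]
        cond2 = [rng.gauss(mu2, math.sqrt(v2)) for _ in range(n_samples)]
        data.append(cond1 + cond2)
        is_deg.append(deg)
    return data, is_deg


data, is_deg = simulate_dataset()
```

The `is_deg` list provides the ground-truth labels against which the selected genes can be compared.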
Asymmetric laplace distribution model
The artificial microarray datasets are created using the same procedure employed for the lognormal distribution, but using an asymmetric Laplace distribution as reported in [28]. The mean and variance of the DEGs and non-DEGs are approximated using the parameters in Table 1. The sample size was set to 12.
Results
The performance of the proposed unified framework for finding DEGs from microarray datasets is evaluated using two models of simulated microarray datasets (50 artificially generated microarray datasets each [3, 4, 27]) as well as six real cancer and Parkinson's microarray datasets [29–31]. Artificial datasets with ground truth information are used for the comparison of performance of unified framework with other gene selection methods. The performance of various gene selection algorithms [3, 11, 13] is further compared with the proposed method using real microarray datasets in the selection of DEGs.
Artificial microarray datasets
1. Generate an artificial microarray dataset such that the first 50 genes are rendered differentially expressed and the next 2000 are non-differentially expressed (see methods).
2. Find the ranks using module 1 of the unified framework and convert them to p-values using the R-test.
3. Merge the p-values and obtain the unified p-value using Fisher's omnibus criterion.
4. Perform FDR analysis.
5. Compare the DEGs with the ground truth to obtain the TPF and FPF (see methods).
6. Repeat steps 1–5 for all 50 artificial microarray datasets to obtain the performance curves shown in Fig. 4.
The unified framework, its individual modules and other well-known techniques such as t-statistics [24, 32], significance analysis of microarrays [13], adaptive ranking [3], combined adaptive ranking [4] and two-way clustering using ASI [21] are employed to find the DEGs of 50 artificially generated microarray datasets. The R-test [20] is employed to convert ranks to p-values for the gene selection methods that do not offer p-values. Fig. 4 shows the performance curves (see methods) of the various well-known methods and the proposed unified approach using the lognormal model and the asymmetric Laplace model. From the values in the performance plots, it can be inferred that the proposed unified approach outperforms the other gene selection methods in finding the DEGs from the artificial microarray data.
Leukemia microarray dataset
Gene expressions of approximately 6817 genes are used to classify two types of acute leukemia, viz. Acute Lymphoid Leukemia (ALL) and Acute Myeloid Leukemia (AML). The data consists of 47 cases of ALL (38 B-cell and 9 T-cell) and 25 cases of AML. The data is divided into a training class containing 38 samples (27 ALL and 11 AML) and a test class containing 34 samples (20 ALL and 14 AML). The class labels for the training and test samples are available from Golub et al. [30]. The preprocessing steps proposed by Golub et al. resulted in 3571 genes; the rest of the genes are considered noise and are therefore eliminated.
Gene selection and statistical validation
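Table 2. FDR Analysis of Leukemia Dataset

| Alpha value | t-statistics GS | t-statistics %FP | SAM GS | SAM %FP | Adaptive GS | Adaptive %FP | Combined Adaptive GS | Combined Adaptive %FP | Two-Way GS | Two-Way %FP | Unified GS | Unified %FP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.01 | 154 | 23.19 | 171 | 20.88 | 183 | 19.51 | 189 | 18.89 | 191 | 18.7 | 211 | 16.92 |
| 0.005 | 94 | 18.99 | 103 | 17.33 | 119 | 15 | 121 | 14.76 | 117 | 15.26 | 147 | 12.15 |
| 0.001 | 29 | 12.31 | 34 | 10.5 | 41 | 8.71 | 41 | 8.71 | 43 | 8.3 | 52 | 6.8 |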
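
The %FP entries above appear to be consistent with a simple reading: the expected number of false positives is the significance level α multiplied by the number of genes tested (3571 after preprocessing), expressed as a percentage of GS, taken here to be the number of genes selected. This interpretation is inferred from the tabulated numbers and is not stated explicitly in the text; the function name `expected_fp_percent` is hypothetical.

```python
def expected_fp_percent(alpha, n_genes_tested, n_selected):
    """Expected false positives as a percentage of the selected genes."""
    if n_selected <= 0:
        raise ValueError("at least one gene must be selected")
    return 100.0 * alpha * n_genes_tested / n_selected


# Leukemia dataset, alpha = 0.01, t-statistics column (154 genes selected).
print(round(expected_fp_percent(0.01, 3571, 154), 2))
```

Under this reading, a method that selects more genes at the same α obtains a lower expected percentage of false positives, which is how the columns of the table are compared.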
Comparison of the obtained DEGs with DEGs obtained by Golub et al. [30]
The 52 significantly expressed genes obtained using the unified framework are compared to the DEGs obtained by Golub et al. [30] (see additional file 3). There are 24 genes in common with the genes found by Golub et al. This shows that the genes obtained by the unified framework are not significantly different from those obtained by Golub et al. It also shows that many genes selected by the unified framework were not considered significant by Golub et al. It has already been statistically validated that the unified framework offers a lower percentage of expected false positives, and hence the genes selected using the unified framework are considered to be relevant.
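Clustering-based validation
Table 3. ASI Classification of Leukemia Samples using the DEGs (number of samples correctly classified with the DEGs of each gene selection method)

| Sample set | Samples | t-statistics | SAM | Adaptive Ranking | Combined Adaptive Ranking | Two-way Clustering | Unified Ranking |
|---|---|---|---|---|---|---|---|
| Training | 38 | 33 | 35 | 36 | 36 | 38 | 38 |
| Testing | 34 | 25 | 28 | 29 | 30 | 30 | 33 |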
The ASI algorithm is further applied to cluster the test samples using the DEGs obtained through the respective methods. It is evident from the testing row of Table 3 that the DEGs obtained using the unified framework classified the AML and ALL samples better (33 of 34 samples, 97.06%) than the DEGs obtained using the other methods, which again shows the improved performance of the unified framework.
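The agreement between the clusters and the class labels can be quantified as in the following sketch, which tries every assignment of clusters to classes and keeps the best one. The cluster assignment in the example is made up solely to reproduce the 33-out-of-34 case; it is not the actual ASI output, and the function name `clustering_accuracy` is hypothetical.

```python
from itertools import permutations


def clustering_accuracy(cluster_ids, class_labels):
    """Best fraction of samples whose cluster maps onto their class."""
    clusters = sorted(set(cluster_ids))
    classes = sorted(set(class_labels))
    if len(clusters) > len(classes):
        raise ValueError("more clusters than classes")
    best = 0
    # Try every one-to-one mapping of clusters onto classes.
    for perm in permutations(classes, len(clusters)):
        mapping = dict(zip(clusters, perm))
        hits = sum(mapping[c] == y for c, y in zip(cluster_ids, class_labels))
        best = max(best, hits)
    return best / len(class_labels)


# Hypothetical clustering of the 34 test samples with one ALL sample misplaced.
labels = ["ALL"] * 20 + ["AML"] * 14
clusters = [1] * 19 + [0] + [0] * 14
accuracy = clustering_accuracy(clusters, labels)
```

Exhaustive matching is adequate for two or three classes; for many classes an assignment algorithm would be preferable.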
Visualization-based validation
Gastric cancer microarray dataset
The objective of this study is to identify genes distinguishing primary gastric cancers and metastatic gastric cancers from neoplastic gastric cancers which are otherwise morphologically indistinguishable. Approximately 30300 genes are used to study expression patterns of 90 primary gastric cancers and 22 neoplastic gastric cancers. The preprocessing steps indicated by Chen et al. [29] are employed resulting in 5200 genes for further study.
Two bootstrapped datasets are created from the original dataset, one for training and one for testing. The training data has 60 randomly selected primary samples and 12 randomly selected neoplastic samples, whereas the test data has 30 randomly selected primary samples and 10 randomly selected neoplastic samples. The experimental design used for the Leukemia dataset is followed for the analysis of the Gastric cancer dataset.
Gene selection and statistical validation
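Table 4. FDR Analysis of Gastric Cancer Dataset

| Alpha value | t-statistics GS | t-statistics %FP | SAM GS | SAM %FP | Adaptive GS | Adaptive %FP | Combined Adaptive GS | Combined Adaptive %FP | Two-Way GS | Two-Way %FP | Unified GS | Unified %FP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.01 | 417 | 12.47 | 398 | 13.07 | 397 | 13.10 | 406 | 12.81 | 414 | 12.56 | 434 | 11.98 |
| 0.005 | 299 | 8.70 | 283 | 9.19 | 279 | 9.32 | 288 | 9.03 | 294 | 8.84 | 325 | 8 |
| 0.001 | 173 | 3.01 | 189 | 2.75 | 166 | 3.13 | 175 | 2.97 | 187 | 2.78 | 210 | 2.48 |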
Comparison of the DEGs to the significant genes obtained by Chen et al. [29]
The DEGs found using the unified framework are compared against the DEGs found by Chen et al. [29]. Of the 210 genes found by the unified algorithm, 204 are common to the DEGs found by Chen et al. [29]. The list of common genes may be accessed through additional file 5. It may be seen that most of the genes found using the unified framework were present in the list of 3000 genes found significant by Chen et al. The improved performance of the unified framework may be attributed to the rejection of most of the genes deemed significant by Chen et al. This is one of the advantages of FDR analysis, which focuses not only on the selection of DEGs but also on the rejection of the insignificant genes.
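Clustering-based validation
Table 5. ASI Classification of Gastric Cancer Samples using the DEGs (number of samples correctly classified with the DEGs of each gene selection method)

| Sample set | Samples | t-statistics | SAM | Adaptive Ranking | Combined Adaptive Ranking | Two-way Clustering | Unified Ranking |
|---|---|---|---|---|---|---|---|
| Training | 72 | 64 | 67 | 67 | 69 | 69 | 72 |
| Testing | 40 | 28 | 34 | 33 | 35 | 33 | 39 |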
Visualization-based validation
Colon cancer microarray dataset
An Affymetrix oligonucleotide array complementary to more than 6500 human genes is used in this study. The gene expression is studied using 40 tumor samples and 22 normal samples. The preprocessing of this dataset resulted in 2000 interesting genes, which have been used as input to the gene selection algorithms.
The analysis is performed by first dividing the data into training and test sets. The training data has 25 tumor samples and 12 normal samples selected randomly, whereas the test data has 15 tumor samples and 10 normal samples selected randomly. Steps similar to the experimental design followed for the Leukemia dataset are used for the analysis of this dataset.
Gene selection and statistical validation
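Table 6. FDR Analysis of Colon Cancer Dataset

| Alpha value | t-statistics GS | t-statistics %FP | SAM GS | SAM %FP | Adaptive GS | Adaptive %FP | Combined Adaptive GS | Combined Adaptive %FP | Two-Way GS | Two-Way %FP | Unified GS | Unified %FP |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0.01 | 211 | 9.48 | 233 | 8.58 | 221 | 9.05 | 218 | 9.17 | 211 | 9.48 | 236 | 8.47 |
| 0.005 | 121 | 8.2 | 119 | 8.4 | 113 | 8.85 | 117 | 8.55 | 124 | 8.06 | 134 | 7.46 |
| 0.001 | 48 | 4.17 | 54 | 3.7 | 38 | 5.26 | 42 | 4.76 | 57 | 3.51 | 66 | 3.03 |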
Comparison of the DEGs with earlier works
A list of significantly differentially expressed genes is not available from Alon et al. for comparison. However, a comprehensive analysis of this dataset was performed by Su et al. [33]. The procedure involves ranking the genes using 8 different measures, viz. t-test, information gain, sum of variances, twoing rule, Gini index, sum minority, max minority and 1D SVM. The rankings are then fused to obtain a list of the 100 best-ranked genes [33]. This list of 100 ranked genes is compared with the list of 66 genes obtained using the unified framework. Of the 66 genes, 51 were among the top 100 genes obtained using the 'rankgene' method. The rankgene method did not employ any FDR analysis for gene selection; it merely lists the top 100 genes.
Clustering-based validation
ASI Classification of Colon Cancer Samples using the DEGs
Dataset  Samples  t-statistics  SAM  Adaptive Ranking  Combined Adaptive Ranking  Two-way Clustering  Unified Ranking
Training  37  31  33  31  33  32  35 
Testing  25  19  22  22  21  21  24 
Visualization-based validation
Parkinson's dataset
The Parkinson's dataset is employed to extend the application of the two-sample gene selection methods to multi-sample experiments. Three sets of microarray data are available for this model from Miller et al. [31]. The first dataset is obtained using Codelink Mouse uniSet I bioarrays. The other two are obtained from Affymetrix arrays, analyzed using the Affymetrix Microarray Suite software (MAS 5) and the Model Based Expression Index (MBEI) in the dChip software, respectively. The data consist of three treatment groups: MC (saline-treated mouse control), MME (mouse MPTP early) and MML (mouse MPTP late). Each group has four sets of samples obtained at different times after MPTP administration, using 12588 genes. The performance of the different gene selection methods is evaluated on the MC versus MME and MC versus MML comparisons for all three datasets. This pattern of comparison provides the DEGs at different times. It also identifies the DEGs at the early stage that remained differentially expressed at the late stage.
Codelink mouse uniSet I bioarrays
Experimental design

Find the p-values for the genes based on differential expression between the MC and MME groups and between the MC and MML groups.

Merge the two sets of p-values using Fisher's omnibus criterion [34].

Perform FDR analysis and select the genes with significant differential expression such that the expected percentage of false positives is minimized.

Apply the ASI algorithm to cross validate the DEGs.

Repeat the process for each gene selection method.
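The merging and selection steps above can be sketched as follows. For two p-values, Fisher's statistic X = -2(ln p1 + ln p2) follows a chi-square distribution with 4 degrees of freedom under the null hypothesis, and its upper tail has the closed form exp(-x/2)(1 + x/2), so no statistics library is needed. The gene names and p-values are hypothetical, the function names are illustrative, and the plain threshold on the fused p-value is an assumed stand-in for the FDR-based selection, not the authors' implementation.

```python
import math


def fisher_combine_two(p1, p2, floor=1e-300):
    """Fuse two p-values with Fisher's method (chi-square, 4 degrees of freedom).

    With s = ln p1 + ln p2, the statistic is x = -2s and the upper-tail
    probability exp(-x/2) * (1 + x/2) simplifies to exp(s) * (1 - s).
    """
    s = math.log(max(p1, floor)) + math.log(max(p2, floor))  # floor avoids log(0)
    return min(1.0, math.exp(s) * (1.0 - s))


def select_degs(p_first, p_second, alpha):
    """Return (gene, fused p-value) pairs with fused p < alpha, most significant first."""
    fused = {g: fisher_combine_two(p_first[g], p_second[g]) for g in p_first}
    hits = [(g, p) for g, p in fused.items() if p < alpha]
    return sorted(hits, key=lambda gp: gp[1])


# Hypothetical per-gene p-values for the two comparisons
p_mc_vs_mme = {"gene_a": 0.0004, "gene_b": 0.03, "gene_c": 0.6, "gene_d": 0.002}
p_mc_vs_mml = {"gene_a": 0.0010, "gene_b": 0.02, "gene_c": 0.4, "gene_d": 0.5}

for gene, p in select_degs(p_mc_vs_mme, p_mc_vs_mml, alpha=0.01):
    print(gene, f"{p:.2e}")
```

A gene that is moderately significant in both comparisons (gene_b here) can end up with a smaller fused p-value than a gene that is strongly significant in only one (gene_d), which is the behaviour expected of an omnibus test.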
Gene selection and statistical validation
FDR Analysis of Parkinson's Dataset using CodeLink BioArrays
Alpha value  t-statistics GS  t-statistics %FP  SAM GS  SAM %FP  Adaptive GS  Adaptive %FP  Combined Adaptive GS  Combined Adaptive %FP  Two-way GS  Two-way %FP  Unified GS  Unified %FP
0.01  51  46.07  52  45.13  57  41.18  57  41.18  51  46.07  56  41.91 
0.005  25  46.9  29  40.4  28  41.8  28  41.8  29  40.4  31  37.8 
0.001  14  16.76  15  15.65  14  16.76  16  14.67  15  15.65  17  13.81 
Clustering-based validation of CodeLink data
Cross Validation of Parkinson's Datasets using Training Samples
Dataset  Samples  t-statistics  SAM  Adaptive Ranking  Combined Adaptive Ranking  Two-way Clustering  Unified Ranking
Codelink  12  10  10  9  11  10  12 
MAS05  12  10  10  10  10  11  12 
Dchip  12  9  9  9  10  11  12 
Visualization-based validation
Affymetrix using MAS 05
Gene selection and statistical validation
FDR Analysis of Parkinson's Dataset using Affymetrix MAS05
Alpha value  t-statistics GS  t-statistics %FP  SAM GS  SAM %FP  Adaptive GS  Adaptive %FP  Combined Adaptive GS  Combined Adaptive %FP  Two-way GS  Two-way %FP  Unified GS  Unified %FP
0.01  46  50.87  46  50.87  48  48.75  48  48.75  49  47.76  51  45.88 
0.005  25  46.8  27  43.34  25  46.8  28  41.8  28  41.8  31  37.75 
0.001  12  19.5  13  18  11  21.27  12  19.5  14  16.71  14  16.71 
Clustering-based validation of MAS 05 data
Table 9 shows the number of samples identified correctly using the DEGs from each gene selection method. As shown in Table 9, row 3, the unified framework performed better than the other methods in the validation of the samples (100% accuracy).
Visualization-based validation
Affymetrix using dchip
Gene selection and statistical validation
FDR Analysis of Parkinson's Dataset using Affymetrix dChip
Alpha value  t-statistics GS  t-statistics %FP  SAM GS  SAM %FP  Adaptive GS  Adaptive %FP  Combined Adaptive GS  Combined Adaptive %FP  Two-way GS  Two-way %FP  Unified GS  Unified %FP
0.01  49  44.47  50  43.58  52  41.9  52  41.9  49  44.47  54  40.35 
0.005  33  33  35  31.1  35  31.1  35  31.1  33  33  37  29.43 
0.001  15  14.53  15  14.53  15  14.53  16  13.62  14  15.56  15  13.22 
Clustering-based validation of dChip data
The samples are clustered using the DEGs obtained from the various gene selection methods (for α = 0.001). The resulting sample clusters are compared with the class labels of the samples (MC, MME and MML). Table 9 shows that all samples (100%) are correctly identified using the proposed unified framework.
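Comparing cluster output with class labels requires deciding which cluster corresponds to which class, since cluster identifiers are arbitrary. The text does not say how this matching is done; the sketch below assumes one common convention, taking the best one-to-one assignment of clusters to classes and counting the agreeing samples. The function name and the example cluster assignments are hypothetical.

```python
from itertools import permutations


def count_correct(cluster_ids, class_labels):
    """Largest number of samples whose cluster agrees with its class label,
    over all one-to-one assignments of cluster ids to class names."""
    clusters = sorted(set(cluster_ids))
    classes = sorted(set(class_labels))
    if len(clusters) > len(classes):
        raise ValueError("more clusters than classes")
    best = 0
    for assigned in permutations(classes, len(clusters)):
        mapping = dict(zip(clusters, assigned))
        hits = sum(mapping[c] == y for c, y in zip(cluster_ids, class_labels))
        best = max(best, hits)
    return best


# 12 samples: four each of MC, MME and MML
labels = ["MC"] * 4 + ["MME"] * 4 + ["MML"] * 4

# Hypothetical clustering in which one MME sample falls into the MML cluster
clusters = [2, 2, 2, 2, 0, 0, 0, 1, 1, 1, 1, 1]

print(count_correct(clusters, labels), "of", len(labels), "samples correctly identified")
```

Exhaustive search over assignments is adequate for three classes; with many classes an assignment solver would be the more practical choice.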
Visualization-based validation
Discussion
This paper presents a unified framework for gene selection and validation. Two different gene selection algorithms, viz. two-way clustering and combined adaptive ranking, are fused to rank the genes. The two-way framework finds the differential expression of the co-expressed genes. The progressive framework using the ASI algorithm is employed to cluster the gene dimension. This presents the gene clusters at different resolutions, each of which may be tested for differential expression. The number of resolutions for the progressive framework is determined using the Davies-Bouldin index.
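For readers unfamiliar with the Davies-Bouldin index, the following is a minimal sketch of its standard definition: each cluster's scatter is the mean distance of its members to the centroid, each cluster is paired with its worst neighbour by the ratio (S_i + S_j) / M_ij, and the index is the mean of those worst ratios. Lower values indicate compact, well-separated clusters. The toy data and function names are illustrative, and how the index is used to choose the number of resolutions in the progressive framework is not reproduced here.

```python
import math


def centroid(points):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    dim = len(points[0])
    return [sum(p[d] for p in points) / len(points) for d in range(dim)]


def davies_bouldin(clusters):
    """Davies-Bouldin index for a list of clusters (each a list of vectors).

    Assumes at least two clusters with distinct centroids.
    """
    if len(clusters) < 2:
        raise ValueError("need at least two clusters")
    centers = [centroid(c) for c in clusters]
    # S_i: mean Euclidean distance from the members of cluster i to its centroid
    scatter = [
        sum(math.dist(p, ctr) for p in c) / len(c)
        for c, ctr in zip(clusters, centers)
    ]
    k = len(clusters)
    total = 0.0
    for i in range(k):
        # Worst-case similarity of cluster i to any other cluster
        total += max(
            (scatter[i] + scatter[j]) / math.dist(centers[i], centers[j])
            for j in range(k)
            if j != i
        )
    return total / k


# Toy example: two compact, distant clusters versus two spread-out, close clusters
tight = [[[0, 0], [0, 1]], [[10, 0], [10, 1]]]
loose = [[[0, 0], [0, 4]], [[3, 0], [3, 4]]]
print(davies_bouldin(tight), davies_bouldin(loose))
```

In a model-selection setting, the index would typically be evaluated for each candidate clustering and the candidate with the smallest value preferred.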
Most of the ranking functions employed in this study for gene selection provide the user with only the relative ranking of the genes. These ranks enable sorting the genes by differential expression, but they do not indicate the significance of the genes. The R-test provides a means of converting ranks into a measure of significance (p-values). The gene rankings from module 1 are converted into p-values using the R-test and fused using Fisher's omnibus criterion. The FDR analysis is then applied to the fused p-values. The FDR analysis enables judicious selection of DEGs by balancing the number of genes selected against the expected percentage of false positives. For example, at α = 0.001 the percentage of false positives for the gastric cancer dataset using the unified framework is 2.48%. This indicates that, out of 210 genes, only about 5 (210 × 2.48% ≈ 5) are expected to have been selected by chance.
The real datasets are divided into two categories: i) two-sample experiments with a large number of samples and ii) multi-sample experiments with a small number of samples. For the first category, the emphasis is on the validation techniques. The data are divided into training and testing sets. The DEGs are obtained from the training set and a three-fold validation is performed. The improvement in statistical power for the selection of DEGs is first shown with the aid of FDR analysis. The clustering-based cross validation of the DEGs is performed next by clustering the training and test samples and evaluating the performance. Finally, a visualization-based cross validation is performed to show the separability of the samples in the projected space. The aim of the second category of real datasets is to show the extensibility of the proposed approach to multi-sample experiments. Because a large number of samples is not available, the validation is performed on the training set only, using the clustering-based and visualization-based algorithms. The clustering-based validation showed that the unified framework performed better than the other algorithms. Further, the visualization-based validation demonstrated that the DEGs obtained using the unified framework offered a much clearer separation between the samples of the different classes than the DEGs obtained using the other methods.
Conclusion
A unified framework for finding DEGs from microarray data is developed and empirically evaluated. The unified framework is built by combining three interrelated modules. Its performance is compared with that of other well-known gene selection algorithms. The performance curves obtained using 50 artificial microarray datasets, each following two different distributions, indicate the superiority of the unified framework over the other reported algorithms. Further analyses on six real datasets (three cancer datasets and three Parkinson's datasets) show a similar improvement in performance. A comprehensive validation of the DEGs is presented using the three cancer datasets. The robustness of the gene selection is first presented using FDR analysis for the various methods used in the study. The clustering-based validation is presented next by analyzing the clustering of the training and test samples using the ASI algorithm. Finally, a visualization-based validation is performed. The scalability of the proposed unified approach to multi-sample experiments is demonstrated using the Parkinson's datasets. Empirical analyses on artificial and real microarray datasets illustrate the efficacy of the proposed unified framework in finding the DEGs.
Declarations
Acknowledgements
The authors acknowledge the Herff Fellowship, financial assistance from the Bioinformatics program and faculty startup grants from the University of Memphis for partially funding this research. The authors thank Drs. Ebenezer George and Ramin Homayouni for their helpful tips in preparing the manuscript. The authors would also like to thank the anonymous reviewers for their helpful suggestions and comments in improving the quality of the paper.
References
1. Guyon I: An Introduction to Variable and Feature Selection. Journal of Machine Learning Research. 2003, 3 (7-8): 1157-1182.
2. Getz G, Levine E, Domany E: Coupled two-way clustering of gene microarray data. Proceedings of National Academy of Science, USA. 2000, 97 (22): 12079-12084.
3. Mukherjee S, Roberts SJ, Laan MJ: Data-adaptive Test Statistics for Microarray Data. Bioinformatics. 2005, 21 (2): 108-114.
4. Shaik J, Yeasin M: Adaptive Ranking and Selection of Differentially Expressed Genes from Microarray Data. WSEAS transactions on Biology and Biomedicine. 2006, 3 (2): 125-133.
5. Hui-Huang H: Advanced Data Mining Technologies in Bioinformatics. 2006, Idea Group Publishing, 329
6. Tang C, Zhang A: Interrelated Two-way Clustering: an unsupervised approach for gene expression data analysis. In Proceedings of the 2nd IEEE international Symposium on Bioinformatics and Bioengineering. 2001, 14 (4): 41-48.
7. McLachlan GJ, Bean RW, Peel D: A mixture model-based approach to the clustering of microarray expression data. Bioinformatics. 2002, 18 (3): 413-422.
8. Alon U, Barkai N, Notterman DA, Gish K, Ybarra S, Mack D, Levine AJ: Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc Natl Acad Sci USA. 1999, 96: 6745-6750.
9. Tamayo P, Slonim D, Mesirov J, Zhu Q, Kitareewan S, Dmitrovsky E, Lander E, Golub T: Interpreting Patterns of Gene Expression with Self-Organizing Maps: Methods and Applications to Hematopoietic differentiation. Proc Natl Acad Sci. 1999, 96: 2907-2912.
10. Tavazoie S, Hughes J, Campbell M, Cho R, Church G: Cluster Analysis and Display of Genome Wide Expression Patterns. Nat Genetics. 1999, 22: 281-285.
11. Sahai H, Ojeda MM: Analysis of Variance for Random Models: Theory, Methods, Applications and Data Analysis. 2004, Birkhauser, 484
12. Casella G, Berger RL: Statistical Inference. Duxbury Advanced Series. 2001, Duxbury Press, 2
13. Tusher VG, Tibshirani R, Chu G: Significance Analysis of Microarrays Applied to The Ionizing Radiation Response. PNAS. 2001, 98 (9): 5116-5121.
14. Jeffery IB, Higgins DG, Culhane AC: Comparison and Evaluation of Methods for Generating Differentially Expressed Gene lists from MicroArray Data. BMC Bioinformatics. 2006, 7: 359-375.
15. Benjamini Y, Hochberg Y: Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. J R Statistical Society. 1995, 57 (1): 289-300.
16. Benjamini Y, Yekutieli D: The Control of The False Discovery Rate in Multiple Testing Under Dependency. The Annals of Statistics. 2001, 29 (4): 1165-1188.
17. Storey JD, Tibshirani R: Statistical Significance for Genome Wide Studies. PNAS. 2003, 100 (16): 9440-9445.
18. Fernando RL, Nettleton D, Southey BR, Dekkers JCM, Rothschild MF, Soller M: Controlling the Proportion of False Positives in Multiple Dependent Tests. Genetics. 2004, 166 (1): 611-619.
19. Shaik J, Yeasin M: Ranking Function Based on Higher Order Statistics (RFHOS) for Two-Sample Microarray Experiments: May; Atlanta, GA. Edited by: Mandoiu I, Zelikovsky A. 2007, Springer Verlag, LNBI 4463: 97-108.
20. Zhang C, Lu X, Zhang X: Significance of Gene Ranking for Classification of Microarray Samples. IEEE/ACM Transactions on Computational Biology and Bioinformatics. 2006, 3 (3): 312-320.
21. Shaik J, Yeasin M: A Progressive Framework for Two-Way Clustering Using Adaptive Subspace Iteration for Functionally Classifying Genes. Proceedings of IEEE IJCNN'06, Vancouver, Canada. 2006, 5287-5292.
22. Shaik J, Yeasin M: Visualization of High Dimensional Data using an Automated 3D Star Coordinate System. Proceedings of IEEE IJCNN'06, Vancouver, Canada. 2006, 2318-2325.
23. Shaik J, Yeasin M: Functionally Classifying Genes from Microarray Data Using Linear and Nonlinear Data Projection. IEEE International Conference on Computer Systems and Applications. 2006, 608-615.
24. Stekel D: Microarray Bioinformatics. 2003, Cambridge, Cambridge University Press, 2631
25. Davies DL, Bouldin DW: A Cluster Separation Measure. IEEE Transactions on Pattern Analysis and Machine Intelligence. 1979, 1: 224-227.
26. Li T, Ma S, Ogihara M: Document Clustering via Adaptive Subspace Iteration. Special Information Group on Information Retrieval 2004. 2004, 218-225.
27. Lonnstedt I, Speed T: Replicated Microarray Data. Statistica Sinica. 2002, 12: 31-46.
28. Purdom E, Holmes S: Error Distribution for Gene Expression Data. Statistical Applications in Genetics and Molecular Biology. 2005, California, Stanford University, 4 (1): 15.
29. Chen X, Leung SY, Yeuen ST, Chu KM, Ji J, Li R, Chan ASY, Law S, Troyanskaya OG, Wong J, So S, Botstein D, Brown PO: Variation in Gene Expression Patterns in Human Gastric Cancers. Mol Bio Cell. 2003, 14: 3208-3215.
30. Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA, Bloomfield CD, Lander ES: Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science. 1999, 286: 531-537.
31. Miller RM, Callahan LM, Casaceli C, Chen L, Kiser GL, Chui B, Kaysser-Kranich TM, Sendera TJ, Palaniappan C, Federoff HJ: Dysregulation of Gene Expression in the 1-Methyl-4-Phenyl-1,2,3,6-Tetrahydropyridine Lesioned Mouse Substantia Nigra. Journal of Neuroscience. 2004, 24 (34): 7445-7454.
32. Duda RO, Hart PE, Stork DG: Pattern Classification. 2000, John Wiley and Sons Inc, 2nd
33. Su Y, Murali TM, Pavlovic V, Schaffer M, Kasif S: rankgene: Identification of Diagnostic Genes Based on Expression Data. 2002, [http://www.genomics10buedu/yangsu/rankgene/]
34. Fisher RA: The use of multiple measurements in taxonomic problems. Annals of Eugenics. 1936, 7 (2): 179-188.
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.