Volume 9 Supplement 6
Symposium of Computations in Bioinformatics and Bioscience (SCBB07)
MCM-test: a fuzzy-set-theory-based approach to differential analysis of gene pathways
 Lily R Liang^{1},
 Vinay Mandal^{2},
 Yi Lu^{3} and
 Deepak Kumar^{4}
DOI: 10.1186/1471-2105-9-S6-S16
© Liang et al; licensee BioMed Central Ltd. 2008
Published: 28 May 2008
Abstract
Background
A gene pathway can be defined as a group of genes that interact with each other to perform a biological process. Along with the efforts to identify the individual genes that play vital roles in a particular disease, there is a growing interest in identifying the roles of gene pathways in such diseases.
Results
This paper proposes an innovative fuzzy-set-theory-based approach, the Multidimensional Cluster Misclassification test (MCM-test), to measure the significance of gene pathways in a particular disease. Experiments were conducted on both synthetic data and real world data. Results on a published diabetes gene expression dataset and a list of predefined pathways from KEGG identified the OXPHOS pathway, which is involved in oxidative phosphorylation in mitochondria, and other mitochondria-related pathways as deregulated in diabetes patients. Our results support the previously reported notion that mitochondrial dysfunction is an important event in insulin resistance and type 2 diabetes.
Conclusion
Our experimental results suggest that MCM-test can be successfully used in pathway level differential analysis of gene expression datasets. This approach also provides a new solution to the general problem of measuring the difference between two groups of data, which is an essential problem in most areas of research.
Background
Current microarray technologies conduct simultaneous studies of gene expression data of thousands of genes under different conditions. In most of these studies, expression data are analyzed using various standard statistical methods to identify a list of genes responsible for a particular condition. However, in these approaches, interplay among genes is not taken into account. Since organisms behave as complex systems, functioning through chemical networks and interaction of various molecules (also known as pathways), this interplay of genes may provide invaluable insight to the understanding of various diseases. Thus, along with the efforts to identify the individual genes that play vital roles in a particular disease, there is a growing interest in identifying the roles of different pathways in such diseases.
A biological pathway is a group of related genes coding for proteins that interact with each other to perform a biological process. According to the biological processes they are involved in, pathways can be divided into several categories, such as metabolic pathways and regulatory pathways. Metabolic pathways are series of chemical reactions occurring within a cell, catalyzed by enzymes, resulting in either the formation of a metabolic product to be used or stored by the cell, or the initiation of another metabolic pathway. Regulatory pathways represent protein-protein interactions.
During the past few years, many signaling and metabolic pathways have been discovered experimentally and have been integrated into pathway databases, such as KEGG [1] and Biocarta [2]. Various statistical techniques have been developed to analyze microarray expression data for the relevance of predefined pathways to a particular disease. These techniques include gene set enrichment analysis [3, 4], pathway level analysis of gene expression using singular value decomposition by Tomfohr et al. [5], and hypothesis testing [6] by Tian et al. These approaches are reviewed in detail in the related works section.
Generally speaking, these approaches can be divided into two categories:

Conduct statistical differential analysis at the individual gene level, and integrate the result statistics of the genes in the same pathway;

Obtain activity level indices of each pathway for different sample groups and conduct differential analysis of these indices.
For the first category, when the statistics at the individual gene level miss significant genes, the effectiveness of the pathway analysis suffers. An example is given in the later part of this section. For the second category, extracting pathway activity level indices from gene expression data may cause loss of information.
Diabetes is a group of diseases characterized by high levels of blood glucose resulting from defects in insulin production, insulin action, or both. It is one of the most common diseases, affecting 18.2 million people in the United States, or 6.3% of the population [7]. Hence, identifying active pathways in diabetes is a critical task for understanding this disease. Several pathway analysis works have been proposed in this area [3, 5, 6].
In gene set enrichment analysis (GSEA) [3], a differential statistic is first calculated for each gene from its expression data in two different groups of samples. The genes are then ordered according to the statistic values. A running sum of weights is calculated over the ordered list for a particular pathway. The maximum value of this running sum is called the enrichment score of that pathway. To measure the significance of this score, a null distribution of enrichment scores is generated by permuting the sample labels. This approach falls into the first category stated previously, i.e., statistical analysis at the individual gene level is performed, followed by an integration of these statistics over the genes in the same pathway.
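The running-sum idea described above can be illustrated with a minimal Python sketch. The hit and miss weights used here (`sqrt((N - G) / G)` for a pathway gene and `-sqrt(G / (N - G))` otherwise, with N ranked genes and G pathway genes) are an assumption about the original GSEA weighting and are not stated in this paper; the function name `enrichment_score` and the gene labels are hypothetical.

```python
import math
from itertools import accumulate


def enrichment_score(ranked_genes, pathway):
    """Maximum of a running sum over a ranked gene list.

    Assumption (not stated in this paper): a pathway gene adds
    sqrt((N - G) / G) and any other gene subtracts sqrt(G / (N - G)),
    so the running sum returns to roughly zero at the end of the list.
    """
    members = set(pathway)
    n_total = len(ranked_genes)
    n_hits = sum(1 for gene in ranked_genes if gene in members)
    if n_hits == 0 or n_hits == n_total:
        raise ValueError("pathway must cover some, but not all, ranked genes")
    hit = math.sqrt((n_total - n_hits) / n_hits)
    miss = -math.sqrt(n_hits / (n_total - n_hits))
    steps = [hit if gene in members else miss for gene in ranked_genes]
    return max(accumulate(steps))


# Hypothetical ranking: genes already ordered by their differential statistic.
ranked = ["a", "b", "c", "d", "e", "f"]
es_top = enrichment_score(ranked, {"a", "b"})     # pathway genes ranked first
es_bottom = enrichment_score(ranked, {"e", "f"})  # pathway genes ranked last
```

A pathway whose genes cluster at the top of the ranking obtains a large score, while one whose genes sit at the bottom obtains a score near zero. Significance would still have to be assessed by permuting the sample labels, as described in the text.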
In [6], a hypothesis testing framework for pathway differential analysis is proposed. The t-test and the Wilcoxon rank test are recommended to measure the difference in expression of a single gene between two groups of samples. This statistic is then accumulated over all genes in a particular pathway and standardized by the total number of genes in the pathway. The significance of the result is interpreted by rejecting two null hypotheses, each with a null population generated by permuting either sample labels or gene indices. This approach also belongs to the first category above, since statistical analysis at the individual gene level is still required for the pathway analysis.
In [5], singular value decomposition is used to obtain pathway activity levels from the gene expression matrix. A t-test is applied to the pathway activity levels of the two sample groups to measure the difference. The significance of the measurement is again obtained by permuting the sample labels. In this approach, no differential analysis at the individual gene level is required. However, the pathway activity levels must be extracted before the differential analysis. Since only the first eigenvector of the singular value decomposition is used in this extraction, some of the expression information is lost. This approach belongs to the second category stated above.
An example of a five-gene pathway
Gene ID  S_{1} (five samples)  S_{2} (five samples)  CM d-value  p-value (CM-test)  p-value (t-test)  p-value (rank-sum test)
1  750, 559, 649, 685, 636  310, 359, 135, 97, 178  1  0.001  0.000  0.008
2  391, 379, 268, 323, 380  774, 506, 416, 468, 449  1  0.005  0.029  0.008
3  598, 424, 695, 451, 141  342, 260, 266, 229, 234  0.904  0.018  0.077  0.152
4  233, 216, 193, 394, 327  436, 980, 363, 424, 416  0.905  0.017  0.071  0.015
5  305, 221, 241, 183, 158  201, 176, 189, 177, 250  0.812  0.143  0.448  0.693
In this paper, we propose an innovative fuzzy-set-theory-based approach for differential analysis of gene pathways and apply it to identify significant pathways for diabetes. In our proposed MCM-test, instead of identifying individual genes first, the differential analysis is done directly at the pathway level, without any differential statistic for individual genes. All expression values of the genes that belong to a pathway for a particular patient are treated as a vector. The intuition behind this is that the genes of each patient interplay with each other. MCM-test does not extract activity levels of pathways either. This keeps the maximum amount of information for the pathway differential analysis. Moreover, the fuzzy concept makes the approach more tolerant to noise in individual data items.
Results
To investigate our approach, we conducted experiments on both synthetic data and real world data. We first conducted a series of experiments on synthetic datasets to characterize the MCM d-value. We then applied MCM-test to the real world diabetes dataset analyzed by Tomfohr et al. [5] and by GSEA [3]. Results on the real world diabetes data identified several pathways that were deregulated in diabetes patients. The top three pathways identified were related to mitochondrial functions, in accordance with previous diabetes studies. Mitochondrial dysfunction is known to be related to insulin resistance and type 2 diabetes. Our data suggest that the method can be successfully used in pathway level differential analysis of gene expression datasets.
Relationship between MCM d-value and mean difference of the distributions
 1. We generated 17 values from a Gaussian distribution N(μ, σ), where μ is the mean and σ is the variance, to use as gene expression data. The number 17 was chosen to mimic the real world diabetes dataset analyzed in this paper.
 2. We repeated Step 1 100 times to obtain expression data for 100 genes.
 3. We generated 17 values from a Gaussian distribution N(μ + x, σ), with x = 0 initially.
 4. We repeated Step 3 100 times.
 5. We analyzed these 100 pairs of sets of values with MCM-test and obtained the d-value.
 6. We repeated Steps 1 to 5 1000 times and averaged the d-values over all iterations.
 7. We repeated Steps 1 to 6 for each x: x = 0, 20, 40, 60, 80, 100, 120, 140, 160, and 180.
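The simulation protocol in Steps 1 to 7 can be sketched in Python as follows. This is only an illustration under stated assumptions: the values `mu=500.0` and `sigma=100.0` are hypothetical (the text does not give them), `random.gauss` interprets its second argument as a standard deviation, and `placeholder_statistic` is a simple stand-in, not the MCM d-value.

```python
import random
from statistics import fmean


def simulate_pathway(rng, mu, sigma, shift, n_genes=100, n_samples=17):
    """Steps 1-4: one synthetic 100-gene pathway under two conditions."""
    cond1 = [[rng.gauss(mu, sigma) for _ in range(n_samples)]
             for _ in range(n_genes)]
    cond2 = [[rng.gauss(mu + shift, sigma) for _ in range(n_samples)]
             for _ in range(n_genes)]
    return cond1, cond2


def placeholder_statistic(cond1, cond2):
    """Stand-in for Step 5 (NOT the MCM d-value): |difference of grand means|."""
    mean1 = fmean(value for gene in cond1 for value in gene)
    mean2 = fmean(value for gene in cond2 for value in gene)
    return abs(mean1 - mean2)


def average_statistic_by_shift(statistic, shifts, iterations,
                               mu=500.0, sigma=100.0, seed=0):
    """Steps 6-7: average the statistic over repeated simulations per shift x."""
    rng = random.Random(seed)
    averages = {}
    for shift in shifts:
        values = [statistic(*simulate_pathway(rng, mu, sigma, shift))
                  for _ in range(iterations)]
        averages[shift] = fmean(values)
    return averages


shifts = list(range(0, 200, 20))  # x = 0, 20, ..., 180
# The paper uses 1000 iterations; 5 keeps this illustration fast.
averages = average_statistic_by_shift(placeholder_statistic, shifts, iterations=5)
```

Passing a function that computes the MCM d-value in place of `placeholder_statistic` would reproduce the structure of the experiment without changing the rest of the loop.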
Impact of population size on standard deviation of MCM d-value
Relationship between MCM d-value and empirical p-value
 1. We generated 15000 pairs of sets, each set with 15 values from the standard normal distribution.
 2. From these 15000 pairs of sets, we randomly selected 100 pairs of sets to simulate the expression data of a pathway with 100 genes under two conditions, and calculated the d-value for this pathway. Since the previous experiment showed that a data size of 8000 is required to obtain a stable standard deviation of the d-value, this process was repeated 10000 times.
 3. For each pathway generated above with d-value D, we calculated the empirical p-value as (n + 1)/10001, where n is the number of d-values generated above that are equal to or greater than D. The relationship between the d-value and the p-value is shown in Figure 3.
In Figure 3 we can see that as the d-value increases, the p-value decreases. In particular, when the d-value is greater than 0.809, we have p-value ≤ 0.05.
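The empirical p-value described in Step 3 can be written as a short Python sketch. The function name `empirical_p_value` and the small list of null d-values are hypothetical; with 10000 simulated d-values the denominator becomes 10001, as in the text.

```python
def empirical_p_value(observed_d, null_d_values):
    """(n + 1) / (N + 1), where n counts null d-values >= the observed one."""
    n = sum(1 for d in null_d_values if d >= observed_d)
    return (n + 1) / (len(null_d_values) + 1)


# Tiny hypothetical null distribution, for illustration only.
null_d = [0.2, 0.5, 0.9]
p_mid = empirical_p_value(0.5, null_d)   # two null values are >= 0.5
p_high = empirical_p_value(1.0, null_d)  # no null value is >= 1.0
```

Adding one to both numerator and denominator keeps the estimate strictly positive, so a p-value of exactly zero is never reported from a finite number of simulations.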
Impact of number of samples on error rate of MCM-test d-value
Analyzing the diabetes dataset with MCM-test
The diabetes dataset contains the transcriptional profiles of skeletal muscle biopsies of diabetic and normal individuals. In the expression dataset, for each gene, there are 17 expression values from subjects with type 2 diabetes (DM2), 17 expression values from subjects with normal glucose tolerance (NGT) and 10 expression values from subjects with impaired glucose tolerance (IGT). For our analysis, we used only the 34 expression values from subjects with type 2 diabetes and subjects with normal glucose tolerance. Furthermore, we used about 150 pathways obtained from KEGG (Kyoto Encyclopedia of Genes and Genomes) [1].
Expression values below 100 were considered to be the result of noise. To minimize the effect of these low values, we included only the genes that have at least one expression value greater than 100. Out of the 22,283 genes in the dataset, 10,983 met this filtering criterion. The d-value for each pathway was calculated as described in the Methods section. The p-value for each pathway was calculated using a permutation test. We permuted the genes 1000 times, each time selecting the same number of genes as in the pathway under consideration, and calculated the d-value of each random gene set thus obtained. The p-value for the pathway was the fraction of the 1000 permutations in which the d-value of the random gene set equaled or exceeded the original d-value.
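The filtering and gene-permutation steps described above can be sketched in Python as follows. The helper names, the toy expression values and `placeholder_statistic` are hypothetical and serve only to show the control flow; the placeholder is not the MCM d-value, and a real analysis would pass a d-value function instead.

```python
import random


def filter_genes(expression, threshold=100.0):
    """Keep genes with at least one expression value above the threshold."""
    return {gene: values for gene, values in expression.items()
            if max(values) > threshold}


def gene_permutation_p_value(expression, pathway_genes, statistic,
                             n_perm=1000, seed=0):
    """Fraction of random gene sets whose statistic >= the pathway's statistic."""
    rng = random.Random(seed)
    all_genes = sorted(expression)
    pathway = [gene for gene in pathway_genes if gene in expression]
    observed = statistic([expression[gene] for gene in pathway])
    hits = 0
    for _ in range(n_perm):
        # Random gene set with the same number of genes as the pathway.
        sampled = rng.sample(all_genes, len(pathway))
        if statistic([expression[gene] for gene in sampled]) >= observed:
            hits += 1
    return hits / n_perm


def placeholder_statistic(rows):
    """Stand-in (NOT the MCM d-value): first half of each row is condition 1."""
    total = 0
    for row in rows:
        half = len(row) // 2
        total += abs(sum(row[half:]) - sum(row[:half]))
    return total


# Hypothetical toy data: two samples per condition for each gene.
raw = {
    "g1": [50, 60, 70, 80],
    "g2": [150, 160, 300, 310],
    "g3": [120, 130, 125, 135],
    "g4": [200, 210, 205, 215],
    "g5": [90, 95, 92, 99],
}
kept = filter_genes(raw)
p_single = gene_permutation_p_value(kept, ["g2"], placeholder_statistic)
p_all = gene_permutation_p_value(kept, ["g2", "g3", "g4"], placeholder_statistic)
```

Sampling random gene sets of the same size as the pathway asks whether the pathway's statistic is unusual compared with arbitrary groups of genes measured on the same subjects.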
The results from MCM-test on the diabetes dataset
Pathway name  MCM-test p-value  No. of gene hits in dataset  Actual no. of genes in pathway  Percentage of gene hits
OXPHOS  0.04995  106  114  92.98 
c20 U133 probes  0.013  215  270  79.62 
human mitoDB  0.029  436  594  73.4 
c33 U133 probes  0.021  245  362  67.67 
MAP00252  0.022  23  35  65.71 
c34 U133  0.012  274  452  60.62 
c21 U133  0.026  166  287  57.84 
c8 U133  0.013  164  288  56.94 
Using our method, we identified the deregulation of mitochondrial pathways in the dataset, which is in accordance with previous studies. The first cluster of genes involved was from the mitochondrial OXPHOS pathway. The OXPHOS pathway was well represented in the data, with 93% of its genes (106 out of 114) present in the dataset. Oxidative phosphorylation in mitochondria provides energy in the form of ATP. In muscle cells, mitochondrial dysfunction has been linked to insulin resistance and type 2 diabetes [8–10]. The involvement of genes encoded by mitochondria in insulin resistance is also well known. The depletion of cellular mitochondrial DNA has been shown to cause insulin resistance in an experimental model [11]. Reduced mitochondrial oxidative phosphorylation leads to the accumulation of intracellular triglycerides, resulting in insulin resistance. The next two clusters, c20_U133, which is a manually curated cluster of genes coregulated with OXPHOS [3], and the mitochondrial gene cluster human_mitoDB_6_2002, reinforce that muscle mitochondrial dysfunction is linked to type 2 diabetes.
Conclusion
In this paper, we propose an innovative fuzzy-set-theory-based approach for differential analysis of gene pathways and apply it to identify significant pathways for diabetes. Experiments were conducted on both synthetic datasets and a real world dataset. Results on the real world diabetes data identified several gene pathways. Of note, our most significant pathways were related to mitochondrial function, which is well known to be involved in insulin resistance and type 2 diabetes. This approach can be used not only for pathway analysis of other diseases but also in other domains. As measuring the difference between two groups of data is essential to most research, our approach provides a solution to this general and critical problem.
Methods
In [12–14], we proposed two fuzzy-set-theory-based methods, CM-test and FM-test, to identify the individual genes that are expressed significantly differently under two conditions. In this paper, we extend the cluster misclassification concept to a multidimensional space and propose a new approach for pathway analysis, the Multidimensional Cluster Misclassification test (MCM-test). Compared with CM-test and FM-test, MCM-test looks for a group of genes that is significant under two conditions instead of identifying significant individual genes. In this approach, the expression values of a group of Q genes for a particular sample under a particular condition are considered as a Q-dimensional vector. The differential analysis is done at the vector level, without any differential statistic for individual genes.
In this section, we first introduce the concept of the fuzzy membership function of vectors, then the details of MCM-test.
Fuzzy membership function of vectors
For vectors with n dimensions, the fuzzy membership function is (n+1)-dimensional, with the additional dimension measuring the fuzzy membership.
Our approach
Consider a pathway that consists of Q genes. The problem is to determine how differently these Q genes are expressed under two conditions. To answer this question, we consider the expression values of the Q genes for a particular sample under a particular condition as a Q-dimensional vector. The expression values of a pathway under one condition j can then be modeled as a set S_{j} (j = 1, 2) of points in a Q-dimensional space. The idea is to consider the two sets of points S_{1} and S_{2} as samples from two different fuzzy sets. We then examine the membership value of each element with respect to these two fuzzy sets and determine the d-value between the two sets of samples.
Here N_{j} is the number of samples in S_{j}, and ${\overrightarrow{x}}_{n}$ is the vector formed by the expression values of the nth sample under condition j.
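The vector representation described above amounts to reading the expression matrix sample by sample rather than gene by gene. A minimal Python sketch follows; the function name `pathway_vectors` and the gene labels are hypothetical, and the numbers are the first two S_{1} and S_{2} values of genes 1 and 2 from the five-gene example table.

```python
def pathway_vectors(expression, pathway_genes, sample_indices):
    """Return one Q-dimensional vector per sample (Q = number of pathway genes)."""
    return [[expression[gene][i] for gene in pathway_genes]
            for i in sample_indices]


# Each gene maps to its values over four samples:
# indices 0-1 are condition 1, indices 2-3 are condition 2.
expression = {
    "gene1": [750, 559, 310, 359],
    "gene2": [391, 379, 774, 506],
}
genes = ["gene1", "gene2"]
s1 = pathway_vectors(expression, genes, [0, 1])  # points of S_1, here Q = 2
s2 = pathway_vectors(expression, genes, [2, 3])  # points of S_2
```

Each element of `s1` or `s2` is one point in the Q-dimensional space, so the two lists play the role of the sets S_{1} and S_{2}.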
We denote the misclassified elements in S_{1} with respect to FS_{2} as M_{FS2}(S_{1}) = {$\overrightarrow{e}$ | $\overrightarrow{e}$ ∈ S_{1} ∧ m($\overrightarrow{e}$, FS_{2}) > 0}. Similarly, we denote the misclassified elements in S_{2} with respect to FS_{1} as M_{FS1}(S_{2}) = {$\overrightarrow{f}$ | $\overrightarrow{f}$ ∈ S_{2} ∧ m($\overrightarrow{f}$, FS_{1}) > 0}. We denote the number of misclassified elements in S_{1} and S_{2} with respect to each other as #M(S_{1}, S_{2}) = |M_{FS2}(S_{1})| + |M_{FS1}(S_{2})|. We then define the convergence degree (c-value) of S_{1} and S_{2} as a linear interpolation of the number of misclassified elements and the mutual misclassification degrees as follows.
c(S_{1}, S_{2}) = β * T_{1} + (1 − β) * T_{2} (5)
Then, the divergence between S_{1} and S_{2} can be calculated using the following:
d(S_{1}, S_{2}) = 1 − c(S_{1}, S_{2}) (8)
In our method, to reduce the effect of outliers, we used the α-trimmed mean instead of the ordinary mean. The α-trimmed mean is calculated by ordering the sample under consideration and removing the α smallest and the α largest values from the ordered sample. The mean of the remaining values is the α-trimmed mean of the sample. For instance, for the sample (3, 17, 25, 29, 23, 53, 22, 31, 45, 81, 90, 1), the 2-trimmed mean is calculated by removing the smallest two values (1, 3) and the largest two values (81, 90). The mean of the remaining values, 30.625, is the 2-trimmed mean of the sample.
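The worked example above can be checked with a few lines of Python. The function name `alpha_trimmed_mean` is hypothetical; the logic follows the description in the text, with α interpreted as the number of values removed from each end.

```python
def alpha_trimmed_mean(values, alpha):
    """Mean after dropping the alpha smallest and alpha largest values."""
    if alpha < 0 or 2 * alpha >= len(values):
        raise ValueError("alpha must leave at least one value in the sample")
    ordered = sorted(values)
    kept = ordered[alpha:len(ordered) - alpha]
    return sum(kept) / len(kept)


sample = [3, 17, 25, 29, 23, 53, 22, 31, 45, 81, 90, 1]
trimmed = alpha_trimmed_mean(sample, 2)  # drops (1, 3) and (81, 90)
```

After trimming, the eight remaining values sum to 245, which gives 245 / 8 = 30.625, the value stated in the text.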
Analysis of method
One dimension: a special case
In this section we analyze MCM-test to provide its theoretical justification. For the sake of clarity, we start with one dimension, the simplest special case of the multidimensional setting. The one-dimensional MCM-test corresponds to the differential analysis of a single gene. Let D_{1} and D_{2} denote the distributions of the expression values of that gene under the two conditions.
MCM-test uses the probability distribution functions of these two distributions as their respective fuzzy membership functions, and takes advantage of the membership differences of "misclassified" samples. As shown in Figure 6, a sample x_{1} of D_{2} has a higher degree of belonging to D_{1}, and is thus "misclassified" in MCM-test. Its misclassification degree is aggregated with those of all the other misclassified samples of D_{2}. Similarly, x_{2} of D_{1} has a higher degree of belonging to D_{2}, and is thus also misclassified. Its misclassification degree is aggregated with those of all the other misclassified samples of D_{1}.
MCM-test collects all the misclassification degrees and the number of misclassified samples and forms them into an index that measures the divergence of these two distributions. As the mean difference between the two distributions increases, the number of misclassified samples and the aggregated misclassification degree both decrease. Thus the convergence degree c decreases and, since d = 1 − c, the MCM d-value increases.
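The one-dimensional case can be illustrated with a short Python sketch. It assumes, for illustration only, that each membership function is a normal density fitted with the ordinary sample mean and standard deviation (the paper uses a trimmed mean), and that the misclassification degree of a sample is the amount by which its membership in the other set exceeds its membership in its own set. The function name and the data are hypothetical, and the sketch stops short of computing the d-value itself.

```python
from statistics import NormalDist, fmean, stdev


def misclassification_summary(d1, d2):
    """Count misclassified samples and sum their misclassification degrees.

    Assumption: membership functions are normal densities fitted to each
    sample; a sample is misclassified when the other density is higher.
    """
    fs1 = NormalDist(fmean(d1), stdev(d1))
    fs2 = NormalDist(fmean(d2), stdev(d2))
    degrees = []
    for x in d1:
        diff = fs2.pdf(x) - fs1.pdf(x)  # sample of D1 judged against D2
        if diff > 0:
            degrees.append(diff)
    for x in d2:
        diff = fs1.pdf(x) - fs2.pdf(x)  # sample of D2 judged against D1
        if diff > 0:
            degrees.append(diff)
    return len(degrees), sum(degrees)


# Hypothetical expression values; the second set is the first shifted by x.
base = [400.0, 450.0, 500.0, 550.0, 600.0]
count_near, degree_near = misclassification_summary(base, [v + 50.0 for v in base])
count_far, degree_far = misclassification_summary(base, [v + 500.0 for v in base])
```

With a small shift, samples lying on the far side of the midpoint between the two means are misclassified. With a large shift, no sample is misclassified and the aggregated degree is zero, which matches the trend described in the text.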
Two and higher dimensions
Distributions of higher dimensions are hard to visualize, but the idea of the misclassification test stays the same. In a multidimensional space, each sample is a vector, and the misclassification degrees of these vectors are used to measure the divergence of their distributions.
List of abbreviations
MCM-test: multidimensional cluster misclassification test
CM-test: cluster misclassification test
FM-test: fuzzy membership test
GSEA: gene set enrichment analysis
Declarations
Acknowledgements
We would like to thank Ying Wang and Togba Liberty for generating some of the figures used in this paper. This work was supported by the Agriculture Experiment Station at the University of the District of Columbia (Project No.: DC0LIANG; Accession No.: 0203877).
This article has been published as part of BMC Bioinformatics Volume 9 Supplement 6, 2008: Symposium of Computations in Bioinformatics and Bioscience (SCBB07). The full contents of the supplement are available online at http://www.biomedcentral.com/14712105/9?issue=S6.
Authors’ Affiliations
References
 1. Kyoto Encyclopedia of Genes and Genomes. [http://www.genome.jp/kegg/]
 2. Biocarta. [http://www.biocarta.com/]
 3. Mootha VK, Lindgren CM, Eriksson KF, Subramanian A, Sihag S, Lehar J, Puigserver P, Carlsson E, Ridderstrale M, Laurila E, Houstis N, Daly MJ, Patterson N, Mesirov JP, Golub TR, Tamayo P, Spiegelman B, Lander ES, Hirschhorn JN, Altshuler D, Groop LC: PGC-1alpha-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nature Genet. 2003, 34: 267-273. 10.1038/ng1180.
 4. Subramanian AS, Tamayo P, Mootha VK, Mukherjee S, Ebert BL, Gillette MA, Paulovich A, Pomeroy SL, Golub TR, Lander ES, Mesirov JP: Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles. PNAS. 2005, 102: 15545-15550. 10.1073/pnas.0506580102.
 5. Tomfohr J, Lu J, Kepler TB: Pathway level analysis of gene expression using singular value decomposition. BMC Bioinformatics. 2005, 6: 225. 10.1186/1471-2105-6-225.
 6. Tian L, Greenberg SA, Kong SW, Altschuler J, Kohane IS, Park PJ: Discovering statistically significant pathways in expression profiling studies. PNAS. 2005, 102: 13544-13549. 10.1073/pnas.0506577102.
 7. American Diabetes Association Website. [http://www.diabetes.org/]
 8. Morino K, Petersen KF, Shulman GI: Molecular mechanisms of insulin resistance in humans and their potential links with mitochondrial dysfunction. Diabetes. 2006, 55 (Suppl 2): S9-S15. 10.2337/db06-S002.
 9. Takamura T, Honda M, Sakai Y, Ando H, Shimizu A, Ota T, Sakurai M, Misu H, Kurita S, Matsuzawa-Nagata N, Uchikata M, Nakamura S, Matoba R, Tanino M, Matsubara K, Kaneko S: Gene expression profiles in peripheral blood mononuclear cells reflect the pathophysiology of type 2 diabetes. Biochem Biophys Res Commun. 2007, 361 (2): 379-84. 10.1016/j.bbrc.2007.07.006.
 10. Skov V, Glintborg D, Knudsen S, Jensen T, Kruse TA, Tan Q, Brusgaard K, Beck-Nielsen H, Højlund K: Reduced expression of nuclear-encoded genes involved in mitochondrial oxidative metabolism in skeletal muscle of insulin-resistant women with polycystic ovary syndrome. Diabetes. 2007, 56 (9): 2349-55. 10.2337/db07-0275.
 11. Park SY, Lee W: The depletion of cellular mitochondrial DNA causes insulin resistance through the alteration of insulin receptor substrate-1 in rat myocytes. Diabetes Res Clin Pract. 2007, 77 (Suppl 1): S165-S171. 10.1016/j.diabres.2007.01.051.
 12. Liang LR, Lu S, Lu Y, Dhawan P, Kumar D: CM-test: An innovative divergence measurement and its application in diabetes gene expression data analysis. Proceedings of the IEEE International Conference on Granular Computing: May 2006; Atlanta, Georgia, USA. 262-268.
 13. Liang LR, Lu S, Wang X, Lu Y, Mandal V, Patacsil D, Kumar D: FM-test: A fuzzy-set-theory-based approach to differential gene expression data analysis. BMC Bioinformatics. 2006, 7 (Suppl 4): S7. 10.1186/1471-2105-7-S4-S7.
 14. Lu Y, Lu S, Liang LR, Kumar D: FM-test: A fuzzy-set-theory based approach for discovering diabetes genes. Proceedings of the IEEE International Symposium of Computations in Bioinformatics and Bioscience: June 2006; Hangzhou, Zhejiang, P.R. China. 48-55.
Copyright
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.