MCM-test: a fuzzy-set-theory-based approach to differential analysis of gene pathways

Background Gene pathway can be defined as a group of genes that interact with each other to perform some biological processes. Along with the efforts to identify the individual genes that play vital roles in a particular disease, there is a growing interest in identifying the roles of gene pathways in such diseases. Results This paper proposes an innovative fuzzy-set-theory-based approach, Multi-dimensional Cluster Misclassification test (MCM-test), to measure the significance of gene pathways in a particular disease. Experiments have been conducted on both synthetic data and real world data. Results on published diabetes gene expression dataset and a list of predefined pathways from KEGG identified OXPHOS pathway involved in oxidative phosphorylation in mitochondria and other mitochondrial related pathways to be deregulated in diabetes patients. Our results support the previously supported notion that mitochondrial dysfunction is an important event in insulin resistance and type-2 diabetes. Conclusion Our experiments results suggest that MCM-test can be successfully used in pathway level differential analysis of gene expression datasets. This approach also provides a new solution to the general problem of measuring the difference between two groups of data, which is one of the most essential problems in most areas of research.


Background
Current microarray technologies conduct simultaneous studies of gene expression data of thousands of genes under different conditions. In most of these studies, expression data are analyzed using various standard statistical methods to identify a list of genes responsible for a particular condition. However, in these approaches, interplay among genes is not taken into account. Since organisms behave as complex systems, functioning through chemical networks and interaction of various molecules (also known as pathways), this interplay of genes may provide invaluable insight to the understanding of various from Symposium of Computations in Bioinformatics and Bioscience (SCBB07) Iowa City, Iowa, USA. 13-15 August 2007 diseases. Thus, along with the efforts to identify the individual genes that play vital roles in a particular disease, there is a growing interest in identifying the roles of different pathways in such diseases.
Biological pathway is a group of related genes coding for proteins that interact with each other to perform some biological processes. According to the biological processes they are involved with, pathways can be divided into several categories, such as metabolic pathways and regulatory pathways. Metabolic pathways are series of chemical reactions occurring within a cell, catalyzed by enzymes, resulting in either the formation of a metabolic product to be used or stored by the cell, or the initiation of another metabolic pathway. Regulatory pathways represent proteinprotein interactions.
During the past few years, many signaling and metabolic pathways have been discovered experimentally and have been integrated into pathway databases, such as KEGG [1] and Biocarta [2]. Various statistical techniques have been developed to analyze microarray expression data for the relevance of predefined pathways to a particular disease. These techniques include gene set enrichment analysis [3,4], pathway level analysis of gene expression using singular value decomposition by Tomfohr et al. [5], and hypothesis testing [6] by Tian et al. These approaches are reviewed in detail in the related works section.
Generally speaking, these approaches can be divided into two categories: • Conduct statistical differential analysis at the individual gene level, and integrate the result statistics of the genes in the same pathway; • Obtain activity level indices of each pathway for different sample groups and conduct differential analysis of these indices.
For the first category, when the statistics at individual gene level miss significant genes, the effectiveness of the pathway analysis will be affected. An example is given in the later part of this section. For the second approach, extracting pathway activity level indices from gene expression data may cause loss of information.
Diabetes is a group of diseases characterized by high levels of blood glucose resulting from defects in insulin production, insulin action, or both. It is one of the most common diseases, affecting 18.2 million people in the United States, or 6.3% of the population [7]. Hence, identifying active pathways in diabetes is a critical task for understanding this disease. Several pathway analysis works have been proposed in this area [3,5,6].
In gene set enrichment analysis (GSEA) [3], a differential statistic is calculated first for each gene from their expression data of two different groups of samples. Then the genes are ordered according to the statistic values. A running sum of weights is calculated from the ordered list for a particular pathway. The maximum value of this running sum is called the enrichment score of that pathway. To measure the significance of this score, a null distribution of enrichment scores is generated by permuting the sample labels. This approach falls into the first category stated previously, i.e., statistical analysis at individual gene level is performed followed by an integration of these statistics of genes in the same pathway.
In [5], a hypothesis testing framework for pathway differential analysis is proposed. T-test and Wilcoxon rank test are recommended to measure the difference of expressions of a single gene between two groups of samples. Then this statistic is accumulated over each gene in a particular pathway and standardized by the total number of genes in this pathway. The significance of the result is then interpreted by rejecting two null hypotheses, each with a null population generated by permuting sample labels or gene indices. This approach also belongs to the first category above. Statistical analysis at individual gene level is still required for the pathway analysis in this approach.
In [6], singular value decomposition is used to obtain pathway activity levels from the gene expression matrix. Ttest is applied to the pathway activity levels of the two different sample groups to measure the difference. Significance of the measurement is also obtained by permuting the sample labels. In this approach, no differential analysis at individual gene level is required. However, an extraction of pathway activity level prior to the differential analysis is required. During this extraction process, since only the first eigenvector of singular value decomposition is used, some information of expressions is lost. This approach belongs to the second category stated above.
As discussed above, either t-test or rank sum test is used as a core step by [3,6] to identify individual genes which are expressed differently from two different sample groups. Thus these methods inevitably inherit the disadvantage of t-test and rank sum test. While the t-test is very sensitive to extreme values and cannot distinguish two sets with close means even though the two sets are significantly different from each others, the rank sum test is not sensitive to absolute values. In turn, those pathways contain genes which can not be identified by t-test or rank sum test but actually are significantly differently expressed in two different sample groups will be affected. For example, as showed in Table 1, the expressions of Gene 3 are significantly different under two conditions. However this gene was not identified by t-test. Thus, a pathway involving this gene is less likely to be identified by the first category of analysis that uses t-test at the gene level.
In this paper, we propose an innovative fuzzy-set-theorybased approach for differential analysis of gene pathways and apply it on identifying significant pathways for diabetes. In our proposed MCM-test, instead of identifying individual genes first, the differential analysis is done directly at the pathway level without individual gene differential statistic. All expression values of genes which belong to a pathway of a particular patient are treated as a vector. The intuition behind this is based on the fact that genes for each patient interplay with each other. MCMtest does not extract activity level of pathways either. This allows keeping the maximum amount of information for the pathway differential analysis. Moreover, the fuzzy concept makes the approach more tolerant to individual data item noise.

Results
To investigate our approach, we conducted experiments on both synthetic data and real world data. We first conducted a series of experiments on synthetic datasets to find the characteristics of MCM d-value. We then used the MCM-test on the real world diabetes dataset analyzed by Tomfohr et al. [5] and GSEA [3]. Results on real world diabetes data identified several pathways that were deregulated in diabetes patients. The top three pathways identified were related to mitochondrial functions in accordance with previous diabetes studies. Mitochondrial dysfunction is known to be related to insulin resistance and type-2 diabetes. Our data suggests that the method can be successfully used in pathway level differential analysis of gene expression datasets.

Relationship between MCM d-value and mean difference of the distributions
Suppose two sets S 1 and S 2 are drawn from two different distributions, then a good divergence value will satisfy the following property: the less the overlap, the higher the dvalue. To validate that our MCM-test has this property, we performed the following steps: 1. generated 17 values from Gaussian distribution N (μ, σ), where μ is the mean and σ is the variance, to use as gene expression data. The number 17 was chosen to mimic the real world diabetes dataset used for the analysis in this paper.

repeated
Step 1 for 100 times to get expression data of 100 genes 3. generated 17 values from Gaussian distribution N (μ + x, σ), with x = 0 at this time.  Two datasets are generated from two distributions N (μ, σ) and N (μ + x, σ). As the mean difference, x, increases, the dvalue also increases. We also note that the error rate becomes stable after the size of the population becomes greater than 8000.

Relationship between MCM d-value and empirical p-value
Suppose two vectors S 1 and S 2 are drawn from same Normal distribution. What is the probability that the MCM dvalue of these vectors is greater than a particular D? Does the probability increases with the increase of D? To answer these questions, we studied the relationship between MCM d-value and empirical p-value as follows: 1. We generated 15000 pairs of sets, each set with 15 values from standard normal distribution.
2. From these 15000 pairs of sets, we randomly selected 100 pairs of sets to simulate expression data of a pathway with 100 genes under two conditions. We calculated dvalue for this pathway. Since we know that the data size required to obtain stable standard deviation of d-value is 8000 from the previous experiment, this process is repeated 10000 times.
3. For each pathway generated above with d-value D, we calculated the empirical p-value as n+1/10001, where n is the number of d-values generated above that are equal to or greater than D. The relationship between the d-value and p-value is shown in Figure 3.
In Figure 3 we can see that as the d-value increases, the pvalue decreases. In particular, when d-value is greater than 0.809, we have p-value ≤ 0.05.

Impact of number of samples on error rate of MCM-test dvalue
In order to understand the effect of the number of samples on error rate of MCM d-value, we generated datasets with different sample sizes. For each sample size, we generated 10000 datasets and calculated the corresponding 10000 d-values. The standard deviation of these d-values was calculated. This process is repeated 10 times and the average of the standard deviations is recorded as the error rate. The same is done for the other sample sizes. The relationship between number of samples and the error rate is shown in Relationship between MCM d-value and its empirical p-value Figure 3 Relationship between MCM d-value and its empirical p-value. As the d-value increases, the corresponding empirical p-value decreases.
Impact of number of permutation on the error rate of PDF of MCM d-value

Analyzing the diabetes dataset with MCM-test
The diabetes dataset contains the transcriptional profiles of smooth muscle biopsies of diabetic and normal individuals. In the expression dataset, for each gene, there are 17 expression values from subjects with type 2 diabetes (DM2), 17 expression values from subjects with normal glucose tolerance (NGT) and 10 expression values from subjects with impaired glucose tolerance (IGT). For our analysis, we only used the 34 expression values from subjects with type 2 diabetes and subjects with normal glucose tolerance. Furthermore, we used about 150 pathways obtained from KEGG (Kyoto Encyclopedia of Genes and Genomes) [1].
The expression values in the dataset which are too small, i.e., less than 100 are considered to be the result of noise. So, to minimize the effect of these low values, we only included the genes which have at least one of the expression values greater than 100. Out of the 22,283 genes in the dataset, 10,983 met the filtering criteria. The d-value for each pathway was calculated as described in the methodology section before. The p-value for the pathway was calculated using permutation test. We permuted the genes 1000 times, each time selecting the same number of genes as that of the pathway under consideration. We then calculated the d-value of each pathway obtained thus and the p-value for the pathway was the fraction of times the dvalues of the pathways obtained by 1000 permutation equaled or exceeded the original d-value.
The pathways are ordered in the ascending order of their p-values. The significant pathways, i.e., the pathways with p-value less than 0.05, are then ordered according to the percentage of the genes in the pathway which were represented in the dataset. Table 2 shows the result after sorting.
Using our method, we identified the deregulation of mitochondrial pathways in the dataset which is in accordance with previous studies. The first cluster of genes involved was from the mitochondrial OXPHOS pathway. The OXPHOS pathway was well represented in the data with 93% of genes (106 out of 114) present in the dataset. Oxidative phosphorylation in mitochondria provides energy in the form of ATP generation. In muscle cells, mitochondrial dysfunction has been linked to insulin resistance and type-2 diabetes [8][9][10]. The involvement of genes coded by mitochondria in insulin resistance is also well known. The depletion of cellular mitochondrial DNA has been shown to cause insulin resistance in experimental model [11]. Reduced mitochondrial oxidative phosphorylation leads to the accumulation of intracellular triglycerides resulting in insulin resistance. The next 2 clusters, c20_U133 which is a manually curated cluster of genes coregulated with OXPHOS [3] and the mitochondrial gene cluster human_mitoDB_6_2002 reinforce that muscle mitochondrial dysfunction is linked to type-2 diabetes.

Conclusion
In this paper, we propose an innovative fuzzy-set-theorybased approach for differential analysis of gene pathways and apply it on identifying significant pathways for diabetes. Experiments have been conducted on both synthetic datasets and real world dataset. Results on real world diabetes data identified several number of gene pathways. Of note our top significant pathways were related to mitochondrial function which is well known to be involved in insulin resistance and type-2 diabetes. This approach can be used not only for pathway analysis of other diseases but also for other domains. As measuring the difference of two groups of data are essential to most of researches, our approach provides a solution to this general and most critical problem.

Methods
In [12][13][14], we proposed two fuzzy-set-theory based methods, CM-test and FM-test, to identify the individual genes that expressed significant differences under two conditions. In this paper, we extended the cluster misclassification concept to a multi-dimensional space and propose a new approach for pathway analysis, this approach, the expression values of a group of Q genes for a particular sample under a particular condition are considered as a Q-dimension vector. The differential analysis is done at the vector level, without individual gene differential statistic.
In this section, we first introduce the concept of fuzzy membership function of vectors, then the details of MCMtest.

Fuzzy membership Function of Vectors
In fuzzy set theory, the degree for one variable to belong to a fuzzy set is defined by a function. For a vector which has two dimensions, the degree that it belongs to a set of vectors can be defined by a three-dimensional function, with the third dimension being the measurement of the membership. Figure 5 shows a sample fuzzy membership function for a vector (x, y).
For vectors with n dimensions, their fuzzy membership function will be n+1-dimensional, with one dimension measuring the fuzzy membership.

Our approach
Consider a pathway that consists of Q genes, the problem now is to determine how these Q genes are expressed differently under two conditions. To answer this question, we consider the expression values of the Q genes for a particular sample under a particular condition as a Q-dimension vector. Then the expression values of a pathway under one condition j can be modeled as set S j (j = 1, 2) of points in a Q-dimension space. The idea is to consider the two sets of points S 1 and S 2 as samples from two different fuzzy sets. We then examine the membership value of each element with respect to these two fuzzy sets and determine the d-value between the two sets of samples.
The mean of the expression values of set S j is: where, N j is the number of samples in S j , is vector made by the expression values of the n-th sample under condition j.
We then characterize set S j (j = 1, 2) by a fuzzy set FS j (j = 1, 2) whose fuzzy membership function is defined as: where, Given an element in S 1 , we calculate its element misclassification degree with respect to FS 2 as We denote the misclassified elements in S 1 with respect to e e e f f where, and Then, the divergence between S 1 and S 2 can be calculated using the following: In our method, to negate the effect of outliers, we used αtrimmed mean instead of normal mean. The α-trimmed For computational simplicity, an Epanechnikov function shown as following can be used instead of the Gaussian function of equation (2): where,

One dimension: a special case
In this section we analyze MCM-test for it theoretical justification. For the sake of clarity, we start with one dimension, the simplest and special case of multi-dimension. The one dimensional MCM-test corresponds to differential analysis of a single gene.
The visualization tells us that they are different as they cover different areas and have different shapes. The CMtest, which can be considered as a special case of the MCM-test differentiate them by measuring the differences on the Y axis, which is a combined result of the location difference together with the difference of the variances.
MCM-test uses the probability distribution functions of these two distributions as their fuzzy membership functions respectively, and takes advantage of the membership differences of "misclassified" samples. As shown in Figure One dimension Gaussian distributions, μ1 = 600, μ2 = 700, σ1 = 50, σ2 = 100
A sample fuzzy membership function of vector (x, y)

Figure 5
A sample fuzzy membership function of vector (x, y).
6, a sample x 1 of D 2 has a higher degree of belonging to D 1 , thus is "misclassified" in MCM-test. This misclassification degree is aggregated with all the other "misclassified" samples of D 2 that are misclassified. Similarly, x 2 of D 1 has a higher degree for D 2 , thus is also misclassified. This misclassification degree is also aggregated with all the other misclassified samples of D 1 .
MCM-test collects all the misclassified degrees and the number of misclassified samples and form them into an index to measure the divergence of these two distributes. With the mean difference between these two distributions increases, the number of misclassified samples, as well as the aggregated misclassification degree decreases. Thus the MCM d-value will decrease. Figure 7(a) illustrates samples of two distributions, each of which is a 2-D Gaussian function. In pathway analysis, the X and Y axis can be the expression data of two individual genes respectively. Figure 7(b) shows the probability density functions of these two distributions, which can be used as their fuzzy membership functions after multiplying a constant.

Two and higher dimensions
Distributions of higher dimensions are hard to visualize. But the idea of the misclassification test stays the same. In multi-dimension space, each sample is a vector. And their misclassification degrees are used to measure the divergence of their distributions.   BioMedcentral BMC Bioinformatics 2008, 9(Suppl 6):S16 http://www.biomedcentral.com/1471-2105/9/S6/S16