Volume 7 Supplement 4

## Symposium of Computations in Bioinformatics and Bioscience (SCBB06)

# FM-test: a fuzzy-set-theory-based approach to differential gene expression data analysis

- Lily R Liang†
^{1}, - Shiyong Lu†
^{2}Email author, - Xuena Wang
^{3}, - Yi Lu
^{2}, - Vinay Mandal
^{2}, - Dorrelyn Patacsil
^{4}and - Deepak Kumar
^{4}

**7(Suppl 4)**:S7

**DOI: **10.1186/1471-2105-7-S4-S7

© Liang et al; licensee BioMed Central Ltd 2006

**Published: **12 December 2006

## Abstract

### Background

Microarray techniques have revolutionized genomic research by making it possible to monitor the expression of thousands of genes in parallel. As the amount of microarray data being produced is increasing at an exponential rate, there is a great demand for efficient and effective expression data analysis tools. Comparison of gene expression profiles of patients against those of normal counterpart people will enhance our understanding of a disease and identify leads for therapeutic intervention.

### Results

In this paper, we propose an innovative approach, *fuzzy membership test* (FM-test), based on fuzzy set theory to identify disease associated genes from microarray gene expression profiles. A new concept of FM d-value is defined to quantify the divergence of two sets of values. We further analyze the asymptotic property of FM-test, and then establish the relationship between FM d-value and p-value. We applied FM-test to a diabetes expression dataset and a lung cancer expression dataset, respectively. Within the 10 significant genes identified in diabetes dataset, six of them have been confirmed to be associated with diabetes in the literature and one has been suggested by other researchers. Within the 10 significantly overexpressed genes identified in lung cancer data, most (eight) of them have been confirmed by the literatures which are related to the lung cancer.

### Conclusion

Our experiments on synthetic datasets show that FM-test is effective and robust. The results in diabetes and lung cancer datasets validated the effectiveness of FM-test. FM-test is implemented as a Web-based application and is available for free at http://database.cs.wayne.edu/bioinformatics.

## Background

Microarray techniques have revolutionized genomic research by making it possible to monitor the expression of thousands of genes in parallel. As the amount of microarray data being produced is increasing at an exponential rate, there is a great demand for efficient and effective expression data analysis tools. The gene expression profile of a cell determines its phenotype and responses to the environment. These responses include its responses towards environmental factors, drugs and therapies. Gene expression patterns can be determined by measuring the quantity of the end product, protein, or the mRNA template used to synthesize the protein. Comparison of gene expression profiles in patients against their normal counterpart people will enhance our understanding of a disease and identify leads for therapeutic intervention. Several important breakthroughs and progress in the gene expression profiling of diseases have been made [1–5]. More interestingly, researchers have identified many genes that play important roles in the onset, development, and progression of various diseases. Identification of these disease genes offers a route to a better understanding of the molecular mechanisms underlying pathogenesis, a necessary prerequisite for the rational development of improved preventative and therapeutic methods.

*S*

_{ 1 }and

*S*

_{ 2 }, the t-value is calculated as

The gene expression values for five genes under two conditions.

Gene ID | IR | IS | d-value | p-value | ||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

FM | t-test | rank sum | ||||||||||||

1 | 750 | 559 | 649 | 685 | 636 | 310 | 359 | 135 | 97 | 178 | 0.999 | 0.001 | 0.008 | 0.000 |

2 | 123 | 142 | 11 | 406 | 220 | 305 | 398 | 707 | 905 | 688 | 0.756 | 0.012 | 0.011 | 0.031 |

3 | 246 | 213 | 232 | 134 | 67 | 86 | 79 | 77 | 94 | 61 | 0.725 | 0.017 | 0.021 | 0.098 |

4 | 200 | 191 | 220 | 83 | 197 | 49 | 81 | 116 | 111 | 135 | 0.708 | 0.019 | 0.024 | 0.058 |

5 | 598 | 424 | 695 | 451 | 141 | 342 | 260 | 266 | 229 | 234 | 0.674 | 0.025 | 0.077 | 0.152 |

where μ_{
S
}and σ_{
S
}are the sample mean and standard deviation of *S*, respectively.

The limitation of t-test is that it cannot distinguish two sets with close means even though the two sets are significantly different from each other. Another limitation of t-test is that it is very sensitive to extreme values.

Another popular statistical method is Wilcoxon rank sum test, which can be used to test the null hypothesis that two sets *S*_{
1
}and *S*_{
2
}have the same distribution. We first merge the data from these two sets and rank the values from the lowest to the highest with all sequences of ties being assigned an average rank. The Wilcoxon test statistic *W* is the sum of the ranks from set *S*_{
1
}. Assuming that the two sets have the same continuous distribution (and no ties occur), then *W* has a mean and standard deviation given by

where *m* = |*S*_{
1
}| and *n* = |*S*_{
2
}|.

We test the null hypothesis *H*_{
o
}: no difference in distributions. A one-sided alternative is *H*_{
a
}: *S*_{
1
}yields lower measurements. We use this alternative if we expect or see that *W* is unusually lower than its expected value μ. In this case, the p-value is given by a normal approximation. We let *N*~*N*(μ,σ) and compute the left-tail *Pr(N ≤ W)* (using continuity correction if *W* is an integer).

If we expect or see that *W* is much higher than its expected value, then we should use the alternative *H*_{
a
}: first *S*_{
1
}yields higher measurements. In this case, the p-value is given by the right-tail *Pr(N ≥ W)*. If the two sums of ranks from each set are close, then we could use a two-sided alternative *H*_{
a
}: there is a difference in distributions. In this case, the p-value is given by twice the smallest tail value 2**Pr(N ≤ W)*, if *W* < μ; or 2**Pr(N ≥ W)*, if *W* > μ.

Although rank sum test overcomes the limitation of t-test in sensitivity to extreme values, it is not sensitive to absolute values. This might be advantageous to some applications but not to others.

## Results

To validate our approach, first, we investigated the distribution of FM d-value on a set of synthetic datasets. Second, we conducted experiments on a synthetic dataset to study the relationship between FM-test d-value and its empirical p-value. Third, on another synthetic dataset, we studied the relationship between FM d-value and the mean difference of distributions.

### The probability distribution of FM d-value

*S*

_{ 1 }and

*S*

_{ 2 }are randomly drawn from the same normal distribution, what is the probability distribution of FM d-value? To answer this question, we conducted the following simulation:

- 1.
We generated

*N*= 64000 pairs of sets of values, with each set containing 5 values. As shown in Figure 1(a), each value in the two data sets is randomly generated from the same normal distribution*N*(0,1).

- 2.
We calculated the d-value for each pair of sets.

- 3.We then estimated the probability density value where
*δ*= 0.005. The value is essentially the fraction of the FM d-values falling in region [d-*δ*, d+*δ*] divided by the region length 2*δ*. The probability density function of the d-distribution was drawn in Figure 2.

- 4.
At the end, in order to understand the effect of the number of pairs used for simulation, i.e., the size of the dataset, on the approximation error of the d-distribution, we generated datasets with different data sizes. For each data size, we generated 10 datasets, and thus derived 10 probability density functions. The maximum standard deviation for all d-values is recorded as the

*error rate*for that data size. As shown in Figure 3, as expected, the error rate decreases as the size of the dataset increases.

From Figure 2, we can see that most FM d-values fall into the range from 0.2 to 0.5, and very few fall into the range greater than 0.6, or less than 0.2. In particular, when *d* ≥ 0.6056, p-value ≤ 0.05. This is reflected in the red-shared area in Figure 2 with
*f*(*x*)*dx* = 0.05. Therefore, given two sets *S*_{
1
}and *S*_{
2
}drawn from the same normal unit distribution, the chance that the pair has a FM d-value equal to or greater than 0.6056 is very low. On the other hand, if we observe that two sets have a d-value equal to or greater than 0.6056, then this is strong evidence that these two sets are drawn from two different distributions. Therefore, they should be considered as significantly divergent.

Figure 3 shows the effect of data size on the error rate of the derived probability density function. As the data size increases, the error rate decreases. We can see from Figure 3 that, after the number of pairs of sets in a dataset is greater than 8000, the trend of the error rate becomes stable. Thus, to obtain a reliable empirical p-value for FM-test, the data size should be greater than 8000.

### Relationship between FM d-value and its empirical p-value

*S*

_{ 1 }and

*S*

_{ 2 }are drawn from the same normal distribution, what is the probability that they have a FM d-value equal to or greater than a particular

*D*? If the

*D*increases, will this probability decrease? To answer these questions, we studied the relationship between FM d-value and empirical p-value as follows:

- 1.
Based on the above experimental result, we know that we need at least 8000 pairs of sets to obtain a reliable empirical p-value. Therefore, in this experiment, we generated 10000 pairs of sets of values, with each set containing 5 values. Each value is randomly generated from the unit normal distribution

*N*(0,1). - 2.
We calculated the d-value for each pair of sets.

- 3.
For each pair of sets

*S*_{ 1 }and*S*_{ 2 }with d-value*D*, we calculated its empirical p-value as*n*+1/10001 where*n*is the number of pairs in these 10000 pairs that have a d-value equal to or greater than*D*. - 4.
We drew the relationship between d-value and empirical p-value in Figure 4.

From Figure 4, we can see that as d-value increases, the p-value decreases. In particular, when *d* ≥ 0.6056, we have p-value ≤ 0.05.

### Relationship between FM d-value and the mean difference of distributions

*S*

_{ 1 }and

*S*

_{ 2 }are drawn from two different distributions, then a good divergence measurement should satisfy the following property: the less overlap between these two distributions, the greater the d-value. We validated that our FM-test has this property as follows:

- 1.
As shown in Figure 1(b), two data sets are generated from two distributions. Let

*N*(0,1) and*N*(x, 1) be two normal distributions, where*x*is the mean difference between these two distributions. In this experiment, we consider*x*= 0, 0.5, 1, 1.5, 2, 2.5, 3, 3.5, 4, 4.5, respectively. - 2.
We generated 1000 pairs of sets of values, with the first set containing 5 values that are randomly generated from

*N*(0,1), and the second set containing 5 values that are randomly generated from*N*(x, 1). - 3.
We calculated the d-value for each pair. Let the average of these 1000 d-values be

*d*. We then plotted (*x*,*d*) in Figure 5.

- 4.
We repeated step 2 and 3 for different

*x*. Finally, the curve was drawn in Figure 5.

Figure 5 confirmed the desirable property of FM-test: the larger the mean difference between the two distributions, the greater the d-value.

## Discussion

### Analyzing diabetes data with FM-test

A diabetes dataset of microarray gene expression for a total of 10831 genes downloadable from [6] is used for analysis. For each gene, there are ten expression values, five from a group of insulin-sensitive (IS) people and five from a group of insulin-resistant (IR) people. Only the genes that have no null expression values are included in this analysis. We also require that, for a gene to be included, at least five out of its ten expression values are greater than 100. This eliminates the genes whose expression values are noisy and not reliable.

*U49835*is identified by FM-test as the 21st ranked significant gene with p-value 0.0258. Neither t-test (with p-value 0.0768) nor rank sum test (with a p-value 0.1522) identifies this gene as significant.

Ten best-ranked and worst-ranked genes of diabetes identified by FM-test.

Probe Set | Gene Description | d-value | Empirical p-value | t-test p-value | rank sum p-value |
---|---|---|---|---|---|

U45973 | Human phosphatidylinositol (4,5) bisphosphate | 0.999 | 0.0003 | 0.0001 | 0.0076 |

M60858 | Human nucleolin gene | 0.935 | 0.0016 | 0.0017 | 0.0076 |

| Homo sapiens mRNA for fungal sterol-C5-desaturase homolog | 0.892 | 0.0028 | 0.0029 | 0.0147 |

| Human alpha 2 type IX collagen (COL9A2) mRNA | 0.872 | 0.0038 | 0.0066 | 0.0076 |

L07648 | Human MXI1 mRNA | 0.858 | 0.0043 | 0.0052 | 0.0076 |

L07033 | Human hydroxymethylglutaryl-CoA lyase mRNA | 0.855 | 0.0046 | 0.0054 | 0.0076 |

X53586 | Human mRNA for integrin alpha 6 | 0.851 | 0.0047 | 0.0075 | 0.0076 |

X81003 | Homo sapiens HCG V mRNA | 0.791 | 0.0089 | 0.0077 | 0.0076 |

| ribosomal protein L7 | 0.767 | 0.0108 | 0.0109 | 0.0313 |

| melan-A | 0.756 | 0.0126 | 0.0118 | 0.0311 |

X82324 | POU domain, class 3, transcription factor 4 | 0.206 | 0.9987 | 0.407 | 1 |

M14764 | nerve growth factor receptor (TNFR superfamily, member 16) | 0.204 | 0.9989 | 0.652 | 1 |

M64673 | heat shock transcription factor 1 | 0.204 | 0.9990 | 0.652 | 0.844 |

U20657 | ubiquitin specific peptidase 4 (proto-oncogene) | 0.197 | 0.9993 | 0.642 | 0.844 |

D17793 | aldo-keto reductase family 1, member C3 | 0.196 | 0.9999 | 0.471 | 0.839 |

D78014 | dihydropyrimidinase-like 3 | 0.194 | 1 | 0.620 | 0.548 |

AB002314 | PDZ domain containing 10 | 0.191 | 1 | 0.367 | 0.545 |

L20348 | oncomodulin | 0.181 | 1 | 0.405 | 0.544 |

D50063 | proteasome (prosome, macropain) 26S subunit | 0.179 | 1 | 0.544 | 0.421 |

Z79581 | H.sapiens LAZ3/BCL6 gene, first non coding exon | 0.179 | 1 | 0.545 | 0.407 |

To study the relevance of genes in insulin metabolism and diabetes, the 10 best ranked differentially regulated genes shown in Table 2 were further searched in the published literature. Human phosphatidylinositol(4,5) bisphosphate 5-phosphatase homolog (gene *U45973*) was found to be differentially expressed in insulin resistance cases. Over-expression of inositol polyphosphate 5-phosphatase-2 SHIP2 has been shown to inhibit insulin-stimulated phosphoinositide 3-kinase (PI3K) dependent signaling events. Analysis of diabetic human subjects has revealed an association between SHIP2 gene polymorphism and type 2 diabetes mellitus. Also knockout mouse studies have shown that SHIP2 is a significant therapeutic target for the treatment of type-2 diabetes as well as obesity [8]. Csermely et al. reported that insulin mediates phosphorylation/dephosphorylation of nucleolar protein nucleolin (gene *M60858*) by stimulating casein kinase II, and this may play a role in the simultaneous enhancement in RNA efflux from isolated, intact cell nucle [9]. c-myc is an oncogene that codes for transcription factor Myc that along with other binding partners such as MAX plays an important role widely studied in various physiological processes including tumor growth in different cancers. Myc modulates the expression of hepatic genes and counteracts the obesity and insulin resistance induced by a high-fat diet in transgenic mice overexpressing c-myc in liver [10].

Max interactor protein, MXI1 (gene *L07648*) competes for MAX thus negatively regulates MYC function and may play a role in insulin resistance. In the presence of glucose or glucose and insulin, leucine is utilized more efficiently as a precursor for lipid biosynthesis by adipose tissue. It has been shown that during the differentiation of 3T3-L1 fibroblasts to adipocytes, the rate of lipid biosynthesis from leucine increases at least 30-fold and the specific activity of 3-hydroxy-3-methylglutaryl-CoA lyase (gene *L07033*), the mitochondrial enzyme catalyzing the terminal reaction in the leucine degradation pathway, increases 4-fold during differentiation [11]. Schottelndreier et al[12] have described a regulatory role of integrin alpha 6 (gene *X53586*) in Ca2+ signaling, that is known to have a significant role in insulin resistance [13].

HCGV gene product (gene *X81003*) is known to inhibit the activity of protein phosphatase-1, which is involved in diverse signalling pathways including insulin signaling [14]. Human ribosomal protein L7 (Gene *X57959*)plays a regulatory role in eukaryotic translation apparatus. It has been shown to be an autoantigen in patients with systemic autoimmune diseases, such as systemic lupus erythematosus [15]. Identification of this gene in our analysis and by [6] suggests a possible role of this gene in insulin resistance. Published reports on these genes indicate their roles in insulin signalling and warrant further investigations on their functions in insulin resistance cases. We further recommend genes *D85181, M95610* and *U06452* as candidate genes for future research in this area.

_{ 2 }( / ), where is the mean of expressions in the IS group, and is the mean of the expressions in the IR group. In this way, up and down regulation appear symmetric. The Y axis represents the p-value for our FM-test, which is on a negative log scale log

_{10}(

*p*-

*value*), so that smaller p-values appear higher up. The X axis indicates biological impact of the change; the Y axis indicates the statistical evidence, or reliability of the change.

As shown in Figure 6, gene *U45973* is identified by FM-test as the most statistically significant gene and it is over-expressed in the IR group; gene *X53586* is identified by FM-test as the 7th statistically significant gene and it is over-expressed in the IS group. Although genes *M60858, D85181, M95610, L07648, L07033*, and *X81003* have been identified by FM-test among the top ten significant genes, they are not over-expressed in either groups. Finally, gene *U41515* is identified by FM-test as the 11th significant gene and it is over-expressed in the IS group.

In summary, out of the top 10 genes identified by FM-test, we could find 6 of them in the literature about their association with insulin metabolism and diabetes. Among the remaining four genes, gene *X57959* has been recommended by [6] as a candidate gene for diabetes, we recommend that gene *D85181, M95610* and *U06452* could serve as candidate genes for future research in this area.

### Analyzing lung cancer data with FM-test

Ten best-ranked (overexpressed) cancer genes identified by FM-test.

Probe Set | Gene Description | p-value |
---|---|---|

NM_173086 | KRT6E: Keratin 6E | 0.000125 |

NM_001723 | DST: Dystonin | 0.000125 |

NM_002639 | SERPINB5: Serpin peptidase inhibitor, clade B (ovalbumin), member 5 | 0.000125 |

AB010153 | TP73L: Tumor protein p73-like | 0.000125 |

NM_023915 | GPR87: G protein-coupled receptor 87 | 0.000125 |

NM_006536 | CLCA2: Chloride channel, calcium activated, family member 2 | 0.000125 |

NM_001005337 | PKP1: Plakophilin 1 (ectodermal dysplasia/skin fragility syndrome) | 0.000125 |

AF043977 | CLCA2: Chloride channel, calcium activated, family member 2 | 0.000125 |

NM_004415 | DSP: Desmoplakin | 0.000125 |

NM_019093 | UGT1A9: UDP glucuronosyltransferase 1 family, polypeptide A9 | 0.000125 |

## Conclusion

We proposed an innovative approach based on the fuzzy set theory, FM-test, that quantifies the divergence of two sets directly. We have validated FM-test on synthetic datasets and show that it is effective and robust. We also applied FM-test to a real diabetes dataset and a cancer dataset. For each dataset, we identified 10 significant genes. Within 10 significant genes in diabetes dataset, six of them have been confirmed to be associated with insulin signalling and/or diabetes in the literature, one has been recommended by others, the remaining three genes, *D85181*, *M95610* and *U06452*, are suggested as three potential diabetes genes involved in insulin resistance for further biological investigation. Out of the 10 significantly overexpressed genes identified in the lung cancer data eight are confirmed by literature to be related to lung cancer. The remaining two genes NM_023915 and NM_019093 are potential candidates for further biological investigation. In addition, we analyzed the asymptotic properties of the distribution of FM d-value and the equation to calculate its p-value. The analysis is presented in appendix. FM-test is implemented as a Web-based application and can be accessed for free at http://database.cs.wayne.edu/bioinformatics.

## Methods

In this section, based on the fuzzy set theory [27], we present our innovative approach, the fuzzy-set-theory-based method test (FM-test), to quantify the divergence of two sets of values directly and robustly. In addition, in append ix section, we show the asymptotic property of FM-test, and then establish the relationship between FM d-value with p-value.

*S*

_{ 1 }and

*S*

_{ 2 }be two sets of values of a particular feature for two groups of samples under two different conditions. The basic idea is to consider the two sets of values as samples from two different fuzzy sets. We examine the membership value of each element with respect to the other fuzzy set. By calculating the average of membership values, we measure the divergence of the original two sets. In particular, we perform the following steps:

- 1.
Compute the sample mean and standard deviation of

*S*_{ 1 }and of*S*_{ 2 }respectively. - 2.
Characterize

*S*_{ 1 }and*S*_{ 2 }as two fuzzy sets*FS*_{ 1 }and*FS*_{ 2 }whose fuzzy membership functions, (*x*) and (*x*), are defined with the sample means and standard deviations. The fuzzy membership function (*x*)(i = 1,2) maps each value*x*to a fuzzy membership value that reflects the degree of*x*belonging to (*x*)(i = 1,2). - 3.
- 4.
Define the divergence degree (FM d-value) between the two sets based on the convergence degree.

### Fuzzy Sets and Membership Functions

The sample mean μ_{1} of *S*_{
1
}is calculated as

where *n*_{
1
}is the number of elements in *S*_{
1
}, and the sample standard deviation σ_{
1
}of *S*_{
1
}is calculated as

For gene 5 in Table 1, we have μ_{
1
}= 461.8, σ_{
1
}= 210.59, μ_{
2
}= 266.2, and σ_{
2
}= 45.29. We then characterize set *S*_{
1
}by a fuzzy set *FS*_{
1
}whose fuzzy membership function is defined as

*x*) maps each value

*x*in

*S*

_{ 1 }to a fuzzy membership value to quantify the degree that

*x*belongs to

*FS*

_{ 1 }. A value equal to the mean has a membership value of 1 and belongs to fuzzy set

*FS*

_{ 1 }to a full degree; a value that deviates from the mean has a smaller membership value and belongs to

*FS*

_{ 1 }to a smaller degree. The further the value deviates from the mean, the smaller the fuzzy membership value. Similarly, the fuzzy membership function for

*S*

_{ 2 }is defined as

where μ_{
2
}and σ_{
2
}are the mean and standard deviation of *S*_{
2
}respectively.

For gene 5 in Table 1, we have
and
. With these two fuzzy membership functions, the fuzzy membership values for each element with respect to the two sets can be calculated. For example,
(598) = 0.81 and
(598) = 2.2*E*^{-12}.

### Our Proposed Method: FM-test

Since the fuzzy membership functions can overlap, one element can belong to more than one fuzzy set with a respective degree for each. For an element in *S*_{
1
}, we measure the degree that it belongs to *FS*_{
1
}by applying its value to
. Similarly we can apply its value to
to measure the degree that it belongs to *FS*_{
2
}. The idea of FM-test is to consider the membership value of an element in *S*_{
1
}with respect to *S*_{
2
}as a bond between *S*_{
1
}and *S*_{
2
}, and vice versa, then the aggregation of all these bonds reflects the overall bond between these two sets. The weaker this overall bond is, the more divergent these two sets are. The strength of the overall bond between two sets is quantified by their c-value, which aggregates the mutual membership values of elements in *S*_{
1
}and *S*_{
2
}and is defined as follows.

**Definition 1 (FM c-value)**: Given two sets *S*_{
1
}and *S*_{
2
}, the convergence degree between *S*_{
1
}and *S*_{
2
}in FM-test is defined as

Now we define the divergence value in FM-test (FM d-value) as follows.

**Definition 2 (FM d-value)**: Given two sets *S*_{
1
}and *S*_{
2
}, the FM d-value between *S*_{
1
}and *S*_{
2
}is defined as

For gene 5 in Table 1, *c(S*_{
1
}*, S*_{
2
}*) = 0.326*, thus the divergence value is 1-*c(S*_{1}*, S*_{2}*)* = 0.674. We calculated all the p-values for the five genes in Table 1 for the three methods. One interesting observation is that, while both t-test and Wilcoxon rank sum test fail to recognize gene 5 as a significant gene since their p-values are greater than 0.05, our FM-test identifies gene 5 as a significant gene with a p-value of 0.025. The reason of the failure of t-test and Wilcoxon rank sum test is due to their sensitivity to the extreme value 141 in the first set of the gene.

Given a calculated FM d-value *D* for two sets *S*_{
1
}and *S*_{
2
}, to interpret *D* in terms of "significantly divergent" or not, we need to know the cutoff value δ of *D*, so that when *D* ≥ δ, the two sets are interpreted as significantly divergent. In the context of FM-test, we like to test the following null hypothesis *H*_{
o
}: *S*_{1} and *S*_{2} originate from the same distribution. Then the p-value is defined as the probability {*Pr(d(S*_{
1
}, *S*_{2}*)* ≥ *D* | *S*_{
1
}and *S*_{
2
}were randomly sampled from the same distribution}. As a convention of statistical analysis, if *p-value* ≤ 0.05, then this is strong evidence to reject the null hypothesis, and accepts that the two sets are significantly divergent, while the p-value reflects the significance. It has been very common to use Monte Carlo procedures to calculate the empirical p-value which approximates the exact p-value without relying on asymptotic distributional theory or on exhaustive enumeration. Davison and Hinkley [28] present the formula for obtaining an empirical p-value as *(n+1)/(N+1)*, where *N* is the number of samples in the data set, and *n* is the number of those samples which produce the statistical value greater than or equal to the specified value.

We perform the following steps to calculate the p-value of two sets *S*_{
1
}and *S*_{
2
}with their FM d-value *D*: (1) Estimate the distribution that *S*_{
1
}and *S*_{
2
}are drawn from a normal distribution *N*(μ,σ), where μ and σ are estimated using the sample mean and standard deviation of *S*_{
1
}∪ *S*_{
2
}; (2) Randomly draw *N* pairs of sets from *N*(μ,σ), then calculate the FM d-value for each pair; (3) Calculate the empirical p-value as *(n+1)/(N+1)*, where *n* is the number of pairs whose FM d-values are equal or greater than *D*.

## APPENDIX

### Asymptotic Characteristics of the FM d-value

The FM d-value is defined in Method section as follows:

where *S* = *S*_{1}∪ *S*_{2} = {*x*_{
i
},*i* = 1,..., *n*_{1} + *n*_{2}}, *n*_{1} = |*S*_{1}| · *n*_{2} = |*S*_{2}| and
(*x*) = 1 if *x* ∈ *S*_{
i
}and 0 otherwise for *i* = 1,2.

*X*over sample space

*S*with a probability

*p*of choosing a sample

*x*from

*S*

_{1}. The calculation of the d-value for a given sample

*x*is therefore given by

*d*(

*S*

_{1},

*S*

_{2}) = = 1 - . Next, the mean and the variance of Δ(

*X*) are calculated respectively preparing for establishing the asymptotic distribution of the d-value.

#### (1). Calculation of the mean of Δ(*X*)

The mean of Δ(*X*) is given by

By (12)–(14), the mean of Δ(*X*) when *p* = 0.5 is

#### (2). Calculation of the variance of Δ(*X*)

Since *S*_{1} and *S*_{2} are independent, the variance of Δ(*X*) is given by

Similarly,

Therefore, when *p* = 0.5

*d*(

*S*

_{1},

*S*

_{2}) = 1 - . By calculating the mean and the variance of Δ(

*X*) in formula (Δ1) and (Δ2), the mean and the variance of the d-value are derived straightforward as follows:

For a large sample, by the **central limit theorem**, the distribution of the d-value follows a truncated normal distribution approximately: *d*(*S*_{1},*S*_{2})→ *N*(*E*(*d*),*Var*(*d*)) on a restrained domain of [0 1].

- i.
Balance study:

*p*= 0.5,*n*_{1}=*n*_{2}=*n*/2

- v.
Balance study with equal variance of 1 and equal mean for large samples:σ

^{2}= 1, μ_{1}= μ_{2},*n*_{1}=*n*_{2}≥ 25

*E*(*d*(*s*_{1},*s*_{2})) = 1-
≈ 0.293, *var*(*d*(*s*_{1},*s*_{2}))=(
,
)/*n* ≈ 0.327/*n*

*d*(*S*_{1},*S*_{2}) → *N*(0.293,0.327/*n*) with a restrained domain of [0 1].

### Calculation of p-value

P-value is also called the observed level of significance and is commonly used to report the smallest α-level at which the observed test result is significant. In this section, we derived the parametric calculation of p-value for the FM test based on the asymptotic distribution obtained from section *I*.

The null hypothesis of the test is *H*_{0}:μ_{1} = μ_{2}, where μ_{1} and μ_{2} are the mean gene expression levels of two studied groups. According to the asymptotic distribution of the d-value, following its special case (ii) (balance study with equal mean), a test statistic under the null hypothesis for large sample size (n >= 25) is given by

Suppose *d*_{
obs
}is an observed d-value for a given study based on two independent samples *S*_{1} = {*x*_{
i
},*i* = 1,...,*n*_{1}} and *S*_{2} = {*y*_{
i
},*i* = 1,...,*n*_{2}}. The population variances σ_{1}^{2} and σ_{2}^{2} are estimated by the corresponding sample variances
and
. Thus the mean and variance of d-value are estimated by

P-value is therefore derived as follows:

### Application in Gene Expression Analysis

P-values given by FM-test for five genes from the study example.

Gene ID | IR | IS | d-value | p-value | |||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|

FM-test by(Δ 3) | FM | t-test | rank sum | ||||||||||||

1 | 750 | 559 | 649 | 685 | 636 | 310 | 359 | 135 | 97 | 178 | 0.999 | 0.000 | 0.001 | 0.008 | 0.000 |

2 | 123 | 142 | 11 | 406 | 220 | 305 | 398 | 707 | 905 | 688 | 0.756 | 0.007 | 0.012 | 0.011 | 0.031 |

3 | 246 | 213 | 232 | 134 | 67 | 86 | 79 | 77 | 94 | 61 | 0.725 | 0.041 | 0.017 | 0.021 | 0.098 |

4 | 200 | 191 | 220 | 83 | 197 | 49 | 81 | 116 | 111 | 135 | 0.708 | 0.014 | 0.019 | 0.024 | 0.058 |

5 | 598 | 424 | 695 | 451 | 141 | 342 | 260 | 266 | 229 | 234 | 0.674 | 0.062 | 0.025 | 0.077 | 0.152 |

## Notes

## Declarations

### Acknowledgements

We would like thank anonymous reviewers for their helpful comments. This work was supported by the Agricultural Experiment Station at the University of the District of Columbia (Project No.: DC-0LIANG; Accession No.: 0203877).

This article has been published as part of *BMC Bioinformatics* Volume 7, Supplement 4, 2006: Symposium of Computations in Bioinformatics and Bioscience (SCBB06). The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/7?issue=S4.

## Authors’ Affiliations

## References

- Rome S, Clement K, Rabasa-Lhoret R, Loizon E, Poitou C, Barsh GS, Riou JP, Laville M, Vidal H:
**Microarray profiling of human skeletal muscle reveals that insulin regulates approximately 800 genes during a hyperinsulinemic clamp.***J Biol Chem*2003,**278**(20):18063–18068. 10.1074/jbc.M300293200View ArticlePubMedGoogle Scholar - Shalev A, Pise-Masison CA, Radonovich M, Hoffmann SC, Hirshberg B, Brady JN, Harlan DM:
**Oligonucleotide microarray analysis of intact human pancreatic islets: identification of glucose-responsive genes and a highly regulated TGFbeta signaling pathway.***Endocrinology*2002,**143**(9):3695–3698. 10.1210/en.2002-220564View ArticlePubMedGoogle Scholar - Sreekumar R, Halvatsiotis P, Schimke JC, Nair KS:
**Gene expression profile in skeletal muscle of type 2 diabetes and the effect of insulin treatment.***Diabetes*2002,**51**(6):1913–1920.View ArticlePubMedGoogle Scholar - Eckenrode SE, Ruan QG, Collins CD, Yang P, McIndoe RA, Muir A, She JX:
**Molecular pathways altered by insulin b9–23 immunization.***Ann N Y Acad Sci*2004,**1037:**175–185. 10.1196/annals.1337.029View ArticlePubMedGoogle Scholar - Voisine P, Ruel M, Khan TA, Bianchi C, Xu SH, Kohane I, Libermann TA, Otu H, Saltiel AR, Sellke FW:
**Differences in gene expression profiles of diabetic and nondiabetic patients undergoing cardiopulmonary bypass and cardioplegic arrest.***Circulation*2004,**110**(11 Suppl 1):II280–286.PubMedGoogle Scholar - Yang X, Pratley RE, Tokraks S, Bogardus C, Permana PA:
**Microarray profiling of skeletal muscle tissues from equally obese, non-diabetic insulin-sensitive and insulin-resistant Pima Indians.***Diabetologia*2002,**45**(11):1584–1593. 10.1007/s00125-002-0905-7View ArticlePubMedGoogle Scholar - Rosner B:
**Fundamentals of Biostatistics.**In*Pacific Grove*. 5th edition. CA: Duxbury Press; 2000.Google Scholar - Dyson JM, Kong AM, Wiradjaja F, Astle MV, Gurung R, Mitchell CA:
**The SH2 domain containing inositol polyphosphate 5-phosphatase-2: SHIP2.***Int J Biochem Cell Biol*2005,**37**(11):2260–2265. 10.1016/j.biocel.2005.05.003View ArticlePubMedGoogle Scholar - Csermely P, Schnaider T, Cheatham B, Olson MO, Kahn CR:
**Insulin induces the phosphorylation of nucleolin. A possible mechanism of insulin-induced RNA efflux from nuclei.***J Biol Chem*1993,**268**(13):9747–9752.PubMedGoogle Scholar - Riu E, Ferre T, Hidalgo A, Mas A, Franckhauser S, Otaegui P, Bosch F:
**Overexpression of c-myc in the liver prevents obesity and insulin resistance.***Faseb J*2003,**17**(12):1715–1717.PubMedGoogle Scholar - Frerman FE, Sabran JL, Taylor JL, Grossberg SE:
**Leucine catabolism during the differentiation of 3T3-L1 cells. Expression of a mitochondrial enzyme system.***J Biol Chem*1983,**258**(11):7087–7093.PubMedGoogle Scholar - Schottelndreier H, Potter BV, Mayr GW, Guse AH:
**Mechanisms involved in alpha6beta1-integrin-mediated Ca(2+) signalling.***Cell Signal*2001,**13**(12):895–899. 10.1016/S0898-6568(01)00225-XView ArticlePubMedGoogle Scholar - Kulkarni RN, Roper MG, Dahlgren G, Shih DQ, Kauri LM, Peters JL, Stoffel M, Kennedy RT:
**Islet secretory defect in insulin receptor substrate 1 null mice is linked with reduced calcium signaling and expression of sarco(endo)plasmic reticulum Ca2+-ATPase (SERCA)-2b and -3.***Diabetes*2004,**53**(6):1517–1525.View ArticlePubMedGoogle Scholar - Zhang J, Zhang L, Zhao S, Lee EY:
**Identification and characterization of the human HCG V gene product as a novel inhibitor of protein phosphatase-1.***Biochemistry*1998,**37**(47):16728–16734. 10.1021/bi981169gView ArticlePubMedGoogle Scholar - von Mikecz A, Hemmerich P, Peter HH, Krawinkel U:
**Characterization of eukaryotic protein L7 as a novel autoantigen which frequently elicits an immune response in patients suffering from systemic autoimmune disease.***Immunobiology*1994,**192**(1–2):137–154.PubMedGoogle Scholar - Wachi S, Yoneda K, Wu R:
**Interactome-transcriptome analysis reveals the high centrality of genes differentially expressed in lung cancer tissues.***Bioinformatics*2005,**21**(23):4205–4208. 10.1093/bioinformatics/bti688View ArticlePubMedGoogle Scholar - Camilo R, Capelozzi VL, Siqueira SA, Del Carlo Bernardi F:
**Expression of p63, keratin 5/6, keratin 7, and surfactant-A in non-small cell lung carcinomas.***Hum Pathol*2006,**37**(5):542–546. 10.1016/j.humpath.2005.12.019View ArticlePubMedGoogle Scholar - Schuetz CS, Bonin M, Clare SE, Nieselt K, Sotlar K, Walter M, Fehm T, Solomayer E, Riess O, Wallwiener D,
*et al*.:**Progression-specific genes identified by expression profiling of matched ductal carcinomas in situ and invasive breast tumors, combining laser capture microdissection and oligonucleotide microarray analysis.***Cancer Res*2006,**66**(10):5278–5286. 10.1158/0008-5472.CAN-05-4610View ArticlePubMedGoogle Scholar - Chen EI, Yates JR:
**Maspin and tumor metastasis.***IUBMB Life*2006,**58**(1):25–29.View ArticlePubMedGoogle Scholar - Sbisa E, Mastropasqua G, Lefkimmiatis K, Caratozzolo MF, D'Erchia AM, Tullo A:
**Connecting p63 to cellular proliferation: the example of the adenosine deaminase target gene.***Cell Cycle*2006,**5**(2):205–212.View ArticlePubMedGoogle Scholar - Massion PP, Taflan PM, Jamshedur Rahman SM, Yildiz P, Shyr Y, Edgerton ME, Westfall MD, Roberts JR, Pietenpol JA, Carbone DP,
*et al*.:**Significance of p63 amplification and overexpression in lung cancer development and prognosis.***Cancer Res*2003,**63**(21):7113–7121.PubMedGoogle Scholar - Hayes DC, Secrist H, Bangur CS, Wang T, Zhang X, Harlan D, Goodman GE, Houghton RL, Persing DH, Zehentner BK:
**Multigene real-time PCR detection of circulating tumor cells in peripheral blood of lung cancer patients.***Anticancer Res*2006,**26**(2B):1567–1575.PubMedGoogle Scholar - Schwarz J, Ayim A, Schmidt A, Jager S, Koch S, Baumann R, Dunne AA, Moll R:
**Differential expression of desmosomal plakophilins in various types of carcinomas: correlation with cell type and differentiation.***Hum Pathol*2006,**37**(5):613–622. 10.1016/j.humpath.2006.01.013View ArticlePubMedGoogle Scholar - Young GD, Winokur TS, Cerfolio RJ, Van Tine BA, Chow LT, Okoh V, Garver RI Jr:
**Differential expression and biodistribution of cytokeratin 18 and desmoplakins in non-small cell lung carcinoma subtypes.***Lung Cancer*2002,**36**(2):133–141. 10.1016/S0169-5002(01)00486-XView ArticlePubMedGoogle Scholar - Mondal G, Sengupta S, Panda CK, Gollin SM, Saunders WS, Roychoudhury S:
**Overexpression of Cdc20 leads to impairment of the spindle assembly checkpoint and aneuploidization in oral cancer.***Carcinogenesis*2006.Google Scholar - Chatrath P, Scott IS, Morris LS, Davies RJ, Rushbrook SM, Bird K, Vowler SL, Grant JW, Saeed IT, Howard D,
*et al*.:**Aberrant expression of minichromosome maintenance protein-2 and Ki67 in laryngeal squamous epithelial lesions.***Br J Cancer*2003,**89**(6):1048–1054. 10.1038/sj.bjc.6601234PubMed CentralView ArticlePubMedGoogle Scholar - Klir GJ, Yuan B:
*Fuzzy Sets and Fuzzy Logic: Theory and Applications.*Prentice-Hall; 1995.Google Scholar - Davison A, Hinkley D:
*Bootstrap methods and their application.*Cambridge: Cambridge University Press; 1997.View ArticleGoogle Scholar

## Copyright

This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.