 Proceedings
 Open Access
A method for computing the overall statistical significance of a treatment effect among a group of genes
 Robert Delongchamp^{1},
 Taewon Lee^{1} and
 Cruz Velasco^{2}
https://doi.org/10.1186/1471-2105-7-S2-S11
© Delongchamp et al.; licensee BioMed Central Ltd. 2006
 Published: 26 September 2006
Abstract
Background
In studies that use DNA arrays to assess changes in gene expression, our goal is to evaluate the statistical significance of treatments on sets of genes. Genes can be grouped by a molecular function, a biological process, or a cellular component, e.g., gene ontology (GO) terms. The meaning of an affected GO group is often clearer than interpretations arising from a list of the statistically significant genes.
Results
Computer simulations demonstrated that correlations among genes invalidate many statistical methods that are commonly used to assign significance to GO terms. Ignoring these correlations overstates the statistical significance. Meta-analysis methods for combining p-values were modified to adjust for correlation. One of these methods is elaborated in the context of a comparison between two treatments. The form of the correlation adjustment depends upon the alternative hypothesis.
Conclusion
Reliable corrections for the effect of correlations among genes on the significance level of a GO term can be constructed for an alternative hypothesis where all transcripts in the GO term increase (decrease) in response to treatment. For general alternatives, which allow some transcripts to increase and others to decrease, the bias of naïve significance calculations can be greatly decreased although not eliminated.
Keywords
 Gene Ontology
 Empirical Distribution
 Standard Normal Distribution
 Null Distribution
 Randomization Test
Introduction
The purpose of this work is to evaluate the statistical significance of treatments on the expressions in subsets of the genes on an array; for example, sets defined by gene ontology terms (GO terms, http://www.geneontology.org). GO terms group genes according to a biological process, molecular function, or cellular component. Inferences about the impact of a treatment are usually more straightforward when based on GO terms or equivalent groupings as opposed to lists of significant genes. Hence, we want to assess the statistical significance of the treatments on the group.
mRNA levels measured among genes within a GO group will be correlated. Correlations among genes involved in a common biological task are likely, and correlations are also expected simply because the expressions are measured within the same animal and array, i.e., under shared conditions. The p-values computed in many packages [1, 2] assume independence and therefore could be misleading.
The statistical significance of a GO group is commonly assessed by counting the number of statistically significant genes in the group. The null hypothesis that this count is a random sample of the significant genes on the array is tested against the alternative hypothesis that the count is enriched [1–3]. The test is Fisher's exact test (probabilities computed using a hypergeometric distribution) or one of its many approximations (e.g., the chi-squared test, which approximates the hypergeometric distribution with a binomial distribution) [4].
We do not test for enrichment, primarily because the null distribution depends upon effects among interrogated genes that are unrelated to the genes in the evaluated GO term. Put another way, the significance of a GO term should not depend upon whether other genes on the array (not in the GO term) were affected by the treatments. This approach is also not amenable to corrections for correlations among the p-values, since the test inherently assumes exchangeability among genes, an assumption which is not met under arbitrary correlation structures. Furthermore, the usual implementation simply counts 'significant' genes, which precludes extracting supporting evidence from the 'not significant' genes.
The distribution of p-values for the individual genes on the array allows one to estimate the number of affected genes, and this estimate typically is larger than any list of 'significant' genes that can be compiled with an acceptably low rate of misclassification [5–7]. Conceptually, methods that collate individual p-values within biologically meaningful groups can extract supporting evidence for treatment effects from group members that individually cannot be identified as affected by the treatments.
We prefer an approach that is based on a p-value having a uniform distribution under the null hypothesis [8]. For any continuous probability distribution F, x = F^{-1}(1 − p) transforms p to the distribution specified by F. Two choices for F, which are in common use for combining p-values, are the standard normal distribution, Φ(x), and the chi-square distribution with 2 degrees of freedom, F(x) = 1 − exp(−x/2). When p-values are independent, the distribution of the sum is straightforward for either normal or chi-squared deviates, and serves as a basis for testing the significance of a GO term. This is elaborated for the normal distribution in the next section, where we also develop a correction for correlations. The test based on the chi-squared deviate can also be corrected for correlation, at least in a 'one-sided' case [9–12].
Randomization tests can deal with correlations among the genes in a GO term, and, when applicable, they should generate uniformly distributed null p-values [13–15]. Our studies usually process samples in batches, and the estimation of treatment differences is essentially done within batches [6, 16–18]. Such analyses usually cannot be implemented as randomization tests, and our motivation to develop the presented methods is largely in this context. We present adjustments for a one-sample t-test because the results are transferable to pairwise contrasts in a fixed-effects linear model. While the pairwise contrasts are our real concern, their presentation carries algebraic baggage that is largely irrelevant to correcting for correlation, so we have opted for streamlined notation that focuses on the fundamental issues. A one-sample t-test can also be implemented as a randomization test, which suggests presenting it as a competitor. The randomization test is constructed to have a uniform distribution under the null, although for sample sizes as small as n = 5 the number of possible permutations may be a complicating factor. When both methods are applicable, we would not choose between them based upon their behavior under the null distribution but rather on a consideration of their power and/or robustness, which is well beyond the scope of this paper.
We will assume that the data are n observations coming from a population with mean vector, Δ, and covariance matrix, Σ. Although simple, this model is directly applicable to some of our studies, and it serves here to focus on basic issues with relatively simple mathematics. For an example where this model can be used directly, see the test for gender differences in gene expression in Delongchamp et al. (2005) [16].
Methods
Meta-analyses
Several methods are used in meta-analyses to combine a set of p-values into an overall significance level. Under a null hypothesis, the p-value for a corresponding statistic is a random variable with a uniform distribution, and it can be transformed to a convenient probability distribution [8]. Here, we use the inverse of the standard normal distribution. Then
z_i = Φ^{-1}(1 − p_i)
is a random variable from the standard normal distribution, and when the set of p-values, {p_i : i = 1,...,m}, are also independent, the statistic,
ψ = 1'z/√m,
where 1 = (1,1,...,1)' is the m-vector of ones, also has a standard normal distribution.
So, the p-value,
P = 1 − Φ(1'z/√m),
gives an overall significance level for the set. We refer to this as the naïve estimate because it naïvely assumes that the covariance of z is the identity matrix, cov(z) = I.
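As an illustration, the naïve combination can be sketched in a few lines. This is a sketch using numpy and scipy, which are not part of the original work; the function name and example p-values are hypothetical.

```python
import numpy as np
from scipy.stats import norm

def naive_group_pvalue(pvals):
    """Combine one-sided per-gene p-values into one group-level p-value,
    naively assuming the p-values are independent."""
    z = norm.ppf(1.0 - np.asarray(pvals))  # z_i = Phi^{-1}(1 - p_i)
    m = z.size
    psi = z.sum() / np.sqrt(m)             # 1'z / sqrt(m) is N(0,1) under H0
    return 1.0 - norm.cdf(psi)

# Example: five moderately small one-sided p-values
print(naive_group_pvalue([0.04, 0.10, 0.20, 0.06, 0.15]))
```

Note that a set of individually unremarkable p-values can combine into a small group-level p-value, which is the point of collating evidence within a GO term.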
Adjusting for correlation
In studies which use DNA arrays, the expressions measured on an array will be correlated, and these correlations imply that the set of p-values, {p_i : i = 1,...,m}, are not independent. If the covariance of z is known, it is straightforward to modify the statistic so that it accommodates correlations. Suppose that cov(z) = R; then the variance of 1'z is 1'R1 and the appropriate p-value is
P = 1 − Φ(1'z/√(1'R1)).     (1)
Note that the naïve estimator takes R = I, implying no correlations among the m p-values.
When R is unknown, it must be estimated. In this context, it is useful to 'adjust' the variance of the naïve estimator. Let r̄ be the average value of the off-diagonal elements of R, i.e., r̄ = (1'R1 − m)/(m(m − 1)); then the implied adjustment is 1'R1 = m(1 + (m − 1)r̄). That is,
P = 1 − Φ(1'z/√(m(1 + (m − 1)r̄))).
The correction for P only depends on the average correlation, r̄, and not on the individual correlations. We believe this allows one to generate an acceptable estimate in small data sets even though the individual correlations will be poorly estimated.
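The variance adjustment amounts to a single changed denominator relative to the naïve estimate. A minimal sketch, assuming numpy/scipy; the function name is hypothetical:

```python
import numpy as np
from scipy.stats import norm

def adjusted_group_pvalue(pvals, r_bar):
    """Group-level p-value adjusted for an average pairwise correlation
    r_bar among the z_i, using var(1'z) = m * (1 + (m - 1) * r_bar)."""
    z = norm.ppf(1.0 - np.asarray(pvals))
    m = z.size
    var = m * (1.0 + (m - 1) * r_bar)      # the adjusted 1'R1
    return 1.0 - norm.cdf(z.sum() / np.sqrt(var))
```

With r_bar = 0 this reduces to the naïve estimate; a positive r_bar inflates the variance of 1'z and yields a larger, less significant p-value.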
Estimating a correction for correlation
Continuing with the data model described in the Introduction, the mean of n observations, ȳ, is assumed to have a multivariate normal distribution with mean vector, Δ, and covariance matrix, Σ/n; that is, ȳ ~ MVN(Δ, Σ/n).
Let
S estimate Σ;
D = [diag(Σ)]^{-1/2} be a diagonal matrix, estimated by D̂ = [diag(S)]^{-1/2}.
The element-wise t-statistic for the null hypothesis, Δ = 0, can be written in vectorial form as t = √n D̂ȳ.
Then for a one-sided p-value with an increasing alternative, t → z implies that the appropriate correlation matrix in Equation 1 is R = DΣD, which is approximated by R̂ = D̂SD̂. Note that for a decreasing alternative, the same correlation applies for z = Φ^{-1}(p).
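The estimation pipeline above can be sketched on simulated null data. This sketch assumes numpy/scipy; the seed, dimensions, and data are illustrative, and treating t as a standard normal deviate is the small-sample approximation discussed in the next subsection:

```python
import numpy as np
from scipy.stats import norm

# Simulated null data: n arrays, m genes (sizes are illustrative)
rng = np.random.default_rng(0)
n, m = 5, 20
Y = rng.normal(size=(n, m))

ybar = Y.mean(axis=0)
S = np.cov(Y, rowvar=False)                  # S estimates Sigma
D_hat = np.diag(1.0 / np.sqrt(np.diag(S)))   # [diag(S)]^{-1/2}
t = np.sqrt(n) * D_hat @ ybar                # element-wise t-statistics
R_hat = D_hat @ S @ D_hat                    # estimate of R = D Sigma D
r_bar = (R_hat.sum() - m) / (m * (m - 1))    # average off-diagonal element

# Correlation-adjusted group p-value, Equation 1 with R replaced by R_hat
P = 1.0 - norm.cdf(t.sum() / np.sqrt(R_hat.sum()))
```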
t-test
The one-sided p-value for a null hypothesis on Δ is based on the distribution of the statistic, t_u = √n ū/s_u. Let a = D̂1 and u_j = a'y_j, and note that ū = a'ȳ and s_u^2 = a'Sa = 1'R̂1.
So t_u = 1't/√(1'R̂1); that is, the adjusted statistic can be computed as an ordinary one-sample t-test on the u_j.
Morrison [19] derives Hotelling's T^2 test in the context of t-statistics of linear combinations, and presumably such statistics have a history nearly as long as multivariate statistics. O'Brien [20] examined this specific statistic as a method for comparing multiple endpoints in clinical trials. Lauter [21] noted that the null distribution is not the t-distribution with small sample sizes and described a modified statistic which corrects this. While the modified statistic controls the Type I error, the modification seems to reduce the power relative to O'Brien's statistic (simulations not reported).
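The identity between the correlation-adjusted statistic and a one-sample t-test on u_j = a'y_j can be checked numerically. This is a sketch with numpy/scipy, not part of the original work; a = D̂1 and the simulated data are illustrative:

```python
import numpy as np
from scipy.stats import ttest_1samp

rng = np.random.default_rng(1)
n, m = 5, 20
Y = rng.normal(size=(n, m))

S = np.cov(Y, rowvar=False)
a = 1.0 / np.sqrt(np.diag(S))     # a = D_hat 1
u = Y @ a                          # u_j = a'y_j, one scalar per array

# One-sided, one-sample t-test on the u_j
res = ttest_1samp(u, 0.0, alternative='greater')

# The same statistic written as 1't / sqrt(1'R_hat 1)
t_elem = np.sqrt(n) * Y.mean(axis=0) * a
t_adj = t_elem.sum() / np.sqrt(a @ S @ a)
```

Because the sample variance of the u_j equals a'Sa exactly, res.statistic and t_adj agree to machine precision, so existing by-gene testing code can be reused for the group-level test.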
One- versus two-sided tests
So far, we have been discussing the case where p_i is a one-sided p-value. Essentially, Equation 1 applies whether p_i is a p-value from a one-sided test or a two-sided test. However, the covariance of z is not the same as the covariance of its absolute value, |z|. Consequently, a two-sided test, in which p = 2(1 − Φ(|z|)), needs a different adjustment than a one-sided test, in which p = 1 − Φ(z).
For a two-sided test, the null distribution of 1'z can be generated through Monte Carlo samples from the null distribution of z, MVN(0, cov(z)). That is, let z_1,...,z_k be pseudo-random samples from the multivariate normal distribution, MVN(0, cov(z)), and directly compute the p-value for the observed value, ψ = 1'z, as
P = (1/k) Σ_{l=1}^{k} I(1'z_l ≥ ψ),
where I(A) is an indicator function which gives 1 if A is true and 0 otherwise. In simulations where Σ is specified, the adjustment can be implemented based upon R or I. In the two-sided case, it is also of interest to generate samples from MVN(0, cov(z)) where cov(z) is computed from R̂ or r̄. In theory, adjustments based upon R correct the p-value for correlations, and for large enough n so will R̂ and r̄. In practice, R̂ or r̄ must be useful when n is quite small. Their utility in this regard can be illustrated with simulated data.
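A Monte Carlo sketch of this two-sided calculation, assuming numpy/scipy; the function name and defaults are hypothetical, and the underlying deviates x are transformed through the two-sided p-value before summing:

```python
import numpy as np
from scipy.stats import norm

def mc_twosided_pvalue(psi_obs, R, k=50_000, rng=None):
    """Monte Carlo null distribution of psi = 1'z for the two-sided case:
    x ~ MVN(0, R), p_i = 2(1 - Phi(|x_i|)), z_i = Phi^{-1}(1 - p_i)."""
    if rng is None:
        rng = np.random.default_rng(0)
    m = R.shape[0]
    X = rng.multivariate_normal(np.zeros(m), R, size=k)
    Z = norm.ppf(1.0 - 2.0 * (1.0 - norm.cdf(np.abs(X))))
    psi_null = Z.sum(axis=1)
    return np.mean(psi_null >= psi_obs)   # average of I(1'z_l >= psi)
```

Passing the estimated correlation matrix R̂ (or the r̄-implied exchangeable matrix) as R gives the small-sample versions of the adjustment described above.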
Results
Simulation
The theory outlined in the previous sections provides adjustments which will work well for large sample sizes. It does not guarantee much when sample sizes are as small as is seen in most studies that use DNA arrays. We simulated a few 'representative' cases to illustrate that P computed from the naïve statistic can be very inaccurate and to demonstrate that the adjustments proposed herein give useful corrections with small sample sizes.
Correlation matrix, R, used in the simulations. The correlations were randomly selected to be between 0.35 and 0.55, a range commonly observed in our studies.
1  0.3722  0.412  0.3719  0.5174  0.3699  0.478  0.4798  0.4496  0.5085  0.5058  0.3705  0.3524  0.4731  0.4538  0.4057  0.4246  0.443  0.3885  0.4519 
0.3722  1  0.3527  0.4034  0.5445  0.4073  0.522  0.5225  0.5097  0.5239  0.3733  0.5298  0.5443  0.3895  0.4547  0.5036  0.3541  0.5071  0.429  0.4089 
0.412  0.3527  1  0.5456  0.359  0.5349  0.398  0.4338  0.3809  0.4903  0.409  0.3718  0.3743  0.5229  0.3965  0.5091  0.5334  0.413  0.3691  0.428 
0.3719  0.4034  0.5456  1  0.5345  0.514  0.409  0.4894  0.5176  0.4007  0.3824  0.4371  0.3679  0.4261  0.4772  0.3796  0.3906  0.4217  0.4393  0.4452 
0.5174  0.5445  0.359  0.5345  1  0.4061  0.418  0.4092  0.5346  0.5224  0.4241  0.3802  0.3807  0.5035  0.4824  0.4953  0.4104  0.3501  0.5438  0.5125 
0.3699  0.4073  0.5349  0.514  0.4061  1  0.351  0.5327  0.434  0.4566  0.4948  0.4281  0.4867  0.4563  0.4714  0.5291  0.4009  0.354  0.503  0.4166 
0.4776  0.5225  0.3979  0.4092  0.4176  0.3505  1  0.4777  0.4354  0.4761  0.4109  0.3624  0.5064  0.4444  0.4038  0.3903  0.4479  0.407  0.5217  0.5469 
0.4798  0.5225  0.4338  0.4894  0.4092  0.5327  0.478  1  0.3583  0.444  0.4546  0.4606  0.4706  0.4674  0.4813  0.4581  0.5171  0.4625  0.4332  0.3834 
0.4496  0.5097  0.3809  0.5176  0.5346  0.434  0.435  0.3583  1  0.5184  0.548  0.4173  0.5257  0.4726  0.3544  0.4549  0.4734  0.5458  0.4346  0.5143 
0.5085  0.5239  0.4903  0.4007  0.5224  0.4566  0.476  0.444  0.5184  1  0.5253  0.5161  0.4758  0.3904  0.4799  0.4291  0.5031  0.473  0.3508  0.4288 
0.5058  0.3733  0.409  0.3824  0.4241  0.4948  0.411  0.4546  0.548  0.5253  1  0.4822  0.4519  0.4779  0.4359  0.3542  0.4875  0.4398  0.3992  0.4385 
0.3705  0.5298  0.3718  0.4371  0.3802  0.4281  0.362  0.4606  0.4173  0.5161  0.4822  1  0.5422  0.4317  0.4011  0.5234  0.4553  0.4305  0.432  0.391 
0.3524  0.5443  0.3743  0.3679  0.3807  0.4867  0.506  0.4706  0.5257  0.4758  0.4519  0.5422  1  0.3732  0.352  0.4682  0.4535  0.3799  0.3696  0.5219 
0.4731  0.3895  0.5229  0.4261  0.5035  0.4563  0.444  0.4674  0.4726  0.3904  0.4779  0.4317  0.3732  1  0.4225  0.3962  0.478  0.4375  0.5188  0.4279 
0.4538  0.4547  0.3965  0.4772  0.4824  0.4714  0.404  0.4813  0.3544  0.4799  0.4359  0.4011  0.352  0.4225  1  0.4796  0.3642  0.5126  0.364  0.3575 
0.4057  0.5036  0.5091  0.3796  0.4953  0.5291  0.39  0.4581  0.4549  0.4291  0.3542  0.5234  0.4682  0.3962  0.4796  1  0.41  0.5454  0.4097  0.3597 
0.4246  0.3541  0.5334  0.3906  0.4104  0.4009  0.448  0.5171  0.4734  0.5031  0.4875  0.4553  0.4535  0.478  0.3642  0.41  1  0.3866  0.4484  0.5465 
0.443  0.5071  0.413  0.4217  0.3501  0.354  0.407  0.4625  0.5458  0.473  0.4398  0.4305  0.3799  0.4375  0.5126  0.5454  0.3866  1  0.4062  0.3617 
0.3885  0.429  0.3691  0.4393  0.5438  0.503  0.522  0.4332  0.4346  0.3508  0.3992  0.432  0.3696  0.5188  0.364  0.4097  0.4484  0.4062  1  0.3759 
0.4519  0.4089  0.428  0.4452  0.5125  0.4166  0.547  0.3834  0.5143  0.4288  0.4385  0.391  0.5219  0.4279  0.3575  0.3597  0.5465  0.3617  0.3759  1 
Variances used in the simulations. The 20 variances were randomly selected in a range typical of our studies.
0.058864, 0.034697, 0.025862, 0.023389, 0.002802, 0.00839, 0.000179, 0.003993, 0.014377, 0.002584, 0.000641, 0.000972, 0.003201, 0.047617, 0.051366, 0.003047, 0.004786, 0.000368, 0.007916, 0.010357
Discussion
Correlations among gene expressions within a GO term invalidate the computed p-value when it is based upon assumed independence. Such estimates overstate the significance when the correlations are positive. This behavior was demonstrated analytically through a specific statistic, Equation (1), as well as through simulations. The simulated data show that the bias of the naïve p-value can be substantial with moderate correlation.
For didactic reasons, we used a statistic and a scenario which are mathematically tractable. However, it should be understood that overstating statistical significance is a problem for any statistic where the computed p-value for the GO term assumes independence among gene expressions. This is true for widely implemented tests which evaluate whether significant genes are 'over-represented' within a GO term. In these tests, p-values are based upon the hypergeometric distribution (Fisher's exact test) or its binomial or chi-squared approximations, and an assumption of independence is essential to the construction of the null distribution. In addition to the presented statistic, there are other meta-analysis tests based upon a uniform distribution of null p-values. A theory-based adjustment for correlation is difficult to construct for these tests. Naïve versions are biased, though not always as severely as the estimate presented here, so we have been pursuing corrections for them.
For the illustrated statistic, the naïve p-value is easy to correct when the correlation is known. In practice, the correlation must be estimated from limited data. Under the one-sample t-test scenario, we can estimate the applicable correlation. The simulation of a 'representative' group of 20 genes shows that estimating the correlation improves upon the naïve p-value with as few as 5 samples.
In the one-sided case, a t-statistic can be computed which implicitly adjusts for the presence of correlations. This statistic is easy to implement in existing computer programs; essentially, the program that computed by-gene p-values can be reused. This procedure can be extended to any statistical test that can be applied to the individual genes. As presented here, the one-sided alternative specifies that all genes change in the same direction; it is trivial to apply the procedure for any pre-specified direction of change for each gene. As our knowledge of expression profiles from responses to toxicity grows, this approach might become a standard test in screening chemicals for toxicity.
Conclusion
Reliable corrections for the effect of correlations among genes on the significance level of a GO term can be constructed for a one-sided alternative hypothesis. For general two-sided alternatives, the bias of naïve significance calculations can be greatly decreased although not eliminated.
Declarations
Acknowledgements
TL was supported by an Oak Ridge Institute of Science and Education (ORISE) fellowship at NCTR.
References
1. Tong W, Harris S, Cao X, Fang H, Shi L, Sun H, Fuscoe J, Harris A, Hong H, Xie Q, Perkins R, Casciano D: Development of public toxicogenomics software for microarray data management and analysis. Mutat Res 2004, 549: 241–253.
2. Khatri P, Draghici S: Ontological analysis of gene expression data: current tools, limitations, and open problems. Bioinformatics 2005, 21: 3587–3595. 10.1093/bioinformatics/bti565
3. Draghici S, Khatri P, Martins RP, Ostermeier GC, Krawetz SA: Global functional profiling of gene expression. Genomics 2003, 81: 98–104. 10.1016/S0888-7543(02)00021-6
4. Johnson NL, Kotz S: Discrete Distributions. New York: John Wiley & Sons; 1969.
5. Allison DB, Gadbury GL, Heo M, Fernandez JR, Lee CK, Prolla TA, Weindruch R: A mixture model approach for the analysis of microarray gene expression data. Computational Statistics and Data Analysis 2002, 39: 1–20. 10.1016/S0167-9473(01)00046-9
6. Delongchamp RR, Bowyer JF, Chen J, Kodell RL: Multiple-testing strategy for analyzing cDNA array data on gene expression. Biometrics 2004, 60: 774–782. 10.1111/j.0006-341X.2004.00228.x
7. Schweder T, Spjotvoll E: Plots of p-values to evaluate many tests simultaneously. Biometrika 1982, 69: 493–502. 10.2307/2335984
8. Hedges LV, Olkin I: Statistical Methods for Meta-Analysis. Academic Press; 1985.
9. Brown MB: A method for combining non-independent, one-sided tests of significance. Biometrics 1975, 31: 987–992. 10.2307/2529826
10. Kost JT, McDermott MP: Combining dependent p-values. Statistics & Probability Letters 2002, 60: 183–190. 10.1016/S0167-7152(02)00310-3
11. Xu X, Tian L, Wei LJ: Combining dependent tests for linkage or association across multiple phenotypic traits. Biostatistics 2003, 4: 223–229. 10.1093/biostatistics/4.2.223
12. Zaykin DV, Zhivotovsky LA, Westfall PH, Weir BS: Truncated product method for combining p-values. Genetic Epidemiology 2002, 22: 170–185. 10.1002/gepi.0042
13. Mootha VK, Lindgren CM, Eriksson KF, Subramanian A, Sihag S, Lehar J, Puigserver P, Carlsson E, Ridderstrale M, Laurila E, Houstis N, Daly MJ, Patterson N, Mesirov JP, Golub TR, Tamayo P, Spiegelman B, Lander ES, Hirschhorn JN, Altshuler D, Groop LC: PGC-1alpha-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nat Genet 2003, 34: 267–273. 10.1038/ng1180
14. Tian L, Greenberg SA, Kong SW, Altschuler J, Kohane IS, Park PJ: Discovering statistically significant pathways in expression profiling studies. PNAS 2005, 102: 13544–13549. 10.1073/pnas.0506577102
15. Simon R, Lam AP: BRB-ArrayTools User's Guide, Version 3.3. Technical Report 28, National Cancer Institute Biometric Research Branch, Bethesda, MD; 2005.
16. Delongchamp RR, Velasco C, Dial S, Harris AJ: Genome-wide estimation of gender differences in the gene expression of human livers: statistical design and analysis. BMC Bioinformatics 2005, 6(Suppl 2): S13. 10.1186/1471-2105-6-S2-S13
17. Desai VG, Moland CL, Branham WS, Delongchamp RR, Fang H, Duffy PH, Peterson CA, Beggs ML, Fuscoe JC: Changes in expression level of genes as a function of time of day in the liver of rats. Mutation Research – Fundamental & Molecular Mechanisms of Mutagenesis 2004, 549: 115–129. 10.1016/j.mrfmmm.2003.11.016
18. Parrish RS, Delongchamp RR: Normalization. In DNA Microarrays and Statistical Genomic Techniques: Design, Analysis, and Interpretation of Experiments. Edited by: Allison DB, Page GP, Beasley TM, Edwards JW. Boca Raton, FL: Chapman & Hall/CRC; 2005: 9–28.
19. Morrison DF: Multivariate Statistical Methods. New York: McGraw-Hill Book Company; 1967.
20. O'Brien PC: Procedures for comparing samples with multiple endpoints. Biometrics 1984, 40: 1079–1087. 10.2307/2531158
21. Lauter J: Exact t and F tests for analyzing studies with multiple endpoints. Biometrics 1996, 52: 964–970. 10.2307/2533057
Copyright
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.