Testing for treatment effects on gene ontology

In studies that use DNA arrays to assess changes in gene expression, it is preferable to measure the significance of treatment effects on a group of genes from a pathway or functional category such as gene ontology terms (GO terms, ) because this facilitates the interpretation of effects and may markedly increase significance. A modified meta-analysis method to combine p-values was developed to measure the significance of an overall treatment effect on such functionally-defined groups of genes, taking into account the correlation structure among genes. For hypothesis testing that allows gene expression to change in both directions, p-values are calculated under the null distribution generated by a Monte Carlo method. As a test of this procedure, we attempted to distinguish altered pathways in microarray studies performed with Mitochips, oligonucleotide microarrays specific to mitochondrial DNA-encoded transcripts. We found that our analytic method improves the specificity of selection for altered pathways, due to incorporation of the inter-gene correlation structure in each pathway. It is thus a practical method to measure treatment effects on GO groups. In many actual applications, microarray experiments measure treatment effects under complicated design structures and with small sample sizes. For such applications to real data of limited statistical power, and also in computer simulations, we demonstrate that our method gives reasonable test results.


Introduction
The advent of DNA microarray technology has revolutionized genomic research and medicine because of its ability to simultaneously determine the expression levels of thousands of genes. However, the interpretation of large amounts of microarray gene expression data, and the abil-ity to derive biologically meaningful conclusions from such data, have always been daunting tasks for statisticians. Because of the high volume and complex characteristics of microarray data, much of the initial work on their analysis has focused on development of data mining or data reduction methods to identify differentially expressed genes. Typically, the p-value of a test statistic is calculated for each gene, the genes are ranked according to these p-values, and a pre-specified significance criterion, such as the false discovery rate, is used to determine a cutoff which creates a category of differentially expressed genes [1][2][3].
Attempts to interpret individual genes in a list of significant genes are demanding and laborious. Therefore, recent efforts have focused on discovery of biological pathways rather than on individual gene function. Gene ontology terms (GO terms, http://www.geneontol ogy.org) reflect gene groupings based on molecular function, biological process, or cellular structure/organelle. The interpretation of differentially expressed GO groups is generally simpler than the presentation of a list of statistically significant genes, and more resistant to erroneous conclusions that can arise from microarray artefacts.
Several statistical methods that combine the analysis of differential gene expression with biological databases have been proposed and implemented in computer packages for a more rapid interpretation of genome-wide expression data [4]. However, most such methods are based on a series of univariate statistical tests and do not properly account for the complex structure of gene interactions. The statistical significance of a GO group is commonly assessed by comparing the number of statistically significant genes in the group to the number expected by chance using Fisher's exact test, which is based on the hypergeometric distribution [5]. Fisher's exact test is used to compare these proportions to assess overrepresentation of significant genes in functional categories. This approach is not amenable to correction for correlations among p-values, since the test inherently assumes exchangeability among genes, an assumption which is not valid under arbitrary or actual correlation structures [6,7].
Hotelling's T 2 Statistic and permutation methods address the correlation structure among genes. Hotelling's T 2 statistic is not applicable when the sample size is smaller than the number of genes in a GO term [8]. Permutation methods, although quite valuable under appropriate conditions, are severely compromised by limited numbers of permutable sample pairs. In many cases, the design of microarray studies has a rather complicated structure intended to manage technical variation associated with differences among probes, dyes, and reagent batches by creating treatment blocks within these sources of variation [9]. Such cases are not suited to permutation methods.
A modified meta-analysis method was developed by Delongchamp et al. [10] to combine p-values, and thus to measure the significance of an overall treatment effect on a group of genes, while taking into account the inter-genic correlation structure. The method is based on the fact that p-values follow a uniform distribution under the null hypothesis. Inverse-normal transformed p-values have a normal distribution and their sum over a set of genes also would follow a normal distribution, provided that the component p-values are independent. The test we have developed to measure the significance of overall treatment effect on genes within a GO category is based on a modification of this statistic, to reflect the actual correlation structure among genes sharing a GO term.
In this paper, we extend the method from a simple oneclass t-test with the null hypothesis H 0 : μ = 0 to a test for pair-wise contrast in a fixed-effects linear model. In the following sections, we describe in detail the extension of the methodology, with validation through computer simulations and application to two toxicogenomics studies designed to evaluate treatment effects on the levels of mRNA transcripts involved in mitochondrial function. We thus demonstrate that this methodology provides a practical approach to testing the significance of the treatment effects on gene classes defined by GO terms, and by extension on any other prior categorization of genes into functional subsets. Because many microarray experiments measure treatment effects under complicated design structures and with small sample sizes [9], we used a simulation study to determine whether the method gives reasonable results under these conditions. Specific applications to toxicogenomics studies showed that the methodology has improved specificity in choosing significantly altered pathways or functional categories, and may thus assist in the understanding of molecular mechanisms of mitochondrial toxicity in the liver induced by anti-HIV drugs [11,12] and in assessing effects on mitochondrial function of weight-reducing dietary supplements, such as usnic acid [13].

Measurement of a treatment effect for each gene
Under a fixed-effects linear model, gene expression data for an arbitrary gene can be written as y = Xβ + ε, where y and ε are n × 1 random vectors, X is a known n × p design matrix of rank r, and β is a p × 1 vector representing unknown parameters. The vector y denotes an observed measurement of expression, suitably transformed, for n biological samples, and ε is an error vector, distributed as N n (0, σ 2 I n ), where σ 2 denotes the unknown within-treat-ment variance among samples. The parameters β and σ 2 are assumed to be gene-specific. Statistical analyses are applied to one gene at a time, with a common design matrix, X. The unbiased estimators of β and σ 2 are In many toxicogenomic studies, the significance of a treatment effect is tested under the null hypothesis H 0 : cβ = 0, where cβ is a pair-wise contrast among treatments. Under the null hypothesis, has a t-distribution with n -r degrees of freedom, and the p-value to assess the significance of a treatment is calculated from this statistic.

Test for a gene group
A modified meta-analysis method of combining p-values was developed to measure the significance of an overall treatment effect on any group of genes by a one-class t-test [10]. The p-value calculated from the null hypothesis is a random variable with uniform distribution, which can be transformed to a suitable probability distribution. , gives the overall significance of the treatment effect on the GO group. We refer to this as the naïve estimate because it assumes independence among p-values.
In reality, genes in a GO group are likely to be functionally related and thus not independent. When the correlation structure among genes is known, we can make a simple adjustment of the naïve estimate. In this case, the test statistics T still has a standard normal distribution and we denote it as , for the k-th gene in a GO group. A common contrast vector, c, is used through all genes since we are measuring same contrast for each gene. The summary statistic for a GO group, is also nor-mally distributed and its variance is , where 1 is m vector of 1s and R is the correlation matrix of (y 1 , y 2 , ʜ, where r s,t is the s-th row and t-th column element of R.
Therefore, is the appropriate p-value which corrects the naïve p-value, .
It follows that , where is the average of off-diagonal elements of R. The correction depends only on the average correlation, , and the correction tends to give a reduced significance when > 0. When R is unknown, we estimated the covariance, to provide an average correlation coefficient for the correction.
The correlation structure of p-values is different for a twosided t-test, which allows gene expression changes in both directions, than for a one-sided situation. A two-tailed test, in which p = 2(1 -Φ(|z|)), requires a different correction method, since the correlation among |z k |, k = 1, ʜ, m differs from a one-sided test in which p = 1 -Φ(z). The null distribution of the summary statistics 1'|z| can be generated through Monte Carlo sampling from the null distribution of z, MVN(0, cov(z)). When z 1 , ʜ, z n are random samples from MVN(0, cov(z)), the p-value for the observed value, Ψ = 1'|z|, is computed as , where I(A) is an indicator function which gives 1 if A is true, or 0 otherwise. Here, cov(z)ˆˆˆ) has to be estimated from the data. The estimated correlation, and its variation, , which has for off-diagonal elements, are used to estimate cov(z).

Simulation
The derivation of the method presented in the previous section is based on the known correlation matrix of the vector of dependent variable Y. When the correlation matrix is not known, we use an estimate of the correlation matrix. In reality, the correlation matrix is always unknown. The proposed method produces an approximately correct p-value for a group of genes. To demonstrate that the method gives not perfect but acceptably correct p-values, we present simulation results in this section.    to a uniform distribution than when is used. The estimate of the average correlation, , is more robust than that of each element of when sample size is small (n = 3 for each group). Figure 3 shows the distribution of p-values for a case with larger samples. The simulation for one-class t-test with sample size n = 25 was conducted as above, to compare several methods for two sided tests. The empirical distributions of p-values were generated from 500,000 iterations. We looked at small p-values between 0.1 and 0.001 on a log scale. Figure 3 shows that the Hotelling T 2 test gives the smallest difference from the uniform distribution. The Hotelling T 2 test is applicable when the number of sample is larger than the number of genes in a group.
When we have a reasonably correct estimate of R, the method is a little better than the method which uses an approximation of R. Both the method and the method give quite accurate p-values with reference to the p-values from the true correlation matrix, R.

Examples
We present two real-world examples based on data from Mitochip, a mitochondria-specific mouse oligonucleotide microarray which was developed by Dr. Varsha Desai at

RR
the National Center for Toxicological Research [11]. Mitochip measures the levels of mRNA for 542 mitochondrial and nuclear genes associated with mitochondrial structure and function. Each Mitochip includes 9 housekeeping genes and 9 Arabidopsis plant genes to serve as positive and negative control genes, respectively. We considered 317 relevant GO groups related to mitochondrial functions, based on a database from Mouse Genome Informatics (MGI, http://www.informatics.jax.org). Table 1 shows the design of an experiment to test the effects of zidovudine (AZT) and lamivudine (3TC) on mouse-liver gene expression. AZT is an anti-HIV drug used to reduce mother-to-child transmission of the virus. AZT Comparison of two-sided tests with sample size n = 25  Fifteen samples are collected and assayed for gene expression using the experimental design is reported to produce severe adverse effects, and shows more toxicity when AZT is applied in combination with 3TC. Adverse effects are believed to be due to druginduced mitochondrial disfunction [14].

RR R
Oxidative phosphorylation is a key mitochondrial function that requires the electron transport assembly of four protein complexes (I, II, III, IV) to catalyze sequential oxidation/reduction reactions, and complex V to generate ATP. Several clinical and animal studies have investigated the effect of nucleoside reverse transcriptase inhibitors (NRTIs), analogs such as AZT, on mitochondrial respiratory chain complexes. These studies suggest that there is a deficit in one of the components of complexes I and IV in skeletal muscle of children perinatally exposed to antiretroviral nucleoside analogues [15]. Table 2 shows p-values indicating the significance of treatment effects on the GO groups related to oxidative phosphorylation and apoptosis. The two-sided correction method detects significant effects on genes encoding components of complexes III and IV, whereas the naïve method finds that genes in all 5 complexes are significantly affected. This demonstrates that the two-sided correction method is more specific in finding significantly affected gene groups, although of course the "true" answer is not known a priori. Although Fisher's exact test also detects significant alteration in genes of complex IV, this test appears to be detecting a gene group that is different from the other groups, rather than registering treatment effects directly. The one-sided test is not applicable since it seems that gene expression changes in both directions after AZT and 3TC treatment.
Usnic acid is a lichen metabolite used as a weight-loss dietary supplement due to its uncoupling action on mitochondria. However, its use has been associated with severe liver disorders in many individuals. Animal studies conducted thus far have evaluated effect of usnic acid on mitochondria, primarily by measuring the rate of oxygen consumption and/or ATP generation. Generation of ATP requires tight coupling of electron transport with oxidative phosphorylation, maintained through a proton gradi-ent across the inner mitochondrial membrane. An important finding of the study is a lack of usnic acid effect on complex V, despite a significant up-regulation of all four complexes of the electron transport chain. Usnic acid is a known uncoupler that is highly lipophilic in both neutral and anionic forms due to its numerous carbonyl groups that absorb the negative charge of the anion by resonance stabilization. This lipophilicity of usnic acid and the usniate anion allows easy passage of both entities through the mitochondrial membranes by passive diffusion into the matrix where it is ionized, releasing a proton into the matrix. The resulting usniate anion can then diffuse back into the inter-membrane space where it binds to the proton on the acidic side of the inner membrane to reform usnic acid which can then diffuse back into the matrix. The resulting cycle causes proton leakage that eventually can dissipate the proton gradient across the inner membrane, disrupting the tight coupling between electron transport and ATP synthesis. This model would explain the absence of gene-expression changes associated with complex V in usnic acid-treated mice, despite the increased electron transport by complexes I -IV. It may also explain the decline in ATP level in spite of increased oxygen consumption.
In Table 3, only the two-sided correction method enables us to explain the function of usnic acid as described above. The one-sided correction method gives p-values similar to those in the two-sided correction method, but this is likely to be due to most of the gene expression changes entailing up-regulation. When the direction of gene expression change due to a treatment is known, then the one-sided correction method is the appropriate choice; it also needs less computation time than the twosided correction method which employs Monte Carlo sample generation.

Discussion
In many studies that use microarray data, the number of samples is small as in the first example shown above. While the number of samples in the simulations is only 3 for each group, the distribution of corrected p-values For the one-sided test, the correction for the correlation depends only on , the average of off-diagonal elements of the correlation matrix. The corrections using or are equivalent for the one-sided test method. Important points in choosing a correlation estimate for the twosided test are the following; 1) The correlation estimate should be robust for small sample sizes, and 2) The correlation estimate should preserve . satisfies these two conditions.
When the direction of gene expression change is pre-specified, the one-sided test is a good choice since it is easy and fast to calculate p-values. However, the two-sided test is the one we have to use in most cases, because it is usually not possible to pre-specify how individual genes will respond to treatment in the exploratory context. When we have a small number of samples to estimate the correlation, the method gives a robust result. Since misrepresents the true correlation, and gives biased p-values, the method works better for larger sample sizes. This is seen in Figure 3, where the method is better than the method. We hesitate to present a specific threshold sample size as sufficient for a converged correlation estimate, since it varies with respect to several conditions, such as the number of genes in a group, the variation of the elements of the correlation matrix, etc. The simulation result in Figure 3 shows that both methods give quite accurate p-values compared to the p-values from the true correlation matrix, R. The Hotelling T 2 test is the best choice whenever it is applicable.
In Table 1, the distribution of animals from different treatment groups (A-E) in three batches (1)(2)(3) gives no per-mutable pairs. In this case, randomization methods are not applicable. Even though randomization methods inherently take into account the correlation structure among genes, they may not be practical when the design of the experiment is complicated and the number of samples per group is small, reducing the numbers of permutable pairs. Randomization methods that permute class labels can adjust p-values for the correlation structure among genes. Randomization methods choose a summary statistic (e.g. enrichment score (ES) in [16], average z-score in [17]), which reflects the degree to which a set of genes is enriched. When the significance of the summary statistic is measured by permuting class labels, the method preserves gene-gene correlations and when applicable, would give similar result to the presented method. Randomization methods that permute gene labels, such as Fisher's exact test, do not preserve the correlation structure and misrepresent the group's significance.

Conclusion
We have presented a method to test the significance of expression changes within a group of genes, while considering the correlation structure among genes in each group. This method will enable the rapid detection of microarray evidence indicating altered cell functions or pathways, and will facilitate the interpretation of microarray outcomes. Application of the method to real data shows that it is an improved, practical method to evaluate the effects of treatments on functional classes of genes such as those based on Gene Ontology descriptors.