Handling multiple testing while interpreting microarrays with the Gene Ontology Database
© Osier et al; licensee BioMed Central Ltd. 2004
Received: 02 April 2004
Accepted: 06 September 2004
Published: 06 September 2004
The development of software tools that analyze microarray data in the context of genetic knowledgebases is being pursued by multiple research groups using different methods. A common problem for many of these tools is how to correct for multiple statistical testing since simple corrections are overly conservative and more sophisticated corrections are currently impractical. A careful study of the nature of the distribution one would expect by chance, such as by a simulation study, may be able to guide the development of an appropriate correction that is not overly time consuming computationally.
We present the results from a preliminary study of the distribution one would expect for analyzing sets of genes extracted from Drosophila, S. cerevisiae, Wormbase, and Gramene databases using the Gene Ontology Database.
We found that the estimated distribution is not regular and is not predictable outside of a particular set of genes. Permutation-based simulations may be necessary to determine the confidence in results of such analyses.
With technological improvements and decreasing costs, microarrays are quickly becoming an affordable analytical tool for genetics analysis. Additionally, the arrays being used are of increasing spot density, allowing for more genes to be tested at once. One impact of the resulting increase in data flow is that it will become more likely that a researcher using microarrays will have greater difficulty making sense of results from preliminary statistical analyses without further computational exploration. In other words, once the researcher has received a list of genes, by whatever statistical means, that are differentially expressed, the task of determining the biological implications of that gene list will need to be performed by statistical methods utilizing computers.
Numerous research groups are developing software tools to perform an interpretation of the list of differentially expressed genes, generally by mapping against previously developed knowledgebases such as the Gene Ontology (GO) [1, 2] or GenMAPP  as a reference data set (reviewed briefly in ). Some tools, such as DAVID  and FatiGO  examine the percentage of the gene list that is directly associated with a node of the knowledgebase. This method is extremely fast due to its simplicity, but it does have disadvantages, which are also due to the simplicity of the analysis. For example, in some of these tools, information about how nodes (biological terms or steps in pathways) of the knowledgebase are related to each other is ignored. Additionally, in hierarchical structures such as GO, genes with a less precise functional definition will be associated with a node closer to the root than a gene with a more precise definition. In such a case, the information content about the two genes is split into different nodes, reducing the power of the analytical method.
Other tools such as GOMiner  and MAPPFinder  analyze the gene list in a broader context of the knowledgebase, looking for patterns of a larger scale than a single node. MAPPFinder searches for whole pathways (MAPPs) over-represented by the gene list. GOMiner performs analyses using genes associated with a node in GO or genes associated with any children of that node, sometimes called "inclusive analysis". In this way, GOMiner minimizes the power reduction of some simpler methods.
These tools provide a powerful way for the researcher to quickly get a summarization of the gene list within a biological context. One common problem for the inclusive analytical methods, especially those using knowledgebases with polyhierarchical structures (individual nodes can have multiple parents) like GO, is correcting for multiple statistical tests, usually thousands. In such a case, a Bonferroni correction is overly conservative to the point of being counterproductive since few if any results of the interpretation remain significant . As of June, 2003, GODB had >13,000 DAG nodes which may be tested, meaning a correction factor of greater than four orders of magnitude would be needed in a Bonferroni correction. Other standard methods used include controlling the Family-Wise Error Rate (FWER) using a numerical correction of the p-value (discussed in ) or controlling the False Positive Rate (FDR, discussed in ). In both cases the methodology should again be overly conservative since, when using inclusive analysis, the p-values for each GO term are not independent .
Here we present, in the context of the program GOArray , a preliminary analysis of the feasibility of using permutation-based simulations to provide an alternate method of handling the multiple-testing problem. GOArray analyzes the gene list in the context of GO. Permutations of the differentially expressed gene list are generated from the total list of genes represented on the microarray to estimate the distribution of significant GO terms expected by chance. We analyze the nature of the distribution of significant terms in reference to varying p-values and numbers of differentially expressed genes using publicly available data sets. We then compare the list of significant terms between data sets. Finally, we discuss the implications of this distribution to provide one solution to the multiple-test problem when analyzing microarray data in the context of GO.
Based on this set of simulations, predictability appears to be limited to specific data sets. One method of correcting our expectations after performing multiple tests would be to calculate a factor by which to modify α based on the DAG of GO terms. In other words, one could use an adjusted p-value to control the FWER or FDR. Controlling for these two types of error by use of adjusted p-values, however, assumes independence of the tests . Since there is currently no practical method for directly untangling the interdependence of terms in the GO hierarchy to generate a less conservative correction, adjusted p-values are limited to overly strict results.
Another method would be to determine a formula that conservatively approximates the simulated distribution. Unfortunately, the only commonality between the distributions is that, with the exception of the Drosophila data sets, the number of significant terms increases with an increasing number of GOI and an increasing p-value cutoff. The magnitude and detailed shape of the distribution varies between all tested data sets. Even in the more regular non-Drosophila data sets, there were some fluctuations in the distribution, and a smooth surface was not observed. Since neither of the two methods of correcting expectations is currently feasible, it appears that, for now, we are forced to rely on simulation-based methods to estimate the expected distribution of significant terms for each set of genes being examined.
While it would be desirable to have a smooth topology that allows for a simple formulaic calculation of the number of significant terms one would expect by chance, it is unfortunately not observed for the Drosophila data sets examined here. The trough that disrupts the Drosophila data sets was not observed, however, in the data sets for other species. The cause of this trough is undetermined, but may be due to structure within the graph of GO terms associated with FlyBase accessions. Alternatively, there could be structure within the chosen genes that is more evident with smaller data sets, since the trough appears to be deepest for the two smaller data sets. One way to approach the question of cause would be to examine which, if any, terms are observed disproportionately in the permuted sets. Based on the frequency of terms it may be possible to observe a pattern in either the genes tested or the set of associated GO terms. We have been unable to observe such a pattern, but that does not mean it does not exist. If one could be found, it may give insights into how to dissect the structure, possibly leading to a more elegant solution to the multiple test problem than a simulation-based approach.
Though we were unable to find hints of an easy formulaic way to correct our expectations, we may be able to find a practical (e.g., efficient) method of correction through simulations. There are several ways in which simulated estimates of the distribution could be implemented to provide a less conservative method, yet still statistically appropriate, than a Bonferroni correction to handle the problem of correcting our expectations after performing multiple statistical tests. The simplest to implement, and likely the most accurate, would be to perform a permutation-based simulation for each analysis of a microarray data set in the context of GO. The primary problem with this approach is that it is computationally intensive since the GOI would need to be permutated and scored a thousand or more times for every analysis of a microarray. While tools such as parallel processing can reduce the absolute time necessary to perform the simulations, it is not the most elegant way to solve the problem.
Another method would be to simply generate the estimated distribution, again using a permutation-based simulation, once for each set of accession numbers (e.g., each microarray design) for a range of GOI counts and p-value cutoffs, similar to what we have done here but in finer detail, and storing the results. The most conservative simulation distribution neighboring the experimental combination of p-value and GOI count could then be extracted from the stored table to provide an estimated distribution. One problem with this approach is determining how fine a table to design (e.g., the number of values to simulate for each of the two primary parameters). With a simple 10 × 10 matrix, the simulation took ~16–24 hours on a single 2.4 GHz Xeon processor. A finer matrix of parameter values will result in a better estimation of the topology, but consumes more time to compute in a non-linear fashion. However, if a large number of microarray experiments is to be conducted with a single geometry, this method would reduce the total time to estimate significance across all experiments since the simulation would only need to be performed once. Additionally, it will be necessary to determine what range of values should be considered. For the smallest data set tested here (>2500 FlyBase accessions), GOI lists representing less than 20% (500) of the accession numbers were used. The amount of computation time that should be dedicated to simulating the distribution of significant terms expected by chance will likely be a balance determined by the computing resources available, estimates of how many experiments will use the array design, and minimal p-value cutoffs and maximal GOI parameter values determined by the predicted user needs.
Based on the large simulations performed here, it appears that the rate at which terms are observed as significant is not predictable between sets of genes for a given GOI count and p-value cutoff. Even within a particular species, there is no correlation in relative frequency at which particular terms are significant. Therefore, permutation-based simulations appear to be the most reliable way to generate an estimate of the expected distribution of significant terms. As a result, we plan to extend the confidence tests in the next version of GOArray (version 2.0) by implementing a "false positive frequency estimation" for individual terms based on simulation results. Also, since which terms are observed as significant appears to be highly dependent on the structure of the gene list, and possibly the list of GOI, we plan to examine the merits of bootstrap methods (e.g. in the simulations choosing GOI from the original list of GOI with replacement) rather than a strict permutation method (e.g. choosing GOI from the total list of genes without replacement).
In the best case, it appears feasible to pre-generate the estimated distribution of the number of significant terms through a permutation-based simulation, then use a lookup table during analyses of experimental data sets. In the worst case, one would need to generate the distribution for each experimental data set, possibly testing various p-value cutoffs to determine where power is maximal. Even in the worst case, currently available processing power allows the test for a single set of genes and a single p-value cutoff to be performed in well under an hour. While near-instant results would be desirable by end users, the worst case scenario is still quite practical and will only improve over time alongside general computer performance. Thus, relying on permutation-based methods may not be a serious inconvenience, and in fact a highly accurate method of assessing our confidence in the results of the analysis.
All tests were performed on a single processor of a dual Xeon 2.4 MHz CPU system with 2 gigabytes of RAM. The operating system was RedHat Linux 7.3 with an SMP kernel. All time calculations were determined using the Linux command time.
GOArray is a Perl script that maps genes of interest (GOI) and non-GOI (NGOI), where the difference between the two gene lists is determined by the researcher, from a microarray experiment to terms in GO and all of that term's parent terms. The GO rooted-DAG is represented in a hash table using the GODB field terms.id as the keys. A z-score is assigned to each term based on the number of genes associated with that term or any of its children relative to the total numbers of GOI and NGOI. Z-scores were used to calculate p-values since they are easy and efficient to compute, and they approximate the hypergeometric p-values when the number of NGOI and GOI for the entire data set is large compared to the NGOI and GOI for the individual nodes. Terms with only one gene in the numerator (GOI) are not given a z-score since it is not possible to have an overrepresentation of GOI with a single gene. P-values are determined using twice the value (e.g., a two-sided test) returned by the routine "uprob($z)" (where "$z" is the z-score) from the Perl module Statistics::Distributions available from the Comprehensive Perl Archive Network (CPAN) . The June 2003 GODB data set is used in this analysis.
Simulations of the GOI list are performed by permuting the status of each gene, keeping the total number of GOI constant. For example, in the case of an experiment examining 5000 total genes with 100 GOI, in each simulation 100 of the 5000 genes are assigned the status of GOI, and 4900 genes are assigned the status of NGOI.
The only modifications to the GOArray source code in this analysis are the addition of loop structures to iterate the numbers of GOI and count the number of significant terms under different p-value cutoffs to determine when a term is significant, the use of a user-determined random number seed rather than a computer determined one for reproducibility, and the addition of a routine to summarize the simulation data.
The source code for both GOArray and the modifications discussed here are available on the Web .
Using the modified GOArray code, the number of significant terms were determined for p-values (determining which terms were significant, not which genes were GOI) from 0.05 down to ~0.000098 (starting with 0.05 and decreasing the p-value by a factor of 2 with each iteration), and GOI counts from 50 to 500 in increments of 50. This generates the number of significant terms for each of 1000 permutations for all combinations of ten different p-values and ten different GOI counts, for a total of 100 distributions of 1000 permutations for each data set.
This work is supported in part by NIH grant K25 HG02378 (KHC), by NIH grants T15 LM07056 and P20 LM07253 from the National Library of Medicine (MVO), by NSF grant DBI-0135442 (KHC), and by NSF grant DMS 0241160 (HYZ).
- The Gene Ontology Consortium: Gene Ontology: tool for the unification of biology. Nature Genetics 2000, 25: 25–29. 10.1038/75556PubMed CentralView ArticleGoogle Scholar
- The Gene Ontology[http://www.geneontology.org/]
- Dahlquist KD, Salomonis N, Vranizan K, Lawlor SC, Conklin BR: GenMAPP, a new tool for viewing and analyzing microarray data on biological pathways. Nature Genetics 2002, 31: 19–20. 10.1038/ng0502-19View ArticlePubMedGoogle Scholar
- Osier MV: Post-Analysis Interpretation: "What do I do with this gene list?". In DNA Microarrays and Statistical Genomic Techniques: Design, Analysis, and Interpretation of Experiments (Edited by: Allison, Page, Beasley, Edwards). New York: Marcel Dekker, Inc , in press.Google Scholar
- Dennis G Jr, Sherman BT, Hosack DA, Yang J, Gao W, Lane HC, Lempicki RA: DAVID Database for Annotation, Visualization, and Integrated Discovery. Genome Biology 2000, 4: R60. 10.1186/gb-2003-4-9-r60View ArticleGoogle Scholar
- Al-Shahrour F, Díaz-Uriarte R, Dopazo J: FatiGO: a web tool for finding significant associations of Gene Ontology terms to groups of genes. Bioinformatics 2004, 20: 578–580. 10.1093/bioinformatics/btg455View ArticlePubMedGoogle Scholar
- Zeeberg BR, Feng W, Wang G, Wang MD, Fojo AT, Sunshine M, Narasimhan S, Kane DW, Reinhold WC, Lababidi S, Bussey KJ, Riss J, Barrett JC, Weinstein JN: GoMiner: a resource for biological interpretation of genomic and proteomic data. Genome Biology 2003, 4: R28. 10.1186/gb-2003-4-4-r28PubMed CentralView ArticlePubMedGoogle Scholar
- Doniger SW, Salomonis N, Dahlquist KD, Vranizan K, Lawlor SC, Conklin BR: MAPPFinder: using Gene Ontology and GenMAPP to create a global gene-expression profile from microarray data. Genome Biology 2003, 4: R7. 10.1186/gb-2003-4-1-r7PubMed CentralView ArticlePubMedGoogle Scholar
- Westfall PH, Young SS: Resampling-Based Multiple Testing New York: John Wiley & Sons 1993.Google Scholar
- Benjamini Y, Hochberg Y: Controlling the False Discovery Rate: a Practical and Powerful Approach to Multiple Testing. J R Statist Soc B 1995, 57: 289–300.Google Scholar
- Slonim DK: From patterns to pathways: gene expression data analysis comes of age. Nat Genet 2002, Suppl 32: 502–508. 10.1038/ng1033View ArticleGoogle Scholar
- Edgar R, Domrachev M, Lash AE: Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Research 2002, 30: 207–210. 10.1093/nar/30.1.207PubMed CentralView ArticlePubMedGoogle Scholar
- Arbeitman MN, Furlong EE, Imam F, Johnson E, Null BH, Baker BS, Krasnow MA, Scott MP, Davis RW, White KP: Gene expression during the life cycle of Drosophila melanogaster. Science 2002, 297: 2270–2275. 10.1126/science.1072152View ArticlePubMedGoogle Scholar
- Meiklejohn CD, Parsch J, Ranz JM, Hartl DL: Rapid evolution of male-biased gene expression in Drosophila. PNAS 2003, 100: 9894–9899. 10.1073/pnas.1630690100PubMed CentralView ArticlePubMedGoogle Scholar
- Comprehensive Perl Archive Network, CPAN[http://www.cpan.org]
This article is published under license to BioMed Central Ltd. This is an open-access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.