PRESTO: Rapid calculation of order statistic distributions and multipletesting adjusted Pvalues via permutation for one and twostage genetic association studies
 Brian L Browning^{1}Email author
DOI: 10.1186/147121059309
© Browning; licensee BioMed Central Ltd. 2008
Received: 06 February 2008
Accepted: 13 July 2008
Published: 13 July 2008
Abstract
Background
Largescale genetic association studies can test hundreds of thousands of genetic markers for association with a trait. Since the genetic markers may be correlated, a Bonferroni correction is typically too stringent a correction for multiple testing. Permutation testing is a standard statistical technique for determining statistical significance when performing multiple correlated tests for genetic association. However, permutation testing for largescale genetic association studies is computationally demanding and calls for optimized algorithms and software. PRESTO is a new software package for genetic association studies that performs fast computation of multipletesting adjusted Pvalues via permutation of the trait.
Results
PRESTO is an order of magnitude faster than other existing permutation testing software, and can analyze a large genomewide association study (500 K markers, 5 K individuals, 1 K permutations) in approximately one hour of computing time. PRESTO has several unique features that are useful in a wide range of studies: it reports empirical null distributions for the topranked statistics (i.e. order statistics), it performs userspecified combinations of allelic and genotypic tests, it performs stratified analysis when sampled individuals are from multiple populations and each individual's population of origin is specified, and it determines significance levels for one and twostage genotyping designs. PRESTO is designed for casecontrol studies, but can also be applied to trio data (parents and affected offspring) if transmitted parental alleles are coded as case alleles and untransmitted parental alleles are coded as control alleles.
Conclusion
PRESTO is a platformindependent software package that performs fast and flexible permutation testing for genetic association studies. The PRESTO executable file, Java source code, example data, and documentation are freely available at http://www.stat.auckland.ac.nz/~browning/presto/presto.html.
Background
Permutation testing is often described as the goldstandard for determining statistical significance when performing multiple correlated tests for genetic association. Permutation testing can be applied to both casecontrol studies and trio studies (parents and affected offspring). In permutation testing, the case/control status of the individuals (for casecontrol studies) or the transmitted/untransmitted status of the parental chromosomes (for trio studies) are randomly permuted. The maximum test statistic, maximized over all tests for all markers, is calculated for the original affection/transmission status and for each permuted affection/transmission status. If k out of P permutations have a maximum test statistic greater than the maximum test statistic for the original data, the multipletesting adjusted Pvalue for the experiment is (k+1)/(P+1) [1].
Permutation testing is computationally demanding for largescale genetic association studies and requires an optimized software implementation. The PRESTO software package provides fast permutation testing for genomewide association studies with thousands or millions of markers genotyped on thousands of samples. In addition to using permutation of the trait status to determine statistical significance of userspecified allelic and genotypic tests, PRESTO has three additional useful features: it can compute empirical distributions of order statistics so that the significance of sophisticated multimarker statistics such as truncated products can be determined [2, 3], it can perform stratified tests when sampled individuals are from multiple populations and each individual's population of origin is specified, and it can compute significance levels for twostage genotyping designs using only firststage genotyping data [4].
Implementation
Features
PRESTO is designed to be flexible and userfriendly. Input files have a simple format with rows corresponding to markers and columns corresponding to individuals (two columns per diploid genotype). This format is wellsuited to largescale genetic studies where there are typically many more markers (rows) than individuals (columns). Genetic marker data can be split up over multiple input files (e.g. one file per chromosome). There are no restrictions on how alleles or missing data are coded, and any sequence of nonwhite space characters can be used. Multiallelic markers are permitted and are analyzed by creating a diallelic marker for each allele (grouping the other alleles) and testing each diallelic marker for association with the trait status. If the cases and controls are sampled from a stratified population and the strata are specified, PRESTO will automatically perform stratified allelic and genotypic tests [5, 6] and will permute the trait status within each population stratum.
PRESTO can also compute significance levels of combined single locus and multilocus analysis by representing clusters of haplotypes as diallelic markers as described in Browning and Browning [7]. By default, PRESTO permutes casecontrol status for individuals; however, it can also permute casecontrol status for chromosomes, so that transmitted/untransmitted data from trio studies (parents and affected offspring) can be analyzed.
PRESTO performs a CochranMantelHaenszel (CMH) test with continuity correction and a Mantel trend test [5, 6]. The CMH test is a generalization to stratified data of the standard chisquare test of independence, and the Mantel trend test is a generalization to stratified data of the allelic trend test. When there is only one population stratum, these test statistics are equal to the standard test statistics after multiplying by N/(N 1) where N is the sample size. The CMH test is used to test for recessive, dominant, or overdominant effects. Although the CMH test can be used to test for allelic effects, the Mantel trend test is the preferred allelic test since it is robust to departure from HardyWeinberg equilibrium.
For each permutation of the trait status, PRESTO can store and report the topranked order statistics. The jth order statistic is the jth largest test statistic. For each marker, the test statistic is the largest allelic or genotypic test statistic for that marker. The empirical distributions of the topranked order statistics enable one to calculate an empirical Pvalue for any order statistic or any function of order statistics. An example of a function of order statistics is the jth rank truncated product statistic, which is defined as the product of the nominal Pvalues of the j largest test statistics [2]. Rank truncated product statistics are useful for detecting association of multiple topranked markers with a trait, even when no individual marker has a significant multipletesting adjusted Pvalue.
PRESTO can also calculate significance levels for twostage genotyping designs from the firststage genotype data using the technique described by Dudbridge [4]. In a twostage genotype design, the sample is divided into two stages, a set of markers is genotyped in the first stage sample, and only the subset of firststage markers with test statistics greater than a specified threshold are genotyped in the second stage sample. PRESTO samples the null distribution of the topranked order statistics for a twostage genotyping design by using a subset of the first stage sample as a simulated first stage sample and the remainder of the first stage sample as a simulated second stage sample [4].
Optimization techniques
PRESTO employs several techniques to optimize permutation testing on largescale data sets. The permutations of the trait status are computed once and are stored. Each permutation of the trait status is represented as an array of Boolean (1 bit) variables in which the kth binary indicator gives the affection status of the kth chromosome in the input file. Each genetic marker is read once and is tested against all stored permutations of the trait status, so that only one marker is stored in memory at a time.
For each permutation of the trait status and each diallelic marker, a 2 × 3 contingency table is created where the rows are the cases and controls and the columns are the three possible genotypes. PRESTO obtains the 2 × 3 contingency table counts without having to check the permuted trait status and genotype for all individuals. Instead, PRESTO stores the indices of individuals with missing genotypes, heterozygote genotypes, and minor (least common) allele homozygote genotypes. The indices of individuals with major allele homozygote genotypes do not need to be stored because the case and control major allele homozygote genotype counts can be calculated from the case and control sample sizes, the case and control missing genotype counts, and the case and control heterozygote and minor allele homozygote genotype counts. For example, if there are N genotypes, and 90% of these are major allele homozygotes, then contingency tables for each permutation of the trait status are constructed by examining the permuted trait status of N/10 individuals instead of N individuals. This optimization is expected to be increasingly effective as denser marker sets are developed since markers with low minor allele frequency are much more numerous than markers with high minor allele frequency. An analogous optimization is used to obtain allele contingency table counts.
Output files
PRESTO produces three output files: a log file, a Pvalue file, and a null distribution file. The log file summarizes the analysis and reports the command line parameters, the running time, and a list of all markers with a multipletesting adjusted Pvalue less than 0.2.
The Pvalue file gives the chisquare test statistics for each allelic and genotypic test performed for each marker, and the permutation Pvalue for the maximum test statistic for each marker (maximized over all allelic and genotypic tests for the marker). If a marker has a maximum test statistic t_{0} when tested for association with the unpermuted trait status, and if for k out of P permutations of the trait status there exists at least one marker with a maximum test statistic ≥ t_{0}, then the multipletesting adjusted Pvalue for the marker is (k+1)/(P+1) [1]. The Pvalue file has a simple format and can be read into standard statistical software packages, such as R [8] for filtering and sorting. An R script for displaying QQplots of Pvalue distributions is available from the PRESTO web site.
The null distribution file gives the largest test statistics for each permutation of the trait status. If there are P permutations and the K largest test statistics are saved, then the null Pvalue file is a P × K whitespace delimited matrix whose rows correspond to permutations and whose columns are the empirical distributions of the topranked order statistics. The entry in row i and column j is the jth largest test statistic for the ith permutation. Thus the jth column gives the empirical distribution of the jth largest test statistic.
Results
Computational time
PRESTO running times for the Wellcome Trust Case Control Consortium Crohn's disease study.
# order statistics  # strata  onestage study  twostage study 

1  1  52.3 m  33.8 m 
1  12  84.9 m  55.1 m 
1000  1  56.6 m  34.3 m 
1000  12  85.6 m  58.5 m 
Running times for PRESTO 1.0.1 and PLINK 1.0 [10] were compared for this same data set on the same computer using a chisquare allelic test, 1000 permutations of the trait status, and a single population stratum. PRESTO was approximately 18 times faster than PLINK (50 minutes vs. 15 hours).
PRESTO's running time is linear in the number of samples, linear in the number of markers and linear in the number of permutations. Generally, 1000 permutations are sufficient to determine experimentwide significance. PRESTO can also be run in parallel as described in the documentation.
Memory requirements
Since only one marker is stored in memory at a time and since the trait status for each individual is stored using 2 bits, PRESTO's memory requirements are modest. If there are P permutations, N individuals, and the distributions of the top K order statistics are reported, then 2NP/8 bytes are allocated to store the permutations of the trait status and 8KP bytes are allocated to store the topranked order statistics (8 bytes per floating point number). In practice, 500 Mb of memory should suffice for values of P ≤ 10,000, K ≤ 2000, and N ≤ 20,000.
Discussion
Permutation testing with 1000 permutations of a large casecontrol genomewide association study with 5000 individuals genotyped for 500,000 markers can be performed using PRESTO in approximately one hour of computing time (Table 1). With PRESTO, the costs of permutation testing (in terms of time and computing resources) are extremely low for many common study designs, and these costs compare very favourably to the costs associated with data generation (e.g. performing genotype assays, calling genotypes, and performing data quality control filtering).
There has been some debate regarding the number of permutations required. When performing N permutations, the smallest multipletesting adjusted Pvalue one can observe is 1/(N+1) [1]. Thus, 1000 permutations can provide multipletesting adjusted Pvalues as low as 0.001, which provide strong evidence of association. In the analysis of Wellcome Trust Case Control Consortium data described in Table 1, multipletesting adjusted Pvalues of 0.05, 0.01, and 0.005 correspond to nominal Pvalues of 7.5 × 10^{8}, 1.5 × 10^{8}, and 6.2 × 10^{9} respectively. If additional permutations are desired, 10^{4} or 10^{5} permutations are easily performed on a large genomewide data set like the WTCCC data set in Table 1, and even larger numbers of permutations can be easily performed for smaller studies (computation time is linear in the number of permutations).
Permutation testing is particularly appealing because of its simplicity. Recently, several more complex alternatives to permutation testing have been proposed [11–14]. These methods can be useful, more computationally efficient alternatives to permutation testing in some situations.
Some methods for computing adjusted Pvalues exploit the fact that for many common statistical tests, the correlated tests have an asymptotic multivariate normal distribution under the null hypothesis of no traitmarker correlation. Seaman and MüllerMyhsok have proposed estimating the asymptotic distribution and sampling directly from it [14], and Conneely and Boehnke have proposed estimating the asymptotic distribution and calculating probabilities under this distribution using numerical integration [11]. Either approach can be used to estimate the probability of observing a minimum Pvalue smaller than the observed minimum Pvalue. Both approaches are particularly wellsuited to situations where covariate data are available or multiple quantitative phenotypes are tested. When the asymptotic distribution is accurately estimated, these methods are shown to give accurate results (compared to permutation as the gold standard) for candidate gene studies.
There are some limitations with these approaches that estimate the asymptotic multivariate normal distribution of the test statistics. These methods do not estimate significance levels for twostage genotyping designs. A more severe restriction is that these methods are typically limited to several hundred correlated tests. Seaman and MüllerMyhsok and Conneely and Boehnke suggest that the number of samples should be at least 10 times the number of tests performed in order to accurately estimate the asymptotic multivariate normal distribution [11, 14]. So these methods cannot be directly applied to hundreds of thousands of single marker tests in a genomewide association study.
Other alternatives to permutation testing are based on importance sampling. Kimmel and Shamir [13] have proposed a method that uses importance sampling to accurately estimate extremely small multipletesting adjusted Pvalues, and Kimmel and colleagues [12] have modified this method to work with data from a stratified population. Decay of linkage disequilibrium with increasing genomic distance is exploited to further improve the computational efficiency of these methods.
These importance sampling methods lack some of the features that are found in PRESTO. The methods do not calculate significance for twostage genotyping designs, and they do not calculate adjusted Pvalues for general order statistics. In the extension to stratified data, the association test statistic used in Kimmel et al [12] will have suboptimal power because it ignores the population structure of the data (the population structure is incorporated in the importance sampling, but not in the test statistic). The method of Kimmel and colleagues [12] can be modified to use a test statistic for stratified data (such as those used in PRESTO), but this would dramatically increase the computation time because their method loops through all possible contingency tables for each sampled permutation, and the number of contingency tables consistent with a permutation increases exponentially with the number of population strata.
Methods for computing multipletesting adjusted Pvalues that are based on asymptotic multivariate normal distributions or importance sampling, are more complex than permutation testing, and require the asymptotic approximations to be accurate. In addition, when testing a single binary trait, these alternative methods provide little or no decrease in computational time relative to permutation testing with PRESTO, unless one is performing more than 1000 permutations.
Conclusion
PRESTO is a flexible, platformindependent software package that determines multipletesting adjusted statistical significance for largescale genetic association studies by using permutation of the trait status. PRESTO is faster than existing permutation testing software and can analyze a large genomewide association study (500 K markers, 5 K individuals, 1 K permutations) in approximately one hour of computing time. PRESTO can be used with stratified data from multiple populations and with twostage genotyping designs. PRESTO can also report empirical null distributions for the topranked statistics (i.e. order statistics) so that statistical significance can be determined for any test statistic calculated in terms of order statistics.
Availability and requirements

Project name: PRESTO

Project home page: http://www.stat.auckland.ac.nz/~browning/presto/presto.html

Operating system(s): Platform independent

Programming language: Java

Other requirements: standard edition (SE) Java Runtime Environment (JRE) 5.0 (or higher)

License: freely available for academic and commercial use.
Declarations
Acknowledgements
This work was supported by the U.S. National Institutes of Health grant 3R01GM07509102S1.
The analysis in Table 1 makes use of data generated by the Wellcome Trust Case Control Consortium. A full list of the investigators who contributed to the generation of the data is available from http://www.wtccc.org.uk. Funding for the Wellcome Trust Case Control Consortium project was provided by the Wellcome Trust under award 076113.
Authors’ Affiliations
References
 Besag J, Clifford P: Sequential MonteCarlo pvalues. Biometrika 1991, 78(2):301–304.View Article
 Dudbridge F, Koeleman BP: Rank truncated product of Pvalues, with application to genomewide association scans. Genet Epidemiol 2003, 25(4):360–366. 10.1002/gepi.10264View ArticlePubMed
 Zaykin DV, Zhivotovsky LA, Westfall PH, Weir BS: Truncated product method for combining Pvalues. Genet Epidemiol 2002, 22(2):170–185. 10.1002/gepi.0042View ArticlePubMed
 Dudbridge F: A note on permutation tests in multistage association scans. Am J Hum Genet 2006, 78(6):1094–1095. 10.1086/504527PubMed CentralView ArticlePubMed
 Agresti A: Categorical Data Analysis. second edition. New York: John Wiley & Sons; 2002.View Article
 Mantel N: Chisquare tests with one degree of freedom; extensions of the MantelHaenszel procedure. Journal of the American Statistical Association 1963, 58(303):690–700. 10.2307/2282717
 Browning SR, Browning BL: Rapid and accurate haplotype phasing and missingdata inference for wholegenome association studies by use of localized haplotype clustering. Am J Hum Genet 2007, 81(5):1084–1097. 10.1086/521987PubMed CentralView ArticlePubMed
 R Development Core Team: R: A Language and Environment for Statistical Computing. Vienna, Austria: R Foundation for Statistical Computing; 2006.
 The Wellcome Trust Case Control Consortium: Genomewide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature 2007, 447(7145):661–678. 10.1038/nature05911PubMed CentralView Article
 Purcell S, Neale B, ToddBrown K, Thomas L, Ferreira MA, Bender D, Maller J, Sklar P, de Bakker PI, Daly MJ, et al.: PLINK: a tool set for wholegenome association and populationbased linkage analyses. Am J Hum Genet 2007, 81(3):559–575. 10.1086/519795PubMed CentralView ArticlePubMed
 Conneely KN, Boehnke M: So Many Correlated Tests, So Little Time! Rapid Adjustment of P Values for Multiple Correlated Tests. Am J Hum Genet 2007, 81(6):1158–68. 10.1086/522036PubMed CentralView ArticlePubMed
 Kimmel G, Jordan MI, Halperin E, Shamir R, Karp RM: A randomization test for controlling population stratification in wholegenome association studies. Am J Hum Genet 2007, 81(5):895–905. 10.1086/521372PubMed CentralView ArticlePubMed
 Kimmel G, Shamir R: A fast method for computing highsignificance disease association in large populationbased studies. Am J Hum Genet 2006, 79(3):481–492. 10.1086/507317PubMed CentralView ArticlePubMed
 Seaman SR, MullerMyhsok B: Rapid simulation of P values for product methods and multipletesting adjustment in association studies. Am J Hum Genet 2005, 76(3):399–408. 10.1086/428140PubMed CentralView ArticlePubMed
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.