Moment based gene set tests

Background Permutation-based gene set tests are standard approaches for testing relationships between collections of related genes and an outcome of interest in high throughput expression analyses. Using M random permutations, one can attain p-values as small as 1/(M+1). When many gene sets are tested, we need smaller p-values, hence larger M, to achieve significance while accounting for the number of simultaneous tests being made. As a result, the number of permutations to be done rises along with the cost per permutation. To reduce this cost, we seek parametric approximations to the permutation distributions for gene set tests. Results We study two gene set methods based on sums and sums of squared correlations. The statistics we study are among the best performers in the extensive simulation of 261 gene set methods by Ackermann and Strimmer in 2009. Our approach calculates exact relevant moments of these statistics and uses them to fit parametric distributions. The computational cost of our algorithm for the linear case is on the order of doing |G| permutations, where |G| is the number of genes in set G. For the quadratic statistics, the cost is on the order of |G|2 permutations which can still be orders of magnitude faster than plain permutation sampling. We applied the permutation approximation method to three public Parkinson’s Disease expression datasets and discovered enriched gene sets not previously discussed. We found that the moment-based gene set enrichment p-values closely approximate the permutation method p-values at a tiny fraction of their cost. They also gave nearly identical rankings to the gene sets being compared. Conclusions We have developed a moment based approximation to linear and quadratic gene set test statistics’ permutation distribution. This allows approximate testing to be done orders of magnitude faster than one could do by sampling permutations. We have implemented our method as a publicly available Bioconductor package, npGSEA (www.bioconductor.org). Electronic supplementary material The online version of this article (doi:10.1186/s12859-015-0571-7) contains supplementary material, which is available to authorized users.


Introduction
In a genome-wide expression study, researchers often compare the level of gene expression in thousands of genes between two treatments groups (e.g., disease, drug, genotype, etc.).Many individual genes may trend toward differential expression, but will often fail to achieve significance.This could happen for a set of genes in a given pathway or system (a gene set).A number of significant 1 arXiv:1405.1383v1[stat.CO] 6 May 2014 and related genes taken together can provide strong evidence of an association between the corresponding gene set and treatment of interest.Gene set methods can improve power by looking for small, coordinated expression changes in a collection of related genes, rather than testing for large shifts in many individual genes.
Additionally, single gene methods often assume that all genes are independent of each other; this is not likely true in real biological systems.With known gene sets of interest, researchers can use existing biological knowledge to drive their analysis of genome-wide expression data, thereby increasing the interpretability of their results.Mootha et al. (2003) first introduced gene set enrichment analysis (GSEA) and calculated gene set p-values based on Kolmogorov-Smirnov statistics.Since then, there have been many methodological proposals for GSEA; no single one is always the best.For example, some tests are better for a large number of weakly associated genes, while others have better power for a small number of strongly associated genes (Newton et al., 2007).
One of the most important differences among gene set methods is the definition of the null hypothesis.Tian et al., 2005 andGoeman andBühlmann, 2007 (among others) introduce two null hypotheses that differentiate the general approaches for gene set methods.The first measures whether a gene set is more strongly related with the outcome of interest than a comparably sized gene set.Methods of this type typically rely on randomizing the gene labels to test what is often called the competitive null hypothesis.This is problematic because genes are inherently correlated (especially those within a set) and permuting them does not give a rigorous test (Goeman and Bühlmann, 2007).
The second type of approach is used to determine whether the genes within a set associate more strongly with the outcome of interest than they would by chance, had they been independent of the outcome.Methods that test this selfcontained null hypothesis usually judge statistical significance by randomizing the phenotype with respect to expression data and assume that gene sets are fixed.While we acknowledge that the competitive hypothesis is often of interest, we focus on methods that test the self-contained hypothesis in this paper.
A popular self-contained GSEA method is the JG-score (Jiang and Gentleman, 2007), which determines the the level of enrichment based on averaging linear model statistics.Recently, Ackermann and Strimmer (2009) compared 261 different gene set tests, and found particularly good performance from a sum of squared single gene regression coefficients.We extend both the sum and the sum of squared linear statistics approaches with a new method in this paper.
All current GSEA methods are based on permutation approaches.The initial GSEA (Mootha et al., 2003) and JG-score (Jiang and Gentleman, 2007) methods both have closed form null distributions for their enrichment statistics, Gaussian and Kolmogorov-Smirnov, respectively; however, even the authors of these methods acknowledge that these distributions do not give the correct pvalues and suggest the use of permutation.Lehmann and Romano (2005) give a concise explanation of how permutation inference works.It is common to approximate the permutation distribution by a large Monte Carlo sample (Eden and Yates, 1933;David, 2008).
Permutation tests are simple to program and do not make parametric distributional assumptions.They also can be applied to almost any statistic we might wish to investigate.However, permutation approaches are often computationally expensive, are subject to random inference, and fail to achieve continuous p-values.Each of these drawbacks is described in more depth below.
We have developed a new gene set enrichment approach that approximates the permutation distribution of our corresponding test statistics.We find that our method of moments techniques result in almost exactly the same p-values as permutation approaches, but in much less computation time.Through our approach, we are able to obtain refined p-values and achieve stringent significance thresholds.We applied our approach to three public expression analyses, and found disease-associated gene sets not previously discovered in these studies.

The data
For definiteness, we present our notation using the language of gene expression experiments.Let g, h, r, and s denote individual genes and G be a set of genes.The cardinality of G is denoted |G|, or sometimes p.That is the same letter we use for p-value, but the usages are distinct enough that there should be no confusion.Our experiment has n subjects.The subjects may represent patients, cell cultures, or tissue samples.
The expression level for gene g in subject i is X gi , and Y i is the target variable on subject i. Y i is often a treatment, disease, or other phenotype.We center the variables so that The X gi are not necessarily raw expression values, nor are they restricted to microarray values.In addition to the centering (1) they could have been scaled to have a given mean square.The scaling factor for X gi might even depend on the sample variance for some genes h = g if we thought that shrinking the variance for gene j towards the others would yield a more stable test statistic (Smyth, 2005).We might equally use a quantile transformation, replacing the j th largest of the raw X gi by Φ −1 ((j − 1/2)/n) where Φ is the Gaussian cumulative distribution function.Further preprocessing may be advised to handle outliers in X or Y .We do require that the preprocessing of the X's does not depend on the Y 's and vice versa.

Test statistics
Our measure of association for gene g on our treatment of interest is If both X gi and Y i are centered and standardized to have variance 1, then βg = ρg , the sample correlation between Y and gene g.The usual t-statistic for testing a linear relationship between these variables is t , which is a monotone transformation of ρg .
For reasons of power and interpretability, we apply gene set testing methods instead of just testing individual genes.Linear and quadratic test statistics have been found to be the best performers for gene set enrichment analyses; we thus consider two statistics for our approach: The statistic T G,w can approximate the JG score of Jiang and Gentleman (2007).The JG score is (1/ |G|) g∈G t g .Taking w g = √ n − 2/(sd(X g )sd(Y )), where sd denotes a standard deviation, weights genes similarly to the JG score.Although T G,w with these weights sums statistics equivalent to t statistics, it is not exactly equivalent to the sum of those statistics because of the way ρg appears in the denominator of each t g .
The statistic C G,w is a weighted sum of squared sample covariances.Ackermann and Strimmer (2009) conducted an extensive simulation of gene set methods and found good results for quadratic combinations of per gene test statistics.
The letters T and C are mnemonics for the t and χ 2 distributions that resemble the permutation distributions of these quantities.The w g are scalar weights.For the quadratic statistics we will suppose that w g 0. We won't need that condition to find moments of C G,w , but because we will compare C G,w to a χ 2 distribution, it is reasonable to avoid negative weights.Non-negative weights are also used to simplify our algorithm.
Although linear and quadratic test statistics are fairly restricted, they do allow a reasonable amount of customization through the weights w g , and they are very interpretable compared to more ad hoc statistics.
In a permutation analysis, we replace Y i by Y i where The n! different permutations form a reference distribution from which we can compute p-values.There are often so many possible permutations that we cannot calculate or use all of them.Instead, we independently sample uniform random permutations M times, getting statistics C m = C G,w,m , and similarly T m , for m = 1, . . ., M .We then compute p-values by comparing our observed statistics to our permutation distribution: where p Q and p C are p-values for two-sided inferences on the quadratic and linear statistic, respectively, and p L (left) and p R (right) are for one-sided inferences based on the linear statistic.We use the mnemonic C in p C to denote the central (or two-sided) p-value, which corresponds to a central confidence interval.The +1 in numerator and denominator of the p-values corresponds to counting the sample test statistic as one of the permutations.That is, we automatically include an identity permutation.

Permutation disadvantages
There are three main disadvantages to permutation-based analyses: cost, randomness, and granularity.Testing many sets of genes becomes computationally expensive for two reasons.First, there are many test statistics to calculate in each permuted version of the data.Second, to allow for multiplicity adjustment, we require small nominal p-values to draw inference about our sets, which in turn requires a large number of permutations.That is, to obtain a small adjusted p-value (e.g., via FDR, FWER, Bonferroni methods), one first needs a small enough raw p-value.In order to obtain small raw p-values, the number of permutations (M ) must be large, thereby increasing computational cost.
Because permutations are based on a random shuffling of the data, there is a chance that we will obtain a different p-value for our set of interest each time we run our permutation analysis.That is, our inference is subject to a given random seed.
Permutations also have a granularity problem.If we do M permutations, then the smallest possible p-value we can attain is 1/(M + 1).At or below this minimum p-value permutation tests have no power.Knijnenburg et al. (2009) suggest that for a reliable p-value, there should be at least 10 permuted values more extreme than the sample.That requires M ≈ 10/p and when it is necessary, due to test multiplicity, to use small p such as 10 −6 or smaller, the permutation approach becomes computationally expensive.We call this the sample granularity problem.
There is also a population granularity problem.In an experiment with n observations, the smallest possible p-value is at least 1/n!.Sometimes the at-tainable minimum is much larger.For instance, when the target variable Y is binary with n/2 positive and n/2 negative values then the smallest possible pvalue is 1/ n n/2 .For n = 10 we necessarily have p 1/252.Rotation sampling methods such as ROAST are able to get around this population granularity problem (Wu et al., 2010).Increased Monte Carlo sampling can mitigate the sample granularity problem but not the population granularity problem.
Another aspect of the granularity problem is that permutations give us no basis to distinguish between two gene sets that both have the same p-value 1/(M + 1).There may be many such gene sets, and they have meaningfully different effect sizes.Many current approaches solve this problem by ranking significantly enriched gene sets by their corresponding test statistics.This practice only works if all test statistics have the same null distribution and correlation structure, which is not the case for many current GSEA methods.Additionally, the resulting broken ties do not have a p-value interpretation and cannot be directly used in multiple testing methods.To break ties in this way also requires the retention of both a p-value and a test statistic for inference, rather than just one value.
Because of each of these limitations of permutation testing, there is a need to move beyond permutation-based GSEA methods.The methods we present below are not as computationally expensive, random, or granular as their permutation counterparts.Our proposal results in a single number on the p-value scale.

Moment based reference distributions
To avoid the issues discussed above, we approximate the distribution of the permuted test statistics T G,w by Gaussians or by rescaled beta distributions.For quadratic statistics C G,w we use a distribution of the form σ 2 χ 2 (ν) choosing σ 2 and ν to match the second and fourth moments of C G,w under permutation.
For the Gaussian treatment of T G,w we find σ 2 = var( T G,w ) under permutation using equation ( 5) of Section 3.3 and then report the p-value p = Pr(N (0, σ 2 ) T G,w ), where T G,w is the observed value of the linear statistic.The above is a left tail p-value.Two-tailed and right-tailed p values are analogous.
When we want something sharper than the normal distribution, we can use a scaled Beta distribution, of the form A + (B − A)beta(α, β).The beta(α, β) distribution has a continuous density function on 0 < x < 1 for α, β > 0. We choose A, B, α and β by matching the upper and lower limits of T G,w , as well as its mean and variance.Using equation ( 5) from our theory section we have + 1 , and The observed left-tailed p-value is It is easy to find the permutations that maximize and minimize T G,w by sorting the X and Y values appropriately as described in Section 3.3.The result has A < 0 < B. For the beta distribution to have valid parameters we must have σ 2 < −AB.From the inequality of Bhatia and Davis (2000), we know that σ 2 −AB.There are in fact degenerate cases with σ 2 = −AB, but in these cases T G,w only takes one or two distinct values under permutation, and those cases are not of practical interest.
Like us, Zhou et al. ( 2009) have used a beta distribution to approximate a permutation.They used the first 4 moments of a Pearson curve for their approach.Fitting by moments in the Pearson family, it is possible to get a beta distribution whose support set (A, B) does not even include the observed value.That is, the observed value is even more extreme than it would have to be to get p = 0; it is almost like getting p < 0. We chose (A, B) based on the upper and lower limits of T G,w to prevent our observed test statistic from falling outside the range of possible values of our reference distribution (Section 3.3).
For the quadratic test statistic C G,w we use a σ 2 χ 2 (ν) reference distribution reporting the two-tailed p-value Pr(σ 2 χ 2 (ν) C G,w ) after matching the first and ) .
Our formulas for E( C G,w ) and E( C 2 G,w ) under permutation are given in equation (4) of Section 3.1.Those formulas use E( β 2 g ) and cov( β 2 g , β 2 h ) which we give in Corollaries 1 and 2 of Section 3.1.
All of our reference distributions are continuous and unbounded and hence they avoid the granularity problem of permutation testing.We have prepared a publicly available Bioconductor (Gentleman et al., 2004) package, npGSEA, which implements our algorithm and calculates the corresponding statistics discussed in this section.
3 Theoretical results

Permutation moments of test statistics
Under permutation, E( Y i ) = 0 by symmetry, and so E( β g ) = 0 too.We easily find that, The means, variances and covariances in (4) are taken with respect to the random permutations with the data X and Y held fixed.We adopt the convention that moments of permuted quantities are taken with respect to the permutation and are conditional on the X's and Y 's.This avoids cumbersome expressions like E( β 2 g | X gi , Y i , g ∈ G).We will need the following even moments of X and Y : X gi X hi , and Although our derivations involve O(p 4 ) different moments when the gene set G has p genes, our computations do not require all of those moments.
Lemma 1.For an experiment with n 2 including genes g and h, Proof.See Appendix 1.
Corollary 1.For an experiment with n 2 including genes g and h, Proof.This follows from Lemma 1 because E( β g ) = 0.
From Corollary 1, we see that the correlation between permuted test statistics β g and β h is simply the correlation between expression values for genes g and h.
Lemma 2. For an experiment with n 4 including genes g, h, r, s, where X * ghrs = Xgh Xrs + Xgs Xhr + Xgr Xhs , with A T given by The expression is complicated, but it is simple to compute; we need only two moments of Y , two cross-moments of X, and the 2 × 2 matrix A T B. The matrix A depends on the experiment through n.Using Lemma 2 we can obtain the covariance between β 2 g and β 2 h .
Corollary 2. For an experiment with n 4 and genes g, h, where X * gghh = Xgg Xhh + 2 X2 gh with A and B as given in Lemma 2.
Proof.The covariance is E( Applying Lemma 2 to the first expectation and Lemma 1 to the other two yields the result.

Rotation moments of test statistics
Rotation sampling (Wedderburn, 1975;Langsrud, 2005) provides an alternative to permutations, and is justified if either X or Y has a Gaussian distribution.It is simplest to describe when Y ∼ N (µ, σ 2 I n ) and even simpler for Y ∼ N (0, σ 2 I n ).In the latter case we can replace Y by Y = QY where Q ∈ R n×n is a random orthogonal matrix (independent of both X and Y ), and the distribution of our test statistics is unchanged under the null hypothesis that X and Y are independent.
Rotation tests work by repeatedly sampling from the uniform distribution on random orthogonal matrices and recomputing the test statistics using Y instead of Y .They suffer from sample granularity but not population granularity because Q has a continuous distribution (for n 2).
To take account of centering we need to use a rotation test appropriate for Y ∼ N (µ, σ 2 I n ).Langsrud (2005) does this by choosing rotation matrices that leave the population mean fixed.He rotates the data in an n − 1 dimensional space orthogonal to the vector 1 n .To get such a rotation matrix, he first selects an orthogonal contrast matrix W ∈ R n×(n−1) .This matrix satisfies W T W = I n−1 and W T 1 n = 0 n−1 .Then he generates a uniform random rotation Langsrud (2005) shows how to rotate Y in the residual space of this model, leaving the fits unchanged.Wu et al. (2010) have implemented rotation sampling for microarray experiments in their method, ROAST.They speed up the sampling by generating a random vector instead of a random matrix.For some tests, permutations and rotations have the same moments, and so our approximations are approximations of rotation tests as much as of permutation tests.
Our rotation method approximation performs very similarly to the permutation method.We let where Q * is a uniform random n − 1 × n − 1 rotation matrix and the contrast matrix W ∈ R n×(n−1) satisfies W T 1 n = 0 n−1 and W T W = I n−1 and then β, T and C are defined as for permutations, substituting Y for Y .
The variance of the quadratic test statistic depends on which contrast matrix W one chooses, and it cannot always match the permutation variance.This difference disappears asymptotically as n → ∞.
Lemma 3.For an experiment with n 2 including genes g and h, the moments E( β g ) and E( β g β h ) are identical to their permutation counterparts, regardless of the choice for W .
Corollary 3.For an experiment with n 2, E( T G,w ), var( T G,w ) and E( C G,w ) are the same whether Y is formed by permutation or rotation of Y .

Computation and costs
To facilitate computation for the linear statistic, we reduce each gene set to a single pseudo-gene X Gi = g∈G w g X gi and then let The weights w have been absorbed into the pseudo-gene to simplify notation.We define βG = g∈G Our permuted linear test statistic is For the beta approximation, we need the range of T G,w .Let the sorted Y values be Y (1) Y (2) . . .Y (n) and the sorted X Gi values be X G(1) X G(2) . . .X G(n) .Then the range of T G,w is [A, B], where For a σt (ν) reference distribution we would also need E( T 4 G,w ) = E( β 4 G ).We can apply Lemma 2 to the pseudo-gene resulting in where XGGGG = 1 n n i=1 X 4 Gi .We considered using a σt (ν) reference distribution for T G,w , taking into account the fourth moment of T G,w (6).We have often (in fact usually) found that E( T 4 G,w ) < 3E( T 2 G,w ) 2 ; that is, lighter tails than the normal.This implies a negative kurtosis for the permutation distribution, and t distributions have positive kurtosis.For this reason we use a beta approximation and not a t approximation.
For the quadratic statistic we have found it useful to replace X gi by √ w g X gi in precomputation.That step is only valid for non-negative w g , but those are the ones of most interest.Then we use formulas for E( C G,w ) and var( C G,w ) with all w g = w h = 1 (4).Now we consider the computational cost.The cost to compute all of the X Gi is dominated by np multiplications.It then takes n more multiplications to get βG and another n to get XGGe .It costs n multiplications to get all the µ 2 , except that step done once can be used for all gene sets.The cost for the Gaussian approximation N (0, var( T G,w )) is dominated by n(p + 2) multiplications.
For the beta approximation there is also a cost proportional to n log(n) in the sorting to compute limits A and B. That adds a cost comparable to a multiple of log(n) permutations.We judge that the cost of sorting is usually minor for n and p of interest in bioinformatics.
A permutation analysis requires nM multiplications, after computing X Gi , for a total of n(M + p).It is very common for p to be a few tens and M to be many thousands or more.Then we can simplify the costs to n(M + p) ≈ nM and n(2 + p) ≈ np.The moment method costs about as much as doing p permutations.When the gene set has tens of genes and the permutation method uses many thousands or even several million permutations, the computational cost is quite large.
The pseudo-gene technique is more expensive for the quadratic statistics.The dominant cost in computing C G,w is still the np multiplications required to compute β g for g ∈ G.We can also compute E( C G,w ) in about this amount of work.
The cost of computing var( C G,w ) by a straightforward algorithm is at least np 2 , because we need Xgh and Xgghh for all g, h ∈ G.Some parts of that computation can be sped up to O(np) by rewriting the expression as described in Appendix 5.One of the terms however does not reduce to O(np).A straightforward implementation costs O(np 2 ) while an alternative expression costs O(n 2 p).The latter is valuable in settings where the gene sets are large compared to the sample size.In the former case, the moment approximation has cost comparable to O(p 2 ) permutations.If n < p then the latter case is like np permutations, so the quadratic cost is comparable to on the order of p * min(n, p) permutations.
We have verified our moment formulas by finding that they match values found by enumerating all n! permutations, for some simulated data sets with small n.During testing, we also compared permutation and our approximate p-values on simulated data.We saw a close match but think that an illustration on real data is more compelling.Section 4 makes comparisons using three genome wide expression studies in Parkinson's Disease (PD) patients: Moran et al. (2006), Scherzer et al. (2007) and Zhang et al. (2005).

Parkinson's Disease
We illustrate our method using publicly available data from three expression studies in Parkinson's Disease (PD) patients (Moran et al., 2006, Zhang et al., 2005, and Scherzer et al., 2007;  Using a selected set from the Broad Institute's mSigDB v3.1 (Subramanian et al., 2005) and the presence of PD as a response variable from the Zhang et al. (2005) dataset, we visualized both permutation distributions and our approximation of these distributions (Figure 1).As discussed above, we use a linear test statistic, T G,w = g∈G βg , and a quadratic test statistic, C G,w = g∈G β2 g , where βg is a sample covariance between gene expression and, in this case, disease status. Figure 1 shows these two test statistics with a histogram of 99,999 recomputations of those statistics for permutations of treatment status versus gene expression.In principle, histograms of permuted test statistics can be very complicated, but in practice, they often resemble familiar parametric distributions, as in Figure 1.
Using the fitted normal distribution to determine the rarity of the observed gene set statistic results in a two-tailed p-value of 0.0604 for the linear statistic while permutations yield p = 0.0595.A fitted σ 2 χ 2 (ν) distribution results in p = 0.0425 for the sum of squares gene set statistic, while permutations yield p = 0.0458.The p-values are a quite close despite the somewhat higher peak for the permutation histogram relative to the χ 2 density.
We compared our non-permutation p-values to p-values for linear and quadratic statistics for the 6,303 gene sets from mSigDB's curated gene sets and Gene Ontology (GO, Ashburner et al., 2000) gene sets collections (v3.1).One gene set was removed because it contained only one gene in our experiments.The average size of these gene sets is 79.40 genes.For our gold standard we ran 999,999 permutations of the linear statistic and 499,999 permutations of the quadratic statistic.For all of our permutations, we first calculated the observed test statistic for each of the 6,303 gene sets and then permuted the Y i 's M times to obtain 6,303 × M permuted test statistics.We next compared the pre-computed test statistic vector to our matrix of permuted test statistics.
For each set, we computed left-sided p-values, p L , for the linear statistic and two-sided p-values, p Q , for the quadratic statistic using these permutations.We also computed the normal and beta approximations of p L with our method.(Figure 2, left panel).We converted these one-sided p-values to two-sided pvalues via p = 2 min(p L , 1 − p L ).The beta approximation p-values are almost identical to the permutation p-values.
For our quadratic test statistic, we fit our moment based σ 2 χ 2 (ν) approximation and computed two-sided tailed p-values across all sets (Figure 2  panel).We see that the smallest χ 2 non-permutation p-values are slightly conservative.This may reflect the boundedness of the permutation distribution combined with the unbounded right tail of the χ 2 distribution.
In each of the three experiments, there is a tight correlation between the permutation-based p-values of all sets and both of our moment-based methods (Table 2).The beta and normal approximations are almost identical.Our beta approximations are slightly closer to the gold standard than the normal approximations, but not by a practically important amount.The beta approximation has shorter tails than the Gaussian approximation.It yielded p-values somewhat smaller than permutations did, while the Gaussian approximation yielded p-values somewhat larger than the permutations did.The χ 2 approximations also reproduce the ranking of the gold standard quite well, though not as well as the normal and beta approximations to the linear statistic.
For these data sets and 6,303 gene sets, both of the linear statistics, which have more or less the same rank-ordering of p-values as 999,999 permutations, could be approximated in about than the amount of time it takes to compute 100 permutations (Table 3, top block).Our gene sets had an size of about 80 genes.This lead us to expect that the cost of the linear approximation would be comparable to doing 80 permutations.We found that the Gaussian approximation cost about as much as 100 permutations.While this is a close match, we remark that the time to do M permutations is nearly an affine function a + bM with positive intercept a.At such small M the overhead costs dominated the total cost making the per permutation costs hard to resolve.The beta approximation was slightly slower than the Gaussian one because it involves the sorting of the data.
The χ 2 approximation to the quadratic statistic has a computational cost about as much as 35,000 to 45,000 permutations, yet has a similar rank-ordering of p-values 499,999 permutations (Table 3, bottom block).For the quadratic statistic we expected our algorithm to cost as much as doing a number of permutations equal to a small multiple of the mean square gene set size.It cost about as much as 35,000 to 45,000 permutations while the mean square set size was 27,171.
After applying our permutation approximation methods to each dataset in 6,303 mSigDB gene sets, we found many significantly enriched gene sets, even  (2009)).Through our new gene set enrichment method, we discovered a relationship between the expression of these gene sets and PD.

Discussion
Gene set methods are able to pool weak single gene signals over a set of genes to get a stronger inference.These methods and their corresponding permutationbased inferences are a staple of high throughput methods in genomics.Because an experiment for this purpose may have a few to hundreds of microarrays or RNA-seq samples, permutation can be computationally costly, and yet still result in granular p-values.In this paper, we introduce an approximation gene set method, which performs as well as permutation methods, in a fraction of the computation time and which generates continuous p-values.
Permutation methods have some valuable properties that our approach does not share.Permutation based inferences give exact p-values.Our approxima-tions are not ordinarily exact because the permutation histogram is not in the parametric family we use.
The second advantage of permutations is that they apply to arbitrarily complicated statistics.In our view, many of those complicated statistics are much harder to interpret and are less intuitive than the plain sum and sum of squared statistics we present.Others have observed that simple linear and squared statistics outperform more complex approaches (Ackermann and Strimmer, 2009).Our method allows for the weighting of coefficients in our statistics, granting users access to additional useful and interpretable patterns.
Because of the disadvantages discussed above, there has long been interest in finding approximations to permutation tests.Eden and Yates (1933) noticed that the permutation distribution closely matched a parametric distribution that one would get running an F -test on the same data.It has also been known since the 1940s that the permutation distribution of the linear test is asymptotically normal as n increases (Good, 2004).More recently, Knijnenburg et al. (2009) approach the granularity issue by taking a random sample of permutations and fitting a generalized extreme value (GEV) distribution to the tail of their distribution.
Our work differs from these previous permutation approximation approaches.We use Gaussian or beta distributions for the linear statistic and a χ 2 distribution for the quadratic statistic.These choices never place the observed test statistic strictly outside the possible range of our reference distribution.In this way, we also avoid nonsensical p-values.
We have developed a new and intuitive method for gene set enrichment analysis that is computationally inexpensive, as accurate as permutation methods, and avoids the sample granularity issue.A Gaussian, beta, or χ 2 approximation gives a principled way to break ties among genes or gene sets whose test statistics are larger than any seen in the M permutations.We applied our moment based approximations to three human Parkinson's Disease data sets and discovered the enrichment of several gene sets in this disease, none of which were mentioned in the original publications.involvement of specific protein processing, energy metabolism, and signaling pathways, and suggests novel disease mechanisms.Am J Med Genet B Neuropsychiatr Genet, 137B(1):5-16.Appendix 1: Proof of Lemma 1 This appears in Owen (2005) but we prove it here to keep the paper selfcontained.First and so proving Lemma 1.
indices are distinct: and where the subscripts are mnemonics for terms four of a kind, three of a kind, two pair, one pair and nothing special.
We can express all of these moments in terms of µ 2 and µ 4 = (1/n) n i=1 Y 4 i .Each moment is a normalized sum over distinct indices.We can write these in terms of normalized sums over all indices.Many of those terms vanish because and so on.We can write these sums in terms of unrestricted sums: See Gleich and Owen (2011) for details.We will use the last expression in a context where f ijk vanishes when summed over the entire range of any one of its indices.In that case * ijk We also use the notation n (k) = n(n − 1)(n − 2) • • • (n − k + 1), often called 'n to k factors', where k is a positive integer.Now .
Finally using (7), n (4) µ ∅ equals * ijk We may summarize these results via where the matrix A is given in the statement of Lemma 2.
The coefficient of µ ∅ is, using (7), * ijk X gi X hj X rk X s = ij X gi X hj X ri X sj + X gi X hj X rj X si + X gi X hi X rj X sj − 6 i X gi X hi X ri X si = n 2 Xgh Xrs + Xgr Xhs + Xgs Xhr − 6n Xghrs .
and when Y is substituted for Y , T G,w becomes T G,w and C G,w becomes C G,w .

Figure 1 :
Figure 1: Top panel shows a permutation histogram for a linear test statistic for the the steroid hormone signaling pathway gene set as described in the text.The bottom panel shows a quadratic test statistic.Solid red dots indicate the observed values and curves indicate parametric fits, based on normal and χ 2 distributions.

i
Y i = 0. Let * represent summation over distinct indices, as in * ij =i k=1,k =i,k =j f ijk

Table 2 :
L Beta p L Normal p C Beta p C Chisq p Q Spearman correlations between gold standard (999,999 and 499,999 permutations for linear and quadratic statistics) and approximation p-values.p L and p C represent results for one and two-tailed linear test statistics, respectively.Chisq p Q represents results for the sum of squares analysis.
1 Figure 2: Permutation p-values (x-axis) versus moment-based p-values (y-axis) for 6,303 gene sets.The left column represents results for a linear test statistic, the right column for sum of squares.Data come from three genome-wide expression studies.We applied the non-linear transformation p 1/2 to stretch the Reference Normal p

Table 3 :
Time in seconds for p-value calculations for 6,303 gene sets in three genome-wide expression studies.Linear statistic results with M = 100, M = 500, and M = 1,000,000 permutations, and the normal and beta approximations are in the top block.Timings for the quadratic statistic with = 30,000, M = 40,000, M = 50,000, and M = 500,000 permutations, and the χ 2 approximation are presented in the bottom block.