BMC Bioinformatics BioMed Central Methodology article A general modular framework for gene set enrichment analysis

Background Analysis of microarray and other high-throughput data on the basis of gene sets, rather than individual genes, is becoming more important in genomic studies. Correspondingly, a large number of statistical approaches for detecting gene set enrichment have been proposed, but both the interrelations and the relative performance of the various methods are still very much unclear. Results We conduct an extensive survey of statistical approaches for gene set analysis and identify a common modular structure underlying most published methods. Based on this finding we propose a general framework for detecting gene set enrichment. This framework provides a meta-theory of gene set analysis that not only helps to gain a better understanding of the relative merits of each embedded approach but also facilitates a principled comparison and offers insights into the relative interplay of the methods. Conclusion We use this framework to conduct a computer simulation comparing 261 different variants of gene set enrichment procedures and to analyze two experimental data sets. Based on the results we offer recommendations for best practices regarding the choice of effective procedures for gene set enrichment analysis.


Background
The analysis of "enrichment" of gene sets is a natural extension of the study of differential expression of individual genes. Focusing on sets of genes rather than on individual genes has several benefits. From a statistical point of view the analysis of groups instead of individual genes is advantageous as this typically increases power and reduces the dimensionality of the underlying statistical problem. From the biological perspective gene set enrichment analysis allows one to ask (and answer!) questions that are of direct interest to the understanding of the functional mechanism in a cell: is a certain pathway activated in a given tissue under some treatment X? Is the pathway more active than other pathways? These questions directly relate to various null models for gene sets. Therefore, for good reason a substantial number of statistical procedures to assess gene set enrichment have been introduced in the last few years -see Table 1 and references  for an overview. Given this extensive literature, biologists are now confronted with the difficult choice of a gene set method that is best suited to analyze their data at hand. So far, a "taxonomy" of enrichment analysis approaches and a systematic comparison is lacking.
The objective of the present work is three-fold. First, we present a meta-theory of gene set enrichment analysis in the form of a general modular framework (Fig. 1). This modular scheme encompasses most if not all published approaches for gene set analysis, and hence allows to study the interplay among the various methods. Second, based on this framework we conduct a principled comparison investigating 261 variants of gene set enrichment procedures. From extensive computer simulations and the analysis of two experimental data sets, we offer specific recommendations for conducting an effective gene set analysis. Third, we present an extensive survey of existing statistical methods for detecting enriched gene sets. This is found in the Appendix and forms the basis of the proposed modular framework.

Methods
In this section we describe the key result from our studies: a modular framework that provides a taxonomy for gene set analysis. In the Appendix we have compiled a very comprehensive overview of the respective approaches that underlie our proposed meta-theory of gene set enrichment.

Modular structure underlying procedures for gene set enrichment analysis
Our suggestion for a common modular framework for gene set enrichment analysis methods is depicted in Fig. 1. The scheme consists of five distinct modules: the calculation of a gene-level statistic, an optional transformation of these statistics, the choice of a null hypothesis, the computation of a gene set statistic and the significance assessment. The first two modules correspond to an analysis of differential expression on the individual gene level. The remaining three modules process these results further to obtain an assessment of gene set enrichment. A separate track in this diagram is provided for a module for conducting global tests.
We will now describe each module in more detail. Note that the scheme presented in Fig. 1 is both descriptive as well as generative: by selecting a method in each box one arrives at some specific approach for detecting gene set enrichment. To our knowledge, all enrichment approaches published so far fit into this scheme. In addition, there are also many combinations of modules that will lead to a "novel" method (i.e. one that has not yet been explicitly described in the literature). However, note that some modules are tightly interconnected, e.g., the choice of null hypothesis and of significance assessment. These dependencies need to be kept in mind when setting up a gene set analysis.

Description of individual modules Gene-level statistics
In all the univariate approaches (cf. Appendix), the first step in conducting a gene set enrichment analysis is to assess the amount of differential expression of the individual genes. Choosing a suitable test statistic for this purpose has been discussed extensively in the literature, for an overview we refer, e.g., to [37] and references therein. Typically, the gene-level statistic will be selected from the following list: • fold change, • signal-to-noise ratio, • (regularized) t-statistics, • (shrinkage) correlation coefficient, • coefficient of ANOVA or linear/logistic regression, and • (penalized) log-likelihood ratio.

Transformation of gene-level statistics
The second step of univariate procedures for detecting gene set enrichment consists of the transformation of the local gene statistic. Suitable options suggested in the literature include: • no transformation, • absolute values, • squared values, • binary transformation, • ranks, • p-values or Bayesian posterior probabilities, and • (local) false discovery rates.
There are many instances when a transformation is beneficial, for example in order to improve robustness, or to account for both up-and down-regulation. The binary transformation is an extreme case of ranking considering only two classes (e.g., "differentially expressed" versus "nondifferentially expressed"). Many of the adhoc procedures used by biologists to declare differential expression (such as based on fold change and some kind of arbitrary rules) fall into this category. Also note that binary transformation combined with testing gene set membership is equivalent to the contingency table approach (cf. Appendix).
The influence of the transformation of the gene-level statistics on the results of gene set enrichment procedures is shown by Newton et al. [20]. These authors compare the power properties of averaging and selection test statistics, i.e. methods based on the mean of local statistics like the t-statistic and methods where the mean over the binary indicator of differential expression is used. Newton et al. [20] find that selection methods work better when the gene set enrichment is characterized through a small number of highly differentially expressed genes, whereas averaging is able to detect enriched sets consisting of a large number of genes with subtle increase or decrease of the expression levels.

Null hypotheses
A factor that strongly influences the general setup of an enrichment procedure is the choice of the null hypothesis. The two most common choices are termed Q 1 and Q 2 [14]. Q 1 is called the "competitive null hypothesis" and corresponds to the case where the association between the genes in the set and the phenotype is compared with the association of the remainder of the genes and the phenotype. Note that this implies a model where the genes are the sampling units and the association between the samples and the phenotypes is fixed. Thus, testing this hypothesis is conditional on the gene-level scores. The "complete null hypothesis" [16] that all the gene sets under consideration have the same differential expression can be seen as a special case of Q 1 .
In contrast, Q 2 is the "self-contained null hypothesis" that focuses on the given gene set without considering genes outside of this group. It compares the association of the gene set and the phenotype with that of random phenotypes. Under Q 2 the sampling units are the phenotypes while the gene set membership is fixed. A special case of the self-contained null hypothesis is the "global hypothesis" that there are no differentially expressed genes at all.
A third possible null model, here called Q 3 , is the "nested null hypothesis", where differential expression of the genes in the gene set is compared to the differential expression of all genes under consideration (both inside and outside the gene set). This model is motivated by the FDR-based approach described in Efron [21], where FDR and conditional FDR values are compared.
For understanding gene set enrichment procedures it is absolutely essential to understand the implications of the various null models [5]. For instance, if there are many differentially expressed genes in all gene sets, testing under the self-contained model Q 2 is quite likely to indicate enrichment, even though the particular gene set under investigation might not show any special "enrichment" compared to other gene sets.
Furthermore, as Goeman and Bühlmann [5] point out, the aim of a gene expression experiment is to assess whether the transcript levels of the genes differ between the different phenotypes. Thus, it is based on a model where the phenotypes are the sampling units. Consequently, one would repeat the experiment with different samples and not with different genes to verify the results. In contrast, the competitive hypothesis Q 1 assumes that the genes are the sampling units, turning the experimental design upside down. Therefore, some authors [38,33] consider Q 1 only to be "of limited utility" for statistical analysis.
Gene set statistics A further step in an enrichment analysis is the computation of a gene set statistic. Examples include: • the sum or the mean of the (transformed) single gene statistics, • the median of the (transformed) single gene statistics, • the (modified) Kolmogorov-Smirnov statistic, • the maxmean statistic, • the Wilcoxon rank sum test statistic, • the sign test statistic, and • the conditional local FDR.
The question of which gene set statistic is optimal in a given setting is subject to ongoing discussion. For example, Efron and Tibshirani [11] show that their maxmean statistic is more powerful than a Kolmogorov-Smirnov test or an average of local statistics. Jiang and Gentleman [16] call attention to the problem of robustness against outliers and advocate summaries such as the median or the sign test statistic.

Significance assessment
The last step of both the univariate and the global and multivariate procedures (cf. Appendix) for enrichment analysis is the assessment of significance of the observed gene set statistics. The calculation of the p-value can be done in three different ways: 1. gene sampling: A large number of random gene sets of the same size as the set under investigation is drawn from all the genes and the global statistic is recomputed for every random set. The p-value is calculated as the fraction of resampled gene set statistics that exceed (or fall below) the observed value.

sample label permutation:
The phenotypes of the subjects are permuted a large number of times and the local and global statistics are recomputed. The p-value is the fraction of permutation gene set statistics that exceed (or fall below) the observed value.

restandardization:
Both gene sampling and sample label permutation are conducted. Then each permutation statistic is standardized with the mean and standard deviation of all permutation statistics. To obtain the restandardized value, the standardized permutation statistic is multiplied with the standard deviation of the gene sampling statistics and the mean of the gene sampling statistics is added. The p-value is the fraction of restandardized gene set statistics that exceed (or fall below) the observed value.
Note that the choice of the null hypothesis and the pursued sampling strategy are interconnected: Q 1 implies gene sampling, and Q 2 sample label permutation. Furthermore, gene resampling implicitly assumes independent genes in the group, a prerequisite that is unlikely to hold, e.g., for genes within a pathway. The restandardization strategy put forward in [11] combines the two very different sampling schemes into a single procedure. Therefore, this approach mixes the two null hypotheses Q 1 and Q 2 which may lead to difficulties in the interpretation of the resulting p-values. Interestingly, note that the GSEA algorithm [7,8] implicitly also considers both permutation approaches. While the GSEA null distribution itself is found by permuting sample labels, the Kolmorogov-Smirnov-type test statistic equally carries information on the whole set of genes (at least if the set S is small compared to the number of all genes). Thus, GSEA implements something in effect related to restandardization, albeit not in the extent of the method by [11]. Finally, a computational consideration is whether the chosen null distribution may be approximated by a parametric distribution. This is fundamentally tied to the choices made earlier, and is possible only in very rare circumstances, e.g., in the method of Kim and Volsky [18].
Global and multivariate approaches Enrichment approaches that do not follow the structure introduced above are the global and multivariate procedures. These directly define one model for the whole gene set. Nevertheless, the globaltest of [22] and the GlobalAncova of [23] also allow to obtain a score for every gene as well, since the gene set models can be written as a combination of the models for each gene. For the T 2 -based tests this is only possible when the covariance matrix is diagonal.
In a simulation study, [23] showed that the globaltest and their GlobalAncova perform equally well for independent genes but that the GlobalAncova outperforms the approach of [22] when there are strong gene-gene dependencies. Kong et al. [25] show that their combination of PCA and Hotelling's T 2 -test detects more gene sets than the globaltest. They explain this finding with the multivariate nature of their test. However, the assumption is only made on the basis of the evaluation of one data set and further analyzes are needed to show the superiority of this procedure.

Results and discussion
Employing the developed taxonomy and modular framework we now compare the various gene set enrichment approaches by a principled simulation study and exemplary data analysis.

Simulation study
General set-up All necessary programming and calculations were implemented using the free software environment R [39]. We simulated data in a fashion to represent stylized yet typical constellations of gene set structures found in real gene expression data. We considered gene sets with different levels of differential expression (Δμ = 0, 0.75, 1, -1) and with varying levels of intra-group correlation (ρ = 0, 0.6, -0.6). Furthermore, we also constructed mixed sets, i.e. gene sets that included both differentially expressed and non-differentially expressed genes, and also sets containing both up and down-regulated genes.
Each generated data set consisted of p = 600 genes with n = 20 samples (10 in each of two "treatment" groups). The data were generated from a 600-dimensional multivariate normal distribution, with variances set to 1 and means and correlations specified as follows: • background (noninformative genes): 520 genes were simulated with μ = 0 and ρ = 0.
• set 2 (differential expression + no correlation): same as set 1 but with zero correlation among genes.
• set 3 (no differential expression, no correlation): 20 genes selected randomly from the 520 background genes.
• set 5 (differential expression + no correlation, mixed with background): same as set 4, but with zero correlation among genes (i.e. 10 genes from set 2 and 10 genes from the background set).
• set 6: (differential expression, correlation, both up and down regulation): 10 genes with Δμ = 1 and 10 genes with Δμ = -1, the pairwise correlation between the genes with the same direction of differential expression is ρ = 0.6, between the up-and down-regulated genes it is ρ = -0.6.
• set 7 (differential expression, no correlation, both up and down regulation): same as set 6, but without any correlation among genes.
• set 8 (differential expression, correlation, both up and down regulation, mixed with background): 10 genes from set 6 (5 up-regulated and 5 down-regulated), mixed with 10 genes from the background set.
• set 9 (differential expression, no correlation, mixed with background, both up and down regulation): same as set 8, but without any correlation among genes (i.e. 10 genes from set 7 and 10 genes from background set).
Thus, the 600 simulated genes are constituted by the background set, set 1, set 2, set 6 and set 7. Any well working procedure for gene set enrichment analysis should be able to detect at least the pure sets 1, 2, 6 and 7, but ideally also the sets 4, 5, 8, 9 where only half of the genes are differentially expressed. Gene set 3 serves as a negative control.
Using this data we conducted gene set enrichment analysis by considering combinations of the following module methods: • gene-level statistics: two-sample t-statistic, moderated tstatistic, Pearson correlation coefficient.
• gene set statistics: mean, median, maxmean statistic, ES (weighted by absolute value of gene level statistics), conditional FDR (linear logistic regression of probability of gene set membership depending on the value of the genelevel statistic), Wilcoxon rank sum test.
The gene set statistic ES was not combined with a binary transformation since the latter does not allow a sensible ranking of the genes. Hence, in total 3 × 5 × 6 × 3 -9 = 261 variants of gene set analysis were considered. For each combination of modules we simulated 100 data sets, and determined the false positive rate for set 3 and the true positive rates for the remaining sets.

Gene-level statistics and their transformations
First, we studied the effect of the choice of a specific gene level statistic and the corresponding transformation. Specifically, we varied the gene level statistic and the transformation and fixed the gene set statistic to the mean and computed p-values by resampling. The results are summarized in Table 2.
In general, the choice of the gene-level statistic does not seem to have a great impact on the results of the enrichment analysis. In our simulation setting we obtained roughly the same results, regardless whether the conventional or the moderated t-statistic or the correlation was used. We note that the small differences might be explained with the rather large number of samples per group (10) which is already sufficient to properly estimate the variance of each gene without using a regularization approach. For smaller sample sizes, the merits of regularization or borrowing information across genes is likely to be more pronounced. In contrast, we find that the choice of a transformation has quite a substantial effect on the detection rates. It is striking that all three investigated gene level statistics fail to find sets containing both up-and down-regulated genes. In this situation applying an appropriate transformation instead of using the original scores is indispensable. In this regard, a very useful transformation is the calculation of squared values. Although in Table 2 it can be seen that this results in slightly lower true positive rates for sets 1, 2, 4 and 5 compared to untransformed gene-level scores, it enables the detection of the sets 6, 7, 8, and 9, which are not recognized using the untransformed gene level statistics. Interestingly, if the squared transformation is combined with the rank transformation, this further improves the true positive rates and offers the best overall performance of all transformations under study (see Table 2). Furthermore, the (rank) quadratic transformations outperform the binary and the FDR transformations, which are two approaches that also detect sets containing both up and down-regulated genes.
Gene set statistics Next, we considered in our simulations the impact of the choice of the gene set statistic. To facilitate comparison, we used as underlying gene level statistic the moderated tstatistic and employed a quadratic transformation. For significance assessment we applied the resampling. Table  3 shows the results.
Overall, the gene set statistics all behave as expected for the nine sets. They assign high p-values to the uninteresting set 3 and relatively low p-values to the other sets. All approaches have some difficulty to reliably detect gene sets 4 and 5, where there is relatively weak differential expression and half of the genes belong to the uninformative background.
Performance-wise, the test statistics may be grouped as follows. The mean and the maxmean statistic produce rather similar and overall very good results. The median and the Wilcoxon rank sum test, which are coarsened and more robust location estimators for the change in transcript level, perform markedly better than the mean or maxmean for the correlated sets. However, if phenotype permutation instead of resampling is used, this effect vanishes and the performance of the median and the Wilcoxon test deteriorates (data not shown, cf. [40] for more information). Thus, the usage of the median and the Wilcoxon test is primarily advantageous if the competitive null hypothesis is tested, or if there are many outliers in the data. The conditional FDR procedure compares well to the mean and maxmean procedures in the simulations. However, we find that the results of this approach vary strongly with the choice of the gene-level statistic, transformation and permutation approach. Since it is not markedly better than the other approaches for any gene set and it is computationally expensive, it might not be the best overall choice for an enrichment analysis. Perhaps surprising is the comparatively weak performance of the popular ES score, which yields results that are worse than that of the sum or the mean.

Significance assessment
In a further simulation experiment, we compared four approaches for obtaining significance values (see Table 4).
In this study we used moderated t as the gene level statistic, employed a quadratic transformation, and used the mean as the gene set statistic.
First, the parametric approximation of the null distribution yields the best results. However, the assumption of independence of the individual genes is clearly violated for sets 1, 4, 6 and 8. Thus, the corresponding p-values are over-optimistic.
The resampling procedure that investigates the competitive null hypothesis Q 1 also performs quite well. Sets 3 and 7 are correctly categorized for almost all the 100 repetitions of the simulation. For sets 2 and 9 this is achieved in more than 90% of the cases. The correlated sets 1, 6 and 8 achieve lower true positive rates. Resampling performs worst for the difficult sets 4 and 5, where only half of the genes are differentially expressed.
Testing the self-contained null hypothesis Q 2 using sample label permutation appears to be slightly easier than testing Q 1 (see third column of Table 4), especially for the uncorrelated gene sets. Note that permutation preserves the correlation structure among genes. Since the effective gene set size is smaller in a correlated set compared to a set where the genes are independent, it is more difficult to detect enrichment in these situations. For this reason it is surprising that in most of the cases the permutation works still better for the correlated sets than resampling.
The restandardization procedure performs very similar to resampling. This was the case in almost all of the combinations of enrichment methods that were investigated in this work, so the benefits of restandardization are likely to be small in general. However, we note that [11] showed in The indicated values correspond to the proportion of p-values ≤ 0.05 for squared moderated t and resampling. their paper that there might be some special situations where restandardization is advantageous.

Comparison with global and multivariate approaches
In order to compare the modular univariate approaches with the global and multivariate procedures, we additionally analyzed the simulated data using the globaltest and Hotelling's T 2 -test with a shrinkage covariance matrix [41].
The results for the globaltest under different significance assessment approaches are summarized in Table 5.
Clearly, the performance of this procedure is not better than that of the less sophisticated univariate methods, especially the sum or mean of the squared gene-level statistics. Indeed, the results are very close to these two approaches. Hence, using the globaltest does not improve the results, but it is computationally a little bit faster, especially when the parametric approximation of the null distribution is used. Since it leads to comparable results, it might be a useful alternative when a large number of sets is to be investigated.
The corresponding results for the Hotelling T 2 -test can be found in Table 6. Using this approach the uncorrelated sets are found with the same reliability as with univariate approaches. However, in stark contrast, the sets with correlation (sets 1, 4, 6, and 8) are hardly detected. The reason is clear from a statistical perspective: the Hotelling test statistic penalizes correlation (i.e. the test statistic is largest for zero correlation). Specifically, it is much more likely that genes in a set are all simultaneously differentially expressed if there is correlation. While the Hotelling statistic corrects for this bias accordingly, this implicit penalization is clearly not desired from a biological point of view. A clear advantage of the Hotelling T 2 -test is its improved performance with sample label permutation as opposed to gene sampling. Since the approach does by definition test the self-contained null hypothesis, this is a positive result. Nevertheless, the overall very poor performance of this multivariate approach raises the question whether taking into account the dependence structure between the genes is at all useful in the context of enrichment analysis.

Analysis of p53 cancer data
In order to verify our conclusions from the simulations we reanalyzed the p53 cancer data from Subramanian et al. [8]. The protein p53 is a tumor suppressor preventing the development of cancer cells. It regulates genes involved in the cell cycle and the induction of apoptosis after DNA damage. Gene set enrichment analysis allows to study which pathways are involved in these cellular mechanisms.
The p53 data is used as a benchmark data set in a number of papers [e.g., [19,11]] and contains expression profiles of 50 cancer cell lines that can be found in the data base of the International Agency for Research in Cancer (IARC). The gene expressions were measured with the Affymetrix HGU95Av2 chip which contains 12, 625 probe sets. For each cell line, the mutational status of the p53 protein was reported. 17 cell lines were found to have normal p53 status while the remaining 33 samples showed a mutation of p53. The analyzes in [8,19] and [11] recovered several gene sets that are differentially expressed between the mutation and wildtype cell lines. Five functional groups were mentioned in all three publications: • the p53 pathway, • the down-stream targets of p53, • genes induced by radiation, • genes induced by hypoxia, and • the heat-shock protein signaling pathway.
Dinu and co-workers presented 31 additional gene sets found with the SAM-GS method [19]. In our preprocessing of the data we filtered out genes that showed maximum signal intensities of less than 5 on log 2 scale and a variation of less than 2-fold over all samples to reduce the noise in the data. The resulting data set contained 8,768 genes. From the biologically chosen gene sets of [8] we focused on sets with sizes between 15 and 500 genes. This left 290 gene sets of varying size, each containing between 15 and 321 genes.
In the following we report only our main findings. A much extended analysis of the p53 data set including detailed tables of significant gene sets and pathways can be found in [40].

Gene-level scores and transformations
As in the simulation study, in order to compare the various gene level statistics and their transformations we employed as gene set statistic the mean and use resampling for computing p-values.
Concerning the choice of gene statistics, there was little difference in the overall ranking of the gene sets. The 25 top ranked gene sets are nearly the same for all three statistics in combination with a specific transformation. The significant findings are in broad accordance with the results published previously [8,19,11]. Specifically, we also recover the additional sets reported by Dinu et al. [19] if a squared gene transformation is used (recall that their SAM-GS approach is based on a squared regularized t-statistic). For this data set we noted a distinct difference when positive and negative scores were analyzed separately, instead of combined. In particular, taking squared values of the individual gene scores appears to give more weight to gene sets that are not high scored in a separate analysis for up-and down-regulated genes. This is probably due to the fact that summing up the squared values leads to the detection of gene sets with many subtle expression changes in both directions instead of sets where some constituents are either up-or down-regulated.
When ranks or the binary transformation are used, fewer of the Dinu and Subramanian gene sets are detected. Furthermore, the number of significant gene sets is decreased compared to using the original or squared scores. Thus, for this data the results based on the rank and binary transformations appear to suffer substantially due to the loss of information. In contrast, the transformation to local FDR values (empirical posterior probabilities) performs much better than in the simulation study.
Gene set statistics and significance assessment Next, we investigated the impact of the gene set statistic and the choice of the significance procedure. In this analysis we used the squared moderated t-statistic as the gene level statistic.
The results are summarized in Fig. 2. For this data, both phenotype permutation as well as resampling lead to the same number of significant gene sets regardless of the pvalue cutoff, the only exception being the Wilcoxon statistic. The Hotelling statistic and the conditional FDR approach consistently found the largest number of significant gene sets for any cutoff, followed by the global test and the mean and maxmean. The ES, median, and Wilcoxon statistics typically resulted in the least number of gene sets. Note, however, that we do not know the ground truth for the p53 data. Therefore, one needs to keep in mind that the larger power of the mean or the global test compared, e.g., to the ES score might be due to a higher specificity of the latter. This is in concordance with our simulation study (cf. Table 3) where ES and the median were the only two statistics that never declared the negative control (set 3) significant. The outcome of the Hotelling T 2 -test is in very good accordance with the findings of [19]. Thus, in contrast to the simulated data the Hotelling approach gives meaningful results for real data. We investigated this discrepancy more closely by inspecting the correlations among the genes in a gene set. As can be seen from Fig. 3 the correlations within a gene set are comparatively weak. This explains the good result of the Hotelling approach, which for vanishing correlation essentially reduces to the SAM-GS [19].

Analysis of Hedenfalk breast cancer data
To complement the study of gene expression data we analyzed the breast cancer gene expression data from Hedenfalk et al. [42]. This well-studied data set includes 7 cancer patients carrying the BRCA1 genetic mutation and 8 other patients carrying the BRCA2 mutation. Microarray measurements were taken on 3,226 genes.
A peculiar feature of this data set is that on the gene level there is virtually no signal to distinguish the two groups (BRCA1 versus BRCA2) -see for example [43]. In our preprocessing for gene set analysis we considered a subset of 2,200 genes (all that could be uniquely mapped from cloneIDs to a unigene symbol) and a catalog of 89 gene sets from [8] relevant for these genes.

Gene set statistics and significance assessment
The p53 and the Hedenfalk cancer data behave strikingly different, as can be seen in Fig. 4. For the Hedenfalk data employing resampling resulted in very few gene sets detected as significant, regardless of gene set statistic. In contrast, when phenotype permutation was applied a substantial number of significant gene sets were found. As for the p53 data set, the gene set statistics can be divided into two groups. The Hotelling, mean, maxmean and global statistics resulted in larger numbers of significant set, whereas median, ES, Wilcoxon and conditional FDR detected fewer gene sets. Thus, for this data the conditional FDR approach is as conservative as in the simulations.
The difference between results for permutation and resampling may be explained as follows. If most of the active genes show only small expression changes, then the interesting sets do not differ markedly from random sets. Consequently, they cannot be found by the resampling approach. But still, in the Hedenfalk data the subtle changes of the interesting genes between the phenotypes are large enough when they are accumulated over the active gene set to be detected with sample label permutation.

Enriched gene sets
Finally, it is also instructive to inspect the list of detected gene sets. For gene set statistic mean with gene-level statistic moderated t, transformation squared values and sample label permutation all significant (p-value cutoff 0.01) gene sets are listed in Table 7. Note the enrichment of many cancer-related sets as a result of the gene set analysis.
It is noteworthy here that using the ES as gene set statistic does not yield the set Breast_Cancer_Estrogen_Signalling among the 20 top ranked gene sets, whereas in our Table  7 it is ranked number 1. We remark that this is a biologically important finding, in concordance with Hedenfalk et al. [42] who state that tumors with BRCA1 mutations are generally negative for estrogen receptors, whereas BRCA2 mutations are positive, implying that BRCA1 and BRCA2 genes induce the formation of breast tumors through separate pathways.
Overall, it appears that the Hedenfalk data is an exemplary case where merits of gene set enrichment over analyzing individual genes can be most prominently seen.

Conclusion
In this work we have investigated statistical procedures for detecting enrichment of gene sets. By conducting an extensive survey of corresponding approaches we found that all procedures in current use may be fit into a simple modular framework as depicted in Fig. 1.  The identification of this modular structure greatly facilitates the systematic comparison of the diverse methods. Based on simulations and data analysis we arrive at a number of guidelines for conducting a gene set enrichment analysis. We believe that these will contribute to the establishment of best practices for gene set enrichment analysis.

Analysis of p53 data set
Perhaps the most surprising result from our study is that the simple univariate procedures appear to work best under a large variety of conditions. Specifically, using the mean of the squared (regularized) t-statistics and assessing the significance with one of the two permutation approaches reliably detects gene sets of diverse types of expression and correlation structures. This procedure was proposed, e.g., in [15,14,19,17]. Intriguingly, this simple approach typically outperforms (by a fairly large margin) the in biological circles very popular GSEA method [8]. Thus, we strongly disagree with the recommendation put forward in [34] which was primarily given on grounds of software availability.
With regard to the choice of the best approach for each module in Fig. 1, our conclusions can be summarized as follows: • gene-level score: The choice of the gene-level score is not crucial. All different kinds of scores lead to very similar results. However, it is recommended to use a statistic that is standardized by the variation of the gene expres-Distribution of correlation across the 290 gene sets investigated for the p53 data sion, as this makes the measure of differential expression comparable across genes. Scores with this characteristic are t-statistics, correlation coefficients or standardized regression coefficients. If the sample size is small, using one of the regularized t-statistics is the best choice since they are designed to stabilize small variances.
• transformation: The choice of transformation has a substantial impact on the overall results. In our study the (rank) squared transformation was most accurate. The application of absolute values instead of squared values gives comparable results. Note that using the squared values essentially corresponds to using a modified version of Hotelling's T 2 -test with a diagonal covariance matrix. Binary transformations often lead to satisfying results but also imply a loss of information. If the focus is put on gene sets that are concertedly regulated in one direction the untransformed scores should be used.
• gene set statistics: Overall, the simple approaches such as the mean or the median work very well. Using the median instead of the mean was proposed by [16] as an approach that is less sensitive with regard to outliers. However, using the median may also lead to a smaller number of significant findings. Similar arguments apply to the rank-based Wilcoxon test. The GSEA, or more precisely the enrichment score, is not as reliable as the other gene set test statistics. Furthermore, the construction of the GSEA is rather complicated since it mixes the competitive and the self-contained null hypotheses.
• significance assessment: The choice of the permutation approach for significance assessment is mainly determined by the null hypothesis that is tested. Hence, it has to be decided carefully which procedure is used. If a selfcontained test is conducted, the phenotypes have to be permuted. In this way, the null distribution of the gene set statistic under the assumption of no difference between the treatment groups is obtained. If the gene set is to be compared to other sets, gene sampling is the correct approach. Then the null model is based on the assumption that the gene set membership is random. Depending on the data structure, both approaches can yield quite different results.
• multivariate procedures: Using a multivariate procedure such as the globaltest or the Hotelling approach not necessarily yields better results than the best univariate approaches. However, for the globaltest a parametric approximation of the null distribution is available, which might be beneficial if a large number of gene sets are analyzed. The Hotelling-type approach is not appropriate for the detection of gene sets containing genes that are highly correlated. From a statistical perspective this is straightforward to understand, as it is more unexpected to find a group of independent genes that are co-regulated than a group of correlated genes which behave similarly. In conclusion, we find that despite the large variety of available gene set enrichment procedures there is a common core among all procedures. In order to reliably detect enriched gene sets it is in general sufficient to use a univariate approach, in combination with either resampling or sample label permutation, depending on the desired underlying null model.

Analysis of Hedenfalk data set
In the present study, we have focused on methods for detecting enrichment of a single prespecified gene set. A topic not investigated here is the issue of multiple testing of gene sets. In this context, standard corrections for mul-tiplicity need to be refined in order to account for dependencies among gene sets, which may be due to the underlying gene ontology graph structure or because genes may belong to several gene sets at once [44,28]. 11  As gene set statistic we used the mean combined with the squared moderated t-statistic on the gene level and sample label permutation.
Kruskal [46] measure of statistical association between the two margins of the contingency table.

GSEA algorithm and related methods
One of the most popular methods for detecting enrichment is the "Gene Set Enrichment Analysis" (GSEA) introduced in [7] and improved by the same group only a short time later [8].
This method starts with a list L of N genes ranked by their correlation with the phenotype of interest, e.g., using the t-score or the signal-to-noise ratio. In order to test the null hypothesis that the m genes in a set S are randomly spread in the list L a so-called enrichment score (ES) is computed. This equals the maximum deviation from zero of a running sum (running down the list of genes L) that increases every time a gene in the list L is in S and decreases every time a gene is not in S (j = 1, ..., N). Additionally, every gene set member is weighted by its absolute value of correlation with the phenotype. As a result, genes with very low correlations do not contribute to the enrichment score of the set. The running sums for the genes in the set S and the complementary set (including all the genes from the list which are not in S) are calculated separately as follows: r j is a measure for the correlation of the individual gene g j with the phenotype, for example a t-statistic. The ES is defined as the maximum deviation of ES S (l)from zero (l = 1, ..., N). This test statistic is similar to a Kolmogorov-Smirnov statistic in that it compares the deviation between two cumulative distribution functions. But contrary to the Kolmogorov-Smirnov statistic, the ES is signed and uses weights.
The choice of the exponent p in the above formula determines the influence of the single genes on the score. For p = 0 an unweighted statistic is obtained. When p = 1 the genes in S are weighted by their absolute value of correlation with the phenotype according to the chosen correlation measure. Other values of p can also be used (see [8], supplementary material). The authors suggest p <1 if one wishes to penalize a lack of coherence in a set or p > 1 if correlation of a small subset of S with the phenotype is sufficient to call a set enriched. Additionally, the summands are normalized by the sum of all test statistics in S.
The summands for the genes in are constant.
Other authors presented similar statistics, see for example [12] who use X j = N -m and X j = -m as the contribution of the genes in S and to the running sum respectively. All the choices result in a score that assumes the value 0 when the running sum reaches the end of the list L.
Significance of the test statistics is assessed by permuting the class labels and recalculating the ES for every gene set in every permutation. This necessitates correction for multiplicity when a larger number of sets is tested.
An approach related to GSEA but more general is "Significance Analysis of Function and Expression" (SAFE) as described by [9]. This is a two-step procedure, where in the first step a local test statistic is computed for each gene (e.g. a t-score) and in the second the statistics of the genes in the gene set S are combined to give a global statistic. This is done in order to compare the distribution of the local statistics in the gene set with that of the local statistics of the rest of the genes. However, [9] leave it to the user to decide which global statistic is to be applied. They propose rank-based statistics as the Wilcoxon rank sum test or the Kolmogorov-Smirnov test since they do not require any assumptions about the joint distribution of the local statistics. Significance in the SAFE algorithm is evaluated by permuting the sample labels and recalculating the local and global statistics. For taking account of multiplicity when testing for several gene sets at a time [9] suggest to control the false discovery rate and related criteria.  Yet another modification of the GSEA method is introduced in [10]. These authors employ regression coefficients of the interesting outcome (in their case the age of the patient) to rank the genes. Instead of taking a Kolmogorov-Smirnov statistic as the gene set statistic, [10] use a van der Waerden statistic where Φ -1 is the quantile function of the standard normal distribution N (0, 1) and r j is the rank of gene g j in the list of all genes. Under the null hypothesis the test statistic Y is approximately normally distributed with mean zero.
[10] employ a bootstrap sampling approach to estimate the variance of the null distribution.

Averaging methods
Another natural approach of combining information of a group of genes pursued by many authors is to simply average over the test statistics of the individual genes. Specifically, let t j , j = 1, ..., N, be the association measure of gene g j with the phenotype. Furthermore define G kj as the binary indicator of whether gene g j belongs to gene set S k (G kj = 1) or not (G kj = 0). Then the test statistic is given by where is the number of genes in set S k .
A crucial point noted by [14] is that one needs to distinguish between two types of null hypotheses, which in turn necessitates different ways of assigning significance. The two hypotheses are: 1.Q 1 : "The genes in a gene set show the same pattern of association with the phenotype compared with the rest of the genes." 2. Q 2 : "The gene set does not contain any genes whose expression levels are associated with the phenotype of interest." These correspond to the "competitive" and "self-contained" hypotheses discussed in [5]. Note that the first hypothesis compares the genes in the set with the remaining genes, thus implying that in a permutation test the genes are the sampling units. Q 2 tests the association between the genes in the set and the phenotype regardless of the genes that are not in the set. Therefore, here the phenotype labels are the sampling units. When testing Q 2 [14] employ a permutation approach whereas for testing Q 1 they apply a normal approximation to compute p-values.
An averaging approach based on the moderated t-statistic is described in the documentation of [15]. Further discussion regarding the averaging of gene-level statistics can be found in [20,33] and [18]. The "category" software of [17] also implements a general form of the averaging approach.
The maxmean statistic and restandardization [11] propose several improvements of the averaging approach and GSEA. First, they introduce a new test statistic called the "maxmean" statistic which leads to more effcient tests. It is defined as Second, [11] show the necessity not only to incorporate sample label permutation as in the GSEA method but also gene sampling into the analysis. This idea is implemented in a procedure called "restandardization" which combines gene sampling and phenotype permutation.
Let μ † and σ † be the mean and standard deviation of the distribution obtained by gene label permutation and μ* and σ* the corresponding parameters of the sample label permutation distribution. For a general test statistic T, for example the ES used in GSEA, the restandardization can be written as It corrects the phenotype permutation statistics for the estimated distribution of the individual genes in the data set. The corresponding p-value can then be calculated in the usual way as the fraction of a large number of restandardized test statistics exceeding the observed test statistic.
When the gene set score is simply the average of the individual gene scores in the set, the restandardization formula can be simplified to Here T ave denotes the gene set statistic obtained by an averaging approach, μ s and σ s are the mean and standard deviation of the scores s j of all N genes and μ* and σ* are the mean and standard deviation of the gene set test statistics obtained by sample label permutation. In this special case, it is not necessary to conduct gene sampling before doing the restandardization since the mean of the gene sampling distribution is the same as the overall mean of the single genes and the standard deviations only differ by the factor .
Extensions to GSEA and averaging methods Some further variations to the GSEA and methods using the average of gene-level statistics are reported in Jiang and Gentleman [16]. They propose using the median or a sign test for robustly averaging over the scores in a gene set. [16] also suggest an alternative test statistic for the individual genes when there are other covariates such as gender or age which might influence the expression levels.
In essence, this test statistic is based on fitting a linear model where the gene expression is the response and the phenotype and other covariates are the independent variables.
A similar approach was proposed by [13] who additionally incorporate a measure for co-expression in the gene sets, the mean of the pairwise correlation coefficients, and a measure for the distinctiveness of the sets based on a nearest neighbors classifier.
In their paper, [16] also propose a Bayes approach for the investigation of gene set enrichment. The procedure is somewhat similar to the two groups empirical Bayes model described in [21]. Another issue addressed by [16] is the potential overlap of gene sets. They also investigate dimension reduction (i.e. principle components analysis) to find gene sets with co-regulated genes.

Parametric method
Another very simple approach to detecting gene set enrichment is PAGE (Parametric Analysis of Gene Set Enrichment) suggested by [18]. Essentially, this approach proceeds by averaging over fold changes or other genelevel statistics and comparing this test statistic with the standard normal distribution. Since the approach is fully parametric and because no permutation is required this procedure is very time saving and computationally efficient. Using the normal approximation is justified by [18] via asymptotic arguments. In a similar fashion, the "category" averaging approach [17] also employs (among other options) a normal approximation for significance assessment.

Random-set method
Newton et al. [20] aggregate some ideas from the averaging methods, the parametric models, and the 2 × 2 table methods. Instead of only counting the number of differentially expressed genes present in the gene set, as done when applying 2 × 2 table methods, they allow for more general test statistics similar to the ES introduced by [7]. However, they do not apply sample label permutation to assess significance but consider the genes to be the sampling units.
Starting with the measures of differential expression s j , j = 1, ..., N for each gene, an unstandardized score for a gene set S consisting of m genes is calculated as the average of the gene scores of these genes As the method aims to compare the enrichment in a gene set S with the enrichment of all other distinct randomly drawn gene sets of size m, a so called random-set model is introduced. That means that the gene set S is now considered as a random collection of m genes whose scores s j are fixed. For most choices of the gene-level statistic, the exact distribution of becomes intractable. [20] propose to approximate it with the normal distribution with mean and variance as follows: The score can then be standardized with these two quantities, resulting in the test statistic Under the null hypothesis of no enrichment, Z is approximately normally distributed with mean zero and unit variance. tained because of the use of the sample permutation procedure. Furthermore, a refined follow-up model allows for the explicit inclusion of an estimated correlation matrix [24].

Hotelling's T 2 -test and related methods
Other procedures for gene set enrichment analysis which are also directly based on the gene set information can be derived from the well-known multivariate extension of the t-test, the Hotelling's T 2 -test. The corresponding test statistic is where and are the mean vectors of the gene expressions in the set for the two phenotypes, and n 1 and n 2 are the sample sizes in both groups. S is the pooled empirical covariance matrix of the genes in the set, and R is the corresponding correlation matrix. With t we denote the vector containing all local two-sample equal variance tstatistics. Note that as a weighted sum of the gene-wise tstatistics Hotelling's T 2 explicitly incorporates the dependence structure among the genes -unlike most other test statistics for gene set enrichment.
Under the null hypothesis that the mean vectors in both groups are equal, follows an F distribution with m and n -m -1 degrees of freedom, where m is the gene set size. However, in the analysis of gene expression data it is rarely the case that the number of samples is greater than the number of genes in the set, so the above procedure needs to be modified appropriately.
Kong et al. [25] propose to use dimension reduction by means of singular value decomposition. In effect, their approach amounts to using the Moore-Penrose pseudoinverse for inverting the singular empirical covariance matrix in the definition of Hotelling's T 2 .
Another Hotelling-like statistic can be found in [19]. These authors substitute the usual t-statistic with the SAM statistic [47] where s 0 is a regularizing constant, and assume a diagonal correlation matrix. This leads to the SAM Gene Set (SAM-GS) statistic for a gene set S. Significance is assessed by permuting sample labels.
A further possibility is to substitute the empirical covariance matrix in the Hotelling T 2 statistic with a more efficient estimator. This is important if the sample size is small [48].

Further procedures
Rahnenführer et al. [26] present procedures to investigate enrichment whilst taking into account known pathway structures and co-regulations of gene sets. Adewale et al. [30] extend the approach of [19] to account for diverse phenotypes including survival and count data. Lewin and Grieve [28] developed special methods considering the connections between gene sets derived from the graph structure of the Gene Ontology [49]. Procedures to infer the enrichment of gene-gene and gene-protein networks have been presented by Nacu et al. [29].
Läuter et al. [31] describe a procedure for detecting gene set enrichment and the simultaneous selection of suitable gene sets. This utilizes an exact high-dimensional multivariate spherical test [50]. Note that in this procedure the gene sets are not fixed a priori, but are chosen according to the correlation among genes. As gene set statistic a weighted average of univariate beta statistics is employed.
Edelman et al. [27] present an approach similar to the GSEA of [8].
A key feature of their method is that it is constructed to measure the enrichment of gene sets in individual samples. This aims at investigating the variation of the activity of functional groups of genes in the population so that the expression profile of relevant gene sets can be used for predicting the phenotype of an individual. As local test statistic the log-likelihood ratio is used.