# Computation of significance scores of unweighted Gene Set Enrichment Analyses

- Andreas Keller
^{1}Email author, - Christina Backes
^{1}and - Hans-Peter Lenhof
^{1}

**8**:290

**DOI: **10.1186/1471-2105-8-290

© Keller et al; licensee BioMed Central Ltd. 2007

**Received: **23 May 2007

**Accepted: **06 August 2007

**Published: **06 August 2007

## Abstract

### Background

Gene Set Enrichment Analysis (GSEA) is a computational method for the statistical evaluation of sorted lists of genes or proteins. Originally GSEA was developed for interpreting microarray gene expression data, but it can be applied to any sorted list of genes. Given the gene list and an arbitrary biological category, GSEA evaluates whether the genes of the considered category are randomly distributed or accumulated on top or bottom of the list. Usually, significance scores (p-values) of GSEA are computed by nonparametric permutation tests, a time consuming procedure that yields only estimates of the p-values.

### Results

We present a novel dynamic programming algorithm for calculating exact significance values of unweighted Gene Set Enrichment Analyses. Our algorithm avoids typical problems of nonparametric permutation tests, as varying findings in different runs caused by the random sampling procedure. Another advantage of the presented dynamic programming algorithm is its runtime and memory efficiency. To test our algorithm, we applied it not only to simulated data sets, but additionally evaluated expression profiles of squamous cell lung cancer tissue and autologous unaffected tissue.

## Background

Modern high-throughput methods deliver large sets of genes or proteins that can not be evaluated manually. For example, cDNA microarrays are used to measure the expression of a variety of genes under different conditions, e.g. in normal and cancer tissues. Usually, for each gene the expression quotient is computed and the genes are sorted by their expression quotient. The question of interest is whether over-expressed or under-expressed genes accumulate in certain biological categories, as for example biochemical pathways or Gene Ontology categories. To answer this question different approaches can be applied. First, the so-called "Over-Representation Analysis" (ORA) that compares a reference set to a test set of genes by using either the hypergeometric test or Fisher's exact test. Second, "Gene Set Enrichment Analysis" (GSEA) evaluates the distribution of genes belonging to a biological category in a given sorted list of genes or proteins by computing running sum statistics.

*C*and sorted list

*L*of

*m*genes of which

*l*belong to

*C*means that a running sum statistic

*RS*is computed for

*L*. RS statistics evaluate whether the genes of

*C*are accumulated on top or bottom of the sorted list or whether they are randomly distributed. Hereby, the sorted list is processed from top to bottom. Whenever a gene belonging to

*C*is detected, the running sum is increased by a certain number, otherwise it is decreased. The value of interest is the running sum's maximal deviation from zero, denoted as

*RS*

_{ C }. An example is provided in Figure 1 for a list containing 8 genes of which 4 belong to

*C*. The black graph corresponds to all possible running sum statistics. The red pathway represents the example where the first three genes and the seventh gene belongs to

*C*. The

*RS*

_{ C }value of the red path is 12.

*RS*

_{ C }is calculated for permuted gene lists. Two approaches to compute these lists exist. First, the sorted gene list is randomly permuted. Second, if

*L*is sorted by the median expression quotient of expression values in one group divided by the median expression value in another group, the samples are randomly assigned to the two groups and thereby permuted gene lists are generated. Notably, these methods do not always yield the same results. The permutation procedure is repeated

*t*times and the running sum statistics together with the corresponding maximal deviations from zero, denoted as

*RS*

_{ i },

*i*∈ {1,...,

*t*}, are computed. Usually, the p-value computes as the fraction of

*RS*

_{ i }values that are larger or equal than

*RS*

_{ C }:

Since its development in 2003 [1, 2], Gene Set Enrichment Analysis has been enhanced [3] and integrated in a number of analysis tools [4]. Among the most popular programs are "ermineJ" [5] and "GSEA-p" [6]. These two tools estimate the significance values by using nonparametric permutation tests. However, such tests entail three disadvantages:

First, repeated runs of the permutation test algorithm may lead to different significance values because of the random sampling.

Since GSEA is often applied to many biological categories, p-values have to be adjusted for multiple testing by using Bonferroni Hochberg [7], Benjamini [8], or similar adjustment approaches. However, given the above estimation and the known multiple testing methods, the p-value cannot be adjusted in an appropriate way.

*m*= 20000 and

*l*= 2000, a sample size of 1000 permutations may be by far too small. Remarkably, the number of possible different running sum statistics amounts to $\left(\begin{array}{c}m\\ l\end{array}\right)$. On the example given above, the number of different running sums adds up to approximately 4·10

^{2821}, emphasizing that 1000 permutation represent a very small sample. The example shown in Figure 2 indicates that for a sorted list of length 2000 and a biological category including only 14 genes 1000 permutations do not yield reliable significance values. The required large number of permutation tests leads to an unacceptable computational effort, especially if thousands of biological categories are tested. An alternative, parametric method is the so called Parametric Analysis of Gene Set Enrichment "PAGE" method [9] that calculates a z-score for a given gene set and infers the significance value of this z-score against standard normal distribution.

In this study, we address the exact and efficient p-value computation for unweighted Gene Set Enrichment Analysis. Unweighted means that the number by which the running sum statistic is increased if a gene of *C* is found and the number by which the running sum statistic is decreased if the gene does not belong to *C* are constants. In our case, whenever a gene of *C* is found the running sum is increased by *m* - *l*, and otherwise it is decreased by *l*. The dynamic programming method is similar to the "DRIM" approach ([10]) that computes the optimal partition of a gene set in a target and a background set.

We integrated our dynamic programing algorithm into the gene set analysis tool "GeneTrail" [11] that is freely available at *genetrail.bioinf.uni-sb.de*. GeneTrail tests a wide variety of biological categories, among them Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways [12], TRANSPATH pathways [13], transcription factors [14], Gene Ontology GO, [15], granzyme B clevage sites [16], and protein-protein interactions [17–20]. GeneTrail relies on the Biological Information System BN++ [21] that provides easy access to a wide variety of biological data.

## Results and Discussion

### Dynamic programming algorithm

Before presenting our algorithm, we discuss some important features of the running sum statistic. Given the sorted list *L* of *m* genes of which *l* belong to the considered biological category *C*, we calculate a running sum statistic as follows: whenever we find one of the *l* genes of the considered category *C*, we increase the running sum by *m* - *l* leading to a total sum of *l*·(*m* - *l*) over all genes in *C*. Otherwise, we decrease the running sum statistic by *l* leading to a total sum of (*m* - *l*)·(-*l*) over all genes not in *C*. Therefore, the running sum's final value will always be zero. Moreover, the running sum's maximal possible value is *l*(*m* - *l*), whereas its minimal possible value is -*l*(*m* - *l*).

*RS*

_{ C }. The p-value can be computed as the probability that a random running sum reaches a maximal deviation greater or equal as

*RS*

_{ C }. We compute this probability via the complement of the event as:

where *X* is the number of running sum statistics with a maximum deviation of at most *RS*_{
C
}- 1 and *Y* is the number of all possible different running sum statistics which can be obviously computed as $\left(\begin{array}{c}m\\ l\end{array}\right)$. To compute *X*, we count all running sum statistics that have a maximum deviation of at most *RS*_{
C
}- 1.

We use a matrix *M* of dimension (2*l*(*m* - *l*) + 1) × (*m* + 1), where the different rows represent all possible values of the running sum and the columns represent the indices of the sorted list *L* from 1,..., *m* and an initialization column with index 0. Let *M*(*j*, *i*) denote the number of running sum statistics with value *j* in step *i* whose maximum deviation of zero is less than *RS*_{
C
}- 1. The entries of *M* are computed using dynamic programming, starting with the first column. *M*(0, 0) is set to 1 and all other values are set to 0.

*M*(

*j*,

*i*) is recursively computed as:

where the constraint(*) -|*RS*_{
C
}| <*j* < |*RS*_{
C
}|

ensures that only the running sum statistics with maximal deviation of smaller than *RS*_{
C
}are counted. The total number of running sum statistics with maximum deviation smaller than *RS*_{
C
}can be found at matrix entry *M*(0, *m*).

### Implementation details

At first glance, the presented algorithm seems to be inefficient concerning both, space requirement and runtime, which are of order *O*(*m*^{2}*l*). For example, if *m* = 20000 genes and a functional category with *l* = 2000 genes is considered, *M* would have about 1.44·10^{12}entries.

We have implemented the above described algorithm in C++ using time and space efficient data structures which will be discussed here.

As the recurrence equation implies, filling the *i* th column of *M* only requires the values of the *i* - 1th column. Thus, the dynamic programming approach requires only two columns of the matrix reducing the memory requirements to *O*(*ml*).

*M*(·, 0) contains only one number unequal to zero, the second column two numbers, and the

*l*th column

*l*values unequal to zero (see Figure 1 and Figure 3). By using two standard STL hash maps instead of the matrix

*M*, the space requirements can be reduced to

*O*(

*l*) and the expected runtime can be reduced to

*O*(

*ml*).

Another important feature of the running sum statistics implies that certain parts of the matrix *M* do not have to be computed. The running time of the algorithm can be further reduced by adding a second constraint(**) -*m*^{2} + *l*·*m* + *i*·*m* - *i*·*l* ≤ *j* ≤ *l*·*m* - *i*·*l*

for each column *i* to the recurrence equation. The right side of the constraint holds because, for column *i*, the value *j* of the running sum can be computed as*j* = *a*·(*m* - *l*) + (*i* - *a*)·(-*l*)

*a*is the number of genes that belong to

*C*up to index

*i*in the ordered list. Since

*a*can be at most

*l*, the following inequality holds

*i*the following equation holds:

*b*is the number of genes that do not belong to

*C*up to index

*i*in the ordered list. Since

*b*can be at most

*m*-

*l*,

Although the additional constraint does not lead to an asymptotically improved runtime, an increased performance has been measured, especially for small p-values.

Additionally, the runtime of the presented algorithm can be improved by computing only the first half of *M*. Due to a certain "symmetry" of the running sum statistics it suffices to compute either the column in the middle or the two columns in the middle to derive the required number of pathways.

### Dependence of runtime on p-value

*RS*

_{ C }|, |

*RS*

_{ C }|[ have to be computed. However, the running sum's maximal deviations from zero and the corresponding p-values are mutually dependent on each other, i.e., the higher the running sum's maximal deviation, the lower the p-value. Thus, the running time of the presented algorithm depends on the p-value (see Figure 4A and 4B). The lower the p-values are the higher is the runtime of the presented algorithm.

As described above, our algorithm is applied to evaluate several thousands of biological categories/hypotheses using the gene set analysis toolkit "GeneTrail". In general, findings are considered to be significant, if the p-value is smaller than 0.05. Most computation time is spend for small p-values. However, only few of the considered categories are statistically significant, whereas the others will lead to intermediate and larger p-values. Since our algorithm is especially fast for intermediate and large p-values, the complete GSEA analysis is highly efficient and most of its running time is spent for the p-value calculation of the most significant categories.

### Comparison to other approaches

To get maximal performance, we implemented our algorithm in C++. Other available GSEA tools have been implemented in Java or are available as "R" scripts. For this reason, a fair comparison of our tool to other nonparametric permutation tests is not possible. We implemented a permutation test procedure in C++ with expected running time of *O*(number of permutations·m). We applied the algorithm to the example presented in Figure 4 using 1000 permutations. On average the presented dynamic programming approach was more than ten times faster compared to the permutation test procedure. Please note that the two approaches are not directly comparable. The runtime of both methods depends on the length of the gene list *l*. In addition, the runtime of our algorithm depends on the p-value whereas the running time of the permutation test approach depends on the number of performed permutations.

### Evaluation of lung cancer expression profiles

We tested our algorithm by evaluating freely available expression profiles of lung cancer tissue and autologous control samples. In detail, we downloaded expression profiles of 5 squamous cell lung cancer patients [22] from the "Gene Expression Omnibus" [23]. Together with the cancer tissue, unaffected tissue of autologous patients was extracted at surgery and 5 control expression profiles were generated. The 10 expression profiles were measured using the Affymetrix HG-U133A including more than 22000 transcripts and 13000 genes. In a pre-processing step, the profiles were median normalized. Thereafter, for each transcript a paired t-tests was performed in order to detect differentially expressed genes. Paired t-test is applicable here, since the control samples were taken from the normal lung tissues of autologous patients. To generate a sorted list, t-test statistic values were sorted in increasing order such that the top of the resulting list contains the most significantly up-regulated genes in lung-cancer and the bottom of the list the most significantly down-regulated genes.

*α*-level of 0.05 without adjustment, and 711 with Benjamini-Hochberg adjustment. About 300 categories would achieve a p-value of zero by application of 1000 permutation tests. However, these categories differ extremely in their significance.

Evaluation on lung cancer

Category | #sub-categories | #significant | #significant adjusted |
---|---|---|---|

KEGG | 165 | 44 | 22 |

TRANSPATH | 171 | 28 | 4 |

TRANSFAC | 248 | 41 | 5 |

Gene Ontology | 5771 | 1590 | 646 |

Chromosomes | 25 | 15 | 11 |

Chromosome Arms | 44 | 24 | 23 |

Sequence Motifs | 4 | 2 | 2 |

Sum | 6428 | 1744 | 711 |

We detected many significantly down-regulated KEGG-pathways. Among them the *Cell Adhesion Molecules* (p-value of 0.00024). The *Cell Cycle* is the most significantly up-regulated pathway (p-value of 0.0011). It is very likely that both pathways would achieve a p-value of zero by permutation tests, however, they are not equally significant as demonstrated above. The up-regulated *rRNA-binding* achieved a p-value of 0.0488. This category represents an example where permutation tests might define a pathway as significant in one run and as not significant in another run.

## Conclusion

We presented a novel dynamic programming algorithm that enables the efficient computation of exact significance values of unweighted "Gene Set Enrichment Analysis" and thus avoids typical problems of nonparametric permutation tests. Additionally, we showed that the runtime of the presented algorithm decreases as the p-values increase, i.e. our algorithm spends most time for computing small p-values of significant categories.

We integrated our algorithm in the gene set analysis tool "GeneTrail" that allows for performing a wide variety of statistical analyses efficiently. Using GeneTrail, we evaluated the differential expression of genes in squamous cell lung cancer expression profiles, demonstrating the usefulness of the presented dynamic programming algorithm.

## Methods

### GMP library for arbitrary numbers

The number of possible running sum statistics increases exponentially, i.e. $\left(\begin{array}{c}m\\ l\end{array}\right)$. On the example given above, a microarray containing *m* = 20000 genes and a category with *l* = 2000 genes, the number of different running sums adds up to approximately 4·10^{2821}. In the worst case, the matrix entry *M*(0, *m*) amounts to 4·10^{2821}, if all genes of *C* are either top or bottom ranked. This example shows that the approach must be able to handle very large numbers. Hence, we use the "GNU Multiple Precision Arithmetic Library" (GMP), a numerically stable and fast library that can compute arbitrary large natural numbers and is freely available.

## Declarations

### Acknowledgements

This work was supported by "Deutsche Krebshilfe", grant 107342 and by the "Deutsche Forschungsgemeinschaft", grant BIZ 4/1-(1,...,4).

## Authors’ Affiliations

## References

- Mootha V, Lindgren C, Eriksson K, Subramanian A, Sihag S, Lehar J, Puigserver P, Carlsson E, Ridderstrale M, Laurila E, Houstis N, Daly M, Patterson N, Mesirov J, Golub T, Tamayo P, Spiegelman B, Lander E, Hirschhorn J, Altshuler D, Groop L: PGC-1alpha-responsive genes involved in oxidative phosphorylation are coordinately downregulated in human diabetes. Nat Genet. 2003, 34: 267-73. 10.1038/ng1180.View ArticlePubMed
- Lamb J, Ramaswamy S, Ford H, Contreras B, Martinez R, Kittrell F, Zahnow C, N P, TR G, Ewen M: A Mechanism of Cyclin D1 Action Encoded in the Patterns of Gene Expression in Human Cancer. Cell. 2003, 114: 323-32. 10.1016/S0092-8674(03)00570-1.View ArticlePubMed
- Jiang Z, Gentleman R: Extensions to gene set enrichment. Bioinformatics. 2007, 23: 306-13. 10.1093/bioinformatics/btl599.View ArticlePubMed
- Rubin E: Circumventing the cut-off for enrichment analysis. Brief Bioinform. 2006, 7: 202-3. 10.1093/bib/bbl013.View ArticlePubMed
- Lee H, Braynen W, Keshav K, Pavlidis P: ErmineJ: tool for functional analysis of gene expression data sets. BMC Bioinformatics. 2005, 9: 269-10.1186/1471-2105-6-269.View Article
- Subramanian A, Tamayo P, Mootha V, Mukherjee S, Ebert B, Gillette M, Paulovich A, Pomeroy S, Golub T, Lander E, Mesirov J: Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci USA. 2005, 102: 15545-50. 10.1073/pnas.0506580102.PubMed CentralView ArticlePubMed
- Hochberg Y: A sharper Bonferroni procedure for multiple tests of significance. Biometrika. 1988, 75: 800-2. 10.1093/biomet/75.4.800.View Article
- Benjamini Y, Hochberg Y: Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. J R Statist Soc B. 1995, 57: 289-300.
- Kim S, Volsky D: PAGE: parametric analysis of gene set enrichment. BMC Bioinformatics. 2005, 6: 144-10.1186/1471-2105-6-144.PubMed CentralView ArticlePubMed
- Eden E, Lipson D, Yogev S, Yakhin Z: Discovering Motifs in Ranked Lists of DNA Sequences. PloS Comput Biol. 2007, 3 (3):
- Backes C, Keller A, Küntzer J, Kneissl B, Comtesse N, Elnakadi Y, Müeller R, Meese E, Lenhof H: GeneTrail – Advanced Gene Set Enrichment Analysis. Nucleic Acids Research. 2007,
- Kanehisa M, Goto S, Hattori M, Aoki-Kinoshita K, Itoh M, Kawashima S, Katayama T, Araki M, Hirakawa M: From genomics to chemical genomics: new developments in KEGG. Nucleic Acids Res. 2006, 34: D354-D357. 10.1093/nar/gkj102.PubMed CentralView ArticlePubMed
- Krull M, Pistor S, Voss N, Kel A, Reuter I, Kronenberg D, Michael H, Schwarzer K, Potapov A, Choi C, Kel-Margoulis O, Wingender E: TRANSPATH: An Information Resource for Storing and Visualizing Signaling Pathways and their Pathological Aberrations. Nucleic Acids Res. 2006, 34: D546-D551. 10.1093/nar/gkj107.PubMed CentralView ArticlePubMed
- Matys V, Kel-Margoulis O, Fricke E, Liebich I, Land S, Barre-Dirrie A, Reuter I, Chekmenev D, Krull M, Hornischer K, Voss N, Stegmaier P, Lewicki-Potapov B, Saxel H, Kel A, Wingender E: TRANSFAC and its module TRANSCompel: transcriptional gene regulation in eukaryotes. Nucleic Acids Res. 2006, 34: D108-D110. 10.1093/nar/gkj143.PubMed CentralView ArticlePubMed
- Consortium TGO: Gene Ontology: tool for the unification of biology. Nature Genet. 2000, 25: 25-9. 10.1038/75556.View Article
- Backes C, Küntzer J, Lenhof H, Comtesse N, Meese E: GraBCas: a bioinformatics tool for score-based prediction of Caspase- and Granzyme B-cleavage sites in protein sequences. Nucleic Acids Research. 2005, 33: W208-W213. 10.1093/nar/gki433.PubMed CentralView ArticlePubMed
- Hermjakob H, Montecchi-Palazzi L, Lewington C, Mudali S, Kerrien S, Orchard S, Vingron M, Roechert B, Roepstorff P, Valencia A, Margalit H, Armstrong J, Bairoch A, Cesareni G, Sherman D, Apweiler R: IntAct -an open source molecular interaction database. Nucleic Acids Res. 2004, 32: D452-D455. 10.1093/nar/gkh052.PubMed CentralView ArticlePubMed
- Peri S: Development of human protein reference database as an initial platform for approaching systems biology in humans. Genome Research. 2003, 13: 2363-71. 10.1101/gr.1680803.PubMed CentralView ArticlePubMed
- Salwinski L, Miller C, Smith A, Pettit F, Bowie J, Eisenberg D: The database of interacting proteins: 2004 update. Nucleic Acids Res. 2004, 32: D449-D451. 10.1093/nar/gkh086.PubMed CentralView ArticlePubMed
- Zanzoni A, Montecchi-Palazzi L, Quondam M, Ausiello G, Helmer-Citterich M, Cesareni G: MINT: a Molecular INTeraction database. FEBS Letters. 2002, 513: 135-40. 10.1016/S0014-5793(01)03293-8.View ArticlePubMed
- Küntzer J, Blum T, Gerasch A, Backes C, Hildebrandt A, Kaufmann M, Kohlbacher O, Lenhof H: BN++ – A Biological Information System. Journal of Integrative Bioinformatics. 2006, 3: 34-
- Wachi S, Yoneda K, Wu R: Interactome-transcriptome analysis reveals the high centrality of genes differentially expressed in lung cancer tissues. Bioinformatics. 2005, 21 (23): 4205-8. 10.1093/bioinformatics/bti688.View ArticlePubMed
- Barrett T, Troup D, Wilhite S, Ledoux P, Rudnev D, Evangelista C, Kim I, Soboleva A, Tomashevsky M, Edgar R: NCBI GEO: mining tens of millions of expression profiles-database and tools update. NAR. 2006, 35: D760-D765. 10.1093/nar/gkl887.PubMed CentralView ArticlePubMed

## Copyright

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.