Computation of significance scores of unweighted Gene Set Enrichment Analyses
© Keller et al. 2007
Received: 23 May 2007
Accepted: 06 August 2007
Published: 06 August 2007
Gene Set Enrichment Analysis (GSEA) is a computational method for the statistical evaluation of sorted lists of genes or proteins. GSEA was originally developed for interpreting microarray gene expression data, but it can be applied to any sorted list of genes. Given the gene list and an arbitrary biological category, GSEA evaluates whether the genes of the considered category are randomly distributed or accumulate at the top or bottom of the list. Usually, significance scores (p-values) of GSEA are computed by nonparametric permutation tests, a time-consuming procedure that yields only estimates of the p-values.
We present a novel dynamic programming algorithm for calculating exact significance values of unweighted Gene Set Enrichment Analyses. Our algorithm avoids typical problems of nonparametric permutation tests, such as results that vary between runs because of the random sampling procedure. Another advantage of the presented dynamic programming algorithm is its runtime and memory efficiency. To test our algorithm, we applied it not only to simulated data sets, but also to expression profiles of squamous cell lung cancer tissue and autologous unaffected tissue.
Modern high-throughput methods deliver large sets of genes or proteins that cannot be evaluated manually. For example, cDNA microarrays are used to measure the expression of a variety of genes under different conditions, e.g. in normal and cancer tissues. Usually, the expression quotient is computed for each gene and the genes are sorted by this quotient. The question of interest is whether over-expressed or under-expressed genes accumulate in certain biological categories, such as biochemical pathways or Gene Ontology categories. Different approaches can be applied to answer this question. First, the so-called "Over-Representation Analysis" (ORA) compares a reference set to a test set of genes by using either the hypergeometric test or Fisher's exact test. Second, "Gene Set Enrichment Analysis" (GSEA) evaluates the distribution of genes belonging to a biological category in a given sorted list of genes or proteins by computing running sum statistics.
Since its development in 2003 [1, 2], Gene Set Enrichment Analysis has been enhanced and integrated in a number of analysis tools. Among the most popular programs are "ermineJ" and "GSEA-p". These two tools estimate the significance values by using nonparametric permutation tests. However, such tests entail three disadvantages:
First, repeated runs of the permutation test algorithm may lead to different significance values because of the random sampling.
Since GSEA is often applied to many biological categories, p-values have to be adjusted for multiple testing by using Bonferroni Hochberg, Benjamini, or similar adjustment approaches. However, given the above estimation and the known multiple testing methods, the p-value cannot be adjusted in an appropriate way.
In this study, we address the exact and efficient p-value computation for unweighted Gene Set Enrichment Analysis. Unweighted means that the amount by which the running sum statistic is increased if a gene belongs to the considered category C, and the amount by which it is decreased if the gene does not belong to C, are constants. In our case, whenever a gene of C is found the running sum is increased by m - l, and otherwise it is decreased by l. The dynamic programming method is similar to the "DRIM" approach that computes the optimal partition of a gene set in a target and a background set.
We integrated our dynamic programming algorithm into the gene set analysis tool "GeneTrail" that is freely available at genetrail.bioinf.uni-sb.de. GeneTrail tests a wide variety of biological categories, among them Kyoto Encyclopedia of Genes and Genomes (KEGG) pathways, TRANSPATH pathways, transcription factors, Gene Ontology (GO), granzyme B cleavage sites, and protein-protein interactions [17–20]. GeneTrail relies on the Biological Information System BN++ that provides easy access to a wide variety of biological data.
Before presenting our algorithm, we discuss some important features of the running sum statistic. Given the sorted list L of m genes of which l belong to the considered biological category C, we calculate a running sum statistic as follows: whenever we find one of the l genes of the considered category C, we increase the running sum by m - l leading to a total sum of l·(m - l) over all genes in C. Otherwise, we decrease the running sum statistic by l leading to a total sum of (m - l)·(-l) over all genes not in C. Therefore, the running sum's final value will always be zero. Moreover, the running sum's maximal possible value is l(m - l), whereas its minimal possible value is -l(m - l).
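The running sum described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' C++ implementation; the function names `running_sum` and `max_deviation` and the toy membership list are assumptions made for this example, and the maximum deviation is taken to be the running sum value with the largest absolute distance from zero.

```python
def running_sum(in_category):
    """Return the running sum after each position of a sorted gene list.

    in_category: list of booleans, True where the gene belongs to category C.
    """
    m = len(in_category)   # number of genes in the sorted list L
    l = sum(in_category)   # number of genes that belong to C
    values = []
    total = 0
    for hit in in_category:
        # Increase by m - l for a gene of C, decrease by l otherwise.
        total += (m - l) if hit else -l
        values.append(total)
    return values


def max_deviation(values):
    """Running sum value with the largest absolute deviation from zero."""
    return max(values, key=abs)


# Toy example (assumed data): m = 8 genes, l = 3 of them in C.
flags = [True, True, False, True, False, False, False, False]
rs = running_sum(flags)
print(rs)                 # the last entry is always 0
print(max_deviation(rs))  # bounded in absolute value by l * (m - l) = 15
```

In this toy list the category genes sit near the top, so the running sum climbs early and then falls back to zero at the last position.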
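The p-value of an observed maximum deviation $RS_C$ is the fraction of all possible running sum statistics whose maximum deviation from zero is at least $|RS_C|$:
$$p = 1 - \frac{X}{Y},$$
where X is the number of running sum statistics with a maximum deviation of at most $|RS_C| - 1$ and Y is the number of all possible different running sum statistics, which is $Y = \binom{m}{l}$. To compute X, we count all running sum statistics that have a maximum deviation of at most $|RS_C| - 1$.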
We use a matrix M of dimension (2l(m - l) + 1) × (m + 1), where the rows represent all possible values of the running sum and the columns represent the indices 1,..., m of the sorted list L, preceded by an initialization column with index 0. Let M(j, i) denote the number of running sum statistics that have value j in step i and whose maximum deviation from zero up to step i is smaller than $|RS_C|$. The entries of M are computed using dynamic programming, starting with the initialization column: M(0, 0) is set to 1 and all other entries of column 0 are set to 0.
A running sum can reach value j in step i only from value j - (m - l), if the ith gene belongs to C, or from value j + l, if it does not. The remaining columns therefore follow the recurrence
$$M(j, i) = M(j - (m - l),\, i - 1) + M(j + l,\, i - 1),$$
where the constraint
(*) $-|RS_C| < j < |RS_C|$
ensures that only running sum statistics with a maximum deviation smaller than $|RS_C|$ are counted; entries that violate (*) are 0. The total number of running sum statistics with a maximum deviation smaller than $|RS_C|$ can be found at matrix entry M(0, m).
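The counting scheme can be illustrated with a short Python sketch. It is only a sketch under stated assumptions: the function name `exact_p_value` is hypothetical, each column is stored as a sparse dictionary rather than a dense array, the p-value is taken as 1 - X/Y with Y equal to the binomial coefficient of m and l, and Python's built-in integers stand in for the arbitrary-precision arithmetic that a C++ implementation would need.

```python
from fractions import Fraction
from math import comb


def exact_p_value(m, l, rs_c):
    """Exact p-value of a maximum deviation rs_c for m genes, l of them in C."""
    bound = abs(rs_c)
    column = {0: 1}  # initialization column: M(0, 0) = 1, all other entries 0
    for _ in range(m):
        nxt = {}
        for j, count in column.items():
            # A gene of C adds m - l, any other gene subtracts l.
            for k in (j + (m - l), j - l):
                if -bound < k < bound:  # constraint (*)
                    nxt[k] = nxt.get(k, 0) + count
        column = nxt
    x = column.get(0, 0)  # M(0, m): paths whose deviation stays below |rs_c|
    y = comb(m, l)        # number of all possible running sum statistics
    return 1 - Fraction(x, y)


# Toy example: m = 8, l = 3. Only the two extreme arrangements reach |RS| = 15.
print(exact_p_value(8, 3, 15))  # 1/28
```

A running sum that ends at zero after m steps necessarily contains exactly l genes of C, so reading the count at value 0 in the last column restricts the result to valid arrangements without tracking the number of category genes separately.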
At first glance, the presented algorithm seems to be inefficient concerning both space requirement and runtime, which are of order O(m^2·l). For example, if m = 20000 genes and a functional category with l = 2000 genes are considered, M would have about 1.44·10^12 entries.
We have implemented the above described algorithm in C++ using time and space efficient data structures which will be discussed here.
As the recurrence equation implies, filling the ith column of M only requires the values of the (i - 1)th column. Thus, the dynamic programming approach requires only two columns of the matrix, reducing the memory requirements to O(ml).
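The effect of keeping only two columns can be checked with simple arithmetic for the example of m = 20000 and l = 2000. The variable names below are chosen for this illustration only.

```python
m, l = 20000, 2000

rows = 2 * l * (m - l) + 1       # all possible running sum values
full_matrix = rows * (m + 1)     # entries of the complete matrix M
two_columns = 2 * rows           # entries when only two columns are kept

print(f"{full_matrix:.3e}")      # about 1.44e12 entries
print(f"{two_columns:.3e}")      # about 1.44e8 entries
```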
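Another important property of the running sum statistic implies that certain parts of the matrix M do not have to be computed. The running time of the algorithm can be further reduced by adding a second constraint
(**) $-m^2 + l \cdot m + i \cdot m - i \cdot l \le j \le l \cdot m - i \cdot l$
for each column i to the recurrence equation. The right side of the constraint holds because, for column i, the value j of the running sum can be computed as
$$j = a \cdot (m - l) + (i - a) \cdot (-l) = a \cdot m - i \cdot l,$$
where a denotes the number of genes of C among the first i genes of the list. Since a ≤ l, it follows that j ≤ l·m - i·l. The left side follows analogously: at most m - l of the first i genes lie outside C, so a ≥ i - (m - l) and therefore j ≥ -m^2 + l·m + i·m - i·l.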
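The following Python sketch illustrates how constraint (**) prunes the table without changing the result. It is an assumption-laden illustration: `count_paths` is a hypothetical helper, columns are stored as sparse dictionaries, and the second return value simply counts the nonzero cells that were filled.

```python
def count_paths(m, l, rs_c, prune=True):
    """Return (X, visited): X = M(0, m), visited = number of nonzero cells."""
    bound = abs(rs_c)
    column = {0: 1}
    visited = 0
    for i in range(1, m + 1):
        lo = -(m - l) * (m - i)  # equals -m^2 + l*m + i*m - i*l
        hi = l * (m - i)         # equals l*m - i*l
        nxt = {}
        for j, count in column.items():
            for k in (j + (m - l), j - l):
                if not (-bound < k < bound):       # constraint (*)
                    continue
                if prune and not (lo <= k <= hi):  # constraint (**)
                    continue
                nxt[k] = nxt.get(k, 0) + count
        visited += len(nxt)
        column = nxt
    return column.get(0, 0), visited


x_pruned, cells_pruned = count_paths(8, 3, 15, prune=True)
x_full, cells_full = count_paths(8, 3, 15, prune=False)
print(x_pruned == x_full, cells_pruned, cells_full)
```

Cells outside (**) belong to running sums that can no longer return to zero by step m, so dropping them leaves M(0, m) unchanged while fewer cells are filled.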
Although the additional constraint does not lead to an asymptotically improved runtime, an increased performance has been measured, especially for small p-values.
Additionally, the runtime of the presented algorithm can be improved by computing only the first half of M. Due to a certain "symmetry" of the running sum statistics, it suffices to compute either the column in the middle (for even m) or the two columns in the middle (for odd m) to derive the required number of paths.
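One way to exploit such a symmetry is sketched below in Python. The text does not spell out the exact construction, so this is an assumed reading: a path from value j in a middle column back to zero, read backwards with its sign flipped, is a path from zero to -j, which means the second half of each path can be counted with first-half columns. The names `advance` and `count_paths_half` are hypothetical.

```python
def advance(column, m, l, bound):
    """Compute the next sparse column under constraint (*)."""
    nxt = {}
    for j, count in column.items():
        for k in (j + (m - l), j - l):
            if -bound < k < bound:
                nxt[k] = nxt.get(k, 0) + count
    return nxt


def count_paths_half(m, l, rs_c):
    """Count X = M(0, m) using only columns up to the middle of the matrix."""
    bound = abs(rs_c)
    lower_mid, upper_mid = m // 2, (m + 1) // 2
    column = {0: 1}
    for _ in range(lower_mid):
        column = advance(column, m, l, bound)
    first = column  # column floor(m / 2)
    # For odd m one additional column, ceil(m / 2), is needed.
    second = first if upper_mid == lower_mid else advance(first, m, l, bound)
    # Join both halves: paths reaching j pair with reversed paths reaching -j.
    return sum(count * first.get(-j, 0) for j, count in second.items())


print(count_paths_half(8, 3, 15))  # 54, the same count as the full table
```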
As described above, our algorithm is applied to evaluate several thousand biological categories/hypotheses using the gene set analysis toolkit "GeneTrail". In general, findings are considered to be significant if the p-value is smaller than 0.05. Most computation time is spent on small p-values. However, only a few of the considered categories are statistically significant, whereas the others lead to intermediate and larger p-values. Since our algorithm is especially fast for intermediate and large p-values, the complete GSEA analysis is highly efficient and most of its running time is spent on the p-value calculation of the most significant categories.
To get maximal performance, we implemented our algorithm in C++. Other available GSEA tools have been implemented in Java or are available as "R" scripts. For this reason, a fair comparison of our tool with existing permutation test implementations is not possible. We therefore implemented a permutation test procedure in C++ with an expected running time of O(number of permutations·m). We applied the algorithm to the example presented in Figure 4 using 1000 permutations. On average, the presented dynamic programming approach was more than ten times faster than the permutation test procedure. Please note that the two approaches are not directly comparable. The runtime of both methods depends on the length m of the gene list. In addition, the runtime of our algorithm depends on the p-value, whereas the running time of the permutation test approach depends on the number of performed permutations.
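For comparison, a permutation test with running time proportional to the number of permutations times m can be sketched as follows. This is not the authors' C++ procedure; the function name `permutation_p_value`, its parameters, and the use of Python's `random` module are assumptions for illustration.

```python
import random


def permutation_p_value(m, l, rs_c, permutations=1000, seed=None):
    """Estimate the p-value of rs_c by shuffling the category labels."""
    rng = random.Random(seed)
    flags = [True] * l + [False] * (m - l)
    bound = abs(rs_c)
    extreme = 0
    for _ in range(permutations):
        rng.shuffle(flags)  # one random arrangement of the l category genes
        total = deviation = 0
        for hit in flags:
            total += (m - l) if hit else -l
            deviation = max(deviation, abs(total))
        if deviation >= bound:
            extreme += 1
    return extreme / permutations


# Two runs with different seeds may yield different estimates.
print(permutation_p_value(8, 3, 12, seed=1))
print(permutation_p_value(8, 3, 12, seed=2))
```

With 1000 permutations the estimate can only take multiples of 0.001, so any exact p-value below that resolution is typically reported as zero.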
We tested our algorithm by evaluating freely available expression profiles of lung cancer tissue and autologous control samples. In detail, we downloaded expression profiles of 5 squamous cell lung cancer patients from the "Gene Expression Omnibus". Together with the cancer tissue, unaffected tissue of autologous patients was extracted at surgery and 5 control expression profiles were generated. The 10 expression profiles were measured using the Affymetrix HG-U133A including more than 22000 transcripts and 13000 genes. In a pre-processing step, the profiles were median normalized. Thereafter, a paired t-test was performed for each transcript in order to detect differentially expressed genes. The paired t-test is applicable here, since the control samples were taken from the normal lung tissues of autologous patients. To generate a sorted list, t-test statistic values were sorted in increasing order such that the top of the resulting list contains the most significantly up-regulated genes in lung cancer and the bottom of the list the most significantly down-regulated genes.
Evaluation on lung cancer
We detected many significantly down-regulated KEGG pathways, among them the Cell Adhesion Molecules (p-value of 0.00024). The Cell Cycle is the most significantly up-regulated pathway (p-value of 0.0011). It is very likely that both pathways would achieve a p-value of zero by permutation tests; however, they are not equally significant, as demonstrated above. The up-regulated rRNA-binding achieved a p-value of 0.0488. This category represents an example where permutation tests might define a pathway as significant in one run and as not significant in another run.
We presented a novel dynamic programming algorithm that enables the efficient computation of exact significance values of unweighted "Gene Set Enrichment Analysis" and thus avoids typical problems of nonparametric permutation tests. Additionally, we showed that the runtime of the presented algorithm decreases as the p-values increase, i.e. our algorithm spends most time for computing small p-values of significant categories.
We integrated our algorithm in the gene set analysis tool "GeneTrail" that allows for performing a wide variety of statistical analyses efficiently. Using GeneTrail, we evaluated the differential expression of genes in squamous cell lung cancer expression profiles, demonstrating the usefulness of the presented dynamic programming algorithm.
The number of possible running sum statistics, $\binom{m}{l}$, increases exponentially. For the example given above, a microarray containing m = 20000 genes and a category with l = 2000 genes, the number of different running sums adds up to approximately 4·10^2821. In the worst case, the matrix entry M(0, m) is close to this value of 4·10^2821, if all genes of C are either top or bottom ranked. This example shows that the approach must be able to handle very large numbers. Hence, we use the "GNU Multiple Precision Arithmetic Library" (GMP), a numerically stable and fast library that can compute with arbitrarily large natural numbers and is freely available.
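The magnitude quoted above can be checked with arbitrary-precision integers. The sketch below uses Python's built-in integers and `math.comb` as a stand-in for GMP, which is an assumption of this illustration and not the authors' implementation.

```python
from math import comb

m, l = 20000, 2000
y = comb(m, l)             # number of different running sum statistics
digits = len(str(y))       # number of decimal digits of the exact count
leading = str(y)[0]        # leading decimal digit

print(digits, leading)     # a number of the form 4...·10^2821 has 2822 digits
```

A 64-bit integer holds at most 20 decimal digits, so counts of this size cannot be stored in native machine words.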
This work was supported by "Deutsche Krebshilfe", grant 107342 and by the "Deutsche Forschungsgemeinschaft", grant BIZ 4/1-(1,...,4).
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.