 Software
 Open Access
 Published:
rapidGSEA: Speeding up gene set enrichment analysis on multicore CPUs and CUDAenabled GPUs
BMC Bioinformatics volume 17, Article number: 394 (2016)
Abstract
Background
Gene Set Enrichment Analysis (GSEA) is a popular method to reveal significant dependencies between predefined sets of gene symbols and observed phenotypes by evaluating the deviation of gene expression values between cases and controls. An established measure of interclass deviation, the enrichment score, is usually computed using a weighted running sum statistic over the whole set of gene symbols. Due to the lack of analytic expressions the significance of enrichment scores is determined using a nonparametric estimation of their null distribution by permuting the phenotype labels of the probed patients. Accordingly, GSEA is a timeconsuming task due to the large number of required permutations to accurately estimate the nominal pvalue – a circumstance that is even more pronounced during multiple hypothesis testing since its estimate is lowerbounded by the inverse number of samples in permutation space.
Results
We present rapidGSEA – a software suite consisting of two tools for facilitating permutationbased GSEA: cudaGSEA and ompGSEA. cudaGSEA is a CUDAaccelerated tool using finegrained parallelization schemes on massively parallel architectures while ompGSEA is a coarsegrained multithreaded tool for multicore CPUs. Nominal pvalue estimation of 4,725 gene sets on a data set consisting of 20,639 unique gene symbols and 200 patients (183 cases + 17 controls) each probing one million permutations takes 19 hours on a Xeon CPU and less than one hour on a GeForce Titan X GPU while the established GSEA tool from the Broad Institute (broadGSEA) takes roughly 13 days.
Conclusion
cudaGSEA outperforms broadGSEA by around two ordersofmagnitude on a single Tesla K40c or GeForce Titan X GPU. ompGSEA provides around one orderofmagnitude speedup to broadGSEA on a standard Xeon CPU. The rapidGSEA suite is opensource software and can be downloaded at https://github.com/gravitino/cudaGSEAas standalone application or package for the R framework.
Background
Highthroughput technologies such as microarray or nextgeneration sequencing enable researchers to routinely measure the expressions of tens of thousands of genes in many patients. Typically, long lists of interesting candidate genes are generated by subsequent computational analyses. However, interpreting these gene lists is challenging. Recognizing that genes act in concert to drive various biological processes, Gene Set Enrichment Analysis (GSEA) was introduced [1] to summarize genomics data using a predefined gene set. Nowadays, GSEA is a heavily used tool in bioinformatics [2] and has been successfully applied to gain insights into the biological function of diseases such as cancer and diabetes.
However, the GSEA procedure can be highly timeconsuming since significance of a calculated enrichment score is typically tested using a resampling strategy drawing large numbers of permutations. When a whole database of gene sets is used, the amount of required permutations is even higher in order to account for multiple hypothesis testing. Furthermore, size and availability of input data sets continue to increase driven by advances in highthroughput technologies [3]. Thus, developing fast software solutions is of high importance to research. Previous work on accelerating gene set analysis has been limited to cloud computing [4]. We present the rapidGSEA suite – an efficient parallelization of the GSEA method for commonly available multicore CPUs and CUDAenabled GPUs. By using a combination of parallelization techniques we can achieve speedups of one orderofmagnitude on Xeon CPUs and around two ordersofmagnitude on a single GPU compared to broadGSEA.
Implementation
This section is divided into three parts. First, we give a brief explanation of the sequential GSEA algorithm and its four major processing steps for estimating the nominal pvalue of a determined enrichment score using a single gene set. Second, we introduce novel parallelization schemes for single and multiple gene set probing and their explicit implementation optimized for multicore CPUs and CUDAenabled GPUs. Finally, we describe the usage of our standalone application and the bundled package for the R framework.
The sequential algorithm
The traditional GSEA algorithm operates on a realvalued gene expression matrix D(g _{ i },p _{ j }) of shape G×P where g _{ i }∈G denotes G unique gene identifiers and p _{ j }∈P enumerates P patient identifiers each labelled by a binary phenotype L(p _{ j })∈{0,1} encoding cases and controls. The computation of the enrichment score statistics can be split into four major stages:
Computation of local deviation measures
For each gene symbol g _{ i } (each row of D) a local deviation score Δ(g _{ i }) is computed that encodes the interclass deviation between cases and controls. As an example, the difference of means between both classes can be employed to express their variability per gene:
where \(m^{(1)} = \sum _{j=0}^{P1} L(p_{j})\) and m ^{(0)}=P−m ^{(1)} denote the number of patients in each class from the set {0,1}. Variations that combine intraclass means and standard deviations e.g.
are common choices for Δ in GSEA implementations. Please note that extensions from binary to realvalued phenotype profiles \(L(p_{j}) \in \mathbb {R}\) using Euclidean distance, Pearson’s productmoment or Spearman’s rankorder correlation coefficient are straightforward [1] and thus will not be discussed further in this paper.
Gene ranking
After computation of the local deviations, the indices i∈{0,…,G−1} enumerating the gene symbols g _{ i } are reordered such that
is a sorted (usually descending) sequence of local deviation scores. The sequence of reordered gene symbols g _{ σ(i)} is called gene ranking according to Δ and will later be used to determine the enrichment score statistic. Figure 1 illustrates the first and second stage of the GSEA algorithm.
Enrichment score computation
To elucidate significant differences in gene regulation across different phenotypes, it is generally insufficient to consider transcription differences Δ(g _{ σ(i)}) individually. Each gene can be significantly up or down regulated by chance alone, or through correlation with processes such as the cell cycle. In principle, information can be gained from clustering genes according to their regulation [5]. Interpretation of the resulting clusters, however, is often unclear. Instead, prior information about gene classes that are assumed to behave correlatedly (e.g. genes on a regulatory pathway), is used in the analysis. Today, this is typically achieved through the framework of GSEA, which considers the significance of the transcription profile of a set of gene symbols S⊂G as a whole as opposed to individual enrichment values.
Let S be a gene set supposedly correlated to the observed phenotypes and σ(i) the aforementioned reordering of gene symbols. The enrichment score E S(S) is then determined as the maximal amplitude of a weighted running sum statistic ρ(k)∈[−1,1]:
with precomputed constants \(\alpha = \sum _{g \in S} \Delta (g)^{q}\) and β=G−S. The exponent q≥0 is usually chosen from the set \(\{0, 1, \tfrac {3}{2}, 2\}\) and controls the leverage of the weights Δ(g _{ σ(i)}). Please note that the special case q=0 is the wellknown KolmogorovSmirnov statistic [1]. Figure 2 illustrates an example for the linearweighted (q=1) computation of E S(S) using a toy data set.
Significance estimation
Similar to Pearson’s correlation coefficient the enrichment score takes values in the interval [−1,1] with E S(S)=1 indicating perfect (anti)correlation and E S(S)≈0 implying no dependency between S and the observed phenotypes in terms of the used deviation measure. When E S(S)=±1 all gene symbols g∈S are situated at the top/bottom of the ranked gene list. In contrast, small values are observed if the gene symbols g∈S are scattered over the index domain and thus are unlikely to explain the phenotype distribution.
ES values have no intrinsic significance, though. A value of E S(S)=0.857, as computed in our toy model in Fig. 2, might correspond to a high or low significance, depending on the probability to arrive at such a value by chance alone. Unfortunately, closed forms for the statistical distribution of enrichment score are inaccessible. Therefore, pvalues are typically estimated by sampling the null distribution using a permutation of phenotype labels. Please note that while some GSEA implementations allow to permute gene identifiers instead of phenotype labels [1, 6] to estimate the null distribution, phenotype permutation is often considered the more appropriate choice – genes are expected to feature statistical dependencies within a single patient, while probes gained from distinct patients are less likely to do so. Hence, in the following we only consider phenotype permutation.
Figure 3 depicts the enrichment score computation for a permutation π=(1 4) of the original list of six patients where the columns 1 and 4 of D have been swapped.^{1} The resulting score E S(S,π)=0.457<0.857=E S(S) suggests that the original value is considerably higher than a randomly sampled one. An exact computation of the pvalue – due to absent closed forms for their distribution – would require us to calculate E S(S,π) for all P! permutations and finally determine the portion of values which are more extreme than E S(S). GSEA implementations hence usually estimate pvalues by sampling in the space of permutations since P! is too large even for a moderate number of patients.
When probing more than one gene set at once, pvalue estimates have to be adjusted for multiple hypothesis testing. As an example, Bonferronicorrected acceptance levels and familywise error rates (FWER) are frequently used criteria to evaluate the significance of enrichment scores. The need for a large number of samples in the space of permutation is even more pronounced during multiple hypothesis testing: let e∈Π be the identity permutation in the set of n tested permutations Π. Then the pvalue estimate for a fixed gene set S is strictly positive [7] and lowerbounded by inverse sample size:
The Molecular Signature Database v5.1 [8] contains more than 13,000 gene sets divided into eight major collections. Thus, when testing all gene sets at a Bonferroniadjusted significance level of \(\alpha = \frac {0.01}{13,000}\) we have to probe more than 1,300,000 permutations in order to allow the result \(\hat p_{S} < \alpha \). For the rest of the paper, we focus on the efficient computation of the enrichment score table E S(S,π) since pvalue estimates and other statistics such as FWER can be determined using its entries in a postprocessing phase.
The parallel algorithm
GSEA can be parallelized using coarsegrained computation schemes such as assigning threads to each permutation π or gene set S since all entries in E S(S,π) can be processed independently. This approach will be used in our multithreaded shared memory implementation of GSEA (ompGSEA): The set of n probed permutations is split into m partitions each of approximate size \(\frac {n}{m}\) and afterwards m threads independently operate on the individual chunks. This can easily be achieved in shared memory architectures using OpenMP pragmas. Moreover, extensions to distributed memory architectures using the Message Passing Interface (MPI) are conceivable.
However, CUDAenabled accelerators can maintain up to several thousands of threads (e.g. Titan X/Tesla K40c: 3,072/2,880 cores) but only exhibit a limited amount of RAM (both GPUs provide 12 GB). As a result, finegrained computation schemes that parallelize the aforementioned building blocks of the GSEA algorithm have to be employed to exploit the full compute capabilities of CUDAenabled accelerators. In the following, we will present the finegrained parallelization scheme for each processing stage separately.
Computation of local deviation measures
Many local deviation measures used in traditional GSEA e.g. difference of means or fold change can be expressed in terms of intraclass means and standard deviations. Therefore, we have to separately accumulate sums of expression values and their squares for each of the two phenotypes. Although efficient implementations for parallel reduction on CUDAenabled accelerators are known [9] we instead parallelize the loop over the gene symbols since each row of the data matrix D can be processed independently without the need for expensive synchronization as used in reduction algorithms. Moreover, the number of gene symbols will most likely exceed the number of probed patients and thus the loop over g _{ i } is better suited for massively parallel computation. During the calculation of statistical moments we encounter two challenges:
First, the numerically stable computation of standard deviations is known to be a stubborn task. On the one hand, when accumulating a large number of entries (here patients) one has to account for numeric stability using cancellationcompensation [10] or twopass algorithms for the standard deviations. On the other hand, when dealing with only a few patients onepass or cancellationcompensated online algorithms for the standard deviation might be a proper tradeoff between accuracy and speed [11]. rapidGSEA exploits the C++ template engine to provide specialized and usercustomizable accumulator functors adaptable to the task’s requirements.
Second, the genewise computation of transcription differences Δ(g _{ i }) accumulates statistical moments along the rows of the matrix D. Using a CUDA thread block of up to 1,024 CUDA threads for a fixed permutation of the phenotype array L(p _{ π(j)}) it is advisable to transpose D to guarantee coalesced access to global memory. More specifically, since a warp of 32 threads is executed simultaneously on the GPU their concurrent reads from the same column of D would result in excessive cache misses. In contrast, when transposing D the same access pattern causes consecutive threads to simultaneously access consecutive memory. This change from columnmajororder to rowmajororder traversal decreases the runtime of this processing step by one orderofmagnitude in our experiments. Since D usually tends to be smaller than 100 MB, we can use a standard bank conflictfree outofplace algorithm for matrix transposition [12]. Figure 4 depicts the described computation scheme for two CUDA thread blocks each consisting of ten CUDA threads. Please note that the genes are distributed using a blockcyclic distribution if the number of genes exceed the number of threads.
The sampling of permutations can be accomplished using the pseudo random number generators (PRNG) from the cuRAND library [13] bundled with the CUDA SDK. Unfortunately, cuRAND does not provide hostsided calls for the random number generators defined in the device API. Thus, we implemented the keep it simple stupid (KISS) PRNG [14] for the CPU and GPU in order to provide consistent results across architectures. Both cuRAND’s xorwow PRNG and our KISS implementation pass all tests of the dieharder suite [15]. The permutation of the phenotype labels L(p _{ π(j)}) is generated by reordering the original label list L(p _{ j }) in shared memory using a FisherYates shuffle.
Gene ranking
Up to this point, the transcription differences Δ(g _{ i }) have been computed for a batch of permutations that fit into the RAM of the GPU. Unfortunately, we cannot directly apply a keyvalue sort to Δ(g _{ i }) within the same kernel due to the 48 KB limitation of shared memory. Thus, after termination of the previous kernel, we call a devicewide keyvalue radix sort primitive cub::DeviceSegmentedRadixSort from the CUB [9] library specifically optimized for the efficient sorting of segmented arrays. This approach is up to one orderofmagnitude faster than stacking single devicewide cub::DeviceRadixSort calls for each permutation or aliasing global memory to the blockwide cub::BlockRadixSort primitives. The number of concurrently sorted arrays has been set to 128 as a proper tradeoff between runtime and memory consumption. At the end of this stage, we have stored the sorted deviation scores Δ(g _{ σ(i)}) and corresponding indices σ(i) for each of the probed permutations in global memory. Figure 4 illustrates the described workflow.
Enrichment score computation
The computation scheme for the running sum statistic is similar to the processing of local deviation scores. For each permutation a CUDA thread block operates on a pair (g _{ σ(i)},Δ(g _{ σ(i)})) of reordered gene symbols and gene transcription differences. The test whether a gene identifier is part of a gene set g _{ σ(i)}∈S is usually implemented with hash sets on CPUs. Efficient hashing algorithms on CUDAenabled devices are stated in the literature [16] which typically involve linked lists or binary search in sorted arrays in order to resolve collisions. However, we decided to encode the affiliation of a gene g with a binary bit mask b(g,S). The computation of the bit mask can be delegated to the CPU using STL hashes. Further, the corresponding execution time can be overlapped with the deviation score and gene ranking kernels. As a result, we can determine a gene’s affiliation on the GPU in constant time by reading the corresponding entry of the bit mask from global memory.
Each thread k within a thread block processes one gene set S _{ k }. Shared memory can be utilized to avoid slow accesses to global memory since all threads in a warp have to access the same entry from the bit mask b(g _{ σ(i)},S _{ k }) in random order. To achieve this, batches of 64 entries of reordered gene transcription differences Δ(g _{ σ(i)}) and bit mask entries b(g _{ σ(i)},S _{ k }) are consecutively loaded into shared memory (scratchpad) and afterwards processed in order. Due to the large number of genes we again use numerically stable Kahan summation [10] in order to suppress cancellation in floating point arithmetic. Finally, the maximum amplitude of the weighted KolmogorovSmirnov statistic is written to the enrichment score table E S(S,π) and consecutively transferred to the host. Figure 5 illustrates the described procedure.
Significance estimation
When only computing pvalue estimates the counting of values in the tails of the null distribution could be accomplished on the GPU using the devicewide reduction primitive cub::DeviceSegmentedReduce from the CUB library. A similar approach for the computation of the FWER is conceivable. However, we decided to copy E S(S,π) to the host in order to provide the full information for consecutive analysis and visualization of sampled distributions.
Bindings for the R language
The core algorithm written in CUDA and C++11 is provided as standalone application and additionally as Rcppbased [17] package for R. The latter includes functions for the reading of gene expression tables (*.gct), class assignment labels (*.cls) and gene sets files (*.gmt) as well as methods for the querying and selection of the used GPU (see user manual).
Results and discussion
The performance of rapidGSEA is compared to the broadGSEA Java application in version 2.2.2 [18] on the following platform:

(CPU) Intel Xeon E52660 v3 @ 2.60 GHz GHz (10+10 HT) with 128 GB DDR4 RAM

(GPU) NVIDIA GeForce GTX Titan X with 12 GB GDDR5 RAM, NVIDIA Tesla K40c with 12 GB GDDR5 RAM disabled ECC, NVCC ver. 7.5

(Software) Ubuntu 14.04 LTS, GCC ver. 4.8.4, IcedTea ver. 2.6.3 OpenJDK 64Bit Server VM
In our experiments, we use gene expression data (GEO: Series GSE19429) consisting of 183 MDS patients and 17 healthy controls where the array spots have been collapsed to G=20,639 unique gene symbols by max pooling ambiguous mappings in the Affymetrix Human Genome U133 Plus 2.0 Array (GEO: Platform GPL570) [19]. We further choose the smallest (H: hallmark, 50 gene sets) and the biggest (C: curated, 4726 gene sets) collection from the Molecular Signatures Database 5.0 [8]. The number of tested permutations ranges from 1,024 up to 1,024^{2} = 1,048,576 samples. Singleprecision runs are executed on the GeForce GTX Titan X and doubleprecision experiments on the Tesla K40c GPU. If not stated otherwise, rapidGSEA and broadGSEA have been configured to read the input data from disk and afterwards to write the full enrichment score table E S(S,π) to the file system in order to ensure fair competition.
Accuracy and compliance of enrichment scores
We have evaluated the compliance of computed enrichment scores between broadGSEA and rapidGSEA using the identity permutation on the 50 gene sets of the Hallmark collection under the difference of classes measure. The deviation of computed enrichment scores between rapidGSEA and broadGSEA comply within six digits for both single and doubleprecision arithmetic (see Fig. 6). Using identical floating point data types the computed scores of both rapidGSEA components, cudaGSEA and ompGSEA, are indistinguishable.
However, a comparison of computed histograms E S(S,π) is more complex due to different implementations of random number generators. Thus, we have approximated the probability density functions (PDFs) of the enrichment score distribution using n=1,024^{2} permutations and \(\sqrt {n} = 1,024\) bins uniformly sampling the interval [−1,1]. Afterwards, the approximate cumulative distribution functions (CDFs) are computed by prefix summation. The maximum absolute difference of approximated CDFs, also know as Kolmogorov distance,
is then determined for each of the 50 gene sets. Note, the Kolmogorov distance is a reasonable choice since it determines the measurement error of the area under the PDF of the enrichment score distribution and thus relates to the error of the estimated pvalue. Figure 7 visualizes the described procedure for one gene set. The minimum/median/maximum absolute deviation between the approximated CDFs produced by rapidGSEA and broadGSEA over the 50 gene sets is given by 0.0005/0.0011/0.0018. When comparing two histograms both computed by broadGSEA with different seeds the same metrics yield 0.0006/0.0011/0.0018. Moreover, in 26 out of 50 cases rapidGSEA produces histograms with a smaller Kolmogorov distances to broadGSEA in contrast to 24 cases where both histograms produced by broadGSEA are more similar. Concluding, the deviations in estimated areas are reasonably small and mainly caused by different samples in permutation space.^{2}
Scaling over multiple cores
We perform a strong scalability test of our ompGSEA implementation over multiple cores of the Xeon CPU. Note, ompGSEA is part of the cudaGSEA binary and can be selected using the cpu flag. The time needed to process the 50 gene sets defined in the H(allmark) collection is measured for a fixed input size of n=16,384 permutations and a variable number of threads. The experiments cover performance measurements for up to ten physical cores each executing a single thread and a hyperthreaded scenario where up to twenty threads are assigned to ten physical cores. When taking measurements on less than ten physical cores we enforce a thread’s affinity using the taskset command in order to avoid rescheduling by the operating system. The obtained runtimes are listed in Table 1 and illustrated in Fig. 8. The first experiment utilizing only physical cores reveals almost linear speedup for ompGSEA with an efficiency of roughly 77 % for ten cores. However, the hyperthreaded variant exhibits slightly superlinear behaviour for up to nine physical cores and an efficiency of 98 % for all cores. Throughout the rest of this paper all reported runtimes of ompGSEA refer to the hyperthreaded ten core scenario running approximately ten times faster than the corresponding singlecore application. Please note that the time for writing the enrichment score table E S(S,π) to disk has been neglected during this benchmark.
Comparison between rapidGSEA and broadGSEA
The execution time of rapidGSEA and broadGSEA is measured on the aforementioned data set over a wide range of permutations (1,024 up to 1,024^{2}) using the Hallmark (H: 50 gene sets) and Curated (C2: 4,725 gene sets) collections. The experiments include parsing of input files, memory transfers over PCIe when using CUDA and writing the enrichment score table E S(S,π) to spinning disk. The obtained runtimes and speedups are listed in Table 2 and illustrated in Figs. 9 and 10. Numbers in square brackets or dashed lines indicate linearly extrapolated runtimes for broadGSEA in loglog space for large amounts of permutations.
Our multithreaded implementation ompGSEA outperforms broadGSEA on both gene set collections (H and C2) by at least one orderofmagnitude. Note, although broadGSEA spawns more than twenty threads the majority remains idle during processing. Therefore, broadGSEA cannot benefit from the additional physical cores of the Xeon processor. The same behaviour can be observed on an Intel i7 i3970X CPU with six physical cores.
Moreover, cudaGSEA outperforms broadGSEA by around two ordersofmagnitude with growing speedups for an increasing number of permutations. This can be explained by the thread occupancy of the used GPUs. Both, the GeForce Titan X and the Tesla K40c can store at once tens of thousands of permutations (roughly 70k/35k in single/doubleprecision) within their 12 GB of RAM. Thus, when probing a small number of permutations the majority of streaming multiprocessors remain idle. Furthermore, the parsing of input files and dumping of results takes several seconds and cannot be parallelized on the GPU.
Conclusions
In this paper, we have introduced rapidGSEA – a software suite consisting of two tools for facilitating permutationbased GSEA: cudaGSEA and ompGSEA. cudaGSEA is a CUDAaccelerated tool using finegrained parallelization schemes on massively parallel architectures while ompGSEA is a coarsegrained multithreaded tool for multicore CPUs. ompGSEA outperforms the stateoftheart implementation of GSEA (broadGSEA) by at least one orderofmagnitude in terms of execution times while providing compliant results. Furthermore, cudaGSEA outperforms broadGSEA by around two ordersofmagnitude. The time for probing 1,048,576 permutations on a gene expression data set consisting of 20,639 unique gene symbols and 200 patients can drastically be reduced from roughly 13 days for broadGSEA to less than two hours using rapidGSEA on a commonly available Tesla K40c GPU in doubleprecision or less than one hour on a GeForce Titan X in singleprecision.
A possible direction of future research in order to further reduce runtimes is the parallelization of GSEA on a compute cluster with multiple GPUs attached to each node. Furthermore, extensions of GSEA to consider graphbased (Gene Graph Enrichment Analysis [20]) or networkbased (Networkbased GSEA [21]) correlations between gene symbols and observed phenotypes have gained increasing attention in recent years. It will be interesting to investigate how the parallelization techniques discussed in this paper can be applied to accelerate these extended enrichment methods.
Availability and requirements
Project name: cudaGSEA Project home page: https://github.com/gravitino/cudaGSEAOperating system(s): Linux Programming language: C++, CUDA, R Other requirements: CUDAcapable GPULicense: GNU LGPL Any restrictions to use by nonacademics: None
Endnotes
^{1} Please note that throughout this manuscript, we use zerobased indexing.
^{2} Individual results for each gene set can be found at the github repository of rapidGSEA.
Abbreviations
 API:

Application programming interface
 CUDA:

Compute unified device architecture
 FWER:

Familywise error rate
 GSEA:

Gene set enrichment analysis
 MPI:

Message passing interface
 PCIe:

Peripheral component interconnect express
 PRNG:

Pseudo random number generator
References
 1
Subramanian, et al.Gene Set Enrichment Analysis: A KnowledgeBased Approach for Interpreting GenomeWide Expression Profiles. Proc Natl Acad Sci. 2005; 102(43):15545–15550. doi:http://dx.doi.org/10.1073/pnas.0506580102.
 2
Hung JH, Yang TH, Hu Z, Weng Z, DeLisi C. Gene Set Enrichment Analysis: Performance Evaluation and Usage Guidelines. Brief. Bioinform. 2012; 13(3):281–91.
 3
Wang X, Cairns MJ. SeqGSEA: a Bioconductor Package for Gene Set Enrichment Analysis of RNASeq Data Integrating Differential Expression and Splicing. Bioinformatics. 2014; 30(12):1777–1779. doi:http://dx.doi.org/10.1093/bioinformatics/btu090.
 4
Zhang L, Gu S, Liu Y, Wang B, Azuaje F. Gene set analysis in the cloud. Bioinformatics. 2012; 28(2):294–5.
 5
Eisen MB, Spellman PT, Brown PO, Botstein D. Cluster analysis and display of genomewide expression patterns. Proc Natl Acad Sci. 1998; 95(25):14863–14868. arxiv http://www.pnas.org/content/95/25/14863.full.pdf. Accessed 1 Apr 2016.
 6
Backes C, Keller A, Kuentzer J, Kneissl B, Comtesse N, Elnakady YA, Müller R, Meese E, Lenhof HP. GeneTrailadvanced gene set enrichment analysis. Nucleic Acids Research. 2007; 35(suppl 2):186–92.
 7
Phipson B, Smyth GK. Permutation Pvalues Should Never Be Zero: Calculating Exact Pvalues When Permutations Are Randomly Drawn. Stat Appl Genet Mol Biol. 2010;9(1), Article 39. http://www.degruyter.com/view/j/sagmb.2010.9.1/sagmb.2010.9.1.1585/sagmb.2010.9.1.1585.xml.
 8
Molecular Signatures Database. Accessed 1 Apr 2016. http://software.broadinstitute.org/gsea/msigdb.
 9
CUB: CUDA Unbound Library. Accessed 1 Apr 2016. https://nvlabs.github.io/cub/.
 10
Kahan W. Pracniques: Further Remarks on Reducing Truncation Errors. Commun. ACM. 1965; 8(1):40–8. doi:http://dx.doi.org/10.1145/363707.363723.
 11
Chan TF, Golub GH, LeVeque RJ. Updating Formulae and a Pairwise Algorithm for Computing Sample Variances, Technical report. Stanford: Stanford University; 1979. http://i.stanford.edu/pub/cstr/reports/cs/tr/79/773/CSTR79773.pdf.
 12
Ruetsch G, Micikevicius P. Optimize Matrix Transpose Technical report. Santa Clara: NVIDIA coporation; 2010. http://docs.nvidia.com/cuda/samples/6_Advanced/transpose/doc/MatrixTranspose.pdf. Accessed 1 Apr 2016.
 13
cuRAND: NVIDIA CUDA Random Number Generation Library. [Accessed 1 Apr 2016. https://developer.nvidia.com/curand].
 14
Marsaglia G, Tsang WW, et al.Some difficulttopass tests of randomness. J Stat Softw. 2002; 7(3):1–9.
 15
dieharder: Random Number Generator Testing Suite. Accessed 1 Apr 2016. https://www.phy.duke.edu/~rgb/General/dieharder.php.
 16
Alcantara DAF. Efficient hash tables on the gpu, PhD thesis. Davis: University of California at Davis; 2011. AAI3482095.
 17
Eddelbuettel D, François R. Rcpp: Seamless R and C++ Integration. J Stat Softw. 2011; 40(8):1–18.
 18
Broad Institute of MIT and Harvard. GSEA Java Package. 2016. http://software.broadinstitute.org/gsea/downloads.jsp. Accessed: 01 April 2016.
 19
Pellagatti, et al.Deregulated Gene Expression Pathways in Myelodysplastic Syndrome Hematopoietic Stem Cells. Leukemia. 2010; 24:756–64.
 20
Geistlinger L, Csaba G, Küffner R, Mulder N, Zimmer R. From sets to graphs: towards a realistic enrichment analysis of transcriptomic systems. Bioinformatics. 2011; 27(13):366–73.
 21
Glaab E, Baudot A, Krasnogor N, Schneider R, Valencia A. Enrichnet: networkbased gene set enrichment analysis. Bioinformatics. 2012; 28(18):451.
Acknowledgements
Partial funding was gratefully provided by the Center for Computational Science in Mainz.
Authors’ contributions
BS and AH conceived the study, and participated in its design and coordination. All authors contributed to the writing of the manuscript. CH wrote and evaluated the CPU and GPU implementations. All authors read and approved the final manuscript.
Competing interests
The authors declare that they have no competing interests.
Consent for publication
Not applicable.
Ethics approval and consent to participate
Throughout this paper the used gene expression data set is anonymized and has been obtained from NCBI Gene Expression Omnibus (GEO: Series GSE19429). The data has exclusively been used for runtime measurements and compliance evaluation of computed enrichment score values between broadGSEA and rapidGSEA. The original source [19] explicitly states approval granted by appropriate ethics committees: ’The study was approved by the ethics committees (Oxford C00.196, Bournemouth 9991/03/E, Duisburg 2283/03, Stockholm 410/03, Pavia 26264/2002) and informed consent was obtained.’
Author information
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License(http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Hundt, C., Hildebrandt, A. & Schmidt, B. rapidGSEA: Speeding up gene set enrichment analysis on multicore CPUs and CUDAenabled GPUs. BMC Bioinformatics 17, 394 (2016). https://doi.org/10.1186/s128590161244x
Received:
Accepted:
Published:
Keywords
 CUDA
 Gene set enrichment analysis
 Gene expression data
 Resampling statistics