rapidGSEA: Speeding up gene set enrichment analysis on multi-core CPUs and CUDA-enabled GPUs

Hundt, Christian; Hildebrandt, Andreas; Schmidt, Bertil

doi:10.1186/s12859-016-1244-x

Software
Open access
Published: 23 September 2016

BMC Bioinformatics volume 17, Article number: 394 (2016)

Abstract

Background

Gene Set Enrichment Analysis (GSEA) is a popular method to reveal significant dependencies between predefined sets of gene symbols and observed phenotypes by evaluating the deviation of gene expression values between cases and controls. An established measure of inter-class deviation, the enrichment score, is usually computed using a weighted running sum statistic over the whole set of gene symbols. Due to the lack of analytic expressions the significance of enrichment scores is determined using a non-parametric estimation of their null distribution by permuting the phenotype labels of the probed patients. Accordingly, GSEA is a time-consuming task due to the large number of required permutations to accurately estimate the nominal p-value – a circumstance that is even more pronounced during multiple hypothesis testing since its estimate is lower-bounded by the inverse number of samples in permutation space.

Results

We present rapidGSEA – a software suite consisting of two tools for facilitating permutation-based GSEA: cudaGSEA and ompGSEA. cudaGSEA is a CUDA-accelerated tool using fine-grained parallelization schemes on massively parallel architectures while ompGSEA is a coarse-grained multi-threaded tool for multi-core CPUs. Nominal p-value estimation of 4,725 gene sets on a data set consisting of 20,639 unique gene symbols and 200 patients (183 cases + 17 controls) each probing one million permutations takes 19 hours on a Xeon CPU and less than one hour on a GeForce Titan X GPU while the established GSEA tool from the Broad Institute (broadGSEA) takes roughly 13 days.

Conclusion

cudaGSEA outperforms broadGSEA by around two orders-of-magnitude on a single Tesla K40c or GeForce Titan X GPU. ompGSEA provides around one order-of-magnitude speedup to broadGSEA on a standard Xeon CPU. The rapidGSEA suite is open-source software and can be downloaded at https://github.com/gravitino/cudaGSEA as a standalone application or as a package for the R framework.

Background

High-throughput technologies such as microarray or next-generation sequencing enable researchers to routinely measure the expressions of tens of thousands of genes in many patients. Typically, long lists of interesting candidate genes are generated by subsequent computational analyses. However, interpreting these gene lists is challenging. Recognizing that genes act in concert to drive various biological processes, Gene Set Enrichment Analysis (GSEA) was introduced [1] to summarize genomics data using a predefined gene set. Nowadays, GSEA is a heavily used tool in bioinformatics [2] and has been successfully applied to gain insights into the biological function of diseases such as cancer and diabetes.

However, the GSEA procedure can be highly time-consuming since the significance of a calculated enrichment score is typically tested using a resampling strategy that draws large numbers of permutations. When a whole database of gene sets is used, the number of required permutations is even higher in order to account for multiple hypothesis testing. Furthermore, the size and availability of input data sets continue to increase, driven by advances in high-throughput technologies [3]. Thus, developing fast software solutions is of high importance to research. Previous work on accelerating gene set analysis has been limited to cloud computing [4]. We present the rapidGSEA suite – an efficient parallelization of the GSEA method for commonly available multi-core CPUs and CUDA-enabled GPUs. By using a combination of parallelization techniques we can achieve speedups of one order-of-magnitude on Xeon CPUs and around two orders-of-magnitude on a single GPU compared to broadGSEA.

Implementation

This section is divided into three parts. First, we give a brief explanation of the sequential GSEA algorithm and its four major processing steps for estimating the nominal p-value of a determined enrichment score using a single gene set. Second, we introduce novel parallelization schemes for single and multiple gene set probing and their explicit implementation optimized for multi-core CPUs and CUDA-enabled GPUs. Finally, we describe the usage of our standalone application and the bundled package for the R framework.

The sequential algorithm

The traditional GSEA algorithm operates on a real-valued gene expression matrix D(g_i, p_j) of shape |G|×|P|, where g_i∈G denotes |G| unique gene identifiers and p_j∈P enumerates |P| patient identifiers, each labelled by a binary phenotype L(p_j)∈{0,1} encoding cases and controls. The computation of the enrichment score statistics can be split into four major stages:

Computation of local deviation measures

For each gene symbol g_i (each row of D) a local deviation score Δ(g_i) is computed that encodes the inter-class deviation between cases and controls. As an example, the difference of means between both classes can be employed to express their variability per gene:

$$\begin{array}{*{20}l} \Delta(g_{i}) &= \mu_{i}^{(1)} - \mu_{i}^{(0)} \\ \mu_{i}^{(1)} &= \sum\limits_{j=0}^{|P|-1} \frac{L(p_{j})}{m^{(1)}} D(g_{i}, p_{j})\\ \mu_{i}^{(0)} &= \sum\limits_{j=0}^{|P|-1} \frac{1-L(p_{j})}{m^{(0)}} D(g_{i}, p_{j}) \end{array} $$

where $m^{(1)} = \sum _{j=0}^{|P|-1} L(p_{j})$ and $m^{(0)} = |P| - m^{(1)}$ denote the number of patients in each class from the set {0,1}. Variations that combine intra-class means and standard deviations e.g.

$$ \begin{aligned} \begin{array}{lll} \text{fold change:} &\Delta(g_{i}) =\frac{\mu_{i}^{(1)}-\mu_{i}^{(0)}}{\sigma_{i}^{(1)}+\sigma_{i}^{(0)}} &\text{,} \\ \text{t-test:} &\Delta(g_{i}) = \frac{\mu_{i}^{(1)}-\mu_{i}^{(0)}}{\sqrt{\left(\sigma_{i}^{(1)}\right)^{2}+\left(\sigma_{i}^{(0)}\right)^{2}}}& \end{array} \end{aligned} $$

(1)

are common choices for Δ in GSEA implementations. Please note that extensions from binary to real-valued phenotype profiles $L(p_{j}) \in \mathbb {R}$ using Euclidean distance, Pearson’s product-moment or Spearman’s rank-order correlation coefficient are straightforward [1] and thus will not be discussed further in this paper.
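The difference-of-means measure defined above can be sketched in a few lines of plain Python. This is a minimal illustration, not the rapidGSEA implementation: the function name `difference_of_means`, the list-of-rows layout of D, and the toy values are assumptions made for this example.

```python
def difference_of_means(D, labels):
    """Return Delta(g_i) = mu_i^(1) - mu_i^(0) for every row (gene) of D.

    D      -- list of rows, one per gene, each holding one value per patient
    labels -- list of 0/1 phenotype labels, one per patient (column of D)
    """
    m1 = sum(labels)             # number of patients with label 1
    m0 = len(labels) - m1        # number of patients with label 0
    if m1 == 0 or m0 == 0:
        raise ValueError("both phenotype classes need at least one patient")

    deltas = []
    for row in D:
        mu1 = sum(x for x, l in zip(row, labels) if l == 1) / m1
        mu0 = sum(x for x, l in zip(row, labels) if l == 0) / m0
        deltas.append(mu1 - mu0)
    return deltas


# Toy data (hypothetical): 3 genes, 4 patients, the last two labelled 1.
D = [[1.0, 2.0, 5.0, 6.0],
     [4.0, 4.0, 4.0, 4.0],
     [9.0, 7.0, 1.0, 3.0]]
labels = [0, 0, 1, 1]
deltas = difference_of_means(D, labels)
print(deltas)
```

A positive value marks a gene with a higher mean in the class labelled 1, a negative value one with a higher mean in the class labelled 0.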

Gene ranking

After computation of the local deviations, the indices i∈{0,…,|G|−1} enumerating the gene symbols g_i are reordered such that

$$\begin{array}{*{20}l} \left(\Delta\left(g_{\sigma(0)}\right), \dots, \Delta\left(g_{\sigma(i)}\right), \dots\Delta\left(g_{\sigma(|G|-1)}\right)\right) \end{array} $$

is a sorted (usually descending) sequence of local deviation scores. The sequence of reordered gene symbols g_{σ(i)} is called gene ranking according to Δ and will later be used to determine the enrichment score statistic. Figure 1 illustrates the first and second stage of the GSEA algorithm.
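The ranking step amounts to an argsort of the local deviation scores. The following sketch assumes a descending order and uses a hypothetical helper name `rank_genes`; it returns the permutation σ as a list of gene indices.

```python
def rank_genes(deltas):
    """Return sigma, the gene indices ordered by descending Delta."""
    return sorted(range(len(deltas)), key=lambda i: deltas[i], reverse=True)


# Hypothetical local deviation scores for four genes.
deltas = [4.0, 0.0, -6.0, 2.5]
sigma = rank_genes(deltas)
ranked_scores = [deltas[i] for i in sigma]
print(sigma, ranked_scores)
```

Keeping σ as an index list rather than sorting the scores in place preserves the link between each rank and its gene identifier, which the enrichment score needs for its membership test.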

Enrichment score computation

To elucidate significant differences in gene regulation across different phenotypes, it is generally insufficient to consider transcription differences Δ(g_{σ(i)}) individually. Each gene can be significantly up- or down-regulated by chance alone, or through correlation with processes such as the cell cycle. In principle, information can be gained from clustering genes according to their regulation [5]. Interpretation of the resulting clusters, however, is often unclear. Instead, prior information about gene classes that are assumed to behave in a correlated manner (e.g. genes on a regulatory pathway) is used in the analysis. Today, this is typically achieved through the framework of GSEA, which considers the significance of the transcription profile of a set of gene symbols S⊂G as a whole as opposed to individual enrichment values.

Let S be a gene set supposedly correlated to the observed phenotypes and σ(i) the aforementioned reordering of gene symbols. The enrichment score ES(S) is then determined as the maximal amplitude of a weighted running sum statistic ρ(k)∈[−1,1]:

$$\begin{array}{*{20}l} ES(S) &= \rho\left(\mathop{\text{argmax}}_{k} |\rho(k)| \right) \ \ \quad\text{where} \\ \rho(k) &= \sum\limits_{i=0}^{k} \left\{ \begin{array}{lll} \frac{1}{\alpha} \cdot |\Delta(g_{\sigma(i)})|^{q} & \text{if} & g_{\sigma(i)} \in S \\ - \frac{1}{\beta} & \text{if} & g_{\sigma(i)} \notin S \end{array} \right. \end{array} $$

with precomputed constants $\alpha = \sum _{g \in S} |\Delta (g)|^{q}$ and β=|G|−|S|. The exponent q≥0 is usually chosen from the set $\{0, 1, \tfrac {3}{2}, 2\}$ and controls the leverage of the weights |Δ(g_{σ(i)})|. Please note that the special case q=0 is the well-known Kolmogorov-Smirnov statistic [1]. Figure 2 illustrates an example for the linear-weighted (q=1) computation of ES(S) using a toy data set.
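The running sum can be sketched as a single pass over the ranked genes. The code below is an illustrative reading of the formula above rather than the authors' implementation; the function name `enrichment_score`, the representation of S as a Python set of gene indices, and the toy values are assumptions.

```python
def enrichment_score(deltas, sigma, gene_set, q=1.0):
    """Return ES(S): the running-sum value rho(k) of largest magnitude.

    deltas   -- local deviation Delta(g) per gene index
    sigma    -- gene indices in ranked order
    gene_set -- set S of gene indices
    q        -- exponent controlling the weight of |Delta|
    """
    alpha = sum(abs(deltas[g]) ** q for g in gene_set)   # normalizes the hits
    beta = len(deltas) - len(gene_set)                   # normalizes the misses
    if alpha == 0 or beta == 0:
        raise ValueError("degenerate gene set: alpha and beta must be non-zero")

    rho, best = 0.0, 0.0
    for g in sigma:
        if g in gene_set:
            rho += abs(deltas[g]) ** q / alpha   # step up on a member of S
        else:
            rho -= 1.0 / beta                    # step down otherwise
        if abs(rho) > abs(best):                 # track the maximal amplitude
            best = rho
    return best


# Hypothetical scores and ranking for four genes.
deltas = [4.0, 0.0, -6.0, 2.5]
sigma = [0, 3, 1, 2]

es_top = enrichment_score(deltas, sigma, {0, 3})   # S occupies the top ranks
es_bottom = enrichment_score(deltas, sigma, {2})   # S occupies the last rank
print(es_top, es_bottom)
```

In this toy case the two gene sets sit entirely at the top and at the bottom of the ranking, so the scores come out close to +1 and −1 respectively, up to floating-point rounding.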

Significance estimation

Similar to Pearson’s correlation coefficient the enrichment score takes values in the interval [−1,1] with |ES(S)|=1 indicating perfect (anti-)correlation and |ES(S)|≈0 implying no dependency between S and the observed phenotypes in terms of the used deviation measure. When ES(S)=±1 all gene symbols g∈S are situated at the top/bottom of the ranked gene list. In contrast, small values are observed if the gene symbols g∈S are scattered over the index domain and thus are unlikely to explain the phenotype distribution.

ES values have no intrinsic significance, though. A value of ES(S)=0.857, as computed in our toy model in Fig. 2, might correspond to a high or low significance, depending on the probability of arriving at such a value by chance alone. Unfortunately, closed forms for the statistical distribution of enrichment scores are inaccessible. Therefore, p-values are typically estimated by sampling the null distribution using permutations of the phenotype labels. Please note that while some GSEA implementations allow permuting gene identifiers instead of phenotype labels [1, 6] to estimate the null distribution, phenotype permutation is often considered the more appropriate choice – genes are expected to feature statistical dependencies within a single patient, while probes gained from distinct patients are less likely to do so. Hence, in the following we only consider phenotype permutation.

Figure 3 depicts the enrichment score computation for a permutation π=(1 4) of the original list of six patients where the columns 1 and 4 of D have been swapped.¹ The resulting score ES(S,π)=0.457<0.857=ES(S) suggests that the original value is considerably higher than a randomly sampled one. An exact computation of the p-value – due to absent closed forms for their distribution – would require us to calculate ES(S,π) for all |P|! permutations and finally determine the proportion of values that are at least as extreme as ES(S). Since |P|! is too large even for a moderate number of patients, GSEA implementations usually estimate p-values by sampling in the space of permutations.

When probing more than one gene set at once, p-value estimates have to be adjusted for multiple hypothesis testing. As an example, Bonferroni-corrected acceptance levels and family-wise error rates (FWER) are frequently used criteria to evaluate the significance of enrichment scores. The need for a large number of samples in the space of permutations is even more pronounced during multiple hypothesis testing: let e∈Π be the identity permutation in the set of n tested permutations Π. Then the p-value estimate for a fixed gene set S is strictly positive [7] and lower-bounded by the inverse sample size:

$$\begin{array}{*{20}l} \hat p_{S} = \frac{1}{n}\sum\limits_{\pi\in\Pi} \left(|ES(S, \pi)| \geq |ES(S, e)| \right) \geq \frac{1}{n} \end{array} $$
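The estimator above can be sketched as a simple Monte Carlo loop over shuffled phenotype labels. This is a minimal illustration under stated assumptions: `estimate_p_value` is a hypothetical helper, `score` stands in for any function mapping a label vector to ES(S, π), and `toy_score` is a difference of means on one made-up gene rather than a real enrichment score. Shuffling the labels is used here as an equivalent of permuting the patient columns of D.

```python
import random


def estimate_p_value(score, labels, n, seed=0):
    """Estimate the nominal p-value from n permutations, identity included."""
    rng = random.Random(seed)
    observed = abs(score(labels))
    hits = 1                      # the identity permutation e always counts
    for _ in range(n - 1):
        shuffled = labels[:]      # copy so the original labelling is kept
        rng.shuffle(shuffled)
        if abs(score(shuffled)) >= observed:
            hits += 1
    return hits / n               # never below 1/n because of the identity


def toy_score(labels):
    """Stand-in statistic (assumption): difference of means on one gene."""
    values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
    cases = [v for v, l in zip(values, labels) if l == 1]
    controls = [v for v, l in zip(values, labels) if l == 0]
    return sum(cases) / len(cases) - sum(controls) / len(controls)


labels = [0, 0, 0, 1, 1, 1]
p_hat = estimate_p_value(toy_score, labels, n=1000)
print(p_hat)
```

Counting the identity permutation first makes the lower bound of 1/n explicit: even if no shuffled labelling reaches the observed score, the estimate cannot drop to zero.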

The Molecular Signatures Database v5.1 [8] contains more than 13,000 gene sets divided into eight major collections. Thus, when testing all gene sets at a Bonferroni-adjusted significance level of $\alpha = \frac {0.01}{13,000}$ we have to probe more than 1,300,000 permutations in order to allow the result $\hat p_{S} < \alpha $. For the rest of the paper, we focus on the efficient computation of the enrichment score table ES(S,π) since p-value estimates and other statistics such as FWER can be determined using its entries in a post-processing phase.
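As a short worked step, and assuming exactly 13,000 gene sets as in the Bonferroni example, the required number of permutations follows from combining the lower bound on the estimate with the adjusted significance level:

$$\begin{array}{*{20}l} \frac{1}{n} \leq \hat p_{S} < \alpha = \frac{0.01}{13{,}000} \quad\Longrightarrow\quad n > \frac{13{,}000}{0.01} = 1{,}300{,}000 \end{array} $$

The threshold grows proportionally with the number of tested gene sets and inversely with the unadjusted significance level.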

The parallel algorithm

GSEA can be parallelized using coarse-grained computation schemes such as assigning threads to each permutation π or gene set S since all entries in ES(S,π) can be processed independently. This approach is used in our multi-threaded shared memory implementation of GSEA (ompGSEA): the set of n probed permutations is split into m partitions, each of approximate size $\frac {n}{m}$, and afterwards m threads independently operate on the individual chunks. This can easily be achieved in shared memory architectures using OpenMP pragmas. Moreover, extensions to distributed memory architectures using the Message Passing Interface (MPI) are conceivable.
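The partitioning idea can be sketched with a thread pool. This is a Python stand-in for the OpenMP scheme described above, not the ompGSEA code: the helper names `split_evenly`, `score_chunk` and `parallel_scores`, the per-permutation seeding, and the `toy_score` statistic are all assumptions made for illustration. CPython threads do not speed up CPU-bound work, so the sketch shows only how the n permutations are divided among m independent workers.

```python
import random
from concurrent.futures import ThreadPoolExecutor


def split_evenly(n, m):
    """Split indices 0..n-1 into m (start, stop) ranges of near-equal size."""
    base, extra = divmod(n, m)
    bounds, start = [], 0
    for t in range(m):
        stop = start + base + (1 if t < extra else 0)
        bounds.append((start, stop))
        start = stop
    return bounds


def score_chunk(score, labels, start, stop, seed):
    """Score permutations start..stop-1; index 0 is the identity."""
    out = []
    for k in range(start, stop):
        shuffled = labels[:]
        if k > 0:
            # Seeding per permutation makes results independent of m.
            random.Random(seed + k).shuffle(shuffled)
        out.append(score(shuffled))
    return out


def parallel_scores(score, labels, n, m, seed=0):
    """Evaluate n permutations with m workers, one chunk per worker."""
    with ThreadPoolExecutor(max_workers=m) as pool:
        futures = [pool.submit(score_chunk, score, labels, a, b, seed)
                   for a, b in split_evenly(n, m)]
        return [s for f in futures for s in f.result()]


def toy_score(labels):
    """Stand-in statistic (assumption): difference of means on one gene."""
    values = [1.0, 2.0, 3.0, 10.0, 11.0, 12.0]
    cases = [v for v, l in zip(values, labels) if l == 1]
    controls = [v for v, l in zip(values, labels) if l == 0]
    return sum(cases) / len(cases) - sum(controls) / len(controls)


labels = [0, 0, 0, 1, 1, 1]
scores = parallel_scores(toy_score, labels, n=10, m=3)
print(split_evenly(10, 3), scores)
```

Because no chunk reads or writes data owned by another chunk, the workers need no synchronization beyond collecting their results, which is the property that makes the coarse-grained scheme straightforward.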

However, CUDA-enabled accelerators can maintain up to several thousands of threads (e.g. Titan X/Tesla K40c: 3,072/2,880 cores) but only exhibit a limited amount of RAM (both GPUs provide 12 GB). As a result, fine-grained computation schemes that parallelize the aforementioned building blocks of the GSEA algorithm have to be employed to exploit the full compute capabilities of CUDA-enabled accelerators. In the following, we will present the fine-grained parallelization scheme for each processing stage separately.