Fig. 4

From: rapidGSEA: Speeding up gene set enrichment analysis on multi-core CPUs and CUDA-enabled GPUs

Fine-grained parallelization of Stages 1 and 2. Parallelization of the deviation score computation operating on the transposed data matrix D^T. Each thread block draws a permutation by shuffling the original phenotype label list in shared memory. The threads within a thread block independently accumulate gene transcription differences for each gene symbol identifier (along columns), ensuring coalesced reads from global memory. Finally, the local deviation scores are sorted using the segmented radix sort primitive of CUB.
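
The scheme in the caption can be emulated on the CPU to make the data flow concrete. The following is a minimal sketch, not the rapidGSEA kernel. The function name `deviation_scores`, the use of a difference of class means as the deviation measure, the choice to leave the first permutation unshuffled, and the use of Python's built-in sort in place of CUB's segmented radix sort are all assumptions made for illustration.

```python
import random


def deviation_scores(dt, labels, num_perms, seed=0):
    """Hypothetical CPU emulation of the per-permutation scoring scheme.

    dt        -- transposed data matrix D^T as a list of rows: dt[sample][gene]
    labels    -- binary phenotype label (0 or 1) for each sample
    num_perms -- number of label permutations to evaluate
    Returns one list per permutation of (score, gene_index) pairs,
    sorted by descending score.
    """
    rng = random.Random(seed)
    n_samples = len(dt)
    n_genes = len(dt[0])
    results = []

    # Outer loop: one iteration plays the role of one thread block.
    for perm in range(num_perms):
        # Block-local copy of the label list (the shared-memory analogue).
        shuffled = list(labels)
        if perm > 0:
            # Assumption: permutation 0 keeps the observed labelling.
            rng.shuffle(shuffled)

        # Shuffling preserves the class sizes.
        n1 = sum(shuffled)
        n0 = n_samples - n1

        scores = []
        # Inner loop: one iteration plays the role of one thread (one gene).
        for gene in range(n_genes):
            sum0 = 0.0
            sum1 = 0.0
            # Walk down one column of D^T. In a row-major layout, threads
            # handling neighbouring genes touch neighbouring addresses at
            # the same sample index, which is what permits coalesced reads.
            for sample in range(n_samples):
                if shuffled[sample] == 1:
                    sum1 += dt[sample][gene]
                else:
                    sum0 += dt[sample][gene]
            # Assumed deviation measure: difference of class means.
            scores.append((sum1 / n1 - sum0 / n0, gene))

        # Stand-in for the segmented sort: each permutation is one segment.
        scores.sort(key=lambda pair: pair[0], reverse=True)
        results.append(scores)

    return results
```

In this sketch, each entry of the returned list corresponds to one sorted segment, that is, the gene ranking for one permutation of the phenotype labels. Both classes must be non-empty, otherwise the class means are undefined.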