Accelerating pairwise statistical significance estimation for local alignment by harvesting GPU's power
© Zhang et al.; licensee BioMed Central Ltd. 2012
Published: 12 April 2012
Skip to main content
© Zhang et al.; licensee BioMed Central Ltd. 2012
Published: 12 April 2012
Pairwise statistical significance has been recognized to be able to accurately identify related sequences, which is a very important cornerstone procedure in numerous bioinformatics applications. However, it is both computationally and data intensive, which poses a big challenge in terms of performance and scalability.
We present a GPU implementation to accelerate pairwise statistical significance estimation of local sequence alignment using standard substitution matrices. By carefully studying the algorithm's data access characteristics, we developed a tile-based scheme that can produce a contiguous data access in the GPU global memory and sustain a large number of threads to achieve a high GPU occupancy. We further extend the parallelization technique to estimate pairwise statistical significance using position-specific substitution matrices, which has earlier demonstrated significantly better sequence comparison accuracy than using standard substitution matrices. The implementation is also extended to take advantage of dual-GPUs. We observe end-to-end speedups of nearly 250 (370) × using single-GPU Tesla C2050 GPU (dual-Tesla C2050) over the CPU implementation using Intel© Core™i7 CPU 920 processor.
Harvesting the high performance of modern GPUs is a promising approach to accelerate pairwise statistical significance estimation for local sequence alignment.
The past decades have witnessed dramatically increasing trends in the quantity and variety of publicly available genomic and proteomic sequence data. Dealing with the massive data and making sense of them are big challenges in bioinformatics [1, 2]. One of the most widely used procedures for extracting information from proteomic and genomic data is pairwise sequence alignment (PSA). Given two sequences, PSA finds the extent of similarity between them. Many bioinformatics applications have been developed based on pairwise sequence alignment, such as BLAST , PSI-BLAST [4–6], and FASTA . PSA produces a score for an alignment as a measure of the similarity between two sequences. Generally, the higher the score, the more related the sequences. However, the alignment score depends on various factors such as alignment methods, scoring schemes, sequence lengths, and sequence compositions [8, 9]. Judging the relationship between two sequences solely based on the scores can often lead to wrong conclusion. Therefore, it is more appropriate to measure the quality of PSA using the statistical significance of the score rather than the score itself [10, 11]. Statistical significance of sequence alignment scores is very important to know whether an observed sequence similarity could imply a functional or evolutionary link, or is a chance event [8, 12]. Accurate estimation of statistical significance of gapped sequence alignment has attracted a lot of research in recent years [13–26].
where λ and K are calculational constants and E(x), also known as E-value, is the expected number of distinct local alignments with score values of at least x.
Note that the above distribution is for a gapless alignment. For the cases of gapped alignment, although no asymptotic score distribution has yet been established analytically, computational experiments strongly indicate these scores still roughly follow Gumbel law [8, 13, 30].
In addition to not needing a database to estimate the statistical significance of an alignment, pairwise statistical significance is shown to be more accurate than database statistical significance reported by popular database search programs like BLAST, PSI-BLAST, and SSEARCH . However, it involves thousands of such permutations and alignments, which are enormously time consuming and can be impractical for estimating pairwise statistical significance of a large number of sequence pairs. Hence, use of high-performance computing (HPC) techniques (such as multi-cores CPU, many-core graphics processing units (GPUs), FPGAs, etc.) is highly conducive to accelerate the computation of PSSE. Moreover, large data sets demand more computing power. In many demanding bioinformatics applications, such as sequence alignment [31, 32], and protein sequence database search [33, 34], many-core GPU has demonstrated its extreme computing power. This strongly motivates the use of GPUs to accelerate the PSSE. Acceleration of PSSE using MPI  and FPGA  has been explored earlier, and in this work, we design a GPU implementation for the same. Compared to [35, 36], we consider the estimation of multi-pair PSS along with single-pair PSS using GPUs.
We present a GPU implementation to accelerate pairwise statistical significance estimation of local sequence alignment using standard substitution matrices. Our parallel implementation makes use of CUDA (Compute Unified Device Architecture) parallel programming paradigm. Our design uses an efficient data reorganization method to produce coalesced global memory access, and a tiled-based memory scheme to increase the GPU occupancy, a key measure for GPU efficiency . Through careful analysis of the data dependency and access patterns of PSSE, we reorganize the data sequences into aligned arrays that coalesce the global memory access pattern. Such data access contiguity keeps the GPU cores occupied with computation and allows the thread scheduler to overlap the global memory access with the computation. In addition to the ability to calculate the optimal tile size for data to be shipped to the GPU, our design can also issue a large enough number of threads to maximize the occupancy.
We further extend the parallelization technique to estimate pairwise statistical significance using position-specific substitution matrices, which has earlier demonstrated significantly better sequence comparison accuracy than using standard substitution matrices . The implementation is also extended to take advantage of dual-GPUs to accelerate those computations. As a result, maximum performance could be obtained by harvesting the power of many-core GPUs.
The performance evaluation was carried out on NVIDIA Telsa C2050 GPU. For multi-pair PSSE implementation, we observe nearly 250 (370)× speedups using a single-GPU Tesla C2050 GPU (dual-Tesla C2050) over the CPU implementation using an Intel© Core ™i7 CPU 920 processor. The proposed optimizations and efficient framework for PSSE, which have direct applicability to speedup homology detection, are also applicable to many other sequence comparison based applications, such as DNA sequence mapping, phylogenetic tree construction and database search especially in the era of next-generation sequencing.
In this section we present GPU implementations both for single-pair PSSE and multi-pair PSSE. Along with the methodological details, we also discuss several performance optimization techniques used in this paper, such as, GPU memory access optimization, occupancy maximization, and substitution matrix customization. These techniques have significantly speeded up PSSE.
Careful analysis of the data pipelines of PSSE shows that the computation of PSSE can be decomposed into three computation kernels: Permutation, Alignment and Fitting. Permutation and Alignment comprise the overwhelming majority (a more than 99.8%) of the overall execution time . Therefore, efforts should be spent to optimize these two kernels to achieve high performance. Also, we observe that permutation presents high degrees of data independency that are naturally suitable for single-instruction, multiple-thread (SIMT) architectures  and therefore, can be mapped very well to task parallelism models of GPU. Moreover, even though the alignment task suffers from data dependency, we show that with clever optimizations, it can be heavily accelerated using GPUs.
It is especially important to optimize global memory access as its bandwidth is low and its latency is hundreds of clock cycles . Moreover, global memory coalescing is the most critical optimization for GPU programming . Since the kernels of PSSE usually work over large numbers of sequences that reside in the global memory, the performance is highly dependent on hiding memory latency. When a GPU kernel is accessing global memory, all threads in groups of 32 (i.e. warp) access a bank of memory at one time. A batch of memory accesses is considered coalesced when the data requested by a warp of threads are located in contiguous memory addresses. For example, if the data requested by threads within a warp are located in 32 consecutive memory addresses (such that the i th address is accessed by the i th thread), the memory can be read in a single access. Hence, this memory access operation runs 32 times faster. If the memory access is not coalesced, it is divided into multiple reads and hence serialized .
Considering inter-task parallelism, where each thread works on the alignment of one of the permuted copies of s 2 to s 1, in this layout the gap between the memory accesses by the neighboring threads is at least the length of the sequence. For example, in the intuitive layout, if thread T 0 accesses the first residue (i.e., 'R'), and thread T 1 accesses the first residue (i.e., 'E'), the gap between the access data is n. This results in non-coalesced memory reads (i.e., serialized reads), which significantly deteriorates the performance.
We therefore reorganize the layout of sequence data in memory to obtain coalesced reads. Now, to achieve coalesced access, we reorganize layout of sequences in memory as aligned structure of arrays, as shown in Figure 1 (b) In the optimized layout, the characters (in granularity of 4 bytes) that lie at the same index in different permuted sequences stay at neighboring positions. Then if the first uchar4 of the first permuted sequence (i.e. 'REGN') is requested by thread T 0, the first uchar4 of the second permuted sequence (i.e. 'ARNE') is requested by T 1, and so on. This results in reading a consecutive memory (each thread reads 4 bytes) by a warp of threads in a single access. Thus the global memory access is coalesced, and therefore high performance is achieved.
As the sequences remain unchanged during the alignment, they can be thought of as read-only data, which can be bound to texture memory. For read patterns, texture memory fetches is a better alternative to global memory reads because of texture memory cache, which can further improve the performance.
where T max is maximum number of resident threads that can be launched on a streaming multiprocessor (SM) (which is a constant for a specific GPU), T num is the number of active threads per block and B is the number of active blocks per Streaming Multiprocessor (SM).
Based on the above analysis, higher occupancy can be pursued according to the following rules:
To avoid wasting computation on under-populated warps, the number of threads per block should be chosen as a multiple of the warp size (currently, 32).
Total number of active threads per SM should be close to maximum number of resident threads per multiprocessor (T max ). In short, we should use all the available threads if possible.
To achieve 100% occupancy, (B hw × T num )/T max ≥ 1. Hence, setting T num such that T num ≥ T max /B hw is preferable.
By contrast, position-specific score matrix (PSSM)  offers a variation of this approach. We call this a position specific query profile in which scores are further refined. In this case the same residue (e.g. 'A') appearing at different position of query sequence has different scores, as shown in Figure 2 (b). For a given query sequence, PSSM can be pre-constructed by PSI-BLAST [3, 4].
Using the customized approach, in both cases, the dimension |ω| × |ω| of the substitution matrix is replaced by a query-specific matrix of dimension |ω|×|L|. This increases the memory requirement compared to the original layout, but reduces the lookups of the substitution matrix significantly as explained below.
In traditional matrix layout, if the substitution scores of a subject sequence residue (i.e., 'A') against query sequence residues (i.e., 'A','K','L','G') are required subsequently, GPU has to look up the matrix four times one by one, because the score S[A][A], S[A][K], S[A][L], S[A][G] are usually stored far from each other. In contrast, if we use CUDA built-in vector data type int4 (which packs 4 integers together) to store the customized substitution matrix in the texture memory, the same four scores are stored at neighboring memory locations (as shown in Figure 2 (b). By reading an int4, the GPU can get these four scores S[A][i] in one instruction, where i is the position of residues in query sequence. This reduces the number of lookups by a factor of 4, therefore, a higher performance can be obtained.
In essence, both coalesced memory layout and customized substitution matrix optimization heighten performance of memory access by improving data locality among threads in GPU programming.
As mentioned previously, to simulate the required random sequences, a lot of random numbers are needed, since each s 2 copy has to be permuted thousands of times. Although the most widely used pseudorandom number generators such as linear congruential generators (LCGs) can meet our requirements, the current version of CUDA does not support calls to the host random function. Hence, we develop an efficient random number generator similar to lrand48() on GPUs.
The single-pair PSSE processes only one pair of query and subject sequences. The idea of computing single-pair PSSE is as follows. Given the query sequence s 1 and the subject sequence s 2, to compute PSSE, thousands (say N) of randomly permutations of s 2 are needed. To obtain these N random sequences of s 2, first, a set of N random numbers is generated in CPU as described in previous section. These numbers are then transferred to GPU and considered as seeds by the threads of GPU. Each thread then generates new random numbers using its own seed and swap the symbols of s 2 accordingly to obtain a permuted sequence. Thus the N random permutations of s 2 are obtained in parallel. The algorithm then uses SW algorithm to compute alignment score of s 1 and the N permuted copies of s 2 in parallel on GPU. The scores are then transferred to CPU for fitting. The details of computing single-pair PSSE is outlined by Algorithm 1. Note that Step 4 in Algorithm 1 uses the optimized memory layout, explained previously in detail, for s 2 and its N permuted copies.
Smith-Waterman is a dynamic programming algorithm to identify the optimal local alignment between a pair of sequences. In general, there are two different methods for parallelizing the alignment task . The first method is regarded as inter-task parallelism. In this case, each thread performs alignment of one pair of sequences. Hence, in a thread block, multiple alignment tasks are performed in parallel . The second one is intra-task parallelism. Here, alignment of each pair of sequences is assigned to a block of threads, splitting the whole task into a number of sub-tasks. Each thread in the thread block then performs its own sub-tasks, cooperating to exploit the parallel characteristics of cells in the anti-diagonals of the local alignment matrix . We use a similar alignment kernel as proposed in  with some modifications for further improvement. It is worthwhile to mention that Step 7 describes fitting, which is implemented on the CPU. This is because it involves recursion that is not supported very well on GPU.
The multi-pair PSSE processes multiple queries and subject sequences. We can obtain some hints about improving the performance by analyzing the single-pair PSSE implementation (which we do in a later section). Owing to a dramatic increase of data set for multiple-pairs implementation, there are some significant differences relative to performance of GPU hardware between the two implementations. Through analyzing our experimental results of the single-pair PSSE, we observe that inter-task parallelism performs better than intra-task parallelism (results shown later). Hence, we consider inter-task parallelism in computing multi-pair PSSE. Based on the guidelines for optimizing memory and occupancy described earlier, we compare three implementation strategies.
Algorithm 1: Pseudo-code of single-pair PSSE
Input: (s 1, s 2) - Sequence-pair; M - Substitution matrix; G - Gap opening penalty; GE - Gap extension penalty; N - Number of permutes;
Generate a number N of random numbers in CPU;
Copy LCG seeds to GPU global memory;
Copy s 1 and s 2 to GPU global memory;
Generate random numbers (RNs) using LCG in GPU
Permute sequence s 2 N times using the RNs
Reorganize sequences s 2 and its N times permuted copies as shown in Figure 2 (b) if using inter-task parallel Smith-Waterman();
Align s 2 and its N permuted copies against s 1 using Smith-Waterman() (inter-task or intra-task);
Transfer N alignment scores from GPU into CPU;
(K, λ) ← EV DCensordMLFit(Scores);
pss ← 1 - exp(-Kmne - λx );
Given Q query sequences and S subject sequences, the intuitive strategy is to simply perform the same 'single-pair' procedure Q × S times. In other words, in each iteration, we send a single pair of query and subject sequences to the GPU. The GPU processes that pair and returns the result to the CPU. The same procedure is repeated for all query and subject sequence pairs. However, this strategy suffers from low occupancy. We analyze the cause along with its performance results in the next section.
In the first strategy, the subject sequences from the database are permuted for every query-subject sequence pair. Hence, the same subject sequence is permuted every time it is sent to the GPU. A better strategy is to create permutations of each subject sequence only once and reuse them to align with all the queries. "One permutation, all queries" is the idea of the second strategy. Because of the reuse of permuted sequences, higher performance is expected than the first strategy. However, the occupancy of GPU, to be shown in the next section, is still not elevated.
The low occupancy of the above two strategies is due to the underutilized computing power of GPU. In addition, these two strategies do not work well when the size of subject sequence database becomes too big to be fitted into GPU global memory. For instance, if the size of the original subject sequence database is 5 MB, it becomes 5000 MB when each of the sequences is permuted, say, 1000 times. This prohibits transfer of all the subject sequences to GPU at the same time. We therefore need an optimal number of subject sequences to be shipped to GPU keeping in mind that the subject sequences and their permuted copies fit in global memory. Moreover, the number of new generated sequences should be enough to keep all CUDA cores busy, i.e., keep a high occupancy of GPU, which is very important to harness the GPU power.
where SM num is the total number of SMs in GPU, T max is the maximum number of resident threads per SM, and N is the number of permutations.
All our experiments have been are carried out using Intel© Core™i7 CPU 920 processors running @2.67 GHz. The system has 4 cores, 4 GB of memory, dual Tesla C2050 GPU (each with 448 CUDA cores) and is running a 64-bit Linux-based operating system. Our program has been compiled using gcc 4.4.1 and CUDA 4.0. The sequence data used in this work comprises of a non-redundant subset of the CATH 2.3 database . This dataset consists of 2771 domain sequences as our subject library and includes 86 CATH queries as our query set. We derive PSSMs for the 86 test queries against the non-redundant protein database using PSI-BLAST (provided along with the BLAST+ package ) over a maximum of five iterations and with other default parameters. BLOSUM62 and PSSMs have been used as the scoring matrices with affine gap penalty of 10 + 2k for a gap of length k. We permute the 2771 subject sequences N = 1000 time as in several previous studies [21, 35, 36, 45].
For intra-task parallel implementation, the best speedups for sequences of length 200, 400, and 800 are 24.57×, 35.09×, and 43.63×, respectively. All these are obtained using 128 threads per block. But the best speedup for sequences of length 1600 is 46.34× and used 256 threads per block. These results tell us that it is hard to find a general rule to parameterize the settings to achieve peak performance. A possible reason is that many factors, such as the number of threads per block, the available registers, and shared memory, may contradict with each other.
The intra-task parallel implementation creates enough thread blocks (one block for each alignment task) to keep the occupancy high. However, the SW alignment matrix must be serially computed from the first to the last anti-diagonal. Only the cells of the alignment matrix belonging to the same anti-diagonal can be computed in parallel. In this case, most threads in a block have no work to do. As a result, the performance of the intra-task parallel implementation is worse than that of the inter-task parallel implementation even though it has higher occupancy.
In contrast, inter-task parallelism would have a total of 1000 threads (one for each alignment task). If each block contains 64 threads, then the total number of blocks is B = ⌈1000/64⌉ = 16. The assignment of 16 blocks to 14 available SMs will result in 12 SMs with one active block and two SMs with two blocks. Therefore, the occupancy for the two SMs with two blocks is (2 × 64)/1024 = 12.5%. For the 12 SMs with only one block its occupancy is (1 × 64)/1024 = 6.25%. Because of the low occupancy, there are not sufficient threads to keep CUDA cores busy when the global memory is accessed. As a result, the latency-hiding capabilities of this method are limited. As the number of threads per block T num increases, the total active blocks (B) decreases and some SMs even have no blocks assigned. For example, if T num = 512, then B = ⌈1000/512⌉ = 2. Hence only two SMs are working with one block each.
However, these SMs have a higher occupancy (1 × 512)/1024 = 50%, which compensates for the decrease in the number of working SMs. Consequently, even though there is a reduction in speedups, it is not directly proportional to the reduction in the number of active SMs. When the number of threads per block is 64, for the sequence length of 200, 400, 800, and 1600, the best speedups are 52.14×, 56.36×, 73.39×, and 73.16×, respectively.
In brief, getting performance out of a GPU is about keeping the CUDA cores busy. Both inter-task parallelism and intra-task parallelism, under the situations of small data set being processed, seem to fail in this regard, but not necessarily for the same reason.
The maximum occupancy for three strategies
Due to a high occupancy, the tile-based strategy using single (dual) GPU(s) achieves speedups of 240.31 (338.63)×, 243.51 (361.63)×, 240.71 (363.99)×, and 243.84 (369.95)× using 64, 128, 256, and 512 threads per block, respectively, as shown in Figure 5.
The billion cell updates per second (GCUPS) value is another commonly used performance measure in bioinformatics . The tile-based strategy using single (dual) GPU(s) achieves performance results in the range of 16.55 (23.34) to 16.79 (25.50) GCUPS, as shown in Figure 5. Although a direct comparison across different GPU implementations and hardware in not fair, just for the sake of completeness, we cited below the performance in GCUPS reported by some existing implementations of the Smith-Waterman alignment task. It is worthwhile to mention here that  has a peak performance of 4.65 to 8.99 GCUPS for various query lengths on an NVIDIA 9800 and  has a peak performance of 3.5 GCUPS on a workstation running two GeForce 8800 GTX. Our implementation show higher GCUPS.
In summary, low occupancy is known to interfere with the ability to hide latency on memory-bound kernels, causing performance degradation. However, increasing occupancy does not necessarily increase performance. In general, once a 50% occupancy is achieved, further optimization to gain additional occupancy has little effect on performance . Our experiments verify this claim.
Since the GPU implementation presented in this paper uses the same algorithm for PSSE (specifically the Smith-Waterman algorithm for getting alignment scores, the same fitting routine to get statistical parameters K and λ, and the same algorithm parameters) as in , the retrieval accuracy of the proposed implementation is expected to be identical to . According to , pairwise statistical significance with standard substitution matrices performs at least comparable or significantly better than database statistical significance (using BLAST, PSI-BLAST, and SSEARCH). Moreover, pairwise statistical significance with PSSMs performs significantly better than using standard substitution matrices, and also better than PSI-BLAST using pre-constructed position-specific substitution matrices. More details can be found in . The implementation of the proposed method and related programs in CUDA is available for free academic use at http://cucis.ece.northwestern.edu/projects/PSSE/.
In this paper, we present a high performance accelerator to estimate the pairwise statistical significance of local sequence alignment, which supports standard substitution matrix like BLOSUM62 as well as PSSMs.
Our accelerator harvests the computation power of many-core GPUs by using CUDA, which results in high end-to-end speedups for PSSE. We also demonstrate a comparative performance analysis of single-pair and multi-pair implementations. The proposed optimizations and efficient framework are applicable to a wide variety of next-generation sequencing comparison based applications, such as, DNA sequence mapping and database search. As the size of biological sequence databases are increasing rapidly, even more powerful high performance computing accelerator platforms are expected to be more and more common and imperative for sequence analysis, for which our work can serve as a meaningful stepping stone.
Basic Local Alignment Search Tool
BLOcks of Amino Acid SUbstitution Matrix
Compute Unified Device Architecture
Extreme Value Distribution
Billion Cell Updates Per Second
Graphics Processing Unit
Position Specific Scoring Matrix
Pairwise Statistical Significance Estimation
Position Specific Iterative BLAST
Pairwise Sequence Alignment
Pairwise Statistical Significance
Single: Instruction, Multiple: Thread
Standard Substitution Matrix SM: Streaming Multiprocessor
This work is supported in part by NSF award numbers CCF-0621443, OCI-0724599, CCF-0833131, CNS-0830927, IIS-0905205, OCI-0956311, CCF-0938000, CCF-1043085, CCF-1029166, and OCI-1144061, and in part by DOE grants DE-FC02-07ER25808, DE-FG02-08ER25848, DE-SC0001283, DE-SC0005309, and DE-SC0005340.This work is also supported in part by National Nature Science Foundation of China (NO. 60973118, NO. 61133016), Sichuan provincial science and technology agency (NO. M110106012010HH2027), Ministry of Education of China (NO. 708078), and China Scholarship Council.
This article has been published as part of BMC Bioinformatics Volume 13 Supplement 5, 2012: Selected articles from the First IEEE International Conference on Computational Advances in Bio and medical Sciences (ICCABS 2011): Bioinformatics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/13/S5.