Program outline
CUDASW++ 3.0 gains its high speed from the combined use of CPU and GPU SIMD instructions and from concurrent CPU and GPU computation. Our algorithm works in four stages: (i) distribution of workloads over CPUs and GPUs according to their compute power; (ii) concurrent CPU and GPU computations; (iii) re-computation, using CPU 8-lane 16-bit SIMD instructions, of all alignments whose scores exceeded 8-bit accuracy; and (iv) sorting of all alignment scores in descending order and output of the results. Figure 1 illustrates the workflow of our algorithm. In our algorithm, all subject sequences are pre-sorted in ascending order of sequence length.
Workload distribution
Our workload distribution in Stage (i) balances the runtimes of the CPU and GPU SIMD computations. Hence, the compute power of CPUs and GPUs must be taken into account in order to generalize our approach to different hardware configurations. Our distribution policy calculates the rate R, i.e. the fraction of database residues assigned to GPUs, as
R = C·f_G·N_G / (f_C·N_C + C·f_G·N_G)  (2)

where f_C and f_G are the core frequencies of CPUs and GPUs, N_C and N_G are the number of CPU cores (i.e. threads) and the number of GPU SMs, and C is a constant derived from empirical evaluations, i.e. 3.2 and 5.1 for the query profile and its variant, respectively. When using multiple GPUs, our algorithm assumes that they have the same compute power and calculates N_G by summing up the number of SMs on all GPUs.
After obtaining R, we calculate the number N_R of residues assigned to GPUs as R times the total number of residues in the database. Subsequently, all subject sequences assigned to GPUs are determined by summing sequence lengths in ascending order until the accumulated length reaches N_R. All other subject sequences are distributed to CPUs.
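For illustration, a minimal host-side sketch of this distribution policy is given below. The Sequence structure, function name and parameter names are illustrative rather than taken from the CUDASW++ 3.0 source code.

#include <vector>
#include <cstddef>

// Hedged sketch of the Stage (i) workload split; identifiers are illustrative.
struct Sequence { const char* residues; size_t length; };

// Returns the index of the first subject sequence (in the list pre-sorted by
// ascending length) that is assigned to the CPU; sequences [0, split) go to the GPU(s).
size_t splitDatabase(const std::vector<Sequence>& sortedDb,
                     double fC, int numCpuThreads,
                     double fG, int numSMs, double C /* e.g. 3.2 or 5.1 */)
{
    // Equation (2): fraction of database residues assigned to the GPU(s).
    double R = (C * fG * numSMs) / (fC * numCpuThreads + C * fG * numSMs);

    size_t totalResidues = 0;
    for (const Sequence& s : sortedDb) totalResidues += s.length;
    size_t gpuResidues = static_cast<size_t>(R * totalResidues);   // N_R

    // Accumulate the shortest sequences until N_R residues are reached.
    size_t sum = 0, split = 0;
    while (split < sortedDb.size() && sum + sortedDb[split].length <= gpuResidues)
        sum += sortedDb[split++].length;
    return split;   // the remaining sequences are handled by the CPU threads
}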
CPU SIMD computation
In Stage (ii), the CPU SIMD computation consists of two steps. First, we compute the SW algorithm by splitting an SSE vector into 16 lanes of 8-bit width, which allows a query to be aligned to 16 subject sequences in parallel following the inter-task parallelization model. Second, we re-compute all alignments whose scores have overflow potential using 8-lane SSE vectors with 16-bit lane width. We determine whether an alignment has overflow potential by comparing its score with a score limit, calculated by subtracting the maximum substitution score in the scoring matrix from 128. If the score is greater than or equal to the score limit, the alignment is deemed to have overflow potential and thus requires re-computation. Our approach is based on the open-source SWIPE, and more details about the SSE-based SW implementation can be found in [19].
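The overflow test itself can be expressed in a few lines, as in the hedged sketch below; score8 holds the results of the 16-lane 8-bit pass and recompute16 stands for the 8-lane 16-bit re-alignment (both names are hypothetical).

#include <vector>
#include <cstddef>

// Hedged sketch of the overflow check between the two CPU SIMD passes.
std::vector<int> resolveOverflows(const std::vector<int>& score8,
                                  int maxSubstitutionScore,
                                  int (*recompute16)(size_t alignmentIndex))
{
    const int scoreLimit = 128 - maxSubstitutionScore;  // scores >= limit may have overflowed
    std::vector<int> finalScores(score8.size());
    for (size_t i = 0; i < score8.size(); ++i)
        finalScores[i] = (score8[i] >= scoreLimit) ? recompute16(i)   // 16-bit re-computation
                                                   : score8[i];       // 8-bit score is safe
    return finalScores;
}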
In our algorithm, users are allowed to use multiple threads for the CPU SIMD computation. Since the workload (i.e. the subject sequences assigned to CPUs) is known beforehand, we calculate the total number of residues in all assigned subject sequences and distribute them (nearly) equally over all threads, using a sequence as the unit. This distribution aims to give each thread roughly the same number of residues, though not necessarily the same number of subject sequences.
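One possible greedy realization of this per-thread distribution is sketched below; the function and variable names are illustrative.

#include <vector>
#include <cstddef>

// Hedged sketch: split the CPU-assigned sequences into numThreads chunks that hold
// roughly the same number of residues, using whole sequences as the unit.
std::vector<std::vector<size_t>> partitionForThreads(const std::vector<size_t>& seqLengths,
                                                     int numThreads)
{
    size_t totalResidues = 0;
    for (size_t len : seqLengths) totalResidues += len;
    const size_t target = (totalResidues + numThreads - 1) / numThreads;  // residues per thread

    std::vector<std::vector<size_t>> chunks(numThreads);
    size_t accumulated = 0;
    int thread = 0;
    for (size_t i = 0; i < seqLengths.size(); ++i) {
        chunks[thread].push_back(i);            // assign sequence i to the current thread
        accumulated += seqLengths[i];
        if (accumulated >= target && thread + 1 < numThreads) {
            ++thread;                            // move on once this thread's quota is filled
            accumulated = 0;
        }
    }
    return chunks;
}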
GPU SIMD computation
Core PTX SIMD assemblies
We have implemented the recurrence in Equation (1) with PTX SIMD assembly instructions. The code consists of ten assembly instructions for the recurrence and one instruction for obtaining the optimal local alignment score. Figure 2 shows the PTX SIMD assembly instructions.
The figure shows that every instruction operates on quads of 8-bit signed values, corresponding to four independent alignments. Variables h, n and h_e represent alignment score vectors corresponding to matrix H, where h denotes the score vector of the four current cells, n the score vector of the four diagonal neighbours and h_e the score vector of the four upper neighbours. Variables e and f represent the score vectors corresponding to matrices E and F, respectively, and S stores the current maximum alignment scores. For additions and subtractions, saturating instructions are used to clamp the values to their appropriate signed ranges.
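As the figure is not reproduced here, the following device-side sketch shows a functionally equivalent quad-lane cell update written with CUDA's byte-wise video SIMD intrinsics (__vaddss4, __vsubss4, __vmaxs4), which correspond to the PTX vadd4/vsub4/vmax4 instructions on Kepler. It is an illustration of the technique, not the exact instruction sequence of Figure 2.

// Hedged sketch of the quad-lane (4 x 8-bit) cell update; each unsigned int packs
// four signed bytes, one per alignment. Gap penalties are replicated into all lanes.
__device__ void cellUpdate(unsigned int& h,     // scores of the four current H cells
                           unsigned int  n,     // scores of the four diagonal neighbours
                           unsigned int  sub,   // four substitution scores
                           unsigned int& e,     // four E values
                           unsigned int& f,     // four F values
                           unsigned int& S,     // running maxima of the four alignments
                           unsigned int  gapOE, // gap open + extend, per lane
                           unsigned int  gapE)  // gap extend, per lane
{
    h = __vaddss4(n, sub);          // H[i-1][j-1] + sbt(q[i], s[j]), saturated per byte
    h = __vmaxs4(h, e);             // allow a gap ending in the subject
    h = __vmaxs4(h, f);             // allow a gap ending in the query
    h = __vmaxs4(h, 0);             // local alignment: clamp at zero
    S = __vmaxs4(S, h);             // track the optimal score of each alignment
    e = __vmaxs4(__vsubss4(e, gapE), __vsubss4(h, gapOE));  // update E for the next cell
    f = __vmaxs4(__vsubss4(f, gapE), __vsubss4(h, gapOE));  // update F for the next cell
}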
CUDA-enabled parallelization
For quad-lane SIMD computing on GPUs, four adjacent subject sequences (in the pre-sorted list mentioned above) are assigned to a single thread, with each vector lane corresponding to one sequence. To facilitate data fetches for SIMD vectors, a two-dimensional sequence profile of size 4×l is created for the four sequences, where l is the maximum length of the four sequences. In a sequence profile, each row is a quad-lane residue vector represented as an integer data type, created by packing the four residues at the same index in their corresponding sequences, with each residue occupying 8 bits. To reduce the number of texture fetches, we further pack four successive residue vectors into a uint4 vector data type for each sequence profile. Thus, four residue vectors for the four subject sequences can be obtained by a single texture fetch. Using a profile as the unit, we store all profiles in texture memory following the same layout as in CUDASW++ 2.0.
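A host-side sketch of how such a sequence profile might be packed is shown below; the buffer layout, padding residue and names are illustrative, and the final arrangement of all profiles in the CUDASW++ 2.0 texture layout is omitted.

#include <vector>
#include <cstdint>
#include <cstddef>

// Hedged sketch: pack four length-sorted subject sequences into one sequence profile.
// Row j holds the four residues at index j, one byte per sequence (padded with a
// sentinel beyond a sequence's end), so one 32-bit word feeds one quad-lane vector.
std::vector<uint32_t> buildSequenceProfile(const std::vector<std::vector<uint8_t>>& fourSeqs,
                                           uint8_t padResidue)
{
    size_t maxLen = 0;
    for (const auto& s : fourSeqs) maxLen = (s.size() > maxLen) ? s.size() : maxLen;

    std::vector<uint32_t> profile(maxLen);
    for (size_t j = 0; j < maxLen; ++j) {
        uint32_t row = 0;
        for (int lane = 0; lane < 4; ++lane) {
            uint8_t r = (j < fourSeqs[lane].size()) ? fourSeqs[lane][j] : padResidue;
            row |= static_cast<uint32_t>(r) << (8 * lane);   // one residue per byte lane
        }
        profile[j] = row;
    }
    // Four consecutive rows are later fetched together as one uint4 texture element.
    return profile;
}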
For the linear-space SW algorithm, we require two intermediate buffers to store one row of matrices H and E (in our case), respectively. Instead of global memory, we allocate them in local memory. Since the Kepler architecture provides 512 KB of per-thread local memory, we can theoretically support subject sequences of up to 65,536 residues on GPUs. Our algorithm sets the maximum subject sequence length to 3,072 by default, but allows users to configure it at compile time because the two intermediate buffers have to be statically allocated in local memory.
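In a kernel this amounts to declaring fixed-size per-thread arrays, as in the sketch below; the element type and kernel name are assumptions for illustration.

// Hedged sketch of the per-thread intermediate rows kept in local memory.
// MAX_SUBJECT_LENGTH (3,072 by default) must be fixed at compile time because
// local-memory arrays cannot be sized dynamically.
#define MAX_SUBJECT_LENGTH 3072

__global__ void swStaticKernel(/* ... query, sequence profiles, results ... */)
{
    unsigned int rowH[MAX_SUBJECT_LENGTH];  // one row of matrix H (quad-lane packed scores)
    unsigned int rowE[MAX_SUBJECT_LENGTH];  // one row of matrix E

    // ... iterate over query positions, reading and writing rowH/rowE per subject column ...
    (void)rowH; (void)rowE;
}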
The sequence length deviation generally causes runtime imbalance between threads, which in turn can waste GPU compute power. In this regard, we have developed two CUDA kernels based on two parallelization approaches: static scheduling and dynamic scheduling. These two kernels are invoked at runtime based on the sequence length deviation of the database. For both approaches, we compute the total number of thread blocks from the total number of sequence profiles constructed from the workload assigned to the GPU. Since each thread has its own intermediate buffers, the static scheduling parallelization launches all thread blocks onto the GPU at the same time, as is common when launching a CUDA kernel, and relies on the CUDA runtime system to maximize the utilization of the GPU's computational resources. The dynamic scheduling approach, in contrast, intervenes in the scheduling of thread blocks rather than leaving it entirely to the CUDA runtime system. This parallelization launches a small set of N_T thread blocks to carry out the whole computation, regardless of the assigned workload. N_T is defined as

N_T = N_SM × N_MRT / N_TPB  (3)

where N_SM is the number of SMs, N_MRT is the maximum number of resident threads per SM supported by the GPU, and N_TPB is the number of threads per thread block configured by the user. The dynamic scheduling parallelization works as follows. All sequence profiles are organized into sequence profile blocks, each of which contains as many sequence profiles as there are threads in a thread block. Subsequently, N_T thread blocks are launched to perform the computation, where a thread block processes one sequence profile block at a time. When a thread block finishes its current computation, it dynamically obtains an unprocessed profile block. This operation is performed by the atomic addition function atomicAdd() on global memory, which increments the index of the global profile blocks. In our algorithm, both static scheduling and dynamic scheduling use a thread block size of 64.
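A skeleton of this dynamic scheduling scheme could look as follows; it is a hedged sketch in which the per-profile alignment code is omitted and the names are illustrative.

// Hedged sketch of the dynamic-scheduling kernel: N_T persistent thread blocks repeatedly
// claim the next unprocessed sequence profile block via atomicAdd on a global counter.
__device__ unsigned int nextProfileBlock = 0;   // reset to 0 from the host before each launch

__global__ void swDynamicKernel(int numProfileBlocks /*, profiles, query, results ... */)
{
    __shared__ unsigned int myBlock;
    for (;;) {
        if (threadIdx.x == 0)
            myBlock = atomicAdd(&nextProfileBlock, 1u);   // claim one profile block
        __syncthreads();
        if (myBlock >= (unsigned int)numProfileBlocks)
            break;                                        // no work left

        // Each thread aligns the query against the sequence profile with index
        // myBlock * blockDim.x + threadIdx.x.
        // ... quad-lane SW computation goes here ...

        __syncthreads();                                  // finish before claiming the next block
    }
}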
Our evaluation has found that dynamic scheduling performs slightly better than static scheduling when using the Swiss-Prot database (with large sequence length deviation), whereas the latter seems slightly better in the ideal case where all sequences are of equal lengths. Based on the above observations, we have decided to employ the static scheduling approach for databases with very small sequence length deviation (by default when the standard deviation does not exceed 1% of the mean) and the dynamic scheduling approach for all others.
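The choice between the two kernels can be made with a simple statistic over the database lengths, for example as in this hedged sketch of the 1% rule:

#include <vector>
#include <cmath>
#include <cstddef>

// Hedged sketch of the kernel-selection rule: use static scheduling only when the
// standard deviation of the sequence lengths is at most 1% of their mean.
bool useStaticScheduling(const std::vector<size_t>& seqLengths)
{
    double mean = 0.0;
    for (size_t len : seqLengths) mean += static_cast<double>(len);
    mean /= seqLengths.size();

    double var = 0.0;
    for (size_t len : seqLengths) {
        double d = static_cast<double>(len) - mean;
        var += d * d;
    }
    double stddev = std::sqrt(var / seqLengths.size());

    return stddev <= 0.01 * mean;   // otherwise fall back to dynamic scheduling
}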
Query profile variant
Given a query S defined over an alphabet Σ, a query profile is defined as a numerical string set P = {P_r | r ∈ Σ}, where P_r is a numeric string composed of the substitution scores required for aligning the whole query to residue r. The space complexity of the query profile is O(|S|×|Σ|). In our algorithm we have employed the sequential-layout query profile [16], which defines the i-th element of P_r as M(r, S[i]), 1 ≤ i ≤ |S|. The query profile is stored in texture memory and packed in the same way as in CUDASW++ 2.0 to reduce the number of texture fetches.
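A sketch of building such a sequential-layout query profile is given below; the substitution matrix is assumed to be indexed by residue codes, and the subsequent packing into the texture layout is omitted.

#include <vector>
#include <cstdint>
#include <cstddef>

// Hedged sketch: sequential-layout query profile, profile[r][i] = M(r, S[i]).
// 'alphabetSize' is |Sigma| and 'matrix' is the |Sigma| x |Sigma| substitution matrix.
std::vector<std::vector<int8_t>> buildQueryProfile(const std::vector<uint8_t>& query,
                                                   const std::vector<std::vector<int8_t>>& matrix,
                                                   size_t alphabetSize)
{
    std::vector<std::vector<int8_t>> profile(alphabetSize,
                                             std::vector<int8_t>(query.size()));
    for (size_t r = 0; r < alphabetSize; ++r)
        for (size_t i = 0; i < query.size(); ++i)
            profile[r][i] = matrix[r][query[i]];   // score for aligning residue r to S[i]
    return profile;
}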
To facilitate GPU SIMD parallelization, we have derived a variant of the query profile. By enumerating all residues in Σ, we define a query profile variant as a numerical set V = {V_r | 0 ≤ r < |Σ|^K} of |Σ|^K entries, where each element of V_r is a vector of K substitution scores and r is an integer corresponding to a permutation of K residues in Σ. V_r stores all substitution score vectors for aligning the whole query to the K residues corresponding to r. The space complexity of a query profile variant is O(|S|×|Σ|^K). Like the query profile, this variant is also stored in texture memory. When K = 4, each element of V_r can be used directly in our quad-lane SIMD computation. However, the memory footprint is then considerable even for short protein queries (this pressure is significantly alleviated for DNA sequences due to their small alphabet size), and it causes more texture cache misses as the query length increases. In order to improve speed for long queries, our algorithm therefore uses K = 2. We represent each element of V_r using the short integer data type: since the range of the char data type is generally large enough to store a single substitution score, one short can hold the two scores of an element. Figure 3 shows an example query profile variant using K = 2. Similar to the query profile, the variant is also packed by representing four consecutive elements of each V_r using the short4 data type. In this way, the variant halves the number of texture fetches compared to the query profile. On the other hand, to compute each cell vector (see Figure 2), we have to extract and concatenate the substitution scores, from either the query profile or the variant, to generate a substitution score vector, which requires some additional bitwise operations in our implementation. Here the variant saves six bitwise operations per cell vector compared to the query profile. This matters for speed, considering that each cell vector requires only a handful of assembly instructions, as shown in Figure 2.
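For K = 2, the variant can be built as sketched below: entry r encodes a pair of residues, and each element packs the two corresponding substitution scores for one query position into a 16-bit value. The encoding of r and all names are illustrative, and the further packing into short4 for texture fetches is omitted.

#include <vector>
#include <cstdint>
#include <cstddef>

// Hedged sketch of the query profile variant with K = 2.
// variant[r][i] packs M(r0, S[i]) and M(r1, S[i]) into one 16-bit value,
// where r = r0 + r1 * alphabetSize enumerates all pairs of residues.
std::vector<std::vector<uint16_t>> buildVariantK2(const std::vector<uint8_t>& query,
                                                  const std::vector<std::vector<int8_t>>& matrix,
                                                  size_t alphabetSize)
{
    std::vector<std::vector<uint16_t>> variant(alphabetSize * alphabetSize,
                                               std::vector<uint16_t>(query.size()));
    for (size_t r1 = 0; r1 < alphabetSize; ++r1) {
        for (size_t r0 = 0; r0 < alphabetSize; ++r0) {
            size_t r = r0 + r1 * alphabetSize;
            for (size_t i = 0; i < query.size(); ++i) {
                uint8_t s0 = static_cast<uint8_t>(matrix[r0][query[i]]);  // score for the first lane
                uint8_t s1 = static_cast<uint8_t>(matrix[r1][query[i]]);  // score for the second lane
                variant[r][i] = static_cast<uint16_t>(s0 | (s1 << 8));    // two 8-bit scores per short
            }
        }
    }
    return variant;
}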
Our algorithm employs both the query profile and its variant. In general, for short queries, more performance is gained from the query profile variant because it halves the number of texture fetches and uses fewer bitwise operations per cell, as mentioned above. For longer queries, however, the query profile becomes superior due to its much smaller memory footprint and fewer texture cache misses. Thus, we calculate a query length threshold Q to decide whether to use the query profile or the variant. For the Kepler architecture, texture fetches are cached by the aforementioned read-only cache and the L2 cache. Since the L2 cache is usually much larger than the read-only cache, we estimate Q from the L2 cache size as
Q = S_L2 / (2|Σ|²)  (4)

where S_L2 is the L2 cache size in bytes and 2|Σ|² is the memory footprint (in bytes) of the query profile variant per query residue when K = 2.
This estimation is empirical but has worked well in practice in our evaluations. Q is used in the dynamic scheduling parallelization to cope with more general databases. For static scheduling, which is only applied to databases with very small sequence length deviation, we have found that the query profile variant usually yields superior speed.