Initialization
Our ADEPT implementation has two parts: a driver and a kernel. The driver initializes the GPU memory, packs the sequences into batches, and, once enough sequences are available to saturate the GPU's global memory, transfers the data to the GPU. The driver also detects the different GPUs available on the node and balances the work across all of them, as shown in Fig. 3.
Batched sequences are stored in two arrays, one for query sequences and the other for reference sequences. The number of sequences in the query and reference arrays is the same, and sequences located at the same indices are aligned with each other. For instance, if there are N query sequences and N reference sequences, a total of N alignments will be performed. Each alignment is mapped to a unique CUDA block, and a more fine-grained parallelization is then applied within each CUDA block. From here on, all implementation details are per-block; the same algorithm is replicated across all CUDA blocks.
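As an illustration, this mapping could be sketched as follows (a minimal sketch: the flat character arrays, the offset arrays, and the kernel signature are assumptions of this illustration, not ADEPT's exact interface):

```cuda
// Sketch: one CUDA block per query/reference pair. The packed arrays and
// offset arrays used here are illustrative assumptions.
__global__ void align_batch(const char *queries, const int *q_offsets,
                            const char *refs,    const int *r_offsets,
                            short *scores, int n_alignments)
{
    int pair = blockIdx.x;                      // alignment i <-> CUDA block i
    if (pair >= n_alignments) return;

    const char *Q = queries + q_offsets[pair];
    const char *R = refs    + r_offsets[pair];
    int q_len = q_offsets[pair + 1] - q_offsets[pair];
    int r_len = r_offsets[pair + 1] - r_offsets[pair];

    // ... per-block Smith-Waterman, detailed in the following subsections ...
    (void)Q; (void)R; (void)q_len; (void)r_len; (void)scores;
}

// Host side: N alignments map to N blocks, one thread per scoring-matrix column:
//   align_batch<<<N, threads_per_block>>>(d_q, d_qoff, d_r, d_roff, d_scores, N);
```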
Tracking inter-thread dependencies
As discussed in the Background section, to compute each cell Hij of the dynamic programming table H, the cells Hi−1,j, Hi,j−1, and Hi−1,j−1 need to have been computed. Because of this dependency structure, parallelism is restricted to the cells along an anti-diagonal of the matrix, as shown in Fig. 1. It can be further observed in the figure that the amount of parallel work first increases as the algorithm progresses, then remains constant for some iterations, and finally decreases near the bottom-right triangle of the matrix (shown in red). The maximum number of cells that can be calculated in parallel at any given time is equal to the length of the shorter of the two sequences. This poses the unique challenge of keeping track of dependencies for different cells and masking out the threads whose dependencies are not ready.
To calculate the scoring table H, we start by assigning one CUDA thread per column (as in Figure 1) such that it computes all the scoring cells within that column. Here we assume that, of the two sequences being aligned, the longer one is mapped along the columns and is referred to as R. As discussed before, because of the dependencies, not all threads can progress together. To tackle this problem, we introduce a Binary Masking Array (BMA) that masks out, in each iteration, the threads whose dependencies are not ready. The BMA has length b = 3∗|Q| and is initialized as:
$$ x_{i}=\left\{ \begin{array}{ll} 0, & \text{if } (i < |Q|) \text{ or } (i > 2|Q|)\\ 1, & \text{otherwise} \end{array}\right. $$
(4)
In the above equation, xi is the ith element of BMA. The number of threads that need to track their state is equal to the length of the query sequence. Since each thread needs to track its state in three different phases of the algorithm, the length of BMA is fixed to three times the query length.
Figure 4 shows the BMA array for a query of length 6. Here, each thread keeps track of an element in BMA. After each iteration, if the algorithm is in the yellow region (see Figure 1), the array shifts to the right, activating one more thread. Condition C is used to keep track of the region the algorithm is in:
$$ C = (I < |Q|) \;\text{or}\; (I \geq |R|) $$
(5)
In the above equation, I is the iteration number, which also corresponds to the anti-diagonal being computed. It can be observed in Fig. 4 that initially no CUDA thread is active, and as the algorithm progresses more and more threads are activated to perform the work. Once the algorithm reaches the orange region, condition C becomes false and the array stops shifting until the algorithm enters the red region. There the array again starts shifting to the right with each iteration (as shown in Fig. 5), but this time threads are masked out with each iteration because of the decreasing diagonal size, as can be observed in Fig. 1. The pseudocode in Algorithm 1 shows how the BMA is used to keep track of the algorithm's state.
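The sketch below illustrates this bookkeeping (a simplified illustration: the variable names, the slot each thread consults, and the exact point in the iteration at which the array shifts are assumptions chosen to mirror Figs. 4 and 5, not ADEPT's actual code):

```cuda
// Simplified sketch of the BMA bookkeeping; bounds checks on the cell
// computation are omitted for brevity.
#define MAX_QUERY_LEN 1024

__device__ void score_with_bma(int qLen, int rLen)
{
    __shared__ short bma[3 * MAX_QUERY_LEN];
    for (int i = threadIdx.x; i < 3 * qLen; i += blockDim.x)   // initialize per Eq. (4)
        bma[i] = (i < qLen || i > 2 * qLen) ? 0 : 1;
    __syncthreads();

    for (int I = 0; I < qLen + rLen - 1; ++I) {                // one anti-diagonal per iteration
        bool C = (I < qLen) || (I >= rLen);                    // Eq. (5)

        if (bma[2 * qLen + threadIdx.x]) {                     // is this thread active?
            // ... compute this thread's cell of anti-diagonal I ...
        }
        __syncthreads();

        // In the yellow and red regions the array shifts right by one element per
        // iteration, activating (and later masking) one thread at a time.
        // The shift is written serially for clarity.
        if (C && threadIdx.x == 0) {
            for (int i = 3 * qLen - 1; i > 0; --i) bma[i] = bma[i - 1];
            bma[0] = 0;
        }
        __syncthreads();
    }
}
```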
Dynamic programming table storage and memory access issues
To compute the highest scoring cell, a pass over the complete table H is required. The table H contains n∗m cells, so if each cell takes 2 bytes, storing the complete dynamic programming table for a batch requires N∗(2∗n∗m) bytes, where N is the batch size. For example, with sequences of length n = m = 300, a batch of one million alignments would need roughly 180 GB. The total global memory requirement thus reaches several hundred GBs for a million alignments, while even top-of-the-line GPUs offer only a few tens of GBs of global memory.
Apart from the storage size of the dynamic programming table, another challenge that arises frequently on GPUs is that of non-coalesced global memory accesses. Global memory accesses by the threads of a CUDA warp are bundled into the minimum number of cache-line transactions, where the L1 cache line size is 128 bytes. Thus, if two threads access memory locations that are more than 128 bytes apart, the accesses cannot be coalesced. It can be observed in Fig. 6 that, while writing the table H back to global memory, consecutive threads write to locations about 2∗(n−1) bytes apart, which exceeds 200 bytes when n is larger than 100.
It can be seen in Fig. 1 that computing a given anti-diagonal with the proposed parallel approach requires only the two most recent anti-diagonals. Apart from locating the maximum scoring cell at the end of the scoring phase, there is thus no reason to store the scoring matrix beyond the two most recent diagonals. To solve the problem of computing the maximum scoring cell, we modified our implementation so that each thread maintains a running maximum score for its assigned column in a thread register. This lets us discard the scoring matrix beyond the two most recent diagonals; since only this small portion of the matrix needs to be stored, it fits in thread registers, which also avoids the problem of non-coalesced memory accesses.
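A minimal sketch of the resulting per-thread state is shown below (illustrative names; the Smith-Waterman recurrence itself is elided):

```cuda
// Each thread keeps only its own cells from the two most recent anti-diagonals,
// plus a running column maximum, entirely in registers.
__device__ void score_one_column(int qLen, int rLen)
{
    short h_prev      = 0;   // this thread's H value on the previous anti-diagonal
    short h_prev_prev = 0;   // this thread's H value two anti-diagonals back
    short best_score  = 0;   // running maximum of this thread's column
    int   best_row    = -1;  // row index of that maximum

    for (int I = 0; I < qLen + rLen - 1; ++I) {
        int row = I - threadIdx.x;        // row this thread touches on diagonal I
        short h_new = 0;                  // ... Smith-Waterman recurrence, using h_prev,
                                          //     h_prev_prev and the left neighbour's
                                          //     values (see the next subsection) ...
        if (h_new > best_score) { best_score = h_new; best_row = row; }
        h_prev_prev = h_prev;             // slide the two-diagonal window
        h_prev      = h_new;
    }
    // best_score / best_row feed the block-wide reduction described next.
    (void)best_score; (void)best_row; (void)h_prev_prev;
}
```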
Once all the cells have been computed, we use CUDA's warp shuffle intrinsics to implement a block-wide reduction that obtains the highest scoring cell, as shown in Algorithm 1. Our block-wide reduction is adapted from NVIDIA's own reduction method [30], with modifications to return the indices of the highest scoring cell along with the score.
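The following sketch shows one way such a block-wide (score, index) reduction can be written with warp shuffles; the function and variable names are illustrative, and the code assumes every thread of the block participates and that the block size is a multiple of the warp size:

```cuda
// Block-wide maximum reduction carrying the cell indices alongside the score,
// in the spirit of NVIDIA's reduction examples [30]. Scores are widened to int
// for the shuffle intrinsics.
__device__ void block_arg_max(int score, int row, int col,
                              int *out_score, int *out_row, int *out_col)
{
    const unsigned FULL = 0xffffffffu;

    // 1) Reduce within each warp; the (score, row, col) triple travels together.
    for (int offset = warpSize / 2; offset > 0; offset >>= 1) {
        int s = __shfl_down_sync(FULL, score, offset);
        int r = __shfl_down_sync(FULL, row,   offset);
        int c = __shfl_down_sync(FULL, col,   offset);
        if (s > score) { score = s; row = r; col = c; }
    }

    // 2) Lane 0 of each warp stores its partial result in shared memory.
    __shared__ int s_score[32], s_row[32], s_col[32];
    int lane = threadIdx.x % warpSize;
    int wid  = threadIdx.x / warpSize;
    if (lane == 0) { s_score[wid] = score; s_row[wid] = row; s_col[wid] = col; }
    __syncthreads();

    // 3) The first warp reduces the per-warp partials and writes the result.
    int nWarps = blockDim.x / warpSize;
    if (wid == 0) {
        score = (lane < nWarps) ? s_score[lane] : -1;
        row   = (lane < nWarps) ? s_row[lane]   : -1;
        col   = (lane < nWarps) ? s_col[lane]   : -1;
        for (int offset = warpSize / 2; offset > 0; offset >>= 1) {
            int s = __shfl_down_sync(FULL, score, offset);
            int r = __shfl_down_sync(FULL, row,   offset);
            int c = __shfl_down_sync(FULL, col,   offset);
            if (s > score) { score = s; row = r; col = c; }
        }
        if (lane == 0) { *out_score = score; *out_row = row; *out_col = col; }
    }
}
```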
Efficient inter-thread communication
Figure 1 shows the mapping of CUDA threads to columns of the scoring matrix. Because of the cell dependencies, communication is required between consecutive threads: for thread j to compute the cell Hi,j, it needs the values of cells Hi−1,j, Hi,j−1 and Hi−1,j−1. As the figure shows, cell Hi−1,j is computed by thread j itself, while the other two cells are computed by thread j−1. For this inter-thread transfer we explored two methods of data sharing between threads: communication through shared memory and register-to-register transfer.
CUDA's warp shuffle intrinsics allow threads to perform direct register-to-register data exchange without any memory loads or stores, whereas the alternative involves going through the on-chip shared memory. Because of its much better performance, we opted for the register-to-register data exchange method.
However, register-to-register transfers are only allowed among the non-predicated threads of the same warp. This introduces several edge cases. For example, on the CUDA platform, where a warp is 32 threads wide, communication between threads (32∗q)−1 and 32∗q (where q>0) is not possible through register-to-register transfer because the two threads do not belong to the same warp. For instance, in Figure 1, threads 3 and 4 belong to different warps (assuming a warp three threads wide), so they cannot communicate via the register exchange method. For such cases, the last thread of each warp spills its registers to shared memory every iteration so that the first thread of the next warp can retrieve the data.
Similarly, while computing the scores for cells Hm,j, thread j−1 may already have been predicated (in Fig. 1, each thread is masked after it has computed the last cell of its assigned column), so a register-to-register transfer is not possible. To handle these edge cases, we use shared memory arrays into which threads spill their register values whenever such a case occurs. Using the BMA method discussed in the previous section, it is straightforward to determine whether a given thread will be predicated in the next iteration, so that its registers are spilled to shared memory in time and any dependent threads can then read the required values from shared memory.
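A sketch of this exchange pattern for the warp-boundary case is shown below (illustrative names; it assumes every thread of the block calls it each iteration, all warp lanes participate in the shuffle, and one shared-memory slot is reserved per warp):

```cuda
// Thread j obtains a value produced by thread j-1: a register shuffle when both
// threads sit in the same warp, a shared-memory spill at warp boundaries.
__device__ short exchange_with_left(short my_val, short *warp_edge /* one slot per warp */)
{
    int lane = threadIdx.x % warpSize;
    int wid  = threadIdx.x / warpSize;

    // The last thread of each warp spills its register so that the first thread
    // of the next warp can retrieve it.
    if (lane == warpSize - 1) warp_edge[wid] = my_val;
    __syncthreads();

    // Register-to-register exchange inside the warp (value widened for the intrinsic).
    short from_left = (short)__shfl_up_sync(0xffffffffu, (int)my_val, 1);

    // The first lane of a warp cannot receive via shuffle; read the spilled value.
    if (lane == 0 && wid > 0) from_left = warp_edge[wid - 1];

    __syncthreads();   // keep warp_edge stable until every reader is done
    return from_left;
}
```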
Using the above method provides fast inter-thread communication while freeing up a significant amount of shared memory, which helps improve GPU utilization and also helps avoid shared memory bank conflicts. A bank conflict occurs when multiple threads access the same bank of shared memory; this forces sequential access to that portion of memory and degrades performance.
Step-by-step kernel pseudocode for the forward phase is provided in Algorithm 1.
Reverse scoring
The third phase of the Smith-Waterman algorithm performs a traceback that starts from the highest scoring cell and ends when the score drops to zero or the top-left end of the matrix is reached. This requires maintaining traceback pointers, which can be stored as two matrices: one holding the indices of the query sequence and the other the indices of the reference sequence. Storing these matrices poses two challenges. First, the memory required for the traceback matrices equals 2∗N∗(n∗m), which can amount to several hundred GBs when N is close to a million alignments; unlike the scoring phase, we cannot discard parts of the traceback matrices, because doing so may lead to missing optimal alignments. Second, as mentioned before, the accesses are un-coalesced: the write-back to the traceback matrices proceeds along the anti-diagonals, and since the matrices are laid out in global memory in row-major order, this leads to un-coalesced memory accesses, as shown in Fig. 6.
However, in most practical Smith-Waterman applications, complete alignment details are rarely required; the majority of applications only need the optimal alignment score and the start and end indices of the optimal alignment [1, 13, 27]. Details of insertions and deletions are typically not required when the Smith-Waterman algorithm is used as part of a computational pipeline, in particular for pairwise alignments. For this practical reason, rather than performing a detailed traceback, we use a reverse scoring phase.
Reverse scoring phase
To obtain the start position of the alignment, we make use of the symmetric nature of the optimal alignment: scoring two sequences in the forward direction or with both of their directions reversed yields the same optimal alignment.
For reverse scoring, we exploit this property, as previously done in [27], and compute a reverse scoring matrix with both sequences flipped, starting from the indices of the highest scoring cell of the forward scoring matrix. When scoring in reverse, the highest score corresponds to the same alignment as in the forward scoring phase, as shown in Fig. 7.
Using the indices of the highest scoring cell of the reverse scoring phase, we can compute the start index of the alignment. The reverse scoring phase enables us to avoid storing traceback matrices, freeing up GBs of space. The reverse scoring kernel follows the same implementation as the forward scoring kernel shown in Algorithm 1, so we reuse that implementation by providing the flipped sequences as input. It should be noted that the reverse scoring matrix in most cases involves less total work because the end positions are already known.
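A host-side sketch of this reuse is shown below (the helper name is hypothetical; the idea is simply that both sequences are truncated at the end indices found in the forward phase, reversed, and fed to the same scoring kernel):

```cuda
#include <algorithm>
#include <string>

// Flip a sequence up to the end index reported by the forward scoring phase;
// the alignment cannot extend past its end cell, hence the smaller workload.
static std::string reversed_up_to(const std::string &s, int end_idx)
{
    std::string r = s.substr(0, end_idx + 1);
    std::reverse(r.begin(), r.end());
    return r;
}

// Forward pass:  scoring kernel on (Q, R)                        -> score, q_end, r_end
// Reverse pass:  the same kernel on (reversed_up_to(Q, q_end),
//                reversed_up_to(R, r_end))                       -> start indices
```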
Support for protein alignment
Thus far, ADEPT has not required any domain-specific optimizations, and the Smith-Waterman implementation discussed above is oblivious to the type of sequence being aligned.
The difference between aligning protein sequences and DNA sequences lies in the scoring method. When aligning DNA sequences, if two identical nucleotide bases align, this is considered a match and a fixed match score is used in computing the total score; if the bases do not match, a mismatch score is used instead. When aligning protein sequences, two aligned amino acids are scored based on their chemical similarity. Similarity scores for all possible pairs of amino acids have been characterized and are available in the form of a scoring matrix [31]. Hence, for protein alignment the user provides a scoring matrix instead of match/mismatch scores.
Since the scoring matrix is accessed very frequently, our implementation moves it to the GPU's shared memory to reduce the overhead of repeated accesses. To simplify scoring matrix lookups and minimize shared memory usage, we use a decoding matrix to index into the scoring matrix. Typically, the scoring matrix is indexed directly by amino acid characters, which reserves a large amount of memory for the matrix. In our implementation, we first index the decoding matrix with the ASCII code of the amino acid character to retrieve an encoded index, which is then used to access the scoring matrix.
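The two-step lookup can be sketched as follows (the array names, the alphabet size, and the use of constant memory for the master copies are assumptions of this sketch):

```cuda
// encode[] maps an amino-acid ASCII code to a compact index into the
// user-supplied substitution matrix (e.g. BLOSUM62).
#define NUM_AA 24                       // 20 amino acids plus ambiguity codes (assumed)

__constant__ short d_encode[128];       // ASCII code -> compact index
__constant__ short d_scoring[NUM_AA * NUM_AA];

// At kernel start the block cooperatively copies the scoring matrix into shared memory:
//   __shared__ short s_scoring[NUM_AA * NUM_AA];
//   for (int i = threadIdx.x; i < NUM_AA * NUM_AA; i += blockDim.x)
//       s_scoring[i] = d_scoring[i];
//   __syncthreads();

__device__ short substitution_score(unsigned char a, unsigned char b,
                                    const short *s_scoring /* shared-memory copy */)
{
    int ia = d_encode[a];               // first lookup: decode the character
    int ib = d_encode[b];
    return s_scoring[ia * NUM_AA + ib]; // second lookup: similarity score
}
```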
The underlying kernel remains the same for protein and DNA alignment; for protein alignment, the only difference is that the similarity score is obtained from the scoring matrix instead of using a match/mismatch score. Everything else remains the same as in Algorithm 1.
To make switching between the protein and DNA kernels easy for the user, we provide two separate kernels for protein alignment and DNA alignment. The DNA kernel accepts match, mismatch, gap open and gap extend scores as input, while the protein kernel accepts a scoring matrix along with gap open and gap extend scores at compile time.
Multi-GPU asynchronous pipeline
In a typical CPU-GPU setup, the CPU prepares a batch of data, offloads it to a GPU, and launches a GPU kernel to process it; once the data has been processed, the results are moved back to the CPU. However, as GPU technology has evolved and GPUs have been widely adopted, a typical GPU system now has several GPUs per node rather than one. For instance, the Summit supercomputer [32] has six GPUs per node and the upcoming Perlmutter supercomputer is planned to have four GPUs per node [33]. This calls for software that dynamically determines the type and memory capacity of each GPU on a node and divides the work among all GPUs accordingly. As a solution, ADEPT contains a driver component that manages all communication, load balancing and batch-size determination for the GPU kernels while keeping the developer oblivious of these intricacies.
ADEPT's driver gathers hardware information about all the GPUs installed on a node and then divides the work equally among them. A separate context is created for each GPU, with a unique CPU thread assigned to each one. This CPU thread divides its share of the computational load into smaller batches, sized according to the memory capacity of the GPU assigned to it. Each batch is then prepared, packed into a data structure, and passed to a GPU kernel call. Using CUDA streams, the GPU kernel call and the data packing stage are overlapped so that CPU and GPU work proceed in parallel. An overview of ADEPT's design is shown in Fig. 3.
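A simplified host-side sketch of this driver loop is shown below (the structure, the helper names, and the per-alignment memory estimate are illustrative assumptions; the real driver also handles load balancing and result collection):

```cuda
#include <cuda_runtime.h>
#include <thread>
#include <vector>

// Rough, illustrative per-alignment memory estimate used to size batches.
static const size_t BYTES_PER_ALIGNMENT = 4096;

void drive_one_gpu(int dev, int first, int last)
{
    cudaSetDevice(dev);                              // bind this CPU thread to one GPU

    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, dev);
    int batch = (int)(prop.totalGlobalMem / (2 * BYTES_PER_ALIGNMENT));  // leave headroom

    cudaStream_t stream;
    cudaStreamCreate(&stream);
    for (int b = first; b < last; b += batch) {
        // Pack the next batch on the CPU while the previous kernel runs on `stream`,
        // then cudaMemcpyAsync(...) and launch the alignment kernel on `stream`.
    }
    cudaStreamSynchronize(stream);
    cudaStreamDestroy(stream);
}

void drive_all_gpus(int total_alignments)
{
    int n_gpus = 0;
    cudaGetDeviceCount(&n_gpus);

    std::vector<std::thread> workers;
    int share = total_alignments / n_gpus;           // equal split across GPUs
    for (int dev = 0; dev < n_gpus; ++dev) {
        int first = dev * share;
        int last  = (dev == n_gpus - 1) ? total_alignments : first + share;
        workers.emplace_back(drive_one_gpu, dev, first, last);
    }
    for (auto &t : workers) t.join();
}
```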
ADEPT's driver makes it easier for developers to integrate ADEPT into high performance bioinformatics software pipelines by hiding the complexities of dealing with multiple GPUs and requiring just one call to the driver function. This effectively makes ADEPT a drop-in replacement for existing CPU libraries, whereas existing GPU libraries require a significant amount of work before they can be included in an existing software pipeline.