Faster Smith-Waterman database searches with inter-sequence SIMD parallelisation
© Rognes; licensee BioMed Central Ltd. 2011
Received: 5 March 2011
Accepted: 1 June 2011
Published: 1 June 2011
Skip to main content
© Rognes; licensee BioMed Central Ltd. 2011
Received: 5 March 2011
Accepted: 1 June 2011
Published: 1 June 2011
The Smith-Waterman algorithm for local sequence alignment is more sensitive than heuristic methods for database searching, but also more time-consuming. The fastest approach to parallelisation with SIMD technology has previously been described by Farrar in 2007. The aim of this study was to explore whether further speed could be gained by other approaches to parallelisation.
A faster approach and implementation is described and benchmarked. In the new tool SWIPE, residues from sixteen different database sequences are compared in parallel to one query residue. Using a 375 residue query sequence a speed of 106 billion cell updates per second (GCUPS) was achieved on a dual Intel Xeon X5650 six-core processor system, which is over six times more rapid than software based on Farrar's 'striped' approach. SWIPE was about 2.5 times faster when the programs used only a single thread. For shorter queries, the increase in speed was larger. SWIPE was about twice as fast as BLAST when using the BLOSUM50 score matrix, while BLAST was about twice as fast as SWIPE for the BLOSUM62 matrix. The software is designed for 64 bit Linux on processors with SSSE3. Source code is available from http://dna.uio.no/swipe/ under the GNU Affero General Public License.
Efficient parallelisation using SIMD on standard hardware makes it possible to run Smith-Waterman database searches more than six times faster than before. The approach described here could significantly widen the potential application of Smith-Waterman searches. Other applications that require optimal local alignment scores could also benefit from improved performance.
The alignment of two biological sequences is a fundamental operation that forms part of many bioinformatics applications, including sequence database searching, multiple sequence alignment, genome assembly, and short read mapping.
Smith and Waterman  described a simple and general algorithm requiring O(N 3) time and O(N 2) memory to identify the optimal local sequence alignment score using a substitution score matrix and a general gap penalty function. Gotoh  showed that with affine gap penalties the optimal local alignment score could be computed in just O(N 2) time and O(N) memory.
When the optimal alignment score needs to be computed many times, for example when searching a sequence database, the computation time becomes substantial. Several approaches have been pursued to reduce the time needed. Heuristic approaches like BLAST [3, 4] are considerably faster, but are not guaranteed to discover the optimal alignment.
Reconfigurable hardware in the form of FPGA (Field-Programmable Gate Array) can also accelerate the speed of alignment score computations. Li et al. reported speeds equivalent to about 23.8 billion cell updates per second (GCUPS) with DNA sequences, on a state-of-the-art FPGA board. Search speed is often reported in GCUPS, which indicates the billion (giga) number of cells in the alignment matrix (query sequence length times total number of database residues), processed per second.
The algorithm can also be implemented with various forms of parallelisation in software running on more common hardware. Pairwise alignment of separate sequences is in principle "embarrassingly" parallel because the computations for each pair of sequence are completely independent. Alpern et al. suggested improving speed by performing several independent alignment score computations in parallel by dividing the bits of wide registers into several narrower units and using instructions to perform arithmetic operations on these units individually. This form of parallelism within a register was later made much simpler and easier by microprocessor manufacturers with the introduction of technologies like MMX, SSE, SSE2, MAX, MVI, VIS, and AltiVec, which are now generally referred to as SIMD technology. Several implementations take advantage of the SSE2 instructions available on Intel processors . The approach where parallelisation is carried out across multiple database sequences is also known as inter-task parallelisation, in contrast to intra-task parallelisation where the parallelisation occurs within a single pair of sequences.
The Cell processor manufactured by Sony, Toshiba and IBM, have one main core (Power Processing Element, PPE) and 8 minor cores (Synergistic Processing Elements, SPEs). These cores have SIMD vector processing capabilities. Cell processors are found in some IBM servers (QS20) as well as the Sony PlayStation 3 (PS3) (only 6 SPEs available). Several implementations for this processor have been described. SWPS3 by Szalkowski et al. used the Farrar approach and claimed speeds on the PS3 of up to 8 GCUPS. Also here the performance was highly dependent on query length. With the P07327 query sequence, SWPS3 performance was less than 2 GCUPS on the PS3. Wirawan et al. also employed the Farrar approach in their implementation called CBESW and claimed speeds on the PS3 of over 3.6 GCUPS, while the performance was 2.2 GCUPS with the P07327 query sequence. Farrar also reported speeds of 15.5 GCUPS on an IBM QS20 and up to 11.6 GCUPS on the PS3 using the same query . Rudnicki et al. described an implementation that used parallelisation over multiple database sequences on the PS3 and reported speeds approaching 9 GCUPS.
Graphics processors (GPUs) can also accelerate alignments. Several implementations have employed the CUDA interface to Nvidia GPUs. The CUDASW++ tool by Liu et al. reportedly performed 9.5 GCUPS on the single-GPU GeForce GTX 280 and 14.5 GCUPS on the dual-GPU GeForce GTX 295. Ligowski and Rudnicki  reported speeds of up to 14.5 GCUPS on the dual-GPU GeForce 9800 GX2. In 2010, Liu et al. reported speeds up to 17 and 30 GCUPS by CUDASW++ 2.0 on the single-GPU GeForce GTX 280 and the dual-GPU GeForce GTX 295, respectively. Recently, Ligowski et al. reported a speed of 42.6 GCUPS on the GeForce GTX 480.
The main advantage of the inter-task parallelisation approach where multiple database sequences are processed in parallel as described by Alpern et al. is that it simply avoids all data dependences within the alignment matrix. This approach does not seem to have been explored much in later implementations using SIMD technology apart from the work by Rudnicki et al. However, in the GPU-based tools [15, 16] this approach is common. The aim of the present study was to explore the use of this approach further using SIMD on ordinary CPUs.
Here the algorithm is implemented on Intel processors with SSSE3  with parallelisation over multiple database sequences as illustrated in Figure 1D. Instead of aligning one database sequence against the query sequence at a time, residues from multiple database sequences are retrieved and processed in parallel. Rapid extraction and organisation of data from the database sequences have made this approach feasible. The approach has been implemented in a tool called SWIPE. The approach also involves computing four consecutive cells along the database sequences before proceeding to the next query residue in order to reduce the number of memory accesses needed.
The performance of the new implementation has been extensively tested using different scoring matrices, gap penalties, query sequences and number of threads. The speed of SWIPE was almost constant at more than 100 GCUPS on a dual Intel Xeon X5650 six-core processor system for a wide range of query lengths. SWIPE was about six times faster than SWPS3 and Farrar's own implementation for a typical length query, but the factor varied between 2 and 18 depending on query length and number of threads used. Two versions of BLAST were tested and the speed was found to be highly dependent on the score matrix. SWIPE was about twice as fast as BLAST using the BLOSUM50 matrix, while BLAST was about twice as fast as SWIPE using the BLOSUM62 matrix.
Benchmarking was performed on compute nodes in the Titan high performance cluster at the University of Oslo. Entire nodes were reserved to ensure that no other major processes were running. All data was initially copied to a fast local disk to reduce the influence of the computer networks and minimize file reading time. The output from all programs was redirected to/dev/null to minimize performance differences due to the amount of output. All combinations of programs, number of threads, query sequences, score matrices, gap open penalties and gap extension penalties were run 15 times and the median total wall clock execution time was recorded.
Programs included in performance testing
./blastall -p blastp -F F -C 0 -b 0 -v 10 -a $T -M $M -G $GO
-E $GE -i $Q -d $D
./blastp -seg no -comp_based_stats F -num_alignments 0
-num_descriptions 10 -num_threads $T -matrix $M
-gapopen $GO -gapextend $GE -query $Q -db $D
./swipe -v 10 -a $T -M $M.mat -G $GO -E $GE -i $Q -d $D
./striped -c 10 -T $T -i -$GO -e -$GE $M.mat $Q $D.fsa
./swps3 -j $T -i -$GO -e -$GE $M.mat $Q $D.fsa
SWIPE was written mainly in C++ with some parts hand coded in inline assembler and some using SSE2 and SSSE3 intrinsics. It was compiled for 64 bit Linux using the Intel C++ compiler version 11.1. Source code is available at http://dna.uio.no/swipe/ under the GNU Affero General Public License, version 3. An executable binary and score matrix files are also available at the same location. The same files are included in a gzipped tar archive as additional file 1.
The source code for Farrar's STRIPED software was downloaded from the author's website and compiled with the GNU gcc compiler as specified in the supplied Makefile . The Makefile was also modified to compile STRIPED using the Intel compiler, to see if there were any differences in performance.
The binary executable for the SWPS3 program was downloaded from the authors' website and used directly .
Precompiled binaries for BLAST and BLAST+ were downloaded from the NCBI FTP site .
Graphs were drawn using Gnuplot version 4.5.
Performance tests were carried out on Dell PowerEdge M610 blade servers with 48 GB RAM and dual Intel Xeon X5650 six-core processors running at 2.67 GHz. The X5650 processors have simultaneous multithreading capability also known as hyper-threading (HT). With HT enabled, each of these computers has a total of 24 logical cores.
The programs were run using 1 to 24 threads. To take full advantage of the hardware a number of threads equal to the number of logical cores is usually the most appropriate. There are however important differences between software on the effect of HT and in the ability to make efficient use of the cores available. To simplify comparisons, some of the tests were only performed with 1 or 24 threads.
UniProt Knowledgebase Release 11.0  consisting of both Swiss-Prot release 53.0 and TrEMBL release 36.0 of 29 May 2007 was used for the performance tests. This database consists of 4 646 608 protein sequences with a total of 1 517 383 530 amino acid residues. The longest sequence contains 36 805 residues.
This database was chosen because the size was large enough to be realistic but smaller than the apparent 2 GB file size limit of some software. The database also did not contain any of the special J, O or U amino acid residue symbols that some of the software could not handle. Finally, this release of the database should be available for download in the foreseeable future, making it suitable also for future benchmarking.
The current version of SWIPE will not work with databases split by formatdb into separate volumes. SWIPE therefore cannot search databases larger than about 4 billion amino acids.
The database was converted into FASTA format by a simple Perl script and then formatted with NCBI formatdb version 2.2.24 into the NCBI BLAST binary database format . This binary format was read by SWIPE, BLAST and BLAST+, while STRIPED and SWPS3 read the FASTA-formatted database file directly.
The 32 query sequences with accession numbers P56980, O29181, P03630, P02232, P01111, P05013, P14942, P00762, P53765, Q8ZGB4, P03989 (replacing the identical but obsolete P10318), P07327, P01008, P10635, P58229, P25705, P03435, P42357, P21177, Q38941, O60341, P27895, P07756, P04775, P19096, P28167, P0C6B8, P20930, P08519, Q7TMA5, P33450 and Q9UKN1, ranging in length from 24 to 5478 residues were retrieved from the UniProt database . Most of them have previously been used several times for performance testing. To simplify comparisons, some of the tests were only performed with the 375 residues long P07327 query, representing a protein of roughly average length.
All 82 different combinations of amino acid substitution score matrices and gap penalties accepted by BLAST were tested. The matrices used were BLOSUM45, BLOSUM50, BLOSUM62, BLOSUM80, and BLOSUM90 from the BLOSUM series  as well as PAM30, PAM70, and PAM250 from the PAM series . Matrices were obtained from the NCBI FTP site. Rows and columns for stop codons (*) were removed from the matrices for compatibility with the SWPS3 program. SWPS3 would only run successfully using the BLOSUM45, 50 and 62 matrices. To simplify comparisons, some of the tests were only performed with the BLOSUM62 matrix and gap open and extension penalties of 11 and 1, respectively, which is the BLAST default.
The query sequence q of length m contains residues q i . The database sequence d of length n contains residues d j . H i,j is the score for aligning the prefixes of q and d ending in the alignment of residues q i and d j . E i,j and F i,j are the scores of aligning the same prefixes of q and d but ending with a gap in the query and database sequence, respectively. P[q i , d j ] is the score of aligning residues q i and d j with each other according to the substitution score matrix P. Q is the sum of gap open and extension penalties while R is the gap extension gap penalty. S is the overall optimal local alignment score.
The calculations are carried out column by column. Only parts of the H, E and F matrices need to be kept in memory: a single element of the F matrix as well as two arrays containing m elements each, corresponding to one column of the H and E matrices.
The main features of the implementation are described below.
When a new database sequence begins in one of the channels, the score of the previous database sequence must be recorded. In addition, the H, E and S scores from the previous sequence must be reset. When a new column is going to be processed, it is checked whether any database sequence ended in the previous column. If that is the case, special processing is carried out. The score of the sequences that ended are recorded. A mask is created and later used to reset the values of H, E and S in the appropriate channels before the new column is processed. The channels are filled, and one or more new sequences are started. A somewhat simpler and faster processing step is carried out for the new column if no sequences ended in the previous column. In the simpler processing step no scores need to be saved, no new sequences are started, and no mask need to be created or used. Since most columns are of the simpler type the overall performance is mostly dependent on the speed of processing the simple columns. To simplify computations, each channel is padded with 1-3 null symbols after the end of a sequence if its length is not a multiple of 4, to ensure that a new database sequence will only begin at the beginning of a new block of cells. This padding is indicated in pink in Figure 2. The padding increases the total number of cell a little bit, but allows the checks described above to be carried out only on every fourth residue.
The temporary score profiles are created from the ordinary substitution score matrices (e.g. BLOSUM62) and the 4 × 16 database sequence residues using a series of packed shuffle instructions (pshufb). The shuffle instruction is only available on Intel processors with Supplemental Streaming SIMD Extensions 3 (SSSE3). On processors without SSSE3 (i.e. AMD processors and older Intel processors), the computations may be replaced by a kind of matrix transpose operation using a series of unpack instructions (punpcklbw, punpckhbw) with only a modest speed penalty.
Computations are initially performed using only a 7 bit score range. This allows 16 alignment score matrices to be computed in parallel using SSE2 instructions. Additions and subtractions are performed using signed and saturated arithmetics, while the maximum operations are carried out on unsigned numbers. Only the 7 bit score range from -128 to -1 (signed numbers) or 128 to 255 (unsigned numbers) is used. All scores are biased by an offset of 128. This range of values will ensure that signed saturated addition and subtraction works well on the lower boundary. Also, the unsigned maximum works well in this range. The range enables the use of the packed maximum unsigned byte instruction (pmaxub) and packed add and subtract signed saturated bytes instructions (paddsb and psubsb), which are available on all SSE2 processors and gives the highest speed.
An 8 bit range, which would allow the same number (16) of parallel computations as a 7-bit range, and at the same time allow a wider score range, is not used because it is slower. Either the range from -128 to 127, or the range from 0 to 255 could be used. There are both signed and unsigned versions of the instructions for parallel computation of the maximum of bytes (pmaxub, pmaxsb) and for parallel addition and subtraction of bytes (paddusb, psubusb, paddsb, psubsb), but the pmaxsb instruction for the maximum of signed bytes was only recently available in SSE4.1 and is slower than pmaxub. Unsigned additions and subtractions (paddusb and psubusb) may be used, but the addition of the score matrix vector then requires two instructions. First the score vector including a bias must be added; then the bias must be subtracted.
Versions using 16-bit and 63-bit score ranges are also implemented and used when overflow is detected in computations with lower score ranges. The 16-bit version allows 8 parallel computations. When potential overflow is detected in computations with a narrow score range, the alignment score for that database sequence is recalculated using the next wider score range (first 16-bit and then 63-bit if necessary). Because rather few sequences will usually reach a score that cannot be represented by 7 bits, the additional computation time for wider score ranges are usually negligible.
Recalculations with a wider score range are carried out on a subset of sequences after all sequences in a chunk of database sequences (see below) have been processed using the narrower score range.
Database sequences were stored in the NCBI BLAST database format, produced by the formatdb tool. This is a binary format where the sequence information is split into at least 3 files: indices (.pin), headers (.phr) and sequences (.psq). The file format allows efficient reading of sequences into memory. Protein sequences are stored using byte values in the range 1-24 and 26-27 representing amino acid residues A-I, K-N, P-T, V-Z, U, O and J, respectively. Sequences are separated by a zero byte, which simplifies the check for sequence ends.
Database sequences are retrieved using memory mapping of the .pin and .psq files. This is an efficient and convenient method of accessing the sequences, in which the operating system manages reading data into memory from disk as necessary, concurrently with program execution. The sequence database is divided into 100 chunks per thread. Each chunk contains approximately the same number of sequences. The sequences in one chunk are mapped into memory and processed before the next chunk is mapped. This results in a small memory footprint of the program.
SWIPE uses multiple threads (pthreads) that work on different parts of the sequence database. The number of threads is specified when starting the program and should in general be equal to the number of cores of the computer. For the latest generations of Intel processors with hyper-threading, a number of threads equal to the number of logical cores is usually most effective. Chunks of database sequences are assigned to the threads as they are ready for more work, so the threads may not process exactly the same number of chunks each. Results from the threads are inserted into a common hit list after each chunk is processed.
The SWIPE software was benchmarked against BLAST, BLAST+, STRIPED and SWPS3 under many different conditions to measure speed. The performance using a variable number of threads and the effect of query sequence length was studied. Additionally, the impact of different scoring systems, both substitution score matrices and gap open and extension penalties was examined.
Figure 6B and Figure 6C illustrates the performance with queries of varying length using either 24 threads (B) or a single thread (C). 32 different sequences with lengths ranging from 24 to 5478 amino acid residues were used as queries.
SWIPE had a rather flat performance curve. For queries shorter than about 100 residues there was a gradual loss in performance, especially when running with many threads. The speed was also slightly reduced for very long queries when using many threads. The performance ranged from 23.1 to 110.1 GCUPS for 24 threads and from 3.9 to 9.8 GCUPS for a single thread.
The SWPS3 program was very dependent on query length with speeds ranging from 0.94 to 49.0 GCUPS on 24 threads and between 1.1 and 4.6 GCUPS on a single thread.
The STRIPED program compiled with the Intel compiler was also quite dependent on the query length, in particular when running on 24 threads, with speeds ranging from 1.2 to 46.6 GCUPS. On a single thread, the speed of STRIPED varied between 0.8 and 5.7 GCUPS. STRIPED compiled with the GNU compiler was in general slightly slower, particularly with longer queries.
The speed of the BLAST programs seemed somewhat faster with longer query sequences with speeds ranging from 52.8 to 374.1 and 30.6 to 360.2 GCUPS for BLAST and BLAST+, respectively, but the performance varied a bit from sequence to sequence. There was a noticeable drop in performance with queries shorter than about 100 residues.
The SWIPE software greatly increases the speed of sequence database searches based on the Smith-Waterman algorithm compared to earlier SIMD implementations, being more than six times faster in realistic cases. SWIPE was found to be performing at a speed of 106 GCUPS with a 375 residue query sequence on a dual Intel Xeon X5650 six-core processor system. The speed was a bit dependent on the query length, but independent of the scoring system used. The maximum speed corresponds to the processing of more than 3.3 cells per physical core in each clock cycle.
Using GPUs, speeds of up to 42.6 GCUPS have been reported for CUDASW++ 2.0 on a Nvidia GeForce GTX 480 graphics card . This is comparable to the expected performance of SWIPE on a quad-core CPU.
SWIPE scaled almost linearly with the number of threads used up to 12 threads, corresponding to the number of physical cores available. The X5650 processors features hyper-threading which makes it possible to obtain extra performance using more than one thread per physical core, unless all execution units of busy. Almost no increase in performance beyond 12 threads was observed, so apparently the execution units are fairly busy when SWIPE is running on 12 threads. SWIPE scaled much better than SWPS3 and STRIPED with multiple threads. The maximum speed of SWIPE was 6.5 times faster than SWPS3, the fastest of these, using the 375 residue query. When all programs were running on a single thread, SWIPE was about 2.5 times faster. There seems to be two equally important factors responsible for the differences in speed. Based on the single thread performance numbers, it seems like the use of the inter-sequence parallelisation approach instead of the striped approach is responsible for about half of the increase in speed. Efficient thread parallelisation seems responsible for the other half.
In the comparisons with the STRIPED software it should be noted that Farrar (2007) apparently reported the "scan time" and not the complete running time for the program, excluding the time needed to read the database sequences into memory. The full running time is reported here. For the other programs, it is difficult to separate the scan time from the rest of the time used. STRIPED reads the FASTA formatted files directly and not files formatted by NCBI's formatdb tool. It appears that STRIPED initially parses the entire FASTA file and reads the database into memory using non-threaded code, resulting in low performance when using several threads and measuring the complete running time. Excluding the time for database reading leads to shorter run times and higher performance numbers. On the other hand, it is probably faster to read files in the binary NCBI format than parsing the FASTA text format.
The performance of SWIPE was not very dependent on query length, except for rather short and very long queries. Overhead costs incurred for each database residue probably reduced the performance of SWIPE for the shortest queries. For the longest queries there was a performance decrease when many threads were running. This is probably due to the effects of memory caches, of which some are shared between cores. The programs based on Farrar's approach were considerably more dependent on query length and performed better with increasing query length. The reason for this is probably that as the width of the stripes increase, the relative importance of the dependency between the stripes is reduced.
SWIPE was about twice as fast as BLAST using the BLOSUM50 matrix, while BLAST was twice as fast as SWIPE using the BLOSUM62 matrix. BLAST performance was found to be very dependent on the scoring matrix. The reason may be that BLAST in its heuristics uses an initial hit score threshold (T) that has a fixed default value (11) independent of the score matrix specified. Score matrices with relatively high expected values, e.g. BLOSUM50, will then trigger more initial hits than other matrices, e.g. PAM30.
If one would like to search using a query profile (position-specific scoring matrix) instead of a query sequence, the computation of the temporary score profile need to be carried out for every position of the query, not just for the 20 possible amino acid residues. This has not been implemented, but the resulting reduction in speed has been estimated to 30%.
The software described here should be considered a prototype indicating the performance potential of the approach. A later version that at least computes the actual alignments (not just the alignment score) and the statistical significance of the matches is planned.
Efficient parallelisation using SIMD on standard hardware now allows Smith-Waterman database searches to run considerably faster than before. In the new tool SWIPE, residues from sixteen different database sequences are compared in parallel to one query residue. Using a 375 residue query sequence a speed of 106 billion cell updates per second (GCUPS) was achieved on a dual Intel Xeon X5650 six-core processor system, which is more than six times faster than software based on Farrar's approach, the previous fastest implementation. Furthermore, for the first time, the speed of a Smith-Waterman based search has been shown to clearly exceed that of BLAST at least with one particular scoring matrix.
Since the slow speed has been the major drawback limiting the usefulness of Smith-Waterman based searches, the approach described here could significantly widen the potential application of such searches. Other applications that require optimal local alignment scores, like short read mapping  or genome sequence assembly  could also benefit from improved performance of this method. The approach used here may probably also be applied to HMM-based searches.
This work was supported by the Research Council of Norway with a CoE grant to CMBN. Thanks to the Research Computing Services group at the Center for Information Technology at the University of Oslo for access to the Titan cluster and other services.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.