Sequence alignment is a popular bioinformatics application that determines the degree of similarity between nucleotide or amino acid sequences that are assumed to share an ancestral relationship. The optimal local alignment of a pair of sequences can be computed by the dynamic programming (DP) based Smith-Waterman (SW) algorithm[1]. However, this approach is expensive in terms of both time and memory. Furthermore, the exponential growth of available biological data[2] means that the computational power needed is growing exponentially as well.

The recent emergence of accelerator technologies such as FPGAs, GPUs and specialized processors has made it possible to achieve substantial improvements in execution time for many bioinformatics applications, compared to current general-purpose platforms. However, special-purpose hardware implementations such as FPGAs [3, 4] tend to be very expensive and hard to program. Hence, they are not suitable for many users. Recent uses of easily accessible accelerator technologies to improve the search time of the SW algorithm include Intel SSE2[5], GPU[6] and CUDA[7].

Farrar[5] exploits the SSE2 SIMD multimedia extension of general-purpose CPUs. His implementation utilizes vector registers, which are parallel to the query sequence and are accessed in a striped pattern. Similar to the implementation by Rognes [8], a query profile is calculated only once for each database search. However, Farrar's implementation allows the conditional calculation of the *F*-matrix to be moved outside the inner loop. As a result, this implementation achieves a speedup by a factor of 2–8 over the previous SIMD implementations by Wozniak[9] and Rognes[8].
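The striped query profile described above can be illustrated with a small, scalar Python sketch. It is only a model of the data layout, not Farrar's SSE2 code: the function name `striped_query_profile`, the `lanes` parameter (the number of elements per vector register), the toy match/mismatch scoring and the zero padding of unused lanes are all assumptions made for illustration.

```python
def striped_query_profile(query, alphabet, sbt, lanes):
    """Precompute, once per query, the substitution scores of every
    alphabet residue against the query, stored in a striped layout.

    The query is split into `lanes` segments of length seg_len.
    Vector k holds query positions k, k + seg_len, k + 2*seg_len, ...
    """
    seg_len = (len(query) + lanes - 1) // lanes  # ceil(m / lanes)
    profile = {}
    for residue in alphabet:
        vectors = []
        for k in range(seg_len):
            vec = []
            for lane in range(lanes):
                pos = k + lane * seg_len
                # Lanes past the end of the query are padded with 0
                # (an assumption of this sketch).
                vec.append(sbt(residue, query[pos]) if pos < len(query) else 0)
            vectors.append(vec)
        profile[residue] = vectors
    return profile


def toy_sbt(a, b):
    """Hypothetical scoring: +2 for a match, -1 for a mismatch."""
    return 2 if a == b else -1


profile = striped_query_profile("ACGTAC", "ACGT", toy_sbt, lanes=4)
print(profile["A"])
```

With a query of length 6 and 4 lanes, each residue gets two vectors: the first covers query positions 0, 2, 4 and the second covers positions 1, 3, 5, with the last lane padded. During a database search, the vectors for the current database residue can then be looked up instead of recomputing substitution scores.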

Liu et al. [10] first reported the implementation of the Smith-Waterman algorithm on graphics hardware. The SW algorithm is implemented using the streaming architecture of GPUs by reformulating it in terms of computer graphics primitives. The implementation relies on OpenGL, which requires converting the problem to the graphical domain and then converting the results back. Although it achieves high efficiency, programming in OpenGL requires specialized skills. Therefore, Manavski[7] re-implemented the SW algorithm on a GPU with the recently released C-based CUDA programming environment. The implementation performs from 2 to 30 times faster than any other previous attempt available on commodity hardware.

In this paper, we demonstrate how the *PlayStation*^{®} 3 (PS3), commodity hardware powered by the Cell Broadband Engine[11], can be used as a low-cost computational platform to accelerate the Smith-Waterman algorithm. Our implementation outperforms both the striped method on an Intel Core 2 Duo and the CUDA-based GPU implementation on a GeForce 8800 GTX.

### The Smith-Waterman Algorithm

The Smith-Waterman algorithm is used to determine the optimal local alignment between two nucleotide or protein sequences. The algorithm compares two sequences by computing the similarity score by means of dynamic programming (DP). Two elementary operations are used: substitution and insertion/deletion (also called a gap operation). The original algorithm was proposed by Smith and Waterman[1] with a complexity of O(m^{2}n) and was improved by Gotoh[12] to run at O(mn).

Consider two strings *S*_{1} and *S*_{2} with lengths *m* and *n*, respectively. The Smith-Waterman algorithm computes the similarity value *M*(*i*, *j*) of the two subsequences ending at positions *i* and *j* of *S*_{1} and *S*_{2}, respectively. For affine gap penalties, i.e. *α* ≠ *β*, the computation of *M*(*i*, *j*), for 1 ≤ *i* ≤ *m*, 1 ≤ *j* ≤ *n*, is given in the following equations 1–3:

*M*(*i*, *j*) = max{*M*(*i* - 1, *j* - 1) + *sbt*(*S*_{1}[*i*], *S*_{2}[*j*]), *E*(*i*, *j*), *F*(*i*, *j*), 0}, (1)

*E*(*i*, *j*) = max{*M*(*i*, *j* - 1) - *α*, *E*(*i*, *j* - 1) - *β*}, (2)

*F*(*i*, *j*) = max{*M*(*i* - 1, *j*) - *α*, *F*(*i* - 1, *j*) - *β*}, (3)

where *sbt* is a character substitution cost table, *α* is the cost of the initial gap, and *β* is the cost of each following gap. For linear gap penalties, i.e. *α* = *β*, the above recurrence relations can be simplified as shown in equation 4:

*M*(*i*, *j*) = max{*M*(*i* - 1, *j* - 1) + *sbt*(*S*_{1}[*i*], *S*_{2}[*j*]), *M*(*i*, *j* - 1) - *α*, *M*(*i* - 1, *j*) - *α*, 0} (4)

The initialization values are as follows: for 0 ≤ *i* ≤ *m*, 0 ≤ *j* ≤ *n*, *M*(*i*, 0) = *M*(0, *j*) = *E*(*i*, 0) = *F*(0, *j*) = 0. Each position of the matrix *M* is a similarity value. The two segments of *S*_{1} and *S*_{2} producing this value can be determined by a trace-back procedure.
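As an illustration of the affine-gap recurrences (equations 1–3) and the initialization values, the following minimal Python sketch fills the three matrices and returns the highest similarity value. It is a plain quadratic-memory reference version, not the accelerated implementation discussed in this paper. The function name `sw_affine_score` and the toy match/mismatch scoring function are assumptions introduced for the example.

```python
def sw_affine_score(s1, s2, sbt, alpha, beta):
    """Best local alignment score of s1 and s2 with affine gaps.

    sbt(a, b) is the substitution score, alpha the cost of the
    initial gap and beta the cost of each following gap.
    """
    m, n = len(s1), len(s2)
    # Row 0 and column 0 hold the all-zero initialization values.
    M = [[0] * (n + 1) for _ in range(m + 1)]
    E = [[0] * (n + 1) for _ in range(m + 1)]
    F = [[0] * (n + 1) for _ in range(m + 1)]
    best = 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            # Equation 2: gap along row i.
            E[i][j] = max(M[i][j - 1] - alpha, E[i][j - 1] - beta)
            # Equation 3: gap along column j.
            F[i][j] = max(M[i - 1][j] - alpha, F[i - 1][j] - beta)
            # Equation 1: strings are 0-indexed, the matrices 1-indexed.
            M[i][j] = max(M[i - 1][j - 1] + sbt(s1[i - 1], s2[j - 1]),
                          E[i][j], F[i][j], 0)
            best = max(best, M[i][j])
    return best


def toy_sbt(a, b):
    """Hypothetical scoring: +2 for a match, -1 for a mismatch."""
    return 2 if a == b else -1


print(sw_affine_score("AAAGGG", "AAATTGGG", toy_sbt, alpha=3, beta=1))
```

With these toy parameters the call prints 8: six matches (+12) and one gap of length two (-3 - 1). Setting `beta` equal to `alpha` reproduces the linear-gap case, where the same pair scores 6.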

Figure 1 illustrates an example of computing the local alignment between two sequences PAWHEAE and HEAGAWGHEE using the Smith-Waterman algorithm with the BLOSUM 50 scoring matrix [13]. The highest score in the matrix (+28) is the optimal score for the alignment. The trace-back procedure, shown in the form of arrows, shows that the optimal local alignment is AW-HE and AWGHE.
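The trace-back procedure can be sketched in Python as follows, using the linear-gap recurrence for brevity. This is an illustrative sketch only: it uses a hypothetical match/mismatch scoring and toy sequences rather than BLOSUM 50 and the sequences of Figure 1, and the function name `local_align` and its default parameters are assumptions.

```python
def local_align(s1, s2, match=2, mismatch=-1, gap=2):
    """Return (score, aligned_s1, aligned_s2) for the best local alignment."""
    m, n = len(s1), len(s2)
    M = [[0] * (n + 1) for _ in range(m + 1)]
    best, bi, bj = 0, 0, 0
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            sub = match if s1[i - 1] == s2[j - 1] else mismatch
            M[i][j] = max(M[i - 1][j - 1] + sub,
                          M[i][j - 1] - gap,
                          M[i - 1][j] - gap,
                          0)
            if M[i][j] > best:  # remember the first highest-scoring cell
                best, bi, bj = M[i][j], i, j

    # Trace back from the highest score until a zero cell is reached.
    a1, a2 = [], []
    i, j = bi, bj
    while M[i][j] > 0:
        sub = match if s1[i - 1] == s2[j - 1] else mismatch
        if M[i][j] == M[i - 1][j - 1] + sub:    # substitution
            a1.append(s1[i - 1]); a2.append(s2[j - 1])
            i, j = i - 1, j - 1
        elif M[i][j] == M[i][j - 1] - gap:      # gap in s1
            a1.append("-"); a2.append(s2[j - 1])
            j -= 1
        else:                                   # gap in s2
            a1.append(s1[i - 1]); a2.append("-")
            i -= 1
    return best, "".join(reversed(a1)), "".join(reversed(a2))


print(local_align("GATTACA", "GATCTACA"))
```

For this toy input the sketch returns a score of 12 with the alignment GAT-TACA and GATCTACA: seven matches and one gap. When several paths give the same score, the order of the checks in the loop decides which alignment is reported.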

### Cell Broadband Engine Architecture

The Cell Broadband Engine[14] (Cell BE) is a recently introduced single-chip heterogeneous multi-core processor developed by Sony, Toshiba and IBM. The Cell BE offers a unique assembly of thread-level and data-level parallelization options. It operates at the upper range of existing processor frequencies (3.2 GHz for current models) and is projected to run at more than 5 GHz in the near future. Examples of bioinformatics applications that have been ported to the Cell BE architecture include Folding@Home[15], FASTA[16], ClustalW[16] and RAxML[17].

The Cell BE combines an IBM PowerPC Processor Element (PPE) and eight Synergistic Processor Elements (SPEs)[11]. An integrated high-bandwidth bus called the Element Interconnect Bus (EIB) connects the processors and their ports to external memory and I/O devices. The block diagram of the Cell BE architecture is shown in Figure 2.

The PPE is a 64-bit Power Architecture core and contains a 64-bit general purpose register set (GPR), a 64-bit floating point register set (FPR), and a 128-bit Altivec register set. It is fully compliant with the 64-bit Power Architecture specification and can run 32-bit and 64-bit operating systems and applications. Each SPE is able to run its own individual application programs. Each SPE consists of a processor designed for streaming workloads, a local memory, and a globally coherent Direct Memory Access (DMA) engine. The EIB is a 4-ring structure, and can transmit 96 bytes per cycle, for a bandwidth of 204.8 Gigabytes/second. The EIB can support more than 100 outstanding DMA requests.

The most distinguishing feature of the Cell BE is the variety of processors it contains, i.e. the PPE and the SPEs. Heterogeneous multi-core systems can lead to decreased performance if both the operating system and the application are unaware of the heterogeneity. The PPE is designed to run the operating system and, in many cases, the top-level control thread of an application, while the SPEs are optimized for compute-intensive work and hence provide the bulk of the application performance.

The SPE can access RAM through direct memory access (DMA) requests. The DMA transfers are handled by the Memory Flow Controller (MFC). The MFC provides the interface, by means of the EIB, between the local storage of the SPE and main memory. The MFC supports DMA transfers as well as mailbox and signal-notification messaging between the SPE and the PPE and other devices. Data transferred between local storage and main memory must be 128-bit aligned. The size of each DMA transfer can be at most 16 KB. DMA-lists can be used for transferring large amounts of data (more than 16 KB). A list can have up to 2,048 DMA requests, each for up to 16 KB.
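The size limits above determine how a large buffer has to be split into individual DMA requests. The following Python sketch models only that arithmetic; it does not call the Cell BE SDK, and the function name `plan_dma_list` and its `(address, size)` entry format are hypothetical.

```python
MAX_DMA_BYTES = 16 * 1024   # at most 16 KB per DMA transfer
MAX_LIST_ENTRIES = 2048     # at most 2,048 requests per DMA-list
ALIGNMENT = 16              # 128-bit alignment, expressed in bytes


def plan_dma_list(address, size):
    """Split one large transfer into (address, size) DMA-list entries."""
    if address % ALIGNMENT != 0:
        raise ValueError("start address must be 128-bit aligned")
    entries = []
    offset = 0
    while offset < size:
        chunk = min(MAX_DMA_BYTES, size - offset)
        entries.append((address + offset, chunk))
        offset += chunk
    if len(entries) > MAX_LIST_ENTRIES:
        raise ValueError("transfer does not fit into a single DMA-list")
    return entries


print(plan_dma_list(0, 40 * 1024))
```

A 40 KB buffer is split into two 16 KB requests and one 8 KB request. Under these limits a single DMA-list can describe at most 2,048 × 16 KB = 32 MB, so larger transfers would need several lists.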

The PS3 uses the Cell Broadband Engine as its CPU, making it possible for users to create a high-powered computing environment for a fraction of the cost of a Cell Blade server. The PS3 utilizes seven of the eight SPEs; the eighth SPE is disabled to improve chip yields, i.e. chips do not have to be discarded if one of the SPEs is defective. Only six of the seven SPEs are accessible to developers, as one is reserved by the operating system. The power requirement for the PS3 is 120 V AC, 60 Hz, and the power consumption is approximately 380 W. Generally available PS3s can be used for scientific high performance computing through installation of Linux (e.g. Red Hat or Yellow Dog). Programs can be developed using the freely available C-based Cell BE SDK [18]. At the time of this writing, the retail price of the PlayStation^{®} 3 is US$399 for 40 GB and US$480 for 60 GB, while the retail price of the Nvidia GeForce 8800 GTX card is US$529, and a Dell Optiplex 745 with an Intel Core 2 Duo 2.4 GHz processor is US$871. A QS20 Blade Server with two Cell BE chips has a retail price of US$18,995. Thus, the PS3 offers a good alternative to other accelerator technologies.
