Parallel algorithms for large-scale biological sequence alignment on Xeon-Phi based clusters
- Haidong Lan†1,
- Yuandong Chan†1,
- Kai Xu1,
- Bertil Schmidt2,
- Shaoliang Peng3 and
- Weiguo Liu1
© The Author(s) 2016
Published: 19 July 2016
Computing alignments between two or more sequences is a common operation in computational molecular biology. The continuing growth of biological sequence databases establishes the need for efficient parallel implementations of these operations on modern accelerators.
This paper presents new approaches to high performance biological sequence database scanning with the Smith-Waterman algorithm and the first stage of progressive multiple sequence alignment based on the ClustalW heuristic on a Xeon Phi-based compute cluster. Our approach uses a three-level parallelization scheme to take full advantage of the compute power available on this type of architecture; i.e. cluster-level data parallelism, thread-level coarse-grained parallelism, and vector-level fine-grained parallelism. Furthermore, we re-organize the sequence datasets and use Xeon Phi shuffle operations to improve I/O efficiency.
Evaluations show that our method achieves a peak overall performance up to 220 GCUPS for scanning real protein sequence databanks on a single node consisting of two Intel E5-2620 CPUs and two Intel Xeon Phi 7110P cards. It also exhibits good scalability in terms of sequence length and size, and number of compute nodes for both database scanning and multiple sequence alignment. Furthermore, the achieved performance is highly competitive in comparison to optimized Xeon Phi and GPU implementations. Our implementation is available at https://github.com/turbo0628/LSDBS-mpi.
The two applications considered are:
- scanning of protein sequence databases with the Smith-Waterman algorithm, and
- distance matrix computation for multiple sequence alignment (i.e. the first stage of the popular ClustalW heuristic).
Three levels of parallelization are required in order to exploit the compute power available in a cluster of Xeon Phis. Parallelization within a Xeon Phi is usually based on the “scale-and-vectorize” approach: scaling across up to 61 cores requires several hundred threads, while exploiting the 512-bit wide vector units requires SIMD vectorization within each core. Recent examples of efficient parallelization on Xeon Phis include scientific computing, bioinformatics [6–10], and database operations. Furthermore, parallelization between Xeon Phis adds another level of message passing based parallelism. This level needs to consider data partitioning, load balancing, and task scheduling. The accelerator-based approach is motivated by the fact that the performance of many-core architectures is growing. For example, the 2nd generation Xeon Phi processor named “Knights Landing” has already been announced.
The rest of this paper is organized as follows. The “Related work” Section provides important background information about the Xeon Phi programming model, pairwise and multiple sequence alignment, and hardware accelerated alignment algorithms. Our single-node parallel algorithms are presented in the “Algorithms on a single node” Section. The “Cluster level data parallelization” Section describes our cluster-level parallelization. Section “Results and discussion” evaluates performance. Some conclusions are drawn in Section “Conclusion”.
Programming models on Xeon Phi coprocessor
Xeon Phi is a coprocessor connected via the PCI Express (PCIe) bus to a host CPU. From a hardware perspective, it contains up to 61 x86-compatible cores. Each core features a 512-bit vector processing unit (VPU) based on a new instruction set. The cache hierarchy contains a 32 KB L1 data cache and a 512 KB L2 cache per core. The cores are connected via a bidirectional ring bus which enables L2 cache coherence based on a directory-based protocol. Each core can execute up to four threads at the same time.
Assuming a Xeon Phi with 61 usable cores running at 1.238 GHz, we can determine the peak performance for 32-bit integer (integer arithmetic is commonly used for sequence alignment calculations) operations as follows: 16 (#SIMD lanes) × 1 integer operation × 1.238 GHz × 61 (#cores) = 1.208 Tera integer operations per second.
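The peak-performance arithmetic above can be reproduced with a few lines of Python. This is only a worked check of the stated numbers; the function and parameter names are illustrative and do not come from the paper's code.

```python
def peak_int_ops_per_second(vector_bits, element_bits, ops_per_lane, clock_hz, cores):
    """Theoretical peak of SIMD integer operations per second."""
    simd_lanes = vector_bits // element_bits  # 512-bit VPU / 32-bit integers = 16 lanes
    return simd_lanes * ops_per_lane * clock_hz * cores


# Values taken from the text: 61 usable cores at 1.238 GHz, one integer op per lane per cycle.
peak = peak_int_ops_per_second(512, 32, 1, 1.238e9, 61)
print(f"{peak / 1e12:.3f} Tera integer operations per second")  # prints 1.208 ...
```

The figure is an upper bound: it assumes every lane of every core retires one useful integer operation in every cycle.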
From a software perspective, three programming models can be used in order to harness the compute power of the Xeon Phi: (i) native model, (ii) offload model, and (iii) symmetric model. In this paper, we have chosen the offload model. In this model, code sections and data can be offloaded from the host CPU to the Xeon Phi. Using OpenMP pragmas, offload regions can be specified. When encountering such a region during program execution, the necessary data transfers between host and Xeon Phi are performed and the code inside the (parallelized) region is executed on the Xeon Phi.
Pairwise sequence alignment and database search
The database search application considered in this paper scans a protein sequence database using a single protein sequence as a query (similar to BLASTP). In contrast to the BLASTP heuristic, which follows a seed-and-extend approach, we calculate the score of an optimal local alignment between the query and each subject sequence using the Smith-Waterman algorithm with affine gap penalties. The subject sequences are ranked by this score. Actual alignments are computed only for the top-ranked database sequences, which takes a negligible amount of time in comparison to the score-only search procedure. Note that the score-only Smith-Waterman computation can be performed in linear space and quadratic time with respect to the lengths of the two sequences being aligned.
The iterative computation of these matrices starts from the initial values $H_A(i,0) = H_A(0,j) = E(i,0) = F(0,j) = 0$ for all $0 \le i \le q$, $0 \le j \le s$, where $q$ and $s$ denote the lengths of the query and the subject sequence.
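The recurrence relations themselves are not reproduced in this text, so the following Python sketch falls back on the standard affine-gap (Gotoh) formulation of Smith-Waterman. It is a scalar illustration of the score-only, linear-space computation, not the vectorized implementation described in the paper. Two things are assumptions made for the example: a gap of length k is charged α + kβ, and a simple match/mismatch score stands in for a substitution matrix such as BLOSUM62.

```python
def sw_score(query, subject, match=2, mismatch=-1, alpha=10, beta=2):
    """Score-only Smith-Waterman with affine gaps, using O(len(subject)) memory.

    Assumed convention: a gap of length k costs alpha + k * beta.
    match/mismatch is a hypothetical stand-in for a substitution matrix.
    """
    n = len(subject)
    h_row = [0] * (n + 1)  # H(i-1, j) before the update of column j, H(i, j) after it
    f_col = [0] * (n + 1)  # F(i-1, j); F(0, j) = 0
    best = 0
    for qc in query:
        diag = 0    # H(i-1, j-1), starting with H(i-1, 0) = 0
        h_left = 0  # H(i, j-1), starting with H(i, 0) = 0
        e = 0       # E(i, j-1), starting with E(i, 0) = 0
        for j in range(1, n + 1):
            e = max(e - beta, h_left - alpha - beta)            # gap extended along the row
            f = max(f_col[j] - beta, h_row[j] - alpha - beta)   # gap extended down the column
            sub = match if qc == subject[j - 1] else mismatch
            h = max(0, e, f, diag + sub)
            diag = h_row[j]  # keep H(i-1, j) as the next diagonal value
            h_row[j] = h
            f_col[j] = f
            h_left = h
            best = max(best, h)
    return best
```

Only one row of H and one row of F are kept, which is what makes the memory footprint linear in the subject length while the running time stays proportional to the product of both lengths.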
Progressive multiple sequence alignment
The time complexity of computing an optimal multiple alignment of more than two sequences grows exponentially in terms of the number of input sequences. Thus, heuristic approaches with polynomial complexities must be used in practice for large inputs to approximate the (generally unknown) optimal multiple alignment.
- (a) Distance matrix: For each input sequence pair, a distance value is computed based on the Smith-Waterman algorithm.
- (b) Guide tree: The distance matrix computed in the previous step is taken as input to compute an evolutionary tree using the neighbor-joining method.
- (c) Progressive alignment: Following the branching order of the tree, a multiple sequence alignment is built progressively.
Hardware accelerated alignment algorithms
We briefly review some previous work on accelerating pairwise alignment (based on Smith-Waterman) and progressive multiple sequence alignment (based on ClustalW) on a number of parallel computer architectures. A number of SIMD implementations have been designed in order to harness the vector units of common multi-core CPUs (e.g. [15–21]) or the Cell/BE (e.g. [22, 23]). Recent years have seen increased interest in accelerating sequence alignment on massively parallel GPUs. Initially, programming these graphics chips for bioinformatics applications still required programming with shaders using languages such as OpenGL. The release of CUDA in 2007 made the usage of GPUs for general purpose computing more accessible, and a number of CUDA-enabled Smith-Waterman implementations have subsequently been presented [4, 25–33]. A number of MPI-based solutions for progressive multiple sequence alignment target PC clusters [34–37]. FPGAs [38–41], which are based on reconfigurable hardware, are another attractive architecture for sequence analysis. However, in comparison to the other architectures mentioned, FPGAs are often less accessible and generally more difficult to program.
The solution in this paper is based on a cluster of Xeon Phis. Compared to common CPUs, a Xeon Phi contains significantly more cores and often a wider vector unit. Different from CUDA-enabled GPUs, a Xeon Phi provides x86 compatibility, which often simplifies the implementation process. Nevertheless, achieving near-optimal performance is still a challenge which needs to be addressed by parallel algorithm design and efficient implementation. In this paper we demonstrate how this can be done for protein sequence database search and distance matrix computation for multiple sequence alignment.
We have designed new algorithms which can handle searching tasks for large-scale protein databases on Xeon Phi clusters.
We have designed new algorithms for calculating large-scale multiple sequence alignments on Xeon Phi clusters.
We have implemented our multiple sequence alignment algorithm using the offload model to make full use of the compute power of both the multi-core CPUs and the many-core Xeon Phi hardware.
Algorithms on a single node
Protein sequence database search
We have observed two facts: (1) protein sequence database search has inherent data parallelism; (2) each VPU on the Xeon Phi can efficiently execute multiple integer operations in SIMD fashion. Based on these two facts, we have partitioned the database search process on a single node into two data parallel parts: device level and thread level. The device level data parallel part runs on the host CPU. It splits the subject database into multiple batches that can be distributed to the CPU and Xeon Phi devices. The thread level data parallel part processes these batches locally. In order to support search tasks for large-scale databases, we have designed a dynamic data distribution framework that distributes the batches to both the host CPU and the Xeon Phi devices. In order to avoid the performance loss observed for long query sequences, we have also proposed a multi-pass algorithm in which long query sequences are partitioned into multiple short subsequences that are searched in consecutive passes. More implementation details of this algorithm are presented in an earlier publication.
The distance matrix computation stage of ClustalW is typically a major runtime bottleneck. Thus, in our work we have concentrated only on designing a parallel algorithm for this stage. ClustalW bases the distance computation between two protein sequences $S_i$ and $S_j$ on the following concept:
whereby $nid(S_i, S_j)$ is defined as the number of exact matches in an optimal local alignment between $S_i$ and $S_j$, and $l_i$ ($l_j$) is the length of $S_i$ ($S_j$).
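The distance equation is not reproduced in this text. A commonly cited form of the ClustalW distance, assumed here, is d(S_i, S_j) = 1 − nid(S_i, S_j) / min{l_i, l_j}. The Python sketch below fills a symmetric distance matrix under that assumption. The `nid_fn` callback is hypothetical and stands for whatever routine counts exact matches in an optimal local alignment; `toy_nid` is only a placeholder that counts position-wise matches and is not an alignment.

```python
from itertools import combinations


def clustalw_distance(nid, len_i, len_j):
    """Assumed form: 1 - nid / min(len_i, len_j); both sequences must be non-empty."""
    return 1.0 - nid / min(len_i, len_j)


def distance_matrix(seqs, nid_fn):
    """Fill the symmetric distance matrix over all n * (n - 1) / 2 sequence pairs."""
    n = len(seqs)
    dist = [[0.0] * n for _ in range(n)]
    for i, j in combinations(range(n), 2):
        d = clustalw_distance(nid_fn(seqs[i], seqs[j]), len(seqs[i]), len(seqs[j]))
        dist[i][j] = dist[j][i] = d
    return dist


def toy_nid(a, b):
    """Placeholder only: counts ungapped position-wise matches, NOT an optimal local alignment."""
    return sum(x == y for x, y in zip(a, b))


matrix = distance_matrix(["ACGT", "ACGA", "TT"], toy_nid)
```

Every pair is independent of every other pair, which is why this stage lends itself to the data-parallel distribution across threads, devices and nodes discussed in the paper.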
In our implementation, the sizes of these two temporary vectors are 16 for the Xeon Phi and 8 for the CPU.
Cluster level data parallelization
Dispatcher (Master): Partitions the subject database or the MSA tasks into a number of chunks in a preprocessing step and sends them to the compute nodes.
Algorithms on a Single Node (Worker): Receives sequence chunks from master and performs the corresponding DP calculations.
Result Collector (Master): Performs additional operations required to further process the returned results.
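The three roles listed above can be mimicked in a single process to show how data flows between them. The sketch below is an assumption-laden illustration: the chunk sizing rule (a residue budget per chunk), the function names and the `score_fn` callback are all hypothetical, and the message passing between master and workers is replaced by plain function calls.

```python
def partition(seqs, chunk_residues):
    """Dispatcher: greedily pack sequences into chunks of at most ~chunk_residues residues."""
    chunks, current, size = [], [], 0
    for s in seqs:
        if current and size + len(s) > chunk_residues:
            chunks.append(current)
            current, size = [], 0
        current.append(s)
        size += len(s)
    if current:
        chunks.append(current)
    return chunks


def worker(query, chunk, score_fn):
    """Worker: score every subject sequence in one chunk against the query."""
    return [(score_fn(query, s), s) for s in chunk]


def collect(partial_results, top_k):
    """Result collector: merge per-chunk results and keep the top_k highest scores."""
    merged = [hit for part in partial_results for hit in part]
    return sorted(merged, key=lambda hit: hit[0], reverse=True)[:top_k]


def toy_score(q, s):
    """Placeholder score (position-wise matches); a real worker would run the DP kernel."""
    return sum(a == b for a, b in zip(q, s))


chunks = partition(["ACGA", "TTTT", "ACGT"], chunk_residues=8)
hits = collect([worker("ACGT", c, toy_score) for c in chunks], top_k=2)
```

Sizing chunks by residue count rather than by sequence count keeps the work per chunk roughly even when sequence lengths vary widely.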
Protein sequence database search
Results and discussion
Intel Xeon Phi 7110P: 61 hardware cores, 1.1 GHz processor clock speed, 8 GB GDDR5 device memory.
Intel Xeon Phi 31S1P: 57 hardware cores, 1.1 GHz processor clock speed, 8 GB GDDR5 device memory.
Table 1 Test cluster configurations
Node | Host CPUs | Xeon Phi coprocessors
N1 | Xeon E5-2620 (6 cores) × 2 | Xeon Phi 7110P × 1
N2 | Xeon E5-2620v2 (6 cores) × 2 | Xeon Phi 7110P × 2
N3 | Xeon E5-2650v2 (8 cores) × 2 | Xeon Phi 31S1P × 4
Protein sequence database search
A performance measure commonly used in computational biology to evaluate Smith-Waterman implementations is cell updates per second (CUPS). One cell update comprises the complete computation of one entry of the DP matrix, including all comparisons, additions and maximum operations.
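As a worked example of this measure, the following Python sketch converts a search run into GCUPS (10^9 cell updates per second), using the usual convention that a database scan updates query length × total database residues cells. The query length and runtime in the example are hypothetical; only the residue count is taken from the first database described in this section.

```python
def gcups(query_length, database_residues, seconds):
    """Giga cell updates per second for one query scanned against a whole database."""
    cells = query_length * database_residues  # one DP cell per (query residue, subject residue) pair
    return cells / seconds / 1e9


# Hypothetical run: a 1000-residue query against 5,943,361,275 residues in 30 s.
example = gcups(1000, 5_943_361_275, 30.0)
print(f"{example:.1f} GCUPS")
```

Because the measure normalizes by both query length and database size, it allows runs with different queries and databases to be compared directly.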
We have scanned three protein sequence databases: (i) the 7.5 GB UniProtKB/Reviewed and Annotated (5,943,361,275 residues in 16,110,751 sequences), (ii) the 18 GB UniProtKB/TrEMBL (13,630,914,768 residues in 42,821,879 sequences), and (iii) the 37 GB merged Non-Redundant plus UniProtKB/TrEMBL (24,323,686,690 residues in 73,401,766 sequences) for query sequences with varying lengths. Query sequences used in our tests have the accession numbers P01008, P42357, P56418, P07756, P19096, P0C6B8, P08519, and Q9UKN1.
Performance on a single node
We first compared the single-node performance of our methods to SWAPHI and CUDASW++ 3.1. SWAPHI is another parallel Smith-Waterman algorithm for Xeon Phi-based neo-heterogeneous architectures. It is also implemented using the offload model. However, SWAPHI can only run search tasks on the Xeon Phi; i.e. it does not exploit the computing power of multi-core CPUs. SWAPHI also cannot handle search tasks for large-scale biological databases: in our tests, we found that the database size it supports must be less than the available RAM size, i.e. 16 GB. CUDASW++ 3.1 is currently the fastest available Smith-Waterman implementation for database searching. It makes use of the compute power of both the CPU and the GPU. On the CPU side, CUDASW++ 3.1 carries out parallel database searching by invoking the SWIPE program. On the GPU side, it employs CUDA PTX SIMD video instructions to exploit data parallelism. The database size supported by CUDASW++ 3.1 must be less than the memory size available on the GPU. Neither SWAPHI nor CUDASW++ 3.1 supports clusters.
For single-node tests, we have used the N2 node (see Table 1) as the test platform. In our experiments, we run our methods with 24 threads on the two Intel E5-2620 v2 six-core 2.0 GHz CPUs and with 240 threads on each Intel Xeon Phi 7110P. We execute SWAPHI with 240 threads on each Xeon Phi 7110P. We have executed CUDASW++ 3.1 on another server with the same two Intel E5-2620 v2 six-core 2.0 GHz CPUs plus two Nvidia Tesla Kepler K40 GPUs with ECC enabled. CUDASW++ 3.1 also uses 24 CPU threads. Unless specified otherwise, default parameters are used for both SWAPHI and CUDASW++ 3.1. Furthermore, all available compiler optimizations have been enabled. The parameters α=10 and β=2 have been used in our experiments. The substitution matrix used is BLOSUM62.
SWAPHI and CUDASW++ 3.1 cannot support search tasks for the 18 GB and 37 GB databases. Thus, we only use our methods to search them. Figure 10a also reports the performance of our methods for searching these two databases. The results show that our methods can handle large-scale database search tasks efficiently.
Performance on a cluster
Figure 10b shows the performance of our methods using all three cluster nodes. The result indicates that our methods exhibit good scalability in terms of sequence length and size, and number of compute nodes. Our method achieves a peak overall performance of 730 GCUPS on the Xeon Phi-based cluster.
Test datasets for MSA
Performance for processing medium-scale datasets
Performance for processing large-scale datasets
We have presented two parallel algorithms for protein sequence alignment based on the dynamic programming concept which can be efficiently mapped onto Xeon Phi clusters. Our methods exhibit good performance on a single compute node as well as good scalability in terms of sequence length and size, and number of compute nodes for both protein sequence database search and distance matrix computation employed in multiple sequence alignment. Furthermore, the achieved performance is highly competitive in comparison to other optimized Xeon Phi and GPU implementations. Biological sequence databases are continuously growing, establishing the need for even faster parallel solutions in the future. Hence, our results are especially encouraging since the performance of many-core architectures grows much faster than Moore’s law as it applies to CPUs. For instance, a performance improvement of at least a factor of 3 can be expected on the already announced next-generation Xeon Phi product.
Publication of this article was funded by the PPP project from CSC and DAAD, Taishan Scholar, and NSFC Grants 61272056 and U1435222.
This article has been published as part of BMC Bioinformatics Vol 17 Suppl 9 2016: Selected articles from the IEEE International Conference on Bioinformatics and Biomedicine 2015: genomics. The full contents of the supplement are available online at http://bmcbioinformatics.biomedcentral.com/articles/supplements/volume-17-supplement-9.
Availability of data and materials
Project name: LSDBS-mpi
Project homepage: https://github.com/turbo0628/LSDBS-mpi
Operating System: Linux
Programming Language: C++
HL, BS, and WL designed the study, wrote and revised the manuscript. HL, YC, and KX implemented the algorithm, performed the tests, analysed the results. BS, SP, and WL contributed the idea of using Knights Corner instructions and Xeon Phi clusters, participated in the algorithm optimization, analysed the results. All authors read and approved the final manuscript.
The authors declare that they have no competing interests.
Consent for publication
Ethics approval and consent to participate
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
1. Schmidt B, Schröder H, Schimmler M. Massively parallel solutions for molecular sequence analysis. International Parallel and Distributed Processing Symposium. IEEE: 2002. p. 0186.
2. Bader DA. Computational biology and high-performance computing. Commun ACM. 2004; 47(11):34–41.
3. Rajko S, Aluru S. Space and time optimal parallel sequence alignments. IEEE Trans Parallel Distrib Syst. 2004; 15(11):1070–81.
4. Liu Y, Schmidt B. SWAPHI: Smith-Waterman protein database search on Xeon Phi coprocessors. Application-specific Systems, Architectures and Processors (ASAP), 2014 IEEE 25th International Conference on. IEEE: 2014. p. 184–5.
5. Heinecke A, Vaidyanathan K, Smelyanskiy M, et al. Design and implementation of the Linpack benchmark for single and multi-node systems based on Intel Xeon Phi coprocessor. Parallel & Distributed Processing (IPDPS), 2013 IEEE 27th International Symposium on. IEEE: 2013. p. 126–37.
6. Pennycook SJ, Hughes CJ, Smelyanskiy M, et al. Exploring SIMD for Molecular Dynamics, Using Intel Xeon Processors and Intel Xeon Phi Coprocessors. Parallel & Distributed Processing (IPDPS), 2013 IEEE 27th International Symposium on. IEEE: 2013. p. 1085–97.
7. Wang L, Chan Y, Duan X, et al. XSW: Accelerating biological database search on Xeon Phi. Parallel & Distributed Processing Symposium Workshops (IPDPSW), 2014 IEEE International. IEEE: 2014. p. 950–7.
8. Liu Y, Maskell DL, Schmidt B. CUDASW++: optimizing Smith-Waterman sequence database searches for CUDA-enabled graphics processing units. BMC Res Notes. 2009; 2(1):73.
9. Lan H, Liu W, Schmidt B, et al. Accelerating large-scale biological database search on Xeon Phi-based neo-heterogeneous architectures. Bioinformatics and Biomedicine (BIBM), 2015 IEEE International Conference on. IEEE: 2015. p. 503–10.
10. Rucci E, García C, Botella G, Degiusti A, Naiouf M, Prieto-Matías M. An energy-aware performance analysis of SWIMM: Smith-Waterman implementation on Intel’s multicore and manycore architectures. Concurr Comput Pract Experience. 2015; 22(6):865–72.
11. Lu M, Zhang L, Huynh HP, et al. Optimizing the MapReduce framework on Intel Xeon Phi coprocessor. Big Data, 2013 IEEE International Conference on. IEEE: 2013. p. 125–30.
12. Thompson J, Higgins D, Gibson T. ClustalW: improving the sensitivity of progressive multiple sequence alignment through sequence weighting position specific gap penalties and weight matrix choice. Nucleic Acids Res. 1994; 22:4673–680.
13. Feng D, Doolittle R. Progressive sequence alignment as a prerequisite to correct phylogenetic trees. J Mol Evol. 1987; 25:351–60.
14. Saitou N, Nei M. The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol Biol Evol. 1987; 4:406–25.
15. Wozniak A. Using video-oriented instructions to speed up sequence comparison. Comput Appl Biosci. 1997; 13(2):145–50.
16. Rognes T, Seeberg E. Six-fold speed-up of Smith-Waterman sequence database searches using parallel processing on common microprocessors. Bioinformatics. 2000; 16(8):699–706.
17. Alpern B, Carter L, Su Gatlin K. Microparallelism and high-performance protein matching. Proceedings of the 1995 ACM/IEEE conference on Supercomputing. ACM: 1995. p. 24.
18. Rognes T. Faster Smith-Waterman database searches with inter-sequence SIMD parallelisation. BMC Bioinforma. 2011; 12.
19. Edgar RC. MUSCLE: multiple sequence alignment with high accuracy and high throughput. Nucleic Acids Res. 2004; 32(5):1792–7.
20. Notredame C, Higgins D, Heringa J. T-Coffee: A novel method for fast and accurate multiple sequence alignment. J Mol Biol. 2000; 302:205–17.
21. Chaichoompu K, Kittitornkun S, Tongsima S. MT-ClustalW: multithreading multiple sequence alignment. Parallel and Distributed Processing Symposium, 2006. IPDPS 2006. 20th International. IEEE: 2006. p. 8.
22. Wirawan A, Kwoh CK, Hieu NT, et al. CBESW: sequence alignment on the Playstation 3. BMC Bioinforma. 2008; 9(1):377.
23. Szalkowski A, Ledergerber C, Krähenbühl P, et al. SWPS3–fast multi-threaded vectorized Smith-Waterman for IBM Cell/BE and x86/SSE2. BMC Res Notes. 2008; 1(1):107.
24. Liu W, Schmidt B, Voss G, Mueller-Wittig W. Streaming algorithms for biological sequence alignment on GPUs. IEEE Trans Parallel Distrib Syst. 2007; 18(9):1270–81.
25. Liu Y, Schmidt B, Maskell DL. CUDASW++ 2.0: enhanced Smith-Waterman protein database search on CUDA-enabled GPUs based on SIMT and virtualized SIMD abstractions. BMC Res Notes. 2010; 3(1):93.
26. Liu Y, Wirawan A, Schmidt B. CUDASW++ 3.0: accelerating Smith-Waterman protein database search by coupling CPU and GPU SIMD instructions. BMC Bioinforma. 2013; 14(1):117.
27. Manavski S, Valle G. CUDA compatible GPU cards as efficient hardware accelerators for Smith-Waterman sequence alignment. BMC Bioinforma. 2008; 9(2):1.
28. Ligowski L, Rudnicki W. An efficient implementation of Smith-Waterman algorithm on GPU using CUDA, for massively parallel scanning of sequence databases. 2009 International Parallel and Distributed Processing Symposium. IEEE: 2009. p. 1–8.
29. Khajeh-Saeed A, Poole S, Perot JB. Acceleration of the Smith-Waterman algorithm using single and multiple graphics processors. J Comput Phys. 2010; 229(11):4247–58.
30. Blazewicz J, Frohmberg W, Kierzynka M, Pesch E, Wojciechowski P. Protein alignment algorithms with an efficient backtracking routine on multiple GPUs. BMC Bioinforma. 2011; 12:181.
31. Hains D, Cashero Z, Ottenberg M, et al. Improving CUDASW++, a parallelization of Smith-Waterman for CUDA enabled devices. Parallel and Distributed Processing Workshops and Phd Forum (IPDPSW), 2011 IEEE International Symposium on. IEEE: 2011. p. 490–501.
32. Liu Y, Schmidt B, Maskell DL. MSA-CUDA: multiple sequence alignment on graphics processing units with CUDA. Application-specific Systems, Architectures and Processors, 2009. ASAP 2009. 20th IEEE International Conference on. IEEE: 2009. p. 121–8.
33. Hung CL, Lin YS, Lin CY, Chung YC, Chung YF. CUDA ClustalW: An efficient parallel algorithm for progressive multiple sequence alignment on multi-GPUs. Comput Biol Chem. 2015; 58:62–8.
34. Li K. ClustalW analysis using parallel and distributed computing. Bioinformatics. 2003; 19:1585–6.
35. Ebedes J, Datta A. Multiple sequence alignment in parallel on a workstation cluster. Bioinformatics. 2004; 20:1193–5.
36. Cheetham J, Dehne F, Pitre S, et al. Parallel Clustal W for PC clusters. Computational Science and Its Applications - ICCSA 2003. Berlin Heidelberg: Springer; 2003. p. 300–9.
37. Tan J, Feng S, Sun N. Parallel multiple sequences alignment in SMP cluster. Int Conf High Perform Comput Asia Reg. 2005; 20:425–31.
38. Oliver T, Schmidt B, Maskell D. Hyper customized processors for bio-sequence database scanning on FPGAs. Proceedings of the 2005 ACM/SIGDA 13th international symposium on Field-programmable gate arrays. ACM: 2005. p. 229–37.
39. Li ITS, Shum W, Truong K. 160-fold acceleration of the Smith-Waterman algorithm using a field programmable gate array (FPGA). BMC Bioinforma. 2007; 8(1):1.
40. Oliver T, Schmidt B, Nathan D, Clemens R, Maskell D. Using reconfigurable hardware to accelerate multiple sequence alignment with ClustalW. Bioinformatics. 2005; 21:3431–432.
41. Boukerche A, Correa JM, de Melo ACMA, et al. An FPGA-based accelerator for multiple biological sequence alignment with DIALIGN. High Performance Computing-HiPC 2007. Berlin Heidelberg: Springer; 2007. p. 71–82.