- Open Access
MICA: A fast short-read aligner that takes full advantage of Many Integrated Core Architecture (MIC)
- Ruibang Luo†1,
- Jeanno Cheung†1,
- Edward Wu†1,
- Heng Wang†1, 2,
- Sze-Hang Chan1,
- Wai-Chun Law1,
- Guangzhu He1,
- Chang Yu1,
- Chi-Man Liu1,
- Dazong Zhou1,
- Yingrui Li1, 3,
- Ruiqiang Li4,
- Jun Wang1, 3,
- Xiaoqian Zhu2,
- Shaoliang Peng2 and
- Tak-Wah Lam1, 4Email author
© Luo et al.; licensee BioMed Central Ltd. 2015
- Published: 23 April 2015
Short-read aligners have recently gained a lot of speed by exploiting the massive parallelism of GPU. An uprising alterative to GPU is Intel MIC; supercomputers like Tianhe-2, currently top of TOP500, is built with 48,000 MIC boards to offer ~55 PFLOPS. The CPU-like architecture of MIC allows CPU-based software to be parallelized easily; however, the performance is often inferior to GPU counterparts as an MIC card contains only ~60 cores (while a GPU card typically has over a thousand cores).
To better utilize MIC-enabled computers for NGS data analysis, we developed a new short-read aligner MICA that is optimized in view of MIC's limitation and the extra parallelism inside each MIC core. By utilizing the 512-bit vector units in the MIC and implementing a new seeding strategy, experiments on aligning 150 bp paired-end reads show that MICA using one MIC card is 4.9 times faster than BWA-MEM (using 6 cores of a top-end CPU), and slightly faster than SOAP3-dp (using a GPU). Furthermore, MICA's simplicity allows very efficient scale-up when multiple MIC cards are used in a node (3 cards give a 14.1-fold speedup over BWA-MEM).
MICA can be readily used by MIC-enabled supercomputers for production purpose. We have tested MICA on Tianhe-2 with 90 WGS samples (17.47 Tera-bases), which can be aligned in an hour using 400 nodes. MICA has impressive performance even though MIC is only in its initial stage of development.
Availability and implementation
MICA's source code is freely available at http://sourceforge.net/projects/mica-aligner under GPL v3.
Supplementary information is available as "Additional File 1". Datasets are available at www.bio8.cs.hku.hk/dataset/mica.
- Graphic Processing Unit
- Seed Length
- Graphic Processing Unit Implementation
- Many Integrate Core
- Seed Alignment
With the rapid advance of sequencing technologies, there is continuously demand for faster and faster analysis. The recently announced Illumina HiSeq × Ten sequencing system promises to sequence 18,000 whole human genomes (30x) in one year (four such systems can sequence more genomes than in all of history), while cutting the cost to $1,000 each. To cater to such capacity, it is important to develop new analysis software to fully utilize available acceleration hardware in addition to the CPU. For example, SOAP3-dp uses a graphics processing unit (GPU) and is a few times faster than mainstream CPU aligners and delivers higher sensitivity. Besides GPU, attention has also fallen on Intel's new product Many Integrated Core (MIC), a.k.a. Xeon Phi Co-processor. MIC was introduced in 2011. It is an acceleration device whose hardware and system-software architecture support its use for general purpose computing. It is well suited to software implementations where computations on many thousands of data items can be carried out independently in parallel. The latest product has 57-61 cores and 8 GB of memory in one board, providing ~1 TFlops. Two of the top ten supercomputers in TOP500 (which ranks the world's 500 most powerful supercomputers) are equipped with MIC (Tianhe-2 has 48,000 MIC boards, and Stampede has 6,400 boards).
Experience has shown, however, that it is not easy to build useful read alignment software using massive core architectures such as GPU and MIC. The apparent problem is that the most biologically relevant sequence-alignment algorithm[2, 3] involves dynamic programming dependencies that are awkward to compute efficiently in parallel. The fastest GPU implementations of the algorithm to date relies on task parallelism, where each thread of execution computes an entire alignment independently of all other parallel threads. This, however, requires sequences to be aligned to have similar lengths to ensure balanced tasks distribution among cores. Another typical sequence alignment problem is that a short query sequence (100 to 250 bp) must be aligned with a comparatively long (3 Gbp or longer) reference sequence. Since a brute-force search for all candidate alignments in this setting would be computationally prohibitive, read aligners typically construct a list of candidate reference sequence locations within which potential alignments might be discovered. The size of a list depends on the complexity of the sequence to be aligned and this accounts for a significant proportion of the computational imbalance involved in read alignment using massive core architectures. Several high-throughput read aligners including BarraCUDA, CUSHAW and SOAP3-dp that utilize GPU acceleration have been developed in the past few years, but to this date, limited studies have been carried out on short read alignment using MIC. This paper introduces MICA, a new short-read aligner designed to fully utilize the computing power of MIC.
Algorithmically, MICA maintains a BWT index in each MIC, and it adopts SOAP3-dp's approach to align reads shorter than 150 bp, and uses a new approach to handle longer reads. SOAP3-dp was designed for HiSeq 2000 paired-end reads of 100 bp; its efficiency stems from the fact that a GPU, with a 2 way-BWT index, can efficiently align at least one end of a paired-end read with 0-2 mismatches; the other end can be aligned using dynamic programming in the GPU. For longer reads (150 bp or more), SOAP3-dp deteriorates in sensitivity and speed because GPU is inefficient to align with more mismatches using the index. Thus, we need a new approach. In the following, we divide the discussion into two parts: the first part describes the techniques on how to utilize the resources of MIC to match the performance of SOAP3-dp on GPU, and the second part is about the new techniques for aligning reads.
Our experiment revealed that an MIC core was 4 to 6 times slower than a CPU core when running programs designed for CPU, and an MIC with 57 cores might be comparable to a 12-core CPU when used for brute-force parallelization of a CPU program. Noteworthy, each MIC core has thirty-two 512-bit registers; each allows sixteen 32-bit data to be operated in parallel. Such extra parallelism, if exploited properly, can boost the efficiency dramatically. This requires a lot of engineering work, though. Our first success is on the BWT index, MICA exploits 512-bit operations to speed up different arithmetic and memory transaction operations when querying the BWT (SOAP3-dp is based on 64-bit operations). Next, we turn to dynamic programming, which is for aligning reads allowing indels and soft clipping. Below we give the details of a new parallel algorithm for dynamic programming that can utilize the 512-bit registers.
A. Highly parallel dynamic programming
B. New seeding strategies
For longer reads, we can only afford to use the BWT to align short fragments of a read (i.e., the seeds) and then count on dynamic programming to verify the candidate positions of the seed. Ideally we want fewer seeds (to save time on seed alignment) and the seeds should return correct candidate positions without too many incorrect ones (to save time on dynamic programming). MICA attempts to improve SOAP3-dp and other aligners by using as few seeds as possible, while maintains a balance between efficiency (fast & do not give too many candidates) and sensitivity (capture correct candidate positions for more reads). MICA uses a combination of the following strategies.
a) Non-branching mismatches
MICA allows more mismatches during seed alignment but without sacrificing efficiency. SOAP3-dp allows only one mismatch as the time for seed alignment increases exponentially with the number of mismatches (due to the branching of the search tree). We observe that very often a mismatch is unambiguous and there is exactly one way to correct it according to the reference genome. MICA allows 1-2 "non-branching" mismatches in addition to one "branching" mismatch for seed alignment.
For computing purposes, a BM increases the number of possible extensions much more than a NBM, and is more time-consuming to handle. Yet, we observed that very often a mismatch is unambiguous and there is exactly one way to correct it according to the reference genome (i.e. it is a NBM). Therefore, we allow MICA to find alignments with up to 2 non-branching mismatches, in addition to the 1 branching mismatch for seed alignment. This approach discovers more seeds than the "1 mismatch only" approach.
b) Read-sensitive seed length
MICA dynamically adjusts the seed length depending on the content of the read. If a region of the read appears to be non-repetitive, it uses a shorter seed to avoid missing critical candidate positions; for a region that appears to be repetitive, we choose a long seed to reduce the number of candidates.
More specifically, MICA follows SOAP3-dp's design for rapidly aligning those reads with < 3 mismatches and without an Indel, leaving those reads failed to be aligned (with ≥ 3 mismatches or with ≥ 1 Indel) to the Dynamic Programming (DP) module. This rapid mode works efficiently with 100 bp paired-end reads, where the slower DP module handles only about 10% of the reads. However, new sequencers such as Illumina HiSeq × Ten, produces paired-end reads 150 bp in length. The probability of a 150 bp read to span ≥ 3 mismatches or an Indel is higher than that of 100 bp, empirically forcing the DP module to handle 30-40% of the reads, which slows down the performance of MICA significantly. To enable MICA to work with 150 bp or longer reads efficiently, we have redesigned the workflow by disabling the rapid mode and using multiple rounds of seed discovery in the DP module to catch up to the speed. The longer seed lengths required in the former rounds limit the number of candidate alignment regions, where the shorter seed lengths in the latter rounds ensure the sensitivity, but requiring much more computation due to the large number of candidate alignment regions to be verified. Most reads with fewer mismatches and Indels thus could be aligned in the former rounds, leaving complicated reads for the latter rounds, which are more computationally demanding.
MICA and SOAP3-dp seeding details.
Seed # Limit
SOAP3-dp DP module:
c) Breaking seeds at variant positions
MICA also tries to exclude read positions that are suspicious to be errors or variants (especially Indels) as part of a seed.
The above seeding strategies are used in multiple rounds with different parameters to attain better speed without sacrificing sensitivity. The superior sensitivity of the seeds allows MICA to pair up the candidate positions from both ends of almost all reads before the DP. This saves a lot of time on verifying incorrect positions.
A. Software settings and machine specifications
a) MICA: Version r178. Compiled with Intel C/C++ Compiler version 13.1; one to three 57-core MIC cards (8G, ECC enabled), each coupled with a CPU core; and one CPU core for output (in SAM format).
b) SOAP3-dp: Version r176. Compiled with GCC 4.7.2 and CUDA SDK 5.5; one nVidia GeForce GTX680 (4G); 4 CPU cores; serial BAM output.
c) BWA-MEM: Version 0.7.5a. Compiled with GCC 4.7.2; 6 CPU cores. SAM output.
d) In general: Besides the acceleration devices (MIC/GPU), all other hardware is the same for all experiments. Specifically, Intel i7-3730k, 6-core @3.2 GHz, 64G memory
B. Real data comparison with other aligners
Performance of MICA, SOAP3-dp and BWA-MEM on experimental data.
# of Read Pairs (M)
3 cards scale-up
We have also benchmarked 100 bp paired-end reads  (PE100, ~49-fold). Interestingly, MICA's acceleration on PE150 is slightly more significant than on PE100. For the latter, MICA is ~3.4 and ~9.9 times faster than BWA-MEM when using one and three MIC cards, respectively. When processing many shorter reads, MICA has a bottleneck in accessing the memory.
A. Simulated data comparison with other aligners
We carried out simulated data test with short read simulator Mason to assess the accuracy and sensitivity of MICA. We tested with 2 sets (100 bp and 150 bp) of 6M Illumina-style paired-end (PE) reads with 500 bp insert size from GRCh37 major build.
Bowtie2, SeqAlto and GEM allow users to sacrifice accuracy and sensitivity for speed with switches. We applied "very-fast", "sensitive", and "very-sensitive" switches to Bowtie2, "fast (-f)" to SeqAlto, and "fast adaptive (--fast-mapping)", "fastest (--fast-mapping = 0)" to GEM. We used indices with full suffix array (SA) for SOAP3-dp. All parameters of MICA, SOAP3-dp and SOAP3 were set to default, and parameters for other aligners including BWAaln and CUSHAW2[13, 14] were set to favor the two read types and length. 13 sets of programs and parameters were compared in total.
Comparison on 13 sets of programs and parameters using 100 bp paired-end simulated reads.
6M 100 bp Paired-end reads, 1.2 Gbp bases. 500 bp insert size, 25 bp standard deviation.
MICA(1 MIC Card, 240 threads)
(Fast Mapping: adaptive)
(Fast Mapping: 0)
CPU (thread: core i7-3930k)
GPU (device: GTX680)
For 100 bp reads, MICA takes 178 seconds to align 6M read pairs, which is 16 seconds slower than SOAP3-dp (due to the increasing time in loading the index), but 1.67 to 11.09 times faster than other tools. MICA's sensitivity is 99.60% which is 0.06% lower than SOAP3-dp but 0.07-1.83% higher than other tools. MICA's FDR is 0.40%, which is 0.07-0.79% lower than other tools except being 0.06% higher than SOAP3-dp.
B. Production test on Tianhe-2 supercomputer
Tianhe-2 is currently the champion of the Top500 list of supercomputers. It has in total 16,000 computing nodes, each equipped with two 6-core Intel E5-2692 v2 @ 2.20 Hz CPU, 64 GB main memory and 3 MIC cards. Each MIC card has 57 cores and 8 GB on-board memory with ECC enabled.
To test Tianhe-2 with the workload of a population scale study, we used whole genome sequencing data of 45 CHB and 45 CHS samples (Supplementary Details) from the 1000 genomes project; a sample had 64.68-fold depth on average. In total, we had 932 lanes of 100 bp paired-end reads with size varying from 1.82- to 12.7-fold per lane.
Results of experiments on Tianhe-2 using 5 different settings.
MIC card used
Using 3 cards and 6-thread BAM output, the mean time consumption is about the same but the maximum time consumption is ~18 minutes longer. Interestingly, alignment using 2 cards is almost twice as fast as using 1 card, but using 3 cards is only slightly faster than using 2 cards, since the computation was throttled by the slow compression algorithm in the BAM output module. Splitting the output into two or more files to enable non-blocking parallel compression will help in exploiting the full power of all three cards.
The publication costs for this article were funded by Hong Kong GRF HKU-713512E and ITF GHP/011/12.
This article has been published as part of BMC Bioinformatics Volume 16 Supplement 7, 2015: Selected articles from The 11th Annual Biotechnology and Bioinformatics Symposium (BIOT-2014): Bioinformatics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/16/S7.
- Luo R, Wong T, Zhu J, Liu CM, Zhu X, Wu E, Lee LK, Lin H, Zhu W, Cheung DW, et al: SOAP3-dp: fast, accurate and sensitive GPU-based short read aligner. PloS one. 2013, 8 (5): e65632-10.1371/journal.pone.0065632.PubMed CentralView ArticlePubMedGoogle Scholar
- Gotoh O: An improved algorithm for matching biological sequences. Journal of molecular biology. 1982, 162 (3): 705-708. 10.1016/0022-2836(82)90398-9.View ArticlePubMedGoogle Scholar
- Smith TF, Waterman MS: Identification of common molecular subsequences. Journal of molecular biology. 1981, 147 (1): 195-197. 10.1016/0022-2836(81)90087-5.View ArticlePubMedGoogle Scholar
- Klus P, Lam S, Lyberg D, Cheung MS, Pullan G, McFarlane I, Yeo G, Lam BY: BarraCUDA - a fast short read sequence aligner using graphics processing units. BMC research notes. 2012, 5: 27-10.1186/1756-0500-5-27.PubMed CentralView ArticlePubMedGoogle Scholar
- Liu Y, Schmidt B, Maskell DL: CUSHAW: a CUDA compatible short read aligner to large genomes based on the Burrows-Wheeler transform. Bioinformatics. 2012, 28 (14): 1830-1837. 10.1093/bioinformatics/bts276.View ArticlePubMedGoogle Scholar
- Lam TW, Li R, Tam A, Wong S, Wu E, Yiu S-M: High throughput short read alignment via bi-directional BWT. Bioinformatics and Biomedicine. 2009, IEEE, 31-36. BIBM'09 IEEE International Conference on: 2009Google Scholar
- Luo R, Liu B, Xie Y, Li Z, Huang W, Yuan J, He G, Chen Y, Pan Q, Liu Y, et al: SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler. GigaScience. 2012, 1 (1): 18-10.1186/2047-217X-1-18.PubMed CentralView ArticlePubMedGoogle Scholar
- Li H: Aligning sequence reads, clone sequences and assembly contigs with BWA-MEM. arXiv preprint arXiv:13033997. 2013Google Scholar
- Doring A, Weese D, Rausch T, Reinert K: SeqAn an efficient, generic C++ library for sequence analysis. BMC bioinformatics. 2008, 9: 11-10.1186/1471-2105-9-11.PubMed CentralView ArticlePubMedGoogle Scholar
- Langmead B, Trapnell C, Pop M, Salzberg SL: Ultrafast and memory-efficient alignment of short DNA sequences to the human genome. Genome biology. 2009, 10 (3): R25-10.1186/gb-2009-10-3-r25.PubMed CentralView ArticlePubMedGoogle Scholar
- Mu JC, Jiang H, Kiani A, Mohiyuddin M, Bani Asadi N, Wong WH: Fast and accurate read alignment for resequencing. Bioinformatics. 2012, 28 (18): 2366-2373. 10.1093/bioinformatics/bts450.PubMed CentralView ArticlePubMedGoogle Scholar
- Marco-Sola S, Sammeth M, Guigo R, Ribeca P: The GEM mapper: fast, accurate and versatile alignment by filtration. Nature methods. 2012, 9 (12): 1185-1188. 10.1038/nmeth.2221.View ArticlePubMedGoogle Scholar
- Li H, Ruan J, Durbin R: Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome research. 2008, 18 (11): 1851-1858. 10.1101/gr.078212.108.PubMed CentralView ArticlePubMedGoogle Scholar
- Liu Y, Schmidt B: Long read alignment based on maximal exact match seeds. Bioinformatics. 2012, 28 (18): i318-i324. 10.1093/bioinformatics/bts414.PubMed CentralView ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.