MICA: A fast short-read aligner that takes full advantage of Many Integrated Core Architecture (MIC)

Background Short-read aligners have recently gained a lot of speed by exploiting the massive parallelism of GPU. An uprising alterative to GPU is Intel MIC; supercomputers like Tianhe-2, currently top of TOP500, is built with 48,000 MIC boards to offer ~55 PFLOPS. The CPU-like architecture of MIC allows CPU-based software to be parallelized easily; however, the performance is often inferior to GPU counterparts as an MIC card contains only ~60 cores (while a GPU card typically has over a thousand cores). Results To better utilize MIC-enabled computers for NGS data analysis, we developed a new short-read aligner MICA that is optimized in view of MIC's limitation and the extra parallelism inside each MIC core. By utilizing the 512-bit vector units in the MIC and implementing a new seeding strategy, experiments on aligning 150 bp paired-end reads show that MICA using one MIC card is 4.9 times faster than BWA-MEM (using 6 cores of a top-end CPU), and slightly faster than SOAP3-dp (using a GPU). Furthermore, MICA's simplicity allows very efficient scale-up when multiple MIC cards are used in a node (3 cards give a 14.1-fold speedup over BWA-MEM). Summary MICA can be readily used by MIC-enabled supercomputers for production purpose. We have tested MICA on Tianhe-2 with 90 WGS samples (17.47 Tera-bases), which can be aligned in an hour using 400 nodes. MICA has impressive performance even though MIC is only in its initial stage of development. Availability and implementation MICA's source code is freely available at http://sourceforge.net/projects/mica-aligner under GPL v3. Supplementary information Supplementary information is available as "Additional File 1". Datasets are available at www.bio8.cs.hku.hk/dataset/mica.


Introduction
With the rapid advance of sequencing technologies, there is continuously demand for faster and faster analysis. The recently announced Illumina HiSeq × Ten sequencing system promises to sequence 18,000 whole human genomes (30x) in one year (four such systems can sequence more genomes than in all of history), while cutting the cost to $1,000 each. To cater to such capacity, it is important to develop new analysis software to fully utilize available acceleration hardware in addition to the CPU. For example, SOAP3-dp [1] uses a graphics processing unit (GPU) and is a few times faster than mainstream CPU aligners and delivers higher sensitivity. Besides GPU, attention has also fallen on Intel's new product Many Integrated Core (MIC), a.k.a. Xeon Phi Co-processor. MIC was introduced in 2011. It is an acceleration device whose hardware and system-software architecture support its use for general purpose computing. It is well suited to software implementations where computations on many thousands of data items can be carried out independently in parallel. The latest product has 57-61 cores and 8 GB of memory in one board, pro-viding~1 TFlops. Two of the top ten supercomputers in TOP500 (which ranks the world's 500 most powerful supercomputers) are equipped with MIC (Tianhe-2 has 48,000 MIC boards, and Stampede has 6,400 boards).
Experience has shown, however, that it is not easy to build useful read alignment software using massive core architectures such as GPU and MIC. The apparent problem is that the most biologically relevant sequencealignment algorithm [2,3] involves dynamic programming dependencies that are awkward to compute efficiently in parallel. The fastest GPU implementations of the algorithm to date relies on task parallelism [1], where each thread of execution computes an entire alignment independently of all other parallel threads. This, however, requires sequences to be aligned to have similar lengths to ensure balanced tasks distribution among cores. Another typical sequence alignment problem is that a short query sequence (100 to 250 bp) must be aligned with a comparatively long (3 Gbp or longer) reference sequence. Since a brute-force search for all candidate alignments in this setting would be computationally prohibitive, read aligners typically construct a list of candidate reference sequence locations within which potential alignments might be discovered. The size of a list depends on the complexity of the sequence to be aligned and this accounts for a significant proportion of the computational imbalance involved in read alignment using massive core architectures. Several high-throughput read aligners including BarraCUDA [4], CUSHAW [5] and SOAP3-dp [1] that utilize GPU acceleration have been developed in the past few years, but to this date, limited studies have been carried out on short read alignment using MIC. This paper introduces MICA, a new short-read aligner designed to fully utilize the computing power of MIC.

Method
MICA accepts input reads in FASTA/FASTQ format and outputs alignment results in SAM/BAM format. In a typical setting, MICA runs on a server (host) equipped with 1 to 3 MIC cards. MICA runs each MIC using the offload mode (instead of native mode) to exploit the host's memory and provide better I/O performance. The latter is important when dealing with large volume of sequencing data. For each MIC card, MICA maintains a CPU thread in the host called an MIC-controller, which feeds the MIC with around million reads each time and spawns 224 MIC threads (running on 56 cores) to align the reads in parallel. At the end, the alignment results are copied to the MICcontroller for output ( Figure 1). Note that reads with different complexities vary in aligning time. To achieve maximum throughput, MICA dynamically balances the load of the threads to avoid unnecessary idling.
Algorithmically, MICA maintains a BWT index in each MIC, and it adopts SOAP3-dp's approach to align reads shorter than 150 bp, and uses a new approach to handle longer reads. SOAP3-dp was designed for HiSeq 2000 paired-end reads of 100 bp; its efficiency stems from the fact that a GPU, with a 2 way-BWT index [6], can efficiently align at least one end of a paired-end read with 0-2 mismatches; the other end can be aligned using dynamic programming in the GPU. For longer reads (150 bp or more), SOAP3-dp deteriorates in sensitivity and speed because GPU is inefficient to align with more mismatches using the index. Thus, we need a new approach. In the following, we divide the discussion into two parts: the first part describes the techniques on how to utilize the resources of MIC to match the performance of SOAP3-dp on GPU, and the second part is about the new techniques for aligning reads.
Our experiment revealed that an MIC core was 4 to 6 times slower than a CPU core when running programs designed for CPU, and an MIC with 57 cores might be comparable to a 12-core CPU when used for brute-force parallelization of a CPU program. Noteworthy, each MIC core has thirty-two 512-bit registers; each allows sixteen 32-bit data to be operated in parallel. Such extra parallelism, if exploited properly, can boost the efficiency dramatically. This requires a lot of engineering work, though. Our first success is on the BWT index, MICA exploits 512-bit operations to speed up different arithmetic and memory transaction operations when querying the BWT (SOAP3-dp is based on 64-bit operations). Next, we turn to dynamic programming, which is for aligning reads allowing indels and soft clipping. Below we give the details of a new parallel algorithm for dynamic programming that can utilize the 512-bit registers.
A. Highly parallel dynamic programming.
To align a read of length m and a reference region of length n by DP, the traditional approach ( Another concern is that the number of memory transactions in each core is an important factor to the overall efficiency. To minimize the number of memory transactions, for each entry, we pack all the necessary information (including the read and reference nucleotide) required for calculation into a 32-bit entry in the DP table and the memory transactions are reduced to loading only three 512-bit vectors of the DP table; The number "three" comes from the affine gap penalty model, which requires to compute three DP tables M, I, D to store the optimal score when aligning up to each position with the last character being a Match, Insert or Delete, respectively. To reduce memory transactions by increasing locality, the M, I, D values for each i, j are packed into T[i, j].

B. New seeding strategies.
For longer reads, we can only afford to use the BWT to align short fragments of a read (i.e., the seeds) and then count on dynamic programming to verify the candidate positions of the seed. Ideally we want fewer seeds (to save time on seed alignment) and the seeds should return correct candidate positions without too many incorrect ones (to save time on dynamic programming). MICA attempts to improve SOAP3-dp and other aligners by using as few seeds as possible, while maintains a balance between efficiency (fast & do not give too many candidates) and sensitivity (capture correct    candidate positions for more reads). MICA uses a combination of the following strategies.

a) Non-branching mismatches
MICA allows more mismatches during seed alignment but without sacrificing efficiency. SOAP3-dp allows only one mismatch as the time for seed alignment increases exponentially with the number of mismatches (due to the branching of the search tree). We observe that very often a mismatch is unambiguous and there is exactly one way to correct it according to the reference genome. MICA allows 1-2 "non-branching" mismatches in addition to one "branching" mismatch for seed alignment.
In details, when performing alignment, we take a single character in the read and extend it character by character and check if the pattern exists in the reference genome. We call it a mismatch when we try to extend it such that the character in the read does not match the corresponding character in the reference genome. We can classify the mismatches into two types, branching mismatches (BM) and non-branching mismatches (NBM) (Figure 4). (a) When a NBM occurs, the current character in the read does not allow an extension, but there is another unique character that allows an extension. (b) When a BM occurs, there are at least 2 characters allowing an extension, one of them can be the character in the read, and we choose to extend it with a mismatch character.
For computing purposes, a BM increases the number of possible extensions much more than a NBM, and is more time-consuming to handle. Yet, we observed that very often a mismatch is unambiguous and there is exactly one way to correct it according to the reference genome (i.e. it is a NBM). Therefore, we allow MICA to find alignments with up to 2 non-branching mismatches, in addition to the 1 branching mismatch for seed alignment. This approach discovers more seeds than the "1 mismatch only" approach. b) Read-sensitive seed length MICA dynamically adjusts the seed length depending on the content of the read. If a region of the read appears to be non-repetitive, it uses a shorter seed to avoid missing critical candidate positions; for a region that appears to be repetitive, we choose a long seed to reduce the number of candidates.
More specifically, MICA follows SOAP3-dp's design for rapidly aligning those reads with < 3 mismatches and without an Indel, leaving those reads failed to be aligned (with ≥ 3 mismatches or with ≥ 1 Indel) to the Dynamic Programming (DP) module. This rapid mode works efficiently with 100 bp paired-end reads, where the slower DP module handles only about 10% of the reads. However, new sequencers such as Illumina HiSeq × Ten, produces paired-end reads 150 bp in length. The probability of a 150 bp read to span ≥ 3 mismatches or an Indel is higher than that of 100 bp, empirically forcing the DP module to handle 30-40% of the reads, which slows down the performance of MICA significantly. To enable MICA to work with 150 bp or longer reads efficiently, we have redesigned the workflow by disabling the rapid mode and using multiple rounds of seed discovery in the DP module to catch up to the speed. The longer seed lengths required in the former rounds limit the number of candidate alignment regions, where the shorter seed lengths in the latter rounds ensure the sensitivity, but requiring much more computation due to the large number of candidate alignment regions to be verified. Most reads with fewer mismatches and Indels thus could be aligned in the former rounds, leaving complicated reads for the latter rounds, which are more computationally demanding. Table 1 gives the implementation details of each round as in the latest version of MICA. The details of SOAP3-dp are also provided for comparison.

c) Breaking seeds at variant positions
MICA also tries to exclude read positions that are suspicious to be errors or variants (especially Indels) as part of a seed.
The above seeding strategies are used in multiple rounds with different parameters to attain better speed without sacrificing sensitivity. The superior sensitivity of

Results
A. Software settings and machine specifications. a) MICA: Version r178. Compiled with Intel C/C++ Compiler version 13.1; one to three 57-core MIC cards (8G, ECC enabled), each coupled with a CPU core; and one CPU core for output (in SAM format). b) SOAP3-dp: Version r176. Compiled with GCC 4.7.2 and CUDA SDK 5.5; one nVidia GeForce GTX680 (4G); 4 CPU cores; serial BAM output. c) BWA-MEM: Version 0.7.5a. Compiled with GCC 4.7.2; 6 CPU cores. SAM output. d) In general: Besides the acceleration devices (MIC/ GPU), all other hardware is the same for all experiments. Specifically, Intel i7-3730k, 6-core @3.2 GHz, 64G memory B. Real data comparison with other aligners.
We used 77-fold depth of 150 bp Illumina paired-end reads of the YH samples [7] (PE150) to benchmark MICA and other state-of-the-art aligners including GPU-based SOAP3-dp [1] and CPU-based BWA-MEM [8]. The results are shown in Table 2. The figures were calculated as the average of three repeated runs, and a tailor-made setting was used for each aligner to ensure the best performance. For the PE150 dataset, MICA using one card is~4.9 times faster than BWA-MEM and slightly faster than SOAP3-dp. The speed of MICA can be scaled up almost linearly with additional cards. When using three cards, MICA is~14.1 times faster than BWA-MEM. MICA's sensitivity is 3.1% and 0.47% higher than BWA-MEM and SOAP3-dp, respectively.
We have also benchmarked 100 bp paired-end reads [7] (PE100,~49-fold). Interestingly, MICA's acceleration on PE150 is slightly more significant than on PE100. For the latter, MICA is~3.4 and~9.9 times faster than BWA-MEM when using one and three MIC cards, respectively. When processing many shorter reads, MICA has a bottleneck in accessing the memory.

A. Simulated data comparison with other aligners.
We carried out simulated data test with short read simulator Mason [9] to assess the accuracy and sensitivity of MICA. We tested with 2 sets (100 bp and 150 bp) of 6M Illumina-style paired-end (PE) reads with 500 bp insert size from GRCh37 major build.
Bowtie2 [10], SeqAlto [11] and GEM [12] allow users to sacrifice accuracy and sensitivity for speed with switches. We applied "very-fast", "sensitive", and "verysensitive" switches to Bowtie2, "fast (-f)" to SeqAlto, and "fast adaptive (-fast-mapping)", "fastest (-fast-mapping = 0)" to GEM. We used indices with full suffix array (SA) for SOAP3-dp. All parameters of MICA, SOAP3-dp and SOAP3 were set to default, and parameters for other aligners including BWAaln and CUSHAW2 [13,14] were set to favor the two read types and length. 13 sets of programs and parameters were compared in total.
In both 100 bp and 150 bp datasets, MICA is comparable to SOAP3-dp. MICA has advantage over other tools with higher speed, higher sensitivity and lower FDR results (Table 3 for 100 bp PE, and 150 bp table in  supplementary).
For 100 bp reads, MICA takes 178 seconds to align 6M read pairs, which is 16 seconds slower than SOAP3dp (due to the increasing time in loading the index), but 1.67 to 11.09 times faster than other tools. MICA's sensitivity is 99.60% which is 0.06% lower than SOAP3-dp but 0.07-1.83% higher than other tools. MICA's FDR is 0.40%, which is 0.07-0.79% lower than other tools except being 0.06% higher than SOAP3-dp.

B. Production test on Tianhe-2 supercomputer.
Tianhe-2 is currently the champion of the Top500 list of supercomputers. It has in total 16,000 computing nodes, each equipped with two 6-core Intel E5-2692 v2 @ 2.20 Hz CPU, 64 GB main memory and 3 MIC cards. Each MIC card has 57 cores and 8 GB on-board memory with ECC enabled.
To test Tianhe-2 with the workload of a population scale study, we used whole genome sequencing data of 45 CHB and 45 CHS samples (Supplementary Details) from the 1000 genomes project; a sample had 64.68-fold depth on average. In total, we had 932 lanes of 100 bp paired-end reads with size varying from 1.82-to 12.7fold per lane.
The alignments of 932 lanes were carried out with 5 different settings. All alignment jobs were submitted simultaneously using one node per lane, occupying 4660 nodes in total at the very beginning. The results are shown in Table 4. 19 jobs failed to complete due to node or card failure, thus excluded from calculation.     The time consumption of BWA is calculated as "align left reads"+"align right reads"+"sampe". The index loading times of "align right reads" and "sampe" modules are 1 second due to the reason that, index files were cached during "align left reads". However, datasets larger than the host memory will flush the cache during alignment. 2 The alignment time consumption of GEM is calculated as "alignment"+"convert to SAM format". The conversion module was run with 4 threads in consistent with the alignment module. 3 SOAP3-dp, SOAP3, SeqAlto and GEM aligners explicitly provide index loading time consumption. The index loading time for Bowtie2, CUSHAW2 and BWA are calculated by the total size of index, divided by 100 MB/ s, which is the average network file system speed of the testing environment. The index loading time maybe underestimated while the time processing the index was not calculated. 4 The alignment times were explicitly provided by the aligners (include results processing and input/output time) or calculated by total elapsed time minus estimated index loading time. 5 Sensitivity is calculated as "Correctly aligned reads"/"All simulated reads". The higher the better. 6 FDR is calculated as "Incorrectly aligned reads"/"All aligned reads". The lower the better Using 3 cards per node and SAM output, the mean time consumption for 923 completed lanes is 1,303 seconds, with the largest lane finished within an hour. By stacking up the alignment jobs to minimize the idle time of computing nodes involved, we anticipate that all 923 jobs could be finished in an hour using less than 400 nodes.
Using 3 cards and 6-thread BAM output, the mean time consumption is about the same but the maximum time consumption is~18 minutes longer. Interestingly, alignment using 2 cards is almost twice as fast as using 1 card, but using 3 cards is only slightly faster than using 2 cards, since the computation was throttled by the slow compression algorithm in the BAM output module. Splitting the output into two or more files to enable non-blocking parallel compression will help in exploiting the full power of all three cards.