 Methodology Article
 Open access
 Published:
BitMapper: an efficient allmapper based on bitvector computing
BMC Bioinformatics volume 16, Article number: 192 (2015)
Abstract
Background
As the nextgeneration sequencing (NGS) technologies producing hundreds of millions of reads every day, a tremendous computational challenge is to map NGS reads to a given reference genome efficiently. However, existing methods of allmappers, which aim at finding all mapping locations of each read, are very time consuming. The majority of existing allmappers consist of 2 main parts, filtration and verification. This work significantly reduces verification time, which is the dominant part of the running time.
Results
An efficient allmapper, BitMapper, is developed based on a new vectorized bitvector algorithm, which simultaneously calculates the edit distance of one read to multiple locations in a given reference genome. Experimental results on both simulated and real data sets show that BitMapper is from several times to an order of magnitude faster than the current stateoftheart allmappers, while achieving higher sensitivity, i.e., better quality solutions.
Conclusions
We present BitMapper, which is designed to return all mapping locations of raw reads containing indels as well as mismatches. BitMapper is implemented in C under a GPL license. Binaries are freely available at http://home.ustc.edu.cn/%7Echhy.
Background
Recently, DNA sequencing has become a powerful tool for researches in biology and medicine. The decreasing cost and improving speed of the nextgeneration sequencing (NGS) technologies generate massive reads every day. However, a disadvantage of NGS technologies is that they produce sequenced reads of relatively short length. For instance, the HiSeq2500 platform of Illumina usually produces 150 bp reads. The first step of many genomic researches is finding the mapping locations of these short NGS reads in a given large reference genome.
For this mapping issue, two classes of methods have been developed. One class, including Bowtie [1], Bowtie2 [2], BWA [3], GEM [4], etc., is referred to as bestmappers for trying to identify one or a few best mapping locations for each read. The other class, including RazerS 3 [5], Hobbes2 [6], and mrFAST [7, 8], is referred to as allmappers for finding all mapping locations. Generally, the selection of different mappers depends on the needs of downstream applications. Finding one or a few best mapping locations for each read using bestmappers is enough in most cases (e.g., mapping DNAprotein interactions, wholetranscriptome sequencing and whole genome expression profiling). However, for some specific applications, such as ChIPseq experiments, CNVs (copy number variation) calling and detecting structural variants, it is necessary to identify all mapping locations using allmappers.
Due to the different purposes, identifying all mapping locations using allmappers is usually much slower than finding one or a few best locations using bestmappers. An important reason is that allmappers have to enumerate all possible locations, while bestmappers can use some heuristic methods to select the most likely one. There are a lot of matches for some reads due to huge numbers of segmental duplications and common repeats in reference genomes. Thus, finding all mapping locations is still a computationally very expensive problem.
To solve this problem, many allmappers have been developed. Most of them consist of two parts, filtration and verification. Filtration reduces the number of the locations that need to be verified (called candidates), especially when a reference genome is extremely large. For example, Hobbes [9] uses a dynamic programming algorithm to select several qgrams with the lowest frequency, where qgrams are the subsequences with length of q. Therefore, the number of candidates is minimal. Another filter proposed by Hobbes 2 chooses k+2 q−grams instead of k+1 and only verifies the locations that appear at least two times. Recently, Masai [10] improved the performance of filtration by generating candidate locations of multiple reads simultaneously and using approximate seeds. Compared with filtration, verification used for edit distance is the dominant part of the whole running time in current mappers [8]. Several algorithms have been proposed to speedup verification. A bitvector algorithm proposed by Myers [11] uses bit representation contained in a machine word to calculte edit distance. RazerS 3 [5] implements a banded version of Myers’ algorithm [12], which only calculates several consecutive diagonals rather than the whole dynamic programming matrix. Although the current banded method of verification is quite quick, it only calculates the edit distance of a read to one location rather than multiple locations.
In this paper, we present BitMapper, an efficient read mapper which is designed to return all mapping locations of raw reads containing indels as well as mismatches. It includes a new vectorized bitvector algorithm using a single machine word to represent several bit vectors and simultaneously calculates the edit distance of a read to multiple locations in a given reference genome. A vectorized verification scheme is also proposed to work with the new bitvector algorithm. Experimental results show that the running time of BitMapper is from several times to an order of magnitude faster than the best existing allmappers, including Hobbes 2, RazerS 3, mrFAST (with FastHASH) [8], Masai and Yara [13].
Methods
First, we define the read mapping problem and related concepts.
Definition 1.
Given a set of reads R and a reference genome S, find all locations in S where the hamming or edit distance of each read in R is at most k.
Hamming and edit distance are two common distance metrics for sequence alignment. Hamming distance only includes the substitutions of the corresponding symbols between two strings of equal length, while edit distance consists of substitutions, insertions and deletions. Calculating hamming distance is relatively easy and has been well solved. On the other hand, calculating edit distance efficiently is still difficult, which is the focus on this article.
Similar to existing short reads mappers, BitMapper mainly consists of two parts: filtration and verification. In the following, we first briefly describe the procedure of existing approaches and then present and analyze BitMapper in detail.
Filtration
Filtration is an important phase for sequence alignment, especially if a reference genome is extremely large. Only the regions consisting of potential mapping locations can be reserved after filtration. Currently, the basic principle of nearly all qgrambased filtration strategies is that the number of qgrams shared between two sequences should exceed a certain threshold if they are potentially similar. Next, we briefly summarize commonly usedqgrambased approaches.
Pigeonhole principle
A simple and efficient filtration strategy is pigeonhole principle: if l items are put into l+1 boxes, then one or more boxes would be empty. In its application on sequence alignment, first each read is divided into k+1 nonoverlapping qgrams, where k is the threshold of edit distance or hamming distance. If the distance between a read and a candidate region is less than k, at least one in k+1 nonoverlapping qgrams of the read can be mapped to the reference exactly, since a substitution, insertion or deletion only affects a qgram. A more general version of pigeonhole principle is that if a read is able to be cut into k+m nonoverlapping qgrams, sharing at least m of them with a read is necessary for each mappinglocation.
Count filtering
Compared with pigeonhole principle, a more involved filtration strategy is count filtering. Given a sequence s, there are s−q+1 overlapping qgrams that are obtained by sliding a window of length q over s, where s is the length of s. As in the explanation of pigeonhole principle, a substitution only affects at most q overlapping qgrams. Thus, no more than k×q qgrams could be affected with hamming distance k. If the hamming distance between s and another sequence r is less than k, then the number T of shared qgrams is at least.
The lower bound T of edit distance is similar to that of hamming distance. The first method of count filtering on sequence alignment is a modified SWIFT algorithm [14] used in RazerS [15].
Our implementation
Pigeonhole principle is faster than count filtering on filtration phase, while the verification time of the pigeonholeprinciplebased mappers is more than that of the countfilteringbased mappers. In fact, there is a tradeoff between filtration and verification. Because the proportion of verification time for the pigeonholeprinciplebased mappers is larger than that for the countfilteringbased mappers, the former benefit more from the improvement of verification than the latter. As our verification method is efficient, Bitmapper used pigeonhole principle instead of count filtering.
Verification
The locations reserved after filtration are the candidates for matches. During the verification phase, these candidates should be verified by calculating their edit distance or hamming distance to each read. Compared with computing hamming distance, computing edit distance is extremely timeconsuming. In the following, we first describe the theoretical basis for our vectorized Gene Myers’ bitvector algorithm, and illustrate the algorithm in detail. Then, we present a vectorized verification scheme, which is designed to work with the vectorized Gene Myers’ bitvector algorithm.
Theoretical basis
For sequence alignment, the reads and the reference genomes can be viewed as the strings including letters A, C, G, T and N. Assume the length of read r is m, the length of genome s is n, and the threshold of edit distance is k. The dynamic programming algorithm proposed in [16] is a classic method for this problem, which computes a dynamic programming matrix C[0…m,0…n] of size (m+1)×(n+1). The wellknown recurrence formula is as follows.
Its time complexity is O(m×n) and it is very slow when a reference genome is large. Actually, calculating the whole dynamic programming matrix is unnecessary when the edit distance threshold k has been set in advance. As stated in the following Lemma 1, the size of computing area in dynamic programming matrix is related to k, which has been proven in [17].
Lemma 1.
Given a read of length m, a candidate location d in a reference genome and an edit distance threshold k, the start and end positions of potential matches may be from d−k to d+k and from d+m−k−1 to d+m+k−1, respectively. In other words, the length of the verification window, which would be calculated with the read, is m+2k.
Figure 1 shows an example for Lemma 1. Note that the candidate location d is obtained by subtracting the offset c from dq, where dq is an exactly matched location of a qgram and c is the offset of this qgram in the read. If there are only at most k deletions, the segment starting at d−k and ending at d+m+k−1 needs to be computed. If there are only at most k insertions, the segment range needing to be considered is from d+k to d+m−k−1. Combining with these two intervals, the range of maximal verification window is from dk to d+m+k−1.
According to Lemma 1, the length of verification window is m+2k. Thus, only (m+2k+1)×(m+1) cells in dynamic programming matrix need to be calculated. We define the diagonal which is shifted from the main diagonal by k diagonals to the right as “base diagonal”. It corresponds to the situation that only substitutions are considered, since the computing path moves right down from the current cell in dynamic programming matrix to the adjacent cell when a substitution occurs. For a deletion in the reference genome, the computing path goes right to the adjacent cell. For an insertion, this path goes down to the adjacent cell. Thus, the rightmost and the leftmost diagonals are obtained by sliding k diagonals from the “base diagonal” to its right and left, respectively. In fact, the computing area in dynamic programming matrix is a banded parallelogram, as shownin Fig. 2.
An efficient solution for this problem is the bitvector algorithm proposed in [11], which is based on the observation that the difference of the values between adjacent cells in dynamic programming matrix is at most 1. It is able to encode a whole column in dynamic programming matrix using bit vectors and compute a column by bit operations rather than cellbycell. Banded versions of Myers’ bitvector algorithm have been implemented in [5] and [12]. They encode a banded parallelogram in dynamic programming matrix into columns for columnwise computation, since only limited consecutive diagonals need to be calculated rather than the whole dynamic programming matrix, according to the analysis above. Figure 2 shows that at most 2k+1 cells in each column need to be calculated, so that the length of the bit vectors is also 2k+1. If 2k+1 is less than the word size of computer, a column could be processed in one step.
Vectorized Gene Myers’ bitvector algorithm
A significant characteristic of the NGS reads is that the length of them is relatively short. For the downstream applications using allmappers, the edit distance threshold is usually set to 4 % or 5 % of the read length. Thus, the edit distance threshold k is usually low. It means that a few bits are enough for banded bitvector algorithms to calculate edit distance. For example, the length of the reads sequenced by Illumina platform is always under 150 so that the threshold k is set to 7. If k=7, the length of bit vectors is 15, while the word size of modern computers is typically 64 and the Streaming SIMD Extensions (SSE) instruction set has several 128bit registers. Therefore, it is possible to load multiple bit vectors into a machine word or a 128bit SSE register. Furthermore, the problem can be converted to how to compute the edit distance between several patterns and a text. Based on these observations, we propose a new vectorized Gene Myers’ bitvector algorithm to simultaneously process a text with multiple patterns.
First we briefly introduce the current bitvector algorithm proposed in [12], which is the basis of our vectorized algorithm. It uses delta encoding in dynamic programming matrix C[0…m,0…n]. Specifically, for column j, the bit delta vectors are.
where H P _{ j },H N _{ j },V P _{ j },V N _{ j },D0_{ j } and P e q _{ j } are the jth element of H P,H N,V P,V N,D and P e q, respectively. And the notation H P _{ j }[i],H N _{ j }[i],V P _{ j }[i],V N _{ j }[i],D0_{ j }[i] and P e q _{ j }[s][i] denote the ith bit of H P _{ j },H N _{ j },V P _{ j },V N _{ j },D0_{ j } and P e q _{ j }[s], respectively.
If the values of these bit vectors in column j−1 have already been known, then the bit vectors in column j can be computed as following.
where t[p] is the pth element of text. Because the length of each column in computing area is 2k+1, so the length of these bit vectors is also 2k+1.
Based on the Algorithm 1, we developed the vectorized Gene Myers’ bitvector algorithm. Briefly, it packs multiple bitvectors of different patterns into a machine word so that these patterns are able to be processed with one text simultaneously. The calculating of each bitvector is similar to the previous banded bitvector algorithm [12]. However, some problems would occur when multiple bitvectors are processed as a whole. From Algorithm 1, only six operations have been used: ‘ ⊕’, ‘ ’, ‘ &’, ‘ ≫’, ‘ ∼’ and ‘+’. They can be divided into two groups. The first group consists of ‘ ⊕’, ‘ ’, ‘ &’ and ‘ ∼’, while ‘ ≫’ and ‘+’ belong to the second group. Using the operations of the first group in a machine word including multiple bit vectors does not have any difficulty, because the bit vectors in it cannot be affected with each other. However, implementing the operations of the second group as a whole would influence each other. For operation ‘+’, the carries resulted from addition of lower bit vectors would affect the nearly upper bit vectors. For operation ‘ ≫’, the lowest bit in upper bit vector would move right to influence the nearly lower bit vector.
To solve the problem about operation ‘+’, the data structures of the variables used in our vectorized Gene Myers’ bitvector algorithm have been redesigned. For the previous banded bitvector algorithm [12], 2k+1 bits are enough for each variable. Intuitively, for the vectorized algorithm, the length of each variable, which represents n bit vectors, is (2k+1)×n. However, if multiple bit vectors are loaded into a machine word as this, the problem above could not be solved. Our solution is to use one more bit between two bit vectors as a buffer, so that the carries resulted from the operation ‘+’ among the lower bit vectors would not affect the upper bit vectors. For example, V P _{ l }, which represents the difference of values between the vertical cells for pattern l, is a part of V P, starting from (2k+2)×lth bit and ending at (2k+2)×(l+1)−1th bit. For h patterns, the bit vectors of them are assembled together so that the length of variables including VP, V N,H P,H N,D0 and Peq is not less than (2k+2)×h, as shown in Fig. 3. And the problem about operation ‘ ≫’ has been solved in our vectorized Gene Myers’ bitvector algorithm by using an extra ‘ &’ operation with a predefined bitmask.
Our vectorized Gene Myers’ bitvector algorithm proceeds columnbycolumn through the dynamic programming matrix. If the length of patterns is less than that of the text, it returns the optimal end location for each pattern on text. Otherwise, it returns the optimal end locations for the text on each pattern. As an example, we present the outline of the algorithm for the second situation as follows.

1.
Preprocess the variables for column 0.

Set the Peq array for the first 2k+1 symbols in each pattern.

Set VP, VN and E to 0, where E contains the edit distances for h patterns.


2.
Scan and compute the banded parallelogram in the dynamic programming matrix from left to right by column.

Compute HP, HN and D0 of column j from VP and VN of column j1.

Compute VP and VN of next column using HP, HN and D0.

Set Peq for next column by shifting to the right of the current Peq.

Update E for this column using D0.


3.
Output the locations with the lowest edit distance as the optimal end location on each pattern, separately. The range in E from (2k+2)×jth bit to (2k+2)×(j+1)−1th bit denotes the edit distance of pattern j.
All of the patterns have to be processed one by one in step 3, while step 1 and step 2 can process multiple patterns simultaneously. Fortunately, unlike step 1 and step 2, step 3 is not always necessary due to two reasons: a) the number of matched locations is much less than that of the candidate locations, and b) a simple branchcut strategy is used in step 2 to stop algorithm earlier, as described in [5]. More details of the vectorized Gene Myers’ bitvector algorithm can be found in the Additional file 1: Section S5.
Influence of the number of patterns
We have already implemented the vectorized Gene Myers’ bitvector algorithm using 64bit machine word and 128bit SSE2 register. It can calculate the edit distance of a text with n patterns. In order to figure out the influence of different n, we selected a 100 bp read from specimen HG00096 as a text and regarded 1 thousand subsequences of human genome starting at the candidate locations of this read as patterns. The threshold k of edit distance was set to 4. Because the length of each bit vector is 2k + 2=10 and the length of a SSE register is 128, the vectorized Gene Myers’ bitvector algorithm can process at most 12 patterns with a text simultaneously. Figure 4 shows the running time of the algorithm with different n. Although the performance was improved until n=12, we found that the running time decreased rapidly from n=1 to n=8, while it only decreased a little from n=9 to n=12.
The reason is that for the original banded bitvector algorithm which calculates the edit distance between a text and a pattern, algorithm stops once it meets the requirement of the branchcut strategy. For our vectorized Gene Myers’ bitvector algorithm, calculating stops until all of the patterns meet the requirement. It is difficult when n is large and would result in extra cost. Therefore, we set n to 8 in most cases. For higher edit distance threshold, n is set to 4 since a 128bit register cannot load 8 bit vectors.
Vectorized verification scheme
In order to make full use of the vectorized algorithm, the patterns used to compare with a same text should be collected. The traditional verification scheme, which only selects a read and a subsequence in given reference genome as input every time, is not suitable for our vectorized Gene Myers’ bitvector algorithm. It is necessary to propose a vectorized verification scheme that considers multiple reads as patterns and a subsequences in given reference genome as text, or vice versa. In other words, multiple reads may correspond to one location in given reference genome, or multiple locations correspond to one read.
Figure 5a shows the vectorized verification scheme A, which considers multiple reads as patterns and a subsequence in given reference genome as a text. All of the four reads have a matched 3gram ATG in the reference genome and share the same candidate location d. Generally, this scheme needs to build a reads index in order to collect the reads sharing the same locations efficiently. Figure 5b shows vectorized verification scheme B that considers a read as a text and multiple subsequences starting at the candidate locations of the read as patterns. The read here corresponds to two subsequences sharing a 2gram AT. These two subsequences are obtained by looking up the index of the reference genome using the nonoverlapping qgrams of the read.
According to the analysis above, we found that scheme A takes advantage of the repeatability of the reads, while B takes advantage of the repeatability of the reference genomes. In the experimental results presented in Additional file 1: Section S2, the repeatability of genomes is much more than that of reads. Therefore, scheme B suits the vectorized Gene Myers’ bitvector algorithm better than A. Another advantage of scheme B is that it does not need an extra index of reads. For the reasons outlined above, BitMapper is implemented as scheme B.
Results and discussion
BitMapper was compared with five stateoftheart allmappers, including mrFAST (with FastHASH), Hobbes 2, RazerS 3, Masai and Yara, and three popular bestmappers, Bowtie 2, GEM and BWA in our experiments. The default configurations of these mappers were used except stated otherwise, and the results were output in the SAM format. For a fair comparison, all mappers ran on the same computer with an Intel(R) Core(TM) i74770 processor and 24GB of RAM running 64bit Ubuntu 14.04.
The distance metric used in our experiments was edit distance with threshold 5 %. The reference genomes were the whole genome of human (NCBI HG19), caenorhabditis elegans (WormBase WS201) and arabidopsis thaliana (assembly TAIR10). In the following, mapping time and sensitivity on both real and simulated data sets were presented.
Sensitivity comparison using Rabema results
In this experiment, 100 k simulated 100 bp reads of human were generated by a simulator tool Mason [18] using default profile setting. And we also selected a real data set consisting of 1 million 100 bp reads from specimen HG00096 of the 1000 Genome project [19]. To compare the sensitivity of singleend alignment in different genomes, we used the first 1 million 100 bp reads of the data sets SRX026594 and the first 1 million 101 bp reads of SRR1604937, which were obtained from the DNA Data Bank of Japan (DDBJ) repository [20] and National Center for Biotechnology Information (NCBI) repository [21], respectively.
To compare the sensitivity of different mappers fairly, Rabema benchmark [22] was used to evaluate them. It has been widely used in recent articles, such as [5, 6] and [10]. The categories of sensitivity scores provided by Rabema benchmark include all, allbest, and anybest, which are designed to denote the mapped fraction of all, all of the best, and any of the best matches. And to measure these scores, Rabema benchmark defines two metrics: normalized found interval and found interval. For normalized found interval, each read is given at most one point no matter how many mapping locations it has. For found interval, each mapping location is given one point [see Additional file 1: Section S4 for more detailed illustration]. Note that we only presented the Rabema scores (normalized found interval) in the following, and presented the Rabema scores (found interval) in Additional file 1: Section S4 due to the limited space. Because Rabema benchmark needs a baseline of mapping locations to build a gold standard, we implemented RazerS 3 in full sensitivity mode, which can report 100 % of mapping locations for each read.
Rabema benchmark results on simulated data
Table 1 shows the results of mapping 100k simulated reads to the reference genome of human. The Rabema all, allbest and anybest scores were presented here. Each Rabema category has a large number and 6 small numbers representing the total score and the scores for mapping locations with \(\binom {0\ 1\ 2}{3\ 4\ 5}\) errors, respectively. Bestmappers including Bowtie 2, GEM and BWA were implemented in both high and default sensitivity mode. In high sensitivity mode, the Rabema allscores (normalized found interval) for Bowtie 2, BWA and GEM were 99.73 %, 97.80 % and 96.02 %, respectively. It seems that these bestmappers can achieve nearly full sensitivity. However, the Rabema allscores (found interval) of BWA and GEM, which can be found in in Additional file 1: Table S3, were 86.67 % and 61.46 %, respectively. For Bowtie 2, although the Rabema allscores (found interval) was still more than 98 %, it was extremely slow using one thread. Thus, we implemented Bowtie 2 with 16 threads and did not present its running time here. This means that the bestmappers are not suitable for applications requiring full or nearly full sensitivity. It is mainly because the bestmappers are designed specifically for identifying the best mapping locations of each read.
Compared to the bestmappers, allmappers usually achieve higher sensitivity. For mrFAST, it is interesting that its Rabema anybest and allbest scores were 53.69 % and 54.09 % at edit distance 5, which were much lower than other allmappers. Masai and Hobbes 2 lost a few mapping locations due to their heuristic methods. BitMapper and RazerS 3 were the only two mappers identifying 100 % all of the best and any of the best mapping locations. Note that the all, allbest and anybest scores of RazerS 3 in full sensitivity mode were 100 %, since we used the output of RazerS 3 in full sensitivity mode as the baseline for Rabema benchmark. However, it was extremely slow. The Rabema allscore for BitMapper was nearly 100 %, which was the best except RazerS 3 in full sensitivity. We did not present the sensitivity of Yara in Table 1, since it could not generate CIGAR strings for suboptimal alignments, which led to incorrect output of Rabema benchmark.
Rabema benchmark results on real data
According to the results above, we found that the sensitivities of GEM and BWA on both high and default sensitivity modes were not high enough for the applications needing all or nearly all mapping locations. For Bowtie 2, although the sensitivity on high sensitivity mode has been improved, it spent much more time and memory than allmappers. Thus, we would not present the results of them in the following.
To compare the sensitivity of allmappers on real data sets, we also measured the Rabema scores using 1 million 100 bp reads of human, as shown in Table 2. And to evaluate the sensitivity for different genomes, the Rabema scores for caenorhabditis elegans genome and arabidopsis thaliana genome were presented in Tables 3 and 4, respectively. According to these results, we fonud that the sensitivity of Bitmapper was also best among all of the allmappers except RazerS 3 in full sensitivity, which generated the baseline of Rabema benchmark. As the results in Table 1, the Rabema scores of Yara were not included in Tables 2, 3 and 4 due to the absence of CIGAR strings.
Performance comparison on large data sets
In order to compare the performance of BitMapper with other mappers on large data sets, we selected a singleend data set consisting of 10 million 100 bp reads from specimen HG00096 of the 1000 Genome project. And to compare the performance in different genomes, we used the first 10 million 100 bp reads of the data sets SRX026594 and the first 10 million 101 bp reads of SRR1604937, which were obtained from the DNA Data Bank of Japan (DDBJ) repository and National Center for Biotechnology Information (NCBI) repository, respectively. The first 10 million read pairs of these data sets were also used to measure the performance of pairedend alignment. Moreover, to demonstrate that Bitmapper also works well for longer reads, a real data set and two simulated data sets were used. The real data set consists of the first 10 million 151 bp reads of human in the HiSeq 2500 NA12878 demo data set in [23], while the two simulated data sets include 10 million 300 bp reads of caenorhabditis elegans and arabidopsis thaliana, respectively. All of these data sets with 100 bp, 151 bp and 300 bp reads were mapped to their reference genomes using edit distance threshold 5, 7 and 15, respectively.
Because the Rabema benchmark cannot be implemented in such large data sets, we used the percentage of mapped reads and the number of mapping sites to measure the sensitivity, instead of Rabema scores. For running time comparison, the results of different mappers with single and eight threads were presented in Tables 5, 6, 7 and 8. Note that since Masai and mrFAST do not support multithreading, the results of them with eight threads were omitted. In addition, peak memory consumption was also compared.
Sensitivity and running time comparison
Table 5 shows the results of mapping 10 million 100 bp and 151 bp singleend reads to the whole human genome. Results of the bestmappers including GEM, Bowtie 2 and BWA were left out, because the sensitivity of them is usually substantially less than that of allmappers and the running time is usually longer, as shown in Table 1 and Additional file 1: Table S3. The results in Table 5 show that BitMapper was the best in terms of sensitivity and running time on the human genome data sets. For 10 million 151 bp reads, it was nearly 3 times faster than the second fastest read mapper Yara, and achieved highest sensitivity with 940.16 million mapping locations identified and 93.8487 % reads mapped in the human genome. Compared to Masai, BitMapper was more than 4 times faster and found more mapping locations. For 10 million 100 bp reads, Bitmapper also presented the best performance among all read mappers. The results of RazerS 3 were not shown, since the memory requirement of RazerS 3 was larger than the memory capacity of our computer. Similarly, BitMapper was superior in mapping 10 million singleend reads against the genomes of caenorhabditis elegans and arabidopsis thaliana, as shown in Table 6. And for longer 300 bp reads, Bitmapper was still more efficient than others, as shown in Table 7.
Finally, Table 8 shows the experimental results for pairedend alignment, where three data sets consisting of 10 million read pairs from different genomes were used to evaluate the performance. Again, BitMapper was the best, 2.5 times faster than Hobbes 2, nearly 3 times faster than Masai and Yara in human genome. Note that the results of the human genome using RazerS 3 are not shown here, because the memory requirement of RazerS 3 was larger than the memory capacity of our computer. For caenorhabditis elegans and arabidopsis thaliana, BitMapper was also several times faster than the existing allmappers. In addition, BitMapper still showed great performance in sensitivity comparison. We did not present the results of mrFAST, since it reported many extra locations. Thus, the running time was extremely long.
Memory usage comparison
If a reference genome is large, the memory usage of most mappers mainly depends on the size of the genome and the index for it. For instance, the human genome could be regarded as a long string including 3.15 billion symbols so that 3GB is required to store them. For hash table index, the locations for each qgram should be saved and a location is represented by a 32bit integer. Thus, BitMapper and Hobbes 2, which both index the reference genome using hash tables, require more than 14GB to load the index and genome. Similarly, Masai requires large memory space and uses about 20GB to map 10 million reads to human genome. Although the hash table index is also used in mrFAST, only 7GB is used since it splits the whole human genome and index into several segments and loads one of them at a time. Yara is another read mapper which requires small memory space, since it uses the BWT and FMindex. The memory usage of RazerS 3 mainly depends on the number of mapping locations. It needs more than 24GB to map 10 million 151 bp reads of human.
Conclusion
BitMapper is designed to find all mapping locations for each read based on bitvector computing. In experiments on both simulated and real data sets, it achieved nearly full sensitivity and superior speed, outperforming existing stateoftheart allmappers.
The verification of edit distance constitutes a significant portion of the whole running time. We propose a new vectorized Gene Myers’ bitvector algorithm, which calculates the edit distance of a read to multiple locations in a given genome. To make full use of the algorithm, the traditional verification scheme is redesigned in BitMapper.
Recently, a new SIMD instruction set AVX2 has been applied to many CPUs. Thus, the performance of our vectorized Gene Myers’ bitvector algorithm will be improved further by using AVX2 in the future. The vectorized bitvector computing approach can also be used to accelerate filtration, which is a future research direction in BitMapper.
BitMapper is implemented in C under a GPL license and is able to download at http://home.ustc.edu.cn/%7Echhy.
References
Langmead B, Trapnell C, Pop M, Salzberg SL. Ultrafast and memoryefficient alignment of short dna sequences to the human genome. Genome Biol. 2009; 10(3):25.
Langmead B, Salzberg SL. Fast gappedread alignment with bowtie 2. Nat Methods. 2012; 9(4):357–9.
Li H, Durbin R. Fast and accurate short read alignment with burrows–wheeler transform. Bioinformatics. 2009; 25(14):1754–60.
MarcoSola S, Sammeth M, Guigó R, Ribeca P. The gem mapper: fast, accurate and versatile alignment by filtration. Nat Methods. 2012; 9(12):1185–8.
Weese D, Holtgrewe M, Reinert K. Razers 3: faster, fully sensitive read mapping. Bioinformatics. 2012; 28(20):2592–599.
Kim J, Li C, Xie X. Improving read mapping using additional prefix grams. BMC Bioinformatics. 2014; 15(1):42.
Hach F, Hormozdiari F, Alkan C, Hormozdiari F, Birol I, Eichler EE, et al. mrsfast: a cacheoblivious algorithm for shortread mapping. Nat Methods. 2010; 7(8):576–7.
Xin H, Lee D, Hormozdiari F, Yedkar S, Mutlu O, Alkan C. Accelerating read mapping with fasthash. BMC Genomics. 2013; 14(Suppl 1):13.
Ahmadi A, Behm A, Honnalli N, Li C, Weng L, Xie X. Hobbes: optimized grambased methods for efficient read alignment. Nucleic Acids Res. 2012; 40:41–1.
Siragusa E, Weese D, Reinert K. Fast and accurate read mapping with approximate seeds and multiple backtracking. Nucleic Acids Res. 2013; 41(7):78–8.
Myers G. A fast bitvector algorithm for approximate string matching based on dynamic programming. J ACM (JACM). 1999; 46(3):395–415.
Hyyrö H. A bitvector algorithm for computing levenshtein and damerau edit distances. Nord J Comput. 2003; 10(1):29–39.
Siragusa WD E, Reinert K. Yara: welldefined alignment of highthroughput sequencing reads. http://www.seqan.de/projects/yara/.
Rasmussen KR, Stoye J, Myers EW. Efficient qgram filters for finding all εmatches over a given length. J Comput Biol. 2006; 13(2):296–308.
Weese D, Emde AK, Rausch T, Döring A, Reinert K. Razersfast read mapping with sensitivity control. Genome Res. 2009; 19(9):1646–54.
Sellers PH. The theory and computation of evolutionary distances: pattern recognition. J Algorithms. 1980; 1(4):359–73.
Ukkonen E. Finding approximate patterns in strings. J Algorithms. 1985; 6(1):132–7.
Holtgrewe M. Mason–a read simulator for second generation sequencing data. Technical Report FU Berlin. 2010.
1000 Genomes: a Deep Catalog of Human Genetic Variation. http://www.1000genomes.org/data.
DNA Data Bank of Japan. ftp://ftp.ddbj.nig.ac.jp.
National Center for Biotechnology Information. http://www.ncbi.nlm.nih.gov/.
Holtgrewe M, Emde AK, Weese D, Reinert K. A novel and welldefined benchmarking method for second generation read mapping. BMC Bioinformatics. 2011; 12(1):210.
BaseSpace Sequencing Data Sets. http://www.illumina.com/informatics/research/sequencingdataanalysismanagement/sequencingdatalibrary.html.
Acknowledgements
The authors would like to thank Yanan Zhao for her suggestions about our article. This work was partially supported by the Key Project of The National Nature Science Foundation of China under the grant No. 60533020 and the Fund for Foreign Scholars in University Research and Teaching Programs(B07033).
Author information
Authors and Affiliations
Corresponding authors
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
HC developed the vectorized Gene Myers’ bitvector algorithm, and implemented the whole software. HC, JY and YX designed the strategies in the software. HC,YX and YS drafted the manuscript. HJ tested the software and revised the bugs of it. All authors read and approved the final manuscript.
Additional file
Additional file 1
Supplementary material. This file consists of the configuration of each read mapper and the analysis of the two vectorized verification schemes. Besides, we present the pseudo code of our vectorized bitvector algorithm and the performance comparison between it and other existing implementation of the Gene Myers’ algorithmin in the additional file.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.
The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
To view a copy of this licence, visit https://creativecommons.org/licenses/by/4.0/.
The Creative Commons Public Domain Dedication waiver (https://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Cheng, H., Jiang, H., Yang, J. et al. BitMapper: an efficient allmapper based on bitvector computing. BMC Bioinformatics 16, 192 (2015). https://doi.org/10.1186/s1285901506269
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s1285901506269