smsMap: mapping single molecule sequencing reads by locating the alignment starting positions

Background Single Molecule Sequencing (SMS) technology can produce longer reads with higher sequencing error rate. Mapping these reads to a reference genome is often the most fundamental and computing-intensive step for downstream analysis. Most existing mapping tools generally adopt the traditional seed-and-extend strategy, and the candidate aligned regions for each query read are selected either by counting the number of matched seeds or chaining a group of seeds. However, for all the existing mapping tools, the coverage ratio of the alignment region to the query read is lower, and the read alignment quality and efficiency need to be improved. Here, we introduce smsMap, a novel mapping tool that is specifically designed to map the long reads of SMS to a reference genome. Results smsMap was evaluated with other existing seven SMS mapping tools (e.g., BLASR, minimap2, and BWA-MEM) on both simulated and real-life SMS datasets. The experimental results show that smsMap can efficiently achieve higher aligned read coverage ratio and has higher sensitivity that can align more sequences and bases to the reference genome. Additionally, smsMap is more robust to sequencing errors. Conclusions smsMap is computationally efficient to align SMS reads, especially for the larger size of the reference genome (e.g., H. sapiens genome with over 3 billion base pairs). The source code of smsMap can be freely downloaded from https://github.com/NWPU-903PR/smsMap.


Background
Technological breakthroughs in high-throughput sequencing (HTS) platforms have triggered a revolution in genomics [1][2][3][4][5], which enable scientists to obtain the full genomic sequence of many species by read alignment and de novo assembly [6][7][8]. Mapping (or aligning) the HTS reads from a sample to a reference genome is the most fundamental and computing-intensive step in genome resequencing studies [7,8], which detects genome variations such as single nucleotide polymorphisms, large-scale structural variations (SVs) and count for different sequencing errors [9,10]. All existing tools for mapping short reads, such as BLAST [11], BLAT [10], SOAP2 [12], Bowtie [13] and BWA [14], aim to find a "long" substring that would exactly match its mapping locus on the reference genome by using either Burrows-Wheeler Transform Full-text Minute-space (BWT-FM) index [15,16], or substring hashing [17], or hybrid methods of combining FM index with hashing [18]. However, due to a higher number of errors which are primarily insertions and deletions rather than substitutions, the mapping methods created for short reads are not readily extended to long reads generated from the single molecule sequencing (SMS) technology [19].
Recently, a number of available methods (or tools) for mapping SMS long reads to the reference genome, such as BLASR [20], BWA-MEM [21], rHAT [22], GraphMap [23], LAMSA [24], minimap2 [25], NGMLR [26] and lordFAST [27], have been proposed. BLASR [20] is the first tool that is specially designed for mapping SMS reads. It first builds a BWT-FM index [15,16] of the genome to search exact matches and then applies sparse dynamic programming (SDP) to generate rough alignments. The final detailed alignments are generated by dynamic programming. BWA-MEM [21] initially finds the alignment seeds and greedily chains these seeds, then extends the selected seeds to achieve the alignment. rHAT [22] utilizes the regional hash table (RHT) to find the highest possible candidate regions and then adopts an SDP-based approach to align the reads on the candidate regions. GraphMap [23] finds gapped space seeds that are clustered to obtain a coarse alignment, then uses a graph-based vertexcentric procedure to construct the alignment anchors, and chains these anchors, refining the chain to generate the final alignment. LAMSA [24] finds all approximate matches on the reference genome by using the GEM mapper [28], it then builds a direct acyclic graph (DAG) to generate alignment skeletons. Finally, LAMSA implements a specific split-alignment strategy to fill the gaps within the skeletons. minimap2 [25] collects minimizers [29] of the reference genome sequence for indexing them in a hash table and then finds matches to the reference by identifying the sets of co-linear seeds. Afterward, minimap2 applies dynamic programming in the unseeded regions to get the final alignment. NGMLR [26] first finds the linear mapping seeds, then performs a pairwise sequence alignment based on the Smith-Waterman algorithm. NGMLR lastly selects the set of linear alignments with the highest joint score as the final read alignment results. lordFAST [27] first builds an index from the reference genome then maps reads to the reference genome by extracting longest exact matches. It next selects candidate alignment regions, and finally gets the base-to-base alignment with dynamic programming.
All of the above mapping methods designed for SMS reads follow the canonical seedand-extension paradigm [18,30], that is, they find the maximal exact matches (seeds) and then extend the alignment to the non-seed fragments within the selected candidate regions in the query read and the reference genome. The major differences among them are the ways by which seeds and the candidate aligned region are selected.
The candidate aligned regions for each query read are selected either by counting the number of matched seeds (e.g., rHAT and lordFAST) or chaining a group of seeds that are colinear or close to each other (e.g., BLASR, LAMSA, GraphMap, NGMLR, BWA-MEM, and minimap2). However, the candidate aligned region of each query read is always a part of the read and cannot completely cover the whole read length. Although the non-seed fragments are subsequently aligned with dynamic programming, they are still within the candidate regions. For the two ends of the query read, which are not covered by the candidate aligned region, all existing methods do not align them and directly output as the soft clipping. As a result, these methods designed for SMS reads usually focus on producing local mapping results for the query read, other than obtaining the whole end-to-end alignment, leading to low aligned coverage (aligned fraction of the read). The aligned coverage is the percentage of one query sequence aligned to a reference genome, which reflects the effectively aligned size of the query sequence. Many researchers consider the alignments with higher aligned coverage as the valid alignment results [31][32][33][34]. Additionally, alignments with higher aligned coverage mean that more aligned bases can be obtained, which is a key requirement for mapping tools and mapping-based analysis [23], as bases that cannot be mapped are unavailable for use in many downstream applications [35][36][37]. Therefore, the shortcoming (i.e., low aligned coverage) of current mapping methods highlights the need for a sensitive, efficient computational method with higher aligned coverage.
Herein, we proposed a new SMS sequence mapping method (called smsMap) that aims to get the end-to-end accurate alignment against the reference genome for a query read. smsMap mainly contains three steps. It first constructs the BWT-FM index for the reference genome, then finds the starting positions in the query read and reference genome, and lastly a column reduction banded alignment method is developed to obtain the detailed dynamic alignment results from the located starting positions to the two ends of the query read and the genome, which can cover the whole read length. The experiments on simulated and real-life PacBio datasets show that smsMap can achieve more aligned coverage than other mapping tools. Also, smsMap is more sensitive that can map more reads and bases onto the reference genome.

Results
smsMap is implemented in C++ language with multithreading, and it can be run in both Linux and Windows systems. To evaluate the performance and efficiency of smsMap, we compared our smsMap with other seven state-of-art long read mapping tools, such as BLASR [20], BWA-MEM [21], GraphMap [23], minimap2 [25], NGMLR [26], rHAT [22] and lordFAST [27] on simulated datasets and real-life datasets. Because LAMSA [24] always appears a segmentation fault (core dumped) information, we did not compare with it. The real-life raw sequencing datasets, derived from E. coli, A. thaliana, C. elegans and H. sapiens (CHM1), were generated by PacBio sequencing platform. All methods were executed on an Ubuntu 16.04.5 server with 16 3.2-GHz Intel Xeon (E5-2667V4) processors and 128 GB of RAM. The parameters used for each mapping tools are given in Table S1.
For simulated sequence datasets, if a read is aligned to the correct genome and strand, and the aligned subsequence on the reference genome overlaps with the "true" mapping subsequence by at least p bases (here p = 0.9L(r τ )), we consider this read to be correctly mapped on the genome [27]. If a matched base locates within T bp (here T = 5) of the corresponding truth position on the genome [24,27], we consider this base as a correct matched base. Thus, we use three measures including the fraction of correctly aligned reads (cFAR), the fraction of correctly aligned bases (cFAB), and the average coverage ratio of correctly aligned reads (cACR) to estimate the performance of mappers on simulated datasets, which are defined as the following percentages: where N c is the total number of correctly aligned reads, N is the total number of query reads, M c τ is the number of correct matched bases for read r τ , M τ is the number of matched bases for read r τ , M is the total number of matched bases for all correctly aligned reads. An example of how to calculate cFAR, cFAB and cACR is presented in the supplementary file. Additionally, for simulated sequence datasets, base sensitivity and precision [27] are used to compare the performance of different mappers. Sensitivity is defined as the number of correct matched base divided by the total number of bases, precision is defined as the number of correct matched bases divided by the number of mapped bases.
Due to the true base pairing on the reference genome unknown for real-life read datasets, we use another three measures of the fraction of aligned reads (FAR), the fraction of aligned bases (FAB), and the average coverage ratio of aligned reads (ACR) to estimate the performance of mappers on real-life datasets, which are defined as the following percentages: where N a is the total number of aligned reads, N is the total number of query reads, M a τ is the number of matched bases for read r τ , M τ is the base number of read r τ , M is the total base number for all query reads.

Simulation without structural variations
We first adopted the simulated datasets without structural variations (SVs) to evaluate the performance of our smsMap and other mapping tools. The E. coli MG1655 genome sequence (with the length of 4,614,652 bp) from NCBI (No. NC_ 000913.3) was downloaded and inputted to the NPBSS simulator [19] for generating the PacBio simulated reads with different error rates. As a result, 6 simulated datasets with 5, 10, 15, 20 25 and 30% error rates were generated. The sequencing depth and average read length are 50 and 10,000 bp, respectively. The error parameter settings of NPBSS can be found in Table S2. The reads number and total bases of each simulated dataset are listed in Table S3. Figure 1 shows the cFAR, cFAB, cACR, sensitivity, and precision of smsMap, BLASR, BWA-MEM, GraphMap, minimap2, NGMLR, rHAT and lordFAST on the simulated datasets with different error rates (Table S4 gives the detail results of these methods). From Fig. 1, we can see that smsMap and GraphMap correctly mapped almost all reads and bases with different error rates (i.e., 5 to 30%) to the genome, while the cFAR, cFAB of other six mappers gradually decrease as the read error rates increase. smsMap, lordFAST, and GraphMap achieved higher cACR than the other five methods with different error rates, but our smsMap obtained little higher cACR than lordFAST and GraphMap, especially when the error rate is more than 20%. For the base sensitivity and precision in Fig. 1d and e, we can see that the sensitivity and precision of smsMap, minimap2, lordFAST, BLASR, and GraphMap are significantly higher than those of rHAT and NGMLR with error rate increases from 5 to 15%. For error rates ranging from 15 to 30%, smsMap and GraphMap achieved higher sensitivity and precision than other methods, and smsMap obtained a little higher sensitivity and precision than GraphMap. These results show that our smsMap are more robust to sequencing errors, and it can obtain better mapping quality for simulated datasets without SVs. Tables S5 lists the alignment scores for different methods with parameters: match = 2, mismatch = − 2, gap existence = − 2 and gap extension = − 2. It can be seen that the average scores are various among different methods.
Another simulated dataset of H. sapiens (CHM1) generated by NPBSS was also applied to test the performance of different methods. Table S6 shows the cFAR, cFAB, cACR, base sensitivity and precision of smsMap, BLASR, BWA-MEM, GraphMap, minimap2, NGMLR, rHAT and lordFAST on the simulated dataset of H. sapiens (CHM1). We can see that smsMap still achieved better mapping results in terms of cFAR, cFAB, cACR, sensitivity, and precision.

Simulation with structural variations
In order to estimate the capability of smsMap for mapping reads that span structural variations (SVs), we used another simulation dataset from chr1 of NA12878 with SVs. The simulation dataset with SVs was generated by inserting 7 SVs (i.e., 3 insertions, 3 deletions and 1 inversion) from DGV [38] into the reference chr1 and using the NPBSS simulator [19] at 20x coverage. Among the simulated reads, a total of 185 reads cover the SVs breakpoints. The detailed SVs and its breakpoints are listed in Table S7.
If the start and end alignment coordinates of a read in the genome cover the actual simulated breakpoints, we consider this read spanning SVs [27]. Here, we provide the aligned reads number of spanning SVs (#SVs) to evaluate the performance of different mapping tools. The results of our smsMap and other seven tools are listed in Table 1, from which we can see that our smsMap can map more reads with SVs on the genome than the other six tools, suggesting that our smsMap can also handle the SV-spanning reads.

Evaluation of the real datasets
In addition to the simulated datasets, we also used four datasets (generating by PacBio RS II) of E. coli, A. thaliana, C. elegans and H. sapiens (CHM1) to further test the performance of our smsMap. The raw sequences, assembly genome, and statistics of these datasets can be found in Tables S8, S9, S10, respectively. Considering that rHAT always returns the segmentation fault (core dumped) information, we do not provide the results of rHAT in the real datasets. Table 2 reports the mapping results of seven methods in four real datasets, and Table 3 lists ACR standard deviation (std). From Table 2, we can see that FAR of our smsMap is respectively 3.16-24.69%, 2.18-12.12%, 0.58-4.38%, and 3.75-9.86% higher than those of other six methods for E. coli, A. thaliana, C. elegans and H. sapiens   Table 3, we can find that the ACR std. of smsMap is also significantly smaller than those of other six methods, indicating that smsMap can generate better mapping quality for real datasets. Additionally, one real dataset (i.e., E. coli UTI89) generated by MinION sequencer was used to evaluate the performance of seven methods for Oxford Nanopore sequencing data [23]. Table  S11 reports the mapping results for E. coli UTI89 dataset, from which we can also observe that smsMap achieved higher FAR, FAB, and ACR than other methods, demonstrating that smsMap generates better mapping quality for Oxford Nanopore sequencing data. Additionally, the agreement between different methods based on their alignment results were measured. For a given read, there are two alignment results x and y generated by two methods. We define x covers y if the aligned region on the reference genome covered by x shares at least 90% overlaps with the aligned region covered by y [27]. Figure 2 presents the illustration of covering and non-covering alignments. Table 4 reports how best alignments from different methods cover each other for E. coli dataset. Specifically, each row contains the percentage of alignments generated by one method that covers alignments obtained by other tools. For example, among all aligned reads for smsMap and BLASR in Table 4, 95.23% of the alignments produced by BLASR are covered by smsMap, while only 85.76% of the alignments generated by smsMap are covered by BLASR. Tables S12, 13, 14 report the agreement between Fig. 2 A toy example to illustrate the covering and non-covering alignments. x, y and z are different alignments obtained by different methods for the same read. We can see that alignments x and y cover each other as aligned regions on the reference genome share more than 90% overlap. On the other hand, the alignment z does not cover either alignment x or y Given the massive sequences generated by SMS technology, we also need to consider the computational complexity of the mapping tools. To demonstrate the computational efficiency of our smsMap, we compared smsMap with other mapping tools on the four real-world datasets. Table 5 shows the running time (wall-time) and memory usage by using the seven tools. We can see that for the reference genome datasets with small size and relatively short average read lengths, such as E. coli and A. thaliana datasets, the speed of smsMap is a little lower than minimap2 and lordFAST. But for the H. sapiens genome dataset with larger size and relatively short average read length, smsMap is faster than lordFAST, BWA-MEM, NGMLR, and BLASR. These results indicate that smsMap is efficient to align SMS reads, especially for the larger size of the reference genome.

Discussion
Most mapping methods for SMS reads adopt the classical seed-and-extension methodology to obtain the alignment results. That is, they first find the exactly matched seeds in the reference genome, then select the candidate aligned region based on counting the number of matched seeds (e.g., rHAT and lordFAST) or chaining a group of seeds that are co-linear or close to each other (e.g., BLASR, LAMSA, GraphMap, NGMLR,  BWA-MEM, and minimap2), finally, extend the alignment to the non-seed fragments within the selected candidate regions. However, the candidate aligned region of each query read is always a part of the read and cannot completely cover the whole read length. As a result, these methods usually focus on producing local mapping results for the query read, other than obtaining the whole end-to-end alignment, leading to low aligned coverage. To solve the above issue, here we developed smsMap to obtain the whole read alignment by locating the alignment starting positions. smsMap is also a seeds-based method using BWT-FM index technique, there are two differences between smsMap and other methods: i) smsMap proposes a scoring strategy to select the candidate aligned regions by defining a credibility function to measure the starting position credibility, which can locate the aligned positions for each query read; ii) smsMap introduces a banded alignment on the low column memory matrix to get the alignment results of the whole read. The credibility function ensures that smsMap can locate the aligned positions for every query read, even in the situation that the matched seeds are dispersedly distributed in the reference genome. Thus, smsMap can get higher FAR, that is, align more reads. The banded alignment with the low column can obtain the whole end-to-end alignment, not local alignment achieved by other methods. Therefore, the FAB of smsMap is higher than other methods. Table S15 shows the example alignments of different methods for one sequence with length of 296 bp, the detail base-to-base alignments are also provided in the supplementary file. We can see that smsMap aligned the whole read, while other tools failed to align the whole read. So, the alignment of smsMap can truly reflect the error rate of the sequencing platform, while other tools just output the local alignment results. Now, smsMap just outputs the bestaligned position on the reference for query reads. But for a long chimeric read that a part of the read comes from one position and another part of the read from a different position. smsMap still reports one aligned position. This is a limitation of smsMap. Thus, if users prefer to get the whole end-to-end alignments, smsMap is recommended, if users prefer to obtain other aligned positions for a query read, other methods such as lordFAST and minimap2 are recommended.

Conclusions
With the development of SMS technologies (e.g., PacBio and Oxford Nanopore Min-ION) that produce long but noisy reads, mapping these reads to the reference genome has become a central bioinformatics challenge. It is important to develop novel long read alignment tools with better aligning accuracy as well as higher aligned coverage.
In this article, we developed smsMap to improve mapping quality of the long reads. Mainly, there are two key features of smsMap. i) smsMap utilizes a strategy to identify the starting positions in the query read and reference genome by designing a position credibility function, this strategy makes more query reads aligning on the genome, and also enables more segments of the query read mapping to the genome. ii) Compared with the traditional banded alignment algorithm, smsMap implements the banded aligning on a low column matrix, which can reduce the memory usage. The experimental results on both simulated and real-life SMS datasets show that smsMap achieves higher aligned read coverage ratio and better mapping quality, and it can be more robust to the high sequencing errors. In addition, smsMap adopts the strategy of locating the start position that provides the split alignments of the reads. It makes smsMap appropriate for aligning reads deriving from regions with long structural variations.

Methods
An overview of the smsMap mapper is shown in Fig. 3. smsMap mapper mainly includes three main phases: i) build the BWT-FM index of the reference genome (Fig. 3a), ii) locate the best starting positions in genome and query read by designing a position location approach (Fig. 3b), and iii) obtain the detailed dynamic alignment results by presenting a strategy of banded alignment on the low column memory matrix (Fig. 3c).

Indexing the reference genome
The BWT-FM index is a compressed full-text substring index based on the BWT transformation, which can efficiently find the number of occurrences of a substring within the compressed text, as well as locate the position of each occurrence [39]. It has been widely applied in bioinformatics, such as whole-genome alignment [40], short DNA sequence mapping [13], etc. Inspired by the BWT-FM index that allows long reference genome to be searched efficiently with low memory usage [41], here, we use the BWT-FM technique implemented in combined-index [27] to construct the index for the reference genome, which can quickly locate and find the match positions in the reference genome for a given short word (default word length k = 14).

Locating the starting positions for alignment
Suppose that the query read r is the input sequence, where we do not know which position in the reference genome that r comes from or which region in the genome that has high similarity to r. It is impossible to align r with each subsequence of the genome due to high time complexity. Therefore, it is crucial to design a searching strategy to quickly locate the positions in query read r and genome for mapping.
Because the genome sequence is greatly longer than the query read, there is a high possibility that one short word in the query read can be found in multi-positions in the genome, but only one of the multi-positions for each word is the right location. In addition, due to the query reads generated by SMS technology containing different Fig. 3 Overview of the smsMap mapper. a Index the reference genome by BWT-FM strategy. b Locate the starting positions in the genome and the query read, exact word matches are shown with different colors. c A detailed dynamic alignment result is obtained by using the proposed column reduction banded alignment as a guide. The backtracking route is denoted by the yellow arrows errors, there is a high possibility that some words cannot be mapped to the genome (e.g., word w 1 and w 6 in Fig. 4), and the widths between the corresponding mapped positions of adjacent word pairs are usually unequal (e.g., p 2 4 − p 1 3 ≠ p 1 8 − p 2 7 in Fig. 4). Considering these issues, we proposed the following position location strategy to find the starting positions for aligning quickly.
First, we search all exact short word matches from the query read r in the genome index built in the previous step. An exact match of word w i in the read r onto the genome can be described by a position set W i : where i is the position of the i-th word (w i ) in the read r, p l i ðl ¼ 1; 2; ⋯; L i Þ is the l-th matched position of w i on the genome, o l i ¼ ðp l i − iÞ is the modified position of the l-th matched position. After building the position sets for all the matched words in read r, we can define the following credibility function Sðo l i Þ to measure the starting position credibility for aligning.
where Θ is the total matched word number in the query read r, L j is the total matched word number of word w j on the genome, and L(r) is an error-tolerant length function for the query read r (here we set L(r) = 0.2r).
From all the credibility scores of o l i ði ¼ 1; 2; ⋯; Θ; l ¼ 1; 2; ⋯; L i Þ, we select the o l i with largest score value to identify the alignment starting positions on the read and genome, that is, the position of word w i on the read is considered as the alignment starting position, and its l-th matched position on the genome is considered as the alignment starting position.

Banded alignment on the low column memory matrix
After identifying the alignment starting positions on the read and genome, the starting positions generally divide the query read into downstream segment r d and upstream segment r u , the genome into downstream segment g d and upstream segment g u (Fig. 5). For aligning each pair of segments (i.e., pair of r d and g d , or r u and g u ), the traditional banded alignment dynamic programming [42] can be applied to get the alignment result (Fig. 6a). Evidently, it needs a l(r d ) × l(g d ) matrix to store the alignment scores, Fig. 4 The mapping positions for a query read of SMS. Some words (e.g., w 1 and w 6 ) cannot be mapped to the genome because it contains sequencing errors, the right mapping positions for alignment are dotted where l(r d ) and l(g d ) are the length of r d and g d , respectively. Generally, the length of l(r d ) and l(g d ) are over 1000 bps, and it requires bigger memory usage. In order to reduce the matrix memory usage, here we present the following strategy to reduce the matrix column size for relieving the large memory usage (Fig. 6b and c).
We first extract the score region and index for every column in Fig. 6a. The number (i.e., starting column index) in green boxes in Fig. 6b can be used to obtain the bases in the genome for scoring. Based on the observation from Fig. 6b that the maximum aligning base number for each row is 2b (b is the bandwidth), we can use a matrix with l(r d ) × 2b (Fig. 6c) to store the aligning scores in Fig. 6b. Because 2b is much smaller than l(g d ), the matrix l(r d ) × 2b can significantly reduce memory usage. The scores in the low column matrix in Fig. 6c are calculated by the following eqs. 6-10.
where F(u, v) is the score value of the u-th row and v-th column in the matrix in Fig. 6c, Score[r d (u), g d (v ' ')] is the match score or mismatch penalty for the u-th base in segment r d and the v ' '-th base in g d , sci(u) is the starting column index and floor() is the floor function. Here we define the bandwidth b = αl r , where l r is the read length, α is a width coefficient. The default value of α is 0.1, which can cover almost all aligned paths (see supplementary file for more discussion about the banded width). After obtaining the two low column matrices of l(r d ) × 2b and l(r u ) × 2b, we applied the banded alignment algorithm on these two matrices to align r d with g d , and r u with g u , respectively. In the end, we combine the aligning results of downstream segments (i.e., r d with g d ) and upstream segments (i.e., r u with g u ) to get the mapping result of the query read and the reference genome.
Additional file 1: Figure S1. An example to explain how to calculate the metrics of cFAR, cFAB and cARC. The numbers above each read are the start and end aligned positions. Figure S2. The base-to-base alignments for different methods with one sequence (length: 296 bp). lordFAST and minimap2 fail to align this sequence. The statistics of the alignments are listed in Table S15. Figure S2. Diagram of using three bandwidths for aligning. (A) Too large bandwidth. (B) Too small bandwidth. (C) Appropriate bandwidth, it not only covers the backtracking, but also reduces the memory usage. Figure S3. The distribution of width coefficient for simulated datasets with different error rates ranging from 5 to 30%. Table S1. Running command lines of different mapping programs. Table S2. Parameter settings of NPBSS for generating simulated datasets. Table S3. Read number and total bases of simulated datasets. Table S4. The cFAR, cFAB and cACR for different methods on the simulated datasets. Table S5. Alignment scores of different methods for E. coli simulated datasets. The values in the brackets are the min. and max. Alignment scores. Table S6. The cFAR, cFAB and cACR for different methods on the simulated datasets of H. sapiens (CHM1). Table S7. Different SVs types and its breakpoints in the genome. Table S8. PacBio datasets website links. Table S9. Reference genome website links. Table S10. Statistics of E. coli, A. thaliana, C. elegans and H. sapiens datasets. Table S11. FAR(%), FAB(%) and ACR(%) of eight methods on E. coli UTI89 dataset. Table S12. Agreement of different alignment methods for A. thaliana dataset. Table S13. Agreement of different alignment methods for C. elegans dataset. Table S14. Agreement of different alignment methods for H. sapiens dataset. Table S15. Aligned results of different alignment methods for one sequence. (see Figure S2 for detail alignments). Table S16. The average and maximum of width coefficient for simulated datasets.