Re-alignment of the unmapped reads with base quality score

Motivation Based on the next generation genome sequencing technologies, a variety of biological applications are developed, while alignment is the first step once the sequencing reads are obtained. In recent years, many software tools have been developed to efficiently and accurately align short reads to the reference genome. However, there are still many reads that can't be mapped to the reference genome, due to the exceeding of allowable mismatches. Moreover, besides the unmapped reads, the reads with low mapping qualities are also excluded from the downstream analysis, such as variance calling. If we can take advantages of the confident segments of these reads, not only can the alignment rates be improved, but also more information will be provided for the downstream analysis. Results This paper proposes a method, called RAUR (Re-align the Unmapped Reads), to re-align the reads that can not be mapped by alignment tools. Firstly, it takes advantages of the base quality scores (reported by the sequencer) to figure out the most confident and informative segments of the unmapped reads by controlling the number of possible mismatches in the alignment. Then, combined with an alignment tool, RAUR re-align these segments of the reads. We run RAUR on both simulated data and real data with different read lengths. The results show that many reads which fail to be aligned by the most popular alignment tools (BWA and Bowtie2) can be correctly re-aligned by RAUR, with a similar Precision. Even compared with the BWA-MEM and the local mode of Bowtie2, which perform local alignment for long reads to improve the alignment rate, RAUR also shows advantages on the Alignment rate and Precision in some cases. Therefore, the trimming strategy used in RAUR is useful to improve the Alignment rate of alignment tools for the next-generation genome sequencing. Availability All source code are available at http://netlab.csu.edu.cn/bioinformatics/RAUR.html.


Introduction
Next-generation genome sequencing (NGS) technologies, including Illumina/Solexa and AB/SOLiD, generate billions of short reads (25-200 bp) and become more and more popular. Based on NGS technologies, a variety of biological applications are developed. In many large projects, resequencing and read mapping are extensively used, such as 1000 Genome Project [1] and ENCODE [2]. Recently various high-throughput approaches based on bisulfite conversion combined with NGS have been developed and applied for the genome wide analysis of DNA methylation [3]. Resequencing [4], disease genome study [5], and identification of genetic variants [6,7] are also benefited greatly by NGS. For most applications and analysis, assembly and alignment are the first step once sequencing reads are obtained. When reference genomes are not available, assembly will be used to construct genomes and many algorithms have been proposed, such as [8]. The alignment algorithms are applied when reference genomes are available. However, there are many challenges to accurately map the reads to the genome, due to the sequencing errors with an overall per base error rate around 1-2% [9], repeats in the reference genome and differences between the donor and reference genomes.
In recent years, many short read alignment algorithms have been developed to address these challenges, different in speed, memory, accuracy, and alignment strategy [10,11]. There are two main strategies adopted in them. One strategy is spaced seeds, and the representative alignment algorithms are known as MAQ [12] and SOAP [13]. The other one is Burrow-Wheeler Transform [14], and the representative alignment algorithms are BWA [15], Bowtie2 [16], and SOAP2 [17]. Although these alignment algorithms are more and more efficient and accurate, there are a portion of reads which are not mapped at all by the alignment tool or the mapping quality scores are less than the threshold.

The mapping quality and the related works
Mapping quality was firstly proposed in MAQ [12], which is an indicator of the likelihood that a mapping is accurate. Later on, many alignment tools also report mapping qualities for their alignments. The calculation of mapping quality is related to "uniquenes". An alignment is unique if it has a much higher alignment score than all the other possible alignments. In another word, the bigger the gap between the best alignment's score and the second-best alignment's score, the more unique the best alignment, and the higher its mapping quality should be.
Mapping quality is important to the downstream analysis, like variance calling. For instance, a variant caller might choose to ignore evidence from alignments with mapping quality less than 10. However, in almost all the state-of-the-art alignment tools, the mapping quality scores do not correlate well with the actual likelihood that a mapping is accurate [11]. Many accurate mappings are generally reported with quality 0, and many inaccurate mappings are reported with high-quality scores. The RMAP algorithm [18] is proposed to improve mapping accuracy by incorporating base-call quality scores to weight mismatches. Furthermore, Ruffalo et al. [19] use a machine learning approach to re-calculate the mapping qualities of the short read mappings which are more accurate than those reported by the available alignment tools.

The coming of unmapped reads
The re-calculation of mapping quality of the mappings can make the mapping quality more reliable and promote the accuracy to some extent. However, it can do nothing for the reads which are reported as unmapped.
For most alignment tools, the edit distances or the allowed mismatches are limited, thus some reads can not be mapped if the number of mismatches in any hit exceeds the allowable differences. Given a read of length m, BWA [15] only tolerates at most k differences (mismatches or gaps) in a hit, where k is chosen such that < 4% of m-long reads with 2% uniform base error rate. With this configuration, for 15-37 bp reads, k equals 2; for 38-63 bp, k = 3; for 64-92 bp, k = 4; for 93-123 bp, k = 5; and for 124-156 bp reads, k = 6. That is to say, the reads with differences more than k in any hits will be unmapped.
Some trimmed-like strategies appear in some alignment programs and try to handle the problem. For example, in local read alignment mode, Bowtie2 [16] might "trim" or "clip" some read characters from one or both ends of the alignment to maximize the alignment score. The local read alignment can improve the Alignment rate at some extend. However, the false positive sites are also introduced by maximizing the alignment score which will affect the alignment accuracy, since the maximum alignment score can't guarantee that high quality bases are involved. BWA-MEM [20] is a new alignment algorithm, which can perform local alignment and is robust to sequencing errors and applicable to a wide range of sequence lengths.

Our contribution in this article
The unmapped reads also contain many information which is important to the downstream analysis. Thus in this article, we propose a method named (RAUR) to realign these unmapped reads. A trimming strategy used in RAUR is to figure out the longest and most confident and informative segment of a read based on base quality score. It adopts an iterative progress to trim the unmapped reads until the reads can be confidently mapped or can't be mapped in the whole progress. RAUR can combine with any alignment tool to improve the alignment rate. In our experiments, RAUR is combined with BWA [15] and Bow-tie2 [16] separately, and run on both the simulated data and real data with different read lengths. By comparing the Precision and Alignment rate, we can find out that RAUR can improve the Alignment rate of each alignment tool greatly, while the Pecision are still comparative with those of the original alignment tool. Furthermore, in some cases, it has comparative or better performance than BWA-MEM and the local read alignment mode of Bowtie2, which also adopt trimmed-like strategies.

Methods
In this section, we investigate the correlation between the low base quality scores and sequencing errors. Based on the investigation, the trimming strategy adopted in RAUR is presented in details. Then, RAUR algorithm is described.

Base quality scores distribution
Quality score measures the probability that a base is called incorrectly. With sequencing by synthesis technology, each base in a read is assigned a quality score by a phred-like algorithm [21], similar to that originally developed for Sanger sequencing experiments. The quality score of a given base, Q, is defined by Equation 1.
where e is the estimated probability of the base call being wrong. Thus, a higher quality score indicates a smaller probability of error. A quality score of 10 represents an error rate of 1 in 10, with a corresponding call accuracy of 90%; a quality score of 20 represents an error rate of 1 in 100, with a corresponding call accuracy of 99%; a quality score of 30 represents an error rate of 1 in 1000, with a corresponding call accuracy of 99.9%. In this paper, a base quality score ≥ 20 is considered as a high base quality, otherwise it is a low base quality.
Sequencing errors are one of the main resources for mismatches. The differences between the individual genome and the reference genome are the other resource for mismatches or gaps in alignment. We investigate the quality score of sequencing errors of ILLUMINA sequencing reads with length 50-bp simulated by ART [22]. As shown in Figure 1, we can observe that, the base quality scores of the majority of sequencing errors are lower than 20. On the other side, majority of the bases (above 90%) with low base quality scores (≤ 20) are not sequencing errors, as shown in Figure 2.

The strategy of trimming
There is a saying that the more things you do, the higher possibility you will make a mistake. Similarly, more bases considered, more sequencing errors will be encountered, which may ruin the alignment. With the number of mismatches or the edit distance greater than the allowed value, some reads will be unmapped by the alignment tools, or are mapped with low mapping qualities. These reads are excluded from downstream analysis. However, some confident segments of these reads can be used in variance calling. The first and most important step to make use of the unmapped reads is to figure out the most confident and informative segment of an unmapped read, which can be aligned correctly. This step is called trimming.
The purpose of trimming is to control the number of possible mismatches in the alignment. Mismatches in alignment can be sequencing errors and variances. Given a segment with K low quality bases, the maximum number of possible mismatches is K+b, and the minimum number is 0, where b is the number of possible variances. From Figure 2, we can know that the probability that all the K low quality bases in the segment are sequencing errors is small. Furthermore, Sachidanandam et al. [23] found out that it is nearly in 1 kb that there is a SNP, which indicates in a short read, b is ≤ 1. Thus, an alignment tool which allows K edit distances in a read, can align a segment with K low quality bases confidently. Additionally, to align uniquely, the length of the segment should be long enough. Thus, our aim of trimming is to find the longest segment with no more than K low quality bases, which can be aligned uniquely. Figure 1 The quality score distribution of sequencing errors. The 10 million reads of Illumina's Solexa with length 50-bp simulated by ART, and each base in a read is assigned a quality score by a phred-like algorithm. X represents the quality scores ranging from 0 to 40, and Y represents the number of sequencing error corresponding for each X value. The details of trimming is illustrated as Algorithm 1. The inputs are unmapped reads, and parameter K. K is the number of low quality bases allowed in the segment. For each read, the positions of the bases with low qualities in the read are stored in a array. A segment of a read is several successive bases. Then we check the lengths of segments in the read containing K low quality bases. Each unmapped read is undertook the trimming in RAUR, and can be represented by a longest segment(or called a trimmed read) under the parameter K. The longest segments will be output in the same format as the original unmapped reads. The start position and the end position of a trimmed read in the original unmapped read are recorded, which can be used to deduce the position of an original unmapped read by using these information. if i has a low base quality then 11: Low_position

RAUR algorithm
The process of RAUR is illustrated in Algorithm 2.
Firstly, reads are aligned by an alignment program.
Then the unmapped reads and the unconfident mapped reads (with mapping quality less than 10) [15] are the input of the loop. RAUR makes every effort to find out the longest and mappable segments of these reads by decreasing the values of K of the loop. The parameter K is used to control the number of low quality bases allowed in the trimmed reads. In all experiments of this paper, K is set as 8. For each iteration, the first step is to trim each unmapped reads into a longest segment (trimmed reads) containing K low quality bases. Then align these trimmed reads by the alignment program.
When the trimmed reads with K low quality bases cannot be aligned or confidently mapped, their original reads are the input of the next loop with K = K-1. The whole process will stop when K = 0. Thus, for each read, it either can be confidently mapped with a certain value of K or can't be mapped with any value of K. Figure 3 shows an example of trimming a read. The consecutive squares represent the bases of a read with 45 bp, where the black color squares denote the bases with low quality scores, and in contrast the white color squares are the bases with high quality scores. There are eight bases with low quality scores in the read. When K = 4, the longest segment of the read starts at position 14 of the original read, and ends at position 42, containing four low quality bases. When the trimmed read can't be aligned, K is decreased by 1, and the trimming algorithm search for the longest segment containing three low quality score bases. The start position of the longest segment is 7, and end position is 29. The trimming will stop when the read can be confidently mapped or K = 0. In our experiments, the initial value of K is set as 8.

Evaluated programs and Evaluation metrics
To demonstrate the efficiency of RAUR, two alignment programs are involved in the experiments: BWA(v0.7.5) [15,20], and Bowtie(v2.0.4) [16], which are BWT-based short read alignment tools. RAUR combines each alignment program separately to re-align the unmapped reads and the unconfident mapped reads. RAUR(BWA) and RAUR(Bowtie2) denote the alignment program combined in RAUR. The two alignment programs are run independently as the control group. BWA-MEM algorithm and the local mode of Bowtie2 are sensitive to align longer reads, such as 70 bp-1 Mbp query reads. For further comparison, BWA-MEM [20] (denoted as BWA(mem)) and the local mode of Bowtie2 (denoted as Bowtie2(local)), which perform local alignment for long reads to improve the alignment rate, are run on the datasets with read length greater than 70. For all the alignment programs, the default options are adopted, and the value of K in RAUR is initiated as 8.
To evaluate the performance of different alignment programs, the Alignment rate and Precision are compared. After alignment, all the reads can be classified into three classes, confidently mapped reads, unconfidently mapped reads and the un-mapped reads. The threshold of mapping quality score to differentiate confident mappings and unconfident mappings is set as 10 for all the alignment programs. Alignment rate is the fraction of confidently mapped reads to all the reads defined as Equation 2. On simulated data, we can know the correct chromosomal coordinates of the alignment and the Precision can be measured. According to the correct chromosomal coordinates, confidently mapped reads can be classified into confidently and correctly mapped reads and confidently but incorrectly mapped reads. Thus, Precision is defined as the fraction of confidently and correctly mapped reads among all the confidently mapped reads, calculated according to Equation 3.
where N is the number of total reads, CN is the number of confidently mapped reads with mapping quality ≥ 10, and CCN is the number of confidently and correctly mapped reads.

Simulated data and performance
On simulated data, we can know the correct chromosomal coordinates of the alignment and the evaluation is Figure 3 An example of trimming. The consecutive squares represent the bases of a read with 45 bp, where the black color squares denote the bases with low quality scores, and in contrast the white color squares are the bases with high quality scores. There are eight bases with low quality scores in the read. When K = 4, the read is trimmed into a longest segment which contains four low quality bases, and when K = 3, the read is trimmed into a longest segment which contains three low quality bases. straightforward. We simulate reads from the whole human genome using ART [22], which simulates sequencing reads by mimicking real sequencing process with empirical error models or quality profiles summarized from large recalibrated sequencing data. In this paper, ART is used to simulates sequencing reads of Illumina's Solexa. Six datasets, including both single-end reads and paired-end reads, are generated by ART against the reference genome of human Hg19, with read length 50-bp, 75-bp and 100-bp, respectively. Each dataset contains more than 1 million reads. And then, the alignment programs map the reads back to the human genome. As the exact coordinate of each read is known, it is able to calculate the Precision of the alignments. Table 1 and 2 show the Alignment rate and Precision of each alignment programs on single-end datasets and paired-end datasets, respectively.
As shown in Table 1 for the simulated single-end reads with length 50 bp, the Alignment rate of BWA and Bow-tie2 are about 74% and 79%, respectively, while the Alignment rate of RAUR(BWA) and RAUR(Bowtie2) are about 83%. It means about 4% and 9% reads can be re-aligned by RAUR. The Precision of RAUR(BWA) is comparative with that of BWA and Bowtie2, whose precision are above 99%, while the Precision of RAUR(Bowtie2) has a little decrease. For the 75-bp reads and 100-bp reads, the Alignment rate of RAUR(BWA) and RAUR(Bowtie2) not only outperform BWA and Bowtie2, but also show advantages when compared with BWA(men) and Bowtie2(local). Although in theory BWA works with arbitrarily long reads, its performances are degraded on long reads especially when the sequencing error rate is high. The Alignment rate of RAUR(BWA) are about 13% more and 47% more than those of BWA on the 75-bp reads and 100-bp reads, and about 3% more and 4% more than those of BWA(men). The Precision of RAUR(BWA) are above 99%, which are comparative with those of BWA and B-WA(men). Compared with Bowtie2, the Alignment rate of both RAUR (Bowtie2) and Bowtie2(local) on the 75-bp reads and 100bp reads are improved, however, their Precision decrease to about 98% and 95%, respectively.
The performance of each alignment program on paired-end reads with different read lengths are compared, as shown in Table 2. Compared with the singleend reads with the same read length, the Alignment rate of each alignment program on paired-end reads are much higher. The Alignment rate of BWA and Bowtie2 are about 84% and 89% on 50-bp paired-end reads, and 91% and 88% on 75-bp paired-end reads, respectively. However, the Alignment rate of BWA on 100-bp pairedend reads is as low as that of BWA on 100-bp singleend reads. In contrast, the Alignment rate of RAUR (BWA) and RAUR(Bowtie2) are above 94% on pairedend reads with different read lengths. Compared with Bowtie2(local), not only the Alignment rate but also the Precision of RAUR(BWA) and RAUR(Bowtie2) are  greater than those of Bowtie2(local) on both 75-bp paired-end reads and 100-bp paired-end reads. However, the performances of BWA(men) are slightly better than RAUR(BWA) on Alignment rate or Precision. From Table 1 and 2, it is easy to find out that with longer read length, the numbers of unmapped reads are increasing, and the Alignment rate of BWA and Bowtie2 are declined, while RAUR(BWA) and RAUR(Bowtie2) can dramatically improve the Alignment rate by re-aligning the unmapped reads. Furthermore, we can observe that RAUR(BWA) and RAUR(Bowtie2) can achieve higher Alignment rate on datasets with longer read length, and the Precisions are above 98%. It indicates that for long reads there exist some fragments whose mapping positions can correctly deduce the mapping positions of the original reads, and RAUR can figure out these most informative fragments to be aligned. Table 3 and 4 list the numbers of re-aligned reads which are actually TP (true positive), and FP (false positive) from single-end simulated datasets and paired-end simulated datasets, respectively. Most of realigned reads are eventually TP. In Table 1 and 2, the alignment rate of RAUR(Bowtie2) are improved, while the precision of RAUR(Bowtie2) are less than those of RAUR (BWA). The reason lies in the different strategies of Bow-tie2 and BWA to perform gapped alignment. BWA pays different penalties for mismatches, gap opens and gap extensions. Bowtie2 combines the full-text minute indexassisted seed alignment and SIMD-accelerated dynamic programming to perform sensitive gapped alignment without incurring serious computational penalties. For Illumina reads, there are only substitution errors seldom indel errors. Since the simulated sequencing reads of Illumina's Solexa are generated by ART against the reference genome of human Hg19, the differences between simulated reads and the reference genome are mismatches rather than gaps. With low penalty for gapped alignment, a gapped alignment may gain a high mapping quality score, which will damage the accuracy of the alignment.
Taken RAUR(bowtie2) for example, the influence of different parameter K on the Alignment rate and Precision is analyzed in Table 5. We run RAUR(Bowtie2) with different initial values of parameter K, and the Alignment rate and Precision are compared, as shown in Table S3. The Alignment rate of RAUR(Bowtie2) are increased with larger initial values of K, while the Precision of RAUR(Bowtie2) are decreased. In the original mappings of Bowtie2, there are reads unmapped or mapped with low mapping qualities due to the exceeding of allowable mismatches or gaps. RAUR employs a parameter K to control the possible mismatches, therefore the Alignment rate are improved by RAUR(Bow-tie2). For large initial values of K, the gapped alignments of Bowtie2 may damage the Precision of RAUR(Bow-tie2). For smaller initial values of K, the Precision of RAUR(Bowtie2) are higher, because the lengths of reads trimmed with the small initial value of K are short, and most part of the trimmed reads are aligned to the genome in an ungapped fashion using the FM Index by Bowtie 2. However, for real data, the Precision of RAUR (Bowtie2) will be higher compared with those on simulated data. Because besides the substitution errors introduced by sequencers, the indels and substitutions will be introduced by the differences between the donor and reference genomes, gapped alignments performed by Bowtie2 will be useful. The influence of different initial values of K on the performance of RAUR(BWA) will be similar, which the Alignment rate of RAUR(BWA) are increased with larger initial values of K, but the Table 3 The number of TP (true positive), and FP (false positive) in the re-aligned reads from single-end simulated datasets #RA is the number of confidently re-aligned reads with mapping quality not less than 10, #TP is the number of confidently and correctly re-aligned reads, and #FP is the number of confidently but incorrectly re-aligned reads. #RA is the number of confidently re-aligned reads with mapping quality not less than 10, #TP is the number of confidently and correctly re-aligned reads, and #FP is the number of confidently but incorrectly re-aligned reads.
Precision of RAUR(BWA) will not be decreased as much as RAUR(Bowtie2). The reason lies in the different strategies of Bowtie2 and BWA to perform gapped alignment, and BWA pays different penalties for mismatches, gap opens and gap extensions.

Real data and performance
To assess the performance on real data, each alignment program is run on three datasets of single-end reads (ERR008838(76-bp), SRR006273(76-bp) and ERR008843 (83-bp)) and three datasets of paired-end reads (ERR007641(51-bp), SRR019044(76-bp), and ERR050728 (90-bp)). The single-end reads were produced by Illumina for NA18633, NA18498, and NA18624 individuals, and the paired-end reads were produced by Illumina for NA12282, NA11831, and HG00759 individuals, included in the 1000 Genomes Project http://www.1000genomes. org. These reads are mapped to the human genome UCSC Hg19. The comparison of Alignment rate of different alignment programs on different datasets are shown in Table 6 and 7.
In Table 6 the Alignment rate of RAUR(BWA) and RAUR(Bowtie2) are significantly higher than those of BWA and Bowtie2, and consistent with those of RAUR(BWA) and RAUR(Bowtie2) on single-end simulated data with read length 75-bp. A little different from the simulated results, the Alignment rate of RAU-R(BWA) and RAUR(Bowtie2) outperform those of BWA(men) on three datasets, while Bowtie2(local) gains the highest Alignment rate on SRR006273 and ER-R00884s3, compared with other alignment programs, which is 2% more than those of RAUR(BWA) and RAUR(Bowtie2).
On the three real datasets of paired-end reads, as shown in Table 7 RAUR(BWA) and RAUR(Bowtie2) outperform BWA and Bowtie2, and show significant improvement on ERR007641 and SRR019044. All the alignment programs work well on long reads (ERR050728(90-bp)). The Alignment rate of RAUR(BWA) is comparative with those of BWA(men) and Bowtie2(local), while the Alignment rate of RAUR(Bowtie2) is about 1-2% less than Bowtie2(local).