In this section, we investigate the correlation between the low base quality scores and sequencing errors. Based on the investigation, the trimming strategy adopted in RAUR is presented in details. Then, RAUR algorithm is described.
Base quality scores distribution
Quality score measures the probability that a base is called incorrectly. With sequencing by synthesis technology, each base in a read is assigned a quality score by a phred-like algorithm [21], similar to that originally developed for Sanger sequencing experiments. The quality score of a given base, Q, is defined by Equation 1.
(1)
where e is the estimated probability of the base call being wrong. Thus, a higher quality score indicates a smaller probability of error. A quality score of 10 represents an error rate of 1 in 10, with a corresponding call accuracy of 90%; a quality score of 20 represents an error rate of 1 in 100, with a corresponding call accuracy of 99%; a quality score of 30 represents an error rate of 1 in 1000, with a corresponding call accuracy of 99.9%. In this paper, a base quality score ≥ 20 is considered as a high base quality, otherwise it is a low base quality.
Sequencing errors are one of the main resources for mismatches. The differences between the individual genome and the reference genome are the other resource for mismatches or gaps in alignment. We investigate the quality score of sequencing errors of ILLUMINA sequencing reads with length 50-bp simulated by ART [22]. As shown in Figure 1, we can observe that, the base quality scores of the majority of sequencing errors are lower than 20. On the other side, majority of the bases (above 90%) with low base quality scores (≤ 20) are not sequencing errors, as shown in Figure 2.
The strategy of trimming
There is a saying that the more things you do, the higher possibility you will make a mistake. Similarly, more bases considered, more sequencing errors will be encountered, which may ruin the alignment. With the number of mismatches or the edit distance greater than the allowed value, some reads will be unmapped by the alignment tools, or are mapped with low mapping qualities. These reads are excluded from downstream analysis. However, some confident segments of these reads can be used in variance calling. The first and most important step to make use of the unmapped reads is to figure out the most confident and informative segment of an unmapped read, which can be aligned correctly. This step is called trimming.
The purpose of trimming is to control the number of possible mismatches in the alignment. Mismatches in alignment can be sequencing errors and variances. Given a segment with K low quality bases, the maximum number of possible mismatches is K+b, and the minimum number is 0, where b is the number of possible variances. From Figure 2, we can know that the probability that all the K low quality bases in the segment are sequencing errors is small. Furthermore, Sachidanandam et al. [23] found out that it is nearly in 1 kb that there is a SNP, which indicates in a short read, b is ≤ 1. Thus, an alignment tool which allows K edit distances in a read, can align a segment with K low quality bases confidently. Additionally, to align uniquely, the length of the segment should be long enough. Thus, our aim of trimming is to find the longest segment with no more than K low quality bases, which can be aligned uniquely.
The details of trimming is illustrated as Algorithm 1. The inputs are unmapped reads, and parameter K. K is the number of low quality bases allowed in the segment. For each read, the positions of the bases with low qualities in the read are stored in a array. A segment of a read is several successive bases. Then we check the lengths of segments in the read containing K low quality bases. Each unmapped read is undertook the trimming in RAUR, and can be represented by a longest segment(or called a trimmed read) under the parameter K. The longest segments will be output in the same format as the original unmapped reads. The start position and the end position of a trimmed read in the original unmapped read are recorded, which can be used to deduce the position of an original unmapped read by using these information.
Algorithm 1 Trimming
1: Input: reads in fastq format, parameter K;
2: Output: trimmed reads in fastq format;
3:
4: Process:
5: for each read R do
6: ▹ find the positions of low quality score
7: N_Low = 0,i = 0,Low_position = [];
8: Max_length = 0,Max_start = 0,Max_end = 0;
9: for each base i ∈ R do
10: if i has a low base quality then
11: Low_position[N_Low++] = i
12: end if
13:
14: end for
15: if N_Low ≤ K then
16: output R in fastq format
17: next
18: end if
19: ▹find the longest segment with K low quality bases
20: for S = 0;S ≤ N Low − K;S++ do
21: length = 0,start = 0,end = 0,j = S + K;
22: if S ≥ 1 then
23: start = Low_position[S-1]+1;
24: else
25: start = 0;
26: end if
27: if j <N Low then
28: end = Low_position[j]-1;
29: else
30: end = R.length-1;
31: end if
32: length = end-start+1
33: if length >Max_length then
34: Max_start = start
35: Max_end = end
36: Max_length = length
37: end if
38: end for
39: Output substr(R,Max_start,Max_end,Max_length) in fastq format
40: end for
RAUR algorithm
The process of RAUR is illustrated in Algorithm 2. Firstly, reads are aligned by an alignment program. Then the unmapped reads and the unconfident mapped reads (with mapping quality less than 10) [15] are the input of the loop. RAUR makes every effort to find out the longest and mappable segments of these reads by decreasing the values of K of the loop. The parameter K is used to control the number of low quality bases allowed in the trimmed reads. In all experiments of this paper, K is set as 8. For each iteration, the first step is to trim each unmapped reads into a longest segment (trimmed reads) containing K low quality bases. Then align these trimmed reads by the alignment program. When the trimmed reads with K low quality bases cannot be aligned or confidently mapped, their original reads are the input of the next loop with K = K-1. The whole process will stop when K = 0. Thus, for each read, it either can be confidently mapped with a certain value of K or can't be mapped with any value of K.
Figure 3 shows an example of trimming a read. The consecutive squares represent the bases of a read with 45 bp, where the black color squares denote the bases with low quality scores, and in contrast the white color squares are the bases with high quality scores. There are eight bases with low quality scores in the read. When K = 4, the longest segment of the read starts at position 14 of the original read, and ends at position 42, containing four low quality bases. When the trimmed read can't be aligned, K is decreased by 1, and the trimming algorithm search for the longest segment containing three low quality score bases. The start position of the longest segment is 7, and end position is 29. The trimming will stop when the read can be confidently mapped or K = 0. In our experiments, the initial value of K is set as 8.
Algorithm 2 RAUR
1: Input: reference sequence, illumina Reads in fastq format, parameter K(K > 0);
2: Output: alignment_file in sam format;
3:
4: Process:
5: Align Reads against reference sequence with an aligner;
6:
7: Figure out the unmapped reads and reads with mapping quality ≥10 and write into file unmapped
R
eads
8:
9: for K_low = K;K_low > 0;K_low = K_low - 1 do
10: ▹ Trim reads into longest segments with K_low low quality bases
11: K_low_Reads = Trimming(unmapped_Reads, K_low);
12:
13: Align K_low_Reads against reference sequence with an aligner;
14:
15: Figure out the unmapped reads and reads with mapping quality ≥10 and write their original reads into file unmapped
R
eads
16:
17: end for