ReQON is open source software, written in R and available through the Bioconductor project [8]. This section describes the ReQON algorithm developed to recalibrate base quality scores.
Input
The input to ReQON is an indexed and sorted BAM file [9] with quality scores reported in the QUAL field. The reads can be aligned to any genome using any alignment algorithm. The recalibration results will be dependent on the accuracy of these alignments.
Algorithm
ReQON uses logistic regression to recalibrate the quality scores. The following parameters must be specified:
• Region: Genomic coordinates of the region that the regression model is trained on. This training region must be large enough to obtain accurate coefficients for the regression model. On the other hand, specifying a larger than necessary region will increase run time and may overfit the model to the training set. We recommend training on one of the smaller chromosomes, or specifying MaxTrain, described next.
• MaxTrain (optional): This allows the user to train on a fixed number of bases from a large training region. For example, training on the first 10 million bases of the genomic region given in Region is achieved by setting MaxTrain = 10,000,000. Typically, results are consistent when training on at least 10 million bases and do not improve when the training size is larger than 25 million bases.
• RefSeq: File containing the reference sequence corresponding to the training set.
• SNP (optional): File of known variant locations to remove from the training set before recalibration.
• nerr: The maximum number of errors tolerated at a genomic position (default = 2). Positions with more than nerr errors may likely be true variants, so bases from these positions are removed from the training region.
• nraf: The maximum non-reference allele frequency at a genomic position that is allowed (default = 0.05). Positions with non-reference allele frequency greater than nraf are removed from the training set for the same reason as nerr.
The first step is to read the training region, specified by Region and MaxTrain, from the input BAM file. Positions that are listed in SNP are removed from the training set. Next, sequencing errors are identified as bases that do not match RefSeq. In reality, some of these identified bases will not be errors but instead correct calls, such as novel variants or mapping errors. In an attempt to remedy this issue, two different thresholds are set to remove positions most likely to contain false error calls. At each genomic position, thresh = max{nerr, nraf × coverage} is calculated. This threshold sets a maximum number of allowable called errors for positions with low coverage, and a maximum frequency of non-reference bases for positions with high coverage. The default settings allow up to 2 errors per position if the coverage is less than 60x, and more errors (0.05 × coverage) for positions with at least 60x coverage. Positions with more called errors than thresh are identified and removed from the training set. Note that when we say that “a position is removed from the training set,” this is different than keeping the bases at this position in the training set but switching the call from error to non-error. Instead, all bases at this position are removed from the training set due to low confidence of the error call.
Every base in the training set has now been classified as either error or non-error. There is a strong relationship between errors and their position in the read. Most errors occur at the end of the read, due to technical aspects of the sequencing process [10]. This same pattern is present in Figure 1A, which shows the distribution of sequencing errors in the training set by their read position. The rest of Figure 1 is described in more detail in the next section. To account for these high-error read positions in the ReQON regression model, the algorithm flags read positions that have more errors than a specified threshold, set at 1.5 times the average number of errors per read position. This threshold is visualized by the dashed cyan line in Figure 1A. For notational purposes, assume that there are k flagged positions, denoted as fp
1
fp
2
, … fp
k
.
Regression coefficients (βs) are obtained from the following logistic regression model,,
where Y is the (n x 1) response vector and the X s are (n x 1) vectors of explanatory variables, as defined below.
• Y(i) = I{base i is a sequencing error (i.e., does not match RefSeq)}.
· X1(i) = original quality score of base i.
• X2(i) = I{X1(i) = 0}. Many base calling algorithms assign a quality of zero to indicate bad or randomly called bases, so these bases are treated separately.
• X3(i) = average quality score across all bases in the read containing base i. This helps identify reads for which all bases are assigned low quality scores.
• X4(i) = Read position of base i.
• X5(i) = I{base i = A}.
• X6(i) = I{base i = C}.
• X7(i) = I{base i = G}.
• X7+m(i) = I{X4(i) = fp
m
} for m = 1, …, k. These indicator variables identify bases originating from one of the flagged read positions.
Additional covariates were originally considered, but those that did not improve model performance were dropped from the final model.
Due to memory constraints, it is not practical to fit the model when the training set contains tens of millions of bases. Instead, the training set is split into smaller subsets of no more than 10 million bases and the regression model is run on all subsets. Then, for each coefficient, the median is calculated over all regression models. Once these median regression coefficients are obtained, the model is applied to each base in the input BAM file to obtain predicted error probability values. These probabilities are transformed to the Phred scale [6] and rounded down to the nearest integer.
Output and visualizations
ReQON outputs a BAM file with original quality scores replaced by the recalibrated scores. Additionally, ReQON produces diagnostic plots (Figure 1) which show the effectiveness of the quality score recalibration on the training set. Plot A shows the distribution of errors in the training set by their read position. Any read position above the threshold (dashed cyan line) is given an additional indicator variable in the regression model, discussed in the previous section. Plot B shows the relative frequency distribution of quality scores both before (solid blue line) and after (dashed red line) recalibration.
The bottom two plots show the reported quality score versus the empirical quality score before (plot C) and after (plot D) recalibration. The empirical quality score is calculated by computing the observed error rate for all bases in the training set that are assigned a specific quality score. This error rate is then converted to the log-transformed Phred scale. If the quality scores are accurate, which occurs when the observed and reported sequencing error rates match, the points will fall on the 45-degree line.
The bottom plots also report the Frequency-Weighted Squared Error (FWSE), a measure of how close the points lie to the 45-degree line. FWSE is calculated by squaring the error (vertical distance between the point and 45-degree line), weighting this squared error by its relative frequency (shown in plot B and represented by shading of the point) and summing across all quality scores. FWSE will be close to zero if the quality scores accurately reflect the probability of a sequencing error.