The preceding analyses show that the original quality scores assigned by the sequencer are neither accurate nor effective at discriminating sequencing errors from non-errors. ReQON, GATK and BAQ all produce much more reliable quality scores, although those assigned by ReQON are more accurate than those assigned by GATK or BAQ (Table 1). In most cases, any of the three recalibration algorithms is reasonable to use, but there are some distinct differences.
The algorithms of ReQON and GATK consider very similar covariates, yet Tables 1 and 2 show that ReQON performs just as well as, and in most cases better than, GATK. We attribute ReQON’s advantage over GATK to two main differences: ReQON recalibrates low-quality bases, and it filters out mismatched bases that may be due to true variants or mapping errors.
First, as previously mentioned, GATK chooses not to recalibrate bases with very low quality scores, with the default quality threshold set at 5. The GATK authors' reasoning is that these quality scores indicate bad or randomly called bases by the sequencer, so the original qualities should be kept as is. This makes sense if these low-quality bases are filtered out before later analyses. However, investigators may prefer not to filter out low-quality bases, such as when sequencing experiments are expected to yield low coverage. In this low-coverage setting, it makes more sense to recalibrate all bases, regardless of quality score, which is how ReQON operates. Because GATK does not recalibrate its low-quality bases, their quality scores generally have poor accuracy. In contrast, the recalibrated low-quality scores from ReQON are much more accurate.
The low-quality bases that GATK chooses not to recalibrate contribute heavily to its FWSE (Table 1), indicating decreased accuracy. If these low qualities are removed from the analysis, then FWSE is approximately equal between GATK and ReQON. For a more exact comparison, we could have changed the threshold for GATK and recalibrated all bases regardless of their original quality scores. However, as most users will use the default settings when running either recalibration algorithm, we chose to compare only the output obtained with these default settings.
A second main advantage of ReQON over GATK is the criterion used for identifying sequencing errors. GATK identifies sequencing errors by filtering out known variant positions, then calling all bases that do not match the reference sequence as errors. In reality, some of these bases are not sequencing errors: the base call itself is correct, and the mismatch reflects a novel variant or a mapping error. These miscalls disproportionately affect the higher quality scores. Because GATK's observed error rate is approximately the sum of the sequencing error rate, the alignment error rate and the rate of novel variants, GATK will underestimate the true quality. In contrast, ReQON goes a step further by utilizing information from multiple reads and removing positions from the training set with low confidence in the error calls (determined by model parameters nerr and nraf). Figure 3 shows an example of such a position. For cell line replicate 1, ReQON removed 77,133 bases at 2,117 positions (average of 36x coverage) from the training set that GATK called as sequencing errors. These removed positions are likely to be novel variants or mismatches due to systematic alignment errors. ReQON identifies these positions without prior knowledge; in contrast, GATK would need information about these positions in order to remove them when building its model. Therefore, ReQON should be preferred when the data are aligned to unfinished genomes, where many mapping errors are expected, or when an input file of known variant positions is not available.
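The effect of counting non-sequencing mismatches as errors can be illustrated with a short Phred calculation. The sketch below relies only on the standard relation Q = -10 log10(p); the error rates are invented for illustration and are not taken from our data, and the function name `phred` is our own.

```python
import math

def phred(p):
    """Phred-scaled quality for an error probability p (0 < p <= 1)."""
    return -10.0 * math.log10(p)

# Illustrative rates only; none of these numbers come from the paper.
contamination = 1e-3  # mismatches from misalignment plus novel variants

# A high-quality bin: true sequencing error rate 1e-3 (Q30).
true_q_high = phred(1e-3)
observed_q_high = phred(1e-3 + contamination)

# A low-quality bin: true sequencing error rate 1e-2 (Q20).
true_q_low = phred(1e-2)
observed_q_low = phred(1e-2 + contamination)

# Quality lost in each bin when the extra mismatches are counted as errors.
drop_high = true_q_high - observed_q_high
drop_low = true_q_low - observed_q_low

print(f"Q30 bin is estimated as Q{observed_q_high:.1f} (loss {drop_high:.2f})")
print(f"Q20 bin is estimated as Q{observed_q_low:.1f} (loss {drop_low:.2f})")
```

Under these assumed rates, the same absolute number of non-sequencing mismatches costs the Q30 bin about 3 quality units but the Q20 bin less than half a unit, which is why the bias is concentrated at the higher quality scores.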
Figure 3 also shows that ReQON assigns significantly higher quality scores to the non-reference bases at this position than GATK (two-sided paired t-test, p = 8.36 × 10⁻⁹). This suggests that GATK biases against discovering novel variants by assigning lower quality scores to non-reference bases at positions supported by multiple reads. Therefore, investigators interested in detecting novel variants should prefer ReQON over GATK.
Another important difference between the recalibration algorithms is that GATK tests model performance on the same data that were used in training the model (the entire genome). This approach leads to overfitting and overly optimistic estimates of the error. This issue, along with the concern of falsely identifying non-reference bases as sequencing errors, calls into question the true performance of GATK recalibration. On GATK’s software website, the authors discuss the option of training their model on a smaller subset of the data to reduce runtime. The authors provide evidence that training the model on a subset of the data leads to decreased accuracy. They conclude that users interested in maximum recalibration accuracy should continue to train on the full data set. We view this as further evidence that the GATK model overfits to the training data and, thus, overestimates the true model performance. In contrast, ReQON trains the model on a training set, which allows performance to be measured on a separate testing set. The analyses presented in the Results section show that ReQON does not overfit the model to the training data.
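The distinction between in-sample and held-out evaluation can be sketched with a toy example. The code below is not the ReQON or GATK model: it substitutes a simple per-quality lookup table for either tool's actual model, splits made-up records by region, and scores the fit with a mean squared error between predicted error probability and the observed outcome. The helper names `fit_table` and `brier`, the region labels and all counts are assumptions made for illustration.

```python
import math
from collections import defaultdict

def fit_table(bases):
    """Map each reported quality to a recalibrated Phred score from `bases`."""
    counts = defaultdict(lambda: [0, 0])  # reported_q -> [errors, total]
    for _, q, is_error in bases:
        counts[q][0] += int(is_error)
        counts[q][1] += 1
    # Add-one smoothing keeps the error rate above 0 so log10 is defined.
    return {q: -10.0 * math.log10((e + 1) / (n + 2))
            for q, (e, n) in counts.items()}

def brier(bases, table):
    """Mean squared gap between predicted error probability and 0/1 outcome."""
    total = 0.0
    for _, q, is_error in bases:
        p = 10 ** (-table.get(q, q) / 10.0)  # fall back to the reported score
        total += (p - int(is_error)) ** 2
    return total / len(bases)

# Toy records: (region, reported quality, is_error). All values are made up.
bases = ([("chr1", 30, False)] * 98 + [("chr1", 30, True)] * 2
         + [("chr2", 30, False)] * 95 + [("chr2", 30, True)] * 5)

train = [b for b in bases if b[0] == "chr1"]  # region used for fitting
test = [b for b in bases if b[0] != "chr1"]   # region held out for scoring

table = fit_table(train)
in_sample = brier(train, table)
held_out = brier(test, table)
print(f"in-sample {in_sample:.4f}, held-out {held_out:.4f}")
```

In this toy setting the score computed on the training region is lower than the score on the held-out region, so reporting only the former would overstate how well the fitted table generalizes.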
ReQON was also compared to BAQ, which is not a traditional quality score recalibration algorithm in the sense that it does not attempt to adjust quality scores so that they better reflect the probability of a sequencing error. Instead, BAQ only considers alignment quality and adjusts the base quality when the alignment quality is low. Due to the difference in motivation, ReQON greatly outperforms BAQ in terms of accurately representing the probability of a sequencing error, as shown in Table 1. BAQ adjustment has been shown to improve SNP calling, especially in reducing false calls at positions near indels. Table 2 shows that ReQON does a better job at distinguishing non-reference bases belonging to dbSNP, representing true variants, from non-reference bases at other positions, representing mainly sequencing errors with possibly a few novel variants. Although more detailed analysis is required, this suggests that, overall, ReQON may be as effective at improving variant calling as BAQ.
As with all available base quality score recalibration algorithms, ReQON's results depend on the accuracy of the read alignments. Alignment becomes much more complicated when considering indels, variants, splicing or poorly annotated genomes. BAQ incorporates mapping quality into its recalibrated scores. However, as seen in Table 1, this comes at the cost of the quality scores accurately representing the probability that a base is a sequencing error. We believe that alignment quality should be represented in mapping quality scores and that base quality scores should only convey information about the likelihood of a base being a sequencing error. ReQON attempts to separate out mismatches due to alignment by identifying and removing such bases from the training set. Following the assumption that mapping errors occur in a more systematic fashion than stochastic sequencing errors, this filtering is achieved through the use of parameters nraf and nerr. While this may not remove all effects of alignment on the quality score, it is a marked improvement over GATK, which fails to consider alignment-specific sources of error.
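The idea behind this position-level filter can be sketched as follows. This is a simplified illustration rather than the ReQON implementation: we assume here that nerr acts as a limit on the number of mismatches at a position and nraf as a limit on the non-reference allele frequency, and that a position is dropped only when both limits are exceeded. The combination rule, the threshold values and the example pileups are all hypothetical.

```python
def keep_for_training(coverage, mismatches, nerr, nraf):
    """Return True if the position stays in the training set.

    Assumed rule (not confirmed by the text): a position is dropped when its
    mismatch count exceeds nerr AND its non-reference allele frequency
    exceeds nraf.
    """
    if coverage == 0:
        return False  # nothing to learn from an uncovered position
    freq = mismatches / coverage
    return not (mismatches > nerr and freq > nraf)

# Hypothetical thresholds and pileups, chosen only for illustration.
NERR, NRAF = 2, 0.05
positions = {
    "isolated_error": (36, 1),   # one mismatch in 36 reads
    "likely_variant": (36, 17),  # about half the reads disagree
    "deep_low_freq": (400, 5),   # several mismatches, but 1.25% of reads
}

kept = {name for name, (cov, mm) in positions.items()
        if keep_for_training(cov, mm, NERR, NRAF)}
print(sorted(kept))
```

Under these assumptions, a position where many reads agree on the same non-reference base is treated as systematic (a variant or an alignment artifact) and excluded, while scattered mismatches remain available as training examples of sequencing error.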