MERIT: a Mutation Error Rate Identification Toolkit for Ultra-deep Sequencing Applications

Rapid progress in high-throughput sequencing (HTS) has enabled the molecular characterization of mutational landscapes in heterogeneous populations and has improved our understanding of clonal evolution processes. Assessing the sensitivity of detecting genomic mutations in HTS data requires comprehensive profiling of sequencing artifacts. To this end, we introduce MERIT, designed for in-depth quantification of erroneous substitutions and small insertions and deletions, specifically for ultra-deep applications. MERIT incorporates an all-inclusive variant caller and considers genomic context, including the nucleotides immediately 5′ and 3′ of the variant, thereby establishing error rates for 96 possible substitutions as well as four single-base and 16 double-base indels. We apply MERIT to ultra-deep sequencing data (1,300,000×) and show a significant relationship between error rates and genomic contexts. We devise an in silico approach to determine the optimal sequencing depth, where errors occur at rates similar to those of true mutations. Finally, we assess the nucleotide-incorporation fidelity of four high-fidelity DNA polymerases in clinically relevant loci, and demonstrate how fixed detection thresholds may result in substantial false positive as well as false negative calls.


Introduction
Advances in high-throughput sequencing (HTS) technologies have revolutionized the genomic, transcriptomic, and epigenomic characterization of biological states. HTS platforms produce large amounts of sequencing reads at relatively low cost. However, their high rates of sequencing artifacts have hindered the profiling of heterogeneous populations such as those in tumor samples, where, in addition to genomically diverse cancer cells, contaminating normal cells may also be present.
Comprehensive error profiling and mutation detection sensitivity analysis have direct implications for predicting evolutionary trajectories that may arise from such heterogeneity. For example, since cancer's eventual evolution to a therapy-resistant disease is often associated with the natural selection of pre-existing resistant clones, detecting somatic genomic variations that exist in small cell populations prior to treatment is pertinent for prevention of therapy-resistant relapsed disease [1,2]. HTS errors are dominated by misreading a base within the instrument or by nucleotide misincorporations during library enrichment with polymerase chain reaction (PCR). Specifically, DNA-damaging factors such as spontaneous deamination, the presence of oxidized bases in cells, ex vivo oxidation during DNA extraction [3], and short-lived high temperatures during acoustic shearing [4] have been associated with high polymerase error rates [5]. PCR-free library preparation [6,7], sophisticated DNA barcoding [4,8,9], and overlapping-read designs [10,11,12] have been proposed to reduce error rates. Despite the varying success of these methods, complete elimination of HTS errors is not possible; therefore, their exhaustive analysis is critical for estimating and improving sensitivity in calling somatic variants.
Here, we introduce MERIT (Mutation Error Rate Identification Toolkit), a comprehensive pipeline designed for in-depth quantification of erroneous HTS calls, specifically for ultra-deep applications. MERIT considers the genomic context of errors and shows a significant relationship between error rates and their sequence contexts. In addition to observing more than three orders of magnitude difference between transition and transversion error rates, we identify variations of more than 130-fold within each error type at high sequencing depths. We also propose an in silico depth reduction approach to provide insight on estimating the optimal depth, where sequencing errors exist at rates similar to those of true mutations. Finally, we describe an assay for detailed assessment of the nucleotide-incorporation fidelity of four high-fidelity DNA polymerases.
Results
HTS errors arise from misreading bases within the sequencing instrument or from nucleotide misincorporations during library preparation. Such events often lead to higher rates of transitions versus transversions [13,14,15,16,17] or an increased number of errors in specific genomic contexts. These differences can be more pronounced at higher sequencing depths and directly impact the sensitivity for detecting true mutations. MERIT, primarily designed for profiling ultra-deep data (depths >5,000×), provides a comprehensive toolkit to identify, characterize, and quantify various sources of HTS error.
Comparative performance analysis of commonly used HTS variant callers [18,19,20] suggests significant disagreement between their identified variants [21,22,23]. These differences are mainly rooted in each pipeline's specific filtering and statistical methodology. Moreover, since the goal of these algorithms is to distinguish true mutations from artifacts, they do not report all the variants required for a precise quantification of sequencing noise. To overcome this limitation, MERIT uses SAMtools to identify all positions with alternate alleles from the aligned, indexed sequencing reads. It then probes SAMtools mpileup data to extract the reference and alternate alleles' depths, Phred quality, and position-in-read information for all variants, even when they are present in only a single read amongst tens of thousands. Finally, it obtains the genomic context of the variants from the reference genome, including the nucleotides immediately at their 5′ and 3′ ends, and estimates error rates for 96 possible single nucleotide substitutions as well as four single-base and 16 double-base insertions/deletions (indels). An optional annotation step is also available.
Details of MERIT's workflow are shown in Figure 1 and described in Methods.
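To illustrate the mpileup-probing step, the sketch below decodes the base column of a samtools mpileup line into per-allele counts. This is our simplified illustration, not MERIT's actual implementation: real mpileup strings carry further markup (e.g., reference-skip and quality columns) that a production parser must also handle.

```python
def parse_pileup_bases(ref, bases):
    """Simplified parser for the base column of a samtools mpileup line.
    '.' and ',' match the reference; letters are alternate alleles;
    '^' (read start, followed by a mapping-quality char), '$' (read end),
    and '+n'/'-n' indel runs are markup to be skipped."""
    counts = {}
    i = 0
    while i < len(bases):
        c = bases[i]
        if c == "^":                 # read start: skip the mapping-quality char too
            i += 2
            continue
        if c == "$":                 # read end marker carries no base call
            i += 1
            continue
        if c in "+-":                # indel: read the length digits, skip that many bases
            j = i + 1
            while j < len(bases) and bases[j].isdigit():
                j += 1
            n = int(bases[i + 1:j])
            i = j + n
            continue
        if c in ".,":
            counts[ref] = counts.get(ref, 0) + 1
        elif c.upper() in "ACGT":
            alt = c.upper()
            counts[alt] = counts.get(alt, 0) + 1
        i += 1
    return counts
```

Counting even a single alternate-allele read among tens of thousands, as above, is what makes an all-inclusive error profile possible.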
Ultra-deep sequencing data set. We generated ultra-deep sequencing data for a HapMap sample (NA19240), targeting four commonly mutated loci in the TP53 and SF3B1 genes. We used four high-fidelity DNA polymerases for PCR amplification: NEBNext® High-Fidelity 2X PCR Master Mix (Hi-Fi 2X), NEBNext® Ultra™ II Q5® Master Mix (Ultra II), KAPA HiFi PCR kit with ReadyMix (KAPA), and Invitrogen™ Platinum™ SuperFi™ DNA polymerase (SuperFi). Custom primers for amplicon-based paired-end reads were designed such that a large portion of the two reads for each DNA fragment (R1 and R2) overlapped (Tables S1 and S3, SI Materials). Because we used a single, and therefore homogeneous, sample, all detected variants were errors accumulated during library preparation or sequencing.
Impact of merging reads on context-specific error correction. Independent analysis of R1 and R2 reads at 1,300,000× indicated significant differences in estimated error rates across the 96 possible sequence contexts (Fig. 2). The high error rates and low Phred quality scores observed in R2 relative to R1 may be associated with sequencing errors caused by misreading a base, attributed to image analysis biases [5] or phasing/pre-phasing [17]. These sequencing errors, which dominated the R1 and R2 profiles, can be distinguished from polymerase errors by merging the overlapping paired-end reads [24,11,25]. Merging, however, cannot eliminate errors randomly accumulated during the amplification processes and present in both reads. In contrast to the higher rates of transversion versus transition errors in paired-end reads (Fig. 3A), the remaining PCR-related errors in the merged reads are dominated by transitions, often with high Phred quality scores (Fig. 3B). MERIT provides further insight for profiling these errors, which are the main hurdle in distinguishing real mutations from sequencing noise:
• Merging R1 and R2 reads lowered all the context-specific error rates. The highest reduction was observed for GTA>GGA transversions (5,025±2,794×), while GCG>GTG transition errors improved by only a factor of 1.22±0.07×. Moreover, these improvements were context-specific: for example, T>A transversions in GTA trinucleotides showed a substantial reduction (568±249×) compared to those in CTAs (1.43±0.31×).
• The rate of C>A errors in ACCs was the highest of all such transversions. These errors are linked to the conversion of guanine to 8-oxoG resulting in mismatched pairing with adenine [26,27]. Oxidation of guanine to 8-oxoG happens naturally in living cells and can be increased by DNA damaging factors such as acoustic shearing [28].
• Merging R1 and R2 can correct the low-quality erroneous base calls associated with sequencing errors. Our analysis suggests that such sequencing errors can be identified and eliminated based on their quality when merging the reads is not possible (e.g., in hybrid-capture-based sequencing, where read pairs are not designed to necessarily overlap).
Effect of mutation context on amino acid variations. In a single codon, the context-specific rate of error for each base change directly affects the sensitivity of detecting the resulting amino acid variation. Our data indicated that the most commonly mutated residues in TP53 and SF3B1 were often more prone to errors and hence comparatively less likely to be distinguished from sequencing errors. For example, in TP53, R248Q and R248W are among the most common mutations found in cancer patients [29]. The transition base changes that result in these mutations could be confounded by HTS errors at an 8-fold higher rate than the transversion alterations that lead to R248L, and a 55-fold higher rate than those that lead to R248G (Fig. 4A). Similarly, the K700E mutation in SF3B1 is the most frequently mutated residue in the gene's exon 10 [30,31]; it results from a T>C mutation in a TTC trinucleotide that showed the highest rate of error for a non-synonymous amino acid change in its codon (4.74±0.42×10−5 [error/base]). In contrast, the comparatively rarer I704F mutation, a T>A in an ATG, had one of the lowest rates of error in its respective codon (5.15±1.13×10−6 [error/base]; Fig. 4B). K700E's 9-fold higher rate of error than that of I704F indicated a marked reduction in its relative detection sensitivity.
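The codon-level effect described above can be sketched by enumerating every single-base substitution of a codon under the standard genetic code (this illustration is ours, not part of MERIT). For the TP53 R248 codon, CGG, the transitions CGG>CAG and CGG>TGG yield R248Q and R248W, while transversions yield, e.g., R248L (CGG>CTG) and R248G (CGG>GGG):

```python
# Standard genetic code, NCBI table 1, with codons ordered T, C, A, G.
BASES = "TCAG"
AA = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {a + b + c: AA[16 * i + 4 * j + k]
               for i, a in enumerate(BASES)
               for j, b in enumerate(BASES)
               for k, c in enumerate(BASES)}

def single_base_variants(codon):
    """Enumerate (ref_base, alt_base, position, resulting amino acid)
    for every possible single-base substitution of a codon."""
    out = []
    for pos in range(3):
        for alt in "ACGT":
            if alt == codon[pos]:
                continue
            mut = codon[:pos] + alt + codon[pos + 1:]
            out.append((codon[pos], alt, pos, CODON_TABLE[mut]))
    return out

# TP53 R248 codon is CGG; each substitution maps to a distinct amino acid change,
# and each base change carries its own context-specific error rate.
r248_variants = single_base_variants("CGG")
```

Pairing each tuple with its trinucleotide error rate then ranks the amino acid changes by how easily they can be confounded with sequencing noise.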
Optimal sequencing depth. Insufficient sequencing depth reduces the sensitivity of detecting variants and leads to a loss of statistical significance for confident variant calling [32]. Consequently, sequencing at higher depths is expected to provide robust error rate estimates and improved sensitivity in detecting true mutations. Accurate estimation of the optimal sequencing depth, beyond which the inferred background error is not further reduced, not only provides a precise view of the intrinsic limitations of HTS assays, but also conserves time and resources by avoiding unproductive ultra-deep sequencing experiments.
To provide insight on optimal sequencing depth, we performed in silico experiments and estimated context-specific error rates as a function of depth. We randomly selected merged reads and constructed simulated sequencing data at depths ranging from 1,000× to 700,000×, with 500 independent replicates at each depth to establish confidence intervals (Fig. 5). MERIT showed that the type of substitution error was an important determinant in estimating the optimal depth (Fig. 5A); for example, the CTA>CAA and GTT>GAT error rates, both T>A transversions, were estimated at … and … [error/base], respectively. Selecting a frequency threshold for these variants at 5,000× based on the general T>A rate may not yield a significant number of false calls independent of their sequence contexts; however, at depths >5,000×, setting a threshold based on all T>A errors would lead to substantial false positive CTA>CAA and false negative GTT>GAT calls, as their corresponding error rates diverge at high depths, reaching a difference of two orders of magnitude at 700,000×.
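The cost of a fixed threshold can be sketched with a toy binomial calculation (the rates, depth, and cutoff below are hypothetical, chosen only for illustration): when two contexts of the same error type diverge, a cutoff calibrated to the pooled rate floods the high-error context with false positives while remaining conservative for the low-error one.

```python
from math import comb

def binom_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), via the complement of the lower tail
    (only k terms, so it stays numerically tractable at ultra-deep n)."""
    return 1.0 - sum(comb(n, m) * p**m * (1 - p)**(n - m) for m in range(k))

# Hypothetical, illustrative values (not estimates from the paper):
depth = 10_000
p_high, p_low = 1e-3, 1e-5   # two contexts of the same substitution type
cutoff = 5                    # fixed alt-read threshold for calling a mutation

fp_high = binom_tail(cutoff, depth, p_high)  # errors alone exceed the cutoff
fp_low = binom_tail(cutoff, depth, p_low)    # errors essentially never do
```

Conversely, a cutoff raised high enough to control the high-error context would discard genuine low-abundance mutations in the low-error context, producing false negatives; this is the asymmetry that motivates context-specific thresholds.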
DNA polymerase fidelity estimation. High-fidelity DNA polymerases, equipped with proofreading activity, produce fewer base misincorporations in the PCR enrichment step and can thus reduce HTS error rates. The Hi-Fi 2X, Ultra II, KAPA, and SuperFi enzymes are marketed as high-fidelity polymerases, specifically designed for efficient amplification of complex templates such as those with GC-rich regions. Their providers have reported fidelities 100× better than that of wild-type Taq DNA polymerase [33,34,35,36]. Several techniques, such as the forward mutation assay [37], denaturing gradient gel electrophoresis [38], and, most recently, HTS-based techniques [39,40,41,42], have been used to determine the replication fidelity of DNA polymerases. Because different assays, quantification methods, and often descriptive units [42,41] are used to report polymerase fidelity, comparing these methodologies is challenging and beyond the scope of this work. Instead, here we used MERIT to emphasize the importance of context-specific error estimation and to provide a robust comparison of these commonly used enzymes.
We applied MERIT to the merged reads at equal depths of 650,000×, ensuring that the estimated fidelities were not affected by sequencing depth. When all errors were included in the analysis, global error rates suggested that these polymerases performed fairly similarly to each other, with the highest and lowest error rates belonging to the KAPA and SuperFi enzymes, respectively. Specifically, the overall substitution error rates for Hi-Fi 2X, Ultra II, KAPA, and SuperFi were estimated at 2.66±0.21×10−6, 1.91±0.19×10−6, 6.95±0.54×10−6, and 1.76±0.25×10−6 [error/base/doubling], respectively (Fig. S1A).
Despite the highly correlated error profiles of these polymerases (Fig. 6D), MERIT illustrated enzyme-specific biases beyond global error rates. For example, the overall substitution fidelity of SuperFi was 3.95±0.65× better than KAPA's; however, specific substitution fidelities differed widely. C>G errors of SuperFi were 6.88±2.16× less frequent than those of KAPA, whereas for C>A substitutions, SuperFi's advantage over KAPA shrank to only 1.85±0.30× (Fig. S1B). More unexpected differences could also be observed across the 96 contexts (Fig. 6A): the TTA>TGA error rate of SuperFi was 132±35× lower than KAPA's, while for GCG>GAG errors, KAPA performed slightly better than SuperFi. Finally, using the data from multiple regions of the TP53 and SF3B1 genes, we found limited change in error profiles as the similarity between the genomic content of the amplified amplicons decreased, as measured by their corresponding Kullback-Leibler distance (Fig. 7).
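The content-similarity measure used in Fig. 7 can be sketched as follows (our illustration; the pseudocount smoothing is an assumption we add to keep the distance finite for contexts absent from one amplicon, not a detail taken from the paper):

```python
import math
from collections import Counter

def trinucleotide_profile(seq, pseudocount=0.5):
    """Trinucleotide content profile of a sequence, smoothed with a small
    pseudocount so the KL distance below is finite for absent contexts."""
    contexts = [a + b + c for a in "ACGT" for b in "ACGT" for c in "ACGT"]
    counts = Counter(seq[i:i + 3] for i in range(len(seq) - 2))
    total = sum(counts.get(k, 0) for k in contexts) + pseudocount * len(contexts)
    return {k: (counts.get(k, 0) + pseudocount) / total for k in contexts}

def symmetric_kl(p, q):
    """Symmetric Kullback-Leibler distance: KL(p||q) + KL(q||p)."""
    d = sum(p[k] * math.log(p[k] / q[k]) for k in p)
    d += sum(q[k] * math.log(q[k] / p[k]) for k in q)
    return d
```

Plotting each amplicon pair's error-profile correlation against this distance then shows how (in)sensitive the error profiles are to genomic content.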

Discussion
Rapid progress in HTS technologies, together with the development of novel library preparation methods, has reduced background sequencing noise and thereby improved the sensitivity of detecting true mutations in heterogeneous samples: PCR-free methods [43,44] have forgone the bias associated with polymerase base incorporation [6,7]; DNA barcoding and overlapping-read designs have minimized PCR-related errors [4,8,45,46,9]. However, these methods are bounded by their technical and theoretical limitations and often require large amounts of input DNA. Moreover, since sequencing errors cannot be completely eliminated, comprehensive quantification of their profiles can highlight both the efficiency and the limitations of any methodology. In this manuscript, we introduce MERIT, which provides a platform-independent toolkit and illustrates the importance of estimating error rates in their genomic contexts, especially in ultra-deep HTS data.
Our results as applied to the Illumina platform confirm previous data on the differential rates of errors in paired-end sequencing reads [17], and indicate that merging the overlapping read pairs can notably correct errors that accumulate in sequencing instruments. We also provide a systematic approach for estimating the optimal sequencing depth for discovery of mutations that exist at very low abundances. Our data show that increasing sequencing depth may improve sensitivity for detecting some mutations based on their genomic context.
We also report the application of MERIT to ultra-deep sequencing data obtained from the amplification of multiple clinically relevant loci using four high-fidelity polymerases. Although there is limited variation in both the rates of error and the dependence on the genomic content of the amplified region, our results indicate that profiling each polymerase's misincorporations according to their genomic contexts has important clinical consequences. Specifically, we show that assigning a single allele-frequency threshold to call mutations may result in substantial false positive as well as false negative variants. Not only are neighboring mutational hotspots in one gene affected by markedly different error rates, there is also significant variation in the sensitivity of detecting common amino acid changes within each residue. These results suggest that some of these mutations may in fact be more prevalent at sub-clonal levels in disease populations than previously reported. Small mutated clones in the TP53 and SF3B1 genes, present in >0.1% of alleles, are shown to be strong predictors of poor survival and possible resistance to therapy in various neoplasms [47,48,49,50]; thus, their detection at very low abundances is pertinent for patient care. Put together, our results strongly advocate mutation-specific variant calling approaches that go beyond estimating fixed detection thresholds for all variants [51].

Methods
HTS platform and assay optimization. Custom amplicon-based sequencing and library preparation were performed at GeneWiz (South Plainfield, NJ) using an Illumina HiSeq2500 Rapid Run. Primers were designed using Primer3 [52] and were used along with Illumina partial adapters to target four loci in the TP53 and SF3B1 genes (SI Materials, Table S1) such that the paired-end reads significantly overlapped. The annealing temperature of 66°C was determined based on product specificity and yield for all polymerases after gradient PCR optimization at eight different temperatures (SI Materials, Fig. S3).
PCR amplification and indexing. Twenty PCR cycles were performed using the Hi-Fi 2X, KAPA, and SuperFi polymerases, and 16 cycles were performed using the Ultra II polymerase in the first round of amplification. The cycle numbers were determined after initial PCR amplification tests (Table S2, SI Materials) in order to obtain similar amounts of DNA for each enzyme. The second round of PCR for multiplexing included seven cycles for all four polymerases. After each PCR amplification, AMPure Bead cleanup was performed. First, a 0.4× ratio (20 µL AMPure beads to 50 µL PCR product) was used to remove gDNA and larger fragments (i.e., >600 bp). To the saved supernatant, an additional 80 µL of AMPure beads was added to bring the total to a 2× ratio. The beads were eluted with 22 µL EB (10 mM Tris, pH 8.0). Qubit quantification and Bioanalyzer analysis were performed for quality assessment.
Alignment and merging. Paired-end (PE) reads were cleaned of adapters and primers, and were aligned to the hg19 human reference genome assembly using the Burrows-Wheeler Aligner (BWA) [53]. PE reads that properly mapped to the targeted loci were then merged. In our merging scheme, if the R1 and R2 reads did not match at a base, an N was assigned to that position. Read pairs with overlaps smaller than 50 bases or with more than five mismatches were discarded. The Phred quality score (Q) of a successfully merged locus was calculated as the sum of the qualities in the R1 and R2 reads, since these are independent observations; Q is given by Q = −10 log10(p), where p is the probability that the base is called incorrectly. Merged reads were then mapped to the reference human genome and filtered so that they were uniquely mapped (BWA tags X0:1 and X1:0). Finally, in order to make a fair comparison between the error rates of merged and PE reads, only PE reads that were merged successfully and uniquely mapped were considered. Table S3 in SI Materials lists the average depth of merged and PE reads at the different loci.
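The merging scheme above can be sketched as follows. This is our simplified illustration, assuming R2 has already been reverse-complemented into R1 orientation and the two reads are aligned base-for-base over their overlap; a production implementation must first establish that alignment:

```python
def merge_pair(r1, q1, r2, q2, min_overlap=50, max_mismatch=5):
    """Merge a pre-aligned overlapping read pair (per the scheme in the text):
    disagreeing bases become 'N'; agreeing bases keep the sum of the two
    Phred scores, since the reads are independent observations of the base.
    Returns None for pairs failing the overlap or mismatch criteria."""
    if len(r1) < min_overlap:
        return None
    merged, quals, mismatches = [], [], 0
    for b1, s1, b2, s2 in zip(r1, q1, r2, q2):
        if b1 == b2:
            merged.append(b1)
            quals.append(s1 + s2)
        else:
            merged.append("N")
            quals.append(0)
            mismatches += 1
    if mismatches > max_mismatch:
        return None
    return "".join(merged), quals
```

Summing Phred scores corresponds to multiplying the two independent per-read error probabilities, which is why merged bases can reach qualities well above either read's maximum.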
In silico depth reduction. The sequencing assay was designed to obtain an average depth of >1,000,000×, but for some amplicons the average depth was substantially larger (Table S3, SI Materials). Therefore, an in silico depth reduction procedure was performed to reduce the high depths and, more importantly, to generate enough independent samples to estimate low error rates confidently. One of the main hurdles in HTS-based error rate estimation for high-fidelity polymerases is the lack of signal: errors occur infrequently as fidelity increases, so a large number of samples is required to estimate error rates accurately. As performing ultra-deep sequencing on a large number of samples is not cost-effective, in silico samples at lower depths can instead be generated from one ultra-deep sequencing run by randomly selecting reads from the original raw sequencing data.
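The procedure can be sketched as follows (our illustration, not MERIT's code): each replicate draws reads without replacement from the ultra-deep pool and re-estimates the rate, and the spread across replicates provides the error bars reported alongside each estimate.

```python
import random

def error_rate_at_depth(calls, depth, n_replicates=500, seed=1):
    """Estimate an error rate at a reduced in silico depth.
    `calls` is one boolean per read covering the position (True where the
    read carries the error). Each replicate subsamples `depth` reads
    without replacement; returns (mean rate, standard deviation) across
    replicates."""
    rng = random.Random(seed)
    rates = []
    for _ in range(n_replicates):
        sample = rng.sample(calls, depth)
        rates.append(sum(sample) / depth)
    mean = sum(rates) / n_replicates
    sd = (sum((r - mean) ** 2 for r in rates) / n_replicates) ** 0.5
    return mean, sd
```

Sweeping `depth` over a grid (e.g., 1,000× to 700,000×) and plotting mean ± sd reproduces the shape of the depth-dependence curves described in the Results.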
Error rate estimation. Context-specific erroneous base calls at each locus were assumed to follow a binomial distribution. More precisely, the probability of a single nucleotide X_i with the genomic context Z X_i Z′, Z, Z′ ∈ {A, C, T, G}, at a specific locus i being misread as Y_i, i.e., P(Z X_i Z′ → Z Y_i Z′), followed

P(x_i) = C(n_i, x_i) p^(x_i) (1 − p)^(n_i − x_i),

where p is the combined PCR and sequencing error rate, and n_i and x_i are the total read depth and the number of erroneous calls at position i, respectively. Assuming a position-independent p, the probability of observing m instances of the Z X Z′ → Z Y Z′ error within each sample was then given by

P(m) = C(N, m) p^m (1 − p)^(N − m), where N = Σ_i n_i.

See Remark 1 in SI Materials on the sum of binomial random variables.
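Under this model, the estimation reduces to a short numerical sketch (our illustration): with a shared p, the per-locus binomial counts pool into a single binomial with N = Σ n_i, so the maximum-likelihood estimate of p is the total error count over the total depth.

```python
from math import comb

def pooled_error_rate(errors, depths):
    """MLE of a position-independent error rate p: the sum of binomials
    with a common p is binomial with N = sum(depths), so
    p_hat = total errors / total depth."""
    return sum(errors) / sum(depths)

def prob_m_errors(m, depths, p):
    """P(m) = C(N, m) p^m (1 - p)^(N - m): the probability of observing
    m instances of a given context-specific error across all loci."""
    N = sum(depths)
    return comb(N, m) * p**m * (1 - p)**(N - m)
```

Applying `pooled_error_rate` separately to each of the 96 trinucleotide substitution classes yields the context-specific rates, with `prob_m_errors` giving the corresponding count distribution.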
For the case of indels, a binomial model was used to describe the error rate as well; however, instead of categorizing them by genomic context, indels were classified by the type of inserted/deleted base, as no differential error rates were observed for context-specific indels.

[Figure 2 caption fragment; trinucleotide axis labels removed] … R2 read. C) Γ, the ratio of the error rate in R1 over that in R2. P-values are computed by performing a two-tailed z-test, binned as: p-value > 0.05; 10−5 < p-value < 0.05; 10−10 < p-value < 10−5; 10−50 < p-value < 10−10; p-value < 10−50. D) Δ, the difference between their corresponding Phred quality scores. We reduced the depth of paired-end reads to approximately 1,300,000× through an in silico depth reduction experiment. Results are obtained by averaging over 100 independent samples to establish error bars, which indicate one standard deviation from the average.

Figure 4: Significant variation in error rates for possible amino acid changes at individual codons. A) Six frequently mutated residues in the TP53 gene. B) Two hotspot residues in the SF3B1 gene. The higher the rate of error for a specific base change, the lower the power to distinguish true mutations from sequencing artifacts at its position. Here, the error rates represent amplification by the Hi-Fi 2X polymerase. Error bars represent one standard deviation from the mean of 100 independent sub-samples.

Figure 5: Substitution error rates classified by type (column A) and context (columns B and C) at nine depths: 1,000×, 5,000×, 10,000×, 25,000×, 50,000×, 100,000×, 200,000×, 400,000×, and 700,000×. The in silico depth reduction experiment was performed on merged reads, amplified by the Ultra II polymerase to an average depth of 1,930,473×. The shaded areas are uncertainty bounds of one standard deviation around the average, derived from 500 independent sub-samples.

Figure 6: Estimating the fidelity of four polymerases (Hi-Fi 2X, Ultra II, KAPA, and SuperFi). A) Context-specific substitutions. B) Single-base insertions. C) Single-base deletions. D) Spearman's rank correlation coefficient between context-specific error profiles. Results are obtained by averaging over 100 independent samples to establish error bars, which indicate one standard deviation from the average.

Figure 7: Spearman's rank correlation coefficient between the context-specific error profiles of the targeted genes as a function of the symmetric Kullback-Leibler distance between their content profiles, presented in SI Materials, Fig. S2.