TNER: A Novel Bayesian Background Error Suppression Method for Mutation Detection in Circulating Tumor DNA

The use of ultra-deep, next-generation sequencing of circulating tumor DNA (ctDNA) holds great promise for early detection of cancer as well as for monitoring disease progression and therapeutic responses. However, the low abundance of ctDNA in the bloodstream, coupled with technical errors introduced during library construction and sequencing, complicates mutation detection. To achieve high accuracy of variant calling by better distinguishing low frequency ctDNA mutations from technical errors, we introduce TNER (Tri-Nucleotide Error Reducer), a novel Bayesian background error suppression method that provides a robust estimation of background noise to enhance the specificity of ctDNA mutation detection without sacrificing sensitivity. Results on both simulated and real healthy subjects' data demonstrate that the proposed algorithm consistently outperforms current position-specific models, particularly when the sample size is small. TNER is publicly available at https://github.com/ctDNA/TNER.


Introduction
Cancer is a genetic disease driven by changes to genes that control cellular function [1]. Characterizing the disease at the molecular level is essential to early detection, personalized therapy based on tumor genomic profile, monitoring tumor progression and response to treatment, and identification of resistance mechanisms [2]. For solid tumors, this typically requires tissue biopsies to obtain samples for genotyping or other molecular analyses. Biopsy procedures are usually invasive and carry additional risk to the patient's health. In many cases, tumor tissue biopsy is medically contraindicated, and the tissue samples are often insufficient or unsuitable for molecular profiling [3]. In addition, cancer is a heterogeneous disease with different subclones within the same primary tumor and between the primary tumor and metastatic lesions, which can lead to variation in tumor sampling [4,5].
Both cancer and normal cells shed DNA as a result of apoptosis and other biological processes, releasing DNA fragments into the blood stream as cell-free DNA (cfDNA) [6][7][8]. cfDNA derived from tumor cells is called circulating tumor DNA (ctDNA), and provides a real-time genomic snapshot of cancer cells due to the relatively short half-life of cfDNA (~1-2 hours) [2,9]. ctDNA is a form of "liquid biopsy" that provides a non-invasive alternative to tissue biopsy for cancer diagnosis and monitoring [10][11][12]. Moreover, ctDNA generally derives from all tumor lesions and is pooled in the circulatory system; it is therefore less subject to the sampling variation associated with tumor heterogeneity than a single tissue biopsy [13].
The fraction of ctDNA in the total cfDNA in plasma, however, can be extremely low in many cancer patients [2,9]. Recently established techniques such as droplet-digital PCR (ddPCR) enable detection and quantification of low abundance ctDNA, but are limited to a small number of known "hotspot" mutations [9,14]. Advances in DNA sequencing technology have made it possible to identify ctDNA mutations at sensitivity comparable to ddPCR [15,16] when the sequence coverage is sufficient (>10,000 sequence reads per base). One of the most significant challenges in detecting ctDNA mutations is suppressing technical errors introduced during library preparation, PCR amplification or the sequencing itself [17]. While errors arising during PCR amplification can be removed effectively using molecular barcodes [17], other technical errors are more universal and also need to be removed before mutation calling [3,18]. Newman et al [19] recently proposed an integrated digital error suppression (iDES) method that combines a molecular barcoding system to reduce PCR errors with a background polishing model offering an improved estimate of the background mutation error rate (BMER) relative to the previous computational method used in CAPP-Seq [20]. Specifically, the BMER was estimated by fitting a Gaussian or Weibull distribution to the mutation data from a collection of healthy subjects [19]. The method increased the accuracy of variant detection significantly, but the relatively small sample size (n=12) and the nature of the data (small discrete counts) make it difficult to estimate the background robustly. As a result, false positive errors are still observed at a substantial number of bases [19].
To provide a more robust estimation of the background error and further reduce the false positive rate of variant calling, we developed a novel background polishing method called TNER (Tri-Nucleotide Error Reducer), which uses a Bayesian approach to overcome the small sample size of healthy subjects. TNER is based on tri-nucleotide context data and uses a binomial distribution for the mutation error to estimate the background from healthy subjects. Tri-nucleotide context, or more exactly the 96 tri-nucleotide contexts consisting of the 6 distinguishable single nucleotide mutations and the 16 possible combinations of immediately preceding and following bases, has been extensively studied in cancer genetics to construct mutation signatures in response to carcinogens, to compare the mutational spectra of trunk and branch mutations, and to predict the clinical implications of called mutations [5,[21][22][23]. Given that the pattern of low frequency technical errors from next-generation sequencing (NGS) should be similar in normal control samples and patient samples, we argue that local sequence context can better model noise at small sample sizes by leveraging information from other bases sharing the same tri-nucleotide context. In this paper, we describe the Bayesian background polishing model based on tri-nucleotide contexts and demonstrate that TNER reduces errors more effectively than the position-specific error model when only a limited number of healthy donor samples is available for modeling the background error. The TNER methodology proposed here is, to the best of our knowledge, novel in this area. TNER is publicly available at https://github.com/ctDNA/TNER.
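For concreteness, the 96 tri-nucleotide contexts can be enumerated directly. A minimal sketch (the function name is ours), following the common convention of writing the mutated middle base as a pyrimidine:

```python
from itertools import product

# The 6 distinguishable single-nucleotide substitutions
# (middle base written as a pyrimidine by convention).
SUBSTITUTIONS = ["C>A", "C>G", "C>T", "T>A", "T>C", "T>G"]
BASES = "ACGT"

def trinucleotide_contexts():
    """Enumerate the 96 contexts: 6 substitutions x 16 flanking-base pairs."""
    return [
        f"{five}[{sub}]{three}"
        for sub in SUBSTITUTIONS
        for five, three in product(BASES, repeat=2)
    ]

contexts = trinucleotide_contexts()  # 6 * 16 = 96 contexts
```

Each base in a panel is then assigned to one of these 96 groups based on its reference base, its two flanking bases, and the alternative allele being tested.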

NGS data for analysis
To demonstrate the performance of the error suppression model in detecting single nucleotide variations, we analyzed targeted sequencing data of plasma cfDNA of healthy subjects using a panel of 87 cancer genes described previously (Lira et al. AACR 2017, #2749). The barcoded target-enriched DNA library (147Kb) was sequenced on an Illumina HiSeq 4000 platform generating ultra-deep coverage with an average coverage per base of ~12,000.

Tri-nucleotide error reduction model
The detection of ctDNA is typically achieved by detecting signature mutations associated with tumors in cfDNA. Sequence data from cfDNA contain many stereotypical errors and other background mutation errors that are not of tumor origin [24]. In order to call a mutation in ctDNA, the distribution of the BMER needs to be characterized at each nucleotide position to reduce false positive errors, for example by modeling cfDNA data on the same NGS panel from healthy subjects [19]. The mutation rates observed in healthy subjects are assumed to represent background mutation noise from both technical and biological sources. One challenge in characterizing the individual-nucleotide BMER from healthy subjects is the relatively small cohort size. The iDES method used 12 healthy subjects [19]; we used a comparably sized set of 14 healthy subjects. Such small sample sizes do not allow a reliable estimate of the background error distribution for individual nucleotides. A Bayesian method with prior information can help overcome this limitation.
To better estimate the BMER distribution, we propose a background error model based on a hierarchical Bayesian method that utilizes the distribution of the mutation error rate in a tri-nucleotide context, which consists of the mutated nucleotide and the immediately preceding and following nucleotides. Mutation signatures characterized by tri-nucleotide context have been used frequently in cancer genetics [5,21,22]. There are 96 independent tri-nucleotide contexts. For a nucleotide in tri-nucleotide group i (i=1, …, 96) at base position j (j=1, …, J), the number of mutant reads X_ij observed at a given coverage N_j is assumed to follow a binomial distribution with position-specific mutation rate parameter π_ij, that is, X_ij ~ Binomial(N_j, π_ij), where J is the total number of bases in the panel (147k). With a large N_j (typically >1,000) and a small π_ij (<1%), X_ij can also be modeled as a Poisson distribution, X_ij ~ Pois(N_j π_ij), with rate parameter N_j π_ij. The BMER at position j can be estimated by the average mutation rate of the jth nucleotide across the 14 healthy subjects, π̂_ij. This position-specific parameter is poorly estimated with such a small sample size. To improve the estimate of π_ij, we adopt a Bayesian framework and assume that π_ij follows a beta distribution within a tri-nucleotide context. For convenience, we re-parametrize the beta distribution using its mean as a parameter:
π_ij ~ Beta(μ_i, ω_i), with μ_i = α/(α+β) and ω_i = α+β.

That is, we assume the BMER within a tri-nucleotide context follows a beta distribution. The prior parameters of the beta distribution can be estimated from the BMER distribution of the nucleotides in the tri-nucleotide context using the method of moments [25]; in particular, the mean parameter μ_i can be estimated by the average mutation rate of the nucleotides in the tri-nucleotide context. For a position with x mutant reads out of n total reads, the posterior distribution of the BMER at this position is Beta(α+x, β+n-x), whose mean is w μ_i + (1-w)(x/n) with w = (α+β)/(α+β+n). The posterior mean is therefore a shrinkage estimator: a weighted average of the tri-nucleotide level mutation rate μ_i and the position-specific rate π̂_ij = x/n.
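As a minimal sketch of this shrinkage estimate (the function names and toy rates below are ours, not part of the TNER package), assuming NumPy is available:

```python
import numpy as np

def beta_prior_mom(rates):
    """Method-of-moments fit of a beta prior to the position-level average
    error rates within one tri-nucleotide context.
    Returns (mu, omega) in the mean/size parametrization:
    mu = alpha/(alpha+beta), omega = alpha+beta."""
    mu = float(np.mean(rates))
    var = float(np.var(rates, ddof=1))
    # Beta variance: mu*(1-mu)/(omega+1)  =>  omega = mu*(1-mu)/var - 1
    omega = mu * (1.0 - mu) / var - 1.0
    return mu, omega

def posterior_mean(x, n, mu, omega):
    """Shrinkage estimate of the position-specific BMER: a weighted average
    of the tri-nucleotide mean mu and the observed rate x/n, with weight
    w = omega/(omega + n) on the prior mean."""
    w = omega / (omega + n)
    return w * mu + (1.0 - w) * (x / n)

# Toy example: four positions sharing one tri-nucleotide context.
mu, omega = beta_prior_mom([0.5e-4, 1.0e-4, 1.5e-4, 2.0e-4])
polished = posterior_mean(0, 10_000, mu, omega)  # a position with no mutant reads
```

A position observing zero mutant reads is thus pulled up toward the context-level background, while a position with many mutant reads is shrunk only mildly toward it.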
In practice, a simplified weight w_ij that balances the relative size of the tri-nucleotide mutation error rate and the position-specific mutation error rate works as well. This weight provides less shrinkage when the position-specific mutation error rate is high, a property that helps retain the position-specific background when it is much higher than the tri-nucleotide level background. However, it does not reflect the impact of sample size on the weight, and is therefore a more heuristic approach than the full Bayesian method.
Once we have an estimate of the BMER, the threshold for mutation detection can be defined based on its upper confidence limit. At level α, the upper 1-α/2 Clopper-Pearson confidence limit for a binomial proportion parameter is B(1-α/2; x_ij+1, N_j-x_ij), where B(·) is the quantile function of the beta distribution, x_ij is the mutation read count at position j and N_j is the total reads for this position. In the Bayesian model we do not need to worry about multiple comparison correction, and the prior distribution allows pooling of information between positions and avoids false positive calls when variation is low [26]. In our analysis, false positive calls are very rare. Similar beta-binomial models have been used in other studies [27][28][29]. However, none of them used the model to estimate the BMER distribution with tri-nucleotide contexts, nor were they applied to ctDNA NGS data.
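The Clopper-Pearson limit above is simply a beta quantile, so the detection threshold takes a few lines to compute. A sketch assuming SciPy is available (the function name is ours):

```python
from scipy.stats import beta

def detection_threshold(x, n, alpha=0.01):
    """Upper 1 - alpha/2 Clopper-Pearson confidence limit for a binomial
    proportion with x mutant reads out of n total (requires x < n):
    the 1 - alpha/2 quantile of Beta(x + 1, n - x)."""
    return float(beta.ppf(1.0 - alpha / 2.0, x + 1, n - x))

# A position whose polished background implies 5 mutant reads in 10,000:
threshold = detection_threshold(5, 10_000, alpha=0.01)
```

A sample whose observed mutant allele frequency at that position exceeds `threshold` would then be called as a mutation under this scheme.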

Results
We first evaluated the TNER model on the healthy subject data using the leave-one-out method: we built the background model using data from 13 healthy subjects and predicted the mutations in the left-out subject. Similar to Newman et al [19], we counted the number of error-free positions, defined as positions with exclusively reference allele reads, for each of the 14 healthy subjects at all 147k nucleotide positions, and compared the different error suppression methods, including the background polishing from iDES and the TNER method (Figure 1). For TNER we used α=0.01, although the results were similar for α=0.05. We also calculated the panel-wide error rate, defined as the number of non-reference allele reads (frequency <5%, to exclude SNPs) divided by the total reads. To test the sensitivity of the method, we used data from three healthy subjects that were not part of the background cohort. One subject had 10 unique private SNPs that were not shared by any of the other healthy subjects. In an in-silico experiment, we diluted this subject's data with the other two healthy subjects' data in a 1:250:250 ratio and assumed heterozygosity, giving an expected allele frequency of 0.1% for the 10 private SNPs. Both iDES and TNER (α=0.01) detected all 10 SNPs in this experiment.
To compare the performance of the position-specific background polishing method and the TNER method more rigorously, we evaluated their sensitivity and specificity at various detection thresholds using simulation studies. The simulation used the average position-specific mutation error rate from the 14 healthy subjects as the BMER, a matrix of 147k rows and four columns. Each column corresponds to a nucleotide the reference base can mutate to; the entry for the reference nucleotide itself is zero. We randomly selected 1,000 bases (rows) out of the 147k total, and at each selected base a simulated allele frequency (simulated signal) was added to the existing BMER of a selected non-reference nucleotide (column). Specifically, for each of the 1,000 positions there are three possible non-reference nucleotides it can mutate to; we chose the nucleotide with the largest BMER value to receive the simulated signal. If the BMERs were all zero at a position, we used the first non-reference letter (in the order A-C-T-G). The updated BMER matrix is identical to the original except that 1,000 rows have a signal added to a selected column. With the updated BMER matrix, we simulated read counts with a total coverage of 10,000 per position using a binomial and a normal distribution; for the normal distribution, the standard deviation is the square root of the BMER divided by 100. The simulated counts were further split into forward and reverse strands with a random forward-to-reverse ratio centered around 1. The position-specific Gaussian model from iDES and the TNER method were then separately applied to the simulated data. As the true positives and true negatives are known, sensitivity and specificity were calculated under various detection thresholds (α values). Receiver operating characteristic (ROC) curves in Figure 2 compare the two methods under different scenarios.
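The spike-in step of such a simulation can be sketched as follows. This is a simplified version (binomial branch only; the function name, seed, and toy values are ours), assuming NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulate_counts(bmer, coverage=10_000, n_signal=1000, signal_af=0.001):
    """Spike a signal allele frequency into n_signal random rows of a BMER
    matrix (positions x 4 nucleotide columns, reference column = 0), then
    draw binomial read counts at the given coverage.
    Returns the count matrix and the indices of the spiked positions."""
    rates = np.asarray(bmer, dtype=float).copy()
    spiked = rng.choice(rates.shape[0], size=n_signal, replace=False)
    for i in spiked:
        # Add the signal to the column with the largest background rate;
        # np.argmax falls back to the first column when all rates are zero.
        j = int(np.argmax(rates[i]))
        rates[i, j] += signal_af
    counts = rng.binomial(coverage, np.clip(rates, 0.0, 1.0))
    return counts, spiked
```

Since the spiked rows and columns are recorded, calls made by each polishing method on the resulting counts can be scored directly as true or false positives.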
The TNER method performed better than the position-specific Gaussian model in all cases: data simulated under different distributions and different mutation rates (MR), as shown by the ROC curves. One of the advantages of the TNER method is that it uses information from other positions with the same tri-nucleotide context through a Bayesian model and thus stabilizes parameter estimates of the BMER. We would therefore expect TNER to outperform position-specific error models when the available sample of healthy subjects becomes small. To evaluate the effect of healthy-subject sample size on the performance of the mutation detection methods, we used half the available healthy subjects (n=7) for the background mutation estimate and compared the results from the position-specific Gaussian model and TNER in the simulation studies. As expected, the smaller sample size of healthy subjects did not substantially reduce the performance of TNER, but greatly reduced the performance of the position-specific Gaussian method (Figure 3). This illustrates the robustness of the TNER method when the number of healthy subjects is small.

Discussion
We propose a new background polishing method for the detection of ctDNA in liquid biopsy samples based on tri-nucleotide level data. The TNER method estimates the background mutation errors from healthy subjects using a hierarchical model to incorporate both the tri-nucleotide level mutation rate and position specific mutation rate. The additional information from tri-nucleotide level data helps stabilize the estimate of background errors and proves to be more robust when the number of healthy subjects is small.
We could have used a dinucleotide context or a more complicated local sequence context such as pentanucleotide (2 flanking nucleotides on each side) or heptanucleotide (3 flanking nucleotides on each side). A larger local sequence context may provide a better model fit to the mutation error rate [30], but the increasing model complexity of pentanucleotide (1,536 unique contexts) and heptanucleotide (24,576 unique contexts) models becomes impractical for a targeted panel like the one tested here, with a total of 147k bases: the Bayesian prior parameters cannot be well estimated from the small number of bases within each context. Tri-nucleotide context provides a better fit than dinucleotide context [31] without the complexity of the larger local sequence contexts [30], and is therefore a more balanced approach for a common NGS targeted panel.
One of the assumptions TNER makes about NGS data is that individual nucleotides within a tri-nucleotide context share a more similar mutation error rate than nucleotides in different tri-nucleotide contexts. We examined the average mutation error rate from healthy subjects at the tri-nucleotide level and compared the intra-trinucleotide and inter-trinucleotide variability: 94% of tri-nucleotides have intra-trinucleotide variability smaller than the inter-trinucleotide variability. Figure 4 displays an example of three tri-nucleotides, all with a C to T mutation, showing very different distributions. In general, the beta distribution fits the intra-trinucleotide mutation error rates very well.
Figure 4. Examples of mutation error rate distributions of tri-nucleotides with C-T mutation. Solid lines are the probability density of the average position-specific error rate within a tri-nucleotide. Dashed lines are the corresponding fit of a beta distribution using parameters estimated from the data by the method of moments.
In genomic data analysis, when the sample size is small, it is common to analyze data for individual genes using information borrowed from other genes; this is implemented in the limma method [32] for microarray data analysis and the DESeq method [33] for RNA-seq data analysis. In our approach, we take advantage of the large number of bases sharing the same tri-nucleotide context and use these data to model the mutation error rate of individual bases. We found that the TNER method improves the imprecise base-level background estimates associated with small sample sizes.
Sequence data are read counts that are best described by discrete distributions such as the Poisson or binomial, particularly when the read count is low and the mutation rate is small, as in ctDNA data. We found that the Poisson distribution fit the count data well in general. More sophisticated distributions accounting for over-dispersion and the zero-inflated nature of ctDNA data may further improve the method.
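The Poisson approximation to the binomial invoked in the model section is easy to check numerically at the depths and error rates typical of this panel. A stdlib-only sketch (helper names are ours):

```python
from math import comb, exp

def binom_pmf(k, n, p):
    """Exact Binomial(n, p) probability mass at k."""
    return comb(n, k) * p**k * (1.0 - p) ** (n - k)

def pois_pmf(k, lam):
    """Poisson(lam) probability mass at k, computed iteratively."""
    out = exp(-lam)
    for i in range(1, k + 1):
        out *= lam / i
    return out

# Deep coverage (n = 10,000) and a low error rate (p = 1e-4): the two
# pmfs agree closely at every plausible mutant-read count.
n, p = 10_000, 1e-4
diffs = [abs(binom_pmf(k, n, p) - pois_pmf(k, n * p)) for k in range(10)]
```

At these parameter values the largest pointwise difference between the two pmfs is on the order of 1e-5, supporting the use of either distribution for the background counts.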
ctDNA is rapidly becoming established as an important tool to supplement conventional biopsies for early cancer detection, molecular characterization and monitoring of tumor dynamics. The TNER method provides a novel Bayesian approach to improve the accuracy of mutation detection in ctDNA.