A novel virtual barcode strategy for accurate panel-wide variants calling in circulating tumor DNA

Background Hybrid capture-based next generation sequencing of DNA has been widely applied in the detection of circulating tumor DNA (ctDNA). Various methods have been proposed for ctDNA detection, while still a great challenging present for these low allelic fraction (AF) variants. In addition, no panel-wide calling algorithm is available, which hider the fully usage of ctDNA based ‘liquid biopsy’. Thus, we developed VBCALAVD in silico to overcome these limits. Results Based on understanding of ctDNA fragmentation nature, a novel platform-independent virtual barcode strategy was established to eliminate random sequencing errors by clustering sequencing reads into virtual family. Polishing stereotypical background artifacts through constructing AF distributions. And additional three robust fine-tuning filters were obtained to eliminate stochastic mutant-family-level noises. The performance of our algorithm was validated using cell-free DNA reference standard samples (cfDNA RSDs) with AFs ranging from 0.1% to 5%. For the RSDs with AFs of 0.1%, 0.2%, 0.5%, 1% and 5%, the mean F1 scores were 0.27 (0~0.56), 0.77, 0.92, 0.926 (0.86~1.0) and 0.89 (0.75~1.0), respectively, which indicates that the proposed approach significantly outperforms the published algorithms. Also, no false positives were detected among 14 normal healthy cfDNA samples. Meanwhile, characteristics of mutant-family-level noise and quantity determinant of its divergence between samples with high and low templates were clearly depicted. by calculating cutoff AF values from best-fitted personalized distribution. Results showed that the ‘Johnsonsu’ distribution was


Performance of our virtual barcode
The performance of our virtual barcode was validated from three aspects using three Oncosmart2 UMI samples: 1) recovery rate of the real family from the UMI; 2) family contents; and 3) effectiveness in suppressing errors. Real family was clustered by UMI and genomic position. Virtual family was defined as reads that share the same start, template length and strand. We randomly selected genomic positions on our panel for 10 times (20, 92.98%±1.1%) and only a small proportion of reads with different UMI tags were mistakenly clustered by the virtual barcode. The incorrectly clustered family contents were investigated and 92.6% of these members were composed of two UMI tag families, 6.8% were three UMI tag families ( Figure 2D). The incorrect clusters might introduce false negatives, particularly if the allele number of variants is extremely low. Thus, we compared f = 1.0 (here f is the ratio of the nonreference allele in a family) virtual family numbers with f = 1.0 real family numbers at six positive sites among three UMI samples. At the 0.1% level, five out of the six positive sites had equal family numbers and no false negatives were detected ( Figure 2E), and consistency was found at the 1% and 5% levels ( Figure S1). Additionally, AF values of six positive sites calculated from virtual-family-level showed good consistency with AF values from read-level and expected values ( Figure S1).
In conclusion, our virtual barcode was sufficiently robust to replace a real UMI tag and could become a universally applicable approach for reducing noise when sequencing cfDNA samples.
Subsequently, virtual barcode was applied for 30 BGs, the panel-wide error position percentage was significantly decreased in every BG ( Figure 3A) and in turn, mean panelwide error-free position percentage was improved by ~64.11%±12.9%. The ability of the method to decrease random read errors was further confirmed among six positive sites among the top 7 high-sequencing-depth control samples. The presence of random nonreference alleles in two or more samples at the positive site ( Figure 3B), and nearly all of these alleles were decreased ( Figure 3C). These results confirmed the good and stable performance of our virtual barcode for decreasing read-level stochastic noise.
Characteristics of mutant-family-level noise A small proportion of error sites supported with f = 1.0 mutant families made the virtual barcode/UMI alone indistinguishable from real variants, and we denote this type of noise mutant-family-level noise (designated as f = 1.0 sites). Thus, additional robust filters are needed to improve the specificity of the proposed algorithm.
The profiles of mutant-family-level noise among 14 controls and 16 RSDs showed an interesting divergence. A significant linear relationship between the mean depth and error position percentage ( Figure 3D) was remained at mutant-family-level in the RSDs ( Figure   3E, green line; R 2 = 79.04%; P = 5.22*10 -6 ; 95% CI: 0.059~0.107) but not among the controls ( Figure 3E, red dots). This disagreement might be caused by input DNA quantities (virtual family numbers) and uneven depth/coverage. Through normalizing panel-wide virtual family numbers based on the coverage, family degree was obtained for every sample. Compared with controls, the median virtual family degree was significantly higher both in Oncosmart2 (2.49-fold, 2.26*10 -5 ) and Oncosmart3 RSDs (1.88-fold, P = 0.007; Figure 3F). Based on the observation that the reciprocal of family degree could reflect panel-wide median virtual family size ( Figure S2), controls had significantly larger overall virtual family size than RSDs ( Figure 3G; P = 5.88*10 -5 ), which in turn could give higher confident support for calculating f values and further decreasing random read-level noise ( Figure 3E; Figure S2). The significantly larger family size in controls were caused by the significantly less template numbers than RSDs (P = 2.05*10 -5 , Figure S2). Scatterplot clearly showed that high template numbers in RSDs caused significantly higher percentage of mutant-family level noise than controls (P = 6.25*10 -8 ; Figure 3H). This result indicated that using cfDNA data from normal healthy individuals with low templates as the background (20,36) is not sufficient to cover all noises in samples with high templates under similar sequencing coverage. Thus, we combined controls with RSDs for following analysis.
According to relationship between sample occurrence and AF spectra ( Figure S3), mutantfamily-level noises were classified into two types: stereotypical (occurrence > = 6 BGs) and stochastic mutant-family-level noise. In total, we obtained 265 unique stereotypical variants ( Figure 4A). The RSDs made a greater contribution than the controls in recovering stereotypical variants many of that happened only once in controls ( Figure S4). As expected, 265 stereotypical noises were occurred stable as showing a significant linear relationship between 25 Oncosmart2 BGs and 529 Oncosmart2 cfDNA samples ( Figure S3; P = 5.6*10 -32 ; 95% CI: 0.922~1.235; R 2 = 41.7). Further analysis of the occurrence rates of 121 shared noises ( Figure 4A) showed a significant linear relationship with a higher R 2 value ( Figure 4B; P = 2.66*10 -12 ; 95% CI: 1.019~1.308; R 2 = 67.8). Additionally, after polishing based on Oncosmart2, no stereotypical noises were found among five Oncosmart3 BGs at intersection region of the two panels (Table S2- Stereotypical noises are caused by many factors, such as DNA damage (37) and PCR errors (38), which had different substitution preference. The main substitution types of our stereotypical variants were that C>T/G>A, C>A/G>T, and A>G/T>C (71.05%, Figure 4C), and these were consistent with the substitution types from Oncosmart3 BGs (Table S2-4: Substitution frequency) and previously reported error profiles for 'Kapa HF' polymerase (38). The percentage of these six substitutions further increased to 84.297% in 121 shared sites, which demonstrated that these substitutions introduced by PCR errors were likely to occur universally ( Figure 4B, Figure  Then a comparison between the IDES construction step and our step was made ( Figure   S4). Finally, 265 stereotypical variants were polished by calculating cutoff AF values from best-fitted personalized distribution. Results showed that the 'Johnsonsu' distribution was the best-fitted distribution (Table 1; 26%) and AF cutoffs were shown in Table S3.
Compared with stereotypical noises, stochastic mutant-noises (designated as stochastic f = 1.0 site) were prone to low AF values, wide AF value spectra and unstable occurrence ( Figure S3). Additional three fine-tuning filters were proposed based on appropriate specific features, namely, variant position in a segment, imbalanced singleton number and minimum template number requirement.  Figure 4D.
With respect to the variant singleton ratio, based on the observation that variant singleton numbers (ranging from 0 to 39) among stochastic mutant-family-level noises were significantly higher than variant singleton numbers among six positive sites, we hypothesized that for the real SNV site, the ratio of singleton numbers to f = 1.0 family numbers would fluctuate within a certain range. First, in panel level, singleton ratios of all BGs were less than 2.0 ( Figure 4E). This singleton ratio was a general robust cutoff value that could well distinguish positive mutations, known mutations of Non-small-cell lung carcinoma (NSCLC) patients (40) and high AF variants from these stochastic family-level noise outliners ( Figure S6). Second, in sample level, mean variant singleton ratios of high AF sites could reflect panel-wide singleton ratio, indicating that variant singleton ratio of real variants was fluctuated around panel-wide singleton ratio ( Figure 4F). Thus, a samplelevel strategy based on distribution of singleton ratios from high AF variants (AF> = 0.05) was applied as reference distribution ( Figure S6). After false discovery rate (FDR) correction, a small number (blue bar) of extreme outliners with mean ratios ranging from 4.1~28.2 (orange bar) were removed (FDR< = 0.01; Figure 4G). Besides our method was relatively conservative and no outliner were found in samples with overall high/low singleton ratio ( Figure S6), such as two tumor samples ( Figure 4G). In conclusion, this filter could avoid over-recovery of variant singletons at genomic sites vulnerable to random noise.
Finally, template numbers were updated both for f = 1.0 numbers and variant singleton number. And updated template features were the most specific features ( Figure S7).
Based on this specific template feature, ROC curve was constructed for six positive sites at every AF level ( Figure 4F), which showed an optimal tradeoff between sensitivity and specificity at a strict 99% confidence level.
Effectiveness of all the filters in improving the panel- We then calculated the sensitivity, PPV and F1 score of both our algorithm and an available calling algorithm at every level using 25 Oncosmart2 BGs ( Figure S8). The results showed that the performance of our algorithm was significantly better than that of previously published calling software at every AF level from 0.1% to 5% (Figures 5D~5H).
Additional validation of our algorithm using five Oncosmart3 RSDs proved the robustness of our algorithm at AF levels ranging from 0.1% to 5% ( Figure S9; Table S2-1: Sensitivity).
A small number of false-positive sites were retained in the 25 Oncosmart2 BGs. From previous reference, we incorporated low-complexity (LC)(41) and short tandem region (STR)(42) into pipeline. Left in controls were SNP sites after annotation (Table S4). And left false-positive sites supported the "spreading-of-signal" (43) with the newer sequencing platform (HiSeq 3000/4000/X Ten) in the same sequencing lane (Table S5).

Discussion
Recently, several studies have been focusing on application of cfDNA fragmentation information in clinical settings (44-46). Traditional endogenous UMI of randomly sheared genomic DNA sequences in depressing noises has been technically validated. Due to the shearing process not entirely random, its usage is limited (25). However, in highly fragmentation nature cfDNA, this type of endogenous UMI as virtual barcode here was accommodated well supporting both by comprehensively validation here and application in our previous research (40). The downside of this step was that approximately 8% of the UMI was wrongly clustered by the virtual barcode, which was due to the fact that different ctDNA molecules have a certain probability of sharing the same virtual barcode (19).
This downside of our proposed method leads to a lower yield of usable families that might generate lower f = 1.0 supported family numbers for a candidate mutation, as shown by the lower f = 1.0 virtual family numbers compared with f = 1.0 real family numbers in Figure 2E, Figure S1. This downside did not have an effect on the sensitivity and PPV at any of the AF levels tested in this study, and thus, we did not further optimize this step of the algorithm. However, because this downside might have some effect in some cases, the value of the f parameter can be adjusted to minimize this effect. Besides, this step also can be affected by paralogous sequences. Reads in these regions tend to have lower mapping quality that is due to multiple alignments. Additionally, multiple mismatches (MM)(47) was another feature to avoid this effect.
For polishing step, unlike IDES, through large samples, we find the most fitted distribution of stereotypical noise under high depth. Meanwhile, best-fitted distributions also provided informative prior distribution for distribution construction with low sample size using Bayesian methods.
For variant singleton ratio filter, the hypothesis of this filter relies on panel-wide singleton ratio and sequencing depth (family degree). For sample with panel-wide single ratio lager than 2, it might be not necessary for this calculating process. For example, one exome data, most of its templates were singletons ( Figure S6) that were the main virtual family form to support variants. Under this circumstance, overall variant singleton ratios were high among variants. Except for panel-wide singleton ratio, sequence depth is another factor. For tumor-70KB with extreme low sequence depth among all samples ( Figure S2), its low family degree under low sequence depth lead a small proportion of singletons that caused overall low variant singleton ratios ( Figure 4E: dark green dot; Figure S6).
Although our method can intelligently recognize these samples, we though, there should be a sample level cutoff value to assess whether this sample needs calculating process of this filter and related precise sample level cutoff value need further detailed investigation among large series of family degree samples with different sequencing depth.

Conclusions
This study developed a novel calling algorithm for the accurate detection of somatic mutations with an AF as low as 0.1%. The algorithm introduces three noise-reduction strategies based on a comprehensive analysis of the source of different types of sequencing noise. The robustness of the strategies was well elaborated using 11 Oncosmart2 RSDs and 14 Oncosmart2 controls and validated with five Oncosmart3 RSDs.
Our algorithm is independent of the platform and well suited for NGS data with or without an UMI. Due to its good performance for the detection of low-AF mutations, our algorithm will greatly facilitate the noninvasive panel-wide detection of mutations in cfDNA in clinical settings.

Materials
In the present study, the following materials were included: 14 cfDNA samples (controls) from healthy individuals, 529 Oncosmart2 patient cfDNA, 104 Oncosmart1 patient cfDNA data and 2 tumor samples (one was from 70KB panel, another was whole exome data), 16 cfDNA reference standards (RSDs) (HD780) harboring six SNV positive sites with AF levels from 0.1% to 5% and corresponding 3 wild-type cfDNA control (HWT). Detailed sample descriptions and sample usages were provided in Supplementary Materials. After preprocessing, sample statistics were provided in Table S1.

Virtual barcode-based algorithm
The sequencing reads were clustered into virtual families according to the start, template length and strand. We validated the robustness and effectiveness of the virtual barcode compared with three UMI samples from three aspects: 1) recovery rate of the real family from the UMI; 2) family contents; and 3) effectiveness in suppressing errors. After validation, if both read1 (R1) and read2 (R2) from the sample template covered a genomic site, we further consolidated the read1 and read2 families.

Consent for publication
This manuscript contains no individual person's data in any form.

Availability of data and materials
All data generated or analyzed during this study are included in this published article and its supplementary information files.

Competing interests
The authors declare that they have no competing interests.