Integrated approach to generate artificial samples with low tumor fraction for somatic variant calling benchmarking

Background High-throughput sequencing (HTS) has become the gold standard approach for variant analysis in cancer research. However, somatic variants may occur at low fractions due to contamination from normal cells or tumor heterogeneity; this poses a significant challenge for standard HTS analysis pipelines. The problem is exacerbated in scenarios with minimal tumor DNA, such as circulating tumor DNA in plasma. Assessing sensitivity and detection of HTS approaches in such cases is paramount, but time-consuming and expensive: specialized experimental protocols and a sufficient quantity of samples are required for processing and analysis. To overcome these limitations, we propose a new computational approach specifically designed for the generation of artificial datasets suitable for this task, simulating ultra-deep targeted sequencing data with low-fraction variants and demonstrating their effectiveness in benchmarking low-fraction variant calling. Results Our approach enables the generation of artificial raw reads that mimic real data without relying on pre-existing data by using NEAT, a fine-grained read simulator that generates artificial datasets using models learned from multiple different datasets. Then, it incorporates low-fraction variants to simulate somatic mutations in samples with minimal tumor DNA content. To prove the suitability of the created artificial datasets for low-fraction variant calling benchmarking, we used them as ground truth to evaluate the performance of widely-used variant calling algorithms: they allowed us to define tuned parameter values of major variant callers, considerably improving their detection of very low-fraction variants. Conclusions Our findings highlight both the pivotal role of our approach in creating adequate artificial datasets with low tumor fraction, facilitating rapid prototyping and benchmarking of algorithms for such dataset type, as well as the important need of advancing low-fraction variant calling techniques. Supplementary Information The online version contains supplementary material available at 10.1186/s12859-024-05793-8.

where: • FASTA is the reference genome (GRCh38 or hg38) • READLEN is the read length; standard read length in targeted sequencing for Illumina sequencers is 150 • FRAGLENMODEL is the fragment length model; we used the default fraglenModel_default.pickle.gz• COVERAGE is the expected median coverage for output artificial datasets; we set the value 30,000, a common target in ultra-deep sequencing • SEQERRORMODEL is the sequencing error model; we used the default errorModel_default.pickle.gz• GCBIASMODEL is the GC% bias model; we used the default gcBias_default.pickle.gz• BED indicates the target regions of interest, e.g., some specific gene regions; it is a tab-separated value text file containing the chromosome names and the genomic positions of the target regions, e.g., constructed from the genome coordinates for all exons of the genes of interest extracted from the UCSC Table Browser (https://genome.ucsc.edu)• --bam indicates to generate a BAM file in output • --vcf is used to produce a VCF file in output • --no-fastq specifies that no FASTQ files are generated in output, saving space and computational time • RANDOM is the Bash function used to generate a random seed for NEAT • MUTMODEL is the mutational model; we used the MutModel_NA12878.pickle.gz,generated from the NA12878 cell line • OUTPUT_NAME is the output name, i.e., the name of the generated files In addition to the BAM file, using the option --vcf also generates the related VCF file containing the random mutations included by NEAT.

S1.2. Generation of low-fraction somatic variants
Somatic mutations, to be spiked in the artificial normal samples from NEAT, are generated using the random_sites.pyscript of the BAMSurgeon suite version 1.3 (Ewing et al., 2015).The script can generate the user-supplied desired number of variants (either single nucleotide variants -SNVs, or insertions-deletions -INDELs), randomly selecting the affected and the varied DNA bases.The allele fraction of the generated variants is randomly sampled from the beta distribution, with the fraction upper and lower bounds selected by the user along with the length of the INDELs.
SNVs are generated on each artificial sample with the following command: where: • FASTA is the reference genome (GRCh38 or hg38) • -b is the BED file to be fed; it is the same file used for the generation of the normal samples (see section S1.1 for details) • RANDOM is the Bash function used to generate a random seed for random_sites.py• -n is the number of random mutations to be generated, e.g., 100 • --avoidN is used to avoid regions containing unknown references when spiking-in variants • --minvaf and --maxvaf are the minimum and the maximum variant allele fractions (VAF) of the randomly generated mutations, respectively; we set them to 0.00001 (0.001%) and 0.05 (5.0%) • snv specifies the type of variants to be created, i.e., Single Nucleotide Variants (SNV) • OUTPUT is the output file name INDELs are generated in the same fashion, using the command: where, in addition to the above specified parameters: • --minlen and --maxlen are the minimum and the maximum base length of the randomly generated INDELs, respectively; we generated two types of INDELs: INDELs of mixed lengths (--maxlen 90) and short INDELs (--maxlen 3) • indel specifies the type of variants to be created, i.e., insertions/deletions (INDEL)

S1.3. Spike-in of artificial variants
Variants generated with the random_sites.pyscript are then spiked in each artificial normal sample through the scripts add_snv.pyand add_indel.py of the BAMSurgeon suite.SNVs and INDELs are spiked in separately.First, the following command is used to spike-in the somatic SNVs:  (Li and Durbin, 2009) for the alignment process, performed by BAMsurgeon • --alignopts specifies the aligner options, established by the bioinformatics community (https://github.com/bcbio/bcbio-nextgen)• -p splits into the specified number of multiple processes • --ignoresnps forces spiking-in even if the reference allele does not share the relevant reads • --tagreads adds a tag in the generated BAM file ('BS') to the altered reads, to distinguish them from the reads from the original sample (for possible downstream analyses or quality control) • --mindepth and --maxdepth are the minimum and maximum read depth to spike-in the mutations • --tmpdir sets the name of the BAMSurgeon temporary directory • --minmutreads 4 sets a minimum of four mutated reads • RANDOM is the Bash random function used to create a random seed for BAMSurgeon INDELs previously generated are spiked-in as done for SNVs, but using the add_indel.pyscript, with the same parameter values.Each type of INDELs generated is inserted separately in the artificial tumor samples previously generated with spiked-in SNVs.

S1.4. Variant calling parameters and their value tuning
Variant calling is performed using six different and widely used variant callers: VarDict (Lai et al., 2016), Mutect2 (Cibulskis et al., 2013), LoFreq (Wilm et al., 2012), VarScan2 (Koboldt et al., 2012), FreeBayes (Garrison and Marth, 2012) and Strelka2 (Kim et al., 2018).We first tested the performance of the variant callers using their default parameter values (default settings), and subsequently we tuned the parameter values of each caller to improve sensitivity (high-sensitivity settings), including performing a reduction or removal of the minimum variant allele fraction limit.Details about default values of algorithm parameters and their tuning are reported as follows: VarDict: Tumor-only mode, default parameters: where: • FASTA is the reference genome (GRCh38 or hg38) • TUMOR_BAM is the input, tumor, BAM file • NAME is the output name of the result • 0.01 is the default limit of detection of VarDict (i.e., the lowest VAF detectable) • --nosv turns off structural variation calling to save computational time • BED is a tab-separated value text file containing the chromosome names and the genomic positions of the target regions of interest; it is the same file used for the previous step (see Section S1.1 for details) • teststrandbias.R is a R script that removes strand bias (a type of sequencing error) • var2vcf_valid.pl is a PERL script used to remove false positive calls from the output calls • -E (of the var2vcf_valid.plscript) does not print the tag END at the end of the log when set • NAME.vcf is the output VCF file name The other parameters are set by default, as stated in the VarDict documentation (https://github.com/AstraZeneca-NGS/VarDictJava#readme).
Tumor-normal paired mode, default parameters: where: • TUMOR_BAM is the tumor input BAM file • NORMAL_BAM is the normal, paired, input BAM file • testsomatic.R is the tumor-normal mode equivalent of the teststrandbias.R script to remove strand bias (a type of sequencing error) • var2vcf_paired.pl is the tumor-normal mode equivalent of the var2vcf_valid.plscript to remove false positive calls from the output calls • TUMOR_NAME and NORMAL_NAME are the sample name of the tumor and normal BAM files, respectively The limit of detection is removed by strongly lowering the threshold on the minimum allele fraction, using the following parameter: • -f 0.0001 To detect the best set of parameter values for high-sensitivity settings, we first tested the following parameters on a single dataset (with either SNVs or SNVs + INDELs spiked-in) from the training set: • -X [default: 2]: extension of bp to look for mismatches after insertion or deletion • --nmfreq [default: 0.1]: variant frequency threshold to determine variant as good in case of nonmonomer Micro Satellite Instability (MSI) • --mfreq [default: 0.25]: variant frequency threshold to determine variant as good in case of monomer MSI • -q [default: 22.5]: phred score for a base to be considered a good call • -m [default: 8]: filter of reads with mismatches more than 8 After the benchmarking results (Supplementary Figure 17), we focused further benchmarks only on the -q parameter, which achieved the best performance with the following value: • -q 15 Mutect2: Tumor-only mode, default parameters: The benchmarking showed that only two of the proposed parameters influenced the software performance on low fraction calling: max-reads-per-alignment-start and min-base-quality-score.We then benchmarked the performance of Mutect2 when combinations of different values of these two parameters were fed in input to the software (Supplementary Figure 18 and 19).The parameter values that achieved the best performance, in both tumor-normal paired and tumor-only mode, and for both SNVs and INDELs, were: • --max-reads-per-alignment-start 100 • --min-base-quality-score 25 LoFreq: Tumor-only mode, default parameters: where --call-indels is needed to call INDELs, and is set only when INDELs are spiked in the artificial normal samples.Note that when calling INDELs, input BAM files need to be processed using the following command: where the --dindel command inserts INDEL qualities into the input BAM files and is mandatory for INDELs calling.
Parameter tuning for high-sensitivity settings was performed separately for tumor-normal paired mode and tumor-only mode, since the two modes do not share the same commands.For tumor-only mode, the following parameter was benchmarked (Supplementary Figure 20): • --sig [default: 0.01]: p-value cut-off for a variant call The value that achieved the best performance was: • --sig 5 Tumor-normal paired mode, default parameters: To perform parameter tuning for high-sensitivity settings for tumor-normal paired mode, following LoFreq documentation (https://csb5.github.io/lofreq/commands/#somatic)we benchmarked two parameters: • --tumor-mtc [default: bonf]: the type of multiple testing correction for tumor; default is the Bonferroni test • --tumor-mtc-alpha [default: 1.0]: the value of the Alpha parameter for the multiple test correction Likewise the tumor-only mode, these parameter values were systematically tested (Supplementary Figure 20).The combination of parameter values that achieved the best performance were: • --tumor-mtc fdr • --tumor-mtc-alpha 1 VarScan2: Tumor-only mode, default parameters: VarScan2 needs a mpileup file (a text-based format file that summarizes base calls of aligned reads) as input, which SAMtools (Li et al., 2009) can build when using the mpileup command.Then, the VarScan2 mpileup2cns command is run to perform variant calling.
Tumor-normal paired mode, default parameters: In tumor-normal paired mode, a single mpileup is generated using both the tumor and the normal BAM files from the sample.Then, the -mpileup 1 command from VarScan2 ensures a single BAM file is processed as input.
The limit of detection is removed by adding the following parameters: • We lastly utilized the following combination of parameter values for high-sensitivity settings, after the benchmarking results (Supplementary Figure 21): • For tumor-only mode:  --p-value 0.1  --min-avg-qual 5 • For tumor-normal paired mode:  --p-value 0.1  --somatic-p-value 0.1 Freebayes: Tumor only mode, default parameters: Tumor-normal paired mode, default parameters: To remove the limit of detection we set the following parameter: • --min-alternate-fraction 0.01 Further increasing the limit of detection was not possible, due to the high computational time required from the algorithm in such settings.

Strelka2:
Tumor-normal paired mode, default parameters: path/to/strelka/bin/configureStrelkaSomaticWorkflow.py \\ --tumorBam TUMOR_BAM \ --normalBam NORMAL_BAM \ --referenceFasta FASTA \ --runDir ANALYSIS_DIR_PATH ANALYSIS_DIR_PATH/runWorkflow.py -m local -j CPUS where • ANALYSIS_DIR_PATH is the folder, generated by Strelka2 itself, in which the analysis is run • -j CPUS is the number of threads to be used The following parameters were benchmarked in order to detect the best values for the high-sensitivity settings (values tested are reported):

S1.5. Variant calling metrics for benchmarking
Variant calling performances are evaluated using an in-house developed script (performance_evaluation.py) provided at https://github.com/DIncalciLab/dincalcilab-lowfrac-variant-benchmarkIt is written in Python programming language (version 3.9) and built around the cyvcf2 package (Pedersen and Quinlan, 2017), a Cython (https://cython.org/)wrapper for fast parsing of VCF files, which allows the easy extraction of the variants from VCF files.
Called variants extracted are compared with those successfully spiked-in and four metrics are calculated: True Positive Rate (TPR), Positive Predictive Value (PPV),False Discovery Rate (FDR) and F1 score.
The True Positive Rate (or Sensitivity, or Recall) is calculated as:

𝑇𝑃𝑅 = 𝑇𝑃 𝑇𝑃 + 𝐹𝑁
where: • True Positive (TP) is the number of spiked-in somatic variants (SNVs or INDELs) correctly called • False Negative (FN) is the number of spiked-in somatic variants not called The Positive Predictive Value (or Precision) is calculated as:

𝑃𝑃𝑉 = 𝑇𝑃 𝑇𝑃 + 𝐹𝑃
where: • False Positive (FP) is the number of called somatic variants that were not spiked-in, thus are incorrectly called.Called pseudo-germline variants randomly inserted in the initial artificial normal samples are not considered as false positives, and thus removed from the calculation.
The False Discovery Rate is calculated as: The F1 score is calculated as:

S1.6. BAM files downsampling
The random selection for reads downsampling was performed using the sambamba (Tarasov et al., 2015) tool.Specifically, for each percentage range (i.e., 2-8% and 20-80%), we used the subsample command in sambamba to randomly select the desired number of reads.The command takes into account the read group information, which allows for subsampling at the individual sample level.In particular, for each sample and each downsample step, we ran the following command: where: • -s [default: .01]:fraction of reads from the original BAM to be retained • --subsampling-seed: seed for subsampling • -f: output format • -o: output file name

S2.1 Spike-in of artificial variants
We tested the reliability of the spike-in of artificial somatic variants by comparing the number of generated loci versus their actual number spiked in the data, i.e., the spike-in success rate.Performances of SNVs and INDELs were evaluated separately.
The SNV spike-in success rate for each sample is reported in Supplementary Table 1.There was no significant difference (Kolmogorov-Smirnov test, p-value = 0.99) in the VAF distribution of the generated variants versus the VAF distribution of the actual spiked-in variants, for both the entire artificial sample cohort (Supplementary Figure 1) and at the individual sample level (Supplementary Figure 2).
.vcfThe limit of detection is removed by adding the following parameter:• --minimum-allele-fraction 0.0 To detect the best set of parameter values for high-sensitivity settings, we first benchmarked the following parameters individually: • ---f1r2-max-depth [default: 200]: groups sites with depth higher than 200 • --f1r2-median-mq [default: 50]: sites with median mapping quality below 50 were skipped • --max-reads-per-alignment-start [default: 50]: maximum number of reads to retain per alignment start position.Reads above 50 threshold were downsampled • --min-base-quality-score [default: 10]: minimum base quality required to consider a base for calling • --tumor-lod-to-emit [default: 3]: Log 10 odds threshold to emit variant to VCF --min-var-freq 0.0 (for varscan) [default: 0.01]: minimum variant allele frequency threshold • -d 0 (for mpileup)[default: 8000]: maximum reads read per input file To detect the best set of parameters for high-sensitivity settings, we tested the following parameters on the training set containing only SNVs spiked-in: • For tumor-only mode:  --p-value [default: 99e-02]: p-value threshold for calling variants  --min-avg-qual [default: 15]: minimum base quality at a position to count a read • For tumor-normal paired mode:  --p-value [default: 0.99]: p-value threshold to call a heterozygote  --somatic-p-value [default: 0.05]: p-value threshold to call a somatic site

Table 1 .
Success rate of spiked-in somatic SNVs.Success rate represents the ratio between the actual number of simulated random variants successfully inserted into the normal data and their generated number.

Table 2 .
Success rate of spiked-in somatic INDELs (max length 90 bp).Success rate represents the ratio between the actual number of simulated random variants successfully inserted into the normal data and their generated number.

Table 3 .
Success rate of spiked-in somatic short INDELs (max length 3 bp).Success rate represents the ratio between the actual number of simulated random variants successfully inserted into the normal data and their generated number.

Tumor-normal paired mode Supplementary Figure 8. Variant caller performance analysis with low-fraction spiked-in somatic SNVs, using tumor-normal paired mode and parameter default values.
Plot depicts the recall (i.e., Sensitivity, or TPR) and precision (i.e., PPV) of VarDict, Mutect2, LoFreq and Strelka2 variant callers in tumor-normal paired mode with parameter default values on simulated data samples with spiked-in somatic SNVs only.

Table 4 .
Performance of variant callers in calling SNVs using parameter default values in matched tumor-normal paired mode on SNVs spiked-in data.

Supplementary Figure 9. Variant caller performance analysis with low-fraction spiked-in somatic SNVs and INDELs, using tumor-normal paired mode and parameter default values
. Plots depict the recall (i.e., Sensitivity, or TPR) and precision (i.e., PPV) of VarDict, Mutect2, LoFreq and Strelka2 variant callers in tumor-normal paired mode with parameter default values on simulated data samples with spiked-in somatic SNVs and INDELs (max length 90 bp).Performances of the variant callers are reported for SNVs (a), INDELs (b), or both SNVs and INDELs (c).

Table 6 .
Performance of variant callers in calling INDELs (max length 90 bp) using parameter default values in matched tumor-normal paired mode on spiked-in data containing both SNVs and INDELs (max length 90 bp).

Table 7 .
Performance of variant callers in calling both SNVs and INDELs (max length 90 bp) using parameter default values in matched tumor-normal paired mode on spiked-in data containing both SNVs and INDELs (max length 90 bp).