VariFAST: a variant filter by automated scoring based on tagged-signatures

Background Variant calling and refinement from whole genome/exome sequencing data is a fundamental task for genomics studies. Due to the limited accuracy of NGS sequencing and variant callers, IGV-based manual review is required for further false positive variant filtering, which costs massive labor and time, and results in high inter- and intra-lab variability. Results To overcome the limitation of manual review, we developed a novel approach for Variant Filter by Automated Scoring based on Tagged-signature (VariFAST), and also provided a pipeline integrating GATK Best Practices with VariFAST, which can be easily used for high quality variants detection from raw data. Using the bam and vcf files, VariFAST calculates a v-score by sum of weighted metrics causing false positive variations, and marks tags in the manner of keeping high consistency with manual review, for each variant. We validated the performance of VariFAST for germline variant filtering using the benchmark sequencing data from GIAB, and also for somatic variant filtering using sequencing data of both malignant carcinoma and benign adenomas as well. VariFAST also includes a predictive model trained by XGBOOST algorithm for germline variants refinement, which reveals better MCC and AUC than the state-of-the-art VQSR, especially outcompete in INDEL variant filtering. Conclusion VariFAST can assist researchers efficiently and conveniently to filter the false positive variants, including both germline and somatic ones, in NGS data analysis. The VariFAST source code and the pipeline integrating with GATK Best Practices are available at https://github.com/bioxsjtu/VariFAST.


Background
With the rapid development of sequencing technologies and the decrease of the costs, a tremendous amount of genome-wide data, including whole exome and genome sequencing data, expression profile data, single nucleotide polymorphism (SNP) and copy number spectrum, as well as functional experimental data, are available for biomedical researches. Different sequencing workflows and platforms have distinctive advantages, e.g. there are many popular variations calling tools, such as Mutect2 [1], SomaticSniper [2], Strelka [3], VarScan2 [4] and HaplotyperCaller [5]. However, false positive alterations still often survive in the final results. In order to acquire high-quality mutations from raw data, the automated pipelines were used to identify and wipe out many false positive calls resulting from sequencing errors, misalignment of reads, and other factors [6,7]. However, to avoid being misled by inaccurate detection of variants, additional manual refinement of alterations is crucial for minimizing false positives and determining candidate variations for further disease studies.
As for germline mutations, it typically involves employing Variant Quality Score Recalibration (VQSR) to produce callsets ready for downstream genetic analysis via using resources of known variation, truthsets and other metadata to assess and improve the accuracy of the results. However, VQSR requires a large scale of samples and tends to work well enough with at least one whole genome or 30 exomes. Anything smaller than that scale is likely to run into difficulties. As for somatic mutations, Fil-terMutectCalls is recommended to filter based on sequence context artifacts. However, additional filtering, for instance, manually reviewing is also necessary for further studies, which extends from deciphering call record annotations to the nitty-gritty of reviewing read alignments using a visualizer.
The Integrative Genomics Viewer (IGV), a lightweight visualization tool that enables intuitive real-time exploration of variants, were developed to handle various types of data [8,9]. Moreover, minimizing false positive alterations via IGV also becomes a traditional method to manually examine sequencing data. Nonetheless, it is obviously not an effective method and has a personal preference. It is necessary to develop more efficient methods and tools for such purposes. In a recent publication, an SOP has been put forwarded, which includes the summarized 19 factors accounting for false positive variants, and provides guideline for variants refinement [10]. In addition, there are benchmark sequencing data from GIAB (the Genome in a Bottle Consortium) for the CEPH/HapMap genome HG001 (NA12878) have been widely used to develop, optimize, and demonstrate the performances of sequencing and bioinformatics methods [11]. Therefore, high-confidence calls of GIAB datasets also can be used for evaluating the reliability and accuracy of variant refinement methods.
In this study, we developed a novel approach for automated filtering false positive variations, called VariFAST, which is based on both weighted score and machine learning model. By using the bam and vcf files, Vari-FAST calculates a v-score by sum of weighted metrics causing false positive variations, and marks tags according to the SOP [10], in the manner of keeping high consistency with existing knowledge, for each variant. We validated the performance of VariFAST for germline variant filtering using the benchmark sequencing data from GIAB, and also for somatic variant filtering using sequencing data of penile squamous cell carcinoma and pituitary adenomas by our Lab. This pipeline of variant refinement is substantially useful for researchers to get a more comprehensive understanding of the pathogenic mechanism of diseases.

Materials and methods
Overall strategy of VariFAST approach for automated variant refinement Figure 1 illustrates the strategy and procedure of Vari-FAST approach. VariFAST contains four major modules: metric calculating, tag-marking, v-score evaluating, and model training.
The Basic metrics contains h, e, r, ri, ao, si, dir, sse, mv, lm, lcr, vafr. The metrics in brackets are used for somatic variants filtering. The input and output files were highlighted in grey.
The first step is metric calculating, in which all reads are stored in the list according to their original position on chromosomes. All mismatches around the variant locus are considered for further metric calculating. For germline variants, 16 metrics are calculated based on reads list. For somatic variants, it will be divided into two partition. For normal track, potential germline variants on the locus are found by v-score and two additional metrics are computed. Then, reads list together with potential germline variants are used to calculate metrics on cancer track. Finally, all 16 metrics for germline variants and 18 metrics for somatic variants are used for tag-marking, v-score evaluating and model training, each step will be explained in more details (see 2.4, 2.5, 2.6).
We implemented the VariFAST approach with python and developed an easy-to-use tool for users to filter variants. We also provide a pipeline integrating GATK Best Practices Pipeline with VariFAST, which can be convenient for users to get high quality variants from raw sequencing data more efficiently.
The detailed flowchart is shown in Additional file 1: Figure S1, and this pipeline can also be downloaded from Github.

Definition and calculation of metrics for variant filtering
We proposed 16 quantitative metrics for germline variant filtering and 18 for somatic variant filtering generally based on SOP of IGV manual review [10], as follows: Where d is the total read counts covering the specific locus. According to different sequencing data, thr represents the users' requirement for minimal read counts. Lcr ranges from 0 to 1, while 0 means the variant passes the quality control of low coverage.

Variant allele frequency risk
Where d m is the depth of the variant.

Near Head Rate
d h is the number of supported reads whose distance between read's head and variant locus is under threshold (default 5).

Near End Rate
d e is the number of supported reads whose distance between read's end and variant locus is under threshold (default 5).
d l is the number of variants supported reads whose mapping quality is under threshold (default 10).

Multiple Mismatched Rate
Where T is the set of mismatches whose frequency are under threshold (default 0.05), v i represents the depth of mismatch i and n means the number of nucleotides within 10 bp upstream and 10 bp downstream. 11. Multiple Variants Rate Where d mv is the number of alleles at the same position.
Notably, for somatic variants, the program will identify potential germline variants first via using normal track according to v-score (see 2. 3) and then count the d mv according to the tumor track, excluding germline variant with same position detected before.

High Discrepancy Region
HR ðHR ¼ 2h d m þd i Þ is calculated for all mismatches located on same reads with candidate variant, where d i the depth of mismatch and h is the counts of mismatch and candidate variant occurring on same read. HR max is the maximum of all HR. Particularly, the value of hdr might be greater than 1 as more than one mismatch appeared high concordant with candidate variant (default HR ≥ 0.9). 13. Short Inserts Rate Tag Short Inserts (SI) is frequently in data derived from archival material (FFPE) or other source material with small DNA fragment [12], where almost all variants appeared in the overlapping region of the two read fragments. Metric si is set to catch this tag, where d si is the number of variants appeared in the overlapping region of both read fragments and d p is the counts of the paired supported reads. 14. Repeat r: when candidate variant is near repeat region, r will be 1. 15. Repeat Insert ri: when insertion is near repeat region, and insert nucleotides are the same as the minimum repeat unit, ri will be 1. 16. Large Insert li: when the length of insert fragment reach threshold (default 20) and nucleotides of the fragments are the same as the near nucleotides of reference, li will be 1.
As for somatic variant, there are two additional metrics.

Normal Variant allele frequency for somatic variation
Where d n is the depth of variants in normal track, and d normal is total read counts covering the locus in normal track. 18. Low normal coverage risk for somatic variation Where thr n represents the users' requirement for minimal read counts in normal track.
All metrics mentioned above ranged from 0 to 1 except mv, hdr and mm.

Quantification of the variant score by sum of weighted metrics
We proposed an indicatrix named variant score (v-score) to comprehensively consider the effects of all metrics. The specific formula for calculating v-score is as follow, where w means the weight of each metric, and x is the value of metric calculated above.
v ¼ On the basis of prior knowledge [10], the 18 metrics are partitioned into three importance levels with different weights. The most important metrics including lcr, vafr, lm, ni and nd, are regarded as level3, and the weights are assigned as 3. The level2 including lncr, mv and hdr are also crucial for false variant filtering, whose weight values are 2. The remaining metrics are regarded as level1, which may not be dominant metrics for refinement. In addition, each metric's weight is not fixed and can be changed to fulfill different kinds of applications.  [10]. Whereas in this study, we combined the DN, TR, MN together as a new tag Repeat Region (RR), SIO and SI are combined as tag SI. AI is divided into two new tags Neat Insertion (NI) and Near Deletion (ND). Two new tags HE (Head) and RI (Repeat Insert) were added, where HE represents nearly all variants are near to head of reads and RI means insertion happening on repeat region and inserted nucleotides are same as the minimum repeat units. To summarize, we proposed a more reasonable and quantitative standard including 19 tags for variant filtering. These tags are automatically marked according to the corresponding metrics defined above, and the standard is shown in Table 1.

Evaluation of variant refinement by v-score
The F β score was calculated as F β ¼ ð1þβ 2 ÞÂPÂR ðβ 2 ÂPÞþR to select the most appropriate v-score. The value of β influence the balance between precision (P) and recall (R), β> 1 demonstrates that recalls plays more important role, while if β< 1, the precision becomes a major effect.

Machine learning model for advanced variant filtering
Supervised machine learning method usually behaves well in a situation with big data, which requires little prior knowledge. XGBoost is an effective and widely used machine learning method, which is based on Gradient Boosting framework [13]. Here, we used XGBOOST to construct a model for false positive variant filtering.
All metrics calculated above were used to train the model with binary-logistic as objective function. The default values were selected for almost all hyperparameters other than learning rate (eta), minimum loss reduction for a further partition (gamma) and the maximum depth of the tree (max_depth). The large gamma makes the model conservative and results in underfitting, oppositely, increasing max_depth may more likely lead to overfitting. Grid search with ten-fold cross-validation was performed for hyperparameters selection while eta was chosen from (0.01; 0.05; 0.1), gamma from (0.1; 1; 10), max_depth from (3; 6; 9).

Sequencing data with high-confidence germline variant annotation
Six datasets sequenced for four samples (HG001, HG002, HG003, HG004) which contain highconfidence benchmark compiled by GIAB were used in this study [11,14,15]. The benchmark set was generated by integrating multi-datasets from 5 different sequencing platforms and could be retrieved from (ftp://ftp-trace.ncbi.nlm.nih.gov/giab/ftp/release/). Our lab also sequenced the HG001 sample using Illumina Xten platform, and generated two exome sequencing datasets HG001_b and HG001_c. Samples HG002, HG003 and HG004 are Ashkenazim Trio from GIAB. The specific information of datasets is shown in Table 2. All sequencing data are mapped using BWA MEM [17,18] and called variants by GATK HaplotypeCaller.

Sequencing data for somatic variant refinement
Sequencing data of two different cancers conducted by our research team were used to show the application of VariFAST for somatic variant refinement. The first one is three paired whole exome datasets sequencing for penile squamous cell carcinoma cases, which were collected in the Affiliated Hospital of Qingdao University. The other one is whole exome datasets for 136 Pituitary adenomas (PAs) published in our previous study [19].

Results
Variant refinement performance on germline variants Summary of variants called from different standard sequencing samples In order to assess whether VariFAST method is able to improve the accuracy of variant refinement, we aligned the sequence data of samples HG001, HG002, HG003, and HG004 by different Illumina sequencers to the reference genome (hg19) using BWA MEM, and used GATK HaplotypeCaller to call the candidate variants. The candidate variants consistent with high confidence variants annotated in GIAB are positive, while others are regarded as negative ( Table 3).

Performance of variant filtering by tags
For all of the 279,573 germline variants, we used Vari-FAST to calculate 16 metrics, and then get the v-score and mark the tags. The distributions of tags showed slightly different patterns among the six standard sequencing datasets (Fig. 2a). While HG001 from 1000 genome has a higher ratio in tags LC and D than other    (Fig. 2b). None of the variants is marked by only SSE. The tags LM (96%), LVF (88%), NI (86%) and ND (96%) are most effective, since almost all of variants marked by them are negative variants without confidence in benchmark. Above half of the variants with tags HDR (55%) and MV (69%) are also false positives out of the benchmark. LC (Low Coverage) may be an impediment to distinguish sequencing artifacts and considerably decline the confidence of a variant, thus, it also should be taken into key consideration, even though the proportion of negative variants marked by tag LC (21%) is not as high as other tags mentioned above. In order to recall these variants, a deeper depth sequencing data should be required. The other tags including RI (29%), RR (25%), SI (25%), AO (33%), MM (26%) and extremely DIR (13%) might not be effective for filtering variant, but they could reduce the confidence of the variant to some extent, which also requires comprehensive assessment.

Performance of variant filtering by v-score
We demonstrated the Recall-Precision diagram using the different v-score threshold with interval 0.1, as shown in Fig. 3a, and found that all samples exhibit similar trends. Making an overall consideration between precision ratio and recall ratio, F β score (see Materials and Methods) was calculated for each dataset with β as 0.3. Figure 3b shows the relationship between F β score and v-score. Interestingly, the F β score of all datasets from the Illumina platform reaches the maximum when vscore is nearly 3.5 (Table 4). We also validated the performance of v-score for different β ranging from 0.1 to 0.5, and found that all F β scores reach the maximum with the v-score from 3 to 4 (Additional file 2: Figure  S2). It suggested that v-score is robust across different samples and choosing a threshold for v-score from 3 to 4 is reasonable. Notably, some complex variants might get high v-score due to the inaccuracy of alignment algorithm and sequencing error, which are still difficult to be detected.

Sanger sequencing validation for refined variants inconsistent with high confidence in benchmark
We selected 15 variants without high confidence in benchmark and with low v-score, including 7 SNVs, 4 insertions and 4 deletions to perform Sanger sequencing validation (Table 5). Almost all variants (14/15) were confirmed as true, which demonstrated that our v-score could efficiently recall a number of variants, although they are out of benchmark. There is only one variant failed in verification, as shown in the IGV visualization in Additional file 3: Figure  S3. Three insert nucleotides were detected in locus 8,374, We also chose other 12 variants with high v-score but within benchmark, which would be filtered by manual review obviously, to perform Sanger sequencing validation. All 12 variants even some with extremely low allele frequency were validated true. There are 5 variants marked by HDR were validated as Multiple Nucleotide Polymorphisms (MNPs). The visualization of MNPs were similar to HDR, whose variants were supported by reads that have other recurrent mismatches across the track (Additional file 4: Figure S4). Thus, v-score is difficult to distinguish these exceptional variants.
In conclusion, variants with low v-score can be regarded as true positive with considerable certainty. Oppositely, a few variants with higher v-score, which were usually also excluded by manual review, would be possibly misjudged. This is also the challenge for most of variant refinement methods.

Performance of variant filtering by machine learning model in VariFAST
All variants (132534) called from three whole exome sequencing datasets (HG002, HG003, and HG004) were used to train the XGBOOST model. The hyperparameters were set as eta:0.1; gamma: 1; max_depth: 6 by ten fold cross validation for optimized grid search. Both v-score and XGBOOST were tested on three independent datasets sequenced by the Illumina platform (HG001_a, HG001_b, and HG001_c). The variants predicted as positive or negative consisted with annotation in benchmark are considered as true positive (TP) or true negative (TN). The variants predicted as positive but without high confidence in benchmark are considered as false positive (FP). Correspondingly, the variants predicted as negative but having high confidence of positive variants are considered as false negative (FN). The ROC curve [20] demonstrated that the XGBOOST model significantly performs better than v-score in all three datasets, as shown in Fig. 4. The AUC values are shown as follow: HG001_a: 0.790 vs 0.712; HG001_b: 0.832 vs 0.766; HG001_c: 0.828 vs 0.766. By using the var.roc function in R with bootstrap method, we got the variance of AUC are 1.66e-5, 7.52e-6, and 7.01e-6 respectively for datasets HG001_a, HG001_b, and HG001_ c, which revealed the robustness of XGBOOST model.
In addition, we also used the datasets HG001_a, HG001_b, and HG001_c as training set and the datasets HG002, HG003, and HG004 as test set, the performance of XGBOOST is also similar.    Fig. 4 ROC curves plotting true-positive rate versus false-positive rate for three independent datasets predicted by XGBOOST model and v-score respectively IGV review, while our pipeline saved massive time and labor. Importantly, the v-score of positive and negative variants groups are significantly different (p-value = 2.2e-16) under rank sum test (Additional file 5: Figure S5).

Variant filtering using VariFAST for pituitary adenomas sequencing data
A large-scale whole exome sequencing data including 136 pituitary adenomas (PAs) cases published in our previous study was used for further evaluation of VariFAST for somatic variant refinement. Aggregately, 108,400 variants are called by GATK and 202 variants have been validated via Sanger sequencing. Figure 6 showed the distribution of v-score for both Sanger validated variants and total variants. The v-scores of almost all (174/202) validated variants are below 4, and the overall distribution of v-score for total variants showed significant difference with the We provided an easy-to-use Python package for Vari-FAST implementation, which saves massive time and labor of manual review. The package uses ray [21], a distributed execution engine, to deal with task-parallel computations. It costs approximate 3 h for dataset HG001_a containing 30,000 germline variants with an average depth of 80, and 3 h for total 108,400 somatic variants from 136 PA cases with average depth of 90, using 64 cores on high performance computer. It can also be adapted to variant filtering of whole genome sequencing data with acceptable time. Our global pipeline  Fig. 6 The v-score distribution of Sanger validated variants and total variants in pituitary adenomas samples. Legend 'total' (red) represents group of total variants, and 'positive' (green) represents group of Sanger validated variants integrating GATK Best Practices with VariFAST enables users to get high quality variants from raw sequencing data in more efficient way.
Comparison of VariFAST with variant quality score recalibration (VQSR) Variant quality score recalibration is the state-of-the-art variant filtering tool involved in GATK, it evaluates the probability that a SNP is a true genetic variant versus a sequencing or data processing artifact based on Gaussian mixture model.  Figure S6). It because VQSR is expecting at least thousands of variant sites in order to achieve decent modeling with the Gaussian mixture model, while VariFAST is stable in the cases of variant filtering for small samples sequencing.
The key characteristic of VariFAST is based on the comprehensive SOP summarized from experience of manual review, which is not affected by the sample size. The other important advantage of VariFAST is that the automated filtering by v-score can be used efficiently for both germline and somatic variants with better interpretability, while VQSR can not be used for somatic variants refinement. V-score is the sum of weighted metrics contributing for false positive variants, and the marked tags have been verified of biological significance accounting for the false positive.

Discussion
Identification of high-quality variants is crucial to explore the in-depth view of genetic causes and find clinical treatment of disease. Although some variant discovery tools have been developed to call variants, such as GATK, there are still many false positive variants remain in final results. In order to reduce false positives, IGV-based manual review is required to filter variants, but it is very timeconsuming and usually has personal preference. Here, we developed an automated approach VariFAST, and provided an easy-to-use tool, to help researchers solve the problem. We also provided a pipeline integrating GATK Best Practices with VariFAST, which can be easily used for high quality variants detection from raw data. Vari-FAST runs dramatically faster than manual review, nearly 3 h for 30,000 germline variants with average coverage of 80 using 64 cores computation, which can free researchers from heavy manual work. VariFAST tool possesses high flexibility, which enables researchers to set different parameters for application in particular studies.
Scoring by quantitative metrics and tags determine the good performance of VariFAST We proposed 18 quantitative metrics based on the SOP for IGV manual review [10], and assigned different weights on metrics from level 1 to 3 according to their importance. Then VariFAST calculates v-score as the sum of weighted metrics, which is robustly from 3 to 4 under the maximum F β score indicating both good precision and recall (Additional file 2: Figure S2). Next, VariFAST marks tags for each variant according to the standard in Table 1, we identified the top effective tags for determining false positive variants, including LM, LVF, NI and ND, LC (Low coverage) should also be taken into key consideration. Correspondingly, the variants having no tag are possibly true positives. In addition, almost all of the positive variants determined by low vscore but without high confidence in benchmark have been validated as true positive by Sanger sequencing validation (Table 5), which indicated that the automated scoring based on quantitative metrics and tags can recall a number of variants that were easily filtered out.
VariFAST is effective for both germline and somatic variants filtering VariFAST has been validated useful for both germline and somatic variants refinement. We performed VariFAST on 279,573 germline variants called from six sequencing datasets of standard samples (NA12878) and discovered that all datasets from Illumina platform have similar optimal v-score thresholds (Fig. 3b) determined by F β , which means v-score is robust in application for different sequencing datasets. The Sanger sequencing validation of HG001 demonstrated high accuracy of VariFAST for positive recall. In addition, two independent cancer datasets were used to validate the performance of VariFAST to refine somatic variants. For penile squamous cell carcinoma sequencing data, we found the result of VariFAST are highly consistent with the manual review (Fig. 5). For sequencing data of 136 pituitary adenomas samples, we showed the v-score distribution of Sanger validated variants is significantly lower than that of the total variants, which means the scoring by VariFAST is effective to distinguish positive and negative variations.

XGBOOST model in VariFAST outperforms state-of-the-art VQSR
VariFAST also provides a XGBOOST model trained using the datasets of Ashkenazim Trio from Illumina platforms for germline variant filtering. XGBOOST model generated an average AUC of 0.82 using three HG001 datasets as test set, which performed better than v-score (Fig. 4). However, XGBOOST model could not be used widely across different types of sequencing platforms due to the bias of training data. Furthermore, compared with v-score, XGBOOST model may have worse interpretability, because no specific tags will be marked to explain why the variant is positive or negative. Therefore, we suggest that the combination of XGBOOST model and v-score should be all taken into consideration for germline variants filtering. By comparison with state-of-the-art variant filtering method VQSR in GATK, XGBOOST model reveals better MCC and AUC, especially more superior for INDEL variants filtering, because it is based on the comprehensive SOP summarized from experience of manual review. Moreover it has no prerequisite of large scale samples and variant sites opposed to VQSR. Although both VQSR and XGBOOST model have limitations in somatic filtering, the v-score of VariFAST has potential to be the dominant tool for discovering true positive somatic mutation for disease study.

Further work
There are still some limitations for VariFAST. Firstly and most importantly, complex variant such as Multiple Nucleotide Polymorphisms, which is similar to HDR, is difficult to detect, so more complex metric may be required to be proposed. Secondly, the weights of metrics are difficult to determine for different tasks due to the absence of gold standard datasets, assigning reasonable weights on each metric by auto-learning will be considered in our further work.

Conclusions
We developed a novel approach VariFAST for high quality variants detection from raw data, which includes the vscore part by sum of weighted metrics causing false positive variations and the predictive model part trained by XGBOOST algorithm for germline variants refinement. VariFAST is an automated and efficient approach for both germline and somatic variant filtering, which is promising to substitute for the laborious IGV manual review.