Skip to main content
  • Methodology article
  • Open access
  • Published:

Benchmarking workflows to assess performance and suitability of germline variant calling pipelines in clinical diagnostic assays



Benchmarking the performance of complex analytical pipelines is an essential part of developing Lab Developed Tests (LDT). Reference samples and benchmark calls published by Genome in a Bottle (GIAB) consortium have enabled the evaluation of analytical methods. The performance of such methods is not uniform across the different genomic regions of interest and variant types. Several benchmarking methods such as, vcfeval, and vcflib are available to assess the analytical performance characteristics of variant calling algorithms. However, assessing the performance characteristics of an overall LDT assay still requires stringing together several such methods and experienced bioinformaticians to interpret the results. In addition, these methods are dependent on the hardware, operating system and other software libraries, making it impossible to reliably repeat the analytical assessment, when any of the underlying dependencies change in the assay. Here we present a scalable and reproducible, cloud-based benchmarking workflow that is independent of the laboratory and the technician executing the workflow, or the underlying compute hardware used to rapidly and continually assess the performance of LDT assays, across their regions of interest and reportable range, using a broad set of benchmarking samples.


The benchmarking workflow was used to evaluate the performance characteristics for secondary analysis pipelines commonly used by Clinical Genomics laboratories in their LDT assays such as the GATK HaplotypeCaller v3.7 and the SpeedSeq workflow based on FreeBayes v0.9.10. Five reference sample truth sets generated by Genome in a Bottle (GIAB) consortium, six samples from the Personal Genome Project (PGP) and several samples with validated clinically relevant variants from the Centers for Disease Control were used in this work. The performance characteristics were evaluated and compared for multiple reportable ranges, such as whole exome and the clinical exome.


We have implemented a benchmarking workflow for clinical diagnostic laboratories that generates metrics such as specificity, precision and sensitivity for germline SNPs and InDels within a reportable range using whole exome or genome sequencing data. Combining these benchmarking results with validation using known variants of clinical significance in publicly available cell lines, we were able to establish the performance of variant calling pipelines in a clinical setting.


Next Generation Sequencing (NGS) and analytical methods developed to detect various forms of disease-causing polymorphisms are now routinely being used by clinical laboratories to determine the molecular etiology of complex diseases or disorders and in many cases to make critical treatment course decisions. In the past two decades, many polymorphisms in the human genome have been identified and validated that serve as predictive, diagnostic, and prognostic markers for complex inherited diseases. These genomic disease markers can be of different forms, such as Single Nucleotide Variants (SNVs), small INsertions and DELetions (InDels), large deletions and duplications, and Copy Number Variations (CNVs), and can vary in size from a single base change to several Mega Bases (MB) in length and even whole chromosomal polysomies. Clinically relevant polymorphisms occur in the different regions of the genome, including exonic, splice-sites, and deep-intronic regions. These polymorphisms also happen in various forms, including single base changes within high entropic regions, copy number changes to homopolymer repeats and copy number changes to Short Tandem Repeat (STR) regions. NGS platforms with disparate sequencing chemistries and signal processing methods used to detect these polymorphisms also operate under various error modes; hence, they have very different analytical performances across regions of the genome. Consequently, analytical methods specific to various NGS platforms such as Illumina, Ion Torrent, Pacific Biosciences, and Oxford Nanopore have been developed to both account for and correct the errors particular to these sequencing platforms. A dizzying array of combinations of sequencing platforms and analytical methods are now available to a clinical diagnostic laboratory to develop their LDT assays. Therefore, this presents a challenge to the laboratory staff to determine which combination is the optimal.

To meet this challenge, methods of benchmarking systems and pipelines are used to accurately assess the performance of sequencing platform and analytical method combinations before they are incorporated into a clinical diagnostic assay. Benchmarking starts with a set of data for which the relationship between the input and output is known, so that the sequencing system can be tested to see if, given the same input, it produces the same output, or at least something acceptably close. The Genome In A Bottle (GIAB) consortium hosted by NIST has provided that data for a pilot genome (NA12878/HG001) [1] and for six samples from the Personal Genome Project (PGP) [2]. The established, ground-truth calls for SNVs and small InDels (1–20 base pairs) from these reference samples can be used for optimization, performance estimation, and analytical validation of LDT assays using complex analytical pipelines with multiple methods to detect polymorphisms in the genome. To assist with assessing benchmarking runs, the Global Alliance for Genomics and Health (GA4GH) benchmarking team has developed standardized tools [3] to evaluate the performance metrics of germline variant callers used primarily in research applications.

Results from these types of benchmarking techniques allow a laboratory to demonstrate that its practices meet the exacting standards which certify laboratories for the use of their NGS results in the care of clinical patients. The Clinical Laboratory Improvement Amendments (CLIA) program requires that all laboratories using LDT must establish the test's performance specifications, such as analytical sensitivity, specificity, reportable range, and reference range [4]. The College of American Pathologists (CAP) laboratory standards for NGS based clinical diagnostics [5] not only require the laboratories to assess and document the performance characteristics of all variants within the entire reportable range of LDTs but also obtain the performance characteristics for every type and size of variants that are reported by the assay. Laboratories are also required to assess the performance characteristics for clinically relevant variants, such as \(\Delta F508\) and IVS8-5T [6] mutations in a CFTR assay. The CAP guidelines also require laboratories to periodically (determined by the laboratory) assess and document the analytical performance characteristics to ensure that the LDT is continuing to perform as expected over time.

Benchmarking workflows that are highly scalable, reproducible and capable of reporting the performance characteristics using many reference and clinical samples are needed. In addition, evaluation within multiple highly stratified regions of interest are essential for clinical laboratories to optimize and routinely assess the performance of their LDT assays. Jeremy Leipzig in his comprehensive review of bioinformatics workflows for NGS applications [7] defines a bioinformatics workflow as a structured sequence of methods or scripts that are executed either in sequence or parallel to achieve a complex set of analytical goals that are not feasible by any single program. The individual steps within the workflow help achieve a specific goal by accepting a set of inputs and transforming them to a desired set of outputs, which in-turn serve as inputs to other steps later in the workflow. Benchmarking workflows for NGS based clinical assays typically start with variant calls in form of VCF files as inputs to generate desired assay performance characteristics used during the assay development and validation. In order to accomplish this overall objective, workflows deploy widely-used variant comparison programs, such as vcfeval [8], [9], SURVIVOR [10], SURVIVOR_ant [11] and svclassify [12], which are capable of variant allele normalization, genotype matching, variant classification and breakpoint matching for structural variants. However, due to ambiguity in allelic representations (especially in cases of adjacent SNPs and InDels) and differing normalization methods employed by each of these programs, the computed performance characteristics can vary depending on the program used in the benchmarking workflows. To overcome this challenge, benchmarking workflows can deploy multiple methods in parallel and either report a consensus-based assessment or report all the performance characteristics that are estimated by the individual methods. Thus, the overall value of a benchmarking workflow is not just in the methods that are included but also in the specific way that they are deployed within the workflow and the reliability of the assay’s overall performance characteristics reported by the workflow.

In addition, benchmarking workflows deployed in evaluation of clinical-diagnostic assays should meet minimum precision guidelines for both repeatability and reproducibility [13]. CLIA guidelines have adopted the International Organization for Standardization definitions of reproducibility and repeatability for clinical assays. Repeatability (within-run) is the measure of precision involving assays carried out under the same experimental conditions, such as reagent lots, instrument, laboratory and operator. Reproducibility (intra-run) is the measure of precision involving assays carried out under different conditions, such as reagent, operator, laboratory, etc. In terms of benchmarking workflows, repeatability is defined as the ability of the workflow to reliably generate the same set of performance metrics given the same set of input variant call files. Though achieving high levels of precision for repeatability might seem trivial for software programs, in-practice, it requires a high degree of engineering to achieve. The workflows should be able to track and detect changes to input files (using MD5 checksums), software libraries, underlying operating systems and even hardware architecture. Also, benchmarking workflows, even individual methods used within these workflows, are often not reproducible as they are often custom developed for a laboratory setting, such as hardware configuration, software versions, and in some cases even the method of execution. The level of effort needed to port these pipelines successfully to another laboratory with different operating conditions is often insurmountable, resulting in a plethora of non-standard workflows with no discernable way to compare the results.

Thus, there is a critical need for benchmarking workflows that can meet the high precision requirements for both reproducibility and repeatability of a clinical assay. These workflows should also be highly scalable to meet the growing adoption of NGS based assays in the clinical diagnostic setting.


Our goal was to develop a benchmarking workflow that any clinical laboratory could use to quickly evaluate and compare the performance characteristics of all suitable secondary analysis pipelines such as those employing variant callers. A benchmarking workflow should further help optimize the analytical workflow based on well-defined performance metrics and finally produce a thorough analytical validation report to justify the use of the analytical workflow in their diagnostic assay to regulatory authorities such as CLIA and CAP.

To test the abilities of our benchmarking workflow, we used it to compare two analytical workflows commonly used for germline variant calling: (1) workflow based on Broad Institute’s best practices guidelines using the GATK HaplotypeCaller v3.7 [14] and (2) the SpeedSeq workflow [15] based on FreeBayes v0.9.10 [16] as the primary variant calling engine. The GATK HaplotypeCaller based workflow was chosen over the FreeBayes based workflow as it out-performed in the detection of small-InDels (1–20 base pairs). For reference, the benchmarking results for both the above workflows are available for two GIAB samples (NA24143 and NA24149) in the Additional files 1, 2, 3, 4: Tables S1–S4. In addition, the comparison between the InDel size distribution numbers for the GATK HaplotypeCaller and SpeedSeq workflows for one GIAB sample (NA24631) is presented in the Additional file 5: Table S5.

The performance characteristics of the analytical workflow using GATK v3.7 was further optimized using benchmarking metrics generated from the five GIAB reference samples and four GeT-RM samples (see “Methods” section) with known pathogenic variants. Also, it is critical for the clinical laboratories developing NGS based LDT assays to accurately determine the reportable range to avoid misdiagnosis which would lead to wrong treatment decisions. To this end, we evaluated the performance metrics using the benchmarking workflow in three distinct genomic regions of interest (see “Methods” section for details).

Although we have the benchmarking results for the region including coding exons in all the RefSeq genes, we have omitted those findings in this section and focus only on the clinically relevant regions.

Tables 1 and 2 show the benchmarking metrics for SNPs in all 5 GIAB samples within the clinically relevant genes and whole exome regions, respectively. The precision, recall, and negative percent agreement (NPA) metrics for SNPs are uniform across all the reference samples, and there is no sample bias in results for some of the better-characterized samples such as NA24385 and NA12878. Performance metrics for SNPs within the clinically relevant gene region are significantly better than those within the whole exome region. It is notable that recall metrics are a percentage point better in the clinically pertinent gene region, across all reference samples. This phenomenon is attributable to the fact that many genes have isoforms, resulting in higher alignment errors, and some genes have either very high or very low GC content, resulting in higher than average sequencing errors within these regions of the genome. This finding is of great clinical significance, since the reportable region of most inherited disease or disorder-diagnostic based LDT assays are limited to the clinically relevant genes. Thereby, the overall performance characteristics of the assay is better than that estimated over either the whole genome or whole exome regions.

Table 1 Benchmarking metrics for SNPs within coding exons of ~ 7000 clinically relevant genes (as specified in “Methods” section)
Table 2 Benchmarking metrics for SNPs in whole exome regions, including non-coding exons, splice sites (± 20 bp) and clinically relevant deep intronic regions

Tables 3 and 4 provide the InDel benchmarking metrics for sample NA24385 in the clinically relevant and whole exome regions, respectively. As expected, the benchmarking workflow reveals that the performance metrics for InDels are lower than those for SNPs. However, the stratification by InDel size, helped us determine the reference range for InDels (1–20 base pairs). The recall metric for InDels larger than 20 base pairs is significantly lower than the recall for InDels 1–20 base pairs. As in the case of SNPs, performance metrics for InDel detection within the clinically relevant genes of interest is better than the whole exome region.

Table 3 Benchmarking metrics for InDels of different size ranges in NA24385 (truth set NIST v3.3.2, total bases = 12,626,866) for the regions within ~ 7000 clinically relevant genes (as specified in “Methods” section)
Table 4 Benchmarking metrics on the number of InDels of different size ranges in NA24385 (truth set NIST v3.3, total bases = 65,948,744) for the whole exome regions including non-coding exons, splice sites (± 20 bp) and clinically relevant deep intronic regions

The benchmarking results of the other GIAB reference samples in the clinically relevant and whole exome regions can be obtained in the Additional files 6, 7, 8, 9: Tables S6–S9 and Additional files 10, 11, 12, 13: Tables S10–S13, respectively. The histograms for the InDel size distribution in the NA24385 reference sample for the clinically relevant and whole exome regions respectively are in Additional file 17: Fig. S1. The histograms of InDel size distributions for GIAB samples in both the whole exome and clinically relevant regions are available in the github repository—StanfordBioinformatics/stanford-benchmarking-workflows.

Finally, the benchmarking workflow was able to confirm that the variant calling pipeline can detect all the clinical variants in GeT-RM samples listed in Table 5.

Table 5 Validation of the presence of the truth variants in the GeT-RM samples with their locations specified as GRCh37 coordinates used in our variant calling workflow

To obtain all the metrics produced by and other output files including InDel size distribution plots from our benchmarking workflow for each reference sample, please refer to our GitHub repository.

Additionally, we generated benchmarking metrics and ROC curves for NA24143 using a tool provided by Real Time Genomics (RTG) [8]. The results for benchmarking in the clinically relevant regions and details on the metrics for the other two regions of interest can be found in Additional files 14, 15, 16: Tables S14–S16. Similarly, Additional files 18, 19, 20: Figs. S2–S4.


The GIAB consortium has helped develop standards for genomic data to evaluate the performance of NGS sequencing platforms and analytical methods used for alignment and variant calling. The precisionFDA platform [17] has enabled the genomics community to develop and deploy benchmarking methods that can evaluate the performance of analytical methods against the gold standard datasets, such as ones made available by GIAB. These benchmarking tools, along with accuracy challenges, have led to the development of highly accurate variant calling methods. However, the requirements of a clinical diagnostic laboratory go beyond the simple evaluation of performance characteristics of an analytical pipeline against one or more reference samples. Our purpose was to build a benchmarking workflow to meet the assay optimization and validation needs of a clinical laboratory. The primary benefit of our benchmarking workflow is that it allows for the assay performance to be evaluated using a broad set of both reference samples with a large number of gold-standard variant calls and clinical samples with a small number of clinical variants that are specific to the diagnostic assay being evaluated. The benchmarking workflow enables the clinical laboratories to establish the reporting range of the diagnostic assay by estimating the performance within multiple regions of interest.

The precisionFDA platform has enabled a benchmarking pipeline (vcf comparison framework) which accepts input and truth set variant call files and regions of interest files in BED format to calculate the assay performance metrics across the entire region of interest. The vcf comparison pipeline is only capable of comparing variant calls one sample at a time. The Association for Clinical Genomic Science [18] has published a cloud-based benchmarking pipeline similar to the precisionFDA pipeline. However, this pipeline is more restrictive than the precisionFDA pipeline in that it is capable of evaluating the performance using only one (NA12878) of the several benchmarking samples published by GA4GH. The Seven Bridges platform recently published another cloud-based benchmarking pipeline [19] capable of evaluating the performance characteristics using several GA4GH benchmarking samples and multiple regions of interest at the same time. All the cloud-based benchmarking workflows require the clinical laboratories to upload their sensitive assay results, which in some cases include germline variant calls from WGS assays, to a public cloud platform which may not be HIPAA compliant.

Unlike these web-based benchmarking apps such as those provided by the precisionFDA platform or GA4GH, our benchmarking framework can be seamlessly integrated with any variant calling pipeline in the user's software environment. Thus, our benchmarking workflow enables ease of use and avoids the transfer of sensitive data to different locations.

The emphasis on repeatability and reproducibility by building containerized solutions in clinical settings are starting to emerge in other fields, such as medical imaging and neurosciences [20, 21]. Similarly, the benchmarking workflow is able to achieve high precision for both repeatability and reproducibility by being agnostic to the hardware infrastructure used to execute it. The individual software modules within the workflow are deployed as Docker [22, 23] containers which are self-contained with all the prerequisite software libraries and other dependencies. The Docker images of individual software modules have been published to a public container repository, Docker Hub [24].

The benchmarking workflow is distributed using human-readable YAML [25] format, and it can be ported to existing WDL based workflows, which are executed using workflow managers like Cromwell, published by the Broad Institute [26, 27]. Similarly, the workflow YAML files can be ported to the Common Workflow Language (CWL) format [28, 29] to use pipeline execution engines published by the GA4GH [30]. The workflow can be readily deployed by the clinical laboratories either within their on-premises computer infrastructure, private-cloud or any of the available public-cloud platforms, such as the Google Cloud Platform, Amazon Web Services and Microsoft Azure.

Our benchmarking modules integrated with deployment tools, such as Jenkins [31] or CircleCI [32], that work on the principle of continuous integration and continuous delivery/deployment (CI/CD) can provide a fool proof way of examining consistency in results. In this era where workflows generating reproducible results are gaining attention, easy incorporation of workflows with CI/CD tools is a nice feature to have.


Benchmarking variants is a critical part of implementing variant calling pipelines for research or clinical purposes. Here, we have successfully implemented a benchmarking workflow that generates metrics, such as specificity, precision and sensitivity for germline SNPs, and InDels in whole exome sequencing data. Also, InDel size distributions even in the form of histograms are also provided. Moreover, the parameters within each tool and predefined InDel size bins can be easily modified in the benchmarking workflow to suit a laboratory’s requirements. Combining these benchmarking results with validation using known variants of clinical significance in publicly available cell lines, we were able to establish our variant calling pipelines in a clinical setting. Our benchmarking workflow can serve as a plug-in to any existing variant calling pipeline to work as an integrated unit or be used as a separate module as well.

Furthermore, future extensions to the existing benchmarking workflow can be made to accommodate automatic generation of benchmarking metrics and ROC curves from other tools such as RTG for which we have the public Docker image readily available.

A benchmarking workflow similar to the one built in this work for benchmarking short variants can be constructed to perform benchmarking of structural variants (SVs).


Benchmarking workflow

The benchmarking workflow, as illustrated in Fig. 1, is a sequence of steps required to perform a rapid and comprehensive analytical validation of a clinical diagnostic assay based on germline variants. The benchmarking workflow can be easily integrated with any secondary-analysis pipeline used in a diagnostic assay to call germline variants, and the workflow accepts germline variants (SNVs and small InDels) in Variant Call Format VCF v4.1 [33] or higher. The workflow takes one or more stratification files specifying the regions of interest in BED [34] format and generates a comprehensive analytical validation report detailing the performance characteristics of the assay within each of the specified regions of interest. The benchmark variant calls that are considered as ground truths for each of the reference samples used to evaluate the analytical performance can also be specified in VCF format.

Fig. 1
figure 1

Schematic diagram of the benchmarking framework used in this study. The benchmarking workflow written in YAML format consists of four main steps in which the software tool employed in each step is dockerized and indicated within parentheses. The details of the Docker images for these software tools, which are available in Docker Hub, are specified in the “Methods” section. This benchmarking workflow was executed using Loom, an in-house workflow engine. The inputs to the benchmarking workflow are depicted on the left and the outputs are on the right with the red arrows from the outputs to workflow steps indicating the dependency of the output of that step with another step. The steps in the benchmarking workflow are repeated for each stratified region of interest provided

The basic structure of the benchmarking workflow listing all the initial inputs and outputs expected from all the steps before the individual steps are defined is shown in Fig. 2. The name of the workflow is specified first followed by the entire list of variable input files required for the workflow. Next, we listed inputs that are fixed, which can be files or strings required in any step in the workflow. These “fixed_inputs” are those that do not change often even when the workflow is run multiple times, such as the reference fasta file and its corresponding index file in this case. For each “fixed_input”, the sub-field “data” has “contents” which is assigned a value based on “type” provided. Depending on the “type” specified for a certain input, the “contents” value is interpreted differently. For a “string” type, “contents” contains a text value. For a “file” type, “contents” contains a unique identifier such as “[filename]$ [MD5]”. Please note that “fixed_inputs” is just a term we adopted in order to segregate infrequently changed inputs from variable inputs, such as different regions of interest or ground-truth VCF files. The user can modify or remove the files or strings in the fixed input category as required, and it does not affect the functionality or mode of execution of the workflow. Subsequently, we listed the entire list of output files generated by all the steps in the workflow including intermediate files. In this YAML structure of workflow, the “type” of input or output is specified for example as a string or file. A “channel” refers to a designed name (similar to a variable name in a script) for a particular input or output that the workflow manager uses during execution.

Fig. 2
figure 2

Snippet of the benchmarking.yaml depicting the structure of the beginning of the benchmarking workflow before the steps are defined

A typical structure of a step in a workflow is presented in Fig. 3, where the “steps” tag denotes the beginning of all the steps listed in that workflow. A step begins with the name of the step followed by the “inputs” and “outputs” pertaining only to that step. Here, aside from “type” and “channel”, an output has a “source” tag specified in the form of a “key:value” pair where the key is “filename”. The value for “filename” is a string that can be a “channel” from “fixed_inputs” or an aggregation of different “channels” listed in the “fixed_inputs”, which can be optionally concatenated with a user defined string. The “command” tag specifies the command line consisting of the software tool, parameters, and the inputs and outputs to be executed inside the docker container. The “environment” tag contains a “key:value” pair in the form of “docker_image” as the key and the value as the Docker image name required for that step. The Docker image can be pulled from Docker Hub or the Docker container registry in the user’s computing platform. The compute resources to be utilized for a workflow step is provided via the “resources” tag containing two “key:value” pairs where “memory” and “cores” are the keys whose values are strings denoting memory in GB and number of cores respectively.

Fig. 3
figure 3

Snippet of the benchmarking.yaml showing the first step (vcf comparison using of the benchmarking workflow

The first step in the benchmarking process as seen in Fig. 3 involves the comparison of input variants generated by the analytical pipeline with the benchmark variant calls within each region of interest. The variant calls are compared using, which is capable of haplotype construction from individual genotype calls and is recommended by GIAB consortium and GA4GH. The variant comparison step is performed for each stratification or region of interest file specified as input, and generates a single output VCF file classifying the variant calls defined in the input and truth VCF files as either True Positive (TP), False Positive (FP) or False Negative (FN).

Step two in the benchmarking workflow splits the variant calls annotated using by variant type (SNPs and small InDels) and by variant classification (TP or FP or FN). This step is executed within the workflow for each of the stratification or region of interest files specified. The VCF files are split by variant type using bcftools [35], and a bash script is used to further split the variant calls by the variant classification. This allows the workflow to generate the performance metrics for each of the variant types reported by the diagnostic assay.

Steps three and four of the benchmarking workflow (see Fig. 1) were used to generate a histogram of small InDels by size. The bins used for InDel size histograms were (a) 1–10 base pairs, (b) 11–20 base pairs, (c) 21–50 base pairs, and (d) Greater than 50 base pairs. The R script—indelSizeDistribution_Detailed.R (code in Additional file 21: File 1) then calculates the performance metrics of the assay for each of the InDel size bins. The Python script— (Additional file 22: File 2) consolidates the benchmarking metrics previously obtained and calculates the NPA related metrics combining some of the bin size ranges (user preferred) for all reference samples provided. The details on the usage of the above script are in the associated README file available in our public repository.

In addition to benchmarking truth sets for well-characterized reference samples published by the GIAB consortium, the benchmarking workflow allows clinical laboratories to specify additional samples with clinically relevant variants as ground-truths to estimate the analytical performance of the assay for specific variant types, such as \(\Delta F508\) and IVS8-5 T in CFTR panels. The Python script— (Additional file 23: File 3) accepts the ground-truth variant call sets to confirm the presence or absence of these variants in the VCF files generated by the variant calling pipeline.

Finally, the benchmarking workflow generates a comprehensive analytical validation report using all the provided benchmarking ground-truth call sets—Final_benchmarking_metrics.txt (Additional file 24: File 4).

Scalability and reproducibility of benchmarking workflow

The benchmarking workflow is designed to be repeatable and reproducible by using Docker containers for all software and bioinformatics components used within the workflow (see Table 6). The workflow is distributed in a human-readable data serialization format YAML v1.2, and the workflow can be readily executed using the workflow execution manager—Loom (0.5.3-prerelease-rc10) [36]. The workflow definition file—Benchmarking.yaml (see our GitHub repository) can also be easily ported to Common Workflow Language (CWL) or Workflow Definition Language (WDL) formats and executed using workflow execution managers, such as Toil [37, 38] and Cromwell.

Table 6 List of software components utilized in the benchmarking workflow with their software dependencies, settings and Docker image names (as available in the Docker Hub repository) if applicable

Golden/ground-truth call sets

The golden/ground-truth sets for five reference and PGP genomes are currently available—NA12878 (CEPH family's daughter), NA24143 (AJ mother), NA24149 (AJ father), NA24385 (AJ son) and NA24631 (Chinese son), and these reference call sets were used in this benchmarking study. GIAB provides a high confidence regions file and a high confidence VCF file, and as recommended by GIAB, only the high confidence calls were used in the evaluation of the assay's performance characteristics. The NIST versions and their corresponding FTP site locations used for the above samples in this study can be found in the Additional file 25.

In addition to the GIAB reference samples, samples with known pathogenic germline variants (see Table 5) for various inherited diseases or disorders were chosen from the Genetic Testing Reference Materials Coordination Program (GeT-RM) [39,40,41,42,43].

Stratification or regions of interest (ROI) BED files.

Three stratification files were used to evaluate the performance characteristics of an inherited Whole Exome Sequencing (WES) assay.

  1. 1.

    Coding Exons for all known transcripts in RefSeq genes: RefSeq gene names, transcripts, and coordinates of all coding exons were obtained from the UCSC genome browser [44,45,46].

  2. 2.

    Clinically relevant regions of the human genome: Clinically relevant regions were determined by intersecting coordinates of all known pathogenic variants reported in OMIM [47], ClinVar [48] and DECIPHER v9.28 [49] with all the exon regions (coding and non-coding) file for RefSeq genes obtained from the UCSC genome browser. The exonic coordinates were later extended by 20 base pairs on either end to include canonical and non-canonical splice sites. Deep-intronic regions with pathogenic variants were added to the exonic regions to generate the final clinically relevant regions (BED) file.

  3. 3.

    Whole Exome regions file for RefSeq genes: Whole Exome regions file was obtained from the UCSC genome browser. The exon regions were extended by 20 base pairs on either end to include splice sites.

Benchmarking metrics

Precision and recall are benchmarking metrics provided as output by The true positives (TP), false positives (FP), and false negatives (FN) are counted as described by the developers of Again, as explained by the authors of, precision and recall are calculated using the below formulae:

$$\begin{aligned} Precision \, & = \, True \, \,Positives/\left( {True\, \, Positives \, + \, False\, \, Positives} \right) \\ Recall \, & = \, True \, \,Positives/\left( {True \, \,Positives \, + \, False \, \,Negatives} \right) \\ \end{aligned}$$

Other metrics reported by, such as variants outside the high confidence truth set regions and transition or transversion SNP type, can be found in the extended.csv files included in the vcfComparison directories for each GIAB sample in our GitHub repository..

The total number of bases per sample in a particular region of interest as specified by the corresponding bed file was computed using a bash command provided in the Additional file 25.

True negatives (TN) and Total Negatives are computed using the following:

$$\begin{aligned} & TN \, = \, Total\, \, number\, \, of \, \,bases \, \,in \, \,the\, \, region \, \,of\, \, interest \, - \, \left( {True \, \,Positives \, + \, False\, \, Positives \, + \, False \, \,Negatives} \right) \\ & Total \, Negatives \, = \, True \, \,Negatives \, + \, False \, Positives \\ \end{aligned}$$

The Negative Percentage Agreement (NPA) or specificity as recommended by the FDA [50] is calculated using the following:

$${\text{NPA }} = True \, \,Negatives/Total \, \,Negatives.$$

Generation of ROC curves outside of benchmarking workflow

There is an option for the user to generate benchmarking metrics and corresponding ROC curves using another popular vcf comparison tool called vcfeval in RTG-core’s suite of tools [51]. However, this option is currently not available as part of the validated benchmarking workflow. The ROC plots were obtained using the dockerized version of RTG-core, available in Docker Hub as vandhanak/rtg-core:3.11. The input sample VCF used contained variants called by the germline variant caller GATK HaplotypeCaller v3.7. The default value, genotype quality (GQ), for the parameter vcf-score-field was used. Further, the parameter evaluation-regions was set to the truth bed file corresponding to the region of interest for the GIAB sample. ROC plots were generated using the rocplot tool within RTG. In order to demonstrate this functionality, one GIAB sample, NA24143 was used for the benchmarking on the three stratified regions of interest utilizing the corresponding ground-truth sets.

Availability of data and materials

The benchmarking workflow (benchmarking.yaml), analytical methods or scripts, and datasets generated and analyzed during this study are available in the GitHub repository—

Availability and requirements

Project Name: Clinical benchmarking workflow

Project home page:

Operating system: Platform independent

Programming language: Python 2.x

Other requirements: Docker, R

License: MIT license

Any restrictions to use by non-academics: None



National Institute of Standards and Technology


Genome In A Bottle consortium


Cystic Fibrosis Transmembrane conductance Regulator


Single Nucleotide Polymorphisms


Multiple Nucleotide Polymorphisms




Whole Exome Sequencing


Negative Percent Agreement


True Negative


True Positive


False Negative


False Positive


Health Insurance Portability and Accountability Act


Public database containing the human genes, their genetic phenotypes, and their associations with genetic disorders (Online Mendelian Inheritance in Man)


Public database with genotypic and phenotypic data from ~ 30,000 individuals


Public database with information on the relationship between medically important variants and phenotypes.


Workflow Definition Language


Common Workflow Language


  1. Zook JM, Chapman B, Wang J, Mittelman D, Hofmann O, Hide W, Salit M. Integrating human sequence data sets provides a resource of benchmark SNP and indel genotype calls. Nat Biotechnol. 2014;32(3):246–51.

    Article  CAS  Google Scholar 

  2. Zook JM, Catoe D, McDaniel J, Vang L, Spies N, Sidow A, Weng Z, Liu Y, Mason CE, Alexander N, et al. Extensive sequencing of seven human genomes to characterize benchmark reference materials. Sci Data. 2016;3:160025.

    Article  CAS  Google Scholar 

  3. Krusche P, Trigg L, Boutros PC, Mason CE, De La Vega FM, Moore BL, Gonzalez-Porta M, Eberle MA, Tezak Z, Lababidi S, et al. Author correction: Best practices for benchmarking germline small-variant calls in human genomes. Nat Biotechnol. 2019;37(5):567.

    Article  CAS  Google Scholar 

  4. Jennings L, Van Deerlin VM, Gulley ML. Committee CoAPMPR: recommended principles and practices for validating clinical molecular pathology tests. Arch Pathol Lab Med. 2009;133(5):743–55.

    Article  Google Scholar 

  5. Aziz N, Zhao Q, Bry L, Driscoll D, Funke B, Gibson J, Grody W, Hegde M, Hoeltge G, Leonard D, et al. College of American Pathologists Laboratory standards for next-generation sequencing clinical tests. Arch Pathol Lab Med. 2015;139(4):481–93.

    Article  Google Scholar 

  6. Watson M, Cutting G, Desnick R, Driscoll D, Klinger K, Mennuti M, Palomaki G, Popovich B, Pratt V, Rohlfs E, et al. Cystic fibrosis population carrier screening: 2004 revision of the American College of Medical Genetics mutation panel. Genet Med. 2004;6(5):387–91.

    Article  Google Scholar 

  7. Leipzig J. A review of bioinformatic pipeline frameworks. Brief Bioinform. 2017;18(3):530–6.

    PubMed  Google Scholar 

  8. Cleary JG, Braithwaite R, Gaastr K, Hilbush BS, Inglis S, Irvine SA, Jackson A, Littin R, Rathod M, Ware D et al. Comparing variant call files for performance benchmarking of next-generation sequencing variant calling pipelines. BioRxiv; 2015.

  9. Haplotype VCF Comparison Tools.

  10. Jeffares DC, Jolly C, Hoti M, Speed D, Shaw L, Rallis C, Balloux F, Dessimoz C, Bähler J, Sedlazeck FJ. Transient structural variations have strong effects on quantitative traits and reproductive isolation in fission yeast. Nat Commun. 2017;8:14061.

    Article  CAS  Google Scholar 

  11. Sedlazeck FJ, Dhroso A, Bodian DL, Paschall J, Hermes F, Zook JM. Tools for annotation and comparison of structural variation. F1000Res. 2017;6:1795.

    Article  Google Scholar 

  12. Parikh H, Mohiyuddin M, Lam HY, Iyer H, Chen D, Pratt M, Bartha G, Spies N, Losert W, Zook JM, et al. svclassify: a method to establish benchmark structural variant calls. BMC Genomics. 2016;17:64.

    Article  Google Scholar 

  13. Plesser HE: Reproducibility vs. Replicability: A Brief History of a Confused Terminology. Front Neuroinform 2017, 11:76.

  14. GATK Workflows.

  15. Chiang C, Layer RM, Faust GG, Lindberg MR, Rose DB, Garrison EP, Marth GT, Quinlan AR, Hall IM. SpeedSeq: ultra-fast personal genome analysis and interpretation. Nat Methods. 2015;12(10):966–8.

    Article  CAS  Google Scholar 

  16. Garrison E, Marth G: Haplotype-based variant detection from short-read sequencing. arXiv:1207.3907v2 [q-bio.GN]; 20 July 2012.

  17. PrecisionFDA.

  18. Association for Clinical Genomic Science.

  19. VCF Benchmarking Workflow.

  20. Matelsky J, Kiar G, Johnson E, Rivera C, Toma M, Gray-Roncal W. Container-based clinical solutions for portable and reproducible image analysis. J Digit Imaging. 2018;31(3):315–20.

    Article  Google Scholar 

  21. Kiar G, Brown ST, Glatard T, Evans AC. A serverless tool for platform agnostic computational experiment management. Front Neuroinform. 2019;13:12.

    Article  Google Scholar 

  22. Docker.

  23. Merkel D: Docker: lightweight Linux containers for consistent development and deployment. Linux J 2014, 2014(239):Article 2.

  24. Docker Hub.

  25. YAML.

  26. OpenWDL.

  27. Cromwell.

  28. Amstutz P, Crusoe MR, Tijanić N, Chapman B, Chilton J, Heuer M, Kartashov A, Leehr D, Ménager H, Nedeljkovich M et al: Common Workflow Language, v1.0. Specification, Common Workflow Language working group. In. Edited by Peter Amstutz MRC, Nebojša Tijanić; 2016.

  29. Common Workflow Language (CWL). In.: Software Freedom Conservancy, Inc.

  30. O'Connor BD, Yuen D, Chung V, Duncan AG, Liu XK, Patricia J, Paten B, Stein L, Ferretti V: The Dockstore: enabling modular, community-focused sharing of Docker-based genomics tools and workflows. F1000Res 2017, 6:52.

  31. Jenkins.

  32. CircleCI.

  33. Variant Call Format.

  34. BED format.

  35. BCFtools.

  36. Hammond N, Liao I: Loom: platform-independent tool to create, execute, track, and share workflows; 2017.

  37. Toil.

  38. Vivian J, Rao AA, Nothaft FA, Ketchum C, Armstrong J, Novak A, Pfeil J, Narkizian J, Deran AD, Musselman-Brown A, et al. Toil enables reproducible, open source, big biomedical data analyses. Nat Biotechnol. 2017;35(4):314–6.

    Article  CAS  Google Scholar 

  39. GeT-RM Home.

  40. GeT-RM NA04408.

  41. GeT-RM NA14090.

  42. Get-RM NA14170.

  43. GeT-RM NA16658.

  44. Fujita PA, Rhead B, Zweig AS, Hinrichs AS, Karolchik D, Cline MS, Goldman M, Barber GP, Clawson H, Coelho A et al: The UCSC Genome browser database: update 2011. Nucl Acids Res 2011, 39(Database issue):D876–882.

  45. Karolchik D, Hinrichs AS, Furey TS, Roskin KM, Sugnet CW, Haussler D, Kent WJ: The UCSC Table Browser data retrieval tool. Nucleic Acids Res 2004, 32(Database issue):D493–496.

  46. Karolchik D, Hinrichs AS, Kent WJ: The UCSC Genome Browser. Curr Protoc Bioinformatics 2009, Chapter 1:Unit1.4.

  47. Online Mendelian Inheritance in Man, OMIM®. 2017 edn: McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University (Baltimore, MD).

  48. Landrum MJ, Lee JM, Benson M, Brown G, Chao C, Chitipiralla S, Gu B, Hart J, Hoffman D, Hoover J, et al. ClinVar: public archive of interpretations of clinically relevant variants. Nucl Acids Res. 2016;44(D1):D862-868.

    Article  CAS  Google Scholar 

  49. Firth HV, Richards SM, Bevan AP, Clayton S, Corpas M, Rajan D, Van Vooren S, Moreau Y, Pettett RM, Carter NP. DECIPHER: database of chromosomal imbalance and phenotype in humans using Ensembl resources. Am J Hum Genet. 2009;84(4):524–33.

    Article  CAS  Google Scholar 

  50. Administration USFaD: Guidance for industry and FDA staff: statistical guidance on reporting results from studies evaluating diagnostic tests.

  51. RTG Core.

Download references


We thank Amin Zia for providing useful information during the initial phase of the benchmarking work. We thank Nathan Hammond and Isaac Liao for the development of the in-house workflow engine “Loom” which was used to run the variant calling pipelines and the subsequent benchmarking workflows. We thank Keith Bettinger for the help in reviewing the manuscript. We also thank Chittaranjan Muthumalai for leading the automated benchmarking workflow testing efforts and Jason Merker for useful discussions in terms of clinical relevance during the benchmarking process. This study makes use of data generated by the DECIPHER community. A full list of centers who contributed to the generation of the data is available from and via email from Funding for the project was provided by the Wellcome Trust.


This work was funded by Stanford Health Care, Stanford Children’s Health and Stanford School of Medicine.

Author information

Authors and Affiliations



VK designed and implemented the benchmarking workflow. SU and VK wrote the manuscript. ZN implemented the script to generate the performance assay report including the clinical variant validation. SU, SD, MS and EA conceived, designed and supervised the overall study. All authors read and approved the final manuscript.

Corresponding authors

Correspondence to Sowmithri Utiramerur or Euan A. Ashley.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary Information

Additional file 1: Table S1

. Benchmarking metrics on SNPs and multiple nucleotide polymorphisms (MNPs) in NA24149 (truth set NIST v3.3) for the RefSeq coding exons regions generated for both the GATK and SpeedSeq pipelines that were executed using workflows run by Loom (in-house workflow engine).

Additional file 2: Table S2

. Benchmarking metrics on InDels in NA24149 (truth set NIST v3.3) for the RefSeq coding exons regions generated for both the GATK and SpeedSeq pipelines that were executed using workflows run by Loom (in-house workflow engine).

Additional file 3: Table S3

. Benchmarking metrics on SNPs and multiple nucleotide polymorphisms (MNPs) in NA24143 (truth set NIST v3.3) for the RefSeq coding exons regions generated for both the GATK and SpeedSeq pipelines that were executed using workflows run by Loom (in-house workflow engine).

Additional file 4: Table S4

. Benchmarking metrics on InDels in NA24143 (truth set NIST v3.3) for the RefSeq coding exons regions generated for both the GATK and SpeedSeq pipelines that were executed using workflows run by Loom (in-house workflow engine).

Additional file 5: Table S5

. Benchmarking metrics on InDel size distribution in NA24631 (truth set NIST v3.3.2) for whole exome regions, including non-coding exons, splice sites (+/- 2 bp) and clinically relevant deep intronic regions intersected with clinical exome to assess performance of GATK and SpeedSeq pipelines.

Additional file 6: Table S6

. Benchmarking metrics for InDels of different size ranges in NA12878 (truth set NIST v3.3, total bases = 13728555) for the regions within ~7000 clinically relevant genes (as specified in Methods).

Additional file 7: Table S7

. Benchmarking metrics for InDels of different size ranges in NA24143 (truth set NIST v3.3, total bases = 12549224) for the regions within ~7000 clinically relevant genes (as specified in Methods).

Additional file 8: Table S8

. Benchmarking metrics for InDels of different size ranges in NA24149 (truth set NIST v3.3, total bases = 12538042) for the regions within ~7000 clinically relevant genes (as specified in Methods).

Additional file 9: Table S9

. Benchmarking metrics for InDels of different size ranges in NA24631 (truth set NIST v3.3, total bases = 12808688) for the regions within ~7000 clinically relevant genes (as specified in Methods).

Additional file 10: Table S10

. Benchmarking metrics on the number of InDels of different size ranges in NA12878 (truth set NIST v3.3, total bases = 71152019) for the whole exome regions including non–coding exons, splice sites (+/- 20 bp) and clinically relevant deep intronic regions.

Additional file 11: Table S11

. Benchmarking metrics on the number of InDels of different size ranges in NA24143 (truth set NIST v3.3, total bases = 65657646) for the whole exome regions including non-coding exons, splice sites (+/- 20 bp) and clinically relevant deep intronic regions.

Additional file 12: Table S12

. Benchmarking metrics on the number of InDels of different size ranges in NA24149 (truth set NIST v3.3, total bases = 65597266) for the whole exome regions including non-coding exons, splice sites (+/- 20 bp) and clinically relevant deep intronic regions.

Additional file 13: Table S13

. Benchmarking metrics on the number of InDels of different size ranges in NA24631 (truth set NIST v3.3, total bases = 65657646) for the whole exome regions including non-coding exons, splice sites (+/- 20 bp) and clinically relevant deep intronic regions.

Additional file 14: Table S14

. Benchmarking metrics for NA24143 (SNPs and InDels, truth set NIST v3.3) within coding exons of ~7000 clinically relevant genes (as specified in Methods) using RTG vcfeval.

Additional file 15: Table S15

. Benchmarking metrics for NA24143 (SNPs and InDels, truth set NIST v3.3) in whole exome regions, including non-coding exons, splice sites (+/- 20 bp) and clinically relevant deep intronic regions using RTG vcfeval.

Additional file 16: Table S16

. Benchmarking metrics for NA24143 (SNPs and InDels, truth set NIST v3.3) in the RefSeq coding exon regions using RTG vcfeval.

Additional file 17: Fig S1

. InDel size distribution histograms for NA24385 as generated by the benchmarking workflow for the coding exons of ~7000 clinically relevant genes and whole exome regions (as specified in Methods).

Additional file 18: Fig S2

. ROC curves for NA24143 within coding exons of ~7000 clinically relevant genes using RTG rocplot with metrics obtained from RTG vcfeval.

Additional file 19: Fig S3

. ROC curves for NA24143 in the whole exome regions (as specified in Methods) using RTG rocplot with metrics obtained from RTG vcfeval.

Additional file 20: Fig S4

. ROC curves for NA24143 in the coding exon regions (as specified in Methods) using RTG rocplot with metrics obtained from RTG vcfeval.

Additional file 21: File 1

. indelSizeDistribution_Detailed.R.

Additional file 22: File 2


Additional file 23: File 3.

Additional file 24: File 4

. Final_benchmarking_metrics.txt.

Additional file 25

. Supplementary information.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit The Creative Commons Public Domain Dedication waiver ( applies to the data made available in this article, unless otherwise stated in a credit line to the data.

Reprints and permissions

About this article

Check for updates. Verify currency and authenticity via CrossMark

Cite this article

Krishnan, V., Utiramerur, S., Ng, Z. et al. Benchmarking workflows to assess performance and suitability of germline variant calling pipelines in clinical diagnostic assays. BMC Bioinformatics 22, 85 (2021).

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: