Evaluating the necessity of PCR duplicate removal from next-generation sequencing data and a comparison of approaches

Background Analyzing next-generation sequencing data is difficult because datasets are large, second generation sequencing platforms have high error rates, and because each position in the target genome (exome, transcriptome, etc.) is sequenced multiple times. Given these challenges, numerous bioinformatic algorithms have been developed to analyze these data. These algorithms aim to find an appropriate balance between data loss, errors, analysis time, and memory footprint. Typical analysis pipelines require multiple steps. If one or more of these steps is unnecessary, it would significantly decrease compute time and data manipulation to remove the step. One step in many pipelines is PCR duplicate removal, where PCR duplicates arise from multiple PCR products from the same template molecule binding on the flowcell. These are often removed because there is concern they can lead to false positive variant calls. Picard (MarkDuplicates) and SAMTools (rmdup) are the two main softwares used for PCR duplicate removal. Results Approximately 92 % of the 17+ million variants called were called whether we removed duplicates with Picard or SAMTools, or left the PCR duplicates in the dataset. There were no significant differences between the unique variant sets when comparing the transition/transversion ratios (p = 1.0), percentage of novel variants (p = 0.99), average population frequencies (p = 0.99), and the percentage of protein-changing variants (p = 1.0). Results were similar for variants in the American College of Medical Genetics genes. Genotype concordance between NGS and SNP chips was above 99 % for all genotype groups (e.g., homozygous reference). Conclusions Our results suggest that PCR duplicate removal has minimal effect on the accuracy of subsequent variant calls.


Background
Next-generation sequencing (NGS) has accelerated research efforts in virtually every field in the life sciences. NGS is being used to diagnose and determine the genetic cause of diseases, measure gene expression, refine phylogenetic trees, identify markers to differentiate between morphologically similar species, and de novo sequencing for non-model organisms [1][2][3][4][5]. For many years these types of projects were not possible because the data were difficult and expensive to obtain. Today, however, it is possible to sequence entire genomes for a fraction of what it cost just 10 years ago.
Despite the many benefits of NGS, these data are challenging to work with for several reasons, including: (1) NGS has a much higher error rate than other genotyping methods (e.g. compared to Sanger sequencing), (2) the most common NGS methods only produce short fragments, known as "reads", ranging from~100-300 nucleotides in length, and (3) datasets are very large, frequently >100 gigabytes [6]. Many experimental and bioinformatics innovations are employed to address these challenges.
One innovation to overcome the high error rate is to sequence each nucleotide (position) in the target DNA (genome, exome, etc.) multiple times. The number of times each nucleotide is sequenced is referred to as coverage. Coverage is variable within a sample and typical coverage ranges from 30 or less to >1000 for typical human genetic and cancer applications, respectively. This approach is employed under the assumption that sequencing errors are random, making deeper coverage more reliable to determine the nucleotide at a given position. In other words, if each nucleotide is sequenced multiple times, most reads will have the correct nucleotide. PCR duplicates are, at least in theory, one possible impediment to this innovation.
To prepare DNA for NGS, DNA is sonicated, and adapters are ligated to the end of each resulting fragment. Fragments are then PCR amplified and PCR products are spread across the flowcell. There are several additional steps not pertinent to this research, but have been previously described thoroughly by Voelkerding et al. [7]. PCR duplicates are sequence reads that result from sequencing two or more copies of the exact same DNA fragment, which, at worst, may contain erroneous mutations introduced during PCR amplification, or, at the very least, make the occurrence of the allele(s) sequenced in duplicates appear proportionately more often than it should compared to the other allele (assuming a non-haploid organism). Ideally, one PCR copy of each original DNA fragment will hybridize to the flowcell, but there is currently no way to enforce this. When multiple copies originating from the same DNA molecule all bind to the flowcell, each is sequenced and the resulting reads are referred to as PCR duplicates. These duplicates occur for two reasons: (1) we cannot control exactly which sequences from the pool of PCR products hybridize to the flowcell, and (2) not all of the original DNA molecules are amplified without bias (PCR amplification bias). PCR amplification bias and increasing the number of PCR cycles both increase the likelihood of PCR duplicates during sequencing.
Many analysis pipelines remove PCR duplicates to mitigate potential biases on variant calling algorithms. For example, a large number of PCR duplicates containing an amplification-induced error may cause a variant calling algorithm to misidentify the error as a true variant. Several programs exist to remove or mark PCR duplicates (e.g. SEAL [8], elPrep [9], FastUniq [10], etc.), but in this work we focus on the two most commonly used approaches: Picard MarkDuplicates (http://broadinstitute.github.io/picard/) and SAMTools rmdup [11,12].
SAMTools and Picard use similar approaches for duplicate marking or removal, but with some differences. SAMTools (rmdup) identifies PCR duplicates by identifying pairs of reads where multiple reads align to the same exact start position in the genome, and the reverse read on the 3′ end maps at the exact same location (i.e. external mapping coordinates are identical). The read pair with the highest mapping quality score is retained and other read pairs removed (a possible disadvantage because data is lost). Also, rmdup does not work for unpaired reads (in paired end mode) or read pairs where each read maps to different chromosomes. There will also be unexpected results if multiple libraries are present in the same BAM file since rmdup assumes all reads in the BAM file originated from the same library [11,12]. Picard (MarkDuplicates) is similar to rmdup. MarkDuplicates identifies read pairs with the same orientation that have the exact same 5′ start position in the mapping. It takes into account clipping on the 5′ end of the read and makes calculations based on where the 5′ start position would be if the entire read had mapped to the reference. In contrast to rmdup, MarkDuplicates handles interchromosomal read pairs, and considers the library for each read pair and keeps a read pair from each library. MarkDuplicates also does not remove reads, but sets the SAM flag 1024 for all but the best read pair. The best read pair is the read pair with the highest sum of base qualities with Q ≥ 15 (http://broadinstitute.github.io/picard/).
We performed a three-way comparison between variant calls generated without removing duplicates and those removing duplicates with either Picard MarkDuplicates or SAMTools rmdup to determine: (1) if PCR duplicate removal improves the accuracy of variant calls, and (2) if so, whether MarkDuplicates or rmdup produces a more accurate variant dataset. Our results suggest that accuracy is the same for both MarkDuplicates and rmdup, but there are substantial performance (execution time and memory usage) differences between the two. Our results further suggest that removing duplicates may not be necessary in variant calling pipelines.

Dataset
Whole-genome (WGS) data used in this article were obtained from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database (adni.loni.usc.edu). ADNI is the result of efforts of many co-investigators, led by Dr. Michael Weiner, from a broad range of institutions, and includes subjects from more than 50 American and Canadian research sites. A primary goal of ADNI is to identify biological markers for Alzheimer's disease (AD). To date over 1,500 adults, ages 55 to 90, have participated in the study. For up-to-date information see www.adni-info.org. Of the 809 WGS samples (average coverage~37) available in this dataset, we randomly selected 100 to use in our duplicate removal analysis. During the analysis process, one sample was removed due to low quality data and was not replaced. Each of the remaining 99 study samples was run through the exact same pipeline described below (Data Analysis subsection).
We also have matching SNP chip data for the 99 samples used in this study. Samples were genotyped using the HumanOmniExpress BeadChip Kit by Illumina. The SNP chip data were cleaned by removing (in order): (1) all SNPs missing greater than 2 % of data, (2) all individuals missing more than 2 % of data, (3) SNPs with a minor allele frequency less than 0.02, and (4) SNPs out of Hardy-Weinberg equilbrium (p < 0.000001).

Data analysis
We followed the GATK Best Practices [13] during this process, varying only the step of how and whether we removed duplicates during the process (Fig. 1). Genomes used in this research were aligned by ADNI using the Burrows-Wheeler Aligner (BWA) [14]. Three versions of each BAM file in our dataset were generated: (1) a BAM where PCR duplicates were left intact, (2) a BAM where PCR duplicates were removed using SAMTools (rmdup), and (3) a BAM where PCR duplicates were marked (and subsequently ignored) using Picard (MarkDuplicates). Subsequent steps are identical in each pipeline and all steps were performed with the Genome Analysis Toolkit (GATK, version 3.2). Following duplicate removal (or not) we refined the mappings using GATK's IndelRealigner and BaseRecalibrator (BQSR), and joint called/refined variants using the HaplotypeCaller and variant quality score recalibration (VQSR).

Variant Call Format (VCF) file comparisons
After completing the variant calling step we compared the three resulting variant call format (VCF) [15] files using Variant Tool Chest (VTC) [16]. VTC performs complex set operations, like intersect and complement, on VCF files. Next, we extracted summary statistics for the resulting intersects and complements with the variant statistics tool (VarStats) in VTC. We used R (version 3.1.2) to analyze the information from the comparisons of the three files, and the summary statistics [17].

Comparison between SNP chip and NGS data
We compared results between each group and the matching SNP chip data, using the more accurate SNP data as Fig. 1 Our pipeline was identical for every sample, except for how we handled PCR duplicates. Original BAM files with mapped reads were processed using the shown pipeline. PCR duplicates were ignored, removed using SAMTools (rmdup), or marked (left in the file, but ignored in subsequent steps) using Picard (MarkDuplicates). After the duplicates step, all subsequent steps in the analysis pipeline were identical. The final output of the pipeline is a multi-sample VCF file. VCF: Variant Call Format, BQSR: Base Quality Score Recalibration, VQSR: Variant Quality Score Recalibration The datasets correspond to the three pipelines: removing duplicates using SAMTools, removing duplicates using Picard, and ignoring duplicates. The blue circle is the Picard dataset, red is the no duplicates removed dataset, and green is the SAMTools dataset truth. We measured percentage of genotypes changed from one type to another (e.g. heterozygous to homozygous variant).

The American College of Medical Genetics (ACMG) gene list
In 2013, the American College of Medical Genetics (ACMG) published guidelines for reporting incidental findings in large sequencing diagnostics [18], typically defined as whole exome or genome sequencing, or sequencing targeted genes. Included in these guidelines is a list of genes the working group recommends clinicians always examine for deleterious mutations. The list was compiled based on conditions that are verifiable using secondary diagnostic approaches, and for which early intervention is likely to significantly change or prevent disease. This list is certainly not a comprehensive listing of all clinically important genes, but it does include genes that meet the criteria outlined above. The genes on the recommended list are perhaps the most clinically important known genes because there are effective treatments for disorders resulting from mutations in these genes. We refer to this gene list as the ACMG genes.

Variant annotation
We used ANNOVAR [19] to annotate each variant with dbSNP 138 [20] identification numbers (if present in dbSNP), 1000 Genomes Project minor allele population frequency (if observed in the 1000 Genomes Project) [21,22], and separated variants by type (e.g. nonsynonymous, InDel, etc.). We refer to protein changing variants as nonsynonymous variants, InDel frameshifting variants, or structural variants. We assumed any variant not present in dbSNP is novel.

Duplicate removal
We calculated the percentage of duplicates removed by both Picard MarkDuplicates and SAMTools rmdup to quantify approximately how many reads were considered duplicates by both softwares, by comparing the number of reads in the BAM files before and after duplicate Fig. 3 The Venn diagram is as described in Fig. 2, except limited to variants in the ACMG genes. The blue circle is the Picard dataset, red is the no duplicates removed dataset, and green is the SAMTools dataset Here we present metrics from each portion of the Venn diagram (Fig. 2), including total number of variants, transition/transversion (Ti/Tv) ratios, average population frequency, proportion of novel variants, and proportion of variants that change the protein product. In the top part of the  (1) comparing the values for all variants in each main dataset ("All Picard", "All SAMTools", and "All No Dups"); and (2) comparing values for variants across all "Unique" groups. There was a significant difference when comparing the number of variants across groups, but none of the other measures were significantly different marking/removal. For MarkDuplicates, specifically, we counted the number of reads not marked as duplicates.

Whole genome variant dataset: Picard versus SAMTools versus not removing duplicates
We processed whole genome data for each of 99 different genomes three different times. For one set of the 99 genomes, we removed duplicates using Picard (MarkDuplicates), for another set we removed duplicates using SAMTools (rmdup), and for the third we left the duplicates in the alignments. Next, we called variants on each of the alignments using the GATK pipeline (outlined above). Finally, we pooled all of the variants for each of the three sets of genomes for comparison. From this point forward, variant datasets referred to as Picard, SAMTools, or no dup refer to the union of variants from all 99 genomes with duplicates removed using Picard, SAMTools, or no duplicates removed, respectively. In Fig. 2, we show the overlap of called variants in each of the datasets using a Venn diagram. There were a total of 17134081 different variants called, and about 16 million (about 92 %) were called regardless of how duplicates were treated (three-way intersect in the center of the Venn diagram in Fig. 2). Picard and no dup had about twice as many unique variants as SAMTools (307486 and 398248 compared to 150474), and the three two-way intersections each had comparable numbers of variants (177313, 181176, and 230589). Approximately 31 % of all variants in this study were rare, with a minor allele frequency less than 0.01. This is unsurprising because the 1000 Genomes Project demonstrated that each individual has hundreds of rare variants at evolutionarily conserved sites alone [22].
Next we analyzed the variant characteristics from individual partitions of the Venn diagram (Table 1). Several  measurements can be used to assess the quality of a variant dataset, such as total number of variants, transition/transversion (Ti/Tv) ratios, proportion of novel variants, and proportion of variants that change the protein product. The number of SNPs between sets (labeled "All" in Table 1) and subsets (labeled "Unique" in Table 1) were significantly different, but the Ti/Tv ratios, percentage of variants in dbSNP, population frequencies, and percentage of variants that are protein changing are not significantly different (Table 1). Variants across the full Picard, SAMTools, and no dup sets had about 16 million total variants, Ti/Tv ratios of 2.14, 28-29 % novel variants, and 0.4 % of variants are protein-changing. Variants across the subsets (labeled "Unique" in Table 1) had lower Ti/Tv ratios and percentage of variants changing the protein, and higher novel variants.
To further compare these variant datasets, we compared the intersections of the three variant datasets. Most of the variants (15.6 million of~17 million) were called using all three approaches. Metrics for each of the individual variant datasets and the three-way intersect were very comparable, except that~10 % fewer variants were novel in the three-way intersect (Table 1). Variant characteristics in different partitions of the datasets (e.g. Unique to Picard/SAMTools, Unique to SAMTools/no dups, Unique to SAMTools, etc.) are dramatically different than the three-way intersect. The biggest changes occur in Ti/Tv ratios and percent novel variants. Except for the three-way intersect and entire variant datasets for each of the three approaches, the Ti/Tv ratios are all less than two, in contrast to~2.1. Additionally, a large number of novel variants exist in each of these other partitions. There are more novel variants in the Unique to No Dups (~46 %) and Unique to SAMTools/No Dups (~48 %) groups.

ACMG genes variant dataset: Picard versus SAMTools versus not removing duplicates
In 2008 the American College of Medical Genetics (ACMG) compiled a list of genes known to harbor diseasecausing mutations (i.e. clinically important genes). We compared variants in only the ACMG genes to determine if the choice of duplicate removal appears to be (more or less) important for the study of clinically important genes.
We performed the same partitioning of variants (Fig. 3) and comparison of variant characteristics ( Table 2) as above. Many of the results were comparable. The majority (34000 of 34803,~98 %) of variants were identified using all three approaches. Again, the variants in the three-way intersect were very similar compared to all the ACMG variants in the each of the three individual datasets. However, these variants (three-way intersect and whole ACMG variant datasets) and the other partitions of ACMG variants, were very different in every measured statistic.

Comparison between SNP chip and NGS data
We grouped variants from the SNP chip by variant type (i.e. homozygous reference, heterozygous, or homozygous alternate) and compared each of these genotypes to the called genotypes in each of the three groups of NGS data. Comparing genotypes across the NGS and matching SNP chip data, we found that Picard, SAMTools, and not removing duplicates were virtually indistinguishable (  PCR duplicate removal (Picard and SAMTools). To our knowledge, no one has formally compared the two different algorithms. Furthermore, there is little data assessing the necessity of PCR duplicate removal. Our goal was to determine whether PCR duplicate removal meaningfully affects the resulting variant datasets, and if the accuracy of the variant datasets is different when using Picard and SAMTools.
We compared the variant datasets resulting from each of the three different pipelines. First, we compared common measures of the variants in each dataset to assess the overall quality of the called variants [23]. This comparison says nothing about any individual variant, but about the accuracy of the dataset as a whole. When considering the entire variant dataset for each of the three approaches, the important characteristics we compared were nearly identical. All three had Ti/Tv ratios of 2.14, a ratio in the expected range [12], very similar proportions of novel variants (27.95 % to 28.7 %) also in the expected range [23], identical minor allele frequencies for called variants (21 %), and identical proportions of protein changing variants (0.4 %). Therefore, using these meaningful measures to assess the accuracy of the variant calls, the three approaches are nearly indistinguishable. There is evidence suggesting that the intersection of the three variant datasets results in a more accurate dataset because the percentage of novel variants decreased by a small amount (~27 % to~20 %), though the difference was not significant when comparing the average of the three unique values to the intersect of all three (p = 0.49). However, when looking at all other possible intersects of the three variant datasets, the apparent quality of the datasets drops.
We further assessed the accuracy of the three variant datasets by comparing genotypes in the variant datasets to genotypes from SNP chip data. All three approaches were much more accurate than 99 % and nearly identically accurate. This comparison, however, only demonstrates concordance amongst common variants. In this case, we only considered variants with a minor allele frequency greater than 0.02.
As NGS is being moved into a clinical setting, we wanted to verify that our results are consistent in clinically important genes so we performed the same tests in parallel in only the ACMG genes. The ACMG recommended a set of genes that should be assessed when analyzing the whole genome/exome of a patient in a clinical setting. We make no assumption that this gene set contains every clinically important gene, but it does contain all the genes considered most important by the ACMG. We performed the same analyses on variants The y-axis is execution time measured in minutes from only the genes recommended by the ACMG, and our findings are nearly identical to those performed on variants from the entire genome. The three individual datasets are indistinguishable, while the intersection of the three again appears to be slightly better, and there is still nearly perfect concordance with SNP chip data. Next, to assess computational performance differences between Picard and SAMTools, we measured the memory and compute time required for each to remove duplicates. Picard required substantially more memory (31000 versus 120 megabytes) and slightly more time (seven versus eight hours) than SAMTools.
Removing duplicates is intended to reduce noise during the variant identification process and minimize false positives. Our results suggest removing duplicates has little effect on the results. As sequencing technologies continue to advance, PCR duplicate removal will become less of an issue. For example, single-molecule sequencing technologies such as PacBio's Single Molecule, Real-Time (SMRT) sequencing and Oxford Nanopore Technologies' Minion perform sequencing on non-amplified DNA.

Conclusions
In summary, we compared the effect on the resulting variant datasets when using Picard for duplicate removal, SAMTools for duplicate removal, or not removing duplicates. We performed these comparisons across the entire genome, and then limited our analyses to variants located in clinically important genes. Our results suggest that in deep sequencing whole genome data, removing or ignoring PCR duplicates has non-significant effects on the accuracy of subsequent variant datasets. Furthermore, our results demonstrate that when PCR duplicates are handled using either SAMTools or Picard, the resulting variant datasets are very comparable. In some settings, PCR duplicate removal/marking may be preferable. For example, our data show that the most accurate variant dataset may be obtained by using each of the three approaches and then intersecting the three datasets (assuming any variant outside the intersect is a false positive). Since NGS library preparation is different for sequencing genomes, exomes, transcriptomes, etc. additional studies will be necessary to know if our results extend to the exome or other sequenced partitions of the genome.