Alignment and read processing
Mapping/Merging
When mapping reads with BWA, we attempted to use multi-threading over 16 cores per read group, but found that the mapping speed did not improve linearly with the number of computing threads (Fig. 1
b). Therefore, we specified 8 threads per read group as a more cost-effective approach. The output of the mapping stage resulted in an uncompressed sequence alignment/map (SAM) formatted text file that is ~4 × larger than a compressed fastq file (~400–500 GB/sample; Fig. 1
a, Additional file 1: Table S1). Subsequent compression of the SAM file resulted in a BAM file requiring only 25–30 % of the storage space, thereby reducing the overall footprint of aligned read files from ~175 TB to ~60 TB (Additional file 1: Figure S1A). The merged BAM file was comparable in size to the sum of the files containing individual read groups.
Sorting
When sorting a single whole human genome, upwards of 800 temporary files are generated, totaling up to 150 GB per genome. These files are repeatedly opened, read, written, and closed, imposing demanding IO performance. To fully utilize the sixteen CPU cores on each node, a single compute node must be capable of processing 2.4 TB of intermediate data spread across over 10,000 files. Unfortunately, the large parallel file systems available on most massively parallel supercomputers are designed to deliver high streaming bandwidth, not high IOPS [8, 9], and sorting several dozen genomes concurrently on Gordon’s Lustre file system resulted in severe performance degradation. To circumvent this issue, we utilized 2 “BigFlash” nodes, each with 4.4 TB of SSD flash storage.
The ability to write the temporary files to the flash space allowed us to maintain very efficient CPU utiliziation through the sort step by processing 32 genomes (16 per node) concurrently without being limited by IO performance, ultimately decreasing both the computational cost and wall time for this step. Without this ability, job packing would have been impossible and sorting the reads could have come at a needlessly high computational cost as a result of the IO limitations. The output file from the sorting step resulted in a slight compression from the unsorted input BAM file, requiring ~15 % less space than the unsorted file.
Realignment/Recalibration
When running, we allotted ~8 GB of memory per genome, allowing us to process 8 samples on a 64 GB node. Thus, 2 cores were designated for each instance of MarkDuplicates. This approach prevented the possibility of multiple job submissions exceeding the available memory and causing run errors. The RealignerTargetCreator step was similarly designated 8 GB of memory per genome and IndelRealigner steps utilized 12 GB of memory per genome, so 8 and 5 samples were run on a single node, respectively. This approach prevented memory allocation errors arising from memory exhaustion. BaseRecalibrator, on the other hand, could be run in parallel and was multithreaded to 8 cores per genome. Finally, PrintReads was run in parallel over 8 or 16 cores. When we established that specifying 16 computing threads yielded no performance increase, subsequent jobs were multithreaded over 8 cores (Fig. 1
d).
The output from RealignerTargetCreator and BaseRecalibrator were negligible in size, while the output of MarkDuplicates and IndelRealigner were BAM files comparable in size to the sorted BAM file. The final, recalibrated BAM file was roughly double the size of the input BAM file and approximately 200 GB were required for each sample (Fig. 1
a). Given the extraordinary amount of data being processed and generated, it is important to remove large, intermediate files after subsequent processing has been completed to minimize required storage resources. Without deleting intermediate files during the processing phase, each sample would require >1TB of storage space, totalling nearly half a petabyte for our cohort of 437 individuals (Additional file 1: Figure S1A). We suggest using correlations between input and output files sizes as one indication of successful completion of processing steps prior to deleting intermediate files from previous steps (Additional file 1: Figure S1B).
Variant calling
Computing time
By comparing variant calling approaches using different group sizes, we show that the total computing time required to call variants for a group grows quadratically as the group size increases (Fig. 2
a). Consequently, as group size increases, the "per sample" cost of calling variants sharply reduces at first, but then increases linearly when the group size is large (Fig. 2
b). We estimate that the "per sample" cost of variant calling reaches a minimum for groups of ~15 individuals. Any computational cost benefit of group calling is lost when the group size reaches 75 and the cost increases linearly from there (~60 % increase per 100 individuals). For particularly large groups (>100), variant calls were made on only a fraction of chromosome 21 before the job timed out after 72 hrs (Additional file 1: Figure S3).
Accommodating ancestry
To further guide or grouping approach for variant calling, we called variants on portions of the NA12878 genome in groups with varying ancestral background (Additional file 1: Figure S4B). Concordance, defined as the fraction of variant calls made in all four ancestral test groups, amongst all four test groups tended to be higher after variant filtration was applied than in calls prior to filtering. Concordance was also higher for SNVs than indels (Additional file 1: Figure S5). For indels, over 87 % of variant positions were identified in at least 3 of 4 groups, while SNVs had greater than 90 % concordance amongst all test groups with over 95 % of SNV calls being made in at least 3 of 4 groups. To test the hypotheses that ancestral background of a group would affect the sensitivity and specificity of the variant calling process, we used a validated set of variant calls in high-confidence regions of the NA12878 genome [7]. We found evidence that the number of false negative SNV and indel calls was affected by the ancestral background of the group, but not the number of false-positive variants, suggesting that ancestral environment has a greater influence on sensitivity than specificity (Additional file 1: Figure S6). Because the NA12878 genome is of European ancestry, we investigated whether the estimated percent of European ancestry within a test group had an effect on the number of false-negatives or false-positives called on NA12878. The results were consistent with the initial results, indicating a positive correlation between sensitivity and the estimated fraction of European admixture within the test group for SNVs and indels after controlling for region of the genome (Fig. 2
c). Again, there was no effect for the specificity of variant calls (Fig. 2
d). Similarly, there was a positive correlation between overall accuracy and negative predictive value and fraction of European admixture, respectively, but no correlation with positive predictive value (Additional file 1: Figure S7). These results suggest that calling variants in groups of similar ancestry could result in fewer missed variant calls than a heterogenous group containing a high degree of admixture. Thus, in order to obtain high-quality variant calls in a computationally efficient manner, we called variants on the individuals in groups of 20–24 based on shared ancestry (Fig. 2
e).
Variant call comparison
Summary
The resulting variant calls were compared on an individual scale and as a cohort with variant calls made using the conventional pipeline. Our approach yielded 29,915,861 positions in which a SNV was identified across all 437 samples, while the conventional pipeline identified 30,790,918 SNV positions across 435 samples. Of those positions, both pipelines identified 27,247,530 (~81.4 %), while 3,543,283 (~10.5 %) were identified only by the conventional pipeline and 2,668,331 (~8.0 %) were identified only by our pipeline (Fig. 3
a). Concordance between pipelines, defined as the intersection versus the union of variant sites for the pipelines, varied greatly for each individual as well, ranging from 71–82 %, and displayed a bimodal distribution (Fig. 3
b). Surprisingly, our pipeline called, on average, more SNVs per genome despite identifying fewer unique positions, though the number of variants called on an individual varied between pipelines (Fig. 3
a,c), and additional evidence of a batch effect is apparent in the conventional workflow.
The most glaring difference between variant call sets was the dearth of variant calls made by the conventional pipeline in the highly polymorphic human leukocyte antigen (HLA) region of chromosome 6 (Fig. 3
d). The conventional workflow resulted in only a handful of calls in this region, while our own pipeline made tens of thousands of calls on the cohort. This may partially explain why our pipeline identified more SNVs per individual but fewer positions overall. This discrepancy could have staggering consequences in an association study, given the importance of the region in many autoimmune diseases. Other regions of inconsistency between call sets appeared most commonly near centromeres and telomeres. Often, both pipelines made variant calls in these regions that were inconsistent with one another (Additional file 1: Figure S9). This could be a result of poor local alignment in these particular regions. We further considered the prevalence of variant calls on each chromosome. After removing variants from the HLA region, the conventional pipeline resulted in more unique variant positions on each of the 22 autosomes and the X chromosome, while our own pipeline resulted in more unique calls on the Y chromosome and additional unmapped contigs, the latter of which were not included in the reference coordinates for the conventional pipeline (Additional file 1: Figure S10). This is consistent with the fact that the conventional pipeline yielded far more pipeline-specific variant positions overall.
In addition to differences in the location of variant calls between the two call sets, we considered differences in the frequency, novelty, and classification of variant calls made exclusively by each pipeline. Of the variant calls made by a single pipeline, a slightly higher proportion of our calls were rare (minor allele frequency (MAF) <1 % in our cohort), while a slightly higher proportion of calls made by the conventional pipeline were low-frequency (MAF 1–5 %). A similar proportion of calls made exclusively by each pipeline were common (MAF >5 % (Additional file 1: Figure S11). Furthermore, a larger fraction of the calls made specifically by the conventional pipeline were novel (not identified in dbSNP) than those made only by our own pipeline. The same pattern holds true when considering only variant calls outside of the HLA region. In the HLA region, the majority of variant calls made exclusively by our pipeline were known variants, while the majority of the relatively few calls made in that region only by the conventional pipeline were novel (Fig. 4
b). Finally, calls made exclusively by our pipeline had a transition to transversion (Ti/Tv) ratio much closer to the 2.19 expected for a whole human genome than did calls made by the conventional pipeline. The same trend is true for variants in the non-HLA region of the genome, and the difference is especially apparent for variants within the HLA region (Fig. 4
c). These data suggest that the variant calls made by our pipeline are generally more consistent with what is expected than those made by the conventional pipeline.
Impact
We further assessed the location of variant calls relative to functional elements in the genome and the predicted functional consequences of those variants. Excluding variants in the HLA region, we discovered that there were more calls specific to the conventional pipeline in intergenic, intronic, exonic, 3’ untranslated region (UTR), and noncoding RNA (ncRNA) regions than calls made exclusively by our pipeline. This effect was most striking for 3’ UTR regions, while our pipeline actually resulted in slightly more calls in 5’ UTR regions (Fig. 5a,b). When we normalize to the total number of variants uniquely called by each pipeline, we see that our group calling strategy resulted in a higher proportion of calls landing in intergenic, exonic, 5’ UTR and ncRNA regions than the conventional pipeline-specific calls (Additional file 1: Figure S12A). In assessing the coding impact of the pipeline-specific variants in non-HLA regions, we found that the conventional pipeline made more calls predicted to result in nonsynonymous amino acid changes, while calls specific to our pipeline resulted in more synonymous and nonsense variant calls (Fig. 5
c). Spurious variant calls would not be subject to real-world natural selection, so one might expect to find an enrichment of deleterious variants in a set of false-positive variant calls; however, this would need to be validated with functional assays. It should be noted that our group calling workflow-specific calls contained a higher proportion of, albeit fewer, variant calls in all three categories (Additional file 1: Figure S12B).
To further evaluate the accuracy of our variant calls relative to those made by the conventional pipeline, we performed principal component analysis (PCA) and compared the clusters formed by plotting the first two PCs against one another. These measures have been shown to represent the geographical separation amongst different populations, with individuals from similar ancestral backgrounds clustering together. While variant calls made by our pipeline result in clusters that recapitulate self-reported race and ethnicity, calls made with the conventional pipeline show a batch effect with no biological interpretation. Two distinct European clusters are present along with two distinct clusters of Asian and Hispanic individuals extending away from their respective European clusters (Fig. 5
d). These results suggest that some non-biological process is resulting in a large proportion of genomic variation in this cohort when the variants from the conventional pipeline are considered. Using variant calls that do not reflect the biological reality of the patient would lead to spurious results in an association study.
Our results speak to the importance of the sophistication of variant calling workflows in obtaining high-quality data for use in association studies. Furthermore, we highlight the importance of efficiency in large-scale sequencing studies and describe an approach that yields reliable variant calls in a computationally efficient manner.