HaplotypeTools algorithm
The algorithm for HaplotypeTools comprises on five steps. The first step splits the VCF into windows of a specified length (default 10 kb), and BAM files into windows of the same length. Step 3 combines pairs of BAM and VCFs for each sample by assigning read information to intermediate VCF files (i.e. VCF-[contig]-[start window]-[stop window]-phased-[sample number]). Step 4 assigns phase groups based on 5 conditions, outputting intermediate tabulated files (i.e. VCF-[contig]-[start window]-[stop window]-phased-[sample number]-and-assigned.tab). Step 5 merges all phased samples for a given window into VCFs (i.e. VCF-[contig]-[start window]-[stop window]-phased), and then concatenates those into a final phased VCF. Splitting input data into windows allows steps 3 and 4 to be run in parallel on a cluster (Platform Load Sharing Facility (LSF), Sun GridEngine (SGE) or Univa GridEngine (UGE) currently supported). HaplotypeTools can also be run on an individual computer in serial at the expense of slower computational time.
Step 3 of HaplotypeTools assigns Phase Positions (PP) for all reads that overlap ≥ 2 heterozygous positions, which are separated by semicolons and stored in the ID column of the output. PPs consist of:
-
1.
unique read count (RC) and
-
2.
read genotype values (rGT) or read nucleotide values (rNT)
RCs serve as simple integer identifiers (0, 1, 2, n) for step 4 to identify reads that overlap multiple VCF positions, the value of which is incremented for each new read in the BAM. rGT are variant positions in sequence reads corresponding to a sequence match to the VCF REF column (0) or ALT column (1,2 etc.). rNT are used instead of rGT for variant positions in sequence reads that do not match a VCF REF or ALT base (based on the CIGAR flag), e.g. rNT = A or rNT = ATCC. For example, following step 3, an ID column could be’0-PP-0;1-PP-0;2-PP-1;3-PP-1;4-NT-A’, indicating 5 reads align over this VCF position in total, two of which have the REF allele, two that have the ALT allele, and one that has an adenosine, which is not described in the VCF REF or ALT column at that position.
Step 4 runs through pairs of consecutively found heterozygous positions named Previous Heterozygous Position (PHP) and Current Heterozygous Position (CHP), checking them for 5 conditions:
-
1.
Check for ≥ 2 rGT’s in CHP.
-
2.
Check the 2 CHP rGT’s with the highest depth > min. haplotype depth parameter.
-
3.
Check the 2 CHP rGT combined depth (percent) > phase cutoff parameter.
-
4.
Check PHP passed conditions 1–3.
-
5.
Check for ≥ 2 haplotypes from PHP and CHP PPs.
If any of those 5 conditions are not fulfilled, the PHP ID column is replaced by a comment stating the sample number and the reason it was not phased. A Phase Block (PB) integer value (identifier for separate haplotypes) is also incremented. The following pair of PHP and CHP are then assessed. Providing all 5 conditions are met, the reads that match the two PHP rGT’s and the two CHP rGT’s are identified, and used to construct a new CHP phased genotype. In the case > 2 rGT’s are found, the two with the highest depth are selected. A phase group (PG) is assigned to PHP (if not already assigned) and CHP, which is appended to the SAMPLEINFO column. The PG consists of the contig, sample number, start window and PB separated by dashed (e.g. supercont1.1-0-350,000-1), ensuring every PG is unique e.g. the same phase block identifier in the same window (350,000–360,000) for a 2nd sample in a multi sample VCF will be supercont1.1-1-350,000-1. A summary file for each window is printed including contig, position, ID and SAMPLEINFO, which is used to update the final phased VCF during concatenation in Step 5.
Benchmarking using simulated data
HaplotypeTools and WhatsHap (downloaded from https://github.com/whatshap on 1st March 2021) were benchmarked using simulated data from the Batrachochytrium dendrobatidis (Bd) JEL423 genome. First, 40 contigs each of < 10 kb were removed from the reference sequence, ensuring we could simulate 10 kb reads across the genome (updated reference = 29 contigs, 23.44 Mb, N50 = 1.7 Mb). Next, the Biscap utility script “Introduce Random Mutations (IRMS)” [28] was used with the heterozygous setting (HET), which duplicates every chromosome (homologous versions), followed by selecting random nucleotides to ‘mutate’ into other random nucleotides across both chromosomes and homologous chromosomes. Three such modified reference genomes were generated including 1 SNP/Kb (23,137 total), 10 SNP/Kb (231,370 total) and 100 SNP/Kb (2,313,700 total). Next, sequence reads were simulated from this duplicated and modified reference sequence using WGSim (https://github.com/lh3/wgsim) to ~ 20X depth using either short (100 nt) paired reads (2,313,797 pairs) or long (10 kb) paired reads (23,138 pairs) with no introduced errors (-r 0). Aligning these reads back to the unduplicated and unmodified reference genome will then appear to contain heterozygous positions, for which each position changed is known (the truth set). Reads were aligned to the genome using BWA v0.7.4-r385 mem, and a clean BAM created using Samtools v1.8 view -b -h -f 0 × 2. For WhatsHap compatibility, Picard AddOrReplaceReadGroups was applied to the clean BAM files.
Variants were called from the simulated data alignments using Pilon v1.9 with the diploid flag [18]. For WhatsHap compatibility, reference bases were removed from the VCF. GATK v.4.1.2.0 [12] was not used for calling heterozygous positions from simulated data alignments owing to failing the StrandBiasBySample filter used by HaplotypeCaller. Accuracy of Pilon was assessed using Biscap utility script “Comparison of FDR (CFDR)” [28].
Phased VCF’s from both HaplotypeTools and WhatsHap were assessed for accuracy using HaplotypeTools utility scripts. Specifically, Phased in Any (PIA) regions were identified (VCF_phased_to_PIA.pl), with parameter -t PS for WhatsHap and -t PID (default) for HaplotypeTools. FASTA sequences of haplotypes blocks/pairs were extracted using VCF_phased_and_PIA_to_FASTA.pl. Accuracy was assessed using Haplotype_FASTA_files_to_compare_to_IRMS_het_sites.pl, which calculates for every haplotype block/pair the number of sites that are correctly phased, sites that are incorrectly phased (False Positive type 1) and sites that have been incorrectly variant called and also been phased (False Positive type 2), false negatives within haplotype blocks (not presented), and Switch Errors (incorrect crossovers between haplotypes). To calculate Switch Errors, false negatives were ignored, while False Positive type 2 were considered as a switch error. Additionally, the script produces two summary statistics including overall Switch Error Rate, where the switch error is divided by the number of opportunities for switch errors. Finally, the Quality adjusted N50 (QAN50) was calculated for each test, where each haplotype block/pair is divided into sub-blocks with no switch errors, which are multiplied to the proportion of phased alleles inside that block (called an adjusted span), sorted from largest to smallest, and then the QAN50 is the size of the adjusted span that includes more than half of the total variants [29].
HaplotypeTools using real data
To test HaplotypeTools on real data, variant calling was first applied to a major fungal pathogen of amphibians, Batrachochytrium dendrobatidis, which has a 23 Mb diploid or triploid (with frequent aneuploid [22]) genome. Paired-end Illumina data from representatives of all five known lineages (BdGPL JEL423, BdCAPE TF5a1, BdCH ACON, BdAsia-1 KRBOOR_323, BdAsia-2 CLFT065, and a hybrid of unknown parentage SA-EC3) were obtained from the NCBI Sequence Read Archive (SRA) [19, 22, 30]. The Genome Analysis Toolkit (GATK) v.4.1.2.0 [12] was used to call variants. Our Workflow Description Language (WDL) scripts were executed by Cromwell workflow execution engine v.48 [31]. Briefly, raw sequences were pre-processed by mapping reads to the reference genome Bd JEL423 using BWA-MEM v.0.7.17 [32]. Next, duplicates were marked, and the resulting file was sorted by coordinate order. Intervals were created using a custom bash script allowing parallel analysis of large batches of genomics data. Using the scatter–gather approach, HaplotypeCaller was executed in GVCF mode with the diploid ploidy flag. Variants were imported to GATK 4 GenomicsDB and hard filtered (QD < 2.0, FS > 60.0, MQ < 40.0, GQ ≥ 50, AD ≥ 0.8, DP ≥ 10). HaplotypeTools and WhatsHap were used individually to phase each VCF with default parameters, and HaplotypeTools utility scripts to phylogenetically place and visualise haplotype placement across the genome, as well as explore crossovers between pairs of phased VCFs.