HELLO: improved neural network architectures and methodologies for small variant calling

Background: Modern Next-Generation and Third-Generation Sequencing methods, such as the Illumina and PacBio Circular Consensus Sequencing platforms, provide accurate sequencing data. Parallel developments in Deep Learning have enabled the application of Deep Neural Networks to variant calling, surpassing the accuracy of classical approaches in many settings. DeepVariant, arguably the most popular among such methods, transforms the problem of variant calling into one of image recognition, where a Deep Neural Network analyzes sequencing data formatted as images, achieving high accuracy. In this paper, we explore an alternative approach to designing Deep Neural Networks for variant calling: rather than converting the problem to one of image recognition, we use meticulously designed Deep Neural Network architectures and customized variant inference functions that account for the underlying nature of sequencing data.

Results: Results from 27 whole-genome variant calling experiments spanning Illumina, PacBio, and hybrid Illumina-PacBio settings suggest that our method allows vastly smaller Deep Neural Networks to outperform the Inception-v3 architecture used in DeepVariant for indel and substitution-type variant calls. For example, compared to a similarly trained DeepVariant pipeline, our method reduces the number of indel call errors by up to 18%, 55%, and 65% for Illumina, PacBio, and hybrid Illumina-PacBio variant calling, respectively. In these cases, our models are between 7 and 14 times smaller.

Conclusions: We believe that the improved accuracy and problem-specific customization of our models will enable more accurate pipelines and further method development in the field. HELLO is available at https://github.com/anands-repo/hello

Supplementary Information: The online version contains supplementary material available at 10.1186/s12859-021-04311-4.

Since the aligner used to generate these BAM files, pbmm2, is the same aligner we used in our evaluations, we did not perform BAM-to-fastq conversion followed by realignment for this dataset. Instead, we prepared 8.67x and 17.33x coverage PacBio sequencing reads using our Python script, which splits BAM files into multiple files of different coverages through random partitioning. The partitions contained 1/6th and 1/3rd of the reads in the source BAM file, respectively. The script is available here: https://github.com/anands-repo/bamutils/blob/master/split_bams.py
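The splitting logic is simple enough to sketch. The following is a minimal version assuming pysam; file names are hypothetical, and the actual implementation lives in split_bams.py at the link above.

import zlib
import pysam

def split_bam(in_path, out_specs):
    # out_specs: list of (output_path, fraction) pairs, fractions summing to <= 1
    src = pysam.AlignmentFile(in_path, "rb")
    outs = [(pysam.AlignmentFile(path, "wb", template=src), frac)
            for path, frac in out_specs]
    for read in src:
        # Deterministic per-read-name value in [0, 1), so read pairs stay together
        r = zlib.crc32(read.query_name.encode()) / 2**32
        lo = 0.0
        for out, frac in outs:
            if lo <= r < lo + frac:
                out.write(read)
                break
            lo += frac
    src.close()
    for out, _ in outs:
        out.close()

# 1/6 and 1/3 of a ~52x BAM give roughly 8.67x and 17.33x coverage
split_bam("hg002_pacbio_52x.bam",
          [("pacbio_8.67x.bam", 1/6), ("pacbio_17.33x.bam", 1/3)])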

Haplotagging training BAM files
Haplotagging requires a VCF file and a BAM file. To prepare the VCF file used to produce haplotagged BAM files for training, we followed the procedure used by the DeepVariant team (https://github.com/google/deepvariant/issues/369), which is to prepare variant calls with an older version of DeepVariant and use these calls to haplotag the BAM files. For this, we first removed existing haplotags from the alignment files using a command of the following form, where $input is the input alignment file and $UNHAPLOTAGGED is the output alignment file without haplotags.
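One way to strip the WhatsHap haplotag fields (the HP and PS tags) with samtools, offered as an assumption rather than the exact command used, is:

samtools view -b -x HP -x PS -o $UNHAPLOTAGGED $input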
To rehaplotag the reads, we used WhatsHap to phase the variants and then haplotag the original alignment files as follows:

whatshap phase --output $whatshap_vcf_name --reference $REF $dv_vcf_name $UNHAPLOTAGGED
whatshap haplotag --output $rehaplotagged --reference $REF $whatshap_vcf_name $UNHAPLOTAGGED

We used the same haplotagged BAM file for training the HELLO haplotagged model as well.

Ground truth data
Ground-truth VCF and BED files (referred to as TRUTH_VCF and TRUTH_BED) for the HG002 sample, corresponding to the GRCh38 reference, were downloaded from the following link: ftp://ftp.ncbi.nlm.nih.gov/giab/ftp/data/AshkenazimTrio/analysis/NIST_v4.2_SmallVariantDraftBenchmark_07092020

HELLO Illumina training data generation
HELLO is launched for each Illumina BAM file as follows:

python /path/to/hello_dev/python/dump.py --ibam $BAM --ref $REF --truth $TRUTH_VCF --bed $TRUTH_BED --workdir $OUTPUT_DIR --mapq_threshold 5 --num_threads 30 --reconcilement_size 0

HELLO's dump.py only looks at chr1-chr20 when generating the data; other reference contigs are ignored. This step produces a list of .memmap, .hdf5, and .index files. The list of .index files can be used as input to the HELLO training script.

DeepVariant PacBio training data generation
For PacBio training data generation, we added the following options to the make_examples command template presented for the Illumina case: --norealign_reads --vsc_min_fraction_indels 0.12 --alt_aligned_pileup 'diff_channels' --add_hp_channel --sort_by_haplotypes --parse_sam_aux_fields. This is based on an examination of run_deepvariant.py, which uses these additional options for PacBio variant calling with haplotagged data.
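For reference, a make_examples invocation in training mode with these options added would look roughly as follows; the base template and paths are assumptions, and only the added options are taken from the text above.

/opt/deepvariant/bin/make_examples --mode training --ref $REF --reads $BAM --examples $OUTPUT --truth_variants $TRUTH_VCF --confident_regions $TRUTH_BED --norealign_reads --vsc_min_fraction_indels 0.12 --alt_aligned_pileup 'diff_channels' --add_hp_channel --sort_by_haplotypes --parse_sam_aux_fields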

HELLO PacBio training data generation
HELLO is run using a command similar to the Illumina one to generate PacBio training examples, except that the BAM files are passed using --pbam instead of --ibam. For the model that uses haplotagged information, we pass the additional --include_hp option. We use the same BAM file as for DeepVariant training in this case.
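For concreteness, mirroring the Illumina template above, the PacBio invocation would take roughly the following form (our reading of the text, not a command reproduced from the repository; the bracketed option applies only to the haplotagged model):

python /path/to/hello_dev/python/dump.py --pbam $BAM --ref $REF --truth $TRUTH_VCF --bed $TRUTH_BED --workdir $OUTPUT_DIR --mapq_threshold 5 --num_threads 30 --reconcilement_size 0 [--include_hp]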

HELLO hybrid training data generation
For hybrid calling, we create random combinations of Illumina and PacBio BAM files for each chromosome, using the simple algorithm below. Let illumina_bam_files be the set of Illumina BAM files and pacbio_bam_files be the set of PacBio BAM files; then:

for chrom in [1, 20]:
    for ibam in illumina_bam_files:
        pbam = randomly select from pacbio_bam_files
        create HELLO training data using ibam, pbam for chromosome chrom

The command template below is used to generate the training data for a single chromosome:

python /path/to/hello_dev/python/dump.py --ibam $ibam --pbam $pbam --ref $REF --workdir $OUTPUT_DIRECTORY --truth $TRUTH_VCF --bed $TRUTH_BED --hybrid_eval --no_data_lst --q_threshold 10 --mapq_threshold 5 --num_threads 40 --reconcilement_size 0 --chromosomes $chrom

As may be seen, HELLO accepts two BAM files and a chromosome designation.
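Rendered as a runnable Python sketch (BAM file names, the reference and truth paths, and the per-chromosome working directory are illustrative; the dump.py options mirror the template above):

import random
import subprocess

# Placeholders; actual paths depend on the experiment setup
REF, TRUTH_VCF, TRUTH_BED = "GRCh38.fa", "truth.vcf.gz", "truth.bed"
illumina_bam_files = ["ilmn_a.bam", "ilmn_b.bam"]
pacbio_bam_files = ["pacbio_a.bam", "pacbio_b.bam"]

for chrom in range(1, 21):  # chromosomes 1 through 20
    for ibam in illumina_bam_files:
        # Pair each Illumina BAM with a randomly chosen PacBio BAM
        pbam = random.choice(pacbio_bam_files)
        subprocess.run(
            ["python", "/path/to/hello_dev/python/dump.py",
             "--ibam", ibam, "--pbam", pbam, "--ref", REF,
             "--workdir", f"hybrid_chr{chrom}",
             "--truth", TRUTH_VCF, "--bed", TRUTH_BED,
             "--hybrid_eval", "--no_data_lst",
             "--q_threshold", "10", "--mapq_threshold", "5",
             "--num_threads", "40", "--reconcilement_size", "0",
             "--chromosomes", str(chrom)],
            check=True)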

DeepVariant hybrid training data generation
We collect the {chrom, ibam, pbam} combinations generated by the above algorithm and create a merged ibam+pbam BAM file for the corresponding chrom. We then overwrite the read group information to be the same as required by DeepVariant for hybrid variant calling (see the sketch of these steps below). We invoke make_examples for each $merged_bam_part_RG with options appropriately set. We keep all tfrecords produced for chr1-chr18 for training, and keep those produced for chr19 and chr20 separate for validation.
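A plausible sketch of the merge and read-group overwrite steps, assuming samtools and Picard AddOrReplaceReadGroups (the read-group values shown are illustrative, not necessarily those DeepVariant expects):

samtools merge $merged_bam_part $ibam $pbam
picard AddOrReplaceReadGroups I=$merged_bam_part O=$merged_bam_part_RG RGID=1 RGLB=lib1 RGPL=HYBRID RGPU=unit1 RGSM=HG002
samtools index $merged_bam_part_RG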

Training procedure
For training HELLO, we collect all .index files produced by the dump.py script runs and list them in a data.lst file. We also need to specify the architecture of the DNN being used via $CONFIG, the name of a Python module inside HELLO's codebase that describes the different model architectures. The learning rate used for HELLO is 0.0003 and is applied uniformly when training all models. For training DeepVariant, we collect the output config files produced by the shuffle scripts and use the following command template to run training:
/opt/deepvariant/bin/model_train \
    --use_tpu \
    --master="grpc://<TPU_IP_ADDR>:

The learning rate used for DeepVariant is 0.0005, based on the DeepVariant training tutorial in the DeepVariant GitHub repository. We used Google Cloud TPU v2 to train these models, as DeepVariant builds are not available for IBM machines and our clusters have GPUs only on IBM machines.

The realigned file is termed $abam; note that $abam and $dbam have the same coverage.
We subsampled $dbam to a fraction of 0.5 using samtools view, and realigned the reads using Picard SamToFastq followed by the same pbmm2 command template as above.
We subsampled $abam to a fraction of 0.25 using samtools view and realigned the reads to the GRCh38 reference using Picard SamToFastq followed by the same pbmm2 command as above. This subsampling could equally have been done with $dbam; however, we had deleted $dbam due to lack of disk space.
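For reference, the subsample-and-realign steps take roughly the following form (a sketch assuming samtools, Picard SamToFastq, and pbmm2; the fraction shown corresponds to the $dbam case, intermediate file names are illustrative, and the pbmm2 preset and sort options are assumptions):

samtools view -b -s 0.5 $dbam > $dbam_half
picard SamToFastq I=$dbam_half FASTQ=$reads_fq
pbmm2 align $REF $reads_fq $abam_half --preset CCS --sort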
This procedure gives us 15x, 30x, and 60x reads; $dbam has a higher sequencing depth than the PacBio training-set BAM, but we could not find its coverage information in the README.
To determine the sequencing depth of the test-set BAM, we counted the number of bases in $abam as well as in the training-set BAM file for PacBio and used the ratio. That is,

coverage_test = coverage_train x (bases in $abam) / (bases in training-set BAM)

Setting coverage_train to 52x based on the associated README for the training set, coverage_test comes out to be 60x.
To count the number of bases in a BAM file, we used the following command:

samtools bam2fq $1 | awk 'NR%4==2' | awk '{ print length($0) }' | awk '{s += $1} END {print s}'

For HG001, we performed similar steps to prepare 15x and 30x coverage reads; we note that HG001 is stated to be 30x coverage in the README released with the data.

Haplotagging PacBio BAMs for variant calling
As mentioned in the main draft, we train two models for HELLO: one with haplotagging and one without. We first used HELLO's non-haplotagged model to call variants on the PacBio BAM files, and then used those variant calls with WhatsHap to produce haplotagged BAM files.
The WhatsHap invocations are similar to those presented before, except that for the 15x coverage data for HG001 and the 30x coverage data for HG003 we had to use the --distrust-genotypes option, which gives the phasing step the freedom to discard certain heterozygous variant records in the VCF file.
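For those datasets, the phase invocation took, in effect, the following form (adapted from the earlier whatshap template; the VCF and BAM variable names are illustrative, and this is presented as an illustration rather than a verbatim command):

whatshap phase --distrust-genotypes --output $whatshap_vcf_name --reference $REF $hello_vcf_name $BAM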
We used the same haplotagged BAM files for variant calling with HELLO and DeepVariant.

Hybrid datasets
HELLO accepts two BAM files, one for Illumina and one for PacBio, so no additional BAM file is needed. For DeepVariant, we performed "samtools merge" to produce merged versions of the BAM files for each of the hybrid experiments, and used Picard to apply the same read group to all the reads, as in the merge and read-group steps sketched earlier for training data.

The GATK command template is based on https://www.nature.com/articles/s41587-019-0217-9 and is run as follows:
1. HaplotypeCaller was run with --pcr-indel-model AGGRESSIVE, a minimum mapping quality setting of 60, and the -ERC option set to GVCF.
2. GVCF to VCF conversion was performed using scripts released at https://github.com/PacificBiosciences/hg002-ccs/
3. VCF filtration was performed using scripts released at https://github.com/PacificBiosciences/hg002-ccs/

Hybrid datasets
HELLO's command template is as follows; it accepts both the --ibam and --pbam options. The --reconcilement_size option should be set to 0 as of now, which disables the reconcilement feature. This is an untested feature that tries to reconcile differing representations of identical alleles by the Illumina and PacBio aligners; currently the method is not tuned properly and can take a large amount of time to run.

python /path/to/hello_dev/python/call.py \

DeepVariant's command template is as follows. The input BAM file in this case is the merged, read-group-imputed BAM file we described before.
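A hypothetical completion of the call.py invocation, patterned on the dump.py options shown earlier (every option here beyond --ibam, --pbam, and --reconcilement_size is an assumption, not taken from the repository):

python /path/to/hello_dev/python/call.py --ibam $ibam --pbam $pbam --ref $REF --workdir $OUTPUT_DIR --num_threads 30 --reconcilement_size 0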

Evaluating results
We downloaded ground-truth VCF and BED files for HG003 corresponding to the GRCh38 reference.

DNN architecture
Figure S3 shows the architecture of HELLO's DNN used for single-platform variant calling, with its three subnetworks (labeled 1, 2, and 3 in the figure). Two copies of subnetworks 1 and 2 are used in the hybrid architecture, as explained in the main document, and the features coming from them are combined using the convolutional architecture shown in Figure S1. The architecture of subnetwork 3 is the same for the single-platform and hybrid-platform cases. Figure S2 shows the residual block architecture used in HELLO: two convolutional blocks in series, short-circuited by skip/residual connections. Where the dimensionality changes from input to output, the skip connection includes a convolutional layer that matches the output dimensions.
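The residual block just described maps naturally to code. The following is a minimal sketch assuming PyTorch; the channel counts, kernel sizes, and the choice of 1-D convolutions are illustrative rather than HELLO's actual configuration.

import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, in_channels, out_channels, stride=1):
        super().__init__()
        # Two convolutional blocks in series
        self.body = nn.Sequential(
            nn.Conv1d(in_channels, out_channels, kernel_size=3,
                      stride=stride, padding=1),
            nn.BatchNorm1d(out_channels),
            nn.ReLU(inplace=True),
            nn.Conv1d(out_channels, out_channels, kernel_size=3, padding=1),
            nn.BatchNorm1d(out_channels),
        )
        # Skip connection; when dimensionality changes from input to output,
        # a convolutional layer matches the output dimensions
        if stride != 1 or in_channels != out_channels:
            self.skip = nn.Conv1d(in_channels, out_channels,
                                  kernel_size=1, stride=stride)
        else:
            self.skip = nn.Identity()
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.body(x) + self.skip(x))

# Example: 64 -> 128 channels; a (batch, channels, length) input of
# shape (8, 64, 100) comes out as (8, 128, 50) with stride 2
block = ResidualBlock(64, 128, stride=2)
out = block(torch.randn(8, 64, 100))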