gcaPDA: a haplotype-resolved diploid assembler

Background Generating chromosome-scale haplotype resolved assembly is important for functional studies. However, current de novo assemblers are either haploid assemblers that discard allelic information, or diploid assemblers that can only tackle genomes of low complexity. Results Here, Using robust programs, we build a diploid genome assembly pipeline called gcaPDA (gamete cells assisted Phased Diploid Assembler), which exploits haploid gamete cells to assist in resolving haplotypes. We demonstrate the effectiveness of gcaPDA based on simulated HiFi reads of maize genome which is highly heterozygous and repetitive, and real data from rice. Conclusions With applicability of coping with complex genomes and fewer restrictions on application than most of diploid assemblers, gcaPDA is likely to find broad applications in studies of eukaryotic genomes. Supplementary Information The online version contains supplementary material available at 10.1186/s12859-022-04591-4.

To resolve haplotypes, de novo assemblers such as FALCON-unzip [9], hifiasm [10] and SDip [11] try to build diploid assembly by distinguishing long reads of different haplotypes based on heterozygous single nucleotide polymorphisms (SNPs). This strategy requires reads error rate to be much lower than genome heterozygous rate. In addition, heterozygous SNPs are not evenly distributed along chromosomes, with long stretch of homozygous region scattered in the genome. When long reads fail to span adjacent heterozygous SNPs, haplotype phasing across this region can't be correctly inferred and lead haplotype switches. Therefore, supplementary data that can assist in long-range phasing is imperative to achieve chromosome-scale haplotype construction. In this regard, parental whole genome shotgun reads (WGS) data [12,13], Hi-C data [14], Strand-seq data [15] and gamete cell data [16,17] have been used to bridge adjacent haplotype blocks.
Trio binning [13] uses WGS data of two parents to partition the progeny's long reads and then assembles paternal and maternal genome respectively. It can provide entirely phased diploid assembly, except for novel mutations that unique to the progeny's genome or heterozygous variants that exist in the trio samples. However, parental samples are not always available, especially in plant and animal studies. Diploid de novo assemblers such as DipAsm [14], PGAS (trio-free Phased Diploid genome Assembly using Strand-Seq) [15], Gamete-binning [16] share a common assembly framework (Additional file 1: Figure  S1), which involving building an initial assembly, calling and phasing SNPs using sequencing reads, partitioning long reads based on phased SNPs and assembling each haplotype genome separately with partitioned long reads. This framework has gain success in genomes of low heterozygosity. However, it is not well suited for species with highly heterozygous genomes, which are quite prevalent scenarios in nature. Highly heterozygous genomes pose great challenge to haploid assembler and usually result in low quality initial assembly with fragmented contigs, abundant mis-assemblies and haplotigs. It is difficult to remove haplotigs thoroughly, while retaining haplotigs in the initial assembly will mess with SNP calling [18] and phasing process and generate poor haplotype-resolved assembly. Hence, it is particularly urgent to develop a diploid assembler that can cope with highly heterozygous genomes.
In this study, we build a diploid assembly pipeline called gcaPDA (gamete cells assisted Phased Diploid Assembler) based on robust programs, that can generate chromosomescale phased diploid assemblies for highly heterozygous and repetitive genomes using PacBio HiFi data, Hi-C data and gamete cell WGS data. gcaPDA offers equivalent performance to the trio-dependent method, with 98% phasing accuracy and > 99% genome completeness. Both of the reconstructed haplotype assemblies generated using gcaPDA have excellent collinearity with their corresponding reference assemblies. Additionally, structural variations between reference genomes, including inversions and InDels, are well-resolved. Having demonstrated its utility with maize and rice genome, it is plausible that gcaPDA can be easily applied to most diploid eukaryotic species and may find broad application in the coming diploid genome era.

Schematic of the gcaDPA assembler
We have developed a diploid assembler, gcaPDA, to generate chromosome-scale phased genome assembly for diploid species. gcaPDA requires PacBio HiFi data and Hi-C data of an individual. In addition, short read WGS data of haploid gamete cells from the same individual are required to assist in long-range phasing. As illustrated in Fig. 1 and Additional file 1: Figure S2, gcaPDA consists of 4 major steps: (1) generating an initial assembly; (2) reconstructing of haplotypes; ((3) partition and normalization of gamete cell reads and (4) generating chromosome-scale phased diploid assembly.

Generating sequencing data for a maize F 1 hybrid
Maize has a very representative large and complex genome [19], which is suitable for testing the performance of gcaPDA. We selected a maize F 1 hybrid by crossing two inbred lines B73 and SK as test sample. High quality reference genomes are available for both parents (B73 [20] and SK [21]) which could be used to benchmark the phased diploid assemblies. In total, 126 Gb and 129 Gb PacBio HiFi reads, with an average read length of 14 Kb, were simulated based on published genome sequences of B73 [20] and SK [21], respectively (Additional file 1: Table S1). Combining simulated B73 and  1  2  3  4  1 2 3 4   5   C  T  C  C  G  T  T   7   C  T  C  C  G  T  T   8   T  C  T  T  A  C  A   6   T  C  T  T  G  T  T   4 Whole genome shotgun sequencing is performed to generate gamete cell short reads. (c) HiFi reads and Hi-C reads are generated by sequencing somatic tissues. (d) An initial assembly is generated by assembling HiFi reads into contigs and scaffolding contigs into superscaffolds with Hi-C data. (e) Short reads of gamete cells are then mapped to the initial assembly and (f) SNPs were identified for each gamete cells. (g) Chromosomal-scale haplotypes of the sequenced individual are reconstructed based on gamete cell SNP arrays using major voting strategy [23], with number of gamete cells that supports adjacent SNP combination were shown on the left. By comparing SNPs of each gamete cell with reconstructed haplotypes, (h) crossovers and (i) haplotype blocks of gamete cells can be determined. Gamete cell reads are (j) partitioned based on haplotype blocks and (k) normalized by k-mer depth to mimic genome coverage distribution of regular parental WGS reads. At last, (l) HiFi reads and partitioned normalized gamete cell reads are then used to construct phased contigs using hifiasm [11] and scaffolded into phased pseudochromosomes with Hi-C data SK PacBio HiFi reads together represents the sequenced reads of the F 1 hybrid. K-mer analysis shows that the heterozygosity, haploid genome size and repeat content of the F 1 hybrid genome are 1.98%, 2.16 Gb and 78% (Additional file 1: Figure S3), respectively, which indicates it's a large, highly heterozygous and repetitive genome.
To assist chromosome-scale assembly, 40 microspores (hereafter referring to as gamete cells) were isolated from 10 tetrads of the F 1 hybrid (Fig. 1a). DNA was extracted from each gamete cell and followed by multiple displacement amplification (MDA) [22] to generate enough DNA for library construction and sequencing [23] (Fig. 1b). Around 45 Gb (~ 20-fold) high quality short reads were generated for each gamete cell (Additional file 1: Table S2). In addition, 427 Gb Hi-C data were also generated from root tissues of the F 1 hybrid ( Fig. 1c and Additional file 1: Table S3).

Generating an initial assembly for the maize F 1 hybrid
The simulated HiFi reads were assembled into contigs using haploid assembler FAL-CON [9] (Fig. 1d). The FALCON assembly is comprised of 2903 Mb primary contigs and 685 Mb alternative contigs, with a contig N50 of 1.3 Mb. The total length of primary contigs was larger than both the reference genomes of SK (2161 Mb) and B73 (2106 Mb), indicating there are ~ 742 Mb haplotigs in the primary contigs. Of these, 432 Mb haplotigs can be tagged by purge_haplotigs [18], while the remaining 310 Mb haplotigs are undetectable (Additional file 1: Table S4). Since haplotigs in the primary contigs are hard to remove thoroughly, we choose to keep all of them and generated an initial assembly by scaffolding all the primary contigs into super-scaffolds with Hi-C data (Fig. 1d). The longest 10 supper-scaffolds (corresponding to 10 chromosomes) covers 97.5% of the initial assembly.

Reconstruction of chromosome-scale haplotypes for the maize F 1 hybrid
Short reads of gamete cells were mapped to the initial assembly and SNPs were identified for each gamete cell (Fig. 1e, f ). Nine out of the 40 gamete cells have abnormal SNP heterozygous rate or missing rate (Additional file 1: Table S5 and Figure S4), which could be caused by contamination of adjacent cells or insufficient genome coverage. These gamete cells were considered to be low-quality and excluded from downstream analyses. In general, 1,387,594 to 2,423,031 high confident SNPs were identified in each gamete cell (Additional file 1: Table S5).
Chromosome-scale haplotypes of the sequenced F 1 hybrid were reconstructed based on the SNPs that located on 10 chromosomes using a major voting strategy with R package Hapi [24] (Fig. 1g). Totally, we obtained 2,721,839 phased SNPs in the reconstructed chromosome-scale haplotypes, which are evenly distributed across chromosomes (Additional file 1: Figures S5 and S6). Haplotype blocks in each gamete cell could be identified by comparing the genotype at each SNP locus with that of the reconstructed chromosome-scale haplotypes (Fig. 1h, i and Additional file 1: Figure S6).

Partition and normalization of gamete cell short reads
Reads of each gamete cell that belonged to haplotype blocks of B73 or SK were extracted and merged separately (Fig. 1j). Reads of each haplotype were normalized by k-mer depth to mitigate the extremely uneven genome coverage caused by MDA procedure (Fig. 1k). After normalization, k-mer depth distribution of the haplotype reads is similar to that of parental WGS data (Additional file 1: Figure S7) and ready for use by triodependent de novo assemblers (HiCanu, hifiasm, etc.).

Generating phased diploid genome assemblies for the maize F 1 hybrid
The phased diploid genome assemblies were generated using hifiasm [10], with simulated HiFi reads and k-mers derived from normalized haplotype reads (Fig. 1l). In total, 2162 Mb contigs were assigned to SK haplotype (HapSK contigs) and 2159 Mb contigs were assigned to B73 haplotype (HapB73 contigs), with a contig N50 of 55.3 Mb and 57.0 Mb, respectively ( Table 1). The hapSK contigs and hapB73 contigs were further scaffolded into chromosomes with Hi-C data, respectively. This chromosome-scale phased diploid assembly, including both hapSK assembly and hapB73 assembly, is referred to gcaPDA assembly hereafter.

Testing the generalizability of gcaPDA on real data of rice F 1 hybrid
We selected a rice F 1 hybrid by crossing two inbred lines MH63 and ZS97 as test sample. Gap-free reference genomes are available for both parents [25] which could be used to benchmark the phased diploid assemblies. In total, 23 Gb PacBio HiFi reads, with an average read length of 14.9 Kb of the rice hybrid was generated (Additional file 1: Table S6). K-mer analysis shows that the heterozygosity and haploid genome size of the F 1 hybrid genome are 0.72% and 386 Mb respectively. To assist chromosome-scale assembly, 24 gamete cells were isolated from 6 tetrads of the rice hybrid. Around 10 Gb (~ 25-fold) high quality short reads were generated for each gamete cell (Additional file 1: Table S7). In addition, 101 Gb Hi-C data were also generated from root tissues of the rice hybrid (Additional file 1: Table S8). Then we perform gcaPDA on rice hybrid same with maize.

Comparing gcaDPA with other methods
To compare the performance of gcaPDA with other methods, we generated a "Hifiasm assembly" with using hifiasm with only HiFi reads, a "Trio assembly" using hifiasm with HiFi reads and simulated parental WGS reads and a "Hifiasm + Hi-C assembly" with using hifiasm with both HiFi and Hi-C reads. Genome assemblies were evaluated in three aspects: contiguity, completeness and accuracy.
Genome contiguity is usually measured with contig N50. All the assembler generated two haplotypic assemblies. The overall contig N50 of gcaPDA is comparable to that of other approaches in with both maize and rice dataset ( Table 1, Additional file 1: Table S9).
Genome completeness were evaluated by k-mer, and BUSCOs [26] (Benchmarking Universal Single-Copy Orthologs), and assembled genome size. All assemblies achieved > 99% k-mer completeness ( Table 1, Additional file 1: Table S9). In addition, all assemblies achieved overall BUSCO completeness comparable to that of the parental genomes. In accordance with the evaluation by k-mer and BUSCOs, assembled genome size of all assemblies are comparable to that of the parental genomes (Table 1, Additional file 1: Table S9). When we looked into the completeness of haplotypic assemblies, we found that duplicated and missing BUSCOs rate are higher in maize Hifiasm hap1/hap2 assembles compared to haplotypic assemblies of other approaches. This might be caused by insufficient phasing and partition of contigs. Phasing accuracy of genome assemblies were evaluated with hapmer. The overall phasing accuracy (positive predictive value, PPV) of the gcaPDA assembly is 98.01% and 98.5% for maize and rice separately, comparable to that of the Trio assembly and Hifiasm + Hi-C assembly ( Table 1, Additional file 1: Table S9) and assemblies reported in recent studies [16,27,28]. Furthermore, we investigated chimeric contigs that contain both SK and B73 hapmer. In Hifiasm assembly, there are many contigs contain hapmers from both haplotypes (Fig. 2, Additional file 1: Figure S8). In comparison, only few contigs in gcaPAD assembly, Trio assembly and Hifiasm + Hi-C contain hapmer from the other haplotype (Fig. 2, Additional file 1: Figure S8). With hapmer, we also detected haplotype blocks from the other haplotype in the hapSK and hapB73 assembly of gcaPDA ( Fig. 3 and Additional file 1: Table S10). It is worth to note that un-anchored sequences are more prone to be mis-assigned to the other haplotype, when comparing with sequences that anchored to chromosomes (Additional file 1: Table S10).

Comparing gcaDPA assembly with parental reference genomes
With reference genome sequences for both parents of the F 1 hybrid available, accuracy of gcaPDA assembly was further evaluated by comparing with parental genomes. The HapSK and HapB73 assembly were compared with SK and B73 reference genomes, respectively. The hapSK assembly covers 99.04% of the SK reference genome, with an average alignment identity of 99.99%, while the hapB73 assembly covers 99.72% of the B73 reference genome, with an average alignment identity of 99.99% (Additional file 1: Table S11). In general, near perfect collinearity was observed between haplotype-resloved assembles and corresponding parental genomes (Additional file 1: Figure S9).
Notablely, gcaPDA could phase the structural variations between B73 and SK genome properly. For example, a large inversion (c. 8 Mb) and a large InDel (c. 3 Mb) between B73 and SK genomes were correctly recovered in the hapSK and hapB73 assembly (Fig. 4a). In addition, we inspected the ZmBAM1d locus 19 which is highly divergent between B73 and SK genomes. We found that the ZmBAM1d locus were also perfectly phased in the gcaPDA assembly, resulting in haplotype sequences identical to respective reference genome (Fig. 4b).

Discussion
Diploid genomes comprise two sets of homologous chromosomes that are slightly different from each other. Current diploid assemblers (such as DipAsm [14], gamete binning [16], PGAS [15]) are tailored for genomes of low complexity and not suitable for heterozygous genomes. In the present study, we developed gcaPDA, a gamete cell-based method which generates chromosome-scale haplotype-resolved diploid assembly for species with highly heterozygous genomes, thus providing access to the genetic variations present in both sets of homologous chromosomes of diploid cells.
In standard trio binning [13], parental data are used to resolve haplotypes. For a highly heterozygous individual, it is natural to assume that its parents are highly heterozygous, too. Notably, phasing information can be fuzzy at sites where all trio-samples have a heterozygous genotype [10]. And parent information is not always available, which limits the application of trio binning method. In contrast, gcaDPA does not require parental information. The gamete cells used to generate input data for gcaPDA are naturally haploid, and preserves unambiguous chromosome-scale phasing information. Thus, gcaP-DA's performance is not affected by the heterozygosity level of the sequenced individual. In DipAsm [14], gamete binning [16], PGAS [15], and gcaDPA, an initial assembly is generated for SNP calling and phasing (Additional file 1: Figure S1). For highly heterozygous genome, the initial assembly may contain haplotigs, which lead to missing SNP calls (false negative) [18]. In addition, SNP calling at repetitive genomic regions are prone to generate false positive calls [29]. Both false positive and negative SNP calls can complicate SNP-based long reads partition in DipAsm, gamete binning, and PGAS methods. Furthermore, partition long reads before assembly process is not the recommended way of diploid assembly [10]. In contrast, gcaPDA partitions gamete cell reads based on haplotype blocks instead of individual SNPs, to increase the tolerance to potential false negative and positive SNP calls. In addition, gcaPDA used all the HiFi reads to construct assembly graphs, with haplotype-specific k-mer derived from gamete cell reads to assist in resolving graph, and generate both haplotype assembly simultaneously (Additional file 1: Figures S1 and S2). Furthermore, the input data required by gcaPDA were generated by standard wet-lab and sequencing approaches which are accessible to researchers. In contrast, gamete binning needs to sequence hundreds of thousands of gamete cell genomes by 10× sc-CNV sequencing [16], while this service is no longer supported by 10× Genomics. The Hi-C data used by DipAsm were generated with four restriction enzymes based on a modified protocol with Arima-HiC kit [14] to achieve uniform per-base coverage of the genome and maintain the highest long-range contiguity signal, while it may be difficult to generate Hi-C data of comparable quality for plant species. Generation of Strand-seq [30] data required by PGAS pose great challenge for non-human species, which limits the application of PGAS.
There are several factors that affect the performance of gcaPDA. First, contiguity and accuracy the initial assembly. The genome of F 1 hybrid is highly repetitive and heterozygous, which poses a great challenge for FALCON and results in abundant short contigs. Short contigs that couldn't be scaffolded into chromosomes won't be phased by gcaPDA, while short contigs wrongly placed during Hi-C scaffolding analysis resulting in misassemblies (false inversion, translocation) in the initial assembly [31], which in turn lead to chunks of gamete cells reads wrongly assigned to the other haplotype. Improving contiguity of the initial assembly with longer HiFi reads, or incorporating optical mapping data to correct mis-placed sequences [32] might mitigate these issues and improve the performance of gcaPDA. Second, number of gamete cells. In this study, 31 gamete cells (~ 20× WGS data for each cell) were used by gcaPDA. According to k-mer coverage accumulation curve (Additional file 1: Figure S10), 20 gamete cells shall suffice. However, with more gamete cells sequenced, phasing errors introduced by mis-placed contigs shall be alleviated and higher phasing accuracy can be achieved by gcaPDA. Third, gcaPDA integrates Hi-C and gamete cells reads to assist genome assembly and phasing resulting in higher computational resources (Additional file 1: Tables 12-13).
All in all, taking the assembly of a real large, highly heterozygous and repetitive maize F 1 hybrid genome as the positive control, we proved that gcaPDA could generate high quality haplotype-resolved and chromosome-scale diploid assembly for diploid species. In contrast to other diploid assemblers, gcaPDA does not rely on paternal information, tremendous amount of gamete cells or special sequencing approaches. As a result, gcaPDA is a good alternative option option to perform haplotype-resolved genome assembly.

Hi-C library construction and sequencing for F 1 hybrid
The maize F 1 hybrid (B73 × SK) seeds were from our own lab in Huazhong Agricultural University. The rice F 1 hybrid (MH63 × ZS97) seeds were provided by Dr. Xiangchun Zhou in Huazhong Agricultural University. The maize F 1 hybrids were planted in 2020 Wuhan, China. The rice F 1 hybrids were planted in 2021 Wuhan, China. Young root tissues of F 1 hybrid were harvested. Hi-C proximity libraries were constructed by the previously described method with restriction enzyme MboI [4]. The libraries were size-selected to retain 350 bp DNA fragments and sequenced on MGISEQ2000 platform (MGI-Tech) to generate 150 bp paired-end reads.

Generating HiFi reads for rice F 1 hybrid
High molecular weight DNA was extracted from young root of the rice hybrid F 1 using modified cetyltrimethylammonium bromide (CTAB) method [33]. PacBio HiFi library was constructed and sequenced with PacBio Sequel II system in BGI-Shenzhen according to manufacture's guidance.

Simulating reads for maize F 1 hybrid
Reference genome sequences of Zea mays var. SK was downloaded from ZEAMAP database (http:// www. zeamap. com/ ftp/ 01_ Genom ics/ Genom es/) [34], while reference genome sequences of Zea mays var. B73 was downloaded from Ensembl Plant database [35] (release 46). Reference genome sequences of MH63 and ZS97 was downloaded from https:// rice. hzau. edu. cn/ rice_ rs3/, In order to lift the contiguity limit capped by the reference sequences, unambiguous bases ('N's) within sequences were removed to generate gapless sequences. Reads were simulated based on SK or B73 gapless sequences. For short reads simulation, wgsim (parameters: − e 0.01) from samtools package [36] (version 1.9) was used. Pbsim [37] (version 1.0.4) was used to simulate PacBio HiFi reads with read length and quality score randomly sampled from the previously published PacBio HiFi data [38].

Genome survey analysis
Genome survey analysis was performed to profile features of the F 1 hybrid genome. Simulated SK, B73 and real rice HiFi reads were broken into k-mer and then counted by Jellyfish [39] (version 2.2.10, k = 21). Genome features such as haploid genome size, heterozygosity, repreat content were estimated using genomescope [40] (version 1) with k-mer frequencies outputted by Jellyfish.

Single cell isolation, DNA extraction, amplification and sequencing
The seeds of F 1 hybrid were planted and then immature tassels were harvested before they had emerged. Gamete cells (microspores) were isolated from tetrads as described in a previous study [23]. DNA was extracted from each gamete cell using QIAGEN REPLI-g Single Cell Kit (Cat No. 150343), followed by multiple displacement amplification (MDA) [22] procedure to generate enough DNA for downstream experiments. A sequencing library was constructed for each gamete cell and then to be sequenced on MGISEQ2000 platform (MGI-Tech) to generate 150 bp paired-end reads.

Reconstruction of haplotypes and gamete cell reads partition
The raw reads of gamete cell were preprocessed to filter adapter sequences and lowquality reads using bbduk.sh (parameters: ktrim = r k = 17 mink = 7 hdist = 1 tpe tbo qtrim = rl trimq = 15 minlength = 80) from BBTools package (https:// sourc eforge. net/ proje cts/ bbmap/). Reads that passed quality filtering were then mapped to the initial assembly using bowtie2 [44] (version 2.3.5.1), with parameters "-X 800". Single nucleotide polymorphisms (SNPs) were identified using bcftools [45] 'mpileup' (version 1.8,parameters: -d 500 -q 10 -ff SECONDARY), followed by bcftools 'call' (parameters: -Ob -cv -p 0.01) implementation. Only bi-allelic SNPs, with quality value ≥ 20 and allele frequency between 0.3 and 0.7 were selected. SNPs that located adjacent to (< 5 bp) InDels were also filtered. Furthermore, SNPs in each cell sample with supporting reads depth < 5 were replaced with 'NA' and treated as missing calls. Heterozygous rate and missing rate were calculated for each cell based on SNP calls. Since we sequenced haploid single cells, most of the SNPs identified in each cell should be homozygous, with lots of missing calls due to insufficient read coverage. Cells with abnormal level of heterozygous rate of SNPs (> 5%) or missing call rate (< 30% or > 70%) indicate contamination or insufficient sequencing coverage, hence were considered as low-quality cells and excluded from downstream analyses.
The paternal and maternal haplotypes were reconstructed based on SNP array of qualified cells using Hapi [24]. By comparing genotypes of each gamete cell with reconstructed parental haplotypes, haplotype blocks can be identified for each cell. Gamete cell reads that mapped to haplotype blocks with the same parental origin were extracted and merged. Due to MDA procedure, the read depth of each cell is unevenly distributed across the genome. To mitigate this issue, merged reads of each haplotype were normalized by k-mer depth using BBnorm.sh (https:// sourc eforge. net/ proje cts/ bbmap/), to mimic regular whole genome sequencing (WGS) reads. The normalized short reads were then used as parental WGS data by diploid de novo assembler.

De novo genome assembly
The normalized short reads of gamete cells of each haplotype were broken into k-mers using yak (https:// github. com/ lh3/ yak), respectively. HiFi reads together with parental k-mers, were de novo assembled using hifiasm [10] to generate phased diploid genome assembly. The haplotype-resolved contigs were linked into superscaffolds using Hi-C reads, with methods described above. The final results of gcaPDA were two haplotype-resolved chromosome-level assemblies: hapSK assembly and hapB73 assembly, hapMH63 and hapZS97 assembly. This final assembly was referred to as gcaPDA assembly in the main text.
In the same time, another haplotype-resolved diploid assembly were generated with HiFi reads and simulated parental WGS short reads using hifiasm. This assembly was referred to as Trio assembly.
A pseudo-haplotype assembly was generated using hifiasm, with only HiFi reads. This assembly was referred to as Hifiasm assembly.
A haplotype-resolved assembly was generated using hifiasm with HiFi and Hi-C reads. This assembly was referred to as Hifiasm + Hi-C assembly.

Evaluation of genome assemblies
We broke the gapless reference genomes of B73 and SK, MH63 and ZS97 into k-mer using meryl utility from Merqury package [46] (release 20200430). Total k-mer set and haplotype-specific k-mer set (hapmers) were computed based on B73 and SK k-mer set and MH63 and SK k-mer set. Genome completeness of each assemblies was evaluated with total k-mer set and phasing accuracy were evaluated with hapmers using Merqury package [46].
Whole genome sequence comparison between assemblies and reference genomes were performed using nucmer and visualized using mummerplot from MUMMER package [47] (version 4.0). Coverage and identity of the alignments were calculated using dnadiff implementation from MUMMER package. Only alignments span > 10 Kb were counted.