Preliminary observations
The genome size of our benchmark organism was estimated as 102 Mb based on k-mer frequencies (see Methods). The k-mer spectrum shows two peaks (Additional file 1: Figure S1): the first one, around 45X, for heterozygous k-mers (found in only one haplotype); and the second peak, around 90X, for homozygous k-mers (identical in both haplotypes). As Adineta vaga has a mid-range heterozygosity, the number of distinct heterozygous k-mers is higher than the number of homozygous k-mers, making a haploid assembly more difficult to obtain than with low-heterozygosity genomes.
We initially ran each assembler (Table 1) five times on our complete and filtered Nanopore and PacBio datasets, as we had observed that there were some discrepancies (assembly size, N50) in the outputs when running several times Flye, NextDenovo, Shasta and wtdbg2. We found that the assembly size, N50 and BUSCO scores of the resulting assemblies were very similar to each other (Additional file 1: Figures S2–S3). As a result, we chose randomly one replicate assembly from each assembler and used this replicate for the subsequent haplotype-purging step.
To represent assembly statistics in a comprehensive manner, we defined four scores (see Methods): size, that is 1 minus the distance of the assembly size to the expected genome size; N50; completeness, a combined metric that includes both the single-copy BUSCO score and the distance of the observed k-mer completeness to the expected value of 50%; haploidy, computed using HapPy.
Assemblies of PacBio reads
Full assembly statistics are available in Additional file 1: Figure S4 and Additional file 1: Table S1–S3, whereas a summary of the results is presented in Fig. 1.
Raw assemblies
All raw assemblies of the full PacBio datasets were oversized compared to the estimated genome size of 102 Mb, ranging from 114 Mb (wtdbg2) to 169 Mb (Canu) (Fig. 1a, raw assemblies). The NextDenovo assembly obtained the highest N50 score with a value of 8.9 Mb, while other assemblies had an N50 ranging from 301 kb (Canu) to 2.7 Mb (Flye). The Canu assembly had the lowest completeness score, as its k-mer completeness greatly exceeded the expected value of 50% (73.5%) and its number of single-copy BUSCOs was only 538, compared to a highest value of 687 features (NextDenovo). The wtdbg2 and Ra assemblies scored the highest according to the haploidy metrics (0.90), while Canu obtained the lowest score (0.59).
Larger assembly sizes correlated with highly bimodal coverage distributions (Fig. 1b, raw assemblies). This was particularly the case with Canu assemblies, which exhibited two large peaks plus a smaller, low-coverage peak. The high-coverage peak, around 210X, was the collapsed peak C whereas the 100X peak, at about half-coverage of the C peak, corresponded to the uncollapsed peak U. In the case of the Canu assembly, the U peak was larger than the C peak, revealing a poor collapsing of highly heterozygous regions. The Flye, NextDenovo, Raven and Shasta assemblies also exhibited two peaks in their coverage distribution, although their U peak was smaller than the one of Canu. The Ra, Raven, Shasta, and wtdbg2 assemblies exhibited an additional low-coverage peak identified as contaminants (see Additional file 1: Figures S5–S11).
Read filtering
We filtered PacBio reads using two strategies: either keeping reads longer than 15 kb; or filtering reads with Filtlong [27] based on quality (in priority) and length. These filtered datasets resulted in assemblies closer to the expected size than assemblies of all reads for Canu, NextDenovo, Ra, Raven and Shasta (Fig. 1a, read filtering). In the case of NextDenovo, filtering based on length made the N50 assembly drop to a value comparable with other assemblies (from 8.9 to 1.8 Mb), while the N50 did not decrease for the Filtlong dataset. Most assemblies maintained their completeness score with both strategies, and it even increased for some (Canu, Flye, Raven). As for the coverage distribution (Fig. 1b, read filtering), the Ra assembly no longer showed a low-coverage contaminant peak. For the NextDenovo assembly the U peak was absent when selecting the longest reads, but remained with the Filtlong dataset. The contaminant peak was also removed for the Raven assembly, and the U peak was reduced. The Shasta assembly had no U peak with the longest reads, but the assembly was shorter than expected (89 Mb) and had a poor completeness score.
Haplotig purging
When adding a post-assembly haplotig-purging step, we observed strikingly different results depending on the combination of assembler and post-assembly purging tool, namely HaploMerger2, purge_dups, purge_haplotigs. While HaploMerger2 reduced the size of all assemblies (resulting in higher size scores on Fig. 1a), it also led to a decrease of the completeness score of all assemblies, except for Canu (Fig. 1a, HaploMerger2). Nevertheless, the haploidy scores all increased (with a minimum of 0.84 for Canu and a maximum of 0.92 for Ra and wtdbg2) as U peaks decreased drastically in all coverage distributions (Fig. 1b, HaploMerger2). Assemblies purged with purge_dups were all closer to the expected size of 102 Mb, and the N50 and completeness scores were maintained or even improved (Fig. 1a, purge_dups). The haploidy scores were also improved as they ranged from 0.89 (Canu and Flye) to 0.91 (Ra and wtdbg2). The coverage distributions showed that the U peaks were removed or at least reduced, but the low-coverage peaks were not (Fig. 1b, purge_dups). After purging with purge_haplotigs, all assembly sizes were closer to the expected size except for Flye, and the N50 and completeness scores were maintained or even improved (Fig. 1a, purge_haplotigs). The haploidy scores were improved for Canu, NextDenovo and Shasta. The coverage distributions showed a reduction of the U peak for the Canu, NextDenovo and Shasta assemblies, explaining their higher haploidy scores (Fig. 1b, purge_haplotigs). The low-coverage peaks were removed for Ra, Raven, Shasta and wtdbg2, explaining their smaller assembly size.
Combination of read filtering and haplotig purging
The combination of read filtering and haplotig purging resulted in assemblies that almost all had a unimodal coverage distribution (Additional file 1: Figure S12–S13). As observed with assemblies of all reads, HaploMerger2 seemed to overpurge assemblies of the filtered datasets. Only the Canu assembly remained above the expected size, but its statistics were similar to those obtained with the Canu assembly of all reads. Assemblies purged with purge_dups or purge_haplotigs were closer to the estimated size and exhibited high numbers of single-copy BUSCO features. The combination of pre-assembly read filtering and post-assembly purge_dups seemed beneficial for most assemblers with the exception of Shasta, as the resulting assemblies ranged in haploidy from 0.90 (Canu and Flye) to 0.97 (NextDenovo). By contrast, the improvements observed when using a combination of read filtering and purge_haplotigs were similar to those obtained with either one of the two. NextDenovo, Ra and wtdbg2 assemblies had satisfying assembly sizes (from 99 to 107 Mb), high haploidy scores (from 0.85 to 0.96) (Additional file 1: Table S2) and unimodal coverage distributions, but similar scores were also obtained using only read selection for NextDenovo and Ra and using only purge_haplotigs for wtdbg2.
Combination of haplotig-purging tools
The combination of purge_dups or purge_haplotigs with HaploMerger2 resulted in problems similar to those observed on assemblies purged only with HaploMerger2: except for Canu, assemblies were shorter than expected and the number of single-copy BUSCO features dropped (Additional file 1: Figure S14). Assemblies purged with both purge_dups and purge_haplotigs were all reduced in size, but none went below the expected size. The BUSCO score and the k-mer completeness were stable, and the haploidy scores ranged from 0.88 (Canu and Raven) to 0.92 (NextDenovo) (Additional file 1: Table S3). The coverage distributions were all close to a unimodal distribution, as there were no low-coverage contigs and the U peaks had mostly disappeared.
Assemblies of Nanopore reads
Full assembly statistics are provided in Additional file 1: Figure S15 and Additional file 1: Table S4–S6.
Raw assemblies
Similarly to the PacBio assemblies, Nanopore assembly sizes exceeded the expected size of 102 Mb, ranging from 118 Mb (Shasta, wtdbg2) to 154 Mb (Canu) (Fig. 2a, raw assemblies). Nanopore assemblies achieved a much higher continuity than PacBio assemblies, as PacBio assemblies had a lowest N50 of 301 kb while Nanopore assemblies had a lowest N50 of 1.6 Mb. Canu, NextDenovo, Ra and wtdbg2 achieved a N50 over 5 Mb using Nanopore reads. The number of complete single-copy BUSCOs was lower in Nanopore assemblies (up to 559) than in PacBio assemblies (up to 699). The k-mer completeness was also usually lower for Nanopore assemblies, around 38.6–54.8%, compared to 47.7–73.5% for PacBio assemblies. These lower values for k-mer completeness in Nanopore assemblies were likely not due to a better collapsing but rather to systematic errors. The haploidy scores were higher for Nanopore assemblies produced by Canu, Ra, Raven, Shasta and wtdbg2, in comparison with PacBio assemblies, but this score was lower for Flye and NextDenovo assemblies (Additional file 1: Table S1, Additional file 1: Table S4). The coverage distribution of the Canu assembly exhibited two distinct U (uncollapsed) and C (collapsed) peaks respectively around 75X and 160X, indicating that many haplotypes were not collapsed and were therefore represented twice in the assemblies (Fig. 2b, raw assemblies). This was also the case for the Flye, NextDenovo, Raven and Shasta assemblies, albeit their U peak was smaller than the one of the Canu assembly. The Ra, Raven, Shasta and wtdbg2 assemblies had an additional low-coverage peak identified as contaminants (see Additional file 1: Figures S16–S22).
Read filtering
When assembling a subset of either the longest Nanopore reads (over 30 kb) or reads filtered with Filtlong, the Ra and Raven assemblies had sizes closer to the estimated size and did not exhibit a contaminant peak in their coverage distribution, whereas other assemblies were generally unmodified (Fig. 2, read filtering). The Raven assembly of filtered reads had a smaller U peak compared to the raw assembly, but still present, whereas the U peak of the Ra assemby disappeared. The wtdbg2 assembly of the Fitlong dataset no longer shows a contaminant peak. These improvements came along with increased haploidy scores: from 0.90 to 0.95–0.97 for Ra, and from 0.83 to 0.89–0.92 for Raven. The result produced by Shasta with read filtering was the opposite, as the size and haploidy scores became lower and the U peak became larger with both filtering strategies. This increase in size of the U peak explained the higher k-mer completeness obtained due to a higher percentage of uncollapsed homozygous and heterozygous k-mers (Additional file 1: Figure S23–S24). Overall, read filtering did not affect completeness scores.
Haplotig purging
As we observed with PacBio reads, all assembly sizes were reduced by HaploMerger2, which resulted in higher size scores, and the U peaks were removed from the coverage distributions, but this also led to lower scores for completeness (Fig. 2, HaploMerger2). purge_dups improved size scores while maintaining or increasing completeness scores and removing U peaks (Fig. 2, purge_dups). These improved coverage distributions resulted in higher haploidy scores, ranging from 0.90 (Flye and Raven) to 0.93 (Ra and Raven). purge_haplotigs improved size scores for all assemblies except Flye and also kept high completeness scores (Fig. 2). Haploidy scores were higher for assemblies produced by Canu and NextDenovo, and the coverage distribution shows that U peak were indeed reduced or removed for Canu and NextDenovo assemblies, while the contaminant peaks were removed for Ra, Raven, Shasta and wtdbg2 assemblies. We observed again that HaploMerger2 generally decreased the continuity and quality metrics of the assemblies, while purge_dups and purge_haplotigs did not. Interestingly, the Canu assembly obtained the highest N50 of all the assemblies presented in this paper (12.4 Mb) when raw assemblies were purged with either purge_dups or purge_haplotigs.
Combination of read filtering and haplotig purging
The read-filtered Nanopore assemblies after HaploMerger2 were too short, except for the Canu assembly (Additional file 1: Figure S25–S26). All assemblies of the longest reads after purge_dups were close to the expected genome size (96–109 Mb), except for the wtdbg2 assembly (114 Mb), and the coverage distributions were almost all unimodal, with a haploidy score ranging from 0.87 (Canu) to 0.96 (Ra) (Additional file 1: Table S5). Notably, the Shasta assembly of filtered reads had a strong U peak that was completely removed by purge_dups, which brought the assembly to a size close to our estimation (102 Mb) and resulted in a high number of singe-copy BUSCO features (519 features). purge_haplotigs was apparently less efficient, as the U peak was reduced for the Canu and Raven assemblies but remained high for the Flye, NextDenovo and Shasta assemblies. The BUSCO scores of assemblies purged with purge_dups or purge_haplotigs were similarly high (up to 539 single-copy BUSCO features, Ra).
Combination of haplotig-purging tools
The combination of HaploMerger2 with either purge_dups or purge_haplotigs led to excessively shorter assemblies, except for Canu, as HaploMerger2 tended to overpurge (Additional file 1: Figure S27). The assemblies purged with both purge_dups and purge_haplotigs were all close to the expected size (100–110 Mb), their single-copy BUSCO score was maintained (up to 559 features, Flye assembly), they had a k-mer completeness below 50%, their haploidy ranged from 0.90 (Flye and Raven) to 0.94 (NextDenovo) (Additional file 1: Table S6) and their coverage distribution was unimodal or close to it.
Impact of sequencing depth
We further evaluated the impact of sequencing depths ranging from 10X to 100X on the size, N50, BUSCO score, and haploidy metrics of PacBio and Nanopore assemblies (Fig. 3).
With a 10X sequencing depth, almost all assemblers produced outputs excessively small compared to the expected genome size, and NextDenovo and Shasta performed the worst. However, with wtbdg2 on PacBio and Nanopore reads, and with Flye on PacBio reads, the assembly size was close to the expected size. At 20X, Canu, Flye, Ra, Raven and wtdbg2 reached at least 93 Mb. Assembly sizes increased sharply up to 40X, except for Canu for which the assemblies kept increasing in size with sequencing depth.
While there was no variation in assembly size among replicates, N50s were highly variable for PacBio assemblies with NextDenovo, Flye and wtdbg2, and for almost all Nanopore assemblies other than wtdbg2. For NextDenovo, there was a clear optimal coverage at about 40X in terms of N50 with PacBio as well as Nanopore reads. Other assemblers also showed a decrease in N50 above a certain sequencing depth: for PacBio assemblies, this happened for Raven over 30X and for Shasta and wtbdg2 over 80X; for Nanopore assemblies, this happened for Flye over 30X, for Shasta over 40X, and for Raven over 80X.
The number of single-copy BUSCO features did not vary much among replicate subsamplings; only NextDenovo exhibited some variability in that regard. Over 40X, the number of single-copy BUSCOs became strikingly stable for most assemblers with both PacBio and Nanopore reads, except for Shasta that is tuned for a 60X depth. For Canu assemblies, the number of complete single-copy BUSCOs decreased when sequencing depth exceeded 30X. For wtdbg2 the complete single-copy BUSCOs remained low for the Nanopore dataset at the different sequencing depths.
For most assemblers, haploidy decreased when sequencing depth increased, and this was especially drastic for Canu assemblies. Only the assemblies produced by wtdbg2 had a stable haploidy value, whereas the Ra assemblies of PacBio reads also exhibited limited variations in haploidy.
Computational performance
We evaluated the computational performance of Flye, Ra, Raven, Shasta and wtdbg2. Canu was only evaluated on PacBio reads as the CPU time was too high for Nanopore reads, and we had to run Canu on a cluster. For the full Nanopore dataset, with 32 threads, Canu ran in 7 days 4 h 41 min and used 99 GB. NextDenovo was not included in this comparison as it required high RAM resources and was therefore run on a different machine; we were not able to measure properly its RAM usage or CPU time. RAM usage and CPU time (Fig. 4) increased with sequencing depth, except for wtdbg2 and Canu that did not display much change. wtdbg2 was the assembler that required the lowest amount of RAM (less than 20 GB) and was the second fastest on PacBio reads (less than 53 h) and Nanopore reads (less than 192 h). Flye had the highest RAM usage for Nanopore assemblies (Canu and NextDenovo excluded), with high variability. Notably, Shasta generally required a lot of RAM and had the highest memory footprint for high-coverage PacBio data, but it ran the fastest on PacBio as well as on Nanopore reads. The most recent versions of Flye (2.8) and Raven (1.5) have decreased the CPU time. Raven was already an efficient assembler, and its latest improvements have halved the RAM usage and CPU time, when tested on the full PacBio and Nanopore datasets. Shasta 0.7.0 also decreased its RAM usage and CPU time.