A binning tool to reconstruct viral haplotypes from assembled contigs

Background: Infections by RNA viruses such as influenza and HIV still pose a serious threat to human health despite extensive research on viral diseases. One challenge in developing effective prevention and treatment strategies is high intra-species genetic diversity. Because different strains may have different biological properties, characterizing this genetic diversity is important for vaccine and drug design. Next-generation sequencing technology enables comprehensive characterization of both known and novel strains and has been widely adopted for sequencing viral populations. However, genome-scale reconstruction of haplotypes is still a challenging problem. In particular, haplotype assembly programs often produce contigs rather than full genomes. Because a mutation in one gene can mask the phenotypic effects of a mutation at another locus, these contigs still need to be clustered into genome-scale haplotypes.

Results: We developed a contig binning tool, VirBin, which clusters contigs into groups so that each group represents a haplotype. Commonly used features based on sequence composition and contig coverage cannot effectively distinguish viral haplotypes because of the high sequence similarity between haplotypes and the heterogeneous sequencing coverage of RNA viruses. VirBin applies prototype-based clustering to regions that are more likely to contain mutations specific to a haplotype. The tool was tested on multiple simulated sequencing data sets with different haplotype abundance distributions and contig sizes, and also on mock quasispecies sequencing data. Benchmarks against other contig binning tools demonstrated the superior sensitivity and precision of VirBin in contig binning for viral haplotype reconstruction.

Conclusions: In this work, we presented VirBin, a new contig binning tool for distinguishing contigs from different viral haplotypes with high sequence similarity. It competes favorably with other tools on viral contig binning.
The source code is available at: https://github.com/chjiao/VirBin.


Data simulation details
Fig. S1 sketches the input data sets for the simulated contigs. The details about the contig sets can be found in Table S1. ``Len (longest)'' is the length of the longest contig in a group. ``Genome coverage'' is the percentage of the underlying genomes covered by all the simulated contigs. ``Group ID'' (1000 to 5000) indicates the upper bound of the contig length in each group.

Abundance prediction by VirBin on the assembled contigs
The contigs were produced by the assembly tools SGA and PEHaplo from simulated reads of five haplotypes. The details of the assembled contigs can be found in Table S2.

Group ID        1000  2000  3000  4000  5000
Contig number     61    35    30    25    24

N50 is the largest length L such that the contigs of length at least L together contain at least 50% of all contig bases. Genome coverage (cov.) is the percentage of the reference genomes covered by aligned contigs, and the mismatch rate is the percentage of mismatched bases in the aligned contigs.
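The N50 definition above can be made concrete with a short sketch. The function name `n50` and the example lengths are illustrative only and are not taken from VirBin or from Table S2.

```python
def n50(lengths):
    """Return the N50 of a collection of contig lengths.

    N50 is the largest length L such that contigs of length >= L
    together contain at least half of all contig bases.
    """
    if not lengths:
        raise ValueError("need at least one contig length")
    total = sum(lengths)
    running = 0
    # Walk from the longest contig downwards and stop as soon as the
    # accumulated bases reach half of the total.
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length


# Hypothetical contig lengths (not from the paper).
example_lengths = [2, 3, 4, 5, 6]
print(n50(example_lengths))
```

Here the total is 20 bases; the two longest contigs (6 and 5) already hold 11 bases, so the N50 of this toy set is 5.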
The relative abundance of each haplotype is computed during VirBin's iterative clustering algorithm. Fig. S2 compares the known haplotype abundances with those computed by VirBin when the assembled contigs are used as input. The abundance profiles output by VirBin on both PEHaplo's and SGA's assembled contigs are close to the known haplotype abundance profile. Because MaxBin's results contain many empty clusters, we did not include MaxBin in the abundance comparison.

Results for 10 HIV haplotypes
The simulated 10-haplotype quasispecies contains FJ064, FJ061, FJ065, FJ066, and two simulated haplotypes derived from each of the last three strains. The sequence similarity between each simulated haplotype and its originating sequence is 97%. The average sequence similarity among the 10 haplotypes is around 90.1%. In total, there are 76,974 reads with a sequencing depth of 2000x. The relative abundance of each haplotype is shown in Fig. S3. Using the same simulation method as for the 5 HIV haplotype data set, 25 contigs were generated, covering 89.45% of the 10 haplotypes. The longest contig has a length of 8,846 bp and the N50 is 5,933 bp.

Haplotype number estimation:
Contig alignment and window identification were also applied to the simulated contigs. We sorted the windows in descending order of window length. Of the top 50 windows, 26 contain 10 contigs, 16 contain 9 contigs, and 6 contain 8 contigs. Therefore, the haplotype number of 10 can still be correctly predicted from the consensus window depth.
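The estimate above can be sketched in a few lines. This is a minimal illustration, assuming that "consensus window depth" means the most frequent number of contigs per window among the longest windows. The function name, the `(length, depth)` tuple layout, and the window lengths are hypothetical; only the depth counts (26, 16, 6) come from the text.

```python
from collections import Counter


def estimate_haplotype_number(windows, top_k=50):
    """Estimate the haplotype number from (window_length, depth) pairs.

    Assumption: the consensus depth is the most common depth among the
    top_k longest windows; ties are broken toward the larger depth.
    """
    top = sorted(windows, key=lambda w: w[0], reverse=True)[:top_k]
    depth_counts = Counter(depth for _, depth in top)
    best_depth, _ = max(depth_counts.items(), key=lambda kv: (kv[1], kv[0]))
    return best_depth, depth_counts


# Depth counts reported in the text; window lengths are made up.
windows = (
    [(900 - i, 10) for i in range(26)]
    + [(800 - i, 9) for i in range(16)]
    + [(700 - i, 8) for i in range(6)]
)
estimate, counts = estimate_haplotype_number(windows)
print(estimate, dict(counts))
```

With the reported counts, depth 10 is the most frequent value, which matches the predicted haplotype number.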
Clustering results:
The clustering results of VirBin are shown in Table S3. The 3 least abundant haplotypes have highly similar abundances (~3%) and have the lowest recall and precision values. The results were also compared with MaxBin. 10 seed contigs from each haplotype were randomly selected and provided to MaxBin. It classified 25 out of 34 contigs, leaving 9 unclassified. MaxBin correctly classified all contigs from FJ064 and FJ061-h1, and one contig from FJ061-h2. It assigned three contigs to the FJ066 cluster, one of which was correct. Most of the other contigs were not appropriately clustered. The results of MaxBin are shown in Table S3 as well. We again tried StrainPhlAn and ConStrain on this simulated data set, but still no reads could be mapped to the available reference genes.
Fig. S3 compares the true abundance distribution with the outputs of VirBin and MaxBin. VirBin's output is closer to the ground truth than MaxBin's.

Additional results for the mock data set
For the mock data experiment, we also present the recall and precision at the contig level in Table S4. At this level, recall quantifies how many of the contigs originating from one haplotype are correctly grouped in the corresponding cluster, and precision quantifies how many of the contigs in a cluster originate from the corresponding haplotype. Our program is still generally more accurate than MaxBin.
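The contig-level definitions above can be illustrated with a small sketch. It assumes that each cluster has already been labeled with its corresponding haplotype, and that unclassified contigs are simply absent from the prediction. The function name, the contig IDs, and the labels "A" and "B" are hypothetical and do not come from Table S4.

```python
def contig_level_metrics(truth, predicted):
    """Compute per-haplotype (recall, precision) at the contig level.

    truth:     dict mapping contig id -> true haplotype
    predicted: dict mapping contig id -> haplotype label of its cluster
               (unclassified contigs are omitted)
    """
    metrics = {}
    for hap in sorted(set(truth.values())):
        true_contigs = {c for c, label in truth.items() if label == hap}
        cluster = {c for c, label in predicted.items() if label == hap}
        correct = len(true_contigs & cluster)
        recall = correct / len(true_contigs)
        # An empty cluster has no correct members; report precision as 0.
        precision = correct / len(cluster) if cluster else 0.0
        metrics[hap] = (recall, precision)
    return metrics


# Toy example: c2 is placed in the wrong cluster and c4 is unclassified.
truth = {"c1": "A", "c2": "A", "c3": "B", "c4": "B"}
predicted = {"c1": "A", "c2": "B", "c3": "B"}
for hap, (recall, precision) in contig_level_metrics(truth, predicted).items():
    print(hap, recall, precision)
```

In this toy case haplotype A has recall 0.5 and precision 1.0, while haplotype B has recall 0.5 and precision 0.5, because its cluster contains one contig that belongs to A.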