Accurate assembly of full-length consensus for viral quasispecies
BMC Bioinformatics volume 26, Article number: 36 (2025)
Abstract
Background
Viruses can inhabit their hosts in the form of an ensemble of various mutant strains. Reconstructing a robust consensus representation for these diverse mutant strains is essential for recognizing the genetic variations among strains and for studying virulence, pathogenesis, and therapy selection. Virus genomes are typically small, often composed of only a few thousand to several hundred thousand nucleotides. Although constructing a high-quality consensus of virus strains might therefore seem feasible, most current assemblers generate only fragmented contigs. Assembling a single full-length consensus contig matters because it is vital for identifying genetic diversity and estimating strain abundance accurately.
Results
In this paper, we developed FC-Virus, a de novo genome assembly strategy specifically targeting highly diverse viral populations. FC-Virus first identifies the k-mers that are common across most viral strains, and then uses these k-mers as a backbone to build a full-length consensus sequence covering the entire genome. We benchmark FC-Virus against state-of-the-art genome assemblers.
Conclusion
Experimental results confirm that FC-Virus can construct a single, accurate full-length consensus, whereas other assemblers only manage to produce fragmented contigs. FC-Virus is freely available at https://github.com/qdu-bioinfo/FC-Virus.git.
Introduction
Viruses replicate their genetic material rapidly within host cells, resulting in a high mutation rate. As they undergo multiple replication cycles, different mutant strains emerge, each with genomes that are closely related but slightly distinct [1]. This group of genetically related mutant strains is known as a viral quasispecies. The genetic variations among mutant strains primarily manifest as Single Nucleotide Polymorphisms (SNPs) and Indels (insertions or deletions) [2]. These genetic differences significantly influence the varying degrees of virulence, transmissibility, and drug resistance observed among different viral strains. Research has shown that viruses often exist within their hosts as viral quasispecies. Reconstructing a full-length consensus for a viral quasispecies can yield a high-quality reference genome. This helps in characterizing viral strains, enhancing our understanding of the virus’s pathogenesis, and supporting vaccine development.
Advancements in high-throughput sequencing (HTS) have made it possible to reconstruct full-length genomes [3]. HTS technology can generate numerous short fragments (known as reads) from various mutant virus strains. In principle, these short reads should allow for the reconstruction of a full-length consensus genome for virus populations, as most viral genomes are typically just a few thousand to several hundred thousand nucleotides long. However, current assemblers struggle with this task due to sequencing biases and errors, varying levels of strain abundance, repetitive segments, and substantial discrepancies in length between the reads and the actual genome.
Numerous well-established generic assembly algorithms are recognized for their efficacy in genome assembly across diverse sequencing datasets and applications. These algorithms commonly utilize graph models, such as overlap graphs and de Bruijn graphs, and reconstruct the genome by identifying a path within these graphs. Assemblers such as Celera [4], Newbler [5], Arachne [6], CAP3 [7], TIGR [8], PCAP [9], AMOS [10], Phrap [11], and Phusion [12] were developed using the overlap graph as their foundational model [13]. Each vertex in an overlap graph represents a read sequence, and an edge connects two vertices if the suffix of one read matches the prefix of the other. Assemblers like SPAdes [14], AllPath [15], ABySS [16], Velvet [17], SOAPdenovo2 [18], IDBA [19], EulerUSR [20], SKESA [21], and Ray [22] model the genome assembly problem by seeking an Eulerian path within a de Bruijn graph. The de Bruijn graph treats each k-mer (a substring of length k from a read) as a vertex and connects two vertices when the suffix of length k−1 of one k-mer matches the prefix of length k−1 of the other. Compared to the overlap graph, the de Bruijn graph generally requires fewer computational resources for construction and storage. However, it involves a larger number of vertices and fails to capture longer overlaps between reads, which increases the graph’s complexity and makes path extraction more challenging.
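As a minimal sketch of the k-mer-as-vertex de Bruijn graph described above, the following Python code draws an edge from u to v whenever the suffix of length k−1 of u equals the prefix of length k−1 of v. The function names and the toy reads are illustrative assumptions and are not taken from any of the cited assemblers.

```python
from collections import defaultdict


def kmers(read, k):
    """Yield every substring of length k from a read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k]


def build_de_bruijn(reads, k):
    """Return adjacency sets: u -> v when u[1:] == v[:-1]."""
    nodes = {km for read in reads for km in kmers(read, k)}
    # Index k-mers by their (k-1)-prefix so successors are found in O(1).
    by_prefix = defaultdict(set)
    for km in nodes:
        by_prefix[km[:-1]].add(km)
    return {km: set(by_prefix.get(km[1:], ())) for km in nodes}


# Toy input (hypothetical reads, k chosen small for readability).
graph = build_de_bruijn(["ACGTAC", "CGTACG"], k=4)
```

Real assemblers store such graphs far more compactly; the dictionary of sets is used here only to keep the example readable.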
However, these assemblers often struggle with viral genome assembly because they are not specifically designed for reconstructing viral genomes [23]. Due to the complexity and large size of genomes, many assemblers focus on constructing relatively unambiguous subsequences, or contigs. Consequently, even with smaller viral genomes, these assemblers often struggle to produce a full-length consensus sequence, leading to fragmented contigs. These fragmented contigs represent partial sequences from different strains but do not together form a full-length genome of any single strain or match the combined length of all strain genomes. This limitation indicates that fragmented contigs are insufficient as a comprehensive reference for all strains and fail to accurately represent the genome of each individual strain.
The assembly of viral quasispecies involves two main tasks: consensus assembly and strain-level assembly. Consensus assembly focuses on creating a reference genome that closely represents all the strains, capturing a high level of similarity across them. In contrast, strain-level assembly aims to individually reconstruct the genome of each specific strain. VICUNA is designed for reconstructing a robust consensus from ultra-deep sequencing data [24]. It employs an overlap-layout-consensus-based assembly algorithm to generate a consensus sequence through the following steps: read trimming, contig construction and clustering, contig validation, and contig extension and merging. VICUNA is among the few tools designed specifically for assembling highly diverse viral populations. Algorithms such as SAVAGE [25], viaDBG [26], PEHaplo [27], Virus-VG [28], VG-Flow [29] and VStrains [23] are designed for assembling strain-level genomes. Due to the high mutation rates and varying expression levels among strains, these algorithms often produce fragmented contigs rather than complete viral strains, or they combine contigs and fail to capture the differences between strains.
In this paper, we introduce FC-Virus, a novel de novo assembly method designed to reconstruct a full-length consensus sequence from viral quasispecies. The key contributions of this study are as follows:
- (i) Proposal of a concept and methodology for identifying homologous k-mers, which are k-mers shared across multiple strains.
- (ii) Development of a de novo assembly strategy that uses homologous k-mers as a backbone, enabling the reconstruction of a full-length consensus sequence.
- (iii) Evaluation of the FC-Virus algorithm against state-of-the-art assemblers using both simulated and real datasets. Experimental results show that FC-Virus produces a single consensus sequence that matches or even surpasses the results of multiple fragmented contigs from other assemblers.
Method
FC-Virus takes high-throughput reads as input and aims to reconstruct a full-length consensus sequence that maximally incorporates the characteristics of all strains [30]. It first extracts k-mers from the reads and identifies homologous k-mers according to k-mer abundance. FC-Virus regards any read containing at least two homologous k-mers as a homologous read, and it glues homologous reads together to form a consensus sequence. Finally, it extends and refines the consensus using a greedy strategy. The general workflow of the FC-Virus algorithm is illustrated in Fig. 1. Details of each step are explained in the following sections.
Identification of homologous k-mers
The genomes of different strains of the same virus are highly similar, so many k-mers are shared among most strains. We refer to these k-mers as homologous k-mers. FC-Virus counts the occurrences of k-mers (by default, \(k=25\)) and plots how many k-mers fall at each abundance level as a frequency distribution graph. According to the definition of homologous k-mers, they are expected to cluster around the final peak in the abundance graph when the strain genomes contain no long repeated sequences. In cases where peaks are poorly defined or very few homologous k-mers are identified, we select the top \(x\%\) of k-mers by frequency, with x typically set to 30 as an empirical value. Under this strategy, homologous k-mers may not be present in every strain but are likely to be common across most strains. With these k-mers as anchors, a consensus sequence can be assembled.
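The counting step and the top-x% fallback can be sketched as follows. This is a simplified illustration rather than the FC-Virus implementation: the function names, the toy reads, the small k, and the choice to rank distinct k-mers by count (ties broken alphabetically) are assumptions, and the peak-based selection is omitted.

```python
from collections import Counter


def count_kmers(reads, k=25):
    """Count every k-mer occurrence across all reads."""
    counts = Counter()
    for read in reads:
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


def abundance_histogram(counts):
    """Map abundance -> number of distinct k-mers with that abundance."""
    return Counter(counts.values())


def top_fraction(counts, x=30.0):
    """Keep the x% most abundant distinct k-mers (always at least one)."""
    ranked = sorted(counts, key=lambda km: (-counts[km], km))
    keep = max(1, int(len(ranked) * x / 100))
    return set(ranked[:keep])


# Toy input: three hypothetical reads, k = 4 for readability.
counts = count_kmers(["ACGTACGT", "ACGTACGA", "TCGTACGT"], k=4)
histogram = abundance_histogram(counts)
anchors = top_fraction(counts, x=30.0)
```

On real data the histogram would be inspected for its last peak first, and the percentage cut-off would only be used when that peak is poorly defined.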
Fig. 1 The workflow of the FC-Virus algorithm. (a) Homologous k-mer extraction. FC-Virus extracts k-mers from reads and plots their frequency distribution. It then builds a de Bruijn graph from the k-mers in the last peak interval. If this graph lacks a large connected component, the k-mers are classified as homologous. Otherwise, it evaluates the preceding peaks. (b) Consensus assembly. FC-Virus stitches together reads using homologous k-mers and determines the base composition of divergent regions through a voting process, resulting in a preliminary consensus. (c) Consensus refinement. FC-Virus extends the consensus using previously unused k-mers through a greedy strategy.
To mitigate the influence of repeated sequences, FC-Virus builds a de Bruijn graph using the k-mers from the final peak range. If this graph contains a large connected component (i.e., the majority of k-mers within the peak interval can be linked together), the last peak results from repeats rather than from true genomic similarities. In this case, FC-Virus iteratively checks the adjacent peaks to the left until it encounters one caused by genomic similarities. This strategy stems from the observation that subsequences shared among strains are scattered throughout the genome and rarely aggregate in a single location. Consequently, the de Bruijn graph constructed from homologous k-mers is often highly fragmented and unlikely to contain a large connected component.
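One way to test whether the peak k-mers "can be linked together" is to measure the largest weakly connected component of the graph they induce. The sketch below is an assumption-laden illustration: the function names are hypothetical, and the 50% threshold used for "majority" is a placeholder, since the exact cut-off is not stated here.

```python
from collections import defaultdict


def largest_component_fraction(kmer_set):
    """Fraction of k-mers in the largest weakly connected component,
    linking two k-mers when they overlap by k-1 bases."""
    by_prefix = defaultdict(list)
    for km in kmer_set:
        by_prefix[km[:-1]].append(km)
    # Build an undirected adjacency structure from the directed overlaps.
    adj = {km: set() for km in kmer_set}
    for km in kmer_set:
        for nxt in by_prefix.get(km[1:], []):
            if nxt != km:
                adj[km].add(nxt)
                adj[nxt].add(km)
    seen, best = set(), 0
    for start in kmer_set:
        if start in seen:
            continue
        # Iterative depth-first search over one component.
        stack, size = [start], 0
        seen.add(start)
        while stack:
            node = stack.pop()
            size += 1
            for nb in adj[node]:
                if nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
        best = max(best, size)
    return best / len(kmer_set) if kmer_set else 0.0


def looks_like_repeat(kmer_set, majority=0.5):
    """Assumed rule: a peak is repeat-driven if most of its k-mers chain together."""
    return largest_component_fraction(kmer_set) > majority
```

A chain such as ACGT, CGTA, GTAC, TACG forms a single component and would be flagged, whereas k-mers scattered across the genome share no k−1 overlaps and would be accepted as homologous.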
Suppose there are a total of m reads, each of length L, which together generate n distinct k-mers. Then we have \(n \le \min \{m(L-k+1), 4^k\}\). The space required to store these reads and k-mers is \(O(m+n)\). Since L and k are constants and \(n \le m(L-k+1)\), and since the number of homologous k-mers is less than the total number of k-mers, the space complexity simplifies to O(m). In addition, extracting the k-mers takes \(O(m(L-k+1))\) time, and identifying homologous k-mers requires scanning all k-mers, which takes O(n) time. As a result, the overall time complexity is O(m).
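For concreteness, the bound can be instantiated with the default \(k=25\) and, as an assumed example, reads of length \(L=250\) (the read length reported for the HIV-LABMIX dataset):

```latex
\[
n \;\le\; \min\{\, m(L-k+1),\; 4^{k} \,\}
  \;=\; \min\{\, 226\,m,\; 4^{25} \,\},
\qquad 4^{25} = 2^{50} \approx 1.13 \times 10^{15},
\]
\[
O\bigl(m(L-k+1)\bigr) + O(n) \;=\; O(226\,m) + O(226\,m) \;=\; O(m).
\]
```

Because \(4^{25}\) is far larger than \(226\,m\) for any realistic number of reads, the first term of the minimum is the binding one in this setting.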
Consensus assembly
FC-Virus regards reads containing at least two homologous k-mers as homologous reads, and it reconstructs a consensus sequence through the following steps.
Step 1: Let the set R denote all homologous reads. FC-Virus first selects from R the homologous read that contains the highest number of homologous k-mers as the seed read r \((r \in R)\). It then uses steps 2 and 3 to refine the sequence composition between the first and the last homologous k-mer in the seed read r. This refined sequence is regarded as the initial consensus sequence. The consensus sequence is subsequently extended using step 4 and refined iteratively with steps 2 and 3 until no unused homologous reads remain.
Step 2: FC-Virus scans the homologous k-mers in the seed read r from left to right, looking for homologous reads that contain two adjacent homologous k-mers from r simultaneously, and aggregates these reads together. For two adjacent k-mers \(k_{i}\) and \(k_{i+1}\) in the seed read r, FC-Virus extracts a subset \(R_{i}\) from R. Each read \(r_{j}^{i}\) \((r_{j}^{i} \in R_{i})\) in this subset contains both homologous k-mers \(k_{i}\) and \(k_{i+1}\).
Step 3: FC-Virus determines the sequence composition between adjacent homologous k-mers \(k_{i}\) and \(k_{i+1}\) based on the read set \(R_{i}\). It distinguishes two scenarios (as shown in Fig. 2). In the first case, \(k_{i}\) and \(k_{i+1}\) overlap; in the second case, they do not. If an overlap exists between \(k_{i}\) and \(k_{i+1}\), FC-Virus integrates them seamlessly into the consensus sequence. For example, consider the seed read CACTTGCCGTACGGT, with CTTGC and TGCCG as two neighboring homologous k-mers. In this case, the consensus sequence spanning these k-mers is CTTGCCG. When there is no overlap between \(k_{i}\) and \(k_{i+1}\), FC-Virus selects the most frequently occurring sequence composition from \(R_{i}\). For instance, suppose the seed read is CACTTGCCGTACGGT, ACTTG and TACGG are two adjacent k-mers \(k_{i}\) and \(k_{i+1}\), and the set \(R_{i}\) consists of 3 reads: CACTTGCCGTACGGT, ACTTGCCATACGGTC, and GACTTGCCGTACGGA. There are then two types of sequences between \(k_{i}\) and \(k_{i+1}\): ACTTGCCGTACGG and ACTTGCCATACGG, supported by 2 reads and 1 read, respectively. FC-Virus chooses the sequence ACTTGCCGTACGG, which is supported by the most reads, as the consensus in the region between \(k_{i}\) and \(k_{i+1}\).
Step 4: FC-Virus extends the consensus sequence by selecting an unused homologous read that contains the last two homologous k-mers \(k_{n-1}\) and \(k_{n}\) of the seed r. If no such read is available, FC-Virus relaxes the criterion and requires only that the new seed contain \(k_{n}\). In either case, the new seed must contain at least one unused homologous k-mer to the right of \(k_{n}\). The seed read r is updated to this newly selected read, and the sequence of the new seed r is then refined using steps 2 and 3.
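Steps 2 and 3 can be illustrated on the toy reads used in the example of step 3. The following is a simplified sketch rather than the FC-Virus implementation: the function names are hypothetical, each k-mer is assumed to be located by its first occurrence in a read, and ties between equally supported sequences are broken arbitrarily.

```python
from collections import Counter


def reads_with_both(reads, k_i, k_next):
    """Step 2: subset R_i of reads containing k_i followed by k_next."""
    subset = []
    for read in reads:
        a = read.find(k_i)
        b = read.find(k_next, a + 1) if a != -1 else -1
        if b != -1:
            subset.append(read)
    return subset


def span_between(read, k_i, k_next):
    """Sequence from the start of k_i to the end of k_next within one read."""
    a = read.find(k_i)
    b = read.find(k_next, a + 1)
    return read[a:b + len(k_next)]


def vote(reads, k_i, k_next):
    """Step 3: pick the span supported by the most reads in R_i.

    Assumes at least one read contains both k-mers."""
    r_i = reads_with_both(reads, k_i, k_next)
    tally = Counter(span_between(r, k_i, k_next) for r in r_i)
    return tally.most_common(1)[0][0], tally


reads = ["CACTTGCCGTACGGT", "ACTTGCCATACGGTC", "GACTTGCCGTACGGA"]
consensus, tally = vote(reads, "ACTTG", "TACGG")
print(consensus)  # ACTTGCCGTACGG (2 supporting reads vs. 1)
```

The same `span_between` helper also covers the overlapping case: for the seed read CACTTGCCGTACGGT with k-mers CTTGC and TGCCG it returns CTTGCCG, so no separate merge routine is needed in this sketch.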
FC-Virus can generate a consensus sequence in O(|R|L) time and memory, where R represents the set of homologous reads and L is the length of the reads, following the steps described above. However, this consensus lacks information to the left of the first k-mer and to the right of the last k-mer. The non-uniformity of current sequencing technologies might result in homologous k-mers that do not cover the entire genome. Additionally, long repetitive sequences, which exceed the read length, may be represented only once by this strategy. To address these limitations, FC-Virus refines the consensus using the strategies outlined in the following section.
Consensus refinement
FC-Virus aligns reads to the consensus using homologous k-mers as anchor points and calculates the average depth X of the consensus based on these reads. Following the alignment positions of reads on the consensus, FC-Virus condenses k-mers located at the same position on the consensus into a block. Algorithm 1 is then used to classify all k-mers in this block as either used or unused, and their abundances are updated accordingly. As described in Algorithm 1, FC-Virus adjusts the abundance of k-mers within block H based on the consensus depth X. Note that block H consists of four k-mers: one that matches the consensus sequence and three others that differ from it only by the last base. The abundance of block H is defined as the sum of the abundances of its four k-mers. If this total abundance significantly exceeds X, some k-mers in H will be labeled as unused, even if they are part of the consensus. This process aims to identify and label k-mers that are likely part of repetitive segments as unused.
FC-Virus extends the consensus at both ends using unused k-mers. It starts with a k-mer from the endpoint of the current consensus as the seed k-mer and groups all unused k-mers that meet the connectivity condition into a block H. The connectivity condition requires that the k−1 prefix of one k-mer matches exactly with the k−1 suffix of another. FC-Virus then selects the unused k-mer with the highest abundance from block H to extend the consensus. This k-mer becomes the new seed, and the abundance and usage status of k-mers in block H are updated using Algorithm 1. This process is repeated iteratively until no more extendable k-mers are found (see Figure S1 for more details). The time complexity of this process is O(n), where n represents the number of k-mers.
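The right-hand extension can be sketched as a greedy walk over unused k-mers. This is a deliberately reduced illustration: it simply marks each chosen k-mer as used instead of applying Algorithm 1 (whose abundance update is not reproduced here), and the function name, the tie-breaking rule, and the toy abundances are assumptions.

```python
def extend_right(consensus, abundance, used, k):
    """Greedily append bases while an unused k-mer overlaps the last k-1 bases."""
    seq = consensus
    while True:
        suffix = seq[-(k - 1):]
        # Block H: the (up to four) unused k-mers that continue the current end.
        block = [suffix + base for base in "ACGT"
                 if suffix + base in abundance and suffix + base not in used]
        if not block:
            return seq
        # Pick the most abundant candidate; ties are broken alphabetically.
        best = max(block, key=lambda km: (abundance[km], km))
        used.add(best)  # simplification: no Algorithm 1 abundance adjustment
        seq += best[-1]


# Toy input with k = 4 (hypothetical abundances).
abundance = {"ACGT": 10, "CGTA": 8, "CGTC": 3, "GTAG": 5, "TAGG": 4}
extended = extend_right("ACGT", abundance, used={"ACGT"}, k=4)
print(extended)  # ACGTAGG
```

Because every iteration marks one more k-mer as used, the loop terminates after at most as many steps as there are k-mers. Extension to the left would follow the same pattern with prefixes and suffixes swapped.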
Experimental setup
We conducted benchmarking experiments to compare FC-Virus with traditional genome assembly methods, including SOAPdenovo2, SPAdes, and IDBA, as well as algorithms designed for viral quasispecies assembly, such as viaDBG, VG-Flow, and VStrains. In addition to benchmarking against these widely used algorithms, we also conducted comparisons against the actual genome sequence of each strain: we used the genome of one strain as a consensus and compared it with the genomes of the other strains. This baseline is referred to as the ‘Reference’. We also tried to benchmark the VICUNA algorithm, but its download link is no longer available.
Dataset information
We benchmarked FC-Virus against other algorithms on the following datasets:
-
Widely used simulated datasets. We collected four simulated datasets from [25] consisting of 5 HIV, 6 Polio, 10 HCV, and 15 ZIKV mixed strains, respectively. The genome lengths in the HIV, Polio, HCV, and ZIKV datasets range over \(9478 \sim 9719\), \(7428 \sim 7460\), \(9273 \sim 9311\), and \(10,251 \sim 10,269\) bp, respectively. These datasets can be downloaded at https://bitbucket.org/jbaaijens/savage-benchmarks.
-
Newly simulated datasets for SARS-CoV-2 (COVID-19). To investigate how sequencing depth affects assembler performance, we used SimSeq [31] to generate 11 simulated COVID-19 datasets with sequencing depths ranging from 50X to 20,000X. Each simulated COVID-19 dataset comprises three strains downloaded from the NCBI database with accession numbers ON944270.1, ON286803.1, and OL519143.1. The genome lengths of these strains range from 29,675 to 29,853 bp. These simulated datasets are available at https://bitbucket.org/fc-virus-benchmark.
-
Real datasets. We collected a real Illumina MiSeq dataset from a lab mix of five HIV strains, named HIV-LABMIX. The reads in this dataset are 250 bp in length, and the dataset is available at https://github.com/cbg-ethz/5-virus-mix.
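Sequencing depth relates read count to genome length through the standard coverage relation, depth = (number of reads × read length) / genome length. The sketch below uses that relation to estimate how many reads a target depth implies for the longest SARS-CoV-2 strain listed above. The 250 bp read length and the function name are illustrative assumptions and do not describe the SimSeq settings used to generate these datasets.

```python
import math


def reads_for_depth(depth, genome_length, read_length):
    """Number of reads needed so that depth = reads * read_length / genome_length."""
    return math.ceil(depth * genome_length / read_length)


# Lowest and highest depths in the simulated series, for a 29,853 bp genome.
low = reads_for_depth(50, 29_853, 250)
high = reads_for_depth(20_000, 29_853, 250)
print(low, high)  # 5971 2388240
```

For paired-end data, the same count corresponds to half as many read pairs.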
Baselines and evaluation metrics
To evaluate the degree of sequence fragmentation, we analyzed the length distribution of all contigs produced by each algorithm. We also assessed the assembly results using universal metrics such as genome fraction, N50, NGA50, duplication ratio, largest alignment, number of contigs, N’s per 100 kbp, mismatches per 100 kbp and indels per 100 kbp, as calculated by QUAST [32]. We use the term errors per 100 kbp to denote the sum of N’s, mismatches and indels per 100 kbp. Additionally, we aligned reads back to the contigs and evaluated the assembly quality in terms of read mapping rate.
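Two of these metrics are simple enough to sketch directly. The N50 function below follows the usual definition (the length of the contig at which the cumulative length of contigs, sorted from longest to shortest, first reaches half of the total assembly length), and the error metric is the sum stated above. The function names are assumptions, and QUAST's own computation may differ in details such as minimum contig length filtering.

```python
def n50(contig_lengths):
    """Contig length at which the sorted cumulative length reaches half the total."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0  # empty assembly


def errors_per_100kbp(n_per_100kbp, mismatches_per_100kbp, indels_per_100kbp):
    """Combined error metric: N's + mismatches + indels, each per 100 kbp."""
    return n_per_100kbp + mismatches_per_100kbp + indels_per_100kbp


single = n50([9000, 500, 400])          # one dominant contig
fragmented = n50([400, 400, 400, 400])  # evenly fragmented assembly
print(single, fragmented)  # 9000 400
```

The two toy inputs show why N50 is sensitive to fragmentation: an assembly dominated by one long contig scores close to the genome length, while an evenly fragmented one scores only the fragment length.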
Results
Evaluation of fragmentation degree
We compared the length distribution of contigs (longer than 350 bp) produced by the assemblers across the four widely used simulated datasets. As shown in Fig. 3, FC-Virus consistently produces exactly one long contig whose length closely matches that of the viral genome. In contrast, other assemblers generate numerous short contigs, with IDBA and SOAPdenovo2 typically producing contigs shorter than the viral genome. Although SPAdes, VG-Flow, viaDBG, and VStrains do generate some longer contigs, they also produce a substantial number of short ones. We attribute this result to their algorithmic strategies. SPAdes and other existing assembly algorithms usually construct a graph based on the overlap between reads or k-mers and then extract paths from the graph to form contigs. In this study, the reads are sequenced from multiple strains of the same virus, and the differences between strain genomes make the graph complex, which increases the difficulty of subsequent path extraction. As a result, the compared algorithms generate many short contigs. These contigs typically represent different strains but fail to cover the full range of strains (see the next section for more details). FC-Virus, on the other hand, integrates strain variations into its consensus assembly and refinement processes. In regions shared by multiple strains, FC-Virus constructs a consensus sequence that closely matches most strains, whereas other assemblers might generate multiple contigs that correspond to strains with higher abundance.
Benchmark using universal criteria
We used QUAST, a tool for assessing the quality of genome assemblies, to evaluate how well the contigs assembled by each algorithm align with the reference viral strains. For each dataset, we used each strain’s genome in turn as the ground truth reference genome and then averaged the results across all evaluations. Table 1 presents the average results for each algorithm across the HIV, POLIO, ZIKV, COVID-19, and HIV-LABMIX datasets. The sequencing depth of all these datasets is 20,000X. Table 1 shows an interesting result: FC-Virus achieves significantly fewer errors per 100 kbp than the Reference, which uses a single strain’s genome as the consensus. This highlights the importance of consensus assembly and suggests that the consensus generated by FC-Virus serves as a more accurate reference genome than an individual strain’s genome.
As presented in Table 1, traditional assembly algorithms, with the exception of SPAdes, tend to perform poorly in virus genome assembly. The contigs assembled by IDBA and SOAPdenovo2 are generally quite short, with numerous contigs less than 350bp in length. Since we only focus on contigs with a length greater than 350bp, the advantages of these two algorithms aren’t particularly noticeable. In some datasets, it is even challenging to compute their NGA50 values, where NGA50 is the contig length at which the aligned contigs cumulatively cover half of the reference genome. If the contigs are too short or fragmented, they may not cover enough of the reference genome, making NGA50 calculation impossible. Compared to FC-Virus, SPAdes generally performs slightly better in genome coverage. However, FC-Virus outperforms SPAdes on evaluation criteria like duplication ratio, errors per 100 kbp, NGA50, N50, largest alignment, and the number of contigs. Strain-level assembly algorithms such as viaDBG, VG-Flow, and VStrains generally excel in genome coverage, largest alignment length, N50, and NGA50. However, they struggle with high duplication ratios and errors per 100 kbp. These algorithms often produce a greater number of contigs than the number of strains, leading to excessive redundancy. In some datasets, their duplication rate far exceeds the number of strains present.
We observed that the total length of contigs assembled by SPAdes is significantly greater than the length of the consensus generated by FC-Virus. Although SPAdes and FC-Virus have a similar genome fraction, SPAdes exhibits a higher duplication ratio. This raised the question of whether the high duplication ratio indicates that SPAdes successfully assembled all viral strains. According to the experimental results from [23, 29] and our Appendix Table S17, SPAdes did not assemble all viral strains; it only managed to assemble a portion of them. While this is a positive outcome, it does not entirely align with our goal of assembling a single reference sequence. We then extracted the longest contig assembled by SPAdes and compared it with FC-Virus. We found that the genome fraction, largest alignment, N50, NGA50, and duplication ratio achieved by SPAdes (with only the longest contig) are similar to those of FC-Virus. However, SPAdes exhibited a higher error rate than FC-Virus (Fig. 4).
Assessment of reads re-mapping rate
We aligned reads to contigs assembled by each algorithm and calculated the percentage of reads with both ends matching to the assembled contigs. Figure 5 presents the percentage of reads that align to contigs at both ends across the simulated datasets of HIV, POLIO, HCV, and ZIKV.
As shown in Fig. 5, FC-Virus, VG-Flow, and VStrains all demonstrate strong performance in read re-mapping rates, with average values of 99.15%, 98.54%, and 97.73%, respectively. However, VG-Flow and VStrains generate a greater number of contigs with a larger total length, which naturally results in higher read re-mapping rates. In contrast, FC-Virus achieves similar or even better performance with just a single contig. Even in its worst case, the HCV dataset, FC-Virus has a re-mapping rate of 98.08%. This suggests that FC-Virus’s consensus covers nearly all the reads and can effectively serve as a reference genome.
Investigate the impact of sequencing depth
The sequencing depth of datasets used in existing studies on the assembly of viral strains is usually very high, at 20,000X. We were interested in exploring the impact of sequencing depth on the performance of assemblers. We evaluated FC-Virus along with other algorithms on 11 COVID-19 datasets with sequencing depths ranging from 50X to 20,000X. To minimize interference from other factors, we maintained a consistent error rate (less than \(1\%\)) across all simulated datasets. The results show that all four algorithms perform well in terms of genome coverage and read re-mapping rate. However, they differ in contig count, duplication ratio, errors per 100 kbp and N50 (see Fig. 6).
As shown in Fig. 6, sequencing depth has a relatively insignificant impact on IDBA, SPAdes, VStrains and FC-Virus, whereas SOAPdenovo2, VG-Flow, and viaDBG are noticeably affected. Note that viaDBG, VStrains, and VG-Flow fail to produce results for some datasets. FC-Virus remains competitive across the various evaluation criteria, which is consistent with the previous results. SPAdes, viaDBG, VG-Flow, and VStrains perform similarly to FC-Virus across most evaluation criteria. However, viaDBG, VG-Flow, and VStrains suffer from a high duplication ratio, and the errors per 100 kbp obtained by SPAdes are sometimes higher than those of FC-Virus.
These results were somewhat surprising, as we initially expected that sequencing depth would significantly impact the performance of most assemblers. We also anticipated that assemblers would perform less effectively on the COVID-19 dataset compared to datasets like HCV, POLIO, and ZIKV due to the larger genome size of COVID-19. We attribute the observed performance to the low sequencing error rate we used and the relatively small number of strains in the COVID-19 dataset, which has only 3 strains, while the other datasets have between 5 and 15 strains. Our findings suggest that with a low sequencing error rate, the effect of sequencing depth on assembler performance is minimal. Moving forward, we plan to investigate how the number of strains in a dataset influences assembler performance.
Assessment of CPU time and memory requirements
In theory, both the total time complexity and the space complexity of FC-Virus are O(m), where m is the number of reads. To better understand its performance in practice, we evaluated the computational demands of FC-Virus alongside the other compared assemblers in terms of CPU time and peak memory usage (Fig. 7). FC-Virus requires the shortest CPU time for the POLIO and HIV-LABMIX datasets. For the other datasets, its CPU time is second only to that of SOAPdenovo2. We attribute FC-Virus’s strong performance in CPU time to its use of homologous reads for constructing the consensus, which significantly reduces the required processing time. The time spent on its greedy consensus refinement step is also relatively low. In terms of memory usage, FC-Virus either performs best or is second only to IDBA. Both SOAPdenovo2 and IDBA are traditional genome assembly algorithms. Although they perform well in terms of computational demands, their assembly results lag behind those of other algorithms.
Conclusion
In this paper, we presented FC-Virus, an efficient genome assembly algorithm designed to accurately reconstruct full-length consensus sequences for viral quasispecies. We were the first to introduce the concept of homologous k-mers and outline a strategy for identifying them. By using homologous k-mers as anchors, FC-Virus merges reads to produce a consensus sequence that serves as a reference genome. This reference enables more detailed analysis of strain composition and distribution within the dataset. Our experimental results demonstrate that FC-Virus outperforms other assemblers across most evaluation metrics, thanks to its specialized consensus assembly approach. FC-Virus has the advantage of generating a single consensus sequence that delivers the same assembly effect as multiple contigs produced by other assemblers. Future work will focus on using the consensus sequences generated by FC-Virus as reference genomes for assembling individual strain genomes.
Availability of data and materials
The simulated datasets are available at https://bitbucket.org/fc-virus-benchmark and https://bitbucket.org/jbaaijens/savage-benchmarks/src/master/. Real datasets can be downloaded at https://github.com/cbg-ethz/5-virus-mix. The materials are available at https://github.com/qdu-bioinfo/FC-Virus-Supplementary.
References
Alves Brunna M, Siqueira Juliana D, Garrido Marianne M, Botelho Ornella M, Prellwitz Isabel M, Ribeiro Sayonara R, Soares Esmeralda A, Soares Marcelo A. Characterization of hiv-1 near full-length proviral genome quasispecies from patients with undetectable viral load undergoing first-line haart therapy. Viruses. 2017;9(12):392.
Kim S, Misra A. Snp genotyping: technologies and biomedical applications. Annu Rev Biomed Eng. 2007;9:289–320.
Rishton Gilbert M. Reactive compounds and in vitro false positives in hts. Drug Discov Today. 1997;2(9):382–4.
Craig Venter J, Adams Mark D, Myers Eugene W, Li Peter W, Mural Richard J, Sutton Granger G, Smith Hamilton O, Yandell Mark, Evans Cheryl A, Holt Robert A, et al. The sequence of the human genome. Science. 2001;291(5507):1304–51.
Nederbragt Alexander J. On the middle ground between open source and commercial software-the case of the newbler program. Genome Biol. 2014;15(4):1–2.
Batzoglou S, Jaffe DB, Stanley K, Butler J, Gnerre S, Mauceli E, Berger B, Mesirov Jill P, Lander ES. Arachne: a whole-genome shotgun assembler. Genome Res. 2002;12(1):177–89.
Huang X, Madan A. Cap3: a dna sequence assembly program. Genome Res. 1999;9(9):868–77.
Sutton GG, White O, Adams MD, Kerlavage AR. Tigr assembler: a new tool for assembling large shotgun sequencing projects. Genome Sci Technol. 1995;1(1):9–19.
Huang X, Wang J, Aluru S, Yang S-P, Hillier L. Pcap: a whole-genome assembly program. Genome Res. 2003;13(9):2164–70.
Treangen TJ, Sommer DD, Angly FE, Koren S, Pop M. Next generation sequence assembly with amos. Curr Protoc Bioinform. 2011;33(1):11–8.
De La Bastide M, McCombie WR. Assembling genomic dna sequences with phrap. Curr Protoc Bioinform. 2007;17(1):11–4.
Mullikin JC, Ning Z. The phusion assembler. Genome Res. 2003;13(1):81–90.
Li Z, Chen Y, Desheng M, Yuan J, Shi Y, Zhang H, Gan J, Li N, Xuesong H, Liu B, et al. Comparison of the two major classes of assembly algorithms: overlap-layout-consensus and de-bruijn-graph. Brief Funct Genom. 2012;11(1):25–37.
Bankevich A, Nurk S, Antipov D, Gurevich AA, Dvorkin M, Kulikov AS, Lesin VM, Nikolenko SI, Pham S, Prjibelski AD, Pyshkin AV, Sirotkin AV, Vyahhi N, Tesler G, Alekseyev MA, Pevzner PA. SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J Comput Biol. 2012;19(5):455–77.
Butler J, MacCallum I, Kleber M, Shlyakhter IA, Belmonte MK, Lander ES, Nusbaum C, Jaffe DB. ALLPATHS: de novo assembly of whole-genome shotgun microreads. Genome Res. 2008;18(5):810–20.
Simpson JT, Wong K, Jackman SD, Schein JE, Jones SJM, Birol I. ABySS: a parallel assembler for short read sequence data. Genome Res. 2009;19(6):1117–23.
Zerbino DR, Birney E. Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 2008;18(5):821–9.
Luo R, Liu B, Xie Y, Li Z, Huang W, Yuan J, He G, Chen Y, Pan Q, Liu Y, Tang J, Guangxi W, Hao ZY, Shi YL, Chang Y, Wang B, Yao L, Han C, Cheung DW, Yiu S-M, Peng S, Xiaoqian Z, Liu G, Liao X, Li Y, Yang H, Wang J, Lam T-W, Wang J. Soapdenovo2: an empirically improved memory-efficient short-read de novo assembler. GigaScience. 2012;1(1):18.
Peng Y, Leung HC, Yiu SM, Chin FYL. IDBA-UD: a de novo assembler for single-cell and metagenomic sequencing data with highly uneven depth. Bioinformatics. 2012;28(11):1420–8.
Chaisson MJ, Brinza D, Pevzner PA. De novo fragment assembly with short mate-paired reads: does the read length matter? Genome Res. 2009;19(2):336–46.
Souvorov A, Agarwala R, Lipman DJ. SKESA: strategic k-mer extension for scrupulous assemblies. Genome Biol. 2018;19(1):153.
Boisvert S, Laviolette F, Corbeil J. Ray: simultaneous assembly of reads from a mix of high-throughput sequencing technologies. J Comput Biol. 2010;17(11):1519–33.
Luo R, Lin Y. VStrains: de novo reconstruction of viral strains via iterative path extraction from assembly graphs. In International Conference on Research in Computational Molecular Biology, pp. 3–20. Springer; 2023.
Yang X, Charlebois P, Gnerre S, Coole MG, Lennon NJ, Levin JZ, Qu J, Ryan EM, Zody MC, Henn MR. De novo assembly of highly diverse viral populations. BMC Genom. 2012;13:1–13.
Baaijens JA, El Aabidine AZ, Rivals E, Schönhuth A. De novo assembly of viral quasispecies using overlap graphs. Genome Res. 2017;27(5):835–48.
Freire B, Ladra S, Paramá JR, Salmela L. Inference of viral quasispecies with a paired de Bruijn graph. Bioinformatics. 2021;37(4):473–81.
Chen J, Zhao Y, Sun Y. De novo haplotype reconstruction in viral quasispecies using paired-end read guided path finding. Bioinformatics. 2018;34(17):2927–35.
Baaijens JA, Van der Roest B, Köster J, Stougie L, Schönhuth A. Full-length de novo viral quasispecies assembly through variation graph construction. Bioinformatics. 2019;35(24):5086–94.
Baaijens JA, Stougie L, Schönhuth A. Strain-aware assembly of genomes from mixed samples using flow variation graphs. In Research in Computational Molecular Biology: 24th Annual International Conference, RECOMB 2020, Padua, Italy, May 10–13, 2020, Proceedings 24, pp. 221–222. Springer; 2020.
Chor B, Horn D, Goldman N, Levy Y, Massingham T. Genomic DNA k-mer spectra: models and modalities. Genome Biol. 2009;10:1–10.
Benidt S, Nettleton D. SimSeq: a nonparametric approach to simulation of RNA-sequence datasets. Bioinformatics. 2015;31(13):2131–40.
Gurevich A, Saveliev V, Vyahhi N, Tesler G. QUAST: quality assessment tool for genome assemblies. Bioinformatics. 2013;29(8):1072–5.
Acknowledgements
The authors would like to thank the editors and reviewers for their constructive comments and suggestions on this manuscript.
Funding
This work is supported by the National Natural Science Foundation of China under Grant Nos. 62202251, 6227226 and 62172028, and by the Natural Science Foundation of Shandong Province under Grant No. ZR2022QF133.
Author information
Authors and Affiliations
Contributions
J.Z. contributed to the design of the study. J.T. and Z.G. implemented FC-Virus. J.T., Z.G. and M.L. performed the experiments. J.Z., J.T., Z.G., B.E. and M.L. wrote and reviewed the manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable
Consent for publication
Not applicable
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License, which permits any non-commercial use, sharing, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if you modified the licensed material. You do not have permission under this licence to share adapted material derived from this article or parts of it. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by-nc-nd/4.0/.
About this article
Cite this article
Tian, J., Gao, Z., Li, M. et al. Accurate assembly of full-length consensus for viral quasispecies. BMC Bioinformatics 26, 36 (2025). https://doi.org/10.1186/s12859-025-06045-z
DOI: https://doi.org/10.1186/s12859-025-06045-z