A new strategy for better genome assembly from very short reads
© Ji et al; licensee BioMed Central Ltd. 2011
Received: 25 August 2011
Accepted: 30 December 2011
Published: 30 December 2011
With the rapid development of next-generation sequencing (NGS) technology, large quantities of genome sequencing data have been generated. Because of repetitive regions in genomes and other factors, assembly of very short reads remains challenging.
A novel strategy for improving genome assembly from very short reads is proposed. It increases the accuracy of assemblies by integrating de novo contigs, and produces comparative contigs by allowing multiple references that are not limited to genomes of closely related strains. Comparative contigs are used to scaffold de novo contigs. Using simulated and real datasets, it is shown that our strategy can effectively improve the quality of assemblies of isolated microbial genomes and metagenomes.
With more and more reference genomes available, our strategy will be useful for improving the quality of genome assemblies from very short reads. Scripts that make the strategy applicable are provided at http://code.google.com/p/cd-hybrid/.
In the past few years, several new platforms, such as Roche 454, Illumina/Solexa and ABI SOLiD, collectively called Next-Generation Sequencing (NGS) technologies, have revolutionized the sequencing landscape. Compared to the traditional Sanger sequencing method, the NGS technologies have several distinct features. First, NGS reads are shorter. A typical read from Sanger sequencing is about 650-800 base pairs, whereas Roche's 454 sequencer produces reads of 250-400 bp, and Solexa/SOLiD reads are generally within 100 bp. Second, the NGS technologies enable one machine to produce millions of reads simultaneously. For example, Roche/454's GS FLX Titanium, Illumina/Solexa's GAII and Life/APG's SOLiD 3 can generate about 0.45, 4 and 7 gigabytes of data, respectively, in one run. With the dramatically reduced time and cost of sequencing a genome, thousands of such projects have been finished or are in progress. These projects involve either de novo sequencing or re-sequencing of prokaryotic and eukaryotic species (Genomes Online Database, http://www.genomesonline.org/). The NGS technologies were first applied to bacterial genomes [2–4]. Among eukaryotic genomes sequenced with the NGS technologies, the giant panda genome was assembled solely from Solexa reads; the filamentous fungus Grosmannia clavigera and the cucumber Cucumis sativus were sequenced in combination with the Sanger technology; and the genome of the filamentous fungus Sordaria macrospora was assembled from a mixture of Solexa and 454 reads.
Genome assembly from very short reads is challenging because of genomic repeats, and it requires intensive computational resources. Two strategies are commonly used: the comparative assembly strategy and the de novo assembly strategy. In the comparative assembly strategy, DNA fragments are mapped to a reference genome and this information is used to infer the structure of the genome being sequenced [9, 10]. The de novo assembly strategy constructs genome sequences from a set of sequence reads without the help of reference genomes, using either the overlap-layout-consensus (OLC) approach or an algorithm based on a de Bruijn graph (DBG). Both methods have been well described in previous reports [11, 12]. Because DBG-based assemblers can resolve genomic repeats more accurately and with less computation than OLC-based ones, they have been widely adopted by genome sequencing projects.
The quality of a genome assembly is evaluated by the contiguity and the accuracy of its contigs or scaffolds. Contiguity refers to the lengths of contigs or scaffolds, such as the total length, the average length and the longest length. Accuracy mainly refers to the mis-assembly rate. Previous studies showed that, when NGS reads are shorter than genomic repeats, the complexity of genomic repeat regions is the major factor determining the quality of genome assembly [13–15]. Whiteford and colleagues showed that NGS reads of 30 bp could generate useful assemblies and recover almost all genes, while genes that failed to be correctly assembled were mostly related to repetitive elements (such as transposons, IS elements and prophages). Alkan and colleagues discovered that many genomic repeats or segmental duplications were left out of de novo assemblies of human genomes from short reads, and suggested combining high-quality sequencing approaches with high-throughput ones to improve assembly quality.
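The contiguity measures named above (total, average and longest contig length) can be computed directly from a list of contig lengths. The following is a minimal sketch rather than the authors' evaluation code; the function name `contiguity_stats` and the `min_length` cutoff are illustrative assumptions.

```python
from typing import Dict, Sequence


def contiguity_stats(contig_lengths: Sequence[int], min_length: int = 0) -> Dict[str, float]:
    """Summarise contiguity for contigs that are at least min_length bp long.

    The min_length cutoff is a hypothetical convenience parameter, not a
    threshold prescribed by the paper.
    """
    kept = [n for n in contig_lengths if n >= min_length]
    if not kept:
        return {"count": 0, "total": 0, "mean": 0.0, "longest": 0}
    total = sum(kept)
    return {
        "count": len(kept),          # number of contigs retained
        "total": total,              # total assembled length in bp
        "mean": total / len(kept),   # average contig length in bp
        "longest": max(kept),        # longest contig in bp
    }


# Toy example: four contigs, of which three pass a 500 bp cutoff.
stats = contiguity_stats([1200, 300, 4500, 800], min_length=500)
```

Accuracy is a separate axis: it needs an alignment of contigs against a trusted sequence, which this sketch does not attempt.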
There are several possible ways to improve the quality of a genome assembly from short read data. One is to utilize paired-end reads from libraries with different insert lengths. Another is to combine different types of reads, such as Roche 454/Sanger and Solexa [6, 8]. Using a reference genome to fill gaps between scaffolds of de novo assemblies may also be feasible [16, 17]. The first two approaches work because either the separation distances of paired reads or assemblies from longer reads increase the chance of resolving genomic repeats correctly. If a reference genome is highly similar to the target genome, a comparative assembly gives a better result than a de novo approach because it resolves genomic repeats more easily. In some studies, comparative assemblies were also used to improve the quality of de novo assemblies [16, 17]. Currently, however, the comparative approach is limited by the availability of closely related reference genomes: as shown in the Results section, if the similarity between the reference and the target genomes is not high enough, contigs may be wrongly assembled.
Here, a novel strategy for improving the quality of genome assembly from very short reads is proposed. By combining de novo assemblies and comparative ones, this strategy can produce high-quality assemblies in terms of both contiguity and accuracy. The major DBG-based assemblers deal with genomic repeats and sequencing errors in different ways [18, 19]. Therefore, their assembly results from short read data differ, as shown in the Results section. Moreover, mis-assembled contigs were still produced by Velvet, ABySS and SOAPdenovo. In our approach, a method is used to choose contigs from de novo assemblies, and these contigs are called DBG contigs. Using simulated short read datasets, we show that this method significantly reduces the error rates of de novo assemblies and produces extremely reliable DBG contigs. In addition, multiple comparative assemblies are produced by choosing multiple reference genomes, without limiting the choice to highly similar ones. A method based on DBG contigs is then proposed to eliminate almost all the mis-assembled contigs from the comparative assemblies. As a result, the remaining comparative contigs are reliable and can be used to improve the quality of de novo assemblies. Tested on simulated and real short read datasets, this workflow is shown to be useful for improving the quality of assemblies from very short reads for isolated microbial genomes and metagenomes.
Algorithm: the pipeline of our strategy
The quality of DBG contigs
Choosing reference genomes
Algorithm: criteria for selecting A-contigs
Testing: validation of our strategy
Testing: application of our strategy
Isolate microbial genome assembly
After filtering out the low-quality reads, our pipeline is used to assemble paired-end reads randomly sampled from short reads of Bacillus subtilis subsp. natto BEST195 (SRA: DRX000001) . A draft assembly (Nucleotide: AP011541) of strain Bacillus subtilis subsp. natto BEST195 (Taxonomy: 645657) from very short reads (36 bp) was produced by combining sequences from both the Velvet assembler and the MAQ software.
In the first module, DBG contigs are produced from separate Velvet, ABySS and SOAPdenovo assemblies. In the second module, three genomes are chosen as references: Bacillus subtilis subsp. subtilis str. 168 (Nucleotide: NC_000964; Taxonomy: 224308), Bacillus subtilis subsp. spizizenii str. W23 (Nucleotide: NC_014479; Taxonomy: 655816) and Bacillus subtilis BSn5 (Nucleotide: NC_014976; Taxonomy: 936156). Their BLAST coverages against AP011541 are 86%, 83% and 87%, respectively. Short reads are assembled against the reference genomes by AMOScmp to give A-contigs. In the third module, reliable contigs are chosen. Finally, a hybrid assembly is produced in the fourth module.
Results when our novel strategy is applied to a real short read dataset (table; the only surviving entry is contig number (> 1 kbp): 1068 (> 0.5 kbp))
Metagenomics provides opportunities for in-depth investigation of environmental microbes by directly sequencing randomly sampled DNA. A good-quality metagenome assembly is clearly helpful for metagenome research, because longer sequences not only make gene prediction more accurate but also contain more genomic context to assist gene annotation. So far, metagenome assembly is still challenging, and most available de novo assemblers for NGS reads have a limited capability to assemble metagenomes. The quality of de novo metagenome assembly is affected not only by repeats within the same genome or across different genomes but also by heterogeneous DNA fragments with different coverages. The comparative assembly strategy is a promising way to improve the quality of metagenome assembly, but reference genomes with nearly 100% similarity to the microbial members of a metagenome are hard to find, since even genomes of the same species may differ, for example, the genomes of various Escherichia coli strains. Therefore, by allowing less similar genomes as references and thus making more references eligible, our strategy makes it possible to assemble metagenomes in a comparative way.
Results when our novel strategy is applied to two sets of simulated metagenomes
In our strategy, two key approaches are devised to improve the quality of genome assemblies. In the first module, long contigs are selected from three de novo assemblies so that error rates are greatly reduced. This is based on the assumption that DBG-based assemblers adopt different approaches to resolve ambiguities in de Bruijn graphs caused by genomic repeats or other factors, so that the sets of long mis-assembled contigs produced by different DBG-based assemblers are largely inconsistent with one another. Using simulated short read datasets, this assumption is shown to hold for at least three assemblers (Velvet, SOAPdenovo and ABySS), since almost all mis-assembled contigs of at least 500 bp in length are excluded by this method. Thus, this method can improve the accuracy of genome assembly. In the second and third modules, another approach is proposed to improve the contiguity of genome assemblies. It applies the comparative assembly strategy in a broader way, allowing multiple references that are not limited to genomes of closely related strains. Most of the mis-assembled contigs generated in this step are then eliminated by the criteria used for selecting reliable comparative contigs. Tested on simulated and real short read datasets, comparative contigs are shown to be usable for extending or scaffolding de novo contigs.
Moreover, the accuracies of genome assemblies at different steps of our strategy are shown graphically in Figure 2, Figure 3 and Figure 6. The genomes used in Figure 3 and Figure 6 are subsets of those used in Figure 2. First, Figure 2a shows that the accuracy of DBG contigs was 100% for all simulated datasets, while the Velvet, ABySS and SOAPdenovo assemblies contained wrongly assembled contigs. Second, as shown in Figure 3a and Figure 6, our criteria for selecting A-contigs significantly improved the accuracy of A-contigs, from 77% to 90% on average. Third, after assembling DBG contigs and reliable A-contigs, the accuracy of hybrid assemblies averaged 95%. Meanwhile, the contiguity of genome assemblies was significantly improved in comparison with de novo assemblies, as shown in Figure 7.
In practice, if other DBG-based assemblers are available, the method used to produce DBG contigs in the first module makes it possible to integrate the results of more than three de novo assemblies. Moreover, at least two reference genomes should be chosen to produce comparative assemblies, because the criteria for selecting reliable A-contigs are specifically designed for multiple reference genomes and are expected to perform better with more comparative assemblies. In the fourth module, only Minimo, a stringent light-weight assembler, is used to assemble the mixture of DBG contigs and reliable A-contigs. Additional processing steps may be needed, such as scaffolding using Bambus and gap filling of the scaffolds using IMAGE.
For a genome sequencing project in which no genomes of closely related species are available and the de novo assemblies produced by DBG-based assemblers are highly fragmented, our strategy should be the first assembly pipeline to try. The effectiveness of our strategy depends on certain factors, for example, the complexity of the repetitive regions of the genome being sequenced and its similarity to the chosen reference genomes. For some short read datasets, therefore, our strategy may not work, and other strategies should then be considered.
In the future, we will try to integrate our strategy for selecting reliable comparative contigs with other signatures used for assembly validation, such as mate-pair orientations and separations and depth of coverage. We hope that, by eliminating almost all mis-assembled comparative contigs, more reliable A-contigs will be chosen to extend more DBG contigs, so that the quality of genome assemblies can be further improved.
A novel strategy for improving genome assembly from very short reads is proposed. The basic idea is that comparative assemblies can be used to improve the quality of genome assemblies by scaffolding or extending de novo contigs. De novo contigs are produced by integrating assemblies obtained from different DBG-based assemblers. Compared to assemblies from a single assembler, error rates are greatly reduced on simulated datasets. Comparative assemblies are produced by allowing multiple references, without limiting them to closely related genomes. A method is proposed to exclude mis-assembled contigs that arise from the reduced similarity between reference and target genomes.
With more and more microbial genomes available, our strategy will be useful to improve qualities of genome assemblies from very short reads. Some scripts are provided to make our strategy applicable at http://code.google.com/p/cd-hybrid/.
Codes for the pipeline
To make our strategy applicable, code for the pipeline is available from http://code.google.com/p/cd-hybrid/. The generateDBGcontigs.pl script takes de novo assemblies from different tools as input and outputs DBG contigs. The chooseReliableAcontigs.pl script takes DBG contigs and a set of comparative assemblies as input and produces reliable A-contigs.
The method to choose genomes used for simulated short read datasets
A simple measure is used to estimate the complexity of genomic repeat regions. For each genome, a complexity value is calculated by dividing the sum of the lengths of all genomic repeats by the genome length. Programs from the MUMmer package are used to identify repeats; the commands and their parameters are "nucmer --maxmatch -nosimplify" and "show-coords -r -T -H".
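The complexity value defined above is a single ratio, which the following minimal sketch computes from a list of repeat intervals. It is not the authors' script: the interval representation (1-based, inclusive coordinates) and the `merge_overlaps` switch are assumptions, and the text does not say whether overlapping repeat alignments are counted once or several times.

```python
from typing import Iterable, List, Tuple

Interval = Tuple[int, int]  # assumed 1-based, inclusive (start, end) coordinates


def merge_intervals(intervals: Iterable[Interval]) -> List[Interval]:
    """Merge overlapping or directly adjacent intervals."""
    merged: List[Interval] = []
    for start, end in sorted((min(s, e), max(s, e)) for s, e in intervals):
        if merged and start <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def repeat_complexity(repeats: Iterable[Interval], genome_length: int,
                      merge_overlaps: bool = False) -> float:
    """Sum of repeat lengths divided by genome length.

    With merge_overlaps=False the lengths are summed literally, as the text
    describes; merge_overlaps=True is an optional variant (an assumption)
    that counts each repeated base only once.
    """
    if genome_length <= 0:
        raise ValueError("genome_length must be positive")
    if merge_overlaps:
        spans = merge_intervals(repeats)
    else:
        spans = [(min(s, e), max(s, e)) for s, e in repeats]
    repeat_bases = sum(end - start + 1 for start, end in spans)
    return repeat_bases / genome_length


# Toy example: three repeat intervals on a 1 Mbp genome, two of them overlapping.
toy_repeats = [(100, 1099), (5000, 5999), (900, 1299)]
literal = repeat_complexity(toy_repeats, 1_000_000)
deduplicated = repeat_complexity(toy_repeats, 1_000_000, merge_overlaps=True)
```

In practice the intervals would come from the coordinate columns of the show-coords output; parsing that output is left out here to avoid guessing at its exact layout.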
For the short read datasets used to show the quality of DBG contigs, 629 genomes with complexity values greater than 6e-3 and lengths greater than 1e6 are chosen (see additional file 2). For the short read datasets used for comparative assemblies, 41 genomes are chosen as targets using the following criteria: their complexity values are greater than 6e-3; their genome lengths are greater than 1e6; at least three reference genomes are available; and the similarity values of their reference genomes are greater than 0.8 (see additional file 3). The similarity value between a target genome and a reference genome is calculated from sequence alignment results produced by the 'mummer' program in the MUMmer package. It is defined as the ratio of the sum of the lengths of maximal matches between the two genomes to the total length of the two genomes. Comparative assemblies used to show the performance of the criteria for selecting A-contigs are produced from the 41 target genomes with each of their reference genomes. Comparative assemblies used for validation of our strategy on simulated short read datasets are produced from the 41 target genomes with all of their reference genomes.
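The similarity value and the target-selection criteria above can be written down as two small functions. This is a sketch under stated assumptions: the function names are hypothetical, the match lengths are taken as already extracted from the mummer output, and the criteria are read as requiring at least three references that each exceed the similarity threshold. The text does not specify how matches are counted against the combined genome length, so the ratio is implemented literally.

```python
from typing import Sequence


def similarity(match_lengths: Sequence[int], target_length: int,
               reference_length: int) -> float:
    """Sum of maximal-match lengths over the total length of the two genomes."""
    total = target_length + reference_length
    if total <= 0:
        raise ValueError("genome lengths must be positive")
    return sum(match_lengths) / total


def is_candidate_target(complexity: float, genome_length: int,
                        reference_similarities: Sequence[float],
                        min_complexity: float = 6e-3,
                        min_length: int = 1_000_000,
                        min_references: int = 3,
                        min_similarity: float = 0.8) -> bool:
    """Apply the four selection criteria described in the text.

    Assumption: a reference counts only if its similarity exceeds
    min_similarity, and at least min_references such references are needed.
    """
    qualifying = [s for s in reference_similarities if s > min_similarity]
    return (complexity > min_complexity
            and genome_length > min_length
            and len(qualifying) >= min_references)


# Toy examples with made-up numbers.
toy_similarity = similarity([1000, 2000], 5000, 5000)
accepted = is_candidate_target(0.01, 4_200_000, [0.85, 0.90, 0.82])
rejected = is_candidate_target(0.01, 4_200_000, [0.85, 0.90, 0.50])
```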
Simulated and real short read datasets
Given a genome of length G, a coverage C and an insert length L, forward-reverse paired-end reads of length R are simulated as follows: G*C/(2*R) stretches of sequence of length L are sampled, with start positions uniformly distributed along the genome sequence, and one read of length R is taken from each end of every stretch. In this paper, C is 60, L is 300 and R is 75. The simulations are run with Maq's simulation module.
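The sampling scheme described above can be illustrated with a short, self-contained sketch. It is not Maq's simulation module: it uses a fixed insert length, adds no sequencing errors, and assumes that the second read of each pair is the reverse complement of the far end of the stretch (one reading of "forward-reverse"). All function names are hypothetical.

```python
import random
from typing import List, Tuple

_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def number_of_pairs(genome_length: int, coverage: float, read_length: int) -> int:
    """Number of read pairs, G*C/(2*R), truncated to an integer."""
    return int(genome_length * coverage / (2 * read_length))


def simulate_pairs(genome: str, coverage: float = 60, insert_length: int = 300,
                   read_length: int = 75, seed: int = 0) -> List[Tuple[str, str]]:
    """Sample error-free forward-reverse read pairs from a genome string."""
    if len(genome) < insert_length or insert_length < read_length:
        raise ValueError("need len(genome) >= insert_length >= read_length")
    rng = random.Random(seed)  # seeded for reproducibility
    pairs = []
    for _ in range(number_of_pairs(len(genome), coverage, read_length)):
        # Start positions are uniform over all positions where a stretch fits.
        start = rng.randint(0, len(genome) - insert_length)
        stretch = genome[start:start + insert_length]
        forward = stretch[:read_length]
        reverse = reverse_complement(stretch[-read_length:])
        pairs.append((forward, reverse))
    return pairs


# With the paper's settings (C = 60, R = 75), a 1 Mbp genome yields 400,000 pairs.
pairs_for_1mbp = number_of_pairs(1_000_000, 60, 75)

# Toy run on a 1,000 bp sequence.
toy_pairs = simulate_pairs("ACGT" * 250)
```

Each pair contributes 2*R bases, so G*C/(2*R) pairs give a total of G*C sequenced bases, which is the requested coverage.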
For Bacillus subtilis subsp. natto BEST195 (SRA: DRX000001), the dataset contains 27,296,731 reads (982.7 Mbp). After filtering out 488,869 reads whose quality scores contain the character 'N', 11,214,956 reads (403.7 Mbp) randomly sampled from the remaining reads are used for genome assembly.
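The filtering and subsampling step can be sketched as below. This is an illustration, not the authors' procedure: it assumes that reads are handled as pairs, that a pair is dropped when either base-call string contains 'N', and that sampling is done without replacement. The last two lines only check that the stated base totals are consistent with the 36 bp read length mentioned earlier.

```python
import random
from typing import List, Sequence, Tuple

ReadPair = Tuple[str, str]


def drop_pairs_with_n(pairs: Sequence[ReadPair]) -> List[ReadPair]:
    """Keep only pairs in which neither read contains the character 'N'."""
    return [(a, b) for a, b in pairs
            if "N" not in a.upper() and "N" not in b.upper()]


def subsample(pairs: Sequence[ReadPair], n: int, seed: int = 0) -> List[ReadPair]:
    """Draw n pairs uniformly at random, without replacement."""
    rng = random.Random(seed)  # seeded for reproducibility
    return rng.sample(list(pairs), n)


def total_mbp(read_count: int, read_length: int) -> float:
    """Total sequenced bases in Mbp, rounded to one decimal place."""
    return round(read_count * read_length / 1e6, 1)


# Toy data: the second pair contains an 'N' and is removed.
toy = [("ACGT", "TTGA"), ("ACNT", "TTGA"), ("GGCA", "CATG"), ("TTTT", "AAAA")]
clean = drop_pairs_with_n(toy)
sampled = subsample(clean, 2)

# Consistency check of the totals quoted in the text, assuming 36 bp reads.
all_reads_mbp = total_mbp(27_296_731, 36)
sampled_reads_mbp = total_mbp(11_214_956, 36)
```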
Running DBG-based assemblers, AMOScmp and Minimo
For the Velvet assemblies from simulated datasets, the parameters are "velveth 29 -fastq -shortPaired" and "velvetg -cov_cutoff auto -exp_cov auto -scaffolding yes". For the SOAPdenovo assemblies from simulated datasets, the parameters are "SOAPdenovo-31mer all -K 29" and "reverse_seq=0; asm_flags=3; rank=1; pair_num_cutoff=3". For the ABySS assemblies from simulated datasets, the parameters are "abyss-pe k=29 n=10" and "ABYSS -k 29". For all the de novo assemblies from the real short read dataset, the k-mer value is replaced with 23.
For the comparative assemblies from both simulated and real datasets, the AMOScmp-shortReads tool is used.
For the hybrid assemblies produced by Minimo from DBG contigs and reliable A-contigs, the parameters are "-D FASTA_EXP=1 -D MIN_LEN=30".
Aligning sequences using MUMmer
Three programs from the MUMmer package are used to align sequences: nucmer, delta-filter and show-coords. Their parameters are "nucmer --maxgap=500 --mincluster=100 --maxmatch", "delta-filter -q" and "show-coords -T -c -l -o -r -H -I = 0.2". Several methods in this paper use this alignment approach: the one that removes redundant contigs from DBG contigs, the one that selects reliable A-contigs from comparative assemblies by aligning them to DBG contigs, and the one that identifies mis-assembled contigs in de novo and comparative assemblies by aligning them onto the genomes used for simulation. These methods are implemented using the alignment annotations reported by show-coords, such as "[CONTAINS]", "[CONTAINED]" and "[IDENTITY]".
List of Abbreviations
NGS: next generation sequencing technology
DBG: de Bruijn graph
Acknowledgements and funding
This research was supported by grants from National High-Tech R&D Program (863) (2006AA02Z334, 2007DFA31040), State key basic research program (973) (2006CB910705, 2010CB529206, 2011CBA00801), Research Program of CAS (KSCX2-YW-R-112, KSCX2-YW-R-190), National Natural Science Foundation of China (30900272) and SA-SIBS Scholarship Program.
- Metzker ML: Sequencing technologies - the next generation. Nat Rev Genet 2010, 11(1):31–46.
- Farrer RA, Kemen E, Jones JD, Studholme DJ: De novo assembly of the Pseudomonas syringae pv. syringae B728a genome using Illumina/Solexa short sequence reads. FEMS Microbiol Lett 2009, 291(1):103–111.
- Margulies M, Egholm M, Altman WE, Attiya S, Bader JS, Bemben LA, Berka J, Braverman MS, Chen YJ, Chen Z, et al.: Genome sequencing in microfabricated high-density picolitre reactors. Nature 2005, 437(7057):376–380.
- Reinhardt JA, Baltrus DA, Nishimura MT, Jeck WR, Jones CD, Dangl JL: De novo assembly using low-coverage short read sequence data from the rice pathogen Pseudomonas syringae pv. oryzae. Genome Res 2009, 19(2):294–305.
- Li R, Fan W, Tian G, Zhu H, He L, Cai J, Huang Q, Cai Q, Li B, Bai Y, et al.: The sequence and de novo assembly of the giant panda genome. Nature 2010, 463(7279):311–317.
- Diguistini S, Liao NY, Platt D, Robertson G, Seidel M, Chan SK, Docking TR, Birol I, Holt RA, Hirst M, et al.: De novo genome sequence assembly of a filamentous fungus using Sanger, 454 and Illumina sequence data. Genome Biol 2009, 10(9):R94.
- Huang S, Li R, Zhang Z, Li L, Gu X, Fan W, Lucas WJ, Wang X, Xie B, Ni P, et al.: The genome of the cucumber, Cucumis sativus L. Nat Genet 2009, 41(12):1275–1281.
- Nowrousian M, Stajich JE, Chu M, Engh I, Espagne E, Halliday K, Kamerewerd J, Kempken F, Knab B, Kuo HC, et al.: De novo assembly of a 40 Mb eukaryotic genome from short sequence reads: Sordaria macrospora, a model organism for fungal morphogenesis. PLoS Genet 2010, 6(4):e1000891.
- Pop M: Genome assembly reborn: recent computational challenges. Briefings in Bioinformatics 2009, 10(4):354–366.
- Pop M, Phillippy A, Delcher AL, Salzberg SL: Comparative genome assembly. Brief Bioinform 2004, 5(3):237–248.
- Paszkiewicz K, Studholme DJ: De novo assembly of short sequence reads. Brief Bioinform 2010, 11(5):457–472.
- Jackman SD, Birol I: Assembling genomes using short-read sequencing technology. Genome Biol 2010, 11(1):202.
- Kingsford C, Schatz MC, Pop M: Assembly complexity of prokaryotic genomes using short reads. BMC Bioinformatics 2010, 11: 21.
- Whiteford N, Haslam N, Weber G, Prugel-Bennett A, Essex JW, Roach PL, Bradley M, Neylon C: An analysis of the feasibility of short read sequencing. Nucleic Acids Res 2005, 33(19):e171.
- Alkan C, Sajjadian S, Eichler EE: Limitations of next-generation genome sequence assembly. Nat Methods 2010, 8(1):61–65.
- Nishito Y, Osana Y, Hachiya T, Popendorf K, Toyoda A, Fujiyama A, Itaya M, Sakakibara Y: Whole genome assembly of a natto production strain Bacillus subtilis natto from very short read data. BMC Genomics 2010, 11: 243.
- Salzberg SL, Sommer DD, Puiu D, Lee VT: Gene-boosted assembly of a novel bacterial genome from very short reads. PLoS Comput Biol 2008, 4(9):e1000186.
- Flicek P, Birney E: Sense from sequence reads: methods for alignment and assembly. Nat Methods 2009, 6(11 Suppl):S6-S12.
- Bao S, Jiang R, Kwan W, Wang B, Ma X, Song YQ: Evaluation of next-generation sequencing software in mapping and assembly. J Hum Genet 2011.
- Zerbino DR, Birney E: Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res 2008, 18(5):821–829.
- Simpson JT, Wong K, Jackman SD, Schein JE, Jones SJ, Birol I: ABySS: a parallel assembler for short read sequence data. Genome Res 2009, 19(6):1117–1123.
- Li R, Zhu H, Ruan J, Qian W, Fang X, Shi Z, Li Y, Li S, Shan G, Kristiansen K, et al.: De novo assembly of human genomes with massively parallel short read sequencing. Genome Res 2010, 20(2):265–272.
- Sommer DD, Delcher AL, Salzberg SL, Pop M: Minimus: a fast, lightweight genome assembler. BMC Bioinformatics 2007, 8: 64.
- Treangen TJ, Sommer DD, Angly FE, Koren S, Pop M: Next generation sequence assembly with AMOS. Curr Protoc Bioinformatics 2011, Chapter 11: Unit 11 18.
- Li H, Ruan J, Durbin R: Mapping short DNA sequencing reads and calling variants using mapping quality scores. Genome Res 2008, 18(11):1851–1858.
- Wooley JC, Godzik A, Friedberg I: A primer on metagenomics. PLoS Comput Biol 2010, 6(2):e1000667.
- Pignatelli M, Moya A: Evaluating the fidelity of de novo short read metagenomic assembly using simulated data. PLoS One 2011, 6(5):e19984.
- Richter DC, Ott F, Auch AF, Schmid R, Huson DH: MetaSim: a sequencing simulator for genomics and metagenomics. PLoS One 2008, 3(10):e3373.
- Pop M, Kosack DS, Salzberg SL: Hierarchical scaffolding with Bambus. Genome Res 2004, 14(1):149–159.
- Tsai IJ, Otto TD, Berriman M: Improving draft assemblies by iterative mapping and assembly of short reads to eliminate gaps. Genome Biol 2010, 11(4):R41.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.