Automated ensemble assembly and validation of microbial genomes
© Koren et al.; licensee BioMed Central Ltd. 2014
Received: 7 February 2014
Accepted: 24 April 2014
Published: 3 May 2014
The continued democratization of DNA sequencing has sparked a new wave of development of genome assembly and assembly validation methods. As individual research labs, rather than centralized centers, begin to sequence the majority of new genomes, it is important to establish best practices for genome assembly. However, recent evaluations such as GAGE and the Assemblathon have concluded that there is no single best approach to genome assembly. Instead, it is preferable to generate multiple assemblies and validate them to determine which is most useful for the desired analysis; this is a labor-intensive process that is often impossible or unfeasible.
To encourage best practices supported by the community, we present iMetAMOS, an automated ensemble assembly pipeline; iMetAMOS encapsulates the process of running, validating, and selecting a single assembly from multiple assemblies. iMetAMOS packages several leading open-source tools into a single binary that automates parameter selection and execution of multiple assemblers, scores the resulting assemblies based on multiple validation metrics, and annotates the assemblies for genes and contaminants. We demonstrate the utility of the ensemble process on 225 previously unassembled Mycobacterium tuberculosis genomes as well as a Rhodobacter sphaeroides benchmark dataset. On these real data, iMetAMOS reliably produces validated assemblies and identifies potential contamination without user intervention. In addition, intelligent parameter selection produces assemblies of R. sphaeroides comparable to or exceeding the quality of those from the GAGE-B evaluation, affecting the relative ranking of some assemblers.
Ensemble assembly with iMetAMOS provides users with multiple, validated assemblies for each genome. Although computationally limited to small or mid-sized genomes, this approach is the most effective and reproducible means for generating high-quality assemblies and enables users to select an assembly best tailored to their specific needs.
Genome assembly reconstructs a genome from many shorter sequencing reads as faithfully as possible [1–3]. Since reasonable formulations of the problem are NP-hard [2, 4], practical implementations often return an approximate solution that contains errors. Recent assembly evaluations like GAGE and the Assemblathon [5–8] have highlighted the chaotic nature of genome assembly, in which assembler performance varies widely across datasets and small parameter changes can have drastic effects. In GAGE-B  for example, each dataset required a different k-mer parameter, the best assembler was not consistent across datasets, and the continuity difference between best and second best was often two-fold.
Although genome assembly is a complex problem, validating assemblies is more straightforward. The quality of a genome assembly can be confirmed by verifying that the layout of reads is consistent with the sequencing process used to generate the data . Multiple tools have been recently developed for validating genome assemblies both with and without the use of a reference genome [10–15]. Thus, given the chaotic nature of assemblers and the relative ease of validation, it is recommended to generate multiple assemblies and use validation to determine the most appropriate one. This is akin to a “hypothesis generation” view of assembly , which can be most easily implemented as an ensemble of independent methods. Unfortunately, running multiple assemblers is a time consuming, non-trivial task requiring substantial installation, learning, and maintenance costs.
There exist a limited set of tools that integrate automated parameter selection and validation into the assembly process. The A5 pipeline [17, 18] automates the microbial assembly process, but is limited to a single assembler and includes limited validation. CG-Pipeline  is targeted to 454 sequencing. VelvetOptimizer  automates a parameter sweep of k-mer sizes for the Velvet assembler , but uses contig N50 size as the optimization metric, which is not always representative of assembly quality [7, 22]. More recently, a number of assembly methods have been developed that incorporate assembly likelihood estimates into the primary assembly algorithm [23–25]. However, none of these tools robustly automate the execution of multiple assembly methods and validation metrics to achieve the best possible assembly. Here we present iMetAMOS, which automates the process of ensemble assembly and validation.
Whereas MetAMOS  was developed for metagenomic assembly, iMetAMOS is an isolate-focused extension that encapsulates the current best practices for microbial genome assembly using Illumina , 454 , Ion , or PacBio  sequencing data. Building on the conclusions of GAGE and the Assemblathon, iMetAMOS runs multiple, independent tools to generate and validate assemblies. Uniquely, iMetAMOS automates the entire ensemble assembly process including automated parameter selection and sweeps, execution of multiple assembly and validation tools, preliminary gene annotation, and identification of potential contaminating sequences. This ensemble approach is robust to individual tool failures and reliably generates high-quality assemblies with minimal user input.
iMetAMOS is primarily written in Python and builds upon Ruffus  for pipeline management. However, it incorporates many freely available tools written in a variety of languages. To simplify installation, iMetAMOS is distributed as 64-bit OS X and Linux binaries, including all supported assemblers, tools, and required databases. On 32-bit systems, iMetAMOS automatically downloads and installs the required dependencies, as needed, which significantly simplifies installation.
To support future extensibility, iMetAMOS includes a generic framework to add new tools to the pipeline. Currently supported modules are for assembly and classification. When a new tool is available, no code modification is need. Instead, a configuration is written to specify parameters for the tool and required inputs and outputs. iMetAMOS will automatically load this configuration and run the requested tool. When an external tool is executed, a corresponding citation is output to ensure users of iMetAMOS properly credit the tools on which it relies.
iMetAMOS enables reproducible analysis by recording all commands, software versions (via an MD5 hash), and intermediate inputs and outputs. The single, comprehensive binary is generated via PyInstaller , which also serves to fix and archive the exact version of all programs used. Reproducibility of custom analyses is supported via workflows. A workflow defines the software required for an analysis, as well as optional parameters and input data. Workflows support both local and remote file names, as well as SRA run identifiers, and can inherit their parameters from other workflows, allowing users to easily add or modify input data or parameters. Given a workflow, iMetAMOS will download any required remote data and run the analysis using pre-specified parameters. For further reproducibility, a workflow is automatically created for every iMetAMOS run, which can be easily shared with remote collaborators. If the data are available on the Internet, the entire analysis can be reproduced by two simple commands.
Assembly is treated as a hypothesis generation and testing problem. Multiple assembly tools are run to ensure robustness to failure and a thorough exploration of the hypothesis space. The following assemblers are currently supported: ABySS , CABOG , IDBA-UD , MaSuRCA , MetaVelvet , MIRA , Ray/RayMeta [39, 40], SGA , SOAPdenovo2 , SPAdes , SparseAssembler , Velvet , and Velvet-SC . For De Bruijn assemblers, a k-mer size is automatically selected using KmerGenie . Alternatively, users can specify a list of k-mers and iMetAMOS will run each assembly with each specified k-mer. In this mode, iMetAMOS can operate similarly to VelvetOptimizer , but for multiple assemblers and with more appropriate validation measures.
Validation and annotation
Each assembly is treated as a hypothesis subject to validation. The following validation tools are supported: ALE , CGAL , FRCbam , FreeBayes , LAP , QUAST , and REAPR . Both reference-based and reference-free validations are performed. For reference-based validation, a MUMi distance  is used to recruit the most similar reference genome from RefSeq  to calculate reference-based metrics. For reference-free validation, the input reads and read pairs are verified to be in agreement with the resulting assembly using both likelihood-based methods and mis-assembly features. In addition, to provide an initial annotation and comparison between gene content, the assemblies are automatically annotated using Prokka .
From the ensemble, the “winning” assembly is selected using the consensus of the validation tools. For each selected metric, the assemblies are assigned an order from best to worst (with 1 being best). By default, the top assemblies are selected as those that are in the top 10% for at least half the metrics. The best assembly is then selected as the top scoring assembly with the highest count of best scores. A user can select a single metric (i.e. consensus accuracy) or an arbitrarily weighted combination of metrics for validation. Importantly, this allows users to customize the validation process to suit their downstream project goals. For example, studies focused on phylogenetic tree reconstruction may prefer to minimize consensus errors, while structural variation studies may instead focus on maximizing continuity and minimizing long-range errors.
Although iMetAMOS focuses on single-genome assembly, all inputs are considered as a metagenome to control against possible contamination. The winning assembly’s contigs and unassembled reads are analyzed by a taxonomic classification program. By default, iMetAMOS uses the k-mer based Kraken  tool, but the alternative methods of FCP , PhyloSift , PHMMER , and PhymmBL  are also supported. Contigs are partitioned into separate, taxon-specific directories (genus by default) according to their classification, so that contaminating sequence can be easily identified and removed. This process also serves as an initial species identification when assembling novel organisms.
The classification result is dependent on the classifier and database used, and serves as only a preliminary species identification or indicator of potential contamination. Manual follow-up is recommended to confirm the classification. For example, recently acquired genomic elements, such as phage integrations, may be incorrectly classified. Nevertheless, this initial binning facilitates rapid identification of the assembled organism and easier contaminant removal before downstream analysis or submission to a nucleotide archive.
The final output of iMetAMOS is a self-contained HTML5 summary page. Here, users can browse the output files as well as drill down to detailed results from any step in the pipeline. This includes FastQC  reports for the preprocess step, QUAST  graphs and metrics from the validation step, and an interactive Krona  display of the taxonomic classifications.
Automated assembly evaluation
With iMetAMOS it is possible to automatically recreate an assembler evaluation for every sample. We used iMetAMOS to perform ensemble assembly of the Rhodobacter sphaeroides 2.4.1 MiSeq dataset from the recent GAGE-B evaluation . In addition, our automated evaluation included four additional assemblers (IDBA-UD , SparseAssembler , Velvet-SC , and Ray ) and validation metrics not utilized by GAGE-B (e.g. consensus accuracy).
The best assembly of this dataset, as selected by iMetAMOS with an automatically chosen kmer-size of 35, was MaSuRCA, matching the GAGE-B result. However, the corrected N50 of the iMetAMOS MaSuRCA assembly increased to 139 Kbp from the 120 Kbp using the manually selected k-mer size of 55 reported in GAGE-B. Similar improvements were observed for four assemblers when compared to the manually selected k-mer in GAGE-B. This improvement is the result of selecting a value of k to maximize assembly correctness, rather than the GAGE-B approach of maximizing the uncorrected contig N50 size. In cases where iMetAMOS did not outperform the GAGE-B results, GAGE-B had utilized EA-UTILS  to pre-process the data. While EA-UTILS is supported by iMetAMOS, using raw sequencing data generated the best assemblies in GAGE-B, so pre-processing was disabled.
GAGE-B assemblies versus iMetAMOS (iMA) assemblies on the R. sphaeroides dataset
GAGE-B reference coverage
iMA reference coverage
The detection and removal of contaminating DNA sequences is an often-overlooked phase of assembly. For example, the Assemblathon 1 dataset included mock contaminant, which only a few teams attempted to detect and remove . Failure to remove real contaminant from assemblies significantly affects the quality of public databases to which these genomes are submitted.
iMetAMOS average runtime on 225 M. tuberculosis samples
Average time (h)
We have developed an open-source microbial analysis pipeline, iMetAMOS, which automates the process of ensemble assembly. In addition, its modular architecture is extensible and able to incorporate additional analyses or alternative tools. A potential enhancement is assembly correction, or contig breaking, which iMetAMOS does not currently support. However, the infrastructure required to support this is largely in place. For example, REAPR  is included with iMetAMOS and capable of splitting assembled contigs at predicted mis-assemblies. Using this and other supplied validation tools, assembly breaking could be iteratively performed until the validation scores are no longer improving or no more corrections are possible. Alternatively, because iMetAMOS generates multiple assemblies, assembly reconciliation techniques [61–63] could be incorporated into the pipeline. However, in practice, we have found the simple process of running multiple assemblers with multiple parameters is capable of generating high-confidence assemblies on its own, while merging assemblies can increase the risk of mis-assembly without significantly improving continuity .
The iMetAMOS extensible framework also supports customizable workflows and parameters on a per-user or per-run basis. Because all components of iMetAMOS are open source, users and tool authors are able to contribute improved parameters to the repository. Users can also contribute custom workflows tailored for specific analyses. In this way, iMetAMOS can serve as a best-practice repository for multiple assemblers, data types, and analysis tools.
iMetAMOS enables accurate and reproducible genome assembly via a “GAGE-in-a-box” analysis, allowing non-expert users to run multiple assemblers, validation metrics, and annotations with a single command. Results are presented in a simplified and interactive HTML5 format, and reproducibility is enabled through detailed logging and workflows. The current implementation supports over thirteen assemblers and seven validation tools, and its modular architecture supports the easy addition of future tools. Ensemble assembly is more robust, reproducible, and accurate than manual assembly, even surpassing the quality of GAGE-B assemblies using the same data and tools. Most importantly, iMetAMOS provides users with a simple means to generate multiple assemblies and validation metrics, empowering them to choose the best assembly for their specific needs.
Availability and requirements
Project name: iMetAMOS.
Project home page:http://www.cbcb.umd.edu/software/imetamos.
Operating systems: Linux/OS X.
Programming language: Python, C++, Perl, and Java.
Other requirements: Perl (5.8.8+), Python (2.7.3+), Java (1.6+), R (2.11.1+ with PNG support), gcc (4.7+ recommended), git, curl.
License: iMetAMOS and metAMOS-specific code are released open source under the Perl Artistic License .
All assemblies described here are available for download from http://www.cbcb.umd.edu/software/imetamos. The exact version of iMetAMOS used for analysis in this manuscript is available from ftp://ftp.cbcb.umd.edu/pub/data/metamos/imetamos_pub.tar.gz. However, we recommend using the latest release for all analyses.
We thank Magoc et al. and Comas et al. who submitted the raw data that was used in this study. We thank Lex Nederbragt and an anonymous reviewer for detailed comments on the manuscript and iMetAMOS software, usability, and documentation. The contributions of SK, TJT, and AMP were funded under Agreement No. HSHQDC-07-C-00020 awarded by the Department of Homeland Security Science and Technology Directorate (DHS/S&T) for the management and operation of the National Biodefense Analysis and Countermeasures Center (NBACC), a Federally Funded Research and Development Center. The views and conclusions contained in this document are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of the U.S. Department of Homeland Security. In no event shall the DHS, NBACC, or Battelle National Biodefense Institute (BNBI) have any responsibility or liability for any use, misuse, inability to use, or reliance upon the information contained herein. The Department of Homeland Security does not endorse any products or commercial services mentioned in this publication. MP and CMH were supported by NIH grant R01-AI-100947and the NSF grant IIS-1117247.
- Miller JR, Koren S, Sutton G: Assembly algorithms for next-generation sequencing data. Genomics. 2010, 95 (6): 315-327. 10.1016/j.ygeno.2010.03.001.View ArticlePubMed CentralPubMedGoogle Scholar
- Nagarajan N, Pop M: Parametric complexity of sequence assembly: theory and applications to next generation sequencing. J Comput Biol. 2009, 16 (7): 897-908. 10.1089/cmb.2009.0005.View ArticlePubMedGoogle Scholar
- Nagarajan N, Pop M: Sequence assembly demystified. Nat Rev Genet. 2013, 14 (3): 157-167. 10.1038/nrg3367.View ArticlePubMedGoogle Scholar
- Myers EW: Toward simplifying and accurately formulating fragment assembly. J Comput Biol. 1995, 2 (2): 275-290. 10.1089/cmb.1995.2.275.View ArticlePubMedGoogle Scholar
- Bradnam K, Fass J, Alexandrov A, Baranay P, Bechner M, Birol I, Boisvert S, Chapman J, Chapuis G, Chikhi R, Chitsaz H, Chou W-C, Corbeil J, Del Fabbro C, Docking T, Durbin R, Earl D, Emrich S, Fedotov P, Fonseca N, Ganapathy G, Gibbs R, Gnerre S, Godzaridis E, Goldstein S, Haimel M, Hall G, Haussler D, Hiatt J, Ho I: Assemblathon 2: evaluating de novo methods of genome assembly in three vertebrate species. GigaScience. 2013, 2 (1): 10-10.1186/2047-217X-2-10.View ArticlePubMed CentralPubMedGoogle Scholar
- Earl D, Bradnam K, St John J, Darling A, Lin D, Fass J, Yu HO, Buffalo V, Zerbino DR, Diekhans M, Nguyen N, Ariyaratne PN, Sung WK, Ning Z, Haimel M, Simpson JT, Fonseca NA, Birol I, Docking TR, Ho IY, Rokhsar DS, Chikhi R, Lavenier D, Chapuis G, Naquin D, Maillet N, Schatz MC, Kelley DR, Phillippy AM, Koren S, et al: Assemblathon 1: a competitive assessment of de novo short read assembly methods. Genome Res. 2011, 21 (12): 2224-2241. 10.1101/gr.126599.111.View ArticlePubMed CentralPubMedGoogle Scholar
- Salzberg SL, Phillippy AM, Zimin A, Puiu D, Magoc T, Koren S, Treangen TJ, Schatz MC, Delcher AL, Roberts M: GAGE: a critical evaluation of genome assemblies and assembly algorithms. Genome Res. 2012, 22 (3): 557-567. 10.1101/gr.131383.111.View ArticlePubMed CentralPubMedGoogle Scholar
- Magoc T, Pabinger S, Canzar S, Liu X, Su Q, Puiu D, Tallon LJ, Salzberg SL: GAGE-B: an evaluation of genome assemblers for bacterial organisms. Bioinformatics. 2013, 29 (14): 1718-1725. 10.1093/bioinformatics/btt273.View ArticlePubMed CentralPubMedGoogle Scholar
- Phillippy AM, Schatz MC, Pop M: Genome assembly forensics: finding the elusive mis-assembly. Genome Biol. 2008, 9 (3): R55-R55. 10.1186/gb-2008-9-3-r55.View ArticlePubMed CentralPubMedGoogle Scholar
- Clark SC, Egan R, Frazier PI, Wang Z: ALE: a generic assembly likelihood evaluation framework for assessing the accuracy of genome and metagenome assemblies. Bioinformatics. 2013, 29 (4): 435-443. 10.1093/bioinformatics/bts723.View ArticlePubMedGoogle Scholar
- Rahman A, Pachter L: CGAL: computing genome assembly likelihoods. Genome Biol. 2013, 14 (1): R8-10.1186/gb-2013-14-1-r8.View ArticlePubMed CentralPubMedGoogle Scholar
- Hunt M, Kikuchi T, Sanders M, Newbold C, Berriman M, Otto TD: REAPR: a universal tool for genome assembly evaluation. Genome Biol. 2013, 14 (5): R47-10.1186/gb-2013-14-5-r47.View ArticlePubMed CentralPubMedGoogle Scholar
- Gurevich A, Saveliev V, Vyahhi N, Tesler G: QUAST: quality assessment tool for genome assemblies. Bioinformatics. 2013, 29 (8): 1072-1075. 10.1093/bioinformatics/btt086.View ArticlePubMed CentralPubMedGoogle Scholar
- Ghodsi M, Hill CM, Astrovskaya I, Lin H, Sommer DD, Koren S, Pop M: De novo likelihood-based measures for assembly validation. BMC Res Notes. 2013, 6 (1): 334-10.1186/1756-0500-6-334.View ArticlePubMed CentralPubMedGoogle Scholar
- Vezzi F, Narzisi G, Mishra B: Reevaluating assembly evaluations with feature response curves: GAGE and assemblathons. PLoS One. 2012, 7 (12): e52210-10.1371/journal.pone.0052210.View ArticlePubMed CentralPubMedGoogle Scholar
- Howison M, Zapata F, Dunn CW: Toward a statistically explicit understanding of de novo sequence assembly. Bioinformatics. 2013, 29 (23): 2959-2963. 10.1093/bioinformatics/btt525.View ArticlePubMedGoogle Scholar
- Tritt A, Eisen JA, Facciotti MT, Darling AE: An integrated pipeline for de novo assembly of microbial genomes. PLoS One. 2012, 7 (9): e42304-10.1371/journal.pone.0042304.View ArticlePubMed CentralPubMedGoogle Scholar
- Coil D, Jospin G, Darling AE: A5-miseq: an updated pipeline to assemble microbial genomes from Illumina MiSeq data. arXiv preprint arXiv:1401.5130 2014Google Scholar
- Kislyuk AO, Katz LS, Agrawal S, Hagen MS, Conley AB, Jayaraman P, Nelakuditi V, Humphrey JC, Sammons SA, Govil D, Mair RD, Tatti KM, Tondella ML, Harcourt BH, Mayer LW, Jordan IK: A computational genomics pipeline for prokaryotic sequencing projects. Bioinformatics. 2010, 26 (15): 1819-1826. 10.1093/bioinformatics/btq284.View ArticlePubMed CentralPubMedGoogle Scholar
- Velvet Optimizer: http://bioinformatics.net.au/software.velvetoptimiser.shtml,
- Zerbino DR, Birney E: Velvet: algorithms for de novo short read assembly using de bruijn graphs. Genome Res. 2008, 18 (5): 821-829. 10.1101/gr.074492.107.View ArticlePubMed CentralPubMedGoogle Scholar
- Narzisi G, Mishra B: Comparing de novo genome assembly: the long and short of it. PLoS One. 2011, 6 (4): 17-17.View ArticleGoogle Scholar
- Medvedev P, Brudno M: Maximum likelihood genome assembly. J Comput Biol. 2009, 16 (8): 1101-1116. 10.1089/cmb.2009.0047.View ArticlePubMed CentralPubMedGoogle Scholar
- Laserson J, Jojic V, Koller D: Genovo: de novo assembly for metagenomes. J Comp Biol J Comp Mol Cell Biol. 2011, 18 (3): 429-443. 10.1089/cmb.2010.0244.View ArticleGoogle Scholar
- Hayati A, Sato K, Sakakibara Y: An extended genovo metagenomic assembler by incorporating paired-end information. PeerJ. 2013, 1: e196-View ArticleGoogle Scholar
- Treangen TJ, Koren S, Sommer DD, Liu B, Astrovskaya I, Ondov B, Darling AE, Phillippy AM, Pop M: MetAMOS: a modular and open source metagenomic assembly and analysis pipeline. Genome Biol. 2013, 14 (1): R2-10.1186/gb-2013-14-1-r2.View ArticlePubMed CentralPubMedGoogle Scholar
- Bentley DR, Balasubramanian S, Swerdlow HP, Smith GP, Milton J, Brown CG, Hall KP, Evers DJ, Barnes CL, Bignell HR, Boutell JM, Bryant J, Carter RJ, Keira Cheetham R, Cox AJ, Ellis DJ, Flatbush MR, Gormley NA, Humphray SJ, Irving LJ, Karbelashvili MS, Kirk SM, Li H, Liu X, Maisinger KS, Murray LJ, Obradovic B, Ost T, Parkinson ML, Pratt MR, et al: Accurate whole human genome sequencing using reversible terminator chemistry. Nature. 2008, 456 (7218): 53-59. 10.1038/nature07517.View ArticlePubMed CentralPubMedGoogle Scholar
- Margulies M, Egholm M, Altman WE, Attiya S, Bader JS, Bemben LA, Berka J, Braverman MS, Chen Y-J, Chen Z, Dewell SB, Du L, Fierro JM, Gomes XV, Godwin BC, He W, Helgesen S, Ho CH, Irzyk GP, Jando SC, Alenquer MLI, Jarvie TP, Jirage KB, Kim J-B, Knight JR, Lanza JR, Leamon JH, Lefkowitz SM, Lei M, Li J, et al: Genome sequencing in microfabricated high-density picolitre reactors. Nature. 2005, 437 (7057): 376-380.PubMed CentralPubMedGoogle Scholar
- Rothberg JM, Hinz W, Rearick TM, Schultz J, Mileski W, Davey M, Leamon JH, Johnson K, Milgrew MJ, Edwards M, Hoon J, Simons JF, Marran D, Myers JW, Davidson JF, Branting A, Nobile JR, Puc BP, Light D, Clark TA, Huber M, Branciforte JT, Stoner IB, Cawley SE, Lyons M, Fu Y, Homer N, Sedova M, Miao X, Reed B, et al: An integrated semiconductor device enabling non-optical genome sequencing. Nature. 2011, 475 (7356): 348-352. 10.1038/nature10242.View ArticlePubMedGoogle Scholar
- Eid J, Fehr A, Gray J, Luong K, Lyle J, Otto G, Peluso P, Rank D, Baybayan P, Bettman B, Bibillo A, Bjornson K, Chaudhuri B, Christians F, Cicero R, Clark S, Dalal R, Dewinter A, Dixon J, Foquet M, Gaertner A, Hardenbol P, Heiner C, Hester K, Holden D, Kearns G, Kong X, Kuse R, Lacroix Y, Lin S, et al: Real-time DNA sequencing from single polymerase molecules. Science. 2009, 323 (5910): 133-138. 10.1126/science.1162986.View ArticlePubMedGoogle Scholar
- Goodstadt L: Ruffus: a lightweight python library for computational pipelines. Bioinformatics. 2010, 26 (21): 2778-2779. 10.1093/bioinformatics/btq524.View ArticlePubMedGoogle Scholar
- PyInstaller: http://www.pyinstaller.org/,
- Simpson JT, Wong K, Jackman SD, Schein JE, Jones SJM, Birol I: ABySS: a parallel assembler for short read sequence data. Genome Res. 2009, 19 (6): 1117-1123. 10.1101/gr.089532.108.View ArticlePubMed CentralPubMedGoogle Scholar
- Miller JR, Delcher AL, Koren S, Venter E, Walenz BP, Brownley A, Johnson J, Li K, Mobarry C, Sutton G: Aggressive assembly of pyrosequencing reads with mates. Bioinformatics. 2008, 24 (24): 2818-2824. 10.1093/bioinformatics/btn548.View ArticlePubMed CentralPubMedGoogle Scholar
- Peng Y, Leung HC, Yiu SM, Chin FY: IDBA-UD: a de novo assembler for single-cell and metagenomic sequencing data with highly uneven depth. Bioinformatics. 2012, 28 (11): 1420-1428. 10.1093/bioinformatics/bts174.View ArticlePubMedGoogle Scholar
- Zimin AV, Marcais G, Puiu D, Roberts M, Salzberg SL, Yorke JA: The MaSuRCA genome assembler. Bioinformatics. 2013, 29 (21): 2669-2677. 10.1093/bioinformatics/btt476.View ArticlePubMed CentralPubMedGoogle Scholar
- Namiki T, Hachiya T, Tanaka H, Sakakibara Y: MetaVelvet: an extension of Velvet assembler to de novo metagenome assembly from short sequence reads. Nucleic Acids Res. 2012, 40 (20): e155-10.1093/nar/gks678.View ArticlePubMed CentralPubMedGoogle Scholar
- Chevreux B, Pfisterer T, Drescher B, Driesel AJ, Muller WE, Wetter T, Suhai S: Using the miraEST assembler for reliable and automated mRNA transcript assembly and SNP detection in sequenced ESTs. Genome Res. 2004, 14 (6): 1147-1159. 10.1101/gr.1917404.View ArticlePubMed CentralPubMedGoogle Scholar
- Boisvert S, Laviolette F, Corbeil J: Ray: simultaneous assembly of reads from a mix of high-throughput sequencing technologies. J Comput Biol. 2010, 17 (11): 1519-1533. 10.1089/cmb.2009.0238.View ArticlePubMed CentralPubMedGoogle Scholar
- Boisvert S, Raymond F, Godzaridis E, Laviolette F, Corbeil J: Ray meta: scalable de novo metagenome assembly and profiling. Genome Biol. 2012, 13 (12): R122-10.1186/gb-2012-13-12-r122.View ArticlePubMed CentralPubMedGoogle Scholar
- Simpson JT, Durbin R: Efficient de novo assembly of large genomes using compressed data structures. Genome Res. 2012, 22 (3): 549-556. 10.1101/gr.126953.111.View ArticlePubMed CentralPubMedGoogle Scholar
- Luo R, Liu B, Xie Y, Li Z, Huang W, Yuan J, He G, Chen Y, Pan Q, Liu Y, Tang J, Wu G, Zhang H, Shi Y, Liu Y, Yu C, Wang B, Lu Y, Han C, Cheung DW, Yiu SM, Peng S, Xiaoqian Z, Liu G, Liao X, Li Y, Yang H, Wang J, Lam TW, Wang J: SOAPdenovo2: an empirically improved memory-efficient short-read de novo assembler. GigaScience. 2012, 1 (1): 18-10.1186/2047-217X-1-18.View ArticlePubMed CentralPubMedGoogle Scholar
- Bankevich A, Nurk S, Antipov D, Gurevich AA, Dvorkin M, Kulikov AS, Lesin VM, Nikolenko SI, Pham S, Prjibelski AD, Pyshkin AV, Sirotkin AV, Vyahhi N, Tesler G, Alekseyev MA, Pevzner PA: SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J Comput Biol. 2012, 19 (5): 455-477. 10.1089/cmb.2012.0021.View ArticlePubMed CentralPubMedGoogle Scholar
- Ye C, Ma ZS, Cannon CH, Pop M, Yu DW: Exploiting sparseness in de novo genome assembly. BMC Bioinforma. 2012, 13 (Suppl 6): S1-10.1186/1471-2105-13-S6-S1.View ArticleGoogle Scholar
- Chitsaz H, Yee-Greenbaum JL, Tesler G, Lombardo MJ, Dupont CL, Badger JH, Novotny M, Rusch DB, Fraser LJ, Gormley NA, Schulz-Trieglaff O, Smith GP, Evers DJ, Pevzner PA, Lasken RS: Efficient de novo assembly of single-cell bacterial genomes from short-read data sets. Nat Biotechnol. 2011, 29 (10): 915-921. 10.1038/nbt.1966.View ArticlePubMed CentralPubMedGoogle Scholar
- Chikhi R, Medvedev P: Informed and automated k-mer size selection for genome assembly. Bioinformatics. 2014, 30 (1): 31-37. 10.1093/bioinformatics/btt310.View ArticlePubMedGoogle Scholar
- Garrison E, Marth G: Haplotype-based variant detection from short-read sequencing. 2012, arXiv preprint arXiv:1207.3907Google Scholar
- Deloger M, El Karoui M, Petit MA: A genomic distance based on MUM indicates discontinuity between most bacterial species and genera. J Bacteriol. 2009, 191 (1): 91-99. 10.1128/JB.01202-08.View ArticlePubMed CentralPubMedGoogle Scholar
- NCBI RefSeq: ftp://ftp.ncbi.nih.gov/genomes/Bacteria/all.fna.tar.gzGoogle Scholar
- Seemann T: Prokka: rapid prokaryotic genome annotation. Bioinformatics. 2014, btu153:Google Scholar
- Wood D, Salzberg S: Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biol. 2014, 15 (3): R46-10.1186/gb-2014-15-3-r46.View ArticlePubMed CentralPubMedGoogle Scholar
- Parks DH, MacDonald NJ, Beiko RG: Classifying short genomic fragments from novel lineages using composition and homology. BMC Bioinforma. 2011, 12 (1): 328-328. 10.1186/1471-2105-12-328.View ArticleGoogle Scholar
- Darling AE, Jospin G, Lowe E, Matsen FAIV, Bik HM, Eisen JA: PhyloSift: phylogenetic analysis of genomes and metagenomes. PeerJ. 2014, 2: e243-View ArticlePubMed CentralPubMedGoogle Scholar
- Eddy SR: Accelerated profile HMM searches. PLoS Comput Biol. 2011, 7 (10): e1002195-e1002195. 10.1371/journal.pcbi.1002195.View ArticlePubMed CentralPubMedGoogle Scholar
- Brady A, Salzberg SL: Phymm and PhymmBL: metagenomic phylogenetic classification with interpolated markov models. Nat Methods. 2009, 6 (9): 673-676. 10.1038/nmeth.1358.View ArticlePubMed CentralPubMedGoogle Scholar
- FastQC: A quality control tool for high throughput sequence data: http://www.bioinformatics.babraham.ac.uk/projects/fastqc/,
- Ondov BD, Bergman NH, Phillippy AM: Interactive metagenomic visualization in a web browser. BMC Bioinforma. 2011, 12 (1): 385-385. 10.1186/1471-2105-12-385.View ArticleGoogle Scholar
- Command-line tools for processing biological sequencing data: https://code.google.com/p/ea-utils/,
- Comas I, Coscolla M, Luo T, Borrell S, Holt KE, Kato-Maeda M, Parkhill J, Malla B, Berg S, Thwaites G, Yeboah-Manu D, Bothamley G, Mei J, Wei L, Bentley S, Harris SR, Niemann S, Diel R, Aseffa A, Gao Q, Young D, Gagneux S: Out-of-Africa migration and Neolithic coexpansion of Mycobacterium tuberculosis with modern humans. Nat Genet. 2013, 45 (10): 1176-1182. 10.1038/ng.2744.View ArticlePubMed CentralPubMedGoogle Scholar
- Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol. 1990, 215 (3): 403-410. 10.1016/S0022-2836(05)80360-2.View ArticlePubMedGoogle Scholar
- Vicedomini R, Vezzi F, Scalabrin S, Arvestad L, Policriti A: GAM-NGS: genomic assemblies merger for next generation sequencing. BMC Bioinforma. 2013, 14 (Suppl 7): S6-10.1186/1471-2105-14-S7-S6.View ArticleGoogle Scholar
- Yao G, Ye L, Gao H, Minx P, Warren WC, Weinstock GM: Graph accordance of next-generation sequence assemblies. Bioinformatics. 2012, 28 (1): 13-16. 10.1093/bioinformatics/btr588.View ArticlePubMed CentralPubMedGoogle Scholar
- Sommer DD, Delcher AL, Salzberg SL, Pop M: Minimus: a fast, lightweight genome assembler. BMC Bioinforma. 2007, 8 (1): 64-64. 10.1186/1471-2105-8-64.View ArticleGoogle Scholar
- Perl Artistic License: http://dev.perl.org/licenses/artistic.html,
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.