Automated ensemble assembly and validation of microbial genomes
© Koren et al.; licensee BioMed Central Ltd. 2014
Received: 7 February 2014
Accepted: 24 April 2014
Published: 3 May 2014
The continued democratization of DNA sequencing has sparked a new wave of development of genome assembly and assembly validation methods. As individual research labs, rather than centralized centers, begin to sequence the majority of new genomes, it is important to establish best practices for genome assembly. However, recent evaluations such as GAGE and the Assemblathon have concluded that there is no single best approach to genome assembly. Instead, it is preferable to generate multiple assemblies and validate them to determine which is most useful for the desired analysis; this is a labor-intensive process that is often impossible or unfeasible.
To encourage best practices supported by the community, we present iMetAMOS, an automated ensemble assembly pipeline; iMetAMOS encapsulates the process of running, validating, and selecting a single assembly from multiple assemblies. iMetAMOS packages several leading open-source tools into a single binary that automates parameter selection and execution of multiple assemblers, scores the resulting assemblies based on multiple validation metrics, and annotates the assemblies for genes and contaminants. We demonstrate the utility of the ensemble process on 225 previously unassembled Mycobacterium tuberculosis genomes as well as a Rhodobacter sphaeroides benchmark dataset. On these real data, iMetAMOS reliably produces validated assemblies and identifies potential contamination without user intervention. In addition, intelligent parameter selection produces assemblies of R. sphaeroides comparable to or exceeding the quality of those from the GAGE-B evaluation, affecting the relative ranking of some assemblers.
Ensemble assembly with iMetAMOS provides users with multiple, validated assemblies for each genome. Although computationally limited to small or mid-sized genomes, this approach is the most effective and reproducible means for generating high-quality assemblies and enables users to select an assembly best tailored to their specific needs.
Genome assembly reconstructs a genome from many shorter sequencing reads as faithfully as possible [1–3]. Since reasonable formulations of the problem are NP-hard [2, 4], practical implementations often return an approximate solution that contains errors. Recent assembly evaluations like GAGE and the Assemblathon [5–8] have highlighted the chaotic nature of genome assembly, in which assembler performance varies widely across datasets and small parameter changes can have drastic effects. In GAGE-B, for example, each dataset required a different k-mer parameter, the best assembler was not consistent across datasets, and the continuity difference between the best and second-best assemblies was often two-fold.
Although genome assembly is a complex problem, validating assemblies is more straightforward. The quality of a genome assembly can be confirmed by verifying that the layout of reads is consistent with the sequencing process used to generate the data. Multiple tools have recently been developed for validating genome assemblies both with and without the use of a reference genome [10–15]. Thus, given the chaotic nature of assemblers and the relative ease of validation, it is recommended to generate multiple assemblies and use validation to determine the most appropriate one. This is akin to a “hypothesis generation” view of assembly, which can be most easily implemented as an ensemble of independent methods. Unfortunately, running multiple assemblers is a time-consuming, non-trivial task requiring substantial installation, learning, and maintenance costs.
Only a limited set of tools integrates automated parameter selection and validation into the assembly process. The A5 pipeline [17, 18] automates the microbial assembly process, but is limited to a single assembler and includes limited validation. CG-Pipeline is targeted to 454 sequencing. VelvetOptimizer automates a parameter sweep of k-mer sizes for the Velvet assembler, but uses contig N50 size as the optimization metric, which is not always representative of assembly quality [7, 22]. More recently, a number of assembly methods have been developed that incorporate assembly likelihood estimates into the primary assembly algorithm [23–25]. However, none of these tools robustly automates the execution of multiple assembly methods and validation metrics to achieve the best possible assembly. Here we present iMetAMOS, which automates the process of ensemble assembly and validation.
Whereas MetAMOS was developed for metagenomic assembly, iMetAMOS is an isolate-focused extension that encapsulates the current best practices for microbial genome assembly using Illumina, 454, Ion, or PacBio sequencing data. Building on the conclusions of GAGE and the Assemblathon, iMetAMOS runs multiple, independent tools to generate and validate assemblies. Uniquely, iMetAMOS automates the entire ensemble assembly process including automated parameter selection and sweeps, execution of multiple assembly and validation tools, preliminary gene annotation, and identification of potential contaminating sequences. This ensemble approach is robust to individual tool failures and reliably generates high-quality assemblies with minimal user input.
iMetAMOS is primarily written in Python and builds upon Ruffus for pipeline management. However, it incorporates many freely available tools written in a variety of languages. To simplify installation, iMetAMOS is distributed as 64-bit OS X and Linux binaries, including all supported assemblers, tools, and required databases. On 32-bit systems, iMetAMOS automatically downloads and installs the required dependencies as needed, which significantly simplifies installation.
To support future extensibility, iMetAMOS includes a generic framework to add new tools to the pipeline. Currently supported modules are for assembly and classification. When a new tool is available, no code modification is needed. Instead, a configuration is written to specify parameters for the tool and its required inputs and outputs. iMetAMOS will automatically load this configuration and run the requested tool. When an external tool is executed, a corresponding citation is output to ensure users of iMetAMOS properly credit the tools on which it relies.
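The configuration-driven approach described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the specification fields (`name`, `command`, `citation`), the tool `myassembler`, and its flags are hypothetical and do not reflect the actual iMetAMOS configuration syntax.

```python
import shlex

# Hypothetical tool specification; the field names, the tool name, and its
# flags are illustrative only and are NOT the real iMetAMOS config format.
SPEC = {
    "name": "myassembler",
    "command": "myassembler --reads {reads} --kmer {kmer} --out {output}",
    "citation": "Placeholder citation for myassembler.",
}


def build_command(spec, **params):
    """Fill the command template from the spec and split it into argv form."""
    return shlex.split(spec["command"].format(**params))


def prepare_tool(spec, used_citations, **params):
    """Build the argv for a tool and record its citation for later reporting.

    Nothing is executed here; the argv could be handed to subprocess.run.
    """
    used_citations.add(spec["citation"])
    return build_command(spec, **params)


citations = set()
argv = prepare_tool(SPEC, citations, reads="reads.fastq", kmer=31, output="asm")
print(argv)
print(sorted(citations))
```

Keeping the command template and citation in data rather than code is what lets a new tool be added without touching the pipeline itself.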
iMetAMOS enables reproducible analysis by recording all commands, software versions (via an MD5 hash), and intermediate inputs and outputs. The single, comprehensive binary is generated via PyInstaller, which also serves to fix and archive the exact version of all programs used. Reproducibility of custom analyses is supported via workflows. A workflow defines the software required for an analysis, as well as optional parameters and input data. Workflows support both local and remote file names, as well as SRA run identifiers, and can inherit their parameters from other workflows, allowing users to easily add or modify input data or parameters. Given a workflow, iMetAMOS will download any required remote data and run the analysis using pre-specified parameters. For further reproducibility, a workflow is automatically created for every iMetAMOS run, which can be easily shared with remote collaborators. If the data are available on the Internet, the entire analysis can be reproduced by two simple commands.
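Two of the ideas above, fingerprinting software by MD5 hash and workflows that inherit parameters from other workflows, can be illustrated with a small sketch. The workflow names, keys (`inherit`, `assemblers`, `kmer`, `reads`), and the run identifier are assumptions made for this example and are not the real iMetAMOS workflow format.

```python
import hashlib


def md5_of_bytes(data: bytes) -> str:
    """Return the hex MD5 digest of a byte string (e.g. a tool's binary)."""
    return hashlib.md5(data).hexdigest()


def resolve_workflow(name, workflows):
    """Merge a workflow with its ancestors; child settings override parents."""
    wf = workflows[name]
    parent = wf.get("inherit")
    merged = resolve_workflow(parent, workflows) if parent else {}
    merged.update({k: v for k, v in wf.items() if k != "inherit"})
    return merged


# Hypothetical workflows; all names and values are placeholders.
WORKFLOWS = {
    "isolate_default": {"assemblers": ["spades", "masurca"], "kmer": "auto"},
    "my_run": {
        "inherit": "isolate_default",
        "reads": ["SRA_RUN_ID_PLACEHOLDER"],
        "kmer": "35",
    },
}

resolved = resolve_workflow("my_run", WORKFLOWS)
print(resolved)
print(md5_of_bytes(b""))
```

In this sketch the child workflow only states what differs from its parent, which is the property that makes adding input data or changing one parameter cheap.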
Assembly is treated as a hypothesis generation and testing problem. Multiple assembly tools are run to ensure robustness to failure and a thorough exploration of the hypothesis space. The following assemblers are currently supported: ABySS, CABOG, IDBA-UD, MaSuRCA, MetaVelvet, MIRA, Ray/RayMeta [39, 40], SGA, SOAPdenovo2, SPAdes, SparseAssembler, Velvet, and Velvet-SC. For de Bruijn assemblers, a k-mer size is automatically selected using KmerGenie. Alternatively, users can specify a list of k-mers and iMetAMOS will run each assembler with each specified k-mer. In this mode, iMetAMOS can operate similarly to VelvetOptimizer, but for multiple assemblers and with more appropriate validation measures.
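The two modes described above (one automatically chosen k, or a user-supplied sweep) amount to building a job list over assemblers and k-mer sizes. The sketch below is a simplified illustration, not iMetAMOS code: the `DE_BRUIJN` set is a hand-picked subset, `auto_k` stands in for whatever value a tool such as KmerGenie would report, and assemblers outside the set are assumed to take no k parameter.

```python
# Illustrative subset of de Bruijn graph assemblers; not an exhaustive list.
DE_BRUIJN = {"velvet", "abyss", "soapdenovo2"}


def plan_runs(assemblers, kmers=None, auto_k=None):
    """Return a list of (assembler, k) jobs.

    If `kmers` is given, every de Bruijn assembler is run once per k.
    Otherwise each de Bruijn assembler is run once with `auto_k`.
    Assemblers not in DE_BRUIJN get a single job with k=None.
    """
    ks = list(kmers) if kmers else [auto_k]
    jobs = []
    for asm in assemblers:
        if asm in DE_BRUIJN:
            jobs.extend((asm, k) for k in ks)
        else:
            jobs.append((asm, None))
    return jobs


sweep = plan_runs(["velvet", "mira"], kmers=[31, 55])
single = plan_runs(["abyss", "cabog"], auto_k=35)
print(sweep)
print(single)
```

Each resulting job is an independent hypothesis, so a failure of one assembler or one k value does not prevent the remaining jobs from being validated.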
Each assembly is treated as a hypothesis subject to validation. The following validation tools are supported: ALE, CGAL, FRCbam, FreeBayes, LAP, QUAST, and REAPR. Both reference-based and reference-free validations are performed. For reference-based validation, a MUMi distance is used to recruit the most similar reference genome from RefSeq to calculate reference-based metrics. For reference-free validation, the input reads and read pairs are verified to be in agreement with the resulting assembly using both likelihood-based methods and mis-assembly features. In addition, to provide an initial annotation and comparison between gene content, the assemblies are automatically annotated using Prokka.
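Reference recruitment by MUMi distance can be sketched as picking the candidate with the smallest distance. The sketch assumes the commonly cited definition MUMi = 1 - Lmum / Lav, where Lmum is the total length of non-overlapping maximal unique matches and Lav is the average of the two genome lengths. The match lengths are supplied as plain numbers here (in practice they would come from a MUM-finding tool), and the reference names and values are made up.

```python
def mumi(l_mum, len_a, len_b):
    """MUMi distance under the assumed definition 1 - Lmum / Lav."""
    l_av = (len_a + len_b) / 2
    return 1 - l_mum / l_av


def recruit_reference(query_len, candidates):
    """Pick the reference with the smallest MUMi distance to the query.

    candidates: {reference_name: (reference_length, l_mum_with_query)}
    """
    return min(
        candidates,
        key=lambda ref: mumi(candidates[ref][1], query_len, candidates[ref][0]),
    )


# Toy values for illustration only.
refs = {"refA": (1000, 900), "refB": (1200, 300)}
best_ref = recruit_reference(1000, refs)
print(best_ref)
```

A distance near 0 indicates that most of both genomes is covered by unique matches, while a distance near 1 indicates that little is shared.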
From the ensemble, the “winning” assembly is selected using the consensus of the validation tools. For each selected metric, the assemblies are assigned an order from best to worst (with 1 being best). By default, the top assemblies are selected as those that are in the top 10% for at least half the metrics. The best assembly is then selected as the top-scoring assembly with the highest count of best scores. A user can select a single metric (e.g., consensus accuracy) or an arbitrarily weighted combination of metrics for validation. Importantly, this allows users to customize the validation process to suit their downstream project goals. For example, studies focused on phylogenetic tree reconstruction may prefer to minimize consensus errors, while structural variation studies may instead focus on maximizing continuity and minimizing long-range errors.
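One possible reading of this selection rule is sketched below. It is an interpretation, not the iMetAMOS implementation. The assumptions are: ties within a metric are broken by input order, the top-10% cutoff always admits at least one assembly, all assemblies are considered if none reaches the "half of the metrics" threshold, and remaining ties are broken by the number of top-fraction placements. The assembly names and scores are toy values.

```python
import math


def rank_per_metric(scores, higher_is_better):
    """scores: {assembly: {metric: value}} -> {metric: {assembly: rank}}, 1 = best."""
    ranks = {}
    for metric, hib in higher_is_better.items():
        ordered = sorted(scores, key=lambda a: scores[a][metric], reverse=hib)
        ranks[metric] = {a: i + 1 for i, a in enumerate(ordered)}
    return ranks


def select_winner(scores, higher_is_better, top_fraction=0.10):
    """Pick the assembly with the most first-place ranks among top candidates."""
    ranks = rank_per_metric(scores, higher_is_better)
    metrics = list(higher_is_better)
    # Assumption: at least one assembly per metric counts as "top".
    cutoff = max(1, math.ceil(top_fraction * len(scores) - 1e-9))
    top = {a: sum(ranks[m][a] <= cutoff for m in metrics) for a in scores}
    best = {a: sum(ranks[m][a] == 1 for m in metrics) for a in scores}
    candidates = [a for a in scores if top[a] >= len(metrics) / 2]
    if not candidates:  # Assumption: fall back to every assembly.
        candidates = list(scores)
    return max(candidates, key=lambda a: (best[a], top[a]))


# Toy scores for three hypothetical assemblies.
scores = {
    "asmA": {"corrected_n50": 139, "errors": 5, "likelihood": -20.1},
    "asmB": {"corrected_n50": 120, "errors": 3, "likelihood": -20.5},
    "asmC": {"corrected_n50": 40, "errors": 9, "likelihood": -22.0},
}
direction = {"corrected_n50": True, "errors": False, "likelihood": True}
winner = select_winner(scores, direction)
print(winner)
```

Restricting `direction` to a single metric reproduces the single-metric mode mentioned above; a weighted combination would replace the simple counts with weighted sums.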
Although iMetAMOS focuses on single-genome assembly, all inputs are considered as a metagenome to control against possible contamination. The winning assembly’s contigs and unassembled reads are analyzed by a taxonomic classification program. By default, iMetAMOS uses the k-mer based Kraken tool, but the alternative methods of FCP, PhyloSift, PHMMER, and PhymmBL are also supported. Contigs are partitioned into separate, taxon-specific directories (genus by default) according to their classification, so that contaminating sequence can be easily identified and removed. This process also serves as an initial species identification when assembling novel organisms.
The classification result is dependent on the classifier and database used, and serves as only a preliminary species identification or indicator of potential contamination. Manual follow-up is recommended to confirm the classification. For example, recently acquired genomic elements, such as phage integrations, may be incorrectly classified. Nevertheless, this initial binning facilitates rapid identification of the assembled organism and easier contaminant removal before downstream analysis or submission to a nucleotide archive.
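The taxon-specific binning described above can be illustrated with a short sketch. It assumes the classifier output has already been reduced to a mapping from contig identifier to genus label (or `None` when unclassified). The labels, lengths, and function names are hypothetical, and no directories are written.

```python
from collections import defaultdict


def partition_by_taxon(classifications, unclassified="unclassified"):
    """Group contig IDs by their assigned taxon label."""
    bins = defaultdict(list)
    for contig, taxon in classifications.items():
        bins[taxon or unclassified].append(contig)
    return dict(bins)


def summarize_bins(bins, lengths):
    """Return (taxon, total_bp) pairs, largest bin first."""
    totals = {t: sum(lengths[c] for c in contigs) for t, contigs in bins.items()}
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Toy classification results for illustration only.
labels = {
    "contig1": "Mycobacterium",
    "contig2": "Mycobacterium",
    "contig3": "Escherichia",
    "contig4": None,
}
lengths = {"contig1": 5000, "contig2": 3000, "contig3": 700, "contig4": 200}

bins = partition_by_taxon(labels)
summary = summarize_bins(bins, lengths)
print(summary)
```

In a summary like this, a small bin from an unexpected genus is a candidate contaminant to check manually, in line with the caveat that classifications are preliminary.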
The final output of iMetAMOS is a self-contained HTML5 summary page. Here, users can browse the output files as well as drill down to detailed results from any step in the pipeline. This includes FastQC reports for the preprocess step, QUAST graphs and metrics from the validation step, and an interactive Krona display of the taxonomic classifications.
With iMetAMOS it is possible to automatically recreate an assembler evaluation for every sample. We used iMetAMOS to perform ensemble assembly of the Rhodobacter sphaeroides 2.4.1 MiSeq dataset from the recent GAGE-B evaluation. In addition, our automated evaluation included four additional assemblers (IDBA-UD, SparseAssembler, Velvet-SC, and Ray) and validation metrics not utilized by GAGE-B (e.g., consensus accuracy).
The best assembly of this dataset, as selected by iMetAMOS with an automatically chosen k-mer size of 35, was MaSuRCA, matching the GAGE-B result. However, the corrected N50 of the iMetAMOS MaSuRCA assembly increased to 139 Kbp from the 120 Kbp using the manually selected k-mer size of 55 reported in GAGE-B. Similar improvements were observed for four assemblers when compared to the manually selected k-mer in GAGE-B. This improvement is the result of selecting a value of k to maximize assembly correctness, rather than the GAGE-B approach of maximizing the uncorrected contig N50 size. In cases where iMetAMOS did not outperform the GAGE-B results, GAGE-B had utilized EA-UTILS to pre-process the data. While EA-UTILS is supported by iMetAMOS, using raw sequencing data generated the best assemblies in GAGE-B, so pre-processing was disabled.
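The difference between uncorrected and corrected N50 can be made concrete with a small sketch. It assumes the usual definitions: N50 is the contig length at which the cumulative sum of lengths, taken from longest to shortest, first reaches half of the total, and the corrected value is the same statistic after splitting contigs at detected mis-assembly positions. By default the total is the assembly size; evaluations may instead pass a reference genome size through `total`. The contig lengths and breakpoints are toy values.

```python
def n50(lengths, total=None):
    """Length L such that contigs of length >= L cover at least half of total."""
    total = sum(lengths) if total is None else total
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if 2 * running >= total:
            return length
    return 0


def break_contig(length, breakpoints):
    """Split a contig of the given length at interior breakpoint positions."""
    cuts = [0] + sorted(breakpoints) + [length]
    return [b - a for a, b in zip(cuts, cuts[1:]) if b > a]


def corrected_n50(contigs, total=None):
    """contigs: list of (length, [breakpoints]); N50 after breaking at errors."""
    pieces = [p for length, bps in contigs for p in break_contig(length, bps)]
    return n50(pieces, total)


# Toy assembly: the longest contig carries one mis-assembly at position 50.
contigs = [(100, [50]), (80, []), (60, []), (40, []), (20, [])]
raw = n50([length for length, _ in contigs])
corrected = corrected_n50(contigs)
print(raw, corrected)
```

Here a single mis-join in the longest contig lowers the statistic once it is accounted for, which is why optimizing k for uncorrected N50 can reward assemblies that are long but wrong.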
[Table: GAGE-B assemblies versus iMetAMOS (iMA) assemblies on the R. sphaeroides dataset. Recoverable column headers: GAGE-B reference coverage; iMA reference coverage.]
The detection and removal of contaminating DNA sequences is an often-overlooked phase of assembly. For example, the Assemblathon 1 dataset included a mock contaminant, which only a few teams attempted to detect and remove. Failure to remove real contaminants from assemblies degrades the quality of the public databases to which these genomes are submitted.
[Table: iMetAMOS average runtime on 225 M. tuberculosis samples. Recoverable column header: Average time (h).]
We have developed an open-source microbial analysis pipeline, iMetAMOS, which automates the process of ensemble assembly. In addition, its modular architecture is extensible and able to incorporate additional analyses or alternative tools. A potential enhancement is assembly correction, or contig breaking, which iMetAMOS does not currently support. However, the infrastructure required to support this is largely in place. For example, REAPR is included with iMetAMOS and capable of splitting assembled contigs at predicted mis-assemblies. Using this and other supplied validation tools, assembly breaking could be iteratively performed until the validation scores are no longer improving or no more corrections are possible. Alternatively, because iMetAMOS generates multiple assemblies, assembly reconciliation techniques [61–63] could be incorporated into the pipeline. However, in practice, we have found the simple process of running multiple assemblers with multiple parameters is capable of generating high-confidence assemblies on its own, while merging assemblies can increase the risk of mis-assembly without significantly improving continuity.
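The proposed iterative breaking loop, which is described as a possible enhancement rather than an existing feature, could look roughly like the sketch below. The `score` and `correct` callables are hypothetical stand-ins for a validation tool and a contig-breaking tool. The toy versions only manipulate (length, flagged_errors) tuples and do not call REAPR or any other program.

```python
def iterative_correction(assembly, score, correct, max_rounds=10):
    """Apply `correct` repeatedly while the validation score keeps improving."""
    best, best_score = assembly, score(assembly)
    for _ in range(max_rounds):
        candidate = correct(best)
        if candidate == best:
            break  # no more corrections are possible
        candidate_score = score(candidate)
        if candidate_score <= best_score:
            break  # validation score is no longer improving
        best, best_score = candidate, candidate_score
    return best, best_score


def toy_correct(contigs):
    """Split the first contig that still has a flagged error into two halves."""
    for i, (length, errors) in enumerate(contigs):
        if errors > 0:
            half = length // 2
            pieces = [(half, errors - 1), (length - half, 0)]
            return contigs[:i] + pieces + contigs[i + 1:]
    return contigs


def toy_score(contigs):
    """Higher is better: the negated number of flagged errors."""
    return -sum(errors for _, errors in contigs)


final, final_score = iterative_correction([(100, 2), (50, 0)], toy_score, toy_correct)
print(final, final_score)
```

The two stopping conditions mirror the text: the loop ends when a round produces no change or when the score fails to improve, with `max_rounds` as an added safety bound.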
The iMetAMOS extensible framework also supports customizable workflows and parameters on a per-user or per-run basis. Because all components of iMetAMOS are open source, users and tool authors are able to contribute improved parameters to the repository. Users can also contribute custom workflows tailored for specific analyses. In this way, iMetAMOS can serve as a best-practice repository for multiple assemblers, data types, and analysis tools.
iMetAMOS enables accurate and reproducible genome assembly via a “GAGE-in-a-box” analysis, allowing non-expert users to run multiple assemblers, validation metrics, and annotations with a single command. Results are presented in a simplified and interactive HTML5 format, and reproducibility is enabled through detailed logging and workflows. The current implementation supports over thirteen assemblers and seven validation tools, and its modular architecture supports the easy addition of future tools. Ensemble assembly is more robust, reproducible, and accurate than manual assembly, even surpassing the quality of GAGE-B assemblies using the same data and tools. Most importantly, iMetAMOS provides users with a simple means to generate multiple assemblies and validation metrics, empowering them to choose the best assembly for their specific needs.
Project name: iMetAMOS.
Project home page: http://www.cbcb.umd.edu/software/imetamos.
Operating systems: Linux/OS X.
Programming language: Python, C++, Perl, and Java.
Other requirements: Perl (5.8.8+), Python (2.7.3+), Java (1.6+), R (2.11.1+ with PNG support), gcc (4.7+ recommended), git, curl.
License: iMetAMOS and metAMOS-specific code are released open source under the Perl Artistic License.
All assemblies described here are available for download from http://www.cbcb.umd.edu/software/imetamos. The exact version of iMetAMOS used for analysis in this manuscript is available from ftp://ftp.cbcb.umd.edu/pub/data/metamos/imetamos_pub.tar.gz. However, we recommend using the latest release for all analyses.
We thank Magoc et al. and Comas et al. who submitted the raw data that was used in this study. We thank Lex Nederbragt and an anonymous reviewer for detailed comments on the manuscript and iMetAMOS software, usability, and documentation. The contributions of SK, TJT, and AMP were funded under Agreement No. HSHQDC-07-C-00020 awarded by the Department of Homeland Security Science and Technology Directorate (DHS/S&T) for the management and operation of the National Biodefense Analysis and Countermeasures Center (NBACC), a Federally Funded Research and Development Center. The views and conclusions contained in this document are those of the authors and should not be interpreted as necessarily representing the official policies, either expressed or implied, of the U.S. Department of Homeland Security. In no event shall the DHS, NBACC, or Battelle National Biodefense Institute (BNBI) have any responsibility or liability for any use, misuse, inability to use, or reliance upon the information contained herein. The Department of Homeland Security does not endorse any products or commercial services mentioned in this publication. MP and CMH were supported by NIH grant R01-AI-100947 and NSF grant IIS-1117247.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.