Minimus: a fast, lightweight genome assembler
© Sommer et al. 2007
Received: 06 October 2006
Accepted: 26 February 2007
Published: 26 February 2007
Genome assemblers have grown very large and complex in response to the need for algorithms to handle the challenges of large whole-genome sequencing projects. Many of the most common uses of assemblers, however, are best served by a simpler type of assembler that requires fewer software components, uses less memory, and is far easier to install and run.
We have developed the Minimus assembler to address these issues, and tested it on a range of assembly problems. We show that Minimus performs well on several small assembly tasks, including the assembly of viral genomes, individual genes, and BAC clones. In addition, we evaluate Minimus' performance in assembling bacterial genomes in order to assess its suitability as a component of a larger assembly pipeline. We show that, compared with other software currently used for these tasks, Minimus produces significantly fewer assembly errors, at the cost of generating a more fragmented assembly.
We find that for small genomes and other small assembly tasks, Minimus is faster and far more flexible than existing tools. Due to its small size and modular design Minimus is perfectly suited to be a component of complex assembly pipelines. Minimus is released as an open-source software project and the code is available as part of the AMOS project at Sourceforge.
With the advent of whole-genome shotgun (WGS) sequencing in the mid-1990s, the genomics community had an urgent need for software that could process tens of thousands of individual sequence "reads" and assemble those into the genome from which they had come. The first generation of assemblers, including TIGR Assembler, phrap, and CAP3, was able to assemble small- to medium-sized bacterial genomes, often requiring several weeks of computer time on the fastest computers then available. As sequencing technology improved, ever larger projects were attempted with the WGS method, and it became clear that new methods were needed. For the 130 million base pair (Mbp) genome of the fruit fly Drosophila melanogaster, an entirely new assembler was developed, which incorporated many new ideas about efficient memory usage and sophisticated repeat processing. The Celera Assembler (CelAsm) was also the first algorithm to use mate pair information to any serious degree: taking advantage of the fact that most reads in a WGS project are generated in pairs, that system used the expected distance between reads in a pair to impose many useful constraints on the overall assembly. Other large-scale WGS assemblers followed, including Arachne [5, 6], which was used to assemble the 2.6 billion base pair (Gbp) mouse genome, Phusion, Atlas, and JAZZ.
As these systems have scaled up to meet the needs of very large WGS projects, they have grown in size and complexity, so that today, only a few very sophisticated bioinformatics groups have the expertise needed to install and run them. Like many large systems, these assemblers are relatively brittle, meaning that they often crash if the data does not conform to fairly rigid specifications. However, because they produce far superior results to the first generation of assemblers, the leading genome centers have focused their efforts on these large assemblers to the exclusion of other approaches.
Meanwhile, a host of new genome sequencing applications has arisen that place different demands on assembly algorithms. Although large-scale sequencing has pushed assembly technology in productive directions, small-scale sequencing efforts have proliferated as well. Our group recognized the growing need for an assembler that could assemble a handful of sequencing reads with a minimum of overhead (both computational and human), and as a result we have developed Minimus, a fast, "lightweight" assembler that addresses these needs. Before describing the algorithm and our results, we will describe several of these motivating applications.
Since the very first bacterial genome, Haemophilus influenzae, was sequenced, we and our colleagues at The Institute for Genomic Research (TIGR) have been developing methods for closing the gaps in a draft genome. The initial assembly of a WGS project normally produces a large collection of contiguous pieces of DNA (contigs) that are separated by gaps. Improvements in sequencing and assembly technology have yielded fewer gaps per megabase in recent years, but nonetheless, the increased scale of sequencing has meant that large centers have many more gaps to fill. One unintended by-product of this trend is that many genomes today are left in "draft" form: the initial assembly is the only assembly, and the published genome consists of hundreds or thousands of unordered contigs.
Fortunately, many genomes, especially those of the greatest scientific interest, are still being finished, which means that all gaps need to be closed. Gap closure consists of running additional sequencing reactions that fill in the gap between two adjacent contigs. If the gap is filled with repetitive sequence (which is often the case), then "closure" teams may go to great lengths to clone and sequence small DNA fragments that correctly span the gap. Once these sequences are generated, the final step is to assemble the gap. This requires that the newly generated sequences, often spanning just a few kilobases or even less, be assembled together with the two surrounding contigs.
Large-scale assemblers such as CelAsm and Arachne can be used for this task, but this presents several problems. First, the scale of these programs means that simply loading them into memory can take longer than the execution time of the assembly itself. Second, the laboratory teams filling gaps typically use graphical tools to manage gaps, and configuring these tools to call a very large external program is impractical if not impossible. Third, and perhaps most telling, the cleverness of these WGS assemblers is a hindrance for gap closure, because the data do not conform to the characteristics of a typical shotgun process. The depth of coverage of finishing reads often differs from that in the surrounding areas, thereby confusing the statistical repeat detection mechanisms present in large-scale assemblers and preventing a correct assembly of the gap. Therefore an assembler for gaps will do better by using a simple, straightforward algorithm, focused on a specific region of the genome. Finally, these assemblers cannot be easily modified to address the specific issues raised by specialized finishing procedures, especially as new finishing techniques are continuously being developed. For example, high-throughput finishing experiments often use transposons to sample a problematic region, resulting in paired reads that are facing away from each other (the sequencing proceeds away from the transposon). Such constraints cannot be easily incorporated in existing assemblers, which are hard-coded to assume paired reads are facing inwards, towards the middle of the corresponding shotgun fragment. Flexible tools like Minimus and AMOS provide the potential for incorporating such information through add-on modules.
Another important use of small-scale assembly takes advantage of the rapidly growing Trace Archive at NCBI, a public repository of all the raw data from many large sequencing projects. Because it takes months and sometimes years before the final, assembled sequence from a genome project is released, scientists use the BLAST search function at the Trace Archive to find reads matching a gene of interest. If the gene is contained in the Trace Archive data, then a search will return anywhere from a handful to a few hundred sequences. These need to be assembled together to produce a better picture of the genomic region containing the gene. Once again, the scientist needs a small, less finicky assembler for this purpose.
Although most sequencing capacity is taken up by the largest genome projects, the number of small genomes being sequenced easily outstrips, in number of species and strains, the number of large genomes. Ironically, some of the very clever and complicated ideas that make CelAsm, Arachne, and other assemblers work for large genomes make them less than ideal for these small genomes. Viruses are a good example: they typically have genomes ranging from 5–50 kilobases, and they contain relatively little repetitive DNA. Thus there is no need to characterize the repeat content, and a simple assembler that ignores the issues of large-scale WGS projects will produce a perfectly correct assembly more quickly. For example, the Influenza Genome Sequencing Project, which uses an RT-PCR strategy rather than WGS, has assembled over 1000 influenza genomes using Minimus, with savings coming from not having to address special formatting requirements to prepare the data and from not having to maintain a large assembly software package.
The Minimus assembler was built in a modular fashion from software modules available within the AMOS assembly package and is released as one of the components of this package. AMOS is an open-source software package that provides researchers with a collection of modules and software libraries that are useful in the development of genome assembly and analysis software. A full description of the AMOS package is beyond the scope of this paper and will be published elsewhere (M. Pop, manuscript in preparation). Minimus combines the following three AMOS modules:
1. hash-overlap – a sequence overlapper that uses minimizers to increase speed and decrease memory usage.
2. tigger – a unitigger, i.e., a tool that identifies clusters of reads that can be uniquely assembled, based on algorithms developed by Myers [28, 29]; in graph-theoretic terms, a unitigger identifies maximal interval subgraphs of the overlap graph.
3. make-consensus – a progressive multiple alignment program that refines the read layout generated by the unitigger to build a precise multiple alignment of these reads.
Note that sequence quality values are only used during the generation of the multiple alignment consensus (step 3). Other assemblers, such as phrap, use the quality values as an integral component of the assembly algorithm. We found that, due to the high quality of data produced by modern sequencing instruments, the explicit consideration of quality values during the overlap and unitigging steps is unnecessary. Instead we only use the quality data to trim the poor quality flanks of each read (see below under Sequence trimming), and to compute the consensus (and associated quality values) for the multiple alignment of co-assembled reads.
An execution of Minimus consists of the following stages, described in detail below.
The shotgun reads are loaded into the AMOS bank. The inputs are presented as an AMOS message file, whose format is modeled on the format used by Celera Assembler. Virtually any existing format for representing shotgun data can be easily converted to this message format with the help of conversion tools distributed with the AMOS package.
The hash-overlap program is used to compute all pair-wise alignments between the reads provided in the input.
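To make the role of minimizers in the overlap stage concrete, the following Python sketch selects, in every window of w consecutive k-mers, the lexicographically smallest k-mer, and then proposes as overlap candidates only those read pairs that share a selected k-mer. This is an illustration of the general minimizer idea, not the hash-overlap implementation: the function names, the parameter values k=4 and w=3, and the toy reads are all assumptions made for the example, and reverse-complement matches and the subsequent alignment step are omitted.

```python
from collections import defaultdict
from itertools import combinations


def minimizers(seq, k=4, w=3):
    """Return the set of (k-mer, position) minimizers of seq.

    In each window of w consecutive k-mers, the lexicographically
    smallest k-mer is chosen (the leftmost one on ties).
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    picked = set()
    for start in range(len(kmers) - w + 1):
        window = kmers[start:start + w]
        offset = min(range(w), key=lambda j: (window[j], j))
        picked.add((window[offset], start + offset))
    return picked


def candidate_pairs(reads, k=4, w=3):
    """Return the pairs of read names that share at least one minimizer k-mer."""
    index = defaultdict(set)          # minimizer k-mer -> names of reads containing it
    for name, seq in reads.items():
        for kmer, _pos in minimizers(seq, k, w):
            index[kmer].add(name)
    pairs = set()
    for names in index.values():
        pairs.update(combinations(sorted(names), 2))
    return pairs


# Toy reads (hypothetical): r1 and r2 share the 8-base substring TGCATGGA.
reads = {
    "r1": "ACGTTGCATGGA",
    "r2": "TGCATGGACCTA",
    "r3": "GGGGGGGGGGGG",
}
print(candidate_pairs(reads))  # {('r1', 'r2')}
```

Two reads that share an exact substring of at least w + k − 1 bases are guaranteed to share a minimizer, so only a fraction of all k-mers needs to be stored while long exact matches are still found; only the candidate pairs would then be passed to a full alignment.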
The tigger module constructs a graph representation of the set of overlaps determined in the overlap stage. The overlap graph contains a node for each shotgun read, and an edge connects two nodes if the corresponding reads overlap. The unitigger then uses several reduction steps to simplify this graph, and generate a set of unitigs, based on algorithms originally developed by Myers [28, 29]. Briefly, these reduction steps are:
1. Removal of containment edges. Reads completely contained within other reads in the input are removed from the graph.
2. Transitive reduction. For any set of three reads (A, B, and C), if the overlap between A and C can be inferred from the overlaps between reads A and B, and B and C, this overlap (i.e. the edge corresponding to this overlap) is removed from the graph.
3. Unique-join collapsing. Every simple path in the graph (a path that contains no branches, i.e. all of its nodes have in- and out-degrees equal to 1) is collapsed into a single vertex. Each such vertex represents an individual unitig.
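The three reduction steps above can be sketched on a toy overlap graph. The Python code below is a simplified illustration and not the tigger implementation: it assumes that overlaps are already given as directed edges (a, b), meaning that the end of read a overlaps the start of read b, that contained reads are already flagged, and that read orientation and overlap lengths can be ignored. The function name and the toy data are hypothetical.

```python
def build_unitigs(reads, edges, contained):
    """Reduce a toy overlap graph to unitigs (lists of read names)."""
    # Step 1: remove contained reads together with all of their edges.
    nodes = [r for r in reads if r not in contained]
    keep = set(nodes)
    graph = {(a, b) for a, b in edges if a in keep and b in keep}

    # Step 2: transitive reduction. Drop a->c when some b gives a->b and b->c.
    succ = {n: {b for a, b in graph if a == n} for n in nodes}
    transitive = {(a, c) for a, c in graph
                  if any(c in succ[b] for b in succ[a] if b != c)}
    graph -= transitive

    # Step 3: collapse unique joins (out-degree of a is 1 and in-degree of b is 1).
    succ = {n: [b for a, b in sorted(graph) if a == n] for n in nodes}
    pred = {n: [a for a, b in sorted(graph) if b == n] for n in nodes}

    def unique_join(a, b):
        return len(succ[a]) == 1 and len(pred[b]) == 1

    def is_chain_start(n):
        return not (len(pred[n]) == 1 and unique_join(pred[n][0], n))

    unitigs, seen = [], set()
    starts = [n for n in nodes if is_chain_start(n)]
    for n in starts + nodes:          # the second pass picks up circular chains
        if n in seen:
            continue
        chain = [n]
        seen.add(n)
        while True:
            out = succ[chain[-1]]
            if len(out) != 1 or not unique_join(chain[-1], out[0]) or out[0] in seen:
                break
            chain.append(out[0])
            seen.add(out[0])
        unitigs.append(chain)
    return unitigs


# Toy input: D is contained in another read, A->C is implied by A->B and B->C,
# and C branches to E and F (as happens at a repeat boundary).
reads = ["A", "B", "C", "D", "E", "F"]
edges = {("A", "B"), ("B", "C"), ("A", "C"), ("D", "C"), ("C", "E"), ("C", "F")}
print(build_unitigs(reads, edges, contained={"D"}))
# [['A', 'B', 'C'], ['E'], ['F']]
```

In this example the chain A, B, C is collapsed into one unitig, while the branch at C stops the extension; this conservative behaviour at branches is what trades contig length for correctness.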
The final stage of Minimus constructs the full multiple alignment of the reads aligned within each unitig, using as a guide the approximate placement of the reads inferred from the overlap information.
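As a rough illustration of how a read layout and quality values can be combined into a consensus, the sketch below takes a quality-weighted vote in each column. It is not the make-consensus algorithm: it assumes that the reads are already placed at fixed offsets without gaps, whereas make-consensus builds a progressive multiple alignment. The function name, the data layout, and the example values are hypothetical.

```python
from collections import defaultdict


def consensus(layout):
    """Quality-weighted majority consensus of ungapped reads.

    layout: list of (offset, bases, quals) tuples, where offset is the
    column of the first base and quals holds one quality value per base.
    """
    columns = defaultdict(lambda: defaultdict(int))   # column -> base -> summed quality
    for offset, bases, quals in layout:
        for i, (base, qual) in enumerate(zip(bases, quals)):
            columns[offset + i][base] += qual

    result = []
    for pos in range(min(columns), max(columns) + 1):
        votes = columns.get(pos)
        if not votes:
            result.append("N")                         # column covered by no read
        else:
            # Highest summed quality wins; ties go to the alphabetically first base.
            result.append(max(sorted(votes), key=votes.get))
    return "".join(result)


layout = [
    (0, "ACGTAC", [30, 30, 30, 30, 30, 30]),
    (2, "GTTCGG", [30, 30, 10, 30, 30, 30]),   # low-quality T disagrees at column 4
    (4, "ACGGT", [30, 30, 30, 30, 30]),
]
print(consensus(layout))  # ACGTACGGT
```

The low-quality mismatch at column 4 is outvoted by the two high-quality bases, which is the reason quality values matter at this stage even when they are ignored during overlapping and unitigging.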
The criteria used for trimming the vector sequence and the poor quality flanks of shotgun reads vary significantly depending on the source of the data and on the protocols employed during sequencing. In designing Minimus we therefore opted to perform the trimming of the data with external software tools that can be customized to the specific characteristics of the data. For the examples described in this paper we followed two different approaches for sequence trimming:
1. For data where we had confidence that the Trace Archive clipping coordinates were correct (i.e. the two bacterial genomes) we simply used the coordinates provided to us.
2. For the other data-sets (zebrafish gene and mouse BACs) we followed the protocol described at , specifically we used the program Lucy  for quality trimming, followed by a k-mer based vector trimming protocol.
Note that while phrap performs some trimming based on quality values, in order to ensure consistent trimming of the data, we provided phrap with sequences already trimmed according to the protocol described above.
To extract the set of zebrafish shotgun reads that map to the human GPC3 gene, we built an NCBI Blast database containing the high-quality region of the zebrafish reads (obtained by removing the sequencing vector and the poor quality regions). We then aligned the protein sequence of the human GPC3 gene using tblastn with an E-value cutoff of 0.01. All reads matching GPC3 under these extremely relaxed criteria were then provided to Minimus for assembly.
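The read selection step can be sketched as a filter over tabular search results. The Python example below assumes that the tblastn hits are available in the 12-column tabular layout written by NCBI BLAST+ with `-outfmt 6` (query id, subject id, percent identity, alignment length, mismatches, gap opens, query start, query end, subject start, subject end, E-value, bit score); whether the original analysis used this output format is not stated in the text, and the read names and scores shown are invented for illustration.

```python
def reads_matching(tabular_lines, max_evalue=0.01):
    """Return the sorted, de-duplicated subject (read) ids with E-value <= max_evalue."""
    hits = set()
    for line in tabular_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue                                   # skip blank and comment lines
        fields = line.split("\t")
        subject_id, evalue = fields[1], float(fields[10])
        if evalue <= max_evalue:
            hits.add(subject_id)
    return sorted(hits)


# Hypothetical hits of one protein query against a database of reads.
example = [
    "GPC3\tread_0017\t48.2\t85\t44\t0\t101\t185\t12\t266\t2e-18\t84.0",
    "GPC3\tread_0452\t35.0\t60\t39\t0\t300\t359\t540\t361\t0.004\t38.5",
    "GPC3\tread_0911\t30.1\t53\t37\t0\t20\t72\t90\t248\t1.3\t29.6",
    "GPC3\tread_0017\t41.0\t39\t23\t0\t186\t224\t400\t516\t5e-05\t45.1",
]
print(reads_matching(example))  # ['read_0017', 'read_0452']
```

The resulting list of read identifiers is what would be extracted from the read set and handed to the assembler; a read with several matching segments is reported only once.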
To demonstrate the capabilities of Minimus we present its application to the assembly of several small data-sets: influenza A virus isolates, individual genes, and BAC clones. We compare the performance of Minimus to that of phrap, the "small assembler" most commonly used for such small assembly tasks. We also used Minimus to assemble two bacterial genomes, Brucella suis and Staphylococcus aureus, to illustrate its potential use as one of the components of a complex assembly pipeline. Genome assemblers such as Atlas, developed at the Human Genome Sequencing Center at the Baylor College of Medicine, and Phusion, developed at the Sanger Center, represent such assembly pipelines. Both assemblers use a hierarchical approach to partition the reads into small sets during an initial clustering step, then assemble each of the clusters with the phrap assembler.
Before describing our results we would like to emphasize that the comparisons to phrap provided below are inherently skewed because phrap and Minimus were designed to solve different problems. These comparisons are relevant, however, because phrap has been widely applied to assembly tasks that fall outside the scope of the original intended use for this program. We will demonstrate that Minimus provides scientists with a better tool for small assembly tasks, be it the assembly of viral genomes or individual genes, or as a component in a larger assembly pipeline such as Atlas or Phusion. The high stringency of the algorithms employed by Minimus obviates the need for the complex modules commonly used (e.g., the RPphrap module of Phusion) in such assembly pipelines to correct the errors introduced by phrap. In addition, the flexibility provided by Minimus' well-defined interfaces and open-source license allows scientists to adapt and extend our software as needed by their specific projects. Such enhancements are virtually impossible with phrap due to the restrictive license and code release model.
Assembling the influenza A virus is an ideal application for Minimus due to the small size of the virus. The influenza A sequencing project, currently underway at TIGR, has been using Minimus to assemble the genomes of more than 1400 individual isolates of the influenza virus. The sequencing pipeline at TIGR generates approximately 200 sequencing reads for each viral isolate, providing approximately 4-fold coverage of the 8 segments composing the flu genome. The assembly of the influenza genome is performed in a hierarchical manner, building a collection of contigs using Minimus with high stringency settings, then improving this assembly during two additional passes that combine Minimus with quality trimming software. In approximately 95% of the cases (J. Sitz, personal communication), this hierarchical process results in complete reconstructions of each of the segments, these data forming the substrate for genome annotation and for other subsequent analyses. The whole assembly process, including the time needed to access the database used to store the reads and the resulting assemblies, takes approximately 4 minutes. The actual time used by Minimus for assembling the data is approximately 2 seconds/segment during each of the three passes. The shotgun reads and the assemblies produced by Minimus are made freely available to the scientific community by submission to the NCBI Trace and Assembly Archives.
One of the applications that initially drove the development of Minimus is the assembly of an individual gene from reads "fished" out of a shotgun dataset by alignment to a homologous gene from a related organism. This application is particularly relevant to the study of large eukaryotic genomes that are being sequenced but for which no assembly has yet been made available to the scientific community. While sequencing is a highly automated process, the assembly of large genomes is a time-consuming activity that requires extensive manual intervention, particularly in the case of large, highly repetitive genomes, or genomes with highly divergent homologous chromosomes. Thus, it is not uncommon for the raw shotgun data to be deposited in the Trace Archive months, and sometimes years, before an assembly of a genome is made available, even in a draft form. This situation makes it difficult for scientists to ask questions such as "does this organism being sequenced have a homologue of gene X?", or "how many copies of gene Y are present in this genome?" Such questions are often difficult to answer even if a draft assembly is available, as evidenced, for example, by the absence of chromosome Y-linked genes in an early draft of Drosophila pseudoobscura; in that case, investigators found the genes of interest by directly examining the underlying shotgun data.
To highlight the application of Minimus to assembling individual genes directly extracted from the shotgun data, we attempted to assemble the zebrafish (Danio rerio) homologues to the human glypican-3 (GPC3) gene. The GPC3 gene is highly expressed during development and has been implicated in a variety of cancers as well as in the Simpson-Golabi-Behmel overgrowth syndrome (see, e.g., [16–19]). We chose this combination of organisms due to the large evolutionary distance between human and zebrafish, as well as the fluid nature of the draft assembly of the zebrafish genome (currently at version 6 and still being actively improved).
To ascertain whether the incomplete coverage of the human GPC3 is due to limitations in our methodology, or to actual differences between the human and zebrafish homologues, we aligned the annotated zebrafish GPC3 homologue to the human protein (Figure 2 bottom). The alignment reveals the zebrafish GPC3 gene to be shorter than its human counterpart, consistent with our reconstruction. In fact, the Minimus contigs cover most of the zebrafish gene, with the exception of approximately 100 amino acids at the C terminus. This comparison also reveals a limitation of our approach. Short exons and/or splicing differences between the human and zebrafish homologues of the gene may prevent a simple translated search from identifying the shotgun reads necessary to reconstruct the full length gene. Despite such limitations, we believe our results show that Minimus can be successfully used as a first step in characterizing the homologues of a gene of interest in a newly sequenced organism. Furthermore, the approach we chose can be easily augmented to hierarchically recruit additional shotgun reads that extend the initial set of contigs, eventually reconstructing assemblies of entire genes. We implemented a simple version of such a procedure by also recruiting the mates for all reads identified during the translated searches. Unfortunately, the inclusion of these reads into the assembly process resulted in only marginal improvements. Better results will undoubtedly be obtained by extending this process to also incorporate reads that overlap the reconstructed contigs; however, an implementation of such a procedure is beyond the scope of the current paper.
Comparison of Minimus and phrap in the assembly of 10 mouse BACs from data obtained from the NCBI Trace Archive.
Columns: BAC size (bp); # Reads/seq. coverage; N50 contig size (kbp).
[Table body not recoverable. Surviving time entries, in source order: 1 m 45 s; 2 m 55 s; 1 m 5 s; 2 m 33 s; 1 m 43 s; 3 m 11 s; 6 m 30 s; 3 m 49 s; 9 m 53 s; 1 m 22 s; 1 m 5 s; 1 m 39 s; 2 m 29 s; 1 m 19 s; 2 m 12 s.]
Comparison of Minimus and phrap in the assembly of two bacterial genomes (Brucella suis and Staphylococcus aureus).
Columns: Genome size (Mbp); # Reads/seq. coverage; N50 contig size (kbp).
[Table body not recoverable. Surviving time entries, in source order: 6 m 30 s; 30 m 2 s; 16 m 40 s.]
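The N50 contig size used as a column in both comparisons is a standard measure of assembly contiguity. The short Python sketch below computes it under the common definition in which N50 is the length of the contig at which the running sum of contig lengths, taken from longest to shortest, first reaches half of the total assembled length. Whether the tables use the total contig length or the known genome size as the denominator is not stated in the text, so this choice is an assumption, as are the function name and the example lengths.

```python
def n50(contig_lengths):
    """Return the N50 of a list of contig lengths (0 for an empty list)."""
    total = sum(contig_lengths)
    running = 0
    for length in sorted(contig_lengths, reverse=True):
        running += length
        if 2 * running >= total:      # at least half of the assembly is covered
            return length
    return 0


# Total length 300: the two longest contigs (100 + 80) cover at least 150 bases.
print(n50([100, 80, 60, 40, 20]))  # 80
```

A more fragmented assembly of the same data yields a smaller N50, which is how the trade-off between contig size and assembly correctness shows up in this statistic.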
One, perhaps surprising, result of our experiments is the higher fragmentation of the BAC assemblies in comparison to the bacterial assemblies (observed both for Minimus and phrap), even though the BACs were sequenced to a deeper level of coverage. The reason for this fragmentation is the higher density of repeats in the mouse genome. Eukaryotic genomes often contain high-copy repeats that disrupt the assembly process, even within the range of a BAC insert. Such complex repeats are less frequently encountered in bacteria.
We have described Minimus, a shotgun sequence assembly program designed for the assembly of small data-sets, and shown that Minimus can be successfully used to extract individual genes from shotgun data-sets, thereby providing scientists with the means to analyze newly sequenced organisms long before complete genome assemblies are made available. Due to its small size and modular design, Minimus is perfectly suited to be a component of complex assembly pipelines, as shown by its use at TIGR as the main workhorse in the influenza virus sequencing pipeline. Traditionally, phrap has been used as a main component of such pipelines. We compared Minimus to phrap on two medium-sized assembly tasks, BAC clones and bacterial genomes, and found that Minimus is able to perform such assemblies more efficiently and more accurately than phrap, at the cost of producing smaller contigs. We would like to emphasize that it is important to obtain a correct assembly, even if this assembly is fragmented. Assembly errors are often difficult to detect and correct, and are usually resolved through an expensive and time-consuming process of manual curation (no automated tools exist for this task), while fragmented assemblies can easily be improved in a high-throughput fashion by, for example, hierarchically combining the fragmented contigs based on lower-stringency overlap information. These results highlight the potential for Minimus to be used as a replacement for phrap in assembly pipelines such as Atlas or Phusion, especially as these pipelines already implement mechanisms for combining contigs. Also note that the errors in the phrap assemblies are an artifact of the greedy assembly algorithm used by phrap and cannot be resolved by simply adjusting the stringency of the assembly process.
Finally, the modular design of Minimus (and its Open Source license) allows scientists to easily fine-tune, or replace, individual components of the assembly pipeline, tailoring the execution of Minimus to the specific characteristics of the data. Such fine-tuning is impossible in phrap, partly due to its restrictive license, and also due to its monolithic design. Minimus is therefore more than a simple assembler: it can be thought of as a potential testbed for evaluating specific assembly approaches, whether for educational purposes as part of a bioinformatics curriculum, or during the conduct of research in genome assembly.
Minimus is distributed under an Open Source license (the Artistic License) as a component of the AMOS package. The details for this package are provided below.
Project name: AMOS
Project homepage: http://amos.sourceforge.net
Operating systems: Unix (tested on Linux x86 and x86_64, Mac OSX, cygwin, Solaris, and Tru64)
Programming languages: C++, Perl
Other requirements: none for Minimus, some components of AMOS require the QT library
License: OSI Artistic License
Any restrictions to use by non-academics: none
Test data for running Minimus can be downloaded from the Minimus website: http://amos.sourceforge.net/docs/pipeline/minimus.html.
We thank Martin Shumway and Jeff Sitz from The Institute for Genomic Research for providing us with detailed information on the use of Minimus as part of the Influenza A sequencing pipeline. We also thank Mike Schatz for providing us with a vector- and quality-trimmed set of zebrafish reads, and the anonymous reviewers for their detailed and insightful comments. Finally, we thank Marina Lee for suggesting GPC3 as a test of Minimus' ability to reconstruct the assembly of individual genes. The development of Minimus was supported in part by NIH under grants R01-LM06845 and R01-LM007938 to SLS and by DHS cooperative agreement W81XWH-05-2-0051.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.