InteMAP: Integrated metagenomic assembly pipeline for NGS short reads
© Lai et al. 2015
Received: 21 October 2014
Accepted: 24 July 2015
Published: 7 August 2015
Next-generation sequencing (NGS) has greatly facilitated metagenomic analysis but also raised new challenges for metagenomic DNA sequence assembly, owing to its high-throughput nature and extremely short reads generated by sequencers such as Illumina. To date, how to generate a high-quality draft assembly for metagenomic sequencing projects has not been fully addressed.
We conducted a comprehensive assessment on state-of-the-art de novo assemblers and revealed that the performance of each assembler depends critically on the sequencing depth. To address this problem, we developed a pipeline named InteMAP to integrate three assemblers, ABySS, IDBA-UD and CABOG, which were found to complement each other in assembling metagenomic sequences. Making a decision of which assembling approaches to use according to the sequencing coverage estimation algorithm for each short read, the pipeline presents an automatic platform suitable to assemble real metagenomic NGS data with uneven coverage distribution of sequencing depth. By comparing the performance of InteMAP with current assemblers on both synthetic and real NGS metagenomic data, we demonstrated that InteMAP achieves better performance with a longer total contig length and higher contiguity, and contains more genes than others.
We developed a de novo pipeline, named InteMAP, that integrates existing tools for metagenomics assembly. The pipeline outperforms previous assembly methods on metagenomic assembly by providing a longer total contig length, a higher contiguity and covering more genes. InteMAP, therefore, could potentially be a useful tool for the research community of metagenomics.
Without the need for prior laboratory cultivation, metagenomics, as the study of sequence data directly from microbial communities in their natural habitats, has shown great power in investigating ubiquitous microorganisms that have intimate relationships with human beings as well as all other living organisms [1–3]. Recently, the next-generation sequencing (NGS) technology has greatly facilitated metagenomics, such as the launch of the Human Microbiome Project  and the MetaHIT project , providing numerous metagenomic data that can be mined for information on the compositional and functional properties of human microbial communities. Moreover, the drastic reduction in the cost of sequencing has allowed more and more laboratories to initiate metagenomic sequencing projects to decode complex microbiomes.
In order to carry out the analysis of the metagenomes, such as gene-finding and binning [5–8], short reads (e.g. 100 bp for Illumina reads) are usually expected to be assembled first into longer contiguous sequences called contigs (with sizes ranging from several hundreds of bps to the whole chromosome) to provide more valuable genomic content. Despite of early endeavours to assemble reads from Sanger and 454 sequencers, such as Genovo , Xgeovo  and MAP , high-throughput NGS short reads raise new challenges for this problem. The challenges come from the situation of high-throughput and extremely short reads generated by such as Illumina sequencers, as well as from intrinsic complications of metagenomic data caused by the microbial communities. Except for a few studies, most metagenomes using NGS remain unassembled and have been deposited directly into public databases. As part of the MetaHIT project, Qin et al.  assembled metagenomic short reads from 127 human microbial samples and reported a catalogue of microbial genes from the intestinal tract. However, the assembler they employed, SOAPdenovo , is not designed specifically for metagenome assembly but only for isolated genomes. Another single-genome assembler, Velvet , has also been used to assemble permafrost metagenomes . Recently, several assemblers for metagenomic short reads, such as IDBA-UD , MetaVelvet , Ray , and SPAdes [18, 19], are reported to be superior to other methods. In addition, CABOG, previously named Celera Assembler and employed in both single genome and metagenome assembly tasks, has been rebuilt to support Illumina reads in its latest version [20, 21]. Recently, a metagenomic assembler, Omega, has been reported to outperform others mainly on long reads data (MiSeq 300 bp datasets) .
Despite the aforementioned progress, metagenomic sequence assembly is still regarded as a challenge, both with respect to the data and the assembly process. It is known that metagenomic data contain sequences from so many species within a community, with many differences in abundances, levels of polymorphisms, and relationships with each other ; In addition, other technical factors also affect the assembly quality, such as the depth of sequencing coverage, the sequencing errors, or the choice of k-mer size for assembling algorithm [11, 24]. Consequently, using these tools may be placed many more limitations than we might think. That is, they may not suit all the species in the metagenome data, or not be able to deal with all of the assembly issues. To this end, it is useful for the scientific community to realize the properties of existing assemblers with their performance on real metagenomic NGS data. Furthermore, it is also important if we can improve assembly quality by combining the strengths of these tools to deal with the complex task of metagenomic assembly. It has been reported that assembly reconciliation which merges different draft genome assemblies improved the assembly quality [25–27], especially in the task of metagenomic assembly . However, these studies on assembly reconciliation only provided general merging algorithm, leaving the problem of deciding which assemblers to choose unresolved.
In this work, we first conducted a comprehensive assessment on state-of-the-art de novo assemblers including both de Bruijn graph based and overlap graph-based assemblers, with simulated metagenomic NGS data. In contrast to previous metagenomic assembly evaluation efforts that deemed the community as a whole [29, 30], we evaluated the assemblers by investigating how they perform on individual species in terms of factors such as sequencing depth of coverage and/or genomic similarity within the microbial community. Such a strategy has enabled us to reveal that different computational strategies not only complement each other by having different advantages in assembling sequences with different levels of sequencing depth, but also complement each other on assembling the sequences from the same genomes. Based on these findings, we further developed a de novo pipeline, named InteMAP, which integrates three tools, ABySS , IDBA-UD , and CABOG , for metagenomic assembly. Taking advantage of the strength of each assembler and the complementary nature among them, the InteMAP pipeline shows remarkable improvement in metagenomic assembly. Unlike other metagenomic assembly pipeline, such as MetAMOS  and MOCAT , which directly provide a collection of available assemblers to users without a smart solution according to input short reads data, the InteMAP pipeline generates improved assembly based on integrating selected assemblers. The tests upon both simulated and real metagenomic NGS data show InteMAP has the potential for the assembly of metagenomics.
Evaluation of assemblers revealing their respective advantages
In order to understand how state-of-the-art de novo assemblers perform on metagenomic datasets with short DNA reads, here we conducted the evaluation of several leading assemblers currently in wide use. Five latest assemblers, ABySS , CABOG , IDBA-UD , MetaVelvet , and SOAPdenovo , were selected in the current study. Among them, ABySS, CABOG, and SOAPdenovo are general-purpose genome assemblers while the other two are specifically designed for assembling metagenomic reads. For the benchmark data, we constructed a simulated dataset consisting of 113 microbial species with different abundances, called sim-113sp, to model a real complex metagenome. For each assembler, we ran the program by multiple options and selected the best results (see Methods and Additional file 1: Supplemental Methods for details).
It should be pointed out that our evaluation of these assemblers aims to examine their performance directly based on the details of reads, including the sequencing depth of coverage. Fortunately via simulated metagenome data, we may trace the details by mapping each read or contig into the corresponding source genome. Thereby, each assembly is applied to each individual genome within the simulated community. Herein, we evaluated three common-use measurements as contiguity, completeness, and correctness of each assembly for individual genome by calculating correct N50 size, assembly cover rate, and structural assembly errors in contigs (see Methods for how to calculate correct N50). Furthermore, assembly cover rate is defined as the proportion of the reference genome covered by contigs. Structural assembly errors are defined as in previous studies, including “mis-joins” of two unrelated sequences and segmental indels in the contigs .
We have noticed that each assembler provides similar performance for genomes of similar sequencing coverage. Therefore, 113 species within the sim-113sp dataset are classified into three groups by their sequencing depth levels: high (≥30× depth), medium (15-30×) and low coverage (≤15×). From the results of CABOG and IDBA-UD, we selected those that have the best trade-off between contiguity and correctness on most species. For the assemblers using the k-mer graph-based approach (ABySS, MetaVelvet, and SOAPdenovo), we found that assemblies configured with lower k-mer sizes have the better performance at lower coverage levels, while those with higher k-mer sizes present better performance at higher coverage levels. Thus, at different levels of sequencing coverage, we chose different k-mer sizes for the assemblies. Specifically, we selected the assembly with k-mer size 23 for lower coverage levels (low and medium clusters) while 51 for higher levels (medium and high clusters). It should be noted here that selecting a close k-mer size does not significantly change the assembly outcome.
We finally investigated whether the mixture of species impacts the metagenomic assembly for each assembler (details shown in Additional file 1: Supplemental Methods). As shown in Additional file 1: Figure S4 and Additional file 1: Figure S5, the test on four small synthetic datasets with mixed genomes demonstrates that the mixture of uneven coverage sequences did not significantly change the result of each individual genome in metagenomic assembly process, for most assemblers except ABySS (ABySS is found to greatly increase the errors). In contrast, the mixture of the closely related species had a significant impact on both correct N50 size and error count for most assemblers, with shorter N50 and more errors. Nonetheless, the impacts from the mixture of different species seem to be insensitive or quite similar for all the assemblers in this test. Therefore, the variation at specific coverage level in the metagenomic assembly of different assemblers are, to a large extent, determined by the assembly capabilities of assemblers at that level, rather than the abilities to resolve the impacts of the mixture of reads.
To sum up, our analysis revealed that any given individual assembler has bare means of providing the equal performance of assembly on short reads with different sequencing depth. Therefore, a user may rather be caught in trouble when using these tools for reads sequenced with heterogeneous depth as in a real metagenomic sample. However, on the basis of thoroughly understanding the assembling approaches, we expected that a fine-tune combinatorial application of these tools instead of running single tool would improve the assembly outcome.
The InteMAP pipeline and its performance on simulated metagenomic dataset
As mentioned above, upon test of the sim-113sp dataset, we found that existing de novo assemblers achieve their best performances on reads with different sequencing depth levels, moreover they are expected as being complementary to each other. Based on these findings, we devised a pipeline for metagenomic de novo assembly, named InteMAP, to integrate three de novo assemblers, IDBA-UD , ABySS  and CABOG , which are shown to have their best performances at different depth levels and to be actually complementary to each other in our test. The InteMAP pipeline aims to generate a high-quality assembly on both high and low depth of coverage. In general, the pipeline begins with an algorithm to estimate the coverage of each read, and then integrates assemblies via two strategies: for different sequencing coverage reads, different optimal assemblers are used; while for reads of the same coverage, two optimal assemblers are used and then integrated. Iteratively, the pipeline integrates two assemblies at each step. For low coverage reads, the assemblies by IDBA-UD and CABOG are integrated. If high coverage reads are detected in the data, InteMAP combines the assemblies by ABySS and IDBA-UD for these sequences. Finally, the assemblies integrated from low and high coverage are further merged into the final assembly. Herein, the integration algorithm is specially designed for contigs produced by different assemblers. For non-overlapped contigs from two assemblies, the pipeline directly includes all the non-overlapped contigs, while for overlapped contigs from two assemblies, the pipeline merges the overlapped segments and makes a decision on the contradictory point, whether to extend from one contig or to cut at the point (see section Methods for detailed description of the algorithm).
We then report the merits of the integrated assembling pipeline upon tests described below. Unlike the evaluation of the five assemblers mentioned above, herein we are confronted with another situation. To apply an assembler, including our InteMAP pipeline, to practical assembling metagenomic data, a read cannot be indicated its source genome as well as sequencing coverage. Therefore, as in previous studies [11, 16, 35], the statistics of the assembly on the whole community other than the assemblies calculated by individual genomes are used to evaluate the performance of the assembler on simulated metagenomic data. The first measure of assembly we assessed is the total length of the reference sequences covered by contigs, named “total cover length”, from which we evaluated the capability of the assembler to generate contigs. We also evaluated contiguity of the assembly using the statistic “N-len size” of contigs assembled, which has been used in previous studies [11, 16, 35]. The N-len, as a function of a designated length L, is computed as follows: the contigs are sorted in decreasing order of length, therefore the N-len(L) statistic means the shortest length of the contigs, such that the total length of the contigs with a size no shorter than N-len exceeds the length L. Since contigs with assembly errors may lead to misleadingly large N-len sizes, we used the correct N-len size to evaluate the contiguity.
Herein, we compared the assembly generated by InteMAP with the assemblies generated from five assemblers (ABySS, CABOG, IDBA-UD, SOAPdenovo, and MetaVelvet). In addition, we included Bambus 2  and three recently developed assembler methods Ray , SPAdes [18, 19], and Omega  for comparison. Ray and Omega are metagenome-specific assembly methods. Bambus 2 is not a complete assembly tool, but a scaffolder to scaffold unitigs generated by other assemblers, such as CABOG or SOAPdenovo. In this comparison, we used CABOG to generate unitigs for Bambus 2. For Bambus 2, CABOG, IDBA-UD, Omega, Ray, and SPAdes, we selected their best assembly by examining both total cover length and N-len size of the contigs from multiple runs with different parameters. For each of ABySS, MetaVelvet and SOAPdenovo (k-mer graph-based), we selected two assemblies, one of which was set a small k-mer size and reached the best cover length, while the other was set a large k-mer size and reached the best contiguity.
The summaries of assemblies on simulated metagenomic sim-113sp dataset
Total cover length (Mbp)
Corr. N-len at 10 Mbp (bp)
Num. of covered genes
ABySS,k = 31
ABySS,k = 61
MetaVelvet,k = 23
MetaVelvet,k = 61
SOAPdenovo,k = 23
SOAPdenovo,k = 61
Furthermore, we calculate E-size statistic which was proposed in a previous study . E-size is defined as the expectation of the length of the contigs that contains a position (base) randomly designated in reference genomes. This statistic addresses the question of how many genes are completely contained in the contigs. Also, it reflects both the assembly cover rate and the contiguity of the assembly. As we can see from the third column in Table 1, InteMAP has an E-size of 17,652 bp, the largest among all the assemblers.
Besides, we calculated the number of genes that were completely contained in the contigs of each assembly (see section Methods). The fourth column of Table 1 shows that the InteMAP assembly contains the most complete genes in contigs with the number 70,859. As one of the main goals of metagenomic assembly is to recover as many genes as possible from the sequencing reads, it is clear that the most genes covered by InteMAP demonstrate the superiority of our pipeline on metagenomic assembly.
In conclusion, the InteMAP pipeline has an overall better performance than ABySS, Bambus 2, IDBA-UD, SPAdes, and Omega in terms of assembly cover rate, contiguity, and accuracy. Compared to CABOG, MetaVelvet, Ray, and SOAPdenovo, the InteMAP pipeline is superior on assembly cover rate and contiguity at the cost of a slightly higher error rate in short contigs or contigs from species with low sequencing coverage. It should be noted that long contigs provide more valuable genomic contents and the dominant species are often the key object to be studied in the microbial community. In this regard, compared to CABOG, MetaVelvet, Ray, and SOAPdenovo, the total superiority of the InteMAP pipeline in terms of high cover rate, high contiguity and low error in long contigs or contigs from species with high coverage indicates that InteMAP performs better than the tools CABOG, MetaVelvet, Ray, and SOAPdenovo.
Evaluating the merging method in InteMAP
Comparison of InteMAP merging algorithm with MeGAMerge and Minimus2
Total cover length (Mbp)
Corr. N-len at 10 Mbp (bp)
Num. of covered genes
Merging high-coverage-sequence assemblies (from IDBA-UD and ABySS)
Merging low-coverage-sequence assemblies (from IDBA-UD and CABOG)
Final merged assembly
It should be noted that the merging algorithm in InteMAP was not designed for improving any two assemblies. Indeed, merging all the assemblies from Table 1 does not results in a better assembly than the result from InteMAP pipeline (see Additional file 1: Table S1). In contrast, the assembly generated by merging all the assemblies has much more errors than the assembly generated by InteMAP pipeline, mainly because the InteMAP merging algorithm could not detect or correct all the errors contained in all of the merged assemblies. In addition, what deserves special mention is that InteMAP is more than a merging algorithm. The purpose of InteMAP is not to improve the merging algorithm, but rather provide a pipeline for a complete metagenome assembly task, which gives an optimal resolution for how to choose optimal assemblers and how to integrate assemblies step by step.
Testing InteMAP on real NGS data
It is interesting to verify the capability of InteMAP on real metagenomic data by NGS. To this end, we selected a dataset from the 124 human gut microbial metagenomes, sequenced as a part of the MetaHIT project (sample ID: MH0012) . This sample was deeply sequenced by Illumina GA machine. The high-quality paired-end reads were downloaded from NCBI Sequence Read Archive (http://www.ncbi.nlm.nih.gov/sra; ERR011117, ERR011118, ERR011119, ERR011120, ERR011121, ERR011122, ERR011123), consisting of 186 million short reads with length of 75 bp and two paired-end insert size of 134 bp and 378 bp.
We applied the InteMAP pipeline on this dataset and compared the assembly generated by InteMAP with those by other assembling tools including six metagenome-specific assemblers (Bambus 2, IDBA-UD, MetaVelvet, Omega, Ray, and SPAdes), and three general-purpose assemblers (ABySS, CABOG and SOAPdenovo). In fact, both Bambus 2 and MetaVelvet have been assessed using this benchmarking dataset in previous studies [16, 35]. However, we reran Bambus 2 and MetaVelvet on the data since we did not find the assembled data from their publications. For MetaVelvet, we selected the optimal assembly with k-mer size 31 from multiple runs. We reran Bambus 2 based on the unitigs generated by CABOG. As Qin et al. , who sequenced MH0012 sample, assembled the data using SOAPdenovo, we directly downloaded the assembly data from http://www.bork.embl.de/~arumugam/Qin_et_al_2010/ (see section Methods and Additional file 1: Supplemental Methods for detailed receipts describing how to run each assembler).
Assembly statistics and predicted gene number on human gut microbial metagenome dataset (Sample MH0012)
Sum of contig length (Mbp)
N-len at 5 Mbp (bp)
N-len at 50 Mbp (bp)
Non-redundant ORFs predicted
Num. of predicted complete ORFs
ABySS, k = 61a
MetaVelvet, k = 31
SOAPdenovo (Qin et al. 2010)b
Comparison of IDBA-UD and InteMAP on two real metagenomic data samples
Sum of contig length (bp)
N-len at 100 Kbp (bp)
Non-redundant ORFs predicted
Sum of contig length (bp)
N-len at 10 Mbp (bp)
Non-redundant ORFs predicted
The InteMAP pipeline is designed for convenient use with automatic one-step operation; meanwhile users can also run the pipeline step by step, so that they can get an optimized outcome from multiple runs of different parameters for each individual assembler in the middle process. As consisted of several assembly processes by individual assemblers and mapping processes by Bowtie2, it takes longer to run InteMAP than other individual assemblers. For instance, in our test on the dataset MH0012, it took us for about 3 days to finish running InteMAP on the machine with the memory of 192G and thread number of 22, while it took for within one day to run individual assemblers on the same machine.
Discussion and conclusions
It has been believed that the assembly for metagenomic NGS reads has not reached a point to recover one or more complete genomes with the exception of finishable dominant species . However, it is never a trivial matter to make efforts to develop strong and efficient assembly tools to provide more valuable genomic content beyond raw data with a large number of NGS reads, even researchers can directly obtain a little information based on these raw data or 16S sequences . In this work, we have performed a comprehensive assessment on state-of-the-art de novo assemblers to understand their advantages and disadvantages in assembling metagenomic reads. The analysis revealed that the assembly performance of individual tool critically depends on sequence coverage and a combinatorial application of different tools is desired to achieve an overall better performance to assemble short reads sequenced with uneven coverage as in a real metagenomic sample. The InteMAP pipeline addressed this question by integrating three leading assemblers, IDBA-UD, ABySS, and CABOG, to assemble metagenomic NGS reads. The essential manipulation of InteMAP is to decide the assembling approaches according to sequencing coverage estimated for each short read. Thus, the pipeline actually presents a smart platform suitable to assembling real metagenomic NGS data with uneven coverage distribution of sequencing depth. Our analysis showed that the estimated coverage for each read does not exactly mean its real coverage of sequencing, which is no more than calculated based on the draft assemblies by IDBA-UD and ABySS. However, the estimation plays an important role in the pipeline, moreover it leads to the fact that InteMAP achieves a high performance in assembling metagenomic NGS reads.
Surprisingly, several newly developed methods for metagenomic assembly such as Omega , and Ray  do not work well for our datasets. Ray has surprising low performance in terms of contig size and covered genes on the tested datasets, but has an extremely low error rate, suggesting Ray is a quite conservative assembler which tends to avoid errors while generating fewer contigs. Omega employs a minimal cost flow analysis on an overlap graph to generate unitigs. However, unlike CABOG which allows mismatch in the overlap detection, Omega uses exact-match overlaps to construct overlap graph, making it highly sensitivity to the sequencing errors or SNPs. Thus, Omega does not deal well with a highly complex metagenome dataset which might contain a large amount of low-sequencing-depth species on which sequencing errors could not get corrected easily.
The assembler space is constantly moving and more assemblers are developed. It will be interesting to exam new assemblers to find their particular strength in the metagenomic assembly and further plug them in the InteMAP pipeline. However, as we have shown, merging all the assemblies did not result in a better assembly than what InteMAP did, arbitrary merge will not improve the assembly, unless we know the respective strength of an assembler and integrate it in an appropriate way.
Metagenomes are complex mixtures usually containing diverse species with different abundances. An ideal metagenomic assembler is expected to perform well on both high and low depth sequences within the metagenomes. On the condition of a certain sequencing capacity, high depth sequences usually represent dominant or high-abundance species within the community. Conversely, low-coverage sequences usually represent low-abundance species which often occupy a large proportion of the community, especially in high-complexity metagenomes, such as the ocean and the soil metagenomes [1, 38]. In metagenomes which have dominant species, to construct the high-quality draft assemblies or even complete genomes of dominant species has a great significance in metagenomic studies . On the other hand, to assemble as many low abundant sequences as possible is also very important for the completeness of metagenomic studies. By integrating different assemblers which are good at assembling sequences at different coverage levels, the InteMAP pipeline demonstrates strong capability in assembling both high and low depth sequences, based on the improvements in terms of contiguity and total cover length, in comparison with current leading assemblers on both simulated and real NGS data.
To construct the simulated metagenome dataset, we used the MetaSim simulator  to generate a collection of synthetic metagenomic reads which reflect the specified abundance distribution of the species. In the current study, we mainly target the Illumina reads although our pipeline is also able to be applied on other short reads. To imitate Illumina sequencing, the read length was set 100 bp, with the average and the standard deviation of paired-end insert size as 300 bp and 20 bp. To model the specific pattern of sequencing error of Illumina technology, we used NGSfy  to generate sequencing errors in reads, which uses a fourth-degree polynomial model to describe the frequency of errors in Illumina reads. We used the default settings for NGSfy, and the average error rate was 1.5 %, as in previous studies [41, 42].
We constructed the sim-113sp dataset by mixing the synthetic reads from 113 complete microbial genomes downloaded from NCBI RefSeq database. The collection of the 113 species is exactly the same as used by Mavromatis et al.  to generate simulated metagenomic datasets, except for a few unfinished genomes, which we chose close relatives (usually a different strain) instead (see Additional file 2: File S1). For the abundance setting, we used the logarithmic distribution, which has been used in Mende et al.  to model the microbial abundance distribution. The highest sequencing coverage of species in this dataset is 177.5×, followed by 97.6×, and the coverage drops to below 10× after the top 20 species (see Additional file 1: Figure S7). We totally generated 4 million reads of totally 400 Mbp, which are able to be downloaded from http://cqb.pku.edu.cn/ZhuLab/InteMAP/index.html.
In order to calculate the number of genes contained by the assembly in the evaluation on simulated metagenomic dataset sim-113sp, we downloaded the gene annotation files for each reference genome, which record the start and ending positions for each gene on the genome sequence, from the NCBI database. According to the mappings between the contigs and the reference genomes, if one gene was completely contained in an aligned region of the references, this gene was deemed completely contained by the contigs.
Running the assemblers
The pre-processing of correcting the sequencing errors in raw reads has been reported to substantially enlarge the contig size for some assemblers in the single genome assembly . In the current study, error correction also improved the metagenome-assembly quality, especially on the high-depth species. The InteMAP pipeline employs a module of error correction to pre-process the reads before assembling. For a fair comparison, in our experiments, reads were pre-processed by the error correction process before being assembled by all assemblers (see Additional file 1: Supplemental Methods for the detailed process of error correction).
For each assembler, we ran multiple times with different parameter settings and selected the ones that appeared optimal or near optimal results (the parameters selected and details describing how to run each of the assembler are in Additional file 1: Supplemental Methods). The contigs by ABySS, IDBA-UD, MetaVelvet and SOAPdenovo were extracted from their scaffold sequences after splitting the scaffold sequences at gaps. SOAPdenovo has an additional module named GapCloser as a post-processor. Our test demonstrated that the GapCloser's post-processing actually increases the assembling errors dramatically. Therefore, we selected the assemblies without the post-processing of GapCloser for SOAPdenovo.
Evaluating the assemblies
Because extremely short contigs, in many cases as small as the k-mer size used to build the k-mer graph, can hardly support information for further analysis, for all the assemblies, contigs with length <200 bp were excluded before evaluation, as did in previous study . Following the way used in the previous study , we used MUMmer package v3.23  to evaluate the contigs, in which nucmer  was used to align the contigs against the references with the options “-maxmatch -l 30 -banded -D 5.” Then delta-filter with the option “-i 95 -o 95” and dnadiff  were used to obtain a globally optimal mappings between contigs and references, from which we could directly get the structural assembly errors (segmental indels (>5 bp) and misjoins of two non-adjacent segments). From the output of dnadiff, we also got the identity of the aligned contigs. In the evaluation of the assemblies on simulated dataset, we used corrected contigs to evaluate the contiguity of the contigs by computing the correct N50 size, correct N-len size, and correct E-size. We extracted the aligned contig fragments as the corrected contigs from the mappings between contigs and references generated by dnadiff (aligned fragments end at the ends of contigs or assembly errors).
where L C is the length of contig C, and G is the total reference genome length. On real NGS dataset MH0012, as the total length of reference genomes is not available, the G in the E-size formula was assigned the maximal total contigs length among all the assemblies run on this dataset (we used 280,000,000 here).
Algorithm of the InteMAP pipeline
Step 1. Correcting errors of the input reads. By default, input reads are first pre-processed by the Quake tool , which aims to correct the sequencing errors in high-coverage sequences (see Additional file 1: Supplemental Methods for details). Two sets of reads are then generated: one named “correct read set” consisting of reads corrected or validated by Quake, and the other consisting of the remaining ones (see Additional file 1: Figure S8 for the ratio of the correct reads to total reads for each coverage of species). Two sets are mixed together and the mixed read set is termed “total read set”.
Step 2. Running IDBA-UD and estimating reads coverage. After the assembly by IDBA-UD, reads are mapped into the contigs using the read alignment tool, Bowtie 2 . According to the mappings between reads and contigs, the coverage of each read is then estimated by the times of the total length of reads to the length of the assembled contig. Furthermore, the contigs are separated along with the mapped reads by different coverage. The pipeline then picks up both the reads mapping to contigs with the coverage < L (this can be designated by users, 50× by default) and the unmapped reads, and assembles them using CABOG.
Step 3. Checking the total length of high coverage contigs. The pipeline first checks the total length of the contigs with the coverage >30×. If the length is shorter than a designated value (e.g. 1 Mbp by default), the pipeline goes to Step 6 for the condition of the low-coverage reads assembly. Otherwise, goes to the next step.
Step 4. Assembling reads of the “correct read set” using ABySS and IDBA-UD respectively. The “correct read set” has been generated in Step 1, and using this set can largely increase the quality of assembly for the high-depth species by both ABySS and IDBA-UD. After assemblies are generated by ABySS and IDBA-UD, the pipeline maps the reads into the contigs and picks up the high-depth contigs for both assemblies. The minimal coverage of high-depth contigs for ABySS and IDBA-UD is set 20× by default.
Step 5. Merging the assemblies at high and low coverage levels iteratively. First, the pipeline merges the high-depth contigs from the IDBA-UD and ABySS assemblies generated in Step 4, using the algorithm described below. Then filter out the high-coverage contigs (≥50×) by IDBA-UD in Step 2 and merge the remaining contigs by CABOG in Step 2. Since the pipeline has merged contigs both on the high and low coverage sequences, finally, the pipeline merges them together as the final contigs. Note that the ranges of the high and low coverage overlap each other to ensure that the pipeline covers all the sequences, so the default cut-off for high-depth and low-depth are set to make the range broader than those used for the assembly evaluation. The pipeline ends at this step.
Step 6. Merging the assemblies by IDBA-UD and CABOG. If Step 4 and 5 are skipped, which means that there are no high coverage species within the metagenomes, merge the contigs generated by both IDBA-UD and CABOG in Step 2 as the final contigs. The pipeline ends at this step.
The algorithm for merging two sets of contigs
Step 1. Comparing the two sets of contigs. It begins with comparing the two sets of contigs using nucmer, delta-filter, and dnadiff in MUMmer package to get the mappings between the two sets of contigs. From the mappings, the same segments shared by two sets of contigs are marked. Then, the breakpoints are marked inside the contigs where the bifurcations appear (Type 1 breakpoint) or the segments end in the contigs of the different assembly (Type 2 breakpoint), as illustrated in Fig. 7a. The breakpoints may indicate assembly errors.
Step 2. Validating the contigs. The contigs are validated by mapping the paired-end reads against the contigs, which has been used in previous work . The basic idea is that if one contig is correctly assembled, each pair of paired-end reads is located and mated correctly. If a region has ≥5 mis-mated reads located, this region will be marked as suspicious which indicates potential mis-assembly within the region. Thus, if one breakpoint is located within the suspicious region, it is more likely to be an assembly error. Therefore, the algorithm breaks all the breakpoints which are located within the suspicious regions (Fig. 7b).
Step 3. Breaking the contradictory breakpoints. The remaining breakpoints are treated differently according to their types. As Type 1 breakpoints are mated and in different assemblies respectively, if both mated breakpoints are not broken in Step 2, both of them will be broken at this step. In contrast, if only one Type 1 breakpoint is broken in Step 2, the other breakpoint will automatically transform to the Type 2 breakpoint. The survived Type 2 breakpoints are reserved.
Step 4. Connecting the segments at reserved breakpoints (Fig. 7c). Because there are no conflicting breakpoints remained after Step 2 and 3, each segment has at most one segment linked by the reserved breakpoint in either the left or the right direction. Then, beginning from any segment which ends at one side, the pipeline extends the segment by connecting the adjacent one until the extending segment ends at the extending direction. The final contigs are generated by reading the bases from the merged segments. If the segment is shared by two assemblies, randomly select one strand of sequences to read the bases.
Availability of supporting data
The package of the InteMAP software and all the benchmarking data are freely available from http://cqb.pku.edu.cn/ZhuLab/InteMAP/index.html.
We wish to thank Prof. Xuegong Zhang of Tsinghua University and Prof. Zuhong Lu of Peking University for interest to the project and useful discussions. We also thank Prof. Antony Chen of Peking University for his refinement to the manuscript writing. We thank Dr. Yongchu Liu, Jiantao Guo, Luying Liu, Qi Wang, Xiao Guo and Peng Zhai for their helpful discussions.
This work was supported by the National Key Technology Research and Design Program of China (2012BAI06B02), the National Natural Science Foundation of China (30970667, 11021463, 61131003 and 91231119), National Basic Research Program of China (2011CB707500), and Excellent Doctoral Dissertation Supervisor Funding of Beijing (YB 20101000102).
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- Venter JC, Remington K, Heidelberg JF, Halpern AL, Rusch D, Eisen JA, et al. Environmental genome shotgun sequencing of Sargasso sea. Science. 2004;304:66–74.View ArticlePubMedGoogle Scholar
- Qin J, Li R, Raes J, Arumugam M, Burgdorf KS, Manichanh C, et al. A human gut microbial gene catalogue established by metagenomic sequencing. Nature. 2010;464:59–64.View ArticlePubMedPubMed CentralGoogle Scholar
- Emmanuelle LC, Trine N, Junjin Q, Edi P, Falk H, Gwen F, et al. Richness of human gut microbiome correlates with metabolic markers. Nature. 2013;500:541–6.View ArticleGoogle Scholar
- Nelson KE, Weinstock GM, Highlander SK, Worley KC, Creasy HH, Wortman JR, et al. A catalog of reference genomes from the human microbiome. Science. 2010;328:994–9.View ArticlePubMedGoogle Scholar
- Hu GQ, Guo JT, Liu YC, Zhu HQ. MetaTISA: Metagenomic translation initiation site annotator for improving gene start prediction. Bioinformatics. 2009;25:1843–5.View ArticlePubMedGoogle Scholar
- Zhu W, Lomsadze A, Borodovsky M. Ab initio gene identification in metagenomic sequences. Nucleic Acids Res. 2010;38, e132. doi:https://doi.org/10.1093/nar/gkq275.View ArticlePubMedPubMed CentralGoogle Scholar
- Patil KR, Haider P, Pope PB, Turnbaugh PJ, Morrison M, Scheffer T, et al. Taxonomic metagenome sequence assignment with structured output models. Nat Methods. 2011;8(3):191–2.View ArticlePubMedPubMed CentralGoogle Scholar
- Liu YC, Guo JT, Hu GQ, Zhu HQ. Gene prediction in metagenomic fragments based on the SVM algorithm. BMC Bioinformatics. 2013;14:S12.View ArticlePubMedPubMed CentralGoogle Scholar
- Laserson J, Jojic V, Koller D. Genovo: de novo assembly for metagenomes. J Comput Biol. 2011;18:429–43.View ArticlePubMedGoogle Scholar
- Afiahayati, Sato K, Sakakibara Y: An extended genovo metagenomic assembler by incorportating paired-end information. Peer J 2013, 1:e196.Google Scholar
- Lai BB, Ding RG, Li Y, Duan LP, Zhu HQ. A de novo metagenomic assembly program for shotgun DNA reads. Bioinformatics. 2012;28:1455–62.View ArticlePubMedGoogle Scholar
- Li R, Zhu H, Ruan J, Qian W, Fang X, Shi Z, et al. De novo assembly of human genomes with massively parallel short read sequencing. Genome Res. 2010;20:265–72.View ArticlePubMedPubMed CentralGoogle Scholar
- Zerbino DR, Birney E. Velvet: Algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 2008;18:821–9.View ArticlePubMedPubMed CentralGoogle Scholar
- Machelprang R, Waldrop MP, DeAngelis KM, David MM, Chavarria K, Blazewicz SJ, Rubin EM, Jansson JK: Metagenomic analysis of a permafrost microbial community reveals a rapid response to thaw. Nature 2011, 480:doi:https://doi.org/10.1038/nature10576.
- Peng Y, Leung HC, Yiu S, Chin FY. IDBA-UD: A de novo sssembler for single-cell and metagenomic sequencing data with highly uneven depth. Bioinformatics. 2012;28(11):1420–8.View ArticlePubMedGoogle Scholar
- Namiki T, Hachiya T, Tanaka H, Sakakibara Y. MetaVelvet: an extension of Velvet assembler to de novo metagenome assembly from short sequence reads. Nucleic Acids Res. 2012;40, e155.View ArticlePubMedPubMed CentralGoogle Scholar
- Boisvert S, Raymond F, Godzaridis E, Laviolette F, Corbeil J. Ray Meta: scalable de novo metagenome assembly and profiling. Genome Biol. 2012;13(12):R122.View ArticlePubMedPubMed CentralGoogle Scholar
- Bankevich A, Nurk S, Antipov D, Gurevich AA, Dvorkin M, Kulikov AS, et al. SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J Comput Biol. 2012;19:455–77.View ArticlePubMedPubMed CentralGoogle Scholar
- Nurk S, Bankevich A, Antipov D, Gurevich AA, Korobeynikov A, Lapidus A, et al. Assembling single-cell genomes and mini-metagenomes from chimeric MDA products. J Comput Biol. 2013;20:714–37.View ArticlePubMedPubMed CentralGoogle Scholar
- Myers EW, Sutton GG, Delcher AL, Dew IM, Fasulo DP, Flanigan MJ, et al. A whole-genome assembly of Drosophila. Science. 2000;287:2896–204.Google Scholar
- Miller JR, Delcher AL, Koren S, Venter E, Walenz BP, Brownley A, et al. Aggressive assembly of pyrosequencing reads with mates. Bioinformatics. 2008;24:2818–24.View ArticlePubMedPubMed CentralGoogle Scholar
- Haider B, Ahn TH, Bushnell B, Chai J, Copeland A, Pan C. Omega: an Overlap-graph de novo Assembler for Metagenomics. Bioinformatics. 2014;30(19):2717–22.View ArticlePubMedGoogle Scholar
- Kunin V, Copeland A, Lapidus A, Mavromatis K, Hugenholtz P. A bioinformatician's guide to metagenomics. Microbiol Mol Biol Rev. 2008;72:557–178.View ArticlePubMedPubMed CentralGoogle Scholar
- Miller JR, Koren S, Sutton G. Assembly algorithms for next-generation sequencing data. Genomics. 2010;95:315–27.View ArticlePubMedPubMed CentralGoogle Scholar
- Zimin AV, Smith DR, Sutton G, Yorke JA. Assembly reconciliation. Bioinformatics. 2008;24:42–5.View ArticlePubMedGoogle Scholar
- Yao G, Ye L, Gao H, Minx P, Warren WC, Weinstock GM. Graph accordance of next-generation sequence assemblies. Bioinformatics. 2012;28:13–6.View ArticlePubMedGoogle Scholar
- Vicedomini R, Vezzi F, Scalabrin S, Arvestad L, Policriti A. GAM-NGS: genomic assemblies merger for next generation sequencing. BMC Bioinformatics. 2013;14 Suppl 7:S6.View ArticlePubMedPubMed CentralGoogle Scholar
- Scholz M, Lo CC, Chain PS. Improved assemblies using a source-agnostic pipeline for MetaGenomic Assembly by Merging (MeGAMerge) of contigs. Sci Rep. 2014;4:6480.View ArticlePubMedPubMed CentralGoogle Scholar
- Charuvaka A, Rangwala H. Evaluation of short read metagenomic assembly. BMC Genomics. 2011;12 Suppl 2:S8.View ArticlePubMedPubMed CentralGoogle Scholar
- Mende DR, Waller AS, Sunagawa S, Jarvelin AI, Chan MM, Arumugam M, et al. Assessment of metagenomic assembly using simulated next generation sequencing data. PLoS One. 2012;7(2), e31386. doi:https://doi.org/10.1371/journal.pone.0031386.View ArticlePubMedPubMed CentralGoogle Scholar
- Simpson JT, Durbin R. Efficient de novo assembly of large genomes using compressed data structures. Genome Res. 2012;22(3):549–56.View ArticlePubMedPubMed CentralGoogle Scholar
- Treangen TJ, Koren S, Sommer DD, Liu B, Astrovskaya I, Ondov B, et al. MetAMOS: a modular and open source metagenomic assembly and analysis pipeline. Genome Biol. 2013;14:R2.View ArticlePubMedPubMed CentralGoogle Scholar
- Kultima JR, Sunagawa S, Li J, Chen W, Chen H, Mende DR, et al. MOCAT: a metagenomics assembly and gene prediction toolkit. PLoS One. 2012;7, e47656.View ArticlePubMedPubMed CentralGoogle Scholar
- Salzberg SL, Phillippy AM, Zimin A, Puiu D, Magoc T, Koren S, et al. GAGE: A critical evaluation of genome assemblies and assembly algorithms. Genome Res. 2012;22:557–67.View ArticlePubMedPubMed CentralGoogle Scholar
- Koren S, Treangen TJ, Pop M. Bambus 2: Scaffolding metagenomes. Bioinformatics. 2011;27:2964–71.View ArticlePubMedPubMed CentralGoogle Scholar
- Pop M, Phillippy A, Delcher AL, Salzberg SL. Comparative genome assembly. Brief Bioinform. 2004;5:237–48.View ArticlePubMedGoogle Scholar
- Qin J, Li Y, Cai Z, Li S, Zhu J, Zhang F, et al. A metagenome-wide association study of gut microbiota in type 2 diabetes. Nature. 2012;490:55–60.View ArticlePubMedGoogle Scholar
- Tringe SG, Mering CV, Kobayashi A, Salamov AA, Chen K, Chang HW, et al. Comparative metagenomics of microbial mommunities. Science. 2005;308:554–7.View ArticlePubMedGoogle Scholar
- Tyson GW, Chapman J, Hugenholtz P, Allen EE, Ram RJ, Richardson PM, et al. Community structure and metabolism through reconstruction of microbial genomes from the environment. Nature. 2004;428:37–43.View ArticlePubMedGoogle Scholar
- Richter DC, Ott F, Auch AF, Schmid R, Huson DH. MetaSim—a sequencing simulator for genomics and metagenomics. PLoS One. 2008;3(10), e3373.View ArticlePubMedPubMed CentralGoogle Scholar
- Pignatelli M, Moya A. Evaluating the fidelity of de novo short read metagenomic assembly using simulated data. PLoS One. 2011;6(5), e19984.View ArticlePubMedPubMed CentralGoogle Scholar
- Korbel JO, Abyzov A, Mu XJ, Carriero N, Cayting P, Zhang Z, et al. PEMer: a computational framework with simulation-based error models for inferring genomic structural variants from massive paired-end sequencing data. Genome Biol. 2009;10:R23.View ArticlePubMedPubMed CentralGoogle Scholar
- Mavromatis K, Ivanova N, Barry K, Shapiro H, Goltsman E, McHardy AC, et al. Use of simulated data sets to evaluate the fidelity of metagenomic processing methods. Nat Methods. 2007;4:495–500.View ArticlePubMedGoogle Scholar
- Kurtz S, Phillippy A, Delcher AL, Smoot M, Shumway M, Antonescu C, et al. Versatile and open software for comparing large genomes. Genome Biol. 2004;5:R12.View ArticlePubMedPubMed CentralGoogle Scholar
- Phillippy AM, Schatz MC, Pop M. Genome assembly forensics: Finding the elusive mis-assembly. Genome Biol. 2008;9:R55.View ArticlePubMedPubMed CentralGoogle Scholar
- Kelley DR, Schatz MC, Salzberg SL. Quake: Quality-aware detection and correctetion of sequencing errors. Genome Biol. 2010;11:R116.View ArticlePubMedPubMed CentralGoogle Scholar
- Langmead B, Salzberg SL. Fast gapped-read alignment with Bowtie 2. Nat Methods. 2012;9:357–9.View ArticlePubMedPubMed CentralGoogle Scholar