Improving contig binning of metagenomic data using $ {d}_2^S $ oligonucleotide frequency dissimilarity

Wang, Ying; Wang, Kun; Lu, Yang Young; Sun, Fengzhu

doi:10.1186/s12859-017-1835-1

Methodology Article
Open access
Published: 20 September 2017

Ying Wang (ORCID: orcid.org/0000-0001-8766-5950)¹, Kun Wang¹, Yang Young Lu² & Fengzhu Sun²,³

BMC Bioinformatics volume 18, Article number: 425 (2017)

Abstract

Background

Metagenomics sequencing provides deep insights into microbial communities. To investigate their taxonomic structure, binning assembled contigs into discrete clusters is critical. Many binning algorithms have been developed, but their performance is not always satisfactory, especially for complex microbial communities, calling for further development.

Results

According to previous studies, relative sequence compositions are similar across different regions of the same genome, but they differ between distinct genomes. Generally, current tools have used the normalized frequency of k-tuples directly, but this represents an absolute, not relative, sequence composition. Therefore, we attempted to model contigs using relative k-tuple composition, followed by measuring dissimilarity between contigs using $ {d}_2^S $. The $ {d}_2^S $ was designed to measure the dissimilarity between two long sequences or Next-Generation Sequencing data with the Markov models of the background genomes. This method was effective in revealing group and gradient relationships between genomes, metagenomes and metatranscriptomes. With many binning tools available, we do not try to bin contigs from scratch. Instead, we developed $ {d}_2^S\mathrm{Bin} $ to adjust contigs among bins based on the output of existing binning tools for a single metagenomic sample. The tool is taxonomy-free and depends only on k-tuples. To evaluate the performance of $ {d}_2^S\mathrm{Bin} $, five widely used binning tools with different strategies of sequence composition or the hybrid of sequence composition and abundance were selected to bin six synthetic and real datasets, after which $ {d}_2^S\mathrm{Bin} $ was applied to adjust the binning results. Our experiments showed that $ {d}_2^S\mathrm{Bin} $ consistently achieves the best performance with tuple length k = 6 under the independent identically distributed (i.i.d.) background model. Using the metrics of recall, precision and ARI (Adjusted Rand Index), $ {d}_2^S\mathrm{Bin} $ improves the binning performance in 28 out of 30 testing experiments (6 datasets with 5 binning tools). The $ {d}_2^S\mathrm{Bin} $ is available at https://github.com/kunWangkun/d2SBin.

Conclusions

Experiments showed that $ {d}_2^S $ accurately measures the dissimilarity between contigs assembled from metagenomic reads and that relative sequence composition is a more reasonable basis for binning contigs. $ {d}_2^S\mathrm{Bin} $ can be applied to the output of any existing contig-binning tool for single metagenomic samples to obtain better binning results.

Background

Metagenomics sequencing provides deep insights into microbial communities [1]. A key step toward investigating their taxonomic structure within metagenomics data involves assigning assembled contigs into discrete clusters known as bins [2]. These bins represent species, genera or higher taxonomic groups [3]. Therefore, efficient and accurate binning of contigs is essential for metagenomics studies.

The binning of contigs remains challenging owing to repetitive sequence regions within or across genomes, sequencing errors, and strain-level variation within the same species [4]. Many studies have reported on binning, essentially highlighting two different strategies [5]: “taxonomy-dependent” supervised classification and “taxonomy-independent” unsupervised clustering. “Taxonomy-dependent” studies are based on sequence alignments [6], phylogenetic models [7, 8] or oligonucleotide patterns [9]. “Taxonomy-independent” studies extract features from contigs to infer bins based on sequence composition [10,11,12,13,14], abundance [15], or hybrids of both sequence composition and abundance [4, 5, 16,17,18]. Therefore, these approaches can be applied to bin contigs from incomplete or uncultivated genomes. Some hybrid binning tools, such as COCACOLA [5], CONCOCT [4], MaxBin2.0 [18] and GroopM [16], are designed to bin contigs based on multiple related metagenomic samples. Contigs with similar coverage profiles are more likely to come from the same genome. Previous studies showed that co-varying coverage profiles across multiple related metagenomes play important roles in contig binning [4, 5]. The multiple related samples should be temporal or spatial samples of a given ecosystem [16] composed of similar microbial organisms, but different abundance levels. However, in many situations, multiple related samples may not be available in the required numbers, and as a result, contig-binning based on single metagenomes is still important.

Contig binning tools based on a single sample generally follow one of three strategies. 1) Sequence composition. Composition is usually represented by the frequencies of k-tuples (k-mers) with k = 2–6, which serve as genomic signatures of contigs. MetaWatt [12] and SCIMM [11] built multivariate statistics and/or interpolated Markov models of background genomes to bin the contigs. MetaCluster3.0 [14] clustered the contigs using k-tuple frequency and Spearman correlation between the k-tuple frequency vectors. LikelyBin [10] utilized Markov Chain Monte Carlo approaches based on 2- to 5-tuples. 2) Abundance. AbundanceBin [15] estimated the relative abundance levels of species living in the same environment based on Poisson distributions of 20-tuples with an Expectation Maximization (EM) algorithm. The MBBC [19] package estimated the abundance of each genome using the Poisson process. All tools based on abundance are designed to bin short or long reads instead of assembled contigs. 3) Hybrid of composition and abundance. MaxBin1.0 [17] combined 4-tuple frequencies and scaffold coverage levels to populate the genomic bins using single-copy marker genes and an EM algorithm. MyCC [20] combined genomic signatures, marker genes and optional contig coverages within one or multiple samples.

Contig binning using k-tuple composition is based on the observation that relative sequence compositions are similar across different regions of the same genome, but differ between distinct genomes [21, 22]. The frequency vector of k-tuples is one representation of sequence composition. In general, current tools use the frequency of k-tuples directly, but this represents absolute, not relative, sequence composition. Here, “absolute” frequency refers to the number of occurrences of a k-tuple over the total number of occurrences of all k-tuples. On the other hand, “relative” frequency refers to the difference between the observed frequency of a k-tuple and the corresponding expected frequency under a given background model. Contigs in the same bin are from the same taxonomic group, such as one class, species or strain. Therefore, contigs from the same bin are expected to obey a consistent background model. Several sequence dissimilarity measures based on relative frequencies of k-tuples have been developed, such as CVTree, $ {d}_2^{\ast } $ and $ {d}_2^S $, and recent studies [23,24,25,26,27] have shown that $ {d}_2^S $ is superior to other dissimilarity measures for the comparison of genome sequences based on relative k-tuple frequencies. Therefore, in the present study, we attempted to model the relative sequence composition and measure dissimilarity between contigs with $ {d}_2^S $ for a single metagenomic sample. The $ {d}_2^S $ was designed to measure the dissimilarity between two sequences or next generation sequencing data by modeling the background genomes [23] using Markov and interpolated Markov chains. Previous studies verified the effectiveness of $ {d}_2^S $ in revealing group and gradient relationships between genomes [24, 25], metagenomes [28] and metatranscriptomes [26, 27].
However, binning contigs directly with $ {d}_2^S $ is computationally expensive and impractical for large metagenomics studies, because a Markov background model must be constructed for every sequence and the expected counts of k-tuples must be calculated. On the other hand, many binning tools are based on absolute k-tuple frequencies, and the results from such methods are reasonable. Still, these tools and methods can be improved by using $ {d}_2^S $ dissimilarity. Therefore, in the present study, we do not bin the contigs from scratch. Instead, we attempt to adjust contig bins based on the output of any existing binning tool. We model each contig with a Markov chain based on its k-tuple frequency vector. The bin’s center is represented by the averaged k-tuple frequency vector of all contigs in the bin and is also modeled with a Markov chain. Then, $ {d}_2^S $ measures the dissimilarity between a contig and a bin’s center based on relative sequence composition, as represented by the Markov chains. Finally, a K-means clustering algorithm is applied to cluster the contigs based on the $ {d}_2^S $ dissimilarities, where K is the number of clusters. Such an approach, on the one hand, avoids the heavy computational cost of using $ {d}_2^S $ directly and, on the other hand, further improves the initial binning results. The method is developed as an open source package, termed $ {d}_2^S\mathrm{Bin} $, which is available at https://github.com/kunWangkun/d2SBin.

We selected six synthetic and real datasets that had originally been used to evaluate existing tools as testing datasets. $ {d}_2^S\mathrm{Bin} $ was applied to adjust the binning results of five representative binning tools using sequence composition (MetaCluster3.0 [14], MetaWatt [12] and SCIMM [11]) and the hybrid of sequence composition and abundance (MaxBin1.0 [17], MyCC [20]) based on a single metagenomic sample. Tuple length k = 6 and the independent identically distributed (i.i.d.) background model (i.e., Markov order r = 0) are frequently the optimal parameters for $ {d}_2^S\mathrm{Bin} $ to achieve the best performance for metagenomics contig binning. $ {d}_2^S\mathrm{Bin} $ improved the binning results in 28 out of 30 testing experiments for 6 datasets using 5 binning tools, giving significantly better performance in terms of recall, precision and ARI (Adjusted Rand Index).

Methods

The framework of $ {d}_2^S\mathrm{Bin} $ is shown in the flowchart of Fig. 1. Any existing contig binning tool is applied with its default settings to bin the contigs in a single metagenomic sample. Each contig is modeled with a Markov chain based on its k-tuple frequency vector. For each bin, the bin’s center is also modeled with a Markov chain based on the averaged frequency vector of all contigs in this bin. The $ {d}_2^S $ measures the dissimilarity between a contig and a bin’s center based on the background probability models. Assuming that contigs in the same bin come from an identical background model, the $ {d}_2^S $ dissimilarity between contigs from the same bin should be smaller than that between contigs from different bins under correct binning. The K-means algorithm is then applied to adjust the contigs among different bins to minimize the within-bin sum of squares based on $ {d}_2^S $ dissimilarity.

The $ {d}_2^S $ dissimilarity measure between two contigs based on k-tuple sequence signature

The $ {d}_2^S $ is a normalized dissimilarity measure for two sequences, given either as long genomic sequences or as NGS short reads, in which expected word counts are subtracted from the observed counts for each sequence. The background-adjusted word counts are then compared using correlation to measure the dissimilarity between the two sequences [25]. Let $ {c}_X=\left({c}_{X,1},{c}_{X,2},\cdots, {c}_{X,{4}^k}\right) $ and $ {c}_Y=\left({c}_{Y,1},{c}_{Y,2},\cdots, {c}_{Y,{4}^k}\right) $ be the k-tuple frequency vectors from two sequences X and Y, respectively, where $ {c}_{X,i} $ is the number of occurrences of the $ i $-th k-tuple in sequence X and $ i=1,\cdots, {4}^k $. Each position in a tuple can hold one of four nucleotides (A, C, G or T), so there are $ {4}^k $ possible tuples of length k.
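
As an illustration of the k-tuple vector just defined, the following minimal Python sketch counts overlapping k-tuples on a single strand. The function name `ktuple_counts`, the lexicographic ordering of tuples over A, C, G, T, and the choice to skip tuples containing other symbols (such as N) are assumptions made for this example; they are not taken from the $ {d}_2^S\mathrm{Bin} $ source code.

```python
from itertools import product


def ktuple_counts(seq, k):
    """Return the 4**k occurrence vector of k-tuples in seq.

    Tuples are indexed in lexicographic order over "ACGT" (an assumption of
    this sketch); windows containing any other symbol are ignored.
    """
    index = {"".join(p): i for i, p in enumerate(product("ACGT", repeat=k))}
    counts = [0] * (4 ** k)
    seq = seq.upper()
    for start in range(len(seq) - k + 1):
        i = index.get(seq[start:start + k])
        if i is not None:
            counts[i] += 1
    return counts


# "ACGTACGT" holds 7 overlapping 2-tuples: AC, CG and GT twice each, TA once.
counts = ktuple_counts("ACGTACGT", 2)
```

Dividing each entry by the sum of the vector gives the "absolute" frequency discussed in the Background; the relative composition used by $ {d}_2^S $ additionally subtracts the expected counts.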

The $ {d}_2^S $ dissimilarity is defined as

$$ {d}_2^S\left({\tilde{c}}_X,{\tilde{c}}_Y\right)=\frac{1}{2}\left(1-\frac{D_2^S\left({\tilde{c}}_X,{\tilde{c}}_Y\right)}{\sqrt{\sum_{i=1}^{4^k}\frac{{\tilde{c}}_{X,i}^2}{\sqrt{{\tilde{c}}_{X,i}^2+{\tilde{c}}_{Y,i}^2}}}\sqrt{\sum_{i=1}^{4^k}\frac{{\tilde{c}}_{Y,i}^2}{\sqrt{{\tilde{c}}_{X,i}^2+{\tilde{c}}_{Y,i}^2}}}}\right), $$

(1)

where

$$ {D}_2^S\left({\tilde{c}}_X,{\tilde{c}}_Y\right)=\sum_{i=1}^{4^k}\frac{{\tilde{c}}_{X,i}{\tilde{c}}_{Y,i}}{\sqrt{{\tilde{c}}_{X,i}^2+{\tilde{c}}_{Y,i}^2}}, $$

(2)

$$ {\tilde{c}}_{X,i}={c}_{X,i}-{n}_X{p}_{X,i},\kern0.5em {\tilde{c}}_{Y,i}={c}_{Y,i}-{n}_Y{p}_{Y,i}, $$

(3)

where $ {p}_{\bullet, i} $ is the probability of the $ i $-th k-tuple under the Markov model of order r = 0–3 for one long sequence or set of reads, and $ {n}_{\bullet }=\sum_{i=1}^{4^k}{c}_{\bullet, i} $, with • = X or Y, is the total number of occurrences of all k-tuples. The value of $ {d}_2^S $ lies between 0 and 1. Specifically, $ {p}_{X,i} $ is the probability of the $ i $-th k-tuple under the background model for X, which can be the i.i.d. model or a Markov chain of a given order. Denote the $ i $-th k-tuple as $ w={w}_1{w}_2\cdots {w}_k $. Under the $ r $-th order Markov chain $ {M}_r $, the probability of the k-tuple w, namely its expected frequency, can be computed as

$$ p\left(w|{M}_r\right)=\left\{\begin{array}{l}\prod \limits_{j=1}^kp\left({w}_j\right)\kern5.00em r=0\\ {}p\left({w}_1{w}_2\dots {w}_r\right)\prod \limits_{j=1}^{k-r}p\left({w}_{j+r}|{w}_j{w}_{j+1}\dots {w}_{j+r-1}\right)\kern0.5em 1\le r\le k-1\end{array}\right. $$

(4)

where $ p\left({w}_j\right) $ is the probability of $ {w}_j $, estimated as the number of occurrences of $ {w}_j $ divided by the total number of nucleotides. The value of $ p\left({w}_1{w}_2\cdots {w}_r\right) $ is estimated as the number of occurrences of $ {w}_1{w}_2\cdots {w}_r $ divided by the total number of r-tuple occurrences. The value of $ p\left({w}_{j+r}|{w}_j{w}_{j+1}\cdots {w}_{j+r-1}\right) $ is estimated as the fraction of occurrences of $ {w}_j{w}_{j+1}\cdots {w}_{j+r-1} $ that are followed by $ {w}_{j+r} $.
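
To make Eqs. (1)–(4) concrete, here is a minimal Python sketch restricted to the i.i.d. background (r = 0), the setting this study ends up using. All function names are hypothetical, and two choices are assumptions of the sketch rather than part of the definition: terms whose denominator $ \sqrt{{\tilde{c}}_{X,i}^2+{\tilde{c}}_{Y,i}^2} $ is zero are skipped, and an all-zero centered vector is given the neutral value 0.5.

```python
import math
from itertools import product


def word_counts(seq, k):
    """Occurrences of every k-tuple over ACGT, as a dict word -> count."""
    counts = {"".join(p): 0 for p in product("ACGT", repeat=k)}
    for start in range(len(seq) - k + 1):
        word = seq[start:start + k]
        if word in counts:
            counts[word] += 1
    return counts


def centered_counts(seq, k):
    """Eq. (3) with the r = 0 case of Eq. (4): count minus n * p(word)."""
    counts = word_counts(seq, k)
    n = sum(counts.values())
    bases = word_counts(seq, 1)
    total = sum(bases.values())  # assumed non-zero (non-empty sequence)
    base_freq = {b: c / total for b, c in bases.items()}
    centered = []
    for word in sorted(counts):  # fixed lexicographic order for all sequences
        p = 1.0
        for ch in word:
            p *= base_freq[ch]
        centered.append(counts[word] - n * p)
    return centered


def d2s(cx, cy):
    """Eqs. (1)-(2) applied to two centered count vectors of equal length."""
    num = sx = sy = 0.0
    for x, y in zip(cx, cy):
        denom = math.sqrt(x * x + y * y)
        if denom == 0.0:
            continue  # 0/0 term; skipping it is an assumption of this sketch
        num += x * y / denom
        sx += x * x / denom
        sy += y * y / denom
    if sx == 0.0 or sy == 0.0:
        return 0.5  # assumption: an all-zero vector is treated as uncorrelated
    return 0.5 * (1.0 - num / (math.sqrt(sx) * math.sqrt(sy)))
```

With these definitions, a sequence compared with itself gives a dissimilarity of 0 up to rounding, and a vector compared with its negation gives 1, which matches the stated range of $ {d}_2^S $.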

$ {d}_2^S\mathrm{Bin} $: Contig binning based on the $ {d}_2^S $ measure

Let $ S=\left\{{S}_1,{S}_2,\cdots, {S}_l\right\} $ be the partition of all contigs into l bins. Contig X is represented as $ {c}_X=\left({c}_{X,1},{c}_{X,2},\cdots, {c}_{X,{4}^k}\right) $, the occurrence vector of k-tuples within the contig. The center of bin $ {S}_j $ is represented as the average frequency vector,

$$ {c}_{S_j}=\frac{1}{n_j}{\sum}_{X_i\in {S}_j}{c}_{X_i}, $$

(5)

where $ {X}_i $ is a contig currently in $ {S}_j $ and $ {n}_j $ is the number of contigs in $ {S}_j $. The value of $ {d}_2^S\left({\tilde{c}}_X,{\tilde{c}}_{S_j}\right) $ quantifies the dissimilarity between contig X and bin $ {S}_j $.

In our study, when the number of bins is fixed, the metrics of binning call for minimizing the within-bin sum of squares based on $ {d}_2^S $ dissimilarity, that is,

$$ \underset{S}{\arg \min}\sum_{j=1}^l\sum_{X\in {S}_j}{d}_2^S\left({\tilde{c}}_X,{\tilde{c}}_{S_j}\right). $$

(6)

We then used the K-means clustering algorithm to optimize Eq. (6).
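
The adjustment loop can be sketched as follows. This is a simplified illustration under stated assumptions, not the released implementation: the names `adjust_bins`, `center_vector` and `centered` are hypothetical, the background is i.i.d. with nucleotide frequencies estimated from the k-tuple counts themselves, zero-denominator terms are skipped, and a bin that loses all of its contigs simply disappears.

```python
import math
from itertools import product


def d2s(cx, cy):
    """Eqs. (1)-(2); zero-denominator terms are skipped (sketch assumption)."""
    num = sx = sy = 0.0
    for x, y in zip(cx, cy):
        denom = math.sqrt(x * x + y * y)
        if denom == 0.0:
            continue
        num += x * y / denom
        sx += x * x / denom
        sy += y * y / denom
    if sx == 0.0 or sy == 0.0:
        return 0.5
    return 0.5 * (1.0 - num / (math.sqrt(sx) * math.sqrt(sy)))


def center_vector(vectors):
    """Eq. (5): component-wise average of the count vectors in one bin."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def centered(counts, words):
    """Eq. (3) under the i.i.d. model, estimated from k-tuple counts alone."""
    n = sum(counts)
    base = dict.fromkeys("ACGT", 0.0)
    for word, c in zip(words, counts):
        for ch in word:
            base[ch] += c
    total = sum(base.values())
    if total == 0:
        return [0.0] * len(counts)
    out = []
    for word, c in zip(words, counts):
        p = 1.0
        for ch in word:
            p *= base[ch] / total
        out.append(c - n * p)
    return out


def adjust_bins(vectors, labels, k, max_iter=5):
    """Reassign each contig to the bin whose center is closest in d2S."""
    words = ["".join(p) for p in product("ACGT", repeat=k)]
    contigs = [centered(v, words) for v in vectors]
    labels = list(labels)
    for _ in range(max_iter):
        bins = sorted(set(labels))
        centers = {
            b: centered(
                center_vector([v for v, lab in zip(vectors, labels) if lab == b]),
                words,
            )
            for b in bins
        }
        new_labels = [min(bins, key=lambda b: d2s(c, centers[b])) for c in contigs]
        if new_labels == labels:  # no contig moved: converged
            break
        labels = new_labels
    return labels
```

Here `vectors` holds one k-tuple count vector per contig (in lexicographic tuple order) and `labels` holds the bin assignment produced by the initial binning tool; the function returns the adjusted assignment.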

Experimental design

The purpose of our study is to improve binning results using $ {d}_2^S\mathrm{Bin} $ based on the output of current existing binning tools. Therefore, we adopted both synthetic and real testing datasets generated, or used, by previous binning tools in order to test the performance of $ {d}_2^S\mathrm{Bin} $, as shown in Table 1. The $ {d}_2^S\mathrm{Bin} $ was applied to the binning results of five contig-binning tools, respectively, to evaluate its performance in improving their binning results.

Table 1 Synthetic and real testing datasets for contig binning


Selection of contig binning tools

The $ {d}_2^S\mathrm{Bin} $ was applied to adjust the contig-binning results from MaxBin1.0 [17], MetaCluster3.0 [14], MetaWatt [12], MyCC [20] and SCIMM [11] to evaluate its performance. These five widely used contig-binning tools use different strategies to bin the contigs of a single metagenomic sample: 1) Sequence composition: MetaCluster3.0 [14] measures the Spearman distance between 4-tuple frequency vectors and bins contigs with the K-median algorithm. MetaCluster4.0 [29] and MetaCluster5.0 [30] were designed to bin the reads from metagenomics samples of different abundance characteristics. MetaWatt [12] and SCIMM [11] build interpolated Markov models of the background genomes and assign the contigs to bins with maximum likelihood. 2) Hybrid of abundance and sequence composition: MaxBin1.0 [17] measures the Euclidean distance between 4-tuple frequency vectors of contigs and assigns them with an EM algorithm, taking scaffold coverage levels into consideration. MyCC [20] combines genomic signatures, marker genes and optional contig coverages within one or multiple samples.

Five synthetic testing datasets with 10 genomes and 100 genomes

MaxBin1.0 [17] used these five datasets to evaluate its performance. Here we used the same five datasets to evaluate the performance of $ {d}_2^S\mathrm{Bin}. $ Short reads were simulated by MetaSim [31] and assembled to contigs by Velvet [32]. The contigs and their labels are available for downloading from the MaxBin1.0 paper [17]. For the metagenomes containing 10 genomes, 5 million and 20 million paired-end reads were sampled as 20× and 80× average coverage, respectively. For the metagenomes containing 100 genomes, 100 million paired-end reads were sampled with three settings to create simLC+, simMC+ and simHC+. The three datasets represent microbial communities with different levels of complexity, which mimicked the setting of the previous study [33]: simLC simulates low-complexity communities dominated by a single near-clonal population flanked by low-abundance ones. Such datasets result in a near-complete draft assembly of the dominant population in, for example, bioreactor communities [34]. simMC resembles moderately complex communities with more than one dominant population, also flanked by low-abundance ones, as has been observed in an acid mine drainage biofilm [35] and Olavius algarvensis symbionts [36]. These types of communities usually result in substantial assembly of the dominant populations according to their clonality. simHC simulates high-complexity communities lacking dominant populations, such as agricultural soil [37], where no dominant strains are present and minimal assembly results. In addition, the empirical 80-bps error model, which incorporates different error types (deletion, insertion, substitution) at certain positions with empirical error probabilities for Illumina, was produced by MetaSim [31] and used in simulating all metagenomes [17].

One real testing dataset, Sharon

This dataset was applied to test the binning tools COCACOLA [5] and CONCOCT [4]. The dataset is composed of a time-series of 11 fecal microbiome samples from a premature infant [38], denoted as ‘Sharon’. All metagenomic sequencing reads from the 11 samples were merged together, and 5579 contigs were assembled. The contigs were annotated with TAXAassign [39], and 2614 contigs were unambiguously aligned to 21 species [5].

The above datasets cover various species diversity, species dissimilarity, sequencing depth, and community complexity. They include synthetic and real data. Therefore, testing on these datasets would yield a comprehensive evaluation of $ {d}_2^S\mathrm{Bin} $.

Evaluation criteria

To evaluate the performance of $ {d}_2^S\mathrm{Bin} $, three commonly used criteria in binning studies [4, 5, 17], recall, precision and ARI (Adjusted Rand Index), were applied in our study. As described in COCACOLA [5], the binning result is represented as a K × S matrix $ A=\left({a}_{ks}\right) $ with K bins on S species, where $ {a}_{ks} $ is the number of contigs shared between the $ k $-th bin and the $ s $-th species. Each contig binning tool filters out low-quality contigs; therefore, N is the total number of contigs that pass the filter and are binned by the tool.

Recall: For each species, we first find the bin that contains the maximum number of contigs from the species. We then sum these maxima over all species and divide by the total number of contigs N.

$$ \mathrm{recall}=\frac{1}{N}\sum_s\max_k\left\{{a}_{ks}\right\} $$

(7)

Precision: For each contig bin, we first find the species with the maximum number of contigs assigned to the bin. We then sum the maximum numbers across all bins and divide by the number of contigs.

$$ \mathrm{precision}=\frac{1}{N}\sum_k\max_s\left\{{a}_{ks}\right\} $$

(8)
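
Equations (7) and (8) translate directly into code. The sketch below uses a small made-up contingency matrix purely for illustration (2 bins, 3 species); the function name `recall_precision` is hypothetical.

```python
def recall_precision(a):
    """a[k][s] = number of contigs shared by bin k and species s."""
    n = sum(sum(row) for row in a)
    # Eq. (7): best bin for each species (column-wise maxima).
    recall = sum(max(column) for column in zip(*a)) / n
    # Eq. (8): dominant species in each bin (row-wise maxima).
    precision = sum(max(row) for row in a) / n
    return recall, precision


# Illustrative matrix, not data from the study: N = 25 contigs.
a = [[8, 1, 4],
     [2, 9, 1]]
recall, precision = recall_precision(a)  # (8 + 9 + 4) / 25 and (8 + 9) / 25
```

With fewer bins than species, as in this toy matrix, precision falls below recall because at least one species can never dominate a bin.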

ARI (Adjusted Rand Index): ARI is a single summary measure of how far a binning result falls from the perfect grouping. It considers whether pairs of contigs belonging to the same species are binned together or not. Detailed descriptions can be found in [4, 5].

$$ ARI=\frac{\sum_{k,s}\left(\begin{array}{c}{a}_{ks}\\ {}2\end{array}\right)-{t}_3}{\frac{1}{2}\left({t}_1+{t}_2\right)-{t}_3} $$

(9)

where $ {t}_1=\sum_k\left(\begin{array}{c}{a}_{k\bullet}\\ {}2\end{array}\right) $, $ {t}_2=\sum_s\left(\begin{array}{c}{a}_{\bullet s}\\ {}2\end{array}\right) $, $ {t}_3=\frac{{t}_1{t}_2}{\left(\begin{array}{c}N\\ {}2\end{array}\right)} $, $ {a}_{k\bullet }=\sum_s{a}_{ks} $ and $ {a}_{\bullet s}=\sum_k{a}_{ks} $.
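
A short Python sketch of Eq. (9) is given below. It uses the standard expected-index term $ {t}_3={t}_1{t}_2/\left(\begin{array}{c}N\\ {}2\end{array}\right) $ (equivalently $ 2{t}_1{t}_2/\left(N\left(N-1\right)\right) $); the function name is hypothetical and the input matrix is a toy example, not data from the study.

```python
from math import comb


def adjusted_rand_index(a):
    """ARI from a contingency matrix a[k][s] (bins x species)."""
    n = sum(sum(row) for row in a)
    index = sum(comb(x, 2) for row in a for x in row)
    t1 = sum(comb(sum(row), 2) for row in a)           # pairs within bins
    t2 = sum(comb(sum(column), 2) for column in zip(*a))  # pairs within species
    t3 = t1 * t2 / comb(n, 2)                          # expected index
    # The denominator is zero only in degenerate cases (e.g. one bin and one
    # species); such inputs are not handled in this sketch.
    return (index - t3) / (0.5 * (t1 + t2) - t3)


perfect = adjusted_rand_index([[3, 0], [0, 2]])  # bins match species exactly
```

A binning that reproduces the species partition exactly scores 1, while a random assignment scores close to 0.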

Results

In the calculation of $ {d}_2^S $ dissimilarity, the tuple length k and the Markov order of the background sequences must be set. Based on previous studies [4, 5] and on previous applications of $ {d}_2^S $ to metagenomic and metatranscriptomic samples [25, 26], tuple length k was generally set to 4–7 and the order of the Markov chain to 0–2. Therefore, we extended the testing ranges of tuple length and Markov order to 4–8 and 0–3, respectively, to assess their effect on contig binning with $ {d}_2^S\mathrm{Bin} $. As shown in Table 2, for the binning results of MaxBin on 10genome-80×, the i.i.d. (that is, 0-order Markov) model obtained the highest values of the three indexes at almost all tuple lengths. Models based on tuple length k = 6 showed superior performance, and the best performance was achieved under the i.i.d. background model with 6-tuples. All three criteria dropped sharply at k = 8. This experiment offered initial guidance for the selection of tuple length and Markov order.

Table 2 Initial assessments of the effects of tuple length and Markov order of the background sequences on the performance of MaxBin+ $ {d}_2^S\mathrm{Bin} $ in terms of recall, precision and ARI for dataset 10genome-80×


Length selection of k-tuple in $ {d}_2^S\mathrm{Bin} $

According to Table 2, we calculated $ {d}_2^S $ with 4–8 bp tuples under the i.i.d. model based on the output of the existing binning tools. These tools were run with their default tuple length and mode. The datasets 10genome-80× and 100genome-simHC+ were selected to test the effect of tuple length on the performance of $ {d}_2^S\mathrm{Bin} $. For both datasets, $ {d}_2^S\mathrm{Bin} $ based on 6-tuples achieved the best performance on precision, recall and ARI for all five tools. Figures 2 and 3 plot only the curves for tuple lengths k = 4–6: the severe drop in performance at k = 7 and 8 would require an excessively wide Y-axis range, which would compress the curves for k = 4–6 together and obscure the superiority of k = 6. Therefore, we set k = 6 for $ {d}_2^S $ in the rest of our study.

Order selection for Markov chain in $ {d}_2^S\mathrm{Bin} $

To obtain the most suitable Markov order for the background genome, we fixed the tuple length at k = 6 and applied Markov chains of order 0–2 to calculate $ {d}_2^S $ for datasets 10genome-80× and 100genome-simHC+ on the output of the five contig-binning tools. As shown in Figs. 4 and 5, for both datasets, $ {d}_2^S\mathrm{Bin} $ under the i.i.d. model with 6-tuples achieves the best performance for precision, recall and ARI on all five tools. In our previous studies applying $ {d}_2^S $ to compare metagenomic [28] and metatranscriptomic samples [26], $ {d}_2^S $ under the i.i.d. model always achieved the best results on all 12 testing datasets, which illustrates that the i.i.d. model works well for the study of microbial communities. This is probably because each bin is a mixture of several genomes, and no Markov chain model with a fixed order greater than 0 describes the bin better. Therefore, we set tuple length k = 6 and used the i.i.d. model in $ {d}_2^S\mathrm{Bin} $.

Experiments on contig binning

The contig-binning tools MaxBin [17], MetaCluster3.0 [14], MetaWatt [12], SCIMM [11] and MyCC [20] were applied to bin the contigs from the six synthetic and real datasets with their original running modes. Based on the results from these tools, $ {d}_2^S\mathrm{Bin} $ was further applied to adjust the contigs among bins. $ {d}_2^S\mathrm{Bin} $ did not change the number of bins obtained by the original tools. The bar graphs in Fig. 6 illustrate the recall, precision and ARI of the output of the five existing tools before and after the adjustment by $ {d}_2^S\mathrm{Bin} $ for the six datasets. In most cases, the three criteria were improved by 1%–22%. Additional file 1: Table S1 presents the numerical values of the three indexes and offers more detailed information on all experiments, including the numbers of total and binned contigs and of actual and clustered bins, providing a more comprehensive view of the scale and complexity of each dataset and of the original binning performance.

Contig binning on synthetic dataset 10 genome 80× coverage

From Fig. 6a, it is easy to see that the three criteria were improved for all five tools. As shown in Additional file 1: Table S1, 8022 contigs were assembled from simulated metagenomic reads. The best results were obtained on MyCC where $ {d}_2^S\mathrm{Bin} $ increased recall, precision and ARI from 97.21%, 97.21%, and 95.58% to 97.75%, 97.75% and 96.16%, respectively. MaxBin, MetaCluster and MyCC assigned the contigs into 10 bins. MetaWatt and SCIMM obtained 27 and 8 bins, respectively, but $ {d}_2^S\mathrm{Bin} $ still adjusted contigs among these bins to achieve better performance.

Contig binning on synthetic dataset 10 genome 20× coverage

Compared with the 20 million reads in the 10 genome 80× data, the 10 genome 20× data have only 5 million reads for the 10 genomes. Fig. 6b shows that $ {d}_2^S\mathrm{Bin} $ improved the binning of MaxBin, MetaWatt, SCIMM and MyCC. As shown in Additional file 1: Table S1, both MaxBin and MetaCluster produced only three bins, and most contigs belonged to the three genomes with the highest abundances because most contigs from the seven low-abundance genomes were too short and were discarded during preprocessing [17]. However, $ {d}_2^S\mathrm{Bin} $ only improved precision, but not recall or ARI, on MetaCluster. To gain insight into this deterioration of binning performance, we list the number of contigs from the 10 genomes in each bin, as shown in Additional file 1: Table S2–2 for MetaCluster and MetaCluster+ $ {d}_2^S\mathrm{Bin} $. Each row of the table is one genome, identified by its genome ID and corresponding genome name in NCBI, and each column is a clustered bin, so each element is the number of contigs from one genome inside the given bin. Among the 1217 contigs assigned by MetaCluster, 1209 come from four dominant genomes with at least 100 contigs each: Flavobacterium branchiophilum, Halothiobacillus neapolitanus, Lactobacillus casei and Acetobacter pasteurianus. But MetaCluster output only three bins: the contigs from Flavobacterium branchiophilum, Halothiobacillus neapolitanus and Lactobacillus casei are dominant in the three bins, and the contigs from Acetobacter pasteurianus are scattered across the three bins. After adjustment by $ {d}_2^S\mathrm{Bin} $, the contigs from Acetobacter pasteurianus were merged into the same bin as Halothiobacillus neapolitanus. Acetobacter pasteurianus and Halothiobacillus neapolitanus are both from the phylum Proteobacteria. Therefore, Acetobacter pasteurianus is phylogenetically closer to Halothiobacillus neapolitanus than to the other two genomes. From this point of view, $ {d}_2^S\mathrm{Bin} $ indeed improved the binning of MetaCluster although the performance indexes did not show improvement. Additional file 1: Table S2 also gives the details of the contig assignments to bins before and after $ {d}_2^S\mathrm{Bin} $ for the other four tools. For MyCC (Additional file 1: Table S2–5), before $ {d}_2^S\mathrm{Bin} $ was applied, MyCC produced 5 bins; the contigs from Halothiobacillus neapolitanus were split between bin 1 and bin 4, and bin 1 contained contigs from both Halothiobacillus neapolitanus and Lactobacillus casei, which led to a low ARI of 24.76%. After $ {d}_2^S\mathrm{Bin} $ was applied, most contigs from Halothiobacillus neapolitanus were assigned to bin 4, bin 1 mainly included contigs from Lactobacillus casei, and the ARI increased to 70.48%. The result demonstrates that $ {d}_2^S\mathrm{Bin} $ tends to assign contigs with consistent or similar background models to the same bin.

Contig binning on synthetic dataset 100 genome-simHC+

simHC+ has evenly distributed species abundance levels with no dominant species. According to Fig. 6c, the three criteria were all improved for the five tools. According to Additional file 1: Table S1, among a total of 407,873 contigs, 13,919 were clustered into 87 bins by MaxBin with 80.23%, 76.69% and 64.58% recall, precision and ARI, respectively. After $ {d}_2^S\mathrm{Bin} $, the three indexes were improved to 90.67%, 80.14% and 74.03%, respectively, showing overall superior performance. MetaCluster, MetaWatt, and MyCC produced 97, 129 and 94 bins, respectively, and recall, precision and ARI were improved for all of them by $ {d}_2^S\mathrm{Bin} $. SCIMM produced only 19 bins, which led to low precision and ARI, but $ {d}_2^S\mathrm{Bin} $ still improved the three metrics.

Contig binning on synthetic dataset 100 genome-simMC+

According to Fig. 6d, the three criteria were improved by $ {d}_2^S\mathrm{Bin} $ for MaxBin, MetaCluster, SCIMM and MyCC. Owing to the poor assembly quality of simMC+ [17], only ~10,000+ contigs of the 795,573 passed the minimum length threshold, among which a small portion came from low-abundance genomes. Therefore, only high-abundance genomes were binned, and 11 bins were generated for MaxBin and MetaCluster, and 15 bins for MyCC. The large disparity between the number of real species and bins led to low precision and ARI. However, $ {d}_2^S\mathrm{Bin} $ still greatly improved recall, precision and ARI. The exception was MetaWatt. Among the 11,987 clustered contigs, MetaWatt isolated 41 bins. In this case, extracting contigs from the dominant genome from each bin would leave only 7978, meaning that one-third of the contigs would remain to interfere with the modeling of the 41 dominant genomes, in turn leading to decreased performance for precision and ARI.

Contig binning on synthetic dataset 100 genome-simLC+

$ {d}_2^S\mathrm{Bin} $ improved the binning performance for all tools. All three metrics were also significantly improved by $ {d}_2^S\mathrm{Bin} $. For SCIMM, $ {d}_2^S\mathrm{Bin} $ increased recall, precision and ARI from 70.99%, 46.29% and 32.64% to 76.42%, 65.46% and 55.24%, respectively, which represents the best performance among the five tools.

Contig binning on real dataset Sharon

For this real dataset, the ground truth of binning was not available. The following two evaluations were implemented: (1) We only binned the 2614 contigs with unambiguous labels belonging to 21 species, and the annotations were considered as the ground truth. MaxBin, MetaCluster, MetaWatt, SCIMM and MyCC isolated 11, 10, 23, 19 and 16 bins for Sharon originally. As shown in Fig. 6f, based on their binning outputs, $ {d}_2^S\mathrm{Bin} $ adjusted the contig binning and increased Recall, Precision and ARI for all tools. (2) We applied CheckM [40] to estimate the approximate contamination and genome completeness of the contigs in the bins free from ground truth. Figure 7a shows the number of recovered genome bins by each method in different recall (completeness) threshold with precision (lack of contamination) > 80%. Although the tools identified 10–23 bins among the 21 species in the Sharon dataset, only 4–6 genome bins were recovered with precision > 80%. $ {d}_2^S\mathrm{Bin} $ did improve recall and precision. For MetaWatt and MyCC, $ {d}_2^S\mathrm{Bin} $ increased the number of bins with precision > 80%. For MetaCluster and SCIMM, $ {d}_2^S\mathrm{Bin} $ not only increased the number of bins with precision > 80% but also increased the number of bins with recall > 90%. The $ {d}_2^S\mathrm{Bin} $ also increased the recall of each bin for MaxBin and MyCC. Figure 7b shows the number of recovered genome bins at different precision thresholds with recall > 80%. For all tools, $ {d}_2^S\mathrm{Bin} $ increased the number of bins with recall > 80%. For MaxBin and MyCC, the number of bins with precision > 90% is also increased by $ {d}_2^S\mathrm{Bin} $.

Testing on these synthetic and real datasets showed that $ {d}_2^S\mathrm{Bin} $ clearly improved on the original outputs of the five tested tools.

Convergence of K-means iteration on $ {d}_2^S\mathrm{Bin} $

To evaluate the convergence of the K-means iteration in $ {d}_2^S\mathrm{Bin} $, we plotted the performance curves of the three indexes for randomly selected tools and datasets, as shown in Fig. 8. The “0” on the horizontal axis indicates the performance of the original binning tool. In our experiments with ten iterations, the three indexes increased markedly in the first iteration and quickly reached a steady state. Therefore, in $ {d}_2^S\mathrm{Bin} $, the K-means iterations stop when no contig is reassigned or when the number of iterations reaches 5.
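The stopping rule can be sketched as a reassignment loop. This is an illustration only, under several assumptions: each contig is represented by a plain numeric vector, the bin center is taken as the component-wise mean of its members, and the dissimilarity is passed in as a callable (the paper uses $ {d}_2^S $; the example substitutes an absolute difference so that it runs on its own). The name `adjust_bins` is hypothetical.

```python
def adjust_bins(vectors, labels, dissimilarity, max_iter=5):
    """Reassign contigs to their nearest bin center until nothing moves.

    vectors: list of equal-length feature lists, one per contig
    labels: initial bin label per contig (e.g. from an existing binning tool)
    dissimilarity: callable(vector, center) -> float
    Returns (final_labels, iterations_run).
    """
    labels = list(labels)
    iteration = 0
    for iteration in range(1, max_iter + 1):
        # Group member vectors by current bin label.
        members = {}
        for vec, lab in zip(vectors, labels):
            members.setdefault(lab, []).append(vec)
        # Bin center = component-wise mean of its members (assumption).
        centers = {lab: [sum(col) / len(vecs) for col in zip(*vecs)]
                   for lab, vecs in members.items()}
        # Move every contig to the bin whose center is least dissimilar.
        new_labels = [min(centers, key=lambda lab: dissimilarity(vec, centers[lab]))
                      for vec in vectors]
        moved = sum(old != new for old, new in zip(labels, new_labels))
        labels = new_labels
        if moved == 0:  # first stopping condition: no contig was reassigned
            break
    return labels, iteration  # second condition: max_iter reached


# One-dimensional toy data with two misplaced contigs in the initial labels.
toy = [[0.0], [0.1], [0.2], [5.0], [5.1], [4.9]]
start = [0, 0, 1, 1, 1, 0]
final, n_iter = adjust_bins(toy, start, lambda v, c: abs(v[0] - c[0]))
print(final, n_iter)
```

The number of bins never changes in this loop, because centers are only computed for labels that still have members and no bin is created or divided.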

Software implementation and running

The code of $ {d}_2^S\mathrm{Bin} $ was implemented in Python and Cython and runs under Linux. Cython is a superset of the Python language that additionally supports calling C functions, and the code can be compiled into a shared library that is called directly from Python. Tested on a server with 128 GB of memory and an Intel(R) Xeon(R) CPU E5–2620 v2 @ 2.10GHz with 6 CPU cores, $ {d}_2^S\mathrm{Bin} $ with 6-tuples took 16 min to adjust the binning of 8022 contigs in 10 bins, with an average contig length of 4000 bp, and the peak memory was 6.7 GB. The source code of $ {d}_2^S\mathrm{Bin} $ is available at https://github.com/kunWangkun/d2SBin.

Discussion

Our experiments demonstrate that $ {d}_2^S $ can measure the similarity between contigs more accurately. However, $ {d}_2^S $ requires building a background Markov model for each contig, which imposes a heavy computational burden. Therefore, instead of de novo binning from scratch, we adjust contig bins based on the output of any existing binning tool for a single metagenomic sample, a strategy that overcomes the computational issue. When multiple related samples are available, sequence composition contributes less to contig binning than the co-varying coverage profiles across samples, and $ {d}_2^S\mathrm{Bin} $ cannot improve the contig binning for multiple metagenomic samples. Tools designed for multiple samples, such as COCACOLA, GroopM, Concoct and MaxBin 2.0, can achieve satisfactory results when multiple metagenomic samples are available.

Currently, $ {d}_2^S\mathrm{Bin} $ does not merge or split bins. When the number of clustered bins differs greatly from the ground truth, merging and splitting bins would improve the results. However, algorithms that adjust the number of clusters, such as ISODATA [41], require as input a minimum threshold for between-class dissimilarity and a maximum threshold for within-class dissimilarity. These thresholds depend on the taxonomic level in which the investigators are interested. Once these thresholds are given, algorithms for merging and splitting bins can be combined with $ {d}_2^S\mathrm{Bin} $ to further improve the binning results.
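To make the role of the two thresholds concrete, the sketch below flags candidate bins for splitting and candidate bin pairs for merging. It is not ISODATA itself and is not part of $ {d}_2^S\mathrm{Bin} $; the function name, the input layout and all numbers are hypothetical, and the dissimilarity values are assumed to have been computed beforehand.

```python
def flag_bins(within, between, max_within, min_between):
    """Flag bins to split and bin pairs to merge from two thresholds.

    within: {bin: average dissimilarity of its contigs to the bin center}
    between: {(bin_a, bin_b): dissimilarity between the two bin centers}
    max_within: bins looser than this are split candidates
    min_between: bin pairs closer than this are merge candidates
    """
    to_split = sorted(b for b, d in within.items() if d > max_within)
    to_merge = sorted(pair for pair, d in between.items() if d < min_between)
    return to_split, to_merge


# Made-up dissimilarities for three bins.
within = {"bin1": 0.10, "bin2": 0.45, "bin3": 0.12}
between = {("bin1", "bin2"): 0.40, ("bin1", "bin3"): 0.05, ("bin2", "bin3"): 0.38}
print(flag_bins(within, between, max_within=0.3, min_between=0.1))
```

Raising `min_between` or lowering `max_within` changes which bins are flagged, which is why the thresholds have to be chosen with a target taxonomic level in mind.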

Conclusions

The ability of $ {d}_2^S\mathrm{Bin} $ to achieve improved binning performance is based on the idea that contigs clustered into one bin will come from the same genome and that relative sequence compositions will be similar across different regions of the same genome, but differ between genomes [21, 22]. $ {d}_2^S $ measures the dissimilarity between a contig and the bin’s center based on the Markov model of k-tuple sequence compositions.
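As an illustration of a dissimilarity built on relative rather than absolute k-tuple composition, the sketch below computes $ {d}_2^S $ in the form commonly written in the alignment-free literature, using an i.i.d. background (a zeroth-order Markov model). It rests on several assumptions that should be checked against the Methods section: sequences contain only A, C, G and T, only one strand is counted, and the undefined case with all-zero centered counts is mapped to 0.5. The function names are hypothetical and this is not the $ {d}_2^S\mathrm{Bin} $ implementation.

```python
from collections import Counter
from itertools import product
from math import sqrt


def centered_counts(seq, k):
    """k-tuple counts minus their expected counts under an i.i.d. base model."""
    seq = seq.upper()
    n_tuples = len(seq) - k + 1
    counts = Counter(seq[i:i + k] for i in range(n_tuples))
    base_freq = {b: seq.count(b) / len(seq) for b in "ACGT"}
    centered = {}
    for letters in product("ACGT", repeat=k):
        word = "".join(letters)
        prob = 1.0
        for b in word:
            prob *= base_freq[b]  # i.i.d.: word probability is a product
        centered[word] = counts[word] - n_tuples * prob
    return centered


def d2s(seq_x, seq_y, k=6):
    """d2S dissimilarity; 0 for identical composition, at most 1."""
    x = centered_counts(seq_x, k)
    y = centered_counts(seq_y, k)
    num = norm_x = norm_y = 0.0
    for word in x:
        scale = sqrt(x[word] ** 2 + y[word] ** 2)
        if scale == 0.0:
            continue  # this word adds nothing to any of the three sums
        num += x[word] * y[word] / scale
        norm_x += x[word] ** 2 / scale
        norm_y += y[word] ** 2 / scale
    if norm_x == 0.0 or norm_y == 0.0:
        return 0.5  # undefined case, treated as uninformative in this sketch
    return 0.5 * (1.0 - num / (sqrt(norm_x) * sqrt(norm_y)))


# Short made-up sequences; k=2 keeps the example fast and readable.
a = "ACGTTGCAAGGCTTAACCGGATCGATCG"
b = "AAAAATTTTTAAAATTTTGGAATTAATT"
print(d2s(a, a, k=2), d2s(a, b, k=2))
```

Subtracting the expected count from each observed count is what makes the composition relative: a k-tuple only contributes when it is over- or under-represented with respect to the sequence's own base frequencies.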

Our experiments demonstrate that $ {d}_2^S\mathrm{Bin} $ significantly improves binning performance in almost all cases, thus giving credence to the relative sequence composition model over the direct application of absolute sequence composition. We applied $ {d}_2^S\mathrm{Bin} $ to five contig-binning tools with different binning strategies. Irrespective of the strategy employed by the contig-binning tool, $ {d}_2^S\mathrm{Bin} $ was able to achieve better performance for all tools tested. Finally, the best results for $ {d}_2^S\mathrm{Bin} $ were consistently obtained with a fixed tuple length k = 6 under the i.i.d. model, so there is no need to search for optimal parameters.

Abbreviations

ARI :: Adjusted Rand Index
EM :: Expectation maximization
i.i.d. :: independent identically distributed

References

1. Riesenfeld CS, Schloss PD, Handelsman J. Metagenomics: genomic analysis of microbial communities. Annu Rev Genet. 2004;38:525–52.
2. Mande SS, Mohammed MH, Ghosh TS. Classification of metagenomic sequences: methods and challenges. Brief Bioinform. 2012;13(6):669–81.
3. Sedlar K, Kupkova K, Provaznik I. Bioinformatics strategies for taxonomy independent binning and visualization of sequences in shotgun metagenomics. Comput Struct Biotechnol J. 2017;15:48–55.
4. Alneberg J, et al. Binning metagenomic contigs by coverage and composition. Nat Methods. 2014;11:1144–6.
5. Lu YY, et al. COCACOLA: binning metagenomic contigs using sequence COmposition, read CoverAge, CO-alignment, and paired-end read LinkAge. Bioinformatics. 2017;33(6):791–8.
6. Huson DH, et al. MEGAN analysis of metagenomic data. Genome Res. 2007;17(3):377–86.
7. Wood DE, Salzberg SL. Kraken: ultrafast metagenomic sequence classification using exact alignments. Genome Biol. 2014;15(3):R46.
8. Finn RD, et al. The Pfam protein families database: towards a more sustainable future. Nucleic Acids Res. 2016;44(D1):D279–85.
9. Rosen GL, Reichenberger ER, Rosenfeld AM. NBC: the naive Bayes classification tool webserver for taxonomic classification of metagenomic reads. Bioinformatics. 2011;27(1):127–9.
10. Kislyuk A, et al. Unsupervised statistical clustering of environmental shotgun sequences. BMC Bioinformatics. 2009;10(1):316.
11. Kelley DR, Salzberg SL. Clustering metagenomic sequences with interpolated Markov models. BMC Bioinformatics. 2010;11(1):544.
12. Strous M, et al. The binning of metagenomic contigs for microbial physiology of mixed cultures. Front Microbiol. 2012;3:410.
13. Laczny CC, et al. VizBin-an application for reference-independent visualization and human-augmented binning of metagenomic data. Microbiome. 2015;3(1):1.
14. Leung HC, et al. A robust and accurate binning algorithm for metagenomic sequences with arbitrary species abundance ratio. Bioinformatics. 2011;27(11):1489–95.
15. Wu Y-W, Ye Y. A novel abundance-based algorithm for binning metagenomic sequences using l-tuples. J Comput Biol. 2011;18(3):523–34.
16. Imelfort M, et al. GroopM: an automated tool for the recovery of population genomes from related metagenomes. PeerJ. 2014;2:e603.
17. Wu Y-W, et al. MaxBin: an automated binning method to recover individual genomes from metagenomes using an expectation-maximization algorithm. Microbiome. 2014;2(1):26.
18. Wu Y-W, Simmons BA, Singer SW. MaxBin 2.0: an automated binning algorithm to recover genomes from multiple metagenomic datasets. Bioinformatics. 2016;32(4):605–7.
19. Wang Y, Hu H, Li X. MBBC: an efficient approach for metagenomic binning based on clustering. BMC Bioinformatics. 2015;16(1):36.
20. Lin H-H, Liao Y-C. Accurate binning of metagenomic contigs via automated clustering sequences using information of genomic signatures and marker genes. Sci Rep. 2016;6:24175.
21. Karlin S, Mrazek J, Campbell AM. Compositional biases of bacterial genomes and evolutionary implications. J Bacteriol. 1997;179(12):3899–913.
22. Dick GJ, et al. Community-wide analysis of microbial genome sequence signatures. Genome Biol. 2009;10(8):R85.
23. Wan L, et al. Alignment-free sequence comparison (II): theoretical power of comparison statistics. J Comput Biol. 2010;17(11):1467–90.
24. Ahlgren NA, et al. Alignment-free d₂* oligonucleotide frequency dissimilarity measure improves prediction of hosts from metagenomically-derived viral sequences. Nucleic Acids Res. 2017;45(1):39–53.
25. Song K, et al. Alignment-free sequence comparison based on next-generation sequencing reads. J Comput Biol. 2013;20(2):64–79.
26. Wang Y, et al. Comparison of metatranscriptomic samples based on k-tuple frequencies. PLoS One. 2014;9(1):e84348.
27. Liao W, et al. Alignment-free transcriptomic and Metatranscriptomic comparison using sequencing signatures with variable length Markov chains. Sci Rep. 2016;6:37243.
28. Jiang B, et al. Comparison of metagenomic samples using sequence signatures. BMC Genomics. 2012;13(1):730.
29. Wang Y, et al. MetaCluster 4.0: a novel binning algorithm for NGS reads and huge number of species. J Comput Biol. 2012;19(2):241–9.
30. Wang Y, et al. MetaCluster 5.0: a two-round binning approach for metagenomic data for low-abundance species in a noisy sample. Bioinformatics. 2012;28(18):i356–62.
31. Richter DC, et al. MetaSim—a sequencing simulator for genomics and metagenomics. PLoS One. 2008;3(10):e3373.
32. Zerbino DR, Birney E. Velvet: algorithms for de novo short read assembly using de Bruijn graphs. Genome Res. 2008;18(5):821–9.
33. Mavromatis K, et al. Use of simulated data sets to evaluate the fidelity of metagenomic processing methods. Nat Methods. 2007;4(6):495–500.
34. Hallam SJ, et al. Genomic analysis of the uncultivated marine crenarchaeote Cenarchaeum symbiosum. Proc Natl Acad Sci. 2006;103(48):18296–301.
35. Tyson GW, et al. Community structure and metabolism through reconstruction of microbial genomes from the environment. Nature. 2004;428(6978):37–43.
36. Woyke T, et al. Symbiosis insights through metagenomic analysis of a microbial consortium. Nature. 2006;443(7114):950–5.
37. Tringe SG, et al. Comparative metagenomics of microbial communities. Science. 2005;308(5721):554–7.
38. Sharon I, et al. Time series community genomics analysis reveals rapid shifts in bacterial species, strains, and phage during infant gut colonization. Genome Res. 2013;23(1):111–20.
39. Ijaz U, Quince C. TAXAassign v0.4. 2013. https://github.com/umerijaz/taxaassign.
40. Parks DH, et al. CheckM: assessing the quality of microbial genomes recovered from isolates, single cells, and metagenomes. Genome Res. 2015;25(7):1043–55.
41. Ball GH, Hall DJ. ISODATA, a novel method of data analysis and pattern classification. Menlo Park CA: Stanford research inst; 1965.
42. Wu Y-W, et al. MaxBin: an automated binning method to recover individual genomes from metagenomes using an expectation-maximization algorithm. 2014 13 Apr 2017; Available from: http://downloads.jbei.org/data/microbial_communities/MaxBin/MaxBin.html.

Acknowledgements

Not applicable.

Funding

This research is supported by the National Natural Science Foundation of China (61673324, 61503314), U.S. National Science Foundation grants (DMS-1518001), NIH R01GM120624, China Scholarship Council (201606315011) and Natural Science Foundation of Fujian (2016 J01316). The funding agencies had no role in study design, analysis, interpretation of results, decision to publish, or preparation of the manuscript.

Availability of data and materials

The $ {d}_2^S\mathrm{Bin} $ source codes are available at https://github.com/kunWangkun/d2SBin.

The five synthetic testing datasets were from: http://downloads.jbei.org/data/microbial_communities/MaxBin/MaxBin.html [42].

The real Sharon dataset was from the NCBI short-read archive (SRA052203).

Author information

Authors and Affiliations

Department of Automation, Xiamen University, Xiamen, Fujian, 361005, China
Ying Wang & Kun Wang
Molecular and Computational Biology Program, University of Southern California, Los Angeles, California, CA, 90089, USA
Yang Young Lu & Fengzhu Sun
Center for Computational Systems Biology, Fudan University, Shanghai, 200433, China
Fengzhu Sun

Contributions

YW and FS planned the project; YW developed the model and designed the experiments; KW realized the models and implemented the experiments; KW and YL analyzed the results; YW and FS wrote the main manuscript. All authors read and approved the final manuscript.

Corresponding authors

Correspondence to Ying Wang or Fengzhu Sun.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Additional file

Additional file 1:

Table S1. Numerical values of the three criteria of contig binning for the experiments on the six testing datasets. Table S2. Detailed binning results of the contigs before and after $ {d}_2^S\mathrm{Bin} $ for dataset 10genome-20× based on the five testing tools. (DOCX 38 kb)

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

About this article

Cite this article

Wang, Y., Wang, K., Lu, Y.Y. et al. Improving contig binning of metagenomic data using $ {d}_2^S $ oligonucleotide frequency dissimilarity. BMC Bioinformatics 18, 425 (2017). https://doi.org/10.1186/s12859-017-1835-1

Received: 03 May 2017
Accepted: 11 September 2017
Published: 20 September 2017
DOI: https://doi.org/10.1186/s12859-017-1835-1
