- Methodology article
- Open Access
- Published:

# Unsupervised statistical clustering of environmental shotgun sequences

*BMC Bioinformatics*
**volume 10**, Article number: 316 (2009)

## Abstract

### Background

The development of effective environmental shotgun sequence binning methods remains an ongoing challenge in algorithmic analysis of metagenomic data. While previous methods have focused primarily on supervised learning involving extrinsic data, a first-principles statistical model combined with a self-training fitting method has not yet been developed.

### Results

We derive an unsupervised, maximum-likelihood formalism for clustering short sequences by their taxonomic origin on the basis of their *k*-mer distributions. The formalism is implemented using a Markov Chain Monte Carlo approach in a *k*-mer feature space. We introduce a space transformation that reduces the dimensionality of the feature space and a genomic fragment divergence measure that strongly correlates with the method's performance. Pairwise analysis of over 1000 completely sequenced genomes reveals that the vast majority of genomes have sufficient genomic fragment divergence to be amenable for binning using the present formalism. Using a high-performance implementation, the binner is able to classify fragments as short as 400 nt with accuracy over 90% in simulations of low-complexity communities of 2 to 10 species, given sufficient genomic fragment divergence. The method is available as an open source package called LikelyBin.

### Conclusion

An unsupervised binning method based on statistical signatures of short environmental sequences is a viable stand-alone binning method for low complexity samples. For medium and high complexity samples, we discuss the possibility of combining the current method with other methods as part of an iterative process to enhance the resolving power of sorting reads into taxonomic and/or functional bins.

## Background

Metagenomics, the study of the combined genomes of communities of organisms, is a rapidly expanding area of genome research. The field is driven by environmental shotgun sequencing (ESS), a technique of applying high-throughput genome sequencing to non-clonal DNA purified directly from an environmental sample. This removes the requirement to isolate and cultivate clonal cultures of each species, allowing an unprecedented broad view of microbial communities.

Thus far, environments such as acid mine drainage [1], Scottish soil [2], open ocean [3], termite gut [4], human gut [5], and neanderthal [6] have been sequenced, to name a few. Attention has been directed to bacterial and viral fractions of these communities, with eukaryotic metagenomics pioneered by projects such as the marine protist census [7]. Complexity of these communities varies greatly from 5 to several thousand identifiable bacterial species. These projects have uncovered vast amounts of previously unobserved genetic diversity [8, 9]. For example, "deep sequencing" using 454 pyrosequencing suggests that possibly tens of thousands of species coexist in a single ml of seawater [10].

Given this wealth of genomic data it is becoming possible to make increasingly precise biological inferences regarding the structure and functioning of microbial communities [11–13]. As but one example, the discovery of a novel proteorhodopsin gene was the first step in uncovering a previously unknown, yet apparently dominant, mechanism for phototrophy in the oceans [14]. Characterization of functional diversity is limited by our ability to classify sequences into distinct groups that reflect a desired taxonomic or functional resolution.

Shotgun metagenomic DNA is sequenced in fragments of 50 to 1000 nucleotides, then possibly assembled into longer sequences (contigs). Phylogenetic binning, the task of classifying these sequences into bins by taxonomic origin, then becomes critical to separate metagenomic data into coherent subsets plausibly belonging to separate organisms. This task is challenging due to the short length of available fragments. Bacterial communities of very high complexity, with thousands of species present, further complicate the task.

While methods such as 16S bacterial community censuses [15] and functional- or sequence-based screening surveys are the forerunners of modern metagenomics, indiscriminate whole-genome shotgun sequencing may be the defining approach of the discipline today. This approach has recently generated vast amounts of data, facilitated by continual capacity increases and quality improvements at major sequencing centers and the emergence of cost effective very high throughput Next Generation sequencing (NGS) (454 pyrosequencing [16], Illumina [17] and SOLiD [18]). At the highest diversity levels, the reads may not be assembled at all due to the sparseness of even the highest throughput sequencing methods and the danger of chimeric assemblies, arising from sampling so many organisms at once, leaving the binner with raw reads. Binning methods therefore aim to be able to operate on very short read lengths provided by next-generation sequencing, although most, including the present approach, are only able to go down to 454 pyrosequencing read length (about 400 nt) and not to microread length (30 to 100 nt).

Classic approaches to phylogenetic determination of species identities from environmental sequences rely on identifying variants of highly conserved genes, like 16S rRNA or recA [19]. This approach is not applicable on a full metagenomic scale for two reasons: first, ribosomal or marker gene sequences comprise a small fraction of the bacterial genome, so most shotgun sequences do not contain them and cannot be classified this way; and second, organisms with identical or closely related 16S genes have been shown to exhibit variations in essential physiological functions [20]. Other approaches are broadly divided into sequence similarity based classifiers such as MEGAN [21], which rely on BLAST or other alignments, and sequence composition based classifiers, which rely on statistical patterns of oligonucleotide distributions. Many solutions integrate the task of phylogenetic assignment (labeling) together with that of binning per se (clustering) of genomic fragments. However, with unsupervised methods, like the one presented here, labeling is not possible as part of the algorithm and has to be performed by other means, like analyzing the correspondence of generated clusters to known phylogenies.

Sequence classification based on oligonucleotide distributions has been the basis for gene finding applications since the early 1990s. In 1995, Karlin and Burge [22] noted that dinucleotide distribution is relatively constant within genomes but varies between genomes. Since then, this property has been extensively studied and generalized to other oligonucleotide lengths [23]. With the advent of ESS, several binning methods have used oligonucleotide distributions of various orders to build supervised and semi-supervised classifiers. These include PhyloPythia [24], CompostBin [25], and self-organizing map (SOM) based methods [26–28].

Machine learning-based classification algorithms like those used for binning are categorized into supervised, semi-supervised, and unsupervised classes. Supervised algorithms accept a training set of labeled data used to build their models, which are then applied to the query data. In case of binning, this training set consists of genomic sequences labeled according to the species they originate from. Semi-supervised algorithms use both training set data and query data to build their models. Unsupervised algorithms use no training data and derive their models directly from the query input. While methods described above have achieved considerable success in classifying short anonymous genomic fragments, their supervised nature makes them reliant on previously sequenced data. For example, BLAST-based methods are completely dependent on the presence of sequences related to the query in the database. While semi-supervised clustering methods can have significant generalizing power, their accuracy still depends on similarity of input data to their training set.

To our knowledge, two approaches to unsupervised metagenomic binning have been published. TETRA [29, 30] explores the applications of *k*-mer frequency statistics to metagenomic data. The authors state that their method is suitable as a "fingerprinting technique" for longer DNA fragments, though not as a general-purpose binning method for single-read 454 pyrosequenced or Sanger fragments, and an application of methods including TETRA to binning of fosmid-sized DNA is used in [31]. Abe *et al*. [26] used self-organizing maps (SOM) in combination with principal component analysis (PCA) on 1- and 10-Kb fragments, and this method was evaluated and enhanced in [27] using growing self-organizing maps (GSOM), an extension of SOM, on 8- and 10-kb fragments.

Given the apparent diversity of metagenomic samples and the significant fraction of the full bacterial phylogeny with no sequenced representatives [20, 32], as well as possible undiscovered diversity of the tree of life, binning methods must perform well on previously unseen data. Semi-supervised methods may be able to extrapolate on this data, but if not, unsupervised clustering will be a necessary part of a combined-method binning approach. We present LikelyBin, a new statistical approach to unsupervised classification of metagenomic reads based on an explicit likelihood model of short genomic fragments [33]. The rest of this paper is organized as follows. The Methods section introduces a formal definition of the binning problem, the application of the Markov Chain Monte Carlo (MCMC) formalism, and the feature space and likelihood model used. We discuss numerical methods used in the implementation, including a novel coordinate transformation which achieves dimension reduction for the feature space of *k*-mer frequencies, and the genomic fragment divergence measure *D*_{
n
}, a novel statistical measure we developed for performance evaluation of our algorithm. The Results section presents performance evaluations of our method on mixtures of 2 to 10 species compiled from completed genomes available in GenBank, with fragment lengths starting at 400 nt, as well as accuracy trends over different fragment lengths and mixing ratios. We also present results on the FAMeS [34] dataset and compare the current method to a semi-supervised binning method based on *k*-mer distributions [25]. The Conclusion section explains the applicability of our method, its speed and availability, as well as important future directions for improvement.

## Methods

### The binning problem

We state the problem as follows: given a collection of *N* short sequence reads from *M* complete genomes, how can we predict which sequences derive from the same genome? In our model, we represent a genome as a string of characters deriving from a stochastic model with parameters Θ, referred to here as a master distribution. We make the simplifying assumption that the oligonucleotide distribution is uniform across the bacterial chromosome. This assumption is not satisfied biologically; gene-coding, RNA-coding, and noncoding regions, leading and lagging strands of replication, and genomic islands resulting from horizontal gene transfer can all exhibit distinct oligonucleotide distributions. Accurate classification of these regions in metagenomic fragments is an open problem which requires complex statistical models that we have yet to incorporate into our framework, and which are targets for subsequent model development. Nonetheless we have found that clustering of short reads using the above assumption is sufficiently accurate for use in low complexity metagenome samples.

Given this assumption of statistical homogeneity, we model a collection of sequences from a single genome as realizations of a single stochastic process. Similarly, we model a collection of sequences from multiple genomes as realizations of multiple stochastic processes, one per genome, each with its own master distribution. We are interested in determining which sequences in a metagenomic survey are likely to have been drawn from the same genome and, consequently, the statistical distributions of oligonucleotides within each of the master distributions. If the number of master distributions is unknown, then we must include some prior estimate to close the model. Thus, even in cases where due to insufficient coverage it is impossible to assemble disparate segments of a consensus genome together, a binning algorithm should still be able to group reads together based on their statistical distribution of oligonucleotides.

The simplest model of a genome would be a random collection of letters, A, T, C, and G. The master distribution of a single genome can then be represented as a single probability, *p*_{
A
}, denoting the fraction of A-s in the genome. Base complementarity requires *p*_{
A
}= *p*_{
T
}and *p*_{
C
}= 1/2 - *p*_{
A
}= *p*_{
G
}. A more complex representation would be to assume that genomes are random collections of *k*-mers. When *k* = 1, each nucleotide is independent of the previous. When *k* = 2, the genomes are random collections of dimers and so on. However, when *k* ≥ 2, inherent symmetries are present in this representation since all but the first letters of the current *k*-mer are also contained in the next *k*-mer. In a metagenomic dataset, each short fragment derives from a single master distribution, *θ*_{
i
}, which is represented a fraction *f*_{
i
}of times. How then can we infer the most likely Θ ≡ (*θ*_{1}, *θ*_{2}, ..., *θ*_{
M
}) and *F* ≡ (*f*_{1}, *f*_{2}, ..., *f*_{
M
}) given a set of *N* sequences *S* ≡ (*s*_{1}, *s*_{2}, ..., *s*_{
N
})? To do so, we must calculate the likelihood ℒ(*S*|Θ, *F*) of observing the sequences *S* given the parameters Θ and *F*. Then, we must estimate the values of Θ and *F* that maximize the likelihood ℒ. Below, we demonstrate the use of a MCMC algorithm to perform this task.

### MCMC framework

FIGURE 1 We are interested in finding the values of Θ and *F* that maximize the likelihood, ℒ. The MCMC approach has been described in detail elsewhere [35]. Given an initial parameter setting and a metagenomic data set, we implement the following Metropolis-Hastings algorithm to MCMC maximum likelihood estimation: (i) Determine the likelihood of the dataset ℒ(Θ, *F*|*S*); (ii) Choose some Φ = Θ + dΘ, and *G* = *F* + dF and determine its likelihood, ℒ*'* (Φ, *G*), such that both Φ and *G* exist in the same high-dimensional simplex as Θ and *F* respectively; (iii) Accept the new value given a probability 1 if ℒ*'* (Φ, *G*) > ℒ'(Θ, *F*) and with probability ℒ*'* (Φ, *G*)/ℒ(Θ, *F*) otherwise; (iv) Repeat, and after a burn-in period determine the values and which maximize ℒ(*S*|Θ, *F*). We can then utilize the resulting model of sequence parameters to classify sequences and estimate the most likely oligonucleotide distribution of each of the originating master distributions. The iterative process, together with key stages of the entire binning algorithm, is illustrated in Figure 1. Some technical details necessary for the implementation follow.

#### Likelihood model

Consider a nucleotide sequence *s* = *c*_{1}*c*_{2}*c*_{3}, ..., *c*_{ℓ}. We would like to know the probability of observing such a sequence given some underlying model. We assume that our sequence is selected from broken pieces of double-stranded DNA, and thus that complementary nucleotide sequences have the same probability: i.e., *L*(*s*) = *L*(*s'*), where , and is the nucleotide complementary to the nucleotide *c*_{
i
}. We assume that the probability of our sequence is determined by a set of 2^{k}*k*-mer probabilities .

That is, we write:

Assuming we know probabilities for all of our *k*-mers, we have probabilities for *k* - 1-mers as marginals.

Thus we can write:

As an example, the probability of a sequence given a set of known dimer frequencies is:

Note that we assume the marginal probabilities are well defined: i.e., that we get the same marginal probability if we collapse a *k*-mer to a *k* - 1-mer by summing over the first, or the last, nucleotide. The likelihood of observing *N* sequences given *M* master distributions is

where *P*_{
m
}(*s*_{
i
}) is the probability of generating the *i*-th sequence given the *m*-th master distribution.

A simple example of likelihood computation according to the described model is given in the Appendix.

#### The space of *k*-mer frequencies

Given the assumption of uniformity of the *k*-mer (oligonucleotide) distribution across each genome, we can impose three kinds of constraints on the *k*-mer frequency space. This space is a subspace of , subject to three kinds of constraints: all *k*-mer frequencies sum to 1, e.g.

each *k*-mer has the same frequency as its complement; and all marginal probabilities are consistent over all margins, e.g.

We then derive a transformation of the original *k*-mer frequency vector, *x* = [*p*_{
A
}, *p*_{
T
}, *p*_{
G
}, *p*_{
C
}, *p*_{
AA
}, *p*_{
AT
}, *p*_{
AG
}, *p*_{
AC
}, *p*_{
TA
}, ...], into the independent coordinate space. To generalize and automate the process, we perform it for each case from 1-mers (4 dimensions before removing redundancies) to 5-mers (1364 dimensions before removing redundancies) by generating all equations governing the constraints above. We use the notation [*A*|*b*] to denote the matrices of the constraint equation *Ax* = *b* by generating rows for each constraint type. For example, for *k* = 2, we write the summation, complementarity and marginality constraints as follows:

We find the nullspace of the resulting matrix *A* and use it to perform the transformation. The resulting number of independent dimensions is shown in Table 1. The MCMC simulation then performs the search in the independent coordinate space. For *k* > 6, the matrix *A* becomes too big to compute its nullspace using a non-parallelized algorithm. Even for *k* = 6, the number of independent dimensions is so large that the MCMC simulation takes an intractable amount of time. Therefore, we only generalize our algorithm up to *k* = 5.

#### Initial conditions

The choice of initial conditions can dramatically alter the speed of convergence of a MCMC solver. We used the same initial conditions for comparison of model results, specified by the frequencies of *k*-mers in the entire dataset provided as input (i.e., the weighted average of all sources' contributions to the dataset). Other possibilities, implemented but not chosen as the default, include taking uniformly distributed frequencies, randomizing the starting condition, or using principal components analysis with *K*-means clustering to obtain initial cluster centroids. We verified that convergence, when it did occur, did not depend sensitively on initial conditions (Additional files 1 and 2).

#### Finding the maximum likelihood model

Once the predefined number of timesteps has elapsed, the model with the largest log likelihood is selected. Note that the MCMC framework is amenable to a Bayesian approach, which we implemented as an alternative. Once the equilibrium state has been reached we calculate the autocorrelation of frequencies and estimate a window over which frequencies show no significant autocorrelations. Given a specified prior distribution *p*(Θ, *F*) for the master distribution and frequencies, the Metropolis-Hastings approach will converge to the true posterior distribution of π (Θ, *F*|*S*) ∝ ℒ (*S*|Θ, *F*) *p*(Θ, *F*). In our case we used an uninformed prior distribution so long as positivity and all other specified constraints among *k*-mer probabilities were preserved. We then sample from the equilibrium state to find π (Θ, *F*). Averages of master distributions in the posterior distribution also preserve the constraint conditions because of the linearity of the averaging operator. Accuracy of the model was similar whether using the maximum likelihood model or the average of the posterior distribution (Additional file 3). Full posterior distributions of *k*-mer models could be used to estimate posterior distributions of binning accuracy.

## Numerical details

### Precision

Due to precision limitations of the machine double precision floating point format, the model likelihood calculation is performed in log space. Denote the old model under consideration as **M** = {*M*_{1}, *M*_{2}, ... *M*_{
m
}}, and the new (perturbed) model as . The log likelihood of a single model is

and note that the innermost fraction contains higher-order terms when working with Markov chain orders higher than 2. The innermost product term is a product of on the order of 1000 terms of magnitude ≈ 1/4. However, 1/4^{n}exceeds double floating point precision at *n* ≈ 540. To prevent underflow, we find the *P*_{
m
}(*s*_{
i
}) of highest magnitude and divide the inner sum by it. This allows log space evaluation of the highest magnitude term and ensures that any terms whose precision is lost are at least ≈ 1e300 times smaller. The model log likelihood ratio is then . If this term exceeds 0, the new model is more likely to be observed than the old.

The MCMC iteration loop was implemented with the Metropolis-Hastings criterion. From an initial model, a perturbed model *M*_{
N
}is generated. The new model's probability is evaluated as above and compared to that of the currently selected model *M*_{
C
}. If higher, the new model is selected; otherwise, the new model is selected with probability *p* = exp (log ℒ(*M*_{
N
}|*S*) - log ℒ(*M*_{
C
}|*S*)). The step is repeated *N* times (*N* is fixed at 40000 for the experiments described). Each selected model is stored in a model record for later sampling.

#### Computing the perturbation

The statistical model consists of sub-models for each source. The perturbation step is performed for every sub-model independently. Every sub-model consists of a complete *k*-mer frequency vector, {*p*_{
A
}, *p*_{
T
}, *p*_{
G
}, *p*_{
C
}, *p*_{
AA
}...}. It is perturbed by scaling each vector of the basis matrix *A* by a random number *r*_{
i
}drawn from a Gaussian distribution with mean 0 and constant variance (computed as described below), then adding each scaled vector in succession to the frequency vector. The basis matrix *A* is precomputed for each *k*-mer model order from 2 to 5 and supplied with the program. The computation is performed by generating a system of equations representing the base complementarity, marginal, and summation constraints and using the standard nullspace algorithm supplied with GNU Octave.

The perturbation step variance must be calibrated independently for each dataset. An excessive variance will result in too many suboptimal perturbations as well as perturbations placing the frequency vector outside the unit hypercube (those perturbations are rejected). A variance that is too small can result in an inability to escape local maxima in the model search space and an inability to reach the stationary phase before the pre-determined number of steps is taken. To calibrate the variance, the MCMC iteration is started independently for a reduced number of steps, and different variances ranging from 1*e* - 3 down to 1*e* - 8 are tried. With each trial, the number of new model acceptances is recorded. We consider the fraction . Once the variance yielding *f* closest to 0.234 is found (a heuristic level of acceptances that has become standard [35], p. 504), we use this variance for the main run. Convergence to the stationary phase occurred after 40,000 iterations in all cases of interest.

#### Computing the prediction

To derive the final model prediction, the model with the overall maximum log likelihood is selected. The full MCMC simulation is repeated a selected number of times (to increase performance, the classifier was run in parallel on an 8-core machine; each core was assigned to run one MCMC simulation for a total of 8 restarts). Final model predictions are compared between different runs, and the best overall prediction is selected according to its model likelihood (described above).

The classifier then assigns a putative source to each sequence fragment it was initially queried with. For every fragment, its likelihood according to each sub-model in the final predicted model is computed, and the sub-model supplying the highest likelihood is selected. Since the sources are anonymous, they are referred to simply by indices from 1 to *n* corresponding to each sub-model's index in the final predicted model. Figure 2 illustrates the log likelihood comparison process for all fragments in a given dataset, according to the best model selected as a result of this process.

### Testing methodology

Simulated metagenomic datasets were created by selecting two or more genomic sequences as source DNA. Sequence fragments were selected at random positions within source sequences; overlaps were allowed to occur. Fragment size was fixed for all fragments for each experiment. The total number of fragments per source was selected either according to overall source length or at specified frequency ratios (e.g., 2:1, 10:1:1). The number of sources in each testing dataset was supplied to the classifier.

Accuracy of the classifier is calculated as follows. Every possible matching of source genomic sequence names to classifier output indices is considered, e.g. {*seq* 1 → 1, *seq* 2 → 2}, {*seq* 1 → 2, *seq* 2 → 1}. The number of correct assignments made by the classifier is then counted for each matching and the matching with the highest number of correct assignments is selected. Accuracy is then given as . To evaluate separability of the randomly generated datasets according to the classifier's model, we also define and compute the *genomic fragment divergence* between two sources' *k*-mer distributions. First, we compute the mean, *μ*, and standard deviation, *σ*, of each *k*-mer frequency for each source across fragments originating from that source. The genomic fragment divergence of *k*-mer order *n* is then given by

Generalizing to *M* species, let {*S*} = {*S*_{1}, *S*_{2}, ... *S*_{
m
}}. Then we define.

Figure 3 illustrates the distribution of genomic fragment divergences between completed bacterial genomes. A different formula for intergenomic difference, called the *average absolute dinucleotide relative abundance difference* is [36]: , where . This formula encompasses dinucleotides and pairwise comparisons of entire sequences only, and uses dimer frequency biases instead of absolute frequencies and their deviations in a hierarchical fashion.

## Results and Discussion

The accuracy and applicability of the present method in binning short sequence fragments from low complexity communities (2-10 species) was systematically analyzed using a variety of species, varying fragment lengths, and varying ratios of fragment representation.

First, a set of 1055 completed bacterial chromosomes was retrieved from GenBank. This set was randomly sampled for sets of 2, 3, 5, 10 genomes at a time, representative of various genomic fragment *k*-mer distribution divergences. Binning results for nearly 1800 simulated communities comprised of 2 or 3 genomes at a time are summarized in the top panels of Figure 4. There is a strong positive correlation between genomic fragment divergence and average performance. Classification accuracy was consistently above 85% for fragment divergences when *D*_{3} > 2. Results for Bayesian posterior distribution sampling were not substantially different (Additional file 3).

Accuracy of binning simulated communities of 5-10 species was consistent with the results from 2-3 species communities. The accuracy of binning was strongly positively correlated with genomic fragment divergence with accuracies consistently above 85% for *D*_{3} > 2. Note that accurate binning was possible when fragment length was either *L* = 400 nt or *L* = 800 nt (middle and bottom panels of Figure 4 respectively). For 5 and 10 species, a total of 1815 simulated communities were tested in the *L* = 400 nt case and a total of 425 simulated communities were tested in the *L* = 800 nt case.

Next, we evaluated the robustness of our binning method to changes in fragment length and to changes in fragment ratios using five distinct genome pairs from the preceding experiment (see Table 2). The pairs were selected based on their relatively low genomic fragment divergence, *D*_{3} ≈ 1, given a fragment length of *L* = 400 nt. Binning results on these 2-species tests were evaluated using sequence fragments whose lengths ranged from 40 to 1000 nt. The results are shown in Figure 5. Performance stabilizes close to its optimal value at fragment length 400. Again, results for Bayesian posterior distribution sampling were not substantially different than the maximum likelihood approach (Table 3).

For the same five pairs as in Figure 5, we performed a test of fragment ratio-dependent contributions to accuracy (Figure 6). The binner successfully classifies mixtures with species' fractional content of 20% and above. Although robust to moderate variation in fragment ratios, these results indicate that binning relatively rare species may require modifications to the present likelihood formalism.

We also tested our method using subsets of the JGI FAMeS [34, 37] simulated low-complexity dataset (simLC). We took 5 genomic sources at a time, using 500 fragments, each of length *L* = 400 nt. The accuracy results for binning these simulated low complexity communities are summarized in Table 4. The binning method has approximately 80% accuracy for a five-species community despite the genomic divergence, *D*_{3}, being approximately 1.5 (an indicator of a community with similar *k*-mer distributions).

We also compared our method to CompostBin [25], a semi-supervised algorithm that utilizes a PCA method to bin fragments based on their *k*-mer distributions (Table 5). We performed comparisons on pairs of genomes with fragment divergence *D*_{3} ≈ 1 using the same dataset analyzed in Figures 5, 6 and Table 2. The results indicated that our method performs on par with or better than CompostBin, even though CompostBin required a fraction of input fragments to be labeled to initialize its clustering algorithm. Run time and memory performance was comparable between the two methods.

The algorithm is implemented in portable Perl and C code that can be compiled and run on any platform supporting a Perl interpreter. Both memory use and run time scale linearly with the number of fragments and species, and sub-linearly with fragment length. Memory complexity scales quadratically with the number of dimensions in the search space, or exponentially with *k* (as shown in Table 1). We selected *k* = 3 as the default *k*-mer length, with user-defined options for 2, 4, or 5 available. We have not yet formalized convergence time performance as a function of *k*. In practice, a 3-species dataset of 1000 fragments per species, with *k*-mer order set to 3, takes approximately 2 minutes to run on an Intel Core 2 Duo-class processor.

## Conclusion

We developed an unsupervised, maximum likelihood approach to the binning problem - called LikelyBin. LikelyBin uses a MCMC framework to estimate the set of master distributions and relative frequencies most likely to give rise to an observed collection of short reads. The likelihood approach is based on *k*-mer distributions, for which we developed an index of separability of any pair of genomes, which we termed the genomic fragment divergence measure, *D*_{
n
}. We found that the vast majority of genomes have sufficient divergence to be distinguished using the present method (Figure 3).

Using a high-performance implementation, LikelyBin can be used to cluster sequences with high accuracy (in some cases, > 95%) even when the mononucleotide content of the original genomes is essentially identical (Figure 4). The method does as well or better than a comparable semi-supervised method (CompostBin [25]) that also uses *k*-mer distributions as the statistical basis for binning (Table 5).

Performance of LikelyBin is consistently good for synthesized low-complexity datasets (2-10 species) with fragments of length as low as 400 nt, which corresponds to the characteristic single-read length of a 454 pyrosequencing FLX machine. Microread sequencing technologies such as Solexa and SOLiD are currently out of reach of any non-alignment-based binning method when applied to single reads, which range from 30 to 50 base pairs with these technologies.

The unsupervised nature of our approach makes it potentially useful for classifying mixtures of novel sequences for which supervised learning-based methods may have difficulties. A future direction for our work is to combine our statistical formalism with alignment and supervised composition-based models. For example, we could develop a feature selection framework that would transform the input fragments' features such as *k*-mer statistics, coding frame information, and variable-length motifs into a lower-dimensional space. We could then feed these features to an unsupervised MCMC-based classifier in tandem with an alignment-based classifier that can partially label fragments based on known taxonomic information, then compare and combine their results.

A number of challenges remain to broaden the scope and applicability of the current method. At present, our method is scalable for *k*-mer length from *k* = 2 to *k* = 5. We intend to expand the method's ability to capture longer motif frequencies by using dimension transformation or feature selection in a future work. Intra-genomic heterogeneity of oligonucleotide distributions is another topic that is yet to be addressed. A confidence measure that serves as a performance self-check is already available as part of our method but we have not incorporated it into the program's output yet.

Further, applying the current method in an environmental context requires an estimation of the number of bins. The problem of identifying the necessary number of distinct models, or groups thereof, to represent all components of a given genome, is related to the problem of identifying the number of distinct genomes in the mixture. A combination of jump diffusion and grouped models is our currently planned solution. In this respect, the use of phylogenetic markers to estimate the number of bins will provide important prior information.

In summary, the unsupervised method we proposed is based on a maximum likelihood formalism and can bin short fragments (*L* = 400 nt) of low complexity communities (2-10 species) with high accuracy (in some cases, > 95%) given sufficient genomic divergence. The maximum likelihood formalism and its MCMC implementation make the current approach amenable to extension and incorporation into other packages. The MCMC binner application is provided as an open-source downloadable package, LikelyBin [33], that can be installed on any platform that supports Perl and C and is fully automated to facilitiate use in genome processing pipelines. Version 0.1 of the source code is provided in Additional files 4.

## Appendix

### Example application of likelihood model

Suppose we have two source genomes, *G*_{1} and *G*_{2}, with two fragments from each: *G*_{1} → {ATGTTA, TGTAAT}, *G*_{2} → {CCTGTC, AGGCCTC}. We wish to evaluate the likelihood of observing these sequences according to a dimer model of 2 sources, *M* = {*S*_{1}, *S*_{2}}, which we have generated. Assume the model's source frequency vector is *F* = [0.6, 0.4], its monomer frequencies are

{*S*_{1} : {*p*_{
A
}= 0.3, *p*_{
T
}= 0.3, *p*_{
G
}= 0.2, *p*_{
C
}= 0.2}, *S*_{2} : {*p*_{
A
}= 0.2, *p*_{
T
}= 0.2, *p*_{
G
}= 0.3, *p*_{
C
}= 0.3}} and its dimer frequencies are

*S*_{1} : {*p*_{
AA
}= 0.09, *p*_{
AT
}= 0.09, *p*_{
AG
}= 0.06, *p*_{
AC
}= 0.06, *p*_{
TA
}= 0.07, *p*_{
TT
}= 0.09, *p*_{
TG
}= 0.06, *p*_{
TC
}= 0.08 *p*_{
GA
}= 0.08, *p*_{
GT
}= 0.06, *p*_{
GG
}= 0.04, *p*_{
GC
}= 0.02*p*_{
CA
}= 0.06, *p*_{
CT
}= 0.06, *p*_{
CG
}= 0.04, *p*_{
CC
}= 0.04}, *S*_{2} : {*p*_{
AA
}= 0.02, *p*_{
AT
}= 0.04, *p*_{
AG
}= 0.08, *p*_{
AC
}= 0.06, *p*_{
TA
}= 0.04, *p*_{
TT
}= 0.02, *p*_{
TG
}= 0.06, *p*_{
TC
}= 0.08, *p*_{
GA
}= 0.08, *p*_{
GT
}= 0.06, *p*_{
GG
}= 0.07, *p*_{
GC
}= 0.09*p*_{
CA
}= 0.06, *p*_{
CT
}= 0.08, *p*_{
CG
}= 0.09, *p*_{
CC
}= 0.07}}

Then the likelihoods of observing the first fragment, ATGTTA, given master distributions *S*_{1} and *S*_{2}, respectively, are

where superscripts *S*_{1} and *S*_{2} denote the master distribution. Similarly,

The overall posterior likelihood of the model is then

## References

- 1.
Tyson GW, Chapman J, Hugenholtz P, Allen EE, Ram RJ, Richardson PM, Solovyev VV, Rubin EM, Rokhsar DS, Banfield JF: Community structure and metabolism through reconstruction of microbial genomes from the environment.

*Nature*2004, 428(6978):37–43. 10.1038/nature02340 - 2.
Tringe SG, von Mering C, Kobayashi A, Salamov AA, Chen K, Chang HW, Podar M, Short JM, Mathur EJ, Detter JC, Bork P, Hugenholtz P, Rubin EM: Comparative Metagenomics of Microbial Communities.

*Science*2005, 308(5721):554–557. 10.1126/science.1107851 - 3.
Rusch DB, Halpern AL, Sutton G, Heidelberg KB, Williamson S, Yooseph S, Wu D, Eisen JA, Hoffman JM, Remington K, Beeson K, Tran B, Smith H, Baden-Tillson H, Stewart C, Thorpe J, Freeman J, Andrews-Pfannkoch C, Venter JE, Li K, Kravitz S, Heidelberg JF, Utterback T, Rogers YH, Falcón LI, Souza V, Bonilla-Rosso G, Eguiarte LE, Karl DM, Sathyendranath S, Platt T, Bermingham E, Gallardo V, Tamayo-Castillo G, Ferrari MR, Strausberg RL, Nealson K, Friedman R, Frazier M, Venter CJ: The Sorcerer II Global Ocean Sampling Expedition: Northwest Atlantic through Eastern Tropical Pacific.

*PLoS Biology*2007, 5(3):e77. 10.1371/journal.pbio.0050077 - 4.
Warnecke F, Luginbühl P, Ivanova N, Ghassemian M, Richardson TH, Stege JT, Cayouette M, Mchardy AC, Djordjevic G, Aboushadi N, Sorek R, Tringe SG, Podar M, Martin HG, Kunin V, Dalevi D, Madejska J, Kirton E, Platt D, Szeto E, Salamov A, Barry K, Mikhailova N, Kyrpides NC, Matson EG, Ottesen EA, Zhang X, Hernández M, Murillo C, Acosta LG, Rigoutsos I, Tamayo G, Green BD, Chang C, Rubin EM, Mathur EJ, Robertson DE, Hugenholtz P, Leadbetter JR: Metagenomic and functional analysis of hindgut microbiota of a wood-feeding higher termite.

*Nature*2007, 450(7169):560–565. 10.1038/nature06269 - 5.
Gill SR, Pop M, Deboy RT, Eckburg PB, Turnbaugh PJ, Samuel BS, Gordon JI, Relman DA, Fraser-Liggett CM, Nelson KE: Metagenomic Analysis of the Human Distal Gut Microbiome.

*Science*2006, 312(5778):1355–1359. 10.1126/science.1124234 - 6.
Noonan JP, Coop G, Kudaravalli S, Smith D, Krause J, Alessi J, Chen F, Platt D, Pääbo S, Pritchard JK, Rubin EM: Sequencing and analysis of Neanderthal genomic DNA.

*Science*2006, 314(5802):1113–1118. 10.1126/science.1131412 - 7.
Not F, Gausling R, Azam F, Heidelberg JF, Worden AZ: Vertical distribution of picoeukaryotic diversity in the Sargasso Sea.

*Environmental Microbiology*2007, 9(5):1233–1252. 10.1111/j.1462-2920.2007.01247.x - 8.
Angly FE, Felts B, Breitbart M, Salamon P, Edwards RA, Carlson C, Chan AM, Haynes M, Kelley S, Liu H, Mahaffy JM, Mueller JE, Nulton J, Olson R, Parsons R, Rayhawk S, Suttle CA, Rohwer F: The marine viromes of four oceanic regions.

*PLoS Biol*2006., 4(11): 10.1371/journal.pbio.0040368 - 9.
Huse SM, Dethlefsen L, Huber JA, Welch DM, Relman DA, Sogin ML: Exploring Microbial Diversity and Taxonomy Using SSU rRNA Hypervariable Tag Sequencing.

*PLoS Genet*2008, 4(11):e1000255. 10.1371/journal.pgen.1000255 - 10.
Sogin ML, Morrison HG, Huber JA, Welch DM, Huse SM, Neal PR, Arrieta JM, Herndl GJ: Microbial diversity in the deep sea and the underexplored " rare biosphere".

*Proceedings of the National Academy of Sciences*2006, 103(32):12115–12120. 10.1073/pnas.0605127103 - 11.
Handelsman J: Metagenomics: application of genomics to uncultured microorganisms.

*Microbiol Mol Biol Rev*2004, 68(4):669–685. 10.1128/MMBR.68.4.669-685.2004 - 12.
Yooseph S, Sutton G, Rusch DB, Halpern AL, Williamson SJ, Remington K, Eisen JA, Heidelberg KB, Manning G, Li W, Jaroszewski L, Cieplak P, Miller CS, Li H, Mashiyama STT, Joachimiak MP, van Belle C, Chandonia JM, Soergel DA, Zhai Y, Natarajan K, Lee S, Raphael BJ, Bafna V, Friedman R, Brenner SE, Godzik A, Eisenberg D, Dixon JE, Taylor SS, Strausberg RL, Frazier M, Venter JC: The Sorcerer II Global Ocean Sampling Expedition: Expanding the Universe of Protein Families.

*PLoS Biol*2007., 5(3): 10.1371/journal.pbio.0050016 - 13.
Dinsdale EA, Edwards RA, Hall D, Angly F, Breitbart M, Brulc JM, Furlan M, Desnues C, Haynes M, Li L, McDaniel L, Moran MAA, Nelson KE, Nilsson C, Olson R, Paul J, Brito BRR, Ruan Y, Swan BK, Stevens R, Valentine DL, Thurber RVV, Wegley L, White BA, Rohwer F: Functional metagenomic profiling of nine biomes.

*Nature*2008, 452(7187):629–632. 10.1038/nature06810 - 14.
Béjà O, Spudich EN, Spudich JL, Leclerc M, DeLong EF: Proteorhodopsin phototrophy in the ocean.

*Nature*2001, 411(6839):786–789. 10.1038/35081051 - 15.
Muyzer G, de Waal EC, Uitterlinden AG: Profiling of complex microbial populations by denaturing gradient gel electrophoresis analysis of polymerase chain reaction-amplified genes coding for 16S rRNA.

*Appl Environ Microbiol*1993, 59(3):695–700. - 16.
Margulies M, Egholm M, Altman WE, Attiya S, Bader JS, Bemben LA, Berka J, Braverman MS, Chen YJ, Chen Z, Dewell SB, Du L, Fierro JM, Gomes XV, Godwin BC, He W, Helgesen S, Ho CH, Irzyk GP, Jando SC, Alenquer MLI, Jarvie TP, Jirage KB, Kim JB, Knight JR, Lanza JR, Leamon JH, Lefkowitz SM, Lei M, Li J, Lohman KL, Lu H, Makhijani VB, Mcdade KE, Mckenna MP, Myers EW, Nickerson E, Nobile JR, Plant R, Puc BP, Ronan MT, Roth GT, Sarkis GJ, Simons JF, Simpson JW, Srinivasan M, Tartaro KR, Tomasz A, Vogt KA, Volkmer GA, Wang SH, Wang Y, Weiner MP, Yu P, Begley RF, Rothberg JM: Genome sequencing in microfabricated high-density picolitre reactors.

*Nature*2005, 437(7057):376–380. - 17.
Bentley DR: Whole-genome re-sequencing.

*Current Opinion in Genetics & Development*2006, 16(6):545–552. 10.1016/j.gde.2006.10.009 - 18.
Shendure J, Porreca GJ, Reppas NB, Lin X, Mccutcheon JP, Rosenbaum AM, Wang MD, Zhang K, Mitra RD, Church GM: Accurate Multiplex Polony Sequencing of an Evolved Bacterial Genome.

*Science*2005, 309(5741):1728–1732. 10.1126/science.1117389 - 19.
Lane DJ, Pace B, Olsen GJ, Stahl DA, Sogin ML, Pace NR: Rapid determination of 16S ribosomal RNA sequences for phylogenetic analyses.

*Proceedings of the National Academy of Sciences*1985, 82(20):6955–6959. 10.1073/pnas.82.20.6955 - 20.
Ward BB: How many species of prokaryotes are there?

*Proc Natl Acad Sci USA*2002, 99(16):10234–10236. 10.1073/pnas.162359199 - 21.
Huson DH, Auch AF, Qi J, Schuster SC: MEGAN analysis of metagenomic data.

*Genome Res*2007, 17(3):377–386. 10.1101/gr.5969107 - 22.
Kariin S, Burge C: Dinucleotide relative abundance extremes: a genomic signature.

*Trends in Genetics*1995, 11(7):283–290. 10.1016/S0168-9525(00)89076-9 - 23.
Deschavanne PJ, Giron A, Vilain J, Fagot G, Fertil B: Genomic signature: characterization and classification of species assessed by chaos game representation of sequences.

*Mol Biol Evol*1999, 16(10):1391–1399. - 24.
Mchardy AC, Martín HG, Tsirigos A, Hugenholtz P, Rigoutsos I: Accurate phylogenetic classification of variable-length DNA fragments.

*Nature Methods*2006, 4: 63–72. 10.1038/nmeth976 - 25.
Chatterji S, Yamazaki I, Bai Z, Eisen J: CompostBin: A DNA composition-based algorithm for binning environmental shotgun reads. In

*Research in Computational Molecular Biology, 12th Annual International Conference, RECOMB 2008, Singapore, March 30 - April 2, 2008. Proceedings, Lecture Notes in Computer Science*.*Volume 4955*. Springer; 2008. - 26.
Abe T, Kanaya S, Kinouchi M, Ichiba Y, Kozuki T, Ikemura T: Informatics for unveiling hidden genome signatures.

*Genome research*2003, 13(4):693–702. 10.1101/gr.634603 - 27.
Chan CK, Hsu AL, Tang SL, Halgamuge SK: Using growing self-organising maps to improve the binning process in environmental whole-genome shotgun sequencing.

*Journal of biomedicine & biotechnology*2008, 2008: 513701. - 28.
Chan CKK, Hsu AL, Halgamuge SK, Tang SL: Binning sequences using very sparse labels within a metagenome.

*BMC Bioinformatics*2008, 9: 215. 10.1186/1471-2105-9-215 - 29.
Teeling H, Meyerdierks A, Bauer M, Amann R, Glöckner FO: Application of tetranucleotide frequencies for the assignment of genomic fragments.

*Environ Microbiol*2004, 6(9):938–947. 10.1111/j.1462-2920.2004.00624.x - 30.
Teeling H, Waldmann J, Lombardot T, Bauer M, Glöckner FO: TETRA: a web-service and a stand-alone program for the analysis and comparison of tetranucleotide usage patterns in DNA sequences.

*BMC Bioinformatics*2004, 5: 163. 10.1186/1471-2105-5-163 - 31.
Woyke T, Teeling H, Ivanova NN, Huntemann M, Richter M, Gloeckner FO, Boffelli D, Anderson IJ, Barry KW, Shapiro HJ, Szeto E, Kyrpides NC, Mussmann M, Amann R, Bergin C, Ruehland C, Rubin EM, Dubilier N: Symbiosis insights through metagenomic analysis of a microbial consortium.

*Nature*2006, 443(7114):950–955. 10.1038/nature05192 - 32.
A Genomic Encyclopedia of Bacteria and Archaea (GEBA)[http://www.jgi.doe.gov/programs/GEBA/index.html]

- 33.
LikelyBin webpage[http://ecotheory.biology.gatech.edu/likelybin]

- 34.
Mavromatis K, Ivanova N, Barry K, Shapiro H, Goltsman E, McHardy AC, Rigoutsos I, Salamov A, Korzeniewski F, Land M, Lapidus A, Grigoriev I, Richardson P, Hugenholtz P, Kyrpides NC: Use of simulated data sets to evaluate the fidelity of metagenomic processing methods.

*Nature methods*2007, 4(6):495–500. 10.1038/nmeth1043 - 35.
Sorensen D, Gianola D:

*Likelihood, Bayesian and MCMC Methods in Quantitative Genetics*. Springer; 2007. - 36.
Campbell A, Mrázek J, Karlin S: Genome signature comparisons among prokaryote, plasmid, and mitochondrial DNA.

*Proceedings of the National Academy of Sciences of the United States of America*1999, 96(16):9184–9189. 10.1073/pnas.96.16.9184 - 37.
FAMeS: Fidelity of Analysis of Metagenomic Samples[http://fames.jgi-psf.org/]

## Acknowledgements

We are pleased to acknowledge the support of the Defense Advanced Research Projects Agency under grant HR0011-05-1-0057. Joshua S. Weitz, Ph.D., holds a Career Award at the Scientific Interface from the Burroughs Wellcome Fund. The authors would like to thank Jonathan Eisen for many inspiring discussions. The authors would also like to thank Amol Shetty, Michael Raghib-Moreno, Sourav Chatterji, Luca Giuggoli, and Simon Levin for their suggestions on a preliminary version of the present model, and thank three anonymous reviewers for their helpful suggestions on the paper. The authors are grateful to Sourav Chatterji, Jonathan Eisen, and Ichitaro Yamazaki for their help in the utilization of CompostBin.

## Author information

## Additional information

### Authors' contributions

AK developed code, performed all statistical analyses, and wrote the manuscript. SB developed code and performed preliminary statistical analyses. JD developed the mathematical method and supervised the statistical analysis. JSW developed the mathematical method, supervised the computational and statistical analysis, and wrote the manuscript. All authors read and approved the final manuscript.

## Electronic supplementary material

### Additional file 1: **Convergence dynamics**. Figure 1: Convergence dynamics for good accuracy, *Mycoplasma capricolum subsp. capricolum ATCC 27343* vs. *Campylobacter jejuni subsp. jejuni 81-176* (*D*_{3} = 2.8). A single MCMC simulation was completed for this pair of genomes as described in Methods. *k*-mer order 3 model was used with 30000 steps, and expected nucleotide frequencies in accepted models were plotted over time for all independent mono- and dinucleotides in the model. Two starting conditions were compared: uniform initial frequencies (solid line) and frequencies at dataset mean (dashed line). Dotted lines indicate true average frequencies in the constituent species' fragment datasets. Convergence was observed to be substantially the same, demonstrating robustness of the algorithm to initial starting conditions. Final model accuracy was ≈ 95% in both cases. (PDF 156 KB)

### Additional file 2: **Convergence dynamics**. Figure 2: Convergence dynamics for poor accuracy, *Granulibacter bethesdensis CGDNIH1* vs. *Gluconobacter oxydans 621H* (*D*_{3} = 0.45). Details are identical to Additional file 1, but final model accuracy was ≈ 60% in both cases. (PDF 170 KB)

### Additional file 3: **Accuracy-divergence dependencies for Bayesian sampling**. Figure 3: Pairs and triples of genomes were sampled randomly from a set of 1055 completed bacterial chromosomes, and experiments were conducted using Bayesian posterior distribution sampling on the stationary distribution of the MCMC simulation. The results were found to not be significantly different from those for maximum likelihood sampling (Figure 4). (PDF 139 KB)

## Authors’ original submitted files for images

Below are the links to the authors’ original submitted files for images.

## Rights and permissions

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

## About this article

### Cite this article

Kislyuk, A., Bhatnagar, S., Dushoff, J. *et al.* Unsupervised statistical clustering of environmental shotgun sequences.
*BMC Bioinformatics* **10, **316 (2009) doi:10.1186/1471-2105-10-316

#### Received

#### Accepted

#### Published

#### DOI

### Keywords

- Markov Chain Monte Carlo
- Markov Chain Monte Carlo Simulation
- Metagenomic Dataset
- Markov Chain Monte Carlo Iteration
- Simulated Community