Unsupervised binning of environmental genomic fragments based on an error robust selection of l-mers
© Yang and Chin. 2010
Published: 16 April 2010
Skip to main content
© Yang and Chin. 2010
Published: 16 April 2010
With the rapid development of genome sequencing techniques, traditional research methods based on the isolation and cultivation of microorganisms are being gradually replaced by metagenomics, which is also known as environmental genomics. The first step, which is still a major bottleneck, of metagenomics is the taxonomic characterization of DNA fragments (reads) resulting from sequencing a sample of mixed species. This step is usually referred as “binning”. Existing binning methods are based on supervised or semi-supervised approaches which rely heavily on reference genomes of known microorganisms and phylogenetic marker genes. Due to the limited availability of reference genomes and the bias and instability of marker genes, existing binning methods may not be applicable in many cases.
In this paper, we present an unsupervised binning method based on the distribution of a carefully selected set of l-mers (substrings of length l in DNA fragments). From our experiments, we show that our method can accurately bin DNA fragments with various lengths and relative species abundance ratios without using any reference and training datasets.
Another feature of our method is its error robustness. The binning accuracy decreases by less than 1% when the sequencing error rate increases from 0% to 5%. Note that the typical sequencing error rate of existing commercial sequencing platforms is less than 2%.
We provide a new and effective tool to solve the metagenome binning problem without using any reference datasets or markers information of any known reference genomes (species). The source code of our software tool, the reference genomes of the species for generating the test datasets and the corresponding test datasets are available at http://i.cs.hku.hk/~alse/MetaCluster/.
Microbes are essential for almost every process in the biosphere and for every part of human life in both the positive and negative sense, for example, the production of yoghurt with Lactobacillus and alcohol brewing with yeast as well as the fatal pathogen of pulmonary tuberculosis and cholera. The impact of microbes on humans is not limited to several kinds of species. The complex environment of human life is maintained by microbial communities which are composed of dozens to thousands kinds of individual microbes. The unbalance or abnormal diversity of these microbial communities is proved to be associated with common diseases like pericementitis  and gastrointestinal  disturbance. Understanding how microbial community diversity affects health and disease may contribute to better diagnosis, prevention, and treatment of diseases. During the last centuries, researches on microbes have been based on the isolation, cultivation and purification of individual microorganism from complex communities. But 99% of all microbial diversity in the biosphere, as yet, is uncultivable , because of the highly artificial and limited number of conditions used currently for cultivation. For the uncultivable majority of the microbial communities, rapidly developing genome sequencing techniques can help. Metagenomics, which is also known as the environmental genomics, applies the shotgun sequencing technique to mixed genome samples, obtained directly from an environmental sample or series of related samples, producing high-throughput randomly sampled DNA fragments of these genome samples . Since 2004, several metagenomics sequencing projects have been successfully implemented, such as Acid Mine Drainage Biofilm (AMD) for dozens of species  and the recent Human Gut Microbiome (HGM) for more than thousands of species .
Different from traditional single genome sequencing researches where all DNA fragments are coming from one single species, the metagenomics sequencing dataset contains DNA fragments from different species where most of their genomes are unknown. The data analysis process of metagenomic sequencing (MS) dataset requires an additional analyzing step, called “binning” . The binning step assigns the DNA fragments to the taxonomy tree which provides a general map of the microbe distribution of the mixed sample, basically answering the essential question of metagenomic research: what’s in the mix? Various phylogenetic resolution or taxonomical rank of binning from high level such as kingdom to low level such as genus depends on the research requirements and the quality of the MS dataset.
A number of currently available binning methods fall into two broad categories: sequence similarity-based and sequence composition-based. The first, for example based on BLAST , classifies fragments based on the distribution of BLAST hits of phylogenetic specific marker genes to taxonomic classes . The application of this kind of method is limited due to the limited availability of reference genomes of known microorganisms. As mentioned above, less than 1% of all microorganisms can be cultured and sequenced today.
So more generic features such as structure and composition form the basis of methods developed to distinguish components from mixed sequencing dataset in a supervised or semi-supervised manner. In general, these methods extract the composition features of reference genomes or marker regions (e.g. the widely applied fingerprint gene 16S rRNA , recA and rpoB). Then, a classifier is generated based on different machine learning methods like SVM or SOM with the selected training dataset. For semi-supervised methods, the marker information of the testing datasets is translated to additional constraints during the clustering or classification process. Compared to the sequence similarity-based methods, sequence composition-based methods achieve better performance. But there are still limitations on these approaches; For example, not all species inside the sample carry the known phylogenetic markers. According to several metagenomic projects, such as the enhanced biological phosphorus removing (EBPR) sludge , Sargasso Sea  and the Minnesota soil samples , only 0.17%, 0.06% and 0.017% of the contigs (fragments) respectively are known to carry 16S rRNA markers. Even if we consider other markers such as recA and rpoB, less than 1% of the fragments could be identified. Also, some species may share multiple markers with other species, which leads to incorrect classifications. For example, according to recent reports, multiple kinds of 16S rRNA molecules can exist in a single bacterium . Moreover, the marker gene information is provided by the existing cultivation and isolation techniques with limitation and bias in the selection of specific microorganisms. The training datasets with the bias caused by technical limitation will also introduce bias into the classification and clustering results.
However, the difficulty of applying l-mer distribution lies in the resulting high-dimensional data. The l-mer frequency of each input DNA fragment is usually represented by a feature vector with 4 l components. Consider the palindrome reverse and complement DNA string in the sequencing datasets, the dimension of the l-mer vector could be decreased to about 4 l /2. Details of the generation of the l-mer vectors will be described late in the Methods Section. When l ≥ 4, the dimension of the feature vector is very large (i.e. when l = 4, the dimension of the feature vector is 136) and the clustering problem based on the high dimensional feature vectors becomes difficult. Some researchers suggested using methods like PCA (Principal Components Analysis) etc. to come up with a combination of selected significant l-mers  to lower the dimension. However, we show that some seemingly significant l-mers may be due to noise, and simply applying PCA cannot filter out this noise. Moreover, PCA involves a complicated process and it is not easy to understand the resulting combination of l-mers (in terms of eigenvectors) and to find the biological meaning of the eigenvectors. Thus, selecting an appropriate set of l-mers to decrease the dimension of the datasets poses a difficult problem.
To tackle this problem, we introduce a simple but error-robust method, based on a modified Chebychev Distance, to decrease the dimension of the dataset by remove some l-mers. Our selection of l-mers combined with a simple k-mean clustering algorithm is shown to be effective in the binning process.
Note that our method does not rely on any reference sequence or training dataset, but is only based on the similarity of the l-mer frequencies. Our method could bin the raw DNA fragments to several taxonomy specific clusters with high accuracy and resolution. The other advantage of our method is its robustness with respect to sequencing errors. The binning performance decreases by less than 1% while the sequencing error rate increases from 0% to 5% which is much higher than the typical sequencing error rate of less than 2% for the existing commercial sequencing platforms. We believe that our approach is promising to solve the metagenomics binning problem for short fragments generated by the high-throughput sequencing machines.
The DNA composition features of each DNA fragment are extracted by calculating its l-mer frequencies. There are 4 different nucleotides in a DNA sequence, so there are in total 4 l l-mers. A scan window of length l is slid along each DNA fragment and the frequency of every l-mer, say Ni, i ∈ [1, 4 l ], is recorded. For practice, DNA fragments are of different lengths and they contain different numbers of l-mers, for example, a DNA fragment of length 500bp contains 497 4-mers and a DNA fragment of length 2000bp contains 1997 4-mers. So we cannot compare directly the l-mer frequencies of two DNA fragments of different lengths. We need to apply an extra step to normalize the l-mer frequencies based on the lengths of the DNA fragments. Set the total number of l-mers in a DNA fragment be: , the normalized frequency of each l-mer is . Then the feature vector is defined as with 4 l components. After getting the l-mer frequencies, we need to do some modification to make them applicable for reverse complement strings. As each DNA fragment can be sequenced from either strand of the DNA genome, they should give the same l-mer frequencies. Hence, we can combine the frequency of one l-mer and its reverse complement palindrome l-mer into a single frequency for counting. This process will reduce the size of vector by half, i.e. the size of the vector for l-mer , if l is odd; , if l is even.
Based on some previous studies [22, 23], in order to be effective and to have a reasonable vector size, l is set to 4. So each DNA fragment will be transformed to a vector with 136 components and the input sequencing dataset of FASTA will be transformed to a n × 136 matrix with n rows representing n DNA fragments.
Clustering high-dimensional vectors (representing the DNA fragments) of 136 values is not easy. Because of noise, not all the l-mers (components in the feature vectors) are useful. Based on our experiments and analysis, two kinds of major noises l-mers are identified, the intraspecies noise l-mers (about 20% of the total l-mers) and interspecies noise l-mers (about 60% of the total l-mers). In the following parts of this section, we will introduce how to identify and remove these two kinds of noises.
Even though the high-throughput sequencing technology and the assembly process could provide DNA fragments covering the whole genome, because of the noise in data, even for the same genome of the same species, the l-mer distribution of a general region of the genome may be quite different from the l-mer distribution of a special functional region (such as promoters and exogenous transferred regions). This is called intraspecies noise. Figure 1(B) shows an example of two regions from the same genome. These two regions have quite different l-mer distributions (compared to Figure 1(A)). However, this does not occur very often. Usually randomly picked regions from the same genome should have quite similar l-mer distributions, as shown in Figure 1(A). The following procedure can be used to remove these outliners.
Chebychev Distance examines the differences across all components of the vectors and uses only the maximum difference as the measure. However the extreme values are most likely caused by intraspecies noise. To avoid using these noisy l-mers as measure, we construct the sorted value list of a and b where the values of are sorted in increasing order, i.e. if 4-mers are taken as an example, is the minimum and is the maximum. Then the last 20% of the top values are removed as intraspecies noise. The 20% cut off value is determined based on extensive experiments and how to determine this value should be further investigated, but in general, the results are similar for choosing the cut off values between 15% and 25%.
The other type of noise called interspecies noise is due to similar and statistical unstable l-mers among interspecies. Within the entire 136 4-mers, only a few of them are essential and effective for representing the distance (similarity) between the feature vectors of two DNA fragments. Hence, any distance definitions which consider all the l-mers will introduce noises caused by these “unessential” l-mers, which lead to unsatisfactory results.
A group of experiments were conducted to identify those unessential l-mers to be removed for calculating the distance between the feature vectors of two DNA fragments. Let D i be its corresponding value. Without loss of generality, assume that D 1 is the minimum value and D 136 the maximum (if 4-mers are taken as an example). The experiments were conducted based on the reference genomes of some known species or microorganisms in NCBI genome database. Three DNA fragments, say , and , were picked randomly where and are from the same species and from a different species within a particular taxonomical differentia level. Let , the i th element of the sorted value list of X and Y. The interspecies distance D i (A, C)
Three sets of experiments were conducted to generate the three curves in Figure 3. Each curve represents a particular taxonomical differentia level among DNA fragment and . We tested the differentia level at “Genus” ( and are from the same family but different genuses), “Family” ( and are from the same order but different families) and “> Order” ( and are from same class or higher taxonomical levels but different orders). The y-axis represents the value of P i , which could also be considered as the confidence level for selecting the corresponding D i value as distance for species discrimination. Strictly speaking, the range of D i with the corresponding Pi ≤ 50% (the performance of random choice) should be filtered out. In order to avoid the unstable part of the curves with undulation, we filter out the range from 0% to 60% to achieve more solid and better performance, even though we consider the validity for the datasets with closer taxonomic similarity.
N(l) is the number of l-mers described in Methods Section (subsection of l-mer frequency calculation).
A clustering algorithm is then applied to classify the vectors into suitable taxonomical groups based on the modified Chebychev distance. As mentioned before, the l-mer frequencies feature vectors of DNA fragments from the same species tend to have similar distributions. The l-mers distribution within one species is quite stable. Our observation is that, the l-mer feature vectors from the same taxonomic sub tree tend to be located around the same clustering center. We use the k-mean algorithm to cluster the fragments.
Compute the l-mers feature vector for each fragment.
Randomly select k vectors, each as the center of a group.
Cluster all the vectors to the nearest center.
For each group, calculate the mean of all the vectors to generate a new clustering center.
Repeat step 3 and step 4 many times, say M times until the clustering groups are stable.
Because of the unstable feature of k-mean caused by the random selection of the initial clustering centers, we will run the algorithm several times with different initial clustering centers and choose the best clustering result with minimum objective value.
For binning methods based on the DNA composition features  related, there are four major factors which usually affect binning performance: (1) taxonomic complexity (the number of species in the metagenome dataset), (2) length of the input DNA fragments, (3) sequencing error rate and (4) relative abundance ratio (the ratio of DNA fragments among different species in the metagenome dataset). In our experiment, we used simulated datasets with two to eight species, DNA fragments of length from 500bp to 5,000bp, the sequencing error rate from 1% to 5% and the species relative abundance ratio from the simplest 1:1 to 1:8.
The complete reference genomes of 23 species were downloaded from NCBI genomes database ftp server (http://ftp.ncbi.nih.gov/genomes/). The detailed taxonomic level information of these 23 species was obtained from the NCBI taxonomic database (http://www.ncbi.nlm.nih.gov/taxonomy). They are from three major kingdoms (6 from Archaea, 15 from Bacteria and 2 from Eukaryota).
We implemented our method using C++ (See additional file 1 for the source code and additional file 2 for the user manual.) with two sample testing datasets (additional file 3 and additional file 4) for demonstration under Linux OS environment (Ubuntu 8.10 AMD64 and Debian 5.0 AMD64).
Since our approach is unsupervised, we assume that we do not know any information about the species that exist in the sample. For each dataset, we compute the accuracy as the percentage of fragments from the same species that are classified in the same group. The exact estimated number of species inside the sample is not necessary in our approach. If the selected k value is less than the actual number of species inside the sample, the most similar species will be clustered together into some taxonomic specific groups. If the k value is larger than the actual number of species inside the sample, then some special functional genome regions (such as promoters and exogenous transferred regions) will be binned into some taxonomic homologous specific groups. We tested different k values for some datasets. And the result was robust. In order to give a fair evaluation of our approach, the following experiments and performances are based on the assumption that the number of species is known for each sample.
In the following, we give a general description and some detailed discussion on the performance of our approach by varying the DNA fragment lengths, the relative abundance ratios of the species, and sequencing error rates.
General performance based on different sample complexity and taxonomic similarities
Taxonomic Difference of Species
No. of species in datasets
Higher than Order
This result also indicates that the accuracy improvement by increasing the fragment length will taper off from 2000bp to 5000bp.
The result shows that our method is robust to sequencing errors. Even for the datasets with 5% error, compared with the error free datasets, the accuracy decreases by less than 1%. The error robustness property could be due to the chosen DNA composition features. Take 4-mer as an example: if there is one nucleotide sequencing error, only 4 of the total 4-mers will be affected by this nucleotide sequencing error. So if the sequencing error rate is 1%, for DNA fragments of length 2000bp, the worst case is when the erroneous nucleotides are apart from each other with at least 6 correct nucleotides; hence, no more than 80 erroneous 4-mers are introduced into the 4-mer occurrence frequency calculation. The modified Chebychev distance with sorting and range-picking strategy could effectively remove the effect of these erroneous 4-mers.
In this paper, we have introduced an unsupervised DNA composition feature based metagenomic sequencing data binning algorithm focused on the high accuracy taxonomic clustering of unknown species without any reference or markers. Our approach can filter out the interspecies and intraspecies noise to achieve better binning performance and the results are robust even when there are 5% sequencing errors in the DNA fragments. We will improve our method as follows.
The DNA composition features  of the occurrence frequency of short oligonucleotides have been reported in previous research from 2-mers to 8-mers. The selection of suitable substring length l depends on many factors and it is not sure that large l will give better results. When l is large, say l = 8, there are many (32,768) l-mers and the accumulated background noise is so large that we cannot cluster the DNA fragments well. When l is small, say l = 2, the signal in effective l-mers, l > 2, will be mixed with the noise in background l-mers (l-mers with similar frequency in different genomes) so that we cannot cluster the DNA fragments well. For practice, the performance of our approach performs well when l = 4 to 6 for most datasets. In the future, we will study how to select a suitable l for different datasets according to the error rate, DNA fragment length and the expected similarity of the genomes.
The traditional k-mean clustering algorithm is selected based on the assumption that the distribution of the l-mer feature vectors in the hyper-dimensional space is a sphere. When the abundance ratio of different species is extremely different, density-based clustering methods may perform better. Besides, tree structure taxonomic classification among these groups is also important for metagenomic research. Thus, hierarchal clustering methods should be more suitable. In the future, we will study the performance of different clustering algorithms on the binning problem.
Another important direction of metagenomic research is that we do not try to identify different species in the sample. Instead we can treat the sample as a single species with very complex genome structure and study which functions can be provided by this genome structure. The l-mer frequency may also be suitable for clustering DNA fragments according to their functions.
We thank Ruiqiang Li, Junjie Qin from BGI Shenzhen for their suggestions and comments. This research is supported by Project 30871393 of National Natural Science Foundation of China, Hong Kong GRF grant HKU 7117/09E and the HKU Genomics Strategic Research Theme Matching Fund.
This article has been published as part of BMC Bioinformatics Volume 11 Supplement 2, 2010: Third International Workshop on Data and Text Mining in Bioinformatics (DTMBio) 2009. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/11?issue=S2.
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.