Barcodes for genomes and applications
© Zhou et al. 2008
Received: 16 June 2008
Accepted: 17 December 2008
Published: 17 December 2008
Skip to main content
© Zhou et al. 2008
Received: 16 June 2008
Accepted: 17 December 2008
Published: 17 December 2008
Each genome has a stable distribution of the combined frequency for each k-mer and its reverse complement measured in sequence fragments as short as 1000 bps across the whole genome, for 1<k<6. The collection of these k-mer frequency distributions is unique to each genome and termed the genome's barcode.
We found that for each genome, the majority of its short sequence fragments have highly similar barcodes while sequence fragments with different barcodes typically correspond to genes that are horizontally transferred or highly expressed. This observation has led to new and more effective ways for addressing two challenging problems: metagenome binning problem and identification of horizontally transferred genes. Our barcode-based metagenome binning algorithm substantially improves the state of the art in terms of both binning accuracies and the scope of applicability. Other attractive properties of genomes barcodes include (a) the barcodes have different and identifiable characteristics for different classes of genomes like prokaryotes, eukaryotes, mitochondria and plastids, and (b) barcodes similarities are generally proportional to the genomes' phylogenetic closeness.
These and other properties of genomes barcodes make them a new and effective tool for studying numerous genome and metagenome analysis problems.
The challenges being faced in sorting out short genomic fragments generated by metagenome sequencing projects  pose a fundamental question: "does each genome have a unique signature imprinted on its short sequence fragments so that fragments from the same genomes in a metagenome can be identified accurately?" A positive answer to this question could have significant implications to many important genome and metagenome analysis problems such as identification of genetic material transferred from other organisms  or through virus invasions [3, 4], separation of short sequence fragments generated by metagenome sequencing into individual genomes  and phylogenetic analyses of genomes .
Understanding the intrinsic properties of genome sequences, either general to all or specific to some classes of genomes, has been the focus of many studies in the past two decades. Earlier work includes the discovery of the periodicity property of DNA sequences across both prokaryotic and eukaryotic genomes  and the realization that coding sequences follow Markov chain properties [8–10]. Karlin and colleagues have studied various genome properties based on analyses of k-mer frequency distributions, and have observed that the di-nucleotide relative abundance, a normalized di-mer frequency with respect to the mono-mer frequencies, is generally stable across a genome measured on 50 K base-pair (bp) fragments [11–13]. They even suggested that such normalized di-mer frequency distributions can possibly serve as signatures of genomes.
In this paper, we present a barcoding scheme for all sequenced genomes, and illustrate a number of interesting and useful properties of the barcodes, which we can take advantage to solve challenging genome analysis problems. We highlight the power of this barcoding scheme through addressing two application problems: metagenome binning problem and identification of horizontally transferred genes.
We have calculated the barcode for each sequenced prokaryotic genome, using the following procedure. For each genome, we partition its sequence into a series of non-overlapping and equal-sized fragments of M bps; then for each k-mer (1 <k < 6 in this study), we calculate the combined frequency of the k-mer and its reverse complement within each partitioned fragment. The barcode for each genome is a matrix of N(k) columns and genome_length/M rows, with each element representing the frequency of the corresponding k-mer within the corresponding sequence fragment, where N(k) is the number of unique combined k-mers. Note that N(k) = 4 k /2 or (4k + 4k/2)/2, depending on whether k is odd or even. For example, N(4) = 136. The portion of the barcode corresponding to a fragment in a genome is called the fragment's barcode. In this paper, barcodes are calculated using M = 1000 and k = 4 unless stated otherwise. A discussion on our choices of the M and k values is given in Additional file 1, where we can also see that the above "equal-sized" requirement is not necessary.
In addition to prokaryotes, we have also calculated the barcodes for the other classes of sequenced genomes, namely eukaryotic, mitochondrial, plastid and plasmid genomes. For eukaryotes, we studied the barcodes for four key components in eukaryotic genomes, namely the (composite) regions of repetitive sequences, promoter sequences (the 1000-bp upstream region from each translation start), coding regions and introns, respectively (Figure 3(b)–(e)). We observed that (i) different regions in a high-level eukaryotic genome (e.g., human) have similar "backbone" structures in their barcodes, and (ii) the barcodes for the four types of regions have increasingly higher complexity, going from repetitive sequences to coding regions to introns and promoter sequences. This is consistent with the belief that introns and promoter sequences are probably the most information rich among the four because of the possibly large numbers of regulatory elements they encode.
The barcodes of the mitochondrial genomes are generally unique compared to the barcodes of the other genomes as they have a distinct overall appearance (e.g., Figure 3(h)–(i)). Their similar appearance may be the result of all mitochondria originating from Proteobacteria. The barcodes of all plasmid genomes also tend to have similar characteristics among themselves, possibly due to being under similar selection pressure caused by their frequent transferring among cell cultures. The barcodes of all the plastid genomes are also generally unique compared to the barcodes of the others (e.g., Figure 3(j)–(k)). For example, a majority of them each consist of two dark horizontal bends toward one end in their barcodes along the genome axis, whose corresponding genomic regions consist of RNA genes such as ribosomal RNAs and tRNAs, plus ribosomal proteins. The fuzzier appearance of the plastid barcodes indicates that their k-mer frequencies along the genome axis are not as stable as in the other genomes. The overall similar appearances of the plastid barcodes may be due to all originating from the Cyanobacteria.
Our procedure for identifying sequence fragments with abnormal barcodes in a genome employs a clustering strategy to divide all the sequence fragments in a genome into two groups: (a) a large group of fragments with their barcodes all similar to each other and (b) the rest (see METHODS section).
Using this procedure, we have identified 30,582 abnormal fragments, covering 30,889 genes across all the complete prokaryotic genomes. Specifically 28,460 such fragments are identified in the 542 bacterial genomes, covering 28,562 genes, and 2,122 such fragments are identified in the 46 archaeal genomes, covering 2,327 genes. We found that the percentage of fragments with abnormal barcodes ranges from 9.40% to 32.32% across all the bacterial genomes, with the average being 12.85%. Among the 46 sequenced archaeal genomes, the percentage of fragments with abnormal barcodes ranges from 9.86% to 23.14%, with the average being 13.58%. Further information can be found from Additional file 1. The detailed frequency information for abnormal fragments across different genomes is in Additional file 2.
While we found that it is generally more challenging to study the abnormal fragments in eukaryotes, we did apply the same procedure to different human chromosomes, and found that the percentage of abnormal fragments ranges from 10.08% to 31.32%, with the average being 12.10%.
We have analyzed the abnormal fragments across the prokaryotic genomes, and found the following: ~30% of the abnormal fragments can be explained in terms of (a) horizontal gene transfers, (b) phage invasions and (c) highly expressed genes, based on PHX-PA [16, 17] and Prophinder , respectively. Among the genes that fall into this 30%, 6.99% are horizontally transferred genes, 4.97% bacteriophage genes and 18.90% highly expressed genes, based on the above two prediction programs – note that these numbers do not add up to exactly 30% since there are overlaps among them. The genes falling into different categories are given in Additional file 3. We have carried out an enrichment analysis of such elements in regions with abnormal versus normal barcodes. We found that the highly expressed genes are enriched in the abnormal fragments, with the enrichment ratio > 1 across all the genomes and the average enrichment ratio being 1.90. Similar results hold for the horizontally transferred genes and bacteriophage genes. All the detailed data can be found in Additional file 2
We noted that our estimate of the percentages of "foreign fragments" in bacterial genomes (after deducting the "highly expressed genes") is in general agreement with the previous estimates though different information and techniques are used to derive the estimates .
We do not yet have an explanation for the remaining ~70% of abnormal fragments in prokaryotic genomes, although we suspect that they mostly fall into the same three categories – one reason that we could not explain them now is possibly due to the limited coverage of the current databases for horizontally transferred genes, bacteriophage genomes and highly expressed genes. We believe that by using more sophisticated computational procedures, one may be able to derive the level of abnormality of a fragment's barcode in a genome, and possibly link such information to when such fragments were horizontally transferred .
The ability to sequence a microbial community has led to the sequencing of at least 7.04 Giga bps of metagenome sequences, already 2.22 times the total complete genome sequences accumulated in the past two decades . These metagenome sequences have opened many doors to new research possibilities, and have posed some challenging problems. One such problem is determining which fragments are from the same organisms in a large pool of metagenomic fragments , typically ~1000 bps in lengths after the initial assembly using the Sanger sequencing techniques.
We have applied a clustering algorithm (see METHODS section) for binning sequence fragments together based on their barcode similarities, and tested the clustering strategy on three sets of simulated metagenome data created by cutting actual bacterial genomes into fragments and mixing them together. The three test sets consist of all sequence fragments from three sets of genomes, respectively, extracted from the GenBank. The first set consists of 11 genomes randomly selected from the same genus but from 11 different species (the genus has only 11 sequenced species) while the last two sets each consist of 30 and 100 genomes randomly selected from 30 and 100 different bacterial genera, respectively. The genome names are given in Additional file 4.
Binning accuracies of our barcode-based clustering algorithm.
FS = 500 bps
FS = 1000 bps
FS = 2000 bps
FS = 5000 bps
FS = 10000 bps
From the table, we can see that the binning accuracy (into the correct genomes) is high for fragment size M = 1000 and above, at both the species and the genus level. From the table, we see that there is a drop in the binning accuracy when the number of the underlying genomes is increased from 30 to 100. This indicates the increased complexity of the problem as a function of the number of underlying genomes.
We have compared our binning performance with the published results by the best available algorithm PhyloPythia . Our comparison indicates that our algorithm gives consistently more accurate and more specific binning results across different fragment sizes. For example, at the species level, our algorithm has better than 50% accuracy on our test set when the fragment size is at least 2000 bps while no binning results at the species level is given in McHardy et al. . At the genus level, the accuracy (the average of binning specificity and sensitivity, extracted from Figure 1(a) and 1(b) in McHardy et al. ) by PhyloPythia is 45.5% for fragment size 1000, 56% for 2000, 74% for 5000 and 82.5% for 10000 (no data is provided for 500), all measured in terms of putting fragments into the correct genera while ours is to the correct genomes and with more accurate binning results. It should be noted that the test set used by PhyloPythia is different than ours, which may affect the performance statistics somewhat though we suspect that will be insignificant, considering the sizes of the test sets. Another key difference between the two algorithms is that while PhyloPythia is a supervised learning algorithm, which requires a training set, our algorithm does not require a training set, and hence it is more general.
One thing worth noting is that a prokaryotic genome, on average, has ~13–14% of abnormal sequence fragments, when the fragment size is M = 1,000, suggesting that the theoretical limit for binning accuracy should be no better than 86–87%. Similarly we expect that the theoretical limits of binning accuracy for 2,000, 5,000 and 10,000 fragments-based binning should, in general, be no better than 87.36%, 87.58% and 88.4%, respectively.
The barcode analyses in this paper are mainly based on data from prokaryotes. Though we have applied the same barcode model to eukaryotes and made interesting observations, we suspect that the current barcoding scheme is rich enough to capture all the complexity of eukaryotes. Further studies along this direction are clearly needed.
We believe that for many genome analysis problems, particularly for prokaryotic genomes, the barcodes provide a natural, intuitive, information-rich and unified framework for studying them. Further applications of this capability to numerous genome analysis problems can be envisioned, such as phylogeny studies, particularly for genomes without obvious marker genes such as viruses, more thorough examination of different types of genomic regions in eukaryotes, their structures and organization, further studies of horizontal gene transfers, assisting in genome assembly of higher-order organisms (e.g., populus which we are currently working on) and possibly many more. We believe that we have only begun exploring the true power of this new capability for genome studies.
The frequency of each k-mer is mapped to a grey level as follows. We first count the frequency of each k-mer across all prokaryotic genomes, and sort the frequency list S [1:N(k)] in the increasing order of the frequencies with N(k) being the number of k-mers. We then find an integer L, L > 0, and partition S [1:N(k)] into L sub-lists so that the following function is minimized: , where S i is the sum of all frequencies in the i th sub-list, is the average of S, and L is a parameter to be determined by the minimization result. For M = 1000 and k = 4, we found L = 14 gives the best value for the above objective function. The computed partition of S gives a mapping of frequencies to the grey levels. Note that this mapping is genome-independent so each grey level in the barcodes has the same meaning in different genomes.
Clearly this is a generalization of the Euclidean distance between two vectors of the averaged k-mer frequencies across each genome, widely used for genome comparisons as in the work of Karlin and colleagues [13, 16, 23, 24] and many others. This is equivalent to the special case of our barcode distance when L = 1. Figure S4 in Additional file 1 provides a comparison between the two distances.
We have used the following procedure to identify fragments in a genome with substantially different barcodes than the average barcode of the genome. The procedure consists of two key steps. First, for each k-mer, we select the fragments in the genome that have the highest or the lowest X% of this k-mer's frequency among all fragments, with X being a parameter. Then we sort all the fragments in the increasing order of the number of times they are selected in the first step, termed function F(p), with p being the index of a fragment. Let p 0 be the fragment index having the highest second-order derivative of F(p). We consider all fragments p with F(p) > F( p 0 ) to be the non-native fragments of the genome as they have used the most number of k-mers with frequencies that are substantially different than the typical k-mer frequencies throughout the genome. We found that the abnormal fragment prediction is not very sensitive to the detailed value of X within the range from 5 to 20. So we have chosen X = 10 as the default value of our program.
The rationale for this procedure is that fragments with higher F(p) values represent fragments that have more "abnormal" k-mer frequencies compared to the average k-mer frequencies in the genome, and hence are more probable to be non-native fragments. By examining the curve of the F(p) function, we found that it is convex with one sharp transition point p 0, indicating a transition point from the typical fragments to the "abnormal" fragments in the genome (see Additional file 1). Hence we have used this point as the separation point between the normal (or native) fragments and the "abnormal"fragments.
where C 1, C 2,..., C K is a partition of a given pool of metagenomic fragments with each C i being a subset of the pool and being the average of the barcodes of all fragments in C i , i = 1,..., K.
This work was supported by National Science Foundation (DBI-0354771, ITR-IIS-0407204, DBI-0542119, CCF0621700), a U.S. Department of Energy "BioEnergy Research Center" grant from the Office of Biological and Environmental Research in the DOE Office of Science, and a Distinguished Scholar grant from the Georgia Cancer Coalition. We would like to thank the two anonymous reviewers for their helpful comments on our work. We would also like to thank Ms Joan Yantko for preparing the manuscript.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.