TETRA: a web-service and a stand-alone program for the analysis and comparison of tetranucleotide usage patterns in DNA sequences
© Teeling et al; licensee BioMed Central Ltd. 2004
Received: 17 August 2004
Accepted: 26 October 2004
Published: 26 October 2004
In the emerging field of environmental genomics, direct cloning and sequencing of genomic fragments from complex microbial communities has proven to be a valuable source of new enzymes, expanding the knowledge of basic biological processes. The central problem of this so-called metagenome-approach is that the cloned fragments often lack suitable phylogenetic marker genes, rendering the identification of clones that are likely to originate from the same genome difficult or impossible. In such cases, the analysis of intrinsic DNA-signatures such as tetranucleotide frequencies can provide valuable hints on fragment affiliation. With this application in mind, the TETRA web-service and the TETRA stand-alone program have been developed, both of which automate the task of comparative tetranucleotide frequency analysis.
TETRA provides a statistical analysis of tetranucleotide usage patterns in genomic fragments, either via a web-service or a stand-alone program. With respect to discriminatory power, such an analysis outperforms the assignment of genomic fragments based on the (G+C)-content, which is a widely-used sequence-based measure for assessing fragment relatedness. While the web-service is restricted to the calculation of correlation coefficients between tetranucleotide usage patterns of submitted DNA sequences, the stand-alone program generates a much more detailed output, comprising all raw data and graphical plots. The stand-alone program is controlled via a graphical user interface and can batch-process a multitude of sequences. Furthermore, it comes with pre-computed tetranucleotide usage patterns for 166 prokaryote chromosomes, providing a useful reference dataset and source for data-mining.
Up to now, the analysis of skewed oligonucleotide distributions within DNA sequences has not been a commonly used tool within metagenomics. With the TETRA web-service and stand-alone program, the method is now accessible to a broad audience in an easy-to-use form. This will hopefully facilitate the interrelation of genomic fragments from metagenome libraries, ultimately leading to new insights into the genetic potentials of yet uncultured microorganisms.
At present, a majority of the microbes from natural microbial communities cannot be transferred into pure cultures, either because proper cultivation conditions have yet to be found, or due to currently unidentified fundamental obstacles. For decades, this has limited our understanding of the functioning of microbes within their natural habitats, such as their metabolic roles, interactions and dependencies.
The metagenome-approach makes it possible, for the first time, to circumvent the restrictions imposed by the limited culturability of environmental microorganisms. Now, DNA fragments of 40–150 kb from uncultured microorganisms can be cloned directly from the environment. This delivers insights into the microorganisms' genetic potentials, sometimes even allowing the reconstruction of entire genomes. Prominent examples include the unexpected finding of bacteriorhodopsin in marine Gammaproteobacteria [3–6] and the almost complete reconstruction of two bacterial genomes from an acid mine drainage microbial biofilm.
However, the metagenome-approach is not without its limitations and problems. A major constraint is that especially small genomic fragments, as are obtained from libraries that have been constructed with fosmids or cosmids as cloning vectors, often lack suitable phylogenetic marker genes. This leads to the problem that fragments belonging to the same organism cannot be reliably identified as such unless they overlap. In order to nonetheless interrelate such genomic fragments, measures such as the (G+C)-content or BLAST hits and codon usage of the fragment's coding regions are commonly used to assess whether two unlinked fragments from a metagenome library belong to the same organism. These measures, however, can produce ambiguous or even misleading results, and should be supplemented by additional tools that assess the relatedness of the genomic fragments.
Since numerous studies have shown that oligonucleotide frequencies within DNA sequences exhibit species-specific patterns [9–18], comparative analysis of such oligonucleotide frequencies is a promising approach to this problem. For tetranucleotides, it has even been demonstrated that their frequencies carry an innate but weak phylogenetic signal. Comparative analysis of tetranucleotide usage patterns also provides a good balance between computational requirements and attainable resolution. This makes the method particularly well-suited for use as a high-throughput method that can assist in tackling the fragment identification problem in metagenomics.
In order to automate and facilitate such an analysis, the TETRA software suite was developed, comprising both a web-service and a stand-alone program.
The algorithms that are used within TETRA have been described elsewhere. In brief, DNA sequences are extended by their reverse-complements to compensate for different patterns of tetranucleotide over- and underrepresentation between the leading and the lagging strand. Then, the frequencies of all 256 possible tetranucleotides are counted and the corresponding expected frequencies are calculated by means of a maximal-order Markov model from the sequences' di- and trinucleotide composition. In order to evaluate the significance of the level of over- or underrepresentation for each tetranucleotide, the divergence between the observed and expected tetranucleotide frequencies is then transformed into z-scores using an approximation published by Schbath [20, 21]. Finally, all DNA sequences are compared in pairs by computing Pearson's correlation coefficient of their z-scores. Details on the method, its applicability and its limits are given in Teeling et al. (2004) and the TETRA online manual.
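The z-score step described above can be illustrated with a minimal Python sketch. It is not the TETRA implementation (which is written in REALbasic and C); the function names are hypothetical. It assumes the usual maximal-order Markov estimate for the expected count, E(n1n2n3n4) = N(n1n2n3)·N(n2n3n4)/N(n2n3), and the variance approximation var = E·[N(n2n3) − N(n1n2n3)]·[N(n2n3) − N(n2n3n4)]/N(n2n3)², which should be checked against the cited publications before use. As a further simplification, words are counted on the sequence and on its reverse complement separately and summed, rather than on a concatenated sequence, so that no words spanning the junction are counted.

```python
from collections import Counter
from itertools import product
from math import sqrt

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def count_kmers(seq, k):
    """Count overlapping k-mers on both strands, skipping words with non-ACGT symbols."""
    counts = Counter()
    for strand in (seq, reverse_complement(seq)):
        for i in range(len(strand) - k + 1):
            word = strand[i:i + k]
            if set(word) <= set("ACGT"):
                counts[word] += 1
    return counts


def tetra_zscores(seq):
    """Return a dict mapping each of the 256 tetranucleotides to a z-score."""
    seq = seq.upper()
    n2, n3, n4 = (count_kmers(seq, k) for k in (2, 3, 4))
    zscores = {}
    for word in map("".join, product("ACGT", repeat=4)):
        left = n3[word[:3]]    # N(n1n2n3)
        right = n3[word[1:]]   # N(n2n3n4)
        mid = n2[word[1:3]]    # N(n2n3)
        if mid == 0:
            zscores[word] = 0.0  # assumption: no evidence, treat as neutral
            continue
        expected = left * right / mid
        variance = expected * ((mid - left) * (mid - right)) / mid ** 2
        # assumption: a zero variance is reported as a neutral z-score
        zscores[word] = (n4[word] - expected) / sqrt(variance) if variance > 0 else 0.0
    return zscores
```

Calling `tetra_zscores(sequence)` on a fragment yields one value per tetranucleotide; listing the values in a fixed word order (for example, sorted alphabetically) gives the vector that is then correlated between fragments.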
The TETRA stand-alone program can be downloaded for free from the TETRA website. The current release has been implemented in REALbasic (http://www.realsoftware.com REAL Software Inc., Austin, Texas) and is available for Mac OS X. Versions for Linux and Windows are also available, but differ in details regarding their implementation and features. In the current version of the TETRA stand-alone program, the tetranucleotides are counted by ocount, a self-written C program that has been integrated into the program.
Results and discussion
The TETRA web-service computes correlation coefficients between tetranucleotide usage patterns of DNA sequences, which can be used as an indicator of sequence relatedness. Details on the input and output formats are available in the comprehensive online documentation.
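As an illustration of the comparison step, the following Python sketch computes Pearson's correlation coefficient between two z-score vectors. It is a generic textbook implementation, not code taken from the web-service, and the function name `pearson` is hypothetical. Both vectors are assumed to list the 256 tetranucleotide z-scores in the same word order.

```python
from math import sqrt


def pearson(x, y):
    """Pearson's correlation coefficient of two equally long numeric sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two vectors of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    if ss_x == 0 or ss_y == 0:
        raise ValueError("correlation is undefined for a constant vector")
    return cov / sqrt(ss_x * ss_y)
```

A coefficient close to 1 indicates similar patterns of tetranucleotide over- and underrepresentation in the two sequences; which threshold counts as "related" is not fixed by this sketch and depends on fragment length.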
TETRA stand-alone program
The tetranucleotide usage pattern of an average-sized genome (4 Mb) is computed in less than 10 minutes on a dual 1.8 GHz G5 (IBM PPC 970) computer. Newly computed tetranucleotide usage patterns are displayed within the 'Navigator' window, which is the central place for data management, access to the raw data and the calculation of plots and correlation coefficients (Figure 2). Raw data and correlation coefficients that have been computed for multiple patterns can be saved as tab-delimited tables in plain-text format and the graphical output (2D-plots) can be saved in JPEG-format.
Detailed documentation of the TETRA stand-alone program and its functions is available via the program's online help system.
As has been demonstrated in a previous study, the analysis of tetranucleotide usage patterns is often (but not always) a much more reliable measure of sequence relatedness than the (G+C)-content. However, as a sequence-based measure it is affected by local changes in sequence composition. For example, large stretches of horizontally acquired genes will blur the resolution. Likewise, resolution is a function of sequence length, i.e. the shorter the sequence, the less meaningful a tetranucleotide frequency analysis will be.
While the method works quite well for sequences in the range of 40 kb, it is certainly not suited for the analysis of single-read end-sequences, which are usually shorter than 1 kb. Since the phylogenetic signal within tetranucleotide usage patterns is faint, the method performs poorly for whole-genome phylogenetic tree reconstructions. In a whole-genome tree calculated from the 166 pre-computed prokaryotic chromosomes (data not shown), organisms are mostly grouped at the species level, and at the genus level when the genera are closely related (e.g. Escherichia sp., Shigella sp., Yersinia sp. or Mesorhizobium sp., Sinorhizobium sp., Bradyrhizobium sp.). However, more distantly related genera, or even species with larger evolutionary distances, are often not correctly clustered (e.g. Prochlorococcus sp.).
Therefore, the analysis of tetranucleotide usage patterns should not be regarded as a tool to deduce phylogenetic relationships, but rather as a fingerprinting technique for genomic fragment correlation. For example, assignment of fosmid-sized genomic fragments from metagenome libraries of a microbial consortium that mediates the anaerobic oxidation of methane was possible using tetranucleotide frequency analysis, and was shown to be in perfect agreement with 16S rRNA sequence analysis.
With the worldwide ongoing programs to sequence and analyze natural communities, new approaches for sequence correlation beyond G+C content, read densities and codon usage have to be developed and made available to the users. The easy-to-use TETRA software will facilitate this task and provide additional decision support, e.g. for fragment assignment when complete genomes have to be assembled in environmental sequencing projects.
Availability and requirements
Project name: TETRA
Project home page: http://www.megx.net/tetra
Operating system(s): Platform independent (web-service); Mac OS X (stand-alone program)
Programming language: REALbasic
Other requirements: none
Any restrictions to use by non-academics: none
List of abbreviations
BLAST: basic local alignment search tool
marine environmental genomics
We thank Christian Quast for helping with programming details and for making the TETRA web-service HTML 4.01 compliant, and Melissa Duhaime for carefully reading the manuscript. This work was supported by the Max Planck Society.
1. Amann RI, Ludwig W, Schleifer KH: Phylogenetic identification and in situ detection of individual microbial cells without cultivation. Microbiol Rev 1995, 59: 143–169.
2. Rondon MR, August PR, Bettermann AD, Brady SF, Grossman TH, Liles MR, Loiacono KA, Lynch BA, MacNeil IA, Minor C, Tiong CL, Gilman M, Osburne MS, Clardy J, Handelsman J, Goodman RM: Cloning the soil metagenome: a strategy for accessing the genetic and functional diversity of uncultured microorganisms. Appl Environ Microbiol 2000, 66: 2541–2547. doi:10.1128/AEM.66.6.2541-2547.2000
3. Beja O, Aravind L, Koonin EV, Suzuki MT, Hadd A, Nguyen LP, Jovanovich SB, Gates CM, Feldman RA, Spudich JL, Spudich EN, DeLong EF: Bacterial rhodopsin: evidence for a new type of phototrophy in the sea. Science 2000, 289: 1902–1906. doi:10.1126/science.289.5486.1902
4. Beja O, Spudich EN, Spudich JL, Leclerc M, DeLong EF: Proteorhodopsin phototrophy in the ocean. Nature 2001, 411: 786–789. doi:10.1038/35081051
5. de la Torre JR, Christianson LM, Beja O, Suzuki MT, Karl DM, Heidelberg J, DeLong EF: Proteorhodopsin genes are distributed among divergent marine bacterial taxa. Proc Natl Acad Sci U S A 2003, 100: 12830–12835. doi:10.1073/pnas.2133554100
6. Sabehi G, Massana R, Bielawski JP, Rosenberg M, DeLong EF, Beja O: Novel Proteorhodopsin variants from the Mediterranean and Red Seas. Environ Microbiol 2003, 5: 842–849. doi:10.1046/j.1462-2920.2003.00493.x
7. Tyson GW, Chapman J, Hugenholtz P, Allen EE, Ram RJ, Richardson PM, Solovyev VV, Rubin EM, Rokhsar DS, Banfield JF: Community structure and metabolism through reconstruction of microbial genomes from the environment. Nature 2004, 428: 37–43. doi:10.1038/nature02340
8. Teeling H, Meyerdierks A, Bauer M, Amann R, Glockner FO: Application of tetranucleotide frequencies for the assignment of genomic fragments. Environ Microbiol 2004, 6: 938–947. doi:10.1111/j.1462-2920.2004.00624.x
9. Karlin S, Ladunga I, Blaisdell BE: Heterogeneity of genomes: measures and values. Proc Natl Acad Sci U S A 1994, 91: 12837–12841.
10. Karlin S, Burge C: Dinucleotide relative abundance extremes: a genomic signature. Trends Genet 1995, 11: 283–290. doi:10.1016/S0168-9525(00)89076-9
11. Karlin S: Global dinucleotide signatures and analysis of genomic heterogeneity. Curr Opin Microbiol 1998, 1: 598–610. doi:10.1016/S1369-5274(98)80095-7
12. Karlin S, Campbell AM, Mrazek J: Comparative DNA analysis across diverse genomes. Annu Rev Genet 1998, 32: 185–225. doi:10.1146/annurev.genet.32.1.185
13. Nakashima H, Ota M, Nishikawa K, Ooi T: Genes from nine genomes are separated into their organisms in the dinucleotide composition space. DNA Res 1998, 5: 251–259.
14. Gentles AJ, Karlin S: Genome-scale compositional comparisons in eukaryotes. Genome Res 2001, 11: 540–546. doi:10.1101/gr.163101
15. Abe T, Kanaya S, Kinouchi M, Ichiba Y, Kozuki T, Ikemura T: Informatics for unveiling hidden genome signatures. Genome Res 2003, 13: 693–702. doi:10.1101/gr.634603
16. Deschavanne PJ, Giron A, Vilain J, Fagot G, Fertil B: Genomic signature: characterization and classification of species assessed by chaos game representation of sequences. Mol Biol Evol 1999, 16: 1391–1399.
17. Goldman N: Nucleotide, dinucleotide and trinucleotide frequencies explain patterns observed in chaos game representations of DNA sequences. Nucleic Acids Res 1993, 21: 2487–2491.
18. Sandberg R, Winberg G, Branden CI, Kaske A, Ernberg I, Coster J: Capturing whole-genome characteristics in short sequences using a naive Bayesian classifier. Genome Res 2001, 11: 1404–1409. doi:10.1101/gr.186401
19. Pride DT, Meinersmann RJ, Wassenaar TM, Blaser MJ: Evolutionary implications of microbial genome tetranucleotide frequency biases. Genome Res 2003, 13: 145–158. doi:10.1101/gr.335003
20. Schbath S, Prum B, de Turckheim E: Exceptional motifs in different Markov chain models for a statistical analysis of DNA sequences. J Comput Biol 1995, 2: 417–437.
21. Schbath S: An efficient statistic to detect over- and under-represented words in DNA sequences. J Comput Biol 1997, 4: 189–192.
22. TETRA web-service [http://www.megx.net/tetra/]
23. TETRA standalone program [http://www.megx.net/tetra/solo/TETRA.zip]
24. PHYLIP homepage [http://evolution.genetics.washington.edu/phylip.html]
This article is published under license to BioMed Central Ltd. This is an open-access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.