TETRA web-service
The TETRA web-service computes correlation coefficients between the tetranucleotide usage patterns of DNA sequences, which can be used as an indicator of sequence relatedness. Details on the input and output formats are available in the comprehensive online documentation [22].
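As a rough illustration of the underlying idea, the following Python sketch counts overlapping tetranucleotides in a sequence and computes the Pearson correlation coefficient between two of the resulting frequency vectors. It is a simplification under stated assumptions: it correlates raw relative frequencies, whereas the web-service may normalise the counts before correlating them, and the names `TETRAMERS`, `tetra_frequencies` and `pearson` are hypothetical and not part of TETRA.

```python
from itertools import product
from math import sqrt

# All 256 tetranucleotides in a fixed order (hypothetical helper constant).
TETRAMERS = ["".join(p) for p in product("ACGT", repeat=4)]


def tetra_frequencies(seq):
    """Return relative frequencies of all 256 overlapping tetranucleotides."""
    seq = seq.upper()
    counts = dict.fromkeys(TETRAMERS, 0)
    for i in range(len(seq) - 3):
        word = seq[i:i + 4]
        if word in counts:  # skips words containing N or other ambiguity codes
            counts[word] += 1
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no unambiguous tetranucleotide")
    return [counts[t] / total for t in TETRAMERS]


def pearson(x, y):
    """Pearson correlation coefficient of two equally long vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant pattern")
    return cov / sqrt(var_x * var_y)
```

Calling `pearson(tetra_frequencies(a), tetra_frequencies(b))` for two sequences `a` and `b` yields a value between -1 and 1, with higher values indicating more similar tetranucleotide usage.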
TETRA stand-alone program
The stand-alone version of TETRA has many additional features that are not available via the TETRA web-service. Firstly, it comes with pre-computed tetranucleotide usage patterns of all 166 prokaryote chromosomes that were publicly available by June 2004 (Figure 2). These patterns have been incorporated into the program to provide the user with reference data that can also be used to become familiar with the program. With a few mouse clicks, correlation coefficients for the tetranucleotide usage patterns of all genomes can be computed and exported into PHYLIP format [24]. Although tetranucleotide usage patterns are not well suited for phylogenetic reconstruction, the resolution limits of the method can easily be evaluated by inspecting the resulting whole-genome trees.
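To make the export step more concrete, the sketch below writes a square matrix of correlation coefficients as a PHYLIP-style distance matrix. It rests on two assumptions that are not taken from the program: the correlation r is converted into a distance as 1 - r, and names are cut or padded to the ten columns expected by the strict PHYLIP layout. The function name `to_phylip` is hypothetical.

```python
def to_phylip(names, corr):
    """Format a square correlation matrix as a PHYLIP distance matrix.

    Assumption: distance = 1 - r. The conversion TETRA itself applies
    is not specified here.
    """
    n = len(names)
    if len(corr) != n or any(len(row) != n for row in corr):
        raise ValueError("matrix must be square and match the number of names")
    lines = [f"{n:5d}"]  # first line: number of taxa
    for name, row in zip(names, corr):
        label = name[:10].ljust(10)  # strict PHYLIP: names occupy 10 columns
        dists = " ".join(f"{1.0 - r:.6f}" for r in row)
        lines.append(f"{label} {dists}")
    return "\n".join(lines) + "\n"
```

Because names are truncated to ten characters, labels that differ only after the tenth character would collide, so in practice they should be made unique before export.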
Secondly, besides calculating correlation coefficients for tetranucleotide usage patterns, the TETRA stand-alone program allows the user to investigate the raw data (Figure 2) and can produce plots for a more detailed analysis of tetranucleotide over- and underrepresentation (Figure 3). Examining significantly underrepresented tetranucleotides can, for example, hint at possible restriction sites. Tetranucleotide usage patterns for user-provided sequences can be generated in two ways. Single sequences shorter than 100 kb can be pasted into the so-called 'Single Sequence Window'. From there, a sequence can be extended by its reverse complement and its tetranucleotide usage pattern can be calculated. Additionally, the sequence's base composition and GC-content can be computed. Sequences longer than 100 kb or files with multiple sequences can be imported via the 'Batch Mode'. The 'Batch Mode' reads a multi-headed FASTA file and computes the tetranucleotide usage patterns of all sequences within this file in a fully automated manner.
The tetranucleotide usage pattern of an average-sized genome (4 Mb) is computed in less than 10 minutes on a dual 1.8 GHz G5 (IBM PPC 970) computer. Newly computed tetranucleotide usage patterns are displayed within the 'Navigator' window, which is the central place for data management, access to the raw data, and the calculation of plots and correlation coefficients (Figure 2). Raw data and correlation coefficients that have been computed for multiple patterns can be saved as tab-delimited tables in plain-text format, and the graphical output (2D-plots) can be saved in JPEG format.
Detailed documentation of the TETRA stand-alone program and its functions is available via the program's online help system.
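The sequence-handling steps described above can be sketched in Python as follows. This is an illustrative approximation rather than the program's own code: all function names are hypothetical, the reverse complement is simply appended to the forward strand (which is an assumption about how the extension is done and creates a few artificial tetranucleotides at the junction), and GC-content is computed over unambiguous bases only.

```python
def read_fasta(text):
    """Parse multi-record FASTA text into a list of (header, sequence) pairs."""
    records = []
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                records.append((header, "".join(chunks)))
            header, chunks = line[1:].strip(), []
        elif header is not None:  # ignore any text before the first header
            chunks.append(line.upper())
    if header is not None:
        records.append((header, "".join(chunks)))
    return records


# Translation table for complementing unambiguous bases and N.
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def extend_with_reverse_complement(seq):
    """Append the reverse complement to the forward strand (assumed layout)."""
    return seq.upper() + reverse_complement(seq)


def gc_content(seq):
    """Fraction of G and C among the unambiguous bases of a sequence."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        raise ValueError("sequence contains no unambiguous bases")
    return (seq.count("G") + seq.count("C")) / acgt
```

Looping over the pairs returned by `read_fasta` and processing each sequence in turn mirrors the fully automated handling of a multi-headed FASTA file.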
Applicability
As has been demonstrated in a previous study [8], the analysis of tetranucleotide usage patterns is often (but not always) a much more reliable measure of sequence relatedness than the (G+C)-content. However, as a sequence-based measure it is affected by local changes in sequence composition. For example, large stretches of horizontally acquired genes will blur the resolution. Likewise, resolution is a function of sequence length: the shorter the sequence, the less meaningful a tetranucleotide frequency analysis will be.
While the method works quite well for sequences in the range of 40 kb, it is certainly not suited for the analysis of single-read end-sequences, which are usually shorter than 1 kb. Since the phylogenetic signal within tetranucleotide usage patterns is faint, the method performs weakly for whole-genome phylogenetic tree reconstructions. In a whole-genome tree calculated from the pre-computed 166 prokaryotic chromosomes (data not shown), organisms are mostly grouped at the species level and, when the genera are closely related, at the level of genera (e.g. Escherichia sp., Shigella sp., Yersinia sp. or Mesorhizobium sp., Sinorhizobium sp., Bradyrhizobium sp.). However, more distantly related genera or even species with larger evolutionary distances are often not correctly clustered (e.g. Prochlorococcus sp.).
Therefore, the analysis of tetranucleotide usage patterns should not be regarded as a tool to deduce phylogenetic relationships, but rather as a fingerprinting technique for genomic fragment correlation. For example, assignment of fosmid-sized genomic fragments from metagenome libraries of a microbial consortium that mediates the anaerobic oxidation of methane was possible using tetranucleotide frequency analysis, and was shown to be in perfect agreement with 16S rRNA sequence analysis [8].