Differentiation of regions with atypical oligonucleotide composition in bacterial genomes
© Reva and Tümmler; licensee BioMed Central Ltd. 2005
Received: 07 June 2005
Accepted: 14 October 2005
Published: 14 October 2005
Complete sequencing of bacterial genomes has become a common technique in present-day microbiology. Data mining in the complete sequence is then an essential step. New in silico methods are needed that rapidly identify the major features of genome organization and facilitate the prediction of the functional class of ORFs. We tested the usefulness of local oligonucleotide usage (OU) patterns to recognize and differentiate types of atypical oligonucleotide composition in DNA sequences of bacterial genomes.
A total of 163 bacterial genomes of eubacteria and archaea published in the NCBI database were analyzed. Local OU patterns exhibit substantial intrachromosomal variation in bacteria. Loci with alternative OU patterns were parts of horizontally acquired gene islands or ancient regions such as genes for ribosomal proteins and RNAs. OU statistical parameters, such as local pattern deviation (D), pattern skew (PS) and OU variance (OUV) enabled the detection and visualization of gene islands of different functional classes.
A set of approaches has been designed for the statistical analysis of nucleotide sequences of bacterial genomes. These methods are useful for the visualization and differentiation of regions with atypical oligonucleotide composition prior to or accompanying gene annotation.
The number of sequenced prokaryotic genomes increases rapidly each year. Their comprehensive analysis requires the development of new high-throughput computational methods. The analysis of oligonucleotide usage biases has been recognized to be practical for the recognition of pathogenicity islands [1, 2] and elucidation of origins of orphan sequences [3–5]. Recently we have developed methods for the global analysis of oligonucleotide usage (OU) in complete sequences of bacterial chromosomes, plasmids and phages. The patterns of deviations of oligonucleotide frequencies from expectations were shown to be genome signatures reflecting to some extent the phylogenetic links between microorganisms [3, 4, 7, 8].
The usage of oligonucleotides in bacterial sequences is not random. Frequencies of the oligonucleotide words (hereafter, words) depend strongly on their physicochemical properties such as base stacking energy, propeller twist angle, bendability, position preference and protein deformability. Oligonucleotide usage in bacterial genomes is strongly influenced by codon usage; however, there are further, as yet unknown, mechanisms of word selection.
To characterize OU in a sequence, the concept of OU patterns has been introduced. The disparity of frequencies of words and their reverse complements, termed pattern skew (PS), and the variance of oligonucleotide frequencies (OUV) are attributes of each OU pattern, and the distance (D) expresses the difference between two OU patterns. These OU parameters are independent of the length of the sequence and hence allow the comparison of windows of different sequence length (see 'Materials and methods'). This study applied OU statistics to visualize and discern gene islands of different functional classes. The developed methods are of importance for structural, functional and comparative genomics.
Results and discussion
Types of OU patterns, abbreviations and nomenclature
Counts of words of different lengths N from 2 to 7-mer were analyzed in this work applying different schemes of normalization. Different types of OU patterns were abbreviated as type_N mer. Types were "n0" for non-normalized, "n1" for normalized by mononucleotide frequencies, "n2" for normalized by dinucleotides, and so on. For example, the non-normalized tetranucleotide usage pattern is denoted as n0_4 mer, the trinucleotide usage pattern normalized by dinucleotides is n2_3 mer, and the pentanucleotide usage pattern normalized by trinucleotides is n3_5 mer.
Each OU pattern is characterized by three statistical parameters:
D – distance between two patterns of the same type (in this work we used distances D between local and global genome patterns);
PS – pattern skew, the distance between the two patterns of the direct and reverse strands of the same DNA sequence;
OUV – oligonucleotide usage variance.
Correspondingly, the nomenclature is as follows:
distance between a local n0_4 mer pattern and the corresponding global pattern – D:n0_4 mer;
pattern skew of a n0_3 mer pattern – PS:n0_3 mer;
variance of a n3_7 mer pattern – OUV:n3_7 mer.
Two subtypes of normalization of local OU patterns were defined: normalized by frequencies of component words in the current genomic fragment (internal normalization, i) and in the complete sequence of the genome (global normalization, g). For example, internal and global OUV determined for a local n1_4 mer pattern were OUV:n1 i _4 mer and OUV:n1 g _4 mer, respectively. Internal normalization was always used in this study with the exception of the chapter "Identification of horizontally transferred elements" where the distances between OUV:n1 i _4 mer and OUV:n1 g _4 mer are analyzed. To simplify nomenclature, the index i was skipped in the pattern type abbreviation in all other chapters.
OU constraints in bacterial DNA
Local variations of OU patterns
Genetic repertoire of loci characterized by atypical tetranucleotide usage patterns and extreme OUV (section III in Fig. 4) identified in bacterial chromosomes
Organism / Genes and the encoded protein
  putative hemagglutinin/hemolysin-related protein
  non-coding multiple repeats TTTAGAAA
Bordetella bronchiseptica RB50
  BB1186: putative hemolysin
Bradyrhizobium japonicum USDA110
Corynebacterium efficiens YS-314
  fasA: fatty-acid synthase I
  fasB: fatty-acid synthase II
Deinococcus radiodurans R1 chromosome 1
  DR1461-1462: hypothetical proteins
  non-coding tandem repeats CCCGCCC
E. coli O157:H7
  Z0609, Z0615: RTX family exoproteins
Mycobacterium tuberculosis H37Rv
  Rv0272c-Rv0279c: hypothetical Gly-, Ala-rich proteins
  Rv0297-Rv0304c: hypothetical Gly-, Ala-, Asn-rich proteins
  Rv0355c: Asn-rich protein
  Rv0573c-Rv0578c: hypothetical Gly-rich proteins
  Rv0742-Rv0747: hypothetical Gly-rich proteins
  Rv1060-Rv1068c: hypothetical Gly-, Ala-rich proteins
  Rv1084-Rv1092c: hypothetical proteins
  multiple repeats CCGCCGCCA
  Rv2490c-Rv2494: hypothetical Gly-rich proteins
Pseudomonas aeruginosa PAO1
  PA1874: hypothetical protein
P. putida KT2440
  PP0168: Thr-rich surface adhesion protein
  PP0806: surface adhesion protein
P. syringae DC3000
  PSPTO3229: filamentous hemagglutinin
Rhodopirellula baltica 1
  RB3077: putative cyclic nucleotide binding protein
  RB4375: large polymorphic membrane protein, probable extracellular nuclease
  RB11769: probable aggregation factor core protein MAFp3
Rhodopseudomonas palustris CGA009
  conserved hypothetical protein
  conserved hypothetical protein
Sulfolobus solfataricus P2
  non-coding tandem repeats GAATTGAAAG
Staphylococcus aureus N315
  ebhA – ebhB: large surface anchored proteins
  SA2447: similar to streptococcal hemagglutinin
Streptomyces coelicolor A3(2)
  SC8F4.01c: Ala/Glu-rich protein
  SC2H4.02: hypothetical protein
Xanthomonas campestris ATCC33913
  yapH: putative autotransporter adhesin
Xylella fastidiosa Temecula 1
  non-coding sequence, multiple
Yersinia pestis KIM
  irp1-2: yersiniabactin peptide/polyketide synthetase
  yapH: putative autotransporter adhesin
  y3579: putative filamentous hemagglutinin
Section I is heterogeneous. The genes for ribosomal RNAs are discerned from the other genes in section I by their extremely high PS of 60 – 70%, which is usually the highest value in the genome. To differentiate the gene classes in section I further, the next chapter describes how additional OU statistical parameters can be applied to identify the subgroup of horizontally acquired elements.
Identification of horizontally transferred elements
An example of the identification of a laterally acquired gene island is shown in Fig. 5. The island in the chromosome of P. putida KT2440 has significantly divergent OUV:n1 i _4 mer and OUV:n1 g _4 mer values and D:n0_4 mer values beyond the 95% confidence interval of the complete chromosome (Fig. 5A). Since OUV:n1 i _n mer and OUV:n1 g _n mer in local patterns and the difference between them are automatically calculated by the program, the method may be used for the high-throughput identification of horizontally transferred elements in bacterial genomes. Whereas OUV:n1 i _4 mer and OUV:n1 g _4 mer values are strongly correlated in the bulk of the P. putida genome, all islands are distinguished by high OUV:n1 g _4 mer and low OUV:n1 i _4 mer values (Fig. 5B).
Informative assignments of the OU statistical parameters
To check whether the local fluctuations of OU parameters are statistically valid, a sequence of 100 kbp of mononucleotide content similar to pKLC102 was randomly generated. The ranges of 3-sigma fluctuation of D:n0_3 mer and OUV:n1_3 mer in the random sequence are depicted in Fig. 6 by vertical grey bars along the corresponding D and OUV axes. In the real sequences these values vary over a significantly larger range with the mean value of D smaller and the mean OUV higher than in the randomly generated sequence. (The plasmid pKLC102 sequence and the randomly generated sequence are included in the additional files as examples of source data files pKLC102.fts and random.fts, respectively.)
Correlation coefficients between D, PS and OUV of n0_4 mer local patterns with those of the corresponding n1, n2 and n3 normalized patterns (sequences analyzed: plasmid pKLC102, window 5,000 bp, step 2,500 bp; 1 Mbp-2 Mbp locus of E. coli K12 chromosome, window 10,000 bp, step 5,000 bp)
Bacterial genomes are not homogeneous but contain polymorphic blocks including horizontally transferred gene islands, non-coding sequences, long multidomain genes and ancient conserved gene clusters. The structural polymorphism of bacterial genomes may be effectively analyzed by local OU pattern signatures. A set of statistical approaches has been designed to perform this structural analysis of nucleotide sequences of bacterial genomes. These methods are useful for the visualization of regions with atypical oligonucleotide composition. The combination of the informative parameters, of which there are 21 in the case of tetranucleotide usage analysis, facilitates the prediction of gene classes. Moreover, many other subtypes of OU patterns may be additionally introduced. OU statistical analysis thus provides a valuable toolbox for the functional classification of regions and genes of interest prior to common-practice gene annotation.
A command-line version of the Python program that applies the OU statistics methods described above is available as an additional file. To run the program, the Python interpreter must first be downloaded from the Web site http://www.python.org/download/ and installed on the computer. The source DNA sequence (or sequences) should be saved in FASTA format in text file(s) with .FST file name extensions. Users may choose the OU statistical parameters to be calculated and the parameters of the sliding window by setting the corresponding command line arguments. Many different OU parameters may be determined in a single run of the program, and all FST files in the target folder will be processed consecutively in a batch. For each source data file an output file in TXT format will be saved in the same folder. The full list of arguments and a description of how to use the program are documented in the readme.doc file provided in the additional files. The program is fast enough to calculate the full set of OU parameters mentioned in this paper for a complete bacterial genome of average length in 10–20 min, depending on computer performance.
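The sliding-window scheme itself can be illustrated with a short Python sketch. This is not the program supplied in the additional files, and it does not reproduce its command line arguments; the function names `read_fasta` and `sliding_windows` are hypothetical, and only full-length windows are returned (an assumption about how the sequence end is handled).

```python
def read_fasta(path):
    """Return a dict {header: sequence} for a FASTA-formatted text file."""
    records = {}
    header = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                header = line[1:]
                records[header] = []
            elif header is not None:
                records[header].append(line.upper())
    return {name: "".join(parts) for name, parts in records.items()}


def sliding_windows(seq, window, step):
    """Yield (start, fragment) for every full-length window along seq."""
    for start in range(0, len(seq) - window + 1, step):
        yield start, seq[start:start + window]


if __name__ == "__main__":
    # Toy sequence; the study used, for example, a 5,000 bp window
    # with a 2,500 bp step for plasmid pKLC102.
    for start, fragment in sliding_windows("ACGTACGTAC", window=4, step=2):
        print(start, fragment)
```

Each fragment returned by `sliding_windows` would then be passed to whatever routine computes the local OU parameters for that window.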
Several general conclusions about OU in bacteria can be drawn from this report. First, most OU constraints reside in di-, tri- and tetranucleotide combinations and vanish with increasing word length (see Fig. 1). For example, given a hexamer, the four possible heptamer words that extend it have the same likelihood of occurring next in the sequence. Hence, i) the analysis of the oligonucleotide distribution of up to 4-mers is sufficient to uncover all OU constraints in the sequence; and ii) neighbor effects are limited to dipeptides so that protein evolution is not skewed by oligonucleotide biases. Second, D and PS values are correlated in local patterns (see the examples for D:n0_4 mer and PS:n0_4 mer in Figs. 3 and 4). This observation is in accordance with the general trend in bacterial sequences to keep parity of frequencies of words and their reverse complements, in other words, a trend towards minimal PS. OU parity is most pronounced for the OU pattern of the whole chromosome, whereas fluctuations of OU in local patterns lead to an increased PS. The exceptions are the laterally transferred elements with their island-specific OU signature. In this case, large D values of the local OU patterns may be associated with low PS (see blue and green dots in section I in Fig. 4).
Sequences of 163 bacterial chromosomes including eubacterial and archaeal genomes published in the NCBI database were analyzed in this study.
OU parameters of words of length N were normalized by shorter words n (0 ≤ n < N) as follows:
whereby the F values are the observed frequencies of the particular word of length n in the complete sequence and ξ is any nucleotide A, T, G or C. The expected count of a word [ξ1...ξN] of length N in a sequence of length L_seq normalized by frequencies of n-mers (n < N) was calculated as follows:
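For the simplest case only, n = 1, a zero-order expectation can be sketched in Python. This is an assumption made for illustration and not the normalization formula of this study: the expected count of a word is taken as L_seq multiplied by the product of the observed frequencies of its nucleotides. The function names `count_words` and `expected_count_n1` are hypothetical, and normalization by longer words (n2, n3 and so on) is not covered.

```python
from collections import Counter


def count_words(seq, n):
    """Count overlapping words of length n, keeping only words made of A, C, G, T."""
    seq = seq.upper()
    counts = Counter(seq[i:i + n] for i in range(len(seq) - n + 1))
    return {word: c for word, c in counts.items() if set(word) <= set("ACGT")}


def expected_count_n1(word, seq):
    """Assumed zero-order expectation: L_seq times the product of mononucleotide frequencies."""
    mono = count_words(seq, 1)
    total = sum(mono.values())
    expected = float(len(seq))
    for nucleotide in word:
        expected *= mono.get(nucleotide, 0) / total
    return expected


if __name__ == "__main__":
    seq = "ACGT" * 25
    observed = count_words(seq, 2)
    for word in ("AC", "CA"):
        # Observed count next to the assumed mononucleotide-based expectation.
        print(word, observed.get(word, 0), expected_count_n1(word, seq))
```

Comparing the observed and expected counts of each word gives a simple picture of over- and under-represented words under this assumed model.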
The distance D between two patterns was calculated as the sum of absolute distances between ranks of identical words (w, of a total of 4^N different words) in patterns i and j as follows:
PS is a particular case of D where patterns i and j were calculated for the same DNA but for direct and reverse strands, respectively. Dmax = 4^N(4^N - 1)/2 and Dmin = 0 when calculating D; in the case of a PS calculation, Dmin = 4^N if N is an odd number or Dmin = 4^N - 2^N if N is an even number.
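The rank-based distance and its PS special case can be sketched in Python as follows. This is a minimal illustration rather than the authors' program: the function names are hypothetical, ties in word counts are broken alphabetically (an assumption), and the linear rescaling of the raw rank sum between Dmin and Dmax to a percentage is likewise an assumption. The reverse-strand pattern is obtained by giving each word the rank of its reverse complement, which corresponds to counting the words on the reverse strand.

```python
from collections import Counter
from itertools import product

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(word):
    return word.translate(COMPLEMENT)[::-1]


def word_ranks(seq, n):
    """Rank all 4**n words by count (rank 1 = most frequent, ties alphabetical)."""
    counts = Counter(seq[i:i + n] for i in range(len(seq) - n + 1))
    words = ["".join(p) for p in product("ACGT", repeat=n)]
    ordered = sorted(words, key=lambda w: (-counts[w], w))
    return {w: rank for rank, w in enumerate(ordered, start=1)}


def rank_distance(ranks_i, ranks_j):
    """Sum of absolute rank differences of identical words in two patterns."""
    return sum(abs(ranks_i[w] - ranks_j[w]) for w in ranks_i)


def pattern_skew_raw(seq, n):
    """Raw PS: rank distance between direct and reverse-strand patterns."""
    direct = word_ranks(seq, n)
    reverse = {w: direct[reverse_complement(w)] for w in direct}
    return rank_distance(direct, reverse)


def d_max(n):
    """Dmax as quoted in the text."""
    return 4**n * (4**n - 1) // 2


def ps_min(n):
    """Dmin for a PS calculation as quoted in the text (odd N / even N)."""
    return 4**n if n % 2 else 4**n - 2**n


def scale_percent(raw, d_min, d_max_value):
    """Assumed linear rescaling of a raw rank sum to a percentage."""
    return 100.0 * (raw - d_min) / (d_max_value - d_min)


if __name__ == "__main__":
    local = "ACGTTGCAAGGCTTAACCGG" * 10
    genome = "ACGGTCATTGCAAGCTAGCT" * 50
    raw_d = rank_distance(word_ranks(local, 2), word_ranks(genome, 2))
    print("D (%):", scale_percent(raw_d, 0, d_max(2)))
    print("PS (%):", scale_percent(pattern_skew_raw(local, 2), ps_min(2), d_max(2)))
```

Working on ranks rather than raw counts is what makes patterns from windows of different length comparable in this scheme.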
The definition of OUV was provided in our previous paper.
The random sequence was generated by an in-house program using the Python randomizer.
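A random sequence with a given mononucleotide content can be generated along the following lines. This sketch is not the in-house program: the function name `random_sequence_like` is hypothetical, nucleotides are drawn independently with the mononucleotide frequencies of a template sequence (an assumption about the procedure), and it relies on `random.choices` from the current Python standard library.

```python
import random
from collections import Counter


def random_sequence_like(template, length, seed=None):
    """Draw `length` nucleotides independently, weighted by the template's base counts."""
    counts = Counter(base for base in template.upper() if base in "ACGT")
    if not counts:
        raise ValueError("template contains no A, C, G or T")
    bases = sorted(counts)
    weights = [counts[base] for base in bases]
    rng = random.Random(seed)  # a fixed seed makes the sequence reproducible
    return "".join(rng.choices(bases, weights=weights, k=length))


if __name__ == "__main__":
    # Toy GC-rich template; the study used a 100 kbp random sequence
    # with a mononucleotide content similar to pKLC102.
    simulated = random_sequence_like("GGCCGGCCAT", 100, seed=1)
    print(simulated)
```

Such a sequence preserves only the base composition, so any OU fluctuation measured in it reflects sampling noise rather than word selection.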
List of abbreviations
OUV: oligonucleotide usage variance
D: distance between two OU patterns of an identical type
This work was supported by the DFG-sponsored Europäisches Graduiertenkolleg 653.
- Noble PA, Citek RW, Ogunseitan OA: Tetranucleotide frequencies in microbial genomes. Electrophoresis 1998, 19: 528–535. 10.1002/elps.1150190412
- Pride DT, Blaser MJ: Identification of horizontally acquired elements in Helicobacter pylori and other prokaryotes using oligonucleotide difference analysis. Genome Let 2002, 1: 2–15. 10.1166/gl.2002.003
- Abe T, Kanaya S, Kinouchi M, Ichiba Y, Kozuki T, Ikemura T: Informatics for unveiling hidden genome signatures. Genome Res 2003, 13: 693–702. 10.1101/gr.634603
- Pride DT, Meinersmann RJ, Wassenaar TM, Blaser MJ: Evolutionary implications of microbial genome tetranucleotide frequency biases. Genome Res 2003, 13: 145–155. 10.1101/gr.335003
- Teeling H, Waldmann J, Lombardot T, Bauer M, Glöckner FO: TETRA: a web-service and a stand-alone program for the analysis and comparison of tetranucleotide usage patterns in DNA sequences. BMC Bioinformatics 2004, 5: 163. 10.1186/1471-2105-5-163
- Reva ON, Tümmler B: Global features of sequences of bacterial chromosomes, plasmids and phages revealed by analysis of oligonucleotide usage patterns. BMC Bioinformatics 2004, 5: 90. 10.1186/1471-2105-5-90
- Karlin S: Global dinucleotide signatures and analysis of genomic heterogeneity. Curr Opin Microbiol 1998, 1: 598–610. 10.1016/S1369-5274(98)80095-7
- Karlin S, Mrazek J, Campbell A: Compositional biases of bacterial genomes and evolutionary implications. J Bacteriol 1997, 179: 3899–3913.
- Gorban AN, Popova TG, Zinovyev AY: Four basic symmetry types in the 7-cluster structure of microbial genomic sequences. In Silico Biol 2005, 5: 0025.
- Weinel C, Ussery DW, Ohlsson H, Sicheritz-Ponten T, Kiewitz C, Tümmler B: Comparative genomics of Pseudomonas aeruginosa PAO1 and Pseudomonas putida KT2440: orthologs, codon usage, REP elements and oligonucleotide motif signatures. Genome Letters 2002, 1: 175–187. 10.1166/gl.2002.021
- Weinel C, Nelson KE, Tümmler B: Global features of the Pseudomonas putida KT2440 genome sequence. Environ Microbiol 2002, 4: 809–818. 10.1046/j.1462-2920.2002.00331.x
- Weinel C, Tümmler B, Hilbert H, Nelson KE, Kiewitz C: General method of rapid Smith/Birnstiel mapping adds for gap closure in shotgun microbial genome sequencing projects: application to Pseudomonas putida KT2440. Nucleic Acids Res 2001, 29: E110. 10.1093/nar/29.22.e110
- Carbone A, Zinovyev A, Képès : Codon adaptation index as a measure of dominating codon bias. Bioinformatics 2003, 19: 2005–2015. 10.1093/bioinformatics/btg272
- Kiewitz C, Weinel C, Tümmler B: Genome codon index of Pseudomonas aeruginosa: a codon index that utilizes whole genome sequence data. Genome Letters 2002, 1: 61–70. 10.1166/gl.2002.008
- Hacker J, Kaper JB: Pathogenicity islands and the evolution of microbes. Annu Rev Microbiol 2000, 54: 641–679. 10.1146/annurev.micro.54.1.641
- van der Meer JR, Sentchilo V: Genomic islands and the evolution of catabolic pathways in bacteria. Curr Opin Biotechnol 2003, 14: 248–254. 10.1016/S0958-1669(03)00058-2
- Sato T, Kobayashi Y: The ars operon in the skin element of Bacillus subtilis confers resistance to arsenate and arsenite. J Bacteriol 1998, 180: 1655–1661.
- Deng W, Liou SR, Plunkett G 3rd, Mayhew GF, Rose DJ, Burland V, Kodoyianni V, Schwartz DC, Blattner FR: Comparative genomics of Salmonella enterica serovar Typhi strains Ty2 and CT18. J Bacteriol 2003, 185: 2330–2337. 10.1128/JB.185.7.2330-2337.2003
- Perna NT, Mayhew GF, Posfai G, Elliott S, Donnenberg MS, Kaper JB, Blattner FR: Molecular evolution of a pathogenicity island from enterohemorrhagic Escherichia coli O157:H7. Infect Immun 1998, 66: 3810–3817.
- Wei J, Goldberg MB, Burland V, Venkatesan MM, Deng W, Fournier G, Mayhew GF, Plunkett G 3rd, Rose DJ, Darling A, et al.: Complete genome sequence and comparative genomics of Shigella flexneri serotype 2a strain 2457T. Infect Immun 2003, 71: 2775–2786. 10.1128/IAI.71.5.2775-2786.2003
- Larsson P, Oyston PC, Chain P, Chu MC, Duffield M, Fuxelius HH, Garcia E, Halltorp G, Johansson D, Isherwood KE, et al.: The complete genome sequence of Francisella tularensis, the causative agent of tularemia. Nat Genet 2005, 37: 153–159. 10.1038/ng1499
- Simpson AJ, Reinach FC, Arruda P, Abreu FA, Acencio M, Alvarenga R, Alves LM, Araya JE, Baia GS, Baptista CS, et al.: The genome sequence of the plant pathogen Xylella fastidiosa. The Xylella fastidiosa Consortium of the Organization for Nucleotide Sequencing and Analysis. Nature 2000, 406: 151–157. 10.1038/35018003
- Kaneko T, Nakamura Y, Sato S, Asamizu E, Kato T, Sasamoto S, Watanabe A, Idesawa K, Ishikawa A, Kawashima K, et al.: Complete genome structure of the nitrogen-fixing symbiotic bacterium Mesorhizobium loti. DNA Res 2000, 7: 331–338. 10.1093/dnares/7.6.331
- Kaneko T, Nakamura Y, Sato S, Minamisawa K, Uchiumi T, Sasamoto S, Watanabe A, Idesawa K, Iriguchi M, Kawashima K, et al.: Complete genomic sequence of nitrogen-fixing symbiotic bacterium Bradyrhizobium japonicum USDA110. DNA Res 2002, 9: 189–197. 10.1093/dnares/9.6.189
- Lawrence JG, Ochman H: Amelioration of bacterial genomes: rates of change and exchange. J Mol Evol 1997, 44: 383–397.
- Klockgether J, Reva O, Larbig K, Tümmler B: Sequence analysis of the mobile genome island pKLC102 of Pseudomonas aeruginosa C. J Bacteriol 2004, 186: 518–534. 10.1128/JB.186.2.518-534.2004
- NCBI Genome Sequence Database [http://www.ncbi.nlm.nih.gov/genomes/MICROBES/Complete.html]
- The Python home site [http://www.python.org/]
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.