Global features of sequences of bacterial chromosomes, plasmids and phages revealed by analysis of oligonucleotide usage patterns

Background Oligonucleotide frequencies were shown to be conserved signatures for bacterial genomes, however, the underlying constraints have yet not been resolved in detail. In this paper we analyzed oligonucleotide usage (OU) biases in a comprehensive collection of 155 completely sequenced bacterial chromosomes, 316 plasmids and 104 phages. Results Two global features were analyzed: pattern skew (PS) and variance of OU deviations normalized by mononucleotide content of the sequence (OUV). OUV reflects the strength of OU biases and taxonomic signals. PS denotes asymmetry of OU in direct and reverse DNA strands. A trend towards minimal PS was observed for almost all complete sequences of bacterial chromosomes and plasmids, however, PS was substantially higher in separate genomic loci and several types of plasmids and phages characterized by long stretches of non-coding DNA and/or asymmetric gene distribution on the two DNA strands. Five of the 155 bacterial chromosomes have anomalously high PS, of which the chromosomes of Xylella fastidiosa 9a5c and Prochlorococcus marinus MIT9313 exhibit extreme PS values suggesting an intermediate unstable state of these two genomes. Conclusions Strand symmetry as indicated by minimal PS is a universally conserved feature of complete bacterial genomes that results from the matching mutual compensation of local OU biases on both replichors while OUV is more a taxon specific feature. Local events such as inversions or the incorporation of genome islands are balanced by global changes in genome organization to minimize PS that may represent one of the leading evolutionary forces driving bacterial genome diversification.


Background
The analysis of oligonucleotide usage (OU) biases is a useful approach to study bacterial genome organization [1][2][3][4][5]. A number of computational approaches have been designed to visualize pathogenicity regions [2,[6][7][8] or to separate sequences of diverse origins by differential oligonucleotide composition [1,3,5,9,10]. These methods were based on analysis of OU variances and departures from expectations. Di-, tri-and tetranucleotide frequencies were shown to be conserved signatures for individual genomes [1,4,5,11,12] albeit the underlying biological and physicochemical constraints that lead to the over-and underrepresentation of oligonucleotides are only partially known and understood. Mutagenesis and repair, restriction-modification systems, amino acid frequency, codon bias and structural parameters, to name just a few, should all influence the frequency and sequence of oligonucleotides. For example, DNA flexibility and curvature has been related to the nucleotide sequence by trinucleotide bendability, dinucleotide base stacking energy and propeller twist angle [13].
OU biases are characteristic features of a bacterial genome [4,11], but according to Chargaff's first and second parity rule [14] the complementary strands are believed to be symmetric. However, with more and more complete bacterial genome sequences at hand, the AT and GC composition are meanwhile known to be skewed between leading and lagging strand [15,16] and the extrapolation of the first-order parity rule to higher orders of oligonucleotide composition has yet not been investigated in sufficient depth to draw any valid conclusions. The only published evidence that strand symmetry extends from base composition (first order) to oligonucleotides (higher orders) has been provided by numerical analysis of the sequence of human chromosome 22 [12]. According to this data strand symmetry does not result from a single cause, but rather seems to emerge from numerous mechanisms [12]. We were curious to know whether or not strand symmetry of higher order is a universal feature in the microbial world. Physicochemical constraints operating at the oligonucleotide level could lead to higher order strand symmetry at the local and/or global level. Hence we determined strand symmetry and its relationship to oligonucleotide composition in a comprehensive collection of completely sequenced bacterial chromosomes, plasmids and phages.

Results and Discussion
DNA structure and oligonucleotide usage First, we wanted to know whether an association exists between structural features of DNA and the frequency of particular tetranucleotide words in bacterial genomes. DNA structural features have been related from theoretical calculations and empirical measurements to di-or trinucleotide scales such as base stacking energy [17], propeller twist angle [18], protein deformability [19], position preference [20] and bendability [21]. By permutation analysis the 256 tetranucleotides were assigned to 39 equivalence classes each of which characterized by the same values for the five scales mentioned above [13].
The deviations ∆ w of observed from expected counts (see eq. 6 in 'Methods' section) were determined for all 256 tetranucleotides on one clockwise strand of four bacterial genomes and then sorted by decreasing stacking energy with class 39 (CGCG, GCGC) and 32 (ATAT, TATA) having the highest and lowest energy, respectively. The preference of tetranucleotides for low, intermediate or high base stacking energy was found to be a taxon-specific feature. For example, whereas the GC-rich Pseudomonas putida KT2440 prefers the extremes with low and high stacking energy, the selection of tetranucleotides in the AT-rich Bacillus subtilis 168 is biased towards low stacking energy (Fig. 1). All analyzed genomes, however, shared the feature that virtually all tetranucleotide words within the same class were distributed with similar taxon-specific frequency. In other words, tetranucleotides with matching structural features apparently occur with comparable frequency in a bacterial genome. The high-order parity of complementary DNA strands has been postulated previously, implicitly or explicitly, as the consequence of firstorder symmetry [5,12,22,23], but was not substantiated by comprehensive analysis of real sequence data. The reader should note that an oligonucleotide and its reverse complement always belong to the same equivalence class. This tendency towards intrastrand parity of complementary oligonucleotides will consequently lead to a symmetric distribution of oligonucleotides on both strands. To verify the universality of this phenomenon, we systematically analyzed a comprehensive selection of 155 bacterial chromosomes, 104 phages and 316 plasmids by the calculation of pattern skew (PS), a global genome parameter of intrastrand disparity. The values determined for all analyzed genomes are presented in additional data files [see additional data files 1,2 and 3 for chromosomes, plasmids and phages, respectively].

OU pattern skew
The PS values of n0_4mer patterns (definition given in the 'Methods' section) determined for all sequenced bacterial chromosomes were in the range of 1 to 8% ( Fig. 2A In shorter sequences both the PS mean values and standard deviation increase exponentially due to size limitation of the sequences. Regression equations were calculated from the empirical data to simulate this dependence for average PS (PS avr ):   Lamboid phages PS PS the genome and correspondingly the smaller the frequency of the individual oligonucleotide, the larger the variance. The portion of noise increases exponentially in shorter sequences making the oligonucleotide distribution parameters closer to the characteristics of a random sequence [24].
Strand symmetry was found to be a global feature of bacterial chromosomes, but this did not apply to all plasmids and phages: PS values varied over a large range ( Fig. 2A). PS values of circular and linear plasmid and phage genomes overlapped (data not shown Higher PS was found in so-called conjugative genome islands such as pKLC102 [26], R100, R391 [27] and virulence plasmid pO157 [28] and pADAP [29] which share a mosaic genome structure of plasmid and phage origins. Another example of this transition zone of phages and plasmids are the plasmid pBClin15 of Bacillus cereus and the phage Bam35c of Bacillus thuringiensis. Both shares an almost identical gene repertoire, all genes are leftward oriented [30] and in both cases PS values are beyond the 95% confidence limit. The majority of phage genomes exhibit higher PS values than plasmids or chromosomes [see additional data file 3]. Most head-and-tail bacteriophages have a genome sequence length of around 50 kbp that fits to capsid size [31] ( Fig. 2A). Short tail Podoviridae, such as streptococcal phage C 1 [32] and bacillar phage GA-1, have symmetric OU, while PS increases in long tail lamboid Siphoviridae up to extreme values of 60-64% in SPP1, T3, T7, phiYe03-12 phages. A large portion of the genome of long tail phages is non-coding and serves to anchor the DNA within the capsular head and tail [33,34]. Thus in SPP1 phage only 32 kbp out of 47 kbp are coding sequences and all genes are transcribed from one strand [35]. The highest PS value of 82% was observed in the 5,415 bp single strand DNA enterobacterial phage G4.
Symmetric OU is characteristic for Myoviridae T4 and RM 378 and for mycobacterial Siphoviridaea D29, L5 and Bxb1. The genomes of the former have a symmetric structure of two arms of leftward and rightward transcribed genes. However, in the close relative mycobacteriophage TM4 all genes are transcribed in the rightward direction [36] and its PS value for the n0_4mer pattern is threefold larger than that of D29, L5 and Bxb1. In conclusion, asym-metric genome topology and/or the presence of large inserts of less conserved non-coding DNA stretches apparently cause strand specific OU asymmetry in some plasmids and most phages.
Next, we wanted to test whether the association of PS values being higher for non-coding than for coding sequences does not only apply to phages, but also to bacterial chromosomes. PS values of non-coding regions of tested bacterial chromosomes were higher than PS values of coding sequences (Table 1), however, this higher PS is just caused to major extent by its shorter total length in the genome. The comparatively higher PS values are within expectation for a shorter sequence for most genomes ( Fig.  2A). Exceptions were the two outliers X. fastidiosa 9a5c and P. marinus MIT9313 (see above) and chromosomes of Campylobacter jejuni NCTC 11168, Corynebacterium glutamicum ATCC 13032 and Xylella fastidiosa Temecula1.
In other words, the same underlying principles shape oligonucleotide distributions in coding and non-coding sequences in bacterial genomes; i.e. codon usage is not a major determinant for global PS. A bias of coding sequences to one strand as seen in phages, however, leads to increased PS according to their other lifestyle and evolution.
Next, the hypothesis was tested whether global strand symmetry of bacterial chromosomes also extends to local regions. The local PS values were calculated for 15 selected chromosomes (Table 1). PS values of n0_4mer patterns were determined for 100 randomly generated arbitrary loci of 5 to 1,000 kbp in size. These local patterns exhibited significantly higher PS than complete genomes of the same size for 12 of 15 analyzed chromosomes (P < 10 -6 in all 12 cases, χ 2 tests, P values corrected for multiple testing; the genome of Mycoplasms pulmonis UAB CTIP was not applicable for this test due to its small genome size). Figure 2B displays the local patterns randomly generated from the Escherichia coli K12 and Bacillus subtilis 168 chromosomes. The majority of PS values of chromosomal segments were higher than the 95% confidence interval predicted by equations 1 and 2 for genome sequences. In Aquifex aeolicus, Bradyrhizobium japonicum, and Streptomyces coelicolor, however, local PS was only slightly higher than in the whole chromosome (Table 1) indicating strand symmetry throughout the chromosome. In all other selected species pronounced strand asymmetry was determined to differential genome-specific extent (see X. fastidiosa TemeculaI as an extreme example, Table 1), but these local skews of tetranucleotide usage are mutually compensated by other regions so that PS skew of the whole chromosome is reduced to 1-8%.
Normalization of OU frequencies by mononucleotide content typically had either no effect or increased PS, whereby in case of the latter the OU variance decreased. For instance, the mean values of PS of n0_4mer and n1_4mer patterns of the analyzed bacterial chromosomes [see additional data file 1] were 3.4 ± 3.0% and 3.7 ± 2.0%, respectively. In some cases, however, the normalization significantly decreased PS. On Fig. 2A  Next, we wanted to know whether PS changes with the length of oligomers. PS values increased with the length of oligonucleotide words in all examined bacterial chromosomes to an extent that is consistent with the increase of numbers of n-mers and concomitant random fluctuations between expected and observed counts of words; i.e. strand symmetry is maintained to similar extent for di-, tri-, tetra-and pentanucleotides. However, in the two outliers X. fastidiosa 9a5c and P. marinus MIT9313 with the largest n0_4mer PS values (additional data file 1, Fig. 2A) PS decreased with increasing word length (Fig. 3). This behaviour could be attributed in both strains to extreme dinucleotide skew that is diluted in longer words. In case of the three other outliers C. tetani E88, N. europaea ATCC 19718 and H. ducreyi 35000 HP di-, tri-and tetranucleotide skews all contribute to the pronounced pattern skew. Since we did not observe any trend towards common oligonucleotides in all five strains that account for the skew but rather individual patterns that are characteristic for the peculiar strain, we conclude that high pattern skew is a strain specific feature.
Asymmetric OU in bacterial chromosomes may reflect asymmetric genome topology and/or multiple inserts of foreign DNA with high PS. To test this hypothesis, the genomes of X. fastidiosa 9a5c and P. putida KT2440 were selected for a comparative study of OU pattern consistency. Both genomes are characterized by multiple genome islands and unequal lengths of the clockwise and counterclockwise replichors [37,38]. The standard n0_4mer OU patterns were defined for the leading strands of both chromosomes. Next, the local n0_4mer patterns were determined in 15 kbp sliding windows in steps of 7.5 kbp and compared against the standard genome-wide patterns. The X. fastidiosa 9a5c chromosome is characterized by large local variations of OU patterns and a qualitatively different behaviour of the clockwise and counterclockwise replichors (Fig. 4). We calculated the distances D (see eq.8 in 'Methods' section) of n0_4mer patterns between the clockwise and counterclockwise replichor and between the clockwise and the reverse complement of the counterclockwise replichor to be 44.09% and 9.03%, respectively. Compared to X. fastidiosa 9a5c, the OU pattern of the P. putida KT2440 genome is smoother and more uniform in spite of many genome islands with atypical OU patterns scattered throughout the chromosome (Fig. 4). OU pat- terns of the clockwise and counterclockwise replichors were different but complementary to each other. Thus, in P. putida KT2440 the distance of n0_4mer patterns between the clockwise and counterclockwise replichors was 10.43%, while that between the clockwise and the reverse-complement of the counterclockwise replichors was only 1.96%. Notably, despite the length difference, the counterclockwise replichor manages to compensate the mirror OU skew in the clockwise strand due to its greater OU variance in the shorter replichor.

OU variance
OUV is strongly dependent on GC-content [7,39]. If the data from sequenced bacterial chromosomes are superimposed, a parabolic curve is generated with the minimum of n0_4mer OUV at 50 mol% GC [39]. Normalization of OU patterns by mononucleotide content significantly reduced this bias, however, did not remove it completely. OUV values determined for n1_4mer patterns of 155 bacterial chromosomes varied from 0.05 to 0.5. According to regression analysis average values (OUV avr ) and variances (σ OUV ) of OUV retained some parabolic dependence on mol% GC (X GC ): OUV avr has its minimum at 43% GC and σ OUV at 57% GC content. In GC-rich and AT-rich sequences OU is strongly biased leading to higher OUV. Moreover, the lower variance of OUV values in sequences of high GC-content (Fig.  5) suggests that the use of GC-rich words is more biased than that of AT-rich words, presumably due to the extreme values of base stacking energy, propeller twist and protein deformability [13].
The OUV values of all studied bacterial chromosomes and 3-sigma limits of their variations are shown in Fig. 5. In general strains of the same species or genera have similar OUV [see additional data file 1]. No links between OUV and PS values were observed. The strains P. marinus MIT9313 and X. fastidiosa 9a5c with atypical PS have similar OUV as their close relatives.

Conclusions
Two global genome features based on OU statistics were considered in this study: PS and OUV. They provide nonredundant characteristics of the complete sequence of genomes and allow the discrimination of bacterial, plasmid and phage genomes by phylogeny, the arrangement of coding and non-coding sequence and the distribution of islands and islets.
A strong taxonomic signal was observed in genome specific OUV values. Strains belonging to the same species or genus usually have similar OUV. In general, the higher is the OUV, the less random is the sequence. Multiple influences such as DNA structure and topology, codon usage, DNA repair and restriction-modification systems contribute to the surrogate parameter OUV, and hence it is plausible that the OUV is a taxon-specific feature. Future work on the frequency and distribution of individual words should elucidate the biological meaning of the genome specific OUV for the individual taxon (see Weinel et al., 2002 [40] as one of the few published examples).
The major finding was that words and their reverse complements occur with similar frequencies in the complete sequence, because they share the same structural features Relation between PS and the length of oligonucleotide words  . .
in terms of DNA flexibility and curvature thus generating strand symmetry. However, in local regions PS is substantially more variable. Local PS values, for example, were up to 70% in the X. fastidiosa Temecula1 genome (see Table  1), that means the complementary words were deliberately counterselected in these sequences, but even in this extreme case the global PS of the chromosome remained low. Local asymmetric OU patterns of leading and lagging strands of DNA mutually compensate each other restoring in that way the OU symmetry of the complete sequence. Notably, OU symmetry was retained in sequences with replichors of disproportionate length and multiple horizontally acquired genome islands like in the P. putida KT2440 genome. Low PS is a feature of complete genomes. This conclusion is drawn from the fact that the PS values of plasmid genomes and whole chromosomes are smaller than those of arbitrarily selected chromosomal loci of the same size (see Fig. 2B, Table 1).
High PS was predominant for phages and conjugative genome islands that corresponds to their dual lifestyle to either exist in the episomal state or to be integrated into the chromosome. In addition, the high PS correlates with the known high mutation rates of these genetic elements and an intensive horizontal exchange that is an important component of their evolution [26,[41][42][43]. Among the analyzed bacteria only the chromosomes of X. fastidiosa 9a5c and P. marinus MIT9313 show high PS which in both cases is caused by dinucleotide skew of four complementary pairs (TT/AA, GG/CC, GT/AC and TG/CA). This dinucleotide skew led to large fluctuations of local OU patterns, high global PS that is not compensated by the replichors (Fig. 4) and loss of intrastrand parity which should give rise to an uncommon, probably unstable chromosome structure. Future experimental work may Deviations of oligonucleotide usage patterns in local loci of two bacterial chromosomes The OUV values defined for n1_4mer patterns of 155 bacte-rial chromosomes and plotted against the mol% GC content Figure 5 The OUV values defined for n1_4mer patterns of 155 bacterial chromosomes and plotted against the mol% GC content. Curve lines depict the boundaries of the 95% confidence interval of OUV variation determined by eq.3 and 4. show whether or not such a status of the genome is associated with reduced fitness and/or more rapid evolution.
In summary, referring to the comparative analysis of PS in local regions and complete genomes shown in Fig. 2, the proportional usage of words and their reverse complements, i.e. strand symmetry is important for genome stability, and there could exist yet unknown housekeeping mechanisms to control compensating matching of OU patterns of different loci of a sequence. Local events such as inversions or the incorporation of a genome island are balanced by global changes in genome organization to minimize pattern skew. This response of the genome to local perturbations may represent one of the leading evolutionary forces that drive bacterial genome diversification and speciation.

Overall DNA properties
Intragenomic GC-content and GC-skew variations were determined as quantities of (G+C) and (G-C)/(G+C), respectively, averaged over a sliding window of certain length.

Evaluation of oligonucleotide patterns
Overlapping oligonucleotide words of a certain length l w were counted in the sequence of L seq nucleotides by shifting the window in steps of 1 nucleotide. The total word number (W total ) is L seq -l w in a linear sequence or W total =L seq in a circular sequence. Since L seq >>l w , W total ≅ L seq in all cases. For a given word length l w , N w = 4 lw different words are possible for a sequence of four letters A, T, G and C. The observed counts of words (C o ) were compared with the expected counts of words (C e ). Assuming the same distribution frequency for all words of a common length l w irrespective of their composition and sequence, C e matches the standard count number C n0 Correspondingly, if we normalize oligonucleotide usage (OU) by mononucleotide content using zero-order Markov method [44], C e becomes The deviation ∆ w of observed from expected counts is given by In the present work we used the following abbreviations for the different types of patterns: type_l w mer. Types are called 'n0', if they are not normalized by mononucleotide frequency, or 'n1', if they are normalized by the zero-order Markov method. For example, the non-normalized trinucleotide usage pattern is a n0_3mer type, the normalized pentanucleotide usage pattern is a n1_5mer type.
Variance OUV of word deviations were determined as following:

Pattern comparison and pattern skew
For the comparison of sequences by OU patterns of the same type, the words in each sequence were ranked by ∆ w values according to equation 6. Rank numbers instead of word counts were used to simplify pattern comparison and to remove sequence length bias. Assuming that 95% of all words should occur at least ten times in a random sequence, the threshold for the minimum length of the sequence was chosen to be 0.3, 1.2, 5 and 20 kb for di-, tri-, tetra-and pentanucleotides, respectively.
The distance D between two patterns was calculated as the sum of absolute distances between ranks of identical words in patterns i and j as follows: where D max is the maximal distance that is theoretically possible between two patterns of l w long words (equation 9). D min is the minimal distance between two patterns. The minimal distance is zero for two independent sequences, but has a positive value for the two complementary strands of the same DNA sequence, because the OU patterns designed for both strands of the same DNA molecule cannot be identical. The pattern skew (PS) describes this distance between opposite strands of the same DNA and is a measure of OU asymmetry. The minimal theoretical distance between two patterns of opposite strands is realized if the words and their reverse complements are distributed with similar frequencies in the sequence and it is D min = 4 lw , if l w is an odd number (10a) but D min = 4 lw -2 lw , if l w is an even number (10b). because palindromes, which occur in both strands with the same frequency, only exist in words with an even number of nucleotides and the total number of all possible palindromes is 2 lw .

Data mining and storage
A computational program for determining OU patterns, their comparative analysis and storage in a database was written on Python 2.2 [45]. Sequences of 155 bacterial chromosomes including eubacterial, archaeal and cyanobacterial genomes, 316 plasmids and 104 phages published in NCBI database [46] and the plasmid genome database [47] were analyzed in this study. Regression analysis has been done using DataFit7.1.44 software.
Coding and non-coding sequences were selected by gene coordinates provided in the NCBI database [46]. Intergenic spacer regions shorter or equal 50 bp were included in the coding sequence of a genome, while longer spacer regions were concatenated into the non-coding sequence.