Systematic determination of the mosaic structure of bacterial genomes: species backbone versus strain-specific loops
© Chiapello et al. 2005
Received: 05 April 2005
Accepted: 12 July 2005
Published: 12 July 2005
Public databases now contain a multitude of complete bacterial genomes, including several genomes of the same species. The available data offer new opportunities to address questions about bacterial genome evolution, a task that requires reliable, fine-grained comparison data for closely related genomes. Recent analyses have shown, using pairwise whole genome alignments, that it is possible to segment bacterial genomes into a common conserved backbone and strain-specific sequences called loops.
Here, we generalize this approach and propose a strategy that allows systematic and non-biased genome segmentation based on multiple genome alignments. Segmentation analyses, as applied to 13 different bacterial species, confirmed the feasibility of our approach to discern the 'mosaic' organization of bacterial genomes. Segmentation results are available through a Web interface permitting functional analysis, extraction and visualization of the backbone/loops structure of documented genomes. To illustrate the potential of this approach, we performed a precise analysis of the mosaic organization of three E. coli strains and functional characterization of the loops.
The segmentation results including the backbone/loops structure of 13 bacterial species genomes are new and available for use by the scientific community at the URL: http://genome.jouy.inra.fr/mosaic.
Systematic genome comparisons play an increasingly important role in genome analysis and annotation. Two main kinds of approaches are used for whole genome comparisons: whole proteome comparison studies and whole genomic sequence alignment studies. Both are powerful tools for studying genome organization and evolution over different time scales. These approaches have been employed with success in a recent study comparing the genome of the yeast S. cerevisiae to the genomes of three related yeast species [1, 2]. Genome-wide comparative analysis of the yeast chromosomes has considerably improved gene annotation and has permitted the prediction of new motifs conserved in intergenic regions that potentially act as regulatory elements of gene expression.
Whole genome alignment tools have developed considerably in recent years. It is now possible to rapidly align two or more long genomic DNA sequences with several tools such as MultiPipMaker, Vista, Mummer [5–7] and MGA. Some of them include graphical interfaces to display and browse genome alignments [3, 4, 7]. Other resources provide precomputed alignments for genomes of related species, such as EnteriX or Colibase for enterobacteria [9, 10].
Here we focus on whole genomic sequence alignments in the particular case of strains of a single bacterial species. Since the publication of a second strain of Helicobacter pylori in 1999, sequence data on closely related bacterial genomes have rapidly accumulated in public databases. The availability of complete genome sequences for multiple strains of numerous species opens up new perspectives for studying short-term evolutionary processes. For example, pairwise alignment of the complete genomes of the enterohemorrhagic Escherichia coli O157:H7 strains (Sakai or EDL933) with the E. coli K-12 laboratory strain allowed the definition of a 4.1 Mb sequence that was highly conserved between the two strains [12, 13]. It was proposed that this common sequence corresponds to the conserved backbone of the E. coli chromosome, which is interrupted by numerous DNA segments called strain-specific loops, distributed throughout the backbone.
Examination of "mosaic" structures of backbones and loops offers a potential approach to trace the dynamics of genome evolution at the bacterial species level. The backbone, conserved in all aligned genomes of the species, probably corresponds in large part to the common ancestral strain and is the part of the genome under vertical selective pressure. As such, the backbone is also likely to be the most adapted part of the genome, which could be relevant when studying essential functional elements of the cell (such as genes, motifs or signals). Loops differ among strains. Some may correspond to mobile elements, like prophages and insertion sequences, and may be associated with strain-specific pathogenicity. However, little is known about functional elements associated with small loops.
Up to now, no systematic strategy for backbone/loop segmentation has been proposed for closely related bacterial genomes. Existing studies are either limited to pairwise comparisons or choose a reference genome that is compared to several related genomes. Precomputed alignments are often limited to a subgroup of species and use different software and parameters, so that results are generally neither reproducible nor comparable.
In this paper we address the problem of defining a strategy to obtain a backbone/loop segmentation of bacterial genomes at the intraspecies level. This approach is based on two recent genome aligners: Mummer3 and MGA. Using a validated benchmark dataset, we developed a simple treatment of alignment results which permits a robust definition of the mosaic structure. Our approach does not take any genome as a reference and places no restriction on the number of genomes to align. We used our method to define this segmentation for 13 bacterial species. Validated backbone/loop segmentation results are stored in the MOSAIC database and are freely accessible through a user-friendly Web interface. The backbone/loop segmentation determined using three E. coli genomes illustrates important properties of this structure, and indicates that intraspecies segmentation is a useful means of enhancing bacterial genome annotation.
The loop coordinates of the E. coli K-12 and O157:H7 Sakai genomes validated by Hayashi et al. (see Methods) were used as a basis to define an alignment strategy and to develop a treatment of alignment results adapted to bacterial backbone/loop segmentations. These strains are known to belong to distantly related E. coli lineages, and their genomes are more distantly related to each other than genomes within other species. Parameters allowing a pertinent alignment of such different genomes were therefore expected to produce reliable results for more closely related strains. The K-12/Sakai comparisons were performed using different parameters of the MGA software, and those giving the best agreement with the coordinates obtained by Hayashi et al. were chosen. This parameter set was used to produce alignments for all species so that results may be compared.
MGA software provides three types of results: matches (anchored MEMs of a given minimal length), aligned gaps (segments between anchored MEMs, shorter than a user-defined size and aligned with ClustalW) and unaligned gaps. Matches were computed using an iterative process on MEM size: MEMs of at least 50 bp were used for the first MGA step and MEMs of at least 20 bp were computed in the second recursive step. These two kinds of MEMs were included in the backbone of the segmented genomes. The gaps were then treated as follows: gaps longer than 3000 bp (unaligned MGA gaps) were considered as loops, and gaps shorter than 3000 bp were aligned with ClustalW. Aligned gaps with more than 76 % identity were considered as backbone, and the others as loops. The minimal size of loops and backbone segments was set to 20 nucleotides each. This strategy generated a backbone/loop profile of the K-12/Sakai alignment that differed by around 2400 nt (0.1 %) from that validated by Hayashi et al.
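The gap treatment described above can be summarized as a small decision rule. The following Python sketch is only an illustration of that rule, not the code used to build MOSAIC: the `Gap` class, the constant names and the `classify_gap` function are hypothetical, and the handling of a gap of exactly 3000 bp (not specified in the text) is an assumption. The 20-nucleotide minimal segment size is not modelled here.

```python
from dataclasses import dataclass
from typing import Optional

# Thresholds quoted in the text. The names are illustrative only and
# are not MGA or ClustalW option names.
MAX_ALIGNED_GAP_BP = 3000   # longer gaps are left unaligned and become loops
MIN_IDENTITY_PCT = 76.0     # aligned gaps need MORE than this to join the backbone


@dataclass
class Gap:
    """A region between two anchored MEMs (hypothetical container)."""
    length: int                           # gap length in bp
    identity_pct: Optional[float] = None  # identity of the gap alignment; None if unaligned


def classify_gap(gap: Gap) -> str:
    """Return 'backbone' or 'loop' for one gap, following the rule in the text."""
    # Assumption: a gap of exactly 3000 bp is treated like a shorter gap.
    if gap.length > MAX_ALIGNED_GAP_BP or gap.identity_pct is None:
        return "loop"
    # Strictly more than 76 % identity is required for backbone.
    return "backbone" if gap.identity_pct > MIN_IDENTITY_PCT else "loop"


if __name__ == "__main__":
    for g in (Gap(5000), Gap(1200, 90.0), Gap(1200, 60.0)):
        print(g, "->", classify_gap(g))
```

Keeping the thresholds as named constants makes it straightforward to repeat the parameter comparison against a reference set of loops with other values.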
In order to select a subset of genomes for which backbone/loop segmentation makes sense, an analysis using the Mummer package was performed (see Methods). Three categories of results were obtained. The first category includes 33 genomes for which MGA alignments and backbone/loop segmentations are feasible, as they have not undergone numerous or large rearrangements. The second category includes genomes that can be aligned after minor correction of their sequences (Reverse complement and Translation operators, see Methods). This second category concerns 5 genomes (4 species). The last category corresponds to 17 genomes belonging to 8 species: Neisseria meningitidis, Prochlorococcus marinus, Salmonella enterica, Shigella flexneri, Streptococcus pyogenes, Tropheryma whipplei, Xylella fastidiosa and Yersinia pestis. These genomes were excluded because Mummerplot results revealed rearrangements covering a large part of the genome.
Twenty-four genome alignments were generated and treated for backbone/loop segmentation using MGA and our defined set of parameters. These included two quadruple alignments (E. coli, C. pneumoniae), four triple alignments (C. pneumoniae, E. coli, S. aureus, S. pyogenes) and eighteen pairwise alignments. For one species, Buchnera aphidicola, segmentation results were not exploited because coverage was too low (coverage estimates the percentage of total genome length included in the backbone, in this case 40 %; see Discussion).
Segmentation results obtained from MGA alignments and included in the MOSAIC database. For each segmentation result, the first column describes the species and genomes used for segmentation analyses; the number of compared strains is indicated in parentheses. Total loop sizes and loop numbers of each genome are listed in the same order as the strain names, and separated by '+'. Coverage corresponds to the ratio between backbone size and total genome size of a strain; here the mean value for all compared strains is given as a percentage.
Compared genomes (numbers of strains)
Backbone size (Mb)
Cumulative loop size (kb) [Loop number]
C58 Cereon circ X C58 Univ. Wash circ (2)
C58 Cereon lin RC X C58 Univ. Wash lin (2)
Ames X Ames 'Ancestor' (2)
ATCC14579 X ATCC10987 (2)
AR39 RC+TR X CWL029 X J138 X TW183 (4)
CWL029 X J138 X TW183 (3)
CWL029 X J138 (2)
J138 X TW183 (2)
CWL029 X TW183 (2)
AR39 RC+TR X CWL029 (2)
K-12 X Sakai X EDL933 X CFT073 (4)
K-12 X Sakai X CFT073 (3)
26695 X J99 (2)
EGD X 4b F2365 (2)
CDC1551 X H37Rv (2)
MW2 X MU50 X N315 (3)
2603V/R X NEM316 (2)
R6 X TIGR4 (2)
M1GAS X MGAS315 X MGAS8232 (3)
M1GAS X MGAS315 (2)
M1GAS X MGAS8232 (2)
YJ016 K2 X CMCP6 K2 TR (2)
YJ016 K1 RC X CMCP6 K1 TR (2)
The number of loops in a segmented genome also appeared to be highly variable among bacterial species, ranging from 6 (Chlamydophila pneumoniae strain CWL029 compared to strain TW183) to 2878 (Bacillus cereus, strain ATCC14579 compared to ATCC10987). The results in Table 1 revealed two extreme situations. Some species have a few very long loops, such as Agrobacterium tumefaciens (24/25 loops for the circular chromosome, mean loop length around 28 kilobases). Others (Bacillus cereus) contain a large number of short loops (mean length around 400 bases). These differences will need to be examined further in detail, in relation to genome annotations.
A more precise analysis of the segmentation results from the comparison of E. coli strains K-12, O157:H7 Sakai (named Sakai below) and CFT073 (named CFT below) was performed. A 3.73 Mb backbone (exhibiting more than 97 % identity between the three strains) and three sets of strain-specific loops (of very different total length) were identified. The K-12 genome included 827 K-12 loops (total length 0.9 Mb, 20 % of the K-12 genome), the Sakai genome 795 Sakai loops (total length 1.8 Mb, 33 % of the Sakai genome) and the CFT genome 770 CFT loops (total length 1.5 Mb, 29 % of the CFT genome). The differences in total loop size are in keeping with the different total genome sizes of the three strains (K-12: 4.6 Mb; Sakai: 5.5 Mb; CFT: 5.2 Mb).
Size distribution of loops (in bp) obtained from segmentation of the E. coli genomes K-12, O157:H7 Sakai (SAK) and CFT073 (CFT). Minimal size (Min), Mean size, Maximal size (Max), First Quartile (1st Qu.), Median size, and Third Quartile (3rd Qu.) are shown.
The distribution of functional elements in the backbone/loop structure was analyzed. Functions identified by classical annotations of bacterial genomes, i.e. genes, tRNA, rRNA, phages and Insertion Sequences (IS), were considered first. As expected, tRNA and rRNA were mainly present in the backbone. One exception concerns a rather large proportion of tRNA present in the Sakai loops (27 %) compared to K-12 (9 %) and CFT (13 %) loops. The Sakai strain contains 18 specific tRNAs not present in K-12; Hayashi et al. observed that these tRNAs recognize codons which are used with an increased frequency in Sakai loops. Not surprisingly, we observed that phages and IS are almost entirely included in loops (>98 %). The ten longest loops of K-12 correspond systematically to known phages or phage remnants of the E. coli genome.
Distribution of BIME (in percent of length) in backbone and loops regions of the E. coli K-12 genome, as determined from the triple K-12, Sakai and CFT073 alignment.
Studying the backbone and loops of bacterial genomes is an efficient way to distinguish the two major modes of evolution acting on bacterial genomes. The backbone may be considered as the part of the genome subject to vertical long-term evolution. Backbones are very similar for closely related strains, and variability comes mainly from point mutations or insertions/deletions of oligonucleotides. The loop population (defined in MOSAIC as variable regions of 21 bp or more) is more heterogeneous: the number of loops and the average loop length vary greatly from one species to another (Table 1). Loops can be viewed as elements arising from short-term evolutionary processes. One such process is horizontal transfer. For example, acquisition/loss of distinct prophage sets seems to be a rapid process, which can be observed between closely related strains. Significantly, for some genomes, phages are the major contributors to loop length. Eleven loops of E. coli K-12 are associated with phages and constitute 24 % of the total E. coli K-12 loop length. A contrasting example is found in H. pylori: this species does not contain prophages, although it contains large loops that may be associated with pathogenicity islands. Our results indicate that medium-size loops (on the scale of a gene) consist, at least in part, of known variable elements of bacterial genomes such as Insertion Sequences. The relatively large number of short loops found in some species (E. coli, B. cereus) is quite surprising. Such small loops may be due to replication errors ('copy-choice' of DNA polymerase, slippage mechanism), which can generate small insertions or deletions, or may correspond to highly polymorphic regions. In contrast to large or medium-size loops, these shorter loops are likely to have arisen from events other than horizontal transfer.
Alignments including more than two genomes generally yield a more robust but smaller backbone than pairwise alignments, because a larger set of genome variations is taken into account. In the future, ten or more genomes will be available for some species. One possible consequence is that the backbone length will shrink steadily as new strain genomes are added. In that case, the backbone could be redefined as, for example, the subset of chromosomal regions present in at least half of the strains. Alternatively, the backbone size may decrease but rapidly reach a minimal size, which will remain stable even when new strain genomes are added to the alignment.
As a consequence of multiple comparisons, loop populations are greater and more heterogeneous. They include for example elements present in only one genome (which may correspond to acquisition of a very specific characteristic by one strain), elements present in a subset of strains or elements present in all genomes but one (which may correspond to a deletion in one strain). It will be important to systematically classify loops obtained from multiple comparisons in order to facilitate their identification through the MOSAIC interface.
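One simple way to classify loops from a multiple comparison is by the set of strains in which the corresponding sequence is present. The sketch below illustrates the three situations listed above; the function name, the category labels and the input representation (sets of strain names) are assumptions made for the example and do not describe the MOSAIC schema. In a pairwise comparison "present in one genome" and "present in all but one" coincide, and the sketch then reports the first category.

```python
def classify_loop(present_in: set, strains: set) -> str:
    """Classify a loop by the strains that carry it (illustrative labels).

    present_in: names of the strains in which the loop sequence is found.
    strains:    names of all strains in the alignment.
    """
    if not present_in <= strains:
        raise ValueError("loop reported in a strain outside the alignment")
    n, total = len(present_in), len(strains)
    if n == 0 or n == total:
        # Absent everywhere is meaningless; present everywhere is backbone.
        raise ValueError("a loop must be present in some but not all strains")
    if n == 1:
        return "single-strain"   # e.g. a strain-specific acquisition
    if n == total - 1:
        return "all-but-one"     # e.g. a possible deletion in one strain
    return "subset"


if __name__ == "__main__":
    ecoli = {"K-12", "Sakai", "CFT073"}
    print(classify_loop({"K-12"}, ecoli))
    print(classify_loop({"K-12", "Sakai"}, ecoli))
```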
To estimate the importance of loops corresponding to DNA present in the common ancestor but lost in one of the compared strains, a preliminary study was performed: all sequences present in the K-12 loops (from the triple alignment) were compared by BLAST against the Salmonella typhimurium genome (considered as the outgroup), and matching sequences present in the same genomic environment were considered as "ancestral loops". Ten loops, corresponding to a total length of 3658 bp, matched this criterion. This suggests that only a minor subset of the loops corresponds to deletions that occurred in either the E. coli Sakai or CFT genomes.
Some genomes of species like Buchnera aphidicola were too divergent to be segmented with our procedure. In fact, these genomes are clearly atypical in terms of evolutionary distance within a species: despite complete colinearity of their genomes, the B. aphidicola Sg and Ap genomes display a high degree of divergence at the nucleotide level, making them as different as the E. coli and S. typhimurium genomes. Comparing the Sg and Ap genomes is thus almost the same as comparing different species, but would be possible by adapting the alignment parameters. This raises the question of bacterial species definition: evolutionary distances within a species and between species are very heterogeneous. For example, it has recently been confirmed that Shigella is phylogenetically indistinguishable from E. coli. Our method could also be extended to bacterial species in which numerous chromosomal rearrangements have occurred, using recently developed genome aligners such as MAUVE. Intra-species comparison of divergent and/or rearranged genomes will open the way to segmentation of genomes from different, but closely related species.
To our knowledge, this work is the first study allowing systematic mosaic genome segmentation of all available strains (ranging from two to four) in 13 bacterial species. Examination of the backbone of a bacterial species should greatly facilitate refinement of gene annotation and prediction of conserved sites with potential regulatory roles. Examination of the gene content in loops is important for identification of putative horizontally transferred genes. Genes adapted to specific ecological environments or involved in pathogenicity of a specific strain should also be found in the strain-specific loops. Indeed, the ASAP database (A Systematic Annotation Package for community analysis of genomes) recently added the feature types 'island' and 'conserved_segments' in order to provide lists of regions that are specific or common to the two E. coli K-12 and O157:H7 genomes.
Genome aligners were used to build a robust strategy for bacterial genome segmentation. Backbone/loops structures were systematically determined for 38 bacterial genomes. The MOSAIC resource makes it easy to visualize, annotate, and analyse loops and backbone segments of these genomes. First analyses reveal a surprising diversity in the number of loops from one species to another. In addition some species accumulate a large number of short loops, unsuspected previously.
Complete bacterial genomes were downloaded from the NCBI microbial genome database: http://www.ncbi.nlm.nih.gov/genomes/lproks.cgi?view=1, version of 06/24/2004. Twenty one species (55 genomes) for which genome sequences of at least two different strains are available were selected for analysis: Agrobacterium tumefaciens, Bacillus anthracis, Bacillus cereus, Buchnera aphidicola, Chlamydophila pneumoniae, Escherichia coli, Helicobacter pylori, Listeria monocytogenes, Mycobacterium tuberculosis, Neisseria meningitidis, Prochlorococcus marinus, Salmonella enterica, Shigella flexneri, Staphylococcus aureus, Streptococcus agalactiae, Streptococcus pneumoniae, Streptococcus pyogenes, Tropheryma whipplei, Vibrio vulnificus, Xylella fastidiosa, and Yersinia pestis [see Additional file 1].
In the first step, the subset of genomes for which it is possible to define a reliable backbone was identified using the mummer and mummerplot programs of the Mummer3 package. First, all Maximal Exact Matches (MEMs, not necessarily unique) of at least 20 bp in both forward and reverse strands of the compared genomes were computed using the mummer program. Results for each pair of sequences were then visualized using the mummerplot program. This graphical visualization was used to decide whether a common backbone could be defined for the considered genomes. In several cases, this step led us to adjust one of the genomes before the segmentation step. Two operators were defined: the reverse complement operator, named RC, and the translation operator, named TRx, where x indicates that bases from position 1 to x were transferred to the end of the genome. The number x of bases shifted to the end of the genome was determined by the position of the first aligned MEM detected by MGA between the two genomes. These operators allowed us to assign the same strand and the same start position to all compared genomes. They were applied to a subset of genomes before alignment with MGA software. Genomes in which mummerplot detected rearrangements covering more than half of the total length were excluded at this step. They cannot be handled properly by MGA and would therefore lead to inaccurate segmentation.
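The two operators can be illustrated on a plain sequence string. This is a minimal Python sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, positions are taken as 1-based so that TRx moves the first x bases, and any character other than A, C, G or T (for example N) is left unchanged by the complement step.

```python
# Complement table for unambiguous bases; other symbols pass through unchanged.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse_complement(seq: str) -> str:
    """RC operator: complement every base, then reverse the sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def translate_origin(seq: str, x: int) -> str:
    """TRx operator: move bases 1..x (1-based, inclusive) to the end.

    For a circular chromosome this only changes where the linear
    representation starts; the sequence content is unchanged.
    """
    if not 0 <= x <= len(seq):
        raise ValueError("x must lie between 0 and the genome length")
    return seq[x:] + seq[:x]


if __name__ == "__main__":
    genome = "AACGTTGC"  # toy sequence, not real data
    print(reverse_complement(genome))
    print(translate_origin(genome, 3))
```

Applying `reverse_complement` followed by `translate_origin` would correspond to an entry labelled "RC+TR" in the list of compared genomes, assuming the operators are applied in that order.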
The second step was to use the MGA software to perform whole genome alignments on the subset of selected genomes and to define backbone and loops. MGA is a powerful multiple genome aligner which presents two major advantages. First, it performs simultaneous multiple alignments based on the selection of MEMs (Maximal Exact Matches present in all aligned genomes), without considering any genome as the reference. Second, a consistent and robust backbone for the aligned genomes can be defined using its MEM anchoring algorithm followed by treatment of gaps (i.e. regions between the anchored MEMs). Parameters used in MGA were adjusted by comparison with a manually curated reference set of loops of two E. coli strains: K-12 and O157:H7 Sakai. To build this reference set, after Mummer 1 alignment, backbone/loop junctions were extracted and systematically aligned using the fasta3 algorithm. Each alignment was checked by eye and, in many cases, the backbone sequence was extended by a few to several base pairs [Pr. T. Hayashi, personal communication]. Further analysis using whole genome PCR scanning confirmed that the loops longer than 500 bp are indeed variable elements. A simple treatment of MGA alignment results was developed to define the boundaries of loops and to enhance their concordance with this manually determined pairwise reference dataset [see 'Results' section, 'Validation of segmentation parameters' subsection].
Results of MGA alignments were generated in XML format. Backbone/loop segmentations were processed with a Perl script using the SAX module for XML parsing. For each aligned genome, backbone and loop coordinates were computed and coverage (length of the backbone divided by total length of the genome) was calculated. Results were then integrated into the MOSAIC relational database, which was implemented using the PostgreSQL relational database system. The MOSAIC relational model is generic and not dedicated to any alignment tool or genome species. The Web interface was also written in Perl, using standard modules for database access (DBI module for DataBase Interface) and dynamic pages (CGI module for Common Gateway Interface). Different graphical visualizations of the backbone/loop structure were developed using the MuGeN software and the cirdna program, which is part of the EMBOSS package.
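The coverage calculation and the derivation of loop coordinates from backbone coordinates can be sketched as follows. The original processing was done in Perl; this Python version is only an illustration under stated assumptions: the function names are hypothetical, backbone segments are given as 1-based inclusive, non-overlapping (start, end) pairs, and the genome is treated as a linear coordinate range, so a loop spanning the origin of a circular chromosome appears as two pieces.

```python
def coverage(backbone_segments, genome_length):
    """Length of the backbone divided by total genome length."""
    backbone_len = sum(end - start + 1 for start, end in backbone_segments)
    return backbone_len / genome_length


def loops_from_backbone(backbone_segments, genome_length):
    """Return loop coordinates as the regions not covered by the backbone."""
    loops = []
    pos = 1  # next position not yet assigned to backbone or loop
    for start, end in sorted(backbone_segments):
        if start > pos:
            loops.append((pos, start - 1))
        pos = end + 1
    if pos <= genome_length:
        loops.append((pos, genome_length))
    return loops


if __name__ == "__main__":
    # Toy coordinates for illustration, not real segmentation results.
    backbone = [(1, 40), (61, 100)]
    print(loops_from_backbone(backbone, 120))
    print(round(100 * coverage(backbone, 120), 1), "% coverage")
```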
We thank C. Hennequet-Antier for her assistance in using R software and Dr A. Gruss for many helpful discussions. We are indebted to Professor T. Hayashi and Dr K. Kurokawa for sharing data before publication and helpful discussions. This work is supported in part by the "ACI IMPbio" grant from the French Ministry of Research.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.