SynteBase/SynteView: a tool to visualize gene order conservation in prokaryotic genomes
BMC Bioinformatics volume 9, Article number: 536 (2008)
It has been repeatedly observed that gene order is rapidly lost in prokaryotic genomes. However, persistent synteny blocks are found when comparing more or less distant species. These genes that remain consistently adjacent are appealing candidates for the study of genome evolution and a more accurate definition of their functional role. Such studies require visualizing conserved synteny blocks in a large number of genomes at all taxonomic distances.
After comparing nearly 600 completely sequenced genomes encompassing the whole prokaryotic tree of life, the computed synteny data were assembled in a relational database, SynteBase. SynteView was designed to visualize conserved synteny blocks in a large number of genomes after choosing one of them as a reference. SynteView functions with data stored either in SynteBase or in a home-made relational database of personal data. In addition, this software can compute on-the-fly and display the distribution of synteny blocks which are conserved in pairs of genomes. This tool has been designed to provide a wealth of information on each positional orthologous gene, to be user-friendly and customizable. It is also possible to download sequences of genes belonging to these synteny blocks for further studies. SynteView is accessible through Java Webstart at http://www.synteview.u-psud.fr.
SynteBase answers queries about gene order conservation and SynteView visualizes the obtained results in a flexible and powerful way which provides a comparative overview of the conserved synteny in a large number of genomes, whatever their taxonomic distances.
As prokaryotic species diverge, their gene order progressively fades away, except in rare locations where a few genes retain their neighborhood. Such observations gave rise to the concept of genomic context [1–9]. Accordingly, it is assumed that a small number of genes remain adjacent either because they are expressed at the same time, or because they encode proteins that are constituents of the same molecular machine (e.g. membrane ATPase) or are involved in the same cellular function. These genes that remain persistently adjacent in constantly moving genomes form synteny blocks. In a recent work, we identified such synteny blocks in a large and diverse set of nearly 600 microbial genomes using a three-step process. In step one, we compared each protein encoded by a completely sequenced genome with all other available microbial proteomes in order to identify the full set of homologous proteins they share. In step two, we outlined an approach allowing the identification of bona fide orthologues among all recognized homologues when comparing many pairs of genomes. This second step is based on an adaptation of the method designed by Wall et al. to compute the reciprocal smallest distance (RSD) that separates the homologues present in a pair of genomes. In step three, we searched among the identified orthologues to pinpoint those that belong to a minimal unit that is conserved in each pair of genomes, i.e., a pair of positional orthologous genes (POGs) that remain adjacent in each genome. Then, after extending these minimal units as far as possible, it becomes feasible to assess the relative amount and size of synteny blocks in close and distant species. Such synteny blocks are appealing candidates in the study of the mechanisms of genome evolution and in the verification of the functional annotation of neighboring genes.
Accordingly, visualizing these blocks in a large number of genomes at various taxonomic distances helps to study their features. In this paper, we describe how to assemble all these synteny data in a relational database (SynteBase) and we develop a tool (SynteView) to visualize all conserved synteny blocks in a large number of completely sequenced prokaryotic genomes.
SynteView was designed to display homology and gene context data that are organized in a relational database, SynteBase, described in detail below.
Creating a relational database for synteny data and populating its tables with a dedicated suite of software and other tools
We installed PostgreSQL, one of the most advanced open source relational database management systems, on a Linux platform and used it to create SynteBase, which is made up of five tables (Fig. 1). The database can be further populated with home-made data using the different tools we developed (see the user guide [Additional file 1]). Alternatively, one can directly use the SynteBase version we built for our own usage (this paper and ).
Step one: searching for homologues
Raw data extracted from public genomic databanks (GenBank/EMBL/DDBJ) were organized into two tables. The genome table contains information for the 598 prokaryotic genomes that were compared. The gene/protein table contains many features of their 1,928,135 encoded proteins, such as amino acid sequence, length, species name, location of the encoding gene, etc. An exhaustive comparison of all these proteins led to the identification of all homologues. A complete suite of programs (Table 1) was used to compare each pair of proteomes using the following criteria: a pair of aligned proteins was retained as a pair of homologues if the E-value of the alignment was smaller than 10⁻⁵, and if the alignment extended over at least 80% of the length of the shorter matching protein [11, 14].
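The two retention criteria above can be sketched as a simple filter. This is only an illustration under stated assumptions: the `Hit` record, its field names and the example values are hypothetical, and the way the aligned length is measured here (number of aligned residues) is an assumption, since the text does not specify it.

```python
from dataclasses import dataclass

E_VALUE_MAX = 1e-5    # threshold quoted in the text
MIN_COVERAGE = 0.80   # fraction of the shorter protein that must be aligned


@dataclass(frozen=True)
class Hit:
    """One pairwise alignment between two proteins (hypothetical record layout)."""
    query_id: str
    subject_id: str
    e_value: float
    aligned_length: int   # assumed: number of aligned residues
    query_length: int
    subject_length: int


def is_homologous_pair(hit: Hit) -> bool:
    """Apply both criteria from the text to a single alignment."""
    shorter = min(hit.query_length, hit.subject_length)
    coverage = hit.aligned_length / shorter
    return hit.e_value < E_VALUE_MAX and coverage >= MIN_COVERAGE


# Made-up alignments, for illustration only.
hits = [
    Hit("a1", "b1", 1e-30, 290, 300, 350),  # passes both criteria
    Hit("a2", "b2", 1e-3, 200, 210, 220),   # E-value too large
    Hit("a3", "b3", 1e-12, 100, 300, 310),  # alignment covers only a third
]
homologues = [(h.query_id, h.subject_id) for h in hits if is_homologous_pair(h)]
print(homologues)
```

Only the first alignment is retained, because the second fails the E-value threshold and the third fails the coverage threshold.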
Step two: identifying orthologues among the collected homologues
We further adapted the Reciprocal Best Blast Hit approach to analyze the Blast results obtained in the first step. The best RSD orthologous pairs were determined in each comparison of two proteomes as follows. Protein a encoded by genome G_A and protein b encoded by genome G_B form the best pair of orthologues if the distance separating a from b is smaller than both the distance separating a from any other protein encoded by G_B and the distance separating b from any other protein encoded by G_A. We automated this search (Table 1, step 2a). The data obtained were used to populate the ortho table (Fig. 1).
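The reciprocity condition above can be illustrated with a short sketch. It assumes that the pairwise distances between homologues have already been computed (the distance estimation itself is not reproduced here); the function name, the dictionary layout and the distance values are hypothetical.

```python
from collections import defaultdict


def best_reciprocal_pairs(distances):
    """Return the pairs (a, b) such that b is strictly the closest protein
    of genome B to a, and a is strictly the closest protein of genome A to b.

    `distances` maps (protein_of_A, protein_of_B) to a distance.
    """
    by_a = defaultdict(dict)  # a -> {b: distance}
    by_b = defaultdict(dict)  # b -> {a: distance}
    for (a, b), d in distances.items():
        by_a[a][b] = d
        by_b[b][a] = d

    pairs = []
    for (a, b), d in distances.items():
        closest_for_a = all(d < other for x, other in by_a[a].items() if x != b)
        closest_for_b = all(d < other for x, other in by_b[b].items() if x != a)
        if closest_for_a and closest_for_b:
            pairs.append((a, b))
    return sorted(pairs)


# Made-up distances between homologues of two genomes.
distances = {
    ("a1", "b1"): 0.10,
    ("a1", "b2"): 0.40,
    ("a2", "b1"): 0.35,
    ("a2", "b2"): 0.20,
    ("a3", "b2"): 0.15,
}
orthologues = best_reciprocal_pairs(distances)
print(orthologues)
```

In this toy example a2 is left without an orthologue: its closest partner is b2, but b2 is closer to a3, so the relation is not reciprocal.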
Step three: identifying positional orthologous genes among the collected orthologues
Once populated, the first three tables were used to identify the synteny blocks. We devised a specific SQL query (see [Additional file 2]) to discover the pairs of adjacent orthologous genes (Table 1, step 3a). Then, blocks of size greater than 2 were detected by progressive accretion of blocks of size 2 which shared a common pair of orthologues (Table 1, step 3b). These computed data were entered in the neighborpairs and synteny blocks tables, respectively (Fig. 1).
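The two sub-steps above (finding adjacent orthologous pairs, then accreting them into larger blocks) can be sketched in memory as follows. This is not the SQL query of Additional file 2. It is a simplified illustration that assumes linear gene orders, ignores strands and circular replicons, and uses hypothetical function names and made-up gene identifiers.

```python
def adjacent_ortholog_pairs(order_a, order_b, orthologs):
    """Find minimal units: two genes adjacent in genome A whose orthologues
    are also adjacent in genome B.

    order_a, order_b: gene identifiers in chromosomal order.
    orthologs: maps a gene of A to its orthologue in B (assumed present in order_b).
    """
    pos_b = {gene: i for i, gene in enumerate(order_b)}
    pairs = []
    for left, right in zip(order_a, order_a[1:]):
        if left in orthologs and right in orthologs:
            b_left, b_right = orthologs[left], orthologs[right]
            if abs(pos_b[b_left] - pos_b[b_right]) == 1:
                pairs.append(((left, b_left), (right, b_right)))
    return pairs


def accrete_blocks(neighbor_pairs):
    """Merge size-2 blocks that share a common pair of orthologues."""
    blocks = []  # each block is a set of (gene_A, gene_B) orthologous pairs
    for first, second in neighbor_pairs:
        merged = {first, second}
        remaining = []
        for block in blocks:
            if block & merged:
                merged |= block
            else:
                remaining.append(block)
        remaining.append(merged)
        blocks = remaining
    return blocks


# Made-up gene orders; the a1-a2-a3 region is inverted in genome B.
order_a = ["a1", "a2", "a3", "a4", "a5", "a6"]
order_b = ["b3", "b2", "b1", "b9", "b5", "b6", "b4"]
orthologs = {f"a{i}": f"b{i}" for i in range(1, 7)}

neighbor_pairs = adjacent_ortholog_pairs(order_a, order_b, orthologs)
blocks = accrete_blocks(neighbor_pairs)
print(sorted(len(block) for block in blocks))
```

Here three adjacent pairs are found, and accretion yields one block of three genes and one block of two genes.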
Architecture of SynteView
To implement SynteView, we applied an object-oriented programming paradigm using the Java programming language. In this way, SynteView may be run either as a Java Webstart application or as a local application (Fig. 2). In both cases, SynteView can be used to query SynteBase through a web service (web service mode), or to query a local synteny database (local access mode). The web service mode allows the user to visualize the precomputed data that are present in our version of SynteBase. To do so, SynteView connects to the SynteView web service to retrieve synteny data present in SynteBase. The local access mode will be useful for those who wish to work online, with home-made computed data. This mode requires the local installation of the database management system PostgreSQL and the creation of a dedicated SynteBase-like database, which must be populated with home-made synteny data that meet the following mandatory requirements. SynteView requires information on proteins (identifier, coding strand, sequence, function, and length), genomes (species name, species name abbreviation, strain name, taxonomy), and synteny blocks (identifier of the blocks, and pairs of identifiers of orthologous proteins belonging to this block). Note that it does not matter how the data are organized in the underlying local database: SynteView parameters can be set to retrieve the data it needs. However, while SynteView is independent of the names of the selected fields, their order matters for correct functioning. Components required to set up a local database are described in detail in Additional file 1. Once the custom-made database has been built, SynteView can connect to it after the settings, including connection information (server, login, etc.) and all the mandatory queries, have been filled out.
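The mandatory information listed above can be pictured as three small record types, together with a loader that reads a query result by field position rather than by field name. This is a sketch only: the class names, the field order and the example row are assumptions made for illustration, and the order actually expected by SynteView is the one documented in Additional file 1.

```python
from dataclasses import dataclass, field


@dataclass
class Protein:
    identifier: str
    strand: str        # coding strand, e.g. "+" or "-"
    sequence: str
    function: str
    length: int


@dataclass
class Genome:
    species_name: str
    abbreviation: str
    strain_name: str
    taxonomy: str


@dataclass
class SyntenyBlock:
    identifier: str
    # pairs of identifiers of orthologous proteins belonging to this block
    ortholog_pairs: list = field(default_factory=list)


def protein_from_row(row):
    """Build a Protein from a result row by position, ignoring column names.

    The field order used here is hypothetical.
    """
    identifier, strand, sequence, function, length = row
    return Protein(identifier, strand, sequence, function, int(length))


# Made-up row, as it might be returned by a user-defined query.
row = ("P001", "+", "MKT", "hypothetical protein", "3")
protein = protein_from_row(row)
print(protein.identifier, protein.length)
```

Reading by position is what makes the column names of the local database irrelevant, and it is also why a query returning the fields in a different order would silently mislabel the data.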
Visualizing synteny data with SynteView
The whole set of synteny data that was stored in SynteBase was further examined using SynteView. This tool was designed to provide a wealth of information on each positional orthologous gene, to be user-friendly and customizable. For example, the user can choose the set of genomes to be studied by defining either an array of species names or a taxonomic sampling. The procedure used to visualize synteny between a reference species s1 and a set of species (s2, s3, s4, s5) is straightforward. The user first chooses a reference species in the "select reference genome" panel by selecting nodes in the species tree (Fig. 3). Clicking on a node produces a list of all the species that are its leaves (right panel). Then, the reference species is chosen by clicking on the species name in this list. Next, the set of compared species is determined by means of the "select compared species" tab (Fig. 3). As before, the user browses the taxonomic tree of prokaryotes. When the user clicks on one node of the tree (e.g. Enterobacteriales), all the descendants of this node appear in the bottom panel. To choose one or several species, the user drags and drops the selected names, which moves the corresponding species into the right panel. This can be repeated several times, until the required set of species is selected. When this step is accomplished, clicking on the "Start data retrieval process" button at the bottom of the panel will launch the visualization step. The speed of this process depends on the number and nature of the chosen species. Once the retrieval process is completed, all regions of each compared genome become accessible for visualization in a scrollable window using the following features, as shown in Fig. 4. Each line corresponds to a genome. The first line from the top (light blue background) shows gene adjacency in the chromosome of the reference species.
Dark blue (positive DNA strand) and yellow (negative strand) rectangles stand for genes belonging to a synteny block that is conserved in at least one other species. Gray rectangles are genes of this reference genome that do not have any POGs in the set of compared genomes. Respective gene names are labeled on each rectangle. The following lines contain the different species that are compared to the reference genome. SynteView automatically sorts the chosen species by their taxonomic proximity to the reference genome. For each gene of the reference genome, columns contain the orthologous genes belonging to a synteny block found to be conserved in the different analyzed genomes, with their respective names. The same color code (blue or yellow) indicates the strand on which they are located in each genome. The number of genes present in a block is displayed when the cursor is moved over this block. Note that synteny blocks in compared genomes are defined exclusively with respect to the gene order in the reference genome. Thus, in a SynteView window of synteny blocks, the apparent proximity of blocks in compared genomes does not imply that they are as physically close in these genomes as their POGs are in the reference genome. By opening the Settings panel (to do so, click on the "settings" button in the left toolbar menu), the user accesses a dialog box where it is possible to modify various default parameters. For example, clicking on the "Database" tab allows the user to choose the retrieval mode (database or web service). Once these various parameters have been customized, it is possible to navigate along the reference genome to estimate the density of the synteny blocks present in the other genomes. For example, and as expected, comparing E. coli with the other gammaproteobacteria reveals a rather high density of gene conservation. The bottom blue background shape portrays this rate of conservation in the compared genomes as a histogram (Fig. 4).
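The conservation histogram described above can be thought of as a per-gene percentage: for each gene of the reference genome, the share of compared species in which it has a POG. The sketch below illustrates one way to compute such a profile; the function name, the input layout and the example data are hypothetical and are not taken from SynteView itself.

```python
def conservation_profile(reference_genes, pogs_by_species):
    """For each reference gene, return the percentage of compared species
    in which that gene has a positional orthologue.

    pogs_by_species: maps a species name to the set of reference genes
    that have a POG in that species.
    """
    n_species = len(pogs_by_species)
    profile = {}
    for gene in reference_genes:
        if n_species == 0:
            profile[gene] = 0.0
            continue
        hits = sum(gene in genes for genes in pogs_by_species.values())
        profile[gene] = 100.0 * hits / n_species
    return profile


# Made-up data: a reference s1 compared with four species.
reference_genes = ["hisG", "hisD", "hisC", "geneX"]
pogs_by_species = {
    "s2": {"hisG", "hisD", "hisC"},
    "s3": {"hisG", "hisD"},
    "s4": set(),
    "s5": {"hisD", "hisC"},
}
profile = conservation_profile(reference_genes, pogs_by_species)
for gene, percent in profile.items():
    # Crude text rendering: one '#' per 10% of species.
    print(f"{gene:6s} {percent:5.1f}% {'#' * int(percent // 10)}")
```

A gene shown in gray in the display (no POG in any compared genome) corresponds to a value of 0 in such a profile.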
Using SynteView for comparative analysis of gene context
Information about any annotated gene is immediately available by clicking on the corresponding rectangle. This opens, to the left of the window, the "gene information" panel (Fig. 4), which gives, for the selected gene, its GenBank PID, its name, the species name and the replicon to which it belongs, as well as the function of its product (if available) and its exact location on the chromosome. This information panel also contains a text field which permits simple queries such as a search for a protein function, a gene name or a PID in the analyzed genomes, as well as a search for synteny blocks containing at least x adjacent genes and having orthologous genes in at least y species. Moreover, clicking on a gene delivers complete information on its neighbors. For instance, it is possible to estimate the various levels of conservation of detected operons when comparing organisms separated by various taxonomic distances. While the histidine operon is rather well conserved in proteobacteria (Fig. 5, panel A), the neighboring clusters of genes involved in the O-specific lipopolysaccharide biosynthesis (rfb cluster) and the production of the extracellular polysaccharide colanic acid (wca cluster), which are located at a short distance and on the other strand, are rapidly fragmented into a small number of blocks of 2–4 genes, such as the rml genes in Pseudomonas aeruginosa (Fig. 5, panel B). In addition, clicking on the "get sequences" button in the information panel opens a dialog box. SynteView shows the sequence of the clicked gene in the first tab and that of its orthologues in the second tab in Fasta format. This further allows downloading of all these amino acid sequences for future work.
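The block search mentioned above (at least x adjacent genes, orthologues in at least y species) amounts to a two-threshold filter. The following sketch shows the idea on an in-memory list; the dictionary layout, the function name and the species sets are assumptions made for illustration and do not describe SynteView's internal data model.

```python
def find_blocks(blocks, min_genes, min_species):
    """Keep the blocks that contain at least `min_genes` adjacent genes
    and are conserved in at least `min_species` compared species."""
    return [
        block
        for block in blocks
        if len(block["genes"]) >= min_genes and len(block["species"]) >= min_species
    ]


# Made-up blocks defined on a reference genome.
blocks = [
    {"genes": ["hisG", "hisD", "hisC", "hisB"], "species": {"s2", "s3", "s4"}},
    {"genes": ["rmlA", "rmlC"], "species": {"s2"}},
    {"genes": ["wcaA", "wcaB", "wcaC"], "species": {"s2", "s3"}},
]

selected = find_blocks(blocks, min_genes=3, min_species=2)
print([block["genes"][0] for block in selected])
```

With x = 3 and y = 2, the two-gene block found in a single species is discarded and the other two blocks are returned.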
Using SynteView for comparative analysis of multiple views
SynteView was also designed to allow complex studies by means of easy and simple operations. For example, looking at a particular set of species makes it possible to immediately visualize new assortments of synteny blocks. This is done simply by selecting a new reference species by clicking on a species name on the left of the display and/or by changing the list of compared genomes. Moreover, unlike competing tools (see Discussion below), SynteView allows global analyses of the synteny data from various points of view. Scrolling up and down the same window, one can assess the level of conservation of gene order at various taxonomic depths, the relative density of the synteny blocks along the whole genome, the relative size of the blocks, and the respective events of gene insertion/deletion in close and distant species.
Using SynteView to quantify synteny data
Besides being a visualization tool, SynteView can display various kinds of histograms which are computed on-the-fly. For example, the percentage of species displaying POGs in the same synteny block in the reference species is automatically computed and displayed as a histogram (blue background shape at the bottom of the main display). It is also easy to display the distribution of the size of synteny blocks which are conserved between genomes by selecting a pair of species and clicking on the Histogram button in the left toolbar. This histogram may be saved for further use by selecting the "Save as" button located in the contextual menu in the window. Table 2 summarizes the data obtained when comparing the model organism Bacillus subtilis with various bacteria and archaea. It appears that the number of genes present in conserved synteny blocks depends on the phylogenetic (taxonomic) distance between species. Indeed, the mean size of synteny blocks is close to 3.3 genes when comparing two closely related bacteria such as the Bacillaceae B. subtilis and Oceanobacillus iheyensis, whereas it diminishes to nearly 2 when comparing a bacterium (B. subtilis) and an archaeon (Methanosarcina acetivorans), although these genomes encode a similar range of proteins (3000–4500). Likewise, the size of the longest block drops from 19 to 4 genes for the same species comparisons.
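The block-size statistics discussed above (distribution, mean size, longest block) are straightforward to derive once the sizes of the blocks conserved between two genomes are known. The sketch below uses made-up sizes, not the values of Table 2, and a hypothetical function name.

```python
from collections import Counter
from statistics import mean


def block_size_distribution(block_sizes):
    """Count how many conserved blocks have each size (in genes)."""
    return dict(sorted(Counter(block_sizes).items()))


# Made-up block sizes for one pair of genomes (illustration only).
block_sizes = [2, 2, 2, 3, 3, 4, 7]

distribution = block_size_distribution(block_sizes)
mean_size = mean(block_sizes)
longest = max(block_sizes)

for size, count in distribution.items():
    print(f"{size:2d} genes: {'#' * count}")
print(f"mean size = {mean_size:.2f}, longest block = {longest}")
```

Since a block contains at least two genes by definition, a mean size approaching 2 indicates that almost only minimal pairs of adjacent orthologues remain.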
SynteView was designed to allow fast and easy visualization of the conservation of gene adjacency in many genomes for which orthology and neighborhood data were computed and stored in a dedicated relational database, SynteBase. Our goal was to develop a flexible yet powerful tool to work directly with home-computed data obtained after comparing large and diverse sets of species. Indeed, our tool can be easily installed on any personal computer running one of the main operating systems (Windows, Mac OS X or Linux). Moreover, SynteView can be customized in many respects. In particular, it can be used with another, home-made, database in place of SynteBase. We observed that among the other tools to visualize synteny data [16–20] that have been designed to be locally installed, not one is adapted to the use of the abundant genomic data for prokaryotic species. Unlike these previously published tools [16–20], SynteView allows the user to compare the gene order in many different genomes in the same window. Finally, the close relationship between SynteBase and SynteView allows their user to extend the study of gene order by means of specific queries on SynteBase. In addition to the visualization of synteny blocks, it is possible to obtain useful information through various requests such as "How many genes are involved in a neighbouring relationship, for each pair of genomes?"
We anticipate that we will be inundated by thousands of completely sequenced genomes in the next few years. Our tool SynteBase/SynteView has been designed to support such large sets of prokaryotic data. This tool will serve to quickly evaluate the conservation of gene order in newly-published genomes as soon as they have been compared to those already analyzed.
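A request such as the one quoted above could look like the following sketch. It is only an illustration: the table and column names are hypothetical (the actual SynteBase schema is the one shown in Fig. 1), and an in-memory SQLite database stands in for PostgreSQL so that the example is self-contained.

```python
import sqlite3

# Hypothetical, simplified layout: one row per pair of adjacent orthologous
# genes, seen from the reference genome (genome_a) towards genome_b.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE neighborpairs (
        genome_a TEXT, genome_b TEXT,
        gene_a1 TEXT, gene_a2 TEXT,
        gene_b1 TEXT, gene_b2 TEXT
    )
    """
)
rows = [  # made-up data
    ("eco", "sty", "a1", "a2", "b1", "b2"),
    ("eco", "sty", "a2", "a3", "b2", "b3"),
    ("eco", "bsu", "a1", "a2", "c1", "c2"),
]
conn.executemany("INSERT INTO neighborpairs VALUES (?, ?, ?, ?, ?, ?)", rows)

# Count the distinct genes of genome_a involved in at least one
# neighbouring relationship, for each pair of genomes.
QUERY = """
SELECT genome_a, genome_b, COUNT(DISTINCT gene) AS genes_in_pairs
FROM (
    SELECT genome_a, genome_b, gene_a1 AS gene FROM neighborpairs
    UNION
    SELECT genome_a, genome_b, gene_a2 AS gene FROM neighborpairs
) AS g
GROUP BY genome_a, genome_b
ORDER BY genome_a, genome_b
"""
result = conn.execute(QUERY).fetchall()
conn.close()
print(result)
```

A gene that takes part in two adjacent pairs (a2 in the example) is counted once, which is why the query unions both gene columns before counting distinct values.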
Availability and requirements
Project name: SynteView/SynteBase
Project home page: http://www.synteview.u-psud.fr
Operating System(s): Windows, Linux, MacOS X (Java web start)
Programming Language: Java
Other requirements: Java 1.5
License: GNU GPL
Any restrictions to use by non-academics: none
Perl scripts: available on request
POGs: Positional Orthologous Genes
RSD: Reciprocal Smallest Distance
SQL: Structured Query Language
Dandekar T, Snel B, Huynen M, Bork P: Conservation of gene order: a fingerprint of proteins that physically interact. Trends in Biochemical Sciences 1998, 23: 324–328. 10.1016/S0968-0004(98)01274-2
Huynen MA, Bork P: Measuring genome evolution. Proc Natl Acad Sci USA 1998, 95: 5849–5856. 10.1073/pnas.95.11.5849
Enright AJ, Iliopoulos I, Kyrpides NC, Ouzounis CA: Protein interaction maps for complete genomes based on gene fusion events. Nature 1999, 402: 86–90. 10.1038/47056
Marcotte EM, Pellegrini M, Ng HL, Rice DW, Yeates TO, Eisenberg D: Detecting protein function and protein-protein interactions from genome sequences. Science 1999, 285(5428):751–753. 10.1126/science.285.5428.751
Overbeek R, Fonstein M, D'Souza M, Pusch GD, Maltsev N: The use of gene clusters to infer functional coupling. Proc Natl Acad Sci USA 1999, 96: 2896–2901. 10.1073/pnas.96.6.2896
Pellegrini M, Marcotte EM, Thompson MJ, Eisenberg D, Yeates TO: Assigning protein functions by comparative genome analysis: protein phylogenetic profiles. Proc Natl Acad Sci USA 1999, 96: 4285–4288. 10.1073/pnas.96.8.4285
Galperin MY, Koonin EV: Who's your neighbor? New computational approaches for functional genomics. Nat Biotechnol 2000, 18: 609–613. 10.1038/76443
Huynen M, Snel B, Lathe W, Bork P: Exploitation of gene context. Curr Opin Struct Biol 2000, 10: 366–370. 10.1016/S0959-440X(00)00098-1
Wolf Y, Rogozin I, Kondrashov A, Koonin E: Genome alignment, evolution of prokaryotic genome organization, and prediction of gene function using genomic context. Genome Research 2001, 11: 356–372. 10.1101/gr.GR-1619R
Huynen M, Snel B, Lathe W, Bork P: Predicting Protein Function by Genomic Context: Quantitative Evaluation and Qualitative Inferences. Genome Research 2000, 10: 1204–1210. 10.1101/gr.10.8.1204
Lemoine F, Lespinet O, Labedan B: Assessing the evolutionary rate of positional orthologous genes in prokaryotes using synteny data. BMC Evol Biol 2007, 7: 237. 10.1186/1471-2148-7-237
Wall D, Fraser H, Hirsh A: Detecting putative orthologs. Bioinformatics 2003, 19: 1710–1711. 10.1093/bioinformatics/btg213
PostgreSQL database management systems [http://www.postgresql.org/]
Le Bouder-Langevin S, Capron-Montaland I, De Rosa R, Labedan B: A strategy to retrieve the whole set of protein modules in microbial proteomes. Genome Research 2002, 12: 1961–1973. 10.1101/gr.393902
Sinha AU, Meller J: Cinteny: flexible analysis and visualization of synteny and genome rearrangements in multiple organisms. BMC Bioinformatics 2007, 8: 82. 10.1186/1471-2105-8-82
Wang H, Su Y, Mackey AJ, Kraemer ET, Kissinger JC: SynView: a GBrowse-compatible approach to visualizing comparative genome data. Bioinformatics 2006, 22: 2308–2309. 10.1093/bioinformatics/btl389
Hunt E, Hanlon N, Leader DP, Bryce H, Dominiczak AF: The visual language of synteny. OMICS 2004, 8: 289–305. 10.1089/omi.2004.8.289
Pan X, Stein L, Brendel V: SynBrowse: a synteny browser for comparative sequence analysis. Bioinformatics 2005, 21: 3461–3468. 10.1093/bioinformatics/bti555
Byrne KP, Wolfe KH: The Yeast Gene Order Browser: Combining curated homology and syntenic context reveals gene fate in polyploid species. Genome Research 2005, 15: 1456–1461. 10.1101/gr.3672305
Overbeek R, Begley T, Butler RM, Choudhuri JV, Chuang HY, Cohoon M, de Crecy-Lagard V, Diaz N, Disz T, Edwards R, Fonstein M, Frank ED, Gerdes S, Glass EM, Goesmann A, Hanson A, Iwata-Reuyl D, Jensen R, Jamshidi N, Krause L, Kubal M, Larsen N, Linke B, McHardy AC, Meyer F, Neuweger H, Olsen G, Olson R, Osterman A, Portnoy V, Pusch GD, Rodionov DA, Ruckert C, Steiner J, Stevens R, Thiele I, Vassieva O, Ye Y, Zagnitko O, Vonstein V: The subsystems approach to genome annotation and its use in the project to annotate 1000 genomes. Nucleic Acids Res 2005, 33: 5691–5702. 10.1093/nar/gki866
MCL – a cluster algorithm for graphs [http://micans.org/mcl/]
FL is a PhD student supported by the French Ministry of Research. This work was funded by the CNRS (UMR 8621) and the Agence Nationale de la Recherche (ANR-05-MMSA-0009 MDMS NV 10). We gratefully acknowledge Stéphane Descorps-Declère for his help in designing the genome comparison pipeline and Mary Bouley (Université de Bourgogne) for her aid in improving the quality of our manuscript.
FL wrote the different programs necessary to collect all synteny data and to build up the relational database and the visualizing tool. He is responsible for the website. The three authors participated in the design of the experimental approach, the conception of the tools, and the data analysis. Together, the three authors wrote this manuscript.