SyntTax: a web server linking synteny to prokaryotic taxonomy
© Oberto; licensee BioMed Central Ltd. 2013
Received: 11 October 2012
Accepted: 19 December 2012
Published: 16 January 2013
The study of the conservation of gene order, or synteny, constitutes a powerful methodology to assess the orthology of genomic regions and to predict functional relationships between genes. The exponential growth of microbial genomic databases is expected to improve synteny predictions significantly. Paradoxically, this plethora of genomic data, without information on organism relatedness, could impair the performance of synteny analysis programs.
In this work, I present SyntTax, a synteny web service designed to take full advantage of the large number of archaeal and bacterial genomes by linking them through taxonomic relationships. SyntTax incorporates a full hierarchical taxonomic tree allowing intuitive access to all completely sequenced prokaryotes. Single or multiple organisms can be chosen on the basis of their lineage by selecting the corresponding rank nodes in the tree. The synteny methodology is built upon our previously described Absynte algorithm with several additional improvements.
SyntTax aims to produce robust syntenies by providing prompt access to the taxonomic relationships connecting all completely sequenced microbial genomes. The reduction in redundancy offered by lineage selection presents the benefit of increasing accuracy while reducing computation time. This web tool was used to resolve successfully several conserved complex gene clusters described in the literature. In addition, particular features of SyntTax permit the confirmation of the involvement of the four components constituting the E. coli YgjD multiprotein complex responsible for tRNA modification. By analyzing the clustering evolution of alternative gene fusions, new proteins potentially interacting with this complex could be proposed. The web service is available at http://archaea.u-psud.fr/SyntTax.
The conservation of gene order, or synteny, has become an invaluable method for establishing the orthology of genomic regions in different species and for inferring functional relationships between genes. The term synteny ('same ribbon' in Greek) was introduced four decades ago to define loci positioned on the same chromosome whether they are genetically linked or not. Synteny has since notably diverged from the original definition and commonly refers to gene loci in different organisms located on a chromosomal region of common evolutionary ancestry. For simplicity, the term synteny will be used hereafter to indicate conservation of gene order even if some would prefer the more accurate 'shared synteny' denomination. In recent years, large-scale sequencing has increased the number of complete prokaryotic genomes exponentially and endows synteny with an even more prominent role. This wealth of genomic information has motivated the development of new bioinformatics tools capable of processing large amounts of data in order to produce valid synteny analyses. A number of algorithms and related implementations have been developed for the automatic identification of syntenies across multiple genomes (see  for a review). Although these algorithms are able to predict all conservations of gene order at the genome level, they require very intensive computations, which restricts their use to a finite group of organisms. These tools are therefore not suited to the retrieval of a particular gene synteny in a set of newly sequenced genomes. Such queries, commonly performed by geneticists and phylogeneticists, require tools of a different nature, able to promptly provide up-to-date syntenies in human-readable form. Several web services have been developed for this purpose, such as GeConT2, PSAT and GCView. Unfortunately, syntenies produced by these tools are pre-calculated and therefore often outdated. Their relevance depends on the variable frequency of their genome updates.
Standalone synteny programs such as GeneclusterViz are also available but require, in addition to local installation, the impractical retrieval, manipulation and storage of large data files. To address most of these limitations, we recently developed the Absynte web server.
The large and continuously increasing number of completely sequenced prokaryotes is introducing yet another level of difficulty. Researchers working on gene syntenies in given species might overlook their relationships with newly sequenced organisms and potentially miss relevant conserved gene clusters. Indeed, with over 2000 sequenced micro-organisms often bearing exotic genus names, keeping mental track of all the lineages is a daunting task. To my knowledge, with the exception of GeConT2, which offers partial access to phyla, none of the aforementioned synteny tools allows organism selection on the basis of taxonomy. The SyntTax web service described here aims to assist the analysis of synteny by providing an organized tree exposing the full lineage of every completely sequenced prokaryote in the National Center for Biotechnology Information (NCBI) repository. In this manner, single or multiple organisms can be easily selected by any combination of classification ranks, allowing the generation of robust syntenies performed according to taxonomic criteria. The taxonomic and genomic databases are stored locally on the SyntTax server; they are both updated automatically on a daily basis from the respective NCBI resources. This feature makes SyntTax a flexible tool capable of promptly adapting to the addition of newly sequenced genomes. The validity of the SyntTax web service was tested in benchmarking experiments based on conserved complex gene clusters described in the literature. Additionally, a new use for SyntTax syntenies was investigated: the prediction of additional functions for the YgjD multiprotein complex by analyzing the taxonomic distribution of gene fusions.
This procedure yields, for each organism, the superkingdom, phylum, class, order and family definitions in addition to the genus and species definitions already present in the FTP directory.
In order to cover all the organisms included in the NCBI database, particular attention was devoted to maximizing the retrieval of taxonomic data. Most lineages originating from the NCBI are complete, displaying a constant rank depth of 7 with all ranks defined. However, a number of organisms labeled 'candidatus' or 'uncultured' often lack one or more rank definitions. The addition to the database of incompletely defined lineages without proper processing would impair the uniqueness of particular ranks and perturb the subsequent recursive addition of new taxa. When at least the superkingdom is defined, the incomplete taxa are shortened to superkingdom and phylum definitions (if the latter is available) and added under Bacteria:unclassified or Archaea:unclassified. Severely incomplete lineages are stored under Prokaryote:unclassified.
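The handling of incomplete lineages described above can be illustrated with a minimal Python sketch. It is not the SyntTax implementation: the function name, the dictionary input format and the exact placement of the phylum beneath the 'unclassified' node are assumptions made for illustration only.

```python
# Hypothetical sketch of the lineage normalization rules described in the text.
# The seven ranks follow the "rank depth of 7" mentioned above.
RANKS = ("superkingdom", "phylum", "class", "order", "family", "genus", "species")


def normalize_lineage(lineage):
    """Return the list of tree node names under which an organism is filed.

    `lineage` maps rank names to taxon names; missing or empty ranks are
    treated as undefined (assumed input format).
    """
    defined = {rank: lineage[rank] for rank in RANKS if lineage.get(rank)}

    # Complete lineage: keep all seven ranks in order.
    if len(defined) == len(RANKS):
        return [defined[rank] for rank in RANKS]

    # Severely incomplete lineage: no superkingdom available.
    if "superkingdom" not in defined:
        return ["Prokaryote", "unclassified"]

    # Incomplete lineage: shorten to superkingdom (and phylum when known)
    # and file it under "<superkingdom>:unclassified".
    path = [defined["superkingdom"], "unclassified"]
    if "phylum" in defined:
        path.append(defined["phylum"])
    return path
```

For example, a 'candidatus' entry with only a superkingdom and a phylum would be filed under Bacteria:unclassified followed by its phylum, so that the uniqueness of the fully defined ranks elsewhere in the tree is preserved.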
On the SyntTax web server, the taxonomical information is stored as an XML document whose hierarchical structure is particularly well suited to store data of this nature. The SyntTax taxonomy database is fully accessible to the user and can be queried for particular organism names. New routines devoted to the management of the taxonomy database have been added to the ancillary Updater program described previously [8, 12, 13]. This program is now capable of performing automated daily incremental updates of both the taxonomic and genomic databases. Decremental updates are performed as well in order to remove obsolete, redundant or renamed organisms from the local databases.
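As an illustration of why a hierarchical XML document suits this kind of data, the following Python sketch builds a small taxonomy tree by recursive addition of lineages and queries it by organism name. The element name `taxon`, the root name `taxonomy` and the `rank`/`name` attributes are hypothetical; the actual SyntTax schema is not described here.

```python
import xml.etree.ElementTree as ET


def add_lineage(root, lineage):
    """Insert a lineage, given as (rank, name) pairs, reusing existing nodes."""
    node = root
    for rank, name in lineage:
        # Look for an existing child with the same rank and name.
        child = next(
            (c for c in node if c.get("rank") == rank and c.get("name") == name),
            None,
        )
        if child is None:
            child = ET.SubElement(node, "taxon", rank=rank, name=name)
        node = child
    return node


def find_organisms(root, text):
    """Return species names containing `text` (case-insensitive)."""
    text = text.lower()
    return [
        t.get("name")
        for t in root.iter("taxon")
        if t.get("rank") == "species" and text in t.get("name").lower()
    ]


if __name__ == "__main__":
    root = ET.Element("taxonomy")
    add_lineage(root, [("superkingdom", "Bacteria"),
                       ("phylum", "Proteobacteria"),
                       ("species", "Escherichia coli")])
    add_lineage(root, [("superkingdom", "Bacteria"),
                       ("phylum", "Proteobacteria"),
                       ("species", "Salmonella enterica")])
    print(ET.tostring(root, encoding="unicode"))
    print(find_organisms(root, "escherichia"))
```

Because shared ranks are stored only once, adding a new organism touches only the nodes that do not exist yet, which is what makes incremental updates of such a tree inexpensive.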
The synteny methodology used in SyntTax is based on the Absynte algorithm described in our previous work, with significant improvements. The algorithm employs a multiple center star gene clustering topology and can be briefly summarized as follows. The query protein is first compared to itself using BLASTP in order to permit normalization of further alignments. The same protein sequence is then compared to the DNA sequence of the selected target genomes using TBLASTN. The normalized results obtained in such queries allow the various genomes to be ranked by decreasing score and the absolute chromosomal coordinates of the matching hit to be extracted. DNA segments of 15 kb, centered on these coordinates, are then extracted from each of these chromosomes and translated according to the corresponding GenBank annotation. The proteins of the highest ranking chromosome are compared to each other using the Smith-Waterman-Gotoh (SWG) local alignment algorithm for performance reasons detailed in the Results and Discussion section. A unique color is assigned to each individual protein or to paralogs, when present. These protein sequences are then compared to those extracted from the remaining chromosomes using SWG and the newly identified orthologs are colored accordingly. The synteny becomes readily apparent upon proper alignment and proportional drawing of the color-coded genetic maps. All the processing mentioned above is executed in real time and does not rely on pre-calculation. Synteny analysis with the Absynte/SyntTax algorithm is therefore a processor-intensive task and a number of measures were taken in order to ensure optimal performance. Faster volatile memory access was favored over slower disk operations. Disk I/O was limited to database reading as no user data is written on the server at any time.
Multi-threaded or parallel operations were also developed in most areas of the algorithm in order to take full advantage of the multi-core processor architectures available in modern servers. A specific area of the synteny algorithm, built around the BLAST executable (v2.2.24), was nevertheless still single-threaded in the original Absynte algorithm. A substantial optimization of the algorithm could be achieved in SyntTax by allowing full multi-threaded execution of BLAST v2.2.26 in addition to the already parallel SWG routines. The entire data flow and processing can now be distributed among the individual processors. The result report produced by SyntTax was also significantly improved: a list of the selected genomes where the synteny is not observed is now included in the result page and in the Acrobat .pdf and Excel .csv reports. In addition, the sequence of the query protein is also included in the printable reports.
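The normalization, ranking and window extraction steps summarized above can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the normalized score is taken here as the ratio of the TBLASTN score to the BLASTP self-score, chromosomes are treated as linear, and all function names and the input format are hypothetical.

```python
def normalized_score(hit_score, self_score):
    """Score of a genome hit relative to the query aligned to itself (assumed ratio)."""
    return hit_score / self_score


def rank_genomes(hits, self_score):
    """Rank genomes by decreasing normalized score.

    `hits` maps a genome name to (score, hit_start, hit_end); this layout is
    an assumption made for the example.
    """
    scored = [
        (genome, normalized_score(score, self_score))
        for genome, (score, _start, _end) in hits.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


def extraction_window(hit_start, hit_end, chromosome_length, size=15000):
    """Return the (start, end) of a segment of `size` bp centered on the hit.

    The window is simply clamped at the chromosome ends; wrapping around a
    circular chromosome is deliberately left out of this sketch.
    """
    center = (hit_start + hit_end) // 2
    half = size // 2
    return max(0, center - half), min(chromosome_length, center + half)
```

Dividing by the self-score puts hits obtained with different query proteins on a comparable scale, since a perfect match then scores 1 regardless of the query length.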
Synteny analysis in real time is a processor-intensive task. To address this problem, SyntTax permits the selection of target organisms on the basis of taxonomy in order to reduce computing time and increase the robustness of the analysis. With the relatedness of organisms becoming readily apparent, the user is able to restrict the search to more meaningful organisms while avoiding the redundancy of too closely related species. Several benchmarking experiments were performed in order to demonstrate the capabilities of the SyntTax web service and to evaluate its efficiency. In the first part of this section, SyntTax was tested for the resolution of two conserved complex gene clusters reported in the literature. In the last part, I will describe a predictive analysis aiming to determine new potential factors interacting with the E. coli YgjD complex recently described as involved in tRNA modification. This new use for synteny is directly correlated to the specific taxonomic capabilities of SyntTax and illustrates the potential of this new web service.
SyntTax provides access to prokaryotic genomes by the means of an intuitive taxonomic tree. The selection of a particular tree node will select recursively all the child nodes up to the individual species while the parent nodes remain unselected. Node selection will also propagate correctly even if the child nodes are collapsed and therefore not visible in the interface. Displaying a taxonomic tree containing over 3000 individual rank nodes while keeping a responsive web application was not trivial. Since only portions of the tree are accessed at a given time, displaying the entire tree at once is superfluous. Taxonomic data transfer is therefore partial and only occurs upon expansion of the specific nodes. This tree loading 'on demand' was developed to avoid massive data flux from the server. In addition, to further improve performance, a significant part of the selection logic is executed locally in the web browser application itself.
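The selection behavior described above, where choosing a node selects all of its descendants but none of its ancestors, independently of whether the children are currently displayed, can be sketched as follows. The `Node` class and its fields are hypothetical and stand in for whatever structure the web interface actually uses.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    children: list = field(default_factory=list)
    selected: bool = False
    expanded: bool = False  # display state only; never consulted by select()


def select(node, value=True):
    """Select (or deselect) a node and, recursively, all of its descendants."""
    node.selected = value
    for child in node.children:
        select(child, value)


def selected_species(node):
    """Collect the names of the selected leaf nodes (individual organisms)."""
    if not node.children:
        return [node.name] if node.selected else []
    names = []
    for child in node.children:
        names.extend(selected_species(child))
    return names
```

Keeping the display state (`expanded`) separate from the selection state is what allows a selection to propagate through collapsed branches.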
In contrast to most synteny programs, which use BLAST, SyntTax relies extensively on the Smith-Waterman-Gotoh algorithm for protein alignments. Whereas BLAST is the most widely used alignment program due to its fast execution speed, several reports have pointed out the superior sensitivity of the SWG algorithm [16, 17]. The choice of SWG, combined with the parallel implementation of this algorithm in SyntTax, contributes significantly to the overall performance of the web service.
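For readers unfamiliar with the method, the sketch below computes a Smith-Waterman local alignment score with Gotoh's affine gap penalties in Python. It is a textbook formulation, not the SyntTax code: the simple match/mismatch scoring and the gap parameters are illustrative assumptions, whereas a protein aligner would normally use a substitution matrix.

```python
def swg_score(a, b, match=2, mismatch=-1, gap_open=3, gap_extend=1):
    """Best local alignment score of sequences `a` and `b` with affine gaps.

    A gap of length k costs gap_open + (k - 1) * gap_extend.
    Only two rows are kept in memory at any time.
    """
    neg_inf = float("-inf")
    cols = len(b) + 1
    prev_h = [0] * cols        # best score of an alignment ending at (i-1, j)
    prev_f = [neg_inf] * cols  # same, but ending with a gap in b (vertical)
    best = 0
    for i in range(1, len(a) + 1):
        h = [0] * cols
        f = [neg_inf] * cols
        e = neg_inf            # alignment ending with a gap in a (horizontal)
        for j in range(1, cols):
            e = max(e - gap_extend, h[j - 1] - gap_open)
            f[j] = max(prev_f[j] - gap_extend, prev_h[j] - gap_open)
            diag = prev_h[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            h[j] = max(0, diag, e, f[j])  # 0 restarts the local alignment
            best = max(best, h[j])
        prev_h, prev_f = h, f
    return best
```

Unlike the seed-and-extend heuristic of BLAST, this dynamic programming scheme examines every cell of the alignment matrix, which is the source of its higher sensitivity and of its higher cost. Since each pair of proteins is scored independently, the pairwise comparisons lend themselves naturally to parallel execution.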
Recently, Danielou and co-workers described two particular prokaryotic syntenies with genes involved respectively in the glycogen biosynthesis/degradation pathway and in the biotin biosynthetic pathway. These complex gene clusters are characterized by the presence of a number of gene permutations or inversions and by rapid sequence divergence within orthologous groups. Using SyntTax, benchmarking experiments were performed on the above syntenies as follows. Commonly used bacterial organisms were selected directly and unfamiliar lineages were retrieved by querying the taxonomy database (Figure 1B).
The second reported complex gene cluster involves genes implicated in the biotin biosynthetic pathway. Using the E. coli BioF protein sequence, SyntTax generated the genomic contexts shown in Figure 3B. Once again, a fully resolved synteny was obtained, connecting all relevant genes in the selected organisms. In contrast to OTFQ, SyntTax was able to assign the Staphylococcus aureus bioD gene to the correct group of orthologs (BLAST e-value ≤4e-11) and to link correctly the bioC genes of E. coli and B. thuringiensis (BLAST e-value ≤7e-22). The rapid sequence divergence within these orthologous gene groups did not constitute an obstacle for the resolution of the corresponding complex gene clusters by SyntTax. The use of the SWG protein alignment algorithm in this web service might explain its superior sensitivity, as discussed previously.
The detection and the resolution of complex gene clusters, described above, are not the only functions of SyntTax. This section will show that the algorithm of this web service can be used successfully to produce a more robust prediction of multiple protein-protein interactions. Most tools limit their synteny analysis to the relative positioning of clustered genes. In SyntTax, the concept of synteny is pushed one step further. This web service provides a more direct demonstration of proteins interacting with one another by allowing the visualization of gene fusions in the syntenies. Marcotte et al. reported over a decade ago that proteins are likely to interact when their homologs are found fused together. The high proportion of false positives predicted in this way can be substantially reduced by considering only orthologs. On the basis of these principles, SyntTax provides a refined and systematic approach for the study of proteins that have evolved by modular assembly of independent domains. In addition to the exhaustive approach provided by its taxonomic capabilities, a second property of SyntTax is instrumental for this type of investigation. The synteny maps produced by this web service are drawn to the same exact scale, on the basis of GenBank annotations provided by the NCBI. This particular feature, which is absent in other resources such as GeConT2, GCView or OTFQ, allows the immediate observation of the evolution of gene size and the detection of protein fusions.
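In SyntTax, fusions are detected visually on the scaled maps. The same principle can nevertheless be expressed programmatically, as in the Python sketch below: a target protein is flagged as a candidate fusion when two different query proteins align to essentially non-overlapping regions of it. The function name, the input tuple layout and the overlap tolerance are assumptions made for illustration and do not describe SyntTax internals.

```python
from itertools import combinations


def find_fusions(hits, max_overlap=10):
    """Return {target: set of query names} for candidate fusion proteins.

    `hits` is a list of (query, target, target_start, target_end) tuples in
    residue coordinates on the target protein (assumed format).
    """
    by_target = {}
    for query, target, start, end in hits:
        by_target.setdefault(target, []).append((start, end, query))

    fusions = {}
    for target, segments in by_target.items():
        for (s1, e1, q1), (s2, e2, q2) in combinations(sorted(segments), 2):
            # Negative values mean the two aligned regions are disjoint.
            overlap = min(e1, e2) - max(s1, s2)
            if q1 != q2 and overlap <= max_overlap:
                fusions.setdefault(target, set()).update((q1, q2))
    return fusions
```

As noted in the text, restricting the input hits to orthologs rather than to any homolog would be expected to reduce the number of false positives produced by such a test.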
To illustrate this concept, SyntTax was used to investigate new potential partners of the highly conserved YgjD/Kae1 protein. This protein, often erroneously annotated as a protease, is part of a multiprotein complex involved in the biosynthesis of N6-threonylcarbamoyl adenosine (t6A), a universal modification found at position 37 of tRNAs decoding ANN codons [22, 23]. This essential protein is present in the three domains of life and deletion mutants are characterized phenotypically by severe chromosomal loss. In eukaryotes, Kae1 interacts at the biochemical and structural levels with Bud32, Pcc1, and Cgi121 in order to constitute the KEOPS/EKC complex. In bacteria, however, most of these additional factors are absent and the Kae1 ortholog, named YgjD, interacts directly with the YjeE ATPase and YeaZ to constitute the YgjD complex. The tRNA modification reaction was recently reconstituted in vitro with the addition of a fourth partner, YrdC (now called RimN), even though no direct protein-protein interaction was reported between RimN and the YgjD complex. Interestingly, the archaeal organisms inherited instead the eukaryotic KEOPS/EKC complex [27, 28]. The precise role of each eukaryotic or prokaryotic protein partner in the tRNA modification pathway is still unknown.
In this post-genomic era, the accumulation of completely sequenced genomes provides, in theory, the opportunity to assign a biological role to a growing number of proteins. However, newly sequenced prokaryotic organisms introduce not only a considerable number of additional orphan proteins but also a certain degree of redundancy. The scientific community is therefore expected to produce increasing efforts in order to keep a favorable proportion of proteins with known functions over the total number of protein sequences. Only time-consuming wet bench experimentation is able to produce the ultimate proof of function, and it constitutes the bottleneck in this process. Fortunately, comparative genome analysis assists the biologist in this endeavor by prioritizing targets for experimental study. The development of new bioinformatics tools that can adapt to continuously expanding genomic datasets is therefore fundamental. The present work has focused on the development of SyntTax, a new synteny web service whose predictive robustness increases with the number of available prokaryotic genomes. This property is due to the taxonomic capabilities offered by this web tool, which are not available in comparable implementations.
The results presented here show that the quality of complex syntenies produced in real time by SyntTax equals or exceeds that obtained by other methodologies such as OTFQ, which requires hours of systematic calculation on a finite set of genomes. The extensive use of the SWG protein alignment algorithm contributes positively to its performance and sensitivity. Several unique additional SyntTax properties were instrumental for the analysis of the ygjD gene cluster. Beyond the confirmation of known interacting factors in this complex, SyntTax was able to predict the interplay between new potential partners composing this multiprotein machinery involved in bacterial tRNA modification.
The SyntTax web server is built upon the proven Absynte synteny algorithm and allows full access to complete taxonomic records for each sequenced archaeon and bacterium in the NCBI database. This tool is capable of adapting rapidly to the sequencing of new genomes due to the daily automated update of its databases. The taxonomic capabilities of this web service are not available in other tools and add a new dimension to synteny analysis by providing robust results while reducing computing time. Exhaustive access to prokaryotic lineages is not available in similar tools such as GeConT2, PSAT or GCView. Although solely based on genomic sequences, annotations and taxonomic data, the accurate illustration of genomic contexts produced by SyntTax is able to provide exhaustive, cutting-edge predictive information on protein-protein interactions. These predictions adequately complement those provided by other tools such as STRING, which uses a complex combination of genomic, post-genomic and text databases. The principles shown here demonstrate that the high number of sequenced genomes and their increasing redundancy no longer constitute a burden. This wealth of information can indeed be used effectively to target specific aspects of the evolution of prokaryotic gene clusters and to yield insights into the molecular interactions between the corresponding proteins. Predictive models, such as those produced by SyntTax, would constitute interesting challenges for wet bench experimentalists.
Project name: SyntTax
Project home page: http://archaea.u-psud.fr/synttax.
Operating system(s): platform independent.
Other requirements: internet connection.
License: none required.
Any restrictions to use by non-academics: no restriction.
The author wishes to thank the “Centre National pour la Recherche Scientifique” (CNRS) and the "Agence Nationale de la Recherche" (ANR GRONAG grant), for financial support.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.