Reconstructing genome trees of prokaryotes using overlapping genes
© Cheng et al. 2010
Received: 23 October 2009
Accepted: 24 February 2010
Published: 24 February 2010
Skip to main content
© Cheng et al. 2010
Received: 23 October 2009
Accepted: 24 February 2010
Published: 24 February 2010
Overlapping genes (OGs) are defined as adjacent genes whose coding sequences overlap partially or entirely. In fact, they are ubiquitous in microbial genomes and more conserved between species than non-overlapping genes. Based on this property, we have previously implemented a web server, named OGtree, that allows the user to reconstruct genome trees of some prokaryotes according to their pairwise OG distances. By analogy to the analyses of gene content and gene order, the OG distance between two genomes we defined was based on a measure of combining OG content (i.e., the normalized number of shared orthologous OG pairs) and OG order (i.e., the normalized OG breakpoint distance) in their whole genomes. A shortcoming of using the concept of breakpoints to define the OG distance is its inability to analyze the OG distance of multi-chromosomal genomes. In addition, the amount of overlapping coding sequences between some distantly related prokaryotic genomes may be limited so that it is hard to find enough OGs to properly evaluate their pairwise OG distances.
In this study, we therefore define a new OG order distance that is based on more biologically accurate rearrangements (e.g., reversals, transpositions and translocations) rather than breakpoints and that is applicable to both uni-chromosomal and multi-chromosomal genomes. In addition, we expand the term "gene" to include both its coding sequence and regulatory regions so that two adjacent genes whose coding sequences or regulatory regions overlap with each other are considered as a pair of overlapping genes. This is because overlapping of regulatory regions of distinct genes suggests that the regulation of expression for these genes should be more or less interrelated. Based on these modifications, we have reimplemented our OGtree as a new web server, named OGtree2, and have also evaluated its accuracy of genome tree reconstruction on a testing dataset consisting of 21 Proteobacteria genomes. Our experimental results have finally shown that our current OGtree2 indeed outperforms its previous version OGtree, as well as another similar server, called BPhyOG, significantly in the quality of genome tree reconstruction, because the phylogenetic tree obtained by OGtree2 is greatly congruent with the reference tree that coincides with the taxonomy accepted by biologists for these Proteobacteria.
In this study, we have introduced a new web server OGtree2 at http://bioalgorithm.life.nctu.edu.tw/OGtree2.0/ that can serve as a useful tool for reconstructing more precise and robust genome trees of prokaryotes according to their overlapping genes.
The approach to analyzing a single gene (e.g., ribosomal RNA) has proved itself to be a powerful tool in molecular phylogenetic studies. However, it may not be suitable for deriving the phylogenetic history of organisms because it sometimes provides insufficient resolution in the derived tree due to limited phylogenetic information in a single gene, or it gives rise to conflicting trees when applied to different individual genes due to different evolution rates or horizontal gene transfer . The recently advent of high-throughput sequencing techniques has made it possible and reliable for evolutionary biologists to reconstruct phylogenetic trees of organisms (hereafter called genome trees) using the overwhelming amount of genomic information extracted from their complete genomes. It is also believed that the created genome trees are less affected by variable mutation rates or horizontal gene transfer events. So far, many different methods on the basis of this principle have been proposed , such as gene content based on the presence and absence of genes [3, 4] and gene order based on the presence and absence of gene pairs [5–7]. Gene order basically evolves faster than gene content . To gain a high-resolution genome tree, therefore, gene order is more suited for closely related organisms, whereas gene content is more suited for distantly related organisms.
Recently, Luo et al. [8, 9] have proposed a new method, as well as a server named BPhyOG, for reconstructing the genome trees of some prokaryotes only based on the content of overlapping genes. The so-called overlapping genes (OGs) are defined as adjacent genes whose coding sequences (CDSs) overlap partially or entirely. In their studies [8, 9], Luo et al. have reported that OGs can serve as a useful phylogenetic character by providing interesting additional insights into phylogenetic relationship among prokaryotes. Their rationale for doing this is as follows. As phylogenetic characters, OGs may not evolve as slowly as gene content, because they can be observed frequently in all prokaryotic genomes and may also mutate at a universal (constant) rate [10, 11]. On the other hand, OGs have more evolutionary conservation than gene order because the linkage may be preserved between two functional-related OGs [12–14]. However, we have found that some prokaryotic genome trees constructed using BPhyOG are not greatly consistent with those produced by traditional phylogenetic approaches based on ribosomal RNAs and/or concatenation of multiple protein sequences .
To address this problem, we have recently implemented a new server, called OGtree , which allows evolutionary biologists to reconstruct more reliable genome trees of some prokaryotes by using not only their OG content but also their OG order. It has been widely accepted that during evolutionary course, species genomes are subject to rearrangements, such as reversals (also called inversions), transpositions and translocations, all of which can alter the order and/or the orientation of genes in the genomes. As a consequence, the orders of orthologous OG pairs even between two closely related organisms may not be conserved. This suggests that we should take into account both OG content and orthologous OG order when reconstructing the genome trees of prokaryotes using the information of OGs. In our previous study , therefore, we have defined an overlapping-gene distance between two genomes based on a measure of combining OG content (i.e., the presence and absence of OGs) and OG order (i.e., the presence and absence of orthologous OG pairs) in their whole genomes and also implemented our OGtree according to the pairwise OG distances between prokaryotic genomes. Our experimental results for a set of closely related Proteobacteria showed that our OGtree outperformed BPhyOG in the quality of reconstruction of their genome trees.
In this study, we further improve the accuracy of our OGtree by extending the genes retrieved from their complete genomes to include their regulatory regions and redefining the distance measure between two orthologous OG orders using genome rearrangements rather than breakpoints caused by the absence of orthologous OG pairs. The reasons for doing so are as follows. For some distantly related prokaryotic genomes, the amount of their overlapping CDSs is limited so that it is hard to find enough OG pairs to properly evaluate their pairwise OG distances and accurately reconstruct their genome trees. Actually, the term "gene" defined in modern genomics should include not only its coding region, but also its regulatory regions, such as promoter (at the 5' upstream end of the coding region) and terminator (at the 3' downstream end of the coding region) . In addition, overlapping of regulatory regions of distinct genes should be of certain interest, because the regulation of expression for these genes is more or less interrelated . In this study, therefore, we expand the region of a gene to include both its CDS and regulatory regions so that two adjacent genes whose CDSs or regulatory regions overlap with each other are considered as a pair of overlapping genes.
On the other hand, the orders of orthologous OG pairs between two prokaryotic genomes, as mentioned above, are often different due to genome rearrangements. The distance measure between two orthologous OG orders we previously defined was analogous to the breakpoint distance between two gene orders, which has been widely used as a rough measure of genomic distance . In contrast to the genome rearrangement distance, however, the breakpoint distance may not correspond to an optimal series of events that accounts for the rearrangements of one genome with respect to another. Moreover, it is still not clear how to adapt the breakpoint analysis to multi-chromosomal genomes . In this study, therefore, we try to use the genome rearrangement distance involved with reversals, block-interchanges (i.e., generalized transpositions) and translocations [19, 20] to re-define the distance of the orthologous OG orders between two prokaryotic genomes.
Complete genomes of 21 Proteobacteria used in this study.
Neisseria meningitidis MC58
Neisseria meningitidis Z2491
Escherichia coli K12
Escherichia coli O157:H7 EDL933
Salmonella enterica subsp. enterica serovar Typhi Ty2
Salmonella typhimurium LT2
Yersinia pestis KIM
Buchnera aphidicola str. Bp
Buchnera aphidicola str. Sg
Buchnera aphidicola str. APS
Wigglesworthia glossinidia brevipalpis
Vibrio cholerae El Tor N16961 (I)
Vibrio cholerae El Tor N16961 (II)
Haemophilus influenzae Rd
Pasteurella multocida Pm70
Xanthomonas axonopodis pv. citri str. 306
Xanthomonas campestris pv. campestris str. ATCC 33913
As compared to the phylogenetic tree constructed by BPhyOG (Figure 3), our OGtree2 produced a more accurate phylogeny (Figure 2) for the 21 Proteobacteria genomes used in this study. In the BPhyOG tree, the relationship of endosymbionts was paraphyletic, because the three Buchnera species failed to form a monophyletic group and the two insect endosymbionts, W. brevipalpis and B. aphidicola, were separated far away from each other. In addition, the three β-Proteobacteria were placed just as neighbor taxa rather than a sister cluster. In contrast, W. brevipalpis, B. aphidicola and other three Buchnera species in our UPGMA tree (Figure 2), as well as in the reference tree (Figure 1), were placed as a sister group, suggesting that there should be a common origin for these five endosymbionts. Moreover, our current OGtree2 indeed outperformed over its previous version OGtree in phylogeny reconstruction for prokaryotes, because in the genome tree predicted by OGtree using the UPGMA method (Figure 4), the α-Proteobacteria of R. prowazekii and the β-Proteobacteria of R. solanacearum were placed together as a sister group and the insect endosymbiont of B. floridanus was placed in the branch of enterobacteria.
As demonstrated above, as well as in other previous studies [8, 9, 15], OG pairs indeed can serve as a useful tool in phylogenetic inference of prokaryotes, because they are abundant and more conserved than non-overlapping genes in prokaryotic genomes and even may evolve at a constant rate across prokaryotic genomes. In fact, our algorithm for constructing the genome trees of prokaryotes using OG pairs relies on successfully identifying orthologous genes, as well as authentic ORFs and horizontally transferred genes, before we can compare the OG content and order across organisms and calculate their pairwise OG distances. Therefore, more accurate identification of authentic ORFs, HGT events and orthologous genes will definitely further improve the accuracy of our algorithm and software tool. On the other hand, we measured the OG distance by taking into account of both the OG content and order in a pair of organisms. Particularly, we estimated the OG order distance (i.e., r i,j ) by using the genome rearrangements involved with reversals (or inversions), block-interchanges (i.e., generalized transpositions) and translocations (including fusions and fissions) [19, 20]. Although this rearrangement distance may underestimate the true distance, we believe that the difference between them for the Proteobacteria we used in this study is small. The reasons for this small difference are as follows. First, the rearrangements we considered include not only reversals but also transpositions and translocations. Second, Bourque and Pevzner  have conducted simulations to compare the estimated reversal distances and the true ones, consequently showing that the reversal distance approximates the true distance very well as long as the number of reversals remains below 0.4 n (i.e., the normalized reversal distance is less than or equal to 0.4), where n is the number of genes being considered. According to the experimental results we obtained in this study, the normalized rearrangement distance (i.e., r i,j /n) typically varies from 0 to 0.12 for closely related prokaryotes (e.g., free-living Enterobacteriaceae, while it is typically in the range of 0.43 to 0.5 for more divergent organisms (e.g., between Pasteurellaceae and free-living Enterobacteriaceae).
Previously, we have implemented a web server named OGtree to demonstrate that overlapping genes can be served as a useful genomic marker for reconstructing genome trees of some prokaryotes. In contrast to BPhyOG, the OG distance we defined to measure the difference between two prokaryotic genomes in our OGtree was based on a combination of their OG content and orthologous OG order. In this study, we have further improved the accuracy of our OGtree in reconstruction of prokaryotic genome trees by extending the regions of genes to include their regulatory regions and redefining the distance measure between two orthologous OG orders using genome rearrangements rather than simple breakpoints. Our experimental results on a set of 21 Proteobacteria have also shown that the above modifications indeed helped us to reconstruct a more precise and robust genome tree that coincides with the taxonomy accepted by biologists for these Proteobacteria. This suggests that our current OGtree2 can provide interesting insights into the study of evolutionary relationships of completely sequenced prokaryotic genomes.
In the above formula, r i,j denotes the genome rearrangement distance between G i and G j using reversals, block-interchanges (i.e., generalized transpositions) and translocations (including fusions and fissions), which can be computed in polynomial time when block-interchanges are weighted 2 and the others are weighted 1 [19, 20], and x i and x j denote the numbers of total OGs in G i and G j , respectively. Basically, D i,j evaluates the distance between G i and G j by considering the orthologous OG order measure as defined in the first term and the OG content measure as defined in the second term. Then w o and w c can be considered as the weight of orthologous OG order and the weight of OG content, respectively, where their defaults are now set to 1 and 2, respectively, in our OGtree2.
In theory, it can be proved that the value of r i,j is less than or equal to n . Below, we give an intuitive reason for this property. Basically, the number of breakpoints between two permutation orders over the same set of n OG pairs is less than or equal to n. In the generic case, an optimal reversal or translocation will remove two breakpoints, while an optimal block-interchange will remove four breakpoints. We therefore have r i,j ≤ n, even though reversals and translocations are weighted 1 and block-interchanges are weighted 2. It is also worth mentioning, according to the experimental results we obtained in this study, that for closely related prokaryotes (e.g., free-living Enterobacteriaceae), r i,j is typically between 0 and 120, while for more divergent organisms (e.g., between Pasteurellaceae and free-living Enterobacteriaceae), r i,j is typically in the range of 170 to 217, and typical values for n range between 13 and 2,700 depending on how closely related the species are.
Next, the BLASTP program  is used to determine putative orthologous genes between two genomes according to the so-called bidirectional best hit (BBH) approach. A BBH denotes a pair of genes a and b from two genomes G i and G j such that b is the best hit (i.e., most similar gene) when a is compared against all genes of G j , and vice versa. Tatusov et al.  have previously evidenced that the BBH approach works reasonably well for identifying putative orthologs of bacterial genomes. The Inparanoid program  is also used as an alternative to identify putative orthologous genes between any two genomes, because it has been shown to be the best among five currently existing methods of automatically detecting orthologous genes . Recall that the term "gene" defined in this study can be expanded to include not only its coding region but also regulatory regions, such as promoters and transcription terminators. Basically, the promoters of prokaryotes are always located immediately upstream of the transcription start site (TSS), the TSSs are located upstream of the start codon, and the transcription terminators are located downstream of the stop codon. In this case, the CDSs of genes are further extended at their 5' and 3' ends to their regulatory promoter and terminator regions. Then two adjacent genes in each genome are identified as overlapping genes (OGs), or an OG pair, if their CDSs (or extended CDSs) overlap partially or completely. Two OGs, say (a, c) and (b, d), from different genomes are then considered as an orthologous OG pair if a and b, as well as c and d, are orthologous to each other, and (a, c) and (b, d) have the same directional pattern.
Finally, we calculate the pairwise OG distance D i,j between any two given genomes G i and G j according to their OG pairs and then construct genome trees of all input genomes using the so-called distance-based methods of building trees, such as unweighted pair group method with arithmetic mean (UPGMA), neighbor-joining (NJ) and Fitch-Margoliash (FM).
Based on the algorithm described above, we have implemented a web server named as OGtree2 (http://bioalgorithm.life.nctu.edu.tw/OGtree2.0/) that allows the user to reconstruct prokaryotic genome trees with overlapping genes retrieved from the prokaryotic genomes.
It is inevitable that some genes may be misannotated in the genomes downloaded from the NCBI. We may therefore remove those CDSs annotated as unknown, hypothetical or putative genes from each downloaded genome in our analysis. As was done in , however, the fact that most of the CDSs in W. brevipalpisa are currently annotated as unknown, hypothetical or putative leads us to find no orthologous OG pair between W. brevipalpisa and other Proteobacteria, if all these CDSs in W. brevipalpisa are excluded from our analysis in this study. Instead, we first removed those horizontally transferred genes currently annotated at the HGT-DB database  and then used the BBH approach, as mentioned previously, to identify putative orthologous genes by setting the parameters with a minimum E-value of 10-8, at least 85% of each authentic CDS sequence involved in the alignment, and a minimum similarity of 45%. In addition, we observed that the amount of the orthologous OG pairs between non-γ-Proteobacteria genomes and other Proteobacteria genomes is few, resulting in difficulty in measuring the accurate OG distances between them. Recall that the term "gene" can be expanded to include both of its coding and regulatory regions, such as promoters and transcription terminators. In prokaryotic genomes, a promoter region, which basically contains the so-called -10 hexamer, extended -10 element, -35 hexamer and UP element, usually occupies about 60 base pairs (bp) upstream of the transcription start site (TSS) [28, 29] and a terminator region usually occupies about 50 bp downstream of the stop codon . Moreover, as exemplified in E. coli genome, 95% of TSSs occur 325 bp upstream from the translation start sites (TLS) of their corresponding genes . According to these information, therefore, we extended the region of each CDS by 385 bp at its 5' end and by 50 bp at its 3' end, so that any two adjacent genes in a genome were considered as an OG pair if their extended CDSs partially or completely overlap with each other. With default values for all the other parameters (i.e., w c = 2 and w o = 1), we used OGtree2 to calculate the OG distance between every pair of Proteobacteria and finally construct the genome trees for all the Proteobacteria used in this study with the UPGMA, NJ and FM methods.
In prokaryotic genomes, many genes are organized into operon structures [32, 33], where an operon is a cluster of genes co-transcribed in a single mRNA. Most operons typically have a single promoter located upstream of the first gene of operon and a single terminator located downstream of the last gene of operon. By the gene definition we used in this study, any two adjacent genes in such operons can be considered an OG pair because they have the same regulatory elements. In this situation, our method described above to identify OG pairs can still find most of them because it has been reported that, in most bacterial genomes, intergenic distances between genes in the same operon are often small (e.g., less than 20 bp) [10, 32].
To demonstrate the robustness of our method, we have adopted a method similar to the so-called jackknife resampling approach  to compute the support values of the tree branches as described as follows. We first randomly removed e -1 ≈ 37% of the initial OG pairs from each genome, while retaining the relative orders of the remaining OG pairs, and then calculated the OG distance between every pair of Proteobacteria. In this process, we implemented 1,000 such jackknife random samples to obtain 1,000 pairwise OG distance matrices. Next, we applied the NEIGHBOR/FITCH program in the PHYLIP package  to these 1,000 OG distance matrices to obtain 1,000 jackknife trees. Finally, we applied the CONSENSE program in the PHYLIP package to these 1,000 jackknife trees to obtain a majority-rule consensus tree with the numbers at each node representing the percentage of times that the clade defined by that node appears in the 1,000 jackknife trees.
We used the following procedure to obtain a 16S rRNA tree topology for the 21 Proteobacteria. First, the 16S rRNA sequences of these 21 Proteobacteria were downloaded from RDP (http://rdp.cme.msu.edu/). Their multiple sequence alignment were then obtained using CLUSTALW 1.8 . Finally, the neighbor-joining tree was inferred using the NEIGHBOR program of PHYLIP 3.6  and its support values were obtained by bootstrap resampling with 1,000 replicates.
This work was supported in part by National Science Council of Republic of China under grant NSC97-2221-E-009-081-MY3.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.