Phylogenetic detection of conserved gene clusters in microbial genomes
© Zheng et al. 2005
Received: 12 April 2005
Accepted: 03 October 2005
Published: 03 October 2005
Skip to main content
© Zheng et al. 2005
Received: 12 April 2005
Accepted: 03 October 2005
Published: 03 October 2005
Microbial genomes contain an abundance of genes with conserved proximity forming clusters on the chromosome. However, the conservation can be a result of many factors such as vertical inheritance, or functional selection. Thus, identification of conserved gene clusters that are under functional selection provides an effective channel for gene annotation, microarray screening, and pathway reconstruction. The problem of devising a robust method to identify these conserved gene clusters and to evaluate the significance of the conservation in multiple genomes has a number of implications for comparative, evolutionary and functional genomics as well as synthetic biology.
In this paper we describe a new method for detecting conserved gene clusters that incorporates the information captured by a genome phylogenetic tree. We show that our method can overcome the common problem of overestimation of significance due to the bias in the genome database and thereby achieve better accuracy when detecting functionally connected gene clusters. Our results can be accessed at database GeneChords http://genomics10.bu.edu/GeneChords.
The methodology described in this paper gives a scalable framework for discovering conserved gene clusters in microbial genomes. It serves as a platform for many other functional genomic analyses in microorganisms, such as operon prediction, regulatory site prediction, functional annotation of genes, evolutionary origin and development of gene clusters.
In microorganisms, it is often seen that genes tend to locate in conserved proximity in a number of genomes forming conserved gene clusters . Further analyses often uncover biologically meaningful relationships between genes with conserved proximity [1–3]: they are often co-transcribed as operons , or co-regulated as part of a larger biochemical network [5–7]. Examples include the widely present DNA restriction and modification gene pairs , which provide a way of defending against bacteriophage and other foriegn DNA, and the two component systems which respond to changes in environmental conditions . Thus, delineation of conserved gene clusters will help reveal functional relationships between genes within them . In addition, the conserved gene clusters provide an invaluable resource for corroborating the growing number of co-expressed gene sets obtained from microarray based mRNA expression experiments.
The increasing accumulation of genome sequence data has facilitated the practice of finding conserved gene clusters since both similarity and synteny information can be easily obtained . Various computational methods for finding clusters have been described [1, 6, 12–16]. An important issue has been the estimation of significance of the observed conserved proximity. It is generally accepted that phylogenetic distances between genomes largely determine the significance: conservation is deemed more significant when a cluster appears in distantly related genomes than in closely related ones. The rationale is that the greater the length of time a gene cluster has persisted, the more it has resisted dissolution by recombination events, and the stronger the selective pressure to maintain it. Previous published methods have incorporated phylogenetic information into the estimation of significance by using either 16S RNA distance  or statistical methods . However, few efforts have focused on the development of a full evolutionary model to describe conserved gene clusters. Empirical approaches include grouping closely related genome into clades  or choosing a subset of genomes based on knowledge of the evolution of microorganisms . Although the latter approaches are efficient in finding non-trivial conserved gene clusters, they do not scale automatically, which is problematic with more and more sequenced genomes.
Moreover, sequenced organisms are often close relatives of known model organisms or pathogens selected for biomedical reasons, so phylogenetic coverage can be rather sparse and biased. For example, there are many more close relatives of Escherichia coli than those of Helicobacter pylori in the genome database. Thus, simply counting the number of genomes where a gene cluster is conserved often overestimates the significance of conservation of a cluster observed in microorganisms such as E. coli.
Delineation of evolutionarily conserved gene clusters must take into account a stochastic model of evolution of these clusters. In this paper we used a simplified stochastic generative evolutionary process where an organism inherits a gene cluster from its immediate ancestor with a probability proportional to the evolutionary distance in the tree. This model facilitates an efficient computation of the probability that a particular pattern of conservation in many genomes is observed. In particular, observing a large number of occurrences in closely related genomes will not carry the same significance as the occurrence of a cluster in a wider evolutionary phylum.
We define a tree-based probabilistic conservation score and show that it provides a quantitative measure the strength of proximity constraints in genome evolution and serves as a better predictor of functional links among genes than more naïve methods. It also sheds insight into the relationship between evolutionary development and the functional selection.
We applied our computational pipeline to 127 microbial genomes which were obtained from the NCBI website (2/2004). Pairwise genome comparisons of translated open reading frames were performed using BLASTP  and putative orthologs between genomes were identified as reciprocal best hits (see Methods). Conservation scores C u and C d were assigned to each gene (see Methods). Intuitively, C u measures the strength of the conservation between the gene under consideration and its upstream neighboring genes while C d does the same with its downstream neighboring genes on the chromosome. The higher the C u or C d of a gene is, the more significant its conservation with its upstream or downstream neighbors. Conserved gene clusters were then detected using a conservation score cutoff (see Methods section). The complete results of our method on all microbial genomes included can be accessed at . Results using more than 200 genomes available now are being incorporated into the database.
To test the predictive value of our method for inferring operons, we took a compiled list of E. coli operons from the RegulonDB . A total of 345 operons with multiple genes were extracted from the RegulonDB dataset. Genes that are not in the RegulonDB set and have intergenic regions of larger than 100 nucleotides on both sides comprise the negative set (1342 genes). Using different cutoffs, we calculated the sensitivity (Sn) and the specificity (Sp) by:
where TP and FP are the number of genes that are correctly or wrongly predicted to be in operons, and TN and FN are the number of genes that are correctly or wrongly predicted not to be in operons. Notice that genes in operons must reside on the same strand while genes in clusters detected by our system may come from both strands. Nevertheless, the high percentage of conserved clusters that overlap with operon dataset suggests our ability to identify clusters that are conserved due to the transcriptional selection pressure and demonstrates the potential use of the system in the task of operon prediction.
Statistics of conserved gene clusters in a number of microorganisms
Total genes in clusters
Total detected clusters
Average cluster size
Chlamydophila pneumoniae J138
Mycobacterium tuberculosis CDC1551
Mycobacterium tuberculosis H37Rv
Staphylococcus aureus Mu50
Agrobacterium tumefaciens C58
Streptococcus pneumoniae R6
Agrobacterium tumefaciens C58 UWash
Escherichia coli K12
Streptococcus pneumoniae TIGR4
Escherichia coli O157H7
Neisseria meningitidis MC58
Conserved gene clusters, once identified, can be used to make functional predictions for the genes within them, and to hypothesize interactions between their products. For example, in the genome of Mycobacterium tuberculosis CDC1551, we found a four-gene cluster encoding putative homologs of (1) mraZ (MT2224), (2) mraW/yabC (MT2223), (3) ftsL/mraR (MT2222), and (4) ftsI/pbpB (MT2221). This cluster is widely conserved among both gram-positive and gram-negative bacteria, and in E. coli these four genes comprise the start of a known operon of cell division and cell envelope genes . FtsI and FtsL are known to be essential for the assembly of the cell division septum in E. coli. MraW has been shown to be a methyltransferase whose substrates are localized to the cell envelope . Furthermore, it has been shown that lack of S-adenosylmethionine (SAM) leads to a cell division defect in E. coli, with one possible explanation being that SAM serves as a methyl donor in a required methylation event . The location of mraW within this cluster suggests it may encode the methylase involved in such an event, possibly modifying the FtsL and/or FtsI proteins. Indeed, there is experimental data to suggest this , although it has not yet been conclusively shown.
The gene cluster in Figure 5a was originally found and characterized in Sinorhizobium meliloti and was shown to synthesize a special type of siderophore, rhizobactin 1021, involved in iron uptake . It is largely conserved in another three genomes: one is an archaea, Halobacterium sp., and the other two are bacteria, Nostoc sp. and Bacillus halodurans. Notice that this cluster is absent in other Bacilli species (Bacillus subtillis, etc.) despite their closeness to Bacillus halodurans. On the other hand, Bacillus subtilis is known to possess the dhb operon responsible for synthesizing a different type of siderophore, 2,3-dihydroxybenzoate , and this operon is absent in Bacillus halodurans. Figure 5a suggests that the other three microorganisms may possess the ability to synthesize rhizobactin 1021 or a siderophore with a similar structure, reflecting a requirement for iron in the life cycles of these species.
Figure 5b depicts a three gene cluster that is conserved in Pirelulla sp., Methanosarcina acetivorans and Oceanobacillus iheynesis. One of the three genes is annotated as a cellulosomal protein. A BLAST search reveals its similarity to cotH, which has been characterized in Bacillus subtilis and is essential for spore coat assembly . The other two are conserved hypothetical genes in Genbank with no functional information although they are both predicted to be transmembrane proteins (by TMpred). The conservation in Figure 5b suggests the two unknown genes are functionally related to cotH. We note that these conserved gene clusters that appear in just a few genomes with a large evolutionary span could be instances of horizontal transfer.
If conserved gene proximity indeed implies relatedness, we expect to see functional enrichment in the list of gene clusters found by our method. We used the 18 functional codes of COG  as a crude measure of this, counting gene pairs in which both members belong to the same COG category. For 39 genomes examined, we determined the fraction of gene pairs found by our method (m/n), then performed a one-tailed Fisher's exact test to determine the probability of observing at least m gene pairs with the same functional code among n pairs selected randomly from all gene pairs in the genome. Gene pairs including at least one gene with no functional information (i.e., having no assigned category, or designated as general function or unknown function) were excluded from the analysis.
Statistics of functional enrichment
P-value ratio a
Escherichia coli K12
Helicobacter pylori 26695
Neisseria meningitidis MC58
Using the same methodology, we divided the gene pairs into colinear (on the same strand) and divergent (on opposite strands) pairs to see if the bulk of functionally enriched pairs were of one type or the other. The P-values for colinear pairs were similar to those for all pairs, suggesting the vast majority of gene pairs captured by our method might be operons (data not shown). On the other hand, there are only limited numbers of divergent pairs and the P-values for divergent pairs are often insignificant (data not shown).
Functional enrichment analysis has been also applied to the results from the simple counting method, using a cutoff of 10 organisms (of 127 total). The results are shown in Table 3, columns 4 and 5. Although there is no equivalency between them (one is an absolute number, whereas the other is a log-odds score), the number of gene pairs captured by the two methods with their respective cutoffs were similar for many of the genomes examined (Table 3). In most cases, the simple counting method also yielded significantly functionally enriched gene pairs, as indicated by the P-values in column 5. For most genomes, the P-values of the simple counting method are much smaller than those from the phylogeny method (column 6 of Table 2). This is especially true for those genomes which have many close relatives in the database, e.g., E. coli and other Enterobacteriaceae. In those cases, we also observe a large increase in the number of pairs obtained by the simple counting method. We expect this increase to consist largely of functionally unrelated false positives that are conserved because of close phylogenetic distance.
We recognize the limitations of using COG codes to capture functional relationships, and our results will certainly include both false positives and exclude false negatives. For example, consider a functionally related gene pair consisting of a transcriptional regulator and the gene it regulates. These genes would likely be assigned different COG codes, so such a functional relationship would not be captured by this analysis. Therefore, to independently verify our results with COG, we performed a similar analysis using KEGG pathway information in E. coli, the organism with the largest number of KEGG pathway assignments. The overrepresentation statistics for E. coli using KEGG are similar to our results using COG and thus support the above discussion (data not shown).
There is a general problem when comparing sequenced genomes because the samples of such genomes presently available are not uniformly distributed across the microbial kingdoms. Rather, several groups of related bacteria, such as the Enterobacteriaceae, are over-represented. This means that the significance of a feature found in organisms from this group must be interpreted with caution because it may be conserved by simple lineage effects rather than by functional selection. Most papers dealing with this problem either ignore it or simply remove the multiple members of the family and settle for a single representative example. Both methods necessarily lead to inaccuracies in estimating significance. In this paper, we have attempted to overcome this problem by using a method that takes into account the phylogenies of the individual members of related families. We show that it results in a reduction in the false positive rate over the simple counting method. We have not attempted to compare it to the selective sampling method, where one organism is used to represent a phylum, because given the wide variability usually observed within a phylum, the results from that approach will vary widely depending upon which genome is selected as the exemplar.
We have applied our new phylogenetically informed method to the problem of detecting conserved gene clusters. Such clusters are generally believed to reflect conservation of biological function in that often the gene products from the various genes in the cluster are involved in closely-related pathways. This may include the traditional operons known to encode the biosynthetic pathways of intermediary metabolism, or they may reflect the fact that enzymes responsible for post-translational modification will sometimes affect neighboring gene products. Other functional connections may also be found in these clusters. This can be a powerful tool in making predictions about gene products that might otherwise be recalcitrant to direct similarity analysis. Examples here include the restriction-modification enzymes where the DNA methyltransferases often show reasonable degrees of similarity that enable them to be identified, whereas the genes for the restriction enzymes evolve rapidly and usually cannot be identified on the basis of sequence similarity. Nevertheless, the genes are usually clustered. In addition, the current methodology is not necessarily restricted to the conserved synteny between genes. It may be applied to conserved synteny of other functional elements in genomes such as cis-elements, riboswitches, etc.
In our current methodology we have made several simplifying assumptions of which the most important might be to ignore horizontal gene transfer. While this is not a problem if an entire cluster is transferred horizontally, it does become a problem for those "hitchhiking" genes that may be transferred with the cluster. Nevertheless, the results we report seem promising and we do not view this as being a serious limitation at the present time.
The mode of phylogenetic inference is an important consideration in the type of analysis presented here. The basis for the tree underlying the analysis is not restricted to shared gene content, as we have employed here, but could alternatively be based on 16S RNA sequences, gene content, gene order, genome statistics, or coding sequences. Our current method is oblivious to "local" changes such as rearrangements inside the cluster, or differential selection within individual coding sequences. For example, it is possible that genes in the cluster might have different gene trees from each other. The choice of the best phylogenetic methodology in this context is an important follow-up and we feel that the final solution will combine the benefits of multiple methodologies in a single system.
The methodology and results that we present here should be generally useful in any situation where the functional significance of conserved genes or clusters is being investigated, for example, as a way of cross-validating co-expressed genes inferred from many microarray experiments or as a starting point for the assembly of gene networks. As more and more genomes are sequenced, the approach described here should be generally applicable and it will scale well computationally.
Of particular significance in this paper has been our finding that many of the clusters we have observed contain unknown genes with no biological function currently assigned. In these cases it seems reasonable to hypothesize that the product of the unknown gene has a function closely related to the functions of the known genes. Such a function may be a key enzymatic step in a biosynthetic pathway, a key regulatory function or an important post-translational modification. We have assembled a database of conserved gene clusters, called GeneChords http://genomics10.bu.edu/GeneChords, with a simple user interface to permit its rapid query. This will be described in detail in a separate publication.
For each gene pair (as defined below) in the genome, a tree-based conservation score is computed;
For each gene in the genome, two neighbourhood conservation scores are computed, measuring the strength of proximity constraint of the gene with its upstream or downstream neighboring genes;
Adjacent genes with conservation scores between them exceeding some threshold are joined into conserved clusters.
The details are described in the following sections.
For simplicity let us consider the problem of identification of a conserved gene pair, the building block of our gene clusters. We define a gene pair as two genes with no more than k open reading frames separating them along the chromosome (k = 1 in this paper). In contrast to the operon identification procedure , the two genes in a pair do not need to reside on the same strand of the chromosome. When orthologs of a gene pair form a pair in other genomes, the pair is considered conserved. Orthologous genes are often detected as reciprocal best hits (BLAST E-value < 1E-5) between the two genomes by sequence comparison software, such as BLAST . However, evolutionary events such as gene duplication followed by diversification could obscure the reciprocal relationship, resulting in relatively high false negative rates in the identification of orthologous genes. To relieve this concern, we loosen the criteria to include genes with high similarity (BLAST E-value ≤ 1E-5) (see Results). However, unless specifically pointed out, the results presented are calculated on the basis of orthology data.
We assign a score to a conserved gene pair by computing the probability a particular pattern of conservation is observed in analyzed genomes based on a stochastic model of evolution. Before introducing the detailed implementation, we first lay out the theoretical foundation of the method and point out our assumptions.
We assume that the probability of a genome acquiring a gene cluster given its absence in the ancestor is negligible, i.e., P(child = 1|parent = 0) = 0. This is a simplifying assumption which considers the predominance of vertical inheritance and omits the negligible probability of independent formation of identical clusters. Under this assumption the most recent common ancestor of all the leaves that are assigned to 1 is also set to 1. Accordingly, in the Figure 6 tree model, X0 is set to 1. We compute the significance of the conservation based on the probability of observing the specific gene cluster given that the closest common ancestor has it, that is,
P(conservation) = P(Q = 1, A = 1, B = 1, C = 1, D = 0|X 0 = 1)
In our initial model we do not take into account the genomes that do not possess the cluster, although it is not difficult to do with the appropriate assumption on the conditional probability of loss of a cluster in a descendant of an organism that has it. Thus a leaf node D that lacks the cluster is dropped from further calculation. Now we have
P(conservation) = P(Q = 1, A = 1, B = 1, C = 1|X 0 = 1)
The tree in Figure 6 can also be interpreted as a Bayesian network in a tree form [31, 32]. In particular, the vertical inheritance along any path in the tree is a generative probabilistic process, and the probability that a child inherits a gene pair is only dependent on its immediate evolutionary ancestor. More specifically, we associate a conditional probability table with each edge of the tree enumerating the probability of the presence or absence of a gene pair in a genome given the state of its immediate ancestor. According to the tree model, we have
Assuming independence between the siblings, the above can be rewritten as
Note that this derivation is a simple generalization of the forward algorithm [30, 32–33]. According to our vertical inheritance assumption, the probability for a child to have a gene pair is approximately zero if the immediate ancestor does not have the gene pair. After applying this assumption and evaluating the equation recursively, the above formula reduces to
P(Q = 1, A = 1, B = 1, C = 1|X 0 = 1) = P(Q = 1|X 2 = 1)·P(A = 1|X 2 = 1)·P(B = 1|X 3 = 1)·P(C = 1|X 3 = 1)·P(X 2 = 1|X 0 = 1)·P(X 3 = 1|X 0 = 1)
More generally, for a gene pair found in a set of genomes and a given genome phylogeny tree T, the following simple relation holds
where Y is an immediate ancestor of X in T.
Taking the negative logarithm of both sides, we have
Thus the probability of conservation of a gene cluster in a given tree is simply the sum of log conditional probabilities of all the branches on paths leading to the genomes where the gene pair is present.
We further assume that log(P(X = 1|Y = 1) is proportional to the length of the branch that connects X and its parent Y, that is, log(P(X = 1|Y = 1)) ~ d(X,Y). Thus,
where Y is the parent of X. The summation of all the tree branches in the phylogenetic tree is proportional to the logarithm of the overall probability.
In our implementation, the genome phylogenetic tree is built by using the genome distance metric based on the shared gene content suggested by Snel and coworkers . We first calculate the pairwise distance between genomes by d = -ln(s), where s = (number of shared orthologs)/(average of total gene numbers in two genomes). In this implementation, Escherichia coli and Salmonella typhi, have a distance of 0.35, while Escherichia coli and Bacillus subtilis, which are much more distantly related, have a larger distance of 1.18. The genome phylogenetic tree is then constructed from the pairwise distance matrix using the neighbor joining algorithm .
For each gene pair, a genome phylogenetic tree is built on the genomes that have the pair, the conservation score of the gene pair is the summation of all the branch lengths in the tree. Notice that another nice property of the score is that it is independent of the query genome.
We now extend the methodology by using the gene pair conservation to detect longer conserved gene clusters. For the gth gene on the query chromosome, we estimate its upstream conservation score (C u ) and downstream conservation score (C d ) by:
where each s(g-i, g) is the conservation score assigned to gene pair (g-i, g) calculated using the method above; k is the maximum number of intervening genes allowed in a conserved gene pair (k = 1 in this paper). As a result, each gene will be associated with two numerical scores, measuring the extent of conservation between itself and its upstream or downstream neighboring genes respectively.
The statistical significance of the conservation scores is inferred from a bootstrap simulation. For each genome, the null distribution is computed by calculating the conservation scores of the randomly shuffled genome. The P-value cutoff is set at about 1E-4 in this paper, which corresponds to a conservation score of about 5.0 for most genomes. Notice that P-values are related to genome size since genes in very small genomes may have higher chance of forming conserved gene clusters. For instance, conservation score of 5.0 corresponds to a larger P-value (3.5E-3) in a small genome Mycoplasma genitalium than in E. coli (2.4E-4).
For genes that are at the boundaries of the cluster, only one of the conservation scores will exceed the threshold, which provides a convenient way of detecting the boundaries. We detect the maximal conserved gene clusters by scanning the genomes sequentially. The gene with C d over the threshold, but not for its upstream genes, marks the start of a new cluster. The gene whose downstream genes have C u scores below the threshold mark the end of the cluster. All the genes between are considered as part of the cluster.
More sophisticated dynamic programming procedures, or single-linkage clustering algorithm to identify maximal conserved gene clusters are also possible.
The main program was written in C and used the LEDA library for the manipulation of trees. Scripts for generating genome BLAST data and for analyzing data were written in Perl. GeneChords was built with PostGreSQL. All scripts are available upon request from the authors.
We thank the genome sequencing teams who made their data publicly available. We thank Drs. Eugene Koonin, Peer Bork, Russell Greiner and David Lipman, Steven Salzberg and five anonymous reviewers for many insightful comments. Y.Z. thank Kevin Wiehe, Jennifer Beane, Elisabeth George, and David Sternlichit, Paul Strominger, Yulei Fu for their early involvement in building the database interface and the Java based user interface. This work is supported in part by NSF grants DBI-0239435 and ITR-048715 and NHGRI grant #1R33HG002850-01A1 (to S.K.). B.A. and R.J.R. are supported by New England Biolabs.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.