 Methodology article
 Open Access
 Published:
Simultaneous prediction of transcription factor binding sites in a group of prokaryotic genomes
BMC Bioinformatics volume 11, Article number: 397 (2010)
Abstract
Background
Our current understanding of transcription factor binding sites (TFBSs) in sequenced prokaryotic genomes is very limited due to the lack of an accurate and efficient computational method for the prediction of TFBSs at a genome scale. In an attempt to change this situation, we have recently developed a comparative genomics based algorithm called GLECLUBS for de novo genomewide prediction of TFBSs in a target genome. Although GLECLUBS has achieved rather high prediction accuracy of TFBSs in a target genome, it is still not efficient enough to be applied to all the sequenced prokaryotic genomes.
Results
Here, we designed a new algorithm based on GLECLUBS called extended GLECLUBS (eGLECLUBS) for simultaneous prediction of TFBSs in a group of related prokaryotic genomes. When tested on a group of γproteobacterial genomes including E. coli K12, a group of firmicutes genomes including B. subtilis and a group of cyanobacterial genomes using the same parameter settings, eGLECLUBS predicts more than 82% of known TFBSs in extracted interoperonic sequences in both E. coli K12 and B. subtilis. Because each genome in a group is equally treated, it is highly likely that similar prediction accuracy has been achieved for each genome in the group.
Conclusions
We have developed a new algorithm for genomewide de novo prediction of TFBSs in a group of related prokaryotic genomes. The algorithm has achieved the same level of accuracy and robustness as its predecessor GLECLUBS, but can work on dozens of genomes at the same time.
Background
With the continuous decline in the cost of genome sequencing due to the development of new technologies [1, 2], numerous prokaryotic genomes are being sequenced, and this number will soon approach a few thousand. Since the biological functions of an organism are encoded in its genome, knowing its genome sequence can greatly facilitate the understanding of its biological functions. However, due to the expensive nature of experimental characterization of biological functions of an organism, ideally, these functions should be largely deduced computationally from its genome sequence. Nevertheless, understanding the function of even a relatively simple prokaryotic cell from its genome sequence remains one of the most daunting challenges in the postgenomic era. In particular, we know very little about the cisregulatory elements or transcription factor (TF) binding sites (TFBSs) in the vast majority of sequenced prokaryotic genomes because of the lack of an accurate and efficient computational method for predicting TFBSs in sequenced genomes.
The difficulty of computational prediction of TFBSs in a prokaryotic genome is mainly due to the short length and degenerate nature of TFBSs, which complicates their discovery within the long upstream interoperonic regions in which they usually reside. Furthermore, although some TFBSs in prokaryotes have a palindromic structure, any segment of an interoperonic sequence can in principle be a TFBS as long as a TF can recognize it. Therefore, TFBSs are usually predicted by comparative analysis of multiple sequences that are known to contain or potentially contain TFBSs. Based on the observation that the transcriptional regulation machinery including TFBSs is relatively conserved in closely related genomes, various forms of phylogenetic footprinting algorithms have been developed to identify conserved DNA segments as possible TFBSs in the promoters of orthologous genes in a group of related prokaryotic [3–9] and fungal genomes [10]. For the convenience of discussion, in this paper, we refer a set of similar TFBSs as a motif.
These algorithms typically start by predicting TFBSs in the upstream intergenic sequences of a group of orthologous genes using a motiffinding tool, and then cluster the resulting motifs into distinct sets according to the similarity values among the motifs using different clustering strategies and similarity measures [3–10]. Although meaningful results have been achieved by these algorithms in their specific applications, their prediction coverage of possible TFBSs in the applied genomes is generally low [3–10]. For instance, using a Bayesian clustering algorithm to group similar TFBSs predicted in E. coli K12 by phylogenetic footprinting in an earlier work [3], Qin et al.[4] could only predict 192 motifs covering only 438 operons, while the E. coli K12 genome is predicted to encode 266314 TFs, and more than 2000 operons [11–14]. In another study, van Nimwegen et al.[5] used a Monte Carlo sampling strategy to partition into clusters a set of TFBSs predicted by phylogenetic footprinting [3]; this study yielded only 115 significant clusters/motifs. More recently, Liu et al.[8] used the PhyloNet algorithm [15] to cluster putative TFBSs predicted by the motiffinding program CONSENSUSv6c [16] through phylogenetic footprinting in the Shewanella oneidensis genome, finding that PhyloNet is not able to efficiently cluster the predicted TFBSs. Therefore, an additional hierarchical clustering procedure was used to achieve reasonable predictions [8].
In our opinion, there are two unnecessarily exclusive problems in these existing algorithms, limiting their performance and applications. First, these algorithms used only a single motiffinding tool in the phylogenetic footprinting process for identifying putative TFBSs. However, it has been shown that these motiffinding tools can only predict at most 30% of known TFBSs in the input intergenic sequences, and may be biased to some types of TFBSs, though different tools may complement with each other for recovering different types of TFBSs [17, 18]. Second, most of these algorithms assume that the putative TFBSs predicted by a phylogenetic footprinting procedure are all true TFBSs; therefore the subsequent clustering procedure is designed to group similar motifs into distinct ones without filtering out the spurious predictions. However, a considerable portion of the predicted motifs are clearly spurious predictions due to the low prediction accuracy of current motiffinding tools [17, 18].
To overcome these problems, we have recently developed a new algorithm named GLECLUBS (GLobal Ensemble and Clustering of Binding Sites) for genomewide de novo prediction of TFBSs in a prokaryotic genome [19]. Although GLECLUBS also employs a phylogenetic footprinting technique to first identify all possible TFBSs, and then clusters similar motifs, it is distinct from the prior methods in two ways. First, in order to harvest as many as possible true TFBSs by phylogenetic footprinting, GLECLUBS uses multiple wellevaluated complementary motiffinding tools instead of using only a single tool, and considers multiple outputs of each tool. Second, GLECLUBS assumes that only a small portion of predicted TFBSs by phylogenetic footprinting are true TFBSs, and that the vast majority of them are spurious predictions. Therefore, the goal of the clustering step of GLECLUBS is to discriminate true TFBSs from spurious ones using an iterative filtering procedure, instead of simply partitioning putative TFBSs into distinct groups. We have shown that GLECLUBS outperforms the existing algorithms in terms of the prediction sensitivity and specificity in E. coli K12, B. subtilis and S. oneidensis[19]. We found that the major bottleneck for the prediction accuracy of GLECLUBS is the accuracy of operon predictions that are used to guide the extraction of interoperonic sequences for phylogenetic footprinting [19]. When interoperonic sequences are correctly extracted, GLECLUBS can recover at least 80% known TFBSs in both E. coli K12 and B. subtilis, according to RegulonDB [20] and DBTBS [21], respectively.
Nevertheless, GLECLUBS and all of the other prior algorithms that we are aware of, only aim at predicting TFBSs in a single target genome even though multiple reference genomes are used during the phylogenetic footprinting and subsequent clustering steps; therefore the information derived for the TFBSs in the reference genomes are not fully utilized. For this reason, all these algorithms including GLECLUBS are not efficient enough to be applied to all the sequenced prokaryotic genomes. In this study, we have developed a new version of GLECLUBS called extendedGLECLUBS (eGLECLUBS) for simultaneous de novo prediction of TFBSs in a group of prokaryotic genomes that we call a group of target genomes. We showed that eGLECLUBS can achieve at least the same level of prediction accuracy for a group of genomes as GLECLUBS does for a single genome; however, it can simultaneously predict TFBSs in dozens of closely related genomes. Therefore eGLECLUBS is more efficient than GLECLUBS, and can be used for predicting TFBSs in the increasing number of sequenced prokaryotic genomes.
Results
TFBSs can be effectively identified by phylogenetic footprinting based on predicted COORs using multiple motiffinding tools
In a typical phylogenetic footprinting procedure with a single target genome, upstream intergenic sequences are extracted based on a group of orthologous genes of a gene in the target genome [3–10]. We [19] have previously shown that intergenic sequences extracted from a group of orthologous operons determined by an operon in the target genome outperforms intergenic sequences extracted based on a group of orthologous genes determined by a gene in the target genome for motiffinding. In current study, in order to predict TFBSs in a group of genomes, we predict Clusters of Operons with Orthologous Relationships (COORs, Figure 1 and see Methods) in the genomes. We then extract upstream interoperonic sequences based on the predicted COORs. Clearly, by the design of the algorithm for predicting COORs, the extracted sequences are unlikely biased to any genome in the group. Application of the algorithm (Figure 1) to a group of target genomes comprised of 32 γproteobacterial genomes including E. coli K12 [Additional file 1: group D in Supplemental Figure S1] resulted in 4,103 COORs and interoperonic sequences sets which contain 1,447 known E. coli K12 TFBSs as described above. To evaluate whether or not we can effectively identify these known E. coli K12 TFBSs in the interoperonic sequence sets based on the COORs, we applied seven motiffinding tools that we have evaluated previously [19] to these 4,103 interoperonic sequence sets. These seven tools were MEME [22], BioProspector [23], MotifSampler [24], CUBIC [25], MDScan [26], Weeder [27] and CONSENSUS [28]. These tools were chosen based on evaluations by others [17, 18], the balance of different algorithm designs and ease of use. As shown in Figure 2A, these tools have different performances in their ability to recover known E. coli K12 TFBSs in the interoperonic sequence sets for their best, top 5, 10, 15, 20 and 25 predictions, but they all identify an increasing number of known E. coli K12 TFBSs when more predictions are considered. We can define a lower bound of specificity as the number of predicted known TFBS divided by the number of predicted TFBSs, to evaluate the prediction specificity of each tool, although since TFBs in E. coli K12 have not been completely characterized, this estimate of specificity may be overly conservative. As shown in Figure 2B, the increasing number of known E. coli K12 TFBSs recovered by each tool when more top predictions are considered is at the cost of a decreased lower bound of specificity. Furthermore, the predictions of these tools are complementary to one another as their combined predictions recover more known TFBSs than does any single tool (Figure 2A). However, Weeder and CONSENSUS substantially underperform the other tools, and their predictions were all covered by the other tools, therefore they were not further considered. Using a low bound specificity cutoff of 5% (Figure 2B), and based on the results shown in Figure 2A, we consider for further analysis a total of 40 motifs in each interoperonic sequence set associated with a COOR, including the top 15 predictions of MEME, the top 10 predictions of BioProspector, and the top 5 predictions of CUBIC, MDscan and MotifSampler, respectively. Therefore, there are a total of 4,103 × 40 = 164,120 predicted motifs (called input motifs, see Methods) for this group of target genomes. These predicted motifs recover 1,347 (1347/1447 = 94%) known E. coli K12 TFBSs in the extracted interoperonic sequence sets. As shown in Figures 2C and 2D, although there are large overlaps among the predictions of these tools, each tool has its own considerable unique predictions. Therefore, we have achieved at least the same level of sensitivity as our previous results obtained from the interoperonic sequences based on operons using E. coli K12 as the single target genome [19]. These results suggest that TFBSs can be effectively identified based on the predicted COORs, and that our choice of these five tools and their top predictions are sufficient enough to recover true TFBSs in the interoperonic sequence sets, though other choices of motiffinding tools are possible, in particular, when better ones are available in the future. Furthermore, although our predictions were only evaluated in E. coli K12 because very little is known about TFBSs in other genomes in the group, it is highly likely that similar results have been achieved for the other genomes in the group since the COORs are unlikely biased to any genome in the group (see Methods).
Optimal selection of a group of target genomes is vital for the prediction of TFBSs
Since each genome in a group of target genomes is both a target genome for which we want to predict all possible TFBSs and a reference genome for all the others in the group, the composition of genomes in a group can largely affect the results of phylogenetic footprinting. Furthermore, it has been shown that cisregulatory systems in prokaryotes evolve very rapidly and are extremely flexible [29], thus traditional phylogenetic analyses may fail to detect conserved cisregulatory systems. To quantify the level of conservation of cisregulatory systems in a group of related genomes, we constructed a special phylogenetic tree that largely reflects the relationships of the cisregulatory systems in the genomes, using a method that we developed previously[19] (see Methods). The tree for 139 γproteobacterial genomes is shown in [Additional file 1: Supplemental Figure S1]. To determine the level of conservation of cisregulatory systems, so that phylogenetic footprinting would perform best in the corresponding genomes for recovering known TFBSs, we selected five subtrees including E. coli K12 using different branching points with increasing distances in the tree for the γproteobacterial genomes [Additional file 1: Figure S1], and obtained five groups of genomes as follows: group A containing Escherichia genomes, group B containing Escherichia and Shigella genomes, group C containing Escherichia, Shigella, and Salmonella genomes, group D containing Escherichia, Shigella, Salmonella, and Yersinia genomes, and group E containing Escherichia, Shigella, Salmonella, Yersinia, Pseudomonas, Shewanella, and Vibrio genomes. As shown in Figure 2E, groups A and B did not perform very well because only genomes that are very closely related to each other were used. Similarly, group E did not perform well either, presumably because some genomes in the groups have cisregulatory systems that are too divergent. In contrast, groups C and D that include genomes encoding cis regulatory systems that are intermediately related to that of E. coli K12 are the best in recovering the known E. coli K12 TFBSs in the extracted intergenic sequences. Interestingly, removal of other very closely related Escherichia genomes in groups C and D decreased the performance of phylogenetic footprinting (data not shown). This result is consistent with an earlier observation that inclusion of reference genomes that are very closely related to the target genome may improve the performance of phylogenetic footprinting if other reference genomes that are intermediately related to the target genome are also included [30]. As shown in Figure 2E group D recovers the largest number of known E. coli K12 TFBSs and has the largest number of overlapping predictions with those of its neighboring groups C and E. Since group D does not include any genomes that share less than 50% of TFs with any other genomes, we used this as the criterion for selecting a group of target genomes for the prediction of their TFBSs. In other words, by default, we select a subtree from the phylogenetic tree of cisregulatory systems of a group of sequenced genomes belonging to the same phylum/subphylum or class to form a group of target genomes, such that each genome in the subtree shares at least 50% of its TFs with any other genomes in the group (see Methods). Accordingly, we used group D to evaluate the performance of our motif clustering algorithm.
Selection of the motif similarity cutoffs α and β for the construction of motif similarity graphs G _{ 1 } and G _{ 2 } , respectively
Next, we want to distinguish true TFBSs from spurious ones in the set of input motifs predicted for a group of target genomes by gradually filtering out the latter based on the following two assumptions as we used in GLECLUBS previously [19]: 1) a true motif is more likely than a spurious one to be found by multiple tools from the same interoperonic sequence set based on a COOR; and 2) a true motif is more likely than a spurious one to have a similar motif found in a different COOR. To this end, we use a graphtheoretic approach similar to that used in GLECLUBS. However, instead of first constructing a single motif similarity graph, we begin with two motif similarity graphs G_{1} and G_{2} using a low and a high motif similarity score cutoffs α and β (α <β), respectively (Figure 3, and see Methods). As shown in Figure 2F, the distribution of the similarity scores among all predicted motifs in the group is largely leftshifted relative to that among the submotifs of each known motif in E. coli K12, suggesting that the majority of predicted motifs are irrelevant to one another or are spurious predictions. On the other hand, there is a considerable overlap between the two distributions of the similarity scores, suggesting that the true motifs cannot be easily separated from the spurious ones by a single similarity score cutoff. The same conclusion was reached by us previously when E. coli K12 was chosen as the single target genome [19]. Furthermore, in the current application we have to deal with a much large set of input motifs (1.6×10^{5}) identified from the 4,103 interoperonic sequence sets based on the same number of COORs in the group of genomes, which is almost twice more than those when E. coli K12 was used as the single target genome [19]. Ideally, a motif similarity graph should be constructed, so that most relevant true submotifs are connected by edges with higher weights, while most irrelevant and spurious motifs are not. According to the similarity score distributions shown in Figure 2F, the optimal similarity score cutoff value seems to be located around 0.05. However, we found that with this similarity score cutoff or lower, the density of a resulting graph (defined as the number of its edges divided by the number of its nodes) was too high (more than 10^{3}). And a graph of this size (>10^{5} nodes) could not be efficiently clustered into dense subgraphs/clusters by any clustering algorithms that we have tested, including the Markov chain clustering (MCL) algorithm [31] that is best known for its high efficiency for clustering very large graphs. However, the density of the motif similarity graphs decreases precipitously when the motif similarity cutoff is greater than 0.2 (Figure 4A) while the resulting graphs still include the vast majority (99.9%) of input motifs as long as the cutoff is less than 0.3 (Figure 4B). Therefore, we selected the cutoff α ∈[0.2, 0.3] for the construction of motif similarity graph G_{1} that includes the vast majority of input motifs. Nevertheless, G_{1} usually still has a too high density (>500 when α = 0.2, Figure 4A) to be efficiently clustered, therefore, we construct a substitute motif similarity graph G_{2} using a higher motif similarity score cutoff β ∈[0.35, 0.45]. Since G_{2} has a low enough density (<100), thus can be efficiently clustered by the MCL algorithm, while the graph still keeps more than 90% of the input motifs (Figure 4B). Therefore, our clustering algorithm works by first clustering G_{2} into dense subgraphs, and then recruits those in G_{1} but missed in G_{2} (see Methods).
Prediction of TFBSs in a group of γproteobacterial genomes including E. coli K12 and a group of firmicutes genomes including B. subtilis
The eGLECLUBS algorithm ranks the predicted motifs in each genome in a group of target genomes according their ClusterScores defined by formula (4). To evaluate the prediction accuracy and robustness of our algorithm, we first applied it to a group of 32 γproteobacterial genomes including E. coli K12 [Additional file 1: group D in Supplemental Figure S1], and a group of 19 firmicutes genomes including B. subtilis [Additional file 1: Supplemental Figure S2] using the same parameter settings. These two groups of genomes were chosen for the evaluation as the relatively large numbers of known TFBSs in E. coli K12 and B. subtilis can be used to systematically verify our predictions in the two genomes.
As shown in Table 1, of the 1,642 and 568 known TFBSs belonging to 125 and 99 motifs in E. coli K12 [20] and B. subtilis[21], 1,447 and 451 belonging to123 and 93 motifs, respectively, are correctly extracted and included in the upstream interoperonic sequence sets according to the predicted operons in the genomes [32, 33] and our criterion that each set has to contain at least three sequences (see Methods). Consequently, about 12 and 21% of TFBSs in E. coli K12 and B. subtilis[32, 33], respectively, were not included in the interoperonic sequence sets due to the incorrect operon predictions or the restriction of our criterion, and thus could not be predicted by our clustering algorithm. The reason that there are a larger portion (21%) of missed TFBSs in the extracted sequence sets for B. subtilis than that (12%) for E. coli K12 is that more interoperonic sequence sets containing a sequence from B. subtilis were deleted as the corresponding COORs contain fewer than three operons, and thus were not further analyzed (see Methods). The five motiffinding tools returning a total of 40 motifs in each interoperonic sequence set recover 1,347 and 397 known TFBSs belonging to 122 and 92 motifs in E. coli K12 and B. subtilis, respectively (Table 1). Thus, 93 and 88% of known TFBSs included in the interoperonic sequence sets are recovered by the phylogenetic footprinting procedure in E. coli K12 and B. subtilis[32, 33], respectively (Table 1).
We then calculated the cumulative recovery rate by the topranked motifs of the 1,347 and 397 known TFBSs found by phylogenetic footprinting in E. coli K12 and B. subtilis, respectively. As shown in Figure 5A, with the increase in the number of topranked clusters, the cumulative recovery rate of known TFBSs by the topranked clusters increases very rapidly for the top 300 and 230 clusters, which recover 1,102 (82%) and 324 (82%) of the 1347 and 397 known TFBSs in E. coli K12 and B. subtilis, respectively, and then it enters a saturation phase with a small linear increase. We also calculated the recovery rate of the 122 and 92 motifs by the topranked clusters in E. coli K12 and B. subtilis, respectively. We consider that a known motif is recovered by one cluster if at least 20% of its known TFBSs are included in this cluster. As shown in Figure 5B, with the increase in the rank of clusters, the cumulative recovery rate of known motifs by the topranked clusters increases rapidly for the top 300 and 230 clusters, recovering 113 (93%) and 88 (96%) of the 122 and 92 motifs in E. coli K12 and B. subtilis, respectively, and then it also enters a saturation phase with little increase. Therefore, our clustering algorithm has achieved rather high sensitivity in recovering the known TFBSs as well as the known motifs of both genomes. Interestingly, the cumulative recovery rate of known motifs saturate at around the top 300th and 230th predicted motifs, which is close to the number of TFs that have been estimated in the E. coli K12 [11–14] and B. subtilis genomes [34], respectively. The rapid recovery of known TFBSs (Figure 5A) and motifs (Figure 5B) by topranked clusters also suggest that the higher the rank of a cluster, the more likely it is a true motif.
To assess the specificity of the predictions in E. coli K12 and B. subtilis, we calculated the number of cumulative unique predicted TFBSs in the topranked clusters. We consider a predicted TFBS to be unique if there is no other site overlapping with it by more than eight bases. As shown in Figure 5C, with the increase in the rank of clusters, the number of cumulative unique predicted TFBSs in both genomes increases in a way very similar to the cumulative recovery rates of known TFBSs shown in Figure 5A, and then it saturates around the top 300th and 230th clusters, covering about 4,900 and 4,600 unique TFBSs in E. coli K12 and B. subtilis, respectively.
There are 40,405 and 57,567 unique predicted TFBSs in the whole set of 164,120 and 18,6960 input motifs predicted by phylogenetic footprinting in E. coli K12 and B. subtilis, respectively. Therefore, the vast majority of predicted TFBSs have been filtered out by our algorithm in both genomes. The fact that these relatively small numbers (4,900 and 4,600) of unique predicted TFBSs recover 1,102 (82%) and 324 (82%) of 1,347 and 397 known TFBSs in the top 300 and 230 clusters of E. coli K12 and B. subtilis, respectively, suggests that our clustering algorithm has significantly enriched the true TFBSs in the predictions (p < 10^{10} according to a hypergeometric distribution for both genomes). It is highly likely that our algorithm has achieved high prediction specificity in both genomes, although we cannot accurately compute the prediction specificity and false positive rate, because not all TFBSs in both genomes are currently known. However, we can estimate the low bound of prediction specificity based on the partially known TFBSs to be 22% (1102/4900) and 7% (324/4600) by the top 300 and 230 clusters in which 113 (113/300 = 30%) and 88 (88/230 = 38%) motifs recovered in E. coli K12 and B. subtilis, respectively. As shown in [Additional File 1: Supplemental Tables S1 and S2], the top 20 predicted motifs in both E. coli K12 and B. subtilis recover 10 known motifs, achieving a low bound specificity of 50% in both genomes. Therefore, we have achieved at least the same level of prediction accuracy in both genomes as we have previously obtained when E. col i K12 and B. subtilis were used as the single target genome [19].
In the current study since our selection of genomes in the two groups of target genomes, and prediction of COORs are not biased to any particular genome, e.g., E. coli K12 or B. subtilis in the groups, it is highly likely that we have also achieved the same level of prediction accuracy for all of the genomes in the groups, although we cannot systematically verify our prediction in the genomes other than E. coli K12 or B. subtilis, because none of these genomes has enough number of known TFBSs for doing so. However, if we inspect the predictions in any of these lessstudied genomes, biologically meaningful results can be always identified http://gleclubs.uncc.edu/pbs. For example, if we look at the topranked motifs in Yersinia pestis biovar Mediaevails that is relatively far away from E. coli K12 in the tree [Additional file 1: Supplemental Figure S1] in the γproteobacterial group, the 2nd topscoring motif is similar the CRP binding sites in E. coli K12, and the genes that bear a binding site of this motif are annotated to be involved in carbon and energy metabolism, therefore it is highly likely that this is the CRP binding motif in the genome. The 3rd topscoring motif is similar to the FUR motif in E. coli K12, and genes that bear a binding site of this motif are annotated to be involved in ion assimilation, therefore it is highly likely that this motif is recognized by FUR in this genome. The 32nd topscoring motif is associated with genes involved in arginine biosynthesis, thus is likely to be the binding motif of the regulator ArgR in this genome. The binding sites of the 43rd topscoring motifs are born by genes involved in pentose metabolism, so this motif is likely to be involved in the regulation of pentose metabolism, however, its cognate TF cannot be predicted based on the current data. The 50th topscoring motifs is similar to the LexA motif in E. coli K12, and many genes bearing a binding site of this motif are annotated to be involved in DNA damage repairing, so it is likely that this motif is recognized by LexA in this genome. Similar results are seen for the other Yersinia genomes in the group http://gleclubs.uncc.edu/pbs. Therefore, eGLECLUBS can not only predict known TFBSs in model organisms, but also can predict novel TFBSs in lessstudied organisms. The predicted TFBSs in the 32 γproteobacterial and 19 firmicutes genomes can be found at the website of eGLECLUBS http://gleclubs.uncc.edu/pbs. More importantly, with only one more graph clustering step than its predecessor GLECLUBS (see Methods), eGLECLUBS can predict TFBSs in a group of up to 32 genomes. Therefore, eGLECLUBS is more efficient than GLECLUBS.
Furthermore, these results also show that we have achieved the similar prediction accuracy in B. subtilis based on the 19 firmicutes genomes as in E. coli K12 based on the 39 γproteobacterial genomes using the same parameter settings, suggesting that our algorithm is very robust in predicting TFBSs in different groups of genomes. In addition, it is noticeable that the cumulative binding sites, motifs and unique putative binding sites in B. subtilis all saturate faster than those of E. coli K12 (Figure 5), respectively, although both genomes encode similar number of genes (4,105 vs. 4,132) and operons (about 2,300 vs. 2,400). This might indicate that B. subtilis encodes fewer TFs than does E. coli K12. In fact, according to the DBD database [35], the E. coli K12 genome was estimated to encode 266 TFs, while the B. subtilis genome 238 TFs.
Prediction of TFBSs in a group of cyanobacterial genomes
To further evaluate the prediction accuracy and robustness of eGLECLUBS, we applied it to a group of 14 target genomes chosen from 33 sequenced cyanobacteria [Additional file 1: Supplemental Figure S3] using the same parameter settings. Despite the important roles of cyanobacteria in global primary production and the evolution of green plants, our current understanding of cis regulatory systems in even the beststudied cyanobacterial Synechocystis sp. PCC 6803 strain (PCC6803) is still very limited. Since there are not many known TFBSs in any cyanobacterial genome, we cannot systematically verify our predictions in a way as we did in E. coli K12 and B. subtilis. However, as shown in Figure 2F, the similarity scores of the input motifs for this cyanobacterial group have a very similar distribution to those of the γproteobacterial and firmicutes groups, suggesting that eGLECLUBS should also perform well in this group of genomes. Furthermore, if we use our predictions in PCC6803 as an example, as shown in Figure 5C, the number of cumulative unique putative TFBSs in the topranked clusters in the genome rapidly saturates at ~3,354 around the 200th cluster, indicating that we have likely enriched the true TFBSs in PCC6803 as in the cases of E. coli K12 and B. subtilis. Again, the much faster saturation of the cumulative unique predicted TFBSs in PCC6803 than in B. subtilis and E. coli K12 (Figure 5C) suggests that the number of TFs in PCC6803 might be fewer than those in these two genomes. Indeed, DBD predicts that there are 85 TFs in PCC6803 [36], even though the actually number might be greater than this number [36]. Moreover, a few experimentally characterized or computationally predicted motifs by other ad hoc methods are among the top 20 predicted motifs in PCC6803 [Additional file 1: Supplemental Table S3], including the PhoB motif (9th) [37, 38], NtcA motif (11th) [39, 40], and CRP motif (14th) [41]. It is worth noting that, although the CRP regulons in cyanobacteria are rather diverse as we recently showed [41], eGLECLUBS can still accurately identify the CRP binding motif in PCC6803 with a high rank. Therefore, these results again unequivocally demonstrate that eGLECLUBS is robust enough to predict TFBSs in various groups of lessstudied genomes with high accuracy.
Correlation between the number of predicted motifs and the number of TFs encoded in genomes
As shown in Figure 5D, the number of topranked clusters at which the cumulative unique predicted TFBSs saturate is highly correlated to the number of predicted TFs in the DBD database for each genome in the three groups of target genomes. Therefore, the saturation point can be used to roughly estimate the number of motifs encoded in each genome in a group of target genomes. These results again suggest that our predictions of TFBSs in each genome have achieved a high level of accuracy, and that our algorithm is very robust in terms of parameter settings for different groups of target genomes. Interestingly, the two bacteria that have the largest number of predicted motifs and TFs, namely, Serratia proteamaculans 568 and Klebsiella pneumoniae MGH 78578 (Figure 5D), are freeliving pathogens of human and animals [42]http://www.ncbi.nlm.nih.gov/genomeprj/31. Their more complex gene regulatory systems might be due to their needs to cope with very different living environments. In contrast, the bacterium that has the smallest number of predicted motifs and TFs is the Cyanobacterium synechococcus PCC7002, which thrives in seawater where nutrients are relatively stable. Furthermore, cyanobacteria can acquire their carbon source through photosynthesis, and thus generally have simpler gene regulatory systems.
Discussion
Phylogenetic footprinting followed by motif clustering has been proven the most practical method for genomewide TFBS prediction in prokaryotic genomes. However, the existing methods including the GLECLUBS algorithm that we developed earlier [19] were designed to predict TFBSs in a single target genome of interest, and the TFBSs in multiple reference genomes were largely ignored and not predicted. Therefore these algorithms are not very efficient and cannot be applied to a large number of sequenced genomes. The major reason for this practice is that these methods identify orthologous gene groups for phylogenetic footprinting based on the gene contents in the target genome, therefore, interoperonic sequences pooled based on these orthologous gene groups are inevitably biased to the target genome, and the coverage of predictions in reference genomes may not be high enough to be used for the genomewide prediction of TFBSs in the genomes. In the current study, we have developed the eGLECLUBS algorithm for simultaneous genomewide prediction of TFBSs in each genome in a group of genomes without introducing a bias to any genome via overcoming two technical obstacles. First, we identify COORs in a group of target genomes based on the gene contents and operon structures in each genome, and treat each genome in the group equally both as a target genome and as a reference genome for the others. Therefore, the pooled interoperonic sequences are unlikely to be biased to a particular genome. Second, because we consider every gene in each genome in the group, we end up with about twice as many interoperonic sequence sets as we have when a single target genome was considered [19]. To circumvent the difficulty of clustering a much larger motif similarity graph with relatively high density for predicting TFBSs, we construct two similarity graphs G_{1} and G_{2} with a high and a low motif similarity score cutoffs, respectively. We first cut G_{2} into dense clusters, and then recruit in the resulting clusters the input motifs in G_{1} but not in G_{2}, and reconnect those that are not connect in G_{2} but in G_{1}. True motifs are then gradually separated from spurious ones through multiple rounds of subsequent graph clustering. Application of the eGLECLUBS algorithm to a group of 32 γproteobacterial genomes including E. coli K12, a group of 19 firmicutes genomes including B. subtilis, and a group of 14 cyanobacterial genomes, resulted in similar prediction results in both the E. coli K12 and B. subtilis genomes as we have achieved when each was used as the single target genome by GLECLUBS [19]. Due to the limited knowledge of TFBSs in most sequenced genomes, we can only systematically verify our predictions in the two model bacteria E. coli K12 and B. subtilis. We assert, however, that it is highly likely that we have achieved the same level of prediction accuracy for each genome in the three groups of genomes because of the unbiasedness of our algorithm to any genome in the groups. Therefore, we have improved our original algorithm for simultaneously predicting TFBSs of dozens of genomes without a cost to prediction accuracy. To further speed up the algorithm, we are currently parallelizing the eGLECLUBS algorithm through distributed computing, so that it can be applied to all the sequenced prokaryotic genomes in the public databases.
In addition, with a single set of parameter settings, our algorithm performs equally well on the groups of γproteobacterial, firmicutes and likely cyanobacterial genomes. Therefore, it is very robust for predicting TFBSs in various groups of genomes. The robustness of our algorithm can be explained by the similar distributions of the similarity scores among the input motifs identified in the different groups of target genomes, and the similar distributions of the submotifs of the known motifs in different genomes (Figure 2F). Consequently, the coverage of nodes and the graph density of motif similarity graphs under different motif similarity score cutoffs in the three groups are similar (Figure 4A and 4B). Therefore, the motif similarity score cutoffs used to construct motif similarity graphs for the three groups of genomes work equally well. Since these distributions are determined by the first principles of biochemical rules, thus they are unlikely to be genomespecific. Therefore, we expect that our algorithm can be applied to any group of target genomes using similar parameter settings.
Conclusions
We have developed a new algorithm for genomewide de novo prediction of TFBSs in a group of related prokaryotic genomes. The algorithm has achieved rather high prediction accuracy, and is very robust and computationally efficient. The resulting tool can be used for annotating TFBSs in the increasing number of sequenced prokaryotic genomes.
Methods
Materials
The genome sequences and their annotation files of γproteobacteria, firmicutes and cyanobacteria were downloaded from the NCBI ftp sever ftp://ftp.ncbi.nih.gov/genomes. The known TFBSs of E. coli K12 and B. subtilis were downloaded from RegulonDB Version 6.0 [20] and DBTBS Release 5 [21], respectively. Known and predicted TFs were downloaded from DBD Release 2.0 http://www.transcriptionfactor.org[35]. Predicted prokaryotic operons were downloaded from the DOOR database [32, 33], which has the highest prediction accuracy among all surveyed operon prediction algorithms [43].
Selection of a group of target genomes
To select a group of target genomes for which we want to predict TFBSs in each genome, we used a previously developed method that considers not only the evolutionary relationships, but also the number of shared TFs among the genomes in the group [19]. Briefly, we selected all the sequenced genomes from the same phylum or class of interest, e.g. the 139 sequenced γproteobacterial genomes, 124 sequenced firmicutes genomes, and 33 sequenced cyanobacterial genomes as it has been shown that genomes from the same phylum or class usually share TFBSs that are conserved enough to be predicted by phylogenetic footprinting [3–10]. We then computed a bits vector for each genome, where "1" and "0" represents the presence and absence of a known or predicted TFs in the genome according to the DBD database [35]. We constructed a neighborjoining tree based on the Hamming distance between the vectors of each pair of genomes. By selecting difference branches, we could obtain different groups of target genomes with different levels of conservation of gene transcriptional regulatory systems.
Predictions of orthologs and COORs
Orthologous proteins and their genes between two genomes were predicted by the bidirectional best hits (BDBH) method [44] using the BLASTP algorithm with an Evalue cutoff 10^{20} in both directions of search. For each operon o of each genome G_{ i } in a group of target genomes, we construct a group of operons called a SemiCluster of Operons with Orthologous Relationship (SCOOR) as follows (Figure 1): The SCOOR initially only contains the operon o; for each genome except G_{ i } in the group of target genomes, if there exists an operon o' containing orthologs of at least 50% of genes in o, we recruit o' into the SCOOR. Next, we construct a graph using these resulting SCOORs as the set of vertices. We connect two SCOORs by an edge if the larger one includes 70% of the operons in the smaller one. Because of the low connectivity, the graph is composed of many connected components (a connected component is defined as a subgraph in which any two vertices are connected by at least a path.). We call the operons associated with each connected component a Cluster of Operons with Orthologous Relationships (COOR). Clearly, in a COOR, the majority of genes in each operon in a genome have orthologs in other operons from other genomes. A COOR is conceptually similar to a cluster of orthologous transcription units defined by Wels et al. [9].
Prediction of input motifs
For each COOR C_{ i } containing at least three operons, we extracted up to 800 bases upstream interoperonic sequence for each operon in C_{ i } to form a set I_{ i }of sequences. Therefore each I_{ i }contains at least three interoperonic sequences, which is necessary for most motiffinding tools to work well. We apply P motiffinding tools to each I_{ i }, and each tool j returns its top T_{ j } motifs (Figure 3A and 3B). The length of returned motifs is set to be 16 bases for all tools as we found that motiffinding tools perform best with this length in prokaryotes [19]. Therefore, there are T = T_{1} + T_{2} + ... + T_{P} putative motifs for each I_{ i }. If there are n COORs containing at least three operons in the groups of target genomes, we will have a total of nT motifs, which are referred to as input motifs.
Calculation of motifmotif similarity
We used the same motif similarity metric that we defined previously [19] to compute a similarity score for each pair of input motifs. We have shown that this metric outperforms the other existing ones for separating true motifs from spurious ones. Briefly, for a motif M_{ x }containing n_{ x }sequences with length L_{ x } , let be its frequency matrix, and P_{ x } be its profile matrix defined as
where p_{ x } (b, i) is the probability of base b ∈{A, C, G, T} appearing at position i of M_{ x } , and q(b) is the probability of base b appearing in the background sequences. For two motifs M_{1} = (P_{1}, F_{1}) and M_{2} = (P_{2}, F_{2}), the similarity score between them is defined as
where
Note that in the metric (3), A is the set of optimal ungapped alignments that have the maximum number of aligned columns {i}, each satisfying , and column i of P_{ x } corresponds to column s(i) of F_{ y } in the alignment s ∈ A.
Prediction of TFBSs in a group of target genomes through graph clustering
In our previous design of the GLECLUBS algorithm [19], we constructed a motif similarity graph using the predicted input motifs as the nodes and connecting two nodes with an edge if their similarity scores is above a preset cutoff; we then clustered the graph into dense subgraphs using the MCL algorithm [31]. However, we found that when the size of the graph increases to the size of current application, the MCL algorithm becomes too inefficient to cluster the graph. To circumvent this obstacle, we first construct two motif similarity graphs G_{1} = (V_{1}, E_{1}) and G_{2} = (V_{1}, E_{1}) with a high and a low graph densities, respectively (Figure 3). The density of a graph is defined as the number of its edges divided by the number of its nodes in this paper. We choose the density of G_{1} to be high enough, so that most relevant motifs are presumably connected, and the density of G_{2} to be low enough so that only highly similar motifs are connected and the graph can be efficiently cut by the MCL algorithm. We first cut G_{2} into dense subgraphs/clusters, and then recruit in the resulting clusters the motifs not in G_{2} but in G_{1}(Figure 3). The details of the clustering procedure follow.

(A)
Construct motif similarity graphs G _{1} with a high density and G _{2} with a low density, and cut G _{2} into dense subgraphs using MCL. We construct two motif similarity graphs G _{1} = (V _{1}, E _{1}) and G _{2} = (V _{2}, E _{2}) using predicted motifs as the nodes, and connecting any two motifs by an edge if their similarity score is greater than two preset cutoffs α and β (β > α), respectively, with the similarity score being the weight on the edge (Figure 3C). Because of the larger value of β, G _{2} contains fewer nodes and edges than does G _{1}. We choose a value of α so that most relevant motifs are connected in G _{1}, and a value of β so that the density of G _{2} is low enough to be efficiently clustered into dense subgraphs using the MCL algorithm. We typically choose α and β such that the density of G _{1} is about 500, and that of G _{2} is about 15~20 (see Results). We then apply MCL to G _{2} to cut it into dense subgraphs, denoted by C = {c _{1}, c _{2}, ..., c_{ t } }(Figure 3C).

(B)
Construct and cluster motif similarity graph G _{3}. Let V _{1}  V _{2} = {v _{1}, v _{2}, ..., v_{ k } } be the nodes in graph G _{1} but not in G _{2}. We construct graph G _{3} = (V _{3}, E _{3}), where the node set V _{3} = C ∪(V _{1}  V _{2}) = {c _{1}, ..., c_{ t } , v _{1}, ..., v_{ k } }, and the edge set E _{3} is defined as follows. For each pair of nodes u, v ∈ V _{3} we calculate , where w_{ xy } is the weight of edge (x, y) ∈ E _{1}, and u and v the number of nodes of u and v, respectively. We connect u and v by an edge (u, v) ∈ E _{3} with a weight if and only if is greater than the cutoff α used to construct G _{1}. We then apply MCL to G _{3} to obtain a set of clusters (Figure 3C). The motifs in each of these resulting clusters induce a subgraph of G _{1}. Let these induced subgraphs be the set S= {s _{1},s _{2},...,s_{ n } } (Figure 3D).

(C)
Construct and cluster the quasicliquebased motif similarity graph G _{4}. For each subgraph s_{ i } in S, we apply the method described in our previous work [19] to find a clique associated with each node in s_{ i } , and then merge all cliques into a socalled quasiclique if any two cliques share the majority of their nodes (Figure 3D). For each quasiclique, we pool all the corresponding putative TFBSs, and merge any two sequences if they overlap more than 8 bases to form a quasicliquespecific sequence set. We then identify the best motif for each of the quasicliquespecific sequence sets using the motiffinding tool BioProspector [23]. We call this motif a quasicliquespecific motif. We then calculate the similarity score between each pair of quasicliquespecific motifs, and construct a new motif similarity graph G _{4} using the quasicliquespecific motifs as the nodes, and connecting any two nodes if the similarity between the two motifs is larger than a preset cutoff γ (= β, normally). We cluster graph G _{4} into a set of dense subgraphs using MCL.

(D)
Construct and cluster the extended sequencebased motif similarity graph G _{5}. For each cluster from G _{4}, we pool all the corresponding putative binding site sequences, and merge any two sequences from the same intergenic sequence if they overlap more than 8 bases to form a new nonoverlapping sequence set. To fix the possible problem of only covering a part of a binding site by motiffinding tools because of the use of a fixed motif length for motif finding so far, we extend each sequence on both ends by padding a fixed length (10 bases) of its flanking genome sequences. For each of these extended sequence sets, we identify the best 22base long motif called an extended motif. Similar to the construction of graph G _{1}, we construct a new motif similarity graph G _{5} using the same motif similarity score cutoff γ as used to construct G _{4}. We then cluster G _{5} into a set of subgraphs using MCL.

(E)
Construct and cluster genomespecific motif similarity graphs. Note that each motif in the clusters from G _{5} contains predicted TFBSs from different genomes in the groups of target genomes. We pull the putative binding sites in each cluster from G _{5} into a group if they are from the same genome to form a genomespecific sequence set. So sequences in each cluster are partitioned into multiple genomespecific sequence sets, and there are multiple genomespecific sequence sets from the same genome. For each genome, after merging the overlapping sequences in each genomespecific sequence set, we find the best motif in each of its genomespecific sequence sets using the motiffinding tool MEME [22] with the length being automatically detected in the region of 822 bases. We then calculate motifmotif similarity between each pair of the resulting motifs for the genome, and construct a genomespecific motif similarity graph using the similarity score cutoff γ used to construct G _{4}. The graph is clustered into subgraphs using MCL.
(F). Refine and rank the clusters of each genome. The clusters obtained from the genomespecific motif similarity graph contain sequences with different lengths, we identify up to top 15 motifs of different lengths (= 22  8 + 1) from the sequences of each cluster using MEME with the motif length being automatically detected in the region of 8~22 bases. Since the resulting motifs may cover different parts of the same motif, and contain the same TFBS because of the high similarity of sequences in each cluster, we merge the overlapping and redundant sequences covered by different motifs to form a set of unique sequences, which is the final predicted motif in each cluster. We use the following scoring function defined previously [19] to evaluate the likelihood that a cluster contains a true motif,
where n is the number of sequences in the best motif of length L found by MEME, p(b, i) and P(b, i) are the probability and profile, respectively, of base b appearing at position i of the motif as defined in formula (1), and N is the number of sequences in the cluster. We rank all the clusters in descending order according to their ClusterScore s.
Availability
The eGLECLUBS algorithm is currently implemented in Perl, consisting of a set of utility programs for generating input files and a suite of programs for TFBS prediction in genomes. The source codes and detailed instructions for running the programs are available at http://gleclubs.uncc.edu/pbs.
Abbreviations
 GLECLUBS:

Global Ensemble CLUsters of Binding Sites
 eGLECLUBS:

extended GLECLUBS
 ORF:

open reading frame
 TFBS:

transcription factor binding site
 COOR:

Clusters of Operons with Orthologous Relationships
 SCOOR:

semiCOOR
 TF:

transcription factor.
References
 1.
Sundquist A, Ronaghi M, Tang H, Pevzner P, Batzoglou S: Wholegenome sequencing and assembly with highthroughput, shortread technologies. PLoS One 2007, 2(5):e484. 10.1371/journal.pone.0000484
 2.
Mardis ER: The impact of nextgeneration sequencing technology on genetics. Trends Genet 2008, 24(3):133–141.
 3.
McCue L, Thompson W, Carmack C, Ryan MP, Liu JS, Derbyshire V, Lawrence CE: Phylogenetic footprinting of transcription factor binding sites in proteobacterial genomes. Nucleic Acids Res 2001, 29(3):774–782. 10.1093/nar/29.3.774
 4.
Qin ZS, McCue LA, Thompson W, Mayerhofer L, Lawrence CE, Liu JS: Identification of coregulated genes through Bayesian clustering of predicted regulatory binding sites. Nat Biotechnol 2003, 21(4):435–439. 10.1038/nbt802
 5.
van Nimwegen E, Zavolan M, Rajewsky N, Siggia ED: Probabilistic clustering of sequences: inferring new bacterial regulons by comparative genomics. Proc Natl Acad Sci USA 2002, 99(11):7323–7328. 10.1073/pnas.112690399
 6.
Alkema WB, Lenhard B, Wasserman WW: Regulog analysis: detection of conserved regulatory networks across bacteria: application to Staphylococcus aureus. Genome Res 2004, 14(7):1362–1373. 10.1101/gr.2242604
 7.
Jensen ST, Shen L, Liu JS: Combining phylogenetic motif discovery and motif clustering to predict coregulated genes. Bioinformatics 2005, 21(20):3832–3839. 10.1093/bioinformatics/bti628
 8.
Liu J, Xu X, Stormo GD: The cisregulatory map of Shewanella genomes. Nucleic Acids Res 2008, 36(16):5376–5390. 10.1093/nar/gkn515
 9.
Wels M, Francke C, Kerkhoven R, Kleerebezem M, Siezen RJ: Predicting cisacting elements of Lactobacillus plantarum by comparative genomics with different taxonomic subgroups. Nucleic Acids Res 2006, 34(7):1947–1958. 10.1093/nar/gkl138
 10.
Wang T, Stormo GD: Combining phylogenetic data with coregulated genes to identify regulatory motifs. Bioinformatics 2003, 19(18):2369–2380. 10.1093/bioinformatics/btg329
 11.
Madan Babu M, Teichmann SA: Evolution of transcription factors and the gene regulatory network in Escherichia coli. Nucleic Acids Res 2003, 31(4):1234–1244. 10.1093/nar/gkg210
 12.
PerezRueda E, ColladoVides J: The repertoire of DNAbinding transcriptional regulators in Escherichia coli K12. Nucleic Acids Res 2000, 28(8):1838–1847. 10.1093/nar/28.8.1838
 13.
Salgado H, MorenoHagelsieb G, Smith TF, ColladoVides J: Operons in Escherichia coli: genomic analyses and predictions. Proc Natl Acad Sci USA 2000, 97(12):6652–6657. 10.1073/pnas.110147297
 14.
MorenoHagelsieb G, ColladoVides J: A powerful nonhomology method for the prediction of operons in prokaryotes. Bioinformatics 2002, 18(Suppl 1):S329–336.
 15.
Wang T, Stormo GD: Identifying the conserved network of cisregulatory sites of a eukaryotic genome. Proc Natl Acad Sci USA 2005, 102(48):17400–17405. 10.1073/pnas.0505147102
 16.
Hertz GZ, Stormo GD: Identifying DNA and protein patterns with statistically significant alignments of multiple sequences. Bioinformatics 1999, 15(78):563–577. 10.1093/bioinformatics/15.7.563
 17.
Tompa M, Li N, Bailey TL, Church GM, De Moor B, Eskin E, Favorov AV, Frith MC, Fu Y, Kent WJ, et al.: Assessing computational tools for the discovery of transcription factor binding sites. Nat Biotechnol 2005, 23(1):137–144. 10.1038/nbt1053
 18.
Hu J, Li B, Kihara D: Limitations and potentials of current motif discovery algorithms. Nucleic Acids Res 2005, 33(15):4899–4913. 10.1093/nar/gki791
 19.
Zhang S, Xu M, Li S, Su Z: Genomewide de novo prediction of cisregulatory binding sites in prokaryotes. Nucleic Acids Res 2009, 37(10):e72. 10.1093/nar/gkp248
 20.
GamaCastro S, JimenezJacinto V, PeraltaGil M, SantosZavaleta A, PenalozaSpinola MI, ContrerasMoreira B, SeguraSalazar J, MunizRascado L, MartinezFlores I, Salgado H, et al.: RegulonDB (version 6.0): gene regulation model of Escherichia coli K12 beyond transcription, active (experimental) annotated promoters and Textpresso navigation. Nucleic Acids Res 2008, (36 Database):D120–124.
 21.
Sierro N, Makita Y, de Hoon M, Nakai K: DBTBS: a database of transcriptional regulation in Bacillus subtilis containing upstream intergenic conservation information. Nucleic Acids Res 2008, (36 Database):D93–96.
 22.
Bailey TL, Elkan C: Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proc Int Conf Intell Syst Mol Biol 1994, 2: 28–36.
 23.
Liu X, Brutlag DL, Liu JS: BioProspector: discovering conserved DNA motifs in upstream regulatory regions of coexpressed genes. Pac Symp Biocomput 2001, 127–138.
 24.
Thijs G, Lescot M, Marchal K, Rombauts S, De Moor B, Rouze P, Moreau Y: A higherorder background model improves the detection of promoter regulatory elements by Gibbs sampling. Bioinformatics 2001, 17(12):1113–1122. 10.1093/bioinformatics/17.12.1113
 25.
Olman V, Xu D, Xu Y: CUBIC: identification of regulatory binding sites through data clustering. J Bioinform Comput Biol 2003, 1(1):21–40. 10.1142/S0219720003000162
 26.
Liu XS, Brutlag DL, Liu JS: An algorithm for finding proteinDNA binding sites with applications to chromatinimmunoprecipitation microarray experiments. Nat Biotechnol 2002, 20(8):835–839.
 27.
Pavesi G, Mereghetti P, Zambelli F, Stefani M, Mauri G, Pesole G: MoD Tools: regulatory motif discovery in nucleotide sequences from coregulated or homologous genes. Nucleic Acids Res 2006, (34 Web Server):W566–570. 10.1093/nar/gkl285
 28.
Stormo GD, Hartzell GW: Identifying proteinbinding sites from unaligned DNA fragments. Proc Natl Acad Sci USA 1989, 86(4):1183–1187. 10.1073/pnas.86.4.1183
 29.
LozadaChavez I, Janga SC, ColladoVides J: Bacterial regulatory networks are extremely flexible in evolution. Nucleic Acids Res 2006, 34(12):3434–3445. 10.1093/nar/gkl423
 30.
McCue LA, Thompson W, Carmack CS, Lawrence CE: Factors influencing the identification of transcription factor binding sites by crossspecies comparison. Genome Res 2002, 12(10):1523–1532. 10.1101/gr.323602
 31.
van Dongen S: Graph clustering by flow simulation. PhD thesis. University of Utrecht; 2000.
 32.
Mao F, Dam P, Chou J, Olman V, Xu Y: DOOR: a database for prokaryotic operons. Nucleic Acids Res 2009, (37 Database):D459–463. 10.1093/nar/gkn757
 33.
Dam P, Olman V, Harris K, Su Z, Xu Y: Operon prediction using both genomespecific and general genomic information. Nucleic Acids Res 2007, 35(1):288–298. 10.1093/nar/gkl1018
 34.
MorenoCampuzano S, Janga SC, PerezRueda E: Identification and analysis of DNAbinding transcription factors in Bacillus subtilis and other Firmicutesa genomic approach. BMC Genomics 2006, 7: 147. 10.1186/147121647147
 35.
Wilson D, Charoensawan V, Kummerfeld SK, Teichmann SA: DBDtaxonomically broad transcription factor predictions: new content and functionality. Nucleic Acids Res 2008, (36 Database):D88–92.
 36.
Kummerfeld SK, Teichmann SA: DBD: a transcription factor prediction database. Nucleic Acids Res 2006, (34 Database):D74–81. 10.1093/nar/gkj131
 37.
Suzuki S, Ferjani A, Suzuki I, Murata N: The SphSSphR two component system is the exclusive sensor for the induction of gene expression in response to phosphate limitation in synechocystis. J Biol Chem 2004, 279: 13234–13240. 10.1074/jbc.M313358200
 38.
Su Z, Olman V, Xu Y: Computational prediction of Pho regulons in cyanobacteria. BMC Genomics 2007, 8: 156. 10.1186/147121648156
 39.
Su Z, Olman V, Mao F, Xu Y: Comparative genomics analysis of ntcA regulons in cyanobacteria: regulation of nitrogen assimilation and its coupling to photosynthesis. Nucleic Acid Res 2005, 33(16):5156–5171. 10.1093/nar/gki817
 40.
Herrero A, MuroPastor AM, Valladares A, Flores E: Cellular differentiation and the NtcA transcription factor in filamentous cyanobacteria. FEMS Microbiol Rev 2004, 28(4):469–487. 10.1016/j.femsre.2004.04.003
 41.
Xu M, Su Z: Computational prediction of cAMP receptor protein (CRP) binding sites in cyanobacterial genomes. BMC Genomics 2009, 10: 23. 10.1186/147121641023
 42.
Bollet C, Grimont P, Gainnier M, Geissler A, Sainty JM, De Micco P: Fatal pneumonia due to Serratia proteamaculans subsp. quinovora. J Clin Microbiol 1993, 31(2):444–445.
 43.
Brouwer RW, Kuipers OP, van Hijum SA: The relative value of operon predictions. Brief Bioinform 2008, 9(5):367–375. 10.1093/bib/bbn019
 44.
Mushegian AR, Koonin EV: A minimal gene set for cellular life derived by comparison of complete bacterial genomes. Proc Natl Acad Sci USA 1996, 93: 10268–10273.
Acknowledgements
This work was partially supported by a grant from the University of North Carolina at Charlotte, a grant from CMCUNCC Collaborative Research Fund and a grant (EF0849615) from NSF to ZS. We would like to thank Dr. Anthony Fodor for his critical reading of the manuscript.
Author information
Additional information
Authors' contributions
ZS conceived the project. SZ designed and conducted the experiment. SL and PTP helped conduct some analysis. ZS and SZ wrote the manuscript. All authors read and approved the final manuscript.
Electronic supplementary material
Additional file 1: Supplementary figures and tables. Additional file 1 consists of three supplementary figures and tables. Supplemental Figure S1: The phylogenetic tree of cisregulatory systems in sequenced γproteobacterial genomes for the selection of groups of target genomes containing E. coli K12. Supplemental Figure S2: The phylogenetic tree of cisregulatory systems in sequenced firmicutes for the selection of a group of target genomes containing B. subtilis. Supplemental Figure S3: The phylogenetic tree of cisregulatory systems in sequenced cyanobacteria for the selection of a group of target genomes containing Synechocystis sp. PCC 6803. Supplemental Table S1: The top 20 motifs/clusters predicted in E. coli K12. Supplemental Table S2: The top 20 motifs/clusters predicted in B. Subtilis. Table S3: The top 20 motifs/clusters predicted in Synechocystis sp. PCC 6803. (PDF 1 MB)
Authors’ original submitted files for images
Below are the links to the authors’ original submitted files for images.
Rights and permissions
About this article
Cite this article
Zhang, S., Li, S., Pham, P.T. et al. Simultaneous prediction of transcription factor binding sites in a group of prokaryotic genomes. BMC Bioinformatics 11, 397 (2010). https://doi.org/10.1186/1471210511397
Received:
Accepted:
Published:
Keywords
 Target Genome
 Prokaryotic Genome
 Phylogenetic Footprinting
 Dense Subgraph
 Cyanobacterial Genome