Bayesian hierarchical model for transcriptional module discovery by jointly modeling gene expression and ChIP-chip data

Background Transcriptional modules (TM) consist of groups of co-regulated genes and transcription factors (TF) regulating their expression. Two high-throughput (HT) experimental technologies, gene expression microarrays and Chromatin Immuno-Precipitation on Chip (ChIP-chip), are capable of producing data informative about expression regulatory mechanism on a genome scale. The optimal approach to joint modeling of data generated by these two complementary biological assays, with the goal of identifying and characterizing TMs, is an important open problem in computational biomedicine. Results We developed and validated a novel probabilistic model and related computational procedure for identifying TMs by jointly modeling gene expression and ChIP-chip binding data. We demonstrate an improved functional coherence of the TMs produced by the new method when compared to either analyzing expression or ChIP-chip data separately or to alternative approaches for joint analysis. We also demonstrate the ability of the new algorithm to identify novel regulatory relationships not revealed by ChIP-chip data alone. The new computational procedure can be used in more or less the same way as one would use simple hierarchical clustering without performing any special transformation of data prior to the analysis. The R and C-source code for implementing our algorithm is incorporated within the R package gimmR which is freely available at http://eh3.uc.edu/gimm. Conclusion Our results indicate that, whenever available, ChIP-chip and expression data should be analyzed within the unified probabilistic modeling framework, which will likely result in improved clusters of co-regulated genes and improved ability to detect meaningful regulatory relationships. Given the good statistical properties and the ease of use, the new computational procedure offers a worthy new tool for reconstructing transcriptional regulatory networks.


Background
Transcriptional regula tion is one of the crucial mechanisms used by living systems to maintain homeostasis. Disregulation of gene expression underlies toxic effects of many chemicals [1], and gene expression changes are often reliable markers of a disease [2]. The specificity of transcriptional initiation of a eukaryotic gene is maintained through a complex interaction of one or more sequence-specific transcription factors, regulatory DNA regions harboring corresponding DNA regulatory motifs, chromatin-remodeling proteins and the basal transcriptional machinery [3]. While not all modes of expression regulatory controls are known, it has been shown that in many important biological processes the initiation of transcription requires binding of one or more transcriptional factors to their cognate regulatory motifs within regulatory DNA regions. Two key high-throughput (HT) experimental technologies are capable of producing data offering insights into the expression regulatory mechanism on a genome scale. The first technology are expression microarrays facilitating simultaneous monitoring expression of virtually all genes in a genome [3][4][5]. The second technology is the Chromatin Immuno-Precipitation on Chip (ChIP-chip) technology facilitating assessment of transcription factor binding events on a genomic scale [6,7]. Optimal joint modeling of data generated by these two complementary biological assays, with the goal of identifying and characterizing TMs, is an important open problem in computational biomedicine.
Earliest applications of microarray technology included attempts at discovering shared regulatory motifs and corresponding transcription factors within groups of coexpressed genes identified by cluster analysis [8]. Groups of co-expressed genes were first identified by clustering gene expression profiles. Putative regulatory motifs inducing the co-expression were then identified de-novo using the MEME algorithm [9]. The inefficiency of procedures in which different data-types (e.g. expression data and promoter sequences) are analyzed separately is due to the inability of patterns in different data-types to re-enforce each other. For example, due to the noise in microarray data, the correlation between expression levels of two co-regulated genes could be too weak to be detected by clustering expression data alone. However, if evidence exists that promoters of these two genes are bound by the same TF, this information could enforce the weak signal in the expression data and allow us to identify these two genes as being parts of the same TM. In the traditional two-step approach such co-regulation will be lost since the second step regulatory motif analysis is conditional on co-expression of the two genes.
Several heuristic algorithms have been developed for constructing TMs by integrated analysis of gene expression and binding (ChIP-chip) data. Genetic Regulatory Modules (GRAM) algorithm [10] uses binding data to identify a gene set bound to common TFs (p-value < 0.001). It then searches for other genes at a lower level of significance (p-value < 0.01) that are bound by those TFs and have similar expression levels to the initial gene set core (d < d 0 ). ReMoDiscovery [11] follows similar stringent and relaxed two step procedures and infers TMs from Chipchip, motif and expression data. Module Finding Algorithm (MOFA) also uses two level p-values, but additional criteria for selecting genes regulated by a specific TF is the correlation between expression levels of such genes and expression level of the TF [12]. Statistical-Algorithmic Method for Bicluster Analysis (SAMBA) algorithm [13] transforms expression and binding data items to properties of genes/genes or genes/proteins, then generates a genes-properties bipartite graph. The algorithm aims at discovering sets of genes with statistically significant common properties. SAMBA requires discretization of inherently continuous gene expression and binding data based on more or less ad-hoc cut-offs which will almost certainly reduce the information content of the data.
In a model-based approach to find TMs based on gene expression and TF binding data, one postulates the probabilistic model of all data and then estimates parameters of the model which define TM membership. Three such models based on Bayesian networks have been proposed. In the first approach [14] both gene expression and ChIPchip data are directly modeled within the same Bayesian hierarchical model. In the other model, ChIP-chip data is used to calculate prior probabilities of TM memberships [15] based on an extension of the Bayesian module networks model [16]. In both of these models, the number of the modules has to be first be estimated from the data (or guessed) and all inference is valid conditional on the number of modules being correct. Since both of these models can also be thought of as extensions of the basic finite-mixture model, it is very likely that they will share inherent instability with respect to misspecification of the "correct" number of modules [17,18]. Earlier, a Bayesian casual network inferred from discretized expression data was used to describe the gene regulatory network with the binding data used to establish the constraints for the network structure [19]. The number of genes participating in the network construction is limited because of the complexity of model search. COGRIM [20] algorithm uses a Bayesian hierarchical framework to fit a gene-by-gene linear regression model of a gene's expression levels as function of is a quadratic function of all TFs' expression levels and their pair-wise interactions. The ChIP-chip binding data and the TF binding motif scores based on predefined Position Weight Matrices (PWM) are integrated as the prior information in the model. Genes are grouped into same TMs if they are regulated by the same set of TFs.
We developed a novel Expression-ChIP Infinite Mixture (ECIM) model for identifying TMs by jointly modeling gene expression and TF binding data. The model is constructed by extending the context-specific infinite mixture model (CSIMM) [21] in such a way that expression and binding data are represented by two separate contexts with different probabilistic models. We also constructed a novel probabilistic representation for the ChIP-chip data that seems to capture all relevant information from this data and use it within the binding-context of the model. The overall approach makes use of the Bayesian infinite mixture framework [17,18] to circumvent the issue of identifying the 'correct' number of global and local patterns in the data. Context-specificity not only allows the use of different probabilistic models to represent expression and binding data, but it also allows for discordances between patterns of co-expression and co-regulation. Posterior distribution of model parameters is estimated using Gibbs sampling [22]. TMs are formed based on Posterior Pairwise Probabilities (PPPs) of co-membership and Posterior Binding Probabilities (PBPs). It has been previously shown that PPPs can be directly interpreted as measures of statistical significance of co-membership [18,21].
The new computational procedure can be used in more or less the same way as one would use simple hierarchical clustering without need to perform any special transformation of data prior to the analysis. In the results section we show that PBPs are able to identify binding relationships not revealed by CHIP-chip binding data alone. We demonstrate the ability of this procedure to integrate information from gene expression and TF binding data by assessing the functional coherence of TMs constructed from real-world datasets.

Data preparation
We constructed four expression-binding datasets to examine the performance of ECIM and alternative methods. For each dataset, binding data consisted of ChIP-Chip data assessing binding affinities for 106 TFs to promoters of 6270 genes [6]. Expression datasets we used were the sporulation data set consisting of gene expression measurements throughout the sporulation process for the yeast strain SK1 [8]; the sporulation data set consisting of gene expression measurements during sporulation for the yeast strains SK1 and W303y [23]; the cell cycle data set consisting of gene expression measurements spanning two complete yeast cell cycles [24]; and the combined sporulationcell cycle dataset which we previously used to validate the CSIMM model [21]. Dual channel data [8] was processed by: (i) adjusting for background signal intensities; (ii) calculating log-intensity ratios of intensities in two channels; (iii) adjusting log-ratios using local regression of logratios on average log-intensities in two channels; and (iv) centering each gene's log-ratios by subtracting the genespecific average log-ratio. Affymetrix data [23,24] was processed by: (i) setting any measurement below one to one; (ii) log-transforming measurements; and (iii) centering each gene's log-measurements by subtracting the gene-specific average log-measurement. Genes with the maximum signal strength of less than 100 were excluded from the analysis. To make results comparable across different datasets, we used only data for genes represented on all microarray platforms (4980 genes).

Sensitivity and specificity of co-memembership in TMs
Using the Gibbs sampler, we generated a sequence of TMs approximating the marginal posterior distribution of TMs given data. This distribution was summarized by calculating PPPs of two genes belonging to the same TM, and PBP of a specific TF binding to the promoter of a specific gene. For each dataset we constructed an Expression-ChIP Infinite Mixture (ECIM) based hierarchical clustering of genes using PPPs as the similarity measure with the averagelinkage principle. The precision of such analysis was compared to results obtained by using alternative analytical approaches and by using the equivalent models with only expression or only binding data. Following are descriptions of all methods compared:

ECIM Expression and Binding
Hierarchical clustering based on PPPs derived from ECIM analysis of both expression and binding data.

ECIM Expression
Hierarchical clustering based on PPPs derived from ECIM analysis of expression data.

ECIM Binding
Hierarchical clustering based on PPPs derived from ECIM analysis of binding data.

Binding P-Values
TM's formed based on p-values of binding calculated in the original publication [6].

Binding PBP
TMs formed based on PBPs from ECIM analysis of expression and binding data.

Euclidian Distance
Hierarchical clustering based on Euclidian distances of expression data.

GRAM
TMs formed using the GRAM algorithm with default parameters, expression and binding data.

SAMBA
TMs formed using the SAMBA algorithm with default parameters and expression data only. ROC curves were constructed by correlating results for the 949 KEGG-associated genes where "functional clusters" are based on the co-membership of these 949 genes within any KEGG [25] pathway. It is obvious that this is not the perfect "gold standard" as some co-regulated genes will not be categorized to belong to a common pathway and vice versa. However, the assumption behind using membership in specific pathways as a gold standard, which is that co-regulated genes are more likely to participate in the same pathway than randomly grouped genes, is reasonable. Other well-known annotation databases, such as GO [26] or MIPS [27], are more complicated to use since they are hierarchically structured and results would depend on the level of specificity used to construct functional grouping.

ROC for hierarchical TMs based on hierarchical clustering using PPPs and Euclidian distance
The tree was cut at different depths to create clustering with every possible number of clusters. For a fixed number of clusters a pair of genes (from the 949 genes assigned to at least one pathway) belonging to the same cluster was assumed to be a "true positive" if the two genes both belonged to at least one specific KEGG pathway, and it was considered to be a "false positive" if they did not share a single KEGG pathway. True and false positive rates were then obtained by dividing the number of true/false positives with the total number of gene pairs sharing a common KEGG pathway and total number of gene pairs not sharing a KEGG pathway respectively. When the number of clusters is equal to the number of genes and all genes are placed in their own individual clusters, both true and false positive rates are equal to zero. A ROC curve is defined when we reduce the number of clusters and both true and false positive rates increase. At the extreme when all genes are placed in the same cluster, both true and false positive rates are equal to one.

ROC curves for ECIMs based on binding p-values and PBP
The significance cut-off was varied between 0 and 1. For each cut-off level, two genes were considered to be co-regulated if they were bound by at least one common TF at this significance level. True and false positive rates were established in the same way as for the clusters formed by hierarchical clustering with KEGG "gold standard".

GRAM and SAMBA
True and false positive rates for TMs produced by the two algorithms with default parameters were calculated in the same way as for the previous two situations. There was no recommended way to vary specificity and sensitivity of these two algorithms so we report only a single true and false positive rate for each algorithm.
Since just 5% of gene-pairs annotated in KEGG shared the same pathway, only when the True Positive Rate (TPR) is at least 20 times higher than the False Positive Rate (FPR) do true positive pairs outnumber the false positives. Therefore we only show ROC curves for each dataset/ method combination for statistically relevant false-positive rates (less than 0.05). The FPRs achieved by GRAM and SAMBA are around or less than 0.001, thus we plotted left most part of ROC curves (less than 0.0025) to make a clear comparison (Figure 1). ROC curves on the expended rage of FPRs (less than 0.05) are shown Figure 2.
ECIM-derived TMs based on the expression and binding data clearly outperformed all other approaches. In all three datasets, ECIM framework was able to successfully integrate information from both data types and significantly improve precision of analysis over individually analyzing any one of two data types. When using only binding data, it made no difference whether we simply use p-values to construct modules or apply ECIM procedure using only the binding data context, which was expected since the binding data was the only information source even we use different processing methods. On the other hand, TMs constructed by either hierarchically clustering genes using PPPs or using PBPs derived from the same analysis, were equally precise. This suggests that either PPP or PBP summarizes the posterior distribution of TMs generated by the ECIM analysis of two data types and carries all the meaningful information about the underlying TM structure.
To demonstrate the seamless integration of ECIM framework with more sophisticated expression data models we re-analyzed the combined sporulation-cellcycle data set we previously described [21] using CSIMM model for multiple expression data contexts ( Figure 2 and Figure  3D). As expected, the ECIM with CSIMM expression data contexts outperformed all other approaches, indicating the ability of the CSIMM model to effectively integrate information from different expression data sets and the ability of the ECIM model to integrate further such complex expression data with ChIP-Chip binding data.
The performance of two previously described computational procedures for constructing TMs based on joint analysis of expression and binding data was relatively poor. Points defined by single pairs of true/false positive rates for both methods with default parameters fall below all ROC curves including the one that uses only binding p-values. For the combined sporulation-cellcycle dataset we manipulated the parameters for the two algorithm with the goal of obtaining ROC points for a range of false positive rates. Detailed tables of parameters used and resulting FPRs and TPRs are shown in Supplemental Tables 1 and 2, (see Additional files 2 and 3 ). ROC points obtained by these two algorithms with non-default parameters are depicted by smaller dots in Figures 2 and 3D. While we managed to expand the range of FPRs, the overall conclusions did not change.
In the case of SAMBA we used only expression data because we could not establish with certainty the appropriate transformation for the binding data used in the original study [13]. This is appropriate because the statistical procedure implemented in SAMBA is same for both the gene expression and appropriately transformed ChIP-chip data. Furthermore, SAMBA has been originally described in the context of clustering gene expression data alone and the web page manual describes only this kind of use. However, it is important to emphasize that SAMBA's performance should be compared to results of other procedures that use only gene expression data (Euclidian Distance and ECIM Expression). Given the poor precision of TMs generated by SAMBA when compared to ECIM using only expression data, we conjecture that adding binding data is unlikely to improve SAMBA's results to the point of performing better than ECIM using both data types. For the sporulation data in Figure 1A both SAMBA and GRAM failed to identify any TMs. Same was the case for GRAM with cell-cycle data in Figure 1C.
ROC curves for 8 different algorithms using three different yeast gene expression datasets Figure 1 ROC curves for 8 different algorithms using three different yeast gene expression datasets. A) Chu,1998, sporulation; B) Primig,2000, sporulation;C) Cho,1998, cellCycle and the ChIP-chip data of Lee, 2002. KEGG pathways were used as the gold standard. ECIM utilizing both expression and binding data dominated all other algorithms. ROC "spots" for GRAM and SAMBA algorithms were obtained by applying the algorithms using the default parameters. In the original publications, both SAMBA and GRAM were used to analyze larger expression datasets than we used here. To assess the scalability of results presented here we also analyzed a significantly larger dataset with 165 microarray experiments assessing yeast transcriptional responses to various environmental perturbations [28]. The functional coherence of produced TMs was also compared to the functional coherence of TMs previously constructed using a large scale gene expression data analysis [29] for 23 different cut levels provided by authors, and two latest algorithms (ReMoDiscovery and COGRIM) [11,20] for constructing TMs from jointly analyzing gene expression data, ChIP-chip data and DNA motif scores obtained by scanning gene promoters using predefined PWM. The comparisons to ReMoDiscovery and COGRIM were based on results published in original publications describing these two algorithms. These results were based on analyzing the gene expression datasets that contained the Gasch dataset [28], and on the same TF binding dataset we used in our analyses (Lee's ChIP-chip data [6]). We downloaded module definitions from the respective support web sites and constructed ROC points using again KEGG pathways as the gold standard. For ReMoDiscovery we used two modules definitions discussed in the paper (seed module and extended module). For COGRIM we used two modules defined by authors (B+C+ corresponding to modules defined by COGRIM and supported by binding data alone and B-C+ corresponding to modules defined by COGRIM but not supported by binding data alone) and the combined module corresponding to all modules constructed by COGRIM. Unfortunately, after multiple attempts we were not able to construct TMs using SAMBA on this dataset. This could be a consequence of the large number of missing values in this dataset or our inability to correctly format missing values. We again manipulated GRAM parameters (details in Supplemental Table 1, (see Additional file 2) to expand the range of false positive rates.
Basic conclusion still held and all algorithms we tested produced improved ROC results when compared to the smaller expression datasets ( Figure 4). However, although ECIM performed as well or better than any other algorithm, significant improvements in precision from adding ChIP-chip data were visible only when PBP's are used to construct the modules. This could be the consequence of the additional noise in the algorithm for constructing hierarchical clustering from PPPs. ECIM also outperformed TMs constructed by the large gene expression datasets alone [26] as well as two algorithms that use expression, binding and DNA sequence motif information to infer TMs [11,20] despite the dramatically smaller number of data points used in the analysis. COGRIM outperformed GRAM and matched the functional coherence of modules that were based on a much larger gene expression dataset alone. This could be due to the additional regulation information used in the analysis or simply due to the more efficient use of the expression data alone.
Finally, we performed additional comparisons between TMs produced by GRAM and ECIM using Gene Ontologies as the gold standard [26]. In this comparison, we constructed TMs by cutting the hierarchical tree constructed by the ECIM algorithm so that the total number of genes in resulting TMs was about the same as the number of genes implicated by GRAM (740 unique genes in 98 TMs). For each gene-pair we identify the most specific GO category to which both of them belong by defining the specificity as I =  Lee, 2002. KEGG pathways were used as the gold standard. ECIM utilizing both expression and binding data dominated all other algorithms. ROC "spots" for GRAM and SAMBA algorithms were obtained by applying the algorithms using the default parameters. Smaller ROC "spots" for SAMBA were obtained by systematically manipulating algorithm's parameters.

GRAM SAMBA
number of genes annotated in GO. It has been shown that such a measure of specificity is a good way to represent the level of information about functional relationship between genes based on GO groupings [30]. For a specific cut-off i, a pair of genes is True Positive if the corresponding I>i and are placed in at least one common TM. A pair of genes is False Positive if I>i, but the two genes do not share a commong TM. ROC curves in Figure 5 are constructed by systematically changing the threshold i and calculating corresponding true and false positive rates for TMs constructed by GRAM and those constructed by ECIM. Results of this analysis are concordant with results obtained by using KEGG pathways as the gold standard.
In addition to constructing ROC curves we examined the coherence of TMs identified in this analysis in terms of statistical significances of over-represented Gene Ontologies. For each TM, we identified the most over-represented Gene Ontology as measured by the p-value of the Fisher's exact test. The distribution of TM sizes and the statistical significances of most over-represented Gene Ontologies is depicted in Figure 6. Assuming that the false discovery rate of 0.05 to be statistically significant, the results of the analysis are summarized in Table 1 Lee, 2002. KEGG pathways were used as the gold standard. ECIM utilizing both expression and binding data dominated all other algorithms. Large ROC "spots" for GRAM and SAMBA algorithms were obtained by applying the algorithms using the default parameters. Smaller ROC "spots" for GRAM and SAMBA were obtained by systematically manipulating algorithm's parameters. The comparison of Gene Ontologies significantly associated with TMs constructed by ECIM and GRAM (Table 2) reveals that several key Gene Ontologies were implicated by both algorithms (protein biosynthesis, Sporulation, sulfur metabolism, mitosis and amino acid metabolism). On the other hand, 8 out of 15 ECIM modules and 5 out 15 GRAM modules were algorithm specific. All of these 13 algorithm specific categories could be linked in one way or another to the two basic process investigated by expression data (sporulation and cell cycle). Consequently, it seems that both algorithms are identifying relevant TMs, it is just that ECIM is assigning a greater number relevant genes to these TMs. The list of all TMs along with the associated Gene Ontologies is given in the Supplemental Table 3, (see Additional file 4).

Constructing TM's and identification of associated regulators
To demonstrate the simplicity of use and interpretation of ECIM results we constructed TMs based on results of the combined sporulation-cell cycle dataset. 294 genes were selected based on the fact that their average linkage distance based on ECIM-derived PPPs to at least one other gene or group of genes was below 0.1 and their cluster size is larger than 10. Previously we demonstrated that such average linkage distance cut-offs have direct interpretations in terms of statistical significance of implicated associations [21]. The heatmap in Figure 7 Table 4, (see Additional file 5). The biological meaning of identified TMs is discussed in the next section.
We also investigate the utility of PBPs in identifying novel regulatory relationships not implicated by ChIP-chip data

Description of transcriptional modules detected Sporulation
Clusters associated with the biological processes of synapsis/recombination and spore wall assembly were clearly discerned in the Primig sporulation dataset (Figure 7). Genes within each of the clusters for both yeast strains SK1 and W303 were all upregulated late in the sporulation process. Joint data clustering showed enrichment in the number of clusters associated with sporulation as well as the number of regulators identified (Figure 8). In addition to modules regulated by Sum1 and Pho4, the ECIM algorithm identified a third transcriptional module associated with synapsis/recombination (cluster 4) that consisted of three additional regulators; Gln3, Otu1 and Rcs1. Gln3 positively regulates genes that are subject to nitrogen catabolite repression (NCR) [35]; under conditions of nitrogen limitation, Gln3 localizes to the nucleus and activates NCR-sensitive genes. Gln3 was likely detected due to the use of nitrogen-deficient sporulation media. In addition to its role as a deubiquitylation enzyme, Otu1 has been suggested by database mining to affect PIS1 expression, which is required for the final step in phosphatidylinositol synthesis [36]. Previous work has demonstrated that S. cerevisiae inositol auxotrophic strains require inositol for the completion of sporulation [37]. Rcs1 is a transcription factor involved in iron utilization and homeostasis [38]. Previous studies have found that it is also involved in controlling cell size [39] as well as biotin uptake and biosynthesis, nitrogen assimilation and purine biosynthesis [40]. Using joint data clustering, two transcriptional modules separately detected Sum1. SUM1 is required for middle sporulation element-mediated repression during meiotic development in S. cerevisiae [32].

Amino acid metabolism
A single transcriptional module involved in the biological process of amino acid metabolism was detected using expression data exclusively. This ten gene Gcn4-regulated module could not be further specifically annotated. In contrast, joint data clustering identified a transcriptional module that was significantly associated with methionine biosynthesis (cluster 2 in Figure 8). Genes in this module were cell cycle regulated and had increased expression in the S/G2 transition (Figure 7). This "MET" cluster has similarly been observed using microarrays to study S. cerevisiae cell cycle-regulated genes [41]. In addition to Gcn4, the primary regulator of the transcriptional response to amino acid starvation, joint data clustering identified Met4, Met31, Cbf1 and Ino4 (Figure 8). Met4 is responsible for the regulation of the sulfur amino acid pathway and requires different combinations of auxiliary factors including Met31 and Cbf1. In the Primig sporulation dataset [23], genes in the cluster associated with methionine biosynthesis show a derepression early in the sporulation process prior to sporulation clusters associated with ROC curves comparing the functional coherence of TMs constructed by GRAM and ECIM using the combined sporu-lation and cell-cycle gene expression dataset and the ChIP-chip data of Lee, 2002 with Gene Ontologies as gold stand-ard Figure 5 ROC curves comparing the functional coherence of TMs constructed by GRAM and ECIM using the combined sporulation and cell-cycle gene expression dataset and the ChIPchip data of Lee, 2002 with Gene Ontologies as gold standard.
spore wall assembly. Ino4 is required for derepression of inositol-choline-regulated genes involved in phospholipid synthesis. Previous work has shown that the completion of sporulation requires inositol [37].

Protein biosynthesis
Three clusters associated with the biological processes of rRNA processing and metabolism, RNA processing and ribosomal gene expression, and mitochondrial ribosomal protein metabolism were detected using expression data exclusively. However, only one transcriptional module was identified, consisting of the regulators Fhl1, Rap1 and Yap5. This same cluster was identified using joint data clustering (cluster 10) and two additional regulators were identified; Met4 and Pdr1. Patterns identified in both sporulation and cell cycle datasets suggested that genes regulated by this module were upregulated in G1-and Sphases and/or early in SK1 sporulation. The forkhead-like transcription factor Fhl1 plays a key role in the control of rRNA processing [42]. Rap1, in its role as a positive regulator, activates a number of ribosomal proteins [43]. Yap5 is a bZIP protein, shown to be regulated at the G1/S transition [44]. Pdr1 is a master drug regulator involved in the recruitment of other zinc cluster proteins to pleiotropic drug resistance elements to modulate the regulation of multidrug resistance genes [45]. Met4, also identified above in the amino acid metabolism transcriptional module category, is a transcription factor involved in the regulation of the sulfur amino acid pathway.
The second transcriptional module involved in the biological process of rRNA processing and metabolism (cluster 5) was identified using joint data clustering and consisted of three additional regulators; Arg80, Hap3 and Rcs1. Patterns identified in the Cho cell cycle dataset [24] suggested that genes regulated by this module were upregulated in S-and G2-phases. The ReMoDiscovery algorithm similarly identified Arg80 associated with ribosome biogenesis [11], a transcription factor involved in regulation of arginine-responsive genes. Likewise, the GRAM algorithm identified Rcs1 associated with protein synthesis [10]. Rcs1, also identified above in the sporulation transcriptional module category, is a transcription factor involved in a variety of different processes, including iron homeostasis, control of cell size, biotin biosynthesis, nitrogen assimilation and purine biosynthesis. Hap3 is a subunit of the CCAAT-binding factor (CBF), which activates genes required for respiratory metabolism; the Hap2 and Hap3 subunits of CBF are also required for optimal expression of ASN1, an asparagine synthase [46].

Cell cycle
Two transcriptional modules involved in the biological processes of chromatin cohesion and DNA repair and G2/ M cell cycle transition were detected using expression data exclusively. Joint data clustering also identified these two modules (clusters 3 and 8), but found several more regulators. In addition to Dot6, MATa1, Mbp1, Mcm1, Ndd1 and Swi6, the CSIMM algorithm identified Fkh2, Ino4 and Swi4. Further, two additional transcriptional modules associated with the biological processes of late-G1specific transcription (cluster 6) and cytokinesis (cluster 1) were detected ( Figure 8) and included the regulators Ace2, Ash1, Mbp1, Skn7, Stb1 and Swi4 as well as Fkh1, Ino4 and Mcm1.
In diploid cells, MATa1 has been shown to interact with another homeodomain protein, MATalpha2, and bind DNA as a heterodimer to repress transcription of haploidspecific genes [47]. Mbp1 is a DNA-binding protein that forms the MBF complex with Swi6; MBF is a sequencespecific transcription factor that regulates gene expression during the G1/S transition of the cell cycle [48]. In addition to Mbp1, Swi6 has been shown to form the SBF complex with the DNA-binding protein Swi4 to regulate transcription at the G1/S transition [49]. The MBF and SBF The distribution of TM sizes vs -log 10 of FDR-adjusted p-val-ues calculated by Fisher's test for association between the membership in a TM and the most significantly over-repre-sented Gene Ontology Figure 6 The distribution of TM sizes vs -log 10 of FDR-adjusted p-values calculated by Fisher's test for association between the membership in a TM and the most significantly over-represented Gene Ontology. The green line represents the statistically significant cut-off of FDR<0.05. All points above the line represent statistically significant associations.
complexes regulate late-G1-specific transcription. Although Skn7 is required for induction of heat-shock genes by oxidative stress [50], it has recently been shown to associate with Mbp1, forming a transcription factor independent of MBF that may be involved in the budemergence process [51]. Stb1 binds to Swi6 and has a role in the regulation of MBP-specific transcription [52].
Mcm1 has been shown to be required for the coordination of G2-specific transcription [53]. Ndd1 is essential for the expression of a set of late-S-phase-specific genes [54]. Fkh1 and Fkh2 are transcription factors of the forkhead family that regulate the cell cycle [55]. Ace2 has been shown to activate the expression of early-G1-specific genes [56]. Dot6 is a protein of unknown function involved in telomeric silencing [57] and filamentation [58]. Ino4, also identified above using joint expression and binding data clustering in the amino acid metabolism transcriptional module category, is a transcription factor that regulated genes involved in phospholipid synthesis.
In diploid cells deprived of nitrogen, Ash1 has been shown to be asymmetrically localized to the nuclei of daughter cells during pseudohyphal growth [59].
Finally, the transcriptional coherence of the genes in these TMs and associated regulators were assessed by calculating average correlations between expression levels of genes in a TM and the expression levels of associated TFs ( Table 3). The statistical significance of these average correlations (r) was assessed by calculating p-values based on resampling-based null-distribution of average correlations. Briefly, for each TM-TF pair a random set of genes of the same size as the original TM was selected from the list of all genes used in the analysis. The average correlation between the expression levels of the TF and all genes in such random set was calculated and compared to the actual average correlation for this TM-TF pair. This was repeated 2000 times. For r>0, one-sided p-value assessing the statistical significance of r was calculated as the proportion of times when r was larger than re-sampled average correlations. For r<0, one-sided p-value was calculated as the proportion of times when r was smaller than resampled average correlations. Two-sided p-values were obtained by doubling the one-sided p-values and are reported in Table 3. P-values that were equal to zero by this calculations were set to the smallest observable nonzero p-value (0.001). 23 out of 37 TM-TF pairs were significantly positively or negatively correlated (p-value< 0.05). Expected number of pairs with p-value<0.05 under the global null hypothesis that none of the TM-TF pairs were correlated is less than 2. 15 out of 23 TFs were positively correlated with respective TMs representing putative inducers. 8 TM-TF pairs were negatively correlated implicated potential repressors.
In this paper we utilized the ChIP-chip dataset of Lee et al [6] instead of the newer ChIP-chip dataset [60]. The reason for this was the "higher information density" in the Lee dataset which has about 4000 statistically significant binding events while the newer dataset has about 25% more binding events for twice as many transcription factors examined. However, we did perform similar analysis using the newer dataset for comparative reasons. ROC curves resulting from this analysis ( Figure S1 in the web supplement, (see Additional file 1)) and TMs (Supplementary Table 5, (see Additional file 6) were similar to the ones discussed here.

Conclusion
We presented a novel probabilistic model and related computational procedures for jointly modeling the gene expression and TF binding data within the context specific Bayesian infinite mixture framework. The algorithm identifies transcriptional modules consisting of groups of coregulated genes and TFs that regulate expression of genes within such groups. The method does not require prior knowledge of number of modules. We demonstrated the improved functional coherence of TMs by analyzing real world data. We also demonstrated that novel regulatory relationship can be identified which would not be implicated by either analyzing gene expression or binding data separately. The new method also produced more functionally coherent TMs than two alternative algorithms for joint analysis of gene expression and binding data. In the original publications, both of these algorithms were tested on much larger expression datasets than we used here. However, the functional coherence as measured by the sensitivity and specificity of predicting the co-member-ship in KEGG pathways remained significantly improved for the ECIM algorithm in analyzing an order of magnitude larger dataset [28]. Furthermore, most of the expression datasets examining a specific biological process are similar in size to datasets we used here and so the comparisons we made are very relevant.
Since there are no free parameters to adjust or tune during clustering phase, users only need to provide the data and the time consuming sampling process will go by itself, then user can select or change either stringent or relaxed criteria to search qualified gene group and corresponding TFs immediately. The output will show results of the analysis in a familiar form without the need to completely understand the mathematical/computational machinery used. We believe that this is an appealing characteristic of ECIM. The model presented here does not account for combinatorial interactions of different TFs in regulating expression. However, the modular nature of the model allows straightforward incorporation of more precise Heatmap of expression data and PBPs for highly specific TMs inferred by ECIM algorithm using the combined sporulation and cell-cycle gene expression dataset and Lee's ChIP-chip data Figure 7 Heatmap of expression data and PBPs for highly specific TMs inferred by ECIM algorithm using the combined sporulation and cell-cycle gene expression dataset and Lee's ChIP-chip data. Each line in the heatmap represents a gene. Red-green heatmap represents gene expression levels in the three different gene expression datasets that were combined together in this analysis and each column represents one microarray. The yellow-blue heatmap represents Posterior Binding Probabilities for 29 most significant TFs with each column in the heatmap representing a TF. Colour-bar on the right of the heatmap depicts groupings of co-regulated genes into TMs and is denoted with the significantly correlated functional category.    models for ChIP-chip data which will most likely further improve the performance of the method.

The probabilistic model and computational algorithm
Suppose that expression levels are measured for T genes across M experimental conditions. If x im is the expression level of gene i for experimental condition m, then x i = (x i1 , x i2 , ..., x iM ) denotes the complete expression profile for gene i. Suppose further the ChIP-Chip experiments measured binding affinity of N TFs to promoters of each of T genes. If p ij is the p-value for rejecting the null-hypothesis that TF j does not bind the promoter of gene i, we define the "binding intensity" of TF j to promoter of gene i as y ij = log(p ij )/log(p min ), where p min is the minimum of all pvalues. y i = (y i1 , y i2 , ..., y iN ) denotes the complete "binding profile" for gene i. x i and y i jointly represent the expression-binding (EB) profile for gene i.
Each gene's EB profile can be viewed as being generated by one out of Q different underlying EB patterns. Suppose that c i is the classification variable indicating the EB pattern that generates EB profile i. c i = q means that EB profile i was generated by pattern q. A clustering structure indicating putative TMs is defined by a set of classification variables for all EB profiles C = (c 1 , c 2 , ..., c T ). The expression part of pattern q that generates profile i is represented by the mean vector and the variance-covariance matrix of the M-dimensional Gaussian random variable (µ q , Σ q ). The binding part of pattern q is N-dimensional vector b q = (b q1 , ..., b qN ), where b qj ∈ {0,1} and , specifying the identity of the TF binding to promoters of genes in TM q (b qj = 1 implicates that TF j is associated with genes in TM q). The space of all possible associated TFs is augmented by a "baseline" TF having p-value of 0.5 for all genes. This allows certain expression patterns not to be associated with any real TF.
Observed expression profiles of genes from the same TM (i.e. generated by the same expression pattern) are assumed to be a random sample from the same multivariate Gaussian random variable (e.g. c i = q implies that x ĩ N M (µ q , Σ q )). The binding profiles of genes associated with TM q, {y i : c i = q}, are assumed to be observations from the random variable with probability density function defined as where p(y ij ) = 2(y ij ) if b qj = 1 and p(y ij ) = 2(1 -y ij ) if b qj = 0. (1) The local structure of the expression and binding patterns is specified by the Q × 2 matrix L(C) = (L 1 , ..., L Q ), where L q1 = k 1 if genes in TM q are placed in group k 1 within the expression context and L q2 = k 2 if genes in TM q are placed in group k 2 within the binding context.

Specification of the complete model
The probabilistic model describing the distribution of the data (i.e. observed EB profiles (x i , y i )) is given in the form of a Bayesian hierarchical model [61]. Dependencies between various model parameters and the data are defined by the Directed Acyclic Network [62] in Figure 9. Nodes in the network represent random variables and arcs define the independence structure of the joint probability distribution function. An arc drawn between a node and a dotted rectangle containing multiple nodes implies that it is the parent node for all nodes within the rectangle.
Assuming that the probability distribution of any node is independent of its non-descendants if values of the parent nodes are given (Directed Markov Assumption), the joint probability distribution of all parameters and data is given by the product of the local probability distributions of individual random variables given their parents. unique. That is, (µ q , Σ q ) = (µ q' , Σ q' ) whenever L q1 = L q'1 , and As specified above, Prior distributions for the local TM assignments C and context groupings L are defined following the infinite mixtures approach that avoids the specification of the "correct" number of groups of local clusters for each context [17,18,21]. The prior distribution for C is defined by specifying prior probabilities that a complete data vector will be either placed in an already existing TM q,  , c i-1 , c i+1 , ..., c T ), n -i,q is the number of profiles generated by EB pattern q without counting EB profile i, and α is the hyper-parameter. Similarly, local structure priors are specified by the probability that expression or binding profiles from TM q are further grouped together within the corresponding context. The probability of assigning TM q to an already existing group of TMs t within context f (f = 1 for the expression context and 2 for the binding context), is , where n -qft is the number of TMs currently placed in local grouping t within context f without counting TM q and a is the hyper-parameter. The probability of assigning TM q to a new local group is . Hyper-parameters a and α are further modeled and estimated from the data and don't have to be specified in the analysis [21,63]. Conditional distributions for all other parameters in the model given their parent nodes in the DAG are the same as previously described [17,18,21] and are given in the web supplement (see Addtional file 1).
The goal of the analysis is to estimate the posterior distribution of parameters in the model given data

Fitting the model
The joint posterior distribution of all parameters in the model given data is estimated using Gibbs sampler. Gibbs sampler [22] is a general procedure for sampling observations from a multivariate distribution. It proceeds by iteratively drawing observations from complete conditional distributions of all components given the current values of all other components. Under mild condition, the distribution of generated multivariate observations converges to the target multivariate distribution. The Gibbs sampler employed here is derived from previously described algorithms for fitting infinite mixture models.

Inferring transcription factors from PBP and binding pvalue
Once we select gene clusters based on average PPP distance and proper Gene Ontology annotations we can infer associated TFs by either PBP or binding p-values. The first method transformed binding p-value to a boolean value based on the p-value cut-off threshold (0.001). Each TF was then examined to determine if it was significantly bound to the promoters of the gene cluster using a Fisher exact test (p-value <= 0.005). The second method calculated the average PBP between gene clusters and each TF. Those TFs with PBP >= 0.1 were considered significant. The selection of thresholds for significance is established empirically to balance the sensitivity and specificity of candidate TFs. This is the same cut-off threshold as used in the original publication [6]. The PBP threshold was chosen by examining the distribution of all PBPs to select the cut-off with pretty much the same level of specificity that was achieved by the p-value cut-off. Cluster size of 10 was somewhat ad-hoc cut-off aimed at getting reasonable level of statistical power to detect significant Gene Ontologies correlating with TMs.
It is important to emphasize that ROC curves presented before are completely independent of these threshold selections. These thresholds are only used when finally constructing TM's based on the posterior distribution generated by the Gibbs sampler. ROC's are designed to systematically compare true and false positive results using all possible ways to automatically construct TM's from the Gibbs sampler output.

Availability and requirements
We have implemented ECIM algorithm within the R package gimmR which can be downloaded our website http:// eh3.uc.edu/gimm.