Identification of tissue-specific cis-regulatory modules based on interactions between transcription factors

Background Evolutionary conservation has been used successfully to help identify cis-acting DNA regions that are important in regulating tissue-specific gene expression. Motivated by increasing evidence that some DNA regulatory regions are not evolutionary conserved, we have developed an approach for cis-regulatory region identification that does not rely upon evolutionary sequence conservation. Results The conservation-independent approach is based on an empirical potential energy between interacting transcription factors (TFs). In this analysis, the potential energy is defined as a function of the number of TF interactions in a genomic region and the strength of the interactions. By identifying sets of interacting TFs, the analysis locates regions enriched with the binding sites of these interacting TFs. We applied this approach to 30 human tissues and identified 6232 putative cis-regulatory modules (CRMs) regulating 2130 tissue-specific genes. Interestingly, some genes appear to be regulated by different CRMs in different tissues. Known regulatory regions are highly enriched in our predicted CRMs. In addition, DNase I hypersensitive sites, which tend to be associated with active regulatory regions, significantly overlap with the predicted CRMs, but not with more conserved regions. We also find that conserved and non-conserved CRMs regulate distinct gene groups. Conserved CRMs control more essential genes and genes involved in fundamental cellular activities such as transcription. In contrast, non-conserved CRMs, in general, regulate more non-essential genes, such as genes related to neural activity. Conclusion These results demonstrate that identifying relevant sets of binding motifs can help in the mapping of DNA regulatory regions, and suggest that non-conserved CRMs play an important role in gene regulation.


Background
Transcriptional regulation is a key component of gene regulation, which itself plays a major role in all forms of cellular differentiation and function. To understand the mechanisms that regulate gene expression, it is important to identify and define the network of cis-acting DNA regulatory elements, which can be viewed as the regulatory code wired within the genome. The code itself is executed through transcription factors (TFs), which determine which set of genes will be expressed. Because the cis-regulatory elements are usually short and degenerate, distinguishing regions of regulatory importance from other non-coding regions is often a great challenge.
One important method for cis-regulatory element detection is based on the concept of the cis-regulatory module (CRM). The hypothesis behind this approach is that TFs co-operate as a functional complex in regulating gene expression. A corollary of this hypothesis is that a region with multiple putative TF binding sites (TFBSs) is more likely to be functional than a region with only a solitary site. These studies were carried out by counting TFBS hits in sliding windows along genomic sequence, and then predicting that the regions with the highest density of TFBSs represent CRMs. This CRM concept has been applied to several biological systems [1][2][3][4][5][6][7][8][9]. Methods based on Hidden Markov Models have been developed to use this CRM concept to improve motif identification [10,11]. A limit that has prevented this approach from being applied on a large scale is the a priori requirement of a set of biologically relevant TFs. Definition of the set of relevant TFs generally requires extensive experimental work, and for most cases such data is not available.
Evolutionary conservation (phylogenetic footprinting approach) can help to improve the cis-regulatory elements identification when used in combination with other methods. This approach is based on the hypothesis that evolutionarily conserved sequences within non-coding regions are the result of selective pressure, and are likely to be enriched for functional regulatory elements. This type of approach has been successfully applied to a variety of systems [12][13][14][15][16][17][18]. However, by the nature of the approach, it is not effective for the discovery of speciesspecific regulatory elements. This limitation is becoming increasingly important because a growing body of evidence suggests the biological importance of non-conserved sequences. Recent comparative genomics studies have identified important RET elements in human and zebrafish that are not conserved evolutionarily [19]. On a genome-wide scale, comparisons between Drosophila species show only slight difference in conservation between known regulatory regions and other non-coding regions, again suggesting the regulatory importance of non-conserved sequences [20]. A large scale analysis from the ENCODE project found that only 55% of regulatory factor binding sites overlap the high-confidence evolutionarily constrained sequences, suggesting the possibility of large number of neutral regulatory elements which are biologically functional but are not under selective pressure [21].
To study tissue-specific gene regulation, we have been working to develop approaches to identify and analyze regulatory regions that contribute to tissue specificity. We suggest that species differences in regulatory regions may be particularly important in specialized functions such as the determination of tissue specificity. Therefore, it is important to identify tissue-specific cis-regulatory elements in an unbiased way in terms of evolutionary conservation. In the work described here, the goals were to (1) develop a computational method for identification of cisregulatory elements that works with both conserved and non-conserved regions, (2) discover regions that are responsible for tissue specificity, and (3) functionally compare conserved and non-conserved regions with the hope of providing a glimpse into the evolution of gene regulation.
Our algorithm, called CRM-PI (cis-regulatory module identification based on protein interaction), searches for DNA regions with dense TFBSs whose binding TFs are known or predicted to be important to tissue specificity. The approach is similar to the above-described methods for mapping CRMs in genomes [1][2][3][4][5][6][7][8][9], but it has been adapted to deal with situations in which experimental data defining biologically relevant sets of interacting factors is lacking. Sets of putative interacting TFs are first computationally predicted based on the positional relationship and co-occurrences of their binding sites in the conserved regions of the promoters of tissue-specific genes [22]. Based upon these predictions on the trans-acting factors, efforts are made to construct cis-acting DNA regions (i.e. CRMs) in the promoter of each tissue-specific gene, including conserved and non-conserved regions. Since our predicted CRMs do not rely on evolutionary conservation, we are able to investigate the contribution of both conserved and non-conserved CRMs to tissue specific gene regulation, and to study the associated characteristics in the context of the evolution of gene regulation.

Detection of CRMs based on tissue-specific TF interactions
We utilized computationally predicted tissue-specific TF interactions to identify CRMs. In our previous work we identified 9060 putative tissue-specific TF interactions [22]. Two TFs were predicted as interacting if the relative positions and co-occurrence of their binding sites in promoters differed significantly from random expectation ( Figure 1; Additional files 1 and 2). Since identifying the CRMs harboring these interactions in each individual promoter is not trivial, we developed an algorithm, CRM-PI, to detect CRMs by calculating an empirical "potential energy" between interacting TFBSs along the genomic sequence. A promoter region containing many interacting TFBSs will have low "potential energy" (see Methods). CRM-PI obtains an energy landscape along the regulatory regions and searches for regions with low "potential energy". For those locations at which the energy is below a given threshold, the region around the minimum is defined as a CRM.
As one example of the use of CRM-PI, Figure 2A shows the promoter region around the gene Aldolase A (ALDOA, NM_000034), which is found to be preferentially expressed in the larynx. The energy landscape (Ecrm) based on larynx-specific TF interactions is shown in the bottom panel of the figure. An energy minimum around -200 bp is lower than the threshold (dashed line) and thus predicts the presence of a CRM (highlighted by vertical bar). Five known cis-regulatory elements are also located around this position (red dots). Interestingly, the region is not evolutionarily conserved (upper panel). Furthermore, if we simply count hits of all TFBSs instead of only the relevant (interacting) ones, the TFBS density at this region is just around the average value.
CRM-PI detected novel as well as known CRMs. For example, we identified three putative CRMs for cyclic nucleotide gated channel beta 3 (CNGB3, NM_019098), a gene that is preferentially expressed in the retina. Two CRMs are within upstream 400 bp, while the other one is located around upstream 3200 bp (vertical bars in Figure 2B). The CRM closest to the transcription start site (TSS, 0 on the xaxis) is conserved (average conservation score: 0.92). However, the TFBS density is not high around this region. For the CRM spanning from -200 bp to -340 bp, the region is also relatively conserved. The one at -3200 bp, however, is not conserved, and there is only a minimal increase in TFBS density in this region. These examples demonstrate that our approach can identify CRMs in both conserved and non-conserved regions.  Schematic of module detection method based on TF interactions. Based on gene expression profiles across different tissues, we identified groups of genes that are preferentially expressed in tissues (e.g. gene C and D in the schematic). For each group of genes, we searched the binding sites of known TFs in promoter regions and determined the TF pairs whose binding sites tend to co-occur in close proximity. A tissuespecific TF interaction network was obtained from the analysis. We then scanned the genomic regions and identified cisregulatory regions (CRMs). The CRMs are defined as regions enriched with TF interactions. Note the first steps were implemented in our previous work [22] while this paper focuses on the last step. We applied our method to 7261 genes that are preferentially expressed in various human tissues [22]. For each of the 7261 tissue-specific genes, we calculated the energy landscape based on the TF interactions specific to the respective tissues. The investigated regions included sequences upstream of the translation start sites, introns, and sequences downstream of the transcription end sites (see Methods). Among the tissue-specific genes, 2130 of them were found to contain a total of 6232 CRMs, with an average length of 90 bp. The summary statistics for each tissue is shown in Table 1. Detailed information of the predictions can be found in Additional file 3.

Evaluation using known regulatory elements and DNase I hypersensitive sites
In an effort to evaluate CRM-PI, we first explored its behavior with known cis-regulatory elements. We collected the elements from the TRANSFAC database [23], and chose those in the investigated regions of the tissuespecific genes as the positive controls. In total, we identified 548 regions as positive controls.
Sensitivity and specificity are widely used criteria in evaluation of a prediction. Here, because only a very small fraction of cis-regulatory elements are determined, we calculated enrichment instead of specificity. Sensitivity is defined as the ratio of the recovered positive control regions (in unit of nucleotides) to total length of the positive control regions. The enrichment is defined as the ratio of the probabilities of predicting a nucleotide site in positive control region as part of CRM versus a site in random sequence as CRM. If a method does not have any prediction power, the resulted prediction will have enrichment close to 1.
We obtained a series of predictions using different thresholds to assess the performance of our approach in terms of sensitivity and enrichment. As expected, if more stringent cutoffs are used, the number of known regulatory regions recovered decreases (Figure 3). At the same time, the enrichment of known regulatory regions in the prediction increases with more stringent cutoffs, indicating that the known regions are not randomly distributed in the predic- tion set and tend to have more significant "potential energy" from our prediction. This result justifies the usage of "potential energy" as a measure of the relative probability that a region is in a CRM. In our prediction, we chose a threshold of E crm = 1 (see Methods). At this threshold, we have sensitivity of 12% and enrichment of 10. In other words, these CRMs constitute only 1.2% of the regulatory sequences, while 12% of known regulatory regions are found in these CRMs.
To obtain a relative sense of the performance of our approach, we compared our results with an approach that predicts CRMs based purely on conservation. In this simplified comparison, we used conservation scores as the cutoff, and the non-coding regions with conservation scores higher than the cutoff were predicted as CRMs [24]. We found that the performance of our approach was sim-ilar to that of this conservation-based method ( Figure 3). However, comparison of the approaches revealed that there are significant areas of non-overlap between the sets of identified CRMs, suggesting that the different approaches can complement each other (see Discussion).
We also used DNase I hypersensitive sites (DHSs) as an indirect assessment of our predictions. DHSs represent chromatin regions that show increased sensitivity to digestion by the enzyme DNase I. The increased DNase I sensitivity is thought to reflect areas of DNA where there is increased accessibility to TFs and other DNA binding proteins. DHSs are generally enriched with regulatory elements. We analyzed the 84 DHSs that are associated with tissue-specific genes [25]. Figure 3B shows the sensitivityenrichment plots for our approach. Our approach can predict DHSs with a reasonable performance. However, an approach based solely on conservation does not effectively distinguish DHSs from the overall genomic sequence (enrichment of DHSs in the prediction is close to 1).

Some features of predicted CRMs for tissue-specific genes
The analysis revealed that CRMs are not homogenously distributed across regulatory regions. We calculated the probability for each position to be in a CRM (Figure 4). There is a significant peak showing high regulatory activity around transcription start sites (TSSs). The peak is located at ~ -60 bp upstream to the TSS. The regulatory activity decays rapidly with increasing distance from the TSS in both directions. The decreases are symmetric for upstream and downstream to TSSs, an observation consistent with the ENCODE analysis [21]. We also noticed that there is a moderate peak at the start position of the first intron. As a comparison, we calculated the same probability for random sequences as the background (pink line in left panel). The random sequences were generated according to the nucleic acid compositions and 1sr/2 nd /6th order transition probabilities of all promoter sequences in the human genome. The averaged probabilities for a base pair to be in CRM are 0.0005, 0.00056, and 0.0014 for random sequences of orders 1st, 2nd and 6th, respectively. For clarity, only the results of the 1 st -order random sequences are presented in Figure 4. All regions investigated have significantly higher regulatory activity (>0.01) than the random sequences. This finding is consistent with the results from several large scale ChIP-chip experiments showing that regulatory elements are not restricted to upstream sequences [21,[26][27][28].
We then examined the number of CRMs associated with each gene and found that the CRM numbers display a power-law-like distribution. In other words, most of the genes have only a few associated CRMs and a few genes have many CRMs. More than half of the genes (65%) are  For example, CD36 is preferentially expressed in blood, bone marrow, heart, and soft tissue. The numbers of CRMs specific to these tissues are 1, 4, 8, and 9, respectively.

Enrichment and sensitivity of predictions
Another example of this phenomenon is provided by PITX2, which encodes a paired-like homeodomain transcription factor. PITX2 is preferentially expressed in both placenta and eye based on our EST analysis. We calculated the potential energy landscapes using the interaction sets specific to placenta and eye, respectively. Figure 5 shows the energy landscape around the TSS of PITX2. The landscape in the upper panel was based on placenta-specific TF interactions. The placenta-specific interactions yield a CRM around 1800 bp in the first intron (the intron spans from 573 bp to 4343 bp). The most important TFs binding to the CRM are LHX3, CHX10 and ATBF1. Similarly, we calculated the energy landscape based on eye-specific interaction for the same gene (bottom panel in Figure 5). Two CRMs were found based on eye-specific TF interactions. One is located around -3600 bp upstream and the other around 1400 bp in the first intron. The TF binding to the CRM in the upstream region is FOXJ2. Three FOXJ2 binding sites occur in the region of the CRM, suggesting several homotypic interactions between FOXJ2. The TFs found in the CRM in intron include FOXJ2, CHX10, POU3F2 and CRX. This example demonstrates that different sets of TFs bind to their respective CRMs and can regulate the same gene in different tissues.

Conserved and non-conserved CRMs regulate distinct classes of genes
Since our approach to CRM prediction does not rely on evolutionary conservation, one would expect to obtain both conserved and non-conserved CRMs. As expected, the identified CRMs occurred in both conserved and nonconserved regions. Among all predicted CRMs, 55% of them have an average conservation score less than 0.2.
We explored whether there might be differences between the conserved and non-conserved classes of CRMs in terms of the activity and/or properties of their target genes. We first investigated the functional classes enriched for the respective gene groups regulated by conserved CRMs (cCRMs) and non-conserved CRMs (ncCRMs) (see Methods for definitions). We counted the numbers of genes in various functional classes based on gene ontology (GO) annotation [29] and calculated the P values for each class (see Methods). Table 2 lists the most significantly enriched functional classes in the gene groups targeted by cCRMs and ncCRMs, respectively. We can see that genes involved in gene expression and protein modification (e.g. transcription, protein amino acid dephosphorylation) tend to be targeted by cCRMs. In contrast, the genes related to nervous system function (e.g. synaptic transmission) tend to be regulated by ncCRMs. The observations on the different enriched functional classes for cCRMs and ncCRMs could be rationalized by the fact that transcription and translation activities are among the most fundamental processes in cells, while neural activity The energy landscapes for PITX2 Figure 5 The energy landscapes for PITX2. The landscape in the upper panel was calculated based on placenta-specific interactions between TFs. The one in bottom panel was based on eyespecific TF interactions. Dependence of regulatory activity on positions relative to gene structure Figure 4 Dependence of regulatory activity on positions relative to gene structure. We calculated the probability for each position containing a CRM. The reference positions (origins in the x-axis) are transcription start sites, the respective start sites of introns and transcription end sites in three regions, respectively. The pink curve in the left panel is from random sequences which were generated with the same nucleic acids compositions and 1 st order transition probabilities, respectively, as those of the all promoter sequences in the human genome. is more specific to species. To exclude possible effects brought about by the conservation of their target genes, we performed the same analysis on the conserved gene groups (of average conservation scores not less than 0.5).
One can see that most of the results are consistent with the above (Additional file 4).
We then linked CRM conservation with the essentiality of their target genes. Essential genes are defined as those that render the cell, or organism, non-viable if knocked out. The essential and viable genes were obtained through mouse knockout experiments [30], and the gene lists themselves were obtained from the Human Protein Reference Database [31]. From the database, we identified 153 viable and 83 essential genes that were tissue-specific and had at least one CRM in their regulatory regions. The fractions of essential genes in the gene group regulated by cCRMs and ncCRMs are 0.48 (45 out 93) and 0.27 (38 out of 143), respectively, suggesting that the classification cCRM/ncCRM is significantly correlated with the classification viable/essential genes (chi square test, p < 0.01). Our result demonstrates that the regulatory elements for essential genes tend to evolve slower than those regulating non-essential genes. Since it has been found that more dispensable genes tend to have a higher evolutionary rate [32], this result indicates a possible co-evolution between coding sequences and their regulatory elements.
A recent study on interspecies variation in gene expression revealed that genes containing a TATA box in their promoters tend to have increased gene expression variation across species, indicating that the TATA box could be a genetic signature for gene expression variation [33]. We examined the relationship between the conservation of the CRMs and the likelihood of containing a TATA-box in the promoters. Twelve percent (9 out of 74) of genes regulated by cCRMs contain a TATA-box. In contrast, 23% (39 out of 169) of genes regulated by ncCRMs contain a TATA-box (chi square test, p < 0.05). Hypogeometric testing also indicates that the cCRM group significantly overlaps with the non-TATA group (p ~ 0.003) and the ncCRM group with the TATA group (p ~ 0.03). This observation and the above results demonstrate that the genes controlled by cCRMs and ncCMRs are distinctive in terms of functional classification, gene essentiality and the likelihood of being associated with a TATA-box containing promoter.

Computational complexity and software availability
The computation of a gene's energy landscape takes less than 1 second for a typical 5-10 kbp promoter sequence, and ~42 minutes for ~6000 random sequences in a linux system operating on a DELL PRECISION 690. The program CRM-PI is available upon request.

Discussion
Gene regulation is not a static process. It is not only about "which TF regulates which genes", but also about "when, where and how the TF regulates the gene". In this context, CRMs are best thought of not just as static segments of DNA sequences located around genes, but rather as dynamic entities that are a function of both time and space. There are many attributes associated with the CRMs, and these attributes are also dynamic. In our analysis, we predicted tissue-specific CRMs, i.e. CRMs associated with gene regulation specific to individual cells or tissues. In different tissues, the CRMs regulating the same gene can be dramatically different (see example in Figure  5). In addition, distinct sets of regulatory elements may regulate the same gene at different developmental stages. As more information becomes available, it should be possible to more fully relate regulatory elements with temporal (e.g. development) and spatial (e.g. cell types) attributes. The additional temporal and spatial information will likely be helpful in describing regulatory elements in their larger biological context.

Sensitivity
Our approach achieved a reasonable sensitivity of 0.12 for identification of tissue-specific cis-regulatory regions. This performance is similar to other currently available tools. Gupta and Liu tested the CRM predictions of 7 available methods on conserved human-mouse alignment regions of 5 k upstream sequence, and their sensitivities ranged from 0.10 to 0.31 [10]. PreMod is a new promising approach of CRM detection based on conservation, and its sensitivity is 0.15~0.30 for TRANSFAC binding sites.
The modest sensitivity of CRM-PI can be attributed to many factors. First, our approach used 306 TFs with known binding sites, which is just a fraction of the total of 1500 human TFs [34]. As a result, many CRMs bound by the factors not included this study will be missing from our prediction. Second, our positive controls from TRANSFAC are not necessarily tissue-specific, while our predicted CRMs are only those that contribute to tissue specificity. Third, we expect that if we could include more information for TF interactions, such as binding site orientation and distances, our approach would be more sensitive to detect true CRMs.

Comparison with other CRM methods
One significant difference between our approach and other CRM methods is that we proposed that only relevant TFBSs contribute to a functional CRM. Therefore, CRM-PI counts the number of interactions instead of the total number of TFBSs in a sliding window for CRM detection. One interesting observation is that we obtained similar densities of TFBSs in promoter sequences and 6 th order Markov random sequences (0.152 vs 0.148 hit per bp). In other words, we would have similar numbers of CRMs for real promoters and random sequences if we simply counted the TFBSs. However, the numbers of CRMs (i.e regulatory activities) between the promoter and random sequences are significantly different (> 0.01 vs. 0.0014), suggesting the importance of including only relevant TFs for CRM detection.

Comparison with conserved-based approach
To enrich for true interactions between TFs, we utilized evolutionary conservation as a constraint [22]. After obtaining the TF interactions, we searched the cis-regulatory regions for each promoter regardless of the degree of evolutionary conservation. As a result, we have found a substantial fraction of CRMs in non-conserved regions, indicating that our approach and conservation-based methods detect different sets of the CRMs. For example, using TFBS regions of TRANSFAC database as positive control, we correctly predicted 66 TFBS regions. Among these 66 sites, only 19 were predicted by PReMod, a genome-wide conservation-based approach [12]. Therefore, our method can complement the widely used conservation-based methods.

Evolution of regulatory mechanisms
From the perspective of molecular evolution, identification of both conserved and non-conserved regulatory elements can provide insight into the evolution of gene regulation. It has been hypothesized that variation in gene regulation can be as important as coding region changes in contributing to the differences between species [35]. Supporting this viewpoint, human and chimpanzee share 99% sequence identity in coding sequences, yet they still demonstrate major difference in morphology and behavior [36]. Therefore, how genes are regulated might, in some ways, be as critical as evolutionary variation between genes themselves, suggesting that the evolution of regulatory elements might be as important as that of coding sequences in the explanation of divergence among species.
Despite this, there have been few comprehensive studies on the evolution of regulatory elements. In one study, Gasch et al analyzed the conservation and evolution of cis-regulatory systems in fungi [37]. Many cis-regulatory elements in S. cerevisiae were found to be conserved in other fungi, but they also found some novel cis-regulatory elements specific to individual species.
One problem limiting comprehensive study of regulatory variation is the lack of sufficient known cis-regulatory elements, especially in mammalian systems. Current studies of the evolution of gene regulation often compare gene expression across species instead of regulatory elements directly [38][39][40]. It is worth emphasizing that it is not identical to study evolution in terms of gene expression and cis-regulatory elements. For example, a recent study based upon comparison of the messenger RNA levels in liver tissues within and between human, chimpanzee, orangutan and rhesus macaque found that TFs tend to evolve more rapidly than other genes [40]. In contrast, our study on promoter sequences indicates that TFs tend to be regulated by conserved CRMs. Our findings may reflect the evolutionary discrepancy between the gene expression and regulatory elements. The study of regulatory elements therefore requires unbiased (conserved vs. non-conserved) identification of these elements. We suggest that our method will be helpful in the identification of both conserved and non-conserved regulatory elements, which in turn will provide insights into the evolution of regulatory mechanisms.

Limitations and future directions
In this study, we limited the promoter sequences to 5 kb upstream of the transcription start site (TSS), and thus excluded many enhancers and other regulatory elements that are far from the TSSs. However, extension of our analysis to the entire genomic would likely introduce significant noise, a reflection of the trade-off problem between sensitivity and specificity.
Another area of compromise involved whether or not to include evolutionary conservation. We decided to ignore use of the evolutionary conservation constraint so as to be able to detect both conserved and non-conserved CRMs. A limitation of excluding evolutionary conservation, however, is that we loose its power for helping to identify certain cis-regulatory regions. One possible compromise approach that could take advantage of both approaches would be to include only mammalian genomes to define conservation. In this way, we would have the power of the conservation constraint, and in the meantime, we can hopefully retain most of species-specific cis-regulatory elements.
Our approach to CRM identification can be extended to treat cell-type-or development-stage-specific gene groups. The real challenge of our strategy is how to find welldefined gene groups whose members share similar mechanisms of gene regulation. Secondly, choosing a more non-redundant and comprehensive motif set, such as a refined combination of Jaspar [41] and TRANSFAC, may improve the current performance of our approach. Other developments, such as improvements in motif finding algorithms, may further help improve the global performance of our approach.

Conclusion
We have presented a study of cis-regulatory systems that regulate tissue-specific gene expression. In contrast to popular phylogenetic footprinting approaches, our proposed method utilizes information based on TF interactions to identify cis-regulatory modules (CRMs), an approach that is unbiased in terms of evolutionary conservation. The results suggest that non-conserved CRMs contribute significantly to tissue-specific gene regulation. With more data available, it should be possible to put the CRMs in a fuller biological context and better understand the roles they play in cellular differentiation and function.

Sequences in the genome
We obtained sequences and gene structures based on Ref-Seq gene annotation (hg17). The regulatory regions we studied include (i) upstream 5 kb to translation start site (ii) introns (iii) 3'UTR (iv) downstream relative to transcription end site (TES) to its 3' adjacent gene. If the region is longer than 5 kb, we cut it to 5 kb from TES.

Position weight matrices and genome-wide search
Similar as previous work [22], we collected 306 human position weight matrices (PWMs) from TRANSFAC database and literatures. The match scores between a matrix and a sequence were calculated for all possible positions along the promoters defined in previous work [22]. The score threshold for the top 0.015% matches in all promoters was utilized as the cutoff. This cutoff is somewhat arbitrary and may cause some false positive hits. The study of CRMs based on TF interactions is expected to increase the specificity of these hits.

Identification of cis-regulatory modules (CRMs)
The basic hypothesis of our method is that the TF complex, rather than individual TFs, is the functional unit of gene regulation. Based on this hypothesis, we perform our analysis by searching for clusters of transcription factor binding sites (TFBSs), which are denoted as cis-regulatory modules (CRMs). These clusters are more likely to be functional than solitary TFBSs. A key step in our method is to identify sets of relevant TFs. We argue that only the clusters of relevant (interacting) TFs are biologically meaningful. Relevant TFs are defined here as the TFs that may interact to co-regulate tissue-specific genes. We identified cis-regulatory modules based on TF interactions. If a region has sufficient TFBSs whose binding TFs interact with each other, we predict the region as CRMs.
In our previous work, we predicted tissue-specific TF interactions [22]. Briefly, we first identified tissue-specific genes based on gene expression profile across various tissues. We then derived TF pairs that are likely to regulate the tissue-specific genes. We scanned known TFBSs (i.e. the binding matrix from TRANSFAC) in the promoters of the tissue-specific genes. If the co-occurrence of two TFBSs is over-represented and/or the distances between the two TFBSs in the promoters are significantly deviating from a random expectation, we predicted the two TFs interact with each other in the tissue (Figure 1). The interaction strength between two motifs is defined as S = -log(P) = -log(P occ P d ), where P occ describes whether the co-occurrence (g) of the TFBS pair is enriched in the tissue-specific promoters (n) compared to the total co-occurrence (G) in all N promoters.
The contribution from distances between two TFBSs, P d , is obtained from comparison of the observed TFBS-pair distance distribution with that of two random sites using a Kolmogorov-Smirnov (KS) test.
In this work, we then predicted cis-regulatory modules (CRMs) based on the obtained TF interactions. We developed an algorithm, CRM-PI, to calculate an empirical "potential energy" between interacting TFBSs along the genomic sequence. In this context, two TFBSs are considered as interacting if their respective TFs are interacting. Previous methods for identification of CRMs often counted the number of TFBSs in a sliding window along the genomic sequences. Our method is equivalent to count the number of interactions in the regions. Note that multiple TFBS matches for the same TF are considered separately. More precisely, we calculated an empirical "potential energy" for each TFBS site, Here S ij is the interaction strength between two TFBSs (see equation 1); d ij is the distance between sites i and j; D is the decaying constant, which is equivalent to the window size from other CRM methods. In our calculations of TF interactions, we found that the average genomic length between two interacting TFBSs is roughly 200 bp, which we choose as the value of D. Changing this value in a certain range does affect our results. If the two interacting sites are far away, their contribution to the energy would be small. For each TFBS, we summed up the "potential energy" due to the interactions with all other TFBSs.
In some cases, multiple TFBSs overlap at site i, and we choose the TFBS hit whose E i value is the lowest. This E i values still contain the contributions from other overlapped TFBSs. We utilized a simpler approach, compared to previous methods such as AHAB [8] and ClusterDraw [9]. For all sites in the studied sequence, we sort them according to their E i values from low to high, and delete the overlapped hits for each of the ordered sites sequentially. We then update the E i values for the non-overlapped hits. Although this greedy algorithm may not guarantee in obtaining the optimal solution, its efficiency in CPU time makes it a practical choice for a genome-wide calculation.
The E i values for the sites generally fluctuate greatly along the sequence. The lengths of CRMs based on the E i values are often around the length of a TF binding site. To smooth the energy landscape and define a continuous CRM, the contributions from the vicinity of the studied site with a window size W were summed up to be as the CRM energy E crm of this site. We chose a window size of W = 400 bp. Furthermore, to make the calculations in different tissues comparable, we performed a simulation on random sequences to determine the cutoff for E crm . The random sequences have the same transition probabilities as the sequences (-5000 bp to translation start site) of all RefSeq genes (~23000). For each tissue, we calculated E crm on 10000 random sequences and recorded the lowest E crm value for each random sequence. We chose the cutoff as the E crm which is ranked on 5% lowest values (i.e. the value at the lowest 500) in the random simulation. The E crm was normalized by this value, thus the cutoffs for all tissues are E crm = 1.

Conservation of sequences and CRMs
We utilized conservation scores, which were calculated based on multiple sequence alignment of eight vertebrate species using the phylogenetic hidden Markov model (phastCons) [42], to evaluate the conservation of sequences and CRMs. The eight species compared were human, chimp, mouse, rat, dog, chicken, fugu, and zebrafish. The conservation score obtained from this comparison reflects the general conservation level within the vertebrates and allows for a metric to determine conservation of larger stretches of genomic sequence. Since CRMs often span several hundred base pairs, the average value of conservation scores of all base pairs in a CRM is an appropriate measurement for the general conservation trend of this CRM.
To determine whether there might be differences between the conserved and non-conserved classes of CRMs in terms of the activity and/or properties of their target genes, we compared the gene groups regulated by conserved CRMs (cCRMs) and non-conserved CRMs (ncCRMs). The cCRM group was defined as those with conservation score > 0.5, and the ncCRM group defined as those with conservation score < 0.2. For those genes controlled by more than one CRM, we used the average conservation score of the associated CRMs. By this definition, 455 and 877 genes were regulated by cCRMs and ncCRMs, respectively. (It should be noted that variation in the cutoffs chosen for the definition of cCRMs and ncCRMs was tested and such variation did not significantly affect the results presented in the result section.)

Functional enrichment for cCRMs and ncCRMs
We first counted the number of genes associated with cCRMs (or ncCRMs) in each functional class and then compared the observed number with that which would be expected if genes were randomly assigned to the functional class. P-values were calculated using the binomial distribution. We did the multiple testing over the functional classes by performing random simulation to determine the cutoff for significant P-values. In each simulation we selected a group of genes with the same size of the genes controlled by cCRMs (or ncCRMs) and calculated the P-values for their occurrences in different functional classes. We obtained the most significant P-value from each simulation and determined the P-value at top 5% of the P-value distribution from the simulation as the cutoff for significance. For cCRM and ncCRM, the cutoffs for the P-values are 10 -3.66 and 10 -3.9 , respectively.

Authors' contributions
JQ and XY conceived of the study. XY implemented the software and performed most of the analysis. JL and DJZ helped to interpret the results. All authors wrote the paper.