The objective is to investigate whether the occurrence of one TRE directly affects the occurrences of other TREs. The correct scale for studying the organization of TREs on the genome seems to be a scale where most regulators show point-like interactions with the genome at binding sites that each cover only a few nucleotides, since this corresponds to actual binding site sizes. At this stage, it is helpful to review the ChIP-seq and ChIP-chip techniques. ChIP-seq/chip are based on a protocol that first fixes DNA-bound proteins to DNA by cross-linking, followed by shearing of the DNA. Antibodies are then added to isolate the DNA bound by a protein of interest (see [15] and references therein for an introduction). The sheared DNA sequences are approximately 400-1000 bp long, depending on the protocol used. For ChIP-seq, the 5' or 3' edges of the fragments are sequenced. These sequence reads can then be mapped back to the genome, and the site of the cognate DNA-binding protein can be determined using specific algorithms, see e.g. [16]. Adjacent regions on the DNA with a high frequency of mapped sheared sequences from the ChIP step are merged into a larger block. This implies that it is possible to merge two or more binding sites in a single block, since individual signals end up being merged if they are physically close together. For ChIP-chip, the sheared DNA sequences are hybridized to an array of DNA probes that are designed to cover the genome with a certain spacing between the probes. As for ChIP-seq data, sets of probes that are consecutive on the genome and have high hybridization signals will be merged into a larger block corresponding to the DNA region spanning the probes. Since probes cannot be placed on repetitive genomic sequences and repetitive sequences may map to more than one location on the genome, a larger block can be broken up by occurrences of repeat elements in both ChIP-seq and ChIP-chip data. Finally, since both methods are based on a ChIP step, the regions detected are much larger than the actual underlying TREs, and hence we need to obtain proxies for the actual locations of the TREs from the wider ChIP data.
Figure 1 illustrates how we obtain proxies for binding sites for at given TRE from a ChIP-chip signal. For ChIP-seq data the approach is essentially identical, although the raw signal is slightly different. When considering several TREs, we ultimately obtain a sequence of points along the genome with different labels according to the different TREs, i.e., a multivariate point process along the genome. The intensity of occurrence of one TRE along the genome is then affected by the occurrences of the other TREs, and so we need to model how this intensity changes along the genome and how the occurrences of other TREs affect the intensity.
Multivariate analysis of TREs for mouse embryonic stem cell data
For the application of our model to the ChIP-seq data from Chen et al [10], we focus on 11 TREs: 9 TFs and 2 transcription regulators. We let
denote the set of these TREs. Inspired by [17], we investigate the use of the Hawkes process, where the main specification consists of a collection of functions g
m, k
for each combination of m, k ∈ I of TREs considered (see Methods). Given that we observe TRE m, the intensity for observing TRE k downstream at a distance s is then given as a baseline intensity multiplied by g
m, k
(s). The most important hypotheses to investigate are whether g
m, k
= 1, which means that the occurrence of TRE k is not affected by upstream occurrences of TRE m. We can use standard methods for testing these hypotheses and as described in the Methods section, we can use the results to determine local independencies in the data.
In the multivariate analysis of the 11 TREs we initially allow for all 121 potential interactions among the TREs. As described in the Methods section, each of the g-functions are modeled using a spline basis expansion of log g
m, k
for m, k ∈ I with 8 equidistant, fixed knots. We choose to place the knots so that we limit the range of dependence to a maximum of 1000 base pairs downstream.
Estimates of the 121 g-functions are shown in Figure 2. For each TRE(column), these functions show the factor by which that TRE affects the downstream baseline intensity of another TRE (row), as a function of the distance between the TREs. A point-wise 95% confidence interval is also shown for each of the estimated functions. Adopting the terminology of [17] we say that a value less than one means that a given inter-TRE distance tends to be avoided, while a value greater than one means that a given inter-TRE distance tends to be favored.
It is important to point out that the implementation of the Hawkes process treats the genome simply as a line along which events (the occurrences of TREs) happen. This means that the descriptors "downstream" and "upstream" are dependent only on the direction we assign to the genome and not on the actual direction of genes. The estimated g
m, k
-functions in Figure 2 were estimated in the forward direction of the genome; i.e., the lowest numbered nucleotide in each chromosome based on the assembly coordinates was used as the starting point. We will discuss this point in more depth below.
The overall impression from Figure 2 is that generally, the occurrence of one TRE affects the occurrence of another TRE by increasing its intensity immediately downstream, with the effect then leveling off. For instance, there is a more than 10-fold increase in the intensity for occurrences of Zfx, c-myc and n-myc immedialy downstream of E2f1. However, several of the estimated effects do not seem to be significantly different from one.
The largest positive effects are found among the four TREs in the upper left corner, E2f1, Zfx, c-myc and n-myc and among the three TREs Nanog, Sox2 and Oct4. This indicates that the factors in the two groups often bind in proximity to each other. In addition, the three TREs Stat3, p300 and Smad1 seem to be more related to the group consisting of Nanog, Sox2 and Oct4 than to the other group. This is consistent with the analyses by Chen et al [10]. We also observe possible relations between Suz12 and Oct4 and Zfx and n-myc, respectively. In the original study, no relation with Suz12 was reported, but another study reported an interaction between Suz12 and Oct4 in mouse embryonic stem cells [14]. The experimental validation was based on both PCR analysis of the promoters and a knockdown study where reduced levels of Oct4 led to a loss of Suz12 at certain target promoters.
Based on the log g
m, k
-functions, we clustered the 11 TREs (see Methods) in order to link regulators that co-occur. The result of this clustering is presented in Figure 3 and shows two clusters of TREs. The first cluster consists of E2f1, Zfx, c-myc and n-myc, while the second includes Nanog, Sox2, Oct4, Suz12, Stat3, p300 and Smad1. Again, this is consistent with the results presented in [10], apart from the Suz12 findings.
The functions on the diagonal (from upper left corner to bottom right corner) in Figure 2 represent the self-dependencies of the TREs, i.e., the effect of the occurrence of a TRE on the downstream occurrence of that same TRE. All 11 effects seem significantly different from one and all have a characteristic shape, with a clear depletion of the intensity for the first 500 nucleotides. For some of the TREs, the effects approach 1 thereafter and for others, there is a small positive effect after 500 nucleotides. The self-dependence for the transcription regulator Suz12 stands out, with a large (more than ten-fold) effect after 500 nucleotides. While there are many reports that show homotypic clusters of transcription factor binding sites [18, 19], the depletion and peaks might be technical artifacts. Depletions can be attributed to the case where multiple binding sites are located very close to each other; in this case, the block will be interpreted as a single binding event, with the implication that sites located close together will never be detected. Peaks might occur due to the presence of repeat elements that break up larger regions. In the Suz12 case, the effect after 500 bp does not seem to be a technical artifact, since the effect is large.
In most cases, transcription factors have no strand preference relative to their regulated gene when it comes to binding, and regardless, strand information is lost in the ChIP experiment and the experiment will not explicitly tell us which gene is the regulatory target of a given site. Consequently, we fit the model to a mixed signal. If two TREs typically occur in a specific order when involved in the regulation of a gene, then the order is reversed from the forward direction point-of-view if the TREs are involved in the regulation of a gene in the reverse direction. We argue that if there is an equal distribution of TREs involved in regulation in the forward and reverse directions, the mixed signal should be approximately symmetric, which would then imply that the shapes of g
m, k
and g
k, m
do not differ that much up to a multiplicity factor. From Figure 2, we see that this is true in most cases (e.g. gE2f1, Zfx and gZfx, E2f1) but we also see some deviations from this (e.g. gNanog, Smad1 and gSmad1, Nanog).
To further investigate the estimated effects for each combination of m, k ∈ I, we test the hypothesis
This is the hypothesis of local independence of the m'th TRE on the k'th TRE, conditional on the upstream occurrences of the k th TRE and the remaining 9 TREs, as described in the Methods section. If we reject H0(m, k), the interpretation is that there is a direct relation between the occurrence of TRE m and the downstream occurrence of TRE k. Of course, we can not rule out that such a relation can be explained by a factor not included in our analysis, but we can say that the other TREs included can not collectively explain the relation. On the other hand, if we do not reject H0(m, k), there is no evidence in the data for a direct relation between the occurrence of TRE m and the downstream occurrence of TRE k. The p-values for the 121 tests are shown in Figure 4, with tests for which H0 is rejected shown as red squares. As in Figure 2, we use the forward direction of the genome. We find that all TREs show significant self-dependence, as discussed above. In addition, we observe that many of the interactions are significant, with Oct4 having interactions with all the other TFs, and Suz12 having the fewest interactions with other TFs. All interactions within the group consisting of E2f1, Zfx, c-myc and n-myc are significant, as is those within the group consisting of Nanog, Sox2, Oct4, Stat3, p300 and Smad1, which again is in accordance with the analyses in [10].
We found a significant effect of the occurrence of Suz12 on downstream occurrences of Oct4 but not the reverse. We argue that this asymmetry is a consequence of the inclusion of self-dependence terms in the model, combined with Suz12's strong self-dependence, as can be seen in Figure 2. When we fit a model without the self-dependence term for Suz12 (not shown), the asymmetry disappears. We believe that most of the observed self-dependencies, including the large self-dependence for Suz12, represent true self-dependencies; therefore we prefer to use the multivariate model including all self-dependence terms when analyzing the data.
Multivariate analysis of TREs for the pilot ENCODE regions
To further investigate the applicability of our model we analyze a subset of the ENCODE pilot data produced by Affymetrix: the "Affymetrix Sites" track from the UCSC ENCODE browser database resulting from a study of retinoic acid-treated HL-60 cells 0, 2, 8 and 32 hours after treatment. Initially, we focus on the 8-hour post-treatment results from the data and investigate the effects of the oriented specification of the model and the inclusion of histone modifications in the model. Subsequently, we compare the results obtained at the four different time points. We focused on 10 TREs, selecting classical transcription factors, the transcription machinery and chromatin boundary elements. Because some regulatory elements, such as histone modifications, can not always be regarded as point-like, we include the two histone modifications, H3K27me3 and H4Kac4, as covariates in the modeling of the remaining eight TREs. We let
denote the set of these TREs. Aside from the inclusion of the histone modifications, an important feature, the model is the same as in the previous section. The intensity of the occurrence of a TRE at a given location depends on upstream occurrences of other TREs and on whether the histone modifications are present at the same location.
The set of TREs available from the ENCODE data is quite diverse and potential interactions among them are to a large extent previously undescribed. In the multivariate analysis of the 8 TREs, we initially allow for all 64 potential dependencies among the TREs. Again, as described in the Methods section, each of the g-functions are modeled using a spline basis expansion with 8 equidistant, fixed knots and a maximum range of dependence of 1000 base pairs downstream. (We did not seem to be able to capture dependencies over a longer range).
Estimates of the 64 g-functions are shown in Figure 5, with the effect of each TRE (column) on the downstream baseline intensity of another TRE (row) as a function of the distance between the TREs. As with the previous data set, we observe that several TREs show a clear impact on the downstream occurrence of other TREs and, as in the previous analysis, most estimated effects seem greater than one, suggesting that in general, the occurrence of a given TRE affects the occurrence of another TRE by increasing its intensity. All 8 self-dependencies are significantly different from one, and for all except CEBPE, the g-function have a shape with a depletion of the intensity for the first 100 nucleotides, followed by a peak, with, for instance, a four-fold increase of the intensity for POL2 at approximately 300 nucleotides downstream. CEBPE shows a different behavior with only the depletion of the intensity downstream of the occurrence of CEBPE. As noted previously, both depletions and peaks might be technical artifacts. For ChIP-chip data, the peaks can be caused by genomic repeats, which can break up a longer signal block, causing the block to be interpreted as two binding events. On the other hand, the fact that CEBPE does not show the same behavior indicates that the effect may actually not be a technical artifact. Depletions might be attributed to multiple binding events occurring very close to one another, such that the block is interpreted as a single binding event, and since we use the center of the block as our proxy for the binding site, a depletion will occur if the block is large.
Investigation of the oriented specification of the model
To investigate whether the estimated effects are statistically significant, for each combination of m, k ∈ I we test the hypothesis H0(m, k): g
m, k
= 1. The p-values are shown in Figure 6, with tests for which the hypothesis is rejected shown as red squares. As in Figure 5, we use the forward direction of the genome. We find that all TREs show significant self-dependence, as discussed above. We observe that RARA appear to be directly associated with most of the other TREs, both up- and downstream, whereas PU1 and POL2 show fewer significant direct relations with other TREs.
Keeping in mind that we fit the model to a mixed signal, we would expect to reject the hypothesis g
m, k
= 1 if, and only if, we reject the hypothesis g
k, m
= 1. Deviances from this, as seen in Figure 6, could be explained by an unequal distribution of these TREs involved in regulations in the forward and reverse directions, respectively, in this data set.
Regardless of a mixed signal, if we fit the model using the reverse direction of the genome, we would expect the estimate of g
k, m
to be similar to the estimate of g
m, k
in the forward direction. The g-functions estimated in the reverse direction are shown in Figure 7. To make the comparison with Figure 5 easier, for each TRE (row), the figure shows the factor by which that TRE affects the downstream baseline intensity of another TRE (column); with this figure orientation the estimate of g
k, m
in the reverse direction is in the same place in Figure 7 as the estimate of g
m, k
in the forward direction in Figure 5. We see that the estimated functions are indeed very similar for almost all of the TREs, with most differences being differences in function amplitude only and not in shape. The different amplitudes can be due to TREs occurring at different rates, such that at least one of the TREs will occur in situations without the other TRE.
We would also hope that the conclusions reached would be qualitatively symmetric - that is to say, that we reject H0(m, k) in the forward direction if, and only if, we reject H0(k, m) in the reverse direction. When we repeat the tests fitting the model in the reverse direction, we find that this is generally, but not entirely, the case. The p-values are shown in Figure 6 with tests for which H0 is rejected shown as red circles. As in Figure 7, we have transposed the results relative to the forward direction, that is, the figure shows the effect of the TRE (row) on the TRE (column), to make the comparison simple. Qualitatively, most of the conclusions are preserved, but 12 of the 64 tests had a different outcome when we estimate the model in the reverse direction. Most notably, BRG1, POL2 and SIRT1 are each involved in 5 of these tests. However, the conclusions for CEBPE are identical and for CTCF, there is only one difference.
An explanation for seeing a direct relation of TRE m on downstream occurrences of TRE k in the forward direction, but not a direct relation of TRE k on downstream occurrences of TRE m in the reverse direction, could be that TRE m occurs almost exclusively in relation to TRE k, but TRE k also occurs in many other situations in the absence of TRE m. This could explain the results for the relations between BRG1 and POL2 and between SIRT1 and BRG1, since in these cases we see a direct relation of TRE m (POL2 and BRG1, respectively) on the downstream occurrences of TRE k (BRG1 and SIRT1, respectively) in the forward as well as in the reverse direction.
In conclusion, certain findings are consistent when estimating in either direction. The self-dependencies are all significant, and RARA seems to have direct relations to all or most of the other TREs both up- and downstream. RARA is known to function as an active repressor by recruiting corepressors and/or deacetylases when its ligand is not present, and an activator when the ligand is present [20]. Our observation that RARA is directly related to most of the other TREs is then an indication of the importance of the factor in this system, since the cells were treated with its ligand (retinoic acid-treated HL60 cells).
Effect of histone modifications on occurrences of TREs in the pilot ENCODE regions
As mentioned above, the two histone modifications are included as covariates in our model. The effects of the histone modifications on the intensity for the occurrence of TRE k in a modified region is captured by the fold change parameters and , respectively (see Methods).
Figure 8 shows the point estimates and 95% confidence intervals for these parameters. We find that all parameters are significantly greater than 1, showing that the intensity of the occurrence of any of the 8 TREs is increased in the presence of either of the histone modifications. Most notable is the effect of H4Kac4 on PU1 and POL2; the presence of H4Kac4 increases the intensity of the occurrence of these two TREs by a factor of 22.2 and 16.6, respectively. It is generally believed that acetylation is coupled to chromatin opening and increased transcription; indeed, an independent report shows a characteristic pattern of acetylation around transcription start sites [21]. However, it is not clear why the effect on PU1 and POL2 is so strong. One possible explanation is that the PU1 binding is not particulary stringent - it is a HMG box which essentially binds GGAA-rich sequences.
The effect of H3K27me3 on the occurrence of PU1 and POL2 is, on the other hand, negligible, which might have to do with its effect as a repressor [22]. We observe that the presence of H4Kac4 also show a pronounced effect on the occurrence of BRG1 and CEBPE, increasing the intensity by a factor of 9.9 and 6.2, respectively. The most pronounced effects of the presence of H3K27me3 are found for P300, RARA and CTCF where the intensity is increased by a factor of 5.8, 4.7 and 4.5, respectively.
Results for all four time points for the pilot ENCODE regions
Since data are available for 0, 2 and 32 hours post-treatment, in addition to the 8 hours analyzed initially, we can investigate whether our findings are consistent over time. Hence we fit our multivariate model to the data at these time points in the forward direction and test the hypotheses g
m, k
= 1 for all m, k ∈ I. The p-values are shown in Figure 9. We see that the results for 2 hours after treatment stand out compared to the other three time points, with many fewer significant relations. BRG1 especially shows almost no relations with the other TREs, with the effect of CEBPE on downstream occurrences of BRG1 the only exception. In particular, we note that the self-dependence has disappeared. In contrast, BRG1 shows many relations with other TREs at the other three time points. The fact that BRG1 is observed less frequently 2 hours post-treatment compared with the other time points, combined with inspection of the estimated g-functions and their 95% confidence intervals (not shown), suggest that the much larger variance in the estimated signal for BRG1 relations 2 hours post-treatment are at least part of the explanation for the observed differences among the time points. Apart from the observations 2 hours post-treatment, many of the relations are consistent over time. We observe, for instance, that the local independence of CEBPE and CTCF is present at all four time points, as is the relation between CEBPE and P300. We should note, however, that although many relations are preserved over time, this does not necessarily mean that the binding sites for the TFs are the same over time, only that there is a relation between the binding sites for the TFs.