Application of machine learning methods to histone methylation ChIP-Seq data reveals H4R3me2 globally represses gene expression
© Xu et al; licensee BioMed Central Ltd. 2010
Received: 7 March 2010
Accepted: 23 July 2010
Published: 23 July 2010
In the last decade, biochemical studies have revealed that epigenetic modifications including histone modifications, histone variants and DNA methylation form a complex network that regulate the state of chromatin and processes that depend on it including transcription and DNA replication. Currently, a large number of these epigenetic modifications are being mapped in a variety of cell lines at different stages of development using high throughput sequencing by members of the ENCODE consortium, the NIH Roadmap Epigenomics Program and the Human Epigenome Project. An extremely promising and underexplored area of research is the application of machine learning methods, which are designed to construct predictive network models, to these large-scale epigenomic data sets.
Using a ChIP-Seq data set of 20 histone lysine and arginine methylations and histone variant H2A.Z in human CD4+ T-cells, we built predictive models of gene expression as a function of histone modification/variant levels using Multilinear (ML) Regression and Multivariate Adaptive Regression Splines (MARS). Along with extensive crosstalk among the 20 histone methylations, we found H4R3me2 was the most and second most globally repressive histone methylation among the 20 studied in the ML and MARS models, respectively. In support of our finding, a number of experimental studies show that PRMT5-catalyzed symmetric dimethylation of H4R3 is associated with repression of gene expression. This includes a recent study, which demonstrated that H4R3me2 is required for DNMT3A-mediated DNA methylation--a known global repressor of gene expression.
In stark contrast to univariate analysis of the relationship between H4R3me2 and gene expression levels, our study showed that the regulatory role of some modifications like H4R3me2 is masked by confounding variables, but can be elucidated by multivariate/systems-level approaches.
Histones are subjected to numerous modifications, including methylation, acetylation and phosphorylation. Over 60 different amino acid residues of the histones, detected by specific antibodies or mass spectrometry, can be modified. They regulate a number of important processes on DNA, including transcription [1, 2]. Extensive studies comparing histone modification and transcription levels have established that histone methylation is associated with either transcriptional repression or activation. A number of marks have been classified as "activating" transcription including H3 lysine 4 tri-methyl (H3K4me3) and H3 lysine 36 tri-methyl (H3K36me3) and "repressing" transcription including H3 lysine 27 tri-methyl (H3K27me3) [1, 2]. These modifications can be recognized by chromatin remodeling proteins (readers), which render chromatin in either "open", transcriptionally permissive conformations or "closed", DNA-inaccessible conformations, respectively [1, 2].
A simple question that emerges is: Why does the cell require ~100 or more modifications to maintain two (i.e., open and closed) or a handful of chromatin states? The histone code hypothesis was developed to address this question. The histone code hypothesis "suggested that distinct functional consequences result from histone modifications and that a given outcome is encoded in the precise nature and pattern of marks" [3–6]. A challenge to the hypothesis has been the identification of multiple readers for a single modification, thereby confounding "a simple one-mark-to-one-module type of decoding" . A framework that keeps the histone code hypothesis intact and addresses this criticism is the phenomenon of multivalency--the cooperative engagement of several linked substrates by a species with more than one discrete interacting surface [2, 3]. In other words, chromatin regulatory proteins and their associated complexes write, read and erase multiple histone modifications simultaneously. It has been suggested that multivalency may be widespread in chromatin regulation. Indeed, a number of recent studies are uncovering patterns of coexisting histone marks, extensive crosstalk among different modifications as well as multiple effector proteins on the same complex [2, 3, 7–9].
Using ChIP-chip and ChIP-Seq, bivalent domains of H3K4me3 and H3K27me3 were observed at genes encoding developmentally important transcription factors in embryonic stem cells [10–12]. It is suggested that these genes are transcriptionally silent but poised for activation during development. Indeed, in differentiated cells the vast majority of bivalent domains (93/97) resolved into either K4me3 (active genes) or K27me3 (repressed genes). Consistent with the idea of widespread multivalency, it is notable that two "opposing" marks were assayed on a genomic scale and were found to occur in bivalent domains. It raises the question: If many more marks were mapped, would we find widespread multivalencies?
To help address these questions we applied two machine learning methods, Stepwise Multilinear Regression and Multivariate Adaptive Regression Splines (MARS) , to genome-wide ChIP-Seq maps of 20 histone lysine and arginine methylations and histone variant H2A.Z in CD4+ T-cells . We hypothesize that inclusion of two (bivalent) and three (trivalent) interacting cross-terms in the model can reveal (1) putative cross-regulation or multivalent relationships between histone modifications and (2) a global view of the epigenetic regulatory network. Specifically, we first estimate the enrichment level of each modification using a new, model-based approach, which accounts for the characteristic spatial distribution of each modification across genes. With the enrichment levels as inputs and normalized log2 gene expression levels as output, we build the multilinear (ML) model from a set of 21 single or monovalent inputs, 210 bivalent inputs and 1330 trivalent inputs. For the MARS model, the 21 monovalent amplitudes were supplied as input and the bi- and trivalent interacting terms were added as part of the model optimization procedure. Using 10-fold cross validation and requiring terms to appear in 5 of 10 training models, our best ML model contained 7, 8 and 8 mono-, bi-and trivalent terms, respectively. Using the Generalized Cross Validation (GCV) score to protect against overfitting, we trained a MARS model that had 7, 10 and 6 mono-, bi- and trivalent terms, respectively. We were able to identify a number of highly significant multivalent terms, suggesting that multivalency and cross talk among histone modifications may be widespread. However, we were surprised that both models predicted H4R3me2--shown to be repressive in a number of experimental studies [15–28]--to be among the most repressive histone methylations given that its ChIP-Seq enrichment levels showed no response to increasing gene expression .
Calculation of amplitudes
Enrichment levels, or amplitudes, for each of the 21 histone modifications were estimated for each gene using a spatially weighted average of the mapped ChIP-Seq tag counts (see Additional file 1: Supplemental Figure S1 and Additional file 2: Supplemental Table S1 for the range of amplitude values). The gene list used in this study was compiled from the NCBI36 Homo sapiens database (Ensembl 54, downloaded June 24, 2009). For each mark j, an average enrichment template, t i,j across the 5' flanking region (i.e., -2 kbp before the transcription start site), the body of a scaled gene (a gene divided into a fixed number of bins), and the 3' flanking region (i.e., transcription stop site to +2 kbp), was first calculated as a function of relative genomic position i. For both the 5' flanking region and 3' flanking region, the coordinate i represents each nucleotide position relative to the transcription start and stop sites, respectively. Within gene bodies (i.e., transcription start site to stop site), the coordinate i represents the position in the gene body, which is divided into 8138 segments, or bins, which corresponds to the median gene length. For genes whose lengths are greater than 8138 bp, tag counts were averaged across bases within each of the 8138 bins. For genes whose lengths are less than 8138 bp, tag counts were repeated in order to generate 8138 bins. For genes not divisible by 8138 or a divisor thereof, fractions of base pairs within a bin were rounded to the nearest integer value; therefore bins containing the majority of the fraction received the full tag count value of the corresponding base pair, while the bin containing the minority received no part of the bisected base pair's value. The median value was used for the bin number to minimize biases introduced in scaling. Large bins would tend to over-smooth large genes, while small bins would tend to overrepresent copied values from small genes. The t i,j or template for mark j, was finally computed by (1) aligning the transcription start and stop sites of every scaled gene and then (2) calculating the average bin-averaged tag count across genes for every coordinate i. All templates were then normalized so that their average across bins was 1 ( ; N = number of bins). In other words, the template is the averaged and normalized enrichment profile across all scaled genes. Because the template appears to have a characteristic shape for a given mark j across the length of scaled genes, we developed a model of relative enrichment which assumes the actual profile of any given mark is given by a product of a gene-dependent amplitude, , for a gene, k, and the mark's template t i,j . In other words, gene k's tag count profile for mark j across genomic coordinate i, , is well approximated by . Using least squares, we minimized the difference between the model and the actual tag count profiles:
We note that in the special case where the template is constant as a function of genomic position/bin, i, reduces to a simple average of tag counts across bins, , which is the appropriate estimate of tag "depth" for a mark whose tag distribution is uniform across a gene.
Selection of transcription start and stop sites
Many Ensembl genes contain multiple start and stop sites. Given that we only have 3' biased gene expression data, there are cases where we cannot unambiguously assign an Affymetrix probe set to one transcription start or stop site which we need for our estimate of mark enrichment. Consequently, we chose the transcription start sites that were associated with the highest number of significantly enriched histone modifications as representing the most likely expressed transcript. If a selected start site had multiple stop sites, we chose a stop site using the same scheme. In cases where multiple transcription start sites had the same number of significant marks, the most upstream transcription start site was chosen. When multiple stop sites for a given start site had the same number of significant marks a stop site was arbitrarily selected.
where μ j is the mode of the amplitude distribution and the sum is over genes k* whose amplitude is less than or equal to the mode and n is the number of genes that satisfy this inequality. This null model was used to determine the p-value by calculating the integral of the Gaussian from the mark amplitude to infinity for each mark at every Ensembl gene [29, 30].
A Benjamini-Hochberg false discovery rate (FDR)  correction was applied to the p-values using the p.adjust function in R, and an FDR-corrected p-value cutoff of 0.05 was used to determine significantly enriched amplitudes.
Amplitude robustness and relative error metrics
where n is the number of bins in the template, and all other variables follow previous definitions. In addition to the gene amplitude calculation with the 8138-bin (8 k) template, amplitudes were also calculated with 6000 (6 k) and 10,000 (10 k) bin (plus flanking regions) templates. To assess the robustness of our amplitude estimates, we calculated the Spearman correlation coefficient between the 8 k and 6 k bin amplitudes and the 8 k and 10 k bin amplitudes. We also calculated the fractional difference between the 8 k and 6 k bin amplitudes, , and the 8 k and 10 k bin amplitudes, which is similarly defined. Finally, the CV(RMSD) was calculated for all marks for the 3 sets of amplitude calculations to assess effect of bin size selection on the template model fit.
Building the multilinear model using stepwise linear regression
where Y k is the normalized log2 gene expression (using GCRMA ); β j , β j,l , β j,l,m are mono-, bi- and trivalent histone modification fitting coefficients; are mark j amplitudes for gene k and the ε k are random errors. Briefly, an initial model with randomly selected terms-defined as β coefficients multiplied by one, two or three mark amplitudes (e.g., )-- is fit. Terms from the set that are not in the initial model and make a statistically significant contribution to the model (i.e., p-value ≤ 0.05 according to an F-test) are added during a forward step. The forward step continues until no terms from the available pool of unused terms contribute significantly to the model. A backward step is then applied whereby terms are ranked in descending order according to their p-values and removed if they are not significant (i.e., p-value > 0.05). The backward step ends when no terms in the model are insignificant. The forward and backward steps are repeated until no significant terms can be added or removed, respectively.
Because stepwise linear regression is not guaranteed to converge to a globally optimal solution (e.g., minimum adjusted R2) for any given initial seed model, we performed multiple rounds of multiple stepwise regressions using different randomly seeded models. In the first round, we ran stepwisefit on the full data set 100 times using randomly seeded models. This resulted in 100 models with a mean of 227 terms. To assess the statistical significance of a given term's survival rate across the models, we randomly sampled 227 of the 1561 possible terms to generate a null model. While a survival rate of 0.2 was significant (p-value < 0.05), to increase stringency we arbitrarily selected a cutoff of 0.35 to arrive at 167 starting terms for the next round of stepwise linear regression.
To avoid the problem of overfitting and its inflation of model complexity, we applied stepwisefit to 10-fold cross validation data. Specifically, for each of the 10 folds we performed 10 runs of stepwisefit where the initial model contained the 167 terms found in the first round plus an additional 60 randomly selected terms (i.e., we generated 100 models). Using the test data, we applied only the backward step of the stepwise procedure to assess the significance of every term that survived the training step and removed those with p-values > 0.05. Among the 10 runs for each fold, the model with the lowest test mean square error (MSE) was selected. This resulted in 10 models for each fold. We then required a term to appear in 5 or more of the 10 models generated within each fold to be selected for the final model. This resulted in 24 terms.
We arrived at a robust estimate of the final set of 24 coefficients by fitting the training data to a model that contained only the 24 terms. This yielded 10 sets of 24 coefficients (i.e., one for each fold of the 10-fold training data). We arrived at the final value of each fitting coefficient by calculating the trimmed mean of the 10 found in each fold. The final model's performance was assessed by calculating the mean MSE and adjusted R2 across the 10 test and training data folds (see Results and Discussion).
Building the Multivariate Adaptive Regression Splines (MARS) model
where Y k is log2 gene expression of gene k (i.e., output variable), c0 is a constant, is the subset of mark amplitudes that appear in term i, and b i (·) is a basis function that is made up of either one or a product of two or more hinge functions. Hinge functions are splines that take the form or where is a special constant known as a knot. We note that the two hinge functions shown above are a symmetrical pair about the vertical line .
where RSS is the residual sum of squares, N g is the number of observations or genes with expression data in our case, T is the total number of terms in the model, P is a user defined penalty (earth's default is 3 for multivalent models), and B is the total number of non-constant basis functions in the model.
Results and discussion
These amplitudes were calculated for 11,796 Ensembl genes . Because the gene expression data was 3' biased, we could not distinguish the expression levels of different isoforms, which included multiple TSS. We selected the TSS that had the largest number of significant modifications as detailed in the Methods section. Human CD4+ T-cell gene expression data which was used to generate the output of our models was collected from the Genomics Institute of the Novartis Research Foundation's SymAtlas --a compendium of gene expression data in human and mouse tissues. Only genes that had Ensembl, UCSC, and RefSeq IDs were included in this study. Of the 18,647 genes that met these criteria, 11,796 had expression data  associated with them. Because multiple Affymetrix probes can interrogate a single gene, the total number of expression data points for the 11,796 genes was 17,635, which constituted the output of the multilinear and MARS models.
Estimating input amplitudes for regression models
Two groups [35, 36] have estimated ChIP-Seq histone modification/variant enrichment levels across genes in order to applying machine learning techniques--linear regression  and Bayesian networks . They count tags only in a region surrounding the transcription start site (i.e., ±1 kbp  or ±2 kbp  of the TSS). A major problem with this method is that many marks do not have promoter/5' end biased enrichment patterns. A striking example is H3K36me3, which has increasing enrichment along the gene body, which peaks near the 3' end of genes . Yu et al. calculate correlation coefficients between their 5' end biased mark enrichment estimates and gene expression levels and find little to no correlation between H3K36me3 and gene expression. This is an unexpected result as H3K36me3 has been characterized as an activating mark in a number of biochemical studies [7, 8], and its levels have been shown to have a strong positive correlation with gene expression . This discrepancy is likely due to the 5' bias of their amplitude estimation method. To address this problem, we estimated the effective enrichment levels of each mark by calculating a weighted average across the whole gene and its flanking region as described in the Methods section. We use the average enrichment pattern across the flanking regions and body of scaled genes as an estimate of the weighting function. However, given the large variation in gene lengths, exon/intron number, and mark deposition patterns, we assessed the robustness and relative error of our amplitude estimation procedure.
Robustness and relative error of mark amplitude estimates
Our amplitude estimation procedure is motivated by the observation that a number of histone methylations (e.g., H3K36me3, H4K20me1, H2BK5me1, etc) are pervasive across the body of genes and their enrichment patterns appear to scale with gene length. However, methyl groups are physically attached to histones whose octamer wraps 146 bp of DNA (i.e., a nucleosome). Thus, our (and others ) gene-scaling procedures average different numbers of nucleosomes depending on the gene length and selected bin size. Consequently, we assessed the robustness of our amplitude estimation procedure by recalculating our template and amplitude using 6000 (6 k) and 10,000 (10 k) bins and compared them to those calculated using 8138 (8 k) bins.
We first generated scatter plots of the 6 k versus 8 k and 10 k versus 8 k amplitude estimates (Additional file 3: Supplemental Figure S2) and calculated their associated Spearman correlation coefficients (Additional file 4: Supplemental Table S2 and Additional file 5: Supplemental Table S3) for all 21 histone modifications/variants. We found the values to be highly correlated with correlation coefficients ranging from 0.994-0.9995 and 0.9975-0.9998 across marks for the comparisons of 8 k bins to 6 k and 10 k, respectively. We also calculated the fractional difference (i.e., difference divided by mean) between 6 k and 8 k and 10 k and 8 k amplitude estimates. These values were summarized across genes by calculating the 0th, 25th, 50th, 75th and 100th percentile values for each mark (Additional file 4: Supplemental Table S2 and Additional file 5: Supplemental Table S3). The absolute value of typical (50th percentile) fractional differences range from 0.0023-0.065 and 0.0016-0.042 for the comparisons of 8 k to 6 k and 10 k, respectively. Indeed, the worst absolute values were 0.22 and 0.16 for 8 k versus 6 k and 10 k, respectively. Thus, our estimates of mark enrichment amplitudes are relatively robust with respect to bin size. Given these results, it's not surprising that our model results and main conclusions do not depend on bin size as discussed below.
An advantage of a model-based approach to estimating enrichment levels is that we can directly assess model performance by calculating residuals (i.e., differences between the model and the data). Thus, for each mark, we calculated the root mean square deviation (RMSD) between the model and the data divided by the amplitude (i.e., CV(RMSD) defined in Methods section) for every gene, which is a measure of the relative error. In a plot of CV(RMSD) versus amplitude for every mark we find a near universal curve (Additional file 6: Supplemental Figure S3). This results in part because our normalization of each mark's template (i.e., their average across bins equals 1) allows us to interpret the mark amplitude as a model-based estimate of each mark's effective read density, or read coverage. As might be expected, below amplitude values of 1 (i.e., 1× coverage) the error grows rapidly. For relatively large amplitudes (i.e., values greater than 1), the CV asymptotically reach values slightly below 2. In contrast, marks whose largest amplitudes fall well below 1 and consequently don't achieve their large amplitude asymptotic value have CV values that range from 2.3-5 at the largest amplitudes encountered (i.e., 95th amplitude percentile). For most marks, the 95th amplitude percentile is below 1, indicating from our crude gene-centric measure of coverage/read density that the effective sequencing coverage might be low (see Additional file 7: Supplemental Table S4). We also see a steady trend upward in the CV--calculated in the neighborhood of the 95th amplitude percentile (mean CV calculated for genes in the 92.5-97.5 amplitude percentile)--with decreasing 95th percentile amplitude levels (see Additional file 7: Supplemental Table S4). Taken together, these results indicate that RMSD between the model and the data are on the same order as the amplitude. We also note that we find essentially the same CV(RMSD) values for 6 k and 10 k bins. We are currently working on improving estimates of mark enrichment levels from ChIP-Seq data and distinguishing the extent to which this modest to high variability is due to intrinsic noise or systematic biases introduced by our model assumptions, including how we scale genes. Nevertheless, we find our weighted average estimate is relatively robust and should capture enrichment level trends in histone modification/variant ChIP-Seq data reasonably well.
Multilinear model terms and statistics
Multilinear model term
β (trim mean)
Z (trim mean)
Impact (trim mean)
H3K4me2 a * - H3R2me1 r
H3K27me2 r * - H3R2me1 r
H3K9me2 r * - H3K27me1 r * - H4K20me1 a
H2BK5me1 a * - H3K36me3 a
H2BK5me1 a * - H4K20me1 a - H3R2me1 r
H3K9me3 r * - H3K36me3 a
H3K4me2 a * - H3K36me3 a - H3K79me3 a
H3K36me3 a - H3K79me2 - H3R2me2
H3K27me3 r * - H3K79me2 - H3K79me3 a
H3K4me1 a * - H3K9me2 r * - H4K20me1 a
H3K9me1 a * - H3K27me1 r * - H4K20me1 a
H2BK5me1 a * - H4K20me1 a
H3K36me1 - H3K79me1 a - H3K79me3 a
H3K4me2 a * - H3K9me1 a *
H3K27me2 r * - H3K36me3 a
H3K27me2 r * - H3K79me1 a
Multilinear model terms
Of the 7 monovalent terms (Table 1), H3K4me3, H3K36me3, H3K79me1, H3K79me3 and H4K20me1 were activating. Of these, only H3K4me3, H3K36me3 and H4K20me1 display a clear overall activating trend from composite plots . Based on their composite plot analysis, Barski et al. conclude that H3K79me1 level alone shows no overall trend with gene expression while it makes the highest impact activating contribution in our ML model. This is consistent with a recent finding that H3K4me3 and H3K79me1 are the most predictive of gene expression levels in low CpG content promoters . Barski et al. also find that H3K79me3 is enriched in active gene promoters, and in the body of silent genes. Two arginine methylations, H4R3me2 and H3R2me1, were the only repressive monovalent marks in the model. In contrast, Barski et al. find no overall activating or repressing trend for these two methylations from their composite plot analysis. Marks that showed an activating trend from composite plots  but did not appear as monovalent terms in our ML model included H3K4me1,2, H3K9me1 and H2BK5me1. Marks that showed a repressive trend from composite plots but were absent at the monovalent level in our ML model were H3K27me2,3, and H3K9me2,3. H3K27me1 and the variant H2A.Z did not appear as monovalent terms in the ML model and displayed complex, non-monotonic enrichments as a function of increasing expression from their composite plots. Finally, marks that neither appeared at the monovalent level in the ML model nor showed any trend with respect to gene expression level were H3K79me2, H3R2me2 and H4K20me3.
While the majority of monovalent terms are activating, the majority of multivalent terms, 11 of 16, are repressive. Half of the 16 multivalent terms involve a mix of activating and repressing modifications according to either the sign of their monovalent term in the ML model or composite plot trends. This is interesting given the discovery of bivalent domains of H3K4me3 and H3K27me3. Indeed, two of the three highest impact and most significant repressive terms are bivalent. They both include H3K27me2 together with H3K79me1 and H3K36me3 respectively. The highest impact activating multivalent term is also bivalent and composed of an activating mark, H3K4me2, and a repressive mark, H3R2me1. Thus, at the bivalent level, the linear model terms suggest that there is significant overlap between opposing marks (i.e., activating and repressive) and that one of them tends to "override" the other, similar to the observation that H3K27me3 overrides H3K4me3 in ES cells [10, 12].
We find increasing H3K27me2 corresponds to decreasing expression as expected; we also find increasing expression with fixed H3K27me2 and increasing H3K36me3 (i.e., vertical lines at various x-axis intercepts). However, for relative H3K27me2 amplitudes exceeding 0.9, gene expression remains at low levels independent of H3K36me3 levels. This suggests that high levels of H3K27me2 are capable of overriding the gene activating potential of H3K36me3.
ML models have been applied in physical and statistical studies where a common outcome-- theoretically expected in many cases--is that the single terms dominate the model in both their relative impact and statistical significance. In these systems, the double, triple, quadruple product terms tend to make small, diminishing (in the order of the number of products) corrections to the single terms. As shown in Table 1, we find the expected trend with the highest impact and most significant activating terms all monovalent. The two most significant repressive terms are also monovalent: H3R2me1 (highest impact) and H4R3me2. However, we note that size of the impact of the highest impact bivalent terms is relatively large.
Normalized log2 gene expression was modeled as a function of histone modification enrichment using the nonlinear Multivariate Adaptive Regression Splines (MARS) method. The model was built with the earth package in R http://cran.r-project.org/web/packages/earth/index.html as described in the Methods section.
MARS model terms
Genes with non-Zero value
MARS model terms
Of the 5 unique monovalent marks in the MARS model, 3 are activating, including H3K36me3, H3K79me3 and H4K20me1. These results are in agreement with the ML model and the Barski et al. data; although as previously mentioned, H3K79me3 is a complicated mark, which is enriched in the promoters of active genes and the bodies of repressed genes. H3K27me2 has a repressive trend in the model, which agrees with Barski et al. H4R3me2 appears to have no discernible behavior in the Barski et al. data; however, both the MARS and ML models select it as a repressive monovalent mark.
Impact of marks in MARS model
Predicted impact (95th-5th)
Impact of two marks in MARS model
Bivalent MARS term
Impact of three marks in MARS model
Trivalent MARS term
We also find one bivalent and trivalent combination composed of activating and repressive marks, H3K36me3-H4R3me2 and H3K27me2-H3K27me3-H4K20me1. As shown in Tables 4 and 5, we find that increasing each mark's level independently results in the expected activating or repressive response. High levels of the activating and repressive marks result in a moderating effect on predicted gene expression with values falling between those of high activating-low repressive and low activating-high repressive mark amplitudes. This reinforces the results of the ML model where we found the tendency of one mark to override or oppose another overlapping mark.
Like the ML model, the MARS model explains about half of the variation in gene expression. Moreover, the ML and MARS model predicted gene expression profiles are highly correlated (Figure 2C). However, the Pearson correlation coefficient between predicted and actual log2 gene expression is slighter better for the MARS model. This is impressive given that both models contain the same number of terms, 24, and the MARS model was built using one round of a greedy algorithm while the ML model was built by selecting the best model from multiple rounds of a stepwise algorithm. We note that the stepwise algorithm is a more powerful and computationally expensive optimization procedure. These observations suggest that methods like MARS that are capable of modeling the nonlinear relationship between histone modification and gene expression levels should outperform models that assume this relationship is linear. Moreover, many of the bi- and trivalent terms in the ML model may not have a biological origin but are compensating for the nonlinearities in the data. Specifically, as shown in Table 1, the ML model contains two bivalent terms (H3K4me2-H3K9me1 and H2BK5me1-H4K20me1) containing activating marks with a negative (repressive) fitting coefficient, one bivalent term composed on two repressive marks (H3K27me2-H3R2me1) with a positive (activating) fitting coefficient, and a trivalent term composed of activating marks (H3K4me2-H3K36me3-H3K79me3) with a negative (repressive) fitting coefficient. These terms have no known biological origin and are more likely artifacts of imposing a linear model on data, which is inherently nonlinear.
Both regression methods produced a model with 24 terms. However, there are only 4 common terms between the models, all of which are monovalent terms: H4R3me2, H3K79me3, H4K20me1 and H3K36me3. Both models agree that of these marks H4R3me2 is the only repressive mark, while the others are activating. Considering the linear model contains 7 monovalent terms out of a possible 21, and the MARS model contains 5, the degree of overlap in the monovalent terms is quite high.
No overlapping multivalent terms existed between the two models. The differences between the multivalent components of each model could be the result of the way the models were built. Since the space of possible terms increases rapidly with valency, and the search space over which the ML model converges on a final model is much larger than that of the MARS model, the potential for model overlap becomes less likely as valency increases. In this way the MARS model--built using a greedy algorithm--is constrained in the number of multivalent terms it can potentially include.
In silico knockout analysis
In order to generate experimentally testable predictions, a knockout analysis of the ML and MARS models was performed to assess the effect of removing a specific modification on gene expression. Predictions of gene expression were made with each model by setting the amplitude of a single mark to zero in the model and holding all others at their experimental values. This process was repeated for all marks in each model to determine the global effect of each histone modification. All pairwise combinations of marks were also knocked out for each model.
ML model knockout analysis
Multilinear model knockouts
log2 fold change (predicted WT/predicted KO)
MARS model knockout analysis
MARS model knockouts
log2 fold change (predicted WT/predicted KO)
H4R3me2 is globally repressive in ML and MARS models
Strikingly, H4R3me2 was the most globally repressive mark in the ML model according to the knockout analysis, with an average predicted fold change (WT/knockout) in gene expression of 0.55. It was also the second most repressive mark in the MARS model knockout analysis, with an average fold change of 0.70. This was a highly unexpected result for H4R3me2 given the unresponsiveness of its ChIP-Seq enrichment levels to increasing gene expression . Indeed, we used the data generated by Barski et al. but came to diametrically opposed conclusions regarding H4R3me2's influence on gene expression. Moreover, this conclusion is not altered by the selection of bin size, as H4R3me2 is the most highly repressive mark in the 6 k bin-based MARS model, and second most repressive in the 8 k and 10 k bin-based MARS models (Additional file 8: Supplemental Table S5).
In order to make sense of the apparently contradictory behavior of H4R3me2, we first note that we performed two analyses that differed from the Barski et al. analysis: (1) estimated mark amplitudes using a model based weighted average and (2) modeled gene expression using all histone methylation amplitudes as input to the ML and MARS models. A trivial explanation would be that our amplitude estimation procedure now yields a response of H4R3me2 levels to gene expression, which is then reflected in the ML and MARS fitting coefficients. We directly tested this by generating boxplots of our mark amplitudes stratified by quartiles of gene expression shown in Additional file 9: Supplemental Figure S4. Consistent with Barski's  composite plots, we observe little to no response of H4R3me2 amplitudes with increasing gene expression. For comparison, we generate the same boxplots for H3K27me2 and observe a dramatic decrease in its amplitude with increasing gene expression level as expected. Thus, we can rule out the mark amplitude estimation procedure as an explanation. We are left with the interesting result that H4R3me2's repression of gene expression is revealed by analyses such as MARS and ML modeling which account for the simultaneous impact of the other histone methylations. In other words, the impact of any given histone methylation on gene expression is analyzed in the context of 20 other activating and repressive modifications.
In order to better understand H4R3me2's affect on gene expression, we performed a comparative analysis with H3K27me2, which is highly repressive in the ML and MARS model knockout analysis as well as the Barski et al. composite plot analysis. Specifically, we divided the ML model log2 fold changes of wild type over H4R3me2 and H3K27me2 knockouts by quintiles and calculated boxplots of predicted gene expression in the WT and KO cases as shown in Additional file10: Supplemental Figures S5A and S5B, respectively. The first 20% of the data (QU1) represents the genes most up regulated by knocking out the mark (i.e., the largest de-repression of gene expression). The last 20% of the data (QU5) represents the genes least up regulated by knocking out H4R3me2 and genes down regulated by knocking out H3K27me2. The trends in WT and KO gene expression across the stratified data are opposite for H4R3me2 and K3K27me2. For H4R3me2, the median log2 gene expression in the WT is relatively low, 4, in QU1 and increases slightly to 4.7 in QU5 with the knockout showing a similar trend (Additional file 10: Supplemental Figure S5A). In contrast, the H3K27me2 WT median starts out considerably higher in QU1, 5.9, and plummets 8-fold to 2.9 in QU5 with the knockout again showing a similar trend (Additional file 10: Supplemental Figure S5B). Thus, H3R3me2 tends to be consistently acting on relatively low expressed genes, and its removal is predicted to increase their expression ~1.7-fold on average. H3K27me2, on the other hand, has the highest impact, from our knockout analysis, on middle to high expressed genes and the least on silenced genes.
Taken together these analyses suggest that in contrast to well-characterized repressive modifications including H3K27 and H3K9 methylation whose levels tend to be strongly anti-correlated to activating modification levels (see Figure 3A), H4R3me2 and the other arginine methylations tend to be somewhat uncorrelated to both. Consequently, its absence at genes does not imply the presence of activating marks and high levels of expression (see Figure 3B). Conversely, high levels of H4R3me2 can coincide with modest to relatively high levels of activating marks like H3K36me3, which tend to override H4R3me2 (see Table 4, Table 7 and Figure 3B). Thus, its levels show no strong trend with increasing overall gene expression as Barski et al. found. Instead, dimethylation of H4R3 consistently tends to further repress low to modestly expressed genes nearly 2-fold on average, leading to relatively strong predicted de-repression in the ML and MARS models when knocked out.
Experimental studies demonstrate H4R3me2 represses gene expression
Barski et al. report that they used the antibody that recognizes symmetric dimethylated H4R3 (H4R3me2s), which is deposited by the arginine methyltransferase PRMT5. A number of experimental studies have shown that PRMT5 and H4R3me2 s repress gene expression [15–28]. In an experiment that is a direct analogue of our knockout analysis, silencing of PRMT5 in mouse cell lines resulted in more de-repressed than repressed genes in a microarray analysis , supporting our result that H4R3me2 s is globally repressive. PRMT5 is a member of the multi-subunit mSin3A and NuRD histone deacetylase complexes , suggesting H4R3me2 is associated with deacetylation and hence gene inactivation [22, 23, 26]. Interestingly, both the mSin3A-PRMT5 containing complex and recombinant PRMT5 methylate H4R3 and show an in vitro preference for methylating hypo-versus hyperacetylated histone H4R3 . PRMT5 was also shown to interact with the MBD2/NuRD complex and that PRMT5 and MBD2 are recruited to CpG islands in a methylation-dependent manner with H4R3 methylated at these loci . These results are consistent with our finding that H4R3me2 tends to further repress modest to low expressed genes, which are likely hypoacetylated, contain methylated CpG islands in their promoter regions or both.
In a recent study, H4R3me2 s was shown to be required for subsequent DNA methylation --a repressor of gene expression. Indeed, H4R3me2 s was shown to be a direct binding target of the DNA methyltransferase DNMT3A . Loss of H3R3me2 s through shRNA knockdown of PRMT5 resulted in reduced DNMT3A binding, loss of DNA methylation and six-fold induction of the fetal (γ) globin gene .
Current genomic strategies for assessing whether a particular histone modification is activating or repressive involve (1) mapping it to the genome using ChIP-chip or ChIP-Seq and either (2) comparing boxplots of gene expression of genes with and without the mark  or (3) generating composite plots of average mark levels of genes stratified by gene expression level as in the Barski et al. analysis. Using this approach, Barski et al. concluded that H4R3me2 is neither activating nor repressive because its levels showed no response with increasing gene expression level. Using the Barski et al. ChIP-Seq data of 20 histone lysine and arginine methylations and histone variant H2A.Z in CD4+ T-cells, we built models of gene expression as a function of histone modification/variant levels using Multilinear (ML) Regression and Multivariate Adaptive Regression Splines (MARS). The response of monovalent (non-interacting) terms in the ML and MARS model indicate whether a given modification is activating or repressive. For most of the 20 histone methylations, our assignments agree with that of Barski et al. However, according to our in-silico equivalent of knocking out modifications, H4R3me2 is predicted to be the most and second most globally repressive histone methylation among the 20 studied in the ML and MARS models, respectively. A number of experimental studies show that PRMT5-catalyzed symmetric dimethylation of H4R3 is associated with repression of gene expression [15–28]. This includes a recent study, which demonstrated that H4R3me2 is required for DNMT3A-mediated DNA methylation --a known global repressor of gene expression. Consequently, our study serves as the first demonstration that H4R3me2 represses gene expression using genomic data and shows that the regulatory role of some modifications like H4R3me2 can only be revealed by approaches that simultaneously analyze multiple activating and repressive modifications. Our findings point to a disconnect between traditional biochemical (e.g., silencing) and genomic approaches in assessing the activating or repressive potential of an individual modification. Indeed, assuming the biochemical studies were correct and H4R3me2 is repressive, one would conclude from the Barski et al. analysis that the antibody they used for H4R3me2 did not work. Our results suggest that it worked extremely well. Taken together, our findings have broad implications for ChIP-Seq experimental design, analysis and interpretation and suggest an important role for retaining network level information in the analysis of ChIP-Seq data.
Multivariate Adaptive Regression Splines
This work was supported in part by the National Institutes of Health grant CA104397 from the National Cancer Institute (NCI). SH was supported by the University of Virginia Cancer Training Grant from the NCI (5T32CA009109). We would like to thank Swati Jain for contributing helpful ideas and generating the first multilinear models as part of a class assignment. We would also like to thank two anonymous reviewers for significantly strengthening the manuscript, especially regarding assessment of our amplitude estimation procedure and wording of the text.
- Li B, Carey M, Workman JL: The role of chromatin during transcription. Cell 2007, 128(4):707–719. 10.1016/j.cell.2007.01.015View ArticlePubMedGoogle Scholar
- Berger SL: The complex language of chromatin regulation during transcription. Nature 2007, 447(7143):407–412. 10.1038/nature05915View ArticlePubMedGoogle Scholar
- Ruthenburg AJ, Li H, Patel DJ, Allis CD: Multivalent engagement of chromatin modifications by linked binding modules. Nat Rev Mol Cell Biol 2007, 8(12):983–994. 10.1038/nrm2298View ArticlePubMedPubMed CentralGoogle Scholar
- Strahl BD, Allis CD: The language of covalent histone modifications. Nature 2000, 403(6765):41–45. 10.1038/47412View ArticlePubMedGoogle Scholar
- Jenuwein T, Allis CD: Translating the histone code. Science 2001, 293(5532):1074–1080. 10.1126/science.1063127View ArticlePubMedGoogle Scholar
- Turner BM: Histone acetylation and an epigenetic code. Bioessays 2000, 22(9):836–845. 10.1002/1521-1878(200009)22:9<836::AID-BIES9>3.0.CO;2-XView ArticlePubMedGoogle Scholar
- Suganuma T, Workman JL: Crosstalk among Histone Modifications. Cell 2008, 135(4):604–607. 10.1016/j.cell.2008.10.036View ArticlePubMedGoogle Scholar
- Latham JA, Dent SY: Cross-regulation of histone modifications. Nat Struct Mol Biol 2007, 14(11):1017–1024. 10.1038/nsmb1307View ArticlePubMedGoogle Scholar
- Zhang Y, Reinberg D: Transcription regulation by histone methylation: interplay between different covalent modifications of the core histone tails. Genes Dev 2001, 15(18):2343–2360. 10.1101/gad.927301View ArticlePubMedGoogle Scholar
- Bernstein BE, Mikkelsen TS, Xie X, Kamal M, Huebert DJ, Cuff J, Fry B, Meissner A, Wernig M, Plath K, Jaenish R, Wagschal A, Feil R, Schreiber SL, Lander ES: A bivalent chromatin structure marks key developmental genes in embryonic stem cells. Cell 2006, 125(2):315–326. 10.1016/j.cell.2006.02.041View ArticlePubMedGoogle Scholar
- Ku M, Koche RP, Rheinbay E, Mendenhall EM, Endoh M, Mikkelsen TS, Presser A, Nusbaum C, Xie X, Chi AS, Adli M, Kasif S, Ptaszek LM, Cowan CA, Lander ES, Koseki H, Bernstein BE: Genomewide analysis of PRC1 and PRC2 occupancy identifies two classes of bivalent domains. PLoS Genet 2008, 4(10):e1000242. 10.1371/journal.pgen.1000242View ArticlePubMedPubMed CentralGoogle Scholar
- Mikkelsen TS, Ku M, Jaffe DB, Issac B, Lieberman E, Giannoukos G, Alvarez P, Brockman W, Kim TK, Koche RP, Lee W, Mendenhall E, O'Donovan A, Presser A, Russ C, Xie X, Meissner A, Wernig M, Jaenisch R, Nusbaum C, Lander ES, Bernstein BE: Genome-wide maps of chromatin state in pluripotent and lineage-committed cells. Nature 2007, 448(7153):553–560. 10.1038/nature06008View ArticlePubMedPubMed CentralGoogle Scholar
- Hastie T, Tibshirani R, Friedman J: The Elements of Statistical Learning: Data Mining, Inference, and Prediction. New York: Springer-Verlag; 2001.View ArticleGoogle Scholar
- Barski A, Cuddapah S, Cui K, Roh TY, Schones DE, Wang Z, Wei G, Chepelev I, Zhao K: High-resolution profiling of histone methylations in the human genome. Cell 2007, 129(4):823–837. 10.1016/j.cell.2007.05.009View ArticlePubMedGoogle Scholar
- Zhao Q, Rank G, Tan YT, Li H, Moritz RL, Simpson RJ, Cerruti L, Curtis DJ, Patel DJ, Allis CD, Cunningham JM, Jane SM: PRMT5-mediated methylation of histone H4R3 recruits DNMT3A, coupling histone and DNA methylation in gene silencing. Nat Struct Mol Biol 2009, 16(3):304–311. 10.1038/nsmb.1568View ArticlePubMedGoogle Scholar
- Hou Z, Peng H, Ayyanathan K, Yan KP, Langer EM, Longmore GD, Rauscher FJ: The LIM protein AJUBA recruits protein arginine methyltransferase 5 to mediate SNAIL-dependent transcriptional repression. Mol Cell Biol 2008, 28(10):3198–3207. 10.1128/MCB.01435-07View ArticlePubMedPubMed CentralGoogle Scholar
- Bedford MT, Clarke SG: Protein arginine methylation in mammals: who, what, and why. Mol Cell 2009, 33(1):1–13. 10.1016/j.molcel.2008.12.013View ArticlePubMedPubMed CentralGoogle Scholar
- Wang L, Pal S, Sif S: Protein arginine methyltransferase 5 suppresses the transcription of the RB family of tumor suppressors in leukemia and lymphoma cells. Mol Cell Biol 2008, 28(20):6262–6277. 10.1128/MCB.00923-08View ArticlePubMedPubMed CentralGoogle Scholar
- Ancelin K, Lange UC, Hajkova P, Schneider R, Bannister AJ, Kouzarides T, Surani MA: Blimp1 associates with Prmt5 and directs histone arginine methylation in mouse germ cells. Nat Cell Biol 2006, 8(6):623–630. 10.1038/ncb1413View ArticlePubMedGoogle Scholar
- Bedford MT, Richard S: Arginine methylation an emerging regulator of protein function. Mol Cell 2005, 18(3):263–272. 10.1016/j.molcel.2005.04.003View ArticlePubMedGoogle Scholar
- Litt M, Qiu Y, Huang S: Histone arginine methylations: their roles in chromatin dynamics and transcriptional regulation. Biosci Rep 2009, 29(2):131–141. 10.1042/BSR20080176View ArticlePubMedGoogle Scholar
- Pal S, Vishwanath SN, Erdjument-Bromage H, Tempst P, Sif S: Human SWI/SNF-associated PRMT5 methylates histone H3 arginine 8 and negatively regulates expression of ST7 and NM23 tumor suppressor genes. Mol Cell Biol 2004, 24(21):9630–9645. 10.1128/MCB.24.21.9630-9645.2004View ArticlePubMedPubMed CentralGoogle Scholar
- Wysocka J, Allis CD, Coonrod S: Histone arginine methylation and its dynamic regulation. Front Biosci 2006, 11: 344–355. 10.2741/1802View ArticlePubMedGoogle Scholar
- Lee DY, Teyssier C, Strahl BD, Stallcup MR: Role of protein methylation in regulation of transcription. Endocr Rev 2005, 26(2):147–170. 10.1210/er.2004-0008View ArticlePubMedGoogle Scholar
- Le Guezennec X, Vermeulen M, Brinkman AB, Hoeijmakers WA, Cohen A, Lasonder E, Stunnenberg HG: MBD2/NuRD and MBD3/NuRD, two distinct complexes with different biochemical and functional properties. Mol Cell Biol 2006, 26(3):843–851. 10.1128/MCB.26.3.843-851.2006View ArticlePubMedPubMed CentralGoogle Scholar
- Pal S, Yun R, Datta A, Lacomis L, Erdjument-Bromage H, Kumar J, Tempst P, Sif S: mSin3A/histone deacetylase 2- and PRMT5-containing Brg1 complex is involved in transcriptional repression of the Myc target gene cad. Mol Cell Biol 2003, 23(21):7475–7487. 10.1128/MCB.23.21.7475-7487.2003View ArticlePubMedPubMed CentralGoogle Scholar
- Fabbrizio E, El Messaoudi S, Polanowska J, Paul C, Cook JR, Lee JH, Negre V, Rousset M, Pestka S, Le Cam A, Sardet C: Negative regulation of transcription by the type II arginine methyltransferase PRMT5. EMBO Rep 2002, 3(7):641–645. 10.1093/embo-reports/kvf136View ArticlePubMedPubMed CentralGoogle Scholar
- Yu MC, Lamming DW, Eskin JA, Sinclair DA, Silver PA: The role of protein arginine methylation in the formation of silent chromatin. Genes Dev 2006, 20(23):3249–3254. 10.1101/gad.1495206View ArticlePubMedPubMed CentralGoogle Scholar
- Buck MJ, Nobel AB, Lieb JD: ChIPOTle: a user-friendly tool for the analysis of ChIP-chip data. Genome Biol 2005, 6(11):R97. 10.1186/gb-2005-6-11-r97View ArticlePubMedPubMed CentralGoogle Scholar
- Gibbons FD, Proft M, Struhl K, Roth FP: Chipper: discovering transcription-factor targets from chromatin immunoprecipitation microarrays using variance stabilization. Genome Biol 2005, 6(11):R96. 10.1186/gb-2005-6-11-r96View ArticlePubMedPubMed CentralGoogle Scholar
- Benjamini Y, Hocberg Y: Controlling the false discovery rate: a practical and powerful approach to multiple testing. J Roy Statist Soc Ser B (Methodological) 1995, 57: 289–300.Google Scholar
- Wu ZJ, Irizarry RA, Gentleman R, Martinez-Murillo F, Spencer F: A model-based background adjustment for oligonucleotide expression arrays. Journal of the American Statistical Association 2004, 99(468):909–917. 10.1198/016214504000000683View ArticleGoogle Scholar
- Hubbard TJ, Aken BL, Ayling S, Ballester B, Beal K, Bragin E, Brent S, Chen Y, Clapham P, Clarke L, Coates G, Fairley S, Fitzgerald S, Fernandez-Banet J, Gordon L, Graf S, Haider S, Hammond M, Holland R, Howe K, Jenkinson A, Johnson N, Kahari A, Keef D, Keenan S, Kinsella R, Kokocinski F, Kulesha E, Lawson D, Longden I, et al.: Ensembl 2009. Nucleic Acids Res 2009, (37 Database):D690–697. 10.1093/nar/gkn828Google Scholar
- Su AI, Wiltshire T, Batalov S, Lapp H, Ching KA, Block D, Zhang J, Soden R, Hayakawa M, Kreiman G, Cooke MP, Walker JR, Hogenesch JB: A gene atlas of the mouse and human protein-encoding transcriptomes. Proc Natl Acad Sci USA 2004, 101(16):6062–6067. 10.1073/pnas.0400782101View ArticlePubMedPubMed CentralGoogle Scholar
- Yu H, Zhu S, Zhou B, Xue H, Han JD: Inferring causal relationships among different histone modifications and gene expression. Genome Res 2008, 18(8):1314–1324. 10.1101/gr.073080.107View ArticlePubMedPubMed CentralGoogle Scholar
- Karlic R, Chung HR, Lasserre J, Vlahovicek K, Vingron M: Histone modification levels are predictive for gene expression. Proc Natl Acad Sci USA 2010, 107(7):2926–2931. 10.1073/pnas.0909344107View ArticlePubMedPubMed CentralGoogle Scholar
- Bernstein BE, Kamal M, Lindblad-Toh K, Bekiranov S, Bailey DK, Huebert DJ, McMahon S, Karlsson EK, Kulbokas EJ, Gingeras TR, Schreiber SL, Lander ES: Genomic maps and comparative analysis of histone modifications in human and mouse. Cell 2005, 120(2):169–181. 10.1016/j.cell.2005.01.001View ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.