Histone modification profiles are predictive for tissue/cell-type specific expression of both protein-coding and microRNA genes
© Zhang and Zhang; licensee BioMed Central Ltd. 2011
Received: 9 December 2010
Accepted: 14 May 2011
Published: 14 May 2011
Gene expression is regulated at both the DNA sequence level and through modification of chromatin. However, the effect of chromatin on tissue/cell-type specific gene regulation (TCSR) is largely unknown. In this paper, we present a method to elucidate the relationship between histone modification/variation (HMV) and TCSR.
A classifier for differentiating CD4+ T cell-specific genes from housekeeping genes using HMV data was built. We found HMV in both promoter and gene body regions to be predictive of genes which are targets of TCSR. For example, the histone modification types H3K4me3 and H3K27ac were identified as the most predictive for CpG-related promoters, whereas H3K4me3 and H3K79me3 were the most predictive for nonCpG-related promoters. However, genes targeted by TCSR can be predicted using other types of HMVs as well. Such redundancy implies that multiple types of underlying regulatory elements, such as enhancers or intragenic alternative promoters, which can regulate gene expression in a tissue/cell-type specific fashion, may be marked by the HMVs. Finally, we show that the predictive power of HMV for TCSR is not limited to protein-coding genes in CD4+ T cells: the same classifier, trained on HMV data of protein-coding genes in CD4+ T cells, successfully predicted TCSR targeted genes in muscle cells as well as microRNA genes with expression specific to CD4+ T cells.
We have begun to understand the HMV patterns that guide gene expression in both tissue/cell-type specific and ubiquitous manners.
The development of a human body from a single fertilized egg is a spatially and temporally regulated complex process. The genes that are responsible for general cellular function are expressed in all cell-types and tissues. However, in many tissue/cell-types, specialized functions require or exclude the expression of certain genes. The mechanism of this tissue/cell-type specific regulation (TCSR) is not well understood, particularly because such diverse expression patterns are achieved through one genome shared largely by all cells. Gene transcription is regulated in multiple layers, e.g. transcription factor binding through DNA nucleotide features, DNA methylation, and chromatin modifications. TCSR may involve combinations of these regulations in all layers (for review, see [1–3]).
Thanks to next generation sequencing technology, our understanding of human TCSR has accelerated in recent years. At the base layer of DNA features, the association between DNA regulatory elements, such as the TATA box and CpG islands in the promoter regions, and tissue-specific regulation has been investigated experimentally and computationally. Tissue-specific regulatory transcription factor binding sites in the promoter regions have been well studied in muscle and liver, and binding sites were also detected in multiple tissues using generic transcription factor binding site prediction tools [7–9]. Cell-type specific enhancers have been experimentally explored in several cell types as well. High-throughput Cap Analysis of Gene Expression (CAGE) data showed that alternative transcription start sites (TSS) are more prevalent in the mammalian genome than previously thought, and distributions of TSS have also been associated with TCSR. Recently, genome-wide mapping of Histone Modifications and Variants (HMVs) in CD4+ T cells [13, 14], as well as other cell types, opened up an opportunity to model gene expression levels from the perspective of post-translational modification of histones. For example, Pekowska et al. clustered genes by their H3K4me2 profile at the promoter regions in CD4+ T cells. They found that one cluster was enriched in CD4+ T cell specific genes. However, a comprehensive picture of how post-translational modifications of histones contribute to TCSR is still lacking.
Therefore, in this work, we addressed three major questions: 1) which HMVs carry sufficient information to allow TCSR target gene prediction, 2) whether TCSR is the same as gene expression activity regulation, and 3) whether the predictive relationship between HMV and TCSR target genes is universal for the entire Pol II transcriptome. To address these questions, we developed a quantitative model to link the HMVs and TCSR target genes using CoreBoost, and applied it to recently published genome-wide mapped HMVs in CD4+ T cells [13, 14]. CoreBoost is a previously developed boosting classifier [18, 19] that can select informative features from an ensemble of weak classifiers. We first show that HMV profiles in both proximal promoters and gene bodies are predictive for CD4+ T cell specificity. The most predictive HMV types have been identified for CpG- and nonCpG-related genes in promoters and gene bodies. The evidence shows that the underlying enhancers and intragenic alternative promoters marked by the HMV patterns were associated with tissue/cell-type specific gene expression. Second, we demonstrated that TCSR is different from the regulation of gene expression activity. Finally, the model, which was trained on HMV data of protein-coding genes in CD4+ T cells, successfully predicted muscle cell specific genes and CD4+ T cell specific microRNA genes.
Results and Discussion
Definition of CD4+ T cell specific regulated genes
We chose CD4+ T cells as the model, taking advantage of the widespread availability of genome-wide HMV data for this cell type [13, 14]. CD4+ T cell specific expressed genes (denoted as CD4SE) and housekeeping genes (denoted as HK) were collected as positive and negative datasets. We identified CD4SE genes according to their expression profiles among human tissues and other information. Altogether, 454 and 630 genes were collected in the CD4SE and HK sets, respectively (see Methods and Materials).
Genes in the CD4SE set were not expressed in most tissue/cell-types other than blood cell types. We plotted the expression distribution of genes in CD4SE, HK and randomly selected genes among all tissues in the GNF symAtlas dataset, as shown in Additional file 1. CD4SE genes were only expressed in a small number of blood cell types (CD14, CD19, CD33, CD4, CD56, CD8, X721 B/T cells, and whole blood). This is expected and agrees with the high expression correlation between blood cells [15, 16]. On the other hand, the HK genes and randomly selected genes were expressed in the various tissue/cell-types studied. Quantitatively, both the overall entropy and the categorical entropy in CD4+ T cells are significantly smaller in CD4SE genes than in HK genes (the average overall entropies for CD4SE and HK genes are 4.8 and 6.26 in the GNF symAtlas dataset [4, 20], respectively, P < 2.2e-16; the average categorical entropies for CD4SE and HK genes are 8.95 and 12.35 in the GNF symAtlas dataset, respectively, P < 2.2e-16).
The predictive HMVs for CD4+ T cell specific regulation
Previous studies suggested that CpG- and nonCpG-related promoters have different regulatory characteristics [21–26], and have a contrasting distribution of HMVs. Following the same strategy used previously for CoreBoost, CoreBoost_HM, and a third work, we analyzed CpG- and nonCpG-related genes separately. There were 40 HMV types in the CD4+ T cell dataset [13, 14], many of which were correlated with each other. We first performed a principal component analysis (PCA) and grouped the HMVs into two sets. For convenience, we refer to them as Set I and Set II. Set I contained the HMVs that have the highest contributions in the first 4 principal components (which captured 90% of variance, see Additional file 2). Set II contained the remaining HMVs. There were 25 and 15 HMVs in Set I and Set II, respectively. We trained CoreBoost to distinguish between CD4SE genes and HK genes. Because there were more genes in the HK set (630) than in the CD4SE set (454), we randomly sampled about 454 HK genes and combined them with CD4SE to form a total set. The performance of the CoreBoost classifiers was evaluated based on sensitivity, positive predictive value (PPV) and F-score. Five-fold cross-validation was performed to limit over-fitting. To further reduce any potential bias introduced by sampling fluctuation, we repeated the whole process 100 times.
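The class balancing and repetition described above can be sketched as follows. This is a minimal illustration only: the function name, the gene identifiers and the fixed random seed are assumptions introduced here, and the text says "about 454" HK genes were sampled, whereas the sketch samples exactly as many negatives as there are positives.

```python
import random

def balanced_replicates(positives, negatives, n_replicates=100, seed=0):
    """Yield (positives, sampled_negatives) pairs with equal class sizes.

    Hypothetical helper: assumes the negative set is at least as large
    as the positive set, as for the HK (630) and CD4SE (454) gene sets.
    """
    positives = list(positives)
    negatives = list(negatives)
    if len(negatives) < len(positives):
        raise ValueError("negative set must be at least as large as positive set")
    rng = random.Random(seed)
    for _ in range(n_replicates):
        # Draw a fresh random subset of negatives for every replicate.
        yield positives, rng.sample(negatives, len(positives))

# Placeholder identifiers standing in for the real gene lists.
cd4se = [f"CD4SE_{i}" for i in range(454)]
hk = [f"HK_{i}" for i in range(630)]
replicates = list(balanced_replicates(cd4se, hk))
```

Each replicate would then be passed to the cross-validated training step, and the reported means and standard deviations taken over the 100 replicates.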
The performance of CoreBoost based on features in CpG- or nonCpG-related proximal promoter and gene body region
Sensitivity      PPV              F-score
0.579 ± 0.016    0.764 ± 0.028    0.658 ± 0.006
0.889 ± 0.032    0.771 ± 0.021    0.825 ± 0.007
0.594 ± 0.013    0.789 ± 0.021    0.678 ± 0.007
0.876 ± 0.015    0.811 ± 0.013    0.842 ± 0.006
0.523 ± 0.016    0.717 ± 0.021    0.605 ± 0.010
0.908 ± 0.033    0.790 ± 0.021    0.844 ± 0.007
0.588 ± 0.015    0.736 ± 0.020    0.653 ± 0.008
0.892 ± 0.014    0.821 ± 0.012    0.854 ± 0.006
To investigate this possibility, we designed a new HMV feature table containing the following information: the average and sum of each HMV level for the first exon and the first intron; the average and sum of each HMV level for the whole gene body; and the sum of each HMV level in the first twenty nucleosomes positioned after the first exon. The first exon and the first intron were chosen because previous studies had shown that the first exon and/or intron can play important roles in gene regulation, especially in tissue-specific regulation [30, 31]. Using this newly designed feature table, we repeated the CoreBoost training and analysis. The "body" entries in Table 1 summarize the performance of the new CoreBoost classifier for 100 replicates. We found that the classifiers have similar performance, irrespective of whether the HMV features in promoters or in gene bodies were used for training, and both performed significantly better than classifiers trained on control regions (Table 1). For CpG-related genes, the sums of H3K27ac, H3K79me3 and H3K4me3 levels over the entire gene body contributed most to the prediction of CD4+ T cell specificity (see Additional file 5). For nonCpG-related genes, the sums of H4K20me3 and H3K14ac levels over the entire gene body contributed most to the prediction (see Additional file 5). Based on this line of evidence, we conclude that HMV profiles in gene bodies encode information about TCSR, much like those in promoters.
TCSR is different from gene expression activity regulation at the HMV level
We have shown above that TCSR target genes can be predicted by HMV profiles in both promoters and gene body regions. The immediate question that follows is how much gene expression level per se may determine TCSR. It might be argued that our TCSR model achieves high predictive power because CD4SE genes are highly expressed in CD4+ T cells and could therefore be easily predicted by any gene expression level prediction model. We now argue that this is not the case.
The predictive power of our TCSR model does not stem from the high expression level of CD4SE genes. First, if we define highly expressed genes as those genes whose expression levels are at least one standard deviation higher than the average level in a given cell type, then CD4SE genes are by no means highly expressed genes, even though their expression levels are higher than those of HK genes (rank sum test P = 0.01). Second, our model does not simply predict highly expressed genes as CD4+ T cell specific. For example, of the 159 CpG-related genes that were predicted as CD4+ T cell specific in at least half of the 100 replicates, only 26 genes were actually highly expressed in CD4+ T cells (P = 0.005). Moreover, the highly expressed genes predicted by the model proposed by Karlic et al. are expressed in a broad range of tissues, whereas our model predicted CD4+ T cell specific genes expressed only in a limited number of blood cell types, akin to CD4SE genes (Additional file 6). Therefore, it is not surprising that our predictions, in comparison, have significantly smaller overall entropies (average overall entropies are 5.7 and 4.3, respectively, rank sum test P < 2.2e-16) and categorical entropies (5.8 and 4.5, respectively, rank sum test P < 2.2e-16). The same observation can be made even if one removes the intersection of the two predictions (overall entropies are 5.7 and 4.3 respectively, P < 2.2e-16; and categorical entropies are 5.9 and 4.6 respectively, P < 2.2e-16).
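The working definition of "highly expressed" used above (at least one standard deviation above the average level in a given cell type) can be written down directly. The sketch below is an assumption-laden illustration: the function name is hypothetical, the population standard deviation is used, and the text does not say whether expression values were log-transformed first.

```python
from statistics import mean, pstdev

def highly_expressed(expression, n_sd=1.0):
    """Return genes whose level is >= mean + n_sd * SD within one cell type.

    `expression` maps gene name -> expression level in that cell type.
    Uses the population SD (an assumption; the source does not specify).
    """
    values = list(expression.values())
    cutoff = mean(values) + n_sd * pstdev(values)
    return {gene for gene, level in expression.items() if level >= cutoff}

# Toy values, not taken from the paper.
toy = {"a": 1, "b": 1, "c": 1, "d": 1, "e": 10}
high = highly_expressed(toy)
```

Intersecting such a set with the predicted CD4+ T cell specific genes gives the kind of overlap count reported in the paragraph above.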
The correlations between predictive HMV profiles in nonCpG-related and CpG-related promoters
What makes HMV predictive for tissue/cell-types specific regulation?
There are several other possible associations between TCSR and HMV patterns. The HMV patterns could be markers in the nucleosomes indicating enhancers in the nearby DNA sequence. The binding of a transcription regulatory factor at an enhancer has long been suggested as one of the most important mechanisms of tissue/cell-type regulation [10, 15, 34]. H3K4me1 is most frequently associated with enhancers [10, 13]. We compared H3K4me1 profiles in the gene body with the profiles of other HMV types (see Additional file 7). Of the 15 HMVs most correlated with H3K4me1, 13 (87%, hypergeometric test P = 9.3E-9) were selected as top predictive features by resampling at least once in the 100 replicates. In addition, other HMV types are associated with enhancers. For example, H2A.Z, H3K27ac, monomethylated H3K4, H3K9, and H3K27 were all found to be strongly associated with enhancers [13–15, 32, 41, 42]. Also, six HMVs (H3K4me1, H3K4me2, H3K4me3, H3K9me1, H3K18ac, and H2A.Z) were detected at more than a fifth of potential enhancers. All of these HMVs were selected as predictive HMVs at least once by resampling (see Additional file 5), indicating possible underlying enhancer activity in these regions.
Another possibility is that tissue/cell-type specific expression could be regulated after transcription initiation and/or in the pause and elongation stages. Recent studies implied that the majority of genes are transcriptionally initiated and paused [43–45]. H3K79me2, a characteristic marker of RNAPII elongation, is only found downstream of TSS in the human genome. In our data, H3K79me2 is one of the most frequently selected predictive HMVs among the 100 replicates from Set I (see Additional file 4). In nearly all cases (except for the Set I HMVs in the nonCpG-related promoters), as shown in Table 2, we noticed an HMV highly correlated with H3K4me3 (in nonCpG-related genes) and H3K27ac (in CpG-related genes), respectively. H3K4me3 and H3K27ac are well-known gene activity markers [13, 14]. The other HMVs are much less correlated with either of the HMVs selected by the gene activity model of Karlic et al. (except for the Set II HMVs in the CpG-related promoters). Given this observation, we propose that the HMV profile of H3K27ac and H3K4me3, together with other correlated HMV types, may provide a basal layer of information for gene transcriptional regulation in CpG- and nonCpG-related genes, respectively. As additional signals, the remaining HMV marks may be "modulated" on top of the basal signals so that the tissue/cell-type specificities of gene expression can be achieved. This modulation process could be manifested either by guiding the binding of transcription factors at enhancer regions or by directing the pause or elongation of transcription, as discussed above.
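For readers unfamiliar with the hypergeometric test quoted above, the upper-tail probability can be computed as in the following sketch. The function name and argument names are assumptions made for illustration. The text does not state the population size or the total number of selected features behind P = 9.3E-9, so the usage line uses toy numbers and does not attempt to reproduce that value.

```python
from math import comb

def hypergeom_upper_tail(k, population, successes, draws):
    """P(X >= k) when `draws` items are taken without replacement from a
    population of size `population` containing `successes` marked items."""
    upper = min(successes, draws)
    total = comb(population, draws)
    favourable = sum(
        comb(successes, i) * comb(population - successes, draws - i)
        for i in range(k, upper + 1)
    )
    return favourable / total

# Toy example (hypothetical numbers): 10 items, 5 marked, 5 drawn, all 5 marked.
p_toy = hypergeom_upper_tail(5, 10, 5, 5)
```

In the setting of the text, the marked items would be the HMVs selected as predictive features and the draws would be the 15 HMVs most correlated with H3K4me1.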
The HMV profile marks skeletal muscle myoblasts specific genes
The performance of CoreBoost classifiers
Sensitivity       PPV               F-score
0.929 ± 0.098     0.467 ± 0.012     0.619 ± 0.033
0.764 ± 0.214     0.617 ± 0.043     0.662 ± 0.123
0.919 ± 0.135     0.473 ± 0.022     0.618 ± 0.043
0.860 ± 0.183     0.608 ± 0.025     0.700 ± 0.093
0.417 ± 0.037     0.441 ± 0.036     0.426 ± 0.011
0.365 ± 0.073†    0.521 ± 0.085     0.422 ± 0.079†
0.255 ± 0.080     0.220 ± 0.068†    0.234 ± 0.068†
0.125 ± 0.057*    0.099 ± 0.069*    0.102 ± 0.056*
Prediction of CD4+ T cell specific regulation of microRNA genes
MicroRNAs (miRNAs) are a class of short RNA molecules which are generated from intergenic or intronic transcripts called pri-miRNAs (for review see [47, 48]). Similar to mRNAs, pri-miRNAs also have a 5' cap structure and a 3' polyA tail. The majority of pri-miRNAs are believed to be transcribed by Pol II, with a few exceptions. Nevertheless, most pri-miRNAs share a transcription mechanism similar to that of protein-coding genes.
To test whether the association between TCSR and HMV patterns we found in protein-coding genes is similar for miRNA genes, we trained our CoreBoost classifiers using the HMV profiles of protein-coding genes and applied them to miRNA genes. We evaluated our prediction with a recently published miRNA expression atlas in which 13 and 50 CpG-related miRNA clusters were identified as CD4+ T cell specific and housekeeping, respectively. The performance of the classifiers trained in promoter and gene body regions was significantly better than the performance of classifiers trained in control regions (Table 3), although it was not as good as the performance for predicting protein-coding genes (Table 1). The relatively lower performance of the classifiers on miRNAs most likely results from the fact that we do not have sufficient knowledge about miRNA gene structures, e.g., the TSS, the full length of the pre-miRNA transcript, or the existence or lengths of first exons/introns. The promoter regions of miRNA genes used for this prediction were obtained from recent computational predictions. However, because of the shortage of high-quality training data, miRNA promoter prediction is a difficult problem, and the resolution and the accuracy of the predictions are relatively low [19, 23]. On the other hand, our classifiers were trained on the HMV profiles in individual hypothetical nucleosomes positioned relative to a well-defined TSS. Thus, the low resolution of promoter prediction has a significant effect on the nucleosome assignment (a 500-bp resolution could result in a difference of about 3 nucleosomes). This effect lowers the expected predictive power of our promoter-trained HMV classifiers. Nevertheless, even without full knowledge, our model was still able to correctly predict about 40% of CD4+ T cell specific miRNAs, and this prediction was significantly better than the control.
This result suggested that miRNA genes may share a similar association between HMV patterns and TCSR with protein-coding genes.
Predictive information is redundantly distributed among HMVs
In this work, we identified H3K4me3, H3K79me3, and H3K27ac as the most predictive marks in the promoter regions (Figure 1). However, these three HMV marks are by no means the only predictive ones. For example, H3K79me2 was also selected among the most predictive HMV marks in nonCpG-related gene bodies (see Additional file 4). Therefore, we can reasonably argue that the predictive power for detecting TCSR targeted genes is redundantly encoded among HMVs. One clue indicating the existence of such redundancy was the success of applying our model to HSMM cell input data. Instead of using the full model, we trained our CoreBoost classifiers with the eight HMV types which were available in the ENCODE dataset. Although neither H3K79me2 nor H3K79me3 was available in the ENCODE dataset, the classifiers still managed to make significant predictions with performance similar to those trained with the full HMV set (Table 3).
To further exclude the possibility that this high performance could be attributed to the existence of one or several dominating HMV marks, we performed the training and testing once more with a subset of HMV types from which all three of the most predictive HMV types, H3K4me3, H3K79me3, and H3K27ac, were removed. We also excluded H3K4me2 from the training data because this HMV type has recently been suggested as a unique mark for CD4+ T cell specificity. Interestingly, the classifiers achieved significant predictive power similar to that of the classifiers trained on the full HMV profile (Table 3 and Additional file 8). With the possible exception of H3K9me1, Pekowska et al. did not find any HMV marks other than H3K4me2 that produced the same enrichment of CD4+ T cell specific genes. This is probably because clustering did not fully reveal the relationship between HMV profile and TCSR. To explore this possibility, we revisited the cluster (cluster 1) in which they observed enrichment of CD4+ T cell specific genes. By comparing the entropies of cluster 1 and CD4SE using the GNF symAtlas dataset, we found that the overall entropy of cluster 1 was larger than that of CD4SE (5.5 and 4.8 respectively, p < 2.2e-16), and that its categorical entropy was also larger (10.8 and 8.95 respectively, p < 2.2e-16), implying that the genes in cluster 1 are significantly less specific to CD4+ T cells than the genes in CD4SE. Only 66 out of 392 genes in cluster 1 were actually CD4+ T cell specific according to our definition of TCSR by gene expression entropy (sensitivity = 0.14, PPV = 0.16 and F-score = 0.15).
We have utilized CoreBoost to connect the HMV and TCSR patterns in CD4+ T cells. From these data we draw the following conclusions. First, we found that patterns of HMV contain sufficient information to predict TCSR target genes. The classifier we trained on HMV data successfully distinguished CD4+ T cell specific genes from housekeeping genes. Predictive HMV information was found not only in promoter regions, but also in the gene body. This finding is important because it implies the existence of multiple regulatory elements which could be marked by HMVs for TCSR. Second, we identified predictive HMV marks for CpG- and nonCpG-related genes. In promoters, H3K4me3 and H3K27ac were the most predictive HMV marks for CpG-related genes, whereas for nonCpG-related genes H3K4me3 and H3K79me3 were the most predictive. However, even when we excluded data from the most predictive HMV marks, the remaining data still had sufficient predictive ability to make significant predictions for TCSR target genes. This information redundancy again points to the existence of multiple regulatory elements which could be marked by HMVs for TCSR. By carefully surveying patterns of HMV, we further propose that marking the underlying enhancers and marking intragenic alternative promoters are two potential mechanisms that could guide TCSR. Finally, we provide evidence that TCSR in other tissue/cell-types, as well as TCSR for non-protein coding Pol II transcripts such as microRNAs, may share HMV patterns similar to those of CD4+ T cells. The associations between the HMV patterns and TCSR we found may be generic, as the same classifier, trained on the HMV data of protein-coding genes in CD4+ T cells, successfully predicted genes with muscle cell specific expression as well as microRNA genes with CD4+ T cell specific expression.
The RefSeq Gene annotation track for the human genome sequence (hg18) was downloaded from the University of California Santa Cruz Genome Browser (UCSC, http://genome.ucsc.edu/). The exon information was downloaded from BioMart at Ensembl (http://www.ensembl.org/biomart/). Two gene expression data sets in human tissues were taken from the GNF symAtlas database and the GEO database (http://www.ncbi.nlm.nih.gov/geo/, GSE7307). We defined a promoter to be CpG-related if there was a CpG island located within the region from 2 kb upstream to 500 bp downstream of the TSS. The CpG island annotations were downloaded from the UCSC Genome Browser as well. HMV data for CD4+ T cells were retrieved from genome-wide studies of the distribution of 19 lysine or arginine histone methylations and the H2A.Z histone variant, and mapping of 19 histone acetylations. HMV data for normal human skeletal muscle myoblasts (HSMM) and K562 cell lines were retrieved from the ENCODE project, specifically from the Broad Institute ChIP-seq dataset. In addition, as part of the ENCODE project, CAGE experimental data for the K562 cell line were retrieved from the RIKEN institute, and DNA methylation levels in K562 were retrieved from the work of Brunner and colleagues. The miRNA expression profiles were retrieved from a small RNA library based sequencing atlas; we used the number of clones for each miRNA cluster to represent the expression of the pri-miRNA in each tissue. The promoter of a miRNA cluster was chosen as the closest promoter predicted for the members of the cluster.
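The CpG-related promoter rule stated above can be expressed as a simple interval check. The sketch below is illustrative: the function names are hypothetical, coordinates are assumed to be on a single chromosome, and "located within" is interpreted as any overlap between a CpG island and the [-2 kb, +500 bp] window, which is an assumption rather than something the text specifies.

```python
def promoter_window(tss, strand, upstream=2000, downstream=500):
    """Genomic interval from `upstream` bp before to `downstream` bp after the TSS."""
    if strand == "+":
        return tss - upstream, tss + downstream
    # On the minus strand, "upstream" lies at higher genomic coordinates.
    return tss - downstream, tss + upstream

def is_cpg_related(tss, strand, islands):
    """True if any (start, end) CpG island overlaps the promoter window."""
    lo, hi = promoter_window(tss, strand)
    return any(start <= hi and end >= lo for start, end in islands)

# Toy coordinates, not taken from any annotation.
plus_hit = is_cpg_related(10_000, "+", [(8_500, 9_000)])
plus_miss = is_cpg_related(10_000, "+", [(10_600, 11_000)])
minus_hit = is_cpg_related(10_000, "-", [(10_600, 11_000)])
```

The same island can therefore classify a promoter differently depending on strand, which is why the strand must be carried along with the TSS coordinate.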
Identifying tissue-specific and housekeeping transcripts
As a measurement of information content, Shannon entropy has been used for measuring the tissue-specificity of gene expression. As the information content (tissue-specificity) of a distribution increases, its entropy decreases. Borrowing this concept, we defined the CD4+ T cell specific expressed gene set as the combination of the following two datasets: 1) genes whose overall gene expression entropy is smaller than 5.0 and whose categorical entropy is less than 9; and 2) manually selected genes. From the literature, we manually selected 40 genes that play certain roles in CD4+ T cell development or maturation (see Additional file 9). The combination of the above two datasets contains 454 genes (see Additional file 9). The housekeeping genes were defined according to two criteria: 1) an overall gene expression entropy larger than 6.2 in the GNF symAtlas dataset [4, 20]; 2) an overall gene expression entropy larger than 8.9 in the GSE7307 dataset. In total, there were 630 genes identified as housekeeping genes (see Additional file 9). The threshold for CD4+ T cell specificity was determined according to the bell-shaped distribution of categorical entropy (z-score > 2). The threshold for housekeeping genes was determined as the turning point of the overall entropy distribution curve, which has an exponential-like shape. We also tried several surrounding threshold values and retrained our model accordingly, but no significantly different results were observed.
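The two entropy measures can be sketched as follows. This assumes the formulation commonly attributed to Schug et al. [4]: expression levels across tissues are normalized to relative expression p_t, the overall entropy is H = -Σ p_t log2 p_t, and the categorical entropy for a tissue t is Q = H - log2 p_t. The function names and the toy profile are introduced here for illustration; the thresholds 5.0 and 9 are those given in the text.

```python
import math

def relative_expression(expression):
    """Normalize a tissue -> level mapping so the values sum to 1."""
    total = sum(expression.values())
    return {tissue: level / total for tissue, level in expression.items()}

def overall_entropy(expression):
    """Shannon entropy (bits) of a gene's expression across tissues."""
    p = relative_expression(expression)
    return -sum(x * math.log2(x) for x in p.values() if x > 0)

def categorical_entropy(expression, tissue):
    """Q = H - log2(p_tissue); small when expression is specific to `tissue`.

    Assumes the gene has non-zero expression in `tissue`.
    """
    p = relative_expression(expression)
    return overall_entropy(expression) - math.log2(p[tissue])

def is_cd4_specific(expression, tissue="CD4", h_max=5.0, q_max=9.0):
    """Apply the entropy thresholds quoted in the text."""
    return (overall_entropy(expression) < h_max
            and categorical_entropy(expression, tissue) < q_max)

# Toy profile: uniform expression over four tissues gives H = 2 bits, Q = 4.
uniform = {"CD4": 1.0, "liver": 1.0, "muscle": 1.0, "brain": 1.0}
h_uniform = overall_entropy(uniform)
q_uniform = categorical_entropy(uniform, "CD4")
```

With only four tissues every gene falls below the thresholds, so the cut-offs are meaningful only for atlases with many tissues, such as the datasets used here.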
Definition of feature tables for promoter and gene body regions
The promoter region was defined as the region from the 6th nucleosome upstream of the transcription start site (TSS) to the 20th nucleosome downstream of the TSS. We adopted the position definitions of the -2 to +5 nucleosomes relative to the TSS used by Schones et al. (-2 [-370:-196], -1 [-195:-46], +1 [-45:134], +2 [135:314], +3 [315:494], +4 [495:674] and +5 [675:859]). For the positions of the other nucleosomes, we simply extended 150 bp from the immediate neighbor nucleosome. We tried another combination of up- and downstream nucleosome numbers to define the promoter region (from the 6th nucleosome upstream of the TSS to the 9th downstream of the TSS), but it did not change the results. To construct the feature table for a gene, HMV levels were individually calculated for each HMV type. For any given HMV type, the sum of the HMV tag numbers in a nucleosome was assigned as the HMV level on that nucleosome. The HMV feature of a gene is therefore an array containing all HMV levels of each nucleosome within the proximal promoter. Taken together, there are 40 HMV levels (including the bound level of the CCCTC-binding factor) on all 26 nucleosomes for each gene. The feature table for gene body regions is defined in the main text.
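The nucleosome windows and per-nucleosome HMV levels described above can be sketched as follows. The seven anchor intervals are copied from the text; the function names, the treatment of intervals as inclusive, and the representation of tags as TSS-relative positions are assumptions made for this illustration.

```python
# Positions of nucleosomes -2 to +5 relative to the TSS, as listed in the text.
ANCHOR_WINDOWS = {
    -2: (-370, -196), -1: (-195, -46), 1: (-45, 134), 2: (135, 314),
    3: (315, 494), 4: (495, 674), 5: (675, 859),
}

def nucleosome_windows(first=-6, last=20, width=150):
    """Extend the anchor windows by `width` bp per nucleosome in both directions."""
    windows = dict(ANCHOR_WINDOWS)
    for idx in range(-3, first - 1, -1):        # upstream: -3, -4, ..., first
        neighbour_start = windows[idx + 1][0]
        windows[idx] = (neighbour_start - width, neighbour_start - 1)
    for idx in range(6, last + 1):              # downstream: +6, ..., last
        neighbour_end = windows[idx - 1][1]
        windows[idx] = (neighbour_end + 1, neighbour_end + width)
    return dict(sorted(windows.items()))

def hmv_levels(tag_positions, windows):
    """Sum of tags per nucleosome for one HMV type (positions relative to TSS)."""
    return [sum(lo <= pos <= hi for pos in tag_positions)
            for lo, hi in windows.values()]

windows = nucleosome_windows()
levels = hmv_levels([-400, 0, 100, 5000], windows)  # toy tag positions
```

Concatenating the resulting 26-element vectors over all HMV types gives one feature row per gene, in the spirit of the feature table described in the text.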
PCA analysis was performed by using R
The sum of tags in the 4 kb region around the TSS ([-2 kb, +2 kb]) of all HMVs for all genes forms a matrix in which rows represent genes and columns represent HMV types. PCA produced linear combinations of the columns. The HMVs with the highest contributions to the first 4 principal components formed Set I, and the remaining HMVs belonged to Set II.
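CoreBoost and performance evaluations
Sensitivity, PPV and F-score take their standard forms:

\[
\text{Sensitivity} = \frac{TP}{TP + FN}, \qquad
\text{PPV} = \frac{TP}{TP + FP}, \qquad
\text{F-score} = \frac{2 \times \text{Sensitivity} \times \text{PPV}}{\text{Sensitivity} + \text{PPV}},
\]

where TP denotes true positives, TN denotes true negatives, FP denotes false positives and FN denotes false negatives.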
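A minimal sketch of the three performance measures named in the text, assuming the standard definitions of sensitivity and PPV and the F-score as their harmonic mean. The function names are introduced here for illustration.

```python
def sensitivity(tp, fn):
    """Fraction of true positives among all actual positives."""
    return tp / (tp + fn) if (tp + fn) else 0.0

def ppv(tp, fp):
    """Positive predictive value: fraction of predictions that are correct."""
    return tp / (tp + fp) if (tp + fp) else 0.0

def f_score(sens, prec):
    """Harmonic mean of sensitivity and PPV."""
    return 2 * sens * prec / (sens + prec) if (sens + prec) else 0.0

# Toy confusion counts, not taken from the paper.
sens = sensitivity(tp=80, fn=20)
prec = ppv(tp=80, fp=40)
f = f_score(sens, prec)
```

As a consistency check, applying `f_score` to the first reported pair of mean values in the CD4+ T cell performance table (0.579 and 0.764) gives about 0.659, close to the reported 0.658; a small difference is expected because the reported F-score is averaged over replicates.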
5-fold cross validation
The evaluations and the number of features for selection by CoreBoost were all obtained by 5-fold cross validations. The procedure was as follows: Given a total dataset D, 1) D was randomly partitioned into 5 subsets D_i (i = 1, 2, ..., 5); 2) each D_i was removed exactly once from D and a CoreBoost classifier was trained on the remaining 80% and tested on the removed D_i; and 3) the final evaluations were the average of the tests on the 5 subsets. The final classifier was trained on the total dataset D, and was used to predict the miRNAs expressed specifically in CD4+ T cells.
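The cross-validation procedure above can be sketched generically. In this illustration `train` and `evaluate` are hypothetical callables standing in for CoreBoost training and scoring, and the toy threshold classifier is not the method used in the paper.

```python
import random

def k_fold_cv(data, train, evaluate, k=5, seed=0):
    """Average score over k folds: train on k-1 folds, test on the held-out fold."""
    rng = random.Random(seed)
    shuffled = list(data)
    rng.shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]   # random partition into k subsets
    scores = []
    for i, held_out in enumerate(folds):
        training = [x for j, fold in enumerate(folds) if j != i for x in fold]
        model = train(training)
        scores.append(evaluate(model, held_out))
    return sum(scores) / k

# Toy stand-in classifier: threshold midway between the two class means.
def train_threshold(samples):
    pos = [x for x, label in samples if label == 1]
    neg = [x for x, label in samples if label == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2

def accuracy(threshold, samples):
    correct = sum((x > threshold) == (label == 1) for x, label in samples)
    return correct / len(samples)

toy = [(x, 1) for x in range(100, 110)] + [(x, 0) for x in range(10)]
cv_score = k_fold_cv(toy, train_threshold, accuracy)
```

Passing a function that returns sensitivity, PPV or F-score instead of `accuracy` yields the cross-validated estimate of that measure.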
The definition of control regions
We chose a position 50 kb upstream of any given annotated TSS as its control site. When we repeated the analysis with control sites 500 kb upstream, we observed essentially the same results.
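Because "upstream" depends on the strand of the gene, the control site can be computed as in the following sketch. The function name and the strand encoding are assumptions made for illustration; the text does not describe how strand was handled.

```python
def control_site(tss, strand, offset=50_000):
    """Genomic coordinate `offset` bp upstream of the TSS.

    Upstream means lower coordinates on the plus strand and higher
    coordinates on the minus strand.
    """
    if strand == "+":
        return tss - offset
    return tss + offset

# Toy coordinates for illustration.
plus_control = control_site(1_000_000, "+")
minus_control = control_site(1_000_000, "-")
far_control = control_site(1_000_000, "+", offset=500_000)
```

The promoter feature windows would then be laid out around the control site exactly as around a real TSS.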
List of abbreviations
TCSR: tissue/cell-type specific regulation
TSS: transcription start site
CAGE: Cap Analysis of Gene Expression
PPV: positive predictive value
H3K4me3: histone H3 trimethylated at lysine 4
H3K4me2: histone H3 dimethylated at lysine 4
H3K4me1: histone H3 monomethylated at lysine 4
H3K27ac: histone H3 acetylated at lysine 27
H3K79me2: histone H3 dimethylated at lysine 79
H3K79me3: histone H3 trimethylated at lysine 79
H3K9ac: histone H3 acetylated at lysine 9
H3K27me3: histone H3 trimethylated at lysine 27
H3K36me3: histone H3 trimethylated at lysine 36
H4K20me1: histone H4 monomethylated at lysine 20.
We thank Dr. Piero Carninci and Dr. Kawaji Hideya of RIKEN, Japan for providing early access to the Fantom data. We thank Dr Monica Sleumer, Dr. Greg Vatcher, Dr. Jeff De Jong, Will Liao, Dr. Pradipta Ray, and David Martin for proofreading of the manuscript. We thank the two anonymous reviewers for their suggestions. This work was supported by a NIH R01 grant (HG001696) to M.Q.Z.
- Maston GA, Evans SK, Green MR: Transcriptional regulatory elements in the human genome. Annu Rev Genomics Hum Genet 2006, 7: 29–59. 10.1146/annurev.genom.7.080505.115623View ArticlePubMed
- Schones DE, Cui K, Cuddapah S, Roh TY, Barski A, Wang Z, Wei G, Zhao K: Dynamic regulation of nucleosome positioning in the human genome. Cell 2008, 132(5):887–898. 10.1016/j.cell.2008.02.022View ArticlePubMed
- Allis CD, Jenuwein T, Reinberg D: Epigenetics. Cold Spring Harbor, New York: Cold Spring Harbor Laboratory Press; 2007.
- Schug J, Schuller WP, Kappen C, Salbaum JM, Bucan M, Stoeckert CJ Jr: Promoter features related to tissue specificity as measured by Shannon entropy. Genome Biol 2005, 6(4):R33. 10.1186/gb-2005-6-4-r33
- Wasserman WW, Fickett JW: Identification of regulatory regions which confer muscle-specific gene expression. J Mol Biol 1998, 278(1):167–181. 10.1006/jmbi.1998.1700
- Krivan W, Wasserman WW: A predictive model for regulatory sequences directing liver-specific transcription. Genome Res 2001, 11(9):1559–1566. 10.1101/gr.180601
- Smith AD, Sumazin P, Zhang MQ: Identifying tissue-selective transcription factor binding sites in vertebrate promoters. Proc Natl Acad Sci USA 2005, 102(5):1560–1565. 10.1073/pnas.0406123102
- Smith AD, Sumazin P, Zhang MQ: Tissue-specific regulatory elements in mammalian promoters. Mol Syst Biol 2007, 3:73.
- Smith AD, Sumazin P, Xuan Z, Zhang MQ: DNA motifs in human and mouse proximal promoters predict tissue-specific expression. Proc Natl Acad Sci USA 2006, 103(16):6275–6280. 10.1073/pnas.0508169103
- Heintzman ND, Hon GC, Hawkins RD, Kheradpour P, Stark A, Harp LF, Ye Z, Lee LK, Stuart RK, Ching CW, et al.: Histone modifications at human enhancers reflect global cell-type-specific gene expression. Nature 2009, 459(7243):108–112. 10.1038/nature07829
- Carninci P, Kasukawa T, Katayama S, Gough J, Frith MC, Maeda N, Oyama R, Ravasi T, Lenhard B, Wells C, et al.: The transcriptional landscape of the mammalian genome. Science 2005, 309(5740):1559–1563.
- Carninci P, Sandelin A, Lenhard B, Katayama S, Shimokawa K, Ponjavic J, Semple CA, Taylor MS, Engstrom PG, Frith MC, et al.: Genome-wide analysis of mammalian promoter architecture and evolution. Nat Genet 2006, 38(6):626–635. 10.1038/ng1789
- Wang Z, Zang C, Rosenfeld JA, Schones DE, Barski A, Cuddapah S, Cui K, Roh TY, Peng W, Zhang MQ, et al.: Combinatorial patterns of histone acetylations and methylations in the human genome. Nat Genet 2008, 40(7):897–903. 10.1038/ng.154
- Barski A, Cuddapah S, Cui K, Roh TY, Schones DE, Wang Z, Wei G, Chepelev I, Zhao K: High-resolution profiling of histone methylations in the human genome. Cell 2007, 129(4):823–837. 10.1016/j.cell.2007.05.009
- Cui K, Zang C, Roh TY, Schones DE, Childs RW, Peng W, Zhao K: Chromatin signatures in multipotent human hematopoietic stem cells indicate the fate of bivalent genes during differentiation. Cell Stem Cell 2009, 4(1):80–93. 10.1016/j.stem.2008.11.011
- Karlic R, Chung HR, Lasserre J, Vlahovicek K, Vingron M: Histone modification levels are predictive for gene expression. Proc Natl Acad Sci USA 2010, 107(7):2926–2931. 10.1073/pnas.0909344107
- Pekowska A, Benoukraf T, Ferrier P, Spicuglia S: A unique H3K4me2 profile marks tissue-specific gene regulation. Genome Res 2010, 20:1493–1502. 10.1101/gr.109389.110
- Zhao XY, Xuan ZY, Zhang MQ: Boosting with stumps for predicting transcription start sites. Genome Biol 2007, 8(2):R17. 10.1186/gb-2007-8-2-r17
- Wang XW, Xuan ZY, Zhao XY, Li YD, Zhang MQ: High-resolution human core-promoter prediction with CoreBoost_HM. Genome Res 2009, 19(2):266–275.
- Su AI, Wiltshire T, Batalov S, Lapp H, Ching KA, Block D, Zhang J, Soden R, Hayakawa M, Kreiman G, et al.: A gene atlas of the mouse and human protein-encoding transcriptomes. Proc Natl Acad Sci USA 2004, 101(16):6062–6067. 10.1073/pnas.0400782101
- Bajic VB, Seah SH, Chong A, Zhang G, Koh JL, Brusic V: Dragon Promoter Finder: recognition of vertebrate RNA polymerase II promoters. Bioinformatics 2002, 18(1):198–199. 10.1093/bioinformatics/18.1.198
- Davuluri RV, Grosse I, Zhang MQ: Computational identification of promoters and first exons in the human genome. Nat Genet 2001, 29(4):412–417. 10.1038/ng780
- Marson A, Levine SS, Cole MF, Frampton GM, Brambrink T, Johnstone S, Guenther MG, Johnston WK, Wernig M, Newman J, et al.: Connecting microRNA genes to the core transcriptional regulatory circuitry of embryonic stem cells. Cell 2008, 134(3):521–533. 10.1016/j.cell.2008.07.020
- Mikkelsen TS, Ku M, Jaffe DB, Issac B, Lieberman E, Giannoukos G, Alvarez P, Brockman W, Kim TK, Koche RP, et al.: Genome-wide maps of chromatin state in pluripotent and lineage-committed cells. Nature 2007, 448(7153):553–560. 10.1038/nature06008
- Roh TY, Cuddapah S, Cui K, Zhao K: The genomic landscape of histone modifications in human T cells. Proc Natl Acad Sci USA 2006, 103(43):15782–15787. 10.1073/pnas.0607617103
- Saxonov S, Berg P, Brutlag DL: A genome-wide analysis of CpG dinucleotides in the human genome distinguishes two distinct classes of promoters. Proc Natl Acad Sci USA 2006, 103(5):1412–1417. 10.1073/pnas.0510310103
- Bajic VB, Tan SL, Suzuki Y, Sugano S: Promoter prediction analysis on the whole human genome. Nat Biotechnol 2004, 22(11):1467–1473. 10.1038/nbt1032
- Abeel T, Saeys Y, Bonnet E, Rouze P, Van de Peer Y: Generic eukaryotic core promoter prediction using structural features of DNA. Genome Res 2008, 18(2):310–323. 10.1101/gr.6991408
- Cenik C, Derti A, Mellor JC, Berriz GF, Roth FP: Genome-wide functional analysis of human 5' untranslated region introns. Genome Biol 2010, 11(3):R29.
- Nogami H, Hoshino R, Ogasawara K, Miyamoto S, Hisano S: Region-specific expression and hormonal regulation of the first exon variants of rat prolactin receptor mRNA in rat brain and anterior pituitary gland. J Neuroendocrinol 2007, 19(8):583–593. 10.1111/j.1365-2826.2007.01565.x
- Turner JD, Schote AB, Macedo JA, Pelascini LP, Muller CP: Tissue specific glucocorticoid receptor expression, a role for alternative first exon usage? Biochem Pharmacol 2006, 72(11):1529–1537. 10.1016/j.bcp.2006.07.005
- Roh TY, Wei G, Farrell CM, Zhao K: Genome-wide prediction of conserved and non-conserved enhancers by histone acetylation patterns. Genome Res 2007, 17(1):74–81.
- Schones DE, Zhao K: Genome-wide approaches to studying chromatin modifications. Nat Rev Genet 2008, 9(3):179–191. 10.1038/nrg2270
- Wei G, Wei L, Zhu J, Zang C, Hu-Li J, Yao Z, Cui K, Kanno Y, Roh TY, Watford WT, et al.: Global mapping of H3K4me3 and H3K27me3 reveals specificity and plasticity in lineage fate determination of differentiating CD4+ T cells. Immunity 2009, 30(1):155–167. 10.1016/j.immuni.2008.12.009
- Lorincz MC, Dickerson DR, Schmitt M, Groudine M: Intragenic DNA methylation alters chromatin structure and elongation efficiency in mammalian cells. Nat Struct Mol Biol 2004, 11(11):1068–1075. 10.1038/nsmb840
- Birney E, Stamatoyannopoulos JA, Dutta A, Guigo R, Gingeras TR, Margulies EH, Weng Z, Snyder M, Dermitzakis ET, Thurman RE, et al.: Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature 2007, 447(7146):799–816. 10.1038/nature05874
- Valen E, Pascarella G, Chalk A, Maeda N, Kojima M, Kawazu C, Murata M, Nishiyori H, Lazarevic D, Motti D, et al.: Genome-wide detection and analysis of hippocampus core promoters using DeepCAGE. Genome Res 2009, 19(2):255–265.
- Brunner AL, Johnson DS, Kim SW, Valouev A, Reddy TE, Neff NF, Anton E, Medina C, Nguyen L, Chiao E, et al.: Distinct DNA methylation patterns characterize differentiated human embryonic stem cells and developing human fetal liver. Genome Res 2009, 19(6):1044–1056. 10.1101/gr.088773.108
- Portela A, Esteller M: Epigenetic modifications and human disease. Nat Biotechnol 2010, 28(10):1057–1068. 10.1038/nbt.1685
- Maunakea AK, Nagarajan RP, Bilenky M, Ballinger TJ, D'Souza C, Fouse SD, Johnson BE, Hong C, Nielsen C, Zhao Y, et al.: Conserved role of intragenic DNA methylation in regulating alternative promoters. Nature 2010, 466(7303):253–257. 10.1038/nature09165
- Heintzman ND, Stuart RK, Hon G, Fu YT, Ching CW, Hawkins RD, Barrera LO, Van Calcar S, Qu CX, Ching KA, et al.: Distinct and predictive chromatin signatures of transcriptional promoters and enhancers in the human genome. Nat Genet 2007, 39(3):311–318. 10.1038/ng1966
- Roh TY, Cuddapah S, Zhao K: Active chromatin domains are defined by acetylation islands revealed by genome-wide mapping. Genes Dev 2005, 19(5):542–552. 10.1101/gad.1272505
- Guenther MG, Levine SS, Boyer LA, Jaenisch R, Young RA: A chromatin landmark and transcription initiation at most promoters in human cells. Cell 2007, 130(1):77–88. 10.1016/j.cell.2007.05.042
- Core LJ, Waterfall JJ, Lis JT: Nascent RNA sequencing reveals widespread pausing and divergent initiation at human promoters. Science 2008, 322(5909):1845–1848. 10.1126/science.1162228
- Core LJ, Lis JT: Transcription regulation through promoter-proximal pausing of RNA polymerase II. Science 2008, 319(5871):1791–1792. 10.1126/science.1150843
- Seila AC, Calabrese JM, Levine SS, Yeo GW, Rahl PB, Flynn RA, Young RA, Sharp PA: Divergent transcription from active promoters. Science 2008, 322(5909):1849–1851. 10.1126/science.1162253
- Bartel DP: MicroRNAs: genomics, biogenesis, mechanism, and function. Cell 2004, 116(2):281–297. 10.1016/S0092-8674(04)00045-5
- Bartel DP: MicroRNAs: target recognition and regulatory functions. Cell 2009, 136(2):215–233. 10.1016/j.cell.2009.01.002
- Cai X, Hagedorn CH, Cullen BR: Human microRNAs are processed from capped, polyadenylated transcripts that can also function as mRNAs. RNA 2004, 10(12):1957–1966. 10.1261/rna.7135204
- Lee Y, Kim M, Han J, Yeom KH, Lee S, Baek SH, Kim VN: MicroRNA genes are transcribed by RNA polymerase II. EMBO J 2004, 23(20):4051–4060. 10.1038/sj.emboj.7600385
- Borchert GM, Lanier W, Davidson BL: RNA polymerase III transcribes human microRNAs. Nat Struct Mol Biol 2006, 13(12):1097–1101. 10.1038/nsmb1167
- Landgraf P, Rusu M, Sheridan R, Sewer A, Iovino N, Aravin A, Pfeffer S, Rice A, Kamphorst AO, Landthaler M, et al.: A mammalian microRNA expression atlas based on small RNA library sequencing. Cell 2007, 129(7):1401–1414. 10.1016/j.cell.2007.04.040
- Xuan Z, Zhao F, Wang J, Chen G, Zhang MQ: Genome-wide promoter extraction and analysis in human, mouse, and rat. Genome Biol 2005, 6(8):R72. 10.1186/gb-2005-6-8-r72
- Hastie T, Tibshirani R, Friedman J: The elements of statistical learning: Data mining, inference, and prediction. New York: Springer-Verlag; 2000.
- Breiman L, Friedman J, Olshen R, Stone C: Classification and regression trees. Belmont, CA: Wadsworth International Group; 1984.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.