Systematic classification of non-coding RNAs by epigenomic similarity
© Dozmorov et al.; licensee BioMed Central Ltd. 2013
Published: 9 October 2013
Even though only 1.5% of the human genome is translated into proteins, recent reports indicate that most of it is transcribed into non-coding RNAs (ncRNAs), which are becoming the subject of increased scientific interest. We hypothesized that examining how different classes of ncRNAs co-localized with annotated epigenomic elements could help understand the functions, regulatory mechanisms, and relationships among ncRNA families.
We examined 15 different ncRNA classes for statistically significant genomic co-localizations with cell type-specific chromatin segmentation states, transcription factor binding sites (TFBSs), and histone modification marks using GenomeRunner (http://www.genomerunner.org). P-values were obtained using a Chi-square test and corrected for multiple testing using the Benjamini-Hochberg procedure. We clustered and visualized the ncRNA classes by the strength of their statistical enrichments and depletions.
We found piwi-interacting RNAs (piRNAs) to be depleted in regions containing activating histone modification marks, such as H3K4 mono-, di- and trimethylation, H3K27 acetylation, as well as certain TFBSs. piRNAs were further depleted in active promoters, weak transcription, and transcription elongation regions, and enriched in repressed and heterochromatic regions. Conversely, transfer RNAs (tRNAs) were depleted in heterochromatin regions and strongly enriched in regions containing activating H3K4 di- and trimethylation marks, H2az histone variant, and a variety of TFBSs. Interestingly, regions containing CTCF insulator protein binding sites were associated with tRNAs. tRNAs were also enriched in the active, weak and poised promoters and, surprisingly, in regions with repetitive/copy number variations.
Searching for statistically significant associations between ncRNA classes and epigenomic elements permits detection of potential functional and/or regulatory relationships among ncRNA classes, and suggests cell type-specific biological roles of ncRNAs.
The advent of novel high-throughput technologies such as next-generation sequencing (NGS) has empowered the gathering of transcriptional data of unprecedented quantity and quality. It has lead to the discovery that ~75% of the human genome is transcribed to RNA at some point in certain cell types [1, 2], with other estimates being as high as 90% . Only ~1-2% of these transcripts are translated into protein form [4, 5], while the remainder are classified as non-coding RNAs (ncRNAs) [6, 7]. The fact that ncRNAs are relatively abundant, expressed in a developmentally regulated fashion [8, 9], and often exhibit precise sub-cellular localization  supports the notion that they play important biological roles. In particular, their ability to base-pair with other transcripts and regions suggests they may be responsible for a variety of regulatory functions .
Attempts to untangle the complex landscape of ncRNAs have led to crude classification of ncRNAs based on their length (small, 18-31nt; medium, 31-200nt; and long, >200nt) , function (housekeeping ncRNAs such as ribosomal (rRNAs), transfer RNAs (tRNAs)), regulatory potential (microRNAs (miRNAs), long non-coding RNAs (lncRNAs)) , and subcellular localization (small nuclear RNAs (snRNAs), small nucleolar RNAs (snoRNAs), cytoplasm-located piwi-interacting RNAs (piRNAs), and short interfering RNAs (siRNAs)). Other unusual ncRNA species such as trans-spliced transcripts, macroRNAs that encompass enormous genomic distances, and multi-gene transcripts that encompass several genes or even the whole chromosome further confound efforts for systematic classification [13–15]. In reality, however, clear categorization of ncRNA classes has been quite difficult, as many ncRNA transcripts often share the properties of multiple categories.
ncRNAs can also be broadly categorized by genomic properties such as 1) sense or antisense transcripts, when the ncRNA transcript co-localizes with exons of another transcript on the same or the opposite strand, respectively, 2) bidirectional, when the expression of the ncRNA and of a neighboring coding transcript in the opposite strand are initiated in close proximity, 3) intronic, when the ncRNA arises from an intron of another transcript, and 4) intergenic, when the ncRNA is localized between two coding transcripts [10, 12, 16]. Other ncRNA classes, such as promoter-associated RNAs (PARs) and enhancer RNAs (eRNAs) are also being characterized [17, 18]. These considerations alone suggest the role of genomic organization in the biogenesis of ncRNAs is more complex than previously thought .
Recent years have seen a rapid growth of publicly available data on genome organization and functional annotation. The Encyclopedia of DNA Elements (ENCODE) project has been actively cataloging functional elements in the human genome, such as cell type-specific histone modification profiles, chromatin states, and transcription factor binding sites . In this study we will refer to these functional and regulatory regions as epigenomic elements, i.e., genomic data other than nucleotide sequence that describes functions, properties, or experimental values associated with genome regions [19–23]. Although the precise definition of the word "epigenomics" is hotly debated  and is often narrower in scope, we feel our broader definition of epigenomic elements is suitable to convey the concepts of our research to the readers. The aim of the current study was to use the ENCODE data, such as cell type-specific histone modification profiles, chromatin states, and experimentally validated transcription factor binding sites to examine how epigenomic elements statistically significantly associated with different classes of ncRNAs may reveal functional roles and relationships among ncRNA classes.
We investigated whether the genomic coordinates of 15 ncRNA classes significantly overlapped with or were depleted in a total of 420 annotated epigenomic elements. We found that piRNA and tRNA ncRNAs were strongly under- and over-represented with the majority of epigenomic elements examined, respectively. Other ncRNA classes showed less extreme enrichments, but we were able to confirm known and identify new statistically significant ncRNA class-specific epigenomic associations. Clustering ncRNA classes by the significance of their epigenomic associations captures their known hierarchy, providing a means to utilize epigenomic background for classification purposes. In summary, our study demonstrates a means to use genome annotation data to identify regulatory commonalities and differences among genomic and epigenomic elements.
The ncRNA classes used in the current study
Part of snRNP complex; involved in control of transcription elongation.
Localized to nucleolus, involved in methylation of rRNA.
Localized to nucleolus, involved in pseudouridylation of rRNA.
Long non-coding RNA
A non protein-coding transcript with > 200 nt.
A short, hairpin-shaped RNA that usually suppresses translation of target mRNA by binding to 3' UTR.
A large class of 26-31 nt RNA suggested being involved in transposon silencing.
A transcript containing one or more microRNAs before processing by Drosha complex.
RNA which is directly incorporated into the ribosome (distinct from mRNA encoding ribosomal proteins).
small Cajal body-specific RNA
Located in the nucleolus, and involved in methylation and pseudouridylation of spliceosomal RNAs.
small nucleolar RNA
RNA located in the nucleolus, mostly involved in modification of other RNAs, such as rRNA or spliceosomal RNA.
small nuclear RNA
Small RNA located in the nucleus, involved in the spliceosome, RNA modification, or other functions.
RNA forming part of the spliceosome complex.
signal recognition particle RNA
RNA forming part of the SRP complex. The SRP complex targets proteins to their proper subcellular localization by recognizing the signal recognition peptide tag on proteins.
Transfer amino acids to the ribosome for protein construction.
Components of the Ro RNP complex; may repress Ro activity. As an independent function, is also required for DNA replication.
Functional/Regulatory elements data
As part of our interest in automating the search for biologically meaningful correlations among high-throughput and high-information data [25–27], we developed GenomeRunner (http://sourceforge.net/projects/genomerunner), a software program that searches for statistically significant co-localization between a set of genomic regions of interest (ncRNA classes in this paper) and sets of annotated genomic features (epigenomic elements, e.g., genomic regions in the GM12878 cell line where H3K4me1 histone modification sites were found) . We tested each different class of ncRNAs to see if they were statistically significantly associated with or depleted in three groups of epigenomic elements: Chromatin State Segmentation by HMM from ENCODE/Broad, Histone Modifications by ChIP-seq from ENCODE/Broad Institute, and Experimentally validated Transcription Factor ChIP-seq from ENCODE (Additional files 1, 2). A total of 420 epigenomic elements were obtained from the UCSC genome database , and stored in a local MySQL database accessible for the GenomeRunner community.
Briefly, sets of ncRNAs were tested for genomic overlap of at least 1 nucleotide with 420 epigenomic elements. We first calculated the total number of members from an ncRNA class that overlapped with an epigenomic element. If a given ncRNA overlaps with >1 epigenomic element of the same type, it is only counted once, to reflect the fact of overlap. In other words, overlap counting is ncRNA-centric, e.g., if a tRNA overlaps with 3 NFKB binding sites, only one is added to the tRNA-NFKB overlap counter.
We then performed random sampling from the total pool of all 15 classes of ncRNAs, selecting the same number of random ncRNAs as in the class being analyzed. The use of all ncRNAs as a background enables us to search, specifically, for associations that might distinguish one ncRNA class from the rest. This background selection also restricts random sampling to avoid low complexity genomic regions, such as transposable elements, duplications and inversions, repeats, comprising up to 47% of the human genome . We selected a random number of ncRNA elements according to the size of each ncRNA class and performed 1000 such random samplings, estimating the average number and variance of random overlapping. A Chi-square test was used to determine whether there was a significant difference between the total number of co-localization for the ncRNA class as compared with what could be expected by random chance.
Transformation of p-values
To emphasize epigenomic elements most significantly associated with an ncRNA class, p-values were adjusted for multiple testing using the Benjamini-Hochberg procedure . For easier visualization and comparison, we converted p-values into decimal scale by -log10-transformation. A "-"sign was added if a p-value represents an underrepresented association. This allows representation of significant associations in an intuitive format - larger numbers equal more statistically significant overrepresented associations of a ncRNA class with an epigenomic element, while smaller negative numbers represent more statistically significant under-represented associations.
Clustering and visualization
The transformed p-values can be directly visualized with a blue-yellow gradient representing under- and over-represented associations, respectively. A n × m matrix of transformed p-values, where n is the number of ncRNA classes and m is the number of epigenomic elements, was assembled (Additional file 3). Epigenomic elements showing no statistically significant associations (p-value cutoff 0.01, unless otherwise specified) with at least one ncRNA class were removed. Further, epigenomic elements showing consistent associations across the ncRNA classes (standard deviation of the transformed p-value distribution is less than 2) were also filtered. These filtering steps simplify visualization and allow a reader to focus on the most significant epigenomic associations differentially enriched among ncRNA classes.
Hierarchical clustering was performed using "maximum" distance to measure dissimilarity between rows and columns, and the "ward" agglomeration method . Clustering and visualization were performed within R computing environment .
Pair-wise correlation analysis
We compared epigenomic similarities and differences among ncRNA classes. That is, pair-wise correlations between all ncRNA class-specific transformed p-values were measured using Pearson's correlation coefficient. We expect ncRNA classes showing similar enrichment patterns with epigenomic elements to correlate with positive Pearson's correlation coefficient close to 1, while ncRNA classes differentially associated with epigenomic elements would show negative Pearson's correlation coefficient. Combining these Pearson's correlation coefficients into n × n matrix, where n is the number of ncRNA classes, allows clustering and visualization of epigenomic similarities among ncRNA classes as describe above, using blue/yellow gradient to highlight negative/positive correlations, respectively.
There was a positive correlation between the size of ncRNA classes and the standard deviation of the p-value (Pearson's correlation coefficient = 0.25), although it was not significant (p = 0.487), likely due to the extreme p-value variability in tRNA and piRNA classes. Removing those classes increased Pearson's correlation coefficient to 0.58 (p = 0.134). Notable, that both classes did not show extreme size or length abnormalities, being comparable in size with snoRNA and miRNA classes (Table 1, Figure 2B), suggesting strong epigenomic associations driving p-value variability.
Pair-wise correlation analysis of epigenomic associations reveal known and novel relationships among ncRNA classes
Expectedly, snoRNA, C/D-box and H/ACA-box ncRNA classes shared similar epigenomic associations, being subgroups of a larger snRNA class (Figure 1). Yet scaRNAs, a subgroup of snoRNAs, did not share similar significant associations. This may be, in part, due to size differences in the groups (see Table 1). Along with tRNA class, scaRNAs had epigenomic associations distinct from other ncRNA classes. Spliceosomal, rRNA, 7SK, piRNA, SRP-RNA and Y-RNA classes formed the largest group with similar epigenomic associations.
Pearson's correlation coefficient for pairs of ncRNA classes best correlated and anticorrelated by their epigenomic associations
best correlates with
at Pearson's correlation coefficient
and best anticorrelates with
at Pearson's correlation coefficient
piRNAs are depleted, while tRNAs are enriched in histone modification marks
Conversely, piRNAs were associated with heterochromatin regions in the genome and underrepresented in active promoter regions (Figure 5A, Additional files 4, 5). piRNAs were depleted in the same types of histone marks the tRNAs were enriched in. That is, piRNAs were depleted in activation-related H3K4 mono-, di- and trimethylation marks, H3K9ac, H3k27ac and H3k36me3 marks, and the H2az histone variant. Yet, the association of piRNAs with transcriptionally silent regions of the genome is blurred by their depletion in H4K20me1, a marker of silent chromatin, and in H3k27me3 marker of gene repression.
Heatmap visualization helps summarize over- and underrepresentation trends for other ncRNA classes (Figure 5B, Additional files 4, 6). The H3K4 di- and trimethylation marks were the most frequently differentially enriched, followed by H3K9ac. We did not observe bias in the cell type-specific origin of these histone marks, although K562, NHEK, H1hESC, Hela-S3 and HepG2 specific marks dominated the heatmap. The driving force behind the clustering appeared to be the histone marks themselves. This can be seen from groups of H3K79me2, H3K4me2 and H3K4me3 showing similar enrichment patterns, thus clustered together, on the heatmap.
Interestingly, even though C/D box and H/ACA box RNAs are subclasses of snoRNAs, the H/ACA box subclass was not enriched for most of the associations that either snoRNAs or C/D box RNAs were, in general. This is in contrast to the general similarity of associated features shown in Figure 5B and indicates that one of the factors that may separate the two snoRNA subclasses is their regulatory locations. The only mark H/ACA box and C/D boxes seemed to both be enriched in was H3K79me2 which has been associated with the cell cycle. This suggests that H/ACA and C/D box RNAs may be orchestrated similarly by histone modification marks during the cell cycle, but not beyond.
piRNAs are enriched in the repressed chromatin states, tRNAs are associated with transcriptionally active regions
The Chromatin State Segmentation by HMM from ENCODE/Broad covers a spectrum of chromatin states, ranging from active promoters and strong enhancers to heterochromatin and repetitive/copy number variation regions. Expectedly, we found nearly symmetrical distribution of enrichment/depletion of ncRNA associations in opposing chromatin states, reflecting their presence in either active or repressed regions and, conversely, absence in the region with opposite regulatory role. Again, piRNA and tRNA classes showed greater range of p-values (Figure 4B).
piRNAs were strongly enriched in the repressed chromatin (evidence from 9 cell lines), heterochromatin (5 cell lines) and insulator (3 cell lines) regions, as well as in the repetitive/CNV regions in the H1hESC cell line. Conversely, they were depleted in the active, weak and poised promoters (9, 8 and 4 cell lines, respectively), transcription elongation and transition regions (9 and 7 cell lines, respectively), strong and weak enhancers (11 cell lines) (Figure 5C, Additional files 4, 5). Although these findings suggest piRNAs are generally located at the inactive regions of the genome, cell type-specific chromatin states also showed an exception from these observations. For example, piRNAs were depleted in the heterochromatin regions in only the K562 cell line.
tRNAs showed greater diversity in cell type-specific enriched and depleted associations with chromatin states. They were enriched in the active, weak, and poised promoters (9, 9, and 6 cell lines, respectively), and depleted in heterochromatin regions in 6 cell lines. Although this suggests active transcription processes associated with tRNAs, they were depleted in transcription elongation, transition, weak transcription (9, 8, and 6 cell lines, respectively), and in weak enhancers in the HMEC and HUVEC cell lines. tRNAs were also strongly overrepresented in the repetitive/CNV regions in 9 cell lines (Figure 5C, Additional files 4, 5).
Other ncRNA classes showed similar enrichment/depletion patterns in the antagonistic chromatin state regions. Similar to the histone marks, chromatin states were the major driving force behind clustering. For example, pri-miRNA, miRNA, and lncRNA classes were depleted in heterochromatin regions in the majority of the cell lines, while rRNA, 7SK and spliceosomal classes were enriched in these regions (Figure 5D, Additional files 4, 6).
The snoRNA class, and its members C/D box, H/ACA box and scaRNAs were enriched in the "transcription elongation" and "transcription transition" regions in H1hESC cells, and depleted in the "repressed" regions in NHEK and HepG2 cells. These classes were also enriched in the "strong enhancers" and depleted in "weak transcription" chromatin regions in the K562 cell line. The Y-RNA and SRP-RNA classes, on the contrary, were enriched in weak transcription regions in K562 cells (Figure 5D, Additional files 4, 6).
piRNAs are depleted in, while tRNAs are strongly enriched in, transcription factor binding sites
tRNAs showed strong enrichment in their tendencies to co-localize with 70 TFBSs (p < 0.01, adjusted for multiple testing) (Figure 4C, 5E, Additional files 4, 5). Not unexpectedly, the strongest association was RPC155 (p.adj = 4.43E-89), a catalytic core and the largest (155kDa) component of RNA polymerase III (Pol III), which synthesizes tRNAs as well as small RNAs, such as 5S rRNA. Other Pol III components were also enriched, such as BDP1 (B double prime 1, subunit of Pol III transcription initiation factor IIIB, p.adj = 2.85E-72), TFIIIC-110 (a 110kDa subunit of the Pol III that initiates assembly of the transcription complex on tRNA, p.adj = 9.27E-70), BRF1 (B-related factor 1, a subunit of the Pol III playing in transcription initiation on genes encoding tRNAs, 5S rRNAs, and other small structural RNAs, p.adj = 2.12E-48), Pol3 (aka RPC32, involved either in the recruitment and stabilization of the subcomplex within Pol III, or in stimulating catalytic function of other subunits during initiation, p.adj = 5.28E-34). TBP, TATA-binding protein, a subunit of the transcription factor TFIID (p.adj = 8.19E-80) has also been shown to participate in the initiation complex assembly on tRNA . A general co-activator protein p300 (aka EP300 of E1A binding protein p300, p.adj = 2.01E-65) was also strongly associated with tRNAs.
Pol III can sense non-self dsDNA that serves as template for transcription into dsRNA. Such non-self Pol III transcripts induce type I interferon and NF-kB through RIG-I pathway [37, 38]. We observed strong enrichment of immune-related transcription factor binding sites associated with tRNAs. Notably, ETS1, a transcription factor involved in hematopoietic cell differentiation and the development of lymphoid tissues, was the most strongly enriched (p.adj = 8.72E-38), accompanied by ELF1 (aka E74-like factor 1, ets domain transcription factor, p.adj = 5.71E-30). Other well known immune-related transcription factors include NFkB, IRF1, IRF4, and STAT1 (Additional file 5).
piRNAs showed completely the opposite picture (Figure 4C, 5E), being statistically significantly depleted in 59 TFBSs at p-value adjusted for multiple testing <0.01 (Additional files 4, 5). piRNAs were strongly depleted in Pol2 binding sites assessed by four different antibodies (Pol2, p.adj = 1.50E-22; Pol2-4H8, p.adj = 2.55E-18; Pol2(b), p.adj = 5.67E-09; Pol2(phosphoS2), p.adj = 1.03E-06). We also observed the opposite behavior of the Pol III-related proteins, with TBP (p.adj = 5.94E-16), RCP155 (p.adj = 1.41E-12), BDP1 (p.adj = 7.74E-10), TFIIIC-110 (p.adj = 1.38E-08), Pol3 (p.adj = 1.67E-04), and p300 (p.adj = 5.30E-13) binding sites.
The aforementioned RNA polymerase II- and III-related TFBSs were also statistically significantly associated with other ncRNA classes (Figure 5F, Additional files 4, 6). A general trend was that an ncRNA class shows opposite associations with RNA polymerases II and III, being generally associated with Pol II sites and depleted in Pol III-related sites. miRNAs were found enriched in Pol II sites, which corroborates earlier reports .
We comprehensively investigated statistically significant genomic associations between a total of 15 ncRNA classes versus 420 cell type-specific epigenomic elements. Although the ENCODE project provides cell type-specific information about epigenomic elements, we generally did not observe preferential cell type specificity of the enrichments. Epigenomic elements associated with different ncRNA classes were generally similarly enriched across multiple cell lines and were the driving force behind clustering patterns confirming that epigenomic relationships hold across multiple cell lines . This can be illustrated by similar patterns of enrichment among spliceosomal, rRNA and 7SK ncRNAs in the heterochromatin regions in multiple cell lines (Figure 5D), association of C/D box and snoRNAs with H3K79me2 mark across several cell lines (Figure 5B), and by similar observations in other ncRNA classes (Additional files 4, 5, 6). There were exceptions from this general trend; for example, piRNAs were strongly enriched in the heterochromatin regions in all but the K562 cell line, which instead showed statistically significant underrepresentation. Y-RNA and SRP-RNA classes were strongly associated with weak transcription regions exclusively in the K562 cell line. Therefore, we kept the cell type-specific results of our analysis (Additional files 1, 2) while describing our results from the epigenomic point of view, mentioning cell line specificity only where appropriate.
Our study revealed both expected (positive controls) and novel relationships among ncRNA classes and epigenomic elements. For example, C/D box, H/ACAbox, and snoRNA were grouped by their common epigenomic background, being subsets of a larger snRNA class. Yet, the relationship between long non-coding RNAs and miRNAs is interesting, as they are located in different genomic regions but shared similar epigenomic background. Both lncRNA and miRNA classes were depleted in the heterochromatin regions in multiple cell lines, were located in the poised promoter regions in the H1hESC cell line, and showed similar enrichment in several histone modification marks (Additional files 4, 6). Yet they were differentially associated with active promoters, with only lncRNAs being associated with these regions in the HSMM and HUVEC cell lines. The lncRNA class was also enriched in the repressed regions of the genome in GM12878 and NHEK cell lines, which is consistent with an observation that >20% of lncRNAs are bound by the Polycomb Repressive Complex 2 (PRC2) , with miRNAs showing no measurable enrichment. miRNAs, on the contrary, were enriched in the transcription elongation regions in HMEC, HUVEC, and NHLF cell lines. miRNAs were depleted in the Pol III transcription factor binding regions, and enriched only in the Pol2 binding sites, while lncRNAs did not show any associations with any of these regions. Instead, lncRNAs were enriched in SUZ12 binding sites, consistent with an observation that PRC2-transcribed smaller ~50-200 nt RNAs that interact with SUZ12 to mediate gene repression .
We observed strong statistically significant associations of the tRNA and piRNA classes of non-coding RNAs with epigenomic elements. This can't be attributed solely to the size of these classes, since ncRNA classes having more members did not show such enrichment diversity. The mean length of tRNAs and piRNAs were among the smallest among all ncRNA classes (Table 1), thus diminishing the probability of overlapping with epigenomic elements on a per-transcript basis. Yet, the significance of the identified associations suggests strong epigenomic regulation of tRNA and piRNA. We expect our analyses to produce more focused results as the ncRNA classes will become better defined with more members showing clearer enrichment patterns. This is especially important for small ncRNA classes, like scaRNAs, which did not have sufficient numbers to demonstrate statistically significant enrichments. Advancements of the ENCODE project with more epigenomic data being available will further aid analysis of cell type-specific associations.
Our enrichment analysis has limitations. We define ncRNA classes by the genomic coordinates of ncRNA elements, and test these locations for the enriched co-localizations with epigenomic elements. Thus, our enrichment analysis of transcription factor binding sites does not consider the promoter regions of the ncRNAs or, if they don't have promoters, then their local cis-regulatory genomic environment. Yet, the transcription of ncRNAs can be different from that of genes, with ncRNAs often originating from exons and introns [16, 43], promoters of the genes , or intergenic regions by mechanisms apparently different from canonical transcription factor binding . Transcription factors, on the other hand, show a great variety of binding patterns, being located within gene bodies, overlapping first exons, and occurring in the intergenic regions . Our analysis revealed TF binding preference (or the absence of it) within or partially overlapping ncRNA bodies without making assumptions about the mechanism of ncRNA biogenesis. While the full potential of transcription factor co-localization with ncRNA classes remains to be tested experimentally, we are confident our findings of differential transcription factor co-localization with ncRNAs will reveal novel relationships among non-coding and coding parts of the genome.
In the current study we did not consider the amount of overlap of ncRNAs with epigenomic elements. While this is not critical for the analysis of, for example, single nucleotide polymorphisms with length equal to 1 nt, ncRNAs may show a variety of co-localization patterns, ranging from being located within an epigenomic element to overlapping it completely or partially. We restricted our analysis by considering any overlap as co-localization for computational efficiency and ease of interpretation. However, our future work will include comparison of the percent of overlap observed for an ncRNA class vs. that of for a random selection. This "overlap enrichment" analysis may provide better resolution of spatial relationships of ncRNA classes with epigenomic elements.
We present the use of our method for finding statistically significant relationships between experimental regions of interest (ncRNA classes in the current study) with cell type-specific epigenomic data from the ENCODE project (such as transcription factor binding sites, histone modification marks, chromatin segmentation states). We demonstrate the utility of statistically significant epigenomic associations to classify and contrast ncRNA classes, and to outline their potential functional roles and mechanisms of biogenesis from epigenomic perspective. Our method, implemented as the open source software GenomeRunner (http://sourceforge.net/projects/genomerunner/)  can provide interpretation of any experimental genome-wide data within the growing amount of epigenomic data from the ENCODE project .
GenomeRunner, the main method used in this work, is available at http://sourceforge.net/projects/genomerunner
Publication of this article is funded in part by the National Institutes of Health (NIH, grants # 1P20GM103636 and 8P20GM103456 to JDW), the National Institute of Arthritis and Musculoskeletal and Skin Diseases (a subaward from grant # P30 AR053483 to MGD), and an Institutional Development Award (IDeA) from the National Institute of General Medical Sciences (a subaward from grant # P30GM103510 to MGD). The content of this article is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health.
This article has been published as part of BMC Bioinformatics Volume 14 Supplement 14, 2013: Proceedings of the Tenth Annual MCBIOS Conference. Discovery in a sea of data. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/14/S14.
- GEO ftp data archive. [ftp://ftp.ncbi.nih.gov/pub/geo/DATA/SOFT/GDS/]
- Djebali S, Davis CA, Merkel A, Dobin A, Lassmann T, Mortazavi A, Tanzer A, Lagarde J, Lin W, Schlesinger F: Landscape of transcription in human cells. Nature. 2012, 489 (7414): 101-108. 10.1038/nature11233.PubMed CentralView ArticlePubMedGoogle Scholar
- ENCODE Project Consortium: The ENCODE (ENCyclopedia Of DNA Elements) Project. Science. 2004, 306 (5696): 636-640.View ArticleGoogle Scholar
- ENCODE Project Consortium, Birney E, Stamatoyannopoulos JA, Dutta A, Guigo R, Gingeras TR, Margulies EH, Weng Z, Snyder M, Dermitzakis ET, Thurman RE: Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature. 2007, 447 (7146): 799-816. 10.1038/nature05874.View ArticleGoogle Scholar
- Mattick JS: The genetic signatures of noncoding RNAs. PLoS Genet. 2009, 5 (4): e1000459-10.1371/journal.pgen.1000459.PubMed CentralView ArticlePubMedGoogle Scholar
- Mattick JS: The functional genomics of noncoding RNA. Science. 2005, 309 (5740): 1527-1528. 10.1126/science.1117806.View ArticlePubMedGoogle Scholar
- Mattick JS, Makunin IV: Non-coding RNA. Hum Mol Genet. 2006, 15 (Spec No 1): R17-29.View ArticlePubMedGoogle Scholar
- Amaral PP, Mattick JS: Noncoding RNA in development. Mamm Genome. 2008, 19 (78): 454-492.View ArticlePubMedGoogle Scholar
- Dinger ME, Amaral PP, Mercer TR, Pang KC, Bruce SJ, Gardiner BB, Askarian-Amiri ME, Ru K, Solda G, Simons C: Long noncoding RNAs in mouse embryonic stem cell pluripotency and differentiation. Genome Res. 2008, 18 (9): 1433-1445. 10.1101/gr.078378.108.PubMed CentralView ArticlePubMedGoogle Scholar
- Mercer TR, Dinger ME, Mattick JS: Long non-coding RNAs: insights into functions. Nat Rev Genet. 2009, 10 (3): 155-159. 10.1038/nrg2521.View ArticlePubMedGoogle Scholar
- Nagano T, Fraser P: No-nonsense functions for long noncoding RNAs. Cell. 2011, 145 (2): 178-181. 10.1016/j.cell.2011.03.014.View ArticlePubMedGoogle Scholar
- Ponting CP, Oliver PL, Reik W: Evolution and functions of long noncoding RNAs. Cell. 2009, 136 (4): 629-641. 10.1016/j.cell.2009.02.006.View ArticlePubMedGoogle Scholar
- Denoeud F, Kapranov P, Ucla C, Frankish A, Castelo R, Drenkow J, Lagarde J, Alioto T, Manzano C, Chrast J: Prominent use of distal 5' transcription start sites and discovery of a large number of additional exons in ENCODE regions. Genome Res. 2007, 17 (6): 746-759. 10.1101/gr.5660607.PubMed CentralView ArticlePubMedGoogle Scholar
- Horiuchi T, Aigaki T: Alternative trans-splicing: a novel mode of pre-mRNA processing. Biol Cell. 2006, 98 (2): 135-140. 10.1042/BC20050002.View ArticlePubMedGoogle Scholar
- Lipovich L, Johnson R, Lin CY: MacroRNA underdogs in a microRNA world: evolutionary, regulatory, and biomedical significance of mammalian long non-protein-coding RNA. Biochim Biophys Acta. 2010, 1799 (9): 597-615. 10.1016/j.bbagrm.2010.10.001.View ArticlePubMedGoogle Scholar
- St Laurent G, Shtokalo D, Tackett MR, Yang Z, Eremina T, Wahlestedt C, Urcuqui-Inchima S, Seilheimer B, McCaffrey TA, Kapranov P: Intronic RNAs constitute the major fraction of the non-coding RNA in mammalian cells. BMC Genomics. 2012, 13: 504-10.1186/1471-2164-13-504.PubMed CentralView ArticlePubMedGoogle Scholar
- De Santa F, Barozzi I, Mietton F, Ghisletti S, Polletti S, Tusi BK, Muller H, Ragoussis J, Wei CL, Natoli G: A large fraction of extragenic RNA pol II transcription sites overlap enhancers. PLoS Biol. 2010, 8 (5): e1000384-10.1371/journal.pbio.1000384.PubMed CentralView ArticlePubMedGoogle Scholar
- Kurokawa R: Promoter-associated long noncoding RNAs repress transcription through a RNA binding protein TLS. Adv Exp Med Biol. 2011, 722: 196-208. 10.1007/978-1-4614-0332-6_12.View ArticlePubMedGoogle Scholar
- Berger SL, Kouzarides T, Shiekhattar R, Shilatifard A: An operational definition of epigenetics. Genes Dev. 2009, 23 (7): 781-783. 10.1101/gad.1787609.PubMed CentralView ArticlePubMedGoogle Scholar
- Hardison RC: Genome-wide epigenetic data facilitate understanding of disease susceptibility association studies. J Biol Chem. 2012, 287 (37): 30932-30940. 10.1074/jbc.R112.352427.PubMed CentralView ArticlePubMedGoogle Scholar
- Bird A: Perceptions of epigenetics. Nature. 2007, 447 (7143): 396-398. 10.1038/nature05913.View ArticlePubMedGoogle Scholar
- Leung A, Schones DE, Natarajan R: Using epigenetic mechanisms to understand the impact of common disease causing alleles. Curr Opin Immunol. 2012, 24 (5): 558-563. 10.1016/j.coi.2012.07.004.PubMed CentralView ArticlePubMedGoogle Scholar
- Inbar-Feigenberg M, Choufani S, Butcher DT, Roifman M, Weksberg R: Basic concepts of epigenetics. Fertil Steril. 2013, 99 (3): 607-15. 10.1016/j.fertnstert.2013.01.117.View ArticlePubMedGoogle Scholar
- Ptashne M: Epigenetics: Core misconcept. Proc Natl Acad Sci USA. 2013, 110 (18): 7101-7103. 10.1073/pnas.1305399110.PubMed CentralView ArticlePubMedGoogle Scholar
- Wren JD, Johnson D, Gruenwald L: Automating genomic data mining via a sequence-based matrix format and associative rule set. BMC Bioinformatics. 2005, 6 (Suppl 2): S2-10.1186/1471-2105-6-S2-S2.PubMed CentralView ArticlePubMedGoogle Scholar
- Dozmorov MG, Wren JD: High-throughput processing and normalization of one-color microarrays for transcriptional meta-analyses. BMC Bioinformatics. 2011, 12 (Suppl 10): S2-10.1186/1471-2105-12-S10-S2.PubMed CentralView ArticlePubMedGoogle Scholar
- Dozmorov MG, Giles CB, Wren JD: Predicting gene ontology from a global meta-analysis of 1-color microarray experiments. BMC Bioinformatics. 2011, 12 (Suppl 10): S14-10.1186/1471-2105-12-S10-S14.PubMed CentralView ArticlePubMedGoogle Scholar
- Dozmorov MG, Cara LR, Giles CB, Wren JD: GenomeRunner: Automating genome exploration. Bioinformatics. 2012, 28 (3): 419-20. 10.1093/bioinformatics/btr666.PubMed CentralView ArticlePubMedGoogle Scholar
- Rosenbloom KR, Sloan CA, Malladi VS, Dreszer TR, Learned K, Kirkup VM, Wong MC, Maddren M, Fang R, Heitner SG: ENCODE data in the UCSC Genome Browser: year 5 update. Nucleic Acids Res. 2013, 41 (Database): D56-63.PubMed CentralView ArticlePubMedGoogle Scholar
- Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, Devon K, Dewar K, Doyle M, FitzHugh W: Initial sequencing and analysis of the human genome. Nature. 2001, 409 (6822): 860-921. 10.1038/35057062.View ArticlePubMedGoogle Scholar
- Benjamini Y, Hochberg Y: Controlling the False Discovery Rate: a Practical and Powerful Approach to Multiple Testing. J R Stat Soc Series B. 1995, 57 (1): 289-300.Google Scholar
- Ward JH: Hierarchical Grouping to Optimize an Objective Function. J Am Stat Assoc. 1963, 58 (301): 236-244. 10.1080/01621459.1963.10500845.View ArticleGoogle Scholar
- R Core Team: R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing. 2013, [http://www.R-project.org]Google Scholar
- Ernst J, Kellis M: Discovery and characterization of chromatin states for systematic annotation of the human genome. Nat Biotechnol. 2010, 28 (8): 817-825. 10.1038/nbt.1662.PubMed CentralView ArticlePubMedGoogle Scholar
- Phillips JE, Corces VG: CTCF: master weaver of the genome. Cell. 2009, 137 (7): 1194-1211. 10.1016/j.cell.2009.06.001.PubMed CentralView ArticlePubMedGoogle Scholar
- Huet J, Sentenac A: The TATA-binding protein participates in TFIIIB assembly on tRNA genes. Nucleic Acids Res. 1992, 20 (24): 6451-6454. 10.1093/nar/20.24.6451.PubMed CentralView ArticlePubMedGoogle Scholar
- Ablasser A, Bauernfeind F, Hartmann G, Latz E, Fitzgerald KA, Hornung V: RIG-I-dependent sensing of poly(dA:dT) through the induction of an RNA polymerase III-transcribed RNA intermediate. Nat Immunol. 2009, 10 (10): 1065-1072. 10.1038/ni.1779.View ArticlePubMedGoogle Scholar
- Chiu YH, Macmillan JB, Chen ZJ: RNA polymerase III detects cytosolic DNA and induces type I interferons through the RIG-I pathway. Cell. 2009, 138 (3): 576-591. 10.1016/j.cell.2009.06.015.PubMed CentralView ArticlePubMedGoogle Scholar
- Lee Y, Kim M, Han J, Yeom KH, Lee S, Baek SH, Kim VN: MicroRNA genes are transcribed by RNA polymerase II. EMBO J. 2004, 23 (20): 4051-4060. 10.1038/sj.emboj.7600385.PubMed CentralView ArticlePubMedGoogle Scholar
- Dong X, Greven MC, Kundaje A, Djebali S, Brown JB, Cheng C, Gingeras TR, Gerstein M, Guigo R, Birney E: Modeling gene expression using chromatin features in various cellular contexts. Genome Biol. 2012, 13 (9): R53-10.1186/gb-2012-13-9-r53.PubMed CentralView ArticlePubMedGoogle Scholar
- Khalil AM, Guttman M, Huarte M, Garber M, Raj A, Rivea Morales D, Thomas K, Presser A, Bernstein BE, van Oudenaarden A: Many human large intergenic noncoding RNAs associate with chromatin-modifying complexes and affect gene expression. Proc Natl Acad Sci USA. 2009, 106 (28): 11667-11672. 10.1073/pnas.0904715106.PubMed CentralView ArticlePubMedGoogle Scholar
- Kanhere A, Viiri K, Araujo CC, Rasaiyaah J, Bouwman RD, Whyte WA, Pereira CF, Brookes E, Walker K, Bell GW: Short RNAs are transcribed from repressed polycomb target genes and interact with polycomb repressive complex-2. Mol Cell. 2010, 38 (5): 675-688. 10.1016/j.molcel.2010.03.019.PubMed CentralView ArticlePubMedGoogle Scholar
- Rodriguez A, Griffiths-Jones S, Ashurst JL, Bradley A: Identification of mammalian microRNA host genes and transcription units. Genome Res. 2004, 14 (10A): 1902-1910. 10.1101/gr.2722704.PubMed CentralView ArticlePubMedGoogle Scholar
- Wang J, Zhuang J, Iyer S, Lin X, Whitfield TW, Greven MC, Pierce BG, Dong X, Kundaje A, Cheng Y: Sequence features and chromatin structure around the genomic regions bound by 119 human transcription factors. Genome Res. 2012, 22 (9): 1798-1812. 10.1101/gr.139105.112.PubMed CentralView ArticlePubMedGoogle Scholar
- RFAM. [ftp://ftp.sanger.ac.uk/pub/databases/Rfam/CURRENT/USERMAN]
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.