Skip to main content

A signal processing approach for enriched region detection in RNA polymerase II ChIP-seq data

Abstract

Background

RNA polymerase II (PolII) is essential in gene transcription and ChIP-seq experiments have been used to study PolII binding patterns over the entire genome. However, since PolII enriched regions in the genome can be very long, existing peak finding algorithms for ChIP-seq data are not adequate for identifying such long regions.

Methods

Here we propose an enriched region detection method for ChIP-seq data to identify long enriched regions by combining a signal denoising algorithm with a false discovery rate (FDR) approach. The binned ChIP-seq data for PolII are first processed using a non-local means (NL-means) algorithm for purposes of denoising. Then, a FDR approach is developed to determine the threshold for marking enriched regions in the binned histogram.

Results

We first test our method using a public PolII ChIP-seq dataset and compare our results with published results obtained using the published algorithm HPeak. Our results show a high consistency with the published results (80-100%). Then, we apply our proposed method on PolII ChIP-seq data generated in our own study on the effects of hormone on the breast cancer cell line MCF7. The results demonstrate that our method can effectively identify long enriched regions in ChIP-seq datasets. Specifically, pertaining to MCF7 control samples we identified 5,911 segments with length of at least 4 Kbp (maximum 233,000 bp); and in MCF7 treated with E2 samples, we identified 6,200 such segments (maximum 325,000 bp).

Conclusions

We demonstrated the effectiveness of this method in studying binding patterns of PolII in cancer cells which enables further deep analysis in transcription regulation and epigenetics. Our method complements existing peak detection algorithms for ChIP-seq experiments.

Background

Chromatin immunoprecipitation combined with next generation sequencing technology (ChIP-seq) has been swiftly adopted as a standard technique for studying genome wide protein-DNA interaction patterns during the past four years. It is applied in gene regulation studies for identifying transcription factor targets and binding motifs, as well as in epigenetics research towards the characterization of chromatin states using various histone marks and RNA polymerase II (PolII) [13].

PolII plays an essential role in gene transcription. During transcription, it is responsible for the synthesis of nascent messenger RNA molecules (mRNA) for protein-coding genes and microRNAs [4]. The nascent mRNAs then go through a series of processing steps including splicing to form mature mRNAs. To transcribe a gene, PolII will undergose several steps including recruitment, initiation, elongation, and dissociation [4, 5]. In addition, PolII pausing and pre-mature dissociation will cause stalling of the transcription process [4, 5]. Thus, accurately characterization of PolII binding patterns over the entire genome is of great importance in studying the dynamics of transcription as well as contributing to the characterization of nascent mRNA, which cannot be directly inferred from gene expression microarray or regular RNA-seq technologies since they focus on mature mRNA. However, since during transcription PolII elongates along the entire gene, the PolII binding pattern over a gene is usually not just a single peak but forms elongated regions as manifest in ChIP-seq data. PolII enriched regions can stretch to several thousands of basepairs (Figure 1). Traditionally, ChIP-seq data analysis methods rely on peak region detection algorithm to delineate genomic regions with enriched protein bindings. However, the binding pattern of PolII poses a very different paradigm of computing and in turn significant challenges. Several peak detection algorithms were developed for delineating transcription factor binding sites and the anticipated regions are short (e.g., 200-1500 bp) [612] thus rendering such algorithms inadequate for studying proteins with prevalent binding over the entire genome such as PolII.

Figure 1
figure 1

Examples of PolII ChIP-seq data for MCF7 cell line. ChIP-seq data for PolII binding pattern on SEMA3C in MCF7 cell control samples. The top lane shows the histogram of the PolII binding densities over a range of genome. The gene covered by this range is shown in the bottom lane. In the bottom lane, the thick bars below the gene symbol indicate exons of the gene while the blue arrow indicates its orientation. The tail and head of the arrow correspond to the transcription starting site (TSS) and transcription ending site (TES) of the gene respectively. The same arrangements are also applied to the other figures. It is apparent that PolII not only binds to the TSS regions of the gene but also form long enriched regions over the entire transcript.

While ChIP-seq data can be considered a 1-D signal over the entire genome, only a few studies explicitly take advantage of signal denoising and detection methods developed in the engineering community. For instance, in [13], wavelet denoising technique was applied to filter the ChIP-seq data to identify nucleosome distribution patterns. For histone marks, a method called SISSR was recently developed [14], which takes a multiscale approach to analyze ChIP-seq data. This approach first identifies potential regions with enriched histone patterns and then links proximal regions which are separated by short intervals as a contiguous large region. The short intervals can be considered "noise" in the genome-wide signal that can be filtered out at coarser scales.

In this paper, we also consider a ChIP-seq dataset a noisy 1-D signal stretched over the genome and apply signal processing techniques combined with statistical analysis for identifying large enriched regions in ChIP-seq data. Our approach includes three main steps. First we directly apply a signal denoising algorithm to process histogram of ChIP-seq data. Noise in ChIP-seq data originates from multiple sources including non-specific binding to the antibody, artifacts in amplification (local PCR), and sequencing reads mapping errors. These confounding noises are usually modeled as spike noise with an underlying Poisson distribution. Recently it has been shown that the non-local means (NL-means) algorithm is highly effective in signal denoising compared to other commonly used algorithms such as median filter, low-pass filtering and wavelet denoising [15, 16]. It should be noted that our proposed method is based on the NL-means denoising algorithm. Since the NL-means algorithm is optimal for Gaussian noise, we apply the Anscombe transformation to the binned ChIP-seq [17]. Consequently, the underlying noise distribution model is now approximated by a Gaussian distribution. Subsequently, we apply the NL-means algorithm followed by inverse Anscombe transform. The denoised data is then compared against random model to determine the threshold for region selection based on a false discovery rate (FDR) approach. Finally, regions are selected based on the threshold, the region length, and the ratio between the peak value and the threshold.

To evaluate our method, we first test our method using a public PolII ChIP-seq dataset and compare the results with published results obtained using the published algorithm HPeak [18]. Then, we apply our proposed method on PolII ChIP-seq data generated in our own study on the effects of hormone on breast cancer cell line MCF7. We compare the long segments obtained when tested against both MCF7 cell line and the MCF7 cells treated with 17-estradial (E2) hormone. The results demonstrate that our method can effectively identify both long enriched regions in ChIP-seq datasets and complement existing peak finding algorithms for a variety of potential wide applications such as histone mark binding pattern study.

Methods

Data selection and pre-processing

To test our method, we first downloaded a ChIP-seq dataset for PolII on a prostate cancer cell line (LnCap) after treatment with an androgen receptor agonist R1881 (GSM353618) from NCBI Gene Expression Omnibus (GEO). The dataset was generated using Illumina Genome Analyzer II (GAII) sequencer and was analyzed using the HPeak algorithm [3]. In addition, we apply our algorithm in PolII ChIP-seq data generated in our research on breast cancer using MCF7 cell line. The accession numbers are GSM529981 (MCF7 control), and GSM529982 (for MCF7 treated with 17-estradial (E2)). For all the datasets, we downloaded the sequence mapping results (using the Eland algorithm provided by Illumina Inc.) and then generated histogram with selected bin size.

Introduction of NL-means algorithm and parameter selection

The NL-means algorithm was originally developed for image denoising [15] and was also applied to 1-D signals (including acoustic signals) [19] and video [20]. The details of the original NL-means algorithm are given in [15]. Here we briefly describe the essential formulation and the salient parameters. Basically, for each data point, its value is replaced by a weighted average of data points over the entire signal such that points with similar neighbourhoods are given higher weights. It has been shown that the NL-means algorithm is optimal for Gaussian noise. In addition, using weighted averaging for points with similar neighbourhoods is suitable for ChIP-seq experiments since the pattern of protein binding are considered similar across many regions of the genome.

Formally, given a signal with data points X={x i ,i = 1,...,N}, the filtered value at x i is defined as NL x i = j = 1 N w i , j x j , where is a difference measure between x i and a neighbouring point x j under the constraints w(i,j)≥0 and j = 1 N w ( i , j ) =1 Specifically, w i , j = e x p ( - || N x i - N x j || 2 / ( 2 σ 2 ) ) Z ( i ) , where is the normalizing factor, Z i = j exp ( - || N x i - N x j || 2 2 σ 2 ) , and N x i denote a fixed-size neighbourhood centred at the position i. In practice, searching for similar neighbourhood patterns over the entire genome is not feasible. Instead, a parameter specifying the range of search is needed. In summary, the NL-means algorithm requires three parameters, the size of neighbourhood the range of search and the weight parameter We evaluated a series of parameter combinations (Figure 2) and in this study we use R= 10(bins) L= 15, and σ= 10

Figure 2
figure 2

Denoising on PolII ChIP-seq data for MCF7 cell line using NL-means algorithm. This region is over the gene ELAC2. Different parameters for the NL-means algorithm are tested on the ChIP-seq data. The second panel uses the parameter sets used in the rest of this paper (R=10(bins),L=15, and σ=10)

Anscombe transformation

Since NL-means algorithm is optimal for Gaussian noise while the noise in ChIP-seq data is usually modelled using Poisson distribution, we apply Anscombe transform to the data first. Anscombe transform is a commonly used variance-stabilizing transformation that transforms a Poisson distribution random variable into one with that is close to a Gaussian distribution [17, 21]. The Anscombe transformation on data point is defined as follows: y=2 x + 3 / 8 and the inverse Anscombe transform is given by x=(y/2)2-3/8.

FDR based approach

To select a threshold for region selection, we designed a FDR approach based on B random simulations. Given the histogram of the ChIP-seq data with M bins and the reads over each bin as r i (i = 1, 2,..., M), we take the following steps:

1. In the b-th round of simulation (b = 1, 2,..., B), we randomly (Poisson distribution) position the same amount of sequencing reads for each chromosome as in the original data and then apply the NL-means algorithm. The reads over the i-th bin is denoted as r i b * .

2. Then for the i-th bin in the histogram, the ratio that the observed data is less than simulated data is recorded as p i = b = 1 B I ( r i < r ib * ) . I(A) takes value 1 (resp. 0) if A is true (resp. false).

3. For the b-th round of simulation, we treat it as a "true" signal and compare it with rest of simulated data. For the i-th bin, we compute p i b * = b = 1 B I r i b * < r i b * Then for a threshold p cut , the number of false peaks in this round of simulation is d b = i = 1 M I ( p i b * < p c u t )

4. The false discovery rate for the cutoff p cut is FDR p c u t = B - 1 b = 1 B d b i = 1 M I p i p c u t .

Since p cut is a function of the region height, we obtain the FDR for selected threshold on height. Then for each detected region based on the FDR approach, we also calculate the ratio between the peak value and the threshold.

Results

Comparing with HPeak for detecting PolII enriched regions in prostate cancer

We first apply our method to a published PolII ChIP-seq data in prostate cancer model (GSM353618). We compare our results with the published 15,833 PolII enriched region detected using HPeak [3, 18]. As shown in Table 1, when FDR = 0.01, our method show a high overlap with published results with overlap ratio (comparing to HPeak) of 80-100%. We then inspected many of regions that were detected using our method but were missed by HPeak. We found many of them correspond to potential transcription starting sites (TSS) for unknown genes or non-coding RNA genes. Figure 3A-B show examples of regions detected using our method using 0.01 for FDR and 6.5 for Peak-to-threshold ratio but were missed by the HPeak algorithm even though they imply the potential transcription activity for the covered genes. One of them is a snoRNA gene (U5E). Figure 3C shows examples of regions that were detected by HPeak but were missed by our method. In fact, using this setting, all regions missed by our method contain narrow peaks (covers one to two 25 bp bins). This observation suggests that the regions detected using our method can provide addition information regarding gene transcription and annotation. Further detailed annotation on these new regions is currently being carried out by integration with other information including CpG island distribution, DNase I hypersensitivity, and H3K4me2 binding (ChIP-seq).

Table 1 Comparing our method with HPeak (bin size 25 bp).
Figure 3
figure 3

Comparison of HPeak and the proposed algorithm in detecting regions on the prostate PolII ChIP-seq data (parameters of 25 bp for bin size, 0.01 for FDR and 6.5 for Peak-to-threshold ratio). The bin size is the same between the two algorithms. A: A region detected using the proposed algorithm but not by HPeak. Top: the original histogram of the ChIP-seq data. Bottom: the NL-means filtered data and the detected regions (marked by the red bar). The length of the detected region (10375 bp) is marked under the red bar. A gene (NEAT1, marked by black bar) is covered by this region. B: Another region detected using the proposed algorithm but not by HPeak. Top: the original histogram of the ChIP-seq data. Bottom: the NL-means filtered data and the detected regions (marked by the red bar, 2975 bp long). A snoRNA gene (U5E) and its pseudogene (marked by black bars) are covered by this region. C: Two regions detected by HPeak but not the proposed algorithm. The HPeak detected regions are marked using green bars (the lengths are 675 bp and 200 bp, respectively). We examined all regions that were missed by the proposed algorithm under this set of parameters, all regions contain short spike-like peaks as shown here.

Detecting large regions with PolII enrichment in breast cancer cell line MCF7

We then applied our method in the PolII ChIP-seq datasets for the breast cancer cell line MCF7 using large bin size (1,000 bp). As shown in Table 2, pertaining to MCF7 control samples we identified 5,911 segments with length of at least 4 K bp (ranging from 4,000 bp to 233,000 bp); and in MCF7 treated with E2 samples, we identified 6,200 such segments (ranging from 4,000 bp to 325,000 bp). Some examples of the long regions detected in MCF7 control samples are shown in Figure 4.

Table 2 Long enriched regions identified in PolII ChIP-seq data in MCF7 cells using bin size of 1,000 bp.
Figure 4
figure 4

Examples of long regions of PolII binding detected using our algorithm. Red bars indicate the detected regions. The segments are shown above the patterns over corresponding genes. The lengths of the segments are listed in the figure. The four genes (from top to bottom) are SEMA3C, TRIM37, SETD5, and BMP7.

Discussion

Identification of long segments for PolII binding are important for further investigation for understanding gene transcription regulation as well as potentially discovery novel transcripts and alternative promoters. For gene transcription, while PolII binding density at promoter around the TSSs was considered to determine gene transcription levels, recent studies show that the density of PolII binding on gene body is also critical [5, 22]. We also observed such phenomena using the above identified segments. For instance, as shown in Figure 5, a segment of 16,000 bp has been identified over the transcript of the gene PLK2 on human chromosome 5. The MCF7 control sample has more sequencing reads over this region than the MCF7 sample treated with E2 sample (958 vs 454 reads with similar amount of total reads in chromosome 5 between the two samples). Although, the height of the "peak" at the TSS region in the MCF7 control sample is lower than that in the MCF7 E2 treated sample, the total transcription level (measured by Affymetrix gene expression array) is still higher in MCF7 control by a factor of 3.95-fold (Student t-test p = 3.872 × 10-6).

Figure 5
figure 5

The PolII binding patterns and expression levels for PLK2 gene. Top: the PolII binding patterns for PLK2 gene in control (first lane) and E2 treated samples (second lane). PolII shows a higher peak for the E2 treated sample but lower total amount of binding over the transcript. Red bar indicates the 16 Kbp region detected using our method in MCF7 and blue bar indicates the 14 Kbp region detected in MCF7+E2. Bottom: the expression levels of PLK2 gene in the two different conditions (n = 4). Error bar is for standard deviation.

The above observations suggested that ChIP-seq technology can reveal potential new insights and principles in biology. The method presented in this paper will contribute to such discovery efforts. Nevertheless, this method can be improved in several aspects. First, currently the parameters for NL-means algorithm are fixed. In practice, the user may focus on enriched regions of certain length and this could lead to the change of these parameters. Therefore, a multiscale approach is preferred. Second, an important utility of ChIP-seq data analysis is to compare enriched regions between different samples such as ChIP sample versus its input control or control sample versus drug treated sample. Currently we are implementing a Fisher's exact test based approach to enable such comparison. Last but not the least, this method can be applied not only to PolII ChIP-seq data but can also be used for analyzing other data such as histone mark ChIP-seq data for integrative genomic analysis. Currently we are also expanding our study to integrate ChIP-seq data for important histone marks including H3K4me2 and H3K27me3 in an epigenetic study.

Conclusions

In this paper, we propose an enriched region detection method for ChIP-seq data to identify long enriched regions by combining a signal denoising algorithm with a FDR approach. Our method complements existing peak detection algorithms for ChIP-seq experiments. We demonstrated the effectiveness of this method by studying the binding patterns of PolII in cancer cell lines which enables further deep analysis with applications in transcription regulation and epigenetics.

References

  1. Barski A, Cuddapah S, Cui K, Roh TY, Schones DE, Wang Z, Wei G, Chepelev I, Zhao K: High-resolution profiling of histone methylations in the human genome. Cell. 2007, 129 (4): 823-837. 10.1016/j.cell.2007.05.009.

    Article  CAS  PubMed  Google Scholar 

  2. Barski A, Zhao K: Genomic location analysis by ChIP-Seq. Journal of cellular biochemistry. 2009, 107 (1): 11-18. 10.1002/jcb.22077.

    Article  CAS  PubMed  Google Scholar 

  3. Yu J, Yu J, Mani RS, Cao Q, Brenner CJ, Cao X, Wang X, Wu L, Li J, Hu M: An integrated network of androgen receptor, polycomb, and TMPRSS2-ERG gene fusions in prostate cancer progression. Cancer cell. 17 (5): 443-454.

  4. Fuda NJ, Ardehali MB, Lis JT: Defining mechanisms that regulate RNA polymerase II transcription in vivo. Nature. 2009, 461 (7261): 186-192. 10.1038/nature08449.

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  5. Zeitlinger J, Stark A, Kellis M, Hong JW, Nechaev S, Adelman K, Levine M, Young RA: RNA polymerase stalling at developmental control genes in the Drosophila melanogaster embryo. Nature genetics. 2007, 39 (12): 1512-1516. 10.1038/ng.2007.26.

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  6. Rozowsky J, Euskirchen G, Auerbach RK, Zhang ZD, Gibson T, Bjornson R, Carriero N, Snyder M, Gerstein MB: PeakSeq enables systematic scoring of ChIP-seq experiments relative to controls. Nature biotechnology. 2009, 27 (1): 66-75. 10.1038/nbt.1518.

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  7. Cao AR, Rabinovich R, Xu M, Xu X, Jin VX, Farnham PJ: Genome-wide analysis of transcription factor E2F1 mutant proteins reveals that N- and C-terminal protein interaction domains do not participate in targeting E2F1 to the human genome. The Journal of biological chemistry. 286 (14): 11985-11996.

  8. Zhang Y, Liu T, Meyer CA, Eeckhoute J, Johnson DS, Bernstein BE, Nussbaum C, Myers RM, Brown M, Li W: Model-based analysis of ChIP-Seq (MACS). Genome biology. 2008, 9 (9): R137-10.1186/gb-2008-9-9-r137.

    Article  PubMed Central  PubMed  Google Scholar 

  9. Johnson DS, Mortazavi A, Myers RM, Wold B: Genome-wide mapping of in vivo protein-DNA interactions. Science (New York, NY). 2007, 316 (5830): 1497-1502. 10.1126/science.1141319.

    Article  CAS  Google Scholar 

  10. Valouev A, Johnson DS, Sundquist A, Medina C, Anton E, Batzoglou S, Myers RM, Sidow A: Genome-wide analysis of transcription factor binding sites based on ChIP-Seq data. Nature methods. 2008, 5 (9): 829-834. 10.1038/nmeth.1246.

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  11. Blahnik KR, Dou L, O'Geen H, McPhillips T, Xu X, Cao AR, Iyengar S, Nicolet CM, Ludascher B, Korf I: Sole-Search: an integrated analysis program for peak detection and functional annotation using ChIP-seq data. Nucleic acids research. 38 (3): e13-

  12. Ji H, Jiang H, Ma W, Johnson DS, Myers RM, Wong WH: An integrated software system for analyzing ChIP-chip and ChIP-seq data. Nature biotechnology. 2008, 26 (11): 1293-1300. 10.1038/nbt.1505.

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  13. He HH, Meyer CA, Shin H, Bailey ST, Wei G, Wang Q, Zhang Y, Xu K, Ni M, Lupien M: Nucleosome dynamics define transcriptional enhancers. Nature genetics. 42 (4): 343-347.

  14. Zang C, Schones DE, Zeng C, Cui K, Zhao K, Peng W: A clustering approach for identification of enriched domains from histone modification ChIP-Seq data. Bioinformatics (Oxford, England). 2009, 25 (15): 1952-1958. 10.1093/bioinformatics/btp340.

    Article  CAS  Google Scholar 

  15. Buades A, Coll B: A non-local algorithm image denoising. IEEE International Conference on Computer Vision and Pattern Recognition (CVPR): 2005. 2005, 60-65.

    Google Scholar 

  16. Van De Ville D, Kocher M: Non-Local Means With Dimensionality Reduction and SURE-Based Parameter Selection. IEEE Transactions on Image Processing. 2011, 20 (9): 2683-2690.

    Article  PubMed  Google Scholar 

  17. Anscombe F: The transformation of Poisson, binomial and negative-binomial data. Biometrika. 1948, 35 (3-4): 246-254. 10.1093/biomet/35.3-4.246.

    Article  Google Scholar 

  18. Qin ZS, Yu J, Shen J, Maher CA, Hu M, Kalyana-Sundaram S, Yu J, Chinnaiyan AM: HPeak: an HMM-based algorithm for defining read-enriched regions in ChIP-Seq data. BMC bioinformatics. 11: 369-

  19. Szlam A: Non-Local Means for Audio Denoising. CAM report. 2008, UCLA

    Google Scholar 

  20. Boulanger J, Kervrann C, Bouthemy P: Space-time adaptation for patch-based image sequence restoration. IEEE transactions on pattern analysis and machine intelligence. 2007, 29 (6): 1096-1102.

    Article  PubMed  Google Scholar 

  21. Boulanger J, Kervrann C, Bouthemy P, Elbau P, Sibarita JB, Salamero J: Patch-based nonlocal functional for denoising fluorescence microscopy image sequences. IEEE transactions on medical imaging. 29 (2): 442-454.

  22. Rahl PB, Lin CY, Seila AC, Flynn RA, McCuine S, Burge CB, Sharp PA, Young RA: c-Myc regulates transcriptional pause release. Cell. 141 (3): 432-445.

Download references

Acknowledgements

This work is partially supported by NCI U54 CA113001-06 grant and PhAMA Foundation grant. TP is supported by the Pelotonia Fellowship grant.

This article has been published as part of BMC Bioinformatics Volume 13 Supplement 2, 2012: Proceedings from the Great Lakes Bioinformatics Conference 2011. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/13/S2

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Kun Huang.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

ZH implemented the algorithm and carried out all tests. LT designed the FDR test. TP advised on the implementation of NL-means algorithm. TH provided the breast cancer PolII data. RM conceived the idea of using NL-means for ChIP-seq denoising. KH led the project including development of the idea of signal processing approach, design of all experiments, interpretation of the results and writing of the manuscript. All authors edited the manuscript.

Rights and permissions

Open Access This article is published under license to BioMed Central Ltd. This is an Open Access article is distributed under the terms of the Creative Commons Attribution License ( https://creativecommons.org/licenses/by/2.0 ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Reprints and permissions

About this article

Cite this article

Han, Z., Tian, L., Pécot, T. et al. A signal processing approach for enriched region detection in RNA polymerase II ChIP-seq data. BMC Bioinformatics 13 (Suppl 2), S2 (2012). https://doi.org/10.1186/1471-2105-13-S2-S2

Download citation

  • Published:

  • DOI: https://doi.org/10.1186/1471-2105-13-S2-S2

Keywords