- Open Access
A signal processing approach for enriched region detection in RNA polymerase II ChIP-seq data
© Han et al.; licensee BioMed Central Ltd. 2012
- Published: 13 March 2012
RNA polymerase II (PolII) is essential in gene transcription and ChIP-seq experiments have been used to study PolII binding patterns over the entire genome. However, since PolII enriched regions in the genome can be very long, existing peak finding algorithms for ChIP-seq data are not adequate for identifying such long regions.
Here we propose an enriched region detection method for ChIP-seq data to identify long enriched regions by combining a signal denoising algorithm with a false discovery rate (FDR) approach. The binned ChIP-seq data for PolII are first processed using a non-local means (NL-means) algorithm for purposes of denoising. Then, a FDR approach is developed to determine the threshold for marking enriched regions in the binned histogram.
We first test our method using a public PolII ChIP-seq dataset and compare our results with published results obtained using the published algorithm HPeak. Our results show a high consistency with the published results (80-100%). Then, we apply our proposed method on PolII ChIP-seq data generated in our own study on the effects of hormone on the breast cancer cell line MCF7. The results demonstrate that our method can effectively identify long enriched regions in ChIP-seq datasets. Specifically, pertaining to MCF7 control samples we identified 5,911 segments with length of at least 4 Kbp (maximum 233,000 bp); and in MCF7 treated with E2 samples, we identified 6,200 such segments (maximum 325,000 bp).
We demonstrated the effectiveness of this method in studying binding patterns of PolII in cancer cells which enables further deep analysis in transcription regulation and epigenetics. Our method complements existing peak detection algorithms for ChIP-seq experiments.
- Histone Mark
- Breast Cancer Cell Line MCF7
- Enrich Region
- Wavelet Denoising
- snoRNA Gene
Chromatin immunoprecipitation combined with next generation sequencing technology (ChIP-seq) has been swiftly adopted as a standard technique for studying genome wide protein-DNA interaction patterns during the past four years. It is applied in gene regulation studies for identifying transcription factor targets and binding motifs, as well as in epigenetics research towards the characterization of chromatin states using various histone marks and RNA polymerase II (PolII) [1–3].
While ChIP-seq data can be considered a 1-D signal over the entire genome, only a few studies explicitly take advantage of signal denoising and detection methods developed in the engineering community. For instance, in , wavelet denoising technique was applied to filter the ChIP-seq data to identify nucleosome distribution patterns. For histone marks, a method called SISSR was recently developed , which takes a multiscale approach to analyze ChIP-seq data. This approach first identifies potential regions with enriched histone patterns and then links proximal regions which are separated by short intervals as a contiguous large region. The short intervals can be considered "noise" in the genome-wide signal that can be filtered out at coarser scales.
In this paper, we also consider a ChIP-seq dataset a noisy 1-D signal stretched over the genome and apply signal processing techniques combined with statistical analysis for identifying large enriched regions in ChIP-seq data. Our approach includes three main steps. First we directly apply a signal denoising algorithm to process histogram of ChIP-seq data. Noise in ChIP-seq data originates from multiple sources including non-specific binding to the antibody, artifacts in amplification (local PCR), and sequencing reads mapping errors. These confounding noises are usually modeled as spike noise with an underlying Poisson distribution. Recently it has been shown that the non-local means (NL-means) algorithm is highly effective in signal denoising compared to other commonly used algorithms such as median filter, low-pass filtering and wavelet denoising [15, 16]. It should be noted that our proposed method is based on the NL-means denoising algorithm. Since the NL-means algorithm is optimal for Gaussian noise, we apply the Anscombe transformation to the binned ChIP-seq . Consequently, the underlying noise distribution model is now approximated by a Gaussian distribution. Subsequently, we apply the NL-means algorithm followed by inverse Anscombe transform. The denoised data is then compared against random model to determine the threshold for region selection based on a false discovery rate (FDR) approach. Finally, regions are selected based on the threshold, the region length, and the ratio between the peak value and the threshold.
To evaluate our method, we first test our method using a public PolII ChIP-seq dataset and compare the results with published results obtained using the published algorithm HPeak . Then, we apply our proposed method on PolII ChIP-seq data generated in our own study on the effects of hormone on breast cancer cell line MCF7. We compare the long segments obtained when tested against both MCF7 cell line and the MCF7 cells treated with 17-estradial (E2) hormone. The results demonstrate that our method can effectively identify both long enriched regions in ChIP-seq datasets and complement existing peak finding algorithms for a variety of potential wide applications such as histone mark binding pattern study.
Data selection and pre-processing
To test our method, we first downloaded a ChIP-seq dataset for PolII on a prostate cancer cell line (LnCap) after treatment with an androgen receptor agonist R1881 (GSM353618) from NCBI Gene Expression Omnibus (GEO). The dataset was generated using Illumina Genome Analyzer II (GAII) sequencer and was analyzed using the HPeak algorithm . In addition, we apply our algorithm in PolII ChIP-seq data generated in our research on breast cancer using MCF7 cell line. The accession numbers are GSM529981 (MCF7 control), and GSM529982 (for MCF7 treated with 17-estradial (E2)). For all the datasets, we downloaded the sequence mapping results (using the Eland algorithm provided by Illumina Inc.) and then generated histogram with selected bin size.
Introduction of NL-means algorithm and parameter selection
The NL-means algorithm was originally developed for image denoising  and was also applied to 1-D signals (including acoustic signals)  and video . The details of the original NL-means algorithm are given in . Here we briefly describe the essential formulation and the salient parameters. Basically, for each data point, its value is replaced by a weighted average of data points over the entire signal such that points with similar neighbourhoods are given higher weights. It has been shown that the NL-means algorithm is optimal for Gaussian noise. In addition, using weighted averaging for points with similar neighbourhoods is suitable for ChIP-seq experiments since the pattern of protein binding are considered similar across many regions of the genome.
Since NL-means algorithm is optimal for Gaussian noise while the noise in ChIP-seq data is usually modelled using Poisson distribution, we apply Anscombe transform to the data first. Anscombe transform is a commonly used variance-stabilizing transformation that transforms a Poisson distribution random variable into one with that is close to a Gaussian distribution [17, 21]. The Anscombe transformation on data point is defined as follows: and the inverse Anscombe transform is given by x=(y/2)2-3/8.
FDR based approach
To select a threshold for region selection, we designed a FDR approach based on B random simulations. Given the histogram of the ChIP-seq data with M bins and the reads over each bin as r i (i = 1, 2,..., M), we take the following steps:
1. In the b-th round of simulation (b = 1, 2,..., B), we randomly (Poisson distribution) position the same amount of sequencing reads for each chromosome as in the original data and then apply the NL-means algorithm. The reads over the i-th bin is denoted as .
2. Then for the i-th bin in the histogram, the ratio that the observed data is less than simulated data is recorded as . I(A) takes value 1 (resp. 0) if A is true (resp. false).
3. For the b-th round of simulation, we treat it as a "true" signal and compare it with rest of simulated data. For the i-th bin, we compute Then for a threshold p cut , the number of false peaks in this round of simulation is
4. The false discovery rate for the cutoff p cut is .
Since p cut is a function of the region height, we obtain the FDR for selected threshold on height. Then for each detected region based on the FDR approach, we also calculate the ratio between the peak value and the threshold.
Comparing with HPeak for detecting PolII enriched regions in prostate cancer
Comparing our method with HPeak (bin size 25 bp).
# of Regions
Ratio of Overlap with
Detecting large regions with PolII enrichment in breast cancer cell line MCF7
Long enriched regions identified in PolII ChIP-seq data in MCF7 cells using bin size of 1,000 bp.
# of regions with length ≥ 4,000 bp
# of regions with length ≥ 10,000 bp
Maximum length (bp)
MCF7 + E2
The above observations suggested that ChIP-seq technology can reveal potential new insights and principles in biology. The method presented in this paper will contribute to such discovery efforts. Nevertheless, this method can be improved in several aspects. First, currently the parameters for NL-means algorithm are fixed. In practice, the user may focus on enriched regions of certain length and this could lead to the change of these parameters. Therefore, a multiscale approach is preferred. Second, an important utility of ChIP-seq data analysis is to compare enriched regions between different samples such as ChIP sample versus its input control or control sample versus drug treated sample. Currently we are implementing a Fisher's exact test based approach to enable such comparison. Last but not the least, this method can be applied not only to PolII ChIP-seq data but can also be used for analyzing other data such as histone mark ChIP-seq data for integrative genomic analysis. Currently we are also expanding our study to integrate ChIP-seq data for important histone marks including H3K4me2 and H3K27me3 in an epigenetic study.
In this paper, we propose an enriched region detection method for ChIP-seq data to identify long enriched regions by combining a signal denoising algorithm with a FDR approach. Our method complements existing peak detection algorithms for ChIP-seq experiments. We demonstrated the effectiveness of this method by studying the binding patterns of PolII in cancer cell lines which enables further deep analysis with applications in transcription regulation and epigenetics.
This work is partially supported by NCI U54 CA113001-06 grant and PhAMA Foundation grant. TP is supported by the Pelotonia Fellowship grant.
This article has been published as part of BMC Bioinformatics Volume 13 Supplement 2, 2012: Proceedings from the Great Lakes Bioinformatics Conference 2011. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/13/S2
- Barski A, Cuddapah S, Cui K, Roh TY, Schones DE, Wang Z, Wei G, Chepelev I, Zhao K: High-resolution profiling of histone methylations in the human genome. Cell. 2007, 129 (4): 823-837. 10.1016/j.cell.2007.05.009.View ArticlePubMedGoogle Scholar
- Barski A, Zhao K: Genomic location analysis by ChIP-Seq. Journal of cellular biochemistry. 2009, 107 (1): 11-18. 10.1002/jcb.22077.View ArticlePubMedGoogle Scholar
- Yu J, Yu J, Mani RS, Cao Q, Brenner CJ, Cao X, Wang X, Wu L, Li J, Hu M: An integrated network of androgen receptor, polycomb, and TMPRSS2-ERG gene fusions in prostate cancer progression. Cancer cell. 17 (5): 443-454.Google Scholar
- Fuda NJ, Ardehali MB, Lis JT: Defining mechanisms that regulate RNA polymerase II transcription in vivo. Nature. 2009, 461 (7261): 186-192. 10.1038/nature08449.PubMed CentralView ArticlePubMedGoogle Scholar
- Zeitlinger J, Stark A, Kellis M, Hong JW, Nechaev S, Adelman K, Levine M, Young RA: RNA polymerase stalling at developmental control genes in the Drosophila melanogaster embryo. Nature genetics. 2007, 39 (12): 1512-1516. 10.1038/ng.2007.26.PubMed CentralView ArticlePubMedGoogle Scholar
- Rozowsky J, Euskirchen G, Auerbach RK, Zhang ZD, Gibson T, Bjornson R, Carriero N, Snyder M, Gerstein MB: PeakSeq enables systematic scoring of ChIP-seq experiments relative to controls. Nature biotechnology. 2009, 27 (1): 66-75. 10.1038/nbt.1518.PubMed CentralView ArticlePubMedGoogle Scholar
- Cao AR, Rabinovich R, Xu M, Xu X, Jin VX, Farnham PJ: Genome-wide analysis of transcription factor E2F1 mutant proteins reveals that N- and C-terminal protein interaction domains do not participate in targeting E2F1 to the human genome. The Journal of biological chemistry. 286 (14): 11985-11996.Google Scholar
- Zhang Y, Liu T, Meyer CA, Eeckhoute J, Johnson DS, Bernstein BE, Nussbaum C, Myers RM, Brown M, Li W: Model-based analysis of ChIP-Seq (MACS). Genome biology. 2008, 9 (9): R137-10.1186/gb-2008-9-9-r137.PubMed CentralView ArticlePubMedGoogle Scholar
- Johnson DS, Mortazavi A, Myers RM, Wold B: Genome-wide mapping of in vivo protein-DNA interactions. Science (New York, NY). 2007, 316 (5830): 1497-1502. 10.1126/science.1141319.View ArticleGoogle Scholar
- Valouev A, Johnson DS, Sundquist A, Medina C, Anton E, Batzoglou S, Myers RM, Sidow A: Genome-wide analysis of transcription factor binding sites based on ChIP-Seq data. Nature methods. 2008, 5 (9): 829-834. 10.1038/nmeth.1246.PubMed CentralView ArticlePubMedGoogle Scholar
- Blahnik KR, Dou L, O'Geen H, McPhillips T, Xu X, Cao AR, Iyengar S, Nicolet CM, Ludascher B, Korf I: Sole-Search: an integrated analysis program for peak detection and functional annotation using ChIP-seq data. Nucleic acids research. 38 (3): e13-Google Scholar
- Ji H, Jiang H, Ma W, Johnson DS, Myers RM, Wong WH: An integrated software system for analyzing ChIP-chip and ChIP-seq data. Nature biotechnology. 2008, 26 (11): 1293-1300. 10.1038/nbt.1505.PubMed CentralView ArticlePubMedGoogle Scholar
- He HH, Meyer CA, Shin H, Bailey ST, Wei G, Wang Q, Zhang Y, Xu K, Ni M, Lupien M: Nucleosome dynamics define transcriptional enhancers. Nature genetics. 42 (4): 343-347.Google Scholar
- Zang C, Schones DE, Zeng C, Cui K, Zhao K, Peng W: A clustering approach for identification of enriched domains from histone modification ChIP-Seq data. Bioinformatics (Oxford, England). 2009, 25 (15): 1952-1958. 10.1093/bioinformatics/btp340.View ArticleGoogle Scholar
- Buades A, Coll B: A non-local algorithm image denoising. IEEE International Conference on Computer Vision and Pattern Recognition (CVPR): 2005. 2005, 60-65.Google Scholar
- Van De Ville D, Kocher M: Non-Local Means With Dimensionality Reduction and SURE-Based Parameter Selection. IEEE Transactions on Image Processing. 2011, 20 (9): 2683-2690.View ArticlePubMedGoogle Scholar
- Anscombe F: The transformation of Poisson, binomial and negative-binomial data. Biometrika. 1948, 35 (3-4): 246-254. 10.1093/biomet/35.3-4.246.View ArticleGoogle Scholar
- Qin ZS, Yu J, Shen J, Maher CA, Hu M, Kalyana-Sundaram S, Yu J, Chinnaiyan AM: HPeak: an HMM-based algorithm for defining read-enriched regions in ChIP-Seq data. BMC bioinformatics. 11: 369-Google Scholar
- Szlam A: Non-Local Means for Audio Denoising. CAM report. 2008, UCLAGoogle Scholar
- Boulanger J, Kervrann C, Bouthemy P: Space-time adaptation for patch-based image sequence restoration. IEEE transactions on pattern analysis and machine intelligence. 2007, 29 (6): 1096-1102.View ArticlePubMedGoogle Scholar
- Boulanger J, Kervrann C, Bouthemy P, Elbau P, Sibarita JB, Salamero J: Patch-based nonlocal functional for denoising fluorescence microscopy image sequences. IEEE transactions on medical imaging. 29 (2): 442-454.Google Scholar
- Rahl PB, Lin CY, Seila AC, Flynn RA, McCuine S, Burge CB, Sharp PA, Young RA: c-Myc regulates transcriptional pause release. Cell. 141 (3): 432-445.Google Scholar
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.