A signal processing approach for enriched region detection in RNA polymerase II ChIP-seq data

Background RNA polymerase II (PolII) is essential in gene transcription and ChIP-seq experiments have been used to study PolII binding patterns over the entire genome. However, since PolII enriched regions in the genome can be very long, existing peak finding algorithms for ChIP-seq data are not adequate for identifying such long regions. Methods Here we propose an enriched region detection method for ChIP-seq data to identify long enriched regions by combining a signal denoising algorithm with a false discovery rate (FDR) approach. The binned ChIP-seq data for PolII are first processed using a non-local means (NL-means) algorithm for purposes of denoising. Then, a FDR approach is developed to determine the threshold for marking enriched regions in the binned histogram. Results We first test our method using a public PolII ChIP-seq dataset and compare our results with published results obtained using the published algorithm HPeak. Our results show a high consistency with the published results (80-100%). Then, we apply our proposed method on PolII ChIP-seq data generated in our own study on the effects of hormone on the breast cancer cell line MCF7. The results demonstrate that our method can effectively identify long enriched regions in ChIP-seq datasets. Specifically, pertaining to MCF7 control samples we identified 5,911 segments with length of at least 4 Kbp (maximum 233,000 bp); and in MCF7 treated with E2 samples, we identified 6,200 such segments (maximum 325,000 bp). Conclusions We demonstrated the effectiveness of this method in studying binding patterns of PolII in cancer cells which enables further deep analysis in transcription regulation and epigenetics. Our method complements existing peak detection algorithms for ChIP-seq experiments.


Background
Chromatin immunoprecipitation combined with next generation sequencing technology (ChIP-seq) has been swiftly adopted as a standard technique for studying genome wide protein-DNA interaction patterns during the past four years. It is applied in gene regulation studies for identifying transcription factor targets and binding motifs, as well as in epigenetics research towards the characterization of chromatin states using various histone marks and RNA polymerase II (PolII) [1][2][3].
PolII plays an essential role in gene transcription. During transcription, it is responsible for the synthesis of nascent messenger RNA molecules (mRNA) for protein-coding genes and microRNAs [4]. The nascent mRNAs then go through a series of processing steps including splicing to form mature mRNAs. To transcribe a gene, PolII will undergose several steps including recruitment, initiation, elongation, and dissociation [4,5]. In addition, PolII pausing and pre-mature dissociation will cause stalling of the transcription process [4,5]. Thus, accurately characterization of PolII binding patterns over the entire genome is of great importance in studying the dynamics of transcription as well as contributing to the characterization of nascent mRNA, which cannot be directly inferred from gene expression microarray or regular RNA-seq technologies since they focus on mature mRNA. However, since during transcription PolII elongates along the entire gene, the PolII binding pattern over a gene is usually not just a single peak but forms elongated regions as manifest in ChIP-seq data. PolII enriched regions can stretch to several thousands of basepairs ( Figure 1). Traditionally, ChIP-seq data analysis methods rely on peak region detection algorithm to delineate genomic regions with enriched protein bindings. However, the binding pattern of PolII poses a very different paradigm of computing and in turn significant challenges. Several peak detection algorithms were developed for delineating transcription factor binding sites and the anticipated regions are short (e.g., 200-1500 bp) [6][7][8][9][10][11][12] thus rendering such algorithms inadequate for studying proteins with prevalent binding over the entire genome such as PolII.
While ChIP-seq data can be considered a 1-D signal over the entire genome, only a few studies explicitly take advantage of signal denoising and detection methods developed in the engineering community. For instance, in [13], wavelet denoising technique was applied to filter the ChIP-seq data to identify nucleosome distribution patterns. For histone marks, a method called SISSR was recently developed [14], which takes a multiscale approach to analyze ChIP-seq data. This approach first identifies potential regions with enriched histone patterns and then links proximal regions which are separated by short intervals as a contiguous large region. The short intervals can be considered "noise" in the genome-wide signal that can be filtered out at coarser scales.
In this paper, we also consider a ChIP-seq dataset a noisy 1-D signal stretched over the genome and apply signal processing techniques combined with statistical analysis for identifying large enriched regions in ChIPseq data. Our approach includes three main steps. First we directly apply a signal denoising algorithm to process histogram of ChIP-seq data. Noise in ChIP-seq data originates from multiple sources including non-specific binding to the antibody, artifacts in amplification (local PCR), and sequencing reads mapping errors. These confounding noises are usually modeled as spike noise with an underlying Poisson distribution. Recently it has been shown that the non-local means (NL-means) algorithm is highly effective in signal denoising compared to other commonly used algorithms such as median filter, lowpass filtering and wavelet denoising [15,16]. It should be noted that our proposed method is based on the NL-means denoising algorithm. Since the NL-means algorithm is optimal for Gaussian noise, we apply the Anscombe transformation to the binned ChIP-seq [17]. Consequently, the underlying noise distribution model is now approximated by a Gaussian distribution. Subsequently, we apply the NL-means algorithm followed by inverse Anscombe transform. The denoised data is then compared against random model to determine the threshold for region selection based on a false discovery rate (FDR) approach. Finally, regions are selected based on the threshold, the region length, and the ratio between the peak value and the threshold.
To evaluate our method, we first test our method using a public PolII ChIP-seq dataset and compare the results with published results obtained using the published algorithm HPeak [18]. Then, we apply our proposed method on PolII ChIP-seq data generated in our own study on the effects of hormone on breast cancer cell line MCF7. We compare the long segments obtained when tested against both MCF7 cell line and the MCF7 cells treated with 17-estradial (E2) hormone. The results demonstrate that our method can effectively identify both long enriched regions in ChIP-seq datasets and complement existing peak finding algorithms for a variety of potential wide applications such as histone mark binding pattern study.

Data selection and pre-processing
To test our method, we first downloaded a ChIP-seq dataset for PolII on a prostate cancer cell line (LnCap) after treatment with an androgen receptor agonist R1881 (GSM353618) from NCBI Gene Expression Omnibus (GEO). The dataset was generated using Illumina Genome Analyzer II (GAII) sequencer and was analyzed using the HPeak algorithm [3]. In addition, we apply our algorithm in PolII ChIP-seq data generated in our research on breast cancer using MCF7 cell line. The accession numbers are GSM529981 (MCF7 control), and GSM529982 (for MCF7 treated with 17-estradial (E2)). For all the datasets, we downloaded the sequence mapping results (using the Eland algorithm provided by Illumina Inc.) and then generated histogram with selected bin size.

Introduction of NL-means algorithm and parameter selection
The NL-means algorithm was originally developed for image denoising [15] and was also applied to 1-D signals (including acoustic signals) [19] and video [20]. The details of the original NL-means algorithm are given in [15]. Here we briefly describe the essential formulation and the salient parameters. Basically, for each data point, its value is replaced by a weighted average of data points over the entire signal such that points with similar neighbourhoods are given higher weights. It has been shown that the NL-means algorithm is optimal for Gaussian noise. In addition, using weighted averaging for points with similar neighbourhoods is suitable for ChIP-seq experiments since the pattern of protein binding are considered similar across many regions of the genome.
Formally, given a signal with data points X={x i ,i = 1,..., N}, the filtered value at x i is defined as and N (x i ) denote a fixed-size neighbourhood centred at the position i. In practice, searching for similar neighbourhood patterns over the entire genome is not feasible. Instead, a parameter specifying the range of search is needed. In summary, the NL-means algorithm requires three parameters, the size of neighbourhood the range of search and the weight parameter We evaluated a series of parameter combinations ( Figure 2) and in this study we use R = 10(bins) L = 15, and s = 10

Anscombe transformation
Since NL-means algorithm is optimal for Gaussian noise while the noise in ChIP-seq data is usually modelled using Poisson distribution, we apply Anscombe transform to the data first. Anscombe transform is a commonly used variance-stabilizing transformation that transforms a Poisson distribution random variable into one with that is close to a Gaussian distribution [17,21]. The Anscombe transformation on data point is defined as follows: y = 2 x + 3/8 and the inverse Anscombe transform is given by x=(y/2) 2 -3/8.

FDR based approach
To select a threshold for region selection, we designed a FDR approach based on B random simulations. Given the histogram of the ChIP-seq data with M bins and the reads over each bin as r i (i = 1, 2,..., M), we take the following steps: 1. In the b-th round of simulation (b = 1, 2,..., B), we randomly (Poisson distribution) position the same amount of sequencing reads for each chromosome as in the original data and then apply the NL-means algorithm. The reads over the i-th bin is denoted as r * ib . 2. Then for the i-th bin in the histogram, the ratio that the observed data is less than simulated data is recorded as p i = B b=1 I(r i < r * ib ). I(A) takes value 1 (resp. 0) if A is true (resp. false).
3. For the b-th round of simulation, we treat it as a "true" signal and compare it with rest of simulated data.
For the i-th bin, we compute p I r * ib < r * ib Then for a threshold p cut , the number of false peaks in this round of simulation is 4. The false discovery rate for the cutoff p cut is Since p cut is a function of the region height, we obtain the FDR for selected threshold on height. Then for each detected region based on the FDR approach, we also calculate the ratio between the peak value and the threshold.

Comparing with HPeak for detecting PolII enriched regions in prostate cancer
We first apply our method to a published PolII ChIPseq data in prostate cancer model (GSM353618). We Figure 2 Denoising on PolII ChIP-seq data for MCF7 cell line using NL-means algorithm. This region is over the gene ELAC2. Different parameters for the NL-means algorithm are tested on the ChIP-seq data. The second panel uses the parameter sets used in the rest of this paper (R=10(bins),L=15, and s=10) compare our results with the published 15,833 PolII enriched region detected using HPeak [3,18]. As shown in Table 1, when FDR = 0.01, our method show a high overlap with published results with overlap ratio (comparing to HPeak) of 80-100%. We then inspected many of regions that were detected using our method but were missed by HPeak. We found many of them correspond to potential transcription starting sites (TSS) for unknown genes or non-coding RNA genes. Figure 3A-B show examples of regions detected using our method using 0.01 for FDR and 6.5 for Peak-to-threshold ratio but were missed by the HPeak algorithm even though they imply the potential transcription activity for the covered genes. One of them is a snoRNA gene (U5E). Figure 3C shows examples of regions that were detected by HPeak but were missed by our method. In fact, using this setting, all regions missed by our method contain narrow peaks (covers one to two 25 bp bins). This observation suggests that the regions detected using our method can provide addition information regarding gene transcription and annotation. Further detailed annotation on these new regions is currently being carried out by integration with other information including CpG island distribution, DNase I hypersensitivity, and H3K4me2 binding (ChIP-seq).

Detecting large regions with PolII enrichment in breast cancer cell line MCF7
We then applied our method in the PolII ChIP-seq datasets for the breast cancer cell line MCF7 using large bin size (1,000 bp). As shown in Table 2, pertaining to MCF7 control samples we identified 5,911 segments with length of at least 4 K bp (ranging from 4,000 bp to 233,000 bp); and in MCF7 treated with E2 samples, we identified 6,200 such segments (ranging from 4,000 bp to 325,000 bp). Some examples of the long regions detected in MCF7 control samples are shown in Figure 4.

Discussion
Identification of long segments for PolII binding are important for further investigation for understanding gene transcription regulation as well as potentially discovery novel transcripts and alternative promoters. For gene transcription, while PolII binding density at promoter around the TSSs was considered to determine gene transcription levels, recent studies show that the density of PolII binding on gene body is also critical [5,22]. We also observed such phenomena using the above identified segments. For instance, as shown in Figure 5, a segment of 16,000 bp has been identified over the transcript of the gene PLK2 on human chromosome 5. The MCF7 control sample has more sequencing reads over this region than the MCF7 sample treated with E2 sample (958 vs 454 reads with similar amount of total reads in chromosome 5 between the two samples). Although, the height of the "peak" at the TSS region in the MCF7 control sample is lower than that in the MCF7 E2 treated sample, the total transcription level (measured by Affymetrix gene expression array) is still higher in MCF7 control by a factor of 3.95-fold (Student t-test p = 3.872 × 10 -6 ).
The above observations suggested that ChIP-seq technology can reveal potential new insights and principles in biology. The method presented in this paper will contribute to such discovery efforts. Nevertheless, this method can be improved in several aspects. First, currently the parameters for NL-means algorithm are fixed. In practice, the user may focus on enriched regions of certain length and this could lead to the change of these parameters. Therefore, a multiscale approach is preferred. Second, an important utility of ChIP-seq data analysis is to compare enriched regions between different samples such as ChIP sample versus its input control or control sample versus drug treated sample. Currently we are implementing a Fisher's exact test based approach to enable such comparison. Last but not the least, this method can be applied not only to PolII ChIP-seq data but can also be used for analyzing other data such as histone mark ChIP-seq data for integrative genomic analysis. Currently we are also expanding our study to integrate ChIP-seq data for important histone marks including H3K4me2 and H3K27me3 in an epigenetic study.

Conclusions
In this paper, we propose an enriched region detection method for ChIP-seq data to identify long enriched regions by combining a signal denoising algorithm with a FDR approach. Our method complements existing peak detection algorithms for ChIP-seq experiments. We demonstrated the effectiveness of this method by     studying the binding patterns of PolII in cancer cell lines which enables further deep analysis with applications in transcription regulation and epigenetics.