Is this the right normalization? A diagnostic tool for ChIP-seq normalization
- Claudia Angelini^{1}Email author,
- Ruth Heller^{2},
- Rita Volkinshtein^{2} and
- Daniel Yekutieli^{2}
https://doi.org/10.1186/s12859-015-0579-z
© Angelini et al.; licensee BioMed Central. 2015
Received: 3 August 2014
Accepted: 20 April 2015
Published: 9 May 2015
Abstract
Background
Chip-seq experiments are becoming a standard approach for genome-wide profiling protein-DNA interactions, such as detecting transcription factor binding sites, histone modification marks and RNA Polymerase II occupancy. However, when comparing a ChIP sample versus a control sample, such as Input DNA, normalization procedures have to be applied in order to remove experimental source of biases. Despite the substantial impact that the choice of the normalization method can have on the results of a ChIP-seq data analysis, their assessment is not fully explored in the literature. In particular, there are no diagnostic tools that show whether the applied normalization is indeed appropriate for the data being analyzed.
Results
In this work we propose a novel diagnostic tool to examine the appropriateness of the estimated normalization procedure. By plotting the empirical densities of log relative risks in bins of equal read count, along with the estimated normalization constant, after logarithmic transformation, the researcher is able to assess the appropriateness of the estimated normalization constant. We use the diagnostic plot to evaluate the appropriateness of the estimates obtained by CisGenome, NCIS and CCAT on several real data examples. Moreover, we show the impact that the choice of the normalization constant can have on standard tools for peak calling such as MACS or SICER. Finally, we propose a novel procedure for controlling the FDR using sample swapping. This procedure makes use of the estimated normalization constant in order to gain power over the naive choice of constant (used in MACS and SICER), which is the ratio of the total number of reads in the ChIP and Input samples.
Conclusions
Linear normalization approaches aim to estimate a scale factor, r, to adjust for different sequencing depths when comparing ChIP versus Input samples. The estimated scaling factor can easily be incorporated in many peak caller algorithms to improve the accuracy of the peak identification. The diagnostic plot proposed in this paper can be used to assess how adequate ChIP/Input normalization constants are, and thus it allows the user to choose the most adequate estimate for the analysis.
Keywords
Chip-Seq Diagnostic plots NormalizationBackground
Mammalian genomes are organized to form a tridimensional structure called chromatin. It is a highly structured and compact DNA-protein complex that can assume many different conformations depending on the nuclear context and on the biochemical modifications present on both DNA and histone proteins [1]. Its conformation can influence cell activity, state and functionality and can help in understanding why different types of cells exhibit very different behaviours, although they share the same genome. Indeed, transcription factors and chromatin modifiers represent important players in gene regulation, DNA replication, programming and reprogramming of cellular states during differentiation and development. Epigenetic regulatory mechanisms are crucial in the onset and progression of several diseases, including cancer development [2,3], and can be altered by the environment. Therefore, being able to precisely profile all epigenetic actors can deeply help in revealing the landscape of transcription and regulation in cells.
Since the seminal papers [4-7], chromatin immunoprecipitation followed by massively parallel sequencing (ChIP-seq, see [8-10] for a review) has substituted the use of DNA hybridization to microarray (ChIP-chip) and has become a standard technique for identifying protein-DNA interaction [11]. Several large scale projects such as the Encyclopedia of DNA Elements [12,13], The Cancer Genome Atlas (http://cancergenome.nih.gov/), the Roadmap Epigenomics (http://www.roadmapepigenomics.org), are now producing thousands of genome-wide maps for a variety of transcription factors and histone modifications in a large number of different cell types and conditions, see for example [14,15]. In ChIP-seq experiments [8-10], the DNA-binding protein is cross-linked to DNA in vivo, then DNA fragments (usually 200-600 bp) are enriched by immuno-precipitation using an antibody specific to the protein of interest. DNA libraries (i.e. the collections of DNA fragments that will be sequenced) are prepared according to the protocol of the instrument, and finally massively sequenced to produce several millions of short reads of about 50-100 bp.
According to most of the available computational protocols [9,16-18] short reads have to be first aligned to the reference genome in order to identify the genomic locations (i.e., sub-intervals of the whole genome) with enrichment of reads (i.e., with a higher number of mapped reads than expected by chance). Several tools have been proposed for identifying peaks of enriched regions. The corresponding algorithms are known as “peak caller” algorithms [19-28]. Different shapes of the genome profile require different peak callers: transcription factors can be well described as sharp punctuate signals of few tens or at most few hundreds base-pairs of length; histone modifications appear as wide-spread and less defined domains and can reach several hundred of kilobases of length; RNA polymerase II binding sites are modelled like a sharp peak with a long heavy tail toward the direction of the transcription.
Exploring the genome-wide coverage profiles it is easy to observe that, due to several experimental sources of errors, the signal consists of both truly enriched regions (i.e., the true regions of interest) and a large number of non specific/non uniform regions that behave as “background”. Following this observation, in [20,29], the data are modeled as a mixture of reads randomly sampled from an enriched signal distribution and a background noise distribution. In order to account for different sources of experimental biases in the background distribution, it is standard practice to sequence a control sample either from Input DNA (i.e., DNA isolated from cells that have been cross-linked and fragmented under conditions similar to those of the experimental sample) or by using a non specific antibody during the enrichment (i.e., IgG) [11]. When the control sample is available, peaks are defined as those sub-regions of the genome with statistically significant higher number of reads than the control/background.
A good peak detection algorithm, that balances sensitivity and specificity, is obtained by choosing an appropriate peak-calling algorithm and a normalization method. The normalization is essential to make the ChIP sample and control/Input sample comparable, since they are usually sequenced with a different number of short reads. The most commonly used procedure for accounting for the different depth is to scale-up the read counts (usually observed in a window w) with respect to the total library sizes ratio. This approach is too naive, since it does not account for the fact that only the background component of the ChIP signal follows the same distribution as that of the control signal. Therefore, more sophisticated linear and non-linear approaches to ChIP-seq data normalization have been proposed (see Box 6 and Table S2 in [18] for a summary, and [29-32] for examples). However, despite the substantial impact that the choice of the normalization method can have on the results of a ChIP-seq data analysis [20,29], their properties are not fully explored in the literature, and there are no diagnostic tools that show whether the applied normalization is indeed appropriate for the data being analyzed.
In this paper we focus on linear normalization approaches. These approaches estimate a scale factor, r, able to adjust for different sequencing depths. Linear approaches have the great advantage that they can easily be incorporated in many peak caller algorithms to improve the accuracy of the peak identification (e.g., [29]). Moreover, we observe that, independently from the impact on peak calling, the direct estimate of the normalization factor is of interest both for measuring the specificity of the antibody and for comparing specific genomic regions under different conditions. The latter often occurs when relating epigenomic signatures with gene expression, see [33,34].
In this work we propose a novel diagnostic plot to examine the appropriateness of the estimated normalization constant. By plotting empirical densities of log relative risks in bins of equal read count, along with the normalization constant, after logarithmic transformation, the researcher is able to assess the level of agreement of the estimated normalization constant with the data. If the agreement is not satisfactory, the user can revise the estimate, either by using a different normalization method or by changing the input parameters of the method, and reassess. We stress that an accurate diagnosis of the normalization constant is important since it can affect the subsequent analysis dramatically: if the estimated normalization constant is too large, there may be too few discoveries due to power loss in the peak caller algorithms; if the estimated normalization constant is too small, there may be an increase in false positives in the peak caller algorithm.
The paper is organized as follows. After introducing the notation, and three common normalization methods along with their potential limitations, we introduce our diagnostic tool. We examine its usefulness in data-driven simulations based on yeast, as well as on several real mouse data examples and a real model organism example. In order to assess the importance of the correct normalization factor in inference, we examine empirically the effect of different values for the normalization factor on peak calling algorithms, and we examine theoretically the effect of the normalization constant value on the false discovery rate (FDR). We conclude with a summary, some final remarks and future extensions.
Methods
Notation
The value of r represents the normalization constant of interest. Since π _{0} is unknown, it has to be estimated. Note that the naive approach that normalises the reads according to the ratio of the two depths corresponds to setting π _{0} to one, which is the correct value only if all the observed reads in the ChIP sample are background reads.
We partition of the genome of interest into G _{ b } windows of equal genomic length |w|=b, indexed by w=1⋯G _{ b }. We denote by \(\mathcal {W} = \{ 1 \cdots G_{b} \}\) the set of indices of all windows. Let N _{ ch }(w) be the number of ChIP reads mapped to window w and N _{ in }(w) be the number of Input reads mapped to window w. Let N _{ tot }(w)=N _{ ch }(w)+N _{ in }(w) be the total read count in window w. The total number of reads in the ChIP and Input samples are therefore \(N_{\textit {ch}} = \sum _{w \in \mathcal {W}} N_{\textit {ch}} (w)\) and \(N_{\textit {in}} = \sum _{w \in \mathcal {W}} N_{\textit {in}} (w)\), respectively. Remark 1 below discusses the preprocessing steps for the read counts to avoid redundancy in the window counting.
We assume that each enriched region is a compact interval and only a subset of the genome is affected by the modification. These regions may be very large, as expected for histone modifications. The set of windows is thus comprised of two subsets of indices of enriched and background windows, denoted by \(\mathcal {W}_{1} \) and \(\mathcal {W}_{0} \), respectively. A window w is in \(\mathcal {W}_{1} \) if the ChIP signal is enriched, and in \(\mathcal {W}_{0} \) if the ChIP signal behaves like background (i.e., control/Input signal).
Remark 1.
Since the DNA fragments undergo a PCR step (during the sample preparation) that can result in excessive and non-uniform amplification of the original sequences, it is common practice to remove such artefacts after the alignment step, retaining at most few reads per starting position. When the sequencing depth is low, then the number of retained reads is often only one. In case of higher coverage two or three reads are usually retained. Moreover, to account for the fact that each sequenced read constitutes only the 5’-end of the corresponding DNA fragments, it can be beneficial to shift each mapped position towards their 3’-end by half of the average DNA fragment length before computing N _{ ch }(w) and N _{ in }(w).
Normalization approaches
The unknown ratio r in (1) is approximately equal to \(\frac {\sum _{w \in \mathcal {W}_{0}} N_{\textit {ch}} (w) } { \sum _{w \in \mathcal {W}_{0}} N_{\textit {in}} (w)}\). The key to estimating r is thus the estimation of the subset \(\mathcal {W}_{0}\) of background windows. Several normalization procedures have been proposed to estimate r, and we shall review the three methods investigated in this paper [20,21,29].
CisGenome
In [21], the genome is first divided into windows of length b=100 b p and \(\mathcal {W}_{0}\) is estimated by \(\hat {\mathcal {W}}_{0}=\left \{ w:\, N_{\textit {tot}} (w)\leq t\right \} \), with t fixed at the value t=1. The assumption in [21] is that windows with low total count, N _{ tot }(w), are more likely to belong to the background. The main drawback of this method is that the fixed window-size does not adapt to signals with variable spread, and that the fixed threshold t=1 does not scale up with the increase of sequencing depth.
CisGenome is freely available at http://www.biostat.jhsph.edu/~hji/cisgenome/.
NCIS
The final step is to estimate \({\mathcal {W}}_{0}\) by \({\hat {\mathcal {W}}_{0}}={\hat {\mathcal {W}}_{0}}^{b^{*},t^{*}}\), where b ^{∗} is the smallest window size with an estimated normalization factor that is a local minima (i.e. with value smaller than that computed from smaller window sizes, and at most as large as that of the next largest window size).
This method, as the previous one, assumes that the background windows tend to have lower counts than windows in enriched regions, and therefore may result in poor estimates if the total number of many background windows is actually larger than for windows in enriched regions.
NCIS is an R package available in Additional file two of [29].
CCAT
Recognising that due to the intrinsic bias of ChIP-seq experiments, the read counts may also be small for some ChIP-enriched regions and may be relatively large for some background regions, in [20] a different approach has been taken: an iterative method based on the assumption that reads with both positive and negative strand direction are equally distributed in the sample.
- 1.Estimation of \({\hat {\mathcal {W}}_{0}}^{j} \) as follows:where \(N^{+}_{\textit {ch}}(w) \) and \(N^{+}_{\textit {in}}(w)\) are, respectively, the ChIP sample and Input sample reads that map on the positive strand only for window w.$${\hat{\mathcal{W}}_{0}}^{j} =\left\{ w:\, N^{+}_{ch}\left(w\right)<\hat{r}^{j-1} N^{+}_{in}\left(w\right)\right\}, $$
- 2.Updating of \(\hat {r}^{j}\) aswhere \(N^{-}_{\textit {ch}}(w)\) and \(N^{-}_{\textit {in}}(w)\) are, respectively, the ChIP sample and Input sample reads that map on the negative strand only for window w.$$\hat {r}^{j}: = \frac{\sum_{w \in {\hat {\cal W}_{0}}^{j} } N^{-}_{ch} (w) } { \sum_{w \in {\hat {\cal W}_{0}}^{j} } N^{-}_{in} (w)}, $$
The estimator may be biased if the distribution of reads in the positive and negative directions differ in many regions.
CCAT is freely available at http://cmb.gis.a-star.edu.sg/ChIPSeq/paperCCAT.htm.
The diagnostic plot
Given an estimate of r, we shall describe the diagnostic plot we propose in order to verify that the estimate of r is adequate. Our plot is relevant for bin counts that have a Poisson distribution or a distribution that is more dispersed than the Poisson distribution (e.g., Negative binomial, among others). We shall start by describing our diagnostic tool for bin counts that have a Poisson distribution, then relax this assumption and argue that the diagnostic plot remains relevant for distributions that are more dispersed than the Poisson distribution.
We set an integer K in the range from 100 to 1000. For the fixed value of K, we partition the genome into (non-equal length) non-overlapping genomic regions, called bins, by agglomerating consecutive windows w such that the sum of N _{ tot }(w) within the bin is (approximately) K: the first bin is \(\sum _{w=1}^{i_{1}} N_{\textit {tot}}(w)\), where i _{1} is the value that satisfies \(\sum _{w=1}^{i_{1}-1} N_{\textit {tot}}(w)<K\leq \sum _{w=1}^{i_{1}} N_{\textit {tot}}(w)\); the second bin is \(\sum _{w=i_{1}+1}^{i_{2}} N_{\textit {tot}}(w)\), where i _{2} is the value that satisfies \(\sum _{w=i_{1}+1}^{i_{2}-1} N_{\textit {tot}}(w)<K\leq \sum _{w=i_{1}+1}^{i_{2}} N_{\textit {tot}}(w)\); etc. Hence, a bin aggregates a small number of windows in regions containing many mapped reads in total (i.e., of ChIP plus Input data), and a large number of windows when the total number of mapped reads in the region is low.
where N(μ,σ ^{2}) denotes the normal distribution with mean μ and variance σ ^{2}, and ∼· means “approximately distributed as”. The approximation is better the greater \(\tilde {N}_{\textit {tot}}(i)\). Since the bins have approximately the same total counts, it follows that for background bins, the variability of \(\log \frac {\tilde {N}_{\textit {ch}} (i)}{\tilde {N}_{\textit {in}} (i)}\) is similar for all i, and their expectation is logr. Hence, the observed values \(\log \frac {\tilde {N}_{\textit {ch}} (i)}{\tilde {N}_{\textit {in}} (i)}\) can be seen as random samples drawn form the same distribution for background bins. A kernel density approach [36] can be used to estimate the density function of \(\log \frac {\tilde {N}_{\textit {ch}} (i)}{\tilde {N}_{\textit {in}} (i)}\). For the background bins we expect the peak to be around logr. Hence, to assess the performance of the estimate of \(\hat {r}\) we can compare the location of \(\log \hat {r}\) with respect to the left peak of the estimated density.
Finally, we note that the choice of total count, K, can matter: if the selected K is too small, then the expectation may be farther from logr since the large sample approximation is too crude; if the selected K is too large then the number of background bins may be too few. For diagnosing whether the actual estimate of logr is reasonable, we therefore suggest plotting the empirical density of the log relative risks at several values of K. In the Results and discussion section, detailed next, we show that the qualitative assessment of the appropriateness of estimates, in the variety of examples we examined, is the same for different K values in the range 100 to 1000. Since our set of examples is representative of enrichment shapes that are encountered in practice, we recommend using the diagnostic plots with K in this range.
Results and discussion
Data driven simulations
We performed simulations in order to assess the ability of the diagnostic plot to capture the true normalization factor, as well as in pointing out the quality of the estimated normalization factors. We considered two different simulation schemes. They were both based on yeast data, which is 12×10^{6} bp long, about 250 times smaller than that of the human genome.
The first simulation scheme, which we call the “Read-add simulation”, is similar to the simulation method of [29] setting 3. As in [29], it was based on the yeast ChIP-seq study of [37]. The data (GEO Accession number GSE19636) was deeply sequenced: the control sample of segregant 1 (SEG1) had a total of 4.2×10^{6} reads, and the average fragment length was 200 bp. Thus the sequencing coverage was 200×4.2/12=70, much higher than the typical coverage for the human genome. We split the control sample into two halves, “ChIP” and “Input”. We subsampled the two halves to yield 1/d of the original library size. Next, we added reads to the “ChIP” half in several locations, as follows. We added at 50 randomly assigned genome locations along the yeast chromosome reads in the range of several thousands base-pairs, with an average increase of 0.2 reads with respect to the “Input” in these enriched regions. In this simulation scheme, the total number of reads was always greater for the “ChIP” library and the true normalization score was 1.
The second simulation scheme, which we call the“By-Genes simulation”, is similar to that of [27]. We downloaded from the Genome Browser the entire table of yeast genome (sacCer1): 27820 partially overlapping gene bodies having lengths ranging from 42 to 632 Kb. The median and average lengths were 1000 bp and 5000 bp, respectively. We selected 300 non-overlapping gene bodies as enriched. We used the control sample SEG1 above, subsampled to yield 1/d of the library size in order to estimate the probability of input reads in non-overlapping windows of size 100 bp. We generated the “Input” data from these probabilities. For the “ChIP” data, in enriched regions the probability of reads falling in the windows had a higher probability (about twice as high). In this simulation, the total number of reads is the same for the “ChIP” and “Input” libraries and the true normalization score was determined by the fraction of input reads that fell outside the 300 enriched regions. Specifically, let p _{ in }(1),…,p _{ in }(G) be the fraction of reads in windows 1 to G in the 1/d sub-sampled control sample SEG1. Then the “Input” number of reads per window was sampled from a multinomial distribution with N total reads, and the vector of probabilities p _{ in }(1),…,p _{ in }(G). Let \(f_{0} = \sum _{w\in \mathcal {W}_{0}} p_{\textit {in}}(w)\) be the probability of falling in our defined \(\mathcal {W}_{0}\) (i.e., the fraction of reads outside the 300 enriched regions out of all the reads in the 1/d sub-sampled control sample SEG1). The “ChIP” counts per window were sampled from a multinomial distribution with N total reads, and the vector of probabilities p _{ ch }(1),…,p _{ ch }(G), where p _{ ch }(w)=λ _{ w } p _{ in }(w)/(f _{0}+2(1−f _{0})), where λ _{ w }=2 if \(w\in \mathcal {W}_{1}\), and λ _{ w }=1 otherwise. Therefore, r=f _{0}/[f _{0}+2(1−f _{0})]∗[1/f _{0}]=1/(2−f _{0}).
Real data application
The diagnostic plot on transcription factors and histone modifications
We applied the normalization methods and diagnostic plot on the ChIP-seq data of mouse embryonic fibroblast cells in [38] (GEO accession number GSE36048).
ChIP-seq samples consist of three histone modifications H3K4me3, H3K27me3 and H3K36me3 and one transcription factor, CTCF, along with the Input sample. The histone modifications are interesting in our context since they have different shapes and levels of enrichment, with H3K4me3 being more compact and peaked around the transcription starting sites of active genes, and H3K27me3 and H3K36me3 more wide-spread over the gene bodies of repressed and active genes, respectively. On the other hand, CTCF is an example of the transcription factor that shows a sharper and more punctuated signal. These experiments contain signals with different spatial resolutions, and we shall apply the diagnostic plot for each of these signal types.
ChIP-seq and Input library sizes in the mouse embryonic fibroblast cells in the study of [ 38 ] and in the D. melanogaster in the modENCODE 3955 dataset
Modification | Library size, | Library size, “-2” | % of library |
---|---|---|---|
bp | filtration, bp | used | |
CTCF [38] | 15,995,367 | 14,378,395 | 90 |
H3K4me3 [38] | 18,466,608 | 17,047,383 | 92 |
H3K27me3 [38] | 12,594,070 | 11,802,201 | 94 |
H3K36me3-1 [38] | 16,675,805 | 14,397,781 | 86 |
H3K36me3-5 [38] | 20,189,813 | 17,194,818 | 85 |
H3K36me3-8 [38] | 19,588,306 | 16,700,104 | 85 |
H3K36me3-pooled [38] | 56,453,924 | 48,292,703 | 86 |
Input [38] | 19,672,479 | 17,099,127 | 87 |
H3K27me3 (modENCODE,id 1820) | 13, 273,869 | 12,946,455 | 97 |
Input (modENCODE,id 1815) | 13,425,292 | 13,168,039 | 98 |
H3K27me3 (modENCODE,id 1957) | 11,997,853 | 11,807,914 | 98 |
Input (modENCODE,id 1961) | 15,685,299 | 15,402,582 | 98 |
H3K27me3-pooled(modENCODE, id 1820and 1957) | 25,081,783 | 24,754,369 | 98 |
Input-pooled(modENCODE,id 1815 and 1961) | 28,827,874 | 28,570,621 | 98 |
The H3K36me3 modification had three replicates labeled rep 1, rep 5 and rep 8 in Table 1. Sample H3K36me3-pooled was assembled by pooling the three filtered H3K36me3 replicates. All datasets showed a low percentage of duplicated reads, so the impact of the threshold in the filtering step (for PCR artefact removal) was negligible.
We performed the ChIP/Input normalization factor estimation of the data samples using the CisGenome, NCIS and CCAT methods, using the default parameters in their packages. The diagnostic plots in Figure 1 refer to the histone modification samples (H3K4me3, H3K27me3 and H3K36me3) and include the estimated density and the four sub-densities, as well as the estimated logr for the three methods. The assessment was done with K=200. The diagnostic plots (for the same modifications) with K=50,500,5000 are illustrated in Additional files 1, 2 and 3, respectively. Additional file 4 shows the diagnostic plot for sample CTCF (for K=50,100,500 and 2000, respectively).
The shape of the estimated density of the log relative risks confirms that the enriched regions of H3K4me3 are easily separated from those from the background. In fact, the estimated density shows a well-defined peak. For the other two modifications (H3K27me3 and H3K36me3), enrichment is less pronounced and also more widespread, and as a consequence the signal is not as easily separated from the background. This is reflected in the fact that the estimated density shows a less defined peak. However, a comparison of the four sub-densities of the different quartiles in length suggests that the bins in the last quartile in length are background bins, and the bins in the first quartile in length contain mainly signal, for all three modifications.
Since the density of the subset of bins that are in the last quartile of length are background bins, the estimated logr is reasonable if it is in the vicinity of the peak of this curve. The diagnostic plot in Figure 1 shows that for the six datasets, the estimator of CCAT is reasonable. The estimate of CisGenome is diagnosed to be too large for the datasets H3K27me3 and H3K36me3-rep 1, and too small for the dataset H3K36me3-pooled. The estimate of NCIS is diagnosed to be too small for the dataset H3K36me3-pooled.
The diagnostic plot for CTCF, in Additional file 4, shows that the estimates by CisGenome and NCIS (which are very close) agree best with the data, and that the estimate of CCAT may be slightly too small: the mode of the empirical density of the upper quartile in length bins (in pink) for all values of K is very close to the estimates by CisGenome and NCIS, and slightly to the right of the estimate by CCAT.
For each dataset of [ 38 ], the estimated r and π _{ 0 } from CisGenome (columns 2 and 3), from NCIS (columns 4 and 5), and from CCAT (columns 6 and 7)
Dataset | CisGenome | NCIS | CCAT | |||
---|---|---|---|---|---|---|
\(\boldsymbol {\hat {r}}\) | \(\boldsymbol {\hat {\pi }_{0} = \frac {N_{\textit {in}}}{N_{\textit {ch}}}\hat {r}}\) | \(\boldsymbol {\hat {r}}\) | \(\boldsymbol {\hat {\pi }_{0} = \frac {N_{\textit {in}}}{N_{\textit {ch}}}\hat {r}}\) | \(\boldsymbol {\hat {r}}\) | \(\boldsymbol {\hat {\pi }_{0} = \frac {N_{\textit {in}}}{N_{\textit {ch}}}\hat {r}}\) | |
CTCF | 0.6699 | 0.5633 | 0.6657 | 0.5598 | 0.6419 | 0.5397 |
H3K4me3 | 0.3833 | 0.3845 | 0.3751 | 0.3762 | 0.3715 | 0.3726 |
H3K27me3 | 0.6163 | 0.8929 | 0.5182 | 0.7507 | 0.5187 | 0.7515 |
H3K36me3-1 | 0.6460 | 0.7672 | 0.5727 | 0.6801 | 0.5689 | 0.6757 |
H3K36me3-5 | 0.7097 | 0.7058 | 0.6802 | 0.676 | 0.6763 | 0.6725 |
H3K36me3-8 | 0.6987 | 0.7154 | 0.6813 | 0.6976 | 0.6499 | 0.6654 |
H3K36me3-pooled | 1.3117 | 0.4644 | 1.4680 | 0.5198 | 1.8084 | 0.6403 |
\(\frac {\hat {\pi }_{01}N_{Ch_{1}}+\hat {\pi }_{05}N_{Ch_{5}}+\hat {\pi }_{08}N_{Ch_{8}}}{N_{Ch_{1}}+N_{Ch_{5}}+N_{Ch_{8}}}\) | 0.7274 | 0.6848 | 0.6710 |
Effect of the coverage depth
It is known that very disperse histone modifications such as H3K27me3 require very high coverage to be identified, and the coverage in the dataset of [38] was low. Therefore, in addition to the examination of H3K27me3 in the data of [38], we examined it in a study on D. melanogaster from modENCODE Project[39] (Model Organism ENCyclopedia Of DNA Elements). The dataset consists of two replicates of histone modification H3K27me3 (identified by id number 1820 and 1957) and two replicates of Input (identified by id number 1815 and 1961) from Drosophila Oregon R embryos, 14-16 hr after egg laying (GEO accession number GSE47230, modENCODE 3955).
Short reads were aligned to the reference genome using Bowtie and the alignments were converted to BED format. As before, the aligned dataset was subject to filtration, meaning that up to two reads were retained in starting on a single genomic position.
Table 1 shows the number of reads sequenced in the study. Since the D. melanogaster genome is about 1.2∗10^{8} bp (compared to 2.8∗10^{9} bp for the mouse genome), the coverage of this dataset was about 20 times higher than the one in previous examples. Sample H3K27me3-pooled was assembled by pooling the two filtered H3K27me3 replicates. In a similar way we also pooled the two replicate of the Input.
For each dataset of modENCODE 38955, the estimated r and π _{ 0 } from CisGenome (columns 2 and 3), from NCIS (columns 4 and 5), and from CCAT (columns 6 and 7)
Dataset | CisGenome | NCIS | CCAT | |||
---|---|---|---|---|---|---|
\(\boldsymbol {\hat {r}}\) | \(\boldsymbol {\hat {\pi }_{0} = \frac {N_{\textit {in}}}{N_{\textit {ch}}}\hat {r}}\) | \(\boldsymbol {\hat {r}}\) | \(\boldsymbol {\hat {\pi }_{0} = \frac {N_{\textit {in}}}{N_{\textit {ch}}}\hat {r}}\) | \(\boldsymbol {\hat {r}}\) | \(\boldsymbol {\hat {\pi }_{0} = \frac {N_{\textit {in}}}{N_{\textit {ch}}}\hat {r}}\) | |
H3K27me3 (modENCODE,id 1820 vs id 1815) | 0.6481 | 0.6587 | 0.6508 | 0.6615 | 0.5760 | 0.5855 |
H3K27me3 (modENCODE,id 1957 vs id 1961) | 0.4883 | 0.6364 | 0.4933 | 0.6428 | 0.4251 | 0.5540 |
H3K27me3-pooled (modENCODE) | 0.5464 | 0.6301 | 0.5656 | 0.6522 | 0.4660 | 0.5373 |
\(\frac {\hat {\pi }_{1820}N_{Ch_{1820}}+\hat {\pi }_{1957}N_{Ch_{1957}}}{N_{Ch_{1820}}+N_{Ch_{1957}}}\) | 0.6473 | 0.6520 | 0.5694 |
The shape of the estimated density of the log relative risks suggests that the enriched regions can be well separated from the background. The empirical density of the lower and upper quartile in length (red and pink curves, respectively) clearly capture the enriched and background bins. The diagnostic plot shows that the estimates by CisGenome and NCIS (which are very close), agree best with the data, and that the estimate of CCAT is too small: the mode of the empirical density of the upper quartile in length bins (in pink) is very close to the estimates by CisGenome and NCIS, and it is to the right of the estimate by CCAT. Additional file 5 shows the diagnostic plot for ChIP sample id 1820 versus Inpunt sample Id 1815 and K=100,200,1000, and 2000, respectively (the other cases behave similarly and are not shown for brevity).
Impact on peak calling algoritms
The main purpose of the ChIP-seq studies is to detect binding sites and/or enriched regions of histone modifications, and a peak detection algorithm is typically used for this purpose. In general, the impact of the normalization on the identified regions depends on the specific peak calling method and on the type of multiple testing procedure that is implemented. Most of the peak calling algorithms normalize data inside their code using suitable internal strategies. Tools such as MACS [24] and SICER[25] use N _{ ch }/N _{ in } as their normalization constant. Following the suggestion of [29], we modified both MACS and SICER, making them able to work with user defined normalization constants. For both methods, the (global) normalization constant does not change the rank of the peaks (where the ranking is by p-values or by enrichment scores), but it does affect the number of discovered regions. Therefore, a smaller normalization constant, with all other parameters held fixed, may result in a larger number of discovered regions.
Number of peaks detected using the naive normalization constant (column 2), the estimate by CisGenome (column 3), NCIS (column 4), and CCAT (column 5), in a specific peak calling algorithm (column 6)
TF or Modification | Default | CisGenome | NCIS | CCAT | Peak calling Algorithm |
---|---|---|---|---|---|
CTCF [38] | 22522 | 29841 | 30369 | 38339 | MACS |
H3K4me3 [38] | 16972 | 17400 | 17401 | 17403 | SICER |
H3K27me3 [38] | 10233 | 10844 | 11272 | 11271 | SICER |
H3K36me3-1 [38] | 10139 | 10676 | 10747 | 10752 | SICER |
H3K36me3-5 [38] | 13880 | 15192 | 15248 | 15257 | SICER |
H3K36me3-8 [38] | 14079 | 15454 | 15494 | 15564 | SICER |
H3K36me3-pooled [38] | 16326 | 18788 | 18729 | 18509 | SICER |
H3K27me3 (modENCODE:id 1820 vs id 1815) | 1037 | 1959 | 1955 | 2028 | SICER |
H3K27me3 (modENCODE:id 1957 vs id 1961) | 1026 | 1894 | 1893 | 1949 | SICER |
H3K27me3-pooled (modENCODE) | 1118 | 2068 | 2049 | 2172 | SICER |
Moreover, by looking at the locations of the enriched regions along the chromosomes in the two replicates we also measured the number of overlapping/non overlapping intervals as function of the normalization constant. First, we considered the 1037 regions detected as enriched (with default constant) in the first H3K27me3 replicate of modEncode dataset, and the 1026 regions detected as enriched in the second replicate (see Table 4, column 2). We found that 106 regions of the first set did not have any positional overlap with the regions in the second set; and 154 regions of the second set did not have any overlap with the the first set. However, out of the 106 regions above described, 79 are detected as enriched in the second replicate when using NCIS as constant (i.e., they overlap some of the 1893 regions in Table 4, column 4), and out of the 154 regions, 113 overlaps those of the first replicate when using NCIS as constant (i.e., the 1955 regions Table 4, column 4). Such comparisons suggest that a proper estimate of the normalization constant can increase the number of true discoveries: for example 79 peaks that were detected by the first replicated but not by the second replicate using the naive constant due to lack of power, but when the proper normalization constant was used they were indeed detected in both replicates.
Overall, using NCIS for estimating the normalization constant, out of the 1955 regions of the first replicate, 1657 overlap the 1893 regions of the second replicate; and out of the 1893 regions of the second replicate 1542 overlap the 1955 regions of the first replicate.
The effect of the estimated normalization constant on the FDR
As mentioned in the above section, the normalization constant has an impact on the number of regions that are declared significant. Each peak calling algorithm produces an enrichment score (or a p-value) for each region of interest. The greater the enrichment score (the smaller the p-value), the greater the evidence that there is a modification or a binding site in that region. In order to determine the cutoff threshold, above which a region is considered enriched, it has been suggested by [20] and [29] (among others) to use false discovery rate (FDR, [40]) estimation by library swapping.
The estimated normalization constant has a crucial role in determining the cutoff threshold. If the estimate is too large, \(\hat r>r\), the procedure may be overly conservative. If the estimate is too small, \(\hat r<r\), the procedure may detect too many false positives. To see this, we first introduce a peak detection procedure based on library swapping that controls the FDR, and then we discuss the effect of over- or under- estimation of r.
Consider an enrichment score \(g(\tilde {N}_{\textit {ch}} (i), \tilde {N}_{\textit {in}} (i), r)\), where \(\tilde {N}_{\textit {ch}} (i)\) and \(\tilde {N}_{\textit {in}} (i)\) denote the number of reads in the ChIP and Input sample, respectively, in bin i. The swapped score in bin i is therefore \(g(\tilde {N}_{\textit {in}} (i), \tilde {N}_{\textit {ch}} (i), 1/r)\). Let q be the desired FDR level (e.g., q=0.05). Let \(\mathcal {S} = \{i: g(\tilde {N}_{\textit {in}} (i), \tilde {N}_{\textit {ch}} (i), 1/r)\geq g(\tilde {N}_{\textit {ch}} (i), \tilde {N}_{\textit {in}} (i), r)\}\) be the index set of the bins where the enrichment score is higher for the swapped libraries. Let \(\mathcal {S}^{c}\) contain the remaining indices.
Procedure 1.
- 1.Find$${\fontsize{7.6pt}{9.6pt}\selectfont{\begin{aligned} T= \min \left\{ t\in (0,\infty): \frac{1+\#\{i\in \mathcal{S}:g(\tilde{N}_{in} (i), \tilde{N}_{ch} (i), 1/r)\geq t \}}{\#\{i\in \mathcal S^{c}:g(\tilde{N}_{ch} (i), \tilde{N}_{in} (i), r)\geq t \}\lor 1}\leq q \right\}\!. \end{aligned}}} $$
- 2.
Declare all bins with \(g(\tilde {N}_{\textit {ch}} (i), \tilde {N}_{\textit {in}} (i), r)\geq T\), \(i \in \mathcal S^{c}\), as enriched.
Theorem 1.
See Additional file 6 for a proof based on a very nice recent result of [41].
Since r is not known in practice, it is estimated from the data. For a reasonable enrichment score, \(g(\tilde {N}_{\textit {ch}} (i), \tilde {N}_{\textit {in}} (i), r)\) decreases with increasing r, since a larger r implies that a larger fraction of the reads in the ChIP sample are null reads. Consider an estimate \(\hat r>r\). Then \(g(\tilde {N}_{\textit {ch}} (i), \tilde {N}_{\textit {in}} (i), \hat {r})< g(\tilde {N}_{\textit {ch}} (i), \tilde {N}_{\textit {in}} (i), r)\), and \(g(\tilde {N}_{\textit {in}} (i), \tilde {N}_{\textit {ch}} (i), 1/\hat {r})>g(\tilde {N}_{\textit {in}} (i), \tilde {N}_{\textit {ch}} (i), 1/ r)\). Therefore, using \(\hat {r}\) instead of r, may result in a larger set (and, respectively, a smaller set \(\mathcal {S}^{c}\)), and the cut-off threshold based on \(\hat {r}\) will be smaller, thus less discoveries would be made. A similar reasoning suggests that if \(\hat {r}<r\), the cut-off threshold based on \(\hat {r}\) will be larger, thus more discoveries would be made. However, if \(\hat r<r\), there is no longer any guarantee for FDR control, and it may very well be that the actual fraction of false discoveries among the discoveries is far larger than the desired level q. Therefore, in order to keep the desired balance between power and false discovery control, it is crucial to estimate r well.
Remark 2.
- 1.
Sort the bins, so that the bin with index i=1 has the largest enrichment score or swapped score, the bin with index i=2 has the second largest enrichment score or swapped score, etc. Formally, if \(G_{i} = \max \left \{g(\tilde {N}_{\textit {ch}} (i), \tilde {N}_{\textit {in}} (i), r),g(\tilde {N}_{\textit {in}} (i), \tilde {N}_{\textit {ch}} (i), 1/r)\right \}\), the sorted scores satisfy G _{1}≥…≥G _{ B }>0, where B is the total number of bins considered for enrichment.
- 2.Find$$\hat k= \max \left\{ k : \frac{1+\#\{i\leq k, i \in \mathcal{S} \}}{\#\{i\leq k, i \in \mathcal{S}^{c} \}\lor 1}\leq q \right\}. $$
- 3.
Declare the bins \(i=1,\ldots, \hat {k}\) as enriched.
If a peak calling algorithm outputs only all enrichment scores above a cut-off A, then the inference proceeds as follows. The G _{ i }s are computed for all the bins with enrichment score or swapped score above A. Clearly, if bin i was only discovered in the original analysis, then \(i\in \mathcal {S}^{c}\), and if it was only discovered after swapping the libraries, then \(i\in \mathcal {S}\). If the bin was discovered by both original analysis and analysis after swapping, then the bin is in \(\mathcal {S}^{c}\) if the score in the original analysis was highest, and it is in if the score after library swapping is highest. If \(\frac {1+\#\{i \in S \}}{\#\{i \in \mathcal {S}^{c} \}\lor 1}\leq q\), all bins above the cut-off A are declared enriched. Otherwise, find \(\hat k\) as detailed above, and declare the bins \(i=1,\ldots, \hat k\) as enriched. It is straightforward to show that the selection above the cut-off A does not invalidate the procedure, and thus the FDR is still controlled. The cut-off A may however affect the power, since if a lower cut-off than A would have led to more rejections, then the power could have been higher. Therefore, it is desirable to choose a cut-off A below which it is believed that enrichment is unlikely.
Code availability
The diagnostic plot described in this paper has been implemented within the c h i p_d i a g n o s t i c s function in the R language. The c h i p_d i a g n o s t i c s function is described in the Additional file 7.
Conclusions
The analysis of ChIP-seq data has become in the last decade one of the most used methods for obtaining genome-wide maps of protein-DNA interactions and different epigenetic signatures. Despite the many available tools for detecting enriched regions from ChIP-seq experiments, many statistical issues are still open. Among them, we have focused our attention on the problem of estimating the normalization constant when comparing ChIP and Input samples. We developed a simple diagnostic tool for assessing the appropriateness of a given normalization constant when applied to a real dataset. We illustrated the proposed approach on several real datasets consisting of enriched regions of various shapes. We showed the usefulness of such a plot in picking the most reliable constant among few proposals. As a consequence, instead of choosing one specific estimator for all datasets, the user can choose the estimator that is most adequate for the dataset under analysis.
All our examples clearly show that there is no single normalization method that is clearly superior to the other ones, under different settings. The diagnostic tool can identify problematic cases that require a reassessment of the normalization procedure, and it can choose the best estimate among several. However, it cannot serve as an estimate of the normalization constant, even though the diagnostic tool shows graphically the range of reasonable estimates. This is so since from the graph it is very difficult to extract a single number which can serve as an estimate of the normalization constant. Although the left peak of the empirical density of the log relative risks is expected to be around logr, we see from the real and simulated examples that the peak is broad, and that it depends on K. Therefore, it is hard to estimate the normalization constant from the diagnostic plot, and such an estimate will necessarily be unstable and of limited use.
Once the normalization constant is chosen, it should be incorporated in the algorithm of detection of enriched regions in peak calling tools. Unfortunately, most of the tools compute the normalization constant directly in their code. Therefore, to change the constant the user has to access the source code. By suitably modifying MACS and SICER, we show that the list of significant regions can be tuned using \(\hat {r}\). An over-estimate of \(\hat {r}\) will produce an over conservative list of peaks, an under-estimate will increase the number of false positive. We also provide a novel procedure aimed to control the FDR at level q using a sample swapping strategy. This novel procedure can be incorporated in the analysis pipeline to allow a more rigorous control of false positives.
In this paper we have considered the problem of between-sample normalization, and we have limited our attention to linear methods (i.e., a global estimator \(\hat {r}\)). However, the problem of sample normalization in the ChIP-seq context can be more complex than what has been considered here. In fact, several other experimental biases are connected with ChIP-seq normalization, such as CG content, PCR amplification, library preparation etc. This suggests that both within sample and between sample normalization procedures should be applied. Some non-linear methods towards this direction have already been proposed in the context of ChIP-seq [30,31]. Other approaches could be derived from methods available for DNA-seq (see [42] for CG content bias). Our approach does not directly apply to non linear methods and the extension might depend on the type of non liner method that is considered. Such extension is outside the scope of this manuscript, and a direction for future research.
In this paper we have considered the case where a ChIP sample is compared with an Input sample. It is now becoming standard practice to profile more ChIP samples (relative to the same transcription factor or histone modification) collected under different experimental conditions, such as those used to investigate the epigenetic response to different pharmacological treatments or to associate difference in the epigenetic profile to different diseases status. In such cases the problem is to detect regions that are both enriched with respect to the respective Input samples and are differentially enriched among the ChIP samples belonging to different experimental conditions. This context constitutes an emerging area of research with relatively few methods available (see [18] for a review), where normalization is going to play a very important role. Our diagnostic tool is expected to be useful in such settings as well.
Another point of future development is the possibility of incorporating additional source of biological information such as chromatin accessibility obtained from by DNase I digestion [43] in order to create the sub-densities.
Declarations
Acknowledgements
This research was supported by Consiglio Nazionale delle Ricerche (CNR) flagship project EPIGEN. Claudia Angelini would like to thank all co-authors for warm hospitality while visiting Tel Aviv to carry out part of this work. The research of Ruth Heller and Rita Volkinshtein was supported by the Israel Science Foundation (ISF) Grant no. 2012896.
Authors’ Affiliations
References
- Espada J, Esteller M. Epigenetic control of nuclear architecture. Cell Mol Life Sci. 2007; 64:449–57.View ArticlePubMedPubMed CentralGoogle Scholar
- Portela A, Esteller M. Epigenetic modifications and human disease. Nat Biotech. 2010; 28:1057–68.View ArticleGoogle Scholar
- Martens J, Stunnenberg H, Logie C. The decade of the epigenomes?Genes Cancer. 2011; 6:680–7.View ArticleGoogle Scholar
- Barski A, Cuddapah S, Cui K, Roh T, Schones D, Wang Z, et al. High-resolution profiling of histone methylations in the human genome. Cell. 2007; 129:823–37.View ArticlePubMedGoogle Scholar
- Johnson D, Mortazavi A, Myers R, Wald B. Genome-wide mapping of in vivo protein-DNA interactions. Science. 2007; 316:1497–502.View ArticlePubMedGoogle Scholar
- Mikkelsen T, Ku M, Jaffe D, Issac B, Lieberman E, Giannoukos G, et al. Genome-wide maps of chromatin state in pluripotent and lineage-committed cells. Nature. 2007; 448:553–60.View ArticlePubMedPubMed CentralGoogle Scholar
- Robertson G, Hirst M, Bainbridge M, Bilenky M, Zhao Y, Zeng T, et al. Genome-wide profiles of STAT1 DNA association using chromatin immunoprecipitation and massively parallel sequencing. Nat Methods. 2007; 316:1497–502.View ArticleGoogle Scholar
- Park P. ChIP-seq: advantages and challenges of a maturing technology. Nat Rev Genet. 2009; 10:669–80.View ArticlePubMedPubMed CentralGoogle Scholar
- Huss M. Introduction into the analysis of high-throughput-sequencing based epigenome data. Briefings Bioinf. 2010; 11:512–23.View ArticleGoogle Scholar
- Furey T. ChIP-seq and beyond: new and improved methodologies to detect and characterize protein-DNA interactions. Nat Rev Genet. 2012; 13:840–52.View ArticlePubMedPubMed CentralGoogle Scholar
- Ho J, Bishop E, Karchenko P, Negre N, White K, Park P. ChIP-chip versus ChIP-seq: Lessons for experimental design and data analysis. BMC Genomics. 2011; 12:art 134.View ArticleGoogle Scholar
- The ENCODE Project: ENCyclopedia Of DNA Elements. [http://www.genome.gov/10005107]
- Genome Browser: Encyclopedia of DNA Elements. [http://genome.ucsc.edu/ENCODE/]
- The ENCODE Project Consortium. An integrated encyclopedia of DNA elements in the human genome. Nature. 2012; 489(7414):57–74.View ArticlePubMed CentralGoogle Scholar
- Landt SG, Marinov GK, Kundaje A, Kheradpour P, Pauli F, Batzoglou S, et al. ChIP-seq guidelines and practices of the ENCODE and modENCODE consortia. Genome Res. 2012; 22:1813–31.View ArticlePubMedPubMed CentralGoogle Scholar
- Pepke S, Wold B, Mortazavi A. Computation for ChIP-seq and RNA-seq studies. Nat Methods. 2009; 6(11 Suppl):S22–32.View ArticlePubMedPubMed CentralGoogle Scholar
- Wilbanks E, Facciotti M. Evaluation of Algorithm Performance in ChIP-Seq Peak Detections. PLoS ONE. 2010; 5:e11471.View ArticlePubMedPubMed CentralGoogle Scholar
- Bailey T, Krajewski P, Ladunga I, Lefebvre C, Li Q, Liu T, et al. Practical Guidelines for the Comprehensive Analysis of ChIP-seq Data. PLoS Comput Biol. 2013; 9:e1003326.View ArticlePubMedPubMed CentralGoogle Scholar
- Kharchenko P, Tolstorukov M, Park P. Design and analysis of ChIP-seq experiments for DNA-binding proteins. Nat Biotechnol. 2008; 26:1351–9.View ArticlePubMedPubMed CentralGoogle Scholar
- Xu H, Handoko L, Wei X, Ye C, Sheng J, Wei C, et al. A signal noise model for significance analysis of ChIP-seq with negative control. Nat Biotechnol. 2008; 26:1199–204.View ArticleGoogle Scholar
- Ji H, Jiang H, Ma W, Johnson D, Myers R, Wong H. An integrated software system for analyzing ChIP-chip and ChIP-seq data. Nat Biotechnol. 2008; 26:1293–300.View ArticlePubMedPubMed CentralGoogle Scholar
- Rozowsky J, Euskirchen G, Auerbach RK, Zhang ZD, Gibson T, Bjornson R, et al. PeakSeq enables systematic scoring of ChIP-seq experiments relative to controls. Nat Biotechnol. 2009; 27:66–75.View ArticlePubMedPubMed CentralGoogle Scholar
- Spyrou C, Stark R, Lynch AG, Tavaré S. BayesPeak: Bayesian analysis of ChIP-seq data. BMC Bioinf. 2009; 10:299.View ArticleGoogle Scholar
- Zhang Y, Liu T, Meyer C, Eeckhoute J, Johnson D, Bernstein B, et al. Model-based analysis of ChIP-Seq (MACS). Nat Protoc. 2012; 7:1728–40.View ArticlePubMedGoogle Scholar
- Zang C, Schones D, Zeng C, Cui K, Zhao K, Peng W. A clustering approach for identification of enriched domains from histone modification ChIP-Seq data. Bioinformatics. 2009; 25:1952–8.View ArticlePubMedPubMed CentralGoogle Scholar
- Micsinai M, Parisi F, Strino F, Asp P, Dynlacht BD, Kluger Y. Picking ChIP-seq peak detectors for analyzing chromatin modification experiments. Nucleic Acids Res. 2012; 40(9):e70.View ArticlePubMedPubMed CentralGoogle Scholar
- Song Q, Smith A. Identifying dispersed epigenomic domains from ChIP-Seq data. Bioinformatics. 2011; 27:870–1.View ArticlePubMedPubMed CentralGoogle Scholar
- Wang J, Lunyak V, Jordan I. BroadPeak: a novel algorithm for identifying broad peaks in diffuse ChIP-seq datasets. Bioinformatics. 2013; 29:492–3.View ArticlePubMedGoogle Scholar
- Liang K, Keleş S. Normalization on ChIP-seq data with control. BMC Bioinformatics. 2012; 13:199.View ArticlePubMedPubMed CentralGoogle Scholar
- Nair N, Sahu A, Bucher P, Moret BM. ChIPnorm: a statistical method for normalizing and identifying differential regions in histone modification ChIP-seq libraries. PLoS One. 2012; 7:e39573.View ArticlePubMedPubMed CentralGoogle Scholar
- Diaz A, Park K, Lim D, Song JS. Normalization, bias correction, and peak calling for ChIP-seq. Stat Appl Genet Mol Biol. 2012; 11:Article 9.PubMedGoogle Scholar
- Taslim C, Huang K, Huang T, Lin S. Analyzing ChIP-seq data: preprocessing, normalization, differential identification, and binding pattern characterization In: Wang J, Tan AC, Tian T, editors. Methods Mol, Biol. New York: Springer: 2012. p. 275–91.Google Scholar
- Cheng C, Alexander R, Min R, Leng J, Yip KY, Rozowsky J, et al. Understanding transcriptional regulation by integrative analysis of transcription factor binding data. Genome Res. 2012; 22:1658–67.View ArticlePubMedPubMed CentralGoogle Scholar
- Dong X, Greven M, Kundaje A, Djebali S, Brown J, Cheng C. Modeling gene expression using chromatin features in various cellular contexts. Genome Biol. 2012; 213:R53.View ArticleGoogle Scholar
- Agresti A. Categorical Data Analysis, 2nd edition: John Wiley & Sons; 2002.Google Scholar
- Silverman BW. Density Estimation. London: Chapman and Hall; 1986.View ArticleGoogle Scholar
- Zheng W, Zhao H, Mancera E, Steinmetz L, Snyder M. Genetic analysis of variation in transcription factor binding in yeast. Nature. 2010; 464(7292):1187–91.View ArticlePubMedPubMed CentralGoogle Scholar
- Zullo JM, Demarco IA, Piqué-Regi R, Gaffney DJ, Epstein CB, Spooner CJ, et al. DNA sequence-dependent compartmentalization and silencing of chromatin at the nuclear lamina. Cell. 2012; 149(7):1474–87. doi:10.1016/j.ell.2012.04.035.View ArticlePubMedGoogle Scholar
- The model organism ENCyclopedia Of DNA Elements. [http://www.modencode.org/]
- Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc Ser B. 1995; 57(1):289–300.Google Scholar
- Barber RF, Candes E. Controlling the false discovery rate via knockoffs. arXiv:1404.5609. 2014.Google Scholar
- Benjamini Y, Speed TP. Summarizing and correcting for the GC-content bias in high throughput sequencing. Nucleic Acids Res. 2012; 40:10:e72.View ArticlePubMedPubMed CentralGoogle Scholar
- Madrigal P, Krajewski P. Current bioinformatic approaches to identify DNase I hypersensitive sites and genomic footprints from DNase-seq data. Front Genet. 2012; 3:230.View ArticlePubMedPubMed CentralGoogle Scholar
Copyright
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.