Normalization of ChIP-seq data with control

Background ChIP-seq has become an important tool for identifying genome-wide protein-DNA interactions, including transcription factor binding and histone modifications. In ChIP-seq experiments, ChIP samples are usually coupled with their matching control samples. Proper normalization between the ChIP and control samples is an essential aspect of ChIP-seq data analysis. Results We have developed a novel method for estimating the normalization factor between the ChIP and the control samples. Our method, named as NCIS (Normalization of ChIP-seq) can accommodate both low and high sequencing depth datasets. We compare statistical properties of NCIS against existing methods in a set of diverse simulation settings, where NCIS enjoys the best estimation precision. In addition, we illustrate the impact of the normalization factor in FDR control and show that NCIS leads to more power among methods that control FDR at nominal levels. Conclusion Our results indicate that the proper normalization between the ChIP and control samples is an important step in ChIP-seq analysis in terms of power and error rate control. Our proposed method shows excellent statistical properties and is useful in the full range of ChIP-seq applications, especially with deeply sequenced data.


Linearity between ChIP and control samples
We demonstrate that a proper control sample correlates linearly with the background parts of its corresponding ChIP sample. In the following examples, we first draw the original ChIP vs control bins counts to show the over-abundance of high ChIP count bins due to binding signals. Then we filter the strong binding signals by calling peaks with SPP (Kharchenko et al., 2008) at FDR 0.1 level and exclude all the bins that intersect with the peaks. Then the remaining ChIP and control bin counts show clear linear trend and are roughly symmetric around the normalization factor estimated through NCIS. Zheng et al. (2010) studied transcription factor Ste12 in various yeast strains. The ChIP and control samples bin counts of one of the strains (segregant 1) are plotted in Figure 1a. After peak-calling and exclusion of the bins that intersect with peaks, the remaining ChIP and control bin counts are plotted in Figure 1b.    Kasowski et al. (2010) studied binding of the transcription factor NFκB in ten human cell lines. The ChIP and control samples bin counts of cell line GM12878 are plotted in Figure 2a. After peak-calling and exclusion of the bins that intersect with peaks, the remaining ChIP and control bin counts are plotted in Figure 2b.

Human NFκB data
There are some noticeable outliers where control sample has much larger read counts than ChIP sample. For example, the rightmost point in Figure 3b can be traced back to a region on chromosome 1, where the read count per nucleotide is plotted in Figure  4. The reads on each strand are not concentrated on a single nucleotide but rather on a stretch of 30 nucleotides. The footprint of a typical transcription factor binding site on a single strand should be larger than the average fragment length, which is about 200 bp in this case. Also given the fact that the control sample has much larger read count than the ChIP sample, these reads are most likely the result of some artifacts. Similarly, most outliers that appeared below the normalization line in Figure 5 of the main text can be traced to a region in chromosome 8 and is plotted in Figure 5.

Precision
In this section, we present simulation results based on a deeply sequenced C.elegans ChIP-seq dataset (Zhong et al., 2010). In this dataset, the control sample of the first stage of larval development under starvation has about 26.8 million uniquely mapped reads. The C.elegans genome has roughly 100 million base pairs and is about 33 times smaller than the human genome. Thus, in terms of genome coverage, the C.elegans dataset is on the same order as the yeast dataset used in the main text. We performed simulations with parameters identical to those used in the main text for the yeast data. Although the absolute values of MSE change due to the changes in genome, the relative performances of different normalization factor estimators remain mostly the same.

FDR control and power
We follow the same procedure as in Section "FDR control and power" in the main text except we selected 5000 sites from candidate sites of 7725 predicted from the C.elegans data (SPP FDR level 0.01) in each iteration. The added artifacts leads to a negative bias in PeakSeq normalization estimation due to its regression approach's sensitivity to the influential points. As a result, FDR with PeakSeq is out of control. On average, NCIS is about 14% more powerful than SPP, which is the only other method maintain proper FDR control, across different sequencing depths. The purpose of normalization in ChIP-seq analysis is to make the ChIP and the control samples comparable. We define the ChIP and the control samples as balanced when π 0 N 1 = N 2 , or equivalently, r = 1. The balance can be judged in practice by checking whetherπ 0 N 1 = N 2 after obtainingπ 0 , an estimate of π 0 . The theory of the sample-swapping method for estimating FDR in ChIP-seq data analysis was studied by Xu et al. (2010) under a balanced setting. However, almost all ChIPseq datasets exhibit imbalance. One strategy to deal with unbalanced data is to subsample the larger of the ChIP and control samples to achieve balance. This strategy has been advocated and practiced in Xu et al. (2010) and Smagulova et al. (2011). The obvious drawback of the subsampling strategy is that part of the samples will not be utilized. To address this issue, a strategy has been proposed in Xu et al. (2010) to resample multiple copies of balanced data and merge the results so that all reads of the samples can have a chance to contribute to the final result. We hypothesize that by incorporating the normalization factor into significance score, the loss of data in the the subsampling strategy and the added computational complexity of resampling strategy can be avoided. Let g(a i , b i , r) be a normalized significance score function based on ChIP count a i , corresponding control count b i of region i and normalization factor r when comparing the ChIP sample to the control sample. Define E as the collection of nucleotides at which ChIP signal is enriched. We now focus on the positive FDR (pFDR) which is proposed in Storey (2003) and can be defined in ChIP-seq context as where s is a significance threshold.
Theorem 1 Under the conditions: , the estimated pFDR at a threshold s can be approximated as for large s.
The first condition requires the normalized significance scores to have similar tail distributions in background regions of the ChIP and control samples. The second condition assumes good separation of the significance scores of g(a i , b i , r) and g(b i , a i , 1/r) when the region i is not entirely within background regions. When r = 1, the approximation in condition (a) is exact.

Proof of Theorem 1
pFDR(s) = P(i ⊂ E|g(a i , b i , r) ≥ s) where the last step is by condition (a). From condition (b), we also have P(g(b i ,a i ,1/r)≥s|i ⊂E)P(i ⊂E) P(g(a i ,b i ,r)≥s) P(g(a i ,b i ,r)≥s|i ⊂E)P(i ⊂E) P(g(a i ,b i ,r)≥s) ≤ 1 Thus, pFDR(s) ≈ P(g(b i , a i , 1/r) ≥ s|i ⊂ E)P(i ⊂ E) P(g(a i , b i , r) ≥ s) which can be approximated as