Normalization and experimental design for ChIP-chip data
© Peng et al. 2007
Received: 26 March 2007
Accepted: 25 June 2007
Published: 25 June 2007
Chromatin immunoprecipitation on tiling arrays (ChIP-chip) has been widely used to investigate the DNA binding sites for a variety of proteins on a genome-wide scale. However, several issues in the processing and analysis of ChIP-chip data have not been resolved fully, including the effect of background (mock control) subtraction and normalization within and across arrays.
The binding profiles of Drosophila male-specific lethal (MSL) complex on a tiling array provide a unique opportunity for investigating these topics, as it is known to bind on the X chromosome but not on the autosomes. These large bound and control regions on the same array allow clear evaluation of analytical methods.
We introduce a novel normalization scheme specifically designed for ChIP-chip data from dual-channel arrays and demonstrate that this step is critical for correcting systematic dye-bias that may exist in the data. Subtraction of the mock (non-specific antibody or no antibody) control data is generally needed to eliminate the bias, but appropriate normalization obviates the need for mock experiments and increases the correlation among replicates. The idea underlying the normalization can be used subsequently to estimate the background noise level in each array for normalization across arrays. We demonstrate the effectiveness of the methods with the MSL complex binding data and other publicly available data.
Proper normalization is essential for ChIP-chip experiments. The proposed normalization technique can correct systematic errors and compensate for the lack of mock control data, thus reducing the experimental cost and producing more accurate results.
Chromatin immunoprecipitation on microarrays (ChIP-chip) is a technique that has been used primarily for investigating the binding locations of a protein on a genome-wide scale. With the availability of custom tiling arrays, this approach has yielded unprecedented resolution for these binding events. Much of the work so far has focused on binding of transcription factors [1–3], and many computational methods have been developed to identify the bound regions [4–9]. Recently this technology has been extended to the genomic mapping of other features, such as histone modifications, transcriptionally active regions, and binding sites of other protein complexes [12, 13]. In the present work, we examine several issues related to the experimental design and data analysis of ChIP-chip experiments, with a focus on two-color platforms. Despite the increasing popularity of these platforms [11, 14], some basic analytical issues remain unresolved.
One issue is related to the role of mock controls in the design of experiments. Mock control experiments using non-specific antibody or no antibody are often performed together with ChIP-chip experiments to control for sample handling, labeling bias, preferential amplification, and other biases that may occur in the experiment. For experiments with two-channel arrays, the immunoprecipitated sample is hybridized against the input DNA and the mock control is hybridized against the same input DNA. This design allows the same mock control to be used in multiple experiments and tends to give less noise than hybridizing directly against the mock IP, as the amount of input DNA is much larger than that of the mock IP. Without the mock control, it is possible to get false positive bound sites due to an artifact in the experimental procedure; on the other hand, the mock controls increase the experimental cost substantially and may in fact add other artifacts in some cases. The importance of this design issue was underscored in a recent paper. In that work, differential enrichment of genic and intergenic regions was observed in the ChIP-chip experiments for histone occupancy in yeast. However, the mock control data also displayed a similar feature, and there was no substantial differential enrichment once the mock control data were used to normalize the histone occupancy data. If in general the conclusion drawn in a study depends on whether or not the mock control data are used, there is a need to perform mock control experiments in all cases, and the conclusions drawn in those experiments without mock control may be suspect. In fact, in many published works, mock control data sets are missing, and the control experiments do not appear to have been performed.
Another important and related issue is normalization of the data. There are many sources of systematic variation in microarray experiments, and normalization is a computational process for reducing the experimental artifacts, both within each array and between arrays. This issue has been studied extensively for gene expression microarray data [15, 16], and the standard normalization methods for expression arrays have been extended to ChIP-chip experiments. A difficulty with ChIP-chip data, however, is that the distribution of the log-ratios is asymmetric. There is only a 'bump' on the right side of the distribution, corresponding to the binding events. Standard normalization methods for expression analysis assume either that there is a roughly equal number of up- and down-regulated genes or that the proportion of differentially expressed genes is small. Neither assumption is satisfied in general for ChIP-chip data, and standard methods do not work well. Some fixes have been proposed previously, such as mirroring the left side of the distribution to the right to estimate the background distribution. However, we have noticed that the distribution of log-ratios is often not centered at zero or does not appear to have a symmetric background. Quantile normalization, which forces the signals in each array to follow the same distribution, has been used in some cases, but this seems to be too stringent, eliminating the variation in the degree of binding among experiments. The GC content of each probe also influences its hybridization characteristics, and a normalization for this effect has been proposed recently.
In this work, we present a novel normalization method designed specifically for ChIP-chip data and show its effectiveness in correcting dye-bias and other systematic errors. We also find that proper normalization is closely related to the issue of whether mock control should be used: in the absence of proper normalization, mock control is generally necessary to eliminate the effect of systematic errors; but through appropriate normalization, the lack of mock controls may be compensated. We find that the use of the proposed normalization method alone, without the use of mock control, is sufficient to identify the binding events and that the correlation among biological replicates also improves as a result. Furthermore, our normalization strategy also yields insight into the question of differential enrichment of histone occupancy in genic and intergenic regions.
This methodological investigation is made possible by a unique data set we have generated on dosage compensation in Drosophila. The MSL complex is known to bind specifically to the X chromosome to up-regulate the X-linked genes [12, 18]. With tiling array data for the binding of the MSL complex on both the X chromosome and autosomes (2L and 4), analytical methods can be evaluated for their effectiveness in identifying bound regions on the X, with autosomes as control regions. This feature of Drosophila MSL binding also allows evaluation of normalization schemes and other data analysis strategies.
The purpose of the mock control experiment is to correct dye-bias and other systematic errors in ChIP-chip experiments. The correlation observed in Figure 2b indicates that there may indeed be some systematic bias. To examine possible dye-bias, we plot the log-ratio as a function of signal intensity. In this plot, often referred to as the 'MA plot', the log-ratio M = log(R/G) is on the y-axis and the average log-intensity A = log √(RG) is on the x-axis, where R and G are the intensities of the two channels. This is equivalent to a scatterplot of the two channel intensities, but the 45-degree rotation and rescaling make it easier to interpret.
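The MA transform described above can be sketched in a few lines. This is a minimal illustration, not the authors' code: the function name `ma_values` is hypothetical, and base-2 logarithms are an assumption, since the text only says "log".

```python
import math

def ma_values(red, green):
    """Return (M, A) lists for paired channel intensities.

    M = log2(R/G) is the log-ratio and A = log2(sqrt(R*G)) is the
    average log-intensity. Base 2 is an assumption of this sketch.
    """
    m_vals, a_vals = [], []
    for r, g in zip(red, green):
        if r <= 0 or g <= 0:
            raise ValueError("intensities must be positive")
        m_vals.append(math.log2(r / g))
        a_vals.append(0.5 * math.log2(r * g))
    return m_vals, a_vals

# Two hypothetical probes: one with a 4-fold red excess, one balanced.
m, a = ma_values([400.0, 100.0], [100.0, 100.0])
```

The rotation mentioned in the text follows directly from these definitions: log R = A + M/2 and log G = A − M/2, so each (M, A) pair is a rotated and rescaled version of the pair of log channel intensities.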
The use of mock data is one possible solution to this problem. Because the mock data contain the same dye-bias (Figure 4b), subtraction of the mock data from the experiment data would eliminate much of the bias. However, because noise exists in both ChIP-chip and control experiments, a direct subtraction of the mock control has the disadvantage of introducing additional noise. This effect is shown in Figure 4, in which the probes in the small green region selected from the background of ChIP-chip experiment (Figure 4a) are spread out in the background of mock control experiment (Figure 4b) due to noise.
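The trade-off in mock subtraction can be illustrated with a small simulation. It is only a sketch under stated assumptions: the noise in the two arrays is taken to be independent and Gaussian, the shared bias is drawn uniformly, and all numerical values are hypothetical. Under these assumptions, subtraction removes the shared bias but the noise variances add.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

n_probes = 20000
sigma_chip = 0.3  # hypothetical noise SD of the ChIP log-ratios
sigma_mock = 0.3  # hypothetical noise SD of the mock log-ratios

# Shared systematic bias (for example dye-bias) present in both arrays.
bias = [random.uniform(-1.0, 1.0) for _ in range(n_probes)]
chip = [b + random.gauss(0.0, sigma_chip) for b in bias]
mock = [b + random.gauss(0.0, sigma_mock) for b in bias]

# Subtraction cancels the shared bias but adds the two noise variances.
diff = [c - m for c, m in zip(chip, mock)]
sd_diff = statistics.pstdev(diff)
expected = (sigma_chip ** 2 + sigma_mock ** 2) ** 0.5
print(f"SD after subtraction: {sd_diff:.3f} (theory {expected:.3f})")
```

With equal noise levels in the two arrays, the background spread after subtraction is larger by a factor of about √2 than in the ChIP array alone.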
Rather than eliminating systematic variation by subtracting the mock data, we consider the possibility of normalizing each array first. Normalization is a process for reducing the variations within and between arrays of non-biological origin [15, 16]. It is an important step for any array analysis, as it has been shown that different normalization methods can lead to divergent results in expression data analysis. For two-color arrays, the conventional normalization method is to fit a line through the background portion of the data in the MA plot and then to regard this line as the new horizontal line M = 0. Because one cannot distinguish the signal from the background, a robust line fitting method called locally weighted regression and smoothing scatterplots (lowess) is often applied. This technique fits the line through the dense part of the distribution and is resistant to outliers, which are likely to be the signal. While this procedure has been shown to work well for expression arrays, it is unlikely to be effective for ChIP-chip data. The reason is that the signal is only positive, unlike the up- and down-regulation in expression arrays, and that the amount of signal (binding) can be very large, especially for histone modifications or chromatin-associated proteins, as illustrated in Figure 1. These issues were described in the Background section. Indeed, when the lowess method is applied to our data shown in Figure 4, the fitted curve is 'pulled' by the signal (red points) and is far away from the background data (blue points). Because of the heavy bias in this data set, even a more robust version of lowess, such as one based on interquartile range of residuals, is unlikely to work near the high-intensity spots. Other line fitting methods we have tried also gave the same unsatisfactory results.
While it is not difficult to see where the line should be from visual inspection, the problem of fitting a line through the background when the background is mixed with the signal is not trivial.
The combination of rotation and lowess normalization, with no use of mock control data, in fact results in improved correlations among biological replicates compared to the subtraction of the mock data, both at the probe level and at the gene level. For the two ChIP-chip replicates for two cell types (Drosophila embryos and Clone 8 cells), the correlations at the probe level increase from 73% and 78% to 82% and 79%, respectively. For agreement in enrichment at the gene level, the matches increase from 89% and 86% to 96% and 95%, respectively. This may be partly due to the reduction in the noise level as a result of avoiding the subtraction of log-ratios. A further improvement in the identification of the signal can be made by simple smoothing of the profiles (see Figure 1) along their chromosomal location, for both ChIP-chip and the mock control. Regardless of the algorithm used for finding bound regions, the smoothing step suppresses the effect of outlier probes and results in improved performance. The amount of smoothing applied depends on the type of binding sites and the spacing of the probes on the array. In our case, running median smoothing along the probes with a window size of 7 probes gave good results. In contrast to Figure 7a, there are fewer stray probes in Figure 7b and both the signal and background parts appear to be 'tighter', with smaller standard deviations.
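The smoothing step can be sketched as a centered running median over probes ordered by chromosomal position. The function name `running_median` is hypothetical, and truncating the window at the ends of the profile is an assumption, since the text does not say how edges are handled.

```python
import statistics

def running_median(values, window=7):
    """Smooth values with a centered running median.

    Near the ends the window is truncated to the available probes,
    which is an assumption of this sketch.
    """
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd integer")
    half = window // 2
    n = len(values)
    return [
        statistics.median(values[max(0, i - half):min(n, i + half + 1)])
        for i in range(n)
    ]

# Hypothetical log-ratio profile: background near 0 with one outlier
# probe (3.0), followed by a bound region near 2.
profile = [0.1, 0.0, 3.0, -0.1, 0.2, 0.1, 0.0,
           2.1, 2.0, 2.2, 1.9, 2.1, 2.0, 2.2]
smoothed = running_median(profile)
```

In this toy profile the isolated outlier is pulled back toward the background level while the bound region keeps its level, which is the behavior the text attributes to the smoothing step.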
In the previous section, we used the differences of log-ratios from neighboring probes as a means to symmetrize the distribution and perform curve fitting. The same idea can also be used to estimate the level of array-specific noise, which can in turn be used to normalize between arrays. As shown in Figure 5, the differences between consecutive probes are not affected by the amount of binding, other than a few outlier points corresponding to the ends of the bound regions. Thus, a measure of deviation for the distribution of differences (Figure 5d) would serve as a reasonable measure of noise. The median absolute deviation can be used to obtain a robust measure of deviation (see Methods).
To see how the amount of binding affects this estimate, the noise estimates were obtained separately for chromosomes X and 2L as well as the combined data (Figure 8a). For chromosome 2L, there is a clear stabilization of the noise estimate; for chromosome X, it continues to increase due to the long-range correlations, but the increase is very slight. For the mock control, there is almost no distinction between the X and the 2L chromosomes and they reach a plateau around j = 8. The proximity of the curves near j = 8 for X and 2L indicates that the proposed estimate of noise is resistant to differing amounts of binding. This estimate can therefore be used to rescale the log-ratios explicitly or as a basis for determining a significance threshold on each array.
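The noise estimate from lagged differences can be sketched as follows. The function name `lagged_noise` and the simulated data are hypothetical. The scaling factor 1.4826/√2 is used on the assumption of independent normal noise: 1.4826 converts a median absolute deviation to a standard deviation, and 1/√2 accounts for each difference combining two noisy values.

```python
import math
import random
import statistics

def lagged_noise(log_ratios, lag=8):
    """Robust noise estimate from differences of probes `lag` apart."""
    diffs = [log_ratios[i + lag] - log_ratios[i]
             for i in range(len(log_ratios) - lag)]
    if not diffs:
        raise ValueError("need more probes than the lag")
    center = statistics.median(diffs)
    mad = statistics.median([abs(d - center) for d in diffs])
    # 1.4826 converts MAD to SD for a normal distribution; the extra
    # 1/sqrt(2) accounts for a difference of two independent values.
    return 1.4826 / math.sqrt(2.0) * mad

random.seed(1)  # fixed seed so the sketch is reproducible
true_sigma = 0.5
background = [random.gauss(0.0, true_sigma) for _ in range(20000)]
# Same noise with a large hypothetical bound region shifted up by 2.
bound = [x + 2.0 if 5000 <= i < 15000 else x
         for i, x in enumerate(background)]

sigma_background = lagged_noise(background)
sigma_bound = lagged_noise(bound)
```

In this simulation the two estimates are nearly identical, because only the few differences that straddle the ends of the bound region are affected, and the median ignores them. Real profiles also carry long-range correlations, which this toy example does not model.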
To determine the effectiveness of the proposed normalization, we examine how the analysis results change between this and the standard mock subtraction normalization. This is studied in the context of two data sets.
The X-specificity of Drosophila MSL complex binding provides an excellent opportunity for the development of methods for identifying bound regions in ChIP-chip data. We have found that there is significant bias due to the differences in the dye, as seen on the average intensity vs log-ratio plot, and that this must be corrected to obtain accurate estimates of binding sites. One way to fix this problem is via direct subtraction of the mock control, but it appears that the lack of mock control may be compensated through proper normalization steps. We developed a normalization procedure for ChIP-chip experiments based on the differences of the neighboring probes along the chromosome and found that it improves both the correlation of log-ratios at probe level and the overlap at gene level among replicates. Conventional normalization methods may work for ChIP-chip experiments with transcription factors because the proportion of the bound probes is generally small. But with histone modifications or binding of chromatin-associated proteins, that proportion may be much greater and the standard methods do not work well. We also used the same idea to measure array-specific noise for normalization across arrays.
We have examined the Nimblegen and Agilent platforms in detail here, but other studies using these and other two-dye platforms are likely to have experimental artifacts. It is thus important that the data are processed properly to obtain accurate description of the binding sites.
Data from two Drosophila male cell types (Clone 8 and SL2) and late-stage embryos are available for the experiments with the MSL complex. Because the MSL complex appears to bind to some regions on autosomes in SL2 cells, possibly due to genome rearrangement, we used the embryo data for developing and testing data analysis methods. The custom NimbleGen arrays for these experiments contained 388,000 probes each and were designed based on FlyBase 3.2. The X and the 2L chromosomes were tiled with 50-mer probes every 100 bp except for the repetitive regions. For the ChIP-chip experiments, a tandem affinity purification (TAP) tag was added to the C-terminus of the MSL3 protein, which is a component of the MSL complex. The addition of the TAP tag does not affect the function of the MSL complex, and chromatin immunoprecipitation was achieved with an antibody specifically recognizing the TAP tag. Mock control experiments were done following the same protocol as the ChIP-chip experiments but in the absence of the TAP-tagged MSL complex (detailed in ). All data are available from the authors' web site. The array designs and the ChIP-chip data for histone modification and occupancy in yeast were obtained from the ArrayExpress database under the accession number E-WMIT-3.
To measure the noise level in each array, we applied the median absolute deviation to the lagged differences. This is defined as σ*(j) = s × median_i |d_ij − median_i(d_ij)|, where d_ij = x_(i+j) − x_i and x_i is the log-ratio of the ith probe. The scaling factor s ≈ 1.4826/√2 is introduced so that σ* becomes the standard deviation σ when the underlying distribution is normal. There are two parameters to consider in determining whether a region is bound: the threshold for a significant log-ratio value and the minimum number of probes needed to define a cluster of probes. We set the threshold for log-ratios to 2σ*(j = 8) to define "enriched" signal probes (note that this definition is different from the one in Ref  due to the lag). We also set the minimum number of probes in a cluster to 8, based on the correlation length of 800 bp obtained from Figure 8. These values give a high enrichment ratio between the number of sites on the X and the 2L chromosomes, and the false discovery rates based on random permutation are below 0.05.
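The two-parameter rule above can be sketched as a scan for runs of enriched probes. The function name `call_bound_regions` is hypothetical, and requiring the enriched probes to be strictly consecutive is an assumption; the text does not say whether gaps within a cluster are tolerated.

```python
def call_bound_regions(log_ratios, sigma, n_sigma=2.0, min_probes=8):
    """Return (start, end) index pairs, end exclusive, for runs of at
    least `min_probes` consecutive probes above n_sigma * sigma."""
    threshold = n_sigma * sigma
    regions = []
    run_start = None
    for i, x in enumerate(log_ratios):
        if x > threshold:
            if run_start is None:
                run_start = i  # a new run of enriched probes begins
        else:
            if run_start is not None and i - run_start >= min_probes:
                regions.append((run_start, i))
            run_start = None
    # Close a run that extends to the last probe.
    if run_start is not None and len(log_ratios) - run_start >= min_probes:
        regions.append((run_start, len(log_ratios)))
    return regions

# Hypothetical profile: a 10-probe enriched run and a 4-probe run.
values = [0.1] * 5 + [1.5] * 10 + [0.0] * 3 + [1.5] * 4 + [0.2] * 2
regions = call_bound_regions(values, sigma=0.5)
```

With a noise estimate of 0.5 the threshold is 1.0, so only the 10-probe run qualifies; the 4-probe run is enriched but too short to be called a bound region.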
This work was supported by the National Institutes of Health (GM45744 to M.I.K and GM67825 to P.J.P), a grant to E.L., a Leukemia and Lymphoma Society Fellow (5198-05), and the Howard Hughes Medical Institute. M.I.K. is an HHMI Investigator.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.