In the current work we propose a new computational approach to analyze ChIP-chip data using wavelet decomposition. A schematic view of the decomposition process is shown in Figure 1.
We use Coiflets (Coif1) as basis functions for the wavelet decomposition [11], as Coiflets have a nearly symmetrical, peak-like mother wavelet. This shape resembles the tiling array signal profile at the transcription factor binding sites observed in ChIP-chip experiments (see top graph in Figure 2). We chose Coiflets for the wavelet decomposition because they have the maximum number of vanishing moments for a small width (also called the support in the wavelet literature), ensuring a fast convergence rate [11].
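As an illustration, the short sketch below inspects the Coif1 basis. It assumes the PyWavelets package (not used in the paper); the printed low-pass decomposition filter should correspond, up to sign and ordering conventions, to the six coefficients c1,..., c6 quoted later in the algorithm description.

```python
# A minimal sketch (assuming the PyWavelets package) for inspecting the Coif1 basis.
import pywt

coif1 = pywt.Wavelet("coif1")
print(coif1.dec_lo)                    # six low-pass decomposition filter coefficients
print(coif1.vanishing_moments_psi)     # number of vanishing moments of the wavelet
phi, psi, x = coif1.wavefun(level=8)   # scaling and wavelet functions on a fine grid
```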
We applied a thresholding procedure to the wavelet coefficients at different resolutions in order to delineate the regions of biochemical activity of interest at the same confidence level for all relevant length-scales.
Application of wavelet analysis to Pol II and histone modification data
Our goal is to expand the signals in both the red and green channels in a wavelet basis.
The input signal for the wavelet decomposition is derived as follows. We define the signal as a function of the genomic coordinate: inside the non-overlapping part of a probe, and inside half of any part that overlaps with the nearest-neighbor probe along the genomic coordinate, the signal is set equal to the measured intensity of that probe. Each overlapping region is thus divided equally between the two nearest-neighbor probes along the genomic coordinate. The signal is set to zero for missing data (gaps), i.e. genomic regions lacking probes on the tiling array. An example of the input signal for the wavelet decomposition algorithm is shown in Figure 2 (top graph). In this figure, the signal of biochemical activity (Pol II binding sites in this case) contained in the red channel is located between genomic coordinates 1900 and 2200 and also between 1050 and 1250.
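The sketch below illustrates this construction. It is only a hedged illustration: the helper name, the coordinate conventions, and the assumption that probes lie inside the region of interest are ours, not the paper's.

```python
import numpy as np

def probes_to_signal(starts, ends, intensities, region_start, region_end):
    """Build a per-base-pair signal from tiled probes (illustrative helper):
    each base in a probe's non-overlapping part gets that probe's intensity,
    overlaps are split at their midpoint between the two neighboring probes,
    and bases not covered by any probe (gaps) are left at zero."""
    signal = np.zeros(region_end - region_start)
    order = np.argsort(starts)
    starts = np.asarray(starts)[order]
    ends = np.asarray(ends)[order]
    intensities = np.asarray(intensities, dtype=float)[order]
    for i in range(len(starts)):
        lo, hi = starts[i], ends[i]
        # split any overlap with the previous / next probe at its midpoint
        if i > 0 and ends[i - 1] > lo:
            lo = (lo + ends[i - 1]) // 2
        if i + 1 < len(starts) and starts[i + 1] < hi:
            hi = (hi + starts[i + 1]) // 2
        signal[lo - region_start : hi - region_start] = intensities[i]
    return signal
```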
It is important to mention that almost all the wavelet coefficients are greater than one (A(m) > 1), so they can be log-transformed. Only the wavelet coefficients corresponding to the missing data points next to the tiled regions (we assigned zero signal to the missing data points) can take small values (A(m) < 1) or even become negative. We filter out those coefficients while retaining the rest, including those corresponding to peaks. Wavelet coefficients with A(m) > 1 are log-transformed for further analysis.
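A minimal sketch of this filtering and log-transform step (the function name is illustrative):

```python
import numpy as np

def filter_and_log(coeffs):
    """Keep only approximation coefficients A(m) > 1 (coefficients <= 1 arise from
    the zero-filled gaps next to the tiled regions) and log-transform the rest."""
    coeffs = np.asarray(coeffs, dtype=float)
    keep = coeffs > 1.0
    return np.log(coeffs[keep]), keep
```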
The level of wavelet decomposition to be used is defined by the typical length-scale of the signal variation we wish to analyze. This length-scale (the size of the peaks) is in the range of 200-2,000 base pairs for the Pol II ChIP-chip data and in the range of 200-4,000 base pairs for the histone modification data. The width of the wavelet at decomposition level m is approximately 2^m. Hence, we used wavelet decomposition levels m = 8, 9, 10, 11 for the Pol II data (range: 256-2,048) and m = 8, 9, 10, 11, 12 for the histone modification data (range: 256-4,096) to resolve the signal variation length-scales of interest.
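This mapping from peak size to decomposition level can be made explicit. The helper below is illustrative (not from the paper); for the ranges quoted above it reproduces the levels m = 8,..., 11 and m = 8,..., 12.

```python
import math

def levels_for_peak_widths(min_width_bp, max_width_bp):
    """Decomposition levels whose wavelet width (~2**m base pairs) spans the
    peak sizes of interest; e.g. 200-2000 bp -> m = 8..11, 200-4000 bp -> m = 8..12."""
    m1 = math.ceil(math.log2(min_width_bp))
    m2 = math.ceil(math.log2(max_width_bp))
    return list(range(m1, m2 + 1))
```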
Choice of the wavelet decomposition levels
Approximation coefficients for the wavelet decomposition at levels m = 8 through m = 12 should capture the signal on the tiling array due to biochemical activity. The presence of this activity is indicated by the enrichment of the red channel signal relative to the green channel signal. If there is no enrichment of the signal in the red channel relative to the signal in the green channel, we expect the wavelet coefficients for the red and green channels to grow proportionally to each other as a function of the average intensity. Wavelet coefficients corresponding to the regions of the enrichment of the red signal relative to the green signal will exhibit deviation from this main trend. The approximation coefficients A(m) of the input signal for the decomposition levels m = 6, 7, and 8 are shown on the second, third, and fourth graphs from the top in Figure 2. The region of biochemical activity between genomic coordinates 1900 and 2200 is captured by one wavelet approximation coefficient at decomposition level m = 8, three wavelet approximation coefficients at decomposition level m = 7 and five wavelet approximation coefficients at decomposition level m = 6. Broader peaks require higher order wavelet decomposition.
Wavelet coefficients A(m)red vs. A(m)green (m = 8,..., 11) for the signal on the entire array are plotted in Figure 3 (subplots A, D, G, J). Each point on the graph corresponds to a pair of wavelet coefficients of the red- and green-channel signals on the tiling array. As can be seen from the graph, the majority of the points are located inside a triangle-like area bounded by two lines coming from the origin of coordinates. Plotting the same data in logarithmic coordinates, log[A(m)red/A(m)green] vs. log[A(m)green], we see that all points lie inside a stripe-shaped region whose width is roughly independent of the average intensity of the signal (see Figure 3, subplots B, E, H and K). Figure 3 (subplots C, F, I and L) shows the histograms of the distribution of log[A(m)green]. Peaks in the histograms of log[A(m)green] at the resolution levels m = 8, 9, 10 indicate that the scale of the wavelet at those resolution levels is smaller than the size of many contiguously tiled regions on the ENCODE array. Many data points inside the red box in Figure 3 (subplots B, E, H and K) correspond to the peaks inside the contiguously tiled regions whose size is larger than the size of the wavelet used for the signal decomposition. Going back to the original input signal and selecting the regions corresponding to those wavelet coefficients, we can identify only parts of the broad peaks. In order to identify all the broad peaks in their entirety we should combine the information provided by the wavelet coefficients up to the resolution level m = 11. As can be seen in Figure 3 (subplot L), the histogram for A11 becomes flat compared to the histograms for A10, A9 and A8 (subplots C, F and I). This indicates a lack of contiguously tiled regions on the ENCODE tiling array with a size greater than the size of the wavelet at the resolution level m = 11. Wavelet coefficients for the decomposition levels m > 12 contain information from the missing data regions where the signal is zero; consequently, it is impractical to use decomposition levels m > 12.
The probability density function of log[A(m)red/A(m)green] is very close to normal at the resolution levels m = 8,..., 11, as can be seen in Figure 4. The deviation from the normal distribution is due to the regions of high enrichment attributable to specific hybridization. For each resolution level m we can compute the standard deviation σm of log[A(m)red/A(m)green] and threshold the wavelet coefficients relative to σm, allowing us to obtain regions of biochemical activity of interest at the same confidence level for all relevant length-scales. For every wavelet coefficient above the threshold we can go back to the original signal and identify the region of the biochemical activity. The size of each region is related to the resolution level of the corresponding wavelet. At the end, all the detected regions are combined together.
The log-normal distribution is the characteristic feature of multiplicative random processes [16]. One explanation for the appearance of the log-normal distribution in the data is that the measurement process of the fluorescent signal of the tiling array involves multiplicative random factors. These factors can include the collection efficiency of the light during array scanning and the variation of the quantum efficiency of the pixels in the CCD camera. The log-normal distribution was previously observed in the fluorescence microscopy signal [17].
Furthermore, the log-normal distribution of the data could be attributed to the kinetics of the hybridization process on the array.
Consistency property of the wavelet coefficients corresponding to the locations of the peaks
A very interesting feature can be observed in Figure 5: the approximation coefficients for the red channel are consistently above the approximation coefficients for the green channel over the region of biochemical activity across several wavelet decomposition levels. We use this characteristic to decrease the number of false-positive calls. The numerical procedure enforcing this consistency property of the wavelet coefficients is described below.
According to (2.1), the approximation coefficients A(m+1)n at scale (m+1) can be obtained from six approximation coefficients at the finer resolution scale (m):

A(m+1)n = c1 A(m)2n+1 + c2 A(m)2n + c3 A(m)2n-1 + c4 A(m)2n-2 + c5 A(m)2n-3 + c6 A(m)2n-4,

where ck, k = 1,..., 6 are the decomposition low-pass filter coefficients of Coif1: c1 = -0.0157, c2 = -0.0727, c3 = 0.3849, c4 = 0.8526, c5 = 0.3379, c6 = -0.0727.
Three of the six approximation coefficients at the resolution level m, namely A(m)2n-1, A(m)2n-2 and A(m)2n-3, contribute the most to A(m+1)n. Repeating the same argument at the decomposition level m, we find nine approximation coefficients at the resolution level m-1 that contribute the most to the numerical values of these three coefficients. Thus, for the wavelet decomposition using Coif1, there are three approximation coefficients at the resolution level m and nine approximation coefficients at the resolution level m-1 that greatly influence the numerical value of the approximation coefficient A(m+1)n.
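A small sketch of this parent-child bookkeeping is given below; the indexing convention is assumed from the relations quoted above and is used only for illustration.

```python
def dominant_children(n):
    """Indices of the three level-m coefficients contributing most to A(m+1)_n,
    following the relation quoted above (A(m)_{2n-1}, A(m)_{2n-2}, A(m)_{2n-3})."""
    return [2 * n - 1, 2 * n - 2, 2 * n - 3]

def dominant_grandchildren(n):
    """The nine level-(m-1) coefficients obtained by repeating the argument once more."""
    return [g for child in dominant_children(n) for g in dominant_children(child)]
```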
Description of the numerical algorithm
Our algorithm performs wavelet decomposition of the signals in the red and green channels, computes the standard deviations of the distributions of the log-ratios of the wavelet coefficients, thresholds the log-ratios, checks the thresholded wavelet coefficients for consistency, generates hit regions from the wavelet coefficients selected by the algorithm, and estimates the FPR (false positive rate) for the chosen threshold.
Here are the steps of our algorithm:
1) Depending on the range of peak sizes we are looking for, we choose the range of wavelet decomposition levels M1 ≤ m ≤ M2. The width of the wavelet at decomposition level m is approximately 2^m. Hence, we use wavelet decomposition levels M1 ≤ m ≤ M2 to look for peaks of width L with 2^M1 ≤ L ≤ 2^M2. For example, we use M1 = 8, M2 = 11 for the Pol II data and M1 = 8, M2 = 12 for the histone modification data. We compute the wavelet decomposition of the signal at levels M1 - 2 ≤ m ≤ M2 (the two extra levels M1 - 1 and M1 - 2 are computed to check the wavelet coefficients at level M1 for consistency).
2) For every decomposition level m, the probability density function of log[A(m)red/A(m)green] is approximated by a Gaussian and its standard deviation σm is computed.
3) We call the region corresponding to the approximation coefficient A(m)n a hit only if:
a) log[A(m)red/A(m)green] > ασm for this coefficient, where α is the numerical value of the threshold. The threshold value is the same for all M1 - 2 ≤ m ≤ M2. This thresholding allows us to select peaks of different sizes with the same confidence level.
b) The same condition as in a) holds for the log-ratios of at least three of the approximation coefficients at the resolution levels m-1 and m-2 that contribute greatly to A(m)n. Requirements a) and b) impose consistency constraints on the wavelet coefficients across three resolution levels, which helps to reduce the number of false positives. Each time we find a hit we go from A(m)n to A(m-1)2n-1, A(m-1)2n-2, and A(m-1)2n-3, and so on, until we reach the original input signal and identify the region of biochemical activity.
4) We combine each overlapping group of hits into one big hit region. We call N the total number of final hit regions. A code sketch of steps 1)-4) is given below.
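The following sketch illustrates steps 1)-4). It is a simplified illustration rather than the exact implementation: the PyWavelets-based decomposition, the helper names, the particular reading of requirement b) (at least three passing contributors at each of the two finer levels), and the rough mapping of a coefficient back to genomic coordinates are all assumptions made for the example.

```python
import numpy as np
import pywt

def approx_by_level(signal, max_level, wavelet="coif1"):
    """Approximation (low-pass) coefficients A(m) for m = 1..max_level."""
    approx, out = np.asarray(signal, dtype=float), {}
    for m in range(1, max_level + 1):
        approx, _ = pywt.dwt(approx, wavelet, mode="zero")
        out[m] = approx
    return out

def call_hits(red, green, M1, M2, alpha):
    """Steps 1)-4): decompose, threshold log-ratios at alpha*sigma_m,
    check consistency at levels m-1 and m-2, and merge overlapping hits."""
    a_red, a_green = approx_by_level(red, M2), approx_by_level(green, M2)
    log_ratio, sigma = {}, {}
    for m in range(M1 - 2, M2 + 1):
        keep = (a_red[m] > 1) & (a_green[m] > 1)      # drop gap-dominated coefficients
        lr = np.full(a_red[m].shape, -np.inf)
        lr[keep] = np.log(a_red[m][keep] / a_green[m][keep])
        log_ratio[m], sigma[m] = lr, lr[keep].std()
    hits = []
    for m in range(M1, M2 + 1):
        for n in np.flatnonzero(log_ratio[m] > alpha * sigma[m]):
            kids = [2 * n - k for k in (1, 2, 3) if 0 <= 2 * n - k < len(log_ratio[m - 1])]
            grand = [2 * c - k for c in kids for k in (1, 2, 3)
                     if 0 <= 2 * c - k < len(log_ratio[m - 2])]
            # one possible reading of requirement b): at least three passing
            # coefficients among the dominant contributors at each finer level
            if (sum(log_ratio[m - 1][c] > alpha * sigma[m - 1] for c in kids) >= 3
                    and sum(log_ratio[m - 2][g] > alpha * sigma[m - 2] for g in grand) >= 3):
                # rough genomic span of the coefficient; the paper instead traces the
                # dominant children all the way back to the original input signal
                hits.append((n * 2 ** m, (n + 1) * 2 ** m))
    return merge_overlaps(sorted(hits))

def merge_overlaps(intervals):
    """Step 4): combine each overlapping group of hits into one hit region."""
    merged = []
    for start, end in intervals:
        if merged and start <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged
```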
We can estimate the false positive rate (FPR) corresponding to the chosen value of the threshold α. The signal of the red channel is randomly shuffled between the probes, and we repeat the same scoring procedure for the shuffled signal, keeping M1, M2 and α the same. The resulting hits are false hits, which allows us to estimate the false positive rate as FPR = Nshuff/N, where N is the total number of hits without random shuffling and Nshuff is the total number of hits after the random shuffling of the red signal probes. One can then choose α according to the corresponding FPR.
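A short sketch of this shuffle-based FPR estimate, reusing the hypothetical call_hits helper from the previous sketch; note that the paper shuffles the red intensities between probes before the per-base signal is rebuilt, whereas for brevity this sketch permutes the per-base red signal directly.

```python
import numpy as np

def estimate_fpr(red, green, M1, M2, alpha, seed=0):
    """FPR = N_shuff / N for a given threshold alpha (illustrative only)."""
    rng = np.random.default_rng(seed)
    n_hits = len(call_hits(red, green, M1, M2, alpha))
    n_shuff = len(call_hits(rng.permutation(red), green, M1, M2, alpha))
    return n_shuff / n_hits if n_hits else float("nan")
```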
Figure 5 shows Pol II binding sites located between genomic coordinates 1900 and 2200 and also between 1050 and 1250. Two regions are identified as hits by our algorithm with parameters M1 = 8, M2 = 11 and the threshold α corresponding to the estimated FPR = 5%. The wavelet coefficients which satisfy condition 3) and the regions of the signal corresponding to those wavelet coefficients are indicated by the red boxes.
Figure 6 shows a snapshot from the Affymetrix Integrated Genome Browser (IGB) displaying broader peaks of the Pol II ChIP-chip data on the Nimblegen ENCODE tiling array. Raw signals for the green and red channels are shown as blue and pink tracks. We again used the same set of parameters: M1 = 8, M2 = 11 and the threshold α corresponding to an estimated FPR of 5%. Yellow bars indicate the hit regions corresponding to the wavelet coefficients at the resolution levels m = 8,..., 11 satisfying condition 3) of our numerical algorithm. Hit regions obtained by combining the information from these resolution levels (combining together overlapping yellow bars) are shown as red bars (step 4 of our algorithm).
Comparison with other methods
In order to test the performance of our method with the consistency constraint described above, we applied our algorithm to the spike-in data from the ENCODE Nimblegen tiling array. Mixtures of human genomic DNA and 100 laboratory-generated human sequences of approximately the same size were hybridized on the array at various concentrations [5]. The spike-in data allow for an objective estimation of the performance of our method and a comparison with other methods.
We used the model parameters M1 = 8, M2 = 9 (to identify narrow peaks) and varied α to obtain an ROC-type curve. Our choice of the model parameters was based on our observation that the size of the wavelets should be comparable to the size of the peaks to be identified. ROC-type curves were generated by plotting the sensitivity (i.e. the number of true positives/100) vs. the false positives ratio (i.e. the number of false positives/100). The optimal ROC-type curve is the one closest to the upper left corner.
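As an illustration of how such a curve can be traced, the sketch below again reuses the hypothetical call_hits helper and assumes the spike-in sequences are available as known (start, end) intervals.

```python
def roc_points(red, green, spike_ins, alphas, M1=8, M2=9):
    """(false-positive ratio, sensitivity) pairs obtained by varying alpha,
    given the 100 known spike-in intervals as (start, end) tuples."""
    def overlaps(a, b):
        return a[0] < b[1] and b[0] < a[1]
    points = []
    for alpha in alphas:
        hits = call_hits(red, green, M1, M2, alpha)
        tp = sum(any(overlaps(h, s) for h in hits) for s in spike_ins)
        fp = sum(not any(overlaps(h, s) for s in spike_ins) for h in hits)
        points.append((fp / len(spike_ins), tp / len(spike_ins)))
    return points
```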
We compared our method with other methods described in [5]: MA2C, Splitter, Permu, ACME, TAMALg, and TAMALs. MA2C (Model-based Analysis for 2-Color arrays) [8] is a normalization method based on the GC content of the probes. It corrects each probe's log(Cy5/Cy3) ratio for the GC bias and weights each probe, taking into account the signal variance of the GC group to which the probe belongs. A sliding window of 500 bp was used, and windows with high median values were identified as hits. The Splitter algorithm [7] incrementally changes the cutoff value of the signal and compares the total number of hits before and after the change. If the ratio of the number of hits before and after the cutoff change is smaller than a pre-defined value, the algorithm stops and the hits before the last cutoff change are reported as final hits. Clusters of probes located closer than a "maxgap" parameter are merged together, and clusters containing fewer probes than a "minrun" parameter are discarded. Permu [6] identifies peaks within a sliding window using an iterative thresholding procedure; an FPR (false positive rate) is assigned to each peak using randomized data. TAMALg and TAMALs are two versions of the same peak-finding algorithm [18] that use different stringency levels.
Our method demonstrated excellent performance compared to other methods, as can be seen from the ROC-type curve in Figure 7. The intuitive reason behind such good performance of our method is that the shape of the wavelets we use is very similar to the shape of the peaks in the signal. Another reason for the good performance is the use of the consistency constraint, which reduces the number of false positives.
We also tested our method using the STAT1 data from ENCODE Nimblegen arrays in [19]. We used the following parameters for our model: M1 = 8, M2 = 12 and the threshold α corresponding to the estimated false positive rate FPR = 5%. Most of the hits (84%) obtained by our method overlap with hits reported in [19].