 Methodology article
 Open Access
 Published:
Tiling array data analysis: a multiscale approach using wavelets
BMC Bioinformatics volume 12, Article number: 57 (2011)
Abstract
Background
Tiling array data is hard to interpret due to noise. The wavelet transformation is a widely used technique in signal processing for elucidating the true signal from noisy data. Consequently, we attempted to denoise representative tiling array datasets for ChIPchip experiments using wavelets. In doing this, we used specific wavelet basis functions, Coiflets, since their triangular shape closely resembles the expected profiles of true ChIPchip peaks.
Results
In our wavelettransformed data, we observed that noise tends to be confined to small scales while the useful signalofinterest spans multiple large scales. We were also able to show that wavelet coefficients due to nonspecific crosshybridization follow a lognormal distribution, and we used this fact in developing a thresholding procedure. In particular, wavelets allow one to set an unambiguous, absolute threshold, which has been hard to define in ChIPchip experiments. One can set this threshold by requiring a similar confidence level at different lengthscales of the transformed signal. We applied our algorithm to a number of representative ChIPchip data sets, including those of Pol II and histone modifications, which have a diverse distribution of lengthscales of biochemical activity, including some broad peaks.
Conclusions
Finally, we benchmarked our method in comparison to other approaches for scoring ChIPchip data using spikeins on the ENCODE Nimblegen tiling array. This comparison demonstrated excellent performance, with wavelets getting the best overall score.
Background
Tiling arrays have recently become widely used to investigate thousands of biochemical reactions in a parallel fashion. In oligonucleotide microarrays the probes are designed to match parts of the known genomic sequence. Genome tiling arrays include overlapping oligonucleotides designed to cover an entire genomic region of interest. These arrays are able to simultaneously monitor the expression of thousands of genes [1] as well as to identify the transcription factor binding sites [2]. Transcription factors are regulatory proteins which bind to DNA and control the gene transcription or biochemical activity of other regulatory proteins. The experimental technique to identify these regions of activity of the regulatory proteins on DNA involves the hybridization of immunoprecipitated DNA on a tiling microarray (ChIPchip experiments)[3]. In this paper we examine data from two ChIPchip experiments in order to identify regions of activity.
One common feature of all microarray experiments is that the signal of interest due to biochemical reactions is contaminated by noise. This noise can be mainly attributed to nonspecific crosshybridization. In the ideal case the oligonucleotides on the microarray only bind targets with exactly the right complementary sequences. In reality, however, lower affinity binding with other, imperfect sequences (known as mismatches) also occurs. In order to correct for this nonspecific binding two samples labeled with red and green fluorescent dyes (Cy3 and Cy5) are hybridized on the same tiling array simultaneously. One sample, labeled with red dye, contains DNA immunoprecipitated with antibodies against the transcription factor of interest. Another sample, labeled with green dye, is derived from nonimmunoprecipitated, genomic DNA and is used as a negative control.
Several statistical techniques have been developed for microarray data analysis [4–10]. Experimental data coming from the Pol II and histone modification ChIPchip data show a broad distribution of the lengths of biochemicallyactive regions. The signalofinterest on the tiling microarray exhibits trianglelike peaks of different widths [9]. There is both a scale and intensity separation between the useful signal and the noise in ChIPchip experiments. Most of the existing ChIPchip data analysis methods do not explicitly exploit this feature.
Signal decomposition using basis functions (wavelets) that resemble peaks makes this separation more apparent. Numerical implementation of the wavelet transformation is called Discrete Wavelet Transform (DWT). Computational cost of DWT scales linearly with the number of input data points (~O(N)). The same is also true for the computation cost of the moving window averaging. One of the major advantages of the DWT over the moving average window method is that DWT gives explicit representation of the signal at different lengthscales and the possibility to choose the type of wavelets for the DWT closely resembling the shape of the peaks we are trying to identify. One of the existing ChIPchip data analysis methods [10] uses wavelets only as a denoising tool by thresholding the wavelet decomposition at a fixed level.
The peaks were detected not from the wavelet decomposition, but by applying Laplacian of a Gaussian (LoG) edge detector. We describe a method which is capable of delineating peaks of different sizes from the wavelet decomposition coefficients at different levels, and the range of sizes is determined by the algorithm parameters. Our method does not require any data preprocessing. Statistical analysis of the wavelet coefficients produces the probability density function of the signal intensity due to the nonspecific hybridization and makes it possible to conduct an unbiased identification of protein binding site locations. Below we describe how to employ a wavelet algorithm to analyze the experimental data.
Methods
Mathematical formalism
In this section we will provide an overview of wavelet decomposition. The term "wavelets" is used to refer to a set of localized basis functions which posses a special structure. Wavelets are defined by two mutually orthogonal functions: scaling function ϕ and mother wavelet ψ. The rest of the basis functions can be obtained by performing dilation and translation operations with the scaling function and the mother wavelet. The fundamental idea behind wavelet analysis is that one can separate data based upon its scale. A very comprehensive review of wavelets can be found in [11].
In this work we use the discrete wavelet transform (DWT) algorithm proposed by Mallat [12]. This algorithm uses scales and positions based on powers of two (i.e. dyadic scales and positions). Mallat's algorithm takes the discrete input signal A(m) of length 2N and decomposes it into two signals: A(m+1) and D(m+1), each of length N. The results of this process are called approximation coefficients and detail coefficients. For ChIPchip data, the approximation coefficients represent a relevant signal, whereas the detail coefficients represent noise. The approximation coefficients at scale (m+1) can be obtained from the approximation coefficients at the finer resolution scale (m) as follows [11]:
where c_{ k } , k = 1,.., 6 are the decomposition lowpass filter coefficients. For the Coif1 wavelet the numerical values are [13]:
c_{1} = 0.0157, c_{2} = 0.0727, c_{3} = 0.3849, c_{4} = 0.8526, c_{5} = 0.3379, c_{6} = 0.0727.
Similarly the detail coefficients at scale (m+1) can be obtained from the approximation coefficients at scale (m) as follows:
where b_{ k } , k = 1,.., 5 are the decomposition highpass filter coefficients. For the Coif1 wavelet the numerical values of the coefficients are:
b_{1} = 0.0727, b_{2} = 0.3379, b_{3} = 0.8526, b_{4} = 0.3849, b_{5} = 0.0727, b_{6} = 0.0157.
A schematic view of the decomposition process is shown in Figure 1.
Equations (1) and (2) perform lowpass and highpass filtering of the input signal. Approximation coefficients retain a lowfrequency, smoothed version of the signal; whereas the detail coefficients capture the lowscale, high frequency component of the signal. Procedures (1) and (2) can be performed recursively using A(m+1) as the input signal. In practice, the discrete input signal is treated as the set of approximation coefficients at scale m = 0, and multilevel wavelet decomposition is performed using formulas (1) and (2). The decomposition (1)(2) is reversible. Approximation coefficients at scale (m) can be reconstructed from the approximation and detail coefficients at scale (m+1):
For the Coif1 wavelet, the lowpass filter coefficients f and highpass filter coefficients g are [13]:
The pyramidal structure of the algorithm makes signal decomposition (1)(2) and signal reconstruction (3) computationally very efficient [11]. Additional file 1 Figure S1 shows the highpass and lowpass filter coefficients for the decomposition and reconstruction procedures.
Specific datasets used for the analysis
All the analysis was performed using data from Nimblegen's ENCODE tiling arrays. The goal of the ENCODE (Encyclopaedia of DNA Elements) project [14] is to identify functional elements from a representative 1% of the human genome. This part of the human genome is represented on the Nimblegen ENCODE tiling array. The single array contains more than 384,000 unique 50mer probes selected from 30 megabases of human sequence data specified by the ENCODE PROJECT CONSORTIUM [14]. These probes are spaced apart every 38 bases on the average, thus creating a 12base overlap between probes. No probes were included for interspersed repetitive DNA, thus there are gaps in genome tiling paths on the array. The POL II (CTD4H8) and H3K36ME3 histone modification data were obtained from [15]. It is interesting to use these datasets to test our algorithm because they have broad peaks.
Results and Discussion
In the current work we propose a new computational approach to analyze ChIPchip data using wavelet decomposition. A schematic view of the decomposition process is shown in Figure 1.
We use Coiflets (Coif1) as basis functions for the wavelet decomposition [11], as Coiflets have a nearly symmetrical, peaklike form of the mother wavelet. This shape resembles the tiling array signal profile at the transcription factor binding sites observed in ChIPchip experiments (see top graph in Figure 2). We chose Coiflets for the wavelet decomposition due to their properties of having the maximum number of zero moments while also having small widths (also called support in wavelet literature), ensuring a fast convergence rate [11].
We applied a thresholding procedure to the wavelet coefficients at different resolutions in order to delineate the regions of biochemical activity of interest at the same confidence level for all relevant lengthscales.
Application of wavelet analysis to Pol II and histone modification data
Our goal is to expand both signals in red and green channels using a wavelet basis.
The input signal for the wavelet decomposition is derived as follows. We define the signal as a function of the genomic coordinate to be equal to the measured intensity of this probe for the genomic coordinates inside of the nonoverlapping part of the probe, as well as for half of the part which overlaps with the nearestneighbor probe along the genomic coordinate. Each overlapping region is divided equally between two nearest probes along the genomic coordinates. The signal is assumed to be zero for the missing data (gaps) in the genomic regions lacking probes on the tiling array. An example of the input signal for the wavelet decomposition algorithm is shown in Figure 2 (top graph). In this figure, the signal of biochemical activity (Pol II binding site in this case) contained in the red channel is located between genomic coordinates 1900 and 2200 and also between 1050 and 1250.
It is important to mention that almost all the wavelet coefficients are greater than one (A(m) > 1). As a result it is possible to log transform them. Wavelet coefficients corresponding only to the missing data points (we assigned zero signal to the missing data points) next to the tiled regions can take small values values (A(m) < 1) and even can become negative. We filter out those coefficients while retaining the rest, including those corresponding to peaks. Wavelet coefficients A(m) > 1 are logtransformed for further analysis.
The level of wavelet decomposition to be used is defined by the typical lengthscale of the signal variation we wish to analyze. This lengthscale (the size of the peaks) is in the range of 2002000 base pairs in the ChIPchip experiments for Pol II data and in the range of 2004000 base pairs for the histone modification data. The width of the wavelet at the composition level m is approximately 2 ^{m} . Hence, we used wavelet decomposition levels m = 8, 9, 10, 11 for the Pol II data (range: 2562,048) and m = 8, 9, 10, 11, 12 for the histone modification data (range: 2564,096) to resolve the signal variation lengthscales of interest.
Choice of the wavelet decomposition levels
Approximation coefficients for the wavelet decomposition at levels m = 8 through m = 12 should capture the signal on the tiling array due to biochemical activity. The presence of this activity is indicated by the enrichment of the red channel signal relative to the green channel signal. If there is no enrichment of the signal in the red channel relative to the signal in the green channel, we expect the wavelet coefficients for the red and green channels to grow proportionally to each other as a function of the average intensity. Wavelet coefficients corresponding to the regions of the enrichment of the red signal relative to the green signal will exhibit deviation from this main trend. The approximation coefficients A(m) of the input signal for the decomposition levels m = 6, 7, and 8 are shown on the second, third, and fourth graphs from the top in Figure 2. The region of biochemical activity between genomic coordinates 1900 and 2200 is captured by one wavelet approximation coefficient at decomposition level m = 8, three wavelet approximation coefficients at decomposition level m = 7 and five wavelet approximation coefficients at decomposition level m = 6. Broader peaks require higher order wavelet decomposition.
Wavelet coefficients A(m)_{red} vs. A(m)_{green} (m = 8,..., 11) for the signal on the entire array are plotted in Figure 3 (subplots A, D, G, J). Each point on the graph corresponds to the pair of the wavelet coefficients of the signals of the red and green channels on the tiling array. As can be seen from the graph, the majority of the points are located inside of the trianglelike area bounded by two lines coming from the origin of the coordinates. Plotting the same data using logarithmic coordinates log[A(m)_{red} /A(m)_{green} ] vs. log[A(m)_{green} ], we see that all points lay inside of the stripeshaped region and that the width of this region is roughly independent of the average intensity of the signal (see Figure 3 (subplots B, E, H and K)). Figure 3 (subplots C, F, I and L) shows the histograms of the distribution function for the log[A(m)_{green} ]. Peaks on the histograms for log[A(m)_{green} ] at the resolution levels m = 8,9,10 indicate that the scale of the wavelet at those resolution levels is smaller than the size of many contiguously tiled regions on the ENCODE array. Many data points inside the red box in Figure 3 (subplots B, E, H and K) correspond to the peaks inside the contiguously tiled regions whose size is larger than the size of the wavelet used for the signal decomposition. We can only identify parts of the broad peaks by going back to the original input signal and selecting the regions corresponding to those wavelet coefficients. In order to identify all the broad peaks in their entirety we should combine the information provided by the wavelet coefficients up to the resolution level m = 11. As can be seen on Figure 3 (subplot L) the histogram for A11 becomes flat compared to the histograms for A10, A9 and A8 (subplots C, F and I). This is an indication that there is a lack of contiguously tiled regions on the Encode tiling array with a size greater than the size of the wavelet at the resolution level m = 11. Wavelet coefficients for the decomposition levels m > 12 contain the information from the missing data regions where the signal is zero. Consequently, It makes it impractical to use the decomposition levels m > 12.
The probability density function of log[A(m)_{red}/A(m)_{green} ] is very close to normal at the resolution levels m = 8,..., 11; as can be seen in Figure 4. The deviation from the normal distribution is due to the regions of high enrichment attributable to the specific hybridization. For each resolution level m we can compute the standard deviation σ_{ m } of log[A(m)_{red} /A(m)_{green} ] and threshold the wavelet coefficients relative to σ_{ m } , allowing us to obtain regions of biochemical activity of interest at the same confidence level for all relevant lengthscales. For every wavelet coefficient above the threshold we can go back to the original signal and identify the region of the biochemical activity. The size of each region is related to the resolution level of the corresponding wavelet. At the end, all the detected regions are combined together.
The lognormal distribution is the characteristic feature of multiplicative random processes [16]. One explanation for the appearance of the lognormal distribution in the data is that the measurement process of the fluorescent signal of the tiling array involves multiplicative random factors. These factors can include the collection efficiency of the light during array scanning and the variation of the quantum efficiency of the pixels in the CCD camera. The lognormal distribution was previously observed in the fluorescence microscopy signal [17].
Furthermore, the lognormal distribution of the data could be attributed to the kinetics of the hybridization process on the array.
Consistency property of the wavelet coefficients corresponding to the locations of the peaks
A very interesting feature can be observed from Figure 5: Approximation coefficients for the red channel are consistently above the approximation coefficients for the green channel over the region of the biochemical activity across several wavelet decomposition levels. We use this characteristic to decrease the number of falsepositive calls. We describe the numerical procedure ensuring the consistency property of the wavelet coefficients below.
According to (2.1) the approximation coefficients A(m+1)_{ n } at scale (m+1) can be obtained from six approximation coefficients at the finer resolution scale (m):
where c_{ k } , k = 1,.., 6 are the decomposition lowpass filter coefficients of Coif1: c_{1} = 0.0157, c_{2} = 0.0727, c_{3} = 0.3849, c_{4} = 0.8526, c_{5} = 0.3379, c_{6} = 0.0727.
Three of the six approximation coefficients at the resolution level m: A(m)_{2n1}, A(m)_{2n2}and A(m)_{2n3}contribute the most to A(m+1)_{ n }. Repeating the same argument for the decomposition level m we find nine approximation coefficients at the resolution level m1 contributing mostly to the numerical values of three approximation coefficients at the resolution level m. For the wavelet decomposition using Coif1 there are nine approximation coefficients at the resolution level m1 and three approximation coefficients at the resolution level m that greatly influence the numerical value of the approximation coefficient A(m+1)_{ n }.
Description of the numerical algorithm
Our algorithm performs wavelet decomposition of the signals in red and green channels, computes the standard deviations of the distribution functions of the logratios of of the wavelet coefficients, thresholds the logratios, checks the thresholded wavelet coefficients for consistency, generates hit regions from the wavelet coefficients selected by the algorithm and estimates the FPR (false positive rate) for the chosen threshold.
Here are the steps of our algorithm:

1)
Depending on the range of peaks we are looking for, we choose the range of wavelet decomposition levels: M _{1} ≤ m ≤ M _{2}. The width of the wavelet at the decomposition level m is approximately 2^{m} . Hence, we use wavelet decomposition levels M _{1} ≤ m ≤ M _{2} to look for the peaks of width L: {2}^{{M}_{1}}\le L\le {2}^{{M}_{2}}. For example, we use M _{1} = 8, M _{2} = 11 for POL II data and M _{1} = 8, M _{2} = 12 for the histone modification data. We compute the wavelet decomposition of the signal at levels M _{1}  2 ≤ m ≤ M _{2} (two extra levels M _{1}  1 and M _{1}  2 are computed to check the wavelet coefficients at level M _{1} for consistency).

2)
For every decomposition level m, the probability density function of \mathrm{log}\left(\frac{(A{(m)}_{n\phantom{\rule{0.5em}{0ex}}red})}{(A{(m)}_{n\phantom{\rule{0.5em}{0ex}}green})}\right) is approximated by a Gaussian and the standard deviation σ_{ m } is computed.

3)
We call a region corresponding to the approximation coefficient A(m) _{ n } a hit only if:

a)
\mathrm{log}\left(\frac{(A{(m)}_{n\phantom{\rule{0.5em}{0ex}}red})}{(A{(m)}_{n\phantom{\rule{0.5em}{0ex}}green})}\right)>\alpha {\sigma}_{m}
, where α is the numerical value of the threshold. The threshold value is the same for all M _{1}  2 ≤ m ≤ M _{2}. The thresholding allows us to select peaks of different sizes with the same confidence level.

b)
The same as in 1) is true for the log ratios of at least three approximation coefficients at the resolution levels m1 and m2 contributing greatly to A(m)_{ n }. Requirements a) and b) impose consistency constraints on the wavelet coefficients across three resolution levels which help to reduce the number of false positives. Each time we find a hit we go from A(m)_{ n }, to A(m1)_{2n1}, A(m1)_{2n2}, and A(m1)_{2n3}; until we reach the original input signal to identify the region of biochemical activity.

4)
We combine together each overlapping group of hits into one big hit region. We call N the total number of final hit regions.
We can estimate the false positive rate (FPR) corresponding to the chosen value of threshold α. Signal of the red channel is randomly shuffled between the probes. We repeat the same scoring procedure for the shuffled signal keeping M_{1}, M_{2} and α the same.
The obtained hits are false hits which allow us to estimate the false positive rate (FPR): FPR=\frac{{N}_{shuff}}{N}, where N is the total number of hits without random shuffling and N_{ shuff } is the total number of hits after the random shuffling of the red signal probes. One can choose the α according to the corresponding FPR.
Figure 5 shows Pol II binding sites located between genomic coordinates 1900 and 2200 and also between 1050 and 1250. Two regions are identified as hits by our algorithm with parameters M_{1} = 8, M_{2} = 11 and the threshold α corresponding to the estimated FPR = 5%. The wavelet coefficient which satisfy conditions 3) and the regions of the signal corresponding to those wavelet coefficients are indicated by the red boxes.
Figure 6 shows a snapshot from the Affymetrix Integrated Genome Browser (IGB) displaying broader peaks of the Pol II ChIPchip data using the Nimblegen ENCODE tiling array. Raw signals for the green and red channels are shown as blue and pink tracks. We used again the following set of parameters: M_{1} = 8, M_{2} = 11 and the threshold α corresponding to the estimated FPR = 5%. Yellow bars indicate the hit regions corresponding to the wavelet coefficients at the resolution levels m = 8, ..., 11 satisfying the condition 3) of our numerical algorithm. Hit regions obtained by combining the information from these resolution levels (combining together overlapping yellow bars) are shown as red bars (step 4 of our algorithm).
Comparison with other methods
In order to test the performance of our method with the consistency constraint previously described, we applied our algorithm to the Spikein data from the ENCODE Nimblegen tiling array. Mixtures of human genomic DNA and 100 human sequences at various concentrations were hybridized on the array [5]. Spikein data was obtained from human sequences of approximately the same size, which were generated in the laboratory. Spikein data allows for an objective estimation of the performance of our method and a comparison with other methods.
We used model parameters: M_{1} = 8, M_{2} = 9 (to identify narrow peaks) and varying α to obtain ROCtype curve. We choice of the model parameters was based on our observation that the size of the wavelets should be comparable to the size of the peaks to identify. ROCtype curves were generated by plotting the Sensitivity (i.e. the number of true positives/100) vs. the False Positives ratio (i.e. the number of false positives/100). The optimal ROCtype curve is the one closest to the left upper corner.
We compared our method with other methods described in [5]: MA2C, Splitter, Permu, ACME, TAMALg, and TAMALs. MA2C (Modelbased Analysis for 2Color arrays) [8] is a normalization method based on the GC content of the probes. It compensates each probe's log(Cy5/Cy3) ratio for the GC bias and weights each probe, taking into account the signal variance of the GC group to which the probe belongs. A sliding window consisting of 500 bp was used, and windows with high median values were identified as hits. The Splitter algorithm [7] incrementally changes the cutoff value of the signal and compares the total number of hits before and after the change. If the ratio of the number of hits before and after the cutoff change is smaller than a predefined value, the algorithm stops and hits before the last cutoff change are reported as final hits. Clusters of probes located closer than a "maxgap" parameter were merged together. Clusters of probes containing the number of probes smaller than a "minrun" parameter were discarded. Permu [6] identifies the peaks within the sliding window based on iterative thresholding procedure. FPR (false positive rate) is assigned to each peak using the randomized data. TAMALg and TAMALc are two versions of the same peakfinding algorithm [18] that use different stringency levels.
Our method demonstrated excellent performance compared to other methods as can be seen from the ROCtype curve in Figure 7. The intuitive reason behind of such a good performance of our method is that the shape of the wavelets we use is very similar to the shape of peaks of the signal. Another reason of a good performance is the use of consistency constraint which reduces the number of false positives.
We also tested our method using STAT1 data from ENCODE Nimblegen arrays in [19]. We use the following parameters for our model: M_{1} = 8, M_{2} = 12 and the threshold α corresponding to the estimated false positive rate FPR = 5%. Most of the hits (84%) obtained by our method overlap with hits reported in [19].
Conclusions
We analyzed tiling array data using wavelet transformations, and from the resulting wavelet coefficients we obtained clear intensity and lengthscale separation between the background signal and the signal coming from the regions of biochemical activity. A thresholding procedure was applied to the wavelet coefficients at different resolution levels with the consistency constraint in order to delineate the regions of biochemical activity of interest at the same confidence level for all the relevant lengthscales. This method was successfully applied to several ChIPchip data sets, including Pol II and histone modification experiments. Our method demonstrated excellent performance using Spikein data from the Nimblegen tiling array.
References
 1.
Yazaki J, Gregory BD, Ecker JR: Mapping the genome landscape using tiling array technology. Curr Opin Plant Biol 2007, 10(5):534–542. 10.1016/j.pbi.2007.07.006
 2.
Bertone P, Gerstein M, Snyder M: Applications of DNA tiling arrays to experimental genome annotation and regulatory pathway discovery. Chromosome Res 2005, 13(3):259–274. 10.1007/s1057700521650
 3.
Buck MJ, Lieb JD: ChIPchip: considerations for the design, analysis, and application of genomewide chromatin immunoprecipitation experiments. Genomics 2004, 83(3):349–360. 10.1016/j.ygeno.2003.11.004
 4.
Quackenbush J: Microarray data normalization and transformation. Nat Genet 2002, 32: 496–501. 10.1038/ng1032
 5.
Johnson DS, Li W, Gordon DB, Bhattacharjee A, Curry B, Ghosh J, Brizuela L, Carroll JS, Brown M, Flicek P, et al.: Systematic evaluation of variability in ChIPchip experiments using predefined DNA targets. Genome Res 2008, 18(3):393–403. 10.1101/gr.7080508
 6.
Lucas I, Palakodeti A, Jiang YW, Young DJ, Jiang N, Fernald AA, Le Beau MM: Highthroughput mapping of origins of replication in human cells. Embo Rep 2007, 8(8):770–777. 10.1038/sj.embor.7401026
 7.
Splitter algorithm[http://zlab.bu.edu/splitter]
 8.
Song JS, Johnson WE, Zhu XP, Zhang XM, Li W, Manrai AK, Liu JS, Chen RS, Liu XS: Modelbased analysis of twocolor arrays (MA2C). Genome Biol 2007., 8(8): 10.1186/gb200788r178
 9.
Zheng M, Barrera LO, Ren B, Wu YN: ChIPchip: Data, model, and analysis. Biometrics 2007, 63(3):787–796. 10.1111/j.15410420.2007.00768.x
 10.
Ozsolak F, Song JS, Liu XS, Fisher DE: Highthroughput mapping of the chromatin structure of human promoters. Nature Biotechnology 2007, 25(2):244–248. 10.1038/nbt1279
 11.
Daubechies: Ten lectures on wavelets: Society for Industrial and Applied Mathematics, Philadelphia, Pa. 1992.
 12.
Mallat SG: A Theory for Multiresolution Signal Decomposition  the Wavelet Representation. Ieee T Pattern Anal 1989, 11(7):674–693. 10.1109/34.192463
 13.
Mathworks: Wavelet Toolbox User Guide. 2008.
 14.
Birney E, Stamatoyannopoulos JA, Dutta A, Guigo R, Gingeras TR, Margulies EH, Weng Z, Snyder M, Dermitzakis ET, Thurman RE, et al.: Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature 2007, 447(7146):799–816. 10.1038/nature05874
 15.
Lian Z, Karpikov A, Lian J, Mahajan MC, Hartman S, Gerstein M, Snyder M, Weissman SM: A genomic analysis of RNA polymerase II modification and chromatin architecture related to 3' end RNA polyadenylation. Genome Res 2008, 18(8):1224–1237. 10.1101/gr.075804.107
 16.
Limpert E, Stahel WA, Abbt M: Lognormal distributions across the sciences: Keys and clues. Bioscience 2001, 51(5):341–352. 10.1641/00063568(2001)051[0341:LNDATS]2.0.CO;2
 17.
Mutch SA, Fujimoto BS, Kuyper CL, Kuo JS, Bajjalieh SM, Chiu DT: Deconvolving singlemolecule intensity distributions for quantitative microscopy measurements. Biophys J 2007, 92(8):2926–2943. 10.1529/biophysj.106.101428
 18.
Bieda M, Xu XQ, Singer MA, Green R, Farnham PJ: Unbiased location analysis of E2F1binding sites suggests a widespread role for E2F1 in the human genome. Genome Res 2006, 16(5):595–605. 10.1101/gr.4887606
 19.
Euskirchen GM, Rozowsky JS, Wei CL, Lee WH, Zhang ZDD, Hartman S, Emanuelsson O, Stolc V, Weissman S, Gerstein MB, et al.: Mapping of transcription factor binding regions in mammalian cells by ChIP: Comparison of array and sequencingbased technologies. Genome Res 2007, 17(6):898–909. 10.1101/gr.5583007
Acknowledgements
The authors would like to acknowledge Raymond Auerbach, Sudhakar Chelikani, Lukas Habegger, Alex Urban and Sherman Weissman for useful discussions.
Author information
Affiliations
Corresponding author
Additional information
Authors' contributions
AK designed the algorithm, coded it and wrote the manuscript. JR provided crucial help in improving the data analysis. MG provided general advice and guidance in drawing necessary conclusions and results. All the authors read and approved the final manuscript.
Electronic supplementary material
Figure S1  Highpass and lowpass decomposition and reconstruction filters for Coif1
Additional file 1:. Lowpass decomposition filter c has a triangularlike shape that resembles the shape of the signal over binding sites observed in ChIPchip experiments. (TIFF 5 MB)
Authors’ original submitted files for images
Below are the links to the authors’ original submitted files for images.
Rights and permissions
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Cite this article
Karpikov, A., Rozowsky, J. & Gerstein, M. Tiling array data analysis: a multiscale approach using wavelets. BMC Bioinformatics 12, 57 (2011). https://doi.org/10.1186/147121051257
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/147121051257
Keywords
 Discrete Wavelet Transform
 Wavelet Coefficient
 Wavelet Decomposition
 Resolution Level
 Decomposition Level