Wavelet-based identification of DNA focal genomic aberrations from single nucleotide polymorphism arrays

Background Copy number aberrations (CNAs) are an important molecular signature in cancer initiation, development, and progression. However, these aberrations span a wide range of chromosomes, making it hard to distinguish cancer related genes from other genes that are not closely related to cancer but are located in broadly aberrant regions. With the current availability of high-resolution data sets such as single nucleotide polymorphism (SNP) microarrays, it has become an important issue to develop a computational method to detect driving genes related to cancer development located in the focal regions of CNAs. Results In this study, we introduce a novel method referred to as the wavelet-based identification of focal genomic aberrations (WIFA). The use of the wavelet analysis, because it is a multi-resolution approach, makes it possible to effectively identify focal genomic aberrations in broadly aberrant regions. The proposed method integrates multiple cancer samples so that it enables the detection of the consistent aberrations across multiple samples. We then apply this method to glioblastoma multiforme and lung cancer data sets from the SNP microarray platform. Through this process, we confirm the ability to detect previously known cancer related genes from both cancer types with high accuracy. Also, the application of this approach to a lung cancer data set identifies focal amplification regions that contain known oncogenes, though these regions are not reported using a recent CNAs detecting algorithm GISTIC: SMAD7 (chr18q21.1) and FGF10 (chr5p12). Conclusions Our results suggest that WIFA can be used to reveal cancer related genes in various cancer data sets.


Background
With the recent advances of cancer studies at a molecular level, DNA copy number aberrations (CNAs) have been studied as important causes and consequences in the initiation, development, and progression of cancer. To date, many researchers have focused on the detection of chromosomal regions having amplifications and deletions using arrays of comparative genomic hybridization (CGH) data sets. These studies have generated valuable observations about cancer metastasis [1][2][3][4][5][6][7]. For example, it is now known that many oncogenes and tumor suppressor genes are located in regions of amplifications and deletions, and that chromosome regions with aberrations can be used to distinguish between cancer types. Also, new cancer related genes have been discovered. These advances have been accelerated by the development of computational methods and software [8][9][10][11][12][13][14]; segmentation and denoising methods such as circular binary segmentation (CBS) [8], wavelets [9], and the Gaussian-based likelihood approach (GLAD) [10] have been developed in order to identify true aberrations from background noise in a single sample. And with the accumulation of copy number aberration data sets, it has become increasingly important to find concordant aberrations in multiple samples. Thus, algorithms such as the minimum common region (MCRs) [15] and significance testing for aberrant copy number (STAC) [16] have been developed to address this issue.
However, even though each method can identify aberrant regions, these regions are not concordant between the different methods. As one possible explanation for this lack of concordance, Beroukhim et al. (2007) [17] assumed that many aberrations randomly occur, though most methods do not explicitly consider the background rate of random aberrations. For instance, most locations of chr7 and chr10 are amplified and deleted, respectively, in shortterm survival patients of glioblastoma multiforme (GBM) [18], though only a few of their genes are known oncogenes and tumor suppressors in GBM. As such, if random aberrations are not considered, most chr7 and chr10 genes will be regarded as relevant. Hence, an important issue is to distinguish cancer driving genes, i.e., genes involved in cancer development, from broad chromosomal aberrations. Fortunately, the amount of aberrations of driving genes has been observed to be larger than in their neighboring genes, and these aberrations are likely to occur consistently across multiple cancer patients. A few algorithms, such as the genomic identification of significant targets in cancer (GISTIC) [17], have been developed in attempts to incorporate these issues and are used to detect focal aberrations. Note that the term "focal aberrations" is used here to refer to relatively short, but consistently aberrant, regions in multiple samples. The use of GISTIC revealed that these focal aberrations contain many cancer related genes. In a comparison of GISTIC to MCR [15], via three independent data sets, GISTIC consistently identified more cancer related genes than MCR. In GISTIC, it first selects copy number aberration regions by applying a segmentation method to each sample, and then sums the amount of aberrations from the multiple samples. Then, differences between the aberrations and their neighbors are computed using a peel-off method. However, GISTIC has an inherent weakness: differences between neighbors in individual samples may cancel out since it summates log2 ratios in all aberrant samples first. The important difference between GISTIC and our proposed approach is that we first consider the differences between neighbors in an individual sample, before identifying focal regions in multiple samples. In this study, we propose a novel algorithm, referred to as the wavelet-based identification of focal genomic aberrations (WIFA), to address the following issues: (i) distinguish signals from noise among probes having high aberrations, (ii) detect focal aberrations by considering the differences between aberrations and their neighbors, as well as the amount of aberrations, and (iii) consider the consistency of aberrations in multiple samples.
Wavelet analysis is a mathematical technique for representing data. Wavelets can be used to remove noise from observed data (contaminated by noise) while preserving important features of true data; this process is called wavelet denoising. In this study, we use a variant of the translation-invariant level dependent wavelet denoising method in [19] to obtain translation-invariant approximations of the smooth (low-frequency) part of true data y LOW , and of the local (high-frequency) behavior of true data y HIGH , from the observed data y. In brief, y LOW is based on the averages of the neighboring values of y, and y HIGH is based on the differences of neighboring values of y, followed by thresholding. Thresholding is only performed in y HIGH since it is likely that noise would be more pronounced in the high-frequency content. After obtaining y HIGH via the wavelet analysis, we obtain y * HIGH for each sample by adjusting some obvious artifacts in y HIGH , and then cluster continuous focal aberrations across multiple samples. By applying this approach to GBM and lung cancer data sets, we are able to find previously known cancer related genes in the focal aberrations. In addition, a similar procedure based on y LOW enables us to detect broad regions of chromosomal aberrations.
The difficulty of assessing the performance in detecting focal aberrations is that the true answer is often not known, since regions containing cancer related genes still need to be revealed. Hence, we compare genes identified by our approach to known cancer genes obtained from GISTIC [17]. Based on this comparison, in addition to confirming regions identified by GISTIC, we are able to find new regions not previously identified by GISTIC; literature shows that these new regions contain known oncogenes. In addition, WIFA is compared to STAC and MCR, outperforming these two methods both in the simulation and GBM data. The source code for WIFA is available at http://www.gcancer.org/wifa/WIFA.html.

Materials
We collected and reanalyzed three single nucleotide polymorphism (SNP) data sets: 154 GBM tumor samples [17], 178 GBM tumor samples [20], and 371 lung tumor samples [21]. We downloaded the signal intensities of the data sets from either the websites of the original publications or the GEO database. We used all chromosomes except X and Y. Both GBM data sets were generated from an Affymetrix 100K SNP microarray, and the lung cancer data set was from 250K Sty SNP arrays. Since the 100K SNP array consisted of independent 50K Xba and 50K Hind arrays, we then merged these two arrays along the chromosome positions. Next, to calculate copy number changes from signal intensities, we applied the following procedure (similar to original publications): (i) signal intensities were transformed using the log2 transform to make the noise constant; (ii) for each sample, the median value across all probes was subtracted from the probes; (iii) to obtain the log2 ratio for tumor samples compared to the normal samples, log2 transformed normal samples were subtracted from the log2 transformed tumor samples; and (iv) to remove copy number variants (CNVs) that occur in normal population, positions with CNVs obtained from [22] were omitted from the data sets. Figure 1 illustrates the WIFA method for detecting broad and focal aberrations in SNP array data sets using the wavelet transform. The procedure for detecting focal aberrations is as follows. First, the difference between the aberrations and their neighbors is measured for the SNP array data using the wavelet transform and a postprocessing step; this process is called a HIGH transform and generates y * HIGH . Probes that have a larger difference with their neighbors than a given threshold value are likely to have nonzero values in y * HIGH . Thus, y * HIGH can be used to distinguish driving aberrations from passenger aberrations because the amount of aberrations for driving genes is usually larger than for their neighbors. Second, the significance of probes for indicating the driving gene in multiple samples is calculated as the sum of the values in y * HIGH from multiple samples, and is referred to as Y * HIGH . This value is generated based on the assumption that both the number of samples with aberrations and the amount of aberrations are important indicators for discovering driving aberrations. Third, probes having nonzero Y * HIGH values are clustered with neighbors having nonzero Y * HIGH values within a certain distance. Since chromosome positions are also important for identifying candidate cancer-related genes, we consider sets of neighboring probes in this third step instead of using the information from individual probes. To prioritize these clusters, the score of each cluster is determined by summing the probes' values in Y * HIGH in the cluster. Similar to the procedure for detecting focal aberrations, broad aberrations are identified using y * LOW ; y * LOW values are calculated for each sample and then summed for all samples, denoted as y * LOW . The statistical significance of the aberrations is subsequently calculated as shown in Figure 1(d)-(e).

Wavelet transform and its use in WIFA Wavelet transform
Let J and L be integers such that 1 ≤ L ≤ J -1. The (discrete) wavelet transform (WT) maps a given data set y of length 2 J into the scaling coefficients s := {s L,t : t = 0, 1, ..., 2 L -1} and the wavelet coefficients w := {w j,t : j = L, L + 1, ..., J -1; t = 0, 1, ..., 2 j -1}. Note that WT is linear and can be represented by a 2 J × 2 J orthogonal matrix W. WT depends on the specific wavelet selected. In this paper, we use a WT based on the Haar wavelet. The Haar wavelet transform is used to simply pair up input values, storing the difference and passing the average, and it repeats this process recursively, pairing up the averages to provide the next level-finally resulting in 2 L averages (stored in the scaling coefficients s) and 2 J -2 L differences (that are stored in the wavelet coefficients w). For more details about wavelets, refer to [23], [24], and [25].

Wavelet procedure for WIFA
A drawback of the traditional WT is that it is not translation-invariant. In attempts to remedy this problem, a number of translation-invariant wavelet transforms have been employed [9,19,26]. Among the available translation-invariant wavelet transforms, we use the stationary wavelet transform by [27] for our WIFA methodology. Suppose that a single tumor sample is fixed, and that a chromosome of the sample has n = 2 J locations for some positive integer J. We then denote y i as the observed copy number change at the i-th genomic location x i for i = 1, ..., n. We assume that the genomic locations x i are fixed, known and equally spaced; for a wavelet analysis with unequally spaced data, see example [28]. We further assume that for i = 1, ..., n, the observed copy number change can be expressed as where f is an underlying function representing the true copy number change, and ε i is a stationary Gaussian noise with a zero mean value [19].
The basic principle of wavelet denoising then becomes to identify and zero out the wavelet coefficients y := {y i : i = 1, ..., n} that are likely to contain noise, and to estimate {f (x i ): i = 1, ..., n} in (1). Instead of the usual wavelet denoising procedure [29], we use the following modified steps as the main wavelet denoising procedure for our methodology: (Step 1) Given the data y of length n = 2 J , and an integer L such that 1 ≤ L ≤ J -1, compute W y = {s, w}. For an integer M ≥ L, let w MID be the wavelet coefficients w of y with levels j = L, L + 1, ..., M -1, and w HIGH be the wavelet coefficients w of y with levels j = M, M + 1, ..., J -1. ( Step 2) Define T LOW :{s, w MID , w HIGH } ↦ {s, 0, 0} and T HIGH : {s, w MID , w HIGH } → {0, 0, w HIGH }, where w HIGH is obtained from w HIGH by thresholding using a hard threshold function [30] with the threshold value λ = Cσ j 2logn j for each level j = M, M + 1, ..., J -1.
Here, σ 2 j is the estimate of the noise variance for the wavelet coefficients at level j, n j is the length of the subsignal at level j, and C is a constant to be determined later. ( Step 3) Let S represent a shift operator of the one time unit [27]. Then, compute the translation-invariant low-frequency and high-frequency approximations y LOW and y HIGH defined as The threshold value λ = Cσ j 2logn j used in (Step 2) is a variant of the threshold value used in [19]. After (Step 1)-(Step 3), we obtain y LOW and y HIGH . Note that y LOW gives a translation-invariant approximation of the smooth (low-frequency) part of the true data, which provides rough estimate for detecting a broad region of chromosomal aberrations. This value is based on the Haar scaling coefficients, which can be considered as averages of neighboring values of y. On the other hand, y HIGH gives a translation-invariant approximation of the local (high-frequency) behavior of the true data, which provides a rough idea for detecting the focal aberration of chromosomes; y HIGH is based on Haar wavelet coefficientswhich are differences between the neighboring values of y-and the threshold.
Dividing the wavelet coefficients depending on the level has been used in many studies (see [19] and [31]), although the exact form may vary. The main difference between our method and other level-dependent wavelet denoising methods is that we concentrate only on the low-frequency scaling and high-frequency wavelet coefficients, and do not consider the mid-frequency wavelet coefficients. To do this, we add the parameter M to the usual wavelet thresholding process; from the discussion in the Results section, this parameter allows us to identify focal genomic aberrations more effectively.
The values of y LOW for all chromosomes of a given sample y are obtained simply by processing each chromosome separately, and then concatenating the values of y LOW for each chromosome; similarly, the values of y HIGH for all chromosomes can be found.
Next, let us explain how we treat the problem of the boundary of each chromosome. In brief, the problem of the boundary is caused by our previous assumption that the chromosome has n = 2 J locations for a positive integer J, which may not hold true in general; for a more detailed discussion about boundary conditions, refer to [32]. We handle this boundary problem by extending each chromosome first symmetrically and then periodically. Our experiments show the effectiveness of this method. Other parameters used in our methodology include: • Constant C in the threshold value λ = Cσ j 2logn j : in (Step 2), we use the threshold value λ = Cσ j 2logn j to threshold the high-frequency wavelet coefficients at level j. A smaller C would allow more nonzero values in y HIGH . • Level L: parameter L can be as small as 1 and as large as J -1. A smaller L would increase the coarseness of the y LOW approximation, whereas a larger L would make it finer.
• Level M : parameter M can be as small as L and as large as needed. A smaller M would produce a y HIGH with more nonzero values, whereas a larger M would produce a y HIGH with fewer nonzero values. Since this is not a standard parameter in wavelet literature, we pay special attention to it and discuss its effect on our methodology by varying M. See the Results section.
In the Results section, we further discuss which values of the above parameters C, L, and M are used for each of the data sets in our experiments. To implement the wavelet transforms, we used WaveLab http://www-stat. stanford.edu/%7Ewavelab/Wavelab_850/index_wave-lab850.html.

Identification of broad and focal aberrations in WIFA Identification of focal aberrations
The values of y HIGH generated from the wavelet transform indicate the difference between the neighboring values of the denoised sample. To use this information for focal aberration detection in the WIFA method, we process the y HIGH values. In Figure 2(a), we draw the log2 ratio of 152 probes for a single sample, a part of the GBM SNP array data [17], in which there are strong amplifications between the 98th and 143th probes. The y HIGH values obtained by the wavelet transform are then presented in Figure 2(b). Note that negative y HIGH values are generated around the positive y HIGH values of the 98th and 143th probes. This negativity is due to the fact that the positions next to amplifications also have large differences with their neighbors. To keep only y HIGH positions with amplifications, we select the position with the highest absolute log2 ratio value among the consecutive positions with nonzero y HIGH values, and then assign zero to the positions with a different sign in the y HIGH values. This process generates y * HIGH , as shown in Figure 2(c). A similar process is then performed on the deletions.
After obtaining y * HIGH , WIFA considers the two following issues. Let y * HIGH (p) be the value of y * HIGH at a position p. First, if for a given probe p, y * HIGH (p) = 0, and for its consecutive probes k on both sides, y * HIGH (k) = 0, the p might represent CNVs that are abundant in a normal population, instead of CNAs. Thus, we set y * HIGH (p) = 0. Second, we attempt to determine focal aberrations that are consistent across multiple samples. are located within a distance d. From the clusters in a group, we select a cluster c having the maximum score S(c) = p∈cluster(c) |Y * HIGH (p)| . In the clustering process, clusters containing nonzero Y * HIGH (p) values from only a single patient are removed. Then, statistically significant clusters are ranked based on their S(c) scores. In Figure  3(c), the cluster on the right is a focal aberration that contains a known cancer related gene (EGFR).
A statistical significance of each cluster is calculated based on the null hypothesis that consecutive aberrations in each sample are independent from those in other samples. Here, the number of permutations N and the significant value a are used to estimate the significance of clusters.
1. In each sample, segments of aberrations (a set of consecutive probes with y * HIGH = 0) are randomly positioned on a chromosome. This random positioning is then applied to all samples, generating randomly permuted data from the multiple samples. This permutation approach is described in detail in [11]. 2. The process for detecting focal aberrations in multiple samples is subsequently applied to the randomly permuted data, generating a set of clusters. 3. Steps 1 and 2 are repeated N times. Let the max score of clusters from the ith permutation be the max_score(i). 4. The p-value of each cluster c of the observed data P cluster (c) can be calculated by comparing scores from the permuted and observed data: where S(c) is the score of a given cluster c, and I denotes the indicator function. 5. Clusters with P cluster (c) less than a are considered statistically significant.
In this paper, we use N = 1,000 and a = 0.1, and the permutation and calculation of P cluster is performed for each chromosome.

Identification of broad aberrations
After y LOW values are generated from the wavelet transform for each sample, we use y LOW as y * LOW ; y LOW values do not require a processing step, contrary to y HIGH . To integrate multiple samples, we sum y * LOW values from multiple samples, referred to as Y * LOW . Note that if all probes in a chromosome arm are statistically significant in Y * LOW , we consider it a broad aberration. We then calculate the statistical significance of Y * LOW in the following way. The null hypothesis for Y * LOW is that y * LOW is independent among samples, so the summation of y * LOW , Y * LOW , is the same across all probes in the chromosomes.
To generate the null distribution, we first construct a histogram h i of y * LOW in a single sample i by splitting y * LOW values into bins at intervals of 0.01. Next, the distribution of Y * LOW is calculated by the convolution of h i of all samples, and the p-value of the observed Y * LOW is calculated by summing the probabilities from the tail of the null distribution to the observed Y * LOW value. The pvalue is separately calculated for amplifications and deletions. For the correction of multiple tests, p-values are converted into q-values [33]; the p-values of Y * HIGH are similarly calculated. This approach is similar to the calculation of statistical significance of aberrations used in [17]. Note that the p-values of Y * HIGH are calculated for each probe, and the P cluster discussed above is calculated for each cluster.

Broad and focal aberrations in glioblastoma
We applied our approach to 154 GBM tumor samples [17]. After testing different values of C, L, and M (cf. Methods section), we selected C = 1.94, L = 9, and M = 12. Our experimental results for different values of M are shown below (experiments for different values of C and L are not shown). Since J total := ⌈log 2(n total )⌉ = 17, where n total is the number of total probes in the data set, L = 9 for this data set roughly indicates that we average the log2 ratios of 2 J total −L = 2 17−9 = 256 probes in the 100K SNP array to obtain y LOW . In order to identify broad aberrations, we apply the LOW transform and calculate the q-values of Y * LOW . Here, amplifications and deletions in the size of a chromosome arm are considered broad aberrations, with a threshold q-value of 0.01. As shown in Additional File 1(a) and 1(b), chr7, 8q, 17q, 19p, and 20 are amplified, and chr6q, 9p, 10, 13, 14, and 22 are deleted in the size of a chromosome arm. Next, using the HIGH transform, our model is able to detect clusters with focal aberrations; we used d = 100 KB to construct clusters. Since EGFR, MDM2, PDGFRA, MDM4, CDK4, MET, CDKN2A, PTEN, RB1, CDK6, and MYC have been reported as important GBM related genes [34], we investigate whether the clusters contain these genes. In addition, we investigate the effect of different M values for detecting focal aberrations. We use M values ≥ 9 since M can have values from L, as described in the Methods section. The number of probes with nonzero values in Y * HIGH is the largest when M = 9; Figure 4(a) shows that 71 statistically significant clusters are generated for this value. In these clusters, seven GBM related genes are ranked as the top seven clusters. As M increases, the number of clusters and the number of nonzero probes in Y * HIGH decreases. Indeed, in Figure 4(g) only two clusters are generated when M = 15. In the range M = 10-12, eight or nine GBM genes are located in the highly ranked clusters; this observation shows that our method highly ranks clusters containing important GBM genes regardless of the M value. However, if we consider the size of the chromosomal region containing the identified focal aberrations, there is a preferred choice for M value. When M = 9, among all 71 clusters spanning 436 MB, the top seven clusters contained seven GBM genes. On the other hand, when M = 12, 19 clusters spanning 26 MB included nine GBM genes in the top nine clusters. Hence, we use M = 12 for further analysis. As shown in Figure 4(b), these nine clusters contain nine previously identified GBM related genes [34], including EGFR, MDM2, PDFGFA, MDM4, CDK4, MET (amplifications), CDKN2A, PTEN, and RB1 (deletions); however, MDM4 and RB1 lie slightly out of the detected region. These nine genes were also detected by GISTIC. In addition to these nine genes, aberrations of CDK6 and MYC have also been reported in multiple GBM studies [34]. However, using this GBM data set, aberrations of these two genes were detected neither by our method nor by the GISTIC method. The above results indicate that our method highly ranks most of the important GBM genes and reports only a small number of false positive positionsalthough there is a possibility that these false positive regions may include previously unreported GBM genes. Table 1 shows the positions of clusters in chromosomes, ranked according to the cluster scores when M = 12. WIFA identified ten additional regions (other than the above nine genes). Two regions include known cancer related genes (MYCN [35] and IGFBPL1 [36]), though the remaining eight regions do not contain any cancer related genes; Additional File 2 contains the gene symbols for all clusters. In contrast, GISTIC identified 17 additional regions. In these regions, five known cancer related genes are included, one of which is MYCN. The remaining 12 regions do not contain any cancer related genes. Hence, both methods contain regions that may require further analysis, or they might just be falsely detected regions. Let us now revisit Figure 3 in order to illustrate the use of the HIGH transform to detect focal aberrations in GBM data [17]. Figure 3(a) presents the SNP array data of 154 patients at positions from 52,767 to 55,790 KB in chr7. In this region, DNA amplification seems apparent in multiple patients. Figure 3(b) then gives the HIGH analysis for the corresponding region. Note that the cluster from 54,145 to 55,790 KB ranks 2nd in Table 1 and contains EGFR.
We also applied the proposed method to a GBM data set obtained by Kotliarov et al. [20]. We used the same parameter values for C, L, M, and d as for the previous GBM data, since both are generated using the same SNP array platform. The HIGH analysis for this GBM data generates eight clusters; among these clusters, one focal deletion contains CDKN2A, and five focal amplifications contain MDM4, PDGFRA, EGFR, MDM2, and CDK4 (Additional File 3 and Additional File 4). Note that MET is not included in the focal aberrations because amplification occurs only in a single sample. The six GBM genes identified using this data are also found in the previous GBM data set. This result confirms that our method is able to detect focal aberrations that are consistent across different experiments.

Broad and focal aberrations in lung cancer
We applied our method to the 371 lung cancer patients studied by Weir et al. [21]. Let us first explain how some of the parameters in our Methods section can be determined for a data set other than GBM data, by using this data set as an example. We recall that the parameters used for GBM data are C = 1.94, L = 9, M = 12, and J total = 17. Since the average distance between probes for GBM data set is 50 KB, the actual length of genomes we average for y LOW is approximately 50K · 2 J total −L = 50K · 2 17−9 = 12.8M, and the actual length of genomes we consider as a focal aberration for y HIGH is approximately 50K · 2 J total −M = 50K · 2 17−12 = 1.6M.
For the lung data set, the average distance between probes is 13 K and J total := ⌈log 2(n total )⌉ = 18, where n total is the number of total probes in the data set. Since we want the actual length of genomes that we average for y LOW and the actual length of genomes we consider as focal aberrations to be similar to the GBM case, we select L = 8 and M = 11. Then, the actual length of genomes we average over for y LOW is approximately 13K · 2 J total −L = 13K · 2 18−8 = 13.2M, and the actual length of genomes that we consider as focal aberrations for y HIGH is approximately 13K · 2 J total −M = 13K · 2 18−11 = 1.664M.  With the value of C = 1.94, the number of nonzero values in y HIGH of all chromosomes in the GBM data set is about 10% of the number of nonzero values in the sample. The value of C that provides approximately the same percentage of nonzero values in y HIGH for the lung data set is C = 5.2.
When Weir et al. [21] analyzed this lung cancer data set using the GISTIC approach, they identified five deleted focal regions and 17 amplified focal regions with statistical significance: in these regions, ten known oncogenes (MDM2, MYC, EGFR, CDK4, KRAS, CCNE1, ERBB2, CCND1, TERT, and ARNT), two known tumor suppressor genes (CDKN2A and PTEN), and six new candidate genes (MBIP, NKX2-1, VEGFA, PTPRD, PDE4D, and AUTS2) were found. We applied the WIFA approach to this same data set (d = 50 KB). When M = 11, 20 clusters spanning 28 MB are generated. Table 2 and Additional File 6 present a detailed description of these 20 clusters. The top-ranked cluster contains MBIP, a new candidate gene that is also ranked at the top in the GISTIC analysis. Among the 18 cancer related genes noted above, six known oncogenes (MDM2, EGFR, KRAS, CCNE1, ERBB2, and CCND1), one known tumor suppressor gene (CDKN2A), and three new candidate genes (MBIP, NKX2-1, and VEGFA) are included in the focal aberrations. In addition, SS18 and FGFR1-which were identified by Weir et al., though they did not consider them statistically significant-are identified as having statistical significance in our study. Let us now look at the remaining clusters that do not contain cancer related genes reported by Weir et al. We investigated whether or not these remaining clusters contain other cancer related genes ( Table 2). Two genes SMAD7 (chr18q21.1) and FGF10 (chr5p12) are located in focal aberration regions. It is known that SMAD7 functions as an intracellular antagonist of transforming growth factor beta (TGF-beta) signaling and is frequently unregulated in various cancers [37]. A recent study showed that a transgenic mouse model with SMAD7 disrupted TGF-beta signaling and increased lung carcinogenesis [38]. In our study, the values of y * HIGH in chr18:43,953-45,322 KB, where SMAD7 is located, are positive in two patients. Indeed, the averages of the log2 intensity ratio of these two patients are higher than those of patients having zero values in y * HIGH (Additional File 7(a)). This result implies the DNA amplification of SMAD7 in lung cancer. A member of the fibroblast growth factor FGF10 has also been reported in several cancer studies; multiple lines of evidences show that FGF10 plays important roles in various cancer types, including prostate and pancreatic cancers [39,40]. Previously, the mRNA expression of FGF10 in the fetal lung of mice was shown to disrupt lung morphogenesis [41], and the alternation of FGF pathways frequently occurs in lung cancer [42].  When M = 11, 20 statistically significant clusters are shown. These clusters contain known lung cancer genes or cancer related genes in other cancer types. All genes contained in the 20 clusters are described in Additional File 6.
Columns are as described in Table 1. §Gene annotations are based on hg18 human genome assembly.
SMAD7 is located), which was identified only by WIFA. Figure 5(a) shows the log2 intensity ratios of two patients for part of chr18. Although both samples are commonly amplified in chr18:43,953-45,322 KB, the Gvalues obtained by GISTIC are not large enough to identify this region. The G-values are calculated by summing the log2 intensity ratios of the amplified samples; 40 samples are considered amplified in this region (log2 ratio>0.1). However, even though the original data from two patients ( Figure 5(a)) were highly amplified and their differences with neighbors seemed significant, the G-values did not capture this significance. This is because the other 38 (amplified with less difference between neighboring values than the two) samples were also used to calculate the G-values, which resulted in the significance of the two samples being obscured ( Figure 5(b)). In contrast, WIFA first calculates the differences with neighbors for each sample, then sums the differences. In WIFA, the y * HIGH values are high for the two samples, but the y * HIGH values are zero for the other samples since these other samples do not have any significant difference with their neighbors. Therefore, Y * HIGH -the summation of y * HIGH values from multiple samples-can reflect the significant differences with the neighbors of these two samples ( Figure 5(c)). Note that both GISTIC and WIFA identify amplification in chr18:20,300-23,559 KB.
To validate our choice of M, we conducted experiments with various M values. For example, with M = 9, 37 clusters spanning 161 MB included 12 cancer related genes, as shown in Additional File 8(a). However, even though the M = 9 case identified two more genes compared to M = 11, it required the search of five times more genomic regions; refer to Additional File 8 for the results of the other M values. Our analysis of lung cancer confirms that WIFA is useful for identifying cancer related genes in focal aberrations across different cancer types.

Comparison with other methods
We then compared our method with MCR and STAC. For implementation, we used the MCR from waviCGH [13](http://wavi.bioinfo.cnio.es/) and STAC from the authors' website (http://www.cbil.upenn.edu/STAC/). Note that the input files from both methods should have binary aberration calls of amplification, deletion, or no change; hence, GLAD [10], a segmentation method, was applied to single samples. The thresholds for amplification and deletion were then used to determine the aberration regions. In MCR, the fraction of samples in aberrant regions was used to determine the significant regions. In STAC, the p-value of the footprint was used as a measure of the significance of aberrant regions. For WIFA, the cluster score is used for this purpose.
We used a series of simulation data as the basis of our comparison, and generated the simulation data in two steps. First, ten different underlying true data were generated using Multiple Sample Analysis [11](http://www. cbil.upenn.edu/MSA/) software. For each true data, the length of a genome, in terms of number of markers, was 4,500; in addition, the number of samples was 50; the number of markers in the underlying concordant aberrations was 30; and the numbers of samples in concordant aberrations varied from 50% to 70%. In ten true data, the numbers of concordant aberrant regions varied from five to seven, and there were one or two nonconcordant regions. Second, the background aberrations were generated using a normal distribution. Because the maximum (in absolute) values of the markers for each sample were different, we set the standard deviation of the normal distribution to be the multiplication of a fixed number, which we refer to as the noise level, and the maximum value of the true data.
We investigated noise levels of 0.2 and 0.4. Performances of three methods are measured based on values for the area under a curve (AUC) for the sensitivity and false positive rate. For both noise levels, WIFA shows very good performance in identifying concordant regions in the simulation data, as shown in Figure 6. Note that MCR performs slightly better than WIFA when the noise level is 0.2, but WIFA is superior when the noise level is 0.4. Considering that a lower noise level generally results in easier noise removal, we can conclude that WIFA is the most useful in identifying concordant regions in the simulation data. During these simulations, the parameter values of C, L, and M in WIFA were determined in the same manner as for lung cancer. In MCR and STAC, 0.1 is used as the threshold value for identifying amplifications and deletions in a single sample, after application of the GLAD segmentation method.
We next compared WIFA with MCR and STAC using real GBM data sets. As a measure of performance, we used the length of genomes to identify known GBM genes. In Figure 7, the y-axis represents the length of the genome regions sorted based on the region significance, and the x-axis represents the number of GBM genes contained in the corresponding genomes in the y-axis. Because methods with high performance require a smaller length of genomes in their search for known GBM related genes, methods closer to the x-axis generally outperform other methods. Here, on average, WIFA typically shows the best performance, as nine genes can be identified around 10,000 KB. In MCR and STAC, five different thresholds of 0.1, 0.2, 0.3, 0.4, and 0.5 are used to determine the amplification and deletion. In both methods, the threshold value of 0.3 shows the best performance. Figure 7 includes the graphs of the 0.3 threshold, along with one other threshold for comparison. In MCR, only three or four genes are identified within 10,000 KB; more than 100,000 KB is required to identify the remaining genes. In STAC, more than 10,000 KB is required to identify most of the genes.

Discussion and Conclusions
Our work is based on a wavelet analysis. The wavelet analysis has been used in other papers to analyze array CGH data (cf. [9], [43]); for example, in [9], it is shown to perform well compared to approaches such as CBS, a change-point method [8], and HMM [44]. Compared to other wavelet-based approaches used to analyze array CGH data, the main differences in WIFA include: (i) a new parameter M is introduced, which is used to identify focal genomic aberrations more effectively; and (ii) a new method that integrates multiple samples, as a postprocessing step in the wavelet analysis, is suggested in order to identify cancer-related genes from a data set having multiple samples. As a result, we were able to detect cancer related genes with high rate of accuracy in both GBM and lung data sets.
CNPs are another type of DNA variation that are abundant in the normal population, and are usually observed in kilobase or megabase DNA deletions or duplications. When a HIGH analysis was applied to SNP microarrays, deletions of a single SNP probe were frequently observed. When these were compared to the positions of known CNPs [22], many regions were found to overlap (data not shown); these single SNP probes were removed from our analysis since the relevance of CNPs to cancer requires further study. However, if CNPs are the main subject of analysis, it is possible that a new method based on our HIGH analysis could be developed to achieve this task. As a promising example, a single deletion of the SNP probe from 13 patients was observed at the 55,205,890 base position of chr11 when the GBM data set was used [17]. Olfactory receptor (OR) genes such as OR4C11, OR4P4, OR4S2,  Figure 6 Comparison of STAC, MCR, and WIFA in the simulation data. Averages of AUC values of ten underlying true data in three methods are drawn for noise levels of 0.2 and 0.4. The standard deviation of the AUC values for ten underlying true data is drawn as error bars. The y-axis is the length of the genome regions sorted based on the significance of regions, and the x-axis is the number of GBM genes contained in the corresponding genomes in the y-axis. When the genome length is larger than 6,000,000 KB, it is shown exactly as 6,000,000 KB for clarity. Because methods with high performance require a smaller length of genomes in their search for known GBM related genes, a method closer to x-axis is considered to outperform other methods. and OR4C6 are located at this position, and it was previously shown that the OR genomic location is frequently affected by CNPs [45]. This observation suggests that our wavelet analysis has the potential to be broadly applied to detect various kinds of focal aberrations.