Spatial normalization of arrayCGH data
 Pierre Neuvial†^{1}Email author,
 Philippe Hupé†^{1, 2},
 Isabel Brito^{1},
 Stéphane Liva^{1},
 Élodie Manié^{3},
 Caroline Brennetot^{3},
 François Radvanyi^{2},
 Alain Aurias^{3} and
 Emmanuel Barillot^{1}
https://doi.org/10.1186/147121057264
© Neuvial et al; licensee BioMed Central Ltd. 2006
Received: 15 September 2005
Accepted: 22 May 2006
Published: 22 May 2006
Abstract
Background
Arraybased comparative genomic hybridization (arrayCGH) is a recently developed technique for analyzing changes in DNA copy number. As in all microarray analyses, normalization is required to correct for experimental artifacts while preserving the true biological signal. We investigated various sources of systematic variation in arrayCGH data and identified two distinct types of spatial effect of no biological relevance as the predominant experimental artifacts: continuous spatial gradients and local spatial bias. Local spatial bias affects a large proportion of arrays, and has not previously been considered in arrayCGH experiments.
Results
We show that existing normalization techniques do not correct these spatial effects properly. We therefore developed an automatic method for the spatial normalization of arrayCGH data. This method makes it possible to delineate and to eliminate and/or correct areas affected by spatial bias. It is based on the combination of a spatial segmentation algorithm called NEM (Neighborhood Expectation Maximization) and spatial trend estimation. We defined quality criteria for arrayCGH data, demonstrating significant improvements in data quality with our method for three data sets coming from two different platforms (198, 175 and 26 BACarrays).
Conclusion
We have designed an automatic algorithm for the spatial normalization of BAC CGHarray data, preventing the misinterpretation of experimental artifacts as biologically relevant outliers in the genomic profile. This algorithm is implemented in the R package MANOR (MicroArray NORmalization), which is described at http://bioinfo.curie.fr/projects/manor and available from the Bioconductor site http://www.bioconductor.org. It can also be tested on the CAPweb bioinformatics platform at http://bioinfo.curie.fr/CAPweb.
Background
Arraybased comparative genomic hybridization (arrayCGH) provides a quantitative measure of differences in copy number between two DNA samples [1]. The technique is typically applied to cancer studies because chromosome aberrations frequently occur during tumor progression [2]. ArrayCGH facilitates the localization and identification of oncogenes and tumor suppressor genes, which are likely to be present in chromosomal regions gained and lost, respectively, in cancer cells.
Recent developments in the statistical analysis of arrayCGH data have focused on highlevel analysis, typically the identification of breakpoints from the genomic profile [3–7], rather than normalization. Most of the normalization techniques used to date for arrayCGH data analysis have therefore involved the simple transposition of methods originally designed for expression data [8, 9], correcting for differences in the labeling efficiency of the two dyes, spotting effects (block, row, column, or printtip effects), and local or global intensity dependence of the ratios [10]. As far as we are aware, Khojasteh et al. [11] have reported the only method specific to CGH arrays.
Investigation of the systematic sources of variation in the arrayCGH data studied showed that the effects affecting expression arrays were negligible with respect to spatial effects of two types. We describe here an algorithm for spatial normalization, which can also be combined with existing normalization methods for handling nonspatial artifacts. We will define and illustrate these two types of spatial effect, and show that such effects are not properly taken into account by traditional normalization techniques.
Two distinct types of spatial artifact
The methods proposed here were originally developed for the analysis of bladder cancer data from tumors collected at Henri Mondor Hospital (Créteil, France) [12], analyzed by hybridization on CGH arrays (F. Radvanyi, D. Pinkel et al., unpublished results), including 2464 clones spotted at the University of California San Francisco (UCSF) [13]. They were then adapted to several data sets for CGH arrays produced and hybridized at the Institut Curie, including the breast cancer data (O. Delattre, A. Aurias et al., unpublished results) and the neuroblastoma data [14] (which is publicly available [15]) used to illustrate the technique.
Local spatial bias
Continuous spatial gradient
The array image shows a smooth gradient in signal from one side of the slide to the other (Fig. 21(a)). This artifact leads to genomic profiles with high variability, even between regions with the same DNA copy number. When this effect is observed, it affects all spots to various degrees.

They occur on arrays designed such that neighboring spots on the array correspond to nonneighboring clones in the genome, so there is no obvious biological reason for the clustering of high (or low) signals on the array;

They are frequently observed on control (normal tissue vs normal tissue) hybridizations, and even on background signals (see Figure 5 for illustration with the breast cancer data set).
The methods proposed are designed to remove or reduce these two types of spatial effect, while preserving the true biological signal.
The need for a spatial segmentation method
The spatial effects described above cannot be attributed to spotting, for two reasons: firstly, they are not limited to array rows, columns or blocks; secondly, they are not reproducible from one array to another, even for arrays taken from batches of slides printed at the same time. Therefore, it is not possible to correct for them properly with the normalization methods generally used for expression arrays, in which "spatial" effects are captured only by row, column, or printtip group effects. For a method to be appropriate, it must take into account the spatial structure of the array as a whole, and the arbitrary shape of these biased areas.
Several different studies have taken into account spatial effects in expression microarray data and have provided signal correction methods. For example, Workman et al. [16] defined a spatial gradient normalization method using a twodimensional Gaussian function to estimate local background bias in a probe neighborhood. Baird et al. [17] proposed a mixed model for cDNA array data, using splines with spatial autocorrelation, assuming the existence of a onestep correlation between adjacent spots in a row or column. Colantuoni et al. [18] proposed a method for normalizing the element signal intensities to a mean intensity calculated locally across the surface of a DNA microarray. Others studies have combined intensitydependent and spatiallydependent effects. Wilson et al. [19] have proposed fitting a single LOESS curve on the MA plot and then spatially smoothing the residuals using a median filter to estimate the spatial trend. Tarca et al. [20] proposed correcting intensitydependent and spatiallydependent effects using a feedforward neural network. Khojasteh et al. [11] have compared different CGH array data normalization methods and suggested that a threestep normalization that combines printtip LOESS with spatial correction using moving median and microplate effect correction gave the best results.
These methods may be suitable for correcting continuous spatial gradients, but they were not designed to detect abrupt changes in signal value across the array, and therefore may not adequately handle local spatial bias: Figure 1 illustrates the need for a spatial segmentation method to handle such local spatial effects. From the mediancentered logratios (a) we estimate a spatial trend (b) by twodimensional LOESS regression [21, 22]; subtracting this spatial trend from the raw values partially corrects the spatial effect (c), but the array trend after correction (d) demonstrates that the spatial effect is undercorrected at the inner border of the biased area, and overcorrected at the outer border, consistent with the observation that signal disturbances vary steeply at the border of the biased area. This systematic overcorrection or undercorrection may lead to misinterpretation in the corresponding genomic profile.
A similar type of spatial effect was reported for expression microarrays by Reimers et al [23]. For CGH arrays, this type of effect should be easier to detect and correct, as they have a much smaller range of signal ratio variation than expression microarrays. However, this smaller range necessitates a much greater measurement precision for arrayCGH data.
We describe here a spatial segmentation algorithm for the automatic delineation and elimination of unreliable areas, facilitating the exclusion of local spatial bias from arrayCGH data. This algorithm consists of three steps, which are explained in detail in the Methods section:
[step 1]: Estimation of a spatial trend on the array using twodimensional LOESS regression [21, 22]
[step 2]: Segmentation of the array into spatial areas with similar trend values using NEM, an unsupervised classification algorithm including spatial constraints [24, 25]
[step 3]: Identification of the areas affected by spatial bias.
A wide variety of microarray techniques based on BACs, cDNAs or oligonucleotides (see [26] for a review) may be used to quantify changes in DNA copy number. From a technical aspect, our method could be applied to any of these microarray types, although we detected local spatial bias only on BAC arrays.

The first data set (bladder cancer data) was produced at the UCSF. In this data set, local spatial effects were observed on 57% of 198 arrays, with a median of 229 affected spots, and no visual evidence of spatial gradients;

The two other data sets were produced at the Institut Curie, INSERM U509. They consist of a breast cancer data set, in which local spatial effects were observed on 45% of 175 arrays, with a median of 592 affected spots, and a neuroblastoma data set [14, 15], with local spatial effects on 23% of 26 arrays, and a median of 551 affected spots.
MANOR: an algorithm combining segmentation and signal correction
In addition to local spatial bias, we also frequently identified continuous spatial gradients, especially in breast cancer data set (Fig. 21(a)) and neuroblastoma data set. A straightforward way to correct for spatial gradients (Fig. 21(b)) is to subtract from the logratios an estimate of the spatial trend on the array (Fig. 22(a, b)). The first step of the spatial segmentation algorithm for detecting local spatial bias (step 1) provides such an estimate. This estimate is calculated using twodimensional LOESS regression as explained in detail in the Methods section.
 1.
run the spatial segmentation algorithm (seg) to identify potential areas of local spatial bias
 2.
correct spots not excluded during the first step for continuous spatial gradients (2dLoess).
This algorithm, implemented in the MANOR package, will be referred to as seg+2dLoess in the remainder of this article. The rationale underlying this twostep approach is that arrays affected by continuous spatial gradients only will not be detected as containing local spatial bias by the step seg, and will therefore be properly corrected by the step 2dLoess. This twostep approach is suitable for the spatial normalization of data sets containing both types of spatial effect.
Results and discussion
We have used our method for the spatial normalization of arrayCGH data from two different platforms. In this section, we provide information about the practical implementation of the method on these two platforms, and quantitative results comparing our method to ten other normalization techniques. These compare the values of three quality criteria calculated after normalization of each array: the first, sigma, estimates the experimental variability between replicates, whereas the others, smt and dyn, evaluate quality in the context of the estimation of differences in DNA copy number between test and reference samples: smt quantifies the smoothness of the signal over the genome, and dyn assesses the dynamics of the signal, defined by the signaltonoise ratio between gained and normal regions; these criteria are defined more formally and explained in detail in the Methods section.
To our knowledge, the ten normalization procedures used for the comparisons cover all the different types of approaches proposed so far and include the methods proposed by Tarca et al. [20], Yang et al. [10] and Khojasteh et al. [11]. These methods are detailed in the Methods section. For each normalization method, we calculated the three quality criteria for each array. When comparing two methods, we calculated a relative performance for each quality criterion, and assessed the significance of this performance using a Student's ttest, as explained in the Methods section. We show that our proposed method outperforms all previously published approaches for the three data sets.
Application to data produced at UCSF

Neighboring clones in the genome are dispersed on the array – a necessary condition for distinguishing between spatial artifacts and real biological information;

Each clone is replicated three times on the array, and the three replicated spots are adjacent, so a high level of consistency for the three corresponding ratios does not prove that there are no spatial effects.
For this data set, spatial normalization is the last step in the following comprehensive normalization process. After image analysis of the arrays with SPOT 2.0 software [27], we screened for lowquality spots: spots with a foreground reference signal (and foreground DAPI signal) less than 125% of the background reference signal (reference DAPI signal) were discarded, as were clones with a logratio standard deviation exceeding 0.1. Clones for which only one of the three replicates was retained after these steps were then also discarded.
Finally, we applied the proposed spatial normalization method seg+2dLoess as follows: the spatial segmentation seg was applied to the logratios of this filtered array, with K = 5 and β = 1 (see Methods for a definition of these parameters and a discussion of how to choose them), followed by the correction for continuous spatial gradients 2dLoess.
Spatial normalization step
Our segmentation algorithm detected local spatial effects on 113 of 198 bladder cancer arrays (57%); the median proportion of biased areas on these arrays was 3.1%. Figure 3 (top) illustrates the successive steps of the algorithm, from centered logratios to array trend, spatial segmentation of the array, and finally the delineation of biased areas. Red dots on the corresponding genomic profile (Figure 3, bottom) correspond to the spots discarded during spatial normalization (on this figure, signal logratios have not yet been averaged by clone: spotlevel information is displayed).
Figure 3 (bottom) illustrates the improvement in data quality achieved with our spatial normalization method: among the apparent outliers (i.e. clones with logratio values significantly different from the mean logratio value for the genomic region), it distinguished between experimental artifacts (red dots) and potentially biologically relevant outliers accounting for localized genomic amplifications.
Evaluation of the performance of the seg+2dLoess method
For each normalization method (11 methods including ours), we calculated the three quality criteria for each array and performed pairwise comparison of methods using the estimate and significance of their relative performance for each criterion, as explained in detail in the Methods section.
Figure 4 shows the results of comparison of the ten methods with seg+2dLoess. For the dyn criterion, seg+2dLoess significantly outperformed all methods (with all pvalues ≤ 0.039), and most significantly methods 5 to 11, that do not include the 2dLoess step (with all pvalues below 8.5 × 10^{18}). The dyn criterion is particularly important as it assesses the quality of copy number change detection. seg+2dLoess also gives significantly better results for the sigma criterion than all other methods (with all pvalues below 1.1 × 10^{8}) except one: seg performs significantly better (p = 7.9 × 10^{4}) but the relative improvement has a limited amplitude (only 0.36%).
For the smt criterion, seg+2dLoess also significantly outperforms all methods (with all pvalues below 8.1 × 10^{6}, except block+2dLoess for which p = 0.048).
Section 1 of the Additional file 1 shows similar plots to Figure 4, but for the smt and dyn criteria, and for the smt and sigma criteria. Tables 1 to 3 of the Additional files 2 and 3 summarize the results of all the pairwise comparisons of methods for the three quality criteria.
Taken together, these results show that the seg+2dLoess method outperforms its competitors for the bladder cancer data set.
Application to data produced at Institut Curie, INSERM U 509
The Institut Curie, INSERM U509 has developed its own highdensity CGH array; all steps in the production of these chips are performed in Institut Curie laboratories, including array spotting, DNA preparation, hybridization, scanning and image processing. The current version of the array contains 3342 clones, each of which is spotted at least three times on the array, giving a total of 10800 to 11520 spots (including controls).
This array was designed to facilitate distinction between relevant biological effects and experimental artifacts: "empty" spots and spots of water were included as controls, clone replicates were scattered over the array, and the positions of clones on the array are not correlated with their actual positions in the genome. A reliable ratio value can therefore be calculated even if one of the three replicates is flagged. The arrays were scanned using an Axon Genepix 4000b scanner, and images were processed with Genepix Pro 5.1.
We analyzed a breast cancer data set and a neuroblastoma data set from this platform.
For this platform, we applied the proposed spatial normalization method seg+2dLoess as follows: the spatial segmentation seg was applied to the Background signal as explained in the paragraph below, and the spatial gradients were corrected by 2dLoess calculated over the logratios. A postprocessing step that includes spot and clone screening was then applied (allowing us, for example, to discard spots having too low a signaltonoise ratio, or with poor replicate consistency).
Detail of the spatial segmentation step
Although we can correct the foreground signal for background intensity, a significant proportion of arrays still show localized spatial patterns that cannot be attributed to biological causes. Visual examination of spatial representations of the four signals (foreground and background intensities for test and reference signals) revealed that the bias was much clearer for the background signal of Cy3labeled samples (Figure 5), which was not the case for bladder cancer data. We therefore applied the spatial segmentation method described above to the background signal of the Cy3 channel, with K = 7 and β = 1 (see Methods for a definition of these parameters and a discussion of how to choose them).
Biased areas of the CGH array are flagged and excluded from subsequent analysis. As clone replicates are not adjacent on the array, at least two of the three replicates generally remain after spatial bias correction, and a reliable ratio value can still be calculated. Figure 6 shows the results of this spatial segmentation step in the case of an array with local spatial bias but no spatial gradients.
Evaluation of the performance of the method seg+2dLoess
As for bladder cancer data, we calculated the three quality criteria for each normalization method and for each array for the breast cancer data set and the neuroblastoma data set. We then compared the methods paiwise using the estimate and significance of their relative performance for each criterion, as explained in detail in the Methods section.
The neuroblastoma data set gives similar results: seg+2dLoess quality criteria are always better than those of the other methods, except for dyn, in which adjSeg+2dLoess is slightly better (0.22%) but not significantly so (p = 0.1). For smt, seg+2dLoess is only slightly better than ptl+movMed and the methods including the 2dLoess step, but not significantly so for adjSeg+2dLoess and ptl+movMed. In these cases, the small size of the data set (26 arrays, 6 with local spatial bias) affects the statistical power.
Section 2 and 3 of the Additional file 1 and Tables 4 to 9 of the Additional files 2 and 3 detail and complement these results.
These results show that the seg+2dLoess method outperforms the other methods on the two data sets produced on the Institut Curie, INSERM U509 platform. The results also allow the methods to be ranked in terms of performance. Those methods that include a twodimensional LOESS step are the highest ranked, with the methods proposed by [11, 10] and [20], which all include some spatial processing, being next, and the other methods being the lowest ranked (see Figure 7 for example).
Conclusion
We have designed an efficient and automated algorithm for the spatial normalization of BAC arrayCGH data, and defined a set of parameters for CGH array data quality assessment. We have shown that our method significantly improves the quality of data from two different BACarray platforms and outperforms other normalization techniques on three data sets.
The proposed algorithm is particularly suitable for correcting spatial effects not related to array design (row, column, or printtip group effects): indeed, the arrays studied show two distinct types of such spatial effect (local spatial bias and continuous spatial gradients), which can simultaneously affect any given array. In such cases, using spatial trend correction after spatial segmentation helps to remove or reduce these two types of spatial effect, while preserving the true biological signal.
This method is original in the application of a segmentation algorithm for detecting and removing local spatial bias, preventing the misinterpretation of experimental artifacts as biologically relevant outliers in the genomic profile.
This method was developed for arrayCGH experiments, and gave very good results. However, it can be applied to any microarray experiment having the same types of spatial effect.
Availability and requirements
Methods
In this section, we provide details of the segmentation method and the other normalization techniques used for comparison, and of the quality criteria proposed. We also discuss the choice of the two parameters of the segmentation algorithm: K and β.
Description of the segmentation algorithm (seg)
The segmentation method consists of three steps:
[step 1]: Estimation of a spatial trend on the array using twodimensional LOESS regression [21, 22]
[step 2]: Segmentation of the array into spatial areas with similar trend values, using NEM, an unsupervised classification algorithm including spatial constraints [24, 25]
[step 3]: Identification of the areas affected by spatial bias.
[step 1]: spatial trend estimation
We decided to carry out spatial segmentation based on an estimate of the spatial trend on the array, to optimize the robustness of segmentation. Furthermore, estimation of this trend makes it possible to replace missing values by interpolating the spatial trend.
The trend is estimated by means of a twodimensional LOESS procedure with three iterative reweighting steps [21, 22]. The local estimation is linear and the neighborhood taken into account to fit the local model corresponds to 3% of the total number of points. We use an iterative reweighting procedure to avoid outlier effects. Indeed, in the context of cancer studies, we are investigating changes in DNA copy number, and some clones displaying an amplification or a homozygous deletion may generate extreme but biologically meaningful values, which should not be interpreted as a local spatial bias.
When the spatial trend is estimated from the logratios, we first apply a basic correction to these logratios to prevent confusion between spatial artifacts and biologically relevant effects. For each chromosome arm, centered logratios are calculated as follows: the median of the corresponding logratio values is calculated and then subtracted from the initial values. The spatial trend is estimated from these centered logratios. This method helps to decrease the impact of true genomic aberrations on the detection of spatial trends in the data, particularly for samples with many, or large genomic alterations, as most of these alterations correspond to the gain or loss of whole chromosome arms.
[step 2]: spatial segmentation
This step aims to identify K clusters corresponding to spots with similar signal levels located close together geographically. This is achieved by Neighborhood Expectation Maximization (NEM) [24, 25]. We assume that the data are drawn from a mixed Gaussian density function $f({x}_{i}\Phi )={\displaystyle {\sum}_{k=1}^{K}{p}_{k}{f}_{k}({x}_{i}{\theta}_{k})}$ where p_{ k }are the proportions of the mixture model, f_{ k }(x_{ i }θ_{ k }) denotes the density function of a Gaussian distribution with parameter θ_{ k }= (μ_{ k }, Σ_{ k }) and Φ = {p_{1},..., p_{ k }, θ_{1},..., θ_{ K }} is the set of parameters to be estimated. The classical EM algorithm considers the following decomposition of the likelihood:
where
In the mixture model context, [32] pointed out that the EM algorithm is formally equivalent to the alternative maximization of L (c, Φ) with respect to c ("E" step) and with respect to Φ ("M" step). The NEM algorithm is original in that it regularizes the likelihood by means of a term that takes into account the spatial dimension of the problem through the following adjacency matrix:
Here, the neighbors of a point located at coordinates (l, m) are the four points with the following coordinates: (l +1, m), (l  1, m), (l, m  1). We define the following quantity:
Thus, instead of maximizing L (c, Φ) in the E step, we maximize L (c, Φ) + βG (c). The value of β controls the weighting of the geographical context in the maximization. The M step remains unchanged.
[step 3]: elimination of local spatial bias
The basic idea is to remove from the array those spatial clusters with signal values significantly higher (or lower) than the unbiased areas of the array. We describe here the situation for positive spatial bias, but the idea can be adapted to negative bias. As local spatial biases cover a limited proportion of the array, we introduced a tuning parameter p_{ max }, which corresponds to the maximum proportion of the array image corresponding to local spatial bias. In our experiment, local spatial bias typically applies to less than one quarter of the array, so we used p_{ max }= 0.25.
After sorting the clusters identified by NEM by decreasing mean signal, we consider only those clusters with cumulative frequencies lower than p_{ max }to be potentially biased, making it possible to define a set of candidate clusters. The mean signal value of the remaining clusters is used as a reference value for the unbiased signal. Each candidate cluster with a mean signal differing from this reference value by more than a given threshold value is considered biased. The other candidates are considered unbiased, unless their mean signal is closer to that of the biased cluster than to that of the reference: such clusters are also considered biased. This threshold was chosen based on the crossvalidation of arrays analyzed by experts.
Comparison to other normalization methods

A printtip group method:

A printtip group with intensity dependent effect method:

A spatial smoothing method:

Two spatial segmentation methods:
seg (segmentation of local spatial bias): we apply the spatial segmentation algorithm described above to automatically eliminate the biased area.

A method combining printtip group and spatial smoothing:

Two methods combining intensity dependent effect and spatial smoothing:
nnNorm (neural network normalization): we apply the normalization method described by Tarca et al. [20] using the nnNorm R package (1.5.1 release, with default parameters) available from Bioconductor. Briefly, this technique uses a neural network approach to correct the intensitydependent and spatiallydependent effects.

Two methods combining spatial segmentation and spatial smoothing:
adjSeg+2dLoess (correction of local spatial bias and continuous spatial gradients): we apply the 2dLoess method on the normalized logratio values obtained with the adjSeg method.

Raw logratio values with no normalization (none).
ArrayCGH data quality assessment
Definition of quality criteria
Evaluation of the quality of the signal ratios of an array facilitates the comparison of different image analyses or normalization algorithms, and makes it possible to quantify the improvement achieved by each step of a given normalization algorithm. We define three criteria for assessing the quality of the analyzed array: the first addresses the issue of overall quality whereas the other two provide quality evaluations for the estimation of differences in DNA copy number between test and reference samples.
sigma The first item provides an estimate of experimental noise. We isolate each clone and calculate the standard deviation of the logratio of the corresponding replicates. sigma is defined as the median of these standard deviations: the smaller the value of sigma, the higher the quality of the array.
The other two criteria are calculated after detection of the altered (gained or lost) regions in the test sample. We used the GLAD algorithm, developed by Hupé et al. [4] for this purpose:
smt Within a given DNA copy number region, the ratios of contiguous clones should not differ considerably. The second quality criterion concerns the smoothness of the signal logratios within such a chromosomal region: signal smoothness is defined as the median absolute difference between logratios for contiguous normal clones. If N denotes the set of clones considered normal after DNA copy number estimation, we can calculate
smt = median_{n∈N}x_{(n)} x_{(n 1)},
where x_{(n)}is the value of the logratio at the n^{th} clone in genome order.
dyn The last criterion estimates the dynamics of DNA copy number variation between test and reference samples. We calculate the discrepancy between the median ratios of the regions considered "gained"(G) and "normal"(N) after DNA copy number estimation, and compare it with signal smoothness, as measured by smt:
If no gained region is detected, we compare "normal" regions with "lost"(L) regions.
smt and dyn are not independent parameters and are anticorrelated. However, they quantify related but different ideas, as smt estimates the noise level after data normalization whereas dyn measures the ability to detect genome alterations after data normalization.
Paiwise comparison of quality criteria
These three criteria help us to decide which of two normalization methods gives the best results for a given array. In this pairwise comparison context, smt and dyn must be calculated with the same definition of G, N, and L regions for the two normalized arrays. We therefore define consensus G, N, and L regions associated with an array processed with two different normalization methods as the intersection of the two corresponding G, N, and L regions obtained using the two different normalization methods.
In order to test whether method j is better than method i, we defined a relative performance for each quality criterion as follows:
We calculated this relative performance for each array, and assessed its significance by testing the hypotheses ${\mathscr{H}}_{i,j}$: {RP^{ qc } (i,j) < 0} for each quality criterion qc, using a Student's unilateral ttest.
In figures 4, 7, and 8, we calculated relative performances RP(seg+2dLoess, test) where test corresponds to one of the ten other methods. Hence a negative value for RP (seg+2dLoess, test) indicates that our proposed method outperforms the test method.
Parameter choice for the segmentation algorithm
The segmentation algorithm includes two parameters: the number K of clusters, and the regularization parameter β, which controls the weighting of geographic context in signal segmentation. Our experience suggests that the optimal choice of K and β may depend on the arrayCGH technology used. We therefore provide guidelines for the choice of suitable parameters of the algorithm. We have investigated two different approaches to the choice of (K, β): incorporating a model selection criterion into the algorithm so that an optimal (K, β) can be chosen for each array, or developing a calibration method to help the user to find relevant sets of parameters for analyzing a whole data set. In this section, we discuss these two approaches and justify our choice of the second solution.
The difficulty finding optimal parameters on a per array basis
Choice of the number K of components in a mixture model can be addressed using model selection criteria. The basic idea is as follows: as the maximum likelihood estimator of the model increases mechanically with K (as model complexity increases with K), this method subtracts an increasing function of K from the likelihood of the model with K components, to prevent model overfitting. Many applications use the Akaike Information Criterion (AIC) or the Bayesian Information Criterion (BIC) for this purpose. However, in our framework, K and β must be chosen simultaneously, because β also affects the maximum likelihood estimator. As we have no information concerning the quantitative behavior of the maximum likelihood estimator with respect to K and β (this complex question is beyond the scope of this paper), the choice of an appropriate penalization remains arbitrary.
We also considered an approach involving the fitting of K using model selection criteria and crossvalidating the choice of β, but this approach has major drawbacks: first, it strongly increases the complexity of the estimation process, making this method too timeconsuming for use as a routine normalization method; second, it makes the normalization method difficult to interpret, because two arrays from the same platform will not be treated with the same parameters.
Guidelines for choosing relevant parameters for analyzing a new data set
Rather than searching for optimal (K, β) values for each array, we provide a calibration method making it possible to choose appropriate (K, β) values for each data set. The basic principle of the calibration method is comparison of the output of our algorithm run on different (K, β) pairs, taken from a predefined grid (e. g. K ∈ {2,... 10} and β ∈ {0.1,0.2,...2.0}).
We considered two different approaches to compare the results of the segmentations and to choose appropriate (K, β) values. The first approach involved choosing a (K, β) combination that optimizes quality criteria. The second involves expert assessment. An expert examines each array from a representative set and determines whether there is local spatial bias: he or she checks both the array image and the genomic profile to guarantee that the spatial effect is due to an experimental artifact rather than a biological effect. We then select the (K, β) combination that gives the best agreement between the expert decision and the algorithm decision. We call this second approach expert assessment. We found this second method simpler and more efficient than the first, for a number of reasons, outlined below.
In the first approach, quality criteria are calculated after normalization and DNA copy number assessment, so these three steps have to be carried out for each (K, β) combination. Therefore, although this method has the obvious advantage of not relying on expert assessment, it is timeconsuming, and provides only indirect evaluations of the differences between pairs of parameters, which may make the results hard to interpret. Moreover, a much lower level of variation was observed in the values of quality criteria for different (K, β) combinations for a given array than between arrays, so we were unable to identify optimal (K, β) values with this method (data not shown).
In the second approach, we considered two different ways of performing the expert assessment: either identifying arrays displaying local spatial bias (qualitative assessment), or estimating the number of spots that should be discarded (quantitative assessment). We found quantitative assessment to be very poorly reproducible, with large differences between experts, and much more timeconsuming than the qualitative method. Therefore, we adopted the qualitative method, which made possible the rapid expert assessment of a larger number of arrays, thus increasing the accuracy of parameter choice.
Based on the qualitative expert assessment of an entire data set or a subset of data, we compare, for each array, the decision of our algorithm (has the algorithm detected a local spatial bias?) with that of the expert. We then calculate the proportion of false positives and false negatives for each combination of the parameters K ∈ {2,...10} and β ∈ {0.1, 0.2,... 2.0}. Qualitative expert assessment remains highly variable (significant differences between experts), as a substantial proportion of arrays are difficult to classify. Nevertheless, all assessments show the same form of dependence in the error rate in (K, β), and lead to selection of the same parameters (data not shown).
Drawing figures such as Figure 9 or 10 for any new data set can facilitate the identification of relevant sets of parameters for the segmentation algorithm. In our case, they suggest values of K = 5 and β between 0.9 and 1.3 for bladder cancer data set, and K = 7 or 8 and β between 0.9 and 1.3 for breast cancer data set. We used K = 5, β = 1 for the bladder cancer data set, and K = 7, β = 1 for the breast cancer data set.
Notes
Declarations
Acknowledgements
This work was supported by the Institut Curie, the Institut National de la Santé et de la Recherche Médicale, the Centre National de la Recherche Scientifique, the IST program from the European Commission through the HKIS project (IST200138153), the Cancéropole Ile de France, and the association Courir pour la vie, courir pour Curie.
The construction of the 3.3K BACarray by Institut Curie, INSERM U509 was supported by grants from the Carte d'Identité des Tumeurs program of the Ligue Nationale Contre le Cancer.
We thank Isabelle JanoueixLerosey and Olivier Delattre (Institut Curie, INSERM U509) for making the neuroblastoma data set publicly available.
We thank Nadège Gruel, Virginie Raynal, Gaelle Pierron, Olivier Delattre (Institut Curie, INSERM U509) and Daniel Pinkel (University of California San Francisco) for fruitful discussions.
Authors’ Affiliations
References
 Pinkel D, Segraves R, Sudar D, Clark S, Poole I, Kowbel D, Collins C, Kuo WL, Chen C, Zhai Y, Dairkee SH, Ljung BM, Gray JW, Albertson DG: High resolution analysis of DNA copy number variation using comparative genomic hybridization to microarrays. Nat Genet 1998, 20: 207–211. 10.1038/2524View ArticlePubMedGoogle Scholar
 Albertson DG, Collins C, McCormick F, Gray JW: Chromosome aberrations in solid tumors. Nat Genet 2003, 34: 369–76. 10.1038/ng1215View ArticlePubMedGoogle Scholar
 Fridlyand J, Snijders A, Pinkel D, Albertson DG, Jain AN: Application of Hidden Markov Models to the analysis of the array CGH data. Journal of Multivariate Analysis 2004. Special Issue on Multivariate Methods in Genomic Data Analysis Special Issue on Multivariate Methods in Genomic Data AnalysisGoogle Scholar
 Hupé P, Stransky N, Thiery JP, Radvanyi F, Barillot E: Analysis of array CGH data: from signal ratios to gain and loss of DNA regions. Bioinformatics 2004, 20: 3413–3422. 10.1093/bioinformatics/bth418View ArticlePubMedGoogle Scholar
 Jong K, Marchiori E, van der Vaart A, Ylstra B, Weiss M, Meijer G: Chromosomal Breakpoint Detection in Human Cancer. In Applications of Evolutionary Computing, EvoWorkshops2003: EvoBIO, EvoCOP, EvoIASP, EvoMUSART, EvoROB, EvoSTIM, Volume 2611 of LNCS. Edited by: Raidl GR, Cagnoni S, Cardalda JJR, Corne DW, Gottlieb J, Guillot A, Hart E, Johnson CG, Marchiori E, Meyer JA, Middendorf M. University of Essex, England, UK: SpringerVerlag; 2003:54–65.Google Scholar
 Olshen AB, Venkatraman ES, Lucito R, Wigler M: Circular binary segmentation for the analysis of arraybased DNA copy number data. Biostatistics 2004, 5: 557–572. 10.1093/biostatistics/kxh008View ArticlePubMedGoogle Scholar
 Picard F, Robin S, Lavielle M, Vaisse C, Daudin JJ: A statistical approach for array CGH data analysis. BMC Bioinformatics 2005, 6: 27. 10.1186/14712105627PubMed CentralView ArticlePubMedGoogle Scholar
 Pollack JR, Sorlie T, Perou CM, Rees A, Jeffreys SS, Lonning P, Tibshirani R, Botstein D, BorresenDale AL, Brown PO: Microarray analysis reveals a direct role of DNA copy number alteration in the transcriptional program of breast tumors. PNAS 2002.Google Scholar
 Wang J, MezaZepeda LA, Kresse SH, Myklebost O: MCGH: Analysing microarraybased CGH experiments. BMC Bioinformatics 2004, 5: 74. 10.1186/14712105574PubMed CentralView ArticlePubMedGoogle Scholar
 Yang YH, Dudoit S, Luu P, Lin DM, Pend V, Ngai J, Speed TP: Normalization of cDNA microarray data: a robust composite method addressing single and multiple slide systematic variation. Nucleic Acids Research 2002, 30: e15:1e15:11.Google Scholar
 Khojasteh M, Lam WL, Ward RK, MacAulay C: A stepwise framework for the normalization of array CGH data. BMC Bioinformatics 2005, 6: 274. 10.1186/147121056274PubMed CentralView ArticlePubMedGoogle Scholar
 Billerey C, Chopin D, AubriotLorton MH, Ricol D, Gil S, Van Rhijn B, Bralet MP, LefrereBelda MA, Lahaye JB, Abbou CC, Bonaventure J, Zafrani ES, van der Kwast T, Thiery JP, Radvanyi F: Frequent FGFR3 mutations in papillary noninvasive bladder(pTa) tumors. Am J Pathol 2001, 158: 955–1959.View ArticleGoogle Scholar
 Snijders AM, Nowak N, Segraves R, Blackwood S, Brown N, Conroy J, Hamilton G, Hindle AK, Huey B, Kimura K, SL S, Myambo K, Palmer J, Ylstra B, Yue JP, Gray JW, Jain AN, Pinkel D, Albertson DG: Assembly of microarrays for genomewide measurement of DNA copy number. Nat Genet 2001, 29: 263–4. 10.1038/ng754View ArticlePubMedGoogle Scholar
 JanoueixLerosey I, Hupé P, Maciorowski Z, La Rosa P, Pierron G, Manié E, Liva S, Barillot E, Delattre O: Preferential occurence of chromosome breakpoints within early replicating regions in neuroblastoma. Cell Cycle 2005, 4: 1842–1846.View ArticlePubMedGoogle Scholar
 Replication timing data analysis in Neuroblastoma[http://microarrays.curie.fr/publications/U509/reptiming]
 Workman C, Jensen LJ, Jarmer H, Berka R, Gautier L, Nielsen HB, Saxild HH, Nielsen C, Brunak S, Knudsen S: A new nonlinear normalization method for reducing variability in DNA microarray experiments. Genome Biology 2002, 3(9):research0048.1–0048.16. 10.1186/gb200239research0048View ArticleGoogle Scholar
 Baird D, Johnstone P, Wilson T: Normalization of Microarray Data Using a Spatial Mixed Model analysis which includes Splines. Bioinformatics 2004, 20: 3196–3205. 10.1093/bioinformatics/bth384View ArticlePubMedGoogle Scholar
 Colantuoni C, Henry G, Zeger S, Pevsner J: Local mean normalization of microarray element signal intensities across an array surface: quality control and correction of spatially systematic artifacts. Biotechniques 2002, 32: 1316–1320.PubMedGoogle Scholar
 Wilson DL, Buckley MJ, Helliwell CA, Wilson IW: New normalization methods for cDNA microarray data. Bioinformatics 2003, 19: 1325–1332. 10.1093/bioinformatics/btg146View ArticlePubMedGoogle Scholar
 Tarca AL, Cooke JEK, Mackay J: A robust neural networks approach for spatial and intensitydependent normalization of cDNA microarray data. Bioinformatics 2005, 21(11):2674–2683. 10.1093/bioinformatics/bti397View ArticlePubMedGoogle Scholar
 Cleveland W, Devlin S, Grosse E: Regression By Local Fitting. Journal of Econometrics 1988, 37: 87–114. 10.1016/03044076(88)900772View ArticleGoogle Scholar
 Cleveland WS, Grosse E: Computational Methods for Local Regression. Statistics and Computing 1991, 1: 47–62. 10.1007/BF01890836View ArticleGoogle Scholar
 Reimers M, Weinstein JN: Quality assessment of microarrays: visualization of spatial artifacts and quantitation of regional biases. BMC Bioinformatics 2005, 6: 166. 10.1186/147121056166PubMed CentralView ArticlePubMedGoogle Scholar
 Ambroise C: Approche probabiliste en classification automatique et contraintes de voisinage. PhD thesis. Université Technique de Compiègne, France; 1996.Google Scholar
 Ambroise C, Dang M, Govaert G: Clustering of spatial data by the EM algorithm. In Geostatistics for Environmental Applications. Edited by: Soares A, GomesHernandez J, Froidevaux R. Kluwer Academic Publisher; 1997:493–504.View ArticleGoogle Scholar
 Pinkel D, Albertson DG: Array comparative genomic hybridization and its applications in cancer. Nat Genet 2005, (37 Suppl):S11S17. 10.1038/ng1569Google Scholar
 Jain AN, Tokuyasu TA, Snijders AM, Segraves R, Albertson DG, Pinkel D: Fully automatic quantification of microarray image data. Genome Res 2002, 12: 325–332. 10.1101/gr.210902PubMed CentralView ArticlePubMedGoogle Scholar
 MANOR: CGH MicroArray NORmalization[http://bioinfo.curie.fr/projects/manor]
 Bioconductor: Open software development for computational biology and bioinformatics[http://www.bioconductor.org]
 Liva S, Hupé P, Neuvial P, Brito I, Viara E, La Rosa P, Barillot E: CAPweb : a bioinformatics CGH array Analysis Platform. Nucleic Acids Research 2006, in press.Google Scholar
 CAPweb : a bioinformatics CGH array Analysis Platform[http://bioinfo.curie.fr/CAPweb]
 Hathaway RJ: Another interpretation of the EM algorithm for mixture distributions. Journal of Statistics and Probability Letters 1986, 4: 53–56. 10.1016/01677152(86)900167View ArticleGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.