A statistical approach for array CGH data analysis
 Franck Picard^{1},
 Stephane Robin^{1},
 Marc Lavielle^{2},
 Christian Vaisse^{3} and
 Jean-Jacques Daudin^{1}
DOI: 10.1186/1471-2105-6-27
© Picard et al; licensee BioMed Central Ltd. 2005
Received: 18 August 2004
Accepted: 11 February 2005
Published: 11 February 2005
Abstract
Background
Microarray-CGH experiments are used to detect and map chromosomal imbalances, by hybridizing targets of genomic DNA from a test and a reference sample to sequences immobilized on a slide. These probes are genomic DNA sequences (BACs) that are mapped on the genome. The signal has a spatial coherence that can be handled by specific statistical tools, and segmentation methods seem to be a natural framework for this purpose. A CGH profile can be viewed as a succession of segments that represent homogeneous regions in the genome whose BACs share the same relative copy number on average. We model a CGH profile by a random Gaussian process whose distribution parameters are affected by abrupt changes at unknown coordinates. Two major problems arise: determining which parameters are affected by the abrupt changes (the mean and the variance, or the mean only), and selecting the number of segments in the profile.
Results
We demonstrate that existing methods for estimating the number of segments are not well adapted to array CGH data, and we propose an adaptive criterion that detects previously mapped chromosomal aberrations. The performance of this method is assessed using simulations and publicly available data sets. We then discuss the choice of model for array CGH data and show that the model with a homogeneous variance is adapted to this context.
Conclusions
Array CGH data analysis is an emerging field that needs appropriate statistical tools. Process segmentation and model selection provide a theoretical framework that allows precise biological interpretations. Adaptive methods for model selection give promising results concerning the estimation of the number of altered regions on the genome.
Background
Chromosomal aberrations often occur in solid tumors: tumor suppressor genes may be inactivated by physical deletion, and oncogenes activated via duplication in the genome. The gene dosage effect has become particularly important in the understanding of human solid tumor genesis and progression, and has also been associated with other diseases such as mental retardation [1, 2]. Chromosomal aberrations can be studied using many different techniques, such as Comparative Genomic Hybridization (CGH), Fluorescence in Situ Hybridization (FISH), and Representational Difference Analysis (RDA). Although chromosome CGH has become a standard method for cytogenetic studies, technical limitations restrict its usefulness as a comprehensive screening tool [3]. Recently, the resolution of Comparative Genomic Hybridization has been greatly improved using microarray technology [4, 5].
The purpose of array-based Comparative Genomic Hybridization (array CGH) is to detect and map chromosomal aberrations, on a genomic scale, in a single experiment. Since chromosomal copy numbers cannot be measured directly, two samples of genomic DNA (referred to as the reference and test DNAs) are differentially labelled with fluorescent dyes and competitively hybridized to known mapped sequences (referred to as BACs) that are immobilized on a slide. The ratio of the intensities of the two fluorochromes is then computed, and a CGH profile is constituted for each chromosome when the log_{2} of the fluorescence ratios are ranked and plotted according to the physical position of their corresponding BACs on the genome [6]. Different methods and packages have been proposed for the visualization of array CGH data [7, 8].
Each profile can be viewed as a succession of "segments" that represent homogeneous regions in the genome whose BACs share the same relative copy number on average. Array CGH data are normalized with a median set to log_{2}(ratio) = 0 for regions of no change, segments with positive means represent duplicated regions in the test sample genome, and segments with negative means represent deleted regions. Even if the underlying biological process is discrete (counting of relative copy numbers of DNA sequences), the signal under study is viewed as being continuous, because the quantification is based on fluorescence measurements, and because the possible values for chromosomal copy numbers in the test sample may vary considerably, especially in the case of clinical tumor samples that present mixtures of tissues of different natures.
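As an illustration of this view of the signal, a normalized profile can be simulated as a piecewise-constant Gaussian process. The segment layout and noise level below are arbitrary illustrative choices, not values from the paper, and Python stands in for the authors' MATLAB code:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical segment layout: (number of BACs, mean log2-ratio).
# A mean of 0 is an unchanged region, a positive mean a gained region
# (e.g. log2(3/2) ~ 0.58 for a single-copy gain), a negative mean a loss.
segments = [(30, 0.0), (15, 0.58), (25, 0.0), (10, -1.0), (20, 0.0)]

# Piecewise-constant means plus homogeneous Gaussian noise (sigma = 0.2).
profile = np.concatenate(
    [mu + 0.2 * rng.standard_normal(n) for n, mu in segments]
)
positions = np.arange(1, len(profile) + 1)  # rank of each BAC on the genome
```

Plotting `profile` against `positions` reproduces the familiar shape of a CGH profile: flat stretches around 0 interrupted by shifted segments.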
Two main statistical approaches have been considered for the analysis of array CGH data. The first has received the most attention and is based on segmentation methods, where the purpose is to locate segments of biological interest [7, 9–11]. A second approach is based on Hidden Markov Models (aCGH R package [12]), where the purpose is to cluster individual data points into a finite number of hidden groups. Our approach belongs to the first category. Segmentation methods seem to be a natural framework to handle the spatial coherence of the data along the genome that is specific to array CGH. In this context, the signal provided by array CGH data is supposed to be a realization of a Gaussian process whose parameters are affected by an unknown number of abrupt changes at unknown locations on the genome. Two models can be considered, according to the characteristics of the signal that are affected by the changes: either the mean of the signal [7, 10, 11], or the mean and the variance [9]. Since the choice of model is crucial to any interpretation of a segmented CGH profile, we provide guidelines for this choice in the discussion. Two major issues arise in breakpoint detection studies: the localization of the segments on the genome, and the estimation of the number of segments. The first point has led to the definition of many algorithms and packages: segmentation algorithms [9, 10] and smoothing algorithms [11], where the breakpoints are defined by a posterior empirical criterion. These methods are defined by a criterion to optimize and an optimization algorithm. Different criteria have been proposed: the likelihood criterion [9, 11], the least-squares criterion [7], and partial sums [10]; the optimization algorithms are based on genetic algorithms [9], dynamic programming [7], binary segmentation (DNAcopy R package [10]) and adaptive weights smoothing (GLAD R package [11]).
Since many criteria and algorithms have been proposed, an important question concerns the statistical properties of the breakpoint estimators they provide. Note that smoothing techniques do not provide estimators of the breakpoint coordinates, since the primary goal of the underlying model is to smooth the data, and breakpoints are not parameters of the model (in this case, they are defined after the optimization of the criterion [11]). Here we consider the likelihood criterion, and we use dynamic programming, which provides a globally optimal solution, contrary to genetic algorithms [9], in a reasonable computational time.
As for the estimation of the number of segments, existing articles have not defined any statistical criterion adapted to the case of process segmentation. This problem is theoretically complex, and has led to ad hoc procedures [9–11]. Since the purpose of array CGH experiments is to discover biological events, the estimation of the number of segments remains central. This problem can be handled in the more general context of model selection. In the discussion we explain why classical criteria based on penalized likelihoods are not valid for breakpoint detection: criteria such as the Akaike Information Criterion (AIC) and the Bayes Information Criterion (BIC) lead to an overestimation of the number of segments. For this reason, an arbitrary penalty constant can be chosen in order to select a lower number of segments in the profile [9]. We propose a new procedure to estimate the number of segments, choosing the penalty constant adaptively from the data. We explain the construction of this penalty, and its performance is compared to that of other criteria in the Results Section, based on simulation studies and on publicly available data sets. Put together, we propose a methodology that relies on a simple model, uses a fast and effective optimization algorithm, and takes advantage of the statistical properties of the maximum likelihood. Our procedure has been implemented in MATLAB and is freely available at http://www.inapg.fr/ens_rech/mathinfo/recherche/mathematique/outil.html.
Results
Comparison of model selection criteria
The behavior of the criterion we propose is different: it appears more robust to noise, giving a number of segments close to the true number. In particular, the irregular configuration presents a small segment (5 points at t = 80) that could be interesting to detect in an array CGH profile (a putative gained region, for instance). Since the previously described criteria [9, 13] tend to underestimate the number of segments, this particular region would not be detected. On the contrary, the adaptive criterion is able to detect it even when the noise is important, since it selects a number of segments close to the true number whatever the noise level. These simulation examples illustrate the capacity of an adaptive criterion to find a reasonable number of segments even in configurations where the segments are not well separated.
The second simulation-based result concerns the ability of dynamic programming to locate the breakpoints at the correct coordinates, for different amounts of noise (Figures 3 and 4). In the regular configuration (Figure 3), simulation results show that dynamic programming perfectly localizes the breakpoints when the variability of the noise σ^{2} is low relative to the jump d in the mean. If d/σ = 10, the estimated probability of localizing a breakpoint at the correct coordinate is 1, and this probability decreases with the noise (close to 0.65 for d/σ = 2 and 0.25 for d/σ = 1). The effect of additional noise is to widen the zone of estimation, but the estimated breakpoints remain close to the true breakpoints: if the true breakpoint is located at t*, the estimated breakpoint stays in the interval t* ± 3. In the irregular configuration, additional noise has similar effects on breakpoint positioning, but the probability of correctly estimating a breakpoint depends on the jump of the mean between two segments. In the irregular case (Figure 4), at position t = 40 the difference of means is d = 2, and the probability of locating the breakpoint at the true coordinate is higher than 0.65 for any additional noise. On the contrary, at position t = 85, where the difference of means equals d = 0.5, the probability of correctly locating the breakpoint decreases dramatically with the noise (probability 1 for σ = 0.1 and 0.25 for σ = 0.5). This means that dynamic programming is sensitive to small segments that present small differences in the mean relative to the noise. Nevertheless, the example on the real data set presented in Figure 5 shows that using an adaptive criterion with dynamic programming allows for the identification of small regions of putative biological interest, as mentioned above.
Taken together, these simulation results show that the adaptive method selects the correct number of segments even in the presence of important noise, and that once this number is selected, dynamic programming correctly localizes the breakpoints. In addition to its ability to locate the breakpoints precisely, dynamic programming provides a global optimum of the likelihood, which is required for any model selection procedure that selects the number of segments, contrary to genetic algorithms [9].
Segmentation models in the Gaussian framework
Discussion
The definition of an appropriate penalized criterion has been an issue in previous work using segmentation methods for array CGH data analysis [8, 9, 11]. In this section, we explain the specificity of model selection in the case of process segmentation, in order to further justify why classical criteria are inefficient at selecting the number of segments, as shown in the Results Section.
Estimating the number of segments via penalized likelihood
The penalized log-likelihood has the form L̂_K − β pen(K), where L̂_K is the maximized log-likelihood for a segmentation into K segments, pen(K) is a penalty function that increases with the number of segments, and β is a constant of penalization. The estimated number of segments is:

K̂ = argmax_K { L̂_K − β pen(K) }.
It is crucial to notice that the penalized criterion must be evaluated at the best partition in each dimension K: for a fixed K, the criterion has to be globally maximized to ensure convergence of the breakpoint estimators to the true breakpoints [14]. This global optimum is provided by dynamic programming, but not by the other algorithms [9, 10].
Choice of the penalty function and constant
Classical penalized likelihoods use the number of independent continuous parameters to be estimated as a penalty function. Even though those criteria are widely used in the context of model selection, theoretical considerations suggest that they are not appropriate in the context of an exhaustive search for abrupt changes.
Constants and penalty functions for different penalized criteria, in a heteroscedastic model with K segments.

criterion        | β        | pen(K)
-----------------|----------|------------------------
AIC              | 1        | 2K
BIC              | log(n)/2 | 2K
Jong (2003)      | 10/3     | 3K − 1
Lebarbier (2003) | adaptive | 2K(c_1 + c_2 log(n/K))
Lavielle (2003)  | adaptive | 2K
This leads to the definition of a new penalty function adapted to the special context of the exhaustive search for abrupt changes. This function (Table 1) is proportional to the number of continuous parameters, but also contains a term in log(n/K) that takes the complexity of the visited configurations into account. It is written pen(K) = 2K(c_1 + c_2 log(n/K)), where c_1 and c_2 are constant coefficients that have to be calibrated using numerical simulations. Since AIC, BIC and the criterion proposed in [9] do not consider the complexity of the visited models, they select too many segments. The second element of the penalty is the penalty constant β. This term is constant in the case of AIC and BIC (β = 1 and β = log(n)/2, respectively), and contributes to the over-segmentation mentioned above. This can lead to an empirical choice of the constant, in order to obtain expected results based on a priori knowledge: an arbitrary penalty constant can be chosen so that the procedure selects a reasonable number of segments (β = 10/3 in [9]). Instead of an arbitrary choice, β can be chosen adaptively from the data [13, 14]. Furthermore, when the number of segments is small with respect to the number of data points (which is the case in CGH data analysis), the log-term can be considered as a constant [14]. The author rather suggests using the penalty function pen(K) = 2K and defining an automatic procedure to choose the constant of penalization β adaptively. We explain the estimation procedure for the penalty constant in the Methods Section.
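As a sketch (Python standing in for the authors' MATLAB), the criteria of Table 1 differ only in the pair (β, pen(K)); the constants c_1 and c_2 of the complexity-aware penalty are left as free parameters here, since the text states they must be calibrated by simulation:

```python
import numpy as np

def penalized_loglik(loglik_K, K, n, criterion="BIC", beta=None, c1=None, c2=None):
    """Return loglik_K - beta * pen(K) for a heteroscedastic model with
    K segments (2K free parameters: one mean and one variance per segment)."""
    if criterion == "AIC":        # beta = 1, pen(K) = 2K
        return loglik_K - 2 * K
    if criterion == "BIC":        # beta = log(n)/2, pen(K) = 2K
        return loglik_K - (np.log(n) / 2) * 2 * K
    if criterion == "Lebarbier":  # beta adaptive, pen(K) = 2K(c1 + c2 log(n/K))
        return loglik_K - beta * 2 * K * (c1 + c2 * np.log(n / K))
    if criterion == "Lavielle":   # beta adaptive, pen(K) = 2K
        return loglik_K - beta * 2 * K
    raise ValueError(criterion)
```

The number of segments is then the K maximizing the returned value over a grid K = 1, ..., Kmax.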
The power of adaptive methods for model selection lies in the definition of a penalty that is not universal (contrary to AIC and BIC): the dimension of the model is estimated adaptively from the data. The efficiency of this method has been shown on simulated data as well as on experimental results (Results Section), and adaptive model selection criteria appear to be very appropriate for array CGH data analysis.
Choice of modelling for array CGH data
It has to be noted that the classical models used in segmentation methods assume the independence of the data. This may be a reasonable assumption for BAC arrays whose genome representation is approximately 1 BAC every 1.4 Mb [6]. Nevertheless, a new generation of arrays now provides a tiling resolution of the genome [15]. The overlapping of successive BACs could lead to statistical correlations that will require the development of new segmentation models for correlated processes.
Conclusions
Microarray CGH currently constitutes the most powerful method to detect gains or losses of genetic material on a genomic scale. To date, applications have been mainly restricted to cancer research, but the emerging potential of this technique has also been applied to the study of congenital and acquired diseases. Just as expression profile experiments require careful statistical analysis before any biological interpretation, CGH microarray experiments require specific statistical tools to handle experimental variability and to account for the specificity of the biological phenomena under study. We have introduced a statistical method for the analysis of CGH microarray data that models the abrupt changes in the relative copy number ratio between a test DNA and a reference DNA. We discuss the effects of the different models that can be used in segmentation methods, and suggest the use of a model with homogeneous signal variability, based on experimental arguments and on the specificity of array CGH data.
The main theoretical issue in array CGH data analysis lies in the estimation of the number of segments, which requires the definition of an appropriate penalty function and constant. We define a new procedure that estimates the number of segments adaptively from the data. This method selects the number of segments with high accuracy compared to previously mapped aberrations, and appears more efficient than the alternatives proposed to date. Dynamic programming remains central to localizing the breakpoints, and the simulation results show that when the correct number of segments is selected, the algorithm localizes the breakpoints very close to the truth. Assessing the number of segments in a model is theoretically complex, and requires the definition of a precise model of inference. To that extent, microarray CGH analysis requires not only computational approaches, but also a careful statistical methodology.
Methods
Materials
We briefly present the data sets used in this article. The first data set, used in the Results Section, consists of a single experiment on fibroblast cell lines (Coriell cell lines) whose chromosomal aberrations have been previously mapped. These defects concern partial or whole-chromosome aneuploidy. These data have been previously used by other authors [10]. The second data set is described in [6]: a test genome of BT474 cell lines is compared to a normal reference male genome. The last data set, described in [16], consists of 125 primary colorectal tumors that were surgically dissected and frozen. The arrays used for these analyses are BAC arrays described in [6].
Models and Likelihoods
where μ_k is the mean of the k-th segment. The first model specifies that the variance is segment-specific (σ_k^2), whereas the second considers that the variance is common to all segments (σ^2). Since BACs are supposed to be independent, the log-likelihood can be decomposed into a sum of "local" log-likelihoods calculated on each segment, log L_K = Σ_{k=1}^{K} ℓ_k, with

ℓ_k = −(n_k/2) log(2π σ_k^2) − (1/(2σ_k^2)) Σ_{t=t_{k−1}+1}^{t_k} (Y_t − μ_k)^2,

where n_k = t_k − t_{k−1} is the number of points in segment k (replace σ_k^2 by σ^2 in the homogeneous-variance model).
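The local log-likelihood of one segment can be computed as follows (a Python sketch standing in for the MATLAB implementation; the function name and interface are ours — the variance is either profiled at its within-segment maximum-likelihood estimate or supplied as a common value, matching the two models above):

```python
import numpy as np

def segment_loglik(seg, common_sigma2=None):
    """Gaussian log-likelihood of one segment, where seg is the vector of
    log2-ratios in the segment.  With common_sigma2 = None the variance is
    segment-specific and profiled out at its MLE; otherwise the supplied
    common variance is used."""
    n = len(seg)
    resid = seg - seg.mean()
    s2 = common_sigma2 if common_sigma2 is not None else (resid ** 2).mean()
    return -0.5 * n * np.log(2 * np.pi * s2) - 0.5 * (resid ** 2).sum() / s2
```

Summing this quantity over all segments gives the total log-likelihood used by the segmentation algorithm.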
Estimation of the segment's mean and variance
Given the number of segments K and the breakpoint coordinates (t_0, t_1, t_2, ..., t_{K−1}, t_K), we estimate the mean and the variance of each segment by maximum likelihood:

μ̂_k = (1/n_k) Σ_{t=t_{k−1}+1}^{t_k} Y_t,   σ̂_k^2 = (1/n_k) Σ_{t=t_{k−1}+1}^{t_k} (Y_t − μ̂_k)^2,

with n_k = t_k − t_{k−1}. If the variance of the segments is homogeneous, its estimator is given by:

σ̂^2 = (1/n) Σ_{k=1}^{K} Σ_{t=t_{k−1}+1}^{t_k} (Y_t − μ̂_k)^2.

Notice that when the segment coordinates are known, the estimation of the mean and variance of each segment is straightforward. The key problem is then to estimate K and (t_0, t_1, t_2, ..., t_{K−1}, t_K). We proceed in two steps: in the first step, we consider that the number of segments is known, and the problem is to estimate the t_k's, that is, to find the best partition of a set of n individuals into K segments. In the second step, we estimate the number of segments, using a penalized version of the likelihood.
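These closed-form estimators, for a known set of breakpoints, can be sketched as follows (Python in place of MATLAB; the boundary convention, segment k covering indices t_{k−1} + 1 to t_k, is the one used above):

```python
import numpy as np

def segment_estimates(y, bounds, homogeneous=True):
    """ML estimates of segment means and variance(s), given breakpoint
    coordinates bounds = (t_0, ..., t_K) with t_0 = 0 and t_K = len(y);
    segment k covers y[t_{k-1}:t_k]."""
    means, rss = [], 0.0
    for lo, hi in zip(bounds[:-1], bounds[1:]):
        seg = y[lo:hi]
        means.append(seg.mean())
        rss += ((seg - seg.mean()) ** 2).sum()
    if homogeneous:
        return np.array(means), rss / len(y)  # single pooled variance
    variances = np.array([y[lo:hi].var()
                          for lo, hi in zip(bounds[:-1], bounds[1:])])
    return np.array(means), variances
```

The homogeneous case pools the within-segment residuals over the whole profile, exactly as in the formula above.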
A segmentation algorithm when the number of segments is known
When the number of segments K is known, the problem is to find the best partition of {1, ..., n} into K segments, according to the likelihood, where n is the size of the sample. An exhaustive search becomes impossible for large K, since the number of partitions of a set with n elements into K segments is C(n−1, K−1) (the number of ways to choose K − 1 breakpoints among n − 1 possible positions). To reduce the computational load, we use a dynamic programming approach (programs are coded in MATLAB and are available upon request). Let J_k(i, j) be the maximum log-likelihood obtained by the best partition of the data {Y(i), Y(i + 1), ..., Y(j)} into k + 1 segments (with k breakpoints), and note J_0(i, j) the log-likelihood of the single segment {Y(i), ..., Y(j)}. The algorithm is as follows:

J_k(1, j) = max_{k ≤ h < j} { J_{k−1}(1, h) + J_0(h + 1, j) }, for k = 1, ..., K − 1 and j = k + 1, ..., n.
Dynamic programming takes advantage of the additivity of the log-likelihood described above, considering that a partition of the data into k + 1 segments is the union of a partition into k segments and a set containing one segment. This approach presents two main advantages: it provides an exact solution for the global optimum of the likelihood [17], and it reduces the computational load from O(n^K) to O(n^2) for a given K (the algorithm only requires the storage of an upper n × n triangular matrix). At the end of the procedure, the maximum log-likelihoods obtained for every number of segments are stored and used in the next step. Notice that this partitioning problem is analogous to the search for the shortest path between two points, the maximum log-likelihood representing the total length of a (k + 1)-step path connecting the point with coordinate 1 to the point with coordinate n.
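A minimal version of this dynamic programming recursion can be sketched as follows (Python rather than the authors' MATLAB). Under the homogeneous-variance Gaussian model, maximizing the log-likelihood for fixed K amounts to minimizing the residual sum of squares, which is the cost the code minimizes:

```python
import numpy as np

def best_segmentation(y, K):
    """Best partition of y into K segments: minimizes the residual sum of
    squares, equivalent to maximizing the homoscedastic Gaussian
    log-likelihood.  Returns breakpoint coordinates (t_0=0, ..., t_K=n)."""
    n = len(y)
    c1 = np.cumsum(np.insert(y, 0, 0.0))        # prefix sums of y
    c2 = np.cumsum(np.insert(y ** 2, 0, 0.0))   # prefix sums of y^2

    def rss(i, j):  # residual sum of squares of the single segment y[i:j]
        s, s2, m = c1[j] - c1[i], c2[j] - c2[i], j - i
        return s2 - s * s / m

    J = np.full((K, n + 1), np.inf)   # J[k][j]: best cost of y[:j] in k+1 segments
    back = np.zeros((K, n + 1), int)  # back[k][j]: last breakpoint of that partition
    for j in range(1, n + 1):
        J[0][j] = rss(0, j)
    for k in range(1, K):
        for j in range(k + 1, n + 1):
            cand = [J[k - 1][h] + rss(h, j) for h in range(k, j)]
            h = int(np.argmin(cand)) + k
            J[k][j], back[k][j] = cand[h - k], h
    # Trace the breakpoints back from the last position.
    bounds, j = [n], n
    for k in range(K - 1, 0, -1):
        j = back[k][j]
        bounds.append(j)
    return [0] + bounds[::-1]
```

Filling the cost matrix dominates the running time, giving the O(n^2) complexity per K mentioned in the text.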
An adaptive method to estimate the penalty constant
The purpose of this section is to explain an adaptive method to estimate the number of segments. Further theoretical developments can be found in [14]. If we consider that the maximized log-likelihood L̂_K measures the adjustment of a model with K segments to the data, we aim at selecting the dimension for which L̂_K ceases to increase significantly. For this purpose, let us define a decreasing sequence of penalty constants (β_i) such that β_0 = ∞ and, for every β in [β_{i+1}, β_i), the selected dimension is K_i = argmax_K { L̂_K − β pen(K) }.
If we represent the curve (pen(K), L̂_K), the sequence of β_i represents the slopes between the points (pen(K_{i+1}), L̂_{K_{i+1}}) and (pen(K_i), L̂_{K_i}), where the subset {(pen(K_i), L̂_{K_i}), i ≥ 1} is the convex hull of the set {(pen(K), L̂_K)}.
We then consider the discrete second derivative of the curve K ↦ L̂_K, and we select the highest number of segments K for which this second derivative is lower than a given threshold s.
Other procedures have been developed to automatically locate the break in the slope of the likelihood. Nevertheless, the criterion we use can be interpreted geometrically and is easy to implement. The choice of the constant s is arbitrary; in our experience, a threshold s = 0.5 is appropriate for our purpose. A criticism that can be made of this procedure is its dependency on the chosen threshold. Nevertheless, it is important to point out that despite this thresholding the procedure remains adaptive, since the penalty constant is estimated from the data.
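One possible reading of this selection rule, as a hedged sketch (the indexing convention and the sign of the comparison are our interpretation of the text): the likelihood curve being increasing and concave, its second derivative is negative and shrinks toward 0 once adding segments stops paying off, so we keep the last K at which the curve still bends appreciably.

```python
import numpy as np

def select_n_segments(loglik, s=0.5):
    """loglik[K-1] is the maximized log-likelihood for K = 1..Kmax segments.
    Keep the largest K at which the discrete second derivative of the curve
    K -> loglik is still lower than -s; default to K = 1 if it never is."""
    d2 = np.diff(loglik, 2)            # second derivative, at K = 2..Kmax-1
    ks = np.flatnonzero(d2 < -s) + 2   # K values passing the threshold
    return int(ks.max()) if ks.size else 1
```

On a curve that rises steeply up to the true dimension and flattens afterwards, this rule picks the last point of pronounced curvature.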
Simulation studies
We perform numerical simulations to assess the sensitivity of our procedure to the addition of noise. In both configurations, we simulate 100 points with K* = 5 segments. In the first configuration (Figure 3), the segments are regularly spaced and the difference of the means between two segments is d = 1. In the second configuration (Figure 4), the segments are irregularly spaced and the difference of the means varies between d = 2 and d = 0.5. The standard deviation of the Gaussian errors varies from σ = 0.1 to σ = 2. Each configuration is simulated 500 times, and we calculate the average selected number of segments over the 500 simulations. In order to assess the performance of the dynamic programming algorithm, we also calculate the empirical probability, over the 500 simulations, that a breakpoint is located at coordinate t (for t = 1 to 100).
Declarations
Acknowledgements
The authors want to thank Prs D. Pinkel and D. G. Albertson, and Dr E. Lebarbier for helpful discussion and comments, and L. Spector for editing the manuscript. CV is supported by grant NIH RO1 DK60540.
Authors’ Affiliations
References
 Albertson D, Collins C, McCormick F, Gray J: Chromosome aberrations in solid tumors. Nature Genetics 2003, 34: 369–376. 10.1038/ng1215
 Albertson D, Pinkel D: Genomic Microarrays in Human Genetic Disease and Cancer. Human Molecular Genetics 2003, 12: 145–152. 10.1093/hmg/ddg261
 Beheshti B, Park P, Braude I, Squire J: Molecular Cytogenetics: Protocols and Applications. Humana Press; 2002.
 Solinas-Toldo S, Lampel S, Stilgenbauer S, Nickolenko J, Benner A, Dohner H, Cremer T, Lichter P: Matrix-based Comparative Genomic Hybridization: Biochips to Screen for Genomic Imbalances. Genes, Chromosomes and Cancer 1997, 20: 399–407. 10.1002/(SICI)1098-2264(199712)20:4<399::AID-GCC12>3.0.CO;2-I
 Pinkel D, Segraves R, Sudar D, Clark S, Poole I, Kowbel D, Collins C, Kuo W, Chen C, Zhai Y, Dairkee S, Ljung B, Gray J: High resolution analysis of DNA copy number variation using comparative genomic hybridization to microarrays. Nature Genetics 1998, 20: 207–211. 10.1038/2524
 Snijders AM, Nowak N, Segraves R, Blakwood S, Brown N, Conroy J, Hamilton G, Hindle AK, Huey B, Kimura K, Law S, Myambo K, Palmer J, Ylstra B, Yue JP, Gray JW, Jain A, Pinkel D, Albertson DG: Assembly of microarrays for genome-wide measurement of DNA copy number. Nature Genetics 2001, 29: 263–264. 10.1038/ng754
 Autio R, Hautaniemi S, Kauraniemi P, Yli-Harja O, Astola J, Wolf M, Kallioniemi A: CGH-Plotter: MATLAB toolbox for CGH-data analysis. Bioinformatics 2003, 13: 1714–1715. 10.1093/bioinformatics/btg230
 Eilers P, Menezes R: Quantile smoothing of array CGH data. Bioinformatics 2004, in press.
 Jong K, Marchiori E, van der Vaart A, Ylstra B, Weiss M, Meijer G: Chromosomal breakpoint detection in human cancer. In: Applications of Evolutionary Computing: EvoWorkshops 2003 Proceedings. Springer-Verlag, Heidelberg; 2003, 2611: 54–65.
 Olshen A, Venkatraman E, Lucito R, Wigler M: Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics 2004, 5(4): 557–572. 10.1093/biostatistics/kxh008
 Hupe P, Stransky N, Thiery J, Radvanyi F, Barillot E: Analysis of array CGH data: from signal ratio to gain and loss of DNA regions. Bioinformatics 2004, 20(18): 3413–3422.
 Fridlyand J, Snijders A, Pinkel D, Albertson D, Jain A: Hidden Markov models approach to the analysis of array CGH data. Journal of Multivariate Analysis 2004, 90: 132–153. 10.1016/j.jmva.2004.02.008
 Lebarbier E: Detecting Multiple Change-Points in the Mean of Gaussian Process by Model Selection. Signal Processing 2005, in press.
 Lavielle M: Using penalized contrasts for the change-point problem. Signal Processing 2005, in press.
 Ishkanian A, Malloff C, Watson S, deLeeuw R, Chi B, Coe B, Snijders A, Albertson D, Pinkel D, Marra M, Ling V, MacAulay C, Lam W: A tiling resolution DNA microarray with complete coverage of the human genome. Nature Genetics 2004, 36(3): 299–303. 10.1038/ng1307
 Nakao K, Mehta K, Fridlyand J, Moore DH, Jain AJ, Lafuente A, Wiencke J, Terdiman J, Waldman F: High-resolution analysis of DNA copy number alterations in colorectal cancer by array-based comparative genomic hybridization. Carcinogenesis 2004, 25(8): 1345–1357. 10.1093/carcin/bgh134
 Auger I, Lawrence C: Algorithms for the optimal identification of segment neighborhoods. Bull Math Biol 1989, 51: 39–54.
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.