Learning smoothing models of copy number profiles using breakpoint annotations
 Toby Dylan Hocking^{1, 2, 3, 4},
 Gudrun Schleiermacher^{5},
 Isabelle Janoueix-Lerosey^{5},
 Valentina Boeva^{4},
 Julie Cappo^{5},
 Olivier Delattre^{5},
 Francis Bach^{1} and
 Jean-Philippe Vert^{2, 3, 4}
DOI: 10.1186/1471-2105-14-164
© Hocking et al.; licensee BioMed Central Ltd. 2013
Received: 25 July 2012
Accepted: 2 May 2013
Published: 22 May 2013
Abstract
Background
Many models have been proposed to detect copy number alterations in chromosomal copy number profiles, but it is usually not obvious which is most effective for a given data set. Furthermore, most methods have a smoothing parameter that determines the number of breakpoints and must be chosen using various heuristics.
Results
We present three contributions for copy number profile smoothing model selection. First, we propose to select the model and degree of smoothness that maximizes agreement with visual breakpoint region annotations. Second, we develop cross-validation procedures to estimate the error of the trained models. Third, we apply these methods to compare 17 smoothing models on a new database of 575 annotated neuroblastoma copy number profiles, which we make available as a public benchmark for testing new algorithms.
Conclusions
Whereas previous studies have been qualitative or limited to simulated data, our annotation-guided approach is quantitative and suggests which algorithms are fastest and most accurate in practice on real data. In the neuroblastoma data, the equivalent pelt.n and cghseg.k methods were the best breakpoint detectors, and exhibited reasonable computation times.
Background
The need for smoothing model selection criteria
DNA copy number alterations (CNAs) can result from various types of genomic rearrangements, and are important in the study of many types of cancer [1]. In particular, clinical outcome of patients with neuroblastoma has been shown to be worse for tumors with segmental alterations or breakpoints in specific genomic regions [2, 3]. Thus, to construct an accurate predictive model of clinical outcome for these tumors, we must first accurately detect the precise location of each breakpoint.
In recent years, array comparative genomic hybridization (aCGH) microarrays have been developed as genome-wide assays for CNAs, using the fact that microarray fluorescence intensity is proportional to DNA copy number [4]. In parallel, many new mathematical models have been proposed to smooth the noisy signals from these microarray assays in order to recover the CNAs [5-12]. Each model makes different assumptions about the data, and it is not obvious which model is appropriate for a given data set.
Furthermore, most models have parameters that control the degree of smoothness. Varying these smoothing parameters will vary the number of detected breakpoints. Most authors give default values that accurately detect breakpoints on some data, but do not necessarily generalize well to other data. There are some specific criteria for choosing the degree of smoothness in some models [13-15], but it is impossible to verify whether or not the mathematical assumptions of these models are satisfied for real noisy microarray data.
To motivate the use of their cghFLasso smoothing model, Tibshirani and Wang write “The results of a CGH experiment are often interpreted by a biologist, but this is time consuming and not necessarily very accurate” [8].
In contrast, this paper takes the opposite view and assumes that the expert interpretation of the biologist is the gold standard which a model should attain. The first contribution of this paper is a smoothing model training protocol based on this assumption.
In practice, visualization tools such as VAMP are used to plot the normalized microarray signals against genomic position for interpretation by an expert biologist looking for CNAs [16]. Then the biologist plots a model and varies its smoothness parameter, until the model seems to capture all the visible breakpoints in the data. In this article, we make this model training protocol concrete by using an annotation database to encode the expert’s interpretation.
We note that using databases of visual annotations is not a new idea, and has been used successfully for object recognition in photos and cell phenotype recognition in microscopy [17, 18]. In array CGH analysis, some models can incorporate prior knowledge of locations of CNAs [19], but no models have been specifically designed to exploit visual breakpoint annotations.
Our second contribution is a protocol to estimate the breakpoint detection ability of the trained smoothing models on real data. In the Methods section, we propose to estimate the false positive and false negative rates of the trained models using cross-validation. This provides a quantitative criterion for deciding which smoothing algorithms are appropriate breakpoint detectors for which data.
The third contribution of this paper is a systematic, quantitative comparison of the accuracy of 17 common smoothing algorithms on a new database of 575 annotated neuroblastoma copy number profiles, which we give in the Results and discussion section. There are several publications which attempt to assess the accuracy of smoothing algorithms, and these methods fall into 2 categories: simulations and low-throughput experiments. GLAD, DNAcopy, and a hidden Markov model were compared by examining false positive and false negative rates for detection of a breakpoint at a known location in simulated data [20]. However, there is no way to verify if the assumptions of the simulation hold in a real data set, so the value of the comparison is limited. In another article, the accuracy of the CNVfinder algorithm was assessed using quantitative PCR [21]. But quantitative PCR is low-throughput and costly, so is not routinely done as a quality control. So in fact there are no previous studies that quantitatively compare breakpoint detection of smoothing models on real data. In this paper we propose to use annotated regions for quantifying smoothing model accuracy, and we make available 575 new annotated neuroblastoma copy number profiles as a benchmark for the community to test new algorithms on real data.
Several authors have recently proposed methods for so-called joint segmentation of multiple CGH profiles, under the hypothesis that each profile shares breakpoints in the exact same location [22, 23]. These models are not useful in our setting, since we assume that breakpoints do not occur in the exact same locations across copy number profiles. Instead, we focus on learning a model that will accurately detect an unknown, different number of breakpoints in each copy number profile.
To summarize, this article describes a quantitative method for DNA copy number profile smoothing model selection. First, an expert examines scatterplots of the data, and encodes her interpretation of the breakpoint locations in a database of annotated regions. To repeat, the annotations represent an expert’s interpretation, not the biological truth in the tumor samples, which is unknown. We treat the annotated regions as a gold standard, and compare them to the breakpoints detected by 17 existing models. The best model for our expert is the one which maximizes agreement with the annotation database.
Results and discussion
Table 1: Counts of annotations in two annotation data sets of the same copy number profiles

                         Original      Detailed
protocol                 Systematic    Any
annotated profiles       575           575
annotated chromosomes    3418          3730
annotations              3418          4359
0breakpoints             2845          3395
1breakpoint              0             521
>0breakpoints            573           443
For both annotation databases, we calculated the global and local error curves E^{global}(λ) and ${E}_{i}^{\text{local}}\left(\lambda \right)$, which quantify how many breakpoint annotations disagree with the model breakpoints. As shown in Figure 2 for the original set of annotations, the smoothness parameter λ is chosen by minimizing the error curves.
Among global models, cghseg.k and pelt.n exhibit the smallest training error
The global model is defined as the smoothness parameter $\widehat{\lambda}$ that minimizes the global error E^{global}(λ), which is the total number of incorrect annotations over all profiles. Training error curves for cghseg.k, pelt.n, flsa.norm, and dnacopy.sd are shown in Figure 2. An ideal global model would have zero annotation error ${E}^{\text{global}}\left(\widehat{\lambda}\right)=0$ for some smoothness parameter $\widehat{\lambda}$. However, none of the global models that we examined achieved zero training error in either of the two annotation databases. The best global models were the equivalent cghseg.k and pelt.n models, which achieved the minimum error of 2.2% and 6.1% in the original and detailed data sets.
Among local models, cghseg.k and pelt.n exhibit the smallest training error
Since there is no global model that agrees with all of the annotations in either database, we fit local models with profile-specific smoothness parameters ${\widehat{\lambda}}_{i}$. For every profile i, the local model is defined as the smoothness parameter ${\widehat{\lambda}}_{i}$ that minimizes the local error ${E}_{i}^{\text{local}}\left(\lambda \right)$, the number of incorrect annotations on profile i. As shown in Figure 2, the local model fits the annotations at least as well as the global model: ${E}_{i}^{\text{local}}\left({\widehat{\lambda}}_{i}\right)\le {E}_{i}^{\text{local}}\left(\widehat{\lambda}\right)$. However, the local model does not necessarily attain zero error. For example, Figure 2 shows that dnacopy.sd does not detect a breakpoint in profile i=362 even at the smallest parameter value, corresponding to the model with the most breakpoints.
But even if the local models are better at fitting the given breakpoint annotations, they do not generalize well to unannotated breakpoints, as we show in the next section.
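The difference between the global and local definitions above can be illustrated with a small sketch. The error counts below are invented for illustration, and the variable names are hypothetical; ties are broken by taking the first minimum, which is a simplification of the selection rules given in the Methods.

```python
import numpy as np

# Hypothetical error counts: errors[i, j] is the number of incorrect
# annotations on profile i when the model uses the j-th grid value of lambda.
errors = np.array([
    [2, 1, 0, 1, 3],
    [3, 2, 1, 0, 1],
    [1, 0, 0, 2, 4],
])

# Global model: a single lambda index minimizing the total error over profiles.
global_curve = errors.sum(axis=0)
global_index = int(np.argmin(global_curve))

# Local models: one lambda index per profile, each minimizing its own error.
local_index = errors.argmin(axis=1)

local_error = errors[np.arange(len(errors)), local_index]
global_error = errors[:, global_index]

# On the training annotations, the local error never exceeds the global error.
assert np.all(local_error <= global_error)
```

In this toy example the global model makes one training error on the second profile, while each local model fits its own profile perfectly. Whether that advantage carries over to held-out annotations is a separate question that the training error alone cannot answer.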
Global models detect unannotated breakpoints better than default models
In addition, Additional file 1: Figures S1-S5 compare these default and global models. Again, the global models were learned on other profiles, so the shown annotations can be used for model evaluation.
Global models detect unannotated breakpoints better than local models
The leave-one-out cross-validation results in Figure 6 also allow comparison of global and local models. For dnacopy.prune, glad.MinBkpWeight, glad.lambdabreak, dnacopy.sd and flsa, there is little difference between the local and global training procedures. For models flsa.norm, gada, pelt.n, and cghseg.k, there is a clear advantage for the global models which share information between profiles. The equivalent cghseg.k and pelt.n models show the minimal test error of only 2.1% and 4.4% in the original and detailed data sets.
Only a few profiles need to be annotated for a good global model
To estimate the generalization error of a global model trained on a relatively small training set of t annotated profiles, we applied ⌊n/t⌋-fold cross-validation to the n=575 profiles.
In Figure 9, we used ⌊n/t⌋-fold cross-validation in the detailed annotations to estimate the error rates of all 17 models trained using only t=10 profiles.
The equivalent cghseg.k and pelt.n models show the best performance on these data, with an estimated breakpoint detection error of 7.7%.
Global models generalize across annotators
We assessed the extent to which the annotator affects the results by comparing models trained on one data set and tested on the other. Figure 8 shows that test error changes very little between models trained on one data set or the other. This demonstrates that global models generalize very well across annotators.
Timing PELT and cghseg
The PELT and cghseg models use different algorithms to calculate the same segmentation, which showed the best breakpoint detection performance in every comparison. But they are slightly different in terms of speed, as we show in Figure 9.
When comparing the global models, cghseg.k is somewhat faster than pelt.n. For cghseg.k, pruned dynamic programming is used to calculate the best segmentation μ^{ k } for $k\in \{1,\dots ,20\}$ segments, which is the slow step. Then, we calculate the best segmentation for $\lambda \in \{{\lambda}_{1},\dots ,{\lambda}_{100}\}$, based on the stored μ^{ k } values. In contrast, the Pruned Exact Linear Time algorithm must be run for each $\lambda \in \{{\lambda}_{1},\dots ,{\lambda}_{100}\}$, and there is no information shared between λ values.
Timing the PELT and cghseg default models without tuning parameters shows the opposite trend. In particular, the default cghseg.mBIC method is slower than the pelt.default method. This makes sense since cghseg must first calculate the best segmentation μ^{ k } for several k, then use the mBIC criterion to choose among them. In contrast, the PELT algorithm recovers just the μ^{ k } which corresponds to the Schwarz Information Criterion penalty constant β = log d. So if you want to use a particular penalty constant β instead of the annotation-guided approach we suggest in this article, the default PELT method offers a modest speedup over cghseg.
Annotation-based modeling is feasible for high-density microarrays
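The reuse of stored segmentations described for cghseg.k can be sketched as follows. This is a minimal illustration under an assumption that is not stated in this section: that the segmentation for a given penalty β is the one minimizing squared error plus β times the number of segments. The function name and the cost values are hypothetical.

```python
import numpy as np

def select_segments(sse, penalties):
    """Pick a number of segments for each penalty from stored costs.

    sse[k-1] is the squared error of the best segmentation with k segments,
    computed once by the slow step. Assumed criterion: sse_k + penalty * k.
    """
    sse = np.asarray(sse, dtype=float)
    penalties = np.asarray(penalties, dtype=float)
    k = np.arange(1, len(sse) + 1)
    # One row per penalty value, one column per candidate number of segments.
    criterion = sse[None, :] + penalties[:, None] * k[None, :]
    return k[criterion.argmin(axis=1)]

# Hypothetical stored costs for k = 1..4 segments on one chromosome.
stored_sse = [10.0, 4.0, 3.0, 2.9]
chosen = select_segments(stored_sse, [0.01, 0.5, 2.0, 100.0])
```

Once the costs are stored, evaluating a whole grid of penalties is a cheap array operation, which is consistent with the observation that the dynamic programming step dominates the running time of cghseg.k.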
Conclusions
We proposed to train breakpoint detection models using annotations determined by visual inspection of the copy number profiles. We have demonstrated that this approach allows quantitative comparison of smoothing models on a new data set of 575 neuroblastoma copy number profiles. These data provide the first set of annotations that can be used for benchmarking the breakpoint detection ability of future algorithms. Our annotation-based approach is useful in practice on real data, since it provides a quantitative criterion for choosing the model and its smoothing parameter.
One possible criticism of annotation-based model selection is the time required to create the annotations. However, using the GUIs that we have developed, it takes only a few minutes to annotate the breakpoints in a profile. This is a relatively small investment compared to the time required to write the code for data analysis, which is typically on the order of days or weeks. In addition, in the neuroblastoma data, we observed that annotating only about 10 of 575 profiles was sufficient to learn a smoothness parameter that achieves the model-specific optimal breakpoint detection. More generally, our results suggest that after obtaining a moderately sized database of annotations, data analysis time is better spent designing and testing better models. Additionally, the learned models generalized very well between annotators. So breakpoint annotations are a feasible approach for finding an accurate model and smoothing parameter for real copy number profiles.
We compared local models for single profiles with global models selected using annotations from several profiles. We observed that local models fit the given annotations better, but global models generalize better to unannotated regions. In contrast with our results, it has been claimed that local models should be better in some sense: “it is clear that the advantages of selecting individual-specific λ values outweigh the benefit of selecting constant λ values that maximize overall performance” [15]. However, they did not demonstrate this claim explicitly, and one of the contributions of this work is to show that global models generalize better than local models, according to our leave-one-out estimates.
It will be interesting to apply annotation-based model training to other algorithms and data sets. In both annotation data sets we analyzed, cghseg.k and pelt.n showed the best breakpoint detection, but another model may be selected for other data.
Our results indicate that even the best models have nonzero training and testing breakpoint detection error, which could be improved. To make a model that perfectly fits the training annotations, a dynamic programming algorithm called SegAnnot was proposed to recover the most likely breakpoints that are consistent with the annotation data [24]. The test error of the cghseg model can be lowered by choosing chromosomespecific λ parameter values as a function of features such as variance and the number of points sampled [25]. Developing a model that further lowers the test error remains an interesting direction of future research.
We have solved the problem of smoothness parameter selection using breakpoint annotations, but the question of detecting CNAs remains. By constructing a database of annotated regions of CNAs, we could use a similar approach to train models that detect CNAs. Annotations could be actual copy number ($0, 1, 2, 3, \dots$) or some simplification (loss, normal, gain). We will be interested in developing joint breakpoint detection and copy number calling models that directly use these annotation data as constraints or as part of the model likelihood.
Methods
GUIs for annotating copy number profiles
Assume that we have n DNA copy number profiles, and we would like to accurately detect their breakpoints. The first step of annotation-based modeling is to plot the data, visually identify breakpoints, and save these regions to an annotation database. We created 2 annotation graphical user interfaces (GUIs) for this purpose: a Python program for low-density profiles called annotate_breakpoints.py, and a web site for larger profiles called SegAnnDB.
We used Tkinter in Python’s standard library to write annotate_breakpoints.py, a cross-platform GUI for annotating low-density DNA copy number profiles. The annotator loads several profiles from a CSV file, plots the data, and allows annotated regions to be drawn on the plot and saved to a CSV file for later analysis. The annotator does not support zooming, so it is not suitable for annotating high-density profiles. It is available in the annotate_regions package on the Python Package Index:
http://pypi.python.org/pypi/annotate_regions
SegAnnDB is a web site that can be used to annotate low- to high-density copy number profiles. After copy number data in bedGraph format is uploaded, the site uses D3 to show plots which can be annotated [26]. The annotations can then be downloaded for later analysis. As shown in Figure 10, the plots can be zoomed for detailed annotation of high-density copy number profiles. The free/open-source software that runs the web site can be downloaded from the breakpoints project on INRIA GForge:
https://gforge.inria.fr/scm/viewvc.php/webapp/?root=breakpoints
Definition of breakpoints in smoothing models
Note that this set is drawn using vertical black lines in Figures 1 and 7.
Definition of the annotation error
For every profile and chromosome, we judge the accuracy of the predicted set of breakpoints ${\widehat{b}}^{\lambda}$ using a set of visually determined regions and corresponding annotations. Every annotation $a=[\underline{a},\overline{a}]$ is an interval that specifies the expected number of changes in the corresponding region r. For example, we defined 3 types of annotations: a = [0, 0] for 0breakpoints annotations, a = [1, 1] for 1breakpoint annotations, and a = [1, ∞) for >0breakpoints annotations. A region is an interval of base pairs on the chromosome that corresponds to ${\widehat{b}}^{\lambda}$, for example r = [1000000, 2000000].
Note that this loss function gives the same weight to false positives and false negatives. Reweighting schemes could be used, but uniform weighting is justified in the data we analyzed since each annotation took approximately the same amount of time to create.
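As a concrete illustration of how an annotation can be checked against predicted breakpoints, here is a minimal sketch. It assumes a zero-one loss: an annotation counts as one error when the number of predicted breakpoints inside its region falls outside the annotated interval. The function name, the constants, and the choice to include both region endpoints are assumptions made for this example.

```python
import math

# The three annotation types described above, as (lower, upper) bounds on
# the number of breakpoints expected in the region.
ZERO_BREAKPOINTS = (0, 0)
ONE_BREAKPOINT = (1, 1)
SOME_BREAKPOINTS = (1, math.inf)

def annotation_error(breakpoints, region, annotation):
    """Return 0 if the predicted breakpoints agree with the annotation, else 1.

    breakpoints: predicted breakpoint positions in base pairs.
    region: (start, end) in base pairs, endpoints included (an assumption).
    annotation: (lower, upper) bounds on the expected number of breakpoints.
    """
    start, end = region
    lower, upper = annotation
    count = sum(1 for b in breakpoints if start <= b <= end)
    return 0 if lower <= count <= upper else 1

region = (1000000, 2000000)
# One predicted breakpoint inside a 0breakpoints region is a false positive.
false_positive = annotation_error([1500000], region, ZERO_BREAKPOINTS)
# No predicted breakpoint inside a >0breakpoints region is a false negative.
false_negative = annotation_error([], region, SOME_BREAKPOINTS)
```

Both kinds of disagreement contribute the same value of 1, which matches the uniform weighting of false positives and false negatives described above.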
Definitions of error and ROC curves
Selecting the optimal degree of smoothness
 1. When the minimum error is achieved in a range of intermediate parameter values, we select a value in the middle. This occurs in the local error curves shown for flsa.norm and cghseg.k.
 2. When the minimum is attained by the model with the most breakpoints, we select the model with the fewest breakpoints that has the same error. This attempts to minimize the false positive rate. This occurs for profile i=375 with the dnacopy.sd model.
 3. When the minimum is attained by the model with the fewest breakpoints, we select the model with the most breakpoints that has the same error. This attempts to minimize the false negative rate, and occurs for profile i=362 with the dnacopy.sd model.
More complicated smoothing parameter estimators could be defined, but for simplicity in this article we explore only the global $\widehat{\lambda}$ and local ${\widehat{\lambda}}_{i}$ models.
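The three selection rules can be sketched in code. This is an illustrative reading of the rules rather than the implementation used for the paper: the function name is hypothetical, the error curve is assumed to be ordered from the model with the most breakpoints to the model with the fewest, and a curve that is constant over the whole grid is handled by the second rule.

```python
def select_index(errors):
    """Choose a grid index among the minimizers of an error curve.

    errors[j] is the annotation error at the j-th smoothness value, ordered
    so that index 0 gives the most breakpoints and the last index the fewest.
    """
    best = min(errors)
    minimal = [j for j, e in enumerate(errors) if e == best]
    last = len(errors) - 1
    if minimal[0] == 0:
        # Rule 2: the minimum includes the model with the most breakpoints,
        # so move toward fewer breakpoints while the error stays minimal.
        j = 0
        while j < last and errors[j + 1] == best:
            j += 1
        return j
    if minimal[-1] == last:
        # Rule 3: the minimum includes the model with the fewest breakpoints,
        # so move toward more breakpoints while the error stays minimal.
        j = last
        while j > 0 and errors[j - 1] == best:
            j -= 1
        return j
    # Rule 1: the minimum lies at intermediate values, so take a middle one.
    return minimal[len(minimal) // 2]
```

For example, the curve `[3, 1, 0, 0, 0, 2]` falls under the first rule and yields index 3, while `[0, 0, 1, 2]` falls under the second rule and yields index 1.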
Leave-one-out cross-validation for comparing local and global models
 1. On each profile, randomly pick one annotated region and set it aside as a test set.
 2. Using all the other annotations as a training set, select the best λ using the protocol described in Section “Selecting the optimal degree of smoothness”. For local models we learn a profile-specific ${\widehat{\lambda}}_{i}$ that minimizes ${E}_{i}^{\text{local}}$, and for global models we learn a global $\widehat{\lambda}$ that minimizes E^{global}.
 3. To estimate how the model generalizes, count the errors of the learned model on the test regions.
The final estimate of model error shown in Figure 6 is the average error over all V repetitions.
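The three steps above can be sketched as follows. The data layout, the function name, and the default number of repetitions are assumptions made for this example, and ties between equally good parameter values are broken by taking the first minimum rather than by the selection rules of the previous section.

```python
import random
import numpy as np

def leave_one_out(curves, n_repeats=10, seed=0):
    """Estimate test error rates of global and local models.

    curves[i] is a 0/1 array of shape (annotations, grid) for profile i, where
    entry [a, j] is 1 if annotation a is incorrect at the j-th grid value.
    Returns (global_rate, local_rate), averaged over the repetitions.
    """
    rng = random.Random(seed)
    global_rates, local_rates = [], []
    for _ in range(n_repeats):
        # Step 1: hold out one randomly chosen annotation per profile.
        held_out = [rng.randrange(len(c)) for c in curves]
        train = [np.delete(c, r, axis=0) for c, r in zip(curves, held_out)]
        test = [c[r] for c, r in zip(curves, held_out)]
        # Step 2: the global model minimizes the summed training error,
        # and each local model minimizes the error of its own profile.
        g = int(np.argmin(sum(t.sum(axis=0) for t in train)))
        local = [int(np.argmin(t.sum(axis=0))) for t in train]
        # Step 3: count errors of the learned models on held-out annotations.
        global_rates.append(np.mean([t[g] for t in test]))
        local_rates.append(np.mean([t[j] for t, j in zip(test, local)]))
    return float(np.mean(global_rates)), float(np.mean(local_rates))
```

A local model is trained on very few annotations, so its choice of parameter can be poorly determined, whereas the global model pools the training annotations of all profiles.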
⌊n/t⌋-fold cross-validation to estimate error on unannotated profiles
Since the annotation process is time-consuming, we are interested in training an accurate breakpoint detector with as few annotations as possible. Thus we would like to answer the following question: how many profiles t do I need to annotate before I get a global model that will generalize well to all the other profiles?
To answer this question, we estimate the error of a global model trained on the annotations from t profiles using cross-validation. We divide the set of n annotated profiles into exactly ⌊n/t⌋ folds, each with approximately t profiles. For each fold, we consider its annotations a training set for a global model, and combine the other folds as a test set to quantify the model error. The final estimate of generalization error is then the average model error over all folds.
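A minimal sketch of this procedure is given below. Note that, unlike ordinary cross-validation, each small fold is used for training and the remaining folds for testing. The function name and array layout are assumptions for this example, ties are broken by taking the first minimum, and t must be at most n/2 so that there are at least two folds.

```python
import numpy as np

def small_training_set_cv(errors, totals, t, seed=0):
    """Estimate the test error rate of a global model trained on ~t profiles.

    errors[i, j]: number of incorrect annotations on profile i at grid value j.
    totals[i]: number of annotations on profile i.
    """
    n = len(errors)
    n_folds = n // t  # floor(n / t) folds of approximately t profiles each
    rng = np.random.default_rng(seed)
    fold = rng.permutation(np.arange(n) % n_folds)
    rates = []
    for f in range(n_folds):
        train = fold == f
        test = ~train
        # Train a global model on the small fold only.
        g = int(np.argmin(errors[train].sum(axis=0)))
        # Evaluate it on the annotations of all the other profiles.
        rates.append(errors[test, g].sum() / totals[test].sum())
    return float(np.mean(rates))
```

With n=575 and t=10 this gives 57 folds of 10 or 11 profiles, so the reported error is an average over 57 models that were each trained on a small subset.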
Data: neuroblastoma copy number profiles
We analyzed a new data set of n=575 copy number profiles from aCGH microarray experiments on neuroblastoma tumors taken from patients at diagnosis. The microarrays were produced using various technologies, so do not all have the same probes. The number of probes per microarray varies from 1719 to 71340, and the median spacing between probes varies from 40 Kb to 1.2 Mb. In this article we analyzed the normalized log-ratio measurements of these microarrays, which we have made available as neuroblastoma$profiles in R package neuroblastoma on CRAN.
Two different expert annotations were used to construct 2 annotation databases based on these profiles (Table 1). The 2 annotation data sets are mostly consistent, but the detailed annotations provide more precise breakpoint locations (Figure 3).
The “original” annotations were created using the “Systematic” protocol. First, a set of 6 genomic regions was chosen. Then, each of these regions was inspected on scatterplots of each profile. Breakpoint annotations were recorded by typing 0 or 1 in a spreadsheet with one row for each of the 575 profiles and one column for each of the 6 regions. Entries with 0 were 0breakpoints annotations a = [0, 0] and entries with 1 were >0breakpoints annotations a = [1, ∞). These annotations are available as neuroblastoma$annotations in R package neuroblastoma on CRAN.
The “detailed” annotations were constructed using the “Any” protocol. The data were shown as scatterplots in a graphical user interface (GUI) that allows zooming and direct annotation on the plotted profiles. Annotators were asked to label any regions for which they were sure of the annotation. These annotations are available as neuroblastomaDetailed in R package bams on CRAN.
Algorithms: copy number profile smoothing models
We considered smoothing models from the bioinformatics literature with free software implementations available as R packages on CRAN, R-Forge, or Bioconductor [27-29]. For each algorithm, we considered three types of training for the smoothness parameter λ:

Default models can be used when functions give default parameter values, or do not have smoothness parameters that vary the number of breakpoints.

Local models choose a smoothness parameter that maximizes agreement with the annotations from a single profile.

Global models choose a smoothness parameter that maximizes agreement with the entire database of training annotations.
In the following paragraphs, we discuss the precise meaning of the smoothness parameter λ in each of the algorithms. The code that standardizes the outputs of these models can be found in the list of functions smoothers in R package bams on CRAN. For some algorithms (GADA, GLAD, DNAcopy) the smoothing ${\widehat{y}}^{\lambda}\in {\mathbb{R}}^{d}$ is defined for an entire profile $y\in {\mathbb{R}}^{d}$, but in others (cghseg, pelt, flsa) the smoothing ${\widehat{x}}^{\lambda}\in {\mathbb{R}}^{m}$ is defined in terms of probes on a single chromosome $x\in {\mathbb{R}}^{m}$. Note that to decrease computation time, the model fitting may be trivially parallelized for profiles, algorithms, and smoothing parameter values.
We used version 1.0 of the gada package from R-Forge to calculate a sparse Bayesian learning model [11]. We varied the degree of smoothness by adjusting the T parameter of the BackwardElimination function, and for the gada.default model, we did not use the BackwardElimination function.
We define a grid of values $\lambda \in \{1{0}^{5},\dots ,1{0}^{12}\}$, take λ_{1}=0, and consider the following parameterizations for λ_{2}:

flsa: λ_{2}=λ.

flsa.norm: λ_{2}=λ m×10^{6}/l where m is the number of points and l is the length of the chromosome in base pairs.
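The flsa.norm parameterization scales λ by the probe density of the chromosome, expressed in points per megabase. The short sketch below evaluates the formula as printed above; the function name and the example values are hypothetical.

```python
def flsa_norm_lambda2(lam, n_points, chrom_length_bp):
    """Return lambda_2 = lam * m * 10^6 / l for the flsa.norm parameterization.

    n_points is m, the number of points on the chromosome, and
    chrom_length_bp is l, the chromosome length in base pairs.
    """
    points_per_megabase = n_points * 1e6 / chrom_length_bp
    return lam * points_per_megabase

# A chromosome of 50 Mb with 100 points has 2 points per megabase,
# so lam = 1.5 gives lambda_2 = 3.0.
example = flsa_norm_lambda2(1.5, 100, 50_000_000)
```

With this scaling, a single value of λ corresponds to a larger fused lasso penalty on densely sampled chromosomes than on sparsely sampled ones.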
We used version 1.18.0 of the DNAcopy package from Bioconductor to fit a circular binary segmentation model [7]. We varied the degree of smoothness by adjusting the undo.SD, undo.prune, and alpha parameters of the segment function. However, the dnacopy.prune algorithm was too slow ( >24 hours) for some of the profiles with many data points, so these profiles were excluded from the analysis of dnacopy.prune.
We used version 0.2.1 of the cghFLasso package from CRAN, which implements a default fused lasso method [8], but does not provide any smoothness parameters for breakpoint detection.
We used version 2.0.0 of the GLAD package from Bioconductor to fit the GLAD adaptive weights smoothing model [5]. We varied the degree of smoothness by adjusting the lambdabreak and MinBkpWeight parameters of the daglad function. For the glad.haarseg model, we used the smoothfunc="haarseg" option and varied the breaksFdrQ parameter to fit a wavelet smoothing model [9].
For the pelt.default model, we used the default settings which specify penalty="SIC" for the Schwarz or Bayesian Information Criterion, meaning β = log m. For the pelt.n model, we specified penalty="Manual" which means that the value parameter is used as β, and the cpt.mean function returns ${\mu}^{{k}^{\ast}\left(\beta \right)}$. We defined the same grid of λ values that we used for cghseg.k, and let β = λ d. Note that this model is mathematically equivalent to cghseg.k, but shows small differences in the results, since there are rounding errors when specifying the penalty cpt.mean(value=sprintf("n*%f",lambda)) for pelt.n.
Ethical approval
This study was authorized by the decision of the ethics committee “Comité de Protection des Personnes Sud-Est IV”, references L0795 and L12171.
Declarations
Acknowledgements
Thanks to Edouard Pauwels for many helpful discussions and comments to simplify the mathematics on an early draft of the paper.
This work was supported by Digiteo [DIGITEOBIOVIZ200925D to T.D.H.]; the European Research Council [SIERRAERC239993 to F.B.; SMACERC280032 to JP.V.]; the French National Research Agency [ANR09BLAN005104 to JP.V.]; the Annenberg Foundation [to G.S.]; the French Programme Hospitalier de Recherche Clinique [PHRC IC200709 to G.S.]; the French National Cancer Institute [INCA20071RT4IC to G.S.]; and the French Anti-Cancer League.
Authors’ Affiliations
References
 Weinberg RA: The Biology of Cancer. 2007, London: Garland Science Taylor & Francis Group, LLCGoogle Scholar
 JanoueixLerosey I, Schleiermacher G, Michels E, Mosseri V, Ribeiro A, Lequin D, Vermeulen J, Couturier J, Peuchmaur M, Valent A, Plantaz D, Rubie H, ValteauCouanet D, Thomas C, Combaret V, Rousseau R, Eggert A, Michon J, Speleman F, Delattre O: Overall genomic pattern is a predictor of outcome in neuroblastoma. J Clin Oncol. 2009, 27 (7): 10261033. 10.1200/JCO.2008.16.0630. [http://jco.ascopubs.org/content/27/7/1026.abstract]View ArticlePubMedGoogle Scholar
 Schleiermacher G, JanoueixLerosey I, Ribeiro A, Klijanienko J, Couturier J, Pierron G, Mosseri V, Valent A, Auger N, Plantaz D, Rubie H, ValteauCouanet D, Bourdeaut F, Combaret V, Bergeron C, Michon J, Delattre O: Accumulation of segmental alterations determines progression in neuroblastoma. J Clin Oncol. 2010, 28 (19): 31223130. 10.1200/JCO.2009.26.7955. [http://jco.ascopubs.org/cgi/content/abstract/28/19/3122]View ArticlePubMedGoogle Scholar
 Pinkel D, Segraves R, Sudar D, Clark S, Poole I, Kowbel D, Collins C, Kuo WL, Chen C, Zhai Y, Dairkee SH, Gray JW, Albertson DG, Ljung Bm: High resolution analysis of DNA copy number variation using comparative genomic hybridization to microarrays. Nat Genet. 1998, 20 (2): 207211. 10.1038/2524. [http://dx.doi.org/10.1038/2524]View ArticlePubMedGoogle Scholar
 Stransky N, Thiery JP, Radvanyi F, Barillot E, Hupé P: Analysis of array CGH data: from signal ratio to gain and loss of DNA regions. Bioinformatics. 2004, 20 (18): 34133422. 10.1093/bioinformatics/bth418.View ArticlePubMedGoogle Scholar
 Picard F, Robin S, Lavielle M, Vaisse C, Daudin JJ: A statistical approach for array CGH data analysis. BMC Bioinformatics. 2005, 6 (27):
 Venkatraman ES, Olshen AB: A faster circular binary segmentation algorithm for the analysis of array CGH data. Bioinformatics. 2007, 23 (6): 657663. 10.1093/bioinformatics/btl646. [http://dx.doi.org/10.1093/bioinformatics/btl646]View ArticlePubMedGoogle Scholar
 Tibshirani R, Wang P: Spatial smoothing and hot spot detection for CGH data using the fused lasso. Biostatistics. 2007, 9 (1): 1829. 10.1093/biostatistics/kxm013.View ArticlePubMedGoogle Scholar
 BenYaacov E, Eldar YC: A fast and flexible method for the segmentation of aCGH data. Bioinformatics. 2008, 24 (16): i139i145. 10.1093/bioinformatics/btn272.View ArticlePubMedGoogle Scholar
 Hoefling H: A path algorithm for the Fused Lasso Signal Approximator. 2009, [ArXiv:0910.0526]Google Scholar
 PiqueRegi R, MonsoVarona J, Ortega A, Seeger RC, Triche TJ, Asgharzadeh S: Sparse representation and Bayesian detection of genome copy number alterations from microarray data. Bioinformatics. 2008, 24 (3): 309318. 10.1093/bioinformatics/btm601.PubMed CentralView ArticlePubMedGoogle Scholar
 Killick R, Fearnhead P, Eckley IA: Optimal detection of changepoints with a linear computational cost. 2011, [ArXiv:1101.1438]Google Scholar
 Lavielle M: Using penalized contrasts for the changepoint problem. Signal Process. 2005, 85: 15011510. 10.1016/j.sigpro.2005.01.012.View ArticleGoogle Scholar
 Zhang NR, Siegmund DO: A modified Bayes information criterion with applications to the analysis of comparative genomic hybridization data. Biometrics. 2007, 63: 2232. 10.1111/j.15410420.2006.00662.x.View ArticlePubMedGoogle Scholar
 Zhang Z, Lange K, Ophoff R, Sabatti C: Reconstructing DNA copy number by penalized estimation and imputation. Ann Appl Stat. 2010, 4: 17491773. 10.1214/10AOAS357.PubMed CentralView ArticlePubMedGoogle Scholar
 La Rosa P, Viara E, Hupé P, Pierron G, Liva S, Neuvial P, Brito I, Lair S, Servant N, Robine N, Brennetot C, JanoueixLerosey I, Raynal V, Gruel N, Rouveirol C, Stransky N Stern, Delattre O, Aurias A, Radvanyi F, Barillot E, Manié E: VAMP: Visualization and analysis of arrayCGH, transcriptome and other molecular profiles. Bioinformatics. 2006, 22 (17): 20662073. 10.1093/bioinformatics/btl359. [http://bioinformatics.oxfordjournals.org/content/22/17/2066.abstract]View ArticlePubMedGoogle Scholar
 Russell BC, Torralba A, Murphy KP, Freeman WT: LabelMe: a database and web-based tool for image annotation. Int J Comput Vis. 2008, 77 (1-3): 157-173.
 Jones TR, Carpenter AE, Lamprecht MR, Moffat J, Silver SJ, Grenier JK, Castoreno AB, Eggert US, Root DE, Golland P, Sabatini DM: Scoring diverse cellular morphologies in image-based screens with iterative feedback and machine learning. Proc Natl Acad Sci. 2009, 106 (6): 1826-1831. 10.1073/pnas.0808843106. [http://www.pnas.org/content/106/6/1826.abstract]
 Shah SP, Xuan X, DeLeeuw RJ, Khojasteh M, Lam WL, Ng R, Murphy KP: Integrating copy number polymorphisms into array CGH analysis using a robust HMM. Bioinformatics. 2006, 22 (14): 431-439. 10.1093/bioinformatics/btl238.
 Willenbrock H, Fridlyand J: A comparison study: applying segmentation to array CGH data for downstream analysis. Bioinformatics. 2005, 21 (22): 4084-4091. 10.1093/bioinformatics/bti677.
 Fiegler H, Redon R, Andrews D, Scott C, Andrews R, Carder C, Clark R, Dovey O, Ellis P, Feuk L, French L, Hunt P, Kalaitzopoulos D, Larkin J, Montgomery L, Perry GH, Plumb BW, Porter K, Rigby RE, Rigler D, Valsesia A, Langford C, Humphray SJ, Scherer SW, Lee C, Hurles ME, Carter NP: Accurate and reliable high-throughput detection of copy number variation in the human genome. Genome Res. 2006, 16 (12): 1566-1574. 10.1101/gr.5630906. [http://dx.doi.org/10.1101/gr.5630906]
 Vert JP, Bleakley K: Fast detection of multiple change-points shared by many signals using group LARS. Advances in Neural Information Processing Systems 23 (NIPS). Edited by: Lafferty J, Williams CKI, Shawe-Taylor J, Zemel RS, Culotta A. 2010, 2343-2351.
 Ritz A, Paris P, Ittmann M, Collins C, Raphael B: Detection of recurrent rearrangement breakpoints from copy number data. BMC Bioinformatics. 2011, 12: 114. 10.1186/1471-2105-12-114. [http://www.biomedcentral.com/1471-2105/12/114]
 Hocking TD, Rigaill G: SegAnnot: an R package for fast segmentation of annotated piecewise constant signals. [http://hal.inria.fr/hal-00759129]
 Rigaill G, Hocking TD, Bach F, Vert JP: Learning Sparse Penalties for Change-Point Detection using Max Margin Interval Regression. Proceedings of the 30th International Conference on Machine Learning (ICML-13), ICML ’13. Edited by: McAllester D, Dasgupta S. 2013, New York: ACM
 Bostock M, Ogievetsky V, Heer J: D3 data-driven documents. IEEE Trans Vis Comput Graph. 2011, 17 (12): 2301-2309.
 R Development Core Team: R: A Language and Environment for Statistical Computing. 2011, Vienna: R Foundation for Statistical Computing, [http://www.R-project.org/] [ISBN 3-900051-07-0]
 Theußl S, Zeileis A: Collaborative software development using R-Forge. The R J. 2009, 1: 9-14. [http://journal.r-project.org/2009-1/RJournal_2009-1_Theussl+Zeileis.pdf]
 Gentleman RC, Carey VJ, Bates DM: Bioconductor: Open software development for computational biology and bioinformatics. Genome Biol. 2004, 5: R80. 10.1186/gb-2004-5-10-r80. [http://genomebiology.com/2004/5/10/R80]
 Rigaill G: Pruned dynamic programming for optimal multiple change-point detection. 2010, [ArXiv:1004.0887]
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.