A response to Yu et al. "A forwardbackward fragment assembling algorithm for the identification of genomic amplification and deletion breakpoints using highdensity single nucleotide polymorphism (SNP) array", BMC Bioinformatics 2007, 8: 145
 Oscar M Rueda^{1}Email author and
 Ramon DiazUriarte^{1}Email author
DOI: 10.1186/147121058394
© Rueda and DiazUriarte; licensee BioMed Central Ltd. 2007
Received: 23 June 2007
Accepted: 16 October 2007
Published: 16 October 2007
Abstract
Background
Yu et al. (BMC Bioinformatics 2007,8: 145+) have recently compared the performance of several methods for the detection of genomic amplification and deletion breakpoints using data from highdensity single nucleotide polymorphism arrays. One of the methods compared is our nonhomogenous Hidden Markov Model approach. Our approach uses Markov Chain Monte Carlo for inference, but Yu et al. ran the sampler for a severely insufficient number of iterations for a Markov Chain Monte Carlobased method. Moreover, they did not use the appropriate reference level for the nonaltered state.
Methods
We rerun the analysis in Yu et al. using appropriate settings for both the Markov Chain Monte Carlo iterations and the reference level. Additionally, to show how easy it is to obtain answers to additional specific questions, we have added a new analysis targeted specifically to the detection of breakpoints.
Results
The reanalysis shows that the performance of our method is comparable to that of the other methods analyzed. In addition, we can provide probabilities of a given spot being a breakpoint, something unique among the methods examined.
Conclusion
Markov Chain Monte Carlo methods require using a sufficient number of iterations before they can be assumed to yield samples from the distribution of interest. Running our method with too small a number of iterations cannot be representative of its performance. Moreover, our analysis shows how our original approach can be easily adapted to answer specific additional questions (e.g., identify edges).
Background
The recent paper by Yu et al. [1] proposes a new method for the analysis (segmentation) of high density single nucleotide polymorphism (SNP) arrays to detect copy number changes (CNAs) in genomic DNA. Their approach has been designed to be highly sensitive for edge detection and is tailored to SNP arrays. In their paper, Yu et al. compare the performance of their approach with that of several alternative methods initially developed for the analysis of arrayCGH (aCGH) data. The methods compared include Circular Binary Segmentation [2], GLAD [3], CGHseg [4] – though not in the original formulation of their authors, as the recommended adaptive penalization of [4] is not used–, and three Hidden Markov Model approaches: a homogenous one [5], the nonhomogeneous one of Marioni et al. [6], and RJaCGH, our own nonhomogeneous HMM using Markov Chain Monte Carlo (MCMC) with Reversible Jump [7].
The different aCGH technologies, from BACbased aCGH to oligonucleotide aCGH (oaCGH), including Affymetrix SNP arrays, differ in several ways, both in terms of coverage, technology, noise per individual element, requirement for reference samples and minimal DNA quality, and LOH information returned (e.g., see [8–13]). Some of these differences might require customized approaches for SNPbased arrays [11, 13].
Thus, our objective here is not to show the optimality of RJaCGH for SNPbased arrays but, rather, to reanalyze the data of Yu et al. [1] using more appropriate parameters for the Markov Chain Monte Carlo in RJaCGH, which yield much improved performance, as well as to show extensions of our model for edge detection, which show the relevance of modelbased methods of direct interpretation that explicitly return probabilities.
Reanalyzing the data
We will reanalyze the data in [1]: in the original comparison RJaCGH was used with a) the incorrect reference for normal samples and b) too few iterations for both the burnin and posterior sampler in the MCMC algorithm.
The data analyzed, in the nonaltered condition, have a mean value of 1.03 and a median of 1.05. RJaCGH, by default, expects that the "normal" values will be centered around 0. This value of 0 is, often, the result of the log2 of a (normalized) ratio of 1. Expecting values of nonaltered regions to be centered around 0 is not unique to RJaCGH; for instance, Fridlyand et al. [5], in p. 138, explain why they expect that the median copy number of the array will typically be normalized to 0. Likewise, Marioni et al. [6] assume a Gaussian for the log2 ratios of each state, which therefore means that regions with unaltered copy number are expected to show values centered around 0. Similarly, Picard et al. [4] indicate (p. 2) "Array CGH data are normalized with a median set to log2(ratio) = 0 for regions of no change (...)". Now, the last three methods, CGHseg [4], aCGH [5], and BioHMM [6], might not be adversely affected, in terms of edge detection, if the normal samples, instead of being centered around 0, are centered around 1 (as in the current data). However, RJaCGH is adversely affected: our algorithm, by default, uses a step where clones are separated into the three groups "gained", "lost", "no change", and information about the expected value of the nonchanged clones is used to separate these three states, and therefore to identify some of the edges. Appropriately using RJaCGH, thus, requires either normalizing the data so that unaltered areas are centered around 0 or, else, explicitly using the "normal.reference" parameter to the RJaCGH function in our code (this parameter might not have been available in the version used in [1], but recentering the data would have had the same effect).
Second, and more importantly, RJaCGH has been run with a burnin run of 50 iterations and results obtained from 450 runs (TOT parameter set to 500). Except for extremely simple Bayesian models, in reallife usage, much longer runs are often necessary (e.g., [14]) both for the burnin period (i.e., to try to ensure that we are sampling from the true posterior) and for the number of MCMC iterations used after the burnin period (i.e., the number of samples used to actually carry out inferences). The number of iterations should be even larger when we are using Reversible Jump and a complex model as HMM [15–17]. It is extremely unlikely that with only 50 burnin iterations our model would be anywhere near convergence, and thus all inferences from these models are suspect.
Parameters for reanalyzing the data
The data and further details on the analysis and simulations were kindly provided by the first author of the paper, Tianwei Yu. We rerun the RJaCGH analysis, using 10000 burnin iterations and 40000 samples for the posterior. Note that, to depart from the original [1] as little as possible, we do not use multiple chains in parallel (which would, however, be the recommended procedure and would also allow to check for convergence [14, 18], as explained in our paper [7]).
As explained above, we set the reference mean to 1 (parameter "normal.reference" in RJaCGH). We could have, instead, centered all data to ensure that the nonaltered regions were centered around 0. To account for different levels of noise in the data (or to modify the tradeoff between false positive and false negative rates) RJaCGH uses only one parameter, "normal.ref.percentile". This parameter might not have been easily available to the authors of [1] and, thus, none of our key criticisms hinge around nonusage of this parameter. "normal.ref.percentile" is used to set the width of the confidence band based on a Normal (Gaussian) distribution: higher values of "normal.ref.percentile" will tend to incorporate more states into the "nonchange" final state. We have varied "normal.ref.percentile" from 0.5 to 0.99 to construct ROC curves as in Figure 3 of [1]. All code for our analysis is available from [19].
Results
Searching directly for edges
In the reanalysis above, we used RJaCGH in the default way. This means that the inference about the edges is not one where we estimate the probability of a spot being an edge. As RJaCGH is an HMMbased model, its main objective is to try to assign clones to their true (hidden) state; RJaCGH additionally automatically collapses the hidden states to three biologicallymotivated levels: gained/lost/nochange. Thus, to find edges, we first use RJaCGH to infer the status of a clone (gained, lost, no change) and then we identify the edges by looking for two consecutive probes where status changes. There is, however, a simple and direct procedure to obtain the probability of an edge that does not involve the above indirect procedure. Since we can apply the Viterbi algorithm to every MCMC sample [7], for each MCMC sample we can directly obtain the locations where there is an edge: spot i is an edge if its state –as identified by the path from the Viterbi algorithm– is different from that of spot i + 1. This approach has the advantage of returning the probability of an edge in a given spot, directly incorporating averaging over number of hidden states, and not requiring to specify "normal.ref.percentile". The later also illustrates one difference with regards to the previous approach. In the previous approach, the edges are identified from the "collapsed" classification of all hidden states into gained/lost/nochange (and this is why we required the specification of "normal.ref.percentile"). Here, in contrast, we will identify as an edge any change in state, even if the change in state is between hidden states that will eventually be considered a single biological level in the classification (e.g., if the change in state is between two different hidden states that will later be both considered "gain").
Finally, there is no need to rerun the algorithm to modify the tradeoff between sensitivity and FDR, since this tradeoff can be changed by changing the probability threshold over which we consider a spot to be an edge. This feature provides added flexibility, which allows users to choose the best threshold for any application, or even to use different thresholds for different probes or chromosomes. Moreover, even if the edge is not located precisely, we can obtain evidence suggestive of the area where an edge is located. For instance, in Figure 3, panel f) the exact location of any edge is not well defined, but there are three areas where spots of high edge probability are located, and an additional two with areas of medium edge probability; these types of representations might suggest refining searches in these areas. In contrast, panel e) is indicative of a situation where the algorithm is likely to be performing poorly, thus alerting users that those results should not be trusted.
Discussion
We have shown (Figure 1) that, when used properly (i.e., with realistic number of chains for the MCMC algorithm and with the correct reference for the normal–nochange–probes), RJaCGH can achieve performance onpar with, or better than, other methods in this data set. Moreover, and in contrast to other methods, the model of RJaCGH is sufficiently flexible that we can directly obtain answers to the questions we are interested in, as shown in Figure 2, where we directly obtain the probabilities of an edge from the algorithm, without intermediate and indirect calculations. In fact, one of the advantages of the explicit model underlying RJaCGH [7] is the possibility of obtaining answers (in the form of statistics, posterior probabilities, etc) to the biologically relevant questions. Our original code was targeted to correctly classifying probes into the gained/lost/nochange status but we can, as well, focus the analysis on the probability that a probe is an edge. Moreover, it is easy to extend this approach to ask (and answer) more specific questions; for example, we could be interested in the probability that a probe is an edge between a state of nochange and a state of gain. In contrast, answers to these types of questions are often hard to obtain from other methods and the answers, if available, are rarely as immediately applicable as a probability. Finally, both our method and BioHMM [6] are unique among the methods compared because they can incorporate unequal distance between probes. With the data provided, the performance advantage of these two methods can not be detected since, by construction of the simulations, the interprobe distance is the same (or plays no role in the probability of change in state).
A broad issue raised by the paper of [1] and our reanalysis is why HMMbased models, as well as approaches such as CGHseg and Circular Binary Segmentation, do not achieve better performance with these data, especially given their excellent performance in other aCGH platforms [7, 20, 21]. The comparatively poor performance of these approaches with SNPbased data might be attributable more to specifics of the Affymetrix platform [11–13, 22–24] than to specific values of parameters related to noise in the data (as exemplified by Figure 3).
Response from original authors
Tianwei Yu & Xiaofeng Zhou
Address:
Department of Biostatistics, Rollins School of Public Health, Emory University, Atlanta, GA USACenter for Molecular Biology of Oral Diseases, College of Dentistry, University of Illinois at Chicago, Chicago, IL USAGuanghua School & Research Institute of Stomatology, Sun YatSen University, Guangzhou, China
Corresponding author: Xiaofeng Zhou
xfzhou@uic.edu
In our paper [1], the comparison study was done with a previous version of RJaCGH (version 1.0.0). The current version of RJaCGH (version 1.1.1) was publicly available only after our paper was already published. Two key parameters were nonexistent in the version 1.0.0 [25]. They are "normal.ref.percentile" and "normal.reference". The first one controls the sensitivity in edge detection, and the second one determines the expected value of the nonchanged clones.
As Rueda and DiazUriarte pointed out, two factors contributed to the apparent suboptimal performance of RJaCGH in our original paper [1]. The first is the expected value of the two copy clones being nonzero. It is a characteristic of copy number data processed by Copy Number Analysis Tool (CNAT) from Affymetrix Inc. In the version of RJaCGH we tested, there was no such parameter to adjust and its relevance would not have been known to the user. Obviously the RJaCGH package has been improved afterwards to add the parameter "normal.reference" to address this issue. The other two HMMbased packages already reached excellent performance without setting such a parameter [1]. We wish to point out the difficulty and importance of correctly setting this parameter in Affymetrix SNP arraybased copy number analysis. With the current Affymetrix SNP array platforms, the normal reference dataset was provided by an internal library, which is implemented as an integrated part in Affymetrix CNAT software [22]. While this approach greatly reduces the effort and cost to perform copy number analysis, it could also introduce bias from various sources. Hence even minimum variations in the experimental procedure may cause the median value of the two copy clones to drift from the ideal value. In addition, the X chromosome without copy number deviation from a male sample will yield copy numbers centered around 1, as opposed to 2 from the autosomes. These issues require special care by the user when a "normal.reference" value needs to be set.
The authors criticized us for conducting simulations using existing methods "not in the original formulation of their authors". We would like to point out that in Supplement Table 1 of our paper [1], the comparison of all the tested methods were performed with the default settings, except the number of permutations in DNAcopy and number of iterations in RJaCGH. In the results reported in Figure 3 of [1], modifications of the parameters were made in an effort to adapt them better to the SNP array data and improve their performance. Certainly this was done from a user's perspective and we could not guarantee optimality for every package tested. All the parameters we used were listed in Table 1 of our paper [1]. These manipulations were necessitated by the fact that copy number data generated by the Affymetrix arrays have different characteristics than traditional arrayCGH data. It would be beneficial if authors of the packages could provide users guidance as to how to optimize their packages for newer arrays. In general, we are happy that changes are now implemented in the new version of RJaCGH as targeted improvements. And it is encouraging to see the boosted performance of the method on SNP arraybased copy number data.
Abbreviations
 aCGH:

arraybased comparative genomic hybridization
 CBS:

aCGH analysis method developed by [2]
 CGHseg:

aCGH analysis method developed by [4]
 GLAD:

aCGH analysis method developed by [3]
 MCMC:

Markov Chain Monte Carlo
 oaCGH:

Oligonucleotide aCGH
 RJaCGH:

Reversible Jumpbased analysis of aCGH data developed by [7]
 ROC (curve):

Receiver operating characteristic (curve)
 SNP:

Single nucleotide polymorphism
Declarations
Acknowledgements
X. Zhou and, particularly, T. Yu for explanations and details of their simulations. Two reviewers for comments on the ms.
Funding provided by Fundacion de Investigacion Medica Mutua Madrilena. R.D.U. is partially supported by the Ramon y Cajal programme of the Spanish MEC. Funding bodies had no role in study design, in the collection, analysis, and interpretation of data, in the writing of the manuscript or in the decision to submit the manuscript for publication.
Authors’ Affiliations
References
 Yu T, Ye H, Sun W, Li KC, Chen Z, Jacobs S, Bailey DK, Wong DT, Zhou X: A forwardbackward fragment assembling algorithm for the identification of genomic amplification and deletion breakpoints using highdensity single nucleotide polymorphism (SNP) array. BMC Bioinformatics 2007, 8: 145.PubMed CentralView ArticlePubMedGoogle Scholar
 Olshen AB, Venkatraman ES, Lucito R, Wigler M: Circular binary segmentation for the analysis of arraybased DNA copy number data. Biostatistics 2004, 5: 557–572.View ArticlePubMedGoogle Scholar
 Hup'e P, Stransky N, Thiery JP, Radvanyi F, Barillot E: Analysis of array CGH data: from signal ratio to gain and loss of DNA regions. Bioinformatics 2004, 20: 3413–3422.View ArticleGoogle Scholar
 Picard F, Robin S, Lavielle M, Vaisse C, Daudin JJ: A statistical approach for array CGH data analysis. BMC Bioinformatics 2005, 6: 27.PubMed CentralView ArticlePubMedGoogle Scholar
 Fridlyand J, Snijders AM, Pinkel D, Albertson DGa: Hidden Markov models approach to the analysis of array CGH data. Journal of Multivariate Analysis 2004, 90: 132–153.View ArticleGoogle Scholar
 Marioni JC, Thorne NP, Tavar'e S: BioHMM: a heterogeneous hidden Markov model for segmenting array CGH data. Bioinformatics 2006, 22: 1144–1146.View ArticlePubMedGoogle Scholar
 Rueda OM, DíazUriarte R: Flexible and accurate detection of genomic copynumber changes from aCGH. PLoS Comput Biol 2007, 3(6):e122.PubMed CentralView ArticlePubMedGoogle Scholar
 Ylstra B, van den Ijssel P, Carvalho B, Brakenhoff RH, Meijer GA: BAC to the future! Or oligonucleotides: a perspective for micro array comparative genomic hybridization (array CGH). Nucleic Acids Res 2006, 34: 445–450.PubMed CentralView ArticlePubMedGoogle Scholar
 Pinkel D, Albertson DG: Array comparative genomic hybridization and its applications in cancer. Nat Genet 2005, 37(Suppl):S11S17.View ArticlePubMedGoogle Scholar
 Lockwood WW, Chari R, Chi B, Lam WL: Recent advances in array comparative genomic hybridization technologies and their applications in human genetics. Eur J Hum Genet 2006, 14(2):139–148.View ArticlePubMedGoogle Scholar
 Huang J, Wei W, Chen J, Zhang J, Liu G, Di X, Mei R, Ishikawa S, Aburatani H, Jones KW, Shapero MH: CARAT: a novel method for allelic detection of DNA copy number changes using high density oligonucleotide arrays. BMC Bioinformatics 2006, 7.Google Scholar
 Nannya Y, Sanada M, Nakazaki K, Hosoya N, Wang L, Hangaishi A, Kurokawa M, Chiba S, Bailey DK, Kennedy GC, Ogawa S: A robust algorithm for copy number detection using highdensity oligonucleotide single nucleotide polymorphism genotyping arrays. Cancer Res 2005, 65(14):6071–6079.View ArticlePubMedGoogle Scholar
 Wang W, Carvalho B, Miller N, Pevsner J, Chakravarti A, Irizarry R: Estimating GenomeWide Copy Number Using Allele Specific Mixture Models. Research in Computational Molecular Biology, Springer 2007, 137–150.View ArticleGoogle Scholar
 Gelman A, Carlin JB, Stern HS, Rubin DB: Bayesian Data Analysis. 2nd edition. Chapman & Hall/CRC; 2003.Google Scholar
 Green P: Reversible jump Markov chain Monte Carlo computation and Bayesian model determination. Biometrika 1995, 7: 11–732.Google Scholar
 Capp'e O, Moulines E, Ryden T: Inference in Hidden Markov Models (Springer Series in Statistics). Springer 2005.Google Scholar
 Robert C, Ryden T, Titterington D: Bayesian inference in hidden Markov models through reversible jump Markov chain Monte Carlo. Journal of the Royal Statistical Society: Series B (Statistical Methodology) 2000, 62: 57–75.View ArticleGoogle Scholar
 Brooks S, Gelman A: General Methods for monitoring convergence of iterative simulations. Journal of Computational and Graphical Statistics 1998, 7: 434–455.Google Scholar
 Scripts for analysis[http://ligarto.org/rdiaz/RcodereponseYuBMCBioinfo.tar.gz]
 Willenbrock H, Fridlyand J: A comparison study: applying segmentation to array CGH data for downstream analyses. Bioinformatics 2005, 21: 4084–4091.View ArticlePubMedGoogle Scholar
 Lai WRR, Johnson MDD, Kucherlapati R, Park PJJ: Comparative analysis of algorithms for identifying amplifications and deletions in array CGH data. Bioinformatics 2005, 21: 3763–3770.PubMed CentralView ArticlePubMedGoogle Scholar
 Huang J, Wei W, Zhang J, Liu G, Bignell GR, Stratton MR, Futreal PA, Wooster R, Jones KW, Shapero MH: Whole genome DNA copy number changes identified by high density oligonucleotide arrays. Hum Genomics 2004, 1(4):287–299.PubMed CentralView ArticlePubMedGoogle Scholar
 Zhao X, Li C, Paez JG, Chin K, Jänne PA, Chen TH, Girard L, Minna J, Christiani D, Leo C, Gray JW, Sellers WR, Meyerson M: An integrated view of copy number and allelic alterations in the cancer genome using single nucleotide polymorphism arrays. Cancer Res 2004, 64(9):3060–3071.View ArticlePubMedGoogle Scholar
 Lai Y, Zhao H: A statistical method to detect chromosomal regions with DNA copy number alterations using SNParraybased CGH data. Comput Biol Chem 2005, 29: 47–54.View ArticlePubMedGoogle Scholar
 Rueda OM, DiazUriarte R: A flexible, accurate and extensible statistical method for detecting genomic copynumber changes. COBRA Preprint Series 2006., 9:Google Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.