Bivariate segmentation of SNP-array data for allele-specific copy number analysis in tumour samples
© Mosén-Ansorena and Aransay; licensee BioMed Central Ltd. 2013
Received: 3 July 2012
Accepted: 28 February 2013
Published: 5 March 2013
SNP arrays output two signals that reflect the total genomic copy number (LRR) and the allelic ratio (BAF), which in combination allow the characterisation of allele-specific copy numbers (ASCNs). While methods based on hidden Markov models (HMMs) have been extended from array comparative genomic hybridisation (aCGH) to jointly handle the two signals, only one method based on change-point detection, ASCAT, performs bivariate segmentation.
In the present work, we introduce a generic framework for bivariate segmentation of SNP array data for ASCN analysis. For the matter, we discuss the characteristics of the typically applied BAF transformation and how they affect segmentation, introduce concepts of multivariate time series analysis that are of concern in this field and discuss the appropriate formulation of the problem. The framework is implemented in a method named CnaStruct, the bivariate form of the structural change model (SCM), which has been successfully applied to transcriptome mapping and aCGH.
On a comprehensive synthetic dataset, we show that CnaStruct outperforms the segmentation of existing ASCN analysis methods. Furthermore, CnaStruct can be integrated into the workflows of several ASCN analysis tools in order to improve their performance, specially on tumour samples highly contaminated by normal cells.
Two chief genetic instabilities associated to tumoural cells are genomic copy number alterations (CNAs) and somatic loss of heterozygosity (LOH) events, which represent a deviation from the normal allele-specific copy numbers (ASCN). Both imbalances have been reported to affect the expression of oncogenes and tumour-suppressor genes , and therefore, the accurate characterisation of ASCNs in tumoural samples is critical in order to identify candidate cancer-related genes, to discriminate cancer types  and to understand tumour initiation and complexity .
Single nucleotide polymorphism (SNP) arrays of Illumina  and Affymetrix  platforms allow screening for ASCNs at high resolution and throughout the whole genome by providing measures for the log R ratio (LRR), which reflects the total intensity signals for both alleles, and the B allele frequency (BAF), which is the relative proportion of one of the alleles with respect to the total intensity signal. Both LRR and BAF signals are required for a complete characterisation of ASCNs since they provide complementary information. Yet, although each combination of copy number and allelic ratio has an expected LRR value and a specific BAF pattern, these signals can be blurred due to experimental probe-specific noise and by autocorrelated  and dye  biases, respectively.
In the study of ASCNs over tumour samples with SNP arrays, three additional issues need to be considered. First, there is a LRR baseline shift that depends on the ploidy of the sample. Second, tumour biopsies can be contaminated with normal cells, whose genotypes are mainly diploid, which make the LRR and BAF signals to shrink and converge towards those of a diploid state proportionally to the degree of contamination . Third, tumours can be composed of several subclones, this is, subpopulations of cells that harbour specific alterations along with the shared ones, which makes LRR and BAF signals even more complex . The second and third tumour-specific issues, together with the experimental noise and biases, affect the ability to correctly delimit regions with different ASCNs. Therefore, inferring change-point locations from tumour samples requires mathematical models whose performance is affected as little as possible by these issues.
Two approaches are used for the detection of ASCNs in tumour samples on SNP arrays, both of which inherit from methodologies applied to aCGH. The most recurrent approach is based on a combination of a hidden Markov model (HMM) and an expectation-maximisation (EM) algorithm. OncoSNP  and GPHMM  are two recent HMM-based tools validated on Illumina data which, in contrast to previous methods [12-14], are capable of estimating both normal cell contamination and LRR baseline shift. Most existing HMM-based methods, including the two aforementioned ones, integrate the LRR and BAF signals into the same model, which confers them more change-point detection power. Yet, the pre-established levels of HMMs are not prepared to characterise the observed continuous mean levels that arise due to the presence of multiple subclones [9, 15]. Additionally, HMMs require parameterisation on region probability and length, which vary among samples and are not known a priori. Arguably due to the aforementioned issues, in a recent method comparison  HMM-based methods were outperformed by a change-point detection method. For this reason, we propose tackling the problem of ASCN analysis from a change-point-based stand.
Methods based on change-point detection algorithms are typically comprised by segmentation followed by a calling step [17, 18]. This approach does not assume pre-established signal levels and does not require parameterisation of a priori knowledge. Two change-point-based approaches for unpaired tumour samples that use both LRR and BAF signals have been developed: GAP and ASCAT. PSCBS  also falls into this category, but it only works on coupled tumour samples and does not automatically estimate normal cell contamination. In the segmentation step, GAP segments the LRR and BAF signals independently and merges the change-points with those that come from the detection of LOH germline regions in BAF. On the contrary, ASCAT performs a single bivariate segmentation instead of two univariate segmentations, because the integration of the signals into the same formulation can increase the power to detect dimmer joint changes and reduce false positives. However, the extension from the univariate to the bivariate case is not trivial and depends on the characteristics of the considered segmentation approach, which may fall into one of two broad categories: boundary-based and region-based (see  for an analogue distinction in image segmentation).
In the boundary-based differential approach, change-points are seen as inflection points, this is, places where the first derivative has local extrema. Only local information around each point is used to compute the derivative, often resulting in spurious and merged change-points. Multiresolution analysis can be performed by computing the derivative at various window sizes, but region-based approaches are the most adequate to obtain more information for segmentation decisions, although they sacrifice change-point location accuracy. Region-based approaches can be broken down into segment-growing, split-and-merge and global optimisation. Region-growing starts with a number of random single-point regions. Neighbouring points are added to a region if they are similar enough, according to a certain homogeneity criterion; otherwise, a new segment is started. A representative example of split-and-merge is the binary segmentation, which selects as a change-point the position that divides the data into two segments with the most different means. The process is recursively applied to each segment until it cannot be divided into two subsegments with a mean difference that is significant enough. Then, similar regions are merged back together following some pruning criterion. Circular binary segmentation (CBS)  is a modification that allows at each step for the detection of one change-point or two, where the subsegment in the middle has a different mean than the other two subsegments. Global optimisation methods try to optimise an objective function, called cost function in minimisation and utility function in maximisation. Some methods, such as the structural change model (SCM) [15, 22], return the actual optimum. Others, namely heuristics, perform a non-exhaustive scan over the combinatorial space of change-points and can thus be trapped into local extrema.
Current change-point detection methods [17-19] are based on region-based segmentation algorithms, which are more adequate for ASCN analysis because finding change-points is more important than establishing their accurate location. More precisely, GAP is based on CBS and ASCAT on bivariate global optimisation. PSCBS, aimed at paired tumour samples, and BAFsegmentation  and TAPS , which only segment either BAF or LRR, are also based on CBS.
The application of the univariate segmentation methods to the bivariate data from SNP array requires: (i) knowing how the transformation typically applied to the BAF signal influences the applicability of certain segmentation methods and their extension to the bivariate case, and (ii) a mathematical model that generalises the extension from the univariate to the bivariate case. We provide such formalisations, illustrate that the approach taken by ASCAT is a specific case of the bivariate generalisation and discuss why there are more suitable formulations of the bivariate segmentation for ASCN analysis. Then, we show how the bivariate framework is applied to the SCM model in order to achieve CnaStruct, a method that outperforms the segmentation of existing approaches.
BAF transformation and characterisation
The resulting imBAF is not homoscedastic for two reasons: (i) the homozygous band resembles a mixture of a point mass function and a truncated normal distribution with lower variance than its heterozygous counterparts; (ii) the distribution of the heterozygous band, when near the 0.5 axis, is truncated due to the mirroring and thus has lower variance. Nevertheless, homoscedasticity violations seem to be sufficiently small so as to not impact segmentation performance of the approaches we assessed (CBS and SCM).
Non-polymorphic probes yield missing values on the BAF variable. Additionally, the transformation of BAF into imBAF generates more missing values, all of which can be easily removed for the application of univariate segmentation approaches. However, the removal of missing values on bivariate approaches typically implies the exclusion of the corresponding LRR observations and, thus, loss of information. Therefore, missing values should be either handled by the segmentation method or imputed, which can be easily done through interpolation. In general, we observed that constant interpolation is more adequate for change-point detection than linear interpolation, because this latter inserts values that lie between imBAF bands, distorting the profile.
where k = 1…n indexes the observations of the variable z, t1…ts+1 parameterise the borders of the S segments, μ s is the mean value of the s -th segment and ? k are the residuals.
where lr,s and lb,s are the number of non-missing observations of a segment s present in the LRR and imBAF variables, respectively.
where θ establishes constraints for the number of segments and the maximum allowed segment length. This last constraint reduces computational time complexity from O(n2) to O(nl) .
Because this is a fitting problem, Minkowski distances of order p < 1, which model the interaction between the variables as a decay function, are appropriate (Figure 2B). Such approach makes a pair of strong-weak fits result in a lower cost than two average fits, given the same linear combination of residual errors. In other words, it promotes the detection of strong mean shifts albeit in a single variable, such as transitions between segments with same allelic ratio but different copy number. The weighting coefficient β parametrises the relative contribution of the imBAF variable to the cost function. For a balanced contribution, its value should quantify the ratio of informative observations in each variable and the relationship between their signal-to-noise ratios. While the standard deviation is around 6 times larger on LRR independently of the sample data, the expected mean changes depend on the amount of LOH regions and the diversity of copy numbers among other sample factors. The tests we have performed on synthetic data suggest establishing a weighting between variables of β = 1, and show that a Minkowski distance of order of p = 1/4 captures very well the interaction between the variables.
Data can always be fitted better by increasing the number of change-points S, so there is a need to find the optimal S. An option is the use of penalised log-likelihoods .
where N is the number of observations, S is the number of segments, δ determines the best segmentation with S segments and k = 1. Huber et al.  discuss that log-likelihood penalisation overestimates the number of change-points in transcriptional data. In SNP-array copy number data, we found the BIC-penalised log-likelihood to be a satisfactory model selector, but it can be adjusted depending on the desired sensitivity required for downstream analyses.
We built a CnaStruct R package that is freely available at http://web.bioinformatics.cicbiogune.es/cnastruct. The segmentation function is based on the one included in another R package, called tilingArray . Maximum segment length and the number of maximum segments are parameters inherited from such function. The order of the Minkowski distance and the weighing constant between variables are also parameterised, with p = 1/4 and β = 1 set as the default values. Finally, CnaStruct allows the selection of k in the information criterion (default is k = 1, BIC), in order to alter the number of located change-points.
Results and discussion
We evaluated the performance of CnaStruct against the two latest HMM-based methods (GPHMM  and OncoSNP ) and the two change-point detection methods (ASCAT  and GAP ) that use both LRR and BAF variables.
All the assessed methods can handle Illumina data, so we evaluated them on the benchmarking dataset from Mosén-Ansorena et al. , which simulates data from this platform. The dataset considers five characteristic tumour alteration patterns (near-diploid, near-triploid, near-tetraploid, LOH-enriched and complex) and contains one hundred samples per pattern. Fragments with copy numbers 1, 2, 3, 4 and 5 with and without somatic or germline LOH spanning 10, 20, 40, 80 or 160 SNP probes were included and samples were generated at four percentages (0, 25, 50 or 75%) of non-tumoural cell contamination. Longer regions were not included because no major performance differences were expected from the longest considered region on (see Discussion for rationale on this matter). For a more detailed description of the datasets, see Mosén-Ansorena et al. .
A true change-point was considered recalled if at least one predicted change-point falls within a window of 3 probes from it, a threshold that is wide enough to recover most of the correct predictions in the benchmark dataset. Furthermore, from such window on, between-method differences do not vary significantly. Given that GAP outputs the result of merging three segmentations, the calculation of the specificity does not penalise repeated calls of the same change-point in order not to deflate its specificity.
The default parameterisations in OncoSNP and ASCAT are aimed to the detection of longer regions than the ones included in the analysed synthetic samples, so, in order to account for parameterisation differences and keep further comparisons fair, we replaced the default sensitivity-related values with those that achieved the best combination of specificity and sensitivity in the corresponding ROC curves. Such combination is called F-measure, the harmonic mean of specificity and sensitivity. However, notice that the traditional F-measure gives the same importance to both measures, which may not be adequate, as it has been noted that sensitivity is preferable over specificity . Certainly, regions with the same allele-specific copy number can be merged a posteriori after excessive segmentation, but missed region borders cannot be recovered if too few change-points have been detected. Hence, long regions are easily identified regardless of parameterisation, but delimiting short regions requires high sensitivity. Because of this, (i) if a posteriori region merging is applied, parameterisations (of the same method) that prioritise sensitivity achieve better results on shorter regions while having similar results on longer ones (see Additional file 1 for an example), and (ii) CnaStruct’s downstream improvement is expected to be greater with respect to ASCAT, which mainly delivers lower sensitivity, than with respect to GAP, which delivers lower specificity.
Although we only assessed CnaStruct on Illumina-like data, we ran it in combination with GAP, ASCAT and TAPS on samples from either the Illumina or Affymetrix platform (Additional file 4). Visual assessment (see Additional file 5) shows a good performance, although ASCAT fails at the calling step on the sample with 53% of normal cell contamination. Furthermore, we hypothesise that the improvement with respect to existing methods for the Affymetrix platform is even greater than on the Illumina counterpart, given that the noisier profile of Affymetrix SNP-array data is more appealing for the bivariate segmentation and current methods for Affymetrix data, including GAP, only perform univariate segmentation. Particularly, TAPS  is an ASCN analysis tool for the Affymetrix platform whose change-point detection step consists on a simple CBS segmentation over the LRR variable, whose baseline shift is not automatically estimated. Still, when compared to GAP, it delivered a better performance . Given that, as the authors state, the CBS segmentation in TAPS can be replaced by other approaches, we propose the combined use of CnaStruct and TAPS (see Additional file 2) for ASCN analysis on the Affymetrix platform. Given a proper construction of LRR and BAF, we believe that CnaStruct is a sensible segmentation method for high-throughput sequencing (HTS) ASCN analysis too, where, at the moment of writing, the only method that uses BAF  does not perform bivariate segmentation.
We have first identified the issues that arise on segmentation due to imBAF characteristics, namely high value missingness and heteroscedasticity. Although such transformation had already been described, no literature existed on how imBAF’s peculiarities affect segmentation, and more specifically bivariate segmentation.
Then, we have introduced and formalised the bivariate segmentation of SNP-array data for the characterisation of ASCNs in tumour samples. The formalisation generalises the problem and describes the extension from the univariate to the bivariate case, so further univariate methods can eventually be extended to the bivariate SNP-array case through such mathematical framework. With an appropriately selected Minkowski order, the generalisation considers the interaction between variables and their common features, but it is still capable of retrieving changes in a single variable. Thus, the proposed segmentation approach offers an intermediate stand between univariate approaches (e.g. CBS in GAP), which do not include the information available from both variables in the same model and are prone to skipping changes common to the two variables, and bivariate approaches with p = 1 (e.g. ASCAT), which overestimate the effect of variable interaction and tend to obviate single changes. The advantage of bivariate segmentation is more evident in low signal-to-noise ratio (SNR) scenarios, such as high degree of normal cell contamination and samples with high noise levels, where joint variable information reduces the chance of false positives and the promotion of single-variable changes avoids the reduction of recall rates. Additionally, in comparison to the univariate approach, duplicated estimation of change-points is avoided.
CnaStruct exemplifies the benefits of bivariate segmentation with adequately selected Minkowski order and outperforms existing methods at change-point detection on synthetic data. Besides, when coupled with the pattern recognition processes of GAP or ASCAT, the new workflows improve the downstream ASCN analysis in comparison to their original counterparts and the rest of compared methods. Notably, given its performance under the low contrast situations produced by high normal cell contamination levels and intra-tumour heterogeneity, CnaStruct should greatly improve allele-specific copy number characterisation in samples extracted from tumour biopsies, which are typically highly contaminated with normal cells, and in samples from advanced tumours, which are expected to present greater intra-tumour cellular heterogeneity.
DMA is supported by the Government of Navarra, Spain through the grant “Ayuda predoctoral para realizar una tesis doctoral y obtener el título de doctor (Plan de Formación y de I + D 2010/2011)”. AMA and the research expenses are supported by the Department of Industry, Tourism and Trade of the Government of the Autonomous Community of the Basque Country (Etortek Research Programs 2010/2012) and from the Innovation Technology Department of the Bizkaia County.
- Pinkel D, Segraves R, Sudar D, Clark S, Poole I, Kowbel D, Collins C, Kuo WL, Chen C, Zhai Y: High resolution analysis of DNA copy number variation using comparative genomic hybridization to microarrays. Nat Genet 1998,20(2):207-211. 10.1038/2524View ArticlePubMedGoogle Scholar
- Beroukhim R, Mermel CH, Porter D, Wei G, Raychaudhuri S, Donovan J, Barretina J, Boehm JS, Dobson J, Urashima M: The landscape of somatic copy-number alteration across human cancers. Nature 2010,463(7283):899-905. 10.1038/nature08822PubMed CentralView ArticlePubMedGoogle Scholar
- Notta F, Mullighan CG Y, Wang JC, Poeppl A, Doulatov S, Phillips L, Ma J, Minden MD, Downing JR, Dick JE: Evolution of human BCR-ABL1 lymphoblastic leukaemia-initiating cells. Nature 2011,471(7337):254-254.View ArticleGoogle Scholar
- Illumina. http://www.illumina.com
- Affymetrix. http://www.affymetrix.com
- Diskin SJ, Li M, Hou C, Yang S, Glessner J, Hakonarson H, Bucan M, Maris JM, Wang K: Adjustment of genomic waves in signal intensities from whole-genome SNP genotyping platforms. Nucleic Acids Res 2008,36(19):e126-e126. 10.1093/nar/gkn556PubMed CentralView ArticlePubMedGoogle Scholar
- Staaf J, Vallon-Christersson J, Lindgren D, Juliusson G, Rosenquist R, Höglund M, Borg Å, Ringnér M: Normalization of Illumina Infinium whole-genome SNP data improves copy number estimates and allelic intensity ratios. BMC Bioinformatics 2008,9(1):409-409. 10.1186/1471-2105-9-409PubMed CentralView ArticlePubMedGoogle Scholar
- Staaf J, Lindgren D, Vallon-Christersson J, Isaksson A, Göransson H, Juliusson G, Rosenquist R, Höglund M, Borg A, Ringnér M: Segmentation-based detection of allelic imbalance and loss-of-heterozygosity in cancer cells using whole genome SNP arrays. Genome Biol 2008,9(9):R136-R136. 10.1186/gb-2008-9-9-r136PubMed CentralView ArticlePubMedGoogle Scholar
- Parisi F, Ariyan S, Narayan D, Bacchiocchi A, Hoyt K, Cheng E, Xu F, Li P, Halaban R, Kluger Y: Detecting copy number status and uncovering subclonal markers in heterogeneous tumor biopsies. BMC Genomics 2011,12(1):230-230. 10.1186/1471-2164-12-230PubMed CentralView ArticlePubMedGoogle Scholar
- Yau C, Mouradov D, Jorissen RN, Colella S, Mirza G, Steers G, Harris A, Ragoussis J, Sieber O, Holmes CC: A statistical approach for detecting genomic aberrations in heterogeneous tumor samples from single nucleotide polymorphism genotyping data. Genome Biol 2010,11(9):R92-R92.PubMed CentralPubMedGoogle Scholar
- Li A, Liu Z, Lezon-Geyda K, Sarkar S, Lannin D, Schulz V, Krop I, Winer E, Harris L, Tuck D: GPHMM: an integrated hidden Markov model for identification of copy number alteration and loss of heterozygosity in complex tumor samples using whole genome SNP arrays. Nucleic Acids Res 2011,39(12):4928-4941. 10.1093/nar/gkr014PubMed CentralView ArticlePubMedGoogle Scholar
- Liu Z, Li A, Schulz V, Chen M, Tuck D: MixHMM: inferring copy number variation and allelic imbalance using SNP arrays and tumor samples mixed with stromal cells. PLoS One 2010,5(6):14-14.Google Scholar
- Sun W, Wright F, Tang Z, Nordgard SH, Van Loo P, Yu T, Kristensen VN, Perou CM: Integrated study of copy number states and genotype calls using high-density SNP arrays. Nucleic Acids Res 2009,37(16):5365-5377. 10.1093/nar/gkp493PubMed CentralView ArticlePubMedGoogle Scholar
- Greenman CD, Bignell G, Butler A, Edkins S, Hinton J, Beare D, Swamy S, Santarius T, Chen L, Widaa S: PICNIC: an algorithm to predict absolute allelic copy number variation with microarray cancer data. Biostatistics (Oxford, England) 2010,11(1):164-175. 10.1093/biostatistics/kxp045View ArticleGoogle Scholar
- Huber W, Toedling J, Steinmetz LM: Transcript mapping with high-density oligonucleotide tiling arrays. Bioinformatics (Oxford, England) 2006, 22: 1963-1970. 10.1093/bioinformatics/btl289View ArticleGoogle Scholar
- Mosén-Ansorena D, Rodriguez-Ezpeleta N, Aransay AM: Comparison of methods to detect copy number alterations in cancer using simulated and real genotyping data. BMC Bioinformatics 2012, 13: 192. 10.1186/1471-2105-13-192PubMed CentralView ArticlePubMedGoogle Scholar
- Popova T, Manié E, Stoppa-Lyonnet D, Rigaill G, Barillot E, Stern MH: Genome Alteration Print (GAP): a tool to visualize and mine complex cancer genomic profiles obtained by SNP arrays. Genome Biol 2009,10(11):R128-R128. 10.1186/gb-2009-10-11-r128PubMed CentralView ArticlePubMedGoogle Scholar
- Van Loo P, Nordgard SH, Lingjærde OC, Russnes HG, Rye IH, Sun W, Weigman VJ, Marynen P, Zetterberg A, Naume B: Allele-specific copy number analysis of tumors. Proc Natl Acad Sci U S A 2010,107(39):16910-16915. 10.1073/pnas.1009843107PubMed CentralView ArticlePubMedGoogle Scholar
- Olshen AB, Bengtsson H, Neuvial P, Spellman PT, Olshen R, Seshan VE: Parent-specific copy number in paired tumor-normal studies using circular binary segmentation. Bioinformatics (Oxford, England) 2011, 27: 2038-2046. 10.1093/bioinformatics/btr329View ArticleGoogle Scholar
- Cufi X, Munoz X, Freixenet J, Marti J: A review on image segmentation techniques integrating region and boundary information. Advances in imaging and electron physics 2003, 120: 1-39.View ArticleGoogle Scholar
- Venkatraman ES, Olshen AB: A faster circular binary segmentation algorithm for the analysis of array CGH data. Bioinformatics (Oxford, England) 2007,23(6):657-663. 10.1093/bioinformatics/btl646View ArticleGoogle Scholar
- Picard F, Robin S, Lavielle M, Vaisse C, Daudin J-J: A statistical approach for array CGH data analysis. BMC Bioinformatics 2005, 6: 27. 10.1186/1471-2105-6-27PubMed CentralView ArticlePubMedGoogle Scholar
- Rasmussen M, Sundstrom M, Goransson Kultima H, Botling J, Micke P, Birgisson H, Glimelius B, Isaksson A: Allele-specific copy number analysis of tumor samples with aneuploidy and tumor heterogeneity. Genome Biol 2011,12(10):R108-R108. 10.1186/gb-2011-12-10-r108PubMed CentralView ArticlePubMedGoogle Scholar
- Bai J, Perron P: Computation and analysis of multiple structural change models. Journal of Applied Econometrics 2003,18(1):1-22. 10.1002/jae.659View ArticleGoogle Scholar
- Boeva V, Popova T, Bleakley K, Chiche P, Cappo J, Schleiermacher G, Janoueix-lerosey I, Delattre O, Baril E, Curie I: Control-FREEC: a tool for assessing copy number and allelic con- tent using next generation sequencing data. Bioinformatics (Oxford, England) 2012,28(3):423-425. 10.1093/bioinformatics/btr670View ArticleGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.