Two chief genetic instabilities associated with tumoural cells are genomic copy number alterations (CNAs) and somatic loss of heterozygosity (LOH) events, both of which represent deviations from the normal allele-specific copy numbers (ASCNs). Both imbalances have been reported to affect the expression of oncogenes and tumour-suppressor genes [1]. The accurate characterisation of ASCNs in tumoural samples is therefore critical to identify candidate cancer-related genes, to discriminate cancer types [2] and to understand tumour initiation and complexity [3].

Single nucleotide polymorphism (SNP) arrays from the Illumina [4] and Affymetrix [5] platforms allow screening for ASCNs at high resolution throughout the whole genome by providing two measures: the log R ratio (LRR), which reflects the total intensity signal of both alleles, and the B allele frequency (BAF), which is the proportion of the total intensity signal contributed by one of the alleles. Both LRR and BAF signals are required for a complete characterisation of ASCNs, since they provide complementary information. Yet, although each combination of copy number and allelic ratio has an expected LRR value and a specific BAF pattern, these signals can be blurred by experimental probe-specific noise and by autocorrelated [6] and dye [7] biases, respectively.
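As a rough illustration of the two measures, the sketch below computes a simplified BAF and LRR for a single probe from its two allele intensities. It is only a sketch under stated assumptions: the function name, its arguments and the reference intensity `reference_total` are hypothetical, and platform software derives both values through a more elaborate normalisation than the plain ratios used here.

```python
import math


def baf_and_lrr(a_intensity, b_intensity, reference_total):
    """Simplified BAF and LRR for one SNP probe (illustrative only).

    a_intensity, b_intensity: normalised intensities of alleles A and B.
    reference_total: total intensity expected for a normal diploid probe
    (hypothetical input; real pipelines estimate it from reference samples).
    """
    total = a_intensity + b_intensity
    baf = b_intensity / total                 # share of the signal due to allele B
    lrr = math.log2(total / reference_total)  # total signal relative to the reference
    return baf, lrr


# Heterozygous diploid probe (AB): balanced alleles, reference-level total.
print(baf_and_lrr(1.0, 1.0, 2.0))
# Single-copy gain of allele B (ABB): BAF rises above 0.5 and LRR above 0.
print(baf_and_lrr(1.0, 2.0, 2.0))
```

The second call shows why the two signals are complementary: the gain changes the total intensity (LRR) and the allelic balance (BAF) at the same time, whereas a copy-neutral LOH event would change only the BAF.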

In the study of ASCNs in tumour samples with SNP arrays, three additional issues need to be considered. First, there is an LRR baseline shift that depends on the ploidy of the sample. Second, tumour biopsies can be contaminated with normal cells, whose genotypes are mainly diploid, which makes the LRR and BAF signals shrink and converge towards those of a diploid state in proportion to the degree of contamination [8]. Third, tumours can be composed of several subclones, that is, subpopulations of cells that harbour specific alterations along with the shared ones, which makes the LRR and BAF signals even more complex [9]. The second and third tumour-specific issues, together with the experimental noise and biases, affect the ability to correctly delimit regions with different ASCNs. Therefore, inferring change-point locations from tumour samples requires mathematical models whose performance is affected as little as possible by these issues.
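The first two issues can be made concrete with a simple mixture model, sketched below for a SNP that is heterozygous in the germline. This is an illustrative assumption rather than the model of any particular tool: the function name and parameters are hypothetical, normal cells are taken to carry one copy of each allele, and `baseline_copies` stands for the average copy number that the array normalisation maps to LRR = 0.

```python
import math


def expected_signals(n_a, n_b, tumour_fraction, baseline_copies=2.0):
    """Expected (LRR, BAF) of a germline-heterozygous SNP in a mixed sample.

    n_a, n_b: copies of alleles A and B in the tumour cells.
    tumour_fraction: proportion of tumour cells (1 - normal contamination).
    baseline_copies: copy number corresponding to LRR = 0 (assumed parameter;
    values other than 2 model the ploidy-dependent baseline shift).
    """
    normal_fraction = 1.0 - tumour_fraction
    # Average copies per cell: normal cells contribute one A and one B.
    total = 2.0 * normal_fraction + tumour_fraction * (n_a + n_b)
    b_copies = 1.0 * normal_fraction + tumour_fraction * n_b
    return math.log2(total / baseline_copies), b_copies / total


# Hemizygous loss of allele B (tumour genotype A) at decreasing tumour purity.
for purity in (1.0, 0.5, 0.0):
    lrr, baf = expected_signals(1, 0, purity)
    print(f"purity={purity:.1f}  LRR={lrr:+.3f}  BAF={baf:.3f}")
```

As the tumour fraction drops, the loop shows both values moving towards the diploid state (LRR = 0, BAF = 0.5), which is the shrinkage described above. Passing a `baseline_copies` other than 2 shifts every LRR by the same constant, as in the first issue.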

Two approaches are used for the detection of ASCNs in tumour samples on SNP arrays, both of which are inherited from methodologies applied to array comparative genomic hybridisation (aCGH). The most common approach is based on a combination of a hidden Markov model (HMM) and an expectation-maximisation (EM) algorithm. OncoSNP [10] and GPHMM [11] are two recent HMM-based tools validated on Illumina data which, in contrast to previous methods [12–14], are capable of estimating both normal cell contamination and LRR baseline shift. Most existing HMM-based methods, including the two aforementioned ones, integrate the LRR and BAF signals into the same model, which gives them greater change-point detection power. Yet, the pre-established levels of HMMs are not suited to characterising the continuous range of observed mean levels that arises from the presence of multiple subclones [9, 15]. Additionally, HMMs require parameterisation of region probability and length, which vary among samples and are not known a priori. Arguably due to these issues, HMM-based methods were outperformed by a change-point detection method in a recent method comparison [16]. For this reason, we propose tackling the problem of ASCN analysis from a change-point-based standpoint.

Methods based on change-point detection algorithms typically consist of a segmentation step followed by a calling step [17, 18]. This approach does not assume pre-established signal levels and does not require parameterisation of a priori knowledge. Two change-point-based approaches for unpaired tumour samples that use both LRR and BAF signals have been developed: GAP and ASCAT. PSCBS [19] also falls into this category, but it only works on paired tumour samples and does not automatically estimate normal cell contamination. In the segmentation step, GAP segments the LRR and BAF signals independently and merges the resulting change-points with those that come from the detection of germline LOH regions in BAF. In contrast, ASCAT performs a single bivariate segmentation instead of two univariate segmentations, because integrating the signals into the same formulation can increase the power to detect subtler joint changes and reduce false positives. However, the extension from the univariate to the bivariate case is not trivial and depends on the characteristics of the segmentation approach considered, which may fall into one of two broad categories: boundary-based and region-based (see [20] for an analogous distinction in image segmentation).
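To give an intuition of what integrating both signals into one formulation can mean, the sketch below finds a single change-point by minimising a least-squares cost summed over the two tracks. It is a generic illustration, not the formulation used by ASCAT or GAP: the function names and the weight `baf_weight` are hypothetical, and the BAF track is assumed to have been transformed beforehand so that each segment has a single mean level.

```python
def sse(values):
    """Sum of squared deviations of a segment from its own mean."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def best_joint_split(lrr, baf, baf_weight=1.0):
    """Index k that best splits both tracks into [0, k) and [k, n).

    The cost of a split is the within-segment squared error of the LRR plus
    baf_weight times that of the BAF, so a change shared by both signals
    lowers the joint cost more than a change present in only one of them.
    """
    assert len(lrr) == len(baf) and len(lrr) >= 2

    def cost(k):
        return (sse(lrr[:k]) + sse(lrr[k:])
                + baf_weight * (sse(baf[:k]) + sse(baf[k:])))

    return min(range(1, len(lrr)), key=cost)


# Toy tracks with a joint change after the fifth probe.
lrr = [0.0] * 5 + [0.2] * 5
baf = [0.5] * 5 + [0.7] * 5
print(best_joint_split(lrr, baf))
```

Setting `baf_weight` to 0 reduces the search to a univariate segmentation of the LRR, which is one way to see the univariate case as a special case of the joint cost.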

In the boundary-based differential approach, change-points are seen as inflection points, that is, places where the first derivative has local extrema. Only local information around each point is used to compute the derivative, which often results in spurious and merged change-points. Multiresolution analysis can be performed by computing the derivative at various window sizes, but region-based approaches are better suited to bringing more information into segmentation decisions, at the cost of some accuracy in change-point location. Region-based approaches can be broken down into region-growing, split-and-merge and global optimisation. Region-growing starts with a number of random single-point regions. Neighbouring points are added to a region if they are similar enough according to a certain homogeneity criterion; otherwise, a new segment is started. A representative example of split-and-merge is binary segmentation, which selects as a change-point the position that divides the data into the two segments with the most different means. The process is applied recursively to each segment until it can no longer be divided into two subsegments whose mean difference is significant enough. Then, similar regions are merged back together following some pruning criterion. Circular binary segmentation (CBS) [21] is a modification that allows, at each step, for the detection of either one change-point or two, in which case the subsegment in the middle has a different mean than the other two subsegments. Global optimisation methods try to optimise an objective function, called a cost function in minimisation and a utility function in maximisation. Some methods, such as the structural change model (SCM) [15, 22], return the actual optimum. Others, namely heuristics, perform a non-exhaustive scan over the combinatorial space of change-points and can thus become trapped in local extrema.
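The splitting stage of binary segmentation described above can be sketched as follows. This is a minimal illustration, not the procedure of CBS or of any tool cited here: the stopping rule is a plain threshold on the absolute mean difference (`min_mean_diff`, a hypothetical parameter) rather than a significance test, the minimum segment size `min_size` is likewise an assumption, and the subsequent merging stage is omitted.

```python
def binary_segmentation(signal, min_mean_diff, min_size=2, start=0):
    """Recursively split `signal`; return the sorted change-point indices.

    A change-point k means that a new segment starts at position k of the
    original signal. `start` is the offset of this slice in that signal.
    """
    n = len(signal)
    if n < 2 * min_size:
        return []

    total = sum(signal)
    prefix = 0.0
    best_k, best_diff = None, 0.0
    for k in range(1, n):
        prefix += signal[k - 1]              # sum of signal[:k]
        if k < min_size or n - k < min_size:
            continue                         # keep both halves large enough
        diff = abs(prefix / k - (total - prefix) / (n - k))
        if diff > best_diff:
            best_k, best_diff = k, diff

    # Stop when no split separates the means clearly enough.
    if best_k is None or best_diff < min_mean_diff:
        return []

    left = binary_segmentation(signal[:best_k], min_mean_diff, min_size, start)
    right = binary_segmentation(signal[best_k:], min_mean_diff, min_size,
                                start + best_k)
    return left + [start + best_k] + right


# Noise-free toy signal with an aberrant segment in the middle.
signal = [0.0] * 6 + [1.0] * 6 + [0.0] * 6
print(binary_segmentation(signal, min_mean_diff=0.3))  # [6, 12]
```

In this toy signal the first split is tied between the two boundaries of the middle segment, since each separates the means by only 0.5 although the true step is 1. This is the kind of situation that motivates letting a single step detect two change-points at once, as CBS does.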

Current change-point detection methods [17–19] are based on region-based segmentation algorithms, which are better suited to ASCN analysis because detecting change-points is more important than establishing their exact location. More precisely, GAP is based on CBS and ASCAT on bivariate global optimisation. PSCBS, aimed at paired tumour samples, as well as BAFsegmentation [8] and TAPS [23], which segment only one of BAF or LRR, are also based on CBS.

Applying univariate segmentation methods to the bivariate data from SNP arrays requires: (i) knowing how the transformation typically applied to the BAF signal influences the applicability of certain segmentation methods and their extension to the bivariate case, and (ii) a mathematical model that generalises the extension from the univariate to the bivariate case. We provide such formalisations, illustrate that the approach taken by ASCAT is a specific case of the bivariate generalisation and discuss why there are more suitable formulations of the bivariate segmentation for ASCN analysis. Then, we show how the bivariate framework is applied to the SCM model to obtain CnaStruct, a method whose segmentation outperforms that of existing approaches.