Reconstructing DNA copy number by joint segmentation of multiple sequences

Background: Variations in DNA copy number carry information on the modalities of genome evolution and the mis-regulation of DNA replication in cancer cells. Their study can help localize tumor suppressor genes, distinguish different populations of cancerous cells, and identify genomic variations responsible for disease phenotypes. A number of different high-throughput technologies can be used to identify copy number variable sites, and the literature documents multiple effective algorithms. We focus here on the specific problem of detecting regions where variation in copy number is relatively common in the sample at hand. This problem encompasses the cases of copy number polymorphisms, related samples, technical replicates, and cancerous sub-populations from the same individual.

Results: We present a segmentation method named generalized fused lasso (GFL) to reconstruct copy number variant regions. GFL is based on penalized estimation and is capable of processing multiple signals jointly. Our approach is computationally very attractive and leads to sensitivity and specificity levels comparable to those of state-of-the-art specialized methodologies. We illustrate its applicability with simulated and real data sets.

Conclusions: The flexibility of our framework makes it applicable to data obtained with a wide range of technologies. Its versatility and speed make GFL particularly useful in the initial screening stages of large data sets.


Introduction
Duplication and deletion of genomic materials are common in cancer cells and known to play a role in the establishment of the tumor status [24]. As our ability to survey the fine scale of the human genome has increased, it has become apparent that normal cells can also harbor a number of variations in copy number [18,36]. The last few years have witnessed a steady increase in our knowledge of size and frequency of these variants [10,19,23,30] and their implications in complex diseases [29,40].
At the same time, statistical methods and algorithms have been developed to better harness the information available. At the cost of oversimplification, two different approaches have become particularly popular: one is based on the hidden Markov model (HMM) machinery and explicitly aims to reconstruct the unobservable discrete DNA copy number; the other, which we will generically call "segmentation", aims at identifying portions of the genome that have constant copy number, without specifically reconstructing it. The HMM approach takes advantage of the implicitly discrete nature of the copy number process (both when a finite number of states is assumed and when, as in some implementations, less parametric approaches are adopted); furthermore, by careful modeling of the emission probabilities, one can fully utilize the information derived from the experimental results. In the case of genotyping arrays, for example, both the quantification of total DNA amount and relative allelic abundance as well as prior information (for example, minor allele frequencies) can be considered. No a-priori knowledge of the number of copy number states is required by the segmentation approach, an advantage in the study of cancer, where polyploidy and contamination with normal tissue result in a wide range of fractional copy numbers.
Possibly for the reasons outlined, HMMs are the methods of choice in the analysis of normal samples [9,34,41,47,49], while segmentation methods are the standard in cancer studies [26,53].
A limitation of segmentation methods is that they rely on the assumption that variation in copy number is reflected in differences in the means of the segments, which makes them directly applicable to a substantial portion of the data derived from recent technologies, but not to relative allelic abundance (see the modification suggested in [39] and described below for an exception).
While a number of successful approaches have been derived along the lines described above, there is still a paucity of methodology for the joint analysis of multiple sequences. It is clear that if multiple subjects share the same variation in copy number, there exists the potential to increase power by joint analysis. Wang et al. (2009) [45] presented a methodology that extended [24] to reconstruct the location of tumor suppressor genes from the identification of regions lost in a larger number of samples; the initial steps of the Birdsuite algorithm rely on the identification of suspect signals in the context of multiple samples; PennCNV [47] includes an option of joint analysis of trios; methodology to process multiple samples within the context of change point analysis has been developed in [37,51,53]; Efron and Zhang (2011) [14] consider FDR analysis of independent samples to identify copy number polymorphisms (CNPs); and Nowak et al. (2011) [25] use a latent feature model to capture, in joint analysis of array-CGH data from multiple tumor samples, shared copy number profiles, on each of which a fused-lasso penalty is enforced for sparsity. In the present work we consider a setting similar to [53] in that we want joint analysis to inform the segmentation of multiple samples. Our main focus is the analysis of genotyping array data, but the methodology we develop is applicable to a variety of platforms. By adopting a flexible framework we are able, for example, to define a segmentation algorithm that uses all information from Illumina genotyping data. As in [37], we are interested in the situation when not all the samples under consideration carry a copy number variant (CNV): we rather want to enforce a certain sparsity in the vector that identifies which samples carry a given variant. We tackle this problem using a penalized estimation approach, originally proposed in this context by [42], for which we previously developed an algorithmic implementation [54].
Appreciable results are achieved in terms of speed, accuracy and flexibility. In concluding this introduction, we would like to make an important qualification: the focus of our contribution is on segmentation methods, knowing that this is only one of the steps necessary for an effective recovery of CNVs. In particular, normalization and transformation of the signal from experimental sources are crucial and can have a very substantial impact on final results: we refer the reader to [1,2,7,12,33,35], for example. Furthermore, calling procedures that further classify results of segmentation while possibly controlling global error measures [14] are also needed. Indeed, in the data analysis included in this paper we need to resort to both these additional steps and we will describe briefly the fairly standard choices we are making.
The rest of the paper is organized as follows: Section 2 motivates the need for joint analysis of multiple signals and presents the penalized estimation framework. Section 3 describes how the model can be used for data analysis by (a) outlining an efficient estimation algorithm, (b) generalizing it to the case of uncoordinated data, and (c) describing the choice of the penalization parameters. Section 4 illustrates our results on two simulated data sets (descriptive of normal and tumor samples) and two real data sets: in one case multiple platforms are used to analyze the same sample and in the other case samples from related individuals benefit from joint analysis.

Multiple sequence segmentation
The goal of the present paper is to develop a flexible methodology for joint segmentation of multiple sequences that are presumed to carry related information on CNVs. We start by illustrating a series of contexts where the joint analysis appears to be useful.

Genotyping arrays and CNV detection
Genotyping arrays have been used on hundreds of thousands of subjects and the data collected through them provide an extraordinary resource for CNV detection and the study of their frequencies in multiple populations. Typically, the raw intensity data representing hybridization strength is processed to obtain two signals: a quantification of total DNA amount (from now on log R Ratio, LRR, following Illumina terminology) and a relative abundance of the two queried alleles (from now on B allele frequency, BAF). Both these signals contain information on CNVs, and one of the strengths of HMMs has been that they can easily process them jointly. Segmentation models like CBS have traditionally relied only on LRR. While this is a reasonable choice, it can lead to substantial loss of information, particularly in tumor cells, where polyploidy and contamination make information in LRR hard to decipher. To exploit BAF in the context of a segmentation method, a signal transformation has been suggested [39]: mirrored BAF (mBAF) relies on the exchangeability of the two alleles and the low information content of homozygous SNPs.
The resulting mBAF is defined on a coarser grid than the original BAF, but is characterized by changing means in the presence of a CNV. While [39] shows that analysis of mBAF alone can be advantageous and more powerful than segmentation of LRR in some contexts, a joint analysis of LRR and mBAF should clearly be preferable to an arbitrary selection of one signal or the other.
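As a concrete illustration, the mirroring transformation can be sketched as follows. This is a minimal version: the exact homozygosity cut-off used in [39] may differ from the value assumed here, which is chosen only for illustration.

```python
import numpy as np

def mirrored_baf(baf, hom_cutoff=0.97):
    """Fold BAF around 0.5 and drop (near-)homozygous SNPs.

    After mirroring, mBAF is ~0.5 for a heterozygous SNP at copy
    number 2 and shifts upward when allelic balance is perturbed,
    so CNVs appear as changes in mean, as segmentation requires.
    The homozygosity cut-off of 0.97 is an assumption, not the
    value from [39].
    """
    baf = np.asarray(baf, dtype=float)
    mbaf = 0.5 + np.abs(baf - 0.5)      # fold around 0.5
    mbaf[mbaf > hom_cutoff] = np.nan    # homozygous SNPs: no information
    return mbaf
```

Note how the NaN entries at homozygous loci produce exactly the coarser, partially missing grid discussed in the text.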

Multiple platforms
LRR and BAF are just one example of the multiple signals that one can have available for the same sample. Often, as research progresses, the samples are assessed with a variety of technologies.
For example, a number of subjects that have been genotyped at high resolution are now being resequenced. Whenever the technology adopted generates a signal that contains some information on copy number, there is an incentive to analyze the available signals jointly.

Tumor samples from the same patient obtained at different sites or different progression stages
In an effort to identify mutations that are driving a specific tumor, as well as study its response to treatment, researchers might want to study CNVs in cells obtained at different tumor sites or at different time points [27]. Copy number is highly dynamic in cancer cells, so it is to be expected that some differences will be detected over time or across sites. In contrast, the presence of the same CNVs across these samples can be taken as an indication that the tumors share the same origin: therefore a comparative analysis of CNVs can be used to distinguish resurgence of the same cancer from emergence of a new one, or to identify specific cancer cell populations. Given that the tissue extracted always consists of a mixture of normal and cancer cells, which are in turn a mixture of different populations, joint analysis of the signals from the varied materials is much more likely to lead to the identification of common CNVs, when these exist.

Related subjects
Family data is crucial in genetic investigations and hence it is common to analyze related subjects.
When studying individuals from the same pedigree, it is reasonable to assume that some CNVs might be segregating in multiple people: joint analysis would reduce Mendelian errors and increase power of detection.

A model for joint analysis of multiple signals
Assume we have observed M signals, each measured at N locations, corresponding to ordered physical positions along the genome, with y ij being the observed value of sequence i at location j.
The copy number process can be modeled as

y_{ij} = β_{ij} + ε_{ij},    (1)

where the ε_{ij} represent noise, and the mean values β_{ij} are piece-wise constant: there exists a linearly ordered partition of the locations {1, . . . , N} into regions within which β_{ij} does not change. In other words, most of the increments |β_{ij} − β_{i,j−1}| are assumed to be zero. When two sequences k and l share a CNV with the same boundaries at location j, both |β_{kj} − β_{k,j−1}| and |β_{lj} − β_{l,j−1}| will be different from zero in correspondence of the change point. Modulo an appropriate signal normalization, β_{ij} = 0 can be interpreted as corresponding to the normal copy number equal to 2. We propose to reconstruct the mean values β by minimizing the following function, called hereafter generalized fused lasso (GFL):

(1/2) Σ_{i=1}^{M} Σ_{j=1}^{N} (y_{ij} − β_{ij})² + λ_1 Σ_{i=1}^{M} Σ_{j=1}^{N} |β_{ij}| + λ_2 Σ_{i=1}^{M} Σ_{j=2}^{N} |β_{ij} − β_{i,j−1}| + λ_3 Σ_{j=2}^{N} ( Σ_{i=1}^{M} (β_{ij} − β_{i,j−1})² )^{1/2},    (2)

which includes a goodness-of-fit term and three penalties, whose roles we explain one at a time. The ℓ1 penalty Σ_{i=1}^{M} Σ_{j=1}^{N} |β_{ij}| enforces sparsity within β, in favor of values β_{ij} = 0, corresponding to the normal copy number. The total variation penalty Σ_{j=2}^{N} |β_{ij} − β_{i,j−1}| minimizes the number of jumps in the piece-wise constant means of each sequence and was introduced by [42] in the context of CNV reconstruction from array-CGH data. Finally, the Euclidean penalty on the column vector of jumps is a form of the group penalty introduced by [50] and favors common jumps across sequences. As clearly explained in [56], "the local penalty around 0 for each member of a group relaxes as soon as the |β_{ij} − β_{i,j−1}| for one member i of the group moves off 0." Bleakley and Vert (2011) [4] also suggested the use of this group-fused-lasso penalty to reconstruct CNVs. We here consider the use of both the total variation and the Euclidean penalty on the jumps to achieve the equivalent effect of the sparse group lasso which, as pointed out in [16], favors CNV detection in multiple samples by allowing for sparsity in the vector indicating which sequences jump at a given location. A related combination of penalties was proposed in a regularized least-squares optimization for image restoration [32], where β_{ij} is the true underlying intensity of pixel (i, j).
In CNV detection problems, signals from multiple sequences can be aligned in the shape of an image, except that while the probes within each sequence are linearly ordered, the sequences as a group have no a priori order; thus one of the two total variation penalties is replaced by the group penalty on the column vectors of jumps.
Using matrix notation, and allowing the tuning parameters λ_1, λ_2 and λ_3 to be sequence specific, we can reformulate the objective function as follows. Let Y = (y_{ij})_{M×N} and β = (β_{ij})_{M×N}.
Let β_i be the ith row of β and β_{(j)} the jth column of β. Also, let λ_3 = (λ_{3,i})_{M×1}. Then we minimize

(1/2) ||Y − β||_F² + Σ_{i=1}^{M} λ_{1,i} ||β_i||_1 + Σ_{i=1}^{M} λ_{2,i} ||β_{i,2:N} − β_{i,1:(N−1)}||_1 + Σ_{j=2}^{N} ||λ_3 * (β_{(j)} − β_{(j−1)})||_2,    (3)

where || · ||_F is the Frobenius norm of a matrix, || · ||_1 and || · ||_2 are the ℓ1 and ℓ2 norms of a vector, β_{i,s:t} indicates the sub-vector with elements β_{i,s}, . . . , β_{i,t} of the row vector β_i, and "*" denotes entry-wise multiplication between two vectors. Note that it would be easy to modify the tuning parameters so as to make them location specific: that is, reduce the penalty for a jump in correspondence of genomic regions known to harbor CNVs.
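In code, the objective can be written down directly. The sketch below (plain numpy, with uniform rather than sequence-specific tuning constants, for simplicity) mirrors the goodness-of-fit term and the three penalties term by term:

```python
import numpy as np

def gfl_objective(Y, B, lam1, lam2, lam3):
    """Generalized fused lasso objective for M x N matrices Y (data)
    and B (means):

      0.5 * ||Y - B||_F^2                      goodness of fit
      + lam1 * sum_ij |B_ij|                   sparsity toward CN = 2
      + lam2 * sum_i sum_j |B_ij - B_i,j-1|    few jumps per sequence
      + lam3 * sum_j ||B_(j) - B_(j-1)||_2     shared jump locations
    """
    fit = 0.5 * np.sum((Y - B) ** 2)
    lasso = lam1 * np.sum(np.abs(B))
    jumps = np.diff(B, axis=1)                        # M x (N-1) increments
    tv = lam2 * np.sum(np.abs(jumps))
    group = lam3 * np.sum(np.linalg.norm(jumps, axis=0))
    return fit + lasso + tv + group
```

The group term charges each column of jumps once: a jump of size d shared by all M sequences costs lam3 * sqrt(M) * |d| rather than M times the individual cost, which is what favors common change points.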

An MM algorithm
While the solution to the optimization problem (3) might have interesting properties, this approach is useful only if an effective algorithm is available. The last few years have witnessed substantial advances in computational methods for ℓ1-regularization problems, including the use of coordinate descent [15,48] and path following methods [4,17,43,55]. The time cost of these methods in the best situation is O(MNK), for K knots along the solution path. It is important to note that these algorithms, some of which are designed for more general applications, may not be the most efficient for large scale CNV analysis for at least two reasons: on the one hand, reasonable choices of λ might be available, making it unnecessary to solve for the entire path; on the other hand, the number of knots K can be expected to be as large as O(N), making the computational costs of path algorithms prohibitive.
With specific regard to the fused-lasso application to CNV detection, we were successful in developing an algorithm with per-iteration cost O(N) and empirically fast convergence for the analysis of one sequence [54]. We apply the same principles here. We start by modifying the norms in the penalty as follows: rather than the absolute value |x| we use ||x||_{2,ε} = (x² + ε)^{1/2} for a sufficiently small ε > 0 and, for computational stability, we likewise substitute the ℓ2 norm with ||x||_{2,ε} = (||x||_2² + ε)^{1/2}, obtaining a differentiable objective function f_ε. Adopting an MM framework [21], we want to find a surrogate function g_ε(β | β^(m)) for each iteration m such that g_ε(β^(m) | β^(m)) = f_ε(β^(m)) and g_ε(β | β^(m)) ≥ f_ε(β) for all β. At each iteration, then, β^(m+1) = argmin g_ε(β | β^(m)). A majorizing function with the above properties is readily obtained using the concavity of the square-root function,

||x||_{2,ε} ≤ ||z||_{2,ε} + (x² − z²) / (2||z||_{2,ε}),

and its vector equivalent

||x||_{2,ε} ≤ ||z||_{2,ε} + (||x||_2² − ||z||_2²) / (2||z||_{2,ε}).

The resulting surrogate is quadratic and can be decomposed into the sum of similar functions of the individual row vectors β_i, each of which is minimized by solving a linear system whose coefficient matrix A_i is tridiagonal, so that each iteration costs O(N) per sequence.
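A toy single-sequence version of this MM scheme can be sketched as follows. This is a simplification, not the paper's implementation: it keeps a fixed smoothing ε, drops the group penalty (one sequence only), and uses a dense solver in place of the O(N) tridiagonal one.

```python
import numpy as np

EPS = 1e-8

def smooth_abs(x):
    # ||x||_{2,eps} for scalars / entry-wise arrays
    return np.sqrt(x ** 2 + EPS)

def objective(y, b, lam1, lam2):
    """Smoothed single-sequence GFL objective f_eps."""
    return (0.5 * np.sum((y - b) ** 2)
            + lam1 * np.sum(smooth_abs(b))
            + lam2 * np.sum(smooth_abs(np.diff(b))))

def mm_fused_lasso(y, lam1, lam2, n_iter=100):
    """MM iterations: each sqrt term is majorized by a quadratic,
    sqrt(x^2+eps) <= sqrt(z^2+eps) + (x^2 - z^2)/(2*sqrt(z^2+eps)),
    so every step reduces to one linear solve with a diagonal-plus-
    tridiagonal system (solved densely here for brevity)."""
    N = len(y)
    D = np.diff(np.eye(N), axis=0)           # (N-1) x N difference matrix
    b = y.copy()
    for _ in range(n_iter):
        w1 = 1.0 / smooth_abs(b)             # lasso quadratic weights
        w2 = 1.0 / smooth_abs(np.diff(b))    # total-variation weights
        A = np.eye(N) + lam1 * np.diag(w1) + lam2 * D.T @ (w2[:, None] * D)
        b = np.linalg.solve(A, y)            # minimize the surrogate
    return b
```

By the MM property, the smoothed objective never increases along the iterations, and on a noiseless step signal the estimate stays flat within segments while preserving (slightly shrinking) the jump.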

Stacking observations at different genomic locations
While copy number is continuously defined across the genome, experimental procedures record data at discrete positions, for which we have used the indexes j = 1, . . . , N. In reality, repeated evaluations of the same sample (or related samples) will typically result in measurements at only partially overlapping genomic locations: either because different platforms use different sets of probes, or because missing data may occur at different positions across sequences (consider, for example, mBAF and LRR from the same experiment on one subject: the mBAF signal will be defined on a subset of the locations where LRR is).
Let S indicate the union of all genomic positions where some measurement is available among the M signals under study, and let S_i be the subset of locations with measurements in sequence i. We reconstruct β_{ij} for all j ∈ S. When j ∉ S_i, β_{ij} will be determined simply on the basis of the neighboring data points, relying on the regularizations introduced in (3). The goodness-of-fit portion of the objective function is therefore redefined as

(1/2) Σ_{i=1}^{M} Σ_{j∈S_i} (y_{ij} − β_{ij})².

The MM strategy previously described applies with slight modifications of the matrices A_i. The attentive reader will have noted that y_{ij} with j ∉ S_i can be considered as missing data, and an evaluation of the characteristics of this missingness is appropriate. In general, y_{ij} cannot be considered missing at random. The most important example is the case of mBAF, where homozygous markers result in missing values. Now, homozygosity is more common when copy number is equal to 1 than when copy number is equal to 2 and, therefore, there is potentially more information on β_{ij} to be extracted from the signals than the one we will capture with the proposed methodology. On the other hand, the approach outlined does not appear to increase false positives: operationally, then, it can be considered an improvement over segmentation based on LRR only, even if, in theory, it does not completely use the information in BAF. It is also relevant to note that, in reality, most of the information on deletions is obtained through LRR, and BAF really carries additional information in the case of duplications (where the changes in LRR are limited due to saturation effects).
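The effect of the redefined goodness-of-fit term can be seen in a one-sequence sketch: with an observation mask m (our notation, not the paper's), the identity block of the MM system is replaced by diag(m), so unobserved positions are filled in purely by the penalties, i.e. by their neighbors.

```python
import numpy as np

def masked_mm_step(y, m, b, lam2, eps=1e-8):
    """One MM step for a single sequence with missingness mask m
    (m[j] = 1 if position j is observed, 0 otherwise): the data-fit
    term covers only observed entries, so unobserved b[j] are
    determined by the total-variation penalty alone.  A sketch of
    the modification described in the text, not its implementation."""
    N = len(y)
    D = np.diff(np.eye(N), axis=0)                    # difference matrix
    w2 = 1.0 / np.sqrt(np.diff(b) ** 2 + eps)         # MM weights
    A = np.diag(m.astype(float)) + lam2 * D.T @ (w2[:, None] * D)
    rhs = np.where(m > 0, y, 0.0)                     # NaNs never enter
    return np.linalg.solve(A, rhs)
```

The system stays nonsingular as long as at least one position is observed, since the penalty term is only rank-deficient on constant vectors.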

Choice of tuning constants and segmentation
One of the limitations of penalization procedures is that a value for the tuning parameters needs to be set and clear guidelines are not always available. Path methods that obtain a solution of the optimization problem (3) for every value of tuning parameters can be attractive, but recent algorithmic advances [4,43,55] remain impractical for problems of the size of ours. A number of recent publications obtain optimal values of penalty parameters under a series of conditions [3,5,6,13]: we rely upon them to propose the following strategy consisting of obtaining a solution of (3) for reasonably liberal values of the tuning parameters, followed by a sequence-by-sequence hard thresholding of the detected jumps with a data-adaptive threshold.
We have found the following guidelines to be useful in choosing penalty parameter values:

λ_{1,i} = c_1 σ̂_i,   λ_{2,i} = c_2 σ̂_i ρ (log N)^{1/2},   λ_{3,i} = c_3 σ̂_i (1 − ρ) (M log N)^{1/2},   for i = 1, . . . , M,    (7)

where σ̂_i is a robust estimate of the standard deviation of y_i, ρ ∈ [0, 1] decreases with p, roughly the proportion of the M sequences we anticipate to carry CNVs, and c_1, c_2 and c_3 are positive multipliers adjusted in consideration of different signal-to-noise ratios and CNV sizes.
While a more rigorous justification is provided in the supplementary material, we start by underscoring some of the characteristics of this proposal.
• The sequence-specific penalizing parameters are proportional to an estimate of the standard deviation of the sequence signal: that is, given an initial normalization, the same penalties would be used across all signals.
• The tuning parameter for the total variation (fused lasso) and the Euclidean (group fused lasso) penalties on the jumps depend on √ log N , where N is the possible number of jumps.
This has a "multiple comparison controlling" effect and resembles rates that have been proven optimal under various sparse scenarios [3,5,6,13]. This term does not appear in the expression of λ 1 , as the lasso penalty can be understood as providing a soft thresholding of the solution of (3) when λ 1 = 0: given the penalization due to λ 2 and λ 3 , this object will have much smaller dimensionality than N .
• The group penalty depends on √ M , where M is the number of grouped sequences, as in the original proposal [50].
• The relative weight of the fused-lasso and group-fused-lasso penalties is regulated by ρ, which depends on p, the proportion of the M sequences expected to carry the same CNV.
For example, if M = 2 and the two sequences are LRR and BAF from the same individual, we anticipate p = 1 with ρ = 0, enforcing jumps at identical places in the two signals. At the other extreme, for completely unrelated sequences, p = 0 and ρ = 1.
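The guidelines above can be collected in a small helper. One caveat: the text only pins ρ down at the endpoints (ρ = 0 when p = 1, ρ = 1 when p = 0), so the linear map ρ = 1 − p used below is an assumption for illustration, as are the default multipliers.

```python
import numpy as np

def gfl_tuning(sigma_i, N, M, p, c1=1.0, c2=1.0, c3=1.0):
    """Sketch of the tuning guideline in the text.

    Assumes rho = 1 - p (any decreasing map with rho(1) = 0 and
    rho(0) = 1 would be consistent with the text).  sigma_i is a
    robust noise-scale estimate for sequence i, N the number of
    locations, M the number of sequences, p the anticipated
    proportion of sequences sharing a CNV.
    """
    rho = 1.0 - p
    lam1 = c1 * sigma_i                                       # soft threshold
    lam2 = c2 * sigma_i * rho * np.sqrt(np.log(N))            # per-sequence jumps
    lam3 = c3 * sigma_i * (1 - rho) * np.sqrt(M * np.log(N))  # shared jumps
    return lam1, lam2, lam3
```

The two extreme cases of the text are recovered: p = 1 disables the individual fused-lasso penalty (jumps are forced to coincide), and p = 0 disables the group penalty (sequences are segmented independently).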
The standard deviation σ_i can be estimated robustly as follows. Let Δ_ij = y_{i,j+1} − y_{i,j}, for j = 1, . . . , N − 1, be the first-order differences of adjacent y_{ij} for sequence i. Then Var(Δ_ij) = 2σ_i² for most j, the exceptions being the differences bridging real change points, so we can take, for example, a median-based estimate

σ̂_i = median_j |Δ_ij| / (√2 Φ^{-1}(3/4)),

which is insensitive to the few outlying differences. As mentioned before, the exact values of the penalty parameters should be adjusted depending on the expectations of signal strengths. Following the approach in [31], one can approximate the bias induced by each of the penalties and hence work backwards in terms of acceptable levels, as detailed in the supplementary material. Again following the approach in [31], one can show that under some relatively strong assumptions the choices in (7) lead to a consistent behavior as N → ∞ while M stays bounded (see the supplementary material). Despite the fact that N is indeed large in our studies, it is not clear that we can assume it to be in the asymptotic regime. As finer scale measurements become available, scientists desire to investigate CNVs of decreasing length: the CNVs we are interested in discovering are often covered by a small number of probes. Furthermore, we often have little information on the sizes and frequencies of CNVs. In this context, we find it advisable to rely on a two-stage strategy:

1. Sequences are jointly segmented minimizing (3) for a relatively lax choice of the penalty parameters.
2. Jumps are further thresholded on the basis of a data-driven cut-off.
Step 2 allows us to be adaptive to the signal strength and can be carried out with multiple methods.
For example, one can adopt the modified Bayesian Information Criterion (mBIC) [52]. In data analysis, we often apply an even simpler procedure, in which the threshold for jumps is defined as a fraction of the maximal jump size observed for each sequence. Specifically, for sequence i we compute a quantity γ_i, depending on two window parameters a and b, that serves as a "ruler" reflecting the scale of a possible real jump size, and we take cγ_i as the cut-off in removing most small jumps. In all analyses for this paper, we fix a = 1, b = 5 and c = 0.2. In our experience, this heuristic procedure works well for both tumor and normal tissue CNV data.
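A minimal numerical sketch of the scale estimate and of Step 2 follows. Two simplifications to flag: the constant Φ^{-1}(3/4) ≈ 0.6745 is the standard Gaussian choice, and γ_i is reduced here to the single largest fitted jump, ignoring the window parameters a and b of the paper's definition.

```python
import numpy as np

def robust_sigma(y):
    """Robust noise scale from first differences: Delta_j = y[j+1]-y[j]
    has variance 2*sigma^2 away from change points, and the median of
    |Delta| is insensitive to the few differences that bridge them.
    0.6745 is Phi^{-1}(3/4) for a standard normal."""
    delta = np.diff(y)
    return np.median(np.abs(delta)) / (np.sqrt(2.0) * 0.6745)

def threshold_jumps(beta, c=0.2):
    """Step 2 (simplified): keep only jumps exceeding a fraction c of
    the largest fitted jump, used here as the 'ruler' gamma_i; the
    paper's gamma_i also involves window parameters a and b.
    Returns the indices of the surviving change points."""
    jumps = np.abs(np.diff(beta))
    gamma = jumps.max()
    return np.nonzero(jumps > c * gamma)[0] + 1
```

On a fitted mean vector with one large and one tiny jump, only the large one survives the data-driven cut-off.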

Calling Procedure
Even if this is not the focus of our proposal, in order to compare the performance of our segmentation algorithm with HMM approaches, it becomes necessary to distinguish acquisitions from losses of copy number. While the same segmentation algorithm can be applied to a wide range of data sets, calling procedures depend more closely on the specific technology used to carry out the experiments. Since our data analysis relies on Illumina genotyping arrays, we limit ourselves to this platform, and briefly describe the calling procedure we adopt in Section 4.
Analyzing one subject at a time, each segment with constant mean is assigned to one of five possible copy number states (c = 0, 1, 2, 3, 4). Let R collect the indexes of all SNPs comprising one segment and let (x_R, y_R) = {(x_j, y_j), j ∈ R} be the vectors of values for BAF and LRR in the segment. On the basis of the typical patterns of BAF and LRR in the different copy number states (see [9,45,47]), we can write a log-likelihood ratio LR(c), explicitly defined in the supplementary material. Segment R is assigned the CNV state ĉ that maximizes LR(c), but only if LR(ĉ) > r_1, where r_1 is a pre-specified cut-off.
As noted in [53], the LRR data for a segment with c = 2, ideally normalized to have mean 0, often has a small non-zero mean, due to experimental artifacts. If the number of SNPs in R is sufficiently large, a log-likelihood-ratio criterion such as the above would result in the erroneous identification of a copy number different from 2. To avoid this, we also require that the absolute difference of the mean of LRR from zero be larger than a threshold: |ȳ_R| > r_2 σ̂.
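The structure of this rule (likelihood-ratio maximization gated by the r_1 and r_2 thresholds) can be sketched with an LRR-only Gaussian model. The state means below are illustrative values roughly in the style of PennCNV [47]; the paper's actual likelihood also uses BAF and is defined in its supplementary material, not reproduced here.

```python
import numpy as np

# Hypothetical LRR means per copy-number state, for illustration only
LRR_MEAN = {0: -3.5, 1: -0.66, 2: 0.0, 3: 0.4, 4: 0.68}

def call_segment(y_seg, sigma=0.2, r1=10.0, r2=1.0):
    """Assign a copy-number state to one segment via a Gaussian
    log-likelihood ratio against the null state c = 2; keep the
    call only if LR(c_hat) > r1 AND the mean-shift guard
    |mean(LRR)| > r2 * sigma is satisfied."""
    y = np.asarray(y_seg, dtype=float)

    def loglik(c):
        return -0.5 * np.sum((y - LRR_MEAN[c]) ** 2) / sigma ** 2

    lr = {c: loglik(c) - loglik(2) for c in (0, 1, 3, 4)}
    c_hat = max(lr, key=lr.get)
    if lr[c_hat] > r1 and abs(y.mean()) > r2 * sigma:
        return c_hat
    return 2          # default: normal copy number
```

The second condition is exactly the guard of the paragraph above: a long segment with a tiny systematic mean shift passes the LR test but fails the |ȳ_R| > r_2 σ̂ check, and is left at c = 2.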

Results
We report the results of the analysis of two simulated and two real data sets, which overall exemplify the variety of situations where joint segmentation of multiple sequences is attractive, as described in the introduction. In all cases, we compare the performance of the proposed procedure with a set of relevant, often specialized, algorithms. The penalized estimation method we put forward in this manuscript shows competitive performance in all cases and often a substantial computational advantage. Its versatility and speed make it a very convenient tool for the initial exploration of large data sets.

Simulated CNV in normal samples
We consider one of the simulated data sets described in [54]. To call deletions and duplications with CBS and the two fused-lasso approaches, we use both LRR and BAF data (the latter first transformed to mBAF) with the following cut-off values: r_1 = 10 and r_2 = 1 (1.5) for duplications (deletions). Performance is evaluated by the same indexes we used in [54]: true positive rate (TPR, or sensitivity) and false discovery rate (FDR), both defined on a per-SNP basis. Results are summarized in Table 1.
Not surprisingly, all algorithms perform similarly well for larger deletions/duplications, and it is mainly for variants that involve ≤ 10 SNPs that differences are visible. Algorithms that rely only on LRR (such as CBS and the fused lasso) underperform in the detection of small duplications (Table 1). As Table 1 also shows, while all algorithms are rather fast, the two implementations of the fused lasso dominate in terms of speed.

A simulated tumor data set
To explore the challenges presented by tumor data, we rely on a data set created by [39], with the specific goal of studying the effect of contamination between normal and cancer cells. The HapMap sample NA06991, genotyped on the Illumina HumanHap550 array, was used to simulate a cancer cell line, by inserting a total of 10 structural variation regions, including one-copy losses, one-copy gains, and copy neutral loss-of-heterozygosity (CN-LOH) (see Supplementary Table 2).
The signal from this artificial "tumor" sample was then contaminated in silico with that of the original "normal" sample, resulting in 21 data sets, with a percentage of normal cells ranging from 0% to 100%. Note that most simulated CNV or CN-LOH regions are very large (some spanning an entire chromosome), so the challenge in detection is really due to the contamination levels.
For ease of comparison, we evaluate the accuracy of calling procedures as in the original reference [39]: sensitivity is measured for each variant region as the percentage of heterozygous SNPs that are assigned the correct copy number; and specificity is the percentage of originally heterozygous SNPs in unperturbed regions that are assigned CN=2. We compare the performance of GFL to BAFsegmentation [39] and PSCN [8], representing, respectively, segmentation and HMM approaches specifically developed to deal with contaminated tumor samples (both these algorithms have been tested with success on this simulated data set).
Following other analyses, we do not pre-process the data prior to CNV detection. BAFsegmentation and PSCN were run using recommended parameter values. For each of the diluted data sets, we applied the GFL model to one chromosome at a time, using both LRR and mBAF, whose standard deviations are normalized to 1. Tuning constants are set to λ_1 = 0, λ_2 = 0.5 × 3 × √(log N), and λ_3 = 0.5 × 3 × √(log N), with N the number of SNPs interrogating the chromosome under analysis. The change points resulting from hard thresholding on LRR and mBAF are combined to produce a finer segmentation of the genome. Finally, we adopt the same calling procedure described by [39]. PSCN, like GFL, is implemented in R with some computationally intensive subroutines coded in C. BAFsegmentation relies for its segmentation step on the R package DNAcopy, whose core algorithms are implemented in C and Fortran, and is wrapped in Perl. A comparison of run times indicates that GFL and BAFsegmentation are comparable, while PSCN is fifty times slower than GFL (see Supplementary Table 3).

One sample assayed with multiple replicates and multiple platforms
We use the data from a study [28] assessing the performance of different array platforms and CNV calling methods to illustrate the advantages of joint analysis of multiple measurements on the same subject. DNA from four individuals was analyzed in triplicate on each of 5 platforms: Affymetrix 6.0, Illumina 1M, 660W, Omni1-Quad (O1Q) and Omni2.5-Quad (O2Q) (among others [28]).
We use the results on the first three platforms to define "true" copy numbers and try to reconstruct them using data from O1Q and O2Q. The nine "reference" experiments were analyzed with 4 or 5 CNV calling algorithms (see [28]) and a CNV was identified by majority vote: consistent evidence was required from at least 2 analysis tools, on at least 2 platforms, and in at least 2 replicates (see Supplementary Table 4). Here, CNVs detected in two replicates/algorithms/platforms are regarded as the same CNV and, when they overlap with each other, are collapsed into one CNV with the outermost boundaries.
The test experiments are based on 1,020,596 and 2,390,395 autosomal SNPs after some quality control, at a total of 2,657,077 unique loci. Since our focus here is on how best to analyze multiple signals on the same subject, rather than on the specific properties of any CNV calling method, we carry out all the analyses using different settings of GFL for segmentation while keeping the same CNV calling and summarizing procedure. All segmentation is done on LRR only, while the calling procedure uses both LRR and BAF (with cut-offs r_1 = 10 and r_2 = 1). We compare three segmentation settings to analyze these 6 experiments per subject (see Supplementary Table 5 for more details about tuning parameters):

1. The signals from the three technical replicates on one platform are averaged, and the average is then segmented and subjected to the calling procedure, separately for each platform. The final CNV list is the union of CNV calls from the two platforms.
2. The signals from the three technical replicates on one platform are each segmented and subjected to the calling procedure separately. A majority vote is used to summarize the CNV results for each platform: a CNV needs to be called in at least two replicates out of three. The final CNV list is the union of the two platforms' results.
3. The signals from the three technical replicates of both platforms (6 LRR sequences) are segmented jointly. The calling procedure is still applied to each replicate separately, and the same majority vote is used to summarize the CNV results for each platform. Again, the final CNV list is the union of the two platforms' results.
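The voting and summarizing logic shared by settings 2 and 3 is simple set arithmetic. The sketch below treats calls as identical only when their identifiers match exactly; the text instead merges overlapping calls to their outermost boundaries, which requires interval logic not shown here.

```python
from collections import Counter

def majority_vote(replicate_calls, k=2):
    """Keep CNVs called in at least k of the replicate call sets.
    Calls are hashable identifiers, e.g. (chrom, start, end) tuples
    matched exactly; the paper's procedure additionally collapses
    overlapping calls to their outermost boundaries."""
    counts = Counter(c for calls in replicate_calls for c in set(calls))
    return {c for c, n in counts.items() if n >= k}

def final_cnv_list(platform_a, platform_b, k=2):
    """Union of the per-platform majority votes (settings 2 and 3)."""
    return majority_vote(platform_a, k) | majority_vote(platform_b, k)
```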
To benchmark the result of joint analysis we use MPCBS [51], a segmentation method specifically designed for multi-platform CNV analysis. The segments output by MPCBS are passed to the same calling, majority voting, and summarizing procedure.

Multiple related samples assayed with the same platform
In the context of a study of the genetic basis of bipolar disorder, the Illumina Omni2.5-Quad chip was used to genotype 455 individuals from 11 Colombian and 13 Costa Rican pedigrees. We use this data set to explore the advantages of a joint segmentation of related individuals. In the absence of a reference evaluation of CNV status in these samples, we rely on two indirect methods to assess the quality of the predicted CNVs: we used the collection of CNVs observed in HapMap Phase III [19] to compile a list of 426 copy number polymorphisms (selecting all CNVs above a frequency threshold), and we compared different CNV calling procedures on the basis of how many Mendelian errors they generate. As mentioned before, PennCNV represents a state-of-the-art HMM method for the analysis of normal samples, and we therefore included it in our comparisons. However, the parameters of the underlying HMM had not been tuned on the Omni2.5-Quad at the time of writing, resulting in sub-standard performance. Segmentation methods are less dependent on parameter optimization; hence, GFL analysis of LRR and BAF one subject at a time can provide a better indication of the potential of single-sample methods. We considered two multiple-sample algorithms: GFL and MSSCAN [53], both applied to LRR with groups defined by pedigree membership. (While a trio mode is available in PennCNV [46], it does not adapt to the structure of our families.) A final qualification is in order. While the authors of MSSCAN kindly shared with us a beta version of their software, we found it not to be robust: we were unable to use it to segment the entire genome. However, we successfully used it to segment Chromosome 8, so that we could include MSSCAN in the comparison based on Mendelian error rates.
Prior to analysis, the data was normalized using the GC-content correction implemented in PennCNV [12]. For individual analysis, the GFL parameters were λ1 = 0.1, λ2 = 0, and λ3 = 2√(log N), where N is the number of SNPs deployed on each chromosome; for pedigree analysis, λ1 = 0.1 as well, with the remaining parameters scaled according to M, the number of individuals in each pedigree. For MSSCAN, the CNV size is constrained to be less than 200 SNPs and the maximum number of change points is set to 50. The calling procedure with r1 = 10 and r2 = 1 was applied to both the GFL and MSSCAN results. Figure 3 shows an example of a large pedigree, where 3 out of 4 Mendelian errors are removed by the joint analysis.
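As a rough illustration of how Mendelian errors can be screened for from CNV calls, the sketch below flags a child's CNV that overlaps no call in either parent. This is a simplified stand-in for the actual evaluation (real pipelines use reciprocal-overlap thresholds and genotype-aware rules), and all interval values are invented.

```python
# Hedged sketch: flag child CNVs with no overlapping parental CNV as
# putative Mendelian errors. Illustrative only; not the evaluation used
# in the study.

def overlaps(a, b):
    return a[0] <= b[1] and b[0] <= a[1]

def mendelian_errors(child, father, mother):
    """Each argument: list of (start, end) CNV intervals for one person."""
    parental = father + mother
    return [c for c in child if not any(overlaps(c, p) for p in parental)]

child  = [(100, 200), (900, 950)]
father = [(120, 180)]
mother = [(400, 500)]
print(mendelian_errors(child, father, mother))  # [(900, 950)]
```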

Discussion
We have presented a segmentation method based on penalized estimation and capable of processing multiple signals jointly. We have shown how this leads to improvements in the analysis of normal samples (where segmentation can be applied to both total intensity and allelic proportion), tumor samples (where we are able to deal with contamination effectively), measurements from multiple platforms, and related individuals. Given that copy number detection is such an active area of research, it is impossible to compare one method to all the others available. However, for each of the situations we analyzed, we tried to select approaches representing the state of the art. In comparison to these, the algorithm we presented performs well: its accuracy is always comparable to that of the most effective competitor, and its computation time is often shorter.
We believe that its versatility and speed make GFL particularly useful for initial screening.
There are of course many aspects of CNV detection that we have not analyzed in this paper, from normalization and signal transformation to FDR control of detected CNVs. There are also a number of improvements to our approach that appear promising but are left for further work: for example, it is easy to modify the algorithms so that the penalization parameters are location dependent, incorporating prior information on known copy number polymorphisms; more challenging is developing the theory and methods to select the values of these regularization parameters in a data-adaptive fashion.
Finally, while our scientific motivation has been the study of copy number variations, the joint segmentation algorithm we present is not restricted to specific characteristics of these data types, and we expect it will be applied in other contexts.

Software implementation
All the code used to run the analyses presented in this paper is available from the authors' web page. We have implemented the segmentation routine, which is our core contribution, in an R package (Piet) to be submitted to R-forge (http://r-forge.r-project.org). For a visualization of the CNV results on Chromosome 8 in the bipolar disorder study (see Section 4.4), we refer the interested reader to Supplementary Figure 2 in the supplementary material.

TDM algorithm
The non-zero entries in A_i and b_i in the re-shaped surrogate function (5) are listed as follows. When stacking measurements at different positions, the item 1 in a

Bias estimation
Let x_ij be the data for sequence i at locus j after the noise level σ_i of each sequence is normalized to 1. With this normalization, model (3) reduces to a simpler form with tuning parameters common to all sequences, which eases interpretation:

f(β) = (1/2) Σ_{i=1}^{M} Σ_{j=1}^{N} (x_ij − β_ij)² + λ1 Σ_{i,j} |β_ij| + λ2 Σ_i Σ_{j≥2} |β_ij − β_{i,j−1}| + λ3 Σ_{j≥2} ||β_(j) − β_(j−1)||_2.   (S.1)

The minimizer of f partitions each sequence i into segments L_k^(i), k = 1, ..., K_i, with lengths (numbers of SNPs) |L_k^(i)|. Thus, the estimated mean vector for sequence i can be written as β̂_ij = μ̂_k^(i) for j ∈ L_k^(i). Moreover, β̂ is the optimal solution if and only if it satisfies the subgradient condition 0 ∈ ∂f(β̂); that is,

x_ij − β̂_ij = λ1 s_ij^(1) + λ2 s_ij^(2) + λ3 s_ij^(3),   (S.2)

where s_ij^(1), s_ij^(2), and s_ij^(3) are the coordinates of the subgradients corresponding to β_ij in each of the three penalty terms. Both the bias estimation and the asymptotic analysis rely on this analytic form of the subgradient. We now discuss the bias induced by each penalty separately.
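For concreteness, the penalized objective can be evaluated directly. The sketch below is our own illustration (variable names lam1, lam2, lam3 are ours); it makes explicit the roles of the three penalties: lam1 for sparsity, lam2 for within-sequence fusion, and lam3 for across-sequence group fusion.

```python
import math

# Hedged sketch of the penalized objective: x and beta are M x N nested
# lists (M sequences, N loci). Illustrative only, not the Piet solver.

def gfl_objective(x, beta, lam1, lam2, lam3):
    M, N = len(x), len(x[0])
    fit = 0.5 * sum((x[i][j] - beta[i][j]) ** 2
                    for i in range(M) for j in range(N))
    lasso = lam1 * sum(abs(beta[i][j]) for i in range(M) for j in range(N))
    fused = lam2 * sum(abs(beta[i][j] - beta[i][j - 1])
                       for i in range(M) for j in range(1, N))
    group = lam3 * sum(math.sqrt(sum((beta[i][j] - beta[i][j - 1]) ** 2
                                     for i in range(M)))
                       for j in range(1, N))
    return fit + lasso + fused + group

x = [[1.0, 1.0], [0.0, 0.0]]
flat = [[0.0, 0.0], [0.0, 0.0]]
print(gfl_objective(x, flat, 0.0, 0.0, 0.0))  # 1.0: pure fit term
```

A solver would minimize this objective over beta; here the function only exposes the structure of (S.1).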

Bias induced by lasso penalty
It is easy to verify that the subgradient for the lasso penalty can be written as

s_ij^(1) = sign(β_ij) if β_ij ≠ 0, and s_ij^(1) ∈ [−1, 1] if β_ij = 0.

Hence, the lasso penalty merely acts as a soft-thresholding on the fitted values of model (S.1) with λ1 = 0, denoted by β̂_ij(0, λ2, λ3); that is, with a slight abuse of notation, for any λ1 > 0,

β̂_ij(λ1, λ2, λ3) = sign(β̂_ij(0, λ2, λ3)) (|β̂_ij(0, λ2, λ3)| − λ1)_+,

where (x)_+ = max{x, 0}. This is also highlighted in Lemma A.1 of [15] for model (S.1) with λ3 = 0.
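The soft-thresholding operation can be checked with a two-line function (an illustration, not part of the Piet package): with the other penalties fixed, adding λ1 just shrinks each fitted value toward 0, setting small values exactly to 0.

```python
# Hedged illustration of soft-thresholding: sign(b) * (|b| - lam1)_+.

def soft_threshold(b, lam1):
    sign = 1.0 if b >= 0 else -1.0
    return sign * max(abs(b) - lam1, 0.0)

print(soft_threshold(0.75, 0.25))           # 0.5: shrunk toward zero by lam1
print(soft_threshold(-0.1, 0.25) == 0.0)    # True: small value set to zero
```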

Bias induced by fused-lasso penalty
In model (S.1) with λ1 = 0 and λ3 = 0 (only the fused-lasso penalty involved), Lemma 2.1 in [31] gives an insightful characterization of μ̂^(i): for k = 2, ..., K_i − 1,

μ̂_k^(i) = x̄_{L_k^(i)} − 2λ2/|L_k^(i)| if segment k is a local maximum, and μ̂_k^(i) = x̄_{L_k^(i)} + 2λ2/|L_k^(i)| if it is a local minimum,

where x̄_{L_k^(i)} is the sample mean of x over segment L_k^(i). The result implies that the sample mean (an unbiased estimate of the true mean) of a local minimum/maximum segment (unless it is located at either end) is shifted towards 0 by the fused-lasso penalty. The bias is proportional to λ2 and inversely proportional to the length of the segment. More importantly, no configuration exists in which a local minimum/maximum segment has a jump size (relative to the neighboring segments) smaller than the amount of bias. This means that a CNV with a small jump size or short length may be merged into its neighboring segments if λ2 is set too large.
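A small numeric illustration of this trade-off, with made-up values of λ2, segment length L, and jump size δ:

```python
# Hedged illustration: an interior local-maximum segment of length L has
# fitted mean xbar - 2*lam2/L (local-maximum case of Lemma 2.1 in [31]),
# so a CNV whose jump size is below 2*lam2/L risks being fused into its
# neighbours. All numbers are invented.

def interior_max_fitted_mean(xbar, lam2, L):
    return xbar - 2.0 * lam2 / L

def detectable(delta, lam2, L):
    """Is a jump of size delta over L probes larger than the shrinkage?"""
    return delta > 2.0 * lam2 / L

print(interior_max_fitted_mean(1.0, lam2=5.0, L=100))  # mild shrinkage: 0.1
print(detectable(delta=0.5, lam2=5.0, L=10))           # False: 2*5/10 = 1 > 0.5
```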

Bias induced by group-fused-lasso penalty
The subgradient for group-fused-lasso penalty is given in the following Proposition 1.

Proposition 1:
The β_ij's involved in the group-fused-lasso penalty have subgradient coordinates

s_ij^(3) = (β_ij − β_{i,j−1}) / ||β_(j) − β_(j−1)||_2 for j such that ||β_(j) − β_(j−1)||_2 > 0,

while, for j such that ||β_(j) − β_(j−1)||_2 = 0, any vector e_j with ||e_j||_2 ≤ 1 supplies valid coordinates s_ij^(3) = e_ij.

For j such that ||β_(j) − β_(j−1)||_2 > 0, the sub-gradient reduces to the regular gradient and can thus be derived in the usual way. We therefore focus on j such that ||β_(j) − β_(j−1)||_2 = 0, i.e., the subgradient at 0. By the Cauchy–Schwarz inequality, for any vector e_j with ||e_j||_2 ≤ 1,

Σ_i e_ij (β_ij − β_{i,j−1}) ≤ ||e_j||_2 ||β_(j) − β_(j−1)||_2 ≤ ||β_(j) − β_(j−1)||_2,

and it follows from the definition of the subgradient that any such e_j is valid at 0. The bias induced by the group-fused-lasso penalty can be derived accordingly from the analytic form of the subgradient and is given in the following Proposition 2.
Proposition 2: In model (S.1) with λ1 = 0 and λ2 = 0, the fitted means of the segments for sequence i can be expressed as

μ̂_k^(i) = x̄_{L_k^(i)} + ĉ_k / |L_k^(i)|, k = 1, ..., K_i,

where ĉ_k is a bias term determined by λ3 and the subgradient coordinates s_ij^(3) at the two ends of segment L_k^(i). Proof: The proof follows the technique used in the proof of Lemma 2.1 in [31]. Following the subgradient condition (S.2) in the case λ1 = 0 and λ2 = 0, we have x_ij − β̂_ij = λ3 ŝ_ij^(3).
By Proposition 1 and simple algebra, if sequence i shares neither of the segment's two end change points with any other sequence, the bias term ĉ_k reduces to what it is in model (S.1) with the fused-lasso term only (λ1 = 0 and λ3 = 0). If m out of M sequences share change points at these two ends and the jump sizes at these two locations are roughly the same across the m sequences, then the absolute value of the bias term is approximately 2λ3/√m. This means that when more than one sequence shares change points at the same coordinate, the sequences benefit from each other: each sequence's bias is reduced relative to the bias induced by a fused-lasso penalty specific to that individual sequence.
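The 2λ3/√m approximation can be checked numerically; the values below are illustrative only:

```python
import math

# Hedged arithmetic check of the claim above: when m of M sequences share
# a change point, the group-fused bias term is about 2*lam3/sqrt(m), so
# sharing breakpoints across more sequences shrinks the individual bias.

def group_bias(lam3, m):
    return 2.0 * lam3 / math.sqrt(m)

lam3 = 3.0
for m in (1, 4, 9):
    print(m, group_bias(lam3, m))  # bias falls: 6.0, 3.0, 2.0
```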

Asymptotic behavior
We now justify the order of magnitude of λ2 and λ3 by examining their large-sample behavior as N → ∞. When the number of sequences M in the segmentation task is relatively large, extra caution is needed for λ3. Again, for ease of exposition, we discuss the asymptotic behavior of the solution under the fused-lasso and the group-fused-lasso penalties separately.

Asymptotic behavior for fused-lasso penalty
In the fused-lasso model (λ1 = 0 and λ3 = 0), the justification is directly inspired by the proof of Theorem 2.3 in [31]. Denote by A_i, i = 1, ..., M, the event that all jump points, and the directions of the jumps, are correctly identified for sequence i. A necessary condition on λ2 is summarized in Proposition 3: for the events A_i to hold with probability tending to one, it is required that λ2 = c2 √(log N) for some constant c2 > 0. This asymptotic behavior follows directly from the proof of Theorem 2.3 in [31]. Some quick remarks: 1) If the signal of each sequence is not normalized, then λ_{2,i} = c2 σ_i √(log N), specific to sequence i.
2) In order to detect a CNV segment with length L and jump size δ, the bias needs to satisfy 2λ2/L < δ; that is, c2 < (δ/σ_i) L / (2√(log N)). Here, δ/σ_i can be interpreted as the signal-to-noise ratio (SNR). For a specific platform, one may get a sense of the magnitude of the SNR and of L from prior knowledge. In practice, it is desirable to take c2 as large as possible to ensure the sparsity of the segmentation, but not so large as to violate the constraint imposed by the signal strength (through δ/σ_i and L). Based on our experience with the analysis of Illumina data [54], the results are not sensitive to the choice of c2, provided it falls within a reasonable range.
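A hedged helper translating this rule of thumb into an upper bound on c2, assuming the detectability condition takes the form 2λ2/L < δ with λ2 = c2 σ √(log N); all numeric inputs are illustrative:

```python
import math

# Hedged sketch: largest c2 (roughly) that still lets a CNV of jump size
# delta over L probes survive the fused-lasso shrinkage, assuming
# lam2 = c2 * sigma * sqrt(log N). Numbers below are made up.

def max_c2(delta, sigma, L, N):
    return (delta / sigma) * L / (2.0 * math.sqrt(math.log(N)))

# e.g. SNR delta/sigma = 0.5, a 20-probe CNV, N = 1e6 probes:
print(max_c2(delta=0.5, sigma=1.0, L=20, N=10**6))  # roughly 1.3
```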

Proposition 4: It is required that λ3 = c3 √(pM) √(log N) for some constant c3 > 0; in particular, λ3 must grow with √M rather than remain of order √(log N) alone.
Proof: For simplicity, we prove the result under the condition that the ε_ij are i.i.d. N(0, 1) (after σ_i is normalized to 1); this condition can be relaxed [31]. We again follow the technique used in the proof of Theorem 2.3 in [31]. Let d_ij = β_ij − β_{i,j−1}, d̂_ij = β̂_ij − β̂_{i,j−1}, and d̃_ij = ε_ij − ε_{i,j−1}. In practice, M is not negligible with respect to √(log N): for example, √(log(10⁶)) ≈ 3.72, and it is not uncommon to have more than 4 sequences in a joint segmentation. Therefore, it is necessary to let λ3 grow with √M. We also have some remarks on how to determine λ3: 1) If the signal of each sequence is not normalized, then λ_{3,i} = c3 σ_i √(pM) √(log N). The choice of p is decided case by case and is discussed in the main text.
2) Following the above discussion of the bias induced by the group-fused-lasso penalty, if m out of M sequences carry CNVs with exactly the same boundaries, the bias is approximately 2λ3/√m = 2c3 √(pM log N)/√m. On the one hand, if p is overestimated, so that pM is much larger than m, the model is over-penalized and introduces more bias than the individual fused-lasso penalties would, and thus does not benefit from the joint analysis. On the other hand, if pM is set too small, we have insufficient control of the sparsity of each sequence, which then has to be compensated by the fused-lasso penalty. This is why we need to incorporate ρ(p) to re-weight the relative influence of the two penalties.
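A quick numeric comparison of the two bias scales (our own illustration; the values of c2, c3, p, M, m, and N are arbitrary):

```python
import math

# Hedged comparison: per-sequence fused-lasso bias ~ 2*c2*sqrt(log N)
# versus group-fused bias ~ 2*c3*sqrt(p*M*log N)/sqrt(m) when m of M
# sequences share a breakpoint. Illustrative constants only.

def fused_bias(c2, N):
    return 2.0 * c2 * math.sqrt(math.log(N))

def group_fused_bias(c3, p, M, m, N):
    return 2.0 * c3 * math.sqrt(p * M * math.log(N)) / math.sqrt(m)

c2 = c3 = 1.0
N, M = 10**6, 10
common = group_fused_bias(c3, p=0.5, M=M, m=8, N=N)  # shared by 8 of 10
rare = group_fused_bias(c3, p=0.5, M=M, m=1, N=N)    # private to 1 sequence
print(common < fused_bias(c2, N) < rare)             # True: sharing helps
```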

Details in calling procedure
We specify the likelihood functions for the LRR and BAF signals in the log-likelihood ratio (8) as follows. For the BAF signal, the likelihood is usually modeled, for each copy number state, as a mixture of densities centered at the few possible BAF values corresponding to the different genotypes [9,47], with mixture weights derived from the population frequencies p_A and p_B of alleles A and B when these are available or can be estimated from the data; all parameters are defined as in Supplementary Table 1.
For the LRR signal, the likelihood function is simply a normal density: L_LRR(y; c) = φ(y; μ_c, σ_c²). For c = 0, 1, 3, 4, μ_c and σ_c² are estimated from the data y_R in the segment R being considered, while μ_2 and σ_2² are estimated from the data of the whole chromosome on which segment R is located or, locally, from the data of a few hundred markers flanking the segment.
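A minimal sketch of the LRR side of the calling step, assuming the normal-likelihood form above; the segment values and state parameters below are invented for illustration, not the paper's estimates:

```python
import math

# Hedged sketch: per-segment log-likelihood ratio of a candidate copy-number
# state c against the normal state c = 2, using normal densities for LRR.

def norm_logpdf(y, mu, sigma):
    return -0.5 * math.log(2 * math.pi * sigma**2) - (y - mu)**2 / (2 * sigma**2)

def lrr_loglik_ratio(segment, mu_c, sigma_c, mu2, sigma2):
    """Sum over the segment of log L(y; c) - log L(y; 2)."""
    return sum(norm_logpdf(y, mu_c, sigma_c) - norm_logpdf(y, mu2, sigma2)
               for y in segment)

segment = [-0.6, -0.5, -0.7, -0.55]  # LRR values hinting at a deletion
llr = lrr_loglik_ratio(segment, mu_c=-0.5, sigma_c=0.2, mu2=0.0, sigma2=0.2)
print(llr > 0)  # True: the one-copy state fits this segment better
```

In the actual procedure this LRR term is combined with the BAF term in the ratio (8) and compared to the cut-offs r1 and r2.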