Bayesian DNA copy number analysis.

BACKGROUND
Some diseases, like tumors, can be related to chromosomal aberrations, leading to changes of DNA copy number. The copy number of an aberrant genome can be represented as a piecewise constant function, since it can exhibit regions of deletions or gains. In contrast, in a healthy cell the copy number is two, because we inherit one copy of each chromosome from each of our parents. Bayesian Piecewise Constant Regression (BPCR) is a Bayesian regression method for data that are noisy observations of a piecewise constant function. The method estimates the unknown number of segments, the endpoints of the segments and the values of the segment levels of the underlying piecewise constant function. The Bayesian Regression Curve (BRC) estimates the same data with a smoothing curve. However, in the original formulation, some estimators failed to properly determine the corresponding parameters. For example, the boundary estimator did not take into account the dependency among the boundaries and could estimate more than one breakpoint at the same position, thus losing segments.


RESULTS
We derived an improved version of the BPCR (called mBPCR) and of the BRC, changing the segment number estimator and the boundary estimator to enhance the fitting procedure. We also proposed an alternative estimator of the variance of the segment levels, which is useful in the case of highly noisy data. Using artificial data, we compared the original and the modified versions of BPCR and BRC with other regression methods, showing that our improved version of BPCR generally outperformed all the others. Similar results were also observed on real data.


CONCLUSION
We propose an improved method for DNA copy number estimation, mBPCR, which performed very well compared to previously published algorithms. In particular, mBPCR was more powerful in the detection of the true position of the breakpoints and of small aberrations in very noisy data. Hence, from a biological point of view, our method can be very useful, for example, to find targets of genomic aberrations in clinical cancer samples.

f(y_{t_l,t_m}, \mu_{l,m} \mid t_{l,m}, K_{t_l,t_m} = m-l) = f(y_{t_l,t_p}, \mu_{l,p} \mid t_{l,p}, K_{t_l,t_p} = p-l) \, f(y_{t_p,t_m}, \mu_{p,m} \mid t_{p,m}, K_{t_p,t_m} = m-p).   (S.1)

To perform the dynamic program, we define the quantity

A^r_{i,j} := \int_{\mathbb{R}} \mu_m^r \, f(y_{i,j}, \mu_m \mid K_{i,j} = 1) \, d\mu_m, for all i < m \le j,   (S.2)

where \mu_{i+1} = \ldots = \mu_j = \mu_m, since all the data points considered belong to the same segment, which we call m. One should notice that, if r = 0, A^0_{i,j} is the density of the data y_{i,j} given that they belong to the same segment, while, if r \ge 1, A^r_{i,j} is related to the moments of M_m with respect to the conditional probability given y and t, where i = t_{m-1} and j = t_m, for all m = 1, \ldots, k.
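For illustration, A^r_{i,j} can be approximated by direct quadrature. The sketch below assumes the Gaussian model underlying BPCR (segment levels drawn from N(\nu, \rho^2), i.i.d. noise of variance \sigma^2); the function name and the default parameter values are illustrative, not taken from the paper.

```python
import numpy as np

def A_moment(r, y_seg, nu=0.0, rho2=1.0, sigma2=0.25):
    """Quadrature approximation of A^r_{i,j} = int mu^r f(y_{i,j}, mu | K_{i,j}=1) dmu,
    assuming levels ~ N(nu, rho2) and noise ~ N(0, sigma2)."""
    y_seg = np.asarray(y_seg, dtype=float)
    grid = np.linspace(nu - 10.0 * np.sqrt(rho2), nu + 10.0 * np.sqrt(rho2), 4001)
    dx = grid[1] - grid[0]
    # likelihood of the segment data for each candidate level mu on the grid
    lik = np.prod(
        np.exp(-(y_seg[:, None] - grid[None, :]) ** 2 / (2 * sigma2))
        / np.sqrt(2 * np.pi * sigma2),
        axis=0,
    )
    # Gaussian prior on the segment level
    prior = np.exp(-(grid - nu) ** 2 / (2 * rho2)) / np.sqrt(2 * np.pi * rho2)
    return float(np.sum(grid ** r * lik * prior) * dx)
```

In this conjugate model the ratio A^1_{i,j}/A^0_{i,j} is the posterior mean of the segment level, i.e. the sample mean of the segment shrunk towards \nu.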

The left and right recursions of the dynamic program are made on the following quantities

L_{k+1,j} := \binom{j-1}{k} f(y_{0,j} \mid K_{0,j} = k+1)   (S.5)

R_{k+1,i} := \binom{n-i-1}{k} f(y_{i,n} \mid K_{i,n} = k+1).   (S.6)

Given the position of t_k, the probability distribution of the data divided into k+1 segments can be decomposed into the product of the joint distribution of the first k segments and the joint distribution of the last segment. Hence, we can compute the probability distribution of the data in k+1 segments by summing these products over all possible positions of t_k,

L_{k+1,j} = \sum_{i=k}^{j-1} L_{k,i} \, A^0_{i,j}.   (S.7)

Similarly, we can sum, over all possible positions of t_1, the products of the joint distribution of the first segment and the joint distribution of the last k segments, obtaining

R_{k+1,i} = \sum_{j=i+1}^{n-k} A^0_{i,j} \, R_{k,j}.   (S.8)

The recursions start with L_{0,j} := \delta_{0,j} and R_{0,i} := \delta_{i,n}, because there are zero segments only if the two endpoints coincide.
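The left recursion can be sketched as follows (the right one is symmetric). This is a minimal illustration, assuming the table L is built by combining L_{k,i} with A^0_{i,j} over all positions of the last breakpoint, as described in the text; A0 is a hypothetical precomputed matrix of the single-segment densities A^0_{i,j}.

```python
import numpy as np

def left_recursion(A0, kmax):
    """Build the table L[k][j], proportional to f(y_{0,j} | K_{0,j} = k),
    from the single-segment densities A0[i][j] (assumed precomputed)."""
    n = A0.shape[0] - 1
    L = np.zeros((kmax + 1, n + 1))
    L[0, 0] = 1.0                                  # L_{0,j} = delta_{0,j}
    for k in range(kmax):
        for j in range(1, n + 1):
            # last segment covers (i, j]; the first k segments cover (0, i]
            L[k + 1, j] = sum(L[k, i] * A0[i, j] for i in range(j))
    return L
```

With A^0_{i,j} ≡ 1 the recursion simply counts the ways of placing the breakpoints, so L_{k,n} = \binom{n-1}{k-1}, matching the combinatorial factor in (S.5).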

The Bayesian estimators of the original BPCR.
For the definition of the Bayesian estimators, we need to compute the evidence and the posterior distributions of the parameters. Since

L_{k,n} = R_{k,0} = \binom{n-1}{k-1} f(y \mid k),   (S.9)

the evidence of the sample point y turns out to be

f(y) = \sum_{k} f(y \mid k) \, f_K(k).   (S.10)

Then, the posterior probabilities of the segment number and of the boundaries can be written in the following way,

f(k \mid y) = \frac{f(y \mid k) \, f_K(k)}{f(y)}   (S.11)

f(t_p = h \mid y, K = k_0) = \frac{L_{p,h} \, R_{k_0-p,h}}{L_{k_0,n}},   (S.12)

for each p = 1, \ldots, k_0 - 1 and for all h \in \{p, \ldots, n-1\}.
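A small numerical sketch of the boundary posterior, assuming it takes the factorized form L_{p,h} R_{k-p,h} / L_{k,n} (our reading of Equation (S.12)); both DP tables and the marginal are built from a hypothetical precomputed matrix A0 of single-segment densities.

```python
import numpy as np

def lr_tables(A0, kmax):
    """Left/right DP tables built from the single-segment densities A0[i][j]
    (recursion forms assumed as in the text)."""
    n = A0.shape[0] - 1
    L = np.zeros((kmax + 1, n + 1)); L[0, 0] = 1.0
    R = np.zeros((kmax + 1, n + 1)); R[0, n] = 1.0
    for k in range(kmax):
        for j in range(1, n + 1):
            L[k + 1, j] = sum(L[k, i] * A0[i, j] for i in range(j))
        for i in range(n):
            R[k + 1, i] = sum(A0[i, j] * R[k, j] for j in range(i + 1, n + 1))
    return L, R

def boundary_marginal(L, R, k, p):
    """Posterior P(t_p = h | y, K = k), assuming the factorization
    L_{p,h} R_{k-p,h} / L_{k,n}."""
    n = L.shape[1] - 1
    return np.array([L[p, h] * R[k - p, h] / L[k, n] for h in range(n + 1)])
```

With uninformative A^0 ≡ 1 the marginal is uniform over the admissible positions, a quick sanity check that the normalization by L_{k,n} is correct.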
Now we can compute the estimators. The MAP (Maximum A Posteriori) estimate of the segment number given the sample point y is

\hat{k} := \arg\max_{k} f(k \mid y).   (S.13)

As we can see in Equation (S.12), the estimation of the boundaries requires the knowledge of the true number of segments, so we use \hat{k} instead of k_0,

\hat{t}_p := \arg\max_{h} f(t_p = h \mid y, K = \hat{k}), for all p = 1, \ldots, \hat{k}-1.   (S.14)

The computation of the estimate of the r-th moment of the level of the m-th segment requires the knowledge of the segment number and of the partition of the data (see Equation (S.4)), so we use the estimated ones,

\hat{E}[M_m^r \mid y] = \frac{A^r_{i,j}}{A^0_{i,j}},   (S.15)

where i = \hat{t}_{m-1} and j = \hat{t}_m, for all m = 1, \ldots, \hat{k}.
To estimate the segment level M_s at a generic position s, we first need to compute its posterior distribution explicitly,

f(\mu_s \mid y, K = \hat{k}) = \frac{1}{L_{\hat{k},n}} \sum_{p=1}^{\hat{k}} \sum_{i < s \le j} L_{p-1,i} \, f(y_{i,j}, \mu_s \mid K_{i,j} = 1) \, R_{\hat{k}-p,j}.   (S.16)

Hence the estimate of M_s given \hat{k} is

\hat{\mu}_s := E[M_s \mid y, K = \hat{k}] = \frac{1}{L_{\hat{k},n}} \sum_{p=1}^{\hat{k}} \sum_{i < s \le j} L_{p-1,i} \, A^1_{i,j} \, R_{\hat{k}-p,j},   (S.17)

for all s = 1, \ldots, n.
The boundary estimator T_joint. The boundary estimator T_joint is defined as

\hat{T}_{joint} := \arg\max_{t_{1,k_0-1}} f(t_{1,k_0-1} \mid y, K = k_0).   (S.18)

The explicit formula for the joint boundary distribution, given k_0 and the sample point y, is

f(t_{1,k_0-1} \mid y, K = k_0) = \frac{1}{L_{k_0,n}} \prod_{m=1}^{k_0} A^0_{t_{m-1},t_m}.   (S.19)

Thus, the estimated boundaries are

\hat{t}_{1,\hat{k}-1} = \arg\max_{t_{1,\hat{k}-1}} \prod_{m=1}^{\hat{k}} A^0_{t_{m-1},t_m}.   (S.20)

This estimator can be computed by using the following recursion, where W^p_{0,i} represents the maximum probability that y_{0,i} is divided into p segments over all combinations of the boundaries,

W^1_{0,i} := A^0_{0,i}   (S.21)

W^{p+1}_{0,i} := \max_{h} W^p_{0,h} \, A^0_{h,i}.   (S.22)

As a consequence, the components of the vector of the boundary estimator turn out to be, for k = \hat{k}-1, \ldots, 1,

\hat{t}_k = \arg\max_{h} W^k_{0,h} \, A^0_{h,\hat{t}_{k+1}},   (S.23)

with \hat{t}_{\hat{k}} := n.
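The max-product recursion for T_joint can be sketched as a Viterbi-style dynamic program: build the W table, store the argmax at each step, then backtrack from \hat{t}_{\hat{k}} = n. A minimal illustration, again assuming a hypothetical precomputed matrix A0 of single-segment densities.

```python
import numpy as np

def joint_boundaries(A0, k):
    """Max-product DP: W[p][j] is the maximum probability that y_{0,j}
    splits into p segments; backtracking yields the argmax boundaries."""
    n = A0.shape[0] - 1
    W = np.zeros((k + 1, n + 1)); W[0, 0] = 1.0
    arg = np.zeros((k + 1, n + 1), dtype=int)
    for p in range(1, k + 1):
        for j in range(p, n + 1):
            cand = [W[p - 1, i] * A0[i, j] for i in range(j)]
            arg[p, j] = int(np.argmax(cand))
            W[p, j] = cand[arg[p, j]]
    # backtrack: t_k := n, then t_{p-1} is the argmax stored at (p, t_p)
    t = [0] * (k + 1); t[k] = n
    for p in range(k, 0, -1):
        t[p - 1] = arg[p, t[p]]
    return t
```

Storing the argmax avoids recomputing the maximization during backtracking, so the whole procedure stays quadratic in n for each segment count.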
The boundary estimator T_BinErrAk.
To compute the T_BinErrAk estimator, we first need to make explicit the expected value in Equation (22), given the sample point y. Hence, to find the estimated \tau_{BinErrAk}, we need to find the k-1 positions with the highest posterior probability of carrying a breakpoint (corresponding to the indices i_1, \ldots, i_{k-1}), and then take \tau_{BinErrAk} such that \tau_{i_p} = 1 for p = 1, \ldots, k-1. Using the inverse function of (19), we obtain the corresponding value of the estimator T_BinErrAk.
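The selection step reduces to keeping the k-1 largest entries of the posterior breakpoint probabilities. A minimal sketch (the probability vector p_break is a hypothetical input, indexed so that entry h-1 corresponds to position h):

```python
import numpy as np

def binerr_boundaries(p_break, k):
    """Keep the k-1 positions with the highest posterior probability of
    being a breakpoint; return them as sorted 1-based positions."""
    idx = np.argsort(p_break)[::-1][: k - 1]   # indices of the k-1 largest
    return sorted(int(i) + 1 for i in idx)     # shift to 1-based positions
```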

The BRCAk. The computation of the BRCAk is simply given by the corresponding posterior expectation of the regression curve, where, in the last equation, we used the notation of Hutter [11,12], with the only difference that now F^0_{i,j} must be computed for all possible k, so we need to make this dependency explicit. Unfortunately, the computation of this quantity requires time O(n^2 k_{max}^2), hence it can be a problem for samples of large size.

S.2 Some error measure definitions
Definition of the MSE measure. The mean square error is defined as the expected value of the square error, and it is therefore estimated with the corresponding arithmetic mean. Hence, given N estimated values \hat{\theta}_1, \ldots, \hat{\theta}_N of a generic parameter \theta, the estimated mean square error of \theta is

\widehat{MSE}(\theta) = \frac{1}{N} \sum_{i=1}^{N} (\hat{\theta}_i - \theta)^2.

Definition of the mean relative error. Given the estimated values \hat{\theta}_{ij} of a generic parameter \theta, the estimated mean relative error over all datasets is

\frac{\Delta\theta}{\theta} = \frac{1}{mn} \sum_{j=1}^{m} \sum_{i=1}^{n} \frac{|\hat{\theta}_{ij} - \theta_j|}{|\theta_j|},

where m is the number of datasets, n the number of samples in each dataset and \hat{\theta}_{ij} is the estimate of \theta_j based on the i-th sample in the j-th dataset.
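Both measures are straightforward to compute; a short sketch (the function names are ours):

```python
import numpy as np

def mse(theta_hat, theta):
    """Estimated MSE: the arithmetic mean of the squared errors."""
    theta_hat = np.asarray(theta_hat, dtype=float)
    return float(np.mean((theta_hat - theta) ** 2))

def mean_relative_error(theta_hat, theta_true):
    """Mean relative error over m datasets with n samples each:
    theta_hat[i, j] estimates theta_true[j] from sample i of dataset j."""
    theta_hat = np.asarray(theta_hat, dtype=float)
    theta_true = np.asarray(theta_true, dtype=float)
    return float(np.mean(np.abs(theta_hat - theta_true) / np.abs(theta_true)))
```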

S.3 Supplementary Tables
Table S.1: Estimated mean relative error of the estimators \hat{\nu}, \hat{\sigma}^2, \hat{\rho}^2 and \hat{\rho}^2_1 over all the datasets used.

    \Delta\nu/\nu                 0.6376
    \Delta\sigma^2/\sigma^2       0.1524
    \Delta\rho^2/\rho^2           8.6217
    \Delta\rho^2_1/\rho^2_1       0.5840

The results show that \hat{\sigma}^2 had the lowest error. The error of \hat{\nu} was high but acceptable. The estimator \hat{\rho}^2_1 was significantly better than \hat{\rho}^2 with respect to this error measure.

Table S.2: Estimated mean errors ± standard deviation of the estimators of t_0 applied to a dataset without replicates with \sigma^2 = 0.1 and \rho^2 = 0.5, using \hat{k}_0, \hat{\nu}, \hat{\sigma}^2 and \hat{\rho}^2_1. The estimator T_01 minimized the 0-1 error and the average square error. The errors of T_BinErr and T_BinErrAk were always very close and low.

Table S.5: Copy number estimation results obtained on 250K Array data of sample JEKO-1. All methods behaved equally well. The method HMM had problems in determining the right position of one breakpoint around the C-MYC amplification. All methods estimated the copy number of CCND1 differently from the FISH technique.

Figure S.2: Confidence intervals at 95% of the estimators of k_0 for, respectively, the 0-1 error, the absolute error and the squared error, obtained on a dataset without replicates with \sigma^2 = 0.1 and \rho^2 = 0.5. The graphs show that, in each situation, \hat{K}_2 always had the lowest upper bound of the interval. Using \hat{\rho}^2 the confidence intervals were shorter.

Figure S.5: Comparison of the segment level estimation using T_joint or T_BinErrAk with the different \rho^2 estimators, on four datasets with replicates. The corresponding true profiles are in Figure S.4. In general, using T_BinErrAk we obtained a lower MSE per probe than using T_joint. For a fixed boundary estimator, the error was often lower using \hat{\rho}^2_1 on these datasets, where \sigma^2 \gg \rho^2.

Figure S.6: Comparison of the sensitivity and FDR computed on the results obtained on dataset Simulated Chromosomes using all the piecewise constant methods. Using \hat{\rho}^2 instead of \hat{\rho}^2_1, the FDR was lower.
The estimator T_BinErrAk had the highest sensitivity and the second lowest FDR. The FDR of T_01 was the lowest, but this is due to the fact that it reduces the number of estimated breakpoints by assigning the same position to different breakpoints. In conclusion, T_BinErrAk with \hat{\rho}^2 was the best performing estimator on this dataset.

Figure S.7: RMSE per probe of the several Bayesian regression curves, using \hat{\rho}^2 as the estimator of \rho^2, on four datasets with replicates. The corresponding true profiles are in Figure S.4. The graphs show that, using \hat{\rho}^2, BRCAk always had the lowest RMSE per probe and thus performed better than the other BRCs. Sometimes the error committed using \hat{\rho}^2_1 was lower than using \hat{\rho}^2, probably because \hat{\rho}^2_1 can lead to a slight overfitting.

The black segments on the horizontal axis correspond to the regions of aberration. On this very noisy dataset, the modified BPCR with \hat{\rho}^2_1, \hat{K}_2 and T_BinErrAk always had a low RMSE per probe, even though its error was not the lowest one outside the aberrations and inside the first one. On the contrary, CBS and CGHseg had the lowest error in these regions, but the highest error inside the small aberrations. Hence, globally, the modified BPCR with \hat{\rho}^2_1, \hat{K}_2 and T_BinErrAk performed better than the other algorithms on this dataset with respect to the RMSE measure.

Figure S.15: Zoomed ROC curves of several smoothing methods applied to the dataset with SNR = 3. The intersections among the ROC curves were due to the differences of the methods in the level estimation outside the aberrations. The more oscillating the estimated curves in these regions, the closer the corresponding ROC curves to the top side of the graph. In our case, an oscillating estimated profile is very different from the true one.

Figure S.17: ROC curves of several smoothing methods applied to the dataset with SNR = 1.
On this very noisy data, the methods smoothseg and lowess seemed to be the best ones, since their ROC curves were the highest at the top left corner of the plot. The third best method was BRC with \hat{K}_2 and \hat{\rho}^2_1.

Figure S.18: RMSE of several smoothing methods applied to the dataset with SNR = 1. The black segments on the horizontal axis correspond to the regions of aberration. The graphs show that the methods lowess, quantreg and smoothseg had more or less the same error inside and outside the aberrations. Instead, the BRC version with \hat{K}_2 and \hat{\rho}^2_1 and BRCAk with \hat{\rho}^2_1 had a very low error outside the aberrations and not the highest error inside them; thus, globally, they performed better than the other algorithms with respect to the RMSE measure.