 Methodology article
 Open Access
DBS: a fast and informative segmentation algorithm for DNA copy number analysis
 Jun Ruan^{1},
 Zhen Liu^{1},
 Ming Sun^{1},
 Yue Wang^{2},
 Junqiu Yue^{3} and
 Guoqiang Yu^{2}
 Received: 10 July 2017
 Accepted: 7 December 2018
 Published: 3 January 2019
Abstract
Background
Genome-wide DNA copy number changes are hallmark events in the initiation and progression of cancers. Quantitative analysis of somatic copy number alterations (CNAs) has broad applications in cancer research. With the increasing capacity of high-throughput sequencing technologies, fast and efficient segmentation algorithms are required when characterizing high-density CNA data.
Results
A fast and informative segmentation algorithm, DBS (Deviation Binary Segmentation), is developed and discussed. The DBS method is based on the least absolute error principle and is inspired by the segmentation method rooted in the circular binary segmentation procedure. DBS uses point-by-point model calculation to ensure the accuracy of segmentation and combines a binary search algorithm with heuristics derived from the Central Limit Theorem. The DBS algorithm is very efficient, requiring a computational complexity of O(n log n), and is faster than its predecessors. Moreover, DBS measures the change-point amplitude between the mean values of two adjacent segments at a breakpoint, where the significance degree of the change-point amplitude is determined by the weighted average deviation at breakpoints. Accordingly, using the constructed binary tree of significance degrees, DBS informs whether the results of segmentation are over- or under-segmented.
Conclusion
DBS is implemented in a platform-independent and open-source Java application (ToolSeg), including a graphical user interface and simulation data generation, as well as various segmentation methods in the native Java language.
Background
Changes in the number of copies of somatic genomic DNA are a hallmark in cancer and are of fundamental importance in disease initiation and progression. Quantitative analysis of somatic copy number alterations (CNAs) has broad applications in cancer research [1]. CNAs are associated with genomic instability which causes copy number gains or losses of genomic segments. As a result of such genomic events, gains and losses are contiguous segments in the genome [2]. Genome-wide scans of CNAs may be obtained with high-throughput technologies, such as SNP arrays and high-throughput sequencing (HTS). After proper normalization and transformation of the raw sample data obtained from such technologies, the next step is usually to perform segmentation to identify the regions where CNA occurs. This step is critical, because the signal at each genomic position measured is noisy and the segmentation can dramatically increase the accuracy of CNA detection.
Quite a few segmentation algorithms have been designed. Olshen et al. [3, 4] developed Circular Binary Segmentation (CBS), which relies on the intuition that a segmentation can be recovered by recursively cutting the signal into two or more pieces using a permutation reference distribution. Fridlyand et al. [5] proposed an unsupervised segmentation method based on Hidden Markov Models (HMM), assuming that copy numbers in a contiguous segment have a Gaussian distribution. Segmentation is viewed as a state transition and maximizes the probability of an observation sequence (copy number sequence). Several dedicated HMMs have been proposed [6–8]. Zaid Harchaoui et al. [9, 10] proposed casting multiple change-point estimation as a variable selection problem, where a least-squares criterion with a Lasso penalty yields an efficient primary estimation of change-point locations. Tibshirani et al. [11] proposed a method based on a fused Lasso penalty that relies on the L1-norm penalty for successive differences. Nilsen [12] proposed a highly efficient algorithm, Piecewise Constant Fitting (PCF), that is based on dynamic programming and statistically robust penalized least squares principles; the breakpoints are estimated by minimizing a penalized least squares criterion. Rigaill [13, 14] proposed dynamic programming to retrieve the change-points that minimize the quadratic loss. Yu et al. [15, 16] proposed a segmentation method using the Central Limit Theorem (CLT), which is similar to the idea used in the circular binary segmentation procedure.
Many existing methods show promising performance when the observation sequence to be split is of small or moderate length. However, as experienced in our own studies, these methods are computationally intensive, and segmentation becomes a bottleneck in the pipeline of copy number analysis. With the increasing capacity for raw sample data production provided by high-throughput technologies, a faster segmentation algorithm for identifying regions of constant copy number is always desirable. In this paper, a novel and computationally highly efficient algorithm is developed and tested.
There are three innovations in the proposed Deviation Binary Segmentation (DBS) algorithm. First, the least absolute error (LAE) principle is exploited to achieve high processing efficiency and speed, and a novel integral array-based algorithm is proposed to further increase computational efficiency. Second, a heuristic strategy derived from the CLT helps gain additional speed. Third, DBS measures the change-point amplitude between the mean values of two adjacent segments at a breakpoint, and, using the constructed binary tree of significance degrees, informs whether the results of segmentation are over- or under-segmented. A central theme of the present work is to build an algorithm for solving segmentation problems under a statistically and computationally unified framework. The DBS algorithm is implemented in an open-source Java package named ToolSeg, which provides integrated simulation data generation and various segmentation methods: PCF, CBS (2004), and the segmentation method in Bayesian Analysis of Copy Number Mixture (BACOM). It can be used for comparison between methods as well as for meeting actual segmentation needs.
Implementation
Systems overview
The ToolSeg tool provides functionality for many tasks typically encountered in copy number analysis: data preprocessing, various segmentation algorithms and visualization tools. The main workflow includes: 1) reading and filtering of raw sample data; 2) segmentation of allele-specific SNP array data; and 3) visualization of results. The input includes copy number measurements from single or paired SNP-array or HTS experiments. Extreme observations (outliers) normally need to be detected and appropriately modified or filtered prior to segmentation. Here, the median filtering algorithm [17] is used in the ToolSeg toolbox to manipulate the original input measurements. The DBS method is based on the Central Limit Theorem in probability theory for finding breakpoints and observation segments with a well-defined expected mean and variance. In DBS, the segmentation curves are generated by recursive splits using the preceding breakpoints. A set of graphical tools is also available in the toolbox to visualize the raw data and segmentation results and to compare six different segmentation algorithms in a statistically rigorous way.
Input data and preprocessing
ToolSeg requires the raw signals from high-throughput samples to be organized as a one-dimensional vector and stored as a .txt file. Detailed descriptions of the software are included in the Supplementary Material.
Before copy number change detection and segmentation are performed, a challenging factor in copy number analysis is the frequent occurrence of outliers: single probe values that differ markedly from their neighbors. Generally, such extreme observations can be due to the presence of very short segments of DNA with deviant copy numbers, technical aberrations, or a combination of both. Such extreme observations have a potentially harmful effect when the focus is on detection of broader aberrations [17, 18]. In ToolSeg, a classical limit filter, Winsorization, is performed as a preprocessing step to reduce such noise by eliminating extreme values and thereby reducing the effect of possible spurious outliers.
Each observation is clipped to lie within τ estimated standard deviations of the local median, with τ ∈ [1.5, 3] (default 2.5 in ToolSeg). Often, such simple and fast Winsorization is sufficient, as discussed in [12].
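The Winsorization step can be sketched as follows. This is a minimal illustration, clipping values to the segment median ± τσ̂; the class and method names (`Winsor`, `winsorize`) are illustrative, not ToolSeg's actual API, and ToolSeg's implementation may use a local running median instead of the global one.

```java
public class Winsor {
    // Clip each value to [median - tau*sigma, median + tau*sigma].
    // tau is the Winsorization limit (default 2.5 in ToolSeg),
    // sigma an estimate of the noise standard deviation.
    public static double[] winsorize(double[] x, double tau, double sigma) {
        double[] sorted = x.clone();
        java.util.Arrays.sort(sorted);
        double median = sorted[sorted.length / 2];
        double lo = median - tau * sigma, hi = median + tau * sigma;
        double[] y = new double[x.length];
        for (int i = 0; i < x.length; i++) {
            y[i] = Math.min(hi, Math.max(lo, x[i])); // clip to the limits
        }
        return y;
    }
}
```

An isolated spike such as 5.0 in an otherwise near-zero signal is pulled back to the upper limit, while ordinary observations pass through unchanged.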
Binary segmentation
Now, we discuss the basic problem of obtaining individual segmentation for one chromosome arm in one sample. The aim of copy number change detection and segmentation is to divide a chromosome into a few continuous segments, within each of which the copy numbers are considered constant.
Consider first the simplest case, in which there is no copy number change on the chromosome and the result is a single segment. Let x_{1}, …, x_{n} be the copy number signals of length n on the chromosome, where each x_{i} is an observation produced by independent and identically distributed (i.i.d.) random variables drawn from a distribution with expected value \( \widehat{\mu} \) and finite variance \( {\widehat{\sigma}}^2 \).
According to the central limit theorem (CLT), as the sample size (the length of an observation sequence, j − i) increases to a sufficiently large number, the arithmetic mean of independent random variables will be approximately normally distributed with mean μ and variance σ^{2}/(j − i), regardless of the underlying distribution. Therefore, under the null hypothesis of no copy number change, the test statistic \( {\widehat{Z}}_{ij} \) asymptotically follows a standard normal distribution, N (0, 1). Copy number change segments measured by high-throughput sequencing data usually span hundreds, even tens of thousands, of probes. Therefore, the normal approximation of \( {\widehat{Z}}_{ij} \) holds with high accuracy.
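The CLT argument above amounts to standardizing the sub-segment mean. A minimal sketch, under the assumption that the statistic is the segment mean minus μ̂, scaled by √(j − i)/σ̂ (the exact equation numbers in the original are not reproduced here, so this is an illustration rather than DBS's exact formula):

```java
public class CltTest {
    // z-score of the mean of x[i..j-1] under the null hypothesis of no
    // copy number change: under H0, mean ~ N(mu, sigma^2 / (j - i)).
    public static double zStat(double[] x, int i, int j, double mu, double sigma) {
        double sum = 0;
        for (int k = i; k < j; k++) sum += x[k];
        double mean = sum / (j - i);
        // standardize: subtract the expected mean, divide by the
        // standard error sigma / sqrt(j - i)
        return (mean - mu) * Math.sqrt(j - i) / sigma;
    }
}
```

A segment whose mean equals μ̂ gives z ≈ 0, while a shifted segment gives |z| growing with the square root of its length, which is why long aberrations are detected reliably.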
We iterate over the whole segment to calculate the P-value of \( \widehat{Z} \) using the cumulative distribution function of N (0, 1). If the P-value is greater than θ, then we consider that there is no copy number change in the segment. In other words, \( \widehat{Z} \) is not far from the center of the standard normal distribution.
Then, a binary segmentation procedure is performed at breakpoint p, and we apply the above algorithm recursively to the two segments x_{1}, …, x_{p − 1} and x_{p}, …, x_{n}, p ∈ (1, n).
Multiscale scanning procedure
Up to now, the above algorithm has been able to identify the most significant breakpoints, except in the case of one short segment sandwiched between two long segments. In this case, the distance between the breakpoint at the intermediate position p and both ends is far greater than 1. Thus, \( \wp \left({\widehat{T}}_{1,p}\right)=\theta /p \) tends to 0, and \( {\widehat{T}}_{1,p} \) changes very little as p increases. The accumulated error generated by the summing process is shared equally among the points from 1 to p. As the distance to the ends increases, the change in Z_{i, j} becomes slower. Thus, spike pulses and small segments embedded in long segments are suppressed. Therefore, if ℤ_{p} is less than the estimated standard deviation \( \widehat{\sigma} \) after a one-time scan of the whole segment, we cannot arbitrarily exclude the presence of a breakpoint.
From the situation above, it is obvious that we cannot use fixed endpoints to detect breakpoints on a global scale. This method is acceptable for large jumps or changes in long segments, but to detect shorter segments we need smaller windows. For these smaller segments, scale-space scanning is used. In the DBS algorithm, a multiscale scanning stage is started in the second phase using the windowed model, if a breakpoint was not found immediately in the first phase.
Therefore, we can find the local maximum across these scales (window widths), which provides a list of (Z_{p − w, p}, Z_{p, p + w}, p, w) values and indicates that there is a potential breakpoint at p at scale w. Once ℤ_{p} is greater than the estimated standard deviation \( \widehat{\sigma} \), a new candidate breakpoint is found. The recursive procedure of the first phase is then applied to the two new segments just generated.
Analysis of time complexity in DBS
In DBS, the first phase is a binary segmentation procedure, and the time complexity of this phase is O(n ∙ log K), where K is the number of segments in the result of the first phase, and n is the length of an observation sequence to be split. Because n ≫ K, the time complexity approaches O(n). The second phase, the multiscale scanning procedure, is costly compared with a one-time scan on a global scale over the whole segment. When \( \mathcal{W} \) is a geometric sequence with a common ratio of 2, the time complexity of the second phase is O(n ∙ log n). When \( \mathcal{W} \) includes all integer numbers from 1 to n, the time complexity of the second phase degenerates to O(n^{2}). In that case, the algorithm framework of DBS is fully equivalent to the one in BACOM, which is similar to the idea used in the Circular Binary Segmentation procedure.
In the simulated dataset, it is uncommon for one short segment sandwiched between two long segments to be found in the first or first few splits of the whole segmentation process, because broader changes can be expected to be detected reasonably well. After several recursive splits, the length of each subsegment is greatly reduced, and the execution time of the second phase in DBS at each subsegment is likewise greatly reduced. However, the second phase must be triggered once before the recursive procedure ends, so the time complexity of DBS tends to approach O(n ∙ log n). Real data are more complicated, so the cost of DBS is O(n ∙ log n) in practice. Its time complexity is about the same as that of its predecessors, but DBS is faster in absolute terms, as discussed in section “Computational Performance”.
Convergence threshold: Trimmed first-order difference variance \( \widehat{\boldsymbol{\sigma}} \)
Here, the average estimated standard deviation \( \widehat{\sigma} \) on each chromosome is the key to the convergence of iterative binary segmentation, and it comes from a trimmed first-order difference variance estimator [19]. Combined with simple heuristics, this method may be used to further enhance the accuracy of \( \widehat{\sigma} \). The idea is to exclude a set of potential breakpoints identified by computationally inexpensive means. One way to identify potential breakpoints is to use high-pass filters, i.e., filters that yield high absolute values when passing over a breakpoint. The simplest such filter uses the difference ∆x_{i} = x_{i + 1} − x_{i}, 1 ≤ i < n, at each position i. We calculate all the differences and identify approximately 2% of the probe positions as potential breakpoints; in other words, the area below the 1st percentile and above the 99th percentile of all differences is treated as breakpoints. Then, we estimate the standard deviation \( {\overset{\sim }{\sigma}}^{\prime } \) of ∆x_{i} at the remaining positions. Supposing the variances of the segments on one chromosome do not differ greatly, the average standard deviation \( \widehat{\sigma} \) of each segment is \( \widehat{\sigma}={\overset{\sim }{\sigma}}^{\prime }/\sqrt{2} \).
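The estimator above can be sketched as follows. This is a minimal illustration under the assumptions stated in the text (trim the extreme γ fraction of first differences, then divide by √2 because Var(x_{i+1} − x_i) = 2σ² for i.i.d. noise); the class name `NoiseEst` is illustrative, not ToolSeg's API.

```java
public class NoiseEst {
    // Trimmed first-order difference estimate of the noise standard
    // deviation. gamma is the total fraction trimmed (e.g. 0.02 for 2%),
    // split evenly between the lowest and highest differences.
    public static double estimateSigma(double[] x, double gamma) {
        int n = x.length - 1;
        double[] d = new double[n];
        for (int i = 0; i < n; i++) d[i] = x[i + 1] - x[i]; // first differences
        java.util.Arrays.sort(d);
        int cut = (int) Math.floor(n * gamma / 2); // trim each tail
        double sum = 0, sumSq = 0;
        int m = 0;
        for (int i = cut; i < n - cut; i++) {
            sum += d[i];
            sumSq += d[i] * d[i];
            m++;
        }
        double mean = sum / m;
        double var = sumSq / m - mean * mean; // variance of retained diffs
        // Var(x_{i+1} - x_i) = 2 * sigma^2 for i.i.d. noise, hence sqrt(2)
        return Math.sqrt(Math.max(var, 0.0)) / Math.sqrt(2);
    }
}
```

A constant signal yields an estimate of exactly zero, and isolated jumps fall into the trimmed tails, so they do not inflate the noise estimate.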
Note that the current \( \widehat{\sigma} \) is only used to determine whether to continue splitting iteratively. After a whole binary segmentation procedure is completed, we obtain preliminary results and a corresponding binary tree of the test statistic ℤ_{p} generated by segmentation. Furthermore, according to the binary tree, a new fine-tuned \( {\widehat{\sigma}}^{\prime } \) is generated naturally to estimate the intra-segment variance more accurately. Finally, we select those candidate breakpoints whose ℤ_{p} is greater than the given \( {\widehat{\sigma}}^{\prime } \) as the final ‘true’ breakpoints.
Determining real breakpoints or false breakpoints
Then, we can generate a corresponding binary tree of the test statistic ℤ_{p}; see Fig. 1(b). The values of the root node and internal nodes are the ℤ_{p} of the initial array and the corresponding intermediate results, and the values of the leaf nodes are the ℤ_{p} of the segmentation results. Each node's identifier (Node ID) is its Segment ID.
If η > 0, all ℤ_{p} values of non-leaf nodes are greater than all standard deviations of leaf nodes; the breakpoints corresponding to all non-leaf nodes are then the real significant breakpoints, and the DBS algorithm ends immediately.

Quickly calculating the statistic Z_{ij}
In DBS, we use absolute errors rather than squared errors to enhance the computational speed of segmentation. With an integral array, summing over one continuous region requires only a single subtraction, so the cost of computing absolute errors is reduced to O(1). The integral array algorithm is naturally derived from the integral image used in computing 2D images [20]. This special data structure and algorithm, namely, the summed area array, makes it very quick and efficient to generate the sum of values in a continuous subset of an array.
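The summed area array can be sketched in a few lines. This is the one-dimensional analogue of the integral image: a prefix-sum array S with S_0 = 0, so that the sum of x_i, …, x_{j−1} is S_j − S_i, one subtraction regardless of the region's length (the class name `IntegralArray` is illustrative).

```java
public class IntegralArray {
    // Build the prefix-sum array: S[0] = 0, S[i] = S[i-1] + x[i-1].
    public static double[] build(double[] x) {
        double[] s = new double[x.length + 1];
        for (int i = 1; i <= x.length; i++) s[i] = s[i - 1] + x[i - 1];
        return s;
    }

    // Sum of x[i..j-1] in O(1): one subtraction, as described in the text.
    public static double rangeSum(double[] s, int i, int j) {
        return s[j] - s[i];
    }
}
```

Building S costs O(n) once per segment; every subsequent range sum, and hence every candidate-breakpoint statistic, is then constant time.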

Selecting for the distance function dist(∙)
Finally, ℤ_{1, n + 1}(p) represents an accumulated error ε_{1, p} weighted by \( {\mathfrak{D}}_p \). The test statistic ℤ_{p} physically represents the largest fluctuation of ℤ_{1, n + 1}(p) on a segment.
When k < 1, the Minkowski distance between 0 and 〈ω_{1, p}, ω_{p, n + 1}〉 tends to the smaller of the components ω_{1, p} and ω_{p, n + 1}. Furthermore, ω_{1, p} and ω_{p, n + 1} belong to the same interval [ω_{1, n + 1}, ω_{1, 2}], and as p moves, they change in opposite directions. Thus, when ω_{1, p} is equal to ω_{p, n + 1}, \( {\mathfrak{D}}_p \) reaches its maximum value, which occurs at p = n/2.
From the analysis above, when p is close to either end (e.g., p is relatively small and n − p is sufficiently large), Z_{1, p} is very susceptible to the outliers between points 1 and p. In this case, the position of such a local maximum of Z_{1, p} may be a false breakpoint, while Z_{p, n + 1} is not significant because there are sufficiently many probes between point p and the other end to suppress the noise. Here, the Minkowski distance (k < 1) is used to filter out these unbalanced situations, and breakpoints at balanced local extrema are found preferentially. Usually, the most significant breakpoint tends to be found near the middle of a segment due to \( {\mathfrak{D}}_p \), so the performance and stability of binary segmentation are increased.
In other words, at each point, ℤ_{1, n + 1}(p) tends to the smaller value of the left and right parts to suppress the noise when searching for breakpoints; ℤ_{1, n + 1}(p) tends to the larger value for measuring the significance of breakpoints.
Algorithm DBS: Deviation binary segmentation
Input: Copy numbers x_{1}, …, x_{n}; predefined significance level θ = 0.05; filtration ratio γ = 2%; safe gap λ = 0.02;
 1.
Calculate the integral array by letting S_{0} = 0 and iterating S_{i} = S_{i − 1} + x_{i} for i = 1…n;
 2. Estimate standard deviation \( \widehat{\sigma} \):
 a)
Calculate the differences iteratively for i = 1…n − 1: d_{i} = x_{i + 1} − x_{i}
 b)
Sort all d_{i} and exclude the area below the γ/2 percentile and above the 100 − γ/2 percentile of differences of d_{i}, then calculate the estimated standard deviation \( {\overset{\sim }{\sigma}}^{\prime } \) at the remaining part.
 c)
Get \( \widehat{\sigma}={\overset{\sim }{\sigma}}^{\prime }/\sqrt{2} \).
 3. Start binary segmentation with two fixed endpoints for segment x_{1}, …, x_{n}, and calculate the average \( \widehat{\mu} \) on the segment;
 a)
By Eqn (14) iterate Z_{1, p} and Z_{p, n + 1} for p = 1…n;
 b)
Search the index of potential breakpoint b_{k} at which ℤ_{p} is the maximum in the previous step, and calculate \( {\mathbb{Z}}_{b_k} \) by Eqn (19);
 c)
If \( {\mathbb{Z}}_{b_k}>\widehat{\sigma} \), store \( {\mathbb{Z}}_{b_k} \) and b_{k}, then go to Step 3 and apply binary segmentation recursively to the two subsegments \( {x}_1,\dots, {x}_{b_k-1} \) and \( {x}_{b_k},\dots, {x}_n \); otherwise, the multiscale scanning will be started: enter Step 4.
 4. Start binary segmentation with various sliding windows for segment x_{1}, …, x_{n}:
 a)
Create a width set of sliding windows by letting \( {\mathcal{W}}_0=n/2 \), and iterate \( {\mathcal{W}}_i={\mathcal{W}}_{i-1}/2 \) until \( {\mathcal{W}}_i \) is less than 2 or a given value.
 b)
Similar to the above binary segmentation, iterate Z_{p − w, p}, Z_{p, p + w} and ℤ_{1, n + 1}(p) for p = 1…n under all sliding windows \( {\mathcal{W}}_i \), then find the index of the potential breakpoint b_{k} at the maximum and calculate \( {\mathbb{Z}}_{b_k} \).
 c)
If \( {\mathbb{Z}}_{b_k}>\widehat{\sigma} \), store \( {\mathbb{Z}}_{b_k} \) and b_{k}, then go to Step 3 and recursively start binary segmentation without windows on the two segments \( {x}_1,\dots, {x}_{b_k-1} \) and \( {x}_{b_k},\dots, {x}_n \); otherwise, terminate the recursion and return.
 5.
Merge operations: calculate η and update \( {\widehat{\sigma}}^{\prime } \), and prune child nodes corresponding to candidate breakpoints to satisfy η > λ.
 6.
Sort the indices of breakpoints b_{1}, …, b_{K}, and find the segment averages:
y_{i} = average(\( {x}_{b_{i-1}},\dots, {x}_{b_i-1} \)) for i = 1…K, with b_{0} = 1.
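The recursive core of the procedure (Step 3, the first phase with fixed endpoints) can be sketched as below. This is a simplified illustration, not the full DBS algorithm: it omits the multiscale second phase and the integral array (recomputing sums naively), and it scores each candidate p by the weaker of the left and right z-scores, echoing the "smaller of the left and right parts" rule discussed earlier. All names are illustrative.

```java
import java.util.List;

public class BinarySeg {
    // Recursively split x[lo..hi-1]; a split is accepted only when the
    // score of the best candidate breakpoint exceeds sigmaHat.
    public static void segment(double[] x, int lo, int hi, double sigmaHat,
                               List<Integer> breaks) {
        if (hi - lo < 4) return; // guard: minimum segment length
        double mu = 0;
        for (int i = lo; i < hi; i++) mu += x[i];
        mu /= (hi - lo); // segment mean, the null-hypothesis level
        int best = -1;
        double bestScore = 0;
        for (int p = lo + 2; p < hi - 1; p++) {
            double left = 0, right = 0;
            for (int i = lo; i < p; i++) left += x[i];
            for (int i = p; i < hi; i++) right += x[i];
            // CLT-style z-scores of the two halves against mu
            double zl = Math.abs(left / (p - lo) - mu) * Math.sqrt(p - lo);
            double zr = Math.abs(right / (hi - p) - mu) * Math.sqrt(hi - p);
            double score = Math.min(zl, zr); // weaker side suppresses noise
            if (score > bestScore) { bestScore = score; best = p; }
        }
        if (best >= 0 && bestScore > sigmaHat) {
            breaks.add(best);
            segment(x, lo, best, sigmaHat, breaks);   // left subsegment
            segment(x, best, hi, sigmaHat, breaks);   // right subsegment
        }
    }
}
```

On a signal with a single level shift, the recursion accepts exactly one breakpoint at the shift and then terminates on both flat halves, since their scores fall below the threshold.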
In the algorithm, we use the data from each segment to estimate the standard deviation of the noise. It is well documented that copy number signals show higher variation with increasing deviation from the diploid ground state; by assuming that each segment has a single copy number state, the segment-specific estimate of the noise level makes the algorithm robust to heterogeneous noise.
DBS is a statistical approach based on observations. As with its predecessors, there is a lower limit on the length of short segments in order to meet the conditions of the CLT. In DBS, the algorithm has also been extended to allow a constraint on the minimum number of probes in a segment. It is worth noting that low-density data, such as array-CGH, is not suitable for DBS, and a lower limit on the length of a segment (twenty probes) is necessary.
Results and discussion
Constructing the binary tree of ℤ_{p} and filtering the candidate breakpoints
In DBS, the selection of parameters is not difficult. There are three parameters. The predefined significance level θ can be treated as a constant. The filtration ratio γ reflects the proportion of breakpoints in an original sequence; since breakpoints are scarce in copy number analysis, a 2% rejection rate is sufficient. The safety gap λ should be a positive number close to zero and determines the trade-off between high sensitivity and high robustness (i.e., a visually clear leap at the breakpoint). It limits the minimum significance degree ℤ_{p} of breakpoints in the results and ensures that the most significant breakpoints are not intermixed with inconspicuous ones.
The key factor is the average estimated standard deviation \( \widehat{\sigma} \) of all segments, which is predicted statistically from the differences between adjacent points. When the preliminary segmentation is completed, \( \widehat{\sigma} \) is updated in terms of the binary tree of ℤ_{p}. Therefore, the binary tree generated by DBS is the key to the judgment of breakpoints.
Segmentation of actual data sample
However, the last of the top 5 significant breakpoints was not immediately found. Here, we can see that the ℤ_{p} of Node 4 is rising instead of falling compared to that of its parent Node 2. This fact also predicts the existence of complex changes between Node 2 and its child, Node 4. The resolution capacity of the binary segmentation with two fixed endpoints in DBS is increased as the length of the split segments becomes shorter, unless the binary segmentation with various sliding windows is triggered. After splitting several times, one sharp pulse is found between Node 7 and Node 8. The processes of discovering Node 40, Node 41 and Node 56 are also similar.
In DBS, because the resolution capacity of breakpoints is continually enhanced with recursive segmentation, the binary segmentation with multiscale scanning will be performed at the leaf nodes. Thus, the segments corresponding to the leaf nodes cannot contain the breakpoints whose ℤ_{p} is greater than \( \widehat{\sigma} \). Therefore, the conditional multiscale scanning of DBS determining the tradeoff between segmentation efficiency and segmentation priority (i.e., the sooner the greater) can be accepted, although this method leads to the destruction of the tree structure. There will be no missing breakpoints; this process will merely postpone the time to find them unless \( \widehat{\sigma} \) is overestimated.
The segmentation process of Node 19 is representative: the ℤ_{p} values of its child nodes decrease monotonically. The possibility of finding breakpoints becomes smaller as the segment length shortens and more significant breakpoints are excluded. The ℤ_{p} of new nodes will eventually be less than the \( \widehat{\sigma} \) predicted previously by the CLT, and a new leaf node will be identified to terminate the recursion.
For the above examples, we argue that the proper underestimation of \( \widehat{\sigma} \) is necessary. It ensures that the leaf nodes cannot contain any real breakpoints in the initial topdown segmentation process. Simultaneously, the standard deviations of the segments corresponding to the leaf nodes also correctly reflect the actual dispersion under correct segmentation, which guides the classification by degree of significance of breakpoints through a bottomup approach. In DBS, we choose a sufficiently large filtration ratio γ to ensure this result. Thus, η would be less than 0 after the preliminary segmentation. Otherwise, there is a reason to worry about missing breakpoints, which can be observed only when the change in adjacent segments is more obvious, as shown in Fig. 2.
Test dataset
To generate a test dataset with a data structure similar to that of real cases, we chose real data samples as the ground truth reference [16]. We manually checked the plots of all chromosomes and chose several genomic regions as the reference templates for generating simulated data. These regions are representative of the diversity of copy number states typically extracted from tumor-normal sample pairs by classical segmentation algorithms and contain no structural variations. In addition, the data included in each template follow a single Gaussian distribution, and there are four different templates corresponding to copy numbers ranging from 1 to 4. Using these templates, we generate a test dataset with known breakpoint positions and a given average copy number for each segment.
The entire test dataset consists of 104 test sequences and a total of 876 test segments. The length of each test sequence is between about 10^{3} and 10^{5}. During timing, very short sequences are too vulnerable to external random interference generated by other programs running in the operating system. Therefore, we only use sequences of length greater than 10,000 in the performance analysis.
Test method
We evaluate the performance of these algorithms by calculating the precision of the segmentation results. Following the test methods proposed in earlier work [12, 21], a classifier was constructed to test and compare the sensitivity and specificity of segmentation algorithms, as used for the detection of recurring alterations in Multiplex Ligation-dependent Probe Amplification (MLPA) analysis [12].
If Tag(p) is true, we consider that position p is in a normal region and call it positive. Otherwise, it is in an aberrant region and is called negative. Using the given breakpoints and the average copy number of the segments in the test dataset, the gold standard was obtained by uniform sampling near the given breakpoints. In the gold standard, there were 1424 positive and 4752 negative values used for the comparison.
Thus, the test mechanism that was independent of the structure of the algorithm was established for segmentation accuracy. If the prediction from the classifier using the outcomes of a segmentation algorithm is positive and the actual value in the gold standard is also positive (normal loci), then it is called a true positive (TP); however, if the actual value is negative (aberrant loci), then it is said to be a false positive (FP). Conversely, a true negative (TN) has occurred when both the prediction outcome and the actual value are negative, and a false negative (FN) is when the prediction outcome is negative, while the actual value is positive.
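From the TP/FP/TN/FN counts defined above, sensitivity and specificity follow directly. A minimal sketch (the class name `ConfusionStats` is illustrative):

```java
public class ConfusionStats {
    // Sensitivity (true positive rate): TP / (TP + FN).
    public static double sensitivity(int tp, int fn) {
        return tp / (double) (tp + fn);
    }

    // Specificity (true negative rate): TN / (TN + FP).
    public static double specificity(int tn, int fp) {
        return tn / (double) (tn + fp);
    }
}
```

Sweeping the classifier's threshold and plotting sensitivity against 1 − specificity yields the ROC curves whose areas (AUC) are compared in Table 1.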
Segmentation accuracy
Panel (c) shows the effect of different combinations of window sizes. Curve W1 is the result using window sizes generated by an arithmetic progression with common difference 1, which corresponds to using an exhaustive set of window sizes. Curves W2, W4 and W8 correspond to window sizes forming geometric sequences with common ratios of 2, 4 and 8, respectively. We can see that different combinations have little effect on the segmentation result.
Panel (d) shows calls made on the basis of the segmentations found by DBS, PCF, Circular Binary Segmentation (CBS) [3] and the segmentation method in BACOM [15, 16] on raw data. In comparative studies of segmentation accuracy, CBS is the most commonly used algorithm and has good performance in terms of sensitivity and false discovery rate. PCF is a relatively new copy number segmentation algorithm based on least squares principles combined with a suitable penalization scheme. Here, recent versions of the original R implementations have been used: DNAcopy v1.52.0 (CBS) and copynumber v1.18 (PCF). The predecessor of DBS is the segmentation method in BACOM, which replaces the permutation-test-based decision process of CBS with a decision process based on the Central Limit Theorem (CLT). The differences between DBS and the method in BACOM are mainly the following. First, the criterion of segmentation is Eqn (6) in BACOM, whereas Eqn (9) is the criterion in DBS. Second, the algorithm structure of the method in BACOM contains a complete double circulation that recursively divides a segment into three subsegments, while DBS contains only a single circulation with recursive binary splits. Finally, the test statistics of the method in BACOM and of the first phase in DBS are calculated point by point, which is equivalent to a scan using window sizes in an arithmetic progression with common difference 1, whereas the sliding windows of the second phase in DBS form a geometric sequence with a common ratio of 2.
Segmentation and Merging effects
Table 1 Segmentation and merging effects

Method  AUC  Segment Count  Over-segmentation Ratio
In BACOM  0.9256  845  0.965
CBS  0.9373  1171  1.34
PCF  0.9279  4906  5.60
DBS  0.9452  967  1.104
We found that the number of segments in the segmentation results differs between methods. The entire test dataset consists of 104 test sequences and a total of 876 test segments. The method in BACOM obtained only 845 segments, indicating under-segmentation. PCF shows more serious over-segmentation with its default parameter (which is too small). The results of CBS and DBS are both over-segmented, but the result of DBS is slightly better than that of CBS. Here we define the over-segmentation ratio as the ratio of the segment count generated by each algorithm to the actual number of test segments. As shown in Table 1, the ratio of DBS is the smallest among all over-segmented results, which also shows the merit of the merge operation (pruning false breakpoints) in DBS.
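The over-segmentation ratio defined above is a single division, sketched here to make the Table 1 figures reproducible (the class name `OverSeg` is illustrative):

```java
public class OverSeg {
    // Over-segmentation ratio: segments found / true segment count.
    // A value below 1 indicates under-segmentation, above 1 over-segmentation.
    public static double ratio(int found, int actual) {
        return (double) found / actual;
    }
}
```

For example, with 876 true segments, BACOM's 845 found segments give a ratio just below 1 (under-segmentation), while CBS's 1171 give about 1.34.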
Computational performance
Table 2 Computational performance

Method      Time (s)                                                  Trend Line
            Test 1           Test 2           Affymetrix SNP 6.0      Slope    R²
            (26 K, 1 Chr.)   (160 K, 1 Chr.)  (868 K, 22 Chr.)
In BACOM     0.592           18.769           34.818                  1.999    0.988
CBS          1.442            6.249           80.551                  0.984    0.963
PCF          0.393            2.638           10.256                  0.968    0.979
DBS          0.093            0.524            2.167                  0.968    0.986
Further, the time complexity of the first phase of DBS is O(n ∙ log K), where K is the number of segments in the result of the first phase and n is the length of the observation sequence to be split. Because n ≫ K, this approaches O(n). The time complexity of the second phase is O(n ∙ log n), because \( \mathcal{W} \) is by default a geometric sequence with common ratio 2. Although the time complexity of DBS is therefore a mixture of O(n) and O(n ∙ log n) in theory, the actual speed depends on how often binary segmentation is triggered by the various sliding windows, so the running time varies considerably between samples of the same length. As a result, the empirical scaling of DBS in the benchmark is essentially the same as that of the other two algorithms (CBS and PCF); DBS is simply faster by a constant factor. In Fig. 5, the slopes of the lines corresponding to the three algorithms are nearly identical, but the line for DBS lies at the bottom. In conclusion, DBS and the other methods typically produce similar results with equivalent accuracy; however, DBS enjoys a significant advantage in computational performance.
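Near-linear empirical scaling can be checked from the timings in Table 2. The paper does not spell out how the trend-line columns were fitted, so the sketch below assumes an ordinary least-squares fit on log-log axes (class and method names are illustrative); the slope obtained this way need not match the table exactly, but for DBS it comes out close to 1, consistent with near-linear growth.

```java
public class TrendFit {
    // Ordinary least-squares fit of ln(time) against ln(markers).
    // A slope near 1 indicates near-linear empirical scaling.
    // Returns {slope, rSquared}.
    static double[] loglogFit(double[] markers, double[] seconds) {
        int k = markers.length;
        double[] x = new double[k], y = new double[k];
        double mx = 0, my = 0;
        for (int i = 0; i < k; i++) {
            x[i] = Math.log(markers[i]);
            y[i] = Math.log(seconds[i]);
            mx += x[i] / k;
            my += y[i] / k;
        }
        double sxy = 0, sxx = 0, syy = 0;
        for (int i = 0; i < k; i++) {
            sxy += (x[i] - mx) * (y[i] - my);
            sxx += (x[i] - mx) * (x[i] - mx);
            syy += (y[i] - my) * (y[i] - my);
        }
        return new double[] { sxy / sxx, (sxy * sxy) / (sxx * syy) };
    }

    public static void main(String[] args) {
        // DBS timings from Table 2 at 26 K, 160 K, and 868 K markers.
        double[] fit = loglogFit(new double[] { 26_000, 160_000, 868_000 },
                                 new double[] { 0.093, 0.524, 2.167 });
        System.out.printf("slope=%.2f R2=%.2f%n", fit[0], fit[1]);
    }
}
```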
Conclusions
We have developed a variant of binary segmentation, which we call Deviation Binary Segmentation (DBS), based on least absolute error (LAE) principles combined with CLT-based heuristics, for identifying genomic alterations in array copy number experiments. We have also introduced a suite of platform-independent evaluation mechanisms that use the MLPA binary classifier as the gold standard. DBS was applied to a test dataset with a data structure similar to a real case and performed comparably to other leading segmentation methods in terms of sensitivity and specificity. In addition, the proposed algorithm reports the significance of each breakpoint in the results and finds breakpoint locations by searching for the extremum of the test statistic. Furthermore, DBS benefits from an algorithm structure with computational complexity O(n ∙ log n), and the heuristics using the trimmed first-order difference variance for locating potential breakpoints give a further marked reduction in computation time. The proposed algorithm is easy to generalize and is computationally very efficient on high-resolution data. The Java application offers a user-friendly GUI for the proposed algorithms and is freely available at https://gitee.com/w3STeam/ToolSeg.
Declarations
Acknowledgements
This research was supported by research grant 81300042 (to JY) from the Natural Science Foundation of China and grant 2014(41) (to JY) from the "Training Project for Young and Middle-aged Medical Talents" of the Health and Family Planning Commission of Wuhan City of China.
Funding
Not applicable.
Availability of data and materials
Project name: ToolSeg.
Project home page: https://gitee.com/w3STeam/ToolSeg
Download: https://gitee.com/w3STeam/ToolSeg/attach_files
Operating system(s): Any system supporting Java.
Programming language: Java.
Other requirements: No.
License: Apache License 2.0.
Authors’ contributions
The study was initiated by JR, YW and GQY. JR and ZL drafted the manuscript. The software was written by JR with contributions from ZL based on algorithms developed by ZL and MS. YW, GQY and JQY contributed with examples and in discussions of the manuscript and software. All authors have read, commented on and accepted the final manuscript.
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
Authors’ Affiliations
References
 1. Pollack JR, Sorlie T, Perou CM, Rees CA, Jeffrey SS, Lonning PE, Tibshirani R, Botstein D, Borresen-Dale AL, Brown PO. Microarray analysis reveals a major direct role of DNA copy number alteration in the transcriptional program of human breast tumors. Proc Natl Acad Sci U S A. 2002;99(20):12963–8.
 2. Beroukhim R, Mermel CH, Porter D, Wei G, Raychaudhuri S, Donovan J, Barretina J, Boehm JS, Dobson J, Urashima M. The landscape of somatic copy-number alteration across human cancers. Nature. 2010;463(7283):899–905.
 3. Olshen AB, Venkatraman ES, Lucito R, Wigler M. Circular binary segmentation for the analysis of array-based DNA copy number data. Biostatistics. 2004;5(4):557–72.
 4. Venkatraman ES, Olshen AB. A faster circular binary segmentation algorithm for the analysis of array CGH data. Bioinformatics. 2007;23(6):657–63.
 5. Fridlyand J, Snijders AM, Pinkel D, Albertson DG, Jain AN. Hidden Markov models approach to the analysis of array CGH data. J Multivar Anal. 2004;90(1):132–53.
 6. Chen H, Xing H, Zhang NR. Estimation of parent specific DNA copy number in tumors using high-density genotyping arrays. PLoS Comput Biol. 2011;7(1):e1001060.
 7. Greenman CD, Bignell G, Butler A, Edkins S, Hinton J, Beare D, Swamy S, Santarius T, Chen L, Widaa S. PICNIC: an algorithm to predict absolute allelic copy number variation with microarray cancer data. Biostatistics. 2010;11(1):164.
 8. Sun W, Wright FA, Tang Z, Nordgard SH, Van LP, Yu T, Kristensen VN, Perou CM. Integrated study of copy number states and genotype calls using high-density SNP arrays. Nucleic Acids Res. 2009;37(16):5365–77.
 9. Harchaoui Z, Lévy-Leduc C. Catching change-points with lasso. Adv Neural Inf Proces Syst. 2007;22:617–24.
 10. Harchaoui Z, Lévy-Leduc C. Multiple change-point estimation with a total variation penalty. J Am Stat Assoc. 2010;105(492):1480–93.
 11. Tibshirani R, Saunders M, Rosset S, Zhu J, Knight K. Sparsity and smoothness via the fused lasso. J R Stat Soc. 2005;67(1):91–108.
 12. Nilsen G, Liestol K, Van Loo P, Moen Vollan HK, Eide MB, Rueda OM, Chin SF, Russell R, Baumbusch LO, Caldas C, et al. Copynumber: efficient algorithms for single- and multi-track copy number segmentation. BMC Genomics. 2012;13:591.
 13. Rigaill G. A pruned dynamic programming algorithm to recover the best segmentations with 1 to Kmax change-points. Journal de la Société Française de Statistique. 2015;156(4):180–205.
 14. Rigaill G. Pruned dynamic programming for optimal multiple change-point detection. 2010. arXiv preprint arXiv:1004.0887.
 15. Yu GQ, Zhang B, Bova GS, Xu JF, Shih IM, Wang Y. BACOM: in silico detection of genomic deletion types and correction of normal cell contamination in copy number data. Bioinformatics. 2011;27(11):1473–80.
 16. Fu Y, Yu G, Levine DA, Wang N, Shih Ie M, Zhang Z, Clarke R, Wang Y. BACOM2.0 facilitates absolute normalization and quantification of somatic copy number alterations in heterogeneous tumor. Sci Rep. 2015;5:13955.
 17. Huang T, Yang G, Tang G. A fast two-dimensional median filtering algorithm. IEEE Trans Acoust Speech Signal Process. 1979;27(1):13–8.
 18. Hupe P, Stransky N, Thiery JP, Radvanyi F, Barillot E. Analysis of array CGH data: from signal ratio to gain and loss of DNA regions. Bioinformatics. 2004;20(18):3413–22.
 19. Neumann JV, Kent RH, Bellinson HR, Hart BI. The mean square successive difference. Ann Math Stat. 1941;12(2):153–62.
 20. Viola P, Jones M. Rapid object detection using a boosted cascade of simple features. In: Proceedings of the 2001 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR 2001); 2001. p. I-511–8.
 21. Pierre-Jean M, Rigaill G, Neuvial P. Performance evaluation of DNA copy number segmentation methods. Brief Bioinform. 2015;16(4):600–15.