Problem formulation
We adopted a formulation of the segmentation problem from previous methods [7]. The goal was to find segments whose statistical significance, measured by p-values under a given probability distribution assumption, exceeded a predefined level. Priority was given to segments with higher significance: the segment with the highest significance was identified first, followed by the one with the second highest significance, and so on. We implemented the method using Gaussian-based tests (i.e., the t-test and z-test), as used in other existing methods [3, 7, 9]. Our method achieved satisfactory performance for both microarray and next-generation sequencing data without modifying the hypothesis test. A more common formulation of the change-point problem is given in [41].
Consider a sample consisting of N measurements along the genome in sequential order, \( X_1, X_2, \ldots, X_N \), and
$$ {X}_k\sim N\left({\mu}_0,{\sigma}^2\right),\forall k\in L $$
$$ {X}_k\sim N\left({\mu}_i,{\sigma}^2\right),\forall k\notin L $$
for some set of locations L (i.e., the background, regions with no changes, etc.). The common assumption is that there are M non-overlapping segments with means \( \mu_1, \mu_2, \ldots, \mu_i, \ldots, \mu_M \), where \( \mu_i \ne \mu_0 \), and the union of these segments forms the complement of the set L. If the background level, \( \mu_0 \), is non-zero, the null hypothesis can be set to the corresponding non-zero mean. Under this model, multiple segments with means different from \( \mu_0 \) may be adjacent to each other. In addition, all measurements are assumed to be independent. This assumption has been employed in many existing methods [9, 23]. A summary of existing methods that use such an i.i.d. assumption, and its properties, is given in [42]. The goal of a segmentation method is to detect all M segments with means different from \( \mu_0 \).
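As a concrete illustration of this model, the following minimal sketch simulates such data in Python; the number of points and the lengths, locations, and means of the planted segments are arbitrary choices for the example, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Background: X_k ~ N(mu_0, sigma^2) with mu_0 = 0, sigma = 1 (k in L).
N = 1000
x = rng.normal(loc=0.0, scale=1.0, size=N)

# M = 2 planted segments with shifted means mu_i != mu_0 (k not in L).
# Tuples are (start, end, mu_i) with end exclusive; values are illustrative.
segments = [(200, 250, 1.5), (600, 640, -2.0)]
for start, end, mu in segments:
    x[start:end] = rng.normal(loc=mu, scale=1.0, size=end - start)
```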
To illustrate, Fig. 2a shows segments generated from Normal distributions with non-zero means, with the rest of the data generated from a standard Normal distribution. Two computational challenges are associated with this approach, and they also manifest in many previous methods. First, the number of segments to be examined is very large. Second, overlaps among significant segments need to be detected so that the significance of the overlapping segments can be adjusted accordingly. To address the first challenge, we applied dynamic programming combined with exponentially increasing segment scales to speed up the scanning of a long sequence of data points. To address the second challenge, we designed an algorithm that couples two balanced binary trees to quickly detect overlaps and update the list of the most significant segments. Segment refinement and merging allow iSeg to detect segments of arbitrary length.
Computing p-values using dynamic programming
iSeg scans a large number of segments, starting with a minimum window length, \( W_{\mathrm{min}} \), and going up to a maximum window length, \( W_{\mathrm{max}} \). The minimum and maximum window lengths have default values of 1 and 300, respectively. The window length increases by a fixed multiplicative factor, called the power factor (ρ), with every iteration. For example, the shortest window length is \( W_{\mathrm{min}} \), and the next shortest window length is \( \rho W_{\mathrm{min}} \). The default value of ρ is 1.1. When scanning with a particular window length, W, we use overlapping windows spaced W/5 apart; when W is not a multiple of 5, the spacing is rounded up (ceiling). These parameters can be changed by a user. We found that the default parameters work robustly for all the datasets we have worked with. The algorithm computes p-values for candidate segments and detects the set of non-overlapping segments that is most significant among all possible segments.
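A minimal sketch of this scanning schedule, using the default parameter values from the text (the function names are ours):

```python
import math

def window_lengths(w_min=1, w_max=300, rho=1.1):
    """Geometric schedule of window lengths from w_min up to w_max,
    growing by the power factor rho each iteration."""
    w = float(w_min)
    lengths = []
    while w <= w_max:
        lengths.append(math.ceil(w))
        w *= rho
    return sorted(set(lengths))  # dedupe lengths that round to the same integer

def window_starts(n, w):
    """Start positions of overlapping windows of length w over n points,
    spaced ceil(w/5) apart."""
    step = math.ceil(w / 5)
    return range(0, n - w + 1, step)
```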
Given the normality assumption, a standard test for the mean is the one-sample Student's t-test, which is commonly used in many existing methods. The test statistic for this test is
$$ t=\frac{\overline{x}\sqrt{n}}{s} $$
where \( \overline{x} \) is the sample mean, s is the sample standard deviation, and n is the sample size. A drawback of this statistic is that it cannot evaluate segments of length 1, which may be why some previous methods are not good at detecting such segments. Although a separate test statistic could be derived for segments of length 1, the two statistics may not be consistent. To resolve this issue, we first estimate the standard deviation of the data using the median absolute deviation (MAD) and then treat it as known. Specifically, for a series of data points \( x_1, x_2, \ldots, x_n \), the MAD is defined as the median of the absolute deviations from the data's median:
$$ \mathrm{MAD}=\mathrm{median}\left(\left|{x}_i-\mathrm{median}(x)\right|\right). $$
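A minimal sketch of this estimator and the resulting test (our function names; the 1.4826 normal-consistency factor is our assumption, as the text does not state the scaling it uses):

```python
import math
from statistics import median

def mad_sigma(x):
    """Robust sigma estimate from the median absolute deviation.
    The 1.4826 factor makes MAD consistent for Normal data (our assumption)."""
    m = median(x)
    return 1.4826 * median(abs(v - m) for v in x)

def z_pvalue(segment, sigma):
    """Two-sided p-value under H0: mu = mu_0 = 0, with sigma treated as known.
    Unlike the t-test, this is well defined even for segments of length 1."""
    n = len(segment)
    z = (sum(segment) / n) * math.sqrt(n) / sigma
    return math.erfc(abs(z) / math.sqrt(2.0))  # equals 2 * (1 - Phi(|z|))
```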
The MAD is a robust summary statistic of the variability of the data, and using it allows the z-statistic to replace the t-statistic, so that the significance of single points can be evaluated under the same model assumption as longer segments. To calculate sample means for all segments considered for significance, the number of operations required by a brute-force approach is \( C_b \):
$$ {C}_b=\sum \limits_{i=0}^k\left(N-{\rho}^i{W}_{\mathrm{min}}\right){\rho}^i{W}_{\mathrm{min}} $$
where \( {\rho}^k{W}_{\mathrm{min}}\le {W}_{\mathrm{max}} \) and \( {\rho}^{k+1}{W}_{\mathrm{min}}>{W}_{\mathrm{max}} \).
These quantities (means and standard deviations) can be computed more efficiently for larger segments by reusing the sums computed for shorter segments. For example, the running sum of a shorter segment of length m is given by
$$ {S}_m=\sum \limits_{i=1}^m{X}_i. $$
If this sum is retained, the running sum of a longer segment of length r (r > m) in the next iteration can be obtained as,
$$ {S}_r={S}_m+\sum \limits_{i=m+1}^r{X}_i, $$
and the means for all segments can be computed using these running sums. The total number of operations, \( C_b^{\ast } \), is now
$$ {C}_b^{\ast }=N+\sum \limits_{i=0}^k\left(N-{\rho}^i{W}_{\mathrm{min}}\right), $$
which in practice is much smaller than the number of operations \( C_b \) required without dynamic programming. Computation of standard deviations is sped up using a similar process.
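A sketch of this bookkeeping (our function names; prefix sums are equivalent to retaining the running sums described above):

```python
import math
import numpy as np

def all_segment_means(x):
    """O(1) mean for any segment after one O(N) pass: prefix[j] - prefix[i]
    is the running sum of x[i:j] (end exclusive)."""
    prefix = np.concatenate(([0.0], np.cumsum(x)))
    return lambda start, end: (prefix[end] - prefix[start]) / (end - start)

def op_counts(N=100_000, w_min=1, w_max=300, rho=1.1):
    """Compare C_b (brute force) with C_b* (running sums), following the
    two formulas above; window lengths are integer-rounded for the sketch."""
    c_b, c_b_star, w = 0, N, float(w_min)   # C_b* starts at N for the prefix pass
    while w <= w_max:                        # i.e., rho**k * w_min <= w_max
        W = math.ceil(w)
        c_b += (N - W) * W
        c_b_star += N - W
        w *= rho
    return c_b, c_b_star
```

Under the default parameters this makes \( C_b \) larger than \( C_b^{\ast } \) by well over an order of magnitude.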
Detecting overlapping segments and updating significant segments using coupled balanced binary trees
When the p-values of all segments have been computed, we rank the segments by p-value from smallest to largest. All segments with p-values smaller than a threshold value, \( p_s \), are kept in a balanced binary tree (BBT1). The default value of \( p_s \) is 0.001: assuming a family-wise significance level (α) of 0.1 across 100 simultaneous tests, the Bonferroni and Šidák corrections give a per-test cutoff of approximately 0.001, so this default is an acceptable upper bound for multiple testing. It can be changed by a user if necessary. The procedure for overlapping-segment detection is described below as pseudo-code. BBT1 stores all significant segments passing the initial significance cutoff (default 0.001). The second balanced binary tree (BBT2) stores the boundaries of the significant segments. After the procedure, SS contains all detected significant segments. Selecting segments using balanced binary trees ensures that segments with small p-values are kept, while overlapping segments with larger p-values are removed.
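In the spirit of that pseudo-code, the sketch below implements the same greedy selection. Python's standard library has no balanced binary tree, so we emulate BBT2 with a flat sorted list of accepted boundaries and bisect; the asymptotics differ from the paper's coupled trees (list insertion is O(n)), but the selection logic is the same.

```python
import bisect

def select_nonoverlapping(candidates):
    """Greedy selection of non-overlapping segments by significance.

    candidates: list of (pvalue, start, end), end exclusive, already filtered
    at the initial cutoff p_s (i.e., the contents of BBT1).
    Returns SS, the selected significant segments.
    """
    ss = []
    boundaries = []  # sorted flat list [s1, e1, s2, e2, ...] of accepted segments
    for p, start, end in sorted(candidates):          # smallest p-value first
        i = bisect.bisect_right(boundaries, start)
        j = bisect.bisect_left(boundaries, end)
        # Disjoint from every accepted segment only if start and end fall in
        # the same gap between accepted intervals (i == j) and that position
        # is outside any interval (even index).
        if i == j and i % 2 == 0:
            bisect.insort(boundaries, start)
            bisect.insort(boundaries, end)
            ss.append((start, end, p))
    return ss
```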
Refinement of significant segments
The significant segments are further refined by expansion and shrinkage. Without loss of generality, the procedure (see the SegmentExpansion text box) describes expansion on the left side of a segment only; expansion on the right side and shrinkage are done similarly. During expansion and shrinkage, a check for overlapping segments is applied so that the algorithm yields only disjoint segments.
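A sketch of one refinement direction, reusing z_pvalue from the earlier sketch (the `occupied` set of positions already claimed by other segments is our representation of the overlap check, not the paper's exact data structure):

```python
def expand_left(x, start, end, sigma, occupied, step=1):
    """Move a segment's left boundary outward while the p-value improves and
    the expansion does not run into another significant segment. Right
    expansion and shrinkage mirror this procedure."""
    best_p = z_pvalue(x[start:end], sigma)
    while start - step >= 0 and not any(
        k in occupied for k in range(start - step, start)
    ):
        p = z_pvalue(x[start - step:end], sigma)
        if p >= best_p:      # no improvement: stop expanding
            break
        start, best_p = start - step, p
    return start, best_p
```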
Merging adjacent significant segments
Once all significant non-overlapping segments have been detected and refined in the previous steps, iSeg performs a final step that merges adjacent segments (those with no other significant segment between them). The procedure is straightforward: we check each pair of adjacent segments, and if the merged segment, whose range is defined by the left boundary of the first segment and the right boundary of the second, has a p-value smaller than those of the individual segments, we merge the two. The new segment is then tested for merging with its adjacent segments, and the procedure continues iteratively until no segments can be merged. With refinement and merging, iSeg can detect segments of arbitrary length, both long and short. We added an option to merge only segments whose distance is no more than a given threshold, where distance is measured as the difference between the starting position of the second segment and the ending position of the first.
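A sketch of this merging step, again reusing z_pvalue (a simple quadratic restart loop; iSeg's actual implementation may organize the iteration differently):

```python
def merge_adjacent(x, segments, sigma, max_gap=None):
    """Iteratively merge adjacent segments when the merged segment's p-value
    beats both individual p-values.

    segments: position-sorted list of (start, end, p), end exclusive.
    max_gap: optional distance threshold between adjacent segments.
    """
    merged = True
    while merged:
        merged = False
        for i in range(len(segments) - 1):
            (s1, e1, p1), (s2, e2, p2) = segments[i], segments[i + 1]
            if max_gap is not None and s2 - e1 > max_gap:
                continue
            p = z_pvalue(x[s1:e2], sigma)   # span from left of first to right of second
            if p < p1 and p < p2:
                segments[i:i + 2] = [(s1, e2, p)]
                merged = True
                break                        # restart scan over the updated list
    return segments
```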
Multiple comparisons
In iSeg, p-values are calculated for potentially significant segments. Using a common p-value cutoff, for example 0.05, to determine significant segments can produce a large number of false positives due to multiple comparisons. To cope with this issue, which can be serious when the sequence of measurements is long, we use false discovery rate (FDR) control. Specifically, we employ the Benjamini-Hochberg (B-H) procedure [43] to obtain a cutoff value for a predefined false discovery rate (α), which has a default value of 0.01 and can be set by a user. Other types of cutoffs can also be used to select significant segments, such as a fixed number of most significant segments.
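For reference, a standard implementation of the B-H cutoff (not iSeg's exact code):

```python
def bh_cutoff(pvalues, alpha=0.01):
    """Benjamini-Hochberg: the largest p_(i) satisfying p_(i) <= (i/m) * alpha.
    Segments with p-values at or below the returned cutoff are called
    significant; returns 0.0 if nothing passes."""
    m = len(pvalues)
    cutoff = 0.0
    for i, p in enumerate(sorted(pvalues), start=1):
        if p <= i * alpha / m:
            cutoff = p
    return cutoff
```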
Biological cutoff
Often in practice, biologists prefer to call signals above a certain effect-size threshold. For example, in gene expression analysis, a minimum two-fold change may be required to call differentially expressed genes. Here we add a parameter, bc, which can be tuned by a user to allow more flexible and accurate calling of significant segments. The default output gives four bc cutoffs: 1.0, 1.5, 2.0 and 3.0. A biological cutoff of 1.0 means that the height of a segment must be greater than 1.0 times the standard deviation of the data for it to be called significant, regardless of the segment's length. The biological cutoff parameter allows users to select the segments most likely to be biologically significant based on their knowledge of the problem under study.
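A sketch of this filter; we take a segment's height to be the absolute value of its mean, which is one plausible reading of the text:

```python
def apply_bio_cutoff(x, segments, sigma, bc=1.0):
    """Keep only segments whose height exceeds bc * sigma, regardless of
    length. The defaults reported are bc in {1.0, 1.5, 2.0, 3.0}; height
    as |segment mean| is our assumption."""
    kept = []
    for start, end, p in segments:
        height = abs(sum(x[start:end]) / (end - start))
        if height > bc * sigma:
            kept.append((start, end, p))
    return kept
```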
In Fig. 1, we provide a schematic illustration of the iSeg workflow.
Processing of the raw NGS data from maize
Raw fastq files were clipped of 3′ Illumina adapters with cutadapt 1.9.1. Reads were aligned to B73 AGPv3 [44] with bowtie2 v2.2.8 [45], alignments with quality < 20 were removed, and fragment intervals were generated with bedtools (v2.25) bamtobed [46]. Fragments were optionally subset by size. Read counts in 20-bp non-overlapping windows across the entire genome were calculated with bedtools genomecov and normalized for sequencing depth (to fragments per million, FPM). Difference profiles were calculated by subtracting heavy FPM from light FPM. Quantile normalization was performed using the average score distributions within a given combination of digestion level (or difference) and fragment size class.
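The outline below sketches this pipeline as shell commands driven from Python. File names and the adapter sequence are placeholders; the MAPQ filter via samtools, single-end alignment, and the final genomecov invocation are our assumptions (the text does not name the filtering tool, and the 20-bp windowing and FPM/quantile normalization steps are omitted here).

```python
import subprocess

# Hedged sketch of the preprocessing steps; placeholders, not the study's
# actual file names or flags.
steps = [
    "cutadapt -a ADAPTER_SEQ -o trimmed.fastq raw.fastq",        # clip 3' adapters
    "bowtie2 -x B73_AGPv3 -U trimmed.fastq -S aligned.sam",      # align to B73 AGPv3
    "samtools view -b -q 20 aligned.sam -o filtered.bam",        # drop MAPQ < 20 (assumed tool)
    "bedtools bamtobed -i filtered.bam > fragments.bed",         # fragment intervals
    "bedtools genomecov -i fragments.bed -g chrom.sizes -bga > coverage.bedgraph",
]
for cmd in steps:
    subprocess.run(cmd, shell=True, check=True)
```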