Pipeline
Segmentum was developed and written in the Python programming language (version 3) and requires the SciPy library to be installed (If the user wishes to use the ‘plot’ sub-command to inform parameter value selection, matplotlib library is also required). Segmentum employs SAMtools to extract RD and heterozygous SNPs BAF data from BAM files containing WGS data. These constitute the inputs required by Segmentum to perform copy number analysis. Figure 1 illustrates the Segmentum pipeline. Each step is explained in more detail in the following sections.
RD extraction and BAF calculation for heterozygous SNPs
To extract the RD from the BAM files, the genome is divided into bins of user-defined length (2 kbp by default) and the number of reads overlapping each bin is counted to determine the RD at each bin. To calculate the BAF values, heterozygous SNPs in the normal sample are identified at known SNP sites in the human genome (based on SNP annotations such as those produced by the 1000 Genomes project). Next, the number of reference and alternative alleles at each heterozygous SNP position is extracted from the tumor sample and the BAF for the i
th heterozygous SNP is calculated using the following equation:
$$ BA{F}_i = \frac{al{ t}_i}{al{ t}_i + re{f}_i} $$
where alt
i
and ref
i
refer to the alternative and reference allele, respectively, of the i
th heterozygous SNP.
It should be noted that by default reads with mapping quality score 10 are filtered out before RD extraction and BAF calculation in order to address the challenges raised by reads not mapping to a unique region in the genome (the read filtration criterion based on the mapping quality score is a parameter to Segmentum and can be set by the user).
Log-ratio calculation
The RD log-ratio is calculated using the following equation:
$$ log{r}_i = l o{g}_2\left(\ \frac{tR{ D}_i}{nR{ D}_i}\ \right) $$
where logr
i
is the log-ratio of the i
th genomic window and tRD
i
and nRD
i
are RDs extracted from the i
th genomic window of a specific size (determined by user; default is 2 kbp) for the tumor and normal samples, respectively.
Differences in the total number of aligned reads in the normal and tumor samples may bias the estimation of the RD log-ratios. The correction was performed by finding the mode of log-ratio values for each chromosome and subtracting the median of all of the modes from each log-ratio value. It should be noted that median, in the correction step, is robust to the changes in one mode. For instance, one chromosomal arm having a copy number change has no effect on the correction since it only affects one of the chromosomal modes.
Mirroring and smoothing of the BAF values
The BAF of a heterozygous SNP has an expected value of 0.5 in normal diploid cells. In the presence of somatic copy number alterations, the BAF can diverge from 0.5 if the relative abundance of the two alleles changes. To make smoothing and segmentation of BAF data possible, the BAF values must be mirrored about the 0.5 axis so that the B allele fraction always represents the allele fraction of the dominant allele. Without this mirroring step, the BAF values will be symmetric about the BAF = 0.5 axis and smoothing will underestimate the absolute divergence from 0.5 [14]. In this study, a median filter is used for smoothing the BAF data. Simultaneous mirroring and smoothing is implemented using the following equation:
$$ c B A{F}_i = H\ast \left|0.5-{M}_9\left( BA{F}_i\right)\right|+\left(1- H\right)\ast {M}_9\left(\left|0.5- BA{F}_i\right|\right). $$
where BAF
i
is the BAF value for the i
th heterozygous SNP, cBAF
i
is the simultaneously mirrored and smoothed BAF
i
, H is a heterozygosity measurement calculated with the following equation: H = 1 − 2 ∗ |0.5 − x|, and M
9 refers to applying a median filter to 9 SNPs in the vicinity of and including the i
th SNP.
Segmentation using a double sliding window approach
To detect changes in the RD log-ratio and BAF signals, two non-overlapping, fixed-sized windows (determined by the user) are slid over the RD log-ratio and BAF values and a compound score (S) is calculated for each of the adjacent two windows. If the compound score is greater than 1, a change is detected and a breakpoint is placed at the place where the two windows touch each other. The compound score is calculated using the following equation:
$$ S = \frac{\left|\ \overline{log{ r}_{wi{ n}_i}} - \overline{log{ r}_{wi{ n}_{i+1}}}\ \right|{}^2}{\tau_{log r}}+\frac{\left|\overline{cBA{ F}_{wi{ n}_i}} - \overline{cBA{ F}_{wi{ n}_{i+1}}}\ \right|{}^2}{\tau_{BAF}} $$
where \( \overline{log{ r}_{wi{ n}_i}} \) is the mean of the RD log-ratio values in the i
th window, \( \overline{cBA{ F}_{wi{ n}_i}} \) is the mean of the mirrored and smoothed BAF values in the i
th window, τ
logr
and τ
BAF
are thresholds for the absolute mean difference in the RD log-ratios and the absolute mean difference in the BAF values in the two adjacent windows, respectively.
It is possible that some breakpoints will not be detected by a single pass of a double sliding window due to a given window size. Thus, to increase the sensitivity, Segmentum analyzes the signals for the detection of breakpoints multiple times with different window-sizes and thresholds. Each new window is 1.5 times larger than the previous one. The increase in the window size decreases the detection thresholds. This is due to the fact that increasing the window size increases the sample size (assuming sampling from normal distribution with N(μ, σ
2)) and consequently decreases the standard deviation of the mean (mean having probability distribution of \( N\left(\mu, \kern0.5em \raisebox{1ex}{${\sigma}^2$}\!\left/ \!\raisebox{-1ex}{$ n$}\right.\right) \)). The new standard deviation of the mean when window size is increased 1.5 times is \( \frac{1}{\sqrt[2]{1.5}} \) times the old standard deviation. Let τ = ασ where τ is the threshold and α is a scalar and σ is the standard deviation. It follows that:
$$ {\tau}_{new} = \alpha .{\sigma}_{new} = \frac{1}{\sqrt[2]{1.5}}.\ \alpha .{\sigma}_{old} = \frac{1}{\sqrt[2]{1.5}}.\ {\tau}_{old} $$
Thus both the τ
logr
and τ
BAF
thresholds are updated using the following equation:
$$ {\tau}_{new} = \frac{1}{\sqrt[2]{1.5}} \ast {\tau}_{old} $$
The process of increasing the window-size is continued as long as the updated thresholds are greater than the thresholds for the merging two consecutive segments (see below). After detecting all the breakpoints, a consensus list of breakpoints is created by accepting all of the breakpoints detected by the first pass of the double sliding window and adding the breakpoints detected from the larger windows to the list only if the breakpoint is not in the vicinity of an existing breakpoint in the list (i.e., |cp
current
− cp
existing
| > window size, where cp is a detected breakpoint). Consensus breakpoints are used to create the segments. Two consecutive breakpoints constitute a segment. For each segment, the average RD log-ratio and average mirrored and smoothed BAF is calculated. Two consecutive segments are merged if the following conditions are met:
$$ \left|\ \overline{log{ r}_{se{ g}_i}} - \overline{log{ r}_{se{ g}_{i+1}}}\kern0.5em \right|<{\tau}_{merg{ e}_{log r}}\kern1em and\kern1em \left|\ \overline{cBA{ F}_{se{ g}_i}} - \overline{cBA{ F}_{se{ g}_{i+1}}}\ \right| < {\tau}_{merg{ e}_{BAF}} $$
where \( \overline{log{ r}_{se{ g}_i}} \) is the mean RD log-ratio of the i
th segment, \( \overline{cBA{ F}_{se{ g}_i}} \) is the mean mirrored and smoothed BAF of the i
th segment, and \( {\tau}_{merg{ e}_{logr}} \) and \( {\tau}_{merg{ e}_{BAF}} \) (determined by user) are the RD log-ratio and BAF merging thresholds, respectively.
Detection of cnLOH events within a single sample
A segment is considered to be a cnLOH segment if the following conditions are met:
$$ \left|\ \overline{log{ r}_{se{ g}_i}}\ \right| < {\tau}_{cnLO{ H}_{log r}}\kern1.25em and\kern1.5em \left(0.5 - \kern0.5em \overline{cBA{ F}_{se{ g}_i}}\right) < {\tau}_{cnLO{ H}_{BAF}} $$
where \( \overline{log{ r}_{se{ g}_i}} \) is the mean RD log-ratio of the i
th segment, \( \overline{cBA{ F}_{se{ g}_i}} \) is the mean mirrored and smoothed BAF of the i
th segment, \( {\tau}_{cnLO{ H}_{logr}} \) and \( {\tau}_{cnLO{ H}_{BAF}} \) (determined by the user) are thresholds for calling a cnLOH segment.
Detection of recurrent cnLOH regions across multiple samples
To find genomic regions with recurrent cnLOH events, all cnLOH regions for individual samples are identified following the procedure described earlier. Then, the number of occurrences of a cnLOH event for a specific region across multiple samples is counted using an interval tree data structure (Additional file 1: Figure S1).
Simulator
To evaluate Segmentum in terms of segmentation accuracy, a simulator capable of simulating whole-genome RD for both normal and tumor samples and BAF based on events such as deletions, amplifications and cnLOH was developed. The simulator receives a normal sample RD data and outputs 4 sets of data including the simulated normal and tumor RD, BAF data and a ground truth. First, the simulator learns the distribution of the RD data from the provided normal sample by simply counting the number of times two consecutive RD values (e.g., 368 and 299) occur together throughout the genome (Additional file 1: Figure S2). The learned distribution also accounts for the inherent noise in the RD data. Next, inverse transform sampling (Smirnov transform) is used to generate RD values for each position in the genome based on the learned distribution. Then, noise is removed using a median filter. A normal RD is constructed by adding independent Poisson noise to the simulated RD data. To construct the tumor RD, two copy number tracks (because autosomal chromosomes come in maternal and paternal pairs) harboring random SCNAs are constructed. The tumor sample RD is calculated using the copy number tracks, the simulated normal sample RD and the normal sample contamination (i.e., a parameter determined by user). To construct the BAF data, heterozygous SNPs are initially randomly distributed across the genome (1 heterozygous SNP per 1.5 Kbp). The number of B-alleles at a heterozygous SNP is calculated using a binomial distribution with the parameters n (total number of reads at heterozygous SNP position) and p (probability that a read is coming from the B-allele). n is extracted from the simulated normal RD at heterozygous SNP positions. p is calculated using the two constructed copy number tracks and the normal sample contamination. Once the number of B-alleles is calculated, it is used to calculate the BAF values (Additional file 1: Figures S3 and S4 for the simulator pipeline and the simulated data visualized in the integrative genomics viewer (IGV) [15], respectively).