Read count (RC) analysis is one of the most commonly used approaches for CN detection on WES data [5, 28]. In this process, the WES data are first used to extract the RC for each exon, computed as the number of reads aligned to that exon. Once the average read coverage has been computed for each exon in the paired sample (tumor and normal), the data are preprocessed for bias removal. The basic idea behind the RC approach is that the read count of a genomic region should be proportional to the CN profile of that region. However, the data contain several forms of bias that need to be removed. In this work, we apply a two-step bias-removal approach: mappability correction and GC-content correction. The data are then represented as a log ratio (LR) signal of the tumor sample with respect to its matched normal. This list of LR values, one per exon, is the input to the segmentation algorithm, which fragments the signal and eventually enables the identification of amplifications and deletions in the genome. In the following subsections, we describe the details of the algorithm.
Preprocessing
This is the first and vital step in WES data analysis. The raw data contain several sources of bias that affect the copy number profiles and hence require preprocessing for mitigation or normalization. Although normalization against the control samples in the paired approach (tumor with a matched normal) is known to reduce bias, it has been shown that preprocessing is still necessary for better assessment [29, 30].
In our work, we use BAM files of paired tumor-normal samples as the input. In general, the raw reads (DNA sequences) from WES data are first aligned to their locations in the reference genome using alignment tools such as BWA [31] or BOWTIE [32], and finally compiled into BAM format. To reduce mappability bias in our analysis, we removed reads that were mismatched or aligned to multiple regions. The average RC (ARC) for each exon is then computed as the number of reads aligned to the exon divided by the size of that exon (in bases):
$$ {ARC}_{i} = \frac{RC_{i}}{s_{i}} $$
where \(RC_{i}\) is the number of reads aligned to exon \(i\) and \(s_{i}\) is the size of that exon.
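In code, this step reduces to a per-exon division (a minimal sketch; the lists of uniquely aligned read counts and exon sizes are assumed to have been extracted from the BAM files beforehand):

```python
def average_read_count(read_counts, exon_sizes):
    """Per-exon average read count: ARC_i = RC_i / s_i."""
    return [rc / float(s) for rc, s in zip(read_counts, exon_sizes)]
```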
The next step is to remove the biological bias present in the data due to GC content. We observed that GC bias exists in our data, consistent with the analysis previously reported in a similar work [12]. The average read coverage of the exons is highest for GC-content values between 35% and 65%, and decreases outside this range. To address this correlation, we apply the median normalization approach described in [13] and implemented in the recent tool [12]. This approach successfully reduced the bias in our data as well.
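A sketch of one common form of such median normalization (the GC bin width and the rescaling by bin medians are our assumptions about the general scheme in [13], not a transcription of it):

```python
import numpy as np

def gc_correct(arc, gc_content, bin_width=0.01):
    """Median-based GC correction: scale each exon's ARC by the ratio of
    the global median ARC to the median ARC of exons in the same GC bin."""
    arc = np.asarray(arc, dtype=float)
    gc = np.asarray(gc_content, dtype=float)
    global_median = np.median(arc)
    corrected = arc.copy()
    # Assign each exon to a GC bin and rescale by the bin's median ARC.
    bins = np.floor(gc / bin_width).astype(int)
    for b in np.unique(bins):
        mask = bins == b
        bin_median = np.median(arc[mask])
        if bin_median > 0:
            corrected[mask] = arc[mask] * (global_median / bin_median)
    return corrected
```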
Once the data have been preprocessed to remove biases and ARC values computed, we calculate the logarithm of the ratio of the tumor average read coverage and the matched normal average read coverage, which is known as the LR signal and represented as:
$$ {LR}_{i} = \log_{2} \frac{\overline{ARC^{t}_{i}}}{\overline{ARC^{n}_{i}}} $$
where \(\overline{ARC^{t}_{i}}\) and \(\overline{ARC^{n}_{i}}\) are the corrected average read coverage of the tumor sample and the normal sample for exon \(i\), respectively. This LR signal, in which each data point corresponds to an exon, is passed on for segmentation to detect CN breakpoints.
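The LR computation itself is a one-liner over the corrected ARC vectors (a sketch; the epsilon guard for zero-coverage exons is our addition, not part of the original formulation):

```python
import numpy as np

def log_ratio(arc_tumor, arc_normal, eps=1e-9):
    """Per-exon LR signal: log2 of corrected tumor ARC over matched normal ARC."""
    t = np.asarray(arc_tumor, dtype=float)
    n = np.asarray(arc_normal, dtype=float)
    return np.log2((t + eps) / (n + eps))
```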
Segmentation
We extend the VEGA segmentation algorithm, originally developed to detect CNVs in aCGH data, for application to WES data. This algorithm is based on the Mumford and Shah variational model [20]. According to this model, an observed signal \(u_{0}\), defined on the domain \(\Omega\), is partitioned into a set of disjoint connected components (\(\Omega = \Omega_{1} \cup \Omega_{2} \cup \ldots \cup \Omega_{n}\)). The set of points on the boundaries between the \(\Omega_{i}\) is denoted as \(\Gamma\). This partition is modeled such that the signal varies smoothly within a component and discontinuously between the disjoint components. This is also known as the problem of piecewise smooth approximation. The solution to this problem requires the derivation of an optimal approximation \(u\) of \(u_{0}\) while penalizing the complexity of the solution through a regularization parameter \(\lambda\). Here, we adopt the special case of the Mumford and Shah energy functional for piecewise constant approximation, which is best suited for CNV segmentation. The optimal approximation is achieved by minimizing the following energy function, which in the original two-dimensional setting is expressed as:
$$ E(u, \Gamma) = \sum_{i}{\int_{\Omega_{i}} (u_{0} - u_{i})^{2}\,dx\,dy} + \lambda |\Gamma| \tag{1} $$
where \(u_{i}\) is the mean value of \(u_{0}\) within each connected component \(\Omega_{i}\).
VEGA adopts the one-dimensional version of the piecewise model; the data \(D \in \mathbb{R}^{n}\) are represented as a vector of size \(n\), where \(n\) is the number of exons ordered by genomic position. The segmentation divides the data vector into \(M\) connected regions, denoted as \(R\), and is defined as a set of ordered positions \(\Gamma = \{b_{1}, \ldots, b_{M+1}\}\). Each region \(R_{i}\) contains all exons between breakpoints \(\{b_{i}, b_{i+1}\}\). Derived from the Mumford and Shah model for one-dimensional data, the piecewise constant energy function to be minimized in order to find the optimal approximation is [24]:
$$ E(u, \Gamma) = \sum_{i}{\int_{b_{i}}^{b_{i+1}} (u_{0} - u_{i})^{2}\,dx} + \lambda M \tag{2} $$
where λ is the regularization parameter that determines the number of segmented regions.
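As a concrete reading of Eq. (2) on discrete exon data, the energy of a candidate segmentation is the within-region sum of squared deviations from the region means plus \(\lambda\) times the number of regions (a sketch; the breakpoint indexing convention, with the first breakpoint at 0 and the last at \(n\), is ours):

```python
import numpy as np

def piecewise_constant_energy(lr, breakpoints, lam):
    """Evaluate Eq. (2) for a candidate segmentation.

    lr          : per-exon LR values (the observed signal u_0)
    breakpoints : ordered indices b_1..b_{M+1}, with b_1 = 0, b_{M+1} = len(lr)
    lam         : regularization parameter lambda
    """
    lr = np.asarray(lr, dtype=float)
    fidelity = 0.0
    for start, end in zip(breakpoints[:-1], breakpoints[1:]):
        region = lr[start:end]
        u_i = region.mean()               # piecewise-constant approximation
        fidelity += np.sum((region - u_i) ** 2)
    num_regions = len(breakpoints) - 1    # M
    return fidelity + lam * num_regions
```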
In order to minimize this function, adjacent regions \(R_{i}\) and \(R_{i+1}\) are iteratively merged in a pyramidal manner to create larger segments, and the reduction of the energy can be expressed as:
$$ E(u, \Gamma \backslash \{b_{i}\}) - E(u, \Gamma) = \frac{|R_{i}||R_{i+1}|}{|R_{i}| + |R_{i+1}|} \, ||u_{i} - u_{i+1}||^{2} - \lambda \tag{3} $$
where \(|R_{i}|\) and \(u_{i}\) are the length (number of exons) and the mean LR value of the \(i\)-th region, respectively, \(||\cdot||\) is the \(L_{2}\) norm, and \(\backslash\) is the set difference. Following a greedy procedure, we start with a segmentation having \(n\) regions, one for each LR measure. Then, at each step, we merge the pair of adjacent regions whose merger yields the maximum decrease of the energy functional. Since \(\lambda\) determines when merging ends, the choice of an appropriate value is crucial to ensure the quality of the final segmentation. In VEGA, \(\lambda\) is selected dynamically at each merging step, depending on two factors: the lengths and the mean LR values of the consecutive regions being considered for the merger. Hence, the cost of merging two regions \(R_{i}\) and \(R_{i+1}\), associated with a breakpoint \(b_{i}\), is computed as follows:
$$ \hat{\lambda}_{i} = \frac{|R_{i}||R_{i+1}|}{|R_{i}| + |R_{i+1}|} \, ||u_{i} - u_{i+1}||^{2} \tag{4} $$
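In code, the merge cost of Eq. (4) is a direct transcription (region lengths in exons, means over LR values):

```python
def merge_cost(len_i, len_j, mean_i, mean_j):
    """Cost of merging two adjacent regions (Eq. (4)):
    a length-weighted squared difference of their mean LR values."""
    return (len_i * len_j) / (len_i + len_j) * (mean_i - mean_j) ** 2
```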
The adjacent regions are merged and the \(i\)-th breakpoint removed if \(\hat{\lambda}_{i} < \lambda\). When this condition is no longer satisfied by any pair, the energy function has reached its minimum for the current \(\lambda\) and no merging can proceed. Therefore, \(\lambda\) is updated to the smallest \(\hat{\lambda}_{i} + \epsilon\) (with \(\epsilon\) close to zero) and the merging is continued. The sequence of \(\lambda\) values is monotonically increasing, as it corresponds to the amount of decrease of the energy functional at each step (Eq. (3)). Here, we adopt a stopping criterion such that the final segmentation is obtained when the increase in \(\lambda\) stabilizes and further merging does not correspond to a significant decrease of the energy. The final stopping value of \(\lambda\) is based on the variability of adjacent \(\lambda\) values and the total variability of the data, \(\nu\). The resulting stopping criterion is \(\Delta\lambda = \lambda_{l+1} - \lambda_{l} \leq \beta\nu\), where \(\beta\) is a positive constant (ideal values lie between 0.5 and 0.7).
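Under our reading of this criterion (merging may continue while each \(\lambda\) update satisfies the inequality; the first violation fixes the final segmentation), the check is:

```python
def can_continue_merging(lambda_prev, lambda_next, nu, beta=0.6):
    """Stopping test on the lambda sequence: merging may continue while
    delta_lambda = lambda_{l+1} - lambda_l <= beta * nu, where nu is the
    total variability of the data. beta = 0.6 sits in the 0.5-0.7 range
    quoted above."""
    return (lambda_next - lambda_prev) <= beta * nu
```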
WES data are sparse, with highly variable distances between exons. This characteristic distinguishes WES from WGS data and affects the behavior of the segmentation. Hence, in the extended version of the original VEGA model, we take this distance feature into consideration, similar to the approach taken in the development of EXCAVATOR.
The pyramid approach in VEGA merges two regions if the cost of merging them, \(\hat{\lambda}_{i}\), is low. As mentioned earlier, in the original model the two parameters considered to compute the cost are the lengths of the regions, \(|R_{i}|\) and \(|R_{i+1}|\), and the difference of the LR values of the two regions, \(||u_{i} - u_{i+1}||\). In the new model, we add a third parameter that considers the local average distance between the exons within a region. We compute the local average for a region by taking the mean of the distances between consecutive exons in the region, expressed as:
$$ d_{i} = \frac{\sum_{j=1}^{|R_{i}|-1}{|m_{j} - m_{j+1}|}}{|R_{i}|} \tag{5} $$
where \(m_{j}\) is the midpoint of exon \(j\) in region \(R_{i}\) on the genome. The difference between the average distances of the two regions is then calculated and used to adjust the merging condition (the cost function). If this difference is small, the regions are more likely to merge; that is, the cost of merging two regions is reduced if both are sparsely populated or both are densely populated. Conversely, if one region is sparse compared to the other, the difference between the local averages is high, which reduces the chance of merging. The idea is that two adjacent regions merge according to their relative sparsity or density. Therefore, to include this factor, the original update formula for \(\hat{\lambda}_{i}\) is adjusted with a third term that computes a weighted difference between the local average exon distances of the two regions, expressed as follows:
$$ \hat{\lambda}_{i} = \frac{|R_{i}||R_{i+1}|}{|R_{i}| + |R_{i+1}|} \, ||u_{i} - u_{i+1}||^{2} + \alpha \log(|d_{i} - d_{i+1}|) \tag{6} $$
where \(d_{i}\) and \(d_{i+1}\) are the averages of the distances between consecutive exons in \(R_{i}\) and \(R_{i+1}\) respectively, and \(\alpha\) determines the weight of the term. In our analysis, we set \(\alpha\) to 0.001. Setting \(\alpha\) to 0 recovers the original VEGA model, providing the flexibility to run the original segmentation on WES data.
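Eqs. (5) and (6) translate directly into code; the \(|R_{i}|\) denominator follows the text, while the small floor on \(|d_{i} - d_{i+1}|\), guarding against \(\log(0)\) when the two averages coincide, is our implementation choice:

```python
import math

def local_average_distance(midpoints):
    """Eq. (5): mean distance between consecutive exon midpoints in a region.
    Note the denominator is the number of exons |R_i|, as in the text."""
    n = len(midpoints)
    total = sum(abs(midpoints[j] - midpoints[j + 1]) for j in range(n - 1))
    return total / n

def extended_merge_cost(len_i, len_j, mean_i, mean_j, d_i, d_j, alpha=0.001):
    """Eq. (6): original VEGA cost plus the distance-awareness term.
    alpha = 0 recovers the original VEGA cost."""
    base = (len_i * len_j) / (len_i + len_j) * (mean_i - mean_j) ** 2
    if alpha == 0:
        return base
    gap = max(abs(d_i - d_j), 1e-12)  # guard against log(0); our choice
    return base + alpha * math.log(gap)
```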
The pseudocode of the resulting VEGAWES algorithm is reported below:
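A condensed Python rendition of this greedy procedure is sketched here for illustration (not the authors' verbatim pseudocode; in particular, the quadratic cost recomputation and the reading of \(\nu\) as the total sum of squared LR deviations are our assumptions):

```python
import math

def region_stats(lr, midpoints, start, end):
    """Length, mean LR, and local average exon distance (Eq. (5)) of [start, end)."""
    length = end - start
    mean = sum(lr[start:end]) / length
    dist = sum(abs(midpoints[j] - midpoints[j + 1])
               for j in range(start, end - 1)) / length
    return length, mean, dist

def cost(lr, midpoints, left, right, alpha):
    """Eq. (6) cost of merging the adjacent regions left and right."""
    l1, u1, d1 = region_stats(lr, midpoints, *left)
    l2, u2, d2 = region_stats(lr, midpoints, *right)
    base = (l1 * l2) / (l1 + l2) * (u1 - u2) ** 2
    gap = max(abs(d1 - d2), 1e-12)  # avoid log(0); implementation choice
    return base + alpha * math.log(gap)

def vegawes_segment(lr, midpoints, alpha=0.001, beta=0.6, eps=1e-6):
    regions = [(i, i + 1) for i in range(len(lr))]  # one region per exon
    mean_lr = sum(lr) / len(lr)
    nu = sum((x - mean_lr) ** 2 for x in lr)        # total variability (our reading)
    lam = 0.0
    while len(regions) > 1:
        costs = [cost(lr, midpoints, regions[i], regions[i + 1], alpha)
                 for i in range(len(regions) - 1)]
        i_min = min(range(len(costs)), key=costs.__getitem__)
        if costs[i_min] >= lam:
            # No merge decreases the energy under the current lambda: raise
            # lambda to the cheapest cost plus epsilon, unless the jump
            # violates the stopping criterion delta_lambda <= beta * nu.
            if costs[i_min] + eps - lam > beta * nu:
                break
            lam = costs[i_min] + eps
            continue
        start, _ = regions[i_min]
        _, end = regions[i_min + 1]
        regions[i_min:i_min + 2] = [(start, end)]   # merge the cheapest pair
    return regions
```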
Synthetic data
We generated synthetic chromosomes from the corrected ARC data of the eight samples described in [33]. The dataset consisted of seven samples of Yoruba ancestry (NA19131, NA19138, NA19152, NA19153, NA19159, NA19206 and NA19223) and one sample of Caucasian ancestry (NA10847). We applied a procedure similar to that reported in [12] to generate these chromosomes, using the seven samples as tests and the remaining sample as the control for each test sample.
Each synthetic chromosome consists of 1,000 exons and contains g altered genes, where N is the length of an altered gene (in exons) and D is the distance, in bp, between consecutive genes. We performed tests on amplification and deletion events separately, with several combinations of g, N, and D: g = (2, 5), N = (5, 20), and D = (10,000, 1,000,000). For each combination, we generated 100 synthetic chromosomes. To report the segmentation performance of each algorithm on the synthetic chromosomes, we used the receiver operating characteristic (ROC) curve as described in [34]. To compute the ROC curve, the true positive rate (TPR) was defined as the number of exons in the altered regions whose segmented mean LR value is above the threshold divided by the total number of exons in the altered regions; the false positive rate (FPR) was defined as the number of exons in the unaltered regions whose segmented mean LR value is above the threshold divided by the total number of exons in the unaltered regions.
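These rates translate directly into code for a single threshold (a sketch for amplification events; mirroring the comparison to below-threshold for deletions is our adaptation, not stated in the text):

```python
import numpy as np

def tpr_fpr(segment_means, altered_mask, threshold):
    """TPR and FPR at one LR threshold, as defined above.

    segment_means : per-exon mean LR of the segment containing each exon
    altered_mask  : boolean array, True for exons inside altered regions
    """
    means = np.asarray(segment_means, dtype=float)
    altered = np.asarray(altered_mask, dtype=bool)
    called = means > threshold
    tpr = np.sum(called & altered) / np.sum(altered)
    fpr = np.sum(called & ~altered) / np.sum(~altered)
    return tpr, fpr

# Sweeping the threshold over the observed LR range traces the ROC curve:
# roc = [tpr_fpr(means, mask, t) for t in np.linspace(lo, hi, 50)]
```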
Tumor dataset
In order to assess our algorithm on real data, we experimented with data provided by TCGA. All patient data were acquired from the published TCGA GBM Analysis project [35], which states that "Specimens were obtained from patients, with appropriate consent from institutional review boards," in accordance with the policies and guidelines outlined by the Ethics, Law and Policy Group of TCGA. All patient data are anonymous and were originally collected for routine therapeutic purposes.
The SNP data were used as ground truth for comparison and validation of the results generated by our algorithm. The reads in the BAM files were aligned to the Hg19 reference genome (ftp://ftp.ncbi.nlm.nih.gov/sra/reports/Assembly/GRCh37-HG19_Broad_variant/Homo_sapiens_assembly19.fasta) from the Broad Institute.
The sample sets comprised varied coverage values. The total number of reads (RC) ranges from 1.6 to 7.4 billion in the control samples and from 3.4 to 7.9 billion in the tumor samples. The total average reads per exon lie between 7.2 and 33 million in the control samples and between 15 and 35 million in the tumor samples. Furthermore, the average coverage per base ranges from 32.23 to 146.44 in the normal samples and from 53.61 to 177.76 in the tumor samples. The size of the exons in the reference genome varies between 24 bp and 91 kbp, with an average of approximately 352 bp.
Prior to the evaluation, the data were preprocessed and the ARC values were generated using the DepthOfCoverage functionality of the Genome Analysis Toolkit (GATK v3.2-2). The ARC values were then corrected for GC content, and the LR values were computed. The LR signal was then passed to each segmentation algorithm, and the results were obtained as segments with a corresponding mean LR value for each segment. For comparison with the other algorithms, we set the parameters of each algorithm to the default values provided by ExomeCNV for CBS and by EXCAVATOR for HSLM.
We define an amplification event as a segment with an LR value above 0.35 [36] and a deletion event as a segment with an LR value below -0.25 [37], and mark the rest as normal. In addition, since we focus on evaluating the results of the segmentation algorithms and not only the classified CNV labels, we also compare the LR values of the exons computed by each algorithm against the corresponding SNP values for validation. The LR value of an exon is the mean LR value of the segment that contains it. An exon is marked as a true positive if its copy number classification is the same as that in the SNP data, or if the LR value from the segmentation algorithm is within ±0.15 of the LR value from the SNP data.
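The per-exon validation rule can be written directly (a sketch; applying the same LR thresholds to classify the SNP ground truth is our assumption, since the SNP data may carry their own labels):

```python
def classify(lr):
    """Amplification / deletion / normal label from a segment mean LR,
    using the thresholds quoted above (0.35 and -0.25)."""
    if lr > 0.35:
        return "amp"
    if lr < -0.25:
        return "del"
    return "normal"

def is_true_positive(exon_lr, snp_lr, tol=0.15):
    """An exon counts as a true positive if its CN class matches the SNP
    ground truth, or if its segment LR is within +/-tol of the SNP LR."""
    return classify(exon_lr) == classify(snp_lr) or abs(exon_lr - snp_lr) <= tol
```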
To analyze the accuracy of the segmentation on the tumor data, we use the precision, recall, and F-score metrics, computing precision and recall for amplification and deletion events separately. For each sample, precision was defined as the number of true positives detected by the algorithm (detections that correctly correspond to the ground truth) divided by the total number of regions detected by the algorithm. Recall was defined as the number of true positives divided by the total number of regions detected by SNP (the ground truth). The F-score for each sample is then the harmonic mean of the precision and recall scores.
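The per-sample metrics then follow from three counts (a minimal sketch):

```python
def precision_recall_fscore(true_positives, detected, ground_truth):
    """Per-sample metrics as defined above:
    precision = TP / regions detected by the algorithm
    recall    = TP / regions detected by SNP (ground truth)
    F-score   = harmonic mean of precision and recall
    """
    precision = true_positives / detected if detected else 0.0
    recall = true_positives / ground_truth if ground_truth else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score
```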