isoCNV: in silico optimization of copy number variant detection from targeted or exome sequencing data

Background Accurate copy number variant (CNV) detection is especially challenging for both targeted sequencing (TS) and whole‐exome sequencing (WES) data. To maximize the performance, the parameters of the CNV calling algorithms should be optimized for each specific dataset. This requires obtaining validated CNV information using either multiplex ligation-dependent probe amplification (MLPA) or array comparative genomic hybridization (aCGH). They are gold standard but time-consuming and costly approaches. Results We present isoCNV which optimizes the parameters of DECoN algorithm using only NGS data. The parameter optimization process is performed using an in silico CNV validated dataset obtained from the overlapping calls of three algorithms: CNVkit, panelcn.MOPS and DECoN. We evaluated the performance of our tool and showed that increases the sensitivity in both TS and WES real datasets. Conclusions isoCNV provides an easy-to-use pipeline to optimize DECoN that allows the detection of analysis-ready CNV from a set of DNA alignments obtained under the same conditions. It increases the sensitivity of DECoN without the need for orthogonal methods. isoCNV is available at https://gitlab.com/sequentiateampublic/isocnv.

introduce more biases due to hybridization and to a non-uniform read-depth distribution among regions [13][14][15] that make CNV detection even more difficult. Nevertheless, TS and WES can offer greater depth at a lower price and a faster and less complex data analysis.
Many tools have been developed for CNV detection using TS or WES data [13,[16][17][18][19][20]. Among them there is DECoN [18], which has shown a high performance with NGS panel data [21,22]. DECoN is based on read depth data to call CNV and is the result of modification and optimization of ExomeDepth v1.0.0 [16]. However, its performance is highly dependent on the selected parameters which should be tuned for each specific dataset to maximize sensitivity [22] and should not be used directly with data produced differently, i.e. with different sequencing technologies, targeting probes or capture protocol [22].
Parameter optimization can be performed using an optimizer from the CNVbench-markeR framework [22]. However, the parameter optimization process requires a CNV validation set, which is usually generated using either multiplex ligation-dependent probe amplification (MLPA) or array comparative genomic hybridization (aCGH). They are the gold standard methods [23], but both are time-consuming and costly approaches.
Here we present the isoCNV pipeline, which performs in silico optimization of DECoN parameters to maximize its sensitivity using only NGS data. We propose to obtain the CNV validation set from the overlapping calls of three CNV calling tools: CNVkit [24], panelcn.MOPS [19] and DECoN. We show that our tool increases the sensitivity of DECoN for both TS and WES data. In addition, it is easy to implement and allows to obtain analysis-ready CNV from DNA sequencing read alignments in BAM format [25].

Implementation
Our pipeline is a Python 3.7 software package comprising a command-line program, isoCNV.py. The input to the program is a batch of BAM files from TS or WES samples obtained under the same conditions and the regions of interest (ROI) in BED format that should correspond with the capture bait locations. The program is completely modular and allows to run the complete pipeline in batch or perform the step-by-step analysis. The pipeline consists of 5 main steps: individual CNV calling using three different algorithms, creation of an in silico validation dataset, parameter optimization, CNV calling with optimized parameters and CNV annotation (Fig. 1).

Datasets
A targeted and a whole-exome sequencing dataset were selected to evaluate the performance of isoCNV: ICR96 exon CNV validation series [26] and NimbleGen set [27], respectively. ICR96 exon CNV validation series includes 96 samples and NimbleGen set includes 34 samples. Both datasets have available validated CNV information, ICR96 have been validated by MLPA and NimbleGen by SNP microarray [28]. ICR96 exon CNV validation series can be downloaded from European-Genome phenome Archive (EGA), which is hosted by the EBI and the CRG, under accession number EGAS00001002428. The FASTQ files for NimbleGen dataset can be accessed through the Sequence Read Archive (SRA) [29] under accession number SRP010920.

Data preprocessing
All samples were aligned to the GRCh37 human genome assembly using BWA-MEM algorithm developed by Wellcome Trust Sanger Institute [30]. Sentieon sort utility [31] was used to sort and index BAM files. Then, duplicate reads were removed and base quality score recalibration (BQSR) was performed using the Sentieon utilities [31]. Sentieon is a commercial variant caller that is designed as an accelerated software for Genome Analysis Toolkit (GATK) [32].

Individual CNV calling
Preliminary identification of copy number variants is performed using three different CNV callers: DECoN v1.0.2 [22] with default parameters, CNVkit v0.9.6 [24] and panelcn.MOPS v1.12.0 [19]. Since the CNV identification method is based on depth of coverage, the gender of the samples is a critical factor to determine variations in copy number of sex chromosomes. Therefore, one of the mandatory inputs to perform the analysis is the gender of the samples, which can be provided by the user or it will be automatically inferred using the CNVkit gender tool.
Default parameters are applied to perform the CNV calling using DECoN but with the modifications described below. DECoN creates a reference set for each sample of interest consisting only of those samples which are well correlated [22]. Hence related individuals should be excluded from the reference, otherwise common CNV in the family would not be detected. For this reason, a list of related samples can be provided in order to automatically exclude them from the reference set of their relatives in order not to lose CNV of the family. In addition, it has been found that the optimum size of the reference set is between 5 and 10 samples [16] so DECoN has been modified to only accept a maximum of 10 samples as reference. Moreover, it should be noted that by default, CNV calling is performed separately between male and female samples, thus allowing the detection of CNV in the sex chromosomes. However, if there are less than 5 female or male samples, all samples are analyzed in a single batch, disabling a reliable CNV calling on sex chromosomes. Optionally, only sex chromosomes can be analyzed separately between male and female samples, using the "batch2" option of isoCNV. Regions of interest (ROI) will be dropped if they are below the default minimum median coverage threshold (100) for any sample (measured across all ROI in the target) or region (measured across all samples). CNV will be filtered out from samples that do not meet either the minimum coverage threshold (100) or the minimum correlation threshold (0.98). Samples which do not have a high correlation with other samples in the set are likely to have suboptimal detection across the entire target. These two types of filters have been added as options so that the user can easily select whether to apply them.
Regarding CNV calling with CNVkit, default parameters are also applied except for filtering where the 'cn' method is applied instead of 'ci' . Here a single reference set is created for all samples, it will be composed of all female samples in the batch with a standard deviation (SD) between − 2 and 2. Such a reference set will be only modified if the sample of interest is female, in which case it will be excluded from the reference. Two exceptions should be noted in the creation of the reference set: (1) if there are less than 5 female samples, then males will be the ones used as reference and (2) in the case that there are less than 5 females and less than 5 males, then all samples will be used as a reference and CNV in Y chromosome will be unreliable. Furthermore, the thresholds used by CNVkit to define copy numbers 0 and 1 were modified to be more restrictive: for CN0 the threshold range (log 2 value up to) has changed from log 2 ≦ − 1.1 to log 2 ≦ − 2 and for CN1 from − 1.1 < log 2 ≦ − 0.4 to − 2 < log 2 ≦ − 0.4. The precise copy number values obtained by CNVkit (0. 1, 2, 3, etc) are then converted to deletion (DEL) or duplication (DUP) taking into account the gender of the sample of interest and the gender of the references.
The identification of CNV with panelcn.MOPS is also carried out using the default parameters of the tool. As with DECoN, the analysis is carried out in two groups, one with the female samples and another with the male ones, unless there are less than 5 females or males that all samples will be analyzed together. ROI are excluded from the analysis if marked as "low quality" by panelcn.MOPS: their median read count across all samples does not exceed the minimum default threshold (30) or if their read count shows a high variation across all samples as marked by the default behaviour of the algorithm.

In silico validation dataset
The in silico validation dataset is composed of the overlapping calls of the three CNV calling tools (DECoN with default parameters, CNVkit and panelcn.MOPS). In order to compare the results obtained by the three calling tools and create an in silico validation dataset, the output of each tool is normalized to a single format, a tab-delimited BED file. This file contains five columns corresponding to chromosome, start position of the CNV, end position of the CNV, CNV type (DEL or DUP) and samplename. Using BED-Tools utilities v2.29.2 [33] and pybedtools Python library v0.8.1 [34], the overlapping CNV between call sets from the three algorithms are selected if meet two criteria (1) at least 60% of overlap with one of the call sets from the algorithms and (2) a minimum size equivalent to the mean size of the target ROIs. If one of the tools reports no CNV in any sample, only the output of the other two algorithms is used to create the in silico validation set.

Parameter optimization
Parameter optimization is performed using the feature optimizer from CNVbench-markeR framework [22]. From a validated dataset, it executes DECoN algorithm against the dataset using up to 22 different values for each parameter. The results obtained with each combination of parameters are compared with the validated copy number states in order to obtain optimized parameters for the dataset.
Here, the validated CNV are the ones obtained in silico from the overlapped calls between DECoN, CNVkit and panelcn.MOPS (the in silico validation dataset). Nevertheless, it is also necessary to provide validated information about regions with a normal copy number state. To do this, all regions where a CNV has been found (and has been validated in silico) in any of the samples from the dataset are selected as validated regions, and then, a normal copy number state is assigned to each validated region with no validated CNV.
The DECoN parameters subject to optimization are the following: (1) the minimum correlation threshold between a test sample and any other sample to be considered well correlated, (2) the minimum median coverage for any sample or ROI to be considered well-covered and (3) the transition probability between normal copy number state and either deletion or duplication state in the hidden Markov model.
The identification of copy number variants is performed using DECoN algorithm using the same approach applied to create the CNV validation dataset: (1) a maximum of 10 samples are used as reference per sample, (2) related individuals are excluded from the reference set and (3) female and male samples are processed separately. Nevertheless, instead of using the default parameters, the optimized ones obtained in the previous step are used to perform the analysis.
The results are the final copy number variants, which are normalized in BED format with the following columns: chromosome, start position of the CNV, end position of the CNV, CNV type (DEL or DUP), sample name, reads ratio and the precise copy number value. Reads ratio corresponds to the one calculated by DECoN algorithm and copy number values are calculated based on the reads ratio (Table 1).

CNV annotation
Finally, CNV are annotated using the AnnotSV tool [35]. AnnotSV provides numerous relevant annotations: genes-based annotation (OMIM, Haploinsufficiency, Gene intolerance, etc), annotation with features overlapping the CNV (databases of known CNV such as gnomAD or 1000 genomes), annotation with features overlapped with the CNV (pathogenic SV from dbVar, promoters, etc) and annotation of the breakpoints (GC content, segmental duplications, etc). Therefore, it classifies CNVs according to their pathogenicity into one of the 5 classes proposed by the American College of Medical Genetics and Genomics (ACMG) guidelines: benign, likely benign, variant of unknown significance (VUS), likely pathogenic or pathogenic. All of this makes it easier for prioritization of copy number variants of interest.

Benchmark evaluation metrics
The performance of isoCNV was evaluated per region of interest (ROIs). Such ROIs correspond to the target bed file of each dataset and were treated as independent entities. If the tool matched the result of the validation information was classified as true positive (TP) or true negative (TN). If the tool identified a CNV not present in the validation information was a false positive (FP) and if the tool missed a validated CNV was a false negative (FN). Furthermore, the performance of isoCNV was evaluated taking the no calls into account. This is due to the fact that in a real diagnostic scenario, all regions where there is no call should be confirmed by an orthogonal method.

In silico validation dataset
The total copy number variants identified per ROI, for each calling tool and dataset, is shown in a Venn diagram (Fig. 2). It is shown that the total number of CNVs per ROI varies across algorithms. In both datasets, panelcn.MOPS identified the greatest number of CNVs whereas DECoN identified the least number. The overlapped CNVs per ROI between the three call sets were 205 in the TS dataset (ICR96) and 693 in the WES dataset (NimbleGen) (Fig. 2). From these, the validation dataset was composed from the ones that overlapped at least 60% with one of the call sets from the algorithms and that had a minimum size equivalent to the mean size of the target ROIs. Hence, 72 validated CNVs were obtained in ICR96 and 388 in NimbleGen. After the regions with normal copy number state were attached to the validation set, such validation set could be compared to the real copy number information obtained by MLPA in ICR96 and by SNP microarray in NimbleGen set (Table 2). For both datasets, specificity was 1 as no FP were identified, while sensitivity was quite low as a high number of FN were found. These results were expected, due to the stringent filters that we apply to define a copy number as validated before proceeding to the optimization step.

Benchmark evaluation
After the parameter optimization of DECoN, 597 CNV were identified in ICR96 and 125601 in NimbleGen. There was an increase in sensitivity and F-score for both dataset but especially for NimbleGen set where there was a major improvement in sensitivity (from 16.2 to 84.5%) and F-score (from 27.1 to 82.7%) by slightly decreasing specificity (from 99.4 to 96.3%) (Fig. 3, Table 3). Negative Predictive Value (NPV) was higher than the Positive Predictive Value (PPV) before and after optimization process in both datasets (Fig. 3, Table 3) as expected in unbalanced datasets with a much larger number of negative elements (no calls) than positive ones.
To evaluate if parameter optimization of DECoN allows to identify new CNVs only predicted by the other two methods (CNVkit and panelcn.MOPS) when default parameters are used, the unique CNVs of CNVkit (identified by CNVkit but not by DECoN with default parameters) have been obtained and compared to the final CNVs (identified by DECoN with optimized parameters) and found 86 and 2727 CNVs in common in the ICR96 and NimbleGen dataset, respectively. The same approach has been applied to the unique CNVs of panelcn.MOPS and 88 (ICR96) and 68569 (NimbleGen) CNVs have been found in the final CNVs that were not identified initially by DECoN with default parameters.
In addition, the performance of isoCNV was evaluated depending on the number of samples analyzed. This relates to the reference set as samples with a better correlation or a higher coverage may be included and could improve the performance of DECoN. The ICR96 set reached almost 100% specificity and NPV independently of the number of samples with both default and optimized parameters (Fig. 4). An improvement in  PPV and F-score can be observed in the ICR96 set when at least 20 samples were analyzed together and then, from 24 samples, both PPV and F-score remained fairly constant, being always higher when executing DECoN with optimized parameters (Fig. 4). The sensitivity in the ICR96 set also remained quite constant and above 80% when at least 6 samples were analyzed with optimized parameters, whereas there was a decrease in the sensitivity when more than 86 samples were analyzed with default parameters (Fig. 4). The NimbleGen set showed a fairly constant sensitivity, specificity, PPV, NPV and F-score with optimized parameters (Fig. 5). However, sensitivity, F-score and NPV decreased considerably when analyzing more than 20 samples using default parameters (Fig. 5).

Conclusions
We presented isoCNV, an automated pipeline to optimize DECoN algorithm using only NGS data. It allows the detection of analysis-ready CNV from a set of DNA alignments and their corresponding capture bait locations. It has been shown to improve sensitivity of DECoN in both TS and WES data, which is especially critical when this tool is used as a screening step in a diagnostic strategy. We thus hope to reduce the number of assays required per patient to reach a diagnosis as orthogonal methods, such as MLPA or aCGH, are not required.