ConanVarvar: a versatile tool for the detection of large syndromic copy number variation from whole-genome sequencing data
BMC Bioinformatics volume 24, Article number: 49 (2023)
A wide range of tools are available for the detection of copy number variants (CNVs) from whole-genome sequencing (WGS) data. However, none of them focus on clinically relevant CNVs, such as those that are associated with known genetic syndromes. Such variants are often large in size, typically 1–5 Mb, but currently available CNV callers have been developed and benchmarked for the discovery of smaller variants. Thus, the ability of these programs to detect tens of real syndromic CNVs remains largely unknown.
Here we present ConanVarvar, a tool that implements a complete workflow for the targeted analysis of large germline CNVs from WGS data. ConanVarvar comes with an intuitive R Shiny graphical user interface and annotates identified variants with information about 56 associated syndromic conditions. We benchmarked ConanVarvar and four other programs on a dataset containing real and simulated syndromic CNVs larger than 1 Mb. In comparison to other tools, ConanVarvar reports 10–30 times fewer false-positive variants without compromising sensitivity and is quicker to run, especially on large batches of samples.
ConanVarvar is a useful instrument for primary analysis in disease sequencing studies, where large CNVs could be the cause of disease.
A copy number variant (CNV) is defined as a DNA fragment with a size of at least 1 kilobase-pair (kb) which has a different copy number compared to a reference genome. This large unbalanced structural variation can be present in the form of deletions (1 or 0 copies) and duplications (>2 copies). CNVs can cause many rare sporadic and Mendelian disorders, such as split hand/foot malformation and leukodystrophy [1, 2]. Furthermore, severe paediatric conditions are often caused by large CNVs spanning multiple genes, with the size of such CNVs varying from 1 to 3 megabases (Mb) (e.g., DiGeorge and Charcot-Marie-Tooth syndromes [3, 4]) to >10 Mb (e.g., Cri-du-Chat and cat-eye syndromes [5, 6]).
Tens of computer programs for CNV detection have been developed over the last decade, with most being based on read-depth, split-read, read-pair and de novo assembly approaches [7, 8]. However, the vast majority of benchmarking analyses tend to focus on relatively small variants rather than on big, clinically actionable alterations in the genome [9,10,11,12] (see  for an example of a study showing the performance of one popular CNV caller on CNVs >1 Mb). Consequently, there are no comprehensive benchmarks assessing the performance of different tools on large deletions and duplications, a situation which can be partly explained by the limited availability of samples with genetic aberrations of this kind.
We present ConanVarvar, a novel software tool for quick and robust joint calling of large, syndromic CNVs in batches of whole-genome sequencing (WGS) samples using read depth. To aid in the analysis, ConanVarvar annotates identified CNVs with information about associated syndromic conditions and generates plots showing the position of each variant on the chromosome. ConanVarvar demonstrated superior performance on our test dataset, which comprised both clinical and simulated samples with large CNVs, compared to some of the most popular programs for CNV analysis.
Overview of the approach
ConanVarvar is a read depth-based CNV caller with both a graphical user interface (GUI) and a command-line interface (CLI) (Additional file 1: Figs. S1 and S2). It approximates read depth along chromosomes by splitting them into bins of fixed size (e.g., 50 kb) with subsequent corrections for GC content and mappability. To detect abnormal regions, ConanVarvar performs segmentation of binned genomic intervals and assigns each segment an averaged copy number value. When the total number of available segments is sufficient, the program first removes all outliers with high standard deviation and then transforms the mean copy number of each segment to a different scale, so that potential deletions and duplications are further away from other segments (see Fig. 1); a K-means clustering algorithm then groups all transformed segments into “normal” and “CNV” categories. Otherwise, when there are fewer than 30 segments available, which renders clustering inefficient, a simple threshold-based approach is triggered to identify abnormal regions based on the raw read depth of the segment.
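The two-branch decision described above can be sketched in a few lines of Python. This is an illustrative sketch only, not ConanVarvar's code (the tool itself is written in R; see Additional file 1: Note S1 for the actual method). The cubic rescaling, the standard deviation cutoff, the copy number thresholds and all function names are assumptions made for the example; only the 30-segment limit comes from the text. The fallback here thresholds the copy number estimate, whereas the text describes a threshold on raw read depth.

```python
import numpy as np

MIN_SEGMENTS_FOR_CLUSTERING = 30  # from the text: below this, use thresholds


def transform(copy_number):
    """Hypothetical rescaling: cube the deviation from the diploid value (2),
    which shrinks small noise much more than it shrinks real gains or losses."""
    return (np.asarray(copy_number, dtype=float) - 2.0) ** 3


def kmeans_two_groups(values, n_iter=50):
    """Plain 1-D K-means with k=2. Returns True for the higher-mean cluster."""
    values = np.asarray(values, dtype=float)
    if values.size == 0 or values.max() == values.min():
        return np.zeros(values.size, dtype=bool)  # nothing to separate
    centres = np.array([values.min(), values.max()])
    for _ in range(n_iter):
        # Assign every value to its nearest centre, then recompute the centres.
        labels = np.abs(values[:, None] - centres[None, :]).argmin(axis=1)
        new_centres = np.array([
            values[labels == k].mean() if np.any(labels == k) else centres[k]
            for k in range(2)
        ])
        if np.allclose(new_centres, centres):
            break
        centres = new_centres
    return labels == centres.argmax()


def classify_segments(mean_cn, sd, sd_cutoff=0.5, del_cutoff=1.5, dup_cutoff=2.5):
    """Return a boolean array marking segments that look like CNVs.

    sd_cutoff, del_cutoff and dup_cutoff are assumed values for illustration.
    """
    mean_cn = np.asarray(mean_cn, dtype=float)
    sd = np.asarray(sd, dtype=float)
    if mean_cn.size < MIN_SEGMENTS_FOR_CLUSTERING:
        # Too few segments for clustering: simple threshold rule.
        return (mean_cn < del_cutoff) | (mean_cn > dup_cutoff)
    keep = sd <= sd_cutoff  # discard noisy segments before clustering
    is_cnv = np.zeros(mean_cn.size, dtype=bool)
    is_cnv[keep] = kmeans_two_groups(np.abs(transform(mean_cn[keep])))
    return is_cnv
```

Note that K-means with two clusters always produces two groups, so a real implementation would need a safeguard for batches that contain no CNVs at all; the sketch does not attempt one.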
Once all non-CNV segments are excluded from the list, the program assigns each variant an “occurrence” value indicating the total number of identical CNVs found in other samples using a distance-based estimation technique (see Fig. 2). Finally, ConanVarvar generates plots (Additional file 1: Fig. S3) and reports the identified CNVs, with all per-variant statistics and annotations, including bootstrapped p-values, summarised in the form of a spreadsheet (Additional file 1: Fig. S4). A more detailed description of the developed methodology is available in Additional file 1: Note S1, illustrated with the workflow shown in Additional file 1: Fig. S5.
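One way to picture the “occurrence” value is as a count of other samples in the batch that carry a matching variant. The Python sketch below assumes a simple matching rule (same chromosome, same type, both breakpoints within a fixed distance); the rule, the `max_shift` default and all names are hypothetical, and the distance-based estimation actually used by ConanVarvar is described in Additional file 1: Note S1.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cnv:
    sample: str
    chrom: str
    start: int
    end: int
    kind: str  # "DEL" or "DUP"


def is_same_event(a, b, max_shift=100_000):
    """Assumed rule: same chromosome and type, and both breakpoints
    lie within max_shift base pairs of each other."""
    return (
        a.chrom == b.chrom
        and a.kind == b.kind
        and abs(a.start - b.start) <= max_shift
        and abs(a.end - b.end) <= max_shift
    )


def occurrence(cnvs, max_shift=100_000):
    """For each CNV, count the other samples that carry a matching CNV."""
    counts = []
    for cnv in cnvs:
        matching_samples = {
            other.sample
            for other in cnvs
            if other.sample != cnv.sample and is_same_event(cnv, other, max_shift)
        }
        counts.append(len(matching_samples))
    return counts


# Made-up coordinates, for illustration only.
batch = [
    Cnv("sample1", "chr22", 18_900_000, 21_500_000, "DEL"),
    Cnv("sample2", "chr22", 18_950_000, 21_450_000, "DEL"),
    Cnv("sample3", "chr5", 1, 12_000_000, "DEL"),
]
counts = occurrence(batch)
```

Under this rule, a variant that recurs across many unrelated samples in a batch receives a high count, which may help flag systematic artefacts during filtering.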
We benchmarked ConanVarvar against two commonly used read depth-based tools, CNVnator v0.4  and Control-FREEC v11.5 [15, 16]. The tools were selected based on their superior performance among other similar programs in previous benchmarks, especially on large deletions and duplications . We also included in the comparison an alternative Python implementation of the CNVnator algorithm called CNVpytor (v1.2.1), which provides speed improvements and additional features when compared to its predecessor . Another popular tool, Manta v1.6.0 , which is considered one of the top-performing CNV detection algorithms overall in terms of recall, accuracy and precision on both simulated and real data [9, 11], was also added to the comparison. In contrast to ConanVarvar, Control-FREEC and CNVnator/CNVpytor, Manta uses read-pair and split-read information instead of read depth.
Due to the extremely deleterious nature of large CNVs, they are much rarer genomic events than their smaller counterparts. For this reason, there is a very limited number of publicly available WGS samples with large CNVs, which results in the lack of benchmarking studies involving this type of variation [9, 10].
To assess the performance of ConanVarvar, CNVnator/CNVpytor, Control-FREEC and Manta, we created a test dataset of 14 WGS files, comprising 4 clinical samples, 9 simulated single-chromosome samples and the original sample NA12878 from the 1000 Genomes Project  dataset (see Table 1). The clinical samples were selected based on WGS and comparative genomic hybridisation (CGH) microarray results in cases from our in-house cohort of patients . The dataset contained a total of 13 large (>1 Mb) deletions and duplications. All simulated files were generated using either BAMSurgeon v1.2  or Illumina’s EAGLE simulator v2.5.1  (see Additional file 1: Note S2).
All read depth-based tools (ConanVarvar, CNVnator/CNVpytor, Control-FREEC) were run at the 50 kb resolution. For ConanVarvar and CNVnator/CNVpytor, the default settings were used. The parameters of Control-FREEC were selected to most closely match the default settings of ConanVarvar, e.g., both programs were tested with the minimum mappability of 0.8. For other parameters, either recommended or default values were used (see Additional file 1: Note S2). For CNVnator, Control-FREEC and Manta, the same reference files were used in each run, either as separate chromosomes (CNVnator and Control-FREEC) or in a merged form (Manta). Allosomes (sex chromosomes) were excluded from the analysis in all samples, as were Manta’s BND (‘breakend’) type of records, specific to inversions and translocations.
The performance of ConanVarvar, CNVnator/CNVpytor, Control-FREEC and Manta was evaluated using the F1, precision and recall metrics. The execution time on an HPC (high-performance computing) server node with 4 Intel Xeon CPUs and 128 GB of RAM was recorded for each tool.
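For reference, the three metrics follow their standard definitions in terms of true positives (TP), false positives (FP) and false negatives (FN). The short Python sketch below shows the calculation; the function name and the example counts are hypothetical and do not correspond to any result reported in this study.

```python
def precision_recall_f1(true_positives, false_positives, false_negatives):
    """Standard definitions; each metric is 0.0 when its denominator is zero."""
    reported = true_positives + false_positives   # everything the caller output
    expected = true_positives + false_negatives   # everything it should have found
    precision = true_positives / reported if reported else 0.0
    recall = true_positives / expected if expected else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Hypothetical caller: finds 10 of 13 target CNVs and reports 30 other calls.
precision, recall, f1 = precision_recall_f1(10, 30, 3)
```

None of the three metrics requires a count of true negatives.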
F1, precision and recall
Among the tools we evaluated, ConanVarvar had the highest F1 and precision scores, as shown in Fig. 3. It correctly identified all CNVs of interest and reported the smallest number of false positives (i.e., variants other than the selected 13 CNVs). In contrast, Manta had the lowest precision overall. It missed more than half of all CNVs and performed especially poorly on duplications. Interestingly, the performance of Manta on simulated data was better than on real data, where it failed to find most large CNVs (Additional file 1: Figs. S6–S11), except for one clinical sample, for which it gave a partially correct answer (compare Additional file 1: Fig. S12 with Figs. S13 and S14). Perhaps unsurprisingly, the output of CNVnator and CNVpytor for most of the samples in our benchmark was identical or nearly identical, which is also reflected in the highly similar F1, precision and recall characteristics of the two tools in Fig. 3.
As shown in Fig. 4, the overall concordance between the tools was strikingly low. Each tool, with the exception of ConanVarvar, produced a considerably large list of unique false positives. The most obvious examples of such false positives, identified by Manta, are shown in Additional file 1: Figs. S15–S20. Also, we found that both CNVnator/CNVpytor and Control-FREEC treat gaps in the centromere regions as CNVs, which resulted in another group of false-positive calls (Additional file 1: Figs. S21–S29).
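False positives of the centromere-gap kind could in principle be screened out after calling by measuring how much of each call lies inside annotated assembly gaps. The Python sketch below illustrates the idea only; it is not a feature of any of the benchmarked tools, and the function names, the 0.5 cutoff and the coordinates in the example are assumptions.

```python
def overlap_fraction(call_start, call_end, gaps):
    """Fraction of the half-open interval [call_start, call_end) covered by
    gap intervals. Assumes the gap intervals do not overlap one another."""
    length = call_end - call_start
    if length <= 0:
        return 0.0
    covered = sum(
        max(0, min(call_end, gap_end) - max(call_start, gap_start))
        for gap_start, gap_end in gaps
    )
    return covered / length


def drop_gap_calls(calls, gaps_by_chrom, max_fraction=0.5):
    """Keep (chrom, start, end) calls whose gap overlap is at most max_fraction."""
    return [
        (chrom, start, end)
        for chrom, start, end in calls
        if overlap_fraction(start, end, gaps_by_chrom.get(chrom, [])) <= max_fraction
    ]


# Made-up coordinates, for illustration only.
gaps = {"chrA": [(1000, 2000)]}
calls = [("chrA", 900, 2100), ("chrA", 5000, 7000), ("chrB", 1000, 2000)]
kept = drop_gap_calls(calls, gaps)
```

In this example the first call is dropped because most of its length falls inside the gap, while the other two calls are kept.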
CNVnator and ConanVarvar demonstrated a very similar performance in terms of their execution time on the analysed samples (Table 2). Control-FREEC was slightly slower on full BAM files compared to ConanVarvar and CNVnator, and was on average as fast as the other two programs on single-chromosome samples. Manta was the slowest of the five tools on full BAMs, which, however, can be explained by the underlying differences between split-read, read-pair and read depth-based methods. In particular, it demonstrated a two- to three-fold slower performance compared to ConanVarvar, CNVnator and Control-FREEC on each of the full BAM files, though its performance on single-chromosome samples was not too inferior. Interestingly, even though CNVpytor was significantly faster on full BAM files than the other four tools, it was surprisingly slow on single-chromosome BAMs.
None of the above-mentioned tools, except for ConanVarvar, allow for concurrent processing of multiple WGS samples. Even though Manta can, in principle, be run on several BAM files simultaneously, it is reported to fail on datasets with more than 12 samples, and, therefore, this feature was not applicable to our dataset. Hence, apart from ConanVarvar, all other tools required multiple invocations to analyse files from our dataset.
ConanVarvar is different, as it was specifically designed for quick multi-sample analysis. As shown in Table 2, it batch-processed the entire dataset in just 63 min, efficiently utilising all available CPUs (configurable behaviour).
When ConanVarvar was run on all 14 BAM files as one large batch of samples, it reported only 50 variants in total, a quantity 10–30 times smaller than the total number of false-positive CNVs output by the other four tools. Importantly, not only did it find all real CNVs of interest, but it also prioritised those variants based on their calculated p-values and associated syndromes, such that all of them were in the first half of the list. In addition, to aid in further filtering of false positives, ConanVarvar calculated within-batch “occurrence” values for all identified variants based on the proximity of each variant to other variants.
CNVs larger than 1 Mb remain a considerably understudied class of genomic variants in bioinformatics [9,10,11,12]. Yet, these variants cause some of the most severe developmental disorders and should, therefore, be prioritised. In this paper we present ConanVarvar, a robust CNV detection tool based on read depth that specifically addresses the problem of large deletions and duplications. This software is designed to be used as a primary analysis program for simultaneous screening of multiple WGS samples for the most deleterious mutations. The 1 Mb cutoff on the CNV size allows ConanVarvar to quickly detect tens of known syndromic CNVs without reporting large numbers of false positives. False-positive CNVs are a long-known problem in disease sequencing studies [7, 23, 24], as they tend to complicate the analysis by obscuring the more severe genomic abnormalities.
Our benchmarking results show that read depth-based CNV callers, such as ConanVarvar, CNVnator/CNVpytor and Control-FREEC, tend to perform well on large CNVs. In contrast, despite being one of the de facto best algorithms for the detection of structural variation according to previous studies, Manta, which utilises split-read and read-pair information for CNV detection, missed half of all the large CNVs in our dataset. In particular, it did extremely poorly on our large duplications, which is strikingly different to its previously reported performance on smaller CNVs of this type. Among the read depth-based callers in this study, ConanVarvar demonstrated superior results in terms of the F1 and precision metrics on both real and simulated data by making several-fold fewer false-positive calls and prioritising true positives higher. Consistent with the literature, CNVnator correctly identified all large CNVs from our dataset, as did Control-FREEC. Nevertheless, the overall concordance between the tools was rather low, with the vast majority of calls being unique false positives.
Aside from the superior performance, ConanVarvar offers built-in annotations, facilitating the identification of clinically actionable CNVs based on 56 known syndromes. As WGS technology is gradually becoming a standard clinical practice, we anticipate that the reliance on such readily available annotation features will increase accordingly in the near future.
One of the limitations of our study is the size of our selection of CNV callers. Besides, we were also limited in the number of metrics we could use for the benchmarking, as the definition of a true negative is ambiguous in the context of large CNVs. However, given that our test dataset was sufficiently heterogeneous, we argue that the above-mentioned conclusions are still valid.
In future versions of ConanVarvar, we plan to add support for integrating single-nucleotide variants in order to improve the accuracy of CNV breakpoint detection. Breakpoint accuracy is currently one of the major limitations of all read depth-based programs, as it depends on the bin size. We also plan to regularly update the list of syndromes used for annotation in ConanVarvar.
We believe that this work will not only provide the bioinformatics community with a new tool for CNV analysis but will also help to further elucidate the performance of existing tools for CNV detection on large CNVs.
Availability of data and materials
The source code and test data are available online at https://github.com/VCCRI/ConanVarvar. The datasets generated and/or analysed during the current study are not publicly available due to the use of patient-identifiable clinical data, but are available from the corresponding author on reasonable request. Project name: ConanVarvar; Project home page: https://github.com/VCCRI/ConanVarvar; Docker Hub: https://hub.docker.com/r/mgud/conanvarvar; Operating systems: Platform independent; Programming languages: R; Other requirements: Docker; License: GNU GPL.
Abbreviations
CGH: Comparative genomic hybridisation
CHD: Congenital heart disease
CNV: Copy number variant
FNR: False negative rate
GUI: Graphical user interface
Crackower MA, Scherer SW, Rommens JM, Hui C-C, Poorkaj P, Soder S, Cobben JM, Hudgins L, Evans JP, Tsui L-C. Characterization of the split hand/split foot malformation locus SHFM1 at 7Q21.3–Q22.1 and analysis of a candidate gene for its expression during limb development. Hum Mol Genet. 1996;5(5):571–9. https://doi.org/10.1093/hmg/5.5.571.
Padiath QS, Saigoh K, Schiffmann R, Asahara H, Yamada T, Koeppen A, Hogan K, Ptáček LJ, Fu Y-H. Lamin B1 duplications cause autosomal dominant leukodystrophy. Nat Genet. 2006;38(10):1114–23. https://doi.org/10.1038/ng1872.
McDermid HE, Morrow BE. Genomic disorders on 22q11. Am J Hum Genet. 2002;70(5):1077–88. https://doi.org/10.1086/340363.
Boerkoel CF, Takashima H, Garcia CA, Olney RK, Johnson J, Berry K, Russo P, Kennedy S, Teebi AS, Scavina M, Williams LL, Mancias P, Butler IJ, Krajewski K, Shy M, Lupski JR. Charcot–Marie–Tooth disease and related neuropathies: mutation distribution and genotype-phenotype correlation. Ann Neurol. 2002;51(2):190–201. https://doi.org/10.1002/ana.10089.
Mainardi PC, Perfumo C, Calì A, Coucourde G, Pastore G, Cavani S, Zara F, Overhauser J, Pierluigi M, Bricarelli FD. Clinical and molecular characterisation of 80 patients with 5p deletion: genotype-phenotype correlation. J Med Genet. 2001;38(3):151–8. https://doi.org/10.1136/jmg.38.3.151.
Footz TK, Brinkman-Mills P, Banting GS, Maier SA, Riazi MA, Bridgland L, Hu S, Birren B, Minoshima S, Shimizu N, et al. Analysis of the cat eye syndrome critical region in humans and the region of conserved synteny in mice: a search for candidate genes at or near the human chromosome 22 pericentromere. Genome Res. 2001;11(6):1053–70.
Pirooznia M, Goes FS, Zandi PP. Whole-genome CNV analysis: advances in computational approaches. Front Genet. 2015;6:138. https://doi.org/10.3389/fgene.2015.00138.
Teo SM, Pawitan Y, Ku CS, Chia KS, Salim A. Statistical challenges associated with detecting copy number variations with next-generation sequencing. Bioinformatics. 2012;28(21):2711–8. https://doi.org/10.1093/bioinformatics/bts535.
Kosugi S, Momozawa Y, Liu X, Terao C, Kubo M, Kamatani Y. Comprehensive evaluation of structural variation detection algorithms for whole genome sequencing. Genome Biol. 2019;20:117. https://doi.org/10.1186/s13059-019-1720-5.
...Zook JM, Hansen NF, Olson ND, Chapman L, Mullikin JC, Xiao C, Sherry S, Koren S, Phillippy AM, Boutros PC, Sahraeian SME, Huang V, Rouette A, Alexander N, Mason CE, Hajirasouliha I, Ricketts C, Lee J, Tearle R, Fiddes IT, Barrio AM, Wala J, Carroll A, Ghaffari N, Rodriguez OL, Bashir A, Jackman S, Farrell JJ, Wenger AM, Alkan C, Soylev A, Schatz MC, Garg S, Church G, Marschall T, Chen K, Fan X, English AC, Rosenfeld JA, Zhou W, Mills RE, Sage JM, Davis JR, Kaiser MD, Oliver JS, Catalano AP, Chaisson MJP, Spies N, Sedlazeck FJ, Salit M. A robust benchmark for detection of germline large deletions and insertions. Nat Biotechnol. 2020;38(11):1347–55. https://doi.org/10.1038/s41587-020-0538-8.
Wang T, Sun J, Zhang X, Wang W-J, Zhou Q. CNV-PG: a machine-learning framework for accurate copy number variation predicting and genotyping. bioRxiv. 2020. https://doi.org/10.1101/2020.04.13.039016.
Pabinger S, Dander A, Fischer M, Snajder R, Sperk M, Efremova M, Krabichler B, Speicher MR, Zschocke J, Trajanoski Z. A survey of tools for variant analysis of next-generation genome sequencing data. Brief Bioinform. 2013;15(2):256–78. https://doi.org/10.1093/bib/bbs086.
Trost B, Walker S, Wang Z, Thiruvahindrapuram B, MacDonald JR, Sung WWL, Pereira SL, Whitney J, Chan AJS, Pellecchia G, Reuter MS, Lok S, Yuen RKC, Marshall CR, Merico D, Scherer SW. A comprehensive workflow for read depth-based identification of copy-number variation from whole-genome sequence data. Am J Hum Genet. 2018;102(1):142–55. https://doi.org/10.1016/j.ajhg.2017.12.007.
Abyzov A, Urban AE, Snyder M, Gerstein M. CNVnator: an approach to discover, genotype, and characterize typical and atypical CNVs from family and population genome sequencing. Genome Res. 2011;21(6):974–84. https://doi.org/10.1101/gr.114876.110.
Boeva V, Popova T, Bleakley K, Chiche P, Cappo J, Schleiermacher G, Janoueix-Lerosey I, Delattre O, Barillot E. Control-FREEC: a tool for assessing copy number and allelic content using next-generation sequencing data. Bioinformatics. 2012;28(3):423–5. https://doi.org/10.1093/bioinformatics/btr670.
Boeva V, Zinovyev A, Bleakley K, Vert JP, Janoueix-Lerosey I, Delattre O, Barillot E. Control-free calling of copy number alterations in deep-sequencing data using GC-content normalization. Bioinformatics. 2011;27(2):268–9. https://doi.org/10.1093/bioinformatics/btq635.
Suvakov M, Panda A, Diesh C, Holmes I, Abyzov A. CNVpytor: a tool for copy number variation detection and analysis from read depth and allele imbalance in whole-genome sequencing. GigaScience. 2021. https://doi.org/10.1093/gigascience/giab074.
Chen X, Schulz-Trieglaff O, Shaw R, Barnes B, Schlesinger F, Källberg M, Cox AJ, Kruglyak S, Saunders CT. Manta: rapid detection of structural variants and indels for germline and cancer sequencing applications. Bioinformatics. 2016;32(8):1220–2. https://doi.org/10.1093/bioinformatics/btv710.
1000 Genomes Project Consortium. A global reference for human genetic variation. Nature. 2015;526:68–74. https://doi.org/10.1038/nature15393.
...Alankarage D, Ip E, Szot JO, Munro J, Blue GM, Harrison K, Cuny H, Enriquez A, Troup M, Humphreys DT, Wilson M, Harvey RP, Sholler GF, Graham RM, Ho JWK, Kirk EP, Pachter N, Chapman G, Winlaw DS, Giannoulatou E, Dunwoodie SL. Identification of clinically actionable variants from genome sequencing of families with congenital heart disease. Genet Med. 2019;21(5):1111–20. https://doi.org/10.1038/s41436-018-0296-x.
...Lee AY, Ewing AD, Ellrott K, Hu Y, Houlahan KE, Bare JC, Espiritu SMG, Huang V, Dang K, Chong Z, Caloian C, Yamaguchi TN, Kellen MR, Chen K, Norman TC, Friend SH, Guinney J, Stolovitzky G, Haussler D, Margolin AA, Stuart JM, Boutros PC. Combining accurate tumor genome simulation with crowdsourcing to benchmark somatic structural variant detection. Genome Biol. 2018;19(1):188. https://doi.org/10.1186/s13059-018-1539-5.
Kuo T, Frith MC, Sese J, Horton P. EAGLE: explicit alternative genome likelihood evaluator. BMC Med Genom. 2018. https://doi.org/10.1186/s12920-018-0342-1.
Kuśmirek W, Szmurło A, Wiewiórka M, Nowak R, Gambin T. Clustering-based optimization method of reference set selection for improved CNV callers performance. bioRxiv. 2018. https://doi.org/10.1101/478313.
Xiao F, Min X, Zhang H. Modified screening and ranking algorithm for copy number variation detection. Bioinformatics. 2015;31(9):1341–8. https://doi.org/10.1093/bioinformatics/btu850.
This work was supported by NSW State Government [S.L.D., D.S.W., E.G.], Chain Reaction (The Ultimate Corporate Bike Challenge) [S.L.D.], the NSW Health Cardiovascular Senior Scientist Grant [S.L.D.], the NSW Health Early-Mid Career Fellowship [E.G.], the National Health and Medical Research Council (Project Grant 1162878) [S.L.D., D.S.W., E.G.], the National Health and Medical Research Council Principal Research Fellowship (1135886) [S.L.D.], the National Heart Foundation of Australia Future Leader Fellowship (101204) [E.G.], the National Heart Foundation of Australia Postdoctoral Fellowship (101894) [G.M.B.], and the Office of Health and Medical Research. The funding sources played no role in the design of the study; the collection, analysis, and interpretation of the data; or the writing of the manuscript.
Ethics approval and consent to participate
For all clinical samples (samples 2, 3, 4, and 5), ethical approval was obtained from the Sydney Children’s Hospital Network Human Research Ethics Committee (Approval Number HREC/16/SCHN/73). Written informed consent was obtained from all participants. Consent was obtained from a parent or guardian on behalf of any participants under the age of 16.
Consent for publication
Competing interests
The authors declare that they have no competing interests.
Additional file 1. File contains supplementary information about the methods (Note S1) and the benchmarking procedure (Note S2). Fig. S1. The graphical user interface (GUI) of ConanVarvar, developed using the R Shiny framework. Fig. S2. The command-line interface (CLI) of ConanVarvar. Fig. S3. Examples of plots generated by ConanVarvar. Fig. S4. Example of the produced spreadsheet with pre-sorted candidate variants. Fig. S5. Complete workflow diagram of ConanVarvar. Figs. S6–S29. Sample plots created using ConanVarvar's native plotting function showing the output of Manta, CNVnator and Control-FREEC for some of the samples used in the benchmarking.
Cite this article
Gudkov, M., Thibaut, L., Khushi, M. et al. ConanVarvar: a versatile tool for the detection of large syndromic copy number variation from whole-genome sequencing data. BMC Bioinformatics 24, 49 (2023). https://doi.org/10.1186/s12859-023-05154-x