Choice of methods
At present, lots of calling tools are available and these tools exhibited their specialties in CNV calling [8,9,10, 12]. To integrate the different tools into a single package, several factors weighs heavily in our consideration: 1) Efficiency: the efficiency of Anaconda depends on the performance of the included methods. Based on previous report [9], EXCAVATOR, ADTEx and Control-FREEC ranked in the top 3 for processing duration. Tested on our in-house input, ExomeCNV performed slightly slower than EXCAVATOR and ADTEx but out-performed than Control-FREEC. 2) Precision: we identified the precision of each tool based on existing comparisons, especially focus on the comparison conducted on clinical data. When setting SNP array results as control, previous report compared the performance of 6 tools on two major datasets: ADTEx and EXCAVATOR showed better performances owing to their high precision and sensitivity [9]. 3) Input: unified input format will facilitate the combination of different methods. Most caller tools, such as ADTEx, EXCAVATOR, ExomeCNV and Control-FREEC, allow BAM input. Though ERDS-pe also allows BAM input, the required single-nucleotide variation information (VCF format), limited its practicability. Additionally, the tools revealed their preference on CNV size: EXCAVATOR often recognizes larger CNVs, ADTEx tends to detect medium-size CNVs, while ExomeCNV and Control-FREEC are in favor of smaller CNVs [9]. Therefore, Anaconda integrates 4 algorithms: Control-FREEC [13], ADTEx [11], EXCAVATOR [14] and ExomeCNV [15], other tools will be incorporated to Anaconda in future.
Fundamental framework of Anaconda is constructed with Shell. Unix-like systems, R3.0+, Jdk8+, gcc and g++ are required before installing Anaconda. After fulfilling all prerequisites, users could simply run a single command “./install” at the Anaconda unzipped folder to install Anaconda.
Workflow
For convenience of users during setting the parameters, Anaconda prepared a specific config file, at which users could determine the following options: 1) softwares used for CNV detection, 2) paths for input files and output results; 3) gene coverage in CNV regions; 4) minimal called methods in considering CNV as a common CNV; 5) parallel threads as well as all specific parameters for each selected tool. After the setting progress, users could simply run a command “./bin/ANACONDA /path/to/configfile” to process their data. We highly recommend users to access Internet when use Anaconda for the first time, because Anaconda would double-check and download the necessary packages automatically.
Anaconda takes paired tumor and normal bam files, genome reference fasta file, exome bed file as input, and output detected CNVs and their annotations. Human genome (hg18 and hg19) fasta file and exome bed file can be downloaded from Anaconda website. Workflow of Anaconda is shown in Fig. 1. The pipeline contains five steps: 1) configure the running environment; 2) detect somatic CNVs by assigned tools; 3) extract the intersection of detected CNVs; 4) retrieve and annotate genes located within called CNVs; 5) generate a HTML-based report including all the analyzed results.
General analysis for callers
For CNVs called by specific tool, Anaconda draws plot of gain and loss CNVs on every chromosome using R (Additional file 1: Figure S1A), and calculates overall loss and gain of the CNV quantity. Detailed results of CNVs are presented in tables including chromosome, exon start, exon end and copy number information (Additional file 1: Figure S1B).
Venn diagrams are drawn to show the intersection of called CNVs by selected tools and genes involved in CNV regions (Additional file 1: Figure S1C). Detailed CNV intersection results are showed in tables, including CNV position, copy number quantity, caller information and shared number information (Additional file 1: Figure S1D). Anaconda also provides additional coverage and detailed information for the genes involved in called CNV regions.
Shared CNVs and genes
Method that Anaconda determines shared CNV region and genes can be seen at Additional file 2: Figure S2. At first, Anaconda gathers all merged CNV reads called by selected tools, maps them with reference genome and divides them as unique-caller reads, double-caller reads, triple-caller reads and tetrad-caller reads. Mapping gene to called CNV region is based on gene coverage. Our default coverage value is 0.7, i.e. if 70% of gene sequence is located inside this CNV, this gene will be retrieved with caller information. Gene coverage value could be modified at Anaconda config file.
Functional annotation
To reveal gene function in called CNV regions, Anaconda annotates these genes with Gene Ontology (GO), Online Mendelian Inheritance in Man (OMIM), Clusters of Orthologous Groups (COG), Pathway, Protein domain and terms (Additional file 3: Figure S3). All term information are downloaded from Database for Annotation, Visualization and Integrated Discovery (DAVID) V6.8 [16] and Kyoto Encyclopedia of Genes and Genomes (KEGG) [17]. Anaconda applies fisher’s exact test to generate P-value for all variants enriched to respective terms. After assigning annotation categories, detailed table is provided to present annotation results. On each annotation page, search module and data sort function is equipped for users with specific commands. For instance, users could click the sort icon by P-value column to sort the P-value of all the terms in a low to high or high to low manner.