Software comparison for evaluating genomic copy number variation for Affymetrix 6.0 SNP array platform
- Jeanette E Eckel-Passow†1Email author,
- Elizabeth J Atkinson†1,
- Sooraj Maharjan1,
- Sharon LR Kardia2 and
- Mariza de Andrade1
© Eckel-Passow et al; licensee BioMed Central Ltd. 2011
Received: 22 November 2010
Accepted: 31 May 2011
Published: 31 May 2011
Copy number data are routinely being extracted from genome-wide association study chips using a variety of software. We empirically evaluated and compared four freely-available software packages designed for Affymetrix SNP chips to estimate copy number: Affymetrix Power Tools (APT), Aroma.Affymetrix, PennCNV and CRLMM. Our evaluation used 1,418 GENOA samples that were genotyped on the Affymetrix Genome-Wide Human SNP Array 6.0. We compared bias and variance in the locus-level copy number data, the concordance amongst regions of copy number gains/deletions and the false-positive rate amongst deleted segments.
APT had median locus-level copy numbers closest to a value of two, whereas PennCNV and Aroma.Affymetrix had the smallest variability associated with the median copy number. Of those evaluated, only PennCNV provides copy number specific quality-control metrics and identified 136 poor CNV samples. Regions of copy number variation (CNV) were detected using the hidden Markov models provided within PennCNV and CRLMM/VanillaIce. PennCNV detected more CNVs than CRLMM/VanillaIce; the median number of CNVs detected per sample was 39 and 30, respectively. PennCNV detected most of the regions that CRLMM/VanillaIce did as well as additional CNV regions. The median concordance between PennCNV and CRLMM/VanillaIce was 47.9% for duplications and 51.5% for deletions. The estimated false-positive rate associated with deletions was similar for PennCNV and CRLMM/VanillaIce.
If the objective is to perform statistical tests on the locus-level copy number data, our empirical results suggest that PennCNV or Aroma.Affymetrix is optimal. If the objective is to perform statistical tests on the summarized segmented data then PennCNV would be preferred over CRLMM/VanillaIce. Specifically, PennCNV allows the analyst to estimate locus-level copy number, perform segmentation and evaluate CNV-specific quality-control metrics within a single software package. PennCNV has relatively small bias, small variability and detects more regions while maintaining a similar estimated false-positive rate as CRLMM/VanillaIce. More generally, we advocate that software developers need to provide guidance with respect to evaluating and choosing optimal settings in order to obtain optimal results for an individual dataset. Until such guidance exists, we recommend trying multiple algorithms, evaluating concordance/discordance and subsequently consider the union of regions for downstream association tests.
Data from genome-wide association studies are now being mined for genotyping as well as for copy number estimation. Often, the primary objective is genotyping and secondarily the data are interrogated to evaluate copy number variation (CNV). It is well recognized that the genotyping algorithms are usually highly concordant for most SNPs, particularly for the two most common SNP array vendors: Illumina and Affymetrix. However, although the probe intensities produce concordant SNP calls, this does not imply that the probe intensities are sufficiently clean to accurately estimate copy number. In fact, copy number estimation is more susceptible than genotyping to variability in sample processing procedures as well as analytical processing steps. For example, copy number data obtained from SNP arrays are highly prone to batch effects, which reflect systematic variability during sample processing . Although genotype clustering algorithms are relatively robust to these batch effects, copy number data are not. This systematic variability will result in biased estimates of copy number and thus can negatively impact downstream statistical analyses.
There are numerous software packages available for genotyping and for copy number estimation, of which most have options available for removing systematic variability. However, each software package uses different algorithms for removing this variation resulting in different effects on downstream statistical analyses. In an attempt to compare CNV software for oligonulceotide data, Baross et al.  compared four software packages for preprocessing and analyzing CNV data obtained from the Affymetrix 100K SNP array platform: Copy Number Analyzer for GeneChip (CNAG 1.1), dChip, Affymetrix GeneChip® Chromosome Copy Number Analysis Tool (CNAT 3.0), and Gain and Loss Analysis of DNA (GLAD). Baross and colleagues found that the numbers and types of CNVs varied significantly across the four software packages and thus concluded that at least two software packages are necessary in order to identify all real copy number aberrations. Their assumption is that the algorithms have different strengths and thus taking the union of all the copy number aberrations should increase sensitivity. More recently, Winchester et al.  compared various freely- and commercially-available software for detecting germline CNVs using both Illumina and Affymetrix SNP arrays. With respect to freely-available software they primarily evaluated QuantiSNP and PennCNV. Winchester and colleagues also suggest applying two algorithms on a single dataset in order to obtain the most informative results and recommend using software that was designed specifically for the CNV array platform being used.
Both Baross et al.  and Winchester et al.  compared results obtained after segmentation, and thus did not evaluate the agreement, or lack thereof, in the locus-level copy number data. Furthermore, since the Baross et al.  publication, the technology has advanced and there have been additional algorithms developed that are now routinely used. Thus, the goal of this paper is to compare freely-available software that is routinely used for CNV data obtained from the Affymetrix Genome-Wide Human SNP Array 6.0 platform, the newest Affymetrix SNP platform that has an order of magnitude more probes than the older 100K platform. The SNP Array 6.0 has 1.8 million genetic markers, including more than 906,600 SNPs and 946,000 probes for the detection of CNV.
Herein, we provide a description of four freely-available software packages that are commonly used for CNV analysis of data generated from Affymetrix Genome-Wide Human SNP Array 6.0 platform. We compare both the bias and variance in the locus-level copy number data as well as concordance of the segmentation results obtained using a hidden Markov model (HMM). Throughout, we report on results from autosomal chromosomes only and we often use chromosome 22 as an informative example.
Results and Discussion
We used a germline dataset consisting of 1,418 Caucasian samples to empirically evaluate four software packages developed for the Affymetrix 6.0 SNP array platform. The four software packages evaluated were PennCNV , Aroma.Affymetrix , Affymetrix Power Tools (APT)  and Corrected Robust Linear Model with Maximum Likelihood Distance (CRLMM) . It is important to note that PennCNV, Aroma.Affymetrix and APT estimate relative copy number, relative to a reference sample or cohort of reference samples. In contrast, CRLMM estimates absolute copy number. Scharpf et al.  advocates estimating absolute copy number in comparison to relative copy number stating that the disadvantages of estimating relative copy number are that a reference set is necessary, a deviation from the normal two copies can either represent an aberration in the test sample or the reference set, and lastly, that the allelic copy number at polymorphic loci is often ignored.
List of CNV analysis software
locus-level CN (LRR = Log2R)†
Normalization (default target distribution)
Genomic-wave or GC correction
*R = (A + B)/
R = (A + B)/
median(A + B)
First, calibrates for offset and allelic crosstalk. Second, performs quantile (self).
Post normalization, corrects for PCR fragment length and GC content.
Affymetrix Power Tools (APT)
R = (A + B)/
median(A + B)
Standard argument to specify in the linear model
Log 2 R denotes the relative locus-level copy number of the sample of interest relative to a reference sample(s). PennCNV utilizes the HapMap samples as the reference data (denominator) for calculating Log2R, whereas Aroma.Affymetrix and APT utilize the data at hand as their reference data. CRLMM estimates absolute copy number using a linear model and thus does not require a reference sample(s); according to their documentation, CRLMM requires at least 10 samples in order to obtain accurate estimates of the model parameters.
Background correction attempts to remove optical background and non-specific hybridization . PennCNV and APT do not correct for background hybridization. Aroma.Affymetrix corrects for allelic crosstalk prior to performing quantile normalization and CRLMM corrects for optical background and non-specific hybridization using a linear model after performing quantile normalization.
Across-array normalization attempts to correct for systematic variability induced by array manufacturing, sample preparation, and labelling, hybridization and scanning of the arrays . By default, all four software packages apply quantile normalization, which makes the distribution of probe intensities for each array equivalent . Although each of the software applies quantile normalization, they utilize different target distributions. Particularly, Aroma.Affymetrix and APT utilize the data at hand to define the target distribution, whereas PennCNV and CRLMM use the HapMap samples to define a target distribution.
Genomic-wave is an artefact that has been observed in both array comparative genomic hybridization (aCGH) and SNP data and is thought to be correlated with GC content [9, 10]. By default, only Aroma.Affymetrix corrects for genomic waves; however, options are available in PennCNV and APT to correct for genomic waves. Specifically, Aroma.Affymetrix corrects for GC content and PCR fragment length to the post quantile-normalized data.
Summary of all F-statistics for chromosome 22 from performing analysis of variance (ANOVA) on the locus-level copy number data by 96-well plate; there were 23 plates utilized in the GENOA study
Comparison of bias and variance of locus-level data
Agreement of locus-level copy number data
Up to this point, for ease of explanation, we have only used data from chromosome 22 to compare the four software programs. Similar trends were observed across the other autosomes (data not shown).
Concordance of detected segmentation regions
In addition to evaluating the agreement of locus-level copy number data across the different software, we also evaluated the concordance (or discordance) of identified regions of copy number gain and loss as obtained from the HMM algorithm. Here, we only compare the results obtained from CRLMM and PennCNV; of the four software packages evaluated, they are the only two that contain a segmentation algorithm. Although CRLMM itself does not contain a segmentation algorithm, the same authors developed VanillaIce  and thus the results from CRLMM can be directly imported into VanillaIce without extra effort (hereafter referred to as CRLMM/VanillaIce).
It is important to note that while neither CRLMM/VanillaIce nor PennCNV apply CNV-specific quality-control metrics to the segmented data by default, only PennCNV provides CNV-specific quality-control metrics that can be used to identify potentially poor CNV samples (see Methods). Of the 1,418 samples analyzed, 136 samples were identified as potentially poor CNV samples in the PennCNV analysis. CRLMM suggests using a signal-to-noise measure to identify samples that have poor quality for genotype calling but do not have additional CNV-specific quality-control measures. Using the signal-to-noise measure, CRLMM identified 51 samples that were poor samples for genotyping purposes, 35 of which PennCNV also identified as poor samples. The 152 samples identified as poor samples via PennCNV and CRLMM were removed when evaluating concordance between PennCNV and CRLMM/VanillaIce.
Summary of CNV regions on chromosome 22 using HMM algorithm for 1,266 subjects that passed PennCNV and CRLMM QC metrics
Median Number CNVs per patient (Min, 25%, 75%, Max)
30 (4, 21, 42, 438)
39 (13, 31, 48, 96)
Concordance of detected CNV segments using PennCNV and CRLMM/VanillaIce for 1,266 subjects that passed PennCNV and CRLMM QC metrics
Our objective was to compare commonly-used freely-available software algorithms for analyzing CNV data obtained from the Affymetrix 6.0 SNP array platform. Specifically, we compared Affymetrix Power Tools (APT), Aroma.Affymetrix, PennCNV and CRLMM. Of the four software packages that we compared APT performed the best with respect to bias; that is, APT on average had median locus-level copy numbers closest to a value of two. In fact, the tight bounds associated with the bias for APT suggests that there could be an algorithmic assumption made by the software; however, we could not identify any such discussion in their documentation. Although APT had the smallest bias, PennCNV and Aroma.Affymetrix had the smallest variability associated with the median locus-level copy number. Thus, if one is interested in performing statistical tests on the locus-level copy number data, our empirical results suggest that PennCNV and Aroma.Affymetrix are the optimal software packages of those evaluated herein.
Batch effects are an artefact of processing samples in multiple laboratories, by multiple technicians, using reagents from multiple batches, or other sample processing steps that are not constant across samples and affects individual probes differently . Methodologies to remove probe-specific batch effects are an area of active research. Here, we have shown empirically that the extent of probe-specific batch effects post analytical processing is dependent on the software used. Furthermore, even though CRLMM has a default option to remove batch effects, probe-specific batch effects were present in the post-processed locus-level copy number data. Additionally, probe-specific batch effects were more evident in the post-processed CRLMM and PennCNV locus-level copy number data in comparison to the Aroma.Affymetrix and APT post-processed data. This empirical observation was surprising and thus warrants additional evaluation.
It is interesting - and maybe surprising - that so many deleted segments were estimated to be false positives, especially since both PennCNV and VanillaIce utilize genotype and B-allele frequency information in their HMM algorithms. Even though our estimated false-positive rate appears high, our results agree with the rates shown by Baross et al. . Undoubtedly, the gold standard would be to have known-spike in data in which to compare software and likewise determine the true false-positive rate or to validate each candidate CNV region using an independent technology. Unfortunately, validation is not feasible for the thousands of candidate CNVs identified in the GENOA data.
Of the software evaluated, only PennCNV and CRLMM (through the use of VanillaIce, which was developed by the same authors) provide locus-level copy number data as well as a segmentation routine and thus allow the analyst to complete all data processing steps without having to reformat the data for use in another software package. PennCNV provides CNV-specific quality-control metrics to aid in identifying potentially poor CNV samples, whereas CRLMM or VanillaIce does not. That is, CRLMM suggests using a signal-to-noise measure to identify samples that have poor quality for genotype calling but do not have additional CNV-specific quality-control measures. We observed that PennCNV detects almost all of the regions that CRLMM/VanillaIce does as well as additional regions of copy number gain and loss. Although some of these additional regions may be false positives, the estimated false-positive error rate associated with deletions was similar for PennCNV and CRLMM/VanillaIce and thus the additional regions are likely not all false positives.
As discussed by Lai et al. , it is very difficult to compare complicated algorithms as each algorithm has its own set of parameters that must be optically tuned. Therefore, it is possible that the results discussed herein would change if the parameters associated with the software were fine tuned to fit the data more precisely. Unfortunately, there is little-to-no guidance provided by most software for evaluating and choosing optimal parameter settings. This was particularly true for the HMM algorithms provided by PennCNV and VanillaIce. Thus, we evaluated each algorithm using the default parameters. Although not optimal, this is what many analysts - even experienced analysts - will ultimately do until developers of software provide adequate documentation and guidance for evaluating and choosing parameters for complicated algorithms. Until such guidance exists, we recommend trying multiple algorithms, evaluating concordance/discordance as we have done here and subsequently consider the union of regions for downstream association tests. Others have suggested a similar approach [2, 3] assuming that algorithms have different strengths and thus taking the union of all the copy number aberrations should increase sensitivity.
Affymetrix 6.0 SNP array: GENOA Data
The samples included 1,418 of the non-Hispanic white adults enrolled in the Genetic Epidemiology Network of Arteriopathy (GENOA) study of the Family Blood Pressure Program (FBPP), a study designed to identify germline genetic determinants of hypertension in multiple ethnic groups. These samples were genotyped using the Affymetrix SNP Array 6.0 and all had contrast QC values greater than 0.4.
Copy number analysis was performed using PennCNV (Version 2010May01) software using the default parameters . Locus-level copy number for sample i and probe j is calculated as Log2 Rij = log2 (Robserved/Rexpected), where R observed is the summation of the intensity for the A and B allele at probe j for sample i and Rexpected for probe j is computed from linear interpolation of canonical genotype clusters . The transformation was applied for plotting purposes and for calculating bias and variance. Affymetrix Power Tools (APT) was used to obtain genotype calls; the genotype calls are required for PennCNV copy number estimation. Regions of copy number aberrations were identified using the hidden Markov model (HMM) algorithm provided within PennCNV. The PennCNV-Affy Protocol (http://www.openbioinformatics.org/penncnv/penncnv_tutorial_affy_gw6.html) was followed to obtain the Log2R and B-allele frequency (BAF) values. PennCNV provides CNV-specific quality-control metrics for which to identify potentially poor CNV samples. Subjects were excluded if they had more than 100 CNV intervals detected, B-allele drift > 0.0125, wave factor > 0.05, or Log2R standard deviation > 0.40. The wave factor represents the overall waviness or variation of signal intensity and the B-allele drift is the fraction of "abnormal" markers that do not cluster in the usual positions (0, 0.5, 1); this number is the median of all chromosomes. In some cases when a portion of the array has genotyping failure (for example, a corner of the array is dried), most other markers will look normal but some markers will appear very random; the B-allele drift is useful for detecting these situations (personal correspondence with PennCNV). Additional File 2 provides the code used to run PennCNV on the GENOA data.
Copy number analysis was performed using Aroma.Affymetrix (Version 1.3.0) software using the default parameters (http://www.aroma-project.org) . Locus-level copy number for sample i and probe j is calculated as , where is the normalized total copy number and is a reference signal at probe j typically representing the mean diploid signal. The transformation was applied for plotting purposes and for calculating bias and variance. Additional File 2 provides the code used to run Aroma.Affymetrix on the GENOA data.
Affymetrix Power Tools (APT)
Copy number analysis was performed using Affymetrix Power Tools (Version 1.12.0) software using the default parameters . Locus-level copy number for sample i and probe j is calculated as , where is the normalized total copy number and is a reference signal at probe j typically representing the mean diploid signal. The transformation was applied for plotting purposes and for calculating bias and variance. Additional File 2 provides the code used to run APT on the GENOA data.
Copy number analysis was performed using CRLMM (Version 1.4.3) software using the default parameters . CRLMM estimates absolute copy number using a linear model that can be summarized as I = O + NS + S, where I denotes the observed intensity, O denotes optical background, NS denotes nonspecific hybridization and S denotes the change in the average intensity at a given locus per each integer increase in the allelic copy number. Absolute copy number is then estimated as ; please see  for specifics. Batch was defined according to the 96-well plate the sample was processed on. Regions of copy number aberrations were identified using the HMM algorithm provided by VanillaIce (Version 1.6.0) . Although CRLMM does not itself perform segmentation, the copy number data produced by CRLMM can be directly imported into VanillaIce since they were developed by the same authors. Subjects were excluded if they had a signal-to-noise measure larger than 0.40. Additional File 2 provides the code used to run CRLMM and VanillaIce on the GENOA data.
The median locus-level copy number was computed across all loci and chromosomes for each sample. Because these are germline data, we expect a median copy number of two representing no change. Bias and variance were evaluated on the locus-level data across all chromosomes. Bias was calculated as the median raw copy number across all probes minus two. To evaluate variability, the median absolute deviation (MAD) and derivative log ratio spread (DLRS) was computed across all loci for each sample and chromosome.
To evaluate the agreement across the four software packages Bland-Altman plots were utilized . A Bland-Altman plot is a plot of the difference between two measurements (X - Y) against the average of the two measurements (X + Y)/2. In comparison to a simple correlation plot of X versus Y, a Bland-Altman plot provides a better visualization of the magnitude of disagreement (error and bias) and better highlights outliers and trends in the disagreement. If the differences in locus-level copy number between two software packages are not related to the magnitude of either copy number measurement, then it is expected that the data will be randomly scattered around the zero horizontal reference line.
where is an indicator function for loci that were detected using both software algorithms and similarly, is an indicator function for loci that were detected by at least one of the software algorithms. Here, we are evaluating concordance of two different algorithms on the exact same sample and thus we calculated concordance on a locus basis instead of on a segment basis. Even though we calculated concordance on locus-level data, we obtained similar levels of concordance that others have reported with respect to segmented-level data .
We thank Robert Scharpf, Henrik Bengtsson and Kai Wang for their helpful comments about CRLMM, Aroma.Affymetrix and PennCNV, respectively. The Genotyping Shared Resource of the Mayo Clinic Technology Center performed all of the genotyping. This paper was supported by the National Institutes of Health research grant R01 HL87660 (MdA and SLRK) and the Genetic Epidemiology and Risk Assessment (GERA) program (P30CA15083-36) from the National Cancer Institute.
- Scharpf RB, Ruczinski I, Carvalho B, Doan B, Chakravarti A, Irizarry RA: A multilevel model to address batch effects in copy number estimation using SNP arrays. Biostatistics 2011, 12: 33–50. 10.1093/biostatistics/kxq043PubMed CentralView ArticlePubMedGoogle Scholar
- Baross A, Delaney AD, Li HI, Nayar T, Flibotte S, Qian H, Chan SY, Asano J, Ally A, Cao M, Birch P, Brown-John M, Fernandes N, Go A, Kennedy G, Langlois S, Eydoux P, Friedman JM, Marra MA: Assessment of algorithms for high throughput detection of genomic copy number variation in oligonucleotide microarray data. BMC Bioinformatics 2007, 8: 368. 10.1186/1471-2105-8-368PubMed CentralView ArticlePubMedGoogle Scholar
- Winchester L, Yau C, Ragoussis J: Comparing CNV detection methods for SNP arrays. Briefings in Functional Genomics and Proteomics 2009, 8: 353–366. 10.1093/bfgp/elp017View ArticlePubMedGoogle Scholar
- Wang K, Li M, Hadley D, Liu R, Glessner J, Grant S, Hakonarson H, Bucan M: PennCNV: an integrated hidden Markov model designed for high-resolution copy number variation detection in whole-genome SNP genotyping data. Genome Research 2007, 17: 1665–1674. 10.1101/gr.6861907PubMed CentralView ArticlePubMedGoogle Scholar
- Bengtsson H, Wirapati P, Speed T: GenomeWideSNP 5 & 6. Bioinformatics 2009, 25(17):2149–2156. 10.1093/bioinformatics/btn016PubMed CentralView ArticlePubMedGoogle Scholar
- Bengtsson H, Simpson K, Bullard J, Hansen K: Aroma.affymetrix: A generic framework in R for analyzing small to very large Affymetrix data sets in bounded memory. Report 745, Department of Statistics, University of California, Berkeley, 2008Google Scholar
- Irizarry RA, Hobbs B, Collin F, Beazer-Barclay YD, Antonellis KJ, Scherf U, Speed TP: Exploration, normalization and summaries of high density oligonucleotide array proble level data. Biostatistics 2003, 4: 249–264. 10.1093/biostatistics/4.2.249View ArticlePubMedGoogle Scholar
- Bolstad BM, Irizarry RA, Astrand M, Speed TP: A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics 2003, 19: 185–193. 10.1093/bioinformatics/19.2.185View ArticlePubMedGoogle Scholar
- Marioni JC, Thorne NP, Valsesia A, Fitzgerald T, Redon R, Fiegler H, Andrews TD, Stranger BE, Lynch AG, Dermitzakis ET, Carter NP, Tavaré S, Hurles ME: Breaking the waves: improved detection of copy number variation from microarray-based comparative genomic hybridization. Genome Biology 2007, 8: R228. 10.1186/gb-2007-8-10-r228PubMed CentralView ArticlePubMedGoogle Scholar
- Diskin SJ, Li M, Hou C, Yang S, Glessner J, Hakonarson H, Bucan M, Maris JM, Wang K: Adjustment of genomic waves in signal intensities from whole-genome SNP genotyping platforms. Nucleic Acids Research 2008, 36: e126. 10.1093/nar/gkn556PubMed CentralView ArticlePubMedGoogle Scholar
- Altman DG, Bland JM: Measurement in medicine: the analysis of method comparison studies. The Statistician 1983, 32: 307–317. 10.2307/2987937View ArticleGoogle Scholar
- Scharpf RB, Parmigiani G, Pevsner J, Ruczinski I: Hidden Markov models for the assessment of chromosomal alterations using high-throughput SNP arrays. The Annals of Applied Statistics 2008, 2: 687–713. 10.1214/07-AOAS155PubMed CentralView ArticlePubMedGoogle Scholar
- The 1000 Genomes Project Consortium: A map of human genome variation from population - scale sequencing. Nature 2010, 467: 1061–1073. 10.1038/nature09534PubMed CentralView ArticleGoogle Scholar
- Oldridge DA, Banerjee S, Setlur SR, Sboner A, Demichelis F: Optimizing copy number variation analysis using genome-wide short sequence oligonucleotide arrays. Nucleic Acids Research 2010, 38: 3275–86. 10.1093/nar/gkq073PubMed CentralView ArticlePubMedGoogle Scholar
- Lai WR, Johnson MD, Kucherlapati R, Park PJ: Comparative analysis of algorithms for identifying amplifications and deletions in array CGH data. Bioinformatics 2005, 21: 3763–70. 10.1093/bioinformatics/bti611PubMed CentralView ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.