On the association analysis of CNV data: a fast and robust family-based association method

Background
Copy number variation (CNV) is known to play an important role in the genetics of complex diseases, and several methods have been proposed to detect association of CNV with phenotypes of interest. Statistical methods for CNV association analysis follow one of two strategies. In the first, the copy number is estimated by maximum likelihood and the association of the expected copy number with the phenotype is tested. In the second, the observed probe intensity measurements are used directly to detect association of CNV with the phenotype.

Results
For each strategy we provide a statistic that can be applied to extended families. The computational efficiency of the proposed methods enables genome-wide association analysis, and we show with simulation studies that the proposed methods outperform existing approaches. In particular, we found that the first strategy is always more efficient than the second, whether or not the copy numbers of each individual are well identified. With the proposed methods, we performed genome-wide CNV association analyses of a hematological trait, hematocrit, in 521 Korean family samples.

Conclusions
We found that statistical analysis with the expected copy number is more powerful than the statistic based on the probe intensity measurements, regardless of the accuracy of the copy number estimates.

Electronic supplementary material
The online version of this article (doi:10.1186/s12859-017-1622-z) contains supplementary material, which is available to authorized users.


Supplementary Text 1 Rao's score test statistic with the expected copy number
We let = 2 + 2 , = − − and let denote all the parameters except for and . If we assume that the copy number for each individual is known, the conditional log density function of Y is ( , ; | ) = − 1 2 | | − 1 2 −1 .
The score for the proposed likelihood is thus Instead of the expected Fisher information matrix, the hybrid method has been used; that is, the nonzero elements of the Fisher information are replaced by the observed information (Kent, 1982). Then, under the null hypothesis, the information is the score test statistic T1 becomes and T1 follows the chi-square distribution with a single degree of freedom under H0.
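As a rough illustration of the kind of statistic described above, the following sketch computes a generic score test for a single fixed effect in a Gaussian model with a known phenotypic covariance matrix. It is not the authors' implementation: the names `y` (phenotype), `z` (covariate matrix), `x` (expected copy number) and `v` (covariance matrix) are assumptions introduced here, the covariance is treated as known rather than estimated under the null, and the expected information is used in place of the hybrid method mentioned in the text.

```python
import math
import numpy as np

def score_test_fixed_effect(y, z, x, v):
    """Generic score test for the effect of x on y, adjusting for z.

    Assumes y ~ N(z @ alpha + x * beta, v) with v known (an assumption of
    this sketch) and tests H0: beta = 0.
    """
    y = np.asarray(y, dtype=float)
    z = np.asarray(z, dtype=float)
    x = np.asarray(x, dtype=float)
    v_inv = np.linalg.inv(np.asarray(v, dtype=float))

    # Generalized least squares fit of the covariate effects under H0.
    zt_vinv = z.T @ v_inv
    alpha_hat = np.linalg.solve(zt_vinv @ z, zt_vinv @ y)
    resid = y - z @ alpha_hat

    # Score for beta evaluated at the null estimates.
    score = x @ v_inv @ resid

    # Information for beta after adjusting for the covariates.
    xz = zt_vinv @ x
    info = x @ v_inv @ x - xz @ np.linalg.solve(zt_vinv @ z, xz)

    stat = score ** 2 / info
    # Upper tail of a chi-square distribution with one degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value
```

For family data, `v` would carry the within-family correlation (for example, through a kinship matrix); with `v` set to the identity matrix and `z` a column of ones, the statistic reduces to the familiar score test of simple linear regression with unit residual variance.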

Supplementary Text 2 Rao's score test statistic with the probe intensity measurements
If the unobserved copy number vector λ is given, the score for is −1 . The copy number is unknown and is replaced by the expected score for as follows: We assume that ( |X) has a linear relationship with the probe intensity measurements and, if we let 1 = ( −1 ) −1 −1 , our score under the null hypothesis is equivalent to We let = ( 11 , … , ) and we assume that cov( , ′ ) = ′ . The Mendelian transmission of copy numbers makes similar to , but some deviation is expected because of the presence of measurement error.

Supplementary Figure 1. Results of the clustering analysis with family samples (A) and cohort samples (B).
(A) The histogram depicts the clustering result using log2 ratio values based on the signal intensity of the reference and each sample. PedCNV identified two copy-number classes in the family samples, but we estimated that this region is composed of three copy-number classes, denoted C1, C2, and C3. This discrepancy in copy-number genotype might be caused by the difference in sample size between the two studies. Moreover, because the T2 statistic of PedCNV is robust to badly separated clusters, this bias in the CNV calls does not affect the association results. However, to evaluate the exact CNV genotype, we chose samples from C1, C2, and C3 to carry out validation by PCR experiment. The PCR experiment was conducted to evaluate concordance between genotypes and copy-number estimates within each cluster (normal, heterozygous deletion and homozygous deletion).
The PCR product size of the normal allele was 1519 bp, whereas that of the deleted allele was 690 bp. Validation results show complete concordance with the estimated genotypes except for 3 samples (red arrows).

Supplementary Figure 4. Validation results of replication samples by TaqMan qPCR experiment.
Supplementary The empirical powers of T1 and T1*, which use the expected copy number and the most probable copy number respectively, were estimated at various significance levels with 2,000 replicates for different values of under BSC, MSC and WSC. The score test using the expected copy number is denoted by T1.

The estimated parameters of T1 and T1*, which use the expected copy number and the most probable copy number respectively, were estimated at various significance levels with 2,000 replicates for different values of under BSC, MSC and WSC.
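The distinction between the expected copy number (used by T1) and the most probable copy number (used by T1*) can be sketched as follows. The sketch assumes that a clustering step has already produced, for each individual, posterior probabilities over a set of candidate copy-number classes; the function name and its inputs `posterior` and `copy_values` are hypothetical and do not come from the original analysis.

```python
import numpy as np

def copy_number_summaries(posterior, copy_values):
    """Return the expected and the most probable copy number per individual.

    posterior:   (n_individuals, n_classes) array of class probabilities,
                 assumed to come from a clustering of probe intensities.
    copy_values: (n_classes,) array with the copy number of each class.
    """
    posterior = np.asarray(posterior, dtype=float)
    copy_values = np.asarray(copy_values, dtype=float)

    # Probability-weighted average of the candidate copy numbers.
    expected = posterior @ copy_values

    # Copy number of the class with the highest posterior probability.
    most_probable = copy_values[np.argmax(posterior, axis=1)]
    return expected, most_probable
```

When the clusters are well separated, the posterior probabilities are close to 0 or 1 and the two summaries nearly coincide; when they overlap, the expected copy number retains the uncertainty that the most probable copy number discards.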