- Proceedings
- Open Access

# Estimating copy numbers of alleles from population-scale high-throughput sequencing data

- Takahiro Mimori
^{1, 2}, - Naoki Nariai
^{1, 2}, - Kaname Kojima
^{1, 2}, - Yukuto Sato
^{1, 2}, - Yosuke Kawai
^{1, 2}, - Yumi Yamaguchi-Kabata
^{1, 2}and - Masao Nagasaki
^{1, 2}Email author

**16 (Suppl 1)**:S4

https://doi.org/10.1186/1471-2105-16-S1-S4

© Mimori et al.; licensee BioMed Central Ltd. 2015

**Published:**21 January 2015

## Abstract

### Background

With the recent development of microarray and high-throughput sequencing (HTS) technologies, a number of studies have revealed catalogs of copy number variants (CNVs) and their association with phenotypes and complex traits. In parallel, a number of approaches to predict CNV regions and genotypes are proposed for both microarray and HTS data. However, only a few approaches focus on haplotyping of CNV loci.

### Results

We propose a novel approach to infer copy unit alleles and their numbers in each sample simultaneously from population-scale HTS data by variational Bayesian inference on a generative probabilistic model inspired by latent Dirichlet allocation, which is a well studied model for document classification problems. In simulation studies, we evaluated concordance between inferred and true copy unit alleles for lower-, middle-, and higher-copy number dataset, in which precision and recall were ≥ 0.9 for data with mean coverage ≥ 10× per copy unit. We also applied the approach to HTS data of 1123 samples at highly variable salivary amylase gene locus and a pseudogene locus, and confirmed consistency of the estimated alleles within samples belonging to a trio of CEPH/Utah pedigree 1463 with 11 offspring.

### Conclusions

Our proposed approach enables detailed analysis of copy number variations, such as association study between copy unit alleles and phenotypes or biological features including human diseases.

## Keywords

- copy number variation
- high-throughput sequencing data
- latent Dirichlet allocation

## Background

With the recent development of microarray and high-throughput sequencing (HTS) technologies, extensive efforts have elucidated catalogs of haplotypes and genomic variations such as single nucleotide polymorphisms (SNPs), indels, copy number variations (CNVs) and other structural variations in population [1–3]. Based on these catalogs of genomic variations and haplotype structures, a number of genome wide association studies (GWAS) have been conducted to identify associations between genomic variations and phenotypes.

Recent studies also revealed that CNVs affect phenotypes and complex traits, such as human diseases [4–8]. In parallel, a number of methods for detecting CNV loci and inferring copy numbers at each CNV locus have been proposed for both microarray and HTS technologies [9–13]. In particular, high coverage and PCR-free sequencing data enable us to estimate copy numbers of CNVs at higher resolution than former technologies because of its quantitative stability. Even for deletions, which are losses of genomic regions with various size, it requires sequencing data with 20× to 30× depths per diploid genomes for accurate detection [13, 14].

Not only an absolute copy number, but also characteristics of each copy unit at CNV locus are expected to provide critical information about genetic structure and biological function of the locus. For example, nonsynonymous mutations on coding regions are known to affect biological functions. Hence, identifying these copy units is essential for understanding biological effects of CNVs.

*variable sites*in the units (Figure 1) that are supposed to be introduced by mutations during evolutional history in population. If those sequences are similar to each other,

*i.e.*, a ratio of mutated bases among all the bases are less than ten percent, then alignment of reads to the reference genome using tools such as BWA [15] will yield an information including the number of mismatched bases observed at variable sites of the CNV locus. This information reflects copy numbers of each copy unit in sequenced samples.

There are some difficulties in determining sequence and copy numbers of each copy unit at CNV locus from sequenced data of samples. First, observable counts of bases come from diploid sequences for autosomal chromosomes, and the true combination of bases at multiple heterozygotes sites is not apparent from the data. Second, at CNV loci, because copy unit alleles are similar to each other, reads are aligned to the same locus of the reference genome. These problems complicate the task to infer sequences and copy numbers of copy units for each sample from read alignment data. The former problem is called phasing, and several approaches to infer haplotypes from SNP and indel genotypes of multiple samples are developed [16–18]. Recently, phasing approaches of an another type which utilize co-occurrence of multiple heterozygote variants on HTS read are devised [19, 20]. In particular, HapMonster [20] performs simultaneous estimation of haplotype phasing and variant calling and succeeds in improving both of their performances, which suggests that treating genotype and haplotype with a unified statistical model is promising approach. In contrast to a number of phasing approaches have been devised today, there are only a few approaches for inferring haplotypes of the variable sites in CNV locus [21–23] and they all use microarray data of population as input data. There seems to be no approaches for this task from sequencing data at present, which might be due to a lack of PCR-free, high quality, and high coverage sequencing data of population.

In this study, we propose a novel approach to estimate sequence and copy numbers of copy unit at CNV locus from population-scale sequencing data. In the proposed approach, we construct a generative model of sequenced reads and estimate copy unit sequences and their copy numbers for each sample simultaneously using the expectation maximization (EM) algorithm and the variational Bayesian (VB) inference. The similar models and techniques have been studied in topic models of natural language processing [24, 25]. Recently, several bioinformatics approaches such as TIGAR [26] applies the VB inference to estimate transcript isoform abundance from RNA-Seq data.

Due to a limited resolution in identifying allelic ratio with microarray data, previous approaches have been applied to CNVs whose copy numbers are less than or equal to four per diploid [21–23]. On the contrary, our probabilistic model can analyze higher copy number loci with high coverage HTS data, in which the computational complexity is linear with respect to the number of samples and the number of copy unit alleles.

We verified performance of the approaches in simulation studies with various configurations of copy numbers. In a real data analysis, we apply our methods to HTS data of 1123 samples at highly variable salivary amylase gene locus and a pseudogene locus. We also confirmed consistency in predicting copy numbers of copy unit alleles for CEPH trio samples with 11 offspring.

## Methods

### Preprocessing

A precondition is that there is sequenced read data of *N* individual samples, and the read data is aligned to reference sequences that represent sequences at predefined CNV loci. We assume that each sample has combinations of *K* copy unit alleles with which each one has copy number more than or equal to zero at the CNV loci. From aligned data of multiple samples, we can identify *M* variable sites of the alleles to distinguish them.

### The generative model

We construct a generative model of aligned reads at variable sites of CNV loci. In the model, each observed base of the *n*-th sample is assumed to be generated from one of *K* alleles that follows *K* dimensional multinomial distribution with the parameters *θ*_{
n
}, and the base observation probability at the variable site *x* from allele *k* follows multinomial distribution with the parameters *ϕ*_{
kx
}. We also introduce Dirichlet priors αfor parameters *θ*_{
n
}.

*b*

_{ nxt }denotes the

*t*-th observed base at variable site

*x*of sample

*n*, which is one of the nucleotide characters: Λ = {

*A, T, C*,

*G*},

*z*

_{ nxt }denotes the allele index which generates the base

*b*

_{ nxt }, and

*d*

_{ nx }denotes the number of observed bases at site

*x*of sample

*n*. The three terms in this equation are calculated as follows:

where Γ is the gamma function. In this study, we use *α*_{
k
} = 1 (*k =* 1*... K*), which assumes uniform priors for *θ*_{
n
}.

### The EM algorithm and the VB inference

*z*and

*θ*and a parameter vector φwhich describes emission probability of each base for given variable site

*x*and allele

*k*, and calculate the marginal log likelihood:

where *X =* {*b*} is data, *Y =* {*z, θ*} denotes hidden variables, and Φ= {φ} denotes parameters to be estimated.

*P*(Y\X, Φ) for given parameters Φ(E step) and maximizing the lower bound by varying parameters Φfor given posterior distribution (M step) iteratively. In the latter M step, we further approximate the posterior distribution with factorized functions:

where we introduce new parameters wand rto describe the approximate distributions of *z* and *θ*, respectively, and *I* (statement) denotes the indicator function which equals to 1 if the statement is true, otherwise 0.

*Q*iteratively according to following formulas to maximize the lower bound

*L:*

where Ψ is the digamma function, which is the first derivative of the log Gamma function.

### Computational complexity

In the E step, Eq. (1) shows that *w*_{
nxtk
} depends on *t* only through the observed base *b*_{
nxt
}. From this fact and Eq. (2) and Eq. (3), it is clear that the computational complexity of each iteration is *O*(*NMK*|Λ|), where |Λ| is the number of possible bases. One of the ways to determine the number of alleles *K* is that, estimate the marginal log likelihood of the model for each value *K* with its lower bound and select the most likely value $\phantom{\rule{0.3em}{0ex}}\hat{K}$ which provides highest one.

## Results and discussion

### Simulation analysis 1

#### Data preparation

*M =*16, nucleotide bases of these sites as shown in Table 1 the number of samples

*N =*12 with which four samples have two alleles, another four have three alleles, and the remainder have four alleles, respectively. Each allele of samples is chosen with equally probability from the four alleles. From these samples, we generate histogram of bases at the variable sites that correspond to aligned HTS read data. The number of observed read at each variable site follows a mixture of Poisson distribution for each copy unit allele and their means are set to 15 in this analysis. We also take into account 1% sequencing errors that mutates the correct base to one of the other three bases.

Copy unit alleles and their bases at variable sites used in simulation analysis 1.

No. | Bases at variable sites |
---|---|

1 | ATTGCGATATTGCGAT |

1 | ACGGATTTACGGATTT |

3 | CTTCGGAACTTCGGAA |

4 | CGATTGAACGTCGTAC |

#### Evaluation of the results

*K*as four, which maximized log likelihood

*L*as shown in Figure 2. For evaluation of allele concordance between true and predicted set, we defined precision and recall of the predictions as follows:

*K*

_{0}is the number of true alleles which equals to four in this case and

*r*

_{ kl }represents the ratio of matched bases at variable sites between the predicted allele

*k*and the true allele

*l*. The concrete definition of

*r*

_{ kl }is as follows:

where ${\widehat{b}}_{kx}\equiv \underset{b\in \Lambda}{\mathrm{arg}\mathrm{max}}{\phi}_{kxb}$ is a predicted base at variable site *x* of the predicted allele *k* and ${b}_{lx}^{\left(0\right)}$ is a base at variable site *x* of the true allele *l*.

### Simulation analysis 2

#### Data, preparation

Configurations of copy numbers and number of samples in three datasets used in simulation analysis 2.

Dataset | List of copy numbers and number of samples |
---|---|

a) lower-copy number | 1:5, 2:11, 3:6 |

b) middle-copy number | 2:2, 3:3, 4:4, 5:2, 6:1 |

c) higher-copy number | 3:1, 4:2, 5:3, 6:2, 7:1 |

#### Evaluation of the results

### Real data application

#### Data, preparation

We estimate copy numbers of copy unit alleles at salivary amylase gene (AMY1) locus using publicly available HTS data of 1123 samples, in which 17 are high coverage data around 50× per diploid genome of Coriell CEPH/Utah pedigree 1463 provided by Illumina's Platinum Genomes project [27] and 1106 are low coverage data around 4× per diploid genome released from the 1000 Genomes project [3]. AMY1 is known as a CNV locus with highly variable copy numbers [28], whose typical copy number is six to ten.

We obtained BAM files, in which HTS reads were aligned to the hgl9 reference sequence. We extracted paired-end reads in FASTQ format that aligned to amylase gene locus chrl:104,129, 283-104, 320, 531. Then, we aligned the extracted reads with BWA [15] to a custom reference sequence that is comprised of extracted sequences of gene coding loci of AMY1A and AMY2A from the hgl9 reference sequence. After the alignment process, we identify 57 variable sites within 835-th to 9200-th bases of AMY1A locus in the 17 high coverage samples. To determine these variable sites, we adopted criterion that observed counts of minor bases ≥ 15 at least one of the 17 samples. For simplicity of analysis, we omitted variable sites that contained deletions whose observed ratio is ≥ 0.1 against the total observed bases at the same sites.

#### Evaluation of the results

*K*to four which provides maximal lower bound of the log likelihood when varying

*K*between one and 15. From estimated results, we confirmed that copy numbers of each allele for trio samples: NA12877 (father), NA12878 (mother), and their 11 offspring: NA12879, NA12880, NA12881, NA12882, NA12883, NA12884, NA12885, NA12886, NA12887, NA12888, and NA12893 are consistent in a sense of heredity pattern of diploid alleles indirectly (Figure 5), that is, estimated copy number of each allele for offspring is less than or equal to the sum of that of its parent samples (NA12877 and NA12878).

We also conducted the similar analysis for CHEK2P2, which is a pseudogene located at chrl5:20,487,996-20,496,839. The locus had 175 variable sites and its estimated copy numbers ranged from three to 12 in members of the CEPH/Utah pedigree 1463. The copy unit allele K was chosen as 12, which maximized the lower bound of the log likelihood when K was set from one to 15. The estimated copy numbers of haplotypes were consistent within family members, as similar to AMY1 locus.

## Conclusions

We proposed a novel computational approach to simultaneously infer copy unit alleles and their numbers in each sample at CNV loci from HTS data. We verified the prediction performance in estimating copy unit alleles in two different simulation analyses. In the simulation analysis 1, we prepared four alleles with 16 variable sites and succeeded to predict true number and sequences of prepared alleles by maximizing the lower bound of log likelihood. In the simulation analysis 2, we extracted known haplotype sequences from 45 males in CEU population and constructed artificial CNV alleles with lower-, middle-, and higher-copy numbers and varied depth of coverage. Although a dataset with higher copy numbers is more difficult for accurate estimation than with lower copy numbers, the approach achieved allele concordance > 0.9 in terms of precision, recall, and F-measure with HTS data of 10× mean depth of coverage per copy unit. We also applied the approach at highly variable salivary amylase gene locus and a pseudogene locus from HTS real data of 1123 samples that includes 17 high- and 1106 low-coverage alignment data. With this application, we confirmed consistency of inferred copy number for each allele of CEPH/Utah trio samples (NA12877, NA12878, and their 11 offspring).

We model copy numbers of copy unit alleles for each sample by relative amount of the alleles in the sample, instead of inferring combination of integer copy numbers of possible alleles explicitly which will be intractable for high copy number alleles due to the exponentially increasing number of possible states. Thanks to this feature, the computational complexity is linear order of number of alleles *K*, number of samples *N*, and number of variable sites *M* at CNV locus, as described in Methods section, and our approach is robust to increase in the number of alleles and samples.

Although this study presents a promising approach for CNV haplotyping from HTS data, there are several challenges beyond the current approach. First, utilizing full features of HTS data, such as base qualities, paired-end information, and cooccurrence of variable sites on single reads may improve the inference accuracy. Second, using or inferring the population history around CNV locus might improve the accuracy. However, it might be also needed to consider various events in the population history other than mutations such as duplications and recombinations around CNV loci and gene conversions [29, 30], which will complicate the problem. Inference of diplotypes of CNV loci is also an important future work. Third, applying different approximation techniques such as a collapsed VB inference [25] or belief propagation [31] used for topic models of natural language processing to our model might improve accuracy of the inference.

## Declarations

### Acknowledgements

This work was supported (in part) by MEXT Tohoku Medical Megabank Project. All computational resources were provided by the Supercomputing services, Tohoku Medical Megabank Organization, Tohoku University.

**Declarations**

The publication costs for this article were partly funded by MEXT Tohoku Medical Megabank Project.

This article has been published as part of *BMC Bioinformatics* Volume 16 Supplement 1, 2015: Selected articles from the Thirteenth Asia Pacific Bioinformatics Conference (APBC 2015): Bioinformatics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/16/S1

## Authors’ Affiliations

## References

- International HapMap Consortium, et al: A haplotype map of the human genome. Nature. 2005, 437 (7063): 1299-1320. 10.1038/nature04226.View ArticleGoogle Scholar
- International HapMap 3 Consortium, et al: Integrating common and rare genetic variation in diverse human populations. Nature. 2010, 467 (7311): 52-58. 10.1038/nature09298.View ArticleGoogle Scholar
- 1000 Genomes Project Consortium, et al: A map of human genome variation from population-scale sequencing. Nature. 2010, 467 (7319): 1061-1073. 10.1038/nature09534.View ArticleGoogle Scholar
- McCarroll SA, Altshuler DM: Copy-number variation and association studies of human disease. Nature genetics. 2007, 39: 37-42. 10.1038/ng2080.View ArticleGoogle Scholar
- Stranger BE, Forrest MS, Dunning M, Ingle CE, Beazley C, Thorne N, Redon R, Bird CP, De Grassi A, Lee C, et al: Relative impact of nucleotide and copy number variation on gene expression phenotypes. Science. 2007, 315 (5813): 848-853. 10.1126/science.1136678.PubMed CentralView ArticlePubMedGoogle Scholar
- Zhang F, Gu W, Hurles ME, Lupski JR: Copy number variation in human health, disease, and evolution. Annual review of genomics and human genetics. 2009, 10: 451-481. 10.1146/annurev.genom.9.081307.164217.PubMed CentralView ArticlePubMedGoogle Scholar
- Conrad DF, Pinto D, Redon R, Feuk L, Gokcumen O, Zhang Y, Aerts J, Andrews TD, Barnes C, Campbell P, et al: Origins and functional impact of copy number variation in the human genome. Nature. 464 (7289): 704-712.Google Scholar
- Almal SH, Padh H: Implications of gene copy-number variation in health and diseases. Journal of human genetics. 2011, 57 (1): 6-13.View ArticlePubMedGoogle Scholar
- Winchester L, Yau C, Ragoussis J: Comparing CNV detection methods for SNP arrays. Briefings in functional genomics & proteomics. 2009, 8 (5): 353-366. 10.1093/bfgp/elp017.View ArticleGoogle Scholar
- Medvedev P, Stanciu M, Brudno M: Computational methods for discovering structural variation with next-gene rati on sequencing. Nature methods. 2009, 6: 13-20. 10.1038/nmeth.1374.View ArticleGoogle Scholar
- Abyzov A, Urban AE, Snyder M, Gerstein M: CNVnator: an approach to discover, genotype, and characterize typical and atypical CNVs from family and population genome sequencing. Genome research. 2011, 21 (6): 974-984. 10.1101/gr.114876.110.PubMed CentralView ArticlePubMedGoogle Scholar
- Zhu M, Need AC, Han Y, Ge D, Maia JM, Zhu Q, Heinzen EL, Cirulli ET, Pelak K, He M, et al: Using ERDS to infer copy-number variants in high-coverage genomes. The American Journal of Human Genetics. 2012, 91 (3): 408-421. 10.1016/j.ajhg.2012.07.004.View ArticlePubMedGoogle Scholar
- Mimori T, Nariai N, Kojima K, Takahashi M, Ono A, Sato Y, Yamaguchi-Kabata Y, Nagasaki M: iSVP: an integrated structural variant calling pipeline from high-throughput sequencing data. BMC systems biology. 2013, 7 (6): 1-8.Google Scholar
- Sims D, Sudbery I, Ilott NE, Heger A, Ponting CP: Sequencing depth and coverage: key considerations in genomic analyses. Nature Reviews Genetics. 2014, 15 (2): 121-132. 10.1038/nrg3642.View ArticlePubMedGoogle Scholar
- Li H, Durbin R: Fast and accurate short read alignment with Burrows-Wheeler transform. Bioinformatics. 2009, 25 (14): 1754-1760. 10.1093/bioinformatics/btp324.PubMed CentralView ArticlePubMedGoogle Scholar
- Browning SR, Browning BL: Rapid and accurate haplotype phasing and missing-data inference for whole-genome association studies by use of localized haplotype clustering. The American Journal of Human Genetics. 2007, 81 (5): 1084-1097. 10.1086/521987.View ArticlePubMedGoogle Scholar
- Delaneau O, Marchini J, Zagury J-F: A linear complexity phasing method for thousands of genomes. Nature methods. 2012, 9 (2): 179-181.View ArticleGoogle Scholar
- Delaneau O, Zagury J-F, Marchini J: Improved whole-chromosome phasing for disease and population genetic studies. Nature methods. 2013, 10 (1): 5-6.View ArticlePubMedGoogle Scholar
- DePristo MA, Banks E, Poplin R, Garimella KV, Maguire JR, Hartl C, Philippakis AA, del Angel G, Rivas MA, Hanna M, et al: A framework for variation discovery and genotyping using next-gene rat ion DNA sequencing data. Nature genetics. 2011, 43 (5): 491-498. 10.1038/ng.806.PubMed CentralView ArticlePubMedGoogle Scholar
- Kojima K, Nariai N, Mimori T, Yamaguchi-Kabata Y, Sato Y, Kawai Y, Nagasaki M: HapMonster: A statistically unified approach for variant calling and haplotyping based on phase-informative reads. Lecture Notes in Computer Science. 2014, 8574: 107-118.View ArticleGoogle Scholar
- Coin LJ, Asher JE, Walters RG, Moustafa JSE-S, de Smith AJ, Sladek R, Balding DJ, Froguel P, Blakemore AI: cnvHap: an integrative population and haplotype-based multiplatform model of SNPs and CNVs. Nature methods. 2010, 7 (7): 541-546. 10.1038/nmeth.1466.View ArticlePubMedGoogle Scholar
- Kato M, Yoon S, Hosono N, Leotta A, Sebat J, Tsunoda T, Zhang MQ: Inferring haplotypes of copy number variations from high-throughput data with uncertainty. G3 (Bethesda). 2011, 1 (1): 35-42. 2011.View ArticleGoogle Scholar
- Su S-Y, Asher JE, Jarvelin M-R, Froguel P, Blakemore AI, Balding DJ, Coin LJ: Inferring combined CNV/SNP haplotypes from genotype data. Bioinformatics. 2010, 26 (11): 1437-1445. 10.1093/bioinformatics/btq157.PubMed CentralView ArticlePubMedGoogle Scholar
- Blei DM, Ng AY, Jordan MI: Latent dirichlet allocation, the Journal of machine Learning research. 2003, 3: 993-1022.Google Scholar
- Teh YW, Newman D, Welling M: A collapsed variational Bayesian inference algorithm for latent Dirichlet allocation. Advances in Neural Information Processing Systems. 2006, 1353-1360.Google Scholar
- Nariai N, Hirose O, Kojima K, Nagasaki M: TIGAR: transcript isoform abundance estimation method with gapped alignment of RNA-Seq data by variational Bayesian inference. Bioinformatics. 2013, 29 (18): 2292-2299. 10.1093/bioinformatics/btt381.View ArticlePubMedGoogle Scholar
- Illumina Corporation: Platinum genomes project. 2013, [http://www.platinumgenomes.org]Google Scholar
- Perry GH, Dominy NJ, Claw KG, Lee AS, Fiegler H, Redon R, Werner J, Villanea FA, Mountain JL, Misra R, et al: Diet and the evolution of human amylase gene copy number variation. Nature genetics. 2007, 39 (10): 1256-1260. 10.1038/ng2123.PubMed CentralView ArticlePubMedGoogle Scholar
- Innan H, Kondrashov F: The evolution of gene duplications: classifying and distinguishing between models. Nature Reviews Genetics. 2010, 11 (2): 97-108.PubMedGoogle Scholar
- Teshima KM, Innan H: The coalescent with selection on copy number variants. Genetics. 2012, 190 (3): 1077-1086. 10.1534/genetics.111.135343.PubMed CentralView ArticlePubMedGoogle Scholar
- Zeng J, Cheung WK, Liu J: Learning topic models by belief propagation. Pattern Analysis and Machine Intelligence. IEEE Transactions. 2013, 35 (5): 1121-1134.Google Scholar

## Copyright

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.