 Methodology article
 Open access
 Published:
Inference of chromosomespecific copy numbers using population haplotypes
BMC Bioinformatics volumeÂ 12, ArticleÂ number:Â 194 (2011)
Abstract
Background
Using microarray and sequencing platforms, a large number of copy number variations (CNVs) have been identified in humans. In practice, because our human genome is a diploid, these platforms are limited to or more accurate for detecting total copy numbers rather than chromosomespecific copy numbers at each of the two homologous chromosomes. Nevertheless, the analysis of linkage disequilibrium (LD) between CNVs and SNPs indicates that distinct copy numbers often sit on their own background haplotypes.
Results
We propose new computational models for inferring chromosomespecific copy numbers by distinguishing background haplotypes of each copy number. The formulated problems are shown to be NPhard and approximation/heuristic algorithms are developed. Simulation indicates that our method is accurate and outperforms the existing approach. By testing the program in 60 parentoffspring trios, the inferred chromosomespecific copy numbers are highly consistent with the law of Mendelian inheritance. The distributions of copy numbers at chromosomal level are provided for 270 individuals in three HapMap panels.
Conclusions
The estimation of chromosomespecific copy numbers using microarray or sequencing platforms was often confounded by a number of factors. This study showed that the integration of background haplotypes is able to improve the accuracies of copy number estimation at chromosome level, especially for the CNVs having strong LD with SNPs in proximity.
Background
Genetic variations exist in many forms in the human genome. Large structural variations such as deletions and duplications are quite common in the human populations, which encompass more base pairs than single nucleotide polymorphisms (SNPs). Among various types of structural variations, copy number variations (CNVs) often occupy regulatory regions of genes and greatly influence phenotypic traits and disease susceptibility [1]. CNV is defined as a DNA segment with length more than 1 kb and observed with various numbers of copies in the population. A number of CNVs have been known to highly associate with several complex diseases such as HIV infection, autoimmunity, autism, Parkinson's, Alzheimer's and Crohn's disease [2â€“6].
The advance of highthroughput array platforms and sequencing technologies enables fast and costeffective scan of CNVs in genomewide scale [7]. Using array Comparative Genomic Hybridization (aCGH) and nextgeneration sequencing, a large number of CNVs have been identified in human and other primates [8â€“12]. In practice, because our human genome is a diploid, most sequencing platforms often report total copy numbers of one individual instead of chromosomespecific copy numbers presented on each of the two homologous chromosomes. For example, suppose there are two diplotype configurations at one CNV locus: 1/1 represents one copy at each of the two chromosomes, and 0/2 indicates a deletion at one chromosome and a duplication at the other. The total copy numbers of these configurations are both experimentally obtained as two, although the underlying mechanisms generating these two configurations are different. Nevertheless, determination of chromosomespecific copy numbers is important in the analysis of population genetics and disease association studies. For instance, the power of detecting positive selection and accuracy of measuring Linkage Disequilibrium (LD) between SNPs and CNVs can be improved through direct use of chromosomespecific copy numbers [13â€“15]. Moreover, identification of the chromosomespecific copy numbers can even shed light on the age of each CNV [1].
Recently, an expectation maximization (EM) algorithm was developed to estimate frequencies of chromosomespecific copy numbers under the assumption of HardyWeinberg equilibrium (HWE) [16]. In reality, the observed allele frequencies do not completely satisfy HWE, because each copy number allele may be sampled more or less in different sequencing projects. In a few occasions, HWE may be even deviated due to directional selection, assortative mating, or migration [17]. Using B allele frequencies (BAF) and log R ratios (LLR) provided by SNP array platforms, a hidden Markov model (HMM) was designed for inferring chromosomespecific copy numbers within a parentoffspring family [18]. In addition, information of allelicspecific copies at each SNP locus (e.g., AAABB) have been also used to indirectly infer chromosomespecific copy numbers [19]. However, BAF, LLR and allelicspecific copies are not always available in each sequencing platform. For example, in nextgeneration sequencing (e.g., SOLid and Illumina), SNPs and CNVs called at these platforms (e.g., Bioscope and SAMTools) do not provide such information. Moreover, the accuracy of allelicspecific copies is often decreased for higher copies and is worse than that of total copy numbers due to crosshybridization [20, 21]. Although traditional haplotype phasing programs (e.g., fastPHASE) may be used for inferring copy number by encoding biallelic CNVs into SNP genotypes, this approach is inadequate to infer multiallelic CNVs [13, 19].
To date, the analysis of LD structure in human genome indicated that many CNVs have strong LD with SNPs in proximity, probably owing to uneven distribution of recombination hot/cold spots or genetic hitchhiking [22â€“25]. Moreover, a number of CNVs have been shown to be taggable using alleles at flanking SNPs [3]. The LD structure between CNVs and SNPs implies different chromosomespecific copy numbers often sit at their own background haplotypes, which can be viewed as fingerprints of each copy number. As a consequence, chromosomespecific copy numbers of each individual are inferable by careful analysis of background haplotypes around each CNV. In recent years, several largescale sequencing projects have constructed complete haplotype and CNV databases across major human populations (e.g., HapMap [26]). Integration of these databases may gain insight into the distribution of chromosomespecific copy numbers in human populations.
In this study, we develop new computational models and combinatorial algorithms for inferring chromosomespecific copy numbers by distinguishing background haplotypes of each copy number. Two optimization problems are formulated, shown to be NPhard, and solved by approximation or heuristic algorithms. Simulation indicates our method is very accurate and is able to outperform existing approach. By testing the program separately for each individual within 60 parentoffspring trios, the inferred chromosomespecific copy numbers are highly consistent with the law of Mendelian inheritance. The distribution of chromosomespecific copy numbers across three human populations indicate that one copy is the major allele as expected, and zero copy (deletion) alleles are much frequent than high copy (duplication) alleles.
Methods
The haplotypes of 270 individuals are downloaded from the Phase II of international HapMap project [26]. For the input of unphased genotypes, the haplotypes were inferred via the PHASE [27] program, which was used by the HapMap project. For highthroughput sequencing data, a number of haplotype assembly tools can also be used to infer the haplotypes [28]. The total copy numbers of 1,319 CNVs typing on the same individuals are retrieved from [9]. We extract SNPs within each CNV as well as SNPs at flanking regions in our study. We compared the SNP distance (i.e. number of SNPs) with the physical distance (e.g., 5 kb) for capturing the extent of LD and found that the LD is more sensitive to physical distance. The simulation results indicated that the accuracy of our algorithm is highest when including SNPs within onefold extension of the physical size of each CNV (Additional file 1, Figure S1). Therefore, the released program will automatically checks the coordinates of CNVs and SNPs and captures SNPs within the onefold extension regions into consideration.
Given a set of haplotype pairs and the total copy number for each individual (Figure 1), the chromosomespecific copy number of each haplotype is determined by first solving a variant of MaxkCut problem, which aims to divide background haplotypes into k clusters. Then, a variant of Constraint Satisfaction Problem (CSP) is solved to assign chromosomespecific copy number to each cluster. Finally, these two procedures are repeated for all possible k in order to determine the best solution.
Haplotypes Clustering via Solving Constrained MaxkCut Problem
Through analysis of LD between SNPs and CNVs, the copy numbers on a CNV are shown to have strong LD with alleles at flanking SNPs [13, 23, 24]. The LD structure implies different chromosomespecific copy numbers tend to sit at their own background haplotypes. We first group haplotypes spanning across each CNV into k clusters (for all possible k) based on their pairwise hamming distance and total copy numbers. Note that odd total copy number implies the underlying two chromosomespecific numbers should be different (e.g., 3 = 0 + 3 or 1 + 2). Haplotypes clustered into the same set may represent haplotype background for the same chromosomespecific copy number. The input total copy numbers and haplotypes are formulated into a weighted graph described as following (see Figure 2):

(1)
Each haplotype is transformed into a vertex.

(2)
The weight of edge between two vertices is the hamming distance between two haplotypes. Note that the same haplotype from different individuals are formulated as multiple vertices with zero distance.

(3)
For the haplotype pair with odd total copy number (e.g., H _{1} and H _{2}), the edge between them is called hard edge.

(4)
For the other haplotype pairs, the edges between them are called soft edges.
Given the above weighted graph with hard and soft edges, these haplotypes are grouped into k clusters by solving a variant of MaxkCut problem (called constrained MaxkCut). A formal definition of the constrained MaxkCut problem is given below.
Problem: Constrained MaxkCut
Given an undirected weighted graph G = (V, E) in which some edges in E are hard and the others are soft, the constrained MaxkCut problem aims to find a partition of vertices in V into k sets (X_{1}, X_{2}, ..., X_{ k } ) such that the total weight of soft edges across different sets (called cut) is maximized, requiring all hard edges must be on the cut.
The original MaxkCut is known to be NPhard [29, 30], which is a special case of this problem when all edges are soft. Therefore, the problem of constrained MaxkCut is also NPhard. In order to efficiently solve the constrained MaxkCut problem, we develop a greedy approximation algorithm which explores larger solution space by randomizing nondeterministic steps. The core procedure of this algorithm is given below (Additional file 2, Figure S2).
Algorithm for Constrained MaxkCut

(1)
Randomly pick k different vertices as initial elements for each k set (X _{1}, X _{2}, ..., X_{ k } ).

(2)
Without violating the constraint of hard edge, randomly pick a remaining vertex and assign it into the set which maximizes the total weight of soft edges across different sets. This step is repeated until all vertices are assigned.
Note that the above procedure involves nondeterministic parts in both steps (i.e., initial k vertices and the order of picking the next vertex). Therefore, this procedure is repeated ten times to explore larger solution space by trying different initial k vertices in step 1 and different order in step 2. The best solution among all trials is outputted as the final solution. The number of repeated iterations is usually a tradeoff between accuracy and efficiency. Nevertheless, we found that the randomized approaches on top of the greedy framework requires only few iterations (Additional file 3, Figure S3). Thus, the implemented can run fast in practice. The following theorem implies that the solution found by this algorithm is quite close to the optimal solution.
Theorem 1. The algorithm for constrained MaxkCut is a (k  2)/(k  1)approximation algorithm for k > 2.
Proof. Without loss of generality, let the order of picking vertices be v_{1}, v_{2}, â‹¯, v_{ n } . Let W denote the total weight of all edges in G and
then
Let X_{ j } be the jth set of partitioned vertices and
where 1 â‰¤ m â‰¤ i  1 , 1 â‰¤ j â‰¤ k . Then
Suppose Cut_{i 1}â‰¥ Cut_{i 2}â‰¥ â‹¯ â‰¥ Cut _{ ik } and v_{ i } cannot be put into set X_{1} due to the hardedge constraint. Then v_{ i } can be put into set X_{2} by the algorithm instead, because the hard edge only appears between haplotype pairs of the same individual. The relation between Cut _{ ij } and W_{ i } can be computed as following:
Therefore,
Let C denote the solution from the greedy algorithm and C* be the optimal solution, then
Since W â‰¥ C*,
Copy Number Assignment via Solving Constraint Satisfaction Problem
After clustering haplotypes into k sets (X_{1}, X_{2}, ..., X_{ k } ), we then assign k different integers to each set, which correspond to k distinct chromosomespecific copy numbers. For each individual, summation of chromosomespecific copy numbers of each haplotype pair should be equal to his/her total copy number. which can be written as the following constraint:
where Copy(H_{i 1}) and Copy(H_{i 2}) are chromosomespecific copy numbers for the ith individual, and Total(i) is his/her total copy number. For the example shown in Figure 3(A), Copy(H_{5})+Copy(H_{6}) = 2 for individual 3. Note that because all haplotypes have been clustered into the same or different sets, eq (1) can be rewritten into the following constraint using their set variables X_{ i } :
where X_{ a } and X_{ b } denote the sets of these two haplotypes after clustering (e.g., X_{1} + X_{3} = 2 for individual 3). For each individual, a set of constraints with two variables similar to eq (2) can be generated by repeating the above formulation (see Figure 3(B)). By assigning distinct integer numbers to these set variables X_{ i } , chromosomespecific copy number of each haplotype can then be determined. Theoretically, all constraints should be satisfied after the assignment, but practically, not all constraints can be satisfied, because some ambiguous haplotypes may not be the true background of the copy number. In order to satisfy as many constraints as possible, chromosomespecific copy numbers are assigned to each set X_{ i } by solving a variant of the constraint satisfaction problem (termed Unique Max2CSP). Given a set of twovariable constraints over n variables (X_{1}, X_{2},..., X_{ n } ), the Unique Max2CSP problem asks for k unique (distinct) integers assigned to each variable which satisfied maximum number of constraints.
Problem: Unique Max2CSP
Given a set of variables X = {X_{1}, X_{2}, ..., X_{ n } }, a set of finite integer domains D = {0, 1, ..., d}, where d â‰¥ n  1, and a set of twovariable constraints C = {C_{1}, C_{2}, ..., C_{ m } } with the following form:
where T_{ l } is a nonnegative integer. The Unique Max2CSP asks for an assignment of n distinct integers in D to X_{1}, X_{2}, ..., X_{ n } that maximizes the total number of satisfied constraints in C.
We first prove a problem called binary Max2CSP is NPhard, in which the integer domain D is restricted to {0, 1}, and values assigned to different variables in X are allowed to be identical (e.g., X_{1}= X_{2}= 1).
Then, the unique Max2CSP problem is shown to be NPhard by reduction from binary Max2CSP. The details of these proofs can be found in Additional file 4, Supplementary Material.
Theorem 2. Unique Max2CSP is NPhard.
In order to solve unique Max2CSP more efficiently, we developed a greedy heuristic algorithm which also explores larger solution space by randomizing nondeterministic steps. Let n be the number of individuals and c_{ max } be the maximum possible copy number.
Algorithm for Unique Max2CSP
For 1 â‰¤ i â‰¤ n, 0 â‰¤ c â‰¤ c_{ max } , do step (1) to step (3) (see Figure 4).

(1)
Initially set X_{ i } = c.

(2)
Randomly pick a constraint {N : X_{ a } + X_{ b } = T } in which only X_{ a } (or X_{ b } ) is assigned, where N is the number of the constraint. If X_{ a } (or X_{ b } ) = D, and there are m types of constraints with X_{ a } + X_{ b } as following: {N_{ j } : X_{ a } + X_{ b } = T_{ j } }, where 1 â‰¤ j â‰¤ m, and N _{1} is maximum in N_{ j } , assign X_{ b } (or X_{ a } ) = T _{1}  D. Repeat this step until there is no constraint in which only one variable is assigned.

(3)
Compute the number of satisfied constraints with respect to X_{ i } = c.
Ideally, once the value of the initial variable X_{ i } is assigned (e.g., X_{1} = 0), the values of other associated variables can be indirectly determined (e.g., X_{1} + X_{2} = 1 or X_{1} + X_{3} = 2). However, there could be some conflicting constraints existed (e.g., X_{2} + X_{3} = 3). Therefore, the possible values of all variables X_{ i } are dependent on the order of assignment (e.g., X_{1} = 0 first, X_{2} = 1 second, then ...). In reality, there are more variables and the dependency/conflict relations are more complicated. Consequently, we repeat the above procedure ten times to explore different orders of assignments by randomly prioritizing distinct constraints to be satisfied in different rounds. The best solution among all iterations is recorded into the corresponding row in the solution table. Note that some variables may have no assignment due to conflicts with previously assigned variables and hence are recorded as 1.
After the above procedure is iterated over possible initial values of all variables, a solution table will be created. Each row stands for one assignment corresponding to the initial value of some variable. Note that although each row represents a set of possible assignment, the assignment may not satisfy all variables due to the lack of dependency with other variables (e.g., X_{4} may not be reachable from X_{1}). Therefore, we do step (4) iteratively using a greedy approach.

(4)
Select a row which is not chosen in the solution table with maximum number of satisfied constraints. Repeat this row selection until no further constraints can be satisfied. Note that the variables assigned in one iteration cannot violate the assignment in previous iteration.
Finally, the union of assignments selected by this greedy algorithm is outputted as the solution.
Iteration and Adjustment
The previous two procedures (haplotype clustering and copy number assignment) are repeated for all possible numbers of clusters k, because the best setting of k can not be known in advance. We try all possible k from two to maximum possible number. For example, if a CNV have total copy number 2, 3, 4 in populations, the maximum possible k is 5 since all possible chromosomespecific copy numbers range from 0 to 4. We choose the best k with maximum number of satisfied constraints in unique Max2CSP. In practice, the constraints of some individuals may be still unsatisfied after these iterations, because the ambiguous haplotypes, which are not the true background of underlying copy number, may confuse the haplotype clustering. Consequently, we adjust the clustering results for these unsatisfied individuals using the following randomized approach (see Figure 5 for an example):

(1)
Randomly pick an individual with haplotypes violating the constraint and enumerate all possible assignments for these two haplotypes such that the constraint can be satisfied.

(2)
For each possible assignment, evaluate the new cut value in the MaxkCut problem and choose the assignment with maximum cut among all possibilities.

(4)
Repeat step (1) and step (2) for the remaining unsatisfied individuals until all of them are satisfied.
Because the order of individuals processed is nondeterministic, we also repeat above procedure ten times and output the best solution among them.
Simulation
The simulation of LD and HWE generated two series of copy numbers and SNP genotypes from 16 individuals. The haplotype phases are inferred via PHASE [27]. The first series of data sets simulate diplotype configurations completely match HWE (P = 1.0, Chisquare). The flanking SNPs are simulated starting from perfect LD (average r^{2} = 1.0). Subsequently, the remaining data sets of lower LD are constructed by flipping SNP alleles at random. The second set of experiments simulated an imperfect HWE data sets by adding/deleting some copy number alleles from the HWE data sets, which aims to slightly deviate from the expected HWE frequency (P = 0.98). The remaining data sets of LD decay are generated in a similar way.
The simulation using copy number on X Chromosomes is also adopted. Because there is only one X chromosome in each male, the total copy number obtained on X chromosome directly represents the chromosomespecific copy number [9]. We use CNVs and haplotypes in X chromosomes of males from the HapMap project during simulation. The total copy numbers at one CNV is simulated by randomly pairing two copy numbers on two different X chromosomes.
In order to compare the accuracy of our algorithm and CNVphaser [1, 16], which outputs posterior probabilities of each copy number, we parsed the output files of CNVPhaser and picked up the diplotype configuration with highest probability for each individual. The accuracy of inferred copy number configurations is defined as following:
where C_{ correct } is the number of correctly inferred copy number configurations and C_{ total } is the total number of configurations.
Results
The proposed algorithms have been implemented as a program called CSCNPhaser, which is available at http://www.cs.ccu.edu.tw/~ythuang/Tool/CSCNPhaser/. We retrieved haplotypes of 270 individuals from Phase II of the International HapMap Project [26]. These individuals include 30 trios from the Utah, USA region (CEU); 30 trios from the Yoruba in Ibadan, Nigeria (YRI); 45 unrelated Japanese individuals from Tokyo, Japan (JPT); and 45 unrelated Han Chinese individuals from Beijing, China (CHB). In addition, total copy numbers at 1,319 CNVs typing on the same 270 individuals are downloaded from [9]. We consider haplotypes within the CNV as well as haplotypes at flanking regions, whereas the best length of extended haplotypes is determined by simulation (see Method).
Simulation on LD Decay and HWE
We compared the CNVPhaser [1, 16] and our program CSCNPhaser over two series of data sets with respect to different LD and HWE (see Method). Although the copy numbers in both experiments almost match the ideal HWE, the slight deviation from HWE is shown by the P values using the Chisquare test. The first set of experiments simulated complete HWE, in which the copy alleles in all data sets completely follow the expected frequency (P = 1.0). The flanking SNP alleles are randomly flipped to decay the LD. Figure 6(A) plots the accuracies of CSCNPhaser and CNVPhaser at different degrees of LD under complete HWE. Because CSCNPhaser is designed based on the LD of background haplotypes, the accuracy is decreasing as the background haplotypes are less LDinformative. Unexpectedly, we found the accuracy of CNVPhaser also deteriorates as LD decays. This is because CNVPhaser estimated the combined frequencies of the entire haplotype and copy number to match HWE, which implicitly captured LD in a light way. CSCNPhaser outperforms CNVPhaser as the background haplotypes are more LDinformative, and both accuracies are worse as haplotypes are less informative.
The second set of experiments simulated an imperfect HWE data sets by adding/deleting a few copy number alleles to slightly deviate from the expected HWE frequency (P = 0.98). Note that the entire allele frequency spectrum is still close to that of HWE. Figure 6(B) plots the accuracies of CSCNPhaser and CNVPhaser at different degrees of LD. In high LD, both programs can still achieve high accuracies.
Although the major trends are similar to previous experiments, CNVPhaser is slightly worse than previous experiment compared with our method, implying it is more sensitive to HWE deviation.
Consistency with Mendelian Inheritance
The developed program is further applied on 1,292 CNVs on autosomal chromosomes typing over 270 HapMap individuals from [9]. We discarded CNVs with less than 10 SNPs, because they are less informative about LD. There are 969 CNVs used in following experiments. The copy numbers observed among normal individuals should be overwhelmingly inherited from their parents. By running our program separately for each individual within 60 parentoffspring trios (CEU and YRI panels), correctness of our method can be justified by checking the Mendelian consistent rate of inferred chromosomespecific copy numbers within trios [9]. More than 97 percent of CNVs have Mendelian consistent rate larger than 0.9 (see Figure 7). These results indicate that majority of copy numbers inferred by our method is under expectation from the law of Mendelian inheritance. The remaining few CNVs might imply novel deletions/duplications or translocationmediated CNVs [31, 32].
Distributions of Copy Number Configurations in Human Populations
The chromosomespecific copy numbers of 270 individuals in CEU, YRI, and CHB+JBP HapMap panels are inferred by our program in order to investigate the distributions of haplopid and diploid configurations in human populations. Figure 8(A) plots the haploid distribution of chromosomespecific copy numbers inferred by our program. Our results indicate that one copy on each chromosome is the major allele in the population as expected. Zero copy (deletion) is the second frequent allele compared with two copies (duplication). Frequencies of higher chromosomespecific copy numbers are relatively lower. This is not unexpected because multiple duplication events at the same CNV locus are relatively less common.
In the distribution of diploid configurations (Figure 8(B)), 1/1 configurations are the most frequent form as expected. We observed 0/2 configurations (deletion+duplication) is the second frequent one. This phenomenon may be explained by the fact that 1/1 and 0/2 configurations contribute equally to gene copy balance in humans. In order to assess the miscalled rates of 1/1 into 0/2 configurations, we conducted a series of simulation experiments of only 1/1 diplotype configurations (i.e., no CNV). Because 1/1 configuration is miscalled to 0/2 by CSCNPhaser only when the background haplotypes are not LDinformative, we investigated the miscalled rates of data sets starting from LDinformative haplotypes down to lowLD ones. Specifically, the haplotypes are onebyone replaced with noninformative haplotypes. Figure 9 plots the miscalled rates with respect to the percentage of replaced haplotypes. When the majority of haplotypes are LDinformative (>60%), the miscalled rate is low (~0.06). As more haplotypes are replaced with noninformative ones, the miscalled rate goes up as expected.
On the other hand, 0/1 and 0/0 configurations (hemizygous and homozygous deletions) are more frequent than the remaining duplication forms, probably due to the low frequencies of highcopy alleles. Although these distributions are consistent across three HapMap panels, these results could be still biased due to lowLD background of recurrent or translocationmediated CNVs. Therefore, these distributions are for reference only, which require experimental validations before further interpretation.
Capability and Efficiency
Although the maximum copy number in the population is still not clear, it is worth of interest to know the capability and efficiency of both programs for processing data sets with large copy numbers. Figure 10 plots the average running time of CNVPhase and CSCNPhaser over a range of maximum copy numbers. Both programs are able to accept input of up to 60 copies. The differences are the running time and memory usage. CNVPhaser requires longer time (>1 min) and more memory (>1GB) for >50 copy numbers, whereas CSCNPhaser is very fast (within seconds) and does not consume much memory.
Discussion
Strength and Weakness of LDbased Inference
CNVPhaser was developed by estimating allele frequencies using HWE, while our CSCNPhaser investigated the haplotype background of each copy number. Although not explicitly stated, we observed CNVPhaser implicitly capture background haplotypes in a light way, because the frequencies are estimated over the entire copy number/haplotype combinations. Therefore, its accuracy also decreases as LD decays. For CNVs having high LD with flanking SNPs, our program performs better than CNVPhaser. In low LD regions with only 1/1 configurations, we observed that CSCNPhaser may miscall them as 0/2 configurations. On the other hand, in data sets with mixed configurations (i.e., 1/1 and 0/2), we observed that the miscalled rate is lower, because these data sets contain haplotypes LDinformative of 0/2 configuration, which are used for better distinguishing 1/1 from 0/2 configurations in our algorithm. Most common, diallelic CNVs have been found to have strong LD with flanking SNPs, and most lowfrequency CNVs even segregated on specific haplotype background [9]. Therefore, we anticipate it is useful to look at the haplotype background for inferring copy number of most CNVs.
It should be noted that the LDbased approach is not suitable for recurrent CNVs or translocationmediated CNVs, in which their background haplotypes are less informative of the copy number. In fact, our simulation on Xchromosome CNVs found two possible recurrent CNVs with lower accuracies compared with other ordinary CNVs (Tables 1 and 2). Nevertheless, the CNVPhaser and our program can work in a hybrid way to overcome the limitation. The LD of SNPs across the CNV can be computed first. If the LD is low (i.e., recurrent CNVs), it might be a clue for not looking into the haplotypes for copy number inference. That is, we can run the CNVPhaser but exclude SNP genotypes for pure HWE frequency estimation. As for other CNVs with LDinformation haplotypes, our program can be used to achieve higher accuracies. With the release of next generation sequencing platforms, SNPs and CNVs are often collectively called in each sequencing project. And the accuracies of inferring these CNVs can be improved by further looking into the LD background of each copy number.
Strategies for Solving Constrained MaxkCut
The original MaxkCut problem can be solved by a randomized algorithm which randomly partition all vertices into k sets. In addition, it can be also solved by a deterministic greedy algorithm which iteratively assigns one vertex into the set that maximizes the cut size. In fact, both algorithms return a good approximate solution within a factor of of the optimal solution, and the semidefiniteprogramming (SDP) relaxation can achieve a better approximation bound [30]. Therefore, it is natural to consider the three strategies for solving the constrained version. Although not presented in this paper, the random partition method was ever considered but later withdrawn due to the bad accuracies in all experiments. It is because the random partition method simply guess the solution, which can't work in practice although the approximation ratio is theoretically good. The SDP strategy is theoretically sound but the running time is slow in our previous study [33], and the SDP implementation is complex so that the program is often not easily portable to all platforms. On the other hand, the greedy algorithm (with randomization enhancement) can achieve high accuracies and run very fast in all experiments. As a consequence, the greedy solution is taken in order to perform genomewide experiments with high accuracies and within reasonable period of time.
Integration of Greedy and Randomized Approaches
Theoretically, the two optimization problems (MaxkCut and Max2CSP) can be both solved by a deterministic greedy approach or a pure randomized approach (e.g., random partition for the kcut problem). The greedy approach is simple and fast. However, the solution found is often only theoretically sound but not comparable with other heuristic methods in practice. This is due to the fact that the ordinary greedy approach tends to find local optimum solution instead of global optimum solution. On the other hand, the pure randomized (blindsearch) approaches do not have the tendency of finding local optimum solution but requires numerous iterations for finding a good solution. Therefore, when solving both algorithms, we used the greedy algorithm as a framework and randomized the nondeterministic steps for searching better solutions. The results showed that the iterations required of this hybrid approach are far less than those of pure randomized approaches, while obtaining better solutions than an ordinary greedy algorithm.
Conclusion
In this study, we developed new computational models and combinatorial algorithms for inferring chromosomespecific copy numbers by distinguishing their haplotype background. Simulation showed that our method is accurate and outperformed existing method as the background haplotypes are LDinformative of the copy numbers. The inferred copy numbers are consistent with Mendelian inheritance for 97% of CNVs within parentoffspring trios. The inference of copy numbers in microarray and sequencing platforms are often confounded by a number of different factors. This study showed that integration of haplotypes into copy number estimation is able to improve the accuracies, especially for those CNVs having strong LD with SNPs.
References
Kato M, Nakamura Y, Tsunoda T: An algorithm for inferring complex haplotypes in a region of copynumber variation. The American Journal of Human Genetics 2008, 83: 157â€“169. 10.1016/j.ajhg.2008.06.021
Gonzalez E, Kulkarni H, Bolivar H, Mangano A, Sanchez R, Catano G, Nibbs RJ, Freedman BI, Quinones MP, Bamshad MJ, Murthy KK, Rovin BH, Bradley W, Clark RA, Anderson SA, O'connell RJ, Agan BK, Ahuja SS, Bologna R, Sen L, Dolan MJ, Ahuja SK: The influence of CCL3L1 genecontaining segmental duplications on HIV1/AIDS susceptibility. Science 2005, 307: 1434â€“1440. 10.1126/science.1101160
McCarroll SA, Huett A, Kuballa P: Deletion polymorphism upstream of IRGM associated with altered IRGM expression and Crohn's disease. Nature Genetics 2008, 40: 1107â€“1112. 10.1038/ng.215
RoveletLecrux A, Hannequin D, Raux G, Le Meur N, Laquerriere A, Vital A, Feuillette S, Brice A, Vercelletto M, Dubas F, Frebourg T, Campion D: APP locus duplication causes autosomal dominant earlyonset alzheimer disease with cerebral amyloid angiopathy. Nature Genetics 2006, 38: 24â€“26. 10.1038/ng1718
Sebat J: Major changes in our DNA lead to major changes in our thinking. Nature Genetics 39: S3S5.
Singleton AB, Farrer M, Johnson J, Singleton A, Hague S, Kachergus J, Hulihan M, Peuralinna T, Dutra A, Nussbaum R, Lincoln S, Crawley A, Hanson M, Maraganore D, Adler C, Cookson MR, Muenter M, Baptista M, Miller D, Blancato J, Hardy J, GwinnHardy K: alphaSynuclein locus triplication causes Parkinson's disease. Science 2003, 302: 841. 10.1126/science.1090278
Shendure J, Ji H: Nextgeneration DNA sequencing. Nature Biotechnology 2008, 26: 1135â€“1143. 10.1038/nbt1486
Kidd JM, Cooper GM, Donahue WF, Hayden HS, Sampas N, Graves T, Hansen N, Teague B, Alkan C, Antonacci F, Haugen E, Zerr T, Yamada NA, Tsang P, Newman TL, Tuzun E, Cheng Z, Ebling HM, Tusneem N, David R, Gillett W, Phelps KA, Weaver M, Saranga D, Brand A, Tao W, Gustafson E, McKernan K, Chen L, Malig M, Smith JD, Korn JM, McCarroll SA, Altshuler DA, Peiffer DA, Dorschner M, Stamatoyannopoulos J, Schwartz D, Nickerson DA, Mullikin JC, Wilson RK, Bruhn L, Olson MV, Kaul R, Smith DR, Eichler EE: Mapping and sequencing of structural variation from eight human genomes. Nature 2008, 453: 56â€“64. 10.1038/nature06862
McCarroll SA, Kuruvilla FG, Korn JM, Cawley S, Nemesh J, Wysoker A, Shapero MH, de Bakker PI, Maller JB, Kirby A, Elliott AL, Parkin M, Hubbell E, Webster T, Mei R, Veitch J, Collins PJ, Handsaker R, Lincoln S, Nizzari M, Blume J, Jones KW, Rava R, Daly MJ, Gabriel SB, Altshuler D: Integrated detection and populationgenetic analysis of SNPs and copy number variation. Nature Genetics 2008, 40: 1166â€“1174. 10.1038/ng.238
Perry GH, Tchinda J, McGrath SD, Zhang J, Picker SR, Caceres AM, Iafrate AJ, TylerSmith C, Scherer SW, Eichler EE, Stone AC, Lee C: Hotspots for copy number variation in chimpanzees and humans. Proceedings of the National Academy of Sciences of the United States of America 2006, 103: 8006â€“8011. 10.1073/pnas.0602318103
Redon R, Ishikawa S, Fitch KR, Feuk L, Perry GH, Andrews TD, Fiegler H, Shapero MH, Carson AR, Chen W, Cho EK, Dallaire S, Freeman JL, Gonzalez JR, Gratacos M, Huang J, Kalaitzopoulos D, Komura D, MacDonald JR, Marshall CR, Mei R, Montgomery L, Nishimura K, Okamura K, Shen F, Somerville MJ, Tchinda J, Valsesia A, Woodwark C, Yang F, Zhang J, Zerjal T, Zhang J, Armengol L, Conrad DF, Estivill X, TylerSmith C, Carter NP, Aburatani H, Lee C, Jones KW, Scherer SW, Hurles ME: Global variation in copy number in the human genome. Nature 2006, 444: 444â€“454. 10.1038/nature05329
de Smith AJ, Tsalenko A, Sampas N, Scheffer A, Yamada NA, Tsang P, BenDor A, Yakhini Z, Ellis RJ, Bruhn L, Laderman S, Froguel P, Blakemore AI: Array CGH analysis of copy number variation identifies 1284 new genes variant in healthy white males: implications for association studies of complex disease. Human Molecular Genetics 2007, 16: 2783â€“2794. 10.1093/hmg/ddm208
Conrad DF, Pinto D, Redon R, Feuk L, Gokcumen O, Zhang Y, Aerts J, Andrews TD, Barnes C, Campbell P, Fitzgerald T, Hu M, Ihm CH, Kristiansson K, Macarthur DG, Macdonald JR, Onyiah I, Pang AW, Robson S, Stirrups K, Valsesia A, Walter K, Wei J, Wellcome Trust Case Control Consortium, TylerSmith C, Carter NP, Lee C, Scherer SW, Hurles ME: Origins and functional impact of copy number variation in the human genome. Nature 2009, 464: 704â€“712.
Emerson JJ, CardosoMoreira M, Borevitz JO, Long M: Natural selection shapes genomewide patterns of copynumber polymorphism in drosophila melanogaster. Science 2008, 320: 1629â€“1632. 10.1126/science.1158078
Sabeti PC, Varilly P, Fry B, Lohmueller J, Hostetter E, Cotsapas C, Xie X, Byrne EH, McCarroll SA, Gaudet R, Schaffner SF, Lander ES, International HapMap Consortium, Frazer KA, Ballinger DG, Cox DR, Hinds DA, Stuve LL, Gibbs RA, Belmont JW, Boudreau A, Hardenbol P, Leal SM, Pasternak S, Wheeler DA, Willis TD, Yu F, Yang H, Zeng C, Gao Y, Hu H, et al.: Genomewide detection and characterization of positive selection in human populations. Nature 2007, 449: 913â€“918. 10.1038/nature06250
Kato M, Nakamura Y, Tsunoda T: MOCSphaser: a haplotype inference tool from a mixture of copy number variation and single nucleotide polymorphism data. Bioinformatics 2008, 24: 1645â€“1646. 10.1093/bioinformatics/btn242
Wigginton JE, Cutler DJ, Abecasis GR: A Note on Exact Tests of HardyWeinberg Equilibrium. American Journal of Human Genetics 2005, 76: 887â€“893. 10.1086/429864
Wang K, Chen Z, Tadesse MG, Glessner J, Grant SF, Hakonarson H, Bucan M, Li M: Modeling genetic inheritance of copy number variations. Nucleic Acids Research 2008, 36: e138. 10.1093/nar/gkn641
Su SY, Asher JE, Jarvelin MR, Froguel P, Blakemore AI, Balding DJ, Coin LJ: Inferring combined CNV/SNP haplotypes from genotype data. Bioinformatics 2010, 26: 1437â€“1445. 10.1093/bioinformatics/btq157
Alkan C, Kidd JM, MarquesBonet T, Aksay G, Antonacci F, Hormozdiari F, Kitzman JO, Baker C, Malig M, Mutlu O, Sahinalp SC, Gibbs RA, Eichler EE: Personalized copy number and segmental duplication maps using nextgeneration sequencing. Nature Genetics 2009, 41: 1061â€“1067. 10.1038/ng.437
OrtizEstevez M, Bengtsson H, Rubio A: ACNE: a summarization method to estimate allelespecific copy numbers for Affymetrix SNP arrays. Bioinformatics 2010, 26: 1827â€“1833. 10.1093/bioinformatics/btq300
Huang YT, Zhang K, Chen T, Chao KM: Selecting additional tag SNPs for tolerating missing data in genotyping. BMC Bioinformatics 2005, 6: 263. 10.1186/147121056263
McCarroll SA, Hadnott TN, Perry GH, Sabeti PC, Zody MC, Barrett JC, Dallaire S, Gabriel SB, Lee C, Daly MJ, Altshuler DM, International HapMap Consortium: Common deletion polymorphisms in the human genome. Nature Genetics 2006, 38: 86â€“92. 10.1038/ng1696
Surakka I, Kristiansson K, Anttila V, Inouye M, Barnes C, Moutsianas L, Salomaa V, Daly M, Palotie A, Peltonen L, Ripatti S: Founder populationspecific HapMap panel increases power in GWA studies through improved imputation accuracy and CNV tagging. Genome Research 2010, 20: 1344â€“1351. 10.1101/gr.106534.110
Ting CK, Lin WT, Huang YT: Multiobjective tag SNPs selection using evolutionary algorithms. Bioinformatics 2010, 26(11):1446â€“1452. 10.1093/bioinformatics/btq158
International HapMap Consortium, Frazer KA, Ballinger DG, Cox DR, Hinds DA, Stuve LL, Gibbs RA, Belmont JW, Boudreau A, Hardenbol P, Leal SM, Pasternak S, Wheeler DA, Willis TD, Yu F, Yang H, Zeng C, Gao Y, Hu H, Hu W, Li C, Lin W, Liu S, Pan H, Tang X, Wang J, Wang W, Yu J, Zhang B, Zhang Q: A second generation human haplotype map of over 3.1 million SNPs. Nature 449: 851â€“862.
Stephens M, Donnelly P: A comparison of bayesian methods for haplotype reconstruction from population genotype data. Am J Hum Genet 2003, 73: 1162â€“1169. 10.1086/379378
Bansal V, Bafna V: HapCUT: an efficient and accurate algorithm for the haplotype assembly problem. Bioinformatics 2008, 24: i153i159. 10.1093/bioinformatics/btn298
CojaOghlan A, Moore C, Sanwalani V: MAX kCUT and approximating the chromatic number of random graphs. Proceedings of 30th Int Coll Automata, Languages and Programming (ICALP) 2003.
Frieze A, Jerrum M: Improved approximation algorithms for MAX kCUT and MAX BISECTION. Algoritmica 1997, 18.
Linardopoulou EV, Williams EM, Fan Y, Friedman C, Young JM, Trask BJ: Human subtelomeres are hot spots of interchromosomal recombination and segmental duplication. Nature 2005, 437: 94â€“100. 10.1038/nature04029
Watson SK, deLeeuw RJ, Horsman DE, Squire JA, Lam WL: Cytogenetically balanced translocations are associated with focal copy number alterations. Human Genetics 2007, 120: 795â€“805. 10.1007/s0043900602519
Huang YT, Chao KM, Chen T: An approximation algorithm for haplotype inference by pure parsimony. Journal of Computational Biology 2005, 12: 1261â€“1274. 10.1089/cmb.2005.12.1261
Acknowledgements
We thank the reviewers for their valuable comments. YaoTing Huang and MinHan Wu were supported in part by NSC grant 982621B194001 from the National Science Council, Taiwan.
Author information
Authors and Affiliations
Corresponding author
Additional information
Authors' contributions
YTH formulated the two optimization problems. YTH and and MHW designed and analyzed the proposed algorithms. YTH and MHW conducted experiments and wrote the manuscript. Both authors read and approved the final manuscript.
Electronic supplementary material
12859_2010_4602_MOESM1_ESM.EPS
Additional file 1:Supplementary Figure S1. Simulation of extended regions to the left or right of a CNV. The results indicate accuracy is highest when the extension is equal to the size of CNV. (EPS 6 KB)
12859_2010_4602_MOESM2_ESM.EPS
Additional file 2:Supplementary Figure S2. An example of the algorithm for solving the constrained MaxkCut problem. (EPS 505 KB)
12859_2010_4602_MOESM3_ESM.EPS
Additional file 3:Supplementary Figure S3. The average accuracies of CSCNPhaser with respect to different iterations used. (EPS 69 KB)
Authorsâ€™ original submitted files for images
Below are the links to the authorsâ€™ original submitted files for images.
Rights and permissions
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Cite this article
Huang, YT., Wu, MH. Inference of chromosomespecific copy numbers using population haplotypes. BMC Bioinformatics 12, 194 (2011). https://doi.org/10.1186/1471210512194
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/1471210512194