Volume 7 Supplement 4
Symposium of Computations in Bioinformatics and Bioscience (SCBB06)
Estimate haplotype frequencies in pedigrees
- Qiangfeng Zhang^{1},
- Yuzhong Zhao^{1},
- Guoliang Chen^{1}Email author and
- Yun Xu^{1}
DOI: 10.1186/1471-2105-7-S4-S5
© Zhang et al; licensee BioMed Central Ltd 2006
Published: 12 December 2006
Abstract
Background
Haplotype analysis has gained increasing attention in the context of association studies of disease genes and drug responsivities over the last years. The potential use of haplotypes has led to the initiation of the HapMap project which is to investigate haplotype patterns in the human genome in different populations. Haplotype inference and frequency estimation are essential components of this endeavour.
Results
We present a two-stage method to estimate haplotype frequencies in pedigrees, which includes haplotyping stage and estimation stage. In the haplotyping stage, we propose a linear time algorithm to determine all zero-recombinant haplotype configurations for each pedigree. In the estimation stage, we use the expectation-maximization (EM) algorithm to estimate haplotype frequencies based on these haplotype configurations. The experiments demonstrate that our method runs much faster and gives more credible estimates than other popular haplotype analysis software that discards the pedigree information.
Conclusion
Our method suggests that pedigree information is of great importance in haplotype analysis. It can be used to speedup estimation process, and to improve estimation accuracy as well. The result also demonstrates that the whole haplotype configuration space can be substituted by the space of zero-recombinant haplotype configurations in haplotype frequency estimation, especially when the considered haplotype block is relatively short.
Background
The modelling of human genetic variation is critical to the understanding of the genetic basis for complex diseases. Single nucleotide polymorphisms (SNPs) are the most frequent form of variation. The Human Genome Project and other large-scale efforts have identified millions of SNP markers. Although each marker can be analyzed independently, it is more informative to analyze them in groups. Therefore, it is useful to analyze haplotypes (haploid genotypes), which are sequences of linked markers on a single chromosome. In diploid organisms, such as human beings, chromosomes come in pairs, and experiments often yield genotype information, which blend haplotype information for chromosome pairs. There is growing evidence that, in order to better characterize the role of a candidate gene, full haplotype information should be exploited instead of using only genotype information. Unfortunately, it is both time-consuming and expensive to derive haplotype information experimentally. This explains the increasing interest in inferring haplotype information, or haplotyping, computationally. In fact, the potential use of haplotypes has led to the initiation of the HapMap project which is to investigate haplotype patterns in the human genome in different populations. Haplotype inference and frequency estimation are essential components of this endeavour.
Genotype data can be with or without any pedigree information, the first category is called population genotype data while the second one is pedigree genotype data. A large number of algorithms have been designed to estimate haplotype frequencies based on population data [1–4]. Among them, EM algorithms are most popular due to their interpretability and stability.
For any given pedigree genotype data, we can certainly discard the pedigree information and simply take the genotype sequences as the input of EM estimation algorithms for population data. However, it is well accepted that information obtained by analyzing pedigree genotype data is more reliable: the constraint provided by other members in a pedigree would force one genotype to settle on a unique haplotype pair as being most probable.
Here we propose a two-stage method to estimate haplotype frequencies in pedigrees. The first stage is the haplotyping stage, which finds out all feasible haplotype configurations for each pedigree. In the second estimation stage, we use EM algorithm to estimate haplotype frequencies in pedigrees based on the haplotype configurations inferred in the former stage.
In general, haplotyping pedigrees need consider the entire solution space of all possible consistent haplotype configurations. However, the genomic DNA can be partitioned into long blocks such that recombinations within each block are rare or even nonexistent [5, 6]. Thus it is believed that haplotype configurations with fewer recombinations should be preferred in haplotype inference [7–9]. When the region of interest is so small that the expected number of recombinations in the pedigree data is very close to zero, the solution space of all consistent haplotype configurations can be replaced by that of zero recombination (provided it is non-empty) to estimate haplotype frequencies. It is because the contribution of the solutions of recombinations to the overall likelihood becomes so small compared to those of zero recombination while they bring considerable complexity to the computation. Thus, we are interested in finding the consistent haplotype configurations of zero recombination.
Wijsman [10] proposed a 20-rule algorithm, and O'Connell [11] described a genotype-elimination algorithm, both of which can be used to find out zero-recombinant haplotype configurations for pedigrees. Recently, Li and Jiang [8, 9] showed that it could be solved in polynomial time. Here we propose an algorithm to find out zero-recombinant haplotype configurations in linear time using a technique called HCL-linkage analysis.
In the second stage, we use the EM algorithm to estimate haplotype frequencies based on haplotype configurations obtained from the haplotyping stage. We employ the Hardy-Weinberg Equilibrium to obtain the probabilities of founder genotypes and use a genetic model [12] to deduce the transmitted probabilities of non-founders. While the likelihood of each configuration is computed by multiplying the probabilities of each genotype, the frequency of each haplotype that appears in the configuration is calculated by a gene-counting method.
We implement all the algorithms in a C software package named HANAP (Haplotype ANAlysis in Pedigrees) and test its effectiveness and efficiency both on simulated and real data sets. The experimental results show that, our method runs much faster than the direct frequency estimation software that discards the pedigree information. Moreover, because our method utilizes such information, the estimation is more reliable.
Methods
Haplotyping stage: haplotyping algorithm based on HCL-linkage analysis
Excoffier's EM algorithm was widely applied in haplotype analysis [14, 15]. Unfortunately, it should calculate the frequencies of all possible haplotype pairs consistent to each given genotype, which is unbearable in storage when the haplotype length grows to more than 20 [16]. O'Connell [10] showed that genetic information from relatives could be used to resolve one genotype's ambiguity, and thus reduce the number of haplotypes that should be considered. However, O'Connell's method had an exponential time complexity. Recently, Li and Jiang [8, 9] showed that, for any genotype in a given pedigree, its ambiguity could be solved in cubic time (O(m^{3}n^{3})), where n is the number of members in the pedigree and m is the number of loci in each genotype. Here we present a so-called HCL-linkage analysis method to do haplotyping in linear time (O(mn)).
HCL-linkage definition
Imputing the paternal allele and the maternal allele for the child at a single locus
Conditions | Paternal | Maternal |
---|---|---|
e = f | e | f |
e ≠ f, a = b | ||
e = a | e | f |
f = a | f | e |
e ≠ f, a ≠ b, c = d | ||
e = c | f | e |
f = c | e | f |
e ≠ f, a ≠ b, c ≠ d | ||
a = c, b = d | Ambiguous | |
a = c, b ≠ d | ||
e = b or f = d | e | f |
e = d or f = b | f | e |
a ≠ c, b = d | ||
e = a or f = c | e | f |
e = c or f = a | f | e |
a = d, b = c | Ambiguous | |
a = d, b ≠ c | ||
e = b or f = c | e | f |
e = c or f = b | f | e |
a ≠ d, b = c | ||
e = a or f = d | e | f |
e = d or f = a | f | e |
a ≠ c ≠ b ≠ d | ||
e = a or e = b | e | f |
e = c or e = d | f | e |
In fact, for any trio, ignoring the ambiguous loci, the consistent (partial) haplotype configuration for the unambiguous loci is unique and specifies a linkage of alleles on some heterozygous loci for each node in the trio. We define such linkage as HCL-linkages (linkages of Haplotype Configuration on the non-ambiguous Loci).
Definition
An HCL-linkage ψ is a quadruplet <v, RE, LS, PH> defined on node v and specified by the unique consistent (partial) haplotype configuration within a trio that contains v. Here v denotes the node to which the HCL-linkage belongs. LS = {a_{1},...,a_{ l }} is the set of heterozygous loci where the haplotype configuration has been inferred. PH = {ph, ph'} = {(h_{1}...h_{ l }), (h_{1}'...h_{ l }')} records the two (partial) haplotypes imputed on these loci. RE = (R, R') denotes that ph (respectively ph') is inherited from or will be passed on to the node in set R(R').
An HCL-linkage describes the partial haplotype configuration of a node and the inheritance relationship between the parents and their children. Under our definition, we can conclude that every haplotype configuration should be consistent with any HCL-linkage specified by each trio in the pedigree.
Merge and transfer operations over HCL-linkages
In the case of multiple generations and multiple children, loci on one node may be linked by different HCL-linkages. HCL-linkages of the same node should be merged if they can. There are three cases when merging two HCL-linkages ψ_{1} = <v, (R_{1}, R_{1}'), LS_{1}, {ph_{1}, ph_{1}'}> and ψ_{2} = <v, (R_{2}, R_{2}'), LS_{2}, {ph_{2}, ph_{2}'}> on node v.
Case (1): (R_{1} ∪ R_{1}') ∩ (R_{2} ∪ R_{2}') ≠ Φ
l.a) R_{1} ∩ R_{2} ≠ Φ or R_{1}' ∩ R_{2}' ≠ Φ, it means that both ph_{1} and ph_{2} are from the nodes in R_{1} and R_{2} said ph_{1}' and ph_{2}' are from the nodes in R_{1}' and R_{2}', so ph_{1} and ph_{2} should be on the same haplotype, and ph_{1}' and ph_{2}' on the other: i) LS_{1} ∩ LS_{2} = Φ, or LS_{1} ∩ LS_{2} ≠ Φ but ph_{1} equals ph_{2} when restricted to loci in LS_{1} ∩ LS_{2}, it means that ψ_{1} and ψ_{2} are compatible. In this case, they should be merged to ψ = <v, (R_{1} ∪ R_{2}, R_{1}' ∪ R_{2}'), LS_{1} ∪ LS_{2}, {ph_{1} ∪ ph_{2}, ph_{1}' ∪ ph_{2}'}>, here ph_{1} ∪ ph_{2} denote a longer partial haplotype, which alleles equal to those of ph_{1} and ph_{2} when restricted to loci in LS_{1} and LS_{2}; ii) LS_{1} ∩ LS_{2} ≠ Φ and ph_{1} doesn't equal ph_{2} when restricted to LS_{1} ∩ LS_{2}, it means that ψ_{1} and ψ_{2} are incompatible, i.e. no haplotype configuration can satisfy the two HCL-linkages in the same time.
1.b) R_{1} ∩ R_{2}' ≠ Φ or R_{1}' ∩ R_{2} ≠ Φ, it means that ph_{1} and ph_{2}' should be on the same haplotype, and ph_{1}' and ph_{2} on the other. Similarly, ψ_{1} and ψ_{2} can be merged to ψ = <v, (R_{1} ∪ R_{2}', R_{1}' ∪ R_{2}), LS_{1} ∪ LS_{2}, {ph_{1} ∪ ph_{2}', ph_{1}' ∪ ph_{2}}> when they are compatible.
Case (2): (R_{1} ∪ R_{1}') ∩ (R_{2} ∪ R_{2}') = Φ, but LS_{1} ∩ LS_{2} ≠ Φ,
2.a) ph_{1} equals ph_{2} (and ph_{1}' equals ph_{2}' consequently) or ph_{1} equals ph_{2}' (then ph_{1}' equals ph_{2}) when restricted to LS_{1} ∩ LS_{2}, it means that ψ_{1} and ψ_{2} are compatible, in this case, they should be merged to ψ = <v, (R_{1} ∪ R_{2}, R_{1}' ∪ R_{2}'), LS_{1} ∪ LS_{2}, {ph_{1} ∪ ph_{2}, ph_{1}' ∪ ph_{2}'}> or ψ = <v, (R_{1} ∪ R_{2}', R_{1}' ∪ R_{2}), LS_{1} ∪ LS_{2}, {ph_{1} ∪ ph_{2}', ph_{1}' ∪ ph_{2}}>.
2.b) Else, ph_{1} doesn't equal ph_{2} or ph_{2}' when restricted to LS_{1} ∩ LS_{2}, it means that ψ_{1} and ψ_{2} are incompatible.
Case (3): (R_{1} ∪ R_{1}') ∩ (R_{2} ∪ R_{2}') = Φ, and LS_{1} ∩ LS_{2} = Φ,
In this case, ψ_{1} and ψ_{2} cannot be merged and both should be recorded in a HCL-linkage set Ψ_{ v }for node v.
With the merge operation, we can define the normalizing of a set of HCL-linkages Ψ_{ v }: normalizing a set Ψ_{ v }of HCL-linkages on node v means repeatedly applying the merge operation for pairs of HCL-linkages in Ψ_{ v }until, ∀ψ_{ i }, ψ_{ j }∈ Ψ_{ v }, (R_{ i }∪ R_{ i }') ∩ (R_{ i }∪ R_{ i }') = Φ, and LS_{1} ∩ LS_{2} = Φ. Ψ_{ v }is then said to be normalized. From now on, if there is no further notice, Ψ_{ v }should be normalized after any changes.
Like genetic information, HCL-linkages will be passed on from generations to generations. Without loss of generality, let us define the transfer of HCL-linkage information from child C to its parent F. The other case from F to C would be similar. Let Ψ_{ C }and Ψ_{ F }represent the normalized HCL-linkage sets of C and F respectively, and let HS be the set of homozygous loci of F. The transfer of Ψ_{ C }from C to F results in changes to Ψ_{ F }, where each ψ_{ C }= <C, (R_{ C }, R_{ C }'), LS_{ C }, {ph_{ C }, ph_{ C }'}> ∈ Ψ_{ C }is transferred independently. There are two cases to consider.
Case (1): if F ∈ R_{ C }(or respectively, F ∈ R_{ C }'), add ψ_{ F }= <F, ({C}, Φ), LS_{ C }- HS, {ph_{ F }, ph_{ F }'}> to Ψ_{ F }, here ph_{ F }equals the resulting partial haplotypes of ph_{ C }(respectively ph_{ C }') when restricted to loci in LS_{ C }- HS and ph_{ F }' is the compensatory partial haplotypes of ph_{ F }consistent to genotype g_{ F }.
Case (2): else, F ∉ R_{ C }∪ R_{ C }': i) both ph_{ C }and ph_{ C }' are consistent with the partial genotype g_{ F }when restricted to loci in LS_{ C }, then add ψ_{ F }= <F, (Φ, Φ), LS_{ C }- HS, {ph_{ F }, ph_{ F }'}> to Ψ_{ F }, here ph_{ F }and ph_{ F }' equal the resulting partial haplotypes of ph_{ C }and ph_{ C }' when restricted to loci in LS_{ C }- HS; ii) ph_{ C }' (respectively ph_{ C }) is not consistent with the partial genotype g_{ F }when restricted to loci in LS_{ C }, then add ψ_{ F }= <F, ({C}, Φ), LS_{ C }- HS, {ph_{ F }, ph_{ F }'}> to Ψ_{ F }, here ph_{ F }equals the resulting partial haplotypes of ph_{ C }(respectively ph_{ C }') when restricted to loci in LS_{ C }- HS and ph_{ F }' is the compensatory partial haplotypes of ph_{ F }consistent to genotype g_{ F }. Note that at least one of ph_{ C }and ph_{ C }' should be consistent with the partial genotype g_{ F }.
Remember that Ψ_{ F }should be normalized whenever adding a new HCL-linkage to it. In the case of transferring an HCL-linkage ψ_{ F }from F to C, resulting in adding ψ_{ C }= <C, (R_{ C }, R_{ C }'), LS_{ C }, {ph_{ C }, ph_{ C }'}> to Ψ_{ C }, note that we should add M into R_{ C }' whenever we have determined that F ∈ R_{ C }.
Our merge and transfer operations will not bring more or lose any HCL-linkage information for building consistent haplotype configurations.
Main HCL-linkages analysis haplotyping algorithm
Before the algorithm, we preprocess each trio in the pedigree. Whenever a trio specifies an HCL-linkage for node v, it will be stored in the HCL-linkage set Ψ_{ v }. The objective of the algorithm is to collect the complete HCL-linkage information for each node, which is accomplished by traversing the tree twice.
Firstly, we will convert the input pedigree into a rooted searching tree T (at an arbitrary node R) (Step 1). Then we traverse T in post-order to transfer and merge the HCL-linkage information for each node from its relatives (Step 2). We do this from the left lowest nuclear family F_{o}. The HCL-linkages in nuclear family F_{o} will be merged at both parents, and then be transferred to the root of the sub-tree. The same operations will be conducted in its parental nuclear family on HCL-linkages specified in this family as well as on those transferred from its child families. And at last, we collect all the HCL-linkages at the root R. In Step 3, we traverse T again in pre-order and transfer the linkage in another direction from R to its farmost descendants.
After step 3, the HCL-linkage set of each node preserves all HCL-linkages in the pedigree. In step 4, we choose a node v arbitrarily. Set Ψ_{ v }contains several HCL-linkages ψ_{1}, ψ_{2},...,ψ_{ l }defined on disjoint locus set LS_{1}, LS_{2},...LS_{ l }. When a set of loci are linked by one HCL-linkage, they can be viewed as a compound locus, and the two partial haplotypes can be viewed as two compound alleles. These "loci" (and "alleles") will be treated equally as the other heterozygous loci and homozygous loci that are not involved in any HCL-linkage. We arbitrarily select one allele from the two at each locus to form a haplotype; the other alleles form another haplotype. It is called an imputing schema. Whenever the haplotype configuration of one node is determined, it can be used to determine the configurations of its relatives, and those of the whole pedigree at last.
The time complexity and space complexity of our algorithm are both O(mn) where n is the number of the members in the pedigree and m is the length of the loci.
Frequency estimation stage
Suppose that we are given K pedigrees P = {P_{1}, P_{2},...,P_{ K }}. Each P_{ i }consists of n_{ i }nodes v_{ i,j }(1 ≤ i ≤ K, 1 ≤ j ≤ n_{ i }), in which the first n_{ i }' are founders. The genotype of node v_{ i,j }(1 ≤ j ≤ n_{ i }) is g_{ i,j }. Suppose that there are π_{ i }consistent solutions for pedigree P_{ i }and the s-th solution is: S_{ s,i }= <S_{s,i,1}, S_{s,i,2},..., > (1 ≤ s ≤ π_{ i }), where S_{ s,i,j }= <α_{s,i,j,1}, β_{s,i,j,2}> is a haplotype pair of genotype g_{ i,j }. All haplotypes appear in these solutions form a list of haplotypes H = {h_{1}, h_{2},...,h_{ l }} with frequencies Θ = {θ_{1}, θ_{2},...,θ_{ l }}, here θ_{1} + θ_{2} + ... + θ_{ t }= 1 is what we want to estimate.
The likelihood of haplotype frequencies given the observed pedigree data is,
Under the assumption of random mating, the paternal haplotype configuration and the maternal haplotype configuration are independent, and the child's haplotype configuration is transmitted from its parents. We have:
Here Pr (S_{ s,i,j }|Θ) is the probability of haplotype configuration of the founder nodes, it can be computed using the Hardy-Weinberg Equilibrium. Pr(S_{s,i,j'}|< , >) is the gamete transmission probabilities of haplotype configuration S_{ s,i,j }with the parental haplotype configurations of and . It can be computed using a genetic model presented by Elston and Stewart [6].
EM algorithm estimates the haplotype frequencies Θ starting with the initial arbitrary values Θ^{(0)} = {θ_{1}^{(0)}, θ_{2}^{(0)},...,θ_{ l }^{(0)}}. These initial values are used as if they were the unknown true frequencies to estimate solution frequencies Pr(S_{ s,i }|Θ) (the expectation step). These expected solution frequencies are used in turn to estimate haplotype frequencies at the next iteration Θ^{(1)} = {θ i^{(1)}, θ_{2}^{(1)},...,θ^{(1)}} (the maximization step), and so on, until convergence is reached.
Suppose that in the r-th iteration, Θ = Θ^{(r)}and we want to estimate Θ^{(r+1)}. Then we have:
Let δ_{ i,j,t }be an indicator variable equalling the number of haplotype h_{ t }appear in solution S_{ s,i }. Then the haplotype frequencies can be computed using a gene-counting method,
There are several ways to initialize the haplotype frequencies Θ = {θ_{1}, θ_{2},...,θ_{ l }}. For instance, the initial haplotype frequencies can be chosen at random, or all haplotypes are equally frequent, i.e. Θ_{ t }^{(0)} = 1/l (t = 1, 2,...,l). Or that all initial haplotype frequencies are equal to the product of the corresponding single-locus allele frequencies (i.e., a complete linkage equilibrium). Also, we can set all feasible solutions for each pedigree to be equally likely, i.e. Pr(S_{ s,i }|Θ^{(0)}) = 1/π_{ i }, (j = 1,2,...,π_{ i }). We can even initialize the haplotype frequencies by counting their occurrence in all the feasible solutions. Since in practical applications the EM algorithm could be trapped in some local maximum, we recommend to restart the algorithm several times with different initial haplotype frequencies and better with a randomized additive perturbation.
The stopping (convergence) criterion is defined as the absolute value of the difference of Θ between consecutive iterations being less than some small value ε > 0.
Results
Simulated data set
In order to generate a pedigree genotype data set for simulation experiments, we generate a population of haplotypes H* first, where each locus of each haplotype is set to some allele according to the probability distribution function P. In our simulation, we generated haplotypes of SNP loci as well as haplotypes of micro-satellite loci. For a biallelic SNP locus i, suppose that i happens to be one state with a probability of p_{ i }, and to be the other state with a probability of (1 - p_{ i }). For a micro-satellite loci, suppose it has w different alleles: a_{1}, a_{2},...,a_{ w }, each appears with the probability of p_{1}, p_{2},...,p_{ w }(p_{1} + p_{2} +...+ p_{ w }= 1).
Each founder node in any tested pedigree is arbitrarily assigned a pair of haplotypes according to their frequencies θ*. The two haplotypes of a non-founder node are arbitrarily selected from those of its parents (one from the father, one from the mother). At last, the pair of haplotypes of the same node is blended to form a genotype corresponding to that node.
All experiments are conducted on a Windows server with 1.7G Hz CPU and 256 MB RAM. And for each parameter setting, 100 copies are randomly generated and the performance is evaluated by computing the average numbers in these 100 runs.
Running time of the haplotyping algorithm
One of the main contributions of our paper is to do haplotyping in linear time, so we firstly examine the running time with respect to different number of nodes of each pedigree (n) and different number of loci in each sequence (m).
Number of solutions
We compare the numbers of haplotypes that should be considered in the estimation stage, with and without the haplotyping stage. In our experiment, we set P_{1} (p_{1} = p_{2} = 0.5) and P_{2} (p_{1} = 0.9, p_{2} = 0.1) for SNP loci, and set w = 4, P_{3} (p_{1} = p_{2} = p_{3} = p_{4} = 0.25) and P_{4} (p_{1} = 0.5, p_{2} = p_{3} = 0.2, p_{4} = 0.1) for micro-satellite loci. We let |H*| = 20, and θ_{1}* = 0.2, θ_{2}* = θ_{3}* = θ_{4}* = 0.1, θ_{5}* = θ_{6}* = θ_{7}* = θ_{8}* = 0.05, θ_{9}* = θ_{10}* = ... = θ_{20}* = 0.025.
Comparison of number of haplotypes (|H|) on trio pedigrees
Parameter settings (m, k) | directly | HANAP | ||||||
---|---|---|---|---|---|---|---|---|
P _{1} | P _{2} | P _{3} | P_{4} | P_{1} | P_{2} | P _{3} | P _{4} | |
(20, 20) | 3.18e4 | 3.43e2 | 2.06e6 | 1.02e6 | 5.83e2 | 72.2 | 1.12e2 | 76.4 |
(20, 100) | 1.49e5 | 1.68e3 | 1.02e7 | 5.02e6 | 2.67e3 | 3.55e2 | 5.72e2 | 3.61e2 |
(20, 200) | 2.56e5 | 3.30e3 | 2.02e7 | 0.98e7 | 4.45e3 | 5.96e2 | 1.14e3 | 6.02e2 |
(100, 20) | N/A | 8.13e6 | N/A | N/A | 5.90e5 | 2.60e2 | 6.91e2 | 2.74e2 |
(100, 100) | N/A | 1.62e7 | N/A | N/A | 2.68e6 | 1.31e3 | 3.42e3 | 1.52e3 |
(100, 200) | N/A | 3.21e7 | N/A | N/A | 4.65e6 | 2.55e3 | 6.91e3 | 2.89e3 |
(200, 20) | N/A | N/A | N/A | N/A | N/A | 7.92e2 | 6.08e3 | 1.07e3 |
(200, 100) | N/A | N/A | N/A | N/A | N/A | 4.09e3 | 3.21e4 | 5.17e3 |
(200, 200) | N/A | N/A | N/A | N/A | N/A | 7.79e3 | 6.22e4 | 1.03e4 |
Comparison of number of haplotypes (|H|) on a general pedigree
Parameter settings (m, k) | directly | HANAP | ||||||
---|---|---|---|---|---|---|---|---|
P _{1} | P _{2} | P_{3} | P _{4} | P _{1} | P _{2} | P _{3} | P _{4} | |
(20, 20) | 2.31e5 | 1.07e3 | 3.65e6 | 1.65e6 | 20.78 | 18.34 | 20.15 | 20.05 |
(20, 50) | 2.53e5 | 2.38e3 | 6.82e6 | 2.24e6 | 21.0 | 19.39 | 21.2 | 20.67 |
(20, 100) | 2.96e5 | 4.51e3 | 1.08e7 | 3.38e6 | 20.9 | 19.5 | 20.75 | 20.8 |
(200, 20) | N/A | N/A | N/A | N/A | 26.8 | 21.3 | 22.1 | 22.4 |
(200, 50) | N/A | N/A | N/A | N/A | 34.4 | 22.1 | 23.5 | 24.2 |
(200, 100) | N/A | N/A | N/A | N/A | 54.7 | 23.9 | 28.9 | 32.8 |
(500, 20) | N/A | N/A | N/A | N/A | 1.41e2 | 22.4 | 23.2 | 23.9 |
(500, 50) | N/A | N/A | N/A | N/A | 2.24e2 | 29.5 | 35.7 | 30.7 |
(500, 100) | N/A | N/A | N/A | N/A | 4.06e2 | 30.1 | 41.4 | 42.5 |
Running time of HANAP
EM-DeCODER is a popular software using the EM algorithm to estimate the haplotype frequencies based on population data. As we have pointed out, it can be used to estimate haplotype frequencies in pedigrees, simply by discarding the pedigree information. Here we also compare the running times of HANAP and EM-DeCODER.
Accuracy rate of HANAP
We define a parameter Δ to incarnate the deviation of the estimate haplotype frequencies from the underlying ones. Because the simulation data are generated according to the Θ*, we recognize that as the underlying true frequencies. Suppose the estimate haplotype set is H^{E} with frequencies Θ^{E}. Compare H^{E} with H*. Suppose the estimate frequencies of the 20 haplotypes in H* are θ_{1}^{E}, θ_{2}^{E},...,θ_{20}^{E}. We let,
Two real data sets
The frequencies of the 20 most frequent haplotypes found by HANAP
Id | Alleles of Marker Set | θ |
---|---|---|
1 | CCAGGTAGCGCGAAGCATTTCTGTAGTACGA | 0.037 |
2 | CCAGGTAGCGCTCTAAGCAACTGGCGACGAG | 0.043 |
3 | CCCGGTGGTACGAAGCACAATCGGCGAACAC | 0.025 |
4 | CCCGGTGGTACGAAGCACAATCGTAGTACGA | 0.102 |
5 | CCCGGTGGTACGAGGTATCTCTAGCAGTCGT | 0.016 |
6 | CCCGGTGGTACGAGGTATCTCTAGCGACGAG | 0.026 |
7 | CTCGGTGGCGCGAGGTGTATCTATAAACGGC | 0.023 |
8 | CTCGGTGGCGCTCTAAGCAGCTGTAGATCGC | 0.117 |
9 | CTCGGTGGCGCTCTATGCAACTGGCGACGAG | 0.178 |
10 | GACGGTGATATGAGGTATCTCTAGCGTACGC | 0.015 |
11 | GACGGTGATATTCTAGGCTTTCGGAAACGAC | 0.021 |
12 | GCCGGTCGCGCGAGGCATATCTATAGATCGC | 0.027 |
13 | GCCGGTCGCGCTCTAAGCAGCTGTAAACGGC | 0.022 |
14 | GTATCGCATATGAAGCACAATCGGAAACGAC | 0.078 |
15 | GTATCGCATATGAGGTGTATCTAGAAACGAC | 0.041 |
16 | GTATCGCATATGAGGTGTATCTAGCAGTCGT | 0.034 |
17 | GTATCGCATATGCGGAGCCGTCAAAATTGGA | 0.023 |
18 | GTATCGCATATGCGGCGCCGTCAGAAACGAC | 0.055 |
19 | GTATCGCATATTCTAAGCAGCTGTAGATCGC | 0.068 |
20 | GTATCGCATATTCTAGGCTTTCGGAAACGAC | 0.021 |
∑ | 0.97 |
The second data set is from the CEPH database [17], which contains 65 families; each consists of only three generations, usually with four grandparents, two parents and a number of children. Figure 5 in [15] shows a typical family with 21 nodes.
A great portion of the alleles in this data set have not been identified, and will be viewed as missing data. We carefully selected a data set of 28 families (totally 482 nodes) on a block (48 markers) from chromosome 14 (452 markers in total) with no recombination. Both HANAP and PHASE (another widely used software package [3] based on the GS algorithm) are applied to this data set. HANAP inferred 36 haplotypes with frequency larger than 0.01 and PHASE inferred 39 ones, among which 31 are common. Although we are not sure which output is closer to the real cases, the running time of HANAP (13m 24s) is extremely shorter than that of PHASE (21h 14m).
Discussion
Complexities of the HCL-linkage analysis haplotyping algorithm
We show that the algorithm runs in O(mn) time and O(mn) space. The pre-process need calculate no more than 3n HCL-linkages in no more than n trios. Each HCL-linkage can be computed in O(m) time, so the pre-process can be done in O(mn) time. It takes step 1 O(n) time to construct the rooted tree. In step 2 and step 3, we have to traverse the whole tree, and visit each node for no more than constant times.
When we process the HCL-linkages from the left lowest nuclear family, we should merge the d_{1} HCL-linkages at each parent node (if we can), it need O(d_{1}m) time, here d_{1} is the number of children in this family. We need another O(d_{1}m) time to exchange the HCL-linkage information between the two parents and transfer that to its root R_{1} in the search tree T. So we need O(d_{1}m) time in total to process this nuclear family. When we transfer the normalized HCL-linkage set = {ψ_{1}, ψ_{2},...ψ_{ k }} to the upper nuclear families, we only need to remember that all ψ_{ i }is coming from R_{1}, so for each ψ_{ i }, it will only take O(1) time to process RE_{ i }, and O(|LS_{ i }|) time to process LS_{ i }and PH_{ i }. The summation time is no more than O(k + LS_{1} + LS_{2} +...+ LS_{ k }) = O(m) because LS_{ i }are disjoint subsets of {1,2,...,m}. In other words, the HCL-linkages in one nuclear family won't increase the processing time of its adjacent families. So the total running time to process all nuclear familes is no more than O(d_{1}m + d_{2}m +...+ d_{ x }m) = O(nm), here x is the number of nuclear families in the whole pedigree.
We need another O(mn) time to complete step 4. Therefore, the time complexity of this algorithm is O(mn).
For the computation, we need to maintain a data structure to store the HCL-linkage set Ψ_{ v }for each node v; we can maintain the storage always below O(d_{ i }m) for nuclear family F_{ i }. So the space complexity of the algorithm is also O(d_{1}m + d_{2}m +...+ d_{ x }m) = O(nm).
Effectiveness of the haplotyping phase
Excoffier used the EM algorithm to estimate haplotype frequencies while ignoring the pedigree information. Here we adopt a two-stage method, which tries to reduce the number of possible haplotypes to be considered in the stage of estimation by utilizing the relatives' information to do haplotyping at first.
Suppose we are estimating haplotype frequencies in trios and each haplotype consists of m biallelic SNP loci. For locus i, suppose that i happens to be one state with a probability of p_{ i }, and to be the other state with a probability of (1 - p_{ i }). Then locus i of the genotype is heterozygous with the probability of 2p_{ i }(1 - p_{ i }). Suppose the expected value of p_{ i }is p, then the genotype is expected to have 2p(1 - p)·m heterozygous loci. As a consequence, a total number of 2^{2p(1-p)·m-1}possible haplotype pairs is expected to be considered if we use the EM algorithm directly. However, the probability that locus i in a trio is ambiguous is 2p_{ i }(1 - p_{ i })·2p_{ i }(1 - p_{ i })·2/4 = 2p_{ i }^{2}(1 - p_{ i })^{2}. So the expected number of possible haplotype configurations for the trio is . If p = 1/2, our method can handle λ = (2p(1 - p))/(2p^{2}(1 - p)^{2}) = 4 times longer genotypes than Excoffier's methods. Moreover, in most cases, the more frequent allele at one locus appears with a probability of more than 0.9, so our method usually can handle λ = 1/p(1 - p) > 10 times longer genotypes.
Furthermore, if each locus of the haplotype is a micro-satellite locus, and it has l different alleles: a_{1}, a_{2},...,a_{ l }, each appears with the probability of p_{1}, p_{2},...,p_{ l }. then the expected number of possible haplotype pairs for a genotype is , the expected number of feasible haplotype configurations for a trio is , so our method usually can handle times longer genotypes. For example, when l = 8, and p_{1} = p_{2} = ... = p_{8} = 1/8, λ = 64, i.e. our method can be applied to cases of much larger scale.
Conclusion
We present a two-stage method to do haplotyping and to estimate haplotype frequencies for pedigree genotype data in this paper. Given a set of pedigrees, it firstly determines all feasible haplotype configurations for each pedigree, then uses the EM algorithm to estimate the haplotype frequencies based on the inferred haplotype configurations. Because a large number of illegal haplotypes have been eliminated from the possible haplotype list, our method is both more efficient and more accurate. The experimental results show that, HANAP runs much faster than EM-DeCODER, and thus can be applied to much larger scale of instances. Moreover, the deviation of the estimate of HANAP is smaller than that of EM-DeCODER, which means it is more accurate.
Our method suggests that pedigree information is of great importance in haplotype analysis. It can be used to speedup estimation process, and to improve estimation accuracy as well. The result also demonstrates that whole haplotype configuration space can be substitute by the space of zero-recombinant haplotype configurations in haplotype frequency estimation, especially when the considered haplotype block is relatively short.
Appendix
Correctness of the HCL-linkage analysis haplotyping algorithm
First, we point out that every consistent haplotype configuration for pedigree P should be consistent with all HCL-linkages calculated by the unique partial solutions within trios, and our merge and transfer operations keep this feature during the whole process, i.e. if ζ is a haplotype configuration for P, and ζ is consistent with all the HCL-linkages in the pedigree, it should also be consistent with those after the merge and transfer operations.
Obviously, HCL-linkages are sufficient to generate a consistent haplotype configuration within a trio. We prove that:
Lemma
If it exists consistent haplotype configuration for a nuclear family (F, M, C_{1}, C_{2},...,C_{ d }), every imputing schema s of arbitrarily imputing the (compound) alleles at one node can output one feasible solution ζ . Contrarily, if one imputing schema ends with incompatibleness, there is no consistent haplotype configuration.
Proof
Firstly, HCL-linkages are necessary for constructing the consistent haplotype configurations, which means that if there is a feasible solution, it should correspond to one imputing schema.
Secondly, we show that if one imputing schema outputs a feasible solution, all the schemas output feasible solutions.
In particular, for a trio (F, M, C_{1}), without loss of generality, we suppose that Step 4 starts by imputing the "alleles" of node F. For a specified "locus" i, the two alleles of F, M and C_{1} are denoted as (a, b), (c, d) and (e, f), and the partial haplotypes from locus 1 to locus (i - 1) are denoted as (ph_{1}, ph_{2}), (ph_{3}, ph_{4}) and (ph_{5}, ph_{6}).
Suppose in schema s, allele a is imputed to ph_{1}, denoted as ph_{1} ← a, and ph_{2} ← b, ph_{3} ← c, ph_{4} ← d, ph_{5} ← e, ph_{6} ← f. If s outputs a consistent haplotype configuration for trio (F, M, C), we proof that s': ph_{2} ← a, ph_{1} ← b, ph_{4} ← c, ph_{3} ← d, ph_{6} ← e, ph_{5} ← f outputs another consistent haplotype configuration.
The loci from 1 to (i - 1) can be viewed as a compound locus I. Let's refer to Table 1, because we can impute a or b to ph_{1}, either or both of "locus" I and i should be ambiguous. Else, they will be linked to a bigger compound "locus" in Step 2 or 3 by HCL-linkages. Whatever a or b will be linked with ph_{1}, we can not impute that arbitrarily in Step 4. Without loss of generality, we assume that locus i is ambiguous, which means that a ≠ b, and {a, b} = {c, d} = {e, f} (please refer to Table 1). We can prove that s' is also a consistent haplotype configuration by enumeration. Because nuclear family (F, M, C_{1}, C_{2},...,C_{ d }) is the intersection of trio (F, M, C_{1}), (F, M, C_{2}),...,(F, M, C_{ d }), the above prove shows that both both s: ph_{1} ← a, ph_{2} ← b, ph_{3} ← c, ph_{4} ← d and s': ph_{2} ← a, ph_{1} ← b, ph_{4} ← c, ph_{3} ← d will lead to a consistent haplotype configuration for the family.
We call s to s' a walk step by switching ph_{1} ← a to ph_{1} ← b. Obviously, for any two imputing schemata s_{1} and s_{2}, we can transfer s_{1} to s_{2} by consecutive walk steps. So s_{1}and s_{2} will lead to a consistent haplotype configuration or neither can.
This lemma indicates that our algorithm works in a nuclear family. We now can prove the correctness of our HCL-linkage analysis haplotyping algorithm by induction, i.e. if there is at least one feasible solution for a general pedigree, HCL-linkages are sufficient to generate all solutions.
Suppose that the root R has multiple child mating nodes: O_{1},...,O_{ r }, each represents a nuclear family; and the algorithm works in all sub-trees, i.e. all of the feasible solutions for those sub-trees can be directly deduced from the HCL-linkages collected from the sub-trees and stored at their roots. From the former lemma, we know that if incompatibleness of type II occurs, there is no feasible solution. We assume that there is no incompatibleness of type II and there are always feasible solutions for all sub-trees. Suppose that haplotype configuration ζ is consistent with all the HCL-linkages at root R (and all the HCL-linkages in the P consequently). Then ζ should be consistent haplotype configuration when restricted to any nuclear family of O_{1},...,O_{ r }, and it should also be consistent with the HCL-linkages at the root of lower sub-trees of O_{1},...,O_{ r }. By induction, ζ should be a feasible solution when restricted to any of these sub-trees. So ζ is a consistent haplotype configuration of the whole pedigree P.
Declarations
Acknowledgements
This work is supported by the National Science Foundation of China under the grant No.60533020.
This article has been published as part of BMC Bioinformatics Volume 7, Supplement 4, 2006: Symposium of Computations in Bioinformatics and Bioscience (SCBB06). The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/7?issue=S4.
Authors’ Affiliations
References
- Excoffier L, Slatkin M: Maximum-likelihood estimation of molecular haplotype frequencies in a diploid population. Mol Biol Evol 1995, 12: 921–927.PubMedGoogle Scholar
- Hawley ME, Kidd KK: HAPLO: a program using the EM algorithm to estimate the frequencies of multi-site haplotypes. J Hered 1995, 86(5):409–11.PubMedGoogle Scholar
- Stephens M, Smith NJ, Donnelly P: A new statistical method for haplotype reconstruction for population data. Am J Hum Genet 2001, 68: 978–989.PubMed CentralView ArticlePubMedGoogle Scholar
- Niu T, Qin ZS, Xu X, Liu JS: Bayesian haplotype inference for multiple linked single nucleotide polymorphisms. Am J Hum Genet 2002, 70: 157–169.PubMed CentralView ArticlePubMedGoogle Scholar
- Griffiths A, Gelbart W, Lewontin R, Miller J: Modern Genetic Analysis: Integrating Genes and Genomes. New York: W.H. Freeman and Company; 2002.Google Scholar
- Cox R, Bouzekri N, et al.: Angiotensinlconverting enzyme (ACE) plasma concentration is influenced by multiple ACElinked quantitative trait nucleotides. Hum Mol Genet 2002, 11: 2969–2977.View ArticlePubMedGoogle Scholar
- Qian D, Beckmann L: Minimum recombinant haplotyping in pedigrees. Am J Hum Genet 2002, 70(6):1434–1445.PubMed CentralView ArticlePubMedGoogle Scholar
- Li J, Jiang T: Efficient rule-based haplotyping algorithms for pedigree data. Proc of RECOMB 2003, 197–206.Google Scholar
- Li J, Jiang T: Efficient inference of haplotypes from genotypes on a pedigree. J Bioinfo Comp Biol 2003, 1(1):41–69.View ArticleGoogle Scholar
- Wijsman EM: A deductive method of haplotype analysis in pedigrees. Am J Hum Genet 1987, 41(3):356–373.PubMed CentralPubMedGoogle Scholar
- O'Connell JR: Zero-recombinant haplotyping: applications to fine mapping using SNPs. Genet Epidemiol 2000, 19(Suppl 1):S64–70.View ArticlePubMedGoogle Scholar
- Elston RC, Stewart J: A general model for the genetic analysis of pedigree data. Human Heredity 1971, 21: 523–542.View ArticlePubMedGoogle Scholar
- Fallin D, Schork NJ: Accuracy of haplotype frequency estimation for biallelic loci, via the expectation maximization algorithm for unphased diploid genotype data. Am J Hum Genet 2000, 67: 947–959.PubMed CentralView ArticlePubMedGoogle Scholar
- Zhao H, et al.: Transmission/disequilibrium tests using multiple tightly linked markers. Am J Hum Genet 2000, 67: 936–946.PubMed CentralView ArticlePubMedGoogle Scholar
- Zhang Q, Chin FYL, Shen H: Minimum Parent-Offspring Recombination Haplotype Inference in Pedigrees. Transactions on Computational Systems Biology LNBI 3680–0100 2005, 2: 100–12.Google Scholar
- Zhang Q, Che H, Zhou Z, Chen G: Comparative study on different approaches to in silico haplotyping. In Technical report. Dept of Computer Science and Technology, University of Science and Technology of China; 2003.Google Scholar
- The CEPH genotype database[http://www.cephb.fr/]
Copyright
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.