Estimate haplotype frequencies in pedigrees

Background Haplotype analysis has gained increasing attention in the context of association studies of disease genes and drug responsivities over the last years. The potential use of haplotypes has led to the initiation of the HapMap project which is to investigate haplotype patterns in the human genome in different populations. Haplotype inference and frequency estimation are essential components of this endeavour. Results We present a two-stage method to estimate haplotype frequencies in pedigrees, which includes haplotyping stage and estimation stage. In the haplotyping stage, we propose a linear time algorithm to determine all zero-recombinant haplotype configurations for each pedigree. In the estimation stage, we use the expectation-maximization (EM) algorithm to estimate haplotype frequencies based on these haplotype configurations. The experiments demonstrate that our method runs much faster and gives more credible estimates than other popular haplotype analysis software that discards the pedigree information. Conclusion Our method suggests that pedigree information is of great importance in haplotype analysis. It can be used to speedup estimation process, and to improve estimation accuracy as well. The result also demonstrates that the whole haplotype configuration space can be substituted by the space of zero-recombinant haplotype configurations in haplotype frequency estimation, especially when the considered haplotype block is relatively short.


Background
The modelling of human genetic variation is critical to the understanding of the genetic basis for complex diseases. Single nucleotide polymorphisms (SNPs) are the most frequent form of variation. The Human Genome Project and other large-scale efforts have identified millions of SNP markers. Although each marker can be analyzed independently, it is more informative to analyze them in groups. Therefore, it is useful to analyze haplotypes (haploid genotypes), which are sequences of linked markers on a single chromosome. In diploid organisms, such as human beings, chromosomes come in pairs, and experi-ments often yield genotype information, which blend haplotype information for chromosome pairs. There is growing evidence that, in order to better characterize the role of a candidate gene, full haplotype information should be exploited instead of using only genotype information. Unfortunately, it is both time-consuming and expensive to derive haplotype information experimentally. This explains the increasing interest in inferring haplotype information, or haplotyping, computationally. In fact, the potential use of haplotypes has led to the initiation of the HapMap project which is to investigate haplotype patterns in the human genome in different populations. Haplotype inference and frequency estimation are essential components of this endeavour.
Genotype data can be with or without any pedigree information, the first category is called population genotype data while the second one is pedigree genotype data. A large number of algorithms have been designed to estimate haplotype frequencies based on population data [1][2][3][4]. Among them, EM algorithms are most popular due to their interpretability and stability.
For any given pedigree genotype data, we can certainly discard the pedigree information and simply take the genotype sequences as the input of EM estimation algorithms for population data. However, it is well accepted that information obtained by analyzing pedigree genotype data is more reliable: the constraint provided by other members in a pedigree would force one genotype to settle on a unique haplotype pair as being most probable.
Here we propose a two-stage method to estimate haplotype frequencies in pedigrees. The first stage is the haplotyping stage, which finds out all feasible haplotype configurations for each pedigree. In the second estimation stage, we use EM algorithm to estimate haplotype frequencies in pedigrees based on the haplotype configurations inferred in the former stage.
In general, haplotyping pedigrees need consider the entire solution space of all possible consistent haplotype configurations. However, the genomic DNA can be partitioned into long blocks such that recombinations within each block are rare or even nonexistent [5,6]. Thus it is believed that haplotype configurations with fewer recombinations should be preferred in haplotype inference [7][8][9]. When the region of interest is so small that the expected number of recombinations in the pedigree data is very close to zero, the solution space of all consistent haplotype configurations can be replaced by that of zero recombination (provided it is non-empty) to estimate haplotype frequencies. It is because the contribution of the solutions of recombinations to the overall likelihood becomes so small compared to those of zero recombination while they bring considerable complexity to the computation. Thus, we are interested in finding the consistent haplotype configurations of zero recombination.
Wijsman [10] proposed a 20-rule algorithm, and O'Connell [11] described a genotype-elimination algorithm, both of which can be used to find out zero-recombinant haplotype configurations for pedigrees. Recently, Li and Jiang [8,9] showed that it could be solved in polynomial time. Here we propose an algorithm to find out zerorecombinant haplotype configurations in linear time using a technique called HCL-linkage analysis.
In the second stage, we use the EM algorithm to estimate haplotype frequencies based on haplotype configurations obtained from the haplotyping stage. We employ the Hardy-Weinberg Equilibrium to obtain the probabilities of founder genotypes and use a genetic model [12] to deduce the transmitted probabilities of non-founders. While the likelihood of each configuration is computed by multiplying the probabilities of each genotype, the frequency of each haplotype that appears in the configuration is calculated by a gene-counting method.
We implement all the algorithms in a C software package named HANAP (Haplotype ANAlysis in Pedigrees) and test its effectiveness and efficiency both on simulated and real data sets. The experimental results show that, our method runs much faster than the direct frequency estimation software that discards the pedigree information. Moreover, because our method utilizes such information, the estimation is more reliable.

Haplotyping stage: haplotyping algorithm based on HCLlinkage analysis
Excoffier's EM algorithm was widely applied in haplotype analysis [14,15]. Unfortunately, it should calculate the frequencies of all possible haplotype pairs consistent to each given genotype, which is unbearable in storage when the haplotype length grows to more than 20 [16]. O'Connell [10] showed that genetic information from relatives could be used to resolve one genotype's ambiguity, and thus reduce the number of haplotypes that should be considered. However, O'Connell's method had an exponential time complexity. Recently, Li and Jiang [8,9] showed that, for any genotype in a given pedigree, its ambiguity could be solved in cubic time (O(m 3 n 3 )), where n is the number of members in the pedigree and m is the number of loci in each genotype. Here we present a so-called HCLlinkage analysis method to do haplotyping in linear time (O(mn)).

HCL-linkage definition
Trios are simple pedigrees that contain only a pair of parents and a child. A consistent zero-recombinant haplotype configuration for a general pedigree should also be a consistent zero-recombinant haplotype configuration when restricted to each trio in this pedigree. Given trio T = (F, M, C), here F is the father, M is the mother and C is the child, suppose that locus i of F, M, and C have alleles {a, b}, {c, d} and {e, f} (note that {e, f} ⊂ ({a, b} ∪ {c, d})). The genotype information of C can be homozygous or heterozygous. If it is homozygous (e = f), then it is clear that the paternal allele and the maternal allele are the same (e or f). The situation becomes complicated if it is a heterozygous site (e ≠ f). Table 1 lists out all possible situations. We can see that given the locus genotype information for the three members, it may or may not be possible to determine the paternal allele and the maternal allele for the child. We call a locus ambiguous if its inheritance relationship cannot be resolved.
In fact, for any trio, ignoring the ambiguous loci, the consistent (partial) haplotype configuration for the unambiguous loci is unique and specifies a linkage of alleles on some heterozygous loci for each node in the trio. We define such linkage as HCL-linkages (linkages of Haplotype Configuration on the non-ambiguous Loci).

Definition
An HCL-linkage ψ is a quadruplet <v, RE, LS, PH> defined on node v and specified by the unique consistent (partial) haplotype configuration within a trio that contains v. Here v denotes the node to which the HCL-linkage belongs. LS = {a 1 ,...,a l } is the set of heterozygous loci where the haplotype configuration has been inferred. PH = {ph, ph'} = {(h 1 ...h l ), (h 1 '...h l ')} records the two (partial) haplotypes imputed on these loci. RE = (R, R') denotes that ph (respectively ph') is inherited from or will be passed on to the node in set R(R').
An HCL-linkage describes the partial haplotype configuration of a node and the inheritance relationship between the parents and their children. Under our definition, we can conclude that every haplotype configuration should be consistent with any HCL-linkage specified by each trio in the pedigree.

Merge and transfer operations over HCL-linkages
In the case of multiple generations and multiple children, loci on one node may be linked by different HCL-linkages. HCL-linkages of the same node should be merged if they can. There are three cases when merging two HCLlinkages ψ 1 = <v, (R 1 , R 1 '), LS 1 , {ph 1 , ph 1 '}> and ψ 2 = <v, it means that both ph 1 and ph 2 are from the nodes in R 1 and R 2 said ph 1 ' and ph 2 ' are from the nodes in R 1 ' and R 2 ', so ph 1 and ph 2 should be on the same haplotype, and ph 1 ' and ph 2 ' on the other: i) LS 1 ∩ LS 2 = Φ, or LS 1 ∩ LS 2 ≠ Φ but ph 1 equals ph 2 when restricted to loci in LS 1 ∩ LS 2 , it means that ψ 1 and ψ 2 are compatible. In this case, they should be merged to ψ = <v, here ph 1 ∪ ph 2 denote a longer partial haplotype, which alleles equal to those of ph 1 and ph 2 when restricted to loci in LS 1 and LS 2 ; ii) LS 1 ∩ LS 2 ≠ Φ and ph 1 doesn't equal ph 2 when restricted to LS 1 ∩ LS 2 , it means that ψ 1 and ψ 2 are incompatible, i.e. no haplotype configuration can satisfy the two HCL-linkages in the same time.
should be on the same haplotype, and ph 1 ' and ph 2 on the other. Similarly, ψ 1 and ψ 2 can be merged to ψ = <v, (R 1 ∪ 2.a) ph 1 equals ph 2 (and ph 1 ' equals ph 2 ' consequently) or ph 1 equals ph 2 ' (then ph 1 ' equals ph 2 ) when restricted to LS 1 ∩ LS 2 , it means that ψ 1 and ψ 2 are compatible, in this case, they should be merged to ψ = <v, (
In this case, ψ 1 and ψ 2 cannot be merged and both should be recorded in a HCL-linkage set Ψ v for node v.
With the merge operation, we can define the normalizing of a set of HCL-linkages Ψ v : normalizing a set Ψ v of HCLlinkages on node v means repeatedly applying the merge Like genetic information, HCL-linkages will be passed on from generations to generations. Without loss of generality, let us define the transfer of HCL-linkage information from child C to its parent F. The other case from F to C would be similar. Let Ψ C and Ψ F represent the normalized HCL-linkage sets of C and F respectively, and let HS be the are two cases to consider.
resulting partial haplotypes of ph C (respectively ph C ') when restricted to loci in LS C -HS and ph F ' is the compensatory partial haplotypes of ph F consistent to genotype g F .

Case
(2): else, F ∉ R C ∪ R C ': i) both ph C and ph C ' are consistent with the partial genotype g F when restricted to loci here ph F and ph F ' equal the resulting partial haplotypes of ph C and ph C ' when restricted to loci in LS C -HS; ii) ph C ' (respectively ph C ) is not consistent with the partial genotype g F when restricted to loci in LS C , then add ψ F = <F, resulting partial haplotypes of ph C (respectively ph C ') when restricted to loci in LS C -HS and ph F ' is the compensatory partial haplotypes of ph F consistent to genotype g F . Note that at least one of ph C and ph C ' should be consistent with the partial genotype g F .
Remember that Ψ F should be normalized whenever adding a new HCL-linkage to it. In the case of transferring an HCL-linkage ψ F from F to C, resulting in adding ψ C = <C, Our merge and transfer operations will not bring more or lose any HCL-linkage information for building consistent haplotype configurations.

Main HCL-linkages analysis haplotyping algorithm
Before the algorithm, we preprocess each trio in the pedigree. Whenever a trio specifies an HCL-linkage for node v, it will be stored in the HCL-linkage set Ψ v . The objective of the algorithm is to collect the complete HCL-linkage information for each node, which is accomplished by traversing the tree twice.
Firstly, we will convert the input pedigree into a rooted searching tree T (at an arbitrary node R) (Step 1). Then we traverse T in post-order to transfer and merge the HCLlinkage information for each node from its relatives (Step 2). We do this from the left lowest nuclear family F o . The HCL-linkages in nuclear family F o will be merged at both parents, and then be transferred to the root of the sub-tree.
The same operations will be conducted in its parental nuclear family on HCL-linkages specified in this family as well as on those transferred from its child families. And at last, we collect all the HCL-linkages at the root R. In Step 3, we traverse T again in pre-order and transfer the linkage in another direction from R to its farmost descendants.
After step 3, the HCL-linkage set of each node preserves all HCL-linkages in the pedigree. In step 4, we choose a node v arbitrarily. Set Ψ v contains several HCL-linkages ψ 1 , ψ 2 ,...,ψ l defined on disjoint locus set LS 1 , LS 2 ,...LS l . When a set of loci are linked by one HCL-linkage, they can be viewed as a compound locus, and the two partial haplotypes can be viewed as two compound alleles. These "loci" (and "alleles") will be treated equally as the other heterozygous loci and homozygous loci that are not involved in any HCL-linkage. We arbitrarily select one allele from the two at each locus to form a haplotype; the other alleles form another haplotype. It is called an imputing schema. Whenever the haplotype configuration of one node is determined, it can be used to determine the configurations of its relatives, and those of the whole pedigree at last.
During our algorithm, Incompatibleness may occur when normalizing HCL-linkage set Ψ v . Then we declare that there is no solution and exit from the algorithm immediately. Even in step 4, incompatibleness may still occur when applying the haplotypes of the parents to resolve the genotype of the children in the case that an individual node has multiple children. Figure 1 shows an example.
The key point is, if it exists a consistent haplotype configuration for a nuclear family (F, M, C 1 , C 2 ,...,C d ), every arbitrary imputing schema s can output one feasible solution ζ. Contrarily, if one imputing schema ends with incom-patibleness, other schemata will fail too. We will prove this in the appendix.
The time complexity and space complexity of our algorithm are both O(mn) where n is the number of the members in the pedigree and m is the length of the loci.

Frequency estimation stage
Suppose that we are given K pedigrees P = {P 1 , P 2 ,...,P K }.  Suppose that in the r-th iteration, Θ = Θ (r) and we want to estimate Θ (r+1) . Then we have: Let δ i,j,t be an indicator variable equalling the number of haplotype h t appear in solution S s,i . Then the haplotype frequencies can be computed using a gene-counting method, There are several ways to initialize the haplotype frequencies Θ = {θ 1 , θ 2 ,...,θ l }. For instance, the initial haplotype frequencies can be chosen at random, or all haplotypes are equally frequent, i.e. Θ t (0) = 1/l (t = 1, 2,...,l). Or that all initial haplotype frequencies are equal to the product of the corresponding single-locus allele frequencies (i.e., a complete linkage equilibrium). Also, we can set all feasible solutions for each pedigree to be equally likely, i.e. Pr(S s,i |Θ (0) ) = 1/π i , (j = 1,2,...,π i ). We can even initialize the haplotype frequencies by counting their occurrence in all the feasible solutions. Since in practical applications the EM algorithm could be trapped in some local maximum, we recommend to restart the algorithm several times with  The stopping (convergence) criterion is defined as the absolute value of the difference of Θ between consecutive iterations being less than some small value ε > 0.

Simulated data set
In order to generate a pedigree genotype data set for simulation experiments, we generate a population of haplotypes H* first, where each locus of each haplotype is set to some allele according to the probability distribution function P. In our simulation, we generated haplotypes of SNP loci as well as haplotypes of micro-satellite loci. For a biallelic SNP locus i, suppose that i happens to be one state with a probability of p i , and to be the other state with a probability of (1 -p i ). For a micro-satellite loci, suppose it has w different alleles: a 1 , a 2 ,...,a w , each appears with the probability of p 1 , p 2 ,...,p w (p 1 + p 2 +...+ p w = 1).
Each founder node in any tested pedigree is arbitrarily assigned a pair of haplotypes according to their frequencies θ*. The two haplotypes of a non-founder node are arbitrarily selected from those of its parents (one from the father, one from the mother). At last, the pair of haplotypes of the same node is blended to form a genotype corresponding to that node.
All experiments are conducted on a Windows server with 1.7G Hz CPU and 256 MB RAM. And for each parameter setting, 100 copies are randomly generated and the performance is evaluated by computing the average numbers in these 100 runs.

Running time of the haplotyping algorithm
One of the main contributions of our paper is to do haplotyping in linear time, so we firstly examine the running time with respect to different number of nodes of each pedigree (n) and different number of loci in each sequence (m).
Several different tree pedigree structures are used in the simulation, the first pedigree is Figure 1 in [15], which is a tree with 13 nodes. The second one is Figure 8 in [9], which is a tree with 29 nodes. The third one is a 21 node pedigree from Figure 5 of [15]. The results are given in Figure 2. It is obvious that our HCL-analysis haplotyping algorithm runs in linear time and thus could be applied to large-scale haplotype analysis.
When only trio pedigrees are considered, the average numbers of haplotypes are recorded in Table 2. We can see from the table that the numbers of haplotypes that should be estimated have been greatly reduced after the haplotyping stage (HANAP vs. directly), which will immediately bring the improvement on the running time.
We also consider a more complex pedigree that contains 13 nodes (Figure 1 of [15]). The average numbers of haplotypes are recorded in Table 3. We find that the number of haplotypes that should be estimated is even much Running time of the haplotyping algorithm smaller. We also notice that the number of haplotypes is growing with the length of haplotypes and the number of pedigrees. However, it grows very slowly.

Running time of HANAP
EM-DeCODER is a popular software using the EM algorithm to estimate the haplotype frequencies based on population data. As we have pointed out, it can be used to estimate haplotype frequencies in pedigrees, simply by discarding the pedigree information. Here we also compare the running times of HANAP and EM-DeCODER.  We let, Figure 4 shows the deviation of the estimate of HANAP and EM-DeCODER over different number of trios (k), length of haplotypes (m) and distributions of allele-probability (P). We can learn from the figure that the deviation of HANAP is smaller than that of EM-DeCODER, which means HANAP is more accurate. We have also noticed that the deviation of the estimate increases with the length of haplotypes, and decreases with the number of trios.   The running time of HANAP and EM-DeCODER

Two real data sets
We also test the efficiency and accuracy of HANAP on two real data sets. The first real data set is from dbMHC|ABDR, a set of 122 trios. Each genotype of these trios contains 31 markers of the same position on chromosome 6, 10 of which are micro-satellite markers and others are SNPs. We run HANAP to find the most frequent haplotypes (with frequencies larger than 0.01). A list of 20 haplotypes is found by HANAP. Their frequencies are shown in Table 4. It only takes HANAP 0.97 second to find these haplotypes while it is out of the capability of EM-DeCODER.
The second data set is from the CEPH database [17], which contains 65 families; each consists of only three generations, usually with four grandparents, two parents and a number of children. Figure 5 in [15] shows a typical family with 21 nodes.
A great portion of the alleles in this data set have not been identified, and will be viewed as missing data. We carefully selected a data set of 28 families (totally 482 nodes) on a block (48 markers) from chromosome 14 (452 markers in total) with no recombination. Both HANAP and PHASE (another widely used software package [3] based on the GS algorithm) are applied to this data set. HANAP inferred 36 haplotypes with frequency larger than 0.01 and PHASE inferred 39 ones, among which 31 are common. Although we are not sure which output is closer to the real cases, the running time of HANAP (13m24s) is extremely shorter than that of PHASE (21h14m).

Complexities of the HCL-linkage analysis haplotyping algorithm
We

Effectiveness of the haplotyping phase
Excoffier used the EM algorithm to estimate haplotype frequencies while ignoring the pedigree information. Here we adopt a two-stage method, which tries to reduce the number of possible haplotypes to be considered in the stage of estimation by utilizing the relatives' information to do haplotyping at first.

Conclusion
We present a two-stage method to do haplotyping and to estimate haplotype frequencies for pedigree genotype data in this paper. Given a set of pedigrees, it firstly determines all feasible haplotype configurations for each pedigree, then uses the EM algorithm to estimate the haplotype frequencies based on the inferred haplotype configurations. Because a large number of illegal haplotypes have been eliminated from the possible haplotype list, our method is both more efficient and more accurate. The experimental results show that, HANAP runs much faster than EM-DeCODER, and thus can be applied to much larger scale of instances. Moreover, the deviation of the estimate of HANAP is smaller than that of EM-DeCODER, which means it is more accurate.
Our method suggests that pedigree information is of great importance in haplotype analysis. It can be used to speedup estimation process, and to improve estimation accuracy as well. The result also demonstrates that whole haplotype configuration space can be substitute by the space of zero-recombinant haplotype configurations in haplotype frequency estimation, especially when the considered haplotype block is relatively short.
The loci from 1 to (i -1) can be viewed as a compound locus I. Let's refer to Table 1, because we can impute a or b to ph 1 , either or both of "locus" I and i should be ambiguous. Else, they will be linked to a bigger compound "locus" in Step 2 or 3 by HCL-linkages. Whatever a or b will be linked with ph 1 , we can not impute that arbitrarily in Step 4. Without loss of generality, we assume that locus i is ambiguous, which means that a ≠ b, and {a, b} = {c, d} = {e, f} (please refer to Table 1). We can prove that s' is also a consistent haplotype configuration by enumeration. Because nuclear family (F, M, C 1 , C 2 ,...,C d ) is the intersection of trio (F, M, C 1 ), (F, M, C 2 ),...,(F, M, C d ), the above prove shows that both both s: ph 1 ← a, ph 2 ← b, ph 3 ← c, ph 4 ← d and s': ph 2 ← a, ph 1 ← b, ph 4 ← c, ph 3 ← d will lead to a consistent haplotype configuration for the family.
We call s to s' a walk step by switching ph 1 ← a to ph 1 ← b.
Obviously, for any two imputing schemata s 1 and s 2 , we can transfer s 1 to s 2 by consecutive walk steps. So s 1 and s 2 will lead to a consistent haplotype configuration or neither can.
This lemma indicates that our algorithm works in a nuclear family. We now can prove the correctness of our HCL-linkage analysis haplotyping algorithm by induction, i.e. if there is at least one feasible solution for a general pedigree, HCL-linkages are sufficient to generate all solutions.
Suppose that the root R has multiple child mating nodes: O 1 ,...,O r , each represents a nuclear family; and the algorithm works in all sub-trees, i.e. all of the feasible solutions for those sub-trees can be directly deduced from the HCL-linkages collected from the sub-trees and stored at their roots. From the former lemma, we know that if incompatibleness of type II occurs, there is no feasible solution. We assume that there is no incompatibleness of type II and there are always feasible solutions for all subtrees. Suppose that haplotype configuration ζ is consistent with all the HCL-linkages at root R (and all the HCL-linkages in the P consequently). Then ζ should be consistent haplotype configuration when restricted to any nuclear family of O 1 ,...,O r , and it should also be consistent with the HCL-linkages at the root of lower sub-trees of O 1 ,...,O r .
By induction, ζ should be a feasible solution when restricted to any of these sub-trees. So ζ is a consistent haplotype configuration of the whole pedigree P.