 Proceedings
 Open Access
 Published:
A lineartime algorithm for reconstructing zerorecombinant haplotype configuration on a pedigree
BMC Bioinformatics volume 13, Article number: S19 (2012)
Abstract
Background
When studying genetic diseases in which genetic variations are passed on to offspring, the ability to distinguish between paternal and maternal alleles is essential. Determining haplotypes from genotype data is called haplotype inference. Most existing computational algorithms for haplotype inference have been designed to use genotype data collected from individuals in the form of a pedigree. A haplotype is regarded as a hereditary unit and therefore input pedigrees are preferred that are free of mutational events and have a minimum number of genetic recombinational events. These ideas motivated the zerorecombinant haplotype configuration (ZRHC) problem, which strictly follows the Mendelian law of inheritance, namely that one haplotype of each child is inherited from the father and the other haplotype is inherited from the mother, both without any mutation. So far no lineartime algorithm for ZRHC has been proposed for general pedigrees, even though the number of mating loops in a human pedigree is usually very small and can be regarded as constant.
Results
Given a pedigree with n individuals, m marker loci, and k mating loops, we proposed an algorithm that can provide a general solution to the zerorecombinant haplotype configuration problem in O(kmn + k^{2}m) time. In addition, this algorithm can be modified to detect inconsistencies within the genotype data without loss of efficiency. The proposed algorithm was subject to 12000 experiments to verify its performance using different (n, m) combinations. The value of k was uniformly distributed between zero and six throughout all experiments. The experimental results show a great linearity in terms of execution time in relation to input size when both n and m are larger than 100. For those experiments where n or m are less than 100, the proposed algorithm runs very fast, in thousandth to hundredth of a second, on a personal desktop computer.
Conclusions
We have developed the first deterministic lineartime algorithm for the zerorecombinant haplotype configuration problem. Our experimental results demonstrated the linearity of its execution time in relation to the input size. The proposed algorithm can be modified to detect inconsistency within the genotype data without loss of efficiency and is expected to be able to handle recombinant and missing data with further extension.
Background
A genetic disease is caused by the abnormality in an individual's genome. Genetic diseases have been studied extensively for decades by investigating the connection between diseases and genetic variations. In the human genome, chromosomes come in pairs; each gene consists of two alleles that reside in different chromosomes at the same locus. One of the two alleles comes from the father and the other comes from the mother. To study hereditary diseases in which the genetic variations are passed on to offspring, the ability to distinguish between paternal and maternal alleles is essential. Unfortunately, the haplotype structure of a human genome is not available directly from the genotyping and the unordered genotype data does not tell us which allele comes from which parent. A haplotype is a collection of alleles at multiple loci on a chromosome that tend to be inherited as a unit. The determination of haplotypes from genotype data is called haplotype phasing or haplotype inference. Algorithms for haplotype inference are indispensable and have been intensively studied.
The existing computational algorithms for haplotype inference can be classified into statistical and combinatorial and most of which were designed for genotype data collected from individuals in the form of a pedigree. A pedigree is a hierarchical structure that describes the parentchild relationship among members of a family. Individuals without parents are called founders. There may be cycles in a pedigree, which are referred to as mating loops. A mating loop arises from a couple if they have children and both of them are offspring of certain family ancestors. An example of a pedigree, coupled with genotype data, is depicted in Figure 1(a); each allele is denoted as 0 or 1 to represent its form within a gene. If two alleles of a gene are the same, the locus is homozygous; otherwise, it is heterozygous. A haplotype is regarded as a hereditary unit and therefore an input pedigree is preferred to be free of mutational events and to have minimum number of genetic recombinational events [1]. Haplotype inference under this assumption is referred to as the minimumrecombinant haplotype configuration (MRHC) problem, which requires the solving of the haplotype structure of the input pedigree with the minimum number of recombination events [1]. Several algorithms have been proposed to solve the MRHC problem [1–8]. A special case of MRHC is zerorecombinant haplotype configuration (ZRHC) problem, which strictly follows the Mendelian law of inheritance, namely that one haplotype of each child is inherited from the father and the other haplotype is inherited from the mother, without any mutation [9]. To reduce the complexity of the ZRHC, some algorithms have been applied to pedigrees without mating loops (called tree pedigrees) [10–12]. In contrast to algorithms targeting tree pedigrees, so far no lineartime algorithm for ZRHC has been proposed for general pedigrees, even though the number of mating loops in a human pedigree is usually very small and can be regarded as constant; the execution time of existing algorithms for ZRHC using general pedigrees is polynomial [4, 13–17]. Regardless of whether it is a MRHC or a ZRHC problem, some algorithms have been extended to handle pedigrees with mutations or missing data [5, 8, 11, 15]. In addition to haplotype inference from pedigree data, algorithms have been proposed for population datasets that come from unrelated individuals. Algorithms for population datasets try to decode the haplotype structure of each individual as well as the haplotype frequencies of a population [18–22]. All the above mentioned algorithms are mainly combinatorial. Readers who are interested in statistical approaches for haplotype inference can consult a recent review [23].
In this study, we have targeted the ZRHC problem for pedigree data. If we assume we are given a pedigree with n individuals and m marker loci. Then for general pedigrees, Li and Jiang proposed an O(m^{3}n^{3}) time algorithm by converting the inheritance process into an equivalent linear system of O(mn) equations over Galois field GF(2) and invoking Gaussian elimination [4]. Xiao et al. improved the method to take O(mn^{2} + n^{3} log^{2} n log log n) time by removing redundant equations from the linear system [16]. Doan et al. proposed an O(mnα(m)) time algorithm by exploring constraints among marker loci rather than family members, where α(·) is the inverse of the Ackermann function [14]. For tree pedigrees, the execution time of the algorithm proposed by Xiao can be reduced from O(mn^{2} + n^{3} log^{2} n log log n) to O(mn + n^{3}) [16]. Li and Li proposed an O(mnα(n)) time algorithm using disjointset data structures [11]. Liu et al. further lowered the complexity of Xiao's algorithm to linear time O(mn) [12]. Chan et al. also proposed a lineartime algorithm by maintaining a graph structure [10]. Chan's algorithm, however, only produce a particular solution. A particular solution assigns a numerical value to each system variable, while a general solution describes all possible solutions of the system by designating certain variables as free variables and the others as linear combinations of these free variables.
In this paper, we presented an O(kmn + k^{2}m) time algorithm that provides a general solution for ZRHC for general pedigrees, where k is the number of mating loops. In human pedigrees, k is usually very small and can be regarded as constant. Our algorithm therefore turns out to be linear for most of the practical cases. The proposed algorithm was subject to 12000 experiments to verify its performance using different (n, m) combinations. The value of k was uniformly distributed between zero and six throughout all experiments. The experimental results show a great linearity of the execution time in relation to the input size when both n and m are larger than 100. For those experiments where n or m are less than 100, the proposed algorithm runs very fast, from thousandth to hundredth of a second on a personal desktop computer. We also showed that the proposed algorithm can be easily modified to detect inconsistencies among genotype data without loss of efficiency.
Methods
To apply computational techniques, we transformed the input pedigree into a pedigree graph by connecting each parent directly to its children (Figure 1(b)). A pedigree graph is an undirected graph G = (V, E), where V is a set of nodes and E a set of edges. Each node in V represents an individual in the pedigree; each pair of nodes is connected with an edge in E if and only if the two individuals have a parentchild relationship. G is defined to be undirected because the computational property of each edge is symmetric in our algorithm, even if the parentchild relationship is asymmetric. G may contain cycles. We only pay attention to two types of cycles: a cycle due to a mating loop, which is called a global cycle and a cycle due to a couple and two of their children, which is called a local cycle. Global cycles and local cycles are referred to as basic cycles. For ease of cycle processing, we construct a spanning tree T (G) on G. A basic cycle can be obtained by adding a nontree edge into T (G). The set of nontree edges is denoted by E^{X}. Nontree edges are further divided into two disjoint subsets {E}_{L}^{X} and {E}_{G}^{X}; members in {E}_{L}^{X} induce local cycles and members in {E}_{G}^{X} induce global cycles. Mating loops seldom appear in human pedigrees and therefore \left{E}_{G}^{X}\right=k is regarded as a small constant.
In the rest of this paper, we are assuming that G has n nodes and m loci, all alleles are biallelic (denoted by 0 or 1), and the input dataset is free of genotyping errors. Under this assumption, the input size of ZRHC is O(mn). The genotype data of a node n_{ i } are represented as a vector {g}_{{n}_{i}} of size m. The genotype of n_{ i } at locus l, where 1 ≤ l ≤ m, is defined as follows:
Genotype data are available, thus all gvariables can be regarded as constant (Figure 1(b)). We introduce a vector {p}_{{n}_{i}} of size m to describe the haplotype information of n_{ i }; the paternal allele of n_{ i } at locus l, where 1 ≤ l ≤ m, is defined as follows:
The vector {p}_{{n}_{i}} is regarded as unknown even though we know that {p}_{{n}_{i}}\left[l\right]={g}_{{n}_{i}}\left[l\right] if n_{ i } is homozygous at locus l (i.e. {g}_{{n}_{i}}[l] ≠ 2).
We formulated the ZRHC problem as follows.
ZRHC Given a pedigree graph G(V, E) with full gconstants, determine {p}_{{n}_{i}} of each node n_{ i } in V.
The haplotype configuration of the input pedigree is identified by specifying the paternal haplotype of each family member.
A system of linear equations over GF(2)
In this section, we introduce a system of linear equations based on G and gconstants; this system was first proposed in [16] and will be reduced to determine all pvariables. Since pvariables carry binary values, all equations in the linear system are defined over GF(2) whose operations addition (+) and multiplication (·) are shown in Table 1.
The building block of the system: inheritance
"Inheritance" is the building block of the system. What parents pass to their children must be the same as what children receive from their parents. For a parent n_{ i } and a locus l, n_{ i } passes {p}_{{n}_{i}}[l] + 1 to its children if and only if the genotype of n_{ i } at locus l is heterozygous and n_{ i } passes its maternal allele; otherwise n_{ i } passes {p}_{{n}_{i}} [l] to its children. We introduce two auxiliary variables {w}_{{n}_{i}}\left[l\right]{\text{and}h}_{{n}_{i},{n}_{j}} to formally state the above argument. The variable {w}_{{n}_{i}}\left[l\right] indicates if locus l of n_{ i } is heterozygous.
The variable {h}_{{n}_{i},{n}_{j}} indicates which allele of n_{ i } is passed to its child n_{ j }.
Therefore, {p}_{{n}_{i}}\left[l\right]+{w}_{{n}_{i}}\left[l\right]\cdot {h}_{{n}_{i},{n}_{j}} represents the allele at locus l that n_{ i } passes to n_{ j }.
On the other hand, assume that n_{ j } receives an allele from n_{ i }. If n_{ i } is n_{ j }'s father, what n_{ i } passes to n_{ j } is the paternal allele of n_{ j }. In this case, we have {p}_{{n}_{i}}\left[l\right]+{w}_{{n}_{i}}\left[l\right]\cdot {h}_{{n}_{i},{n}_{j}}={p}_{{n}_{j}}\left[l\right]. If n_{ i } is n_{ j }'s mother, there are two subcases. If locus l of n_{ j } is homozygous, what n_{ i } passes to n_{ j } must be the same as the paternal allele of n_{ j }. In this case, we have {p}_{{n}_{i}}\left[l\right]+{w}_{{n}_{i}}\left[l\right]\cdot {h}_{{n}_{i},{n}_{j}}={p}_{{n}_{j}}\left[l\right]. If locus l of n_{ j } is heterozygous, what n_{ i } passes to n_{ j } is the maternal allele of n_{ j } and is different from the paternal allele of n_{ j }. In this case, we have {p}_{{n}_{i}}\left[l\right]+{w}_{{n}_{i}}\left[l\right]\cdot {h}_{{n}_{i},{n}_{j}}={p}_{{n}_{j}}\left[l\right]+1. The variable {w}_{{n}_{j}}\left[l\right] can be used to indicate if locus l of n_{ j } is homozygous or heterozygous, the two subcases can therefore be combined into a single equation {p}_{{n}_{i}}\left[l\right]+{w}_{{n}_{i}}\left[l\right]\cdot {h}_{{n}_{i},{n}_{j}}={p}_{{n}_{j}}\left[l\right]+{w}_{{n}_{j}}\left[l\right]. Moreover, if we introduce another auxiliary variable {d}_{{n}_{i},{n}_{j}}\left[l\right] as follows,
the inheritance relationship can be unified into the following equation:
Note that the w and dvariables are constant by definition, and the p and hvariables are unknowns. Equation (1) formulates the property of edge (n_{ i }, n_{ j }) in G: pvariables and wconstants are attributes of the nodes n_{ i } and n_{ j }, and hvariables and dconstants describe the inheritance relation associated with the edge (n_{ i }, n_{ j }). With the information provided by Equation (1), various constraints on hvariables can be generated by traversing different paths in G. Our algorithm was designed to first determine hvariables based on these constraints and then the solution to the ZRHC problem can be obtained by determining all pvariables based on the solved hvalues and Equation (1). One point needs special care: if n_{ j } is a child of {n}_{i},{h}_{{n}_{j},{n}_{i}} and {d}_{{n}_{j},{n}_{i}} are undefined. In our algorithm, we make the hvariables and dconstants symmetrical such that {h}_{{n}_{j},{n}_{i}}={h}_{{n}_{i},{n}_{j}} and {d}_{{n}_{j},{n}_{i}}={d}_{{n}_{i},{n}_{j}}.
Linear constraints on hvariables
To reduce the computational complexity of our algorithm, we try to make the number of unknowns in the coming linear system as small as possible. In the pedigree graph G, we have mn pvariables and at most 2n hvariables (since each individual has two parents and there are at most n individuals). Observe that if a node n_{ i } itself or one of its parents is homozygous at locus l, {p}_{{n}_{i}}\left[l\right] is determined by definition and Equation (1). In this case n_{ i } is referred to as predetermined at locus l and the number of unknown pvariables is reduced by one. Moreover, for an edge (n_{ i }, n_{ j }) ∈ E, where n_{ i } is a parent of {n}_{j},\phantom{\rule{0.3em}{0ex}}{h}_{{n}_{i},\phantom{\rule{0.3em}{0ex}}{n}_{j}} is cancelled from Equation (1) if {w}_{{n}_{i}}\left[l\right]=0 at locus l. If {w}_{{n}_{i}}\left[l\right]=0 holds for all 1 ≤ l ≤ m, no constraints are imposed on {h}_{{n}_{i},{n}_{j}} and it becomes a free variable (or its value will finally depend on other free variables). In this case the number of hvariables to be determined is reduced by one, which is equipotent to the removal of edge (n_{ i }, n_{ j }) from G. Accordingly, wconstants can be viewed as the weight of edges in G; we only pay attention to edges with weight one (parent nodes that are heterozygous). To consider only the edges with weight one at locus l, we construct the lth locus graph G_{ l } = (V, E_{ l }), where E_{ l } = {(n_{ i }, n_{ j })  n_{ i } is a parent of n_{ j }, {w}_{{n}_{i}}\left[l\right]=1}. Moreover, the spanning forest T(G) ∩ G_{ l } is denoted by T(G_{ l }) and is referred to as the lth locus forest (Figure 1(c)).
We define constraints on hvariables by traversing paths in the locus graphs. Consider a path p = n_{0}, n_{1}, ..., n_{ i } in G_{ l }. Assume that n_{0} and n_{ i } are predetermined and all other inbetween nodes are nonpredetermined. Adding up all hvariables on the path will produce the following equation by Equation (1):
Since n_{0} and n_{ i } are predetermined and all dconstants are known, b is a constant. The constant b is said to be the constraint of path p. Note that the constraint b does not depend on the direction that path p is read because the hvariables and dconstants are symmetric. Moreover, if the path is a cycle c = n_{0}, n_{1}, ..., n_{ i }, n_{0} in G_{ l }, we would have the following equation:
Again, since all dconstants are known, b' is also a constant. The constant b' is said to be the constraint of cycle c. On the basis of Equations (2) and (3), we can generate constraint equations with only hvariables for cycles or for paths that connect predetermined nodes in G_{ l }. Constraints can be classified into two categories with respect to the spanning tree T(G): cycle and path constraints derived from paths containing nontree edges, and tree constraints derived from paths containing only tree edges.
Cycle and Path constraints
Adding a nontree edge e into the spanning tree T (G) generates a basic cycle c. If G_{ l } contains e, there are two cases of c in G_{ l }.
Case 1 c is in G_{ l }. A cycle constraint b_{ c } of cycle c can be obtained by Equation (3). The constraint is denoted interchangeably by b_{ c } or (b_{ c }, e), which is also said to be the cycle constraint of e.
Case 2 c is broken into several disjoint paths in G_{ l } by predetermined nodes. Since these paths are disjoint, there is exactly one path p' of them containing e. Along the path p', we identify a subpath p = n_{ i } ...n_{ j } containing e such that n_{ i } and n_{ j } are predetermined and all other inbetween nodes are nonpredetermined. A path constraint b_{ p } of the subpath p can be obtained by Equation (2). The constraint is denoted interchangeably by b_{ p } or (n_{ i }, n_{ j }, b_{ p }, e), which is also said to be the path constraint of e. Path constraints are symmetric because (n_{ i }, n_{ j }, b_{ p }, e) = (n_{ j }, n_{ i }, b_{ p }, e).
Tree constraints
For each connected component of T (G_{ l }), we arbitrarily pick a predetermined node n_{ s } as the seed. For the unique tree path p that connects n_{ s } and another predetermined node n_{ k } in the same connected component, a tree constraint b_{ t } of path p can be obtained by Equation (2). The constraint is denoted interchangeably by b_{ t } or (n_{ s }, n_{ k }, b_{ t }). Tree constraints are symmetric because (n_{ s }, n_{ k }, b_{ t }) = (n_{ k }, n_{ s }, b_{ t }). Note that if there exists a component that has no predetermined nodes, locus l must be heterozygous across the entire pedigree and no tree constraints will be generated.
Our algorithm in relation to the ZRHC problem
Our algorithm consists of four steps. We begin by initializing required data structures in the preprocessing step. The initialized data structures are subject to the constraint generation step to construct a system of linear constraints on hvariables. There are two issues should be addressed. First, since all constraints are derived from locus graphs that come from the same pedigree graph, there is usually redundancy in the system. Second, we actually do not need to know all hvalues to solve the ZRHC problem. For a child node n_{ i }, there are two hvariables related to it and its parents. However, from Equation (1) we know that one of the two hvalues is sufficient to determine {p}_{{n}_{i}}. So it is easy to see that the (n  1) hvariables in T (G) form a minimal sufficient set to solve the ZRHC problem. In the third step, constraint reduction and transformation, we therefore try to eliminate redundancy in the system and transform as many path constraints into tree constraints as possible. Finally, in the haplotype determination step, we introduce an efficient way to solve hvariables and further pvariables based on the reduced system.
Step 1: preprocessing
The data structures of our algorithm are initialized by the following procedures:

1.
Transform the pedigree into a pedigree graph G = (V, E). Each node n_{ i } in V is equipped with its genotype vector {g}_{{n}_{i}}. Since each individual has two parents, there are at most 2n edges in G, so we have V = O(n) and  E  = O(n).

2.
Construct a spanning tree T (G) on G.

3.
For each locus l,

(a)
generate a locus graph G_{ l },

(b)
generate a locus forest T (G_{ l }), and

(c)
identify predetermined nodes as well as their pvalues, all dconstants, and all wconstants.

(a)
The operations applied in this step are graph traversal and spanning tree construction, both operations can be performed in time O( V  +  E ) = O(n). The time complexity of this step is therefore O(mn).
Step 2: constraint generation
A system of linear equations on hvariables over GF(2) will be constructed in this step. The system consists of three sets C^{C}, C^{P}, and C^{T} that contain different kinds of constraints. C^{C} contains cycle constraints, if any, of all nontree edges at all loci. Similarly, C^{P} contains path constraints, if any, of all nontree edges at all loci. Finally, C^{T} contains tree constraints at all loci. To reduce computational complexity, repetitions of set members are forbidden in our algorithm; we do nothing if an existing member is going to be added into the same set.
There are O(mn) trials to generate a constraint for a nontree edge since there are m locus graphs and each of which contains O(n) nontree edges; in each trial we perform a cycle detection procedure to generate a cycle constraint or a path constraint, so we have  C^{C}  +  C^{P}  = O(mn). The cycle detection procedure is usually implemented by depth first graph traversal and its execution time depends on the length of the cycle. Consequently, if a nontree edge induces a global cycle, the cycle detection procedure takes O(n) time; otherwise the procedure takes constant time because each local cycle contains only four edges. The time to generate O(mn) cycle and path constraints is O(kmn) since there are at most km trials to generate global cycle constraints. To generate tree constraints within a locus graph, we perform tree traversal on its locus forest. This procedure generates O(n) tree constraints in O(n) time. So we require O(mn) time to generate tree constraints at all loci. The time complexity to generate our constraint system is therefore O(kmn) + O(mn) = O(kmn).
Step 3: constraint reduction and transformation
Redundancy arises in the constraint system if a constraint can be represented as a linear combination of other constraints. We are especially interested in the following two types of redundancies.
Type 1 Assume there is a basic cycle c in G and it can be decomposed into two edgedisjoint paths p_{1} and p_{2} both connecting nodes n_{ i } and n_{ j }. There must be exactly a nontree edge e in c, and without loss of generality, we assume that e belongs to path p_{1}. If there is a cycle constraint (b_{ c }, e) of c, a path constraint (n_{ i }, n_{ j }, b_{ p }, e) of p_{1}, and a tree constraint (n_{ i }, n_{ j }, b_{ t }) of p_{2}, we have b_{ c } = b_{ p } + b_{ t } by Equations (2) and (3). That is, these three constraints are linearly dependent and each of them can be represented as a linear combination of the other two constraints (Figure 2(a)). A path constraint can therefore be transformed into a tree constraint by the equation b_{ t } = b_{ p } + b_{ c }, which is the basis of the reduction of our constraint system.
Type 2 Assume there are three tree constraints (n_{ i }, n_{ j }, b_{1}), (n_{ i }, n_{ k }, b_{2}), and (n_{ j }, n_{ k }, b_{3}) of paths p_{1}, p_{2}, and p_{3}, respectively. By definition we know that a tree constraint is the summation of all hvariables along a unique path in T(G), so we have
Suppose that n_{ l } is the node closest to n_{ i } on the path p_{3}. We then have three paths p_{4} between n_{ i } and n_{ l }, p_{5} between n_{ l } and n_{ j }, and p_{6} between n_{ l } and n_{ k } such that p_{1} = p_{4} + p_{5}, p_{2} = p_{4} + p_{6}, and p_{3} = p_{5} + p_{6}. The tree constraints can therefore be rewritten as
Because all constraints are defined over GF(2), we conclude that b_{1} + b_{2} = b_{3}; the three tree constraints are linearly dependent and each of them can be represented as a linear combination of the other two constraints (Figure 2(b)). The above argument implies the following lemma.
Lemma 1 For any three nodes n_{ i }, n_{ j }, and n_{ k }, the tree constraint of the path between n_{ j } and n_{ k } is equal to the total tree constraint of the path between n_{ i } and n_{ j } and the path between n_{ i } and n_{ k }.
Lemma 1 still holds even if n_{ i } is on the path between n_{ j } and n_{ k } (n_{ i } = n_{ l } in Figure 2(b)), which means that if a tree path is partitioned into two disjoint subpaths, the tree constraint of this path is equal to the total constraint of the two subpaths.
In this step, we remove the type 1 redundancy by transforming as many path constraints to tree constraints as possible, and remove the type 2 redundancy by reducing C^{T} to an equivalent set whose cardinality is at most (n  1).
For each nontree edge e ∈ E^{X}, if cycle constraint (b_{ c }, e) exists, we remove all path constraints (n_{ i }, n_{ j }, b_{ p }, e), if any, from C^{P} and add tree constraints (n_{ i }, n_{ j }, b_{ c } + b_{ p }) into C^{T}. Since the size of C^{P} is O(mn), this procedure can be carried out in time O(mn), and the new C^{T} is of size O(mn).
To further remove the redundancy in C^{T}, we construct a constraint graph G* of G. The constraint graph G* shares the same set of nodes V as G; for each tree constraint (n_{ i }, n_{ j }, b_{ t }) ∈ C^{T}, we introduce an edge connecting nodes n_{ i } and n_{ j } in G* with weight b_{ t } (Figure 3(a)). An example of constraint graph is depicted in Figure 3(b). The constraint graph is used to reduce the size of C^{T}. As shown in Figure 3(b), a constraint graph may not be connected. Within each connected component in G*, we randomly choose a seed n_{ s } and try to assign each node n_{ i } a variable W[n_{ i }] to represents the tree constraint of the tree path between nodes n_{ s } and n_{ i } in the pedigree graph G. The assignment is carried out by the following steps in each connected component of G*.

1.
W[n_{ s }] of the seed n_{ s } is assigned the value zero,

2.
start from n_{ s }, perform a breadthfirstsearch traversal via tree constraints, i.e., we can traverse from node n_{ i } to node n_{ j } if (n_{ i }, n_{ j }, b_{ t }) ∈ C^{T} or (n_{ j }, n_{ i }, b_{ t }) ∈ C^{T},

3.
as we traverse from n_{ i } to n_{ j } through (n_{ i }, n_{ j }, b_{ t }) or (n_{ j }, n_{ i }, b_{ t }), if n_{ j } is unvisited, we assign W[n_{ i }] + b_{ t } to W[n_{ j }] based on Lemma 1; otherwise we do nothing.
Since W[n_{ i }] represents the tree constraint (n_{ s }, n_{ i }, W[n_{ i }]), it can be regarded as the summation of hvariables along the unique path on T(G) from the seed n_{ s } to node n_{ i }, which implies the following lemma:
Lemma 2 The hvalue of a tree edge (n_{ i }, n_{ j }) in T(G) can be obtained by{h}_{{n}_{i},{n}_{j}}=W\left[{n}_{i}\right]+W\left[{n}_{j}\right] if n_{ i } and n_{ j } reside in the same connected component of G*.
Therefore, if we can assign Wvalues to all nodes in V and make G* connected, G* would be equipotent to a reduced C^{T} of size (n  1) that covers hvariables of all tree edges of T(G) and is sufficient to solve the ZRHC problem. The construction of the constraint graph takes O(C^{T}) = O(mn) time.
The constraint graph G*, however, may not be connected with fully assigned Wvalues. We therefore introduced an extension procedure to extend G* by adding extra tree constraints, if any, into G*; we would like to reduce the number of connected component in G* as much as possible. To explore more tree constraints to be added into G*, we examine those nontree edges e ∈ E^{X} that do not have cycle constraints in C^{C}. The basic idea is that if we can synthesize a new cycle whose constraint is the same as the expected cycle constraint of e, we may obtain new tree constraints by transforming known path constraints of e.
For a nontree edge e without cycle constraint, we try to synthesize a cycle only if e\in {E}_{G}^{X}. We do nothing if e\in {E}_{L}^{X} because no extra tree constraints of e can be obtained by cycle synthesis. To see this, suppose the local cycle induced by e connects a couple n_{ a } and n_{ b } and their two children n_{ c } and n_{ d }; without loss of generality, we assume e = (n_{ a }, n_{ d }) (Figure 4). We can examine the possible constraints derived from this local cycle. Constraints of a single edge with predetermined endpoints are not of interest and can be ignored because the pvalues of the endpoints are known; we need only pay attention to constraints whose path lengths are longer than one. In the l th locus graph, if {w}_{{n}_{a}}\left[l\right]={w}_{{n}_{b}}\left[l\right]=1, a local cycle exists and we have cycle constraint (b_{ c }, e) (Figure 4(a)); if {w}_{{n}_{a}}\left[l\right]=1 and {w}_{{n}_{b}}\left[l\right]=0, we only have the path p_{1} = n_{ c }n_{ a }n_{ d } with path constraint (n_{ c }, n_{ d }, b_{ p }, e) (Figure 4(b)); if {w}_{{n}_{a}}\left[l\right]=0 and {w}_{{n}_{b}}\left[l\right]=1, we only have the path p_{2} = n_{ c }n_{ b }n_{ d } with tree constraint (n_{ c }, n_{ d }, b_{ t }) (Figure 4(c)); if, {w}_{{n}_{a}}\left[l\right]={w}_{{n}_{b}}\left[l\right]=0 all four nodes are predetermined and we can determine their pvalues directly (Figure 4(d)). No useful constraints other than (b_{ c }, e), (n_{ c }, n_{ d }, b_{ p }, e), and (n_{ c }, n_{ d }, b_{ t }) can be derived from this local cycle. Here we already know that (b_{ c }, e) does not exist. If (n_{ c }, n_{ d }, b_{ t }) is already in C^{T}, it is the only useful tree constraint of e and we are finished. If (n_{ c }, n_{ d }, b_{ t }) does not exist in C^{T}, we cannot obtain (n_{ c }, n_{ d }, b_{ t }) by combining b_{ c } and b_{ p } because (b_{ c }, e) does not exist, even if the path constraint (n_{ c }, n_{ d }, b_{ p }, e) is available. If this case holds for all 1 ≤ l ≤ m, our linear system actually provides no information to obtain the tree constraint of p_{2}; the hvariable of each edge on p_{2} will eventually be assigned a free variable, or its value will depend on other free variables. Therefore we do nothing if e\in {E}_{L}^{X}.
Assume that E^{S} is the set of nontree edges in {E}_{G}^{X} without cycle constraint. Cycle synthesis is carried out by concatenating paths with known path constraints or tree constraints. The extension procedure is applied to E^{S} as follows.
E1. For each e ∈ E^{S}, we check if there is an odd number, say 2t + 1, of path constraints of e that link different connected components in G* to form a synthetic cycle (Figure 5(a)); a constraint is said to link two components A and B if one of its endpoints resides in A and the other resides in B. There is a special case whereby we can also obtain a synthetic cycle if two endpoints of a single path constraint reside in the same connected component (t = 0). If no such 2t + 1 path constraints are found, we cannot synthesize a cycle of e and do nothing; otherwise we perform the following tasks:

E1.1
assign the constraint \left(\hat{{b}_{c}},e\right) to the synthetic cycle, where
\hat{{b}_{c}}=\underset{\left({n}_{i},{n}_{j},{b}_{p},e\right)?{S}_{e}}{?}W\left[{n}_{i}\right]+W\left[{n}_{j}\right]+{b}_{p},(4)
in which S_{ e } is the set of the chosen 2t + 1 path constraints;

E1.2
for each path constraint \left({n}_{x},{n}_{y},{b}_{p}^{\text{'}},e\right) in C^{P}, generate a tree constraint \left({n}_{x},{n}_{y},\hat{{b}_{c}}+{b}_{p}^{\text{'}}\right) and add the new constraint into G*;

E1.3
update G*;

E1.4
remove e from E^{S};
E2. If E^{S} becomes empty (there has been a synthetic cycle for every e in the original E^{S}), or no synthetic cycle is synthesized (E^{S} stays unchanged), we stop the extension procedure; otherwise we go back to E1 to start the next iteration.
We thus try to synthesize a cycle for each nontree edge in E^{S} to generate new tree constraints and update G*. To update G*, if more than one connected component is combined into a new one by new tree constraints, we arbitrarily choose one of the old seeds from these connected components as a new seed, and perform a graph traversal to update Wvalues within the new connected component. A nontree edge that fails to receive a synthetic cycle in a trial of cycle synthesis may benefit from a later updated G* and therefore our extension procedure is designed to operate in an iterative fashion; the procedure terminates only if G*cannot be updated anymore. In this procedure, a nontree edge may be checked many times (in different iterations) to form a synthetic cycle. In the worst case scenario, only one cycle is synthesized in each iteration, so we require k iterations to perform k + (k  1) + ... + 1 = O(k^{2}) trials of cycle synthesis.
To verify the correctness of the extension procedure, we need first to explain the meaning of Equation (4). Follow a similar argument to that of Lemma 1, for two nodes n_{ x } and n_{ y } that reside in the same connected component of G*, we know that W[n_{ x }] + W[n_{ y }] is actually the tree constraint of the path from n_{ x } to n_{ y } on T(G). The synthetic cycle is conceptually a round trip through tree edges and the nontree edge e. The value \hat{{b}_{c}} in Equation (4) is therefore the summation of hvariables along the round trip (Figure 5(b)). Now we demonstrate that \hat{{b}_{c}} is the same as the cycle constraint of e. We first show that there is exactly one hvariable of e in \hat{{b}_{c}}. According to Equation (4), we have 2t + 1 hvariables of e in \hat{{b}_{c}}. Since we perform additions over GF(2), 2t out of the 2t + 1 hvariables will be cancelled and we finally have only one hvariable of e in \hat{{b}_{c}}. To verify if \hat{{b}_{c}}is the same as the cycle constraint of e in G, we assume that the expected cycle constraint of e is b_{ c }. We generate a set {S}_{e}^{\prime} by converting path constraints (n_{ i }, n_{ j }, b_{ p }, e) in S_{ e } to tree constraints (n_{ i }, n_{ j }, b_{ c } + b_{ p }). It is easy to see that the converted 2t + 1 tree constraints also link connected components in G* to form a new synthetic cycle, and the corresponding round trip only contains tree edges in T(G). T(G) has no cycle and therefore each edge of this new round trip must be visited an even number of times, which means that its hvariable will be cancelled in the new cycle constraint. So the constraint of the new synthetic cycle must be zero and we have the following equations:
Since there are 2t + 1 constraints in {S}_{e}^{\prime}, we have {\sum}_{i=1}^{\left{S}_{e}^{\prime}\right}{b}_{c}={b}_{c}. We then obtain \hat{{b}_{c}}+{b}_{c}=0 and conclude that \hat{{b}_{c}}={b}_{c}.
For each e ∈ E^{S}, the time to determine if there are odd number of path constraints that link connected components in G* to form a cycle is O(m). This time complexity can be achieved by regarding each connected component as a single node and each path constraint of e as a single edge, and following O(m) edges to perform a depthfirst traversal. Since there are O(k^{2}) cycle syntheses throughout the extension procedure, we require O(k^{2}m) time to find synthetic cycles. Once we synthesized a cycle for e, we require O(m) time to convert path constraints to tree constraints because there are at most m path constraints of e in C^{P}. There are O(k) nontree edges in E^{S} and therefore the extension procedure takes O(km) time to perform constraint conversion. To update G*, we require O(n) time to perform breadthfirst traversal on every connected component to modify Wvalues similar to the way we initialize G*. There are at most k synthetic cycles and therefore G* is updated O(k) times in O(kn) time. In summary, the Step 3, constraint reduction and transformation, takes O(k^{2}m) + O(km) + O(kn) = O(k^{2}m + kn) time.
Step 4: haplotype determination
To solve the hvalues of the tree edges of T(G) by Lemma 2, we try to make G* produced by Step 3 connected. Firstly, we pay attention to the founders in the pedigree. Founders cannot be predetermined endpoints of paths with either path constraints or tree constraints and therefore founders must be isolated nodes in G*. It is also impossible to know whether an allele of a founder is paternal or maternal. We attach a founder n_{ f } to G* by assuming that it passed its paternal haplotype to an arbitrary child n_{ c }. The attachment can be done by assigning weight zero to the edge (n_{ f }, n_{ c }) of G*, which implies {h}_{{n}_{f},{n}_{c}}=0 (n_{ f } passes its paternal haplotype). There are O(n) edges in G* and therefore the attachment of founders to G* takes O(n) time.
Secondly, we check if there is any nontree edge that can link any two connected components of G*. A nontree edge e = (n_{ i }, n_{ j }) can link two connected components A and B if we can find a path constraint (n_{ k }, n_{ l }, b_{ p }, e) of path p that, without loss of generality, satisfies the following two conditions:

1.
n_{ k } and n_{ i } reside in A and have available W [n_{ k }] and W [n_{ i }] derived from the seed n_{ A } of A,

2.
n_{ l } and n_{ j } reside in B and have available W [n_{ l }] and W [n_{ j }] derived from the seed n_{ B } of B.
If we can find such a nontree edge e, we can decompose p into three parts: a subpath from n_{ k } to n_{ i }, the nontree edge e, and the subpath from n_{ j } to n_{ l }. The constraints of these three parts are W\left[{n}_{k}\right]+W\left[{n}_{i}\right],\phantom{\rule{2.77695pt}{0ex}}{h}_{{n}_{i},{n}_{j}},W\left[{n}_{j}\right]+W\left[{n}_{l}\right], respectively. This turns out that {b}_{p}=W\left[{n}_{k}\right]+W\left[{n}_{i}\right]+{h}_{{n}_{i},{n}_{j}}+W\left[{n}_{j}\right]+W\left[{n}_{l}\right]. The nontree edge e therefore can be used to link components A and B with known hvalue {h}_{{n}_{i},{n}_{j}}={b}_{p}+W\left[{n}_{k}\right]+W\left[{n}_{i}\right]+W\left[{n}_{j}\right]+W\left[{n}_{l}\right]. Since there are at most O(mn) path constraints to be checked, this procedure requires O(mn) time.
Finally, assume that there remain t connected components of G*. We arbitrarily introduce (t 1) edges into G* to make it connected. Our algorithm does not impose any constraint on these (t  1) edges and therefore the weights of these edges can be safely set as free variables. We then update all Wvalues within the new connected G* (new Wvalues may contain free variables), and apply Lemma 2 to determine the hvalues of all edges in T(G). With these solved hvalues as well as the wconstants and dconstants, we can determine the pvalues of all nodes in the locus graphs by Equation (1).
The update of G* takes O(n) time. Moreover, we require O(n) time to determine all hvalues of edges in T(G). For a locus graph, the determined hvalues are used to solve all pvalues in O(n) time. Since there are m locus graphs, we require O(mn) time to determine the pvectors of all nodes in G. Consequently, the three procedures of this step take O(n) + O(mn) + O(n) + O(mn) = O(mn) time.
Results and discussion
An execution example
We use the pedigree given in Figure 6(a) as an example to demonstrate how the proposed algorithm works. There are 19 individuals in the pedigree; eight of them are founders. Each individual is equipped with genotype data collected from four marker loci. There is a mating loop in the pedigree.
In the first step, we transform the input pedigree into a pedigree graph G and construct a spanning tree T(G) on G (Figure 6(b)). There are three local cycles AHBIA, DKELD, and PRQSP and one global cycle BFCGDLOQNIB in G. Edges AH, EL, QR, and BF are chosen as nontree edges within the four cycles. From the pedigree graph G, we construct the four locus graphs and forests that are depicted in Figure 6(c). The pvalues of predetermined nodes, wconstants of all nodes, and dconstants of all edges within the four locus graphs are also identified.
In the second step, we generate all cycle, path, and tree constraints for each of the four locus graphs using Equations (2) and (3). For example, cycle AHBIA in the second locus graph has cycle constraint h_{ A, H } + h_{ H, B } +h_{ B, I } +h_{ I, A } = d_{ A, H } [2] + d_{ H, B } [2] +d_{ B, I } [2] +d_{ I, A }[2] = 0 + 1 + 1 + 0 = 0, and path GCFBINQ in the third locus graph has path constraint of the nontree edge BF h_{ G, C } + h_{ C, F } + h_{ F, B } + h_{ B, I } + h_{ I, N } + h_{ N, Q } = p_{ G }[3] + d_{ G, C }[3] + d_{ C, F } [3] + d_{ F, B }[3] + d_{ B, I } [3] + d_{ I, N } [3] + d_{ N, Q }[3] + p_{ Q }[3] = 0 + 0 + 0 + 1 + 1 + 0 + 0 + 0 = 0.
At the end of this step we receive C^{C} = {(0, e_{ EL }), (0, e_{ QR }), (0, e_{ AH })}, C^{P} = {(I, H, 0, e_{ AH }), (N, G, 0, e_{ BF) }, (G, Q, 0, e_{ BF) }, (R, Q, 0, e_{ RQ }), (N, Q, 0, e_{ BF) }}, and C_{ T } = {(I, N, 0), (G, K, 0), (G, L, 0), (G, O, 0), (Q, S, 0)}.
In the third step, we obtain two new tree constraints (R, Q, 0) and (I, H, 0) by (0, e_{ QR }) + (R, Q, 0, e_{ RQ }) and (0, e_{ AH }) + (I, H, 0, e_{ AH }), respectively. The set C^{T} is therefore extended to {(I, N, 0), (G, K, 0), (G, L, 0), (G, O, 0), (Q, S, 0), (R, Q, 0), (I, H, 0)}. We construct the initial constraint graph G* based on the updated C^{T} (Figure 6(d)). In the initial G*, we choose H, G, and Q as component seeds to determine Wvalues. We can further find that the three path constraints (N, G, 0, e_{BF)},(G, Q, 0, e_{BF)}, and (N, Q, 0, e_{BF)}link three connected components to form a synthetic cycle of the nontree edge BF with constraint zero. So we further obtain three extra tree constraints (N, G, 0), (G, Q, 0), and (N, Q, 0) derived from the synthetic cycle and add them to G*.
In the final step, we try to make G* connected to solve all hvalues. We first arbitrarily introduce eight edges AH, BI, CG, DG, EL, JN, MO, and PR to attach the eight founders to G*; all the eight edges are of weight zero to imply that founders passes their paternal haplotypes to one of their children. Now there are only two connected components in G*, one of which is an isolated node, F. we attach F to G* by set h_{ B, F } as a free variable. This final connected G* is depicted in Figure 6(e). After the final update of G*, all hvalues other than h_{ B, F }are zero, and h_{ B, F } is free to be either zero or one. Given these known hvalues, all pvalues over the four locus graphs can be solved by Equation (1).
Time complexity and experimental result
According to the analyses at the end of each step in Section 3, the time complexity of our algorithm is O(mn) (step 1: preprocessing) + O(kmn) (step 2: constraint generation) + O(k^{2}m + kn) (step 3: constraint reduction and transformation) + O(mn) (step 4: haplotype determination) = O(kmn + k^{2}m). Because k is regarded as a constant, our algorithm is linear.
To verify the efficiency and the correctness of our algorithm, we conducted some experiments using the proposed method. Our algorithm was implemented in C and was evaluated on a desktop computer equipped with Intel Core i72600 3.4 GHz CPU and 8 GB of RAM. The desktop ran Ubuntu Release 11.10 operating system with Linux kernel 3.0.016generic and GNOME 3.2.1 graphical user interface.
In the experiments, we generated test cases by setting different number of individuals (n) and markers (m). We applied the algorithm developed by Thomas et al. [24] to generate 12 tree pedigrees with different n values ranging from 30 to 400. To observe how the number of mating loops (k) affects our algorithm, each tree pedigree was preprocessed to produce four variants with zero, two, four, and six mating loops. For each pedigree, we examined 10 different m values ranging from 10 to 300. Each (n, m) combination was tested 100 times. Each time we generated new genotypes and randomly selected one pedigree from the four variants of the given n. The haplotype configurations of all the 12000 trials were correctly identified. The experimental results are listed in Table 2.
Table 2(a) shows that unknown pvariables were correctly solved without assigning any free variable if the number of marker loci was not less than 30, which covers most practical cases in regular genotyping. Free variables were required only when the number of marker loci was far less than the number of individuals. In this experiment, free variables were used only when m = 10, and they were used at most five times out of 100 trials. The result is reasonable because the dimension of the solution space of a pedigree with a limited number of marker loci is probable less than the number of unknown pvariables.
Table 2(b) shows the cumulative execution time of 100 trials of each (n, m) combination. We received a fluctuation in execution time if n or m were less than 100. We conjecture that, because the algorithm executes very fast for small values of n or m, the cumulative execution time might be dramatically affected by the context switches within the operating system that ran many background services. Furthermore, we believe that when both n and m were larger than 100, the execution time of the algorithm became more significant than that of the context switches. From the table it is apparent that the execution time is linear for the larger n and m values.
Finally, Table 2(c) shows that mating loops existed evenly throughout all 12000 trials, with the number ranging from zero to six per pedigree, and the number did not affect the linearity of the execution time of our algorithm in relation to the input size of n and m.
Issue of spanning tree and seed node selection
In the first step, preprocessing, a spanning tree T(G) is constructed on the pedigree graph G. As mentioned above, T(G) is constructed for the ease of cycle processing; it is merely an auxiliary data structure used to generate linear constraints of all cycles and paths between predetermined nodes in G. We do not impose any constraint on the construction of T(G) because predetermined nodes are defined by genotype data. Once the input pedigree is given, all the cycles and paths as well as their constraints are bound, no matter which spanning tree is constructed on the pedigree graph. Different spanning trees assign different edges as the nontree edge in a cycle, and only affect the type of a constraint; a constraint may be a path constraint with respective to one spanning tree and a tree constraint with respect to another spanning tree. Since different spanning trees are used to generate the same set of constraints, without considering their type, the construction of the spanning tree can be arbitrary. In our implementation, T(G) was constructed by depthfirst traversal.
In the second step, constraint generation, a seed node is arbitrarily selected from T(G) to generate tree constraints. To see why the seed node can be selected arbitrarily, assume that there are two possible seeds n_{ i } and n_{ j }. For any other predetermined node n_{ k }, we have (n_{ j }, n_{ k }, b_{ jk }) = (n_{ i }, n_{ j }, b_{ jk }) + (n_{ i }, n_{ k }, b_{ ik }) by Lemma 1, which means that a tree constraint seeded with one predetermined node is a linear combination of two tree constraints seeded with another predetermined node. Hence, tree constraints seeded with different predetermined nodes are mathematical equivalent; we can safely choose any predetermined node as seed. Similarily, the seed nodes within a constraint graph can also be selected arbitrarily based on the above argument.
Consistency checking
Although we assume that the input pedigree is free of genotyping errors, our algorithm can be easily modified to detect inconsistencies within the genotype data without loss of efficiency. No recombination is allowed in the input pedigree and therefore inconsistencies will arise if there are different assignments of an hvalue, that results in incompatible linear constraints. We may designate the following two checkpoints to detect inconsistencies within our linear system:

1.
The generation of constraints. The constraint of a path or a cycle may be computed more than one time across all locus graphs; all these computations should arrive at the same value. So each time we compute a constraint, we check if it is the same as the current value, if any.

2.
The initialization/update of G*. There may be loops in the constraint graph G* and therefore it is possible that there are more than one path from the seed n_{ s } to a node n_{ i }. It turns out that W [n_{ i }] may be assigned more than once in the initialization or update procedures of G*. By the definition of Wvariables, however, all the assignments to W[n_{ i }] are actually associated with the same path from n_{ s } to n_{ i } on T (G) and therefore should be identical. So each time we compute a Wvalue, we check if it agrees with the current value, if any.
Conclusions
In this study, we proposed and implemented an algorithm to solve the zerorecombinant haplotype configuration (ZRHC) problem for a general pedigree in O(kmn + k^{2}m) time. With the aid of free variables, our method provides a general solution to describe possible haplotype structures within a pedigree rather than a particular solution that only assigns a specific numerical setting to haplotypes. To the best of our knowledge, this algorithm is the first deterministic one to provide a general solution in linear time for pedigrees having small number of mating loops. Moreover, the algorithm can be easily modified to detect inconsistency among genotype data without loss of efficiency. Our experimental results confirm its linearity. In the future, we will try to extend the proposed algorithm to handle recombination and missing data in linear time for general pedigrees.
References
Qian D, Beckmann L: Minimumrecombinant haplotyping in pedigrees. The American Journal of Human Genetics. 2002, 70 (6): 14341445. 10.1086/340610.
Albers CA, Heskes T, Kappen HJ: Haplotype inference in general pedigrees using the cluster variation method. Genetics. 2007, 177 (2): 11011116. 10.1534/genetics.107.074047.
Chin FYL, Zhang Q, Shen H: krecombination haplotype inference in pedigrees. Proceedings of the International Conference on Computational Science (ICCS). 2005, SpringerVerlag, Berlin, 985993.
Li J, Jiang T: Efficient rulebased haplotyping algorithms for pedigree data. Proceedings of the 7th Annual Conference on Research in Computational Molecular Biology (RECOMB). 2003, ACM, New York, 197206.
Li J, Jiang T: An exact solution for finding minimum recombinant haplotype configurations on pedigrees with missing data by integer linear programming. Proceedings of the 8th Annual Conference on Research in Computational Molecular Biology (RECOMB). 2004, ACM, New York, 2029.
Li J, Jiang T: Computing the minimum recombinant haplotype configuration from incomplete genotype data on a pedigree by integer linear programming. Journal of Computational Biology. 2005, 12 (6): 719739. 10.1089/cmb.2005.12.719.
Sobel E, Lange K, O'Connell JR, Weeks DE: Haplotyping algorithms. Genetic Mapping and DNA Sequencing, Volume 81 of IMA Volumes in Mathematics and its Applications. Edited by: Speed T, Waterman MS. 1996, SpringerVerlag, 89110.
Tapadar P, Ghosh S, Majumder PP: Haplotyping in pedigrees via a genetic algorithm. Human Heredity. 2000, 50: 4356. 10.1159/000022890.
O'Connell JR: Zerorecombinant haplotyping: Applications to fine mapping using SNPs. Genetic Epidemiology. 2000, 19: 6470. 10.1002/10982272(200007)19:1<64::AIDGEPI5>3.0.CO;2E.
Chan MY, Chan WT, Chin FYL, Fung SPY, Kao MY: Lineartime haplotype inference on pedigrees without recombinations and mating loops. SIAM J Comput. 2009, 38 (6): 21792197. 10.1137/080680990.
Li X, Li J: Efficient haplotype inference from pedigrees with missing data using linear systems with disjointset data structure. 7th Annual International conference on Computational Systems Bioinformatics. 2008, 297308.
Liu L, Jiang T: A lineartime algorithm for reconstructing zerorecombinant haplotype configuration on pedigrees without mating loops. Journal of Combinatorial Optimization. 2008, 19: 217240.
Baruch E, Weller JI, CohenZinder M, Ron M, Seroussi E: Efficient inference of haplotypes from genotypes on a large animal pedigree. Genetics. 2006, 172 (3): 17571765.
Doan DD, Evans PA, Horton JD: A nearlinear time algorithm for haplotype determination on general pedigrees. J Comput Biol. 2010, 17 (10): 145165. 10.1089/cmb.2009.0133. [http://www.biomedsearch.com/nih/nearlineartimealgorithmhaplotype/20937017.html]
Wang WB, Jiang T: Efficient inference of haplotypes from genotypes on a pedigree with mutations and missing alleles. CPM 2009, LNCS 5577. Edited by: Kucherov G, Ukkonen E. 2009, SpringerVerlag Berlin Heidelberg, 353367.
Xiao J, Liu L, Xia L, Jiang T: Efficient algorithms for reconstructing zerorecombinant haplotypes on a pedigree based on fast elimination of redundant linear equations. SIAM J Comput. 2009, 38: 21982219. 10.1137/070687591.
Zhang K, Sun F, Zhao H: HAPLORE: a program for haplotype reconstruction in general pedigrees without recombination. Bioinformatics. 2005, 21: 90103. 10.1093/bioinformatics/bth388.
Gusfield D: Inference of haplotypes from samples of diploid populations: complexity and algorithms. Journal of Computational Biology. 2001, 8 (3): 305323. 10.1089/10665270152530863.
Niu T, Qin ZS, Xu X, Liu JS: Bayesian haplotype inference for multiple linked singlenucleotide polymorphisms. The American Journal of Human Genetics. 2002, 70: 157169. 10.1086/338446.
Stephens M, Smith NJ, Donnelly P: A new statistical method for haplotype reconstruction from population data. The American Journal of Human Genetics. 2001, 68 (4): 978989. 10.1086/319501.
Wang S, Kidd KK, Zhao H: On the use of DNA pooling to estimate haplotype frequencies. Genetic Epidemiology. 2003, 24: 7482. 10.1002/gepi.10195.
Yang Y, Zhang J, Hoh J, Matsuda F, Xu P, Lathrop M, Ott J: Efficiency of singlenucleotide polymorphism haplotype estimation from pooled DNA. Proceedings of the National Academy of Science of the United States of America. 2002, 100: 72257230.
Browning SR, Browning BL: Haplotype phasing: existing methods and new developments. Nature Reviews Genetics. 2011, 12: 703714. [http://www.nature.com/nrg/journal/v12/n10/full/nrg3054.html]
Thomas A, Cannings C: Simulating realistic zero loop pedigrees using a bipartite Prufer code and graphical modelling. Math Med Biol. 2004, 21 (4): 33545. 10.1093/imammb/21.4.335. [http://www.biomedsearch.com/nih/Simulatingrealisticzerolooppedigrees/15567888.html]
Acknowledgements
This work was supported in part by the National Science Council, Taiwan under NSC992320B010022MY2.
This article has been published as part of BMC Bioinformatics Volume 13 Supplement 17, 2012: Eleventh International Conference on Bioinformatics (InCoB2012): Bioinformatics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/13/S17.
Author information
Authors and Affiliations
Corresponding author
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
EYL, WBW, TJ, and KPW contributed to the algorithm design. EYL implemented the algorithms and performed the experiments. KPW and EYL analyzed the complexity of the algorithm and wrote the paper. All authors read and approved the final manuscript.
Rights and permissions
Open Access This article is published under license to BioMed Central Ltd. This is an Open Access article is distributed under the terms of the Creative Commons Attribution License ( https://creativecommons.org/licenses/by/2.0 ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Cite this article
Lai, EY., Wang, WB., Jiang, T. et al. A lineartime algorithm for reconstructing zerorecombinant haplotype configuration on a pedigree. BMC Bioinformatics 13 (Suppl 17), S19 (2012). https://doi.org/10.1186/1471210513S17S19
Published:
DOI: https://doi.org/10.1186/1471210513S17S19
Keywords
 Haplotype inference
 zerorecombinant haplotype configuration (ZRHC)
 pedigree
 mating loop.