A linear-time algorithm for reconstructing zero-recombinant haplotype configuration on a pedigree

Background When studying genetic diseases in which genetic variations are passed on to offspring, the ability to distinguish between paternal and maternal alleles is essential. Determining haplotypes from genotype data is called haplotype inference. Most existing computational algorithms for haplotype inference have been designed to use genotype data collected from individuals in the form of a pedigree. A haplotype is regarded as a hereditary unit and therefore input pedigrees are preferred that are free of mutational events and have a minimum number of genetic recombinational events. These ideas motivated the zero-recombinant haplotype configuration (ZRHC) problem, which strictly follows the Mendelian law of inheritance, namely that one haplotype of each child is inherited from the father and the other haplotype is inherited from the mother, both without any mutation. So far no linear-time algorithm for ZRHC has been proposed for general pedigrees, even though the number of mating loops in a human pedigree is usually very small and can be regarded as constant. Results Given a pedigree with n individuals, m marker loci, and k mating loops, we proposed an algorithm that can provide a general solution to the zero-recombinant haplotype configuration problem in O(kmn + k2m) time. In addition, this algorithm can be modified to detect inconsistencies within the genotype data without loss of efficiency. The proposed algorithm was subject to 12000 experiments to verify its performance using different (n, m) combinations. The value of k was uniformly distributed between zero and six throughout all experiments. The experimental results show a great linearity in terms of execution time in relation to input size when both n and m are larger than 100. For those experiments where n or m are less than 100, the proposed algorithm runs very fast, in thousandth to hundredth of a second, on a personal desktop computer. Conclusions We have developed the first deterministic linear-time algorithm for the zero-recombinant haplotype configuration problem. Our experimental results demonstrated the linearity of its execution time in relation to the input size. The proposed algorithm can be modified to detect inconsistency within the genotype data without loss of efficiency and is expected to be able to handle recombinant and missing data with further extension.


Background
A genetic disease is caused by the abnormality in an individual's genome. Genetic diseases have been studied extensively for decades by investigating the connection between diseases and genetic variations. In the human genome, chromosomes come in pairs; each gene consists of two alleles that reside in different chromosomes at the same locus. One of the two alleles comes from the father and the other comes from the mother. To study hereditary diseases in which the genetic variations are passed on to offspring, the ability to distinguish between paternal and maternal alleles is essential. Unfortunately, the haplotype structure of a human genome is not available directly from the genotyping and the unordered genotype data does not tell us which allele comes from which parent. A haplotype is a collection of alleles at multiple loci on a chromosome that tend to be inherited as a unit. The determination of haplotypes from genotype data is called haplotype phasing or haplotype inference. Algorithms for haplotype inference are indispensable and have been intensively studied.
The existing computational algorithms for haplotype inference can be classified into statistical and combinatorial and most of which were designed for genotype data collected from individuals in the form of a pedigree. A pedigree is a hierarchical structure that describes the parent-child relationship among members of a family. Individuals without parents are called founders. There may be cycles in a pedigree, which are referred to as mating loops. A mating loop arises from a couple if they have children and both of them are offspring of certain family ancestors. An example of a pedigree, coupled with genotype data, is depicted in Figure 1(a); each allele is denoted as 0 or 1 to represent its form within a gene. If two alleles of a gene are the same, the locus is homozygous; otherwise, it is heterozygous. A haplotype is regarded as a hereditary unit and therefore an input pedigree is preferred to be free of mutational events and to have minimum number of genetic recombinational events [1]. Haplotype inference under this assumption is referred to as the minimum-recombinant haplotype configuration (MRHC) problem, which requires the solving of the haplotype structure of the input pedigree with the minimum number of recombination events [1]. Several algorithms have been proposed to solve the MRHC problem [1][2][3][4][5][6][7][8]. A special case of MRHC is zero-recombinant haplotype configuration (ZRHC) problem, which strictly follows the Mendelian law of inheritance, namely that one haplotype of each child is inherited from the father and the other haplotype is inherited from the mother, without any mutation [9]. To reduce the complexity of the ZRHC, some algorithms have been applied to pedigrees without mating loops (called tree pedigrees) [10][11][12]. In contrast to algorithms targeting tree pedigrees, so far no linear-time algorithm for ZRHC has been proposed for general pedigrees, even though the number of mating loops in a human pedigree is usually very small and can be regarded as constant; the execution time of existing algorithms for ZRHC using general pedigrees is polynomial [4,[13][14][15][16][17]. Regardless of whether it is a MRHC or a ZRHC problem, some algorithms have been extended to handle pedigrees with mutations or missing data [5,8,11,15]. In addition to haplotype inference from pedigree data, algorithms have been proposed for population datasets that come from unrelated individuals. Algorithms for population datasets try to decode the haplotype structure of each individual as well as the haplotype frequencies of a population [18][19][20][21][22]. All the above mentioned algorithms are mainly combinatorial. Readers who are interested in statistical approaches for haplotype inference can consult a recent review [23].
In this study, we have targeted the ZRHC problem for pedigree data. If we assume we are given a pedigree with n individuals and m marker loci. Then for general pedigrees, Li and Jiang proposed an O(m 3 n 3 ) time algorithm by converting the inheritance process into an equivalent linear system of O(mn) equations over Galois field GF (2) and invoking Gaussian elimination [4]. Xiao et al. improved the method to take O(mn 2 + n 3 log 2 n log log n) time by removing redundant equations from the linear system [16]. Doan et al. proposed an O(mna(m)) time algorithm by exploring constraints among marker loci rather than family members, where a(·) is the inverse of the Ackermann function [14]. For tree pedigrees, the execution time of the algorithm proposed by Xiao can be reduced from O(mn 2 + n 3 log 2 n log log n) to O(mn + n 3 ) [16]. Li and Li proposed an O(mna(n)) time algorithm using disjoint-set data structures [11]. Liu et al. further lowered the complexity of Xiao's algorithm to linear time O(mn) [12]. Chan et al. also proposed a linear-time algorithm by maintaining a graph structure [10]. Chan's algorithm, however, only produce a particular solution. A particular solution assigns a numerical value to each system variable, while a general solution describes all possible solutions of the system by designating certain variables as free variables and the others as linear combinations of these free variables.
In this paper, we presented an O(kmn + k 2 m) time algorithm that provides a general solution for ZRHC for general pedigrees, where k is the number of mating loops. In human pedigrees, k is usually very small and can be regarded as constant. Our algorithm therefore turns out to be linear for most of the practical cases. The proposed algorithm was subject to 12000 experiments to verify its performance using different (n, m) combinations. The value of k was uniformly distributed between zero and six throughout all experiments. The experimental results show a great linearity of the execution time in relation to the input size when both n and m are larger than 100. For those experiments where n or m are less than 100, the proposed algorithm runs very fast, from thousandth to hundredth of a second on a personal desktop computer. We also showed that the proposed algorithm can be easily modified to detect inconsistencies among genotype data without loss of efficiency.

Methods
To apply computational techniques, we transformed the input pedigree into a pedigree graph by connecting each parent directly to its children (Figure 1(b)). A pedigree graph is an undirected graph G = (V, E), where V is a set of nodes and E a set of edges. Each node in V represents an individual in the pedigree; each pair of nodes is connected with an edge in E if and only if the two individuals have a parent-child relationship. G is defined to be undirected because the computational property of each edge is symmetric in our algorithm, even if the parent-child relationship is asymmetric. G may contain cycles. We only pay attention to two types of cycles: a cycle due to a mating loop, which is called a global cycle and a cycle due to a couple and two of their children, which is called a local cycle. Global cycles and local cycles are referred to as basic cycles. For ease of cycle processing, we construct a spanning tree T (G) on G. A basic cycle can be obtained by adding a non-tree edge into T (G). The set of non-tree edges is denoted by E X . Non-tree edges are further divided into two disjoint subsets E X L and E X G ; members in E X L induce local cycles and members in E X G induce global cycles. Mating loops seldom appear in human pedigrees and therefore |E X G | = k is regarded as a small constant.
In the rest of this paper, we are assuming that G has n nodes and m loci, all alleles are bi-allelic (denoted by 0 or 1), and the input dataset is free of genotyping errors. Under this assumption, the input size of ZRHC is O (mn). The genotype data of a node n i are represented as a vector g n i of size m. The genotype of n i at locus l, where 1 ≤ l ≤ m, is defined as follows: Genotype data are available, thus all g-variables can be regarded as constant (Figure 1(b)). We introduce a vector p n i of size m to describe the haplotype information of n i ; the paternal allele of n i at locus l, where 1 ≤ l ≤ m, is defined as follows: The vector p n i is regarded as unknown even though we know that p n i [l] = g n i [l] if n i is homozygous at locus l (i.e. g n i [l] ≠ 2).
We formulated the ZRHC problem as follows.
ZRHC Given a pedigree graph G(V, E) with full g-constants, determine p n i of each node n i in V.
The haplotype configuration of the input pedigree is identified by specifying the paternal haplotype of each family member.
A system of linear equations over GF (2) In this section, we introduce a system of linear equations based on G and g-constants; this system was first proposed in [16] and will be reduced to determine all p-variables. Since p-variables carry binary values, all equations in the linear system are defined over GF(2) whose operations addition (+) and multiplication (·) are shown in Table 1.
The building block of the system: inheritance "Inheritance" is the building block of the system. What parents pass to their children must be the same as what children receive from their parents. For a parent n i and a locus l, n i passes p n i [l] + 1 to its children if and only if the genotype of n i at locus l is heterozygous and n i passes its maternal allele; otherwise n i passes p n i [l] to its children. We introduce two auxiliary variables w n i l and h n i ,n j to formally state the above argument. The variable w n i l indicates if locus l of n i is heterozygous. The variable h n i ,n i indicates which allele of n i is passed to its child n j . h n i ,n j = 0 if n i passes its paternal allele to n j 1 if n i passes its maternal allele to n j .
Therefore, p n i [l] + w n i [l] · h n i ,n j represents the allele at locus l that n i passes to n j .
On the other hand, assume that n j receives an allele from n i . If n i is n j 's father, what n i passes to n j is the paternal allele of n j . In this case, we have p n i [l] + w n i [l] · h n i ,n j = p n j [l] . If n i is n j 's mother, there are two sub-cases. If locus l of n j is homozygous, what n i passes to n j must be the same as the paternal allele of n j . In this case, we have p n i [l] + w n i [l] · h n i ,n j = p n j [l] . If locus l of n j is heterozygous, what n i passes to n j is the maternal allele of n j and is different from the paternal allele of n j . In this case, we have p n i [l] + w n i [l] · h n i ,n j = p n j [l] + 1 . The variable w n j l can be used to indicate if locus l of n j is homozygous or heterozygous, the two sub-cases can therefore be combined into a single equation p n i [l] + w n i [l] · h n i ,n j = p n j [l] + w n j [l] . Moreover, if we introduce another auxiliary variable d n i ,n j [l] as follows, if n i is n j s mother, the inheritance relationship can be unified into the following equation: Note that the w-and d-variables are constant by definition, and the p-and h-variables are unknowns. Equation (1) formulates the property of edge (n i , n j ) in G: p-variables and w-constants are attributes of the nodes n i and n j , and h-variables and d-constants describe the inheritance relation associated with the edge (n i , n j ). With the information provided by Equation (1), various constraints on h-variables can be generated by traversing different paths in G. Our algorithm was designed to first determine h-variables based on these constraints and then the solution to the ZRHC problem can be obtained by determining all p-variables based on the solved h-values and Equation (1). One point needs special care: if n j is a child of n i , h n j ,n i and d n j ,n i are undefined. In our algorithm, we make the h-variables and d-constants symmetrical such that h n j ,n i = h n i ,n j and d n j ,n i = d n i ,n j .

Linear constraints on h-variables
To reduce the computational complexity of our algorithm, we try to make the number of unknowns in the coming linear system as small as possible. In the pedigree graph G, we have mn p-variables and at most 2n h-variables (since each individual has two parents and there are at most n individuals). Observe that if a node n i itself or one of its parents is homozygous at locus l, p n i l is determined by definition and Equation (1). In this case n i is referred to as predetermined at locus l and the number of unknown p-variables is reduced by one. Moreover, for an edge (n i , n j ) E, where n i is a parent of n j , h n i , n j is cancelled from Equation (1) if w n i [l] = 0 at locus l. If w n i [l] = 0 holds for all 1 ≤ l ≤ m, no constraints are imposed on h n i ,n j and it becomes a free variable (or its value will finally depend on other free variables). In this case the number of h-variables to be determined is Table 1 Addition (+) and multiplication (·) in GF(2) reduced by one, which is equipotent to the removal of edge (n i , n j ) from G. Accordingly, w-constants can be viewed as the weight of edges in G; we only pay attention to edges with weight one (parent nodes that are heterozygous). To consider only the edges with weight one at locus l, we construct the lth locus graph G l = (V, E l ), where E l = {(n i , n j ) | n i is a parent of n j , w n i l = 1}. Moreover, the spanning forest T(G) ∩ G l is denoted by T(G l ) and is referred to as the lth locus forest (Figure 1(c)). We define constraints on h-variables by traversing paths in the locus graphs. Consider a path p = n 0 , n 1 , ..., n i in G l . Assume that n 0 and n i are predetermined and all other in-between nodes are non-predetermined. Adding up all h-variables on the path will produce the following equation by Equation (1): Since n 0 and n i are predetermined and all d-constants are known, b is a constant. The constant b is said to be the constraint of path p. Note that the constraint b does not depend on the direction that path p is read because the h-variables and d-constants are symmetric. Moreover, if the path is a cycle c = n 0 , n 1 , ..., n i , n 0 in G l , we would have the following equation: Again, since all d-constants are known, b' is also a constant. The constant b' is said to be the constraint of cycle c. On the basis of Equations (2) and (3), we can generate constraint equations with only h-variables for cycles or for paths that connect predetermined nodes in G l . Constraints can be classified into two categories with respect to the spanning tree T(G): cycle and path constraints derived from paths containing non-tree edges, and tree constraints derived from paths containing only tree edges.

Cycle and Path constraints
Adding a non-tree edge e into the spanning tree T (G) generates a basic cycle c. If G l contains e, there are two cases of c in G l .
Case 1 c is in G l . A cycle constraint b c of cycle c can be obtained by Equation (3). The constraint is denoted interchangeably by b c or (b c , e), which is also said to be the cycle constraint of e.
Case 2 c is broken into several disjoint paths in G l by predetermined nodes. Since these paths are disjoint, there is exactly one path p' of them containing e. Along the path p', we identify a subpath p = n i ...n j containing e such that n i and n j are predetermined and all other in-between nodes are non-predetermined. A path constraint b p of the subpath p can be obtained by Equation (2). The constraint is denoted interchangeably by b p or (n i , n j , b p , e), which is also said to be the path constraint of e. Path constraints are symmetric because (n i , n j , b p , e) = (n j , n i , b p , e).

Tree constraints
For each connected component of T (G l ), we arbitrarily pick a predetermined node n s as the seed. For the unique tree path p that connects n s and another predetermined node n k in the same connected component, a tree constraint b t of path p can be obtained by Equation (2). The constraint is denoted interchangeably by b t or (n s , n k , b t ). Tree constraints are symmetric because (n s , n k , b t ) = (n k , n s , b t ). Note that if there exists a component that has no predetermined nodes, locus l must be heterozygous across the entire pedigree and no tree constraints will be generated.

Our algorithm in relation to the ZRHC problem
Our algorithm consists of four steps. We begin by initializing required data structures in the preprocessing step. The initialized data structures are subject to the constraint generation step to construct a system of linear constraints on h-variables. There are two issues should be addressed. First, since all constraints are derived from locus graphs that come from the same pedigree graph, there is usually redundancy in the system. Second, we actually do not need to know all h-values to solve the ZRHC problem. For a child node n i , there are two h-variables related to it and its parents. However, from Equation (1) we know that one of the two h-values is sufficient to determine p n i . So it is easy to see that the (n -1) h-variables in T (G) form a minimal sufficient set to solve the ZRHC problem. In the third step, constraint reduction and transformation, we therefore try to eliminate redundancy in the system and transform as many path constraints into tree constraints as possible. Finally, in the haplotype determination step, we introduce an efficient way to solve h-variables and further p-variables based on the reduced system.
Step 1: preprocessing The data structures of our algorithm are initialized by the following procedures: 1. Transform the pedigree into a pedigree graph G = (V, E). Each node n i in V is equipped with its genotype vector g n i . Since each individual has two parents, there are at most 2n edges in G, so we have |V| = O(n) and | E | = O(n). Step 2: constraint generation A system of linear equations on h-variables over GF (2) will be constructed in this step. The system consists of three sets C C , C P , and C T that contain different kinds of constraints. C C contains cycle constraints, if any, of all non-tree edges at all loci. Similarly, C P contains path constraints, if any, of all non-tree edges at all loci. Finally, C T contains tree constraints at all loci. To reduce computational complexity, repetitions of set members are forbidden in our algorithm; we do nothing if an existing member is going to be added into the same set.
There are O(mn) trials to generate a constraint for a non-tree edge since there are m locus graphs and each of which contains O(n) non-tree edges; in each trial we perform a cycle detection procedure to generate a cycle constraint or a path constraint, so we have | C C | + | C P | = O(mn). The cycle detection procedure is usually implemented by depth first graph traversal and its execution time depends on the length of the cycle. Consequently, if a non-tree edge induces a global cycle, the cycle detection procedure takes O(n) time; otherwise the procedure takes constant time because each local cycle contains only four edges. The time to generate O(mn) cycle and path constraints is O(kmn) since there are at most km trials to generate global cycle constraints. To generate tree constraints within a locus graph, we perform tree traversal on its locus forest. This procedure generates O (n) tree constraints in O(n) time. So we require O(mn) time to generate tree constraints at all loci. The time complexity to generate our constraint system is therefore Step 3: constraint reduction and transformation Redundancy arises in the constraint system if a constraint can be represented as a linear combination of other constraints. We are especially interested in the following two types of redundancies.
Type 1 Assume there is a basic cycle c in G and it can be decomposed into two edge-disjoint paths p 1 and p 2 both connecting nodes n i and n j . There must be exactly a non-tree edge e in c, and without loss of generality, we assume that e belongs to path p 1 . If there is a cycle constraint (b c , e) of c, a path constraint (n i , n j , b p , e) of p 1 , and a tree constraint (n i , n j , b t ) of p 2 , we have b c = b p + b t by Equations (2) and (3). That is, these three constraints are linearly dependent and each of them can be represented as a linear combination of the other two constraints (Figure 2(a)). A path constraint can therefore be transformed into a tree constraint by the equation b t = b p + b c , which is the basis of the reduction of our constraint system.
Type 2 Assume there are three tree constraints (n i , n j , b 1 ), (n i , n k , b 2 ), and (n j , n k , b 3 ) of paths p 1 , p 2 , and p 3 , respectively. By definition we know that a tree constraint is the summation of all h-variables along a unique path in T(G), so we have h n x, n y .
Suppose that n l is the node closest to n i on the path p 3 . We then have three paths p 4 between n i and n l , p 5 between n l and n j , and p 6 between n l and n k such that p 1 = p 4 + p 5 , p 2 = p 4 + p 6 , and p 3 = p 5 + p 6 . The tree constraints can therefore be rewritten as h n x, n y · Because all constraints are defined over GF(2), we conclude that b 1 + b 2 = b 3 ; the three tree constraints are linearly dependent and each of them can be represented as a linear combination of the other two constraints (Figure 2(b)). The above argument implies the following lemma.
Lemma 1 For any three nodes n i , n j , and n k , the tree constraint of the path between n j and n k is equal to the total tree constraint of the path between n i and n j and the path between n i and n k .
Lemma 1 still holds even if n i is on the path between n j and n k (n i = n l in Figure 2(b)), which means that if a tree path is partitioned into two disjoint sub-paths, the tree constraint of this path is equal to the total constraint of the two sub-paths.
In this step, we remove the type 1 redundancy by transforming as many path constraints to tree constraints as possible, and remove the type 2 redundancy by reducing C T to an equivalent set whose cardinality is at most (n -1).
For each non-tree edge e E X , if cycle constraint (b c , e) exists, we remove all path constraints (n i , n j , b p , e), if any, from C P and add tree constraints (n i , n j , b c + b p ) into C T . Since the size of C P is O(mn), this procedure can be carried out in time O(mn), and the new C T is of size O(mn).
To further remove the redundancy in C T , we construct a constraint graph G* of G. The constraint graph G* shares the same set of nodes V as G; for each tree constraint (n i , n j , b t ) C T , we introduce an edge connecting nodes n i and n j in G* with weight b t (Figure 3(a)). An example of constraint graph is depicted in Figure 3(b). The constraint graph is used to reduce the size of C T . As shown in Figure 3(b), a constraint graph may not be connected. Within each connected component in G*, we randomly choose a seed n s and try to assign each node n i a variable W[n i ] to represents the tree constraint of the tree path between nodes n s and n i in the pedigree graph G. The assignment is carried out by the following steps in each connected component of G*.
1. W[n s ] of the seed n s is assigned the value zero, 2. start from n s , perform a breadth-first-search traversal via tree constraints, i.e., we can traverse from node n i to node n j if (n i , n j , b t ) C T or (n j , n i , b t ) C T , 3. as we traverse from n i to n j through (n i , n j , b t ) or (n j , n i , b t ), if n j is unvisited, we assign W[n i ] + b t to W[n j ] based on Lemma 1; otherwise we do nothing.
Since W[n i ] represents the tree constraint (n s , n i , W [n i ]), it can be regarded as the summation of h-variables along the unique path on T(G) from the seed n s to node n i , which implies the following lemma: Therefore, if we can assign W-values to all nodes in V and make G* connected, G* would be equipotent to a reduced C T of size (n -1) that covers h-variables of all tree edges of T(G) and is sufficient to solve the ZRHC problem. The construction of the constraint graph takes O(|C T |) = O(mn) time.
The constraint graph G*, however, may not be connected with fully assigned W-values. We therefore introduced an extension procedure to extend G* by adding extra tree constraints, if any, into G*; we would like to reduce the number of connected component in G* as much as possible. To explore more tree constraints to be added into G*, we examine those non-tree edges e E X that do not have cycle constraints in C C . The basic idea is that if we can synthesize a new cycle whose constraint is Figure 2 Two types of redundancy arise from linearly dependence. (a) A cycle c is decomposed into a tree path p 2 and a path p 1 that contains a non-tree edge e. So we have b c = b p + b t , where b c , b p , and b t are constraints of cycle c, path p 1 and path p 2 , respectively. The dotted line represents the non-tree edge e. (b) n l is the node closest to n i on the path p 3 . Assume that the constraint of tree path p i is b i , 1 ≤ i ≤ 6. We have b 1 = b 4 + b 6 , b 2 = b 4 + b 5 , and b 3 = b 5 + b 6 , which conclude that b 1 = b 2 + b 3 due to the addition over GF (2).

Figure 3
The concept of a constraint graph. (a) A tree constraint (n i , n j , b t ) of the path that connects n i and n j in a pedigree graph G will be transformed into an edge between n i and n j with weight b t in the corresponding constraint graph G*. (b) A constraint graph. There are three edges (n 3 , n 11 ), (n 7 , n 5 ), and (n 9 , n 5 ) in the constraint graph, which means that there are three tree constraints in the linear system. Note that the constraint graph is disconnected and contains several connected components. the same as the expected cycle constraint of e, we may obtain new tree constraints by transforming known path constraints of e.
For a non-tree edge e without cycle constraint, we try to synthesize a cycle only if e ∈ E X G . We do nothing if e ∈ E X L because no extra tree constraints of e can be obtained by cycle synthesis. To see this, suppose the local cycle induced by e connects a couple n a and n b and their two children n c and n d ; without loss of generality, we assume e = (n a , n d ) (Figure 4). We can examine the possible constraints derived from this local cycle. Constraints of a single edge with predetermined endpoints are not of interest and can be ignored because the p-values of the endpoints are known; we need only pay attention to constraints whose path lengths are longer than one.  (Figure 4(d)). No useful constraints other than (b c , e), (n c , n d , b p , e), and (n c , n d , b t ) can be derived from this local cycle. Here we already know that (b c , e) does not exist. If (n c , n d , b t ) is already in C T , it is the only useful tree constraint of e and we are finished. If (n c , n d , b t ) does not exist in C T , we cannot obtain (n c , n d , b t ) by combining b c and b p because (b c , e) does not exist, even if the path constraint (n c , n d , b p , e) is available. If this case holds for all 1 ≤ l ≤ m, our linear system actually provides no information to obtain the tree constraint of p 2 ; the h-variable of each edge on p 2 will eventually be assigned a free variable, or its value will depend on other free variables. Therefore we do nothing if e ∈ E X L . Assume that E S is the set of non-tree edges in E X G without cycle constraint. Cycle synthesis is carried out by concatenating paths with known path constraints or tree constraints. The extension procedure is applied to E S as follows.
E1. For each e E S , we check if there is an odd number, say 2t + 1, of path constraints of e that link different connected components in G* to form a synthetic cycle (Figure 5(a)); a constraint is said to link two components A and B if one of its endpoints resides in A and the other resides in B. There is a special case whereby we can also obtain a synthetic cycle if two endpoints of a single path constraint reside in the same connected component (t = 0). If no such 2t + 1 path constraints are found, we cannot synthesize a cycle of e and do nothing; otherwise we perform the following tasks: in which S e is the set of the chosen 2t + 1 path constraints; E1.2 for each path constraint (n x , n y , b p , e) in C P , generate a tree constraint (n x , n y , b c + b p ) and add the new constraint into G*; E1.3 update G*; E1.4 remove e from E S ; E2. If E S becomes empty (there has been a synthetic cycle for every e in the original E S ), or no synthetic cycle is synthesized (E S stays unchanged), we stop the extension procedure; otherwise we go back to E1 to start the next iteration.
We thus try to synthesize a cycle for each non-tree edge in E S to generate new tree constraints and update G*. To update G*, if more than one connected component is combined into a new one by new tree constraints, we arbitrarily choose one of the old seeds from these connected components as a new seed, and perform a graph traversal to update W-values within the new connected component. A non-tree edge that fails to receive a synthetic cycle in a trial of cycle synthesis may benefit from a later updated G* and therefore our extension procedure is designed to operate in an iterative fashion; the procedure terminates only if G*cannot be updated anymore. In this procedure, a non-tree edge may be checked many times (in different iterations) to form a synthetic cycle. In the worst case scenario, only one cycle is synthesized in each iteration, so we require k iterations to perform k + (k -1) + ... + 1 = O(k 2 ) trials of cycle synthesis.
To verify the correctness of the extension procedure, we need first to explain the meaning of Equation (4). Follow a similar argument to that of Lemma 1, for two nodes n x and n y that reside in the same connected component of G*, we know that W[n x ] + W[n y ] is actually the tree constraint of the path from n x to n y on T(G). The synthetic cycle is conceptually a round trip through tree edges and the non-tree edge e. The value b c in Equation (4) is therefore the summation of h-variables along the round trip ( Figure 5(b)). Now we demonstrate that b c is the same as the cycle constraint of e. We first show that there is exactly one h-variable of e in b c . According to Equation (4), we have 2t + 1 h-variables of e in b c . Since we perform additions over GF (2), 2t out of the 2t + 1 h-variables will be cancelled and we finally have only one h-variable of e in b c . To verify if b c is the same as the cycle constraint of e in G, we assume that the expected cycle constraint of e is b c . We generate a set S e by converting path constraints (n i , n j , b p , e) in S e to tree constraints (n i , n j , b c + b p ). It is easy to see that the converted 2t + 1 tree constraints also link connected components in G* to form a new synthetic cycle, and the corresponding round trip only contains tree edges in T(G). T(G) has no cycle and therefore each edge of this new round trip must be visited an even number of times, which means that its h-variable will be cancelled in the new cycle constraint. So the constraint of the new synthetic cycle must be zero and we have the following equations: Since there are 2t + 1 constraints in S e , we have We then obtain b c + b c = 0 and conclude For each e E S , the time to determine if there are odd number of path constraints that link connected components in G* to form a cycle is O(m). This time complexity can be achieved by regarding each connected component as a single node and each path constraint of e as a single edge, and following O(m) edges to perform a depth-first traversal. Since there are O(k 2 ) cycle syntheses throughout the extension procedure, we require O(k 2 m) time to find synthetic cycles. Once we synthesized a cycle for e, we require O(m) time to convert The conceptual view of the synthetic cycle of (a) in a pedigree graph. The synthetic cycle is actually a round trip through the tree edges and the non-tree edge e. The trip is composed of 10 different but connected paths in the pedigree graph. In this example, e would be visited five times during the trip. path constraints to tree constraints because there are at most m path constraints of e in C P . There are O(k) nontree edges in E S and therefore the extension procedure takes O(km) time to perform constraint conversion. To update G*, we require O(n) time to perform breadthfirst traversal on every connected component to modify W-values similar to the way we initialize G*. There are at most k synthetic cycles and therefore G* is updated O(k) times in O(kn) time. In summary, the Step 3, constraint reduction and transformation, takes O(k 2 m) + O(km) + O(kn) = O(k 2 m + kn) time.
Step 4: haplotype determination To solve the h-values of the tree edges of T(G) by Lemma 2, we try to make G* produced by Step 3 connected. Firstly, we pay attention to the founders in the pedigree. Founders cannot be predetermined endpoints of paths with either path constraints or tree constraints and therefore founders must be isolated nodes in G*. It is also impossible to know whether an allele of a founder is paternal or maternal. We attach a founder n f to G* by assuming that it passed its paternal haplotype to an arbitrary child n c . The attachment can be done by assigning weight zero to the edge (n f , n c ) of G*, which implies h n f ,n c = 0 (n f passes its paternal haplotype). There are O(n) edges in G* and therefore the attachment of founders to G* takes O(n) time.
Secondly, we check if there is any non-tree edge that can link any two connected components of G*. A nontree edge e = (n i , n j ) can link two connected components A and B if we can find a path constraint (n k , n l , b p , e) of path p that, without loss of generality, satisfies the following two conditions: 1. n k and n i reside in A and have available W [n k ] and W [n i ] derived from the seed n A of A, 2. n l and n j reside in B and have available W [n l ] and W [n j ] derived from the seed n B of B.
If we can find such a non-tree edge e, we can decompose p into three parts: a sub-path from n k to n i , the non-tree edge e, and the sub-path from n j to n l . The constraints of these three parts are Finally, assume that there remain t connected components of G*. We arbitrarily introduce (t -1) edges into G* to make it connected. Our algorithm does not impose any constraint on these (t -1) edges and therefore the weights of these edges can be safely set as free variables. We then update all W-values within the new connected G* (new W-values may contain free variables), and apply Lemma 2 to determine the h-values of all edges in T(G). With these solved h-values as well as the w-constants and d-constants, we can determine the p-values of all nodes in the locus graphs by Equation (1). The

An execution example
We use the pedigree given in Figure 6(a) as an example to demonstrate how the proposed algorithm works. There are 19 individuals in the pedigree; eight of them are founders. Each individual is equipped with genotype data collected from four marker loci. There is a mating loop in the pedigree.
In the first step, we transform the input pedigree into a pedigree graph G and construct a spanning tree T(G) on G (Figure 6 From the pedigree graph G, we construct the four locus graphs and forests that are depicted in Figure 6(c). The p-values of predetermined nodes, w-constants of all nodes, and d-constants of all edges within the four locus graphs are also identified.
In the second step, we generate all cycle, path, and tree constraints for each of the four locus graphs using Equations (2)   O, 0), (Q, S, 0), (R, Q, 0), (I, H, 0)}. We construct the initial constraint graph G* based on the updated C T (Figure 6(d)). In the initial G*, we choose H, G, and Q as component seeds to determine W-values. We can further find that the three path constraints (N, G, 0, e B-F) , (G, Q, 0, e B-F) , and (N, Q, 0, e B-F) link three connected components to form a synthetic cycle of the non-tree edge B-F with constraint zero. So we further obtain three extra tree constraints (N, G, 0), (G, Q, 0), and (N, Q, 0) derived from the synthetic cycle and add them to G*.
In the final step, we try to make G* connected to solve all h-values. We first arbitrarily introduce eight edges A-H, B-I, C-G, D-G, E-L, J-N, M-O, and P-R to attach the eight founders to G*; all the eight edges are of weight zero to imply that founders passes their paternal haplotypes to one of their children. Now there are only two connected components in G*, one of which is an isolated node, F. we attach F to G* by set h B, F as a free variable. This final connected G* is depicted in Figure 6(e). After the final update of G*, all h-values other than h B, F are zero, and h B, F is free to be either zero or one. Given these known h-values, all p-values over the four locus graphs can be solved by Equation (1).

Time complexity and experimental result
According to the analyses at the end of each step in Section 3, the time complexity of our algorithm is O (mn) (step 1: preprocessing) + O(kmn) (step 2: constraint generation) + O(k 2 m + kn) (step 3: constraint reduction and transformation) + O(mn) (step 4: haplotype determination) = O(kmn + k 2 m). Because k is regarded as a constant, our algorithm is linear.
To verify the efficiency and the correctness of our algorithm, we conducted some experiments using the proposed method. Our algorithm was implemented in C and was evaluated on a desktop computer equipped with Intel Core i7-2600 3.4 GHz CPU and 8 GB of RAM. The desktop ran Ubuntu Release 11.10 operating system with Linux kernel 3.0.0-16-generic and GNOME 3.2.1 graphical user interface.
In the experiments, we generated test cases by setting different number of individuals (n) and markers (m). We applied the algorithm developed by Thomas et al. [24] to generate 12 tree pedigrees with different n values ranging from 30 to 400. To observe how the number of mating loops (k) affects our algorithm, each tree pedigree was preprocessed to produce four variants with zero, two, four, and six mating loops. For each pedigree, we examined 10 different m values ranging from 10 to 300. Each (n, m) combination was tested 100 times. Each time we generated new genotypes and randomly selected one pedigree from the four variants of the given n. The haplotype configurations of all the 12000 trials were correctly identified. The experimental results are listed in Table 2.
Table 2(a) shows that unknown p-variables were correctly solved without assigning any free variable if the number of marker loci was not less than 30, which covers most practical cases in regular genotyping. Free variables were required only when the number of marker loci was far less than the number of individuals. In this experiment, free variables were used only when m = 10, and they were used at most five times out of 100 trials. The result is reasonable because the dimension of the solution space of a pedigree with a limited number of marker loci is probable less than the number of unknown p-variables.
Table 2(b) shows the cumulative execution time of 100 trials of each (n, m) combination. We received a fluctuation in execution time if n or m were less than 100. We conjecture that, because the algorithm executes very fast for small values of n or m, the cumulative execution time might be dramatically affected by the context switches within the operating system that ran many background services. Furthermore, we believe that when both n and m were larger than 100, the execution time of the algorithm became more significant than that of the context switches. From the table it is apparent that the execution time is linear for the larger n and m values.
Finally, Table 2(c) shows that mating loops existed evenly throughout all 12000 trials, with the number ranging from zero to six per pedigree, and the number did not affect the linearity of the execution time of our algorithm in relation to the input size of n and m.

Issue of spanning tree and seed node selection
In the first step, preprocessing, a spanning tree T(G) is constructed on the pedigree graph G. As mentioned above, T(G) is constructed for the ease of cycle processing; it is merely an auxiliary data structure used to generate linear constraints of all cycles and paths between predetermined nodes in G. We do not impose any constraint on the construction of T(G) because predetermined nodes are defined by genotype data. Once the input pedigree is given, all the cycles and paths as well as their constraints are bound, no matter which spanning tree is constructed on the pedigree graph. Different spanning trees assign different edges as the non-tree edge in a cycle, and only affect the type of a constraint; a constraint may be a path constraint with respective to one spanning tree and a tree constraint with respect to another spanning tree. Since different spanning trees are used to generate the same set of constraints, without considering their type, the construction of the spanning tree can be arbitrary. In our implementation, T(G) was constructed by depth-first traversal.
In the second step, constraint generation, a seed node is arbitrarily selected from T(G) to generate tree constraints. To see why the seed node can be selected arbitrarily, assume that there are two possible seeds n i and n j . For any other predetermined node n k , we have (n j , n k , b jk ) = (n i , n j , b jk ) + (n i , n k , b ik ) by Lemma 1, which means that a tree constraint seeded with one predetermined node is a linear combination of two tree constraints seeded with another predetermined node. Hence, tree constraints seeded with different predetermined nodes are mathematical equivalent; we can safely choose any predetermined node as seed. Similarily, the seed nodes within a constraint graph can also be selected arbitrarily based on the above argument.

Consistency checking
Although we assume that the input pedigree is free of genotyping errors, our algorithm can be easily modified to detect inconsistencies within the genotype data without loss of efficiency. No recombination is allowed in the input pedigree and therefore inconsistencies will arise if there are different assignments of an h-value, that results in incompatible linear constraints. We may designate the following two checkpoints to detect inconsistencies within our linear system: 1. The generation of constraints. The constraint of a path or a cycle may be computed more than one time across all locus graphs; all these computations should arrive at the same value. So each time we compute a constraint, we check if it is the same as the current value, if any. 2. The initialization/update of G*. There may be loops in the constraint graph G* and therefore it is possible that there are more than one path from the seed n s to a node n i . It turns out that W [n i ] may be assigned more than once in the initialization or update procedures of G*. By the definition of W-variables, however, all the assignments to W[n i ] are actually associated with the same path from n s to n i on T (G) and therefore should be identical. So each time we compute a W-value, we check if it agrees with the current value, if any.

Conclusions
In this study, we proposed and implemented an algorithm to solve the zero-recombinant haplotype configuration (ZRHC) problem for a general pedigree in O(kmn + k 2 m) time. With the aid of free variables, our method provides a general solution to describe possible haplotype structures within a pedigree rather than a particular solution that only assigns a specific numerical setting to haplotypes. To the best of our knowledge, this algorithm is the first deterministic one to provide a general solution in linear time for pedigrees having small number of mating loops. Moreover, the algorithm can be easily modified to detect inconsistency among genotype data without loss of efficiency. Our experimental results confirm its linearity. In the future, we will try to extend the proposed algorithm to handle recombination and missing data in linear time for general pedigrees.