Isomorphism and similarity for 2-generation pedigrees

We consider the emerging problem of comparing the similarity between (unlabeled) pedigrees. More specifically, we focus on the simplest pedigrees, namely, the 2-generation pedigrees. We show that the isomorphism testing for two 2-generation pedigrees is GI-hard. If the 2-generation pedigrees are monogamous (i.e., each individual at level-1 can mate with exactly one partner) then the isomorphism testing problem can be solved in polynomial time. We then consider the problem by relaxing it into an NP-complete decomposition problem which can be formulated as the Minimum Common Integer Pair Partition (MCIPP) problem, which we show to be FPT by exploiting a property of the optimal solution. While there is still some difficulty to overcome, this lays down a solid foundation for this research.


Introduction
Pedigrees, commonly known as family trees, are important tools in evolutionary and computational biology. They matter to geneticists because, with a valid pedigree, recombination events can be deduced more accurately [8] and disease loci can be mapped consistently [22,23].
There have been many practical methods for reconstructing pedigrees [30,26,4,5,17]. For instance, Thompson [30] defined the pedigree reconstruction problem as follows: given the genetic data from a set of extant individuals, reconstruct relationships between the individuals that may share unobserved ancestors. There has also been research using machine learning methods to construct pedigrees with maximum likelihood [19,10]. Some theoretical results are also known [27,28,29].
It is known that many computations on pedigree graphs are NP-hard [24,20,16], so a series of studies has been conducted on speeding up these computations [6,13,21]. It is expected that this research will continue, possibly along different directions.
On the other hand, methods for comparing pedigrees are rare. The brute-force method will not work when the data set has size in the thousands [1,14]. Phylogenetic trees can typically serve as the basis for comparing tree-like pedigrees. However, even for humans, pedigrees can be more complex than trees, as inter-generational mating is not rare. The only known work that systematically studies pedigree comparison is by Kirkpatrick et al. [18], which covers the pedigree isomorphism and edit distance problems for both general pedigrees and leaf-labeled pedigrees.
In this paper, we follow the work of Kirkpatrick et al. [18] and consider the isomorphism and similarity problems for the simplest pedigrees, namely 2-generation pedigrees. Surprisingly, we show that the isomorphism problem is GI-hard (GI: Graph Isomorphism) even for 2-generation pedigrees. We then relax the similarity measure and formulate the resulting problem as the Minimum Common Integer Pair Partition (MCIPP) problem, a generalization of the famous NP-complete Minimum Common Integer Partition (MCIP) problem, and we show MCIPP to be Fixed-Parameter Tractable (FPT). While there is still some difficulty to overcome, this lays down a solid foundation for this research.

Preliminaries
An (unlabeled) pedigree is a directed graph P = (I(P), E(P)) with vertices I(P) and edges E(P), together with a gender function s : I(P) → {male, female}, such that:
(1) P is acyclic.
(2) For all nodes v ∈ I(P), the in-degree of v is either two or zero.
(3) For two edges (a, c), (b, c) ∈ E(P), we have s(a) ≠ s(b).
In practice, we typically draw a pedigree in a top-down fashion to denote the direction of the edges. Moreover, we use square (resp. circular) nodes to represent males (resp. females). See Figure 1 for an example. Throughout this paper, we assume that a pedigree (or graph) contains no isolated nodes (i.e., those with in-degree and out-degree both zero). This is easy to enforce: if a pedigree contains isolated nodes, we simply remove them.
Let N = {1, 2, 3, . . .}. An individual u ∈ I(P) is monogamous if it mates with exactly one partner, i.e., the number of individuals u', u' ≠ u, such that (u, x), (u', x) ∈ E(P) for some x ∈ I(P) is exactly one. A pedigree is monogamous if all the individuals are monogamous. In Figure 2, the sub-pedigree formed by the rightmost component is monogamous while the leftmost component is not. A pedigree P = (I(P), E(P)) is generational if there is a function g : I(P) → N such that:
(1) g(v) = 1 for all v ∈ I(P) with in-degree zero.
(2) For all (u, v) ∈ E(P), we have g(v) = g(u) + 1.
The number g(v) is called the generation of v. For a generational pedigree P, we use I_g(P) to represent the individuals of P whose generation is g. The pedigree in Figure 1 is not generational, due to node 11. Figure 2 shows a 2-generation pedigree. Throughout this paper, we focus only on these simplest 2-generation pedigrees.
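The definitions above can be checked mechanically. The following is a minimal sketch, not part of the original development: it assumes a pedigree is given as a list of nodes, a list of directed (parent, child) edges and a dictionary `sex`, and the function name `generations` is hypothetical. It returns the generation function g when the input is a valid generational pedigree and `None` otherwise.

```python
from collections import defaultdict, deque

def generations(nodes, edges, sex):
    """Return {node: generation} for a valid generational pedigree, else None.

    Assumed input format (illustrative only): edges are (parent, child)
    pairs and sex maps each node to "male" or "female".
    """
    parents = defaultdict(list)
    children = defaultdict(list)
    for u, v in edges:
        parents[v].append(u)
        children[u].append(v)

    # In-degree must be 0 or 2, and the two parents must differ in gender.
    for v in nodes:
        p = parents[v]
        if len(p) not in (0, 2):
            return None
        if len(p) == 2 and sex[p[0]] == sex[p[1]]:
            return None

    # Founders (in-degree zero) get generation 1; propagate g(v) = g(u) + 1.
    gen = {v: 1 for v in nodes if not parents[v]}
    queue = deque(gen)
    while queue:
        u = queue.popleft()
        for v in children[u]:
            if v not in gen:
                gen[v] = gen[u] + 1
                queue.append(v)
            elif gen[v] != gen[u] + 1:
                return None  # conflicting generations (or a cycle)

    # Nodes never reached from a founder can only lie on or behind a cycle.
    if len(gen) != len(nodes):
        return None
    return gen
```

A pedigree is a 2-generation pedigree in the sense used here when the returned values are all 1 or 2. An individual with one founder parent and one generation-2 parent makes the function return `None`, which corresponds to inter-generational mating.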

Hardness for 2-generation pedigree
Graph Isomorphism (GI) is one of the most famous problems in computational complexity whose precise complexity has been open since 1972 [15,12]. It is not known to be in P or to be NP-complete. The class of GI-complete problems consists of those problems that are polynomial-time equivalent to the GI problem. The class of GI-hard problems consists of those problems that are at least as hard as the GI problem. It is known that even testing isomorphism for chordal bipartite graphs is GI-complete [32].
In [18], it was shown that the pedigree isomorphism problem is GI-hard. The reduction is from bipartite isomorphism. The construction uses a pedigree of three generations. Here we show that even testing the isomorphism of two 2-generation pedigrees is GI-hard.
Theorem 1 Testing the isomorphism between two 2-generation pedigrees is GI-hard.
Proof. We reduce the bipartite graph isomorphism problem to our problem. Let B_1 = (U_1, V_1, E_1) and B_2 = (U_2, V_2, E_2) be two bipartite graphs (with no isolated nodes). The construction is as follows:
(1) All nodes in U_1, U_2 are marked male.
(2) All nodes in V_1, V_2 are marked female.
(3) B_1 is converted into a pedigree P_1 = (I(P_1), E(P_1)) as follows: (3.1) the nodes in U_1 ∪ V_1 are the generation-1 nodes of I(P_1); (3.2) E(P_1) is initially set as empty; (3.3) for each (u, v) ∈ E_1 we create a new generation-2 node uv such that s(uv) = female, and set E(P_1) ← E(P_1) ∪ {(u, uv), (v, uv)}.
(4) B_2 is converted into a pedigree P_2 in the same way as in step (3).
We claim that B_1 and B_2 are isomorphic iff P_1 and P_2, both 2-generation pedigrees, are isomorphic. We only show that an isomorphism between P_1 and P_2 yields one between B_1 and B_2, as the other direction is easy. If P_1 and P_2 are isomorphic, the first property we make use of is that all the generation-2 nodes are female. So, in the isomorphism between P_1 and P_2, if a generation-2 node uv ∈ I(P_1) is mapped to a generation-2 node xy ∈ I(P_2), we can simultaneously contract uv, xy to their corresponding male parents in P_1 and P_2. Consequently, we obtain the isomorphism between B_1 and B_2. □
Note that in our construction, all generation-2 individuals are female; moreover, each mating pair of generation-1 individuals has exactly one (female) child. A simple example of this reduction is shown in Figure 3.
Although the isomorphism testing problem is GI-hard even for 2-generation pedigrees, in some situations the problem is not hard to solve. In fact, when both of the 2-generation pedigrees are monogamous, the problem can be solved in polynomial time.
When a generation-1 couple has generation-2 children, say i females and j males, we say that these two parents and the i + j children form an 〈i, j〉-family. In Figure 2, the rightmost component is a 〈1,1〉-family.
Theorem 2 Testing the isomorphism between two 2-generation monogamous pedigrees is polynomial time solvable.
Proof. It is easily seen that when a 2-generation pedigree Q_1 is monogamous, it is composed of a set of disjoint 〈i, j〉-families. So to test the isomorphism between two monogamous 2-generation pedigrees Q_1, Q_2 it suffices to check whether two multisets of integer pairs are identical, which can be done in O(n log n) time using a standard optimal sorting algorithm in two passes, similar to radix sort. In the first pass, we sort all the pairs according to their first components, and in the second, for each contiguous list of pairs with the same first component, we sort them according to the second components. □
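The construction in the proof is simple enough to write down directly. The sketch below is illustrative only: the function name `bipartite_to_pedigree`, the tuple used to name each generation-2 node, and the (nodes, edges, sex) output format are assumptions, and U and V are assumed to have disjoint labels.

```python
def bipartite_to_pedigree(U, V, E):
    """Build the 2-generation pedigree of the reduction from B = (U, V, E).

    U-nodes become generation-1 males, V-nodes generation-1 females, and
    every edge (u, v) of B becomes one female generation-2 child of u and v.
    """
    sex = {u: "male" for u in U}
    sex.update({v: "female" for v in V})
    edges = []
    for u, v in E:
        child = ("child", u, v)  # hypothetical label for the node "uv"
        sex[child] = "female"
        edges.append((u, child))
        edges.append((v, child))
    return list(sex), edges, sex
```

For a bipartite graph with |E| edges, the resulting pedigree has |U| + |V| + |E| individuals and 2|E| edges, so the reduction is clearly polynomial.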
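The test in the proof of Theorem 2 can be sketched as follows. This is a minimal illustration under stated assumptions: each pedigree is given as a pair (edges, sex) with (parent, child) edges, both inputs are assumed to be valid monogamous 2-generation pedigrees, and the function names are hypothetical. Python's lexicographic tuple sort plays the role of the two sorting passes.

```python
def family_pairs(edges, sex):
    """Return the sorted list of <i, j> pairs (daughters, sons), one per couple."""
    parents = {}
    for u, v in edges:
        parents.setdefault(v, []).append(u)

    counts = {}
    for child, ps in parents.items():
        couple = frozenset(ps)
        i, j = counts.get(couple, (0, 0))
        if sex[child] == "female":
            counts[couple] = (i + 1, j)
        else:
            counts[couple] = (i, j + 1)

    # Sorting tuples orders by first component, then by second component.
    return sorted(counts.values())

def monogamous_isomorphic(p1, p2):
    """Compare two monogamous 2-generation pedigrees given as (edges, sex)."""
    return family_pairs(*p1) == family_pairs(*p2)
```

The sketch does not verify monogamy; on non-monogamous input the comparison of family multisets is only a necessary condition for isomorphism.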

Similarity of 2-generation pedigrees
The hardness result in the previous section implies that it might be too much if we use the standard isomorphism to measure the similarity of 2-generation pedigrees. In practice, ambiguities exist in pedigree-related datasets. In fact, it is estimated that 2-10% of people do not know their biological father [2,25]. For 2-generation pedigrees, in general the pedigrees cannot be monogamous. So, we need a new measure to weakly describe the similarity of two 2-generation pedigrees.
For a general 2-generation pedigree P, it is not difficult to identify all (not necessarily disjoint) 〈i, j〉-families (or simply families, when 〈i, j〉's are used). (For instance, the left component in Figure 2 can be decomposed into two families: 〈2, 0〉 and 〈0, 1〉.) Then, we try to decompose the generation-2 nodes in these families so that the resulting number of isomorphic sub-families is minimized. Note that in this process a generation-1 pair can appear in more than one sub-family. This can in turn be formulated as the Minimum Common Integer Pair Partition (MCIPP) problem.
A partition of a positive integer n is a multiset τ(n) of positive integers whose sum is n. For example, when n = 9, {1, 2, 2, 4} is a partition of n. It should be noted that while it is simple to partition an integer, the number of such partitions is usually
Figure 3 An example for the GI-hardness reduction. In P_1, only four (the leftmost and rightmost two) generation-2 nodes are labeled, some underlined and some overlined, to maintain clarity.
A partition of a multiset X = {x_1, x_2, ..., x_p} is the multiset union of partitions τ(x_i) of its elements, i.e., ∪_{1≤i≤p} τ(x_i). A multiset Z is a common partition of two multisets X and Y if Z is a partition of both X and Y. The size of the partition Z is denoted as |Z|. For example, given X = {5, 8}, Y = {3, 10}, a common partition of X, Y is Z = {1, 2, 2, 4, 4}, and the size of this partition is 5. It is easily seen that a necessary condition for X and Y to admit a common partition is that the sums of the integers in X and Y are equal. Throughout this paper, whenever we talk about a common partition for multisets of integers X and Y, we always assume that this condition is met.
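To make the notion of a common partition concrete, the following sketch checks by backtracking whether a candidate multiset Z is a partition of X, i.e., whether Z can be split into groups whose sums are exactly the elements of X. It is an illustration only (exponential in the worst case, intended for small inputs); the function names are hypothetical and all integers are assumed positive.

```python
def is_partition(Z, X):
    """True if multiset Z can be split into groups summing to the elements of X."""
    if sum(Z) != sum(X):
        return False
    parts = sorted(Z, reverse=True)  # place large parts first
    room = list(X)                   # remaining capacity of each element of X

    def place(idx):
        if idx == len(parts):
            return True  # total sums match, so every element is exactly filled
        seen = set()
        for b in range(len(room)):
            # Skip elements too small, and elements with an already tried capacity.
            if room[b] >= parts[idx] and room[b] not in seen:
                seen.add(room[b])
                room[b] -= parts[idx]
                if place(idx + 1):
                    return True
                room[b] += parts[idx]
        return False

    return place(0)

def is_common_partition(Z, X, Y):
    """True if Z is a partition of both X and Y."""
    return is_partition(Z, X) and is_partition(Z, Y)
```

For the example in the text, Z = {1, 2, 2, 4, 4} is accepted for X = {5, 8} and Y = {3, 10} via 5 = 1 + 4, 8 = 2 + 2 + 4, 3 = 1 + 2 and 10 = 2 + 4 + 4.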

MCIP (Minimum Common Integer Partition)
Instance: Two multisets of integers A and B, and an integer k.
Question: Do A and B admit a common partition of size k?
For ease of presentation, we use MCIP(A, B) to represent this instance.
Given a 2-tuple of integers 〈a, b〉, the projections are P_1(〈a, b〉) = a and P_2(〈a, b〉) = b; for a set S of 2-tuples, P_i(S) denotes the multiset of the i-th components of the tuples in S. Given two sets of 2-tuples S, T, a common partition of S and T is a set of 2-tuples H = {〈g_1, h_1〉, 〈g_2, h_2〉, ..., 〈g_k, h_k〉} such that P_1(H) is a common partition of P_1(S) and P_1(T), and P_2(H) is a common partition of P_2(S) and P_2(T). Here k is the size of the partition H. Again, it is easily seen that a necessary condition for S and T to admit a common partition is that the sums of the integers in P_1(S) and P_1(T) are equal, and so are those in P_2(S) and P_2(T). Throughout this paper, whenever we talk about any common partition of sets of 2-tuples S, T, we always assume that this condition is met.
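The necessary condition above is easy to test before attempting any partition. The sketch below assumes that P_1 and P_2 select the first and second components of each 2-tuple, as their use in the text suggests; the function names are hypothetical.

```python
def projection(S, i):
    """Multiset (as a list) of the i-th components, i in {1, 2}, of the 2-tuples in S."""
    return [t[i - 1] for t in S]

def sums_match(S, T):
    """Necessary condition for S and T to admit a common partition."""
    return all(sum(projection(S, i)) == sum(projection(T, i)) for i in (1, 2))
```

For instance, S = {〈9, 4〉, 〈1, 11〉, 〈6, 3〉} and T = {〈2, 8〉, 〈12, 1〉, 〈2, 9〉} pass the check, since both first projections sum to 16 and both second projections sum to 18.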

MCIPP (Minimum Common Integer Pair Partition)
Instance: Two multisets of 2-tuples of integers S and T, and an integer k.
Question: Do S and T admit a common partition of size k?
Recall that a 2-tuple 〈i, j〉 represents the pedigree of a couple which has i female and j male children. Again, we use MCIPP(S, T) to represent this instance. As MCIPP is a generalization of MCIP, all the known negative results regarding MCIP hold for MCIPP; i.e., MCIP and MCIPP are both NP-complete and APX-hard, following [7]. (In the past, d-MCIP has also been considered, where the input is d multisets with the same sum. Efficient asymptotic approximation algorithms have been obtained for large d [7,33,34], the best factor being 0.5625 · d + O(1) [34]. We will only consider d = 2 in this paper.) Also, note that the integer 0 in a solution for MCIP is meaningless, while 0 can appear either in the input or in the solution for MCIPP. So for MCIPP, we focus on inte-
Finally, a Fixed-Parameter Tractable (FPT) algorithm is an algorithm for a decision problem with input size n and parameter k whose running time is O(f(k) n^c) = O*(f(k)), where f(·) is any computable function of k and c is a constant. FPT algorithms are efficient tools for handling some NP-complete problems, especially when k is small in practical datasets [9,11].

Some properties of MCIPP
Given a pair of integers a, c, we say a dominates c if a > c. Given a pair of 2-tuples of integers 〈a, b〉 and 〈c, d〉, we say 〈a, b〉 dominates 〈c, d〉 if a ≥ c and b ≥ d.
To simplify the writing, we say that 〈a, b〉 and 〈c, d〉 form a dominating pair if either 〈a, b〉 dominates 〈c, d〉 or vice versa. Likewise, 〈a, b〉 and 〈c, d〉 form a non-dominating pair if either a > c, b < d or a < c, b > d.
We first describe some optimality properties for the optimization versions of both MCIP and MCIPP. When the context is clear, we still use MCIP(-,-) and MCIPP(-,-) to denote the corresponding optimization versions of the instances.
The following property is trivial and holds for both MCIP and MCIPP.
For a pair of dominating 2-tuples 〈a, b〉 and 〈c, d〉, we can use subtraction to partition them into two common pairs. For example, if a ≤ c and b ≤ d, then we can obtain the common partition {〈a, b〉, 〈c − a, d − b〉}. So, given 〈2, 4〉 and 〈4, 5〉 we can obtain a partition {〈2, 4〉, 〈2, 1〉} for 〈4, 5〉. We call this a dominating-partition operation. For MCIP, this gives a way to partition a pair of integers as well. For instance, given 2 and 6, we can subtract 2 from 6 to obtain a partition {2, 4} for 6.
We next describe some properties of non-dominating pairs of 2-tuples, which are unique to MCIPP: for MCIP, a pair of integers a, b has the property that either a dominates b or vice versa. This is not the case for a non-dominating pair of 2-tuples, e.g., 〈1, 4〉 and 〈2, 3〉. We start with this fundamental lemma.
Lemma 3 Let A', B' be two multisets of positive integers with the same total sum; moreover, suppose |A'| = m + 1 and |B'| = m. Then there must exist elements a ∈ A', b ∈ B' such that a < b.
Proof. As |A'| > |B'|, we can arbitrarily select m elements from A' and match them with those in B' in a one-to-one fashion. As the sums of the integers in A' and B' are the same, in this matching at least one element a ∈ A' must be smaller than its matched counterpart b ∈ B'; otherwise, the sum of the integers in A' would be larger than that of B'. □
Corollary 1 Let A, B be two sets of n > 1 positive integers with the same total sum. WLOG, let A = {a_1, a_2, ..., a_n}, B = {b_1, b_2, ..., b_n}. Then there must exist an element b ∈ B' = B − {b_j} which is greater than some element a ∈ A' = {a_1, a_2, ..., a_{i−1}, a_i − b_j, a_{i+1}, ..., a_n}, where a_i > b_j.
Proof. Obviously we have |A'| = n and |B'| = n − 1. Then this corollary follows directly from Lemma 3. □
The implication of Corollary 1 for MCIP with input A, B is obvious: we can successively find pairs of dominating integers. In fact, in the proof of Corollary 1, once we obtain a' = a_k ∈ A' and b' = b_ℓ ∈ B' such that a' < b', we can repeatedly apply the above argument to A'' = A' − {a'} = {a_1, a_2, ..., a_{k−1}, a_{k+1}, ..., a_n} and B'' = {b_1, b_2, ..., b_{ℓ−1}, b_ℓ − a', b_{ℓ+1}, ..., b_n}, where |A''| = |B''| = n − 1 and the two sets A'', B'' have the same sum.
Now let us see how this can be applied to MCIPP. When we have an instance of MCIPP whose input S, T is each composed of m non-dominating 2-tuples, we can find a pair s = 〈s_1, s_2〉 ∈ S and t = 〈t_1, t_2〉 ∈ T (assuming s_1 > t_1 and s_2 < t_2) such that we can put 〈t_1
Moreover, if we apply the dominating-partition process on these two tuples s', t', following Corollary 1, we can repeatedly find dominating tuples until all tuples in S, T are commonly partitioned. This is because after we apply the dominating-partition on s', t' (say s' < t') to obtain an MCIPP instance S'', T'', we have |P_1(S'')| = |P_1(T'')|, |P_2(S'')| = |P_2(T'')| and the two pairs of sizes in fact differ by one. Following Corollary 1, we can then repeatedly obtain dominating pairs.
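The dominance relation and the dominating-partition operation can be sketched in a few lines. This is an illustration only; the function names are hypothetical and 2-tuples are represented as Python tuples.

```python
def dominates(p, q):
    """<a, b> dominates <c, d> if a >= c and b >= d."""
    return p[0] >= q[0] and p[1] >= q[1]

def is_dominating_pair(p, q):
    """True if one of the two 2-tuples dominates the other."""
    return dominates(p, q) or dominates(q, p)

def dominating_partition(p, q):
    """Split the larger tuple of a dominating pair by subtraction.

    Returns (small, remainder), where small is the dominated tuple and
    remainder is the componentwise difference, or None for a
    non-dominating pair.
    """
    if dominates(q, p):
        small, big = p, q
    elif dominates(p, q):
        small, big = q, p
    else:
        return None
    return small, (big[0] - small[0], big[1] - small[1])
```

With 〈2, 4〉 and 〈4, 5〉 this yields the common tuple 〈2, 4〉 and the remainder 〈2, 1〉, matching the example in the text. If the two tuples are equal, the remainder is 〈0, 0〉, which a caller would normally discard.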
Algorithm Heuristic-MCIPP
1 While |S| ≥ 2 and |T| ≥ 2
2   Repeat
2.1   select a pair of dominating 2-tuples, s ∈ S and t ∈ T,
2.2   compute two decomposing 2-tuples by subtraction,
3   Until no dominating 2-tuples can be found.
4   If there are at least two pairs of non-dominating 2-tuples in S and T
5   Then
5.1   use a brute-force method to select two non-dominating 2-tuples s' ∈ S, t' ∈ T which leads to successive dominating pairs.
6 If |S| = 1, |T| = 1, then find the smaller tuple in S and T, x.
7 Return τ(S, T) ← τ(S, T) ∪ {x}.
Of course, due to the 'existence' constraint in Lemma 3 and Corollary 1, we would have to use a brute-force method to find a pair s ∈ S, t ∈ T which makes it possible to repeatedly process dominating pairs. Let us show an example: S = {〈9, 4〉, 〈1, 11〉, 〈6, 3〉} and T = {〈2, 8〉, 〈12, 1〉, 〈2, 9〉}. In this example, among the 9 non-dominating pairs between S and T, there are 4 solutions enabling us to successively find dominating pairs. One of them is s = 〈6, 3〉 and t = 〈2, 9〉, which gives us a common partition of size 6. The other 5 solutions all lead to a common partition of size 7.
The above discussion enables us to design an algorithm Heuristic-MCIPP to prove the next lemma.
Proof. When there are no non-dominating pairs in the input, with the running of the algorithm Heuristic-MCIPP, we have |MCIPP(S, T)| ≤ |S| + |T| − 1. The reason is that when each of S and T has at least two 2-tuples, we can use the dominating-partition procedure to obtain two 2-tuples in the solution set for each pair of dominating 2-tuples from S, T. When there are a total of three elements in S, T, say, one in S and two in T, we just need to return the two elements in T, as their sum already matches the one in S. (This is certainly true for MCIP as pointed out in [7].) When there are p ≥ 2 pairs of non-dominating pairs at Steps 4-5 of the Heuristic-MCIPP algorithm, following Corollary 1 and the subsequent arguments, there exists a non-dominating pair s = 〈s_1, s_2〉 ∈ S, t = 〈t_1, t_2〉 ∈ T which leads to successive dominating pairs. (We can use the brute-force method to find this in O(p^2 (|S| + |T|)) time.) In this case, the solution obtained by Heuristic-MCIPP has size at most 1 + (|S| + |T| − 1) = |S| + |T|, where the first one corresponds to (s_1, t_2) (if s_1 < t_1) or (t_1, s_2) (if s_1 > t_1). □
In fact, the above three lemmas imply that Heuristic-MCIPP provides a factor-2 approximation for MCIPP, as we have |S| + |T| ≤ 2 max{|S|, |T|} ≤ 2|MCIPP*(S, T)|. On the other hand, designing approximation algorithms is not our focus in this paper; in fact, by a simple modification of the Maximum Packing method in [7] we can obtain a similar factor-1.25 approximation for MCIPP. In the remainder of this paper, we focus solely on the exact or FPT algorithm.
Note that Lemma 4 is different from its counterpart for MCIP, which, according to Lemma 2.2 in [7], states that |MCIP(A, B)| < |A| + |B| − 1. The latter in fact immediately implies that for MCIP there is always an optimal solution which leaves at least one integer from either A or B unpartitioned. That further implies that there is a simple FPT algorithm for MCIP based on bounded-degree search. We will show a stronger property in the next section to improve the FPT algorithm for MCIP, and subsequently, an FPT algorithm for MCIPP can be obtained.

An FPT algorithm for MCIPP
We first give the following lemma for MCIP.
Lemma 5 Let A, B be the input for MCIP and let a be the smallest element in A or B. Then there is an optimal solution for MCIP which contains a, i.e., there is an optimal solution which does not partition the smallest element in A and B.
Proof. We first show the following claim: in an optimal solution τ for MCIP with input A, B, let a_1, a_2 be a pair of elements in τ(z), z ∈ A ∪ B, such that (1) a_1 + a_2 < a, and (2) a_2 is the minimum among all pairs of elements in τ satisfying condition (1). Then there is an optimal solution τ' which partitions some element z ∈ A ∪ B with τ'(z) = (τ(z) − {a_1, a_2}) ∪ {a_1 + a_2}.
The proof of the above claim is as follows. WLOG, let z ∈ A and let a_1 ∈ τ(y_1) and a_2 ∈ τ(y_2) for two distinct integers y_1, y_2 ∈ B. Following the definition of 〈a_1, a_2〉, τ(y_1) contains at least one more element other than a_1, say a_3; and following (2) we have a_3 > a_2. Suppose that a_3 ∈ τ(x) for some x ∈ A. By Lemma 1, x ≠ z. We replace a_3 by a_3' = a_3 − a_2, and a_1 by a_1' = a_1 + a_2.
We comment that we can use Lemma 2.2 by Chen et al. [7] directly to prove a weaker claim: as |MCIP(A, B)| < |A| + |B| − 1, there must be an optimal solution whose corresponding matching graph between the partitioned elements in A, B contains no cycle, which means there is at least one leaf node. Then this leaf node corresponds to an unpartitioned integer in A or B.
The above lemma in fact implies a faster FPT algorithm for MCIP. Pick the smallest element a ∈ A ∪ B (say a ∈ A); we try to partition some other integer z ∈ B by subtracting a from it. Then we repeat over the new problem instance involving z − a. This process is repeated up to k times, until either a solution is found or we have to report that there is no solution of size k. The running time is O*((max{|A|, |B|})^k) = O*(k^k).
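The bounded search described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `mcip_at_most` is hypothetical, the inputs are assumed to be lists of positive integers, and the sketch decides the "at most k parts" variant of the question. Its correctness relies on Lemma 5, i.e., on the existence of an optimal solution that keeps the smallest element unpartitioned.

```python
def mcip_at_most(A, B, k):
    """Decide whether A and B admit a common partition with at most k parts."""
    if sum(A) != sum(B):
        return False

    def solve(A, B, budget):
        # A and B are sorted tuples of positive integers with equal sums.
        if not A and not B:
            return True
        if budget == 0:
            return False
        # Make A the side holding the overall smallest element a.
        if A[0] > B[0]:
            A, B = B, A
        a, rest = A[0], A[1:]
        tried = set()
        for idx, z in enumerate(B):
            if z in tried:       # equal values give identical subproblems
                continue
            tried.add(z)
            others = B[:idx] + B[idx + 1:]
            if z > a:            # a becomes one part of z; z - a remains
                others = tuple(sorted(others + (z - a,)))
            if solve(rest, others, budget - 1):
                return True
        return False

    return solve(tuple(sorted(A)), tuple(sorted(B)), k)
```

Each level of the recursion fixes one part of the solution and branches over at most max{|A|, |B|} choices of z, which mirrors the O*((max{|A|, |B|})^k) bound. For X = {5, 8} and Y = {3, 10} the search succeeds with k = 3 (for example via the common partition {3, 5, 5}) and fails with k = 2.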
To obtain an FPT algorithm for MCIPP, we also need a similar lemma.
Lemma 6 Let S, T be the input for MCIPP. Then there is an optimal solution for MCIPP which either contains 〈a, b〉 ∈ S ∪ T or 〈c, d〉 ∈ S ∪ T, or contains 〈a, d〉, where a is the minimum element in P_1(S ∪ T) and d is the minimum element in P_2(S ∪ T).
Similar to Lemma 5, it is obvious that we can repeatedly apply the above steps to obtain an optimal solution which does not partition the smallest element in P_1(S ∪ T) (and, symmetrically, P_2(S ∪ T)). Hence the lemma is proven. □
With the above lemma, it is again possible to have an FPT algorithm, Exact-MCIPP, for MCIPP using bounded-degree search. At each step, we search for 〈a, b〉, 〈c, d〉 ∈ S ∪ T or 〈a, d〉 ∈ S ∪ T, where a is the minimum element in P_1(S ∪ T) and d is the minimum element in P_2(S ∪ T), such that some optimal solution for MCIPP contains 〈a, b〉, 〈c, d〉 or 〈a, d〉. For one step, the running time is O(k_1 + k_2) for the first two cases, and it is also O(k_1 + k_2) for the last case, as 〈a, d〉 could be subtracted from O(k_1 + k_2) pairs, where k_1 = |S|, k_2 = |T|. As k_1, k_2 ≤ k, the running time of this step is bounded by O(2k). Running this for k steps, the running time of the whole algorithm is O*(2^k k^k). Hence, we have the following theorem.
The running time of the above FPT algorithm is still too high for it to be applied alone to the similarity comparison of arbitrary 2-generation pedigrees, i.e., when k is large. In [14], the salmon data contains 60 individuals from each family, with hundreds of families. To handle data like that, we either need to speed up our algorithm or combine the FPT algorithm with some existing approximation algorithms (which will be discussed next). Nevertheless, it lays down a solid theoretical foundation for further research on this problem, especially when k is relatively small.
In practice, to handle datasets possibly of varying k values, we suggest a combination of the FPT algorithm and approximation algorithms [7,31]. That is, when the value of k is not too large, we can run this FPT algorithm; when k is too large for the FPT algorithm to handle, we can then use the approximation algorithms. (We comment that the approximation algorithms in [7,31], though presented for MCIP, can be easily adapted for MCIPP.)

Concluding remarks
We consider the problem of testing the isomorphism and similarity of the simplest possible unlabeled pedigrees. We show that isomorphism testing is GI-hard, excluding any chance of a polynomial time algorithm (unless Graph Isomorphism is polynomially solvable). We define a new similarity measure based on 〈i, j〉-families, and formulate this as the Minimum Common Integer Pair Partition (MCIPP) problem, which generalizes the NP-complete Minimum Common Integer Partition (MCIP) problem. We show that MCIPP (hence MCIP) is FPT (Fixed-Parameter Tractable). It would be interesting to significantly improve the running time of the FPT algorithms presented in this paper.