Isomorphism and similarity for 2-generation pedigrees
BMC Bioinformatics volume 16, Article number: S7 (2015)
Abstract
We consider the emerging problem of comparing the similarity between (unlabeled) pedigrees. More specifically, we focus on the simplest pedigrees, namely, the 2-generation pedigrees. We show that isomorphism testing for two 2-generation pedigrees is GI-hard. If the 2-generation pedigrees are monogamous (i.e., each individual at level-1 can mate with exactly one partner), then the isomorphism testing problem can be solved in polynomial time. We then relax the problem into an NP-complete decomposition problem, which can be formulated as the Minimum Common Integer Pair Partition (MCIPP) problem. We show that MCIPP is FPT by exploiting a property of the optimal solution. While some difficulties remain to be overcome, this lays down a solid foundation for this research.
Introduction
Pedigrees, commonly known as family trees, are important tools in evolutionary and computational biology. They are important for geneticists, as with a valid pedigree the recombination events can be deduced more accurately [8], or disease loci can be mapped consistently [22, 23]. In this sense, pedigrees could greatly help geneticists.
There have been many practical methods for reconstructing pedigrees [30, 26, 4, 5, 17]. For instance, Thompson [30] defined the pedigree reconstruction problem as follows: given the genetic data from a set of extant individuals, reconstruct relationships between the individuals that may share unobserved ancestors. There has also been research using machine learning methods to construct pedigrees with the maximum likelihood [19, 10]. Some theoretical results are also known [27–29].
It is known that many computations on pedigree graphs are NP-hard [24, 20, 16], so a series of studies has been conducted on speeding up these computations [6, 13, 21]. It is expected that this research will continue, possibly along different directions.
On the other hand, methods for comparing pedigrees are rare. The brute-force method will not work when the data set has size in the thousands [1, 14]. People can typically use phylogenetic trees as the basis to compare tree-like pedigrees. However, even for humans the pedigrees could be more complex than trees, as inter-generational mating is not rare. The only known research that systematically studies pedigree comparison is by Kirkpatrick et al. [18], where the pedigree isomorphism and edit distance problems, for both general pedigrees and leaf-labeled pedigrees, are investigated.
In this paper, we follow the work by Kirkpatrick et al. [18] and consider the isomorphism and similarity problems for the simplest pedigrees, namely 2-generation pedigrees. Surprisingly, we show that the isomorphism problem is GI-hard (GI: Graph Isomorphism) even for 2-generation pedigrees. We then relax the similarity measure and formulate this as a Minimum Common Integer Pair Partition (MCIPP) problem, generalizing the famous NP-complete Minimum Common Integer Partition (MCIP) problem. We show that MCIPP is Fixed-Parameter Tractable (FPT). While there is still some difficulty to overcome, this lays down a solid foundation for this research.
Preliminaries
An (unlabeled) pedigree is a directed graph P = (I(P), E(P)) with vertices I(P) and edges E(P), together with a gender function s : I(P) → {male, female} such that:

1
P is acyclic.

2
For all nodes v ∈ I(P), the indegree of v is either two or zero.

3
For two edges (a, c), (b, c) ∈ E(P), we have s(a) ≠ s(b).
In practice, we typically draw a pedigree in a top-down fashion to denote the direction of the edges. Moreover, we use square (resp. circular) nodes to represent males (resp. females). See Figure 1 for an example. Throughout this paper, we assume that a pedigree (or graph) contains no isolated nodes (i.e., those with indegree and outdegree both zero). This is easy to enforce: if a pedigree does contain isolated nodes, we simply remove them.
Let \mathcal{N}=\left\{1,2,3,\dots \right\}. An individual u ∈ I(P) is monogamous if it mates with exactly one partner, i.e., the number of individuals u', u' ≠ u, such that (u, x), (u', x) ∈ E(P) for some x ∈ I(P) is exactly one. A pedigree is monogamous if all the individuals are monogamous. In Figure 2, the subpedigree formed by the rightmost component is monogamous while the leftmost component is not. A pedigree P = (I(P), E(P)) is generational if there is a function g:I\left(P\right)\to \mathcal{N} such that:

1
g(v) = 1 for all v ∈ I(P) with indegree zero.

2
For all (u, v) ∈ E(P), we have g(v) = g(u) + 1.
The number g(v) is called the generation of v. For a generational pedigree P, we use I_{g}(P) to represent the individuals of P whose generation is g. The pedigree in Figure 1 is not generational, due to node 11. Figure 2 shows a 2-generation pedigree. Throughout this paper, we focus only on these simplest 2-generation pedigrees.
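The two conditions defining a generational pedigree can be checked directly. The following is a minimal Python sketch, assuming a pedigree is given as a list of node labels and a list of directed (parent, child) edges; the function name generation_function and this input encoding are illustrative assumptions, not part of the original formulation.

```python
from collections import defaultdict


def generation_function(nodes, edges):
    """Return {node: generation} if the pedigree is generational, else None.

    Assumes the pedigree is acyclic; edges are (parent, child) pairs.
    """
    children = defaultdict(list)
    indegree = {v: 0 for v in nodes}
    for u, v in edges:
        children[u].append(v)
        indegree[v] += 1

    # Condition 1: every node with indegree zero has generation 1.
    g = {v: 1 for v in nodes if indegree[v] == 0}
    stack = list(g)
    while stack:
        u = stack.pop()
        for v in children[u]:
            if v not in g:
                g[v] = g[u] + 1          # Condition 2: g(v) = g(u) + 1
                stack.append(v)
            elif g[v] != g[u] + 1:       # two parents disagree on g(v)
                return None
    return g if len(g) == len(nodes) else None
```

A pedigree is then a 2-generation pedigree exactly when this function returns a mapping whose maximum value is 2.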
Given two pedigrees P = (I(P), E(P)), P' = (I(P'),E(P')) with the associated gender functions s(−), s'(−) respectively, a bijection ϕ: I(P) → I(P') is a pedigree isomorphism between P and P' if:

1
For every u ∈ I(P), s(u) = s'(ϕ(u)), and

2
(u, v) ∈ E(P) if and only if (ϕ(u), ϕ(v)) ∈ E(P').
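Verifying that a given bijection is a pedigree isomorphism is straightforward, in contrast to finding one. The sketch below checks the two conditions above for a candidate mapping; the dictionary-based encoding (sex maps each individual to "male" or "female", phi maps individuals of P to individuals of P') and the function name are assumptions made for illustration.

```python
def is_pedigree_isomorphism(phi, sex_p, edges_p, sex_q, edges_q):
    """Check whether phi is a pedigree isomorphism from P to P'."""
    # phi must be a bijection from I(P) onto I(P').
    if set(phi) != set(sex_p) or set(phi.values()) != set(sex_q):
        return False
    if len(sex_p) != len(sex_q):
        return False
    # Condition 1: genders are preserved.
    if any(sex_p[u] != sex_q[phi[u]] for u in sex_p):
        return False
    # Condition 2: for a bijection, "(u, v) is an edge iff (phi(u), phi(v))
    # is an edge" is equivalent to the image of E(P) being exactly E(P').
    mapped = {(phi[u], phi[v]) for u, v in edges_p}
    return mapped == set(edges_q)
```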
Hardness for 2-generation pedigrees
Graph Isomorphism (GI) is one of the most famous problems in computational complexity whose precise complexity has been open since 1972 [15, 12]. It is not known to be in P or to be NP-complete. The class of GI-complete problems consists of those that are polynomial-time equivalent to the GI problem. The class of GI-hard problems consists of those that are at least as hard as the GI problem. It is known that even testing isomorphism for chordal bipartite graphs is GI-complete [32].
In [18], it was shown that the pedigree isomorphism problem is GI-hard. The reduction is from bipartite isomorphism. The construction uses a pedigree of three generations. Here we show that even testing the isomorphism of two 2-generation pedigrees is GI-hard.
Theorem 1 Testing the isomorphism between two 2-generation pedigrees is GI-hard.
Proof. We reduce the bipartite graph isomorphism problem to our problem. Let B_{1} = (U_{1}, V_{1}, E_{1}), B_{2} = (U_{2}, V_{2}, E_{2}) be two bipartite graphs (with no isolated nodes). For our construction, we perform the following:

1
All nodes in U_{1},U_{2} are marked male.

2
All nodes in V_{1},V_{2} are marked female.

3
B_{1} is converted into a pedigree P_{1} = (I(P_{1}), E(P_{1})) as follows: (3.1) I(P_{1}) = U_{1} ∪ V_{1} are the generation-1 nodes; (3.2) E(P_{1}) is initially set as empty; (3.3) for each (u, v) ∈ E_{1} we create a new generation-2 node uv such that s(uv) = female, and set E(P_{1}) ← E(P_{1}) ∪ {(u, uv), (v, uv)}.

4
B_{2} is converted into a pedigree P_{2} identically as in step (3).
We claim that B_{1} and B_{2} are isomorphic iff P_{1} and P_{2}, both 2-generation pedigrees, are isomorphic. We only show one direction here (from pedigree isomorphism to graph isomorphism), as the other one is easy. Suppose P_{1} and P_{2} are isomorphic. The property we make use of is that all the generation-2 nodes are female. So, in the isomorphism between P_{1} and P_{2}, if a generation-2 node uv ∈ I(P_{1}) is mapped to a generation-2 node xy ∈ I(P_{2}), we can simultaneously contract uv, xy to their corresponding male parents in P_{1} and P_{2}. Consequently, we obtain the isomorphism between B_{1} and B_{2}. □
Note that in our construction, all generation-2 individuals are female; moreover, each mating pair of generation-1 individuals has exactly one child, a female. A simple example of this reduction is shown in Figure 3.
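The construction in the proof of Theorem 1 can be sketched in a few lines of Python. This is an illustrative sketch only: it assumes the labels in U and V are disjoint and hashable, and the function name and the ("child", u, v) labeling of generation-2 nodes are hypothetical choices.

```python
def bipartite_to_pedigree(U, V, E):
    """Convert a bipartite graph (U, V, E) into a 2-generation pedigree.

    Returns (sex, edges): sex maps each individual to a gender, and
    edges is a list of directed (parent, child) pairs.
    """
    sex = {u: "male" for u in U}             # step 1: U nodes are male
    sex.update({v: "female" for v in V})     # step 2: V nodes are female
    edges = []
    for u, v in E:                           # step 3.3: one child per edge
        child = ("child", u, v)
        sex[child] = "female"
        edges.append((u, child))
        edges.append((v, child))
    return sex, edges
```

Each generation-2 node has indegree two, with one male and one female parent, so the output satisfies the pedigree conditions in the Preliminaries.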
Although the isomorphism testing problem is GI-hard even for 2-generation pedigrees, in some situations the problem is not hard to solve. In fact, when both of the 2-generation pedigrees are monogamous, the problem can be solved in polynomial time.
When a generation-1 couple mates to have generation-2 children, for instance i females and j males, we say that these two parents and the i + j children form an 〈i, j〉-family. In Figure 2, the rightmost component is a 〈1,1〉-family.
Theorem 2 Testing the isomorphism between two 2-generation monogamous pedigrees is polynomial-time solvable.
Proof. It is easily seen that when a 2-generation pedigree Q_{1} is monogamous, it is composed of a set of disjoint 〈i, j〉-families. So to test the isomorphism between two monogamous 2-generation pedigrees Q_{1}, Q_{2}, it suffices to check whether the two multisets of integer pairs are identical, which can be done in O(n log n) time using a standard optimal sorting algorithm in two passes, similar to radix sort. In the first pass, we sort all the pairs according to their first components, and in the second, for each contiguous list of pairs with the same first component, we sort them according to the second components. □
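The test in the proof of Theorem 2 can be sketched as follows. The sketch assumes each pedigree is given as a pair (sex, edges), with sex mapping individuals to "male" or "female" and edges listing (parent, child) pairs, and that both inputs are monogamous 2-generation pedigrees; the function names are illustrative. Python's built-in lexicographic sort on tuples replaces the two explicit passes described in the proof.

```python
from collections import defaultdict


def family_pairs(sex, edges):
    """Return the sorted list of <i, j> pairs (i females, j males) per couple."""
    parents = defaultdict(set)
    for u, v in edges:
        parents[v].add(u)
    counts = defaultdict(lambda: [0, 0])
    for child, couple in parents.items():
        index = 0 if sex[child] == "female" else 1
        counts[frozenset(couple)][index] += 1
    return sorted(tuple(c) for c in counts.values())


def monogamous_isomorphic(p, q):
    """Isomorphism test, valid only when both pedigrees are monogamous."""
    return family_pairs(*p) == family_pairs(*q)
```

For a non-monogamous pedigree, family_pairs still lists the (not necessarily disjoint) families, but equality of the lists no longer implies isomorphism.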
Similarity of 2-generation pedigrees
The hardness result in the previous section implies that standard isomorphism might be too strict a measure for the similarity of 2-generation pedigrees. In practice, ambiguities exist in pedigree-related datasets. In fact, it is estimated that 2-10% of people do not know their biological father [2, 25]. For 2-generation pedigrees, in general the pedigrees cannot be assumed to be monogamous. So, we need a new measure to weakly describe the similarity of two 2-generation pedigrees.
For a general 2-generation pedigree P, it is not difficult to identify all (not necessarily disjoint) 〈i, j〉-families (or simply families, when the 〈i, j〉's are clear). (For instance, the left component in Figure 2 can be decomposed into two families: 〈2, 0〉 and 〈0, 1〉.) Then, we try to decompose the generation-2 nodes in these families so that the resulting number of isomorphic sub-families is minimized. Note that in this process a generation-1 pair can appear in more than one sub-family. This can in turn be formulated as the Minimum Common Integer Pair Partition (MCIPP) problem.
MCIP and MCIPP Problems
Throughout this paper, for MCIP, we focus on integers in \mathcal{N}=\left\{1,2,3,\dots \right\}. A partition of an integer n is a multiset τ(n) = {n_{1}, n_{2},..., n_{ t }} such that {\sum}_{1\le i\le t}{n}_{i}=n. For example, when n = 9, {1, 2, 2, 4} is a partition of n. It should be noted that while it is simple to partition an integer, the number of such partitions is usually (counter-intuitively) huge. For instance, the integer 100 has 190569292 distinct partitions [3].
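To give a sense of how quickly the number of integer partitions grows, the following short Python sketch counts them with a standard dynamic program over the allowed part sizes. The function name count_partitions is an illustrative choice.

```python
def count_partitions(n):
    """Count the multisets of positive integers that sum to n."""
    ways = [1] + [0] * n          # ways[t]: partitions of t using parts seen so far
    for part in range(1, n + 1):
        for total in range(part, n + 1):
            ways[total] += ways[total - part]
    return ways[n]
```

With this sketch, count_partitions(10) evaluates to 42, while count_partitions(100) evaluates to 190569292.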
A partition of a multiset X = {x_{1}, x_{2},..., x_{ p }} is a multiset union of all the partitions τ(x_{ i }), i.e., ∪_{1≤i≤p}τ(x_{ i }). A multiset Z is a common partition of two multisets X = {x_{1}, x_{2},..., x_{ p }}, Y = {y_{1}, y_{2},..., y_{ q }} if there are partitions τ_{1}, τ_{2} with ∪_{1≤i≤p}τ_{1}(x_{ i }) = ∪_{1≤j≤q}τ_{2}(y_{ j }) = Z. The size of the partition Z is denoted as |Z|. For example, given X = {5, 8}, Y = {3, 10}, a common partition of X, Y is Z = {1, 2, 2, 4, 4}, and the size of this partition is 5. It is easily seen that a necessary condition for X and Y to admit a common partition is that the sums of the integers in X and Y are equal. Throughout this paper, whenever we talk about a common partition for multisets of integers X and Y, we always assume that this condition is met.
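The definition of a common partition can be checked mechanically once the individual partitions are given explicitly. The sketch below is a minimal verifier under the assumption that tau_x[i] lists the parts chosen for X[i] (and likewise for Y); the function name and this encoding are illustrative.

```python
from collections import Counter


def common_partition_size(X, Y, tau_x, tau_y):
    """Return |Z| if tau_x and tau_y induce the same multiset Z, else None."""
    def is_valid(values, parts):
        # Every integer needs a non-empty list of positive parts with the right sum.
        return len(values) == len(parts) and all(
            p and min(p) >= 1 and sum(p) == v for v, p in zip(values, parts)
        )

    if not (is_valid(X, tau_x) and is_valid(Y, tau_y)):
        return None
    z_from_x = Counter(n for p in tau_x for n in p)
    z_from_y = Counter(n for p in tau_y for n in p)
    if z_from_x != z_from_y:
        return None
    return sum(z_from_x.values())
```

For X = {5, 8} and Y = {3, 10}, the partitions 5 = 1 + 4, 8 = 2 + 2 + 4, 3 = 1 + 2 and 10 = 2 + 4 + 4 yield Z = {1, 2, 2, 4, 4}, of size 5.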
MCIP (Minimum Common Integer Partition)
Instance: Two multisets of integers A and B, and an integer k.
Question: Do A and B admit a common partition of size k?
For the ease of presentation, we use MCIP(A, B) to represent this instance.
Given a 2-tuple of integers, 〈a, b〉, the projections are {\mathcal{P}}_{1}\left(\langle a,b\rangle \right)=a,{\mathcal{P}}_{2}\left(\langle a,b\rangle \right)=b. Let S be a set of 2-tuples of integers; then {\mathcal{P}}_{1}\left(S\right)={\cup}_{s\in S}{\mathcal{P}}_{1}\left(s\right),{\mathcal{P}}_{2}\left(S\right)={\cup}_{s\in S}{\mathcal{P}}_{2}\left(s\right).
Given two sets of 2tuples S, T, a common partition of S and T is a set of 2tuples H = {〈g_{1}, h_{1}〉, 〈g_{2}, h_{2}〉, ⋯, 〈g_{ k }, h_{ k }〉} such that {\mathcal{P}}_{1}\left(H\right) is a common partition of {\mathcal{P}}_{1}\left(S\right) and {\mathcal{P}}_{1}\left(T\right), and, {\mathcal{P}}_{2}\left(H\right) is a common partition of {\mathcal{P}}_{2}\left(S\right) and {\mathcal{P}}_{2}\left(T\right). k is the size of the partition H. Again, it is easily seen that the necessary condition for S and T to admit a common partition is that the sums of the integers in {\mathcal{P}}_{1}\left(S\right) and {\mathcal{P}}_{1}\left(T\right) are equal, so are those in {\mathcal{P}}_{2}\left(S\right) and {\mathcal{P}}_{2}\left(T\right). Throughout this paper, whenever we talk about any common partition of sets of 2tuples S, T, we always assume that this condition is met.
MCIPP (Minimum Common Integer Pair Partition)
Instance: Two multisets of 2-tuples of integers S and T, and an integer k.
Question: Do S and T admit a common partition of size k?
Recall that a 2-tuple 〈i, j〉 represents the pedigree of a couple which has i female and j male children. Again, we use MCIPP(S, T) to represent this instance. As MCIPP is a generalization of MCIP, all the known negative results regarding MCIP hold for MCIPP; i.e., MCIP and MCIPP are both NP-complete and APX-hard, following [7]. (In the past, d-MCIP has also been considered, where the input is d multisets with the same sum. Efficient asymptotic approximation algorithms have been obtained for large d [7, 33, 34], the best factor being 0.5625 · d + O(1) [34]. We will only consider d = 2 in this paper.) Also, note that the integer 0 in a solution for MCIP is meaningless, while 0 can appear either in the input or in the solution for MCIPP. So for MCIPP, we focus on integers in \mathcal{N}\cup \left\{0\right\}=\left\{0,1,2,3,\dots \right\}.
Finally, a Fixed-Parameter Tractable (FPT) algorithm is an algorithm for a decision problem with input size n and parameter k whose running time is O(f(k)n^{c}) = O*(f(k)), where f(−) is any computable function on k and c is a constant. FPT algorithms are efficient tools for handling some NP-complete problems, especially when k is small in practical datasets [9, 11].
Some properties of MCIPP
Given a pair of integers a, c, we say a dominates c if a > c. Given a pair of 2-tuples of integers 〈a, b〉 and 〈c, d〉, we say 〈a, b〉 dominates 〈c, d〉 if a ≥ c and b ≥ d. To simplify the writing, we say that 〈a, b〉 and 〈c, d〉 form a dominating pair if either 〈a, b〉 dominates 〈c, d〉 or vice versa. Likewise, 〈a, b〉 and 〈c, d〉 form a non-dominating pair if either a > c, b < d or a < c, b > d.
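The dominance relation on 2-tuples is a simple componentwise comparison, and a pair of 2-tuples is non-dominating exactly when it is not dominating. A minimal Python sketch follows, with 2-tuples represented as Python tuples; the function names are illustrative assumptions.

```python
def dominates(p, q):
    """<a, b> dominates <c, d> if a >= c and b >= d."""
    return p[0] >= q[0] and p[1] >= q[1]


def is_dominating_pair(p, q):
    """True if p dominates q or q dominates p."""
    return dominates(p, q) or dominates(q, p)


def non_dominating_pairs(S, T):
    """List all (s, t) with s in S, t in T that form a non-dominating pair."""
    return [(s, t) for s in S for t in T if not is_dominating_pair(s, t)]
```

For instance, 〈2, 4〉 and 〈4, 5〉 form a dominating pair, whereas 〈1, 4〉 and 〈2, 3〉 do not.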
We first describe some optimality properties for both the optimization versions of MCIP and MCIPP. When the context is clear, we still use MCIP(,) and MCIPP(,) to denote the corresponding optimization versions of the instances.
Lemma 1 Let A, B be the input for MCIP. In any feasible solution, if a partition for some x ∈ A, τ(x) = {x_{1}, x_{2},..., x_{ p }}, and a partition for some y ∈ B, τ(y) = {y_{1}, y_{2},..., y_{ q }}, satisfy |τ(x) ∩ τ(y)| > 1, then this solution for MCIP is not optimal.
Proof. Suppose to the contrary that |τ(x) ∩ τ(y)| > 1, and the corresponding partition for A, B is optimal. WLOG, suppose τ(x) = {x_{1}, x_{2},..., x_{ p }} and τ(y) = {y_{1}, y_{2},..., y_{ q }} contain r > 1 common elements {z_{1}, z_{2},..., z_{ r }}. Then we can update τ(x) ← (τ(x) − {z_{1}, z_{2},..., z_{ r }}) ∪ {z_{1} + z_{2} + ... + z_{ r }} and τ(y) ← (τ(y) − {z_{1}, z_{2},..., z_{ r }}) ∪ {z_{1} + z_{2} + ... + z_{ r }}. The solution size for MCIP on A, B is thereby reduced by r − 1, contradicting the assumed optimality. □
With the above lemma, we can now assume that in any optimal solution, the partitions of any x ∈ A and any y ∈ B share at most one common element. Notice that this lemma also holds for MCIPP, i.e., in an optimal partition of 〈s_{1}, s_{2}〉 ∈ S and 〈t_{1}, t_{2}〉 ∈ T, τ(〈s_{1}, s_{2}〉) and τ(〈t_{1}, t_{2}〉) share at most one common 2-tuple. Similarly, we can assume that in the input for MCIP(A, B) (resp. MCIPP(S, T)) there is no integer common to A and B (resp. no 2-tuple common to S and T), as it must be put in the optimal solution.
The following property is trivial and holds for both MCIP and MCIPP.
Lemma 2 Let MCIPP*(S, T) be the optimal solution size for MCIPP(S, T). Then MCIPP*(S, T) ≥ max{|S|, |T|}.
For a pair of dominating 2-tuples 〈a, b〉 and 〈c, d〉, we can use subtraction to partition them into two common pairs. For example, if a ≤ c and b ≤ d, then we can obtain the common partition {〈a, b〉, 〈c − a, d − b〉}. So, given 〈2, 4〉 and 〈4, 5〉 we can obtain a partition {〈2, 4〉, 〈2, 1〉} for 〈4, 5〉. We call this a dominating-partition operation. For MCIP, this gives a way to partition a pair of integers as well. For instance, given 2 and 6, we can subtract 2 from 6 to obtain a partition {2, 4} for 6.
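The dominating-partition operation amounts to one componentwise subtraction. The following Python sketch illustrates it on tuples; the function name and the choice to raise an error on non-dominating input are assumptions made for this illustration.

```python
def dominating_partition(s, t):
    """Split the larger of two dominating 2-tuples by subtracting the smaller.

    Returns (small, remainder), so that the larger tuple is partitioned
    into {small, remainder}.
    """
    if s[0] <= t[0] and s[1] <= t[1]:
        small, large = s, t
    elif t[0] <= s[0] and t[1] <= s[1]:
        small, large = t, s
    else:
        raise ValueError("the two 2-tuples do not form a dominating pair")
    remainder = (large[0] - small[0], large[1] - small[1])
    return small, remainder
```

Applied to 〈2, 4〉 and 〈4, 5〉, the sketch returns 〈2, 4〉 together with the remainder 〈2, 1〉, matching the example above.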
We next describe some properties of non-dominating pairs of 2-tuples which are unique to MCIPP. For MCIP, a pair of integers a, b has the property that either a dominates b or vice versa. This is not the case for a non-dominating pair of 2-tuples, e.g., 〈1, 4〉 and 〈2, 3〉. We start with this fundamental lemma.
Lemma 3 Let A', B' be two sets of positive integers with the same total sum, and suppose |A'| = m + 1, |B'| = m. Then there must exist elements a ∈ A', b ∈ B' such that a < b.
Proof. As |A'| > |B'|, we can arbitrarily select m elements from A' and match them up with those in B' in a one-to-one fashion. As the sums of integers in A' and B' are the same, in this matching at least one of the elements a ∈ A' must be smaller than its matched counterpart b ∈ B'; otherwise, the sum of integers in A' would be larger than that of B'. □
Corollary 1 Let A, B be two sets of n > 1 positive integers with the same total sum. WLOG, let A = {a_{1}, a_{2}, ⋯, a_{ n }}, B = {b_{1}, b_{2}, ⋯, b_{ n }}. Then there must exist an element b ∈ B' = B − {b_{ j }} which is greater than some element a ∈ A' = {a_{1}, a_{2}, ⋯, a_{i−1}, a_{ i } − b_{ j }, a_{i+1}, ⋯, a_{ n }}, where a_{ i } > b_{ j }.
Proof. Obviously we have |A'| = n and |B'| = n − 1. Then this corollary follows directly from Lemma 3. □
The implication of Corollary 1 for MCIP with input A, B is obvious: we can successively find pairs of dominating integers. In fact, in the proof of Corollary 1, once we obtain a' = a_{ k } ∈ A' and b' = b_{ ℓ } ∈ B' such that a' < b', we can repeatedly apply the above argument to A'' = A' − {a'} = {a_{1}, a_{2}, ⋯, a_{k−1}, a_{k+1}, ⋯, a_{ n }} and B'' = {b_{1}, b_{2}, ⋯, b_{ℓ−1}, b_{ℓ} − a', b_{ℓ+1}, ⋯, b_{ n }}, where |A''| = |B''| = n − 1 and the two sets A'', B'' have the same sum.
Now let us see how this can be applied to MCIPP. When we have an instance of MCIPP whose input {S, T} is each composed of m non-dominating 2-tuples, then we can find a pair s = 〈s_{1}, s_{2}〉 ∈ S and t = 〈t_{1}, t_{2}〉 ∈ T (assuming s_{1} > t_{1} and s_{2} < t_{2}) such that we can put 〈t_{1}, s_{2}〉 in some solution set while the resulting instance S' = (S − {〈s_{1}, s_{2}〉}) ∪ {〈s_{1} − t_{1}, 0〉}, T' = (T − {〈t_{1}, t_{2}〉}) ∪ {〈0, t_{2} − s_{2}〉} is still a valid instance for MCIPP. (We call this operation non-dominating-partition.) Then, following Corollary 1, there exists a pair of dominating 2-tuples s' ∈ S', t' ∈ T'. Moreover, if we apply the dominating-partition process on these two tuples s', t', following Corollary 1, we can repeatedly find dominating tuples until all tuples in S, T are commonly partitioned. This is because after we apply the dominating-partition on s', t' (say s' < t') to obtain an MCIPP instance S'', T'', we have \left|{\mathcal{P}}_{1}\left({S}^{\prime\prime}\right)\right|\ne \left|{\mathcal{P}}_{1}\left({T}^{\prime\prime}\right)\right|,\left|{\mathcal{P}}_{2}\left({S}^{\prime\prime}\right)\right|\ne \left|{\mathcal{P}}_{2}\left({T}^{\prime\prime}\right)\right| and the two pairs of sizes in fact differ by one. Following Corollary 1, we can then repeatedly obtain dominating pairs.
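The non-dominating-partition operation extracts the componentwise minimum of the two tuples as the common part and leaves one remainder on each side, each with a zero component. A minimal Python sketch, with a hypothetical function name, is:

```python
def non_dominating_partition(s, t):
    """Split a non-dominating pair s, t into a common 2-tuple and two remainders.

    Returns (common, s_rest, t_rest) with s = common + s_rest and
    t = common + t_rest componentwise.
    """
    (s1, s2), (t1, t2) = s, t
    if not ((s1 > t1 and s2 < t2) or (s1 < t1 and s2 > t2)):
        raise ValueError("the two 2-tuples do not form a non-dominating pair")
    common = (min(s1, t1), min(s2, t2))
    s_rest = (s1 - common[0], s2 - common[1])
    t_rest = (t1 - common[0], t2 - common[1])
    return common, s_rest, t_rest
```

For s = 〈6, 3〉 and t = 〈2, 9〉, this gives the common 2-tuple 〈2, 3〉 with remainders 〈4, 0〉 and 〈0, 6〉.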
Algorithm HeuristicMCIPP(S, T)
Input: S, T
Output: A common partition τ(S, T) for S, T, initially empty.
1 While |S| ≥ 2 and |T| ≥ 2
2 Repeat
2.1 select a pair of dominating 2-tuples, s ∈ S and t ∈ T,
2.2 compute two decomposing 2-tuples by subtraction,
2.3 update S ← S − {s}, T ← (T − {t}) ∪ {t − s} if s < t,
2.4 update S ← (S − {s}) ∪ {s − t}, T ← T − {t} if s > t,
2.5 update τ(S, T) ← τ(S, T) ∪ {min(s, t)},
3 Until no dominating 2-tuples can be found.
4 If there are at least two pairs of non-dominating 2-tuples in S and T
5 Then
5.1 use a brute-force method to select two non-dominating 2-tuples s' ∈ S, t' ∈ T which leads to successive dominating pairs.
6 If |S| = 1, |T| = 1, then find the smaller tuple in S and T, x.
7 Return τ(S, T) ← τ(S, T) ∪ {x}.
Of course, due to the 'existence' constraint in Lemma 3 and Corollary 1, we would have to use a brute-force method to find a pair s ∈ S, t ∈ T which makes the process of repeatedly processing dominating pairs possible. Let us show an example, S = {〈9, 4〉, 〈1, 11〉, 〈6, 3〉} and T = {〈2, 8〉, 〈12, 1〉, 〈2, 9〉}. In this example, among the 9 non-dominating pairs between S and T, there are 4 choices enabling us to successively find dominating pairs. One of them is s = 〈6, 3〉 and t = 〈2, 9〉, which gives us a common partition of size 6. The other 5 choices all lead to a common partition of size 7.
The above discussion enables us to design an algorithm HeuristicMCIPP to prove the next lemma.
Lemma 4 Let MCIPP(S, T) be the size of the solution returned by HeuristicMCIPP. Then MCIPP(S, T) ≤ |S| + |T|.
Proof. When there are no non-dominating pairs in the input, running the algorithm HeuristicMCIPP gives MCIPP(S, T) ≤ |S| + |T| − 1. The reason is that when each of S and T has at least two 2-tuples, we can use the dominating-partition procedure to obtain two 2-tuples in the solution set for each pair of dominating 2-tuples from S, T. When there are a total of three elements in S, T, say, one in S and two in T, we just need to return the two elements in T, as their sum already matches the one in S. (This is certainly true for MCIP, as pointed out in [7].)
When there are p ≥ 2 non-dominating pairs at Steps 4-5 of the HeuristicMCIPP algorithm, following Corollary 1 and the subsequent arguments, there exists a non-dominating pair s = 〈s_{1}, s_{2}〉 ∈ S, t = 〈t_{1}, t_{2}〉 ∈ T which leads to successive dominating pairs. (We can use the brute-force method to find this in O(p^{2}(|S| + |T|)) time.) In this case, the solution obtained by HeuristicMCIPP has size at most 1 + (|S| + |T| − 1) = |S| + |T|, where the first one corresponds to 〈s_{1}, t_{2}〉 (if s_{1} < t_{1}) or 〈t_{1}, s_{2}〉 (if s_{1} > t_{1}). □
In fact, the above three lemmas imply that HeuristicMCIPP provides a factor-2 approximation for MCIPP, as we have |S| + |T| ≤ 2max{|S|, |T|} ≤ 2MCIPP*(S, T). On the other hand, designing approximation algorithms is not our focus in this paper; in fact, by a simple modification of the Maximum Packing method in [7] we can obtain a similar factor-1.25 approximation for MCIPP. In the remainder of this paper, we focus solely on exact (FPT) algorithms.
Note that Lemma 4 is different from its counterpart for MCIP, which, according to Lemma 2.2 in [7], states that MCIP(A, B) ≤ |A| + |B| − 1. The latter in fact immediately implies that for MCIP there is always an optimal solution which leaves at least one integer from either A or B unpartitioned. That further implies that there is a simple FPT algorithm for MCIP based on bounded-degree search. We will show a stronger property in the next section to improve the FPT algorithm for MCIP, and subsequently, an FPT algorithm for MCIPP can be obtained.
An FPT algorithm for MCIPP
We first give the following lemma for MCIP.
Lemma 5 Let A, B be the input for MCIP and let a be the smallest element in A or B. Then there is an optimal solution for MCIP which contains a, i.e., there is an optimal solution which does not partition the smallest element in A and B.
Proof. We first show the following claim: in an optimal solution τ for MCIP with input A, B, let a_{1}, a_{2} be a pair of elements in τ(z), z ∈ A ∪ B, with the condition that (1) a_{1} + a_{2} ≤ a, and (2) a_{2} is the minimum among all pairs of elements in τ satisfying the condition (1), then there is an optimal solution τ' which partitions some element z ∈ A ∪ B with τ'(z) = (τ(z) − {a_{1}, a_{2}}) ∪ {a_{1} + a_{2}}.
The proof for the above claim is as follows. WLOG, let z ∈ A and let a_{1} ∈ τ(y_{1}) and a_{2} ∈ τ(y_{2}) for two distinct integers y_{1}, y_{2} ∈ B. Following the definition of 〈a_{1}, a_{2}〉, τ(y_{1}) contains at least one more element other than a_{1}, say a_{3}; and following (2) we have a_{3} > a_{2}. Suppose that a_{3} ∈ τ(x) for some x ∈ A. By Lemma 1, x ≠ z. We replace a_{3} by {a}_{3}^{\prime}={a}_{3}-{a}_{2}, and a_{1} by {a}_{1}^{\prime}={a}_{1}+{a}_{2}. Subsequently, we obtain another optimal partition τ' with τ'(x) = (τ(x) − {a_{3}}) ∪ {{a}_{3}^{\prime}, a_{2}}, τ'(y_{1}) = (τ(y_{1}) − {a_{1}, a_{3}}) ∪ {{a}_{1}^{\prime}, {a}_{3}^{\prime}}, and τ'(z) = (τ(z) − {a_{1}, a_{2}}) ∪ {{a}_{1}^{\prime}}. Apparently, τ' has the same size as τ, so it is also a minimum size common partition for A, B.
It is obvious that, as long as the smallest element a is partitioned in some optimal partition τ with \tau \left(a\right)=\left\{{a}_{1}^{\prime},{a}_{2}^{\prime},\dots ,{a}_{t}^{\prime}\right\}, we can repeatedly apply the above steps to obtain another optimal partition τ' with {\tau}^{\prime}\left(a\right)=\left\{{a}_{1}^{\prime},{a}_{2}^{\prime},\dots ,{a}_{i-1}^{\prime},{a}_{i}^{\prime}+{a}_{j}^{\prime},{a}_{i+1}^{\prime},\dots ,{a}_{j-1}^{\prime},{a}_{j+1}^{\prime},\dots ,{a}_{t}^{\prime}\right\}. After t − 1 such steps, we obtain an optimal partition which contains the minimum element a in A ∪ B. □
An example of the above proof is given as follows. We have A = {2, 5, 5}, B = {6, 6}, and an optimal partition τ = {1, 1, 5, 5} where τ(2) = {a_{1} = 1, a_{2} = 1}. By the construction in the proof of Lemma 5, y_{1} = 6, y_{2} = 6, a_{3} = 5, {a}_{3}^{\prime}=4 and {a}_{1}^{\prime}=2. The new optimal solution is τ' = {1, 2, 4, 5}, where a = 2 is kept.
We comment that we can use Lemma 2.2 by Chen et al. [7] directly to prove a weaker claim: as MCIP(A, B) ≤ |A| + |B| − 1, there must be an optimal solution whose corresponding matching graph between the partitioned elements in A, B contains no cycle, which means there is at least one leaf node. Then this leaf node corresponds to an unpartitioned integer in A or B. The above lemma in fact implies a faster FPT algorithm for MCIP. Pick the smallest element a ∈ A ∪ B (say a ∈ A); we try to partition some other integer z ∈ B by subtracting a from it. Then we repeat over the new problem instance involving z − a. This process is repeated at most k times, after which either a solution is found or we report that there is no solution of size k. The running time is O*((max{|A|, |B|})^{k}) = O*(k^{k}).
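The bounded search described above can be sketched in Python as follows. This is an illustrative sketch rather than a tuned implementation: it decides whether a common partition of size at most k exists, its completeness rests on Lemma 5, and the function names mcip_at_most and mcip_min_size are hypothetical.

```python
def mcip_at_most(A, B, k):
    """Decide whether multisets A, B of positive integers admit a common
    partition of size at most k, branching on the smallest element."""
    if sum(A) != sum(B):
        return False

    def solve(A, B, k):
        if not A and not B:
            return True
        if k == 0 or not A or not B:
            return False
        if A[0] > B[0]:               # let A hold the global minimum
            A, B = B, A
        a, rest = A[0], A[1:]
        # By Lemma 5, a can be kept whole; branch on which z absorbs it.
        for z in sorted(set(B)):
            i = B.index(z)
            new_B = B[:i] + B[i + 1:]
            if z > a:
                new_B = tuple(sorted(new_B + (z - a,)))
            if solve(rest, new_B, k - 1):
                return True
        return False

    return solve(tuple(sorted(A)), tuple(sorted(B)), k)


def mcip_min_size(A, B):
    """Smallest k accepted by mcip_at_most (requires equal sums)."""
    if sum(A) != sum(B):
        raise ValueError("A and B must have the same sum")
    k = max(len(A), len(B))
    while not mcip_at_most(A, B, k):
        k += 1
    return k
```

On the example A = {2, 5, 5}, B = {6, 6}, the search finds a solution within four steps and none within three, in agreement with the optimal partition of size 4 discussed above.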
To obtain an FPT algorithm for MCIPP, we also need a similar lemma.
Lemma 6 Let S, T be the input for MCIPP. Then there is an optimal solution for MCIPP which either contains 〈a, b〉 ∈ S ∪ T or 〈c, d〉 ∈ S ∪ T, or contains 〈a, d〉, where a is the minimum element in {\mathcal{P}}_{1}\left(S\cup T\right) and d is the minimum element in {\mathcal{P}}_{2}\left(S\cup T\right).
Proof. Again, we first show the following claim: in an optimal solution τ for MCIPP with input S, T, let 〈a_{1}, a_{2}〉, 〈b_{1}, b_{2}〉 be two 2-tuples in τ(z), z ∈ S ∪ T, such that (1) a_{1} + b_{1} ≤ a, and (2) b_{1} is the minimum among all pairs of 2-tuples in τ satisfying (1), then there is an optimal solution τ' which partitions some 2-tuple z ∈ S ∪ T with τ'(z) = (τ(z) − {〈a_{1}, a_{2}〉, 〈b_{1}, b_{2}〉}) ∪ {〈a_{1} + b_{1}, a_{2}〉}. (Symmetrically, we can have a claim on the second component of 2-tuples in S ∪ T, i.e., d.)
WLOG, let z = 〈z_{1}, z_{2}〉 ∈ S and let 〈a_{1}, a_{2}〉 ∈ τ(y_{1}) and 〈b_{1}, b_{2}〉 ∈ τ(y_{2}) for two distinct 2-tuples y_{1}, y_{2} ∈ T. Following the definition of (a_{1}, b_{1}), τ(y_{1}) contains at least one more pair 〈c_{1}, c_{2}〉, with c_{1} ≥ b_{1}. Suppose that 〈c_{1}, c_{2}〉 ∈ τ(x) for some x ∈ S. Again, by Lemma 1, x ≠ z. We replace 〈c_{1}, c_{2}〉 by 〈c_{1} − b_{1}, c_{2}〉, and 〈a_{1}, a_{2}〉 by 〈a_{1} + b_{1}, a_{2}〉. Subsequently, we obtain another optimal partition τ' with τ'(x) = (τ(x) − {〈c_{1}, c_{2}〉}) ∪ {〈c_{1} − b_{1}, c_{2}〉, 〈a_{2}, b_{2}〉}, τ'(y_{1}) = (τ(y_{1}) − {〈a_{1}, a_{2}〉, 〈c_{1}, c_{2}〉}) ∪ {〈a_{1} + b_{1}, a_{2}〉, 〈c_{1} − b_{1}, c_{2}〉}, and τ'(z) = (τ(z) − {〈a_{1}, a_{2}〉, 〈b_{1}, b_{2}〉}) ∪ {〈a_{1} + b_{1}, a_{2}〉}. Again, τ' is also a minimum size common partition for S, T.
Similar to Lemma 5, it is obvious that we can repeatedly apply the above steps to obtain an optimal solution which does not partition the smallest element in {\mathcal{P}}_{1}\left(S\cup T\right) (and, symmetrically, {\mathcal{P}}_{2}\left(S\cup T\right)). Hence the lemma is proven. □
With the above lemma, it is again possible to have an FPT algorithm, ExactMCIPP, for MCIPP using bounded-degree search. At each step, we search for 〈a, b〉, 〈c, d〉 ∈ S ∪ T or 〈a, d〉 ∈ S ∪ T, where a is the minimum element in {\mathcal{P}}_{1}\left(S\cup T\right) and d is the minimum element in {\mathcal{P}}_{2}\left(S\cup T\right) such that some optimal solution for MCIPP contains 〈a, b〉, 〈c, d〉 or 〈a, d〉. For one step, the running time would be O(k_{1} + k_{2}) for the first two cases, and would also be O(k_{1} + k_{2}) for the last case, as 〈a, d〉 could be subtracted from O(k_{1} + k_{2}) pairs, where k_{1} = |S|, k_{2} = |T|. As k_{1}, k_{2} ≤ k, the running time of this step is bounded by O(2k). Running this for k steps, the running time of the whole algorithm is O*(2^{k}k^{k}). Hence, we have the following theorem.
Algorithm ExactMCIPP(S, T)
Input: S, T, k
Output: A common partition τ(S, T) for S, T, initially empty.
1 While k ≥ 1
2 Repeat
2.1 let a be the minimum element in {\mathcal{P}}_{1}\left(S\cup T\right),
2.2 let d be the minimum element in {\mathcal{P}}_{2}\left(S\cup T\right),
2.3 if 〈a, d〉 ∈ S ∪ T then τ(S, T) ← τ(S, T) ∪ {〈a, d〉}, delete 〈a, d〉 from S ∪ T, and update S, T and k ← k − 1,
2.4 if 〈a, b〉 ∈ S ∪ T then τ(S, T) ← τ(S, T) ∪ {〈a, b〉}, delete 〈a, b〉 from S ∪ T, and update S, T and k ← k − 1,
2.5 if 〈c, d〉 ∈ S ∪ T then τ(S, T) ← τ(S, T) ∪ {〈c, d〉}, delete 〈c, d〉 from S ∪ T, and update S, T and k ← k − 1,
3 Until S = ∅ or T = ∅ or k = 0.
4 If both S = ∅ and T = ∅
4.1 Then return τ(S, T),
4.2 Else return 'no solution'.
Theorem 3 Minimum Common Integer Pair Partition is FPT.
The running time of the above FPT algorithm is still too high to be applied alone to the similarity comparison for arbitrary 2generation pedigrees, i.e., when k is large. In [14], the salmon data contains 60 individuals from each family, with hundreds of families. To handle some data like that, we either need to speed up the running time of our algorithm or combine the FPT algorithm with some existing approximation algorithms (which will be discussed next). Nevertheless, it lays down a solid theoretical foundation for further research on this problem, especially when k is relatively small.
In practice, to handle datasets possibly of varying k values, we suggest a combination of the FPT algorithm and approximation algorithms [7, 31]. That is, when the value of k is not too large, we can run this FPT algorithm; when k is too large for the FPT algorithm to handle, we can then use the approximation algorithms. (We comment that the approximation algorithms in [7, 31], though presented for MCIP, can be easily adapted for MCIPP.)
Concluding remarks
We consider the problem of testing the isomorphism and similarity of the simplest possible unlabeled pedigrees. We show that the isomorphism testing is GI-hard, excluding any chance for a polynomial-time algorithm (unless Graph Isomorphism is polynomially solvable). We define a new similarity measure based on 〈i, j〉-families, and formulate this as the Minimum Common Integer Pair Partition (MCIPP) problem, which generalizes the NP-complete Minimum Common Integer Partition (MCIP) problem. We show that MCIPP (hence MCIP) is FPT (Fixed-Parameter Tractable). It would be interesting to significantly improve the running time of the FPT algorithms presented in this paper.
References
Abney M, Ober C, McPeek M: Quantitative-trait homozygosity and association mapping and empirical genome-wide significance in large, complex pedigrees: fasting serum-insulin level in the Hutterites. Am J Hum Genet. 2002, 70: 920-934. 10.1086/339705.
Anderson KG: How well does paternity confidence match actual paternity? Evidence from worldwide nonpaternity rates. Curr Anthropol. 2006, 47: 513-520. 10.1086/504167.
Andrews G: The Theory of Partitions. 1976, Addison-Wesley
Berger-Wolf T, Sheikh S, DasGupta B, et al: Reconstructing sibling relationships in wild populations. Bioinformatics. 2007, 23: i49-i56. 10.1093/bioinformatics/btm219.
Brown D, Berger-Wolf T: Discovering kinship through small subsets. Proc. WABI'10. 2010, 111-123.
Browning S, Browning B: On reducing the statespace of hidden Markov models for the identity by descent process. Theor Popul Biol. 2002, 62: 1-8. 10.1006/tpbi.2002.1583.
Chen X, Liu L, Liu Z, Jiang T: On the minimum common integer partition problem. ACM Trans on Algorithms. 2008, 5 (1):
Coop G, Wen X, Ober C, et al: High-resolution mapping of crossovers reveals extensive variation in fine scale recombination patterns among humans. Science. 2008, 319: 1395-1398. 10.1126/science.1151851.
Downey R, Fellows M: Parameterized Complexity. 1999, Springer-Verlag
Fishelson M, Dovgolevsky N, Geiger D: Maximum likelihood haplotyping for general pedigrees. Hum Hered. 2005, 59: 41-60. 10.1159/000084736.
Flum J, Grohe M: Parameterized Complexity Theory. 2006, Springer-Verlag
Garey MR, Johnson DS: Computers and Intractability: A Guide to the Theory of NP-Completeness. 1979, W.H. Freeman
Geiger D, Meek C, Wexler Y: Speeding up HMM algorithms for genetic linkage analysis via chain reduction of the state space. Bioinformatics. 2009, 25: i196. 10.1093/bioinformatics/btp224.
Herbinger C, O'Reiley P, Doyle R, et al: Early growth performance of Atlantic salmon full-sib families reared in single family tanks versus in mixed family tanks. Aquaculture. 1999, 173: 105-116. 10.1016/S0044-8486(98)00479-7.
Karp R: Reducibility among combinatorial problems. Complexity of Computer Computations. Edited by: R Miller and J Thatcher. 1972, Plenum Press, NY, 85-103.
Kirkpatrick B: Haplotype versus genotypes on pedigrees. Proc. WABI'10. 2010, 136-147.
Kirkpatrick B, Li S, Karp R, et al: Pedigree reconstruction using identity by descent. J of Computational Biology. 2011, 18: 1481-1493. 10.1089/cmb.2011.0156.
Kirkpatrick B, Reshef Y, Finucane H, Jiang H, Zhu B, Karp R: Comparing pedigree graphs. J of Computational Biology. 2012, 19 (9): 998-1014. 10.1089/cmb.2011.0254.
Lauritzen S, Sheehan N: Graphical models for gene analysis. Stat Sci. 2003, 18: 489-514. 10.1214/ss/1081443232.
Li J, Jiang T: An exact solution for finding minimum recombinant haplotype configurations on pedigrees with missing data by integer linear programming. Proc. RECOMB'03. 2003, 101-110.
Li X, Yin XL, Li J: Efficient identification of identical-by-descent status in pedigrees with many untyped individuals. Bioinformatics. 2010, 26: i191-198. 10.1093/bioinformatics/btq222.
Ng M, Levinson D, Faraone S, et al: Meta-analysis of 32 genome-wide linkage studies of schizophrenia. Mol Psychiatry. 2009, 14: 774-785. 10.1038/mp.2008.135.
Ng S, Buckingham K, Lee C, et al: Exome sequencing identifies the cause of a mendelian disorder. Nat Genet. 2010, 42: 30-35. 10.1038/ng.499.
Piccolboni A, Gusfield D: On the complexity of fundamental computational problems in pedigree analysis. J of Computational Biology. 2003, 10: 763-773. 10.1089/106652703322539088.
Simmons L, Firman R, Rhodes G, Peters M: Human sperm competition: testis size, sperm production and rates of extrapair copulations. Animal Behavior. 2004, 68: 1323-1337.
Stankovich J, Bahlo M, Rubio J, et al: Identifying nineteenth century genealogical links from genotypes. Hum Genet. 2005, 117: 188-199. 10.1007/s00439-005-1279-y.
Steel M, Hein J: Reconstructing pedigrees: a combinatorial perspective. J Theor Biol. 2006, 240: 360-367. 10.1016/j.jtbi.2005.09.026.
Thatte B, Steel M: Reconstructing pedigrees: a stochastic perspective. J Theor Biol. 2008, 251: 440-449. 10.1016/j.jtbi.2007.12.004.
Thatte B: Combinatorics of pedigrees I: counterexamples to a reconstruction question. SIAM J Disc Math. 2008, 22: 961-970. 10.1137/060675964.
Thompson E: Pedigree Analysis in Human Genetics. 1985, Johns Hopkins University Press
Tong W, Lin G: An improved approximation algorithm for the minimum common integer partition problem. Proc. ISAAC'14. 2014, 353-364.
Uehara R, Toda S, Nagoya T: Graph isomorphism completeness for chordal bipartite graphs and strongly chordal graphs. Disc Appl Math. 2005, 145 (3): 479-482. 10.1016/j.dam.2004.06.008.
Woodruff D: Better approximation for the minimum common integer partition problem. Proc. RANDOM-APPROX'06, LNCS 4110. 2006, 248-259.
Zhao W, Zhang P, Jiang T: A network flow approach to the minimum common integer partition problem. Theoretical Computer Science. 2006, 369 (1-3): 456-462. 10.1016/j.tcs.2006.09.001.
Acknowledgements
This research is partially supported by NSF of China under projects 61070019 and 61202014, by the Doctoral Fund of the Chinese Ministry of Education under grant 20090131110009, and by the China Postdoctoral Science Foundation under grants 2011M501133 and 2012T50614. GL and WT are supported by NSERC of Canada.
Declarations
Publication charges for this work were funded by NSF of China grant 61202014.
This article has been published as part of BMC Bioinformatics Volume 16 Supplement 5, 2015: Selected articles from the 10th International Symposium on Bioinformatics Research and Applications (ISBRA14): Bioinformatics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/16/S5.
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
BZ conceived the study. All authors contributed to the algorithm design and analysis, and read and approved the final manuscript.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.
The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
To view a copy of this licence, visit https://creativecommons.org/licenses/by/4.0/.
The Creative Commons Public Domain Dedication waiver (https://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Jiang, H., Lin, G., Tong, W. et al. Isomorphism and similarity for 2-generation pedigrees. BMC Bioinformatics 16 (Suppl 5), S7 (2015). https://doi.org/10.1186/1471-2105-16-S5-S7