Computing the family-free DCJ similarity

Background: The genomic similarity is a large-scale measure for comparing two given genomes. In this work we study the (NP-hard) problem of computing the genomic similarity under the DCJ model in a setting that does not assume that the genes of the compared genomes are grouped into gene families. This problem is called family-free DCJ similarity.

Results: We propose an exact ILP algorithm to solve the family-free DCJ similarity problem, then we show its APX-hardness and present four combinatorial heuristics, with computational experiments comparing their results to the ILP.

Conclusions: We show that the family-free DCJ similarity can be computed in reasonable time, although for larger genomes it is necessary to resort to heuristics. This provides a basis for further studies on the applicability and model refinement of family-free whole-genome similarity measures.

Electronic supplementary material: The online version of this article (10.1186/s12859-018-2130-5) contains supplementary material, which is available to authorized users.


Diego P. Rubert, Edna A. Hoshino, Marília D. V. Braga, Jens Stoye and Fábio V. Martinez

Additional file 1

This additional file contains the APX-hardness proof of the problem ffdcj-similarity. We first give some definitions based on [1], restricting ourselves to maximization problems and feasible solutions.
Given an instance x of an optimization problem P and a solution y of x, val(x, y) denotes the value of y, which is a positive integer measure of y. The function val, also referred to as the objective function, must be computable in polynomial time. The value of an optimal solution (which maximizes the objective function) is denoted by opt(x). Thus, the performance ratio of y with respect to x is defined as

    R(x, y) = opt(x) / val(x, y).

Given two optimization problems P and P′, let f be a polynomial-time computable function that maps an instance x of P into an instance f(x) of P′, and let g be a polynomial-time computable function that maps a solution y for the instance f(x) of P′ into a solution g(x, y) of x. A reduction is the pair (f, g). A reduction from P to P′ is frequently denoted by P ≤ P′, and we say that P is reduced to P′. A reduction P ≤ P′ preserves membership in a class C if P′ ∈ C implies P ∈ C. An approximation-preserving reduction preserves membership in either APX, PTAS, or both classes. The strict reduction, which is the simplest type of approximation-preserving reduction, preserves membership in both APX and PTAS and must satisfy the following condition, for every instance x of P and every solution y of f(x):

    R_P(x, g(x, y)) ≤ R_P′(f(x), y).

We consider the following optimization problem, to be used within the proof of Theorem 1 below:

Problem max-2sat3(φ): Given a 2-cnf formula (i.e., with at most 2 literals per clause) φ = {C_1, . . . , C_m} with n variables X = {x_1, . . . , x_n}, where each variable appears in at most 3 clauses, find an assignment that satisfies the largest number of clauses.
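To make the objective of max-2sat3 concrete, here is a minimal sketch (the representation is assumed for illustration, not taken from the paper): clauses are tuples of non-zero signed integers in DIMACS style, and an assignment maps each variable index to a Boolean.

```python
# Illustrative sketch: evaluating val(phi, y) for max-2sat3, i.e., the
# number of clauses of phi satisfied by the assignment y.
# Literal k stands for variable x_k; literal -k stands for its negation.

def satisfied_clauses(clauses, assignment):
    """Count clauses satisfied by `assignment` (dict: variable index -> bool)."""
    count = 0
    for clause in clauses:
        # A clause is satisfied if at least one of its literals is true.
        if any(assignment[abs(lit)] == (lit > 0) for lit in clause):
            count += 1
    return count

# Example 2sat3 formula: every clause has at most 2 literals and every
# variable occurs in at most 3 clauses (x_2 occurs exactly 3 times).
phi = [(1, 2), (-1, 3), (2, -3), (-2,)]
y = {1: True, 2: False, 3: True}
print(satisfied_clauses(phi, y))  # 3 of the 4 clauses are satisfied
```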
The formula φ as defined above is called a 2sat3 formula. max-2sat3 [2,3] is a special case of max-2satB (also known as B-occ-max-2sat), where each variable occurs in at most B clauses for some B, which in turn is a restricted version of max-2sat [4].

Theorem 1. ffdcj-similarity is APX-hard and cannot be approximated with approximation ratio better than 22/21 = 1.0476 . . ., unless P = NP.

Proof (Theorem 1, first part). We give a strict reduction (f, g) from max-2sat3 to ffdcj-similarity, showing that R_max-2sat3(φ, g(f(φ), γ)) ≤ R_ffdcj-similarity(f(φ), γ) for any instance φ of max-2sat3 and any solution γ of ffdcj-similarity with instance f(φ). Since variables occurring only once allow their clauses (and possibly others) to be trivially satisfied, we consider only clauses that are not trivially satisfied in their instance. The same holds for clauses containing both literals x_i and x̄_i, for some variable x_i.
(Function f.) We show progressively how to build GS_σ(A, B) and define genes and their sequences in chromosomes of A and B. For each variable x_i occurring three times, let Cx_i^1, Cx_i^2 and Cx_i^3 be aliases for the clauses where x_i occurs (notice that a clause composed of two literals has two aliases). We define a variable component C_i by adding vertices (genes) corresponding to these aliases, where each cycle C has normalized weight w(C) = (Σ_{e ∈ C} w(e)) / |C|. In this transformation, each cycle C is such that w(C) = 0, 0.5 or 1. A cycle C such that w(C) > 0 is a helpful cycle and represents a clause satisfied by one or two literals (w(C) = 0.5 or w(C) = 1, respectively). See an example in Fig. 2.
In this scenario, however, a solution of ffdcj-similarity with performance ratio r could lead to a solution of max-2sat3 with ratio 2r, since the total normalized weight of two cycles C_1 and C_2 with w(C_1) = w(C_2) = 0.5 (two clauses satisfied by one literal each) is the same as that of one cycle C with w(C) = 1.0 (one clause satisfied by two literals). Therefore, achieving the desired ratio requires some modifications in f. It is not possible to make these two types of cycles have the same weight, but it suffices to get close enough.
We introduce special genes called extenders into the genomes. For some even p, for each edge ex_i^j = (Cx_i^j, x_i^j) of weight 1 in GS_σ(A, B) we introduce p extenders α_1, . . . , α_p into A (as a consequence, they are also introduced into GS_σ(A, B)) and p extenders α_{p+1}, . . . , α_{2p} into B (each ex_i^j of weight 1 has its own set of extenders). Edge ex_i^j is replaced by edges (Cx_i^j, α_1) with weight 1 (which we consider equivalent to ex_i^j) and (α_{p+1}, x_i^j) with weight 0, and edges (α_k, α_{p+k}) with weight 0 are added to GS_σ(A, B) for each 1 ≤ k ≤ p (extenders α_1 and α_{p+1} are now part of the variable component C_i). New chromosomes containing the extenders are added to genomes A and B accordingly; the same occurs for the path from x_i^{j_h} to Cx_i^{j_h} (see Fig. 3). Now, cycles in AG_σ(A, B) induced by edges of weight 0 in GS_σ(A, B) have normalized weight 0, cycles previously with normalized weight 1 are extended and have normalized weight 1/(1+p), and cycles previously with normalized weight 0.5 are extended and have normalized weight 1/(2+p). Notice that, for a sufficiently large p, 1/(1+p) is quite close to 1/(2+p), hence the problem of finding the maximum similarity in this graph is very similar to that of finding the maximum number of helpful cycles.
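As a quick numeric sanity check (illustrative, not part of the proof), the sketch below compares the two helpful-cycle weights as p grows: at p = 0 they are the original weights 1 and 0.5 with ratio 2, and the ratio (2+p)/(1+p) tends to 1.

```python
# Numeric illustration: after adding p extenders, a cycle representing a
# clause satisfied by two literals weighs 1/(1+p), and a cycle for a clause
# satisfied by one literal weighs 1/(2+p). Their ratio tends to 1 as p grows,
# so maximizing total weight becomes essentially maximizing the number of
# helpful cycles.

def cycle_weights(p):
    two_lit = 1 / (1 + p)   # clause satisfied by two literals
    one_lit = 1 / (2 + p)   # clause satisfied by one literal
    return two_lit, one_lit

for p in (0, 2, 10, 100, 1000):
    two_lit, one_lit = cycle_weights(p)
    print(f"p={p:5d}  1/(1+p)={two_lit:.6f}  1/(2+p)={one_lit:.6f}  "
          f"ratio={two_lit / one_lit:.4f}")
```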
(Function g.) By the structure of variable components in GS_σ(A, B), and since solutions of ffdcj-similarity are restricted to maximal matchings only, any solution γ for f(φ) is a matching that covers, for each variable component, only edges ex_i^j or ex̄_i^j. If the edges ex_i^j (resp. ex̄_i^j) are in the solution, then the variable x_i is assigned true (resp. false), inducing in polynomial time an assignment for each x_i ∈ X and therefore a solution g(f(φ), γ) to max-2sat3. A clause is satisfied if the vertices (or the only vertex) corresponding to its aliases are in a helpful cycle.
(Approximation ratio.) Given f(φ) and a feasible solution γ of ffdcj-similarity with the maximum number of helpful cycles, denote by c′ the number of helpful cycles in γ. Notice that c′ is also the maximum number of satisfied clauses of max-2sat3, that is, the value of an optimal solution for max-2sat3 for any instance φ, denoted here by opt_2sat3(φ). Thus, c′ = opt_2sat3(φ). To achieve the desired ratio we must establish some properties and relations between the parameters of max-2sat3 and ffdcj-similarity, and set some parameters to specific values.
If opt_sim(f(φ)) denotes the value of an optimal solution for ffdcj-similarity with instance f(φ) and c* denotes the number of helpful cycles in an optimal solution of ffdcj-similarity, then, since each helpful cycle has normalized weight between ω = 1/(2+p) and ω + ε = 1/(1+p), we have immediately that

    c*ω ≤ opt_sim(f(φ)) ≤ c*(ω + ε).
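This sandwich bound can be sanity-checked numerically. The sketch below assumes, as in the construction, that every helpful cycle weighs either 1/(1+p) or 1/(2+p), and takes ω = 1/(2+p) and ε = 1/(1+p) − 1/(2+p); these concrete values are assumptions for illustration only.

```python
# Sanity check (illustrative): a solution whose c helpful cycles each weigh
# either 1/(2+p) = omega or 1/(1+p) = omega + eps has total weight between
# c * omega and c * (omega + eps).
import random

p = 10
omega = 1 / (2 + p)
eps = 1 / (1 + p) - omega

random.seed(42)
for _ in range(1000):
    # Random multiset of helpful-cycle weights.
    weights = [random.choice((1 / (1 + p), 1 / (2 + p)))
               for _ in range(random.randint(1, 20))]
    c = len(weights)
    total = sum(weights)
    assert c * omega <= total <= c * (omega + eps) + 1e-12
print("bound holds on all sampled solutions")
```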
Thus, we have a chain of inequalities in which (7) comes from (3) and (8) is valid due to (5). Now, let c_r be the number of helpful cycles given by an approximate solution for ffdcj-similarity with approximation ratio r. Then a lower bound on c_r follows, where the last inequality is given by Proposition 3 below. This concludes the first part of the proof.

Proposition 2. c′ = c*.

Proof. Since c′ is the greatest number of helpful cycles possible, it is immediate that c* ≤ c′.
Let us now show that c* ≥ c′. Suppose for a moment that c* < c′. Since c* and c′ are integers, this implies that c* + 1 ≤ c′, i.e.,

    c* ≤ c′ − 1. (10)

Let C′ be the set of cycles with c′ helpful cycles, i.e., with the maximum number of helpful cycles possible. Let w(C′) := Σ_{C ∈ C′} w(C) = Σ_{C ∈ C′} (Σ_{e ∈ C} w(e)) / |C|. Then w(C′) > c*(ω + ε) ≥ opt_sim(f(φ)), where (11) follows from (10), (12) comes from (9), and (13) is valid due to (6). It means that w(C′) > opt_sim(f(φ)), which is a contradiction. Therefore, c′ = c*.

Proposition 3. Let c_r be the number of helpful cycles given by an approximate solution for ffdcj-similarity with approximation ratio r, and let c′ be the same as defined in Proposition 2. Then c_r ≥ ⌈c′/r⌉.

Proof. Given an instance f(φ) of ffdcj-similarity, let γ_r be an approximate solution of f(φ) with performance ratio r, i.e., val(f(φ), γ_r) ≥ opt_sim(f(φ))/r. Let c_r be the number of helpful cycles of γ_r. Then a chain of inequalities (14)–(16) follows, where (14) follows from (3) and (15) is valid from (6) and Proposition 2. Then, from (16) we know that c_r > c′/r − 1 and, since c_r is an integer, the result follows.

We now continue with the proof of Theorem 1.

Proof (Theorem 1, second part). First, notice that if a problem is APX-hard, the existence of a PTAS for it implies P = NP. Since a strict reduction preserves membership in the class PTAS, finding a PTAS for ffdcj-similarity would imply a PTAS for every APX-hard problem and hence P = NP. Moreover, ffdcj-similarity cannot be approximated with ratio better than 2012/2011 = 1.0005 . . ., unless P = NP. This follows immediately from the reduction in Theorem 1 with R_max-2sat3 = R_ffdcj-similarity and the fact that max-2sat3 is shown in [2] to be NP-hard to approximate within a factor of 2012/2011 − ε for any ε > 0.
However, our result is slightly stronger. Notice particularly that the reduction max-2sat3 ≤ ffdcj-similarity from the first part of the proof can be trivially extended to max-2sat ≤ ffdcj-similarity by extending variable components to arbitrary sizes. This increases the lower bound to 22/21 = 1.0476 . . . [5].
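Finally, a small numeric illustration (not part of the proof) of the parameter choice implicit in the contradiction argument of Proposition 2, under the assumed helpful-cycle weights ω = 1/(2+p) and ω + ε = 1/(1+p): the inequality (c* + 1)ω > c*(ω + ε) reduces to 1 + p > c*, so choosing p at least as large as the number of clauses suffices.

```python
# Illustrative check (with assumed omega and eps): the contradiction in
# Proposition 2 needs (c* + 1) * omega > c* * (omega + eps), which is
# equivalent to omega > c* * eps. With omega = 1/(2+p) and
# eps = 1/((1+p)(2+p)), this reduces to 1 + p > c*.

def contradiction_holds(c_star, p):
    omega = 1 / (2 + p)
    eps = 1 / (1 + p) - omega
    return (c_star + 1) * omega > c_star * (omega + eps)

assert contradiction_holds(c_star=50, p=50)       # p >= c*: inequality holds
assert not contradiction_holds(c_star=50, p=10)   # p too small: it fails
print("parameter-choice check passed")
```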