On the consistency of orthology relationships

Jones, Mark; Paul, Christophe; Scornavacca, Céline

doi:10.1186/s12859-016-1267-3

Volume 17 Supplement 14

Proceedings of the 14th Annual Research in Computational Molecular Biology (RECOMB) Comparative Genomics Satellite Workshop: bioinformatics

Research
Open access
Published: 11 November 2016


BMC Bioinformatics volume 17, Article number: 416 (2016)


Abstract

Background

Ortholog inference is the starting point of most comparative genomics studies, and a plethora of methods have been designed in the last decade to address this challenging task. In this paper we focus on the problems of deciding consistency with a species tree (known or not) of a partial set of orthology/paralogy relationships \(\mathcal {C}\) on a collection of n genes.

Results

We give the first polynomial algorithm (more precisely, an O(n³) time algorithm) to decide whether \(\mathcal {C}\) is consistent, even when the species tree is unknown. We also investigate a biologically meaningful optimization version of these problems, in which we wish to minimize the number of duplication events; unfortunately, we show that all these optimization problems are NP-hard and are unlikely to have good polynomial time approximation algorithms.

Conclusions

Our polynomial algorithm for checking consistency has been implemented in Python and is available at https://github.com/UdeM-LBIT/OrthoPara-ConstraintChecker.

Background

Two genes from two different species are said to be orthologous if they derive from a single gene present in the last common ancestor of the two species via a speciation event, and paralogous if they were created by a duplication event [1]. Ortholog inference is the starting point of most comparative genomics studies, and is also a key instrument for functional annotation of new genomes. A plethora of methods have been designed in the last decade to address this challenging task, and they can be roughly divided into two groups [2]. The first group of methods uses clustering algorithms to detect homologous genes, i.e., genes sharing a common ancestry, and then reconstructs a gene tree describing the evolutionary history of this set of genes; orthology relationships are then deduced from this tree by comparing it with the species tree, i.e., the tree depicting the history of the species containing those genes, via reconciliation algorithms (see [3], among others, and [4] for a review of reconciliation algorithms). The second group of methods uses other sources of information, e.g. sequence similarity or synteny, to directly estimate orthology relationships [5, among others]. The first group of methods is considered to be more accurate, but these methods require prior knowledge of the species tree and are very dependent on the accuracy of the gene trees. Unfortunately, the species phylogeny is not always known, and gene trees can be highly inaccurate as a result of several kinds of reconstruction artifacts, e.g. long-branch attraction (LBA) [6].

The second group of methods does not suffer from these drawbacks but still has an important weakness: given a set of genes V, the set of inferred orthology/paralogy relationships \(\mathcal {C}\) for V may fail to be satisfiable, i.e., to simultaneously co-exist in any evolutionary history for V, or to be consistent, i.e., such that all displayed triplet phylogenies are included in a species tree (formal definitions are given in the next section).

In recent years, the decision problems associated with these questions have been extensively studied, both when \(\mathcal {C}\) is full, i.e., involves a constraint for each pair of genes in V [7, 8], and when it is not [9].

In [9], the authors give O(n³) time algorithms to decide whether \(\mathcal {C}\) is satisfiable and consistent under the assumption that the species tree is known, where n=|V|. These results hold whether \(\mathcal {C}\) is a full set of constraints or not. They also show how to decide whether \(\mathcal {C}\) is satisfiable when the species tree is unknown but \(\mathcal {C}\) is full (this problem was also considered in [10]).

In this paper, we extend the results of [9] by giving an O(n³) time algorithm to decide whether \(\mathcal {C}\) is consistent, even when the species tree is not known and \(\mathcal {C}\) is not full, and show an application on real data. Thus the problems of deciding satisfiability, deciding consistency given a species tree, and deciding consistency with an unknown species tree, are all polynomial-time solvable. We also investigate an optimization version of these problems, in which we wish to minimize the number of duplication events in the evolutionary history for V; duplication minimization is a well-known criterion in phylogenomics [11]. Unfortunately, we show that all three optimization problems are NP-hard, even when the maximum number of duplication events is 2, and are unlikely to have good polynomial-time approximation algorithms.

Preliminaries

A rooted tree T with arc set E(T) and node set V(T) is a directed acyclic connected graph, in which every node has in-degree 1, except for a single node, the root – denoted by ROOT(T), of in-degree 0, and where the set of nodes in T with out-degree 0 – the leaves of T, denoted by L(T) – are univocally labeled. Throughout the paper, we will treat leaves in a tree as synonymous with the labels associated to them. We denote by I(T) the set V(T)∖L(T) – the internal nodes of T. If all nodes in I(T) have out-degree 2, we say that T is binary.

Given two nodes x,y in T, we say that x is an ancestor of y in T, and that y is a descendant of x in T, if there is a directed path from x to y in T. (Note that any node x is an ancestor and descendant of itself.) If x is not an ancestor of y and y is not an ancestor of x, we say that x,y are separated in T. If there is an arc from x to y in T, we say that x is the parent of y in T and that y is a child of x in T.

Given a node x, let DESC_T(x) denote the descendants of x in T. Let CHILD_T(x) denote the set of all children of x in T. Let LEAF_T(x)=DESC_T(x)∩L(T), i.e. LEAF_T(x) is the set of leaves in T that are descendants of x. Note that LEAF_T(ROOT(T))=L(T). Given a set A of nodes in T, let LCA_T(A) denote the least common ancestor of A in T, that is, the unique node z such that z is an ancestor of all x∈A, and no descendant of z other than z itself has this property. Given two nodes x,y, we will often write LCA_T(x,y) as shorthand for LCA_T({x,y}). When T is clear from context, we will often omit “in T” and simply say that x is the ancestor of y, y is the descendant of x, z is a leaf, etc.

Suppressing a non-root node x of out-degree 1 in a tree T consists of removing x and making the unique child of x a new child of the parent of x. Given a set of leaves L′⊆L(T), the restriction of T to L′, denoted \(T|_{L^{\prime }}\), is the tree derived from T by taking the minimum subtree of T spanning L′, and suppressing all non-root nodes of out-degree 1.

A triplet is a rooted binary tree T with |L(T)|=3. Given three distinct elements x,y,z, we denote by xy|z the unique triplet T with L(T)={x,y,z} such that LCA_T(x,y)≠ROOT(T) (or equivalently, LCA_T(x,y)≠LCA_T(x,z)=LCA_T(y,z)). We say that a rooted tree T displays the triplet xy|z if T|_{x,y,z}=xy|z.

Given a set of edges E over a set of vertices V, and a subset V′⊆V, we define E[V′]={xy : x,y∈V′, xy∈E}. Given graphs G=(V,E) and G′=(V′,E′), we say that G′ is an induced subgraph of G if V′⊆V and E′=E[V′], and denote G′ by G[V′]. We define \(\overline {E} = \{xy: x,y \in V, xy \notin E\}\) and say that \(\overline {G}=(V,\overline {E})\) is the complement of G. For any integer l≥1, a path P_l is a graph (V={v₁,…,v_l}, E={v_i v_{i+1} : 1≤i≤l−1}). We note here that if a graph contains an induced P₄, then its complement contains an induced P₄ on the same four vertices.

Species trees and DS-trees. Let Σ denote a set of species. A species tree S on Σ is a binary rooted tree such that L(S)=Σ, used to depict the evolutionary history of the species in Σ.

Genes are said to be homologous if they share a common ancestor. Let V denote a set of homologous genes belonging to species in Σ. A species assignment of V is a function s:V→Σ, with s(v)=a representing the fact that gene v belongs to species a∈Σ. For a set V′⊆V, we define s(V′)={a∈Σ : ∃x∈V′, s(x)=a}, and \(s_{|V^{\prime }}: V^{\prime } \rightarrow s(V^{\prime })\) such that \(s_{|V^{\prime }}(v)=s(v)\) for all v∈V′. A DS-tree on V is a pair (T,ℓ), where T is a binary rooted tree with leaf set V and ℓ:I(T)→{Dup,Spec} is a function labeling each internal node x of T as a speciation node (if ℓ(x)=Spec) or a duplication node (if ℓ(x)=Dup). DS-trees are used to depict the evolutionary history of the genes in V. When the function ℓ is clear from context, we will often omit it and speak only of a DS-tree T.

Given two genes x,y in T, we say that x,y are orthologs with respect to T if LCA_T(x,y) is a speciation node, and paralogs with respect to T otherwise. Given an undirected graph G=(V,E), a DS-tree (T,ℓ) on V is a DS-tree for G (or G is an orthology graph for T) if for every x,y∈V, xy∈E⇔ℓ(LCA_T(x,y))=Spec. That is, x and y are adjacent in G if and only if they are orthologs with respect to T.

The presence of two homologous genes in the same species can be caused either by duplications or gene transfers [12]. So, in absence of gene transfers, homologous genes from the same species are necessarily paralogs. We formalize this idea in the following assumption.

Assumption 1

We assume in what follows that whenever we are given a graph G=(V,E) with a species assignment s, two vertices x,y of G are not adjacent if s(x)=s(y).

Cographs. A cograph is a graph that can be generated from a single-vertex graph using the operations of disjoint union (taking the disjoint union of multiple graphs) and series composition (adding all possible edges between vertices of multiple graphs) [13]. This generation scheme yields a representation of a cograph in terms of cotrees. A cotree is a rooted tree T, with internal nodes labeled 0 (representing the disjoint union operation) or 1 (representing the series composition). Hence a cotree represents a graph G=(V,E) if L(T)=V and two vertices x and y of G are adjacent if and only if LCA_T(x,y) is labeled 1. Observe that the cotree representation of a cograph is not unique. Also, while a cotree is not necessarily binary, any non-binary cotree can be transformed in linear time into a binary cotree with the same corresponding cograph. There are several characterizations of cographs. Among others, a cograph is a graph with no induced P₄ [13]. Cographs can also be viewed as graphs in which every connected induced subgraph has diameter at most 2.

Hellmuth et al. [8] noted that all orthology graphs (i.e. graphs for which there exists a DS-tree) can be characterized as symbolic ultrametrics [14], and showed that a graph is an orthology graph if and only if it is a cograph [8, Corollary 4].

Thus we have a useful graph-theoretic framework for deciding on the existence of a DS-tree.

Proposition 1

For an undirected graph G=(V,E), the following are equivalent:

1. There exists a DS-tree for G;
2. G contains no induced P₄, i.e. it is P₄-free;
3. G is a cograph.

As cographs can be recognized in linear time [15, 16], deciding whether a graph has a DS-tree, i.e., if it is satisfiable, can be achieved within the same time complexity. Note, however, that not every DS-tree represents a possible evolutionary history for a set of genes. In particular, given a species assignment, different parts of a DS-tree may imply conflicting evolutionary histories for the species containing those genes. The concept of consistency makes this notion precise.
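As a rough illustration of Proposition 1, the following Python sketch builds a DS-tree by recursive decomposition: a disconnected graph gets a duplication root, a graph with a disconnected complement gets a speciation root, and a graph on at least two vertices for which both are connected must contain an induced P₄. This is a simple polynomial recursion, not the linear-time recognition algorithms of [15, 16]; the nested-tuple tree encoding and the names `components` and `ds_tree` are assumptions made for this example only.

```python
def components(vertices, adj):
    """Connected components of the graph induced on `vertices` (adj: vertex -> set)."""
    vertices = set(vertices)
    seen, comps = set(), []
    for v in vertices:
        if v in seen:
            continue
        stack, comp = [v], {v}
        while stack:
            u = stack.pop()
            for w in adj[u] & vertices:
                if w not in comp:
                    comp.add(w)
                    stack.append(w)
        seen |= comp
        comps.append(comp)
    return comps


def ds_tree(vertices, edges):
    """Return a DS-tree as nested tuples (label, left, right), or None if an induced P4 exists."""
    vertices = set(vertices)
    adj = {v: set() for v in vertices}
    for x, y in edges:
        adj[x].add(y)
        adj[y].add(x)
    # Adjacency of the complement graph.
    co_adj = {v: vertices - adj[v] - {v} for v in vertices}

    def build(vs):
        if len(vs) == 1:
            return next(iter(vs))  # a leaf is the gene itself
        # G[vs] disconnected -> duplication root; complement disconnected -> speciation root.
        for label, a in (("Dup", adj), ("Spec", co_adj)):
            comps = components(vs, a)
            if len(comps) > 1:
                left = build(comps[0])
                right = build(set().union(*comps[1:]))
                if left is None or right is None:
                    return None
                return (label, left, right)
        return None  # G[vs] and its complement are both connected

    return build(vertices)
```

For instance, a path on four genes yields `None`, whereas a path on three genes yields a tree whose root is labeled `"Spec"`.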

Consistent DS-trees. Given a DS-tree T on V, a species assignment s:V→Σ and a species tree S on Σ, we say that (T,s) is consistent with S (or S-consistent) if for every speciation node z in T, and distinct children x,y of z, LCA_S(s(LEAF_T(x))) and LCA_S(s(LEAF_T(y))) are separated in S. Given a graph G=(V,E) and the species assignment s, the pair (G,s) is consistent with S if there exists a DS-tree T for G such that (T,s) is consistent with S. We say that G (resp. T) along with the species assignment s, is consistent if there exists a species tree S such that (G,s) (resp. (T,s)) is consistent with S [9].

Given a DS-tree T on V and a species assignment s:V→Σ, let tr(T,s) be the set of triplets s(x)s(y)|s(z) for which the triplet xy|z is displayed by T with a speciation node as the root, and for which s(x)≠s(y).

Hernandez-Rosales et al. [7] showed that (T,s) is consistent with a species tree S if and only if S displays all triplets in tr(T,s). In light of this result, Hellmuth et al. [10] gave a framework for finding the DS-tree and species tree for which the maximum number of triplets are displayed, using Integer Linear Programming. Lafond and El-Mabrouk [9] improved the result of [7] by showing that it is enough to consider only the triplets in tr(T,s) that have a speciation node as the root node and a duplication node as the other internal node. This can be expressed in terms of the consistency of an orthology graph in the following way.

Given a graph G=(V,E) and species assignment s:V→Σ, define the set of triplets P₃(G,s)={s(x)s(y)|s(z) : xz,zy∈E and xy∉E and s(x)≠s(y)}. Note that as a consequence of Assumption 1, if s(x)s(y)|s(z)∈P₃(G,s), then s(z)≠s(y) and s(z)≠s(x).
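The definition of P₃(G,s) translates almost literally into code. The sketch below enumerates all ordered triples of genes by brute force; representing a triplet ab|c as the pair `(frozenset({a, b}), c)` and the function name `p3_triplets` are choices made for this illustration, not notation from the paper.

```python
from itertools import permutations


def p3_triplets(vertices, edges, s):
    """Species triplets induced by paths x - z - y with xy a non-edge and s(x) != s(y)."""
    E = {frozenset(e) for e in edges}
    triplets = set()
    for x, y, z in permutations(vertices, 3):
        if (frozenset((x, z)) in E and frozenset((z, y)) in E
                and frozenset((x, y)) not in E and s[x] != s[y]):
            # Encode the triplet s(x)s(y)|s(z) as (unordered pair, third species).
            triplets.add((frozenset((s[x], s[y])), s[z]))
    return triplets
```

On a single path x - z - y whose genes belong to species A, C and B respectively, this returns the one triplet AB|C.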

By Theorem 5 in [9], we have the following theorem (in fact, Theorem 5 in [9] only states that (G,s) is consistent if and only if there exists a species tree S which displays all triplets in P₃(G,s), but their proof shows that (G,s) is indeed consistent with such an S):

Theorem 1

[9] Let G=(V,E) have a DS-tree and let s:V→Σ be a species assignment. Let S be a species tree on Σ. Then (G,s) is consistent with S if and only if S displays all triplets in P₃(G,s).

Theorem 1 directly provides a polynomial time algorithm to decide whether a graph and a species assignment are consistent with a given species tree. The following proposition reformulates Theorem 1 in a convenient way:

Proposition 2

Given a graph G=(V,E), a species assignment s:V→Σ, and a species tree S,(G,s) is consistent with S if and only if the following holds:

1. G does not contain an induced P₄;
2. Every triplet in P₃(G,s) is displayed by S.
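The second condition of Proposition 2 can be checked with a small helper: a rooted tree displays ab|c exactly when some node has a and b, but not c, among its leaf descendants. The sketch below assumes species trees are given as nested tuples of species names (strings) and triplets as pairs `(frozenset({a, b}), c)`; both encodings and all function names are hypothetical. It covers only the triplet condition, not the P₄-freeness test.

```python
def clusters(tree):
    """Leaf sets of all nodes of a nested-tuple tree; the last entry is the tree's own leaf set."""
    if isinstance(tree, str):
        return [frozenset([tree])]
    out, leaves = [], set()
    for child in tree:
        sub = clusters(child)
        out.extend(sub)
        leaves |= sub[-1]  # leaf set of the child itself
    out.append(frozenset(leaves))
    return out


def displays(tree, triplet):
    """True if `tree` displays the triplet ab|c, given as (frozenset({a, b}), c)."""
    pair, c = triplet
    return any(pair <= cl and c not in cl for cl in clusters(tree))


def displays_all(tree, triplets):
    """Condition 2 of Proposition 2 for a given species tree and triplet set."""
    return all(displays(tree, t) for t in triplets)
```

With the species tree `(("A", "B"), "C")`, the triplet AB|C is displayed while AC|B is not.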

As both of the properties in Proposition 2 are hereditary, we also have:

Corollary 1

Given a graph G=(V,E), a species assignment s and a subset V′⊆V, if (G,s) is consistent with a species tree S then \((G[V^{\prime }],s_{|V^{\prime }})\) is consistent with the species tree \(S|_{s(V^{\prime })}\).

Constraint graphs. A constraint graph is a pair (G,s) where G=(V,M⊎U) is an edge-bicolored graph and s is a species assignment on V. A constraint graph aims at representing the partial knowledge about the orthology or paralogy relations between genes from V. The edges in M are mandatory edges, representing the pairs of genes xy for which we know that x and y are orthologs. The non-edges of G (i.e. the set of unordered pairs xy for which xy∉M⊎U) represent the pairs of genes xy for which we know that x and y are paralogs. The edges in U are unknown edges, for which we do not know if x and y are orthologs or paralogs. From Assumption 1, we have that xy∉M⊎U for any pair of genes x, y such that s(x)=s(y) (in absence of gene transfers, homologous genes from the same species are necessarily paralogs). Note that an orthology graph is a constraint graph where U=∅. A sandwich of a constraint graph (G,s), with G=(V,M⊎U), is a graph H=(V,E) such that M⊆E⊆M∪U.
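The sandwich condition M⊆E⊆M∪U is a pair of set inclusions, which a few lines of Python make concrete. The function name `is_sandwich` and the edge encoding (any iterable of 2-element pairs) are assumptions of this sketch; the check of Assumption 1 is included only as a sanity test on the input.

```python
def is_sandwich(E, M, U, s):
    """True if the edge set E satisfies M <= E <= M | U for the constraint graph (V, M, U)."""
    def norm(edges):
        return {frozenset(e) for e in edges}

    E, M, U = norm(E), norm(M), norm(U)
    # Assumption 1: no mandatory or unknown edge joins two genes of the same species.
    for e in M | U:
        if len({s[v] for v in e}) != 2:
            raise ValueError("edge between genes of the same species")
    return M <= E <= (M | U)
```

For example, with M={xz} and U={zy}, both {xz} and {xz,zy} are sandwiches, while {zy} and {xz,xy} are not.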

As a gene is always associated with the species it belongs to, throughout this paper we will always present a DS-tree T together with a species assignment s. Thus we will speak of a DS-tree (T,s). Similarly, we will always present an orthology graph G together with its species assignment s, and speak of an orthology graph (G,s). A sandwich graph G′ will be presented on its own without a species assignment, as a sandwich graph is defined relative to a constraint graph (G=(V,M⊎U),s), and so the species assignment s will always be clear from context.

Methods

Computing a consistent DS-tree

In this section, we describe a polynomial time algorithm for the following problem:

CONSISTENT ORTHOLOGY GRAPH SANDWICH problem

Input: a constraint graph (G,s), with G=(V,M⊎U) and s:V→Σ a species assignment;

Output: a sandwich graph H for (G,s) such that (H,s) is consistent (if any exists).

Observe that by Proposition 2, the CONSISTENT ORTHOLOGY GRAPH SANDWICH problem amounts to computing a sandwich cograph satisfying extra properties. The sandwich cograph problem is known to be polynomial time solvable [17]. Our algorithm can be seen as a combination of the sandwich cograph algorithm and the BUILD algorithm [18] for checking consistency of a set of triplets.

Let G=(V,M⊎U) be an edge-bicolored graph and for F⊆U, define the graph G(F)=(V,M∪F).

The first lemma proves that unknown edges between connected components of G(∅) can be removed (i.e. frozen as paralogy relations between genes).

Lemma 1

Let (G,s) be a constraint graph with G=(V,M⊎U). Let CC be the set of connected components of G(∅), and let \(U_{CC} = \bigcup _{C \in CC}U[C]\). There exists a consistent sandwich graph of (G,s) if and only if there exists a consistent sandwich graph of (G_CC=(V,M⊎U_CC),s).

Proof

Suppose first that there exists a consistent sandwich graph G′=(V,E′) of (G,s) and let S be a species tree such that (G′,s) is S-consistent. For every C∈CC, by Corollary 1 (G′[C],s_{|C}) is consistent with S|_{s(C)} and hence with S. Then the disjoint union G′′ of the G′[C] is a sandwich cograph of (G_CC,s). Moreover we clearly have P₃(G′′,s)=∪_C P₃(G′[C],s_{|C}), implying that (G′′,s) is also consistent with S. The converse is symmetric. □

Reduction Rule 1

Let (G,s) be a constraint graph with G=(V,M⊎U). Remove from U every edge xy such that x and y belong to distinct connected components of G(∅).

As an example, consider the constraint graph (G,s) in Fig. 1. The genes a₁,b₁,c₁,d₁ form one connected component of G(∅), and a₂,b₂,c₂,d₂ form the other. Thus Reduction Rule 1 will remove the unknown edge d₁a₂ from U.
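Reduction Rule 1 can be sketched with a union-find structure over the mandatory edges: an unknown edge survives only if both endpoints end up in the same component of G(∅). The function name `reduce_unknown_edges` and the edge-list encoding are assumptions of this example, which uses a small made-up instance rather than the graph of Fig. 1.

```python
def reduce_unknown_edges(vertices, M, U):
    """Keep only the unknown edges whose endpoints share a connected component of (V, M)."""
    parent = {v: v for v in vertices}

    def find(v):
        # Union-find lookup with path halving.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for x, y in M:
        parent[find(x)] = find(y)
    return {(x, y) for x, y in U if find(x) == find(y)}
```

With mandatory edges ab, bc and de, the unknown edge ac is kept while the unknown edge cd, which joins two different components, is removed.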

Note that although we remove all edges between connected components of G(∅), we cannot solve the problem on each connected component independently, and so we cannot assume that G(∅) is connected. The reason is that for two connected components C,D of G(∅), a solution for (G[C],s_{|C}) may be consistent with a different species tree than a solution for (G[D],s_{|D}). To avoid conflicts between solutions on different subgraphs, we must split the graph into subgraphs on disjoint sets of species.

From now on, we may assume that |s(V)|>1. Otherwise, Assumption 1 implies that M=U=∅, and thereby (G,s) is a trivial positive instance. For the sake of the algorithm, we define an auxiliary graph H_{G,s}=(Σ,F) on the species set, called hereafter the species graph. For each pair of distinct species a,b∈Σ, add ab to F if there exist x,y∈V such that x and y are in the same connected component of G(∅), x and y are not adjacent in G(U), and s(x)=a, s(y)=b.
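The construction of the species graph can be sketched as follows: compute the components of G(∅), then add a species edge for every pair of genes from different species that share a component and are joined by neither a mandatory nor an unknown edge. The name `species_graph` and the data encoding are assumptions of this illustration.

```python
from itertools import combinations


def species_graph(vertices, M, U, s):
    """Edge set F of the species graph, as a set of frozensets of two species."""
    parent = {v: v for v in vertices}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for x, y in M:  # components of G(∅) = (V, M)
        parent[find(x)] = find(y)
    allowed = {frozenset(e) for e in M} | {frozenset(e) for e in U}
    F = set()
    for x, y in combinations(vertices, 2):
        # Known paralogs from different species inside one component of G(∅).
        if find(x) == find(y) and s[x] != s[y] and frozenset((x, y)) not in allowed:
            F.add(frozenset((s[x], s[y])))
    return F
```

On the mandatory path x - z - y with species A, C and B, the only species edge is AB; making xy an unknown edge removes it.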

Lemma 2

Let (G,s) be a constraint graph reduced by Reduction Rule 1. If the species graph H_{G,s} is connected, then (G,s) does not have a consistent sandwich graph.

Proof

Consider an arbitrary binary species tree S, and an arbitrary sandwich graph G′=(V,E′) of (G,s). We show that P₃(G′,s) contains a triplet not displayed by S.

Let A=LEAF_S(u_A) and B=LEAF_S(u_B) where u_A and u_B are the children of ROOT(S). Note that A and B partition the set of species Σ. As H_{G,s} is connected, there exist a∈A, b∈B such that ab∈F. Therefore there exist x,y∈V such that x,y are in the same connected component C of G(∅), s(x)=a, s(y)=b and xy∉M∪U.

As G′[C] is connected, there exists a chordless path P from x to y in G′. By Proposition 2, G′ is P₄-free. This implies that P contains, in addition to x and y, a third vertex z such that xz∈E′ and zy∈E′.

Assume without loss of generality that s(z)∈A (the case s(z)∈B is symmetric). Then we have s(x)s(y)|s(z)∈P₃(G′,s). Note however that LCA_S(s(y),s(z))=ROOT(S) (as s(z)∈A, s(y)∈B), while LCA_S(s(x),s(z)) is a descendant of LCA_S(A). It follows that LCA_S(s(x),s(z)) is different from LCA_S(s(y),s(z)), and so s(x)s(y)|s(z) is not displayed by S. □

The next lemma shows how to use connected components of the species graph in order to freeze some unknown edges to orthology relations between genes.

Lemma 3

Let (G,s) be a constraint graph reduced by Reduction Rule 1 such that the species graph H_{G,s} is not connected.

Let A be the vertices of a connected component of the species graph H_{G,s} and let B=Σ∖A. Let G_A=(V_A,M[V_A]⊎U[V_A]) and G_B=(V_B,M[V_B]⊎U[V_B]), where V_A=s⁻¹(A) and V_B=s⁻¹(B). There exists a consistent sandwich graph of (G,s) if and only if there exist consistent sandwich graphs of \((G_{A},s_{|V_{A}})\) and of \((G_{B},s_{|V_{B}})\).

Proof

Let \(G^{\prime }_{A}\) and \(G^{\prime }_{B}\) be respectively consistent sandwich graphs of \((G_{A},s_{|V_{A}})\) and of \((G_{B},s_{|V_{B}})\). Suppose that \(G^{\prime }_{A}\) is consistent with the species tree S_A and \(G^{\prime }_{B}\) with S_B. For every connected component C of G(∅), let \(G^{\prime }_{C}\) be the series composition of \(G^{\prime }_{A}[C]\) and \(G^{\prime }_{B}[C]\) and let G′=(V,E′) be the disjoint union of all the \(G^{\prime }_{C}\). We now show that G′ is a consistent sandwich graph of (G,s).

As \(G_{A}^{\prime }\) and \(G_{B}^{\prime }\) are cographs, by construction G′ is a cograph too. Now, as \(G_{A}^{\prime }\) and \(G_{B}^{\prime }\) are respectively sandwich graphs of \((G_{A},s_{|V_{A}})\) and \((G_{B},s_{|V_{B}})\), and as there is no edge in M between different connected components of G(∅), we have that M⊆E′. By construction of H_{G,s} and the fact that H_{G,s} has no edges between A and B, for every connected component C of G(∅), if x∈V_A∩C and y∈V_B∩C, then xy∈M∪U. As \(G_{A}^{\prime }\) and \(G_{B}^{\prime }\) are respectively sandwich graphs of \((G_{A},s_{|V_{A}})\) and \((G_{B},s_{|V_{B}})\), this implies that E′⊆M∪U. It follows that G′ is a sandwich graph of G.

Now consider the species tree S obtained from S_A and S_B by adding a root whose children are ROOT(S_A) and ROOT(S_B). We claim that (G′,s) is consistent with S. Consider a triplet s(x)s(y)|s(z)∈P₃(G′,s). We distinguish two cases:

If {s(x),s(y),s(z)}⊆A (the case {s(x),s(y),s(z)}⊆B is symmetric), then s(x)s(y)|s(z)∈P₃(G′_A,s_{|V_A}) and is displayed by S_A and thereby by S as well.
Otherwise, as xz,yz∈E′, x and y are connected in G′ and so by construction of G′, we have that x,y∈C for some connected component C of G(∅). As xy∉E′, by construction of \(G_{C}^{\prime }\) either {s(x),s(y)}⊆A or {s(x),s(y)}⊆B. Suppose w.l.o.g. that the former holds, implying s(z)∈B. Observe then that s(x)s(y)|s(z) is displayed by S. Indeed, we have LCA_S(s(x),s(z))=LCA_S(s(y),s(z))=ROOT(S), and LCA_S(s(x),s(y)) is a descendant of ROOT(S_A).

The converse follows from Corollary 1. □

The correctness of the next branching rule follows from Lemmas 2 and 3.

Branching Rule 1

Let (G,s) be a constraint graph reduced by Reduction Rule 1 such that the species graph H_{G,s} is not connected. Let A be a connected component of the species graph H_{G,s} and let B=Σ∖A. Solve CONSISTENT ORTHOLOGY GRAPH SANDWICH on \((G_{A}, s_{|V_{A}})\) and \((G_{B}, s_{|V_{B}})\) where V_A=s⁻¹(A) and V_B=s⁻¹(B). If there exist \(G^{\prime }_{A}=(V_{A},E^{\prime }_{A})\) and \(G^{\prime }_{B}=(V_{B},E^{\prime }_{B})\) that are respectively consistent sandwich graphs of \((G_{A}, s_{|V_{A}})\) and \((G_{B}, s_{|V_{B}})\), then return \(G^{\prime }=(V,E^{\prime }_{A}\cup E^{\prime }_{B}\cup M^{\prime })\), where M′={xy∈M∪U : x∈V_A, y∈V_B}. Otherwise, return NULL.

Consider again the example of Fig. 1, after the unknown edge d₁a₂ has been removed by Reduction Rule 1. Because one connected component has non-edges a₁c₁, b₁d₁ and the other has non-edge b₂d₂, the edges in H_{G,s} will be AC and BD (see Fig. 2). Thus, Branching Rule 1 will split the constraint graph into two parts, one restricted to a₁,c₁,a₂,c₂, and one restricted to b₁,d₁,b₂,d₂.

We can now give the pseudocode of the algorithm, which essentially consists of alternately applying Reduction Rule 1 and Branching Rule 1.
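The listing of Algorithm 1 is not reproduced in this text. The Python sketch below is one possible reading of the description above (Reduction Rule 1, the species graph, Lemma 2 and Branching Rule 1), not the authors' implementation; the function names, the edge encoding and the choice of which species component to split off are all assumptions. It is written for clarity and makes no attempt to match the stated running time.

```python
from itertools import combinations


def _components(nodes, edges):
    """Map each node to a representative of its connected component (union-find)."""
    parent = {v: v for v in nodes}

    def find(v):
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    for x, y in edges:
        parent[find(x)] = find(y)
    return {v: find(v) for v in nodes}


def consistent_sandwich(V, M, U, s):
    """Edge set of a consistent sandwich graph of the constraint graph, or None."""
    V = set(V)
    M = {frozenset(e) for e in M}
    U = {frozenset(e) for e in U}
    species = {s[v] for v in V}
    if len(species) <= 1:
        return set()  # under Assumption 1, M and U are empty here
    comp = _components(V, M)
    # Reduction Rule 1: drop unknown edges between components of G(∅).
    U = {e for e in U if len({comp[v] for v in e}) == 1}
    allowed = M | U
    # Species graph: known paralogs from different species within one component.
    sp_edges = [(s[x], s[y]) for x, y in combinations(V, 2)
                if comp[x] == comp[y] and s[x] != s[y]
                and frozenset((x, y)) not in allowed]
    sp_comp = _components(species, sp_edges)
    if len(set(sp_comp.values())) == 1:
        return None  # Lemma 2: connected species graph, no solution
    # Branching Rule 1: split off one component A of the species graph.
    a = next(iter(species))
    VA = {v for v in V if sp_comp[s[v]] == sp_comp[a]}
    VB = V - VA
    EA = consistent_sandwich(VA, {e for e in M if e <= VA}, {e for e in U if e <= VA}, s)
    EB = consistent_sandwich(VB, {e for e in M if e <= VB}, {e for e in U if e <= VB}, s)
    if EA is None or EB is None:
        return None
    # All remaining mandatory and unknown edges between V_A and V_B become orthology edges.
    cross = {e for e in allowed if not e <= VA and not e <= VB}
    return EA | EB | cross
```

On the mandatory path x - z - y with three distinct species the sketch returns exactly the two mandatory edges, and on a mandatory path with four genes from four distinct species it returns `None`. When unknown edges are present, the returned sandwich may depend on which species component is split off first.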

Theorem 2

Given a constraint graph (G,s), the CONSISTENT ORTHOLOGY GRAPH SANDWICH problem can be solved in O(n³) time, where n is the number of genes in G.

Proof

The correctness of Algorithm 1 follows from the correctness of Reduction Rule 1 (Lemma 1) and Branching Rule 1 (Lemmas 2 and 3).

To analyze the running time of Algorithm 1, we simply observe that the recursive calls define a binary tree structure with O(|Σ|)=O(n) nodes. As each step of the recursion can clearly be performed in quadratic time, the claimed complexity follows. □

We can adapt the algorithm to cases when the species tree S is partially known, by adjusting the construction of H_{G,s}. In particular, for any x,y,z∈V for which it is known that S displays the triplet s(x)s(y)|s(z), we add s(x)s(y) as an edge in H_{G,s}.

Algorithm 1 has important applications. When the species tree is not known, it allows us to differentiate constraint graphs that are consistent with a species tree from those that are not; the latter cannot be depicted by a consistent DS-tree, and should be considered as phylogenetically irrelevant and discarded. When the species tree S is known and a given constraint graph C is not consistent with it, the sandwich graph returned by Algorithm 1 shows to what extent C and S are in contradiction. Furthermore if S contains some uncertainties, it allows us to see if the contradictions between C and S lie in the “uncertainty zone” of S. This may help to correct the species tree.

As an example of the last application, suppose that we have the species tree given in Fig. 3(a), but the relative position of species C in this tree is uncertain. Suppose in addition we are given the constraint graph (G,s) given in Fig. 1. The DS-tree in Fig. 3(c) is a DS-tree for (G,s), but is not consistent with the species tree in Fig. 3(a). However, it is consistent with the species tree in Fig. 3(b), which can be derived from Fig. 3(a) by moving species C.

See the “Results and Discussion” section for an example of application on real data.

Hardness of optimizing the duplication nodes

Given a constraint graph (G,s) for which there exist several possible DS-trees, we may be interested in finding one minimizing the number of duplication nodes. Duplication minimization is a well-known criterion in phylogenomics [4,11]; for example, it is used to resolve polytomies in gene trees in [19] and to estimate the species tree in [20].

In this section, we consider the following three optimization variants of the ORTHOLOGY GRAPH SANDWICH problem in which the number of duplication nodes has to be minimized. We prove hardness results for each of these problems.

k-DUPLICATION ORTHOLOGY GRAPH SANDWICH problem (k-DOGS)

Input: a constraint graph (G,s) and an integer k;

Question: does there exist a DS-tree (T,s) containing at most k duplication nodes, whose orthology graph is a sandwich of G?

The above problem is equivalent to asking if (G,s) is satisfiable and there exists a DS-tree for (G,s) containing at most k duplication nodes.

SPECIES TREE CONSISTENT k-DUPLICATION ORTHOLOGY GRAPH SANDWICH problem (S-CONS-k-DOGS)

Input: a constraint graph (G,s), with G=(V,M⊎U) and s:V→Σ a species assignment, a species tree S on Σ and an integer k;

Question: does there exist a DS-tree (T,s) containing at most k duplication nodes, whose orthology graph is a sandwich of G, and is consistent with S?

CONSISTENT k-DUPLICATION ORTHOLOGY GRAPH SANDWICH problem (CONS-k-DOGS)

Input: a constraint graph (G,s), with G=(V,M⊎U) and s:V→Σ a species assignment, and an integer k;

Question: does there exist a DS-tree (T,s) containing at most k duplication nodes and a species tree S, such that the orthology graph of (T,s) is a sandwich of G and is consistent with S?

We first provide a reduction from 3-COLORING that proves that k-DOGS is para-NP-hard [21] with respect to the number of duplication nodes k (that is, k-DOGS is NP-hard for some fixed k). This implies that, unless P=NP, k-DOGS does not belong to the complexity class XP, meaning that the problem cannot be solved in time O(n^{f(k)}) for any function f. In what follows, [k] denotes the set {1,…,k}.

k-COLORING Problem

Input: a (connected) graph G=(V,E);

Question: does there exist a k-coloring c:V→[k] such that for every x y∈E, c(x)≠c(y)?

The following lemma will be useful in this section. An equivalent version of this lemma could be written in terms of cographs, and we believe a proof for such a lemma should already exist in the literature. However, as we were unable to find such a proof, we give one here.

Lemma 4

Let (G,s) be an orthology graph with a DS-tree containing at most k duplication nodes. Then we can find a (k+1)-coloring of its complement \(\overline {G}\) in polynomial time.

Proof

Let (G=(V,E),s) be an orthology graph. We prove the claim by induction on |V|.

If |V|=1, then there are 0 duplication nodes in a DS-tree for (G,s), and \(\overline {G}\) has a 1-coloring, as required.

So now suppose the claim holds for all orthology graphs (G′=(V′,E′),s′) with |V′|<|V|. Let (T,ℓ) be a DS-tree for (G,s) with at most k duplication nodes. Consider ROOT(T). If ROOT(T) is a duplication node, then G is disconnected, and we can find a partition V=V_A⊎V_B such that there is no edge between V_A and V_B in G. Moreover, the number of duplication nodes in T is k_A+k_B+1, where k_A is the number of duplication nodes in a DS-tree for G[V_A], and k_B is the number of duplication nodes in a DS-tree for G[V_B]. By the inductive hypothesis, there exists a (k_A+1)-coloring for \(\overline {G[V_{A}]}\), and a (k_B+1)-coloring for \(\overline {G[V_{B}]}\). It is clear that we can combine these colorings into a coloring of \(\overline {G}\) using k_A+1+k_B+1≤k+1 colors.

If ROOT(T) is a speciation node, then \(\overline {G}\) is disconnected, and we can find a partition V=V_A⊎V_B such that there are no edges between V_A and V_B in \(\overline {G}\). Moreover, the number of duplication nodes in a DS-tree for G[V_A] (G[V_B], respectively) is at most k. By the inductive hypothesis, there exists a (k+1)-coloring for \(\overline {G[V_{A}]}\) and a (k+1)-coloring for \(\overline {G[V_{B}]}\), and these can be combined into a (k+1)-coloring for \(\overline {G}\).

This proof can be turned into a polynomial time algorithm as follows. If G is disconnected, find a partition V=V_A⊎V_B with no edges between V_A and V_B in G, and recursively find colorings for \(\overline {G[V_{A}]}\) and \(\overline {G[V_{B}]}\), adjusting the coloring on \(\overline {G[V_{B}]}\) to assign different values from those assigned by the coloring on \(\overline {G[V_{A}]}\).

Otherwise, find a partition V=V_A⊎V_B with no edges between V_A and V_B in \(\overline {G}\), and recursively find colorings for \(\overline {G[V_{A}]}\) and \(\overline {G[V_{B}]}\). As each recursion splits the set of vertices and each recursive step takes polynomial time, the whole algorithm takes polynomial time. □
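The recursive coloring procedure from the proof can be sketched in Python as follows. Instead of a two-part partition, this version splits along all connected components at once, which is a convenience of the sketch rather than the exact procedure in the proof; the function name `color_complement` and the integer color encoding are assumptions.

```python
def color_complement(vertices, edges):
    """Proper coloring (vertex -> int) of the complement of a P4-free graph G."""
    vertices = set(vertices)
    adj = {v: set() for v in vertices}
    for x, y in edges:
        adj[x].add(y)
        adj[y].add(x)

    def parts(vs, neighbours):
        """Connected components of vs under the given neighbour function."""
        left, out = set(vs), []
        while left:
            stack = [left.pop()]
            comp = set(stack)
            while stack:
                u = stack.pop()
                for w in neighbours(u, vs) - comp:
                    comp.add(w)
                    stack.append(w)
            left -= comp
            out.append(comp)
        return out

    def rec(vs):
        if len(vs) == 1:
            return {next(iter(vs)): 0}
        split = parts(vs, lambda u, vs: adj[u] & vs)
        if len(split) > 1:
            # G[vs] is disconnected: the parts are fully joined in the complement,
            # so each part needs its own palette.
            coloring, offset = {}, 0
            for p in split:
                sub = rec(p)
                for v, c in sub.items():
                    coloring[v] = c + offset
                offset += max(sub.values()) + 1
            return coloring
        split = parts(vs, lambda u, vs: vs - adj[u] - {u})
        if len(split) == 1:
            raise ValueError("graph has an induced P4, so it is not an orthology graph")
        # The complement is disconnected: colors can be reused across parts.
        coloring = {}
        for p in split:
            coloring.update(rec(p))
        return coloring

    return rec(vertices)
```

For a disjoint union of three cliques, the sketch assigns one color per clique, which is a proper 3-coloring of the complement.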

Lemma 5

Given a connected graph G=(V,E), define a constraint graph (H=(V,M⊎U),s) by setting M=∅ and \(U=\overline E\), and letting s:V→Σ be an arbitrary species assignment such that each gene in V is assigned to a different species. Then for any integer k>0, G is k-colorable if and only if (H,s) has a solution with at most k−1 duplication nodes. Furthermore, if such a solution exists then there exists a solution consistent with an arbitrary species tree on Σ.

Proof

Assume that G is k-colorable, with c:V→[k] a k-coloring of G. Let \(V_{i}=c^{-1}(i)\) for each i∈[k]. Thus \(V_{1},\ldots,V_{k}\) form a partition of V. For each i∈[k], let \((T_{i}, s_{|V_{i}})\) be an arbitrary DS-tree with leaves \(V_{i}\) such that every internal node is a speciation node, and let \(x_{i}\) denote the root of \(T_{i}\). We now construct a DS-tree (T,s) as follows. Let \(z_{1},\ldots,z_{k-1}\) be duplication nodes such that \(\text{ROOT}(T)=z_{1}\), such that for each i∈[k−2], \(z_{i}\) has child nodes \(x_{i}\) and \(z_{i+1}\), and the children of \(z_{k-1}\) are \(x_{k-1}\) and \(x_{k}\). Now consider the graph \(H^{\prime}=(V,E^{\prime})\) obtained from the disjoint union of cliques on \(V_{i}\) for 1≤i≤k. Observe that \(H^{\prime}\) is a sandwich graph of (H,s): each \(V_{i}\) is an independent set of G, so every edge of \(H^{\prime}\) lies in \(U=\overline{E}\). Moreover, by construction, we have that \(xy\in E^{\prime}\) if and only if \(\text{LCA}_{T}(x,y)\) is a speciation node. Finally, (T,s) has k−1 duplication nodes, so \(H^{\prime}\) is a solution. To conclude the proof of the first claim, observe that the converse follows from Lemma 4. To see the second claim, observe that as \(H^{\prime}\) is a disjoint union of cliques, \(P_{3}(H^{\prime},s)=\emptyset\) and therefore \((H^{\prime},s)\) is consistent with any species tree on Σ. □

As an example of the construction in the proof above, consider the graph G=(V,E) given in Fig. 4. The corresponding constraint graph (H=(V,M⊎U),s) is given in Fig. 5, and a DS-tree for this constraint graph is given in Fig. 6. As this DS-tree has 2 duplication nodes, G has a 3-coloring. In particular, following the structure of Fig. 6, we observe that there is a 1-coloring of G[{a,e}] (as these vertices are not adjacent in G), and a 1-coloring of G[{b,f}]. Combining these colorings gives a 2-coloring of G[{a,b,e,f}], which can then be combined with a 1-coloring of G[{c,d}] to give a 3-coloring of G.
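The construction in the proof of Lemma 5 can also be sketched in code. The snippet below is an illustrative assumption, not the authors' implementation: trees are encoded as nested tuples `(label, left, right)` with label `"S"` (speciation) or `"D"` (duplication), leaves are plain strings, and each speciation subtree is built as a caterpillar, which is one arbitrary choice among many.

```python
from itertools import combinations


def speciation_subtree(leaves):
    """A binary tree on `leaves` whose internal nodes are all speciations."""
    tree = leaves[0]
    for leaf in leaves[1:]:
        tree = ("S", tree, leaf)
    return tree


def lemma5_solution(coloring):
    """From a proper coloring of G, build the DS-tree T and the edges of H'."""
    classes = {}
    for v, c in coloring.items():
        classes.setdefault(c, []).append(v)
    groups = [sorted(classes[c]) for c in sorted(classes)]
    subtrees = [speciation_subtree(g) for g in groups]
    # Chain the subtree roots x_1, ..., x_k under duplication nodes z_1, ..., z_{k-1}.
    tree = subtrees[-1]
    for sub in reversed(subtrees[:-1]):
        tree = ("D", sub, tree)
    # H' is the disjoint union of cliques on the color classes.
    clique_edges = {frozenset(p) for g in groups for p in combinations(g, 2)}
    return tree, clique_edges


def count_duplications(tree):
    """Number of duplication nodes in a nested-tuple DS-tree."""
    if isinstance(tree, str):
        return 0
    label, left, right = tree
    return (label == "D") + count_duplications(left) + count_duplications(right)
```

As a hypothetical usage example (not the graph of Fig. 4), take the 5-cycle on v0, …, v4 with the 3-coloring `{"v0": 0, "v1": 1, "v2": 0, "v3": 1, "v4": 2}`. The resulting tree has 2 duplication nodes, and the clique edges v0v2 and v1v3 avoid every edge of the cycle, as the sandwich condition requires.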

We now prove the NP-hardness of 2-DOGS, S-CONS-2-DOGS and CONS-2-DOGS, using Lemma 5 and the fact that 3-COLORING is NP-hard [22].

Theorem 3

2-DOGS is NP-hard.

Proof

Given an instance G=(V,E) of 3-COLORING, let (H,s) be the constraint graph given by Lemma 5. Then by Lemma 5, (H,s,2) is a YES-instance of k-DOGS if and only if G is 3-colorable. As 3-COLORING is NP-hard, so is 2-DOGS. □
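The instance produced by this reduction is simple to write down. The following sketch is an assumption-laden illustration: the function name, the representation of M and U as sets of unordered pairs, and the species labels `species_0, species_1, …` are all invented here for the example.

```python
from itertools import combinations


def lemma5_instance(vertices, edges):
    """Build (M, U, s) of Lemma 5 from a connected graph G = (vertices, edges)."""
    order = sorted(vertices)
    edge_set = {frozenset(e) for e in edges}
    all_pairs = {frozenset(p) for p in combinations(order, 2)}
    m_edges = set()                  # M is empty
    u_edges = all_pairs - edge_set   # U is the complement of E
    # Every gene is assigned to its own (hypothetical) species.
    species = {v: f"species_{i}" for i, v in enumerate(order)}
    return m_edges, u_edges, species
```

The 3-COLORING instance G then maps to the 2-DOGS instance formed by this constraint graph together with the bound k = 2; the construction clearly runs in polynomial time.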

Using the same technique as for Theorem 3, we can prove the same NP-hardness result for S-CONS-2-DOGS and CONS-2-DOGS. The proofs are identical to that of Theorem 3, except that in the case of Theorem 4 we construct an arbitrary species tree S on Σ in addition to the constraint graph (H,s).

Theorem 4

S-CONS-2-DOGS is NP-hard.

Theorem 5

CONS-2-DOGS is NP-hard.

Let MINDOGS, S-CONS-MINDOGS, and CONS-MINDOGS denote the minimization versions of k-DOGS, S-CONS-k-DOGS, and CONS-k-DOGS respectively, in which we want to find a solution with the minimum number of duplication nodes. Let GRAPH COLORING denote the minimization version of k-COLORING. As GRAPH COLORING has no polynomial time \(n^{1-\epsilon ^{\prime }}\)-approximation for any \(\epsilon^{\prime}>0\), unless P=NP [23], we can prove the following theorem.

Theorem 6

For any ε>0, there is no polynomial time algorithm that takes as input an instance of MINDOGS, and returns a solution with at most \(n^{1-\epsilon}\cdot k\) duplication nodes if there exists a solution with at most k duplication nodes, unless P=NP.

Proof

Let G=(V,E) be an instance of GRAPH COLORING. Without loss of generality we may assume that G is connected. Let (H,s) be the constraint graph given by Lemma 5.

Now for any ε>0, fix an integer \(n_{0}\) and \(\epsilon^{\prime}>0\) such that \( n^{1-\epsilon } +1 < n^{1-\epsilon ^{\prime }}\) for any \(n\ge n_{0}\).
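To see that such a pair exists, here is one admissible choice, worked out under the assumption that 0<ε<1; it is offered only as an illustration and is not necessarily the choice the authors had in mind.

```latex
% Assumption: 0 < \epsilon < 1.
\epsilon' = \frac{\epsilon}{2},
\qquad
n_0 = \left\lfloor 4^{1/\epsilon} \right\rfloor + 1 .

% For n \ge n_0 we have n^{\epsilon/2} > 2 and n^{1-\epsilon} \ge 1, hence
n^{1-\epsilon} + 1
  \;\le\; 2\, n^{1-\epsilon}
  \;<\; n^{\epsilon/2} \cdot n^{1-\epsilon}
  \;=\; n^{1-\epsilon'} .
```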

Suppose that there exists a polynomial-time \(n^{1-\epsilon}\)-approximation for MINDOGS, i.e. an algorithm that for any instance (H,s) with n vertices, finds a solution with at most \(n^{1-\epsilon}\cdot k\) duplication nodes if there exists a solution with at most k duplication nodes. We show that there exists a polynomial-time \(n^{1-\epsilon ^{\prime }}\)-approximation for GRAPH COLORING.

Let G be an instance of GRAPH COLORING with n vertices, and suppose without loss of generality that \(n\ge n_{0}\) (as otherwise the problem can be solved exactly in polynomial time). Let (H,s) be the instance of MINDOGS constructed from G as above. Now run the supposed approximation algorithm for MINDOGS on (H,s). If G is k-colorable for some k>1, then by Lemma 5 there exists a solution for (H,s) with at most k−1 duplication nodes, and therefore the algorithm returns a solution with at most \(n^{1-\epsilon}\cdot (k-1)\) duplication nodes. (Note that we may assume the solution contains at least 1 duplication node, as otherwise G would be disconnected.) Let \((H^{\prime},s)\) be the orthology graph for this solution. Then by Lemma 4, we have an \((n^{1-\epsilon}\cdot (k-1)+1)\)-coloring for \(\overline {H^{\prime }}\). As G is a subgraph of \(\overline {H^{\prime }}\), this is also an \((n^{1-\epsilon}\cdot (k-1)+1)\)-coloring for G.

As \(n \ge n_{0}\), we have \(n^{1-\epsilon }\cdot (k-1) +1 \le (n^{1-\epsilon } +1) \cdot k \le n^{1-\epsilon ^{\prime }}\cdot k\), and so we have a polynomial time \(n^{1-\epsilon ^{\prime }}\)-approximation for GRAPH COLORING, a contradiction. □
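For readers who want the intermediate steps of this last chain of inequalities, they can be spelled out as follows, using only \(n^{1-\epsilon}\ge 1\), \(k\ge 1\) and the choice of \(n_{0}\) and \(\epsilon^{\prime}\) made earlier in the proof.

```latex
\begin{align*}
n^{1-\epsilon}\cdot(k-1) + 1
  &= n^{1-\epsilon}\cdot k - \left(n^{1-\epsilon} - 1\right) \\
  &\le n^{1-\epsilon}\cdot k
     && \text{since } n^{1-\epsilon} \ge 1 \\
  &\le \left(n^{1-\epsilon} + 1\right)\cdot k
     && \text{since } k \ge 0 \\
  &< n^{1-\epsilon'}\cdot k
     && \text{since } n \ge n_0 \text{ and } k \ge 1 .
\end{align*}
```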

Using the same technique as for Theorem 6, we can prove the same inapproximability result for S-CONS-MINDOGS and CONS-MINDOGS. The proofs are identical to that of Theorem 6, except that in the case of Theorem 7 we construct an arbitrary species tree S on Σ in addition to the constraint graph (H,s).

Theorem 7

For any ε>0, there is no polynomial time algorithm that takes as input an instance of S-CONS-MINDOGS, and returns a solution with at most \(n^{1-\epsilon}\cdot k\) duplication nodes if there exists a solution with at most k duplication nodes, unless P=NP.

Theorem 8

For any ε>0, there is no polynomial time algorithm that takes as input an instance of CONS-MINDOGS, and returns a solution with at most \(n^{1-\epsilon}\cdot k\) duplication nodes if there exists a solution with at most k duplication nodes, unless P=NP.

To summarise the results in this section: given a constraint graph on n vertices, it is NP-hard to find a DS-tree for that graph with at most k duplication nodes, even when k=2. This holds regardless of whether we require the DS-tree to be consistent, or whether we are given a species tree that it should be consistent with. Viewed as a minimization problem, it is NP-hard even to find an \(n^{1-\epsilon}\)-approximate solution, for any ε>0.

Results and Discussion

We integrated Algorithm 1 into the software provided at [24] by the authors of [9]. Note that the previous version of the program could only check satisfiability, and consistency of a constraint graph with respect to a given species tree S.

We used the modified software to reanalyze the data set in [9]. This data set was constructed by randomly choosing 265 gene families of vertebrates with more than 20 genes from Ensembl [25]. Each gene family was then analysed with ProteinOrtho [26] using 9 different parameter settings, yielding 2385 different constraint graphs. Here S is the Ensembl species tree, which can be downloaded at [27].

For this data set, apart from one case, all satisfiable constraint graphs are also consistent. In 533 out of 2385 cases, the constraint graph was found to be consistent, but not consistent with S. We were interested in finding out how greatly the graphs in this set (denoted \(\mathcal {CG}\)) conflicted with S. Indeed, some nodes in the Ensembl species tree, for example the positions of Equus, Tupaia and Cavia, do not enjoy a consensus in the community, so some contradictions with S are expected.

Note that we can use the graph \(G^{\prime}\) output by Algorithm 1 to obtain a species tree in the following way: we compute the set \(\mathcal {T}\) of all triplets in \(P_{3}(G^{\prime},s)\) and then feed \(\mathcal {T}\) to the BUILD algorithm [18], which will return a species tree displaying all the triplets in \(\mathcal {T}\) (in practice, our implementation of Algorithm 1 is able to construct a species tree directly).
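For orientation, the BUILD algorithm of [18] can be sketched in a few lines. This is a minimal textbook-style version, not the code used in the software at [24]: the encoding of a rooted triplet ab|c as the tuple `(a, b, c)`, the nested-tuple output format, and the use of `None` to signal incompatible triplets are assumptions made here, and the extraction of the triplets from \(P_{3}(G^{\prime},s)\) is not shown.

```python
def build(leaves, triplets):
    """BUILD: return a tree (nested tuples) displaying all triplets, or None.

    Each triplet (a, b, c) encodes ab|c, i.e. a and b are closer to each
    other than either is to c.
    """
    leaves = set(leaves)
    if len(leaves) == 1:
        return next(iter(leaves))

    # Keep only the triplets whose three leaves are all in the current set.
    relevant = [t for t in triplets if set(t) <= leaves]

    # Union-find over the leaves: a and b are merged for every triplet ab|c.
    parent = {x: x for x in leaves}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, _c in relevant:
        parent[find(a)] = find(b)

    blocks = {}
    for x in leaves:
        blocks.setdefault(find(x), set()).add(x)
    if len(blocks) == 1:
        return None  # the auxiliary graph is connected: no tree exists

    children = []
    for block in blocks.values():
        subtree = build(block, relevant)
        if subtree is None:
            return None
        children.append(subtree)
    # Sort children only to make the output deterministic.
    return tuple(sorted(children, key=repr))
```

When few triplets are available, the auxiliary graph splits into many blocks and the returned tree has high-degree nodes, so the result is not guaranteed to be binary.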

This species tree can fail to be binary, if the information contained in \(\mathcal {T}\) is sparse (this is actually the case for our data set: the maximum number of internal nodes over all species trees reconstructed by our approach from constraint graphs in \(\mathcal {CG}\) was 6, with an average of 1.5).

To estimate the discordance between the Ensembl species tree S and each of the species trees \(S^{\prime}\) reconstructed by our approach for a constraint graph in \(\mathcal {CG}\), we did the following: for each pair \((S,S^{\prime})\) we constructed a tree \(S^{\prime\prime}\) displaying the maximum number of triplets of S not contradicting \(S^{\prime}\) using PhySIC_IST [28]. We then computed the number of triplets displayed by S but not by \(S^{\prime\prime}\), as a proportion of the total number of triplets displayed by S: the higher this number is, the higher the conflict between S and \(S^{\prime}\). This number, denoted \(c(S,S^{\prime})\), can be used to differentiate gene families that are good markers (i.e. markers highly coherent with the given species tree, which will have a low \(c(S,S^{\prime})\)) from gene families that are bad markers (with a high \(c(S,S^{\prime})\)). The histogram of the values of \(c(S,S^{\prime})\) for our data set is given in Fig. 7. This shows that several constraint graphs, even though not consistent with S, are not in high contradiction with it and thus the corresponding gene families can still be considered as good markers.
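The conflict score itself is a simple ratio once the tree \(S^{\prime\prime}\) is available. The sketch below assumes that \(S^{\prime\prime}\) has already been computed by PhySIC_IST (which is not called here) and that trees are encoded as nested tuples with string leaves; the function names are hypothetical.

```python
from itertools import combinations


def leaf_set(tree):
    """Set of leaves of a nested-tuple tree (a leaf is any non-tuple value)."""
    if not isinstance(tree, tuple):
        return {tree}
    return set().union(*(leaf_set(child) for child in tree))


def displayed_triplets(tree):
    """All rooted triplets ab|c displayed by the tree, as (frozenset({a, b}), c)."""
    if not isinstance(tree, tuple):
        return set()
    result = set()
    everything = leaf_set(tree)
    for child in tree:
        inside = leaf_set(child)
        # a and b meet strictly below this node, c hangs off another child.
        for a, b in combinations(sorted(inside), 2):
            for c in everything - inside:
                result.add((frozenset((a, b)), c))
        result |= displayed_triplets(child)
    return result


def conflict(s, s_second):
    """Proportion of the triplets of S that are not displayed by S''."""
    triplets_s = displayed_triplets(s)
    return len(triplets_s - displayed_triplets(s_second)) / len(triplets_s)
```

For example, with the hypothetical trees `S = ((("a", "b"), "c"), "d")` and `S'' = (("a", "b"), "c", "d")`, S displays four triplets, of which S'' keeps ab|c and ab|d, so the score is 0.5.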

Conclusions

In this paper, we extend the results of [9] by giving an O(n³) time algorithm to decide whether \(\mathcal {C}\) is consistent, even when the species tree is not known and \(\mathcal {C}\) is not full. We also incorporated this algorithm into the software provided at [24]. The algorithm has important applications in providing evidence for the structure of a species tree when that species tree is unknown. It also allows us to see how much an ‘inconsistent’ set of constraints is in conflict with a known species tree, as the algorithm returns a species tree for which those constraints are consistent, if any exists. On the negative side, we show that the problem of minimizing duplication nodes in DS-trees is NP-hard even when the number of duplications is very small, and it is also hard to find approximate solutions for this criterion.

References

1. Fitch WM. Distinguishing homologous from analogous proteins. Syst Biol. 1970; 19(2):99–113.
2. Altenhoff AM, Dessimoz C. Inferring orthology and paralogy. Evol Genomics Stat Comput Methods. 2012; 1:259–79.
3. Schreiber F, Patricio M, Muffato M, Pignatelli M, Bateman A. Treefam v9: a new website, more species and orthology-on-the-fly. Nucleic Acids Res. 2014; 42(Database issue):D922–5. doi:10.1093/nar/gkt1055. Epub 2013 Nov 4.
4. Doyon JP, Ranwez V, Daubin V, Berry V. Models, algorithms and programs for phylogeny reconciliation. Brief Bioinform. 2011; 12(5):392–400.
5. Li L, Stoeckert CJ, Roos DS. Orthomcl: identification of ortholog groups for eukaryotic genomes. Genome Res. 2003; 13(9):2178–89.
6. Bergsten J. A review of long-branch attraction. Cladistics. 2005; 21(2):163–93.
7. Hernandez-Rosales M, Hellmuth M, Wieseke N, Huber KT, Moulton V, Stadler PF. From event-labeled gene trees to species trees. BMC Bioinforma. 2012; 13(Suppl 19):6.
8. Hellmuth M, Hernandez-Rosales M, Huber KT, Moulton V, Stadler PF, Wieseke N. Orthology relations, symbolic ultrametrics, and cographs. J Math Biol. 2013; 66(1-2):399–420.
9. Lafond M, El-Mabrouk N. Orthology and paralogy constraints: satisfiability and consistency. BMC Genomics. 2014; 15(Suppl 6):12.
10. Hellmuth M, Wieseke N, Lechner M, Lenhof HP, Middendorf M, Stadler PF. Phylogenomics with paralogs. Proc Natl Acad Sci U S A. 2015; 112(7):2058–63.
11. Ma B, Li M, Zhang L. From gene trees to species trees. SIAM J Comput. 2000; 30(3):729–52.
12. Koonin EV. Orthologs, paralogs, and evolutionary genomics 1. Annu Rev Genet. 2005; 39:309–38.
13. Corneil D, Lerchs H, Stewart-Burlingham LK. Complement reducible graphs. Discret Appl Math. 1981; 3(1):163–74.
14. Böcker S, Dress AWM. Recovering symbolically dated, rooted trees from symbolic ultrametrics. Adv Math. 1998; 138(1):105–25. doi:10.1006/aima.1998.1743.
15. Bretscher A, Corneil D, Habib M, Paul C. A simple linear time lexbfs cograph recognition algorithm. SIAM J Discret Math. 2008; 22(4):1277–96.
16. Corneil D, Perl Y, Stewart LK. A linear time recognition algorithm for cographs. SIAM J Comput. 1985; 14(4):926–34.
17. Golumbic M, Kaplan H, Shamir R. Graph sandwich problems. J Algorithm. 1995; 19:449–73.
18. Aho AV, Sagiv Y, Szymanski TG, Ullman JD. Inferring a tree from lowest common ancestors with an application to the optimization of relational expressions. SIAM J Comput. 1981; 10(3):405–21.
19. Lafond M, Swenson KM, El-Mabrouk N. An optimal reconciliation algorithm for gene trees with polytomies. In: International Workshop on Algorithms in Bioinformatics. Springer Berlin Heidelberg: 2012. p. 106–22.
20. Boussau B, Szöllősi GJ, Duret L, Gouy M, Tannier E, Daubin V. Genome-scale coestimation of species and gene trees. Genome Res. 2013; 23(2):323–30.
21. Downey RG, Fellows MR. Parameterized Complexity. New York, NY, USA: Springer; 1999, p. 530.
22. Garey MR, Johnson DS. Computers and Intractability: A Guide to the Theory of NP-Completeness. New York, NY, USA: W. H. Freeman & Co.; 1979.
23. Zuckerman D. Linear degree extractors and the inapproximability of max clique and chromatic number. Theory Comput. 2007; 3(6):103–28. doi:10.4086/toc.2007.v003a006.
24. Lafond M, Jones M. GitHub - UdeM-LBIT/OrthoPara-ConstraintChecker: A program to check if a given set of orthology/paralogy relations, with possible unknowns, is satisfiable or consistent with a species tree. https://github.com/UdeM-LBIT/OrthoPara-ConstraintChecker. Accessed 22 July 2016.
25. Flicek P, Ahmed I, Amode MR, Barrell D, Beal K, Brent S, Carvalho-Silva D, Clapham P, Coates G, Fairley S, et al. Ensembl 2013. Nucleic Acids Res. 2012; 41(Database issue):D48–D55. Published online 2012 Nov 30. doi:10.1093/nar/gks1236.
26. Lechner M, Findeiß S, Steiner L, Marz M, Stadler PF, Prohaska SJ. Proteinortho: Detection of (co-) orthologs in large-scale analysis. BMC Bioinforma. 2011; 12(1):1.
27. Ensembl. Ensembl Species Tree. http://www.ensembl.org/info/about/speciestree.html. Accessed 22 July 2016.
28. Scornavacca C, Berry V, Lefort V, Douzery EJ, Ranwez V. Physic_ist: cleaning source trees to infer more informative supertrees. BMC Bioinforma. 2008; 9(1):413.

Acknowledgements

The authors would like to thank Manuel Lafond for providing the Python script to check consistency with a known species tree.

Declarations

This article has been published as part of BMC Bioinformatics Vol 17 Suppl 14, 2016: Proceedings of the 14th Annual Research in Computational Molecular Biology (RECOMB) Comparative Genomics Satellite Workshop: bioinformatics. The full contents of the supplement are available online at https://bmcbioinformatics.biomedcentral.com/articles/supplements/volume-17-supplement-14.

Funding

Research of MJ was funded by Labex NUMEV (ANR-10-LABX-20).

Availability of data and materials

Our polynomial algorithm for checking consistency has been implemented in Python and is available at [24]. The gene families and Ensembl species tree used in our experiments are available at ensembl.org. The datasets analysed during this study are available from the corresponding author upon request. An implementation of the PhySIC_IST method is available at http://www.atgc-montpellier.fr/physic_ist/.

Authors’ contributions

Designed the algorithms: CP MJ CS. Implemented the algorithms: MJ. Designed, performed and analyzed the experiments: MJ CS. Wrote the paper: CP MJ CS. All authors have read and approved the final version of the manuscript.

Competing interests

The authors declare that they have no competing interests.

Consent for publication

Not applicable.

Ethics approval and consent to participate

Not applicable.

Publication funding

Publication charges for this article have been funded by the French Agence Nationale de la Recherche Investissements d’Avenir/ Bioinformatique (ANR-10-BINF-01-02, Ancestrome).

Author information

Authors and Affiliations

LIRMM, CNRS, Université de Montpellier, Montpellier, France
Mark Jones
ISE-M, CNRS, IRD, EPHE, Université, Montpellier, France
Christophe Paul & Céline Scornavacca

Corresponding author

Correspondence to Céline Scornavacca.

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

About this article

Cite this article

Jones, M., Paul, C. & Scornavacca, C. On the consistency of orthology relationships. BMC Bioinformatics 17 (Suppl 14), 416 (2016). https://doi.org/10.1186/s12859-016-1267-3

Published: 11 November 2016
DOI: https://doi.org/10.1186/s12859-016-1267-3
