On the consistency of orthology relationships
 Mark Jones^{1},
 Christophe Paul^{2} and
 Céline Scornavacca^{2}
https://doi.org/10.1186/s1285901612673
© The Author(s) 2016
Published: 11 November 2016
Abstract
Background
Ortholog inference is the starting point of most comparative genomics studies, and a plethora of methods have been designed in the last decade to address this challenging task. In this paper we focus on the problems of deciding consistency with a species tree (known or not) of a partial set of orthology/paralogy relationships \(\mathcal {C}\) on a collection of n genes.
Results
We give the first polynomial algorithm (more precisely, an O(n^{3})-time algorithm) to decide whether \(\mathcal {C}\) is consistent, even when the species tree is unknown. We also investigate a biologically meaningful optimization version of these problems, in which we wish to minimize the number of duplication events; unfortunately, we show that all these optimization problems are NP-hard and are unlikely to have good polynomial-time approximation algorithms.
Conclusions
Our polynomial algorithm for checking consistency has been implemented in Python and is available at https://github.com/UdeMLBIT/OrthoParaConstraintChecker.
Keywords
Background
Two genes from two different species are said to be orthologous if they derived from a single gene present in the last common ancestor of the two species via a speciation event, and paralogous if they were created by a duplication event [1]. Ortholog inference is the starting point of most comparative genomics studies, and is also a key instrument for functional annotation of new genomes. A plethora of methods have been designed in the last decade to address this challenging task, and they can be roughly divided into two groups [2]. The first group of methods uses clustering algorithms to detect homologous genes, i.e., genes sharing a common ancestry, and then reconstructs a gene tree describing the evolutionary history of this set of genes; orthology relationships are then deduced from this tree by comparing it with the species tree, i.e., the tree depicting the history of the species containing those genes, via reconciliation algorithms (see [3], among others, and [4] for a review of reconciliation algorithms). The second group of methods uses other sources of information, e.g. sequence similarity or synteny, to directly estimate orthology relationships [5, among others]. The first group of methods is considered to be more accurate, but these methods require prior knowledge of the species tree and depend heavily on the accuracy of the gene trees. Unfortunately, the species phylogeny is not always known and gene trees can be highly inaccurate as a result of several kinds of reconstruction artifacts, e.g. long-branch attraction (LBA) [6].
The second set of methods does not suffer from these drawbacks but still has an important weakness: given a set of genes V, the set of inferred orthology/paralogy relationships \(\mathcal {C}\) for V may fail to be satisfiable, i.e., to simultaneously coexist in any evolutionary history for V, or consistent i.e., such that all displayed triplet phylogenies are included in a species tree (formal definitions are given in the next section).
In recent years, the decision problems associated with these questions have been extensively studied, both when \(\mathcal {C}\) is full, i.e., involves a constraint for each pair of genes in V [7, 8], and when it is not [9].
In [9], the authors give O(n^{3})-time algorithms to decide whether \(\mathcal {C}\) is satisfiable and consistent under the assumption that the species tree is known, where n=|V|. These results hold whether \(\mathcal {C}\) is a full set of constraints or not. They also showed how to decide whether \(\mathcal {C}\) is satisfiable when the species tree is unknown but \(\mathcal {C}\) is full (this problem was also considered in [10]).
In this paper, we extend the results of [9] by giving an O(n^{3})-time algorithm to decide whether \(\mathcal {C}\) is consistent, even when the species tree is not known and \(\mathcal {C}\) is not full, and show an application on real data. Thus the problems of deciding satisfiability, deciding consistency given a species tree, and deciding consistency with an unknown species tree are all polynomial-time solvable. We also investigate an optimization version of these problems, in which we wish to minimize the number of duplication events in the evolutionary history for V; duplication minimization is a well-known criterion in phylogenomics [11]. Unfortunately, we show that all three problems are NP-hard, even when the maximum number of duplication events is 2, and are unlikely to have good polynomial-time approximation algorithms.
Preliminaries
A rooted tree T with arc set E(T) and node set V(T) is a directed acyclic connected graph in which every node has indegree 1, except for a single node of indegree 0, the root, denoted by ROOT(T), and in which the nodes of outdegree 0 (the leaves of T, denoted by L(T)) are uniquely labeled. Throughout the paper, we will treat leaves in a tree as synonymous with the labels associated to them. We denote by I(T) the set V(T)∖L(T) of internal nodes of T. If all nodes in I(T) have outdegree 2, we say that T is binary.
Given two nodes x,y in T, we say that x is an ancestor of y in T, and that y is a descendant of x in T, if there is a directed path from x to y in T. (Note that any node x is an ancestor and descendant of itself.) If x is not an ancestor of y and y is not an ancestor of x, we say that x,y are separated in T. If there is an arc from x to y in T, we say that x is the parent of y in T and that y is a child of x in T.
Given a node x, let DESC_{T}(x) denote the set of descendants of x in T. Let CHILD_{T}(x) denote the set of all children of x in T. Let LEAF_{T}(x)=DESC_{T}(x)∩L(T), i.e. LEAF_{T}(x) is the set of leaves in T that are descendants of x. Note that LEAF_{T}(ROOT(T))=L(T). Given a set A of nodes in T, let LCA_{T}(A) denote the least common ancestor of A in T, that is, the unique node z such that z is an ancestor of all x∈A, and no descendant of z other than z itself has this property. Given two nodes x,y, we will often write LCA_{T}(x,y) as shorthand for LCA_{T}({x,y}). When T is clear from context, we will often omit “in T” and simply say that x is the ancestor of y, y is the descendant of x, z is a leaf, etc.
Suppressing a non-root node x of outdegree 1 in a tree T consists of removing x and making the unique child of x a new child of the parent of x. Given a set of leaves L ^{′}⊆L(T), the restriction of T to L ^{′}, denoted \(T_{L^{\prime }}\), is the tree derived from T by taking the minimum subtree of T spanning L ^{′}, and suppressing all non-root nodes of outdegree 1.
A triplet is a rooted binary tree T with |L(T)|=3. Given three distinct elements x,y,z, we denote by xy|z the unique triplet T with L(T)={x,y,z} such that LCA_{T}(x,y)≠ROOT(T) (or equivalently, LCA_{T}(x,y)≠LCA_{T}(x,z)=LCA_{T}(y,z)). We say that a rooted tree T displays the triplet xy|z if T_{{x,y,z}}=xy|z.
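The definition of "displays" can be illustrated with a minimal Python sketch. The nested-tuple tree encoding and the helper names `leaves` and `displays` are assumptions made for this illustration only and are not taken from the authors' implementation.

```python
def leaves(tree):
    """Return the set of leaf labels of a tree given as nested tuples."""
    if not isinstance(tree, tuple):
        return {tree}
    return set().union(*(leaves(child) for child in tree))


def displays(tree, x, y, z):
    """Return True if `tree` displays the triplet xy|z."""
    wanted = {x, y, z}
    if len(wanted) < 3 or not wanted <= leaves(tree):
        return False
    node = tree
    while True:
        # Descend while a single child still contains all three leaves.
        below = [child for child in node if wanted <= leaves(child)]
        if not below:
            break
        node = below[0]
    # `node` is now LCA(x, y, z); xy|z is displayed iff x and y share a child.
    return any({x, y} <= leaves(child) for child in node)


# Hypothetical example tree (((a, b), c), d).
S = ((("a", "b"), "c"), "d")
print(displays(S, "a", "b", "c"), displays(S, "a", "c", "b"))
```

Recomputing leaf sets at every step keeps the sketch short; a practical implementation would precompute them or use a constant-time LCA structure.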
Given a set of edges E over a set of vertices V, and a subset V ^{′}⊆V, we define E[V ^{′}]={xy:x,y∈V ^{′},xy∈E}. Given graphs G=(V,E) and G ^{′}=(V ^{′},E ^{′}), we say that G ^{′} is an induced subgraph of G if V ^{′}⊆V and E ^{′}=E[V ^{′}], and denote G ^{′} by G[V ^{′}]. We define \(\overline {E} = \{xy: x,y \in V, xy \notin E\}\) and say that \(\overline {G}=(V,\overline {E})\) is the complement of G. For any integer l≥1, a path P_{l} is a graph (V={v_{1},…,v_{l}},E={v_{i}v_{i+1}:1≤i≤l−1}). We note here that if a graph contains an induced P_{4}, then its complement contains an induced P_{4} on the same four vertices.
Species trees and DS-trees. Let Σ denote a set of species. A species tree S on Σ is a binary rooted tree such that L(S)=Σ, used to depict the evolutionary history of the species in Σ.
Genes are said to be homologous if they share a common ancestor. Let V denote a set of homologous genes belonging to species in Σ. A species assignment of V is a function s:V→Σ, with s(v)=a representing the fact that gene v belongs to species a∈Σ. For a set V ^{′}⊆V, we define s(V ^{′})={a∈Σ:∃x∈V ^{′},s(x)=a}, and \(s_{V^{\prime }}: V^{\prime } \rightarrow s(V^{\prime })\) such that \(s_{V^{\prime }}(v)=s(v)\) for all v∈V ^{′}. A DS-tree on V is a pair (T,ℓ), where T is a binary rooted tree with leaf set V and ℓ:I(T)→{Dup,Spec} is a function labeling each internal node x of T as a speciation node (if ℓ(x)=Spec) or a duplication node (if ℓ(x)=Dup). DS-trees are used to depict the evolutionary history of the genes in V. When the function ℓ is clear from context, we will often omit it and speak only of a DS-tree T.
Given two genes x,y in T, we say that x,y are orthologs with respect to T if LCA_{T}(x,y) is a speciation node, and paralogs with respect to T otherwise. Given an undirected graph G=(V,E), a DS-tree (T,ℓ) on V is a DS-tree for G (or G is an orthology graph for T) if for every x,y∈V, xy∈E⇔ℓ(LCA_{T}(x,y))=Spec. That is, x and y are adjacent in G if and only if they are orthologs with respect to T.
The presence of two homologous genes in the same species can be caused either by duplications or gene transfers [12]. So, in the absence of gene transfers, homologous genes from the same species are necessarily paralogs. We formalize this idea in the following assumption.
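As an illustration of this definition, the following Python sketch derives the orthology graph of a DS-tree. The encoding of a DS-tree as nested tuples `(label, left, right)` with string leaves, and the function name `orthology_edges`, are assumptions made for this example.

```python
from itertools import product


def orthology_edges(ds_tree):
    """Return (leaf list, edge set) of the orthology graph of a DS-tree.

    A DS-tree is either a gene name (str) or a tuple (label, left, right)
    with label in {"Spec", "Dup"}.
    """
    if isinstance(ds_tree, str):
        return [ds_tree], set()
    label, left, right = ds_tree
    left_leaves, left_edges = orthology_edges(left)
    right_leaves, right_edges = orthology_edges(right)
    edges = left_edges | right_edges
    if label == "Spec":
        # Genes whose LCA is this speciation node are orthologs.
        edges |= {frozenset(p) for p in product(left_leaves, right_leaves)}
    return left_leaves + right_leaves, edges


# Hypothetical example: a1 and a2 arise from a duplication, b1 from a speciation.
tree = ("Spec", ("Dup", "a1", "a2"), "b1")
genes, edges = orthology_edges(tree)
print(genes, sorted(tuple(sorted(e)) for e in edges))
```

Each pair of leaves is examined exactly once, at its least common ancestor, so the sketch runs in time quadratic in the number of genes.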
Assumption 1
We assume in what follows that whenever we are given a graph G=(V,E) with a species assignment s, two vertices x,y of G are not adjacent if s(x)=s(y).
Cographs. A cograph is a graph that can be generated from a single-vertex graph using the operations of disjoint union (taking the disjoint union of multiple graphs) and series composition (adding all possible edges between vertices of multiple graphs) [13]. This generation scheme yields a representation of a cograph in terms of cotrees. A cotree is a rooted tree T, with internal nodes labeled 0 (representing the disjoint union operation) or 1 (representing the series composition). Hence a cotree represents a graph G=(V,E) if L(T)=V and two vertices x and y of G are adjacent if and only if LCA_{T}(x,y) is labeled 1. Observe that the cotree representation of a cograph is not unique. Also, while a cotree is not necessarily binary, any non-binary cotree can be transformed in linear time into a binary cotree with the same corresponding cograph. There are several characterizations of cographs. Among other characterizations, a cograph is a graph with no induced P_{4} [13]. Cographs can also be characterized as the graphs in which every connected induced subgraph has diameter at most 2.
Hellmuth et al. [8] noted that all orthology graphs (i.e. graphs for which there exists a DS-tree) can be characterized as symbolic ultrametrics [14], and showed that a graph is an orthology graph if and only if it is a cograph [8, Corollary 4].
Thus we have a useful graph-theoretic framework for deciding on the existence of a DS-tree.
Proposition 1
For a graph G=(V,E), the following statements are equivalent:
 1. There exists a DS-tree for G;
 2. G contains no induced P_{4}, i.e. it is P_{4}-free;
 3. G is a cograph.
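The second condition of Proposition 1 can be tested directly on small instances. The sketch below is a deliberately naive O(n^4) check written for illustration; it is not the linear-time recognition algorithm of [15, 16], and the function name and the edge encoding (a set of two-element frozensets) are assumptions.

```python
from itertools import permutations


def has_induced_p4(vertices, edges):
    """Return True if the graph contains an induced path on four vertices."""
    def adjacent(u, v):
        return frozenset((u, v)) in edges

    for a, b, c, d in permutations(vertices, 4):
        path = adjacent(a, b) and adjacent(b, c) and adjacent(c, d)
        chords = adjacent(a, c) or adjacent(a, d) or adjacent(b, d)
        if path and not chords:
            return True
    return False


# Hypothetical example: the path a-b-c-d is not a cograph, a star is.
p4 = {frozenset(e) for e in [("a", "b"), ("b", "c"), ("c", "d")]}
star = {frozenset(e) for e in [("c", "x"), ("c", "y"), ("c", "z")]}
print(has_induced_p4("abcd", p4), has_induced_p4("cxyz", star))
```

By Proposition 1, a graph for which this function returns False admits a DS-tree.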
As cographs can be recognized in linear time [15, 16], deciding whether a graph has a DS-tree, i.e., if it is satisfiable, can be achieved within the same time complexity. Note, however, that not every DS-tree represents a possible evolutionary history for a set of genes. In particular, given a species assignment, different parts of a DS-tree may imply conflicting evolutionary histories for the species containing those genes. The concept of consistency makes this notion precise.
Consistent DS-trees. Given a DS-tree T on V, a species assignment s:V→Σ and a species tree S on Σ, we say that (T,s) is consistent with S (or S-consistent) if for every speciation node z in T, and distinct children x,y of z, LCA_{S}(s(LEAF_{T}(x))) and LCA_{S}(s(LEAF_{T}(y))) are separated in S. Given a graph G=(V,E) and the species assignment s, the pair (G,s) is consistent with S if there exists a DS-tree T for G such that (T,s) is consistent with S. We say that G (resp. T) along with the species assignment s, is consistent if there exists a species tree S such that (G,s) (resp. (T,s)) is consistent with S [9].
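The definition of S-consistency can be checked node by node. The following Python sketch does so under assumed encodings: a DS-tree is a nested tuple `(label, left, right)`, a species tree is a nested tuple of species names, and `s` is a dictionary from genes to species. It relies on the observation that two nodes of a rooted tree are separated exactly when their leaf sets are disjoint.

```python
def species_leaves(tree):
    """Leaf set of a species tree given as nested tuples of species names."""
    if isinstance(tree, str):
        return {tree}
    return set().union(*(species_leaves(child) for child in tree))


def lca_cluster(species_tree, targets):
    """Leaf set below the least common ancestor of `targets` in the species tree."""
    node = species_tree
    while not isinstance(node, str):
        below = [c for c in node if targets <= species_leaves(c)]
        if not below:
            break
        node = below[0]
    return species_leaves(node)


def gene_leaves(ds_tree):
    """Leaf set of a DS-tree given as (label, left, right) tuples."""
    if isinstance(ds_tree, str):
        return {ds_tree}
    _, left, right = ds_tree
    return gene_leaves(left) | gene_leaves(right)


def is_consistent_with(ds_tree, s, species_tree):
    """Check the separation condition at every speciation node of the DS-tree."""
    if isinstance(ds_tree, str):
        return True
    label, left, right = ds_tree
    if label == "Spec":
        a = lca_cluster(species_tree, {s[g] for g in gene_leaves(left)})
        b = lca_cluster(species_tree, {s[g] for g in gene_leaves(right)})
        if a & b:  # overlapping leaf sets: the two LCAs are not separated
            return False
    return (is_consistent_with(left, s, species_tree)
            and is_consistent_with(right, s, species_tree))


# Hypothetical example with species tree ((A, B), C).
S = (("A", "B"), "C")
s = {"a": "A", "b": "B", "c": "C"}
print(is_consistent_with(("Spec", ("Spec", "a", "b"), "c"), s, S))
print(is_consistent_with(("Spec", ("Spec", "a", "c"), "b"), s, S))
```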
Given a DS-tree T on V and a species assignment s:V→Σ, let tr(T,s) be the set of triplets s(x)s(y)|s(z) for which the triplet xy|z is displayed by T with a speciation node as the root, and for which s(x)≠s(y).
Hernandez-Rosales et al. [7] showed that (T,s) is consistent with a species tree S if and only if S displays all triplets in tr(T,s). In light of this result, Hellmuth et al. [10] gave a framework for finding the DS-tree and species tree for which the maximum number of triplets are displayed, using Integer Linear Programming. Lafond and El-Mabrouk [9] improved the result of [7] by showing that it is enough to consider only the triplets in tr(T,s) that have a speciation node as the root node and a duplication node as the other internal node. This can be expressed in terms of the consistency of an orthology graph in the following way.
Given a graph G=(V,E) and species assignment s:V→Σ, define the set of triplets P_{3}(G,s)={s(x)s(y)|s(z): xz,zy∈E and xy∉E and s(x)≠s(y)}. Note that as a consequence of Assumption 1, if s(x)s(y)|s(z)∈P_{3}(G,s), then s(z)≠s(y) and s(z)≠s(x).
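The set P_{3}(G,s) can be enumerated by scanning, for each gene z, the pairs of non-adjacent neighbours of z. The Python sketch below is an illustration under assumed encodings: edges are two-element frozensets, `s` is a dictionary, and a triplet ab|c is returned as the pair `(frozenset({a, b}), c)`.

```python
from itertools import combinations


def p3_triplets(vertices, edges, s):
    """Return P_3(G, s) as a set of pairs (frozenset({s(x), s(y)}), s(z))."""
    triplets = set()
    for z in vertices:
        neighbours = [v for v in vertices if frozenset((v, z)) in edges]
        for x, y in combinations(neighbours, 2):
            # x - z - y is an induced path and x, y lie in different species.
            if frozenset((x, y)) not in edges and s[x] != s[y]:
                triplets.add((frozenset((s[x], s[y])), s[z]))
    return triplets


# Hypothetical example: the path x - z - y yields the triplet AB|C.
edges = {frozenset(("x", "z")), frozenset(("z", "y"))}
print(p3_triplets(["x", "y", "z"], edges, {"x": "A", "y": "B", "z": "C"}))
```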
By Theorem 5 in [9], we have the following theorem (in fact, Theorem 5 in [9] only states that (G,s) is consistent if and only if there exists a species tree S which displays all triplets in P _{3}(G,s), but their proof shows that (G,s) is indeed consistent with such an S):
Theorem 1
[9] Let G=(V,E) have a DS-tree and let s:V→Σ be a species assignment. Let S be a species tree on Σ. Then (G,s) is consistent with S if and only if S displays all triplets in P_{3}(G,s).
Theorem 1 directly provides a polynomial time algorithm to decide whether a graph and a species assignment are consistent with a given species tree. The following proposition reformulates Theorem 1 in a convenient way:
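Proposition 2
Let G=(V,E) be a graph, s:V→Σ a species assignment and S a species tree on Σ. Then (G,s) is consistent with S if and only if the following two conditions hold:
 1. G does not contain an induced P_{4};
 2. Every triplet in P_{3}(G,s) is displayed by S.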
As both of the properties in Proposition 2 are hereditary, we also have:
Corollary 1
Given a graph G=(V,E), a species assignment s and a subset V ^{′}⊆V, if (G,s) is consistent with a species tree S then \((G[V^{\prime }],s_{V^{\prime }})\) is consistent with the species tree \(S_{s(V^{\prime })}\).
Constraint graphs. A constraint graph is a pair (G,s) where G=(V,M⊎U) is an edge-bicolored graph and s is a species assignment on V. A constraint graph aims at representing the partial knowledge about the orthology or paralogy relations between genes from V. The edges in M are mandatory edges, representing the pairs of genes xy for which we know that x and y are orthologs. The non-edges of G (i.e. the unordered pairs xy for which xy∉M⊎U) represent the pairs of genes xy for which we know that x and y are paralogs. The edges in U are unknown edges, for which we do not know if x and y are orthologs or paralogs. From Assumption 1, we have that xy∉M⊎U for any pair of genes x, y such that s(x)=s(y) (in the absence of gene transfers, homologous genes from the same species are necessarily paralogs). Note that an orthology graph is a constraint graph where U=∅. A sandwich of a constraint graph (G,s), with G=(V,M⊎U), is a graph H=(V,E) such that M⊆E⊆M∪U.
As a gene is always associated with the species it belongs to, throughout this paper we will always present a DS-tree T together with a species assignment s. Thus we will speak of a DS-tree (T,s). Similarly, we will always present an orthology graph G together with its species assignment s, and speak of an orthology graph (G,s). A sandwich graph G ^{′} will be presented on its own without a species assignment, as a sandwich graph is defined relative to a constraint graph (G=(V,M⊎U),s), and so the species assignment s will always be clear from context.
Methods
Computing a consistent DS-tree
In this section, we describe a polynomial time algorithm for the following problem:
CONSISTENT ORTHOLOGY GRAPH SANDWICH problem
Input: a constraint graph (G,s), with G=(V,M⊎U) and s:V→Σ a species assignment;
Output: a sandwich graph H for (G,s) such that (H,s) is consistent (if any exists).
Observe that by Proposition 2, the CONSISTENT ORTHOLOGY GRAPH SANDWICH problem amounts to computing a sandwich cograph satisfying extra properties. The sandwich cograph problem is known to be polynomial time solvable [17]. Our algorithm can be seen as a combination of the sandwich cograph algorithm and the BUILD algorithm [18] for checking consistency of a set of triplets.
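For readers unfamiliar with BUILD, the following Python sketch shows the classical idea as it is commonly described: connect a and b for every triplet ab|c on the current leaf set, fail if the resulting graph is connected, and otherwise recurse on each connected component. The function name, the triplet encoding `(a, b, c)` for ab|c and the nested-tuple output are assumptions made for this illustration, and the returned tree is not necessarily binary.

```python
def build(leaf_set, triplets):
    """Return a tree (nested tuples) displaying all triplets, or None if none exists."""
    leaf_set = set(leaf_set)
    if len(leaf_set) == 1:
        return next(iter(leaf_set))

    # Union-find over the leaves; each triplet ab|c merges a and b.
    parent = {x: x for x in leaf_set}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    relevant = [t for t in triplets if set(t) <= leaf_set]
    for a, b, _ in relevant:
        parent[find(a)] = find(b)

    components = {}
    for x in leaf_set:
        components.setdefault(find(x), set()).add(x)
    if len(components) == 1:
        return None  # the leaves cannot be split at the root: inconsistent

    subtrees = []
    for component in components.values():
        subtree = build(component, relevant)
        if subtree is None:
            return None
        subtrees.append(subtree)
    return tuple(subtrees)


# Hypothetical example: ab|c alone is consistent, ab|c together with ac|b is not.
print(build("abc", [("a", "b", "c")]))
print(build("abc", [("a", "b", "c"), ("a", "c", "b")]))
```

The species graph introduced later in this section plays the same role as the auxiliary graph on the leaves used here.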
Let G=(V,M⊎U) be an edge-bicolored graph and for F⊆U, define the graph G(F)=(V,M∪F).
The first lemma proves that unknown edges between connected components of G(∅) can be removed (i.e. frozen as paralogy relations between genes).
Lemma 1
Let (G,s) be a constraint graph with G=(V,M⊎U). Let CC be the connected components of G(∅), and let \(U_{CC} = \bigcup _{C \in CC}U[C]\). There exists a consistent sandwich graph of (G,s) if and only if there exists a consistent sandwich graph of (G _{ CC }=(V,M⊎U _{ CC }),s).
Proof
Suppose first that there exists a consistent sandwich graph G ^{′}=(V,E ^{′}) of (G,s) and let S be a species tree such that (G ^{′},s) is S-consistent. For every C∈CC, by Corollary 1 (G ^{′}[C],s_{C}) is consistent with S_{s(C)} and hence with S. Then the disjoint union G ^{′′} of the G ^{′}[C] is a sandwich cograph of (G_{CC},s). Moreover we clearly have P_{3}(G ^{′′},s)=∪_{C} P_{3}(G ^{′}[C],s_{C}), implying that (G ^{′′},s) is also consistent with S. The converse is symmetric. □
Reduction Rule 1
Let (G,s) be a constraint graph with G=(V,M⊎U). Remove from U every edge xy such that x and y belong to distinct connected components of G(∅).
Note that although we remove all edges between connected components of G(∅), we cannot solve the problem on each connected component independently, and so we cannot assume that G(∅) is connected. The reason is that for two connected components C,D of G(∅), a solution for (G[C],s _{C }) may be consistent with a different species tree than a solution for (G[D],s _{D }). To avoid conflicts between solutions on different subgraphs, we must split the graph into subgraphs on disjoint sets of species.
From now on, we may assume that |s(V)|>1. Otherwise, Assumption 1 implies that M=U=∅, and thereby (G,s) is a trivial positive instance. For the sake of the algorithm, we define an auxiliary graph H_{G,s}=(Σ,F) on the species set, called hereafter the species graph. For each pair of distinct species a,b∈Σ, add ab to F if there exist x,y∈V such that x and y are in the same connected component of G(∅), x and y are not adjacent in G(U), and s(x)=a, s(y)=b.
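The construction of the species graph can be sketched in Python as follows. The encodings (edges as two-element frozensets, `s` as a dictionary) and the helper names `components` and `species_graph` are assumptions made for this illustration.

```python
from itertools import combinations


def components(nodes, edges):
    """Connected components of the undirected graph (nodes, edges)."""
    adjacency = {v: set() for v in nodes}
    for edge in edges:
        u, v = tuple(edge)
        adjacency[u].add(v)
        adjacency[v].add(u)
    seen, result = set(), []
    for start in nodes:
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            v = stack.pop()
            if v not in component:
                component.add(v)
                stack.extend(adjacency[v] - component)
        seen |= component
        result.append(component)
    return result


def species_graph(vertices, M, U, s):
    """Edge set F of the species graph of the constraint graph (V, M, U) with assignment s."""
    known = M | U
    F = set()
    for component in components(vertices, M):  # components of G(∅) = (V, M)
        for x, y in combinations(component, 2):
            # x and y are forced paralogs from different species.
            if s[x] != s[y] and frozenset((x, y)) not in known:
                F.add(frozenset((s[x], s[y])))
    return F


# Hypothetical example: the mandatory path x - z - y forces the species edge AB.
M = {frozenset(("x", "z")), frozenset(("z", "y"))}
print(species_graph(["x", "y", "z"], M, set(), {"x": "A", "y": "B", "z": "C"}))
```

Each edge ab of F records that some induced path x-z-y with s(x)=a and s(y)=b is unavoidable, so a and b must end up on the same side of the root of any compatible species tree.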
Lemma 2
Let (G,s) be a constraint graph reduced by Reduction Rule 1. If the species graph H _{ G,s } is connected, then (G,s) does not have a consistent sandwich graph.
Proof
Consider an arbitrary binary species tree S, and an arbitrary sandwich graph G ^{′}=(V,E ^{′}) of (G,s). We show that P _{3}(G ^{′},s) contains a triplet not displayed by S.
Let A=LEAF_{ S }(u _{ A }) and B=LEAF_{ S }(u _{ B }) where u _{ A } and u _{ B } are the children of ROOT(S). Note that A and B partition the set of species Σ. As H _{ G,s } is connected, there exists a∈A,b∈B such that a b∈F. Therefore there exist x,y∈V such that x,y are in the same connected component C of G(∅),s(x)=a,s(y)=b and x y∉M∪U.
As G ^{′}[C] is connected, there exists a chordless path P from x to y in G ^{′}. By Proposition 2, G ^{′} is P_{4}-free. This implies that P contains, in addition to x and y, a third vertex z such that xz∈E ^{′} and zy∈E ^{′}.
Assume without loss of generality that s(z)∈A (the case s(z)∈B is symmetric). Then we have s(x)s(y)|s(z)∈P_{3}(G ^{′},s). Note however that LCA_{S}(s(y),s(z))=ROOT(S) (as s(z)∈A, s(y)∈B), while LCA_{S}(s(x),s(z)) is a descendant of LCA_{S}(A). It follows that LCA_{S}(s(x),s(z)) is different from LCA_{S}(s(y),s(z)), and so s(x)s(y)|s(z) is not displayed by S. □
The next lemma shows how to use connected components of the species graph in order to freeze some unknown edges to orthology relations between genes.
Lemma 3
Let (G,s) be a constraint graph reduced by Reduction Rule 1 such that the species graph H _{ G,s } is not connected.
Let A be the vertices of a connected component of the species graph H_{G,s} and let B=Σ∖A. Let G_{A}=(V_{A},M[V_{A}]⊎U[V_{A}]) and G_{B}=(V_{B},M[V_{B}]⊎U[V_{B}]), where V_{A}=s^{−1}(A) and V_{B}=s^{−1}(B). There exists a consistent sandwich graph of (G,s) if and only if there exist consistent sandwich graphs of \((G_{A},s_{V_{A}})\) and of \((G_{B},s_{V_{B}})\).
Proof
Let \(G^{\prime }_{A}\) and \(G^{\prime }_{B}\) be respectively consistent sandwich graphs of \((G_{A},s_{V_{A}})\) and of \((G_{B},s_{V_{B}})\). Suppose that \(G^{\prime }_{A}\) is consistent with the species tree S_{A} and \(G^{\prime }_{B}\) with S_{B}. For every connected component C of G(∅), let \(G^{\prime }_{C}\) be the series composition of \(G^{\prime }_{A}[C]\) and \(G^{\prime }_{B}[C]\) and let G ^{′}=(V,E ^{′}) be the disjoint union of all \(G^{\prime }_{C}\)’s. We now show that (G ^{′},s) is a consistent sandwich graph of (G,s).
As \(G_{A}^{\prime }\) and \(G_{B}^{\prime }\) are cographs, by construction G ^{′} is a cograph too. Now, as \(G_{A}^{\prime }\) and \(G_{B}^{\prime }\) are respectively sandwich graphs of \((G_{A},s_{V_{A}})\) and \((G_{B},s_{V_{B}})\), and as there is no edge in M between different connected components of G(∅), we have that M⊆E ^{′}. By construction of H_{G,s} and the fact that H_{G,s} has no edges between A and B, for every connected component C of G(∅), if x∈V_{A}∩C and y∈V_{B}∩C, then xy∈M∪U. As \(G_{A}^{\prime }\) and \(G_{B}^{\prime }\) are respectively sandwich graphs of \((G_{A},s_{V_{A}})\) and \((G_{B},s_{V_{B}})\), this implies that E ^{′}⊆M∪U. It follows that G ^{′} is a sandwich graph of G.

It remains to show that (G ^{′},s) is consistent. Let S be the species tree whose root has ROOT(S_{A}) and ROOT(S_{B}) as children, and let s(x)s(y)|s(z) be a triplet in P_{3}(G ^{′},s), so that xz,yz∈E ^{′} and xy∉E ^{′}. We show that S displays s(x)s(y)|s(z).
If {s(x),s(y),s(z)}⊆A (the case {s(x),s(y),s(z)}⊆B is symmetric), then s(x)s(y)|s(z)∈\(P_{3}(G^{\prime }_{A},s_{V_{A}})\) and is displayed by S_{A} and thereby by S as well.

Otherwise, as xz,yz∈E ^{′}, x and y are connected in G ^{′} and so by construction of G ^{′}, we have that x,y∈C for some connected component C of G(∅). As xy∉E ^{′}, by construction of \(G_{C}^{\prime }\) either {s(x),s(y)}⊆A or {s(x),s(y)}⊆B. Suppose w.l.o.g. that the former holds, implying s(z)∈B. Observe then that s(x)s(y)|s(z) is displayed by S. Indeed, we have LCA_{S}(s(x),s(z))=LCA_{S}(s(y),s(z))=ROOT(S), and LCA_{S}(s(x),s(y)) is a descendant of ROOT(S_{A}).
The converse follows from Corollary 1. □
The correctness of the next branching rule follows from Lemmas 2 and 3.
Branching Rule 1
Let (G,s) be a constraint graph reduced by Reduction Rule 1 such that the species graph H_{G,s} is not connected. Let A be a connected component of the species graph H_{G,s} and let B=Σ∖A. Solve CONSISTENT ORTHOLOGY GRAPH SANDWICH on \((G_{A}, s_{V_{A}})\) and \((G_{B}, s_{V_{B}})\) where V_{A}=s^{−1}(A) and V_{B}=s^{−1}(B). If there exist \(G^{\prime }_{A}=(V_{A},E^{\prime }_{A})\) and \(G^{\prime }_{B}=(V_{B},E^{\prime }_{B})\) that are respectively consistent sandwich graphs of \((G_{A}, s_{V_{A}})\) and \((G_{B}, s_{V_{B}})\), then return \(G^{\prime }=(V,E^{\prime }_{A}\cup E^{\prime }_{B}\cup M^{\prime })\), where M ^{′}={xy∈M∪U:x∈V_{A},y∈V_{B}}. Otherwise, return NULL.
We can now give the pseudocode of the algorithm, which essentially consists of alternately applying Reduction Rule 1 and Branching Rule 1.
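The alternation of Reduction Rule 1 and Branching Rule 1 can be sketched in Python as follows. This is an illustrative reading of the rules stated above, not the authors' implementation: the function names, the encoding of edges as two-element frozensets and of `s` as a dictionary are assumptions, and no attempt is made to match the data structures needed for the stated running time.

```python
from itertools import combinations


def components(nodes, edges):
    """Connected components of the undirected graph (nodes, edges)."""
    adjacency = {v: set() for v in nodes}
    for edge in edges:
        u, v = tuple(edge)
        adjacency[u].add(v)
        adjacency[v].add(u)
    seen, result = set(), []
    for start in nodes:
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            v = stack.pop()
            if v not in component:
                component.add(v)
                stack.extend(adjacency[v] - component)
        seen |= component
        result.append(component)
    return result


def consistent_sandwich(V, M, U, s):
    """Return the edge set of a consistent sandwich graph, or None if none exists."""
    V = set(V)
    species = {s[v] for v in V}
    if len(species) <= 1:
        return set()  # by Assumption 1, M and U are empty here

    # Reduction Rule 1: drop unknown edges between components of G(∅).
    comps = components(V, M)
    comp_of = {v: i for i, comp in enumerate(comps) for v in comp}
    U = {e for e in U if len({comp_of[v] for v in e}) == 1}
    known = M | U

    # Species graph: one edge per pair of forced paralogs in a common component.
    F = set()
    for comp in comps:
        for x, y in combinations(comp, 2):
            if s[x] != s[y] and frozenset((x, y)) not in known:
                F.add(frozenset((s[x], s[y])))
    species_comps = components(species, F)
    if len(species_comps) == 1:
        return None  # Lemma 2: a connected species graph means no solution

    # Branching Rule 1: split the species into A and B and recurse.
    A = species_comps[0]
    V_A = {v for v in V if s[v] in A}
    V_B = V - V_A
    E_A = consistent_sandwich(V_A, {e for e in M if e <= V_A}, {e for e in U if e <= V_A}, s)
    if E_A is None:
        return None
    E_B = consistent_sandwich(V_B, {e for e in M if e <= V_B}, {e for e in U if e <= V_B}, s)
    if E_B is None:
        return None
    cross = {e for e in known if len(e & V_A) == 1}  # all remaining A-B edges become orthologies
    return E_A | E_B | cross


# Hypothetical example: a mandatory path x - z - y with an unknown edge xy.
M = {frozenset(("x", "z")), frozenset(("z", "y"))}
U = {frozenset(("x", "y"))}
print(consistent_sandwich("xyz", M, U, {"x": "A", "y": "B", "z": "C"}))
```

The edge set returned may depend on which component of the species graph is chosen as A; any choice yields a consistent sandwich graph when one exists.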
Theorem 2
Given a constraint graph (G,s), the CONSISTENT ORTHOLOGY GRAPH SANDWICH problem can be solved in O(n ^{3}) time, where n is the number of genes in G.
Proof
The correctness of Algorithm 1 follows from the correctness of Reduction Rule 1 (Lemma 1) and Branching Rule 1 (Lemmas 2 and 3).
To analyze the running time of Algorithm 1, we simply observe that the recursive calls define a binary tree structure with O(|Σ|)=O(n) nodes. As each step of the recursion can clearly be performed in quadratic time, the O(n^{3}) bound follows. □
We can adapt the algorithm to cases when the species tree S is partially known, by adjusting the construction of H_{G,s}. In particular, for any x,y,z∈V for which it is known that S displays the triplet s(x)s(y)|s(z), we add s(x)s(y) as an edge in H_{G,s}.
Algorithm 1 has important applications. When the species tree is not known, it allows us to differentiate constraint graphs that are consistent with a species tree from those that are not; the latter cannot be depicted by a consistent DS-tree, and should be considered as phylogenetically irrelevant and discarded. When the species tree S is known and a given constraint graph C is not consistent with it, the sandwich graph returned by Algorithm 1 shows to what extent C and S are in contradiction. Furthermore, if S contains some uncertainties, it allows us to see if the contradictions between C and S lie in the “uncertainty zone” of S. This may help to correct the species tree.
See the “Results and Discussion” section for an example of application on real data.
Hardness of optimizing the duplication nodes
Given a constraint graph (G,s) for which there exist several possible DS-trees, we may be interested in finding one minimizing the number of duplication nodes. Duplication minimization is a well-known criterion in phylogenomics [4, 11]; for example, it is used to resolve polytomies in gene trees in [19] and to estimate the species tree in [20].
In this section, we consider the following three optimization variants of the ORTHOLOGY GRAPH SANDWICH problem in which the number of duplication nodes has to be minimized. We prove hardness results for each of these problems.
k-DUPLICATION ORTHOLOGY GRAPH SANDWICH problem (k-DOGS)
Input: a constraint graph (G,s) and an integer k;
Question: does there exist a DS-tree (T,s) containing at most k duplication nodes, whose orthology graph is a sandwich of G?
The above problem is equivalent to asking if (G,s) is satisfiable and there exists a DS-tree for (G,s) containing at most k duplication nodes.
SPECIES TREE CONSISTENT k-DUPLICATION ORTHOLOGY GRAPH SANDWICH problem (S-CONS k-DOGS)
Input: a constraint graph (G,s), with G=(V,M⊎U) and s:V→Σ a species assignment, a species tree S on Σ and an integer k;
Question: does there exist a DS-tree (T,s) containing at most k duplication nodes, whose orthology graph is a sandwich of G, and is consistent with S?
CONSISTENT k-DUPLICATION ORTHOLOGY GRAPH SANDWICH problem (CONS k-DOGS)
Input: a constraint graph (G,s), with G=(V,M⊎U) and s:V→Σ a species assignment, and an integer k;
Question: does there exist a DS-tree (T,s) containing at most k duplication nodes and a species tree S, such that the orthology graph of (T,s) is a sandwich of G and is consistent with S?
We first provide a reduction from 3-COLORING that proves that k-DOGS is para-NP-hard [21] with respect to the number of duplication nodes k (that is, k-DOGS is NP-hard for some fixed k). This implies that, unless P=NP, k-DOGS does not belong to the complexity class XP, meaning that the problem cannot be solved in time O(n^{f(k)}) for any function f. In what follows, [k] denotes the set {1,…,k}.
k-COLORING problem
Input: a (connected) graph G=(V,E);
Question: does there exist a k-coloring of G, i.e. a function c:V→[k] such that c(x)≠c(y) for every xy∈E?
The following lemma will be useful in this section. An equivalent version of this lemma could be written in terms of cographs, and we believe a proof for such a lemma should already exist in the literature. However, as we were unable to find such a proof, we give one here.
Lemma 4
Let (G,s) be an orthology graph with a DS-tree containing at most k duplication nodes. Then we can find a (k+1)-coloring of its complement \(\overline {G}\) in polynomial time.
Proof
Let (G=(V,E),s) be an orthology graph. We prove the claim by induction on |V|.
If |V|=1, then there are 0 duplication nodes in a DS-tree for (G,s), and \(\overline {G}\) has a 1-coloring, as required.
So now suppose the claim holds for all orthology graphs (G ^{′}=(V ^{′},E ^{′}),s ^{′}) with |V ^{′}|<|V|. Let (T,s) be a DS-tree for (G,s) with at most k duplication nodes. Consider ROOT(T). If ROOT(T) is a duplication node, then G is disconnected, and we can find a partition V=V_{A}⊎V_{B} such that there is no edge between V_{A} and V_{B} in G. Moreover, the number of duplication nodes in T is k_{A}+k_{B}+1, where k_{A} is the number of duplication nodes in a DS-tree for G[V_{A}], and k_{B} is the number of duplication nodes in a DS-tree for G[V_{B}]. By the inductive hypothesis, there exists a (k_{A}+1)-coloring for \(\overline {G[V_{A}]}\), and a (k_{B}+1)-coloring for \(\overline {G[V_{B}]}\). It is clear that we can combine these colorings into a coloring of \(\overline {G}\) with k_{A}+1+k_{B}+1≤k+1 colors.
If ROOT(T) is a speciation node, then \(\overline {G}\) is disconnected, and we can find a partition V=V_{A}⊎V_{B} such that there are no edges between V_{A} and V_{B} in \(\overline {G}\). Moreover, the number of duplication nodes in a DS-tree for G[V_{A}] (G[V_{B}], respectively) is at most k. By the inductive hypothesis, there exists a (k+1)-coloring for \(\overline {G[V_{A}]}\) and a (k+1)-coloring for \(\overline {G[V_{B}]}\), and these can be combined into a (k+1)-coloring for \(\overline {G}\).
This proof can be turned into a polynomial time algorithm as follows. If G is disconnected, find a partition V=V _{ A }⊎V _{ B } with no edges between V _{ A } and V _{ B } in G, and recursively find colorings for \(\overline {G[V_{A}]}\) and \(\overline {G[V_{B}]}\), adjusting the coloring on \(\overline {G[V_{B}]}\) to assign different values from those assigned by the coloring on \(\overline {G[V_{A}]}\).
Otherwise, find a partition V=V _{ A }⊎V _{ B } with no edges between V _{ A } and V _{ B } in \(\overline {G}\), and recursively find colorings for \(\overline {G[V_{A}]}\) and \(\overline {G[V_{B}]}\). As each recursion splits the set of vertices and each recursive step takes polynomial time, the whole algorithm takes polynomial time. □
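The recursive procedure described in this proof can be sketched in Python as follows. The sketch splits into all connected components at once rather than into two parts, which does not change the idea; the function names and the encoding of edges as two-element frozensets are assumptions, and colors are numbered from 0.

```python
from itertools import combinations


def components(nodes, edges):
    """Connected components of the undirected graph (nodes, edges)."""
    adjacency = {v: set() for v in nodes}
    for edge in edges:
        u, v = tuple(edge)
        adjacency[u].add(v)
        adjacency[v].add(u)
    seen, result = set(), []
    for start in nodes:
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            v = stack.pop()
            if v not in component:
                component.add(v)
                stack.extend(adjacency[v] - component)
        seen |= component
        result.append(component)
    return result


def color_complement(V, E):
    """Color the complement of the cograph (V, E); returns a dict vertex -> color."""
    V = set(V)
    if len(V) == 1:
        return {next(iter(V)): 0}
    E = {e for e in E if e <= V}
    comps = components(V, E)
    if len(comps) > 1:
        # G is disconnected: parts are fully adjacent in the complement,
        # so each part needs its own range of colors.
        coloring, offset = {}, 0
        for comp in comps:
            sub = color_complement(comp, E)
            for v, c in sub.items():
                coloring[v] = c + offset
            offset += max(sub.values()) + 1
        return coloring
    # G is connected: its complement is disconnected, so colors can be reused.
    co_edges = {frozenset(p) for p in combinations(V, 2)} - E
    co_comps = components(V, co_edges)
    if len(co_comps) == 1:
        raise ValueError("input graph is not a cograph")
    coloring = {}
    for comp in co_comps:
        coloring.update(color_complement(comp, E))
    return coloring


# Hypothetical example: two disjoint edges ab and cd (one duplication suffices).
E = {frozenset(("a", "b")), frozenset(("c", "d"))}
print(color_complement("abcd", E))
```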
Lemma 5
Given a connected graph G=(V,E), define a constraint graph (H=(V,M⊎U),s) by setting M=∅ and \(U=\overline E\), and letting s:V→Σ be an arbitrary species assignment such that each gene in V is assigned to a different species. Then for any integer k>0, G is k-colorable if and only if (H,s) has a solution with at most k−1 duplication nodes. Furthermore, if such a solution exists then there exists a solution consistent with an arbitrary species tree on Σ.
Proof
Assume that G is k-colorable, with c:V→[k] a k-coloring of G. Let V_{i}=c^{−1}(i) for each i∈[k]. Thus V_{1},…,V_{k} form a partition of V. For each i∈[k], let \((T_{i}, s_{V_{i}})\) be an arbitrary DS-tree with leaves V_{i} such that every internal node is a speciation node, and let x_{i} denote the root of T_{i}. We now construct a DS-tree (T,s) as follows. Let z_{1},…,z_{k−1} be duplication nodes such that ROOT(T)=z_{1}, such that for each i∈[k−2], z_{i} has child nodes x_{i} and z_{i+1}, and the children of z_{k−1} are x_{k−1} and x_{k}. Now consider the graph H ^{′}=(V,E ^{′}) obtained from the disjoint union of cliques on V_{i} for 1≤i≤k. Observe that H ^{′} is a sandwich graph of (H,s). Moreover by construction, we have that xy∈E ^{′} if and only if LCA_{T}(x,y) is a speciation node. Moreover (T,s) has k−1 duplication nodes, so H ^{′} is a solution. To conclude the proof of the first claim, observe that the converse follows from Lemma 4. To see the second claim, observe that as H ^{′} is a disjoint union of cliques, P_{3}(H ^{′},s)=∅ and therefore (H ^{′},s) is consistent with any species tree on Σ. □
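The forward direction of this proof is constructive, and a minimal Python sketch of the construction is given here. The DS-tree encoding `(label, left, right)` and the function names are assumptions; the duplication nodes are chained as a caterpillar whose shape differs slightly from the one in the proof, but which uses the same number k−1 of duplication nodes.

```python
def ds_tree_from_coloring(coloring):
    """Build a DS-tree with k-1 duplication nodes from a k-coloring (dict vertex -> color)."""
    classes = {}
    for vertex, color in sorted(coloring.items()):
        classes.setdefault(color, []).append(vertex)

    def caterpillar(label, items):
        # Chain the items under internal nodes that all carry the same label.
        tree = items[0]
        for item in items[1:]:
            tree = (label, tree, item)
        return tree

    # One all-speciation subtree per color class, joined by duplication nodes.
    class_trees = [caterpillar("Spec", members) for _, members in sorted(classes.items())]
    return caterpillar("Dup", class_trees)


def count_duplications(tree):
    """Number of duplication nodes in a DS-tree."""
    if isinstance(tree, str):
        return 0
    label, left, right = tree
    return (label == "Dup") + count_duplications(left) + count_duplications(right)


# Hypothetical example: the path a - b - c is 2-colored, giving one duplication node.
tree = ds_tree_from_coloring({"a": 1, "b": 2, "c": 1})
print(tree, count_duplications(tree))
```

In this example the only orthology edge of the resulting tree is ac, which is a non-edge of the input path and hence an unknown edge of the constraint graph, as the lemma requires.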
We now prove the NP-hardness of 2DOGS, SCONS2DOGS and CONS2DOGS, using Lemma 5 and the fact that 3-COLORING is NP-hard [22].
Theorem 3
2DOGS is NP-hard.
Proof
Given an instance G=(V,E) of 3-COLORING, which we may assume to be connected (3-COLORING remains NP-hard on connected graphs), let (H,s) be the constraint graph given by Lemma 5. Then by Lemma 5, (H,s,2) is a YES-instance of kDOGS if and only if G is 3-colorable. As 3-COLORING is NP-hard, so is 2DOGS. □
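To make the reduction concrete, the following Python sketch builds the instance (H,s,2) from a graph G as prescribed by Lemma 5. The function name `dogs_instance_from_graph` and the species labels are hypothetical, introduced only for illustration.

```python
from itertools import combinations


def dogs_instance_from_graph(vertices, edges, k=3):
    """Map a k-COLORING instance G=(V,E) to (M, U, s, k-1) as in Lemma 5."""
    vertices = sorted(vertices)
    edge_set = {frozenset(e) for e in edges}
    mandatory = set()  # M is empty
    # U is the complement of E: every non-adjacent pair of G becomes an unknown edge.
    unknown = {frozenset(p) for p in combinations(vertices, 2)} - edge_set
    # s assigns each gene its own species (the labels are placeholders).
    species = {v: f"species_{i}" for i, v in enumerate(vertices)}
    return mandatory, unknown, species, k - 1


# The path a-b-c: only the pair {a, c} is non-adjacent, so U = {{a, c}}.
M, U, s, budget = dogs_instance_from_graph("abc", [("a", "b"), ("b", "c")])
```

The edges of G themselves appear in neither M nor U, and the construction clearly takes time polynomial in the size of G.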
Using the same technique as for Theorem 3, we can prove the same NP-hardness result for SCONS2DOGS and CONS2DOGS. The proofs are identical to that of Theorem 3, except that in the case of Theorem 4 we construct an arbitrary species tree S on Σ in addition to the constraint graph (H,s).
Theorem 4
SCONS2DOGS is NP-hard.
Theorem 5
CONS2DOGS is NP-hard.
Let MINDOGS, SCONSMINDOGS, and CONSMINDOGS denote the minimization versions of kDOGS, SCONS kDOGS, and CONS kDOGS respectively, in which we want to find a solution with the minimum number of duplication nodes. Let GRAPH COLORING denote the minimization version of k-COLORING. As GRAPH COLORING has no polynomial-time \(n^{1-\epsilon ^{\prime }}\)-approximation for any ε ^{′}>0, unless P=NP [23], we can prove the following theorem.
Theorem 6
For any ε>0, there is no polynomial-time algorithm that takes as input an instance of MINDOGS, and returns a solution with at most n ^{1−ε }·k duplication nodes if there exists a solution with at most k duplication nodes, unless P=NP.
Proof
Let G=(V,E) be an instance of GRAPH COLORING. Without loss of generality we may assume that G is connected. Let (H,s) be the constraint graph given by Lemma 5.
Now for any ε>0, fix an integer n _{0} and ε ^{′}>0 such that \(n^{1-\epsilon } +1 < n^{1-\epsilon ^{\prime }}\) for any n≥n _{0}.
Suppose that there exists a polynomial-time n ^{1−ε }-approximation for MINDOGS, i.e. an algorithm that for any instance (H,s) with n vertices, finds a solution with at most n ^{1−ε }·k duplication nodes if there exists a solution with at most k duplication nodes. We show that there exists a polynomial-time \(n^{1-\epsilon ^{\prime }}\)-approximation for GRAPH COLORING.
Let G be an instance of GRAPH COLORING with n vertices, and suppose without loss of generality that n≥n _{0} (as otherwise the problem can be solved exactly in polynomial time). Let (H,s) be the instance of MINDOGS constructed from G as above. Now run the supposed approximation algorithm for MINDOGS on (H,s). If G is k-colorable for some k>1, then by Lemma 5, there exists a solution for (H,s) with at most k−1 duplication nodes. Therefore if G is k-colorable, the algorithm returns a solution with at most n ^{1−ε }·(k−1) duplication nodes. (Note that we may assume the solution contains at least 1 duplication node, as otherwise G would be disconnected.) Let (H ^{′},s) be the orthology graph for this solution. Then by Lemma 4, we have an (n ^{1−ε }·(k−1)+1)-coloring for \(\overline {H^{\prime }}\). As G is a subgraph of \(\overline {H^{\prime }}\), this is also an (n ^{1−ε }·(k−1)+1)-coloring for G.
As \(n \ge n_{0}\), we have \(n^{1-\epsilon }\cdot (k-1) +1 \le (n^{1-\epsilon } +1) \cdot k \le n^{1-\epsilon ^{\prime }}\cdot k\), and so we have a polynomial-time \(n^{1-\epsilon ^{\prime }}\)-approximation for GRAPH COLORING, a contradiction. □
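For completeness, the inequalities used in this last step can be spelled out as follows. The choice ε ^{′}=ε/2 is only one admissible choice, made here for illustration, and we assume 0<ε≤1 and k≥1. Since \(n^{1-\epsilon } \ge 1\),

\[
n^{1-\epsilon}\cdot (k-1)+1 \;=\; n^{1-\epsilon}\cdot k - n^{1-\epsilon} + 1 \;\le\; n^{1-\epsilon}\cdot k + k \;=\; (n^{1-\epsilon}+1)\cdot k .
\]

Moreover, taking \(\epsilon ^{\prime } = \epsilon /2\) and any integer \(n_{0} > 2^{2/\epsilon }\), every \(n \ge n_{0}\) satisfies \(n^{\epsilon /2} > 2\), hence

\[
n^{1-\epsilon}+1 \;\le\; 2\, n^{1-\epsilon} \;<\; n^{\epsilon/2}\cdot n^{1-\epsilon} \;=\; n^{1-\epsilon^{\prime}} ,
\]

which gives \((n^{1-\epsilon }+1)\cdot k \le n^{1-\epsilon ^{\prime }}\cdot k\).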
Using the same technique as for Theorem 6, we can prove the same inapproximability result for SCONSMINDOGS and CONSMINDOGS. The proofs are identical to that of Theorem 6, except that in the case of Theorem 7 we construct an arbitrary species tree S on Σ in addition to the constraint graph (H,s).
Theorem 7
For any ε>0, there is no polynomial-time algorithm that takes as input an instance of SCONSMINDOGS, and returns a solution with at most n ^{1−ε }·k duplication nodes if there exists a solution with at most k duplication nodes, unless P=NP.
Theorem 8
For any ε>0, there is no polynomial-time algorithm that takes as input an instance of CONSMINDOGS, and returns a solution with at most n ^{1−ε }·k duplication nodes if there exists a solution with at most k duplication nodes, unless P=NP.
To summarise the results in this section: given a constraint graph on n vertices, it is NP-hard to find a DS-tree for that graph with at most k duplication nodes, even when k=2. This holds regardless of whether we require the DS-tree to be consistent, or whether we are given a species tree that it should be consistent with. Viewed as a minimization problem, it is NP-hard even to find an n ^{1−ε }-approximate solution, for any ε>0.
Results and Discussion
We integrated Algorithm 1 into the software provided at [24] by the authors of [9]. Note that the previous version of the program could only check satisfiability, and consistency of a constraint graph with respect to a given species tree S.
We used the modified software to reanalyze the data set in [9]. This data set was constructed by randomly choosing 265 gene families of vertebrates with more than 20 genes from Ensembl [25]. Each gene family was then analysed with ProteinOrtho [26] using 9 different parameter settings, yielding 2385 different constraint graphs. Here S is the Ensembl species tree, which can be downloaded at [27].
In this data set, all satisfiable constraint graphs but one are also consistent. In 533 out of 2385 cases, the constraint graph was found to be consistent, but not consistent with S. We were interested in finding out how greatly the graphs in this set (denoted \(\mathcal {CG}\)) conflicted with S. Indeed, the positions of some taxa in the Ensembl species tree, for example Equus, Tupaia and Cavia, do not enjoy a consensus in the community, so some contradictions with S are expected.
Note that we can use the graph G ^{′} output by Algorithm 1 to obtain a species tree in the following way: we compute the set \(\mathcal {T}\) of all P _{3}(G ^{′},s) and then feed \(\mathcal {T}\) to the BUILD algorithm [18], which returns a species tree displaying all the triplets in \(\mathcal {T}\) (in practice, our implementation of Algorithm 1 is able to construct a species tree directly).
This species tree may fail to be binary if the information contained in \(\mathcal {T}\) is sparse (this is actually the case for our data set: the maximum number of internal nodes over all species trees reconstructed by our approach from constraint graphs in \(\mathcal {CG}\) was 6, with an average of 1.5).
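As an illustration of the last step, here is a minimal Python sketch of the BUILD algorithm [18]. It is not the code of [24]: the triplet encoding (a, b, c) for ab|c, the nested-tuple output and the function name `build` are assumptions made for this example, and the computation of P _{3}(G ^{′},s) itself is not shown.

```python
def build(leaves, triplets):
    """Return a tree displaying every triplet (a, b, c), read as ab|c, or None."""
    leaves = set(leaves)
    if len(leaves) == 1:
        return next(iter(leaves))
    # Union-find over the leaves: the Aho graph joins a and b for each triplet ab|c
    # whose three leaves all belong to the current leaf set.
    parent = {x: x for x in leaves}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, c in triplets:
        if a in leaves and b in leaves and c in leaves:
            parent[find(a)] = find(b)
    blocks = {}
    for x in leaves:
        blocks.setdefault(find(x), set()).add(x)
    if len(blocks) == 1:
        return None  # the triplets are incompatible on this leaf set
    # Each connected block becomes one child subtree of the current node.
    children = []
    for block in sorted(blocks.values(), key=min):
        subtree = build(block, triplets)
        if subtree is None:
            return None
        children.append(subtree)
    return tuple(children)


resolved = build(["a", "b", "c", "d"], [("a", "b", "c"), ("c", "d", "a")])
unresolved = build(["a", "b", "c"], [])
```

In this sketch a node is a tuple of its children, so a tuple with more than two entries is an unresolved node. With no triplets at all, the result is a single unresolved node over all leaves, which is how a sparse triplet set leads to a non-binary species tree.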
Conclusions
In this paper, we extend the results of [9] by giving an O(n ^{3}) time algorithm to decide whether \(\mathcal {C}\) is consistent, even when the species tree is not known and \(\mathcal {C}\) is not full. We also incorporated this algorithm into the software provided at [24]. The algorithm has important applications in providing evidence for the structure of a species tree when that species tree is unknown. It also allows us to see how much an ‘inconsistent’ set of constraints is in conflict with a known species tree, as the algorithm returns a species tree with which those constraints are consistent, if any exists. On the negative side, we show that the problem of minimizing duplication nodes in DS-trees is NP-hard even when the number of duplications is very small, and that it is also hard to find approximate solutions for this criterion.
Declarations
Acknowledgements
The authors would like to thank Manuel Lafond for providing the Python script to check consistency with a known species tree.
Declarations
This article has been published as part of BMC Bioinformatics Vol 17 Suppl 14, 2016: Proceedings of the 14th Annual Research in Computational Molecular Biology (RECOMB) Comparative Genomics Satellite Workshop: bioinformatics. The full contents of the supplement are available online at https://bmcbioinformatics.biomedcentral.com/articles/supplements/volume17supplement14.
Funding
Research of MJ was funded by Labex NUMEV (ANR10LABX20).
Availability of data and materials
Our polynomial algorithm for checking consistency has been implemented in Python and is available at [24]. The gene families and Ensembl species tree used in our experiments are available at ensembl.org. The datasets analysed during this study are available from the corresponding author upon request. An implementation of the PhySIC_IST method is available at http://www.atgcmontpellier.fr/physic_ist/.
Authors’ contributions
Designed the algorithms: CP MJ CS. Implemented the algorithms: MJ. Designed, performed and analyzed the experiments: MJ CS. Wrote the paper: CP MJ CS. All authors have read and approved the final version of the manuscript.
Competing interests
The authors declare that they have no competing interests.
Consent for publication
Not applicable.
Ethics approval and consent to participate
Not applicable.
Publication funding
Publication charges for this article have been funded by the French Agence Nationale de la Recherche Investissements d’Avenir/ Bioinformatique (ANR10BINF0102, Ancestrome).
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License(http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
References
1. Fitch WM. Distinguishing homologous from analogous proteins. Syst Biol. 1970; 19(2):99–113.
2. Altenhoff AM, Dessimoz C. Inferring orthology and paralogy. Evol Genomics Stat Comput Methods. 2012; 1:259–79.
3. Schreiber F, Patricio M, Muffato M, Pignatelli M, Bateman A. Treefam v9: a new website, more species and orthology-on-the-fly. Nucleic Acids Res. 2014; 42(Database issue):D922–5. doi:10.1093/nar/gkt1055. Epub 2013 Nov 4.
4. Doyon JP, Ranwez V, Daubin V, Berry V. Models, algorithms and programs for phylogeny reconciliation. Brief Bioinform. 2011; 12(5):392–400.
5. Li L, Stoeckert CJ, Roos DS. Orthomcl: identification of ortholog groups for eukaryotic genomes. Genome Res. 2003; 13(9):2178–89.
6. Bergsten J. A review of long-branch attraction. Cladistics. 2005; 21(2):163–93.
7. Hernandez-Rosales M, Hellmuth M, Wieseke N, Huber KT, Moulton V, Stadler PF. From event-labeled gene trees to species trees. BMC Bioinforma. 2012; 13(Suppl 19):6.
8. Hellmuth M, Hernandez-Rosales M, Huber KT, Moulton V, Stadler PF, Wieseke N. Orthology relations, symbolic ultrametrics, and cographs. J Math Biol. 2013; 66(1–2):399–420.
9. Lafond M, El-Mabrouk N. Orthology and paralogy constraints: satisfiability and consistency. BMC Genomics. 2014; 15(Suppl 6):12.
10. Hellmuth M, Wieseke N, Lechner M, Lenhof HP, Middendorf M, Stadler PF. Phylogenomics with paralogs. Proc Natl Acad Sci U S A. 2015; 112(7):2058–63.
11. Ma B, Li M, Zhang L. From gene trees to species trees. SIAM J Comput. 2000; 30(3):729–52.
12. Koonin EV. Orthologs, paralogs, and evolutionary genomics. Annu Rev Genet. 2005; 39:309–38.
13. Corneil D, Lerchs H, StewartBurlingham LK. Complement reducible graphs. Discret Appl Math. 1981; 3(1):163–74.
14. Böcker S, Dress AWM. Recovering symbolically dated, rooted trees from symbolic ultrametrics. Adv Math. 1998; 138(1):105–25. doi:10.1006/aima.1998.1743.
15. Bretscher A, Corneil D, Habib M, Paul C. A simple linear time lexbfs cograph recognition algorithm. SIAM J Discret Math. 2008; 22(4):1277–96.
16. Corneil D, Perl Y, Stewart LK. A linear time recognition algorithm for cographs. SIAM J Comput. 1985; 14(4):926–34.
17. Golumbic M, Kaplan H, Shamir R. Graph sandwich problems. J Algorithm. 1995; 19:449–73.
18. Aho AV, Sagiv Y, Szymanski TG, Ullman JD. Inferring a tree from lowest common ancestors with an application to the optimization of relational expressions. SIAM J Comput. 1981; 10(3):405–21.
19. Lafond M, Swenson KM, El-Mabrouk N. An optimal reconciliation algorithm for gene trees with polytomies. In: International Workshop on Algorithms in Bioinformatics. Springer Berlin Heidelberg: 2012. p. 106–22.
20. Boussau B, Szöllősi GJ, Duret L, Gouy M, Tannier E, Daubin V. Genome-scale coestimation of species and gene trees. Genome Res. 2013; 23(2):323–30.
21. Downey RG, Fellows MR. Parameterized Complexity. New York, NY, USA: Springer; 1999, p. 530.
22. Garey MR, Johnson DS. Computers and Intractability: A Guide to the Theory of NP-Completeness. New York, NY, USA: W. H. Freeman & Co.; 1979.
23. Zuckerman D. Linear degree extractors and the inapproximability of max clique and chromatic number. Theory Comput. 2007; 3(6):103–28. doi:10.4086/toc.2007.v003a006.
24. Lafond M, Jones M. GitHub - UdeMLBIT/OrthoParaConstraintChecker: A program to check if a given set of orthology/paralogy relations, with possible unknowns, is satisfiable or consistent with a species tree. https://github.com/UdeMLBIT/OrthoParaConstraintChecker. Accessed 22 July 2016.
25. Flicek P, Ahmed I, Amode MR, Barrell D, Beal K, Brent S, Carvalho-Silva D, Clapham P, Coates G, Fairley S, et al. Ensembl 2013. Nucleic Acids Res. 2012; 41(Database issue):D48–D55. Published online 2012 Nov 30. doi:10.1093/nar/gks1236.
26. Lechner M, Findeiß S, Steiner L, Marz M, Stadler PF, Prohaska SJ. Proteinortho: Detection of (co-)orthologs in large-scale analysis. BMC Bioinforma. 2011; 12(1):1.
27. Ensembl. Ensembl Species Tree. http://www.ensembl.org/info/about/speciestree.html. Accessed 22 July 2016.
28. Scornavacca C, Berry V, Lefort V, Douzery EJ, Ranwez V. Physic_ist: cleaning source trees to infer more informative supertrees. BMC Bioinforma. 2008; 9(1):413.