Original Sankoff Parsimony
Sankoff parsimony [13, 14] counts the number of evolutionary changes for a specific site in a phylogenetic tree, assuming a set of n character states i = 1, ..., n (for instance, 4 nucleotides or 20 amino acids) for which a cost matrix C = (c
ij
) of changes between states is given. Each node p of the phylogeny has assigned a cost vector, S(p), which contains the minimal evolutionary cost for each of the character states. If node p is assigned state i, the quantity reflects the minimum cost of events (state changes) from p to the root of the tree.
The original Sankoff algorithm calculates the cost vectors at each node moving from the leaves upwards to the root. Initially, the S(x)vectors at the inner nodes are unknown and those at the leaves are initialized with cost 0 for the observed character state and ∞ for the rest. For instance, if adenine is observed in a site (character) for a certain species, the cost vector would be = 0 and . The cost vector of a node p with two children q and r is:
(1)
Equation (1) states that the cost of being in character state i for node p is the cost of moving from character state i to j in child q (c
ij
) plus the cost of having reached state j at node q from the leaves (). Character j is selected to minimize this sum, with the same procedure being applied to character k in child r. Algorithm 1 presents the original implementation of Sankoff parsimony to calculate the cost vector of all nodes in a tree for a single character.
Algorithm 1 (Original Sankoff algorithm: Up phase). A procedure call Sankoff_Up(T, C, S) calculates the cost vector S(p)of all nodes p of the phylogeny T, given a cost matrix C = (c
ij
).
procedure Sankoff_Up(T, C, S)
for all nodes p of T in postorder do
if p is a leaf then
for all i in 1, ..., n do
if state i observed at leaf p then
← 0
else
← ∞
else
{q, r} ← children of p
for all i in 1, ..., n do
← cost(q, i) + cost(r, i)
function cost(x, i)
min ← ∞
for all j in 1, ..., n do
if c
ij
+ < min then
min ← c
ij
+
return min
Figure 1 presents an example for a single site in a nucleotide sequence, where cytosine, guanine, and thymine are observed at the leaves. The minimum number of changes under Sankoff parsimony is 4, as annotated in the root of the tree. As it can be observed, Algorithm 1 takes O(n2) time in the number of character states per node, since the function cost is O(n) and it is called n times. Once the cost vectors have been calculated for each inner node, we can use this information to reconstruct the ancestral states. The algorithm proceeds now top-down, and the root of the phylogeny gets assigned those states with minimum cost in the vector S(root). For any inner node p, and given that ancestral state i was reconstructed at its parent, the state j to be chosen is that for which c
ij
+ is minimized. Notice that for inner nodes this state needs not correspond with that for which is minimum. Algorithm 2 describes an implementation as presented in [[34], §6].
Algorithm 2 (Original Sankoff algorithm: Down phase). A procedure call Sankoff_Down(x, T, C, S, S
anc
) calculates the ancestral states of all nodes p of the phylogeny T, given the root x of T, a cost matrix C = (c
ij
) of transition costs between states, and the cost vectors S(p)for all nodes p of T as calculated by Sankoff_Up(T, C, S).
procedure Sankoff_Down(x, T, C, S, S
anc
)
for all j in do
for all child y of x do
Sankoff_Down(j, y, T, C, S, S
anc
)
procedure Sankoff_Down(i, x, T, C, S, S
anc
)
min_states(i, x, C, S, S
anc
)
for all j in do
for all child y of x do
Sankoff_Down(j, y, T, C, S, S
anc
)
procedure min_states(i, x, C, S, S
anc
)
min ← ∞
for all j in 1, ..., n do
if x = root(T) then
trans_cost ←
else
trans_cost ← c
ij
+
if trans_cost < min then
min ← trans_cost
← {j}
else if trans_cost = min then
← ∪ {j}
Figure 2 presents an example of the top-down phase of Sankoff parsimony. Notice that the states chosen as ancestral do not necesarily correspond to those with minimum values in the cost vectors. The node with cost vector S = (4, 3, 3, 4) has thymine as one of the two possible parsimonious ancestral states, which has larger cost than guanine. For simplicity, ties are not solved in the algorithms presented in this paper and thus inner nodes can have more than one ancestral state. If we were to differentiate among the different parsimonious reconstructions, there would be three scenarios for this example as indicated by the red lines: cytosine both at the root and the inner node, thymine both at the root and the inner node, or thymine at the root and cytosine at the inner node. Cytosine in the root and thymine in the inner node is not parsimonious. The reconstruction of ancestral states as described in Algorithm 2 takes O(n) time per node.
Optimized Sankoff Parsimony
Our optimization is based on the observation that not all transition costs between character states need to be computed if the cost matrix is ultrametric or additive.
Definition 1. A cost matrix is ultrametric if for every three indices i, j, k, one of the three following inequalities holds (three point condition):
-
c
ij
≤ c
ik
= c
jk
-
c
ik
≤ c
ij
= c
jk
-
c
jk
≤ c
ij
= c
ik
Definition 2. A cost matrix is additive if for every four indices i, j, k, ℓ, one of the three following inequalities holds (four point condition):
-
c
ij
+ ck ℓ≤ c
ik
+ cj ℓ= ci ℓ+ c
jk
-
c
ik
+ cj ℓ≤ c
ij
+ ck ℓ= ci ℓ+ c
jk
-
ci ℓ+ c
jk
≤ c
ij
+ ck ℓ= c
ik
+ cj ℓ
Notice that ultrametric matrices are also additive, since they satisfy the four point condition [35].
Consider a simple example where the cost matrix C has c
ii
= 0 and c
ij
= k for all i ≠ j. When calculating the cost in Equation (1), we can substitute min
j
(c
ij
+ ) for min(, k + minj≠i), and similarly for the other child r of p. In the more general case of ultrametric or additive cost matrices, we can efficiently represent them with a unique rooted weighted cost tree T
C
using UPGMA [36] or neighbor-joining [37] respectively. The length of the path between any two leaves i, j in T
C
corresponds to the cost c
ij
. For ultrametric matrices, consider any set of leaves L = {a, b, ...} in the tree that have the same last common ancestor, lca(L). By definition, lca(L) is equidistant to any leaf in L, and all leaves in L are at the same distance d to each other (which is double the distance from the leaf to lca(L)). For any two leaves a, b in L we can then simplify the expression min
L
(c
ab
+ ) as d + min . Therefore, given the cost tree T
C
obtained by UPGMA from the ultrametric matrix, for each state i we only need to compute the minimum costs at the last common ancestor of that state and any other, that is, the inner nodes in the path from i to the root of T
C
. With additive matrices, since the distance from an inner node to its descendant leaves can vary, we need to take into consideration the specific length of each branch when calculating the minimum. Therefore, in our algorithm each cost vector Spis replaced by a cost tree , whose inner nodes will contain the value that minimizes. c
ij
+ for all descendant leaves L. Algorithm 3 presents the optimized version of Sankoff parsimony for the calculations from the leaves to the root of the phylogeny.
Algorithm 3 (Optimized Sankoff algorithm: Up phase). A procedure call Opt_Sankoff_Up(T, T
C
, S) calculates the cost vector S(p)of all nodes p of the phylogeny T, given a cost tree T
C
representing an ultrametric or additive cost matrix C = (c
ij
).
procedure Opt_Sankoff_Up(T, T
C
, S)
for all nodes p of T in postorder do
if p is a leaf then
for all i in 1, ..., n do
if state i observed at leaf p then
← 0
else
← ∞
update(p, i, , T
C
)
else
{q, r} ← children of p
for all i in 1, ..., n do
← cost(q, i) + cost(r, i)
update(p, i, , T
C
)
procedure update(x, i, v, T
C
)
n ← leaf of corresponding to state i
cost(n) ← v
min_tags(n) ← {i}
cost ← branch length between n and its parent in
repeat
n ← parent of n in
if v + cost <cost(n) then
cost(n) ← v + cost
min_tags(n) ← {i}
else if v + cost = cost(n) then
min_tags(n) ← min_tags(n) ∪ {i}
if n ≠ root of then
cost ← cost + branch length between n and its parent in
until n = root of
function cost(x, i)
n ← leaf of corresponding to state i
min ← cost(n)
cost ← branch length between n and its parent in
repeat
n ← parent of n in
if cost + cost(n) < min then
min ← cost + cost(n)
if n ≠ root of then
cost ← cost + branch length between n and its parent in
until n = root of
return min
Algorithm 3 utilizes a cost tree with the same topology as T
C
for each node p, and where each node is annotated with the minimum value corresponding to c
ij
+ , as implemented in the function cost. The function update saves each of these cost trees by moving from the leaves to the root and storing minimum values in the nodes (cost(n)), as well as the leaf responsible for the value stored in the node (min_tags(n)), which will be later used to optimize the reconstruction of ancestral states. The complexity of Algorithm 3 depends on the internal path length of T
C
. The worst case would be a degenerate tree with linear structure, in which case the complexity for n states is (n2 - n)/2 per node and character [[38], §2.3.4.5]. Notice that in practice this will be a rare case, and most cost trees have a more favourable topology to our optimization (as will be shown in the Results section), while the original algorithm takes O(n2) time no matter the cost matrix used.
Figure 3 presents a detailed example of the Algorithm 3. The cost of a transition between states i and j, c
ij
, is the sum of branch lengths between the corresponding leaves in the cost tree. Since the matrix in this example is additive, neighbor-joining guarantees a unique tree. The minimum cost for state a can be calculated as = cost(q, a) + cost(r, a), the sum of minimum costs from the children nodes q and r. For node q this cost will be the minimum among ∞ (cost at leaf a in ), 8 (cost 2 at node x plus branch length 6), and 16 (cost 6 at node w plus total branch length 6 + 4), therefore cost(q, a) = 8. For node r, clearly cost(r, a) = 0, and therefore = 8. At this point, we update the path from leaf a to the root of : node x is set to 8 + 6 = 14, and node w is set to 8 + 6 + 4 = 18. The cost for state b is calculated following the same procedure, resulting in = 8. When updating , the cost at node x is reduced from 14 to 10 (8 at leaf b plus branch length 2), and at node w from 18 to 14. The remaining state costs and inner nodes of the cost tree are calculated in a similar way.
Algorithm 4 presents the optimized version of the reconstruction of ancestral states using Sankoff parsimony. In its original implementation, for each node we had to consider all posible states looking for the one that minimized c
ij
+ . Since we have already saved the minimum transition costs and the leaves responsible for them as shown in Algorithm 3, we can use this information to further speed up computation. The ancestral states for the root r of the phylogeny T are obtained as in the original algorithm. For any inner node x of T, and given that its parent had ancestral state k reconstructed, we only need to move from leaf k to the root of and keep the state min_tags(n) that has the minimum value cost(n).
Algorithm 4 (Optimized Sankoff algorithm: Down phase). A procedure call Opt_Sankoff_Down(x, T, T
C
, S
anc
) calculates the ancestral states of all nodes p of the phylogeny T, given the root x of T and the cost tree for each node p of T as calculated by Opt_Sankoff_Up(T, T
C
, S).
procedure Sankoff_Down(x, T, T
C
, S
anc
)
for all j in do
for all child y of x do
Opt_Sankoff_Down(j, y, T, T
C
, S
anc
)
procedure Opt_Sankoff_Down(i, x, T, T
C
, S
anc
)
n ← leaf of corresponding to state i
min ← cost(n)
cost ← branch length between n and its parent in
repeat
n ← parent of n in
if cost + cost(n) < min then
min ← cost + cost(n)
← {min_tags(n)}
else if cost + cost(n) = min then
← ∪ {min_tags(n)}
if n ≠ root of then
cost ← cost + branch length between n and its parent in
until n = root of
for all j in S
anc
do
for all child y of x do
Opt_Sankoff_Down(j, y, T, T
C
, S
anc
)
Figure 4 presents an example of Algorithm 4. Assuming the parent node of p had state ancestral state e, we move from that leaf to the root of while comparing minimum values: 34 at e, 34 (28 + 6) if we move from state f, 35 (27 + 6 + 2) if we move from d or 25 (14 + 6 + 2 + 1) when moving from b. The ancestral state reconstructed in this node is therefore b. If the ancestral state at the parent had been a instead, node p would have state a as the most parsimonious. Notice that in this case the value in the root of is not necessary, since the cost of moving from b to a is in fact 10 (minimum annotated at the parent of a) plus 6 (the length of the branch from a to its parent).
Algorithm 4 reduces the number of operations compared to the original implementation, since we do not have to review all states at each node but only traverse from a leaf to the root of the cost tree. The original implementation takes O(n) time, while the complexity of our optimization is again a function of the internal path length of T
C
, which in general will be less than n.