Phylogenetic novelty scores
We consider a phylogenetic tree \(\phi\) describing the evolutionary relationships of its N tips \(s_1, \ldots , s_N\). We want to define weights \(w_s\) representing how ‘novel’ tip s is compared to the other tips of \(\phi\). Throughout this paper we consider the tips to represent biomolecular sequences comprising amino acid or nucleotide characters, but other possible sets of characters could equally be accommodated. We assume that we have one sequence associated with each tip, conveniently sharing the same names \(s_1, \ldots , s_N\), and arranged as the rows of an alignment A. We start by defining weights that are a function of \(\phi\) only, and so depend on the evolutionary history relating the considered sequences and not on the specific sequences themselves. In the next section we extend the definitions to also condition on the observed sequence characters.
As a motivating example, if \(\phi\) consists of only extremely long branches, then we want \(w_1 = \cdots = w_N = 1\). In fact, in this case, all sequences represent effectively independent observations, so no weighting correction is needed. This means that, unlike many sequence weighting schemes (e.g. [9, 14, 39, 40]), we want to account for the effect of saturation, so that doubling the length of a long tree branch has negligible effect on the weights.
If instead \(\phi\) has branches all of length 0, we want \(w_1 = \cdots = w_N = 1/N\), so that the total alignment score is 1, as in [9, 14, 39, 40]. This is because all the observed sequences are now just perfectly dependent copies, and so in total they represent just one independent observation of a sequence. At an intermediate level, if \(\phi\) has two tips (\(N=2\)), and branch length such that half of the ancestral characters are expected not to have mutated in either branch (they are PIBD with probability 0.5), then we want the total alignment score to be 1.5, and both weights to be 0.75; this is because in this case only half of each sequence will be novel with respect to the other, so in total we observe 1.5 novel sequences, and we want the two sequences to have the same weight.
A simple way to describe how novel \(s_1\) is with respect to \(s_2\) could be to count the number of mismatches between their sequences. However, even if \(s_1\) and \(s_2\) were very divergent from each other, their sequences would still be identical at some alignment columns by chance or convergent evolution, rather than by close relatedness. In our approach, \(s_1\) can be novel with respect to \(s_2\) at a column of A even if they share the same character, as long as they are not PIBD.
We usually cannot know for sure if sequences are PIBD and at which alignment column, so we define \(p_s(i)\) as the probability that, at a generic alignment column, the number of tips of \(\phi\) (including s) that are PIBD to s, \(i_\phi (s)\), is exactly i. For example, \(p_s(N)\) is the probability that, at a generic alignment column, no substitution occurs along \(\phi\); \(p_s(1)\) is the probability that no tip (except s) in \(\phi\) is PIBD to s at some arbitrary alignment column. We then define the weight \(w_s\) of s within \(\phi\) as:
$$\begin{aligned} w_s=\sum _{i=1}^{N}\dfrac{p_s(i)}{i} = {\mathbb {E}}_\phi \left[ \frac{1}{i_\phi (s)}\right] \; . \end{aligned}$$
(1)
In the simplest case of nucleotide sequences evolving under the JC69 substitution model ( [41]; all substitution rates are 1/3), the probability that two nodes in \(\phi\) separated by branch length t are PIBD is \(e^{-t}\). So, again in the simple case that \(N=2\) and that the two branches in \(\phi\) each have length t/2, \(s_1\) and \(s_2\) each have weight \(w_{s_1}=w_{s_2}=\) \(p_{s_1}(1)+p_{s_1}(2)/2=\) \((1-e^{- t}) + e^{-t}/2=\) \(1-e^{-t}/2\), and the sum of the weights is \(2-e^{-t}\). The same is true for any pair of branch lengths with sum t, of course.
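This two-tip example is simple enough to check numerically. Below is a minimal Python sketch (our illustration, not part of the published implementation) reproducing the limiting behaviours discussed above:

```python
import numpy as np

def two_tip_jc69_pns(t):
    """PNS of each of two tips separated by total divergence t under JC69."""
    p_pibd = np.exp(-t)                    # probability the two tips are PIBD
    return (1.0 - p_pibd) + p_pibd / 2.0   # p_s(1)/1 + p_s(2)/2

for t in (0.0, np.log(2.0), 10.0):
    w = two_tip_jc69_pns(t)
    print(f"t = {t:.3f}:  w = {w:.3f},  ESN = {2 * w:.3f}")
# t = 0    -> w = 0.5,  ESN = 1   (perfectly dependent copies)
# t = ln 2 -> w = 0.75, ESN = 1.5 (PIBD with probability 0.5)
# t = 10   -> w ~ 1,    ESN ~ 2   (effectively independent observations)
```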
We expect the definition of sequence weights given by Eq. 1 to be useful for character frequency inference and many other applications. In fact, in addition to satisfying classical sequence weighting requirements [33], these \(w_s\) can also be efficiently calculated from any \(\phi\) and substitution model, as discussed later. We refer to weights \(w_s\) as the ‘phylogenetic novelty scores’ (PNS). We call the sum of all weights in \(\phi\) the ‘effective sequence number’ (ESN): \(T=\sum _{s=1}^N w_s\), representing the expected number of evolutionarily distinct character observations at an alignment column. An example graphical representation of the PNS is shown in Fig. 1.
Conditioning on observed data
In this section we define weights that are a function not only of phylogeny \(\phi\), but also of a specific alignment column D of alignment A. These weights refer not to the novelty of a sequence s, but of its specific character \(D_s\) observed in row s of column D. The probability that two tips of \(\phi\) are PIBD at a specific alignment column can be strongly affected by the observed characters at that column. Clearly, if the two tips differ at alignment column D then the probability that they are PIBD, conditional on D, is 0. The case that the two tips have the same character in D is less trivial. If the two tips are separated by a total divergence time t and, for simplicity, evolve under a JC69 substitution model [41], then the probability that the two tips have the same nucleotide is \((1+3e^{-4t/3})/4\) and the probability that the two tips are PIBD is \(e^{-t}\); therefore, the probability that the two tips are PIBD conditional on them having the same nucleotide is \(4e^{-t}/(1+3e^{-4t/3})\).
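As a quick numerical illustration of this conditioning (a sketch under the same JC69 assumptions):

```python
import numpy as np

def jc69_pibd_given_match(t):
    """P(two tips are PIBD | same nucleotide), JC69, total divergence t."""
    p_pibd = np.exp(-t)                                    # P(PIBD)
    p_match = (1.0 + 3.0 * np.exp(-4.0 * t / 3.0)) / 4.0   # P(same nucleotide)
    return p_pibd / p_match   # PIBD implies a match, so P(PIBD, match) = P(PIBD)

print(jc69_pibd_given_match(0.1))  # ~0.998: a match is almost surely IBD
print(jc69_pibd_given_match(5.0))  # ~0.027: a match is mostly coincidental
```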
We denote by \(p_s(i|D)\) the probability that exactly i sequences are PIBD to s at the given, observed alignment column D. The new positional PNSs conditional on D are then defined as:
$$\begin{aligned} w^D_s=\sum _{i=1}^{N}\dfrac{p_s(i|D)}{i} \; . \end{aligned}$$
(2)
Algorithms for calculating the phylogenetic novelty scores
We present several algorithms for calculating PNS. One of these methods (‘up-down pruning’) is the most computationally efficient, and so is described below, but may also be the most difficult to understand. For this reason, we also mention other approaches and include their full description in the Supplement. In the following we assume that the phylogeny \(\phi\) is rooted; the case of an unrooted topology follows simply by placing an arbitrary root on the tree, as long as the substitution process is reversible and at equilibrium (as in this case the scores are not affected by the position of the root). In the case of non-reversible or non-stationary character evolution, the position of the root can affect the scores, and so a rooted phylogeny (which could in principle be estimated from sequences in this scenario) is required.
Calculating PNS via simulation
We can calculate PNS by simulating sequence evolution along \(\phi\). If we are interested in weights \(w_s\), at each iteration we start by sampling a root character from the equilibrium distribution. We then sample its descendant characters and the mutation events along the branches of \(\phi\) using standard methods (e.g. [42], “method 2” of [43, 44]) until we reach all the tips, recording each substitution that occurs and hence which tip characters are PIBD. Note that it is not possible to achieve this using software such as evolver [45] or “method 1” of INDELible [43] that only simulate the start and end state of each branch, and do not distinguish between characters that are PIBD and those that happen to match following multiple substitutions. For each iteration we associate a score of 1/i to a tip of \(\phi\) if its observed character is PIBD to the characters of exactly i tips. The final weight of a tip is then obtained by averaging its scores over all iterations. Weights \(w^D_s\) can be similarly calculated employing a variant of the up-down approach [46] to sample characters at internal nodes of the phylogeny conditional on D. A straightforward but inefficient way to achieve the same result is to simulate characters without any conditioning, and discard those iterations that do not match D. These approaches are described in more detail in the Supplement.
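To make the idea concrete, the following minimal Python sketch (ours, not the implementation described in the Supplement) estimates unconditional weights \(w_s\) on a hypothetical toy tree. Only substitution events matter for \(w_s\), so under JC69 (total substitution rate 1) the number of events on a branch of length t is Poisson-distributed with mean t; each event starts a fresh lineage identifier, and tips sharing an identifier are PIBD:

```python
import numpy as np
from collections import Counter

rng = np.random.default_rng(0)

# toy rooted tree as (name, branch_length, children); names are hypothetical
tree = ("root", 0.0, [
    ("A", 0.5, []),
    ("anc", 0.3, [("B", 0.2, []), ("C", 0.2, [])]),
])

def sample_lineage_ids(node, parent_id, next_id, tip_ids):
    """Propagate lineage ids down the tree; a new id marks >=1 substitution."""
    name, t, children = node
    node_id = parent_id
    if rng.poisson(t) > 0:           # at least one substitution on this branch
        next_id[0] += 1
        node_id = next_id[0]
    if not children:
        tip_ids[name] = node_id
    for child in children:
        sample_lineage_ids(child, node_id, next_id, tip_ids)

n_iter = 100_000
scores = Counter()
for _ in range(n_iter):
    tip_ids = {}
    sample_lineage_ids(tree, 0, [0], tip_ids)
    group_size = Counter(tip_ids.values())     # i = number of tips PIBD to s
    for tip, gid in tip_ids.items():
        scores[tip] += 1.0 / group_size[gid]   # score 1/i for this iteration

for tip in ("A", "B", "C"):
    print(tip, scores[tip] / n_iter)           # Monte Carlo estimate of w_s
```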
Calculating PNS via brute force
We can calculate PNS via brute force, that is, by enumerating all possible mutational histories on \(\phi\): we consider all possible character assignments at each end of each branch and, for each branch, whether or not at least one substitution occurs on it. Each mutational history yields a score, computed as in the simulation method above. By averaging the scores of all mutational histories, weighted by their probabilities, we can calculate the \(w_s\) or \(w^D_s\) weights. The full methods are described in detail in the Supplement.
Pruning method to calculate the ESN
The ESNs \(T = \sum _{s=1}^N w_s\) or \(T^D=\sum _{s=1}^N w^D_s\) can be calculated very efficiently (computational cost \({\mathcal {O}}(N)\)) if one is not interested in the weights of the individual tips. The idea is to calculate T iteratively on each subtree of \(\phi\) starting from the tips until we reach the root. We call this the ‘pruning ESN’ method, due to its similarity with Felsenstein’s pruning algorithm [47]. See the Supplement for details.
Up–down pruning approach to calculate PNS
We now present the main, efficient algorithm that we use and recommend for calculating PNS. It can be used for calculating either \(w_s\) or \(w^D_s\) weights and can be considered an adaptation of Felsenstein’s pruning algorithm [47]. The method visits all nodes in \(\phi\) starting from the tips and toward the root (‘up’ phase) and then again a second time starting from the root and moving downward to the tips (‘down’ phase), similar to the up-down approach of [46]. The computational cost of this algorithm is cubic in the number of tips N.
In the following we assume that the substitution rate matrix Q is given. We make no assumptions regarding the state space of the substitution process, which can comprise nucleotide, amino acid or codon states [48], and exclude or include gaps (see e.g. [49]). The probability of having character k at the end of a branch of length t, conditional on having character j at its start, is then \(P_t^{j,k}\), the entry in row j and column k of \(P_t=\exp (tQ)\). See [50] for a more detailed introduction to these concepts in molecular phylogenetics. Starting with character j at the top node of a branch of length t, we denote the probability that no substitution occurs along the branch, and therefore also that the top and bottom nodes of the branch are PIBD, as:
$$\begin{aligned} I^{j}_t=\exp (tQ_{jj}) \; . \end{aligned}$$
(3)
Note that \(I^{j}_t\) is different from \(P_t^{j,j}\), which is the probability that the character at the end of the branch is the same as j, the character at the start of the branch. This is because \(I^{j}_t\) requires not only that the two characters are the same, but also that no substitution occurred on the branch; when substitutions occurred but resulted cumulatively in no change in character at the two ends of a branch (as is possible on long branches), we do not consider those two states PIBD, and the two characters are treated as independent observations.
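The difference is easy to see numerically. A small sketch (ours) under JC69, where every diagonal entry of Q is \(-1\):

```python
import numpy as np
from scipy.linalg import expm

Q = np.full((4, 4), 1.0 / 3.0)   # JC69 rate matrix
np.fill_diagonal(Q, -1.0)

for t in (0.1, 1.0, 5.0):
    P_t = expm(t * Q)                # transition probabilities (Eq. above)
    I_t = np.exp(t * Q[0, 0])        # probability of no substitution at all
    print(f"t = {t}:  P_t[j,j] = {P_t[0, 0]:.4f},  I_t^j = {I_t:.4f}")
# As t grows, P_t[j,j] -> 1/4 (chance matches persist) while I_t^j -> 0.
```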
Our objective is to calculate, for each tip s of \(\phi\), the probability distribution \((p_s(0) = 0, p_s(1), \ldots , p_s(N))\) of having each possible number of PIBD sequences (defined as in Eqs. 1 and 2). To address both the cases of \(w^D_s\) and \(w_s\), we present our description as conditioned on data D; for the case that one is interested in \(w_s\), the same equations can be used but setting D as non-informative. (For example, with DNA sequences a non-informative column D will have all entries equal to character “N”, representing an unknown nucleotide, so that the partial likelihood for column D at each tip is 1.) As before, we denote the observed character at tip s by \(D_s\); we now represent the observed characters for the leaves in sub-phylogeny \(\phi '\) of \(\phi\) as \(D_{\phi '}\). In the particular case that \(\phi '=\phi\), we have \(D_{\phi }=D\), so we can represent the final values of interest for tip s also as \(p_s(i|D_{\phi })\).
For most of the following, we condition probabilities on information from only part of \(\phi\). Given a node \(\nu\) of \(\phi\), and given a sub-phylogeny \(\phi '\) of \(\phi\), we define \(p_\nu ^{\phi '}(i)\) to be the probability that there are exactly i tips in \(\phi '\) that are PIBD to \(\nu\). We also define \(p^{\phi '}_\nu (i,j)\) as the probability of having i tips in \(\phi '\) that are PIBD to \(\nu\) and to have character j in \(\nu\). Similarly, we define \(p^{\phi '}_\nu (i|j)\) to be the probability of having i tips in \(\phi '\) that are PIBD to \(\nu\), conditional on having character j in \(\nu\). Finally, \(p^{\phi '}_\nu (i,D_{\phi '}|j)\) is the probability that i tips in \(\phi '\) are PIBD to \(\nu\) and that the observed data in \(\phi '\) is \(D_{\phi '}\), conditional on having character j in \(\nu\).
The first step of the up phase is to initialise \(p^{s}_s(i,D_s|j)\) at every tip s of \(\phi\), for every character j, and for \(0\leqslant i \leqslant N\):
$$\begin{aligned} p^s_s(i,D_s|j)=\delta (j,D_s) \, \delta (i,1) \end{aligned}$$
(4)
where \(\delta (x,y)\) is the Kronecker delta function (\(\delta (x,y)=1\) if and only if \(x=y\); \(\delta (x,y)=0\) otherwise). In the case that \(D_s\) is uninformative, we have \(p^s_s(i,D_s|j)= \delta (i,1)\).
Next, starting from the tips, we move iteratively ‘upward’, toward the root of \(\phi\). If branch b with length t connects the two nodes \(\nu _1\) (the parent or upper node) and \(\nu _2\) (child or lower node), then b splits \(\phi\) into two sub-phylogenies. We call these \(\phi _1\) and \(\phi _2\), with \(\phi _2\) the sub-phylogeny of \(\phi\) containing \(\nu _2\) (but not b) and all its descendant nodes and branches, and \(\phi _1\) the sub-phylogeny of \(\phi\) containing all nodes and branches (except b) not in \(\phi _2\). Assuming that we have already visited all branches and nodes below b, and therefore know \(p^{\phi _2}_{\nu _2}(i,D_{\phi _2}|j)\) for every character j and every \(0\leqslant i \leqslant N\), we can then calculate \(p^{\phi _2}_{\nu _1}(i,D_{\phi _2}|j)\) for every character j and every \(0\leqslant i \leqslant N\):
$$\begin{aligned} \begin{aligned} p^{\phi _2}_{\nu _1}(0,D_{\phi _2}|j)&= { I^j_t p^{\phi _2}_{\nu _2}(0,D_{\phi _2}|j) + \sum _{k}(P^{j,k}_t - \delta (j,k)I^j_t)\sum _{i=0}^N p^{\phi _2}_{\nu _2}(i,D_{\phi _2}|k)} \\ p^{\phi _2}_{\nu _1}(1,D_{\phi _2}|j)&= I^j_t p^{\phi _2}_{\nu _2}(1,D_{\phi _2}|j) \\&\quad \vdots \\ p^{\phi _2}_{\nu _1}(N,D_{\phi _2}|j)&= I^j_t p^{\phi _2}_{\nu _2}(N,D_{\phi _2}|j) \; . \\ \end{aligned} \end{aligned}$$
(5)
For the first term \(p^{\phi _2}_{\nu _1}(0,D_{\phi _2}|j)\), the first summand \(I^j_t p^{\phi _2}_{\nu _2}(0,D_{\phi _2}|j)\) relates to the case in which there are no mutations on the considered branch b, while the second summand relates to the case in which at least one mutation event happens on b. Graphical examples for Eqs. 4 and 5 are given in Additional file 1: Fig. S1. Many of the \(p^{\phi _2}_{\nu _1}(i,D_{\phi _2}|j)\) will be 0 (when i is larger than the number of tips in \(\phi _2\)). In practice, we have made use of this to speed up the implementation of the algorithm, but we ignore it here for brevity.
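In code, Eq. 5 amounts to rescaling the child's probabilities by \(I^j_t\) and collecting all of the 'at least one substitution' mass into the \(i=0\) entry. A numpy sketch (ours; array names and shapes are illustrative):

```python
import numpy as np

def move_up(p_child, P_t, I_t):
    """Eq. 5: propagate p(i, D | j), stored as an (N+1, B) array, up a branch.

    P_t is the (B, B) transition matrix for the branch and I_t the length-B
    vector of no-substitution probabilities I_t^j."""
    p_parent = I_t[np.newaxis, :] * p_child   # no substitution: i is preserved
    totals = p_child.sum(axis=0)              # sum over i, one entry per k
    # >=1 substitution: the PIBD count drops to 0, whatever it was below
    sub_prob = P_t - np.diag(I_t)             # P_t^{j,k} - delta(j,k) I_t^j
    p_parent[0] += sub_prob @ totals
    return p_parent
```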
Thanks to Eq. 5 we can ‘move’ probabilities up along branches, starting from the initialisations at the tips. Next, we show how to ‘merge’ probabilities when we reach an internal node \(\nu\). A given internal node \(\nu\) splits \(\phi\) into three sub-phylogenies (a parent one, \(\phi _P\), a left child one \(\phi _L\), and a right child one \(\phi _R\)), each associated with one of the three branches adjacent to \(\nu\) (one parent and two child branches). If \(\nu\) is the root, then for simplicity we consider its parent sub-phylogeny to exist but be empty. Assuming that we have already visited all branches and nodes descendant of \(\nu\), and therefore know \(p^{\phi _L}_{\nu }(i,D_{\phi _L}|j)\) and \(p^{\phi _R}_{\nu }(i,D_{\phi _R}|j)\) for every character j and every \(0\leqslant i \leqslant N\), and denoting by \(\phi _L \cup \phi _R\) the sub-phylogeny obtained by joining sub-phylogenies \(\phi _L\) and \(\phi _R\), we can calculate \(p^{\phi _L \cup \phi _R}_{\nu }(i,D_{\phi _L \cup \phi _R}|j)\) for every character j and every \(0\leqslant i \leqslant N\):
$$\begin{aligned} \begin{aligned} p^{\phi _L \cup \phi _R}_{\nu }(0,D_{\phi _L \cup \phi _R}|j)&= p^{\phi _L}_{\nu }(0,D_{\phi _L}|j)p^{\phi _R}_{\nu }(0,D_{\phi _R}|j) \\ p^{\phi _L \cup \phi _R}_{\nu }(1,D_{\phi _L \cup \phi _R}|j)&= p^{\phi _L}_{\nu }(0,D_{\phi _L}|j)p^{\phi _R}_{\nu }(1,D_{\phi _R}|j) \\&\quad + p^{\phi _L}_{\nu }(1,D_{\phi _L}|j)p^{\phi _R}_{\nu }(0,D_{\phi _R}|j) \\&\quad \vdots \\ p^{\phi _L \cup \phi _R}_{\nu }(N,D_{\phi _L \cup \phi _R}|j)&= \sum _{i=0}^{N} p^{\phi _L}_{\nu }(i,D_{\phi _L}|j)p^{\phi _R}_{\nu } (N-i,D_{\phi _R}|j) \; . \end{aligned} \end{aligned}$$
(6)
Equation 6 is one of the most computationally demanding steps of the algorithm (jointly with Eq. 10 below), as it has up to quadratic cost in N. Equation 6 is used on each internal node of \(\phi\), and so causes the algorithm to have a total time complexity of \({\mathcal {O}}(N^3)\).
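Per character j, Eq. 6 is a discrete convolution of the two children's count distributions, which is where the quadratic per-node cost comes from. A numpy sketch (ours):

```python
import numpy as np

def merge(p_left, p_right):
    """Eq. 6: combine (N+1, B) partials from two child sub-phylogenies."""
    n_counts, n_chars = p_left.shape
    merged = np.zeros_like(p_left)
    for j in range(n_chars):
        conv = np.convolve(p_left[:, j], p_right[:, j])   # length 2N+1
        merged[:, j] = conv[:n_counts]   # counts above N have probability 0
    return merged
```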
Using Eqs. 5 and 6 iteratively, we can calculate \(p^{\phi _L}_{\nu }(i,D_{\phi _L}|j)\), \(p^{\phi _R}_{\nu }(i,D_{\phi _R}|j)\) and \(p^{\phi _L \cup \phi _R}_{\nu }(i,D_{\phi _L \cup \phi _R}|j)\) for each internal node \(\nu\), each \(0 \leqslant i \leqslant N\), and any character j. We stop once we reach node \(\rho\), the root of \(\phi\). At \(\rho\) we have \(p^{\phi }_{\rho }(i,D_{\phi }|j)=p^{\phi _L \cup \phi _R}_{\rho }(i,D_{\phi _L \cup \phi _R}|j)\) for any character j and \(0\leqslant i \leqslant N\). If \(\pi\) are the character frequencies at \(\rho\), we then have the joint probabilities:
$$\begin{aligned} p^{\phi }_\rho (i,D_\phi ,j) = \pi (j) p^{\phi }_\rho (i,D_\phi |j) \; . \end{aligned}$$
(7)
This concludes the ‘up’ stage of the method, which is more succinctly described in Eq. 8:
$$\begin{aligned}
&\textbf{Algorithm stage Up}\\
&\quad [\text{initialise}] \quad \text{compute } P_t \text{ and } I^j_t \text{ for every branch length } t \text{ and character } j;\\
&\qquad \qquad \qquad \quad \text{compute } p^s_s(i,D_s\mid j) \text{ for every tip } s \text{, character } j \text{ and } 0 \leqslant i \leqslant N\\
&\quad [\text{iterate}] \qquad \text{visit every internal node } \nu \text{ in post-order traversal; for each } \nu ,j,i \text{ calculate}\\
&\qquad \qquad \qquad \quad p^{\phi _L}_\nu (i,D_{\phi _L}\mid j) \text{ and } p^{\phi _R}_\nu (i,D_{\phi _R}\mid j) \text{ with Eq. 5, and}\\
&\qquad \qquad \qquad \quad p^{\phi _L \cup \phi _R}_\nu (i,D_{\phi _L \cup \phi _R}\mid j) \text{ with Eq. 6}\\
&\quad [\text{finalise}] \quad\; \text{at root } \rho \text{ calculate } p^\phi _\rho (i,D_\phi ,j) \text{ for every } j,i \text{ using Eq. 7}
\end{aligned}$$
(8)
The ‘down’ phase is the second and last stage of the algorithm. Starting from root \(\rho\), we move toward the tips, visiting each node and branch in pre-order traversal. Given branch b of length t connecting nodes \(\nu _1\) (parent) and \(\nu _2\) (child), we assume, as in Eq. 5, that \(\phi _1\) and \(\phi _2\) are the two sub-phylogenies induced by b. Assuming that we have already visited iteratively all ancestor branches of b, and therefore know \(p^{\phi _1}_{\nu _1}(i,D_{\phi _1},j)\) for every character j and \(0\leqslant i \leqslant N\), we can calculate \(p^{\phi _1}_{\nu _2}(i,D_{\phi _1},j)\) for every character j and \(0\leqslant i \leqslant N\):
$$\begin{aligned} \begin{aligned} p^{\phi _1}_{\nu _2}(0,D_{\phi _1},j)&=(P^{j,j}_t-I^j_t) \sum _{i=0}^{N}p^{\phi _1}_{\nu _1}(i,D_{\phi _1},j) + I^j_t p^{\phi _1}_{\nu _1}(0,D_{\phi _1},j) \\&\quad +\sum _{k\ne j}P^{k,j}_t\sum _{i=0}^{ N}p^{\phi _1}_{\nu _1}(i,D_{\phi _1},k) \\ p^{\phi _1}_{\nu _2}(1,D_{\phi _1},j)&=I^j_t p^{\phi _1}_{\nu _1}(1,D_{\phi _1},j)\\&\quad \vdots \\ p^{\phi _1}_{\nu _2}(N,D_{\phi _1},j)&=I^j_t p^{\phi _1}_{\nu _1}(N,D_{\phi _1},j) \; .\\ \end{aligned} \end{aligned}$$
(9)
Equation 9 allows us to ‘move’ probabilities downward along branches, starting from the root. Next, we show again how to ‘merge’ probabilities when we reach an internal node \(\nu\). Given the left child sub-phylogeny \(\phi _L\) of \(\nu\) and its parent sub-phylogeny \(\phi _P\), we can calculate \(p^{\phi _P \cup \phi _L}_{\nu }(i,D_{\phi _P \cup \phi _L},j)\) for every character j and \(0\leqslant i \leqslant N\):
$$\begin{aligned} \begin{aligned} p^{\phi _P \cup \phi _L}_{\nu }(0,D_{\phi _P \cup \phi _L},j)&=p^{\phi _L}_{\nu }(0,D_{\phi _L}|j)p^{\phi _P}_{\nu }(0,D_{\phi _P},j)\\ p^{\phi _P \cup \phi _L}_{\nu }(1,D_{\phi _P \cup \phi _L},j)&=p^{\phi _L}_{\nu }(0,D_{\phi _L}|j)p^{\phi _P}_{\nu }(1,D_{\phi _P},j) \\&\quad + p^{\phi _L}_{\nu }(1,D_{\phi _L}|j)p^{\phi _P}_{\nu }(0,D_{\phi _P},j)\\&\quad \vdots \\ p^{\phi _P \cup \phi _L}_{\nu }(N,D_{\phi _P \cup \phi _L},j)&=\sum _{i=0}^{N} p^{\phi _L}_{\nu }(i,D_{\phi _L}|j)p^{\phi _P}_{\nu }(N-i,D_{\phi _P},j) \; . \end{aligned} \end{aligned}$$
(10)
We use Eq. 10 twice for each internal node \(\nu\), once with the left child sub-phylogeny \(\phi _L\) and once replacing \(\phi _L\) with the right child sub-phylogeny \(\phi _R\). Using Eqs. 9 and 10 iteratively, we calculate \(p^{\phi _P}_{\nu }(i,D_{\phi _P},j)\), \(p^{\phi _P \cup \phi _L}_{\nu }(i,D_{\phi _P \cup \phi _L},j)\) and \(p^{\phi _P \cup \phi _R}_{\nu }(i,D_{\phi _P \cup \phi _R},j)\) for each internal node \(\nu\), each \(0\leqslant i \leqslant N\) and every character j. After we have visited every internal node of \(\phi\), we reach all tips s using Eq. 9 to obtain \(p^{\phi \setminus s}_{s}(i,D_{\phi \setminus s},j)\) for all characters j and all \(0\leqslant i \leqslant N\), where \(\phi \setminus s\) is the sub-phylogeny obtained by removing s (and its parent branch) from \(\phi\). We then combine these probabilities at the tips with the initialisation probabilities \(p^{s}_{s}(i,D_{s}|j)\) to obtain, at every tip s, for all characters j and each \(1\leqslant i \leqslant N\):
$$\begin{aligned} p^{\phi }_s(i,D_\phi ,j)=p^{\phi \setminus s}_{s}(i-1,D_{\phi \setminus s},j)p^{s}_{s}(1,D_{s}|j) \, . \end{aligned}$$
(11)
The final probabilities of interest, \(p^{\phi }_s(i | D_\phi )\), can be calculated for every \(0\leqslant i \leqslant N\) and every tip s as:
$$\begin{aligned} p^{\phi }_s(i | D_\phi ) = \dfrac{ \sum _{j} p^{\phi }_s(i,D_\phi ,j)}{P(D_{\phi })} \, , \end{aligned}$$
(12)
where \(P(D_{\phi })\) is the probability of the data (the phylogenetic likelihood of \(\phi\) and the substitution model for D). In the case that D is empty, that is, if we want to calculate weights \(w_s\), then \(P(D_{\phi })=1\). Otherwise, if we are interested in weights \(w_s^D\), then \(P(D_{\phi })\) can be calculated as a normalisation factor such that the probabilities \(p^{\phi }_s(i | D_\phi )\) at any s sum over i to 1. In either case, the final PNS can be easily calculated by substituting the results from Eq. 12 into Eq. 1 or 2.
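Concretely, the final step at one tip can be sketched as follows (ours; the joint probabilities of Eq. 11 stored as an (N+1, B) array):

```python
import numpy as np

def pns_from_joint(p_joint):
    """Eq. 12 followed by Eq. 1/2: joint (N+1, B) array -> the tip's weight."""
    p_i = p_joint.sum(axis=1)     # marginalise over characters j
    p_i = p_i / p_i.sum()         # divide by P(D); a no-op if D is empty
    i = np.arange(len(p_i))
    return np.sum(p_i[1:] / i[1:])   # E[1/i], using p_s(0) = 0
```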
We summarise the ‘down’ stage of the algorithm in Eq. 13:
$$\begin{aligned}
&\textbf{Algorithm stage Down}\\
&\quad [\text{initialise}] \quad \text{run algorithm stage Up to calculate } p^\phi _\rho (i,D_\phi ,j) \text{ for root } \rho \text{, and}\\
&\qquad \qquad \qquad \quad p^{\phi _L}_\nu (i,D_{\phi _L}\mid j) \text{ and } p^{\phi _R}_\nu (i,D_{\phi _R}\mid j) \text{ at every internal node } \nu \text{ and every } j,i\\
&\quad [\text{iterate}] \qquad \text{visit every internal node } \nu \text{ in pre-order traversal; for each } \nu ,j,i \text{ calculate}\\
&\qquad \qquad \qquad \quad p^{\phi _1}_{\nu _2}(i,D_{\phi _1},j) \text{ with Eq. 9, and}\\
&\qquad \qquad \qquad \quad p^{\phi _P \cup \phi _L}_\nu (i,D_{\phi _P \cup \phi _L},j) \text{ and } p^{\phi _P \cup \phi _R}_\nu (i,D_{\phi _P \cup \phi _R},j) \text{ with Eq. 10}\\
&\quad [\text{finalise}] \quad\; \text{at each tip } s \text{ calculate } p^\phi _s(i,D_\phi ,j) \text{ for every } j,i \text{ using Eq. 11, and}\\
&\qquad \qquad \qquad \quad p^\phi _s(i\mid D_\phi ) \text{ for every } i \text{ using Eq. 12}
\end{aligned}$$
(13)
Fast approximation
The most efficient algorithm above has cubic cost in N. In some circumstances, for example when \(N > 10^5\), it becomes important to consider faster solutions. For this reason, we also present an approximate PNS that can be calculated more efficiently. With \(i_\phi (s)\) the random variable representing the number of tips in \(\phi\) that are PIBD to s, we have \(w_s =\) \(\sum _{i=1}^{N} p_s(i)/i =\) \({\mathbb {E}}[1/i_\phi (s)]\) (Eq. 1). As an alternative fast approximation we consider:
$$\begin{aligned} {\overline{w}}_s=\frac{1}{{\mathbb {E}}[i_\phi (s)]}=\frac{1}{ \sum _{i=1}^{N}ip_s(i)} \; . \end{aligned}$$
(14)
The weights \({\overline{w}}_s\) can be computed very efficiently with an up-down pruning approach, requiring only \({\mathcal {O}}(N)\) time, so we refer to them as ‘fast PNS’. The algorithm to calculate the weights \({\overline{w}}_s\) has many similarities to the one in the previous section, and is described in detail in the Supplement.
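Note that, since \(1/x\) is convex, Jensen's inequality gives \({\overline{w}}_s \leqslant w_s\): the fast PNS never exceeds the exact one. A toy comparison (ours, with a hypothetical count distribution \(p_s(i)\)):

```python
import numpy as np

p = np.array([0.0, 0.5, 0.3, 0.2])   # p_s(0), ..., p_s(3); sums to 1
i = np.arange(len(p))
w_exact = np.sum(p[1:] / i[1:])      # Eq. 1:  E[1/i]  ~ 0.717
w_fast = 1.0 / np.sum(i * p)         # Eq. 14: 1/E[i]  ~ 0.588
print(w_exact, w_fast)
```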
Application to inference of character frequencies
Inference of character frequencies specifically for a single alignment column has broad applications such as modeling selection [34, 35], and creating profile HMMs [3, 4, 6] and sequence logos [36,37,38]. Here, we assume that the frequencies of interest are the equilibrium frequencies at a given alignment column, i.e. the average character frequencies over long evolutionary times. Such frequencies are typically represented in molecular phylogenetics as \(\pi\), with \(\pi (j)\) being the equilibrium frequency of character j [50]. This definition of frequencies fits well with the assumptions of profile HMMs, and is also reasonable for sequence logos, although we acknowledge that different definitions might also be considered in different settings. In this work, we want to investigate and compare different methods for inferring \(\pi\).
The simplest inference method is to use the observed frequency p(j) of character j within the given column as an estimate of the true frequency \(\pi (j)\). This approach corresponds to assuming that all sequences are independent of each other. It might be appropriate in some circumstances, for example when the considered sequences are not homologous but only evolutionarily convergent, but inappropriate in others. As an example, consider an alignment of 1000 homologous human sequences and two mouse sequences (1002 homologous sequences in total). Genetic variation within mice, and variation between mice and humans, will have negligible effects on the estimates p(j), which will be dominated by within-human genetic variation. However, human sequences are highly correlated, as they have very short divergence times between each other, so within-human allele frequencies will typically not represent evolutionary equilibrium character distributions. The problem here is that using p(j) as an estimate of \(\pi (j)\) means treating homologous sequences as independent of each other, while they are often strongly correlated due to shared evolutionary histories.
A traditional way to address this problem is to use sequence weights, for example our \(w_s\), to reduce the contribution of groups of closely related sequences. We can in fact define \(p^w(j)\), a new estimate of \(\pi (j)\), as:
$$\begin{aligned} p^w(j)=\frac{\sum _{s}w_{s} \delta (j,D_s)}{\sum _{s}w_{s}} \end{aligned}$$
(15)
where, as before, \(D_s\) is the observed character for sequence s at the alignment column D under consideration, and \(\delta\) is again the Kronecker delta function.
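Eq. 15 is straightforward to implement; a sketch (ours, with hypothetical tip names and weights):

```python
from collections import defaultdict

def weighted_frequencies(column, weights):
    """Eq. 15. column: tip -> character; weights: tip -> w_s."""
    totals = defaultdict(float)
    for s, char in column.items():
        totals[char] += weights[s]
    norm = sum(totals.values())
    return {char: v / norm for char, v in totals.items()}

# three closely related tips sharing 'A' are down-weighted relative to 'C'
print(weighted_frequencies({"s1": "A", "s2": "A", "s3": "A", "s4": "C"},
                           {"s1": 0.4, "s2": 0.4, "s3": 0.4, "s4": 1.0}))
# {'A': 0.545..., 'C': 0.454...} rather than the observed 0.75 / 0.25
```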
We investigate and compare the performance, for character frequency inference, of the three weighting schemes introduced above: \(w_s\), \(w^D_s\) and \({\overline{w}}_s\). We also consider two popular sequence weighting schemes: those defined by [5], which we call HH94, and by [9], which we call GSC94. HH94 first calculates, for every s, the score of s at an alignment column D, which we denote \(H\!H^{D}_s\). This score \(H\!H^{D}_s\) is \(1/(rd)\), with r the number of different characters in D and d the number of times character \(D_s\) appears in D. The weight for sequence s, which we denote \(H\!H_s\), is then defined as the average of \(H\!H^{D}_s\) over all columns D of alignment A.
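For reference, the HH94 computation described above can be sketched as follows (our illustration, not the original implementation):

```python
from collections import Counter

def hh94_weights(alignment):
    """HH94: per-column score 1/(r*d), averaged over the columns of A."""
    tips = list(alignment)
    length = len(alignment[tips[0]])
    weights = dict.fromkeys(tips, 0.0)
    for col in range(length):
        column = {s: alignment[s][col] for s in tips}
        counts = Counter(column.values())
        r = len(counts)                  # number of distinct characters in D
        for s in tips:
            d = counts[column[s]]        # occurrences of this tip's character
            weights[s] += 1.0 / (r * d)
    return {s: w / length for s, w in weights.items()}

print(hh94_weights({"s1": "AC", "s2": "AC", "s3": "GC"}))
# {'s1': 0.2917, 's2': 0.2917, 's3': 0.4167}: the rarer character scores higher
```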
GSC94 defines sequence weights iteratively along a phylogeny, by visiting branches in post-order traversal (from the tips to the root). First, all terminal branches (those connected to the tips of \(\phi\)) are visited, and the length of the terminal branch connected to tip s is assigned as the initialisation value of the weight \(GSC_s\) of s. Then, every time an internal branch b is visited, its length t is distributed among the weights of its descendant sequences. More precisely, t is first split among the tips, with the part \(t_s\) assigned to s being \(t_s=t\,{GSC_s}/{\sum _{s'\in S_b}GSC_{s'}}\), where \(S_b\) is the set of tips descendant from b; then, each \(GSC_s\) for \(s\in S_b\) is increased by \(t_s\). After the last branches connected to the root have been processed, the \(GSC_s\) are the final GSC94 weights.
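And a sketch of the GSC94 procedure (ours; the nested-tuple tree encoding, and the equal split used when all descendant weights are still zero, are our own choices):

```python
def gsc94_weights(tree):
    """GSC94 on a tree of (branch_length, children-or-tip-name) tuples."""
    weights = {}

    def visit(node):
        t, payload = node
        if isinstance(payload, str):       # tip: initialise with branch length
            weights[payload] = t
            return [payload]
        tips = []
        for child in payload:              # post-order: children first
            tips.extend(visit(child))
        total = sum(weights[s] for s in tips)
        for s in tips:                     # split t proportionally to weights
            weights[s] += t * weights[s] / total if total > 0 else t / len(tips)
        return tips

    visit(tree)
    return weights

# (A:0.5,(B:0.2,C:0.2):0.3); in Newick terms, with root branch length 0
print(gsc94_weights((0.0, [(0.5, "A"), (0.3, [(0.2, "B"), (0.2, "C")])])))
# {'A': 0.5, 'B': 0.35, 'C': 0.35}
```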
In addition to the character frequency inference p(j) based on the observed frequencies, and \(p^w(j)\) based on one of many weighting schemes studied, we also consider character frequency inference via phylogenetic maximum likelihood (ML). We perform this using PhyML v3.1 by fixing the phylogenetic tree (inferred from the whole alignment A using FastTree v2.1.10 [51]) and the substitution model exchangeabilities, and inferring, one alignment column D at a time, only the equilibrium character frequencies \(\pi\).
Bayesian approaches to character frequency inference
Above, we introduced point estimate methods for character frequency inference. These methods do not measure inference uncertainty, and this can result in a very limited summary of the available data. For example, observing character j 100 times in an alignment column from 100 distantly related species leads all the above methods to infer a \(100\%\) frequency for j; but so does observing j twice within an alignment column of just two closely related sequences. While in the first scenario there should be little uncertainty regarding the inferred frequencies, in the second scenario uncertainty should be elevated. Using a Bayesian method is a natural way to address this issue, and also allows the inclusion of priors over character frequencies. Here we present a Bayesian variant of the weight-based character frequency inference of Eq. 15.
If A is composed of N independent (non-homologous) sequences, the likelihood of a column D is \(P(D|\pi )=\prod _{s}\pi (D_s)\). It is simple to combine this likelihood with a character frequency prior to obtain a Bayesian posterior distribution, and perform Bayesian character frequency inference. However, we are interested in the general case where the sequences in A are related by a phylogeny \(\phi\), and therefore are not independent. One possible way to perform Bayesian inference of \(\pi\) in this scenario would be using Bayesian phylogenetic methods such as BEAST [52] or MRBAYES [53], but at often excessive computational cost. Instead, we propose an approximation of the likelihood function \(P(D|\pi )\) based on sequence weights \(w_s\):
$$\begin{aligned} {\widehat{P}}(D|\pi )=\prod _{s}\pi ^{w_s}(D_s) = \prod _j \big (\pi (j)\big )^{\sum _s w_{s}\delta (j,D_s)} \; . \end{aligned}$$
(16)
Similarly, we can replace weights \(w_s\) in Eq. 16 with any other weighting scheme. In the following, we assume a uniform prior \(P(\pi )\) on character frequencies, meaning that all possible \(\pi\) are similarly likely a priori. Alternative priors are possible, and some might be more realistic, but usually at the cost of introducing more parameters in the model. Our approximation of the posterior probability \(P(\pi |D)\) is then:
$$\begin{aligned} {\widehat{P}}(\pi |D) = \dfrac{ P(\pi ) {\widehat{P}}(D|\pi ) }{ P(D) } = \dfrac{ \prod _j \big (\pi (j)\big )^{\sum _s w_{s}\delta (j,D_s)} }{ \int _{\xi } \prod _j \xi _j^{\sum _s w_{s}\delta (j,D_s)}} \end{aligned}$$
(17)
where the integral in the denominator is over all possible character frequencies \(\xi\). Equation 17 defines a Dirichlet distribution with parameters \(\alpha _j=1+\sum _s w_s\delta (j,D_s)\), so in the following we use the properties of Dirichlet distributions [54]. The (approximate) maximum a posteriori and ML \(\pi\) are both given by the weighted observed character frequencies \(p^w(j)\) in Eq. 15. The approximation of the expectation of \(\pi (j)\), \(E(\pi (j)|D)\), is however:
$$\begin{aligned} {\widehat{E}} (\pi (j)|D) = \dfrac{ 1+\sum _s w_s \delta (j,D_s) }{ B + \sum _s w_s } = \dfrac{ \alpha _j }{ \alpha _0 } \; , \end{aligned}$$
(18)
where B is the number of possible characters in the considered alphabet, \(\alpha _j=1+\sum _s w_s\delta (j,D_s)\) and \(\alpha _0= B+\sum _s w_s\). This can be seen as the ML estimate of \(\pi\) in the presence of 1 pseudo-count per character. The posterior variance is then approximated as:
$$\begin{aligned} {\text {Var}}(\pi (j)|D) \approx \dfrac{ \alpha _j (\alpha _0 -\alpha _j)}{ \alpha _0^2 (\alpha _0 +1) } \; , \end{aligned}$$
(19)
which can be used as a measure of the uncertainty over character frequencies. However, considering that Eq. 17 has beta-distributed univariate marginals \({\widehat{P}}(\pi (j)|D) \sim {\text {Beta}}(\alpha _j , \alpha _0 - \alpha _j)\), in the following we derive approximate \(95\%\) posterior probability intervals using the stats.beta.ppf function in scipy [55].
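These summaries are straightforward to compute; a sketch (ours) using scipy:

```python
import numpy as np
from scipy.stats import beta

def posterior_summary(weighted_counts):
    """weighted_counts[j] = sum_s w_s * delta(j, D_s), one entry per character."""
    alpha = 1.0 + np.asarray(weighted_counts, dtype=float)  # uniform prior
    alpha0 = alpha.sum()
    mean = alpha / alpha0                                   # Eq. 18
    var = alpha * (alpha0 - alpha) / (alpha0**2 * (alpha0 + 1.0))  # Eq. 19
    lo = beta.ppf(0.025, alpha, alpha0 - alpha)   # marginal Beta quantiles
    hi = beta.ppf(0.975, alpha, alpha0 - alpha)
    return mean, var, (lo, hi)

# e.g. weighted nucleotide counts from a column dominated by one character
print(posterior_summary([3.2, 0.5, 0.1, 0.1]))
```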
Simulations
We use simulations to test and compare the computational demands of calculating PNS values, as well as to assess the accuracy of different approaches to infer position-specific character frequencies. In the base simulation scenario, we simulate nucleotide sequence evolution along a phylogeny of 100 vertebrate taxa (Fig. 1) using Dendropy [56]. We use an HKY85 substitution model [57] with transition:transversion ratio \(\kappa =3\) both for simulation and inference. We simulate 10 replicates, each replicate consisting of an alignment of 1000 columns. Alignment columns are evolved independently of each other (conditional on the tree and the substitution model). As we do not simulate indel events, we do not consider gap characters in our inference; when used on real data we would treat gap characters as missing data, as typically done in phylogenetics, but it would also be possible to include the gap character in the substitution model state space (see e.g. [49]).
Specific equilibrium character frequencies \(\pi\) are assigned to each alignment column. For each alignment, 800 columns (\(80\%\)) are simulated as evolving under the same background equilibrium character frequency distribution, which we set to \(\pi (A)=\pi (T)=0.3\) and \(\pi (C)=\pi (G)=0.2\). The background character frequency distribution represents, in our simulations, the evolutionary dynamics of positions not strongly affected by selective forces; at these positions, the equilibrium character frequency distribution is constant because it is mostly determined by neutral mutational biases, which we assume constant across all alignment columns. The remaining \(20\%\) of alignment columns are simulated under position-specific selection, with position-specific equilibrium character frequency \(\pi\) sampled from a Dirichlet distribution prior with \(\alpha =0.1\) (Additional file 1: Fig. S2A).
Our aim is, for each replicate and each alignment column, to infer \(\pi\) from the simulated sequences alone. For each replicate/alignment, we first infer a phylogenetic tree and alignment-wide HKY85 substitution model parameters using FastTree v2.1.10 [51]. We then consider this tree and the HKY85 \(\kappa\) parameter to be fixed and infer column-specific character equilibrium frequencies. While \(\pi\) is inferred separately at each column, the HKY85 alignment-wide parameters (including nucleotide frequencies) inferred with FastTree are used in some sequence weighting schemes (for example, in Eq. 3). The methods we used to infer equilibrium frequencies are:
- the observed character frequencies in the alignment column (the p(j) described above),
- observed frequencies corrected using the HH94 [5] weights and Eq. 15,
- observed frequencies corrected using the GSC94 [9] weights and Eq. 15,
- observed frequencies corrected using the PNS weights \(w_s\) from Eq. 1 combined with Eq. 15,
- observed frequencies corrected using our PNS weights conditional on data, \(w^D_s\) from Eq. 2, combined with Eq. 15,
- observed frequencies corrected using our fast approximate PNS \({\overline{w}}_s\) weights (Eq. 14) combined with Eq. 15,
- Bayesian variants of the methods above, and
- ML phylogenetic inference (only of equilibrium character frequencies) with PhyML v3.1 [58].
All the methods above, except FastTree and PhyML, were implemented in custom Python scripts available from https://bitbucket.org/nicofmay/noveltyscores.
In addition to the basic simulation scenario, we also consider variant scenarios in order to investigate how certain parameters can affect the results:
- We consider alternative tree lengths, which we obtain by multiplying all branch lengths in the tree in Fig. 1 by a constant coefficient, either 0.2 or 5.
- We consider the case of amino acid characters instead of nucleotides. In this case, we simulate under an LG substitution model [59], and when we do inference we assume that the substitution model (including character frequencies) is known. Column-specific character frequencies are inferred as usual, and equilibrium character frequencies for columns under selection are sampled from a Dirichlet distribution prior with \(\alpha =0.02\) (Additional file 1: Fig. S2B).
- To test the effect of very biased taxon sampling in an alignment, we added multiple (either 100 or 1000) human tips to the tree in Fig. 1. The short phylogeny relating the human sequences was randomly sampled at each replicate under a standard coalescent prior [60] with mean coalescent time 0.001 between human sequences. This short human phylogeny was then appended to the human tip of the tree in Fig. 1.
- To test the robustness of methods to the assumption of an ultrametric tree (a tree where all tips have equal distance from the root), we consider the case of strongly non-ultrametric trees, as is common for some viruses such as influenza.