A phylogenetic approach for weighting genetic sequences

Background Many important applications in bioinformatics, including sequence alignment and protein family profiling, employ sequence weighting schemes to mitigate the effects of non-independence of homologous sequences and under- or over-representation of certain taxa in a dataset. These schemes aim to assign high weights to sequences that are ‘novel’ compared to the others in the same dataset, and low weights to sequences that are over-represented. Results We formalise this principle by rigorously defining the evolutionary ‘novelty’ of a sequence within an alignment. This results in new sequence weights that we call ‘phylogenetic novelty scores’. These scores have various desirable properties, and we showcase their use by considering, as an example application, the inference of character frequencies at an alignment column—important, for example, in protein family profiling. We give computationally efficient algorithms for calculating our scores and, using simulations, show that they are versatile and can improve the accuracy of character frequency estimation compared to existing sequence weighting schemes. Conclusions Our phylogenetic novelty scores can be useful when an evolutionarily meaningful system for adjusting for uneven taxon sampling is desired. They have numerous possible applications, including estimation of evolutionary conservation scores and sequence logos, identification of targets in conservation biology, and improving and measuring sequence alignment accuracy. Supplementary Information The online version contains supplementary material available at 10.1186/s12859-021-04183-8.

years. These methods assign a score to each sequence considered, with the aim of assigning reduced weights to sequences from over-represented clades and larger weights to sequences from under-represented clades. The purpose of sequence weighting schemes is therefore to improve the accuracy of many bioinformatic tasks in a computationally efficient way.
PSI-BLAST [2], for example, employs the Henikoff and Henikoff [5] weighting scheme ("HH94" [5]) where the score of a sequence is the average of the scores of each position of the sequence, the score of a position being 1/rd, with r the number of different characters at the considered alignment column and d the number of times the character of the considered sequence and position appears in the considered alignment column. The idea of this weighting scheme is to give equal weight to all characters observed at one alignment column, dividing this weight equally among those sequences sharing that character at that position. This method has the advantage of being very fast to calculate, and of giving higher weights to sequences with more rare characters that are, therefore, likely more distantly related.
HMMER [6] and the CLUSTAL family of aligners [1,7,8] use the weighting scheme of Gerstein et al. (1994: "GSC94" [9]; similar to [10]), which defines sequence weights iteratively along a phylogeny from tips to root. At each step, the length of the considered tree branch is split proportionally to the current weights of its descendant sequences, and is then added to the weights of the descendant sequences. Here, the idea is that weights are determined by divergence between groups of sequences. The more diverged one group of sequences is from the others, the higher weights it will have. However, the weight of a group is shared among the sequences in the group, so that in a group with many similar sequences each of those sequences will have small individual weight.
Other sequence weighting schemes have also been proposed, although they have seen fewer applications. Maximum discrimination sequence weighting [11] is a complex approach that aims to optimally distinguish homology from chance alignments in database searches. Henikoff and Henikoff [12] proposed a method that splits sequences into clusters based on sequence similarity, and assigns equal weights to sequences in the same cluster and a total weight of 1 to each cluster [12]. Vingron and Argos weighted sequences proportionally to their average distances from all other sequences [13]; Sibbald and Argos proposed an approach in which a sequence receives more weight if it is more isolated in sequence space [14]. Altschul et al. measured evolutionary correlations among sequences using branch lengths in the phylogeny, and then calculated sequence weights using the inverse of the variance-covariance matrix [15]; Gotoh developed a fast approximation of this method [16]. Similar ideas have also been explored within methods aimed at estimating character frequencies at a given position in a protein [17][18][19], defining tree or alignment informativeness [20][21][22][23][24], and quantifying diversity within a habitat and prioritising conservation efforts [25][26][27][28][29][30][31].
The many weighting schemes proposed have rarely been assessed and compared under different scenarios. We show that heuristic approaches can suffer from limitations: for example GSC94, while providing good performance on ultrametric trees, can lead to inaccurate results on non-ultrametric trees. Instead, here we propose a new weighting scheme, the first to be derived from the idea of evolutionary 'novelty' . We quantify the novelty of each sequence compared to the other sequences under consideration by computing the probability that, at a given position, sequences are 'phylogenetically identical by descent' (PIBD): that is, that they descended from a common ancestor without any substitution occurring. Our terminology highlights the similarity with the concept of "identity by descent" in population genetics (see e.g. [32]), where it applies when two alleles are not only identical ("identity by state"), but also have had no mutation or recombination occurring in the lineages connecting them to their most recent common ancestor. This rigorous approach allows us to quantify the novelty of sequences in very general scenarios (without specific assumptions regarding the phylogeny relating the considered sequences) while being robust to uneven sampling and very elevated or reduced divergence levels, and generally conforming to guiding principles for an acceptable weighting scheme [33]. We present algorithms and scripts to efficiently compute these weights from a phylogeny and from a multiple sequence alignment.
As shown by the examples above, this new weighting scheme has a number of possible applications, from gene family profiles and multiple sequence alignment evaluation to ecology and conservation biology. The aim of our work is to provide a new sequence weighting scheme that is robust to the choice of application and scenario, therefore giving the possibility of improving the accuracy of these bioinformatics tasks, while at the same time being sufficiently computationally efficient to be used on large datasets. As an example, we focus on the task of inferring character frequencies at an alignment column. Inference of character frequencies is not only important for gene family profiling, but also for modeling evolutionary fitness, calculating conservation scores, and visualizing sequence logos [34][35][36][37][38]. We show that our methods result in efficient and accurate inference of character frequencies, with clear advantages compared to previous sequence weighting schemes.

Phylogenetic novelty scores
We consider a phylogenetic tree φ describing the evolutionary relationships of its N tips s 1 , . . . , s N . We want to define weights w s representing how 'novel' tip s is compared to the other tips of φ . Throughout this paper we consider the tips to represent biomolecular sequences comprising amino acid or nucleotide characters, but other possible sets of characters could equally be accommodated. We assume that we have one sequence associated with each tip, conveniently sharing the same names s 1 , . . . , s N , and arranged as the rows of an alignment A. We start by defining weights that are a function of φ only, and so depend on the evolutionary history relating the considered sequences and not on the specific sequences themselves. In the next section we extend the definitions to also condition on the observed sequence characters.
As a motivating example, if φ consists of only extremely long branches, then we want w 1 = · · · = w N = 1 . In fact, in this case, all sequences represent effectively independent observations, so no weighting correction is needed. This means that, unlike many sequence weighting schemes (e.g. [9,14,39,40]), we want to account for the effect of saturation, so that doubling the length of a long tree branch has negligible effect on the weights.
If instead φ has branches all of length 0, we want w 1 = · · · = w N = 1/N , so that the total alignment score is 1, as in [9,14,39,40]. This is because all the observed sequences are now just perfectly dependent copies, and so in total they represent just one independent observation of a sequence. At an intermediate level, if φ has two tips ( N = 2 ), and branch length such that half of the ancestral characters are expected not to have mutated in either branch (they are PIBD with probability 0.5), then we want the total alignment score to be 1.5, and both weights to be 0.75; this is because in this case only half of each sequence will be novel with respect to the other, so in total we observe 1.5 novel sequences, and we want the two sequences to have the same weight.
A simple way to describe how novel s 1 is with respect to s 2 could be to count the number of mismatches between their sequences. However, even if s 1 and s 2 were very divergent from each other, their sequences would still be identical at some alignment column because of chance or of convergent evolution, instead of close relatedness. In our approach, s 1 can be novel with respect to s 2 at a column of A even if they share the same character, as long as they are not PIBD.
We usually cannot know for sure if sequences are PIBD and at which alignment column, so we define p s (i) as the probability that, at a generic alignment column, the number of tips of φ (including s) that are PIBD to s, i φ (s) , is exactly i. For example, p s (N ) is the probability that, at a generic alignment column, no substitution occurs along φ ; p s (1) is the probability that no tip (except s) in φ is PIBD to s at some arbitrary alignment column. We then define the weight w s of s within φ as: In the simplest case of nucleotide sequences evolving under the JC69 substitution model ( [41]; all substitution rates are 1/3), the probability that two nodes in φ separated by branch length t are PIBD is e −t . So, again in the simple case that N = 2 and that the two branches in φ have each length t/2, s 1 and s 2 each have weight w s 1 = w s 2 = p s 1 (1) + p s 1 (2)/2 = (1 − e −t ) + e −t /2 = 1 − e −t /2 , and the sum of the weights is 2 − e −t . The same is true for any pair of branch lengths with sum t, of course.
We expect the definition of sequence weights given by Eq. 1 to be useful for character frequency inference and many other applications. In fact, in addition to satisfying classical sequence weighting requirements [33], these w s can also be efficiently calculated from any φ and substitution model, as discussed later. We refer to weights w s as the 'phylogenetic novelty scores' (PNS). We call the sum of all weights in φ the 'effective sequence number' (ESN): T = N s=1 w s , representing the expected number of evolutionarily distinct character observations at an alignment column. An example graphical representation of the PNS is shown in Fig. 1

Conditioning on observed data
In this section we define weights that are a function not only of phylogeny φ , but also of a specific alignment column D of alignment A. These weights refer not to the novelty of a sequence s, but of its specific character D s observed in row s of column D. The probability that two tips of φ are PIBD at a specific alignment column can be strongly affected by the observed characters at that column. Clearly, if the two tips differ at alignment column D then the probability that they are PIBD, conditional on D, is 0. The case that the two tips have the same character in D is less trivial. If we assume that the two tips are separated by a total divergence time t, and for simplicity assuming a JC69 substitution model [41], then the probability that the two tips have the same nucleotide is (1 + 3e −4t/3 )/4 and the probability that the two tips are PIBD is e −t ; therefore, the probability that the two tips are PIBD conditional on them having the same nucleotide is 4e −t /(1 + 3e −4t/3 ). We denote by p s (i|D) the probability that exactly i sequences are PIBD to s at the given, observed alignment column D. The new positional PNSs conditional on D are then defined as:

Algorithms for calculating the phylogenetic novelty scores
We present several algorithms for calculating PNS. One of these methods ('up-down pruning') is the most computationally efficient, and so is described below, but may also be the most difficult to understand. For this reason, we also mention other approaches and include their full description in the Supplement. In the following we assume that the phylogeny φ is rooted; the case of an unrooted topology follows simply by placing an arbitrary root on the tree, as long as the substitution process is reversible and at  This tree was also used for simulations in this work. a The tree has all tips spaced uniformly on the horizontal axis, representing the case of no weighting scheme being used. b Tips are spaced horizontally according to their w s weight. The weight of each tip can also be seen in the length of the colored bars. Notice how species in regions of the tree with many close relatives (e.g. mammal, primate and bird clades) have low PNSs, and so take up less space individually. This means the horizontal dimension of the plot now gives more equal representation of the novelty of each sequence and clade, instead of emphasising densely sampled clades. More divergent species with few close relatives (e.g. lamprey, coelacanth, frog and platypus) have higher PNSs and are given more horizontal space, representing the greater novelty of their sequences relative to other species in the tree. Cumulative ESN scores (clade-wise sum of PNSs) are also shown for some clades equilibrium (as in this case the scores are not affected by the position of the root). In the case of non-reversible or non-stationary character evolution, the position of the root can affect the scores, and so a rooted phylogeny (which could in principle be estimated from sequences in this scenario) is required.

Calculating PNS scores via simulation
We can calculate PNS by simulating sequence evolution along φ . If we are interested in weights w s , at each iteration we start by sampling a root character from the equilibrium distribution. We then sample its descendant characters and the mutation events along the branches of φ using standard methods (e.g. [42], "method 2" of [43,44]) until we reach all the tips, recording each substitution that occurs and hence which tip characters are PIBD. Note that it is not possible to achieve this using software such as evolver [45] or "method 1" of INDELible [43] that only simulate the start and end state of each branch, and do not distinguish between characters that are PIBD and those that happen to match following multiple substitutions. For each iteration we associate a score of 1/i to a tip of φ if its observed character is PIBD to the characters of exactly i tips. The final weight of a tip is then obtained by averaging its scores over all iterations. Weights w D s can be similarly calculated employing a variant of the up-down approach [46] to sample characters at internal nodes of the phylogeny conditional on D. A straightforward but inefficient way to achieve the same result is to simulate characters without any conditioning, and discard those iterations that do not match D. These approaches are described in more detail in the Supplement.

Calculation of PNS scores via brute-force
We can calculate PNS via brute-force, that is, by enumerating all possible mutational histories on φ by considering all possible character assignments at each end of each branch, and for each branch considering whether there is at least one substitution on it or not. Each mutational history results in a score as in the previous method using simulation. By averaging the scores of all mutational histories, while accounting for different probabilities of different histories, we can calculate the w s or w D s weights. The full methods are described in detail in the Supplement.

Pruning method to calculate the ESN
The ESNs T = N s=1 w s or T D = N s=1 w D s can be calculated very efficiently (computational cost O(N ) ) if one is not interested in the weights of the individual tips. The idea is to calculate T iteratively on each subtree of φ starting from the tips until we reach the root. We call this the 'pruning ESN' method, due to its similarity with Felsenstein's pruning algorithm [47]. See the Supplement for details.

Up-down pruning approach to calculate PNS
We now present the main, efficient algorithm that we use and recommend for calculating PNS. It can be used for calculating either w s or w D s weights and can be considered an adaptation of Felsenstein's pruning algorithm [47]. The method visits all nodes in φ starting from the tips and toward the root ('up' phase) and then again a second time starting from the root and moving downward to the tips ('down' phase), similar to the up-down approach of [46]. The computational cost of this algorithm is cubic in the number of tips N.
In the following we assume that the substitution rate matrix Q is given. We make no assumptions regarding the state space of the substitution process, which can comprise nucleotide, amino acid or codon states [48], and exclude or include gaps (see e.g. [49]). The probability of having character k at the end of a branch of length t, conditional on having character j at its start, is then P j,k t , the entry in row j and column k of P t = exp(tQ) . See [50] for a more detailed introduction to these concepts in molecular phylogenetics. Starting with character j at the top node of a branch of length t, we denote the probability that no substitution occurs along the branch, and therefore also that the top and bottom nodes of the branch are PIBD, as: which is the probability that the character a the end of the branch is the same as j, the character at the start of the branch. This is because I j t requires not only the two characters to be the same, but also that no substitution occurred on the branch; when substitutions occurred but resulted cumulatively in no change in character at the two ends of a branch (as possible on long branches) we do not consider those two states PIBD, and the two characters are treated as independent observations.
Our objective is to calculate, for each tip s of φ , the probability distribution (p s (0) = 0, p s (1), . . . , p s (N )) of having each possible number of PIBD sequences (defined as in Eqs. 1 and 2). To address both the cases of w D s and w s , we present our description as conditioned on data D; for the case that one is interested in w s , the same equations can be used but setting D as non-informative. (For example, with DNA sequences a non-informative column D will have all entries equal to character "N", representing an unknown nucleotide, so that the partial likelihood for column D at each tip is 1.) As before, we denote the observed character at tip s by D s ; we now represent the observed characters for the leaves in sub-phylogeny φ ′ of φ as D φ ′ . In the particular case that φ ′ = φ , we have D φ = D , so we can represent the final values of interest for tip s also as p s (i|D φ ).
For most of the following, we condition probabilities on information from only part of φ . Given a node ν of φ , and given a sub-phylogeny φ ′ of φ , we define p φ ′ ν (i) to be the probability that there are exactly i tips in φ ′ that are PIBD to ν . We also define p φ ′ ν (i, j) as the probability of having i tips in φ ′ that are PIBD to ν and to have character j in ν . Similarly, we define p φ ′ ν (i|j) to be the probability of having i tips in φ ′ that are PIBD to ν , conditional on having is the probability that i tips in φ ′ are PIBD to ν and that the observed data in φ ′ is D φ ′ , conditional on having character j in ν.
The first step of the up phase is to initialise p s s (i, D s |j) at every tip s of φ , for every character j, and for 0 i N: where δ(x, y) is the Kronecker delta function ( δ(x, y) = 1 if and only if x = y ; δ(x, y) = 0 otherwise). In the case that D s is uninformative, we have p s s (i, D s |j) = δ(i, 1).
Next, starting from the tips, we move iteratively 'upward' , toward the root of φ . If branch b with length t connects the two nodes ν 1 (the parent or upper node) and ν 2 (child or lower node), then b splits φ into two sub-phylogenies. We call these φ 1 and φ 2 , with φ 2 the subphylogeny of φ containing ν 2 (but not b) and all its descendant nodes and branches, and φ 1 the sub-phylogeny of φ containing all nodes and branches (except b) not in φ 2 . Assuming that we have already visited all branches and nodes below b, and therefore know p φ 2 ν 2 (i, D φ 2 |j) for every character j and every 0 i N , we can then calculate p for every character j and every 0 i N: relates to the case in which there are no mutations on the considered branch b, while the second summand relates to the case in which at least one mutation event happens on b. Graphical examples for Eqs. 4 and 5 are given in Additional file 1: Fig. S1. Many of the p φ 2 ν 1 (i, D φ 2 |j) will be 0 (when i is larger than the number of tips in φ 2 ). In practice, we have made use of this to speed up the implementation of the algorithm, but we ignore it here for brevity.
Thanks to Eq. 5 we can 'move' probabilities up along branches, starting from the initialisations at the tips. Next, we show how to 'merge' probabilities when we reach an internal node ν . A given internal node ν splits φ into three sub-phylogenies (a parent one, φ P , a left child one φ L , and a right child one φ R ), each associated with one of the three branches adjacent to ν (one parent and two child branches). If ν is the root, then for simplicity we consider its parent sub-phylogeny to exist but be empty. Assuming that we have already visited all branches and nodes descendant of ν , and therefore know p for every character j and every 0 i N , and denoting by φ L ∪ φ R the sub-phylogeny obtained by joining sub-phylogenies φ L and φ R , we can calculate p φ L ∪φ R ν (i, D φ L ∪φ R |j) for every character j and every 0 i N: is one of the most computationally demanding steps of the algorithm (jointly with Eq. 10 below) as it has up to quadratic cost in N. Equation 6 is used on each internal node of φ , and so causes the algorithm to have a total time complexity in the order of O(N 3 ).
Using Eqs. 5 and 6 iteratively, we can calculate p for each internal node ν , each 0 i N , and any character j. We stop once we reach node ρ , the root of φ . At ρ we have p for any character j and 0 i N . If π are the character frequencies at ρ , we then have the joint probabilities: This concludes the 'up' stage of the method, which is more succinctly described in Eq. 8: The 'down' phase is the second and last stage of the algorithm. Starting from root ρ , we move toward the tips, visiting each node and branch in pre-order traversal. Given branch b of length t connecting nodes ν 1 (parent) and ν 2 (child), we assume, as in Eq. 5, that φ 1 and φ 2 are the two sub-phylogenies induced by b. Assuming that we have already visited iteratively all ancestor branches of b, and therefore know p for every character j and 0 i N: allows us to 'move' probabilities downward along branches, starting from the root. Next, we show again how to 'merge' probabilities when we reach an internal node ν . Given one left child sub-phylogeny φ L of ν , and given its parent sub-phylogeny φ P , we can calculate p φ P ∪φ L ν (i, D φ P ∪φ L , j) for every character j and 0 i N:

Algorithm stage Up
[initialise] compute P t and I j t for every branch length t and character j compute p s s (i, D s | j) for every tip s, character j and 0 visit every internal node ν in post-order traversal; for each ν, j, i calculate p at root ρ calculate p φ ρ (i, D φ , j) for every j, i using Eq. 7 We use Eq. 10 twice for each internal node ν , once with the left child sub-phylogeny φ L and once replacing φ L with the right child sub-phylogeny φ R . Using Eqs. 9 and 10 iteratively, we calculate p for each internal node ν , each 0 i N and every character j. After we have visited every internal node of φ , we reach all tips s using Eq. 9 to obtain p φ\s s (i, D φ\s , j) for all characters j and all 0 i N , where φ \ s is the sub-phylogeny obtained by removing s (and itsparent branch) from φ . We then combine these probabilities at the tips with the initialisation probabilities p s s (i, D s |j) to obtain, at every tip s, for all characters j and each 1 i N: The final probabilities of interest, p φ s (i|D φ ) , can be calculated for every 0 i N and every tip s as: where P(D φ ) is the probability of the data (the phylogenetic likelihood of φ and the substitution model for D). In the case D is empty, that is, if we want to calculate weights w s , then P(D φ ) = 1 . Otherwise, if we are interested in weights w D s , then P(D φ ) can be calculated as a normalisation factor such that the probabilities p φ s (i|D φ ) at any s sum over i to 1. In either case, the final PNS scores can be easily calculated substituting the results from Eq. 12 into Eqs. 1 or 2.
We summarise the 'down' stage of the algorithm in Eq. 13:

Fast approximation
The most efficient algorithm above has cubic cost in N. In some circumstances, for example when N > 10 5 , it becomes important to consider faster solutions. For this reason, we also present an approximate PNS that can be calculated more efficiently. With i φ (s) the random variable representing the number of tips in φ that are PIBD to s, we have w s = N i=1 p s (i)/i = E[1/i φ (s)] (Eq. 1). As an alternative fast approximation we consider:

Algorithm stage Down
[initialise] run algorithm stage Up to calculate p φ ρ (i, D φ , j) for root ρ, and p φ L ν (i, D φ L | j) and p φ R ν (i, D φ R | j) at every internal node ν and every j, i visit every internal node ν in pre-order traversal; for each ν, j, i calculate p at each tip s calculate p φ s (i, D φ , j) for every j, i using Eq. 12 at each tip s calculate p φ s (i | D φ ) for every i using Eq. 7 (14) .
The weights w s can be computed very efficiently with an up-down pruning approach, requiring only O(N ) time, so we refer to them as 'fast PNS' . The algorithm to calculate weights w s has many similarities to the one in the previous Section, and is described in detail in the Supplement.

Application to inference of character frequencies
Inference of character frequencies specifically for a single alignment column has broad applications such as modeling selection [34,35], and creating profile HMMs [3,4,6] and sequence logos [36][37][38]. Here, we assume that the frequencies of interest are the equilibrium frequencies at a given alignment column, i.e. the average character frequencies over long evolutionary times. Such frequencies are typically represented in molecular phylogenetics as π , with π(j) being the equilibrium distribution of character j [50]. This definition of frequencies fits well with the assumptions of profile HMMs, and is also reasonable for sequence logos, although we acknowledge that different definitions might be also considered in different settings. In this work, we want to investigate and compare different methods for inferring π.
The simplest inference method is to use the observed frequency p(j) of character j within the given column as an estimate of the true frequency π(j) . This approach corresponds to assuming that all sequences are independent of each other. This approach might be ideal in some circumstances, for example when the considered sequences are not homologous but only evolutionary convergent, but might be inappropriate in others. As an example, consider an alignment of 1000 homologous human sequences and two mouse sequences (1002 homologous sequences in total). Genetic variation within mice, and variation between mice and humans will have negligible effects on estimates p(j), which will be dominated by within-human genetic variation. However, human sequences are highly correlated, as they have very short divergence time between each other, so within-human allele frequencies will typically not represent evolutionary equilibrium character distributions. The problem here is that using p(j) as an estimate of π(j) means treating homologous sequences as independent of each other, while they are often strongly correlated due to shared evolutionary histories.
A traditional way to address this problem is to use sequence weights, for example our w s , to reduce the contribution of groups of closely related sequences. We can in fact define p w (j) , a new estimate of π(j) , as: where, as before, D s is the observed character for sequence s at the alignment column D under consideration, and δ is again the Kronecker delta function.
We investigate and compare the performance, for character frequency inference, of the three weighting schemes introduced above: w s , w D s and w s . We also consider two popular sequence weights: those defined by [5], which we call HH94, and by [9], which we call GSC94. HH94 first calculates, for any s, the score of s at an alignment column D, which we will denote HH D s . This score HH D s is 1/rd, with r the number of different characters in D and d the number of times character D s appears in D. The weight for sequence s, which we denote HH s , is then defined as the average of HH D s over all columns D of alignment A.
(15) p w (j) = s w s δ(j, D s ) s w s GSC94 defines sequence weights iteratively along a phylogeny, by visiting branches in post-order traversal (from the tips to the root). First, all terminal branches (those connected to the tips of φ ) are visited, and the length of a terminal branch connected to tip s is assigned as the initialisation value of the weight GSC s of s. Then, every time an internal branch b is visited, its length t is distributed among the weights of its descendant sequences. More precisely, first t is split among the tips, with the part t s assigned to s being t s = t GSC s / s ′ ∈S b GSC s ′ , where S b is the set of tips descendent from b. Secondly, each GSC s for s ∈ S b is increased by t s . After the last branches connected to the root have been processed, the GSC s are the final GSC94 weights.
In addition to the character frequency inference p(j) based on the observed frequencies, and p w (j) based on one of many weighting schemes studied, we also consider character frequency inference via phylogenetic maximum likelihood (ML). We perform this using PhyML v3.1 by fixing the phylogenetic tree (inferred from the whole alignment A using FastTree v2.1.10 [51]) and the substitution model exchangeabilities, and inferring, one alignment column D at a time, only the equilibrium character frequencies π.

Bayesian approaches to character frequency inference
Above, we introduced point estimate methods for character frequency inference. These methods do not measure inference uncertainty, and this can result in a very limited summary of the available data. For example, observing character j 100 times in an alignment column from 100 distantly related species leads all above methods to infer 100% frequency for j; however, so also does observing j two times within an alignment column of just two closely related sequences. While in the first scenario there should be little uncertainty regarding the inferred frequencies, in the second scenario uncertainty should be elevated. Using a Bayesian method is a natural way to address this issue, and also allows the inclusion of priors over characters frequencies. Here we present a Bayesian variant of the weight-based character frequency inference of Eq. 15.
If A is composed of N independent (non-homologous) sequences, the likelihood of a column D is P(D|π) = s π(D s ) . It is simple to combine this likelihood with a character frequency prior to obtain a Bayesian posterior distribution, and perform Bayesian character frequency inference. However, we are interested in the general case where the sequences in A are related by a phylogeny φ , and therefore are not independent. One possible way to perform Bayesian inference of π in this scenario would be using Bayesian phylogenetic methods such as BEAST [52] or MRBAYES [53], but at often excessive computational cost. Instead, we propose an approximation of the likelihood function P(D|π) based on sequence weights w s : Similarly, we can replace weights w s in Eq. 16 with any other weighting scheme. In the following, we assume a uniform prior P(π ) on character frequencies, meaning that all possible π are similarly likely a priori. Alternative priors are possible, and some might be more realistic, but usually at the cost of introducing more parameters in the model. Our approximation of the posterior probability P(π |D) is then: where the integral in the denominator is over all possible character frequencies ξ . Equation 17 is a Dirichlet distribution with parameters α j = 1 + s w s δ(j, D s ) , so in the following we use the properties of Dirichlet distributions [54]. The (approximate) maximum a posteriori and ML π are both given by the weighted observed character frequencies p w (j) in Eq. 15. The approximation of the expectation of π(j) , E(π(j)|D) is however: where B is the number of possible characters in the considered alphabet, α j = 1 + s w s δ(j, D s ) and α 0 = B + s w s . This can be seen as the ML estimate of π in the presence of 1 pseudo-count per character. The posterior variance is then approximated as: which can be used as a measure of the uncertainty over character frequencies. However, considering that Eq. 17 has beta-distributed univariate marginals P(π(j)|D) ∼ Beta(α j , α 0 − α j ) , in the following we derive approximate 95% posterior probability intervals using the stats.beta.ppf function in scipy [55].

Simulations
We use simulations to test and compare computational demands of calculating PNS values as well as for assessing the accuracy of different approaches to infer position-specific character frequencies. In the base simulation scenario, we simulate nucleotide sequence evolution along a 100 vertebrate taxa phylogeny (Fig. 1) using Dendropy [56]. We use a HKY85 substitution model [57] with transition:transversion ratio κ = 3 both for simulation and inference. We simulate 10 replicates, each replicate consisting of an alignment of 1000 columns. Alignment columns are evolved independently of each other (conditional on the tree and the substitution model). As we do not simulate indel events, so we do not consider gap characters in our inference; when used on real data we would treat gap characters as missing data, as typically done in phylogenetics, but it would also be possible to include the gap character in the substitution model state space (see e.g. [49]).
Specific equilibrium character frequencies π are assigned to each alignment column. For each alignment, 800 columns ( 80% ) are simulated as evolving under the same background equilibrium character frequency distribution, which we set to π(A) = π(T ) = 0.3 and π(C) = π(G) = 0.2 . The background character frequency distribution represents, in our simulations, the evolutionary dynamics of positions not strongly affected by selective forces; at these positions, the equilibrium character frequency distribution is constant because it is mostly determined by neutral mutational biases, which we assume constant across all alignment columns. The remaining 20% of alignment columns are simulated under position-specific selection, with (17) , position-specific equilibrium character frequency π sampled from a Dirichlet distribution prior with α = 0.1 (Additional file 1: Fig. S2A). Our aim is, for each replicate and each alignment column, to infer π from the simulated sequences alone. For each replicate/alignment, we first infer a phylogenetic tree and alignment-wide HKY85 substitution model parameters using FastTree v2.1.10 [51]. We then consider this tree and the HKY85 κ parameter to be fixed and infer column-specific character equilibrium frequencies. While π is inferred separately at each column, the HKY85 alignment-wide parameters (including nucleotide frequencies) inferred with FastTree are used in some sequence weighting schemes (for example, in Eq. 3). The methods we used to infer equilibrium frequencies are: • the observed character frequencies in the alignment column (the p(j) described above), • observed frequencies corrected using the HH94 [5] weights and Eq. 15, • observed frequencies corrected using the GSC94 [9] weights and Eq. 15, • observed frequencies corrected using the PNS weights w s from Eq. 1 combined with Eq. 15, • observed frequencies corrected using our PNS weights conditional on data, w D s from Eq. 2, combined with Eq. 15, • observed frequencies corrected using our fast approximate PNS w s weights (Eq. 14) combined with Eq. 15, • Bayesian variants of the methods above, and • ML phylogenetic inference (only of equilibrium character frequencies) with PhyML v3.1 [58].
All the methods above, except FastTree and PhyML, were implemented in custom Python scripts available from https:// bitbu cket. org/ nicof may/ novel tysco res. In addition to the basic simulation scenario, we also consider variant scenarios in order to investigate how certain parameters can affect the results: • We consider alternative tree lengths, which we obtain by multiplying all branch lengths in the tree in Fig. 1 by a constant coefficient, either 0.2 or 5. • We consider the case of amino acid characters instead of nucleotides. In this case, we simulate under an LG substitution model [59], and when we do inference we assume that the substitution model (including character frequencies) is known. Column-specific character frequencies are inferred as usual. In this case, equilibrium character frequencies for columns under selection are sampled from a Dirichlet distribution prior with α = 0.02 (Additional file 1: Fig. S2B). • To test the effect of very biased taxon sampling in an alignment, we added multiple (either 100 or 1000) human tips to the tree in Fig. 1. The short phylogeny relating the human sequences was randomly sampled at each replicate under a standard coalescent prior [60] with mean coalescent time 0.001 between human sequences. This short human phylogeny was then appended to the human tip of the tree in Fig. 1.
• To test the robustness of methods to the assumption of an ultrametric tree (a tree where all tips have equal distance from the root), we consider the case of a strongly non-ultrametric trees, as is common for some viruses such as influenza.

Sequence weights
We implemented all the considered weighting schemes and all simulations within custom Python scripts (https:// bitbu cket. org/ nicof may/ novel tysco res), making use of the phylogenetic python package dendropy [56]. We implemented all the algorithms presented in the Methods section for calculating weights w s and w D s , and used comparisons of weights from different algorithms to assess the correctness of the implementations.
PNS shows similar trends to previous weighting schemes HH94 and GSC94, assigning higher weights to phylogenetically isolated taxa and smaller weights to taxa within clades with many other closely related taxa (Fig. 2). In particular, PNS seems to show an intermediate 'intensity' compared to the two other schemes. GSC94 weights are the most extreme, assigning the highest weights of any scheme to the most evolutionarily isolated taxa in the tree of Fig. 1, such as Lamprey, Coelacanth and frog Xenopus tropicalis. For example, for Lamprey, the GSC94 normalized weight is about 3 times larger than HH94, while PNS is about 2 times larger than HH94. Conversely, for taxa in over-represented clades, such as Human, GSC94 gives the smallest weight, HH94's weight being many times larger, and PNS being intermediate. Rescaling the branch lengths of the tree does not change this overall trend (Additional file 1: Fig. S3).  Fig. 1 (species names on x-axis labels) in the scenario of nucleotide data (1 locus of 1kb) by different weighting schemes: PNS (weights w s ), HH94 [5] and GSC94 [9]. Weights from each scheme are normalized so that the sum over taxa is 1

Computational demand
Computational efficiency is one of the main requirements for applicability of weighting schemes, in particular when considering large datasets; for this reason, here we compare the computational demand of different approaches. Calculating sequence weights based on a phylogeny (PNS and GSC94) usually requires limited computational demand, with the dominant cost being the estimation of the phylogeny itself ( Fig. 3 and Additional file 1: Fig. S4). One exception to this are the w D s weights of Eq. 2: these, being conditional on the data observed at a specific alignment column, need to be re-computed for each position. Calculation of these weights requires time cubic in N, and so it is not surprising that these weights are slower than phylogenetic inference. The other slowest method for character frequency is phylogenetic ML (PhyML), which also needs to be run once for each alignment column. All other approaches are at least one order of magnitude faster than PhyML and w D s in estimating character frequencies, and are practical also for larger trees (see e.g. Fig. S4 where we included 1000 closely related taxa). Calculating weights w D s and estimating frequencies by ML, instead, becomes infeasible on such larger trees. Estimating frequencies using HH94 weights is the fastest of the methods considered, as it does not require prior estimation of a phylogenetic tree, and it might therefore be one of the few possible choices available for extremely large datasets. The second fastest approach is GSC94, followed by the w s weights, and finally by the w s weights, although  Fig. 1 and nucleotide data. Time cost for computing frequencies from un-weigthed observed characters is not shown as it is negligible. Time demand of Bayesian variants of PNS weights is also not shown, as it is the same as for their non-Bayesian variants (Bayesian variants only require the addition of pseudocounts compared to non-Bayesian variants). 'FastTree' represents the cost of running phylogenetic inference with FastTree prior to weight calculation. Orange violin plots show the total cost (including computational cost of phylogenetic inference for methods requiring a phylogeny). Blue violin plots show the cost of calculating the scores without taking into account the cost of phylogenetic tree inference. For w D s and 'PhyML' , blue and orange plots overlap. Calculating HH94 weights is, overall, the fastest approach among those considered here, as it does not require phylogenetic inference these methods have very similar computational demand once the cost of inferring a phylogenetic tree is taken into account.

Accuracy of character frequency inference
Here we assess the ability of different weighting schemes, including those derived from our new PNS methods, to facilitate inference of column-specific character frequencies.
We measure the accuracy of an approach by calculating, at each alignment column, the Euclidean distance between simulated and inferred character frequencies. ML phylogenetic inference with PhyML is almost always the most accurate method (Figs. 4 and 5). This is perhaps not surprising, given that this approach fully models the effects of varying equilibrium character distributions on character evolution along the phylogeny. However, this approach is also the most computationally demanding, and the advantage of schemes based on sequence weights is that they can be much faster, in particular on datasets with many sequences or many alignment columns. The only case where PhyML seems marginally less accurate than weights-based methods is at high divergence and strong selection (Fig. 5f ). This is probably due to the particular implementation in PhyML, which does not allow character frequencies below 1%.
All the weighting schemes considered improve character frequency inference compared to the simplest approach of counting the observed characters at an alignment Fig. 4 Equilibrium frequency inference error. Comparison of the accuracy of different methods for reconstructing equilibrium frequencies in the basic simulation scenario (nucleotide characters and tree as in Fig. 1). Violin plots summarise the nucleotide frequency inference error (on the y-axis), measured as the Euclidean distance between the vectors of column-specific simulated nucleotide frequencies and inferred ones. Each plot contains 10 replicates, and each replicate contains 800 alignment columns evolved under the background nucleotide frequencies (a, c and e), or 200 alignment columns evolved under equilibrium nucleotide frequencies sampled from a Dirichlet distribution with α = 0.1 (b, d and f). Horizontal black dashed lines aid comparison by showing the median error of the first method (frequencies extracted from character counts). In a and b the tree branch lengths were scaled by a factor of 0.2; in c and d by a factor of 1.0; and in e and f by a factor of 5.0. Each plot shows results for a particular character frequency inference method, indicated on the x-axis. Results from additional methods (e.g. Bayesian approaches) are shown in Additional file 1: Fig. S5 column (Figs. 4 and 5). GSC94 and w s weights seem to give more accurate results than HH94 and w s weights, in particular within very biased datasets (Fig. 5c-f ). The latter is not too surprising, given that weights w s are an approximation of weights w s .
We note that in Figs. 4 and 5 the weights w s and GSC94 give very similar accuracy, with GSC94 sometimes marginally outperforming w s . In theory, we expect the weights w s , compared to GSC94, to benefit from the advantages of being based on intuitive mathematical principles and accounting for the effects of saturation. However, saturation probably has very little impact in this scenario (and in many real life scenarios); another important factor at play here might be that PNS gives more uniform weights compared to the more 'extreme' GSC94 weights. The latter might perform better in this case, possibly because PNS counts character observations as independent after one mutation event, when in reality more mutation events might be needed to approach near-independence of character observations. While our weights w s (and also w D s ) are calculated exactly (aside from rounding errors), this does not mean that an estimate of character frequencies based on these weights will be exact with respect to phylogenetic maximum likelihood optimization. Rather, they give an approximation, and our weights were not defined specifically in order to optimise character frequency estimation.
A limitation of GSC94 is that it does not work well with trees in which tips have very different distances from root (non-ultrametric trees). This effect has limited impact in our basic simulation scenario, as the tree in Fig. 1 is not far from ultrametric. However,  Fig. 1 and with amino acid sequences. In c and d we consider nucleotide sequences and the tree in Fig. 1 with 100 added human sequences (see Methods). In e and f we instead add 1000 human sequences. Results from additional methods (e.g. Bayesian approaches) are shown in Additional file 1: Fig. S5. Results from PhyML are not available, due to excessive computational demand when we consider a strongly non-ultrametric tree (Fig. 6a), as is often observed for some viruses such as influenza [61,62], we see that the GSC94 weights are strongly impacted, resulting in considerably worse inference than any of the other weighting schemes studied, and worse even than observed character frequencies (Fig. 6b). The reason is that, in such strongly non-ultrametric trees, GSC94 weights at terminal, younger tips tend to be considerably larger than GSC94 weights at older tips closer to internal nodes and in particular those closer to the root. Even in cases when observed characters close to internal nodes can provide useful information regarding equilibrium frequencies, for example when branches are sufficiently long in Fig. 6a, GSC94 weights are still almost exclusively distributed on the latest two phylogenetic tips in this scenario. All other approaches seem to perform similarly well in the scenario of Fig. 6, including simple base counting, and the likely reason is that here no clade is over-represented, and so a weighting scheme is not needed in the first place for the considered application.
Using sequence weights conditional on the data at the specific column, i.e. w D s from Eq. 2, unexpectedly does not seem to improve accuracy (Additional file 1: Fig. S5) while, as shown in Fig. 3, it does significantly impact computational demand. For these reasons, we do not generally recommend the use of weights w D s . Using a Bayesian approach to character frequency inference means that the prior on character frequencies can affect the result of the inference. This can have a positive effect if the prior distribution is based on reliable evidence from sources other than the currently considered dataset. However, in our simulations we consider a completely arbitrary prior (corresponding to observing one character of each type at the considered alignment column) and this has the effect of slightly shifting the inferred frequencies closer to a uniform distribution (Additional file 1: Fig. S5). Expectedly, this overall improves character inference at sites evolving under the background frequencies, while it worsens inference at sites evolving under strong selection.

Discussion
We have proposed a new approach for assigning weights to the sequences in an alignment, or, equivalently, to the tips of a phylogenetic tree. First, we define phylogenetic novelty scores (PNS) based on rigorous mathematical principles. These scores summarise how novel is a sequence (respectively, tip), in evolutionary terms, with respect to the rest of the alignment (respectively, tree) and have a number of desirable properties, including meeting the objective criteria of [33].  6 Equilibrium frequency inference error with a strongly non-ultrametric tree. a: The strongly non-ultrametric phylogenetic tree under which simulations for this figure are performed. Some tips of the tree (e.g. T10, T20) are close to the root while others (T1, T11) are considerably more evolutionarily distant; in an ultrametric tree, all tips would instead have the same distance from the root. b and c: Violin plots summarising nucleotide frequency inference error (y-axis), measured as the Euclidean distance between the vectors of column-specific simulated nucleotide frequencies and inferred ones. Each plot contains 10 replicates, and each replicate contains (b) 800 alignment columns evolved under the background nucleotide frequencies, or (c) 200 alignment columns evolved under equilibrium nucleotide frequencies sampled from a Dirichlet distribution with α = 0.1 . Each plot refers to a particular character frequency inference method, indicated on the x-axis (See figure on next page.) compared to some popular sequence weighting schemes, in particular HH94 [5] (see for example Figs. 4E and 5C, E). This however usually comes at the cost of additional computational demand, especially considering that our scores require the availability of an inferred phylogenetic tree, and considering that this might not be feasible for extremely large datasets. PNS and GSC94 [9] weights both require a phylogenetic tree estimate, and both show very similar performance in our main simulation scenario, with GSC94 marginally outperforming PNS. However, we demonstrate that, unlike GSC94 weights, PNS are not affected when the assumption of tree ultrametricity is violated (Fig. 6), and similar patterns are expected with respect to the robustness to the position of the tree. This shows that PNS are particularly versatile in applicability, as one would expect from their formal phylogenetic derivation. Over most scenarios, the most accurate method for position-specific character frequency inference seems to be standard phylogenetic ML inference; however, this approach is also very computationally demanding, and is not suitable for large datasets.
Character frequency inference, our example use-case for PNS, has a number of important applications. Character frequencies are fundamental parameters used in HMM profiling of protein families [3,6], and our scores could therefore improve approaches to this task. Our scores could also be used to improve character frequency estimates used within alignment column-wise conservation scores [36][37][38], frequently defined as where p(j) is the frequency of character j at a given alignment column, and B is the number of characters ( B = 4 for nucleotides and B = 20 for amino acids). ( S max is the maximum possible entropy at the considered position, equal to log 2 B , while S obs is the observed value.) Typically, the p(j) are inferred from the observed character frequencies at an alignment column; however, as we have shown, our PNS can significantly improve the inference of these frequencies, and therefore of conservation scores. Our simulations suggest that this is in fact the case (Additional file 1: Fig. S6).
Sequence weights, like our PNS, also have many other applications, for example to aid alignment inference. They have been shown to improve sequence alignment [1] and profile searches [5,10], and examples of their use include PSI-BLAST [2] (which uses HH94 weights) and the CLUSTAL family of aligners (e.g. [1,10] use GSC94 weights). Our scores could therefore result in improved alignments.
Sequence weights are also used to measure alignment quality, and our scores could be used for example in the context of the information content score (ICS) [63] or the norMD approaches [64]. Furthermore, our scores could be used in measures of conservation priority in conservation biology, such as phylogenetic diversity PD [25], quadratic diversity Q [30] and the phylogenetic entropy index H P [31].
Lastly, we note that our scores could be used to improve the definition of phylogenetic effective sample size to be used for AICc [65] and BIC [66]. This is usually defined as the number of alignment columns, but this is not the only reasonable choice [67,68].