Identification of phylogenetic tree trunks
To resolve tree trunks, sitePath assumes that most descendant sequences of each terminal node have the same polymorphism state. The Treemmer algorithm [6] collapses tree tips and guarantees that the collapsed tree tips are within a monophyletic group. Here, the terminating condition for the Treemmer algorithm is changed to tree tips having the same polymorphism state. However, it is difficult to retrieve a monophyletic group when a few sequences in the group have different polymorphism states than the dominant one (Fig. 1A). The modified Treemmer algorithm would identify a few smaller groups instead of a single monophyletic group. To address this issue, another method similar to CD-HIT [7] is implemented in an attempt to merge the smaller groups. The rule is that the average similarity of the group after merging is no lower than a threshold. The groups of sequences to be merged depend on which site being evaluated. It is therefore impossible to define a universal threshold for all sites. Instead, a universal rule for generating the threshold value is set as the median similarity of the sequences. Furthermore, a polymorphism state has to exist in enough sequences to be valid for defining clades and tree trunk terminals. Thus, the group with the number of sequences less than the predefined \({N}_{\mathrm{min}}\) (the minimum number of sequences to define a valid polymorphism clade) will be treated as outlier and dropped. The tree trunks are found by linking the tree root and the ancestral node of all resulting groups. Only the longest tree trunks are retained if there are overlaps (Fig. 1A).
The phylogenetic pathways are derived from tree trunks. A phylogenetic pathway can be a combination of tree trunks to describe the ancestral and descendant relationships of clades. A clade closer to the tree root is defined to be ancestral to its neighbor clade. The default \({N}_{\mathrm{min}}\) value depends on the total number of sequences in the input dataset. In sitePath, the default \({N}_{\mathrm{min}}\) is defined as the number of pathways remains the same for a few consecutive \({N}_{\mathrm{min}}\) values. For an input file containing all the available sequences of a virus or a proportional subsampled dataset of it, the default \({N}_{\mathrm{min}}\) is recommended. While for a partial dataset of a virus, sitePath provides a method to help find the best \({N}_{\mathrm{min} }\) (Fig. 2A).
Branch-and-bound search for polymorphism clades
A branch-and-bound like search method is implemented to minimize the entropy of the polymorphism state across clades. The search is independently performed per site per tree trunk and the site will be referred as site \(X\) in this section. The goal is to search for tree trunk nodes delimiting the sequences into such clade set \(\left\{{C}_{1}, {C}_{2}, \dots , {C}_{M}\right\}\) that the total entropy \(H=-\sum_{\mathrm{m}=1}^{M}\sum_{n}\mathrm{Pr}\left({S}_{m}^{n}\right)\times \mathrm{log}\left(\mathrm{Pr}\left({S}_{m}^{n}\right)\right)\) is minimized (Fig. 1B). Assume in clade \({C}_{m}\) there are \(N\left({C}_{m}\right)\) sequences in total and \(N\left({S}_{m}^{n}\right)\) sequences with polymorphism state \(n\) for site \(X\). Then the probability of drawing polymorphism state \(n\) in \({C}_{m}\) is \(Pr\left({S}_{m}^{n}\right)=N\left({S}_{m}^{n}\right)/N\left({C}_{m}\right)\). The search starts with initial entropy \({H}_{0}\) by treating all sequences on the tree trunk as a single clade delimited by the tree root and the terminal node. Then, the search branches to explore new clade sets through iterations to find the goal clade set with the minimum entropy \({H}_{\mathrm{min}}\).
A breadth-first searching strategy is used, meaning that all possible clade sets branched from the previous clade set are evaluated when proceeding with the search. All \(N({C}_{m})\) of a newly branched clade set is constrained to be greater than the predefined \({N}_{\mathrm{min}}\) otherwise it will become inactive in later iterations. For each iteration, all the newly branched and other active clade sets are compared to find the temporary best solution with the lowest \(H\) value. This is because \(H\) is not monotonically decreasing in the search towards the goal clade set. The search is terminated when the temporary best solution stays the best after a few iterations. It is set as the total number of nodes of the tree trunk by default and can be folded by user to elongate the search.
Fixed and parallel mutation
The result can be further used to find fixed and parallel mutations in phylogenetic pathways (Fig. 2). Fixed mutation was found by comparing the dominant polymorphism state between adjacent clades (Fig. 2B), while parallel mutation was found when a variant was observed on multiple tree trunks (Fig. 2C). The performance (Fig. 3) of detecting fixed mutations is tested on an influenza A H3N2 antigenic drifting dataset [1]. For visualization, sitePath provides functions based on the R package ggtree [8]. As an example, a ZIKA virus dataset (see Additional file 1 for NCBI Accessions) is included in sitePath package. The result of its site 139 of the polyprotein, which was reported to be associated with enhanced virulence [2], is shown in Fig. 1C. Moreover, sitePath was used on SARS-CoV-2 to find fixed and parallel mutations, which were found to be potentially adaptive for the virus [9].
Computation time complexity
Tree trunk and clade identification approximately share the same algorithm time complexity of \(O\left(SL/{N}_{\mathrm{min}}\right)\), where \(S\) is the total number of sequences and \(L\) is the length of the sequence alignment. To save time, the sites that are completely conserved or have too many ambiguous characters or gaps are ignored.