Algorithm for inference
Software has been created to enable evaluation of conditional retention probability ratios (the test statistic) for any genome that has a pair of WGD events in its history. Written in Python, the conditional probability ratio statistic is calculated from a collection of gene trees derived from systematic comparative genomic analysis and a reference species tree with whole genome duplication events labeled. The statistic can be calculated for every pair of WGD events that occurs serially in the evolutionary history of a species. This refers to two whole genome duplication events that occurred on different species tree lineages that are both on the same phylogenetic trajectory from the species tree root to the extant species tip. At the heart of the package, is a gene tree-species tree algorithm that labels specific nodes in each gene tree, as described below.
The input to the software for inference are a reference rooted species tree and a set of unrooted gene trees from the genomes of the species involved. Gene trees do not need to contain genes from all species in the species tree but are assumed to contain all members that existed descended from the root node of the species tree. Once rooted using the species tree, gene trees containing more than 1 species with root nodes that are duplication events are split iteratively until the root node is a speciation event.
In order to calculate the conditional probability ratio, it is needed to map WGD events onto nodes of a gene tree. In principle, a WGD event could correspond to any duplication event on a gene tree where all leaves under that node correspond to a species under the event on the species tree. Here it will be assumed the most parsimonious solution, where the placement that results in the smallest number of small scale duplications is utilized. When equally parsimonious solutions are possible, the algorithm will place WGD at the earlier node. There may be scenarios where this choice can lead to a bias, but this has not been detected here. The mapping has a linear time complexity with respect to the number of nodes on the species tree.
Notation
-
G, denotes an input node. This should be the root node of a rooted and labeled gene tree.
-
S, denotes an input node. This should be the root of the corresponding species tree with labeled WGD events.
-
g, is used to represent specified nodes on the gene tree.
-
s, is used to represent specified nodes on the species tree.
-
r(n), denotes the right child of node n.
-
l(n), denotes the left child of node n.
-
L(n), denotes all leaves under n.
-
add_event(n,*), labels a gene tree node n as a WGD duplication, where the duplicate was either retained so the event is “Present” or the duplicate was lost so the event is “Missing”.
-
lca(s,g), denotes a method that returns the mapping of node s onto g by getting the last [least] common ancestor of all leaves present under s on g. For example, let g define a gene tree node: ((a1, b1), (a2, b2))g. Let s be a parent node on the species tree with only a,b as children. Here the left child and the right child include all species labels. Then lca(s,g) will simply return g.
Implementation
The placement algorithm takes as input binary gene trees that are rooted and reconciled such that duplication and loss events are present and labeled and one species tree where the branches containing the WGD events are labeled. This is generated by the software as described below, given a rooted species tree and a set of unrooted gene trees. Given the root node of a gene tree G and the root node of the species tree S the algorithm maps the WGD events on to the gene tree to give as output a new gene tree with nodes labeled as WGD.
First it is necessary to root and reconcile the gene trees. The software package presented here implements a known reconciliation algorithm [33], which differs from an algorithm previously implemented in the group [34]. The software also roots the tree by iterating over all branches and selecting the root that minimizes the number of duplication and loss events on the reconciled tree. From the reconciliation a rooted gene tree with both duplication and loss events labeled is obtained. Although it is possible and computationally more efficient to map WGD events onto the gene tree as part of the reconciliation it was decided to keep these methods separate to allow other reconciliation algorithms to be used with the WGD placement software. As long as the resulting gene tree is labeled with duplication and loss events the placement algorithm will work. The loss events are treated as leaves for the purposes of the mapping. Let G be the root node of a gene tree and S the root of a species tree. Both should be labeled rooted binary trees. The set of leaves of a gene tree L(G) and a species tree L(S) are taken from the same set of species. The first step is to select the most recent possible node n under g an arbitrary gene tree node such that L(s) ⊆ L(n), where s represents a node on the species tree directly after the WGD event and L(s) denotes all leaves under s. Let lca(s, g) be a function for this mapping. This mapping could return an n such that L(n) ⊈ L(s), for example if g was a node (((a1, b1), c1), ((a2, b2), c2)))g and s was only a parent to a,b.If this is the case n is a duplication occurring before the whole genome duplication event. This means it is possible the event corresponds to two locations on the gene tree. So we use r(n), the right child of n and l(n), the left child of n and get two new mappings. This process continues until we have a likely candidate, a node or nodes n such that L(s) ⊆ L(n) and L(n) ⊆ L(s).
Next, one checks if n is labeled as a duplication by the rooting/reconciliation method. If the node is not a duplication, then the node is labeled as a missing WGD event. If the node was a duplication both children are checked for duplication events. If both children are duplication nodes, then the event is placed on both children. If not, the event is placed on the current node and labeled as present. The reasoning behind checking the children for duplication events works as follows. If a duplication, where all leaves under that node correspond to a species under the event on the species tree, is in fact not a result of WGD it be would expected to see both copies of the gene duplicate as a result of WGD. This method greatly reduces the number of duplication events attributed to small scale duplication compared to placement methods that do not check children node for duplications. Here, event_num is an integer that tracks which event is being placed. Upon successfully placing an event, it is incremented for the next consecutive event. For example, when place_event() is called if event_num is 0 the node will be labeled “event 0”.
Pseudo-code
Pseudo-code for the algorithm is provided as Fig. 1.
Now with WGD events mapped onto the gene tree, computing the conditional retention is straight forward. First one iterates over each node until we find a node labeled as event 0 then we iterate over all descendant nodes until we find a node labeled as event 1. For every event 1 node found the number of possible duplicate copies increases by one. All the leaves descending from the event 1 node are checked and the total number of retained duplicate copies is counted. The number of duplicate copies along with the number of possible duplicate copies is added to a counter that tracks the total copies and possible copies across all gene families being analyzed. If event 0 was present, then the copies and possible copies get added to the RR count and if event 0 was missing, the copies get added to the LR count. This process repeats for all events 1 descending from the event 0. This is done for every event 0 node on the tree.
For example, In the case where a gene "A" was retained after the first WGD, there will be two copies of "A", "A1" and "A2", before the second WGD happens. If, after the second WGD, the new copies of "A1" are retained but not those "A2". The software with find one event 0 with two descended nodes labeled event 1 so 2 possible duplicate copies, and since the “A1” duplicate was retained and not “A2” we would see 1 duplicate copy of the gene still present on the tree. Thus for just this tree the conditional probability of RR would be 0.5 and since the first event was retained there would be no data for LR.
Comparative genomic bioinformatic pipeline
The following seven species of plants with 5 pairs of WGD events between them [35,36,37,38] were selected to be analyzed: Ananas comosus, Elaeis guineensis, Nelumbo nucifera, Oryza brachyantha, Panicum hallii, Phalaenopsis equestris, and Phoenix dactylifera. All plant species selected are autopolyploids with available high-quality genomes, allopolyploids were eliminated. A species tree was generated using NCBI [39] with the timing of species divergences generated using Timetree [40]. Protein sequences for all seven species were gathered from NCBI as FASTA files. A total of 255,312 protein sequences were gathered, 35,775 from Ananas comosus at 400x, 41,887 from Elaeis guineensis at 16 × coverage, 38,191 from Nelumbo nucifera at 100 × coverage, 26,803 from Oryza brachyantha at 104x, 44,192 from Panicum hallii at 202 × coverage, 29,894 from Phalaenopsis equestris at 99.5 × coverage, and 38,570 from Phoenix dactylifera at 139 × coverage. High coverage genomes were used to reduce any potential for bias due to missing gene duplicates. It should be noted that the Pratio statistic is expected to be more robust than other retention statistics because missing duplicates might be expected to occur in the statistic numerator or denominator without bias. BLAST all-against-all was run for each pair of species, including against themselves, at an e-value threshold of 10–10 to identify homologous protein pairs. These pairs of homologs were run through a script to ensure that a pair of homologs had both percent identity and percent ungapped were both ≥ 60%. 255,187 gene families were formed by single linkage clustering of all the gene pairs. Protein alignments of the gene families were generated using MAFFT [41]. Maximum likelihood trees were created for gene families of size 4 or greater by PhyML [42] using SMS model selection [43] and Neighbor-Joining. Due to the size of some gene families (size > 100) PhyML was not an efficient method to use, Neighbor-Joining was used for these families. The gene trees were rooted using a Python script based off [7, 34]. As described above, gene trees with more than 1 species and a root node as a duplication event were iteratively split until the root node was a speciation event. In total 12,852 trees with size > 4 were generated, 81 of which used neighbor joining due to size.
Generation of simulated data
For testing and as a companion to the inference tool, a simulation tool was built that is capable of producing a statistically probable gene trees for any given species tree [44,45,46]. Using a Poisson process to dictate the arrival of events (duplication/loss), a set of gene trees is produced. The evolutionary history of a gene is simulated over each branch of a given species tree.
Duplication affects the gene tree by duplicating the corresponding branch and subtree and losses affect the tree by dropping one branch and subtree. The simulation also includes the ability to add WGD duplication events where every copy of the gene present for the event is duplicated. The location for these events is determined via comments with a specific T value on the input species tree. The T value represents time and affects the probability that the simulation will place a SSD before or after the event on the same branch. This is a comment used with NHX format. The package uses ETE 3 [47] to read and manipulate trees. The resulting tree is a rooted simulated gene tree corresponding to the inputted species tree with WGD events labeled. Using these simulated trees, the inference method used to place WGD events on the tree can be tested for accuracy under different sets of conditions.
The simulation method was run over two tree types, balanced and caterpillar (Fig. 2) to produce gene trees with correctly labeled WGD nodes. 1,000 trees were produced for every combination of loss and SSD rates = {0.01, 0.009, 0.002, 0.0002, 0.00002} (units of ‘event per million years’) (Additional file 1: Table S1). The values were chosen based on the current estimate of the rate of evolutionary events given in [48] and simulation run time limitations.
To test if the number of speciation events between WGD events affected the accuracy of the placement, the simulation was run for variations of the balanced and caterpillar tree where WGD events were placed a different number of speciation events apart. 1000 trees for each placement, loss rate and SSD rate were produced (Additional file 1: Tables S2 and S3). When the inference tool was run over the simulated trees to identify the branches inferred to have duplication events, the accuracy was inferred. Accuracy is the total number of correctly placed events divided by the total number of events called on the tree.
Bootstrap analysis
Bootstrap analysis was performed to generate p-values and intervals containing 95% of the data. For the comparative genomic analysis, 1000 bootstrap samples of the 8013 trees containing P. equestris were generated. From the simulated data, 1000 bootstrap samples of the 1000 trees for each of the 150 data points was generated.
Software user information
The tool developed in this paper is available on the Python package index under the name WGDTree from https://github.com/cnickh/wgdtree. The software provides functions for simulating likely gene tree phylogeny for a given species tree with WGD events, placing WGD events onto gene trees given a labeled species tree, and determining the conditional retention rate of duplicates resulting from WGD events. There is a user guide available with the source code on GitHub. In addition to the user guide there is also example code displaying the expected usage of the functions provided by this package.