SFREEMAP - A simulation-free tool for stochastic mapping

Background Stochastic mapping is frequently used in comparative biology to simulate character evolution, enabling the probabilistic computation of statistics such as number of state transitions along a tree and distribution of states in its internal nodes. Common implementations rely on Continuous-time Markov Chain simulations whose parameters are difficult to adjust and subjected to inherent inaccuracy. Thus, researchers must run a large number of simulations in order to obtain adequate estimates. Although execution time tends to be relatively small when simulations are performed on a single tree assumed to be the “true” topology, it may become an issue if analyses are conducted on several trees, such as the ones that make up posterior distributions obtained via Bayesian phylogenetic inference. Working with such distributions is preferable to working with a single tree, for they allow the integration of phylogenetic uncertainty into parameter estimation. In such cases, detailed character mapping becomes less important than parameter integration across topologies. Here, we present an R-based implementation (SFREEMAP) of an analytical approach to obtain accurate, per-branch expectations of numbers of state transitions and dwelling times. We also introduce an intuitive way of visualizing the results by integrating over the posterior distribution and summarizing the parameters onto a target reference topology (such as a consensus or MAP tree) provided by the user. Results We benchmarked SFREEMAP’s performance against make.simmap, a popular R-based implementation of stochastic mapping. SFREEMAP confirmed theoretical expectations outperforming make.simmap in every experiment and reducing computation time of relatively modest datasets from hours to minutes. We have also demonstrated that SFREEMAP returns estimates which were not only similar to the ones obtained by averaging across make.simmap mappings, but also more accurate, according to simulated data. We illustrate our visualization strategy using previously published data on the evolution of coloniality in scleractinian corals. Conclusion SFREEMAP is an accurate and fast alternative to ancestral state reconstruction via simulation-based stochastic mapping. Electronic supplementary material The online version of this article (doi:10.1186/s12859-017-1554-7) contains supplementary material, which is available to authorized users.


Introduction
SFREEMAP is an analytical approach to obtain accurate, per-branch expectations of numbers of state transitions and dwelling times. We also introduce an intuitive way of visualizing the results by integrating over the posterior and summarizing the parameters onto a target reference topology (such as a consensus or MAP tree) provided by the user.
The following sections will guide you through installation and use of this tool.

Installing sfreemap
First make sure you have libblas-dev and liblapack-dev installed on your system.
To install the development version from github:

New object classes
One new class of object extend existing data structure for phylogenetic trees: * sfreemap: complements "phylo" and "multiPhylo" classes from ape and phytools with mapped.edge.lmt, a matrix containing the expected value for the number of transitions among states.

Simple stochastic mapping 4.1 Standard type
The program accepts the parameter type=standard which should be used when the character is of morphological type. The dataset sfreemap.corals.trees and its corresponding tip values sfreemap.corals.tips can be used as example here.
Just to have an idea of the dataset we are working on, let's have a look at the first tree: Again, we will run the program with only the first ten trees. The tips dataset has several characters, and again we are going to use just the first ten. What the program will do in this case is to run each tree against each character, in the order they appear in their respective objects. The result will be ten mapped trees. Same logic would apply for standard type. Although the command above is quite typical for an R experienced programmer, it might not be as simple to remember and understand for a more regular user. Besides, what if we want the median instead of the mean? The command would be completely different. That is why we provide some tools to analyse the mappings produced by sfreemap command, which will be described in the next section.

Analysing mapped data with histograms
As we have shown, the object created by sfreemap can be analysed and manipulated to produce summaries and plots. In this section we will show some tools to make this task a lot easier.
For the examples we will map again the sfreemap.corals.trees dataset, but this time scaling the trees to have the same distance from the root to the terminals. This is important for analysing the number of transitions considering the same proportion on all trees.
trees <-rescale(sfreemap.corals.trees, height=1) data <-sfreemap(trees, sfreemap.corals.tips, parallel=FALSE) Now we will run the function below, which analysis mapping results on all trees, creating an object with the dwelling times and the number of transitions for every node of every tree. It tries to match nodes using function matchNodes from phytools packages, which consider two nodes to be the same when they share the same taxa. If you haven't rescaled the trees before but want to do it now, just pass on scale.trees=1 (or any other value) to the function below.
base_tree <-data[[1]] mpd <-map_posterior_distribution(base_tree, data, scale.branches=TRUE, parallel=FALSE) The first parameter of the function is the so called "base tree", the tree on which the branches of the other trees will be compared to. Second parameter is all other trees to be analysed, scale.branches is used to represent the dwelling times regarding the percentage of time spent in the branch, instead of the absolute value (the expected number of transitions is not scaled, as it doesn't make sense to do so).
This function returns a list of three elements, each one representing a type of analysis.
1. lmt (labelled markov transitions): the expected value for number of transitions among states; 2. emr (expected markov reward): it's the dwelling times of states; 3. mr (mutation rate): the muration/evolution rate; Each item on this list contains an array of three dimensions, [x, y, z], where x indexes the trees, y indexes the states and z indexes the nodes. As an example, mpd$emr [10,'colonial',83] would return the dwelling times for state colonial on node 83 of the tenth tree.
If node 83 is not present on the tree number 10, the value will be NA.

Expected dwelling times for states of a branch across all trees
Let's suppose we want to check the posterior distribution of states on a specific branch of the tree. Branches don't usually have names, so we will define a branch by it's ending node. For example, the code below will plot our base tree and it's correspondent node numbers.  The interpretation is as follows: the colonial state was present with 95% certainty around 10% and 45% of the time on the branch ending on node 89 and, with equal certainty, the state solitary was present during 55% to 90% of the time on this branch. At the top of the chart, the text branch posterior probability: 100% means that this particular branch was present on 100% of the trees given as argument to the function map_posterior_distribution and compared to out base tree.
It is also possible to plot only one of the states by supplying the states argument.

Expected dwelling times for states for all branches across all trees
Quite easy to do that, just omit the node parameter and the function will consider the distribution over all nodes.

Posterior Distribution of Branch Lengths
Branch posterior probability: 99.28%

Expected dwelling times for states for all branches on a group of trees
To filter or limit the distribution over a specific group of trees, or maybe a single tree, pass on the argument trees, which work in the same way as nodes and states.

Posterior Distribution of Branch Lengths
Branch posterior probability: 100%

Expected number of transitions for states of one or more branches across all trees
Similar plots can be generated for the number of transitions using the same function and mpd object created before, just changing the parameter type from emr to lmt (labelled markov transitions).
As stated before, it doesn't make sense to scale the number of transitions according to the branch length, so here the absolute values are represented in the x-axis.

Posterior Distribution of Branch Lengths
Branch posterior probability: 90.55%

Expected mutation rate for states of one or more branches across all trees
As the last type of distribution plot, mr (mutation rate) shows the rate of mutation for states per branch. Parameters are very similar than before. It is possible to filter by trees, nodes and/or states.
Just as a reminder, it makes more sense to analyse mutation rate on trees where the distance between any leaf and it's root is equal. If the trees passed to sfreemap where not scaled in that way, it's possible to do that using the map_posterior_distribution function.
base_tree <-data[[1]] mpd <-map_posterior_distribution(base_tree, data, scale.trees=1, parallel=FALSE) Now that we know that all values are scaled in a way that all leaf nodes are distant from the root node by 1 unit, we can then plot the distribution chart.
We can get the mutation rate for all states in a single tree (let's say, tree 2 ) with the command below: