Object manipulation and transformation
Bio::Phylo provides a toolkit for the manipulation of rich phylogenetic data objects. The objects can be annotated and labeled, and have any number of arbitrary other objects attached to them. The objects can be traversed in various ways, including depth-first, breadth-first or level-order traversal of tree shapes and through iterator or visitor access [17] to all objects that are lists of things (e.g. a Bio::Phylo::Taxa object is a list of OTU objects). Traversals can move from node objects to the OTU objects that define them (and back) and from OTU objects to character state observations that were made for the OTU objects (and back). The objects can be tested for various predicates, e.g. whether a tree is rooted, whether it is binary, whether it is ultrametric; whether a set of tips is monophyletic with respect to a given outgroup, whether a set of tips forms a complete clade; whether a node object is a tip, an internal node or the root node; whether it is the ancestor, parent, sibling, child or descendant of another node.
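The traversals and predicates described above can be illustrated with a minimal sketch. It is written in Python rather than Perl and does not use the Bio::Phylo API: the `Node` class and all function names are hypothetical, chosen only to show how such tests reduce to short tree walks.

```python
from collections import deque


class Node:
    """Hypothetical minimal tree node; not Bio::Phylo's node class."""

    def __init__(self, name=None, length=0.0, children=None):
        self.name = name
        self.length = length  # length of the branch leading to this node
        self.children = children or []

    def is_tip(self):
        return not self.children


def depth_first(node):
    """Pre-order traversal: yield a node, then all of its descendants."""
    yield node
    for child in node.children:
        yield from depth_first(child)


def breadth_first(root):
    """Level-order traversal using a first-in, first-out queue."""
    queue = deque([root])
    while queue:
        node = queue.popleft()
        yield node
        queue.extend(node.children)


def is_binary(root):
    """True if every internal node has exactly two children."""
    return all(len(n.children) == 2 for n in depth_first(root) if not n.is_tip())


def root_to_tip_lengths(node, so_far=0.0):
    """Map each tip name to its path length from the root."""
    if node.is_tip():
        return {node.name: so_far}
    paths = {}
    for child in node.children:
        paths.update(root_to_tip_lengths(child, so_far + child.length))
    return paths


def is_ultrametric(root, tol=1e-9):
    """True if all root-to-tip path lengths agree within a tolerance."""
    lengths = list(root_to_tip_lengths(root).values())
    return max(lengths) - min(lengths) <= tol


# The tree ((A:1,B:1)AB:1,C:2)root
tree = Node("root", children=[
    Node("AB", 1.0, [Node("A", 1.0), Node("B", 1.0)]),
    Node("C", 2.0),
])
preorder = [n.name for n in depth_first(tree)]
levelorder = [n.name for n in breadth_first(tree)]
```

Here the pre-order visit sequence is root, AB, A, B, C, whereas the level-order sequence is root, AB, C, A, B; the example tree is both binary and ultrametric.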
Using these traversal methods and tests, simple calculations are easily implemented. Bio::Phylo provides a number of these, e.g. the sum of all branch lengths on a tree; the average, minimum, maximum and cumulative root-to-tip path length; the amount of redundancy (i.e. the amount of shared, ancestral evolutionary history along all lineages relative to the total amount of evolutionary history, including along terminal branches). In addition, a number of more sophisticated tree shape methods useful for biodiversity informatics are provided:
▪ calc_ltt - Calculates lineage-through-time points [18].
▪ calc_symdiff - Calculates the symmetric difference metric [19] between two trees.
▪ calc_fiala_stemminess - Calculates the Fiala stemminess measure [20].
▪ calc_rohlf_stemminess - Calculates the Rohlf stemminess measure [21].
▪ calc_imbalance - Calculates Colless's coefficient of tree imbalance [22].
▪ calc_gamma - Calculates the γ-statistic [23].
▪ calc_i2 - Calculates I2 imbalance [24].
▪ calc_fp - Calculates the Fair Proportion value [25, 26] for each tip.
▪ calc_es - Calculates the Equal Splits value [27, 28] for each tip.
▪ calc_pe - Calculates the Pendant Edge [29] value for each tip.
▪ calc_shapley - Calculates the Shapley [30] value for each tip.
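As an illustration of two of the measures listed above, the following minimal Python sketch computes a Colless imbalance sum and Fair Proportion values on a small binary tree. It is not the Bio::Phylo implementation: the `Node` class and function names are hypothetical, and the normalization by (n-1)(n-2)/2 is the commonly used one, assumed here rather than taken from calc_imbalance.

```python
class Node:
    """Hypothetical minimal node; `length` is the branch leading to it."""

    def __init__(self, name=None, length=0.0, children=()):
        self.name, self.length, self.children = name, length, list(children)


def tips(node):
    """Names of all tips subtended by a node."""
    if not node.children:
        return [node.name]
    return [t for child in node.children for t in tips(child)]


def colless(root):
    """Sum over internal nodes of |n_left - n_right| (binary trees only)."""
    total, stack = 0, [root]
    while stack:
        node = stack.pop()
        if node.children:
            left, right = node.children  # raises ValueError if not binary
            total += abs(len(tips(left)) - len(tips(right)))
            stack.extend(node.children)
    return total


def colless_normalized(root):
    """Colless sum divided by its maximum, (n-1)(n-2)/2; needs n >= 3 tips."""
    n = len(tips(root))
    return 2 * colless(root) / ((n - 1) * (n - 2))


def fair_proportion(root):
    """Share each branch length equally among the tips it subtends."""
    scores = {}

    def walk(node, inherited):
        share = inherited + node.length / len(tips(node))
        if not node.children:
            scores[node.name] = share
        for child in node.children:
            walk(child, share)

    walk(root, 0.0)
    return scores


# The fully unbalanced tree (((A:1,B:1):1,C:2):1,D:3)
tree = Node(children=[
    Node(length=1.0, children=[
        Node(length=1.0, children=[Node("A", 1.0), Node("B", 1.0)]),
        Node("C", 2.0),
    ]),
    Node("D", 3.0),
])
imbalance = colless(tree)
fp = fair_proportion(tree)
```

A useful check on the Fair Proportion values is that they sum to the total branch length of the tree, because every branch is divided completely among the tips below it.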
Likewise, calculations applicable to sets of trees (e.g. split frequencies) and to character state matrices (e.g. state frequencies, G/C content) are provided.
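A minimal sketch of both kinds of calculation follows, again in Python and independent of the Bio::Phylo API. Trees are represented as nested tuples of tip names (an assumption made for brevity), each rooted clade is treated as a split, and all function names are hypothetical.

```python
from collections import Counter


def clades(tree):
    """Return (tip set, list of clade tip sets) for a nested-tuple tree."""
    if isinstance(tree, str):
        return frozenset([tree]), []
    tip_set, found = frozenset(), []
    for child in tree:
        child_tips, child_clades = clades(child)
        tip_set |= child_tips
        found.extend(child_clades)
    found.append(tip_set)
    return tip_set, found


def clade_frequencies(trees):
    """Fraction of trees in which each non-root clade occurs."""
    counts = Counter()
    for tree in trees:
        all_tips, found = clades(tree)
        counts.update(c for c in set(found) if c != all_tips)
    return {clade: n / len(trees) for clade, n in counts.items()}


def gc_content(seq):
    """Proportion of G and C among unambiguous nucleotides (A, C, G, T)."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    return (seq.count("G") + seq.count("C")) / acgt if acgt else 0.0


trees = [(("A", "B"), "C"), (("A", "B"), "C"), (("A", "C"), "B")]
freqs = clade_frequencies(trees)
gc = gc_content("ACGT-N")
```

In this example the clade {A, B} occurs in two of the three trees and {A, C} in one; gaps and ambiguity codes are ignored when computing G/C content.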
Bio::Phylo also provides methods for the transformation of phylogenetic data objects. For example, phylogenetic trees can be re-rooted, pruned or ultrametricized; nodes can be collapsed or inserted; and branch lengths can be exponentiated or log-transformed. Sets of trees can be summarized in consensus trees or represented as pseudo-character-state MRP [31, 32] matrices. Character state data can be manipulated directly, or transformed through bootstrapping and jackknifing [33].
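The resampling transformations mentioned last can be sketched as follows. This is a minimal Python illustration, assuming a character state matrix stored as a dictionary of equal-length strings; the function names and the `keep` parameter are hypothetical and do not correspond to Bio::Phylo methods.

```python
import random


def columns(matrix):
    """Character columns of a matrix given as {taxon: state string}."""
    return list(zip(*matrix.values()))


def bootstrap(matrix, rng):
    """Resample columns with replacement, keeping the original length."""
    n = len(next(iter(matrix.values())))
    picks = [rng.randrange(n) for _ in range(n)]
    return {taxon: "".join(row[i] for i in picks) for taxon, row in matrix.items()}


def jackknife(matrix, rng, keep=0.5):
    """Keep a random subset of columns, without replacement, in order."""
    n = len(next(iter(matrix.values())))
    picks = sorted(rng.sample(range(n), max(1, round(n * keep))))
    return {taxon: "".join(row[i] for i in picks) for taxon, row in matrix.items()}


matrix = {
    "A": "ACGTACGT",
    "B": "ACGTTCGA",
    "C": "ACCTACGA",
}
replicate = bootstrap(matrix, random.Random(1))
subset = jackknife(matrix, random.Random(1))
```

Both functions resample whole columns, so the states observed for the different taxa at a given character always stay together.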
Input/output
A number of file formats are used for phylogenetic data. The Bio::Phylo::IO module supports the most commonly used ones: trees can be written and read in Newick format [8]; projects, taxa, trees and matrices can be written and read in NEXUS format and in NeXML (http://www.nexml.org); character state matrices can be read from CSV, FASTA, PHYLIP and tab-delimited files; trees can be read from the Tree of Life Web Project [34] XML service; trees and character state matrices can be written in the legacy format for the CONTINUOUS [35, 36], DISCRETE [37] and MULTISTATE [37] programs and in PHYLIP format.
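To make the simplest of these formats concrete, the following Python sketch reads and writes a restricted form of Newick. It is not the Bio::Phylo::IO parser: it assumes unquoted names, no comments and no whitespace inside the string, and the function names and dictionary layout are hypothetical.

```python
def parse_newick(text):
    """Parse Newick into nested dicts with 'name', 'length' and 'children'."""
    text = text.strip()
    if not text.endswith(";"):
        raise ValueError("a Newick string must end with ';'")
    pos = 0

    def parse_node():
        nonlocal pos
        children = []
        if text[pos] == "(":
            pos += 1  # consume '('
            children.append(parse_node())
            while text[pos] == ",":
                pos += 1  # consume ','
                children.append(parse_node())
            if text[pos] != ")":
                raise ValueError(f"expected ')' at position {pos}")
            pos += 1  # consume ')'
        start = pos
        while text[pos] not in ":,();":
            pos += 1
        name = text[start:pos] or None
        length = None
        if text[pos] == ":":
            pos += 1  # consume ':'
            start = pos
            while text[pos] not in ",();":
                pos += 1
            length = float(text[start:pos])
        return {"name": name, "length": length, "children": children}

    root = parse_node()
    if pos != len(text) - 1:
        raise ValueError(f"unexpected character at position {pos}")
    return root


def to_newick(node):
    """Serialize the nested-dict representation back to Newick."""

    def fmt(n):
        out = ""
        if n["children"]:
            out = "(" + ",".join(fmt(c) for c in n["children"]) + ")"
        out += n["name"] or ""
        if n["length"] is not None:
            out += f":{n['length']:g}"
        return out

    return fmt(node) + ";"


source = "((A:1,B:1)AB:1,C:2)root;"
tree = parse_newick(source)
round_trip = to_newick(tree)
```

Because Newick carries only topology, labels and branch lengths, richer annotations need a format such as NEXUS or NeXML.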
If BioPerl [7] is present, the wealth of data formats supported by Bio::SeqIO, Bio::AlignIO and Bio::TreeIO is also available, because BioPerl objects can be converted to Bio::Phylo objects (using the new_from_bioperl constructors) and Bio::Phylo objects can be passed to the write methods of BioPerl. Unlike BioPerl, however, Bio::Phylo has the concept of a "project" object, which is a collection of fundamental data objects (OTUs, trees and matrices) that reference each other. Whereas BioPerl treats NEXUS files as flat containers of records of the same type (i.e. either trees or alignments, which are read sequentially by a tree file reader or an alignment reader, respectively), Bio::Phylo can optionally treat NEXUS and NeXML files as containing a project of related data of different types, in the same way as the informatics-oriented applications Mesquite [11] and TreeBASE [12] do. Compatibility with BioPerl is optional: Bio::Phylo does not require BioPerl to be installed, or vice versa (they do not share code), but if Bio::Phylo detects BioPerl's presence, it enables a compatibility mode in which trees, nodes, character state matrices and sequences implement the interfaces that BioPerl defines.
Beyond BioPerl, interaction with other toolkits (e.g. ETE, DendroPy, BioPython, Ape) and combination in larger workflows is limited by the extent to which they can read the same data formats as Bio::Phylo. In practice this usually means NEXUS and Newick file exchange, although DendroPy supports NeXML as well, allowing more fine-grained sharing of data and metadata, and similar functionality is in development for BioRuby [38].
Visualization
Bio::Phylo can draw trees as "cladograms", in which only the branching order and direction, but not the branch lengths, are significant; as "phylograms", in which branch lengths are proportional to time or some other metric such as inferred change; or as "unrooted" trees, in which branch lengths and distance are significant but no direction or nesting is implied. These trees can be drawn with rectangular, curved or diagonal branches. Branch thickness, branch color, node color, node radius and node pie diagrams (e.g. for "likelihood pies" [sensu [39]]) can all be set per node and branch individually. Clades can be represented in a view that shows them collapsed into triangles whose width and color can be set per clade individually. This programmatic access to the visualization of individual objects in large trees allows users to superimpose their data on trees in a variety of ways (for an example, see Figure 2).
The visualizations produced by the tree drawer module can be serialized to various bitmap formats (GIF, JPEG, PNG) and vector formats (PDF, SVG, SWF and the new HTML5 Canvas used by modern web browsers and by the iPhone and iPad), some of which can be used to create interactive graphics and animations (SVG, SWF and Canvas). Using the XML-based SVG output format, the resulting serialization can be further processed programmatically, as was done by the authors of a recent study [40] that used Bio::Phylo (Florent Angly, pers. comm.). Any serialization can of course be manipulated further by hand using vector drawing or graphics editing software to prepare it for publication; however, the most useful application of Bio::Phylo's tree drawing capabilities is probably in the creation of interactive graphics for the web, e.g. within the dynamic server environment of a web application that serves trees from a database.
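The general idea of serializing a tree drawing to SVG can be sketched in a few lines. The Python example below lays out a phylogram with rectangular branches and emits SVG line and text elements; it is an illustration only, not the Bio::Phylo tree drawer, and the `Node` class, function names, scale factors and padding are all assumptions.

```python
from xml.sax.saxutils import escape


class Node:
    """Hypothetical minimal node; `length` is the branch leading to it."""

    def __init__(self, name="", length=0.0, children=()):
        self.name, self.length, self.children = name, length, list(children)


def layout(root):
    """Assign (x, y) per node: x is distance from the root, y is the tip
    row for tips and the mean of the children's rows for internal nodes."""
    coords = {}
    next_row = 0

    def place(node, parent_x):
        nonlocal next_row
        x = parent_x + node.length
        if node.children:
            rows = [place(child, x) for child in node.children]
            y = sum(rows) / len(rows)
        else:
            y = float(next_row)
            next_row += 1
        coords[node] = (x, y)
        return y

    place(root, 0.0)
    return coords


def to_svg(root, x_scale=100, y_scale=40, pad=20):
    """Draw rectangular branches: a vertical connector at the parent,
    then a horizontal branch out to each child; label the tips."""
    coords = {n: (pad + x * x_scale, pad + y * y_scale)
              for n, (x, y) in layout(root).items()}
    parts = []
    for node, (x, y) in coords.items():
        for child in node.children:
            cx, cy = coords[child]
            parts.append(f'<line x1="{x:g}" y1="{y:g}" x2="{x:g}" y2="{cy:g}" stroke="black"/>')
            parts.append(f'<line x1="{x:g}" y1="{cy:g}" x2="{cx:g}" y2="{cy:g}" stroke="black"/>')
        if not node.children:
            parts.append(f'<text x="{x + 5:g}" y="{y + 4:g}" font-size="12">{escape(node.name)}</text>')
    width = max(x for x, _ in coords.values()) + pad + 40  # room for labels
    height = max(y for _, y in coords.values()) + pad
    return (f'<svg xmlns="http://www.w3.org/2000/svg" width="{width:g}" height="{height:g}">'
            + "".join(parts) + "</svg>")


# The tree ((A:1,B:1):1,C:2)
tree = Node(children=[
    Node(length=1.0, children=[Node("A", 1.0), Node("B", 1.0)]),
    Node("C", 2.0),
])
svg = to_svg(tree)
```

Because the output is plain XML, per-branch attributes such as stroke color or width could be set while the elements are generated, or added afterwards by post-processing the document.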
Sampling and simulation
Bio::Phylo can simulate tree topologies under the following models of cladogenesis: pure birth under the model of Hey [41]; pure birth under the Yule model [42]; equiprobable topologies [sensu [43]]; constant rate birth-death, evolving speciation rate and beta binomial models implemented using novel algorithms [44]. The tree sampling interface in Bio::Phylo can also be used to sample from arbitrarily complex user specified models using the algorithms in [44].
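As a simple illustration of one of these processes, the following Python sketch simulates a pure-birth (Yule) tree by repeatedly drawing an exponential waiting time and splitting a randomly chosen extant lineage. It is a textbook-style simulation, not Bio::Phylo's code and not the algorithms of [44]; the `Node` class, function names and the choice to extend the tips by one final waiting interval are assumptions.

```python
import random


class Node:
    """Hypothetical minimal node; `length` is the branch leading to it."""

    def __init__(self):
        self.name = None
        self.length = 0.0
        self.children = []


def simulate_yule(n_tips, birth_rate=1.0, rng=None):
    """Grow a pure-birth tree until it has `n_tips` extant lineages."""
    rng = rng or random.Random()
    root = Node()
    extant = [root]
    while True:
        # With k lineages each splitting at rate b, the waiting time to
        # the next event is exponential with rate k * b.
        wait = rng.expovariate(birth_rate * len(extant))
        for tip in extant:
            tip.length += wait
        if len(extant) == n_tips:
            break  # assumed convention: stop just before the next split
        parent = extant.pop(rng.randrange(len(extant)))
        parent.children = [Node(), Node()]
        extant.extend(parent.children)
    for i, tip in enumerate(extant, start=1):
        tip.name = f"t{i}"
    return root


def tip_depths(node, so_far=0.0):
    """Path length from the start of the process to each tip."""
    so_far += node.length
    if not node.children:
        return {node.name: so_far}
    depths = {}
    for child in node.children:
        depths.update(tip_depths(child, so_far))
    return depths


def internal_nodes(node):
    """All nodes that have children."""
    if node.children:
        yield node
        for child in node.children:
            yield from internal_nodes(child)


tree = simulate_yule(10, birth_rate=1.0, rng=random.Random(7))
depths = tip_depths(tree)
```

Because every extant lineage is extended by the same waiting time at each step, all tips end at the same depth and the simulated tree is ultrametric; note that the root here carries a stem branch equal to the first waiting time.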