The aim of the phylotastic project is to develop a delivery system for expert knowledge of species phylogeny. In response to a user-supplied list of taxa, a phylotastic system identifies suitable source phylogenies, matches species identifiers, prunes away unneeded subtrees, grafts on missing species, and supplies branch lengths and other information, ultimately returning an expert phylogeny for the user’s list of species. Ideally such a system would cover all kingdoms of life, be fast enough to provide results while the user waits, and address the particular needs of researchers for reproducibility and provenance. To enhance the potential for such a system to become a sustainable community resource, it could be implemented as a set of loosely coupled components that interact in clearly defined ways, e.g., via web services.
Steps toward enabling a phylotastic system
The implementations described above provide a point of reference for considering the potential of a phylotastic system as conceived here, and for identifying weaknesses. With these two goals in mind, we discuss three demonstration projects in particular: the MapReduce pruner, Mesquite-o-Tastic, and Reconcili-o-Tastic.
The MapReduce pruner can be invoked interactively via a convenient web form, which has a text box in which to enter a list of species, and pull-down menus to select a format (5 choices from Newick to NeXML) and a source tree (from a list that includes most of the large trees listed in Background). A tree is typically returned in 8 seconds. The web form is merely the front end to a web service that can be invoked via a URI with arguments for “species”, “tree” and “format”. A simple Perl script using this web-services API would be as follows:
This script could be invoked with a command such as
and the result will be a file called “out.tre” with the Newick tree-string “((Homo_sapiens, Pan_troglodytes),Mus_musculus)”.
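The pruning step behind this result can be illustrated locally. The following Python sketch is not the MapReduce implementation: it assumes a tree held in memory as nested tuples, and the five-species source tree is a toy example invented for illustration. It keeps only the requested tips and collapses any internal node left with a single child.

```python
def prune(tree, keep):
    """Return `tree` restricted to the tips in `keep`, or None if none remain.

    A tree is either a tip name (str) or a tuple of child subtrees.
    """
    if isinstance(tree, str):
        return tree if tree in keep else None
    kept = [c for c in (prune(child, keep) for child in tree) if c is not None]
    if not kept:
        return None
    if len(kept) == 1:
        return kept[0]  # collapse a node that is left with a single child
    return tuple(kept)


def to_newick(tree):
    """Serialize a nested-tuple tree as a Newick string (topology only)."""
    if isinstance(tree, str):
        return tree
    return "(" + ",".join(to_newick(child) for child in tree) + ")"


# Toy source tree, invented for illustration (not one of the source megatrees).
source = (
    (("Homo_sapiens", "Pan_troglodytes"), "Gorilla_gorilla"),
    ("Mus_musculus", "Rattus_norvegicus"),
)
query = {"Homo_sapiens", "Pan_troglodytes", "Mus_musculus"}

pruned = prune(source, query)
print(to_newick(pruned))
```

For this toy input the printed topology matches the tree-string given above, apart from whitespace.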
Demonstration software based on Mesquite, an extensible workbench, illustrates how such web services could be integrated into an interactive workflow. In the Mesquite-o-Tastic screencast (Table 1), a NEXUS file from a published scientific study is downloaded from an online archive (MorphoBank), and opened in Mesquite, which shows that the file contains a character matrix with 51 taxa, but not a phylogeny. When the user invokes a custom menu item (added for this project), Mesquite automatically formulates a query URI using the names in the character matrix, executes the query remotely (invoking the MapReduce pruner above) and incorporates the resulting tree in memory. The tree is then available for graphical display as well as for use in analyses such as reconstructing ancestral states. While we chose to create this demonstration using Mesquite, the same thing could be done with a variety of other software systems (e.g., DAMBE, PAUP, MEGA).
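How a client might formulate such a query URI from a list of names can be sketched in Python. The base URL below is a placeholder (the actual service address is not given here), and the comma separator and the example values for “tree” and “format” are assumptions; only the three argument names come from the description of the web service.

```python
from urllib.parse import urlencode, urlsplit, parse_qs

# Hypothetical endpoint, used only for illustration.
BASE_URL = "http://example.org/phylotastic/prune"


def build_query_uri(names, tree, fmt):
    """Build a query URI with "species", "tree" and "format" arguments.

    Joining names with commas is an assumption about the service's input.
    """
    params = {"species": ",".join(names), "tree": tree, "format": fmt}
    return BASE_URL + "?" + urlencode(params)


# Names as they might be read from the rows of a character matrix.
taxa = ["Homo sapiens", "Pan troglodytes", "Mus musculus"]
uri = build_query_uri(taxa, tree="mammals", fmt="newick")
print(uri)

# Decoding the query string recovers the original arguments.
decoded = parse_qs(urlsplit(uri).query)
print(decoded["species"][0])
```

Percent-encoding the arguments with `urlencode` keeps spaces and commas in taxon names from corrupting the URI.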
The Reconcili-o-Tastic demonstration shows how phylotastic services can be integrated into a more automated workflow, as described above (Shiny section). In this case, not only is the query constructed automatically (by retrieving species names from sequence identifiers), but the result is also used automatically in a downstream analysis step, which is to generate a reconciled tree.
Limitations of the tools described above become apparent quickly if one considers a broader set of cases than the sample queries used for illustration. Some of these limitations reflect gaps in the current state of expert knowledge, others reflect incomplete implementations of the phylotastic concept, and still others are limitations of design.
These limitations can be clarified relative to an imaginary challenge of (1) obtaining many sets of names, e.g., by downloading hundreds of NEXUS files from MorphoBank, or thousands of NEXUS files from TreeBASE, or processing thousands of scientific publications using GNRD (Table 1) to auto-recognize names, then (2) using the tools developed here to find phylogenies for implicated taxa, and (3) attempting to use the resulting phylogenies to carry out some kind of phylogeny-dependent downstream step, such as computing phylogenetic diversity, or reconstructing ancestral character states.
Current tools, if subjected to this kind of challenge, would prove unsatisfactory. The first problem is that the source trees available (via the MapReduce pruner) provide limited coverage of the millions of known biological species. The result is that, in many cases, only a minor fraction of the species named in a query would be found in a source tree. This is largely a limitation in the coverage provided by available megatrees (the Open Tree Of Life project, mentioned further below, attempts to address this gap by synthesizing phylogenetic knowledge broadly). Coverage differs dramatically among taxonomic groups, e.g., the mammal tree covers the vast majority of extant mammals, but there is poor coverage of fungi and protists. Comparative studies of morphology often use fossil data, but fossil species are poorly represented in large phylogenies, because the latter typically are constructed from molecular sequences (typically not available for extinct species). Grafting of missing species onto the corresponding genus or family could improve coverage, but we have not implemented grafting methods here other than via web services supplied by the enhanced version of Phylomatic.
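The coverage problem can be made concrete with a small Python sketch. It is an illustration under simplifying assumptions, not part of the tools described here: names are taken to be `Genus_species` strings, the tip list and query are invented, and a missing species is counted as a grafting candidate whenever its genus already occurs among the tips.

```python
def coverage_report(query, tips):
    """Return (fraction found, missing names, missing names with a congener in the tree)."""
    tip_set = set(tips)
    found = [name for name in query if name in tip_set]
    missing = [name for name in query if name not in tip_set]
    # Genus is assumed to be the text before the first underscore.
    genera_in_tree = {tip.split("_")[0] for tip in tip_set}
    graftable = [name for name in missing if name.split("_")[0] in genera_in_tree]
    fraction = len(found) / len(query) if query else 0.0
    return fraction, missing, graftable


# Invented example data.
tips = ["Homo_sapiens", "Pan_troglodytes", "Mus_musculus"]
query = ["Homo_sapiens", "Pan_paniscus", "Mus_musculus", "Amanita_muscaria"]

fraction, missing, graftable = coverage_report(query, tips)
print(fraction)   # share of query names found as tips
print(missing)    # names absent from the source tree
print(graftable)  # absent names that could be grafted at the genus level
```

In this example half of the query is covered, one missing species has a congener in the tree, and the fungal name cannot be placed at all.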
Currently available strategies for taxonomic name resolution also represent limitations. Our TNRS meta-service has not been integrated with other components, so that (for example) the names in a NEXUS file uploaded by Mesquite-o-Tastic must be spelled exactly as in a source tree available from the MapReduce pruner, or no match will be found. This may be a desirable behavior in some cases, e.g., when the user (or client application) is confident about names and does not want to allow any translation. However, in most cases, presumably, integrating the TNRS meta-service would improve results.
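The difference between exact matching and a more forgiving lookup can be sketched in Python. This is not the TNRS meta-service: approximate string matching from the standard library's `difflib` stands in for name resolution, the similarity cutoff of 0.85 is an arbitrary assumption, and the names are invented examples.

```python
from difflib import get_close_matches


def match_names(query, tips, cutoff=0.85):
    """Map each query name to (matched tip or None, "exact" | "fuzzy" | "none")."""
    tip_set = set(tips)
    result = {}
    for name in query:
        if name in tip_set:
            result[name] = (name, "exact")
            continue
        # Fall back to the single closest tip name above the cutoff, if any.
        close = get_close_matches(name, tips, n=1, cutoff=cutoff)
        result[name] = (close[0], "fuzzy") if close else (None, "none")
    return result


tips = ["Homo sapiens", "Pan troglodytes", "Mus musculus"]
query = ["Mus musculus", "Homo sapeins", "Sabellidae"]

for name, (match, kind) in match_names(query, tips).items():
    print(name, "->", match, kind)
```

With exact matching alone, the misspelled name would simply be dropped. A client that is confident about its names could disable the fallback by setting `cutoff=1.0`.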
The potential for improvement is itself limited at present, because (1) sources of taxonomic knowledge referenced by the meta-service (NCBI, MSW, etc.) provide limited coverage of names; (2) spell-checking typically is not available; (3) there is no automated way to choose among multiple matches (especially given the possibility of valid homonyms, i.e., identical names for different species covered by different nomenclatural codes); and (4) there is no automated way to interpret names from higher taxonomy. With regard to the last-named challenge, for instance, MorphoBank has many data matrices (e.g., project #816) that combine data at the genus level or higher, so that the key to a row of data is a higher taxon label (e.g., “Sabellidae”) or an anonymized species name (e.g., “Myriowenia sp.”). One can imagine a system that resolves such names in an appropriate way depending on the user’s choice, but support for such a system exists only for plant species and only up to the taxonomic rank of family.
The general design for a phylotastic system (given above) calls for a tree-store that responds to a user’s query by identifying a source tree that provides the best coverage. However, such a component has not been integrated; thus, current tools require the user to specify a source tree in advance.
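The selection step that such a tree-store would perform can be sketched in Python. This is a minimal illustration, assuming each source tree is represented only by its set of tip names; the tree labels and tip sets are invented, and coverage is scored simply as the fraction of query names present.

```python
def best_source(query, sources):
    """Return (source name, coverage) for the source covering most of the query.

    `sources` maps a source-tree label to an iterable of its tip names.
    """
    wanted = set(query)
    if not wanted:
        raise ValueError("query must contain at least one name")
    scores = {
        label: len(wanted & set(tips)) / len(wanted)
        for label, tips in sources.items()
    }
    best = max(scores, key=scores.get)  # ties go to the first source listed
    return best, scores[best]


# Invented example sources.
sources = {
    "mammal_tree": {"Homo_sapiens", "Pan_troglodytes", "Mus_musculus"},
    "plant_tree": {"Quercus_alba", "Zea_mays"},
}
query = ["Homo_sapiens", "Mus_musculus", "Quercus_alba"]

label, coverage = best_source(query, sources)
print(label, round(coverage, 2))
```

A fuller implementation would presumably weigh other criteria as well (e.g., resolution or the presence of branch lengths), which this sketch ignores.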
Integrating many source phylogenies, along with a TNRS and a tree-store, would make it possible to respond to a variety of phylotastic queries by identifying the best source tree for each one. However, given our current implementations, the resulting trees would lack branch lengths and contain polytomies, making them unsuitable for many kinds of downstream analysis (for reasons noted earlier). The lack of branch lengths could be addressed by integrating the DateLife service, but currently its store of calibrated phylogenies covers only animals. This situation could be improved, but fossil-based calibrations generally are not available for phylogenies of groups of microscopic organisms, which have a poor fossil record. Other methods for scaling trees are possible (as noted above), but we have not explored those methods. Likewise, polytomies could be resolved arbitrarily, or using character-based methods, but we have not explored such options.
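Arbitrary resolution of polytomies, the simplest of the options just mentioned, can be sketched in Python. The sketch assumes a tree stored as nested tuples without branch lengths, and it resolves each multifurcation by repeatedly pairing the first two children; that pairing order is an arbitrary choice with no biological meaning.

```python
def resolve_polytomies(tree):
    """Return a strictly bifurcating version of a nested-tuple tree.

    A tree is either a tip name (str) or a tuple of child subtrees.
    """
    if isinstance(tree, str):
        return tree
    children = [resolve_polytomies(child) for child in tree]
    # Join the first two children into a new node until only two remain.
    while len(children) > 2:
        children = [(children[0], children[1])] + children[2:]
    return tuple(children)


polytomy = ("A", "B", "C", "D")
print(resolve_polytomies(polytomy))

nested = (("A", "B", "C"), "D")
print(resolve_polytomies(nested))
```

Because the added splits carry no information, a downstream analysis that is sensitive to topology would need to treat them with caution, for example by repeating the analysis over several random resolutions.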
Finally, whereas we can imagine an enormously useful phylotastic system that efficiently delivers currently available knowledge of phylogeny, taxonomy, and fossil dating, current standards and technology for annotation are insufficient to enable the delivery of this knowledge with enough credibility for scientific research. Whereas students and educators may be satisfied to download a tree made by a “black box”, researchers will expect a clear description of sources and methods, including metadata on the source trees used to derive a result, and the phylotastic method of its derivation (by pruning, grafting, re-scaling, etc.). Yet standards for annotating sources and methods are relatively undeveloped (see discussion in ). Attribution and licensing present additional challenges for a system that re-uses data.
Beyond these narrow technical limitations, there is the question of whether a fully developed phylotastic system ultimately would represent a practical alternative to other ways of obtaining phylogenetic information. The most obvious uses of such a system are for cases in which the user’s demand for speed is relatively high compared to the demand for rigor. Resources such as Wikipedia or the Encyclopedia of Life, for instance, might benefit from the ability to auto-generate phylogenies to illustrate a taxon for a taxon-specific web page. Whether scientists will use a phylotastic system for research purposes will depend on multiple factors that include the user’s demand for rigor, the user’s potential to infer (at a significant cost in terms of time, training, and computation) a more rigorous phylogeny by de novo methods, and the availability of a pre-computed expert phylogeny that covers the query of interest (and is considered authoritative).
Such factors are not easy to assess directly, but the examples cited by Stoltzfus et al. suggest that, given the opportunity, researchers often will choose to forgo the task of inferring a species phylogeny de novo from character data, and instead will choose to apply approximate methods to compute a derived tree from an expert source tree, even when the researcher’s needs are limited to a single tree with a few dozen species (e.g., as in ).
A phylotastic ecology of informatics resources
The results described above provide a basis for further development of phylotastic systems in hackathons planned for the year 2013. This further development will take place within a community with a long history of cyber-infrastructure projects, the oldest being TreeBASE and ToLWeb, both of which date from the 1990s. More recent phylogeny-related resources include CIPRES and PhyLoTA. Taxonomic information services also have existed for many years (e.g., ITIS noted in Table 1). Our DateLife service is similar to the widely popular TimeTree project, noted above.
How does phylotastic relate to these other projects? How might the projects inter-relate in the future? Above (Implementation) we explained why we chose to implement a TNRS meta-service (rather than rely solely on an existing service), and why we chose to implement a new tree-scaling service similar to TimeTree. In both cases, the reasons relate to the need for resources that are designed (and licensed) to support automated data-integration tasks, rather than interactive or ad hoc uses.
Currently, although TreeBASE and ToLWeb are resources that represent expert knowledge of phylogeny, they are not alternatives to a phylotastic system as a convenient source of custom trees for downstream use. TreeBASE provides tools for searching ~8000 published phylogenies (a small fraction of all published phylogenies), but it does not include pruning or grafting tools, nor does its store of trees include any of the trees given above (Background) as examples of large species trees.
ToLWeb is primarily an educational resource whose main feature is a phylogeny divided into branches curated by experts who determine the phylogeny and supply annotations. Its downloadable phylogeny of 16000 taxa covers all kingdoms but includes <1% of named species; as noted in , when bioinformatics researchers want a comprehensive ToL (e.g., for projects such as TimeTree), they use the NCBI taxonomy tree, which has 250000 species. Educational uses of ToLWeb predominate over research uses, perhaps because the interfaces focus on graphical presentation: when ToLWeb is cited in the research literature, in studies such as [62–67], it appears that knowledge of a small set of relationships is conveyed visually rather than by computation from the tree.
Resources such as PhyLoTA and the CIPRES portal clearly provide ways to generate custom species trees for downstream use. However, the trees are generated by the user de novo from sequence alignments. While implementing a phylogenetic inference workflow using CIPRES or PhyLoTA is far easier than implementing an ad hoc workflow on a local computer, it is time-consuming and represents a substantial burden for most users.
By contrast, the phylotastic project aims to facilitate the case in which a user can make a scientifically defensible choice to use a modified (pruned, grafted, re-scaled) version of an expert phylogeny, rather than attempt to infer a phylogeny de novo. Clearly, Phylomatic (the inspiration for the phylotastic project) also addresses this niche. While the original conception of Phylomatic was a local tool with a fixed source tree, subsequent developments (including developments for this project) expanded its web-services interface and added support for a user-supplied tree, allowing Phylomatic to become a component in a distributed phylotastic system of components that interoperate to provide on-the-fly access to ever-expanding domains of expert knowledge (of phylogeny, taxonomy, geographic distribution, etc.).
Just as Phylomatic was designed for a smaller and more static world of data, but has begun to adapt to a larger and dynamic world, the other resources listed above also could play a role in this emerging world. As described above (Implementation), existing taxonomic name resolution services can be adapted for aggregation into a meta-service. Likewise, existing phylogeny resources such as ToLWeb or TreeBASE could expose their content using the TreeStore concept envisioned in Figure 2. For instance, for this project, we exported the XML version of the ToLWeb tree, and translated it using Bio::Phylo, so that the content of the ToLWeb tree is available via the pruner web tool described above. If ToLWeb were to supply the current version of its tree via a web service, this could be accessed by the pruner; likewise, TreeBASE could expand its current web-services interface to expose its data to phylotastic systems.
The goal of the phylotastic project is to leverage expert knowledge of phylogeny, rather than create de novo trees using tools such as PhyLoTA and CIPRES. Yet, there are cases in which a phylotastic system could benefit from rapid methods for making limited phylogenetic inferences from a sample of sequences or other data, including (1) using phylogenetic placement to place a missing species on a tree, when taxonomic grafting is impossible or undesirable, (2) resolving a polytomy, or (3) assigning branch lengths within subtrees of organisms poorly represented in the fossil record (e.g., single-celled organisms).
The design of phylotastic systems allows for perpetual heterogeneity and novelty; thus, it does not matter whether or not a central source of authoritative ToL knowledge emerges in the next decade through efforts such as the Open Tree Of Life project (Table 1). New phylogenies will augment available tree-stores, and phylotastic systems will allow them to be pruned, grafted and analyzed according to the wishes and needs of the user. Because of the open architecture and modularity of the project, researchers can chain the phylotastic components together in various ways, piecemeal or as complete workflows.
Finally, if successful, a convenient and comprehensive delivery system for expert phylogenetic knowledge will create a competitive marketplace in which alternative source trees, and alternative phylotastic services, compete to satisfy the demands of users. The existence of such a marketplace may be expected to catalyze broad improvements in related technology and standards. For instance, given that the scale of scientific phylogeny re-use has been (with the exception of APG and Phylomatic) unimpressive, the delay in developing a “minimal information” standard for annotating phylogenies, first proposed in 2006, is unsurprising. However, a phylotastic system will require annotations of sources and methods to satisfy the demand of researchers for credible (publishable) results and, given the choice, users will prefer those source trees, tree-stores, and client applications that provide them with more fully annotated results. Another critical feature missing from the technology landscape of phylogenetics is some scheme for quantifying the accuracy or perceived quality of phylogenies (e.g., an objective scheme based on consistency or metadata density, or a subjective scheme based on social bookmarking), but one can expect such a scheme to emerge naturally as an aid to users facing choices in a phylotastic marketplace.