
Phylotastic! Making tree-of-life knowledge accessible, reusable and convenient

Abstract

Background

Scientists rarely reuse expert knowledge of phylogeny, in spite of years of effort to assemble a great “Tree of Life” (ToL). A notable exception involves the use of Phylomatic, which provides tools to generate custom phylogenies from a large, pre-computed, expert phylogeny of plant taxa. This suggests great potential for a more generalized system that, starting with a query consisting of a list of any known species, would rectify non-standard names, identify expert phylogenies containing the implicated taxa, prune away unneeded parts, and supply branch lengths and annotations, resulting in a custom phylogeny suited to the user’s needs. Such a system could become a sustainable community resource if implemented as a distributed system of loosely coupled parts that interact through clearly defined interfaces.

Results

With the aim of building such a “phylotastic” system, the NESCent Hackathons, Interoperability, Phylogenies (HIP) working group recruited 2 dozen scientist-programmers to a weeklong programming hackathon in June 2012. During the hackathon (and a three-month follow-up period), 5 teams produced designs, implementations, documentation, presentations, and tests including: (1) a generalized scheme for integrating components; (2) proof-of-concept pruners and controllers; (3) a meta-API for taxonomic name resolution services; (4) a system for storing, finding, and retrieving phylogenies using semantic web technologies for data exchange, storage, and querying; (5) an innovative new service, DateLife.org, which synthesizes pre-computed, time-calibrated phylogenies to assign ages to nodes; and (6) demonstration projects. These outcomes are accessible via a public code repository (GitHub.com), a website (http://www.phylotastic.org), and a server image.

Conclusions

Approximately 9 person-months of effort (centered on a software development hackathon) resulted in the design and implementation of proof-of-concept software for 4 core phylotastic components, 3 controllers, and 3 end-user demonstration tools. While these products have substantial limitations, they suggest considerable potential for a distributed system that makes phylogenetic knowledge readily accessible in computable form. Widespread use of phylotastic systems will create an electronic marketplace for sharing phylogenetic knowledge that will spur innovation in other areas of the ToL enterprise, such as annotation of sources and methods and third-party methods of quality assessment.

Background

Researchers in many areas of life sciences, from community ecology to genomics to biomedical genetics, use phylogenies to place data in an evolutionary context [1]. Phylogenies provide the basis for classification, whether of species (i.e., biological taxonomy) or molecular sequences. Furthermore, phylogenies are central to rigorous quantitative methods of comparative analysis used throughout biology. Evolved things (genes, species, or other entities) have features that are highly correlated by virtue of common ancestry, thus they are not independent samples of an underlying process, but require special methods of analysis: evolutionary comparative methods use branching models to separate correlations due to common ancestry from correlations due to functional causes.

Inferring a phylogeny is often a challenging task with multiple steps, subject to numerous pitfalls [2]. To infer a credible tree, users must collaborate with experts, or commit to learning about phylogenetic methods.

Nevertheless, the number of phylogeny publications has been growing at a rate considerably above the baseline growth of scientific publishing (e.g., compare [3] with [4]). In 2010, an estimated 7700 publications reported new phylogenies [5]. These and other phylogenies computed throughout the life sciences collectively represent the sum of expert knowledge of evolutionary relationships. This knowledge is largely scattered and inaccessible, locked inside individual publications. In spite of a community archive that has existed for many years (TreeBASE [6, 7]), roughly 96% of phylogenies are not archived, and are available only as pictures in a scientific journal [5].

One possible interpretation of this situation is that, in spite of the effort that goes into generating phylogenies, they generally have a very low value for re-use. One might argue that phylogenies are volatile and must be re-computed constantly from an ever-expanding body of data using ever-improving methods. If so, then the lack of archiving and re-use of phylogenies is neither surprising nor problematic.

However, a recent study of phylogeny re-use [5] suggests that certain types of phylogenies have a high re-use value, under the right conditions. In a small sample of just 40 phylogeny-relevant research articles, the authors found that 6 of the studies re-used large trees, 4 of them using the software called Phylomatic [8] to perform pruning and grafting operations on the framework tree provided by the Angiosperm Phylogeny Group (APG). The APG tree [9] aims to cover all flowering plants, albeit mostly at the level of higher taxa (e.g., families, orders) rather than of species. In this context, “pruning” means cutting away unwanted terminal nodes (and collapsing any resulting unnecessary internal nodes), while “grafting” means adding branches to a tree, which Phylomatic does taxonomically (i.e., “taxonomic grafting”, based on an input list with taxonomic derivations of the form “<order>/<family>/<genus>/<genus species>”). Phylomatic has been cited in over 200 scientific articles since 2005 [10].
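To make the idea of a taxonomic derivation concrete, the following minimal Python sketch shows how slash-separated derivations of the form given above imply a nested hierarchy. It is an illustration only, not Phylomatic's algorithm; the function names `build_hierarchy` and `to_newick` and the three example derivations are assumptions introduced here.

```python
def build_hierarchy(derivations):
    """Turn 'order/family/genus/genus species' strings into nested dicts."""
    root = {}
    for line in derivations:
        node = root
        for name in line.split("/"):
            # Reuse the existing child for this name, or create an empty one.
            node = node.setdefault(name, {})
    return root


def to_newick(name, children):
    """Serialize a nested-dict hierarchy as a Newick string with node labels."""
    label = name.replace(" ", "_")
    if not children:
        return label
    inner = ",".join(to_newick(k, v) for k, v in sorted(children.items()))
    return f"({inner}){label}"


# Hypothetical input list, for illustration only.
derivations = [
    "Fagales/Fagaceae/Quercus/Quercus alba",
    "Fagales/Fagaceae/Fagus/Fagus grandifolia",
    "Rosales/Rosaceae/Malus/Malus domestica",
]
hierarchy = build_hierarchy(derivations)
newick = to_newick("", hierarchy) + ";"
print(newick)
```

In this sketch each species hangs from its genus, family and order, so the result is a taxonomic hierarchy with single-child internal nodes rather than a resolved phylogeny.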

In other words, Phylomatic uses simple manipulations to generate a customized tree based on a larger authoritative tree computed by experts, providing the user with a combination of convenience and credibility. The expert trees most useful in such cases will be those that (1) address the relationships of species (as opposed to relationships of genes or proteins) and (2) cover a large number of species. A single ToL covering millions of species does not exist (in spite of a decade-long effort by the US National Science Foundation), but there are many trees that provide extensive coverage of a large group, e.g., 4510 extant mammal species [11], 55473 angiosperms [12], 73060 eukaryotic species [13], and 408135 prokaryotic 16S rDNAs in the “greengenes” tree [14]. To this list, one may add resources that are not true phylogenies, but taxonomic hierarchies, including the NCBI taxonomy hierarchy of 250000 species of prokaryotes, eukaryotes, and viruses [15], the downloadable version of the tree from the Tree of Life Web project [16] with 16000 species, and the APG tree with 1566 taxa [9]. The NCBI hierarchy is widely used as a ToL in projects that require a single unifying framework to cover all domains of life (e.g., [17-21]).

Whereas the lack of a single authoritative ToL may be a substantial barrier to the re-use of expert knowledge of species phylogeny, the lack of a convenient delivery system for available ToL knowledge is a barrier of equal or greater importance. In particular, the example of Phylomatic suggests the potential for a more general system that, in response to a query consisting of a list of species (or higher taxa), rapidly supplies a phylogeny for those species based on expert knowledge. Ideally such a system would cover the entire ToL, be fast enough to provide results while the user waits, and address the particular needs of researchers for reproducibility and provenance. Such a system will not replace the time-consuming generation of robust phylogenies by experts, but aims to make those hard-won results conveniently accessible to everyone else.

In this paper, we report initial results of a project, codenamed “phylotastic”, that aims to create such a system. The HIP (Hackathons, Interoperability, Phylogenies) working group of the National Evolutionary Synthesis Center (NESCent) gathered a group of programmer-scientists for a one-week “hackathon” (an intensive bout of collaborative software development) aimed at building a system of loosely coupled components that collectively provide phylotastic access to phylogenies with broad taxonomic coverage. The system was designed and implemented by 5 teams: the Architecture team was responsible for overall design and controllers; the TNRS team focused on Taxonomic Name Resolution Services; the TreeStore team developed methods to submit, store, and retrieve phylogenetic trees; the Branch Lengths team implemented a system to assign divergence times to nodes; and team Shiny focused on end-user experiences and outreach. Project outcomes are accessible via a public code repository (http://github.com/phylotastic), a project web site (http://www.phylotastic.org), and a server image (supplementary data).

Our results show that providing on-the-fly phylogenies via web services is possible, although improvements are needed in order to create a robust and flexible system that meets the typical demands of researchers. With further development, phylotastic systems have the potential to create an electronic marketplace for sharing phylogenetic knowledge that, in turn, may be expected to provide the incentives for researchers and technologists to improve data quality, improve technology and standards for annotation of sources and methods, and facilitate third-party methods of quality assessment.

Implementation

Hackathon planning and execution

The goal of developing a phylotastic system emerged in January 2012 from brainstorming and evaluation of multiple alternatives at a face-to-face meeting of the HIP leadership team (LT), a group of 10 scientists with backgrounds in molecular biology, bioinformatics, molecular evolution, genomics, phylogenetics, and comparative biology. Hackathon participants were selected from applications submitted in response to an open call for participation. From 29 applications (nearly all from qualified individuals), the LT selected 20 participants to represent a breadth of expertise and knowledge. A simultaneous satellite hackathon was arranged with a remote group of 6 individuals.

In the two-month pre-hackathon planning stage (April to May of 2012), participants were enrolled in a mailing list and encouraged to raise issues and share ideas. LT members regularly injected ideas and challenges to stimulate discussion and maintain energy and momentum. Some participants and organizers developed and shared proof-of-concept software during this period. By the time of the hackathon, nearly all participants had engaged in discussions via telephone conferencing and a shared email list.

The hackathon took place from June 4 to 8, 2012 at NESCent headquarters in Durham, NC, with a satellite hackathon in Moscow, Idaho composed of a group of interested researchers who could not travel to Durham. The first day began with user scenarios (descriptions of how researchers use ToL knowledge); a two-hour brainstorming session; short technical talks to familiarize participants with certain key technologies (NeXML, the semantic web, iPlant TNRS); and an open-space session to form teams. During the open-space session, participants proposed, joined, critiqued, revised and defended “pitches” (proposals) in an open room, until unpopular pitches were abandoned and the group settled down to a limited number of teams. The first day ended with 5 teams, and the remaining 3.5 days were spent planning and executing team projects. In the period after the hackathon, organizers and a subset of participants collaborated remotely to improve hackathon products.

General conception of a phylotastic system

In the pre-hackathon stage, and during the first day, participants developed a generalized view of phylotastic systems as a means to deliver ToL knowledge to researchers. This view is intended to be practical given present realities of the informatics landscape, yet extensible and adaptable as a long-term community resource.

Conditions

There are 2 initial conditions of vital importance. The first is that the user has a list of taxa, typically a list of dozens or hundreds of species. Occasionally the list may be longer, and it may contain higher taxa such as genera, families or orders. The input list might be composed manually by the user, or constructed interactively from some data resource, e.g., the user might invoke the Global Names Recognition & Discovery service (see Table 1) on the PDF of a scientific paper, resulting in the list of taxa named in that document. The lists of names that emerge from ordinary scientific sources frequently have errors in the specification of names, including typographical errors, misspellings based on ignorance, and deprecated names, with the result that integrating data using names as keys is a major bioinformatics challenge [22-24].

Table 1 Locations of online resources, with hackathon team listed where appropriate

The second important initial condition is that there are a variety of sources of expert phylogenetic knowledge, in the form of multiple source trees that might satisfy (partially or fully) the user’s query (examples were listed in Background). There is no single authority for such trees. They are available in many formats, although this is not problematic due to the availability of general-purpose libraries that provide tools for format conversion [25, 26]. In some cases, the trees that are most useful are taxonomic hierarchies with polytomies, missing branch lengths, and higher taxa as terminals (e.g., [9]). While truly global phylogenies may appear very soon, we assume that, for years into the future, expert phylogenetic knowledge will reside in a plurality of trees, not a single tree.

Requirements

The basic functional requirement is to provide a phylogeny in response to the user’s query, which consists of a list of entities (typically, a list of species) and optionally, additional conditions (e.g., specifying a particular source tree, a restriction on name-matching, etc.). One may conceive of the user’s query as an under-specified version of the resulting phylogeny, i.e., a graph of unconnected leaf nodes that is filled in by the remaining components of the system.

Importantly, the utility of such a system depends on its strengths and weaknesses relative to the alternative (faced by the user) of (1) obtaining specialized training in phylogenetic inference methods, (2) installing or otherwise accessing specialized software, and (3) executing a workflow to infer a custom phylogeny from character data, all of which require a considerable expenditure of time to produce a tree that nonetheless may lack credibility. In the system envisioned here, phylogenetic knowledge is returned while the user waits, and ultimately comes from phylogenies produced by experts and presented (typically) in stand-alone publications (e.g., [11]).

Currently, many such source trees (e.g., [9, 16]) include polytomies and lack branch lengths, yet users typically require a fully resolved tree that includes branch lengths. The reason for this is that branch lengths are required to apply various downstream (phylogeny-based) methods, such as independent contrasts [27], probabilistic reconstruction of ancestral states [28], or correlation of discrete traits [29]. Many implementations of these methods require fully resolved trees (i.e., trees without polytomies), even if the method itself does not. Branch lengths may be obtained in units of amount of change by using tree inference techniques with character data (e.g., sequences from GenBank). Inclusion of calibrations or constraints from fossils, biogeographical data, or other sources can be used with a tree to infer branch lengths in time units. A method called bladj, included in the Phylomatic package [8], simply assigns lengths so that branchings are evenly spaced.

Furthermore, it should be possible to cite the derived tree and provide a description of provenance in a way that will satisfy expectations for writing the “Methods” section of a scientific publication. Such a description would identify the source tree, describe the manipulations performed on it, and possibly provide a precise means to recover the derived tree.

In this context, one may imagine a system that takes the user’s query, rectifies names, identifies a source tree with the best coverage of the user’s list of species, invokes pruning and grafting operations to provide a phylogeny for the species, and as needed, invokes further services to provide branch lengths and possibly other annotations. The name-rectifier would accept, as input, a list of names, and would return a mapping of the input names to qualified taxonomic identifiers. Tree-stores would accept, as input, a query with descriptors or conditions to be satisfied (coverage, quality), and would return trees (or references to trees) that satisfy the conditions. Pruners (and grafters) would accept (1) a list of names and (2) a tree (or tree reference), and would return a suitably pruned (or grafted) tree. Alternatively, one may imagine topology services that combine information from multiple trees (so-called supertree methods) rather than simple pruning and grafting. Scaling operations would accept, as input, a tree with or without branch lengths, and would return a tree with newly scaled branch lengths, or with dates assigned to nodes.

Such a system would be useful to a wide variety of users. Because most such users are not computing experts, one cannot expect them to navigate the complex series of operations just described. Instead, we assume that most users will take advantage of phylotastic systems via client applications or controllers that manage a phylotastic workflow. This raises the question of how to design the overall architecture of the system. We assume that a system that is distributed and based on open collaboratively developed standards has a greater likelihood of becoming a sustainable community resource, relative to a closed, centralized system. Openness and collaborative development lower the bar to participation by early adopters and increase the breadth of use-cases being served, and thus are considered important factors favoring sustainability of cyberinfrastructure [30, 31].

Therefore, to facilitate the development and maintenance of phylotastic systems as a sustainable community resource, we imagine the system in Figure 1 as a set of loosely coupled web services. That is, we imagine a step such as taxonomic name resolution, not as a local operation (e.g., based on a local library of taxonomy-related functions that access a local namebank) but as a remote web service maintained by taxonomy experts (e.g., the ITIS service: see Table 1). Furthermore, we imagine that, for each type of service, there are many possibilities, rather than a single authoritative service. Thus, one phylotastic client might access TNRS service #1, and another client might access TNRS service #2. This feature will allow phylotastic systems to be maintained without relying on the continuity of any particular service or resource. Ultimately, different scientific groups may expose their preferred phylogenies, TNRSs, pruners, and so on; likewise, different clients ultimately would be able to choose among multiple services based on quality, reliability, coverage, and so on.

Figure 1

Overall scheme of a phylotastic system. The user (upper left) experiences a phylotastic system as a piece of software that returns a phylogeny in response to a query consisting of a list of taxa (and possibly other qualifiers). The user’s point of access to the system is a client program or controller that invokes various operations to access (and transform) the information needed to satisfy the user’s query. The response ultimately depends on information available from name-banks, source trees, and calibrations (right). There are many ways to implement such a system. In the approach described here, it is a system of loosely coupled components that uses web services to exchange information using common standards.

Finally, given multiple services of each type, one can imagine each operation in Figure 1, not as a single instance of a resource, but as a service broker that accesses a registry of many services, invoking whichever service is most appropriate to process the user’s query, as in the BioMoby registry [32]. Furthermore, these brokers might choose services using indicators of quality or reliability, based on the success of past queries, or based on third-party evaluations (e.g., social bookmarking).
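The broker idea can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a description of BioMoby or of any hackathon product: the `Service` and `Broker` classes, the reliability measure (fraction of past calls that succeeded) and the neutral prior of 0.5 for untested services are all hypothetical choices made here.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Service:
    name: str
    kind: str                      # e.g. "tnrs", "treestore", "pruner"
    call: Callable[[list], object]
    successes: int = 0
    failures: int = 0

    def reliability(self) -> float:
        total = self.successes + self.failures
        # Untested services get a neutral score (an assumption of this sketch).
        return self.successes / total if total else 0.5


class Broker:
    def __init__(self) -> None:
        self.registry: List[Service] = []

    def register(self, service: Service) -> None:
        self.registry.append(service)

    def invoke(self, kind: str, payload: list):
        """Try services of the requested kind, most reliable first."""
        candidates = sorted(
            (s for s in self.registry if s.kind == kind),
            key=lambda s: s.reliability(),
            reverse=True,
        )
        for service in candidates:
            try:
                result = service.call(payload)
            except Exception:
                service.failures += 1   # remember the failure, try the next one
                continue
            service.successes += 1
            return service.name, result
        raise LookupError(f"no working service of kind {kind!r}")


def _always_fails(names):
    raise RuntimeError("service unavailable")


broker = Broker()
flaky = Service("tnrs-1", "tnrs", _always_fails)
steady = Service("tnrs-2", "tnrs", lambda names: [n.strip() for n in names])
broker.register(flaky)
broker.register(steady)
first = broker.invoke("tnrs", [" Homo sapiens "])
second = broker.invoke("tnrs", ["Mus musculus"])
```

After the first call the failing service is ranked last, so the second call goes directly to the service that succeeded.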

Architecture

The architecture team focused on the high-level design of phylotastic systems, including the definitions and interaction of components, the flow of information through standard interfaces, and the integrated control of operations via controllers and workflow environments.

Design considerations

The architecture group set out to identify and formalize the interoperation between different components described in the generalized design above, taking into account (1) agreement on the minimum workflow to be enabled; (2) a delimitation of operations into modules with agreed inputs and outputs, in terms of both format and content; (3) the requirement that all modules can operate both standalone and as part of a workflow; and (4) the requirement that the system is driven by user events. Addressing these requirements ensured that each individual module could be developed independently and treated as a “black box” in the overall architecture, taking input in a specific format, and producing output to be reused by other modules.

The minimum functionality requirement devised by the hackathon participants was the following: when a user submits a list of names through a controller interface, an annotated tree containing all the named species (“phylotastic tree”) is returned. To enable this simple workflow, the submission of names (“dirty names”) to the controller triggers the TNRS module, which makes use of public services to return a new list of names (“clean names”). The controller submits the clean names to a tree-store service, which will return the matching megatrees and associated metadata. The topology service, in turn, finds and retrieves applicable megatrees by querying the tree-store. Optionally, the user may pass (as input to the topology service) reference trees to be used by the topology service. Both megatrees (returned by the tree-store) and phylotastic trees (pruned and returned by the topology service) can be enriched by the annotator, which tags each branch of the selected tree with provenance metadata and branch length. The phylotastic trees are returned to the user through a user interface that may enable the user to visualize and manipulate the tree. Note that inputs and outputs may be passed by reference (e.g., components may pass a megatree via a resolvable reference to the megatree).
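The minimum workflow can be sketched as a controller that chains four black-box modules. The Python below is a toy under stated assumptions: every function is a local dummy standing in for a remote service, megatrees are reduced to sets of tip names, and the names `tnrs`, `tree_store`, `topology`, `annotate` and `MEGATREES` are hypothetical.

```python
# Dummy "megatrees": only the tip names are kept, no topology (an assumption).
MEGATREES = {
    "mammals": {"Homo sapiens", "Pan troglodytes", "Mus musculus"},
    "plants": {"Arabidopsis thaliana", "Oryza sativa"},
}


def tnrs(dirty_names):
    """Dummy name resolution: fix whitespace and capitalisation only."""
    return [" ".join(name.split()).capitalize() for name in dirty_names]


def tree_store(clean_names):
    """Return megatree identifiers ranked by how many names they cover."""
    wanted = set(clean_names)
    hits = [(len(wanted & taxa), tree_id) for tree_id, taxa in MEGATREES.items()]
    return [tree_id for count, tree_id in sorted(hits, reverse=True) if count > 0]


def topology(tree_id, clean_names):
    """Dummy pruner: report which requested tips survive in the megatree."""
    return sorted(set(clean_names) & MEGATREES[tree_id])


def annotate(tips, tree_id):
    """Attach provenance; branch lengths are left unset in this sketch."""
    return {"tips": tips, "source": tree_id, "branch_lengths": None}


def controller(dirty_names):
    clean = tnrs(dirty_names)
    candidates = tree_store(clean)
    if not candidates:
        return None
    best = candidates[0]            # passed by reference (an identifier)
    return annotate(topology(best, clean), best)


result = controller(["homo  sapiens", "MUS musculus", "Oryza sativa"])
print(result)
```

Because each step only consumes the previous step's output, any dummy could be replaced by a call to a real service without changing the controller logic.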

Implementations

Since one of the requirements was that all services be able to interoperate, whereas the hackathon teams used a variety of programming paradigms, the teams agreed to exchange data through REST services. Figure 2 shows the flow of operations for a typical use-case, as the user’s query is processed via various components of a phylotastic system.

Figure 2

Use case diagram for the implementation of the reference architecture, displaying both the minimal workflow requirements (upper) and the complete workflow (lower) with module independence. The method names are represented in the left column. When a list of names is submitted to the controller, the TNRS endpoint is invoked with a set of configuration parameters. This list of URI names is then submitted to the tree-store service and the output expected is a set of matched megatrees. The megatree URIs and the user-selected taxa URIs are then submitted to the topology endpoint, which responds with the phylotastic trees. Finally, the URIs are submitted to the Branch Length module, which returns the chronogram or trees annotated with branch lengths.

To demonstrate the platform-independent and loosely coupled nature of the phylotastic scheme, 3 different controllers were implemented, in JavaScript, Perl and the component API for Galaxy [33], a workflow environment. The JavaScript controller (executable on the server side with Node.js [34], or on the client side through any web browser) was designed with 4 endpoints to match each of the independently developed modules: TNRS, TreeStore, Topology and Branch Length (Figure 2). The controller both supports the minimum use case, and respects the independence requirement of each module. The Perl version of the controller is implemented as a CGI script and provides functionality that is similar to the JavaScript controller. By default, the Perl controller coordinates 4 “dummy” implementations of the TNRS, Tree Store, Topology, and Branch Length modules, but can be configured (via CGI parameters) to invoke real implementations for each of these modules.

Finally, a controller was implemented in the form of a collection of simple clients for the Galaxy platform [33]. Using Galaxy’s interactive workflow editor, these clients can be chained together in flexible ways to perform taxonomic name reconciliation, branch length estimation and tree pruning using the previously described RESTful services, in addition to various file conversion and filtering services to accommodate the tabular data model used in the Galaxy environment.

Resolving taxonomic names (TNRS)

The phylotastic system envisioned above integrates data using species names, but errors and lack of specificity in such names can be major sources of ambiguity [22, 35]. Users attempting to integrate phylogenetic information via names must go through a manual process of reconciling these names to each other. Several TNRSs (taxonomic name resolution services) have been created in recent years that may assist in this process by matching user-supplied names against currently valid or accepted names in taxonomic databases (e.g., ITIS: see Table 1). However, each TNRS uses a different API, has a different number of names in its database, covers different sections of the ToL and provides a different set of features [36]; none represents a complete solution for a phylotastic system. We decided to focus on developing a single meta-service that would provide access to multiple existing TNRSs, as well as a single API, which could in the future be used by any client software to query any TNRS.

Design considerations

Our main design goals were to develop a web service and interface which would be (1) easy for developers to integrate into phylotastic workflows, (2) simple enough for users to understand, but without shielding the complexity of taxon names from them, and (3) broad enough to cover multiple nomenclatural codes. With these aims in mind, we developed a dual-purpose API that serves both for core services and for a meta-service that aggregates over multiple core services.

The API, which is documented more fully on the project wiki linked to the project web site (Table 1), is based on the iPlant TNRS API [36], and always returns responses as correctly formatted JavaScript Object Notation (JSON) objects. It provides only two methods: a submit method which accepts a list of newline-separated names for matching, returning a token, and a retrieve method that accepts a token and returns a report on the results of processing the original query. This asynchronous approach allows the server to carry out computationally intensive processes like fuzzy matching without the risk of the TCP/IP connection timing out.
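The submit/retrieve pattern can be illustrated with an in-memory toy in Python. This sketch does not reproduce the actual API: the class `ToyTNRS`, the JSON field names (`token`, `status`, `names`, `submitted`, `matched`) and the exact-match lookup are assumptions, and the matching is done at retrieval time here, whereas a real server would process the job in the background between the two calls.

```python
import json
import uuid


class ToyTNRS:
    """In-memory stand-in for a token-based name resolution service."""

    def __init__(self, known_names):
        self._known = set(known_names)
        self._jobs = {}

    def submit(self, query: str) -> str:
        """Accept newline-separated names; return a JSON object with a token."""
        token = uuid.uuid4().hex
        self._jobs[token] = [ln.strip() for ln in query.splitlines() if ln.strip()]
        return json.dumps({"token": token})

    def retrieve(self, token: str) -> str:
        """Return a JSON report for a previously submitted query."""
        names = self._jobs.get(token)
        if names is None:
            return json.dumps({"status": "unknown token"})
        report = [{"submitted": n, "matched": n in self._known} for n in names]
        return json.dumps({"status": "done", "names": report})


service = ToyTNRS(["Homo sapiens", "Mus musculus"])
token = json.loads(service.submit("Homo sapiens\nHomo sapien\n"))["token"]
report = json.loads(service.retrieve(token))
print(report)
```

The client holds only the token between the two calls, so no connection needs to stay open while the names are processed.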

For each name submitted, every name matched by every service is returned with a match score between 0.0 and 1.0, with 0 indicating that the name could not be matched and 1 indicating a perfect match. This score can be used by TNRSs that implement fuzzy or partial matching algorithms to report match scores or degrees of certainty. The meta-service returns all names found across all queried sources, leaving it to the client to decide which names to pick in case of conflicts between sources. A URI uniquely identifies each accepted name and provides credit to its source. Details are available in the online API description mentioned above.
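Since the choice among conflicting matches is left to the client, one simple client-side policy is to keep the highest-scoring candidate above a threshold. The Python sketch below assumes a hypothetical report layout (a dictionary from submitted name to candidate records with `source`, `accepted`, `uri` and `score` keys); the source names, URIs and scores are made up for illustration, and the 0.5 threshold is an arbitrary assumption.

```python
def pick_best(report, min_score=0.5):
    """For each submitted name, keep the highest-scoring usable candidate."""
    chosen = {}
    for submitted, candidates in report.items():
        usable = [c for c in candidates if c["score"] >= min_score]
        # max() with default=None handles names that no source could match.
        chosen[submitted] = max(usable, key=lambda c: c["score"], default=None)
    return chosen


# Hypothetical aggregated report from two sources.
REPORT = {
    "Homo sapien": [
        {"source": "serviceA", "accepted": "Homo sapiens",
         "uri": "http://example.org/a/1", "score": 0.90},
        {"source": "serviceB", "accepted": "Homo sapiens",
         "uri": "http://example.org/b/7", "score": 0.95},
    ],
    "Xyz abc": [
        {"source": "serviceA", "accepted": None, "uri": None, "score": 0.0},
    ],
}

best = pick_best(REPORT)
```

Other policies are equally possible, for example preferring a trusted source regardless of score, which is why the meta-service returns every candidate.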

Implementations

The reference implementation (codenamed “Taxosaurus”) consists of a handler and a collection of adaptors. The handler is responsible for communicating with the client and the adaptors; it uses a subset of the full API to send the client's queries to each adaptor, and combines and formats the results from each adaptor. The calls to the name-providers are handled by adaptors that are independent from the handler, and which may be written in any programming language. Their task is to serve as a translator between the meta-service API and the name provider’s API. The handler itself is modular; for purposes of speed, parts of it might eventually be incorporated into the Apache web service. Taxosaurus is very similar to the Taxonomic Search Engine of Page [35]. A key difference is the adoption of a RESTful approach instead of SOAP and the use of URIs instead of LSIDs to uniquely identify each accepted name.

A core service that conforms to the API may be aggregated into the meta-service implementation; a TNRS service that does not conform to the API may be wrapped in an adaptor, in principle. We provide access to three core services described below.

NCBI Taxonomy. This adaptor, written in Python, uses NCBI’s E-Utils service (particularly the “ESearch” and “ESummary” commands) to match names (see Table 1 for NCBI eUtils). Scores may be '1' (successful match) or '0' (could not match).

iPlant TNRS. Since our API design was directly inspired by the iPlant API [36], this adaptor consists of a simple Perl script that forwards queries to iPlant, and renames the fields in the result to match our schema before returning the results to the handler. The score returned by the iPlant TNRS is directly passed on to the user.

MSW3. Due to the importance of the mammalian supertree from [11] for phylotastic projects involving mammals, we implemented a new core service, called MSW3, based on taxonomic data from Mammal Species of the World, third edition [37]. The MSW3 taxonomy was downloaded as a CSV file from the MSW3 website (see Table 1), and potential synonym names were extracted by searching for text beginning and ending with "<i>" and "</i>" tags in any column. A new CSV file consisting solely of these indexed names was stored as a local database. Three techniques are used to match names: (1) searching for an identical binomial name in the Genus and Species columns; (2) searching for names found anywhere in other columns in the CSV file, and (3) searching for names mentioned in different parts of a single row (e.g., the species epithet in the 'species' field, but the genus name mentioned in the 'Synonyms' column). These techniques identify all but a few percent of the 4500 names in the mammal supertree.

Tree storage (TreeStore)

The focus of the TreeStore team was to develop a flexible way for phylogeny providers to expose knowledge for use in phylotastic systems.

Design considerations

A key feature of the overall architecture (above) is the capacity to choose a suitable tree from among available source trees, rather than being constrained to one or a few local or “built in” trees. The suitability of a source tree for a given case may be based not only on coverage of a set of species, but on metadata describing the methods, protocols, and data used to construct the tree (e.g., see [38]). Such metadata would include the recommended citation by which users can properly credit the source tree(s) in subsequent publications. Thus, a phylotastic tree-store should support flexible storing, querying, and retrieval of trees, and also of metadata associated with trees and their component nodes and branches.

Implementations

Standards and technologies for the semantic web [39] seem well positioned to address the functional requirements for the tree-store. In particular, we chose to use the Resource Description Framework (RDF) [40] to design a model for large phylogenies and their metadata; to use Virtuoso (OpenLink Software [41]) as a triple-store [42] that is scalable enough for storing very large phylogenies annotated with RDF; and to use Virtuoso’s implementation of SPARQL (the standard RDF query language) for querying the triple-store.
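To show what representing a phylogeny as triples means in practice, the following Python sketch stores a small tree as (subject, predicate, object) tuples and answers a pattern query in plain Python. It is a toy stand-in for a triple-store, not the TreeStore data model: the predicates `ex:hasParent`, `ex:label` and `ex:citation` are placeholders invented here (they are not CDAO terms), and the SPARQL string is shown only as a rough equivalent that is not executed.

```python
HAS_PARENT = "ex:hasParent"   # placeholder predicate, not a CDAO term
LABEL = "ex:label"            # placeholder predicate, not a CDAO term

TRIPLES = {
    ("n1", HAS_PARENT, "root"),
    ("n2", HAS_PARENT, "root"),
    ("t1", HAS_PARENT, "n1"),
    ("t2", HAS_PARENT, "n1"),
    ("t3", HAS_PARENT, "n2"),
    ("t1", LABEL, "Homo sapiens"),
    ("t2", LABEL, "Pan troglodytes"),
    ("t3", LABEL, "Mus musculus"),
    ("tree", "ex:citation", "placeholder citation"),  # tree-level metadata
}


def match(s=None, p=None, o=None):
    """Return triples matching a pattern; None acts as a wildcard."""
    return [t for t in TRIPLES
            if s in (None, t[0]) and p in (None, t[1]) and o in (None, t[2])]


def children(node):
    return sorted(s for s, _, _ in match(p=HAS_PARENT, o=node))


def tips_below(node):
    """Labels of all tips descending from a node."""
    kids = children(node)
    if not kids:
        return [o for _, _, o in match(s=node, p=LABEL)]
    return [label for kid in kids for label in tips_below(kid)]


# Roughly equivalent SPARQL 1.1 query for node n1 (not run in this sketch).
SPARQL_SKETCH = """
PREFIX ex: <http://example.org/>
SELECT ?label WHERE { ?tip ex:hasParent+ ex:n1 ; ex:label ?label . }
"""

print(tips_below("n1"))
```

Because tree structure and metadata live in the same set of triples, a single query mechanism covers both, which is the flexibility the design considerations call for.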

A critical step in enabling programmers and users alike to query the data in a triple-store is to design an RDF data model that, on the one hand, has the flexibility to accommodate diverse data, metadata and querying needs, and on the other hand, aligns well with controlled vocabularies and ontologies currently in use. We chose to use the Comparative Data Analysis Ontology (CDAO) [43]. We began by modifying CDAO to comply more fully with best practices in ontology engineering [44].

Designing the representation for the required metadata capabilities revealed several gaps in CDAO and other available vocabularies. As a consequence, we are building OWL ontologies and RDF models for TNRS matches of OTU labels, for bibliographic citation of a tree, and for the various attributes encompassed by the proposed MIAPA reporting standard [38]. For example, the TNRS ontology (see Table 1) defines the classes and properties needed to represent the results from resolving OTU labels to taxonomic names.

Pruning and grafting

Design considerations

The concepts of grafting and pruning are relatively simple, and could be implemented as part of a tree-store, but standalone services are part of the distributed design of a phylotastic system. While there was not a team dedicated to topology services, several hackathon participants developed proof-of-concept pruners, explored compute times for pruning, and refined existing pruning implementations.

Implementations

For small trees, pruning is already available for users of some phylogenetic programming platforms and software packages. However, for on-the-fly pruning of very large trees (e.g., 10^5 taxa), the algorithms implemented in commonly used programming toolkits (e.g., DendroPy, [26]; Forester (see Table 1); NCL, [45]) may be too computationally intensive, especially if for each pruning step the tree structure is parsed out of parenthetical-formatted flat-text [46]. Preliminary tests confirm that this is the case, indicating that pruning using implementations such as DendroPy may take several minutes for a tree the size of the 55473-species tree of Smith, et al. [12]. Experiments with relational databases suggested that performing pruning operations using SQL might not yield satisfactory performance improvements either. In contrast, a prototype implemented using SPARQL and the Virtuoso triple-store suggested that such an approach could provide excellent performance.

Promising results were found using the MapReduce design pattern [47]. Our implementation, deployed on the development server with a convenient web-forms interface described below (see Table 1 for URI), suggested that pre-processing of the input trees into separate terminal taxon-to-root paths that are accessed in a parallelizable way could reduce processing times significantly (e.g., in the prototype implementation, shortening a pruning operation from ~20 minutes to ~6 seconds). Most of the performance gains of the prototype implementation are likely due to the pre-computed de-normalization of the tree structure into taxon-to-root paths. Because this implementation does not yet take advantage of parallelization, and requires the Hadoop MapReduce framework to boot up for each request, further performance gains may be anticipated.
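The idea of pruning from pre-computed taxon-to-root paths can be sketched in a few lines of Python. This is a single-process illustration only, not the Hadoop-based implementation; the node labels (n1, n2, n3), the toy four-species tree, and the function names are assumptions introduced for the example.

```python
# Toy tree stored as child -> parent; the root's parent is None.
PARENT = {
    "n1": None,
    "n2": "n1", "n3": "n1",
    "Homo_sapiens": "n2", "Pan_troglodytes": "n2",
    "Mus_musculus": "n3", "Rattus_norvegicus": "n3",
}

def root_paths(parent):
    """Pre-processing step: store, for every tip, its path up to the root."""
    internal = set(parent.values())
    paths = {}
    for tip in (n for n in parent if n not in internal):
        path, node = [tip], tip
        while parent.get(node) is not None:
            node = parent[node]
            path.append(node)
        paths[tip] = path
    return paths

def prune(paths, keep):
    """Merge the stored paths of the requested tips and emit Newick."""
    kids = {}
    for tip in keep:
        path = paths[tip]
        for child, par in zip(path, path[1:]):
            kids.setdefault(par, set()).add(child)
    root = paths[keep[0]][-1]

    def newick(node):
        sub = kids.get(node)
        if not sub:
            return node                      # a requested tip
        if len(sub) == 1:
            return newick(next(iter(sub)))   # collapse single-child nodes
        return "(" + ",".join(sorted(newick(c) for c in sub)) + ")"

    return newick(root) + ";"
```

Because each query touches only the stored paths of the requested tips, the cost of a pruning request depends on the size of the query rather than on the size of the source tree, which is consistent with the explanation given above for the observed speed-up.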

Grafting and pruning are the core operations of Phylomatic [8], a pre-existing tool mentioned in Background. The online version of Phylomatic was upgraded (version 3) in connection with this hackathon (see Table 1 for URI), and is now implemented in gawk, a lightweight pattern-matching utility [48], drawing on no external libraries to parse Newick, NeXML or CDAO RDF phylogenies, and using simple node-to-parent-node arrays as its internal data representation. Pruning and grafting in Phylomatic are relatively efficient, requiring only 1.8 seconds to load a tree with 55000 tips and prune out all but 10 taxa (on an Intel i5 processor). Functionality that was added to version 3 includes: modifications to enable easier access and incorporation as a web-service; a range of built-in megatrees, not just for plants; and the ability to read and write NeXML and CDAO RDF, enabling the web-service to act as a format translator without any grafting or pruning.

Scaling trees (Branch lengths)

The Branch Lengths group aimed to satisfy the user requirement for phylogenies that are not merely topological frameworks, but have branch lengths that reflect time or amount of divergence, based on incorporating relevant biological data.

Design considerations

Several possible automated approaches to scaling trees may be imagined, including sampling character data (e.g., sequences obtained from GenBank); assigning branch lengths by simple subdivision of root-to-tip path lengths into equidistant internodes (e.g., as in the bladj method from [8]) or using more sophisticated models of expected cladogenesis; calibrating the tree using fossil data; or combinations of these different approaches.
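As a rough illustration of the second approach, the Python sketch below spaces nodes evenly along the longest path beneath each dated node, given only a root age. It is a simplified stand-in and should not be read as the bladj algorithm itself, which also accepts ages for internal nodes; the nested-tuple tree encoding and the function names are assumptions.

```python
def height(tree):
    """Number of edges on the longest path from this node down to a tip."""
    if isinstance(tree, str):          # tips are plain strings
        return 0
    return 1 + max(height(child) for child in tree)

def scale(tree, age):
    """Return Newick text with branch lengths, given the age of this node.

    A child whose longest path to a tip has h edges is placed at
    age * h / (h + 1), which divides the (h + 1)-edge path from the
    parent to that tip into equal internodes. Tips get age 0.
    """
    if isinstance(tree, str):
        return tree
    parts = []
    for child in tree:
        h = height(child)
        child_age = age * h / (h + 1)
        parts.append(f"{scale(child, child_age)}:{age - child_age:g}")
    return "(" + ",".join(parts) + ")"
```

For example, `scale((("A", "B"), "C"), 10)` places the ancestor of A and B halfway between the root and the present, so every root-to-tip path sums to the root age.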

The Branch Lengths group opted to develop a system that assigns dates to nodes based on a stored library of fossil-calibrated trees (chronograms). This design was inspired by TimeTree [49], which takes a pair of species and returns point estimates of the age of their most recent common ancestor from published chronograms. TimeTree itself could not be used, as the terms of its license prohibit large-scale mining of its data, which are compiled from published work. Our initial design includes three elements: an input interface allowing the user to specify a list of taxa (two or more), a server that returns estimates of ages for most-recent-common-ancestors, and an interface to the results returned.

Implementations

Interaction with DateLife is primarily done through its website (see Table 1), though all the source code and data can be downloaded to run locally. The website is created using PHP, which also processes RESTful requests. The PHP layer then calls an already-running R daemon, created using the FastRWeb [50] interface to RServe [51] as well as functions from ape [52] and new functions. This daemon returns the requested information as a JSON string, a Newick tree, or an HTML page. Internally the R scripts work as follows. Upon startup, input trees are pre-processed by converting them into patristic distance matrices for taxa. Then, satisfying a query by obtaining the ages for a set of taxa is simply a matter of matching row names and then subsetting the array to the relevant entries (rather than traversing a tree structure). Doing this for thousands of trees on a typical server takes under a second. The initial chronograms were placed in the PhyloOrchard R package as a temporary tree-store while others were being developed.

Importantly, rather than returning a single point estimate from a study, the new tool allows a range of dates to be returned if there are multiple trees (such as post-burnin trees from a Bayesian analysis) from a study.
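The look-up strategy can be sketched in Python (DateLife itself is written in R). The sketch assumes ultrametric chronograms, for which the age of the most recent common ancestor of two tips is half their patristic distance. The two toy trees, their ages, and the function names are invented for illustration.

```python
from itertools import combinations

# Each toy chronogram is (child -> parent, node -> age); ages are invented.
TREE_1 = ({"A": "n1", "B": "n1", "n1": "r", "C": "r", "r": None},
          {"A": 0, "B": 0, "C": 0, "n1": 4, "r": 10})
TREE_2 = ({"A": "r", "B": "r", "r": None},
          {"A": 0, "B": 0, "r": 6})

def patristic_matrix(parent, age):
    """Pre-processing: pairwise tip-to-tip distances (2 x MRCA age)."""
    def ancestors(node):
        out = [node]
        while parent.get(node) is not None:
            node = parent[node]
            out.append(node)
        return out

    tips = sorted(n for n in age if age[n] == 0)
    dist = {}
    for a, b in combinations(tips, 2):
        anc_a = ancestors(a)
        mrca = next(n for n in ancestors(b) if n in anc_a)
        dist[frozenset((a, b))] = 2 * age[mrca]
    return dist

def mrca_ages(matrices, taxa):
    """Query: (min, max) MRCA age per taxon pair, by matrix look-up only."""
    result = {}
    for a, b in combinations(sorted(taxa), 2):
        key = frozenset((a, b))
        ages = [m[key] / 2 for m in matrices if key in m]
        if ages:
            result[(a, b)] = (min(ages), max(ages))
    return result
```

Once the matrices exist, answering a query never traverses a tree, and pairs found in several trees yield a range rather than a single point estimate.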

Web site and special demonstrations (Shiny)

The goal of Team Shiny was to develop demonstrations showing the potential of phylotastic components, and to create a public face for the project.

Design considerations

The team considered demonstration projects that would be easy to understand, that would highlight the unique role of phylotastic systems, and that would relate to important research problems. The team sketched out five possible projects, prioritized as follows: (1) Reconcili-o-Tastic; (2) a generalized phylotastic web interface; (3) Mesquite-o-Tastic (and other forms of integration with character analysis workflows); (4) Phylo-Taxic; and (5) Phylotas-Doc. The first 3 ideas are explained below; Phylo-Taxic would provide a phylogeny for a higher taxon such as a family or order; Phylotas-Doc would supply a phylogeny for the species named in a scientific paper or other document such as a web site.

Implementations

Team Shiny implemented 3 demonstration projects, deployed an informational web site (http://phylotastic.org), and produced a series of blogs and screencasts. While it was not possible to produce a fully generalized web interface (due to the lack of a fully implemented phylotastic system), the team built up the web interface to the MapReduce pruner (see Table 1) with explanations along with sample queries appropriate for each of the source trees.

Mesquite-o-Tastic is a small demonstration of the utility of integrating phylotastic services into the kinds of workflows often used in evolutionary analysis, which are interactive and manually supervised workflows, often combining several separate pieces of software. Mesquite [53] is an extensible workbench for comparative evolutionary analysis written in Java, providing tools for uploading and manipulating lists of species (or other units of comparison such as genes or proteins), matrices of comparative “character data” and trees. We developed a small Java module that extracts a list of taxa from the data matrix loaded into Mesquite, and attempts to obtain a phylogeny for those species using the MapReduce pruner described above. Mesquite automatically integrates the tree with any character data, allowing phylogeny-based analyses of the character data, as shown in a screencast (see Table 1 and below, Results and Discussion).

The main product of Team Shiny is Reconcili-o-Tastic. As noted in the Background, reconciliation of gene trees with species trees [54] is potentially a high-volume use case for phylotastic services. Current reconciliation approaches assume that the user will supply a species tree. This requires the user to determine the set of species implicated by a gene or protein tree, then generate or otherwise obtain a tree for those species. In Reconcili-o-Tastic, these steps are automated. In response to the choice of an initial gene tree (from among a set of examples provided), Reconcili-o-Tastic (1) reads the input tree; (2) queries external databases to link identifiers in the input file to species names; (3) uses these species names to retrieve the species tree phylotastically; and (4) performs reconciliation.

The operations are implemented as follows. Strings that match the pattern of gene identifiers are extracted from the input file using custom code. Sequence records are then retrieved from an appropriate database by invoking UniProt web services (see Table 1). Species names are obtained from these sequence records. Genes (proteins) for which a corresponding species name cannot be established, along with those missing from the phylotastic species tree, are removed from the gene tree prior to reconciliation. Reconciliation is done using a modified version of the speciation-duplication inference (SDI) algorithm described in [55], which allows for non-binary species trees. The result of this reconciliation is a gene tree with speciation or duplication events at each internal node.
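The core of speciation-duplication inference can be illustrated with the standard LCA-mapping rule: each internal gene-tree node is mapped to the most recent common ancestor, in the species tree, of the species mapped to its children, and it is scored as a duplication when it maps to the same species-tree node as one of its children. The Python sketch below assumes a binary gene tree and is not the Forester implementation or the modified algorithm used here; the species labels (X, Y, Z), gene labels, and function names are placeholders.

```python
# Toy species tree ((X,Y)XY,Z)XYZ stored as child -> parent.
SPECIES_PARENT = {"X": "XY", "Y": "XY", "XY": "XYZ", "Z": "XYZ", "XYZ": None}
GENE_TO_SPECIES = {"x1": "X", "y1": "Y", "x2": "X", "z1": "Z"}
GENE_TREE = (("x1", "y1"), ("x2", "z1"))   # nested tuples, tips are gene names

def reconcile(gene_tree, species_parent, gene_to_species):
    """Return (node, event) pairs in post-order for internal gene-tree nodes."""
    def ancestors(node):
        out = [node]
        while species_parent.get(node) is not None:
            node = species_parent[node]
            out.append(node)
        return out

    def lca(a, b):
        anc_a = set(ancestors(a))
        return next(n for n in ancestors(b) if n in anc_a)

    events = []

    def visit(node):
        if isinstance(node, str):
            return gene_to_species[node]
        left, right = (visit(child) for child in node)
        here = lca(left, right)
        # Same mapping as a child means the split predates the speciation.
        kind = "duplication" if here in (left, right) else "speciation"
        events.append((node, kind))
        return here

    visit(gene_tree)
    return events
```

In the toy data, the two inner nodes are speciations, while the root maps to the same species-tree node as its right child and is therefore scored as a duplication.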

Reconcili-o-Tastic is implemented as a web application in the web2py framework, using JavaScript for front-end operations, and drawing extensively for back-end operations on the Forester library (Java; see Table 1), which includes the SDI reconciliation engine and the Archaeopteryx viewer. Reconciled trees are represented (with encoded duplication and speciation nodes) in PhyloXML format [56]. Input gene trees, species trees, and reconciled trees are displayed interactively using Archaeopteryx [53], an embedded Java applet.

Results and discussion

The aim of the phylotastic project is to develop a delivery system for expert knowledge of species phylogeny. In response to a user-supplied list of taxa, a phylotastic system identifies suitable source phylogenies, matches species identifiers, prunes away unneeded subtrees, grafts on missing species, and supplies branch lengths and other information, ultimately returning an expert phylogeny for the user’s list of species. Ideally such a system would cover all kingdoms of life, be fast enough to provide results while the user waits, and address the particular needs of researchers for reproducibility and provenance. To enhance the potential for such a system to become a sustainable community resource, it could be implemented as a set of loosely coupled components that interact in clearly defined ways, e.g., via web services.

Steps toward enabling a phylotastic system

The implementations described above provide a point of reference for considering the potential of a phylotastic system as conceived here, and for identifying weaknesses. With these 2 goals in mind, we discuss below 3 demonstration projects in particular: the MapReduce pruner, Mesquite-o-Tastic, and Reconcili-o-Tastic.

The MapReduce pruner can be invoked interactively via a convenient web form, which has a text box in which to enter a list of species, and pull-down menus to select a format (5 choices from Newick to NeXML) and a source tree (from a list that includes most of the large trees listed in Background). A tree is typically returned in 8 seconds. The web form is merely the front end to a web service that can be invoked via a URI with arguments for “species”, “tree” and “format”. A simple Perl script using this web-services API would be as follows:

This script could be invoked with a command such as

and the result will be a file called “out.tre” with the Newick tree-string “((Homo_sapiens,Pan_troglodytes),Mus_musculus)”.
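For readers who prefer Python, a request URI of this kind might be assembled as in the sketch below. Only the three argument names (“species”, “tree”, “format”) come from the description above. The base address is a deliberate placeholder (the real one is listed in Table 1), and the comma-separated species list and the example values "mammals" and "newick" are assumptions, not documented parameters of the service.

```python
from urllib.parse import urlencode

# Placeholder only: substitute the pruner address given in Table 1.
BASE_URL = "http://example.org/prune"

def build_query(species, tree, fmt):
    """Assemble a query URI with 'species', 'tree' and 'format' arguments."""
    params = {
        "species": ",".join(species),  # assumed separator
        "tree": tree,                  # assumed source-tree identifier
        "format": fmt,                 # assumed format keyword
    }
    return BASE_URL + "?" + urlencode(params)

uri = build_query(["Homo_sapiens", "Pan_troglodytes", "Mus_musculus"],
                  "mammals", "newick")
```

The resulting URI could then be fetched with any HTTP client and the response written to a file such as “out.tre”.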

Demonstration software based on Mesquite [52], an extensible workbench, illustrates how such web services could be integrated into an interactive workflow. In the Mesquite-o-Tastic screencast (Table 1), a NEXUS file from a published scientific study [57] is downloaded from an online archive (MorphoBank [58]), and opened in Mesquite, which shows that the file contains a character matrix with 51 taxa, but not a phylogeny. When the user invokes a custom menu item (added for this project), Mesquite automatically formulates a query URI using the names in the character matrix, executes the query remotely (invoking the MapReduce pruner above) and incorporates the resulting tree in memory. The tree is then available for graphical display as well as for use in analyses such as reconstructing ancestral states. While we chose to create this demonstration using Mesquite, the same thing could be done with a variety of other software systems (e.g., DAMBE, PAUP, MEGA).

The Reconcili-o-Tastic demonstration shows how phylotastic services can be integrated into a more automated workflow, as described above (Shiny section). In this case, not only is the query constructed automatically (by retrieving species names from sequence identifiers), it is used automatically for a downstream analysis step, which is to generate a reconciled tree.

Current limitations

Limitations of the tools described above become apparent quickly if one considers a broader set of cases than the sample queries used for illustration. Some of these limitations reflect gaps in the current state of expert knowledge, others are due to incomplete implementations of the phylotastic concept, and still others represent design limitations.

These limitations can be clarified relative to an imaginary challenge of (1) obtaining many sets of names, e.g., by downloading hundreds of NEXUS files from MorphoBank [58], or thousands of NEXUS files from TreeBASE [7], or processing thousands of scientific publications using GNRD (Table 1) to auto-recognize names, then (2) using the tools developed here to find phylogenies for implicated taxa, and (3) attempting to use the resulting phylogenies to carry out some kind of phylogeny-dependent downstream step, such as computing phylogenetic diversity, or reconstructing ancestral character states.

Current tools, if subjected to this kind of challenge, would prove unsatisfactory. The first problem is that source trees available (via the MapReduce pruner) provide limited coverage of the millions of known biological species. The result is that, in many cases, only a minor fraction of species named in the query would be found in a source tree. This is largely a limitation in the coverage provided by available megatrees (the Open Tree Of Life project, mentioned further below, attempts to address this gap by synthesizing phylogenetic knowledge broadly). Coverage differs dramatically among different taxonomic groups, e.g., the mammal tree [11] covers the vast majority of extant mammals, but there is poor coverage of fungi and protists. Comparative studies of morphology often use fossil data, but fossil species are poorly represented in large phylogenies, because the latter typically are constructed from molecular sequences (usually not available for extinct species). Grafting of missing species onto the corresponding genus or family could improve coverage, but we have not implemented grafting methods here other than via web services supplied by the enhanced version of Phylomatic.

Currently available strategies for taxonomic name resolution also represent limitations. Our TNRS meta-service has not been integrated with other components, so that (for example) the names in a NEXUS file uploaded by Mesquite-o-Tastic must be spelled exactly as in a source tree available from the MapReduce pruner, or no match will be found. This may be a desirable behavior in some cases, e.g., when the user (or client application) is confident about names and does not want to allow any translation. However, in most cases, presumably, integrating the TNRS meta-service would improve results.

The potential for improvement is itself limited, because (1) sources of taxonomic knowledge referenced by the meta-service (NCBI, MSW, etc.) provide limited coverage of names; (2) spell-checking typically is not available; (3) there is no automated way to choose among multiple matches (especially given the possibility of valid homonyms, i.e., identical names for different species covered by different nomenclatural codes); and (4) there is no automated way to interpret names from higher taxonomy. With regard to the last-named challenge, for instance, MorphoBank has many data matrices (e.g., project #816) that combine data at the genus level or higher, so that the key to a row of data is a higher taxon label (e.g., “Sabellidae”) or an anonymized species name (e.g., “Myriowenia sp.”). One can imagine a system that resolves such names in an appropriate way depending on the user’s choice, but support for such a system exists only for plant species and only up to the taxonomic rank of family [36].

The general design for a phylotastic system (given above) calls for a tree-store that responds to a user’s query by identifying a source tree that provides the best coverage. However, such a component has not been integrated; thus, current tools require the user to specify a source tree in advance.
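The simplest version of such a selection step, ranking source trees purely by how many query names they contain, might look like the Python sketch below. The tree identifiers, taxon labels, and function name are hypothetical, and a real tree-store would presumably also weigh metadata such as methods and provenance, as discussed under Design considerations.

```python
def best_source_tree(query, source_trees):
    """Pick the source tree that contains the most names from the query.

    `source_trees` maps a tree identifier to the set of its tip labels.
    Ties are broken alphabetically by identifier.
    """
    wanted = set(query)

    def coverage(name):
        return len(wanted & source_trees[name])

    best = max(sorted(source_trees), key=coverage)
    return best, sorted(wanted & source_trees[best])

# Hypothetical tree-store contents.
SOURCE_TREES = {
    "tree_1": {"A", "B", "C"},
    "tree_2": {"B", "D"},
}
```

Here `best_source_tree(["A", "B", "C", "D"], SOURCE_TREES)` selects "tree_1", which covers three of the four requested names, and reports which names were found.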

Integrating many source phylogenies, along with a TNRS and a tree-store, would make it possible to respond to a variety of phylotastic queries and to identify the best source tree for each one. However, given our current implementations, the resulting trees would lack branch lengths and contain polytomies, making them unsuitable for many kinds of downstream analysis (for reasons noted earlier). The lack of branch lengths could be addressed by integrating the DateLife service, but currently its store of calibrated phylogenies covers only animals. This situation could be improved, but fossil-based calibrations generally are not available for phylogenies of groups of microscopic organisms, which have a poor fossil record. Other methods for scaling trees are possible (as noted above), but we have not explored those methods. Likewise, polytomies could be resolved arbitrarily, or using character-based methods, but we have not explored such options.

Finally, whereas we can imagine an enormously useful phylotastic system that efficiently delivers currently available knowledge of phylogeny, taxonomy, and fossil dating, current standards and technology for annotation are insufficient to enable the delivery of this knowledge with enough credibility for scientific research. Whereas students and educators may be satisfied to download a tree made by a “black box”, researchers will expect a clear description of sources and methods, including metadata on the source trees used to derive a result, and the phylotastic method of its derivation (by pruning, grafting, re-scaling, etc.). Yet standards for annotating sources and methods are relatively undeveloped (see discussion in [5]). Attribution and licensing present additional challenges for a system that re-uses data.

Beyond these narrow technical limitations there is the question of whether a fully developed phylotastic system ultimately would represent a practical alternative to other ways of obtaining phylogenetic information. The most obvious uses of such a system are for cases in which the user’s demand for speed is relatively high compared to the demand for rigor. Resources such as Wikipedia or the Encyclopedia of Life, for instance, might benefit from the ability to auto-generate phylogenies to illustrate a taxon for a taxon-specific web page. Whether scientists will use a phylotastic system for research purposes will depend on multiple factors that include the user’s demand for rigor, the user’s potential to infer (at a significant cost in terms of time, training, and computation) a more rigorous phylogeny by de novo methods, and the availability of a pre-computed expert phylogeny that covers the query of interest (and is considered authoritative).

Such factors are not easy to assess directly, but the examples cited by Stoltzfus et al. [5] suggest that, given the opportunity, researchers often will choose to forgo the task of inferring a species phylogeny de novo from character data, and instead will choose to apply approximate methods to compute a derived tree from an expert source tree, even when the researcher’s needs are limited to a single tree with a few dozen species (e.g., as in [59]).

A phylotastic ecology of informatics resources

The results described above provide a basis for further development of phylotastic systems in hackathons planned for the year 2013. This further development will take place within a community with a long history of cyber-infrastructure projects, the oldest being TreeBASE and ToLWeb, both of which date from the 1990s. More recent phylogeny-related resources include CIPRES [60] and PhyLoTA [61]. Taxonomic information services also have existed for many years (e.g., ITIS noted in Table 1). Our DateLife service is similar to the widely popular TimeTree project [49], noted above.

How does phylotastic relate to these other projects? How might the projects inter-relate in the future? Above (Implementation) we explained why we chose to implement a TNRS meta-service (rather than rely solely on an existing service), and why we chose to implement a new tree-scaling service similar to TimeTree. In both cases, the reasons relate to the need for resources that are designed (and licensed) to support automated data-integration tasks, rather than interactive or ad hoc uses.

Currently, although TreeBASE and ToLWeb are resources that represent expert knowledge of phylogeny, they are not alternatives to a phylotastic system as a convenient source of custom trees for downstream use. TreeBASE [7] provides tools for searching ~8000 published phylogenies (a small fraction of all published phylogenies [5]), but it does not include pruning or grafting tools, nor does its store of trees include any of the trees given above (Background) as examples of large species trees.

ToLWeb is primarily an educational resource whose main feature is a phylogeny divided into branches curated by experts who determine the phylogeny and supply annotations. Its downloadable phylogeny of 16000 taxa covers all kingdoms but includes <1% of named species; as noted in [5], when bioinformatics researchers want a comprehensive ToL (e.g., for projects such as TimeTree), they use the NCBI taxonomy tree, which has 250000 species. Educational uses of ToLWeb predominate over research uses, perhaps because the interfaces focus on graphical presentation: when ToLWeb is cited in the research literature, in studies such as [62-67], it appears that knowledge of a small set of relationships is conveyed visually rather than by computation from the tree.

Resources such as PhyLoTA [61] and the CIPRES portal [60] clearly provide ways to generate custom species trees for downstream use. However, the trees are generated by the user de novo from sequence alignments. While implementing a phylogenetic inference workflow using CIPRES or PhyLoTA is far easier than implementing an ad hoc workflow on a local computer, it is time-consuming and represents a substantial burden for most users.

By contrast, the phylotastic project aims to facilitate the case in which a user can make a scientifically defensible choice to use a modified (pruned, grafted, re-scaled) version of an expert phylogeny, rather than attempt to infer a phylogeny de novo. Clearly Phylomatic [8], the inspiration for the phylotastic project, also addresses this niche. While the original conception of Phylomatic was a local tool with a fixed source tree, subsequent developments (including developments for this project) expanded its web-services interface and added the capacity for a user-supplied tree, allowing Phylomatic to become a component in a distributed phylotastic system of components that interoperate to provide on-the-fly access to ever-expanding domains of expert knowledge (of phylogeny, taxonomy, geographic distribution, etc.).

Just as Phylomatic was designed for a smaller and more static world of data, but has begun to adapt to a larger and dynamic world, the other resources listed above also could play a role in this emerging world. As described above (Implementation), existing taxonomic name resolution services can be adapted for aggregation into a meta-service. Likewise, existing phylogeny resources such as ToLWeb or TreeBASE could expose their content using the TreeStore concept envisioned in Figure 2. For instance, for this project, we exported the XML version of the ToLWeb tree, and translated it using Bio::Phylo [25], so that the content of the ToLWeb tree is available via the pruner web tool described above. If ToLWeb were to supply the current version of its tree via a web service, this could be accessed by the pruner; likewise, TreeBASE could expand its current web-services interface to expose its data to phylotastic systems.

The goal of the phylotastic project is to leverage expert knowledge of phylogeny, rather than create de novo trees using tools such as PhyLoTA and CIPRES. Yet, there are cases in which a phylotastic system could benefit from rapid methods for making limited phylogenetic inferences from a sample of sequences or other data, including (1) using phylogenetic placement [68] to place a missing species on a tree, when taxonomic grafting is impossible or undesirable, (2) resolving a polytomy, or (3) assigning branch lengths within subtrees of organisms poorly represented in the fossil record (e.g., single-celled organisms).

The design of phylotastic systems allows for perpetual heterogeneity and novelty; thus, it does not matter whether or not a central source of authoritative ToL knowledge emerges in the next decade through efforts such as the Open Tree Of Life project (Table 1). New phylogenies will augment available tree-stores, and phylotastic systems will allow them to be pruned, grafted and analyzed according to the wishes and needs of the user. Because of the open architecture and modularity of the project, researchers can chain the phylotastic components together in various ways, piecemeal or as complete workflows.

Finally, if successful, a convenient and comprehensive delivery system for expert phylogenetic knowledge will create a competitive marketplace in which alternative source trees, and alternative phylotastic services, compete to satisfy the demands of users. The existence of such a marketplace may be expected to catalyze broad improvements in related technology and standards. For instance, given that the scale of scientific phylogeny re-use has been (with the exception of APG and Phylomatic) unimpressive [5], the delay in developing a “minimal information” standard for annotating phylogenies, first proposed in 2006 [38], is unsurprising. However, a phylotastic system will require annotations of sources and methods to satisfy the demand of researchers for credible (publishable) results and, given the choice, users will prefer those source trees, tree-stores, and client applications that provide them with more fully annotated results. Another critical feature missing from the technology landscape of phylogenetics is some scheme for quantifying the accuracy or perceived quality of phylogenies (e.g., an objective scheme based on consistency or metadata density, or a subjective scheme based on social bookmarking), but one can expect such a scheme to emerge naturally as an aid to users facing choices in a phylotastic marketplace.

Conclusions

The expanding scope of available species phylogenies, and the increasing demand for custom phylogenies for use in evolutionary analysis, suggest the value of a generalized phylotastic system that, in response to a user’s query consisting of a list of names, would provide name-resolution, tree-finding, pruning, grafting, scaling, and annotation operations necessary to generate a custom phylogeny for the named entities. Approximately 9 person-months of effort were devoted to a carefully planned hackathon at which 2 dozen participants worked to develop such a system. The results of this hackathon demonstrate the feasibility of some aspects of the project, such as rapid pruning and re-scaling, and expose remaining challenges, such as providing an integrated spell-checking system for mapping input names to qualified taxonomic identifiers. The project has demonstrated the feasibility of on-the-fly delivery of expert phylogenetic knowledge under limited conditions. Further work is needed to develop a production system that is robust and scalable, and which can be adapted to multiple use-cases. If such a system can be developed, it may be expected to drive improvements in other areas of the worldwide effort to assemble a ToL.

Availability and requirements

Project name: Phylotastic;

Project home page: http://www.phylotastic.org;

Operating systems: Linux, MacOSX;

Programming languages: Perl, Java, R, Python, JavaScript, PHP, SPARQL, awk;

Other requirements: as described for individual sub-projects;

Licenses: GPL3, MIT, BSD 3-clause;

Any restrictions to use by non-academics: No.

The project website (Table 1) describes the phylotastic project and provides links to demonstration software (“demos”), web services produced during the hackathon, the working project wiki, and code hosted on GitHub. The screencasts (see Table 1) are available on YouTube and are readily discovered by searching with the keyword “phylotastic”. The requirements differ for the different software products. These products generally are free of dependencies on commercial software.

Source code for most products is available at GitHub under an open-source license, with the following project names: phylotastic/tnrastic (TNRS metaservice); phylotastic/tolomatic (MapReduce pruner); phylotastic/arch-galaxy (Galaxy tool integration); helenadeus/phylotastic_js (JavaScript controller); phylotastic/cgi (CGI controller); phylotastic/mesquite-o-tastic (Mesquite-o-Tastic); phylotastic/phyloShiny (Reconcili-o-Tastic); phylotastic.github.com (project web site); camwebb/phylomatic-ws (Phylomatic 3). The DateLife web site (Table 1) includes links to its source code (currently on BitBucket with the identifier bomeara/datelife).

Access to live demonstrations and documentation is as follows. The MapReduce pruner is accessible via a convenient web-forms interface that provides instructions and examples (see Table 1). Reconcili-o-Tastic is implemented on the NESCent development server (Table 1) and may be installed locally (using web2py) following the instructions in the README file on GitHub. Mesquite-o-Tastic is not implemented on the server, but instead should be evaluated locally: the code available on GitHub (above) may be added to a local installation of Mesquite simply by copying the code into the proper local directory, as explained in the generic instructions for installing modules provided on the Mesquite project web site (Table 1). The Mesquite-o-Tastic screencast (Table 1) serves as the documentation. The DateLife presentation at the 2012 iEvoBio conference (Table 1) serves as the documentation for DateLife.

Finally, because some products may require special expertise to install, we have created a stand-alone server image that can be mounted by a computer system administrator without expertise in bioinformatics (see Table 1). A server image is a snapshot of an operating system disk that can be started up inside a host environment as a virtual machine (VM). This server image includes all of the principal working products of the hackathon, including those listed above, as well as the Virtuoso tree-store with a web-services interface, and excluding only DateLife, Phylomatic 3, and Mesquite-o-Tastic.

Abbreviations

APG:

Angiosperm Phylogeny Group

API:

Applications Programming Interface

CDAO:

Comparative Data Analysis Ontology

CGI:

Common Gateway Interface

HIP:

Hackathons, Interoperability, Phylogenies

ITIS:

Integrated Taxonomic Information System

JSON:

JavaScript Object Notation

MIAPA:

Minimum Information About a Phylogenetic Analysis

NCBI:

National Center for Biotechnology Information

NCL:

NEXUS Class Library

NESCent:

National Evolutionary Synthesis Center

LT:

Leadership Team of HIP

MSW3:

Mammal Species of the World, version 3

OBO:

Open Biomedical Ontologies

OTU:

Operational Taxonomic Unit

PDF:

Portable Document Format

rDNA:

Ribosomal DNA (deoxyribonucleic acid)

RDF:

Resource Description Framework

REST:

Representational State Transfer

TNRS:

Taxonomic Name Resolution Service

URI:

Uniform Resource Identifier

US:

United States

References

  1. Assembling the tree of life: harnessing life's history to benefitscience and society. Edited by: Cracraft J, Donoghue M, Dragoo J, Hillis D, Yates T. 2002, Arlington: National Science Foundation, (accessed 9 May 2013 fromhttp://ucjeps.berkeley.edu/tol.pdf),

    Google Scholar 

  2. Felsenstein J: Inferring Phylogenies. 2004, Sunderland, Mass: Sinauer

    Google Scholar 

  3. Kumar S, Dudley J: Bioinformatics software for biologists in the genomics era. Bioinformatics (Oxford, England). 2007, 23 (14): 1713-1717. 10.1093/bioinformatics/btm239.

    Article  CAS  Google Scholar 

  4. Larsen PO, von Ins M: The rate of growth in scientific publication and the decline in coverageprovided by Science Citation Index. Scientometrics. 2010, 84 (3): 575-603. 10.1007/s11192-010-0202-z.

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  5. Stoltzfus A, O'Meara B, Whitacre J, Mounce R, Gillespie EL, Kumar S, Rosauer DF, Vos RA: Sharing and Re-use of Phylogenetic Trees (and associated data) to FacilitateSynthesis. BMC Res Notes. 2012, 5: 574-10.1186/1756-0500-5-574.

    Article  PubMed Central  PubMed  Google Scholar 

  6. Sanderson MJ, Donoghue MJ, Piel WH, Eriksson T: TreeBASE: a prototype database of phylogenetic analyses and an interactivetool for browsing the phylogeny of life. Am J Bot. 1994, 81 (6): 183-

    Google Scholar 

  7. Piel W, Chan L, Dominus M, Ruan J, Vos R, Tannen V: TreeBASE v. 2: A Database of Phylogenetic Knowledge. 2009, London: e-BioSphere

    Google Scholar 

  8. Webb CO, Donoghue MJ: Phylomatic: tree assembly for applied phylogenetics. Mol Ecol Notes. 2005, 5: 181-183. 10.1111/j.1471-8286.2004.00829.x.

    Article  Google Scholar 

  9. The Angiosperm Phylogeny G: An update of the Angiosperm Phylogeny Group classification for the orders andfamilies of flowering plants: APG III. Bot J Linn Soc. 2009, 16 ((2): 105-121.

    Article  Google Scholar 

  10. Web of Knowledge.  . http://www.webofknowledge.com,

  11. Bininda-Emonds OR, Cardillo M, Jones KE, MacPhee RD, Beck RM, Grenyer R, Price SA, Vos RA, Gittleman JL, Purvis A: The delayed rise of present-day mammals. Nature. 2007, 446 (7135): 507-512. 10.1038/nature05634.

    Article  CAS  PubMed  Google Scholar 

  12. Smith SA, Beaulieu JM, Stamatakis A, Donoghue MJ: Understanding angiosperm diversification using small and large phylogenetictrees. Am J Bot. 2011, 98 (3): 404-414. 10.3732/ajb.1000481.

    Article  PubMed  Google Scholar 

  13. Goloboff PA, Catalano SA, Marcos Mirande J, Szumik CA, Salvador Arias J, Källersjö M, Farris JS: Phylogenetic analysis of 73 060 taxa corroborates major eukaryotic groups. Cladistics. 2009, 25 (3): 211-230. 10.1111/j.1096-0031.2009.00255.x.

    Article  Google Scholar 

  14. McDonald D, Price MN, Goodrich J, Nawrocki EP, DeSantis TZ, Probst A, Andersen GL, Knight R, Hugenholtz P: An improved Greengenes taxonomy with explicit ranks for ecological andevolutionary analyses of bacteria and archaea. ISME J. 2012, 6 (3): 610-618. 10.1038/ismej.2011.139.

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  15. Sayers EW, Barrett T, Benson DA, Bolton E, Bryant SH, Canese K, Chetvernin V, Church DM, DiCuccio M, Federhen S: Database resources of the National Center for Biotechnology Information. Nucleic Acids Res. 2011, 39 (Database issue): D38-D51.

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  16. Maddison D, Schulz K-S, Maddison W: The Tree of Life Web Project. Zootaxa. 2007, 1668: 19-40.

    Google Scholar 

  17. Cannone J, Subramanian S, Schnare M, Collett J, D'Souza L, Du Y, Feng B, Lin N, Madabusi L, Muller K: The Comparative RNA Web (CRW) Site: an online database of comparativesequence and structure information for ribosomal, intron, and other RNAs. BMC Bioinformatics. 2002, 3 (1): 2-10.1186/1471-2105-3-2.

    Article  PubMed Central  PubMed  Google Scholar 

  18. Heymans M, Singh A: Deriving phylogenetic trees from the similarity analysis of metabolicpathways. Bioinformatics (Oxford, England). 2003, 19 (suppl 1): i138-i146. 10.1093/bioinformatics/btg1018.

    Article  Google Scholar 

  19. Kummerfeld S, Teichmann S: Relative rates of gene fusion and fission in multi-domain proteins. Trends Genet. 2005, 21 (1): 25-30. 10.1016/j.tig.2004.11.007.

    Article  CAS  PubMed  Google Scholar 

  20. Ruan J, Li H, Chen Z, Coghlan A, Coin LJ, Guo Y, Heriche JK, Hu Y, Kristiansen K, Li R: TreeFam: 2008 Update. Nucleic Acids Res. 2008, 36 (Database issue): D735-D740.

    PubMed Central  CAS  PubMed  Google Scholar 

  21. Vilella A, Severin J, Ureta-Vidal A, Heng L, Durbin R, Birney E: EnsemblCompara GeneTrees: Complete, duplication-aware phylogenetic trees invertebrates. Genome Res. 2009, 19 (2): 327-335.

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  22. Patterson DJ, Cooper J, Kirk PM, Pyle RL, Remsen DP: Names are key to the big new biology. Trends Ecol Evol (Personal edition). 2010, 25 (12): 686-691.

    Article  CAS  Google Scholar 

  23. Parr CS, Guralnick R, Cellinese N, Page RD: Evolutionary informatics: unifying knowledge about the diversity of life. Trends Ecol Evol. 2012, 27 (2): 94-103. 10.1016/j.tree.2011.11.001.

    Article  PubMed  Google Scholar 

  24. Page RD: Biodiversity informatics: the challenge of linking data and the role ofshared identifiers. Brief Bioinform. 2008, 9 (5): 345-354. 10.1093/bib/bbn022.

    Article  PubMed  Google Scholar 

  25. Vos RA, Caravas J, Hartmann K, Jensen MA, Miller C: BIO:Phylo-phyloinformatic analysis using perl. BMC Bioinformatics. 2011, 12: 63-10.1186/1471-2105-12-63.

    Article  PubMed Central  PubMed  Google Scholar 

  26. Sukumaran J, Holder MT: DendroPy: a Python library for phylogenetic computing. Bioinformatics (Oxford, England). 2010, 26 (12): 1569-1571. 10.1093/bioinformatics/btq228.

    Article  CAS  Google Scholar 

  27. Felsenstein J: Phylogenies and the comparative method. Amer Natural. 1985, 125: 1-15. 10.1086/284325.

    Article  Google Scholar 

  28. Pagel M: The Maximum Likelihood Approach to Reconstructing Ancestral Character Statesof Discrete Characters on Phylogenies. Syst Biol. 1999, 48 (3): 612-622. 10.1080/106351599260184.

    Article  Google Scholar 

  29. Pagel M: Detecting correlated evolution on phylogenies: a general method for thecomparative analysis of discrete characters. Proc R Soc B. 1994, 255: 37-45. 10.1098/rspb.1994.0006.

    Article  Google Scholar 

  30. Stewart CA, Almes GT, Wheeler BC: Cyberinfrastructure Software Sustainability and Reusability: Report from anNSF-funded workshop. 2010, Bloomington, IN: Indiana University,

    Google Scholar 

  31. Prlić A, Procter JB: Ten Simple Rules for the Open Development of Scientific Software. PLoS Comput Biol. 2012, 8 (12): e1002802-10.1371/journal.pcbi.1002802.

    Article  PubMed Central  PubMed  Google Scholar 

  32. Vandervalk BP, McCarthy EL, Wilkinson MD: Moby and Moby 2: creatures of the deep (web). Brief Bioinform. 2009, 10 (2): 114-128. 10.1093/bib/bbn051.

    Article  CAS  PubMed  Google Scholar 

  33. Goecks J, Nekrutenko A, Taylor J: Galaxy: a comprehensive approach for supporting accessible, reproducible, andtransparent computational research in the life sciences. Genome Biol. 2010, 11 (8): R86-10.1186/gb-2010-11-8-r86.

    Article  PubMed Central  PubMed  Google Scholar 

  34. Hughes-Croucher T, Wilson M: Up and Running with Node.js. Up and Running. 2012, Sebastopol: O'Reilly, 204-1

    Google Scholar 

  35. Page RD: A Taxonomic Search Engine: federating taxonomic databases using webservices. BMC Bioinformatics. 2005, 6: 48-10.1186/1471-2105-6-48.

    Article  PubMed Central  PubMed  Google Scholar 

  36. Boyle B, Hopkins N, Lu Z, Garay JAR, Mozzherin D, Rees T, Matasci N, Narro ML, Piel WH, Mckay SJ: The taxonomic name resolution service: an online tool for automatedstandardization of plant names. BMC Bioinformatics. 2013, 14: 16-10.1186/1471-2105-14-16.

    Article  PubMed Central  PubMed  Google Scholar 

  37. Mammal Species of the World. A Taxonomic and Geographic Reference. Edited by: Wilson DE, Reeder DM. 2005, Baltimore: Johns Hopkins University Press, 3

    Google Scholar 

  38. Leebens-Mack J, Vision T, Brenner E, Bowers J, Cannon S, Clement M, Cunningham C, Depamphilis C, DeSalle R, Doyle J: Taking the first steps towards a standard for reporting on phylogenies:Minimum information about a phylogenetic analysis (MIAPA). Omics-A: J Integr Biol. 2006, 10 (2): 231-237. 10.1089/omi.2006.10.231.

    Article  CAS  Google Scholar 

  39. Berners-Lee T, Hendler J: Publishing on the semantic web. Nature. 2001, 410 (6832): 1023-1024.

    Article  CAS  PubMed  Google Scholar 

  40. Klyne G, Carroll JJ: Resource Description Framework (RDF): Concepts and Abstract Syntax. World Wide Web Consortium. 2004.

    Google Scholar 

  41. Virtuoso Universal Server.  . http://virtuoso.openlinksw.com,

  42. World Wide Web Consortium: Large Triple Stores.  . 2011

    Google Scholar 

  43. Prosdocimi F, Chisham B, Pontelli E, Thompson JD, Stoltzfus A: Initial Implementation of a Comparative Data Analysis Ontology. Evol Bioinformatics. 2009, 5: 47-66.

    CAS  Google Scholar 

  44. Smith B, Ashburner M, Rosse C, Bard J, Bug W, Ceusters W, Goldberg LJ, Eilbeck K, Ireland A, Mungall CJ: The OBO Foundry: coordinated evolution of ontologies to support biomedicaldata integration. Nat Biotechnol. 2007, 25 (11): 1251-1255. 10.1038/nbt1346.

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  45. Lewis PO: NCL: a C++ class library for interpreting data files in NEXUS format. Bioinformatics (Oxford, England). 2003, 19 (17): 2330-2331. 10.1093/bioinformatics/btg319.

    Article  CAS  Google Scholar 

  46. Maddison DR, Swofford DL, Maddison WP: NEXUS: An Extensible File Format for Systematic Information. Syst Biol. 1997, 46 (4): 590-621. 10.1093/sysbio/46.4.590.

    Article  CAS  PubMed  Google Scholar 

  47. Dean J, Ghemawat S: MapReduce: Simplified Data Processing on Large Clusters. Sixth Symposium on Operating System Design and Implementation. 2004, San Francisco, CA: ACM, 107-113.

    Google Scholar 

  48. Foundation FS: GNU awk. 2008.

    Google Scholar 

  49. Hedges SB, Dudley J, Kumar S: TimeTree: a public knowledge-base of divergence times among organisms. Bioinformatics (Oxford, England). 2006, 22 (23): 2971-2972. 10.1093/bioinformatics/btl505.

    Article  CAS  Google Scholar 

  50. Urbanek S: FastRWeb: Fast Interactive Web Framework for Data Mining Using R. ISAC 2008 World Congress. 2008.

    Google Scholar 

  51. Urbanek S: Rserve - A Fast Way to Provide R Functionality to Applications. Proceedings of the 3rd International Workshop on Distributed StatisticalComputing (DSC 2003). Edited by: Hornik K, Leisch F, Zeileis A. 2003, : ,

    Google Scholar 

  52. Popescu AA, Huber KT, Paradis E: Ape 3.0: New tools for distance-based phylogenetics and evolutionary analysisin R. Bioinformatics (Oxford, England). 2012, 28 (11): 1536-1537. 10.1093/bioinformatics/bts184.

    Article  CAS  Google Scholar 

  53. Mesquite: a modular system for evolutionary analysis. Version 2.73.  . http://mesquiteproject.org,

  54. Doyon JP, Ranwez V, Daubin V, Berry V: Models, algorithms and programs for phylogeny reconciliation. Brief Bioinform. 2011, 12 (5): 392-400. 10.1093/bib/bbr045.

    Article  PubMed  Google Scholar 

  55. Zmasek CM, Eddy SR: A simple algorithm to infer gene duplication and speciation events on a genetree. Bioinformatics (Oxford, England). 2001, 17 (9): 821-828. 10.1093/bioinformatics/17.9.821.

    Article  CAS  Google Scholar 

  56. Han M, Zmasek C: PhyloXML: XML for evolutionary biology and comparative genomics. BMC Bioinformatics. 2009, 10: 356-10.1186/1471-2105-10-356.

    Article  PubMed Central  PubMed  Google Scholar 

  57. Voss RS, Jansa SA: Phylogenetic relationships and classification of didelphid marsupials, anextant radiation of New World metatherian mammals. Bull Am Mus Nat Hist. 2009, 322: 1-177.

    Article  Google Scholar 

  58. O'Leary MA, Kaufman S: MorphoBank: phylophenomics in the”cloud”. Cladistics. 2011, 27: 529-537. 10.1111/j.1096-0031.2011.00355.x.

    Article  Google Scholar 

  59. Riek A: Allometry of milk intake at peak lactation. Mamm Biol. 2011, 76 (1): 3-11.

    Google Scholar 

  60. Miller MA, Pfeiffer W, Schwartz T: Creating the CIPRES Science Gateway for inference of large phylogenetictrees. Gateway Computing Environments Workshop (GCE). 2010, La Jolla, CA, USA: San Diego Supercomput. Center, 1-8.

    Chapter  Google Scholar 

  61. Sanderson M, Boss D, Chen D, Cranston K, Wehe A: The PhyLoTA Browser: processing GenBank for molecular phylogeneticsresearch. Syst Biol. 2008, 57 (3): 335-346. 10.1080/10635150802158688.

    Article  PubMed  Google Scholar 

  62. Farris SM, Roberts NS: Coevolution of generalist feeding ecologies and gyrencephalic mushroom bodiesin insects. Proc Natl Acad Sci U S A. 2005, 102 (48): 17394-17399. 10.1073/pnas.0508430102.

    Article  PubMed Central  CAS  PubMed  Google Scholar 

  63. Martinson H, Schneider K, Gilbert J, Hines J, Hambäck P, Fagan W: Detritivory: stoichiometry of a neglected trophic level. Ecol Res. 2008, 23 (3): 487-491. 10.1007/s11284-008-0471-7.

    Article  Google Scholar 

  64. Shenoy BD, Jeewon R, Hyde KD: Impact of DNA sequence-data on the taxonomy of anamorphic fungi. Fungal Divers. 2007, 26: 1-54.

    Google Scholar 

  65. Smolenaars MM, Madsen O, Rodenburg KW, Van der Horst DJ: Molecular diversity and evolution of the large lipid transfer proteinsuperfamily. J Lipid Res. 2007, 48 (3): 489-502.

    Article  CAS  PubMed  Google Scholar 

  66. Stelkens R, Seehausen O: Genetic distance between species predicts novel trait expression in theirhybrids. Evolution. 2009, 63 (4): 884-897. 10.1111/j.1558-5646.2008.00599.x.

    Article  PubMed  Google Scholar 

  67. Whitney KD, Garland T: Did genetic drift drive increases in genome complexity?. PLoS Genet. 2010, 6 (8):  -10.1371/journal.pgen.1001080.

  68. Matsen FA, Kodner RB, Armbrust EV: Pplacer: linear time maximum-likelihood and Bayesian phylogenetic placementof sequences onto a fixed reference tree. BMC Bioinformatics. 2010, 11: 538-10.1186/1471-2105-11-538.

    Article  PubMed Central  PubMed  Google Scholar 

Download references

Acknowledgements

We thank Mark Holder (TreeStore), Ben Vandervalk (Architecture), Chris Baron (Shiny) and Jon Eastman (DateLife) for their contributions to the hackathon, and we thank Mark Wilkinson and Sergei Pond for participating in the first Leadership Team meeting. We thank Danielle Wilson, David Palmer, and Mattison Ward for administrative and IT support. Supported by NESCent (the National Evolutionary Synthesis Center, NSF #EF-0905606), the iPlant Collaborative (NSF #DBI-0735191), and the Biodiversity Synthesis Center (BioSync) of the Encyclopedia of Life. Additional funding for travel expenses was provided to RV by the Naturalis Research Incentive budget. The identification of any specific commercial products is for the purpose of specifying a protocol, and does not imply a recommendation or endorsement by the National Institute of Standards and Technology.

Author information

Corresponding author

Correspondence to Arlin Stoltzfus.

Additional information

Competing interests

The authors declare that they have no competing financial interests. Some authors are key participants in projects mentioned in the text, including NeXML (RV, MH, JS, PM, JB, A Stoltzfus), CDAO (JB, EP, A Stoltzfus, HL), TreeBASE (RV), Mesquite (PM), MIAPA (KC, HL, A Stoltzfus, JB, RV, EP) and the iPlant TNRS (NM).

Authors’ contributions

Authorship on this work was open to all who participated in the hackathon or its leadership team. The order of authors reflects post-hackathon contributions to the manuscript and to hackathon products. The HIP Leadership Team (AS, BS, EP, HL, KC, MR, MW, RV, SP and MW) conceived and planned the hackathon, with administrative coordination by AS, assisted by RV and EP. All authors except BS, MR, and MW participated in the hackathon, with BO, LH, JB, MWP, and MA doing so remotely. With the exception that PM wrote the Mesquite code for Mesquite-o-Tastic, and GJ extensively revised Reconciliotastic for purposes of this publication, the hackathon teams are responsible for products attributed to them above: DateLife (BO, TH, PM, LH, MWP); TNRS (NM, GV, SM); Shiny (CZ, HB, MP, AS); Architecture (HD, RV, COW, JS), including Pruners (RV, A Steele, COW); TreeStore (HL, KC, JPB, EP). Most of the authors contributed to initial drafts of Implementations; initial drafts of the Background and Discussion were written by AS with help from BS; revisions were done primarily by AS, with help from HL, HD, BS and EP. AS coordinated work on the manuscript. All authors read and approved the final manuscript.

Rights and permissions

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Stoltzfus, A., Lapp, H., Matasci, N. et al. Phylotastic! Making tree-of-life knowledge accessible, reusable and convenient. BMC Bioinformatics 14, 158 (2013). https://doi.org/10.1186/1471-2105-14-158

  • DOI: https://doi.org/10.1186/1471-2105-14-158

Keywords