Tree Pruner is an efficient, visual editing capability for obtaining a dataset of genetic sequences with desired properties, such as evolutionary representation or shared genotype. Importantly, it provides the user curatorial control over the final selection of sequence data. While it is currently used with the large, biased influenza sequence database, it can be implemented for other viral genetic databases, such as those for HIV and HCV.
Overview of Tree Pruner
The two editing functions of Tree Pruner, Keep and Remove/Restore, act in complementary modes to edit a phylogenetic tree (and consequently the underlying dataset). Keep is particularly suited to selecting a small subset of a dataset. Remove/Restore is particularly suited to fine-tuning a dataset by removing just a few sequences. During a Tree Pruner session, editing actions are represented on the tree by changes in the color of branches and labels. At the end of a Tree Pruner session, the editing actions are committed to the original dataset, resulting in the removal of sequences corresponding to deselected tips in the tree.
Editing actions are selected by custom additions to the Archaeopteryx drop-down menu for actions on nodes. Custom buttons on the Archaeopteryx control panel, such as Discard All or Commit Changes, translate editing actions into changes to the dataset.
Example of editing using Tree Pruner
The following illustrations of Tree Pruner are based on an initial dataset of all sequences of the hemagglutinin (HA) gene from influenza A (H5N1) viruses that were collected in the period 1900-2000. A minimum length of 900 nucleotides (out of a maximum of 1790) was required for inclusion in this initial dataset. A search of the Influenza Research Database (IRD) [4, 5], conducted on March 7, 2010, yielded 96 sequences, which were stored in a working set called DemoTreePruner on the IRD server. Instructions for accessing this dataset to test Tree Pruner are given in Figure 1. (Alternatively, a static version of Tree Pruner with a pre-loaded tree of 953 NP sequences from seasonal influenza A (H1N1) viruses, together with an equine NP sequence as an outgroup, is available at the Influenza Sequence Database, http://www.flu.lanl.gov.)
Tree Pruner is launched from the working set, and automatically infers a phylogeny of the sequences in the set. (The IRD infers a maximum likelihood tree under the HKY model of evolution, using PhyML [10, 11].) Tree Pruner then opens an Archaeopteryx applet, labeled with the name of the working set (DemoTreePruner) and the name of the gene (Segment 4), and displays this tree.
The user can carry out multiple Keep and/or Remove editing sessions. Before switching from one edit action to the other, the user must save or discard all edits of the current type. This ensures that the keep/remove status of all "untouched" taxa is defined. The user can end a Tree Pruner session by committing changes. Then taxa marked for removal in the tree will be removed from the dataset, thus maintaining a 1:1 relationship between the dataset and tree display.
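The rule that edits of one type must be saved or discarded before switching to the other can be sketched as a small state holder. This is a minimal illustration in Python, not Tree Pruner's actual code; the class name PruningSession and its methods are hypothetical.

```python
class PruningSession:
    """Hypothetical model of one Tree Pruner editing session."""

    MODES = ("keep", "remove")

    def __init__(self):
        self.mode = None      # edit type of the pending edits, if any
        self.pending = []     # edits not yet saved or discarded
        self.saved = []       # edits that have been saved

    def edit(self, mode, node):
        """Record an edit; refuse to switch mode while edits are pending."""
        if mode not in self.MODES:
            raise ValueError(f"unknown edit mode: {mode}")
        if self.pending and mode != self.mode:
            raise RuntimeError("save or discard pending edits before switching")
        self.mode = mode
        self.pending.append((mode, node))

    def save(self):
        """Make all pending edits permanent for this session."""
        self.saved.extend(self.pending)
        self.pending.clear()

    def discard(self):
        """Drop all pending edits of the current type."""
        self.pending.clear()
```

With this guard in place, the status of every untouched taxon is well defined at the moment the user changes from Keep to Remove or back.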
(i) Keep function
When a user clicks on a tip node, or on the root node of a sub-tree, the label of the selected tip (or the labels of all tips in the selected sub-tree) is written in black, designating inclusion in the final dataset. All non-selected labels are written in blue. Blue tips are "active," meaning they can be selected by future actions, but are not currently designated for inclusion in the final dataset. Branches are also colored: black branches lead to sequences selected for inclusion in the dataset; blue branches lead to sequences whose status is undetermined. (See Figure 2.)
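The Keep coloring rule can be sketched as follows. The tree representation (a dictionary mapping each internal node to its children) and the function names tips_under and keep_colors are assumptions made for illustration only; they do not reflect the Archaeopteryx data structures.

```python
def tips_under(tree, node):
    """Return the tip labels below `node`; a node with no children is a tip."""
    children = tree.get(node, [])
    if not children:
        return [node]
    tips = []
    for child in children:
        tips.extend(tips_under(tree, child))
    return tips


def keep_colors(tree, root, clicked_nodes):
    """Color every tip blue, then turn the tips under each clicked node black."""
    colors = {tip: "blue" for tip in tips_under(tree, root)}
    for node in clicked_nodes:
        for tip in tips_under(tree, node):
            colors[tip] = "black"
    return colors


# Hypothetical four-tip tree: ((A, B), (C, D))
tree = {"root": ["n1", "n2"], "n1": ["A", "B"], "n2": ["C", "D"]}
colors = keep_colors(tree, "root", ["n1", "C"])
```

In this example the click on the internal node n1 selects tips A and B, the click on tip C selects C alone, and D stays blue.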
(ii) Remove/Restore function
When a user clicks on a tip node, or on the root node of a sub-tree, the label of the selected tip (or the labels of all tips in the selected sub-tree) switches between black and grey. All non-selected labels remain black. Black tips are "active" and can be removed by future actions. Clicking on a grey node is a Restore action, which changes the tip label(s) and relevant branches from grey back to black. (See Figure 3.)
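A minimal sketch of the Remove/Restore toggle is given below. The text does not state what happens when a clicked sub-tree contains both black and grey tips; the sketch assumes that such a click is treated as Remove, and that Restore applies only when every tip below the clicked node is already grey. The function names and tree representation are hypothetical.

```python
def tips_under(tree, node):
    """Return the tip labels below `node`; a node with no children is a tip."""
    children = tree.get(node, [])
    if not children:
        return [node]
    tips = []
    for child in children:
        tips.extend(tips_under(tree, child))
    return tips


def toggle(tree, colors, node):
    """Apply one click in Remove/Restore mode and return the updated colors.

    Assumption: Restore (grey -> black) only if all tips under `node` are
    grey; otherwise the click is a Remove (-> grey).
    """
    tips = tips_under(tree, node)
    all_grey = all(colors[tip] == "grey" for tip in tips)
    new_color = "black" if all_grey else "grey"
    updated = dict(colors)
    for tip in tips:
        updated[tip] = new_color
    return updated
```

Returning a new dictionary rather than modifying the input keeps the previous state available, which is convenient if pending edits are later discarded.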
(iii) Handling large trees
A key scientific requirement of Tree Pruner was to enable the user to view taxon labels, even in large trees. Archaeopteryx offers three features for viewing large trees. "Dynamic hiding" displays only a subset of taxon labels in order to squeeze a large tree into the window. Zooming allows reading of all taxon labels, but loses the overall tree structure. Viewing a sub-tree allows reading of all taxon labels in that sub-tree, but loses the context of editing because the sub-tree replaces the full tree in the window. Archaeopteryx was therefore modified so that Tree Pruner displays a sub-tree in a separate window, side-by-side with the display of the complete tree. Multiple levels of sub-trees are permitted. Pruning actions are mapped among windows by the Refresh action. Thus, a sub-tree can be pruned while viewing its context.
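One way to picture the Refresh action is as a merge of tip states between windows. The sketch below assumes that each window holds a dictionary from tip label to color and that Refresh copies the states of a sub-tree window into the full-tree window; both the representation and the function name refresh are hypothetical, and the real mapping may run in both directions.

```python
def refresh(full_colors, subtree_colors):
    """Return full-tree colors updated with the states set in a sub-tree window."""
    unknown = set(subtree_colors) - set(full_colors)
    if unknown:
        # A sub-tree can only contain tips that exist in the full tree.
        raise KeyError(f"tips not in full tree: {sorted(unknown)}")
    updated = dict(full_colors)
    updated.update(subtree_colors)
    return updated
```

Because a sub-tree window can itself spawn a further sub-tree window, the same merge can be applied level by level up to the complete tree.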
(iv) Concluding a Tree Pruner session
The user may use Commit to exit from Tree Pruner; all taxa tagged for removal (i.e., drawn in grey in the phylogeny) are then removed from the working set. The tree used to edit the dataset is removed from the server. If further editing is required, Tree Pruner infers a new tree from the revised dataset, thereby retaining a 1:1 correspondence between the dataset and the tree display.
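The effect of Commit on the working set can be sketched as a simple filter. The function name commit and the representation of the working set as a list of sequence identifiers are assumptions for illustration; the check at the top expresses the 1:1 correspondence between dataset and tree tips.

```python
def commit(working_set, colors):
    """Return the working set with all grey-tagged sequences removed."""
    if set(working_set) != set(colors):
        # Dataset and tree must describe exactly the same sequences.
        raise ValueError("working set and tree tips are out of sync")
    return [seq_id for seq_id in working_set if colors[seq_id] != "grey"]
```

The revised list would then be the input for inferring a fresh tree if the user chooses to continue editing.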
(v) Crash recovery
AutoSave runs every 10 minutes following the most recent user-initiated Save. If the server or client crashes during a pruning session, Tree Pruner can resume from the most recent (Auto)Save.
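The AutoSave timing can be sketched with an explicit clock. The sketch assumes that a user Save restarts the 10-minute countdown and that each AutoSave does the same; the class name AutoSaveClock and the use of plain timestamps in seconds are hypothetical choices that keep the example independent of any real timer.

```python
class AutoSaveClock:
    """Hypothetical tracker that decides when an AutoSave is due."""

    def __init__(self, start, interval=600.0):
        self.interval = interval        # 10 minutes, in seconds
        self.last_checkpoint = start    # time of the latest Save or AutoSave

    def user_save(self, now):
        """A user-initiated Save restarts the countdown."""
        self.last_checkpoint = now

    def tick(self, now):
        """Return True, and record an AutoSave, if the interval has elapsed."""
        if now - self.last_checkpoint >= self.interval:
            self.last_checkpoint = now
            return True
        return False
```

After a crash, a session would resume from the state written at the latest checkpoint, so at most one interval of editing is lost under these assumptions.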