TreeRipper web application: towards a fully automated optical tree recognition software
© Hughes; licensee BioMed Central Ltd. 2011
Received: 3 February 2011
Accepted: 20 May 2011
Published: 20 May 2011
Relationships between species, genes and genomes have been printed as trees for over a century. Whilst this may have been the best format for exchanging and sharing phylogenetic hypotheses during the 20th century, the worldwide web now provides faster and automated ways of transferring and sharing phylogenetic knowledge. However, novel software is needed to defrost these published phylogenies for the 21st century.
TreeRipper is a simple website for the fully automated recognition of multifurcating phylogenetic trees (http://linnaeus.zoology.gla.ac.uk/~jhughes/treeripper/). The program accepts a range of input image formats (PNG, JPG/JPEG or GIF). The underlying command-line C++ program follows a number of cleaning steps to detect lines, remove node labels, patch up broken lines and corners, and detect line edges. The edge contour is then determined to detect the branch lengths, tip label positions and the topology of the tree. Optical Character Recognition (OCR) is used to convert the tip labels into text with the freely available tesseract-ocr software. Of the images meeting the prerequisites for TreeRipper, 32% were successfully recognised; the largest tree had 115 leaves.
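As an illustration of what patching up a broken line can involve, the following minimal Python sketch bridges short gaps in a single row of a binarised image. It is not taken from the TreeRipper source: the function name `patch_gaps`, the `max_gap` parameter and the 0/1 pixel encoding are assumptions made for this example only.

```python
def patch_gaps(row, max_gap=2):
    """Fill short background runs that sit between two foreground pixels.

    `row` is one image row as a list of 0 (background) and 1 (foreground).
    Gaps longer than `max_gap` pixels are left untouched. Illustrative only;
    this is not the TreeRipper implementation.
    """
    patched = list(row)
    last_foreground = None  # index of the most recent foreground pixel
    for i, pixel in enumerate(row):
        if pixel:
            if last_foreground is not None:
                gap = i - last_foreground - 1
                if 0 < gap <= max_gap:
                    # Bridge the gap so the line becomes continuous.
                    for j in range(last_foreground + 1, i):
                        patched[j] = 1
            last_foreground = i
    return patched


broken_line = [1, 1, 0, 0, 1, 0, 0, 0, 1, 0]
repaired_line = patch_gaps(broken_line, max_gap=2)
```

In this sketch the two-pixel gap is closed while the three-pixel gap is kept, which reflects the usual trade-off: a larger `max_gap` repairs more breaks but risks joining lines that are genuinely separate.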
Although the diversity of ways in which phylogenies have been illustrated makes the design of fully automated tree recognition software difficult, TreeRipper is a step towards automating the digitization of past phylogenies. We also provide a dataset of 100 tree images and associated tree files for training and/or benchmarking future software. TreeRipper is an open source project licensed under the GNU General Public Licence v3.
The reconstruction of the relationships between the estimated 1.8 million species currently depends largely on the unprecedented growth of molecular sequence data, and this makes GenBank the most accessible source of comparative data for most taxa in the tree of life. Whilst more sequence data, more powerful computers and improved phylogenetic reconstruction algorithms will enable researchers to generate up-to-date phylogenies from the raw data in the future, past phylogenetic inferences will remain central to guiding researchers towards studying poorly supported relationships and under-sampled lineages. They are also central for studying the effects of new phylogenetic methodologies and of new and larger datasets.
Not all phylogenetically informative data are confined to sequence databases. TreeBASE is a very valuable repository as it holds morphological or genetic data with the associated published phylogeny. However, as few publishers require submission to TreeBASE as a pre-requisite for publication, a large number of phylogenies remain embedded as images in published articles. Indeed, the rapid growth of published phylogenies is not matched by the availability of those trees in databases (see Figure 1 in ).
The idea of using a program to convert a tree image into a computer-readable representation of that tree was first implemented in TreeThief, which required the user to trace a tree by clicking on each of its nodes in turn. That program is only available for the discontinued operating system Mac OS 9. Laubach and von Haeseler provided a conceptual advance with a semi-automatic program called TreeSnatcher, which has recently been updated. TreeSnatcher uses image-processing methods to prepare a tree image and detect the tree structure; it works on rectangular and freeform trees (e.g., radial and star). The user supervises the tree recognition process by making corrections to the image. For example, the user can modify the image in order to make the foreground dark and the background light, fill gaps in lines and identify the foreground. The program then determines inner node and tip locations. The user can add or remove further nodes and delete or add branches. The user is then required to assign species names to the tips before the program can build the Newick tree code.
Here, we will review the way researchers present their phylogenies, demonstrate the feasibility of a fully automated tree recognition software and provide a dataset of tree images and associated tree files for training and/or benchmarking future programs.
The current version of TreeRipper opens tree-image files in the formats PNG, JPG/JPEG, or GIF.
The tree needs to have the root on the left and leaves on the right.
The tree constitutes a dark foreground on a light homogeneous background (no background boxes or shading).
The tree must be bi- or multifurcating (not a network).
The inner nodes are branching points between lines and have no circles, rectangles, etc. inscribed.
Tip branches must have branch lengths greater than 0.
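One of these prerequisites, a dark tree on a light background, lends itself to a quick automatic pre-check. The sketch below is a hypothetical example and not part of TreeRipper: it assumes the image has already been loaded as rows of greyscale values (0 = black, 255 = white), and the function name `dark_on_light`, the `threshold` of 128 and the `min_background` fraction are illustrative choices.

```python
def dark_on_light(pixels, threshold=128, min_background=0.5):
    """Return True if light pixels make up more than `min_background` of the image.

    `pixels` is a list of rows of greyscale values in the range 0-255.
    A line drawing of a tree on a light background should consist mostly
    of light pixels; an inverted image should not. Illustrative only.
    """
    values = [value for row in pixels for value in row]
    if not values:
        return False  # an empty image cannot satisfy the prerequisite
    light = sum(1 for value in values if value >= threshold)
    return light / len(values) > min_background


# A tiny 3 x 4 "image": dark strokes (0) on a white (255) background.
image = [
    [255, 255, 255, 255],
    [255,   0,   0, 255],
    [255,   0, 255, 255],
]
inverted = [[255 - value for value in row] for row in image]

image_ok = dark_on_light(image)
inverted_ok = dark_on_light(inverted)
```

A check of this kind only covers the foreground and background contrast; the orientation, branching and node-symbol prerequisites would need separate tests.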
Results and Discussion
The successfully recognised tree images, along with a further 63 images manually converted to tree files, are provided as supplementary material in NEXUS, Newick and phyloXML formats (Additional file 1) for training and/or benchmarking future programs.
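To show how topology, tip labels and branch lengths combine in one of these formats, the following Python sketch serialises a small multifurcating tree as a Newick string. The nested-tuple representation and the function name `to_newick` are assumptions made for this example; they do not describe TreeRipper's internal data structures, and labels containing spaces or punctuation would need quoting that is not handled here.

```python
def to_newick(node):
    """Serialise a node given as (label, branch_length, children) to Newick text.

    `children` is a list of nodes in the same form; tips have an empty list.
    A branch length of None is simply omitted. Illustrative only.
    """
    label, length, children = node
    text = ""
    if children:
        # Inner node: its descendants go inside parentheses, comma-separated.
        text = "(" + ",".join(to_newick(child) for child in children) + ")"
    text += label
    if length is not None:
        text += ":" + format(length, "g")
    return text


# Root with one tip (A) and one multifurcating inner node (B, C, D).
tree = ("", None, [
    ("A", 1.0, []),
    ("", 0.5, [("B", 2.0, []), ("C", 2.5, []), ("D", 1.5, [])]),
])
newick = to_newick(tree) + ";"  # a Newick string ends with a semicolon
print(newick)  # (A:1,(B:2,C:2.5,D:1.5):0.5);
```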
Although the program has a high failure rate, it is the first step towards an automated approach for optical tree recognition and proves the feasibility of an approach, which will allow us to defrost published phylogenetic hypotheses. We are unlikely to ever be able to create an application that recognises all possible trees due to the sheer diversity of ways phylogenies have been illustrated but at the very least, this program could be used for automating tree recognition of large sets of tree images before using manual conversion or semi-automated programs like TreeSnatcher for the trees that were not converted.
As phylogenetics enters a third phase of growth with the advent of next-generation sequencing, one hopes that the work of future phylogeneticists will be published in a format that enables the digital curation and preservation of their hard work.
Availability and requirements
Project name: TreeRipper (automated phylogeny recognition from images)
Project home page: https://code.google.com/p/treeripper/
Programming language: C++ and PHP web interface
License: GNU GPL v3
Tesseract-OCR, licensed under the Apache 2.0 License except for tesseractTrainer.py, which is licensed under the GPL: http://code.google.com/p/tesseract-ocr
ImageMagick, whose license is compatible with the GPL: http://www.imagemagick.org/
I am very grateful to Rod Page for giving me the freedom to tackle this project, and for the useful comments, suggestions and debugging help during the project. JH is funded by the Biotechnology and Biological Sciences Research Council (BBSRC) (Grant No. BBF0157201).
- Darwin CR: On the origin of species by means of natural selection, or the preservation of favoured races in the struggle for life. 1st edition. London: John Murray; 1859.
- Smith SA, Beaulieu JM, Donoghue MJ: Mega-phylogeny approach for comparative biology: an alternative to supertree and supermatrix approaches. BMC Evolutionary Biology 2009, 9: 37. 10.1186/1471-2148-9-37
- McMahon MM, Sanderson MJ: Phylogenetic supermatrix analysis of GenBank sequences from 2228 papilionoid legumes. Systematic Biology 2006, 55: 818–836. 10.1080/10635150600999150
- TreeBASE: a database of phylogenetic knowledge [http://www.treebase.org/]
- Page RDM: Towards a Taxonomically Intelligent Phylogenetic. Nature Precedings 2007, 1–5.
- TreeThief: a tool for manual phylogenetic tree entry [http://microbe.bio.indiana.edu:7131/soft/iubionew/molbio/evolution/phylo/TreeThief/main.html]
- Laubach T, von Haeseler A: TreeSnatcher: coding trees from images. Bioinformatics (Oxford, England) 2007, 23: 3384–3385. 10.1093/bioinformatics/btm438
- TreeSnatcher Plus: a phylogenetic tree capturing tool [http://www.cibiv.at/software/treesnatcher/]
- Smith R: An Overview of the Tesseract OCR Engine. Ninth International Conference on Document Analysis and Recognition (ICDAR 2007) 2007, 2: 629–633.
- Han MV, Zmasek CM: phyloXML: XML for evolutionary biology and comparative genomics. BMC Bioinformatics 2009, 10: 356. 10.1186/1471-2105-10-356
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.