An automated graphics tool for comparative genomics: the Coulson plot generator
© Field et al.; licensee BioMed Central Ltd. 2013
Received: 14 November 2012
Accepted: 20 April 2013
Published: 27 April 2013
Comparative analysis is an essential component to biology. When applied to genomics for example, analysis may require comparisons between the predicted presence and absence of genes in a group of genomes under consideration. Frequently, genes can be grouped into small categories based on functional criteria, for example membership of a multimeric complex, participation in a metabolic or signaling pathway or shared sequence features and/or paralogy. These patterns of retention and loss are highly informative for the prediction of function, and hence possible biological context, and can provide great insights into the evolutionary history of cellular functions. However, representation of such information in a standard spreadsheet is a poor visual means from which to extract patterns within a dataset.
We devised the Coulson Plot, a new graphical representation that exploits a matrix of pie charts to display comparative genomics data. Each pie is used to describe a complex or process from a separate taxon, and is divided into sectors corresponding to the number of proteins (subunits) in a complex/process. The predicted presence or absence of proteins in each complex are delineated by occupancy of a given sector; this format is visually highly accessible and makes pattern recognition rapid and reliable. A key to the identity of each subunit, plus hierarchical naming of taxa and coloring are included. A java-based application, the Coulson plot generator (CPG) automates graphic production, with a tab or comma-delineated text file as input and generating an editable portable document format or svg file.
CPG software may be used to rapidly convert spreadsheet data to a graphical matrix pie chart format. The representation essentially retains all of the information from the spreadsheet but presents a graphically rich format making comparisons and identification of patterns significantly clearer. While the Coulson plot format is highly useful in comparative genomics, its original purpose, the software can be used to visualize any dataset where entity occupancy is compared between different classes.
CPG software is available at sourceforge http://sourceforge.net/projects/coulson and http://dl.dropbox.com/u/6701906/Web/Sites/Labsite/CPG.html
With a rapidly growing database of completed genomes and consequential improvements to the reconstruction of deep and broad phylogenetic relationships, it has become possible to consider the molecular origins of many complex cellular systems. Such analyses can reveal deep relationships between cellular functions, identify lineage-specific features and uncover evolutionary mechanisms [1-5], and are important in the identification of, for example, pathogen-associated gene products, with potential for therapeutic intervention, as well as in attempts to understand how such systems arose. Further, falling costs of nucleotide sequencing are providing opportunities to generate genome sequences from even hard to culture organisms, making analysis of function in these taxa possible through comparison with tractable organisms. In short, the need to present comparative data is highly pressing and likely to remain an issue for some time.
While it is now comparatively trivial to generate vast datasets containing 100s to 1000s of query results using BLAST, HMMer and other sequence-based algorithms [6-10] these data constitute essentially gene lists, which only have value when processed and presented coherently [5, 11-16]. The major biological added value within such analyses is the ability to rapidly compare the distributions of genes between multiple biological processes, i.e. protein complexes and pathways, and also across many taxa. This is quite challenging as these datasets can contain may hundreds/thousands of gene calls, and unless these data are represented graphically and in an easily comprehended manner, patterns are difficult to observe. In particular, spreadsheets do not lend themselves to browsing and fragmenting datasets into subgroups to reduce data complexity often removes much valuable comparative information. Production of comparison figures from developing datasets (works in progress) are invaluable during dataset production, and even for making decisions and developing hypotheses, but manual production of figures on the fly is unfeasible.
To address these needs we devised the Coulson plot, a matrix of colorized pie charts and which displays information in a clustered format, together with hierarchical taxonomic labels and a key to individual gene products. This plot we, and others, have used in multiple publications and which we have found to be highly useful and accessible to readers of these reports [3, 17-24]. However, the manual construction of these plots is time consuming and, with hundreds of elements, error prone, and which precludes on-the-fly plots and possibly wider adoption of the format. Hence, to facilitate generic/automated production and adoption of the plot we developed a platform-independent application, the Coulson plot generator (CPG), to draw Coulson plots from structured data that uses standard spreadsheet file formats as input. CPG should be accessible to the vast majority of workers with only rudimentary computing skills and requires minimal post-plot manipulations to generate publication quality plots of considerable complexity.
The CPG application opens with three tabs (Figure 3A). The first allows the user to select an input data file, the second, to choose custom pie colors, and the third tab provides the Help/Manual and change log (and licensing). The fourth tab provides a process log, and information to assist with input file formatting (appropriate error messages if your input is not acceptable). The 'Plot' button is not enabled unless the input is correct; clicking on 'Plot' generates the figure. By returning to the first, tabbed window, multiple plots may be created from different inputs, and different versions of a figure may be created from the second window and viewed all together. A default color set is supplied (text file and hard coded) (Figure 3B). After selecting an input file, CPG will parse the data and if successful it will convert the data to a table (Figure 3C). Clicking ‘Figure’ will display a Coulson plot of the data in a new scrollable window. An example dataset used for testing is shown (Figure 3B) from which a small portion was taken for early development. The text file was produced using Microsoft Excel, with data entry in the table as described (Figure 2). Data from Excel were exported as comma separated files (.csv). The output file is an editable PDF or SVG file which can be opened and manipulated with Inkscape (http://inkscape.org/) or Adobe Illustrator (http://www.adobe.com/products/illustrator.html?promoid=KAUCB). We selected this option as more efficient than attempting to build sophisticated editing tools into CPG as the precise choices and requirements of users and datasets are difficult to predict. This follows a similar philosophy to FigTree, a popular phylogenetics tree graphics package which also generates editable graphics requiring a small amount of finessing prior to publication (http://tree.bio.ed.ac.uk/software/figtree/). A Coulson plot with more than 200 pies can be produced satisfactorily.
We developed the Coulson plot to display and compare data on gene representation grouped by gene product complex or pathway membership and to display this information across multiple taxa (Figure 1). An array of gene product components from multiple species with each complex is displayed as a pie chart comprising a variable number of components (sectors), the number of which matches the number of protein subunits in a functional complex, process (i.e. pathway) or other functional group. Pie charts are arranged by phylogenetic hierarchy to allow evaluation of evolutionary trends and the rapid identification of gene losses, specializations or expansions. Several such systems may be compared, so that an array of systems is represented for each species. Using colors, it is possible to separate groups of systems with excellent visual clarity.
We have found the Coulson plot to be highly valuable for presentations of comparative genomic data, and that the lucid display of patterns within datasets more than offset the time required to manually produce these plots. However, we are aware that the skills required and potentially the effort needed acted as a barrier to adoption of a broadly potentially useful graphing format, and which is not available as part of commercial graphing packages as far as we are aware. We therefore developed a plotting tool that manages the vast majority of the plot functionality, leaving the user a format that can be subjected to final editing as appropriate for individual requirements.
A great many datasets have been used to test CPG [3, 17-24]. We find the software is stable on OS X (10.5.8 to 10.8.2), Microsoft Windows (XP, 7 and 8) and multiple versions of Linux. Creation of .csv output files from Microsoft Excel, Apple Numbers or open source office suites that can be read by CPG is routine, and the PDF and SVG output successfully imported to Adobe Illustrator or Inkscape as an editable graphic. A diagram with more than 200 pies and over 600 individual elements can be routinely produced, allowing publication quality figures to be generated in one hour. The ability to rapidly generate plots from dissimilar datasets on-the-fly, allowing hypothesis-driven composition of datasets, is a distinct advantage, and we hope that the Coulson plot will become a more generally exploited format, and that the use of this plot beyond comparative genomics will be facilitated with the provision of CPG.
Availability and requirements
CPG is a Java application and requires Java 1.5.0 or higher for the JVM. CPG source code and binaries are available from sourceforge: http://sourceforge.net/projects/coulson as a jar file or disc image for Mac OS X. Project home page: http://dl.dropbox.com/u/6701906/Web/Sites/Labsite/CPG.html and http://sourceforge.net/projects/coulson. The software is licensed under GNU Artistic license 2.0.
This work was in part funded through a Wellcome Trust program grant (082813) to MCF.
We are grateful to Paul Manna, Ka-Fai Leung (University of Cambridge) and Joel B. Dacks (University of Alberta) for beta-testing of the software and Andrew Jackson (University of Liverpool) for comments on the manuscript.
- Carvalho-Santos Z, Azimzadeh J, Pereira-Leal JB, Bettencourt-Dias M: Evolution: tracing the origins of centrioles, cilia, and flagella. J Cell Biol. 2011, 194 (2): 165-175. 10.1083/jcb.201011152.PubMed CentralView ArticlePubMedGoogle Scholar
- Wickstead B: Patterns of kinesin evolution reveal a complex ancestral eukaryote with a multifunctional cytoskeleton. BMC Evol Biol. 2010, 10: 110-10.1186/1471-2148-10-110.PubMed CentralView ArticlePubMedGoogle Scholar
- Field MC, Gabernet-Castello C, Dacks JB: Reconstructing the evolution of the endocytic system: insights from genomics and molecular cell biology. Adv Exp Med Biol. 2007, 607: 84-96. 10.1007/978-0-387-74021-8_7.View ArticlePubMedGoogle Scholar
- Barnes RL, Shi H, Kolev NG, Tschudi C, Ullu E: Comparative genomics reveals two novel RNAi factors in Trypanosoma brucei and provides insight into the core machinery. PLoS Pathog. 2012, 8 (5): e1002678-10.1371/journal.ppat.1002678.PubMed CentralView ArticlePubMedGoogle Scholar
- Serpeloni M, Vidal NM, Goldenberg S, Avila AR, Hoffmann FG: Comparative genomics of proteins involved in RNA nucleocytoplasmic export. BMC Evol Biol. 2011, 11: 7-10.1186/1471-2148-11-7.PubMed CentralView ArticlePubMedGoogle Scholar
- Finn RD, Clements J, Eddy SR: HMMER web server: interactive sequence similarity searching. Nucleic Acids Res. 2011, 39 (Web Server issue): W29-W37.PubMed CentralView ArticlePubMedGoogle Scholar
- Price MN: FastBLAST: homology relationships for millions of proteins. PLoS One. 2008, 3: e3589-10.1371/journal.pone.0003589.PubMed CentralView ArticlePubMedGoogle Scholar
- Grant JR, Arantes AS, Stothard P: Comparing thousands of circular genomes using the CGView Comparison Tool. BMC Genomics. 2012, 13 (1): 202-10.1186/1471-2164-13-202.PubMed CentralView ArticlePubMedGoogle Scholar
- Koumandou VL, Klute MJ, Herman EK, Nunez-Miguel R, Dacks JB, Field MC: Evolutionary reconstruction of the retromer complex and its function in Trypanosoma brucei. J Cell Sci. 2011, 124 (Pt 9): 1496-1509.PubMed CentralView ArticlePubMedGoogle Scholar
- Waterhouse RM, Zdobnov EM, Tegenfeldt F, Li J, Kriventseva EV, Ortho DB: the hierarchical catalog of eukaryotic orthologs in 2011. Nucleic Acids Res. 2011, 39 (Database issue): D283-D288.PubMed CentralView ArticlePubMedGoogle Scholar
- Elias M, Brighouse A, Gabernet-Castello C, Field MC, Dacks JB: Sculpting the endomembrane system in deep time: high resolution phylogenetics of Rab GTPases. J Cell Sci. 2012, 125 (Pt 10): 2500-2508.PubMed CentralView ArticlePubMedGoogle Scholar
- Kloepper TH, Kienle CN, Fasshauer D: An elaborate classification of SNARE proteins sheds light on the conservation of the eukaryotic endomembrane system. Mol Biol Cell. 2007, 18 (9): 3463-3471. 10.1091/mbc.E07-03-0193.PubMed CentralView ArticlePubMedGoogle Scholar
- Iyer LM, Anantharaman V, Wolf MY, Aravind L: Comparative genomics of transcription factors and chromatin proteins in parasitic protists and other eukaryotes. Int J Parasitol. 2008, 38 (1): 1-31. 10.1016/j.ijpara.2007.07.018.View ArticlePubMedGoogle Scholar
- El-Sayed NM, Myler PJ, Blandin G, Berriman M, Crabtree J, Aggarwal G, Caler E, Renauld H, Worthey EA, Hertz-Fowler C, Ghedin E, Peacock C, Bartholomeu DC, Haas BJ, Tran AN, Wortman JR, Alsmark UC, Angiuoli S, Anupama A, Badger J, Bringaud F, Cadag E, Carlton JM, Cerqueira GC, Creasy T, Delcher AL, Djikeng A, Embley TM, Hauser C, Ivens AC, Kummerfeld SK, Pereira-Leal JB, Nilsson D, Peterson J, Salzberg SL, Shallom J, Silva JC, Sundaram J, Westenberger S, White O, Melville SE, Donelson JE, Andersson B, Stuart KD, Hall N: Comparative genomics of trypanosomatid parasitic protozoa. Science. 2005, 309 (5733): 404-409. 10.1126/science.1112181.View ArticlePubMedGoogle Scholar
- O’Reilly AJ, Dacks JB, Field MC: Evolution of the karyopherin-β family of nucleocytoplasmic transport factors; ancient origins and continued specialization. PLoS One. 2011, 6 (4): e19308-10.1371/journal.pone.0019308.PubMed CentralView ArticlePubMedGoogle Scholar
- Neumann N, Lundin D, Poole AM: Comparative genomic evidence for a complete nuclear pore complex in the last eukaryotic common ancestor. PLoS One. 2010, 5 (10): e13241-10.1371/journal.pone.0013241.PubMed CentralView ArticlePubMedGoogle Scholar
- Koumandou VL, Dacks JB, Coulson RM, Field MC: Control systems for membrane fusion in the ancestral eukaryote; evolution of tethering complexes and SM proteins. BMC Evol Biol. 2007, 7: 29-10.1186/1471-2148-7-29.PubMed CentralView ArticlePubMedGoogle Scholar
- Leung KF, Dacks JB, Field MC: Evolution of the multivesicular body ESCRT machinery; retention across the eukaryotic lineage. Traffic. 2008, 9 (10): 1698-1716. 10.1111/j.1600-0854.2008.00797.x.View ArticlePubMedGoogle Scholar
- Nevin WD, Dacks JB: Repeated secondary loss of adaptin complex genes in the Apicomplexa. Parasitol Int. 2009, 58 (1): 86-94. 10.1016/j.parint.2008.12.002.View ArticlePubMedGoogle Scholar
- Dokudovskaya S, Waharte F, Schlessinger A, Pieper U, Devos DP, Cristea IM, Williams R, Salamero J, Chait BT, Sali A, Field MC, Rout MP, Dargemont C: A conserved coatomer-related complex containing Sec13 and Seh1 dynamically associates with the vacuole in Saccharomyces cerevisiae. Mol Cell Proteomics. 2011, 10 (6): M110.006478-10.1074/mcp.M110.006478.PubMed CentralView ArticlePubMedGoogle Scholar
- Herman EK, Walker G, van der Giezen M, Dacks JB: Multivesicular bodies in the enigmatic amoeboflagellate Breviata anathema and the evolution of ESCRT 0. J Cell Sci. 2011, 124 (Pt 4): 613-621.PubMed CentralView ArticlePubMedGoogle Scholar
- Hirst J, Barlow LD, Francisco GC, Sahlender DA, Seaman MN, Dacks JB, Robinson MS: The fifth adaptor protein complex. PLoS Biol. 2011, 9 (10): e1001170-10.1371/journal.pbio.1001170.PubMed CentralView ArticlePubMedGoogle Scholar
- Klute MJ, Melançon P, Dacks JB: Evolution and diversity of the Golgi. Cold Spring Harb Perspect Biol. 2011, 3 (8): a007849. 1-View ArticleGoogle Scholar
- Adl SM, Simpson AG, Farmer MA, Andersen RA, Anderson OR, Barta JR, Bowser SS, Brugerolle G, Fensome RA, Fredericq S, James TY, Karpov S, Kugrens P, Krug J, Lane CE, Lewis LA, Lodge J, Lynn DH, Mann DG, McCourt RM, Mendoza L, Moestrup O, Mozley-Standridge SE, Nerad TA, Shearer CA, Smirnov AV, Spiegel FW, Taylor MF: The new higher level classification of eukaryotes with emphasis on the taxonomy of protists. J Eukaryot Microbiol. 2005, 52 (5): 399-451. 10.1111/j.1550-7408.2005.00053.x.View ArticlePubMedGoogle Scholar
- Brennand A, Gualdrón-López M, Coppens I, Rigden DJ, Ginger ML, Michels PA: Autophagy in parasitic protists: unique features and drug targets. Mol Biochem Parasitol. 2011, 177 (2): 83-99. 10.1016/j.molbiopara.2011.02.003.View ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.