Pan-Tetris: an interactive visualisation for Pan-genomes
© Hennig et al. 2015
Published: 13 August 2015
Large-scale genome projects have paved the way to microbial pan-genome analyses. A pan-genome describes the union of all genes found across the members of the species or taxon under investigation. Pan-genomes offer a framework to assess the genomic diversity of a given collection of individual genomes and, moreover, help to consolidate gene predictions and annotations. Computing a pan-genome is often a challenge: many techniques that use a global alignment-independent approach run the risk of not separating paralogs from orthologs, and alignment-based approaches that take the gene neighbourhood into account often need additional manual curation of the results. This curation is quite time-consuming, and so far no visualisation tool offers an interactive GUI for the pan-genome to support curating pan-genomic computations or annotations of orthologous genes.
We introduce Pan-Tetris, a Java-based interactive software tool that provides a clearly structured and suitable way to visually inspect gene occurrences in a pan-genome table. The main features of Pan-Tetris are a standard coordinate-based presentation of multiple genomes, complemented by easy-to-use tools that compensate for algorithmic weaknesses in the pan-genome generation workflow. We demonstrate an application of Pan-Tetris to the pan-genome of Staphylococcus aureus.
Pan-Tetris is currently the only interactive pan-genome visualisation tool. Pan-Tetris is available from http://bit.ly/1vVxYZT
Next-generation sequencing technologies have accelerated the pace at which whole genomes can be sequenced, opening the possibility of sequencing a large number of individuals from one species (as in the 1000 Genomes Project in humans, or the 1001 Genomes Project in Arabidopsis thaliana). The genomes of individuals of one species are compared on different levels, ranging from single nucleotide variation up to chromosomal rearrangements. In addition, individual bacterial strains within one species can show extensive variation in their gene content, such that either individual genes or larger clusters of genes can be lost, or newly acquired by horizontal gene transfer. In particular, in pathogenic strains of a bacterial species the degree of virulence can be attributed to the absence or presence of genes. This latter observation has led to the coining of the term pan-genome, which traditionally encompasses the full repertoire of all genes of a bacterial species, but has also been extended to other organisms such as plants [4, 5]. Given the pan-genome of a species, various subsets of its genes are of interest: the core genes, which are present in all strains; the orphan genes (also called strain-specific genes), which are present in only one strain; and the dispensable genes, which exist in a subset of the strains but neither in all nor in just one. So far, at least 17 microbial pan-genome projects have been conducted. Several scientific questions are pursued using the pan-genome as a framework, such as determining the genomic diversity of a species, reconstructing the phylogenetic relationships between strains, or even replacing, or at least questioning, serotyping systems for the species in question.
The serotype is important for the epidemiologic classification of species and strains, and it has great implications for decisions about, for example, medical treatment. Furthermore, pan-genomes play an increasing role in annotation efforts for bacteria. An example is the community wiki-type database AureoWiki at the University of Greifswald, which aims to unify gene and gene product information based on the pan-genome of Staphylococcus aureus (http://www.protecs.uni-greifswald.de/aureowiki). A unified nomenclature of genes, gene descriptions and gene names will especially help bacteriologists and life scientists to transfer knowledge on gene regulation, gene functions, mechanisms of pathogenicity and more between experiments with different strains of the same species.
For the computation of a pan-genome, most methods employ a BLAST-based approach, or variants of it, to compute orthologous gene groups. Orthologs are homologous genes that are related through speciation from a common ancestral gene, while paralogs have evolved through gene duplication. Widely used tools following this type of pan-genome implementation are PGAP, PanOCT, and PanGP. BLAST-based approaches that do not take gene neighbourhood into account bear the risk of false ortholog clustering, in particular for genes with many paralogs [11, 9].
In alignment-based approaches, on the other hand, genomes are compared based on genomic positions, and typically a specific reference genome is chosen to act as the coordinate system for the comparison. However, rearrangements, insertions and deletions lead to substantial architectural variation between genomes, and genomic regions that cannot be aligned to the reference are therefore lost. We have proposed the SuperGenome as a solution, which establishes a general global coordinate system for multiply aligned genomes. This enables the consistent placement of genome annotations in the presence of insertions, deletions, and rearrangements. From the SuperGenome the pan-genome can be computed in a straightforward way. First, the start and stop positions of the annotated genes of the individual genomes are transferred into the shared coordinate system of the SuperGenome. Next, genes are grouped according to whether their annotations overlap in the SuperGenome. Finally, pairwise similarities of the overlapping genes are computed, which are used for the final grouping of the pan genes.
Alignment-based methods that compute a pan-genome are not error-free either. In particular, in regions with large sequence variation or with many copies of a gene class, such as tRNA gene clusters, the correct deduction of the pan genes is a challenge. Here, methods that visualise the gene order together with functional annotation can help to identify, and possibly resolve, such cases.
Only a few tools have been developed that explicitly address the task of visualising a pan-genome. A commonly used tool is the BLASTatlas, which maps and visualises whole-genome homology of genes within a reference strain. Each genome in such a plot is represented as one circle with a unique colour; the intensity of the colour represents the similarity to the respective orthologous gene in the pre-chosen reference genome. PGAT offers a web-based tool to support the homogenisation of genome annotation across the genomes of a species.
Many visualisations that are used for pan-genomes are not only static, but also mainly focus on visualising summary statistics. An example is the flower pot visualisation. To our knowledge, no tool so far includes analytical methods that can be triggered in connection with the visualisation, for example to (re)annotate genes of strains within a species. Here, we introduce Pan-Tetris, the first tool for interactive visualisation of pan-genome computation results. The pan-genome table is represented in a matrix-like visualisation with the aim of identifying patterns of ordered pan gene groups that could be merged. Pan-Tetris supports such pan gene modifications through a Tetris-inspired interaction.
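The grouping step can be illustrated with a minimal Python sketch. It is not the PanGee implementation: the `Gene` record, the function name `group_overlapping` and the example coordinates are hypothetical, intervals are assumed to be closed and already expressed in SuperGenome coordinates, and the final similarity-based refinement is left out.

```python
from collections import namedtuple

# Hypothetical record: a gene already transferred into shared coordinates.
Gene = namedtuple("Gene", ["genome", "name", "start", "end"])

def group_overlapping(genes):
    """Group genes whose closed intervals overlap in the shared coordinates."""
    groups = []
    current, current_end = [], None
    for g in sorted(genes, key=lambda g: (g.start, g.end)):
        if current and g.start <= current_end:
            # Overlaps the running group: extend it.
            current.append(g)
            current_end = max(current_end, g.end)
        else:
            # No overlap: close the running group and start a new one.
            if current:
                groups.append(current)
            current, current_end = [g], g.end
    if current:
        groups.append(current)
    return groups

genes = [
    Gene("A", "a1", 10, 100),
    Gene("B", "b1", 12, 98),
    Gene("C", "c1", 150, 300),
    Gene("A", "a2", 160, 310),
    Gene("B", "b2", 400, 500),
]
groups = group_overlapping(genes)
group_names = [[g.name for g in grp] for grp in groups]
```

In this toy input, a1 and b1 fall into one candidate group, c1 and a2 into a second, and b2 remains alone. A real pipeline would then split or confirm each candidate group using the pairwise similarities mentioned above.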
We have applied Pan-Tetris to the visualisation of the pan-genome of the bacterium Staphylococcus aureus, a model organism for bacteriologists and life scientists.
This section presents the specifications, design choices and realisation of Pan-Tetris, a framework for an interactive pan-genome map visualisation. A key aspect is the aggregation interaction technique that is implemented in Pan-Tetris to support user correction of the computed pan-genome. For this, we use the aggregation technique that we introduced in iHAT, here applied to the interactive process of annotation-based pan-genome refinement.
The SuperGenome-based pan-genome computation
The starting point of our visualisation is an alternative approach to computing a pan-genome. In contrast to reciprocal BLAST, we first compute a whole-genome alignment of the individual genomes (using progressiveMauve), from which we then build a SuperGenome. The SuperGenome provides a common coordinate system that allows a bidirectional mapping between the alignment coordinates and the original coordinates of each individual genome in the multiple genome alignment. Next, we compute the pan-genome based on the SuperGenome. For the computation of the pan genes we note, first, that in a multiple genome alignment orthologous genes, if not too dissimilar, will commonly be aligned and, second, that these will overlap in the coordinate system of the SuperGenome. The advantage of this alignment-based approach is that overlapping genes are more likely to be orthologs than paralogs, because the synteny of the genes is implicitly taken into account while the multiple genome alignment is constructed. In addition, if the genes in the individual genomes have been annotated, the annotations can be directly transferred because of the bidirectional mapping provided by the SuperGenome.
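To make the idea of a bidirectional mapping concrete, the following Python sketch builds the two lookup tables for a single collinear alignment block. This is a deliberate simplification: the actual SuperGenome also has to handle rearranged blocks, and the names `build_maps` and `transfer`, the strain names and the toy sequences are all hypothetical.

```python
def build_maps(aligned_row, gap="-"):
    """Return (genome_to_aln, aln_to_genome) for one gapped alignment row.

    Positions are 0-based; aln_to_genome holds None at gap columns.
    """
    genome_to_aln = []
    aln_to_genome = []
    for col, char in enumerate(aligned_row):
        if char == gap:
            aln_to_genome.append(None)
        else:
            # The next genome position is the number of residues seen so far.
            aln_to_genome.append(len(genome_to_aln))
            genome_to_aln.append(col)
    return genome_to_aln, aln_to_genome

# Toy alignment of two strains (invented sequences).
rows = {"strainA": "ACG--TTA", "strainB": "A-GCCTT-"}
maps = {name: build_maps(row) for name, row in rows.items()}

def transfer(pos, source, target):
    """Map a 0-based position from one genome to another via the alignment."""
    col = maps[source][0][pos]      # genome coordinate -> alignment column
    return maps[target][1][col]     # alignment column -> genome coordinate

mapped = transfer(3, "strainA", "strainB")    # column 5 in both rows
unmapped = transfer(1, "strainA", "strainB")  # strainB has a gap in column 1
```

Gene start and stop positions can be pushed through the first table to obtain shared coordinates, and annotations can be carried back to any genome through the second table, with `None` marking positions that have no counterpart.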
Our method, which we coined 'PanGee' (unpublished software), computes the pan-genome from a SuperGenome of a multiple genome alignment. We define the pan-genome as computed by PanGee as the union of all genes that are contained in any of the individual genomes in the data set. It considers homologous relationships among these genes, which are represented by the computation of orthologous gene groups from genes that overlap in the coordinate system of the SuperGenome. In the context of PanGee and the underlying SuperGenome, a group of orthologous genes will be called a pan gene. A pan gene is defined as follows:
it has a unique identifier;
it contains at least one gene;
it contains at most n genes, where n is the number of genomes in the data set;
it cannot contain two or more genes from the same genome.
PanGee then outputs a pan-genome map which, similarly to other programs, reports the orthologous gene groups, i.e., all pan genes.
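The four rules of the pan gene definition translate directly into a small data structure. The Python class below is only an illustrative sketch with hypothetical names; it takes n to be the number of genomes in the data set, which is implied by the last rule, and it leaves the uniqueness of identifiers to whatever container holds the pan genes.

```python
class PanGene:
    """A group of orthologous genes with at most one gene per genome."""

    def __init__(self, identifier, genomes):
        self.identifier = identifier    # expected to be unique across pan genes
        self.genomes = tuple(genomes)   # the n genomes of the data set
        self.members = {}               # genome name -> gene name

    def add(self, genome, gene):
        """Add a gene, rejecting a second gene from the same genome."""
        if genome not in self.genomes:
            raise ValueError(f"unknown genome: {genome}")
        if genome in self.members:
            raise ValueError(
                f"{self.identifier} already contains a gene from {genome}")
        self.members[genome] = gene

    def is_valid(self):
        """At least one gene and at most n genes."""
        return 1 <= len(self.members) <= len(self.genomes)

pan_gene = PanGene("PG0001", ["A", "B", "C"])
pan_gene.add("A", "a1")
pan_gene.add("B", "b1")
try:
    pan_gene.add("A", "a2")   # violates the one-gene-per-genome rule
    duplicate_rejected = False
except ValueError:
    duplicate_rejected = True
```

Storing members in a dictionary keyed by genome makes the "at most n genes" rule follow automatically from the "one gene per genome" rule.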
Another advantage is that due to the common coordinate system a specific ordering of the orthologs can be assigned based on the starting position in the alignment. This ordering gives a logical structuring of the groups without the need of a reference genome.
Nevertheless, a multiple genome alignment is in most cases computed heuristically. The non-optimality of such alignments can lead to erroneously aligned regions, which can affect the pan gene count. These erroneous regions are computationally difficult to detect. However, because of the logical ordering of the pan genes in the SuperGenome coordinate system, certain patterns in the absence and presence of genes within consecutive orthologous gene groups of the pan-genome constructed by PanGee give indications of these misaligned regions. A visualisation of this pan-genome can therefore help to identify these patterns and correct the errors caused by the alignment.
With this in mind we have developed Pan-Tetris. It uses the aggregation concept of our previously published tools iHAT, which we developed for the visualisation and analysis of genome-wide association (GWA) data, and inPHAP, an interactive visualisation tool for genotype and phased haplotype data. The number of orthologous groups, and therefore of pan genes, depends on the homologous relationships between the genes and on the resulting multiple genome alignment.
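One such pattern is a pair of consecutive pan genes whose presence/absence rows are complementary, so that they fit together like Tetris pieces. The Python sketch below flags these pairs in a boolean matrix. It is a hypothetical heuristic for illustration, not a feature described for Pan-Tetris or PanGee: in the tool the decision to merge rests with the user, who also consults the functional annotation.

```python
def can_merge(row_a, row_b):
    """True if no genome has a gene in both rows (rows are lists of bool)."""
    return not any(a and b for a, b in zip(row_a, row_b))

def merge_candidates(matrix):
    """Indices i for which rows i and i + 1 are complementary."""
    return [i for i in range(len(matrix) - 1)
            if can_merge(matrix[i], matrix[i + 1])]

def merge(row_a, row_b):
    """Union of two complementary rows."""
    return [a or b for a, b in zip(row_a, row_b)]

# Rows are pan genes in SuperGenome order, columns are genomes (toy data).
matrix = [
    [True, True, True],     # present in all three genomes
    [True, True, False],    # missing in the third genome ...
    [False, False, True],   # ... and this one occurs only there
    [True, False, True],
]
candidates = merge_candidates(matrix)
merged_row = merge(matrix[1], matrix[2])
```

Here only rows 1 and 2 are flagged, and merging them yields a row that is present in all genomes, which would turn two partial pan genes into one core gene.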
Graphical representation of the pan-genome
Data formats and visualisation
The glyphs for each gene present within a pan gene for a specific genome are pre-rendered images, to ensure a smooth interaction with the data. Other graphical elements, such as boxes that highlight the selection of individual genes, rows or columns, are also pre-rendered. As a result, changes do not require a recalculation of the image but only a repainting of the current view, which ensures a real-time response to user interaction.
The Pan-Tetris graphical user interface
Pan-Tetris provides several possibilities of user interaction within the GUI which are described in detail in the next subsections.
General interactions with the GUI
In most pan-genome studies, the number of pan genes is very large in comparison to the number of genomes. For fast navigation along these groups the user can use the overview panel, which not only indicates the current view area by a red rectangle, but also allows the user to jump to a desired location (see Figure 3 for an example). Furthermore, it is possible to adjust the current view by changing the size of the grid onto which the individual present genes are placed, or the colour of single graphical elements. Navigation through the graphical representation of the pan-genome is realised with navigation bars along the pan genes (vertical) as well as the genomes (horizontal).
Interactions with data
Through selection, the user can obtain further details about the data. Selecting a pan gene (row) displays the meta-information of this group in the bottom left panel of the GUI. This information provides a quick overview of the genes that are present in a specific pan gene group. Furthermore, it is possible to select single genes in the pan-genome visualisation panel. All available information about the respective gene is then displayed in the bottom middle panel.
To provide a convenient way of finding a gene, a pan gene, aggregated groups or pan genes of a specific function, a search function has been implemented that allows the user to find and select those elements and to update the current view to their location.
Pan-Tetris provides two general export options. Visualisations can be exported as publication-ready images either in bitmap formats (JPEG, PNG and TIFF) or as vector graphics (SVG or PDF format). When the user has modified the pan-genome matrix itself, the modified matrix can also be saved. The output format of the modified matrix is the same as the chosen input format.
Supported platforms and availability
Pan-Tetris is written in Java 7, and can therefore be run on any machine with a Java VM installed. Pan-Tetris, including a tutorial video and example data, is available at http://bit.ly/1vVxYZT.
Results and discussion
The development of Pan-Tetris is the result of a close collaboration with biologists who work on various pan-genome projects. No method for the computation of a pan-genome is error-free, and one of the aims for Pan-Tetris was to provide an interactive tool that offers the possibility to correct the computations of the pan-genome, which at the same time can then also be used to unify gene annotations. A correct pan-genome with a unified nomenclature of genes and gene descriptions is desirable and will help especially bacteriologists and life scientists to transfer knowledge on gene regulation and gene functions from experiments with different strains of the same species.
During our studies of the Staphylococcus aureus pan-genome we learned that lists and tables of orthologous genes alone are not suitable to describe the pan-genome of a bacterial species. The design choices were motivated by the inconvenient use of tables and their unsuitable depiction of possibly missed orthologous relationships. Due to this, we designed a simple visualisation that clearly separates individual genes and strains and at the same time allows the user to identify possible errors in the underlying pan-genome matrix.
The pan-genome as well as the pan gene concept is closely related to set-type data. Thus, our visualisation concept of Pan-Tetris is similar to set visualisation tools such as ConSet and OnSet. These tools let the user examine relationships between different sets with the help of basic set operations and aim at reducing large data sets with the focus to highlight differences and/or similarities between sets. While the aggregation approach of Pan-Tetris is in fact a specific type of set operation, the focus of Pan-Tetris is, however, the proof-reading of the output of an algorithm as well as the curation of the data.
The resulting design of Pan-Tetris offers both an overview of the data and a possibility for a detailed inspection of the pan-genome matrix (see Figure 3).
The implementation of Pan-Tetris, in particular its pre-rendered graphical elements, provides smooth navigation without noticeable loading times. Additionally, all interaction possibilities with the data are intuitively and conveniently placed, which simplifies use of the application.
Pan-Tetris is tightly linked to our method PanGee with which we compute a pan-genome of a data set. PanGee requires a multiple genome alignment, a possible drawback of our approach in comparison to the reciprocal BLAST based methods, since it is often a difficult endeavour to compute such a genome alignment. However, reciprocal BLAST approaches have no information about genomic rearrangements and are not robust against annotation errors, which makes a correction difficult.
Additionally, many labs sequence their own isolated strains and submit their own assembled and annotated genomes to databases. The problem is that independently performed genome annotations often result in variable predictions of gene start and end points and in varying gene lengths, and they are often affected by sequence errors, which results in the prediction of more or less truncated or multiply divided gene sequences. A reciprocal BLAST approach is more likely to fail here, because the matches found will not be sufficient to establish an orthologous relationship, which prevents the grouping into pan genes. In contrast, the multiple genome aligner will still align truncated genes or place divergent gene sequences in direct neighbourhood to each other, which might lead to incomplete pan genes. An example for the notoriously difficult-to-align clusters of tRNA genes is shown in the supplement (see Additional file S3). With Pan-Tetris we offer an interactive possibility for correction. In total we identified many such or similar cases, and using Pan-Tetris we were able to reduce the pan-genome of Staphylococcus aureus significantly. To this day, we know of no other tool that offers a refinement of a pan-genome.
In the current version of Pan-Tetris we have concentrated primarily on a clearly structured visualisation of the pan-genome matrix computed from a multiple genome alignment using our SuperGenome approach, to help improve a pan-genome and the respective gene annotations. In a future version we will add further functionality that, for example, allows the user to output just the core genome, the dispensable genome or only the orphans of the pan-genome of interest. For research into mechanisms of pathogenicity, the core genome of an organism may reveal generic targets suitable for species-specific but not strain-specific treatment (e.g., vaccines, antibiotics, cellular antagonists). Dispensable genes present only in subgroups of strains, such as genes of mobile genetic elements, pathogenicity islands, plasmids or single genes of unknown origin, may serve as strain-specific markers for diagnostic purposes. They can be used to differentiate among phylogenetically related strain groups. Some of them are responsible for strain-specific capabilities such as resistance to antibiotics, synthesis of defined toxins, defined metabolic properties and further factors. The same holds for orphan genes, but specifically for only one of the analysed strains.
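The three subsets follow directly from counting, for each pan gene, the genomes in which it occurs. The Python sketch below shows this under the definitions used in this article (core: all strains, orphan: exactly one strain, dispensable: everything in between). The function name `classify`, the input layout and the example identifiers are hypothetical and do not describe the planned Pan-Tetris feature.

```python
def classify(pan_genome, n_genomes):
    """Split pan genes into core, dispensable and orphan lists.

    pan_genome maps a pan gene identifier to the set of genomes it occurs in.
    """
    classes = {"core": [], "dispensable": [], "orphan": []}
    for identifier, genomes in pan_genome.items():
        count = len(genomes)
        if count == 0:
            # A pan gene must contain at least one gene.
            raise ValueError(f"{identifier} contains no gene")
        if count == n_genomes:
            classes["core"].append(identifier)
        elif count == 1:
            classes["orphan"].append(identifier)
        else:
            classes["dispensable"].append(identifier)
    return classes

pan_genome = {
    "PG1": {"A", "B", "C"},   # present in every strain
    "PG2": {"A", "B"},        # present in a subset of strains
    "PG3": {"C"},             # present in a single strain
}
result = classify(pan_genome, n_genomes=3)
```

With three strains, PG1 is classified as core, PG2 as dispensable and PG3 as orphan. Because the classification depends only on the presence counts, it changes whenever pan genes are merged during curation.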
The next logical step that will improve the curation of pan-genomes is to connect Pan-Tetris with the underlying multiple genome alignment. Here, we plan to integrate a local multiple alignment method such as Clustal Omega  to realign candidates for aggregation.
There are a number of databases that offer precomputed pan-genomes of bacteria; a very prominent example is EDGAR. Together with the developers and providers of EDGAR, we plan to extend the output formats such that users can visualise EDGAR pan-genomes using Pan-Tetris.
Though traditionally defined for bacteria, the concept of the pan-genome can be, and has been, extended to other organisms, such as plants, where gene repertoire changes are observed. Pan-Tetris is not restricted to microbial species; however, so far it has only been tested on pan-genomes computed from multiple alignments of bacterial species. Last but not least, it is conceivable that pan-genome studies for closely related taxa could be performed at the nucleotide sequence level rather than the gene level. Thus, we could extend our SuperGenome approach and the computation of the pan-genome to all orthologous sequence elements in general, revealing not only all protein-coding sequences, but also non-protein-coding features including promoters and small RNAs.
With these additional analytical functionalities we hope to make Pan-Tetris a truly powerful visual analytics tool for pan-genome computation.
We have presented Pan-Tetris, a framework for the visualisation and interactive exploration of large-scale pan-genome matrices. With its close connection to our previously developed SuperGenome concept, a visual assessment of pan genes and their correction by aggregating different pan genes with common functional annotation is very straightforward. To our knowledge, Pan-Tetris is so far the only available interactive visualisation tool to explore and modify computed pan-genomes.
We acknowledge support for publication by Deutsche Forschungsgemeinschaft and Open Access Publishing of University of Tübingen.
This article has been published as part of BMC Bioinformatics Volume 16 Supplement 11, 2015: Proceedings of the 5th Symposium on Biological Data Visualization: Part 1. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/16/S11
- 1000 Genomes Project Consortium, Abecasis GR, Altshuler D, Auton A, Brooks LD, Durbin RM, et al: A map of human genome variation from population-scale sequencing. Nature. 2010, 467 (7319): 1061-1073. 10.1038/nature09534.
- Weigel D, Mott R: The 1001 genomes project for Arabidopsis thaliana. Genome Biol. 2009, 10 (5): 107-10.1186/gb-2009-10-5-107.
- Tettelin H, Masignani V, Cieslewicz MJ, Donati C, Medini D, Ward NL, et al: Genome analysis of multiple pathogenic isolates of Streptococcus agalactiae: implications for the microbial "pan-genome". Proc Natl Acad Sciences U S A. 2005, 102 (39): 13950-13955. 10.1073/pnas.0506758102.
- Morgante M, De Paoli E, Radovic S: Transposable elements and the plant pan-genomes. Curr Opin Plant Biol. 2007, 10 (2): 149-155. 10.1016/j.pbi.2007.02.001.
- Li YH, Zhou G, Ma J, Jiang W, Jin LG, Zhang Z, et al: De novo assembly of soybean wild relatives for pan-genome analysis of diversity and agronomic traits. Nature Biotechnology. 2014, 32 (10): 1045-1052. 10.1038/nbt.2979.
- Vernikos G, Medini D, Riley DR, Tettelin H: Ten years of pan-genome analyses. Current Opinion in Microbiology. 2015, 23: 148-154.
- Medini D, Donati C, Tettelin H, Masignani V, Rappuoli R: The microbial pan-genome. Curr Opin Genet Dev. 2005, 15 (6): 589-594. 10.1016/j.gde.2005.09.006.
- Zhao Y, Wu J, Yang J, Sun S, Xiao J, Yu J: PGAP: pan-genomes analysis pipeline. Bioinformatics. 2012, 28 (3): 416-418. 10.1093/bioinformatics/btr655.
- Fouts DE, Brinkac L, Beck E, Inman J, Sutton G: PanOCT: automated clustering of orthologs using conserved gene neighborhood for pan-genomic analysis of bacterial strains and closely related species. Nucleic Acids Research. 2012, 40 (22): e172-10.1093/nar/gks757.
- Zhao Y, Jia X, Yang J, Ling Y, Zhang Z, Yu J: PanGP: A tool for quickly analyzing bacterial pan-genome profile. Bioinformatics. 2014, 30 (9): 1297-1299. 10.1093/bioinformatics/btu017.
- Storm CE, Sonnhammer EL: Comprehensive analysis of orthologous protein domains using the HOPS database. Genome Research. 2003, 13 (10): 2353-2362. 10.1101/gr1305203.
- Herbig A, Jäger G, Battke F, Nieselt K: GenomeRing: alignment visualization based on SuperGenome coordinates. Bioinformatics. 2012, 28 (12): 7-15. 10.1093/bioinformatics/bts217.
- Hallin PF, Binnewies TT, Ussery DW: The genome BLASTatlas--a GeneWiz extension for visualization of whole-genome homology. Molecular BioSystems. 2008, 4 (5): 363-371. 10.1039/b717118h.
- Sugawara M, Epstein B, Badgley BD, Unno T, Xu L, Reese J, et al: Comparative genomics of the core and accessory genomes of 48 Sinorhizobium strains comprising five genospecies. Genome Biology. 2013, 14 (2): R17-10.1186/gb-2013-14-2-r17.
- Heinrich J, Vehlow C, Battke F, Jäger G, Weiskopf D, Nieselt K: iHAT: interactive hierarchical aggregation table for genetic association data. BMC Bioinformatics. 2012, 13 (Suppl 8): 2-10.1186/1471-2105-13-S8-S2.
- Darling AE, Mau B, Perna NT: progressiveMauve: multiple genome alignment with gene gain loss and rearrangement. PLoS One. 2010, 5 (6): 11147-10.1371/journal.pone.0011147.
- Jäger G, Peltzer A, Nieselt K: inPHAP: Interactive visualization of genotype and phased haplotype data. BMC Bioinformatics. 2014, 15 (1): 200-10.1186/1471-2105-15-200.
- Kim B, Lee B, Seo J: Visualizing set concordance with permutation matrices and fan diagrams. Interacting with Computers. 2007, 19 (5-6): 630-643. 10.1016/j.intcom.2007.05.004.
- Eddy SR: Accelerated profile HMM searches. PLoS Computational Biology. 2011, 7 (10): 1002195-10.1371/journal.pcbi.1002195.
- Haft DH, Selengut JD, Richter RA, Harkins D, Basu MK, Beck E: TIGRFAMs and genome properties in 2013. Nucleic Acids Research. 2013, 41 (D1): D387-D395. 10.1093/nar/gks1234.
- Sadana R, Major T, Dove A, Stasko J: OnSet: A visualization technique for large-scale binary set data. IEEE Transactions on Visualization and Computer Graphics. 2014, 20 (12): 1993-2002.
- Ross MG, Russ C, Costello M, Hollinger A, Lennon NJ, Hegarty R, et al: Characterizing and measuring bias in sequence data. Genome Biology. 2013, 14 (5): 51-10.1186/gb-2013-14-5-r51.
- Sievers F, Wilm A, Dineen D, Gibson TJ, Karplus K, Li W, et al: Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega. Molecular Systems Biology. 2011, 7 (1): 539.
- Blom J, Albaum SP, Doppmeier D, Pühler A, Vorhölter FJ, Zakrzewski M, Goesmann A: EDGAR: a software framework for the comparative analysis of prokaryotic genomes. BMC Bioinformatics. 2009, 10 (1): 154-10.1186/1471-2105-10-154.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.