- Open Access
ClustAGE: a tool for clustering and distribution analysis of bacterial accessory genomic elements
© The Author(s). 2018
- Received: 22 September 2017
- Accepted: 11 April 2018
- Published: 20 April 2018
The non-conserved accessory genome of bacteria can be associated with important adaptive characteristics that can contribute to niche specificity or pathogenicity of strains. High degrees of structural and compositional diversity in genomic islands and other elements of the accessory genome can complicate characterization of accessory genome contents among populations of strains. Methods for easily and effectively defining the distributions of discrete elements of the accessory genome among bacterial strains in a population are needed to explore the relationships between the flexible genome and bacterial adaptive traits.
We have developed the open-source software package ClustAGE. This program, written in Perl, uses BLAST to cluster nucleotide accessory genomic elements from the genomes of multiple bacterial strains and to identify their distribution within the study population. The program output can be used in combination with strain phenotype data or other characteristics to detect associations. Optional graphical output is available for visualizing accessory genome gene content and distribution patterns. The capabilities of the software are demonstrated on a collection of 14 Pseudomonas aeruginosa genome sequences.
The ClustAGE software and utilities are effective for identifying characteristics and distributions of accessory genomic elements among groups of bacterial genomes. The ability to easily and effectively characterize the accessory genome of a sequence collection may provide a better understanding of the accessory genome’s contribution to a species’ adaptation and pathogenesis. The ClustAGE source code can be downloaded from https://clustage.sourceforge.io and a limited web-based implementation is available at http://vfsmspineagent.fsm.northwestern.edu/cgi-bin/clustage.cgi.
- Comparative genomics
- Accessory genome
- Flexible genome
Gene content can vary greatly between closely related strains of bacteria and between other unicellular organisms [1, 2]. Genes within a species can be divided into a conserved core genome and a flexible accessory genome. The core genome of an organism consists of genetic sequence that is conserved among all or nearly all members of the species. Conversely, the accessory genome represents genetic material that is present in some, but not all, members of the species. The total complement of genetic material within a species is known as the pangenome. Among bacteria, modification of gene content can arise from one of three major mechanisms: gene loss, gene gain through duplication, and gene gain through horizontal gene transfer (HGT) [4, 5]. Horizontally transferred genetic elements can include such structures as plasmids, integrative and conjugative elements (ICEs), replacement islands, prophages and phage-like elements, transposons, insertion sequences, and integrons [6–8]. Collectively, these horizontally transferred elements, as well as any contiguous stretch of genetic material that is not part of the conserved core genome, regardless of source or structure, can be referred to as accessory genomic elements (AGEs).
The accessory genome of bacteria can be an important source of phenotypic diversity. Genes within the accessory genome can drive environmental niche adaptation or pathogenesis within hosts [10, 11]. For instance, in Pseudomonas aeruginosa, genes within the accessory genome have been found to allow the organisms to persist in environments containing heavy metals and toxic organic compounds that would otherwise be unsuitable for P. aeruginosa habitation [12, 13]. In Staphylococcus aureus, the S. aureus pathogenicity islands (SaPIs) are a class of mobile genetic elements that carry genes encoding virulence factors such as TSST1, a toxin important in toxic shock syndrome, or other toxins. Antibiotic resistance genes are frequently found in the accessory genomes of clinically important bacterial pathogens. One example is the carbapenemase-carrying plasmids in Klebsiella pneumoniae and other Gram-negative pathogens that contribute to the spread of this phenotype in healthcare settings [15, 16]. Given that bacterial accessory genomes are known to be enriched in virulence factors, directed study of the accessory genome contents and distributions within a population could yield new diagnostics and therapeutics for bacterial infections.
Often AGEs are not discrete structures with well-defined borders and gene compositions, but instead can be mosaic and fragmented with insertions of other elements, structural rearrangements, or partial deletions. Mosaic islands have previously been described in E. coli and Streptococcus pneumoniae. In Pseudomonas aeruginosa, a genomic island containing the type 3 secretion system effector gene exoU was found to share extensive homology and synteny with genes in the other P. aeruginosa islands PAPI-1 and pKLC102. Given the possibly mosaic nature of accessory genomic regions, accessory element characterization is often not as simple as screening genomes for a discrete set of defined genomic islands or other horizontally transferred elements. Therefore, a robust analysis of the pan-accessory genome of a set of bacterial strains must be able to account for potential changes in structure and composition of accessory regions between strains.
With the increase in availability and affordability of whole-genome sequencing, large-scale genomic analyses of populations of isolates have become more feasible. Software packages such as mga, Mauve, and Mugsy have been developed to perform segmented alignments of complete genomes for the purposes of aligning shared genomic regions. Several bioinformatic tools also exist to characterize the core and pangenome of bacterial species [25–28], but few specifically examine the accessory genome fraction. To address the accessory genome of bacteria more directly, the previously presented bioinformatic tools Spine and AGEnt were developed to identify the conserved nucleotide core genome sequence in a set of genomic sequences and use this core genome sequence to perform in silico subtractive hybridization to isolate the accessory genome component of each strain. However, software such as Spine and AGEnt or others that characterize the accessory genome of bacterial strains do not focus directly on providing the distribution of accessory elements in a study population. Such distribution analyses are important for answering questions about horizontally transferred or subgroup-specific genetic elements that may contribute, for example, to a particular phenotype of interest or to particular niche adaptations.
This report describes ClustAGE, a software package that clusters accessory genomic elements identified by Spine and AGEnt from a set of genomes into discrete AGE units to define the distribution of accessory elements among the analyzed genomes. Several software tools such as DomClust, GCQuery, PanOCT, and OrthoDB have been developed for the purposes of clustering gene sequences into orthologous groups. These programs identify clusters of related genes across bacterial genomes based on gene sequences. The approach to accessory genome characterization taken by ClustAGE differs from these other approaches in that ClustAGE compares the complete nucleotide sequences of the accessory genome rather than just the protein-coding sequences. A nucleotide-sequence-based, gene-agnostic approach offers several advantages in characterizing AGE distributions. First, the identification of shared accessory elements does not depend on annotation, which may differ in method and results between strains available from public databases or collaborators. Second, intergenic sequence distribution can be studied, so that distributions of non-protein-coding sequences with potential biological relevance, such as promoter sequences or small RNAs, can be analyzed more easily in the accessory genome of the population. Third, this approach has the potential to capture variable regions within otherwise conserved genes that may have arisen by homologous recombination or other mechanisms. The data generated by this software allow detailed analysis of the flexible portion of a population’s pangenome.
ClustAGE is a command-line tool built using the Perl programming language for the purpose of analyzing and comparing accessory genomic elements (AGEs) between genomes. The source code is distributed as freeware under the GNU General Public License version 3. The core functionality of ClustAGE requires BLAST+ v2.3.0 [35, 36], binaries of which for OS X or 64-bit Linux are included with the distribution. Optional features require installation of the freeware programs gnuplot v5.0 (http://www.gnuplot.info/) and/or bwa v0.7.13.
In the first step of this process, clustering of similar AGE sequences into “bins” is performed. First, AGE sequences from all genomes input into ClustAGE are pooled together to create a single nucleotide BLAST database. AGE sequences are then sorted by size. In the initial iteration of the clustering algorithm, the longest contiguous AGE in the dataset is chosen as a bin representative. This bin sequence is then used as the query sequence in a blastn alignment against the database of all input AGE sequences. Alignment results are filtered to remove any hits against AGEs from the same genome as the bin representative, as well as hits below user-defined sequence identity and length cutoffs. AGE sequences with BLAST hits that pass these filters are binned with the representative sequence and removed from the pool of potential bin representative sequences in subsequent iterations. Conversely, all non-aligning AGE sequences remain in the pool of potential bin representative sequences. If only part of an AGE sequence aligns to the bin representative, the non-aligning portion of the AGE sequence is isolated and added to the pool of potential bin representatives. Each subsequent iteration of clustering selects the next-longest complete or partial AGE sequence that was not previously binned with a larger bin representative sequence and uses it as the query sequence for alignment against the AGE sequence pool. Clustering iterations continue in this fashion until no bin representative sequences above a user-defined length threshold remain in the pool.
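The iterative binning described above can be illustrated with a small Python sketch. This is not the ClustAGE implementation (which is written in Perl and uses blastn): the function names `toy_align` and `cluster_ages`, the input format, and the exact k-mer matching that stands in for a filtered BLAST alignment are all assumptions made for illustration only.

```python
def kmers(seq, k):
    """Set of all k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def toy_align(rep_seq, frag_seq, k):
    """Hypothetical stand-in for a filtered blastn search.

    Returns merged [start, end) intervals of frag_seq that are covered by
    exact k-mers also present in rep_seq.
    """
    rep_kmers = kmers(rep_seq, k)
    intervals = []
    for i in range(len(frag_seq) - k + 1):
        if frag_seq[i:i + k] in rep_kmers:
            if intervals and i <= intervals[-1][1]:
                intervals[-1][1] = i + k  # extend the current interval
            else:
                intervals.append([i, i + k])
    return [(s, e) for s, e in intervals]


def cluster_ages(ages, k=8, min_len=8):
    """Greedy longest-first binning of AGE sequences.

    ages: list of (genome, age_id, sequence).
    Returns a list of bins, each with a representative (genome, age_id,
    start, end) and the member regions that aligned to it.
    """
    # Pool entries are fragments: (genome, age_id, start, end, sequence slice).
    pool = [(g, a, 0, len(s), s) for g, a, s in ages]
    bins = []
    while True:
        candidates = [f for f in pool if len(f[4]) >= min_len]
        if not candidates:
            break  # no remaining fragment is long enough to represent a bin
        rep = max(candidates, key=lambda f: len(f[4]))
        pool.remove(rep)
        members, next_pool = [], []
        for frag in pool:
            genome, age_id, start, end, seq = frag
            if genome == rep[0]:
                next_pool.append(frag)  # hits within the same genome are ignored
                continue
            hits = toy_align(rep[4], seq, k)
            if not hits:
                next_pool.append(frag)  # non-aligning AGE stays in the pool
                continue
            cursor = 0
            for s, e in hits:
                members.append((genome, age_id, start + s, start + e))
                if s > cursor:  # unaligned piece before this hit returns to the pool
                    next_pool.append((genome, age_id, start + cursor, start + s, seq[cursor:s]))
                cursor = e
            if cursor < len(seq):  # unaligned tail returns to the pool
                next_pool.append((genome, age_id, start + cursor, end, seq[cursor:]))
        pool = next_pool
        bins.append({"representative": rep[:4], "members": members})
    return bins


# Toy input: homopolymer blocks make the shared regions easy to follow by eye.
toy_ages = [
    ("g1", "a1", "A" * 10 + "C" * 10 + "G" * 10),
    ("g2", "a2", "C" * 10 + "T" * 12),
    ("g3", "a3", "T" * 9),
]
for toy_bin in cluster_ages(toy_ages):
    print(toy_bin)
```

In this toy run the longest AGE becomes the first representative, the matching part of the second AGE is binned with it, and the unmatched remainder of the second AGE goes back into the pool, where it later becomes a representative itself. A real analysis would replace `toy_align` with alignments that tolerate mismatches and gaps and that apply the identity and length cutoffs described above.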
Output files from the core function of ClustAGE described above include nucleotide sequences of the bin representative and nucleotide sequences of AGE subelements longer than a user-defined cutoff. A file listing positions within the input sequences from which the bin representative AGEs were derived, as well as a file listing the positions of subelements within each AGE and the distributions of each subelement among the input sequences are also output. Optionally, ClustAGE can produce plots of AGE distributions among the input genomes for each of the bin representative AGEs (Fig. 2). This functionality requires gnuplot (http://www.gnuplot.info) to produce the plots.
ClustAGE allows users to include coordinates and descriptions of protein coding sequences (CDS) within accessory elements as input. If provided, information about which coding sequences are contained within bins and subelements is output for each AGE for which annotations in the bin reference sequence were given. If graphical output was requested, annotated gene positions and directionality will be shown in the images (Fig. 2).
One limitation of working with draft genome sequences generated by de novo assembly of short sequencing reads, such as those produced by Illumina sequencing technology, is that the assembly process can fail to assemble small portions of the genome even when sufficient reads covering these regions are present in the read data set. This in turn can lead to the false presumption that an AGE is absent from a genome when in fact it simply failed to be properly assembled. To account for data missing from de novo generated draft genome sequences assembled from Illumina reads, ClustAGE includes an option to identify missing AGE sequences from raw read sequences. After the set of AGEs is identified from accessory genome sequences as detailed above, whole-genome Illumina sequencing data provided to ClustAGE are aligned to the bin reference AGE sequences using the ‘mem’ function of the bwa aligner with default settings. To minimize false-positive alignments of core genome read sequences to accessory regions, a core genome nucleotide sequence, such as that output by Spine, can be provided to ClustAGE. Any reads aligning to both the core genome sequence and an AGE bin sequence will be excluded. Reads aligning to AGEs above a user-defined minimum depth of coverage and producing a contiguous alignment exceeding a minimum user-defined sequence similarity will be added to the binned sequence for that genome. Alignment data from Illumina reads are only added in AGE regions that were not found to have alignments against a bin representative AGE in the original input draft sequence for a genome. To minimize false-positive results, read alignment data are also not added unless the alignment region is either at one or both of the bin representative AGE ends or contiguous with accessory genomic sequence previously aligned by BLAST from assembly data.
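The acceptance rules for read-derived sequence can be sketched in Python as follows. This is a simplified illustration, not the ClustAGE code: the function names, the input formats, and the treatment of the depth cutoff as "greater than or equal to" are assumptions, and the sequence similarity check on the read alignment is left out.

```python
def depth_from_reads(bin_len, alignments, core_read_ids):
    """Per-base read depth along a bin representative.

    alignments: iterable of (read_id, start, end), end exclusive.
    core_read_ids: IDs of reads that also align to the core genome;
    these reads are excluded.
    """
    depth = [0] * bin_len
    for read_id, start, end in alignments:
        if read_id in core_read_ids:
            continue
        for pos in range(max(start, 0), min(end, bin_len)):
            depth[pos] += 1
    return depth


def read_corrected_regions(depth, assembly_intervals, min_depth):
    """Regions of the bin representative to add from read alignments.

    assembly_intervals: [start, end) regions already aligned from the
    draft assembly of this genome.
    """
    bin_len = len(depth)
    covered = [False] * bin_len
    for s, e in assembly_intervals:
        for pos in range(s, e):
            covered[pos] = True

    # Maximal runs with enough depth that the assembly did not already cover.
    runs, start = [], None
    for pos in range(bin_len + 1):
        ok = pos < bin_len and depth[pos] >= min_depth and not covered[pos]
        if ok and start is None:
            start = pos
        elif not ok and start is not None:
            runs.append((start, pos))
            start = None

    # Keep a run only if it touches a bin end or abuts assembly-aligned sequence.
    starts = {s for s, _ in assembly_intervals}
    ends = {e for _, e in assembly_intervals}
    return [(s, e) for s, e in runs
            if s == 0 or e == bin_len or s in ends or e in starts]
```

For a 20 bp bin with assembly-derived sequence at positions 5 to 10, a well-covered run at positions 10 to 14 would be kept because it abuts the assembled region, whereas an isolated run at positions 15 to 18 would be rejected.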
Subelements are then redefined using the added read alignment data and a separate set of “read-corrected” subelement sequence and coordinate files are output. If optional plotting of AGE distributions was chosen, read-aligned AGE regions are plotted using a different color to distinguish them from AGE alignments derived from accessory genomic sequences (Fig. 2).
The ClustAGE results can be used to visualize and compare relative similarity of total accessory genome content among strains in the population studied. The pipeline script subelements_to_tree.pl is provided with ClustAGE for this purpose. The program quantifies the relative amount of shared subelement accessory genomic sequence for each pair of genomes by calculating Bray-Curtis distances. Briefly, the Bray-Curtis distance for a pair of genomes is calculated as $d = 1 - \frac{2 S_{ij}}{S_{ii} + S_{jj}}$, where $S_{ij}$ is the total length, in bases, of subelements identified by ClustAGE in both genomes $i$ and $j$, and $S_{ii}$ and $S_{jj}$ are the total accessory genome subelement sizes, in bases, of genomes $i$ and $j$, respectively. In order to cluster strains by total accessory genome similarity, a matrix of Bray-Curtis distances for each pair of input strains is used to create a neighbor-joining tree using the ‘neighbor’ function of PHYLIP version 3.696. Optional bootstrap trees from random re-samplings of the data can be generated using PHYLIP’s ‘seqboot’ and ‘neighbor’ functions. Bootstrap support values can then be calculated for each branch of the neighbor-joining tree using the CompareToBootstrap.pl script developed by Morgan N. Price (http://microbesonline.org/fasttree/treecmp.html). In addition to the neighbor-joining tree, a matrix of Bray-Curtis similarity values ($1 - d$) is output, as well as a file that can be used to add a heatmap of Bray-Curtis similarity values to the neighbor-joining tree in the online tree visualization software Interactive Tree Of Life (http://itol.embl.de).
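As a worked illustration of the distance calculation, the following Python sketch computes pairwise Bray-Curtis distances from a table of subelement lengths and the genomes in which each subelement was found. The input dictionary is a hypothetical format chosen for this example and does not correspond to the files read by subelements_to_tree.pl.

```python
from itertools import combinations


def bray_curtis_distance(shared_bp, total_i_bp, total_j_bp):
    """d = 1 - 2*S_ij / (S_ii + S_jj)."""
    denom = total_i_bp + total_j_bp
    if denom == 0:
        # Assumption for this sketch: two empty accessory genomes are identical.
        return 0.0
    return 1.0 - (2.0 * shared_bp) / denom


def distance_matrix(subelements):
    """subelements: dict of subelement_id -> (length_bp, set of genome names).

    Returns the sorted genome list and a dict mapping (genome_a, genome_b)
    to the Bray-Curtis distance.
    """
    genomes = sorted({g for _, members in subelements.values() for g in members})
    totals = {g: 0 for g in genomes}                          # S_ii per genome
    shared = {pair: 0 for pair in combinations(genomes, 2)}   # S_ij per pair
    for length, members in subelements.values():
        for g in members:
            totals[g] += length
        for pair in combinations(sorted(members), 2):
            shared[pair] += length
    dist = {(g, g): 0.0 for g in genomes}
    for (a, b), s in shared.items():
        d = bray_curtis_distance(s, totals[a], totals[b])
        dist[(a, b)] = dist[(b, a)] = d
    return genomes, dist
```

For example, if genomes A and B each carry 1500 bp of accessory subelements and share 1000 bp of them, then $d = 1 - 2000/3000 \approx 0.33$, corresponding to a similarity of about 0.67.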
A utility for visualizing ClustAGE results as a pan-accessory genome figure is also available. ClustAGE Plot (http://vfsmspineagent.fsm.northwestern.edu/cgi-bin/clustage_plot.cgi) uses CGView to produce a representation of ClustAGE results as bins ordered largest to smallest in a circular configuration with concentric rings indicating the distributions of accessory elements for each included strain. Although designed to be flexible, user-friendly, and powerful enough for most users, visualizations with ClustAGE Plot could become less informative with larger (i.e., > 100 genomes) and/or high-complexity data sets. The XML-formatted file produced by ClustAGE Plot can be downloaded and used to produce higher resolution images on a user’s local version of CGView. Furthermore, the output files generated by ClustAGE provide sufficient data for further processing and can be reformatted to serve as input for other applications capable of visualizations such as R (https://www.r-project.org/), Circos (http://circos.ca/), or other third-party programs, depending on the users’ needs and skills.
The ClustAGE scripts and utilities are available for download at https://clustage.sourceforge.io. A web-based implementation of ClustAGE is also available at http://vfsmspineagent.fsm.northwestern.edu/cgi-bin/clustage.cgi. The web version is limited to a maximum of 15 accessory genome sequence sets and does not support read-correction of AGEs.
To demonstrate the functionality of ClustAGE, Spine v0.2.1 was used to identify the core and accessory genomic sequences of a set of 12 Pseudomonas aeruginosa strains, as described previously. The 12 strain sequences used and their NCBI accession numbers were 19BR (AFXJ01000001.1), 213BR (AFXK01000001.1), B136-33 (CP004061.1), DK2 (CP003149.1), LESB58 (FM209186.1), M18 (CP002496.1), NCGM2.S1 (AP012280.1), PA7 (CP000744.1), UCBPP-PA14 (CP000438.1), PACS2 (NZ_AAQW01000001.1), PAO1 (AE004091.2), and RP73 (CP006245.1). Using a core genome definition of sequences present in at least 11 of the 12 reference genomes, the reference core genome size was 5844 kbp. AGEnt v0.2.1 was then used to determine the accessory genomic sequences of these 12 strains as well as of two draft genome assemblies of P. aeruginosa strains, PA99 (JARJ01000000) and PA103 (JARI01000000). The average total size of the accessory genomic fraction of a strain was 735 kbp (range 428 kbp - 1177 kbp) with an average of 208 contiguous accessory elements (range 170 - 435). Output files from the Spine and AGEnt analyses are available in Additional file 1. Sequencing read sets for PA99 and PA103 consisting of 100 bp paired-end Illumina reads generated by the HiSeq 2000 platform are available from the NCBI short read archive (SRR5447413 and SRR5447414). For more detail on the derivation and characteristics of the core and accessory genomes of this sequence set, see the previous publication on Spine and AGEnt.
Table: AGE bin representative characteristics
Columns: # bin representatives; Total size of bin representatives, in bp; Average bin representative size, in bp (min - max)
Average bin representative size, in bp (min - max): 3459 (216 - 50,833); 5387 (268 - 54,765); 3348 (227 - 15,557); 4304 (226 - 81,418); 6094 (206 - 50,121); 3226 (206 - 31,798); 5736 (270 - 40,043); 5331 (209 - 127,886); 3324 (208 - 21,861); 4670 (200 - 55,310); 2336 (229 - 63,512); 4721 (227 - 32,463); 6084 (217 - 10,474); 8981 (212 - 46,125)
Table: AGE read correction results per strain
Columns: Total added accessory genome sequence (bp); % increase in total accessory genome length; # bins with added sequence; Average bp added per bin (min - max)
Average bp added per bin (min - max): 46 (1 - 350); 100 (1 - 1247)
To examine the effect of modifications to the default settings of ClustAGE on output, the analysis was repeated with a more permissive minimum sequence identity of 80%, as well as a more restrictive minimum sequence identity of 90%. See Additional file 4 for a table comparing ClustAGE results at the different cutoffs. Using a setting of 80% minimum sequence identity, more bin representatives were identified, comprising less total sequence and more subelement divisions of the bin representatives, than when the default setting of 85% was used. The lower sequence identity threshold results in more alignments against bin representatives being preserved. This causes more binning of portions of AGEs within the potential bin representative pool, leaving more unbinned AGE fragments to serve as bin representatives. This is reflected in the decreased average length of the bin representatives compared to the results of the 85% cutoff. It also leads to greater fragmentation of the AGEs into subelements as more potentially nonspecific BLAST alignments escape filtering. Conversely, the more restrictive 90% sequence identity cutoff resulted in fewer AGE representatives of longer average length divided into fewer total subelements. Further comparisons of ClustAGE results after read correction can be seen in the table in Additional file 4.
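The role of the identity cutoff as a filter on alignment hits can be illustrated with a short Python sketch. It assumes hits in the standard 12-column BLAST+ tabular format (`-outfmt 6`: qseqid, sseqid, pident, length, mismatch, gapopen, qstart, qend, sstart, send, evalue, bitscore); whether ClustAGE itself parses this format is not stated here. The function name, the `genome_of` lookup, and the default minimum length of 100 bp are hypothetical, while the 85% default identity follows the text above.

```python
def filter_hits(tabular_lines, genome_of, rep_genome,
                min_identity=85.0, min_length=100):
    """Keep hits that pass the identity and length cutoffs.

    tabular_lines: BLAST+ -outfmt 6 lines for one bin representative query.
    genome_of: dict mapping subject sequence ID -> genome name (hypothetical).
    rep_genome: genome of the bin representative; its own AGEs are skipped.
    Returns (subject_id, start, end, percent_identity) tuples with
    1-based inclusive subject coordinates in ascending order.
    """
    kept = []
    for line in tabular_lines:
        if not line.strip() or line.startswith("#"):
            continue
        fields = line.rstrip("\n").split("\t")
        sseqid = fields[1]
        pident = float(fields[2])
        length = int(fields[3])
        if genome_of[sseqid] == rep_genome:
            continue  # hit against the representative's own genome
        if pident < min_identity or length < min_length:
            continue
        sstart, send = int(fields[8]), int(fields[9])
        # Minus-strand hits report sstart > send, so normalize the order.
        kept.append((sseqid, min(sstart, send), max(sstart, send), pident))
    return kept
```

Lowering `min_identity` lets more hits through this filter, which is consistent with the greater fragmentation into subelements observed at the 80% setting.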
ClustAGE gene distribution
Table: ClustAGE annotation vs. gene ortholog analysis
Headers: % of comparisons; ClustAGE+ / Ortholog-; ClustAGE- / Ortholog+
Accessory genome similarity
Scalability and computational efficiency
As the cost of microbial whole-genome sequencing has decreased and availability of sequencing resources has increased, computational requirements for analyzing the resulting genomic data sets can become a limiting factor. Processing time and memory requirements of ClustAGE analyses were evaluated using AGE data sets from increasing numbers of genomes. The figure in Additional file 7 shows the average analysis time and average maximum memory requirements for ClustAGE analyses. Five replicate analyses of each number of input genomes were conducted on both a server platform running Ubuntu Linux and a desktop computer running Mac OS X. For more details, see Additional Methods in Additional file 5. On both computing platforms the ClustAGE processing times increased linearly up to 200 genomes, with r-squared values of 0.9906 and 0.9925 on the Linux and OS X platforms, respectively. The average time required to analyze 200 accessory genomes was less than 70 min on both computers. Peak memory use also increased linearly up to 200 genomes analyzed, with a maximum average RAM use of 1.6 GB on the Ubuntu Linux server and 1.2 GB on the Mac OS X desktop computer. Processing time and memory requirements are likely to vary further depending on the average accessory genome size of the analyzed strains. Nonetheless, these results indicate that ClustAGE analysis is scalable to larger genome data sets and suggest that users without access to high-memory and/or multiple-processor computing resources can still perform ClustAGE analyses on AGEs derived from tens or hundreds of genomic sequences using standard desktop or even laptop computers.
ClustAGE, in combination with the core and accessory genome identification packages Spine and AGEnt, is an easy-to-use and accurate software tool to characterize the distribution of accessory genomic elements (AGEs) within a collection of bacterial whole-genome sequences. It includes utilities for visualizing AGE distributions and comparing and classifying relative accessory genome similarity among strains in the studied population. Taken together, the analysis output provided by ClustAGE can offer researchers a powerful new tool to study the relationships of discrete strain characteristics with flexible genome content in large genomic data sets to gain insight into bacterial evolution and adaptation.
Project name: ClustAGE.
Operating system(s): Platform independent.
Programming language: Perl.
Other requirements: Perl 5.10 or higher, BLAST+ 2.3.0 or higher. For optional functions, gnuplot 5.0 or higher, bwa 0.7.13 or higher, and/or phylip 3.695 or higher are necessary.
License: GNU GPL v3.
Any restrictions to use by non-academics: None.
Thank you to Larry Kociolek, Nathan Pincus, Maulin Soneji, and Syed Beenish for software testing and feedback. Thank you to Timothy Lee Turner and Sudhir Penugonda for manuscript review. Thank you to Alan Hauser for mentorship, guidance, and manuscript review.
This work was supported by a Mentored Research Scholar Grant in Applied and Clinical Research, MRSG-13-220-01 – MPC from the American Cancer Society.
Availability of data and materials
The datasets analyzed during the current study are available in the NCBI nucleotide and short read repositories, https://www.ncbi.nlm.nih.gov/nuccore and https://www.ncbi.nlm.nih.gov/sra. The remainder of the data generated during this study is included in this published article and its supplementary information files.
EO conceived of, programmed, and tested the software and prepared the manuscript. The author read and approved the final manuscript.
Ethics approval and consent to participate
The authors declare that they have no competing interests.
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Open AccessThis article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
- Lan R, Reeves PR. Intraspecies variation in bacterial genomes: the need for a species genome concept. Trends Microbiol. 2000;8(9):396–401.
- van Passel MW, Marri PR, Ochman H. The emergence and fate of horizontally acquired genes in Escherichia coli. PLoS Comput Biol. 2008;4(4):e1000059.
- Medini D, Donati C, Tettelin H, Masignani V, Rappuoli R. The microbial pan-genome. Curr Opin Genet Dev. 2005;15(6):589–94.
- Kuo CH, Ochman H. The fate of new bacterial genes. FEMS Microbiol Rev. 2009;33(1):38–43.
- Rocha EP. Evolutionary patterns in prokaryotic genomes. Curr Opin Microbiol. 2008;11(5):454–60.
- Kung VL, Ozer EA, Hauser AR. The accessory genome of Pseudomonas aeruginosa. Microbiol Mol Biol Rev. 2010;74(4):621–41.
- Hacker J, Blum-Oehler G, Muhldorfer I, Tschape H. Pathogenicity islands of virulent bacteria: structure, function and impact on microbial evolution. Mol Microbiol. 1997;23:1089–97.
- Burrus V, Waldor MK. Shaping bacterial genomes with integrative and conjugative elements. Res Microbiol. 2004;155(5):376–86.
- Rouli L, Merhej V, Fournier PE, Raoult D. The bacterial pangenome as a new tool for analysing pathogenic bacteria. New Microbes New Infect. 2015;7:72–85.
- Top EM, Springael D. The role of mobile genetic elements in bacterial adaptation to xenobiotic organic compounds. Curr Opin Biotechnol. 2003;14(3):262–9.
- Hacker J, Hochhut B, Middendorf B, Schneider G, Buchrieser C, Gottschalk G, Dobrindt U. Pathogenomics of mobile genetic elements of toxigenic bacteria. Int J Med Microbiol. 2004;293(7-8):453–61.
- Aguilar-Barajas E, Ramírez-Díaz MI, Riveros-Rosas H, Cervantes C. Heavy metal resistance in pseudomonads. In: Ramos JL, Filloux A, editors. Pseudomonas: volume 6: molecular microbiology, infection and biodiversity, vol. 6. New York: Springer; 2010. p. 255–82.
- Campos-García J. Metabolism of acyclic terpenes by Pseudomonas. In: Ramos JL, Filloux A, editors. Pseudomonas: volume 6: molecular microbiology, infection and biodiversity, vol. 6. New York: Springer; 2010. p. 235–54.
- Novick RP, Christie GE, Penades JR. The phage-related chromosomal islands of gram-positive bacteria. Nat Rev Microbiol. 2010;8(8):541–51.
- Gomez-Simmonds A, Uhlemann AC. Clinical implications of genomic adaptation and evolution of Carbapenem-resistant Klebsiella pneumoniae. J Infect Dis. 2017;215(suppl_1):S18–27.
- Ramirez MS, Traglia GM, Lin DL, Tran T, Tolmasky ME. Plasmid-mediated antibiotic resistance and virulence in gram-negatives: the Klebsiella pneumoniae paradigm. Microbiol Spectr. 2014;2(5):1–15.
- Ho Sui SJ, Fedynak A, Hsiao WW, Langille MG, Brinkman FS. The association of virulence factors with genomic islands. PLoS One. 2009;4(12):e8094.
- Hacker J, Kaper JB. Pathogenicity islands and the evolution of microbes. Annu Rev Microbiol. 2000;54:641–79.
- Janka A, Becker G, Sonntag AK, Bielaszewska M, Dobrindt U, Karch H. Presence and characterization of a mosaic genomic island which distinguishes sorbitol-fermenting enterohemorrhagic Escherichia coli O157:H- from E. coli O157:H7. Appl Environ Microbiol. 2005;71(8):4875–8.
- Bruckner R, Nuhn M, Reichmann P, Weber B, Hakenbeck R. Mosaic genes and mosaic chromosomes-genomic variation in Streptococcus pneumoniae. Int J Med Microbiol. 2004;294(2-3):157–68.
- Kulasekara BR, Kulasekara HD, Wolfgang MC, Stevens L, Frank DW, Lory S. Acquisition and evolution of the exoU locus in Pseudomonas aeruginosa. J Bacteriol. 2006;188(11):4037–50.
- Hohl M, Kurtz S, Ohlebusch E. Efficient multiple genome alignment. Bioinformatics. 2002;18(Suppl 1):S312–20.
- Darling AE, Mau B, Perna NT. ProgressiveMauve: multiple genome alignment with gene gain, loss and rearrangement. PLoS One. 2010;5(6):e11147.
- Angiuoli SV, Salzberg SL. Mugsy: fast multiple alignment of closely related whole genomes. Bioinformatics. 2011;27(3):334–42.
- Page AJ, Cummins CA, Hunt M, Wong VK, Reuter S, Holden MT, Fookes M, Falush D, Keane JA, Parkhill J. Roary: rapid large-scale prokaryote pan genome analysis. Bioinformatics. 2015;31(22):3691–3.
- Treangen TJ, Ondov BD, Koren S, Phillippy AM. The harvest suite for rapid core-genome alignment and visualization of thousands of intraspecific microbial genomes. Genome Biol. 2014;15(11):524.
- Laing C, Buchanan C, Taboada EN, Zhang Y, Kropinski A, Villegas A, Thomas JE, Gannon VP. Pan-genome sequence analysis using Panseq: an online tool for the rapid analysis of core and accessory genomic regions. BMC Bioinformatics. 2010;11:461.
- Chaudhari NM, Gupta VK, Dutta C. BPGA- an ultra-fast pan-genome analysis pipeline. Sci Rep. 2016;6:24373.
- Ozer EA, Allen JP, Hauser AR. Characterization of the core and accessory genomes of Pseudomonas aeruginosa using bioinformatic tools spine and AGEnt. BMC Genomics. 2014;15:737.
- Lanza VF, Baquero F, de la Cruz F, Coque TM. AcCNET (accessory genome constellation network): comparative genomics software for accessory genome analysis using bipartite networks. Bioinformatics. 2017;33(2):283–5.
- Uchiyama I. Hierarchical clustering algorithm for comprehensive orthologous-domain classification in multiple genomes. Nucleic Acids Res. 2006;34(2):647–58.
- Yang Q, Sze SH. Large-scale analysis of gene clustering in bacteria. Genome Res. 2008;18(6):949–56.
- Fouts DE, Brinkac L, Beck E, Inman J, Sutton G. PanOCT: automated clustering of orthologs using conserved gene neighborhood for pan-genomic analysis of bacterial strains and closely related species. Nucleic Acids Res. 2012;40(22):e172.
- Kriventseva EV, Tegenfeldt F, Petty TJ, Waterhouse RM, Simao FA, Pozdnyakov IA, Ioannidis P, Zdobnov EM. OrthoDB v8: update of the hierarchical catalog of orthologs and the underlying free software. Nucleic Acids Res. 2015;43(Database issue):D250–6.
- Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ. Basic local alignment search tool. J Mol Biol. 1990;215(3):403–10.
- Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25(17):3389–402.
- Li H, Durbin R. Fast and accurate short read alignment with burrows-wheeler transform. Bioinformatics. 2009;25(14):1754–60.
- Bellanger X, Payot S, Leblond-Bourget N, Guedon G. Conjugative and mobilizable genomic islands in bacteria: evolution and diversity. FEMS Microbiol Rev. 2014;38(4):720–60.
- Schmidt H, Hensel M. Pathogenicity islands in bacterial pathogenesis. Clin Microbiol Rev. 2004;17(1):14–56.
- Shapiro BJ, Friedman J, Cordero OX, Preheim SP, Timberlake SC, Szabo G, Polz MF, Alm EJ. Population genomics of early events in the ecological differentiation of bacteria. Science. 2012;336(6077):48–51.
- Felsenstein J. PHYLIP - Phylogeny Inference Package (Version 3.2). Cladistics. 1989;5:164–6.
- Letunic I, Bork P. Interactive tree of life (iTOL) v3: an online tool for the display and annotation of phylogenetic and other trees. Nucleic Acids Res. 2016;44(W1):W242–5.
- Stothard P, Wishart DS. Circular genome visualization and exploration using CGView. Bioinformatics. 2005;21(4):537–9.
- Roy PH, Tetu SG, Larouche A, Elbourne L, Tremblay S, Ren Q, Dodson R, Harkins D, Shay R, Watkins K, Mahamoud Y, Paulsen IT. Complete genome sequence of the multiresistant taxonomic outlier Pseudomonas aeruginosa PA7. PLoS One. 2010;5(1):e8842.
- Battle SE, Meyer F, Rello J, Kung VL, Hauser AR. Hybrid pathogenicity island PAGI-5 contributes to the highly virulent phenotype of a Pseudomonas aeruginosa isolate in mammals. J Bacteriol. 2008;190(21):7130–40.
- Tatusov RL, Koonin EV, Lipman DJ. A genomic perspective on protein families. Science. 1997;278(5338):631–7.
- Bork P, Dandekar T, Diaz-Lazcoz Y, Eisenhaber F, Huynen M, Yuan Y. Predicting function: from genes to genomes and back. J Mol Biol. 1998;283(4):707–25.