micropan: an R-package for microbial pan-genomics
© Snipen and Liland; licensee BioMed Central. 2015
Received: 14 October 2014
Accepted: 24 February 2015
Published: 12 March 2015
A pan-genome is defined as the set of all unique gene families found in one or more strains of a prokaryotic species. Due to the extensive within-species diversity in the microbial world, the pan-genome is often many times larger than a single genome. Studies of pan-genomes have become popular due to the easy access to whole-genome sequence data for prokaryotes. A pan-genome study reveals species diversity and gene families that may be of special interest, e.g because of their role in bacterial survival or their ability to discriminate strains.
We present an R package for the study of prokaryotic pan-genomes. The R computing environment harbors endless possibilities with respect to statistical analyses and graphics. External free software is used for the heavy computations involved, and the R package provides functions for building a computational pipeline.
We demonstrate parts of the package on a data set for the gram positive bacterium Enterococcus faecalis. The package is free to download and install from The Comprehensive R Archive Network.
As a result of accelerated development of sequencing technologies over the last decades, prokaryotic genomes can now be sequenced in short time and at small cost. This has made it popular to sequence many strains within the same species, to investigate the diversity, to look for a core set of genes common to all strains, to be able to distinguish pathogenic from non-pathogenic genotypes, etc. [1-5].
It has long since been recognized that the diversity between prokaryotic strains is much greater than between, say, individuals in a human subpopulation. All humans share, more or less, the same genes, and the differences between individuals are often found in the regulatory parts of the genome. In contrast, two different strains of E. coli may be many times more diverse than humans and chimpanzees, with perhaps 20% of the genes in one strain lacking in the other (e.g. [6,7]). For this reason, the set of unique genes found in the entire E. coli population is many times larger than what we observe in a single genome, and this collection of all genes utilized by a species was by  denoted the ‘pan-genome’.
Microbial comparative genomics has in most cases focused on the core genes, i.e. housekeeping genes conserved across all strains . In pan-genomics we consider all genes, also those referred to as accessory genes. In all cases, a ‘gene’ is actually a gene family or cluster of similar sequences, and we only consider the protein coding genes. A gene family is usually synonymous to a collection of orthologs, i.e. genes playing identical roles in different genomes. Since the grouping we come up with in pan-genomics does not always correspond to the classical gene family, we prefer the term gene cluster instead of gene family. Despite the central role of such clusters or families, there has never really been any consensus on how to compute them , and in the tool set presented here we have implemented several options.
In this paper we describe and illustrate some functions and opportunities of the micropan R-package  by using public data for the species Enterococcus faecalis. E. faecalis is a multifaceted Gram-positive bacterium with an intimate relationship to human health and disease . This species is among the first to colonize the gastro intestinal tract of newborns , and for most people E. faecalis constitutes a commensal of the normal intestinal microflora . Remarkably, while certain strains are used as probiotics , E. faecalis is also a prominent cause of nosocomial infections . Hence, this extreme diversity in phenotype among different strains motivates an investigation of the species’ pan-genome.
Within the package there is a document (casestudy. pdf) giving a detailed description, with R-code, of a case study example that can serve as a template for most studies. Here we give an overview of the package implementation.
We have chosen to use R as the ‘workbench’ for making microbial pan-genome analyses. R is a free and platform-independent tool much used in bioinformatics, and it requires only very modest programming skills. Inside R there is a huge array of other packages and solutions available, making the possible combinations endless. A long list of basic and complex statistical analyses can be made, as well as graphical displays of all kinds. The micropan package offers a set of functions specifically designed for microbial pan-genomics, but it is quite natural to also make use of other facilities in R once the basic computations are made.
In addition to R, we also make use of some external software, typically for heavy computations. All external software used by the micropan package are free and platform-independent, and the package vignette gives an overview of how to obtain them. Once installed, the external software are operated by using micropan R-functions. This means you never leave the R-environment while doing the analyses, but still make use of these extremely efficient computing engines.
A central data structure in a pan-genome analysis is the pan-matrix. Any pan-genome analysis is typically divided into two phases; first, the heavy computations required to establish the pan-matrix, and then the more interesting analyses where the pan-matrix is often used as the input.
Any pan-genome study requires a set of FASTA-files containing the protein sequences of all genes in each genome. The micropan package has functions for downloading genomes (complete or draft) from NCBI , using the Entrez programming utilities , and if gene calling is needed the function prodigalPredict can be used to call upon the Prodigal gene finder  as an external software.
To compute the pan-matrix, a large number of protein sequence comparisons must be made, which is the computational bottleneck of a pan-genome study. We have chosen the software BLAST+ or HMMER3 [20-22] for this job, highly optimized standard tools in the bioinformatics toolbox. These are invoked from R by the functions blastAllAll and hmmerScan in the micropan package. The sequence comparisons are still time-consuming, but they are only done once for each data set since all results are stored on files in a specified output-folder. If data sets are extended, only computations involving the new genomes are needed. On a multi-core computer you can easily increase the computational speed by running multiple R-sessions in parallel. We have designed the functions not to overwrite existing result files, allowing multiple R-sessions to write to the same output-folder.
where S(i;j) is the BLAST score for the alignment between the sequences when i is used as query and j as database sequence, and vice versa for S(j;i). The self-scores S(j;j) and S(i;i) are used to get the relative score (between 0 and 1), and the distance is simply the average lack of relative score. Gene families are found by a hierarchical clustering where linkage functions and cutoff thresholds can be specified.
However, any all-versus-all type of comparison suffers from the quadratic scaling, i.e. the computational load is O(G 2) where G is the number of genomes involved. It is feasible for a smallish number of genomes, but considering the ever-growing number of sequenced strains, it has some daunting perspectives. Already the approach would prove difficult for Escherichia coli. At the time of writing there are 2267 strains available at NCBI. Producing the more than 5 million (22672) BLAST result files would take weeks given ordinary computing resources, and in the meantime the number of E. coli strains has increased further!
An alternative to clustering based on orthologs is to scan all genes for protein domains using HMMER3, as suggested by , and then cluster them by their ordered sequence of non-overlapping domains. The whole idea of this indirect comparison is to first compare all proteins to some reference set of sequences, e.g. a database of protein domains, and then cluster them based on how they look in this sequence subspace. This approach scales linearly in the number of genomes, and is more robust to gene prediction errors since an accurate start-codon prediction is not as critical as for the direct comparison approach.
Finally, the pan-matrix is constructed from the clustering results based on either BLAST+ or HMMER3 comparisons. The pan-matrix has one row for each genome and one column for each gene cluster, and the value in cell (i,j) is the number of copies of gene cluster j found in genome i.
Pan-genome size can be estimated by methods previously suggested [25-27]. The Chao lower bound estimate of pan-genome size is a conservative estimate, i.e. it tends to be on the smaller side of the true size. Fitting a binomial mixture model will also produce a conservative estimate of pan-genome size, as well as an estimate of the core size. Such models also give a nice graphical view of the composition of the pan-genome with respect to gene types, ranging from core genes (always or almost always present), shell genes (often present, but lacking in subsets of genomes) to cloud genes (observed in only a few genomes).
Pan-genome openness/closedness can be estimated as suggested by , using a Heaps law type of model. This model is fitted to rarefaction curves for the data, and the display of such curves is also implemented.
Another quantity describing pan-genome diversity is the genomic fluidity suggested by , also implemented as a function in the micropan package. The fluidity is very similar to a Jaccard distance, which is a measure of overlap between genomes with respect to their gene clusters. Both Jaccard distances as well as Manhattan distances between genomes can be computed. The functions distManhattan and distJaccard will by default consider only presence and absence of any gene cluster in a genome, i.e. transform the pan-matrix values to 0′ s or 1′ s.
Based on some function for computing distances (mentioned above, or made by the user) the genomes can be displayed in pan-genome trees as described in . An alternative display is by principal component analysis (PCA). A pan-matrix, like any matrix, can be used as input in a PCA, and by plotting the genomes in a truncated score-space we may get an impression of how the genomes are located relative to each other. A PCA biplot may also reveal which gene clusters are the best discriminators between (groups of) genomes. A PCA on a pan-matrix will most typically indicate if genomes tend to cluster. The panpca function can also take arguments for copy number scaling and gene cluster weighting, as mentioned for distManhattan above.
Results and discussion
To demonstrate some facets of the micropan package we have used publicly available data for the species E. faecalis. We have chosen this rather large data set since it is typical of what we will face in near future for all species. Within the micropan package a case study document demonstrates the approach on a very small data set, to make all computations fast for a do-it-yourself exercise.
First, genome data for 342 E. faecalis genome sequencing projects were downloaded from NCBI by entrezDownload. Only 5 of these genomes were complete and the rest were draft genomes. Next, we called protein-coding genes in all genomes using the prodigalPredict function. In the 342 FASTA-files of amino acid sequences the median number of proteins was 2916 with a range of 2428 to 3482, and close to 1 million sequences in total.
In the present study, since we have as many as 342 genomes, we decided to compare sequences by their protein domain sequences. All proteins were scanned against the Pfam-A database  using the hmmerScan function. A total of 2318 unique Pfam-A entries produced hits, with around 3500 hits per genome. On average 84% of the proteins in a genome contained some Pfam-A domain(s). Each sequence having a hit in Pfam-A was clustered together with all other sequences having the exact same domains in the exact same non-overlapping order, using the dClust function. Clearly, a drawback of this approach is that the 16% proteins without hits in the Pfam-A database are excluded from the analysis. Using only domain sequences we maintain a very high degree of specificity (including only real genes in the analysis) at the cost of a lower sensitivity. This we find fruitful in pan-genome studies, where the inclusion of some ‘rubbish genes’ in every genome would assemble into a large pool of (seemingly) ORFan genes in the entire pan-genome. The sensitivity can be improved to some degree by extending the domain database, e.g. including data from CDD and/or InterPro [32,33].
The core gene clusters corresponds to the darkest blue sector, with detection probability 1. There are also two other sectors with almost the same dark blue, and detection probabilities close to 1. In practice, we would say these are also core genes. They lack in some of the 342 genomes, but since these data contain mostly incomplete draft genomes we do not expect 100% detection even for real core gene clusters. Thus, the core genes make up roughly one third of the E. faecalis pan-genome. These gene clusters comprise the stable part of the pan-genome. At the other end we notice the large sector of ‘cloud’ gene clusters (pink). If we also include the sector with detection probability 0.041 we see that such gene clusters make up half the pan-genome. Even if they are many, they are rarely found in a genome (low detection probability). The history and function of such genes is interesting from a biological viewpoint, but is of less importance for describing the population of E. faeclis strains. From Figure 3 we also note that around 10-15% of the pan-genome are ‘shell’ gene clusters (greenish colors), being present in many, but not all, genomes. These gene clusters are interesting. In this segment we find the really strong signals with respect to which strains are more or less similar. If we should design a typing scheme for this species, we would certainly consider these gene clusters as important.
Estimates of the full pan-genome size was obtained from the fitted mixture model as well as the chao  function in the micropan package. Both produce conservative estimates of how many clusters we should expect to see if we continue to sequence more strains, the results were 3327 and 3845, respectively. Both these methods assume the pan-genome is closed, i.e. its size has an upper bound. The heaps function was used to indicate if this is actually the case. The Heaps law model parameter α was estimated to 0.82, indicating an open pan-genome, the threshold being 1.0, see  for details. We will, however, not conclude from this that the E. faecalis pan-genome is ‘infinite’. An ‘open’ pan-genome must be interpreted as a size much bigger than what we have seen so far. The pan-genome size estimation is by nature a very difficult problem since it implies an extrapolation way beyond the present data. This must always be acknowledged and warrants a caution on making statements about pan-genome open/closedness.
The estimation of a pan-genome size is a variant of the ‘number of unseen species’ problem that dates back at least to . The methods implemented in the current version of this R package are just some out of several possible . Recent investigations have demonstrated that both these and some other methods have their flaws . More methodological research should be devoted to this problem, and future versions of this R package will be updated accordingly. Specifically, methods incorporating information about genome relatedness should have a potential for improving such estimates. This has recently been tried out by [37,38], and similar ideas from related problems should also be explored .
Given the pan-matrix, we can construct a pan-genome tree, as described in . Such trees illustrate the difference in gene cluster content between genomes. For 342 genomes any such tree requires too much space for being displayed in this paper, but in Additional file 1: Figure S1 we have included such a (unweighted and unscaled) tree for all genomes.
We have presented an R package including a number of functions for exploring microbial pan-genomes. The implementation uses R as the working environment, but exports heavy computations to external software. Within the package is a small case study example for the user to explore. In this paper we have used a larger data set for Enterococcus faecalis to demonstrate some of the facets in this package. We have also used other functions and packages available in R on the E. faecalis data to highlight these possibilities from within the R computing environment.
Availability and requirements
The R package is available for free from The Comprehensive R Archive Network (CRAN, ). It is most easily obtained by starting R and runninginstall.packages(~micropan~, repos=~http://cran.r-project.org/~) in the console window.Project name: micropan Project home page: Search for micropan on CRANOperating systems: Platform independentProgramming language: ROther requirements: For full functionality the system must also have available the BLAST+ software , the HMMER3 software [21,22] and the Prodigal software . License: GPL-2
This project has been financed by the Norwegian University of Life Sciences.
- Deng X, Phillippy AM, Li Z, Salzberg SL, Zhang W. Probing the pan-genome of Listeria monocytogenes: new insights into intraspecific niche expansion and genomic diversification. BMC Genomics. 2010; 11:500.View ArticlePubMedPubMed CentralGoogle Scholar
- Donati C, Hiller NL, Tettelin H, Muzzi A, Croucher NJ, Angiuoli SV, et al.Structure and dynamics of the pan-genome of Streptococcus pneumoniae and closely related species. Genome Biol. 2010; 11(10):R107.View ArticlePubMedPubMed CentralGoogle Scholar
- Lefebure T, Pavinski Bitar PD, Suzuki H, Stanhope MJ. Evolutionary dynamics of complete campylobacter pan-genomes and the bacterial species concept. Genome Biol Evol. 2010; 2:646–55.View ArticlePubMedPubMed CentralGoogle Scholar
- Galardini M, Mengoni A, Brilli M, Pini F, Fioravanti A, Lucas S, et al.Exploring the symbiotic pangenome of the nitrogen-fixing bacterium Sinorhizobium meliloti. BMC Genomics. 2011; 12:235.View ArticlePubMedPubMed CentralGoogle Scholar
- Hao P, Zheng H, Yu Y, Ding D, Gu W, Chen S, et al.Complete sequencing and pan-genomic analysis of lactobacillus delbrueckii subsp. bulgaricus reveal its genetic basis for industrial yogurt production. PLoS ONE. 2011; 6(1):e15964.View ArticlePubMedPubMed CentralGoogle Scholar
- Rasko DA, Rosovitz MJ, Myers GSA, Mongodin EF, Fricke WF, Gajer P, et al.The pangenome structure of Escherichia coli: comparative genomic analysis of E. coli commensal and pathogenic isolates. J Bacteriol. 2008; 190(20):6881–93.View ArticlePubMedPubMed CentralGoogle Scholar
- Lukjancenko O, Wassenaar T, Ussery DW. Comparison of 61 sequenced Escherichia coli genomes. Microb Ecol. 2010; 60:708–20.View ArticlePubMedPubMed CentralGoogle Scholar
- Tettelin H, Masignani V, Cieslewicz MJ, Donati C, Medini D, Ward NL, et al. Genome analysis of multiple pathogenic isolates of streptococcus agalactiae: implications for the microbial ‘pan-genome’. PNAS. 1395; 102:0–5.Google Scholar
- Maiden MCJ, Bygraves JA, Feil E, Morelli G, Rusell JE, Urwin R, et al.Multilocus sequence typing: A portable approach to the identification of clones within populations of pathogenic microorganisms. PNAS. 1998; 25:3140–5.View ArticleGoogle Scholar
- Dessimoz C, Gabaldon T, Roos DS, Sonnhammer ELL, Herrero J, for Orthologs Consortium Q. Towards community standards in the quest for orthologs. Bioinformatics. 2012; 28(6):900–4.View ArticlePubMedPubMed CentralGoogle Scholar
- R Core Team. R: A language and environment for statistical computing. Vienna: Austria: R Foundation for Statistical Computing; 2014. http://www.R-project.org/.Google Scholar
- Gilmore MS, Ferretti JJ. The thin line between gut commensal and pathogen. Science. 2003; 299(5615):1999–2002.View ArticlePubMedGoogle Scholar
- Are A, Aronsson L, Wang S, Greicius G, Lee YK, Gustafsson J, et al. Enterococcus faecalis from newborn babies regulate endogenous PPARgamma activity and IL-10 levels in colonic epithelial cells. PNAS. 2008; 105(6):1943–8.View ArticlePubMedPubMed CentralGoogle Scholar
- Noble CJ. Carriage of group D streptococci in the human bowel. J Clin Pathol. 1978; 31:1182–6.View ArticlePubMedPubMed CentralGoogle Scholar
- Domann E, Hain T, Ghai R, Billion A, Kuenne C, Zimmermann K, et al.Comparative genomic analysis for the presence of potential enterococcal virulence factors in the probiotic Enterococcus faecalis strain Symbioflor 1. Int J Med Microbiol. 2007; 297(7–8):533–9.View ArticlePubMedGoogle Scholar
- Richards MJ, Edwards JR, Culver DH, Gaynes RP. Nosocomial infections in combined medical-surgical intensive care units in the United States. Infect Control Hosp Epidemiol. 2000; 21(8):510–5.View ArticlePubMedGoogle Scholar
- NCBI Genome. http://www.ncbi.nlm.nih.gov/genome.
- NCBI E-utilities. http://www.ncbi.nlm.nih.gov/books/NBK25501/.
- Hyatt D, Chen G, LoCascio PF, Land ML, Larimer FW, Hauser LJ. Prodigal: prokaryotic gene recognition and translation initiation site identification. BMC Bioinformatics. 2010; 11:119.View ArticlePubMedPubMed CentralGoogle Scholar
- Altschul S, Gish W, Miller W, Myers E, Lipman D. Basic local alignment search tool. J Mol Biol. 1990; 215(3):403–10.View ArticlePubMedGoogle Scholar
- Eddy SR. A probabilistic model of local sequence alignment that simplifies statistical significance estimation. PLoS Comput Biol. 2008; 4(5).Google Scholar
- Eddy SR. A new generation of homology search tools based on probabilistic inference. Genome Inform. 2009; 23:205–11.PubMedGoogle Scholar
- Benedict MN, Henriksen JR, Metcalf WM, Whitaker RJ, Price ND. ITEP: An integrated toolkit for exploration of microbial pan-genomes. BMC Genomics. 2014; 15:8.View ArticlePubMedPubMed CentralGoogle Scholar
- Snipen L, Ussery DW. A domain sequence approach to pangenomics: applications to Escherichia coli. F1000 Res. 2012; 1(19):1–19.Google Scholar
- Chao A. Estimating the population size for capture-recapture data with unequal catchability. Biometrics. 1987; 43:783–91.View ArticlePubMedGoogle Scholar
- Hogg JS, Hu FZ, Janto B, Boissy R, Hayes J, Keefe R, et al.Characterization and modelling of the Haemophilus influenzae core- and supra-genomes based on the complete genomic sequences of Rd and 12 clinical nontypeable strains. Genome Biol. 2007; 8(6):R103.View ArticlePubMedPubMed CentralGoogle Scholar
- Snipen L, Almœy T, Ussery DW. Microbial comparative pan-genomics using binomial mixture models. BMC Genomics. 2009; 10:385.View ArticlePubMedPubMed CentralGoogle Scholar
- Tettelin H, Riley D, Cattuto C, Medini D. Comparative genomics: the bacterial pan-genome. Curr Opinions Microbiol. 2008; 12:472–7.View ArticleGoogle Scholar
- Kislyuk AO, Haegeman B, Bergman NH, Weitz JS. Genomic fluidity: an integrative view of gene diversity within microbial populations. BMC Genomics. 2011; 12:32.View ArticlePubMedPubMed CentralGoogle Scholar
- Snipen L, Ussery DW. Standard operating procedure for computing pangenome trees. Stand Genomic Sci. 2010; 2:135–41.View ArticlePubMedPubMed CentralGoogle Scholar
- Finn RD, Mistry J, Tate J, Coggill P, Heger A, Pollington JE, et al. The Pfam protein families database. Nucleic Acid Res. 2010; 38:D211–22.View ArticlePubMedGoogle Scholar
- Conserved Domains Database. http://www.ncbi.nlm.nih.gov/Structure/cdd/cdd.shtml.
- InterPro protein sequencee analysis and classification. http://www.ebi.ac.uk/interpro/.
- Schwarz G. Estimating the dimension of a model. Ann Stat. 1978; 6:461–4.View ArticleGoogle Scholar
- Fisher R, Corbet AS, Williams CB. The relation between the number of species and the number of individuals in a random sample of an animal population. J Anim Ecol. 1943; 12:42–58.View ArticleGoogle Scholar
- Lapierre P, Gogarten JP. Estimating the size of the bacterial pan-genome. Trends Genet. 2009; 25(3):107–10.View ArticlePubMedGoogle Scholar
- Lobkovsky AE, Wolf YI, Koonin EV. Estimation of prokaryotic supergenome size and composition from gene frequency distributions. BMC Genomics. 2014; 15:S14.View ArticlePubMedPubMed CentralGoogle Scholar
- Baumdicker F, Hess WR, Pfaffelhuber P. The infinitely many genes model for the distributed genome of bacteria. Genome Biol Evol. 2012; 4(4):443–56.View ArticlePubMedPubMed CentralGoogle Scholar
- Andersen MM, Eriksen PS, Morling N. The discrete Laplace exponential family and estimation of Y-STR haplotype frequencies. J Theor Biol. 2013; 329:39–51.View ArticlePubMedGoogle Scholar
- BioSample database. http://www.ncbi.nlm.nih.gov/biosample/.
- Kaufman L, Rousseeuw P. Finding groups in data: an introduction to cluster analysis. USA: John Wiley & Sons, Inc; 1990.View ArticleGoogle Scholar
- Feil EJ, Li BC, Aanensen DM, Hanage WP, Spratt BG. eBURST: Inferring patterns of evolutionary descent among clusters of related bacterial genotypes from multilocus sequence typing data. J Bac. 2004; 186:1518–30.View ArticleGoogle Scholar
- The Comprehensive R Archive Network. http://cran.r-project.org/.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly credited. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.