Analysis of plasmid genes by phylogenetic profiling and visualization of homology relationships using Blast2Network
© Brilli et al; licensee BioMed Central Ltd. 2008
Received: 18 June 2008
Accepted: 21 December 2008
Published: 21 December 2008
Phylogenetic methods are well-established bioinformatic tools for sequence analysis that describe the non-independence of sequences arising from their common ancestry. However, the evolutionary profiles of bacterial genes are often complicated by hidden paralogy and by extensive and/or multiple horizontal gene transfer (HGT) events, which often make bifurcating trees inappropriate. In this context, plasmid sequences are paradigms of the network-like relationships that characterize the evolution of prokaryotes. Indeed, they can be transferred among different organisms, allowing the dissemination of novel functions and thus playing a pivotal role in prokaryotic evolution. However, the study of their evolutionary dynamics is complicated by the absence of universally shared genes, a prerequisite for phylogenetic analyses.
To overcome these limitations we developed a bioinformatic package, named Blast2Network (B2N), that performs automatic phylogenetic profiling and visualizes homology relationships in a large number of plasmid sequences. The software was applied to the study of 47 completely sequenced plasmids from Escherichia, Salmonella and Shigella spp.
The tools implemented by B2N make it possible to describe and visualize in a new way some of the evolutionary features of plasmid molecules of Enterobacteriaceae; in particular, they helped to shed light on the complex history of Escherichia, Salmonella and Shigella plasmids and to focus on possible roles of unannotated proteins.
The proposed methodology is general enough to be used for comparative genomic analyses of bacteria.
Despite the huge number of available sequences, few papers have reported comparative analyses of entire plasmids with the aim of a complete classification of the functions they code for [1–4], and none has considered all the sequences from entire genera or more inclusive taxonomic groups.
Nevertheless, plasmids are extremely important in microbial evolution because they can be transferred between organisms, acting as natural vectors for the genes and functions they code for [5, 6 and references therein]. In medical epidemiology and microbial ecology plasmids are thoroughly investigated because they often carry genes encoding adaptive traits such as antibiotic resistance, pathogenesis or the ability to exploit new environments or compounds [7–9 and references therein].
While bacterial chromosomes show a relatively high conservation of their architecture, plasmid molecules are more variable in gene content and/or organization, even at short evolutionary distances. Indeed, plasmid genes can be considered to be under differential selection while moving around the bacterial community. Moreover, plasmids have a dynamic structure, i.e. genes can be gained or lost from the plasmid molecule. The same plasmid can be hosted by different organisms that inhabit different environments (e.g. differing in pH, temperature and chemical composition) and that provide different genetic backgrounds. These factors may shape both the functional role(s) of the proteins and the compositional features of plasmid DNA, such as GC or oligomer content, the latter in some cases being a very specific signature even at close phylogenetic distances.
Despite their key role in the microbial world, at least two main issues concerning plasmids remain poorly investigated: i) the function of the proteins they code for (see Additional file 1; more than 25% of proteins have no assigned COG) and ii) the evolutionary dynamics of plasmids, including their importance in bacterial evolution.
This latter point is often analyzed using phylogenetic methods that rely on rigorous statistical approaches to model the evolution of sequences (such as Maximum Likelihood or Bayesian inference). However, such methods are of restricted use in the case of plasmid molecules: they are computationally expensive when thousands of amino acid or nucleotide sequences are analyzed and, moreover, they require a set of homologous and universally shared sequences, which may be unavailable when studying plasmids.
to reconstruct the evolutionary history of plasmid molecules by identifying those having the most similar gene content;
to assign a putative function to previously uncharacterized proteins. This task is fulfilled in two ways: by means of sequence similarity of unknown or hypothetical proteins to known ones, and through a phylogenetic profiling approach. In the latter case the function of a protein is inferred by observing co-occurrence patterns. This is based on the idea that proteins involved in the same metabolic process or macromolecular complex tend to be maintained (or lost) together, and that proteins which often occur together are likely to be functionally linked;
to provide an immediate visualization of the similarities existing among sequences. One of the outputs of the program is a network of sequence similarities in a format readable by the visualization software Visone (http://visone.info/).
To test the package, we focused on plasmids harbored by members of the Enterobacteriaceae family of γ-Proteobacteria, which is one of the most studied divisions of bacteria and includes the genera Escherichia, Shigella and Salmonella, whose biomedical importance has led to a relatively high number of completely sequenced plasmids in a few species. Moreover, horizontal transfer of plasmids between them has been described, which complicates the phylogenetic information on plasmids; lastly, several pathogenesis-associated phenotypes are plasmid-borne. Consequently, the application of B2N to this dataset could reveal relationships between known pathogenesis-associated proteins and those that have not yet been characterized.
Description of the program
This approach allows the analysis of co-occurrence patterns, metabolic reconstruction and so on. In detail, taking as input the adjacency matrix that stores the sequence similarity values, B2N produces a rectangular matrix (as described in the central part of Figure 1b) whose rows are the plasmids under analysis and whose columns are the protein clusters identified through a depth-first search of the adjacency matrix. Each position of the phylogenetic profile matrix is "1" if a given plasmid (row) possesses at least one protein in the corresponding protein cluster (column), and "0" otherwise.
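The construction of the phylogenetic profile matrix can be sketched as follows. This is a minimal illustration, not the B2N implementation: it assumes the similarity network is already available as a symmetric adjacency mapping, and the names `find_clusters`, `profile_matrix`, `protein_to_plasmid` and the toy protein and plasmid identifiers are all hypothetical.

```python
def find_clusters(adjacency):
    """Return protein clusters as connected components, found by iterative DFS.

    adjacency: dict mapping a protein id to the set of protein ids it is
    similar to (assumed symmetric; this layout is an assumption of the sketch).
    """
    seen, clusters = set(), []
    for start in adjacency:
        if start in seen:
            continue
        stack, component = [start], []
        seen.add(start)
        while stack:
            node = stack.pop()
            component.append(node)
            for neighbor in adjacency.get(node, ()):
                if neighbor not in seen:
                    seen.add(neighbor)
                    stack.append(neighbor)
        clusters.append(sorted(component))
    return clusters


def profile_matrix(clusters, protein_to_plasmid, plasmids):
    """Rows are plasmids, columns are clusters; 1 marks presence, 0 absence."""
    return [
        [1 if any(protein_to_plasmid[p] == plasmid for p in cluster) else 0
         for cluster in clusters]
        for plasmid in plasmids
    ]


# Toy data: six proteins on three hypothetical plasmids.
adjacency = {
    "a1": {"b1"}, "b1": {"a1", "c1"}, "c1": {"b1"},
    "a2": {"c2"}, "c2": {"a2"},
    "b2": set(),
}
protein_to_plasmid = {"a1": "pA", "a2": "pA", "b1": "pB",
                      "b2": "pB", "c1": "pC", "c2": "pC"}
plasmids = ["pA", "pB", "pC"]

clusters = find_clusters(adjacency)
profiles = profile_matrix(clusters, protein_to_plasmid, plasmids)
```

In this toy case the three clusters are {a1, b1, c1}, {a2, c2} and {b2}, so plasmids pA and pC end up with identical profiles while pB differs.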
One of the commonly used metrics for binary data comparison is the Jaccard similarity coefficient. Given two phylogenetic profile vectors in binary form (A and B in this case, with n observations), the Jaccard coefficient is defined as the size of the intersection divided by the size of the union of the sample sets: J(A, B) = |A ∩ B| / |A ∪ B|. The 'Jaccard distance', which measures dissimilarity between sample sets, is obtained by dividing the difference between the sizes of the union and the intersection of the two sets by the size of the union: Jδ(A, B) = (|A ∪ B| − |A ∩ B|) / |A ∪ B| = 1 − J(A, B).
The Jaccard coefficient is a useful measure of the overlap between the attributes of 'A' and 'B'. Each attribute of 'A' and 'B' can be either 0 ('absence') or 1 ('presence'). The counts of each combination of attributes are defined as follows: M11 (M00) is the total number of attributes where 'A' and 'B' both have a value of 1 (0); M01 (M10) is the total number of attributes where the attribute of 'A' is 0 (1) and the attribute of 'B' is 1 (0). Each attribute must fall into one of these four categories, so their sum equals n. The Jaccard similarity coefficient is J = M11 / (M01 + M10 + M11). Blast2Network calculates the Jaccard distance along both dimensions of the phylogenetic profile matrix, giving the distance between plasmids in terms of shared genes and the distance between the occurrence patterns of clusters across plasmids. The Jaccard distance matrices are then used to construct two neighbor-joining dendrograms (Figure 1b). The first describes similarities in the gene content of the plasmids; the second groups together the protein clusters with the most similar occurrence patterns across plasmids. Random permutations of the original data are used to compute the statistical significance of the Jaccard distances.
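As an illustration of these definitions, the sketch below computes the Jaccard distance from the M11, M10 and M01 counts, builds the distance matrices for both dimensions of a profile matrix, and estimates a permutation p-value. The text does not specify the permutation scheme used by B2N, so shuffling the entries of one profile is an assumption made here, as are the function names and the convention of returning 0 for two empty profiles.

```python
import random


def jaccard_distance(a, b):
    """Jaccard distance 1 - M11 / (M01 + M10 + M11) for two binary profiles."""
    m11 = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    m10 = sum(1 for x, y in zip(a, b) if x == 1 and y == 0)
    m01 = sum(1 for x, y in zip(a, b) if x == 0 and y == 1)
    union = m11 + m10 + m01
    if union == 0:
        return 0.0  # assumed convention: two all-zero profiles are identical
    return 1.0 - m11 / union


def distance_matrix(rows):
    """Pairwise Jaccard distances between the rows of a binary matrix."""
    return [[jaccard_distance(r, s) for s in rows] for r in rows]


def transpose(matrix):
    """Swap rows and columns, to compare clusters instead of plasmids."""
    return [list(column) for column in zip(*matrix)]


def permutation_p_value(a, b, n_permutations=1000, seed=0):
    """Fraction of shuffles of b that are at least as close to a as observed.

    The shuffling scheme is an assumption of this sketch, not B2N's method.
    """
    rng = random.Random(seed)
    observed = jaccard_distance(a, b)
    shuffled = list(b)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)
        if jaccard_distance(a, shuffled) <= observed:
            hits += 1
    return (hits + 1) / (n_permutations + 1)


# Toy profile matrix: rows are plasmids, columns are protein clusters.
toy_profiles = [[1, 1, 0], [1, 0, 1], [1, 1, 0]]
plasmid_distances = distance_matrix(toy_profiles)
cluster_distances = distance_matrix(transpose(toy_profiles))
```

Either distance matrix could then be passed to any neighbor-joining implementation to obtain the corresponding dendrogram.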
B2N also outputs the BLAST post-processing results as a network in the format of Visone (http://visone.info/), a freely available software package for network visualization and analysis. In doing so, it takes advantage of several pieces of information: the position and the color of the nodes (proteins) in the network correspond to the source plasmid, whereas the links indicate the existence of a given degree of sequence similarity between nodes. To reduce the dimensionality of the networks it is possible to use the Jaccard distance matrices to construct two hypergraphs in which each plasmid or each protein cluster, respectively, is collapsed to a single node, with nodes connected by edges whose values reflect the significance of the calculated Jaccard distance (see below and Additional file 2).
B2N can include additional information in the network by assigning to each node a numerical (or binary) value, which can be visualized in Visone as the size of the node; this node-associated value might be a compositional measure, such as the GC content and/or the codon adaptation index [17, 18] of the corresponding gene. For this purpose B2N has two built-in methods, but users can also supply their own list of values as a text file. The first built-in method writes node values corresponding to the GC content of a sequence, while the second implements a previously described dinucleotide analysis, obtaining a composition-based dissimilarity index of a gene sequence with respect to the source plasmid (or genome). For each possible dinucleotide, say xy, and a gene s, ρxy(s) = fxy(s) / (fx(s) · fy(s)). From this value the program obtains δ(s, g) = (1/16) · Σ |ρxy(s) − ρxy(g)|, summed over all 16 dinucleotides, which is a measure of the compositional bias of a given sequence (s) with respect to a reference sequence (g), i.e. the genome or the entire plasmid. The δ value can be used to detect genes that have been recently transferred and have since maintained the compositional properties of their original plasmid.
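The two compositional measures can be sketched as follows, reading the odds ratio as ρxy(s) = fxy(s) / (fx(s) · fy(s)) and δ as the mean absolute difference over the 16 dinucleotides. This is a simplified illustration rather than the B2N code: it counts overlapping dinucleotides on a single strand, ignores ambiguous bases, and sets ρ to 0 when a base is absent; these choices and the function names are assumptions.

```python
from itertools import product


def gc_content(seq):
    """Fraction of G and C bases in a nucleotide sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def dinucleotide_odds(seq):
    """Odds ratio rho_xy = f_xy / (f_x * f_y) for all 16 dinucleotides."""
    seq = seq.upper()
    if len(seq) < 2:
        raise ValueError("sequence must contain at least two bases")
    mono = {base: seq.count(base) / len(seq) for base in "ACGT"}
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]  # overlapping pairs
    rho = {}
    for x, y in product("ACGT", repeat=2):
        f_xy = pairs.count(x + y) / len(pairs)
        expected = mono[x] * mono[y]
        # Assumed convention: rho is 0 when either base never occurs.
        rho[x + y] = f_xy / expected if expected > 0 else 0.0
    return rho


def delta(gene, reference):
    """Mean absolute difference of the odds ratios of gene and reference."""
    rho_s = dinucleotide_odds(gene)
    rho_g = dinucleotide_odds(reference)
    return sum(abs(rho_s[k] - rho_g[k]) for k in rho_s) / 16
```

A gene compared with itself gives δ = 0, and larger values indicate a composition that departs more strongly from that of the reference replicon.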
Sequence data source and software availability
Escherichia coli KL4
Escherichia coli O157:H7 str. Sakai
Escherichia coli O157:H7 str. Sakai
Salmonella enterica serovar Enteritidis
Salmonella enterica subsp. enterica serovar Berta
Salmonella enterica subsp. enterica serovar Choleraesuis
Salmonella enterica subsp. enterica serovar Choleraesuis
Salmonella enterica subsp. enterica serovar Choleraesuis
Salmonella enterica subsp. enterica serovar Choleraesuis
Salmonella enterica subsp. enterica serovar Typhi str. CT18
Salmonella enterica subsp. enterica serovar Typhi str. CT18
Shigella boydii Sb227
Shigella dysenteriae Sd197
Shigella flexneri 2a str. 301
Shigella sonnei Ss046
The B2N software, together with the user's manual, can be requested directly from the authors and is also available as Additional file 3.
Results and discussion
Visual representation of sequence homology network
One of the problems faced with such complex data is the reduction of dimensionality, so that important relationships can be identified more easily. Similarities in gene content between different plasmids can be better visualized by collapsing all the proteins belonging to the same plasmid into a single node. In this way a hypergraph is obtained in which each node represents a single plasmid. The connections can be obtained from the plasmid vs plasmid Jaccard distance matrix or, better, from the matrix of p-values, so that each link in the hypergraph quantifies the significance of a given association between plasmids (shown in Additional file 2); a simple hard threshold then changes the stringency for the inclusion of edges in the hypergraph.
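The hard thresholding step can be illustrated with a short sketch. It assumes a symmetric matrix of p-values between plasmids is already available; the function name, the plasmid labels and the p-values below are invented purely for illustration and are not results from the dataset.

```python
def hypergraph_edges(names, p_values, threshold=0.05):
    """Keep one edge per plasmid pair whose p-value is within the threshold.

    names: node labels; p_values: symmetric matrix in the same order.
    Returns (node, node, p-value) tuples for the upper triangle only.
    """
    edges = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if p_values[i][j] <= threshold:
                edges.append((names[i], names[j], p_values[i][j]))
    return edges


# Invented example values, for illustration only.
example_names = ["pA", "pB", "pC"]
example_p_values = [
    [1.0, 0.001, 0.2],
    [0.001, 1.0, 0.03],
    [0.2, 0.03, 1.0],
]

loose = hypergraph_edges(example_names, example_p_values, threshold=0.05)
strict = hypergraph_edges(example_names, example_p_values, threshold=0.01)
```

Lowering the threshold from 0.05 to 0.01 drops the pB to pC edge and leaves only the most significant association, which is how the stringency of the collapsed network can be tuned.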
Network data analysis
Most plasmids contain at least some genes coding for highly interconnected proteins; however, some of them (e.g. pRK2, ColJs Cjl, pLG13, CloDF13) exhibited only a few connections. Hence, these plasmids share few genes with the other members of the dataset at these threshold levels. This, in turn, suggests that they might have experienced fewer recombination events than the others.
Several proteins (about 40% of all the connected nodes) were found to be mobile elements (transposases, IS and transposon-related sequences), representing the most highly connected proteins in the network.
As shown in Figure 3, proteins shared by Escherichia, Salmonella and Shigella plasmids included: a) the antirestriction protein KlcA, involved in the broad host range of IncP plasmids; b) the RNA chaperone FinO, related to repression of sex pilus formation [21, 22]; c) the CcdB protein, which is involved in plasmid stability by killing bacteria that lose the plasmid.
Several clusters were composed of proteins shared by Shigella spp. and Escherichia coli; this finding is in agreement with the notion that they belong to the same species. Moreover, several proteins were shared only by E. coli and Salmonella plasmids, including genetic determinants for antibiotic resistance such as TetA and TetR, β-lactamases (Bla) [25, 27], and genes for resistance to aminoglycosides (AadA) and sulphonamides (DHPT synthase). A similar scenario was observed for sex pilus-related proteins, such as the Tra and Trb proteins: out of 22 different Tra groups, 21 contain proteins from E. coli and Salmonella, but only 3 groups (TraD and TraI for DNA transport, and TraX for pilin acetylation) contain Shigella sequences. Likewise, out of 5 different Trb groups, we observed Shigella plasmid sequences in a single cluster (TrbH). Moreover, the proteins TraP, TrbA and TrbJ seem to be present only in plasmids from E. coli, while all the other sex pilus-related proteins are shared with Salmonella. These data are in agreement with the evidence for recent transfer of plasmid genes between enteroinvasive Escherichia and Salmonella [26, 27].
Concerning the pathogenesis-related genes, Shigella plasmids seem to have a specific set of these genes, comprising at least some of the proteins of the type III secretion system (TTS), e.g. the Mxi, Spa, Ipa, Ipg and Osp proteins.
Overall, despite the closer phylogenetic relationship between E. coli and Shigella, plasmid gene content appeared more similar between E. coli and Salmonella with respect to antibiotic resistance and sex pilus formation.
A relevant exception is represented by five Shigella plasmids (pCP301, pSB4 227, pSD1 197, pWR501, and pSS 046) that form a unique clade (which, however, also includes the pC plasmid from Salmonella enterica).
Figure 5 and Additional file 3 report the co-occurrence clustering for the protein dataset of the selected plasmids. In general, plasmids are believed to share very few common functions (mainly related to their replication and mobility) and several accessory genes, and to have a complex history of recombination events either among themselves or with the host chromosome(s). Here, we show that most of the co-occurrence clusters are due to proteins related to plasmid transfer (e.g. Trb and Tra proteins). Nevertheless, several clusters show the co-occurrence of hypothetical proteins with proteins with predicted functions, such as type II secretion proteins and pilins (BfpK), proteins involved in mobilization (MobA, MbkC) and virulence factors (IroN). These analyses may help to direct experimental work aimed at elucidating the functional role of these proteins.
In conclusion, the tools implemented by B2N make it possible to describe and visualize in a new way some of the evolutionary features of plasmid molecules of Enterobacteriaceae. The most important results obtained on this dataset concern the possibility of uncovering, by means of phylogenetic profiling and protein network relationships, some of the molecular history that shaped the evolution of this group of plasmids. In particular, the data suggest a large amount of horizontal transfer and rearrangement of plasmid molecules between E. coli, Salmonella and Shigella. Moreover, some plasmids from Shigella share a common history with Salmonella, and several hypothetical proteins form co-occurrence clusters, suggesting possible roles in plasmid maintenance and/or pathogenesis that could be investigated by conventional genetic techniques.
The proposed method is general enough to serve as a new tool for comparative genomic analyses of bacteria and can work at least within the range of phylogenetic distances at which BLAST can find homologs. For this reason, the B2N approach could help to solve some questions linked to the presence of (few) well-conserved functions within plasmid datasets from wide taxonomic ranges (e.g. functions related to transfer or replication). Moreover, possible applications of the method could also include chromosomal replicons, in an attempt to depict histories of gene rearrangement and integration from plasmids to chromosomes and vice versa.
TTS: type III secretion system.
This work was supported by the Italian Ministry of Research, FISR funding "Soil sink". MBr was supported by a postdoctoral fellowship of the University of Firenze.
- Gilmour MW, Thomson NR, Sanders M, Parkhill J, Taylor DE: The complete nucleotide sequence of the resistance plasmid R478: defining the backbone components of incompatibility group H conjugative plasmids through comparative genomics. Plasmid 2004, 52: 182–202. doi:10.1016/j.plasmid.2004.06.006
- Guerrero G, Peralta H, Aguilar A, Diaz R, Villalobos MA, Medrano-Soto A, Mora J: Evolutionary, structural and functional relationships revealed by comparative analysis of syntenic genes in Rhizobiales. BMC Evol Biol 2005, 5: 55. doi:10.1186/1471-2148-5-55
- Johnson TJ, Siek KE, Johnson SJ, Nolan LK: DNA sequence and comparative genomics of pAPEC-O2-R, an avian pathogenic Escherichia coli transmissible R plasmid. Antimicrob Agents Chemother 2005, 49: 4681–4688. doi:10.1128/AAC.49.11.4681-4688.2005
- Tauch A, Puhler A, Kalinowski J, Thierbach G: Plasmids in Corynebacterium glutamicum and their molecular classification by comparative genomics. J Biotechnol 2003, 104: 27–40. doi:10.1016/S0168-1656(03)00157-3
- Kohiyama M, Hiraga S, Matic I, Radman M: Bacterial sex: playing voyeurs 50 years later. Science 2003, 301: 802–803. doi:10.1126/science.1085154
- Thomas CM, Nielsen KM: Mechanisms of, and barriers to, horizontal gene transfer between bacteria. Nat Rev Microbiol 2005, 3: 711–721. doi:10.1038/nrmicro1234
- Burrus V, Waldor MK: Shaping bacterial genomes with integrative and conjugative elements. Res Microbiol 2004, 155: 376–386. doi:10.1016/j.resmic.2004.01.012
- Espinosa-Urgel M: Plant-associated Pseudomonas populations: molecular biology, DNA dynamics, and gene transfer. Plasmid 2004, 52: 139–150. doi:10.1016/j.plasmid.2004.06.004
- Dennis JJ: The evolution of IncP catabolic plasmids. Curr Opin Biotechnol 2005, 16: 291–298. doi:10.1016/j.copbio.2005.04.002
- Karlin S, Burge C: Dinucleotide relative abundance extremes: a genomic signature. Trends Genet 1995, 11: 283–290. doi:10.1016/S0168-9525(00)89076-9
- Fernández-López R, Garcillán-Barcia MP, Revilla C, Lázaro M, Vielva L, de la Cruz F: Dynamics of the IncW genetic backbone imply general trends in conjugative plasmid evolution. FEMS Microbiol Rev 2006, 30: 942–966. doi:10.1111/j.1574-6976.2006.00042.x
- Pellegrini M, Marcotte EM, Thompson MJ, Eisenberg D, Yeates TO: Assigning protein functions by comparative genome analysis: protein phylogenetic profiles. Proc Natl Acad Sci USA 1999, 96: 4285–4288. doi:10.1073/pnas.96.8.4285
- Linton A, Hinton MH: Enterobacteriaceae associated with animals in health and disease. Soc Appl Bacteriol Symp Ser 1988, 17: 71S–85S.
- Slater FR, Bailey MJ, Tett AJ, Turner SL: Progress towards understanding the fate of plasmids in bacterial communities. FEMS Microbiol Ecol 2008, 66: 3–13. doi:10.1111/j.1574-6941.2008.00505.x
- Lavigne JP, Blanc-Potard AB: Molecular evolution of Salmonella enterica serovar Typhimurium and pathogenic Escherichia coli: from pathogenesis to therapeutics. Infect Genet Evol 2008, 8: 217–226. doi:10.1016/j.meegid.2007.11.005
- Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucl Acids Res 1997, 25: 3389–3402. doi:10.1093/nar/25.17.3389
- Sharp PM, Li WH: The codon Adaptation Index – a measure of directional synonymous codon usage bias, and its potential applications. Nucleic Acids Res 1987, 15: 1281–1295. doi:10.1093/nar/15.3.1281
- Ramazzotti M, Brilli M, Fani R, Manao G, Degl'Innocenti D: The CAI Analyser Package: inferring gene expressivity from raw genomic data. In Silico Biol 2007, 7: 507–526.
- van Passel MW, Bart A, Luyf AC, van Kampen AH, Ende A: Compositional discordance between prokaryotic plasmids and host chromosomes. BMC Genomics 2006, 7: 26. doi:10.1186/1471-2164-7-26
- Larsen MH, Figurski DH: Structure, expression, and regulation of the kilC operon of promiscuous IncP alpha plasmids. J Bacteriol 1994, 176: 5022–5032.
- Dionisio F, Matic I, Radman M, Rodrigues OR, Taddei F: Plasmids spread very fast in heterogeneous bacterial communities. Genetics 2002, 162: 1525–1532.
- Arthur DC, Ghetu AF, Gubbins MJ, Edwards RA, Frost LS, Glover JN: FinO is an RNA chaperone that facilitates sense-antisense RNA interactions. EMBO J 2003, 22: 6346–6355. doi:10.1093/emboj/cdg607
- Aguirre-Ramirez M, Ramirez-Santos J, Van Melderen L, Gomez-Eichelmann MC: Expression of the F plasmid ccd toxin-antitoxin system in Escherichia coli cells under nutritional stress. Can J Microbiol 2006, 52: 24–30. doi:10.1139/w05-107
- Escobar-Paramo P, Giudicelli C, Parsot C, Denamur E: The evolutionary history of Shigella and enteroinvasive Escherichia coli revised. J Mol Evol 2003, 57: 140–148. doi:10.1007/s00239-003-2460-3
- Hartman AB, Essiet II, Isenbarger DW, Lindler LE: Epidemiology of tetracycline resistance determinants in Shigella spp. and enteroinvasive Escherichia coli: characterization and dissemination of tet(A)-1. J Clin Microbiol 2003, 41: 1023–1032. doi:10.1128/JCM.41.3.1023-1032.2003
- Boyd EF, Hartl DL: Recent horizontal transmission of plasmids between natural populations of Escherichia coli and Salmonella enterica. J Bacteriol 1997, 179: 1622–1627.
- Call DR, Kang MS, Daniels J, Besser TE: Assessing genetic diversity in plasmids from Escherichia coli and Salmonella enterica using a mixed-plasmid microarray. J Appl Microbiol 2006, 100: 15–28. doi:10.1111/j.1365-2672.2005.02775.x
- Thomas CM: Paradigms of plasmid organization. Mol Microbiol 2000, 37: 485–491. doi:10.1046/j.1365-2958.2000.02006.x
- Tian W, Skolnick J: How well is enzyme function conserved as a function of pairwise sequence identity? J Mol Biol 2003, 333: 863–882. doi:10.1016/j.jmb.2003.08.057
- Friedberg I: Automated protein function prediction – the genomic challenge. Brief Bioinform 2006, 7: 225–242. doi:10.1093/bib/bbl004
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.