  • Methodology article
  • Open access

A computational approach for identifying pathogenicity islands in prokaryotic genomes



Pathogenicity islands (PAIs), distinct genomic segments of pathogens encoding virulence factors, represent a subgroup of genomic islands (GIs) that have been acquired by horizontal gene transfer events. Up to now, computational approaches for identifying PAIs have focused on detecting genomic regions that differ from the rest of the genome only in their base composition and codon usage. These approaches often lead to the identification of genomic islands rather than PAIs.


We present a computational method for detecting potential PAIs in complete prokaryotic genomes by combining sequence similarities and abnormalities in genomic composition. We first collected 207 GenBank accessions containing either part or all of the reported PAI loci. In sequenced genomes, strips of PAI-homologs were defined based on the proximity of the homologs of genes in the same PAI accession. An algorithm reminiscent of a sequence-assembly procedure was then devised to merge overlapping or adjacent genomic strips into a large genomic region. Among the defined genomic regions, PAI-like regions were identified by the presence of homolog(s) of virulence genes. In addition, GIs were postulated by calculating G+C content anomalies and codon usage bias. Of 148 prokaryotic genomes examined, 23 pathogenic and 6 non-pathogenic bacteria contained 77 candidate PAIs that partly or entirely overlap GIs.


Supporting the validity of our method, the list of candidate PAIs included thirty-four PAIs previously identified in genome sequencing papers. Furthermore, in some instances, our method was able to detect entire PAIs for which only partial sequences are available. In our study, the method proved efficient for demarcating potential PAIs. Also, the function(s) and origin(s) of a candidate PAI can be inferred by investigating the PAI queries comprising it. Identification and analysis of potential PAIs in prokaryotic genomes will broaden our knowledge of the structure and properties of PAIs and the evolution of bacterial pathogenesis.


PAIs are distinct genetic elements of pathogens encoding various virulence factors such as protein secretion systems, host invasion factors, iron uptake systems, and toxins [1, 2]. PAIs are a subset of genomic islands that have been acquired through horizontal gene transfer (HGT) events and confer virulence upon the recipient. PAIs can be identified by features such as the presence of virulence genes, biased G+C content and codon usage, carriage of mobile sequence elements, and/or association with tRNA genes or repeated sequences at their boundaries [3].

Identification of PAIs is essential in understanding the development of disease and the evolution of bacterial pathogenesis [2]. As complete genome sequences rapidly accumulate, various in silico methods have been developed to detect HGT [4–7]. Most of these methods are based on the detection of genomic regions having atypical G+C content, patterns of codon usage bias, or dinucleotide anomalies. However, compositional approaches may generate many false positives due to other factors such as selection and mutation bias [8, 9], and many false negatives because the composition of a transferred sequence adjusts to that of its new genome by amelioration [10]. In fact, these methods detect different sets of ORFs as being of foreign origin when applied to the genome of Escherichia coli K-12 [11]. Thus, combining multiple lines of evidence can be beneficial in determining whether a gene or a group of genes has been acquired by HGT.

While studies on detecting horizontally transferred genes or GIs in genome sequences have been intensively carried out, little has been reported for PAIs. Considering that a PAI is a GI encoding virulence factors, compositional criteria such as G+C content and codon usage are not sufficient for identifying PAIs because such approaches can only lead to the identification of GIs [2]. In this work, we designed a computational method for identifying PAIs in sequenced genomes by combining a homology-based method with the detection of abnormalities in genomic composition. To do this, we collected published PAI data and checked the virulence genes on the PAI loci. We applied this approach to 148 prokaryotic genomes and identified 77 candidate PAIs. The detected regions contain virulence genes and relics of HGT events.


Genomic islands in bacterial genomes

For the 157 chromosomes examined (Table 1S [see Additional file 1]), the length proportion of GIs to the chromosome averaged 10.1%. Nanoarchaeum equitans, which has the smallest genome of any sequenced microbe, contained the smallest proportion of GIs, only 2.9%. Leptospira interrogans, which is responsible for the worldwide water-borne zoonosis leptospirosis, contained the largest: 34.7% for chromosome I and 32.2% for chromosome II. The genome of L. interrogans was reported to have the largest number of proteins with structural similarity to eukaryal and archaeal proteins as compared to other bacteria [12]. In general, pathogens showed larger proportions of GIs than related non-pathogenic species, e.g., 15.7% for Corynebacterium diphtheriae versus 7.6% for C. glutamicum, and 12.3% for E. coli CFT073 versus 8.9% for E. coli K-12.

PAI-like regions

When every ORF contained in the 207 PAI loci (see Table 1 and supplementary Table 2S for the complete information [see Additional file 2]) was similarity-searched against the ORFs present in the 148 prokaryotic genomes, 1,490 genomic strips of PAI-associated genes were defined based on the proximity of the homologs of genes from the same PAI accession. Overlapping strips were then merged into 525 genomic regions in 83 chromosomes (Figure 1). Among these regions, 241 contained at least one gene homologous to the virulence genes on the PAI loci; these are referred to as PAI-like regions in this study. Of these, 77 PAI-like regions (total 1,652,758 bp) partly or entirely overlapped GIs, while the remaining 164 regions (total 1,553,923 bp) did not contain any part of a GI. In this report, we call the former candidate PAIs (cPAIs). Figure 2 plots the PAI-like regions by their G+C content and by the length proportion of horizontally transferred genes. Of all the PAI-like regions, 52% show lower G+C content than their genomes (average difference of -0.6%, standard deviation of 3.8), whereas 75% of the cPAIs have lower G+C content (-2.7% and 4.7, respectively). The plot indicates that clusters of PAI-homologs are often located in the backbone sequence, while the detected GIs tend to have lower G+C content.

Table 1 A shortened list of the PAI loci mentioned in the text (see supplementary Table 2S for the complete list of 207 collected PAI loci) [see Additional file 2]
Figure 1

Flow chart of the algorithm.

Figure 2

Projection of PAI-like regions in their G+C contents and length-proportion of horizontally transferred genes. PAI-like regions that overlap genomic islands (cPAI) and those that do not (nPAI) are plotted by G+C content (X axis) and length proportion of horizontally transferred genes (Y axis). Symbols denote the following: cPAI (plus sign), nPAI (minus sign), and cPAI and nPAI matching a PAI identified in a genome sequencing paper (circle and triangle, respectively).

Candidate PAIs

cPAIs, PAI-like anomalous regions, were present in 29 bacteria including 6 non-pathogens, and their sizes ranged from 3.7 kb to 137.5 kb with an average length of 21.5 kb (Table 2, supplementary Table 3S [see Additional file 3]). Most of these regions contained transposase or integrase genes or insertion sequence elements, and were associated with tRNA genes at their boundaries, which is indicative of genomic islands. In some instances, our method allowed the detection of entire PAIs for which only partial sequences had been reported in the original papers (Figure 3). This is because PAIs often share conserved regions, so regions homologous to other PAIs can be located in the same PAI locus. Interestingly, cPAIs were detected in six strains that are known to be non-pathogens. The genes they contain code for an ABC transporter (Bacillus halodurans), flagellar proteins (Bacillus subtilis), iron transport and fimbrial proteins (E. coli K-12), transmembrane sensors and outer membrane efflux proteins (Nitrosomonas europaea), or nodulation proteins (Bradyrhizobium japonicum). Genes detected in Mesorhizobium loti, a bacterium that forms globular nodules and performs nitrogen-fixing symbiosis with leguminous plants, are involved in the nodulation process and a type III secretion system (TTSS) [13]. However, the unexpected cPAIs in non-pathogens should be interpreted as clusters of potentially horizontally transferred genes that have homology to virulence genes.

Table 2 PAIs in prokaryotic chromosomes (see supplementary Table 3S for the complete information) [see Additional file 3]
Figure 3

Example of a PAI-like region and a cPAI in genome sequences. The 48.5-kb PAI ICFT073 from E. coli CFT073 was detected by merging genomic strips similar to known PAI loci (yellow strips), including partial sequences of PAI ICFT073. The genomic region contains homologs of the virulence genes on the known PAIs (red arrows) and a genomic island (grey bar). Therefore, this PAI-like region is considered a cPAI. Red and orange arrows in yellow strips denote virulence and putative virulence genes, respectively. Numbers on the yellow strips indicate parts of the PAI loci homologous to the genomic strips: 1. PAI I536 (accession number: AJ488511, host strain: E. coli 536); 2. PAI II536 (AJ494981, E. coli 536); 3. PAI III536 (X16664, E. coli 536); 4. LEE (AJ278144, E. coli 4797/97); 5 and 6. LEE (AF071034, E. coli O157:H7 EDL933); 7 and 8. PAI IICFT073 (AF447814, E. coli CFT073); 9. PAI ICFT073 (AF081284, E. coli CFT073); 10. PAI ICFT073 (AF081285, E. coli CFT073). Note that the accessions of PAI IICFT073 that were included in the query set are partial sequences of the PAI. Some boxes are joined by a line to save space in the figure.

Among the 77 cPAIs, 34 matched PAIs that have been described in genome sequencing papers (Table 2, Figure 2). Of these, 27 cPAIs entirely matched known PAIs – a PAI (in Enterococcus faecalis), PAI I, IICFT073 (E. coli CFT073), LEE (E. coli O157 EDL933 and Sakai), cag PAI (Helicobacter pylori 26695 and J99), the TTSS and tc loci (Photorhabdus luminescens), SPI-2,4,5 (Salmonella enterica serovar Typhi Ty2 and CT18, and serovar Typhimurium LT2), SPI-3 (S. typhimurium LT2), SHI-1, 2 (Shigella flexneri 2a 2457T and 301), VPI (Vibrio cholerae), Hrp PAI (Xanthomonas campestris), and HPI (Yersinia pestis CO92 and KIM). One end of each of the PAIs SPI-1 (in three S. enterica strains), SaPIm3 (S. aureus Mu50), and SaPIn3 (S. aureus N315) was found in 5 cPAIs, while the other end was found in seemingly backbone sequences. νSaβ in S. aureus MW2 and the CTX locus in V. cholerae N16961 were partly matched. Nine cPAIs span TTSS loci that were not annotated as PAIs in the genome sequencing data.

Regions homologous to a given PAI were frequently found in genomes of various taxa. In particular, parts of PAIs originally identified in enteropathogenic bacteria were detected not only in enterobacteria but also in phyla other than the Gammaproteobacteria in our study (Figure 4). The number of genomes containing PAI-like regions was drastically reduced when we considered only genomic regions that overlap GIs. Elements of PAI I536 to PAI III536 in the uropathogenic E. coli strain 536 showed high similarities to other members of the Enterobacteriaceae. This is consistent with the previous report that PAI-specific sequences of E. coli strain 536 were frequently found in pathogenic and commensal E. coli isolates by using the "E. coli pathoarray" [14]. Parts of the LEE PAI in enterohemorrhagic E. coli O157:H7, enteropathogenic E. coli E2348/69, rabbit-specific enteropathogenic E. coli 83/89, and rabbit diarrheagenic E. coli RDEC-1 similarly matched genomic regions of different taxa.

Figure 4

Distribution of genomic regions homologous to the PAIs from enteropathogenic bacteria. For each PAI, the left bar denotes the number of genomes containing at least one cPAI, and the right hatched bar denotes the number of genomes containing at least one PAI-like region. Different colors represent the number of genomes from different taxa – Enterobacteriales (black), Proteobacteria except Enterobacteriales (red), and phyla other than Proteobacteria (green). The PAIs shown are PAI I,II,III536 in uropathogenic E. coli 536, PAI IICFT073 in uropathogenic E. coli CFT073, LEE in enterohemorrhagic E. coli O157, SPI-2 in S. typhimurium, SHI-2 and SRL in S. flexneri, HPI in Y. enterocolitica, and the TTSS locus in Photorhabdus luminescens.

In most cases, the distribution of regions homologous to PAIs from bacteria other than enterobacteria, such as VPI of Vibrio cholerae, cag PAI of Helicobacter pylori, and SaPI1 of Staphylococcus aureus strains, was restricted to their host strains. However, widespread distribution in different species was evident for PAGI-1 of Pseudomonas aeruginosa and the Hrp PAI of P. syringae, Xanthomonas spp., Burkholderia pseudomallei, and Ralstonia solanacearum. Variations of cPAIs were observed between EDL933 and Sakai, which belong to the same E. coli O157 group (Table 2). This discrepancy results from the different distribution of prophages in the two genomes. In addition, differences in ORF prediction between research groups affected the determination of GIs.

PAI-like regions that did not meet the criteria

164 PAI-like regions in 57 prokaryotes, including 16 non-pathogenic bacteria and one archaeon, did not overlap GIs (supplementary Table 4S) [see Additional file 4]. Their sizes ranged from 1.9 to 50.6 kb and averaged 9.5 kb. Most of them encoded flagellar/fimbrial biosynthesis or iron uptake systems. Among these regions, 14 were PAIs published in the genome sequencing papers. Six PAIs – Hrp PAI (in Pseudomonas syringae pv. tomato DC3000), SPI-3 (S. enterica serovar Typhi strains Ty2 and CT18), SaPIm1 (in S. aureus Mu50), SaPIn1 (S. aureus N315) and νSa3 (S. aureus MW2) – entirely matched, and 5 counterparts of the PAIs that partly matched cPAIs were found in these regions. Parts of LIPI-1 in Listeria innocua and two regions of internalins in L. monocytogenes EGD were also found. In fact, the Hrp PAI and LIPI-1 have DNA compositions similar to the core genomes, and are suggested to have been acquired a long time ago [15, 16].


Analyses of the structures of many microbial genomes have made it clear that HGT is an important mechanism for bacterial evolution, as well as for genome complexity and plasticity [1]. GIs, which are large genomic segments most likely transferred by HGT, contribute to the survival of the host bacterial strain in a particular environment and sometimes to virulence. These two kinds of GIs, of which the former can be referred to as 'fitness islands', are often hard to distinguish from each other because the role of a GI may vary with the ecological niche and the physiology of the bacterium. Up to now, attempts to identify PAIs [5, 6, 17] have been made by detecting genomic regions that differ from the rest of the genome only in their base composition and codon usage. In this study, we identified "candidate PAIs (cPAIs)" that reflect potential PAIs with anomalous composition, probably due to their recent acquisition. Among the 148 sequenced strains searched in this study, 17 were closely related to the hosts carrying the queried PAI loci. The reports of their genome sequencing projects describe 27 PAIs. Among them, 23 PAIs were found in the list of cPAIs, so the accuracy of our method can be considered to be 85% (23 of 27; Table 2, supplementary Table 4S [see Additional file 4]).

The presence of virulence factors could be a useful criterion for discerning PAIs from other genomic islands. Clusters consisting of only hypothetical genes and/or elements involved in the transfer mechanism (e.g. IS elements, tRNA genes, integrase, and prophage) were filtered out, leaving only the 46% of genomic regions that contain virulence factors. The widespread distribution of conserved elements of many PAIs in different species, and even in non-pathogens, is due to their complex mosaic structures consisting of elements of different origins. PAI I536 to PAI III536 in E. coli 536 have mosaic-like structures consisting of many DNA fragments that show high similarities to the chromosomal regions of other pathogenic E. coli strains and Shigella flexneri [18]. SPI-2 is a fusion of at least two genetic elements – a 25-kb region encoding the TTSS with a low G+C content and a 15-kb region encoding metabolic functions with a G+C content similar to the rest of the genome [19], and the Hrp PAI of Pseudomonas syringae has a tripartite structure [15].

Some virulence factors in PAIs are homologous to seemingly backbone genes. As shown in Figure 4, PAIs having extensive mosaic structures occurred very frequently in various species, and clusters of seemingly backbone genes could be removed from the list of cPAIs by checking for the presence of a GI in a PAI-like region. Many Gram-negative bacterial pathogens cause diseases by secreting and injecting virulence proteins (effectors) into the host cell via a specialized protein secretion mechanism (TTSS) [20]. TTSSs are evolutionarily related to flagellar systems, and the two are often hard to distinguish on the basis of homology searches alone [21]. However, TTSSs are frequently transferred laterally between Gram-negative bacteria, while flagellar systems are mainly inherited by vertical descent. This explains why many regions encoding flagellar biosynthesis genes fall in PAI-like regions that show no anomalies in DNA composition (supplementary Table 4S) [see Additional file 4], while PAI-like regions overlapping GIs contain many TTSSs (Table 2). Iron uptake systems are important for bacterial survival as well as virulence [2]. Many PAIs such as HPI of Yersinia species, SHI-2 of S. flexneri, and SRL of S. flexneri 2a YSH6000 carry genes encoding various siderophore systems that produce and secrete low-molecular-weight siderophores with extremely high affinities for ferric iron. Clusters of homologs of the ferric dicitrate transport system (fecABCDEIR, Fec) of SRL [22] were widely distributed in the backbone genomic regions of various species, which implies that Fec might be the most ancient siderophore system (Figure 4, Table 2, supplementary Table 4S [see Additional file 4]). Interestingly, a 7.1-kb fecCDE-homologous region can be found even in Halobacterium sp. NRC-1, the only archaeon possessing a PAI-like region in this study. This region is inserted by a 6-phosphogluconate dehydrogenase gene, 3 hypothetical proteins and tRNA-Arg gene.

One of the difficulties in designating potential PAIs in sequenced genomes is determining their boundaries. A PAI may have a number of genes that have undergone many evolutionary stages and are thus compositionally indistinguishable from the rest of the genome [2, 23]. This might be because some parts have become highly adjusted to the base composition of the recipient's genome, or because backbone genomic segments were added later in evolution [10]. We found that the length proportion of transferred regions contained in known chromosomal PAIs – 28.7 kb of LEE in E. coli O157 Sakai, 36.2 kb of Cag PAI in H. pylori 26695, 61.2 kb of VPI-2 in V. cholerae, and 137.5 kb of PAI in Enterococcus faecalis – varies from 0.19 to 0.65. Thus, compositional approaches cannot predict the boundaries of a detected PAI because they only detect atypical genomic regions. To solve this problem, we detected genomic segments homologous to each known PAI, which were then clumped into a large genomic region. This procedure is somewhat like the process of fragment assembly, in which a contiguous region (contig) is made from overlapping fragments in shotgun sequencing [24]. Like the conserved sequences of TTSS structural genes [20], PAIs often share conserved regions. In addition, PAIs frequently carry relics of HGT events, such as mobile sequence elements and association with tRNA genes at their boundaries [3]. Islander [25], a database of potential integrative islands in prokaryotic genomes, detects GIs by identifying tRNA or tmRNA genes and candidate integrase genes. Although many GIs reported in the database were in accordance with our results, a large portion was not annotated as cPAIs, mainly because they lack homologs of virulence genes in known PAIs or because the PAIs are not located at tRNA loci. As illustrated in Figure 3, the frequent occurrence of conserved regions shared between PAIs allows our method to find the entire region of a PAI in a sequenced genome even when its similar sequence is only partially known.

A typical genome sequencing team uses genes in the gene cluster or the genome sequence of interest as a query to search for any similar genes in the databases. Then, homologs of pathogenicity/virulence genes are inferred by checking whether the descriptions of the retrieved genes suggest virulence/pathogenicity or whether the genes come from pathogens. Because this approach depends on the examiner's knowledge of known PAIs or pathogenicity/virulence genes, and because entry descriptions of the retrieved genes are often not informative enough to infer function, one can never be sure whether the searches picked up all the genes associated with PAIs or pathogenicity/virulence. To avoid this uncertainty about the robustness of the open-ended search, we first collected all the reported PAI loci and used them as queries to search for homologs in the complete prokaryotic genomes. Our method ensures that all the potential PAIs related to the known PAIs are searched without the intervention of human interpretation.

In completely sequenced genomes, we detected cPAIs that are homologous to the published PAIs and show anomalies in DNA composition. The methodology we developed in this study has a limitation in that the detected cPAIs are limited by the query data set of known PAIs. This caveat, however, can be advantageous when researchers are concerned only with a specific set of PAIs. Furthermore, this approach can be easily extended to identify various genomic islands (e.g. fitness, metabolism, and resistance islands). The omission of several well-known PAIs, such as the Hrp PAI of P. syringae and LIPI-1 of L. innocua, from the cPAIs detected in this study is due to their DNA compositions being similar to the core genomes, which may be caused by horizontal transfer from closely related strains or by very ancient HGT events. Thus, patterns of best matches of each gene to different species, lineage-specific genes, or genes transferred from phylogenetically distant species would be helpful in improving the chance of finding GIs and PAIs. Also, accumulation of PAI sequence data in bacterial families other than the Enterobacteriaceae will lead to the detection of more putative PAIs across various taxa. Finally, it should be noted that the identity of cPAIs as bona fide PAIs needs to be confirmed by further experimental verification. We are currently improving the detection scheme and are developing a database for cPAIs in sequenced genomes.


We present the first computational framework combining feature-based analyses and similarity-based analyses. As shown in Figure 3, the similarity-based analysis, which is reminiscent of a sequence-assembly procedure, proved to be an efficient method for demarcating potential PAIs in our study. Also, the function(s) and origin(s) of a cPAI can be inferred by investigating the PAI queries comprising it. With the rapidly increasing number of complete genome sequences [26] as well as PAI data, the proposed method will be useful in identifying potential PAIs in microbial genomes.


Collection of complete genomes and PAI Data

The sequence files of 148 complete prokaryotic genomes consisting of 157 chromosomes, including 17 archaeal ones, as of January 2004 were downloaded from the NCBI FTP server (supplementary Table 1S) [see Additional file 1]. We searched the GenBank database and literature [3, 23] for any description of a "pathogenicity island". Forty-five kinds of PAIs and 207 GenBank accessions containing either part or all of the reported PAI loci in 120 pathogenic bacteria are summarized in Table 1 (see supplementary Table 2S for the complete information) [see Additional file 2]. The definition of virulence genes is difficult, as their function may depend on growth conditions and host niches. Thus, we deferred to the biologists who identified the PAI loci, and the virulence genes of the PAI loci were identified by literature survey. Many PAIs, 29 out of the 45 kinds, came from Enterobacteriaceae. Thirty-four PAI loci are completely sequenced, ranging from 6.8 kb to 153.6 kb (average: 41.3 kb), and the remainder are partial PAI sequences. It should be noted that the collected sets do not contain PAIs that were reported in genome sequencing papers.

Detection of GIs in genome sequences

To detect GIs in a chromosome, we first identified horizontally transferred genes (H) based on the algorithm developed by Garcia-Vallve et al. [4]. To reduce false positives caused by applying a single criterion for identifying HGT regions, we considered a gene as H only if both its G+C content and its codon usage were aberrant. For each genome, we computed the total G+C content ([G+C]T) and the G+C contents at the first and third codon positions ([G+C]1 and [G+C]3) of every ORF. The compositional biases at the first and third positions were reported to be positively correlated with expressivity and genomic G+C content, respectively [10, 27]. A gene was considered extraneous in terms of G+C content if its [G+C]T deviated by more than 1.5 σ, or if the deviations of [G+C]1 and [G+C]3 were of the same sign and at least one of them exceeded 1.5 σ. Mahalanobis distance (dM) was used to evaluate the deviation of the codon usage of a gene from the genome mean [4]. dM is a statistic in units of standard deviation from the mean of the 61 codon frequencies and can be calculated as follows:

$d_M(X, X_{\mathrm{mean}}) = (X - X_{\mathrm{mean}})^{T} S^{-1} (X - X_{\mathrm{mean}})$

where X and Xmean are vectors holding the relative frequencies of the 61 codons for a gene and the mean values for the genome, respectively, and S⁻¹ is the inverse of the variance-covariance matrix (S) of all 61 codon frequencies. The higher this value, the greater the deviation in codon usage [4]. If the Xs are normally distributed, dM values can be converted to p-values using the χ² distribution function. We considered a gene extraneous in codon usage if its p-value was less than 0.05. It should be noted that only genes longer than 300 bp were used for calculating the mean and standard deviation (σ) of G+C contents and dM values. This follows from the observation that genes shorter than 300 bp have a much higher chance of anomalies in G+C content and codon usage.
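The two per-gene criteria described above can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function names are hypothetical, deviations are assumed to be supplied already expressed in units of σ, the toy example uses two features rather than 61 codon frequencies, and the conversion of dM to a χ² p-value is omitted.

```python
def gc_fractions(orf):
    """Return ([G+C]T, [G+C]1, [G+C]3) for an in-frame coding sequence."""
    orf = orf.upper()

    def gc(seq):
        return sum(base in "GC" for base in seq) / len(seq)

    # orf[0::3] are first codon positions, orf[2::3] are third codon positions
    return gc(orf), gc(orf[0::3]), gc(orf[2::3])


def gc_aberrant(dev_t, dev_1, dev_3, cutoff=1.5):
    """G+C rule from the text; arguments are deviations in units of sigma."""
    if abs(dev_t) > cutoff:
        return True
    same_sign = dev_1 * dev_3 > 0
    return same_sign and max(abs(dev_1), abs(dev_3)) > cutoff


def solve(a, b):
    """Solve a*y = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    y = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(m[i][j] * y[j] for j in range(i + 1, n))
        y[i] = (m[i][n] - tail) / m[i][i]
    return y


def mahalanobis(x, x_mean, s):
    """(X - Xmean)^T S^-1 (X - Xmean), computed without forming S^-1."""
    d = [xi - mi for xi, mi in zip(x, x_mean)]
    y = solve(s, d)  # y = S^-1 d
    return sum(di * yi for di, yi in zip(d, y))


# Toy usage with two features and a diagonal covariance matrix
print(gc_fractions("ATGGCC"))
print(mahalanobis([2.0, 0.0], [0.0, 0.0], [[4.0, 0.0], [0.0, 1.0]]))
```

Solving the linear system instead of inverting S is a design choice of this sketch; with real codon frequencies, which sum to one, the covariance matrix may be singular and would need separate handling.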

We scanned each genome with a 10-gene window and identified regions containing four or more H genes. This threshold frequency of 0.4 was inferred from the observation that the frequencies of H in known PAIs such as LEE of E. coli O157 Sakai, cag PAI of Helicobacter pylori 26695, VPI-2 of Vibrio cholerae, and a PAI of Enterococcus faecalis averaged 0.35. Neighbouring regions were merged into larger regions, which are referred to as GIs in this study. Some genomic regions had highly biased G+C content compared to the overall G+C content of the chromosome, while their codon usage was not biased. For example, a 46.4-kb genomic region starting at 2,647,129 bp in Yersinia pestis KIM, which contains the yersiniabactin genomic island [28], has considerably higher G+C content (55.7% versus a 47.6% average for the whole genome), but the genes in this region showed codon usage similar to that of the genome. Thus, among genomic regions made from genes anomalous in G+C content, a region was added to the GIs if its [G+C]T deviated by more than 1.5 σ.
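The window scan described above might look like the sketch below. The window size (10 genes) and threshold (four H genes) come from the text; the step size of one gene, the rule that overlapping or directly adjacent windows are merged, and the name `find_gis` are assumptions made for illustration only.

```python
def find_gis(is_h, window=10, min_h=4):
    """Scan per-gene H flags and return merged (first, last) gene-index ranges.

    is_h is a list of 0/1 flags in chromosomal gene order. Every window with
    at least min_h flagged genes is kept, and qualifying windows that overlap
    or touch are merged into one region (an assumption of this sketch).
    """
    regions = []
    last_start = max(1, len(is_h) - window + 1)
    for start in range(last_start):
        end = min(start + window, len(is_h))  # exclusive end of the window
        if sum(is_h[start:end]) >= min_h:
            if regions and start <= regions[-1][1] + 1:
                regions[-1] = (regions[-1][0], end - 1)
            else:
                regions.append((start, end - 1))
    return regions


# Toy usage: 30 genes with H genes at indices 10 to 13
flags = [0] * 30
for i in range(10, 14):
    flags[i] = 1
print(find_gis(flags))
```

Because whole windows are reported, the returned ranges extend past the flagged genes themselves; trimming each region to its outermost H genes would be one possible refinement.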

Identification of candidate PAIs

The detection scheme for cPAI regions is outlined in Figure 1. Each ORF from a PAI locus was used as the query in BLASTP searches [29] against the set of ORFs from each of the 148 completely sequenced genomes, using PAM250 as the scoring matrix in order to retrieve homologous genes in evolutionarily distant strains. Likewise, homologs of ORFs, RNA genes and repeat regions of each PAI locus were searched at the nucleotide level using BLAT, a modified BLAST alignment program that can stitch matched regions into a larger one [30]. If the identity of a hit was over 80% for DNA sequences or over 25% for protein sequences, and the aligned region covered more than 70% of the length of both the query and the hit, the pair of sequences was considered homologous. Genomic strips corresponding to each PAI locus were then obtained by identifying the regions containing four or more homologs of genes from the same PAI accession and by merging neighboring regions. Overlapping or adjacent genomic strips corresponding to the same or different kinds of PAI loci were fused into a large region. Among these regions, PAI-like regions were identified by checking for the presence of at least one gene homologous to a virulence gene on the PAI loci. We considered a PAI-like region a candidate PAI (cPAI) only if it partly or entirely spanned a GI.
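The homolog filter, strip merging, and final classification can be sketched as below. The identity and coverage thresholds are taken from the text; the `Hit` record, the function names, the use of base-pair coordinates, and the reading of "adjacent" as "directly touching" are assumptions of this sketch, and the step that builds strips from four or more homologs of the same accession is omitted.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    identity: float    # percent identity of the alignment
    aligned_len: int   # length of the aligned region
    query_len: int
    subject_len: int
    is_protein: bool   # True for a protein hit, False for a nucleotide hit


def is_homolog(hit):
    """Identity over 25% (protein) or 80% (DNA), alignment over 70% of both."""
    min_identity = 25.0 if hit.is_protein else 80.0
    covers_both = (hit.aligned_len > 0.7 * hit.query_len
                   and hit.aligned_len > 0.7 * hit.subject_len)
    return hit.identity > min_identity and covers_both


def merge_strips(strips):
    """Merge overlapping or directly adjacent (start, end) intervals."""
    merged = []
    for start, end in sorted(strips):
        if merged and start <= merged[-1][1] + 1:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


def overlaps(a, b):
    return a[0] <= b[1] and b[0] <= a[1]


def classify(region, virulence_positions, gis):
    """Label a merged region as 'cPAI', 'nPAI' (PAI-like, no GI), or 'other'."""
    has_virulence = any(region[0] <= p <= region[1] for p in virulence_positions)
    if not has_virulence:
        return "other"
    return "cPAI" if any(overlaps(region, gi) for gi in gis) else "nPAI"


# Toy usage with made-up coordinates
regions = merge_strips([(100, 200), (150, 300), (301, 400), (500, 600)])
print(regions)
print([classify(r, [250, 550], [(350, 450)]) for r in regions])
```

Merging sorted intervals in a single pass mirrors the contig-building analogy used in the text: each strip either extends the current region or starts a new one.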


  1. Dobrindt U, Hochhut B, Hentschel U, Hacker J: Genomic islands in pathogenic and environmental microorganisms. Nat Rev Microbiol 2004, 2(5):414–424. 10.1038/nrmicro884

  2. Schmidt H, Hensel M: Pathogenicity islands in bacterial pathogenesis. Clin Microbiol Rev 2004, 17(1):14–56. 10.1128/CMR.17.1.14-56.2004

  3. Hacker J, Kaper JB: Pathogenicity islands and the evolution of pathogenic microbes. Berlin: Springer-Verlag; 2002.

  4. Garcia-Vallve S, Romeu A, Palau J: Horizontal gene transfer in bacterial and archaeal complete genomes. Genome Res 2000, 10(11):1719–1725. 10.1101/gr.130000

  5. Karlin S: Detecting anomalous gene clusters and pathogenicity islands in diverse bacterial genomes. Trends Microbiol 2001, 9(7):335–343. 10.1016/S0966-842X(01)02079-0

  6. Tu Q, Ding D: Detecting pathogenicity islands and anomalous gene clusters by iterative discriminant analysis. FEMS Microbiol Lett 2003, 221(2):269–275. 10.1016/S0378-1097(03)00204-0

  7. Merkl R: SIGI: score-based identification of genomic islands. BMC Bioinformatics 2004, 5(1):22. 10.1186/1471-2105-5-22

  8. Eisen JA: Horizontal gene transfer among microbial genomes: new insights from complete genome analysis. Curr Opin Genet Dev 2000, 10(6):606–611. 10.1016/S0959-437X(00)00143-X

  9. Wang B: Limitations of compositional approach to identifying horizontally transferred genes. J Mol Evol 2001, 53(3):244–250. 10.1007/s002390010214

  10. Lawrence JG, Ochman H: Amelioration of bacterial genomes: rates of change and exchange. J Mol Evol 1997, 44(4):383–397.

  11. Ragan MA: On surrogate methods for detecting lateral gene transfer. FEMS Microbiol Lett 2001, 201(2):187–191. 10.1016/S0378-1097(01)00262-2

  12. Ren SX, Fu G, Jiang XG, Zeng R, Miao YG, Xu H, Zhang YX, Xiong H, Lu G, Lu LF, Jiang HQ, Jia J, Tu YF, Jiang JX, Gu WY, Zhang YQ, Cai Z, Sheng HH, Yin HF, Zhang Y, Zhu GF, Wan M, Huang HL, Qian Z, Wang SY, Ma W, Yao ZJ, Shen Y, Qiang BQ, Xia QC, Guo XK, Danchin A, Saint Girons I, Somerville RL, Wen YM, Shi MH, Chen Z, Xu JG, Zhao GP: Unique physiological and pathogenic features of Leptospira interrogans revealed by whole-genome sequencing. Nature 2003, 422(6934):888–893. 10.1038/nature01597

  13. Kaneko T, Nakamura Y, Sato S, Asamizu E, Kato T, Sasamoto S, Watanabe A, Idesawa K, Ishikawa A, Kawashima K, Kimura T, Kishida Y, Kiyokawa C, Kohara M, Matsumoto M, Matsuno A, Mochizuki Y, Nakayama S, Nakazaki N, Shimpo S, Sugimoto M, Takeuchi C, Yamada M, Tabata S: Complete genome structure of the nitrogen-fixing symbiotic bacterium Mesorhizobium loti. DNA Res 2000, 7(6):331–338.

  14. Dobrindt U, Agerer F, Michaelis K, Janka A, Buchrieser C, Samuelson M, Svanborg C, Gottschalk G, Karch H, Hacker J: Analysis of genome plasticity in pathogenic and commensal Escherichia coli isolates by use of DNA arrays. J Bacteriol 2003, 185(6):1831–1840. 10.1128/JB.185.6.1831-1840.2003

  15. Alfano JR, Charkowski AO, Deng WL, Badel JL, Petnicki-Ocwieja T, van Dijk K, Collmer A: The Pseudomonas syringae Hrp pathogenicity island has a tripartite mosaic structure composed of a cluster of type III secretion genes bounded by exchangeable effector and conserved effector loci that contribute to parasitic fitness and pathogenicity in plants. Proc Natl Acad Sci U S A 2000, 97(9):4856–4861. 10.1073/pnas.97.9.4856

  16. Vazquez-Boland JA, Kuhn M, Berche P, Chakraborty T, Dominguez-Bernal G, Goebel W, Gonzalez-Zorn B, Wehland J, Kreft J: Listeria pathogenesis and molecular virulence determinants. Clin Microbiol Rev 2001, 14(3):584–640. 10.1128/CMR.14.3.584-640.2001

  17. Lio P, Vannucci M: Finding pathogenicity islands and gene transfer events in genome data. Bioinformatics 2000, 16(10):932–940. 10.1093/bioinformatics/16.10.932

  18. Dobrindt U, Blum-Oehler G, Nagy G, Schneider G, Johann A, Gottschalk G, Hacker J: Genetic structure and distribution of four pathogenicity islands (PAI I536 to PAI IV536) of uropathogenic Escherichia coli strain 536. Infect Immun 2002, 70(11):6365–6372. 10.1128/IAI.70.11.6365-6372.2002

  19. Hensel M, Nikolaus T, Egelseer C: Molecular and functional analysis indicates a mosaic structure of Salmonella pathogenicity island 2. Mol Microbiol 1999, 31(2):489–498. 10.1046/j.1365-2958.1999.01190.x

  20. Hueck CJ: Type III protein secretion systems in bacterial pathogens of animals and plants. Microbiol Mol Biol Rev 1998, 62(2):379–433.

  21. Kim JF: Revisiting the chlamydial type III protein secretion system: clues to the origin of type III protein secretion. Trends Genet 2001, 17(2):65–69. 10.1016/S0168-9525(00)02175-2

  22. Luck SN, Turner SA, Rajakumar K, Sakellaris H, Adler B: Ferric dicitrate transport system (Fec) of Shigella flexneri 2a YSH6000 is encoded on a novel pathogenicity island carrying multiple antibiotic resistance genes. Infect Immun 2001, 69(10):6012–6021. 10.1128/IAI.69.10.6012-6021.2001

  23. Kaper JB, Hacker J: Pathogenicity islands and other mobile virulence elements. Washington, DC: American Society for Microbiology Press; 1999.

  24. Myers G: Whole-genome DNA sequencing. Comput Sci Eng 1999, 1: 33–43. 10.1109/5992.764214

  25. Mantri Y, Williams KP: Islander: a database of integrative islands in prokaryotic genomes, the associated integrases and their DNA site specificities. Nucleic Acids Res 2004, 32(Database issue):D55–8. 10.1093/nar/gkh059

  26. Fraser CM, Eisen JA, Salzberg SL: Microbial genome sequencing. Nature 2000, 406(6797):799–803. 10.1038/35021244

  27. Gutierrez G, Marquez L, Marin A: Preference for guanosine at first codon position in highly expressed Escherichia coli genes. A relationship with translational efficiency. Nucleic Acids Res 1996, 24(13):2525–2527. 10.1093/nar/24.13.2525

  28. Deng W, Burland V, Plunkett III G, Boutin A, Mayhew GF, Liss P, Perna NT, Rose DJ, Mau B, Zhou S, Schwartz DC, Fetherston JD, Lindler LE, Brubaker RR, Plano GV, Straley SC, McDonough KA, Nilles ML, Matson JS, Blattner FR, Perry RD: Genome sequence of Yersinia pestis KIM. J Bacteriol 2002, 184(16):4601–4611. 10.1128/JB.184.16.4601-4611.2002

  29. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997, 25(17):3389–3402. 10.1093/nar/25.17.3389

  30. Kent WJ: BLAT-the BLAST-like alignment tool. Genome Res 2002, 12(4):656–664. 10.1101/gr.229202. Article published online before March 2002


We thank Drs. Seung-Hwan Park and Doil Choi for their heartfelt support of the project. This work was funded by the 21C Frontier Microbial Genomics and Applications Center Program, Ministry of Science and Technology, Republic of Korea.

Author information

Corresponding author

Correspondence to Jihyun F Kim.

Additional information

Authors' contributions

SHY designed the study, developed the software implementing the devised algorithm, and wrote the manuscript. CH and HK contributed to writing the software, YHK collected and reviewed the data, and TKO assessed the biological significance of the results. JFK supervised the project and contributed to the development of the methodology and to writing the manuscript. All authors read and approved the final manuscript.

Rights and permissions

Open Access This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License ( ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Yoon, S.H., Hur, CG., Kang, HY. et al. A computational approach for identifying pathogenicity islands in prokaryotic genomes. BMC Bioinformatics 6, 184 (2005).
