A computational approach for identifying pathogenicity islands in prokaryotic genomes

Background Pathogenicity islands (PAIs), distinct genomic segments of pathogens encoding virulence factors, represent a subgroup of genomic islands (GIs) that have been acquired by horizontal gene transfer event. Up to now, computational approaches for identifying PAIs have been focused on the detection of genomic regions which only differ from the rest of the genome in their base composition and codon usage. These approaches often lead to the identification of genomic islands, rather than PAIs. Results We present a computational method for detecting potential PAIs in complete prokaryotic genomes by combining sequence similarities and abnormalities in genomic composition. We first collected 207 GenBank accessions containing either part or all of the reported PAI loci. In sequenced genomes, strips of PAI-homologs were defined based on the proximity of the homologs of genes in the same PAI accession. An algorithm reminiscent of sequence-assembly procedure was then devised to merge overlapping or adjacent genomic strips into a large genomic region. Among the defined genomic regions, PAI-like regions were identified by the presence of homolog(s) of virulence genes. Also, GIs were postulated by calculating G+C content anomalies and codon usage bias. Of 148 prokaryotic genomes examined, 23 pathogenic and 6 non-pathogenic bacteria contained 77 candidate PAIs that partly or entirely overlap GIs. Conclusion Supporting the validity of our method, included in the list of candidate PAIs were thirty four PAIs previously identified from genome sequencing papers. Furthermore, in some instances, our method was able to detect entire PAIs for those only partial sequences are available. Our method was proven to be an efficient method for demarcating the potential PAIs in our study. Also, the function(s) and origin(s) of a candidate PAI can be inferred by investigating the PAI queries comprising it. Identification and analysis of potential PAIs in prokaryotic genomes will broaden our knowledge on the structure and properties of PAIs and the evolution of bacterial pathogenesis.


Background
PAIs are distinct genetic elements of pathogens encoding various virulence factors such as protein secretion systems, host invasion factors, iron uptake systems, and toxins [1,2]. PAIs are a subset of genomic islands which have been transferred by horizontal gene transfer (HGT) event and confer virulence upon the recipient. PAIs can be identified by features such as the presence of virulence genes, biased G+C content and codon usage, carriage of mobile sequence elements, and/or association with tRNA genes or repeated sequences at their boundaries [3].
Identification of PAIs is essential in understanding the development of disease and the evolution of bacterial pathogenesis [2]. As complete genome sequences rapidly accumulate, various in silico methods have been developed to detect HGT [4][5][6][7]. Most of the methods were based on the detection of genomic regions having atypical G+C content, patterns of codon usage bias, or dinucleotide anomaly. However, compositional approaches may generate many false positives due to other factors such as selection and mutation bias [8,9], and a lot of false negatives owing to adjustment of the transferred sequence in its composition by amelioration [10]. In fact, these methods detect different sets of ORFs as foreign origin when applied to the genome of Escherichia coli K-12 [11]. Thus, combining multiple lines of evidence can be beneficial to determine whether a gene or a group of genes has been acquired by HGT.
While studies on detecting horizontally transferred genes or GIs in genome sequences have been intensively carried out, little has been reported for PAIs. Considering that a PAI is a GI encoding virulence factors, compositional criteria such as G+C content and codon usage is not sufficient for identifying PAIs because genomic approaches can only lead to the identification of GIs [2]. In this work, we designed a computational method for identifying PAIs in sequenced genomes by combining a homology-based method and detection of abnormalities in genomic composition. To do this, we collected published PAI data and checked virulence genes on the PAI loci. We applied this approach to 148 prokaryotic genomes and identified 77 candidate PAIs. Detected regions contain virulence genes and relics of the HGT event.

Genomic islands in bacterial genomes
As for the 157 chromosomes examined (Table 1S [see Additional file 1]), the length proportion of GIs to the chromosome averaged 10.1%. Nanoarchaeum equitans, the smallest genome of any sequenced microbes, contained the smallest proportion of GIs, which is only 2.9%. Leptospira intrerrogans, which is responsible for worldwide water-borne zoonosis leptospirosis, contained the largest, 34.7% for chromosome I and 32.2% for chromosome II. The genome of L. interrogans was reported to have the biggest number of proteins with structural similarity to eukaryal and archaeal proteins as compared to other bacteria [12]. In general, larger proportions of GIs in pathogens than those in related nonpathogenic species were observed, e.g., 15.7% for Corynebacterium diphtheriae versus 7.6% for C. glutamicum, 12.3% for E. coli CFT073 versus 8.9% for E. coli K-12.

PAI-like regions
When every ORF contained in 207 PAI loci (see Table 1 and supplementary Table 2S for the complete information [see Additional file 2]) were similarity-searched against the ORFs present in the 148 prokaryotic genomes, 1,490 genomic strips of PAI-associated genes were defined based on the proximity of the homologs of genes from the same PAI accession. Overlapping strips were then merged into 525 genomic regions in 83 chromosomes ( Figure 1). Among these regions, 241 contained at least one gene homologous to the virulence genes on the PAI loci, which will be referred to as PAI-like regions in this study. 77 PAIlike regions (total 1,652,758 bp) partly or entirely overlapped GIs, while the remaining 164 regions (total 1,553,923 bp) did not contain any part of GIs. In this report, we call the former candidate PAIs (cPAIs). Figure 2 shows the projection of PAI-like regions in their G+C contents and length-proportion of horizontally transferred genes. 52% of all the PAI-like regions show lower G+C content compared to those of their genomes (average of -0.6%, standard deviation of 3.8), however, 75% of the cPAIs have lower G+C contents (-2.7%, 4.7, respectively). The plot indicates that clusters of PAI-homologs are often located in the backbone sequence while the detected GIs tend to be biased to have lower G+C content.
Candidate PAIs cPAIs, PAI-like anomalous regions, were present in 29 bacteria including 6 non-pathogens, and their sizes ranged from 3.7 kb to 137.5 kb with the average length of 21.5 kb ( Table 2, supplementary Table 3S [see Additional  file 3]). Most of these regions contained transposase, integrase genes or insertion sequence elements, and were associated with tRNA genes at their boundaries, which is indicative of genomic islands. In some instances, our method allowed the detection of the entire PAIs for those only partial sequences have been reported in the original papers ( Figure 3). This is due to the fact that PAIs often share conserved regions, and homologous regions of other PAIs can be located in the same PAI locus. Interestingly, cPAIs were detected in six strains which are known to be non-pathogens. Genes contained code for an ABC transporter (Bacillus halodurans), flagellar proteins (Bacillus subtilis), iron transport and fimbrial proteins (E. coli K-12), transmembrane sensors and outer membrane efflux proteins (Nitrosomonas europaea), or nodulation proteins (Bradyrhizobium japonicum). Genes detected in Mesorhizobium loti, a bacterium that forms globular nodules and perform nitrogen-fixing symbiosis with leguminous plants, are involved in the nodulation process and a type III secretion system (TTSS) [13]. However, the unexpected locations of cPAIs in non-pathogens should be interpreted as some clusters of potentially horizontally transferred genes that have homology to virulence genes.
Regions homologous to a certain PAI were frequently found in genomes of various taxa. Especially, parts of PAIs originally identified from enteropathogenic bacteria were detected not only in enterobacteria but also in phyla other than the Gammaproteobacteria in our study ( Figure 4). The number of genomes containing PAI-like regions was drastically reduced when we considered genomic regions that overlap GIs. Elements of PAI I~ III 536 in the uropathogenic E. coli strain 536 showed high similarities to other members of the Enterobacteriaceae. This is consistent with the previous report that PAI-specific sequences of E. coli strain 536 were frequently found in pathogenic and commensal E. coli isolates by using "E. coli pathoarray" [14].  (Table 2). This discrepancy results from the different distribution of prophages in the two genomes. Also, different ORF prediction by different research groups affected the determination of GIs.

PAI-like regions that did not meet the criteria
164 PAI-like regions in 57 prokaryotes including 16 nonpathogenic bacteria and one archaeon did not overlap GIs (supplementary Table 4S) [see Additional file 4]. Their sizes ranged from 1.9 to 50.6 kb and were averaged 9.5 kb.
Most of them encoded flagellar/fimbrial biosynthesis or iron uptake systems. Among these regions, 14 were PAIs published in the genome sequencing papers. Six PAIs -Hrp PAI (in Pseudomonas syringae pv. tomato DC3000), SPI-3 (S. enterica serovar Typhi strains Ty2 and CT18), SaPIm1 (in S. aureus Mu50), SaPIn1 (S. aureus N315) and νSa3 (S. aureus MW2) -entirely matched, and 5 counterparts of the PAIs that partly match to the cPAIs that overlap GIs were found in these regions. Parts of LIPI-1 in Listeria innocua and two regions of internalins in L. monocytogenes EGD were found. In fact, the Hrp PAI and LIPI-1 have DNA compositions similar to the core genomes, and are suggested to have been acquired a long time ago [15,16].

Discussion
By analyzing structures of many microbial genomes, it became obvious that HGT is an important mechanism for bacterial evolution, let alone genome complexity and plasticity [1]. GIs, which are large genomic segments and Flow chart of the algorithm Figure 1 Flow chart of the algorithm.
Projection of PAI-like regions in their G+C contents and length-proportion of horizontally transferred genes Figure 2 Projection of PAI-like regions in their G+C contents and length-proportion of horizontally transferred genes. Projection of PAI-like regions which overlap genomic islands (cPAI) and those which do not overlap genomic islands (nPAI) in their G+C contents (X axis) and length-proportion of horizontally transferred genes (Y axis). Each symbols denotes follows; cPAI (plus sign), nPAI (minus sign), cPAI and nPAI matching to a PAI identified from the genome sequencing paper (circle and triangle, respectively)  a Deviation of the G+C content of the cPAI as compared to that of the whole genome b Length percentage of horizontally transferred genes in the cPAI c Genes involved in the transfer mechanism (integrase, transposase, IS element, or tRNA gene at the boundaries) d Non-pathogenic bacterium e cPAI that entirely matches to a PAI identified from the genome sequencing paper f cPAI that matches to one end of a PAI identified from the genome sequencing paper. The other end of the PAI is present in a PAI-like region not overlapping GIs. g cPAI that partly matches to a PAI identified from the genome sequencing paper Bold characters denote that a sequenced strain containing the cPAI is the same as or closely related to the host strain of the queried PAI loci.
Example of a PAI-like region and a cPAI in genome sequences Figure 3 Example of a PAI-like region and a cPAI in genome sequences. 48.5-kb of PAI I CFT073 from E. coli CFT073 was detected by merging genomic strips similar to known PAI loci (yellow strip) including partial sequence of PAI I CFT073 . The genomic region contains homologs of the virulence genes on the known PAIs (red arrow) and genomic island (grey bar). Therefore, this PAI-like region is considered as a cPAI. Red and orange arrows in yellow strips denote virulence and putative virulence gene, respectively. Numbers on the yellow strips indicate parts of the PAI loci homologous to the genomic strips:  described. Among them, 23 PAIs were found in the list of cPAIs and the accuracy of our method can be considered as 85% (Table 2, supplementary Table 4S [see Additional The presence of virulence factors could be a useful criterion for discerning PAIs from other genomic islands. Clusters consisting of only hypothetical genes and/or elements involved in the transfer mechanism (e.g. IS elements, tRNA genes, integrase, and prophage) were filtered out, leaving only 46% of the genomic regions containing virulence factors. Widespread distribution of conserved elements of many PAIs in different species and in even nonpathogens is due to their complex mosaic structures consisting of elements of different origins. PAI I~ III 536 in E. coli 536 have mosaic-like structures consisting of many DNA fragments that show high similarities to the chromosomal regions of other pathogenic E. coli strains and Shig-ella flexneri [18]. SPI-2 is a fusion of at least two genetic elements -a 25-kb region encoding the TTSS with a low G+C content and a 15-kb region encoding metabolic functions with a G+C content similar to the rest of the genome [19], and the Hrp PAI of Pseudomonas syringae has a tripartite structure [15].
Some virulence factors in PAIs are homologous to seemingly backbone genes. As shown in Figure 4, PAIs having extensive mosaic structures showed highly frequent occurrence in various species, and clusters of seemingly backbone genes could be removed from the list of the cPAIs by checking the presence of a GI in a PAI-like region. Many Gram-negative bacterial pathogens cause diseases by secreting and injecting virulence proteins (effectors) into the host cell via a specialized protein secretion mechanism (TTSS) [20]. They are evolutionarily related to flagellar systems and often hard to distinguish when based only on homology searches [21]. However, TTSSs are frequently transferred laterally between Gram-negative bacteria while flagellar systems are mainly inherited by vertical descent. This fact explains why many regions encoding flagellar biosynthesis genes have hits to PAI-like regions not showing anomalies in DNA composition (supplementary Table 4S) [see Additional file 4], while PAI-like regions overlapping GIs contain lots of TTSSs (Table 2). Iron uptake systems are important for bacterial survival as well as virulence [2]. Many PAIs such as HPI of Yersinia species, SHI-2 of S. flexneri, and SRL of S. flexneri 2a YSH6000 carry genes encoding various siderophore systems that produce and secrete low-molecular-weight siderophores with extremely high affinities for ferric iron. Clusters of homologs of ferric dicitrate transport system (fecABCDEIR, Fec) of SRL [22] were widely distributed in the backbone genomic regions of various species, which implies that Fec might be the most ancient siderophore system ( Figure 4, Table 2, supplementary Table 4S [see  Additional file 4]). Interestingly, a 7.1-kb fecCDE-homologous region can be found even in Halobacterium sp. NRC-1, the only archaeon possessing the PAI-like region in this study. This region is inserted by a 6-phosphogluconate dehydrogenase gene, 3 hypothetical proteins and tRNA-Arg gene.
One of the difficulties when dubbing potential PAIs in the sequenced genomes is to determine the boundaries. A PAI may have a number of genes which have undergone many evolutionary stages and thus compositionally indistinguishable from the rest of the genome [2,23]. This might be due to some parts highly adjusted to the base composition of the recipient's genome or to the backbone genomic segments added later in evolution [10]. We found that the length proportion of transferred regions contained in the known chromosomal PAIs -28. To solve this problem, we detected genomic segments homologous to each known PAI, which were then clumped into a large genomic region. This procedure is somewhat like the process of fragment assembly in which a contiguous region (contig) is made from overlapping fragments in shotgun sequencing [24]. Like the conserved sequences of TTSS structural genes [20], PAIs often share conserved regions. In addition, PAIs frequently carry relics of HGT event such as mobile sequence elements and association with tRNA genes at their boundaries [3]. Islander [25], a database of potential integrative islands in prokaryotic genomes, detects GIs by identifying tRNAs or tmRNA genes, and candidate integrase genes. Although many GIs reported from the database were in accordance with our results, large portion was not annotated as cPAIs mainly due to the absence of homologs of virulence genes in known PAIs and PAIs that are not located at the tRNA loci. As illustrated in Figure 3, frequent distribution of conserved regions between PAIs allows our method to find the entire region of a PAI in a sequenced genome even though its similar sequence is partially known.
A typical genome sequencing team uses genes in the gene cluster or the genome sequence of interest as a query to search for any similar genes in the databases. Then, homologs of pathogenicity/virulence genes are inferred by checking whether descriptions of the retrieved genes have any indications that suggest virulence/pathogenicity or they are from pathogens. Because this approach depends on the examiner's knowledge on known PAIs or pathogenicity/virulence genes and entry descriptions of the retrieved genes often are not informative to infer the function, it is never sure whether the searches thoroughly picked up all the genes associated with PAIs or pathogenicity/virulence. To avoid this uncertainty on the robustness of the open-ended search, we first collected all the reported PAI loci and used them as a query to search for homologs in the complete prokaryotic genomes. Our method guarantees that all the potential PAIs related to the known PAIs were searched without the intervention of human interpretation.
In completely sequenced genomes, we detected cPAIs that are homologous to the published PAIs and show anomaly in DNA composition. The methodology we developed in this study has a limitation in that the detected cPAIs are limited by the query data set of the known PAIs. This caveat, however, can be advantageous when the researchers only concern a specific set of PAIs. Furthermore, this approach can be easily extended to identify various genomic islands (e.g. fitness, metabolism, and resistance islands). Among the cPAIs detected in this study, omission of several well-known PAIs such as Hrp PAI of P. syringae and LIPI-1 of L. innocua is due to their DNA compositions similar to the core genomes which may caused by horizontal transfer from closely related strains or very ancient HGT event. Thus, patterns of best matches of each gene to different species, lineage-specific genes or transferred genes from phylogenetically distant species would be helpful in improving the possibility of finding GIs and PAIs. Also, accumulation of PAI sequence data in bacterial families other than the Enterobacteriaceae will lead to detection of more putative PAIs across various taxa. Finally, it should be noted that the identity of cPAIs as bona fide PAIs need to be confirmed by further experimental verification. We are currently improving the detection scheme and are developing a database for cPAIs in sequenced genomes.

Conclusion
We present the first computational framework combining feature-based analyses and similarity-based analyses. As shown in Figure 3, the similarity-based analysis that is reminiscent of the sequence-assembly procedure was proven to be an efficient method for demarcating the potential PAIs in our study. Also, the function(s) and origin(s) of a cPAI can be inferred by investigating the PAI queries comprising it. With the availability of rapidly increasing complete genome sequences [26] as well as PAI data, the proposed method will be useful in identifying potential PAIs in microbial genomes.

Collection of complete genomes and PAI Data
The sequence files of 148 prokaryotic complete genomes consisting of 157 chromosomes, including 17 archaeal ones as of January 2004 were downloaded from the NCBI FTP server (ftp://ftp.ncbi.nih.gov, supplementary Table  1S) [see Additional file 1]. We searched the GenBank database and literature [3,23] for any descriptions of the "pathogenicity island". Forty five kinds of PAIs and 207 GenBank accessions containing either part or all of the reported PAI loci in 120 pathogenic bacteria, are summarized in Table 1