SIGI: score-based identification of genomic islands
© Merkl 2004
Received: 10 December 2003
Accepted: 03 March 2004
Published: 03 March 2004
Skip to main content
© Merkl 2004
Received: 10 December 2003
Accepted: 03 March 2004
Published: 03 March 2004
Genomic islands can be observed in many microbial genomes. These stretches of DNA have a conspicuous composition with regard to sequence or encoded functions. Genomic islands are assumed to be frequently acquired via horizontal gene transfer. For the analysis of genome structure and the study of horizontal gene transfer, it is necessary to reliably identify and characterize these islands.
A scoring scheme on codon frequencies
Score_G1G2(cdn) = log(f_G2(cdn) / f_G1(cdn))
was utilized. To analyse genes of a species G1 and to test their relatedness to species G2, scores were determined by applying the formula to log-odds derived from mean codon frequencies of the two genomes. A non-redundant set of nearly 400 codon usage tables comprising microbial species was derived; its members were used alternatively at position G2. Genes having at least one score value above a species-specific and dynamically determined cut-off value were analysed further. By means of cluster analysis, genes were identified that comprise clusters of statistically significant size. These clusters were predicted as genomic islands. Finally and individually for each of these genes, the taxonomical relation among those species responsible for significant scores was interpreted. The validity of the approach and its limitations were made plausible by an extensive analysis of natural genes and synthetic ones aimed at modelling the process of gene amelioration.
The method reliably allows to identify genomic island and the likely origin of alien genes.
A microbial genome is by no means a random agglomeration of genes. In addition to operons clustering functionally related genes, additional signals indicating structure can be detected: Base composition e.g. can vary strand-specifically  or the GC-content of a sequence may be correlated with its distance from the origin of replication . Codon usage can be diversified depending on effects like translational efficiency . Such parameters as well as the integration of bacteriophages or megaplasmids are responsible for structures perceptible on the genome level.
In addition, genomic island may result from horizontal gene transfer (HGT), regarded as an additional evolutionary means of biochemical or environmental adaptation . Microbial genomes contain a varying portion of genes presumably acquired via HGT . It was claimed that in some genomes this portion exceeds 20% of the genomic content. To study HGT, various methods based on the analysis of codon or amino acid sequences or the construction of phylogenetic trees were developed; reviewed e.g. in . Each approach has its individual drawbacks and it might be that each method identifies a separate class of genes acquired in a different period of genome evolution . Because of the mechanistic implications, the pieces of DNA captured via HGT frequently have a considerable length. Consequently, it has to be expected that a large fraction of alien genes occurs in clusters. This assumption is supported by biological evidence: Genes responsible for pathogenicity are often agglomerated in islands; see  and references therein. Huge clusters of genes expanding evolutionary fitness can also be found in non-pathogenic species. An example is the symbiotic island of size 611 kb in the genome of M. loti .
An exhaustive analysis of genomic islands has several aspects: It consists of the identification of clusters and the interpretation of gene function. For putatively alien genes (pA, acquired via HGT), their likely origin has to be predicted. The most reliable methods (if applied correctly) coping with the latter task rely on the construction and evaluation of phylogenetic trees. However, each such analysis requires the inference of relations within a gene family. For several reasons like the insufficient number of appropriate clades, it is still difficult to extend these phylogenetic studies to each gene of a complete genome. Therefore, methods were developed aimed at the identification of pA genes without the need for computing phylogenetic trees [5, 10–12]. These intrinsic methods assess (if applied to sequences) the composition on DNA or protein level and measure the deterioration from the typical case. One disadvantage of these surrogate methods  is that the origin of the open reading frames cannot be predicted.
In the following, I introduce a novel surrogate method that has the potential of predicting the putative source of a DNA sequence. It relies on the generally accepted assumption that codon usage in phylogenetically related species is similar [13, 14]. The algorithm is integrated into the software package named SIGI and is based on scores assessing codon usage in pairwise comparisons and the taxonomic evaluation of results. It will be shown that its sensitivity in identifying genomic islands is comparable to the most advanced methods like hidden Markov models (HMM). The combination of a sensitive detector with cluster analysis as implemented here, results in the reliable identification of islands and allows to reduce the number of false positive predictions. This seems to be a problem in many studies of HGT published so far . The validity of the predictions is made plausible by an exhaustive statistical analysis based on natural and synthetic genes. These predictions are one function of SIGI. In addition, it identifies gene clusters originating from additional signals like codon usage bias aimed at the optimisation of translational efficiency.
The following paragraphs are organized as follows: First, the predictive power of the new approach named MPW (see Methods) is compared to methods already introduced in order to validate its ability to find compositionally atypical (CA) genes and genomic islands. Then, the performance of the algorithm in identifying the putative source of genes is studied. Finally, predictions deduced for completely sequenced genomes are presented.
Fraction of compositional atypical genes in microbial genomes. The numbers in the column CU LO are as in , column MPW gives the fraction of CA genes as determined by the MPW approach described here.
CA LO [%]
Escherichia coli O157:H7
Salmonella enterica subsp. Enterica
Vibrio cholerae chr. II
Salmonella typhimurium LT2
Streptococcus pneumoniae TIGR4
Ralstonia solanacearum megaplasmid
Escherichia coli K-12
Ralstonia solanacearum chromosome
Vibrio cholerae chr. I
Chlamydophila pneumoniae AR39
Haemophilus influenzae Rd
Synechocystis sp. PCC 6803
Genomic islands in the genome of Bacillus subtilis.
PHX: ribosomal proteins
yesJ-yesZ, ABC transporter
PBSX prophage (1320–1348)
Bacteria, Bacillales, Chlamydophila, Streptococcus
Bacteria, Bacilli, Streptococcus
Enterobacteriaceae, Bacillus cereus group
Arsenic resistance regulator
PHX: eno, pgm, tpi, pgk, gap
Cell wall synthesis
yxiQ-yxxG, bglS, deaD
Bacteria, gamma subdivision
Streptothricin, tetracycline, mercury regul.
In a critical survey, four surrogate methods were compared  by studying the intersections of those sets of genes identified as pA by the various methods. It turned out that only the CA LO approach and a method based on hidden Markov models tagged the same genes as pA more frequently than expected by change.
In , the genomes of two Xanthomonas pathogens were compared: Xanthomonas axonopodis pv. citri (Xac) and Xanthomonas campestris pv. campestris (Xcc). To identify unique genes the authors BLASTed each gene of one genome against all genes of the second one and analysed the hits. For the following comparison those genes were named unique that had no BLAST hit with an E-value < 10-20 in the second genome. These data sets were downloaded from http://cancer.lbi.ic.unicamp.br/xanthomonas. As the MPW approach was designed to predict clusters only, unique genes lying isolated were removed. The resulting data sets were compared against the MPW prediction. For Xac, 425 genes were identified as unique; the MPW approach annotated 488 genes as CA. 248 genes were both unique and CA. In the Xcc data set consisting of 4240 genes, 340 had the attribute unique and 454 the attribute CA, 213 genes had both attributes. If these attributes were completely unrelated, one would expect for Xcc the following number n of genes with both attributes: n = (454/4240 × 340/4240) × 4240 = 36. In both genomes, the number of CA genes labelled as unique is more the five times the expected value.
Combinations of acceptor and donor species for the generation of random sequences mimicking the amelioration process.
Escherichia coli K-12
It is known that the mean GC-content at the three codon positions is correlated with the mean GC-content of the genome . A significant deviation of the position-specific GC-content from expectation values derived from the mean GC-content of a gene was interpreted as a signal identifying ameliorating genes . In order to determine a precise measure for the calculation of position-specific GC-values, 89 microbial genomic data sets were analysed and used for linear regression. The following formulas were deduced:
GC exp 1 = 0.761 GC mean + 17.9
GC exp 2 = 0.481 GC mean + 15.6
GC exp 3 = 1.732 GC mean - 33.0
GC mean is the mean GC-content of a gene set under study; GC exp 1 .. GC exp 3 are the expected GC-content values of codon positions 1 to 3. To evaluate the GC composition of a gene, expectation values GC exp i as indicated above were derived from its mean GC-content and compared to the position-specific values GC i . An indicator already used to signal amelioration  is the Manhattan distance GC_dist assessing codon positions 1 and 3:
SIGI's performance in predicting the donor genome for synthetic genes modelling the amelioration process.
Escherichia coli K-12
If the random sequences used for the analysis are a proper model for the amelioration process, then the following problem arises: The result suggests that the amelioration of a sequence originating from a species with very dissimilar codon usage causes misleading signals. In addition, the results presented in table 4 do not show a correlation of the relative number of false positive predictions with the GC_dist value. Therefore, and at least for the data set used here, the interpretation of the GC_dist value was no indicator to identify ameliorating sequences. If ameliorating sequences of that kind were frequent in genomes, then the interpretation of codon usage or signatures as in  would be questionable. However, there is one argument that might resolve the dilemma: It was made plausible that the range and the frequency of HGT is constrained by selective barriers  and one might expect that a codon usage too dissimilar to the acceptor's one prevents an expression level necessary to guarantee the survival of a gene in the acceptor's genome.
A second test for the predictive power was based on the analysis of native genes selected in 20 species representing the bacterial and archeal superkingdoms. From the genomes of A. fulgidus, M. acetivorans, T. acidophilum, P. horikoshii, T. maritima, D. radiodurans, C. glutamicum, L. lactis, S. pneumoniae, B. subtilis, E. coli, Y. pestis, H. influenzae, N. meningitidis, H. pylori, A. tumefaciens, M. loti, R. conorii, C. pneumoniae and M. pulmonis, genes annotated by SIGI as putative native ones were extracted. Each data set consisted of more than 200 genes. During the analysis and for all data sets, each of the mentioned species was regarded as the putative acceptor resulting in 20 × 20 individual tests. For all genes, the putative source was predicted individually as described (see Methods). In 14 of the 20 data sets, the highest score identified the source correctly for more than 90% of the genes on the level of the taxonomical family irrespective of the choice of the putative acceptor. Less specific were the results for the data sets extracted from the genomes of L. lactis (72% correct predictions on the family level), S. pneumoniae (81%), C. glutamicum (88%), B. subtilis (88%), Y. pestis (83%) and from genes of chromosome 1 of D. radiodurans (59% correct predictions on the phylum level). Inferring the putative source from three high scoring pairwise comparisons as described below, reduced – as expected – the specificity of the taxonomical classification. However, the number of false predictions decreased drastically: For all cases besides D. radiodurans, less than 5% of the sources were misclassified on the level of the taxonomic class. In the worst case, i.e. for D. radiodurans genes, 32% of the predictions were wrong on the phylum level. These findings suggest that codon usage in the extracted gene set of D. radiodurans is unspecific and has to be studied in more detail. In no case, more than 1% false classifications were generated on the level of the superclass. This result indicates that codon usage in Bacteria and Archaea is quite distinct.
It is known that genomic islands are inhomogeneous in composition and have a mosaic structure, as they are the result of a multistep process . For a first analysis of the genomes however, I identified genomic islands annotated relatively homogeneously with respect to the putative donor.
Each Salmonella genome has approximately 12 fimbria operons frequently involved in virulence . In S. typhimurium, the operons fim, saf, std and sth were identified as CA, bcf, lpf, stc, stf, sti and stj were silent. In S. enteritidis, sef, stb, std, ste and tcf were identified as CA, bcf, saf, sta, stc, stg and sth were inconspicuous. The integrated phage genomes Gifsy-1, Gifsy-2, Fels-1 and Fels-2 were CA. In many cases, for these islands the taxa "Bacteria" or plasmids are predicted as putative source, indicating an unspecific codon usage. The bias seen in the fimbria operons might be due to the strong selective pressure imposed by the host immune system. Based on the analysis of the ycdB gene, it was claimed that genomic islands of low GC-content were acquired from Lactococcus lactis . SIGI gave the following annotations for the most similar sequences of ycdB: AAC75047 of E. coli K-12 predicted as CA originating from Bacilli, second most similar codon usage (hit) Lactococcus, CAD06974 of S. enteritidis CA (Bacilli, first hit Lactococcus), AAL23317 of S. typhimurium was not annotated as CA, SF2054 of S. flexneri CA, (Bacilli, first hit Lactococcus).
In Salmonella enterica, several CA clusters were identified. In the following list, for some clusters the positions, gene names, coded proteins and putative sources are given: 1004 kb – 1053 kb contains genes annotated as putative bacteriophage proteins, originating according to SIGI from plasmids or Enterobacteriaceae; 1625 kb – 1651 kb, ssa genes, coding for a type III secretion system, Enterobacteriaceae; 2118 kb – 2135 kb, rfb, putative transferases, inhomogeneous codon usage; 2863 kb – 2900 kb, prg and sip, pathogenicity 1 island effector proteins, spa, surface presentation of antigens, inv, secretory proteins, Enterobacteriaceae; 3830 kb – 3838 kb, ccm heme exporter protein, Proteobacteria; 3930 kb – 3941 kb, waa, involved in the lipopolysaccharide core biosynthesis, Enterobacteriaceae; 4403 kb – 4543 kb, topB, a topoisomerase, pil, vex polysaccaride export, Enterobacteriaceae, plasmids.
The codon usage of most CA genes identified in Bacillus subtilis (see table 2) is unspecific. If SIGI predicts a specific taxon, it is a closely related clade. A similar result was observed for the genome of Escherichia coli O157:H7. 728 genes were identified as CA, the codon usage table causing the highest score was in 137 cases derived from a plasmid, 349 times it was from a species belonging to the gamma subdivision. All prophages and prophage-like elements known to be integrated into the genome  were found plus several additional CA clusters. All known pathogenicity islands of V. cholerae are CA, among these were the recently identified islands on chromosome one, named "seventh pandemic islands" . For a detailed analysis, see the material deposited on our webserver.
In the genome of the α-Proteobacterium Caulobacter crescentus, only 2.5% of the genes are CA. Several genes clustered in the area from 621 kb – 694 kb were predicted as originating from the Rhizobiaceae group: CC0575 coding for a putative beta-lactamase (p1 = 5 × 10-4, p2 = 9 × 10-5), CC0576, it's product is an asparaginase family protein (p1 = 0.01, p2 = 9 × 10-5) or CC0618, cysG coding for a siroheme synthase (p1 = 0.01, p2 = 9 × 10-5). The indices p1 and p2 are explained in Methods. A second cluster at 2.90 kb – 2.95 kb contains the genes for the conjugal transfer protein trbI and several transposases. The codon signature is inhomogeneous, dominated by species of the Rhizobiaceae group.
Haemophilus influenzae Rd is a small, Gram-negative bacterium; the only natural host is human. For 15 genes, the γ-Proteobacterium Shewanella was predicted as putative source. 12 of these hits were clustered in the region 1572 kb – 1590 kb which belongs to an island extending from 1555 kb – 1595 kb. It contains among genes for hypothetical proteins fepC and genes coding for Mu proteins like muA. Recently, a Shewanella species was identified as human pathogen  making the prediction plausible. As the GC-content of H. influenzae is 38%, these predictions have to be interpreted with care (see above, results for ameliorating genes).
In many cases, restriction-modification enzymes were identified as CA like in Nostoc. A genomic island (3278 kb – 3289 kb) containing a type 1 restriction modification enzyme follows a tRNA-Ala gene and a transposase. A second enzyme of that type is located in the genomic island (4186 kb – 4220 kb) following a tRNA-Gly. These genes are predicted as originating from Bacilli and Chlamydophila.
In the genome of Streptococcus pyogenes, 131 genes are annotated as "phage associated". However, only 50 of these genes were identified as CA. A CA island spans from 884 kb to 895 kb containing a putative methyltransferase and the srt system involved in lantibiotic production; the putative source is diffuse.
It is known that the genome of Mesorhizobium loti contains a huge symbiosis island (4645 kb – 5256 kb) of size 611 kb . SIGI predicts most of these genes as originating from the Rhizobiaceae group. The hypothetical protein MLR6371 has the codon signature of the beta subdivision. Examples for additional CA clusters are: 341 kb – 362 kb coding for unknown proteins around a bacteriophage integrase, genes are similar in codon usage to plasmids and Rhizobiaceae; 779 kb – 827 kb codes several transferases; 843 kb – 861 kb containing genes for an adenylate cyclase, putatively originating from the Rhizobiaceae group (p1 = 0.04, p2 = 3 × 10-5) and rsp, the rhizobiocin secretion system; 2592 kb – 2610 kb containing a cyclase and a glycosyltransferase gene; 3219 kb – 3234 kb with genes for a DNA invertase rlgA and an excisionase; 3705 kb – 3755 kb containing genes for a glycosyltransferase, a DNA polymerase, chloramphenicol-acetyltransferase, heat shock proteins, codon usage most similar to Rhizobiaceae group; 5714 kb – 5742 kb containing genes for elements of an ABC-transporter, methyltransferases, hydrolases; 6580 kb – 6681 kb genes for hypothetical proteins, an ABC-transporter, a DNA modification-methylase, a histidine-kinase and a site-specific recombinase, the codon usage is most similar to that of the Rhizobiaceae group.
For a complete listing of the results, see the material deposited on the web server http://www.g2l.bio.uni-goettingen.de. For each genome, results are available in a tabulated version and a format readable by the gene browser ARTEMIS .
It was argued that codon usage and atypical GC-content are no reliable indicators for the study of horizontally transferred genes . An analysis of positional orthologous genes in E. coli and S. typhimurium came up with a similar result . Interestingly the genes referred to as being classified as false positives in E. coli K-12 (gloB, gadB, yheB) with the CA LO approach were not classified as CA by the MPW method. Definitely, the number of false positive predictions can be reduced by applying a clustering method as introduced here. The risk of missing a large fraction of pA genes should be minimal, as the pieces of transferred DNA have usually a considerably length , although there exist exceptions like in the genome of Neisseria .
The assumption that surrogate methods might overlook genes acquired by horizontal transfer might be valid for more ancient events, recently acquired genes seem to be detected to a great extend by surrogate methods . Lawrence and Ochman estimated the age of imported genes . The conclusion was that most are relatively recent, i.e. acquired within the last few million years; see e.g. . This suggests that older imports have been purged from the genomes presumably because these genes did not improve fitness . If this argument is valid, there is no need to search for huge amounts of ancient pA genes.
The highly consistent findings of the HMM and the MPW approach for the B. subtilis genome confirm specificity and sensitivity of the MPW method. However, there might be two problems: Predicting the false donor and amelioration. The most convincing proof for the correctness of SIGI's prediction are concordances with phylogenetic studies. One example of consistent results is the analysis of the ycdB gene presented above. However, in many cases, genes identified as pA with other methods were not part of a CA cluster. This was the case for gltB and ino1 of Thermotoga maritima identified as archeal , or the events of HGT described for D. radiodurans . The analysis of synthetic genes showed that the risk of predicting a false source is high, if the codon usage of the donor is extremely different. There is however biological evidence that such HGT events are rare. Therefore, most of SIGI's predictions are reliable on a statistical level. The analysis of the domain structure of aminoacyl-tRNA synthetases revealed a complex history of HGT events . In no genome, the MPW method annotated an aminoacyl-tRNA synthetase as CA. This might indicate the limitations of the approach, which is limited to signals on the codon level.
The GC-content decreases near the replication terminus of several microbial species. The AT richness of the terminus region could be caused by the replication machinery or the DNA repair system . This deviation might be the source for classifying genes incorrectly as CA. In many cases, the GC-content of pathogenicity islands is however lower than the average content – see examples in [22, 31] – and it might be that genomic islands were imported preferentially opposite of the origin of replication. In addition, not all genomic islands are AT rich: the area between 1555 kb and 1595 kb in the genome of H. influenzae consists of 40 genes having a GC-content that is higher and 11 genes having a GC-content lower than the mean GC-content of the genome. If GC-content is determined gene-wise, then for 45% of the genomes analysed here, more than 75% of the genes have a lower than mean GC-content, which is in agreement with [40, 41]. However, 18% of the genomes harbour in GIs more than 50% of CA genes having a higher GC-content. An extreme case is the genome of S. solfataricus, where 90% of CA genes have a GC-content higher than the mean value of 35%.
In principle, the MPW should also identify genomic islands whose GC-content is similar to the rest of the genome as long as the codon usage is different. Even a similarity of codon usage as detected in thermophilic bacteria of different clades  will not cause false predictions: because of the interpretation of taxonomic relation between hits, the annotation will in these cases be less specific but not false.
There are several options to improve SIGI: The integration of more codon usage tables and additional indicators like those introduced in  or  may further enhance its predictive power. Applying models for the amelioration process [14, 34] may allow to "reameliorate" genes and to determine the source of pA genes more specifically. Finally, a statistical model for the MPW approach has to be developed.
SIGI is able to detect genomic islands with high sensitivity. These areas are also candidates for HGT events. Studying such events, SIGI complements methods based on phylogenetic approaches. The analysis of the taxonomical relation among putative donors makes clear that a simple comparison of codon usage may create misleading predictions.
The simplest statistical model describing genes as a sequence of codons assumes that codons occur independently from each other. For this model, the Neyman-Pearson lemma assures that a function of the type
is optimal to decide, whether gene = start codon, cdn 1, ccdn 2......cdn n is a member of the family G1 characterized by codon frequencies f G1 (cdn j ) or belongs to family G2 having codon frequencies f G2 (cdn j ). As a result of test theory, it is known that there exists no other function with a decision strength greater than expression (I). Applying the logarithm and normalizing for gene length gives:
Now h(gene) is the sum of species-specific log-odds scores PW G1G2(cdn) divided by the number of codons constituting the gene. Scores of that type were utilized frequently and are supported by a sound theory . Recently it was shown that a similar approach is appropriate to quantify codon usage bias associated with translational efficiency .
The score values PW G1G2(cdn), which were here always deduced from codon frequencies among synonymous codons, can be used to decide whether codon usage in gene resembles more the prevalences of species G1 or species G2. If gene is from genome G1 and if h(gene) is >> 0 then its codon usage is more similar to G2. Therefore, and if G2 is taxonomically distinct, the gene under study must be regarded an alien gene and genome G2 might be its source. In the study presented here, a putative source was predicted for genes longer than 100 codons only. This lower limit for gene length was introduced in order to reduce statistical variation due to small sampling size.
As it was one aim of the study to predict the putative source of compositional atypical genes, it was necessary to generate a sufficiently large number of score sets covering most of the possible origins of taxonomically related species. A prerequisite for the calculation of these scores PW G1G2 are codon usage tables. Their compilation was initiated with data sets derived from completely sequenced microbial genomes, which were publicly available. Frequency values f G2(cdn j ) were determined from those genes not annotated as hypothetical or with a putative function. It is known that codon frequencies in putatively highly expressed (PHX) genes deviate significantly from mean values [3, 44]. For each gene, z-scores were determined for CU contrast  and GCB-values  (compare figure 2). A gene was regarded as PHX, if the combination of the two scores exceeded a predefined cut-off value. This initial set was supplemented with entries from the CUTG database  in the version as of Aug. 2002. From this collection, only those microbial entries were accepted that contained more than 6400 codons. If more than one frequency table existed for the same taxonomic species, the data set deduced from the largest number of codons was processed further. After data collection, similarity of codon usage among species was controlled by calculating pairwise a Manhattan-like distance among codon usage tables. This distance values were used to select the final set CUTG_RF of codon frequencies. For all elements of CUTG_RF, it was confirmed that the most similar species on codon usage belongs to the same taxonomic class or superclass. This step and the other precautions mentioned above were introduced in order to guarantee taxonomic relatedness among the entries and to eliminate codon usage tables presumably derived from a non-representative sample of a genome. The collection was supplemented with codon usage tables of plasmids. Altogether CUTG_RF consisted of n RF = 371 entries used for the calculation of scores PW G1G2(cdn) at position G2. The codon frequencies f G1 (cdn j ) of the genome G1 under study were determined from those genes not annotated with the terms "hypothetical" or "putative" and which were no PHX genes.
For the analysis of each gene of a genome G1, its codon usage was evaluated in a multiple pairwise test (MPW) using n RF individual scoring schemes PW G1G2(cdn). The species G2 max causing the highest score h MPW (gene) was considered a putative source, if h MPW (gene) exceeded a cut-off value. In order to quantify the statistical relevance of the prediction, two parameters p1 and p2 were introduced. p1 gives the fraction of genes in G1 that achieved a score at least as high as h MPW (gene), if evaluated with the scoring scheme PW G1G2max (cdn). A p1- value of 0.01 e.g. indicates that 1% of the genes in G1 have a score equal to or greater than h MPW (gene) if compared to the codon usage of G2 max . The second parameter p2 was derived from a taxonomic rating of those k = 2 or 3 species G2 1 -G2 k triggering the k largest score values for gene. The basis for the analysis was a taxonomic tree generated by using material obtained from the ftp-server of the NCBI ftp://ftp.ncbi.nih.gov/pub/taxonomy/. The nodes that represent species belonging to the set CUTG_RF were labelled with an indicator. To calculate the parameter p2, the position of the leaves G2 1 ... G2 k were used to identify the nearest node (ancestor in the taxonomy tree) t 1 that subsumes the k leaves. If t 1 is the ancestor of those n t species belonging to CUTG_RF, then the probability p2 of picking by chance k species belonging to the taxonomic group t 1 can be calculated as
Formula (III) was adapted accordingly, if n t1 was smaller than k. The identification of taxon t 1 and the p 2 value allow to determine the specificity of codon usage. If the high scoring species are taxonomically unrelated, t 1 will be unspecific and p2 relatively large. A specialized codon usage will result in small p 2 values and a more specific taxon.
The concepts introduced so far allowed to characterize individual genes and to quantify related scores statistically. Now, it was necessary to assess the set of all h max (gene) values in order to derive a cut-off which discriminated those values h(gene) > 0 that deviated significantly from expected fluctuations. Because of the focussing on identifying clusters of CA genes, a statistical approach could be utilized to eliminate false positive predictions. To identify genomic islands and to dynamically adapt the cut-off for each genome individually, a two-pass strategy was used. During the first pass, for each gene with number i, all n RF scores were determined and h MPW (gene i ) was identified. A text string genome was created according to the following instruction:
For the string genome the global frequency f glob (S) was determined and clusters SSSSS indicating a successive sequence of at least five CA genes were localized. These clusters were extended in both directions until the local frequency f loc (S) fell below the value 2 × f glob (S). The h MPW (gene) values of the genes in the extended clusters and the remaining ones were accumulated in two histograms h_cl and h_rem. From the histogram h_cl, the cut-off c_o 2 for round two of the clustering process was derived as the h MPW value exceeded by 95% of the values determined for genes in extended clusters. c_o 2 allows to estimate the error of not classifying a CA gene correctly: Applying c_o 2 on h_rem gives the number of genes having a score above this cut-off and not being classified as a CA gene. Using cut-off c_o 2, the clustering algorithm was reinitiated and the genes classified as belonging to extended clusters in round two were annotated as being CA. The cut-off for round one was always set to 0.025, a value deduced from the analysis of chromosome two of V. cholerae (see Results).
There were several reasons to design the algorithm as described: The main argument for focusing on clusters was a combination of biological evidence and statistical principles that help to increase the reliability of the prediction. First, it is known that genomic islands frequently have a size of 10 – 200 kb . Second, if the probability of annotating a gene as CA is p(S) then the probability for a CA cluster of n successive genes is p(S) n , if independency is assumed. Thus, for realistic values of p(S) and n it is highly unlikely that such a cluster occurs merely by chance. Even if we consider a large value like p(S) = 0.3, then the probability p(S) 5 for a cluster of size 5 (as assumed above) is < 2.5 × 10-3. A rough estimation (1 / p(S) 5) gives that then one among 400 of such clusters occurs merely by chance and is a false positive classification. This situation allows to gain high sensitivity in identifying individual CA genes and to deliberately adjust the cut-off level as described above. As mentioned, the calculation is based on the assumption that the classification of adjacent genes is independent of the context. Assuming independency is a simplistic model, however a rough approximation, if compared to findings in E. coli: 80% of transcription units (which subsume operons) have less than five genes .
The factor 2.0 used in the expression 2 × f glob (S) for the propagation of extended clusters was inferred from the analysis of the integron island on chromosome two of V. cholerae (see Results). The exact value of this parameter did not critically influence the identification and localization of the integron island (annotation as from , data not shown). In general, the algorithm used for the extension of clusters resembles principles implemented in BLAST for the identification of optimal high scoring segment pairs .
Archaeoglobus fulgidus (NC_000917), Aeropyrum pernix (NC_000854), Halobacterium sp. NRC-1 (NC_002607), Methanothermobacter thermautotrophicus (NC_000916), Methanocaldococcus jannaschii (NC_000909), Methanosarcina acetivorans (NC_003552), Methanosarcina mazei (AE008384), Pyrococcus abyssi (NC_000868), Pyrococcus horikoshii (NC_000961), Sulfolobus solfataricus (NC_002754), Sulfolobus tokodaii (NC_003106), Thermoplasma acidophilum (NC_002578), Thermoplasma volcanium (NC_002689).
Agrobacterium tumefaciens (AE007869, AE007870), Aquifex aeolicus (NC_000918), Bacillus halodurans (NC_002570), Bacillus subtilis (NC_000964), Brucella melitensis (NC_003317, NC_003318), Borrelia burgdorferi (NC_001318), Buchnera sp. APS (NC_002528), Campylobacter jejuni (NC_002163), Caulobacter crescentus (NC_002696), Chlamydia muridarum (NC_002620), Chlamydia trachomatis (NC_000117), Chlamydophila pneumoniae J138 (NC_002491), Chlamydophila pneumoniae AR39 (NC_002179), Clostridium acetobutylicum (NC_003030), Clostridium perfringens (NC_003366), Corynebacterium glutamicum (NC_003450), Deinococcus radiodurans (Chromosome 1, NC_001263), Escherichia coli K-12 (NC_000913), Escherichia coli O157:H7 EDL933 (NC_002655), Fusobacterium nucleatum (NC_003454), Haemophilus influenzae Rd (NC_000907), Helicobacter pylori 26695 (NC_000915), Helicobacter pylori J99 (NC_000921), Lactococcus lactis subsp. Lactis (NC_002662), Listeria innocua (NC_003212), Listeria monocytogenes (NC_003210), Mesorhizobium loti (NC_002678), Methanopyrus kandleri (NC_003551), Mycobacterium tuberculosis CDC1551 (NC_002755), Mycobacterium tuberculosis H37Rv (NC_000962), Mycoplasma genitalium (NC_000908), Mycobacterium leprae strain TN (NC_002677), Mycoplasma pneumoniae (NC_000912), Mycoplasma pulmonis (NC_002771), Neisseria meningitidis Z2491 (NC_003116), Nostoc sp. PCC 7120 (NC_003272), Oceanobacillus iheyensis (NC_004193), Pasteurella multocida (NC_002663), Pseudomonas aeruginosa (NC_002516), Pyrobaculum aerophilum (NC_003364), Ralstonia solanacearum (NC_003295, NC_003296), Rickettsia conorii (NC_003103),Rickettsia prowazekii (NC_000963), Salmonella enterica (NC_003198), Salmonella typhimurium (NC_003197), Shigella flexneri (NC_004337), Sinorhizobium meliloti (NC_003047), Staphylococcus aureus subsp. aureus N315 (NC_002745), Staphylococcus aureus strain Mu50 (NC_002758), Streptococcus agalactiae (NC_004116), Streptococcus pneumoniae R6 (NC_003098), Streptococcus pneumoniae TIGR4 (NC_003038), Streptococcus pyogenes (NC_002737), Synechocystis sp. PCC6803 (NC_000911), Thermoanaerobacter tengcongensis (NC_003896), Thermosynechococcus elongatus (NC_004113), Thermotoga maritima (NC_000853), Treponema pallidum (NC_000919), Ureaplasma urealyticum (NC_002162), Vibrio cholerae (NC_002505, NC_002506), Xanthomonas axonopodis (NC_003919), Xanthomonas campestris (NC_003902), Xylella fastidiosa (NC_002488), Yersinia pestis (NC_003143).
The data sets for Thermus thermophilus and Picrophilus torridus were preliminary data prepared at the Göttingen Genomics Laboratory.
The project was carried out within the framework of the Competence Network Göttingen "Genome research on bacteria" (GenoMik) financed by the German Federal Ministry of Education and Research (BMBF). I thank S. Waack and M. Stanke for discussions concerning statistics and test theory and A. Wiezer and J. Sobkowiak for supplying me with a perfect computational infrastructure.
This article is published under license to BioMed Central Ltd. This is an Open Access article: verbatim copying and redistribution of this article are permitted in all media for any purpose, provided this notice is preserved along with the article's original URL.