  • Methodology article
  • Open access
  • Published:

2SigFinder: the combined use of small-scale and large-scale statistical testing for genomic island detection from a single genome

Abstract

Background

Genomic islands are associated with microbial adaptations, carrying genomic signatures different from the host. Some methods perform an overall test to identify genomic islands based on their local features. However, regions of different scales will display different genomic features.

Results

We propose a novel method, “2SigFinder”, the first to combine small-scale and large-scale statistical testing for genomic island detection. The method was evaluated on genomic island boundary detection and on the identification of genomic islands and their functional features in real biological data. We also compared it with comparative genomics and composition-based approaches. The results indicate that 2SigFinder is more efficient in identifying genomic islands.

Conclusions

On real biological data, 2SigFinder identified genomic islands from a single genome and reported robust results across different experiments, without annotated information of genomes or prior knowledge from other datasets. In S. enterica Typhi CT18, 2SigFinder identified 25 Pathogenicity, 1 tRNA, 2 Virulence and 2 Repeats features out of 27 Pathogenicity, 1 tRNA, 2 Virulence and 2 Repeats, and detected 101 Phage and 28 HEG features out of 130 Phage and 36 HEGs, which shows that it is more efficient in detecting functional features associated with GIs.

Background

Bacteria are highly diverse and can adapt to environmental changes. This adaptability is partly due to horizontal gene transfer (HGT). In 1990, Hacker et al. discovered virulence gene clusters in some Escherichia coli genomes that were not found in other closely related species; these gene clusters were named pathogenicity islands (PAIs) [1]. PAIs can be divided into many types, including symbiotic islands, metabolic islands, secretory islands, and resistance islands. Generally, genomic island (GI) is the standard term for a cluster of horizontally transferred genes 10–200 kb in length. A horizontally transferred region is called a GI until its gene function has been fully determined, after which a more specific term based on that function is used [2].

In the genomic era, the importance of GIs should be taken seriously. With new genome sequencing technology, it is possible to identify genomic regions that differ between species or strains. Generally speaking, the more relevant classification of genomic islands is by their associated functions [3, 4]. For example, genomic islands are associated with secretion systems, iron uptake, and the secretion of toxins and adhesins, all of which increase the survival rate of pathogens in the host [5, 6]. Pathogens can initially regulate the detectability of chromosomes and exhibit different pathogenic phenotypes [7, 8]. GIs in bacteria contribute to many adaptation processes, such as metal resistance, antibiotic resistance, and secondary metabolism, thereby providing environmental and industrial benefits [9, 10]. Therefore, the identification of GIs in different genomes is a key factor in the study of microbial evolution and function.

In large-scale comparative genomics, GIs show characteristic features such as a distinct sequence composition, flanking direct repeats, mobility-related genes, and tRNA genes, which can be used to identify them [4, 11, 12, 13]. Genomic islands are scattered using a system model different from the host. Therefore, their differences can be determined by comparison with the differences of 16S rRNA [14]. Detection algorithms based on local alignment [15] and on whole-genome alignment [16] have been developed. These methods rely on multiple genome alignments: regions that are inconsistent or unique among the aligned genomes may be considered GIs, and the remaining regions are considered conserved. At the same time, several methods for constructing and applying multi-layer large-scale genome comparisons have been reported for complex situations. For example, MobilomeFINDER identifies tRNA genes that are shared across several related genomes and uses Mauve to search for genomic islands around homologous tRNA genes [17]. GI identification using this method depends on interrupted tRNAs, so genomic islands that are not associated with a tRNA may be missed. This problem can be addressed by MOSAIC, which is used to determine whether a strain-specific region is inserted into a tRNA region [18]. However, it often incorrectly identifies inversions and translocations as strain-specific regions. Another widely used GI prediction method is IslandPick [19]. For a given genome, IslandPick first selects the comparison genomes without bias and then calls Mauve to construct genome-wide alignments. IslandPick avoids duplication by rechecking Mauve’s alignment regions [20, 21]. The above algorithms are based on genome comparison and are therefore limited by the need for annotations or for closely related genomes, which may be unavailable. Since there are many genomes, the genomes compared with the target species should be carefully selected [22].

In addition, some algorithms detect genomic islands based on the composition of the genome sequence. These algorithms can be highly efficient, but they must distinguish anomalous regions from the remaining compositional biases of the genome, relying on the fact that a GI has a different sequence composition from the host. They are useful for quickly identifying GIs in a genome or sequence and do not require additional genomes. Oligonucleotides two to nine bases long and GC content are often used to describe the composition of a genome sequence [11, 23, 24, 25, 26]. For example, PAI-Finder detects GIs by calculating abnormal GC content and codon-usage deviations, and further evaluates candidate PAIs to determine whether PAI-like regions partially or completely span GIs [27]. The PAI database (PAIDB) and PAI-Finder are combined on one platform, from which annotated data and prediction information can be downloaded [28, 29].

Hidden Markov models (HMMs) help to remove or detect abnormal regions containing compositional deviations [23, 30, 31, 32]. For example, SIGI-HMM constructs an HMM to eliminate ribosomal regions with codon usage preferences [30, 31]. In addition, HMMer can identify the PFAM37 migrating gene map by searching each predicted gene [12], so IslandPath-DIMOB [32] uses HMMs to identify the migrating gene map [33]. In contrast, Alien_Hunter improves the prediction of GI boundaries by introducing a special scoring system based on variable-length k-mers and using HMMs [23]. Although these HMM-based methods are more efficient than other methods in predicting GIs, they require a relatively large amount of parameter training and computation. Therefore, long running times are needed to predict even one GI.

Other methods segment a sequence into regions and extract the compositional characteristics of the sequence, instead of evaluating a set of genes [34, 35, 36, 37]. They measure significant differences between windows to identify windows that differ in composition. The centroid method designates some windows as GIs based on a comparison of the windows’ scores [34]. However, it is limited because the host signature is estimated from all windows; as a result, the host’s local information contains some noise. INDeGenIUS clusters the sequences to obtain a “major cluster” from which it estimates the host’s native signature, which solves this problem [35, 36]. However, measuring every oligonucleotide is unnecessary, because only some oligonucleotides are important indicators of horizontal transfer. Therefore, SigHunt detects core tetranucleotides based on related genomes, using the tetranucleotide mass fraction instead of selecting all possible tetranucleotides [37].

Although the above algorithms perform well, some problems remain: 1) some methods detect GIs mainly through a global test of whether the local signature of a region clearly differs from that of the host. However, these characteristics depend directly on the scale of the genomic signatures: large-scale signatures may miss small local details, whereas small-scale signatures retain local features but GI detection is largely affected by large-scale differences. Therefore, future GI prediction methods should explore genomic signatures at multiple scales; 2) the above algorithms detect some typical regions as possible genomic islands and do not refine their boundaries. If the predicted boundaries of GIs can be further optimized, the effectiveness and efficiency of the prediction will be improved.

To address these problems, we propose a novel method, “2SigFinder”, the first to combine small-scale and large-scale statistical testing for genomic island detection. We propose an iteration of a small-scale t-test with large-scale feature selection for each region of the genome to quantify its compositional difference from the host, instead of calculating a distance or a discrete-interval cumulative score for each region. We use the higher moments of each tetranucleotide and design an iteration of large-scale statistical testing with dynamic signals from small-scale feature selection to identify multi-window segments; we then split these into optimal distinct segments according to the GC-content bias and detect the genomic islands. Finally, the GC-based segmentation method and the Markovian Jensen–Shannon divergence are used to optimize the boundaries of the genomic islands.

Results

Comparison to the algorithms based on the windows for detecting GIs

We evaluated the effectiveness of our algorithm by classifying GIs and non-GIs. Langille et al. constructed a GI benchmark dataset from 675 complete bacterial genomes. All genomes have a sufficient number of related species or strains, using strict but possibly flexible standards [19]. They identified regions conserved across all genomes as the negative dataset and built a standard dataset to evaluate the efficiency of genomic island detection methods. The data contains 771 genomic islands (GIs) and 3770 non-genomic island fragments (non-GIs), ranging in length from 8 kb to 31 kb. These GIs and non-GIs come from 118 genomes of representative species from the domains Bacteria and Archaea.

2SigFinder was used to classify GIs and non-GIs, where the transformed window is 1, the eye window is 5, the neighborhood size is 10, the long window is 50, 256 core features and 4 dynamic features are used, with 10 iterations and 0.05 standard error. Finally, 3 kb around the “raw” genomic islands was used to find the genomic island boundaries. Three published algorithms were also evaluated on the same dataset with default values [34, 35, 37]. For SigHunt and INDeGenIUS, a significance level of 0.05 was used to identify genomic islands, where DIAS was calculated based on all of the tetranucleotides.

The overall accuracy of 2SigFinder was 85.16%, the best among the methods tested, while the overall accuracy of the other methods was similar, ranging from 80 to 82% (Table 1). As for precision and recall, the recall of 2SigFinder exceeds 45%, which no other method achieves. INDeGenIUS achieved better precision, but its recall was lower (19.99%) [35]. SigHunt did not perform as expected; we infer that this is because it predicts more genomic islands (758) with a smaller average length (4670 bp) than the other methods (number: 277–346; average length: 13,146–22,423 bp). These results indicate that 2SigFinder outperforms the other algorithms in genomic island detection.

Table 1 Comparison of the window-based methods Centroid, INDeGenIUS, SigHunt, and the proposed 2SigFinder on classification of GIs/non-GI datasets. The precision, recall and overall accuracy of each method are calculated based on the number of overlapping nucleotides in both published GIs and predicted GIs
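The nucleotide-level metrics in Table 1 can be illustrated with a minimal sketch. It assumes that islands are given as inclusive 1-based (start, end) intervals and that positions outside the labelled GI and non-GI regions are ignored; the function names are hypothetical and are not part of 2SigFinder. Expanding intervals into position sets is only practical for small examples.

```python
def covered_positions(intervals):
    """Return the set of positions covered by inclusive (start, end) intervals."""
    covered = set()
    for start, end in intervals:
        covered.update(range(start, end + 1))
    return covered


def nucleotide_metrics(predicted, known_gi, known_non_gi):
    """Precision, recall and overall accuracy counted per nucleotide.

    Assumption: only positions labelled as GI or non-GI are scored.
    """
    pred = covered_positions(predicted)
    pos = covered_positions(known_gi)
    neg = covered_positions(known_non_gi)
    tp = len(pred & pos)   # predicted and truly GI
    fp = len(pred & neg)   # predicted but labelled non-GI
    fn = len(pos - pred)   # GI nucleotides that were missed
    tn = len(neg - pred)   # non-GI nucleotides correctly left out
    total = tp + fp + fn + tn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    accuracy = (tp + tn) / total if total else 0.0
    return precision, recall, accuracy
```

For genome-scale data, an interval-arithmetic implementation would be preferable to position sets, but the definitions of the three metrics stay the same.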

Identification of genomic islands in Pseudomonas aeruginosa LESB58

We next evaluated the proposed method 2SigFinder on P. aeruginosa LESB58 genome, whose genomic islands have been explored widely [38,39,40]. There are currently 6 prophage gene clusters and 5 annotated pathogenicity islands in P. aeruginosa LESB58 [38, 41, 42].

We applied 2SigFinder to identify the genomic islands in the P. aeruginosa LESB58 genome, where the transformed window is 4, the eye window is 5, the neighbourhood size is 4 and the long window size is 100, using 256 core features and 4 dynamic features, with 4 iterations in IST-LFS and 4 iterations in ILST-DSFS, and 0.05 standard error. Finally, 2 kb upstream and downstream of the ‘raw’ genomic islands was used to refine the boundaries of the predicted genomic islands. Six window-based algorithms and one comparative genomics method were also used to predict the genomic islands with default values [19, 23, 31, 32, 34, 35, 37]. The same significance level of 0.05 was used for all tests, and the resulting scores were used to identify the putative GIs. Figure 1a compares the detection algorithms on P. aeruginosa LESB58 [37, 41, 42]. Since Alien_Hunter detected a large number of hypothetical regions, its predicted GIs have the greatest total length (Fig. 1b). Note that although Alien_Hunter detected 293 kb of the 451 kb of established island-encoded DNA, its number of false positives was large (Fig. 1b). Thus, it achieves better recall at the expense of precision (Fig. 1c and Tables 2 and 3).

Fig. 1

Performance of the proposed 2SigFinder (2SF), SIGI-HMM (SH), Alien_Hunter (AH), Centroid (CE), IslandPath-DIMOB (IPA), INDeGenIUS (IN), SigHunt (SI) and IslandPick (IPI) on the detection of genomic islands in P. aeruginosa LESB58. a Predicted GIs found by all of the methods; the known genomic islands are shown as vertical grey bars. b Overall length of the predicted genomic islands, true positives and false positives of all of the evaluated methods at the nucleotide level. c Precision, false positive rate (FPR) and F1-score of all of the evaluated methods at the island level, in which the precision, false positive rate and F1-score are calculated based on the number of known GIs that are more than 50% covered by the results of the prediction methods

Table 2 Total length, average length and number of genomic islands predicted by 2SigFinder, SIGI-HMM, Alien_Hunter, Centroid, IslandPath-DIMOB, INDeGenIUS, SigHunt and IslandPick in P. aeruginosa LESB58, together with the total number of nucleotides overlapping between known and predicted GIs and the number of known GIs that are at least 50% covered by the results of the prediction methods
Table 3 Precision, false positive rate (FPR) and F1-score of the proposed method 2SigFinder, SIGI-HMM, Alien_Hunter, Centroid, IslandPath-DIMOB, INDeGenIUS, SigHunt and IslandPick on detection of genomic islands in P. aeruginosa LESB58. The precision, false positive rate and F1-score are calculated based on the number of known GIs that are more than 50% covered by the results of the prediction methods

In contrast, the comparative genomics method IslandPick obtained better predictions, detecting 16 genomic islands. To further evaluate predictive ability at the GI level, we calculated the precision and F1-score using the annotated genomic islands that are more than 50% covered by the prediction results. Half of the 5 known genomic islands are predicted by IslandPick, which leads to a high FDR and a low F1-score (Fig. 1c and Tables 2 and 3).
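The island-level criterion (a known GI counts as detected when more than 50% of it is covered by predictions) can be sketched as follows. The text does not spell out how precision is obtained at the island level, so the rule used here is an assumption: a prediction is counted as correct when it overlaps a detected known island. The function names are hypothetical, and predictions are assumed not to overlap one another.

```python
def overlap(a, b):
    """Number of positions shared by two inclusive (start, end) intervals."""
    return max(0, min(a[1], b[1]) - max(a[0], b[0]) + 1)


def covered_fraction(island, predictions):
    """Fraction of a known island covered by non-overlapping predictions."""
    length = island[1] - island[0] + 1
    return sum(overlap(island, p) for p in predictions) / length


def island_level_scores(known, predicted, threshold=0.5):
    """Return (precision, recall, F1) at the island level.

    Assumption: a prediction is a true positive if it overlaps a known
    island whose covered fraction exceeds the threshold.
    """
    detected = [k for k in known if covered_fraction(k, predicted) > threshold]
    correct = [p for p in predicted if any(overlap(p, k) > 0 for k in detected)]
    recall = len(detected) / len(known) if known else 0.0
    precision = len(correct) / len(predicted) if predicted else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)
```

With two known islands of which one is 80% covered and the other 20% covered, only the first counts as detected, so recall is 0.5 regardless of how many nucleotides the predictions span.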

2SigFinder predicted 10 genomic islands with a large average length (Table 3). We observed that about 50% of the predicted 277,741 nucleotides were found in annotated genomic islands. It achieved a large number of true positives with few false positives (Fig. 1b). We then found that half of the 6 annotated genomic islands were predicted by 2SigFinder, resulting in high precision and F1-score (Fig. 1c and Table 3).

Through a comprehensive study, Alien_Hunter was found to be sensitive, but it has a high number of false positives. Some window-based algorithms found genomic islands, but their sizes are small. Thus, the results indicate that 2SigFinder is more efficient in identifying genomic islands.

Identifying functional features in S. enterica Typhi CT18

Comparative genomics has found that genomic islands are often accompanied by insertion sequences, repeat sequences and migratory tRNA genes. These features help to reveal the function of genomic islands. Therefore, we further studied these functional features in the real genomic islands and in the genomic islands predicted by the different methods. We used the annotated genome to search for characteristic genes in the genomic islands. We looked for genes encoding ribosomal proteins, genes with partner degradation functions and genes associated with energy metabolism, treated them as highly expressed genes (HEGs), and counted their total number within genomic islands [39]. We used the REPuter software to find repeated sequence fragments in genomic islands [40], and downloaded the annotation file from the US National Center for Biotechnology Information to look for insertion sequences within the genomic islands.

Here, we further analysed S. enterica Typhi CT18, whose genomic islands have been annotated [23, 43]. There are currently 17 pathogenicity islands in this sequence [23], and multiple phages have been found as well as an unidentified island [3, 44], resulting in 21 fragments reliably of foreign origin. All the functional features associated with genuine genomic islands are summarized in Table 4.

Table 4 Summary of functional features predicted by 2SigFinder, SIGI-HMM, Alien_Hunter, Centroid, IslandPath-DIMOB, INDeGenIUS, SigHunt and IslandPick on detection of genomic islands in S. enterica Typhi CT18, and the functional features were based on the number of the related genes in the real genomic islands which are covered by more than 50% of the results of the prediction method

2SigFinder was used to detect genomic islands in this sequence, where transformed window is equal to 4, eye window size is 5, neighbourhood size is 4 and long window size is 100, using 256 core features and 4 dynamic features, with 8 iterations in IST-LFS and 10 iterations in ILST-DSFS, and 0.05 standard error. At last, it used 20 kb around genomic islands to search the GI’s boundary. Six algorithms based on the window and a comparative genomics were also used to predict the genomic islands with default values [19, 23, 31, 32, 34, 35, 37]. As before, we employed the same test with 0.05 level to detect the genome islands. All the functional features associated with the predicted genomic islands have been summarized in Table 4.

To evaluate the predicted GIs, we counted their features within the real GIs that are more than 50% covered by the results of the prediction method. For Phage and HEG, 2SigFinder outperforms the other methods, detecting 101 Phage and 28 HEG features out of 130 Phage and 36 HEGs. As for features associated with GIs, including Pathogenicity, tRNA, Virulence and Repeats, 2SigFinder and Alien_Hunter achieve the best performance, identifying 25 Pathogenicity, 1 tRNA, 2 Virulence and 2 Repeats out of 27 Pathogenicity, 1 tRNA, 2 Virulence and 2 Repeats. For the Integrase, Transposase and IS features, Alien_Hunter outperforms the other methods. The next best method is 2SigFinder, whereas the other methods lag behind (Table 4).

A PAI is a type of GI that carries genetic elements encoding virulence factors of pathogens and mediates the horizontal transfer of genes for multiple virulence factors. Ten PAIs are located in this genome according to PAIDB [28, 29]; more information is summarised in Table 5. To further evaluate the predicted GIs, we counted the number of PAIs that are more than 50% covered by the results of the prediction method. Figure 2 indicates that Alien_Hunter achieves the best performance, identifying 9 out of 10 PAIs. The next best method is 2SigFinder, whereas the other methods lag behind. However, Alien_Hunter performs better in the detection of Integrase, Transposase, IS features and PAIs because it predicts a large number of genomic islands, and its number of false positives is high (Table 6), indicating that it is of limited practical use. These results show that 2SigFinder is more efficient in detecting functional features associated with GIs.

Table 5 Ten pathogenicity islands reported to be located in S. enterica Typhi CT18. The name, start position, end position, size and function of these PAIs are summarized from the pathogenicity island database (PAIDB)
Fig. 2

Overlap percentages between the reported PAI and the predicted genomic islands from Precision, recall and overall accuracy of SigHunt and INDeGenIUS, in which 0.05–0.2 significance levels are used as cut-off values to evaluate their performances. All evaluation indexes are calculated at the nucleotide level

Table 6 Overall length of the predicted genomic islands, true positives and false positives of all of the evaluated methods at the nucleotide level in S. enterica Typhi CT18

Discussion

Genomic islands are gene clusters of horizontal origin in the genome. They are closely related to the rapid adaptation of the organism and are therefore of medical, economic and environmental importance. Comparative genomics analyses 16S rRNAs and other orthologs among different genomes to detect genomic islands. However, it relies largely on genome comparison and can therefore be limited by the need for annotations or for closely related genomes, which may be unavailable. Research into comparison-free methods is therefore necessary to overcome these critical limitations of comparative genomics.

Several algorithms have been proposed and perform well, but some problems remain in genomic island detection. 2SigFinder, proposed in this paper, is a genomic island recognition method based on small-scale and large-scale statistical tests. Through a comprehensive study, we found that Alien_Hunter is sensitive, but it predicts more genomic islands, and the average length of the predicted fragments is smaller. Comparative genomics obtained better predictions but predicted fewer genomic islands. Some window-based algorithms found genomic islands, but their sizes are small. 2SigFinder is more efficient in detecting genomic islands and their functional features. Although 2SigFinder achieved better performance, it is still not a generic solution for detecting all GIs in different organisms. It relies on the observation of different tetranucleotides, so only limited genomic signatures can be used. Sometimes the tetranucleotide signal of a GI is not strong enough, which may lead to false negative predictions. Small genomic islands that do not provide oligonucleotide patterns sufficiently different from their host genome may also be difficult for 2SigFinder to detect. Therefore, further research could be conducted to determine genomic signatures that are more efficient for genomic island prediction.

Conclusion

Several methods detect GIs mainly through global testing of whether the local signature of a region differs from that of the host. In this paper, we proposed a genomic island recognition method based on small-scale and large-scale statistical tests. Existing methods generally use predetermined thresholds, and the information in each window is limited. In the proposed method, we study the variability of the higher moments of each tetranucleotide and design an iteration of large-scale statistical testing with dynamic signals from small-scale feature selection to identify multi-window segments; we then split these into optimal distinct segments according to the GC-content bias. After delineating these compositionally different segments, genomic islands are selected by their IST-LFS scores. Finally, the GC-based divergence is used to optimize the boundaries of the genomic islands. Systematic and quantitative assessment demonstrated that 2SigFinder is more robust than other existing methods in identifying genomic islands. As for the functional features associated with the real genomic islands, 2SigFinder is more efficient in revealing the functions of genomic islands.

Methods

We designed a test-based algorithm to identify GI. The framework is shown in Fig. 3, and the steps are as follows:

Fig. 3

Overview of the 2SigFinder algorithm. a The workflow of the small-scale t-test with large-scale feature selection, in which signatures of the host are extracted using the confidence interval of window variances, and core signatures are selected based on ordered kurtosis. During an iteration, we score each window using the two-sample t-test and select the windows whose scores are large enough to be considered statistically significant. b The workflow of the large-scale statistical test using dynamic signals from small-scale feature selection. Starting from the higher moments of each tetranucleotide, we select signatures of the host using the confidence interval of window variances and select dynamic core signatures using large sliding windows. During an iteration, we score each sliding long window with an accumulative score and select the windows whose scores are large enough to be considered statistically significant

At the small scale, we used t-tests to score each window based on the large-scale feature selection, in order to evaluate the compositional difference of each region (Fig. 3a). We first divided a genome into n windows, each 1 kb long, and calculated the frequencies f of the tetranucleotides. For each window, the confidence interval of the mean variance \( s^2 \) was estimated as:

$$ \overline{s^2}-{z}_{\alpha /2}\frac{s_{s^2}}{N}\le {\mu}_{s^2}\le \overline{s^2}+{z}_{\alpha /2}\frac{s_{s^2}}{N} $$
(1)

where \( \overline{s^2} \) is the mean value of all window variances, \( {s}_{s^2} \) is denoted as a variance, α is a confidence level, and N is the total number of windows.
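A minimal sketch of the windowing step and of the interval in Eq. 1 is given below. Several points are assumptions rather than details taken from the method: the variance of a window is taken as the variance of its 256 tetranucleotide frequencies, the spread term \( s_{s^2} \) is computed as the sample standard deviation of the window variances, and the divisor N is used exactly as printed in Eq. 1. All function names are hypothetical.

```python
from itertools import product
from statistics import NormalDist, mean, pvariance, stdev

# All 256 tetranucleotides in a fixed order.
TETRAMERS = ["".join(p) for p in product("ACGT", repeat=4)]


def split_windows(genome, size=1000):
    """Cut a sequence into consecutive non-overlapping windows of `size` bp."""
    return [genome[i:i + size] for i in range(0, len(genome) - size + 1, size)]


def tetranucleotide_frequencies(window):
    """Relative frequency of each tetranucleotide (ambiguous k-mers skipped)."""
    counts = dict.fromkeys(TETRAMERS, 0)
    total = 0
    seq = window.upper()
    for i in range(len(seq) - 3):
        kmer = seq[i:i + 4]
        if kmer in counts:
            counts[kmer] += 1
            total += 1
    return [counts[k] / total if total else 0.0 for k in TETRAMERS]


def variance_confidence_interval(window_freqs, alpha=0.05):
    """Interval of Eq. 1 around the mean of the per-window variances.

    Assumption: each window variance is the variance of its 256 frequencies.
    """
    variances = [pvariance(freqs) for freqs in window_freqs]
    n = len(variances)
    centre = mean(variances)
    spread = stdev(variances)                # assumed meaning of s_{s^2}
    z = NormalDist().inv_cdf(1 - alpha / 2)  # z_{alpha/2}
    half_width = z * spread / n              # divisor N as printed in Eq. 1
    return centre - half_width, centre + half_width
```

Windows whose variance falls inside this interval would be natural candidates for representing the host signature, although the exact selection rule is not stated here.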

In n windows, the kurtosis of each tetranucleotide is defined as follows

$$ ku=\frac{\sum {\left({f}_i-\overline{f}\right)}^4/n}{{\left(\sum {\left({f}_i-\overline{f}\right)}^2\right)}^2/n} $$
(2)

where \( {f}_i \) is the frequency of the tetranucleotide in the ith window and \( \overline{f} \) is its average over the n windows. Tetranucleotides with larger kurtosis are selected as the informative signatures.
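The kurtosis-based selection of informative signatures can be sketched as below. The code uses the conventional moment ratio \( m_4/m_2^2 \); this differs from a literal reading of Eq. 2 only by the constant factor n, so the ranking of tetranucleotides is the same. The function names and the parameter k are hypothetical.

```python
def kurtosis(values):
    """Fourth central moment divided by the squared second central moment."""
    n = len(values)
    mu = sum(values) / n
    m2 = sum((v - mu) ** 2 for v in values) / n
    m4 = sum((v - mu) ** 4 for v in values) / n
    return m4 / (m2 * m2) if m2 > 0 else 0.0


def select_core_signatures(freq_matrix, k):
    """Indices of the k signatures with the largest kurtosis across windows.

    freq_matrix[w][j] is the frequency of signature j in window w.
    """
    n_features = len(freq_matrix[0])
    scores = [kurtosis([row[j] for row in freq_matrix]) for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return ranked[:k]
```

A signature that is nearly constant across the genome but spikes in a few windows has a heavy-tailed distribution and therefore a high kurtosis, which is why it is informative for locating atypical regions.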

Given the ith window, we calculated the two-sample t-test between the host and the ith window. For each signature \( {f}_j \) of the ith window, we chose its left and right neighbouring windows as a sample \( \left({f}_j^{i-\varepsilon +1},\cdots, {f}_j^i,\cdots, {f}_j^{i+\varepsilon}\right) \) of the signature \( {f}_j \) from the ith window. The signature \( {f}_j \) from the host was represented as \( \left({f}_j^{t_1},{f}_j^{t_2}\cdots, {f}_j^{t_{\Gamma}}\right) \), where \( {t}_{\Gamma} \) is the number of host windows and Γ denotes the chosen signatures. Then, we used the t-test to determine whether the mean values of the two samples \( \left({f}_j^{i-\varepsilon +1},\cdots, {f}_j^i,\cdots, {f}_j^{i+\varepsilon}\right) \) and \( \left({f}_j^{t_1},{f}_j^{t_2}\cdots, {f}_j^{t_{\Gamma}}\right) \) are equal, and calculated the P-value of each informative signature as follows:

$$ {P}_{f_j}=P\left(\left|t\right|>\frac{\overline{f_j^1}-\overline{f_j^2}}{\sqrt{s_p^2\left(\frac{1}{2\varepsilon +1}+\frac{1}{t_{\Gamma}}\right)}}\right) $$
(3)

where

$$ {s}_p^2=\frac{2\varepsilon {s}_{f_j^1}^2+\left({t}_{\Gamma}-1\right){s}_{f_j^2}^2}{2\varepsilon +{t}_{\Gamma}-1} $$

\( \overline{f_j^1} \) and \( \overline{f_j^2} \) (\( {s}_{f_j^1}^2 \) and \( {s}_{f_j^2}^2 \)) denote the means (variances) of the ith region \( \left({f}_j^{i-\varepsilon +1},\cdots, {f}_j^i,\cdots, {f}_j^{i+\varepsilon}\right) \) and of the host, respectively. Accumulating the P-values of all signatures, the difference was calculated as follows:

$$ D={\sum}_{j=1}^{t_{\Gamma}}{P}_{f_j} $$
(4)

Then we selected the windows whose scores were large enough to be statistically significant and deleted them. We updated all windows in the genome and repeated the above steps until no further windows were found.
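The pooled two-sample statistic of Eq. 3 and the accumulation of Eq. 4 can be sketched as follows. The pooled variance follows the equation for \( s_p^2 \), written for general sample sizes. One substitution is made to keep the sketch free of external dependencies: the two-sided P-value is obtained from a standard normal approximation rather than from the Student t distribution, which is an assumption and becomes inaccurate for small samples. Function names are hypothetical.

```python
from math import sqrt
from statistics import NormalDist, mean, variance


def pooled_t_statistic(local, host):
    """Two-sample t statistic with pooled variance (Eq. 3)."""
    n1, n2 = len(local), len(host)
    sp2 = ((n1 - 1) * variance(local) + (n2 - 1) * variance(host)) / (n1 + n2 - 2)
    diff = mean(local) - mean(host)
    if sp2 == 0:
        # Degenerate case: both samples are constant.
        return 0.0 if diff == 0 else float("inf")
    return diff / sqrt(sp2 * (1 / n1 + 1 / n2))


def two_sided_p_value(t):
    """Two-sided P-value; normal approximation to the t distribution (assumption)."""
    return 2 * (1 - NormalDist().cdf(abs(t)))


def window_score(local_samples, host_samples):
    """Sum of per-signature P-values, as in Eq. 4.

    local_samples[j] and host_samples[j] hold the values of signature j
    in the neighbourhood of the window and in the host windows.
    """
    return sum(
        two_sided_p_value(pooled_t_statistic(local, host))
        for local, host in zip(local_samples, host_samples)
    )
```

In practice the exact tail probability of the t distribution with \( 2\varepsilon + t_{\Gamma} - 1 \) degrees of freedom would replace the normal approximation.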

A large-scale statistical test using dynamic signals from small-scale feature selection

On a large scale, we study the variability of the high-order moments of each tetranucleotide and use dynamic signals selected by small-scale features to design iterations of large-scale statistical tests to identify large, multi-window segments (Fig. 3b).

To assess changes in the local signatures surrounding the ith window, we choose the 2τ windows surrounding the ith window as its neighbourhood and calculate the normalised first, second, third and fourth standardized moments of each signature as follows:

$$ {\mathrm{NM}}_{\mathrm{i}}^1\left({\mathrm{f}}_{\mathrm{t}}^{\mathrm{i}}\right)=\frac{1}{2\uptau +1}{\sum}_{\mathrm{x}=\mathrm{i}-\uptau}^{\mathrm{i}+\uptau}{\mathrm{f}}_{\mathrm{x}}^{\mathrm{i}} $$
(5)
$$ {\mathrm{NM}}_{\mathrm{i}}^2\left({\mathrm{f}}_{\mathrm{t}}^{\mathrm{i}}\right)=\sqrt{\frac{1}{2\uptau +1}{\sum}_{\mathrm{x}=\mathrm{i}-\uptau}^{\mathrm{i}+\uptau}{\left({\mathrm{f}}_{\mathrm{x}}^{\mathrm{i}}-{\mathrm{NM}}_{\mathrm{i}}^1\left({\mathrm{f}}_{\mathrm{t}}^{\mathrm{i}}\right)\right)}^2} $$
(6)
$$ {\mathrm{NM}}_{\mathrm{i}}^3\left({\mathrm{f}}_{\mathrm{t}}^{\mathrm{i}}\right)=\frac{2\uptau +1}{2\uptau \left(2\uptau -1\right)}{\sum}_{\mathrm{x}=\mathrm{i}-\uptau}^{\mathrm{i}+\uptau}{\left(\frac{{\mathrm{f}}_{\mathrm{x}}^{\mathrm{i}}-{\mathrm{NM}}_{\mathrm{i}}^1\left({\mathrm{f}}_{\mathrm{t}}^{\mathrm{i}}\right)}{{\mathrm{NM}}_{\mathrm{i}}^2\left({\mathrm{f}}_{\mathrm{t}}^{\mathrm{i}}\right)}\right)}^3 $$
(7)
$$ {\mathrm{NM}}_{\mathrm{i}}^4\left({\mathrm{f}}_{\mathrm{t}}^{\mathrm{i}}\right)=\frac{\left(2\uptau +1\right)\left(2\uptau +2\right)}{2\uptau \times \left(2\uptau -1\right)\left(2\uptau -2\right)}{\sum}_{\mathrm{x}=\mathrm{i}-\uptau}^{\mathrm{i}+\uptau}{\left(\frac{\mathrm{f}{}_{\mathrm{x}}{}^{\mathrm{i}}-{\mathrm{NM}}_{\mathrm{i}}^1\left({\mathrm{f}}_{\mathrm{t}}^{\mathrm{i}}\right)}{{\mathrm{NM}}_{\mathrm{i}}^2\left({\mathrm{f}}_{\mathrm{t}}^{\mathrm{i}}\right)}\right)}^4-\frac{24{\uptau}^3}{\left(2\uptau -1\right)\left(2\uptau -2\right)} $$
(8)

where \( {\mathrm{NM}}_{\mathrm{i}}^1\left({\mathrm{f}}_{\mathrm{t}}^{\mathrm{i}}\right) \), \( {\mathrm{NM}}_{\mathrm{i}}^2\left({\mathrm{f}}_{\mathrm{t}}^{\mathrm{i}}\right) \), \( {\mathrm{NM}}_{\mathrm{i}}^3\left({\mathrm{f}}_{\mathrm{t}}^{\mathrm{i}}\right) \) and \( {\mathrm{NM}}_{\mathrm{i}}^4\left({\mathrm{f}}_{\mathrm{t}}^{\mathrm{i}}\right) \) are the normalised first, second, third and fourth standardized moments of the signature \( {\mathrm{f}}_{\mathrm{t}}^{\mathrm{i}} \) within the ith window.
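A direct transcription of Eqs. 5 to 8 for a single signature is sketched below. The coefficients, including the constant term of Eq. 8, are copied as printed; the function name and the handling of edge cases (τ smaller than 2, a constant neighbourhood, windows too close to the sequence ends) are assumptions added so that the code runs safely.

```python
from math import sqrt


def normalised_moments(values, centre, tau):
    """Normalised moments of one signature around window `centre` (Eqs. 5-8).

    values[x] is the signature value in window x; the neighbourhood spans
    windows centre - tau .. centre + tau.
    """
    if tau < 2:
        raise ValueError("tau must be at least 2 to avoid division by zero in Eq. 8")
    if centre - tau < 0 or centre + tau >= len(values):
        raise ValueError("neighbourhood extends beyond the available windows")
    hood = values[centre - tau:centre + tau + 1]
    n = 2 * tau + 1
    nm1 = sum(hood) / n                                  # Eq. 5
    nm2 = sqrt(sum((v - nm1) ** 2 for v in hood) / n)    # Eq. 6
    if nm2 == 0:
        return nm1, 0.0, 0.0, 0.0                        # constant neighbourhood
    z = [(v - nm1) / nm2 for v in hood]
    nm3 = n / (2 * tau * (2 * tau - 1)) * sum(x ** 3 for x in z)   # Eq. 7
    nm4 = (
        n * (n + 1) / (2 * tau * (2 * tau - 1) * (2 * tau - 2)) * sum(x ** 4 for x in z)
        - 24 * tau ** 3 / ((2 * tau - 1) * (2 * tau - 2))          # Eq. 8, as printed
    )
    return nm1, nm2, nm3, nm4
```

For a symmetric neighbourhood such as 1, 2, 3, 4, 5 the third moment vanishes, which gives a quick sanity check of the implementation.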

We calculated the genomic signatures of the host and estimated the cumulative kernel distribution function φ for each signature. Starting from the ith window, we use its following δ consecutive windows to create the ith large sliding window (\( {\mathrm{LSW}}_{\mathrm{i}} \)). We then select the core signatures of these δ consecutive windows within the ith large window using ordered kurtosis. It is important to highlight that the core signatures of the large window change as the ith window slides along the genome; we therefore call this set of core signatures the dynamic core signatures of the genome.

Count the top θ dynamic core signatures whose values lie outside their credibility intervals in each non-overlapping window, and sum these counts over the δ consecutive windows to obtain the accumulative score (AS) of the ith large sliding window:

$$ \mathrm{AS}\left({\mathrm{LSW}}_{\mathrm{i}}\right)=\sum \limits_{\mathrm{i}=1}^{\updelta}\sum \limits_{\mathrm{t}=1}^{\uptheta}\mathtt{\varphi}\left({\mathrm{f}}_{\mathrm{t}}^{\mathrm{i}}\right) $$
(9)

where \( \mathtt{\varphi}\left({\mathrm{f}}_{\mathrm{t}}^{\mathrm{i}}\right) \) is an indicator function defined as follows:

$$ \mathtt{\varphi}\left({\mathrm{f}}_{\mathrm{t}}^{\mathrm{i}}\right)=\left\{\begin{array}{cc}0& {\mathrm{f}}_{\mathrm{t}}^{\mathrm{i}}\in \left({\upvarphi}_{\mathrm{t}}^{-1}\left(\frac{\upalpha}{2}\right),{\upvarphi}_{\mathrm{t}}^{-1}\left(1-\frac{\upalpha}{2}\right)\right)\\ {}1& \mathrm{Otherwise}\end{array}\right. $$
(10)

Here \( {\upvarphi}_{\mathrm{t}} \) is the cumulative kernel distribution function of the dynamic core signature \( {\mathrm{f}}_{\mathrm{t}} \), \( {\mathrm{f}}_{\mathrm{t}}^{\mathrm{i}} \) is the value of the dynamic core signature in the ith non-overlapping window, and α is a confidence level.
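As a rough illustration of Eqs. 9 and 10, the sketch below estimates a Gaussian-kernel cumulative distribution for each signature, inverts it at α/2 and 1 − α/2 to obtain the credibility interval, and counts the values that fall outside it within one large sliding window. Several details are assumptions rather than the authors' choices: the Gaussian kernel, Silverman's rule-of-thumb bandwidth, the grid-based inversion, and all function and parameter names. The selection of the top θ core signatures by ordered kurtosis is not reproduced; the indices are simply passed in as `core_idx`.

```python
import math
import numpy as np

# Element-wise error function without requiring SciPy.
_erf = np.vectorize(math.erf, otypes=[float])

def kernel_quantile(sample, q, grid_size=512):
    """Invert a Gaussian-kernel CDF estimate of `sample` at probability q."""
    sample = np.asarray(sample, dtype=float)
    # Silverman's rule of thumb (an assumption, not taken from the paper).
    h = 1.06 * sample.std(ddof=1) * sample.size ** (-0.2)
    if h == 0:
        raise ValueError("constant sample: kernel bandwidth is zero")
    grid = np.linspace(sample.min() - 4 * h, sample.max() + 4 * h, grid_size)
    z = (grid[:, None] - sample[None, :]) / h
    cdf = (0.5 * (1.0 + _erf(z / math.sqrt(2.0)))).mean(axis=1)
    return float(np.interp(q, cdf, grid))

def credibility_bounds(signatures, alpha=0.05):
    """Per-signature bounds (phi_t^-1(alpha/2), phi_t^-1(1 - alpha/2)).
    `signatures` has shape (n_windows, n_signatures)."""
    signatures = np.asarray(signatures, dtype=float)
    lower = np.array([kernel_quantile(col, alpha / 2) for col in signatures.T])
    upper = np.array([kernel_quantile(col, 1 - alpha / 2) for col in signatures.T])
    return lower, upper

def accumulative_score(signatures, start, delta, core_idx, lower, upper):
    """AS of the large sliding window covering windows start .. start+delta-1."""
    core_idx = np.asarray(core_idx, dtype=int)
    block = np.asarray(signatures, dtype=float)[start:start + delta][:, core_idx]
    # Eq. 10: the indicator is 0 inside the open interval and 1 otherwise.
    outside = (block <= lower[core_idx]) | (block >= upper[core_idx])
    return int(outside.sum())
```

With δ windows and θ core signatures the score ranges from 0 to δ·θ, so a large sliding window in which every core signature is atypical in every window reaches the maximum.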

Select the large sliding windows whose scores are large enough to be considered statistically significant. Delete the selected large sliding windows, update the windows of the remaining genome, and repeat the steps above until no further significant large sliding window can be found.
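The iterative select-and-delete step can be sketched as a simple loop. This is one possible reading of the procedure, not the authors' implementation: it assumes a fixed score threshold stands in for the significance test, removes one best-scoring large sliding window per round, and leaves the rescoring of the remaining genome to a caller-supplied function. The names `select_large_windows` and `score_fn` are hypothetical.

```python
import numpy as np

def select_large_windows(n_windows, delta, threshold, score_fn):
    """Repeatedly pick the best-scoring large sliding window, mark its
    windows as deleted, and rescore until nothing reaches the threshold.

    score_fn(start, active) must return the accumulative score of the large
    window starting at `start`, given a boolean mask of windows still kept.
    """
    active = np.ones(n_windows, dtype=bool)
    selected = []
    while True:
        best = None
        for start in range(n_windows - delta + 1):
            # Skip large windows that overlap an already deleted region.
            if not active[start:start + delta].all():
                continue
            score = score_fn(start, active)
            if score >= threshold and (best is None or score > best[1]):
                best = (start, score)
        if best is None:          # no significant large window remains
            return selected
        selected.append(best[0])
        active[best[0]:best[0] + delta] = False
```

Passing the `active` mask to `score_fn` lets the caller re-estimate the host distributions from the windows that have not been deleted, which is how the update step described above could be realised.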

Refine the boundaries of predicted GIs

For each multi-window region detected by the above method, we segment it into several fragments based on the GC content deviation, and use the G-C deviation and the Markovian Jensen-Shannon divergence (MJSD) to determine the boundaries of the predicted GIs. Assume \( {t}_1 \) and \( {t}_2 \) are the start and end points of a given genomic island \( {S}_{\left[{t}_1\to {t}_2\right]} \). We search for its boundaries in the expanded region \( {S}_{\left[{t}_1-\gamma kb\to {t}_2+\gamma kb\right]} \). G-C deviation is an important sequence feature that describes the differences between DNA fragments [45, 46]. To find the starting position, the sequence \( {S}_{\left[{t}_1-\gamma kb\to {t}_2\right]} \) is divided into sub-sequences, which yields a set of candidate points \( \left\{{P}_{S_{\left[{t}_1-\gamma kb\to {t}_2\right]}}^{CG}\right\} \). For each point \( {t}_{\tau } \), its MJSD is calculated as follows:

$$ {\displaystyle \begin{array}{l}{MJSD}^2\left({t}_{\tau}\right)={H}^2\left({S}_{\left[{t}_1-\gamma kb\to {t}_2\right]}\right)-\frac{t_{\tau }-{t}_1-\gamma kb+1}{t_2-{t}_1-\gamma kb+1}{H}^2\left({S}_{\left[{t}_1-\gamma kb\to {t}_{\tau}\right]}\right)\\ {}\kern10em -\frac{t_2-{t}_{\tau }+1}{t_2-{t}_1-\gamma kb+1}{H}^2\left({S}_{\left[{t}_{\tau}\to {t}_2\right]}\right)\end{array}} $$
(11)

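where \( {H}^2\left({S}_{\left[{t}_1-\gamma kb\to {t}_{\tau}\right]}\right) \) and \( {H}^2\left({S}_{\left[{t}_{\tau}\to {t}_2\right]}\right) \) are the entropies of \( {S}_{\left[{t}_1-\gamma kb\to {t}_{\tau}\right]} \) and \( {S}_{\left[{t}_{\tau}\to {t}_2\right]} \), respectively, and \( {H}^2\left({S}_{\left[{t}_1-\gamma kb\to {t}_2\right]}\right) \) is the entropy of \( {S}_{\left[{t}_1-\gamma kb\to {t}_2\right]} \).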
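To make Eq. 11 concrete, the following sketch computes a Markovian entropy and the resulting MJSD at a set of candidate split points, then returns the candidate with the largest divergence. It rests on several assumptions: the superscript in \( H^2 \) is read as a second-order Markov model, entropies are in bits, the two weights are taken as the fractional lengths of the left and right segments, and the candidate points are supplied by the caller rather than derived from the G-C deviation. All function names are hypothetical.

```python
import math
from collections import Counter

def markov_entropy(seq, order=2):
    """Conditional entropy (bits) of the next base given the previous
    `order` bases, estimated from counts within `seq`."""
    context = Counter()
    transition = Counter()
    for k in range(order, len(seq)):
        w = seq[k - order:k]
        context[w] += 1
        transition[(w, seq[k])] += 1
    total = sum(context.values())
    if total == 0:
        return 0.0
    # P(w) * P(a|w) = count(w, a) / total; the log term uses P(a|w).
    return -sum((c / total) * math.log2(c / context[w])
                for (w, _), c in transition.items())

def mjsd(seq, split, order=2):
    """Entropy of the whole sequence minus the length-weighted entropies of
    the segments seq[:split] and seq[split:]."""
    n = len(seq)
    left, right = seq[:split], seq[split:]
    return (markov_entropy(seq, order)
            - len(left) / n * markov_entropy(left, order)
            - len(right) / n * markov_entropy(right, order))

def best_boundary(seq, candidates, order=2):
    """Candidate split point with the largest MJSD."""
    return max(candidates, key=lambda t: mjsd(seq, t, order))
```

In this reading, `seq` would be the expanded region and `candidates` the points obtained from the G-C deviation segmentation; the same search could be repeated on the other side of the island for the end position.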

Availability of data and materials

Datasets and supplementary materials are freely available at https://github.com/bioinfo0706/2SigFinder or http://bioinfo.zstu.edu.cn/2SigFinder.

Abbreviations

HGT: Horizontal Gene Transfer

PAI: Pathogenicity Island

GI: Genomic Island

PAIDB: PAI Database

HMM: Hidden Markov Model

CG-MJSD: GC content and Markovian Jensen-Shannon divergence

HEGs: Highly expressed genes

RP: Ribosomal protein

TF: Transcriptional processing factor

CH: Chaperone degradation

IS: Insertion sequence elements

NCBI: National Center for Biotechnology Information

References

  1. Hacker J, Bender L, Ott M, Wingender J, Lund B, Marre R, Goebel W. Deletions of chromosomal regions coding for fimbriae and hemolysins occur in vitro and in vivo in various extraintestinal Escherichia coli isolates. Microb Pathog. 1990;8:213–25.

  2. Hacker J, Kaper JB. Pathogenicity islands and the evolution of microbes. Annu Rev Microbiol. 2000;54:641–79.

  3. Kingsley RA, Humphries AD, Weening EH, De Zoete MR, Papaconstantinopoulou A, Dougan G, Bäumler AJ. Molecular and phenotypic analysis of the CS54 island of Salmonella enterica serotype Typhimurium: identification of intestinal colonization and persistence determinants. Infect Immun. 2003;71:629–40.

  4. Nakamura Y, Itoh T, Matsuda H, Gojobori T. Biased biological functions of horizontally transferred genes in prokaryotic genomes. Nat Genet. 2004;36:760–6.

  5. Gal-Mor O, Finlay BB. Pathogenicity islands: a molecular toolbox for bacterial virulence. Cell Microbiol. 2006;8:1707–19.

  6. Dobrindt U, Hochhut B, Hentschel U, Hacker J. Genomic islands in pathogenic and environmental microorganisms. Nat Rev Microbiol. 2004;2:414–24.

  7. Lawrence JG. Common themes in the genome strategies of pathogens. Curr Opin Genet Dev. 2005;15:584–8.

  8. Manson JM, Gilmore MS. Pathogenicity island integrase cross-talk: a potential new tool for virulence modulation. Mol Microbiol. 2006;61:555–9.

  9. Middendorf B, Hochhut B, Leipold K, Dobrindt U, Blum-Oehler G, Hacker J. Instability of pathogenicity islands in uropathogenic Escherichia coli 536. J Bacteriol. 2004;186:3086–96.

  10. Finlay BB, Falkow S. Common themes in microbial pathogenicity revisited. Microbiol Mol Biol Rev. 1997;61:136–69.

  11. Karlin S. Detecting anomalous gene clusters and pathogenicity islands in diverse bacterial genomes. Trends Microbiol. 2001;9:335–43.

  12. Hsiao WW, Ung K, Aeschliman D, Bryan J, Finlay BB, Brinkman FS. Evidence of a large novel gene pool associated with prokaryotic genomic islands. PLoS Genet. 2005;1:e62.

  13. Vernikos GS, Parkhill J. Resolving the structural features of genomic islands: a machine learning approach. Genome Res. 2008;18:331–42.

  14. Ragan MA. Detection of lateral gene transfer among microbial genomes. Curr Opin Genet Dev. 2001;11:620–6.

  15. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997;25:3389–402.

  16. Darling ACE, Mau B, Blattner FR, Perna NT. Mauve: multiple alignment of conserved genomic sequence with rearrangements. Genome Res. 2004;14:1394–403.

  17. Ou HY, Chen LL, Lonnen J, Chaudhuri RR, Thani AB, Smith R, Garton NJ, Hinton J, Pallen M, Barer MR, Rajakumar K. A novel strategy for the identification of genomic islands by comparative analysis of the contents and contexts of tRNA sites in closely related bacteria. Nucleic Acids Res. 2006;34:e3.

  18. Chiapello H, Bourgait I, Sourivong F, Heuclin G, Gendrault-Jacquemard A, Petit MA, El Karoui M. Systematic determination of the mosaic structure of bacterial genomes: species backbone versus strain-specific loops. BMC Bioinformatics. 2005;6:171.

  19. Langille MGI, Hsiao WWL, Brinkman FSL. Evaluation of genomic island predictors using a comparative genomics approach. BMC Bioinformatics. 2008;9:329.

  20. Langille MG, Brinkman FS. IslandViewer: an integrated interface for computational identification and visualization of genomic islands. Bioinformatics. 2009;25:664–5.

  21. Dhillon BK, Chiu TA, Laird MR, Langille MG, Brinkman FS. IslandViewer update: improved genomic island discovery and visualization. Nucleic Acids Res. 2013;41:W129–32.

  22. Arvey AJ, Azad RK, Raval A, Lawrence JG. Detection of genomic islands via segmental genome heterogeneity. Nucleic Acids Res. 2009;37:5255–66.

  23. Vernikos GS, Parkhill J. Interpolated variable order motifs for identification of horizontally acquired DNA: revisiting the Salmonella pathogenicity islands. Bioinformatics. 2006;22:2196–203.

  24. Karlin S, Mrazek J, Campbell AM. Codon usages in different gene classes of the Escherichia coli genome. Mol Microbiol. 1998;29:1341–55.

  25. Sandberg R, Winberg G, Branden CI, Kaske A, Ernberg I, Coster J. Capturing whole-genome characteristics in short sequences using a naive Bayesian classifier. Genome Res. 2001;11:1404–9.

  26. Tsirigos A, Rigoutsos I. A new computational method for the detection of horizontal gene transfer events. Nucleic Acids Res. 2005;33:922–33.

  27. Yoon SH, Hur CG, Kang HY, Kim YH, Oh TK, Kim JF. A computational approach for identifying pathogenicity islands in prokaryotic genomes. BMC Bioinformatics. 2005;6:184.

  28. Yoon SH, Park YK, Lee S, Choi D, Oh TK, Hur CG, Kim JF. Towards Pathogenomics: A web-based resource for Pathogenicity Islands. Nucleic Acids Res. 2007;35:D395–400.

  29. Yoon SH, Park YK, Kim JF. PAIDB v2.0: exploration and analysis of pathogenicity and resistance islands. Nucleic Acids Res. 2014;43:D624–30.

  30. Merkl R. SIGI: score-based identification of genomic islands. BMC Bioinformatics. 2004;5:22.

  31. Waack S, Keller O, Asper R, Brodag T, Damm C, Fricke WF, Surovcik K, Meinicke P, Merkl R. Score-based prediction of genomic islands in prokaryotic genomes using hidden markov models. BMC Bioinformatics. 2006;7:142.

  32. Hsiao W, Wan I, Jones SJ, Brinkman FS. IslandPath: aiding detection of genomic islands in prokaryotes. Bioinformatics. 2003;19:418–20.

  33. Finn RD, Tate J, Mistry J, Coggill PC, Sammut SJ, Hotz HR, Ceric G, Forslund K, Eddy SR, Sonnhammer EL, Bateman A. The Pfam protein families database. Nucleic Acids Res. 2008;36:D281–8.

  34. Rajan I, Aravamuthan S, Mande SS. Identification of compositionally distinct regions in genomes using the centroid method. Bioinformatics. 2007;23:2672–7.

  35. Shrivastava S, Reddy CV, Mande SS. INDeGenIUS, a new method for high-throughput identification of specialized functional islands in completely sequenced organisms. J Biosci. 2010;35:351–64.

  36. Azad RK, Lawrence JG. Towards more robust methods of alien gene detection. Nucleic Acids Res. 2011;39(9):e56.

  37. Jaron KS, Moravec JC, Martinkova N. SigHunt: horizontal gene transfer finder optimized for eukaryotic genomes. Bioinformatics. 2014;30:1081–6.

  38. Fothergill JL, Mowat E, Ledson MJ, Walshaw MJ, Winstanley C. Fluctuations in phenotypes and genotypes within populations of Pseudomonas aeruginosa in the cystic fibrosis lung during pulmonary exacerbations. J Med Microbiol. 2009;59:472–81.

  39. Karlin S, Mrazek J. Predicted highly expressed genes of diverse prokaryotic genomes. J Bacteriol. 2000;182:5238–50.

  40. Kurtz S, Schleiermacher C. REPuter: fast computation of maximal repeats in complete genomes. Bioinformatics. 1999;15:426–7.

  41. Winstanley C, Langille MG, Fothergill JL, Kukavical-Ibrulj I, Paradis-Bleau C, Sanschagrin F, Thomson NR, Winsor GL, Quail MA, Lennard N, Bignell A, Clarke L, Seeger K, Saunders D, Harris D, Parkhill J, Hancock RE, Brinkman FS, Levesque RC. Newly introduced genomic prophage islands are critical determinants of in vivo competitiveness in the Liverpool epidemic strain of Pseudomonas aeruginosa. Genome Res. 2009;19:12–23.

  42. Smart CH, Walshaw MJ, Hart CA, Winstanley C. Use of suppression subtractive hybridization to examine the accessory genome of the Liverpool cystic fibrosis epidemic strain of Pseudomonas aeruginosa. J Med Microbiol. 2006;55:677–88.

  43. Vernikos GS, Thomson NR, Parkhill J. Genetic flux over time in the Salmonella lineage. Genome Biol. 2007;8:R100.

  44. Kingsley RA, van Amsterdam K, Kramer N, Bäumler AJ, et al. The shdA gene is restricted to serotypes of Salmonella enterica subspecies I and contributes to efficient and prolonged fecal shedding. Infect Immun. 2000;68:2720–7.

  45. Tu Q, Ding D. Detecting pathogenicity islands and anomalous gene clusters by iterative discriminant analysis. FEMS Microbiol Lett. 2003;221:269–75.

  46. Pundhir S, Vijayvargiya H, Kumar A. PredictBias: a server for the identification of genomic and pathogenicity islands in prokaryotes. In Silico Biol. 2008;8:223–34.

Acknowledgements

We thank the referees for many valuable comments that have improved this manuscript.

Funding

We would like to thank the National Natural Science Foundation of China (Grant Nos. 61772028, 2012CB316503) for providing financial support for this study and its publication charges. The funding bodies did not play any role in the design of the study, in the collection, analysis, and interpretation of data, or in writing the manuscript.

Author information

Contributions

QD conceived the method and prepared the manuscript. RK, XNX and QD implemented the software and performed the analysis. QD, XQL, PAH and MQZ contributed to the discussion and have approved the final manuscript. The author(s) read and approved the final manuscript.

Corresponding author

Correspondence to Qi Dai.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Additional information

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.

About this article

Cite this article

Kong, R., Xu, X., Liu, X. et al. 2SigFinder: the combined use of small-scale and large-scale statistical testing for genomic island detection from a single genome. BMC Bioinformatics 21, 159 (2020). https://doi.org/10.1186/s12859-020-3501-2

  • DOI: https://doi.org/10.1186/s12859-020-3501-2

Keywords