2SigFinder: the combined use of small-scale and large-scale statistical testing for genomic island detection from a single genome

Background Genomic islands are associated with microbial adaptations, carrying genomic signatures different from the host. Some methods perform an overall test to identify genomic islands based on their local features. However, regions of different scales will display different genomic features. Results We proposed here a novel method “2SigFinder “, the first combined use of small-scale and large-scale statistical testing for genomic island detection. The proposed method was tested by genomic island boundary detection and identification of genomic islands or functional features of real biological data. We also compared the proposed method with the comparative genomics and composition-based approaches. The results indicate that the proposed 2SigFinder is more efficient in identifying genomic islands. Conclusions From real biological data, 2SigFinder identified genomic islands from a single genome and reported robust results across different experiments, without annotated information of genomes or prior knowledge from other datasets. 2SigHunter identified 25 Pathogenicity, 1 tRNA, 2 Virulence and 2 Repeats from 27 Pathogenicity, 1 tRNA, 2 Virulence and 2 Repeats, and detected 101 Phage and 28 HEG out of 130 Phage and 36 HEGs in S. enterica Typhi CT18, which shows that it is more efficient in detecting functional features associated with GIs.

including symbiotic islands, metabolic islands, secretory islands, and resistant islands. Generally, genomic islands (GIs) are used as a standard term to refer to a group of genes that are 10-200 kb in length after horizontal transfer. The area of horizontal transfer was originally called the GIs until the gene function was fully determined. Based on their gene function, a more specific term was provided for their basic use [2].
In the genomic era, the importance of GIs should be taken seriously. With new genomic sequencing technology, we aim to identify genomic regions of other species that are different from other species or strains. Generally speaking, the more relevant taxonomy is a method to identify genomic islands associated with functions [3,4]. Such as, the genomic islands are associated with the secretion system, iron absorption function, secretion of toxins and adhesions, all of which increase the survival rate of pathogens in the host [5,6]. Pathogens can initially regulate the detectability of chromosomes and exhibit different pathogenic phenotypes [7,8]. GIs in bacteria induce many adaptation processes, such as metal resistance, antibiotic resistance, and secondary metabolic characteristics, thereby providing environmental and industrial benefits [9,10]. Therefore, the identification of GIs in different genomes has been a key factor in the study of microbial evolution and function.
In large-scale comparative genomics, GIs have characteristics such as different sequence composition, direct flanking, migration-related genes, and tRNA genes, which should be explored and used to identify GIs [4,[11][12][13]. Genomic islands are scattered using a system model different from the host. Therefore, their differences can be determined by comparison with the differences of 16srrna [14]. Some detection algorithms have been developed: local alignment methods [15], and whole alignment methods [16]. These methods are based on multiple genomic alignments are inconsistent or unique, aligned with genomes that may be considered GIs and conservative regions. At the same time, several methods for constructing and applying multi-layer large-scale genome comparisons have been reported for complex situations. For example, Mobilo-meFINDER revealed that tRNA genes are shared across several related genomes. Mauve searches for genomic islands around homologous tRNA [17]. GI identification using this method is related to interrupted tRNAs, and genomic islands that do not have tRNA may be lost. The above question can be solved by MOSAIC, which are used to determine whether a strain-specific region should be inserted into the tRNA region [18]. However, we often incorrectly identify inversions and translocations as a strainspecific region. Another widely used GI prediction method is IslandPick [19]. For a simple genome, IslandPick can first select the optimal comparative gene without any prejudice, and then call Mauve for genome-wide comparison construction. IslandPick avoids duplication with help of rechecking Mauve's alignment regions [20,21]. The above algorithms are based on genomic comparison methods and can therefore be limited to using annotations or closely related but unavailable genomes. Since there are many genomes, the genome of the target species should be carefully selected [22].
In addition, some algorithms are also used to detect genomic islands based on the component of genome sequence. These algorithms can yield high efficiency and must distinguish anomalous regions from the remaining genomic biases because GI has a different sequence composition from the host. They are useful to quickly identify GIs in a genome or sequence and do not require additional genomes. Two to nine long oligonucleotide sequences and GC content are often defined as the component of genome sequence [11,[23][24][25][26]. Such as, abnormal G-C content and codon frequency deviations are calculated using PAI-Finder to detect GIs, and candidate PAIs are further evaluated using PAI-Finder to determine whether PAI-like regions partially or completely span GIs [27]. The PAI database (PAIDB) and PAI Finder are combined on one platform, where you can download annotated data and prediction information [28,29].
Hidden Markov Model (HMM) helps to remove or detect abnormal regions containing component deviations [23,[30][31][32]. For example, SIGI-HMM has constructed an HMM model to eliminate ribosome regions with codon usage preferences [30,31]. In addition, HMMer can identify the PFAM37 migrating gene map by searching each predicted gene [12], so IslandPath DIMOB [32] uses HMM to identify migrating gene map [33]. In contrast, Alien_Hunter improved the prediction of the boundaries of GIs by introducing a special scoring system based on k-mers variable length and using HMM models [23]. Although these methods based on Hidden Markov Models are more efficient than other methods in predicting GIs, they require a relatively large amount of parameter training and a large number of calculations. Therefore, prolonged operations are necessary to predict one GI.
A sequence is segmented into different regions, and the extraction of constituent characteristics of the sequence is performed instead of evaluating a set of genes in several predictions [34][35][36][37]. Measure significant differences between two windows to identify windows that are different in composition. The centroid method is used to determine some windows as GIs based on the comparison of windows' scores [34]. But, it is limited by host signature estimates based on all windows. As a result, some noise was observed in the host's local information. INDeGenIUS finds a cluster of the sequences to obtain a "major cluster" and estimates the host's native signature. In this way, the previous problems can be solved [35,36]. However, the measurement of each oligonucleotide is unnecessary, and some oligonucleotides are considered to be important indicators of horizontal transfer. Therefore, SigHunt detects the core tetranucleotides based on the related genomes using the tetranucleotide mass fraction instead of selecting all possible tetranucleotides [37].
Although the above algorithms achieve better performances, there are still some problems: 1) some methods mainly detect GIs through global testing, and pay attention to whether the local signature of a region is obviously not the same with the host. But, these characteristics are directly related to the scale of genomic signatures, for example, poor local genomic signatures may miss some small details at large scale; in contrast, small-scale features retain local features, whereas the GI detection is largely affected by large-scale differences. Therefore, the future developments of GI prediction should use multi-scale methods to explore the multi-scale genomic signatures; 2) the above algorithms detect some typical regions as possible genomic islands and do not refine the boundaries. If the predicted boundary of GIs can be further optimized, the effectiveness and efficiency of the prediction will be improved.
To address these problems, we proposed here a novel method "2SigFinder", the first combined use of small-scale and large-scale statistical testing for genomic island detection. We propose an iterative of a small-scale t-test with large-scale feature selection techniques for each region of the genome to facilitate quantification of its compositional differences with the host, instead of calculating the distance or discrete interval cumulative score for each region. We used the higher moments of each tetranucleotide and designed an iteration of large-scale statistical testing with dynamic signals from small-scale feature selection to identify some multi-window segments; in addition, we split them into optimal distinct segments according to the CG-content bias and detect the genomic islands. At last, the CG-based segmentation method and the Markovian Jensen-Shannon divergence are used to optimize the boundaries of genomic islands.

Comparison to the algorithms based on the windows for detecting GIs
We evaluated the effectiveness of our algorithm by detecting GI/non-GIs. Langille et al. constructed GI analysis data from 675 complete bacterial genomes. All genomes have a sufficient number of related species or strains, using strict but possibly flexible standards [19]. They identified some regions stored in all genomes as negative datasets and built a standard dataset to evaluate the efficient of genomic island detection methods. The data contains 771 genomic islands, referred to as GI, as well as 3770 non-genomic island fragments (non-GI), ranging in length from 8 kb to 31 kb. Since these GIs and non-GIs come from 118 genomes, the genomes of representative species come from the field of bacteria and archaea.
2SigFinder was used to classify GIs / non-GIs, where the transformed window is 1, the eye window is 5, the neighborhood size is 10, the long window is 50, 256 core features and 4 dynamic features are used, with 10 iterations and 0.05 standard error. Finally, the 3 kb "raw" genomic islands were used to find the genomic island boundary. Three published algorithms were also evaluated on the same dataset with default values [34,35,37]. When we used the SigHunt and INDeGenIUS methods, the significance level 0.05 test was selected to identify genomic islands, where DIAS was calculated based on all of the tetranucleotides.
The overall accuracy of the 2SigFinder was 85.16%, which achieved the best results, while the overall accuracy of the other methods was similar, ranging from 80 to 82% (Table 1). As for accuracy and recall, it is easy to find that the recall rate of 2SigFinder exceeds 45%, and no other methods. INDeGenIUS got a better precision, but its accuracy was lower (19.99%) [35]. The SigHunt's performance did not meet expectations, and we infer that it predicts more genomic islands (758), and the average length of the predicted fragments is smaller (4670 bp) compared with other methods (number: 277-346, and average length: 13146-22,423 bp). These results indicates that 2SigFinder outperforms other algorithms in genomic island detection. Identification of genomic islands in Pseudomonas aeru-ginosa LESB58 We next evaluated the proposed method 2SigFinder on P. aeruginosa LESB58 genome, whose genomic islands have been explored widely [38][39][40]. There are currently 6 prophage gene clusters and 5 annotated pathogenicity islands in P. aeruginosa LESB58 [38,41,42].
We applied 2SigFinder to identify the genomic islands in the P. aeruginosa LESB58 genome, where transformed window is 4, eye window is 5, neighbourhood size is 4 and long window size is 100, using 256 core features and 4 dynamic features, with 4 iterations in IST-LFS and 4 iterations in ILST-DSFS, and 0.05 standard error. At last, 2 kb upstream/ downstream of 'raw' genomic islands was used to refine the boundaries of predicted genomic islands. Six algorithms based on the windows and a comparative genomics were also used to predict the genomic islands with default values [19,23,31,32,34,35,37]. The level of the same significance test was set to 0.05, and the score results were used to identify the putative GIs. Figure 1a is the comparison of different detection algorithms on P. aeruginosa. LESB58 [37,41,42]. Since Alien_Hunter detected a large number of hypothetical regions, the predicted GI has the longest length (Fig. 1b). Note that although Alien_Hunter detected 293 kb in the established island-encoded 451 kb DNA, but its false positives was large (Fig. 1b). Thus, it gets the better recall at the expense of its accuracy ( Fig. 1c and Tables 2 and 3).
In contrast, comparative genomics IslandPick got better prediction results by detecting 16 genomic islands. In order to further evaluate the predictive ability of GI level, we calculated the accuracy rate and F1 using the annotated genomic islands with more than 50% covered by the prediction results. Half of the 5 known genomic islands are predicted by IslandPick, which lead to high FDR and low F1 score ( Fig. 1c and Tables 2 and 3).  (Table 3). We observed that about 50% of the predicted 277,741 nucleotides were found in annotated genomic islands. It got a large true positive, and its false positive is also low (Fig. 1b). We then found that half of the 6 annotated genomic islands were predicted by 2SigFinder, resulting in the high accuracy and F1 (Fig. 1c and Table 3).
Through a comprehensive study, AlienHunter was found to be sensitive, but it has high false positive. Some algorithms based on the windows found some genomic islands, but their sizes are small. Thus, the results indicates that 2SigFinder is more efficient in identifying genomic islands.

Identifying functional features in S. enterica Typhi CT18
Comparative genomics found that genomic island is often accompanied by different insertion sequences, repeat sequences and migratory tRNA genes. These features can better discover the function of genomic islands. Therefore, we further studied these  functional features associated with the real genomic islands and predicted genomic islands from different prediction methods. We used the annotated genome to search for some characteristic genes in the genome islands. We looked for genes containing ribosomal proteins, genes with partner degradation functions, genes associated with energy metabolism, treated them as highly expressed genes, and counted their total number within genomic islands [39]. We used REPuter software to find repeated sequence fragments in genomic islands [40], and downloaded the annotation file from the US National Center for Biotechnology Information and looked for the insertion sequence within the genomic islands.
Here, we further analysed S. enterica Typhi CT18 whose genomic islands was annotated [23,43]. There are currently 17 pathogenicity islands in this sequence [23], and multiple phage has been found as well as the unidentified island [3,44], resulted in 21 fragments reliably from foreign origin. All the functional features associated with genuine genomic islands have been summarized in Table 4.
2SigFinder was used to detect genomic islands in this sequence, where transformed window is equal to 4, eye window size is 5, neighbourhood size is 4 and long window size is 100, using 256 core features and 4 dynamic features, with 8 iterations in IST-LFS and 10 iterations in ILST-DSFS, and 0.05 standard error. At last, it used 20 kb around genomic islands to search the GI's boundary. Six algorithms based on the window and a comparative genomics were also used to predict the genomic islands with default values [19,23,31,32,34,35,37]. As before, we employed the same test with 0.05 level to detect the genome islands. All the functional features associated with the predicted genomic islands have been summarized in Table 4.
To evaluate the predicted GIs, we calculated their features within the real GIs, more than 50% of which was covered by the results of the prediction method. For Phage and HEG, 2SigFinder outperforms the other methods, and it detected 101 Phage and 28 HEG out of 130 Phage and 36 HEGs. As for features associated with GIs, including Pathogenicity, tRNA, Virulence and Repeats, 2SigHunter and Alien_Hunter achieve the best performance, where 25 Pathogenicity, 1 tRNA, 2 Virulence and 2 Repeats were identified from 27 Pathogenicity, 1 tRNA, 2 Virulence and 2 Repeats. For the Integrase,  Transposase and IS features, Alien_Hunter outperforms the other methods. The next best method is 2SigFinder, whereas the other methods lag behind (Table 4). PAI is a type of GIs that possesses the genetic elements of pathogens of virulence factors and affects the horizontal transfer of genes of multiple virulence factors. Ten PAIs are located in this genome as revealed by PAIDB [28,29], and more information are summarised in Table 5. To further evaluate the predicted GIs, we counted the number of PAIs, more than 50% of which was covered by the results of the prediction method. Figure 2 indicates that Alien_Hunter achieves the best performance, with 9 out 10 PAIs were identified. The next best method is 2SigFinder, whereas the other methods lag behind. Moreover, Alien_Hunter performs better in detection of Integrase, Transposase, IS features and PAI because it predicted a lot of genomic islands, and its false positive is high ( Table 6), indicating that it is of limited practical use. These results show that 2SigFinder is more efficient in detecting functional features associated with GIs.

Discussion
Genome islands refer to a type of gene clusters with horizontal origin in the genome, which is closely related to the rapid adaptation of the organism, making it have important values such as medical, economic or environmental. Comparative genomics analyses 16S rRNAs and other orthologs among different genomes to detect genomic islands. However, it relies largely on genomic comparison methods and thus can be limited to the use of annotations or closely related but unavailable genomes. Therefore, the emergence of research into comparison-free method is apparent and necessary to overcome critical limitations of comparative genomics. Table 5 Ten pathogenicity islands reported to be located in S. enterica Typhi CT18, and name, star position, end position, size and function of these PAIs have been summarized from the pathogenicity island database (PAIDB) Several algorithms have been proposed and achieve better performances, but there are still some problems in genomic island detection. 2SigFinder is a genomic island recognition method based on small-scale and large-scale statistical tests proposed by this paper. Through a comprehensive study, we found that AlienHunter was found to be sensitive, but it predicts more genomic islands, and the average length of the predicted fragments is smaller. Comparative genomics got better prediction results, but the number of genomic islands is predicted to be less. Some algorithms based on the windows found some genomic islands, but their sizes are small. 2SigFinder is more efficient in detecting genomic islands and their functional features. Although 2SigFinder achieved better performance, it is still not a generic solution to detect all GIs in different organisms. It relies on the observation of different tetranucleotides, thus only limited genomic signatures can be used. Sometimes, the detection of GI by tetranucleotide is not strong enough, which may lead to false negative prediction. For small genomic islands and not providing sufficient oligonucleotide patterns from their host genome,  2SigFinder may also be difficult to detect. Therefore, further research could also be conducted to determine genomic signatures that are more efficient for genomic island prediction.

Conclusion
Several methods mainly detect GIs through global testing and pay attention to whether the local signature of a region is not the same with the host. In this paper, we proposed a genomic island recognition method based on small-scale and large-scale statistical tests. The existing methods generally have the predetermined thresholds, and the information of each window is limited. In the proposed method, we unique research the variability of higher moments of each tetranucleotide and designed an iteration of large-scale statistical testing with dynamic signals from small-scale feature selection to identify some multi-window segments; in addition, we split them into optimal distinct segments according to the CG-content bias. After depicting these compositionally different segments, the selection of genomic islands was performed by their IST-LFS scores. Finally, the CG-based divergence are used to optimize the boundaries of genomic islands. Systematic and quantitative assessment demonstrated that 2SigFinder is more robust than other existing methods in identifying genomic islands. As for the functional features associated with the real genomic islands, 2SigFinder is more efficient in inspection of the functions of genomic islands.

Methods
We designed a test-based algorithm to identify GI. The framework is shown in Fig. 3, and the steps are as follows: At smaller scales, we used small-scale t-tests to score each window based on the large-scale selection to evaluate the component differences in each area (Fig. 3a). We first divided a genome into n windows with 1 kb long and calculated the frequencies f of the tetranucleotides. For each window, the confidence interval of the mean variance s2 was estimated as: where s 2 is the mean value of all windows variances, ss2 is denoted as a variance, α is a confidence level, and N is the total number of the windows. In n windows, the kurtosis of each tetranucleotide is defined as follows f is the average of a tetranucleotide. If a tetranucleotide has a larger kurtosis, it will be selected as the information signatures.
Given the ith window, we calculated the two-sample t-test between the host and the ith window. For each f j of the ith window, we choose its left and right window regions as a sample ð f i−εþ1 j ; ⋯; f i j ; ⋯; f iþε j Þ of the signature f j from the ith window. The signa-ture f j from the host was represented as ð f t 1 j ; f t 2 j ⋯; f t Γ j Þ, and t Τ tT is the window number from the host and Γ denotes the chose signatures. Then, we used the t-test to determine if the average values of the two samples ð f i−εþ1 j ; ⋯; f i j ; ⋯; f iþε j Þ fji-ε + 1,⋯,fji, ⋯,fji + ε and ð f t 1 j ; f t 2 j ⋯; f t Γ j Þ fjt1,fjt2,⋯,fjtΓ are equal, and calculated the P-value of informative signature as follows: where Fig. 3 Overview of the 2SigFinder algorithm. a The work-flow of the small-scale t-test with large-scale feature selection, in which signatures of the host are extracted using the confidence interval of window variances, and core signatures are selected based on ordered kurtosis. During an iteration, we score each window using the two-sample t-test and selecte the windows whose scores are large enough to be considered to be statistically significant. b The workflow of the large-scale statistical test using dynamic signals from small-scale feature selection. Starting from the higher moments of each tetranucleotide, we select signatures of the host using the confidence interval of window variances and select dynamic core signatures using large sliding windows. During an iteration, we score each sliding long window with an accumulative score and select the windows whose scores are large enough to be consid-ered to be statistically significant f 1 j × 1 and f 2 j × 1 (s 2 f 1 j s12 and s 2 f 2 j ) denote the average (variances) of the ith region fjiε + 1,⋯,fji,⋯,fji + εand the host. Accumulating all the signature p values, the difference was as follows: Then we selected some windows with scores large enough to make the data statistically significant, and delete these selected windows. We updated all windows in the genome, and then repeated the above steps until no windows were found.