SweepCluster: A SNP clustering tool for detecting gene-specific sweeps in prokaryotes

Background The gene-specific sweep is a selection process where an advantageous mutation along with the nearby neutral sites in a gene region increases the frequency in the population. It has been demonstrated to play important roles in ecological differentiation or phenotypic divergence in microbial populations. Therefore, identifying gene-specific sweeps in microorganisms will not only provide insights into the evolutionary mechanisms, but also unravel potential genetic markers associated with biological phenotypes. However, current methods were mainly developed for detecting selective sweeps in eukaryotic data of sparse genotypes and are not readily applicable to prokaryotic data. Furthermore, some challenges have not been sufficiently addressed by the methods, such as the low spatial resolution of sweep regions and lack of consideration of the spatial distribution of mutations. Results We proposed a novel gene-centric and spatial-aware approach for identifying gene-specific sweeps in prokaryotes and implemented it in a python tool SweepCluster. Our method searches for gene regions with a high level of spatial clustering of pre-selected polymorphisms in genotype datasets assuming a null distribution model of neutral selection. The pre-selection of polymorphisms is based on their genetic signatures, such as elevated population subdivision, excessive linkage disequilibrium, or significant phenotype association. Performance evaluation using simulation data showed that the sensitivity and specificity of the clustering algorithm in SweepCluster is above 90%. The application of SweepCluster in two real datasets from the bacteria Streptococcus pyogenes and Streptococcus suis showed that the impact of pre-selection was dramatic and significantly reduced the uninformative signals. We validated our method using the genotype data from Vibrio cyclitrophicus, the only available dataset of gene-specific sweeps in bacteria, and obtained a concordance rate of 78%. We noted that the concordance rate could be underestimated due to distinct reference genomes and clustering strategies. The application to the human genotype datasets showed that SweepCluster is also applicable to eukaryotic data and is able to recover 80% of a catalog of known sweep regions. Conclusion SweepCluster is applicable to a broad category of datasets. It will be valuable for detecting gene-specific sweeps in diverse genotypic data and provide novel insights on adaptive evolution. Supplementary Information The online version contains supplementary material available at 10.1186/s12859-021-04533-6.

Keywords: SweepCluster, SNP clustering, Gene-specific sweep Background A selective sweep is a process where a beneficial allelic change sweeps through the population and becomes fixed in a specific population, and the nearby sites in linkage disequilibrium will hitchhike together and also become fixed [1,2]. Those sweep regions containing beneficial alleles could possibly be introduced by recombination and rise to high frequency rapidly in the population under positive selection. If the increase in frequency is recent or fast relative to other recombination events, the mutation profile in the sweep regions across the population will be maintained without being interrupted. Finally, the process will imprint genetic signatures in the population genomes, leading to lowered within-population genetic diversity, increased between-population differentiation, and/or high linkage disequilibrium [3][4][5]. When such selective sweeps only occur at specific gene regions under selection without affecting the genome-wide diversity, they are described as gene-specific sweep [6].
Recently, the gene-specific sweep has been demonstrated to play important roles in adaptive evolution in microbial populations, such as ecological differentiation in Prochlorococcus [7] and Synechococcus [8], speciation in marine bacterium Vibrio cyclitrophicus (V. cyclitrophicus) [3,9], and phenotypic divergence in human adapted pathogen Streptococcus pyogenes (S. pyogenes) [10]. The observation of the gene-specific sweeps in those scenarios in both environmental organisms and host pathogens suggests that the gene-specific sweep may represent one of the general mechanisms underlying adaptive evolution of microorganisms. Therefore, identifying the gene-specific sweep on the genome-wide scale will not only provide insights into the evolutionary mechanisms shaping the genetic diversity, but also help to unravel potential genetic markers associated with ecological adaptation or phenotypic differentiation.
An array of methods have been proposed to identify the gene-specific sweeps and are generally fall into three categories based on the genetic signatures being captured, i.e., (1) composite likelihood ratio (CLR) tests of the marginal likelihood of the allele frequency spectrum under a model with selective sweeps in comparison with that under a model of selective neutrality [11][12][13] (Kim and Stephan-2002, Nielsen-2005, Huber-2016; (2) comparison of the distribution of population subdivision or linkage disequilibrium in a region under positive selection with that of a neutral background [14,15] (Akey-2002, Kim-Nielsen-2004); (3) haplotype-based approaches for detecting elevated haplotype homozygosity in a locus around the selected site in comparison with that under a neutral model [16][17][18][19][20]  However, those methods were mainly developed for detecting selective sweeps in eukaryotic data and are not readily applicable to prokaryotic data, such as the haplotype-based approaches [21]. In addition, some challenges have not yet been sufficiently addressed by the currently available methods. For example, the gene-centric concept of the gene-specific sweep has not been taken into account leading to a low spatial resolution of sweep regions; the spatial distribution properties of the mutated sites within the sweep regions have not been fully considered.
In this study, we propose a new gene-centric approach specifically for identifying the gene-specific sweeps in prokaryotes, which search for regions with a higher level of spatial clustering of single nucleotide polymorphisms (SNPs) assuming a null distribution model of SNPs under neutral selection. The clustering applies to the SNP subsets of specific interests, which can be selected based on the genetic signatures of sweep regions, such as elevated population subdivision, reduced within-population diversity, excessive linkage disequilibrium, or significant phenotype association. Our approach is the first such type specifically for identifying gene-specific sweeps in prokaryotes and differs from the previous methods for eukaryotes in that: (1) it applies the gene-centric concept by considering the gene-specific location of SNPs; (2) it takes advantages of the spatial distribution properties of SNPs in the sweep region; (3) the clustering is performed on pre-selected target SNPs with specific genetic properties, thus minimizing the influences from uninformative SNPs. We offer it as an open-source tool "SweepCluster" and it is freely accessible at github: https:// github. com/ BaoCo deLab/ Sweep Clust er.

Pre-selection of SNPs
The pre-selection of SNPs could be based on elevated population differentiation Fst, extended linkage disequilibrium LD, or phenotypic association. However, the determination could also depend on the data property and study purposes. For instance, if the positive selection acting on disease markers is of interest, the screening of SNPs with significant association with disease phenotypes using robust genome-wide association analysis is preferred. In the real and simulated datasets in this study, we selected the SNPs associated with phenotypic divergence or population differentiation.

Overview of the clustering approach
The SNP clustering algorithm employs a gene-centric concept to mimic the biological process of introducing gene-specific sweeps. In the gene-specific sweep model, nonsynonymous SNPs (the SNPs causing amino acid alterations) or upstream regulatory SNPs (the SNPs in the regulatory regions) are more likely to be under positive selection than synonymous SNPs (the SNPs without causing amino acid alterations) or inter-genic SNPs, and the selected non-synonymous SNPs along with the nearby synonymous/intergenic SNPs are introduced simultaneously in a single event. For a recent sweep event, the selected SNPs and the hitchhiking SNPs are tightly clustered in specific gene regions without severely ruined by other recombination events. Based on the gene-specific sweep model, our clustering strategy is illustrated in Fig. 1 and described previously [10]. Briefly, a non-synonymous or upstream regulatory SNP is randomly chosen in a specific gene/operon and serves as an anchor for an initial cluster. The initial cluster is then extended progressively by scanning and merging the neighboring SNPs or clusters. If the total span is shorter than the specified sweep length, then the surrounding SNPs or clusters are merged. Otherwise, the initial cluster is extended by merging the neighboring SNPs or clusters which minimize the normalized root-mean-square of inter-SNP distances (NRMSD): where d i is the ith inter-SNP distance, n is the total number of the SNPs in the target cluster and l is the maximum spanning range of the SNPs in the target cluster. The merging process will iterate until no neighboring SNP or cluster satisfies the merging criteria. The majority of CPU time is spent on making decision and merging procedures with the time complexity of O(n + n b · k), where n b is the number of boundary SNPs and k is the number of initial cluster. In the meantime, the required memory is on the scale of O(N) due to its one-dimensional property.
Following merging, all clusters are re-examined and split if any inter-SNP distance within the cluster is longer than a given distance threshold. The distance threshold can be determined based on the genome-wide average inter-SNP distance. Under the null neutral model, the SNPs are independently and randomly distributed across the genome, and the significance of a cluster with m distinct SNPs spanning a length of l can Neighbor is a singleton SNP Neighbor is a cluster Total span is within the sweep length?
The center span is close to the sweep length? Total span is close to the sweep length?
The center span is within the sweep length?  Fig. 1 Outline of the clustering procedures of SweepCluster. A non-synonymous or upstream regulatory SNP is randomly chosen in each gene/operon and serves as an anchoring SNP for an initial cluster. The initial cluster is then extended by scanning and merging the neighboring SNPs or clusters. If the total span is shorter than the specified sweep length, then the surrounding SNPs or clusters are merged. Otherwise, the initial cluster is extended by merging the neighboring SNPs or clusters such that the normalized root-mean-square of inter-SNP distances (NRMSD) is minimized. All clusters after merging are re-examined and split if any inter-SNP distance within the cluster is longer than a given inter-SNP distance threshold (max_dist) be evaluated as the accumulative probability of observing m SNPs in a cluster spanning a distance ≤ the observed length l and the probability density function can be represented as the gamma distribution [22]: Here, the rate parameter β is equal to the average mutation rate μ across the genome: μ = n/s, where n is the total number of SNPs in the genome and s is the length of the genome.

Evaluation of the performance of the clustering method
We evaluated the performance of the clustering using four metrics, i.e., CPU time, memory usage, accuracy and sensitivity based on simulation datasets. The evaluation of CPU time and memory usage was performed using real datasets with varying data size. The assessment of accuracy and sensitivity was conducted based on simulation datasets (see below). The accuracy is defined as the proportion of correctly assigned SNPs among the total SNPs. The sensitivity is defined as the proportion of detected clusters containing at least 90% of the SNPs correctly assigned. The mapping between detected clusters and expected clusters was determined based on reciprocal maximum overlapping between the two sets of clusters. The specificity is defined as the proportion of SNPs assigned outside of clusters among the SNPs expected to be outside of clusters.

Simulation datasets
Due to the lacking of a reference dataset with well-defined SNP clustering profiles, we created simulation datasets for assessing the accuracy, sensitivity, and specificity of the clustering algorithm. The simulation datasets were generated based on the gene region annotation of the bacterial strain S. pyogenes AP53, which was annotated and studied by us previously [23]. The purpose is to take the advantage of the natural gene regions in the genome for producing well-defined artificially SNP clusters under gene-specific sweeps. The SNPs were first generated independently and randomly on the genome based on the Poisson process of a given mutation rate (the average mutation rate of S. pyogenes). The SNPs were then processed to form expected clusters by taking the following procedures to satisfy the pre-defined parameter conditions of sweep length (sweep_lg) and maximum inter-SNP distance (max_dist): (1) roughly a half of the SNPs in each gene region were assigned non-synonymous such that each gene region contain non-synonymous SNPs. The purpose is to make the cluster detection by DBSCAN and SweepCluster independent on the biological functions of SNPs, such that their comparison is robust to biological factors given that the design of SweepCluster favors gene regions containing non-synonymous SNPs.
(2) removing the SNPs in the gene regions longer than sweep_ lg + 50 to create SNP clusters satisfying a specific condition of the parameter sweep_lg; (3) if the spanning length of the neighboring genes is longer than sweep_lg + 50 and the inter-genic distance is smaller than max_dist, then remove the downstream gene to create a larger inter-genic distance to create SNP clusters satisfying a specific condition of the threshold of max_dist.

Real dataset of S. pyogenes genotypes
We used the real genomic datasets from two bacterial species S. pyogenes and S. suis to assess the effects of the procedure of SNP pre-selection. The reason to choose the two species is that they are known to have a high level of genomic variability and a high density of genotypes, which facilitate the manifestation of influences of SNP pre-selection [24,25]. S. pyogenes is a common human pathogenic bacterium causing diverse disease phenotypes, such as pharyngitis, skin infection, necrotizing fasciitis, and acute rheumatic fever. Previous studies have shown that the alleles in the gene regions of S. pyogenes exhibit phenotype-dependent changes, thus providing an excellent dataset for selecting SNPs associated with phenotype differentiation [10,26]. The genomic sequences of S. pyogenes were downloaded from NCBI Genbank database (ftp:// ftp. ncbi. nlm. nih. gov). A total of 46 genomes were chosen for this study with balanced distribution of phenotypes based on the known phenotypic information [10]. The core genome is defined as the regions encoded by all studied genomes and was determined by aligning the shredded genomes against the reference strain AP53 (CP013672). Finally, the core genome contains 69,171 segregating sites mutated in at least one of the genomes and were concatenated for downstream analysis. Both the whole set of SNPs at all segregating loci and a subset of selected SNPs associated with the phenotype of acute rheumatic fever were used for inferring sweep regions using SweepCluster. The SNPs associated with the disease phenotype were identified using the Chi-squared test. The parameters used for SweepCluster are "-sweep_lg 1781 -max_dist 1100 -min_num 2" and the clustering significance was evaluated using the function "Pval" with the parameter of mutation rate "-rate 0.0362". The linkage disequilibrium analysis of the SNPs was performed using Haploview [27].

Real dataset of Streptococcus suis (S. suis) genotypes
S. suis is a swine pathogen that colonizes pigs asymptomatically but can also causes severe clinical diseases in pigs such as respiratory infection, septicemia, and meningitis. S. suis can be classified to 29 distinct serotypes forming complex population structures [28]. Previous phylogenetic study showed that many serotypes exist in multiple subpopulations and each subpopulation may contain multiple serotypes [25]. The complexity has been associated with extensive genetic recombination and genomic shuffling among and between populations. Therefore, it will be interesting to investigate the occurrence of selective sweeps among subpopulations in the highly recombining genome of S. suis.
A total of 1,197 genomic sequences of S. suis strains were downloaded from the NCBI Genbank database (ftp:// ftp. ncbi. nlm. nih. gov). We removed the redundancy among the genomes to reduce the data size by grouping them based on the submission institutions and selecting the most distant genomes within each group based on the phylogenetic structures built by SplitsTree [29] (Additional file 9: Figuure S1A). The selected genomes were further filtered based on their phylogenetic distance. The final dataset comprises 209 non-redundant genomes (Additional file 9: Figure S1B) and gives rise to a total of 236,860 segregating mutation sites with BM407 as the reference (FM252033). The core genome was identified using the same procedures as that for S. pyogenes. The inference of sweep regions using SweepCluster was performed respectively for all segregating SNPs and for those associated with differentiation of two subpopulations (branch-1 and branch-2 in Additional file 9: Figure S1C). The SNPs associated with population differentiation was identified using the Chi-squared test. The parameters used for SweepCluster and significance evaluation are "-sweep_lg 2000 -max_dist 2000 -min_num 2 and "-rate 0.1077", respectively. Here, "sweep_lg" is for sweep length, "max_dist" for maximum inter-SNP distance, "min_mum" for minimum number of SNPs in a cluster, and "rate" for average mutation rate across the genome.

Real dataset of V. cyclitrophicus genotypes
V. cyclitrophicus is a gram-negative bacterium inhabiting seawater. Previous studies reported ecological differentiation of the V. cyclitrophicus population associated with gene-specific sweeps [9]. The authors sequenced 20 strains of V. cyclitrophicus, which are divided into two phenotypic groups (S strains and L strains) according to their ecological partition. They showed that the partition is associated with the ecoSNPs, i.e., the dimorphic nucleotide positions with one allele present in all S strains and the other allele in all L strains. The authors then classified the ecoSNPs into 11 clusters and demonstrated the evidences of gene-specific sweeps in causing the ecoSNPs. This is the only available study of SNP clusters under gene-specific sweeps in bacteria. We used this dataset for benchmarking of our clustering method.
We downloaded the genomic sequences of the 20 strains from NCBI Genbank database (ftp:// ftp. ncbi. nlm. nih. gov) and aligned them to a reference strain with complete genome assembly (ECSMB14105) to derive the segregating SNPs of 139,066 and the phylogenetic structure (Additional file 10: Figure S2). The ecoSNPs were obtained using the same definition as that in the reference [9]. The ecoSNPs were then subject to cluster detection using SweepCluster with the parameters "-sweep_lg 8000 -max_dist 5000 -min_num 2" and "-rate 0.000111".

Empirical datasets of human genotypes
We employed the genotype datasets from the human 1000 Genomes project [30] to evaluate the ability of SweepCluster of identifying selective sweeps in eukaryotic data. We chose the 1000 Genomes datasets because they have been extensively used in previous studies of selective sweeps and a handful of gene loci have been well-characterized to be under selective sweep in specific subpopulations. We extracted the genotype data from three subpopulations, i.e., EUR (Europeans), AFR (Africans) and EAS (East Asians), and selected the mutation sites associated with pairwise population differentiation Fst. The calculation of Fst was based on Hudson's estimator in the transformed formula [31]: where n 1 /n 2 is the subpopulation size and p 1 /p 2 is the minor allele frequency for the two paired populations. Distinct subsets of SNPs were selected using a series of Fst thresholds (0.7, 0.65, 0.60, 0.55, 0.50, 0.45, 0.43, and 0.4) for inferring sweep regions to evaluate the robustness of SweepCluster in eukaryotic data. The parameters used for Sweep-Cluster are: "-sweep_lg 200,000 -max_dist 40,000 -min_num 2". The sweep regions and SNPs were annotated based on the genome build hg19 using ANNOVAR [32].

Optimization of the parameters
We carried out the parameter simulation of sweep lengths by calculating the number of sweep regions inferred by SweepCluster for varying values of sweep lengths in the range 300-10,000 bp. The relationship between the number of sweep regions versus sweep length was approximated using non-linear fitting implemented in generalized additive models in the R package "mgcv". The optimal estimation of the sweep length is calculated based on the maximum curvature in the fitting curves with the curvature calculated with the following formula: where f ′ (x) and f ′′ (x) are the first-order and second-order derivative of the fitting curves, respectively. We have provided in the package a shell script "sweep_lg_simulation.sh" for automatic optimization of the sweep length for any particular genotype dataset. The parallel acceleration was implemented in the script for fine-grained parameter searching.

Overview of SweepCluster
The package SweepCluster performs four major functions. (1) Density: calculates the SNP density using a window-scanning method in a specific genomic region or in the genome-wide scale; (2) Cluster: executes the core functionality of the package, i.e., gene-centric SNP clustering; (3) Pval: estimates the statistical significance of each SNP cluster based on a null gamma distribution of SNPs; (4) a driver script "sweep_ lg_simulation.sh" for parameter optimization.

Computing performance
The computing performance of SweepCluster was evaluated using multiple real datasets with varying number of SNPs (designated as n). The memory usage of Sweep-Cluster increases linearly with n consistent with the expected memory usage scale O(n). The used memory is fairly low even for the maximum datasets of 200,000 SNPs at about 260 megabytes (MB) (Fig. 2A). The CPU time consumption of SweepCluster is on the scale O(n 2 ) at the initial stage and then becomes nearly linear O(n) when n > 140,000 (Fig. 2B). It is in compliance with the expected time complexity O(n + n b · k), whereby the CPU time is governed by optimizing the boundary SNPs when n is small, but governed by clustering the inner SNPs when n is large with the ratio of boundary SNPs rapidly declining. Considering the linear increment of memory usage and CPU time, and the downsized genotype datasets upon pre-selection, we anticipate that the computing resources will not be limiting factors for larger datasets. In the meanwhile, it should also be noted that the computing performance also depends on the applied parameters (such as the sweep length) and the genotype data properties (such as the proportion of the boundary SNPs).

Performance of accuracy and sensitivity
We evaluated the performance of the clustering algorithm in SweepCluster in terms of accuracy, sensitivity and specificity using artificially generated simulation datasets. Due to the lacking of such clustering method for prokaryotes, we compared the performance of SweepCluster with that of DBSCAN, a general-purpose density-based spatial clustering algorithm without considering any trait information of the data [33]. DBSCAN has been commonly used in diverse scenarios for spatial clustering, and a variety of extensions have also been proposed to address specific challenges. For instances, the hierarchical clustering by HDBSCAN and OPTICS appropriate for variable density distributions [34,35]; GDBSCAN with the ability to automatically predict optimal parameters [36]; Fuzzy DBSCAN for dealing with datasets with partially overlapping borders [37]; MR-DBSCAN and DENCAST with distributed implementation for handling largescale and high-dimensional datasets [38,39]. In the current study, we chose DBSCAN for comparison due to its implementation in Python and efficiency matching with our dataset size. The simulation datasets were created with the SNP distributions satisfying specific combinations of sweep lengths (sweep_lg) and maximum inter-SNP distances (max_dist). The clusters in the simulation datasets were made such that the clustering results are insensitive to the biological composition of the clusters (such as the synonymous and non-synonymous SNPs, see the Methods & Materials) and that the comparison with the general-purpose DBSCAN is meaningful. The comparison showed that the performance of both algorithms as a function of maximum inter-SNP distances is highly similar, and quickly approaches optimum when the maximum inter-SNP distance increases to roughly 200 bp, close to the average inter-SNP distance in the simulation datasets (Fig. 3A, C). Interestingly, the performance of SweepCluster and DBSCAN as  (Fig. 3B, D). DBSCAN is not influenced significantly by the sweep length and performs nearly equally well for a broad range of sweep lengths. However, the performance of SweepCluster is dependent on the sweep length. It gradually improves with increasing sweep lengths and achieves optimal results at around 800-1000 bp, coincident with the average gene length of our simulation datasets. The dependence of the performance of SweepCluster on the sweep length is a manifestation of the gene-aware concept of the design of the clustering method in SweepCluster. When the parameter sweep_lg approaches the true value, the clustering results become close to the true clustering profile.

Efficacy of SNP pre-selection in real datasets of S. pyogenes and S. suis
We test the efficacy of the procedure of SNP pre-selection prior to clustering by employing real datasets from two bacterial species, S. pyogenes and S. suis of dense genotypes.
For the datasets of S. pyogenes, a total of 69,171 core SNPs were obtained across 46 representative strains and selection of SNPs based on phenotypic association with the disease acute rheumatic fever reduced the number of SNPs to 1,631 (Additional file 1: Table S1, S2 and S3). SweepCluster was subsequently applied to the two SNP datasets and identified 215 and 131 significant clusters (p value ≤ 0.05), respectively (Additional file 11: Figure S3, Additional file 1: Table S4 and S5). The relevance of the identified clusters to gene-specific sweeps is confirmed by the significant difference of population Fig. 3 The accuracy, sensitivity and specificity of the clustering algorithm in SweepCluster in comparison with DBSCAN. The accuracy, sensitivity, and specificity were calculated for a series of values of sweep lengths or maximum inter-SNP distances using SweepCluster. Only accuracy and sensitivity were calculated using DBSCAN due to the fact that DBSCAN classified all SNPs into clusters leaving no out-of-cluster SNPs. The accuracy and sensitivity were calculated for a series of values of eps (the maximum distance between two samples) and min_samples (the minimum number of samples in a neighbhorhood) using DBSCAN. Other parameters for DBSCAN were set as default, including metric (default = "euclidean"), algorithm (default = "auto"), leaf_size (default = 30), and n_jobs (default = -1) differentiation Fst between the SNPs within the clusters and those outside clusters (p value < 2.0 × 10 -16 ) (Additional file 11: Figure S3C). We then used linkage disequilibrium (LD) between SNPs within the clusters as a proxy to examine the effect of pre-selection. A snapshot of the comparison of the LD patterns before and after pre-selection is shown in Fig. 4A, B. The average LD within clusters was significantly increased after performing SNP pre-selection (p value < 2.2 × 10 -16 ), indicating the significant effect of pre-selection on diminishing the spurious signals in inferring sweep regions (Fig. 4C).
We carried out similar analysis for the genomic data of S. suis as that for S. pyogene. A total of 236,860 core SNPs were obtained across 209 non-redundant strains of S. suis and 349 clusters were identified using SweepCluster (p value ≤ 0.05) (Additional file 12: Figure S4A, Additional file 2: Table S6, S7 and S8). Without pre-selection of SNPs, we found that the clusters are densely distributed on the genome, implying that many of the clusters may contain false positive signals of selective sweep. Therefore, we selected the SNPs associated with differentiation of two subpopulations using the Chi-squared test (Additional file 9: Figure S1C). A total of 2,205 SNPs satisfies the significance threshold (p-value ≤ 0.05) and were subject to cluster detection using SweepCluster (Additional file 2: Table S9). A total of 111 clusters were identified with significance (p-value ≤ 0.05) (Additional file 12: Figure S4B and Additional file 2: Table S10). Similarly, the relevance of the identified clusters to gene-specific sweeps is confirmed by the difference of population differentiation Fst between the SNPs within the clusters and those outside clusters (p-value < 2.0 × 10 -16 ) (Additional file 12: Figure S4C). We examined the effect of SNP pre-selection by calculating the average inter-SNP LD within the clusters (Fig. 4D-F). The results reveal a higher level of average LD in the clusters from the selected SNPs than that from the whole set of SNPs (p-value < 4.0 × 10 -7 ), reiterating the efficiency of our strategy for identification of signals of sweep regions.

Influence of SNP pre-selection methods on SNP clustering
In order to examine the influence of different pre-selection methods on SNP clustering, we further used population differentiation Fst for SNP pre-selection for the S. pyogenes dataset and compared the clustering results with that from pre-selected SNPs using phenotypic association above. Finally, we obtained 2,729 selected SNPs with significant population differentiation (Fst ≥ 0.6), covering a large proportion of the SNPs (1,277 SNPs, 78.3%) selected using phenotypic association (Additional file 13: Figure S5A). We then performed clustering with SweepCluster for the 1,277 overlapped SNPs and 2,729 SNPs, generating 114 and 158 significant clusters, respectively (Additional file 13: Figure S5C and S5D). The two set of clusters cover 77% and 89% of the 131 clusters (with at least 90% common SNPs) detected from the SNPs with phenotypic association. It indicates that SNP clustering results are robust to the methods of pre-selection.

Application in empirical datasets of V. cyclitrophicus
We benchmark our method using the dataset in the Ref. [9], the only currently available study of SNP clusters under gene-specific sweep in bacteria. We processed the genomic data from the 20 strains of V. cyclitrophicus (13 L strains and 7 S strains) to obtain ecoS-NPs associated with ecological differentiation between the L and S population (Additional file 3: Table S11 and S12). Cluster detection is subsequently performed to the ecoSNPs using SweepCluster and 11 significant clusters were identified (Fig. 5, Table 1 and Additional file 3: Table S13). We validated our results by comparing with all eleven but two clusters reported in the Ref. [9]. We excluded cluster2 annotated as "Conserved protein" of which the equivalent gene in our reference cannot be precisely located, and cluster4 which contains flexible genes without falling into the core genome. Among the remaining nine clusters, seven were recovered by our method corresponding to a concordance rate of 78%. It is noted that cluster5 was not recovered because it does not contain non-synonymous or upstream regulatory mutations, reflecting different clustering strategies of the two studies. It is noticeable that we also identified with high significance two novel clusters cluster12 and cluster13 containing 6 and 36 SNPs, respectively (Table 1 and Additional file 3: Table S13). In summary, the cluster comparison shows that the differences in the identified clusters between our results and those in the Ref. [9] are mainly due to distinct clustering methods and reference genomes used in the two studies. The current study used the strain of V. cyclitrophicus ECSMB14105, the only strain of this bacterium with complete genome assembly, while the study of [9] took an alternative but closely related species V. splendidus (12B01) as the reference. Therefore, the concordance rate between the two studies should have been underestimated.

Application in empirical human genotype datasets
Though SweepCluster is specifically developed for prokaryotic data of dense genotypes, it will be helpful to test whether it is also applicable to eukaryotic data. We examined three well-characterized gene regions (LCT, EDAR, and PCDH15) under selective sweep in pairwise populations of EUR, AFR, and EAS from the human 1000 Genomes Project genotype datasets [30]. We at first performed SNP pre-selection based on the population differentiation Fst at a series of cutoff values, and then applied SweepCluster to each dataset of selected SNPs to search for gene regions under potential selective sweep  (Additional file [4][5][6][7]. At the threshold of Fst = 0.4, all three gene loci were recovered as significant regions under selective sweep (Fig. 6). The LCT gene, encoding lactase, was previously shown to be associated with lactase persistence in European populations and the region around it has been acknowledged as the target for strong selective sweep [19,20,40]. In our cluster detection, the LCT locus along with the flanking gene regions (R3HDM1, UBXN4, and MCM6) forms a cluster of 57 variants spanning 235.6 kb with significance (p-value = 5.7 × 10 -6 ), consistent with the strong positive selection. The gene EDAR is involved in ectodermal development and the missense mutation V370A showed evidences for association with hair thickness in East Asians [41,42]. The region around EDAR has been identified to be the locus undergoing strong selective sweep [19,42,43]. We localized the EDAR-centered region (GCC2, LIMS1 and EDAR) of 132 variants (including V370A) spanning 145.8 kb with significance (p-value < 10 -8 ), implying strong selection signals. The gene PCDH15 encodes protocadherin and previous studies showed evidences of positive selection in East Asian populations [43,44]. We recovered the PCDH15 locus as a highly significant sweep region consisting of more than 300 variants spanning 369.5 kb (p-value < 10 -8 ), indicating a strong signature of selective sweep.
Our results show that the size and significance of the sweep regions depend on the SNP selection threshold of Fst, but the detection efficiency is robust for a wide range of In order to assess the overall performance of SweepCluster on detecting sweep regions for eukaryotic genotype data, we collected a catalog of 20 representative gene loci known to be under selective sweep or previously identified to be under positive selection by multiple studies [17,19,20,40,43], and examined whether they can be recovered by SweepCluster (Additional file 8). Notably, 16 of them (80%) were recovered by Sweep-Cluster and 14 (70%) reach high statistical significance (p-value < 0.006). The high rate of recovery reiterates the efficiency and robustness of SweepCluster in detecting sweep regions of eukaryotic data.

Optimization of parameters
The performance evaluation based on simulation data showed that the performance of the clustering algorithm in SweepCluster is closely related with the sweep length. Therefore, proper estimation of sweep lengths is critical for confident inference of selective sweep regions. Unfortunately, in many cases, it is not straightforward to derive the value of sweep lengths from genotype data. Therefore, we provided in the package a simulation script "sweep_lg_simulation.sh" to search for the optimal estimation of the sweep length for a specific genotype dataset. It is particularly suitable for prokaryotic data because the prokaryotes use gene conversion as the main vehicle for introducing selective sweeps and the sweeps are generally uniform in size [21].
We did the simulation by calculating the number of sweep regions inferred by Sweep-Cluster at a series of values of sweep lengths and then fitting a non-linear model for the relationship between the number of sweep regions and sweep lengths. The optimal estimation of sweep lengths is determined by the point of maximum curvature in the fitting model. Here we present the simulation results for the three real datasets of S. pyogenes, S. suis, and V. cyclitrophicus, respectively (Fig. 7). It is shown that all three datasets have the maximum curvatures at the sweep length of ~ 2000 bp (1638 bp for S. pyogenes, 1500 bp for S. suis, and 2157/1989 bp for the two chromosomes of V. cyclitrophicus). It is consistent with our previous estimation of 1,789 bp for S. pyogenes using the alternative tool ClonalFrame [10,45].

Discussion
We have proposed a gene-centric spatial clustering approach to identify gene-specific sweeps in bacterial polymorphism data. It targets for the mutation sites complying with specific genetic properties of selective sweeps and captures the regions with unusual clustering patterns of those mutations differing from that of a neutral expectation. Based on the known genetic properties of gene-specific sweeps, the target mutations are first obtained by selecting those with elevated population differentiation, reduced within-population diversity, heightened linkage disequilibrium, or significant phenotype association. The selected subsets of mutations are then subject to clustering. Therefore, our approach for inferring sweep regions employs two layer of information, i.e., genetic signatures and spatial distribution patterns of mutations under gene-specific sweeps in comparison with current methods focusing only on one layer of information in the genotype data [11-14, 16, 19, 20].
The purpose of the procedure of selecting target mutations of particular genetic signatures prior to clustering is to remove the spurious or uninformative signals and perform spatial clustering only for the mutations related with selective sweep. The impact of mutation selection was dramatic in our two example datasets from the bacteria S. pyogenes and S. suis. The level of linkage disequilibrium between SNPs, as a signature of selective is significantly increased by selecting those mutations associated with disease phenotypes or population differentiation. The ultimate datasets The relationship between the number of sweep regions and sweep lengths was fit using generalized additive models (red lines) and the curvature for each fitting curve was calculated using the formula (4) upon prior selection are more sensitive to the statistical test under the neutral model of spatial distribution of mutations, making it more efficient to identify gene regions under selective sweeps. Using the only available dataset of gene-specific sweeps in bacteria [9], we validated our method yielding a concordance rate of 78% for the detected clusters even with distinct clustering strategies and reference genomes in the two studies. Our approach is specifically designed for prokaryotic data of dense genotypes such that the mutations of particular genetic properties can be exhaustively obtained and the distribution of those mutations can be statistically distinguished from the null model. However, the tests showed that our method also performs well for eukaryotic data of sparse genotypes. We recovered the well-characterized gene regions (LCT, EDAR, and PCDH15) under selective sweeps in the 1000 Genomes Project genotype datasets. The signals of selective sweeps in the three gene loci persist for a wide range of mutation selection criteria, suggesting the robustness of our method on identifying sweep regions in sparse genotype data. Moreover, the spatial-aware strategy of the clustering makes the resolution of detected sweep regions narrowed down to single nucleotides facilitating identifying relatively old sweeps of low numbers of selected sites.
There are some limitations of our approach. It cannot distinguish explicitly between hard sweeps and soft sweeps, or recent sweeps and older sweeps because mixed sites of varying strength of selection are treated as a whole for statistical tests. Our method does not deal with the confounding effects of background selection, as the signatures of background selection are very similar to the real selection and it has been a challenge to confidently classify the background selection for many alternative approaches.

Conclusion
We proposed a novel gene-centric approach for identifying gene-specific sweeps implemented in the Python tool SweepCluster. It performs spatial clustering of polymorphisms to infer the regions with signatures of gene-specific sweeps by employing two layers of information, i.e., genetic properties and spatial distribution models of the polymorphisms. It is specifically developed for prokaryotic data of dense genotypes and exhibit efficiency and robustness in detecting sweep regions in the validation datasets. It also performs well for eukaryotic data in a wide dynamic range of parameters of genetic properties. We expect that our new method will be valuable for detecting gene-specific sweeps in diverse genotype data and provide novel insights on evolutionary selection.

Availability and requirements
Project name: A python tool SweepCluster.
License: GPL3.0 Any restriction to use by non-academics: Not for non-academics.