 Research
 Open access
 Published:
Clustering biological sequences with dynamic sequence similarity threshold
BMC Bioinformatics volume 23, Article number: 108 (2022)
Abstract
Background
Biological sequence clustering is a complicated data clustering problem owing to the high computation costs incurred for pairwise sequence distance calculations through sequence alignments, as well as difficulties in determining parameters for deriving robust clusters. While current approaches are successful in reducing the number of sequence alignments performed, the generated clusters are based on a single sequence identity threshold applied to every cluster. Poor choices of this identity threshold would thus lead to low quality clusters. There is however little support provided to users in selecting thresholds that are well matched with the input sequences.
Results
We present a novel sequence clustering approach called ALFATClust that exploits rapid pairwise alignmentfree sequence distance calculations and community detection in graph for clusters generation. Instead of a single threshold applied to every generated cluster, ALFATClust is capable of dynamically determining the cutoff threshold for each individual cluster by considering both cluster separation and intracluster sequence similarity. Benchmarking analysis shows that ALFATClust generally outperforms existing approaches by simultaneously maintaining cluster robustness and substantial cluster separation for the benchmark datasets. The software also provides an evaluation report for verifying the quality of the nonsingleton clusters obtained.
Conclusions
ALFATClust is able to generate sequence clusters having high intracluster sequence similarity and substantial separation between clusters without having users to decide precise similarity cutoff thresholds.
Background
Sequence clustering refers to the process of grouping together similar biological sequences such that only homologous sequences are expected to appear in each sequence cluster. It is particularly useful for identifying various sets of potentially homologous candidates from unknown sequences for further analysis or annotation, as well as aggregating sequencing reads for reference genes abundance estimation in metagenomic samples. Sequence clustering could be considered within the problem domain of general data clustering and are usually resolved using unsupervised learning techniques. Early approaches exploit agglomerative hierarchical clustering [1] to cluster sequences with either single linkage (e.g. BlastClust [2] and GeneRAGE [3]) or average linkage (methods proposed by Loewenstein et al. [4] and Uchiyama [5]) metrics. Partitional clustering, especially Kmeans clustering [6, 7], was another popular method used to derive sequence clusters (customized Kmeans approaches by Ashlock et al. [8] and Kelarev et al. [9]). All these approaches require a pairwise sequence distance matrix to be computed. High computation costs are therefore incurred due to the O(N^{2}) pairwise sequence alignments (such as BLAST alignments [10]) required for N sequences. The size of the distance matrix (N × N) also creates a scalability problem in terms of space complexity. Moreover, clustering results are often sensitive to userspecified clustering parameters. For example, the Kmeans algorithm would require the number of final clusters K to be specified upfront. In order to determine the optimal parameter values for any given set of sequences, the clustering process would need to be performed iteratively, each with a different set of parameter values. After each iteration, an internal validation index [11] such as the silhouette coefficient [12] is calculated from the pairwise sequence distance matrix for the generated output clusters. The set of parameter values having the best index score is deemed to generate the optimal clusters. However, the number of sequences in each cluster is expected to vary substantially [13], with some clusters having hundreds of sequences while some having only a few or even just a single sequence, therefore making it difficult to efficiently estimate the clustering parameters by assuming similar size or density for all clusters.
Many recent sequence clustering approaches [14,15,16,17,18] therefore aim at minimizing both the number of pairwise alignments performed and the space complexity (i.e. the sequence distance matrix is no longer required), as well as defining the cluster cutoff through a biologically comprehensible parameter known as the sequence identity threshold T. The value of T ranges from 0 (complete mismatch) to 1 (identical sequences) and is usually selected based on users’ domain knowledge or a widely accepted value within the domain. It thus reflects that the higher value of the threshold, the more similar would be the sequences in each derived cluster. Such approaches are primarily greedy algorithms that assign a target sequence to an existing cluster when the pairwise sequence identity of this target sequence with the representative center sequence of this cluster is at least T, or otherwise creates a new cluster of size = 1 with this target sequence as its representative center sequence. Techniques such as short word filtering [19, 20] can be applied to avoid the computational expensive procedure of aligning a target sequence with the representative center sequence if their pairwise sequence identity is likely below T. The greedy algorithms are also very space efficient because they only consider the sequence identities between the target sequence and all existing representative center sequences. The clustering results can however be affected by the choices of the representative center sequences for the clusters, which are determined by the order of the sequences assessed. MeShClust [21] therefore addresses this limitation by reselecting the representative center sequence from all sequences in the cluster after a new sequence has been added using the meanshift algorithm [22]. The representative center sequences for all clusters are also reselected again by meanshift once all sequences have been clustered, and two clusters are merged when the pairwise sequence identity between their representative center sequences is equal to or greater than T.
In contrast, MMseqs2 [23] models each sequence as a unique graph vertex, and two vertices are connected by an edge when the pairwise alignment of their underlying sequences satisfies particular criteria including significance, sequence coverage, and T. Sequence clusters are then obtained through a graph clustering approach. In addition, the sequence clustering tool Linclust [24] can be run as a preprocessing step to divide the sequences into intermediate clusters for individual graph clustering in each intermediate cluster for scalability. For computational efficiencies, MMseqs2 replaces the exact alignment process between sequences with rapid approximations. The speedup techniques utilized by the different algorithms are summarized in Additional file 1: Table S1. Although the sequence identity threshold T is comprehensible to most users, a poorly chosen value could generate clusters substantially different from the true biological clusters [21]. The most common scenario is that the value of T is set too high and hence some homologous sequences are assigned to different clusters, but these clusters are hardly identifiable given the large number of sequences or insufficient annotation. Parameter optimization with internal validation index is also not feasible for these approaches due to the absence of the complete distance matrix.
We therefore develop a novel sequence clustering method that dynamically adjusts the cutoff thresholds for individual clusters. It first estimates a complete pairwise sequence distance matrix using an alignmentfree approach to avoid performing the traditional slower sequence alignments. This distance matrix is then used to derive all the edge weights for a graph, in which each sequence is represented by a vertex, and the edge weight for an edge denotes the pairwise similarity between the pair of sequences associated with. Sequence clustering is now performed via an iterative graph clustering in which each vertex is regarded as a singleton graph cluster (a singleton graph cluster consists of only one vertex) initially. Each iteration begins with identifying potential clusters to be merged. A cluster separation cutoff threshold then determines which of them can be merged into a single cluster without introducing possible outliers. When a cluster is prohibited from expansion, it is deemed well separated from its neighbours. In contrast to the sequence identity threshold T, the cluster separation cutoff threshold is a dynamic threshold because it is partially determined by the clusters considered. We thus name this sequence clustering approach as AlignmentFree Adaptive Threshold Clustering, or ALFATClust in short. ALFATClust is implemented as a publicly available tool, which also provides an user option to evaluate the nonsingleton clusters in terms of sequence identity through sequence alignment.
The remaining sections of this manuscript are organized as follows: The “Methods” section first introduces the sequence distance calculation approach and its estimation parameters to be used in ALFATClust, and then gives an overview of the ALFATClust algorithm. Its core steps such as the binning process to derive the graph clusters and the graph contraction are illustrated afterwards. Details of the scalability enhancement and the optional sequence cluster evaluation report are also provided. In the “Results” section, we assess both clustering and time performances of ALFATClust with other sequence clustering tools using the benchmark datasets. We then elaborate on the advantages and limitations of ALFATClust in the “Discussion” section.
Methods
Pairwise sequence distance using alignmentfree approach
Mash [25] and Dashing [26] are two rapid alignmentfree sequence distance approaches that require kmer size and sketch size to be specified as input parameters. We therefore performed experiments to assess the distance calculation method to be implemented in ALFATClust and also determine the relevant input parameters. The experimental details can be found in Additional file 1 provided. Results (Additional file 1: Figures S1–S5) indicate that Mash is better due to the higher correlation between Mash distance and the sequence distance obtained by alignment approach. In addition, from Additional file 1: Figures S1 and S3, the optimal Mash kmer size for gene sequences is 17, or lower (e.g. 13) when some sequences are very short. For complete protein sequences, the optimal kmer size is 9 (Additional file 1: Figure S5). The optimal sketch size is 2000 for both gene and protein sequences. Since Mash distance d ranges from 0 to 1, the corresponding sequence similarity can be easily calculated as 1 − d.
Overview of ALFATClust algorithm
The ALFATClust algorithm consists of four components:

1.
Mash [25] distance calculation for constructing a graph to model pairwise distances between the input sequences;

2.
Leiden algorithm [27] to partition the graph into communities [28] of raw graph clusters;

3.
A binning process to bin the vertices in each raw graph cluster into validated graph clusters;

4.
A graph contraction step to replace every validated graph cluster by a single vertex.
Suppose a Mash distance matrix D is computed for a sequence set S such that d_{ij} of D represents the pairwise Mash distance between sequences s_{i} and s_{j} in S. Since Mash distance is a symmetric distance measure, d_{ij} = d_{ji} and both D and W = 1 − D are therefore symmetric matrices. An undirected weighted graph G = (V, E, W) is initialized with each vertex v_{i} ∈ V representing a sequence s_{i} ∈ S. Every vertex pair (v_{i}, v_{j}) (i < j) is connected with an edge e_{ij} ∈ E = {(v_{i}, v_{j})  v_{i}, v_{j} ∈ V, i < j}, and the edge weight of e_{ij} is equal to w_{ij} of W. w_{ii} = 1 despite the absence of selfloop e_{ii} in G. w_{ij} therefore denotes the sequence similarity between sequences s_{i} and s_{j}. Leiden algorithm then partitions the graph into raw graph clusters (communities) that maximize the score calculated using a predefined quality function. Given a value of γ between 0 and 1, the quality function of the Constant Potts Model (CPM) [29] guarantees that for any vertex v in a nonsingleton raw graph cluster R, the average edge weight of edges connecting between v and other vertices in R exceeds γ. This means the average sequence similarity between the sequences within every nonsingleton raw cluster is higher than γ. The core parameters of the ALFATClust algorithm thus include a value range [γ_{low}, γ_{high}] (γ_{low} < γ_{high}) for γ and a step size Δ, which are used to determine the values of γ used in the algorithm:

1.
Create graph G using Mash distance matrix D; assign γ_{high} to γ

2.
While γ > γ_{low }− Δ do:

3.
Run Leiden algorithm on G using CPM with γ to obtain set of raw clusters C_{raw}

4.
Initialize L with an empty list

5.
Assign C_{raw} to L if γ = γ_{high}; otherwise bin each raw cluster in C_{raw} and add the bins to L

6.
Contract G by replacing every graph cluster in L by a single vertex and update edges

7.
Decrement γ by Δ

8.
Return the sequence clusters represented by L derived in the final iteration
The abovementioned steps would derive graph (sequence) clusters iteratively from γ = γ_{high} to a value determined by γ_{low}. In the ith iteration, it first performs Leiden algorithm using CPM with γ = γ_{high }− (i − 1)Δ, and the raw clusters C_{raw} become the graph clusters L directly in the first iteration. From the second iteration onwards, the vertices in each individual raw cluster are binned (refer to the binning section below) and vertices in the same bin become a validated graph cluster in L. After deriving L for the current iteration, graph G is contracted in a way similar to Uchiyama’s approach [5] that all vertices belonging to a cluster in L are replaced by a single vertex. The next iteration, if any, begins with the contracted graph G. The graph clusters L derived in the final iteration thus become the output sequence clusters. Details of the Leiden algorithm with CPM, binning, and graph contraction are further elaborated below. A detailed description of the ALFATClust algorithm can be found in Additional file 1.
Leiden algorithm with Constant Potts Model to identify vertices for binning
Community detection [28] refers to the process of partitioning a graph such that vertices in each partition are closely related with each other. While each of these partitions is originally known as a community, we term it as a raw graph cluster or raw cluster in this article. In ALFATClust, community detection is performed using the Leiden algorithm [27] which is an improvement of the Louvain algorithm [30] in both cluster quality and execution time. Briefly, Leiden algorithm starts with each vertex as an individual cluster, and then updates the clusters iteratively to maximize the overall quality score q. The calculation of q requires a quality function to quantify how closely related the vertices are within each of the raw clusters. For a graph G without selfloop, Eq. (1) is the general form of the quality function as defined in CPM [29]:
where α_{ij} belongs to the adjacency matrix A. α_{ij} = 1 for any i and j because G is complete (i.e. an edge exists between any two distinct vertices regardless of edge weight). σ_{i} refers to the raw cluster for v_{i}, and δ(σ_{i}, σ_{j}) = 1 if σ_{i} = σ_{j}, i.e., both v_{i} and v_{j} belong to the same raw cluster, and zero otherwise. CPM therefore only considers edge weights for edges within a cluster but not those across clusters. The value of γ, which is called the resolution parameter, is within the range of the edge weight for G, i.e. from 0 to 1. When a vertex exists as a singleton raw cluster it contributes zero score to q according to Eq. (1). It follows that the baseline case for CPM is every raw cluster being a singleton cluster and therefore q = 0 regardless of the value of γ. Given a value of γ, for any vertex v in a nonsingleton raw cluster R, the average edge weight of the intracluster edges in R involving v must be greater than γ in order to contribute a positive score to q. γ therefore becomes a lower bound of the average intracluster edge weight for all raw clusters. Graph clusters maximizing the value of q are regarded as the raw clusters for a particular value of γ.
A problem for community detection is the size bias where large communities dominate over small communities [31]. Similar size bias are also observed in CPM where the lower the value of γ, the more it favours larger raw cluster size over higher edge weight. For example, suppose graph G consists of vertices v_{1}, v_{2}, v_{3}, v_{4}, v_{5}, … with edge weights w_{ij} in W. For γ = 0.875, v_{1} may form a raw cluster R = {v_{1}, v_{2}} when w_{12} = 0.89, thus giving q = 0.89 − 0.875 = 0.15. Meanwhile, v_{1} cannot form a larger raw cluster R’ = {v_{1}, v_{3}, v_{4}, v_{5}} when {w_{13}, w_{14}, w_{15}, w_{34}, w_{35}, w_{45}} = {0.84, 0.86, 0.87, 0.85, 0.87, 0.86}, because q for R’ is equal to (0.84 + 0.86 + 0.87 + 0.85 + 0.87 + 0.86) − 6 × 0.875 = − 0.1. This situation is however reversed for γ = 0.85, because q becomes 0.05 for R’ and is now larger than that for R (q = 0.89 − 0.85 = 0.04). Size bias occurs at γ = 0.85 in this example since the highest edge weight w_{12} is omitted from R’ for v_{1}. This suggests, in the presence of size bias, community detection may allocate less similar sequences to the same raw cluster and highly similar sequences to distinct raw clusters. Another problem is that parameter γ is still a uniform cluster cutoff threshold analogous to sequence identity threshold T. Nevertheless, community detection is still an effective means to identify potential vertices for individual graph clusters provided these shortcomings are addressed. Indeed, ALFATClust minimizes the impact of size bias by deriving the graph clusters iteratively and replacing them with individual vertices through graph contraction. Also, the subsequent binning process facilitates an adaptive cutoff threshold for individual clusters.
Binning process
The binning cutoff criteria requires calculating the arithmetic mean of the intracluster edge weight for every validated graph cluster obtained in the previous iteration. Since a vertex v_{i} in the contracted graph G represents a validated graph cluster, the average intracluster edge weight is regarded as the vertex weight of v_{i} and is denoted as w_{ii} of W. All diagonal elements w_{ii} of W therefore represent the vertex weights for G and other elements w_{ij} (i ≠ j) denote the edge weights. This does not affect the calculation of Eq. (1) for the community detection process because G has no selfloop e_{ii} and thus w_{ii} is not associated with any edge. Initially each vertex v_{i} in G is simply a singleton graph cluster, and w_{ii} = 1 for any singleton cluster represented by v_{i}. For the first iteration (γ = γ_{high}), all raw clusters become graph clusters directly without binning. This is because γ_{high} is supposed to be close to the highest possible edge weight (i.e. 1) so that the raw clusters identified are robust against size bias, and the intracluster edge weights are sufficiently large for raw clusters to qualify as graph clusters. The binning process, which is performed individually for each nonsingleton raw cluster, begins by sorting its intracluster edges in a descending order of edge weight. The two vertices v_{i} and v_{j} associated with edge e_{ij} are added to a vertex list J following the sorted edge order, however only the vertex not present in J can be added. When both v_{i} and v_{j} do not appear in J, the one representing the larger underlying graph cluster is added first. The first bin is created using the first vertex in J. Starting from the second vertex, the selected vertex v_{t} is eligible to be assigned to an existing bin B satisfying Eq. (2):
where
ρ(v_{i}) returns the underlying graph cluster size of v_{i} as the total number of primitive vertices (i.e. vertices in initial G) belonging to that cluster. In Fig. 1A, I_{bin}(B, W) denotes the current cluster compactness of B in terms of its average intrabin edge weight. I(B, W, v_{t}) refers to the average edge weight between v_{t} and vertices in B. By assuming the homologous sequences have significantly higher edge weights (lower Mash distances) with each other than with other nonhomologous sequences, v_{t} should not be assigned to B when I(B, W, v_{t}) is much smaller than I_{bin}(B, W), or it might introduce outliers into B otherwise. I_{bin}(B, W) − I(B, W, v_{t}) in Eq. (2) is therefore an estimate for the cluster separation between B and v_{t}, and I(B, W, v_{t}) − γ_{low} is the cutoff threshold determined by both B and v_{t} as well as the predefined γ_{low}. Figure 1A illustrates both cluster separation and adaptive cutoff threshold in the graph. v_{t} is assigned to the bin giving the highest positive score calculated with a scoring function Q(B, W, v_{t}):
The inequality in Eq. (2) is satisfied when Q(B, W, v_{t}) > 0. v_{t} is assigned to a new bin when none of the scores for the existing bins are positive. As a result, every bin derived from a raw cluster becomes a validated graph cluster for the current iteration.
Graph contraction
After binning all raw clusters, G is contracted such that each graph cluster C_{i} in G is now replaced by a new vertex v’_{i}, and its vertex weight w’_{ii} is equal to the average intracluster edge weight of C_{i}. Edges connecting between clusters C_{i} and C_{j} are replaced by a single edge connecting v’_{i} and v’_{j} in the contracted graph, with its edge weight averaged from the replaced edges. Figure 1B illustrates an example of the graph contraction process. However, graph clusters obtained from the same raw cluster are prohibited from appearing together in any raw cluster in subsequent iterations. This is achieved by assigning a large negative edge weight (–V’^{2} where V’ is the vertices of the contracted G) to the edges interconnecting them. The next iteration, if any, begins with the contracted graph. The graph contraction therefore preserves the validated graph clusters in subsequent iterations against size bias, provided that Δ is sufficiently small.
Selection of the core parameters in ALFATClust
The value of γ_{high} is supposed to approach the maximum possible edge weight value (i.e. 1) so that the raw clusters detected in the first iteration are robust against size bias. Most of the actual cutoffs for individual clusters are expected to appear above γ_{low}, which still needs to be set as high as possible to avoid capturing subsequent trivial (but larger) drops. For example, a decrease of edge weight from 0.85 to 0.72 is more significant than the next even larger drop from 0.74 to 0.59. The smaller the value of Δ the lesser is the size bias for CPM in an iteration. The minimum value for γ_{high} is at least 0.95. The value for Δ should not exceed 0.025. Both γ_{high} and Δ are usually relatively invariant and the value of γ_{low} can set between 0.7 and 0.8.
Scalability improvement
Although ALFATClust computes a full Mash distance matrix for its graph clustering, the matrix can be significantly reduced using a divideandconquer approach. A preclustering step is performed at the beginning to divide the sequences into multiple sequence partitions. This is achieved by running the highly scalable MMSeqs2 with sequence identity threshold T equal to γ_{low}, and each of its output sequence clusters becomes an individual sequence partition. ALFATClust can then run with each partition separately without compromising the overall clustering accuracy. This is because the pairwise sequence similarity between any two sequences in different partitions is below γ_{low}, hence filtering their corresponding vertex pair brings little impact to both CPM scoring community detection and the binning process. ALFATClust runs in preclustering mode when either the number of sequences exceeds a predefined limit (20,000 sequences by default) or the preclustering option is activated by the user.
Cluster evaluation
ALFATClust provides an optional evaluation of cluster quality in terms of sequence identity. For each nonsingleton cluster, the sequence giving the largest sum of intracluster edge weight and satisfying the sequence length criteria (i.e. between lower quartile − 1.5 × interquartile range and upper quartile + 1.5 × interquartile range) is selected as the representative center sequence. If no such sequence exists or the cluster consists of only two sequences, then the longest sequence will be selected instead. Pairwise sequence identity (number of matched bases divided by alignment length excluding terminal gaps, refer to Additional file 1 for implementation details) is calculated between the center sequence and every other sequence in the same cluster. The evaluation report includes both mean sequence identity and minimum sequence identity for every nonsingleton cluster.
Results
Benchmark datasets and setup
Antimicrobial resistance (AMR) gene and protein sequence datasets [32,33,34] are suitable for evaluating clustering effectiveness due to the presence of many distinct classes of homologous resistance gene sequences, and a wide range of sequence similarities among these sequences. AMR gene sequences are retrieved from various public AMR databases including CARD (version 3.0.7) [32], ResFinder (downloaded on 3th Feb, 2020) [33], and ARGANNOT (version 4) [34]. All AMR sequences collected are validated, integrated, annotated into an AMR gene sequence dataset using ARGDIT [35], which further translates this dataset into an AMR protein sequence dataset. NonAMR plasmid nucleotide sequences are extracted with their original annotation from PLSDB [36] (version 2019_10_07) to create a plasmid nucleotides dataset. Due to its highly variable sequence lengths, it can be used to investigate whether the clustering performance is affected when the sequence length differs substantially. Finally, scalability in terms of the number of sequences and sequence length is examined using viral nucleotide and amino acid sequence datasets (each consisting of ~ 470,000 sequences) retrieved from viruSITE (release 2021.1) [37]. Sequences consisting of ambiguous nucleotide or amino acid, or having their sequence length shorter than the smallest Mash kmer size benchmarked (13 for gene and 9 for protein) are discarded. In particular, viral nucleotide sequences are split into two separate datasets according to the 10,000 nucleotides criteria. The viral sequence datasets are only used for scalability assessments. Table 1 summarizes the properties of all benchmark datasets used, with further details provided in the “Results” section.
The performance of ALFATClust is compared with CDHIT (version 4.8.1), UCLUST (version 11.0.667), VSEARCH (version 2.14.1), DNACLUST (version 3.7), MeShClust (also MeShClust^{2} [38] for the viral nucleotide dataset), and MMseqs2 (version 12.113e3). The identity thresholds used for these tools range from 0.7 to 0.9 with a step size of 0.05. Any cluster optimization option(s) in the tools are also turned on whenever available except for the viral sequence datasets. The suggested value of γ_{low} is in the range of 0.7 to 0.8 for most instances; in this benchmark γ_{low} is set to three different values: 0.7, 0.75, and 0.8. Default values are applied for other parameters: γ_{high} = 0.95, and Δ = 0.025. Also, the default Mash kmer size (17 for nucleotide sequences and 9 for protein sequences) and sketch size (2000) are used except for both plasmid and viral nucleotide datasets where the kmer size is set to 13. Execution commands and options for all the tools are provided in Additional file 1. Silhouette scores [12] for the sequence clusters are computed using scikitlearn [39], while the results are plotted using Matplotlib [40]. All the benchmarks are performed on a workstation with a quadcore Intel Xeon W2102 CPU and 64 GB RAM. Cluster evaluation reports of ALFATClust for the AMR and plasmid sequence datasets are available in Additional files 2, 3, 4, 5, 6, 7, 9, 10, 11. In addition, although the Leiden algorithm involves random vertex and community selection, identical set of clusters is obtained from 50 individual runs for each of the AMR and plasmid datasets using any of the three values of γ_{low} specified above.
Sensitivity of clustering parameter γ _{low} and cluster robustness
Figure 2 shows the rate of decrease in number of clusters generated is slower with decreasing γ_{low} in ALFATClust than decreasing T in other clustering approaches, especially for the AMR sequence datasets. One possible explanation is the clustering results being less sensitive to γ_{low}, at least for its suggested value range (0.7–0.8), than T in other clustering tools. A further verification is therefore performed by comparing the clusters derived with these values of γ_{low} and T (0.9, 0.85, and 0.8) (Additional file 1: Figures S6–S12). ALFATClust exhibits the highest proportion of identical clusters shared across different γ_{low}. For example, in the AMR gene dataset, the number of identical clusters common across the three values of γ_{low} is 995 (Additional file 1: Figure S6), which constitute 83.4% of all distinct clusters seen. This proportion is remarkably higher than all other tools such as UCLUST (68.5%, Additional file 1: Figure S8) and MMseqs2 (65.9%, Additional file 1: Figure S9). This better cluster pattern convergence indicates a lower impact of selecting a nonoptimal γ_{low} from the suggested range. Another observation is the uneven distribution of the cluster sizes. When running ALFATClust with γ_{low} = 0.75, the average number of sequences per cluster and the standard deviation for the AMR gene, AMR protein, and plasmid nucleotides datasets are 3.72 ± 13.77, 3.74 ± 13.98, and 7.27 ± 15.66 respectively. Their cluster sizes are therefore highly varied. Moreover, while there are a few clusters consisting of over 100 sequences in each of these datasets, singleton clusters occupy ~ 70% of all derived clusters for the AMR sequence datasets and ~ 39% for the plasmid dataset. These datasets are thus real examples illustrating the difficulty in searching for suitable clustering parameter value (such as K in Kmeans clustering) evaluated with internal validation index.
The distribution of the minimum sequence identities presented in the ALFATClust cluster evaluation reports (available in Additional files 2, 3, 4, 5, 6, 7, 9, 10, 11) is shown in Fig. 3. For γ_{low} ≥ 0.75, at least 74% of the nonsingleton clusters have minimum sequence identity ≥ 0.9 in all three benchmark datasets; and only a few clusters have minimum sequence identity below 0.8. The particularly low sequence identities (< 0.7) for the plasmid nucleotides dataset are partially due to multiple large nonterminal gaps present in the pairwise sequence alignment, such as the one between the cluster center sequence “CP016074.1_rep7a_15_repC(pS0385p1)” and sequence “NC_017335.1_rep7a_18_rep(pS0385p2)” in the same cluster. Another reason is the significantly underestimated Mash distance due to partially overlapping sequence segments. Plasmid nucleotide sequence pair “LT906556.1_IncFII(pCoo)_1_pCoo” and “KX276657.1_IncFIC(FII)_1” is an example demonstrating such partial overlap, with the Mash distance equal to 0.069 (hence the corresponding sequence similarity is 1 − 0.069 = 0.931), while the true sequence identity calculated using alignment is only 0.555. The alignments giving large nonterminal gaps or partial overlap for the above examples are illustrated in Additional file 1. By examining the cluster evaluation reports generated, it is found that for the AMR gene dataset with γ_{low} = 0.7, the sequence pair having the lowest minimum sequence identity belongs to the same AMR gene family (AMR genes LRA3 and LRA9, both under subclass B3 LRA betalactamase according to CARD). For the AMR protein dataset with γ_{low} = 0.7, the sequence pair having the lowest identity score belongs to dihydrofolate reductase (conferring resistance to trimethoprim), although the sequence identity for this pair is quite low (0.675).
Overall sequence cluster quality benchmark
The overall sequence cluster quality of ALFATClust is compared with other approaches using both external and internal validation indices. Normalized mutual information (NMI) [41] and purity [42] are external validation indices used for this comparison with the AMR sequence datasets. The AMR sequences are manually classified into gene classes created using the gene names (and their synonyms provided by CARD [32]) identified from the AMR gene sequence annotations. Sets of AMR sequence clusters obtained by various approaches at different thresholds are evaluated individually against a set of 827 AMR gene classes consisting of 3 720 AMR sequences (accounting for ~ 92% of all AMR gene sequences, refer to Additional file 8 for the AMR gene classes). For a set of sequence clusters L consisting of N sequences together, NMI measures the correlation between L and the predefined gene classes Ω using Eq. (4) below:
where
and
The higher the values for NMI (maximum NMI is 1) the better correlation is between the sequence clusters and the gene classes. Figure 4 illustrates how NMI varies with γ_{low} for ALFATClust as well as T for other approaches. The NMI values obtained by ALFATClust are not less than any other tool (Additional file 1: Tables S2 and S3) in the AMR gene dataset, and its NMI at γ_{low} = 0.7 is the overall highest (0.933) in the AMR protein dataset. Moreover, the NMI value variation for γ_{low} between 0.7 and 0.8 is relatively small compared to others in both datasets. This is mainly due to the lower sensitivity of the clustering outcomes with γ_{low} than T. Purity is another external index to assess the homogeneity of gene class in the sequence clusters. Higher purity (maximum purity is 1) means the clusters are generally more dominated by sequences in the same gene class. Equation (5) below calculates purity:
Figure 5 shows a strictly decreasing trend of purity with decreasing γ_{low} and T for all the clustering approaches benchmarked. Nevertheless, the rate of decrease for ALFATClust is slower than others, and its purity is often close to or even higher than other tools at T = 0.85 (exact values for ALFATClust and other methods are shown in Additional file 1: Tables S4 and S5 respectively). The implication is that the cluster expansion in ALFATClust is less aggressive than the greedy algorithms towards lower thresholds. This is the benefit of determining an individual cutoff for each cluster by considering the separation with its neighbour clusters, rather than relying on a single rigid threshold to allocate sequences.
Internal validation indices such as silhouette coefficient c [12] evaluates the overall cluster separation using the true sequence distance matrix (rather than the Mash distance matrix). c is defined as:
where d_{neighbor}(s) measures the separation of sequence s, i.e. mean distance between s and a sequence in its nearest (called “neighboring”) cluster, and d_{intra}(s) measures the cohesion of s, i.e. mean distance between s and other sequences in the same cluster. c ranges from − 1 to 1, and a higher value of c indicates a better overall cluster separation. Internal validation indices can be used for cluster quality evaluation when the predetermined sequence classes are unknown (e.g. the plasmid nucleotides dataset). To compute the true sequence distance matrix for the plasmid nucleotides dataset, exact pairwise sequence identity (same calculation formula as the one used for cluster evaluation in ALFATClust) is calculated for every pair of sequences, and the pairwise distance is given by 1 − sequence identity. Figure 6 shows that while the maximum silhouette coefficient values are attained at distinct values of T for different tools, none of them is better than those for ALFATClust at all three values of γ_{low}. By comparing the silhouette coefficients between Additional file 1: Tables S6 and S7, it can be seen that the lowest silhouette coefficient achieved (0.716) by ALFATClust is equal to the highest score obtained with other approaches for the plasmid dataset.
Scalability performance
The scalability benchmark is performed with respect to the number of sequences using the viral nucleotide and amino acid sequence datasets, and sequence length using the long viral nucleotide dataset. ALFATClust clusters these datasets with γ_{low} = 0.75 (without the optional cluster evaluation), while others perform clustering with T = 0.85. The processing times shown in Table 2 vary substantially among different clustering approaches, ranging from a minute or less to several hours. Both MMSeqs2 and MeShClust runs faster than ALFATClust for the viral nucleotide dataset, but the number of clusters derived by MeShClust is unusually low. UCLUST requires less than a minute to cluster ~ 470,000 viral amino acid sequences, but it cannot process long viral nucleotide sequences due to its software limitation. MMseqs2 as well as the alignmentfree ALFATClust and MeShClust^{2} run much faster than other alignmentbased tools for long viral nucleotide sequences. In summary, ALFATClust is scalable for a large number of sequences due to the efficient preclustering based on MMseqs2, and sequence length through alignmentfree sequence distance calculation.
Discussion
ALFATClust is conceptually similar to hierarchical agglomerative clustering since its algorithm begins with each sequence (vertex) as a singleton graph cluster, and the graph clusters are gradually merged through iterations with decreasing resolution parameter γ. The graph contraction at the end of each iteration preserves the integrity of the current graph clusters against the size bias of CPM in subsequent iterations. Moreover, for each raw cluster, vertex sorting prioritizes its vertex pairs in descending order of edge weight (i.e. ascending order of sequence distance) for subsequent binning. Both intracluster and intercluster edge weight calculations are based on unweighted average linkage as shown in Fig. 1B. Using the CPM formulated in Eq. (1), community detection (Leiden algorithm) acts as a selection process at a particular value of γ to identify potential vertices for individual graph clusters. Equation (3) is the scoring function proposed to bin these vertices within each raw cluster to one or more graph clusters. For any existing bin B, it considers both the average intrabin edge weight I_{bin}(B, W), which is equal to the mean sequence similarity inferring sequence homology, and the proximity between B and vertex v_{t}, i.e. I(B, W, v_{t}), with respect to γ_{low}. Note that this scoring function only depends on the constant γ_{low} but not the variable γ in its calculation, hence the binning process is consistent throughout iterations. Although it references a single value of γ_{low}, every bin has its own actual cutoff above γ_{low}. In other words, γ_{low} is a soft cutoff as opposed to a global hard cutoff like sequence identity threshold T or number of clusters K [13] to define all output clusters. Individual adaptive cluster cutoffs offer greater flexibility to fit different clusters such that the cutoff value can be lowered for only certain clusters when necessary, i.e. to expand a cluster by including those slightly less similar sequences, if any. The benchmark analysis suggests that the clustering outcomes are less sensitive to different values of γ_{low} (at least for those within the suggested value range) compared to T, therefore reducing the impact of selecting a nonoptimal soft cutoff. It is also easier to simultaneously maintain relatively high cluster quality and cluster separation for a wider range of γ_{low}. This is much harder for other algorithms to balance between these two criteria through a uniform cluster cutoff threshold, of which the optimal value is often unknown and difficult to determine. Moreover, ALFATClust is scalable towards sequence length due to the use of alignmentfree sequence distance calculation such as Mash.
From the sequence similarity point of view, both the iterative cluster computation from γ_{high} down to γ_{low} (provided Δ is sufficiently small) accompanied by graph contraction, and the vertex sorting prior to the binning process generally prioritize more similar sequence pairs for clustering. This is particularly important because the linear correlation between the average nucleotide identity (ANI) and the Mash distance d decreases with increasing d [25], and thus this prioritization allows the sequence pairs to be processed from the most reliable (and small) sequence distances. In addition, since γ_{low} determines the value of γ for the final iteration, most of the pairwise sequence similarity values below γ_{low} are actually filtered, and so the clustering outcomes are not affected even they deviate substantially from the true sequence identities. Mash distance inaccuracies are mitigated by averaging the edge weights when collapsing the relevant vertices. The proposed ALFATClust algorithm allows preclustering to enhance scalability towards number of sequences without disrupting the overall clustering outcomes, because the sequences from distinct partitions are supposed to be dissimilar with each other. Cluster evaluation is also available for users to inspect the quality of the nonsingleton sequence clusters, and determine whether the specified value of γ_{low} is appropriate for the given dataset. In particular, the recommended value for γ_{low} is 0.7 or above, because the linear correlation between ANI and d might become distorted when its value is too low. It is sufficient to set γ_{high} to 0.95 for most cases, and the step size Δ is not larger than 0.025.
Although both ALFATClust and CARNACLR [43] are clustering tools based on community detection, they do have fundamental differences. Firstly, CARNACLR relies on read alignment (mapping) to determine whether two long reads overlap significantly based on a similarity threshold set in the alignment tool, and an unweighted edge between two vertices (reads) denotes such overlap; ALFATClust creates a complete graph using the alignmentfree Mash distances, which are converted to edge weights representing (average) pairwise sequence similarities. Hence, it does not impose any similarity threshold for graph construction. Secondly, CARNACLR partitions the graph into K cliques or dense subgraphs by minimizing the number of edges between them, and the value of K is determined internally; ALFATClust performs sequence clustering based on a userspecified soft cutoff γ_{low} as explained above. Finally, CARNACLR resolves intersecting clusters by identifying spurious read overlaps and cutting the edges accordingly; ALFATClust gradually expands the clusters from the largest (and also the most reliable) edge weights first to minimize the impact of sequence distance approximation errors.
ALFATClust inherits the limitations of Mash distance, particularly distance underestimation for partially overlapping sequences such as the plasmid sequence cases discussed in the “Results” section. Therefore, when the input sequences consist of genome sequence fragments rather than complete gene sequences, users are advised to identify clusters with particularly low sequence identities through the evaluation report to detect potential partial overlaps. Moreover, compared to genomescale sequences, gene sequences are more sensitive to the specified kmer size for accurate distance calculation, and this value is shown to be varying between different datasets. It should also be noted that ALFATClust is intended to be used to partition multiple groups of homologous sequences, it is therefore not suitable for tasks such as OTU (operational taxonomic unit) clustering, in which the sequence identity threshold required is strictly 97%.
Conclusions
Our benchmark demonstrates numerous advantages of ALFATClust over typical thresholdbased sequence clustering approaches, including better clustering results for a nonoptimal soft cutoff threshold, generally large cluster separation, and scalability with respect to number of sequences and sequence length. It also facilitates cluster quality inspection by providing cluster evaluation. It is suitable for clustering multiple groups of homologous sequences in which the sequence similarity cutoff threshold is often unknown and hard to determine.
Availability of data and materials
ALFATClust is available at https://github.com/phglab/ALFATClust. AMR sequence and plasmid datasets are also available in this package.
References
Murtagh F, Contreras P. Algorithms for hierarchical clustering: an overview. WIREs Data Min Knowl Discov. 2012;2(1):86–97.
National Center for Biotechnology Information (NCBI): Documentation of the BLASTCLUSTalgorithm. ftp://ftp.ncbi.nih.gov/blast/documents/blastclust.html.
Enright AJ, Ouzounis CA. GeneRAGE: a robust algorithm for sequence clustering and domain detection. Bioinformatics. 2000;16(5):451–7.
Loewenstein Y, Portugaly E, Fromer M, Linial M. Efficient algorithms for accurate hierarchical clustering of huge datasets: tackling the entire protein space. Bioinformatics. 2008;24(13):i41–9.
Uchiyama I. Hierarchical clustering algorithm for comprehensive orthologousdomain classification in multiple genomes. Nucleic Acids Res. 2006;34(2):647–58.
Lloyd S. Least squares quantization in PCM. IEEE Trans Inf Theory. 1982;28(2):129–37.
MacQueen J. Some methods for classification and analysis of multivariate observations. In: Proceedings of the fifth Berkeley symposium on mathematical statistics and probability, vol 1: statistics: 1967 1967; Berkeley, Calif.: University of California Press. pp. 281–297.
Ashlock D, Warner E. Classifying synthetic and biological DNA sequences with side effect machines. In: 2008 IEEE symposium on computational intelligence in bioinformatics and computational biology: 1517 Sept. 2008 2008. pp. 22–29.
Kelarev A, Kang B, Steane D. Clustering algorithms for ITS sequence data with alignment metrics. Lect Notes Comput Sci. 2006;4304:1027–31.
Boratyn GM, Camacho C, Cooper PS, Coulouris G, Fong A, Ma N, Madden TL, Matten WT, McGinnis SD, Merezhuk Y, et al. BLAST: a more efficient report with usability improvements. Nucleic Acids Res. 2013;41(W1):W29–33.
Liu Y, Li Z, Xiong H, Gao X, Wu J, Wu S. Understanding and enhancement of internal clustering validation measures. IEEE Trans Cybern. 2013;43(3):982–94.
Rousseeuw PJ. Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J Comput Appl Math. 1987;20:53–65.
Wei D, Jiang Q, Wei Y, Wang S. A novel hierarchical clustering algorithm for gene sequences. BMC Bioinform. 2012;13(1):174.
Fu L, Niu B, Zhu Z, Wu S, Li W. CDHIT: accelerated for clustering the nextgeneration sequencing data. Bioinformatics. 2012;28(23):3150–2.
Li W, Godzik A. Cdhit: a fast program for clustering and comparing large sets of protein or nucleotide sequences. Bioinformatics. 2006;22(13):1658–9.
Edgar RC. Search and clustering orders of magnitude faster than BLAST. Bioinformatics. 2010;26(19):2460–1.
Ghodsi M, Liu B, Pop M. DNACLUST: accurate and efficient clustering of phylogenetic marker genes. BMC Bioinform. 2011;12(1):271.
Rognes T, Flouri T, Nichols B, Quince C, Mahé F. VSEARCH: a versatile open source tool for metagenomics. PeerJ. 2016;4:e2584.
Li W, Jaroszewski L, Godzik A. Clustering of highly homologous sequences to reduce the size of large protein databases. Bioinformatics. 2001;17(3):282–3.
Li W, Jaroszewski L, Godzik A. Sequence clustering strategies improve remote homology recognitions while reducing search times. Protein Eng Des Sel. 2002;15(8):643–9.
James BT, Luczak BB, Girgis HZ. MeShClust: an intelligent tool for clustering DNA sequences. Nucleic Acids Res. 2018;46(14):e83–e83.
Cheng Y. Mean shift, mode seeking, and clustering. IEEE Trans Pattern Anal Mach Intell. 1995;17(8):790–9.
Steinegger M, Söding J. MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nat Biotechnol. 2017;35:1026.
Steinegger M, Söding J. Clustering huge protein sequence sets in linear time. Nat Commun. 2018;9(1):2542.
Ondov BD, Treangen TJ, Melsted P, Mallonee AB, Bergman NH, Koren S, Phillippy AM. Mash: fast genome and metagenome distance estimation using MinHash. Genome Biol. 2016;17(1):132.
Baker DN, Langmead B. Dashing: fast and accurate genomic distances with HyperLogLog. Genome Biol. 2019;20(1):265.
Traag VA, Waltman L, van Eck NJ. From Louvain to Leiden: guaranteeing wellconnected communities. Sci Rep. 2019;9(1):5233.
Girvan M, Newman MEJ. Community structure in social and biological networks. Proc Natl Acad Sci. 2002;99(12):7821–6.
Traag VA, Van Dooren P, Nesterov Y. Narrow scope for resolutionlimitfree community detection. Phys Rev E. 2011;84(1):016114.
Blondel VD, Guillaume JL, Lambiotte R, Lefebvre E. Fast unfolding of communities in large networks. J Stat Mech Theory Exp. 2008;2008(10):P10008.
Jones I, Wang R, Han J, Liu H. Community cores: removing size bias from community detection. In: Proceedings of the international AAAI conference on web and social media 2016, 10(1).
Alcock BP, Raphenya AR, Lau TTY, Tsang KK, Bouchard M, Edalatmand A, Huynh W, Nguyen ALV, Cheng AA, Liu S, et al. CARD 2020: antibiotic resistome surveillance with the comprehensive antibiotic resistance database. Nucleic Acids Res. 2019;48(1):517–25.
Zankari E, Hasman H, Cosentino S, Vestergaard M, Rasmussen S, Lund O, Aarestrup FM, Larsen MV. Identification of acquired antimicrobial resistance genes. J Antimicrob Chemother. 2012;67(11):2640–4.
Gupta SK, Padmanabhan BR, Diene SM, LopezRojas R, Kempf M, Landraud L, Rolain JM. ARGANNOT, a new bioinformatic tool to discover antibiotic resistance genes in bacterial genomes. Antimicrob Agents Chemother. 2014;58(1):212–20.
Chiu JKH. Ong RTH: ARGDIT: a validation and integration toolkit for antimicrobial resistance gene databases. Bioinformatics. 2019;35(14):2466–74.
Galata V, Fehlmann T, Backes C, Keller A. PLSDB: a resource of complete bacterial plasmids. Nucleic Acids Res. 2018;47(D1):D195–202.
Stano M, Beke G, Klucar L. viruSITE—integrated database for viral genomics. Database 2016; 2016.
James BT, Girgis HZ: MeShClust2: Application of alignmentfree identity scores in clustering long DNA sequences. bioRxiv 2018:451278.
Pedregosa F, Varoquaux G, Gramfort A, Michel V, Thirion B, Grisel O, Blondel M, Prettenhofer P, Weiss R, Dubourg V, et al. Scikitlearn: machine learning in python. J Mach Learn Res. 2011;12:2825–30.
Hunter JD. Matplotlib: a 2D graphics environment. Comput Sci Eng. 2007;9(3):90–5.
Vinh NX, Epps J, Bailey J. Information theoretic measures for clusterings comparison: variants, properties, normalization and correction for chance. J Mach Learn Res. 2010;11:2837–54.
Manning CD, Raghavan P, Schütze H. Introduction to information retrieval. Cambridge: Cambridge University Press; 2008.
Marchet C, Lecompte L, Silva CD, Cruaud C, Aury JM, Nicolas J, Peterlongo P. De novo clustering of long reads by gene from transcriptomics data. Nucleic Acids Res. 2018;47(1):e2–e2.
Acknowledgements
Not applicable.
Funding
This research is supported by the Saw Swee Hock School of Public Health, National University of Singapore under its Infectious Disease Programme Research (SSHSPH IDPRG/PILOTGRANT/2018/04).
Author information
Authors and Affiliations
Contributions
JKHC and RTHO conceived the method and wrote the manuscript. JKHC implemented the software and performed the experiments. Both authors read and approved the final manuscript.
Corresponding author
Ethics declarations
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Competing interests
The authors declare that they have no competing interests.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Supplementary Information
Additional file 1
. Supplementary method and benchmark details, figures, and tables.
Additional file 2
. ALFATClust cluster evaluation report for the AMR gene sequence dataset (γ_{low} = 0.7).
Additional file 3
. ALFATClust cluster evaluation report for the AMR gene sequence dataset (γ_{low} = 0.75).
Additional file 4
. ALFATClust cluster evaluation report for the AMR gene sequence dataset (γ_{low} = 0.8).
Additional file 5
. ALFATClust cluster evaluation report for the AMR protein sequence dataset (γ_{low} = 0.7).
Additional file 6
. ALFATClust cluster evaluation report for the AMR protein sequence dataset (γ_{low} = 0.75).
Additional file 7
. ALFATClust cluster evaluation report for the AMR protein sequence dataset (γ_{low} = 0.8).
Additional file 8
. Gene classification for the AMR gene sequence dataset. Used as the background truth for the NMI and purity evaluation.
Additional file 9
. ALFATClust cluster evaluation report for the plasmid nucleotide sequence dataset (γ_{low} = 0.7).
Additional file 10
. ALFATClust cluster evaluation report for the plasmid nucleotide sequence dataset (γ_{low} = 0.75).
Additional file 11
. ALFATClust cluster evaluation report for the plasmid nucleotide sequence dataset (γ_{low} = 0.8).
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article's Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article's Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Chiu, J.K.H., Ong, R.TH. Clustering biological sequences with dynamic sequence similarity threshold. BMC Bioinformatics 23, 108 (2022). https://doi.org/10.1186/s12859022046439
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s12859022046439