kClust: fast and sensitive clustering of large protein sequence databases
© Hauser et al.; licensee BioMed Central Ltd. 2013
Received: 24 April 2013
Accepted: 12 August 2013
Published: 15 August 2013
Skip to main content
© Hauser et al.; licensee BioMed Central Ltd. 2013
Received: 24 April 2013
Accepted: 12 August 2013
Published: 15 August 2013
Fueled by rapid progress in high-throughput sequencing, the size of public sequence databases doubles every two years. Searching the ever larger and more redundant databases is getting increasingly inefficient. Clustering can help to organize sequences into homologous and functionally similar groups and can improve the speed, sensitivity, and readability of homology searches. However, because the clustering time is quadratic in the number of sequences, standard sequence search methods are becoming impracticable.
Here we present a method to cluster large protein sequence databases such as UniProt within days down to 20%–30% maximum pairwise sequence identity. kClust owes its speed and sensitivity to an alignment-free prefilter that calculates the cumulative score of all similar 6-mers between pairs of sequences, and to a dynamic programming algorithm that operates on pairs of similar 4-mers. To increase sensitivity further, kClust can run in profile-sequence comparison mode, with profiles computed from the clusters of a previous kClust iteration. kClust is two to three orders of magnitude faster than clustering based on NCBI BLAST, and on multidomain sequences of 20%–30% maximum pairwise sequence identity it achieves comparable sensitivity and a lower false discovery rate. It also compares favorably to CD-HIT and UCLUST in terms of false discovery rate, sensitivity, and speed.
kClust fills the need for a fast, sensitive, and accurate tool to cluster large protein sequence databases to below 30% sequence identity. kClust is freely available under GPL at http://toolkit.lmb.uni-muenchen.de/pub/kClust/.
In recent years, the amount of sequence data has been growing at an accelerating pace. While one would expect the denser sampling of sequence space to lead to better performance of sequence searches, the opposite seems to be true: The increase has led to stagnating or even negative returns, as measured by the ability to detect homologous sequences for structure or function predictions . Removing redundant sequences through clustering can partly alleviate this problem:  and  showed that sequence searches through clustered databases, which contain one representative sequence per cluster, could improve the sensitivity of homology search methods. Furthermore, clustering reduces search times and can greatly improve the readability of search results . Clustering is also often used in the analysis of metagenomics experiments, where potentially orthologous sequences are clustered together in order to define the inventory of biochemical reactions that are likely to be present in the metagenomic community [5–7].
An important motivation to develop kClust was our need for a database of profile hidden Markov models (HMMs) that contains all UniProt sequences, clustered down to 20%–30% sequence identity (uniprot20). Such a database is required by HHblits, the most sensitive protein sequence search method to date . HHblits calculates a profile HMM from the query sequence and compares this query HMM with the uniprot20 database of HMMs. In subsequent iterations, sequences belonging to significantly similar database HMMs are added to the query multiple sequence alignment (MSA). Thanks to the additional information from homologous sequences in the MSAs, HMM-HMM comparison is more sensitive and accurate than profile-sequence comparison: Compared to PSI-BLAST , HHblits is faster, up to twice as sensitive, and produces alignments with several percent higher accuracy . For HHblits it is critical that only a very low number of clusters are corrupted by non-homologous sequences, since these can cause high-scoring false-positive matches. The clustering sensitivity is also important because MSAs with higher sequence diversity and higher information content are better at finding remotely related sequences. Also, the lower number of clusters results in shorter search times.
Sequence clustering methods first need to compare the sequences in the input set with each other. Most methods use FASTA or BLAST [10, 11] for this purpose. Some of these methods use the pairwise sequence similarities for hierarchical clustering [12–15]. Others use them to cluster sequences into orthologous or functionally similar groups [16–22]. FASTA and BLAST are two to three orders of magnitude faster than Smith-Waterman search  due to their fast, k-mer-based heuristic filters, yet they would require around 80 years of CPU time to cluster the UniProt database version with ∼35.5 million sequence from scratch. SIMAP  employs a 9-TeraFLOP distributed network of computers to regularly update their database of precomputed FASTA similarity scores for a large set of protein sequences. This database can be tapped to avoid the time-consuming sequence alignments , but it cannot be used for sequences not yet contained in SIMAP or when the clustering should depend on information other than sequence similarity scores.
CD-HIT [25–27] and UCLUST  are sequence clustering methods that, like kClust, do not rely on an external sequence search tool. All three methods aim at clustering large sequence databases much faster than what is possible using BLAST or FASTA. They start with zero clusters and pick one sequence after the other from the database. If the query is sufficiently similar to the representative sequence of a cluster, the query is added to that cluster. Otherwise it becomes the founding member and representative sequence for a new cluster. For the fast comparison of the query to the representative sequences of the clusters, these methods first prefilter the representative sequences using an alignment-free, k-mer-based sequence comparison. Sequences that pass the prefilter are aligned to the query using Smith-Waterman alignment. The clustering finishes when all sequences have been picked and assigned to a cluster.
For prefiltering, CD-HIT and UCLUST simply count the number of identical k-mers between sequences. Because this number drops quickly as the similarity of the compared sequences decreases, CD-HIT uses shorter k-mers for achieving higher sensitivity: (e. g. 5-mers for clustering thresholds down to 70% sequence identity, and 2-mers below 50%). The choice of k follows from the requirement for near-perfect sensitivity, between 95% and 99%. Reducing k comes at a considerable loss of speed, however, since decrementing k by one results in approximately 20 times more chance k-mers matches and a 20-fold longer run time. CD-HIT can therefore cluster large databases such as UniProt only to down to ∼ 50% sequence identity. UCLUST uses 5-mers at all clustering thresholds. This allows it to maintain a high speed even at low thresholds, at the cost of a loss of sensitivity. It ranks the representative sequences by the number of 5-mers they have in common with the query sequence and aligns them in this order until one of them is similar to the query sequence or until the highest-ranked eight clusters have been rejected. An apparent disadvantage of UCLUST is that, in order to increase sensitivity despite its word length of 5, it uses rather loose acceptance criteria. At clustering thresholds below 50%, this leads to a high fraction of corrupted clusters containing non-homologous sequences.
Both CD-HIT and UCLUST use banded dynamic programming to speed up the Smith-Waterman search. In addition, UCLUST extends the alignment around identical 5-mer matches in a way similar to BLAST .
kClust achieves high sensitivity by allowing matches between similar k-mers and ranking sequence pairs by the sum of similarity scores over all similar k-mer pairs.
We benchmark the performance of kClust and other tools to cluster large sequence databases well below 50% sequence identity. Our results show that kClust achieves false discovery rates similar to BLAST at much higher speeds, whereas at these low clustering thresholds CD-HIT and UCLUST manifest severe limitations in sensitivity, false discovery rate, and speed.
kClust, like CD-HIT and UCLUST [26, 28], uses the incremental, greedy clustering strategy of  (Additional file 1: Figure S1). First, all sequences are sorted by length. Starting with the longest one, the next sequence is picked from the database as query and is compared with the representative sequences representing the already created clusters. If the query fulfills kClust’s similarity criteria with one of the representative sequences, the query is added to that cluster, otherwise a new cluster is created for which the query becomes the representative sequence.
A fast, alignment-free prefilter reduces the number of costly computations of sequence alignments. Whereas CD-HIT and UCLUST count exact k-mer matches, kClust calculates the sum of similarity scores over all similar 6-mers. To increase speed, alignments are constructed with a novel algorithm, k-mer dynamic programming (kDP), which finds the local optimal alignment passing through pairs of similar 4-mers (C.E.M. and J.S., to be published). Furthermore, kClust employs spaced 6-mers and 4-mers that reduce the noise caused by the correlation between scores of neighboring k-mer matches .
Three similarity criteria are used to decide if a query is added to a cluster: (1) The sequence similarity score from the 4-mer-based dynamic programming algorithm is larger than a minimum BLOSUM62 score per column (default 1.12 half bits, which corresponds to a sequence identity of ∼30%, see Additional file 1: Figure S2), (2) the alignment achieves an E-value less than a defined threshold (default value 1E-3), and (3) the alignment covers at least 80% of the residues of the representative sequence. This criterion ensures that clusters contain sequences with nearly identical domain composition.
In a second iteration, profile-based kClust can also merge homologous clusters. It computes sequence profiles of the clusters generated in the previous step and uses these for scoring during the prefilter and alignment stages. This further improves the sensitivity without loss of speed.
where S BLOSUM62(x,y) denote the BLOSUM62 substitution matrix scores. The list of similar k-mers is then generated iteratively position by position, by appending each of the 20 amino acids to the current j-mer and ending the branch if the score of the current j-mer is lower than the minimum required, S min,j . Using lists of amino acid dimers speeds up the process further by generating two amino acids instead of one at each step. Each dimer list is pre-sorted according to the score to the query dimer, so the branch can be skipped after the score of the j-mer falls below S min,j . Precomputed intermediate thresholds at each position and presorted dimer lists ensure that only k-mers with the overall score above the similarity threshold are generated. Since no k-mers except those with a score above the threshold are generated, this step has a time complexity of O(r).
where N db is the number of sequences in the database and p match is the probability of a chance match above the similarity threshold between two k-mers. Note that before a new query sequence can be processed, the score array S[ ·] (see Figure 2) holding the prefiltering scores for the database sequences for this query has to be reset to 0. The naive resetting would result in a time complexity of which would be too time-consuming. For each query, we therefore start with an empty list to which we add all sequence indices whose score has been increased above zero, and we reset only the scores in that list.
For the sequences that pass the prefiltering step, pairwise alignments are calculated using a fast heuristic, k-mer dynamic programming . kDP records the similar 4-mer matches in the two sequences and determines the optimal alignment passing through the matched 4-mers. The optimal alignment is the one maximizing the sum of k-mer similarity scores minus gap penalties. In a second step, the full, residue-wise alignment with optimal BLOSUM62 score that passes through the 4-mers on the optimal kDP alignment is determined. Since kDP operates only on the similar 4-mer matches, the run time is proportional to number of matches, p match L 2, where L is the length of the sequences. Hence, the run time can be reduced in principle by a factor of p match. In kClust, we chose the similarity threshold such that the lists of similar 4-mers have an average length of r = 200. Therefore, p match ≈ r / 204 ≈ 0.002 (see Additional file 1).
Multiple sequence alignments (MSAs) are generated for each cluster stemming from the previous iteration of kClust, and for each MSA a profile HMM and a consensus sequence are calculated with the hhmake and hhconsensus binaries of the open-source HH-suite software package . The kClust algorithm proceeds in a similar way to the standard case, but instead of picking query sequences from the database, sequence profiles representing clusters are picked from the previously clustered database and compared to the clusters that have already been created in the present clustering iteration. As representative sequences of the clusters, the consensus sequences are used, which improves the sensitivity further . The BLOSUM62 scores are simply replaced by the position-specific profile scores read from the profile HMM files. In the prefiltering, for instance, the lists and scores of the similar k-mers depend on the local 6-column window of the query profile. If no representative sequences similar to the query profile are found, the query is added as a new cluster to the database.
The binary kClust_mkAln generates MSAs for each sequence cluster, using a user-specified multiple sequence alignment program. In addition, a cluster header with merged and redundancy-filtered names and annotations from the headers of the contained sequences is generated. The merging of UniProt and NCBI headers is supported.
We compared kClust to three methods that are able to cluster large protein databases: CD-HIT, UCLUST, and BLAST-based clustering.
We clustered the datasets with the newest, parallelized version of CD-HIT  on 16 cores down to a sequence identity of 40%, the lowest possible value. We used the -n 2 setting (k-mer word length) recommended for a clustering threshold of 40%. We set the minimum alignment coverage of the longer sequence to 80% with the -aL 0.8 option and the maximum available memory to 6000 MB with the -M 6000 option. We used parallel calculation on 16 cores of the computer by setting -T 0 (i. e. all available cores).
We ran UCLUST with the -id 0.3 and 0.4 options (“UCLUST 30” and “UCLUST 40”). Additionally, we used the -targetfract 0.8 setting for the minimum alignment coverage 0.8 of the longer sequence. We tried the -usersort option that should sort the input database by length, but this led to occasional segmentation faults. We therefore gave UCLUST a length-sorted database as input. As a side note, when not crashing, UCLUST produced worse results with the -usersort option than with a presorted database.
BLAST-based clustering was calculated using pdbfilter.pl from HH-suite (http://toolkit.lmb.uni-muenchen.de/HH-suite/), adapted to our input data. This simple procedure uses the same incremental greedy clustering strategy as CD-HIT, UCLUST, and kClust, but instead of comparing sequences using a prefilter and dynamic programming, parallel BLAST (option -a 8) was employed. The acceptance condition to add a query sequence to an existing cluster was an E-value below 1E-3 and an alignment that covers more than 80% of the representative sequence. No condition on the sequence identity is used, which corresponds to the “max sens” version of kClust.
Four parameter settings were tested that differed in the acceptance criteria for adding sequences to existing clusters. In all four cases, the maximum E-value was 1E-3 and the minimum alignment coverage was 80%, which are the default values. For “kClust 30” and “kClust 20”, kClust was run with a clustering threshold of 30% and 20%, respectively. “kClust sens” was run with maximum sensitivity, i.e., with a clustering threshold at 0%. For “kClust iter”, one profile-based kClust iteration was run after a single iteration with “kClust max sens” settings.
Pfam-A-seeds version 24.0 are the 500 361 sequences that define the manually curated 11 912 domain families of the Pfam data base . Of these, 4011 Pfam families are further grouped into 458 clans – a collection of families that are thought to be evolutionarily related. There are therefore 8359 groups of homologous sequences in the Pfam-A seeds set.
A pair of sequences is considered to be incorrectly grouped together in one cluster if the sequences belong to different Pfam families and to different clans. We call such a cluster corrupted.
Pfam clustering results
#Wrong seqs per
66 h 2 m ∗
28 m/1 h 47 m †
5 h 26 m
Two iterations kClust achieve the highest sensitivity, i.e., the lowest number of clusters. The second iteration of kClust reduces the number of clusters in the first iteration almost by 30% without increasing the number of corrupted clusters. BLAST produces slightly more clusters with a comparable error rate. However, there is a striking difference in the run times – two iterations kClust take 28 min while BLAST clustering takes 66 hours. UCLUST is the fastest method, but it clusters less effectively 1.5 times more clusters than kClust and twice more than BLAST. CD-HIT produces less clusters than UCLUST, but with the highest error rate and is slow.
Most eukaryotic sequences consist of two or more structural domains. Multi-domain sequence pose greater challenges to clustering algorithms because they tend to cluster sequences together if they are sufficiently similar within one domain, even though their other domains may be unrelated.
Currently, most protein sequences are only partly or not annotated with contained domains. To test the ability of the tools to correctly cluster multi-domain sequences, we need a dataset of multi-domain sequences where all the sequences are annotated over their entire length. It should contain a sufficient number of sequences with the same domain composition in order to form enough clusters and also enough sequences with only local similarities, so it would be not trivial. Currently, there is no real dataset that would meet these criteria.
To test the ability of the tools to correctly cluster multi-domain sequences, we generated an artificial dataset of 100 000 two-domain proteins. We used the SCOP25 database  (SCOP database filtered to maximum 25% pairwise similarity), downloaded from the ASTRAL server (http://astral.berkeley.edu/). For each seed sequence from SCOP, we searched for homologous sequence segments using two iterations of PSI-BLAST  in the non-redundant database from NCBI (nr) and filtered the results to 30% minimum sequence identity to the seed sequence from SCOP and to 70% maximum pairwise sequence identity, using the hhfilter binary from the HH-suite package , in order to achieve not too low and not too high sequence similarity within the homologous group. Additionally, the homologous sequences must cover at least 90% of the seed sequence.
Sequences that remained after filtering were merged into groups, in accordance to the SCOP family membership of the seed sequences, one group corresponding to one SCOP family. 1000 groups containing more than 100 sequences were drawn randomly, and from the list of these groups 1000 combinations of two different SCOP families were drawn (two-domain architectures). For each such two-domain architecture, 100 two-domain sequences were generated by concatenating two randomly drawn sequences from the corresponding family groups. Additionally, we rejected a sequence if the longer domain in the artificial sequence covered more than 80% of the overall sequence length.
Two sequences were considered to be grouped correctly in one cluster if they have the same domain architecture. Thus, the dataset could, in principle, be clustered into 1000 clusters of homologous sequences with the same domain architecture.
Multidomain proteins clustering results
#Wrong seq per
3 h 21 m ∗
13 m/39 m †
BLAST is clearly the most sensitive method, generating only 6977 clusters. kClust produces 17 070 in the first iteration with the maximum sensitivity setting, and one additional iteration reduces this number to 8537 without increasing the number of false positives. kClust clustered the dataset in the first iteration in about 9 min, the second iteration takes additional 4 minutes (without the time needed to generate multiple alignments of the clusters and profiles that are necessary for the second iteration). BLAST-based clustering took about 3.5 hours. UCLUST is very fast, needing less than one minute, but is much less sensitive, producing clusters with only 2.5 sequences on average, compared to kClust 20 with 5.8 on average.
It is noteworthy that even BLAST produces significantly more than the 1000 clusters that are theore- tically possible. The reason is that many alignments between sequences within one architecture group do not fulfill the coverage criterion, because the longest sequence is often significantly longer than many of the other members of the group, and because BLAST, like all other local alignment methods, often generates alignments that do not cover the entire homologous region.
At the time of the benchmark calculation (June 2013), UniProt  contained 36 042 779 amino acid sequences with 11 786 916 970 residues and average sequence length of about 330 residues.
Clustering of the UniProt database
5 663 658
13 d 12 h
6 636 076
3 d 5 h
Running time of kClust depends heavily on the desired sequence identity threshold for the clustering. For lower thresholds, in addition to lowering the sequence identity threshold, kClust generates longer similar k-mer lists during the prefiltering and the alignment in order to increase sensitivity. As a consequence, the generation of the lists and index table matching takes longer.
With the rapidly growing numbers of protein sequences from genome sequencing and metagenomics projects, methods that can cluster huge sequence sets in a reasonable time are in great need. Tools that are sensitive enough to cluster sequences together down to ∼30 % sequence identity will be particularly useful, since at that similarity protein domains usually still have the same or very similar molecular functions, in particular if their domain architecture is conserved [21, 36]. Most existing sequence clustering methods rely on BLAST or FASTA for calculating the similarities between the sequences to be clustered, which makes them too slow for many clustering tasks ahead.
CD-HIT is fast and works well at high sequence identity clustering threshold, but it gets impracticably slow and inaccurate below 50% sequence identity. UCLUST is a very fast alternative which has, however, limited sensitivity as demonstrated in our study. kClust is to our knowledge the only method at present that fills the need for the fast clustering of large sequence databases or metagenomics sequence with a sensitivity not far from BLAST. kClust achieves this sensitivity with its prefilter that sums up the scores of similar 6-mers, its fast 4-mer-based alignment algorithm kDP, and its iterative, profile-based clustering strategy.
This is to our knowledge a novel approach to solving the problem of inexact, alignment-free comparison of sequences. A somewhat related approach that is popular in next-Generation sequencing data analysis is the “spaced seed” approach. In that approach, exemplified by PatternHunter  or SEED , one looks for exact matches between the compared sequences at nonconsecutive positions. Such a pattern of non-consecutive positions is called a spaced seed. Searches with several different such spaced seed patterns are then combined. For a given maximum number of mismatches, spaced seed sets can be constructed that guarantee an exact match for at least one spaced seed. This approach is very efficient as long as the number of allowed mismatches is low, say 4 out of 100. Our approach is applicable also in a range of 70 mismatches out of 100, far below what the spaced seed approach could handle due to the exploding number of spaced seeds required.
We are currently working on a faster and more flexible successor software which will be parallelized to run efficiently on multi-core architectures, which offers a more powerful clustering algorithm than the incremental, greedy clustering used in kClust, CD-HIT and UCLUST, and which will also allow to perform very fast parallelized sequence searches of a set of sequences against another set of sequences or sequence profiles. Importantly, it will allow updating of clustered databases with new sequences, thus obviating the need to recluster the entire sequence set from scratch.
M.H. gratefully acknowledges financial support from the Deutsche Telekom-Stiftung. C.E.H. and J.S. thank Andrei Lupas, Max-Planck Institute for Developmental Biology, Tübingen, for financial support and for hosting the first phase of this project. The work was supported by a Gastprofessur grant from the Ludwig-Maximilians-Universität München to Patrick Cramer and J.S., which was financed by the Excellence Initiative of the Bundesministerium für Bildung und Forschung (BMBF).
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.