Ultra-fast sequence clustering from similarity networks with SiLiX
© Miele et al; licensee BioMed Central Ltd. 2011
Received: 28 October 2010
Accepted: 22 April 2011
Published: 22 April 2011
The number of gene sequences that are available for comparative genomics approaches is increasing extremely quickly. A current challenge is to be able to handle this huge amount of sequences in order to build families of homologous sequences in a reasonable time.
We present the software package SiLiX, which implements a novel method that reconsiders single linkage clustering with a graph theoretical approach. A parallel version of the algorithms is also presented. As a demonstration of the ability of our software, we clustered more than 3 million sequences from about 2 billion BLAST hits in 7 minutes, with a high clustering quality, both in terms of sensitivity and specificity.
Compared with state-of-the-art software, SiLiX currently offers the best capabilities for clustering large collections of sequences. SiLiX is freely available at http://lbbe.univ-lyon1.fr/SiLiX.
Proteins can be naturally classified into families of homologous sequences that derive from a common ancestor. The comparison of homologous sequences and the analysis of their phylogenetic relationships provide very useful information regarding the structure, function and evolution of genes. Thanks to the progress of sequencing projects, this comparative approach can now be applied at the whole genome scale in many different taxa, and several databases have been developed to provide simple access to collections of multiple sequence alignments and phylogenetic trees [1–9]. The building of such phylogenomic databases involves three steps that require substantial computing resources: 1) compare all proteins to each other to detect sequence similarities, 2) cluster homologous sequences into families (which we will call the clustering step) and 3) compute multiple sequence alignments and phylogenetic trees for each family. With the recent progress of sequencing technologies, there is an urgent need to prepare for the deluge and hence to develop methods able to deal with a huge quantity of sequences. In this paper, we present a new approach for the clustering of homologous sequences, based on single transitive links (single linkage) with alignment coverage constraints and implemented in a software package (called SiLiX for SIngle LInkage Clustering of Sequences). We model the dataset as a similarity network where sequences are vertices and similarities are edges. To overcome memory limitations we follow an online framework in which we visit the edges one at a time to update the families dynamically. This approach also enables an incremental procedure where sequences and similarities are added to the dataset, so that it is not necessary to rebuild the families from scratch. Finally, we adopt a divide-and-conquer strategy to deal with the quantity of data and design a parallel algorithm whose theoretical complexity is addressed in this paper.
We evaluated the computational performance and scalability of this method on a very large dataset of more than 3 million sequences from the HOGENOM phylogenomic database. Our approach presents several advantages over other clustering algorithms: it is extremely fast, it requires only limited memory and it can be run on a parallel architecture, which is essential for ensuring its scalability to large datasets. SiLiX outperforms other existing software programs both in terms of speed and memory requirements. Moreover, it achieves a satisfactory quality of clustering. We discuss the value of SiLiX for the clustering of homologous sequences in huge datasets, possibly in combination with other clustering methods.
The principle of single-linkage clustering is that if sequence A is considered homologous to sequence B, and B homologous to C, then A, B and C are grouped into the same family, whatever the level of similarity between A and C. The choice of the sequence similarity criteria that are used to infer homology is therefore an essential parameter of the single-linkage clustering approach. Different criteria can be used, separately or in combination (percentage of identity, alignment score or E-value, alignment coverage, i.e. the percentage of the length of the sequence that is effectively aligned). If a pair of sequences (A, B) does not satisfy the criteria, the pair is not considered for the clustering. The choice of these criteria depends on the goal of the clustering.
Here we consider the second step: given a list of pairs of similar sequences that passed the filtering step, group the sequences into families. We define an undirected graph G = (V, E) with the set of vertices V representing sequences and the set of edges E representing similarities between these sequences. We define n = |V| and m = |E|. Naturally, finding sequence families consists in computing the connected components of G. In this paper, we want to address the case of large n and m and we therefore develop a parsimonious approach in terms of memory use. We want to examine the edges online [11, 16] and avoid storing them in a connectivity matrix. Therefore the classical depth-first search algorithm is not suitable. By analogy with external-memory graph algorithms, our approach consists in dynamically reducing the connected components into trees. When an edge is examined, we need to execute two operations: find the tree containing each of the two vertices, and union these trees by merging their vertices into a new tree. Consequently, the connected component problem consists in (1) iteratively building a collection of trees representing the connected components of the graph G and (2) transforming each resulting tree into a star tree whose root is the representative (or leader) of the family. The final formulation of the problem is therefore building a spanning star forest G* = (V, E*).
The connected components of G actually form a partition of V into non-overlapping subsets of vertices that we call disjoint-sets. Initially each vertex is a set by itself. We need to store the information of the partition and be able to update it dynamically. For this purpose, we use the disjoint-sets data structure [19, 20], which is well suited when the graph is discovered edge by edge. This structure allows an efficient implementation of the find and union operations by representing each set as a tree. In practice, the forest composed of all the trees is implemented as an array parent of size n. Each element i of a tree has a parent parent(i), such that parent(r) = r if r is the root of the tree. Moreover, it is straightforward and practical to transform each tree into a star tree, so that the parent information is a common label for the vertices in a connected component. Each sequence family can then be retrieved directly by reading the parent information.
Algorithm 1 ADDEDGE(i, j) by UNION-FIND
Function: FIND(i): returns the root of the tree containing i
Function: PATHCOMPRESSION(i, r): the parents of the vertices on the path from i to the root of the tree containing i are set to r
r_1 ← FIND(i); r_2 ← FIND(j)
k ← arg max_{l = 1, 2} rank(r_l)
if rank(r_1) == rank(r_2) and r_1 ≠ r_2 then
    rank(r_k)++
end if
PATHCOMPRESSION(i, r_k)
PATHCOMPRESSION(j, r_k)
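The ADDEDGE procedure can be illustrated with a minimal Python sketch. This is not the SiLiX source code (which is written in C++); the class name `Forest` and its method names are hypothetical, vertices are assumed to be integers 0..n-1, and rank ties are arbitrarily resolved in favour of the first root.

```python
class Forest:
    """Disjoint-set forest over vertices 0..n-1 (hypothetical helper class)."""

    def __init__(self, n):
        self.parent = list(range(n))  # parent[r] == r marks a root
        self.rank = [0] * n

    def find(self, i):
        # Walk up to the root without modifying the tree.
        while self.parent[i] != i:
            i = self.parent[i]
        return i

    def path_compression(self, i, r):
        # Re-attach every vertex on the path from i to its root to r.
        while self.parent[i] != i:
            nxt = self.parent[i]
            self.parent[i] = r
            i = nxt
        self.parent[i] = r  # the old root itself now points to r

    def add_edge(self, i, j):
        r1, r2 = self.find(i), self.find(j)
        # Keep the root of highest rank (ties go to r1: an arbitrary choice).
        rk = r1 if self.rank[r1] >= self.rank[r2] else r2
        if self.rank[r1] == self.rank[r2] and r1 != r2:
            self.rank[rk] += 1
        self.path_compression(i, rk)
        self.path_compression(j, rk)

    def families(self):
        # Star forest view: each vertex is labelled by the root of its tree.
        return [self.find(i) for i in range(len(self.parent))]


# Edges are visited one at a time, as in the online framework.
forest = Forest(5)
for a, b in [(0, 1), (1, 2), (3, 4)]:
    forest.add_edge(a, b)
labels = forest.families()  # vertices 0, 1, 2 share a label; so do 3 and 4
```

Only the `parent` and `rank` arrays are kept in memory; the edges themselves are never stored.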
We take advantage of the possibility of exploring series of sets of sequence similarities with a client-server parallel architecture. We assume that it is usually affordable to split a large set of similarities into q sets. For the sake of clarity, we consider here a group of q processors, which is a reasonable hypothesis in practice. We note that it is also advisable to have sets of comparable sizes. We adopt a divide-and-conquer strategy where different processors use the previous sequential algorithm to independently obtain a collection of q spanning star forests, one for each set of similarities. These subsolutions are successively merged to obtain the final solution G*. We first design an algorithm to merge two of these forests in O(n) time (see Algorithm 2). It is also based on the disjoint-sets data structure since, for each vertex i, it basically consists in adding in one forest a formal edge between i and the root of the tree containing i in the other forest. Then we build a parallel formulation of our approach [21, 22] where the q forests are obtained with step (1) of the sequential algorithm and iteratively merged (see Algorithm 3). The parallel time complexity can be estimated as O(m/q + nq). We notice that the merge procedure is many orders of magnitude faster than the processing of a single set of similarities. For this reason, we decide not to distribute the merge procedures over the processors; they are consequently performed by the server processor, in the order in which the forests become available.
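As a rough reading of the O(m/q + nq) estimate, one can write the two terms with explicit constants. The constants a (cost per processed similarity) and b (cost per vertex in one merge) are assumptions introduced here for illustration; they are not given in the text.

```latex
T(q) \approx a\,\frac{m}{q} + b\,n\,q,
\qquad
\frac{dT}{dq} = -a\,\frac{m}{q^{2}} + b\,n = 0
\;\Longrightarrow\;
q^{\ast} = \sqrt{\frac{a\,m}{b\,n}} .
```

Under this simple model, the estimated time stops decreasing beyond q* processors, because the serial merges on the server then dominate. A small b relative to a, as suggested by the speed of the merge procedure, pushes q* upwards.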
Algorithm 2 MERGE (a first forest is merged into a second forest)
Function: FIND(i): returns the root of the tree containing i
for all i such that FIND(i) ≠ i in the first forest do
    r ← FIND(i) in the first forest
    ADDEDGE(r, i) in the second forest
end for
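The merge of two forests can be sketched in Python as follows. This is an illustration under simplifying assumptions, not the SiLiX implementation: each forest is a plain parent array over the same vertices 0..n-1, and the hypothetical helpers `find` and `add_edge` use path halving without ranks instead of the rank-based ADDEDGE.

```python
def find(parent, i):
    # Root lookup with path halving (simplification: no rank array).
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i


def add_edge(parent, i, j):
    ri, rj = find(parent, i), find(parent, j)
    if ri != rj:
        parent[rj] = ri


def merge(src, dst):
    """Fold forest `src` into forest `dst`; both are parent arrays of size n."""
    for i in range(len(src)):
        r = find(src, i)
        if r != i:               # i is not a root in src
            add_edge(dst, r, i)  # formal edge (root of i in src, i) added to dst
    return dst


def star(parent):
    # Label every vertex with the root of its tree.
    return [find(parent, i) for i in range(len(parent))]


# forest_a was built from edges {0-1, 2-3}, forest_b from edges {1-2, 4-5}.
forest_a = [0, 0, 2, 2, 4, 5]
forest_b = [0, 1, 1, 3, 4, 4]
merged = star(merge(forest_a, forest_b))  # families {0, 1, 2, 3} and {4, 5}
```

The loop visits each vertex once, so the number of find and union operations is linear in n, whatever the number of similarities that produced each forest.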
Algorithm 3 Parallel SiLiX
each processor r builds its own spanning star forest with the sequential algorithm
if r > 1 (client) then
    MPI_SEND the forest of processor r to server processor 1
else
    for k in 2,...,q do
        MPI_RECEIVE one forest among those of the clients, in their order of availability
        MERGE the received forest into the forest of processor 1
    end for
    for all i in the forest of processor 1 do
        PATHCOMPRESSION(i, FIND(i))
    end for
end if
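The control flow of the parallel algorithm can be mimicked on a single machine with a thread pool. This sketch is only an illustration: it uses Python threads instead of MPI (so it shows the order of operations, not a real speed-up), the server starts from an empty forest instead of building its own, and the function names `build_forest`, `merge` and `cluster` are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def find(parent, i):
    # Root lookup with path halving (simplification: no rank array).
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i


def build_forest(n, edges):
    # Sequential step (1): one forest per subset of similarities.
    parent = list(range(n))
    for i, j in edges:
        ri, rj = find(parent, i), find(parent, j)
        if ri != rj:
            parent[rj] = ri
    return parent


def merge(src, dst):
    # Add to dst a formal edge between each non-root vertex of src and its root.
    for i in range(len(src)):
        r = find(src, i)
        if r != i:
            ri, rj = find(dst, r), find(dst, i)
            if ri != rj:
                dst[rj] = ri


def cluster(n, edges, q):
    chunks = [edges[k::q] for k in range(q)]  # q subsets of comparable size
    result = list(range(n))                   # forest held by the "server"
    with ThreadPoolExecutor(max_workers=q) as pool:
        futures = [pool.submit(build_forest, n, chunk) for chunk in chunks]
        for done in as_completed(futures):    # merge in order of availability
            merge(done.result(), result)
    return [find(result, i) for i in range(n)]  # final star forest labels


edges = [(0, 1), (1, 2), (3, 4), (5, 6), (2, 3)]
labels = cluster(8, edges, q=3)  # families {0, 1, 2, 3, 4}, {5, 6} and {7}
```

The final partition does not depend on the order in which the forests become available, although the vertex chosen as family representative may.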
Because genome sequences are often not 100% complete and hence some genes may overlap with gaps in the genome assembly, it is important to be able to treat some partial protein sequences (as opposed to complete sequences). These partial sequences cannot be classified using the same criteria as the complete ones and are therefore treated separately. In a first step, gene families are built using only complete protein sequences as explained previously. In a second step, partial sequences are added to this classification, using different alignment length thresholds (for details about parameters, see ). It is important to note that, if there are several families that meet these alignment coverage criteria, a partial sequence is included in the one with which it shows the strongest similarity score.
To allow the treatment of partial sequences, we propose a modified version of our approach. We redefine the previously mentioned graph as G = (V_c, E_c) and we define the undirected graph H = G ∪ (V_p, E_p) with two sets of vertices V_c and V_p, the complete and partial sequences respectively, and the set of edges E_p between complete and partial sequences, each edge in E_p being weighted by the similarity score. We also impose that edges between partial sequences are not allowed. In this case, n_c = |V_c|, n_p = |V_p|, n = n_c + n_p, m_c = |E_c|, m_p = |E_p| and m = m_c + m_p. At this point, we note that sequence families correspond to the connected components of a subgraph of H obtained by keeping only the edge of maximum weight for each vertex in V_p: this guarantees that each partial sequence is connected to only one complete sequence and prevents partial sequences from linking two connected components. As described previously, the problem consists in building a novel graph H* that has the following properties:
H* is a spanning star forest,
H* is a so-called semi-bipartite graph, i.e. a graph that can be partitioned into two exclusive and comprehensive parts (V_c and V_p) with internal edges (connecting vertices of the same part) only existing within one of the two parts. The particularity here is that edges between the two parts are weighted,
∀v ∈ V_p, deg(v) = 1.
First, it is necessary to insert an additional step between the two steps of the above-mentioned online procedure: build a subset of E_p by selecting, for each vertex of V_p, the edge of maximal weight, in O(m_p) time. Then we extend step (2) to all the vertices in V_p, for a time complexity in O(n). This procedure runs in O(n) space since it requires the storage of n parent values. For the parallelized algorithm, we modify the merging of two forests presented in Algorithm 2 to consider vertices of V_p and once again select edges of maximal weight, such that the overall parallel complexity can be estimated to be in O(m/q + nq).
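The treatment of partial sequences described above can be sketched as follows. The function name `cluster_with_partials`, the vertex numbering (complete sequences first, then partial ones) and the tie-breaking rule (first edge seen wins) are assumptions made for this illustration only.

```python
def find(parent, i):
    # Root lookup with path halving (simplification: no rank array).
    while parent[i] != i:
        parent[i] = parent[parent[i]]
        i = parent[i]
    return i


def cluster_with_partials(n_complete, n_partial, complete_edges, partial_edges):
    """Vertices 0..n_complete-1 are complete sequences, the others are partial.

    partial_edges holds (partial, complete, score) triples.
    """
    n = n_complete + n_partial
    parent = list(range(n))

    # Step (1): build the families of complete sequences only.
    for i, j in complete_edges:
        ri, rj = find(parent, i), find(parent, j)
        if ri != rj:
            parent[rj] = ri

    # Inserted step: keep the edge of maximal weight for each partial sequence.
    best = {}
    for p, c, score in partial_edges:
        if p not in best or score > best[p][1]:
            best[p] = (c, score)

    # Step (2), extended: star labels, with each partial attached as a leaf.
    labels = [find(parent, i) for i in range(n)]
    for p, (c, _) in best.items():
        labels[p] = labels[c]
    return labels


# Complete sequences 0..3 form families {0, 1} and {2, 3}.
# Partial sequence 4 matches both families; partial sequence 5 matches none.
labels = cluster_with_partials(
    4, 2,
    complete_edges=[(0, 1), (2, 3)],
    partial_edges=[(4, 1, 50.0), (4, 2, 80.0)],
)
```

In this example the partial sequence 4 joins the family of sequence 2 (its best hit) and does not cause the two families of complete sequences to fuse.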
All the presented algorithms are implemented in the SiLiX software package, which is written in ANSI C++ and uses MPI (Message Passing Interface) and elements of the well-established Boost library (http://www.boost.org). SiLiX can take two kinds of input. First, the user can provide the result file of an all-against-all BLAST search (genomic or protein sequences) in tabular format (option -outfmt 6 in BLAST). In that case, SiLiX performs the filtering step by analyzing BLAST hits to search for pairs of sequences that fulfill the similarity criteria (alignment coverage, sequence identity) set by the user to build families. In this mode, partial sequences can be treated separately, as described above. Second, if the user prefers to use other types of criteria for the filtering, SiLiX can simply take as input a list of pairs of sequence IDs and perform the clustering step. Compilation and installation are compliant with the GNU standard procedure. The package is freely available on the SiLiX webpage (http://lbbe.univ-lyon1.fr/SiLiX). Online documentation and man pages are also available. SiLiX is licensed under the General Public License (http://www.gnu.org/licenses/licenses.html).
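To give an idea of what the filtering step involves, the following sketch reads BLAST tabular lines and keeps the pairs that pass identity and coverage thresholds. It assumes the default column order of the BLAST+ tabular format (query, subject, percent identity, alignment length, mismatches, gap openings, query start, query end, subject start, subject end, E-value, bit score). The `lengths` dictionary, the coverage formula and the threshold values are illustrative assumptions and do not reproduce the exact SiLiX rules or defaults.

```python
def parse_hits(lines):
    """Yield (query, subject, identity %, qstart, qend) from tabular BLAST lines."""
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comments
        f = line.rstrip("\n").split("\t")
        yield f[0], f[1], float(f[2]), int(f[6]), int(f[7])


def filter_pairs(lines, lengths, min_identity=35.0, min_coverage=0.8):
    """Return the sorted pairs of sequence IDs that satisfy both criteria.

    lengths maps each sequence ID to its length (hypothetical input).
    Coverage is taken here as the aligned query span divided by the length
    of the longest of the two sequences (an assumption for this sketch).
    """
    pairs = set()
    for q, s, pident, qstart, qend in parse_hits(lines):
        if q == s:
            continue  # ignore self hits
        coverage = (qend - qstart + 1) / max(lengths[q], lengths[s])
        if pident >= min_identity and coverage >= min_coverage:
            pairs.add((min(q, s), max(q, s)))  # undirected edge
    return sorted(pairs)


hits = [
    "A\tB\t62.5\t96\t36\t0\t1\t96\t3\t98\t1e-30\t120",
    "A\tC\t90.0\t30\t3\t0\t10\t39\t1\t30\t1e-08\t55",
    "B\tB\t100.0\t110\t0\t0\t1\t110\t1\t110\t0.0\t230",
]
lengths = {"A": 100, "B": 110, "C": 200}
pairs = filter_pairs(hits, lengths)  # only the A-B hit passes both criteria
```

The resulting list of ID pairs corresponds to the second kind of input mentioned above, that is, a list of similar pairs ready for the clustering step.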
To test SiLiX and compare it to state-of-the-art programs, we extracted protein sequences from the HOGENOM database (Release 5). The current release of HOGENOM contains 3,666,568 protein sequences (76% bacteria, 3% archaea and 20% eukarya). We selected 3,159,593 non-redundant sequences including about 1% partial sequences. Sequences were compared against each other with BLASTP with an E-value threshold set to 10^-4. The BLAST output file contained 1,905,335,339 pairwise alignments. Then we selected three previously published programs for which the source code is publicly available: hcluster_sg and MC-UPGMA, which are based on hierarchical clustering, and MCL, which relies on graph-based heuristics.
CPU time and memory requirements for SiLiX and three state-of-the-art programs on the dataset of similarity pairs extracted from the HOGENOM database.
Although speed and memory requirements are important parameters for the choice of a clustering method, the most important criterion is of course the quality of the results. Single linkage clustering is known to be problematic because spurious similarities can lead to the clustering of non-homologous sequences. Even with stringent sequence similarity criteria, single linkage clustering can lead to erroneous clustering, because of the so-called problem of "domain chaining", as illustrated in Figure 1. To avoid this problem, SiLiX performs single linkage clustering with alignment coverage constraints, i.e. pairs of similar sequences are considered for the clustering only if they meet two criteria: i) the alignment should cover at least a given percentage of the longest sequence; ii) sequence similarity within the alignment should exceed a given threshold. To assess the quality of SiLiX clustering, we used two different strategies. First, we compared clustering results to the classification of protein families reported in the InterPro database. Second, we assessed the performance of SiLiX on a set of 13 families of orthologous genes encoded by mitochondrial genomes in 1821 metazoan species.
Comparison of clustering performances of SiLiX and hcluster_sg (used alone or in combination with SiLiX). Row label recovered from the table: SiLiX (0.25) + hcluster_sg (100).
Evaluation of SiLiX performances on mitochondrial genes of metazoan taxa. Column headings: Nb. Seq. 1st fam. (%); Nb. Seq. 2nd fam. (%); Nb. Seq. 3rd fam. (%); Nb. Singletons (%).
Different methods have been proposed for the clustering of proteins into families of homologous sequences [1, 8, 9, 24–26, 32]. These methods differ both in terms of the quality of the clustering and in terms of the computing resources required to perform the clustering. The single-linkage clustering approach is used in different phylogenomic databases such as EnsemblCompara or HOGENOM. Here we propose a new implementation of the single linkage clustering method with alignment coverage constraints, SiLiX, which is extremely efficient, both in terms of computing time and memory requirements. Moreover, this method can be cost-effectively run on parallel architectures, and hence is easily scalable. Thus, in terms of computing resource requirements, this method is much more efficient than other available methods for the treatment of huge sequence datasets. In terms of clustering quality, SiLiX performs as well as hcluster_sg, the only other available clustering program that could be run in reasonable time on such a large sequence dataset. Given its speed, SiLiX may also be used efficiently as a first clustering step, before running other algorithms.
Project name: SiLiX
Project home page: http://lbbe.univ-lyon1.fr/SiLiX
Operating system(s): All Unix-like operating systems such as Linux and Mac OS X.
Programming language: C++
Other requirements: MPI, the Boost program_options library, and optionally CppUnit and the Boost unordered_map class.
License: GNU GPL.
The authors would like to thank Bastien Boussau, Daniel Kahn, Vincent Lacroix, Marie-France Sagot, Franck Picard and Eric Tannier for helpful discussions and comments, Bruno Spataro for the computing facilities and Yanniv Loewenstein, Jan Baumbach and Antje Krause for their answers about the availability and use of their programs. This work has been supported by the French Agence Nationale de la Recherche under grant NeMo ANR-08-BLAN-0304-01.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.