SCPS: a fast implementation of a spectral method for detecting protein families on a genome-wide scale
© Nepusz et al; licensee BioMed Central Ltd. 2010
Received: 25 November 2009
Accepted: 9 March 2010
Published: 9 March 2010
An important problem in genomics is the automatic inference of groups of homologous proteins from pairwise sequence similarities. Several approaches have been proposed for this task which are "local" in the sense that they assign a protein to a cluster based only on the distances between that protein and the other proteins in the set. It was shown recently that global methods such as spectral clustering have better performance on a wide variety of datasets. However, currently available implementations of spectral clustering methods mostly consist of a few loosely coupled Matlab scripts that assume a fair amount of familiarity with Matlab programming and hence they are inaccessible for large parts of the research community.
SCPS (Spectral Clustering of Protein Sequences) is an efficient and user-friendly implementation of a spectral method for inferring protein families. The method uses only pairwise sequence similarities, and is therefore practical when only sequence information is available. SCPS was tested on difficult sets of proteins whose relationships were extracted from the SCOP database, and its results were extensively compared with those obtained using other popular protein clustering algorithms such as TribeMCL, hierarchical clustering and connected component analysis. We show that SCPS is able to identify many of the family/superfamily relationships correctly and that the quality of the obtained clusters as indicated by their F-scores is consistently better than all the other methods we compared it with. We also demonstrate the scalability of SCPS by clustering the entire SCOP database (14,183 sequences) and the complete genome of the yeast Saccharomyces cerevisiae (6,690 sequences).
Besides the spectral method, SCPS also implements connected component analysis and hierarchical clustering, it integrates TribeMCL, it provides different cluster quality tools, it can extract human-readable protein descriptions using GI numbers from NCBI, it interfaces with external tools such as BLAST and Cytoscape, and it can produce publication-quality graphical representations of the clusters obtained, thus constituting a comprehensive and effective tool for practical research in computational biology. Source code and precompiled executables for Windows, Linux and Mac OS X are freely available at http://www.paccanarolab.org/software/scps.
An important problem in genomics is the automatic inference of groups of homologous proteins when only sequence information is available. Several approaches have been proposed for this task which are "local" in the sense that they assign a protein to a cluster based only on the distances between that protein and the other proteins in the set. In fact, the majority of these methods are based on thresholding a sequence similarity measure (e.g., BLAST E-value  or percent identity) and considering two protein sequences potentially homologous if their similarity is above the threshold [2, 3]. However, by considering SCOP superfamilies as gold standard collections of homologous proteins and analysing the distribution of sequence distances within and between superfamilies, it was shown that there does not exist a single threshold on BLAST E-values that can be used to cluster homologues correctly . As a consequence, while the existing methods yield adequate results for close homologues, they are likely to fail in identifying distant evolutionary relationships.
A possible way to improve these results is to use "global" methods, which cluster a set of proteins taking into account all the distances between every pair of proteins in the set. Paccanaro et al  introduced a global method based on spectral clustering and showed that it has better performance than commonly used local methods (namely hierarchical clustering  and connected component analysis ) and TribeMCL . Other authors have also used spectral clustering successfully in various biological contexts [8–12]. The development of SCPS (Spectral Clustering of Protein Sequences) was motivated by the fact that currently available implementations of spectral clustering methods mostly consist of a few loosely coupled Matlab scripts that assume a fair amount of familiarity with Matlab programming and hence they are inaccessible for large parts of the research community. Moreover, the mathematical formulation of the algorithm is rather involved and it is not trivial to implement all the details properly in an ex-novo implementation.
SCPS provides an implementation of the spectral clustering algorithm  via a simple, clean and user-friendly graphical user interface that requires no background knowledge in programming or in the details of spectral clustering algorithms. SCPS is also able to perform connected component analysis and hierarchical clustering, and it incorporates TribeMCL, thus providing the user with an integrated environment where one can experiment with different clustering techniques. SCPS is extremely efficient and its speed scales well with the size of the dataset, allowing for the clustering of protein sets constituted by thousands of proteins in a few minutes. Moreover, SCPS is able to calculate different cluster quality scores, it interfaces with external tools such as BLAST  and Cytoscape , and it can produce publication-quality graphical representations of the clusters obtained, thus constituting a comprehensive tool for practical research. For more advanced use-cases (i.e., the integration of SCPS in automated batch processing pipelines), we also included a sophisticated command line interface.
SCPS was written in C++ and is distributed as an open-source package. Precompiled executables are available for the three major operating systems (Windows, Linux and Mac OS X) at http://www.paccanarolab.org/software/scps.
In the rest of this paper, we outline the general framework of our spectral clustering algorithm and then demonstrate its practical usage and usefulness via a number of benchmarks ranging from a few superfamilies to the entire SCOP database and the genome of the yeast Saccharomyces cerevisiae.
Spectral clustering in SCPS
The goal of SCPS is to infer homology relations between protein sequences based on pairwise sequence information only. The input dataset thus consists of either a set of protein sequences or a list of pairwise similarity scores between some protein domains. The output is a partition of the sequences such that each sequence is assigned to one and only one of the partitions in a way that the partitions represent groups of homologs.
A typical SCPS workflow starts with either a FASTA file containing sequences for the protein domains of interest, or a list of BLAST E-values for all pairs of proteins where significant sequence similarity was reported by BLAST. Besides spectral clustering , SCPS currently supports connected component analysis, hierarchical clustering and TribeMCL , and more algorithms will be added in the near future. The spectral clustering approach reformulates the problem of protein homology detection into that of finding an optimal partition of a weighted undirected graph G. Each vertex of the graph corresponds to a protein sequence. Vertices are connected by undirected, weighted edges, each edge denoting a similarity relation between the two proteins it connects. The weight (label) of the edge is related to the probability of evolutionary relatedness. Edges with large weight are more likely to appear between domains of the same superfamily, hence the problem of partitioning the graph into subsets of vertices with mostly heavy-weight edges is an equivalent formulation of the original protein sequence classification problem. Spectral clustering solves the problem of finding the optimal partition by examining random walks on the similarity graph .
If the input file is a FASTA sequence file, we conduct an all-against-all matching using BLAST and store the E-values.
Given the pairwise BLAST E-values obtained either from the previous step or directly from the input file, we build an affinity matrix based on a non-linear transformation from E-values to similarity scores. The matrix element in row i and column j contains the E-value corresponding to protein j when protein i was used as a query sequence.
Since the BLAST E-value corresponding to a query protein i matching protein j in the database is not necessarily equal to the case when the query protein j matches protein i in the same database, the affinity matrix has to be symmetrized. To obtain a symmetric matrix, we take the higher similarity score (i.e. the smaller E-value) in case of ambiguity. Let s ij denote the symmetrized similarity score between protein i and protein j. The s ij values together constitute the symmetrized affinity matrix S, whose main diagonal contains only ones.
We conduct a preliminary connected component analysis on the graph represented by the affinity matrix S to identify small connected components containing less than five sequences. It is unlikely that these components should be subdivided further, therefore we remove the rows and columns corresponding to these sequences from S, obtaining a reduced matrix S'.
We construct a symmetric matrix L = D-1/2S ' D-1/2, where D is a diagonal matrix formed of the vertex degrees ( ), and find the eigenvectors corresponding to the K largest eigenvalues of L. Let us denote these eigenvectors by u1, u2, ..., u K , respectively.
We build a matrix U s.t. the k th column of U is u k and normalize the rows of the matrix such that each row in U has unit length.
Treating the rows of U as points in the k-dimensional Euclidean space ℝ K , we conduct a k-means clustering of these points into K clusters. The initial centroid positions are chosen from the data points themselves, placed as orthogonally to each other as possible.
We assign node i in the original graph to cluster j if and only if row i of Y was assigned to cluster k in the previous step. Small connected components obtained in step 4 are also merged back into the dataset in this final stage.
An important advantage of this method is that the number of clusters (K) can be selected automatically by evaluating the eigenvalues of S ' . In our implementation, K is set to the smallest integer k such that λ k /λk+1> ε. ε is adjustable and it is chosen to be 1.02 by default. The main role of ε is to control the granularity of the clustering obtained: larger ε values tend to produce more fine-grained clusters, while a smaller ε yields only a few large clusters. We found that the default choice works well in a wide variety of biological problems (see the Results section). Another way to control the granularity of the clustering is to override K manually either before the clustering process or after the eigenvalue calculation. Both methods are facilitated by the SCPS user interface.
Finally, SCPS includes a command line interface which runs the clustering without user intervention and writes the results to the standard output or to a specific output file. This enables the integration of SCPS in batch processing pipelines.
SCPS uses the ARPACK library  for eigenvector calculations. The ARPACK library implements the implicitly restarted Arnoldi method for eigenvector calculations, which is an iterative process that is able to calculate all the eigenvectors and eigenvalues or only the top K ones. When one can provide a reasonable upper estimate on the number of clusters, the Arnoldi method is much more efficient than standard methods that solve the eigenvector equation directly. On the other hand, the convergence of iterative methods is affected negatively in the presence of eigenvalues with multiplicity greater than one. The multiplicity of the top eigenvalue of the affinity matrix S is equal to the number of connected components in the input graph. Therefore, we first eliminate small connected components of size less than five sequences from the original graph (they will not be subdivided further) and then connect the remaining components by a small amount of random edges with weight less than 0.01. This decreases the multiplicity of the top eigenvalue to one and thus improve the stability of the eigenvector calculation process without affecting the final result.
The number of clusters can be selected using one of the following methods in our implementation:
Automatic. This method uses the eigengaps to select the appropriate number of clusters. K is set to the smallest integer k such that λ k /λk+1> ε. ε is adjustable and it is chosen to be 1.02 by default.
Bounded from above. This method is similar to the automatic selection, but it considers at most a given number of clusters. It takes advantage of the fact that the complete eigenspectrum is not needed in this case when using the spectral clustering, saving time and resources during the computation. If the maximum number of clusters is K max , SCPS will compute only the top K max eigenvalues and the corresponding eigenvectors.
Exactly. The user can select the desired number of clusters either before the analysis or after the calculation of the eigenvalues and eigengaps.
Transforming BLAST E-values to similarities
A crucial step in the application of spectral clustering methods in the context of protein sequences is the transformation from BLAST E-values to similarities. SCPS uses an approach based on the statistical analysis of E-values within and between SCOP superfamilies. A randomly selected set of 10,000 E-values chosen from sequences within the same superfamily and 10,000 E-values chosen from sequences in different superfamilies were used to train a logistic regression model that discriminates between intra-superfamily and inter-superfamily E-values. The posterior probability returned by the model on any E-value is then interpreted as the probability of evolutionary relatedness. In case of asymmetric E-values for a pair of proteins, the lower E-value (i.e., the higher probability) is used. The proteins used for training the logistic regression model were not used later in performance assessments of the algorithm.
This section describes the various quality measures we implemented in SCPS. In the following subsections, we will use the following notations:
s ij is the similarity value labelling the edge between vertex i and j in a graph G. s ij = s ji since we always symmetrize the initial similarity values.
δ ij is 1 if vertices i and j are within the same cluster, zero otherwise.
We will also need the following definitions:
Definition 1 (Vertex weight) The weight of vertex i is the sum of the weight (similarity) of all its adjacent edges:d i = Σ j s ij .
Definition 2 (Cluster weight) The weight of cluster i is the sum of the weight of all the edges that lie fully within cluster i (i.e., both their endpoints are in cluster i).
The mass fraction  is an internal quality measure of a clustering on a given graph G. Intuitively, a clustering is good if the total weight of its clusters is comparable to the total weight of the whole network; in other words, most of the heavy-weight edges are within clusters. The mass fraction simply denotes the fraction of edge weights that is concentrated inside the clusters.
A disadvantage of this measure is that it attains its maximum when all the vertices are in the same cluster, hence the mass fraction alone cannot be used to decide whether a given clustering is better than another.
Modularity  is another internal quality measure of a clustering on a given graph G. The idea is that it is not enough for a clustering to be good when it contains much of the edge weights within the clusters; the clustering is good when it contains more weight within the clusters than what we would expect if we rearranged the edges of the graph randomly while keeping the vertex weights constant. Therefore, the difference between the actual cluster weight and the expected cluster weight after such rearrangement is a good indicator of the general quality of the clustering. This measure also avoids the problem with trivial clusterings: a cluster containing all the vertices will contain exactly the same weight before and after rewiring as all the edges will stay within the same cluster, so the modularity score will be zero. Similarly, a clustering where each vertex is in its own cluster will also yield zero modularity as there are no intra-cluster edges at all.
Formally, the modularity score of a clustering is the normalized difference between the actual weight of the clusters and the expected cluster weight after a random rewiring that preserves the vertex weights. It can be shown that the expected weight of the edge between vertex i and j after rewiring is , where m is the sum of all edge weights in the graph (m = Σi ≥ js ij ) . The modularity formula then follows easily:
Positive modularity then means that there is more weight concentrated within the clusters than what we would expect from a completely random graph with the same vertex weight distribution.
Heatmap of the rearranged similarity matrix
This quality measure is not a single numeric value, but it provides a visual cue to the goodness of a clustering result. The basic idea is that the initial similarity matrix can be plotted as a greyscale heatmap where each pixel corresponds to a single cell of the matrix and the intensity of the pixel is proportional to the weight that the corresponding cell in the matrix represents. The rows and columns of the similarity matrix can be arranged in arbitrary order, but by arranging them in a way that rows and columns corresponding to the same cluster are next to each other, one can uncover a block-diagonal structure in the matrix if the clustering is good.
Results and discussion
In this section, we present the results of a comparison of SCPS with other popular clustering methods (hierarchical clustering , connected component analysis  and TribeMCL ) on various datasets assembled from SCOP 1.75 , ASTRAL-95  and STRING v8.1 . First, we will describe the datasets we used, then we give an overview of the methods we compared SCPS with and the quality measures we used to evaluate the performance of each method. After that, the benchmark results will be presented in detail. We conclude the section with a short discussion on the scalability of SCPS.
Datasets 1-4 and the SCOP≥ 5 dataset in our benchmarks were taken from SCOP 1.75 . Sequence data for these datasets were gathered from ASTRAL-95 . Sequence data for the yeast genome benchmark were downloaded from STRING v8.1  and the corresponding Gene Ontology annotations were assembled from the Saccharomyces Genome Database .
List of SCOP superfamilies used in Datasets 1-4
SCOP superfamily ID
Trypsin-like serine proteases
MHC antigen-recognition domain
NAD(P)-binding Rossmann-fold domains
Triosephosphate isomerase (TIM)
Concanavalin A-like lectins/glucanases
Trypsin-like serine proteases
The SCOP≥ 5 dataset was constructed from SCOP 1.75 and ASTRAL-95 as follows: a database containing all sequences in ASTRAL-95 was used to conduct an all-against-all search using BLAST. For each sequence in ASTRAL-95, the corresponding superfamily was looked up from SCOP and a gold standard clustering was created using all the superfamilies that contained at least five sequences. Superfamilies containing less than five domains with associated sequence information were excluded from the benchmark, as we were interested in the performance of the methods in case of non-trivial superfamilies. The final dataset contained 632 superfamilies.
Datasets 1-4 are distributed with the downloadable SCPS package. The SCOP≥ 5 and the yeast genome dataset was excluded as it would have disproportionately increased the size of the package, but it is available from the authors upon request.
Alternative clustering approaches
Hierarchical clustering is a family of clustering methods that start with individual data points (i.e. the sequences) and then build a tree by iteratively merging the closest points until only one is left . The final cluster assignment is then determined by cutting the branches of the tree at a specific level. The various hierarchical clustering methods usually differ only in the way they define the distance between two sets of data points and the way they choose the optimal level to cut the branches of the tree in the end. The best results in our datasets were obtained by using the average distance metric, in which the distance between two sets of data points is given by the average distance between all possible point pairs such that one point is chosen from one of the sets and the other one is from the other set. The tree was cut at the level where the average distance metric was above 10-6, similarly to . Pairs of proteins where BLAST did not return an E-value were considered to have an E-value of 10, which is the default BLAST E-value threshold.
Connected component analysis
Connected component analysis is a method that has been widely used in computer vision  and was initially applied to sequence clustering in GeneRAGE  and ProClust . The method starts with a fully connected graph where the edges are labeled by the E-values or some other suitable distance metric. The algorithm proceeds by removing edges labelled by a distance larger than a given threshold, then collecting groups of vertices that still remained connected. These groups are then considered as the final result of the algorithm. The E-value threshold used in our benchmarks was 10-6, similarly to GeneRAGE .
TribeMCL , a variant of the Markov clustering algorithm (MCL), models the random walk of a particle on a similarity graph, similarly to spectral clustering. A detailed comparison is given in , here we only note that the fundamental difference between MCL and spectral clustering is the way the random walk is propagated along the edges of the network. While our spectral clustering algorithm models the random walk exactly and analyses perturbations to the stationary distribution of the random walk, MCL modifies the random walk to promote the emergence of clusters. This approximation allows MCL to converge faster, but it can potentially lead to many small clusters. Another, less significant difference is the way TribeMCL symmetrizes the input matrix of E-values: while SCPS takes the smaller E-value in face of ambiguity and then transforms it to a similarity value, TribeMCL transforms both E-values to similarities first by taking the negative base 10 logarithm and then symmetrizes the pair by taking the average. A more detailed comparison of the two algorithms is to be found in .
For the TribeMCL benchmarks on Datasets 1-4, we tuned the inflation parameter of the algorithm by trying all possible values with a step size of 0.1 in the range [1.2; 5.0], as suggested by the documentation of the algorithm. The final inflation parameter was chosen in a way that resulted in the highest F-score. For the SCOP≥ 5 dataset, the inflation parameter was chosen as the average of the inflation parameters that were the best for Datasets 1-4.
Comparing clusterings with a gold standard
We used the combined F-score to compare a clustering result with the gold standard SCOP superfamily classification. Let n denote the total number of proteins in the dataset, ni*the number of proteins in the i th superfamily, n*jthe number of proteins in the j th calculated cluster and n ij the number of proteins that are in superfamily i and cluster j at the same time.
Definition 5 (Precision) The precision of cluster j with respect to superfamily i is the fraction of proteins in cluster j that are also in superfamily i: p ij = n ij /n*j
Definition 6 (Recall) The recall of cluster j with respect to superfamily i is the fraction of proteins in superfamily i that are also in cluster j: r ij = n ij /ni*
Now we can define the combined F-score, which combines precision and recall with equal weights.
The combined F-score attains its maximum at 1 if the two clusterings are identical.
Benchmarks on SCOP
The validity of the spectral clustering approach was tested on several datasets assembled from the SCOP database, version 1.75 . Sequences were extracted from ASTRAL-95 , i.e. the sequence identity between any two sequences was at most 95%. Datasets 1-3 contained sequences from 5-8 protein superfamilies that were hand-chosen to resemble the datasets originally used in  (the original datasets could not have been re-used due to the changes in SCOP classifications and to the new sequences added to ASTRAL-95 since 2006). Dataset 4 was conceived specifically for this study. Finally, the SCOP≥ 5 dataset contains all the SCOP superfamilies containing at least five sequences. The datasets were described in detail earlier in the Data subsection.
Comparison of spectral clustering with other methods
Figure 5 shows the heatmaps of the rearranged similarity matrices for Datasets 1-4 using spectral clustering, confirming that the quality of the obtained clustering is indeed very good. It also shows us that Dataset 4 is different from the others as the number of clusters is much higher than the number of superfamilies used to construct the dataset, indicating that BLAST misses many remote homologs in this case. Using an improved similarity measure derived from PSI-BLAST  or CS-BLAST  would probably yield better results in these cases; however, examining this is out of the scope of the present paper.
Clustering the genome of the yeast Saccharomyces cerevisiae
To further test the scalability of our method and to assess its performance on the genome of a model organism with multi-domain proteins, we collected 6,690 sequences of the yeast Saccharomyces cerevisiae from STRING v8.1  and performed an all-against-all BLAST search on them with the default BLAST parameters. The BLAST hits were processed with spectral clustering, TribeMCL, connected component analysis and hierarchical clustering and clusters of size less than three were excluded from further assessment. The parameters for the various algorithms were the same as in the SCOP≥ 5 benchmark.
Comparison of the results obtained on the genome of the yeast Saccharomyces cerevisiae
Total cluster size
The spectral clustering method has two potential bottlenecks. One of them is the k-means clustering step where no exact result is known about the number of steps the algorithm takes in the worst case. However, it was shown recently that the k-means clustering procedure terminates in a polynomial number of steps with high probability in high-dimensional spaces when the data points are drawn from independent multivariate normal distributions . It was also proven that given a clustered structure in the original input dataset, data points of the same cluster will be aligned roughly along orthogonal directions in our k-means step. The normalisation step then ensures that these points will be situated close to each other , thus they can be approximated well with multivariate normal distributions. Therefore, the data points we are likely to encounter in the k-means step satisfy the conditions of polynomial time complexity. The other potential bottleneck of the algorithm is the calculation of the eigenvectors. Typically, the number of steps required to calculate the top K eigenvectors scales linearly with the number of non-zero elements in the input matrix when using the implicitly restarted Arnoldi method . Since SCPS uses this method when a maximum cluster count is specified, the algorithm is expected to terminate in polynomial time for real sequence similarity datasets, enabling us to analyse large datasets comprising of thousands of protein sequences. In our experiments, the SCOP≥ 5 dataset was processed in 83 minutes using a single core of a quad-core Intel Xeon X3360 desktop machine running at 2.83 GHz, using the top 2000 eigenvalues and eigenvectors of the similarity matrix. This does not include the CPU time required to run the all-against-all BLAST query on SCOP, which took nearly four hours.
In this paper, we presented SCPS, an efficient, user-friendly, scalable and platform-independent improved implementation of a spectral clustering method , which can identify protein superfamilies in datasets containing thousands of proteins within a few minutes. The software along with its source code is available to non-commercial users free of charge. We would like to encourage users and developers to provide feedback, suggest new features or contribute code. Future work will focus on the improvement of the similarity measure used by the algorithm and a parallelized implementation of the method to exploit the power of multiple CPU cores.
Availability and requirements
Project name: SCPS
Project home page: http://www.paccanarolab.org/software/scps
Operating systems: Windows, Mac OS X, Linux
Programming language: C++
License: GNU General Public License (GPL) v3
Restrictions to use by non-academics: None
TN was supported by the Newton International Fellowship Scheme of the Royal Society (grant number NF080750). AP was supported by the Biotechnology and Biological Sciences Research Council (BBSRC) grant BB/F00964X/1.
- Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol 1990, 215(3):403–410.View ArticlePubMedGoogle Scholar
- Enright AJ, Ouzounis CA: GeneRAGE: a robust algorithm for sequence clustering and domain detection. Bioinformatics 2000, 16(5):451–457. 10.1093/bioinformatics/16.5.451View ArticlePubMedGoogle Scholar
- Pipenbacher P, Schliep A, Schneckener S, Schönhuth A, Schomburg D, Schrader R: ProClust: improved clustering of protein sequences with an extended graph-based approach. Bioinformatics 2002, 18: S182-S191. 10.1093/bioinformatics/18.1.182View ArticlePubMedGoogle Scholar
- Paccanaro A, Casbon JA, Saqi MAS: Spectral clustering of protein sequences. Nucleic Acids Res 2006, 34(5):1571–1580. 10.1093/nar/gkj515View ArticlePubMedPubMed CentralGoogle Scholar
- Everitt B: Cluster analysis. 3rd edition. London: Edward Arnold; 1993.Google Scholar
- Ballard D, Brown C: Computer Vision. Englewood Cliffs: Prentice-Hall; 1982.Google Scholar
- Enright AJ, Van Dongen S, Ouzounis CA: An efficient algorithm for large-scale detection of protein families. Nucleic Acids Res 2002, 30(7):1575–1584. 10.1093/nar/30.7.1575View ArticlePubMedPubMed CentralGoogle Scholar
- Kannan N, Vishveshwara S: Identification of side-chain clusters in protein structures by a graph spectral method. J Mol Biol 1999, 292(2):441–464. 10.1006/jmbi.1999.3058View ArticlePubMedGoogle Scholar
- Forman JJ, Clemons PA, Schreiber SL, Haggarty SJ: SpectralNET - an application for spectral graph analysis and visualization. BMC Bioinformatics 2005, 6: 260. 10.1186/1471-2105-6-260View ArticlePubMedPubMed CentralGoogle Scholar
- Krishnadev O, Brinda K, Vishveshwara S: A graph spectral analysis of the structural similarity network of protein chains. Proteins Struct Funct Bioinfo 2005, 61: 152–163. 10.1002/prot.20532View ArticleGoogle Scholar
- Verkhedkar K, Raman K, Chandra N, Vishveshwara S: Metabolome based reaction graphs of M. tuberculosis and M. leprae: a comparative network analysis. PLoS ONE 2007, 2(9):e881. 10.1371/journal.pone.0000881View ArticlePubMedPubMed CentralGoogle Scholar
- Wang G, Shen Y, Ouyang M: A vector partitioning approach to detecting community structure in complex networks. Comput Math Appl 2008, 55(12):2746–2752. 10.1016/j.camwa.2007.10.028View ArticleGoogle Scholar
- Shannon P, Markiel A, Ozier O, Baliga NS, Wang JT, Ramage D, Amin N, Schwikowski B, Ideker T: Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res 2003, 13(11):2498–504. 10.1101/gr.1239303View ArticlePubMedPubMed CentralGoogle Scholar
- Meilă M, Shi J: A random walks view of spectral segmentation. Proceedings of the 8th International Workshop on Artificial Intelligence and Statistics (AISTATS) 2001.Google Scholar
- Ng AY, Jordan MI, Weiss Y: On Spectral Clustering: Analysis and an algorithm. In Advances in Neural Information Processing Systems 14. MIT Press; 2001:849–856.Google Scholar
- van Dongen S: Performance criteria for graph clustering and Markov cluster experiments. Tech Rep INS-R0012 2000.Google Scholar
- Newman M: Fast algorithm for detecting community structure in networks. Phys Rev E 2004, 69(6):066133. 10.1103/PhysRevE.69.066133View ArticleGoogle Scholar
- Fruchterman T, Reingold E: Graph drawing by force directed placement. Software Pract Ex 1991, 21(11):1129–1164. 10.1002/spe.4380211102View ArticleGoogle Scholar
- Lehoucq RB, Sorensen DC, Yang CY: ARPACK users' guide: solution of large-scale eigenvalue problems with implicitly restarted Arnoldi methods. SIAM; 1998.View ArticleGoogle Scholar
- Murzin A, Brenner S, Hubbard T, Chothia C: SCOP - a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol 1995, 247(4):536–540.PubMedGoogle Scholar
- Chandonia JM, Hon G, Walker NS, Lo Conte L, Koehl P, Levitt M, Brenner SE: The ASTRAL Compendium in 2004. Nucleic Acids Res 2004, (32 Database):D189–92. 10.1093/nar/gkh034Google Scholar
- Jensen LJ, Kuhn M, Stark M, Chaffron S, Creevey C, Muller J, Doerks T, Julien P, Roth A, Simonovic M, Bork P, von Mering C: STRING 8 - a global view on proteins and their functional interactions in 630 organisms. Nucleic Acids Res 2009, (37 Database):D412–416. 10.1093/nar/gkn760Google Scholar
- Hong EL, Balakrishnan R, Dong Q, Christie KR, Park J, Binkley G, Costanzo MC, Dwight SS, Engel SR, Fisk DG, Hirschman JE, Hitz BC, Krieger CJ, Livstone MS, Miyasato SR, Nash RS, Oughtred R, Skrzypek MS, Weng S, Wong ED, Zhu KK, Dolinski K, Botstein D, Cherry JM: Gene Ontology annotations at SGD: new data sources and annotation methods. Nucleic Acids Res 2008, 36: D577–581. 10.1093/nar/gkm909View ArticlePubMedPubMed CentralGoogle Scholar
- Altschul SF, Madden TL, Schffer AA, Schffer RA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997, 25: 3389–3402. 10.1093/nar/25.17.3389View ArticlePubMedPubMed CentralGoogle Scholar
- Biegert A, Soding J: Sequence context-specific profiles for homology searching. Proc Natl Acad Sci USA 2009, 106(10):3770–3775. 10.1073/pnas.0810767106View ArticlePubMedPubMed CentralGoogle Scholar
- Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, Harris MA, Hill DP, Issel-Tarver L, Kasarskis A, Lewis S, Matese JC, Richardson JE, Ringwald M, Rubin GM, Sherlock G: Gene ontology: tool for the unification of biology. Nat Genet 2000, 25: 25–29. 10.1038/75556View ArticlePubMedPubMed CentralGoogle Scholar
- Benjamini Y, Hochberg Y: Controlling the false discovery rate: a practical and powerful approach to multiple testing. J Roy Stat Soc B Stat Meth 1995, 57: 289–300.Google Scholar
- Ruepp A, Zollner A, Maier D, Albermann K, Hani J, Mokrejs M, Tetko I, Guldener U, Mannhaupt G, Munsterkotter M, Mewes HW: The FunCat, a functional annotation scheme for systematic classification of proteins from whole genomes. Nucl Acids Res 2004, 32(18):5539–5545. 10.1093/nar/gkh894View ArticlePubMedPubMed CentralGoogle Scholar
- Arthur D, Vassilvitskii S: How slow is the k-means method? In SCG '06: Proceedings of the 22nd Annual Symposium on Computational Geometry. New York, NY, USA: ACM; 2006:144–153. full_textView ArticleGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.