PFClust: a novel parameter free clustering algorithm
 Lazaros Mavridis†^{1},
 Neetika Nath†^{1} and
 John BO Mitchell^{1}
https://doi.org/10.1186/1471-2105-14-213
© Mavridis et al.; licensee BioMed Central Ltd. 2013
Received: 29 January 2013
Accepted: 1 July 2013
Published: 3 July 2013
Abstract
Background
We present the algorithm PFClust (Parameter Free Clustering), which automatically clusters data and identifies a suitable number of clusters to group them into, without requiring the user to specify any parameters. The algorithm partitions a dataset into a number of clusters that share some common attributes, such as their minimum expectation value and variance of intracluster similarity. A set of n objects can be clustered into any number of clusters from one to n, and there are many different hierarchical and partitional, agglomerative and divisive, clustering methodologies available that can be used to do this. Nonetheless, automatically determining the number of clusters present in a dataset constitutes a significant challenge for clustering algorithms. Identifying a putative optimum number of clusters to group the objects into involves computing and evaluating a range of clusterings with different numbers of clusters. However, there is no agreed or unique definition of optimum in this context. Thus, we test PFClust on datasets for which an external gold standard of ‘correct’ cluster definitions exists, noting that this division into clusters may be suboptimal according to other reasonable criteria. PFClust is heuristic in the sense that it cannot be described in terms of optimising any single simply expressed metric over the space of possible clusterings.
Results
We validate PFClust firstly with reference to a number of synthetic datasets consisting of 2D vectors, showing that its clustering performance is at least equal to that of six other leading methodologies, even though five of the other methods are told in advance how many clusters to use. We also demonstrate the ability of PFClust to classify the three-dimensional structures of protein domains, using a set of folds taken from the structural bioinformatics database CATH.
Conclusions
We show that PFClust is able to cluster the test datasets a little better, on average, than any of the other algorithms, and furthermore is able to do this without the need to specify any external parameters. Results on the synthetic datasets demonstrate that PFClust generates meaningful clusters, while our algorithm also shows excellent agreement with the correct assignments for a dataset extracted from the CATH part-manually curated classification of protein domain structures.
Background
In pattern recognition, data analysis is used for predicting the behaviour of an unseen test dataset. Amongst problems of this kind, two different types can be clearly distinguished. The first is supervised learning or classification, where a labelled training dataset with known categories is involved. The second kind is unsupervised learning or clustering, where no prior information is available for grouping the dataset. The objective of a clustering algorithm is to partition the given data into mutually exclusive and meaningful [1] clusters; this can provide a better understanding of the natural structure of the data. Semi-supervised [2] classification, which combines strategies from both supervised and unsupervised methods, has also attracted attention in various fields of research as it requires less human effort and gives better accuracy [3] than unsupervised learning. In this paper, we focus our attention on the challenges faced by clustering algorithms [4, 5].
There are numerous clustering algorithms discussed in the literature, traditionally clearly distinguished as either hierarchical [6] or partitional [7]. Hierarchical methods group the objects together layer by layer, based on the closeness of the data points as measured by suitable similarity or distance metrics. Hierarchical clustering sequentially partitions the dataset, either by successively dividing an initial single cluster in divisive methods, or by joining together initially unlinked objects in agglomerative algorithms. In a hierarchical method, once two objects are clustered together they remain together at all subsequent levels of the scheme with fewer clusters. In contrast, partitional clustering algorithms, such as k-means, do not have a layer-by-layer structure and objects may sometimes move from one cluster to another.
The k-means method iteratively assigns each point to the cluster whose centroid is closest to it, recalculates the cluster centroids, and reassigns the points. This process continues until the assignments no longer change at each iteration. k-means tends to generate approximately equally sized clusters, minimising intracluster distances; however, its preference for globular clusters and its failure to reproduce clusters of complex shape are limitations [5].
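The assign-and-update loop described above can be sketched in a few lines. This is a minimal illustration only, not the R implementation used later in the paper; the function name `kmeans`, the random choice of initial centroids and the `max_iter` cap are assumptions made for the example.

```python
import random

def kmeans(points, k, seed=0, max_iter=100):
    """Lloyd-style k-means on a list of (x, y) tuples; returns (labels, centroids)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # assumption: start from k randomly chosen data points
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to the cluster with the nearest centroid.
        new_labels = [
            min(range(k),
                key=lambda c: (p[0] - centroids[c][0]) ** 2 + (p[1] - centroids[c][1]) ** 2)
            for p in points
        ]
        if new_labels == labels:  # assignments no longer change: stop
            break
        labels = new_labels
        # Update step: recompute each centroid as the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the previous centroid if a cluster becomes empty
                centroids[c] = (sum(m[0] for m in members) / len(members),
                                sum(m[1] for m in members) / len(members))
    return labels, centroids
```

Note that k must be supplied by the caller, which is exactly the requirement PFClust aims to remove.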
Determining the number of clusters
Automatically determining the number of clusters is a major problem in clustering. A set of n objects can be clustered into any number k of clusters 1 ≤ k ≤ n by any of the methods we have discussed. Identifying the optimal number of clusters involves computing a range of different numbers of clusters k, with the objective of finding the best value of k that gives the optimum clustering. However, there is no agreed or unique definition of optimum in this context. Using internal and external validation measures as described in Handl et al. [5], one could design a protocol for reaching a decision on the best k. A gap statistic [8] addressed this issue by acting as an internal validation measure, and has been applied in bioinformatics [9]. Though in principle it is not hard to design a workflow to find the best k, in practice this is not commonly done. This is partly because there is no consensus as to which of the many different possible measures should be used to compare clusterings with different numbers of clusters, a more difficult problem than the comparison of clusterings with the same k. This adds to the difficulty of choosing the best clustering method for finding the structure of a novel dataset.
Validation
Validation [4, 8] plays an important role in deciding the number of clusters, as well as in assessing the performance of the clustering algorithm. Cluster validation is designed to evaluate and compare clustering algorithms by their ability to analyse a dataset. There are many different validation measures. Internal measures like Silhouette width [9] and Dunn Index [10, 11] depend on the inherent structure of the data [12]. External measures, such as the Rand Index [13] commonly used to evaluate the noise in biological data, depend on comparison with an externally known gold standard classification of the objects.
Here we propose a novel clustering algorithm, PFClust (Parameter Free Clustering), suitable for use where no prior information about the number of clusters is given. As input, only similarity scores within the dataset are required, and evaluation of the clustering is part of the algorithm. A previous study by Akoglu et al. [14] designed a parameter-free graph clustering algorithm to find cohesive clusters, PICS. They have shown the efficiency of their method using real-world datasets including data from YouTube and Twitter. Our method shares the property of being parameter-free, but is aimed at classifying objects rather than graphs.
As the availability of biological information accelerates, it is necessary to find the natural structure or patterns in data in order to understand biological questions. In bioinformatics, grouping proteins based on sequence [1] or structure is a very common task. Classification of novel proteins [15] can be performed by using pattern recognition approaches, built on the assumption that some underlying characteristics are considered while clustering proteins into superfamilies and families. There are numerous classification schemes for protein sequences, including PIR-PSD [16], a freely available database of protein sequence classification mostly applied for functional annotation, and Pfam [17], a classification of functional protein domains based on hidden Markov models and multiple sequence alignments.
Extending these ideas to three-dimensional (3D) protein structure provides the interesting task of clustering and classifying protein domain folds. During the early 1990s the Protein Data Bank (PDB) [18] held only a few thousand 3D crystal structures, and several initiatives for protein fold classification were proposed, with CATH [19] and SCOP [20] being the best known. These were based on either manual curation (SCOP) or computer-aided manual curation (CATH). Common to both approaches is that the human curator has the final word in the classification decision. With the exponential growth of the number of 3D high-resolution structures deposited in the PDB during the last decade [21], reaching 87,085 structures at the beginning of 2013, the rate-limiting manual part of the curation process restricts our capacity to understand the full structural diversity of proteins. Hence it would be ideal if a fully automated process could classify protein domains and cluster them into structurally similar groups.
Methods
Here we describe a partitional algorithm that uses the idea that each cluster can be represented as a non-predetermined distribution of the intracluster similarities of its members. The algorithm partitions a dataset into a number of clusters that share some common attributes, such as their minimum expectation value and variance of intracluster similarity. It is an agglomerative algorithm, meaning that it starts with separated objects and progressively joins them together to form clusters. PFClust is heuristic in the sense that it cannot be described in terms of optimising any single simply expressed metric over the space of possible clusterings. Nonetheless, we demonstrate that, over a number of validations on test datasets, it produces clusterings that closely reflect the structure of the test data, and outperforms many well-established algorithms in this regard. We have taken a number of design decisions to optimize the algorithm with respect to time efficiency and result stability.
The number of possible clusterings of a dataset depends on the number of points and on k, the number of clusters, inclusive of singletons. This number grows rapidly with the number of points in the dataset. Hence, instead of attempting an exhaustive search of all possible clusterings, we perform a random sampling, where we randomly decide the number of clusters and randomly assign the initial distribution of points amongst clusters. This sampling approach is necessary for efficiency, but introduces a random element to the algorithm. In order to optimize some of our design decisions for the algorithm, we have internally validated using a synthetic dataset of 1500 two-dimensional (2D) vector points.
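To give a feel for how quickly exhaustive search becomes infeasible, the sketch below counts the ways of splitting n objects into k non-empty clusters using Stirling numbers of the second kind, and sums over k to obtain the Bell number. Treating these as the counts referred to above is an assumption, and the function names are illustrative.

```python
def stirling2_row(n):
    """Return [S(n, 0), ..., S(n, n)], the Stirling numbers of the second kind."""
    row = [1]  # S(0, 0) = 1
    for m in range(1, n + 1):
        prev = row + [0]  # pad so that prev[m] = S(m-1, m) = 0
        # Recurrence: S(m, k) = k * S(m-1, k) + S(m-1, k-1)
        row = [0] + [k * prev[k] + prev[k - 1] for k in range(1, m + 1)]
    return row

def count_clusterings(n):
    """Bell number: all partitions of n objects into 1..n non-empty clusters."""
    return sum(stirling2_row(n))

for n in (5, 10, 20):
    print(n, count_clusterings(n))
```

Even for a few tens of points the total is far beyond what can be enumerated, which motivates the random sampling used here.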
Algorithm
Threshold estimation
Similaritybased clustering
Now, for each threshold value T, the dataset Ω is clustered with a similarity-based clustering. We begin with all elements separated and no clusters defined. The two most similar elements in Ω are placed together to form the first cluster. For each of the (n − 2) remaining elements, we compute its average similarity E[X] to the elements already in cluster 1, and we then identify the element with the highest average similarity. If this value is larger than T, the algorithm considers adding this most similar element to the given cluster. The element is added if and only if the element’s average similarity compared to the members of that cluster is at least P% of T and the overall E[X] of the cluster is also larger than T. This process is repeated until no new element can be added to the cluster without E[X] of the cluster falling below T. At this stage, a new cluster is formed from the two most similar remaining elements, provided that their similarity exceeds T. This process is continued iteratively until all elements of Ω have been clustered, or until the remaining elements cannot form a cluster that has an expectation value of intracluster similarity greater than T.
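The greedy procedure for a single threshold T can be sketched as follows. This is a simplified reading of the description, not the authors' code: the names `cluster_at_threshold`, `mean_pairwise` and `avg_to_cluster` are hypothetical, `sim` is assumed to be a symmetric list-of-lists similarity matrix, and the acceptance test uses p × T with p = 0.85 as in the pseudocode.

```python
from itertools import combinations

def mean_pairwise(sim, members):
    """E[X]: mean similarity over all unordered pairs inside a cluster."""
    pairs = list(combinations(members, 2))
    return sum(sim[i][j] for i, j in pairs) / len(pairs)

def avg_to_cluster(sim, x, cluster):
    """Average similarity of element x to the current members of a cluster."""
    return sum(sim[x][m] for m in cluster) / len(cluster)

def cluster_at_threshold(sim, T, p=0.85):
    """Greedy clustering for one threshold T; returns (clusters, singletons)."""
    free = set(range(len(sim)))
    clusters = []
    while len(free) >= 2:
        # Seed a new cluster with the two most similar unclustered elements.
        i, j = max(combinations(sorted(free), 2), key=lambda ij: sim[ij[0]][ij[1]])
        if sim[i][j] <= T:
            break  # no remaining pair exceeds T, so no further cluster can form
        cluster = [i, j]
        free -= {i, j}
        while free:
            # Candidate with the highest average similarity to the cluster.
            best = max(sorted(free), key=lambda x: avg_to_cluster(sim, x, cluster))
            close_enough = avg_to_cluster(sim, best, cluster) > p * T
            if close_enough and mean_pairwise(sim, cluster + [best]) > T:
                cluster.append(best)
                free.discard(best)
            else:
                break  # adding the best candidate would violate a criterion
        clusters.append(cluster)
    return clusters, sorted(free)
```

Elements left in the second return value are the singletons that could not join or seed any cluster at this threshold.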
Comparison of different values of P
P Value  Clusters  Singletons  Silhouette width  Avg Std 

T  30  6  0.1719  1.1740 
0.95 × T  10  4  0.5240  4.2715 
0.90 × T  10  1  0.5650  4.3176 
0.85 × T  10  0  0.5961  4.6604 
0.80 × T  10  0  0.5955  4.8175 
0.75 × T  10  0  0.5955  4.8175 
After the data are assigned to clusters, a final refinement step is applied to all points that have an average similarity score less than T when compared to the members of the cluster they have been assigned to. Each such point has its average similarity calculated with every cluster and is assigned to the cluster to which it is most similar (if this is not its original cluster, then the process moves the point to a different cluster; this assignment to a cluster is made even if the point’s average similarity to the members of its new cluster is less than T).
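A minimal sketch of this refinement step is given below. It assumes a symmetric similarity matrix `sim` and a list of clusters holding element indices; the name `refine` is hypothetical, and evaluating every point against the clusters as they stood before refinement (rather than updating them on the fly) is an assumption, since the text does not specify the order of updates.

```python
def refine(sim, clusters, T):
    """Move points whose average similarity to their own cluster is below T."""
    def avg_to(x, members):
        others = [m for m in members if m != x]
        if not others:
            return float("-inf")  # a lone point has no similarity to its own cluster
        return sum(sim[x][m] for m in others) / len(others)

    refined = [list(c) for c in clusters]
    for idx, members in enumerate(clusters):
        for x in members:
            if avg_to(x, members) >= T:
                continue  # point already fits its cluster well enough
            # Most similar cluster overall, accepted even if still below T.
            best = max(range(len(clusters)), key=lambda c: avg_to(x, clusters[c]))
            if best != idx:
                refined[idx].remove(x)
                refined[best].append(x)
    return refined
```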
Selecting the best threshold
The Dunn Index is the ratio min_{1 ≤ i < j ≤ m} δ(C_i, C_j) / max_{1 ≤ k ≤ m} Δ_k, where δ(C_i, C_j) is the intercluster distance between the centres of clusters i and j, and max_{1 ≤ k ≤ m} Δ_k is the maximum cluster size in the dataset, where cluster size is defined as the mean distance between all members of the cluster and the cluster centroid. The cluster centroid is the element with the maximum similarity to the other members of the cluster. The Silhouette width is the main factor used in deciding which threshold produces the best clustering, and the Dunn Index is used only as a tiebreaker to decide cases where two or more clusterings have the same Silhouette width.
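Both measures can be sketched from a pairwise distance matrix. The code below is an illustration under stated assumptions: `dist` is a symmetric list-of-lists of distances (for example one minus similarity), at least two clusters are supplied, the cluster centre is taken to be the member closest on average to the others, the centre itself is included when averaging member-to-centre distances, and the per-point silhouette uses the standard definition (b − a) / max(a, b). The function names are hypothetical.

```python
def medoid(dist, members):
    """Cluster centre: the member with the smallest total distance to the rest."""
    return min(members, key=lambda x: sum(dist[x][m] for m in members))

def dunn_index(dist, clusters):
    """Minimum centre-to-centre distance divided by the maximum cluster size."""
    centres = [medoid(dist, c) for c in clusters]
    min_sep = min(dist[a][b] for i, a in enumerate(centres) for b in centres[i + 1:])
    max_size = max(sum(dist[x][c] for x in members) / len(members)
                   for members, c in zip(clusters, centres))
    return min_sep / max_size

def silhouette_width(dist, clusters):
    """Mean silhouette over all points; singleton clusters contribute -1 each."""
    scores = []
    for idx, members in enumerate(clusters):
        for x in members:
            if len(members) == 1:
                scores.append(-1.0)
                continue
            # a: mean distance to own cluster; b: mean distance to nearest other cluster.
            a = sum(dist[x][m] for m in members if m != x) / (len(members) - 1)
            b = min(sum(dist[x][m] for m in other) / len(other)
                    for j, other in enumerate(clusters) if j != idx)
            scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)
```

Penalising singletons with −1 discourages thresholds that leave many points unclustered.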
Convergence
The Rand Index is R = (a + b) / (a + b + c + d), where a is the number of cases where two elements are members of the same cluster in both clusterings, b is the number of cases where two elements are members of different clusters in both clusterings, c is the number of cases where a pair of elements are in the same cluster for the first clustering and in different clusters for the second, and d is the number of cases where a pair of elements are members of different clusters for the first clustering and members of the same cluster for the second clustering. Using all six different pairs of clusterings we calculate the average Rand Index.
We use the Rand Index because it is a widely accepted measure of concordance between different clusterings (here, the four clusterings produced by the four runs) and not as a maximization metric compared to some original classification. If the average Rand Index is high (>0.99), this means that most of the runs report near-identical clusterings with no significant differences. Hence, this is enough for the algorithm to converge and report the clustering with the highest Silhouette width. As mentioned above we want to be very confident in the resulting clusters, therefore a very strict average Rand Index of 0.99 (which allows for a limited number of differences in assignment of borderline cases) is applied as a cutoff. In the case of an average Rand Index less than 0.99, we consider that we have found significantly different clusterings. Then, an instance of the clustering with the lowest (or equal lowest) Silhouette width is removed, even if this outcome has been found two or three times, and another randomization is done. This procedure is repeated until convergence.
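The convergence test can be sketched as below, with each clustering represented as a list of cluster labels, one per element. The function names are illustrative, and the pair counts follow the four cases defined above: agreeing pairs are those that are together in both clusterings or apart in both.

```python
from itertools import combinations

def rand_index(labels1, labels2):
    """Fraction of element pairs on which two clusterings agree."""
    agree = total = 0
    for i, j in combinations(range(len(labels1)), 2):
        same1 = labels1[i] == labels1[j]
        same2 = labels2[i] == labels2[j]
        agree += same1 == same2  # together in both, or apart in both
        total += 1
    return agree / total

def has_converged(runs, cutoff=0.99):
    """Average Rand Index over all pairs of runs (six pairs for four runs)."""
    pairs = list(combinations(runs, 2))
    mean_ri = sum(rand_index(x, y) for x, y in pairs) / len(pairs)
    return mean_ri >= cutoff, mean_ri
```

Because the index depends only on co-membership, two runs that number their clusters differently but group the points identically still score 1.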
Pseudocode
I. Do four times:
   Stage 1: Calculate D (the distribution of E[X]’s).
   1. Do the specified “randomization” 1000 times:
      i. Randomly select a number of clusters k.
      ii. Randomly assign each data point α to a cluster c.
      iii. ∀ clusters c, calculate E[X] for the pairwise point-point similarities within c and include this value of E[X] in D.
   2. For each of the ten percentiles {95.00%, 97.50%, 99.00%, 99.14%, 99.29%, 99.43%, 99.57%, 99.71%, 99.86% and 100.00%} of the distribution D of intracluster similarities, and for ten further thresholds corresponding to the second to eleventh highest values, retrieve a threshold value T.
      Stage 1A: Clustering
      i. While any α in the dataset remains unclustered:
         a. Join the two most similar currently unclustered elements to form a new cluster, provided criteria in b. are met.
         b. Calculate average similarity of each currently unclustered data point to the current cluster and keep adding the most similar available data point as a member as long as both:
            E[X] of the cluster > T, and
            the average similarity of the new member to the existing members of the current cluster > 0.85×T.
      Stage 1B: Clustering Refinement
      ii. ∀ α ∈ any c, retrieve its average similarity with all the members of its current cluster. If this average similarity < T then:
         a. If its average similarity with elements of any other cluster is more than that with the parent cluster, move the point α to this other cluster.
      iii. Measure the Silhouette width, averaged over all points with singletons each contributing −1, and the Dunn Index for the final clustering for this T value.
   3. Return the T value and resultant clustering with the best Silhouette width as the result of the run; in the event of a tie, use the Dunn Index to decide.
II. Repeat until Convergence:
   Stage 2: Convergence (measure the Rand Index between each of the four runs)
   1. If average Rand Index amongst all 6 pairs taken from the 4 clusterings ≥ 0.99, return the clustering with the best Silhouette width as the final result (algorithm converged).
   2. If this average Rand Index < 0.99, the algorithm has not converged and the clustering with the lowest Silhouette width is discarded and we repeat Stage 1 a single time to generate a new clustering.
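Stage 1 of the pseudocode (randomization followed by threshold retrieval) might be sketched as follows. Several details are assumptions not stated in the text: k is drawn uniformly from 1 to n, clusters with fewer than two members are skipped because E[X] is undefined for them, and percentiles use the nearest-rank rule. The function names are hypothetical.

```python
import math
import random
from itertools import combinations

PERCENTILES = (95.0, 97.5, 99.0, 99.14, 99.29, 99.43, 99.57, 99.71, 99.86, 100.0)

def sample_expectations(sim, n_rounds=1000, seed=0):
    """Build D: E[X] values of clusters obtained from random partitions."""
    rng = random.Random(seed)
    n = len(sim)
    D = []
    for _ in range(n_rounds):
        k = rng.randint(1, n)                         # random number of clusters
        labels = [rng.randrange(k) for _ in range(n)]  # random assignment
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            if len(members) < 2:
                continue  # no pairwise similarities to average
            pairs = list(combinations(members, 2))
            D.append(sum(sim[i][j] for i, j in pairs) / len(pairs))
    return sorted(D)

def candidate_thresholds(D):
    """Ten percentile thresholds plus the 2nd to 11th highest values of sorted D."""
    by_percentile = [D[max(0, math.ceil(p / 100 * len(D)) - 1)] for p in PERCENTILES]
    next_highest = [D[-r] for r in range(2, 12) if r <= len(D)]
    return by_percentile + next_highest
```

Each of the resulting twenty values would then be used as T in the clustering and refinement stages.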
Validation
In order to validate PFClust, we used a number of synthetic 2D datasets. The first dataset consisted of 3000 2D vectors distributed over 20 groups; for each of these groups, the probability density function falls off with distance from its centre according to a normal distribution. Hence the groups are approximately circular. Each group corresponds to the external gold standard definition of a cluster. We also used subsets of 300 and 450 2D vectors, respectively composed of two and three out of the 20 groups in the 3000 vector dataset. The second dataset consisted of 5000 2D vectors distributed over 15 groups, which vary in shape. Finally, the third dataset consisted of 928 2D vectors distributed over 20 clusters, which all have different member densities. In each dataset, the centres are chosen such that there is no significant overlap between groups, though a handful of outlier points appear within an apparently ‘wrong’ group. We performed three different experiments based on the first dataset, in order to illustrate that the method was not finely tuned for a specific number of clusters or cluster structure. We define similarity as one minus the normalized (such that the most distant point is one unit from the origin of the coordinate system) Euclidean distance between two points. We would therefore expect PFClust to perform optimally where the clusters are approximately circular.
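The similarity measure defined above can be sketched as follows: coordinates are rescaled so that the point farthest from the origin lies at unit distance, and similarity is one minus the Euclidean distance between the rescaled points. The function name is illustrative, and the sketch assumes at least one point lies away from the origin.

```python
import math

def similarity_matrix(points):
    """Similarity = 1 - Euclidean distance after scaling the farthest point to unit norm."""
    scale = max(math.hypot(x, y) for x, y in points)  # distance of the farthest point
    scaled = [(x / scale, y / scale) for x, y in points]
    return [[1.0 - math.dist(p, q) for q in scaled] for p in scaled]

sim = similarity_matrix([(0, 0), (3, 4), (6, 8)])
print(sim[0][1], sim[0][2])
```

Under this normalisation two points can be up to two units apart, so similarities fall between −1 and 1 rather than between 0 and 1.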
On all our datasets, we run PFClust as well as six other current state-of-the-art algorithms. These are (i) the hierarchical clustering algorithm Hierarchy [6]; (ii) the hierarchical AGglomerative NESting (Agnes) [22]; (iii) the partitional k-means clustering algorithm [7]; (iv) Clustering Large Applications (Clara) [23], which is based on repeated k-medoids clustering of samples; (v) Density-Based Algorithm for Discovering Clusters in Large Spatial Databases (DBSCAN) [24]; and (vi) Model-Based Clustering (Mix Model) [25]. We used available implementations of each of these methods in the statistical software suite R [26]; the relevant packages are listed in Additional file 1: Table S1. These algorithms cannot all be compared on the ability to find the optimal number of clusters. Only PFClust and DBSCAN amongst the methods considered here can do this, and in fact the latter algorithm requires two parameters to be optimised before it decides the number of clusters. Hence, for the five other methods, we will use the externally defined ‘correct’ number of clusters (this definition including singletons in the count of clusters) as a given parameter and compare how well each algorithm clusters the data compared to the original classification. In order to compare the different clustering approaches, we selected the Rand Index as a measure of agreement between the externally known ‘correct’ clustering and that produced by a clustering algorithm.
We ran k-means and Clara 100 times each on every dataset and have selected as the final result for each algorithm the one with the best Silhouette width. For the epsilon parameter of DBSCAN, the maximum permitted distance between a point and its closest intracluster neighbour, we examined all values between 0 and 1 with a step size of 0.001. We also iterated the minpoints parameter, the minimum number of members allowed in a valid cluster, using all integer values from 1 to 100. This resulted in 10^{5} clustering outputs, from which the one with the maximum Silhouette width was selected.
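The size of the DBSCAN parameter search can be checked with a short sketch. It assumes the epsilon grid runs from 0.001 to 1.000 (excluding zero, which gives 1000 values) so that the product with the 100 minpoints values is 10^5; `score` is a hypothetical caller-supplied function standing in for running DBSCAN and measuring the Silhouette width of the result.

```python
from itertools import product

def best_dbscan_setting(score):
    """Exhaustive grid search; returns the best (eps, min_points) and the grid size."""
    eps_grid = [i / 1000 for i in range(1, 1001)]  # 0.001, 0.002, ..., 1.000
    min_points_grid = range(1, 101)                # 1, 2, ..., 100
    grid = list(product(eps_grid, min_points_grid))
    best = max(grid, key=lambda setting: score(*setting))
    return best, len(grid)
```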
As an addendum to the main work, we tested the use of the Silhouette width as a characteristic measure from which to decide the correct number of clusters. We ran the deterministic methods once each. We also ran the stochastic Clara and k-means algorithms 100 times each for every number of clusters, k, between 2 and 50. The run with the best Silhouette width for a given algorithm was selected, thus deciding the number of clusters to report.
Protein fold clustering using polar Fourier expansions
where y_{lm}(θ, φ) are the real spherical harmonics, N is the order of the highest polynomial power of the expansion, R_{nl}(r) are radial functions, and α_{nlm} are the expansion coefficients which are calculated as described previously [29]. Mavridis et al. proposed in the same paper a novel structure-based indexing for existing classification schemes such as CATH [19] and SCOP [20]. Their proposed consensus algorithm works well for only some of the cases it was tested on, because of the structural diversity of a number of protein domains assigned to the same superfamilies [28]. Hence, methods such as SPF would greatly benefit from an automated clustering algorithm, such as PFClust, which could identify the structure of a dataset without any prior knowledge or parameter tuning.
In order to illustrate that PFClust could be used to provide such a clustering using the SPF descriptors, we performed the following study. We randomly selected 11 CATH superfamilies, which had in total 224 non-redundant representative structures, and used the SPF descriptors to calculate the all-against-all similarity matrix of these protein domains. We then used PFClust to cluster the protein domain structures based on these similarities.
Results and discussion
Original dataset
Comparison of the clustering methods based on popular metrics (Rand Index for each dataset)
Method  300  450  1500  3000  5000  CATH  Density  Average 

Hierarchy  0.883  0.942  0.896  0.920  0.981  0.964  0.928  0.931 
Agnes  0.947  0.973  0.836  0.820  0.976  0.906  0.975  0.919 
Clara  0.960  0.980  0.952  0.948  0.987  1.000  0.956  0.969 
k-means  0.960  0.980  0.958  0.966  0.986  0.738  0.901  0.927 
Mix Model  0.960  0.980  0.959  0.911  0.990  1.000  0.977  0.968 
PFClust  0.960  1.000  0.958  0.949  0.986  0.996  0.976  0.975 
DBSCAN  0.973  0.973  0.930  0.921  0.978  0.977  0.688  0.920 
Validation
Table 2 summarizes the results for all methods on all the different datasets used in this study. We can see from the results that PFClust is the top performing algorithm on average with a mean Rand Index of 0.975, though it perfectly reproduces the original clusters for only one dataset (450 2D vectors).
Another challenge for algorithms is describing datasets that have different densities of points in their original clusters; the 928 2D vector dataset consists of 20 clusters whose densities vary. This is a subset of the D31 dataset taken from [30], which consists of 31 circular clusters with 100 members each. We then chose a 20 cluster subset of the original dataset and from each cluster we randomly selected a different number of members (varying from 5% to 95%).
PFClust - supervised mode
Supervised and Unsupervised timings for PFClust
Dataset  Rand index  Randomizations  Clustering  Total execution time (runs) 

CATH folds  0.996  4.3 s  1.8 s  25 s 
300 2D Vectors  0.960  8 s  2 s  40 s 
450 2D Vectors  1.000  25 s  8 s  2 m 38 s 
1500 2D Vectors  0.958  13 m 22 s  5 m 47 s  1 h 22 m 
3000 2D Vectors  0.949  2 h 24 m 27 s  1 h 14 m 38 s  12 h 
5000 2D Vectors  0.986  11 h 35 m 30 s  5 h 20 m 20 s  81 h 50 m 30 s 
Density Dataset  0.976  6 m 20 s  2 m 15 s  36 m 
1500 Supervised  0.958  -  1 m 50 s  2 m 57 s 
3000 Supervised  0.951  -  12 m 20 s  20 m 39 s 
5000 Supervised  0.986  -  1 h 9 m 30 s  1 h 21 m 36 s 
The table summarizes the timings for convergence of PFClust with the different datasets. The total execution time typically includes four randomization and four clustering runs, but in the case of the 5000 2D vectors, the “0.99 average Rand Index between the four clusterings” criterion was not met until a fifth run had been carried out. The second column in the table shows the Rand Index of the final clustering against the original gold standard cluster definitions.
There exist cases when one wants to cluster large groups of data and time efficiency is very important. For those cases, a supervised mode of PFClust has been implemented in order to significantly speed up the process. The supervised mode of PFClust addresses the cost issue by applying an initial clustering on a training set and estimating a number of thresholds that are then applied to the full dataset. The training set should be a small subset of the data, representing some coherent groups or clusters amongst the full dataset that we wish to cluster.
PFClust clusters the training set and uses the three best performing thresholds to estimate a total of nine threshold values (these are selected to allow for some variation). Then these nine values are applied on the full dataset and the clustering with the best Silhouette width is reported. Using the supervised version of the algorithm the time-consuming randomization step is removed, which results in a significant speed-up of the total process.
Protein fold clustering using polar Fourier expansions
CATH superfamilies selected for the study
Superfamily  Name  Members 

1.20.140.10  Butyryl-CoA Dehydrogenase, subunit A, domain 3  13 
1.25.40.20  -  21 
2.80.10.50  -  40 
2.30.42.10  -  55 
2.40.100.10  Cyclophilin  10 
2.40.110.10  Butyryl-CoA Dehydrogenase, subunit A, domain 2  7 
3.30.500.10  Murine Class I Major Histocompatibility Complex, H2-DB, subunit A, domain 1  13 
3.40.50.80  Nucleotide-binding domain of ferredoxin-NADP reductase (FNR) module  14 
3.40.50.1220  TPP-binding domain  16 
3.90.110.10  L-2-Hydroxyisocaproate Dehydrogenase, subunit A, domain 2  11 
3.90.79.10  Nucleoside Triphosphate Pyrophosphohydrolase  24 
Finding the correct number of clusters
We see that in most cases, except for the density dataset, at least one method found the correct number of clusters by using the maximum Silhouette width as the stopping criterion. However, no method is consistently able to do this for the different datasets.
Conclusions
It has been shown that PFClust can accurately group data according to their similarities without the need for any parameter tuning. Our clustering results on the synthetic datasets not only show that PFClust provides structurally meaningful clusters, but also that it performs best when compared to six other well-known clustering algorithms. Clustering protein domains using a density representation gives excellent agreement with the CATH part-manually curated classification. In the future, the full CATH database could be automatically clustered based on such density representations of protein domains.
Declarations
Acknowledgements
This project has been carried out with the support of WADA. JBOM and NN thank the Scottish Universities Life Sciences Alliance (SULSA) and the Scottish Overseas Research Student Awards Scheme of the Scottish Funding Council (SFC) for financial support. We thank Dr Luna De Ferrari for her valuable comments on this manuscript.
Availability
By request to the corresponding author, lazaros.mavridis.lm@gmail.com.
References
1. Harlow TJ, Gogarten JP, Ragan MA: A hybrid clustering approach to recognition of protein families in 114 microbial genomes. BMC Bioinformatics. 2004, 5: 45. 10.1186/1471-2105-5-45.
2. Zhu X: Semi-Supervised Learning Literature Survey. Technical Report 1530, Department of Computer Sciences. 2005, Madison: University of Wisconsin, http://pages.cs.wisc.edu/~jerryzhu/pub/ssl_survey.pdf
3. Pise NN, Kulkarni P: A Survey of Semi-Supervised Learning Methods. International Conference on Computational Intelligence and Security. 2008, 30: 34 [http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=4724730]
4. Jain A, Murty M, Flynn P: Data clustering: a review. ACM Comput Surv. 1999, 31: 264-323.
5. Handl J, Knowles J, Kell DB: Computational cluster validation in post-genomic data analysis. Bioinformatics. 2005, 21: 3201-3212. 10.1093/bioinformatics/bti517.
6. Lance BGN, Williams WT: A general theory of classificatory sorting strategies 1. Hierarchical systems. Comput J. 1967, 9: 373-380. 10.1093/comjnl/9.4.373.
7. Jain AK: Data clustering: 50 years beyond K-means. Pattern Recognition Letters. 2010, 31: 651-666. 10.1016/j.patrec.2009.09.011.
8. Tibshirani R, Walther G, Hastie T: Estimating the number of clusters in a data set via the gap statistic. Journal of the Royal Statistical Society Series B Statistical Methodology. 2001, 63: 411-423. 10.1111/1467-9868.00293.
9. Giancarlo R, Scaturro D, Utro F: Computational cluster validation for microarray data analysis: experimental assessment of Clest, Consensus Clustering, Figure of Merit, Gap Statistics and Model Explorer. BMC Bioinformatics. 2008, 9: 462. 10.1186/1471-2105-9-462.
10. Rousseeuw P: Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. J Comput Appl Math. 1987, 20: 53-65.
11. Dunn JC: A Fuzzy Relative of the ISODATA Process and Its Use in Detecting Compact Well-Separated Clusters. J Cybernetics. 1973, 3: 32-57. 10.1080/01969727308546046.
12. Bezdek JC, Pal NR: Some new indexes of cluster validity. IEEE transactions on systems, man, and cybernetics. Part B, Cybernetics. 1998, 28: 15. 10.1109/3477.658574.
13. Rand WM: Objective Criteria for the Evaluation of Clustering Methods. J Am Stat Assoc. 1971, 66: 846-850. 10.1080/01621459.1971.10482356.
14. Akoglu L, Tong H, Meeder B, Faloutsos C: PICS: Parameter-free Identification of Cohesive Subgroups in Large Attributed Graphs. 2012, Anaheim, CA: SDM
15. Shenoy SR, Jayaram B: Proteins: sequence to structure and function - current status. Curr Protein Pept Sci. 2010, 11: 498-514. 10.2174/138920310794109094.
16. Wu CH, Huang H, Arminski L, Castro-Alvear J, Chen Y, Hu ZZ, Ledley RS, Lewis KC, Mewes HW, Orcutt BC, Suzek BE, Tsugita A, Vinayaka CR, Yeh LSL, Zhang J, Barker WC: The Protein Information Resource: an integrated public resource of functional annotation of proteins. Nucleic Acids Res. 2002, 30: 35-37. 10.1093/nar/30.1.35.
17. Finn RD, Mistry J, Tate J, Coggill P, Heger A, Pollington JE, Gavin OL, Gunasekaran P, Ceric G, Forslund K, Holm L, Sonnhammer ELL, Eddy SR, Bateman A: The Pfam protein families database. Nucleic Acids Res. 2010, 38: D211-D222. 10.1093/nar/gkp985.
18. Berman HM, Westbrook J, Feng Z, Gilliland G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE: The Protein Data Bank. Nucleic Acids Res. 2000, 28: 235-242. 10.1093/nar/28.1.235.
19. Cuff AL, Sillitoe I, Lewis T, Redfern OC, Garratt R, Thornton J, Orengo CA: The CATH classification revisited - architectures reviewed and new ways to characterize structural divergence in superfamilies. Nucleic Acids Res. 2009, 37: D310-314. 10.1093/nar/gkn877.
20. Murzin AG, Brenner SE, Hubbard T, Chothia C: SCOP: a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol. 1995, 247: 536-540.
21. Berman HM: The Protein Data Bank: a historical perspective. Acta Crystallographica Section A Foundations of Crystallography. 2008, 64: 88-95. 10.1107/S0108767307035623.
22. Kaufman L, Rousseeuw PJ: Finding Groups in Data: An Introduction to Cluster Analysis. 1990, New York: Wiley
23. Wei C: Empirical Comparison of Fast Clustering Algorithms for Large Data Sets. Experts Systems with Applications. 2003, 24: 351-363. 10.1016/S0957-4174(02)00185-9.
24. Ester M, Kriegel HP, Sander J, Xu X: A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. Proceedings of 2nd International Conference on Knowledge Discovery and Data Mining. 1996, 226-231. KDD-96
25. Fraley C, Raftery AE: Model-based clustering, discriminant analysis, and density estimation. J Am Stat Assoc. 2002, 97: 611-631. 10.1198/016214502760047131.
26. R: A language and environment for statistical computing; R development core team. 2005, Vienna, Austria: R foundation for statistical computing, http://www.r-project.org/
27. Mavridis L, Ritchie DW: 3D-Blast: 3D protein structure alignment, comparison, and classification using spherical polar Fourier correlations. Pac Symp Biocomput. 2010, 281-292.
28. Mavridis L, Ghoorah AW, Venkatraman V, Ritchie DW: Representing and comparing protein folds and fold families using three-dimensional shape-density representations. Proteins: Structure, Function and Bioinformatics. 2011, 80: 530-545.
29. Ritchie DW, Kemp GJ: Protein docking using spherical polar Fourier correlations. Proteins. 2000, 39: 178-194. 10.1002/(SICI)1097-0134(20000501)39:2<178::AID-PROT8>3.0.CO;2-6.
30. Veenman CJ, Reinders MJT, Backer E: A maximum variance cluster algorithm. Pattern Analysis and Machine Intelligence, IEEE Transactions. 2002, 24: 1273-1280. 10.1109/TPAMI.2002.1033218.
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.