# ClusCo: clustering and comparison of protein models

- Michal Jamroz
^{1}and - Andrzej Kolinski
^{1}Email author

**14**:62

https://doi.org/10.1186/1471-2105-14-62

© Jamroz and Kolinski; licensee BioMed Central Ltd. 2013

**Received: **16 November 2012

**Accepted: **17 February 2013

**Published: **22 February 2013

## Abstract

### Background

The development, optimization and validation of protein modeling methods require efficient tools for structural comparison. Frequently, a large number of models need to be compared with the target native structure. The main reason for the development of Clusco software was to create a high-throughput tool for all-versus-all comparison, because calculating similarity matrix is the one of the bottlenecks in the protein modeling pipeline.

### Results

Clusco is fast and easy-to-use software for high-throughput comparison of protein models with different similarity measures (cRMSD, dRMSD, GDT_TS, TM-Score, MaxSub, Contact Map Overlap) and clustering of the comparison results with standard methods: K-means Clustering or Hierarchical Agglomerative Clustering.

### Conclusions

The application was highly optimized and written in C/C++, including the code for parallel execution on CPU and GPU, which resulted in a significant speedup over similar clustering and scoring computation programs.

## Background

The development, optimization and validation of protein modeling methods require efficient tools for structural comparison. Frequently, a large number of models need to be compared with the target native structure. There are numerous measures of model similarity. The most popular is the cRMSD - coordinate Root Mean-Square Deviation (after the best superimposition) [1]. The other popular scores are: GDT_TS - Global Distance Test Total Score [2], MaxSub - Maximal Substructure [3], TM-Score - Template-Modeling Score [4], or dRMSD - distance Root Mean-Square Deviation [5].

One of the methodologies most widely used for protein modeling includes performing the clustering step after generating a protein conformation ensemble [6-10] followed by the selection of a representative model (or models) for refinement. To achieve this, we need a similarity matrix of the whole ensemble, which contains all-versus-all comparison (for *N* conformers it gives *N*(*N*−1)/2 of score calculation). However, many available applications are not optimized for running time, because they were developed rather for simple pair comparison.

The main reason for the development of Clusco software was to create a high-throughput tool for all-versus-all comparison, because calculating similarity matrix is the one of the bottlenecks in the protein modeling pipeline.

## Implementation

The implementation of the similarity measures was performed using OpenMP API which supports multiprocessing programming. Additionally, the cRMSD algorithm was coded on the Graphic Processor Unit (GPU) architecture using nvidia cuda, which gave an over ten fold speedup in comparison with one CPU.

We used an open source parallel K-means [11] clustering code, implemented with OpenMP and serial C Clustering Library [12] for Hierarchical Agglomerative Clustering (single-linkage, maximum-linkage, average-linkage).

### Algorithms

**cRMSD**Coordinates Root Mean-Square deviation is defined as:

*N*is the number of atoms, ${x}_{i}^{A}$ -

*i*-th atom position vector of protein

*A*. There is no need to superimpose structures (calculating rotation matrix) to obtain the cRMSD. By the diagonalization of the 3×3 covariance matrix

*M*, we obtain the cRMSD value by [13]:

where *R*_{
A
} is the radius of gyration of protein *A*, *S* is the sign of the covariance matrix determinant, *λ* is the eigenvalue (sorted in descending order) of the square of the covariance matrix. These eigenvalues can be computed by finding roots of the cubic equation instead of computational demanding Singular Value Decomposition of the covariance matrix.

**GDT_TS**Global Distance Test Total Score is defined as:

where *C*_{1Å} is the number of atom pairs below the 1Å distance. Max denotes here the maximal value for a series of superimpositions.

The Global Distance Test algorithm is NP-hard [14], and all the GDT_TS computing algorithms use their own heuristics. Our GDT_TS algorithm is as follows: 1) divide the chain into all possible continuous 4,*N*/4,*N*/2,*N* fragments, and 2) use them as initial superimposition fragments (i.e. superimpose the whole structure by a rotation matrix computed by superimposing each fragment), 3) find atom pairs which are closer by cutoff (1,2,4,8[Å]), 4) select atoms which are closer than 3.5Å and use them for another superimposition until the number of selected atoms does not change in four iterations.

**TM-Score, MaxSub** (Template Modeling Score and Maximum Subset, respectively) [3, 4] - both scores are variations of Levitt and Gerstein (1998) score.

where ${d}_{0}=1.24\sqrt[3]{N-15}-1.8\left[\text{\xc5}\right]$, and MaxSub is defined as the TM-Score with *d*_{0}=3.5Å. For the calculation of both scores we used the same searching algorithm as for GDT_TS, which means that the computational costs of GDT_TS, MaxSub, and the TM-Score will be the same.

**dRMSD**distance Root Mean-Square Deviation [5]. This score is the deviation of inter-molecular distance matrices:

where ${d}_{\mathit{\text{ij}}}^{A}$ is the distance between *i* and *j* atoms of protein model *A*. Note that the representation of molecule as a distance matrix causes loss of information about chirality.

**CMO**Contact Map Overlap [15]. Using a representation of the protein structure as a binary matrix

*C*, defined as:

*A*,

*B*:

where *n*(*A*) is the number of contacts in protein *A*.

## Results and discussion

Using the cRMSD, dRMSD, MaxSub, GDT_TS, TM-Score, CMO as a similarity measure, Clusco can calculate all-versus-all (or with respect to the reference model) scores of proteins from a one-column list file or using multimodel pdb file. The calculated results may be used, for example, as a similarity matrix input for clustering algorithms or clustered by Clusco itself.

In this section we show the examples of usage and the performance of Clusco with respect to other similar programs. All tests were performed on a box with intel E5649 CPU (24 threads), nvidia GeForce GTX 470 GPU and 24GB of RAM. The computation time elapsed was assessed by the standard *NIX “time” program.

### Selecting of pairs of models within a given cRMSD threshold

Recently, Fogolari and coworkers [16] described an algorithm for reducing the computational cost of all-versus-all comparison of protein models using cRMSD by inverse triangle inequality. As an example of that idea, the authors provided fsss software and 1ctf protein models from 4state_reduced decoy set (Decoys ’R’ Us) [17] as an input ensemble. The fsss software outputs pairs of models with cRMSD below a given threshold (3.2 Å in this example).

We compared fsss and Clusco based on this dataset, recording the time spent by one CPU performing the task. Note that Clusco computes all-versus-all scores by default, and to get results similar to the ones obtained from fsss (only pairs below given threshold) we needed to filter the output by awk (standard *NIX program): clusco -l 4state_reduced_1ctf.list -s rmsd -o 4state . tmp; awk ‘{if($3<3.2) print $line}' 4state.tmp > output.rmsd.

fsss software spent 149.14 seconds on this task, and Clusco+awk spent 0.46 (Clusco) + 0.11 (awk) seconds, which amounts to 261× speedup. Note that it is possible to improve these results, using parallel execution of Clusco (by simply defining shell variable OMP_NUM_THREADS before Clusco execution).

### Clustering of protein decoys from five independent molecular dynamics trajectories

The decoy set vhp_mcmd [18] from Decoys ’R’ Us database contains the results of five (NATIVE, F1, F3, F4, F7) Molecular Dynamics simulations of the thermostable subdomain from chicken villin headpiece (36 residues, pdb code: 1vii), starting from different protein conformations. The set contains 6256 of villin conformations in total, in the range of 0.49 - 12.8 Å cRMSD to the experimental structure deposited in the Protein DataBank.

Using cRMSD and each of the Clusco clustering schemes, it is possible to separate this decoy set roughly into former trajectories, as we show in Additional file 1: Table S1. Each of the Hierarchical Agglomerative methods divides decoys into rather separate clusters i.e. more than 85% of trajectory models create a separated cluster, while of “NATIVE” and “F1” models create one common cluster (which is the result of “F1” convergence to the native structure during simulation - we refer the interested readers to Figure two in [18]). Other clustering scheme in K-means results in the grouping of “NATIVE” and “F1” models into four separated clusters.

Command to perform clustering described above: clusco -l vhp_mcmd . list -s rmsd <0 , 1 , 2 , 3> 8.

The running time for this dataset varied from 5 seconds (for K-means clustering and GPU cRMSD comparison), to 3.5 minutes (for average-linkage, Hierarchical Agglomerative Clustering and CPU cRMSD comparison, see Additional file 1: Table S1.

### Selecting representative model from *de-novo* modeling decoy set

Clustering of protein models after *de-novo* simulations is one the methods most commonly used for the selection of the representative model from the decoys set [6-10]. We compared Clusco with other clustering software (durandal[19], calibur[20], spicker[21]) in terms of results and computation time. To do this, we used public available I-Tasser [22] decoys set, containing 12500-32000 models for each of 56 modelled target protein.

calibur uses preprocessing of decoy set in three ways: screening-out unlikely candidates by setting lower and upper cRMSD bounds, using triangular inequality for assessing if particular model is within the threshold distance from a group of models (which reduces the number of structure comparisons), detecting and ignoring outlier decoys. durandal uses triangular inequality (like calibur) for the approximation of cRMSD value of randomly chosen decoy and fill-up distance matrix until it contains proper amount of information for the next, clustering step. spicker selects the decoy with the largest number of structurally similar decoys (by automatically detected threshold value) and creates the first cluster. The process is repeated to get a sufficient number of clusters.

Clusco was run with cRMSD as a similarity measure (just as durandal, calibur and spicker) and K-means as a clustering method. We set number of clusters to 20: clusco -l list -s rmsd 0 20.

The Clusco representative model was selected by min(<*R*>/*f*), where *f* denoted the fraction of elements in particular cluster and <*R*> - the average cRMSD between cluster elements.

For the comparison of software reliability, we calculated tm-score to the experimental structure (Additional file 1: Table S2) and Z-score of the tm-score (where Z-score <0 means that a model is worse than the average structure of the decoy set, for detailed results see Additional file 1: Table S3).

According to the average tm-score, all of the programs gave similar results: all, except durandal, gave the average tm-score 0.59, and durandal gave the score of 0.58. According to Z-score, the best algorithms were calibur and Clusco (49/56 of the models with Z-Score above zero), followed by spicker and durandal (45/56 and 41/56 respectively).

**Total time for clustering of decoys**

Program | Total time(min) | 1thx_ | 2reb_2 |
---|---|---|---|

spicker | 435 | 10 | 4 |

durandal | 140 | 9 | 0.9 |

calibur | 859 | 64 | 1.2 |

Clusco 1CPU | 426 | 32 | 1.8 |

Clusco 1CPU 1GPU | 213 | 19 | 0.7 |

Clusco 2CPU | 219 | 16 | 0.9 |

Clusco 2CPU 1GPU | 146 | 12 | 0.4 |

Clusco 4CPU | 131 | 11 | 0.5 |

Clusco 4CPU 1GPU | 106 | 7 | 0.4 |

Clusco 23CPU | 47 | 3 | 0.3 |

Analyzing the above results, we can conclude that Clusco gives results which are as good as the ones provided by the state-of-the-art calibur in half of the time, however, on the commonly used today multicore machines, our program gives results in the time about 18× shorter than calibur.

### Comparison of structures with reference/experimental model

This command will compute the tm-score for each of tra.pdb models, saving the results into output.txt. If OMP_NUM_THREADS variable was not set, program will utilize all available CPU’s.

We recorded the computation time of the tm-score to the reference (experimental) structure with Clusco and the original TM-Score software [4] using the decoy set mentioned in the previous paragraph. TM-Score performed the task in 68 minutes, Clusco on 1 CPU - in 53 minutes (speedup of about 1.2×), but when we ran Clusco on 12 CPU’s, it completed the task in 7 minutes (speedup about 10×) (detailed data in Additional file 1: Table S4).

It must be noted that the computation time for GDT_TS and MaxSub will be mostly identical, since all of these algorithms use the same method for selecting fragments of structure. Optionally, it is possible to compute more exact GDT_TS score with Clusco by using -s gdtExt flag - in this particular case Clusco will split structures into many more fragments.

Recently Hung & Samudrala [24] published an algorithm for the computation of all-versus-all tm-score on amd GPU and CPU. We compared Clusco with this algorithm using the exemplary data attached to the program package (1000 models of 140 residues). Clusco on one CPU completed the computation in 53.65 minutes, Hung & Samudrala code - in 57.18 minutes, but Clusco can achieve pronounced speedup if executed in multi-CPU fashion (13.66 minutes on 4 CPU’s), which was not implemented in the Hung & Samudrala algorithm (see Figure S1 in Additional file 1 for tm-score values comparison). However, users with access to the amd GPU can complete this task significantly faster with Hung & Samudrala algorithm.

## Conclusions

We presented here versatile software for comparison and clustering of protein structures, optimized for novel multicore computers. We showed cuda implementation of cRMSD algorithm which may be usable for creating of proteins similarity matrices (a bottleneck of the clustering software) as an input for more efficient clustering algorithms.

However, up till now, not many computers are equipped with graphics cards capable of floating point computation, but most of them are equipped with multicore processors. Our software accounts for this situation by containing a parallel code for all described here comparison methods. It results in great-to-moderate speedup over an existing serial execution algorithms, together with clustering results as good as obtained using the state-of-the-art method, calibur.

Clusco is able to cluster small-to-moderate protein decoys with scoring functions other than the cRMSD, i.e. the TM-Score, dRMSD, GDT_TS, MaxSub, Contact Map Overlap, especially on many-core machines, which is unique.

Clusco may be useful for protein modeling community as an all-in-one, fast and easy in use software for daily lab work. It may be used as a standalone program for comparison or clustering of protein models or as a preprocessing tool for clustering algorithms, either as a compiled program or a fragment of Clusco’s source code.

## Availability and requirements

**Project name:** ClusCo**Project home page:**http://biocomp.chem.uw.edu.pl/clusco**Operating system:** gnu/Linux **Programming language:** c/c++, cuda **Other requirements:** OpenMP library (included in gcc ⩾ 4.2 compiler), optionally: cuda sdk and cuda compatible graphic card**License:** gnu gpl (scoring functions), Python License (Hierarchical Clustering library), custom license for K-means library (included in package)

## Declarations

### Acknowledgements

We would like to thank Dr Sebastian Kmiecik for reading the manuscript.

MJ acknowledge the support from a Project operated within the Foundation for Polish Science MPD Programme, co-financed by the EU European Regional Development Fund.

AK acknowledge support from the Foundation for Polish Science TEAM project (TEAM/2011-7/6) co-financed by the European Regional Development Fund operated within the Innovative Economy Operational Program.

## Authors’ Affiliations

## References

- Kabsch W:
**A solution for the best rotation to relate two sets of vectors.***Acta Crystallogr*1976,**32:**922-923. 10.1107/S0567739476001873View ArticleGoogle Scholar - Zemla A:
**LGA: a method for finding 3D similarities in protein structures.***Nucleic Acids Res*2003,**31**(13):3370-3374. 10.1093/nar/gkg571PubMed CentralView ArticlePubMedGoogle Scholar - Siew N, Elofsson A, Rychlewski L, Fischer D:
**MaxSub: an automated measure for the assessment of protein structure prediction quality.***Bioinformatics*2000,**16**(9):776-785. 10.1093/bioinformatics/16.9.776View ArticlePubMedGoogle Scholar - Zhang Y, Skolnick J:
**Scoring function for automated assessment of protein structure template quality.***Proteins*2004,**57**(4):702-710. 10.1002/prot.20264View ArticlePubMedGoogle Scholar - Torda AE, van Gunsteren WF:
**Algorithms for clustering molecular dynamics configurations.***J Comput Chem*1994,**15**(12):1331-1340. 10.1002/jcc.540151203View ArticleGoogle Scholar - Shortle D, Simons KT, Baker D:
**Clustering of low-energy conformations near the native structures of small proteins.***P Natl Acad Sci USA*1998,**95**(19):11158-11162. 10.1073/pnas.95.19.11158View ArticleGoogle Scholar - Kolinski A:
**Protein modeling and structure prediction with a reduced representation.***Acta Biochim Pol*2004,**51**(2):349-371.PubMedGoogle Scholar - Roy A, Kucukural A, Zhang Y:
**I-TASSER: a unified platform for automated protein structure and function prediction.***Nat Protoc*2010,**5**(4):725-738. 10.1038/nprot.2010.5PubMed CentralView ArticlePubMedGoogle Scholar - Xu D, Zhang Y:
**Ab initio protein structure assembly using continuous structure fragments and optimized knowledge-based force field.***Proteins*2012,**80**(7):1715-1735.PubMed CentralPubMedGoogle Scholar - Rohl CA, Strauss CEM, Misura KMS, Baker D:
**Protein structure prediction using Rosetta.***Methods Enzymol*2004,**383:**66-93.View ArticlePubMedGoogle Scholar - MacQueen JB:
**Some methods for classification and analysis of multiVariate observations.**In*Proc. of the fifth Berkeley Symposium on Mathematical Statistics and Probability, Volume 1*. Edited by: Cam LML, Neyman J.. Berkeley: University of California Press; 1967:281-297.Google Scholar - de Hoon MJL, Imoto S, Nolan J, Miyano S:
**Open source clustering software.***Bioinformatics*2004,**20**(9):1453-1454. 10.1093/bioinformatics/bth078View ArticlePubMedGoogle Scholar - Brüschweiler R:
**Efficient RMSD measures for the comparison of two molecular ensembles. Root-mean-square deviation.***Proteins*2003,**50:**26-34.View ArticlePubMedGoogle Scholar - Li SC, Bu D, Xu J, Li M:
**Finding nearly optimal GDT scores.***J Comput Biol*2011,**18**(5):693-704. 10.1089/cmb.2010.0123PubMed CentralView ArticlePubMedGoogle Scholar - Fraser R, Glasgow J:
**A demonstration of clustering in protein contact maps for alpha helix pairs.**In*Proceedings of the 8th international conference on Adaptive and Natural Computing Algorithms, Part I, Volume 4431*. Berlin, Heidelberg: Springer-Verlag; 2007:758-766.Google Scholar - Fogolari F, Corazza A, Viglino P, Esposito G:
**Fast structure similarity searches among protein models: efficient clustering of protein fragments.***Algorithms Mol Biol*2012,**7:**16. 10.1186/1748-7188-7-16PubMed CentralView ArticlePubMedGoogle Scholar - Samudrala R, Levitt M:
**Decoys ‘R’ Us: a database of incorrect conformations to improve protein structure prediction.***Protein Sci*2000,**9**(7):1399-1401. 10.1110/ps.9.7.1399PubMed CentralView ArticlePubMedGoogle Scholar - Fogolari F, Tosatto SCE, Colombo G:
**A decoy set for the thermostable subdomain from chicken villin headpiece, comparison of different free energy estimators.***BMC Bioinformatics*2005,**6:**301. 10.1186/1471-2105-6-301PubMed CentralView ArticlePubMedGoogle Scholar - Berenger F, Shrestha R, Zhou Y, Simoncini D, Zhang KYJ:
**Durandal: fast exact clustering of protein decoys.***J Comput Chem*2012,**33**(4):471-474. 10.1002/jcc.21988View ArticlePubMedGoogle Scholar - Li SC, Ng YK:
**Calibur: a tool for clustering large numbers of protein decoys.***BMC Bioinformatics*2010,**11:**25. 10.1186/1471-2105-11-25PubMed CentralView ArticlePubMedGoogle Scholar - Zhang Y, Skolnick J:
**SPICKER: a clustering approach to identify near-native protein folds.***J Comput Chem*2004,**25**(6):865-871. 10.1002/jcc.20011View ArticlePubMedGoogle Scholar - Wu S, Skolnick J, Zhang Y:
**Ab initio modeling of small proteins by iterative TASSER simulations.***BMC Biol*2007,**5:**17. 10.1186/1741-7007-5-17PubMed CentralView ArticlePubMedGoogle Scholar - Theobald DL:
**Rapid calculation of RMSDs using a quaternion-based characteristic polynomial.***Acta Crystallogr A*2005,**61**(Pt 4):478-80.View ArticlePubMedGoogle Scholar - Hung LH, Samudrala R:
**Accelerated protein structure comparison using TM-score-GPU.***Bioinformatics*2012,**28**(16):2191-2192. 10.1093/bioinformatics/bts345PubMed CentralView ArticlePubMedGoogle Scholar

## Copyright

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.