Calibur: a tool for clustering large numbers of protein decoys
 Shuai Cheng Li^{1}Email author and
 Yen Kaow Ng^{2}
DOI: 10.1186/147121051125
© Li and Ng; licensee BioMed Central Ltd. 2010
Received: 24 June 2009
Accepted: 13 January 2010
Published: 13 January 2010
Abstract
Background
Ab initio protein structure prediction methods generate numerous structural candidates, which are referred to as decoys. The decoy with the most number of neighbors of up to a threshold distance is typically identified as the most representative decoy. However, the clustering of decoys needed for this criterion involves computations with runtimes that are at best quadratic in the number of decoys. As a result currently there is no tool that is designed to exactly cluster very large numbers of decoys, thus creating a bottleneck in the analysis.
Results
Using three strategies aimed at enhancing performance (proximate decoys organization, preliminary screening via lower and upper bounds, outliers filtering) we designed and implemented a software tool for clustering decoys called Calibur. We show empirical results indicating the effectiveness of each of the strategies employed. The strategies are further finetuned according to their effectiveness.
Calibur demonstrated the ability to scale well with respect to increases in the number of decoys. For a sample size of approximately 30 thousand decoys, Calibur completed the analysis in one third of the time required when the strategies are not used.
For practical use Calibur is able to automatically discover from the input decoys a suitable threshold distance for clustering. Several methods for this discovery are implemented in Calibur, where by default a very fast one is used. Using the default method Calibur reported relatively good decoys in our tests.
Conclusions
Calibur's ability to handle very large protein decoy sets makes it a useful tool for clustering decoys in ab initio protein structure prediction. As the number of decoys generated in these methods increases, we believe Calibur will come in important for progress in the field.
Background
In ab initio protein structure predictions [1–5], it is often the case that a large set of candidates (called decoys) is generated, and one is required to select a single (or a small selection of) best candidate(s) from the set. One criterion for this selection is to choose decoys with more neighbors over decoys with fewer neighbors. The use of this criterion is well justified [6, 7], and there are a few tools which incorporate this strategy [1, 8, 9]. In the popular protein structure prediction systems ITASSER [2, 8] and ROSETTA [1], decoys are selected using the following procedure: Starting with the set of generated decoys, a threshold d is first decided. Then, from the set, the decoy with the most neighboring decoys within RMSD d from it is found, and is reported as the highest ranking decoy. (Ties are broken arbitrarily.) This decoy and all of its neighbors (the first cluster) are then removed from the set, after which the decoy with the most neighbors within RMSD d is again found. This decoy is then reported as the second highest ranking decoy, and together with all its neighbors (the second cluster) are removed from the set. Similarly the third highest ranking decoy is then found, and so on.
Current implementations of this procedure evaluate pairwise RMSD (or approximate values) of the decoys, resulting in runtimes which are at best quadratic in the number of decoys. As the number of decoys grows to the tens of thousands, this method becomes infeasible, necessitating the development of faster methods. In this paper we propose three strategies to speed up the procedure, with no compromise on the clustering performed. That is, the resultant method produces exactly the same clusters that are produced based on pairwise comparison, but only faster. In the first strategy we create auxiliary groups of proximate decoys. This allows us to, through the use of triangular inequality, deduce if a group of decoys is (or is not) within the threshold distance from a given decoy. Our second strategy is to use efficiently computable lower and upper bounds of the RMSD to preliminary screen out unlikely candidates. Thirdly, outlier decoys can be detected and removed prior to the clustering. These strategies are implemented in an opensource tool called Calibur.
Implementation
where ℛ is the set of all rotations and the set of all translations.
Strategy 1: Auxiliary grouping of decoys
We want a grouping such that each decoy belongs to exactly one group, and is at most C_{ α }RMSD r from the group's center (i.e. the representative decoy). This is done as follows: First a distance r less than d is decided, and an arbitrary decoy is set as a center. (Let r = for Case 1 below.) Repeatedly we take an ungrouped decoy, and try to find from all current centers for one which it is within distance r from. If and when any such center C is found the decoy is grouped with C and its distance to C is recorded. Otherwise the decoy is declared as a new center.
To locate the decoys in a group that are within distance d from a decoy A, one can consider the following five cases (denote by C the group's center and X an arbitrary decoy in the group):
Case 1: A is in the group of C (including when A is the group's center), given that r = .
Case 2: C_{ α }RMSD(A, C) + r ≤ d.
Case 3: C_{ α }RMSD(A, C) > d + r.
Case 4: C_{ α }RMSD(A, C) + C_{ α }RMSD(C, X) ≤ d
Case 5: C_{ α }RMSD(A, C)  C_{ α }RMSD(C, X) > d
In Cases 4 and 5, we take advantage of the already computed distance from the group's center to each member of the group. Again, triangular inequality implies that in Case 4, the decoy X is within distance d from A, while in Case 5 the converse is true. The C_{ α }RMSD between X and A is computed if and only if none of the cases applies.
Strategy 2: Lower and upper bounds to C_{ α }RMSD
Hence we can efficiently compute an upper and a lower bound to C_{ α }RMSD(X, Y), through an arbitrarily chosen reference decoy O and precomputed C_{ α }RMSD(X, O) values for each decoy X. In practice, one can use n reference decoys O_{1}, O_{2}, ..., O_{ n }to obtain n upper bounds and n lower bounds.
The Euclidean distance between two decoys, after they are reorientated to minimize their C_{ α }RMSDs to a fixed arbitrary decoy, yields another upper bound to their C_{ α }RMSD [9]. This upper bound distance is referred to as rRMSD.
The signature distance of two protein structures is a lower bound of their C_{ α }RMSD, that is:
Lemma 1. C_{ α }RMSD(S_{1}, S_{2}) ≥ dist(Sig_{1}, Sig_{2})
Proof. Let R and T be the optimal rotation and translation found in computing the C_{ α }RMSD of two structures S_{1} and S_{2}. Let r_{ k }= Rs_{1, k} s_{2, k} T^{2}, u_{1, k}= ⟨s_{1, k}, c_{1}⟩ and u_{2, k}= ⟨s_{2, k}, c_{2}⟩, 1 ≤ k ≤ n. u_{1, k}and u_{2, k}are line segments with lengths v_{1, k}and v_{2, k}respectively.
It is known that the superposition in computing the C_{ α }RMSD of any two structures results in the centroids of the structures to coincide [11].
Let θ be the angle between u_{1, k}and u_{2, k}. By trigonometry, . Hence, C_{ α }RMSD (S_{1}, S_{2}) = .
To decide if a decoy X is within C_{ α }RMSD d of a decoy A, we first compute the bounds and examine the following.

If any of the upper bounds of C_{ α }RMSD(X, A) is smaller than or equal to d. If so, clearly C_{ α }RMSD(X, A) ≤ d.

If any of the lower bounds of C_{ α }RMSD(X, A) is larger than d. If so, clearly C_{ α }RMSD(X, A) > d.
We compute C_{ α }RMSD(X, A) if and only if these two checks fail.
The upper and lower bounds can also be applied to the conditions in Case 2 and Case 3 of Strategy 1, as follows.

In Case 2, if any of the upper bounds of C_{ α }RMSD(A, C) is smaller than d  r, then the condition C_{ α }RMSD(A, C) + r ≤ d holds.

In Case 3, if any of the lower bounds of C_{ α }RMSD(A, C) is larger than d + r, then the condition C_{ α }RMSD(A, C) > d + r holds.
We compute C_{ α }RMSD(A, C) for Case 2 and Case 3 if and only if these two checks fail.
Strategy 3: Filtering outlier decoys
Another possible enhancement to performance is to discard decoys with low similarity to other decoys in the set, prior to the clustering. Here we propose an efficient technique to quickly identify such decoys. Our aim is to retain all of the high ranking decoys, and the decoys which are within distance d from them. We identify these as "good" decoys. Assume that every high ranking decoy is within distance d from 10% of all the decoys. For a random sample of n decoys, the probability for a good decoy to be within a distance 2d from at least one of the sampled decoys is 1  0.9^{ n }, which is > 0.9999 when n = 100. Hence decoys that are not within 2d from any one of 100 randomly sampled decoys are likely not good, and are removed from the set.
Overall program design
We designed a program based on the three strategies. On a given set S of n input decoys, the program does the following:
Step 1: Discover a suitable threshold distance d for clustering S.
Step 2: Filter outlier decoys using 100 randomly selected decoys, as in Strategy 3.
Step 3: Create auxiliary groups out of the decoys as required by Strategy 1. Compute the signature (Sig_{ x }), and the distances (C_{ α }RMSD(X, O) for each decoy X and reference decoy O; rRMSD(X, Y) for each decoy X, Y) as required by Strategy 2.
Step 4: Find for each decoy A a neighbor set N_{ A }which contains all the decoys in S within distance d from A (A inclusive), using Strategy 1 with the preliminary checks of Strategy 2. This is done in a straightforward fashion as follows.
1. Set N_{ A }to an empty set.
2. For each auxiliary group of decoys G (C denote the center of G),
(a) If A is in G, add all decoys in G into N_{ A }and go for the next auxiliary group.
(b) Examine if C_{ α }RMSD(A, C) + r ≤ d using each of the upperbounds of C_{ α }RMSD(A, C).
If true, add all decoys in G into N_{ A }. Go for the next auxiliary group.
(c) Examine if C_{ α }RMSD(A, C) > d + r using each of the lowerbounds of C_{ α }RMSD(A, C).
If true, skip G. Go for the next auxiliary group.
(d) Compute C_{ α }RMSD(A, C).
(e) Examine if C_{ α }RMSD(A, C) + r ≤ d.
If true, add all decoys in G into N_{ A }. Go for the next auxiliary group.
(f) Examine if C_{ α }RMSD(A, C) > d + r. If true, skip G. Go for the next auxiliary group.
(g) For each decoy X in G,
i. Examine if C_{ α }RMSD(A, C) + C_{ α }RMSD(C, X) ≤ d.
If true, add X into N_{ A }. Go for the next decoy in G.
ii. Examine if C_{ α }RMSD(A, C)  C_{ α }RMSD(C, X) > d.
If true, skip X. Go for the next decoy in G.
iii. Examine if C_{ α }RMSD(A, X) ≤ d using each of the upperbounds of C_{ α }RMSD(A, X).
If true, add X into N_{ A }. Go for the next decoy in G.
iv. Examine if C_{ α }RMSD(A, X) > d using each of the lowerbounds of C_{ α }RMSD(A, X).
If true, skip X. Go for the next decoy in G.
v. Compute C_{ α }RMSD(A, X).
vi. If C_{ α }RMSD(A, X) ≤ d, add X into N_{ A }.
3. Output N_{ A }.
Step 5: Start with an empty sequence Output. Repeatedly find A ∈ S with the largest N_{ A }, appending A to Output while removing N_{ A }from S and all the neighbor sets.
Step 6: Output the decoys in Output. (For brevity the program is set to output only the first 3 decoys.)
The threshold selection in Step 1 is addressed in the next subsection.
Steps 2 and 3 are performed straightforwardly.
Step 5 is performed by repeating the following until S is empty: Find the decoy X ∈ S with the largest N_{ X }(breaking ties arbitrarily) and append the decoy to Output. Then, remove N_{ X }from S and for each Y ∈ N_{ X }, remove Y from N_{ Z }for each Z ∈ N_{ Y }.
Selection of a suitable threshold
We consider two decoys to be significantly related if and only if their C_{ α }RMSD is relatively small among all pairwise C_{ α }RMSDs of the decoys. Hence our strategy to threshold finding is to identify a value d such that only x percent of pairwise C_{ α }RMSD distances are below d, for some suitable x. Given x, a straightforward way to determine such a d exactly is to compute all n × n C_{ α }RMSDs and find the (0.01xn^{2})th shortest distance. Alternatively, a reasonable approximation to the xpercentile value can be obtained efficiently using only a relatively small random sample of the decoys. In our tests, around 10 samplings of 100 decoys each sufficed to determine this value quickly and consistently in general. Our program uses this method by default, with x set to min{100n^{1/4}, 10}. The expression 100n^{1/4} is heuristic. It's aim is to reduce the percentile when more decoys are available, in order to speed up the clustering (e.g., x = 10 when n = 10000, x = 5 when n = 160000).
A similar strategy would be to use the most frequently occurring C_{ α }RMSD among decoys, f say, as a reference to decide a threshold distance d. (If the pairwise distances are distributed normally, f would correspond to the 50th percentile.) As a selectable option the program includes a simple method based on this, in which we let d = cf + b, where c is set to and b is set to the minimum pairwise distance discovered through random sampling.
Memory usage
In Steps 13 and 5, the memory required is linear in n. For Step 4, in the theoretical worst case, N_{ X } = n for each X, resulting in O(n^{2}) memory usage. However, such a scenario is unlikely to occur in the program's intended use. In practical use, N_{ X } is seldom above 0.2n, and small for most X. Note that in the case that the number of neighbor sets of a given size falls off geometrically with the size, the memory required to store all neighbor sets would in fact be linear in n. In our tests, the actual growth in memory usage is closer to O(n) than O(n^{2}).
If one is interested in only the highest ranked decoy from the clustering, it is unnecessary to construct the neighbor sets, since the sizes of the neighbor sets suffice to determine such a decoy. In this case, the total memory usage would be linear in n. We include this mode of operation as an option.
Results
Our C++ implementation of the program is called Calibur. Calibur accepts as input a list of names of PDB files (each for a decoy) and an optional threshold d. No preprocessing is required of the PDB files. If no threshold is given, Calibur automatically finds a suitable threshold for the input decoys, as discussed. The method which Calibur uses for threshold discovery can be altered through commandline arguments. A list of all the implemented methods is shown when Calibur is called without any input arguments.
Effectiveness of strategies
The effectiveness of each of the strategies was evaluated with decoys predicted by FALCON (reported in Alipanali et al.: A protocol for automated NMR protein structure determination, submitted for publication) on the proteins TM1112 from the Arrowsmith Lab at University of Toronto (herein the set is referred to as TM1112) and SH3 from Donaldson's Lab at York University (herein referred to as CASKIN). Each of these two sets contains 9999 decoys.
Auxiliary grouping, lower and upper bounds
In Calibur, the order in which evaluations are performed, as well as the range of thresholds to use for the evaluations has been optimized based on these observations.
Filtering
Strategies' effects on Calibur's performance
To evaluate the strategies' effects on Calibur at various thresholds, the runtimes when the strategies are used ("Calibur") and when they are not used ("pairwise") were compared. For reference purposes, the runtime for ROSETTA's pairwise evaluation based clustering program ("cluster_info_silent") is also shown. ROSETTA is currently the most popular system for protein structure prediction. All the tests are run on a 3 GHz Intel Core 2 Duo PC with 2.98 G RAM running CentOS 5.3. All three tools are compiled using GCC 4.1.2 with optimization O. The same codes is used for computing C_{ α }RMSD.
Calibur's performance on a large data set
CPU times of Calibur on large data set.
Threshold  0.5 (27)  1.0 (1966)  1.5 (5531)  2.0 (8560)  3.0 (14397)  4.0 (17915)  5.0 (19905) 

Calibur  74 ± 7  506 ± 14  1047 ± 27  1482 ± 42  2369 ± 154  3109 ± 266  3616 ± 290 
no filtering  225 ± 11  717 ± 22  1250 ± 35  1629 ± 42  2495 ± 180  3166 ± 233  3501 ± 272 
Pairwise  2628 ± 72  2624 ± 69  2651 ± 66  2741 ± 83  3293 ± 205  4014 ± 130  4425 ± 324 
In practice the largest clusters typically contain around 10% of the decoys. In the present case, the largest clusters found at 1.5 threshold distance already contain more than 18% of all the decoys. At this point, the corresponding CPU time required by Calibur is about one third of the time required when the strategies are not used.
As a further reference on Calibur's performance in high load use, Calibur completed in around 15000 seconds CPU time under its default settings in our recent tests using 100,000 decoys.
Evaluation of Calibur's output decoys
Sizes of the decoy set for each target.
Target  #Decoys  Target  #Decoys  Target  #Decoys  Target  #Decoys 

1abv_  12500  1dtjA_  20000  1mkyA3  6119  1shfA  20000 
1af7_ _  12499  1egxA  20000  1mla_2  12500  1sro_  20000 
1ah9_  27498  1fadA  12599  1mn8A  12500  1ten_  20000 
1aoy_  32000  1fo5A  20000  1n0uA4  12499  1tfi_  32000 
1b4bA  12500  1g1cA  19997  1ne3A  12500  1thx_  32000 
1b72A  12499  1gjxA  12500  1no5A  12500  1tif_  12500 
1bm8_  20000  1gnuA  17533  1npsA  20000  1tig_  12500 
1bq9A  20000  1gpt_  32000  1o2fB_  12500  1vcc_  20000 
1cewI  19830  1gyvA  11508  1of9A  20000  256bA  20000 
1cqkA  19999  1hbkA  20000  1ogwA_  19998  2a0b_  32000 
1csp_  12500  1itpA  12500  1orgA  20000  2cr7A  12500 
1cy5A  32000  1jnuA  20000  1pgx_  20000  2f3nA  19999 
1dcjA_  20000  1kjs_  20000  1r69_  20000  2pcy_  20000 
1di2A_  20000  1kviA  20000  1sfp_  19985  2reb_2  12500 
In order to compare Calibur with SPICKER in terms of both output and speed we ran Calibur under the same conditions as SPICKER. Both programs were compiled with optimization O3 and were made to cluster exactly the same set of decoys. Filtering was disabled in Calibur. We noticed a limit on the number of decoys that SPICKER handles. When the number of decoys is larger than 13000, SPICKER samples only 13000 decoys for clustering. To test Calibur with the same set of decoys that SPICKER clusters, we obtained 13000 decoys from each decoy set that is larger than 13000 (using the same procedure as in SPICKER's source codes) and tested Calibur with these decoys.
When decoys are sampled, they may not be sufficiently representative and the quality of the best decoy obtained may be compromised. To investigate this effect, we randomly sampled 1000, 2500, 4000, 5500, 7000, 8500, 10000, 11500 decoys from each of the original sets and ran SPICKER and Calibur with these sampled sets. Since only 6119 decoys are available for 1mkyA3, the full set was used as the sampled set at sizes above 5500.
Total scores and CPU times.
Sample size  Average TMscore  Total C_{ α }RMSD  Total CPU Time (s)  

SPICKER  Calibur  SPICKER  Calibur  SPICKER  Calibur  
1000  0.571107  0.578102  291.534  281.994  136.05  237.25 
2500  0.574379  0.578386  284.937  286.09  687.45  746.84 
4000  0.576045  0.576859  284.953  284.108  1851.34  1533.81 
5500  0.574475  0.57845  284.18  283.055  3276.36  2781.55 
7000  0.576323  0.578823  284.927  278.769  7029.09  4320.95 
8500  0.57843  0.580889  283.325  279.248  8132.9  5906.61 
10000  0.577543  0.581236  282.919  279.1  10745.5  7822.75 
11500  0.578748  0.582175  282.644  281.192  14293.8  10374.2 
13000  0.578432  0.582425  283.795  281.701  16608.8  12420.5 
Total scores and CPU times at larger sample sizes.
Max sample size  Average TMscore  Total C_{ α }RMSD  Total CPU Time (s) 

16000  0.581143  283.749  28857.2 
19000  0.582325  281.589  38928.9 
22000  0.580996  279.887  45916 
25000  0.581545  283.549  47083.1 
28000  0.58272  278.243  50669.4 
31000  0.581805  282.32  53582.2 
Conclusion
Calibur is a carefully implemented tool, dedicated to the purpose of clustering very large numbers of decoys. As methods in ab initio protein structure prediction advances, the number of decoys to be analyzed is expected to increase, and the disability to cluster decoys efficiently will eventually pose a hindrance to the analyses of various problems and subproblems in the prediction of protein structures. It is our belief that Calibur, together with the methods it implements, will come in very useful when this situation arises. For this reason, we have decided to release the source codes of Calibur with an open license.
Availability and requirements

Project name: Calibur

Project homepage: http://sourceforge.net/projects/calibur/

Operating System(s): Multiple platform (tested on Windows and Linux)

Programming Language: C++.

Other requirements: None.

License: GNU General Public License
Declarations
Authors’ Affiliations
References
 Simons KT, Kooperberg C, Huang E, Baker D: Assembly of protein tertiary structures from fragments with similar local sequences using simulated annealing and Bayesian scoring functions. J Mol Biol 1997, 268: 209–25. 10.1006/jmbi.1997.0959View ArticlePubMedGoogle Scholar
 Wu S, Skolnick J, Zhang Y: Ab initio modeling of small proteins by iterative TASSER simulations. BMC Biology 2007., 5(17):Google Scholar
 Hamelryck T, Kent JT, Krogh A: Sampling realistic protein conformations using local structural bias. PLoS Comput Biol 2006, 2(9):1121–1133. 10.1371/journal.pcbi.0020131View ArticleGoogle Scholar
 Bystroff C, Shao Y: Fully automated ab initio protein structure prediction using ISITES, HMMSTR and ROSETTA. Bioinformatics 2002, 18(Suppl 1):54–61.View ArticleGoogle Scholar
 Li SC, Bu D, Xu J, Li M: FragmentHMM: A new approach to protein structure prediction. Protein Science 2008, 17(11):1925–34. 10.1110/ps.036442.108View ArticlePubMedPubMed CentralGoogle Scholar
 Shortle D, Simons KT, Baker D: Clustering of lowenergy conformations near the native structures of small proteins. Proc Natl Acad Sci 1998, 95: 1158–62. 10.1073/pnas.95.19.11158View ArticleGoogle Scholar
 Betancourt MR, Skolnick J: Finding the needle in a haystack: Educing native folds from ambiguous ab initio protein structure. J Comput Chem 2001, 22: 339–353. Publisher Full Text 10.1002/1096987X(200102)22:3%3C339::AIDJCC1006%3E3.0.CO;2RView ArticleGoogle Scholar
 Zhang Y, Skolnick J: SPICKER: Approach to clustering protein structures for nearnative model selection. J Comput Chem 2004, 25: 865–871. 10.1002/jcc.20011View ArticlePubMedGoogle Scholar
 Li H, Zhou Y: SCUD: Fast structure clustering of decoys using reference state to remove overall rotation. J Comput Chem 2005, 26(11):1189–92. 10.1002/jcc.20251View ArticlePubMedGoogle Scholar
 Boris S: A revised proof of the metric properties of optimally superimposed vector sets. Acta Crystallographica Section A 2002, 58(5):506. 10.1107/S0108767302011637View ArticleGoogle Scholar
 Arun KS, Huang TS, Blostein SD: Leastsquares fitting of two 3D point sets. IEEE Trans Pattern Anal Mach Intell 1987, 9(5):698–700. 10.1109/TPAMI.1987.4767965View ArticlePubMedGoogle Scholar
 ITASSER protein structure decoys[http://zhang.bioinformatics.ku.edu/ITASSER/decoys/]
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.