Coordinates of the C_{
α
}atoms on the protein fold backbone are used to represent the main structure of a protein. As distance measure between two protein structures we use the backbone C_{
α
}carbon root mean squared deviation (C_{
α
}RMSD). Each C_{
α
}atom corresponds to a point in 3D space. For two protein structures S_{1} = (s_{1,1}, s_{1,2}, ..., s_{1, n}) and S_{2} = (s_{2,1}, s_{2,2}, ..., s_{2, n}), each s_{i, j}, 1 ≤ i ≤ 2, 1 ≤ j ≤ n, is a 3D point indicating a C_{
α
}atom in the backbone. The C_{
α
}RMSD of S_{1} and S_{2} is defined as:
where ℛ is the set of all rotations and the set of all translations.
Strategy 1: Auxiliary grouping of decoys
To avoid pairwise C_{
α
}RMSD computation, proximate decoys can be considered collectively, in deciding whether they are within C_{
α
}RMSD d from a decoy. We illustrate this idea in Figure 1, where the input decoys are collected into five groups. When finding all the decoys that are within C_{
α
}RMSD d from the decoy A (which is itself in group 2), one can first consider each of the five groups as a whole. In this case, all the decoys in the groups 2 and 3 are within C_{
α
}RMSD d from A, while all the decoys in the groups 1 and 5 are further than d from A. No such conclusion can be collectively made about the decoys in group 4. This strategy is made possible by the fact that we can decide if an entire collection of decoys is within C_{
α
}RMSD d from a decoy A by comparing A to a representative decoy C for the collection. That is, if A is within a certain distance from C, then we conclude that the entire group is within d from X. Similarly, no decoy in the group is within d from A if A is further than some distance from C. How this can be done is as follows.
We want a grouping such that each decoy belongs to exactly one group, and is at most C_{
α
}RMSD r from the group's center (i.e. the representative decoy). This is done as follows: First a distance r less than d is decided, and an arbitrary decoy is set as a center. (Let r = for Case 1 below.) Repeatedly we take an ungrouped decoy, and try to find from all current centers for one which it is within distance r from. If and when any such center C is found the decoy is grouped with C and its distance to C is recorded. Otherwise the decoy is declared as a new center.
To locate the decoys in a group that are within distance d from a decoy A, one can consider the following five cases (denote by C the group's center and X an arbitrary decoy in the group):
Case 1: A is in the group of C (including when A is the group's center), given that r = .
Case 2: C_{
α
}RMSD(A, C) + r ≤ d.
Case 3: C_{
α
}RMSD(A, C) > d + r.
Case 4: C_{
α
}RMSD(A, C) + C_{
α
}RMSD(C, X) ≤ d
Case 5: C_{
α
}RMSD(A, C)  C_{
α
}RMSD(C, X) > d
These cases are depicted in Figure 2. Since C_{
α
}RMSD is a metric [10], triangular inequality applies. Hence in Cases 1 and 2, all the decoys grouped with C must be within distance d from A. In Case 3 the converse is true.
In Cases 4 and 5, we take advantage of the already computed distance from the group's center to each member of the group. Again, triangular inequality implies that in Case 4, the decoy X is within distance d from A, while in Case 5 the converse is true. The C_{
α
}RMSD between X and A is computed if and only if none of the cases applies.
Strategy 2: Lower and upper bounds to C_{
α
}RMSD
Given any two decoys X and A, an efficiently computable lower bound of C_{
α
}RMSD(X, A) can be used to detect if C_{
α
}RMSD(X, A) is larger than a given threshold d. Likewise, an upper bound can be used to detect the case where C_{
α
}RMSD(X, A) is smaller than d. Our strategy is to use multiple such efficiently computable bounds as preliminary checks to reduce the much more expensive C_{
α
}RMSD computations. We will first propose a few of such upper and lower bounds, and then demonstrate how they are applied. First consider any three decoys, O, X and Y. By triangular inequality,
Hence we can efficiently compute an upper and a lower bound to C_{
α
}RMSD(X, Y), through an arbitrarily chosen reference decoy O and precomputed C_{
α
}RMSD(X, O) values for each decoy X. In practice, one can use n reference decoys O_{1}, O_{2}, ..., O_{
n
}to obtain n upper bounds and n lower bounds.
The Euclidean distance between two decoys, after they are reorientated to minimize their C_{
α
}RMSDs to a fixed arbitrary decoy, yields another upper bound to their C_{
α
}RMSD [9]. This upper bound distance is referred to as rRMSD.
Another lower bound can be obtained as follows. Denote the centroid of a protein structure S_{
x
}as c_{
x
}. The signature Sig_{
x
}for a protein structure S_{
x
}= (s_{x, 1}, s_{x, 2}, ..., s_{x, n}) is defined as:
where v_{x, i}= s_{x, i} c_{
x
}, 1 ≤ i ≤ n. Define the distance between two signatures Sig_{1} and Sig_{2}, called signature distance, as:
The signature distance of two protein structures is a lower bound of their C_{
α
}RMSD, that is:
Lemma 1. C_{
α
}RMSD(S_{1}, S_{2}) ≥ dist(Sig_{1}, Sig_{2})
Proof. Let R and T be the optimal rotation and translation found in computing the C_{
α
}RMSD of two structures S_{1} and S_{2}. Let r_{
k
}= Rs_{1, k} s_{2, k} T^{2}, u_{1, k}= ⟨s_{1, k}, c_{1}⟩ and u_{2, k}= ⟨s_{2, k}, c_{2}⟩, 1 ≤ k ≤ n. u_{1, k}and u_{2, k}are line segments with lengths v_{1, k}and v_{2, k}respectively.
It is known that the superposition in computing the C_{
α
}RMSD of any two structures results in the centroids of the structures to coincide [11].
Let θ be the angle between u_{1, k}and u_{2, k}. By trigonometry, . Hence, C_{
α
}RMSD (S_{1}, S_{2}) = .
To decide if a decoy X is within C_{
α
}RMSD d of a decoy A, we first compute the bounds and examine the following.

If any of the upper bounds of C_{
α
}RMSD(X, A) is smaller than or equal to d. If so, clearly C_{
α
}RMSD(X, A) ≤ d.

If any of the lower bounds of C_{
α
}RMSD(X, A) is larger than d. If so, clearly C_{
α
}RMSD(X, A) > d.
We compute C_{
α
}RMSD(X, A) if and only if these two checks fail.
The upper and lower bounds can also be applied to the conditions in Case 2 and Case 3 of Strategy 1, as follows.

In Case 2, if any of the upper bounds of C_{
α
}RMSD(A, C) is smaller than d  r, then the condition C_{
α
}RMSD(A, C) + r ≤ d holds.

In Case 3, if any of the lower bounds of C_{
α
}RMSD(A, C) is larger than d + r, then the condition C_{
α
}RMSD(A, C) > d + r holds.
We compute C_{
α
}RMSD(A, C) for Case 2 and Case 3 if and only if these two checks fail.
Strategy 3: Filtering outlier decoys
Another possible enhancement to performance is to discard decoys with low similarity to other decoys in the set, prior to the clustering. Here we propose an efficient technique to quickly identify such decoys. Our aim is to retain all of the high ranking decoys, and the decoys which are within distance d from them. We identify these as "good" decoys. Assume that every high ranking decoy is within distance d from 10% of all the decoys. For a random sample of n decoys, the probability for a good decoy to be within a distance 2d from at least one of the sampled decoys is 1  0.9^{n}, which is > 0.9999 when n = 100. Hence decoys that are not within 2d from any one of 100 randomly sampled decoys are likely not good, and are removed from the set.
Overall program design
We designed a program based on the three strategies. On a given set S of n input decoys, the program does the following:
Step 1: Discover a suitable threshold distance d for clustering S.
Step 2: Filter outlier decoys using 100 randomly selected decoys, as in Strategy 3.
Step 3: Create auxiliary groups out of the decoys as required by Strategy 1. Compute the signature (Sig_{
x
}), and the distances (C_{
α
}RMSD(X, O) for each decoy X and reference decoy O; rRMSD(X, Y) for each decoy X, Y) as required by Strategy 2.
Step 4: Find for each decoy A a neighbor set N_{
A
}which contains all the decoys in S within distance d from A (A inclusive), using Strategy 1 with the preliminary checks of Strategy 2. This is done in a straightforward fashion as follows.
1. Set N_{
A
}to an empty set.
2. For each auxiliary group of decoys G (C denote the center of G),
(a) If A is in G, add all decoys in G into N_{
A
}and go for the next auxiliary group.
(b) Examine if C_{
α
}RMSD(A, C) + r ≤ d using each of the upperbounds of C_{
α
}RMSD(A, C).
If true, add all decoys in G into N_{
A
}. Go for the next auxiliary group.
(c) Examine if C_{
α
}RMSD(A, C) > d + r using each of the lowerbounds of C_{
α
}RMSD(A, C).
If true, skip G. Go for the next auxiliary group.
(d) Compute C_{
α
}RMSD(A, C).
(e) Examine if C_{
α
}RMSD(A, C) + r ≤ d.
If true, add all decoys in G into N_{
A
}. Go for the next auxiliary group.
(f) Examine if C_{
α
}RMSD(A, C) > d + r. If true, skip G. Go for the next auxiliary group.
(g) For each decoy X in G,
i. Examine if C_{
α
}RMSD(A, C) + C_{
α
}RMSD(C, X) ≤ d.
If true, add X into N_{
A
}. Go for the next decoy in G.
ii. Examine if C_{
α
}RMSD(A, C)  C_{
α
}RMSD(C, X) > d.
If true, skip X. Go for the next decoy in G.
iii. Examine if C_{
α
}RMSD(A, X) ≤ d using each of the upperbounds of C_{
α
}RMSD(A, X).
If true, add X into N_{
A
}. Go for the next decoy in G.
iv. Examine if C_{
α
}RMSD(A, X) > d using each of the lowerbounds of C_{
α
}RMSD(A, X).
If true, skip X. Go for the next decoy in G.
v. Compute C_{
α
}RMSD(A, X).
vi. If C_{
α
}RMSD(A, X) ≤ d, add X into N_{
A
}.
3. Output N_{
A
}.
Step 5: Start with an empty sequence Output. Repeatedly find A ∈ S with the largest N_{
A
}, appending A to Output while removing N_{
A
}from S and all the neighbor sets.
Step 6: Output the decoys in Output. (For brevity the program is set to output only the first 3 decoys.)
The threshold selection in Step 1 is addressed in the next subsection.
Steps 2 and 3 are performed straightforwardly.
Step 5 is performed by repeating the following until S is empty: Find the decoy X ∈ S with the largest N_{
X
}(breaking ties arbitrarily) and append the decoy to Output. Then, remove N_{
X
}from S and for each Y ∈ N_{
X
}, remove Y from N_{
Z
}for each Z ∈ N_{
Y
}.
Selection of a suitable threshold
We consider two decoys to be significantly related if and only if their C_{
α
}RMSD is relatively small among all pairwise C_{
α
}RMSDs of the decoys. Hence our strategy to threshold finding is to identify a value d such that only x percent of pairwise C_{
α
}RMSD distances are below d, for some suitable x. Given x, a straightforward way to determine such a d exactly is to compute all n × n C_{
α
}RMSDs and find the (0.01xn^{2})th shortest distance. Alternatively, a reasonable approximation to the xpercentile value can be obtained efficiently using only a relatively small random sample of the decoys. In our tests, around 10 samplings of 100 decoys each sufficed to determine this value quickly and consistently in general. Our program uses this method by default, with x set to min{100n^{1/4}, 10}. The expression 100n^{1/4} is heuristic. It's aim is to reduce the percentile when more decoys are available, in order to speed up the clustering (e.g., x = 10 when n = 10000, x = 5 when n = 160000).
A similar strategy would be to use the most frequently occurring C_{
α
}RMSD among decoys, f say, as a reference to decide a threshold distance d. (If the pairwise distances are distributed normally, f would correspond to the 50th percentile.) As a selectable option the program includes a simple method based on this, in which we let d = cf + b, where c is set to and b is set to the minimum pairwise distance discovered through random sampling.
Memory usage
In Steps 13 and 5, the memory required is linear in n. For Step 4, in the theoretical worst case, N_{
X
} = n for each X, resulting in O(n^{2}) memory usage. However, such a scenario is unlikely to occur in the program's intended use. In practical use, N_{
X
} is seldom above 0.2n, and small for most X. Note that in the case that the number of neighbor sets of a given size falls off geometrically with the size, the memory required to store all neighbor sets would in fact be linear in n. In our tests, the actual growth in memory usage is closer to O(n) than O(n^{2}).
If one is interested in only the highest ranked decoy from the clustering, it is unnecessary to construct the neighbor sets, since the sizes of the neighbor sets suffice to determine such a decoy. In this case, the total memory usage would be linear in n. We include this mode of operation as an option.