We have suggested in the Methods section that, after dispensing with translation and the length of the SSEs, the structure is effectively represented by a set of points on a unit sphere. With these points we associated information about the underlying SSE type and sequential order. Having settled on this minimal representation of the protein structure, we set out to analyze its sufficiency for structure description and retrieval.
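
The reduction described above can be illustrated with a minimal sketch. It assumes each SSE is supplied as a type label together with the two end points of its axis; the names `sse_direction` and `reduce_structure`, the input layout, and the coordinates are hypothetical choices made for illustration, not the implementation used in this work.

```python
import math

def sse_direction(start, end):
    """Unit vector pointing from the first to the last point of an SSE axis.

    Dividing by the norm discards the SSE length; using only the difference
    of the end points discards the position (translation).
    """
    d = [e - s for s, e in zip(start, end)]
    norm = math.sqrt(sum(c * c for c in d))
    if norm == 0.0:
        raise ValueError("degenerate SSE: start and end coincide")
    return tuple(c / norm for c in d)

def reduce_structure(sses):
    """Map a list of (type, start_xyz, end_xyz), given in sequential order,
    to a list of (type, unit_direction) in the same order."""
    return [(kind, sse_direction(start, end)) for kind, start, end in sses]

# Hypothetical two-element structure: one helix ("H") and one strand ("E").
example = [
    ("H", (0.0, 0.0, 0.0), (0.0, 0.0, 15.0)),
    ("E", (5.0, 0.0, 15.0), (5.0, 8.0, 15.0)),
]
reduced = reduce_structure(example)
```

The list index carries the sequential order, so the three features (direction, type, order) are all retained in `reduced`.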

As noted in subsection "Self-scoring in a large database of structures", Fig. 2 and the related discussion, the directions themselves, unless matched very narrowly (*δ* = 0.1 in our formulation), may be matched by quite diverse protein structures. To eliminate the false positive matches that arise in this way, we have suggested requiring that the matched SSEs follow the same sequential order in the two structures. This, however, is not the only possible way around the problem: as discussed by Mizuguchi and Go [5], and later elaborated by Krissinel and Henrick [12] in the development of SSM (discussed above), the directions of SSEs can be supplemented by various other pieces of information: the length of SSEs, the distance and torsion angle between all possible *pairs* of SSEs in a structure, and/or the angles between their directions and the direction of the line passing through their geometric centers. The advantage of using this type of information, rather than requiring a common sequential order of the SSEs, is the ability to look for pairs of proteins with different connectivity between SSEs that nevertheless result in comparable overall structures. On the flip side, the set of requirements might end up being too restrictive in the search for similar (but non-identical) structures, as we have illustrated in the inset of Fig. 4.

Contrary to the model of similarity adopted here, where similar structures are assumed to share, to a certain extent, the underlying SSE arrangement, it is conceivable that two proteins might share a common function as long as they offer a common surface geometry to their common (or similar) interacting partners [55]. In that case one might be interested in a method for detection and retrieval of proteins sharing the same shape, irrespective of the underlying secondary structure. This possibility is not explored here. Methods for retrieval by global shape similarity have been discussed in the literature (see [29] and references therein), and extremely short retrieval times (~10^{-4} s on a 3 GHz CPU) have been reported [29]. Some questions remain outside the scope of these methods, such as detection of a common substructure or structural motif.

Sticking to a more conservative model of shared protein structure, the problem that ultimately needs to be resolved is that of correspondence: which SSEs (and later, at a finer level of detail, which backbone atoms) in the two structures correspond to one another. The function *F*(*R*; *X*, *Y*) enables us to sidestep this problem initially, in principle at least, because the fast fall-off of the closeness measure *D*_{ij}(*R*) (Eq. 2) makes it possible to take the double sum over all elements without the danger of obtaining, as the optimum, a solution in which no actual match exists but the sum over many distant neighbors artificially inflates the score. By starting the protein structure comparison with the minimization of *F*(*R*; *X*, *Y*), we are effectively adopting, on the SSE level, the match-first-align-later approach popularized by Gerstein and Levitt [18] (see also [56] for a further development of the idea).
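
The role of the fast fall-off can be illustrated with a minimal sketch. Several ingredients are assumptions made only for the illustration: a Gaussian form for the closeness measure, standing in for Eq. 2, with the width parameter `delta`; the restriction of the sum to pairs of the same SSE type; and the sign convention (the sum is negated so that better matches give lower values, consistent with minimization). The function names are hypothetical.

```python
import math

def rotate(R, v):
    """Apply a 3x3 rotation matrix (nested lists) to a 3-vector."""
    return tuple(sum(R[i][k] * v[k] for k in range(3)) for i in range(3))

def closeness(u, v, delta):
    """Assumed Gaussian stand-in for D_ij: equal to 1 for coincident
    directions and decaying quickly once they differ by more than ~delta."""
    d2 = sum((a - b) ** 2 for a, b in zip(u, v))
    return math.exp(-d2 / delta ** 2)

def match_score(R, X, Y, delta=0.5):
    """Negated double sum over all pairs (i, j); no correspondence between
    the elements of X and Y is needed. X and Y are lists of
    (type, unit_vector). Only same-type pairs contribute (an assumption)."""
    total = 0.0
    for kind_x, x in X:
        rx = rotate(R, x)
        for kind_y, y in Y:
            if kind_x == kind_y:
                total += closeness(rx, y, delta)
    return -total

identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
quarter_turn_x = [[1, 0, 0], [0, 0, -1], [0, 1, 0]]  # 90 degrees about x

X = [("H", (0.0, 0.0, 1.0)), ("E", (0.0, 1.0, 0.0))]
Y = [("H", (0.0, 0.0, 1.0)), ("E", (0.0, 1.0, 0.0))]

aligned = match_score(identity, X, Y)
misaligned = match_score(quarter_turn_x, X, Y)
```

In this toy case the misaligned rotation leaves every same-type pair far apart, and each such pair contributes only a factor of exp(-8) to the sum, which is why distant neighbors cannot accumulate into a spurious optimum.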

Ideally, the scoring function *F* would quantify, in a single expression, the geometric match under the constraint of sequential ordering of the pairs, a problem which we leave open. On the high-resolution end of the spectrum of related ideas lies the URMS-RMS hybrid algorithm [8, 23, 57]. There, a set of directions in space is also considered, although not along the SSEs but along the lines connecting neighboring *C*_{α} atoms within a heptapeptide. Being a high-resolution method, it carries a computational burden comparable to that of other backbone-matching approaches (and, of course, offers the final reward of an actual detailed matching of two backbone traces). The match scoring function used in that work is different from the one suggested here, but it runs into a similar difficulty of estimating the statistical significance of a match between different structures. The solution offered there is comparison with an empirically derived background distribution of match probabilities obtained from existing, unrelated protein structures.

Instead, we opted for a solution which separates the geometric match from the alignment. The fuzziest point in the algorithm we have outlined, therefore, is that the averages in Eq. 4 should properly be evaluated not over the set of all rotations *R*, but only over those rotations which allow, through the matrix *D*_{ij}(*R*), the alignment of subsequences of the two proteins of substantial length. Numerical evaluation of these proper averages would effectively grind the search to a halt, so in our prototype evaluation we keep the averages over all *R* as an approximation. The approximation works well for the rigid search, where it is used to discard bad solutions rather than to score good ones. In the case of the flexible search we resort to the total assigned score as a scoring function, coupled with the requirement that both maps have a high rotational *z*-score on their own.
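
The approximation of averaging over all rotations can be sketched as follows. The sketch assumes that the rotational *z*-score has the usual form, (best value minus mean) divided by the standard deviation, with the mean and standard deviation estimated from uniformly sampled rotations; this stands in for Eq. 4 and is not taken from it. The sampling scheme (normalized Gaussian quaternions), the function names, and the sample size are likewise illustrative assumptions.

```python
import math
import random

def random_rotation(rng):
    """Uniformly distributed rotation matrix, built from a unit quaternion
    obtained by normalizing four independent Gaussian samples."""
    q = [rng.gauss(0.0, 1.0) for _ in range(4)]
    n = math.sqrt(sum(c * c for c in q))
    w, x, y, z = (c / n for c in q)
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ]

def rotational_z_score(score, best_value, n_samples=2000, seed=0):
    """Assumed z-score of best_value against the distribution of score(R)
    over ALL rotations R (not only those permitting a long alignment)."""
    rng = random.Random(seed)
    samples = [score(random_rotation(rng)) for _ in range(n_samples)]
    mean = sum(samples) / n_samples
    var = sum((s - mean) ** 2 for s in samples) / (n_samples - 1)
    return (best_value - mean) / math.sqrt(var)

# Toy score: negative cosine between the z axis and its rotated image,
# which is -R[2][2]; its minimum over all rotations is -1.
def toy_score(R):
    return -R[2][2]

z = rotational_z_score(toy_score, best_value=-1.0)
```

Restricting the sample to rotations that admit a substantial alignment would require running the alignment step for every sampled rotation, which is the cost the approximation avoids.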

In terms of implementation, there is certainly ample room for improvement. The relatively large number of false positives is attributable, at least in part, to parallel beta sheets and helix bundles, a problem that could be remedied by more careful grouping of the representation vectors. Also, in the implementation used here, each *β* strand is represented by a single vector, a rather crude approximation for most *β* strands, which are often bent.

Perhaps stating the obvious, the ultimate degree of success of an approach depends on the choices made in the implementation as much as on the underlying idea. In this work, the available implementations (steepest descent and Needleman-Wunsch) determined the way in which the three features we selected to describe a protein were used. Even though a faster, or more robust, implementation could perhaps be achieved by a different choice of optimization or alignment algorithm, these are replaceable components; the main points for improvement are in the representation itself, in the distance (or match scoring) function, and in its statistical evaluation.
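
As an illustration of how replaceable the alignment component is, the following is a minimal sketch of a Needleman-Wunsch global alignment over a similarity matrix between the SSEs of two structures. The matrix entries could, for instance, be closeness values evaluated at the optimal rotation; the linear gap penalty, its value, and the function name are assumptions made for the example, not the settings used in this work.

```python
def needleman_wunsch(sim, gap=-0.1):
    """Global alignment of two ordered SSE lists.

    sim[i][j] is the similarity of SSE i in the first structure to SSE j in
    the second; gap is the (assumed, linear) penalty for skipping an SSE.
    Returns (total score, list of aligned index pairs in sequential order).
    """
    n = len(sim)
    m = len(sim[0]) if n else 0
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        best[i][0] = i * gap
    for j in range(1, m + 1):
        best[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best[i][j] = max(
                best[i - 1][j - 1] + sim[i - 1][j - 1],  # pair i with j
                best[i - 1][j] + gap,                    # skip SSE i
                best[i][j - 1] + gap,                    # skip SSE j
            )
    # Trace back from the corner, recomputing the same three candidates.
    pairs = []
    i, j = n, m
    while i > 0 and j > 0:
        if best[i][j] == best[i - 1][j - 1] + sim[i - 1][j - 1]:
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif best[i][j] == best[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return best[n][m], pairs

# Hypothetical case: the second structure has an extra SSE in the middle.
sim = [
    [1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0],
]
score, pairs = needleman_wunsch(sim)
```

Because the aligned pairs are produced in increasing index order in both structures, the sequential-order requirement is enforced by construction; any other order-preserving alignment routine could be substituted.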