Obtaining the best sequence alignment from a pair of superimposed structures is a non-trivial problem when the two structures are not entirely similar. The common practice is to select as many aligned residue pairs as possible while minimizing the sum of the distances between the Cα atoms of the selected pairs. The natural algorithm for this task is dynamic programming.
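To make the distance-based dynamic programming approach concrete, the following is a minimal sketch, not the implementation used by any of the programs discussed here. It assumes the two structures are already superimposed and given as lists of Cα coordinates. The function name `dp_align`, the per-pair score `cutoff - distance`, the linear gap penalty and the default values are all illustrative assumptions.

```python
import math

def dp_align(coords_a, coords_b, cutoff=4.0, gap=0.5):
    """Global alignment of two superimposed Calpha traces.

    Hypothetical scoring: an aligned pair earns (cutoff - distance), so close
    pairs are rewarded; every skipped residue costs a linear penalty `gap`.
    Both default values are arbitrary illustrations.
    """
    n, m = len(coords_a), len(coords_b)
    score = [[0.0] * (m + 1) for _ in range(n + 1)]
    move = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):          # leading residues of A left unaligned
        score[i][0], move[i][0] = -gap * i, "up"
    for j in range(1, m + 1):          # leading residues of B left unaligned
        score[0][j], move[0][j] = -gap * j, "left"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = math.dist(coords_a[i - 1], coords_b[j - 1])
            score[i][j], move[i][j] = max(
                (score[i - 1][j - 1] + (cutoff - d), "diag"),  # align i with j
                (score[i - 1][j] - gap, "up"),                 # skip residue of A
                (score[i][j - 1] - gap, "left"),               # skip residue of B
            )
    # Trace back from the corner, collecting the aligned (diagonal) pairs.
    pairs, i, j = [], n, m
    while i > 0 or j > 0:
        if move[i][j] == "diag":
            pairs.append((i - 1, j - 1))
            i, j = i - 1, j - 1
        elif move[i][j] == "up":
            i -= 1
        else:
            j -= 1
    return pairs[::-1]
```

The result depends directly on the chosen `gap` and `cutoff` values, which is the arbitrariness that the following paragraph discusses.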

However, blind minimization of the distance sum, combined with an essentially arbitrary gap penalty function, can produce poor alignments. The problem is particularly easy to see when two structurally equivalent helices cross each other at an angle, as in the case shown in Figure 3. In such cases, an insufficient gap penalty often leads to an alignment of the closest, but not necessarily structurally equivalent, residues, with many gaps.
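The crossing-helix problem can be illustrated with a toy geometry. The sketch below is an assumption-laden simplification: each helix is reduced to points on a straight axis with a rise of 1.5 Å per residue, the two axes cross at their midpoints, and the crossing angle of 40° is chosen arbitrarily. Residue i of one helix is equivalent to residue i of the other by construction. The function names are hypothetical.

```python
import math

def straight_helix_axis(n_residues, rise, angle_deg):
    """Points on a line through the origin, rotated by angle_deg in the xy-plane.

    A crude stand-in for a helix: only the axis position of each residue is kept.
    """
    theta = math.radians(angle_deg)
    half = n_residues // 2
    return [(k * rise * math.cos(theta), k * rise * math.sin(theta), 0.0)
            for k in range(-half, half + 1)]

def nearest_index(point, coords):
    """Index of the point in coords that lies closest to `point`."""
    return min(range(len(coords)), key=lambda j: math.dist(point, coords[j]))

# Two 11-residue helices crossing at their midpoints (assumed toy values).
helix_a = straight_helix_axis(11, 1.5, 0.0)
helix_b = straight_helix_axis(11, 1.5, 40.0)

# For each residue of helix A, the closest residue of helix B.
nearest = [nearest_index(p, helix_b) for p in helix_a]

# Residues whose closest partner is not their equivalent partner.
mismatched = [i for i, j in enumerate(nearest) if i != j]
```

In this toy geometry the residues at the crossing point are closest to their equivalent partners, but the residues at the ends of helix A are closest to a partner one position off. A purely distance-driven alignment with a weak gap penalty is therefore drawn toward non-equivalent pairs.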

The SE algorithm is a heuristic that approximately follows the mental process one of the authors (BL) goes through when he manually writes down the alignment from visual inspection of a pair of superimposed structures displayed on a computer screen. It starts with a few residue pairs that are clearly equivalent and then extends the alignment without introducing a gap until the inter-residue distance changes abruptly. There is no explicit gap penalty, although one is implicitly present because the algorithm attempts to extend the alignment without a gap. We have shown in this study that this algorithm produces more accurate alignments than the dynamic programming algorithms implemented in three different programs. It is also considerably faster than those algorithms, especially when the structures are large. An additional merit of the algorithm is that it generates strictly symmetric alignments, i.e. it produces the same alignment when the query and target structures are swapped. This is not always the case with the dynamic programming algorithm.
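The extension step described above can be sketched as follows. This is a simplified illustration and not the SE implementation: the seed pair is supplied by the caller, an "abrupt change" is interpreted as a jump in the Cα distance between consecutive pairs larger than a hypothetical `max_jump`, and the function name and default value are assumptions.

```python
import math

def extend_from_seed(coords_a, coords_b, seed, max_jump=1.0):
    """Extend a seed pair (i, j) gaplessly in both directions.

    Extension walks along the diagonal (i + k, j + k) and stops as soon as
    the inter-residue distance changes by more than `max_jump` relative to
    the previously accepted pair. `max_jump` is an illustrative value.
    """
    i0, j0 = seed
    pairs = [(i0, j0)]
    for step in (1, -1):               # forward, then backward
        prev = math.dist(coords_a[i0], coords_b[j0])
        i, j = i0 + step, j0 + step
        while 0 <= i < len(coords_a) and 0 <= j < len(coords_b):
            d = math.dist(coords_a[i], coords_b[j])
            if abs(d - prev) > max_jump:
                break                  # abrupt change: end of this segment
            pairs.append((i, j))
            prev = d
            i, j = i + step, j + step
    return sorted(pairs)
```

Because the stopping rule depends only on the distances between paired residues, swapping the two coordinate lists (and the seed indices) yields the mirrored set of pairs, which is consistent with the symmetry property noted above.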

The algorithm requires several parameters: the distance change cutoff value, which is used to decide when to stop extending the alignment; the scalar product threshold value, which measures the similarity of orientation of residue triplets and is used to identify the seed alignments; and the distance tolerance and sequence similarity cutoff values, which are used to decide when to consider sequence similarity in choosing among conflicting alignments. Initially, we chose the values of these parameters intuitively. The values of the first two parameters were then varied within a limited range, and the optimal values were chosen using the 582 pairs of alignments selected from the CDD database. Although CDD is the most recent expert-curated database, there are other structure-based sequence alignment databases, e.g. HOMSTRAD [17] and FSSP [18]. It is possible that using these other databases would alter the optimal values of these parameters. Adjustments may also be indicated as the program is tested on more structure pairs and used more widely. However, we expect that any such adjustment will be small in magnitude and, in particular, that SE will remain superior to a dynamic programming algorithm.
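The tuning of the first two parameters can be pictured as a small grid search against reference alignments. The sketch below is a generic illustration under stated assumptions and not the procedure actually used: the accuracy measure (fraction of reference residue pairs reproduced), the function names `fraction_correct` and `grid_search`, and the `align` callable are all hypothetical, and no parameter values from this study are implied.

```python
from itertools import product

def fraction_correct(predicted, reference):
    """Fraction of reference residue pairs that also occur in the prediction."""
    ref = set(reference)
    return len(ref & set(predicted)) / len(ref) if ref else 0.0

def grid_search(align, cases, cutoffs, thresholds):
    """Pick the (cutoff, threshold) combination with the best mean accuracy.

    align      -- callable(structure_pair, cutoff, threshold) -> list of pairs
                  (a stand-in for any alignment routine)
    cases      -- list of (structure_pair, reference_alignment) tuples
    cutoffs    -- candidate distance change cutoff values
    thresholds -- candidate scalar product threshold values
    Returns (mean_accuracy, cutoff, threshold); the first best combination wins.
    """
    best = None
    for cutoff, threshold in product(cutoffs, thresholds):
        mean = sum(
            fraction_correct(align(pair, cutoff, threshold), ref)
            for pair, ref in cases
        ) / len(cases)
        if best is None or mean > best[0]:
            best = (mean, cutoff, threshold)
    return best
```

Repeating such a search with reference alignments from a different database would show directly how sensitive the optimal values are to the choice of benchmark.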