# Convergent algorithms for protein structural alignment

- Leandro Martínez
^{1}, - Roberto Andreani
^{2}and - José Mario Martínez
^{3}Email author

**8**:306

**DOI: **10.1186/1471-2105-8-306

© Martínez et al; licensee BioMed Central Ltd. 2007

**Received: **14 March 2007

**Accepted: **22 August 2007

**Published: **22 August 2007

## Abstract

### Background

Many algorithms exist for protein structural alignment, based on internal protein coordinates or on explicit superposition of the structures. These methods are usually successful for detecting structural similarities. However, current practical methods are seldom supported by convergence theories. In particular, although the goal of each algorithm is to maximize some scoring function, there is no practical method that theoretically guarantees score maximization. A practical algorithm with solid convergence properties would be useful for the refinement of protein folding maps, and for the development of new scores designed to be correlated with functional similarity.

### Results

In this work, the maximization of scoring functions in protein alignment is interpreted as a Low Order Value Optimization (LOVO) problem. The new interpretation provides a framework for the development of algorithms based on well established methods of continuous optimization. The resulting algorithms are convergent and *increase the scoring functions at every iteration*. The solutions obtained are critical points of the scoring functions. Two algorithms are introduced: One is based on the maximization of the scoring function with Dynamic Programming followed by the continuous maximization of *the same* score, with respect to the protein position, using a smooth Newtonian method. The second algorithm replaces the Dynamic Programming step by a fast procedure for computing the correspondence between C*α* atoms. The algorithms are shown to be very effective for the maximization of the STRUCTAL score.

### Conclusion

The interpretation of protein alignment as a LOVO problem provides a new theoretical framework for the development of convergent protein alignment algorithms. These algorithms are shown to be very reliable for the maximization of the STRUCTAL score, and other distance-dependent scores may be optimized with same strategy. The improved score optimization provided by these algorithms provide means for the refinement of protein fold maps and also for the development of scores designed to match biological function. The LOVO strategy may be also used for more general structural superposition problems such as flexible or non-sequential alignments. The package is available on-line at http://www.ime.unicamp.br/~martinez/lovoalign.

## Background

The number of protein structures obtained experimentally becomes larger every year. This large database is the source of data for the study of important problems in structural biology: The classification of protein structures according to their function, and the correlation of sequence and structure. Studies on the classification of proteins available in the Protein Data Bank (PDB) [1] have already provided important insights into the nature of protein evolution and folding [2–4]. With the increase in computer power and the expansion of the database, using this information for protein design and for the characterization of the protein folding landscape (or fold space) is becoming possible [5, 6].

Reliable methods for assessing similarities or discrepancies between structures are thus required. Algorithmic reliability is now most important since the description of the fold space is changing from a discrete to a continuous representation of similarities in terms of geometrical measures [6]. It is not clear whether these measures are meaningful if obtained by methods that do not necessarily converge. Furthermore, in protein folding it was recognized that the stable folds are minimizers in complex energy landscapes [7]. If one intends to obtain insights into the landscape of protein folding from similarity measures (scores), these measures must be meaningful. This means that relevant scores must be developed [4, 6, 8, 9], and that the algorithms that perform the alignment must converge to score maximizers (ideally the global maximizers).

Studies on the organization of the protein fold universe employ multidimensional scaling [10–12] or Kernel methods [13]. From scores that measure the similarity between pairs of proteins, distance-like functions are derived, and proteins are represented as points in the 3D-Euclidean space that best fit the distances. The Structure Space Map developed in [11, 12] provides good predictions of function similarities in many cases. Pairwise scores, which are the essential ingredients to compute distances, are given by alignment algorithms. The score information may be "crude, noisy, incomplete or inconsistent" [13]. In particular, scores related to very dissimilar proteins are usually badly computed. Obviously, poor scoring information is not an advantage for the mapping project and alignment methods that give reliable similarity measures, even for very different proteins, should be preferred. In this sense, if we define a score as the best possible association between substructures of two proteins, the best way to proceed is to compute the global maximum of the association measures with respect to relative positions. This is the philosophy of the algorithm by Kolodny and Linial [4] which, on the other hand, is unacceptably expensive for the present computer facilities. Therefore, as currently done in Optimization, we must turn to algorithms that *very likely* compute global maximizers and *are guaranteed to* compute critical points (points that satisfy sharp necessary conditions for maximality).

Summing up, algorithms for protein structural comparisons that provide theoretical tools for a more profound analysis of the relationships between sequence, folding, and structure must have some desirable properties: They must converge, in the sense that a solution must always be encountered, and the solutions must have known properties in terms of the similarity function being considered. The algorithm must also be versatile (adaptable for the optimization of diverse merit functions) since the study of the correlations between structural similarity and functionality may require specific similarity measures. Furthermore, a good method must be competitive with current algorithms in terms of computer time.

### Structural alignment algorithms

Algorithms for protein alignment usually fall into two categories: A large group is based on the comparison of relative internal coordinates of the proteins [2, 3, 14–17] whereas the second class relies on the explicit superposition of the two structures [18–22]. Structural alignment methods are also affected by the measure of similarity being adopted [17]. The more straightforward comparison tool is the Root Mean Square Deviation (RMSD), but RMSD is not a measure of similarity by itself, since it must be accompanied by the number of C*α* atoms of each structure being aligned and is not sensitive to the presence of gaps [23]. Scores that take into account the presence of gaps and automatically incorporate the number of C*α* atoms being compared have been developed [23]. Algorithms for the maximization of such scores or for the minimization of the RMSD under some conditions use different levels of information on the structure of the proteins: From internal distances only, to secondary structures [14, 21]. The comparison of the performance of these methods is not trivial, since each one is based on a different score and the packages often do not provide clearly comparable outputs. However, a comprehensive evaluation of several of these methods was recently elaborated and the results obtained suggest that the STRUCTAL algorithm (see below) provides the best alignments to date [19, 23]. These results indicate that the search of the maximizer of distance-dependent additive scores is a valuable strategy for obtaining meaningful alignments. All these algorithms are, however, heuristic in the sense that a rigorous characterization of the solutions that they find is not provided, and the relation of such solutions and score maximizers is uncertain.

### Convergent algorithms for protein alignment

Kolodny and Linial [4] introduced a method for solving the Protein Alignment problem whose complexity is polynomial in the number of C*α* atoms. The idea is to use an exhaustive *ε*-grid search in the space of rotations and to exploit the polynomiality of the Dynamic Programming procedure to optimize a score function. They also illustrate their approach by the maximization of the STRUCTAL score, but other scores could be maximized with the same strategy. If one could define a clearly biologically meaningful similarity measure, this algorithm would provide the optimal alignment for any pair of proteins.

The method of Kolodny and Linial obtains the *global* solution (up to a precision *ε* > 0) of the Alignment problem at the expense of an unaffordable computational effort. Therefore, as mentioned in [4], this method is not practical with the present computer facilities, although it sheds light on the complexity of the problem. This is a very common situation in Optimization practice. Methods that converge to global optimizers can be defined but practical methods are based on local optimization principles.

An algorithm is said to be *convergent* if it finds stationary points of the objective function, independently of the initial approximation (this property is called *global convergence* in the Numerical Optimization field [27, 28]). We will show that the interpretation of the protein alignment problem in the context of Low Order Value Optimization Theory [29, 30] suggests how convergent algorithms can be formulated. Moreover, we will describe how the STRUCTAL algorithm can be modified in order to obtain convergence. Then, we will introduce a new strategy for obtaining the correspondence between the C*α* atoms of the proteins that provide the algorithms with high efficiency while maintaining good score maximization and allows for non-sequential alignments. Numerical experiments will be presented that demonstrate that the methods are also competitive in terms of computer time with state of the art algorithms. Remarks and perspectives are given in the Conclusions. The Methods section contains details of the implementation of the line-search Newtonian algorithm proposed and rigorous convergence proofs.

## Results and Discussion

### General methodological principles

In this section we describe the Low Order Value Optimization problem. We explain the way in which structural alignment can be interpreted as a LOVO problem, and we show how this interpretation naturally suggests robust convergent algorithms for protein alignment.

#### Protein Alignment as a Low Order Value Optimization Problem

Given *f*_{1}(*x*), ..., *f*_{
m
}(*x*) a finite set of real functions, the Low Order Value Optimization (LOVO) problem consists of finding *x* such that the maximum of *f*_{1}(*x*), ..., *f*_{
m
}(*x*) is maximal. That is, defining *f*(*x*) = max{*f*_{1}(*x*), ..., *f*_{
m
}(*x*)} the objective of LOVO is to maximize *f* (*x*). The application of LOVO to more general problems, an equivalent formulation in terms of minimization and suitable definitions of criticality [31, 32] are given in [29]. General algorithms of Newtonian type that globally converge to stationary points were defined in [29] and applied to several practical problems. The main ideas of how convergence is obtained with the application of LOVO methods to protein alignment will be given below. Theoretical details and convergence proofs can be found in [29, 30].

*α*atoms, by rotating and translating one of the structures. Therefore,

*each correspondence*defines a function (the RMSD or some other scoring function) of the translations and rotations, as represented schematically in Figure 1(a). Since the number of possible correspondences is finite, the functions of the rigid-body transformations also form a finite set. In theory, one could maximize the scoring function for each correspondence independently, and the correspondence with maximal score value would be the best alignment between the two structures. Therefore, as in the LOVO theory, in structural alignment one wishes to obtain the optimal value of a finite set of real smooth functions (one for each correspondence) [30]. The objective function to be maximized assumes the value of the

*f*

_{ i }that has the optimal value for each point of the domain, as represented in Figure 1(b).

#### The STRUCTAL algorithm

where *d* is the distance between corresponding C*α* atoms and *n*_{
g
}is the number of gaps in the sequence alignment.

The apparent success the STRUCTAL method probably comes from the fact that it is based on two strong ideas that are incorporated in several state-of-the-art alignment algorithms: 1) For a given spatial orientation of the two structures, the correspondence that *globally* maximizes a distance-dependent score can be obtained using Dynamic Programming [24]; and 2) For a given correspondence, a rigid-body transformation that *globally* minimizes the RMSD can be obtained analytically (this problem is known as the Procrustes problem, and here we will call "Procrustes" the process of obtaining the rigid-body superposition that minimizes the RMSD between corresponding C*α* atoms of two proteins) [25, 26]. The STRUCTAL algorithm consists in the iterative application of these two methods: For the current spatial orientation of the proteins, a correspondence is obtained using Dynamic Programming, and this correspondence is used for a best rigid-body superposition (using Procrustes) in order to obtain a new relative spatial orientation of the structures. The goal of the STRUCTAL algorithm is to maximize the STRUCTAL score.

From the point of view of score maximization given a spatial orientation, the use of Dynamic Programming is quite effective and theoretically justified, since this procedure obtains the bijection that globally maximizes the score itself [24]. However, obtaining the rigid-body transformation that minimizes the RMSD for the current bijection between C*α* atoms is not a score-maximizing strategy. Therefore, although this procedure seems to be adequate because alignments with low RMSD are desirable, it is not the best choice if one wants to maximize the scoring function. In particular, the algorithm usually does not converge to a maximizer of the score, and many times oscillates between two different alignments.

#### Convergent LOVO algorithms for protein alignment

Figures 1(a) and 1(b) show how the objective function of protein alignment can be interpreted as the maximal function of rotations-translations corresponding to each admissible correspondence between C*α* atoms of the two structures.

- 1.
For a given displacement (rotation-translation, giving a three-dimensional relative orientation between the two proteins), a suitable subalgorithm must be able to identify which is the correspondence that maximizes the score. This corresponds to steps of type A in Figure 1(c).

- 2.
Once the best correspondence is found, another subalgorithm must be able to obtain a new rotation-translation displacement that improves the score for the correspondence obtained by the step of Type A. These are the steps of type B in Figure 1(c).

In the context of protein alignment, steps of type A may be performed using the classical Dynamic Programming algorithm [24]. Unfortunately, Procrustes rigid-body superpositions do not satisfy the conditions required for the steps of type B, since they do not guarantee the improvement of the score. However, reliable optimization algorithms exist that are always able to obtain improvements of the score merit function. Therefore, a natural approach consists of replacing Procrustes by an iteration of one of these algorithms. This is the principle governing the methods proposed here. Interestingly, note that point C in Figure 1(c) is non-smooth because two correspondences provide the same score for the given rotation-translation. As justified by the LOVO theory, however, the non-smoothness of the objective function may be simply ignored: A smooth optimization algorithm that takes any of the concurrent gradients may be used for improving the merit function at each iteration [29, 30].

## Algorithms

The Low Order Value Optimization approach suggests that, for obtaining a convergent score optimization algorithm, one should replace the Procrustes best rigid-body transformation by some algorithm that guarantees increasing the score during the structural alignment step. Here, we propose a classical Safeguarded Line-Search Newtonian algorithm, leading to the Dynamic-Programming-Line-Search method. Next, motivated by the rather high computational cost of the Dynamic-Programming step, we propose a second LOVO algorithm that preserves the Line-Search step but uses a Non-Bijective correspondence that can be computed very fast. This will be the Non-Bijective-Line-Search method, which will allow also for the study of non-sequential alignments. Both algorithms converge to stationary points of the scores, in the sense defined above.

### Dynamic-Programming-Line-Search method (DP-LS)

The DP-LS method preserves the Dynamic-Programming step. However, it uses a Safeguarded Line Search Newtonian method that guarantees a sufficient increase of the score during the structural alignment step [28].

#### Safeguarded Line Search Newtonian step

- 1.
Obtain a strictly concave quadratic approximation of the objective function at the current point.

- 2.
Maximize this quadratic approximation. The maximizer of the quadratic model is called (first) trial point, as shown in Figure 2(a).

- 3.
If the true objective function increased enough at the trial point, then the trial point is accepted as the new iterate.

- 4.
If the objective function did not increase enough, a new point is obtained in the segment determined by the current point and the trial point using safeguarded quadratic interpolation. This step is represented in Figure 2(b). In this way one obtains a new trial point and the control returns to Step 3.

The quadratic approximation uses the Hessian matrix, modified in such a way that the model is strictly concave. This guarantees that an ascent direction is generated at every iteration and that the objective function increases sufficiently at every iteration. As a consequence of sufficient increase, every limit point generated by the method is stationary and, under reasonable conditions, the local convergence rate is quadratic.

The reasons why this algorithm converges to a critical point of the objective function are the following: The quadratic model is a good approximation of the true objective function at the current point, since it has the same first and second derivatives than the objective function, as shown in Figure 2(c). Furthermore, the model is concave, so that either the current point is a maximizer of the quadratic model or there is a direction along which the model (and the objective function) must increase. The first trial point is obtained by the maximization of the quadratic model, but it may not increase the function value, as represented in Figure 2(c). If the function value is not increased, a new trial point is obtained, closer to the current point, along the line connecting the current point and the first trial point. Since the parabola that represents the quadratic model along this line is concave, any point in this line must have a greater value for the quadratic model. Furthermore, since the model is a good representation of the true objective function in the vicinity of the current point, for a trial point close enough to the current point, the true objective function must also increase. The strategy of reducing the distance between the current and the trial point guarantees that, eventually, the function value will be improved, and the method must converge to a critical point (see Methods).

#### DP-LS algorithm

- 1.
Given the three-dimensional orientation, compute the bijection that maximizes the score using Dynamic Programming.

- 2.
Given the current bijection, perform a single Safeguarded Line Search Newtonian iteration to obtain a new orientation for the second protein that guarantees a greater score for the current bijection.

- 3.
If the score increased more than a given tolerance, go to step 1. Otherwise stop.

DP-LS deals with the same objective function at the two phases of the algorithm, in contrast with STRUCTAL, that maximizes the score in the first phase and minimizes the RMSD in the second one. For this reason, we expect that the convergence to score maximizers should be more robust in DP-LS than in the STRUCTAL algorithm. Their computational cost must be similar, since both use the DP step, and the cost of performing a Newtonian iteration is similar to the cost of a Procrustes best rigid-body superposition.

### Non-Bijective-Line-Search method (NB-LS)

Motivated by the fact that Dynamic Programming is computationally expensive, we define here a new correspondence that sacrifices some of the requirements of a good correspondence but that may be computed much more rapidly. Moreover, this procedure makes it possible the generalization of the methods presented here for problems that do not require the correspondence to be monotone and sequential. The NB-LS method can be used for the structural superposition of other structures than proteins, such as ligands.

The Dynamic-Programming step was designed to globally maximize the score for a bijective and monotone correspondence. Both properties are reasonable from the point of view of protein alignment, since the C*α* atoms of each protein are ordered along the protein chain.

We propose now a new (Non-Bijective) correspondence, that sacrifices both bijectivity and monotonicity, but is very cheap to compute: *For each C* *α* *atom of the first protein, the C* *α* *atom corresponding to it is the closest C* *α* *atom of the second protein*. Clearly, this correspondence violates both the bijective and monotone characteristics of the DP correspondence, since different C*α* atoms of the first protein may be associated to the same C*α* atom of the second protein. This correspondence has the somewhat undesirable property that it is not symmetric relative to the interchange of the two proteins. Here we consider systematically the first structure as the smallest one because, as will be shown bellow, this is favorable for obtaining the correspondence rapidly.

This correspondence globally maximizes any score that increases monotonically as the distance between atoms decreases. The STRUCTAL score is one of the many scores that satisfy this property. Therefore, the use of this correspondence associated with the Line-Search Newtonian method also results in a convergent LOVO algorithm, although the functions *f*_{
i
}are defined in a different way. The advantage of the non-bijective correspondence is that it may be computed very rapidly, as will be shown below.

#### Fast algorithm for obtaining the NB-correspondence

- 1.
Obtain the internal distance-matrix of one (usually the largest) protein (called protein B).

- 2.
Sort the elements of each column of the internal distance-matrix. Therefore, the

*ordered*internal distance-matrix will contain, for each C*α*atom of protein B, the distances to other C*α*atoms of the same protein in increasing order.

- 1.
Compute the distance

*d*_{1}of the first atom of A (1A) to some atom of B (for example, to atom 1B), as shown in Figure 3(a).

- 2.
Since we seek the atom of B that is closest to atom 1A, this atom of B cannot be farther than the atom 1B. Therefore, it necessarily belongs to the sphere of radius

*d*_{1}around atom 1A, as shown in Figure 3(a). - 3.
As shown in Figure 3(b), the condition above implies that the atom of B closest to 1A is in the sphere of radius 2

*d*_{1}around atom 1B. - 4.
Therefore, one needs to compute the distances of the C

*α*atom 1A*only*to those atoms of B that are closer than 2*d*_{1}to atom 1B. We know which are these atoms because we have the ordered internal distance matrix of B.

In practice, the first distance computed may be to some atom already known to be close to 1A according to previous iterations. For the second atom of A, the first distance computed may be to the atom which was found to be closest to the first atom of A, and so on. These procedures maintain the initial distances *d*_{1} small, keeping also small the number of distances that have to be computed for each atom of A. We observe that the number of distances computed for each atom of the first protein is of the order of 10, almost independently of the size of the protein B.

Using this algorithm for a single protein-protein comparison might not result in time savings due to the cost of the preparatory step. However, when performing database comparisons, the time savings are large because the preparatory steps may be performed in a rational way: For a comparison of a single protein to a database of structures, the ordered internal distance matrix may be computed only for that single protein, which is then systematically treated as protein B. For all-on-all protein comparisons, one computes the ordered distance matrix for the largest protein, treats it as protein B, and aligns it to the whole database. Then we move to the second largest protein, repeat the operation, and so on. Only a single preparatory step is performed for each protein. The computation of the ordered distance matrices was observed to take only about 4% of the total alignment time.

### Testing

The theoretical reasons why the methods presented here should be robust and fast for score maximization were presented in the previous sections. Now we discuss how these methods behave in practice and whether the theoretical expectations were fulfilled.

^{-3}. Considering all the alignments, the best scores are obtained in about 50% of the cases by the DP-LS method, in 45% of the cases by the STRUCTAL algorithm and in only about 7% of the cases by the NB-LS method. However, most alignments in a database are not meaningful and, therefore, the capacity of the method in identifying good alignments is more important. Therefore, the percentage of best-scores obtained by each method were classified in terms of the best score obtained by the three methods, resulting in Figure 4(a). In this comparison, the STRUCTAL scores are scaled by the number of atoms of the smallest protein involved in each alignment: The STRUCTAL score for a perfect alignment with no gaps is 20

*n*

_{ A }, where

*n*

_{ A }is the number of atoms of the smallest protein involved. Therefore, scaling the scores allows one to compare alignments containing proteins with different number of atoms. Scaled scores are always between 0 and 20.

The first clear observation is that DP-LS is systematically able to obtain the best scores in the highest percentage of cases for all alignment qualities. For alignments with (scaled) best-scores greater than 6, for example, DP-LS obtains the best scores in at least 90% of the cases. For alignments with best scores greater than 12, DP-LS obtains the best scores in 98% of the problems. The STRUCTAL algorithm is competitive with DP-LS for bad alignments (scores smaller than 3) and for very good alignments (scores greater than 18).

The NB-LS method obtains the best scores only in about 7% of the cases in general (best-scores greater than zero). However, as the overall quality of the scores increases, the algorithm linearly improves its winning percentage, obtaining the best scores for 90% of the cases for scores greater than 13 and for 98% of the cases for scores greater than 15. In terms of this evaluation, this algorithm obtains better results than STRUCTAL for all scores greater than 4.

Figure 4(a) suggests that DP-LS and NB-LS are very effective for STRUCTAL score maximization. However, this figure does not give a measure of the difference between the scores obtained by each method, providing only a partial image of the actual results. Figure 4(b) is a plot of the average value of the scores obtained by each method relatively to the best scores obtained. Again, the best results are obtained systematically by the DP-LS algorithm, followed by the STRUCTAL algorithm for bad alignments and by the NB-LS algorithm for alignments with best-scores greater than 13.

Figure 4(b) depicts the values of the scores obtained by each method, relatively to the best score obtained. All algorithms obtain scores that are greater than 90% of the best score obtained for problems with best scores greater than 8. For alignments with best scores greater than 14, the STRUCTAL method obtains scores about 1 to 2% smaller than the other methods. This illustrates the effect of using a reliable local convergence strategy on the overall alignment qualities.

These results are as expected in view of the theoretical predictions: The DP-LS algorithm systematically obtains the best results, since it contains only monotone score maximization steps. The best scores obtained by this algorithm are mostly related to the fast local convergence of the LS algorithm. For bad alignments this difference is not as important as it is for good alignments. The NB-LS also behaves as expected: It does not penalize gaps or forces monotonicity during optimization, therefore it is not effective for obtaining good scores for alignments which do not naturally satisfy monotone-bijection and few-gaps properties. However, as the overall alignment quality is improved, these properties are automatically satisfied, and the deleterious effect of the non-bijective correspondence is reduced. For alignments with best scores greater than 13 the sacrifice of the bijection is not as important as the improvement provided by the LS step, and the results are better than the ones obtained by STRUCTAL.

Average time per alignment of the three methods

Method | Average time per alignment/s |
---|---|

STRUCTAL | 0.209 |

DP-LS | 0.133 |

NB-LS | 0.033 |

## Implementation

In this section we provide technical details that should be useful to replicate algorithms and experiments.

### Line search

At steps of type B of the algorithms proposed here we used a single line-search procedure. One could perform the full optimization of the score given the current bijection using the LS algorithm until convergence is achieved. However, this is not worthwhile, since after a single movement of the proteins, the bijection that maximizes the score for the new orientation of the proteins frequently changes. Therefore, for every new three-dimensional orientation of the proteins, it is reasonable to recompute the bijection.

Also some safeguards must be taken into account in order to guarantee that the line-search Newtonian method converges. For instance, the current-trial point distances must not be reduced abruptly. Moreover, the increase of objective function value must be at least a fraction of the increase predicted by the quadratic model at the accepted point. The details on how the line-search must be implemented in order to obtain practical and theoretical convergence can be found, for example, in [28] for smooth optimization and in [30] for the (non-smooth) structural alignment problem. More details of the current implementation and convergence proofs can be found in the Methods section.

### Initial approximations

The initial points for the alignments were obtained using an approximate alignment based on the internal coordinates of the proteins: For each protein with *N* atoms, a set of *N* - 3 points in R^{3} are defined by the distances of atom *i* to atoms *i* + 2, *i* + 3, and the distance between *i* + 2 and *i* + 3, for *N* consecutive indices *i*. These are the three distances that determine the dihedral of the atoms that follow atom *i*. This creates a "pseudostructure" with *N* - 3 atoms for each protein. The correspondence between the atoms of the pseudostructures of the two proteins is obtained using Dynamic Programming to maximize a STRUCTAL-like score (in which the distances are multiplied by a factor, in our case 20, for providing a reasonable scaling relative to the score parameters). The superposition that minimizes the RMSD for this bijection is obtained using Procrustes and this orientation of the proteins was defined as the initial point for the alignments. This algorithm was observed to directly provide the solution for the alignment of very similar proteins. Furthermore, this method provides good approximations for all algorithms, in such a way that even the classical circularly permuted pair 2pel:A-5cna:A [34] was correctly aligned with all the methods reported here.

For comparing the alignment obtained by the NB-LS method relative to STRUCTAL and NB-LS we compute, as a post-processing step, the actual bijective and monotone STRUCTAL scores relative to the (DP) optimal monotone bijection, for the final alignment obtained.

### Numerical examples

We selected, at random, 20 proteins from the publicly available DALI alignments [3]. For each of these 20 proteins, the 20 best matches found by DALI were also included in our data set. Therefore our database contained 400 proteins, including similar proteins (since they were obtained from a DALI classification) as well as structurally non-correlated proteins. All-on-all alignments within this set of proteins were performed with the three methods, comprising 79,800 alignments for each algorithm. The list of the 400 proteins used can be obtained at the web site of LovoAlign. For sorting the distances during the calculation of the ordered internal distance matrices for the NB-LS method we used the Flashsort algorithm [33]. The methods are implemented in Fortran77. The tests were run on an AMD Opteron 242 with 1 Gb of RAM running Linux. The software was compiled with the GNU fortran compiler version 3.3 with the "-O3 -3ast-math" options.

## Conclusion

Here we presented two contributions, one theoretical, and other practical, to the problem of structural alignment. The theoretical contribution is the interpretation of the alignment problem as a Low Order Value Optimization problem, a framework under which convergent algorithms can be developed. Furthermore, the solutions obtained are critical points of the scoring function. This means that the solutions obtained by our algorithms admit a precise mathematical description in terms of necessary conditions for score maximality.

- 1.
The STRUCTAL algorithm iteration has two phases: Maximizing the STRUCTAL score for fixed positions (Dynamic Programming) and modifying the relative positions with RMSD minimization (Procrustes). These two objectives might be conflictive leading to oscillation. Our DP-LS modifi-cation improves the score at both phases. Therefore, the score increases monotonically at all the iterations.

- 2.
One of the theoretical consequences of the monotone behavior of DP-LS is that this algorithm enjoys convergence to stationary points independently of the initial approximation.

- 3.
The first phase of the STRUCTAL algorithm and DP-LS uses Dynamic Programming. With the aim of reducing the cost of this procedure we introduced a nonbijective correspondence at the first phase of the iterations. The nonbijective association is computationally very cheap.

- 4.
As expected, NB-LS is faster than the algorithms based on first-phase Dynamic Programming. Perhaps surprisingly, for medium to good alignments, its robustness is similar to the one of the STRUCTAL algorithm and DP-LS. The reason is that meaningful alignments usually satisfy the bijective and monotone properties for the best correspondence without being necessary to force them at every step of the optimization procedure.

In this paper, we *do not* address the problem of whether the STRUCTAL score is the best merit function for the evaluation of the biological relevance. The functional relevance of the alignments obtained here is intrinsically linked to the functional relevance of the score being maximized. An ideal score would be one that increases as functional similarity increases. Although several scores have been proposed [8, 9, 23, 34], the design of a score that maximizes biological relevance is still an open problem. This occurs, in part, because reliable practical methods for score optimization were unavailable. The algorithms presented here should be effective tools for the maximization of any distance-dependent score. Therefore, these methods may be used for the development of new scores designed to be functionally meaningful.

The approaches described here can be employed for more general structural alignment problems. The substitution of the Procrustes procedure by the Newtonian algorithm can be used to introduce internal transformations (as flexibility) in the proteins being aligned. On the other hand, the replacement of the Dynamic-Programming strategy by the NB-correspondence makes it possible the treatment of structural alignment problems where monotonicity does not hold. These possibilities were theoretically investigated, and succesfully tested, in a previous work [30], but an effective implementation of these methods including flexibility or other transformation on the structures is an area of future research.

## Methods

Here we give a detailed description of the Newtonian algorithm used in DP-LS and NB-LS and rigorous convergence proofs. More detailed descriptions of the algorithms and of the theory involved can be found in [30]. Although the original problem is given in terms of maximization the maximum of a set of functions, here we use the "minimization of the minimum" approach, which is trivially equivalent. Therefore, our problem may be formulated in the following way.Minimize *f*_{
min
}(*x*)

where*f*_{
min
}(*x*) = min{*f*_{1}(*x*), ..., *f*_{
m
}(*x*)}.

we denote *I*_{
min
}(*x*) = {*i* ∈ {1, ..., *m*} | *f*_{
i
}(*x*) = *f*_{
min
}(*x*)}. we denote $\Vert z\Vert =\sqrt{{z}_{1}^{2}+\mathrm{...}+{z}_{n}^{2}}$.

**Algorithm A1**. Let *θ* ∈ (0, 1), *α* ∈ (0, 1/2), *β* > 0, be algorithmic parameters. Let *x*_{0} ∈ $\mathcal{R}$^{
n
}be the initial approximation. Given *x*_{
k
}∈ $\mathcal{R}$^{
n
}, the steps for computing *x*_{k+1}are:

**Step 1.** Choose *ν*(*k*) ∈ *I*_{
min
}(*x*_{
k
}). If ||∇*f*_{ν(k)}(*x*_{
k
})|| = 0, terminate.

**Step 2.** Compute *d*_{
k
}∈ $\mathcal{R}$^{
n
}such that∇*f*_{ν(k)}(*x*_{
k
})^{
T
}*d*_{
k
}≤ -*θ*||*d*_{
k
}|| ||∇*f*_{ν(k)}(*x*_{
k
})||(2)

and||*d*_{
k
}|| ≥ *β*||∇*f*_{ν(k)}(*x*_{
k
})||.(3)

In the Newtonian version of the algorithm we choose*d* = - (∇^{2}*f*_{ν(k)}(*x*_{
k
}) + *λI*)^{-1}∇*f*_{ν(k)}(*x*_{
k
}),(4)

where *λ* is the first number in the sequence {0,0.1||∇^{2}*f*_{ν(k)}(*x*_{
k
})||,0.2||∇^{2}*f*_{ν(k)}(*x*_{
k
})||,...} that verifies (2).

*d*= $\widehat{x}$ -

*x*

_{ k }where $\widehat{x}$ is the minimizer of a quadratic approximation

*q*(

*x*) of

*f*

_{ν(k)}(

*x*). Namely,

If *λ* = 0 (which is the usual case) this is the ordinary Taylor approximation of *f*_{ν(k)}. Sometimes it is necessary to take *λ* > 0 in order to guarantee that a minimum of the quadratic exists and that the generated direction is a descent direction (2). In this case, the geometrical meaning of *d* is that *d* = $\widehat{x}$ - *x*_{
k
}, where $\widehat{x}$ minimizes the Taylor quadratic approximation in a restricted trust ball [35].

If *d* satisfies (3) we take *d*_{
k
}= *d*. Otherwise, we take *d*_{
k
}= *βd*_{
k
}/||*d*_{
k
}||. In the Newtonian implementation of the algorithm, we use *θ* = 10^{-4}, *β* = 10^{-6}.

**Step 3**. In this step, we aim to compute *t*_{
k
}> 0, *x*_{k+1}∈ $\mathcal{R}$^{
n
}, such that*f*_{
min
}(*x*_{k+1}) ≤ *f*_{
min
}(*x*_{
k
}) + *αt*_{
k
}∇*f*_{ν(k)}(*x*_{
k
})^{
T
}*d*_{
k
}.(5)

- 1.
*t*← 1; - 2.
**Suffcient Descent Test**. If*t*satisfies*f*_{ min }(*x*_{k+1}) ≤*f*_{ min }(*x*_{ k }) +*αt*∇*f*_{ν(k)}(*x*_{ k })^{ T }*d*_{ k }(6)

*t*

_{ k }=

*t*,

*x*

_{k+1}=

*x*

_{ k }+

*t*

_{ k }

*d*

_{ k }and finish Step 3.

- 3.If
*t*does not satisfy (6) compute $\widehat{t}$ as:$\widehat{t}=\frac{-\nabla {f}_{\nu (k)}{({x}_{k})}^{T}{d}_{k}{t}^{2}}{2({f}_{min}({x}_{k}+t{d}_{k})-{f}_{min}({x}_{k})-\nabla {f}_{\nu (k)}{({x}_{k})}^{T}{d}_{k}t}.$(7)

(If the denominator of the expression above vanishes, we take $\widehat{t}$ = 0.5.)

With the choice (7), $\widehat{t}$ is the minimizer of the one-dimensional quadratic (parabola) *φ* that interpolates *f*_{
min
}at *x*_{
k
}and *x*_{
k
}+ *td*_{
k
}along the direction *d*_{
k
}. By this we mean that*φ*(0) = *f*_{
min
}(*x*_{
k
}), *φ'*(0) = ∇*f*_{ν(k)}(*x*_{
k
})^{
T
}*d*_{
k
}, *φ*(*t*) = *f*_{
min
}(*x*_{
k
}+ *td*_{
k
}).

If $\widehat{t}$ > *t*/2 we take *t* ← *t*/2. If $\widehat{t}$ <*t*/10 we take *t* ← *t*/10. Otherwise, take *t* ← $\widehat{t}$. (This procedure is known as *safeguarded quadratic interpolation* [27].) Go to **Suffcient Descent Test**.

We say that *x*_{*} is a critical (or stationary) point if ∇*f*_{
i
}(*x*) = 0 for some *i* ∈ *I*_{
min
}(*x*). Critical points are Clarke Stationary points in the sense used in [31], for example.

In the following theorems we prove that the algorithm stops at *x*_{
k
}only if *x*_{
k
}is critical and that limit points of sequences generated by Algorithm **A1** are critical. These proofs are adaptations of the ones displayed in [29] in a more general setting.

**Theorem 1**. *Algorithm* **A1** *is well-defined and terminates at x*_{
k
}*only if x*_{
k
}*is critical*.

*Proof*. Assume that

*x*

_{ k }is not critical and define

*i*=

*ν*(

*k*). So, ∇

*f*

_{ i }(

*x*

_{ k }) ≠ 0. By (2) and the differentiability of

*f*

_{ i },

*α*< 1, for

*t*small enough we have:

Since ∇*f*_{
i
}(*x*_{
k
})^{
T
}*d*_{
k
}< 0, we deduce:*f*_{
i
}(*x*_{
k
}+ *td*_{
k
}) ≤ *f*_{
i
}(*x*_{
k
}) + *αt*∇*f*_{
i
}(*x*_{
k
})^{
T
}*d*_{
k
}.

But *f*_{
min
}(*x*_{
k
}+ *td*_{
k
}) ≤ *f*_{
i
}(*x*_{
k
}+ *td*_{
k
}) and *f*_{
min
}(*x*_{
k
}) = *f*_{
i
}(*x*_{
k
}), so:*f*_{
min
}(*x*_{
k
}+ *td*_{
k
}) ≤ *f*_{
min
}(*x*_{
k
}) + *αt*∇*f*_{
i
}(*x*_{
k
})^{
T
}*d*_{
k
}

for *t* small enough.

Therefore, choosing *t*_{
k
}as in Steps 3.1–3.3, the condition (5) is satisfied.

This proves that, whenever *x*_{
k
}is not critical, a point *x*_{k+1}satisfying (5) may be found, so the algorithm is well defined.

**Theorem 2**

*If x*

_{*}

*is a limit point of a sequence generated by Algorithm*

**A1**

*then x*

_{*}

*is critical. Moreover, if*lim

_{k∈K}

*x*

_{ k }=

*x*

_{*}

*and the same i*=

*ν*(

*k*) ∈

*I*

_{ min }(

*x*

_{ k })

*is chosen at Step 1 of the algorithm for infinitely many indices k*∈

*K*,

*then i*∈

*I*

_{ min }(

*x*

_{*})

*and*∇

*f*

_{ i }(

*x*

_{*}) = 0.

*Finally*,

*Proof*. Let

*x*

_{*}∈ $\mathcal{R}$

^{ n }be a limit point of the sequence generated by Algorithm

**A1**. Let

*K*= {

*k*

_{0},

*k*

_{1},

*k*

_{2},

*k*

_{3}, ...} be an infinite sequence of integers such that:

- 1.
There exists

*i*∈ {1, ...,*m*} such that*i*=*ν*(*k*) for all*k*∈*K*. - 2.
lim

_{k∈K}*x*_{ k }=*x*_{*}.

The sequence *K* and the index *i* necessarily exist since {1, ..., *m*} is finite.

*f*

_{ i },

Clearly, since *i* = *ν*(*k*), we have that*f*_{
i
}(*x*_{
k
}) ≤ *f*_{ℓ}(*x*_{
k
}) for all ℓ ∈ {1, ..., *m*}.

for all *k* ∈ *K*.

Taking limits on both sides of this inequality, we see that *f*_{
i
}(*x*_{*}) ≤ *f*_{ℓ}(*x*_{*}) for all ℓ ∈ {1, ..., *m*}. Thus,*i* ∈ *I*_{
min
}(*x*_{*}).(11)

**A1**, since

*k*

_{j+1}≥

*k*

_{ j }+ 1, we have:

for all *j* ∈ $\mathcal{N}$.

*K*

_{1}⊂

*K*, lim

_{k∈1}∇

*f*

_{ i }(

*x*

_{ k }) = 0, we deduce that ∇

*f*

_{ i }(

*x*

_{*}) = 0 and the thesis is proved. Therefore, we only need to analyze the possibility that ||∇

*f*

_{ i }(

*x*

_{ k })|| is bounded away from zero for

*k*∈

*K*. In this case, by (12),

*d*

_{ k }|| → 0, the condition (2) also implies that ∇

*f*

_{ i }(

*x*

_{ k }) → 0 and ∇

*f*

_{ i }(

*x*

_{*}) = 0. Thus, we only need to consider the case in which lim

_{k∈K}

*t*

_{ k }= 0. Without loss of generality, we may assume that

*t*

_{ k }< 1 for all

*k*∈

*K*. So, for all

*k*∈

*K*there exists ${\overline{t}}_{k}$ > 0 such that

*s*

_{ k }= ${\overline{t}}_{k}$

*d*

_{ k }for all

*k*∈

*K*. Then, by (15),

By (14) and the Mean Value Theorem, for all *k* ∈ *K* there exists *ξ*_{
k
}∈ [0, 1] such that∇*f*_{
i
}(*x*_{
k
}+ *ξ*_{
k
}*s*_{
k
})^{
T
}*s*_{
k
}= *f*_{
i
}(*x*_{
k
}+ *s*_{
k
}) - *f*_{
i
}(*x*_{
k
}) > *α*∇*f*_{
i
}(*x*_{
k
})^{
T
}*s*_{
k
}.(17)

for all *k* ∈ *K*.

Let *K*_{1} ⊂ *K*, *s* ∈ $\mathcal{R}$^{
n
}be such that ${\mathrm{lim}\phantom{\rule{0.1em}{0ex}}}_{k\in {K}_{1}}{s}_{k}/\Vert {s}_{k}\Vert =s$.

By (16), dividing both sides of the inequality (17) by ||*s*_{
k
}||, and taking limits for *k* ∈ *K*_{1}, we obtain:∇*f*_{
i
}(*x*_{*})^{
T
}*s* ≥ *α*∇*f*_{
i
}(*x*_{*})^{
T
}*s*.

Since *α* < 1 and ∇*f*_{
i
}(*x*_{
k
})^{
T
}*d*_{
k
}< 0 for all *k*, this implies that ∇*f*_{
i
}(*x*_{*})^{
T
}*s* = 0. Thus, taking limits in (18), we obtain that ∇*f*_{
i
}(*x*_{*}) = 0. Therefore, by (11), *x*_{*} is critical.

Finally, let us prove (9). If (9) is not true, there exists *j* and an infinite set of indices *k* ∈ *K* such that *j* = *ν*(*k*) and ||∇ *f*_{
j
}(*x*_{
k
})|| is bounded away from zero. This implies that *j* ∈ *I*_{
min
}(*x*_{*}) and ||∇*f*_{
j
}(*x*_{*})|| ≠ 0, contradicting the first part of the proof. □

## Availability and requirements

The program for performing protein structural alignments with these methods is freely available, with source codes, at:

http://www.ime.unicamp.br/~martinez/lovoalign

An *online server* for pairwise comparison is also available. The methods are implemented in such a way that performing pairwise, single-protein-to-database or all-on-all database comparisons is straightforward.

## Declarations

### Acknowledgements

We thank Paulo S. Silva and Walter F. Mascarenhas for valuable discussions. The authors thank FAPESP and CNPq and UNICAMP for financial support.

## Authors’ Affiliations

## References

- Berman HM, Westbrook J, Feng Z, Gililand G, Bhat TN, Weissig H, Shindyalov IN, Bourne PE: The Protein Data Bank. Nucleic Acids Res. 2000, 28: 235-242. 10.1093/nar/28.1.235.PubMed CentralView ArticlePubMedGoogle Scholar
- Holm L, Sander C: Protein structure comparison by alignment of distance matrices. J Mol Biol. 1993, 233: 123-138. 10.1006/jmbi.1993.1489.View ArticlePubMedGoogle Scholar
- Holm L, Park J: DaliLite workbench for protein structure comparison. Bioinformatics. 2000, 16: 566-567. 10.1093/bioinformatics/16.6.566.View ArticlePubMedGoogle Scholar
- Kolodny R, Linial N: Approximate protein structural alignment in polynomial time. P Natl Acad Sci USA. 2004, 101: 12201-12206. 10.1073/pnas.0404383101.View ArticleGoogle Scholar
- Yang A-S, Honig B: An integrated approach to the analysis and modeling of protein sequences and structures. I. Protein structural alignment and a quantitative measure for protein structural distance. J Mol Biol. 2000, 301: 665-678. 10.1006/jmbi.2000.3973.View ArticlePubMedGoogle Scholar
- Kolodny R, Petrey D, Honig B: Protein structure comparison: Implications for the nature of 'fold space' and structure and function prediction. Curr Opin Struc Biol. 2006, 16: 393-398. 10.1016/j.sbi.2006.04.007.View ArticleGoogle Scholar
- Onuchic JN, Wolynes PG: Theory of protein folding. Curr Opin Struc Biol. 2004, 14: 70-75. 10.1016/j.sbi.2004.01.009.View ArticleGoogle Scholar
- Zhang Y, Skolnick J: Scoring function for automated assessment of protein structure template quality. Proteins. 2004, 57: 702-710. 10.1002/prot.20264.View ArticlePubMedGoogle Scholar
- Zhang Y, Skolnick J: TM-align: A protein structure alignment algorithm based on TM-score. Nucleic Acids Res. 2005, 33: 2302-2309. 10.1093/nar/gki524.PubMed CentralView ArticlePubMedGoogle Scholar
- Vendruscolo M, Dobson CM: A glimpse at the organization of the protein universe. P Natl Acad Sci USA. 2005, 102: 5641-5642. 10.1073/pnas.0500274102.View ArticleGoogle Scholar
- Hou J, Sims GE, Zhang C, Kim S-H: A global representation of the protein fold space. P Natl Acad Sci USA. 2003, 100: 2386-2390. 10.1073/pnas.2628030100.View ArticleGoogle Scholar
- Hou J, Jun S-R, Zhang C, Kim S-H: Global mapping of the protein structure space and application in structure-based inference of protein function. P Natl Acad Sci USA. 2006, 102: 3651-3656. 10.1073/pnas.0409772102.View ArticleGoogle Scholar
- Lu F, Keles S, Wright SJ, Wahba G: Framework for kernel regularization with application to protein clustering. P Natl Acad Sci USA. 2005, 102: 12332-12337. 10.1073/pnas.0505411102.View ArticleGoogle Scholar
- Holm L, Sander C: Mapping the Protein Universe. Science. 1996, 273: 595-602. 10.1126/science.273.5275.595.View ArticlePubMedGoogle Scholar
- Shyndialov IN, Bourne PE: Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Eng. 1998, 11: 739-747. 10.1093/protein/11.9.739.View ArticleGoogle Scholar
- Zhu J, Weng Z: FAST: A novel protein structure alignment algorithm. Proteins. 2005, 58: 618-627. 10.1002/prot.20331.View ArticlePubMedGoogle Scholar
- Kedem K, Chew LP, Elber R: Unit-vector RMS (URMS) as a tool to analyze molecular dynamics trajectories. Proteins. 1999, 37: 554-564. 10.1002/(SICI)1097-0134(19991201)37:4<554::AID-PROT6>3.0.CO;2-1.View ArticlePubMedGoogle Scholar
- Gerstein M, Levitt M: Comprehensive assessment of automatic structural alignment against a manual standard, the Scop classification of proteins. Protein Sci. 1998, 7 (2): 445-456.PubMed CentralView ArticlePubMedGoogle Scholar
- Subbiah S, Laurents DV, Levitt M: Structural similarity of DNA-binding domains of bacteriophage repressors and the globin core. Curr Biol. 1993, 3: 141-148. 10.1016/0960-9822(93)90255-M.View ArticlePubMedGoogle Scholar
- Kleywegt GJ: Use of non-crystallographic symmetry in protein structure refinement. Acta Crystallog D. 1996, 52: 842-857. 10.1107/S0907444995016477.View ArticleGoogle Scholar
- Krissinel E, Henrick K: Protein structure comparison in 3D based on secondary structure matching (SSM) followed by C-alpha alignment, scored by a new structural similarity function. Proceedings of the First international Conference on Molecular Structural Biology. 2003, 3-7.Google Scholar
- Krissinel E, Henrick K: Secondary-structure matching (SSM), a new tool for fast protein structure alignment in three dimensions. Acta Crystallog D. 2004, 60: 2256-2268. 10.1107/S0907444904026460.View ArticleGoogle Scholar
- Kolodny R, Koehl P, Levitt M: Comprehensive evaluation of protein structure alignment methods: Scoring by geometric measures. J Mol Biol. 2005, 346: 1173-1188. 10.1016/j.jmb.2004.12.032.PubMed CentralView ArticlePubMedGoogle Scholar
- Needleman B, Wunsch CD: A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol. 1970, 48: 443-453. 10.1016/0022-2836(70)90057-4.View ArticlePubMedGoogle Scholar
- Kearsley SK: On the orthogonal transformation used for structural comparisons. Acta Crystallog A. 1989, 45: 208-210. 10.1107/S0108767388010128.View ArticleGoogle Scholar
- Kabsch W: A discussion of the solution for the best rotation to relate two sets of vectors. Acta Crystallog A. 1978, 34: 827-828. 10.1107/S0567739478001680.View ArticleGoogle Scholar
- Dennis JE, Schnabel RB: Numerical Methods for Unconstrained Optimization and Nonlinear Equations. 1983, New Jersey: Prentice Hall, Englewood CliffsGoogle Scholar
- Nocedal J, Wright SJ: Numerical Optimization. 1999, New York: SpringerView ArticleGoogle Scholar
- Andreani R, Martínez JM, Martínez L, Yano F: Low Order Value Optimization and Applications. Technical report. 2007, [http://www.ime.unicamp.br/~martinez/lovoalign]Google Scholar
- Andreani R, Martínez JM, Martínez L, Yano F: Continuous Optimization Methods for Structural Alignment. Math Program. 2008, 12: 93-124.Google Scholar
- Audet C, Dennis JE: Mesh adaptive direct search algorithms for constrained optimization. SIAM J Optimiz. 2006, 17: 188-217. 10.1137/040603371.View ArticleGoogle Scholar
- Burke JV, Lewis AS, Overton ML: A robust gradient sampling algorithm for nonsmooth nonconvex optimization. SIAM J Optimiz. 2005, 15: 751-779. 10.1137/030601296.View ArticleGoogle Scholar
- Neubert K-D: Flashsort1 Algorithm. Dr. Dobb's Journal. 1998, 23: 123-124.Google Scholar
- Bhattacharya S, Bhattacharyya C, Chandra NR: Projections for fast protein structure retrieval. BMC Bioinformatics. 2006, 7 (Suppl 5): S5-10.1186/1471-2105-7-S5-S5.PubMed CentralView ArticlePubMedGoogle Scholar
- Conn AR, Gould NIM, Toint PL: Trust-Region Methods. Edited by: Conn AR, Gould NIM, Toint PhL. 2000, Philadelphia: MPS/SIAM Series on Optimization, SIAM-MPSView ArticleGoogle Scholar

## Copyright

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.