SE: an algorithm for deriving sequence alignment from a pair of superimposed structures
© Tai et al. 2009
Published: 30 January 2009
Skip to main content
© Tai et al. 2009
Published: 30 January 2009
Generating sequence alignments from superimposed structures is an important part of many structure comparison programs. The accuracy of the alignment affects structure recognition, classification and possibly function prediction. Many programs use a dynamic programming algorithm to generate the sequence alignment from superimposed structures. However, this procedure requires using a gap penalty and, depending on the value of the penalty used, can introduce spurious gaps and misalignments. Here we present a new algorithm, Seed Extension (SE), for generating the sequence alignment from a pair of superimposed structures. The SE algorithm first finds "seeds", which are the pairs of residues, one from each structure, that meet certain stringent criteria for being structurally equivalent. Three consecutive seeds form a seed segment, which is extended along the diagonal of the alignment matrix in both directions. Distance and the amino acid type similarity between the residues are used to resolve conflicts that arise during extension of more than one diagonal. The manually curated alignments in the Conserved Domain Database were used as the standard to assess the quality of the sequence alignments.
SE gave an average accuracy of 95.9% over 582 pairs of superimposed proteins tested, while CHIMERA, LSQMAN, and DP extracted from SHEBA, which all use a dynamic programming algorithm, yielded 89.9%, 90.2% and 91.0%, respectively. For pairs of proteins with low sequence or structural similarity, SE produced alignments up to 18% more accurate on average than the next best scoring program. Improvement was most pronounced when the two superimposed structures contained equivalent helices or beta-strands that crossed at an angle. When the SE algorithm was implemented in SHEBA to replace the dynamic programming routine, the alignment accuracy improved by 10% on average for structure pairs with RMSD between 2 and 4 Å. SE also used considerably less CPU time than DP.
The Seed Extension algorithm is fast and, without using a gap penalty, produces more accurate sequence alignments from superimposed structures than three other programs tested that use dynamic programming algorithm.
Structure comparison and accurate structure-based sequence alignment are essential operations in structural bioinformatics. As of September 2008, the total number of structures in the Protein Data Bank (PDB) [1, 2] is more than 53000 and is increasing by 20% per year. Good structure comparison algorithms are necessary in order to compare and classify these structures and to derive accurate sequence alignments, which can help establish evolutionary relationships among the proteins.
Many protein structure alignment programs include iterations of a two-step cycle: first superposing the two structures according to a given sequence alignment, and then deriving a new sequence alignment from the superimposed structures. Dynamic programming algorithm [3, 4] is a widely used method for the second step. Programs such as SSAP , STRUCTAL , LSQMAN , CE , MATRAS , SHEBA , FAST  and others  use it to generate the alignments. However, dynamic programming algorithm requires using a gap penalty function, for which there is little guidance. It also uses a score function that usually considers only the distance between matching residues. Use of such a function can introduce incorrect alignments.
In order to recognize residue pairs that are structurally equivalent but not necessarily the closest ones and to avoid using a gap penalty function, we devised a novel algorithm called Seed Extension (SE) for obtaining the sequence alignment from a pair of superimposed structures. The performance of the new algorithm was compared with those of three programs that use the dynamic programming algorithm, namely LSQMAN, CHIMERA  and DP, which is a program extracted from SHEBA. LSQMAN and CHIMERA are two well-known programs and were chosen because they were easily available and could, without any modification, accept two superimposed structures and output the sequence alignment. The manually curated alignments in the Conserved Domain Database (CDD)  were used as the gold standard. Our results show that SE is fast and generates more accurate alignments, especially in cases where sequence or structural similarity is low. The program can be incorporated into an existing structure comparison program or it can simply be appended to such a program to improve its alignment quality.
In order to see if the SE algorithm improves the alignment quality of structure comparison programs, it was implemented in SHEBA to replace the original dynamic programming algorithm. In SHEBA, after the initial alignment is found for a given pair of structures, they are superimposed according to Kabsch [15, 16] and then a new sequence alignment is obtained from the superimposed structures using either the original DP routine or the new SE procedure. This Kabsch-DP or Kabsch-SE refinement cycle is repeated until convergence or until a set number of cycles has been completed.
Obtaining the best sequence alignment from a pair of superimposed structures is a non-trivial problem when the two structures are not entirely similar. The common practice is to select a maximal number of aligned residue pairs that will minimize the aggregate sum of distances between Cα atoms of the selected pairs. The natural algorithm for doing this is the dynamic programming algorithm.
However, blind minimization of the distance sum, in conjunction with the use of an essentially arbitrary gap penalty function, can produce poor alignments. The problem is particularly easy to see when two structurally equivalent helices cross each other at an angle as in the case shown in Figure 3. In such cases, insufficient gap penalty often leads to an alignment of the closest, but not necessarily structurally equivalent, residues, with many gaps.
The SE algorithm is a heuristic algorithm, which approximately follows the mental process that one of the authors (BL) goes through when he manually writes down the alignment from visual inspection of a pair of superimposed structures displayed on a computer screen. It starts with a few residue pairs that are clearly equivalent and then extends the alignment without introducing a gap until the inter-residue distance changes abruptly. There is no explicit notion of a gap penalty, although it is implicitly present since the algorithm attempts to extend the alignment without a gap. We have shown in this study that this algorithm produces more accurate alignments than the dynamic programming algorithms implemented in three different programs. It is also considerably faster than the latter, especially when the structures are large. An additional merit of the algorithm is that it generates strictly symmetric alignments, i.e. it produces the same alignment when the query and target structures are swapped. This is not always the case with the dynamic programming algorithm.
The algorithm requires several parameters, including the distance change cutoff value, which is used to decide when to stop extension of the alignment, the scalar product threshold value, which measures the similarity of orientation of residue triplets and which is used to identify the seed alignments, and the distance tolerance and the sequence similarity cutoff values, which are used to decide when to consider the sequence similarity in choosing among a couple of conflicting alignments. Initially, we chose the values of these parameters intuitively. The values of the first two parameters were then varied within a limited range and the optimal values were chosen using the 582 pairs of alignments selected from the CDD database. Although CDD is the most recent expert-curated database, there are other structure-based sequence alignment databases, e.g. HOMSTRAD and FSSP. It is possible that use of these other databases can alter the optimal values of these parameters. Also, adjustments may be indicated as the program is tested using more structure pairs and used more widely. However, we also expect that any adjustment will be small in magnitude and, in particular, SE will remain superior to a dynamic programming algorithm.
SE algorithm produces more accurate sequence alignments from superimposed structures than the dynamic programming algorithms used in CHIMERA, LSQMAN or SHEBA, especially in pairs of proteins with low sequence or structure similarity. SE does not require gap penalties but the alignments have fewer gaps. SHEBA implemented with SE algorithm takes less CPU time and generates more accurate alignments than the original version with dynamic programming algorithm. It is available as a software package for implementing in other structure comparison programs.
where dist i, j is the distance between the Cα atoms of residue i of structure A and residue j of structure B.
A pair of residues (i, j) is a seed if its corresponding matrix element M i, j is the minimum in both the i th row and the j th column and their scalar product is greater than 0. The scalar product here refers to that between unit vectors which bisect the angles (i-1, i, i+1) and (j-1, j, j+1). A seed segment (SS) is a set of consecutive seeds along one diagonal. After all seeds have been identified, seed segments of length 1 or 2 (isolated seeds or isolated pairs of seeds) are discarded and treated as not aligned. The status of each residue in structure A is stored as a "seed" or "extended pair" (see the following section), with the paired residue number in B, or "not yet aligned".
After all SSs are extended in both directions, a dynamic programming algorithm is used to choose the best set of consistent ASs. A set of ASs is consistent if for every AS pair p and q in the set, ip < iq implies jp < jq, where (ip, jp) and (iq, jq) are the starting residue numbers of the pth and qth ASs in the set, respectively. Consistency ensures that the resulting alignment respects sequence connectivity of the aligned residues in both structures. The best set of ASs was the one with the largest total number of aligned residue pairs in the set.
The Seed Extension algorithm was first written as a standalone program called SE. In order to compare this algorithm with the dynamic programming algorithm, the dynamic programming routine in the program SHEBA was isolated into a standalone program, which we refer to as DP. Prior to the implementation of the Seed Extension algorithm into SHEBA, the program SHEBA was first modified by removing some known bugs and by altering some features of the refinement procedure. In SHEBA3.1, after initial alignment, the program enters seven different weight schemes of 3 superposition-alignment cycles each. The alignment that has the most number of aligned residues is chosen as the final alignment. This is the last updated version that still employs only the dynamic programming algorithm.
The new SHEBA4.2, with – dp option, also employs the dynamic programming algorithm but uses a modified iteration scheme. It also uses seven different weight schemes but, for each weight scheme, the program first runs three weighted superposition-alignment cycles followed by up to 10 unit weight cycles. If the alignment converged within 10 cycles, that is, the alignment did not change in two consecutive cycles, the converged alignment is selected; otherwise, the alignment which gives the most number of aligned residues in the next cycle is chosen for that weight scheme. The alignment that gives the largest number of aligned residues among the seven different weight schemes is chosen as the final alignment. SHEBA4.2 with – se option has the same iteration scheme but the dynamic programming algorithm in the alignment part of the superposition-alignment cycle is replaced by the Seed Extension algorithm.
A set of structurally aligned protein pairs was selected from NCBI's Conserved Domain Database as described below. In CDD version 2.09, there were 2009 expert curated families with names starting with 'cd', of which 593 had at least two protein sequences with PDB structure files available that did not contain missing coordinates or non-standard amino acid residues. From each of these families, the pair with the least sequence similarity was selected and structurally superimposed using CHIMERA  based on CDD alignments. Discarding those with Cα RMSD greater than 5 Å resulted in 582 protein pairs.
We evaluated following programs for the accuracy of the sequence alignment generated from a given structural superposition: CHIMERA, LSQMAN version 060802, DP, and SE. The option GLocal_nw, global-superposition-distance-based Needleman-Wunsch sequence alignment, was used in LSQMAN to generate a sequence alignment from two superimposed structures. The default Cα distance cut-off value used in CHIMERA, LSQMAN and DP was 3.5 Å. Default values were used for the gap penalty.
The CDD alignments were used as reference alignments; those generated by programs were referred to as test alignments. The alignment accuracy was measured by means of the fraction of correctly aligned residues, f CAR . This is defined as the number of aligned pairs in the reference alignment that are preserved in the test alignment, divided by that in the reference alignment[20, 21].
To measure the speed of the algorithm, CPU time was retrieved by the clock function at the beginning and end of the DP or SE routine. The values were divided by the number of clock ticks per second to convert to the execution time. The average CPU time of SE or DP routine was obtained as the sum of elapsed time for all cycles divided by the number of cycles. The total CPU time to run the whole refinement cycles, including both the superposition and alignment generation, was also recorded. All time measurements were made on a Power Mac G5 with Dual PowerPC 970 2 GHz CPU.
This research was supported by the Intramural Research Program of the National Cancer Institute, National Institutes of Health.
This article has been published as part of BMC Bioinformatics Volume 10 Supplement 1, 2009: Proceedings of The Seventh Asia Pacific Bioinformatics Conference (APBC) 2009. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/10?issue=S1
This article is published under license to BioMed Central Ltd. This is an open access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.