Seed Extension algorithm
The input of the algorithm is the coordinates of two superimposed structures and the output is the sequence alignment based on that superposition. The flow chart of the algorithm is shown in Figure 7.
1. Compute distance matrix
Given a pair of superimposed structures A of length m and B of length n, the m × n matrix M of the average Cα distances is defined as
where disti, jis the distance between the Cα atoms of residue i of structure A and residue j of structure B.
2. Find seed and seed segments (SSs)
A pair of residues (i, j) is a seed if its corresponding matrix element Mi, jis the minimum in both the ithrow and the jthcolumn and their scalar product is greater than 0. The scalar product here refers to that between unit vectors which bisect the angles (i-1, i, i+1) and (j-1, j, j+1). A seed segment (SS) is a set of consecutive seeds along one diagonal. After all seeds have been identified, seed segments of length 1 or 2 (isolated seeds or isolated pairs of seeds) are discarded and treated as not aligned. The status of each residue in structure A is stored as a "seed" or "extended pair" (see the following section), with the paired residue number in B, or "not yet aligned".
3. Extend seed segments to obtain aligned segments (ASs)
An aligned segment (AS) is a set of ungapped, at least 3 consecutive residue pairs that are aligned. ASs are initially set equal to SSs, which are then extended along the diagonal in both directions according to a protocol detailed in Figure 8. Briefly, an AS is extended by a residue pair if the distance between the new pair is not more than the distance between the last aligned pair by a cutoff distance (default is 3.0 Å). The extension is terminated if either of the candidate residue pair is a seed residue (a seed pair is preferred over an extended pair) or if the candidate residue pair is an extended pair (two ASs on the same diagonal are joined). If the extension meets a residue which is a part of a pre-existing AS on a different diagonal, the extension is either stopped or continued, in which case the pre-existing AS is correspondingly shortened, depending on which AS is to be preferred. The factors considered for this choice include the distance between the residue pairs and the similarity of the residue pairs as measured by the BLOSUM62 matrix.
4. Collect consistent sets of diagonals and choose the best set
After all SSs are extended in both directions, a dynamic programming algorithm is used to choose the best set of consistent ASs. A set of ASs is consistent if for every AS pair p and q in the set, ip < iq implies jp < jq, where (ip, jp) and (iq, jq) are the starting residue numbers of the pth and qth ASs in the set, respectively. Consistency ensures that the resulting alignment respects sequence connectivity of the aligned residues in both structures. The best set of ASs was the one with the largest total number of aligned residue pairs in the set.
SE, DP, and SHEBA modifications
The Seed Extension algorithm was first written as a standalone program called SE. In order to compare this algorithm with the dynamic programming algorithm, the dynamic programming routine in the program SHEBA was isolated into a standalone program, which we refer to as DP. Prior to the implementation of the Seed Extension algorithm into SHEBA, the program SHEBA was first modified by removing some known bugs and by altering some features of the refinement procedure. In SHEBA3.1, after initial alignment, the program enters seven different weight schemes of 3 superposition-alignment cycles each. The alignment that has the most number of aligned residues is chosen as the final alignment. This is the last updated version that still employs only the dynamic programming algorithm.
The new SHEBA4.2, with – dp option, also employs the dynamic programming algorithm but uses a modified iteration scheme. It also uses seven different weight schemes but, for each weight scheme, the program first runs three weighted superposition-alignment cycles followed by up to 10 unit weight cycles. If the alignment converged within 10 cycles, that is, the alignment did not change in two consecutive cycles, the converged alignment is selected; otherwise, the alignment which gives the most number of aligned residues in the next cycle is chosen for that weight scheme. The alignment that gives the largest number of aligned residues among the seven different weight schemes is chosen as the final alignment. SHEBA4.2 with – se option has the same iteration scheme but the dynamic programming algorithm in the alignment part of the superposition-alignment cycle is replaced by the Seed Extension algorithm.
Data set of superimposed structures
A set of structurally aligned protein pairs was selected from NCBI's Conserved Domain Database as described below. In CDD version 2.09, there were 2009 expert curated families with names starting with 'cd', of which 593 had at least two protein sequences with PDB structure files available that did not contain missing coordinates or non-standard amino acid residues. From each of these families, the pair with the least sequence similarity was selected and structurally superimposed using CHIMERA [19] based on CDD alignments. Discarding those with Cα RMSD greater than 5 Å resulted in 582 protein pairs.
Structure-based sequence alignment programs
We evaluated following programs for the accuracy of the sequence alignment generated from a given structural superposition: CHIMERA, LSQMAN version 060802, DP, and SE. The option GLocal_nw, global-superposition-distance-based Needleman-Wunsch sequence alignment, was used in LSQMAN to generate a sequence alignment from two superimposed structures. The default Cα distance cut-off value used in CHIMERA, LSQMAN and DP was 3.5 Å. Default values were used for the gap penalty.
Alignment accuracy measure
The CDD alignments were used as reference alignments; those generated by programs were referred to as test alignments. The alignment accuracy was measured by means of the fraction of correctly aligned residues, f
CAR
. This is defined as the number of aligned pairs in the reference alignment that are preserved in the test alignment, divided by that in the reference alignment[20, 21].
Computing time measurement
To measure the speed of the algorithm, CPU time was retrieved by the clock function at the beginning and end of the DP or SE routine. The values were divided by the number of clock ticks per second to convert to the execution time. The average CPU time of SE or DP routine was obtained as the sum of elapsed time for all cycles divided by the number of cycles. The total CPU time to run the whole refinement cycles, including both the superposition and alignment generation, was also recorded. All time measurements were made on a Power Mac G5 with Dual PowerPC 970 2 GHz CPU.