# Improving pairwise sequence alignment accuracy using near-optimal protein sequence alignments

- Michael L Sierk
^{1}, - Michael E Smoot
^{2}, - Ellen J Bass
^{3}and - William R Pearson
^{4}Email author

**11**:146

**DOI: **10.1186/1471-2105-11-146

© Sierk et al; licensee BioMed Central Ltd. 2010

**Received: **27 July 2009

**Accepted: **22 March 2010

**Published: **22 March 2010

## Abstract

### Background

While the pairwise alignments produced by sequence similarity searches are a powerful tool for identifying homologous proteins - proteins that share a common ancestor and a similar structure; pairwise sequence alignments often fail to represent accurately the structural alignments inferred from three-dimensional coordinates. Since sequence alignment algorithms produce optimal alignments, the best structural alignments must reflect suboptimal sequence alignment scores. Thus, we have examined a range of suboptimal sequence alignments and a range of scoring parameters to understand better which sequence alignments are likely to be more structurally accurate.

### Results

We compared near-optimal protein sequence alignments produced by the Zuker algorithm and a set of probabilistic alignments produced by the probA program with structural alignments produced by four different structure alignment algorithms. There is significant overlap between the solution spaces of structural alignments and both the near-optimal sequence alignments produced by commonly used scoring parameters for sequences that share significant sequence similarity (E-values < 10^{-5}) and the ensemble of probA alignments. We constructed a logistic regression model incorporating three input variables derived from sets of near-optimal alignments: robustness, edge frequency, and maximum bits-per-position. A ROC analysis shows that this model more accurately classifies amino acid pairs (edges in the alignment path graph) according to the likelihood of appearance in structural alignments than the robustness score alone. We investigated various trimming protocols for removing incorrect edges from the optimal sequence alignment; the most effective protocol is to remove matches from the semi-global optimal alignment that are outside the boundaries of the local alignment, although trimming according to the model-generated probabilities achieves a similar level of improvement. The model can also be used to generate novel alignments by using the probabilities in lieu of a scoring matrix. These alignments are typically better than the optimal sequence alignment, and include novel correct structural edges. We find that the probA alignments sample a larger variety of alignments than the Zuker set, which more frequently results in alignments that are closer to the structural alignments, but that using the probA alignments as input to the regression model does not increase performance.

### Conclusions

The pool of suboptimal pairwise protein sequence alignments substantially overlaps structure-based alignments for pairs with statistically significant similarity, and a regression model based on information contained in this alignment pool improves the accuracy of pairwise alignments with respect to structure-based alignments.

## Background

Pairwise sequence alignment is the most widely used method for extracting information from protein and DNA sequences; it is routinely used to detect protein homologs that diverged more than 2 billion years ago. Homology -common evolutionary ancestry - can be reliably inferred for proteins that share statistically significant sequence similarity. When statistically significant similarity to a known sequence is found, inferences can be made about the structure, function, and biologically significant residues of the unknown sequence. While the inference of homology is quite robust (proteins that share significant similarity in pairwise alignments always have similar structures), [1] some of the more detailed functional inferences are critically dependent upon the quality of the alignment between the two sequences. For proteins that are very similar (>60% identity), functional inferences are usually very accurate, but for more distantly related proteins, ambiguity in the alignment of poorly conserved regions can lead to errors [2].

The most widely used algorithm for generating pairwise sequence alignments is dynamic programming, originally applied to biological sequences by Needleman and Wunsch[5]. Dynamic programming methods guarantee an algorithmically optimal alignment for the specific sequence and input parameters. However, an optimal sequence alignment score does not necessarily guarantee that the alignment is structurally accurate. Sequence alignment scores are optimized for a particular set of amino acid replacement scores and gap penalties; there is no natural process by which two proteins align themselves. For this reason "optimal" sequence alignments can be very different from optimal structural alignments. For example, in Figure 1A, the highest scoring structural alignment (produced by LSQMAN) had a semi-global sequence alignment score of -20, which is dramatically lower than the optimal sequence alignment score of 145. Yet the actual LSQMAN alignment is very similar to the other structural alignments, and the optimal sequence alignment. The LSQMAN sequence alignment score is very low because LSQMAN has a number of gaps of the form ACD--/---TQY, which would never be produced in a sequence alignment, but may make sense (because two loops are geometrically distant) in a structural alignment.

Furthermore, distantly related proteins often have multiple optimal alignments, as well as a large number of suboptimal alignments with scores very close to that of the optimal score [4, 6–8]. As one moves away from the optimal score, the number of alternative alignments grows rapidly. Thus, to keep the number of alignments to be dealt with computationally tractable, one must sample the suboptimal alignment space.

Although the "gold standard" by which sequence alignments are evaluated is a structure-based alignment, structural alignments can vary, and there is no algorithm that guarantees an optimal structural alignment [9]. The assumption is usually made that because structures tend to vary less than sequences over evolutionary time, differences in structural alignments are small compared to the difference between sequence and structural alignments. While this is certainly true for very distantly related proteins that do not share significant similarity (and thus cannot be meaningfully aligned from sequence data alone), the range of structural and sequence alignment accuracy has not been carefully examined for proteins that share statistically significant similarity. Our results suggest that for proteins with moderately significant sequence similarity, sequence alignments can often be within the range of different structural alignments.

Since structurally accurate alignments often have sub-optimal sequence alignment scores, investigators have explored these sets of alternative alignments, asking whether they provide information about accurate structural alignments. For example, Jaroszewski et al. [4] examined alternative alignments generated both from a near-optimal alignment generation algorithm and by varying the scoring parameters (i.e. the gap penalties and substitution matrix) and showed that there is frequently an alignment in these sets that is closer to the structural alignment. They concluded that the two methods of generating alternative alignments - alternative scoring parameters and sub-optimal alignment - have complementary (as opposed to redundant) information, since the union of the two sets yielded many more alignments that matched a structural alignment than either of the single sets. Holmes and Durbin [10] also investigated the accuracy of the optimal sequence alignment and developed a method for calculating the expected accuracy of a given alignment. Zhang and Marr [11] used an algebraic approach to investigate alternative alignments in the neighborhood of the optimal alignment.

Various authors have taken a probabilistic approach to generating sets of alternative alignments. Miyazawa [17] calculated alignment probabilities based on the exponent of the alignment score, and compared the resultant probabilities of matched amino acids in the alignment to the respective protein structure alignments. Yu and Hwa examined the statistical significance of alignments produced using a pairwise Hidden Markov Model (HMM) [12]. Knudsen and Miyamoto [13] developed an alignment method based on a pairwise HMM that included an explicit evolutionary model for indels. Finally, Mückstein et al. [14] developed a procedure for sampling alignments based on statistical weighting using the partition function over all possible alignments of two sequences.

While it is of theoretical interest to compare sets of individual sequence and structure alignments, it is only of practical use if one can determine which sequence alignment is the correct one in the absence of structural information. One way to address this question is to estimate the reliability of a particular pair of aligned residues (which we call an edge, using the convention that in the dynamic programming path graph, aligned residues, insertions, and deletions are scored along edges, while the optimal score is calculated at the vertex). Cline et al. [15] looked at four methods for predicting the reliability of a particular pair of aligned residues and determined that the method proposed by Yu and Smith [16] for extracting near-optimal alignments from a profile Hidden Markov Model (HMM) provided the most improvement in alignment quality. Miyazawa [17], Knudsen and Miyamoto [13], and Mückstein et al. [14] examined the relationship between edge probabilities and structural alignment, although in the latter two cases only in the context of a small number of protein pairs, and generally found a good correspondence between them. Mevissen and Vingron [18] demonstrated the efficacy of an edge reliability index called robustness, which had been defined previously by Chao et al. [19], among others (see Methods). They demonstrated that the robustness of an edge accurately predicted whether the edge was also aligned in the structural alignment. Here, we extend the analysis of robustness by incorporating additional alignment quality information and developing a logistic regression model that returns (via the logit link function - see Methods) the probability that a given edge is contained in a structural alignment. We also examine the distribution of alignments produced by probA (Mückstein et al. [14]) and compare them to the distribution produced by the Zuker algorithm [8].

## Results

Our goal is to find characteristics of sub-optimal sequence alignments that can be used to identify alignments, or sub-alignments, that are found in structural alignments. Just as importantly, we seek measures of alignment quality that help us predict which alignments are more likely to be correct. Since homologous proteins are typically identified through similarity searches, we focus on protein sequences that share statistically significant sequence similarity. We divide our protein pairs by sequence similarity, to explore the relationship between sequence similarity (statistical significance) and structural accuracy. The most similar third of the alignment pairs have sequence similarity expectation values E() < 10^{-10}, with an average of 48% identity. The intermediate and most distantly related sequences have 10^{-10} < E() < 10^{-5} (26.9% identity) and E() < 10^{-5} (22.6% identity), respectively.

### Comparison of Near-Optimal and Structure-based Sequence Alignments

Suboptimal sequence alignments can only be used to produce accurate structural alignments if the sequence and structure alignments overlap each other. Figure 1 depicts the sequence/structural alignment overlap in the context of the path graph. In Figure 1A, the alignment of two globin homologs, the structure-based alignments tend to overlap extensively, with only minor deviations from each other, in large part because the eight globin α-helices comprise most of the sequence. The optimal sequence alignment also overlaps the structure alignments extensively. The lines for the 95% of optimal and 75% of optimal sets represent envelopes: i.e. for a given residue in the path graph, the two lines mark the maximum and minimum indices of the residues it is aligned against in the pool of alternative alignments. In Figure 1B, two neurotoxins that have a set of three conserved secondary structural elements, but other substantial structural differences, the structural alignments are less consistent, and the 75%-optimal sequence alignments are required to capture the structural uncertainty. In both cases, the structural alignments are contained within the set of alternative sequence alignments.

Sequence and structural alignment scores for two example alignments.

Levitt-Gerstein Structural Similarity | Sequence Alignment Score | Shift Score vs. Dali | Levitt-Gerstein Structural Similarity | Sequence Alignment Score | Shift Score vs. Dali | |
---|---|---|---|---|---|---|

| 1 hdaB00 vs. 1 mytA00 (pair 25) | 1bcgA00 vs. 1b7dA00 (pair 42) | ||||

CE | 2333 | 85 | 0.986 | 978 | 61 | 0.920 |

Matras | 2330 | 103 | 0.983 | 966 | 49 | - |

DALI | 2334 | 115 | 1.000 | 980 | 24 | 1.000 |

LSQMAN | 2337 | -20 | 0.946 | 980 | 24 | 0.940 |

Optimal Sequence | 2352 | 145 | 0.940 | 583 | 108 | 0.837 |

95% Neighborhood | 2355 | 145 | 0.900 | 696 | 104 | 0.880 |

75% Neighborhood | 2359 | 145 | 0.923 | 812 | 88 | 0.780 |

probA | 2362 | 37 | 0.645 | 830 | 37 | 0.757 |

robustness | 2301 | -526 | 0.734 | 383 | 24 | 0.565 |

model | 2352 | 145 | 0.941 | 579 | 108 | 0.796 |

^{-10}) and in more than 50% of pairs for moderately similar sequences (E() < 10

^{-5}). Indeed, 95% sub-optimal sequence alignments were more accurate than the best structural alignment in 10%-20% of the families in both similarity groups. When a sampling of sequence alignment scores that are within 75% of optimal are examined (Figure 3B), a sequence alignment as structurally accurate as the least accurate structure alignment is produced in more than 80% of both the most similar pairs and the moderately similar pairs. However, for sequences that share less similarity (E() < 0.02 and E() > 0.02), sub-optimal sequence alignments rarely include structurally high-scoring alignments. Also shown in Figure 3 are the results for the probA alignments, using either BLOSUM 50 or BLOSUM 62. In contrast to the examples in Figure 2, the wider sampling produced by probA does result in an increase in the number of alignments with good structural similarity scores, particularly in the low similarity group.

Figure 3 demonstrates that there is substantial overlap between the set of near-optimal alignments and the four structural alignments, especially for the pairs of sequences that are most highly similar to each other (as is to be expected). They also demonstrate that increasing the size of the pool of near-optimal alignments increases the probability of finding a near-optimal alignment as good as or better than a structural alignment. In addition, Figure 3 indicates that there is no set of scoring parameters that is clearly better than the others in terms of producing high-scoring structural alignments, for the six scoring matrix/gap penalty combinations examined. Finally, it is instructive to note the difference between the medium and low sequence similarity groups: the former has E()-values between 10^{-10} and 10^{-5} and a mean percent identity of 26.9% (median 25.2%), while the latter has E()-values between 10^{-5} and 0.02 and a mean percent identity of 22.7% (median 22.1%). There is clearly a significant drop-off in alignment quality going from the medium similarity group to the low similarity one, even though the average percent identity does not change very much. The expectation value is much more informative predictor of sequence alignment accuracy than the percent identity, even for sequences that share significant similarity.

### Logistic Regression Model

To make use of the information contained within the set of near-optimal alignments, one needs a way of ranking or assessing the alignments. In particular, one would like to assess the likelihood that a particular pair of residues has been aligned properly. Mevissen and Vingron described one such method [18]. They calculated the robustness, which is the difference between the sequence alignment score for an alignment including a given pair of aligned residues (i.e. an edge in the path graph of the alignment) and the highest score for an alignment that does not include that edge. They demonstrated that robustness reasonably predicted the reliability that the edge would be found in the structural alignment. Robustness gives a measure of the importance of a given edge, but it does not measure aspects of the overall alignment. The frequency that an edge is found in a set of near-optimal alignments incorporates the fact that the area in the path graph surrounding the edge in question may affect whether that particular edge is used frequently. In addition, the maximum bits-per-position score that is obtained by an alignment containing that edge in the set of near-optimal alignments provides information about the overall quality of the alignments that use that edge.

Logistic Regression Model

Estimate | Std. Error | Wald Z-score | p-value | |
---|---|---|---|---|

(Intercept) | -6.1032 | 0.3614 | -16.888 | < 2e-16 |

Frequency | 5.7816 | 0.3904 | 14.808 | <2e-16 |

Robustness | 4.7489 | 0.8787 | 5.405 | 6.49e-08 |

Maximum bits- per-position | -1.6225 | 0.5842 | -2.777 | 0.00548 |

Analysis of variance for logistic regression model

Degrees Freedom | Deviance | Residual Degree Freedom | Residual Deviance | p-value | |
---|---|---|---|---|---|

NULL | 4999 | 1895.85 | |||

Frequency | 1 | 1369 | 4998 | 526.88 | 1.158e-299 |

Robustness | 1 | 23.6 | 4997 | 503.24 | 1.164e-06 |

Maximum bits-per-position | 1 | 7.7 | 4996 | 495.51 | 0.01 |

AIC | 503.5 |

Residual Deviance and AIC values for different regression models.

Model | Residual Deviance | AIC |
---|---|---|

Full Model | 495.5 | 503.5 |

Freq + Robust | 503.2 | 509.2 |

Frequency | 526.9 | 530.9 |

Robustness | 1038.0 | 1042.0 |

Since the probA alignments sampled a wider variety of alignments, which tended to overlap with the structural alignments to a greater extent than the Zuker alignments did, we also created a model using the probA alignment edges as input. This model performed slightly worse than that based on the Zuker alignments did (AUC 0.942, AIC 1327). We did not directly examine the edge probabilities that probA uses to construct its alignments, but since the algorithm produces a "statistically weighted distribution" [14] of alignments, the frequency of an edge should approximately correspond with its probability. Thus if the probA edge probabilities were substantially better at correctly predicting structural edges this should show up in the regression. The implication is that probA expands the diversity of alignments compared to the Zuker algorithm, but it is not necessarily any easier to identify which edges are found in structure alignments.

### Applying the model

Having established that the model is superior at identifying structural edges, we next wanted to make use of this information to produce more structurally accurate alignments. We did this in two ways: by "trimming" the optimal sequence alignments and by generating new alignments using the model probabilities.

Shift Score Summary

Alignments | Better (count) | Worse (count) | Better (avg. (stdev)) | Worse (avg. (stdev)) | |
---|---|---|---|---|---|

High Similarity | model trim vs. optimal | 3 | 1 | 0.011 (0.005) | -0.009 (0.000) |

robust trim vs. optimal | 3 | 1 | 0.010 (0.006) | -0.009 (0.000) | |

frequency trim vs. optimal | 3 | 1 | 0.010 (0.006) | -0.009 (0.000) | |

local trim vs. optimal | 13 | 0 | 0.022 (0.017) | 0 | |

local trim + model trim vs. local | 0 | 10 | 0 | -0.008 (0.003) | |

model alignment vs. optimal | 8 | 3 | 0.025 (0.035) | -0.008 (0.002) | |

robust alignment vs. optimal | 3 | 12 | 0.019 (0.014) | -0.140 (0.130) | |

frequency alignment vs. optimal | 6 | 3 | 0.028 (0.038) | -0.008 (0.002) | |

Medium Similarity | model trim vs. optimal | 7 | 3 | 0.036 (0.025) | -0.013 (0.006) |

robust trim vs. optimal | 9 | 4 | 0.032 (0.025) | -0.010 (0.005) | |

frequency trim vs. optimal | 7 | 4 | 0.037 (0.025) | -0.011 (0.006) | |

local trim vs. optimal | 14 | 2 | 0.088 (0.076) | -0.014 (0.001) | |

local trim + model trim vs. local | 6 | 10 | 0.019 (0.010) | -0.017 (0.013) | |

model alignment vs. optimal | 10 | 2 | 0.034 (0.030) | -0.016 (0.013) | |

robust alignment vs. optimal | 2 | 19 | 0.095 (0.062) | -0.200 (0.150) | |

frequency alignment vs. optimal | 9 | 2 | 0.035 (0.030) | -0.016 (0.013) | |

Low Similarity | model trim vs. optimal | 7 | 3 | 0.032 (0.022) | -0.022 (0.015) |

robust trim vs. optimal | 11 | 3 | 0.027 (0.020) | -0.022 (0.016) | |

frequency trim vs. optimal | 7 | 3 | 0.032 (0.022) | -0.022 (0.015) | |

local trim vs. optimal | 18 | 3 | 0.100 (0.096) | -0.240 (0.390) | |

local trim + model trim vs. local | 5 | 8 | 0.023 (0.019) | -0.013 (0.007) | |

model alignment vs. optimal | 7 | 2 | 0.025 (0.014) | -0.026 (0.019) | |

robust alignment vs. optimal | 3 | 16 | 0.088 (0.066) | -0.220 (0.180) | |

frequency alignment vs. optimal | 6 | 3 | 0.023 (0.014) | -0.024 (0.014) |

True Positive Summary

Alignments | Better (count) | Worse (count) | Better (avg. (stdev)) | Worse (avg. (stdev)) | |
---|---|---|---|---|---|

High Similarity | model trim vs. optimal | 10 | 0 | 0.018 (0.014) | 0 |

robust trim vs. optimal | 8 | 0 | 0.019 (0.014) | 0 | |

frequency trim vs. optimal | 11 | 0 | 0.015 (0.013) | 0 | |

local trim vs. optimal | 15 | 0 | 0.024 (0.016) | 0 | |

local trim + model trim vs. local | 4 | 6 | 0.013 (0.010) | -0.009 (0.003) | |

model alignment vs. optimal | 8 | 3 | 0.024 (0.031) | -0.008 (0.003) | |

robust alignment vs. optimal | 3 | 12 | 0.024 (0.013) | -0.170 (0.170) | |

frequency alignment vs. optimal | 8 | 1 | 0.022 (0.030) | -0.012 | |

Medium Similarity | model trim vs. optimal | 14 | 1 | 0.056 (0.058) | -0.009 |

robust trim vs. optimal | 14 | 0 | 0.061 (0.056) | 0 | |

frequency trim vs. optimal | 13 | 1 | 0.053 (0.057) | -0.009 | |

local trim vs. optimal | 14 | 3 | 0.098 (0.084) | -0.012 (0.005) | |

local trim + model trim vs. local | 9 | 8 | 0.061 (0.056) | -0.015 (0.009) | |

model alignment vs. optimal | 10 | 2 | 0.039 (0.043) | -0.020 (0.018) | |

robust alignment vs. optimal | 2 | 19 | 0.062 (0.000) | -0.250 (0.140) | |

frequency alignment vs. optimal | 8 | 2 | 0.045 (0.043) | -0.020 (0.018) | |

Low Similarity | model trim vs. optimal | 12 | 1 | 0.054 (0.066) | -0.024 |

robust trim vs. optimal | 17 | 0 | 0.051 (0.055) | 0 | |

frequency trim vs. optimal | 12 | 1 | 0.055 (0.066) | -0.024 | |

local trim vs. optimal | 18 | 2 | 0.110 (0.100) | -0.280 (0.380) | |

local trim + model trim vs. local | 10 | 3 | 0.041 (0.065) | -0.013 (0.005) | |

model alignment vs. optimal | 7 | 2 | 0.021 (0.014) | -0.034 (0.022) | |

robust alignment vs. optimal | 2 | 16 | 0.150 (0.050) | -0.240 (0.170) | |

frequency alignment vs. optimal | 5 | 3 | 0.022 (0.016) | -0.029 (0.018) |

Trimming the alignments with the logistic regression model probabilities clearly improves the percentage of true positives in all similarity categories, in agreement with the ROC curve. This has minimal effects on the sensitivity (coverage); of the edges in the optimal semi-global alignment removed by the model, less than 10% (49/503) are true positives (i.e. are aligned in 2 out of 4 structural alignments). No more than four true positives are removed in any given pair. For shift scores, trimmed alignments also show clear, if less dramatic, improvement over the optimal alignment. In this trimming analysis, the full model performs similarly to robustness alone and frequency alone.

Another way to look at alignment quality is to identify regions of the alignment that are of higher or lower quality, as opposed to individual edges. Since we are aligning structurally defined domains, the default was to use semi-global alignments; however, we hypothesized that the regions of the alignment that were within the local alignment boundaries might be of higher quality, under the assumption that there would be more alignment "signal" within this region. Tables 5 and 6 indicate that this is indeed the case, since when we trimmed the optimal global alignments (using the same procedures as described above) to the local alignment boundaries (determined by SSEARCH alignments), the true positives and shift scores improved significantly. The semi-global alignments trimmed to local boundaries perform better than the SSEARCH alignments, indicating that the global alignment is necessary even though it is typically of poorer quality at the ends of the alignment (data not shown). Combining trimming by both local boundaries and the regression model did not result in better performance, indicating that a significant portion of the improvement due to the model trimming is due to removing incorrect edges that are at the ends of the alignment (see Tables 5 and 6).

Trimming alignments can help remove incorrect edges, but it is constrained by the initial alignment. We also produced new alignments that were not so constrained, by using the probabilities calculated by the model (equation 2) in place of standard substitution matrix scores and affine gap penalties (see Methods). Tables 5 and 6 demonstrate that these alignments also improve structural accuracy compared to the optimal sequence alignment. In contrast, alignments created using the same procedure, except that the regression model was built using only robustness as a variable, are typically less accurate than the optimal sequence alignment. Using frequency as the only variable results in slightly lower performance than the full model. The model-trimmed and model-produced alignments are similar, but not identical. The average shift score between them is 0.96, but ranges from 0.77 to 1.0. The model-produced alignments have higher shift scores (with the Dali alignment as the reference) in 15 of the pairs (average difference 0.03), while the model-trimmed alignments have higher shift scores in 12 of the pairs (average difference 0.02). The model-produced alignments find novel (i.e. not in the optimal sequence alignment) correct structural edges, which the trimmed alignments cannot consider. The full model finds two or more novel correct edges in 18 of the 66 pairs (average count of these 18 is 7.4, maximum 29). The corresponding numbers for the robustness-only and frequency-only alignments are 40, 7.4, 26, and 11, 9.2, 27, respectively. The number of new correct edges found is modest, but there is no disadvantage to using the model-produced alignments compared to the either the optimal alignments or the trimmed alignments, and in some cases using the model-produced alignment provides substantial improvement. These results also support the use of the full model over the frequency-only model. Interestingly, the robustness-only model-produced alignments actually find more novel correct edges than the full model does, even though the overall alignments are clearly poorer. From looking at the alignments and the edges found, it appears that this occurs because the robustness-based alignment produces short runs of 2-6 edges that are structurally correct (typically by placing gaps in different locations), but which do not appear in the optimal or full model-produced alignments (data not shown).

## Discussion and Conclusions

Previous authors have shown that there is substantial overlap between the near-optimal alignment space and the structural alignment space [4, 15], and have made use of this information to produce better alignments [15, 16] or evaluate alignments [18]. Our results also show this overlap, focusing on statistically significant pairwise alignments (some previous work has tended to focus on more distantly related sequences). It is important to note that the overlap between sequence and structural alignments is due to variation in both spaces: structural alignments produced by different algorithms sometimes vary substantially, in addition to variations in the set of near-optimal sequence alignments. We also emphasize that expectation value is a much more accurate proxy measure for alignment accuracy than the more commonly used percent identity (e.g. [3]). The differences in percent identity between the medium similarity group and low similarity group (defined by E-value) are modest (i.e. both are in the < 30% identity range), but the degree of overlap with structural alignments differs significantly (Figure 3).

Previous efforts to extract information from the overlap between the structural and sequence alignment spaces have used profile-based methods [4, 15] to improve the alignments of distantly related proteins; here we focus on pairwise alignments. While profile-based methods are clearly able to detect and align more distantly related proteins than pairwise methods, pairwise alignment is more widely used and in some cases may be the only option available; thus improvements in pairwise alignment accuracy are desirable. Focusing on individual edges in the alignment, Mevissen and Vingron [18] demonstrated that robustness can accurately discriminate between structurally correct and incorrect edges in an alignment; however, they did not produce an explicit model for robustness edge classification, and did not attempt to produce improved alignments. We have developed an explicit model that is more accurate than robustness in predicting whether a given edge (i.e. aligned pair of residues) is likely to be found in a structural alignment, and that produces the associated log-odds probability. Our model can be used to produce alignments that are more similar to structure-based alignments, and is capable of finding correct structural edges that are not in the optimal sequence alignment. The model developed will be integrated into our existing display software that allows users to build and visualize sets of near-optimal alignments, [21, 22] making the model easily accessible.

There has also been substantial effort put into developing probabilistic methods for exploring alternative pairwise alignments. We have used the probA program to compare this methodology with the Zuker method of sampling different alignments. It appears that probA samples a more diverse range of alignments, which can result in better agreement with the structural alignments; however, it is not obvious that the "correct" alignments (or edges) are any easier to identify within the set of probA alignments. It is intriguing that in some cases the best probA alignment is also the most similar to the structural alignments (according to the shift score - see Figure 4), which apparently happens rarely if at all with the Zuker alignments. Further investigation will be required to determine if these cases can be distinguished prior to knowledge of the structural alignment.

## Methods

### Near-optimal alignment algorithms

As noted above, the near-optimal solution space can become very large with even small deviations from the optimal score. To accommodate this, several algorithms have been developed that generate samples of the near-optimal space. We chose the Zuker algorithm [8] because it ensures a diverse sampling by forcing all near-optimal edges to be included in at least one alignment while at the same time preserving information about which edges within the set of all edges are used most frequently. The Zuker algorithm identifies all the residue pairs aligned within a sub-optimal range, and produces an alignment that includes those pairs. Because different alignments can have the same suboptimal score, the complete set of alignments, which can be produced by the Waterman-Byers algorithm [6], is very large. The Zuker approach does not guarantee that every possible alignment is produced, but it does guarantee that every possible pair of aligned residues (an alignment edge) is sampled. The maximum number of alignments produced in the 75% of optimal set (within a particular set of scoring parameters) was 44,581, while the minimum was 1. A summary of the numbers of alignments produced for each pair is in Additional File 1.

### probA alignments

We obtained a copy of the probA program [14] from http://www.tbi.univie.ac.at/~ulim/probA/, and used the option to produce 1000 alternative alignments, using either BLOSUM 50 or BLOSUM 62 scoring matrices. probA uses gap open/extension penalties of -9.5/-1.2 for BLOSUM 50, and -7.5/-0.9 for BLOSUM 62. In Figure 2 we used the probA scores; however, when calculating bit scores and robustness values we calculated the sequence similarity scores using the more traditional -10/-2 and -11/-1 penalties.

### Structural alignment programs

The structural alignment programs used were Dali, LSQMAN, CE, and Matras. We used the standalone version of the Dali program [23], called DaliLite [24], obtained from the website ftp://ftp.ebi.ac.uk/pub/contrib/holm/dl/, with default parameters. We used the Linux version of the Combinatorial Extension (CE) program [25], obtainable at http://cl.sdsc.edu/ce.html, also with default parameters. We used the Structal method as implemented in the LSQMAN program [26] from the Uppsala Software Factory: http://xray.bmc.uu.se/usf/. Specifically, we used the Fast Force and Improve commands to get an initial alignment, then the DP command to implement the dynamic-programming method of Levitt and Gerstein [27]. We then used the Global command to calculate the statistics based on the Gerstein and Levitt structural similarity score [27]. For Matras, we used the Linux version of the program provided by the authors [28] with default parameters.

### Alignment comparison metrics

We evaluated sequence and structure alignments using two different metrics: (1) an *individual* metric, such as the Needleman-Wunsch semi-global alignment score (i.e. end gaps are not penalized) [5] or the structural alignment score, and *pairwise* alignment scores, which compare two different alignments. For a structural alignment score we used the Structal score [27] calculated with the LSQMAN program (using the Xalign option) to characterize individual structural alignments. The pairwise metric used was the shift score described by Cline, et al. [15].

### Protein Families

The protein pairs were selected from CATH (version 3.2) domains [29] of known homology and grouped according to expectation values (see Additional File 2). The expectation values were computed using an all vs. all SSEARCH [30] database search using the whole CATH database. We used three groups of sequence pairs, spanning 33 CATH families: high similarity (expectation value (E() < 10^{-10}, average percent identity 48.0%), medium similarity (E() < 5 × 10^{-5}, average percent identity 26.9%), and low similarity (E() < 2 × 10^{-2}, average percent identity 22.6%).

### Logistic Regression

The logistic regression was performed using the R statistical computing system [31]. Models were built using default parameters for the *glm* (generalized linear model) function with the logit link function and the *lrm* (logistic regression model) function from the Design library [32]. The binary response variable for the models was created by using the number of times a particular edge occurred within the set of structural alignments (e.g. if an edge appeared in 2 or more of the 4 structural alignments then the response variable would be set to 1; otherwise it would be set to 0). Three predictors for the model were the robustness of the edge, [18] the frequency of the edge, and the maximum bits-per-position of the edge. Robustness and maximum bits-per-position were normalized to a range of 0 to 1 (in the case of robustness by using the highest and lowest robustness values in a set of suboptimal alignments, rather than within a single alignment).

### Variable Selection

We evaluated the three predictor variables first by building single parameter models using each of the possible predictors. Individual predictors were evaluated following the strategy described in Hosmer and Lemeshow (2000).; any variable with a p-value less than 0.25 was considered for inclusion in the model. (The 0.25 threshold is deliberately large to allow variables that may only be significant when interacting with other variables to be included.) Second, we built a model using all possible predictors and then selectively omitted one or more parameters until the most parsimonious model was achieved. Parsimony was measured with the Akaike Information Criterion (AIC) [33]. Of all the models constructed (using different scoring parameters, edge sample sizes, and target thresholds), all included the frequency, 60% included robustness, and 33% included maximum bits-per-position. Variable omission was due to very small changes in AIC scores; the average difference in the AIC was 0.65 (range 20 to 7264), indicating that the difference between variable selection strategies is minimal. A Kruskal-Wallis test [34] confirmed that the model parameters are robust to changes in sample size, scoring parameters, and target thresholds, indicating that we do not need to construct different models for different combinations of these parameters. Table 2 shows the final model parameters, and Table 3 the analysis of variance for the model. Table 4 shows the residual deviance and AIC values for various combinations of predictor variables.

*p*is the probability that the edge is contained in a structure-based alignment,

*q*is 1-

*p*,

*r*is the robustness of the edge,

*m*is the maximum bits-per-position for an alignment containing that edge, and

*f*is the frequency that the edge is seen in an alignment.

*a*

_{1},

*a*

_{2}, and

*a*

_{3}are coefficients. Given (1), the probability of an edge being found in a structural alignment is:

### Model assessment

Sequence pairs were partitioned into test and training sets, so that edges from a given sequence pair were not used to both train and test the model. As recommended by Hosmer, et al. [35] we used a smoothed residual test statistic, with the p-value calculated based on a chi-squared distribution (Table 2). Performance was also assessed by plotting ROC curves and comparing the area under the curve (AUC) (Figure 5).

### Model-based alignment generation

We used a modified version of our suboptimal sequence alignment generation code to produce alignments where a diagonal edge in the dynamic programming matrix was scored using the log-odds probability produced by the logistic regression model (Equation 2) in place of the normal substitution matrix score. Gap open/extension penalties of -10/-2 were used. We also constructed alignments using the same method, using only robustness or only frequency to calculate the log-odds probability (Tables 5 and 6).

## Declarations

### Acknowledgements

MLS acknowledges funding support from the Army Biotechnology Center, under contract No. NMA401-02-09-2002, through the Johns Hopkins University/APL subcontract No. 904856. WRP was supported by a grant from the National Library of Medicine (LM04969).

## Authors’ Affiliations

## References

- Brenner SE, Chothia C, Hubbard TJ: Assessing sequence comparison methods with reliable structurally identified distant evolutionary relationships.
*Proceedings of the National Academy of Sciences USA*1998, 95: 6073–6078. 10.1073/pnas.95.11.6073View ArticleGoogle Scholar - Venclovas C: Comparative modeling in CASP5: progress is evident, but alignment errors remain a significant hindrance.
*Proteins*2003, 53(Suppl 6):380–388. 10.1002/prot.10591View ArticlePubMedGoogle Scholar - Vitkup D, Melamud E, Moult J, Sander C: Completeness in structural genomics.
*Nat Struct Biol*2001, 8: 559–566. 10.1038/88640View ArticlePubMedGoogle Scholar - Jaroszewski L, Li W, Godzik A: In search for more accurate alignments in the twilight zone.
*Protein Sci*2002, 11: 1702–1713. 10.1110/ps.4820102View ArticlePubMedPubMed CentralGoogle Scholar - Needleman SB, Wunsch CD: A general method applicable to the search for similarities in the amino acid sequence of two proteins.
*Journal of Molecular Biology*1970, 48: 443–453. 10.1016/0022-2836(70)90057-4View ArticlePubMedGoogle Scholar - Waterman MS, Byers TH: A dynamic programming algorithm to find all solutions in a neighborhood of the optimum.
*Mathematical Biosciences*1985, 77: 179–188. 10.1016/0025-5564(85)90096-3View ArticleGoogle Scholar - Saqi MA, Sternberg MJ: A simple method to generate non-trivial alternate alignments of protein sequences.
*J Mol Biol*1991, 219: 727–732. 10.1016/0022-2836(91)90667-UView ArticlePubMedGoogle Scholar - Zuker M: Suboptimal sequence alignment in molecular biology. Alignment with error analysis.
*J Mol Biol*1991, 221: 403–420. 10.1016/0022-2836(91)80062-YView ArticlePubMedGoogle Scholar - Lathrop RH: The protein threading problem with sequence amino acid interaction preferences is NP-complete.
*Protein Eng*1994, 7: 1059–1068. 10.1093/protein/7.9.1059View ArticlePubMedGoogle Scholar - Holmes I, Durbin R: Dynamic programming alignment accuracy.
*J Comput Biol*1998, 5: 493–504. 10.1089/cmb.1998.5.493View ArticlePubMedGoogle Scholar - Zhang MQ, Marr TG: Alignment of molecular sequences seen as random path analysis.
*J Theor Biol*1995, 174: 119–129. 10.1006/jtbi.1995.0085View ArticlePubMedGoogle Scholar - Yu YK, Hwa T: Statistical significance of probabilistic sequence alignment and related local hidden Markov models.
*J Comput Biol*2001, 8: 249–282. 10.1089/10665270152530845View ArticlePubMedGoogle Scholar - Knudsen B, Miyamoto MM: Sequence alignments and pair hidden Markov models using evolutionary history.
*J Mol Biol*2003, 333: 453–460. 10.1016/j.jmb.2003.08.015View ArticlePubMedGoogle Scholar - Muckstein U, Hofacker I, Stadler P: Stochastic pairwise alignments.
*Binformatics*2002, 18: 153–160.View ArticleGoogle Scholar - Cline M, Hughey R, Karplus K: Predicting reliable regions in protein sequence alignments.
*Bioinformatics*2002, 18: 306–314. 10.1093/bioinformatics/18.2.306View ArticlePubMedGoogle Scholar - Yu L, Smith TF: Positional statistical significance in sequence alignment.
*J Comput Biol*1999, 6: 253–259. 10.1089/cmb.1999.6.253View ArticlePubMedGoogle Scholar - Miyazawa S: A reliable sequence alignment method based on probabilities of residue correspondences.
*Protein Eng*1995, 8: 999–1009. 10.1093/protein/8.10.999View ArticlePubMedGoogle Scholar - Mevissen HT, Vingron M: Quantifying the local reliability of a sequence alignment.
*Protein Engineering*1996, 9: 127–132. 10.1093/protein/9.2.127View ArticlePubMedGoogle Scholar - Chao KM, Hardison RC, Miller W: Locating well-conserved regions within a pairwise alignment.
*Comput Appl Biosci*1993, 9: 387–396.PubMedGoogle Scholar - Hosmer DW, Lemeshow S:
*Applied Logistic Regression*. New York: John Wiley & Sons, Inc; 2000. full_textView ArticleGoogle Scholar - Smoot ME, Guerlain SA, Pearson WR: Visualization of near-optimal sequence alignments.
*Bioinformatics*2004, 20: 953–958. 10.1093/bioinformatics/bth013View ArticlePubMedPubMed CentralGoogle Scholar - Smoot ME, Bass EJ, Guerlain SA, Pearson WR: A system for visualizing and analyzing near-optimal protein sequence alignments.
*Information Visualization*2005, 4: 224–237. 10.1057/palgrave.ivs.9500100View ArticleGoogle Scholar - Holm L, Sander C: Mapping the protein universe.
*Science*1996, 273: 595–603. 10.1126/science.273.5275.595View ArticlePubMedGoogle Scholar - Holm L, Park J: DaliLite workbench for protein structure comparison.
*Bioinformatics*2000, 16: 566–567. 10.1093/bioinformatics/16.6.566View ArticlePubMedGoogle Scholar - Shindyalov IN, Bourne PE: Protein structure alignment by incremental combinatorial extension (CE) of the optimal path.
*Protein Eng*1998, 11: 739–747. 10.1093/protein/11.9.739View ArticlePubMedGoogle Scholar - Kleywegt GJ: Use of non-crystallographic symmetry in protein structure refinement.
*Acta Crystallogr D Biol Crystallogr*1996, 52: 842–857. 10.1107/S0907444995016477View ArticlePubMedGoogle Scholar - Levitt M, Gerstein M: A unified statistical framework for sequence cdent, but alignment errors remain a.
*Proc Natl Acad Sci USA*1998, 95: 5913–5920. 10.1073/pnas.95.11.5913View ArticlePubMedPubMed CentralGoogle Scholar - Kawabata T, Nishikawa K: Protein structure comparison using the markov transition model of evolution.
*Proteins*2000, 41: 108–122. 10.1002/1097-0134(20001001)41:1<108::AID-PROT130>3.0.CO;2-SView ArticlePubMedGoogle Scholar - Orengo CA, Michie AD, Jones S, Jones DT, Swindells MB, Thornton JM: CATH--a hierarchic classification of protein domain structures.
*Structure*2003, 5: 1093–1108. 10.1016/S0969-2126(97)00260-8View ArticleGoogle Scholar - Pearson WR: Searching protein sequence libraries: comparison of the sensitivity and selectivity of the Smith-Waterman and FASTA algorithms.
*Genomics*1991, 11: 635–650. 10.1016/0888-7543(91)90071-LView ArticlePubMedGoogle Scholar - R: A language and environment for statistical computing[http://www.R-project.org]
- Design: R Package version 2.0–9[http://biostat.mc.vanderbilt.edu/s/Design]
- Aikake H: An Information Criterion.
*Mathematical Sciences*1976, 14: 5–9.Google Scholar - Kruskal W, Wallis W: Use of ranks in one-criterion variance analysis.
*Journal of the American Statistical Association*1952, 47: 583–621. 10.2307/2280779View ArticleGoogle Scholar - Hosmer DW, Hosmer T, Le Cessie S, Lemeshow S: A comparison of goodness-of-fit tests for the logistic regression model.
*Stat Med*1997, 16: 965–980. 10.1002/(SICI)1097-0258(19970515)16:9<965::AID-SIM509>3.0.CO;2-OView ArticlePubMedGoogle Scholar

## Copyright

This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.