- Research article
- Open Access
Accuracy of structure-based sequence alignment of automatic methods
© Kim and Lee; licensee BioMed Central Ltd. 2007
Received: 04 June 2007
Accepted: 20 September 2007
Published: 20 September 2007
Accurate sequence alignments are essential for homology searches and for building three-dimensional structural models of proteins. Since structure is better conserved than sequence, structure alignments have been used to guide sequence alignments and are commonly used as the gold standard for sequence alignment evaluation. Nonetheless, as far as we know, there is no report of a systematic evaluation of pairwise structure alignment programs in terms of the sequence alignment accuracy.
In this study, we evaluate CE, DaliLite, FAST, LOCK2, MATRAS, SHEBA and VAST in terms of the accuracy of the sequence alignments they produce, using sequence alignments from NCBI's human-curated Conserved Domain Database (CDD) as the standard of truth. We find that 4 to 9% of the residues on average are either not aligned or aligned with more than 8 residues of shift error and that an additional 6 to 14% of residues on average are misaligned by 1–8 residues, depending on the program and the data set used. The fraction of correctly aligned residues generally decreases as the sequence similarity decreases or as the RMSD between the Cα positions of the two structures increases. It varies significantly across CDD superfamilies whether shift error is allowed or not. Also, alignments with different shift errors occur between proteins within the same CDD superfamily, leading to inconsistent alignments between superfamily members. In general, residue pairs that are more than 3.0 Å apart in the reference alignment are heavily (≥ 25% on average) misaligned in the test alignments. In addition, each method shows a different pattern of relative weaknesses for different SCOP classes. CE gives relatively poor results for β-sheet-containing structures (all-β, α/β, and α+β classes), DaliLite for the "others" class, in which all but the four major classes are combined, and LOCK2 and VAST for the all-β and "others" classes.
When the sequence similarity is low, structure-based methods produce better sequence alignments than methods that use sequence similarity alone. However, current structure-based methods still mis-align 11–19% of the conserved core residues when compared to the human-curated CDD alignments. The alignment quality of each program depends on the protein structural type and similarity, with DaliLite showing the most agreement with CDD on average.
Accurate sequence alignments for homologous proteins are essential for constructing accurate motifs and profiles, which are used in motif- or profile-based protein function search models [1–3] and in building homology models[4, 5]. When sequence similarity is low, however, it is difficult to obtain the correct sequence alignment based on sequence similarity alone [3, 4]. Since it is well known that proteins can have similar structures even in the absence of any detectable sequence similarity, structural alignments have been used to guide sequence alignments and are used as the gold standard for sequence alignment evaluation [5, 6].
Many pairwise structure alignment programs have been developed, but their performance has often been measured by how well the programs reproduce an expert-curated structure classification, such as SCOP or CATH [7, 8]. It has been shown that some programs do not produce high quality individual alignments, as measured by geometric match measures such as SAS or GSAS, even when they perform well in classification tests. It is also known that structure-based sequence alignments produced by different programs can be different even when the superimposed structures are similar [4, 5, 10–12]. Nonetheless, as far as we know, there is no report of a systematic evaluation of commonly used structural alignment programs in terms of the sequence alignment accuracy, perhaps because it has been difficult to find a fully human-curated and reasonably difficult reference alignment set [13, 14].
There are a number of sequence alignment databases that are augmented by structural alignments, including CAMPASS, HOMSTRAD, PALI [17, 18], DBAli, PASS2, CDD, SUPFAM, BAliBase, OXBench, PREFAB, SABmark and S4. The extent of similarity of the structures in these databases varies, and so does the degree to which the alignments were curated by human experts after they were initially generated by automatic methods and/or imported from outside sources.
Zhu and Weng used the HOMSTRAD database to measure the performance of their structure alignment program, FAST, and reported an average accuracy of 96%, measured as the percentage of correctly aligned residues among all aligned residues in the reference alignment. However, the study reported here indicates that such high accuracy is generally not obtained unless the structures are highly similar.
In this study, we evaluate the accuracy of structure-based sequence alignments produced by seven pairwise structure alignment programs, using the human-curated sequence alignments from NCBI's CDD as the standard of truth. This is an expert-curated database, built by importing sequence alignments from outside sources, which are manually modified by considering structure-based alignments. In addition to the family-level alignments, where protein sequences are highly similar, it also provides fully curated superfamily-level alignments, where sequence similarity is not so high.
Average performance of each method
Table: The composition of the reference alignment datasets (columns: root node set, terminal node set)
We use the "correctly" aligned fraction of residues (fcar) as a measure of alignment quality. This measure is defined as the ratio of the number of residues that are aligned correctly, within a specified shift error, to the total number of aligned residues in the reference alignment (see Methods section for details). Since there is a large variation in the number of alignments in the CD nodes (e.g., 1424 pairs for the immunoglobulin root node cd00096 vs. one pair in the root node cd00120), we use the node-wide average of fcar, which we denote as Fcar. In order to compare the performance of different structure-alignment programs, we take the average of Fcar (the double average of fcar) over all nodes within each node set.
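The node-wide averaging described above can be illustrated with a short sketch. It assumes that per-pair fcar values are already available as (node identifier, fcar) tuples; the function names and the numerical values are hypothetical and serve only to show why the double average differs from a flat average when node sizes are unequal.

```python
from collections import defaultdict
from statistics import mean


def node_average(pairs):
    """Group per-pair fcar values by CD node and return {node: Fcar}."""
    by_node = defaultdict(list)
    for node, fcar_value in pairs:
        by_node[node].append(fcar_value)
    return {node: mean(values) for node, values in by_node.items()}


def double_average(pairs):
    """Average of the node-wide averages (each node counts once)."""
    return mean(node_average(pairs).values())


# Hypothetical values: a large node and a single-pair node.
example = [("cd00096", 0.5), ("cd00096", 0.7), ("cd00096", 0.9),
           ("cd00120", 1.0)]
per_node = node_average(example)
overall = double_average(example)
flat = mean(value for _, value in example)
```

In this toy case the large node no longer dominates: the double average weights the two nodes equally, whereas the flat average is pulled toward the node with more pairs.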
In contrast to the terminal node set, the Fcar values without shift error are only 0.81 to 0.89 for the root node set. About 6% to 14% of the residues, on average, are aligned with some shift error (at most 4 residues in general) and an additional 4% to 9% of the residues are either not aligned or aligned with shift error of more than 8 residues. The best performance was achieved by DaliLite, whether shift error was allowed or not. CE was the most dependent on allowed shift error; it ranked the lowest when shift error was not allowed but the second best, after only DaliLite, if a shift error of up to 4 was allowed.
Figure 2 also shows that the average Fcar value changes noticeably between shift error of 0 and 4 but that it remains essentially unchanged after 4. For accurate profile construction, one cannot tolerate a shift error of any magnitude. On the other hand, for the purpose of recognizing similar structures in the database, precise accuracy of the alignment is of less concern. Therefore, we generally focus on fcar values with shift errors of either 0 or up to 8 in the following analysis of the results of this study.
Dependence of performance on sequence similarity and distance between homologous residues
We note in passing an easily discernible feature on the length of the alignments that different structure alignment programs produce (inset of Figure 9). As expected, all programs produce longer alignments than the reference alignment, since CDD alignments are those of the conserved core regions in a set of multiple alignments whereas test alignments are pair-wise alignments that may include residues outside of the conserved core. But CE, DaliLite and MATRAS produce relatively long alignments on average, FAST, VAST and SHEBA produce relatively short alignments and LOCK2 is in between.
Variations within and between superfamilies
The results described in the previous sections (except for Figures 8 and 9) were given in terms of the fcar values averaged over all protein pairs and over all CDD superfamilies. However, each method gives alignment accuracies that vary greatly over different protein pairs and over different superfamilies.
Table: The largest CDD superfamily and the superfamilies for which all programs score poorly (description in CDD)
- T-fold; Tunneling fold
- OM_channels; Porin superfamily
- nt_trans; nucleotidyl transferase
- E_set; E or "early" set of sugar utilizing enzymes
- IG; Immunoglobulin domain family
Included in Figure 11 are the RMSD values averaged for each superfamily. They generally decrease as the Fcar(0) value increases, although there are a couple of exceptions, as indicated by the red inverted triangles. None of the 5 superfamilies identified above has an exceptional RMSD value. This indicates that there is no gross error in the reference alignments for these superfamilies.
One notable feature is that CE produces more one-residue shifted alignments than other methods for 4 of the 5 superfamilies (red bars in Figure 12), as well as for cd00096 included here for reference as a typical superfamily.
In general, fcar values also vary within each superfamily for all methods (Figure 13). A relatively large variation of fcar(0) compared to fcar(8) implies that there will be a correspondingly large number of inconsistencies among the alignments of the superfamily members. For the largest superfamily, cd00096, all methods produced 5% (DaliLite) to 20% (CE) of alignments wherein all the residues are shifted. Some of these shifted alignments are as good as the reference alignments in terms of the RMSD and the number of aligned residue pairs, but are clearly wrong because the conserved cysteine residues that form the disulfide bond are not correctly aligned (see Figure 5 for an example). This kind of incorrect alignment in immunoglobulins was discussed by Gerstein and Levitt in the category of "hard to align" pairs.
Architecture dependence of performance
It is known that some structure alignment programs show weaknesses for specific protein architectures in structure classification [7, 28]. To examine whether a similar dependence exists in sequence alignments, the alignments were grouped by their SCOP class. The four main classes, α, β, α/β and α+β, were considered separately and the remainder were combined into the "others" class. For this study, we excluded the 5 outlier superfamilies of Figure 12.
Performance difference of the methods
A significant observation in this study is that DaliLite produces the most accurate structure-based sequence alignment, while CE is clearly not as good when shift error is not allowed (Figure 2). This result contrasts with an earlier evaluation study wherein DaliLite was found to produce worse alignments than CE in terms of geometric measures, which include RMSD. Our result is more consistent with Sierk and Pearson's work, in which DaliLite was found to be the best followed by MATRAS, although they measured classification ability rather than alignment accuracy, using the CATH database as the gold standard.
DaliLite, MATRAS and FAST, which are relatively good performers in our analysis, are based on the comparison of intra-molecular distance matrices without resorting to rigid body rotation during structural alignment [26, 29]. Thus, structural superposition is not necessary to obtain a good sequence alignment. Also, different algorithms give different performances depending on how much shift error is allowed and on the secondary structure content of the structure. DaliLite, LOCK2 and VAST probably depend more on secondary structures than other programs and perform less well for the "others" class of structures. CE tends to give inaccurate alignments for β-containing structures but performs well when some shift error is allowed, which makes it more suitable for homology detection and structure classification tasks. CE, DaliLite, and MATRAS produce long alignments (inset of Figure 9). MATRAS produces longer alignments on average than DaliLite, but performs less well.
Such differences among the methods were not observed with the terminal node set (Figure 2). FAST was evaluated by its own authors using the overlap score, which is the same as fcar(0), and HOMSTRAD as the gold standard. The reported accuracy of 96% is consistent with our observation using the terminal node set. This suggests that the sequence similarity of the proteins in the HOMSTRAD dataset is perhaps similar to that of our terminal node set, which is made of "easy" cases for which all methods perform similarly well. The present study shows the advantage of using the root node set for evaluation since it has a higher discrimination power than the terminal node set (Figure 2).
Alignment accuracy measures
We used fcar(0) and fcar(8) values almost exclusively as the measures of accuracy of alignments. These are the fractions of residues that are correctly aligned within the specified alignment shift error. As mentioned above, fcar(0) values are the suitable measures when accurate alignment is essential, as in building profiles. On the other hand, for the purposes of finding structurally similar proteins and for structure classification, fcar(8) may be a better measure to use. Measures such as fcar(8) are probably preferable to a quantity that measures how well the program reproduces an existing structure classification dataset such as SCOP or CATH; the latter test brings in a set of issues, such as the human classification versus machine comparison and the effect of clustering [ and manuscript in preparation], which are only peripherally related to the performance of the pair-wise structure alignment program itself.
The fcar measures can be used only when one has a reliable set of alignments that can be considered to be true. We used NCBI's CDD alignments for this purpose. When such a standard is not available, one has to use some absolute measure of the goodness of the alignments. The authors of SHEBA, for example, who include one of us (BL), used the number of residue pairs aligned within a given distance as the measure of goodness. Kolodny et al. define four different measures, each of which is some combination of the number of aligned residues and the RMSD. As mentioned above, use of these measures results in a different ranking of the programs. It is easy to understand why the RMSD is included in a goodness measure that is basically based on how many residues a program aligns; the alignment length can be increased arbitrarily until it encompasses the whole protein if RMSD is not considered. However, as can be seen in Figures 8 and 9, our reference alignments include a significant number of conserved core residue pairs that are rather far apart. Simply discouraging the alignment of such pairs is not necessarily the desired characteristic of a good structure alignment program, and it may not be easy to find the proper combination of the number of aligned residues and the RMSD that will correctly assess the accuracy of a structure alignment program.
CDD as reference alignments
There are advantages to using the alignments from CDD as the reference dataset since they are human-curated and include sequences of both high and low sequence similarities. Although VAST alignment results are consulted by the NCBI curators of CDD, there does not seem to be a VAST-specific bias since VAST does not perform particularly well among the tested methods (Figure 2).
An obvious drawback is that CDD gives alignments of only the conserved core region from multiple alignments. A pairwise alignment will generally align more residues outside of the conserved core, but the accuracy of these alignments cannot be assessed using this reference set of alignments. Our assumptions are that any good alignment program should do well for the conserved core residues and that a program that aligns the conserved core residues well will also align the non-core residues better than other programs.
Imperfectness of alignments
Although we investigated only the conserved core regions of the alignments, it is clear that all structure alignment programs often produce alignments with all or part of this core region of the structures misaligned (See Figures 4 and 5). The correctly aligned fraction never reaches 95% even after shift error is allowed for up to 8 residues (Figures 2, 3 and 13) and it decreases rapidly as the sequence similarity decreases or as the RMSD increases (Figures 6, 7 and 8).
A possible reason for such discrepancy is the potential errors in the human-curated reference alignments. It was pointed out in the Results section that some of the CDD alignments were unusual from the point of view of purely structural alignment. However, we believe that this is not the major contributor to the observed discrepancy according to two limited investigations we made as described below.
If the problem is in the reference alignment, all methods are likely to score poorly. But, as shown in Figure 11, there are only 5 superfamilies that are exceptionally poorly aligned by all methods and inclusion or exclusion of these superfamilies had little effect on the overall alignment accuracies.
A related possibility is that there are equally good alternate alignments for many of the structure pairs, as was pointed out by many authors [4, 5, 10–12]. The alternate alignments can affect the whole structure or only a part of the structure. The possibility of such alternates will increase for evolutionarily distant pairs as the sequence similarity becomes low and the structures acquire distinct differences. The fact that residue pairs that are more than 3.0 Å apart in the reference alignment are heavily misaligned in the test alignments (Figure 8) suggests that this could be a significant contributor to the overall discrepancy between the test and reference alignments. In such circumstances, even structure-based sequence alignment can benefit from multiple alignments and from including the evolutionary relation between sequences.
A third possibility is of course the imperfection of the pair-wise structure alignment programs. The fact that different programs behave differently for the same set of data indicates that they are not yet perfect. We have observed that different programs totally fail for different sets of protein pairs. We have observed many instances wherein all or part of the structure is shifted by 2 or 4 residues compared to the reference alignment. In the example shown in Figure 5, the DaliLite alignment is clearly wrong because the cysteine residues do not align. We are also surprised by the large number of cases wherein the alignment is shifted by an odd number of residues for all or part of the structure. It is definitely our impression that there is room for improvement in the structure alignment programs.
The accuracy of the sequence alignments produced by 7 commonly used structure alignment programs was evaluated using the sequence alignments from NCBI's human-curated Conserved Domain Database as the standard of truth and the "correctly" aligned fraction of residues as the alignment quality measure. These programs mis-align 11–19% of the conserved core residues on average for structure pairs in the same CDD root node but not in the same child node. DaliLite gave the best results among the programs tested. The alignment quality varied depending on the program used, on the protein structural type (SCOP classes), and on the degree of sequence and structural similarity.
Reference alignment sets
Since CDD includes hundreds of families imported directly from outside sources, such as Pfam, COGs and SMART, we collected only the expert-curated CD (Conserved Domain) families, whose names always begin with "cd". There were 2,009 such CDs (CDD v.2.07 as of 04/04/2006) organized in a hierarchical manner: 285 singleton CDs (without children or parents), 146 CDs from root nodes, 1,440 CDs from terminal nodes, and 138 CDs from internal nodes (between root and terminal nodes in the CD hierarchy). We selected 828 CDs with at least two 3D structures and, using cddalignview from the NCBI C++ Toolkit, extracted multiple sequence alignments from their ".acd" files. This subset includes 220 singletons, 135 root nodes, 367 terminal nodes and 106 internal nodes. In total, 21,140 pairwise alignments were prepared from these multiple alignments. Each sequence in the alignments included all the unaligned residues at both termini, since the -lefttails and -righttails options were used with cddalignview.
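As an illustration of how pairwise alignments can be derived from a multiple alignment, the sketch below projects two rows of a multiple alignment onto a list of aligned residue index pairs. It is not the procedure used with cddalignview; it assumes that each row is a string of equal length with "-" as the gap character, and the function names are hypothetical.

```python
from itertools import combinations


def pairwise_from_msa(row_a, row_b, gap="-"):
    """Return 0-based (i, j) residue index pairs aligned in both rows."""
    pairs = []
    i = j = 0  # running residue counters for the two sequences
    for char_a, char_b in zip(row_a, row_b):
        has_a = char_a != gap
        has_b = char_b != gap
        if has_a and has_b:
            pairs.append((i, j))
        i += has_a
        j += has_b
    return pairs


def all_pairwise(msa):
    """Project every pair of rows; msa maps sequence name -> aligned row."""
    return {(a, b): pairwise_from_msa(msa[a], msa[b])
            for a, b in combinations(sorted(msa), 2)}


# Toy multiple alignment with hypothetical sequence names.
toy = {"seqA": "AC-DE", "seqB": "A-GDE", "seqC": "ACGDE"}
projected = all_pairwise(toy)
```

A multiple alignment of n structures yields n(n-1)/2 such pairwise alignments, which is why large nodes contribute many more pairs than small ones.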
CDD uses curated domains based on MMDB [30–32]. For this study, we adopted the ASTRAL SCOP domains (ASTRAL SCOP 1.69) because they were better documented. The ASTRAL domain sequences and structures were downloaded from the ASTRAL web site. Finding the ASTRAL domain corresponding to a CDD domain, however, is not trivial, because domain definitions do not always coincide. In order to determine which ASTRAL domain is associated with which CDD domain, we used a sequence alignment procedure (Lobster package). First, each sequence in a given CDD alignment was aligned to all the ASTRAL domain sequences derived from the same PDB structure. An ASTRAL domain was selected if at least 70% of its residues were covered by the CDD aligned span. A CDD aligned span is the sequence segment spanned by the first and the last aligned residues in the CDD alignment. This means that a CDD sequence can correspond to more than one ASTRAL domain. When this happened, all the domains were kept, which meant that the single CDD domain was effectively split into more than one domain according to the ASTRAL SCOP definition. If an ASTRAL domain was not assigned to a sequence of a CDD aligned sequence pair, the pair was omitted. We also required that the aligned region between the domain spans include at least 20 residue pairs and cover at least 70% of the shorter span. A domain span here is defined for each ASTRAL domain as the region from the first to the last aligned residues within the boundaries of the domain. Its length is the number of the residues and gaps in the span. After this procedure, the dataset contained 6,425 pairwise alignments from the root nodes, 2,351 from the internal nodes, 2,809 from the terminal nodes, and 2,979 from the singletons. Each reference alignment is associated with a pair of ASTRAL domains and the pair-wise CDD sequence alignment.
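The 70% coverage criterion for assigning ASTRAL domains to a CDD sequence can be sketched as follows. The sketch assumes that the residues of each domain and the CDD aligned span have already been mapped onto a common residue numbering (which in the study was done by sequence alignment); the domain identifiers and function names are hypothetical, and the additional requirement on the aligned region between domain spans is not covered.

```python
def covered_fraction(domain_residues, span):
    """Fraction of a domain's residues lying inside the aligned span."""
    first, last = span
    inside = sum(1 for residue in domain_residues if first <= residue <= last)
    return inside / len(domain_residues)


def select_domains(domains, span, min_cover=0.70):
    """Return names of domains with at least min_cover of residues in span.

    domains: {domain_name: sequence of residue numbers}
    span: (first_aligned_residue, last_aligned_residue) of the CDD alignment
    """
    return [name for name, residues in domains.items()
            if len(residues) > 0
            and covered_fraction(residues, span) >= min_cover]


# Two hypothetical domains of 100 residues each from one PDB chain.
chain_domains = {"d1xyza1": range(1, 101), "d1xyza2": range(101, 201)}
both = select_domains(chain_domains, span=(20, 180))
first_only = select_domains(chain_domains, span=(1, 150))
```

With a span of 20 to 180 both domains pass (81% and 80% covered), so the single CDD domain is effectively split in two; with a span of 1 to 150 only the first domain is kept, since the second is only 50% covered.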
We used only the root and terminal node sets. In order to select alignments specific to the root node set, the alignments were excluded from the root node set if their domain pair was also included in the internal or terminal node set. The pairs with 80% or more sequence identity (among aligned residue pairs) were also removed from both the root and the terminal node sets. If a structure in the aligned pair did not contain the side chains or was derived by NMR, the pair was also eliminated. The final reference alignment sets consisted of 2,199 alignment pairs for the terminal node set and 4,017 pairs for the root node set (Additional file 3).
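The sequence identity filter can be sketched as below, under the assumption that a reference alignment is held as a list of 0-based residue index pairs and that identity is computed over aligned residue pairs only, as stated above. The function names and the toy sequences are hypothetical.

```python
def aligned_identity(seq_a, seq_b, pairs):
    """Fraction of aligned residue pairs with identical one-letter symbols."""
    if not pairs:
        return 0.0
    identical = sum(1 for i, j in pairs if seq_a[i] == seq_b[j])
    return identical / len(pairs)


def keep_pair(seq_a, seq_b, pairs, max_identity=0.80):
    """Keep a pair only if its identity is below the 80% threshold."""
    return aligned_identity(seq_a, seq_b, pairs) < max_identity


# Toy example: five aligned positions.
toy_pairs = [(k, k) for k in range(5)]
removed = keep_pair("ACDEF", "ACDEY", toy_pairs)  # 4 of 5 identical
kept = keep_pair("ACDEF", "ACDWY", toy_pairs)     # 3 of 5 identical
```

A pair at exactly 80% identity is removed, matching the "80% or more" wording of the criterion.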
Structure alignment programs
For various reasons, we could not evaluate all known structure alignment programs. We selected programs mainly based on their availability. Some programs were difficult to use because they failed for some of the structure pairs for unknown reasons or generated sequence alignments that were different from what was implied by other measures such as RMSD values. In the end, we included CE (Algorithm 1.0, Alignment calculator 1.02), DaliLite_2.4.1, LOCK2, FAST, MATRAS (version 1.2), VAST (directly from Dr. Gibrat) and SHEBA-4.0. SSEARCH from the FASTA3 package was used for pure sequence alignment. MATRAS and VAST were kindly provided by their authors; the others were downloaded from their websites.
Each program was run with its default setting. CE needs SEQRES sequence to recognize the residues as they are in the PDB file. Since such information is not included in PDB-style ASTRAL domain files, the three-letter symbols were derived from the ATOM records in the PDB-style files. When the secondary structure information is explicitly required, DSSP was used. VAST includes companion programs, which derive the secondary structures and SCOP domains from the original PDB files containing the whole structure. When a program generates more than one alignment for a given structure pair as in DaliLite and VAST, the first alignment in the output file was chosen for the evaluation.
Sequence alignment quality measure
A test alignment was generated for each reference alignment by running the structure alignment program on the two ASTRAL domains assigned to the reference alignment. The test alignment then needs to be compared to the reference alignment for quality assessment. However, the protein sequence in the test alignment is often not identical to that in the reference alignment. For example, residues missing in the crystal structure do not appear in the test alignment. Some non-standard amino acids are simply removed (FAST) or marked with the extended amino acid symbols (LOCK2) – B, Z or X. Also, CE removes unaligned N-terminal and C-terminal residues. These and other sequence related issues involved in comparing different sequence alignments have been addressed before. In this study, we used a sequence alignment procedure (see below) in order to establish the correspondence between residues in the test and the reference alignment sequences. In principle, there are cases when an unambiguous correspondence cannot be made even by the sequence alignment. For instance, if there are tandem repeats in the sequence and one of these contains a gap, the gap can be relocated without cost by the sequence alignment procedure. Fortunately, we have not detected such ambiguity in the aligned regions of any of our reference alignments.
We used the C++ class library included in the Lobster package to handle sequence alignments. Two sequences derived from the same protein, one from the test and the other from the reference alignments, were aligned. The lengths and the one-letter symbols of these sequences can be different even though both are for the same protein. Then the serial numbers of the residues in the reference alignment sequence were assigned to the residues in the test alignment sequence. After this step, residues were identified by means of the assigned serial numbers alone, so that different symbols for the same residue were allowed. Also, the residues in the reference alignment sequence that do not appear in the test alignment sequence, either because the residue is missing in the crystal structure or because the ASTRAL domain spans less than the whole reference aligned span, are marked as unaligned in the reference sequence and not considered further.
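The transfer of serial numbers from the reference sequence to the test sequence can be illustrated with the sketch below. It does not use the Lobster library; as a stand-in it uses Python's difflib.SequenceMatcher, which is a generic string matcher rather than a biological sequence aligner, so the result is only an approximation of what a proper alignment would give. The function name and the toy sequences are hypothetical.

```python
from difflib import SequenceMatcher


def transfer_serial_numbers(ref_seq, test_seq):
    """For each test residue, return the 1-based serial number of the
    matching reference residue, or None if no correspondence is found."""
    numbers = [None] * len(test_seq)
    matcher = SequenceMatcher(None, ref_seq, test_seq, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        same_length = (i2 - i1) == (j2 - j1)
        # Identical stretches, and equal-length substitutions (e.g. a
        # non-standard residue written as X), are mapped position by position.
        if tag == "equal" or (tag == "replace" and same_length):
            for offset in range(i2 - i1):
                numbers[j1 + offset] = i1 + offset + 1
    return numbers


# Test sequence lacking the two N-terminal residues of the reference.
trimmed = transfer_serial_numbers("MKTAYIAKQR", "TAYIAKQR")
# Test sequence in which one residue is written with a different symbol.
substituted = transfer_serial_numbers("ACDMFGH", "ACDXFGH")
```

Once every test residue carries a reference serial number, the two alignments can be compared by number alone, regardless of which one-letter symbol each program reports.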
For each structure pair, let r and t be the number of aligned residue pairs in the reference and test alignments, respectively, and let m(δ) be the number of aligned residues in both sequences with shift error up to δ. We define the fraction of "correctly" aligned residues, fcar(δ), and the relative alignment length, l, as $f_{car}(\delta) = m(\delta)/r$ and $l = t/r$. The fcar(δ = 0) is the same as fD, which Sauder et al. called the "developer's viewpoint" score. This has also been called the sensitivity of sequence alignment [40, 41].
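A minimal sketch of these two quantities is given below. It assumes that each alignment is stored as a mapping from the serial number of a residue in the first protein to the serial number of its partner in the second protein, and that the shift error of a reference pair is the absolute difference between the partner assigned by the test alignment and the partner in the reference alignment; a reference residue left unaligned by the test alignment is counted as incorrect. These representation choices and the function names are assumptions made for illustration.

```python
def fcar(reference, test, delta=0):
    """Fraction of reference pairs reproduced within a shift error of delta.

    reference, test: {residue in protein 1: aligned residue in protein 2}
    """
    r = len(reference)
    if r == 0:
        raise ValueError("reference alignment has no aligned pairs")
    m = 0
    for residue, ref_partner in reference.items():
        test_partner = test.get(residue)
        if test_partner is not None and abs(test_partner - ref_partner) <= delta:
            m += 1
    return m / r


def relative_length(reference, test):
    """Number of test pairs divided by number of reference pairs (t / r)."""
    return len(test) / len(reference)


# Toy alignments: residue 1 is exact, residues 2 and 3 are shifted by one,
# residue 4 is not aligned in the test, which also aligns two extra residues.
ref_aln = {1: 1, 2: 2, 3: 3, 4: 4}
test_aln = {1: 1, 2: 3, 3: 4, 5: 6, 6: 7}
exact = fcar(ref_aln, test_aln, delta=0)
tolerant = fcar(ref_aln, test_aln, delta=1)
rel_len = relative_length(ref_aln, test_aln)
```

The toy case shows the behaviour discussed in the Results: allowing a small shift error raises the score substantially, while residues aligned outside the reference core increase the relative length without affecting fcar.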
CE, Combinatorial Extension; DaliLite, standalone version of DALI (Distance mAtrix ALIgnment); DSSP, Definition of Secondary Structure of Proteins given a set of 3D coordinates; FAST, recursive acronym for FAST Alignment and Search Tool; FASTA3, DNA and protein sequence alignment software package; LOCK2, improvements over LOCK (hierarchical protein structure superposition); MATRAS, MArkovian TRAnsition of protein Structure; SHEBA, Structural Homology by Environment-Based Alignment; SSEARCH, Smith-Waterman search; VAST, Vector Alignment Search Tool.
ASTRAL, compendium for protein structure and sequence analysis; BAliBase, Benchmark Alignment dataBase; CATH, hierarchical classification of protein domain structures, which clusters proteins at four major levels, Class (C), Architecture (A), Topology (T) and Homologous superfamily (H); CAMPASS, CAMbridge database of Protein Alignments organised as Structural Superfamilies; COGs, Clusters of Orthologous Groups of proteins; DBAli, DataBase of structure Alignments; HOMSTRAD, HOMologous STRucture Alignment Database; MMDB, Molecular Modelling DataBase; OXBench, benchmark for evaluation of protein multiple sequence alignment accuracy; PALI, database of Phylogeny and ALIgnment of homologous protein structures; PASS2, Protein Alignments organised as Structural Superfamilies (version 2); Pfam, multiple sequence alignments and HMM-profiles of protein domains; PDB, Protein Data Bank; PREFAB, Protein REFerence Alignment Benchmark; S4, Structure-based Sequence alignments of SCOP Superfamilies; SABmark, Sequence Alignment Benchmark; SCOP, Structural Classification of Proteins; SMART, Simple Modular Architecture Research Tool; SUPFAM, database of potential protein SUPerFAMily relationships.
Alignment quality measure
GSAS, Gapped SAS; SAS, Structural Alignment Score.
RMSD, Root-mean-square of Cα distances between aligned pairs after structural superposition; NCBI, National Center for Biotechnology Information; CD, Conserved Domain in CDD; CDD, NCBI's Conserved Domain Database.
We thank Dr. Aron Marchler-Bauer for reading the manuscript and for the many valuable comments. We also thank the authors of CHIMERA and of the structure alignment programs for making their programs available. This research was supported by the Intramural Research Program of the NIH, National Cancer Institute, Center for Cancer Research.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.