OXBench: A benchmark for evaluation of protein multiple sequence alignment accuracy
© Raghava et al; licensee BioMed Central Ltd. 2003
Received: 06 May 2003
Accepted: 10 October 2003
Published: 10 October 2003
The alignment of two or more protein sequences provides a powerful guide in the prediction of the protein structure and in identifying key functional residues, however, the utility of any prediction is completely dependent on the accuracy of the alignment. In this paper we describe a suite of reference alignments derived from the comparison of protein three-dimensional structures together with evaluation measures and software that allow automatically generated alignments to be benchmarked. We test the OXBench benchmark suite on alignments generated by the AMPS multiple alignment method, then apply the suite to compare eight different multiple alignment algorithms. The benchmark shows the current state-of-the art for alignment accuracy and provides a baseline against which new alignment algorithms may be judged.
The simple hierarchical multiple alignment algorithm, AMPS, performed as well as or better than more modern methods such as CLUSTALW once the PAM250 pair-score matrix was replaced by a BLOSUM series matrix. AMPS gave an accuracy in Structurally Conserved Regions (SCRs) of 89.9% over a set of 672 alignments. The T-COFFEE method on a data set of families with <8 sequences gave 91.4% accuracy, significantly better than CLUSTALW (88.9%) and all other methods considered here. The complete suite is available from http://www.compbio.dundee.ac.uk.
The OXBench suite of reference alignments, evaluation software and results database provide a convenient method to assess progress in sequence alignment techniques. Evaluation measures that were dependent on comparison to a reference alignment were found to give good discrimination between methods. The STAMP S c Score which is independent of a reference alignment also gave good discrimination. Application of OXBench in this paper shows that with the exception of T-COFFEE, the majority of the improvement in alignment accuracy seen since 1985 stems from improved pair-score matrices rather than algorithmic refinements. The maximum theoretical alignment accuracy obtained by pooling results over all methods was 94.5% with 52.5% accuracy for alignments in the 0–10 percentage identity range. This suggests that further improvements in accuracy will be possible in the future.
Keywordsprotein multiple sequence alignment benchmark structural alignment
Multiple sequence alignment is a central technique in molecular biology [1, 2]. Alignments enhance the understanding of structure-function relationships by allowing common functional and structural regions in protein families to be identified . Accurate alignment is also the essential first step in predicting a protein structure by homology modelling . Many different techniques have been developed to align protein sequences [5–9]. For two sequences, dynamic programming guarantees a mathematically optimal alignment for a given set of parameters [10, 11]. Dynamic programming can be extended to the alignment of more than two sequences (multiple alignment) , but this becomes computationally intractable for more than ≈ 3 sequences without adding complexity to the basic dynamic programming algorithm . Most practical methods for multiple alignment work by following a guide tree to add sequences or clusters of sequences to an alignment [8, 13], or by iteratively refining an initial alignment [14, 15].
The quality of automatic alignments have been assessed on small sets of protein sequence families [5, 14, 16, 17]. Barton and Sternberg  evaluated the quality of alignment on globin and immunoglobulin families by comparison to reference alignments from 3D (three-dimensional) structure comparison. McClure et al.  studied the performance of 12 different global and local methods of multiple protein sequence alignment on four protein families (hemoglobin, kinase, aspartic acid protease and ribonuclease H). Their criteria of assessment were based on the ability of the methods to identify correctly the ordered series of motifs that are conserved throughout each protein family. Gotoh  assessed the multiple sequence alignment method CLUSTALW ; and four of his own methods [18–20], on 54 families from the Joy 3.2 database  of alignments from 3D structure comparisons.
More recently, the BAliBASE database of sequence alignments  has been created and used to evaluate the accuracy of alignment methods. The set of 142 alignments in BAliBASE are divided into five types that aim to test different factors that affect alignment accuracy, which include large insertions, orphan sequences and N- or C-terminal extensions.
In this study, we describe a data set of reference alignments and software tools for benchmarking pairwise and multiple alignment methods. The benchmark data set is made up of domain families obtained from the 3Dee database of protein structural domains [23, 24]. After filtering these families by different criteria, reference structural alignments were determined by the STAMP algorithm . The initial reference data set of domain family alignments was extended and subdivided in various ways to allow the study of different aspects of the protein sequence alignment problem. The reference alignments and tools were applied to the AMPS multiple alignment method [13, 14] in order to identify the most informative test measures. The benchmark suite was then applied to six further methods for comparison and the detailed results stored in a database accessible via the WWW.
The results of this study consider the development of a database of reference alignments; the definition of evaluation measures for multiple alignment accuracy; the identification of the most informative evaluation measures by application to the AMPS [13, 14] multiple alignment method; the application of the training data set to find good parameters for the AMPS multiple alignment program and investigation of different features of this hierarchical alignment method; exploration of the accuracy of alignment for AMPS on the different OXBench test sets and application and comparison of the OXBench benchmark to eight different multiple alignment methods.
Development of reference alignments and evaluation measures
Reference proteins for alignment were drawn from the 3Dee database of structural domains [23, 24]. 3Dee contains domain definitions for proteins of experimentally determined three-dimensional structure in the Protein Data Bank (PDB) up to July 1998. The domains are organised into a hierarchy of structurally similar protein domain families classified by the "S c score"  from the automatic multiple structure alignment program STAMP . S c scores greater than 3.0 indicate clear structural similarity. STAMP not only provides the multiple structure alignment, but also gives a measure of reliability to each structurally aligned position. Thus, STAMP alignments provide a convenient way to filter out positions that are not structurally equivalent or where structural alignment can be ambiguous.
We started with 729 domain structure families at the S c 5.0 level which contained 9,015 domains. Families with only one member were removed, as were structures of resolution poorer than 3.2 Å and domains with less than 40 residues. Domains with more than 5% unknown residues and any domain for which the secondary structure could not be defined by DSSP  were also removed. The stereochemical quality of the structures was assessed by running PROCHECK v.3.4.4 on each chain . PROCHECK examines a range of stereochemical features of protein structures and identifies torsion angles that deviate significantly from the distributions seen in protein structures solved at a similar resolution. The PROCHECK G-factor encapsulates these quality measures in a single figure. Accordingly, we filtered the domains to exclude any protein with an overall PROCHECK G-factor ≤ -1. These refinements left 465 families containing 7,217 domains. All multiple segment domains were then excluded to leave 5,428 domains in 381 families.
Highly similar domains (≥ 98% identity) provide limited information for assessing alignment quality and so were removed from the data set by the following procedure. Within each family, the domains were compared by pairwise sequence alignment and clustered by percentage sequence identity , then one domain whose structure was solved at high resolution was selected from the clusters formed at 98% identity. Thus, the data set reduced to 1,168 domains in 218 families; where no two sequences in a family share ≥ 98% identity. We chose this relatively high PID cut-off since obtaining accurate alignment of sequences that are very similar is of critical importance in protein modelling and function prediction studies.
Throughout this work the PID for two domains was calculated from the reference structural alignment as the number of identical amino acid pairs in the alignment divided by the length of the shortest sequence.
The STAMP multiple structure comparison algorithm  provides good reference alignments for testing sequence alignment methods since it can generate both pairwise and multiple alignments from structure and automatically identify SCRs (Structurally Conserved Regions). STAMP implements several alternative iterative hierarchical methods for finding the structural alignment of two or more proteins. All alternative methods were tried for all families, and the alignment with the highest structural similarity score (S c ) was selected . Alignments produced by STAMP are usually at least as good as those by a human expert, but as structural similarity drops, alignments by any method become less easy to define [28, 29]. For these reasons, the few alignments found with unusually high or low S c values compared to their PID were carefully inspected and where structural alignments were thought to be in error, alternative STAMP parameters were tried to obtain more satisfactory results.
Structural alignments for every sequence pair in the families of the data set were also generated by STAMP as for the multiple structure alignments. This pairwise reference data set allows comparisons between pairwise and multiple alignment accuracies to be made.
Master data set
For some families in the unique data set of 218 families, the sequence identity between a subset of domains is < 10% and it is difficult for sequence alignment methods to align these families as a whole. An example is the immunoglobulin superfamily, where structure comparison puts C-type and V-type domains together, even though there is little sequence similarity. Although alignments of the complete families presents a useful test, alignments of sub-families within these families are also a challenge to methods. Accordingly, the families were sub-divided on the basis of sequence identity and structural similarity.
Summary statistics for the Master data set. NDom: Number of domains. LenAln: Length of alignment. PID a : Average pairwise percentage identity. PID w : Percentage identity across all members of a family. S c : The structural similarity score. PSCR: Percentage of positions in a structurally conserved region.
Extended data set
It has been observed in previous studies that a multiple alignment will often yield better alignments than a pair of sequences taken in isolation [14, 17]. The Master data set only contains sequences of known three-dimensional structure, but for each family in the Master data set there may be many more known sequences. In order to understand the effect on alignment accuracy of increasing the number of sequences in an alignment, we extracted all clearly similar sequences to each family from the SWALL  sequence database.
Full-length sequence data set
The above data sets are based on isolated domains. In practice, the domain boundaries may be unknown. Accordingly, we generated a data set of families which contain the full-length protein sequences rather than just the domain sequences. Full-length sequences were obtained by cross-reference to SWISS-PROT annotations  and sequence comparison to the corresponding full-length sequences.
Set of pairwise families
The set of 273 families which contain only two sequences was extracted from the Master data set. This set may be used to evaluate alignment methods that work for only two sequences (pairwise methods) and the performance of multiple sequence alignment techniques when aligning pairs.
Set of multiple families
The set of 399 families with more than two members was extracted from the Master data set. This set allows the study of alignment algorithms and parameters on families having more than 2 sequences.
Set of small families
The set of 590 families, containing eight or fewer domains was extracted from the Master data set. This set can be used to assess multiple alignment methods that are not suitable for large families or are too time consuming to use on large families [7, 33].
MSA data set
In testing on the set of small families, DIALIGN and T-COFFEE were able to align all 590 families but MSA failed to align 8 families. Most of these families had either PID<10 or length of alignment >150 which suggests that MSA is not suitable for aligning sequences with low PID or long sequences. Accordingly, we generated a further data set that excludes these 8 families and call this the MSA data set (582 families.)
Test and training sets
So that a fair assessment of performance may be performed, it is necessary that separate, independent training and testing data sets should exist. Ideally a full, leave-one-out jack-knife test would be performed but for multiple sequence alignments this would normally be too time consuming. Accordingly, a simple two-fold cross-validation method was adopted.
The Master data set was split into two sets in such a way that there was no domain in one set that shared sequence similarity with domains in the other set and the PID distribution in both sets were equal. The two sets were created as follows: the PIDs across the whole alignment of each family of the master data set were computed. These families were sorted into ascending order by PID, then the families with odd indices were placed in one set and the complement in the other.
Alignment accuracy evaluation measures
Dependent measures: evaluation of the complete alignment
Multiple alignments can be assessed either by considering the alignment as a whole, or by examining the quality of each pairwise alignment within the multiple alignment. Thus, the accuracy of multiple alignment as a whole (AC w ) and average accuracy AC a of all pair alignments were computed by Equations 1, 2 & 3.
Example calculation of Position Shift Error (PSE). See Equation: 4 and associated text for explanation.
a) Reference Alignment
b) Test Alignment
where IR and IT are the index values for the reference and test alignments corresponding to the index position i. PSE p and LA p are the mean position shift error and alignment length of the pair respectively. The PSE p for the alignment shown in Table 2 would be if we take the first sequence as the reference and would be if we take the second sequence as reference. The mean PSE is calculated over all pairs of sequences.
Dependent measures: evaluation of structurally conserved regions
Assessing accuracy of the complete alignment assumes that every position in the reference alignment is equally valid. However, not every position in a protein sequence has an equivalence in a homologue. In particular, a loop region may be structurally very different and so no alignment is valid at that position. For this reason we also calculate the accuracy of alignment only within SCRs. SCRs were obtained directly from STAMP . The regions with STAMP P ij  > 6.0 for 3 or more residues were considered as SCRs. The average accuracy of each SCR was calculated within each alignment, then the average Acc SCR was calculated over all SCRs in an alignment.
One limitation of the dependent accuracy measures is that the quality of evaluation is dependent upon the quality of the reference alignment. Errors in the reference alignment may reflect badly on a good sequence alignment method. For this reason evaluation measures that are independent of any reference alignment were also developed. To do this, the structure superposition implied by the test alignment was computed. The quality of the test alignment was then calculated from this structure superposition by computing the RMSD [35, 36] and S c  values.
For multiple alignment the RMSD between each pair of domains was calculated as well as the average RMSD over all the pairs in the alignment. Since the RMSD is dependent on the number of atoms fitted [28, 29], the percentage of equivalenced atoms (Pfit) was also calculated.
The Rossmann and Argos  probability is an alternative criterion for computing similarity between structures [25, 37] that combines both distance and conformational terms to give the probability, P ij that residue i in one structure and residue j in the other structure are equivalent. This value was computed for all the equivalent residues of the superimposed structures that were obtained from the sequence alignment by least squares fitting of the main chain Cα atoms. The average sum of pairs (S p ) was calculated from Equation 5:
Since these scores are a function of alignment length, it is necessary to normalise them so that domains of different size may be compared . Equation 6 provides a more useful measure of alignment quality:
where S c , n, L a , L b , I a , I b are structural similarity score, number of equivalent residues, length of alignment of sequence a, length of alignment of sequence b, length of gap introduced in a and length of gap introduced in b respectively.
Substitution matrices and statistics
The substitution matrices investigated by Vogt et al.  were tested. To be consistent with early work on AMPS and the work of Vogt et al., matrices containing negative scores were made positive by subtracting the most negative number in the matrix from all elements. The scores were made integer by multiplying all the elements by an appropriate power of 10.
Significance in the difference of performance between two sets of alignment parameters or two weight matrices was determined by the Wilcoxon Signed Rank Pair test as coded in the statistics package R .
Web server and database
An important goal of this work was to make the reference alignment sets, evaluation test alignments and results, and analysis software readily available to developers and users. Accordingly, we have developed WWW tools to permit the reference alignment data sets to be interrogated and to allow new alignments to be compared to the reference alignments. The facilities available from the OXBench web server (accessible via http://www.compbio.dundee.ac.uk) are described in detail in the on-line documentation, but summarised here.
The server includes a database of more than 400 sets of alignments generated by different sequence alignment methods where each set consists of alignments of the Master data set of 672 families. The database also contains full statistics on these sets of alignments. All the evaluation results and family statistics are stored in a relational database managed by the PostgreSQL system, while the sequence alignments are stored as flat files.
The reference alignment of an individual family or the whole data set can be downloaded. In addition, the sequences of any family, or of all families are also available for download. The details of each family in the OXBench reference data set are available on the server. This information can be searched by PDB code, PDB header and PDB compound. The OXBench evaluation software can be downloaded from the server, and used locally to study alignment quality. The code is written C and R and was developed on the Linux platform.
Identification of the most informative evaluation measures
In this section we sought to find which of the different evaluation measures we examined were most able to discriminate between different alignments. To do this, we applied the AMPS multiple alignment method with a range of parameters chosen deliberately to generate very varied alignments.
AMPS implements multiple sequence alignment methods based on a progressive approach [13, 14]. Multiple alignment is achieved in three steps. First, all pairs of sequences are compared. Next, the order in which they should be aligned (i.e. most similar pair through to least similar pair) is determined by cluster analysis. Finally, the sequences are aligned in that order by performing two-sequence, sequence-to-profile or profile-profile comparisons as required. AMPS has a number of options. Any substitution matrix may be employed, but the default before the work described in this paper was PAM250 . The order of alignment may either follow the simple addition of one sequence at a time (single-order) or follow a tree (tree-order). Ordering may be determined by a range of alternative measures such as PID, SD score (Z-score) or Normalised Alignment Score (NAS) . Optionally, for single-order, the method can iterate to refine the alignment .
In identifying the most informative evaluation measures we were keen to work with the complete Master data set of alignments since this provided the largest sample. Clearly, working with all the data presented a possible problem with optimisation on the test data when subsequently evaluating AMPS alongside other methods. In order to minimise the risk of bias, we tested the evaluation measures by applying AMPS with its original standard protocol of PAM250  matrix and following a guide tree calculated from Z scores obtained from 100 randomisations. This protocol had been found to be effective in 10 years experience of using the program in practical applications to many different sequence families and from tests on a small number of alignments [13, 14].
Since big changes in alignment can happen when the gap-penalty is varied, we varied this single parameter over a large range and examined the effect on alignment as judged by the different evaluation measures. This was not an exhaustive attempt to find the best gap-penalty for AMPS with PAM250, but rather to generate a set of scores to highlight the differences in the evaluation measures.
Evaluation of dependent measures of alignment quality
Dependent measure accuracies (i.e. comparison of test to reference alignment) for AMPS run on the Master data set at various gap-penalties sorted by Acc SCR – Clustering was performed on SD score (from 100 randomisations) with the PAM250 matrix. Acc SCR : Accuracy of Structurally Conserved Regions. AC a : Accuracy average (pairwise). AC w : Accuracy of whole alignment. PSE : Position Shift Error. NA in columns marked p indicates the highest accuracy in the preceding column, p gives the Wilcoxon Signed Rank test probability that the difference to the highest scoring row occured by chance.
The AC a and AC w show similar trends in this test, with the best AC a of 82.49% and best AC w of 74.52% both for a penalty of 12, but not significantly different to accuracies for a penalty of 9. The PSE did not discriminate between the different alignments quite as well as Acc SCR , with only the alignments generated with penalties of 18, 21, 24, 27, 30 and 0 showing significantly worse PSE values than the best alignments with a penalty of 12.
These results suggest that all the measures provide a useful ranking of quality for the alignments. However, since some regions of protein sequences usually share no common structural features and are unalignable, in the following studies, we chose the Acc SCR as the primary measure of quality.
Evaluation of independent measures of alignment quality
Independent accuracy measures for AMPS runs on the Master data set at various gap-penalties. Clustering was performed on significance score (from 100 randomisations) with the PAM250 matrix. S c : Scoring method based on STAMP S c . RMSD: Score based on Root Means Squared Deviation. Pfit: Percentage of aligned positions without gap used in fit.
Table 3 and Table 4 show that the S c independent measure is comparable to Acc SCR at discriminating alignment quality. However, since it is simpler to understand, we focus on the dependent measure Acc SCR in the remainder of this paper.
Visualisation of alignment differences
The SCRs reported by STAMP and exploited in this benchmark are deliberately conservative to avoid the need to inspect every structural alignment for errors. As a consequence, some SCRs in Figure 8 could be extended by one or two residues. For example, positions 21 and 22 in Figure 8 are structurally equivalent, as is position 25 (column: DNDDG). However, positions 22–24 are less straightforward to align structurally due to the insertion in 1fdn and 1fca. Inspection of the structure superposition in this region shows that despite its position in the sequence, the proline at position 24 in 2fxb is not structurally equivalent to the prolines in 1fdn and 1fca, while the valine at position 24 in 1fdn and 1fca should be shown as an insertion.
Application of the training data set to find good parameters for the AMPS multiple alignment program
As a test of OXBench, we applied the suite to the AMPS program to see if pair-score matrices developed since 1986 might give better alignments with AMPS than the original defaults of PAM250  and 8 .
The alignment of families in the training set was generated by AMPS with various combinations of pair-score matrix and gap-penalty by following a tree generated by clustering on Z-scores. In summary, the BLOSUM75  matrix with gap-penalty 10 gave the maximum Acc SCR of 89.90% on the training set while the next best combination was the BENNER74 matrix  with a gap-penalty of 100 (89.7%). Application of AMPS with the best parameters from the training set (BLOSUM75 matrix, gap-penalty 10 and tree mode) gave an average Acc SCR for the test set of 90.3%. Exchanging the test and training sets did not alter the parameters that gave the best performance. While one would normally keep test and training data completely separate, since the performance of the AMPS alignments appeared to be unaffected by which training set was used, we felt it was safe for results on this method to be reported for the complete data set in all subsequent discussions and comparisons. The accuracy on the complete Master set for these parameters was lower (89.9%) than on either test set (90.3%) which gave us further confidence that comparison on the complete Master set was unlikely to enhance the apparent accuracy of AMPS over the result seen for the test data alone. The various problems of training and testing sets when benchmarking are returned to in the Discussion.
The performance of AMPS on the Master data set with 1987 defaults (PAM250, open gap-penalty 8, single order) and optimised (BLOSUM75, gap-penalty 10, tree order) parameters. Percentage SCR is the percentage of residues in the structurally conserved regions. Acc SCR : accuracy of alignment in SCRs. p : Wilcoxon Signed Rank test probability
PID Average 1
Number of Families 2
Percentage SCR 3
Acc SCR (PAM250/8) 4
Acc SCR (BLOSUM75/10) 5
As well as pair-score matrix and gap-penalty, the other adjustable parameters in AMPS are the choice of following a tree or using a single order for multiple alignment, the number of iterations, and the method used to calculate pairwise scores from which the tree is constructed. The minimum number of randomisations necessary has been considered elsewhere  and so was not investigated here. However, early studies on a small set of alignments suggested that the Normalised Alignment Score could be a faster to calculate and good approximation to the Z-score for clustering . Accordingly, we investigated alternative strategies for clustering sequences prior to multiple alignment with AMPS.
Effect of alternative clustering methods on alignment accuracy
The effect of parameters used for clustering on accuracy of SCRs. Evaluation was performed on families of the Master data set which have more than two domains (the set of multiple familes). Acc SCR (SD): Accuracy when clustered by SD score. Acc SCR (NAS): Accuracy when clustered by Normalised Alignment Score . Acc SCR (PID): Accuracy when clustered by percentage identity. Difference(SD-PID): Difference in SCR accuracy between clustering on SD and PID. p : Wilcoxon Signed Rank test probability (SD-PID).
Number of Families
Acc SCR (SD)
Acc SCR (NAS)
Acc SCR (PID)
Difference (SD – PID)
Accuracy of alignment on different OXBench data sets
Comparison of multiple to pairwise alignment accuracy
Effect of adding additional sequences
The effect of additional similar sequences in a family, on the performance of AMPS applied to the Master Data Set. Clustering was performed on NAS instead of SD for efficiency with large alignments. Acc SCR (Master Data Set): Accuracy for AMPS clustered on Normalised Alignment Score (NAS) for the Master data set. Acc SCR (Extended data set): Accuracy for alignments on the data set with additional sequences. p : Wilcoxon Signed Rank Pair test significance
PID Average 1
Number of Families 2
Acc SCR (Master Data set) 3
Acc SCR (Extended data set) 4
Difference in Accuracy (4–3)
The effect of additional similar sequences in a family, for the set of pairwise families only. Headings as for Table 7.
PID Average 1
Number of Families 2
Acc SCR (Pairwise Families) 3
Acc SCR (Extended data set) 4
Difference in Accuracy (4–3)
As in the comparison of multiple with pairwise alignment accuracy shown in Figure 9, these results confirm the early work on globin and immunoglobulin families that showed an improvement in alignment accuracy upon multiple alignment .
Effect of aligning full-length sequences
The performance of AMPS on the Master data set of domain families and on full-length sequences that include the domains. Parameters and column labels as in Table 7.
PID Average 1
Number of Families 2
Acc SCR (Domain Families) 3
Acc SCR (Full-length) 4
Difference in Accuracy (3–4)
Application of the benchmark to compare 8 multiple alignment methods
The OXBench suite was applied to compare and contrast AMPS (with BLOSUM75 matrix and penalty of 10, tree order from Z-scores) and 7 further methods for multiple alignment with their default parameters. The methods considered were: PILEUP, based on the progressive method of Feng and Doolittle  as implemented in the Wisconsin Package Version 8.1-UNIX, (August 1995). CLUSTALW 1.7  which also bases alignment on a tree, but includes a number of sophistications such as variable gap-penalties and variable pair-score matrices. PRRP  a DNR (doubly nested randomised iterative) method for aligning multiple sequences. MSA [7, 45] which computes an optimal multiple alignment with respect to a multiple alignment scoring system that considers all sequences simultaneously. HMMER  hidden Markov model method, trained first on unaligned sequences then applied to align these sequences. PIMA [47, 48] which is based on a pattern construction algorithm. DIALIGN  is based on segment-to-segment comparison instead of residue-to-residue comparison. It constructs multiple alignments from local pairwise alignments. T-COFFEE  is the newest of the methods considered. T-COFFEE locates the most consistent alignments within a set of alignments and has been shown in previous studies  to out-perform CLUSTALW .
Comparison of alignment methods on the Master data set
The Acc SCR for methods on the Master data set.
Percentage Identity Range
Difference in average performance of methods on the Master data set. A positive value indicates that the row method gives a higher accuracy than the column method. Significant differences as calculated by the Wilcoxon Signed Rank Pair test are marked: †p < 0.05; ‡p < 0.01; § p < 0.001.
Evaluation of methods on families with ≤ 8 Sequences
The more computationally intensive methods are either unable to align families in the Master data set, or would take an unreasonably long time. For this reason, DIALIGN  MSA  and T-COFFEE  were tested on a set of small families from the master data set where the number of sequences in a family was eight or less.
The performance of methods on the MSA data set (families with ≤ 8 members.)
Percentage Identity Range
Overall difference in performance of methods on the MSA data set. Symbols as for Table 11.
The clear winner on the MSA benchmark was the newest method T-COFFEE . T-COFFEE gave an average Acc SCR over 582 families of 91.39% which was 1.71% better than the second best average accuracy achieved by AMPS (p = 4.7 × 10-14). Average differences in accuracy between T-COFFEE alignments and AMPS alignments were all positive, and significant in all but the 0–10% identity range. The largest improvement over AMPS was seen in the 10–20% identity range where the accuracy improves from 62.2% to 69.0% (p < 0.005). When T-COFFEE alignments were compared to CLUSTALW the improvement was even more dramatic with an increase in accuracy of 2.45% (p = 1.5e × 10-15) over all alignments and a 12.0% increase in accuracy in the 10–20% identity range.
Evaluation of pairwise alignment
Performance of methods on pairwise alignments.
Percentage Identity Range
Performance of methods on full-length sequence families
Performance of methods on families of the full-length sequence data set. Only AMPS, CLUSTALW and PRRP were able to align all families in this set.
Percentage Identity Range
Maximum possible accuracy
Maximum accuracy achievable over all methods on each family. AMPS (Opt): results for AMPS with BLOSUM75/10 parameters. AMPS (Max): result of taking the most accurate alignment for each family over all tested parameter combinations. All Methods (Max): result of taking the most accurate alignment over AMPS (Max) and alignments by all other methods.
Percentage Identity Range
All Methods (Max)
An early surprise in applying the benchmark to several multiple alignment methods was that our own comparatively simple method AMPS [13, 14] performed as well or better than many more sophisticated methods. Even with the original 1986-default pair-score matrix and gap-penalty of PAM250/8 applied with the single-order option in the method as originally published , AMPS achieved 88.5% Acc SCR on the Master data set of domains (Table 5). With this protocol, AMPS showed no significant difference to the more complex PRRP  which gave 88.4% and PILEUP  which gave 88.9%, but was lower in accuracy than CLUSTALW with default parameters at 89.1% (Table 10.) While the original default method in AMPS followed a single alignment order, AMPS is normally run with tree-order . Simply switching to tree-order and updating the pair-score matrix to one of the BLOSUM series (50 or 75) to be consistent with more modern methods, raised the accuracy of AMPS to 89.9% which showed no significant difference to CLUSTALW with default parameters (Tables 10 and 11).
The results obtained for PAM250/8 are above suspicion since these parameters were published in 1987. However, an obvious criticism of the BLOSUM results for AMPS on the Master data set is that a subset of this data set was used to select the BLOSUM matrix and gap-penalty combination. As discussed in Results, this procedure breaks the rule of separating training data from test data and so in principle invalidates any conclusions about the relative merits of AMPS compared to methods which were not "optimised" on the same subset.
Although one has to bear this in mind when examining the results, all the methods considered, except AMPS with its original options, were developed to use BLOSUM series matrices. The only change necessary to bring AMPS in line with the best of these methods, CLUSTALW, was to update the matrix to one from the BLOSUM series. While not conclusive, this strongly suggests that the improvements in accuracy obtained by CLUSTALW were due primarily to the change in pair-score matrix rather than the many refinements to the CLUSTALW algorithm that are not present in AMPS. This finding is consistent with results for secondary structure prediction, where the addition of data rather than underlying algorithm appear to dominate improvements in accuracy [49, 50].
Another possible explanation of the accuracy seen for AMPS, even before changing to the BLOSUM matrix, is that we unconsciously selected test data and evaluation measures that favour the type of alignments that AMPS produces. If we had stopped with the methods compared in Table 10 it would have been difficult to refute this charge. However, after developing the benchmark and running most of the tests, the T-COFFEE  algorithm became available. When tested by its authors, T-COFFEE was reported as out-performing other methods, so by running T-COFFEE alignments through our benchmark we could test both the method and to some extent the validity of the benchmark. As the data discussed in Results and shown in Tables 12,13,14 indicate, T-COFFEE outperforms all other methods on our benchmark. Since our result agrees with the authors of T-COFFEE but on a different evaluation method, it suggests that the test data and evaluation measures applied in this work are effective and likely to be free of bias towards any single alignment method.
When applying any benchmark to a method, it is important to ensure that there is a separation between the data used to develop the method and optimise parameters, and those data being used to test. Unfortunately, it is rarely possible to make a totally clean separation. Even when test data appear to have been explicitly excluded from training data there may be implicit data and assumptions that are built into the algorithm under test. This is a particular problem when testing sequence alignment algorithms since nearly all make use of amino acid substitution matrices that are derived from the analysis of sequence alignments. The BLOSUM series of matrices in particular, are derived from alignments of protein families that will almost certainly include families that are featured in any reasonable benchmark data set.
A further problem arises when a benchmark is applied to a range of methods from different laboratories, since one does not know exactly which data were used to refine each method. For example, Cuff & Barton  faced this problem for the comparison of secondary structure prediction algorithms where often, even the authors of the methods were unsure of which data had been used in training! A fair test of the different prediction methods was only possible when a completely new test set was generated that could not have been used in the development of any of the methods .
Given these concerns, the good performance of T-COFFEE on our benchmark, may in part be due to overlap between the families of alignments used to optimise T-COFFEE and those in this benchmark. Such overlaps are a problem for those trying to tease out which features of an alignment algorithm are most important. However, an end-user of an alignment algorithm would like it to give the most accurate alignment possible on any presented set of sequences. Developing a method on large data sets such as those presented here is likely to lead to an overall improvement in accuracy and so provide better quality alignments for most users.
Although useful in its current form, OXBench could be improved in a number of ways. OXBench contains alignments that range from 0 to 98.9% percentage identity across the complete alignment (PID w in Table 1), but below 20% identity, the data sets have only 28 alignments of more than two sequences. With so few alignments in this range it is difficult to see significant differences between methods, though general trends are apparent. The deficit in low percentage identity alignments is partly due to the automatic sequence/structure clustering used to construct the 3Dee domains database [23, 24] and the need to avoid incorrect clusters. In the 3Dee structural hierarchy, the domain sequences are first clustered on sequence similarity, then representatives taken and clustered by structure comparison [23, 24]. As structural similarity reduces, the possibility of incorrectly clustering structures increases, so we took a conservative threshold of structural similarity when choosing the families to include in OXBench to avoid polluting the reference data. A further reason for the relatively low numbers of low similarity domains in the benchmark is the general difficulty of constructing multiple structure alignments where the proteins only have a basic topology and a few key positions in common. Again, we erred on the side of caution in selecting domain families for which the STAMP algorithm  will give consistent multiple structure alignments.
Structural hierarchies such as SCOP  that are curated by humans rather than generated entirely automatically provide a more reliable grouping for structures that share remote similarities. A benchmark data set derived from alignments built at the SCOP superfamily or fold level may extend the number of examples in the low PID range. In addition, SCOP is regularly updated and so the growth in the structural database would naturally lead to an expansion in the number of low identity examples described here. However, even with this approach, the problem still remains to generate reliable multiple structure alignments of such families to use as reference alignments for benchmarking. One way to approach this problem would be to be less conservative in selecting STAMP aligned subsets, then inspect each alignment and structure superposition by hand with reference to hydrogen-bonding diagrams and known functional sites. The automatically identified STAMP SCRs could then be extended or contracted as appropriate in order to separate out clear regions of structural similarity. The resulting hand-curated multiple structure alignments should provide a valuable reference resource for the improvement of multiple sequence alignment methods on hard-to-align examples.
In this paper we have considered a range of different, commonly applied alignment evaluation measures. While these measures are not a comprehensive survey, the OXBench code has been written in a modular way to allow extension that may include different measures as needed. For example, some groups when developing techniques that aim to identify the reliably aligned regions of an alignment (e.g. see: [52–54]) prefer an evaluation measure such as the "shift score" proposed by Cline et al.  that attempt to penalise alignments that delete large parts of the structure as well as alignments of regions that should not be aligned. Measures such as this and other evaluation variants  would be straightforward to add as additional modules to the evaluation code.
Ideally, one would dispense with reference alignments and simply judge the quality of the three-dimensional model inferred from alignment as in the independent measures considered above. Although conceptually harder to understand, the S c independent measure that considers both distance and local conformation performed as well as the Acc SCR and so warrants further exploration as a ranking method. More sophisticated model-building approaches followed by 3D-structure evaluation may in future provide even more effective guides to improving alignment methods.
In this paper we have developed sets of protein multiple sequence alignments derived from tertiary structure comparison that can be used to test sequence alignment methods. The sets include those designed to test the ability of a method to align residues in known SCRs, to test the effect of aligning domains within full-length sequences, to test accuracy on pairwise and multiple alignment, and to test the effect of adding additional sequences to an alignment.
All measures of alignment accuracy provided some degree of discrimination of alignment quality but measurement of accuracy within structurally conserved regions (Acc SCR ) was preferred since it focused on the most reliable structurally aligned regions in the reference alignment. A structural similarity score that takes account of distance and local conformation (S c ) was the most discriminating measure that was independent of a reference alignment.
The pair-score matrix (BLOSUM, PAM, etc.) has a bigger effect on alignment accuracy than other parameters such as clustering method. For AMPS an overall improvement of 1.25% was seen on moving from PAM250 to BLOSUM75 matrix for tree-based multiple alignment from Z-scores.
Adding more sequences to an alignment improves the accuracy of multiple alignments for families with low sequence identity. For AMPS the largest effect was in the 0–10 PID bin where an average improvement of 13.6% was seen (p = 0.0045; Tables 7,8).
Aligning full-length sequences degraded the accuracy of alignment for embedded domains by all methods considered. For example, for AMPS the accuracy drops by 6.8% on average (p ≤ 2.2 × 10-16; Table 9).
The order of performance for methods on pairwise alignment (Table 14) was very similar to that on the Master data set for multiple alignment. This suggests that the performance of multiple sequence alignment methods based on the progressive approach is proportional to their ability to align pairs of sequences.
On the MSA data set of alignments that contain ≤ 8 sequences, the best hierarchical alignment methods out-performed more rigorous methods that optimise sum-of-pairs and similar measures. For example, MSA gave alignments that were on average 1.4% worse than AMPS (p < 0.001; Table 13).
The maximum accuracy achieved by taking the maximum accuracy alignment for each family over all methods and parameter combinations was 94.52% (Table 16).
The work in this paper was initially supported by the Royal Society through a University Research Fellowship to GJB and a grant to GJB from the BBSRC (UK Biotechnology and Biological Sciences Research Council) to fund GPSR to work in the Laboratory of Molecular Biophysics at the University of Oxford. The work was subsequently continued by GPSR, SMJS and GJB at the EBI (European Bioinformatics Institute) with support from EMBL (European Molecular Biology Laboratory) to GJB and SMJS. Completion of the work was performed at the University of Dundee School of Life Sciences by GJB, PCA (supported by the Wellcome Trust) and JB (supported by a BBSRC studentship).
- Taylor WR: Identification of protein sequence homology by consensus template alignment. J Mol Biol 1986, 188: 233–258.View ArticlePubMedGoogle Scholar
- Barton GJ: Protein sequence alignment and database scanning. In Protein structure prediction: A practical approach (Edited by: Sternberg MJE). Oxford: IRL Press at Oxford University Press 1996, 31–63.Google Scholar
- Livingstone CD, Barton GJ: Protein sequence alignments: a strategy for the hierarchical analysis of residue conservation. Comput Appl Biosci 1993, 9: 745–756.PubMedGoogle Scholar
- Marti-Renom MA, Stuart AC, Fiser A, Sanchez R, Melo F, Sali A: Comparative protein structure modeling of genes and genomes. Annu Rev Biophys Biomol Struct 2000, 29: 291–325. 10.1146/annurev.biophys.29.1.291View ArticlePubMedGoogle Scholar
- Barton GJ, Sternberg MJ: Evaluation and improvements in the automatic alignment of protein sequences. Protein Eng 1987, 1: 89–94.View ArticlePubMedGoogle Scholar
- Feng DF, Doolittle RF: Progressive sequence alignment as a prerequisite to correct phylogenetic trees. J Mol Evol 1987, 25: 351–360.View ArticlePubMedGoogle Scholar
- Lipman DJ, Altschul SF, Kececioglu JD: A tool for multiple sequence alignment. Proc Natl Acad Sci USA 1989, 86: 4412–4415.PubMed CentralView ArticlePubMedGoogle Scholar
- Thompson JD, Higgins DG, Gibson TJ: CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res 1994, 22: 4673–4680.PubMed CentralView ArticlePubMedGoogle Scholar
- Notredame C, Higgins DG, Heringa J: T-coffee: A novel method for fast and accurate multiple sequence alignment. J Mol Biol 2000, 302: 205–17. 10.1006/jmbi.2000.4042View ArticlePubMedGoogle Scholar
- Needleman SB, Wunsch CD: A general method applicable to the search for similarities in the amino acid sequence of two proteins. J Mol Biol 1970, 48: 443–453.View ArticlePubMedGoogle Scholar
- Sellers PH: On the theory and computation of evolutionary distances. J App Math 1974, 26: 787–793.Google Scholar
- Murata M, Richardson JS, Sussman JL: Simultaneous comparison of three protein sequences. Proc Natl Acad Sci USA 1985, 82: 3073–3077.PubMed CentralView ArticlePubMedGoogle Scholar
- Barton GJ: Protein multiple sequence alignment and flexible pattern matching. Methods Enzymol 1990, 183: 403–28.View ArticlePubMedGoogle Scholar
- Barton GJ, Sternberg MJ: A strategy for the rapid multiple alignment of protein sequences, confidence levels from tertiary structure comparisons. J Mol Biol 1987, 198: 327–337.View ArticlePubMedGoogle Scholar
- Karplus K, Hu B: Evaluation of protein multiple alignments by SAM-T99 using the BAliBASE multiple alignment test set. Bioinformatics 2001, 17: 713–20. 10.1093/bioinformatics/17.8.713View ArticlePubMedGoogle Scholar
- McClure M, Vasi T, Fitch W: Comparative analysis of multiple protein-sequence alignment methods. Mol Biol Evol 1994, 11: 571–592.PubMedGoogle Scholar
- Gotoh O: Significant improvement in accuracy of multiple protein sequence alignments by iterative refinement as assessed by reference to structural alignments. J Mol Biol 1996, 264: 823–838. 10.1006/jmbi.1996.0679View ArticlePubMedGoogle Scholar
- Gotoh O: Optimal alignment between groups of sequences and its application to multiple sequence alignment. Comp App Biosci 1993, 9: 361–370.Google Scholar
- Gotoh O: Further improvement in methods of group-to-group sequence alignment with generalized profile operations. Comp App Biosci 1993, 10: 379–387.Google Scholar
- Gotoh O: A weighting system and algorithm for aligning many phylogenetically related sequences. Comp App Biosci 1995, 11: 543–551.Google Scholar
- Sali A, Overington JP: Derivation of rules for comparative protein modeling from a database of protein structure alignments. Protien Sci 1994, 3: 1582–1596.View ArticleGoogle Scholar
- Thompson J, Plewniak F, Poch O: BAliBASE: a benchmark alignment database for the evaluation of multiple sequence alignment programs. Bioinformatics 1999, 15: 87–88. 10.1093/bioinformatics/15.1.87View ArticlePubMedGoogle Scholar
- Dengler U, Siddiqui AS, Barton GJ: Protein structural domains: analysis of the 3Dee domains database. Proteins 2001, 42: 332–344. 10.1002/1097-0134(20010215)42:3<332::AID-PROT40>3.3.CO;2-JView ArticlePubMedGoogle Scholar
- Siddiqui AS, Dengler U, Barton GJ: 3Dee: a database of protein structural domains. Bioinformatics 2001, 17: 200–201. 10.1093/bioinformatics/17.2.200View ArticlePubMedGoogle Scholar
- Russell RB, Barton GJ: Multiple protein sequence alignment from tertiary structure comparison: assignment of global and residue confidence levels. Proteins 1992, 14: 309–323.View ArticlePubMedGoogle Scholar
- Kabsch W, Sander C: Dictionary of protein secondary structure: pattern recognition of hydrogne-bonded and geometrical features. Biopolymers 1983, 22: 2577–2637.View ArticlePubMedGoogle Scholar
- Laskowski RA, Mac Arthur MW, Moss DS, Thornton JM: PROCHECK: a program to check the stereochemical quality of protein structures. J Appl Cryst 1993, 26: 283–291. 10.1107/S0021889892009944View ArticleGoogle Scholar
- Godzik A: The structural alignment between two protein: Is there a unique answer? Protien Sci 1996, 5: 1325–1338.View ArticleGoogle Scholar
- Feng ZK, Sippl MJ: Optimum superimposition of protein structures: ambiguities and implications. Fold Des 1996, 1: 123–132.View ArticlePubMedGoogle Scholar
- Barton G: OC – A cluster analysis program.1993. [http://www.compbio.dundee.ac.uk/Software/OC/oc.html]Google Scholar
- Hermjakob H, Lang F, Apweiler R: SPTR – A comprehensive, non-redundant and up-to-date view of the protein sequence world. CCP11 Newsletter 1998., 2:Google Scholar
- Bairoch A, Boeckmann B: The SWISS-PROT protein sequence data bank. Nucleic Acids Res 1991, 19 Suppl: 2247–2249.View ArticlePubMedGoogle Scholar
- Morgenstern B, Dress A, Werner T: Multiple DNA and protein sequence alignment based on segment-to-segment comparison. Proc Natl Acad Sci USA 1996, 93: 12098–12103. 10.1073/pnas.93.22.12098PubMed CentralView ArticlePubMedGoogle Scholar
- Moult J, Hubbard T, Bryant SH, Fidelis K, Pedersen JT: Critical assessment of methods of protein structure prediction (CASP): round II. Proteins 1997, Suppl 1: 2–6. Publisher Full Text 10.1002/(SICI)1097-0134(1997)1+%3C2::AID-PROT2%3E3.3.CO;2-KView ArticlePubMedGoogle Scholar
- Mclachlan AD: A mathematical procedure for superimposing atomic coordinates of proteins. Acta Crystallogr A 1972, A28: 656–657. 10.1107/S0567739472001627View ArticleGoogle Scholar
- Diamond R: On the comparison of conformations using linear and quadratic transformations. Acta Crystallogr A 1976, A32: l-10.Google Scholar
- Rossmann MG, Liljas A, Branden CI, Banaszak LJ: Evolutionary and structural relationships among the dehydrogenases. The Enzymes 1975, 11: 61–102.View ArticleGoogle Scholar
- Vogt G, Etzold T, Argos P: An assessment of amino acid exchange matrices in aligning protein sequences: The twilight zone revisited. J Mol Biol 1995, 249: 816–831. 10.1006/jmbi.1995.0340View ArticlePubMedGoogle Scholar
- Ihaka R, Gentleman R: R: A language for data analysis and graphics. Journal of Computational and Graphical Statistics 1996, 5: 299–314.Google Scholar
- Dayhoff MO, Schwartz RM, Orcutt BC: A model of evolutionary change in proteins. Matrices for detecting distant relationships. In Atlas of protein sequence and structure (Edited by: Dayhoff MO). Washington DC: National biomedical research foundation 1978, 5: 345–358.Google Scholar
- Barton GJ: ALSCRIPT: a tool to format multiple sequence alignments. Protein Eng 1993, 6: 37–40.View ArticlePubMedGoogle Scholar
- Henikoff S, Henikoff JG: Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci USA 1992, 89: 10915–10919.PubMed CentralView ArticlePubMedGoogle Scholar
- Gonnet GH, Cohen MA, Benner SA: Exhaustive matching of the entire protein sequence database. Science 1992, 256: 1443–1445.View ArticlePubMedGoogle Scholar
- Webber C, Barton GJ: Estimation of P-values for global alignments of protein sequences. Bioinformatics 2001, 17: 1158–67. 10.1093/bioinformatics/17.12.1158View ArticlePubMedGoogle Scholar
- Gupta SK, Kececioglu J, Schaffer AA: Improving the practical space and time efficiency of the shortest-paths approach to sum-of-pairs multiple sequence alignment. J Comput Biol 1995, 2: 459–472.View ArticlePubMedGoogle Scholar
- Krogh A, Brown M, Mian IS, Sjolander K, Haussler D: Hidden markov models in computational biology: Applications to protein modelling. J Mol Biol 1994, 235: 1501–1531. 10.1006/jmbi.1994.1104View ArticlePubMedGoogle Scholar
- Smith RF, Smith TF: Automatic generation of primary sequence patterns from sets of related protein sequences. Proc Natl Acad Sci USA 1990, 87: 118–122.PubMed CentralView ArticlePubMedGoogle Scholar
- Smith RF, Smith TF: Pattern-induced multi-sequence alignment (PIMA) algorithm employing secondary structure-dependent gap penalties for use in comparitive protein modelling. Proteins 1992, 5: 35–41.View ArticleGoogle Scholar
- Cuff JA, Barton GJ: Evaluation and improvement of multiple sequence methods for protein secondary structure prediction. Proteins 1999, 34: 508–519. 10.1002/(SICI)1097-0134(19990301)34:4<508::AID-PROT10>3.0.CO;2-4View ArticlePubMedGoogle Scholar
- Cuff JA, Barton GJ: Application of multiple sequence alignment profiles to improve protein secondary structure prediction. Proteins 2000, 40: 502–511. 10.1002/1097-0134(20000815)40:3<502::AID-PROT170>3.0.CO;2-QView ArticlePubMedGoogle Scholar
- Murzin A, Brenner SE, Hubbard T, Chothia C: SCOP: a structural classification of proteins database and the investigation of sequences and structures. J Mol Biol 1995, 247: 536–540. 10.1006/jmbi.1995.0159PubMedGoogle Scholar
- Vingron M, Argos P: Determination of reliable regions in protein sequence alignments. Protien Eng 1990, 3: 565–569.View ArticleGoogle Scholar
- Holmes I, Durbin R: Dynamic programming alignment accuracy. J Comput Biol 1998, 5: 493–504.View ArticlePubMedGoogle Scholar
- Cline M, Hughey R, Karplus K: Predicting reliable regions in protein sequence alignments. Bioinformatics 2002, 18: 306–314. 10.1093/bioinformatics/18.2.306View ArticlePubMedGoogle Scholar
This article is published under license to BioMed Central Ltd. This is an Open Access article: verbatim copying and redistribution of this article are permitted in all media for any purpose, provided this notice is preserved along with the article's original URL.