OXBench: A benchmark for evaluation of protein multiple sequence alignment accuracy

Raghava, GPS; Searle, Stephen MJ; Audley, Patrick C; Barber, Jonathan D; Barton, Geoffrey J

doi:10.1186/1471-2105-4-47

Methodology article
Open access
Published: 10 October 2003



BMC Bioinformatics volume 4, Article number: 47 (2003)


Abstract

Background

The alignment of two or more protein sequences provides a powerful guide in the prediction of protein structure and in identifying key functional residues; however, the utility of any prediction is completely dependent on the accuracy of the alignment. In this paper we describe a suite of reference alignments derived from the comparison of protein three-dimensional structures, together with evaluation measures and software that allow automatically generated alignments to be benchmarked. We test the OXBench benchmark suite on alignments generated by the AMPS multiple alignment method, then apply the suite to compare eight different multiple alignment algorithms. The benchmark shows the current state of the art for alignment accuracy and provides a baseline against which new alignment algorithms may be judged.

Results

The simple hierarchical multiple alignment algorithm, AMPS, performed as well as or better than more modern methods such as CLUSTALW once the PAM250 pair-score matrix was replaced by a BLOSUM series matrix. AMPS gave an accuracy in Structurally Conserved Regions (SCRs) of 89.9% over a set of 672 alignments. The T-COFFEE method on a data set of families with <8 sequences gave 91.4% accuracy, significantly better than CLUSTALW (88.9%) and all other methods considered here. The complete suite is available from http://www.compbio.dundee.ac.uk.

Conclusions

The OXBench suite of reference alignments, evaluation software and results database provides a convenient method to assess progress in sequence alignment techniques. Evaluation measures that were dependent on comparison to a reference alignment were found to give good discrimination between methods. The STAMP S_c score, which is independent of a reference alignment, also gave good discrimination. Application of OXBench in this paper shows that, with the exception of T-COFFEE, the majority of the improvement in alignment accuracy seen since 1985 stems from improved pair-score matrices rather than algorithmic refinements. The maximum theoretical alignment accuracy obtained by pooling results over all methods was 94.5%, with 52.5% accuracy for alignments in the 0–10 percentage identity range. This suggests that further improvements in accuracy will be possible in the future.

Background

Multiple sequence alignment is a central technique in molecular biology [1, 2]. Alignments enhance the understanding of structure-function relationships by allowing common functional and structural regions in protein families to be identified [3]. Accurate alignment is also the essential first step in predicting a protein structure by homology modelling [4]. Many different techniques have been developed to align protein sequences [5–9]. For two sequences, dynamic programming guarantees a mathematically optimal alignment for a given set of parameters [10, 11]. Dynamic programming can be extended to the alignment of more than two sequences (multiple alignment) [12], but this becomes computationally intractable for more than ≈ 3 sequences without adding complexity to the basic dynamic programming algorithm [7]. Most practical methods for multiple alignment work by following a guide tree to add sequences or clusters of sequences to an alignment [8, 13], or by iteratively refining an initial alignment [14, 15].
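The optimality guarantee of pairwise dynamic programming can be illustrated with a minimal sketch of global alignment scoring. The function name nw_score, the match/mismatch scores and the linear gap penalty are assumptions made for illustration only; the methods benchmarked in this paper use pair-score matrices such as PAM250 or BLOSUM and their own gap models.

```python
def nw_score(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -2) -> int:
    """Optimal global alignment score of a and b under a toy scoring scheme.

    Illustrative assumptions: identity scoring and a linear gap penalty.
    Only the score is returned; no traceback is performed.
    """
    # prev[j] holds the best score for aligning the processed prefix of a
    # with the first j residues of b.
    prev = [j * gap for j in range(len(b) + 1)]
    for i, x in enumerate(a, start=1):
        curr = [i * gap]  # first i residues of a aligned against gaps only
        for j, y in enumerate(b, start=1):
            diagonal = prev[j - 1] + (match if x == y else mismatch)
            gap_in_b = prev[j] + gap      # residue x aligned to a gap
            gap_in_a = curr[j - 1] + gap  # residue y aligned to a gap
            curr.append(max(diagonal, gap_in_b, gap_in_a))
        prev = curr
    return prev[-1]
```

The table has one cell per pair of positions, so time grows with the product of the two sequence lengths. Extending the same recurrence to N sequences requires an N-dimensional table, which is why the exact approach quickly becomes intractable as sequences are added.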

The quality of automatic alignments has been assessed on small sets of protein sequence families [5, 14, 16, 17]. Barton and Sternberg [14] evaluated the quality of alignment on globin and immunoglobulin families by comparison to reference alignments from 3D (three-dimensional) structure comparison. McClure et al. [16] studied the performance of 12 different global and local methods of multiple protein sequence alignment on four protein families (hemoglobin, kinase, aspartic acid protease and ribonuclease H). Their criteria of assessment were based on the ability of the methods to identify correctly the ordered series of motifs that are conserved throughout each protein family. Gotoh [17] assessed the multiple sequence alignment method CLUSTALW [8] and four of his own methods [18–20] on 54 families from the Joy 3.2 database [21] of alignments from 3D structure comparisons.

More recently, the BAliBASE database of sequence alignments [22] has been created and used to evaluate the accuracy of alignment methods. The set of 142 alignments in BAliBASE is divided into five types that aim to test different factors that affect alignment accuracy, including large insertions, orphan sequences and N- or C-terminal extensions.

In this study, we describe a data set of reference alignments and software tools for benchmarking pairwise and multiple alignment methods. The benchmark data set is made up of domain families obtained from the 3Dee database of protein structural domains [23, 24]. After filtering these families by different criteria, reference structural alignments were determined by the STAMP algorithm [25]. The initial reference data set of domain family alignments was extended and subdivided in various ways to allow the study of different aspects of the protein sequence alignment problem. The reference alignments and tools were applied to the AMPS multiple alignment method [13, 14] in order to identify the most informative test measures. The benchmark suite was then applied to six further methods for comparison and the detailed results stored in a database accessible via the WWW.

Results

The results of this study cover: the development of a database of reference alignments; the definition of evaluation measures for multiple alignment accuracy; the identification of the most informative evaluation measures by application to the AMPS [13, 14] multiple alignment method; the application of the training data set to find good parameters for the AMPS multiple alignment program and investigation of different features of this hierarchical alignment method; exploration of the accuracy of alignment for AMPS on the different OXBench test sets; and application of the OXBench benchmark to compare eight different multiple alignment methods.

Development of reference alignments and evaluation measures

Structural alignments

Reference proteins for alignment were drawn from the 3Dee database of structural domains [23, 24]. 3Dee contains domain definitions for proteins of experimentally determined three-dimensional structure in the Protein Data Bank (PDB) up to July 1998. The domains are organised into a hierarchy of structurally similar protein domain families classified by the "S_c score" [25] from the automatic multiple structure alignment program STAMP [25]. S_c scores greater than 3.0 indicate clear structural similarity. STAMP not only provides the multiple structure alignment, but also gives a measure of reliability to each structurally aligned position. Thus, STAMP alignments provide a convenient way to filter out positions that are not structurally equivalent or where structural alignment can be ambiguous.

We started with 729 domain structure families at the S_c 5.0 level, which contained 9,015 domains. Families with only one member were removed, as were structures of resolution poorer than 3.2 Å and domains with fewer than 40 residues. Domains with more than 5% unknown residues and any domain for which the secondary structure could not be defined by DSSP [26] were also removed. The stereochemical quality of the structures was assessed by running PROCHECK v.3.4.4 on each chain [27]. PROCHECK examines a range of stereochemical features of protein structures and identifies torsion angles that deviate significantly from the distributions seen in protein structures solved at a similar resolution. The PROCHECK G-factor encapsulates these quality measures in a single figure. Accordingly, we filtered the domains to exclude any protein with an overall PROCHECK G-factor ≤ -1. These refinements left 465 families containing 7,217 domains. All multiple segment domains were then excluded to leave 5,428 domains in 381 families.
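The filtering criteria above can be summarised as a short sketch. The Domain record, its field names and the function names are hypothetical and are not part of the OXBench software; the thresholds are those stated in the text. The point at which single-member families are dropped is an assumption: here they are removed after each filtering stage.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Domain:
    # Hypothetical record; field names are illustrative only.
    name: str
    family: int
    resolution: float    # in Angstroms; larger values mean poorer resolution
    length: int          # number of residues
    frac_unknown: float  # fraction of unknown residues, 0.0 to 1.0
    has_dssp: bool       # secondary structure could be defined by DSSP
    g_factor: float      # overall PROCHECK G-factor
    n_segments: int      # number of chain segments making up the domain


def passes_quality(d: Domain) -> bool:
    """Per-domain criteria as stated in the text."""
    return (
        d.resolution <= 3.2        # exclude resolution poorer than 3.2 A
        and d.length >= 40         # exclude fewer than 40 residues
        and d.frac_unknown <= 0.05 # exclude more than 5% unknown residues
        and d.has_dssp
        and d.g_factor > -1.0      # exclude G-factor <= -1
    )


def drop_singleton_families(domains: list[Domain]) -> list[Domain]:
    """Remove domains whose family has only one remaining member."""
    counts = Counter(d.family for d in domains)
    return [d for d in domains if counts[d.family] >= 2]


def filter_domains(domains: list[Domain]) -> list[Domain]:
    # Stage 1: quality filters. Stage 2: single-segment domains only.
    # Dropping singleton families after each stage is an assumption.
    kept = drop_singleton_families([d for d in domains if passes_quality(d)])
    return drop_singleton_families([d for d in kept if d.n_segments == 1])
```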

Highly similar domains (≥ 98% identity) provide limited information for assessing alignment quality and so were removed from the data set by the following procedure. Within each family, the domains were compared by pairwise sequence alignment and clustered by percentage sequence identity [14], then one domain whose structure was solved at high resolution was selected from each cluster formed at 98% identity. Thus, the data set was reduced to 1,168 domains in 218 families, in which no two sequences in a family share ≥ 98% identity. We chose this relatively high PID cut-off since obtaining accurate alignment of sequences that are very similar is of critical importance in protein modelling and function prediction studies.

Throughout this work the PID for two domains was calculated from the reference structural alignment as the number of identical amino acid pairs in the alignment divided by the length of the shortest sequence.
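The PID definition above can be written as a minimal sketch. It assumes the two domains are given as equal-length rows of the reference alignment with '-' as the gap character; the function name and the gap symbol are illustrative assumptions, not part of the OXBench tools.

```python
def percent_identity(row_a: str, row_b: str, gap: str = "-") -> float:
    """PID of two aligned rows: identical residue pairs divided by the
    length of the shorter sequence (gaps excluded), times 100."""
    if len(row_a) != len(row_b):
        raise ValueError("aligned rows must have the same length")
    # Count columns where both rows carry the same residue (not a gap).
    identical = sum(1 for x, y in zip(row_a, row_b) if x == y and x != gap)
    # Ungapped sequence lengths.
    len_a = sum(1 for x in row_a if x != gap)
    len_b = sum(1 for y in row_b if y != gap)
    shortest = min(len_a, len_b)
    if shortest == 0:
        raise ValueError("at least one sequence is empty")
    return 100.0 * identical / shortest
```

Because the denominator is the shorter sequence, a short domain that matches a longer one at every one of its positions scores 100% regardless of how much of the longer domain is left unmatched.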

The STAMP multiple structure comparison algorithm [25] provides good reference alignments for testing sequence alignment methods since it can generate both pairwise and multiple alignments from structure and automatically identify SCRs (Structurally Conserved Regions). STAMP implements several alternative iterative hierarchical methods for finding the structural alignment of two or more proteins. All alternative methods were tried for all families, and the alignment with the highest structural similarity score (S_c) was selected [25]. Alignments produced by STAMP are usually at least as good as those by a human expert, but as structural similarity drops, alignments by any method become less easy to define [28, 29]. For these reasons, the few alignments found with unusually high or low S_c values compared to their PID were carefully inspected and, where structural alignments were thought to be in error, alternative STAMP parameters were tried to obtain more satisfactory results.

Structural alignments for every sequence pair in the families of the data set were also generated by STAMP as for the multiple structure alignments. This pairwise reference data set allows comparisons between pairwise and multiple alignment accuracies to be made.

Master data set

For some families in the unique data set of 218 families, the sequence identity between a subset of domains is < 10% and it is difficult for sequence alignment methods to align these families as a whole. An example is the immunoglobulin superfamily, where structure comparison puts C-type and V-type domains together, even though there is little sequence similarity. Although alignments of the complete families present a useful test, alignments of sub-families within these families are also a challenge to methods. Accordingly, the families were sub-divided on the basis of sequence identity and structural similarity.

In order to generate sequence-similar sub-families we first calculated the PID between every pair of sequences from its structural alignment. The family was then clustered on PID between domains by complete linkage with the program OC [30]. The domain clusters formed at PID cut-offs of 60, 40, 30, 20, 10 and 5 were used as sub-families, as illustrated in Figure 1 for the dehydrogenase family (Family 10). The sub-families formed between the given PID cut-offs were extracted as shown by the sub-divisions labelled A, B, C, D, E, F, G, H and I. For example, sub-family B comprises domains 1hya-AUTO and 1hyb-AUTO. A total of 391 sequence sub-families were created. The structural alignment of these sub-families was optimised by STAMP. In a similar manner, sub-families were generated on structural similarity at S_c cut-offs of 7, 6, 5, 4, 3 and 2.
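Complete-linkage clustering on PID can be sketched as follows. This is a naive illustration, not the OC program: the function name, the dictionary layout of the PID matrix and the use of ≥ at the cut-off are assumptions. Under complete linkage, the similarity of two clusters is the lowest PID between any member of one and any member of the other.

```python
from itertools import combinations


def complete_linkage_clusters(names, pid, cutoff):
    """Cluster names so that every pair within a cluster has PID >= cutoff.

    pid maps frozenset({a, b}) to the PID between domains a and b.
    Returns a list of frozensets; those with two or more members are
    candidate sub-families at this cut-off.
    """
    clusters = [frozenset([n]) for n in names]

    def linkage(c1, c2):
        # Complete linkage: similarity of the least similar cross pair.
        return min(pid[frozenset((a, b))] for a in c1 for b in c2)

    while len(clusters) > 1:
        # Find the most similar pair of clusters under complete linkage.
        score, i, j = max(
            (linkage(clusters[i], clusters[j]), i, j)
            for i, j in combinations(range(len(clusters)), 2)
        )
        if score < cutoff:
            break
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters
```

Running the function once per cut-off (for example 60, 40, 30, 20, 10 and 5) on the same PID matrix gives the nested clusters from which sub-families are drawn.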

The creation of sequence sub-families and structure sub-families was independent, so it was possible for there to be sub-families containing identical members. One of each pair of identical sub-families was removed to leave a total of 672 families and sub-families. This set included the 218 unique families and is referred to as the Master data set. Figure 2 summarises the further data sets and subsets that were derived from the Master data set and are described in the following sections.
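Removing duplicate sub-families amounts to comparing membership irrespective of order. A minimal sketch, assuming each family or sub-family is given as a collection of domain names (the function name is hypothetical):

```python
def unique_families(families):
    """Keep the first occurrence of each distinct set of members."""
    seen = set()
    unique = []
    for members in families:
        key = frozenset(members)  # membership only; order is ignored
        if key not in seen:
            seen.add(key)
            unique.append(sorted(key))
    return unique
```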

The distribution of the 218 families in percentage identity (PID) bins is shown in Table 1 and Figure 3. The families include a wide range of numbers of sequences (from 2 to 122) and a wide distribution of length and PID. The percentage of structurally conserved residues in the families ranges from 2.5% to 100%.

Table 1 Summary statistics for the Master data set. NDom: Number of domains. LenAln: Length of alignment. PID_a: Average pairwise percentage identity. PID_w: Percentage identity across all members of a family. S_c: The structural similarity score. PSCR: Percentage of positions in a structurally conserved region.
