Testing statistical significance scores of sequence comparison methods with structure similarity
© Hulsen et al; licensee BioMed Central Ltd. 2006
Received: 10 July 2006
Accepted: 12 October 2006
Published: 12 October 2006
In the past years the Smith-Waterman sequence comparison algorithm has gained popularity due to improved implementations and rapidly increasing computing power. However, the quality and sensitivity of a database search is not only determined by the algorithm but also by the statistical significance testing for an alignment. The e-value is the most commonly used statistical validation method for sequence database searching. The CluSTr database and the Protein World database have been created using an alternative statistical significance test: a Z-score based on Monte-Carlo statistics. Several papers have described the superiority of the Z-score as compared to the e-value, using simulated data. We were interested if this could be validated when applied to existing, evolutionary related protein sequences.
All experiments are performed on the ASTRAL SCOP database. The Smith-Waterman sequence comparison algorithm with both e-value and Z-score statistics is evaluated, using ROC, CVE and AP measures. The BLAST and FASTA algorithms are used as reference. We find that two out of three Smith-Waterman implementations with e-value are better at predicting structural similarities between proteins than the Smith-Waterman implementation with Z-score. SSEARCH especially has very high scores.
The compute intensive Z-score does not have a clear advantage over the e-value. The Smith-Waterman implementations give generally better results than their heuristic counterparts. We recommend using the SSEARCH algorithm combined with e-values for pairwise sequence comparisons.
Sequence comparison is still one of the most important methodologies in the field of computational biology. It enables researchers to compare the sequences of genes or proteins with unknown functions to sequences of well-studied genes or proteins. However, due to a significant increase in whole genome sequencing projects, the amount of sequence data is nowadays very large and rapidly increasing. Therefore, pairwise comparison algorithms should not only be accurate and reliable but also fast. The Smith-Waterman algorithm  is one of the most advanced and sensitive pairwise sequence comparison algorithms currently available. However, it is theoretically about 50 times slower than other popular algorithms , such as FASTA  and BLAST . All three algorithms generate local alignments, but the Smith-Waterman algorithm puts no constraints on the alignment it reports other than that it has a positive score in terms of the similarity table used to score the alignment. BLAST and FASTA put additional constraints on the alignments that they report in order to speed up their operation: only sequences above a certain similarity threshold are reported, the rest is used for the estimation of certain parameters used in the alignment calculation. Because of this Smith-Waterman is more sensitive than BLAST and FASTA. The Smith-Waterman algorithm finds the best matching regions in the same pair of sequences. However, BLAST and FASTA are still far more popular because of their speed and the addition of a statistical significance value, the Expect-value (or simply e-value), whereas the original Smith-Waterman implementation relies only on the SW-score without any further statistics. The newer Smith-Waterman implementations of Paracel , SSEARCH  and ParAlign  do include the e-value as a measure of statistical significance, which makes the Smith-Waterman algorithm more usable as the engine behind a similarity search tool. The e-value is far more useful than the SW-score, because it describes the number of hits one can expect to see by chance when searching a database of a certain size. An e-value threshold can be used easily to separate the 'interesting' results from the background noise. However, a more reliable statistical estimate is still needed . The Z-score, based on Monte-Carlo statistics, was introduced by Doolittle  and implemented by Gene-IT  in its sequence comparison suite Biofacet . The Z-score has been used in the creation of the sequence annotation databases CluSTr  and Protein World  and was used in orthology studies . The Z-score has also been implemented in algorithms other than Smith-Waterman, such as FASTA . It is calculated by performing a number (e.g., 100) of shuffling randomizations of both sequences that are compared, completed by an estimation of the SW score significance as compared to the original pairwise alignment. This makes the Z-score very useful for doing all-against-all pairwise sequence comparisons: Z-scores of different sequence pairs can be compared to each other, because they are only dependent on the sequences itself and not on the database size, which is one of the parameters used to calculate the e-value. However, this independency of the database size makes the Z-score unsuitable for determining the probability that an alignment has been obtained by chance. The randomizations make the Z-score calculation quite slow, but theoretically it is more sensitive and more selective than e-value statistics [16, 17]. Unfortunately, this has never been validated experimentally.
Some methods have been used to combine the sensitivity and selectivity of a sequence comparison algorithm into one single score . Receiver operating characteristic (ROC) is a popular measure of search accuracy . For a perfect search algorithm, all true positives for these queries should appear before any false positive in the ranked output list, which gives an ROC score of 1. If the first n items in the list are all false positives, the ROCn score is 0. Although researchers have devised many ways to merge ROC scores for a set of queries , one simple and popular method is to 'pool' search results so as to get an overall ROC score . Another method to evaluate different methods is the errors per query (EPQ) criterion and the 'coverage versus error' plots . EPQ is a selectivity indicator based on all-against-all comparisons, and coverage is a sensitivity measure. The assumption for EPQ is that the search algorithm can yield a 'normalized similarity score' rather than a length-dependent one, so that results from queries are comparable. Like ROC, the coverage versus error plot can give an overall performance comparison for search algorithms. A third method, the average precision (AP) criterion, is adopted from information retrieval research . The method defines two values: the recall (true positives divided by the number of homologs) and the precision (true positives divided by the number of hits), which are plotted in a graph. The AP then is an approximate integral to calculate the area under this recall-precision curve. These methods were used to compare several sequence comparison algorithms, but we use them to compare the e-value and Z-score statistics. Analyses of BLAST and FASTA are also included as reference material.
Here we show that two out of the three Smith-Waterman implementations with e-value statistics are more accurate than the Smith-Waterman implementation of Biofacet with Z-score statistics. Furthermore, the comparison of BLAST and FASTA with the four Smith-Waterman implementations shows that FASTA is a more reliable algorithm when using the ASTRAL SCOP structural classification as a benchmark. The Smith-Waterman implementation of Paracel even has lower scores than both BLAST and FASTA. SSEARCH, the Smith-Waterman implementation in the FASTA package, scores best.
Properties of ASTRAL SCOP PDB sets
Maximal percentage indentity
Number of sequences
Number of families
Average family size
Size of largest family
Number of families having only 1 member
Number of families having more than 1 member
Sequence comparison methods and parameters
Gap open penalty
Gap extension penalty
Number of randomizations
Paracel SW e-value
Biofacet SW Z-score
NCBI BLAST e-value
ParAlign SW e-value
Receiver operating characteristic
Coverage versus error
We included two examples of our statistical analysis, which show how the ROC and mean AP measures differ from each other and how results can be different for each studied protein. We choose two well-studied proteins: enoyl-ACP reductase and the progesterone receptor, the first from a prokaryote (E. coli) and the second from a eukaryote (H. sapiens). Both case studies were done using the PDB095 set, which is the most complete ASTRAL SCOP PDB set used in our study.
Bacterial enoyl-ACP reductase
Scores for bacterial enoyl-ACP reductase
First False Polsitive (FFP)
Number of True Positives (NTP)
Human progesterone receptor
Scores for human progesterone receptor
First False Positive (FFP)
Number of True Positives (NTP)
Times for all-against-all sequence comparisons of the ASTRAL SCOP PDB095 set.
Paracel SW e-value
3 hours *
Biofacet SW Z-score
NCBI BLAST e-value
5 hours, 49 minutes
ParAlign SW e-value
Finally, we would like to stress that the results from the CVE analysis might be more reliable than those from the ROC and mean AP analyses. ROC and mean AP make use of a ranking system based on the e-value or Z-score, instead of looking at the e-value or Z-score directly. This means that in some cases, especially the smaller protein families, a large number of very low-scoring hits (e.g. e>100 or Z<3) is still used for the calculation of the scores. This is not the case for the CVE plots, because we varied the e-value and Z-score thresholds above which a hit is seen as a true positive, instead of relying on a ranking system. However, because the results from the CVE plots are similar to the results from the ROC and mean AP graphs, the use of a ranking system does not seem to give a large disadvantage.
For a complete analysis we need a less biased database, having a wide range of proteins classified by structure similarity. Until such a database is available, it will be difficult to pinpoint the reasons for the different results between FASTA, BLAST and Smith-Waterman, and the theoretical advantages of the Z-score. Regardless of all these theoretical assumptions, the computational disadvantage of the Z-score is smaller for larger databases. Z-scores do not have to be recalculated when sequences are added to the database, in contrast to e-values, which are dependent on database size. For very large databases containing all-against-all comparisons, this is an important advantage of the Z-score. Although recalculating the e-values does not take much time when the alignments and SW scores are already available, this may cause a change in research results that were obtained earlier. Despite these considerations, we recommend using SSEARCH with e-value statistics for pairwise sequence comparisons.
For the Smith-Waterman e-value calculation, the ASTRAL SCOP files were loaded onto the Paracel file system as protein databases and subsequently used as queries against these databases: the set with 10% maximal identity (PDB010) against itself, the set with 20% maximal identity (PDB020) against itself, etc. The matrix used for all sequence comparisons was the BLOSUM62 matrix . This is the default scoring matrix for most alignment programs. For all sequence comparisons in this article, the gap open penalty was set to 12 and the gap extension penalty was set to 1. These are the averages of the default penalties over the six studied methods. Both the matrix and gap penalties used are suited for comparing protein sets with a broad spectrum of evolutionary distances, like the PDB set [30, 31]. Per query sequence, the best 100 hits were kept [see section Data availability], discarding the match of each query sequence with itself.
Receiver operating characteristic calculation
For each query, the 100 best hits were marked as true positives or false positives, i.e. the hit being in the same or in a different SCOP family than the query. For each of the first 50 false positives that were found, the number of true positives with a higher similarity score was calculated. The sum of all of these numbers was then divided by the number of false positives (50), and finally divided by the total number of possible true positives in the database (i.e. the total number of members in the SCOP family minus 1), giving an ROC50 score for each query sequence. The average of these ROC50 scores gives the final ROC score for that specific statistical value and that specific ASTRAL SCOP set. Mean ROC50 scores were calculated for all ten different ASTRAL SCOP sets.
Coverage versus error calculation
Instead of taking the first 100 hits for each query, like in the ROC analysis, we varied the threshold at which a certain hit was seen as a positive. For the e-value analysis, we created a list of 49 thresholds in the range of 10-50 to 100. For Z-score, we created a list of 58 thresholds in the range of 0 to 100. Then, for each threshold, two parameters were measured: the coverage and the errors per query (EPQ). The coverage is the number of true hits divided by the total number of sequence pairs that are in the same SCOP family, for that specific ASTRAL SCOP set. The EPQ is the number of false hits divided by the number of queries. We used the most inclusive ASTRAL SCOP set (PDB095), the least inclusive set (PDB010) and an intermediate set (PDB035) to create the coverage versus error plots.
Average precision calculation
For the calculation of the average precision (AP), the 100 best hits per query were marked again as either true positives or false positives. Subsequently for each true positive found by the search algorithm, the true positive rank of this hit (i.e. the number of true positives with a higher score + 1) was divided by the positive rank (i.e. the number of hits with a higher score + 1). These numbers were all added up and then divided by the total number of hits (i.e. 100), giving one AP value per query. The mean AP is the average of all these APs. Mean APs were calculated for all ten different ASTRAL SCOP sets.
Bacterial enoyl-ACP reductase
The ASTRAL SCOP entry for E. coli enoyl-ACP reductase chain A, d1qg6a_, was picked as an example for our methodology. The 100 best hits of this entry on the PDB095 set were calculated using each of the six algorithms and sorted by ascending e-value and descending Z-score. Then they were marked as either true positives or false positives, depending on if the hit was in the same structural family (c.2.1.2) or not. Furthermore, the ROC50 scores and mean APs were calculated.
Human progesterone receptor
A second example is the analysis of d1a28a_, the H. sapiens progesterone receptor chain A. Once more, the 100 best hits of this entry on the PDB095 set were calculated using each of the six algorithms and sorted by ascending e-value and descending Z-score. These hits were marked as either true positives or false positives, depending on if the hit was in the same structural family (a.123.1.1) or not. Finally, the mean AP and ROC50 scores were calculated.
We measured the speed of the sequence comparison algorithms, by doing an all-against-all comparison of the ASTRAL SCOP PDB095 set and using the 'time' command provided by UNIX. All calculations were performed on the same machine, except for the Paracel calculation which could only be performed on the Paracel machine. The Paracel calculation time had to be estimated because of the unaivailability of the Paracel machine at the time of performing this analysis.
All raw sequence comparison output files (containing the top 100 hits per query sequence) are available through our website . The top 100 hits for the two case studies of the bacterial enoyl-ACP reductase (i.e. Table S.1) and the human progesterone receptor (i.e. Table S.2) can be found in the additional files section [see Additional file 1].
- bf z:
Basic Local Alignment Search Tool
- bl e:
BLOcks SUbstitution Matrix
Clusters of SWISS-PROT and TrEMBL
Coverage Versus Error
Errors Per Query
- fa e:
First False Positive
Number of True Positives
- pa e:
- pc e:
Protein Data Bank
Receiver Operating Characteristic
Structural Classification Of Proteins
- ss e:
This work was supported financially by NV Organon and the Netherlands Organization for Scientific Research (NWO). The authors like to thank Scott Lusher for critically reading this manuscript.
- Smith TF, Waterman MS: Identification of common molecular subsequences. J Mol Biol 1981, 147(1):195–197. 10.1016/0022-2836(81)90087-5View ArticlePubMedGoogle Scholar
- Brenner SE, Chothia C, Hubbard TJ: Assessing sequence comparison methods with reliable structurally identified distant evolutionary relationships. Proc Natl Acad Sci U S A 1998, 95(11):6073–6078. 10.1073/pnas.95.11.6073PubMed CentralView ArticlePubMedGoogle Scholar
- Pearson WR, Lipman DJ: Improved tools for biological sequence comparison. Proc Natl Acad Sci U S A 1988, 85(8):2444–2448. 10.1073/pnas.85.8.2444PubMed CentralView ArticlePubMedGoogle Scholar
- Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997, 25(17):3389–3402. 10.1093/nar/25.17.3389PubMed CentralView ArticlePubMedGoogle Scholar
- Pearson WR: Searching protein sequence libraries: comparison of the sensitivity and selectivity of the Smith-Waterman and FASTA algorithms. Genomics 1991, 11(3):635–650. 10.1016/0888-7543(91)90071-LView ArticlePubMedGoogle Scholar
- Rognes T: ParAlign: a parallel sequence alignment algorithm for rapid and sensitive database searches. Nucleic Acids Res 2001, 29(7):1647–1652. 10.1093/nar/29.7.1647PubMed CentralView ArticlePubMedGoogle Scholar
- Pearson WR, Sierk ML: The limits of protein sequence comparison? Curr Opin Struct Biol 2005.Google Scholar
- Doolittle RF: Of URFs and ORFs: a primer on how to analyze derived amino acid sequences. Mill Valley California, University Science Books; 1986.Google Scholar
- Codani JJ, Comet JP, Aude JC, Glémet E, Wozniak A, Risler JL, Hénaut A, Slonimski PP: Automatic Analysis of Large-Scale Pairwise Alignments of Protein Sequences. Methods in Microbiology 1999, 28: 229–244.View ArticleGoogle Scholar
- Kriventseva EV, Servant F, Apweiler R: Improvements to CluSTr: the database of SWISS-PROT+TrEMBL protein clusters. Nucleic Acids Res 2003, 31(1):388–389. 10.1093/nar/gkg035PubMed CentralView ArticlePubMedGoogle Scholar
- Protein World[http://www.cmbi.ru.nl/pw/]
- Hulsen T, Huynen MA, de Vlieg J, Groenen PM: Benchmarking ortholog identification methods using functional genomics data. Genome Biol 2006, 7(4):R31. 10.1186/gb-2006-7-4-r31PubMed CentralView ArticlePubMedGoogle Scholar
- Booth HS, Maindonald JH, Wilson SR, Gready JE: An efficient Z-score algorithm for assessing sequence alignments. J Comput Biol 2004, 11(4):616–625. 10.1089/cmb.2004.11.616View ArticlePubMedGoogle Scholar
- Comet JP, Aude JC, Glemet E, Risler JL, Henaut A, Slonimski PP, Codani JJ: Significance of Z-value statistics of Smith-Waterman scores for protein alignments. Comput Chem 1999, 23(3–4):317–331. 10.1016/S0097-8485(99)00008-XView ArticlePubMedGoogle Scholar
- Bastien O, Aude JC, Roy S, Marechal E: Fundamentals of massive automatic pairwise alignments of protein sequences: theoretical significance of Z-value statistics. Bioinformatics 2004, 20(4):534–537. 10.1093/bioinformatics/btg440View ArticlePubMedGoogle Scholar
- Chen Z: Assessing sequence comparison methods with the average precision criterion. Bioinformatics 2003, 19(18):2456–2460. 10.1093/bioinformatics/btg349View ArticlePubMedGoogle Scholar
- Gribskov M, Robinson NL: Use of receiver operating characteristic (ROC) analysis to evaluate sequence matching. Compu Chem 1996, 20: 25–33. 10.1016/S0097-8485(96)80004-0View ArticleGoogle Scholar
- Kester AD, Buntinx F: Meta-analysis of ROC curves. Med Decis Making 2000, 20(4):430–439.View ArticlePubMedGoogle Scholar
- Schaffer AA, Aravind L, Madden TL, Shavirin S, Spouge JL, Wolf YI, Koonin EV, Altschul SF: Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements. Nucleic Acids Res 2001, 29(14):2994–3005. 10.1093/nar/29.14.2994PubMed CentralView ArticlePubMedGoogle Scholar
- Salton G: Developments in automatic text retrieval. Science 1991, 253: 974–980. 10.1126/science.253.5023.974View ArticlePubMedGoogle Scholar
- Brenner SE, Koehl P, Levitt M: The ASTRAL compendium for protein structure and sequence analysis. Nucleic Acids Res 2000, 28(1):254–256. 10.1093/nar/28.1.254PubMed CentralView ArticlePubMedGoogle Scholar
- Park J, Karplus K, Barrett C, Hughey R, Haussler D, Hubbard T, Chothia C: Sequence comparisons using multiple sequences detect three times as many remote homologues as pairwise methods. J Mol Biol 1998, 284(4):1201–1210. 10.1006/jmbi.1998.2221View ArticlePubMedGoogle Scholar
- ASTRAL SCOP release 1.65[http://astral.berkeley.edu/scopseq-1.65.html]
- Apweiler R, Bairoch A, Wu CH, Barker WC, Boeckmann B, Ferro S, Gasteiger E, Huang H, Lopez R, Magrane M, Martin MJ, Natale DA, O'Donovan C, Redaschi N, Yeh LS: UniProt: the Universal Protein knowledgebase. Nucleic Acids Res 2004, 32(Database issue):D115–9. 10.1093/nar/gkh131PubMed CentralView ArticlePubMedGoogle Scholar
- Agarwal P, States DJ: Comparative accuracy of methods for protein sequence similarity search. Bioinformatics 1998, 14(1):40–47. 10.1093/bioinformatics/14.1.40View ArticlePubMedGoogle Scholar
- Pearson WR: Comparison of methods for searching protein sequence databases. Protein Sci 1995, 4(6):1145–1160.PubMed CentralView ArticlePubMedGoogle Scholar
- Henikoff S, Henikoff JG: Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci U S A 1992, 89(22):10915–10919. 10.1073/pnas.89.22.10915PubMed CentralView ArticlePubMedGoogle Scholar
- Reese JT, Pearson WR: Empirical determination of effective gap penalties for sequence comparison. Bioinformatics 2002, 18(11):1500–1507. 10.1093/bioinformatics/18.11.1500View ArticlePubMedGoogle Scholar
- Price GA, Crooks GE, Green RE, Brenner SE: Statistical evaluation of pairwise protein sequence comparison with the Bayesian bootstrap. Bioinformatics 2005, 21(20):3824–3831. 10.1093/bioinformatics/bti627View ArticlePubMedGoogle Scholar
- Supplementary data[http://www.cmbi.ru.nl/~timhulse/ezcomp/]
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.