Testing statistical significance scores of sequence comparison methods with structure similarity
 Tim Hulsen^{1},
 Jacob de Vlieg^{1, 2},
 Jack AM Leunissen^{3} and
 Peter MA Groenen^{2}
DOI: 10.1186/1471-2105-7-444
© Hulsen et al; licensee BioMed Central Ltd. 2006
Received: 10 July 2006
Accepted: 12 October 2006
Published: 12 October 2006
Abstract
Background
In the past years the Smith-Waterman sequence comparison algorithm has gained popularity due to improved implementations and rapidly increasing computing power. However, the quality and sensitivity of a database search are not only determined by the algorithm but also by the statistical significance testing for an alignment. The e-value is the most commonly used statistical validation method for sequence database searching. The CluSTr database and the Protein World database have been created using an alternative statistical significance test: a Z-score based on Monte Carlo statistics. Several papers have described the superiority of the Z-score over the e-value, using simulated data. We were interested in whether this superiority could be validated when applied to existing, evolutionarily related protein sequences.
Results
All experiments are performed on the ASTRAL SCOP database. The Smith-Waterman sequence comparison algorithm with both e-value and Z-score statistics is evaluated, using ROC, CVE and AP measures. The BLAST and FASTA algorithms are used as references. We find that two out of three Smith-Waterman implementations with e-value are better at predicting structural similarities between proteins than the Smith-Waterman implementation with Z-score. SSEARCH in particular scores very high.
Conclusion
The compute-intensive Z-score does not have a clear advantage over the e-value. The Smith-Waterman implementations generally give better results than their heuristic counterparts. We recommend using the SSEARCH algorithm combined with e-values for pairwise sequence comparisons.
Background
Sequence comparison is still one of the most important methodologies in the field of computational biology. It enables researchers to compare the sequences of genes or proteins with unknown functions to sequences of well-studied genes or proteins. However, due to a significant increase in whole-genome sequencing projects, the amount of sequence data is nowadays very large and rapidly increasing. Therefore, pairwise comparison algorithms should not only be accurate and reliable but also fast. The Smith-Waterman algorithm [1] is one of the most advanced and sensitive pairwise sequence comparison algorithms currently available. However, it is theoretically about 50 times slower than other popular algorithms [2], such as FASTA [3] and BLAST [4]. All three algorithms generate local alignments, but the Smith-Waterman algorithm puts no constraints on the alignment it reports other than that it has a positive score in terms of the similarity table used to score the alignment. BLAST and FASTA put additional constraints on the alignments that they report in order to speed up their operation: only sequences above a certain similarity threshold are reported; the rest are used for the estimation of certain parameters used in the alignment calculation. Because of this, Smith-Waterman is more sensitive than BLAST and FASTA, as it finds the best matching regions in each pair of sequences. However, BLAST and FASTA are still far more popular because of their speed and the addition of a statistical significance value, the Expect value (or simply e-value), whereas the original Smith-Waterman implementation relies only on the SW score without any further statistics. The newer Smith-Waterman implementations of Paracel [5], SSEARCH [6] and ParAlign [7] do include the e-value as a measure of statistical significance, which makes the Smith-Waterman algorithm more usable as the engine behind a similarity search tool.
The e-value is far more useful than the SW score, because it describes the number of hits one can expect to see by chance when searching a database of a certain size. An e-value threshold can easily be used to separate the 'interesting' results from the background noise. However, a more reliable statistical estimate is still needed [8]. The Z-score, based on Monte Carlo statistics, was introduced by Doolittle [9] and implemented by GeneIT [10] in its sequence comparison suite Biofacet [11]. The Z-score has been used in the creation of the sequence annotation databases CluSTr [12] and Protein World [13] and was used in orthology studies [14]. The Z-score has also been implemented in algorithms other than Smith-Waterman, such as FASTA [15]. It is calculated by performing a number (e.g., 100) of shuffling randomizations of both sequences that are compared, followed by an estimation of the significance of the original SW score relative to the distribution of the shuffled scores. This makes the Z-score very useful for doing all-against-all pairwise sequence comparisons: Z-scores of different sequence pairs can be compared to each other, because they depend only on the sequences themselves and not on the database size, which is one of the parameters used to calculate the e-value. However, this independence of the database size makes the Z-score unsuitable for determining the probability that an alignment has been obtained by chance. The randomizations make the Z-score calculation quite slow, but theoretically it is more sensitive and more selective than e-value statistics [16, 17]. Unfortunately, this has never been validated experimentally.
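The shuffling procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the Biofacet implementation; `score_fn` is a placeholder for any pairwise scoring routine (in practice a Smith-Waterman score), and the toy position-match scorer in the usage example below is purely for demonstration.

```python
import random

def z_score(seq_a, seq_b, score_fn, n_shuffles=100, rng=None):
    """Monte Carlo Z-score sketch: compare the real alignment score with
    the distribution of scores obtained after shuffling both sequences."""
    rng = rng or random.Random()
    real = score_fn(seq_a, seq_b)
    shuffled = []
    for _ in range(n_shuffles):
        sa = ''.join(rng.sample(seq_a, len(seq_a)))  # random permutation of seq_a
        sb = ''.join(rng.sample(seq_b, len(seq_b)))  # random permutation of seq_b
        shuffled.append(score_fn(sa, sb))
    mean = sum(shuffled) / n_shuffles
    sd = (sum((s - mean) ** 2 for s in shuffled) / n_shuffles) ** 0.5
    return (real - mean) / sd if sd > 0 else 0.0

# Usage with a toy scorer (count of identical positions); a real pair of
# related sequences scored with Smith-Waterman would behave analogously.
matches = lambda a, b: sum(x == y for x, y in zip(a, b))
z = z_score("AAAABBBBCCCC", "AAAABBBBCCCC", matches, rng=random.Random(42))
```

Because the Z-score is computed from the two sequences alone, adding sequences to the database leaves it unchanged, which is exactly the database-size independence discussed above.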
Some methods have been devised to combine the sensitivity and selectivity of a sequence comparison algorithm into one single score [18]. The receiver operating characteristic (ROC) is a popular measure of search accuracy [19]. For a perfect search algorithm, all true positives for a set of queries should appear before any false positive in the ranked output list, which gives an ROC score of 1. If the first n items in the list are all false positives, the ROC_{n} score is 0. Although researchers have devised many ways to merge ROC scores for a set of queries [20], one simple and popular method is to 'pool' search results so as to get an overall ROC score [21]. Another method to evaluate different methods is the errors per query (EPQ) criterion and the 'coverage versus error' plots [2]. EPQ is a selectivity indicator based on all-against-all comparisons, and coverage is a sensitivity measure. The assumption for EPQ is that the search algorithm can yield a 'normalized similarity score' rather than a length-dependent one, so that results from different queries are comparable. Like ROC, the coverage versus error plot can give an overall performance comparison for search algorithms. A third method, the average precision (AP) criterion, is adopted from information retrieval research [22]. The method defines two values: the recall (true positives divided by the number of homologs) and the precision (true positives divided by the number of hits), which are plotted in a graph. The AP then is an approximate integral that calculates the area under this recall-precision curve. These methods have previously been used to compare several sequence comparison algorithms; we use them to compare the e-value and Z-score statistics. Analyses of BLAST and FASTA are included as reference material.
Here we show that two out of the three Smith-Waterman implementations with e-value statistics are more accurate than the Smith-Waterman implementation of Biofacet with Z-score statistics. Furthermore, the comparison of BLAST and FASTA with the four Smith-Waterman implementations shows that FASTA is a more reliable algorithm when using the ASTRAL SCOP structural classification as a benchmark. The Smith-Waterman implementation of Paracel even has lower scores than both BLAST and FASTA. SSEARCH, the Smith-Waterman implementation in the FASTA package, scores best.
Results
Properties of ASTRAL SCOP PDB sets
Maximal percentage identity | Number of sequences | Number of families | Average family size | Size of largest family | Number of families having only 1 member | Number of families having more than 1 member
10% | 3631 | 2250 | 1.614 | 25 | 1655 | 595
20% | 3968 | 2297 | 1.727 | 29 | 1605 | 692
25% | 4357 | 2313 | 1.884 | 32 | 1530 | 783
30% | 4821 | 2320 | 2.078 | 39 | 1435 | 885
35% | 5301 | 2322 | 2.283 | 46 | 1333 | 989
40% | 5674 | 2322 | 2.444 | 47 | 1269 | 1053
50% | 6442 | 2324 | 2.772 | 50 | 1178 | 1146
70% | 7551 | 2325 | 3.248 | 127 | 1087 | 1238
90% | 8759 | 2326 | 3.766 | 405 | 1023 | 1303
95% | 9498 | 2326 | 4.083 | 479 | 977 | 1349
Sequence comparison methods and parameters
Method | Abbreviation | Version | Matrix | Gap open penalty | Gap extension penalty | Number of randomizations
Paracel SW e-value | pc e | - | BLOSUM62 | 3*IS * | 0.3*IS * | 0
Biofacet SW Z-score | bf z | 2.9.6 | BLOSUM62 | 12 | 1 | 100
NCBI BLAST e-value | bl e | 2.2.9 | BLOSUM62 | 12 | 1 | 0
FASTA e-value | fa e | 3.4t24 | BLOSUM62 | 12 | 1 | 0
SSEARCH e-value | ss e | 3.4t24 | BLOSUM62 | 12 | 1 | 0
ParAlign SW e-value | pa e | 4.0.0 | BLOSUM62 | 12 | 1 | 0
Receiver operating characteristic
Coverage versus error
Average precision
Case studies
We included two examples of our statistical analysis, which show how the ROC and mean AP measures differ from each other and how results can differ for each studied protein. We chose two well-studied proteins: enoyl-ACP reductase and the progesterone receptor, the first from a prokaryote (E. coli) and the second from a eukaryote (H. sapiens). Both case studies were done using the PDB095 set, which is the most complete ASTRAL SCOP PDB set used in our study.
Bacterial enoyl-ACP reductase
Scores for bacterial enoyl-ACP reductase
Measure | pc e | bf z | bl e | fa e | ss e | pa e
ROC score | 0.156 | 0.124 | 0.250 | 0.367 | 0.338 | 0.229
MAP score | 0.212 | 0.161 | 0.264 | 0.374 | 0.343 | 0.234
First False Positive (FFP) | 21 | 24 | 24 | 22 | 24 | 24
Number of True Positives (NTP) | 25 | 23 | 27 | 25 | 24 | 24
Human progesterone receptor
Scores for human progesterone receptor
Measure | pc e | bf z | bl e | fa e | ss e | pa e
ROC score | 0.402 | 0.437 | 0.513 | 0.745 | 0.762 | 0.573
MAP score | 0.504 | 0.503 | 0.548 | 0.727 | 0.745 | 0.586
First False Positive (FFP) | 22 | 18 | 25 | 23 | 23 | 23
Number of True Positives (NTP) | 28 | 27 | 27 | 28 | 28 | 28
Timing
Times for all-against-all sequence comparisons of the ASTRAL SCOP PDB095 set.
Method | Time
Paracel SW e-value | 3 hours *
Biofacet SW Z-score | multiple days
NCBI BLAST e-value | 15 minutes
FASTA e-value | 40 minutes
SSEARCH e-value | 5 hours, 49 minutes
ParAlign SW e-value | 47 minutes
Discussion
Finally, we would like to stress that the results from the CVE analysis might be more reliable than those from the ROC and mean AP analyses. ROC and mean AP make use of a ranking system based on the e-value or Z-score, instead of looking at the e-value or Z-score directly. This means that in some cases, especially for the smaller protein families, a large number of very low-scoring hits (e.g. e > 100 or Z < 3) is still used for the calculation of the scores. This is not the case for the CVE plots, because we varied the e-value and Z-score thresholds above which a hit is seen as a true positive, instead of relying on a ranking system. However, because the results from the CVE plots are similar to the results from the ROC and mean AP graphs, the use of a ranking system does not seem to be a large disadvantage.
Conclusion
For a complete analysis we need a less biased database, having a wide range of proteins classified by structure similarity. Until such a database is available, it will be difficult to pinpoint the reasons for the different results between FASTA, BLAST and Smith-Waterman, and to verify the theoretical advantages of the Z-score. Regardless of these theoretical considerations, the computational disadvantage of the Z-score is smaller for larger databases: Z-scores do not have to be recalculated when sequences are added to the database, in contrast to e-values, which are dependent on database size. For very large databases subjected to all-against-all comparisons, this is an important advantage of the Z-score. Although recalculating the e-values does not take much time when the alignments and SW scores are already available, it may change research results that were obtained earlier. Despite these considerations, we recommend using SSEARCH with e-value statistics for pairwise sequence comparisons.
Methods
Sequence comparisons
For the Smith-Waterman e-value calculation, the ASTRAL SCOP files were loaded onto the Paracel file system as protein databases and subsequently used as queries against these databases: the set with 10% maximal identity (PDB010) against itself, the set with 20% maximal identity (PDB020) against itself, etc. The matrix used for all sequence comparisons was the BLOSUM62 matrix [29], the default scoring matrix for most alignment programs. For all sequence comparisons in this article, the gap open penalty was set to 12 and the gap extension penalty to 1; these are the averages of the default penalties over the six studied methods. Both the matrix and the gap penalties used are suited for comparing protein sets with a broad spectrum of evolutionary distances, such as the PDB set [30, 31]. Per query sequence, the best 100 hits were kept [see section Data availability], discarding the match of each query sequence with itself.
Receiver operating characteristic calculation
For each query, the 100 best hits were marked as true positives or false positives, i.e. the hit being in the same or in a different SCOP family than the query. For each of the first 50 false positives found, the number of true positives with a higher similarity score was counted. The sum of these counts was then divided by the number of false positives (50), and finally by the total number of possible true positives in the database (i.e. the total number of members of the query's SCOP family minus 1), giving a ROC_{50} score for each query sequence. The average of these ROC_{50} scores gives the final ROC score for that specific statistical value and that specific ASTRAL SCOP set. Mean ROC_{50} scores were calculated for all ten ASTRAL SCOP sets.
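The per-query ROC_{50} calculation described above can be expressed compactly. This is an illustrative sketch following the description in this section, not code from the study; the input is a best-first list of booleans marking whether each hit shares the query's SCOP family.

```python
def roc_n(ranked_is_tp, n_fp=50, n_possible_tp=0):
    """ROC_n for a single query. `ranked_is_tp` is the hit list sorted
    best score first, True meaning the hit is in the query's SCOP family.
    `n_possible_tp` is the query's family size minus 1."""
    if n_possible_tp == 0:
        return 0.0
    total, tp_seen, fp_seen = 0, 0, 0
    for is_tp in ranked_is_tp:
        if is_tp:
            tp_seen += 1          # true positives ranked above later FPs
        else:
            fp_seen += 1
            total += tp_seen      # count TPs scoring higher than this FP
            if fp_seen == n_fp:   # stop after the first n_fp false positives
                break
    return total / (n_fp * n_possible_tp)
```

A perfect ranking (all family members before any false positive) yields 1.0, and a ranking whose first n hits are all false positives yields 0.0, matching the ROC behavior described in the Background.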
Coverage versus error calculation
Instead of taking the first 100 hits for each query, as in the ROC analysis, we varied the threshold at which a hit was counted as a positive. For the e-value analysis, we created a list of 49 thresholds in the range of 10^{-50} to 100. For the Z-score, we created a list of 58 thresholds in the range of 0 to 100. Then, for each threshold, two parameters were measured: the coverage and the errors per query (EPQ). The coverage is the number of true hits divided by the total number of sequence pairs that are in the same SCOP family, for that specific ASTRAL SCOP set. The EPQ is the number of false hits divided by the number of queries. We used the most inclusive ASTRAL SCOP set (PDB095), the least inclusive set (PDB010) and an intermediate set (PDB035) to create the coverage versus error plots.
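One point of a coverage-versus-error plot can be computed as follows. This sketch assumes e-value-style thresholds (a hit counts when its score is at or below the threshold); for Z-scores the comparison direction would be reversed. The tuple format is an assumption for illustration.

```python
def coverage_and_epq(pooled_hits, threshold, n_family_pairs, n_queries):
    """One CVE point. `pooled_hits` is a list of (e_value, is_true) tuples
    pooled over all queries; `n_family_pairs` is the total number of
    same-SCOP-family sequence pairs in the set."""
    tp = sum(1 for e, is_true in pooled_hits if e <= threshold and is_true)
    fp = sum(1 for e, is_true in pooled_hits if e <= threshold and not is_true)
    coverage = tp / n_family_pairs   # sensitivity: fraction of true pairs found
    epq = fp / n_queries             # selectivity: false hits per query
    return coverage, epq
```

Sweeping the threshold over the 49 (e-value) or 58 (Z-score) values listed above and plotting the resulting (EPQ, coverage) pairs gives the CVE curve for one ASTRAL SCOP set.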
Average precision calculation
For the calculation of the average precision (AP), the 100 best hits per query were marked again as either true positives or false positives. Subsequently for each true positive found by the search algorithm, the true positive rank of this hit (i.e. the number of true positives with a higher score + 1) was divided by the positive rank (i.e. the number of hits with a higher score + 1). These numbers were all added up and then divided by the total number of hits (i.e. 100), giving one AP value per query. The mean AP is the average of all these APs. Mean APs were calculated for all ten different ASTRAL SCOP sets.
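The AP calculation described above reduces to a short loop. This sketch follows the paper's definition exactly, including the division by the total number of hits (100 in this study) rather than by the number of true positives found.

```python
def average_precision(ranked_is_tp):
    """AP for one query. `ranked_is_tp` is the hit list sorted best score
    first, True meaning the hit is a true positive. For each true positive,
    add (true-positive rank / overall rank); divide the sum by the total
    number of hits, per the definition in this section."""
    total, tp = 0.0, 0
    for rank, is_tp in enumerate(ranked_is_tp, start=1):
        if is_tp:
            tp += 1                 # true-positive rank of this hit
            total += tp / rank      # precision at this position
    return total / len(ranked_is_tp)
```

Averaging this value over all queries gives the mean AP reported for each ASTRAL SCOP set.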
Bacterial enoyl-ACP reductase
The ASTRAL SCOP entry for E. coli enoyl-ACP reductase chain A, d1qg6a_, was picked as an example for our methodology. The 100 best hits of this entry against the PDB095 set were calculated using each of the six algorithms and sorted by ascending e-value or descending Z-score. They were then marked as either true positives or false positives, depending on whether the hit was in the same structural family (c.2.1.2) or not. Furthermore, the ROC_{50} scores and mean APs were calculated.
Human progesterone receptor
A second example is the analysis of d1a28a_, the H. sapiens progesterone receptor chain A. Once more, the 100 best hits of this entry against the PDB095 set were calculated using each of the six algorithms and sorted by ascending e-value or descending Z-score. These hits were marked as either true positives or false positives, depending on whether the hit was in the same structural family (a.123.1.1) or not. Finally, the mean AP and ROC_{50} scores were calculated.
Timing
We measured the speed of the sequence comparison algorithms by doing an all-against-all comparison of the ASTRAL SCOP PDB095 set, using the 'time' command provided by UNIX. All calculations were performed on the same machine, except for the Paracel calculation, which could only be performed on the Paracel machine. The Paracel calculation time had to be estimated because of the unavailability of the Paracel machine at the time of this analysis.
Data availability
All raw sequence comparison output files (containing the top 100 hits per query sequence) are available through our website [32]. The top 100 hits for the two case studies of the bacterial enoyl-ACP reductase (Table S.1) and the human progesterone receptor (Table S.2) can be found in the additional files section [see Additional file 1].
Abbreviations
AP: Average Precision
bf z: Biofacet (Z-score)
BLAST: Basic Local Alignment Search Tool
bl e: BLAST (e-value)
BLOSUM: BLOcks SUbstitution Matrix
CluSTr: Clusters of SWISS-PROT and TrEMBL
CVE: Coverage Versus Error
EPQ: Errors Per Query
fa e: FASTA (e-value)
FFP: First False Positive
NTP: Number of True Positives
pa e: ParAlign (e-value)
pc e: Paracel (e-value)
PDB: Protein Data Bank
ROC: Receiver Operating Characteristic
SCOP: Structural Classification Of Proteins
ss e: SSEARCH (e-value)
SW: Smith-Waterman
Declarations
Acknowledgements
This work was supported financially by NV Organon and the Netherlands Organization for Scientific Research (NWO). The authors would like to thank Scott Lusher for critically reading this manuscript.
Authors’ Affiliations
References
1. Smith TF, Waterman MS: Identification of common molecular subsequences. J Mol Biol 1981, 147(1):195–197. doi:10.1016/0022-2836(81)90087-5
2. Brenner SE, Chothia C, Hubbard TJ: Assessing sequence comparison methods with reliable structurally identified distant evolutionary relationships. Proc Natl Acad Sci U S A 1998, 95(11):6073–6078. doi:10.1073/pnas.95.11.6073
3. Pearson WR, Lipman DJ: Improved tools for biological sequence comparison. Proc Natl Acad Sci U S A 1988, 85(8):2444–2448. doi:10.1073/pnas.85.8.2444
4. Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997, 25(17):3389–3402. doi:10.1093/nar/25.17.3389
5. Paracel [http://www.paracel.com]
6. Pearson WR: Searching protein sequence libraries: comparison of the sensitivity and selectivity of the Smith-Waterman and FASTA algorithms. Genomics 1991, 11(3):635–650. doi:10.1016/0888-7543(91)90071-L
7. Rognes T: ParAlign: a parallel sequence alignment algorithm for rapid and sensitive database searches. Nucleic Acids Res 2001, 29(7):1647–1652. doi:10.1093/nar/29.7.1647
8. Pearson WR, Sierk ML: The limits of protein sequence comparison? Curr Opin Struct Biol 2005.
9. Doolittle RF: Of URFs and ORFs: a primer on how to analyze derived amino acid sequences. Mill Valley, California: University Science Books; 1986.
10. GeneIT [http://www.geneit.com]
11. Codani JJ, Comet JP, Aude JC, Glémet E, Wozniak A, Risler JL, Hénaut A, Slonimski PP: Automatic Analysis of Large-Scale Pairwise Alignments of Protein Sequences. Methods in Microbiology 1999, 28:229–244.
12. Kriventseva EV, Servant F, Apweiler R: Improvements to CluSTr: the database of SWISS-PROT+TrEMBL protein clusters. Nucleic Acids Res 2003, 31(1):388–389. doi:10.1093/nar/gkg035
13. Protein World [http://www.cmbi.ru.nl/pw/]
14. Hulsen T, Huynen MA, de Vlieg J, Groenen PM: Benchmarking ortholog identification methods using functional genomics data. Genome Biol 2006, 7(4):R31. doi:10.1186/gb-2006-7-4-r31
15. Booth HS, Maindonald JH, Wilson SR, Gready JE: An efficient Z-score algorithm for assessing sequence alignments. J Comput Biol 2004, 11(4):616–625. doi:10.1089/cmb.2004.11.616
16. Comet JP, Aude JC, Glemet E, Risler JL, Henaut A, Slonimski PP, Codani JJ: Significance of Z-value statistics of Smith-Waterman scores for protein alignments. Comput Chem 1999, 23(3–4):317–331. doi:10.1016/S0097-8485(99)00008-X
17. Bastien O, Aude JC, Roy S, Marechal E: Fundamentals of massive automatic pairwise alignments of protein sequences: theoretical significance of Z-value statistics. Bioinformatics 2004, 20(4):534–537. doi:10.1093/bioinformatics/btg440
18. Chen Z: Assessing sequence comparison methods with the average precision criterion. Bioinformatics 2003, 19(18):2456–2460. doi:10.1093/bioinformatics/btg349
19. Gribskov M, Robinson NL: Use of receiver operating characteristic (ROC) analysis to evaluate sequence matching. Comput Chem 1996, 20:25–33. doi:10.1016/S0097-8485(96)80004-0
20. Kester AD, Buntinx F: Meta-analysis of ROC curves. Med Decis Making 2000, 20(4):430–439.
21. Schaffer AA, Aravind L, Madden TL, Shavirin S, Spouge JL, Wolf YI, Koonin EV, Altschul SF: Improving the accuracy of PSI-BLAST protein database searches with composition-based statistics and other refinements. Nucleic Acids Res 2001, 29(14):2994–3005. doi:10.1093/nar/29.14.2994
22. Salton G: Developments in automatic text retrieval. Science 1991, 253:974–980. doi:10.1126/science.253.5023.974
23. Brenner SE, Koehl P, Levitt M: The ASTRAL compendium for protein structure and sequence analysis. Nucleic Acids Res 2000, 28(1):254–256. doi:10.1093/nar/28.1.254
24. Park J, Karplus K, Barrett C, Hughey R, Haussler D, Hubbard T, Chothia C: Sequence comparisons using multiple sequences detect three times as many remote homologues as pairwise methods. J Mol Biol 1998, 284(4):1201–1210. doi:10.1006/jmbi.1998.2221
25. ASTRAL SCOP release 1.65 [http://astral.berkeley.edu/scopseq1.65.html]
26. Apweiler R, Bairoch A, Wu CH, Barker WC, Boeckmann B, Ferro S, Gasteiger E, Huang H, Lopez R, Magrane M, Martin MJ, Natale DA, O'Donovan C, Redaschi N, Yeh LS: UniProt: the Universal Protein knowledgebase. Nucleic Acids Res 2004, 32(Database issue):D115–D119. doi:10.1093/nar/gkh131
27. Agarwal P, States DJ: Comparative accuracy of methods for protein sequence similarity search. Bioinformatics 1998, 14(1):40–47. doi:10.1093/bioinformatics/14.1.40
28. Pearson WR: Comparison of methods for searching protein sequence databases. Protein Sci 1995, 4(6):1145–1160.
29. Henikoff S, Henikoff JG: Amino acid substitution matrices from protein blocks. Proc Natl Acad Sci U S A 1992, 89(22):10915–10919. doi:10.1073/pnas.89.22.10915
30. Reese JT, Pearson WR: Empirical determination of effective gap penalties for sequence comparison. Bioinformatics 2002, 18(11):1500–1507. doi:10.1093/bioinformatics/18.11.1500
31. Price GA, Crooks GE, Green RE, Brenner SE: Statistical evaluation of pairwise protein sequence comparison with the Bayesian bootstrap. Bioinformatics 2005, 21(20):3824–3831. doi:10.1093/bioinformatics/bti627
32. Supplementary data [http://www.cmbi.ru.nl/~timhulse/ezcomp/]
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.