Island method for estimating the statistical significance of profile-profile alignment scores
Aleksandar Poleksic
DOI: 10.1186/1471-2105-10-112
© Poleksic; licensee BioMed Central Ltd. 2009
Received: 25 December 2008
Accepted: 20 April 2009
Published: 20 April 2009
Abstract
Background
In the last decade, a significant improvement in detecting remote similarity between protein sequences has been made by utilizing alignment profiles in place of amino acid strings. Unfortunately, no analytical theory is available for estimating the significance of a gapped alignment of two profiles. Many experiments suggest that the distribution of local profile-profile alignment scores is of the Gumbel form. However, estimating distribution parameters by random simulations turns out to be computationally very expensive.
Results
We demonstrate that the background distribution of profile-profile alignment scores heavily depends on the profiles' composition and thus the distribution parameters must be estimated independently, for each pair of profiles of interest. We also show that accurate estimates of statistical parameters can be obtained using the "island statistics" for profile-profile alignments.
Conclusion
The island statistics can be generalized to profile-profile alignments to provide an efficient method for alignment score normalization. Since multiple island scores can be extracted from a single comparison of two profiles, the island method has a clear speed advantage over the direct shuffling method for comparable accuracy in parameter estimates.
Background
The statistical significance of a local alignment score between two sequences of amino acid letters can be assessed by analyzing the background distribution of alignment scores between random sequences. For Smith-Waterman alignments [1] lacking gaps, it has been well established that the background score distribution is approximately Gumbel [2], specified by two analytically computable parameters λ and K [3–6].
Assessing score statistics for profile-based alignments is a much more challenging problem. In order to quickly estimate the significance of a database match, the HMMER method (Eddy, 1997) precomputes extreme value distribution parameters for each hidden Markov model (HMM) in the profile library. These model-dependent parameters are calculated by aligning and scoring a given HMM against thousands of real or random sequences. PSI-BLAST estimates score significance "on the fly", by rescaling the residue scores within each profile column to the same scale as the scores specified in the BLOSUM62 matrix [7]. The assumption is that, after rescaling, the background distribution of PSI-BLAST scores will be the same as the distribution of the gapped BLAST scores. Many experiments suggest that this hypothesis is valid and that the rescaling technique yields accurate p-values.
The assessment of the statistical significance of profile-profile scores is still an unsolved problem. In lieu of a rigorous analytical theory, many profile-profile algorithms resort to Z-score statistics [8, 9]. For sequence-only methods, the Z-value of an alignment score between two sequences is computed by comparing the first sequence with randomly shuffled versions of the second sequence. An advantage of Z-values is that they eliminate sequence length and compositional bias, since shuffling a sequence preserves these two variables. However, there are certain disadvantages to using raw Z-scores to rank the significance of alignment scores. First, Z-score statistics make a false assumption about the Gaussian form of the underlying score distribution. A reader interested in the magnitude of the error introduced by this assumption is referred to [10]. Second, Z-scores do not provide the probability that an alignment score could be obtained by chance.
Nevertheless, Z-values can be made very useful for computing accurate p-values via a "change of variable" technique [11]. More specifically, it has been shown that if the raw alignment scores follow a standard Gumbel law, then the p-values of the associated Z-scores are free of sequence length and amino acid composition biases [12, 13]. Since the only drawback of this approach is the computational expense associated with random simulations, it would be very interesting to see whether the "change of variable" approach can be used in other settings.
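The "change of variable" idea can be illustrated with a short sketch. A Gumbel variable with location μ and scale 1/λ has mean μ + γ/λ (γ is the Euler-Mascheroni constant) and standard deviation π/(λ√6), so under the Gumbel assumption the Z-score alone determines the p-value. The function names below are illustrative and are not taken from [11–13].

```python
import math
import statistics

EULER_GAMMA = 0.5772156649015329

def z_score(score, shuffled_scores):
    """Z-score of `score` relative to scores obtained against shuffled sequences."""
    mean = statistics.fmean(shuffled_scores)
    sd = statistics.stdev(shuffled_scores)
    return (score - mean) / sd

def gumbel_pvalue_from_z(z):
    """P(Z >= z), assuming the raw alignment scores follow a Gumbel law."""
    exponent = -math.pi / math.sqrt(6.0) * z - EULER_GAMMA
    # -expm1(-t) equals 1 - exp(-t) but keeps precision for tiny p-values
    return -math.expm1(-math.exp(exponent))
```

Because λ and μ cancel in the standardization, the resulting p-value depends on the shuffled scores only through their mean and standard deviation.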
Recently, an interesting approach to alignment score normalization has been described that uses the so-called Shared Amount of Information (SAI) between amino acids [12]. The model proposed in [12] is unique in that it is derived from reliability theory applied to sequences of amino acids.
To date, the studies on score normalization for local profile-profile alignments have been limited to some specific alignment scoring schemes. For example, an explicit generalization of the techniques implemented in PSI-BLAST has been successfully used in the COMPASS algorithm [14]. However, the method described by Sadreyev and Grishin works only in the context of the COMPASS scoring function. The statistical significance of alignment scores produced by the LAMA method is estimated using an approach based on Fisher's combining method [15]. In HHSEARCH [16], the profile-specific parameters were computed by comparing each profile to the set of profiles built for the representative sequences in the SCOP database [17] (SCOP folds). The alignment scores obtained by PROF_SIM [18], STRUCTFAST [19], and UNIFOLD [20] were also shown to follow the extreme value distribution, but the distribution parameters in these methods must be precalculated using a computationally expensive curve-fitting procedure. This approach is commonly referred to as the "direct method". In the direct method, thousands of optimal alignment scores between real or random profiles are usually needed for moderately accurate estimates of the distribution parameters. Because profile-profile methods are themselves computationally very expensive, the direct method is too slow for parameter estimation, in particular for deriving the score statistics "on the fly" for each given pair of profiles.
Here, we study a generalization of the well-known island method [21, 22] to the score normalization problem for profile-profile alignments. The island method uses the scores of local alignment "islands" obtained by a simple modification of the dynamic programming matrix. Since multiple island scores can be computed from a single path graph, the island method has a distinct speed advantage over the direct method.
Methods
The statistical theory
The statistical significance of an alignment score is usually expressed by the score's p-value. The p-value of a score x is defined as the probability of obtaining a score of at least x purely by chance, given the probabilistic models for the sequences and the alignment scoring scheme. For ungapped local alignments of two random sequences of lengths m and n, the p-value is well approximated by the Gumbel form [3]
$$P(S \geq x) \approx 1 - \exp\left(-K m n\, e^{-\lambda x}\right) \qquad (1)$$
The analytically computable parameters λ and K depend on the background probabilities of amino acid letters and the residue-residue substitution scores specified in the mutation matrix.
There is plenty of evidence suggesting that equation 1 still holds for alignments with gaps [23–28], as well as for profile-sequence and profile-profile alignments [7, 18, 29]. However, for these methods, λ and K must be estimated from random simulations rather than computed analytically [3, 18, 28, 8]. We note that precise estimates of λ are particularly important since the p-value is a doubly exponential function of λ. We also note that, in contrast to local alignment scores, the scores of global sequence-sequence alignments have been shown to approximately follow a three-parameter gamma distribution function [31]. For global alignment statistics, the computational complexity is still an open problem.
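The sensitivity of the p-value to λ can be seen in a small sketch. It assumes that equation 1 has the usual Karlin-Altschul form P(S ≥ x) ≈ 1 − exp(−K m n e^{−λx}); the function name and the numerical values in the usage note are illustrative only.

```python
import math

def gumbel_pvalue(x, m, n, lam, K):
    """P(S >= x) for a comparison of size m x n, assuming the Gumbel form.

    lam and K are the statistical parameters (computed analytically for
    ungapped alignments, estimated by simulation otherwise).
    """
    expected_hits = K * m * n * math.exp(-lam * x)   # the E-value
    return -math.expm1(-expected_hits)               # 1 - exp(-E), stable for small E
```

For example, with the hypothetical values x = 80, m = n = 350 and K = 0.016, moving λ from 0.160 to 0.172 changes the p-value by more than a factor of two, which is why a few percent of error in λ matters.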
Need for composition-based statistics for profile-profile alignments
For alignment methods that use substitution matrices and residue type information (such as BLAST [4] or FASTA [32]), it has been well established that λ and K depend not only upon the alignment scoring system, but also upon the frequencies of amino acid letters in the sequences being aligned. In these methods, λ can vary by more than 10% from one sequence pair to another, due entirely to changes in sequence amino acid composition [21].
Island statistics
To circumvent the computational expense associated with random simulations for sequence-sequence methods, Olsen et al. proposed using the scores of the so-called "alignment islands" [22]. An alignment island is a region in the dynamic programming matrix corresponding to positively scoring segments in two sequences. More precisely, an island is a collection of locally optimal alignments that start at the same cell (anchor cell) in the path graph [21, 22]. The score of an island is defined as the highest score among all local alignment scores for that island.
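The island definition above can be sketched as a small extension of local-alignment dynamic programming: every cell remembers which anchor cell its best path started from, and each island records the highest score seen on any of its cells. This is a minimal illustration, not the implementation of [21, 22]; it assumes a linear gap penalty, breaks ties arbitrarily, and the names `island_peak_scores`, `score` and `gap` are hypothetical.

```python
def island_peak_scores(a, b, score, gap=2):
    """Return the peak score of every island in the local-alignment DP matrix.

    a, b  : sequences (of letters, or of profile columns)
    score : callable returning the substitution score of a[i] against b[j]
    gap   : positive linear gap penalty (an assumption made for brevity)
    """
    n = len(b)
    prev_h = [0] * (n + 1)      # DP scores of the previous row
    prev_id = [-1] * (n + 1)    # island id of each cell, -1 = not on an island
    peaks = []                  # peaks[k] = highest score seen on island k
    for x in a:
        cur_h = [0] * (n + 1)
        cur_id = [-1] * (n + 1)
        for j in range(1, n + 1):
            candidates = (
                (prev_h[j - 1] + score(x, b[j - 1]), prev_id[j - 1]),  # diagonal move
                (prev_h[j] - gap, prev_id[j]),                          # vertical gap
                (cur_h[j - 1] - gap, cur_id[j - 1]),                    # horizontal gap
            )
            best, island = max(candidates, key=lambda t: t[0])
            if best <= 0:
                continue                 # cell stays at 0 and belongs to no island
            if island == -1:             # path starts from a zero cell: new anchor
                island = len(peaks)
                peaks.append(0)
            cur_h[j], cur_id[j] = best, island
            peaks[island] = max(peaks[island], best)
        prev_h, prev_id = cur_h, cur_id
    return peaks
```

A single pass over the matrix therefore yields one score per island rather than only the single optimal score, at the cost of the extra bookkeeping for island ids and peaks.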
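For a cutoff score c, let R_{c} denote the number of islands with peak score ≥ c and let ⟨S⟩_{c} denote the average peak score of those islands. For integer scores, the island estimates of the statistical parameters take the form [21]
$$\hat{\lambda} = \ln\left(1 + \frac{1}{\langle S \rangle_{c} - c}\right), \qquad \hat{K} = \frac{R_{c}\, e^{\hat{\lambda} c}}{B m n}$$
where m and n are the lengths of the random sequences used in each island comparison and B is the total number of sequence comparisons performed to generate the islands [21].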
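As a sketch of how island scores turn into parameter estimates, the function below assumes the estimators of [21] in their usual form for integer scores, λ̂ = ln(1 + 1/(⟨S⟩ − c)) and K̂ = R_c e^{λ̂c}/(B m n), where ⟨S⟩ is the mean peak score of the R_c islands scoring at least c. The function name and argument names are illustrative.

```python
import math

def island_estimates(peak_scores, cutoff, B, m, n):
    """Estimate (lambda, K) from island peak scores at or above `cutoff`.

    B is the number of comparisons that produced the islands and
    m, n are the lengths of the compared sequences or profiles.
    """
    tail = [s for s in peak_scores if s >= cutoff]
    if not tail:
        raise ValueError("no island reaches the cutoff")
    mean_excess = sum(tail) / len(tail) - cutoff
    if mean_excess <= 0:
        raise ValueError("all islands score exactly at the cutoff")
    lam = math.log(1.0 + 1.0 / mean_excess)
    K = len(tail) * math.exp(lam * cutoff) / (B * m * n)
    return lam, K
```

Raising the cutoff discards low-scoring islands, which reduces bias but also reduces the number of data points available for the estimate.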
We note that the island method is similar to the "declumping" method of Waterman and Vingron [26, 27], but is much faster because, unlike clumps, the islands and their scores can be collected with a minor modification of the Smith-Waterman algorithm [22]. Several applications have recently been developed that incorporate island statistics for score normalization, including CTX-BLAST [34], ConSequenceS [35], and CIS [36].
An added benefit of the island statistics (and other score normalization methods based on sequence shuffling) is flexibility in choosing the scoring system. In order to be amenable to island statistics, the only requirement a method needs to satisfy is that the alignments it generates stay in the local regime, i.e. that the distribution of alignment scores between random sequences (profiles) is approximately Gumbel. Therefore, since the procedure for computing statistical parameters does not change with changes to the scoring function, one can focus entirely on improvements to the scoring scheme. This is important, because incorporating additional information into the alignment process, such as compositionally adjusted background frequencies [20, 37, 38] or protein secondary structure information [9, 39], is known to significantly increase the sensitivity of an alignment method [9, 16].
Results and discussion
The island statistics for profile-profile alignments
The alignment score significance can be assessed using either real or random profiles [40]. We use random profiles to avoid bias in the results toward any particular group of proteins. A random profile of length n is obtained by sampling n profile columns at random from the collection of profiles computed for ~2,500 representative sequences from the FSSP database (FSSP family representatives). The database of FSSP profiles is generated by running three PSI-BLAST iterations on each FSSP sequence and parsing amino acid letter frequencies from the corresponding PSI-BLAST checkpoint files.
We study the applicability of the island statistics on four popular and well-tested profile-profile scoring schemes: JensenShannon (implemented in the PROF_SIM method [18]), CrossProduct (PRALINE [39]), WeightedLogOdds (COMPASS [14]), and Multinomial (UNIFOLD [20]). The definition of each scoring function is given in the appendix. The column-column scores in all four methods are scaled (multiplied by constant factors) so that the alignment score distributions have similar parameters.
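The construction of random profiles can be sketched as follows. The sketch assumes that a profile is simply a list of columns (for example, tuples of amino acid frequencies) and that columns are drawn uniformly with replacement; both are assumptions, since the text does not specify the data layout or the sampling details. The function names are hypothetical.

```python
import random

def random_profile(profile_library, length, rng=random):
    """Sample `length` columns at random from all columns of all library profiles."""
    pool = [column for profile in profile_library for column in profile]
    return [rng.choice(pool) for _ in range(length)]

def shuffled(profile, rng=random):
    """Return a copy of `profile` with its columns in random order."""
    columns = list(profile)
    rng.shuffle(columns)
    return columns
```

Passing a seeded `random.Random` instance as `rng` makes a simulation reproducible.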
To establish a link between the statistics of peak island scores and optimal alignment scores, we compare, for a range of cutoff values c, the observed number of islands with scores ≥ c with the expected number of such islands computed from the best-fit extreme-value distribution. The expected number of islands is defined as $\hat{K} B m n\, e^{-\hat{\lambda} c}$, where $\hat{\lambda}$ and $\hat{K}$ are parameters obtained with the direct method. More specifically, $\hat{\lambda}$ and $\hat{K}$ are the maximum likelihood estimates of the parameters in equation 2, obtained from the scores of (globally) optimal local alignments between profile shuffles. For more on the maximum likelihood estimates of statistical parameters, the reader is referred to [41].
The two statistics obviously differ for low-scoring islands (Figure 4). As argued before [21, 22], the low-scoring islands often correspond to ungapped alignments of only a few profile positions, and therefore the scores of those islands follow a different distribution, namely the distribution of gapless alignment scores.
The plots in Figure 4 show a faster decay in the number of islands with score ≥ c for profiles of size 350 × 350 than for profiles of size 1500 × 1500. We note that the apparent λ for each comparison in Figure 4 is equal to −k, where k denotes the slope of the set of data points. For sequence-only alignments, this dependence of the apparent λ on sequence length is due to the "edge effect", which arises because the length of the longest island, and hence its associated score, is limited by the lengths of the sequences [21]. Thus, if the variance in slopes for profile-profile methods seen in Figure 4 is also due to the edge effect, one would expect to observe a larger difference in slopes for methods that generate longer alignments. Indeed, our analysis of alignments generated by the four methods in our study demonstrates that the variance in λ for small and large comparisons seen in Figure 4 scales proportionally with the average alignment length generated by each method (30 for WeightedLogOdds, 47 for CrossProduct, 37 for JensenShannon, and 35 for Multinomial).
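For comparison, the direct method fits a Gumbel law to a sample of optimal alignment scores by maximum likelihood. The sketch below solves the standard likelihood equation for the scale 1/λ by bisection; it is an illustration under that standard formulation, not the procedure of [41], and the function name is hypothetical.

```python
import math

def fit_gumbel_ml(scores, iterations=60):
    """Maximum likelihood fit of a Gumbel law; returns (lam, mu).

    Solves beta = mean(x) - sum(x*w)/sum(w), w = exp(-x/beta), for beta = 1/lam.
    Assumes the scores are not all identical.
    """
    n = len(scores)
    mean = sum(scores) / n
    low = min(scores)

    def weights(beta):
        # shifting by the minimum keeps every weight in (0, 1]
        return [math.exp(-(x - low) / beta) for x in scores]

    def f(beta):
        w = weights(beta)
        weighted_mean = sum(x * wi for x, wi in zip(scores, w)) / sum(w)
        return beta - (mean - weighted_mean)

    lo, hi = 1e-6, (max(scores) - low) + 1.0   # f(lo) < 0 < f(hi)
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if f(mid) > 0:
            hi = mid
        else:
            lo = mid
    beta = 0.5 * (lo + hi)
    mu = low - beta * math.log(sum(weights(beta)) / n)
    return 1.0 / beta, mu
```

Under the Gumbel p-value form 1 − exp(−K m n e^{−λx}), the location μ corresponds to ln(K m n)/λ, so K can be recovered as exp(lam * mu) / (m * n). Each data point here costs one full alignment of two shuffled profiles, which is what makes the direct method slow.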
Island estimates of λ and K
c  R_{c}  λ  SE  K 
20  18942541  0.1924  0.02%  0.0395 
21  15451236  0.1899  0.03%  0.0371 
22  12673131  0.1882  0.03%  0.0354 
23  10416563  0.1866  0.03%  0.0339 
24  8557899  0.1846  0.03%  0.0320 
25  7041327  0.1825  0.04%  0.0300 
26  5794787  0.1800  0.04%  0.0278 
27  4796202  0.1782  0.05%  0.0262 
28  3981201  0.1767  0.05%  0.0249 
29  3312692  0.1753  0.05%  0.0238 
30  2761460  0.1740  0.06%  0.0227 
31  2307980  0.1730  0.07%  0.0219 
32  1931516  0.1720  0.07%  0.0211 
33  1618724  0.1712  0.08%  0.0204 
34  1358277  0.1704  0.09%  0.0198 
35  1141702  0.1697  0.09%  0.0193 
36  960448  0.1692  0.10%  0.0188 
37  809392  0.1688  0.11%  0.0185 
38  681757  0.1683  0.12%  0.0181 
39  575054  0.1679  0.13%  0.0179 
40  484923  0.1675  0.14%  0.0175 
41  409305  0.1671  0.16%  0.0172 
42  345792  0.1668  0.17%  0.0169 
43  292455  0.1666  0.18%  0.0168 
44  247162  0.1663  0.20%  0.0166 
45  209396  0.1664  0.22%  0.0167 
46  177245  0.1664  0.24%  0.0166 
47  149811  0.1661  0.26%  0.0163 
48  127130  0.1664  0.28%  0.0166 
49  107539  0.1663  0.30%  0.0165 
50  91004  0.1661  0.33%  0.0164 
51  77013  0.1660  0.36%  0.0163 
52  65225  0.1660  0.39%  0.0163 
53  55132  0.1656  0.43%  0.0159 
54  46798  0.1659  0.46%  0.0162 
55  39719  0.1663  0.50%  0.0166 
56  33618  0.1662  0.55%  0.0165 
57  28394  0.1657  0.59%  0.0160 
58  24094  0.1660  0.64%  0.0163 
59  20398  0.1659  0.70%  0.0161 
60  17282  0.1659  0.76%  0.0162 
61  14634  0.1659  0.83%  0.0162 
62  12430  0.1663  0.90%  0.0166 
63  10497  0.1659  0.98%  0.0161 
64  8837  0.1647  1.06%  0.0149 
65  7579  0.1667  1.15%  0.0172 
66  6407  0.1665  1.25%  0.0169 
67  5416  0.1662  1.36%  0.0165 
68  4591  0.1662  1.48%  0.0165 
69  3892  0.1663  1.60%  0.0167 
70  3280  0.1654  1.75%  0.0156 
71  2786  0.1658  1.89%  0.0160 
72  2367  0.1663  2.06%  0.0167 
73  1985  0.1645  2.24%  0.0145 
74  1695  0.1656  2.43%  0.0158 
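Since the number of islands with score ≥ c decays roughly as e^{−λc}, the apparent λ can be read off as minus the slope of ln R_c plotted against c. The sketch below applies an ordinary least-squares fit to five rows copied from the table above; the choice of rows and the function name are illustrative.

```python
import math

# (cutoff c, number of islands R_c) copied from five rows of the table above
ROWS = [(50, 91004), (55, 39719), (60, 17282), (65, 7579), (70, 3280)]

def apparent_lambda(rows):
    """Negative least-squares slope of ln(R_c) against c."""
    xs = [c for c, _ in rows]
    ys = [math.log(r) for _, r in rows]
    x_bar = sum(xs) / len(xs)
    y_bar = sum(ys) / len(ys)
    covariance = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    variance = sum((x - x_bar) ** 2 for x in xs)
    return -covariance / variance
```

For these rows the fitted value lies between 0.16 and 0.17, in line with the λ column of the table at the same cutoffs.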
Speed vs. accuracy
There are two types of errors that can occur when computing the statistical parameters using random simulations. The first error, called "bias", represents the difference between the estimated and "true" statistical parameters. The second error is the standard error, which, unlike the bias, can be controlled by the number of data points used in parameter estimation. More specifically, the relative standard error in $\hat{\lambda}$ is $1/\sqrt{R}$ for the island method and $0.78/\sqrt{R}$ for the direct method [21], where R denotes the number of data points, i.e. the number of island scores above the cutoff and the number of optimal alignment scores, respectively.
Both the direct and the island method suffer from bias in the estimates of the statistical parameters. As seen in Figure 6, the bias of the island method is closely related to the island cutoff score. Similarly, the direct method tends to overestimate λ because no threshold is applied to the optimal alignment scores. The maximum likelihood estimates of distribution parameters obtained with the direct method depend most strongly on the low-scoring data points, because of the steep decrease of the left tail of the extreme value distribution. Therefore, the extent of bias for the direct method is proportional to the fraction of low-scoring optimal alignments used for parameter estimation.
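The standard-error relations can be turned into a quick calculation of how many data points each method needs. The sketch reads the two formulas as relative standard errors of 1/√R (island) and 0.78/√R (direct), which agrees with the SE column of the island table above (for example, R = 1695 gives about 2.43%). The function names are illustrative.

```python
import math

SE_FACTOR = {"island": 1.0, "direct": 0.78}

def relative_se(num_points, method="island"):
    """Relative standard error of the lambda estimate from `num_points` scores."""
    return SE_FACTOR[method] / math.sqrt(num_points)

def points_needed(target_se, method="island"):
    """Smallest number of scores giving a relative standard error <= target_se."""
    return math.ceil((SE_FACTOR[method] / target_se) ** 2)
```

For a target of 0.78%, `points_needed(0.0078, "island")` gives 16437 islands, whereas the direct method needs roughly 10,000 optimal alignment scores, i.e. about 64% fewer data points but one full alignment per point.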
We note that the biases of the direct and island methods can be computed (and compared) for local alignments of single sequences, due to the availability of an experimentally verified "best estimate" of the asymptotic λ [21]. Using the "best estimate" of λ as the reference point, Altschul and coworkers were able to find a threshold island score that eliminates all cutoff-based bias for large-size comparisons of random sequences. By considering only the islands with peak scores over the threshold, they computed accurate, sequence-length-specific estimates of λ, and used these estimates as gold standards to assess the extent of bias for both methods [21].
Unfortunately, it would be difficult to perform a similar experiment in our setting because of the dependence of the statistical parameters on the profiles' composition and because of the computational complexity of profile-profile methods. Thus, instead of comparing the bias side by side, we focus our attention on measuring the difference between the island and direct method estimates of λ and on comparing the computational efficiencies of the two methods.
The speed advantage of the island method is due to its ability to generate multiple data points in a single comparison of two shuffled profiles. However, the average number of islands per pair of shuffled profiles does not directly translate into the speed advantage of the island method. First, for the same standard error in $\hat{\lambda}$, the island method needs to generate 64% more data points than the direct method. Second, a single comparison of two profiles with the island method is computationally more expensive than the same comparison with the direct method, since the island method needs to keep track of the islands and their peak scores. Our implementation of the dynamic programming engine for the island method is ~1.5 times slower than the procedure that only returns an optimal alignment score. Taking those two factors into consideration, the total speed advantage of the island method is about A_{c}/2.4, where A_{c} denotes the average number of islands with peak scores ≥ c collected in a single comparison of two shuffled profiles. We note that our results are identical to previously reported results for sequence-sequence alignments [21].
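The two penalty factors combine multiplicatively, which a one-line helper makes explicit. The default values come from the text (a factor of 1/0.78² ≈ 1.64 more data points, and a dynamic programming engine ~1.5 times slower); the function name and the example value of A_c are hypothetical.

```python
def island_speedup(islands_per_comparison, extra_points=1 / 0.78 ** 2, dp_slowdown=1.5):
    """Approximate speed advantage of the island method over the direct method.

    islands_per_comparison : A_c, average number of islands with peak score >= c
    extra_points           : factor of additional data points for equal standard error
    dp_slowdown            : cost of island bookkeeping relative to plain alignment
    """
    return islands_per_comparison / (extra_points * dp_slowdown)
```

With these defaults the denominator is about 2.47, close to the factor of 2.4 quoted above, so a hypothetical A_c of 30 islands per comparison would correspond to a speed advantage of roughly 12.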
Table 2. Running time of the island method and the deviation in λ
Method  2 s (m, n = 350)  4 s (m, n = 350)  8 s (m, n = 350)  16 s (m, n = 1500)  32 s (m, n = 1500)  64 s (m, n = 1500) 
WeightedLogOdds  7%  5%  1%  4%  2%  1% 
CrossProduct  10%  5%  2%  14%  7%  4% 
JensenShannon  8%  4%  2%  4%  3%  2% 
Multinomial  6%  3%  3%  5%  2%  1% 
We emphasize that, by using the direct method estimates as reference points, we do not argue that these estimates are more accurate than the estimates obtained with the island method. In fact, the results of a similar analysis for sequence-only methods [21] suggest that, for comparisons of size ~350 × 350, the bias of the direct method would be about three times larger than the bias of the island method, for the same standard error in $\hat{\lambda}$.
Previous studies of the island statistics for sequence-sequence alignments addressed the speed-accuracy tradeoff by optimizing the island score cutoff c. For the BLOSUM62 matrix and gap opening and extension penalties of 11 and 1, respectively, the cutoff value of c = 28 was found appropriate [21]. Olsen and coworkers suggested the cutoff value of c = 1.3·max{s_{ab}}, where s_{ab} is the score for matching amino acid letters a and b, specified in the substitution matrix [22].
A slightly different interpretation of the results in Table 2 suggests an alternative approach to controlling the speed-accuracy tradeoff for an arbitrary profile-profile scoring scheme and a range of profile lengths. For example, for a pair of profiles of length 350, the JensenShannon scoring scheme, and a standard error of 0.78%, an island estimate of λ that is within 4% of the direct method estimate can be obtained by running the island method for ~4 seconds and computing λ using the 16,437 top-scoring islands (this number of islands yields a standard error in $\hat{\lambda}$ of 0.78%).
Lindahl benchmark
Level  Top 1  Top 5 
Fold  +1.2%  +0.9% 
Superfamily  0.9%  0.5% 
Family  0.7%  +1.1% 
Conclusion
By utilizing the information present in protein families, profile-profile alignment algorithms are often able to detect extremely weak relationships between protein sequences, as evidenced by large-scale benchmarking experiments such as CASP [43], CAFASP [44], and LiveBench [45]. However, estimating the score statistics for profile-profile alignments is a challenging problem. The background distribution of profile-profile alignment scores depends on the profiles' composition and hence the distribution parameters must be estimated independently, for each given pair of profiles.
We study the applicability of the well-known "island method" to profile-profile score normalization. In the island method, the statistical parameters are computed from the top-scoring islands, which can be collected using a simple modification of the Smith-Waterman algorithm. Since multiple high-scoring islands can be extracted from a single path graph, the island method has a distinct speed advantage over the direct method. For some widely used profile-profile scoring schemes, the speed advantage of the island method exceeds an order of magnitude for comparable accuracy in parameter estimates. For larger profiles, a significant speed advantage of the island statistics comes with almost perfect accuracy. This is important, since the direct method, the only other alternative for computing the parameters "on the fly", is computationally prohibitive for large-size comparisons.
Appendix
In the Multinomial scoring function, both c_{1} and c_{2} are set to 1.
Declarations
Acknowledgements
We thank Dr Igor Strugar for critically reading the manuscript and for helpful suggestions.
References
1. Smith TF, Waterman MS: Identification of common molecular subsequences. J Mol Biol 1981, 147: 195–197.
2. Gumbel EJ: Statistics of Extremes. Columbia University Press, New York, NY; 1958.
3. Karlin S, Altschul SF: Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proc Natl Acad Sci USA 1990, 87: 2264–2268.
4. Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol 1990, 215: 403–410.
5. Dembo A, Karlin S, Zeitouni O: Critical phenomena for sequence matching with scoring. Ann Prob 1994, 22: 1993–2021.
6. Karlin S, Dembo A: Limit distributions of maximal segmental score among Markov-dependent partial sums. Adv Appl Prob 1992, 24: 113–140.
7. Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 1997, 25: 3389–3402.
8. Rychlewski L, Jaroszewski L, Li W, Godzik A: Comparison of sequence profiles. Strategies for structural predictions using sequence information. Protein Science 2000, 9: 232–241.
9. Ginalski K, Pas J, Wyrwicz LS, von Grotthuss M, Bujnicki JM, Rychlewski L: ORFeus: Detection of distant homology using sequence profiles and predicted secondary structure. Nucleic Acids Res 2003, 31: 3804–7.
10. Hulsen T, de Vlieg JAM, Leunissen JMA, Groenen P: Testing statistical significance scores of sequence comparison methods with structure similarity. BMC Bioinformatics 2006, 7: 444.
11. Bastien O, Maréchal E: Evolution of biological sequences implies an extreme value distribution of type I for both global and local pairwise alignment scores. BMC Bioinformatics 2008, 9: 332.
12. Bastien O: A Simple Derivation of the Distribution of Pairwise Local Protein Sequence Alignment Scores. Evol Bioinform Online 2008, 4: 41–45.
13. Pearson WR: Empirical statistical estimates for sequence similarity searches. J Mol Biol 1998, 276: 71–84.
14. Sadreyev RI, Grishin NV: COMPASS: A tool for comparison of multiple protein alignments with assessment of statistical significance. J Mol Biol 2003, 326: 317–336.
15. Frenkel-Morgenstern M, Voet H, Pietrokovski S: Enhanced statistics for local alignment of multiple alignments improves prediction of protein function and structure. Bioinformatics 2005, 21: 2950–6.
16. Söding J: Protein homology detection by HMM-HMM comparison. Bioinformatics 2005, 21: 951–60.
17. Murzin AG, Brenner SE, Hubbard T, Chothia C: SCOP: a structural classification of proteins database for the investigation of sequences and structures. J Mol Biol 1995, 247: 536–540.
18. Yona G, Levitt M: Within the twilight zone: A sensitive profile-profile comparison tool based on information theory. J Mol Biol 2001, 315: 1257–1275.
19. Debe DA, Danzer JF, Goddard WA, Poleksic A: STRUCTFAST: protein sequence remote homology detection and alignment using novel dynamic programming and profile-profile scoring. Proteins 2006, 64: 960–7.
20. Poleksic A, Fienup M: Optimizing the size of the sequence profiles to increase the accuracy of protein sequence alignments generated by profile-profile algorithms. Bioinformatics 2008, 24: 1145–53.
21. Altschul SF, Bundschuh R, Olsen R, Hwa T: The estimation of statistical parameters for local alignment score distributions. Nucleic Acids Res 2001, 29: 351–61.
22. Olsen R, Bundschuh R, Hwa T: Rapid assessment of extremal statistics for gapped local alignment. In Proceedings of the Seventh International Conference on Intelligent Systems for Molecular Biology. Edited by: Lengauer T, Schneider R, Bork P, Brutlag D, Glasgow J, Mewes HW, Zimmer R. AAAI Press, Menlo Park, CA; 1999:211–222.
23. Smith TF, Waterman MS, Burks C: The statistical distribution of nucleic acid similarities. Nucleic Acids Research 1985, 13: 645–656.
24. Collins JF, Coulson AFW, Lyall A: The significance of protein sequence similarities. Comput Appl Biosci 1988, 4: 67–71.
25. Mott R: Maximum likelihood estimation of the statistical distribution of Smith-Waterman local sequence similarity scores. Bull Math Biol 1992, 54: 59–75.
26. Waterman MS, Vingron M: Sequence comparison significance and Poisson approximation. Stat Sci 1994, 9: 367–381.
27. Waterman MS, Vingron M: Rapid and accurate estimates of statistical significance for sequence database searches. Proc Natl Acad Sci USA 1994, 91: 4625–4628.
28. Altschul SF, Gish W: Local alignment statistics. Methods Enzymol 1996, 266: 460–480.
29. Eddy SR: A Probabilistic Model of Local Sequence Alignment That Simplifies Statistical Significance Estimation. PLoS Comput Biol 2008, 4: e1000069.
30. Mott R: Accurate formula for P-values of gapped local sequence and profile alignments. J Mol Biol 2000, 300: 649–59.
31. Pang H, Tang J, Chen SS, Tao S: Statistical distributions of optimal global alignment scores of random protein sequences. BMC Bioinformatics 2005, 6: 257.
32. Pearson WR, Lipman DJ: Improved tools for biological sequence comparison. Proc Natl Acad Sci USA 1988, 85: 2444–2448.
33. Holm L, Ouzounis C, Sander C, Tuparev G, Vriend G: A database of protein structure families with common folding motifs. Protein Sci 1992, 1: 1691–1698.
34. Gambin A, Wojtalewicz P: CTX-BLAST: context sensitive version of protein BLAST. Bioinformatics 2007, 23: 1686–8.
35. Przybylski D, Rost B: Powerful fusion: PSI-BLAST and consensus sequences. Bioinformatics 2008, 24: 1987–1993.
36. Poleksic A, Danzer JF, Hambly K, Debe DA: Convergent Island Statistics: a fast method for determining local alignment score significance. Bioinformatics 2005, 21: 2827–31.
37. Yu YK, Wootton JC, Altschul SF: The compositional adjustment of amino acid substitution matrices. Proc Natl Acad Sci USA 2003, 100: 15688–93.
38. Yu YK, Altschul SF: The construction of amino acid substitution matrices for the comparison of proteins with nonstandard compositions. Bioinformatics 2005, 21: 902–11.
39. Heringa J: Computational methods for protein secondary structure prediction using multiple sequence alignments. Curr Protein Pept Sci 2000, 1: 273–301.
40. Sadreyev RI, Grishin NV: Accurate statistical model of comparison between multiple sequence alignments. Nucleic Acids Res 2008, 36: 2240–8.
41. Lawless JF: Statistical models and methods for lifetime data. Wiley, New York, NY; 1982:141–202.
42. Lindahl E, Elofsson A: Identification of related proteins on family, superfamily and fold level. J Mol Biol 2000, 295: 613–625.
43. Moult J, Fidelis K, Kryshtafovych A, Rost B, Hubbard T, Tramontano A: Critical assessment of methods of protein structure prediction-Round VII. Proteins 2007, 69(Suppl 8):3–9.
44. Fischer D, Rychlewski L, Dunbrack RL Jr, Ortiz AR, Elofsson A: CAFASP3: the third critical assessment of fully automated structure prediction methods. Proteins 2003, 53(Suppl 6):503–516.
45. Rychlewski L, Fischer D: LiveBench-8: the large-scale, continuous assessment of automated protein structure prediction. Protein Sci 2005, 14: 240–245.
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.