Statistical distributions of optimal global alignment scores of random protein sequences

Background The inference of homology from statistically significant sequence similarity is a central issue in sequence alignments. So far the statistical distribution function underlying the optimal global alignments has not been completely determined. Results In this study, random and real but unrelated sequences prepared in six different ways were selected as reference datasets to obtain their respective statistical distributions of global alignment scores. All alignments were carried out with the Needleman-Wunsch algorithm and optimal scores were fitted to the Gumbel, normal and gamma distributions respectively. The three-parameter gamma distribution performs the best as the theoretical distribution function of global alignment scores, as it agrees perfectly well with the distribution of alignment scores. The normal distribution also agrees well with the score distribution frequencies when the shape parameter of the gamma distribution is sufficiently large, for this is the scenario when the normal distribution can be viewed as an approximation of the gamma distribution. Conclusion We have shown that the optimal global alignment scores of random protein sequences fit the three-parameter gamma distribution function. This would be useful for the inference of homology between sequences whose relationship is unknown, through the evaluation of gamma distribution significance between sequences.


Background
Sequence alignment is a central problem in computational molecular biology. Every branch of molecular biology-from database search, phylogeny reconstruction to protein structure prediction-takes sequence alignments as its foundation. The functional and/or structural properties of a new sequence could be predicted from its degree of similarity with some clearly defined and known sequences. If the similarity between two sequences is statistically significant and does not simply stem from sequence repeats of low complexity, then the two sequences are likely to be homologous and thus may have similar functions and/or structures.
To understand whether an observed sequence similarity implies indeed a functional or evolutionary link, or is just a chance event is the central problem for the evaluation of the statistical significance of sequence alignment scores. The basic question to be answered is: what is the probability that a similarity score as great as that actually observed in a comparison between real sequences could have arisen by chance, when sampling from suitably-defined populations of random unrelated sequences? This question is generally addressed by analyzing the distribution of optimal alignment scores from random or real but unrelated sequences [1], which is the approach applied in this research.
Accurate statistical estimate for similarity searches for local alignments has been studied comprehensively [1][2][3][4], and the Gumbel distribution is used to estimate the statistical significance for local alignments [5]. However, we still lack a complete theoretical solution to the optimal global alignment between sequences, due to the fact that global alignment scores grow proportionally to the length of the sequences and small changes in the scoring system can produce a different alignment [6].
Abagyan and Batalov suggested that global alignment scores between unrelated protein sequences followed the Gumbel distribution [7]. However, since the scoring system that they used favoured local alignments, these alignments they produced may not be global but local. Dayhoff et al (1978) and Dayhoff et al(1983) evaluated global alignment scores for randomized sequences (maintaining the amino acid composition and sequence length of the real sequences) using their log-odds scoring matrix at PAM250 and a constant gap penalty. The distribution of the resulting scores matched a normal distribution [8,9]. Webber and Barton used sequences belonging to different folds of the SCOP database whose percent identity to each other is less than 40 to analyze the distribution of global alignment z-scores between sequences [10]. They found that the gamma distribution describes the distribution of z-scores better than either the normal or Gumbel distribution.
The determination of homolog is also affected by the reference datasets used for statistical estimation. Lipman et al analyzed the distribution of scores among 100 vertebrate nucleic acid sequences and compared these scores with randomized sequences prepared in different ways [11]. When the randomized sequences were prepared by shuffling the sequence to conserve base composition, the standard deviation was approximately one-third less than the distribution of scores of the natural sequences, thus leading to an overestimate of the significance if such randomized sequences were used for a significant test. When sequences were locally shuffled for randomization, the standard deviation was similar to that of the natural sequences.
In this research, we chose 6 different ways to generate random and real but unrelated protein sequences as reference datasets for significance estimation. Four scoring matrices were applied for global alignments. The PAM120 and PAM250 matrices were selected because they imply different evolutionary time [8], and the BLOSUM50 and BLOSUM62 matrices were selected for their sound performance in database search [12,13]. Most alignments were carried out with the affine gap penalty (see Methods) as it penalizes less for additional gaps, which is more reasonable. The resulting alignment scores were then fitted with three distributions to obtain the statistical characteristic of the global alignment scores.

Derivation of distributions with different sequence sets
We have generated random and real but unrelated sequences in six different ways as reference datasets. The datasets are abbreviated as LAC, LCA, GS, LS, SMP and RUS sequences respectively. The definition of the abbreviations can be found in the List of Abbreviations Used below.
In general the three-parameter gamma distribution performs the best in the goodness of fit test with the distribution of global alignment scores. When the shape parameter of the gamma distribution is sufficiently large, the gamma distribution closely approximates a normal distribution. Thus the normal distribution agrees with the data as well. The Gumbel distribution deviates from the alignment score distribution over the majority of the score frequencies, however its performance gets a little better for the LS sequence set than for the other sequence sets.
The distributions of global alignment scores of the LAC, LCA and GS sequence sets are similar ( Figure 1). The three-parameter gamma distribution defines the empirical distributions extremely well. The fitting quality of the normal distribution is better than that of the Gumbel distribution, but both of them diverged from the majority of the global alignment frequencies.
For the LS sequence sets the majority of the empirical score frequencies disagree with the normal or Gumbel distributions, whereas the three-parameter gamma distribution describes the alignment data extremely well ( Figure  2). It also can be seen that the performance of the Gumbel distribution is better with this sequence set than with the others.
The distribution of alignment scores of the SMP sequence set is quite different. The empirical score distribution of sequences generated with the PAM120 mutation probability matrix and aligned with the PAM120 log-odds scoring matrix is different from those generated with the PAM250 matrix. Both the gamma and normal distributions fit the score frequencies of the former well ( Figure   3), whereas for the latter, the normal distribution disagrees with the majority of the score curve.
Most of the score distributions of the RUS sequence set agree well with the gamma distribution, no matter what the sequence length is. Although there are occasions when no distribution agrees perfectly well with the score distribution, the three-parameter gamma distribution is still the closest to the empirical distribution ( Figure 4).
In database search, it is always the sequences, with higher scores (i.e., tail of the distribution), are of interest. So the score frequencies were also plotted against the natural logarithm of scores at the tail of the distribution ( Figure 5). It shows clearly that the gamma distribution outperforms both the normal and Gumbel distributions even at the tails of those distributions.

The effect of sequence length on the theoretical distribution
We chose the LAC sequence set to analyze the impact of the sequence length on the gamma distribution because the amino acid composition of the LAC sequence set is the same. The result shows that the shape parameter of the fitted gamma distribution increases and scale parameter decreases gradually as the sequence length increases, at the same time the performance of the normal distribution for curve fitting improves slowly ( Table 2 and Table 3).

The difference of window size for local shuffling
The impact of the window size of local shuffling approach on the gamma distribution was also studied ( Table 4). The result shows that when the window size increases, the shape parameter of the fitted gamma distribution decreases and the fitting performance of the normal distribution gets worse.

The impact of scoring scheme
The effects of the four scoring matrices are minor on the distribution of global alignment scores. The only exception is the empirical distribution of the SMP sequence set aligned with the PAM120 log-odds scoring matrix, in which the normal distribution performs as well as the gamma distribution. Dayhoff et al (1978) and Dayhoff et al (1983) evaluated global alignment scores for randomized sequences generated as our LCA sequence set using the PAM250 log-odds scoring matrix and a constant gap penalty [8,9]. The distribution of the resulting scores matched a normal distribution. We tried both constant and affine gap penalty for Distribution of scores from global alignments of the SMP sequence set of 300aa long  Plots of the tail of the global alignment optimal score distribution Figure 5 Plots of the tail of the global alignment optimal score distribution. The score frequencies were plotted against the natural logarithm of scores at the tail of the distribution of Figure 1. The three theoretical distributions were indicated in solid lines. The score distribution was fitted with (A) the three-parameter gamma distribution; (B) the normal distribution; and (C) the Gumbel distribution. the LCA sequence set and found that the distribution of optimal scores of the LCA sequence set agrees better with the gamma distribution than with the normal distribution.

Discussion
Webber and Barton took sequences of different folds with less than 40 percent identity from the SCOP database for global alignments and fitted the z-scores to peak distributions [10]. They found that the gamma distribution describes the alignment scores between different folds better than either the normal or Gumbel distribution. The problem is that the derivation of z-scores requires additional alignments to be calculated, and the zscore carries with it an implicit and possible incorrect assignment of significance by the normal distribution.
The score distribution of the SMP sequence set simulated from the evolution of the ancestor sequence at PAM120 is an exception. The fitted shape parameter of the gamma distribution is very large, and the normal distribution fits with the data equally well. This is because sequences generated with the PAM120 mutation probability matrix are around 40 percent similar with each other, so they can hardly be viewed as random sequences for distributional statistical analysis.
This study specifies three-parameter gamma distribution as the theoretical distribution of global alignment scores of random protein sequences. It could be used for the inference of homology between sequences whose relationship is unknown through significance evaluation. One issue worth exploring further is to formulate a function taking sequence length, scoring scheme and amino acid composition into account so that rapid statistics conclusion is reached.

Conclusion
The global alignment optimal scores have been regarded as normal [7] or Gumbel distributed [8]. We have shown here that the normal distribution holds only for sequences simulated with the PAM120 matrix, while the Gumbel distribution disagrees with the optimal alignment score frequencies for all sequence sets in this research. The study shows that the three-parameter gamma distribution describes the random sequence alignment scores better than the normal or Gumbel distribu-  tion. The normal distribution performs as well as threeparameter gamma distribution when the shape parameter of the gamma distribution is sufficiently large.

Dataset
The SCOP (Structural Classification of Proteins) database [14] provides a detailed and comprehensive description of the structural and evolutionary relationships between all proteins whose structure is known [15]. The classification is on hierarchical levels that embody the evolutionary and structural relationships. The hierarchy of the database from top to bottom is fold, superfamily and family. Proteins that share clear evolutionarily relationship are clustered in families, those that have low sequence identities but whose structural and functional features suggest that a common evolutionary origin is probable are placed together in superfamilies. Proteins are defined as having a common fold if they have the same major secondary structures in the same arrangement and with the same topological connections. The SCOP 1.65 (released on December 2003) was used in this study. It contains 40,452 domains organized in 2,327 families, 1,294 superfamilies and 800 folds. These domains correspond to 20,619 entries in the Protein Data Bank (PDB) [16].
The amino acid compositions of all the sequences in the SCOP 1.65 were calculated as shown in Table 1. The amino acid composition in Table 1 was taken as the average amino acid composition of proteins for random sequence generation.

Sequence randomization approaches 1) Maintaining the sequence length and average amino acid composition (LAC)
Sequences were generated as strings of independent characters where the amino acid in any position is proportional to its composition in proteins (Table 1) with a given sequence length.

2) Maintaining the sequence length and the amino acid composition of the authentic sequences (LCA)
The amino acid compositions of the two authentic sequences were calculated and random sequences were generated retaining both the distributions of the amino acid composition and the sequence length of the original sequences.

3) Global shuffling or permutation (GS)
Each residue in the authentic sequences is randomly repositioned anywhere in the sequence, so that both the amino acid composition and sequence length were maintained.

4) Local shuffling or permutation (LS)
The sequence is broken into n/w windows (n denotes the length of the sequence and w is the length of the window, typically 5-20 residues) and the residues in each window are randomly shuffled. Both the sequence length and the local amino acid composition are retained with this approach. We selected four window sizes-5aa, 10aa, 15aa and 20aa-to compare their effects on the statistical distributions.

5) Simulation of the mutational process (SMP)
To get sequences of a known evolutionary distance, we simulated the mutational process of the ancestor sequence. First, the PAM1 mutation probability matrix is multiplied by itself 120 or 250 times to get the PAM120 and PAM250 mutation probability matrices. The matrices provide information about each amino acid stayed unchanged or substituted by any other one at the given evolutionary distance. Then the fate of each residue in the new sequence was determined according to the PAM120 or PAM250 mutation probability matrix through computer simulation [8].

6) Real but unrelated sequences (RUS)
The SCOP database also provides filtered sub datasets selected with different criteria, such as sequence identity percentage, E-value, or different hierarchy representatives [17]. We chose three subsets, one in which the sequences are less than 10 percent identity to each other, another with sequences whose E-value with one another is greater than 10, and the third with 800 sequences each represents one fold in the SCOP 1.65.
The sequence lengths in the three subsets vary from 23aa to 1504aa. As global alignment between sequences of highly different sequence lengths is not appropriate, we extracted sequences of similar length in each filtered sequence set further. 300 sequences of 50aa, 100aa, 200aa, 300aa and 400aa were extracted respectively from each of the filtered sequence set, and the tails of longer ones were cut off.

Alignment algorithm
We wrote a C program for all the global alignments in this study. The program implements the Needleman-Wunsch dynamic programming algorithm [18] and penalized on end gaps.
For the LAC and SMP sequence sets, two sequences were generated at a time and global alignments were carried out between them, this process was repeated 10,000 times. For the LCA, GS and LS sequence sets the first sequence was aligned with 5,000 randomizations of the second, then vice versa. Global alignments were carried out between every pair of the 300 sequences for the RUS sequence set.

Scoring scheme
As the evolutionary distance of the SMP sequence set is known, the scoring matrix matching the giving evolutionary time was chosen for the alignments. For other sequence sets two matrices, the PAM120 and PAM250 from the PAM series and another two, the BLOSUM50 and BLOSUM62 from the BLOSUM series were employed. The respective gap open/extension penalty combination for the PAM120 is -16/-4, PAM250 -11/ -1, BLOSUM50 -10/-2, BLOSUM62 -7/-1, as recommended by Vingron and Waterman, Mount, and Pearson [6,19,20]. Gap extension penalty is added for the first gap.
We also used constant gap penalty of -10 for the PAM250 matrix for the LAC and LCA sequence sets.

Curve fitting
Optimal global alignment scores of the different sequence sets aligned with different scoring scheme were fitted with the Gumbel, normal and three-parameter gamma distributions respectively. Methods of moments were used for the estimation of the parameters of the optional distributions.
The probability density function of the gamma distribution is given as where γ (γ > 0) is the shape parameter, λ (λ > 0) is the scale parameter, and µ (x -µ ≥ 0) is the location parameter.
The normal distribution is given as where µ is the location parameter and σ (σ > 0) is the scale parameter.
The Gumbel distribution, given as where µ is the location parameter and β (β > 0) is the scale parameter.
The χ 2 goodness of fit test was used to evaluate the fitting result. The degree of freedom for the fitting with gamma distribution is the number of intervals subtracts 4, while for normal and Gumbel distribution is the number of intervals subtracts 3.

Authors' contributions
HP developed the code, tested program performance, analyzed the results, and drafted the manuscript. JT proposed many additional suggestions for improving algorithm performance. SC supported the research and improved the writing. ST conceived of the study and participated in its design and coordination. All authors read and approved the final manuscript.