Accurate statistics for local sequence alignment with position-dependent scoring by rare-event sampling

Background Molecular database search tools need statistical models to assess the significance for the resulting hits. In the classical approach one asks the question how probable a certain score is observed by pure chance. Asymptotic theories for such questions are available for two random i.i.d. sequences. Some effort had been made to include effects of finite sequence lengths and to account for specific compositions of the sequences. In many applications, such as a large-scale database homology search for transmembrane proteins, these models are not the most appropriate ones. Search sensitivity and specificity benefit from position-dependent scoring schemes or use of Hidden Markov Models. Additional, one may wish to go beyond the assumption that the sequences are i.i.d. Despite their practical importance, the statistical properties of these settings have not been well investigated yet. Results In this paper, we discuss an efficient and general method to compute the score distribution to any desired accuracy. The general approach may be applied to different sequence models and and various similarity measures that satisfy a few weak assumptions. We have access to the low-probability region ("tail") of the distribution where scores are larger than expected by pure chance and therefore relevant for practical applications. Our method uses recent ideas from rare-event simulations, combining Markov chain Monte Carlo simulations with importance sampling and generalized ensembles. We present results for the score statistics of fixed and random queries against random sequences. In a second step, we extend the approach to a model of transmembrane proteins, which can hardly be described as i.i.d. sequences. For this case, we compare the statistical properties of a fixed query model as well as a hidden Markov sequence model in connection with a position based scoring scheme against the classical approach. Conclusions The results illustrate that the sensitivity and specificity strongly depend on the underlying scoring and sequence model. A specific ROC analysis for the case of transmembrane proteins supports our observation.


Background
A large amount of molecular biological data is stored in form of sequences of symbols in huge data bases, e.g., DNA sequences or the primary structure of proteins. It is one main task of Bioinformatics [1] to develop algorithms [2] which allow to find, via sequence comparison, for a given "query" sequence the most similar "subject" sequences in a data base. All widely-used algorithms depend on many parameters, the so-called scoring schemes [2], which are usually suitably adopted to test sets of data. Parts of these scoring schemes involve also rules how to deal with small non-similar subsequences, the so-called gaps. The most popular sequence-comparison algorithms are the Smith-Waterman algorithm [3] for pairwise local sequence alignment (which means that the most similar subsequences of two sequences are found) and the Viterbi algorithm for sequence-to-HMM alignment. In the latter case, one specific sequence is compared to a full set of sequences which is specified via a Hidden Markov Model (HMM) [4]. For practical applications, often very fast heuristics are used, which do not find the exact best-matching (sub-)sequences but only approximations of the optimum. The BLAST algorithm [5] is used widely. Sequence-comparison algorithms return a raw similarity score, i.e., a number, that quantifies the similarity between the input objects. Unfortunately, this raw score is hard to interpret because one does not know its absolute scale.
An interpretation becomes possible when we specify a probabilistic null model for the input: Then the similarity score becomes a random variable S whose probabilities Prob(S = s) under the null model can be determined. Sometimes this can be done analytically, but usually one has to apply numerical simulation [6]. The p-value assigned to an observed score s is defined as pval(s): = Prob(S ≥ s) in the null model, and log pval (s) is a measure of surprise (and hence a universally normalized score) for s. The key problem is, of course, to find Prob(S = s) for a given sequence-comparison algorithm, a given scoring scheme, and a given null model.
In this paper, we explain and extend an efficient and generally applicable numerical technique that solves this problem in many different sequence comparison settings, such as for a BLAST-like database search [5] with a fixed query, for position-specific scoring and/or gapcost schemes (essentially HMMs), or for normalized alignment [7]. In each of those settings a variety of null models in addition to the i.i.d. model is possible.
Most of the existing statistical work for pairwise sequence comparison focuses on null models where both sequences are random and at each position a symbol σ Σ is chosen independently of the other positions ("i.i.d. model"), with a given frequency f f f often reflects the average composition of proteins in the UniProt/SwissProt database [8]. An alignment of the two sequences is a set of pairs {(i k , j k )}, which means that symbol of position i k of the first sequence is aligned (or paired) to the symbol at position j k of the second sequence. The pairs must not cross, i.e., if for two pairs (i k , j k ) (i l , j l ) the condition i k <i l holds, then also j k <j l . Positions which are not paired, i.e., which do not appear in the alignment, contribute to the above mentioned gaps. The length of a gap is the number of adjacent ungapped positions. In the following example, where the two sequences are shown such that paired symbols are atop on each other, Scores for individual pairs of symbols are given by a constant (position-independent) symmetric Σ × Σ scoring matrix with negative expected score, such as BLO-SUM62 [9]. The score of an alignment is given by the sum of the scores of the pairs, plus (negative) contributions for the gaps. For the conventional gap-scoring schemes (called "affined"), each gap contributes a score which depends only linearly on its length (times a parameter called gap-extension penalty) plus a constant (gap-open penalty). We shall refer to this model later as "random query -general-purpose scoring" (RQGS).
For gapless pairwise local sequence alignment, the raw score distribution can be derived numerically by Markov chain analysis [10] and also asymptotically for infinite sequences (Karlin-Altschul or Dembo-Karlin statistics [11]): It is an extreme-value distribution (EVD), also called Gumbel distribution [12]: where the parameters λ > 0 and c > 0 depend on the score matrix, on the symbol frequencies f, and on the query and subject sequence lengths L Q and L S . Asymptotically we have c = KL Q L S for a length-independent K > 0.
For gapped pairwise local sequence alignment, which is the most relevant case in database queries there exist no universal analytic results, with the exception of few special cases [13]. Empirical evidence also indicates convergence towards the Gumbel form for long sequences; λ and K additionally depend on the gap-cost function [14]. Several works have focused on efficient numerical estimation of these parameters [15]. The influence of varying lengths of the finite sequences [16] is treated in various ways, e.g. by adjusting the lengths of the sequences to "effective lengths" but still assuming a Gumbel form of the distribution. Nevertheless, for moderate sequence lengths, which are biologically most relevant, the true distribution differs strongly from a Gumbel form [17,18], which can be dealt with by including a correction term to the Gumbel form (giving rise to an additional parameter).
The (RQGS) model is convenient, because the problem of computing significance values reduces to the estimation of only two parameters, which can be precomputed for each scoring scheme. However, there are also several problems. For instance, even if one considers just the gapless case, it is in general not easy to extend the analytic asymptotic theory to more complex null models. Furthermore, for practical applications where finite sequence lengths are considered, of even more importance is: The p-values reported by (the original) BLAST only depend on the raw score, the query and the subject length, and not on the actual query sequence. This leads to large distortions when the composition of the query sequence does not match the composition of the null model. For example, when we run a homology search for the human transmembrane protein rhodopsin (UniProt accession P08100) with BLAST (BLOSUM 62, gap-init 12, gap-extend 1, no composition adjustment, no filtering), we find a possibly remote homolog Q8NH42 with an E-value of 9 · 10 -8 . The E-value for score s is the expected number of database hits with score at least s and can be easily computed from pval(s) and the database size. Hence, it appears unlikely to obtain such a homolog by pure chance, i.e., the homolog appears to be relevant. However, using a recent "composition-based adjustment" option [19,20] leads to a very different E-value of 0:001 for the same protein. This underlines the importance of query-specific or at least composition-based statistics, particularly for intermediate p-values.
The statistics of position-dependent scoring and/or gap-cost schemes, as used in PSI-BLAST [21] or in hidden Markov model (HMM) frameworks, are less well explored. The central question here is, "given a query Q and a position-specific scoring scheme, what is the score distribution when random null-model sequences of given length are scored against Q?". We refer to this model as "fixed query -position-dependent scoring" (FQPS). As a compromise between the general (RQGS) and the very specific (FQPS) models, one may neither use a completely free (i.i.d) nor use a fixed query but draw query sequences according to more specific models, e.g., HMMs for transmembrane proteins.
In all these cases EVDs Eq. (2) may still used heuristically by fitting the parameters of the EVD to the simulated data. This can be achieved by generating pairs of random sequences according to the given null model while recording the histogram of observed alignment scores. Using such a "simple sampling" approach, the large-probability region of the score distribution can be investigated, e.g., for probabilities about > 10 -4 when generating 10 4 sequence pairs. Such an approach is implemented, e.g. in the hmmcalibrate program from the HMMER package [22]. Nevertheless, this procedure may fail to describe distribution in the "rare-even tail", i. e., where the probability is small (say < 10 -4 ), although this part of the distribution is most important for the estimation of the statistical significance.
Our motivation for a simulation-based method that makes no initial parametric assumption refers to the approach [23] to increase the sensitivity of detecting homologs of a given transmembrane (TM) protein in a database search: A bipartite scoring scheme with a (non-symmetric) transmembrane helix specific scoring matrix (such as SLIM [23]) for the TM helices and a generalpurpose scoring matrix (such as BLOSUM [9]) for the remaining regions of the query protein were applied, see Figure 1. This results in higher search sensitivity and specificity. However, a statistical theory or efficient computational method to obtain the score probabilities in such a (FQPS) framework is missing so far.

Our contributions and paper outline
We present a general framework for efficient estimation of raw score distributions in sequence comparison problems. In particular the rare-event tail for large scores can be accessed. We only make the following assumptions: 1. We are able to sample pairs x, y of sequences according to the null model and to compute the null model probability of any given x, y. 2. We have an efficient algorithm  that computes the score S(x, y), where x, y could be a pair of randomly drawn sequences (RQGS or HMM), or one fixed and one random sequence (FQPS). 3. The scores are rational numbers with a common denominator. Hence, without loss of generality, they can be assumed to be integers. 4. Optionally for the (HMM) approach, we have an efficient algorithm  that predicts the most likely state sequence for a given sequence.
Our framework is readily applicable to the (RQGS), (FQPS) and (HMM) models, but also to more exotic  Figure 1 Bipartite scoring scheme. Bipartite scoring scheme for the detection of homologous transmembrane proteins from Ref. [23]. The figure represents the Smith-Waterman alignment matrix and indicates which scoring matrix is used for which query positions (rows): In transmembrane helices, a transmembranespecific scoring matrix is used. For p-value computations, the query is assumed fixed or generated by the TMHMM and the subject is assumed a random i.i.d. sequence drawn from the distribution of amino-acid frequencies of the database. settings, such as normalized alignment [7], where the score is not additive, but normalized by the alignment length, for which no statistical framework exists so far. Very recently Eddy [24] studied the distributions of Viterbi and Forward scores under probabilistic local alignment, for which a numerical analysis of the rareevent tail would be of interest as well.
In the current stage of the methodology, the computation of an accurate "on the fly" p-value for each particular database query might be impracticable as each full calculation is not achieved within a few minutes.
We will illustrate the approach for the HMM for TM proteins (TMHMM [25,26]), which has been proven valuable in predicting TM helices. In this approach (and possible in other models as well) one is able to specify the score distribution in more detail. Each query may be classified to be a member of certain given subclasses  . In this case, it could be meaningful to obtain an individual score distribution for each subclass. A natural classification of the TMHMM is the number of transmembrane regions. Since for a given sequence this number is usually not known exactly, one takes the most likely one.
The rest of the paper is organized as follows. The following section presents the mathematical background on importance sampling and Markov chain Monte Carlo methods which are fundamental to the methods used to obtain the score distribution, in particular in the rare-event tail, for different null models. Next, we present a description of the methodology. Section "Results" shows computational results on transmembrane protein similarity statistics in (RQGS), (FQPS) and (HMM). A discussion closes the paper.

Importance sampling
Importance sampling is a general technique to reduce the variance in the estimation of quantities that can be written as an expectation  [ ( )] h Z , where Z is a random object representing the null model and h is a real-valued function. We assume that we can draw within a computer simulation n random samples z 1 , ..., z n according the null model. The expectation is then approximated by In our setting, to estimate the score distribution (and then p-values), we consider the state space S , from which we generate n random pairs of sequences({x (1) , y (1) } ,..., {x (n) , y (n) }).
These pairs are then aligned by a given algorithm  and the corresponding similarity scores S(x (i) , y (i) ) are computed. To formally write a histogram as an This means, we approximate the unknown exact probability Prob(S(X, Y)) by normalized score histograms over all sampled sequence pairs. If the probability to be estimated is small, say 10 -9 , when using simple sampling, we need about 10 12 samples to estimate it with reasonable precision. Thus, for very rare events, this sampling quickly becomes infeasible.
Importance sampling generates the "interesting" events more often by sampling from a different distribution and correcting for this bias afterward, which results in a more accurate estimate with a reasonable number of samples. Let p be the probability mass function (pmf) of (X, Y), and let q be another pmf satisfying q(x, y) > 0 whenever p(x, y) > 0. Then where each pair (x' (i) , y' (i) ) is sampled from the pmf q. Eq. (2) gives us the relationship between the expectation value w.r.t. the unknown distribution of interest, the target distribution p, and samples drawn from the actually used sampling distribution q. To successfully apply importance sampling, q has to fulfill three properties: First, it needs to put high probability on the region of interest; second, we need to be able to sample according to q; third, we need to be able to compute the correcting weight p(x, y)/q(x, y). Since directly sampling from q often is impossible, we shall use a general sampling method which we describe next.

Metropolis-Hastings sampling
If we need to generate samples from a discrete (or continuous) distribution q but have no simple direct method to do so, the Metropolis-Hastings method [27] provides a solution by constructing an ergodic Markov chain with stationary distribution q in the following way.
Extensive introductions to such so-called Monte Carlo simulations can be found, e.g., in Refs. [28,29]. Here we just give a concise introduction specifically tailored to the problem of sampling pairs of sequences. Let us call the elements (x, y) of the sample space  configurations. The sampling will be performed by randomly "moving" in the space of configurations, such that for each step a configuration is altered only slightly. Configurations which can be visited and are connected by a move are called neighbors. Hence, each configuration (x, y) has a algorithm-dependent set  ( , ) x y of potential neighbors x y . The movement is performed in such a way that each neighbor exhibits a positive probability P (x,y), (x',y') as being the next configuration. The proposal is accepted with probability in which case (x', y') becomes the new current configuration. Otherwise (x', y') is discarded and (x, y) remains unchanged. Thus, a sampling algorithm is specified via the neighborhoods  ( , ) x y and the proposal probabilities P (x,y),(x', y') The acceptance criterion Equation (3) is quite general. By using a symmetric proposal probability matrix, P (x',y') (x, y) = P (x,y), (x',y') , the relationship simplifies to Since the distribution q appears only in form of a ratio we need to be able to compute q(x, y) only up to a normalization constant. If q(x', y') >q(x, y) we accept the proposal and otherwise the proposal is accepted as new configuration with the finite probability q(x', y')/q(x, y). Hence, we need to choose the neighborhood of x, y such that that the ratio q x y q x y is not too far away from one. Otherwise, virtually only proposals that increase the probability are accepted and the sampling procedure gets stuck at local maxima. Equation (4) and its generalization Equation (3) describe Markov chains in the configuration space . For an appropriate choice of the neighborhoods  ( , ) x y and of the proposal distribution P (x, y), (x', y') , the so-constructed Markov chain is ergodic (each configuration can be reached from any starting configuration with finite probability). Furthermore, one can show that the detailed balance condition which is fulfilled due to the choice of a(.) according to Eq. (3), implies that the chain converges towards the desired sampling distribution q.
We say that the chain has reached equilibrium when convergence has occurred up to numerically negligible error. Thus, if the configuration (x, y) is sampled after equilibration, it will behave like a sample from q. In practice, the exact speed of equilibration is unknown and convergence diagnostics are applied (see below). Several (almost) independent samples are obtained by running the chain further and taking a sample every kth step for sufficiently large k to allow time for "forgetting" (i.e., decorreclating from) the state of the last sampled configuration. This time is usually referred as mixing time.

Implementation
In this section we show how the sampling algorithm for pairs of sequences is actually designed, based on the background given in the two preceding sections, such that the tails of the probability distributions for the scores can be addressed. The crucial point of the Metropolis-Hastings update is the choice of an appropriate neighborhood  ( , ) x y (and the related proposal probabilities) and the computation of the probabilities of newly proposed states q(x', y'). The neighborhood should be chosen such that the acceptance rate Eq. (3) is between 0.3 and 0.7. We shall factorize the (un-normalized) pmf q in two contributions, firstly weights w: ℤ ℝ + that assign each score value of interest a weight and secondly the null probability, i.e.
Note that we will leave w(·) undetermined for a moment, until Section "Wang-Landau Sampling". The importance reweighting equation Eq. (2) for h s is then For the (HMM) a single distribution of scores is not sufficient: Each query is a member of a certain sub-class characterized by the number of transmembrane regions "# of TM helices" to be determined by the Viterbi algorithm (see below). Thus, each class has its own probability P n (s) = Prob(S = s, # of TM helices = n TM ). In order to take this property into account, we deal with the joint probability Prob(S = s, # of TM helices = n TM ). Accordingly, the weights have a two dimensional domain, we write w(s, n). Also h s in Eq. (7) is replaced by an indicator function h s,n that depends on two parameters: h s,n (x, y) equals 1 if S(x, y) = s and # of TM helices of x = n TM . The sampling distribution is generalized to and the the reweighting relationship reads as Generally the occurrence of two sequences x = is characterized by the null probability This simple factorization allows us to draw proposals for the query and for the subject independently. Hence, for simplicity, a neighboring configuration will leave one of the two sequences unchanged. Thus, for selecting a neighboring configuration, first one of the two sequences is chosen at random with probability 1/2. In the case of (FQPS) the subject is always chosen. Then one sequence is chosen from the neighborhood of the selected sequence, as described next. Formally, this means for (RQGS) and (HMM) we use the factorized proposal densities P (x, y), (x', y') = 0.5P x, x' 1 y, y' or P (x, y) (x', y') = 1 x, x' P y, y' (1 y, y' denotes the indicator function which is only one if y = y', 0 else. P x, x' denotes the proposal of a single sequence) depending on the choice of sequence in the first step.

Proposal densities for (FQPS) and (RQGS)
In the simplest case either both sequences are i.i.d. or the query is fixed (to some sequence  x ) and the nullmodel probabilities of their occurrence factorize, i.e.
and of course in both cases Due to the factorization that occurs in Eq. (10) it is possible to draw sequences from  ( ) x such that the detailed balance condition f iid (x) P x,x' = f iid (x') P x', x is fulfilled by the following set of Monte Carlo moves (see also Figure 2 and Table 1 Operation a) appears with probability 1/2 and the other ones with probability 1/2 · 1/4 each. This is one possible choice that guarantees detailed balance.
Note that all sequences in  ( ) x have the same length and each operation involves a replacement of an existing symbol with a newly drawn symbol, in case a) by a direct substitution and in the cases b)-e) indirect via a shift operation. Each position of a sequence has the same probability of being chosen and the replaced symbol is chosen in all cases according to the frequencies f s (σ ∑).
With this construction the Metropolis-Hastings ratio Eq. x y x y The right part in the second line cancels, because all contributions to p(x, y) and p(x', y') where symbols from (x, x') and y, y) agree cancel directly and for the few remaining letters, the proposal probability in the nominator contains exactly the frequencies of the null probability in the denominator, and vice versa. Thus, the acceptance rate depends only on the score value of the current configuration S(x, y) and the one of the proposal S(x', y'). Furthermore, it is easy to prove that the detailed balance conditions in Eq. (5) is fulfilled for this chain which in turn implies that the chain converges towards the sampling distribution q.

Proposal densities for the (HMM)
In contrast to the approach presented in the previous section, the generalized method we use here also works for null models that do not allow for direct sampling from  ( ) x as in the case of i.i.d. sequences.
This framework can be summarized by following algorithm: METROPOLISHASTINGSUPDATE(x, y, z, p, s, n, w) Input: Sequences x, y, a hidden state sequence z, the null probability p(x, y) = f query (x). f subject (y), the score s = S(x, y), the sub-class n and weights w Output: Possibly new values for x, y, z, p, s, n. ▷ Designed such that p(x', y') · P (x', y'), (x,y) = p(x, y) · P (x,y),(x', y') 6: With probability min {1, a}: Let (x, y, z, p, s, n) (x', y', z'. p', s', n') 7: return (x, y, z, p, s, n) The algorithm is applicable to all models that allow for a rapid calculation of the null probabilities f(·). Sequence models based on HMMs fall into this class. In the following we brie y describe this framework. A detailed discussion can be found in the specialized literature on the topic [2,4].
In the probabilistic framework of HMMs one assumes a sequence of "observed" symbols (the protein sequences here) which is generated conditioned on a sequence of "hidden" states. For the case of TM proteins, the state corresponds to the physical region where the corresponding amino acid is located in, as detailed below. Within a modeling using HMMs, this state sequence, also called path, follows a simple Markov chain. The actually generated symbols are connected to the hidden states by conditional "emission" probabilities. More formally, a HMM consists of • a finite set ∑ of (output) symbols (in our case the amino acid alphabet), Figure    LGQITAED deletion at position 5 with right shift DLGQITAE Valid Monte Carlo operations for input sequence s = LGQIWTAE (indexing starts with 1). In order to obtain sequences of the same length as s, in the case of a deletion a character (D) to be appended at the border has to be specified.
• a finite set Γ of (hidden) states, • initial state probabilities π μ for all μ Γ with ∑ μ Γ π μ = 1, • emission probabilities p   in each state μ Γ Σ and for all s ∑ with p     = 1 for all s ∑, • a stochastic transition probability matrix P = Given these model parameters, the "most natural" application of a HMM is to generate a sequence of hidden states by a stochastic process and, in parallel, to generate a random sequence of symbols given the generated states. Hence, the stochastic process describes pairs of states and symbols. But also given a fixed state Z = Z 1 ... Z L , the symbol sequence x = x 1 ... x L is a stochastic process, furthermore the opposite case of a fixed sequence of output symbols, the state sequence is a stochastic process.
For the Monte Carlo sampling as needed here, it is not possible to simulate a HMM directly to generate symbol sequences, since importance sampling changes the underlying sequence probabilities. Nevertheless, one still needs to compute the probabilities f HMM (x) for the Monte Carlo acceptance procedure, i.e. the probabilities that x is the observed symbol sequence generated by the HMM using any feasible state sequence. These probabilities can be computed in  (L · |Γ| 2 ) time using the well known forward algorithm as described in the following. One introduces the auxiliary variables f μ (i), which correspond to the probability that the subsequence x 1 ... x i is generated by the HMM given that the last state variable Z i has the value μ, i.e. f μ (i) = Prob(X 1 ... X i = x 1 ... x i |Z i = μ). The overall probability is then f HMM (x) = ∑ μ Γ f μ (L). The probabilities f μ (i) can be determined by the recursion with initial conditions f p x Within the same time complexity the Viterbi algorithm  computes the most probable state path for a given sequence of observations, that is The missing normalization is no problem, since we are interested only in the most probable path, which is reconstructed by back-tracking [2].
For the approach discussed in this section, the subject sequences are drawn almost as above, see below. The HMM approach we use to sample transmembrane queries is the TMHMM developed by Sonnhammer et. al. [25]. In this setting, the states are (structural) domains. Some of them are "tied", which means that they share the same emission probabilities. They are classified into seven groups: • Helix core, • two different groups of caps on either side, • loops on the cytoplasmic side, • short and long loops on the non-cytoplasmic side, • globular domains.
The internal structure of the helix core and loop module allows modeling different lengths of the corresponding protein domain by assigning jump probabilities. The globular domains have a self-looping structure and hence may also have various lengths. The other modules have fixed lengths. The overall number of model parameters is 216. Figure 3 shows the actual layout of TMHMM. Each box represents a group of "tied" states. The states corresponding to "helix core" represent the transmembrane helices that connect states of the cytoplasmic side and the non-cytoplasmic side of the  membrane. The prediction of the positions of the "helix core" states determines the loci of the special purpose scoring matrix SLIM for position specific alignment (see Figure 1). The following Metropolis-Hastings update consists of two steps: First, the proposal of a new configuration from the neighborhood  ( ) x is made by inserting/ replacing symbols with equal weights f    1 | | for all σ ∑ using one of the five Monte-Carlo moves described above. The acceptance ratio Eq. (3) in that case is given by The current and new number of TM regions n TM and n' TM are determined by the Viterbi algorithm applied on the sequence x and x' respectively. The calculation of the query probabilities is based on the TMHMM. The subject sequence probability is simply calculated according to Eq. (11). Note that, in each step, one of the two probabilities cancels, because only one of the two sequences is changed within each step, as above.
This approach allows us to sample noni.i.d. sequences with appropriate weights and to predict transmembrane helical regions that can be used in the position specific alignment scheme (as described in [23]) even for random sequences.

Wang-Landau sampling
The idea of importance sampling is to choose the weights w(·), such that the drawn events in the region of interest have a high probability to occur in the simulation. Ideally, P(S) is already known and in that case one might choose w(S) ∝ 1/P (S) on the entire range of interest. Then all states are visited with equal probability, and hence a at score histogram is achieved in the limit of infinite sample size. Still, for practical applications with finite sample size, the distribution of scores can be sampled with high accuracy over a large range of its support. This idea refers back to statistical physics and it is known as "generalized ensemble" or " flat histogram" methods. In the following we will denote this weights by w flat .
Of course the true P(S) is unknown and the method requires some guesses which approximate w flat to a suitable accuracy. The achieved score histogram becomes only approximatively flat. The true (unknown) distribution can then be estimated by reweighting the histogram of visited states using the importance sampling formula Eq. (7) for h s .
Many iterative sampling schemes to achieve initial guesses had been developed in the 1990ies, for example entropic sampling [30], multicanonical sampling [31] and later transition matrix Monte Carlo [32][33][34], only to mention a few. Here we use the Wang-Landau algorithm [35,36] to approximate w flat as input for Metropolis-Hastings sampling.
The Wang-Landau algorithm explicitly violates detailed balance by dynamically updated weights depending on the visited states in the following way: First, a score range of interest [S min , S max ] is chosen. The algorithm basically employs a histogram H(S) and weights w(S) defined on the desired score range. For more complicated models such as the (TMHMM), these objects are two-dimensional depending on the score S and the class n TM , i.e. H (S, n TM ) and w(S, n TM ). Furthermore, real valued parameters j i > 1 are used in each iteration i. Initially, the histogram values H(S, n TM ) are set to 0 in the desired range and all weights w(S, n TM ) to a constant, say 1. For the first iteration, i = 0, j i can be as large as e 1 . Then, a simulation is performed using acceptance ratio Eq. (12) or Eq. (14). After each step, corresponding to one step of a (biased) random walk in the configuration space, w(S, n TM ) is updated as w(S, n TM ) w(S, n TM ) × j i , where S is the current score value and n TM the sub-class of the current state. Also the histogram H is updated by one H are set to 0 again, while w is kept for the next iteration. Note that the application of a flatness criterion is not essential for the good performance of the algorithm. It is enough to guarantee that all values of S have been visited, for example by requiring that the random walker has cycled several times through the interval of interest [S min , S max ].
To summarize, we have the following recipe: WANGLANDAU(w, j, j final , N) Input: Initial guess w[s, n], initial and final modification factors j, j final , number of samples for production run N Output: Histogram of visited scores, H(s, n):= number of samples with score s and class n and weights used in the production run w(s, n) for all s and n 1: ▷ Initialize and estimate w[s, n] 2: Pick any x, y  and compute its null probability p := f query (x) · f subject (y); 3: compute s := S(x, y) using  4: compute z := V (x) using  and determine corresponding class n; , the modification factor j converges towards 1. The simulation is stopped when j reaches a chosen threshold value which is close to 1. It turned out that in our case the range from j 0 = exp(0.1) ≈ 1.105 to j final = exp(0.0002) ≈ 1.0002 has been proven valuable. Since detailed balance is violated explicitly, the convergence of the algorithm can not be proven. For this reason one should always use the Wang-Landau part as a precomputation step just to obtain weights suitable w (S). After this, one performs a simulation with j = 1 for data production, which corresponds to the Metropolis-Hastings algorithm.

Improvements
Of course there is much room for improvement. For example, consider the time evolution of the histogram H(S) for (RQGS) with L Q = L S = 348 up to S max = 500 with Prob(S = S max ) ≈ 10 -65 in Figure 4a.
When starting with an initial guess w(S) = 1 for all S [23, 500], the random walker needed about 5.8 × 10 5 Monte-Carlo steps for a round trip, i.e. to move from the lowest score S min = 23 to the highest one S max = 600 and back. The duration of a round trip is a measure of the mixing time of the corresponding Markov chain. Hence, the shorter the round trip is time, the faster the chain convergences. During the first round trip, the weights have been improved such that the second round trip (and further round trips) needed only 13% of the computational effort of the first one. Once the random walker has performed its first round trip, the typical round trip From a practical point of view, this allows us to save computing time for two distributions with close by parameters (e.g. close by sequence lengths). One can use the results of one distribution as input for the second one. With this approach we may also explore the parameter space successively. In some cases it is sufficient to run a short batch run with the weights of a close by distribution and j = 1, i.e. a detailed balance simulation, and then apply importance reweighting and use the so obtained approximation of P(s) for a longer production run. This kind of procedure is shown in the inset of Figure 4b: The detailed balance simulations were performed with L Q = L S = 348, whereas the weights w(s) came from a simulation with L S = 320 and L S = 400, respectively. The result shows that the histograms are not " flat" at all, but the distributions were close enough to visit all score values on the range of interest. In this successive way of iterations a broad range of the parameter space is accessible.

Estimation of the statistical error
Statistical analysis of Markov-chain Monte-Carlo data requires a careful inspection of correlation effects because the events depend on the history of the chain. This correlations vanish within a typical timescale: Events that are separated by a sufficient number of steps can be assumed to be independent. However, since Monte-Carlo methods are only approximative, an assignment of statistical errors are requisite. In this study we used Flyvbjerg and Peterson's [37] blocking method to estimate the error.

Results
To our knowledge we present the first highly accurate score statistics for alignments with position-specific scoring schemes. The alignment scores were calculated with the standard Smith-Waterman algorithm with the BLOSUM62 matrix for the (RQGS) and a bipartite version BLOSUM62/SLIM for (FQPS) and (HMM) (see Figure 1). For the a fine gap costs we have chosen the standard values with a gap-open penalty of 12 and a gap-extension penalty of 1, and UniProt symbol frequencies for i.i.d. sequences. We discuss four different transmembrane proteins as queries (see Table 2 Here we observe in Figure 5b that on the log scale the curvature of the tail of the distribution, i.e., the deviation from the exponential tail of the pure Gumbel form Eq. (2), is more pronounced in the (FQPS) model: Significant differences of shapes already show up in the high probability region, which is accessible by simple sampling (Figure 5a). The (RQGS) distributions for different lengths match almost perfectly (only two lengths are shown), whereas the shape of the (FQPS) distributions varies slightly with the sequence type. This supports the observation of Müller et.al. [23] that position-specific scoring in connection with a fixed query sequence may better discriminate between different sequences than the standard approach where two random sequences are compared with position-independent scoring matrices.
The asymptotic theory for i.i.d. sequences predicts an EVD of the form of Eq. (2). The parameters λ > 0 and c > 0 depend on the score matrix, on the symbol frequencies f, and on the query and subject sequence lengths L Q and L S . Altschul and Gish [16] pointed out that asymptotic results where c = KL Q L S (with K > 0 a constant) need to be corrected by using effective sequence lengths. An alignment may extend to the end of either sequence and the score will be distorted towards lower values and high scores become less probable. In the limit of infinite sequences this effect vanishes and the tail of the Gumbel distribution can be understood as an upper bound for finite sequences. Indeed, we clearly see that the curves in Figure 5b are not straight lines in the right tail, but have negative curvature. A better t to the empirical distribution is obtained by determining parameters s 0 , λ > 0, λ 2 > 0 for a "modified" Gumbel distribution with where s 0 can be interpreted as the center of the distribution. This corresponds to a EVD multiplied with a Gaussian correction factor, given by the last term. The parameter λ 2 is generally small (and thus shows its effect only in the far right tail). It vanishes for sequences of equal length as the length tends to infinity. Previously, such a correction has been proposed for (RQGS) statistics and has been computed for different parameter sets of BLOSUM62 and PAM250 with a ne gap costs [17,18].
More pronounced differences are seen in the behavior of the tail (Figure 5b), which is only accessible via importance sampling approach. The difference between the probabilities spans several orders of magnitude; hence a wrong choice of the model would falsify the estimation of significance drastically. Most importantly, the pmf obtained using the position-specific scoring is considerably curved. Thus, using EVDs from fits to data of the high-probability region is even more questionable here than in the (RQGS) model, where the pmf is almost a straight line. Note that for the (RQGS) model, previous simulations [18] have already shown that for the special case of L S = L Q , the pmf converges for large sequence length indeed to an EVD.
Note that the Gaussian correction for local alignment parameterized by λ 2 is purely heuristic. Looking at the data, the shape in Figure 5b looks similar to the one of the Tracy-Widom distribution [38]. Interestingly, Majumdar and Nechaev [39] as well as Priezzhev and Schütz [40] obtained analytically the Tracy-Widom distribution as the asymptotic distribution for the model of the longest common subsequences which is closely related to global alignment. Also, Sardiu, Alves and Yu [41] observed that the the statistics of the score fluctuations of global alignment in the large probability region is compatible with the Tracy-Widom distribution. There could be a connection to our results, because in the rare-event tail, alignment lengths are of the order of the sequence lengths, hence the alignment is effectively global. Nevertheless all our results are obtained for finite sequence lengths. In contrast, distributions like Gumbel or Tracy-Widom are obtained in the asymptotic limit of infinite sizes of the underlying systems. Since finite-size corrections are hard to obtain, or even unknown, we do not attempt to determine the "true" shape of the distribution and are satisfied by our heuristic formula.
Next, we discuss the usefulness of the (FQPS) statistics in terms of retrieval performance. For this purpose we considered the ASTRAL compendium [42] version 1.75 with less than 40% identity to each other. It contains a set of reference proteins classified hierarchically based on their tertiary structure. It is a subset of the SCOP database with removed redundancy. The main hierarchy levels in the SCOP classification scheme are class, fold, superfamily, family. Proteins in the same class share the same type of secondary structure, whereas the fold level describes more specific the arrangement of the secondary structure. As the position specific scoring scheme is designed to be more sensitive to discriminate transmembrane proteins against others, its performance can be measured by searching a collection of transmembrane proteins from the ASTRAL set against the complete set. From the 10569 sequences in the database we have chosen 63 sequences which are classified as Membrane and cell surface proteins and peptides. For this collection we predicted the membrane regions using TMHMM. Each of those queries were searched against the complete set and ranked according to the p-value on the basis of the (FQPS) statistics. The p-value threshold under which we regard a hit as significant controls the so called receiver operating characteristic (ROC), i.e. the relationship between sensitivity vs. specificity. A transmembrane protein that appears below an p-value threshold is referred as a true positive observation. Accordingly, proteins of all other SCOP classes below that value are false positives. The ROC space can be explored by changing the p-value threshold. A small threshold produces less false positives but we may miss some hits. A larger value leads to more false positives. The ROC is usually illustrated by plotting the true positive rate (TPR = true positives/positives), or the sensitivity, against the false positive rate (FPR = false positives/ negatives), 1-specificity. The result for the search characteristics for (FQPS) is shown in Figure 6. We did the same experiment for a BLAST search. In this case all observations are located in left bottom corner in the ROC space. This can be explained by the fact that we have only considered the highest ranked results below the E-value threshold of 10 and many positives were beyond that value. This can also be seen in the inset of Figure 6 where we show the average number of membrane proteins found so far as a function of the rank in the result set. The line of slope 1 for (FQPS) at the begin of the result list means that virtually all hits are found in the correct SCOP class. In contrast, BLAST ranks transmembrane proteins at a high positions only randomly. Hence, the (FQPS) clearly outperforms the (RQGS) statistics. The ROC curve in Figure 6 show the usefulness of the (FQPS) statistics for retrieval performance, but the extreme small p-values where the the modification factor λ 2 plays a role are not essential for this purpose.  Examples of blast hits for the four proteins used for FQPS. The original result sets have been re-ranked according the the FQPS statistics. Left column: Gumbel assumption (λ 2 = 0). Right column: Modified Gumbel distribution (λ 2 ≠ 0).
The modified Gumbel statistics however affect a possible ranking of database search results, especially for sequences of different lengths. To illustrate this, we used BLAST to retrieve homologs of our four example proteins from the current Swissprot database. The scores were recomputed via the position specific Smith-Waterman algorithm for (FQPS). We computed the corresponding p-values from our simulation data and ranked the result set by the p-value based on 1. the Gumbel distribution (λ 2 = 0) and 2. the accurate distribution (λ 2 > 0).
For subject sequence lengths that are not directly governed by our simulation directly we used interpolated fit parameters. In Table 3 we illustrate that the there are subjects whose relative order in the result set is switched when using the more accurate (FQPS) score distribution in contrast to the BLAST E-value. This result might be important for applications of protein classification where the specific ranking of high scoring proteins is particularly important. However, on a global level the order of the hits persists, signaled by Kendall's rank correlation [43]. When comparing the order obtained between the cases λ 2 = 0 and λ 2 ≠ 0 we measured a rank correlation = 0.986 for the query P08100. A τ of exactly 1 means identical, 0 unrelated, and, -1 exactly the reverse ranking. The rank correlation between the original BLAST ranking and the one based on the accurate p-values is = 0.737.
To investigate the impact of dissimilar query and subject lengths L Q and L S on the parameters of the modified Gumbel distribution, we vary L S and consider the parameters λ and λ 2 as functions of the ratio L S /L Q (see Figure 7). The large gap between the values of λ for the two different models reflects the qualitative difference of the shape in the high probability regime. We see that in the (RQGS) model, λ is virtually independent of the query and the sequence length. However, in the model (FQPS), λ varies with each individual query, as expected. For λ 2 one has to distinguish between L S <L Q and L S >L Q . In the first case, λ 2 decreases, which is not surprising, since the correction term describes a finite-size effect and should vanish for increasing sequence lengths.
Once the subject length exceeds the query length, the search space is still growing, but the finite length of the query enforces subject size independent edge effects.
For the (HMM), we approximate the score distribution within each class (number of helices = n). The shape of the distributions clearly agrees with the curvature for (RQGS) and (FQPS), and the modified Gumbel distribution could be fitted (see Figure 8) when the number of helices was not too small. This is indicated by a large reduced χ 2 value for distributions with a small number of helices. Also a visual inspection of the fit to the data supports this argument.
The rare-event tail shows clear differences between the different sub-classes of the model over several orders of magnitude. In Figure 9 the dependency of the fit parameters on the respective subclass of the model (Figure 9a and Figure 9b) as well as the dependency on the ratio L S / L Q (Figure 9c and Figure 9d) is shown. Note that for distributions that are not well described via Eq. (15), we only fitted the data in the high probability region. Those data points are left out in the plot for λ 2 in Figure 9b and are connected by dotted lines in Figure 9a.
In analogy to (RQGS) and (FQPS), the curvature remains constant when L S >L Q . Regarding the dependence on the number of helices, the curvature decays with increasing number of transmembrane regions and then approaches an approximate constant value. Numerical values are provided in the Appendix for reference.

Discussion and conclusions
We have presented a simple universal numerical method to accurately sample the far right tail of the score distribution of various sequence comparison algorithms. It appears to be the first method that is applicable to all classical local alignment statistics, query-specific and position-dependent score statistics, HMM calibration, statistics of normalized alignments, and many more. To sample the distribution using computer simulations, we use Markov-chain Monte Carlo simulations, in particular the Wang-Landau approach is connection with the Metropolis-Haistings algorithm. Apriori, the Wang-Landau approach does not require any assumption on the shape of the distribution (for example the parameters of the Gumbel distribution). The parameters can be estimated a posteriori by fitting the simulated distribution to an appropriate parametric form like Eq. (15). Here, we observed that for the (FQPS) model, the Gumbel distribution should be replaced by a more negatively curved one. The method has a disadvantage: Because of the high number of samples required for non-parametric estimation of the distribution, it can presently not be used in on-line database search web services, such as a BLAST server. For example, generating the 16,777,216 samples for Figure 5 (L Q = L S = 348) took approximately 16 hours on an Intel Pentium 4 with 3.4GHz. This is not as bad as it seems, though: Both the implementation and the design of the Markov chain have much room for improvement, e.g. we can choose different neighborhoods N(x) and optimize the weights in the generalized ensemble [44,45]. While this still prohibits interactive use, we see a lot of potential for our method to provide an improved version of the hmmcalibrate tool [22] and to explore the statistics of normalized sequence alignment [7].
During the preparation of this manuscript we came aware of a new related importance sampling method which is suitable for efficient p-value computations for alignment statistics [46]. It makes use of simultaneous backward sampling of alignments and sequences. So far this method was applied to i.i.d. sequences but it should be possible to extend it to more complex model as well. We have tested it for the (FQPS) model as well. For the joint distribution of score and number of helices one would have to sample simultaneous the alignments, sequences and the hidden state sequence of the TMHMM. Table 4 and Table 5 show numerical values for the parameters λ, λ 2 and K of the modified Gumbel distribution Eq. (15). These are visualized in Figure 7 and 9 in the body of the paper.