Fast index based algorithms and software for matching position specific scoring matrices

Background In biological sequence analysis, position specific scoring matrices (PSSMs) are widely used to represent sequence motifs in nucleotide as well as amino acid sequences. Searching with PSSMs in complete genomes or large sequence databases is a common, but computationally expensive task. Results We present a new non-heuristic algorithm, called ESAsearch, to efficiently find matches of PSSMs in large databases. Our approach preprocesses the search space, e.g., a complete genome or a set of protein sequences, and builds an enhanced suffix array that is stored on file. This allows the searching of a database with a PSSM in sublinear expected time. Since ESAsearch benefits from small alphabets, we present a variant operating on sequences recoded according to a reduced alphabet. We also address the problem of non-comparable PSSM-scores by developing a method which allows the efficient computation of a matrix similarity threshold for a PSSM, given an E-value or a p-value. Our method is based on dynamic programming and, in contrast to other methods, it employs lazy evaluation of the dynamic programming matrix. We evaluated algorithm ESAsearch with nucleotide PSSMs and with amino acid PSSMs. Compared to the best previous methods, ESAsearch shows speedups of a factor between 17 and 275 for nucleotide PSSMs, and speedups up to factor 1.8 for amino acid PSSMs. Comparisons with the most widely used programs even show speedups by a factor of at least 3.8. Alphabet reduction yields an additional speedup factor of 2 on amino acid sequences compared to results achieved with the 20 symbol standard alphabet. The lazy evaluation method is also much faster than previous methods, with speedups of a factor between 3 and 330. Conclusion Our analysis of ESAsearch reveals sublinear runtime in the expected case, and linear runtime in the worst case for sequences not shorter than |A MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaWaaeGaeaaakeaaimaacqWFaeFqaaa@3821@|m + m - 1, where m is the length of the PSSM and A MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaWaaeGaeaaakeaaimaacqWFaeFqaaa@3821@ a finite alphabet. In practice, ESAsearch shows superior performance over the most widely used programs, especially for DNA sequences. The new algorithm for accurate on-the-fly calculations of thresholds has the potential to replace formerly used approximation approaches. Beyond the algorithmic contributions, we provide a robust, well documented, and easy to use software package, implementing the ideas and algorithms presented in this manuscript.


Background
Position specific scoring matrices (PSSMs) have a long history in sequence analysis (see [1]). A high PSSM-score in some region of a sequence often indicates a possible biological relationship of this sequence to the family or motif characterized by the PSSM. There are several databases utilizing PSSMs for function assignment and annotation, e.g., PROSITE [2], PRINTS [3], BLOCKS [4], EMATRIX [5], JASPAR [6], or TRANSFAC [7]. While these databases are constantly improved, there are only few improvements in the programs searching with PSSMs. E.g., the programs FingerPrintScan [8], BLIMPS [4], and MatInspector [9] still use a straightforward (mn)-time algorithm to search a PSSM of length m in a sequence of length n. In [10] the authors presented a method based on Fourier transformation. A different method introduced in [11] employs data compression. To the best of our knowledge there is no software available implementing these two methods. The most advanced program in the field of searching with PSSMs is EMATRIX [12], which incorporates a technique called lookahead scoring. The lookahead scoring technique is also employed in the suffix tree based method of [13]. This method performs a limited depth first traversal of the suffix tree of the set of target sequences. This search updates PSSM-scores along the edges of the suffix tree. Lookahead scoring allows to skip subtrees of the suffix tree that do not contain any matches to the PSSM. Unfortunately, the method of [13] has not found its way into a widely available and robust software system. In [14], the development of new, more efficient algorithms for searching with PSSMs is considered an important problem, which still needs better solutions.
In this paper, we present a new, non-heuristic algorithm for searching with PSSMs. With any non-heuristic PSSM searching algorithm, the performance in terms of sensitivity and specificity solely depends on the used PSSM and threshold, i.e. given a PSSM and threshold, all these algorithms give exactly the same results. For the generation of PSSMs from aligned sequences, numerous different methods, were described in literature over the last decades [1,5,[15][16][17]. Rather than improving PSSMs, we focus on improvements in terms of time and space efficiency when searching with PSSMs, independently of their underlying generation method. The overall structure of our proposed search algorithm is similar to the method of [13]. However, instead of suffix trees we use enhanced suffix arrays, a data structure which is as powerful as suffix trees (cf. [18]). Enhanced suffix arrays provide several advantages over suffix trees, which make them more suitable for searching with PSSMs: • While suffix trees require about 12n bytes in the best available implementation (cf. [19]), the enhanced suffix array used for searching with PSSMs only needs 9n bytes of space.
• While the suffix tree is usually only computed in main memory, the enhanced suffix array is computed once and stored on file. Whenever a PSSM is to be searched, the enhanced suffix array is mapped into main memory which requires no extra time.
• While the depth first traversal of the suffix tree suffers from the poor locality behavior of the data structure (cf. [20]), the enhanced suffix array provides optimal locality, because when searching with PSSMs it is sequentially scanned from left to right.
One of the algorithmic contributions of this paper is a new technique that allows to skip parts of the enhanced suffix array containing no matches to the PSSM. Due to the skipping, our algorithm achieves an expected running time that is sublinear in the size of the search space (i.e., the size of the nucleotide or protein database). As a consequence, our algorithm scales very well for large data sizes.
Since the running time of our algorithm increases with the size of the underlying alphabet, we developed a filtering technique, utilizing alphabet reduction, that achieves better performance especially on sequences/PSSMs over the amino acid alphabet.
When searching with a PSSM, it is important to determine a suitable threshold for a PSSM-match. Usually, the user prefers to specify a significance threshold (i.e., an E-value or a p-value) which has to be transformed into an absolute score threshold for the PSSM under consideration. This can be done by computing the score distribution of the PSSM, using well-known dynamic programming (DP, for short) methods, e.g., [12,[21][22][23]. Unfortunately, these methods are not fast enough for large PSSMs. For this reason, we have developed a new, lazy evaluation algorithm that only computes a small fraction of the complete score distribution. Our algorithm speeds up the computation of the threshold by factor of at least 3, compared to standard DP methods. This makes our algorithm applicable for onthe-fly computations of the score thresholds.
The new algorithms described in this paper are implemented as part of the PoSSuM software distribution. This is available free of charge for non-commercial research institutions. For details, see [24]. Parts of this contribution appeared as [25] in proceedings of GCB2004. A simple algorithm for the PSSM matching problem slides along the sequence and computes sc (w, M) for each w = S [j..j + m -1], j ∈ [0, n -m]. The running time of this algorithm is (mn). It is used e.g., in the programs FingerPrint-Scan [8], BLIMPS [4], MatInspector [9], and MATCH [17].

PSSMs and lookahead scoring: LAsearch
In [12], lookahead scoring is introduced to improve the simple algorithm. Lookahead scoring allows to stop the calculation of sc (w, M) when it is clear that the given overall score threshold th cannot be achieved. To be more precise, we define ,  [12], is implied by the following Lemma: The following statements are equivalent: Amino acid PSSM Figure 1 Amino acid PSSM. Amino acid PSSM of length m = 10 of a zinc-finger motif. If the score threshold is th = 400, then only substrings beginning with C or V can match the PSSM, because all other amino acids score below the intermediate threshold th 0 = th -σ 0 = 400 -398 = 2. That is, lookahead scoring will skip over all substrings which begin with amino acids different from C and V. Here σ d , d ∈ [0, m -1] denotes the maximal score that can be achieved in the last m -d -1 positions of the PSSM as defined in the text. In the worst case, k ∈ (m), which leads to the worst case running time of (mn), not better than the simple algorithm. However, k is expected to be much smaller than m, leading to considerable speedups in practice.
Our reformulation of lookahead scoring and its implementation is the basis for improvements and evaluation in the subsequent sections.

PSSM searching using enhanced suffix arrays: ESAsearch
The enhanced suffix array for a given sequence S of length n consists of three tables suf, lcp, and skp. Let $ be a symbol in , larger than all other symbols, which does not occur in S. suf is a and S suf [i] . Figure 2 shows this relation. Table skp can be computed in (n) time given suf and lcp. For the algorithm to be described we assume that the enhanced suffix array for S has been precomputed.
In a suffix tree, all substrings of S of a fixed length m can be scored with a PSSM by a depth first traversal of the tree.
Using lookahead scoring, one can skip certain subtrees that do not contain matches to the PSSM. Since suffix trees have several disadvantages (see the introduction), we use enhanced suffix arrays to search PSSMs. Like in other algorithms on enhanced suffix arrays (cf. [18]), one simulates a depth first traversal of the suffix tree by processing the arrays suf and lcp from left to right. To incorporate lookahead scoring while searching we must be able to skip certain ranges of suffixes in suf. To facilitate this, we use table skp. We will now make this more precise.  = 3 we find that S suf [1] and S suf[0] share a common prefix of length 3 and d 0 + 1 = -1 + 1 < lcp[1] = 3 (second case described above). The next suffix that may match M with th = 6 is S suf [6] = caaaaccacac$. Suffixes S suf [1] , S suf [2] ,... S suf [5] can be skipped since they all share a common prefix with S suf[0] of at least length 1. That is, they begin all with character 'a' and would also miss the first intermediate threshold th 0 = 3 when scored. We find S suf [6] by following a chain of entries in  [6] we compute pfxsc 0 (S suf [6] , M) = M(0,c) = 3 and pfxsc 1 (S suf [6] , M) = M(0, c) + M(1, a) = 6 and store them for reuse in C[0] and C [1]. Since d 6 = 1 = m -1 = 1 holds, we report suf [6] = 0 with score sc (S suf [6] , M) = pfxsc 1 (S suf [6] ,M) = 6 as a matching position.
With lookups in lcp[7] = 2 and lcp[8] = 3 we notice that S suf [7] and S suf [8] share a common prefix of at least two characters with S suf [6] . Hence we report suf[7] = 6 and suf [8] = 8 with score C[1] = 6 as further matching positions. We proceed with the scoring of S suf [9] . Since lcp[9] = 1 holds, we obtain the score for the first character 'c' from array C with pfxsc 0 (S suf [9] , M) = C[0]. After scoring the second character of S suf [9] , pfxsc 1 (S suf [9] , M) = 5 <th 1 = 6 holds and we miss the second intermediate threshold and continue with the next suffix. The last two suffixes S suf [10] and S suf [11] in suf do not have to be considered since their lengths are smaller than to m = 2 (not counting the sentinel character $) and therefore they cannot match M. We end up with matching positions 0, 6, and 8 of M in S with match score 6. To find these matches, we processed the enhanced suffix array suf top down and scored suffixes from left to right, facilitating the additional information given in tables lcp and skp to avoid rescoring of characters of common prefixes of suffixes and to skip suffixes that cannot match M for the given threshold.

Analysis
The C i arrays can be stored in a single (m) space array C as any step i only needs the C i specific to that step. C i solely depends on C i-1 , and C i [0..d -1] = C i-1 [0..d -1] holds for a certain d<m, i.e., the first d entries in C i are known from the previous step, and thus C can be organized as a stack.
Relationship between enhanced suffix array and suffix tree Figure 2 Relationship between enhanced suffix array and suffix tree.   head scoring does not give any speedup and every suffix must be read up to depth to m, leading to an (nm) worst case time complexity. This is not worse but also not better than the complexity for LAsearch. Next we show that, independent of the chosen threshold th, the overall worst case running time boundary for ESAsearch drops to (n + m) under the assumption that holds.
The shorter the common prefixes of the neighboring suffixes, the slower ESAsearch runs. Thus to analyze the worst case, we have to consider sequences containing as many different substrings of some length q as possible. Observe that a sequence can contain at most || q different substrings of length q > 0, independent of its length. To analyze the behavior of ESAsearch on such a sequence, we introduce the concept of suffix-intervals on enhanced suffix arrays, similar to lcp-intervals as used in [18].
, if the following three conditions hold: with i <j and lcp[k] = ᐍ for at least one k ∈ {i + 1,..., j}.
Every lcp-interval ᐍ -[i, j] of an enhanced suffix array for text S corresponds to an internal node v in a suffix tree for S, and the length of the string spelled out by the edge labels on the path from the root node to v is equal to ᐍ. Leaves are represented as singleton intervals, As an example for ᐍ-suffix-intervals, consider the enhanced suffix array given in Figure 2 Function skipchain(lcp, skp, n, i, d) input : Tables lcp and skp of an enhanced suffix array, |S| denoted with n, an index i of the i-th smallest suffix, and depth d from where to start skipping. output: An index j of the j-th smallest suffix with j > i.
. This corresponds to the number of internal nodes and leaves in a suffix tree, which is atomic up to at least depth q under our assumptions.
Since we are considering sequences that contain all possible substrings of length q, there are || d d-suffix-intervals at any depth d, Hence processing such a suffix-interval requires one matrix look-up and addition to compute the score, and ji + 1 steps to report all matches and find suffix v k . Since suffix-intervals do not overlap, the total length of all suffix-intervals at depth q can be at most n, so the total time spent on reporting matches is bounded by n.
There are three cases to consider when determining the time required for calculating the match scores for a PSSM of length m. Let p : = m -q.
1. If p = 0 (⇒ m = q), then the time required to calculate all match scores is in (E q ) as discussed above.
2. If p < 0 (⇒ m<q), then none of the m-suffix-intervals are singletons since we assumed that the sequence under consideration contains all possible substrings of length q, i.e., there must be suffixes sharing a common prefix of length m, and the time required to calculate all match scores is in (E m ).
3. If p > 0 (⇒ m > q), then every m-suffix-interval can be a singleton, and all prefix scores for the PSSM prefix of length q are calculated in (E q ) time. However, the remaining scores for the pending substrings of length p must be computed for every suffix longer than q, taking (np) additional time, and leading to a total (E q + np) worst case time complexity for computing all match scores.
Note that a text containing || q different substrings must have a certain length, which must be at least || q . In fact, a minimum length text that contains all strings of length q has length n = || q + q -1. It represents a de Bruijn sequence [28] without wrap-around, i.e., a de Bruijn sequence with its first q -1 characters concatenated to its end. Since a de Bruijn sequence without wrap-around represents the minimum length worst case, we infer from Equation (2) that E q ∈ (n). Hence, if m = q, then it takes (n) time to calculate all match scores. If m <q, then E m <E q and thus it takes sublinear time. If m > q, it takes (n + np) time.
We summarize the worst case running time of ESAsearch for preprocessing a PSSM M of length m, searching with M, and reporting all matches with their match scores, as Hence, the worst case running time is (n + m) for p ≤ 0, implying that this time complexity holds for any PSSM of length m and threshold on any text of length n ≥ || m + m -1, as already stated in Inequality (1).
In practice, large numbers of suffixes can be skipped if the threshold is stringent enough, leading to a total running time sublinear in the size of the text, regardless of the relation between n and m. ESAsearch reads a suffix up to depth m unless an intermediate score falls short of an intermediate threshold, and skips intervals with the same or greater lcp if this happens. Right boundaries of skipped suffixintervals are found quickly by following the chain of skipvalues (see function skipchain in Figure 4). It are these jumps that make ESAsearch superior in terms of running time to LAsearch in practice. The best case is indeed (||) which occurs whenever there is no score in the first row of the PSSM that is greater than th 0 . See Figure 5 for examples of enhanced suffix arrays, constructed from texts S and T that consist of all strings of a certain length m over some alphabet. In these enhanced suffix arrays no suffix shares a prefix of length m with any other suffix, forcing ESAsearch to compute scores for each operations to be performed during the search phase.

Performance improvements via alphabet transformations
Inequality (1) provides the necessary condition for (n + m) worst case running time. We now assume that m in Inequality (1)  Observe that for || = 4, m*(th) needs to be smaller or equal to 15 to achieve linear or sublinear running times. This provides the motivation to reduce the alphabet size by transforming  into a reduced size such that | | < ||.
In practice, for reasonably chosen thresholds th, the performance of ESAsearch mainly depends on the fact that often large ranges of suffixes in the enhanced suffix array can be skipped. This is always the case if we drop below an intermediate threshold while calculating a prefix' score, and if that prefix is a common prefix of other suffixes. In terms of lcp-intervals, this means that we can skip all ᐍ-intervals with ᐍ ≥ m*(th) on average. In contrast to suffixintervals, whose total count is in (n 2 ), size and number of lcp-intervals depend on ||, as illustrated in Figure 6. We  We now describe how to take advantage of reduced alphabets as fast filters in the ESAsearch algorithm. Let

Proof:
Thus the following implications follow directly (1) Transform S into and build the enhanced suffix array for ; (2) Construct from M; Observe that there is a trade-off between increased expected reading depth and increased lcp-interval sizes at low reading depths. Therefore it is desirable to minimize the effect of maximization by grouping PSSM columns with similar score values, i.e., highly correlated columns. Since PSSMs reflect the properties of the underlying multiple alignment, we expect correlations of PSSM columns Number of ᐍ-intervals for various reduced alphabets Figure 6 Number of ᐍ-intervals for various reduced alphabets. Numbers of ᐍ-intervals for ᐍ ∈ [1, 20] of different length for various reduced alphabets. We built the enhanced suffix array with sequences from the RCSB protein data bank (PDB) (total sequence length 4,264,239 bytes). The used reduced amino acid alphabets are given in Figure 8. Note that we limited the interval lengths in the figures to 5,000 to prevent distortion.
according to biologically motivated symbol similarities. Hence character correlation is the motivation for our alphabet reduction strategy.

Reduced amino acid alphabets
It is well known that various of the naturally occurring amino acids share certain similarities, like similar physiochemical properties. Accordingly, the complexity of protein sequences can be reduced by sorting these amino acids with similarities into groups and deriving a transformed, reduced alphabet [29]. These reduced alphabets contain symbols that represent a specific character class of the original alphabet. Since PSSMs and the sequences to be searched have to be encoded over the same alphabet, we are more interested in a single reduced alphabet suitable for all PSSMs under consideration, than in PSSM-specific reduced alphabets. The latter implies an unacceptable overhead of index generation for sequences over PSSMspecific alphabets, even though it may result in a lower expected reading depth. The basis for our reduction of the 20-letter amino acid alphabet to smaller alphabets are correlations indicated by the BLOSUM similarity matrix as described in [30]. That is, amino acid pairs with high similarity scores are grouped together (see Figure 8 for an example). Let a and b be two amino acids and Y a 20 × 20 score matrix, then a measure of amino acid correlation c a,b between a and b can be defined as and amino acid pairs can be iteratively grouped together according to their correlations, starting with the most correlated pairs, until all the amino acids are divided into the desired number of groups.

Finding an appropriate threshold for PSSM searching: LazyDistrib Probabilities and expectation values
The results of PSSM searches strongly depend on the choice of an appropriate threshold value th. A small threshold may produce a large number of false positive matches without any biological meaning, whereas meaningful matches may not be found if the threshold is too stringent. PSSM-scores are not equally distributed and thus scores of two different PSSMs are not comparable. It is therefore desirable to let the user define a significance threshold instead. The expected number of matches in a given random sequence database (E-value) is a widely accepted measure of the significance. We can compute the E-value for a known background distribution and length of the database by exhaustive enumeration of all substrings. However, the time complexity of such a computa-     [12,21,22]) allow to compute the probability distribution (and hence the E-value) in (m 2 R||) time.

|A| L V I M C A G S T P F Y W E D N Q K R H 20 LVIM C A G S T P FY W E D N Q
In practice the probability distribution is often not exactly, or completely calculated due to concerns of speed. E.g., in the EMATRIX system [12] score thresholds are calculated and stored for probability values in the interval π = 10 -1 , 10 -2 ,..., 10 -40 only. Consequently, the user can only specify one of these p-value cutoffs. For the calculation of the p-value from a determined match score, EMATRIX uses log-linear interpolation on the stored thresholds. A different, commonly used strategy to derive a continuous distribution function uses the extreme value distribution as an approximation [31][32][33] of high scoring matches.
Even though it is widely accepted that high-scoring local alignment score distributions of the popular position independent scoring systems PAM and BLOSUM can be well approximated by an extreme value distribution, this cannot be generalized for arbitrary PSSMs.
To check whether an extreme value distribution is a suitable approximation for the distribution of PSSM match scores, we sampled the match scores of PSSMs arbitrarily chosen from the TRANSFAC and BLOCKS database. We randomly shuffled 1000 human promotor sequences of length 1200, taken from the database of transcriptional start sites (DBTSS) and 1000 protein sequences of length 365 (= average sequence length in Uniprot-Swissprot), respectively, preserving their mono-symbol composition. From the derived random PSSM match scores we took the best score for each sequence and calculated the empirical cumulative distribution function. If the match scores S are extreme value distributed, a X-Y plot with X = S and Y = log(-log(S)) should appear linear, since log holds. For the TRANSFAC PSSM shown in Figure 9, the X-Y plot clearly indicates that an extreme value distribution is not an appropriate approximation. For PSSM IPB003211A (see Figure 10) from the BLOCKS database, it seems as if the score distribution can be approximated quite well with an extreme value distribution. However, we then still have That is why we are more interested in an exact solution and thus we focus on the efficient computation of an exact discrete score distribution.

Calculation of exact PSSM score distributions
While recent publications focus on the computation of the complete probability distribution, what is required specifically for PSSM matching, is computing a partial cumulative distribution corresponding to an E-value resp. p-value specified by the user. Therefore, we have developed a new "lazy" method to efficiently compute only a small fraction of the complete distribution.
Score distribution of TRANSFAC PSSM M00734 Figure 9 Score distribution of TRANSFAC PSSM M00734. Histogram, cumulative score distribution function, X-Y plot, and normal probability plot of TRANSFAC PSSM M00734 (PSSM length m = 9).   Figure   11 for an example.
If we allow for floating point scores that are rounded to ε decimal places, the time and space requirement increases by a factor of 10 ε . Conversely, if all integer scores share a greatest common divisor z, the matrix should be canceled down by z.

Restricted probability computation
In order to find (π, M) it is not necessary to compute the whole codomain of the distribution function Q = Q m-1 . We propose a new method only computing a partial distribution by summing over the probabilities for decreasing threshold values sc max (M), sc max (M) -1,..., until the given p-value π is exceeded (see Figures 11,12).
In Evaluation with dynamic programming Figure 11 Evaluation with dynamic programming. The simple DP scheme computes all probability vectors Q 0 , Q 1 , Q 2 completely within the green marked area, corresponding to score ranges of prefix PSSMs M k . In contrast to the simple scheme, the restricted probability computation method computes only the upper end of the probability distribution until the given p-value threshold is exceeded, omitting parts of the green area. In this example we show how to compute the score threshold (π, M) for PSSM M of length m = 3 and a score range of [4,11] corresponding to a given p-value threshold of π = .
For simplicity we assume a uniform character distribution of Cells of the matrix that are computed in the step actually under consideration are marked red. In step d = 0, see (A), the algorithm computes Q 2 (11) recursively for all paths through M that achieve a score of 11, i.e. Q 2 (11) = Q 1 (8)·f(G), Q 1 (8) = Q 0 (4)·f(G), Q 0 (4) = Q -1 (0)·f(A) = 1· , since AGG is the only path achieving score 11. It follows Q 2 (11) = . In step d = 1 all paths achieving a score of 11 -d = 10 to determine Q 2 (10) are computed, see (B). We conclude Q 2 (10) = . In this step, DP allows to reuse value Q 1 (8) without recomputation. In step d = 2, see (C) values Q 1 (7) and Q 0 (3) can be reused to compute Q 2 (9) = . In step d = 2 the cumulated probability Q 2 (11) + Q 2 (10) + Q 2 (9) = exceeds the given p-value threshold of π = , and the restricted probability computation method skips the rest of the computation. We obtain a score threshold of th = 10 correponding to π. Step d=0 : t=11 Step d=2: t= 9 given a threshold th, the algorithm only evaluates parts of the DP vectors necessary to determine Q k (th) and simultaneously saves sub-results concerned with score th in an additional buffer matrix Pbuf (instead of recomputing them later, see Figure 13 for an example). This is described by the following recurrence: Tmin  Probability computation using lazy evaluation ofthe DP matrix Figure 13 Probability computation using lazy evaluation of the DP matrix. In this example we use the same PSSM M, character distribution, and p-value threshold π = as in Figure 11. However, in each row of the PSSM the scores are sorted in descending order, and the rows are sorted with the most discriminant row coming first (see coloured PSSMs for this relationship). Observe that the LazyDistrib algorithm evaluates the DP vectors non-recursively top-down. Cells computed in the actual step are marked red. In step d = 0 the algorithm computes Q 2 (11) by evaluating paths through the PSSM contributing to Q 2 (ll), which is in this example only the high scoring path GGA. Intermediate results of Q 0 (4), Q 1 (7), and Q 2 (11) are collected in buffers Pbuf 0 (4), Pbuf 1 (7), and Pbuf 2 (11) first, and finally copied to the correponding cells in Q. See (A) for the situation after step d = 0 has been completed. In step d = 1, see (B), the algorithm computes Q 2 (10), starting in row k = 1 with the determination of Pbuf 1 (6) and Q 1 (6). That is, Q 1 (6) = Pbuf 1 (6) = Q 0 (4)·f(A) + Q 0 (4)·f(C) + Q 0 (4)·f(T) = . Analogously Q 2 (10) and Pbuf 2 (10) are computed based on Q 1 (7) and Q 1 (6). Additionally Pbuf 2 (9) is filled for further reuse in subsequent steps d + 1, d + 2,.... We compute Pbuf 2 (9) = Q 1 (6)·f(C) = . The algorithm can directly start in row k = 1 with the computation of Q 1 (6) instead of Q 0 (3) since a score of 3 cannot be achieved by the first prefix PSSM M 0 . Only score 4 of M 0 contributes to Q 2 (10), scores 2 and 1 do not. In step d = 2, see (C), the algorithm computes Q 2 (9), starting in row k = 0. Pbuf 2 (9) is computed reusing the partial sum calculated in previous steps, such that Pbuf 2 (9) = + Q 1 (7)·f(T) + Pbuf 1 (5)·f(A) = , and then copied to Q 2 (9). Pbuf 1 (4), Pbuf 2 (8), and Pbuf 2 (7) are filled based on Pbuf 0 (2), Q 1 (6), Pbuf 1 (5), and Q 1 (5) for further reuse. After step d = 2 the rest of the computation can be skipped since the cumulated probability Q 2 (11) + Q 2 (10) + Q 2 (9) = exceeds the given p-value π = and we obtain a score threshold of th = 10 corresponding to π. Step d=0 : t=11 Pbuf k (t) permuted/sorted PSSM: Step d=1 : t=10 Step d=2: t=9 In the present implementation, the algorithm assumes independently distributed symbols. The algorithm can be extended to an order d-Markov model (w.r.t. the background alphabet distribution). This increases the computation time by a factor of || d .

Implementation and computational results
We implemented LAsearch, ESAsearch, both capable to handle reduced alphabets, and LazyDistrib in C. The program was compiled with the GNU C compiler (version 3.1, optimization option -03). All measurements were performed on a 8 CPU Sun UltraSparc III computer running at 900 MHz, with 64 GB main memory (using only one CPU and a small fraction of the memory). Enhanced suffix arrays were constructed with the program mkvtree, see [34].
We performed seven experiments comparing different programs for searching PSSMs. Table 1 gives more details on the experimental input for Experiments 1-6. Results are given in Table 2 Table 1 for the definition of MSS.
In the experiments using protein PSSMs, ESAsearch is faster than the method of [13] by a factor between 1.5 and 1.8 (see Experiment 1). This is due to the better locality behavior of the enhanced suffix array compared to a suffix tree. For larger p-values LAsearch performs slightly better than ESAsearch. Increasing the stringency, the performance of ESAsearch increases, resulting in a speedup of factor 1.5 for a p-value of 10 -40 . We explain this behavior by

MatInspector
x FingerPrintScan x Blimps x DN00 x Overview of the sequences and PSSMs used in the performed experiments. For the experiments that use p-value or E-value cutoffs, we precomputed the cumulative score distributions and stored them on file. mdc is the time needed for this task. In Experiment 1 we measured the running time of the Java-program from [13], referred to by DN00. We ran DN00 with a maximum of 2 GB memory assigned to the Java virtual machine. DN00 constructs the suffix tree in main memory and then performs the searches. For a fair comparison, we therefore measured the total running time, and the time for matching the PSSMs (without suffix tree construction). For Experiment 2, we implemented the matrix similarity    protein PSSMs. Compared to the performance of ESAsearch operating on the normal 20 letter amino acid alphabet a speedup up to factor 2 can be achieved when using a 4 letter alphabet and a p-value cutoff of 10 -20 . Experiment 7 (see Figures 15 and 16) shows that the expected running time of ESAsearch is sublinear, whereas LAsearch runs in linear time. In a final experiment, we compared algorithm LazyDistrib with the DP-algorithm computing the complete distribution. LazyDistrib shows a speedup factor between 3 and 330 on our test set, depending on the stringency of the threshold (see Table 3).  pairing. Hence it is sufficient to build the enhanced suffix array for one strand only. This can then be used to search both strands.

PoSSuM software distribution
The cutoff can be specified as p-value, E-value, MSS (matrix similarity score), or raw score threshold. If only the best matches with the highest scores need to be known, then PoSSuMsearch can be asked to report only the k highest scoring matches without even specifying an explicit cutoff. To do so, the search algorithms dynamically adapt the threshold during the search. When using por E-values, the score threshold is determined by either the lazy dynamic programming algorithm introduced in this contribution, or read from file that stores the complete precalculated probability distribution. Background distributions can be specified arbitrarily by the user, or determined from a given sequence database. We provide a tool, PoSSuMdist, to generate a compressed file contain-ing the complete precalculated probability distribution for a set of PSSMs.
PSSM matches can be sorted by specifying a list of sort keys, like p-value, match score, sequence number, and so on. The output formats of PoSSuMsearch print out all available information about a match, either in a human readable format, tab delimited, or in machine readable, XML-based CisML [35]. PoSSuMsearch as well as PoSSuMdist support multi-threading for a further reduction of running time on multi CPU machines.
The PoSSuM software distribution includes the searching tool PoSSuMsearch itself, and additional tools to determine character frequencies from sequence data, for probability distribution calculation, and PSSM format converters for TRANSFAC, BLOCKS, PRINTS, and EMA-TRIX style PSSMs.

Discussion and conclusion
We have presented a new non-heuristic algorithm for searching with PSSMs, achieving expected sublinear running time. Our analysis of ESAsearch shows that for sequences not shorter than || m + m -1 a linear runtime in the worst case is achieved. It shows superior performance over the most widely used programs, especially for DNA sequences. The enhanced suffix array, on which the method is based, requires only 9n bytes. This is a space reduction of more than 45 percent compared to the 17n bytes implementation of [13]. Further on, we developed a systematic concept for alphabet reduction, especially useful on amino acid sequences and PSSMs for gaining additional speedup. Our third main contribution is a new algorithm for the efficient calculation of score thresholds from user defined E-values and p-values. The algorithm allows for accurate on-the-fly calculations of thresholds, and has the potential to replace formerly used approximation approaches. Beyond the algorithmic contributions, we provide a robust, well documented, and easy to use software package, implementing the ideas and algorithms presented in this manuscript.

Availability
The PoSSuM software distribution and its documentation is available precompiled for different operating systems and architectures on [24]. A version of mkvtree is included. A web based version of PoSSuMsearch is available under the same URL.