Computing expectation values for RNA motifs using discrete convolutions
 André Lambert^{1},
 Matthieu Legendre^{2},
 JeanFred Fontaine^{2, 3} and
 Daniel Gautheret^{2}Email author
https://doi.org/10.1186/147121056118
© Lambert et al; licensee BioMed Central Ltd. 2005
Received: 08 March 2005
Accepted: 13 May 2005
Published: 13 May 2005
Abstract
Background
Computational biologists use Expectation values (Evalues) to estimate the number of solutions that can be expected by chance during a database scan. Here we focus on computing Expectation values for RNA motifs defined by singlestrand and helix lodscore profiles with variable helix spans. Such Evalues cannot be computed assuming a normal score distribution and their estimation previously required lengthy simulations.
Results
We introduce discrete convolutions as an accurate and fast mean to estimate score distributions of lodscore profiles. This method provides excellent score estimations for all singlestrand or helical elements tested and also applies to the combination of elements into larger, complex, motifs. Further, the estimated distributions remain accurate even when pseudocounts are introduced into the lodscore profiles. Estimated score distributions are then easily converted into Evalues.
Conclusion
A good agreement was observed between computed Evalues and simulations for a number of complete RNA motifs. This method is now implemented into the ERPIN software, but it can be applied as well to any search procedure based on ungapped profiles with statistically independent columns.
Keywords
Background
The introduction of the Expectation value (Evalue) in the Blast program in 1990 [1] was a major milestone in the development of sequence search algorithms. For any sequence match with a score S obtained in a given database, the Evalue is the number of hits of same or higher score that can be expected by chance. Evalues tell biologists how likely they are to encounter a specific sequence match in a database search, which is a more useful view of biological significance than a mere similarity score. Except in some special cases, low Evalues are commonly accepted as an evidence of sequence homology.
The recent years have seen a growing interest for RNA motif searches, driven by the discovery of important new classes of regulatory RNA genes and motifs in all organisms. Noncoding RNAs are characterized in a large part by long range base pair interactions, whereas linear sequence constraints are not as important as in protein coding genes. As a result, standard sequence alignment programs such as Blast are not suited to RNA motif search. Computational biologists have addressed this problem in several ways, notably through descriptorbased systems in which the topology of base paired regions is userspecified [3, 5, 4], Stochastic Context Free Grammars (SCFG) which use a complete statistical model of RNA elements [6], and Secondary Structure Profiles, which use position weight matrices describing stems and single strands in the RNA motif, as found in the ERPIN program [8]. Although the last two methods compute alignment scores, the complexity of the underlying models and search algorithms has hampered the estimation of an Evalue to date.
When the behavior of the score distribution is known, such as in sequence alignment scores, Evalues can be computed either directly, or by empirically fitting a histogram of scores from a sample of random sequences to the assumed distribution function [1, 2]. Unfortunately, we will show here that a search algorithm such as the one used in ERPIN does not produce predictable score distributions. A possible workaround for this practical limitation is to run simulations on randomized sequences and use the observed hit count at score S as the Evalue for this score. However, since interesting high scoring solutions can be extremely rare (commonly less than one random occurrence per genome) simulations often require days of calculations. Can we then estimate score distributions without having to run such lengthy simulations?
In this article, we show that ERPIN score distributions can be estimated a priori through a discrete convolution analysis of score profiles, based on a random model of nucleotide frequencies. This led us to develop a computational procedure that estimates the score distribution of ERPIN profiles in a very short time, before the actual search begins, so that each solution can be automatically assigned an Evalue.
ERPIN profiles
ERPIN converts each helix and single strand in the alignment into a lodscore profile. This involves two steps. First, columns in the alignment are converted into frequency profiles, recording the frequencies of bases or basepairs in column c as:
Here n_{ ic }is the number of bases or basepairs of type i in column c, and N is the total base or basepair count in a column. There is one frequency profile for each single strand or helix in the alignment. In the case of a single strand, i ∈ {A, T, G, C}, whereas for a helix, i ∈ {AA, AT, AG, AC, ..., CC} and each column in a helical profile actually refers to two positions in the initial alignment. Single strand profiles thus have 4 rows while helix profiles have 16 rows. The special case of gapcontaining single strands, where a fifth character is added to the profile is discussed later on.
Frequency profiles are then converted into lodscore profiles, where values for column c are defined as:
where b_{ i }is the background frequency of base or basepair i in the target database. Basepair frequencies are considered as the product of individual base frequencies. The score of a helix or singlestrand element is obtained by presenting a target sequence to this element's profile and summing the scores obtained for every profile column. For ungapped elements, the calculation is straightforward. For gapcontaining single strands, a dynamic programming matrix of the profile and target sequence is constructed, which provides the best possible alignment score [8]. In a first stage, let us ignore this alignment procedure and focus on ungapped elements.
Exclusions and pseudocounts
We define as an exclusion a profile element for which no base or base pair is observed in the training set, and thus P_{ ic }= 0. Exclusions may be due either to an unsufficient size of the training set, or to a true avoidance of this particular base or basepair at this position in the RNA molecule. In any case, the logodd ratio formula would produce a value of ∞ for such cases, thus requiring a special treatment. Exclusions are dealt with either by using arbitrary low values (e.g. 30) or by introducing pseudocounts in the frequency matrix that simulate what could have been observed in a larger training set.
Pseudocounts are based on some prior knowledge of "typical" substitution frequencies in RNA molecules, as observed in a model RNA sequence alignment. The pseudocount calculation procedure used in ERPIN is the same spirit as that of Henikoff and Henikoff [9], but we use a different definition of pseudocounts, as explained below. Let us first reformulate the values in any column c of a frequency profile:
Where n_{ ij }is the number of {i, j} couples in column c, P(i, j) is the joint probability of finding i and j in this column, and . This develops into:
Where P(ij) is the conditional probability of observing i, knowing that j is observed in the column. This conditional probability amounts to the observed frequency of i → j substitutions. To introduce pseudocounts in ERPIN frequency profiles, P(ij) is replaced with the average substitution frequencies observed in a model RNA sequence alignment, expressed in the form of a substitution matrix M. Pseudocountbased frequencies can be expressed as:
where M is a square matrix whose columns j are normalized, ∑_{ i }M_{ ij }= 1, so that . See Methods section for construction of M. The relative ratio of pseudocounts to true counts in the final frequency matrix is then controlled by a userdefined weight parameter α ∈ [0, 1], such that:
Since M_{ ij }are generally ≠ 0 in the substitution matrices (either for single strand or helices), most exclusions in the frequency profile are replaced by nonzero values as soon as α ≠ 0. Not only the resulting lodscore profiles are basically devoid of arbitrary low values, but they better reflect "natural" base and basepair substitution frequencies observed in real RNA alignments. This is especially interesting in helical regions, since the substitution matrix for helices represent natural exchanges between frequent basepairs such as WatsonCrick, G:U or even G:A, while incurring strong penalties for exchanges involving rare base pairs. This maintains a large fraction of strongly negative values in helix profiles, which is desirable for the sake of search specificity.
Shapes of score distributions: finite and nonfinite scores
We define as a "finite" score the score obtained for a sequence that does not contain any match to a profile exclusion. When pseudocounts are used, almost all scores are finite, but when pseudocounts are not in use (α = 0), many scores, especially in helix profiles, are "nonfinite", although in practice they are replaced by arbitrary low values.
Let S denote a finite score obtained at a given site in a random sequence. For any x > ∞, the conditional probability formula reads:
P(S > x) = P(S > ∞).P(S > xS > ∞) (7)
In this decomposition, it is noteworthy that:

The first factor is independent of x as soon as x is finite,

The second factor can be computed based on profile elements that contain no exclusion.
Let S_{ c }(c = 1, 2, .., w) denote a score for column c of a profile, and S = S_{1} + S_{2} + ... + S_{ w }the score obtained by presenting a given sequence to this profile. If w is large enough (w ≳ 10) and the distributions of random variables S_{ c }are (i) independent and (ii) identically distributed, the sum S follows a normal distribution (central limit theorem [11]).
Due to exclusions arising with different frequencies in different columns, condition (ii) is generally not fulfilled, but, using the decomposition given by formula (7) rewritten for column c, we can write:
P(S_{ c }> x) = P_{fs,c}.P_{f,c}(x) (8)
where P_{ fs }is the probability that a score is finite, and P_{ f }(x) is the probability that a finite score is higher than x. This gives, for a complete profile:
In many cases, the probability distributions of finite scores P_{f,c}are similar enough so that their sum S is normally distributed. If this behavior was always observed, the score distribution could be fully determined by computing the mean value μ and the standard deviation σ that characterize the normal law, which would enable a direct calculation of the Evalue [2]:
So called "nonfinite" scores may be biologically relevant, since many valid substitutions are potentially absent from the training set. However, nonfinite scores are detected only when high enough to fall into the extreme end of the distribution. This part of the distribution is composed mainly of finite scores and should thus behave like that of finite scores. Unfortunately finite scores may also deviate from a Gaussian distribution, for instance when score distributions in successive columns are too different from each other.
Score distributions of helices and singlestrands
Ungapped helices and single strands
How can we estimate score distributions such as those in Fig 3, 4 ? Let us admit that profile columns are independent. After matching the profile to a purely random sequence, the resulting scores for each column would thus behave as independent random variables, say X_{1} and X_{2} for columns 1 and 2. Therefore, the final score for two columns would be:
S = X_{1} + X_{2}
Then, the probability of obtaining a score S = x is:
The last formula defines the discrete convolution product [11] of two distributions. The overall score distribution can be obtained by doing the calculation for every possible values of u and v.
Using the separation formula (7) between "exclusions" and "finite scores", equation (10) can be written:
These operations can easily be extended to N columns by iterating the products on successive columns in the same singlestrand or helix profile. At each successive iteration, scores are discretized on a predefined grid so that the number of possible scores increases linearly with the number of columns (see Methods section for algorithm).
We performed such an analysis on a variety of helix and single strand profiles, with grid intervals set at Δx = .05. In Figures 3 and 4, score distributions estimated from discrete convolution (solid lines) are compared to scores obtained through simulation on a random database of variable size (shaded bars). There is a very good agreement between the discrete convolution and simulation.
Gapcontaining single strands
The score of a gapcontaining single strand in the ERPIN program is computed from the dynamic programming alignment matrix. Therefore, it is the maximum of several values, and could be expected to comply with an extreme value distribution. However, gapped single strands in ERPIN are very diverse entities that may include oddities such as singlenucleotide strands, or strands mostly filled with gaps. This results in very uneven distributions that we were not able to model satisfyingly. Therefore, the score distribution of a gapped single strand in ERPIN is currently estimated based on a short simulation performed on a random sequence (see Methods section for details).
Score distributions of complete regions with gaps
Score of a configuration
When presenting a sequence to a whole region, the presence of gaps in singlestrands results in multiple allowed positions for flanking helical elements. A configuration is a specific arrangement of helix elements determined by the number of intervening gaps (Fig 1). There is one score for each allowed configuration, which is the sum of scores for all helices and single strands in this configuration. We therefore need to compose the different score distributions to obtain the distribution of the total score for one configuration. This is again done using a discrete convolution of these distributions, with the same procedure and grid parameter as above. This provides the score distribution of a single configuration. Note that, although a configuration may contain gapped single strands of which score distribution was not produced by a discrete convolution, such distributions can now be treated by this second convolution round applied to whole profiles.
Score of a complete region
The number of gaps in a single strand is bounded by the maximum number of gaps observed for this strand in the training set: mxgaps. For a simple hairpinloop motif with mxgaps possible gaps in the loop, there are (mxgaps + 1) possible configurations. For a whole region containing N strands with gaps (i = 1, 2, ..., N), the number of configurations is:
Erpin evaluates all possible configurations without any construction rule or strategy. A combinatorial explosion is avoided by implementing multistage searches, where search at each stage is limited to a defined mask, or subset of the region under study. The score of a region or mask at in given site is the maximum score obtained for all possible configurations.
Since a motif is identified only after all possible configurations are evaluated at a given site, our estimation of motif scores requires taking into account this additional complexity. Let K denote the number of configurations for a given motif, and S_{ i }, the score obtained for the i th configuration. As the ERPIN program does not permit the of addition gaps relative to those present in the training set, K is necessarily bounded but its value can be relatively large. We are now interested in the maximal score obtained for all configurations at each site. This is the extreme value distribution, or the distribution of a random variable M defined as:
If P_{ fs }= P(S_{ i }> ∞) and p_{ i }(x) = P(S_{ i }> xS_{ i }> ∞), and the random variables S_{ i }are statistically independent (s.i) and identically distributed (i.d), then:
The most "interesting" scores are expected to be of the same order of magnitude as those obtained by training set sequences. For any realistic training set, these scores should be very high compared to scores obtained on random sequences and, therefore, their probability should be very low. In this case P(M > x), given by formula (17), behaves at the first order approximation as K.P_{ fs }.p(x).
For a database of size Ω, considering that individual sequences in the database are large enough compared to the search motif so that border effects can be ignored, the final Evalue, is :
E(x) = P(M > x).Ω (18)
In the case where pseudocounts are switchedoff, profiles contain multiple "nonfinite scores" which are excluded from the convolution process. This may imply a lack of accuracy in the lefthand side of the estimated score distribution, where scores have low values, but should have little effect in the region of biologically interesting scores. Therefore we do not expect Evalues to deteriorate significantly in practise when pseudocounts are switched off.
Conclusion
We have presented a method to estimate the score distributions of RNA helices or single strand profiles and of their combinations into larger motifs. This method is based on discrete convolutions. The computing time of the discrete convolution algorithm increases quadratically with profile size and remains in any case negligible relative to database scan durations. This procedure is implemented in the last release of the ERPIN software (V. 4.2) and provides accurate estimates of Evalues for practical applications. Interestingly, the discrete convolution approach can be applied as well to others sequence scoring models nucleic acids or proteins – based on ungapped profiles with independent columns.
Methods
ERPIN program and utilities
The ERPIN program (sources and executables) is available at http://tagc.univmrs.fr/erpin/. Simulated score distributions of independent helix and singlestrand profiles (Fig 2, 3, 4) were obtained using the hist (histogram) option of ERPIN and utility programs epnstat, convhstat, convsstat and mstat provided in the distribution. For Fig 5, full motif searches were performed using ERPIN Version 4.2.5 with pseudocount weight α = 2.10^{4}. Graphical outputs for figures 2, 3, 4, 5 were produced using the Matlab [13] package.
Training sets
Training sets for profile statistics and ERPIN runs in Fig 2, 3, 4, 5 are available on the ERPIN web site and were obtained as follows:

tRNA: 903 type I tRNA sequences (all species) from the 1997 version of M. Sprinzl's nuclear tRNA alignment [17].

SECIS (Selenocystein Insertion Sequence): 117 metazoan SECIS sequences, from our own compilation [7].

snoRNA (Small nucleolar RNA): 217 archaean C/D box snoRNA sequences, compiled and aligned by Fabrice Leclerc at CNRS Nancy (Personal communication).

Let7 miRNA: 27 animal miRNA precursor sequences from our previous compilation [15].

PolyA sites: 2327 human polyadenylation sequences from our previous compilation [16].
RNA substitution matrices for pseudocounts
Pseudocount calculation requires substitution matrices obtained from a model RNA sequence alignment or "training set", annoted with secondary structure information (helix or single strand). Klein and Eddy have developed RNA substitution matrices previously [18], but we use a different type here. Training set columns are converted into profiles containing raw nucleotide or basepair counts. Let Q denote a nucleotide or basepair count profile of width w, produced by a concatenation of all single strand or helix profiles from the pseudocount training set. Q is either a 4line matrix for single strands (h = 4) or a 16line matrix for helices (h = 16). The substitution matrix M introduced in Section "Exclusions and Pseudocounts" is then a square matrix of size h × h defined as:
The square matrix N is actually a "correlation matrix" of the profile lines since element N_{ ij }is the scalar product correlation of lines i and j. Coefficients λ_{ j }make this matrix normalized, so that:
Probability conservation is verified for P and therefore it is also verified for P':
Finally, it is obvious from formula (6) that P" also verifies verifies probability conservation, hence:
Substitution matrices can be generated from any RNA sequence alignment using the utility program mksum of the ERPIN distribution. Default matrices provided with distribution (SUM.dat file) were obtained using a 16S/18S rRNA training set from R.Gutell ([10]) containing 6310 sequences from all three phylogenetic domains. We used the secondary structure of Ecoli 16S rRNA as the consensus structure, resulting in 481 columns of helix profile and 7512 columns of single strand profile.
Option pcw is used to set pseudocount weight α in the ERPIN program. Default internal value is 2.10^{4}, but this has been rescaled for users by a factor of 2.10^{3} giving a default user value of 0.1 and a practical maximal value that should not exceed 1. Effects of pseudocounts and of the α parameter on profiles can be visualized using the utility program pview.
Score distribution of gapcontaining singlestrands
The score distribution of gapcontaining single strands is evaluated by repeatedly computing profile scores with a random sequence of same length L as the profile and same composition as the target sequence database. The calculation is repeated C.L^{2} times, with C a constant, and L <L_{ max }in order to limit CPU time for unusually large strands. Default values are C = 300 and L_{ max }= 12.
Histograms and discrete convolution product
Although discrete convolutions can be computed using iterated Fast Fourier Transforms, this approach is subject to numerical approximations in practice. A direct calculation is more accurate and proved fast enough in all cases tested. Time complexity of the discrete convolution algorithm is O(N^{2}) where N is the total number of profile columns. This value remains tractable even for the largest RNA motifs. The convolution algorithm was adapted from those found in the Octave [12] and Matlab [13] packages. The linear sampling interval was set at Δx = .05. CPU time for the complete Evalue calculation (including profile construction, convolution of independent profiles and convolution of configurations) for motifs in Fig 5 ranged from 10^{4}s to 0.8s on a 2.6 GHz Intel Pentium workstation with 1 Gb of RAM.
Extreme value distribution
Formula (17) used for calculating the extreme value distribution is of the type (1  (1  x)^{ N }). If x.N > 1 the result is obtained with the C library function x x^{ y }which lacks precision when x ≪ 1. Otherwise we compute (17) using the binomial formula for (1  x)^{ N }.
Declarations
Acknowledgements
We thank the ACI IMPBio program for their support to the development of the ERPIN software and Web server.
Authors’ Affiliations
References
 Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ: Basic local alignment search tool. J Mol Biol 1990, 215: 403–10. 10.1006/jmbi.1990.9999View ArticlePubMedGoogle Scholar
 Karlin S, Altschul SF: Methods for assessing the statistical significance of molecular sequence features by using general scoring schemes. Proc Natl Acad Sci U S A 1990, 87: 2264–8.PubMed CentralView ArticlePubMedGoogle Scholar
 Gautheret D, Major F, Cedergren R: Pattern searching/alignment with RNA primary and secondary structures: an effective descriptor for tRNA. Comput Appl Biosci 1990, 6: 325–31.PubMedGoogle Scholar
 Billoud B, Kontic M, Viari A: Palingol: a declarative programming language to describe nucleic acids' secondary structures and to scan sequence database. Nucleic Acids Res 1996, 24: 1395–403. 10.1093/nar/24.8.1395PubMed CentralView ArticlePubMedGoogle Scholar
 Macke TJ, Ecker DJ, Gutell RR, Gautheret D, Case DA, Sampath R: RNAMotif, an RNA secondary structure definition and search algorithm. Nucleic Acids Res 2001, 29: 4724–35. 10.1093/nar/29.22.4724PubMed CentralView ArticlePubMedGoogle Scholar
 Eddy SR, Durbin R: RNA sequence analysis using covariance models. Nucleic Acids Res 1994, 22: 2079–88.PubMed CentralView ArticlePubMedGoogle Scholar
 Lambert A, Lescure A, Gautheret D: A survey of metazoan selenocysteine insertion sequences. Biochimie 2002, 84: 953–9. 10.1016/S03009084(02)014414View ArticlePubMedGoogle Scholar
 Gautheret D, Lambert A: Direct RNA motif definition and identification from multiple sequence alignments using secondary structure profiles. J Mol Biol 2001, 313: 1003–11. 10.1006/jmbi.2001.5102View ArticlePubMedGoogle Scholar
 Henikoff JG, Henikoff S: Using substitution probabilities to improve positionspecific scoring matrices. Comput Appl Biosci 1996, 12: 135–43.PubMedGoogle Scholar
 Cannone JJ, Subramanian S, Schnare MN, Collett JR, D'Souza LM, Du Y, Feng B, Lin N, Madabusi LV, Muller KM, Pande N, Shang Z, Yu N, Gutell RR: The comparative RNA web (CRW) site: an online database of comparative sequence and structure information for ribosomal, intron, and other RNAs. BMC Bioinformatics 2002, 3: 2. 10.1186/1471210532PubMed CentralView ArticlePubMedGoogle Scholar
 Feller W: An Introduction to Probability Theory and its Applications. Third edition. John Wiley & sons; 1968.Google Scholar
 Eaton JW: GNU Octave Manual: A highlevel interactive langage for numerical computations.1997. [http://www.octave.org/docs.html]Google Scholar
 Matlab: HighPerformance Numeric Computation and Visual Software. The MathWorks, IncGoogle Scholar
 Press WH, Teukolsky SA, Vetterling WT, Flannery BP: Numerical Recipes in C. Second edition. Cambridge University Press; 1994.Google Scholar
 Legendre M, Gautheret D: Sequence determinants in human polyadenylation site selection. BMC Genomics 2003, 4: 7. 10.1186/1471216447PubMed CentralView ArticlePubMedGoogle Scholar
 Legendre M, Lambert A, Gautheret D: Profilebased detection of microRNA precursors in animal genomes. Bioinformatics 21(7):841–5. 2005 Apr 1 10.1093/bioinformatics/bti073Google Scholar
 Sprinzl M, Dank N, Nock S, Schon A: Compilation of tRNA sequences and sequences of tRNA genes. Nucl Acids Res 1991, 19: 2127–2171.PubMed CentralView ArticlePubMedGoogle Scholar
 Klein RJ, Eddy SR: RSEARCH: Finding homologs of single structured RNA sequences. BMC Bioinformatics 2003, 4: 44. 10.1186/14712105444PubMed CentralView ArticlePubMedGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.