Volume 13 Supplement 5
Selected articles from the First IEEE International Conference on Computational Advances in Bio and medical Sciences (ICCABS 2011): Bioinformatics
Maximum expected accuracy structural neighbors of an RNA secondary structure
 Peter Clote^{1, 2, 3}Email author,
 Feng Lou^{2}Email author and
 William A Lorenz^{4}
https://doi.org/10.1186/1471210513S5S6
© Clote et al.; licensee BioMed Central Ltd. 2012
Published: 12 April 2012
Abstract
Background
Since RNA molecules regulate genes and control alternative splicing by allostery, it is important to develop algorithms to predict RNA conformational switches. Some tools, such as paRNAss, RNAshapes and RNAbor, can be used to predict potential conformational switches; nevertheless, no existent tool can detect general (i.e., not family specific) entire riboswitches (both aptamer and expression platform) with accuracy. Thus, the development of additional algorithms to detect conformational switches seems important, especially since the difference in free energy between the two metastable secondary structures may be as large as 1520 kcal/mol. It has recently emerged that RNA secondary structure can be more accurately predicted by computing the maximum expected accuracy (MEA) structure, rather than the minimum free energy (MFE) structure.
Results
Given an arbitrary RNA secondary structure S_{0} for an RNA nucleotide sequence a = a_{1},..., a_{ n }, we say that another secondary structure S of a is a kneighbor of S_{0}, if the base pair distance between S_{0} and S is k. In this paper, we prove that the Boltzmann probability of all kneighbors of the minimum free energy structure S_{0} can be approximated with accuracy ε and confidence 1  p, simultaneously for all 0 ≤ k < K, by a relative frequency count over N sampled structures, provided that $N>N\left(\epsilon ,p,K\right)=\frac{{\Phi}^{1}{\left(\frac{p}{2K}\right)}^{2}}{4{\epsilon}^{2}}$, where Φ(z) is the cumulative distribution function (CDF) for the standard normal distribution. We go on to describe the algorithm RNAborMEA, which for an arbitrary initial structure S_{0} and for all values 0 ≤ k < K, computes the secondary structure MEA(k), having maximum expected accuracy over all kneighbors of S_{0}. Computation time is O(n^{3} · K^{2}), and memory requirements are O(n^{2} · K). We analyze a sample TPP riboswitch, and apply our algorithm to the class of purine riboswitches.
Conclusions
The approximation of RNAbor by sampling, with rigorous bound on accuracy, together with the computation of maximum expected accuracy kneighbors by RNAborMEA, provide additional tools toward conformational switch detection. Results from RNAborMEA are quite distinct from other tools, such as RNAbor, RNAshapes and paRNAss, hence may provide orthogonal information when looking for suboptimal structures or conformational switches. Source code for RNAborMEA can be downloaded from http://sourceforge.net/projects/rnabormea/ or http://bioinformatics.bc.edu/clotelab/RNAborMEA/.
Background
RNA secondary structure conformational switches play an essential role in a number of biological processes, such as regulation of viral replication [1] and of viroid replication [2], regulation of R1 plasmid copy number in E. coli by hok/sok system [3], transcriptional and translational gene regulation in prokaryotes by riboswitches [4], regulation of alternative splicing in eukaryotes [5], and stressresponsive gene regulation in humans [6], etc. Due to the biological importance of conformational switches, several groups have developed algorithms that attempt to recognize switches  in particular, thermodynamicsbased methods such as paRNAss[7], RNAshapes[8], RNAbor[9], as well as an approach using the second eigenvalue of the Laplacian matrix [10].
Riboswitches are portions of the 5' untranslated region (UTR) of messenger RNAs, experimentally known to regulate genes in bacteria by allostery [4], and to regulate alternative splicing of the gene NMT1 in the eukaryote Neurospora crassa [5]. Riboswitches are composed of a 5' aptamer and a 3' expression platform. Since the aptamer binds to a specific ligand with high affinity (K_{ D } ≈ 5 nM), thus triggering the conformational change of the expression platform upon ligand binding [11], its sequence and secondary structure tend to be highly conserved. In contrast, there is lower sequence for the expression platform, which forms a bistable switch, effecting gene regulation by premature abortion of transcription (as in guanine riboswitches [12]), or by sequestering the ShineDalgarno sequence (as in thiamine pyrophosphate riboswitches [13]). Due to the conserved sequence and secondary structure within the aptamer, all existent algorithms (to the best of our knowledge), such as [14–16], attempt only to detect riboswitch aptamers, without the expression platform. In addition to these specific algorithmic approaches, more general computational tools that rely on stochastic context free grammars, such as Infernal[17] and CMFinder[18], have been trained to recognize riboswitch aptamers; in particular, Infernal was used to create the Rfam database [19], which includes 14 families of riboswitch aptamers.
Since current riboswitch detection algorithms do not attempt to predict the location of the expression platform, we have developed tools, RNAborSample and RNAborMEA, described in this paper, which yield information concerning alternative, or suboptimal, structures of a given RNA sequence. These tools can suggest the presence of a conformational switch; however, much more work must be done to actually produce a riboswitch gene finder, part of the difficulty due to the fact that riboswitch aptamers contain pseudoknots that cannot be captured by secondary structure.
The sum in the numerator is taken over all secondary structures of the given RNA sequence, that contain base pair (i, j), while the sum in the denominator is taken over all secondary structures of the given RNA sequence. Thus p(i, j) is the sum of the Boltzmann factors of all secondary structures that contain the fixed base pair (i, j), divided by the partition function, which latter is the sum of Boltzmann factors of all secondary structures. In fact, Kiryu et al. [26] describe an algorithm to compute the MEA structure common to all RNAs in a given alignment. Later, Lu et al. [28] rediscovered Kiryu's method; in addition, Lu et al. computed suboptimal MEA structures by implementing an analogue of Zuker's method [29].
Our motivation in developing both RNAborSample and RNAborMEA, was to simplify and improve our previous software, RNAbor, in detecting conformational switches. Since RNAbor computes the minimum free energy structure, MFE(k), over all structures having base pair distance k from an initially given structure S_{0}, a complex portion of the code in RNAbor concerns the retrieval of free energy parameters from the Turner model [22, 23]. The idea of RNAborMEA was to compute the base pairing probabilities p(i, j)  see equation (1)  by McCaskill's algorithm using RNAfold, then to compute the maximum expected accuracy structure, MEA(k), which needs no retrieval of energy parameters, and which we hoped would be very similar to the MFE(k) structure, in light of previously mentioned results [26, 28]. Surprisingly, it turns out that MEA(k) structures are quite different from MFE(k) structures, as shown later in one of the figures.
In this paper, we begin by showing rigorously how to approximate the output of RNAbor by frequency counts from sampling, using Sfold[30]. We then extend the MEA technique to compute the maximum expected accuracy kneighbor of a given RNA secondary structure S_{0}; i.e., that secondary structure which has maximum expected accuracy over all structures that differ from S_{0} by exactly k base pairs. By analyzing the family of purine riboswitches, obtained by retrieving full riboswitch sequences (aptamer and expression platform) from corresponding EMBL genomic data, by extending the aptamers from the seed alignment of Rfam family RF00167 [31], we show that our software RNAborMEA produces strikingly different results from other software that produce suboptimal structures (RNAbor, RNAborSample, RNAlocopt, RNAshapes, UNAFold).
Since the detection of computational switches remains an open problem, despite the success of some tools such as RNAshapes and RNAbor, we feel the addition of the tool RNAborMEA could prove useful, since it appears to be orthogonal to all other methods of generating suboptimal secondary structures.
Results and discussion
 1.
We describe a Python script RNAborSample that approximates the output ${p}_{k}=\frac{{Z}_{k}}{Z}$ of RNAbor by frequency counts ${\widehat{p}}_{k}$ from sampled structures, for all 0 ≤ k ≤ 2n, using Sfold[30], or RNAsubopt p[32].
 2.We prove that for any desired accuracy 0 < ε and probability 0 < α < 1, if at least$N\left(\epsilon ,p,K\right)=\frac{{\Phi}^{1}{\left(\frac{p}{2K}\right)}^{2}}{4{\epsilon}^{2}}$(2)
 3.We develop an algorithm, RNAborMEA, running in time O(n^{3} · K^{2}) and space O(n^{2} · K), which computes simultaneously for all 0 ≤ k ≤ K, the maximum expected accuracy kneighbors of a given RNA secondary structure S_{0}; i.e., for each 0 ≤ k ≤ K, RNAborMEA computes that structure S_{ k } which has maximum expected accuracy over all structures that differ from S_{0} by exactly k base pairs. The algorithm RNAborMEA additionally computes, for each 0 ≤ k ≤ K, the pseudo partition function$\stackrel{\u0303}{{Z}_{k}}=\sum _{\left\{S:{d}_{\text{BP}}\left(S,{S}_{0}\right)=k\right\}}\text{exp}\left(MEA\left(S\right)/RT\right).$
Moreover, RNAborMEA allows the user to stipulate (partial) hard constraints, that stipulate whether particular nucleotides are unpaired, or basepair with certain other nucleotides. The implementation of hard constraints follows ideas from Mathews [33], albeit suitably modified to simultaneously consider all kneighbors, for 0 ≤ k ≤ K.
We now describe the 13 figures and 4 tables, corresponding to computational experiments performed with RNAborSample and RNAborMEA. These tables and figures are only briefly described, and we refer the reader to the captions of the figures and tables, which explain the results in greater detail.
Number of samples needed for highconfidence approximation of Boltzmann probabilities
P  K  ε  z  N 

0.05  1  0.01  1.45  9506 
0.05  100  0.01  3.48  30276 
0.05  1000000  0.01  5.45  74256 
0.001  100  0.01  3.89  37830 
0.000001  100  0.01  5.73  82082 
0.05  1  0.001  1.45  950600 
0.05  100  0.001  3.48  3027600 
Comparison of NestedAlign similarity scores for the GENE ON structure of the XPT guanine riboswitch
index  EMBL  RNAbor  RNAborMEA  RNAborSample  RNAlocopt  RNAshapes  UNAFold 

0  AL591981/205922205823  9.0  5.0  9.0  8.5  9.0  9.0 
1  CP000764/271074271175  43.5  5.0  37.5  44.5  23.0  53.0 
2  CP000764/308099308200  27.0  18.0  24.5  31.5  25.5  22.0 
3  BA000028/760473760574  25.5  0.5  36.0  38.5  24.5  31.0 
4  CP000557/252200252301  9.5  8.5  9.5  8.5  10.0  12.0 
5  X83878/168267  60.0  87.5  57.0  66.0  64.0  59.0 
6  BA000004/15930741592973  35.0  16.5  13.5  21.5  19.0  13.5 
7  AAOX01000023/1944619345  15.0  2.0  13.0  18.5  13.5  15.5 
8  CP000416/17980401798138  5.5  1.5  1.5  12.0  4.5  4.5 
9  CP000721/398929399026  26.0  24.5  16.5  20.0  21.5  32.0 
10  BA000028/11039431104044  1.0  1.5  2.0  0.5  0.5  0.5 
11  ABDQ01000002/251055251152  16.0  2.5  16.5  21.5  17.5  22.5 
12  AAXV01000026/3133431233  11.5  6.0  1.5  8.5  22.0  3.0 
13  AE016877/298774298875  18.5  14.0  17.5  34.0  12.0  26.5 
14  BA000004/676475676576  28.5  31.0  28.0  69.0  21.0  29.5 
15  AE017333/692981693082  1.5  2.5  11.5  9.5  5.5  53.0 
16  AM180355/256217256318  17.0  45.0  45.5  49.0  48.0  49.0 
17  AM406671/13210621320965  25.5  15.0  22.0  28.5  23.5  23.5 
18  CP000612/25981112598012  42.0  39.5  42.0  47.5  39.0  38.5 
19  CP000002/697032697134  8.0  11.0  10.5  10.0  4.5  7.5 
20  CP000002/22959362295837  23.5  47.0  31.5  21.0  30.0  22.5 
21  AL596170/223345223246  0.5  7.0  0.5  8.5  10.0  10.0 
22  ABDQ01000005/131908131807  33.0  15.5  31.5  31.5  19.0  50.0 
23  AAOX01000052/90698968  13.5  1.5  14.0  21.0  15.5  14.5 
24  AE017333/40243244024425  29.5  26.5  33.5  24.0  23.5  36.0 
25  AP006627/15547171554818  31.5  1.5  37.0  44.5  28.5  43.5 
26  CP000024/11829481183043  0.5  18.5  9.0  4.0  2.0  19.0 
27  BA000028/786767786867  18.0  41.5  48.0  46.5  49.0  44.5 
28  ABDP01000002/2968829587  34.5  42.5  34.5  37.0  35.0  50.0 
29  BA000043/272473272574  9.5  4.0  9.5  10.0  3.0  12.5 
30  CP000724/944285944386  30.5  21.5  30.5  28.5  26.5  31.5 
31  CP000764/14097251409826  14.0  3.0  18.0  24.0  11.5  20.0 
32  AAEK01000017/8643786538  44.5  44.0  41.5  52.0  35.0  49.0 
33  CP000764/357645357544  11.0  13.5  33.0  26.0  18.5  36.0 
Comparison of NestedAlign similarity scores for the GENE OFF structure of the XPT guanine riboswitch
Index  EMBL  RNAbor  RNAborMEA  RNAborSample  RNAlocopt  RNAshapes  UNAFold 

0  AL591981/205922205823  27.5  28.5  28.5  25.5  25.5  25.5 
1  CP000764/271074271175  13.0  12.5  11.0  6.5  12.0  5.5 
2  CP000764/308099308200  24.0  26.0  26.5  23.0  24.5  26.5 
3  BA000028/760473760574  18.5  22.0  13.0  20.5  23.5  23.0 
4  CP000557/252200252301  7.0  8.0  7.0  10.0  6.5  4.5 
5  X83878/168267  143.0  143.5  143.0  141.0  143.0  141.0 
6  BA000004/15930741592973  41.0  39.0  41.0  36.0  38.0  41.0 
7  AAOX01000023/1944619345  47.5  45.5  46.0  42.5  34.0  43.5 
8  CP000416/17980401798138  17.5  12.5  12.5  13.0  11.5  12.5 
9  CP000721/398929399026  36.5  20.5  23.0  38.5  34.5  52.5 
10  BA000028/11039431104044  32.0  29.5  32.0  27.5  30.5  30.0 
11  ABDQ01000002/251055251152  27.0  26.0  26.5  24.0  25.5  7.5 
12  AAXV01000026/3133431233  37.5  38.5  38.0  32.5  35.0  36.0 
13  AE016877/298774298875  24.0  25.5  23.0  19.0  23.0  22.5 
14  BA000004/676475676576  9.0  4.5  6.5  35.5  5.0  9.0 
15  AE017333/692981693082  30.0  9.5  23.5  25.5  17.0  70.5 
16  AM180355/256217256318  23.5  24.0  25.0  27.0  23.5  27.0 
17  AM406671/13210621320965  0.5  3.5  1.0  10.0  1.0  0.5 
18  CP000612/25981112598012  12.0  9.0  8.0  8.5  9.5  9.0 
19  CP000002/697032697134  16.5  7.0  12.0  14.0  16.5  7.5 
20  CP000002/22959362295837  75.0  73.0  75.5  71.0  72.0  69.5 
21  AL596170/223345223246  30.5  31.5  30.5  28.5  29.5  29.5 
22  ABDQ01000005/131908131807  12.5  3.0  13.0  10.5  13.5  4.5 
23  AAOX01000052/90698968  12.5  14.5  13.5  11.0  12.0  12.0 
24  AE017333/40243244024425  3.5  2.5  3.5  6.0  2.5  1.5 
25  AP006627/15547171554818  22.5  18.0  22.5  14.5  25.5  12.5 
26  CP000024/11829481183043  6.0  7.0  6.5  6.0  5.0  6.0 
27  BA000028/786767786867  23.5  19.5  23.0  24.5  21.0  24.0 
28  ABDP01000002/2968829587  3.0  1.0  2.5  1.0  4.5  0.5 
29  BA000043/272473272574  17.5  12.5  12.5  13.5  12.5  11.5 
30  CP000724/944285944386  10.0  11.0  10.5  7.0  12.0  9.5 
31  CP000764/14097251409826  32.5  36.0  32.0  26.5  35.0  30.5 
32  AAEK01000017/8643786538  11.5  11.5  13.0  8.0  13.0  11.0 
33  CP000764/357645357544  23.5  22.0  24.5  24.0  22.0  22.5 
Number of times that the most similar structure was produced
Method  greatest similarity to gene on  greatest similarity to gene off 

RNAborMEA  18  11 
RNAbor  7  11 
RNAborSample  2  8 
RNAlocopt  3  2 
RNAshapes  5  8 
UNAFold  1  3 
The figures and tables show, in summary, that RNAborMEA provides useful suboptimal structures, which may be closer to metastable structures of a conformational switch than more traditional methods, which rely on searching for low energy structures.
Conclusions
where first sum is taken over all base pairs (i, j) belonging to S, and the second sum is taken over all unpaired positions in S, and where p_{i,j}[resp. q_{ i }] is the probability that i, j are paired [resp. i is unpaired] in the ensemble of low energy structures, as computed by McCaskill's algorithm [27]. Finally, RNAborMEA allows the user to sample structures from the maximum expected accuracy ensemble, in a fashion analogous to DingLawrence sampling from the low energy Boltzmann ensemble, as in Sfold[30].
Our preliminary investigations have not indicated a clear application of the partition function analogue, though it may be construed to provide a representation of the temperaturedependent mixing of various structures having large score σ. On the other hand, in computational experiments reported in the Results Section, it appears that RNAborMEA produces nearoptimal structures that are closer to the biologically functional structures, in the case of conformational switches that are difficult to predict by any method.
Indeed, in 18 [resp. 11] out of 34 instances, RNAborMEA produced the secondary structure most structurally similar to the experimentally determined XPT GENE ON [resp. GENE OFF] structure, as measured by NestedAlign[36]. See Table 4. Since there appears to be little to no correlation between the structures MFE(k) output by RNAbor[20] and the structures MEA(k) output by our current program RNAborMEA, it appears that RNAborMEA yields a signal that is orthogonal and complementary to that provided by stateoftheart thermodynamics software, such as UNAFold, RNAfold, RNAstructure, Sfold, RNAshapes, RNAbor, etc. For these reasons, we feel that RNAborMEA has a certain value, along with the programs UNAFold, RNAfold, RNAstructure, Sfold, RNAshapes, RNAbor, etc. when producing suboptimal structures. RNAborMEA is written in C and available at http://sourceforge.net/projects/rnabormea/ and http://bioinformatics.bc.edu/clotelab/RNAborMEA/.
Methods
Preliminaries
Recall the definition of RNA secondary structure.
Definition 1 A secondary structure S on RNA sequence a_{1}, .. ., a_{ n } is defined to be a set of ordered pairs (i, j), such that 1 ≤ i < j ≤ n and the following are satisfied.
1. WatsonCrick or GU wobble pairs: If (i, j) belongs to S, then pair (a_{ i }, a_{ j }) must be one of the following canonical base pairs: (A, U), (U, A), (G, C), (C, G), (G, U), (U, G).
2. Threshold requirement: If (i, j) belongs to S, then j  i > θ, where θ, generally taken to be equal to 3, is the minimum number of unpaired bases in a hairpin loop; i.e., there must be at least θ unpaired bases in a hairpin loop.
3. Nonexistence of pseudoknots: If (i, j) and (k, ℓ) belong to S, then it is not the case that i < k < j < ℓ.
4. No base triples: If (i, j) and (i, k) belong to S, then j = k; if (i, j) and (k, j) belong to S, then i = k.
The preceding definition provides for an inductive construction of the set of all secondary structures for a given RNA sequence a_{1}, .. ., a_{ n }. For all values of d = 0, .. ., n and all values of i = 1, .. ., n  d, the collection ${S}_{i,i+d}$of all secondary structures for a_{ i }, .. ., a_{i+d}is defined as follows. If 0 ≤ d ≤ θ, then ${S}_{i,i+d}=\left\{\varnothing \right\}$; i.e., the only secondary structure for a_{ i }, .. ., a_{i+d}is the empty structure containing no base pairs (due to the requirement that all hairpins contain at least θ unpaired bases). If d > θ and ${S}_{i,j}$ has been defined by recursion for all i ≤ j < i + d, then

Any secondary structure of a_{ i }, .. ., a_{i+d1}is a secondary structure for a_{ i }, .. ., a_{i+d}, in which a_{i+d}is unpaired.

If a_{ i }, a_{ j }can form a WatsonCrick or wobble base pair, then for any secondary structure S for a_{i+1}, .. ., a_{i+d1}, the structure S ∪ {(i, j)} is a secondary structure for a_{ i }, ..., a_{i+d}.

For any intermediate value i + 1 ≤ r ≤ j  θ  1, if a_{ r }, a_{ j }can form a WatsonCrick or wobble base pair, then for any secondary structure S for a_{ i }, .. .,a_{r1}and any secondary structure T for a_{r+1}, ..., a_{j1}, the structure S ∪ T ∪ {(r, j)} is a secondary structure for a_{ i }, .. ., a_{i+d}.
Given two secondary structures S, T, we define the base pair distance between S, T, denoted by d_{ BP } (S, T), to be the cardinality of the symmetric difference of S, T; i.e., d_{ BP } (S, T) = (S  T) ∪ (T  S).
RNAborSample
In this section, we describe how sampling from the Boltzmann ensemble, using Sfold[30], leads to a simple and fast approximation of RNAbor computations, provided that the input consists of an RNA sequence and the minimum free energy (MFE) secondary structure for that sequence. The work of this section is inspired by sampling approximations of the number of structural motifs, such as hairpins, multiloops, etc. of Ding and Lawrence [30], as well as the sampling approximation used in RNAshapes[8] for large sequences. A novelty of our work is that we provide a rigorous justification for the accuracy of the approximation, depending on sample size.
Let RNAborSample denote the protocol, where we apply Sfold[30] to sample N secondary structures S of an input RNA sequence a_{1}, .. .,a_{ n }, then subsequently compute the relative frequencies f_{ k } for 0 ≤ k < K, where ${f}_{k}=\frac{{N}_{k}}{N}$ is defined to be the number N_{ k } of sampled structures S, whose base pair distance with S_{0} is k, divided by N. Since Sfold appears to be only available as a web server, where the user can not stipulate the number of sampled structures, we instead use the Vienna RNA Package implementation of Sfold, given in RNAsubopt p[32].
and let ${p}_{k}=\frac{{Z}_{k}}{Z}$ denote the Boltzmann probability of all kneighbors.
Consider the value k as bin number. For a fixed bin k, let us denote the exact value of $\frac{{Z}_{k}}{Z}$ by p_{ k }. If we sample N structures, each falling in the k th bin with probability p_{ k }, then the number of structures in the k th bin is given by the random variable X_{ k } having binomial distribution with mean N · p_{ k } and variance N · p_{ k }(1  p_{ k }). It follows that the proportion $\frac{{X}_{k}}{N}$ of structures in the k th bin has mean µ = p_{ k } and standard deviation $\sigma =\sqrt{\frac{{p}_{k}\left(1{p}_{k}\right)}{N}}<\frac{1}{2\sqrt{N}}$. To determine minimum sample size sufficient to ensure a certain approximation accuracy with certain confidence interval, we adapt a standard argument from statistics [37] (see equation (24.35) on p. 529), by approximating the binomial distribution by the standard normal distribution using Zscores.
Before starting, we mention that it will suffice for our intended application of RNAborSample to have a precise approximation of those p_{ k } which exceed some modest lower bound, such as δ = 0.01 or δ = 0.0001. Thus we intend to prove that for all 0 ≤ k < K, if p_{ k } ≥ δ, then Equation (4) holds.
We have completed a more rigorous argument using Chernoff bounds, but prefer the exposition given here for simplicity. Note that the same argument, given verbatim, can be used for any binning procedure. In particular, this approach provides information on sufficient number of samples in order to approximate the result of RNAshapes [8, 38, 39] by means of sampling.
We can make some basic conclusions from the above analysis. The number of samples sufficient to ensure that f_{ k }  p_{ k } < ε for 0 ≤ k < K with confidence 1  p is reasonable, and only slightly increases for a higher number K of bins or to ensure greater confidence 1  p. However, the number of samples increases greatly when higher precision estimates (smaller ε values) are needed, even for one bin.
In the case of one bin, it is important to remember that the value N is a conservative estimate, sufficient to ensure our conclusion. This estimate uses the worst case of 50% probability of being in a bin, which maximizes the standard deviation. For bins with small probability, one can reestimate what is needed. However, for bins with smaller probability, higher precision is needed as well, unless all that is required is to verify that the bin has small probability. Also, clearly if a bin has probability of 10^{6}, then at least on the order of one million samples are needed, just for a reasonable probability of sampling the bin once.
Algorithm description
where the first sum is taken over all base pairs (i, j) belonging to S, the second sum is taken over all unpaired positions in S, and where p_{ i,j } [resp. q_{ i }] is the probability that i, j are paired [resp. i is unpaired] in the ensemble of low energy structures, and α, β > 0 are weights. Our computational experiments, as in Figure 9, were carried out with default values of 1 for α, β. (See Equation 1 for the formal definition of Boltzmann base pairing probability p_{ i,j }.)
The dynamic programming computation of M(i, j, k) is performed by recursion on increasing values of j  i for all values 1 ≤ i ≤ j ≤ n and 0 ≤ k ≤ Kmax. The value of M(i, j, k), stored in the upper triangular portion of matrix M, will involve taking the maximum over three cases, which correspond to the inductive construction of all secondary structures on a_{ i }, .. ., a_{ j }, as described in the previous section. At the same time, the value M(j, i, k), stored in the lower triangular portion of matrix M, will consist of a triple r, k_{0}, k_{1} of numbers, such that the following approximately holds (in this section, we provide the motivating idea; the actual algorithm description, which deviates slightly from the description here, is given in the next section and in Figures 10 and 11). (i) If r = 0 then M(i, j, k) is maximized by a kneighbor S of S_{0}[i, j] for the subsequence a_{ i }, .. ., a_{ j } in which a_{ j } is unpaired. In this case, k_{0} = k and k_{1} = 0. (ii) If r = i, then M(i, j, k) is maximized by a kneighbor S of S_{0}[i, j] for the subsequence a_{ i }, ...,a_{ j } in which base pair (i, j) ∈ S. In this case, k_{0} = 0 and k_{1} = k  1. (i) If i < r ≤ j  θ  1 then M(i, j, k) is maximized by a kneighbor S of S_{0}[i, j] for the subsequence a_{ i }, .. .,a_{ j } in which base pair (r, j) ∈ S. The left portion of S, which is S[i, r  1] will be a k_{0} neighbor of S[i, r  1], while the right portion of S, which is S[r, j] must contain the base pair (r, j) and itself be a k_{1} neighbor of S[r, j]. In summary, the values r, k_{0}, k_{1} will be used in computing the traceback, where the maximum expected accurate structure that is a kneighbor of S[i, j] will be constructed by one of the following: (i) MEA kneighbor of S[i, j  1], in the event that a_{ j } is unpaired in [i, j]; (ii) MEA k  1neighbor of S[i + 1, j  1], in the event that a_{ i }, a_{ j } form a base pair; (iii) MEA k_{0}neighbor of S[i, r  1] and the MEA k_{1}neighbor of S[r, j], where k_{0} + k_{1} = k, in the event that a_{ r }, a_{ j } form a base pair.
Pseudocode for the algorithm RNAborMEA is given in Figures 10 and 11. An array M of size n × n × Kmax is required to store the MEA scores in M(i, j, k) for all subsequences [i, j] and all base pair distances 0 ≤ k ≤ Kmax between structures S[i, j] and initially given structure S_{0}[i, j]. For 1 ≤ i ≤ j ≤ n and all 0 ≤ k ≤ Kmax, the pseudocode in Figure 11 stores a value of the form (x, y, z) in the lower triangular portion, M(j, i, k), of the array. Here, x = 0 indicates that the optimal structure on [i, j], i.e., having maximum MEA score over all kneighbors of S_{0}[i, j], is obtained by not pairing j with any nucleotide in [i, j]; for values x > 0, hence x ∈ [i, j  θ  1], the optimal kneighbor of S_{0}[i, j] is obtained by pairing x with j. The values y, z correspond to the values k_{0}, k_{1}, such that: (i) if x = 0, then the optimal kneighbor of S_{0}[i, j] is obtained by first computing the optimal k_{0}neighbor of S_{0}[i, j  1], where k_{0} = k  b_{0}, then leaving j unpaired; (ii) if x = i, then the optimal kneighbor of S_{0}[i, j] is obtained by first computing the optimal k_{1}neighbor of S_{0}[i + 1, j  1], where k_{1} = k  b_{1}, then adding the enclosing base pair (i, j); (iii) if x = r ∈ [i + 1, j  θ  1], then the optimal kneighbor of S_{0}[i, j] is obtained by first computing the optimal k_{0}neighbor of S_{0}[i, r  1] as well as the optimal k_{1}neighbor of S_{0}[r + 1, j  1], then adding the base pair (r, j). This last calculation must be done over all values k_{0}, k_{1} such that k_{0} + k_{1} = k. Using the values M(j, i, k) = (x, y, z), the traceback can be easily computed by recursion; see Figure 12 for pseudocode of traceback.
We have graphed the Boltzmann probabilities $\frac{{Z}_{1,n}^{\left(k\right)}}{{Z}_{1,n}}$ as well as the uniform probabilities $\frac{{N}_{1,n}^{\left(k\right)}}{{N}_{1,n}}$, where ${N}_{1,n}^{\left(k\right)}$ is the number of kneighbors, and N_{1,n}is the total number of secondary structures. When RT = n, which normalizes the MEA score to a maximum of 1, it appears that the Boltzmann distribution is the same as the uniform distribution, as shown in Figure 13.
Declarations
Acknowledgements
Funding for the research of P. Clote and F. Lou was provided by the Digiteo Foundation for the project RNAomics. Additional funding was provided to P. Clote by National Science Foundation grants DMS1016618 and DMS0817971. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.
This article has been published as part of BMC Bioinformatics Volume 13 Supplement 5, 2012: Selected articles from the First IEEE International Conference on Computational Advances in Bio and medical Sciences (ICCABS 2011): Bioinformatics. The full contents of the supplement are available online at http://www.biomedcentral.com/bmcbioinformatics/supplements/13/S5.
Authors’ Affiliations
References
 Olsthoorn R, Mertens S, Brederode F, Bol J: A conformational switch at the 3' end of a plant virus RNA regulates viral replication. EMBO J 1999, 18: 4856–4864. 10.1093/emboj/18.17.4856PubMed CentralView ArticlePubMedGoogle Scholar
 Repsilber D, Wiese S, Rachen M, Schroder A, Riesner D, Steger G: Formation of metastable RNA structures by sequential folding during transcription: timeresolved structural analysis of potato spindle tuber viroid ()stranded RNA by temperaturegradient gel. RNA 1999, 5: 574–584. 10.1017/S1355838299982018PubMed CentralView ArticlePubMedGoogle Scholar
 Franch T, Gultyaev AP, Gerdes K: Programmed Cell Death by hok/sok of Plasmid R1: Processing at the hok mRNA 3Hend Triggers Structural Rearrangements that Allow Translation and Antisense RNA Binding. J Mol Biol 1997, 273: 38–51. 10.1006/jmbi.1997.1294View ArticlePubMedGoogle Scholar
 Mandal M, Boese B, Barrick J, Winkler W, Breaker R: Riboswitches control fundamental biochemical pathways in Bacillus subtilis and other bacteria. Cell 2003, 113(5):577–586. 10.1016/S00928674(03)00391XView ArticlePubMedGoogle Scholar
 Cheah MT, Wachter A, Sudarsan N, Breaker RR: Control of alternative RNA splicing and gene expression by eukaryotic riboswitches. Nature 2007, 447(7143):497–500. 10.1038/nature05769View ArticlePubMedGoogle Scholar
 Ray PS, Jia J, Yao P, Majumder M, Hatzoglou M, Fox PL: A stressresponsive RNA switch regulates VEGFA expression. Nature 2009, 457(7231):915–919. 10.1038/nature07598PubMed CentralView ArticlePubMedGoogle Scholar
 Voss B, Meyer C, Giegerich R: Evaluating the predictability of conformational switching in RNA. Bioinformatics 2004, 20(10):1573–1582. 10.1093/bioinformatics/bth129View ArticlePubMedGoogle Scholar
 Voss B, Giegerich R, Rehmsmeier M: Complete probabilistic analysis of RNA shapes. BMC Biol 2006., 4(5):Google Scholar
 Freyhult E, Moulton V, Clote P: Boltzmann probability of RNA structural neighbors and riboswitch detection. Bioinformatics 2007, 23(16):2054–2062. Doi: 10.1093/bioinformatics/btm314 Doi: 10.1093/bioinformatics/btm314 10.1093/bioinformatics/btm314View ArticlePubMedGoogle Scholar
 Barash D: Second eigenvalue of the Laplacian matrix for predicting RNA conformational switch by mutation. Bioinformatics 2004, 20(12):1861–1869. 10.1093/bioinformatics/bth157View ArticlePubMedGoogle Scholar
 Mandal M, Breaker RR: Adenine riboswitches and gene activation by disruption of a transcription terminator. Nat Struct Mol Biol 2004, 11: 29–35. 10.1038/nsmb710View ArticlePubMedGoogle Scholar
 Serganov A, Yuan Y, Pikovskaya O, Polonskaia A, Malinina L, Phan A, Hobartner C, Micura R, Breaker R, Patel D: Structural Basis for Discriminative Regulation of Gene Expression by Adenine and GuanineSensing mRNAs. Chem Biol 2004, 11(12):1729–1741. 10.1016/j.chembiol.2004.11.018View ArticlePubMedGoogle Scholar
 Serganov A, Polonskaia A, Phan AT, Breaker RR, Patel DJ: Structural basis for gene regulation by a thiamine pyrophosphatesensing riboswitch. Nature 2006, 441(7097):1167–1171. 10.1038/nature04740View ArticlePubMedGoogle Scholar
 AbreuGoodger C, Merino E: RibEx: A web server for locating riboswitches and other conserved bacterial regulatory elements. Nucleic Acids Res 2005, 33: W690W692. 10.1093/nar/gki445PubMed CentralView ArticlePubMedGoogle Scholar
 Bengert P, Dandekar T: Riboswitch finder  A tool for identification of riboswitch RNAs. Nucleic Acids Res 2004, 32: W154W159. 10.1093/nar/gkh352PubMed CentralView ArticlePubMedGoogle Scholar
 Chang TH, Huang HD, Wu LC, Yeh CT, Liu BJ, Horng JT: Computational identification of riboswitches based on RNA conserved functional sequences and conformations. RNA 2009., 15(7):Google Scholar
 Nawrocki EP, Kolbe DL, Eddy SR: Infernal 1.0: inference of RNA alignments. Bioinformatics 2009, 25(10):1335–1337. 10.1093/bioinformatics/btp157PubMed CentralView ArticlePubMedGoogle Scholar
 Weinberg Z, Barrick JE, Yao Z, Roth A, Kim JN, Gore J, Wang JX, Lee ER, Block KF, Sudarsan N, Neph S, Tompa M, Ruzzo WL, Breaker RR: Identification of 22 candidate structured RNAs in bacteria using the CMfinder comparative genomics pipeline. Nucleic Acids Res 2007, 35(14):4809–4819. 10.1093/nar/gkm487PubMed CentralView ArticlePubMedGoogle Scholar
 Gardner PP, Daub J, Tate JG, Nawrocki EP, Kolbe DL, Lindgreen S, Wilkinson AC, Finn RD, GriffithsJones S, Eddy SR, Bateman A: Rfam: updates to the RNA families database. Nucleic Acids Res 2009, 37(Database):D136D140. 10.1093/nar/gkn766PubMed CentralView ArticlePubMedGoogle Scholar
 Freyhult E, Moulton V, Clote P: Boltzmann probability of RNA structural neighbors and riboswitch detection. Bioinformatics 2007, 23(16):2054–2062. 10.1093/bioinformatics/btm314View ArticlePubMedGoogle Scholar
 Freyhult E, Moulton V, Clote P: RNAbor: a web server for RNA structural neighbors. Nucleic Acids Res 2007, 35(Web):W305W309. 10.1093/nar/gkm255PubMed CentralView ArticlePubMedGoogle Scholar
 Matthews D, Sabina J, Zuker M, Turner D: Expanded sequence dependence of thermodynamic parameters improves prediction of RNA secondary structure. J Mol Biol 1999, 288: 911–940. 10.1006/jmbi.1999.2700View ArticleGoogle Scholar
 Xia T, SantaLucia JJ, Burkard M, Kierzek R, Schroeder S, Jiao X, Cox C, Turner D: Thermodynamic parameters for an expanded nearestneighbor model for formation of RNA duplexes with WatsonCrick base pairs. Biochemistry 1999, 37: 14719–35.View ArticleGoogle Scholar
 Do CB, Mahabhashyam MS, Brudno M, Batzoglou S: ProbCons: Probabilistic consistencybased multiple sequence alignment. Genome Res 2005, 15(2):330–340. 10.1101/gr.2821705PubMed CentralView ArticlePubMedGoogle Scholar
 Nussinov R, Jacobson AB: Fast Algorithm for Predicting the Secondary Structure of Single Stranded RNA. Proceedings of the National Academy of Sciences, USA 1980, 77(11):6309–6313. 10.1073/pnas.77.11.6309View ArticleGoogle Scholar
 Kiryu H, Kin T, Asai K: Robust prediction of consensus secondary structures using averaged base pairing probability matrices. Bioinformatics 2007, 23(4):434–441. 10.1093/bioinformatics/btl636View ArticlePubMedGoogle Scholar
 McCaskill J: The equilibrium partition function and base pair binding probabilities for RNA secondary structure. Biopolymers 1990, 29: 1105–1119. 10.1002/bip.360290621View ArticlePubMedGoogle Scholar
 Lu ZJ, Gloor JW, Mathews DH: Improved RNA secondary structure prediction by maximizing expected pair accuracy. RNA 2009, 15(10):1805–1813. 10.1261/rna.1643609PubMed CentralView ArticlePubMedGoogle Scholar
 Zuker M: On finding all suboptimal foldings of an RNA molecule. Science 1989, 244(7):48–52.View ArticlePubMedGoogle Scholar
 Ding Y, Lawrence CE: A statistical sampling algorithm for RNA secondary structure prediction. Nucleic Acids Res 2003, 31: 7280–7301. 10.1093/nar/gkg938PubMed CentralView ArticlePubMedGoogle Scholar
 Gardner PP, Daub J, Tate J, Moore BL, Osuch IH, GriffithsJones S, Finn RD, Nawrocki EP, Kolbe DL, Eddy SR, Bateman A: Rfam: Wikipedia, clans and the "decimal" release. Nucleic Acids Res 2011, 39(Database):D141D145. 10.1093/nar/gkq1129PubMed CentralView ArticlePubMedGoogle Scholar
 Gruber A, Lorenz R, Bernhart S, Neubock R, Hofacker I: The Vienna RNA websuite. Nucleic Acids Research 2008, 36: 70–74.View ArticleGoogle Scholar
 Mathews DH, Disney MD, Childs JL, Schroeder SJ, Zuker M, Turner DH: Incorporating chemical modification constraints into a dynamic programming algorithm for prediction of RNA secondary structure. Proc Natl Acad Sci USA 2004, 101(19):7287–7292. 10.1073/pnas.0401799101PubMed CentralView ArticlePubMedGoogle Scholar
 Lorenz W, Clote P: Computing the partition function for kinetically trapped RNA secondary structures. Public Library of Science One (PLoS ONE) 2011, 6: 316178. Doi:10.1371/journal.pone.0016178 Doi:10.1371/journal.pone.0016178Google Scholar
 Wakeman CA, Winkler WC, Dann C: Structural features of metabolitesensing riboswitches. Trends Biochem Sci 2007, 32(9):415–424. 10.1016/j.tibs.2007.08.005PubMed CentralView ArticlePubMedGoogle Scholar
 Blin G, Denise A, Dulucq S, Herrbach C, Touz H: Alignments of RNA structures. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2010.Google Scholar
 Zar J: Biostatistical Analysis. PrenticeHall, Inc; 1999.Google Scholar
 Giegerich R, Voss B, Rehmsmeier M: Abstract shapes of RNA. Nucleic Acids Res 2004, 32(16):4843–4851. 10.1093/nar/gkh779PubMed CentralView ArticlePubMedGoogle Scholar
 Steffen P, Voss B, Rehmsmeier M, Reeder J, Giegerich R: RNAshapes: an integrated RNA analysis package based on abstract shapes. Bioinformatics 2006, 22(4):500–503. 10.1093/bioinformatics/btk010View ArticlePubMedGoogle Scholar
 Hofacker I: Vienna RNA secondary structure server. Nucleic Acids Res 2003, 31: 3429–3431. 10.1093/nar/gkg599PubMed CentralView ArticlePubMedGoogle Scholar
 Markham NR, Zuker M: UNAFold: software for nucleic acid folding and hybridization. Methods Mol Biol 2008, 453: 3–31. 10.1007/9781603274296_1View ArticlePubMedGoogle Scholar
 Ponty Y: Efficient sampling of RNA secondary structures from the Boltzmann ensemble of lowenergy: The boustrophedon method. J Math Biol 2008, 56(1–2):107–127.View ArticlePubMedGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.