- Software
- Open access
- Published:
RNAdualPF: software to compute the dual partition function with sample applications in molecular evolution theory
BMC Bioinformatics volume 17, Article number: 424 (2016)
Abstract
Background
RNA inverse folding is the problem of finding one or more sequences that fold into a user-specified target structure s 0, i.e. whose minimum free energy secondary structure is identical to the target s 0. Here we consider the ensemble of all RNA sequences that have low free energy with respect to a given target s 0.
Results
We introduce the program RNAdualPF, which computes the dual partition function Z ∗, defined as the sum of Boltzmann factors exp(−E(a,s 0)/RT) of all RNA nucleotide sequences a compatible with target structure s 0. Using RNAdualPF, we efficiently sample RNA sequences that approximately fold into s 0, where additionally the user can specify IUPAC sequence constraints at certain positions, and whether to include dangles (energy terms for stacked, single-stranded nucleotides). Moreover, since we also compute the dual partition function Z ∗(k) over all sequences having GC-content k, the user can require that all sampled sequences have a precise, specified GC-content.
Using Z ∗, we compute the dual expected energy 〈E ∗〉, and use it to show that natural RNAs from the Rfam 12.0 database have higher minimum free energy than expected, thus suggesting that functional RNAs are under evolutionary pressure to be only marginally thermodynamically stable.
We show that C. elegans precursor microRNA (pre-miRNA) is significantly non-robust with respect to mutations, by comparing the robustness of each wild type pre-miRNA sequence with 2000 [resp. 500] sequences of the same GC-content generated by RNAdualPF, which approximately [resp. exactly] fold into the wild type target structure. We confirm and strengthen earlier findings that precursor microRNAs and bacterial small noncoding RNAs display plasticity, a measure of structural diversity.
Conclusion
We describe RNAdualPF, which rapidly computes the dual partition function Z ∗ and samples sequences having low energy with respect to a target structure, allowing sequence constraints and specified GC-content. Using different inverse folding software, another group had earlier shown that pre-miRNA is mutationally robust, even controlling for compositional bias. Our opposite conclusion suggests a cautionary note that computationally based insights into molecular evolution may heavily depend on the software used.
C/C++-software for RNAdualPF is available at http://bioinformatics.bc.edu/clotelab/RNAdualPF.
Background
In [1], Borenstein and Ruppin define neutrality of an RNA sequence a=a 1,…,a n by \(\eta (\mathbf {a}) = 1-\frac {\langle d \rangle }{n}\), where in this section 〈d〉 denotes the average, taken over all 3n single-point mutants of a, of the base pair distance d BP between the minimum free energy (MFE) structure s 0 of a and the MFE structures of single-point mutants of a. An RNA sequence a is then defined to be robust if η(a) is greater than the average neutrality of 1000 control sequences generated by the program RNAinverse [2], which fold into the same target structure s 0. The main finding of [1] is that precursor microRNAs (pre-miRNA) exhibit a significantly higher level of mutational robustness than random RNA sequences having the same structure. To control for sequence composition bias in their computational study, the authors selected sequences from the output of RNAinverse, whose dinucleotide composition was similar to that of wild type pre-miRNA (Jensen-Shannon divergence less than 0.01). Since the filtering step required enormous run time and computational resources, the authors restricted their attention to a small set of 211 microRNAs, generating only 100 control sequences per microRNA. Borenstein and Ruppin conclude that robustness of precursor microRNAs is not the byproduct of a base composition bias or of thermodynamic stability.
Subsequently Rodrigo et al. [3] undertook a similar analysis for bacterial small RNAs, also using the program RNAinverse, albeit using somewhat different definitions – precise definitions are given in “Formal definitions of robustness” section. The main finding of [3] was that bacterial sncRNAs are not significantly robust when compared with 1000 sequences having the same structure, as computed by RNAinverse; however, bacterial sncRNAs tend to be significantly plastic, in the sense that the ensemble of low energy structures is structurally diverse. Unlike the case of precursor microRNAs [1], Rodrigo et al. did not control for sequence compositional bias.
This raises the question of whether the control sequences analyzed in [1, 3] are representative or to what extent features shared by sequences output by the program RNAinverse are artifacts of the program used. Indeed, the number of RNA sequences that fold into a given target structure can be astronomically large. Over a few weeks, before we elected to terminate the execution, our state-of-the-art inverse folding software RNAiFold [4] generated 273,926,421 many 52-nt sequences that fold exactly into the MFE secondary structure s 0 of HIV-1 ribosomal frameshift stimulating signal from the Gag-Pol overlap region AF033819.3/1631-1682, and which additionally code 17-mer peptides in the Gag and Pol reading frames having amino acids that appear in Gag/Pol peptides found in the Los Alamos HIV-1 database [5]. The number of 52 nt RNA sequences that fold into target s 0 without additionally imposing the constraint of coding particular peptides in overlapping Gag/Pol reading frames is certain to dwarf the previous number. Moreover, the number of sequences that fold into the MFE structure of an animal precursor microRNA (length 68 to 91 nt [6]) or into the MFE structure of bacterial sncRNA (length 53-436 nt [3]) is certain to be even more daunting.
Different inverse folding algorithms have adopted different strategies to generate sequences that fold into a user-specified target secondary structure s 0. For instance, RNAinverse [2, 7] performs an adaptive walk, in one step of which a nucleotide in the current sequence is mutated and subsequently accepted if the base pair distance between the minimum free energy (MFE) structure of the mutated sequence and the target structure s 0 is reduced. NUPACK Design [8] selects a candidate mutation position with probability proportional to its contribution to the ensemble defect (Boltzmann-weighted Hamming distance to the vector representation of s 0, where s 0[i]=j indicates (i,j)∈s 0 and s 0[i]=i indicates i is unpaired in s 0). RNAiFold CP-design [4, 9] uses constraint programming to systematically explore the search tree of all inverse folding solutions in an order determined by certain heuristics. Accordingly, one cannot claim that the collection of sequences generated by any particular inverse folding algorithm is representative of the astronomically large space of all inverse folding solutions – indeed, each inverse folding algorithm has an inherent but unknown bias.
In this paper, we describe the algorithm RNAdualPF, which generates sequences which have low free energy with respect to a user-specified target structure s 0 – i.e. the inherent bias of RNAdualPF is known, unlike the situation of other inverse folding algorithms. We show that RNAdualPF is extremely fast software for generating sequences that approximately fold into s 0; moreover, in a postprocessing step, one can filter the output of RNAdualPF to select sequences that exactly fold into s 0. RNAdualPF additionally allows the user to specify IUPAC codes to constrain certain nucleotide positions as well as to control the GC-content of all generated sequences. Sampling is performed in a manner distinct but somewhat analogous to that by which Sfold [10] and RNAsubopt -p [2] sample representative secondary structures from the Boltzmann ensemble of all structures of a given sequence. Using RNAdualPF, we perform a pilot study that is similar, though not identical, to that of [1, 3] for two classes of RNA: 250 C. elegans precursor microRNA from miRBase [11] and the bacterial small noncoding RNAs previously analyzed in [3].
Finally, it should be noted that, although RNAdualPF was developed entirely independently of the work of Reinharz et al. [12], one can view our C-program as an extension of Python program IncaRNAtion [12] to the full Turner energy model, where additionally GC-content is rigorously handled. This point will be discussed further in the Conclusion.
Formal definitions of robustness
Let a=a 1,…,a n denote an arbitrary RNA sequence, where \(a_{i} \in \mathcal {N} = \{\text {A,\,U,\,G,\,C}\}\), a secondary structure s of a is a set of base pairs (i,j) satisfying the following conditions: (1) If (i,j)∈s then a i ,a j constitute a Watson-Crick or GU wobble pair, i.e. \(ij \in \mathcal {B}\) which is the set {AU,UA,GC,CG,GU,UG}. (2) If (i,j)∈s then i+θ<j, where θ=3 (a minimum assumed for steric hindrance). (3) If (i,j)∈s and (k,ℓ)∈s, then either i<k<ℓ<j or k<i<j<ℓ or i<j<k<ℓ or k<ℓ<i<j. The collection of all secondary structures of the RNA sequence a is denoted \({\mathbb {S}\mathbb {S}}(\mathbf {a})\), and the free energy [13] of s is denoted by E(a,s), or simply by E(s) provided that the sequence a is clear from context. The Boltzmann probability p(s)=p a (s) for structure s of a is defined by exp(−E(a,s)/RT)/Z, where the partition function \(Z = Z(\mathbf {a}) = \sum _{s \in {\mathbb {S}\mathbb {S}}(\mathbf {a})} \exp (-E(\mathbf {a},s)/RT)\). Given two secondary structures s,t of a, the base pair distance d BP(s,t) between s and t is defined to be the size of the symmetric difference of s,t, i.e. |s−t|+|t−s|.
In [3], Rodrigo et al. define intrinsic distance
i.e. intrinsic distance is another name for ensemble diversity earlier defined in [14], and computed by Vienna RNA Package [2]. Plasticity is defined in [3] to be normalized ensemble diversity; i.e.
obtained by dividing ensemble diversity by (essentially) the maximum possible number n/2 of base pairs in a structure of a. Given two RNA sequences a=a 1,…,a n and b=b 1,…,b n of the same length n, Rodrigo et al. define d 1(a,b) to be the expected base pair distance between structures of a and structures of b minus the ensemble diversity of a, i.e.
Since d 1 is not symmetric, this measure is not a metric. In contrast, ensemble distance as described in [14] is a valid metric, defined by the following:
In [3], Rodrigo et al. define the mutational robustness
where 〈d 1(a,a’)〉 denotes the average value of d 1(a,a’) taken over all single point mutants a’ of a. Since d 1(a,a’) is not a true metric, we replace it by the metric D V(a,b) in our computation of mutational robustness. Clearly both notions are closely related.
Implementation
In [15], McCaskill described a cubic time algorithm to compute the partition function
for an RNA sequence a=a 1,…,a n , where the sum is taken over all secondary structures \(\mathbb {S}\mathbb {S}(\mathbf {a})\) of a, E(a,s) denotes the free energy for the structure s of a with respect to the Turner energy parameters [13], R denotes the universal gas constant and T is absolute temperature. Subsequently Ding and Lawrence [16] described how to use the partition function together with a simple backtracking strategy to sample secondary structures of a from the Boltzmann ensemble of low energy structures.
If s 0 is a given secondary structure of length n, we define the dual partition function
where the sum is taken over all RNA sequences a=a 1,…,a n of length n that are compatible with structure s 0, i.e. a i ,a j constitute a Watson-Crick or wobble pair for each base pair (i,j)∈s 0. The set of all RNA sequences that are compatible with s 0 is denoted by \(\mathbb {AA}(s_{0})\). Note that if a sequence a is not compatible with the target structure s 0, then the energy E(a,s 0) is infinite, so the corresponding Boltzmann factor exp(−E(a,s 0)/RT) is zero and the sum in Eq. (7) could have been written over all sequences of the same length as s 0. Here we describe the efficient software RNAdualPF to compute the dual partition function Z ∗ and to sample from the low energy ensemble of sequences that are compatible with a given secondary structure s 0.
Dual partition function
If s is a secondary structure on sequence a=a 1,…,a n , then the length of s, denoted by ℓ(s), is equal to n, while the size of s, denoted by |s|, is the number of base pairs belonging to s. Similarly, if secondary structure s is restricted to the interval [i,j], where 1≤i≤j≤n, then the length of the restriction of s to [i,j], denoted by ℓ(s[i,j]), is equal to j−i+1, while the size of the restriction of s to [i,j], denoted by |s[i,j]|, is the number of base pairs (X,Y) of s that satisfy i≤x<y≤j.
Given an RNA sequence a=a 1,…,a n , the McCaskill algorithm [15] computes the partition function Z(a) defined in Eq. (6). When a is clear from context, Z(a) is usually denoted by Z.
Given a target secondary structure s 0, we describe below an algorithm to compute the dual partition function Z ∗(s 0), defined as the sum of all Boltzmann factors exp(−E(a,s 0)), where the sum is taken over all RNA sequences \(\mathbf {a} \in \mathbb {A}\mathbb {A}(s_{0})\). Unlike the McCaskill algorithm, which requires time that is cubic in the length of a, the algorithm presented below requires time that is (essentially) linear1 in the length of s 0. Our algorithm is motivated by the initialization step of the algorithm INFO-RNA [17], in which a sequence is determined, for which the free energy with respect to target structure s 0 is a minimum – i.e. INFO-RNA determines argmin a E(a,s 0).
The algorithm specification requires the notation Z ∗(i,j;X,Y), which denotes the sum
of Boltzmann factors for sequences a[i,j]=a i ,…,a j for which a i =x,a j =y, and for the restriction s 0[i,j], defined by
The function Z ∗(i,j;X,Y) is defined for all base pairs (i,j)∈s 0; these values will be stored in an array, whose rows index base pairs of s 0, and whose columns are indexed by the six canonical base pairs GC, CG, AU, UA, GU, UG (see example in Table 1). Once Z ∗(i,j;X,Y) has been computed for all base pairs that are visible, i.e. for which there is no base pair (X,Y) for which x<i<j<y, we can compute the full partition function Z ∗(s 0).
Following [17], we define a total ordering on base pairs (i,j) belonging to the target structure s 0 that satisfy the following precedence rule for any two base pairs (i,j),(X,Y).
From this ordering, we assign a base pair index to each base pair (i,j), which is defined to be the rank of (i,j) in the total ordering.
The following definitions correspond to the Turner nearest neighbor energy model [13], which is an additive loop model where a loop closed by external base pair (i,j) is designated as a k-loop, if the loop contains k base pairs interior to (i,j). Therefore, hairpin loops are 0-loops; base pair stacks, bulge loops and internal loops are 1-loops; and multiloops are k-loops for k≥2 (also called (k+1)-way junctions), where the additional count is due to the outer component adjacent to (i,j) [18].
Since AU-base pairs that close a loop are energetically unfavorable, in the Turner energy model, there is an AU-penalty we now define:
This AU-penalty is applied only if (i,j) is a base pair adjacent to a triloop, a bulge, an internal loop or a multiloop, or if it is the outermost base pair of an external loop in target structure s 0, and (i,j) is instantiated by one of the pairs AU, UA, GU, UG. When base-paired positions i,j are clear from the context, we write e AU (X,Y).
Here, we assume that in parsing the input target structure, a list BPcloseELorML has been created of those base pairs (i,j), which close either an external loop or a multiloop. Let I be the indicator function, it follows that if (i,j) closes an external loop or multiloop, then \(\exp \left (-\frac {I[(i,j) \in BPcloseELorML] \cdot e_{AU}(X,Y)}{RT} \right)\) is the Boltzmann factor for a special AU-penalty, otherwise this factor equals 1. For clarity in the notation, this factor is denoted by \(e^{(-\frac {e^{I}_{AU}(X,Y)}{RT})}\). Note that this term is distinct from the factor \(\exp (-\frac {e_{AU}(X,Y)}{RT})\) applied to base pairs adjacent to a triloop, a bulge or an internal loop, which does not depend on the indicator function.
Hairpins
Let (i,j) close a hairpin in s 0. The hairpin free energy term H(j−i−1), arising solely from entropic considerations, is defined by
where hairpinE(j−i−1) designates the hairpin free energy obtained from table look-up, when j−i−1≤30.
Triloop
Let TriLoop X,Y denote the collection of special triloops, xabcy, having an energy bonus triloopE(xabcy).
Tetraloop
Let TetraLoop X,Y denote the collection of special tetraloops, xabcdy, having an energy bonus tetraloopE(xabcdy). Similarly, given nucleotides \(n_{1},n_{2} \in \mathcal {N}\), TetraLoop X,Y (n 1,n 2) denotes the collection of special tetraloops of the form xn 1 abn 2 y. Define Z ∗(i,j;X,Y) by
Hexaloop Let HexaLoop X,Y denote the collection of special hexaloops, xabcdefy, having an energy bonus hexaloopE(xabcdefy). Similarly, given nucleotides n 1,n 2, HexaLoop X,Y (n 1,n 2) denotes the collection of special hexaloops of the form xn 1 abcdn 2 y. Define Z ∗(i,j;X,Y) by
Hairpin size exceeds four and is different than six
Define Z ∗(i,j;X,Y) by
Stacked base pairs, bulges and internal loops
Here, we consider the case of a 1-loop, which comprises the case of stacked base pairs, bulges and internal loops. The following cases correspond to each possibility.
Stacked base pair
In this case, (i,j) stacks on the base pair (i+1,j−1), and the partition function Z ∗(i+1,j−1;U,V) has been computed. Let stack(X,Y,U,V) denote the free energy of base stack \(\begin {array}{ll} 5'-\text {\texttt {XU}}-3'\\ 3'-\text {\texttt {YV}}-5'\\ \end {array}\) obtained by table look-up.
Bulge loop
In this case, (i,j) closes a bulge in s 0. Since bulge size may exceed the values in table look-up, we define the free energy for a bulge of size r by
If (i,j) closes a left bulge of size r in s 0, then the bulge is closed by base pair (i+r+1,j−1) involving nucleotide pair U,V, and
while if (i,j) closes a right bulge in s 0, then the bulge is closed by base pair (i+1,j−r−1) involving nucleotide pair U,V, and
Internal loop
In this case, (i,j) closes an internal loop in s 0, whose left [resp. right] portion is of size r 1 [resp. r 2]. Since internal loop size r=r 1+r 2 may exceed the values in table look-up, we define the free energy for an internal loop of size r by
The closing base pair (i+r 1+1,j−r 2−1) of the internal loop of size r=r 1+r 2 may involve the nucleotides \(UV \in \mathcal {B}\), while the unpaired (mismatch) nucleotides in positions i+1,j−1,i+r 1,j−r 2 may involve \(A,B,C,D \in \mathcal {N}\). In addition, there is an energy penalty for non symmetric internal loops, min(asym·|r 1−r 2|,maxAsym), where the value of the constants asym and maxAsym are given in the Turner energy model. Thus
External loop
Despite the fact that, by following the total order on base pairs defined in Eq. (10), the dual partition function of multiloops is always computed before the dual partition function of the external loop, the computation of the dual partition function of multiloops will be easier to understand if the dual partition function of the external loop is defined in advance.
In order to improve speed, some implementations of RNA thermodynamics-based algorithms ignore the contribution of dangling positions, which corresponds to Vienna RNA Package -d0 flag. RNAdualPF also includes this option, which dramatically increases the speed of the algorithm. The reason behind this difference of performance is clear from the following definitions.
Suppose that H=[(i 1,j 1),…,(i k ,j k )] constitutes the list of k external base pairs of s 0, where i 1<j 1<i 2<j 2< ⋯ < i k < j k . For each (i r ,j r ), with 1≤r≤k, and for each choice of base pair GC, CG, AU, UA, GU, UG, the value Z ∗(i r ,j r ;X r ,Y r ) has been previously computed and stored by dynamic programming, as well as the sum Z ∗(i r ,j r ). When the contribution of dangles is ignored, the dual partition function of an external loop with ℓ nucleotide positions external to every base pair is defined by
where \(\ell = n-\sum \limits _{r = 1,\ldots,k}{(j_{r}-i_{r}+1)}\) and n is the length of the target structure s 0.
The default treatment of dangles in RNAdualPF described below corresponds to Vienna RNA Package -d2 flag, where both flanking positions of each external base pair contribute to the free energy. Let D=[a 1,b 1,…,a k ,b k ]⊆[i 1−1,j 1+1,⋯,i k ,j k ] be a list of those nucleotide positions that are adjacent to the k external base pairs (i 1,j 1),…,(i k ,j k ). The ordered multiset [a 1,b 1,…,a k ,b k ] can be considered as a collection of constraints, so that (for instance) if a 2=i 2−1, and a 2=j 1+1, then a 2=b 1 and any nucleotide value that is assigned to b 1 must simultaneously be assigned to a 2. Moreover, there can also be an overlap between the list of base paired positions in H [i 1,j 1,…,i k ,j k ] and the multiset D=[a 1,b 1,…,a k ,b k ]. If (for instance) j 1=i 2−1, then b 1=i 2 and a 2=j 1. Therefore, in the computation we have to account for these constraints. Let m denote the number of unpaired positions in D, without repetitions, and define A r ,B r as the nucleotides instantiated respectively at a r ,b r . The energy term for a 5′-dangle [resp. 3′-dangle] on base pair (X,Y) with nucleotides U,V is denoted by E d5(X,Y,x−1;U,V,W) [resp E d3(X,Y,y+1;U,V,W)] where the dangle position x−1 [resp. y+1] is assigned nucleotide W. With the notation just described, we have
Depending on the target structure s 0, it can happen that the second sum of Eq. (24) must be restricted to range over strictly less than 42k many RNA sequences. This is explained as follows. If i 1=1 [resp. j r =n] then there is no position for a 5′ [resp. 3′] dangle, and hence the nucleotide sequences considered in the second summation would have length strictly less than 2k. Moreover, certain 5′ dangled positions could be identical to 3′ dangle positions, which arises for instance when j k +2=i k+1; alternatively, certain dangled positions could be identical with base-paired positions, which arises for instance when j k +1=i k+1. In such situations, instantiations of the 3′-dangle on (i k ,j k ) and the 5′-dangle on (i k+1,j k+1) are not independent, thus leading to a restriction of the range of the second summation in Eq. (24). A similar restriction is implicitly assumed in the treatment of external loops in this section and of multiloops in the next section.
The algorithm performance can be improved by dividing the external loop into groups of components having interdependently constrained dangling positions, as just explained. Define two base pairs (X,Y),(x ′,y ′) as adjacent if x<y<x ′<y ′ and x ′−y≤2 – i.e. dangling positions of the base pairs (X,Y),(x ′,y ′) are constrained. Let G denote a maximal collection of adjacent base pairs belonging to H=[(i 1,j 1),…,(i k ,j k )], together with their associated dangle positions in D=[i 1−1,j 1+1,…,i k −1,j k +1]. It is important to note that H∪D is thus partitioned into a collection of g disjoint groups \(\mathcal {G} = [G_{1},\ldots,G_{g}]\). Therefore, we can divide an external loop of k helices into a collection groups \(\mathcal {G}\) of size g≤k, and p unpaired positions that are external to every base pair of s 0 and not adjacent to any base pair.
For a group G with h base pairs, let H(G)=[(κ 1,λ 1),…,(κ k ,λ k )] denote the list of base pairs in G, and let D(G)=[α 1,β 1,…,α h ,β h ]⊆[κ 1−1,λ 1+1,⋯,κ h −1,λ h +1] denote their associated dangle positions. If U r ,V r ,A r ,B r denote the nucleotides instantiated at the base pair r=(κ r ,λ r ) and its respective dangling positions α r ,β r respectively, then the dual partition function of G is the following.
where the range of the second summation can be constrained by the overlap among positions in D(G) and between positions in D(G) and H(G), as explained for Eq. (24).
Finally, since there are no shared dangling positions between groups, the dual partition function of an external loop is defined by
Multiloop
Suppose that (i,j) closes a multiloop in s 0, which is a k-loop, or (k+1)-way junction, for k>1, where there are ℓ unpaired bases in the multiloop. Suppose that the k components of the multiloop are closed by the base pairs (i 1,j 1),…,(i k ,j k ) with the property that i<i 1<j 1<i 2<j 2<⋯<i k <j k <j. Assume that for all nucleotide choices in \(\mathcal {B}\) for each of the k base pairs of the multiloop (i r ,j r ), for 1≤r≤k, the value Z ∗(i r ,j r ;X r ,Y r ) has previously been computed and stored by dynamic programming, as well as the sum Z ∗(i r ,j r ). The computation of the dual partition function is similar to that of the external loop. However, in this case we have to add the contribution of the base pair closing the multiloop (i,j), the AU-penalties applied to this base pair, and the energetic penalty of a multiloop a+b·(k+1)+c·ℓ, where the values of the constants a, b and c are given in the Turner energy model. Then, the dual partition function of a multiloop without accounting for dangling positions is
The notation we use to define the dual partition function of multiloops with dangling positions is similar to that described for external loops. However, some modifications are required in the previously given definitions, since we have to take into account the flanking positions of the base pair (i,j) closing the multiloop. Let H=[(i 1,j 1),…,(i k ,j k ),(i,j)] be the collection of k base pairs closing one of the k components of the multiloop, and the base pair (i,j) closing the multiloop, and define the multiset D=[a 1,b 1,…,a k+1,b k+1]⊆[i 1−1,j 1+1,⋯,i k −1,j k +1,i+1,j−1] of nucleotide positions adjacent to the base pairs in H. Due to the possible overlap with the base pair closing the multiloop and its flanking positions, there are additional constraints in the ordered multiset [a 1,b 1,…,a k+1,b k+1], so that (for instance) if a 1=i 1−1, and i 1=i+1, then a 1=a k+1 and any nucleotide value that is assigned to a 1 must simultaneously be assigned to a k+1. Moreover, there can also be an overlap between the list of base paired positions [i 1,j 1,…,i k ,j k ,i,j] and the multiset [a 1,b 1,…,a k+1,b k+1]. If (for instance) i=i 1−1, then a k+1=i 1 and a 1=i.
Let m denote the number of unpaired positions in D, without repetitions. Then, the dual partition function of a multiloop with dangling positions is defined as follows.
As explained for Eq. (24), it can happen that the second summation must be restricted to range over strictly less than 42k many RNA sequences.
A decomposition similar that for external loops can be performed to improve the performance in the computation of the dual partition function of a multiloop. In a multiloop, in addition to the adjacency definition given for external loops, we consider the base pair (i,j) that closes the multiloop as adjacent to a base pair (X,Y) that closes a component of the multiloop, where i<x<y<j, if either x≤i+2 or y≥j−2. Then, let G denote a maximal collection of adjacent base pairs belonging to H=[(i 1,j 1),…,(i k ,j k ),(i,j)], together with their associated dangle positions in D=[i 1−1,j 1+1,…,i k −1,j k +1,i+1,j−1]. This decomposition produces a collection \(\mathcal {G}\) of g disjoint groups G 1,…,G g , one of which, designated the closing group G c contains the closing base pair (i,j) of the multiloop, and g−1 of which, designated as non-closing groups G nc , do not contain the base pair (i,j).
Non-closing groups have the same composition as those defined for external loops – i.e. a collection of h base pairs H(G nc )=[(κ 1,λ 1),…,(κ h ,λ h )] and a set of dangling positions D(G nc )=[α 1,β 1,…,α h ,β h ]⊆[κ 1−1,λ 1+1,⋯,κ h −1,λ h +1]. Therefore, we can compute the dual partition function Z(G gc ) of a non-closing group as described in Eq. (25). In addition, the collection of non-closing groups of size g−1 of a multiloop of k components is denoted by \(\mathcal {G}_{nc}\), where 0≤(g−1)≤k.
Therefore, a multiloop of k components and ℓ unpaired positions can be decomposed into one closing group G c , a collection of non-closing groups \(\mathcal {G}_{nc}\), and p unpaired positions that are not adjacent to any base pair, with 0≤p≤ℓ.
In a non-closing group, the collection of base pairs of size h+1 is denoted by H(G c )=[(κ 1,λ 1),…,(κ h ,λ h ),(i,j)], where the base pair (i,j) closing the multiloop is at the last position. The ordered multiset of adjacent positions is denoted by D(G c )=[α 1,β 1,…,α h+1,β h+1]⊆[κ 1−1,λ 1+1,⋯,κ h −1,λ h +1,i+1,j−1], where the positions adjacent to i and j are at the last positions are respectively denoted by α h+1,β h+1. A graphical example of a closing group and a non-closing group is shown in Fig. 1 e, where the positions of a non-closing group with 1 base pair are highlighted in green and the positions of the closing group are highlighted in red and blue, and where the base pair (i,j) that closes the multiloop is depicted in red.
For a closing group G c with h+1 base pairs in H(G c )=[(κ 1,λ 1),…,(κ h ,λ h ),(i,j)] and their flanking positions D(G c )=[α 1,β 1,…,α h+1,β h+1]⊆[κ 1−1,λ 1+1,⋯,κ h −1,λ h +1,i+1,j−1], let X,Y denote the nucleotides assigned to the closing base pair of the multiloop (i,j), and let U r ,V r ,A r ,B r denote the nucleotides assigned respectively to the base pair r=(κ r ,λ r ,) and its flanking positions α r ,β r . Then, the the dual partition function Z ∗(G c ;X,Y) of the closing group is defined by
In the same way as in Eq. (24), the values of the second summation are constrained to the possible choices among overlapping positions.
Then, the dual partition function Z ∗(i,j;X,Y) of the multiloop with k components and ℓ unpaired positions, where p of which are not adjacent to any base pair, is defined by
Sampling
Once the dual partition function Z ∗(i,j) and its subcases Z ∗(i,j;X,Y) for each base pair (i,j) have been computed, it is possible to perform a Boltzmann weighted sampling of positions i and j. For example, given the target structure with sequence constraints depicted in Fig. 2, RNAdualPF computes the dual partition function table shown in Table 1. The dual partition function of the substructure enclosed by the base pair (i,j) is Z ∗(i,j), and the dual partition function of the substructure enclosed by the base pair (i,j) where i,j are currently instantiated by the nucleotides X,Y is denoted by is Z ∗(i,j;X,Y). Therefore, the Boltzmann probability of X,Y at positions i,j in the substructure enclosed by the base pair (i,j) is Z ∗(i,j;X,Y)/Z ∗(i,j) and can be sampled using the roulette wheel method.
Due to the Turner energy model, it is necessary to determine nucleotide positions whose instantiation influences the energy (hence Boltzmann probability) of other positions, and subsequently all mutually dependent positions must be instantiated simultaneously. Figure 1 illustrates the mutual dependencies that must be considered when sampling different types of elements, where the base pair (i,j) to be sampled is highlighted in red, positions whose sampling probability is dependent on the instantiation of (i,j) are highlighted in blue, and positions that are mutually dependent, but independent of the instantiation of (i,j), are highlighted in green.
Since the dynamic programming algorithm for the dual partition function proceeds from inner to outer base pairs, using the total ordering ≺ in Eq. (10), the sampling order of base pairs proceeds from outer to inner positions, i.e. from largest base pair index to smallest. In order to account for mutual dependencies in the sampling step, we define the function sample(k,T,i,j,X,Y) for each base pair (i,j) in S 0, where k indicates the base pair index defined from Eq. (10), T indicates the type of structural element closed by base pair (i,j) in the target RNA secondary structure, as shown in Table 1, and X,Y are the instantiated nucleotides at positions (i,j). Due to the mutual dependencies, sampling a base pair with base pair indexk closing an m-loop, for m>0, forces the instantiation of all inner closing base pairs of the m-loop, and the base pair index of each such inner base pair is strictly less than k. For this reason, except in the case of external loops, the outermost base pair (i,j) has been always instantiated before sample(k,T,i,j,X,Y) is called, and therefore the instantiation X,Y is given as a parameter of the sampling function.
The Boltzmann probability of each possible instantiation of mutually dependent positions can be computed on the fly in the backward step. However, in order to improve the speed of the algorithm, in the forward step RNAdualPF stores (for each base pair) the conditional dual partition function values of instantiations of interdependent positions. These tables are used by the sampling function, since each value corresponds to the dual partition function conditional on a specific instantiation of the positions to be sampled by sample(k,T,i,j,X,Y). Since the sampling procedure depends on the type T of element, we describe the function sample(k,T,i,j,X,Y) for each type of element – hairpin, stacked base pair, internal loop (which also comprises left and right bulge), multiloop and external loop as depicted in Fig. 1. For each of these cases, the values are stored in the conditional dual partition function table associated with the closing base pair (i,j).
Hairpins
When hairpin size exceeds three (Fig. 1 a), since the base pair (i,j) has been previously instantiated, flanking positions i+1,j−1 are sampled first. Given the current assignment X,Y, the Boltzmann probability of sampling respectively the nucleotides U,V at the flanking positions i+1,j−1 is
Therefore, in the forward step RNAdualPF stores in a table the conditional dual partition function of each possible instantiation {X,Y,U,V} of the base pair (i,j) and its flanking positions i+1,j−1 respectively, defined by
Then, remaining unpaired positions are uniformly sampled, since the nucleotide choice does not change the final free energy. Triloops, tetraloops and hexaloops are exceptions to this rule, since there are special loops that contribute to or penalize the free energy. In those cases, we have to account for the special loops, as defined in “Hairpins” section.
Although it could seem to be a waste of space to store a different conditional dual partition function table for each base pair (i,j), even for two different hairpins of the same size in the target structure, one should note that RNAdualPF allows sequence constraints, and thus Z ∗(i,j) could possibly differ from Z ∗(i ′,j ′) when (i,j) and (i ′,j ′) close hairpins of the same size.
Stacking base pairs
As depicted in Fig. 1 b, sampling probability of a base pair with base pair index k−1 is dependent on the value sampled at the adjacent stacking base pair with base pair indexk. Therefore, sample(k,Stack,i,j,X,Y) samples the base pair (i+1,j−1) using the conditional probability given the instantiation of base pair (i,j) by X,Y, defined as follows:
The conditional dual partition function values stored in the forward step correspond to each instantiation {X,Y,U,V} of the base pairs (i,j),(i+1,j−1), denoted by
Internal loops
The energy contribution of internal loops in the Turner energy model depends on the flanking unpaired positions of both the inner and outer closing base pairs, hence the sampling probability of the inner base pair cannot be separated from the adjacent unpaired positions. Moreover, for specific sizes of internal loop (1×1, 1×2, 2×1, 1×N and N×1), the inner and outer closing base pairs share flanking positions. In these cases, all the unpaired positions and the outer base pair must be sampled at the same time, since the energy contribution of each combination of base pairs and flanking positions is different. In the 1×3 internal loop depicted in Fig. 1 c, if the outer base pair (i,j) is instantiated by X,Y, then let P(k=U,l=V,n 1=A,n 2=B,n 3=C|i=X,j=Y) denote the probability of sampling the nucleotides U,V,A,B,C respectively at positions k,l,n 1,n 2,n 3, where (k,l) is the inner closing base pair, n 1 is the flanking position at i+1 shared by the base paired positions i and k, and n 2 and n 3. In the following equation, let P(U,V,A,B,C|X,Y) abbreviate the conditional probability just defined. Then
RNAdualPF computes and stores the conditional dual partition function of each possible instantiation {X,Y,U,V,A,B,C} respectively at positions i,j,k,l,n 1,n 2,n 3, where the value Z ∗(i,j,k,l,n 1,n 2,n 3;X,Y,U,V,A,B,C) is defined by
For internal loops of sizes (1×1, 1×2, 2×1, 1×N and N×1) similar conditional dual partition function tables are computed following the definitions in “Stacked base pairs, bulges and internal loops” section.
Other internal loops: When there are no shared flanking positions between the two base pairs that close an internal loop, as depicted in Fig. 1 d, the energy contribution of innermost base pair and its respective flanking positions is independent of those of the outermost base pair.
In this case, RNAdualPF samples first the flanking positions i+1,j−1 of the outermost base pair (i,j), whose sampling probability is solely dependent on the instantiated nucleotides X,Y at positions i,j. Is not necessary to store any conditional dual partition function for sampling these positions, since the probability of sampling the values A,B at the flanking positions i+1,j−1, given the assignment X,Y is defined by
where mismatch penalties are obtained from table look-up. Finally, the innermost base pair (k,l) and its flanking positions k−1,l+1 are sampled together. In this case, we need to store an additional value Z ∗(k−1,l+1), which is given by
Then, following the same notation, the probability of sampling the nucleotides V,U,D,C respectively at positions k,l,k−1,l+1 is
Therefore, the conditional dual partition function of each possible instantiation {V,U,D,C} stored in the corresponding table is defined as
Finally, since the remaining unpaired position does not contribute to the free energy, it is uniformly sampled.
Multiloops and external loops
As explained in “External loop” section, if dangling positions are not included in the computation, sampling an external base pair or the closing base pair (i,j) of a multiloop from Z ∗(i,j) is trivial. On the other hand, by including dangling positions in the sampling, there is a dramatic increase in the space complexity of RNAdualPF, albeit the space used is only a constant factor larger. However, the decompositions into groups described in “External loop” and “External loop” sections allow to sample the positions of each group independently.
The example shown in Fig. 1 e depicts a multiloop with two groups: a non-closing group G nc highlighted in green, and a closing group G c highlighted in red and blue, where the closing base pair of the multiloop (i,j) is marked in red.
In a non-closing group G nc all base pairs in H(G nc ) and dangling positions in D(G nc ) must be sampled together. Therefore, the conditional dual partition function of each possible instantiation of nucleotides at the h closing pairs in H(G nc ) and their adjacent positions in D(G nc ) is stored. Let \(\mathcal {U}=\{U_{1},V_{1},\ldots,U_{h},V_{h}\}\) denote an instantiation of the h base pairs in H(G nc )=[κ 1,λ 1,…,κ h ,λ h ], and let \(\mathcal {W}=\{A_{1},B_{2},\ldots,A_{h},B_{h}\}\) denote an instantiation of the h flanking positions in D(G nc )=[α 1,β 1,…,α h ,β h ] in the non-closing group G nc . Then, the probability of sampling \(\mathcal {U},\mathcal {W}\) is
Therefore, the conditional dual partition function of each instantiation \(\mathcal {U},\mathcal {W}\) at H(G nc ),D(G nc ), stored in the table of the group, is defined by
Recall that the base pairs in H(G nc ) are adjacent. Therefore, due the constraints given by the overlapping positions within D(G nc ), and between D(G nc ) and H(G nc ), explained in “External loop” section, the number of possible instantiations \(\mathcal {U},\mathcal {W}\) of H(G nc ),D(G nc ) is ≤(6h·4h+1).
In a similar way, sampling from the closing group G c closed by the base pair (i,j), with h+1 base pairs in H(G c ) and their corresponding flanking positions in D(G c ) requires us to store the conditional dual partition function of each instantiation of nucleotides \(\{X,Y,\mathcal {U},\mathcal {W}\}\) respectively at i,j,H(G c ),D(G c ), where \(\mathcal {U}=\{U_{1},V_{1},\ldots,U_{h},V_{h}\}\) denotes an instantiation of the h first base pairs [(κ 1,λ 1),…,(κ h ,λ h )] in H(G c ), \(\mathcal {W}=\{A_{1},B_{2},\ldots,A_{h+1},B_{h+1}\}\) denotes an instantiation of the 2·(h+1) flanking positions in D(G c )=[α 1,β 1,…,α h+1,β h+1], and X,Y denotes an instantiation of (i,j). The probability of the instantiation \(\mathcal {U},\mathcal {W}\), given the nucleotides X,Y is
Then, the values stored in the table of the closing group correspond to the conditional dual partition function of each instantiation \(\left \{X,Y,\mathcal {U},\mathcal {W}\right \}\) are given by \(Z^{*}(G_{c},i,j,H(G_{c}),D(G_{c});X,Y,\mathcal {U},\mathcal {W})\), which is defined by the following expression:
As a final remark, we would like to recall that all the conditional dual partition function values are computed and stored in the forward step at the same time as the dual partition function. Therefore, despite the consequent increase of space complexity in the algorithm, the computation of the values required for correct sampling does not involve a greater time complexity.
Scaling
The sequence partition function Z ∗(s 0) grows much faster than the usual structure partition function Z(a), and so scaling must be used in the implementation. Let C>2 be a user-defined constant. By a slight modification of the previous recursions, we actually compute \(Z^{\dag }(i,j;X,Y) = \frac {Z^{*}(i,j;X,Y)}{C^{j-i+1}}\), and hence \(Z^{\dag }(s_{0}) = \frac {Z^{*}(s_{0})}{C^{n}}\), where n is the length of s 0. For instance, the analogue of Eq. (16) is
and the analogue of Eq. (17) is
This modification does not affect properties of sequences sampled from the low energy ensemble, since the same scaling factor appears in both the numerator and denominator of all conditional probabilities. For instance, the analogue of Eq. (31) is
Controlling GC-content
The GC-content of an RNA sequence a=s 1,…,s n is the number of nucleotides that are either G or C. Instead of computing Z ∗(i,j;X,Y) and Z ∗(s 0), we can compute Z ∗(i,j;X,Y;α) and Z ∗(s 0,α), defined to be the corresponding partition dual partition functions, restricted to sequences having GC-content of α. Note well that GC-content α includes the closing nucleotides X and Y respectively located at positions i and j; i.e.
where s 0[i,j] denotes the restriction of target structure s 0 to the interval [i,j]. We describe two particular subcases, to provide the idea of how modifications need to be undertaken.
Triloop
Note that the number of RNA sequences of length m having GC-content of α is \(\binom {m}{\alpha } \cdot 2^{\alpha } \cdot 2^{m-\alpha } = \binom {m}{\alpha } \cdot 2^{m} \leq 4^{m}\), since α selected positions must be either G or C, yielding the term 2α, while the remaining m−α positions must be either A or U, yielding the term 2m−α. Assume that γ(XY)=|{X,Y}∩{G,C}|=β. Then
Multiloop and external loop
Assume that (i,j) closes a multiloop, which is a (k+1)-way junction with ℓ unpaired nucleotides. Assume that the ordered multiset of potential dangle positions is D=[a 1,b 1,…,a k+1,b k+1], where a r =i r −1 and b r =j r +1 for r=1,…,k, and a k+1=i and b k+1=j, and assume that there are m unpaired positions that are not adjacent to a base pair in the multiloop. If r denotes an RNA sequence of arbitrary length, then let the function γ(r) denote the GC-count in r. Given an assignment of nucleotide base pairs U 1 V 1,…,U k V k to (i 1,j 1),…,(i k ,j k ), where U r V r ∈{GC,CG,AU,UA,GU,UG}, and given an assignment A 1,B 1,…,A k ,B k of dangle nucleotides, where \(A_{r},B_{r} \in \mathcal {N}\), for r=1,…,k, we let
Then the dual partition function of a multiloop with a GC-content of α is defined by setting Z ∗(i,j;X,Y;α) equal to the following:
Since the modification required in the remaining cases follows similar reasoning as in the treatment of the hairpin and external loop just described, the details for these remaining cases are not given.
An additional challenge of computing the dual partition function with GC-content control is the combinatorial problem of efficiently counting the number N of instantiations of the external loop, consisting of all positions external to every base pair, with GC-content k, where the user can stipulate that certain positions are constrained to contain nucleotides consistent with IUPAC codes. To this end, we implemented the combinatorial algorithm defined in Supplementary Information.
Sampling with GC-content
The implementation of sampling with GC-content is performed in a similar manner as described in “Sampling” section, with some notable differences.
First, the sampling function is redefined by sample(k,T,i,j,X,Y,α), where k indicates the base pair index in the ordering defined by Eq. (10) for the base pair (i,j) that is already instantiated by nucleotide pair XY, and T designates the type of structural element closed by base pair (i,j) in the target RNA secondary structure, as shown in Table 1. The function sample(k,T,i,j,X,Y,α) instantiates all positions of the loop having outer closing base pair (i,j), including its inner closing base pair(s) and which returns the GC-content of the sampled loop. Moreover, the GC-content of the subsequence a[i+1,j−1]=(a i+1,…,a j−1) will be α once the entire sequence a 1,…,a n is sampled.
Second, RNAdualPF stores a conditional dual partition function table for each base pair (i,j) and GC-content 0 to j-i-1. The function sample(k,T,i,j,X,Y,α) samples from the conditional dual partition function of those sequences which have exactly α Gs and Cs strictly between the positions i and j, thus guaranteeing a GC-content of α for the subsequence a[i+1,j−1] once the entire sequence a 1,…,a n is sampled. Note that sample(k,T,i,j,X,Y,α) samples only the loop closed by the already instantiated outer base pair (i,j), and that α is the GC-content of the entire subsequence a[i+1,j+1]=a i+1,…,a j−1 once the algorithm terminates. Only in the case that base pair (i,j) closes a hairpin loop will it generally happen that the GC-content of the loop closed by (i,j) is equal to α.
Let α be the user-designated GC-content of sequences a=a 1,…,a n to be sampled from a target secondary structure having ℓ base pairs. The following pseudocode describes how to sample sequences a=a 1,…,a n , whose GC-content exactly equals α. Here, an external loop with m components means that there are m exterior base pairs (i 1,j 1),…,(i m ,j m ) such that all positions exterior to these base pairs are unpaired; i.e. each position \(r \in \{ 1,\ldots,n\} - \cup _{c=1}^{m} \{ i_{c},\ldots,j_{c} \}\) is unpaired in the target structure.
To clarify how the GC-content is sampled in a statistically rigorous manner, suppose that the user has specified the GC-content to be α, and that L is the external loop of the target structure s 0 having m components, where the cth component has external closing base pair (i c ,j c ). In computing the dual partition function, for all possible choices of non-negative integers α 1,…,α m ,β that sum to α and all 6m possible assignments of Watson-Crick or wobble nucleotide pairs X 1,Y 1,…,X m ,Y m to the base pairs (i 1,j 1),…,(i m ,j m ), the software RNAdualPF has computed the sum of \(\sum _{c=1}^{m} Z^{*}_{c}(i_{c},j_{c};X_{c},Y_{c};\alpha _{c})\) plus the Boltzmann factor of the external loop with GC-content β. Since the dual partition function Z ∗(s 0;α) is the sum, taken over all values of α 1,…,α m ,β and all Watson-Crick and wobble pair assignments to the external base pairs, RNAdualPF can then use the roulette wheel method to sample values α 1,…,α m ,β and X 1,Y 1,…,X m ,Y m in a statistically rigorous manner. Multiloops, and other structural elements, which contain unpaired regions whose sequence does not contribute to the free energy of the structure, are handled in a analogous manner.
Results
Robustness and plasticity of C. elegans miRNAs and E. coli sncRNAs
In [1] Borenstein and Ruppin used version 1.4 of the Vienna RNA Package [7] to generate 1000 RNA sequences per wild type precursor microRNA (pre-miRNA) extracted from the database Rfam 1.0 [19], with the property that each of the 1000 control sequences folded into the wild type pre-miRNA structure – i.e. the minimum free energy (MFE) structure of each of the 1000 control sequences was identical to the MFE structure of the wild type pre-miRNA. Based on these computational experiments, Borenstein and Ruppin asserted that the “structure of miRNA precursor stem–loops exhibits a significantly high level of mutational robustness in comparison with random RNA sequences with similar stem–loop structures”. Noting that the Vienna RNA Package inverse folding program RNAinverse does not control for GC-content or other sequence compositional bias, the authors performed a second computational experiment, in which control sequences not only folded into the target wild type structure, but also had similar dinucleotide composition to that of wild type pre-miRNA (Jensen-Shannon divergence less than 0.01). Since the filtering step required enormous run time and computational resources, the authors restricted their attention to a small set of 211 microRNAs, and generated only 100 control sequences per microRNA – note here that RNAinverse cannot control for GC-content. Borenstein and Ruppin concluded that robustness of precursor microRNAs was not the byproduct of a base composition bias or of thermodynamic stability.
Subsequently Rodrigo et al. [3] undertook a similar analysis for bacterial small noncoding RNAs (sncRNA), also using the program RNAinverse, albeit using somewhat different definitions – see precise definitions in “Formal definitions of robustness” section. The main finding of [3] was that bacterial sncRNAs are not significantly robust when compared with 1000 sequences having the same structure, as computed by RNAinverse; however, the authors found that bacterial sncRNAs tend to be significantly plastic, in the sense that the ensemble of low energy structures are structurally diverse. Unlike the case of precursor microRNAs [1], Rodrigo et al. did not control for sequence compositional bias.
Using RNAdualPF, we performed similar computational experiments on 250 precursor microRNAs of C. elegans from miRBase 20 [11] and for the bacterial small noncoding RNAs of [3]. Below, we discuss each case separately.
For each C. elegans pre-miRNA, we used RNAdualPF to sample 2000 sequences with no control over GC-content and 2000 sequences whose GC-content was identical with that of the wild type pre-miRNA. Moreover, each control sequence approximately folded into the MFE wild type pre-miRNA structure as computed by Vienna RNA Package 2.1.9 [2]. Table 2 shows that length-normalized base pair distance between the MFE structure of the control sequence and that of the pre-miRNA is on average 0.09±0.04 for default use of RNAdualPF with control over GC-content, and 0.06±0.03 when GC-content of each control sequence is identical to that of the corresponding wild type pre-miRNA. Additional measures in Table 2 show that sequences sampled from RNAdualPF (1) are only modestly more stable thermodynamically, (2) the ensemble of low energy structures of control sequences deviate slightly more from the target pre-miRNA structure, as is the case for wild type pre-miRNA sequences, as mesured by ensemble defect [20], expected base pair distance to target [9], expected proportion of native contacts (called ensemble neutrality in [21]), average positional entropy [22], Morgan-Higgs structural diversity [23], and Vienna structural diversity (called ensemble diversity in [14]).
Tables 3 and 4 display a similar analysis of the collection of bacterial small noncoding RNAs of [3] and of Rfam 12.0 database [24]. For the Rfam database, we selected one sequence from each of the ≈2500 Rfam families, with the property that the MFE structure of the sequence most resembled the Rfam consensus structure – i.e. whose MFE structure has smallest base pair distance to the consensus structure. These tables show similar trends as those displayed in Table 2, although values are larger due to increased sequence length of bacterial sncRNA and sequences from Rfam.
In agreement with [1], the left panel of Fig. 3 shows that C. elegans miRNA is significantly robust (Z-score of 0.61±1.55, 2-tailed T-test p-value 2.2×10−9), provided that GC-content is not controlled. However, in contrast to [1], when GC-content is controlled, we find that C. elegans miRNA is significantly non-robust (Z-score of −1.3±2.9, 2-tailed T-test p-value 1.5×10−11). To corroborate our findings, for each wild type C. elegans pre-miRNA, we performed a second computational experiment, to generate 500 sequences with no control over GC-content and 500 sequences whose GC-content was identical with that of the wild type pre-miRNA. In contrast to the first experiment, we used RNAdualPF to generate sufficiently many sequences to subsequently select 500 sequences (no GC-control) and 500 sequences (GC-content equal to wild type pre-miRNA), each of whose MFE structure was identical to that of wild type pre-miRNA. The left panel of Fig. 4 shows that when GC-content is not controlled, C. elegans precursor microRNAs are statistically robust (Z-score 0.51±1.44, p-value 7.3×10−8), in agreement with the main result of Borenstein and Ruppin [1]. However, when GC-content of control sequences is identical to that of wild type precursor microRNA, we confirm that C. elegans pre-miRNA is statistically non-robust (Z-score −1.23±2.78, p-value 3×10−11). Note that our finding, which is in opposition to results of Borenstein and Ruppin [1], is based on a larger data set of precursor microRNAs, each of which has a larger control set, than in the analysis of [1].
Turning now to the analysis of bacterial small noncoding RNAs, we find that sncRNAs are not significantly robust (Z-score 0.0±1.4, p-value 0.98) when GC-content is not controlled, confirming a result from Rodrigo et al. [3]. However, when GC-content of the sequences sampled from RNAdualPF is required to be identical to that of wild type sncRNA, bacterial sncRNAs are see to be significantly non-robust (Z-score −2.58±3.87, p-value of 4.4×10−7). Note here that Rodrigo et al. used RNAinverse in their computational experiments, hence could not consider the case with control over GC-content. The left panels of Figs. 3 a and 4 summarize our findings that precursor microRNAs [resp. bacterial sncRNAs] are significantly non-robust [resp. not significantly robust] with respect to a control set of 2000 [resp. 1000] sequences generated by RNAdualPF with identical GC-content to that of the wild type sequence.
Finally, in our analysis of plasticity, the right panels of Figs. 3 and 4 show that both C. elegans and bacterial small noncoding RNAs exhibit more plasticity when compared with control sequences for which GC-content is not controlled, as well as when compared with control sequences for which GC-content is identical to that of wild type sequences.
Structural RNA has higher free energy than expected
In Figure 4 of [9], we showed that the free energy E 0 of the minimum free energy (MFE) structure s 0 of E. coli val-tRNA (accession RV1600 from Sprinzl database [25] tdbR00000454 from tRNAdb [26]), is much higher (less favorable) than the average free energy 〈E〉 of over four million RNAs having the same MFE structure s 0 as that of E. coli val-tRNA. Here, E. coli val-tRNA RV1600 was selected, because its MFE structure s 0 is identical to the Rfam consensus structure for tRNA family RF00005. This preliminary result suggests that naturally occurring transfer RNAs may be under selective pressure to be only marginally thermodynamically stable. Since it took a number of days for RNAiFold [4, 9] to return over four million solutions of the inverse folding problem for the tRNA target structure, we now describe how RNAdualPF can be used to compute the Boltzmann expected free energy of literally all sequences a 1,…,a n with respect to an arbitrary target structure s 0. In this manner, we confirm our preliminary finding concerning E. coli val-tRNA, and show that the folding energy of structural RNA from the Rfam database is much higher (less favorable) than expected. Before presenting results, we need some definitions.
For the Turner nearest neighbor energy model [13], the free energy of a secondary structure s of an RNA sequence a=a 1,…,a n depends on the (absolute) temperature T 0. To indicate this dependence, we write E(a,s,T 0), where in the sequel, T 0 will be designated as table temperature, i.e. the temperature for which parameters from the Turner energy tables are applied. For an arbitrary, but fixed secondary structure s 0 of length n, the dual partition function at temperature T 0 is defined by
where the sum is taken over all RNA sequences a=a 1,…,a n of length n. Note that T 0 indicates the (table) temperature at which the energy of a structure s 0 and nucleotide sequence a is evaluated using the Turner parameters, while all other occurrences of the temperature variable are designated by T, which we call formal temperature. The distinction between formal and table temperature is made to allow us to use finite difference approximations to derivatives with respect to the formal temperature when when we compute dual expected energy and dual conformational entropy below (see [27] for more explanation). When table temperature T 0 equals formal temperature T, and the temperature is clear from the context, we write Z ∗(s 0); if the target structure s 0 is also clear from the context, then we write Z ∗. A similar remark applies to the other thermodynamic functions p ∗,G ∗,〈E ∗〉,S ∗, which we now define.
The dual Boltzmann probability p ∗(a) is defined by
The dual ensemble free energy G ∗(s 0) is defined by
where R≈1.987 cal/(mol K) is the universal gas constant. The dual expected (free) energy 〈E ∗(s 0)〉 is defined by
Straightforward derivations analogous to those in [27] yield the following expressions for dual expected energy 〈E ∗〉 and dual entropy S ∗:
Programs to compute dual expected energy 〈E ∗〉, dual conformational entropy S ∗, and dual heat capacity \(C^{*}_{p}\) are provided at our web site. We do not elaborate further on dual entropy or dual heat capacity, since at the present time we have found no compelling applications.
Figure 5 shows that structural RNAs have higher free energy with respect to their native structure, hence are thermodynamically less stable, than expected, – even when expectations are taken over all sequences having the same GC-content as that of wild type sequences. We believe that this insight could be important when designing functional synthetic RNAs. To generate Fig. 5, we proceeded as follows. For each family from the Rfam 12.0 database [24], we took the family consensus structure s c , and computed 〈E(s c )〉. Additionally, for each Rfam family, we selected that sequence a 0, whose minimum free energy (MFE) structure s 0 has smallest base pair distance to the consensus structure s c . We computed the expected energy 〈E(s 0)〉, as well as the free energies E(a,s c ) and E(a,s 0). Figure 5 displays box-and-whiskers plots for the fold change \(\frac {\langle E(s_{c}) \rangle }{E(\mathbf {a}_{0},s_{c})}\) for the consensus structure and the fold change \(\frac {\langle E(s_{0}) \rangle }{E(\mathbf {a}_{0},s_{0})}\) for the minimum free energy structure. Since the dual Boltzmann probability p ∗(a,s 0) is generally larger for sequences a having higher GC-content (as stacked base pairs involving GC,CG have lower free energy than those involving AU,UA,GU,UG), RNAdualPF computes as well the dual partition function for GC-content k, defined by
In this fashion, we can exactly compute the dual expected energy 〈E ∗(s 0,k)〉 of all sequences having GC-content k which approximately fold into target structure s 0. Tables 2, 3 and 4 analyze what we mean by approximately folding into the target structure – i.e. sequences a are preferentially sampled when free energy E(a,s 0) is low, hence have large dual Boltzmann probability. RNAdualPF, even when exact GC-content is controlled, is faster than inverse folding programs by orders of magnitude, hence providing an effective alternative manner of solving inverse folding.
Conclusion
In this paper we describe the algorithm and software RNAdualPF, which computes the dual partition function Z ∗, defined as the sum of Boltzmann factors exp(−E(a,s 0)/RT) of all sequences a with respect to the target structure s 0. Using RNAdualPF, we efficiently sample RNA sequences that (approximately) fold into s 0, where additionally the user can specify IUPAC sequence constraints at certain positions, and whether to include dangles (energy terms for stacked, single-stranded nucleotides). Moreover, the user can require that all sampled sequences have a precisely specified GC-content, since, optionally, we compute the dual partition function Z ∗(k) simultaneously for all values k=G+C. This sampling strategy is complementary to the use of RNAiFold [4], since it allows the study of the properties of long RNA structures whose number of solutions for the inverse folding problem is astronomically large.
We use RNAdualPF to corroborate previous studies [1] using RNAinverse [2], by confirming that precursor microRNAs are significantly mutationally robust when GC-content is not controlled. However, in contrast to [1], we find that precursor microRNAs are significantly non-robust when GC-content is controlled. We confirm and extend previous findings [3] that bacterial small noncoding RNAs display plasticity (structural diversity) and are not statistically robust, when GC-content is not controlled. Additionally, we obtain the new finding that when when GC-content is controlled, bacterial small noncoding RNAs are significantly non-robust, as in the case of precursor microRNAs. One possible reason for the discrepancy between our results and those of [1] could be related with the fact that the energy parameters of Vienna RNA Package 1.4 (Turner 1999 parameters used in the computational experiments of [1]) differ from those of Vienna RNA Package 2.1.9 (Turner 2004 parameters used in the current study with RNAdualPF). Another possible reason is that the inverse folding solutions returned by the program RNAinverse used in [1] show a different bias than sequences returned by RNAdualPF (in this context, we mean the inverse folding solutions filtered from the sequences returned by RNAdualPF).
As mentioned in the Introduction, there is a relation between our C program RNAdualPF and the Python program IncaRNAtion [12], although our work is independent of that of Reinharz et al. [12]. IncaRNAtion is a weighted sampling algorithm that computes the dual partition function for a simple energy model, which only considers base stacking free energies – unlike RNAdualPF, the program IncaRNAtion includes no energy contributions for hairpins, bulges, internal loops, multiloops, dangles, or mismatches. If the user specifies a desired GC-content α, then IncaRNAtion does not compute the dual partition function for GC-content, but rather applies an adjustable heuristic so that after a suitable burn-in period, sequences tend to approximately have GC-content α. See Table 6.2 of [28] for benchmarking results on RNAdualPF and IncaRNAtion, which show conclusively that RNAdualPF is not only faster, but its sequences have a higher probability of folding into the target structure, its sequences have a smaller GC-content in default mode, where GC-content is not controlled, etc.
Our original motivation in designing RNAdualPF was to generate an unbiased sample of near-solutions (or by subsequent selection of solutions) to the inverse folding problem. At present, it seems clear that no program can claim to generate an unbiased sample of inverse folding solutions, since (1) the solution space so large that this hypothesis cannot be tested by brute force methods, and (2) different inverse folding algorithms return solution sequences having different properties, as shown in Table 2 of [4]. Nevertheless, in the same manner that structures sampled by the algorithm of Ding and Lawrence [16] constitute an unbiased, representative set of low energy secondary structures for a given RNA sequence, as implemented in Sfold [10] and RNAsubopt -p [2], the collection of RNA sequences sampled by the algorithm RNAdualPF constitute an unbiased, representative set of sequences having low energy with respect to a given target structure s 0. Although the minimum free energy structure of such sequences may indeed be distinct from s 0, it is likely that the MFE structure and the target s 0 be similar, as shown in Tables 2, 3 and 4. Moreover, Fig. 6 presents relative frequency plots that suggest that when GC-content is controlled, the sequences returned by RNAdualPF have similar properties to those of wild type sequences: (1) similar expected base pair distance to the wild type target structure [9], (2) similar ensemble defect to the target wild type structure [20], (3) similar positional entropy [22], (4) similar Vienna structural diversity (called ensemble diversity in [14]), (5) similar Morgan-Higgs diversity [23], (6) similar expected proportion of native contacts (called ensemble neutrality in [21]). These graphs were produced by using RNAdualPF to sample 2,000 sequences for each of the 250 C. elegans precursor microRNAs from the miRBase 20 database [11], in each of the following cases: (a) GC-content identical to that of the Rfam sequence, (b) no control for the GC-content. Figure 7 presents additional data, computed in the same manner for C. elegans pre-miRNA from miRBase 20, showing that when GC-content is controlled, sequences sampled by RNAdualPF satisfy the following: (1) the average length-normalized base pair distance between the minimum free energy and target structures is ≈0.05, (2) wild type and RNAdualPF sampled sequences have similar free energy with respect to the wild type target structure, (3) as well as similar minimum free energy, (4) similar dual probability, and (5) similar probability to wild type RNA sequences. Taken together, this data shows that if GC-content is controlled, then RNAdualPF returns sequences whose low energy structures tend to resemble the target structure. Figures 8 and 9 are similar to Figs. 6 and 7, except that for each C. elegans pre-miRNA, 500 sequences were generated by RNAdualPF, each of whose MFE structure is identical to the wild type target structure (this was done by repeatedly sampling sequences from RNAdualPF until 500 sequences were found, that fold exactly into the target pre-miRNA structure). Taken together, Figs. 6, 7, 8 and 9 present convincing evidence that RNAdualPF generates sequences that (approximately) fold into the user-specified target structure, hence supporting our finding that C. elegans precursor microRNAs are statistically non-robust, contrary to the finding of [1].
Additionally, we have shown that natural RNAs from the Rfam 12.0 database have higher minimum free energy than expected, thus supporting our results in [9] which suggest that functional RNAs are under evolutionary pressure to be only marginally thermodynamically stable. The applications described in this paper demonstrate that RNAdualPF is a useful and extremely fast software tool for evolutionary and synthetic biology.
Endnote
1 When dangling positions are not included in the computation (-d0), the algorithm clearly requires linear time. When dangling positions are included (-d2), run time is exponential in the number of components of the largest multilooop; however, in practice the algorithm is extremely fast, and it is possible to modify the algorithm to always run in linear time.
Abbreviations
- MFE:
-
Minimum free energy
- PLMVd:
-
Peach latent mosaic viroid
- sncRNA:
-
Small noncoding RNA
References
Borenstein E, Ruppin E. Direct evolution of genetic robustness in microRNA. Proc Natl Acad Sci. 2006; 103(17):6593–598. doi:10.1073/pnas.0510600103.
Lorenz R, Bernhart SH, Honer Zu Siederdissen C, Tafer H, Flamm C, Stadler PF, Hofacker IL. ViennaRNA Package 2.0. Algorithms Mol Biol. 2011; 6:26. doi:10.1186/1748-7188-6-26.
Rodrigo G, Fares MA. Describing the structural robustness landscape of bacterial small RNAs. BMC Evol Biol. 2012; 12(1):1–12. doi:10.1186/1471-2148-12-52 2148-12-52
Garcia-Martin JA, Dotu I, Clote P. RNAiFold 2.0: a web server and software to design custom and rfam-based RNA molecules. Nucleic Acids Res. 2015; 43(W1):513–21. doi:10.1093/nar/gkv460.
Los Alamos HIV database. 2015. http://www.hiv.lanl.gov/. Accessed 30 Dec 2015.
Krol J, Sobczak K, Wilczynska U, Drath M, Jasinska A, Kaczynska D, Krzyzosiak WJ. Structural features of microRNA (miRNA) precursors and their relevance to mirna biogenesis and small interfering RNA/short hairpin RNA design. J Biol Chem. 2004; 279(40):42230–2239. doi:10.1074/jbc.M404931200.
Hofacker IL, Fontana W, Stadler PF, Bonhoeffer LS, Tacker M, Schuster P. Fast folding and comparison of RNA secondary structures. Monatsch Chem. 1994; 125:167–88.
Zadeh JN, Wolfe BR, Pierce NA. Nucleic acid sequence design via efficient ensemble defect optimization. J Comput Chem. 2011; 32(3):439–52.
Garcia-Martin JA, Clote P, Dotu I. RNAiFold: a constraint programming algorithm for RNA inverse folding and molecular design. J Bioinform Comput Biol. 2013; 11(2):1350001. doi:10.1142/S0219720013500017.
Ding Y, Chan CY, Lawrence CE. Sfold web server for statistical folding and rational design of nucleic acids. Nucleic Acids Res. 2004; 32:0.
Kozomara A, Griffiths-Jones S. mirbase: annotating high confidence microRNAs using deep sequencing data. Nucleic Acids Res. 2014; 42(Database):68–73. doi:24275495.
Reinharz V, Ponty Y, Waldispuhl J. A weighted sampling algorithm for the design of RNA sequences with targeted secondary structure and nucleotide distribution. Bioinformatics. 2013; 29(13):308–15.
Turner DH, Mathews DH. NNDB: the nearest neighbor parameter database for predicting stability of nucleic acid secondary structure. Nucleic Acids Res. 2010; 38(Database):280–2. doi:10.1093/nar/gkp892.
Gruber AR, Bernhart SH, Hofacker IL, Washietl S. Strategies for measuring evolutionary conservation of RNA secondary structures. BMC Bioinforma. 2008; 9(1):1–19. doi:10.1186/1471-2105-9-122.
McCaskill JS. The equilibrium partition function and base pair binding probabilities for RNA secondary structure. Biopolymers. 1990; 29:1105–1119. doi:10.1002/bip.360290621.
Ding Y, Lawrence CE. A statistical sampling algorithm for RNA secondary structure prediction. Nucleic Acids Res. 2003; 31:7280–301.
Busch A, Backofen R. INFO-RNA, a fast approach to inverse RNA folding. Bioinformatics. 2006; 22(15):1823–31. doi:10.1093/bioinformatics/btl194.
Zuker M, Mathews DH, Turner DH. Algorithms and thermodynamics for RNA secondary structure prediction: A practical guide In: Barciszewski J, Clark BFC, editors. RNA Biochemistry and Biotechnology. NATO ASI Series. Dordrecht: Kluwer Academic Publishers: 1999. p. 11–43.
Griffiths-Jones S, Bateman A, Marshall M, Khanna A, Eddy SR. Rfam: an RNA family database. Nucleic Acids Res. 2003; 31(1):439–41.
Dirks RM, Lin M, Winfree E, Pierce NA. Paradigms for computational nucleic acid design. Nucleic Acids Res. 2004; 32(4):1392–1403. doi:10.1093/nar/gkh291.
Pei S, Anthony JS, Meyer MM. Sampled ensemble neutrality as a feature to classify potential structured RNAs. BMC Genomics. 2015; 16(1):1–12. doi:10.1186/s12864-014-1203-8.
Huynen M, Gutell R, Konings D. Assessing the reliability of RNA folding using statistical mechanics. J Mol Biol. 1997; 267(5):1104–12. doi:10.1006/jmbi.1997.0889.
Morgan SR, Higgs PG. Barrier heights between ground states in a model of RNA secondary structure. J Phys A: Math Gen. 1998; 31:3153–170. doi:10.1088/0305-4470/31/14/005.
Nawrocki EP, Burge SW, Bateman A, Daub J, Eberhardt RY, Eddy SR, Floden EW, Gardner PP, Jones TA, Tate J, Finn RD. Rfam 12.0: updates to the RNA families database. Nucleic Acids Res. 2015; 43(D1):130–7. doi:10.1093/nar/gku1063.
Sprinzl M, Horn C, Brown M, Ioudovitch A, Steinberg S. Compilation of tRNA sequences and sequences of tRNA genes. Nucleic Acids Res. 1998; 26(1):148–53. doi:10.1093/nar/26.1.148.
Juhling F, Morl M, Hartmann RK, Sprinzl M, Stadler PF, Putz J. tRNAdb 2009: compilation of tRNA sequences and tRNA genes. Nucleic Acids Res. 2009; 37(Database):159–62.
Garcia-Martin JA, Clote P. RNA Thermodynamic Structural Entropy. PLoS ONE. 2015; 10(11):0137859. doi:10.1371/journal.pone.0137859.
Garcia-Martin JA. RNA inverse folding and synthetic design. Ph.D. dissertation in Biology, Boston College. 2016. Dissertation made available on June 28, 2016 and will remain accessible indefinitely: http://hdl.handle.net/2345/bc-ir:106989.
Acknowledgements
We would like to thank the anonymous reviewers for helpful comments.
Funding
Funding for this research was provided by National Science Foundation grant DBI-1262439. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.
Availability of data and materials
Source code for OSX and Linux of the C/C++-software RNAdualPF is available at http://bioinformatics.bc.edu/clotelab/RNAdualPF under GPL3 licence.
Authors’ contributions
Project design PC. Algorithm design PC, ID, JAGM. Implementation JAGM. Software testing and benchmarking JAGM using a program to test structural diversity implemented by AHB. Manuscript preparation PC, JAGM. All authors read and approved the final manuscript.
Competing interests
The authors declare that they have no competing interests.
Consent for publication
Not applicable.
Ethics approval and consent to participate
Not applicable.
Author information
Authors and Affiliations
Corresponding author
Additional file
Additional file 1
Supplementary Information. Title is Supplementary Information for RNAdualPF: software to compute the dual partition function with sample applications in molecular evolution theory. This is a 3-page PDF file containing a tricky derivation for an efficient computation of the number of external loops of size N with GC-content k, where the user can stipulate that certain positions are constrained to contain nucleotides consistent with IUPAC codes. (PDF 273 kb)
Rights and permissions
Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver(http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.
About this article
Cite this article
Garcia-Martin, J., Bayegan, A., Dotu, I. et al. RNAdualPF: software to compute the dual partition function with sample applications in molecular evolution theory. BMC Bioinformatics 17, 424 (2016). https://doi.org/10.1186/s12859-016-1280-6
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s12859-016-1280-6