Sequence-structure relations of pseudoknot RNA

Huang, Fenix WD; Li, Linda YM; Reidys, Christian M

doi:10.1186/1471-2105-10-S1-S39

Volume 10 Supplement 1

Selected papers from the Seventh Asia-Pacific Bioinformatics Conference (APBC 2009)

Research
Open access
Published: 30 January 2009


BMC Bioinformatics volume 10, Article number: S39 (2009)


Abstract

Background

The analysis of sequence-structure relations of RNA is based on a specific notion and folding of RNA structure. The notion of coarse grained structure employed here is that of canonical RNA pseudoknot contact-structures with at most two mutually crossing bonds (3-noncrossing). These structures are folded by a novel, ab initio prediction algorithm, cross, capable of searching all 3-noncrossing RNA structures. The algorithm outputs the minimum free energy structure.

Results

After giving some background on RNA pseudoknot structures and outlining the folding algorithm employed, we present various statistical results on the mapping from RNA sequences into 3-noncrossing RNA pseudoknot structures. We study properties such as the fraction of pseudoknot structures, the dominant pseudoknot-shapes, neutral walks, neutral neighbors and local connectivity. We then put our results into the context of molecular evolution of RNA.

Conclusion

Our results imply that, in analogy to RNA secondary structures, 3-noncrossing pseudoknot RNA represents a molecular phenotype that is well suited for molecular and in particular neutral evolution. We can conclude that extended, percolating neutral networks of pseudoknot RNA exist.

Background

Three decades ago, Michael Waterman pioneered the combinatorics and ab initio prediction of the then rather exotic ribonucleic acid (RNA) secondary structures [1–5]. The motivation for this work came from a fundamental dichotomy represented by RNA. On the one hand, RNA is described by its primary sequence, a linear string composed of the nucleotides A, G, U and C. The primary sequence embodies the genotypic legislative. On the other hand, RNA, being less structurally constrained than its chemical relative DNA, does fold into 3D-structures, representing the phenotypic executive. Therefore one molecule stands for both: geno- and phenotype.

Indeed, a vast variety of RNA activities was found: the discovery of catalytic RNAs, or ribozymes, in 1981 proved that RNA could catalyze reactions just as proteins do. RNA can also act as a messenger between DNA and protein in the form of messenger RNA. The realization that RNA combines features of proteins with those of DNA led to the "RNA world" hypothesis for the origin of life. The idea was that DNA and the much more versatile proteins took over RNA's functions in the transition from the "RNA-world" to the "DNA/protein-world".

Let us have a closer look at RNA phenotypes. RNA molecules form "helical" structures by folding, i.e. pairing their nucleotides and thereby lowering their free energy; the optimum is the minimum free energy (mfe) configuration. Originally, these bonds were subject to strict combinatorial constraints, for instance "noncrossing" in RNA secondary structures. For the latter, dynamic programming (DP) algorithms predicting the minimum free energy configuration were given in 1980 [5, 6]. It is well known, however, that RNA structures are far more complex than secondary structures. One particularly prominent feature is the existence of cross-serial dependencies [7], that is crossing arcs or pseudoknots, see Figure 1, where we display the natural UTR-pseudoknot structure of the mouse hepatitis virus. The algorithm cross also folds into the natural structure given in Figure 1. In Figure 2 we present another RNA pseudoknot structure, the HDV-pseudoknot. We present here the structure as folded by cross and also its natural structure [8].

In fact, RNA pseudoknots are "everywhere". They occur in functional RNA, for instance in RNAseP [9] as well as in ribosomal RNA [10]. They are conserved in the catalytic core of group I introns, in plant viral RNAs they mimic tRNA structure, and they appear in in vitro RNA evolution [11], where experiments produced families of RNA structures with pseudoknot motifs when binding HIV-1 reverse transcriptase. Important mechanisms like ribosomal frame shifting [12] also involve pseudoknot interactions.

For prediction algorithms the implications of cross-serial dependencies are severe: they imply a higher level of formal language, namely context-sensitive. In general, on this level of formal languages it is not clear whether or not polynomial time ab initio folding algorithms exist. Indeed, Lyngsø et al. [13] showed that "reasonable" classes of RNA pseudoknots require exponential time algorithms. There exist, however, polynomial time folding algorithms capable of the energy based prediction of certain pseudoknots: Rivas et al. [14], Uemura et al. [15], Akutsu [16] and Lyngsø [13]. The output of these algorithms, however, remains somewhat "mysterious": it is not clear which types of pseudoknots can be generated.

In analogy to the case of RNA secondary structures, the identification of key combinatorial properties of the output class offers deeper understanding. The combinatorial properties of RNA pseudoknot structures discussed in the following have indeed profound implications: first, sequence-structure maps will generate exponentially many structures with neutral networks of exponential size. Second, the latter will come close to each other in sequence space, thereby allowing for efficient evolutionary search. None of these findings depend on the particular choice of loop-energies or the partition function [17]. Furthermore, without combinatorial specification, as is the case for the above mentioned DP based pseudoknot folding algorithms [14], one arrives at an impossibly large configuration space.

For instance, the inductive generation of gap-matrices produces an arbitrarily high number of mutually crossing arcs. The results in [18] prove that the exponential growth rate of pseudoknot structures is linear in the crossing number. Accordingly, via gap-matrices, an uncontrollably large output class is being generated. Nevertheless, the DP-routine using pairs of gap-matrices cannot generate any 3-noncrossing nonplanar pseudoknot structure.

We will show that the notion of k-noncrossing diagrams [19] allows us to specify a suitable output-class for pseudoknot folding algorithms. Recall that a diagram is a graph over the vertex set [n] = {1, ..., n} with vertex degree less than or equal to one. It is represented by drawing the vertices in a horizontal line and its arcs (i, j), where i < j, in the upper half-plane. The vertices and arcs correspond to nucleotides and Watson-Crick (A-U, G-C) and (U-G) base pairs, respectively. A diagram is k-noncrossing if it contains at most k - 1 mutually crossing arcs. Diagrams have the following three key parameters: the maximum number of mutually crossing arcs, k - 1, the minimum arc-length, λ, and the minimum stack-length, τ. The length of an arc (i, j) is j - i and a stack of length τ is a sequence of "parallel" arcs of the form

((i, j), (i + 1, j - 1), ..., (i + (τ - 1), j - (τ - 1))),

see Figure 3. We call an arc of length λ a λ-arc. Biophysical constraints on the base pairings imply that in all RNA structures λ is greater than or equal to four. We call diagrams with a minimum stack-length τ, τ-canonical and if λ ≥ 4 we refer to diagrams as structures. To reiterate, in the simplest case we have 2-noncrossing RNA structures, i.e. the secondary structures in which no two arcs cross, see Figure 4. The noncrossing of arcs has far-reaching consequences. It implies that RNA secondary structures form a context free language and allow for the DP algorithms [20], predicting the loop-based mfe-secondary structure in O(n³)-time and O(n²)-space.
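The three diagram parameters introduced above can be checked mechanically. The following is a minimal Python sketch and not part of cross; the function names (`crosses`, `max_mutually_crossing`, `stack_lengths`, `is_structure`) are illustrative assumptions, and the brute-force search over arc subsets is only practical for small diagrams.

```python
from itertools import combinations


def crosses(a, b):
    """True if arcs a = (i, j) and b = (r, s) cross: i < r < j < s or r < i < s < j."""
    (i, j), (r, s) = a, b
    return i < r < j < s or r < i < s < j


def max_mutually_crossing(arcs):
    """Size of the largest set of pairwise crossing arcs (brute force)."""
    best = 1 if arcs else 0
    for size in range(2, len(arcs) + 1):
        found = any(
            all(crosses(a, b) for a, b in combinations(subset, 2))
            for subset in combinations(arcs, size)
        )
        if not found:
            break  # no crossing set of this size, hence none larger
        best = size
    return best


def stack_lengths(arcs):
    """Lengths of the maximal stacks ((i, j), (i + 1, j - 1), ...)."""
    arc_set = set(arcs)
    lengths = []
    for (i, j) in sorted(arc_set):
        if (i - 1, j + 1) in arc_set:
            continue  # (i, j) is not the outermost arc of its stack
        length = 1
        while (i + length, j - length) in arc_set:
            length += 1
        lengths.append(length)
    return lengths


def is_structure(arcs, k, tau, lam=4):
    """Check k-noncrossing, minimum arc-length lam and minimum stack-length tau."""
    if not arcs:
        return True
    return (
        max_mutually_crossing(arcs) <= k - 1
        and min(j - i for i, j in arcs) >= lam
        and min(stack_lengths(arcs)) >= tau
    )


# Two crossing stacks of length three (an H-type pseudoknot over [18]).
h_type = [(1, 12), (2, 11), (3, 10), (6, 18), (7, 17), (8, 16)]
print(is_structure(h_type, k=3, tau=3), is_structure(h_type, k=2, tau=3))
```

The example diagram is 3-noncrossing and 3-canonical, but it is not a secondary structure, since its two stacks cross.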

Let us now, having some background on RNA structures, return to the RNA-world. Around 1990 Peter Schuster and his coworkers initiated a paradigm shift. They began to study evolutionary optimization and neutral evolution of RNA via the relation between RNA genotypes and phenotypes. The particular mapping from RNA sequences into RNA secondary structures was obtained by the algorithm ViennaRNA [21], an implementation of the folding routine [6, 22] mentioned above. Two particularly prominent results of this line of work were the existence of neutral networks, i.e. vast, extended networks composed of sequences folding into a given secondary structure [23], and the Intersection Theorem [23]. The latter guarantees for any two secondary structures the existence of at least one sequence which simultaneously satisfies all constraints imposed by their Watson-Crick and G-U base pairs. For the implication of the latter with respect to molecular switches, see [24]. It became evident that the "statistical" properties of this mapping played a central role in the molecular evolution of RNA.

But there is more. Two discoveries suggested that RNA might not just be a stepping stone towards a DNA/protein world. They show that RNA plays an active role in vital cell processes. A large number of very small RNAs of about 22 nucleotides in length, called microRNAs (miRNAs), were discovered. They were found in organisms as diverse as the worm Caenorhabditis elegans and humans, and they have a particular relationship to certain intermediates in RNA interference (RNAi). These findings have put RNA, in particular noncoding RNA, into the spotlight. In addition, RNA's conformational versatility and catalytic abilities have been identified in the context of protein synthesis and RNA splicing. More and more parallels between RNA and protein are currently being revealed [25].

Let us next briefly review what we know about the combinatorics of our phenotypes, ultimately allowing for the computation of biophysically relevant pseudoknot structures [26]. The key result comes from a seemingly unrelated field, the combinatorics of partitions. Chen et al. proved in a seminal paper [27] a bijection between walks in Weyl chambers and k-noncrossing partitions. This bijection has recently been generalized to tangled diagrams [28]. Now, a k-noncrossing diagram is a special type of k-noncrossing tangle and the relevance of Chen's result lies in the fact that the walks in question can be enumerated via the reflection principle. In fact, the reflection principle facilitated the computation of the generating function of k-noncrossing canonical pseudoknot RNA [19, 26, 29]. Subsequent singularity analysis [26, 29] showed that the exponential growth rates of canonical pseudoknot RNA structures are surprisingly small, see Table 1 [26]. For instance, the number of 3-noncrossing, 3-canonical RNA structures with arc-length greater than or equal to four is asymptotically given by

c n⁻⁵ 2.0348ⁿ,

where c is some (explicitly known) constant. This exponential growth rate is very close to Schuster et al.'s finding [30] for 2-canonical RNA secondary structures with arc-length greater than or equal to four

1.4848 n^(-3/2) 1.8444ⁿ.

Table 1 Exponential growth rates of ⟨k, τ⟩-structures. We have k-noncrossing structures with minimum stack-length greater than or equal to three.

For the analysis presented here, we use the algorithm cross [28], which produces a transparent output. This algorithm does not follow the DP paradigm and generates the mfe-k-noncrossing τ-canonical structure via a combination of branch and bound as well as DP techniques. cross inductively constructs k-noncrossing, τ-canonical RNA structures via motifs. Currently, full loop-based energy models are derived and implemented for k = 3 and τ ≥ 3.

Therefore, cross finds the mfe-RNA pseudoknot structure in which there are at most two mutually crossing arcs, which has minimum arc-length four and in which each stack has size at least three. While cross is an exponential time algorithm, it can fold sequences of length 100 with an average folding time of 4.5 minutes.

Methods

While it is beyond the scope of this paper to present the algorithm cross in detail, the objective of this section is first to sketch its key organization and second to discuss some basic properties of RNA pseudoknot structures. These combinatorial properties enable us to assign a unique, loop-based energy. In the course of our analysis we show that an RNA pseudoknot structure can be constructed via simpler substructures. These serve as the building blocks via which cross derives the mfe-pseudoknot structure. At present we do not have an algorithm computing the partition function version of cross. For RNA secondary structures, the partition function was obtained in 1990 [31], three decades after the first mfe-folding algorithms were derived [32–34]. The partition function is based on a fixed sequence and contains vital statistical information on the probabilities of specific structural configurations of the latter. For any inductively constructed structure class, it allows one to compute the base pairing probabilities. In analogy to similar studies in the case of RNA secondary structures [17, 35–45], the partition function is not of key importance for the type of analysis presented here. We shall derive statistical information on the sequence-structure relation by mfe-folding a large number of sequences instead of considering the ensemble of structural configurations of a single sequence.

Cross

The algorithm cross has three distinct phases: the motif-, skeleton- and saturation-phase, see Figure 5 for an overview. We will here briefly discuss these three parts.

Let ≺ denote the following partial order over arcs

(i₁, j₁) ≺ (i₂, j₂) ⇔ i₂ < i₁ ∧ j₁ < j₂,

i.e. an arc α₁ is smaller than α₂ if it is nested in it.

I Motifs

Let us begin by defining core-structures. A k-noncrossing core [29] is a k-noncrossing diagram in which all stacks have size one. The core of a structure is obtained by identifying all its stacks by single arcs, keeping the unpaired nucleotides and finally relabeling, see Figure 6.
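The passage from a structure to its core can be sketched in a few lines of Python. This is an illustration under our own assumptions, not the implementation used by cross: the function name `core` and the representation of a structure as a length n together with a list of 1-based arcs (i, j) are hypothetical choices.

```python
def core(n, arcs):
    """Collapse every stack of a diagram over [n] into its outermost arc.

    Returns (m, core_arcs): the number of remaining vertices and the
    relabeled arcs over [m]. Unpaired vertices are kept.
    """
    arc_set = set(arcs)
    # An arc is the outermost arc of its stack if (i - 1, j + 1) is absent.
    outermost = [(i, j) for (i, j) in arc_set if (i - 1, j + 1) not in arc_set]
    # Endpoints of all other (inner) stack arcs are removed.
    removed = {
        v
        for (i, j) in arc_set
        if (i - 1, j + 1) in arc_set
        for v in (i, j)
    }
    survivors = [v for v in range(1, n + 1) if v not in removed]
    relabel = {v: new for new, v in enumerate(survivors, start=1)}
    core_arcs = sorted((relabel[i], relabel[j]) for (i, j) in outermost)
    return len(survivors), core_arcs


# Two crossing stacks of length three collapse into two crossing arcs.
print(core(18, [(1, 12), (2, 11), (3, 10), (6, 18), (7, 17), (8, 16)]))
```

In this example the eight endpoints of the inner stack arcs are removed, so the core lives on ten vertices and consists of the two crossing arcs (1, 6) and (4, 10).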

A ⟨k, τ⟩-motif is a ⟨k, τ⟩-diagram over [n], having the following properties

(M1) it has a nonnesting core

(M2) all its arcs are contained in stacks of length exactly τ = 3 and length λ = 4.

An m-shadow is a k-noncrossing diagram obtained from a motif m by successively increasing the stacks of m from top to bottom, see Figure 7.

The key observation about motifs is that they can, despite the fact that they exhibit cross-serial dependencies, be generated inductively [46].

II Skeleta

Skeleta represent the non-inductive "frames" of pseudoknot RNA, i.e. skeleta entail exactly the cross-serial dependencies that need to be considered exhaustively. A skeleton, S, is a 3-noncrossing structure whose core has a connected L-graph. The L-graph of a diagram is the graph whose vertices are the arcs of the diagram, two of them being adjacent if the corresponding arcs cross [46]. An irreducible shadow over [i, j], denoted IS_{i,j}, is a skeleton which has no nested arcs, see Figure 7. Phase II consists of the generation of all skeleta-trees, which are rooted in irreducible shadows.
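The L-graph condition is a plain graph-connectivity test on the arcs of a core. The sketch below illustrates it in Python; the helper names (`l_graph`, `is_connected`, `has_nested_arcs`) are our own, and the input is assumed to be the list of arcs of a core, i.e. with all stacks already collapsed.

```python
from itertools import combinations


def crosses(a, b):
    """True if arcs a = (i, j) and b = (r, s) cross."""
    (i, j), (r, s) = a, b
    return i < r < j < s or r < i < s < j


def l_graph(arcs):
    """Adjacency sets of the L-graph: arcs are vertices, crossing arcs are adjacent."""
    adjacency = {a: set() for a in arcs}
    for a, b in combinations(arcs, 2):
        if crosses(a, b):
            adjacency[a].add(b)
            adjacency[b].add(a)
    return adjacency


def is_connected(adjacency):
    """Depth-first search connectivity test; the empty graph counts as connected."""
    if not adjacency:
        return True
    start = next(iter(adjacency))
    seen, stack = {start}, [start]
    while stack:
        a = stack.pop()
        for b in adjacency[a]:
            if b not in seen:
                seen.add(b)
                stack.append(b)
    return len(seen) == len(adjacency)


def has_nested_arcs(arcs):
    """True if some arc is nested in another one."""
    return any(
        (i2 < i1 and j1 < j2) or (i1 < i2 and j2 < j1)
        for (i1, j1), (i2, j2) in combinations(arcs, 2)
    )


chain = [(1, 5), (3, 8), (6, 10)]  # each arc crosses its neighbor
print(is_connected(l_graph(chain)), has_nested_arcs(chain))
```

The three-arc chain has a connected L-graph and no nested arcs, whereas two arcs placed side by side, such as (1, 4) and (5, 8), give a disconnected L-graph.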

III Saturation

Given a skeleton, cross saturates or "fills" via context-sensitive DP routines the skeleton-intervals. Note that, while the inserted substructures cannot cross any arc of the skeleton, they will in general contain crossing arcs within themselves.

To summarize, first cross inductively constructs all roots of the skeleta-trees, second cross generates the skeleta-trees themselves and third it saturates the skeleta.

Loops

We next discuss loops of 3-noncrossing RNA structures. Loops are not only the basic building blocks for the mfe-evaluation but also of importance for the coarse grained notion of pseudoknot-shapes, discussed in Subsection. Let β be an arc in the 3-noncrossing RNA structure S, and denote by A_S(β) the set of S-arcs that cross β. Clearly, we have β ∈ A_S(α) if and only if α ∈ A_S(β). An arc α ∈ A_S(β) is called a minimal β-crossing arc if there exists no α' ∈ A_S(β) such that α' ≺ α.
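To make the notion of a minimal β-crossing arc concrete, here is a small Python sketch. The names `nested_in`, `crossing_set` and `minimal_crossing_arcs` are illustrative assumptions; `nested_in` implements the partial order ≺ defined above.

```python
def crosses(a, b):
    """True if arcs a = (i, j) and b = (r, s) cross."""
    (i, j), (r, s) = a, b
    return i < r < j < s or r < i < s < j


def nested_in(a, b):
    """a ≺ b: the arc a = (i1, j1) is nested in b = (i2, j2)."""
    (i1, j1), (i2, j2) = a, b
    return i2 < i1 and j1 < j2


def crossing_set(arcs, beta):
    """A_S(beta): all arcs of the structure that cross beta."""
    return {a for a in arcs if crosses(a, beta)}


def minimal_crossing_arcs(arcs, beta):
    """Arcs of A_S(beta) that contain no other arc of A_S(beta) nested inside them."""
    candidates = crossing_set(arcs, beta)
    return sorted(
        a for a in candidates
        if not any(nested_in(other, a) for other in candidates)
    )


structure = [(1, 12), (2, 11), (3, 10), (6, 18), (7, 17), (8, 16)]
print(minimal_crossing_arcs(structure, (6, 18)))
```

For the two crossing stacks used here, all three arcs of the first stack cross β = (6, 18), and only the innermost one, (3, 10), is minimal.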

Let the interval [i, j] denote the sequence

(i, i + 1, ..., j - 1, j).

It is shown in [46] that any 3-noncrossing RNA structure can be uniquely decomposed into the following four loop-types:

(1) a hairpin-loop is a pair

((i, j), [i + 1, j - 1])

where (i, j) is an arc.

(2) an interior-loop is a sequence

((i₁, j₁), [i₁ + 1, i₂ - 1], (i₂, j₂), [j₂ + 1, j₁ - 1]),

where (i₂, j₂) is nested in (i₁, j₁).

(3) a multi-loop, see Figure 8, is a sequence

((i₁, j₁), [i₁ + 1, ω₁ - 1], S_{ω₁}^{τ₁}, [τ₁ + 1, ω₂ - 1], S_{ω₂}^{τ₂}, ...)

where S_{ω_h}^{τ_h} denotes a pseudoknot structure over [ω_h, τ_h] (i.e. nested in (i₁, j₁)), subject to the following condition: if S_{ω_h}^{τ_h} = (ω_h, τ_h) for all h, i.e. all substructures are simply arcs, then h ≥ 2.

(4) a pseudoknot, see Figure 9, consisting of the following data:

(P1) a set of arcs

P = {(i₁, j₁), (i₂, j₂), ..., (i_t, j_t)},

where i₁ = min{i_s} and j_t = max{j_s}, such that

(i) the diagram induced by the arc-set P is irreducible, i.e. the line-graph of P is connected and

(ii) for each (i_s, j_s) ∈ P there exists some arc β (not necessarily contained in P) such that (i_s, j_s) is minimal β-crossing.

(P2) all vertices i₁ < r < j_t, not contained in hairpin-, interior- or multi-loops.

Decomposition

We now show that each 3-noncrossing RNA structure can uniquely be constructed from simpler substructures [46]. Furthermore, each 3-noncrossing RNA structure has a unique loop decomposition, which is the basis of our energy evaluation. We remark that assertion (b) of the following result remains valid for arbitrary crossing number, k.

Theorem. Suppose k ≥ 2, τ ≥ 3.

(a) Any k-noncrossing, τ-canonical RNA structure corresponds to a unique sequence of shadows.

(b) Any ⟨3, τ⟩-structure has a unique loop-decomposition.

In Figure 10 we illustrate how these decompositions work.

Results and discussion

Our results are organized in two sections. First we describe our findings with respect to the statistics of pseudoknot RNA structures and second we present our data with respect to the particular organization of the sequences in neutral networks.

Minimum free energy RNA pseudoknot structures

In this section we present some key statistics on pseudoknotted RNA structures. In order to put our findings into context we consider two variants of cross: first, cross₃, which generates 3-noncrossing, 3-canonical mfe-structures and second, cross₄, which produces 3-noncrossing, 4-canonical mfe-structures.

The fraction of pseudoknots

We next compute the fraction of RNA structures with pseudoknots within all structures for cross₃ and cross₄. Figure 11 displays the fraction of structures with pseudoknots as a function of sequence length. It is evident that the fraction of pseudoknotted structures increases monotonically with the sequence length. Our data are based on folding 2000 random sequences via cross and suggest a linear relation. In particular, for n = 100, approximately 50% of the structures folded by both versions of cross contain pseudoknots.

Pseudoknot-shapes

Next we study the dominant pseudoknot-shapes as a function of sequence length. Our notion of pseudoknot-shape is based on k-noncrossing cores [29] discussed in Subsection. The shape of a structure S, is a subset of the core-arcs, induced by all arcs either contained in pseudoknots or arcs contained in multi-loops which contain nested pseudoknots. In other words, a pseudoknot-shape contains all pseudoknot-arcs and all arcs affecting the energy of pseudoknots, see Figure 12. In Figure 12 we display for cross₃ and cross₄ the dominant types. The shape data are obtained by folding 2000 random sequences. In Figure 13 we display the fraction of sequences on which cross₃ and cross₄ coincide, based on folding 2000 random sequences.

Stack-statistics in pseudoknot RNA

It is well known that large stacks contribute to a low mfe of a structure. In this section we relate the distribution of stacks in random structures to the distribution of stacks in mfe-pseudoknot structures generated by cross. This provides insight into which particular spectrum of pseudoknot structures cross produces.

Let us first discuss the distribution of stacks in random pseudoknot structures. The naive approach would be to generate a random structure and count the number of stacks. However, it is at present time not known how to construct a random pseudoknot structure with uniform probability. Therefore we have to employ a different strategy in order to obtain this distribution for random structures. The key idea [47] is to consider the bivariate generating function

T_{k, τ} (x, u) = \sum_{n \geq 0} \sum_{0 \leq t \leq \frac{n}{2}} T_{k, τ} (n, t) u^{t} x^{n}

(3)

where T_{k, τ}(n, t) denotes the number of k-noncrossing, τ-canonical pseudoknot structures having exactly t stacks. T_{k, τ}(x, u) can be computed using the cores introduced in Section. The stack-distribution is now given by

P (X_{k, τ}^{n} = t) = T_{k, τ} (n, t) / T_{k, τ} (n)

(4)

and via singularity analysis one can show that this distribution becomes asymptotically normal with mean μ_{k, τ} n and variance σ_{k, τ}^{2} n, where

μ_{k, τ} = - \frac{γ'_{k, τ}(0)}{γ_{k, τ}(0)}

(5)

σ_{k, τ}^{2} = \left( \frac{γ'_{k, τ}(0)}{γ_{k, τ}(0)} \right)^{2} - \frac{γ''_{k, τ}(0)}{γ_{k, τ}(0)}.

(6)

Here γ_{k, τ}(u) is the unique dominant singularity parameterized by u = e^s. In Table 2 we display the values μ_{k, τ} and σ_{k, τ}^{2} for k = 2, 3, 4 and τ = 3, ..., 7. Accordingly the number of stacks scales linearly with sequence length and so does the number of loops, since each loop corresponds to a stack. In Figure 14 we present the stack distributions of 3000 structures of random sequences folded by cross₄ and the normal distribution obtained from Table 2 (lhs). Analogously we present the stack distributions of 3000 structures of random sequences folded by cross₅ and the normal distribution obtained from Table 2 (rhs).
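For very small n the stack distribution (4) can also be obtained by exhaustive enumeration, which gives a feel for the quantities T_{k, τ}(n, t). The Python sketch below does this only for the noncrossing case k = 2, i.e. for secondary structures. It is a brute-force illustration under our own assumptions, not the generating-function approach of [47], and the function names are hypothetical.

```python
from collections import Counter


def noncrossing_diagrams(i, j, lam):
    """Yield every noncrossing arc set on vertices i..j with arc-length >= lam."""
    if i > j:
        yield frozenset()
        return
    # Case 1: vertex i stays unpaired.
    yield from noncrossing_diagrams(i + 1, j, lam)
    # Case 2: vertex i is paired with m; the arc splits the interval in two.
    for m in range(i + lam, j + 1):
        for inner in noncrossing_diagrams(i + 1, m - 1, lam):
            for outer in noncrossing_diagrams(m + 1, j, lam):
                yield inner | outer | {(i, m)}


def stack_lengths(arcs):
    """Lengths of the maximal stacks of an arc set."""
    arc_set = set(arcs)
    lengths = []
    for (i, j) in arc_set:
        if (i - 1, j + 1) in arc_set:
            continue  # not the outermost arc of its stack
        length = 1
        while (i + length, j - length) in arc_set:
            length += 1
        lengths.append(length)
    return lengths


def stack_distribution(n, tau, lam=4):
    """P(X = t) = T(n, t) / T(n) for tau-canonical secondary structures over [n]."""
    counts = Counter()
    for arcs in noncrossing_diagrams(1, n, lam):
        lengths = stack_lengths(arcs)
        if all(length >= tau for length in lengths):
            counts[len(lengths)] += 1
    total = sum(counts.values())
    return {t: c / total for t, c in sorted(counts.items())}


print(stack_distribution(10, tau=3))
```

For n = 10, τ = 3 and λ = 4 there are exactly four structures: the empty one and three single stacks with outermost arcs (1, 9), (2, 10) and (1, 10). The enumeration grows exponentially in n, which is why the asymptotic analysis is needed for realistic sequence lengths.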

Table 2 Mean and variances. Mean and variances of the normal limit distributions of the numbers of stacks in pseudoknot RNA structures for different k and τ. We list mean (μ) and variance (σ²).

Full size table

Neutrality and local connectivity

The mapping from sequences to structures plays an important role for evolution [23, 43, 48]. One of its key roles is to facilitate the search of a sequence-population for better adapted shapes. In this context, Table 1 contains some nontrivial information about the mapping from RNA sequences into their pseudoknot structures. To be precise, Table 1, in combination with central limit theorems for the number of arcs in k-noncrossing RNA structures [49, 50], allows us to conclude that there exist exponentially many k-noncrossing canonical structures with exponentially large preimages. Indeed, according to Table 1 the exponential growth rate of the number of k-noncrossing canonical structures, 3 ≤ k ≤ 9, is strictly smaller than four, the growth rate of the space of all sequences over the natural alphabet.

The central limit theorems for the number of arcs of k-noncrossing, canonical pseudoknot structures [50] exhibit a mean of 0.39 n and a variance of 0.041 n. We conclude from this that sequence to structure maps in pseudoknot RNA structures cannot be trivial, since the preimages of particular structures have exponential growth rates strictly smaller than four. As a result the number of canonical pseudoknot structures grows exponentially. Accordingly, a sequence to structure map in pseudoknot RNA necessarily generates exponentially many canonical structures.
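The gap between the two growth rates can be turned into a rough order-of-magnitude estimate. The Python sketch below compares only the exponential terms, 4ⁿ sequences against 2.0348ⁿ structures. It drops the subexponential factor c n⁻⁵, and it assumes nothing about which structures are actually realized as mfe-structures, so it should be read as a crude illustration rather than a result of the paper. The function name is hypothetical.

```python
import math


def log10_sequences_per_structure(n, growth_rate, alphabet_size=4):
    """log10 of alphabet_size**n / growth_rate**n (exponential terms only)."""
    return n * (math.log10(alphabet_size) - math.log10(growth_rate))


# Growth rates quoted in the text: 2.0348 for 3-noncrossing, 3-canonical
# structures and 1.8444 for 2-canonical secondary structures.
for rate in (2.0348, 1.8444):
    print(rate, round(log10_sequences_per_structure(100, rate), 1))
```

For n = 100 and the growth rate 2.0348 the ratio is of the order of 10²⁹, which illustrates why preimages of structures must on average be exponentially large.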

In light of this, the interesting question then becomes how the set of sequences folding into a given structure is "organized" in sequence space. The analysis presented in this section is analogous to the investigations for RNA secondary structures [23, 51] and can be viewed as a basic protocol for the local statistics of a genotype-phenotype map. The only exception is Subsection, which elaborates on the novel concept of local connectivity [48].

It is only possible to derive local statistics since, for instance, an exhaustive computation of the set of all sequences over the natural alphabet with fixed pseudoknot structure is at present impossible for n > 40.

Neutral walks

Let us consider a fixed RNA structure, S. Let furthermore C[S] denote the set of S-compatible sequences, consisting of all sequences that have at any two paired positions one of the 6 nucleotide pairs

(A, U), (U, A), (G, U), (U, G), (G, C), (C, G).

The structure S motivates us to consider a new adjacency relation within C[S]. Indeed, we may reorganize a sequence (x₁, ..., x_n) into the pair

((u_{1}, ..., u_{n_{u}}), (p_{1}, ..., p_{n_{p}})),

(7)

where the u_j denote the unpaired nucleotides and the p_j = (x_i, x_k) all base pairs, respectively, see Figure 15. We can then view v_u = (u₁, ..., u_{n_u}) and v_p = (p₁, ..., p_{n_p}) as elements of the formal cubes Q_4^{n_u} and Q_6^{n_p}, implying the new adjacency relation for elements of C[S].

Accordingly, there are two types of compatible neighbors in sequence space, u- and p-neighbors: a u-neighbor has Hamming distance one and differs exactly by a point mutation at an unpaired position. Analogously, a p-neighbor differs by a compatible base pair-mutation, see Figure 15. Note, however, that a p-neighbor has either Hamming distance one ((G, C) ↦ (G, U)) or Hamming distance two ((G, C) ↦ (C, G)). We call a u- or a p-neighbor, y, a compatible neighbor. If y is contained in the neutral network we refer to y as a neutral neighbor. This gives rise to the compatible and the neutral distance, denoted by C(v, v') and N(v, v'). These are the minimum lengths of a C[S]-path and of a path in the neutral network between v and v', respectively.
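The two neighbor types can be enumerated directly from a sequence and the arcs of S. The following Python sketch is an illustration under our own conventions: sequences are strings over A, C, G, U, arcs are 1-based pairs (i, j), and the name `compatible_neighbors` is hypothetical.

```python
NUCLEOTIDES = "ACGU"
BASE_PAIRS = [("A", "U"), ("U", "A"), ("G", "U"), ("U", "G"), ("G", "C"), ("C", "G")]


def compatible_neighbors(seq, arcs):
    """All u- and p-neighbors of seq with respect to the structure given by arcs."""
    paired = {v for arc in arcs for v in arc}
    neighbors = []
    # u-neighbors: point mutations at unpaired positions (Hamming distance one).
    for pos in range(1, len(seq) + 1):
        if pos in paired:
            continue
        for x in NUCLEOTIDES:
            if x != seq[pos - 1]:
                neighbors.append(seq[:pos - 1] + x + seq[pos:])
    # p-neighbors: replace a base pair by another compatible pair
    # (Hamming distance one or two).
    for (i, j) in arcs:
        current = (seq[i - 1], seq[j - 1])
        for (a, b) in BASE_PAIRS:
            if (a, b) != current:
                mutated = list(seq)
                mutated[i - 1], mutated[j - 1] = a, b
                neighbors.append("".join(mutated))
    return neighbors


print(len(compatible_neighbors("GAAAAC", [(1, 6)])))
```

A compatible sequence with n_u unpaired positions and n_p base pairs has 3 n_u + 5 n_p compatible neighbors, in line with the cubes Q_4^{n_u} and Q_6^{n_p}; for the toy hairpin above this gives 3 · 4 + 5 · 1 = 17.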

Our basic experiment is as follows: We select a (random) sequence, v, and fold it into the structure S(v). We then proceed inductively: assume v_i is constructed. We randomly select some neutral (compatible) neighbor of v_i, denoted by v_{i+1}, subject to the condition d_H(v, v_{i+1}) > d_H(v, v_i), where d_H(x, y) denotes the Hamming distance. If no such neighbor exists we choose some v_{i+1} ≠ v_i with the property d_H(v, v_{i+1}) = d_H(v, v_i). If all neutral v_i-neighbors satisfy d_H(v, v_{i+1}) < d_H(v, v_i) we stop and output the integer d_H(v, v_i). In Figure 16 we study 200 neutral walks for the following four structures: first an H-pseudoknot loop structure (a), second a hairpin-loop structure (b), third an interior-loop structure (c) and finally the phenylalanine tRNA structure (d), see Figure 17. Our findings are in accordance with those for RNA secondary structures. One can easily neutrally traverse sequence space, suggesting the picture of vast, connected networks composed of neutral sequences.
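The walk protocol just described can be sketched as follows. This is a hedged illustration: the folding routine is passed in as a parameter `fold`, which stands in for cross and is purely hypothetical here, and the step limit `max_steps` is our own addition so that the loop is guaranteed to terminate when moves at equal distance remain possible.

```python
import random

NUCLEOTIDES = "ACGU"
BASE_PAIRS = [("A", "U"), ("U", "A"), ("G", "U"), ("U", "G"), ("G", "C"), ("C", "G")]


def hamming(x, y):
    """Hamming distance between two sequences of equal length."""
    return sum(a != b for a, b in zip(x, y))


def compatible_neighbors(seq, arcs):
    """u- and p-neighbors of seq; arcs are 1-based pairs (i, j)."""
    paired = {v for arc in arcs for v in arc}
    out = []
    for pos in range(1, len(seq) + 1):
        if pos not in paired:
            out += [seq[:pos - 1] + x + seq[pos:] for x in NUCLEOTIDES if x != seq[pos - 1]]
    for (i, j) in arcs:
        for (a, b) in BASE_PAIRS:
            if (a, b) != (seq[i - 1], seq[j - 1]):
                s = list(seq)
                s[i - 1], s[j - 1] = a, b
                out.append("".join(s))
    return out


def neutral_walk(start, arcs, fold, rng, max_steps=1000):
    """Walk away from start along neutral neighbors; return the final Hamming distance.

    fold is a user-supplied sequence -> structure map (an assumption of this
    sketch); arcs must describe the structure fold(start).
    """
    target = fold(start)
    current = start
    for _ in range(max_steps):
        d = hamming(start, current)
        neutral = [y for y in compatible_neighbors(current, arcs) if fold(y) == target]
        farther = [y for y in neutral if hamming(start, y) > d]
        same = [y for y in neutral if hamming(start, y) == d]
        if farther:
            current = rng.choice(farther)
        elif same:
            current = rng.choice(same)
        else:
            break  # every neutral neighbor is closer to the start
    return hamming(start, current)
```

With a toy `fold` that maps every sequence to the same structure, every compatible neighbor is neutral and the walk reaches the maximal distance n; with a `fold` under which no neighbor is neutral, the walk stops immediately at distance zero. Real experiments would plug in an actual folding routine.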

Neutral neighbors

Complementing the analysis of neutral walks, we study now the distribution of neutral neighbors. Recall that a neutral neighbor of a sequence v with respect to the structure S = S(v) is a u- or a p-neighbor, y, contained in the neutral network of S. It has Hamming distance one or two, depending on whether it is induced by a point or base pair mutation, see Figure 15. The distribution of neutral neighbors provides relevant information about the mutational robustness of the structure S. The data presented here, are obtained in the course of the neutral walk experiments, displayed in Figure 16. They are given in Figure 18. In order to put things into context we also present in Figure 19 the distribution of neutral neighbors for 10000 random sequences folded by cross₄.

Local connectivity

Connectivity of a subgraph, Γ_n, of an n-cube alone does not imply that a small Hamming distance implies a small distance in Γ_n. For neutral sequences this means that two neutral sequences with Hamming distance less than four are possibly connected only via a neutral path of much greater length. Evidently, for molecular evolution it is therefore not connectivity but the existence of these short paths that matters. Local connectivity is a property which guarantees the existence of these short paths. If Γ_n is locally connected then a small Hamming distance does imply a Γ_n-distance scaled by at most a factor of Δ > 0. We shall begin by studying local connectivity for random induced subgraphs of n-cubes, i.e. where we select sequences with independent probability λ_n. Then we transfer the derived concepts to neutral networks of RNA pseudoknot structures.

We call Γ_n locally connected if and only if almost surely (a.s.)

(†)  ∃ Δ > 0;  d_{Γ_n}(v, v') ≤ Δ · d_{Q_2^n}(v, v'),

provided v, v' are in Γ_n. We immediately observe that, trivially, for any finite n such Δ exists. However, the key point is that (†) employs the notion "almost surely", i.e. it holds for arbitrary n.

Random graph theory [48] shows that on the one hand, for λ_n smaller than n^δ/√n, where δ > 0 is arbitrarily small, there exists a.s. no finite Δ satisfying (†). On the other hand, for λ_n larger than or equal to n^δ/√n, there exists a.s. some finite Δ satisfying (†). In other words, there exists a threshold value for local connectivity. Since random subgraphs of n-cubes have giant components for λ_n = (1 + ε)/n, where ε > 0 [52], we can conclude that local connectivity emerges distinctly later in the evolution of random subgraphs of n-cubes.

Suppose we are given a structure S and a sequence v contained in its neutral network. By construction, local connectivity refers to the two n-cubes Q_4^{n_u} and Q_6^{n_p} induced by S, see Figure 20. Let

C₂ = |{v'| C(v, v') = 2}|

be the cardinality of the set of sequences in compatible distance two. Then the degree of local connectivity of S at v is given by

D_S(v) = |{v' | C(v, v') = 2, N(v, v') ≤ 4}| · C₂⁻¹.

(8)

In other words, D_S(v) is the fraction of the sequences at compatible distance two from v that can be reached from v via a neutral path of length at most four.
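The degree of local connectivity can be computed with two breadth-first searches, one in C[S] up to depth two and one in the neutral network up to depth four. The Python sketch below illustrates this. The predicate `is_neutral`, which decides whether a sequence folds into S, is a hypothetical stand-in for a call to a folding routine, and the remaining names are our own.

```python
NUCLEOTIDES = "ACGU"
BASE_PAIRS = [("A", "U"), ("U", "A"), ("G", "U"), ("U", "G"), ("G", "C"), ("C", "G")]


def compatible_neighbors(seq, arcs):
    """u- and p-neighbors of seq; arcs are 1-based pairs (i, j)."""
    paired = {v for arc in arcs for v in arc}
    out = []
    for pos in range(1, len(seq) + 1):
        if pos not in paired:
            out += [seq[:pos - 1] + x + seq[pos:] for x in NUCLEOTIDES if x != seq[pos - 1]]
    for (i, j) in arcs:
        for (a, b) in BASE_PAIRS:
            if (a, b) != (seq[i - 1], seq[j - 1]):
                s = list(seq)
                s[i - 1], s[j - 1] = a, b
                out.append("".join(s))
    return out


def bfs_distances(start, step, depth):
    """Breadth-first distances from start, explored up to the given depth."""
    dist = {start: 0}
    frontier = [start]
    for d in range(1, depth + 1):
        next_frontier = []
        for x in frontier:
            for y in step(x):
                if y not in dist:
                    dist[y] = d
                    next_frontier.append(y)
        frontier = next_frontier
    return dist


def local_connectivity_degree(v, arcs, is_neutral):
    """Fraction of sequences at compatible distance two with neutral distance <= 4."""
    compatible = bfs_distances(v, lambda x: compatible_neighbors(x, arcs), 2)
    at_two = [y for y, d in compatible.items() if d == 2]
    neutral = bfs_distances(
        v,
        lambda x: [y for y in compatible_neighbors(x, arcs) if is_neutral(y)],
        4,
    )
    return sum(1 for y in at_two if y in neutral) / len(at_two)


# Toy check: if every compatible sequence is neutral, the degree is one.
print(local_connectivity_degree("GAAAAC", [(1, 6)], lambda y: True))
```

The two extreme cases behave as expected: a predicate that accepts every sequence yields degree one, and a predicate that rejects every sequence yields degree zero. In an actual experiment each call to `is_neutral` would require folding a sequence, so the cost is dominated by the number of sequences within neutral distance four.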

We perform the following experiment: we consider neutral walks for the UTR-pseudoknot structure of the mouse hepatitis virus displayed in Figure 1, see Subsection. Along these walks we compute the locality degree D_S(v_i) and the total number of locally connected sequences. Our findings are presented in Figure 21. We can report that the degree of local connectivity is, as suggested by random graph theory, almost 100%.

Conclusion

RNA pseudoknot structures, in particular their statistical properties, are a fascinating and new territory. To our knowledge the only statistical data beyond RNA secondary structures were derived for bi-secondary structures in [53, 54]. The structural concept of k-noncrossing canonical RNA structures and the resulting sequence to structure map employed for our experiments is new and represents a natural generalization of RNA secondary and bi-secondary structures. To be precise, bi-secondary structures are exactly planar 3-noncrossing RNA structures [19].

It is clear that for sequence-length less than or equal to 100 we only encounter pseudoknots of limited complexity. Our findings presented in Figure 12 provide a transparent picture of which pseudoknot-shapes dominate for given sequence length. These results, in combination with the data on the fractions of pseudoknotted structures over sequence length, show that for n = 80 we have approximately 35% structures with nontrivial pseudoknots. In addition it is striking that basically all folded structures are irreducible, i.e. only a very small fraction can be decomposed into several independent substructures. This is of interest since decomposable structures can be folded much faster. It is known [55] that Dyck-paths, i.e. paths starting at the origin, having only up (1, 1) or down (1, -1) steps and ending on the x-axis, decompose on average into three irreducible parts. This is of relevance, since a slight generalization of Dyck-paths, the Motzkin-paths, which have additional horizontal steps, correspond to secondary structures. Our findings suggest that while secondary structures decompose nontrivially, higher and higher crossing numbers change the picture. This complicates the computation of mfe-pseudoknot RNA due to their inherent irreducibility.

Both versions of cross produce analogous findings, confirming the generality of our results. The vast majority of pseudoknot-shapes is of a single type. As expected, cross₃ exhibits more structural variety because its minimum stack-length is only three. The fraction of pseudoknot structures rises significantly between n = 80 and n = 100, to approximately 50%. We conclude that pseudoknots cannot be ignored: they evidently become the dominant structure class for n greater than or equal to 100. Figure 13 shows that the fraction of sequences for which cross₃ and cross₄ coincide decreases linearly as a function of sequence length. This indicates that longer and longer sequences will exhibit more subtle structural elements whose emergence is facilitated by stabilizing large stacks.

Furthermore, the mfe-pseudoknot structures generated by cross are far from being random. The central limit theorems for random k-noncrossing canonical RNA structures, given in Table 2, imply that the numbers of stacks, and consequently of loops, scale linearly with the sequence length. Figure 14 clearly shows that for n = 76 the mfe-structures generated by cross₄ and cross₅ have two stacks fewer than random 3-noncrossing structures with minimum stack-length greater than four and five, respectively. This deviation is significant and indicates that mfe-pseudoknot structures are far from "typical" random structures. We remark that, while it is straightforward to generate random RNA secondary structures, it is nontrivial to obtain random pseudoknot structures. In particular, no polynomial time algorithm is known at present that generates a random 3-noncrossing RNA structure with uniform probability.

The organization of the sequences contained in neutral networks of RNA pseudoknot structures appears to be closely analogous to that of the neutral networks of RNA secondary structures [23]. Figure 16 shows that neutral walks can effectively traverse sequence space, and the fractions of neutral neighbors presented in Figure 18 and Figure 19 suggest a high degree of neutrality.
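The notions of neutral neighbors and neutral walks can be illustrated as follows. This is a minimal sketch under stated assumptions, not the procedure used for the figures: it considers point mutations only (the walks defined earlier in the paper may also admit other moves, such as compensatory base-pair mutations), it accepts a step only if the Hamming distance to the start sequence increases, and it takes the folding map as a caller-supplied function `fold`, which stands in for cross. All function names are hypothetical.

```python
import random

ALPHABET = "ACGU"


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def neutral_neighbors(seq, fold):
    """All one-point mutants of seq that fold into the same structure as seq."""
    target = fold(seq)
    result = []
    for i, old in enumerate(seq):
        for new in ALPHABET:
            if new != old:
                mutant = seq[:i] + new + seq[i + 1:]
                if fold(mutant) == target:
                    result.append(mutant)
    return result


def neutral_fraction(seq, fold):
    """Fraction of the 3 * len(seq) one-point mutants that are neutral."""
    return len(neutral_neighbors(seq, fold)) / (3 * len(seq))


def neutral_walk(start, fold, rng=None):
    """Walk through neutral neighbors, moving strictly away from start.

    Each step increases the Hamming distance to start by one, so the
    walk stops after at most len(start) steps.
    """
    rng = rng or random.Random(0)  # fixed seed, an assumption for reproducibility
    walk = [start]
    current = start
    while True:
        distance = hamming(start, current)
        candidates = [m for m in neutral_neighbors(current, fold)
                      if hamming(start, m) > distance]
        if not candidates:
            return walk
        current = rng.choice(candidates)
        walk.append(current)
```

With a real folding routine passed as `fold`, the length of the returned walk minus one gives the Hamming distance reached from the start sequence, and `neutral_fraction` gives the fraction of neutral neighbors of a single sequence. Every call to `fold` is a full structure prediction, so in practice the cost is dominated by folding.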

In the subsection Local connectivity we discussed local connectivity, a property of neutral networks which implies the existence of short, neutral paths. Local connectivity is of central importance for molecular evolution and for any type of evolutionary optimization. It has been shown in [48] that local connectivity is a prerequisite for preserving any type of sequence-specific information. Since its random graph threshold value is localized at $1/\sqrt{n}$, local connectivity appears much later than connectivity, whose threshold is localized at $1/n$. However, the high neutrality degrees of RNA pseudoknot structures shown in Figure 18 and Figure 19 imply locally connected neutral networks. Our findings for the UTR-pseudoknot structure of the mouse hepatitis virus of length 56, given in Figure 21, confirm the local connectivity of neutral networks of particular pseudoknot RNA structures. At all steps of the neutral walks almost all sequences are locally connected.

Abbreviations

UTR: untranslated region
HDV: hepatitis delta virus
DP: dynamic program
lhs: left hand side
rhs: right hand side

References

1. Penner RC, Waterman MS: Spaces of RNA secondary structures. Adv Math. 1993, 101: 31-49. 10.1006/aima.1993.1039.
2. Waterman MS: Combinatorics of RNA hairpins and cloverleaves. Stud Appl Math. 1979, 60: 91-96.
3. Smith TF, Waterman MS: RNA secondary structure. Math Biol. 1978, 42: 31-49.
4. Schmitt WR, Waterman MS: Linear trees and RNA secondary structure. Discr Appl Math. 1994, 51: 317-323. 10.1016/0166-218X(92)00038-N.
5. Howell JA, Smith TF, Waterman MS: Computation of generating functions for biological molecules. J Appl Math. 1980, 39: 119-133.
6. Nussinov R, Jacobson AB: Fast Algorithm for Predicting the Secondary Structure of Single-Stranded RNA. Proc Natl Acad Sci USA. 1980, 77: 6309-6313. 10.1073/pnas.77.11.6309.
7. Searls DB: The language of genes. Nature. 2002, 420: 211-217. 10.1038/nature01255.
8. Webpage of HDV-pseudoknot structure in natural. [http://www.ekevanbatenburg.nl/PKBASE/PKB00075.HTML]
9. Loria A, Pan T: Domain Structure of the ribozyme from eubacterial ribonuclease. RNA. 1996, 2: 551-563.
10. Konings DAM, Gutell RR: A comparison of thermodynamic foldings with comparatively derived structures of 16s and 16s-like rRNAs. RNA. 1995, 1: 559-574.
11. Schneider D, Tuerk C, Gold L: Selection of high affinity RNA ligands to the bacteriophage R17 coat protein. J Mol Biol. 1992, 228: 862-869. 10.1016/0022-2836(92)90870-P.
12. Chamorro M, Parkin N, Varmus HE: An RNA pseudoknot and an optimal heptameric shift site are required for highly efficient ribosomal frameshifting on a retroviral messenger RNA. Proc Natl Acad Sci USA. 1992, 89 (2): 713-7. 10.1073/pnas.89.2.713. 1309954
13. Lyngsø RB, Pedersen CNS: RNA Pseudoknot Prediction in Energy-Based Models. J Comp Biol. 2000, 7: 409-427. 10.1089/106652700750050862.
14. Rivas E, Eddy S: A dynamic programming algorithm for RNA structure prediction including pseudoknots. J Mol Biol. 1999, 285 (5): 2053-2068. 10.1006/jmbi.1998.2436.
15. Uemura Y, Hasegawa A, Kobayashi S, Yokomori T: Tree adjoining grammars for RNA structure prediction. Theor Comp Sci. 1999, 210: 277-303. 10.1016/S0304-3975(98)00090-5.
16. Akutsu T: Dynamic programming algorithms for RNA secondary prediction with pseudoknots. Discr Appl Math. 2000, 104: 45-62. 10.1016/S0166-218X(00)00186-4.
17. Tacker M, Stadler PF, Bornberg-Bauer EG, Hofacker IL, Schuster P: Algorithm independent properties of RNA secondary structure predictions. Europ Biophys. 1996, 25: 115-130. 10.1007/s002490050023.
18. Jin EY, Reidys CM: Asymptotic enumeration of RNA structures with pseudoknots. Bull Math Biol.
19. Jin EY, Qin J, Reidys CM: Combinatorics of RNA structures with Pseudoknots. Bull Math Biol. 2008, 70 (1): 45-67. 10.1007/s11538-007-9240-y.
20. Waterman MS, Smith TF: Rapid dynamic programming methods for RNA secondary structure. Adv Appl Math. 1986, 7: 455-464. 10.1016/0196-8858(86)90025-4.
21. Hofacker IL, Fontana W, Stadler PF, Bonhoeffer LS, Tacker M, Schuster P: Fast Folding and Comparison of RNA Secondary Structures. Monatsh Chem. 1994, 125: 167-188. 10.1007/BF00818163.
22. Zuker M, Stiegler P: Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information. Nucl Acids Res. 1981, 9: 133-148. 10.1093/nar/9.1.133.
23. Reidys CM, Stadler PF, Schuster P: Generic properties of combinatory maps: neutral networks of RNA secondary structures. Bull Math Biol. 1997, 59 (2): 339-397. 10.1007/BF02462007.
24. Schultes EA, Bartel DP: Implications for the Emergence of New Ribozyme Folds. Science. 2000, 289 (5478): 448-452. 10.1126/science.289.5478.448.
25. Jolly A, (Ed): Mapping RNA form and function. Science. 2005, 309: 1441-1632. 10.1126/science.1111873.
26. Ma G, Reidys CM: Canonical RNA Pseudoknot Structures. J Comp Biol.
27. Chen WYC, Deng EYP, Du RRX, Stanley RP, Yan CH: Crossings and nestings of matchings and partitions. Trans Am Math Soc. 2007, 359: 1555-1575. 10.1090/S0002-9947-06-04210-3.
28. Chen WYC, Qin J, Reidys CM: Crossing and Nesting in Tangled-diagrams. Elec J Comb. 2008, 15:
29. Jin EY, Reidys CM: RNA-LEGO: Combinatorial Design of Pseudoknot RNA. Adv Appl Math.
30. Hofacker IL, Schuster P, Stadler PF: Combinatorics of RNA Secondary Structures. Discr Appl Math. 1998, 88: 207-237. 10.1016/S0166-218X(98)00073-0.
31. McCaskill JS: The equilibrium partition function and base pair binding probabilities for RNA secondary structure. Biopolymers. 1990, 29: 1105-1119. 10.1002/bip.360290621.
32. Fresco JR, Alberts BM, Doty P: Some Molecular Details of the Secondary Structure of Ribonucleic Acid. Nature. 1960, 188: 98-101. 10.1038/188098a0.
33. Tinoco I Jr, Uhlenbeck OC, Levine MD: Estimation of Secondary Structure in Ribonucleic Acids. Nature. 1971, 230: 362-367. 10.1038/230362a0.
34. DeLisi C, Crothers DM: Prediction of RNA secondary structure. Proc Natl Acad Sci USA. 1971, 68: 2682-2685. 10.1073/pnas.68.11.2682.
35. Huynen M, Stadler PF, Fontana W: Smoothness within ruggedness: the role of neutrality in adaptation. Proc Natl Acad Sci USA. 1996, 93: 397-401. 10.1073/pnas.93.1.397.
36. Babajide A, Hofacker IL, Sippl MJ, Stadler PF: Neutral Networks in Protein Space: A Computational Study Based on Knowledge-Based Potentials of Mean Force. Folding Design. 1997, 93: 261-269. 10.1016/S1359-0278(97)00037-0.
37. Schuster P: Genotypes with phenotypes: Adventures in an RNA Toy World. Biophys Chem. 1997, 6: 75-110. 10.1016/S0301-4622(97)00058-6.
38. Fontana W, Schuster P: Shaping Space: The Possible and the Attainable in RNA Genotype-Phenotype Mapping. J Theor Biol. 1998, 194: 491-515. 10.1006/jtbi.1998.0771.
39. Stadler PF: Fitness Landscapes Arising from the Sequence-Structure Maps of Biopolymers. J Mol Struct (THEOCHEM). 1999, 463: 7-19. 10.1016/S0166-1280(98)00387-X.
40. Schuster P, Fontana W: Chance and Necessity in Evolution: Lessons from RNA. Physica D. 1999, 133: 427-452.
41. Reidys CM, Stadler PF: Combinatorial Landscapes. SIAM Review. 2002, 44: 3-54. 10.1137/S0036144501395952.
42. Hofacker IL, Fekete M, Flamm C, Huynen MA, Rauscher S, Stolorz PE, Stadler PF: Automatic Detection of Conserved RNA Structure Elements in Complete RNA Virus Genomes. Nucl Acids Res. 1998, 26: 3825-3836. 10.1093/nar/26.16.3825.
43. Schuster P, Fontana W, Stadler PF, Hofacker IL: From Sequences to Shapes and Back: A Case Study in RNA Secondary Structures. Proc Roy Soc Lond B. 1994, 255: 279-284. 10.1098/rspb.1994.0040.
44. Gruener W, Giegerich R, Strothmann D, Reidys CM, Weber J, Hofacker IL, Stadler PF, Schuster P: Analysis of RNA sequence structure maps by exhaustive enumeration I. Neutral networks. Monatsh Chem. 1996, 127: 375-389. 10.1007/BF00810882.
45. Gruener W, Giegerich R, Strothmann D, Reidys CM, Weber J, Hofacker IL, Stadler PF, Schuster P: Analysis of RNA sequence structure maps by exhaustive enumeration. II. Monatsh Chem. 1996, 127: 355-374. 10.1007/BF00810881.
46. Huang FWD, Peng WWP, Reidys CM: Folding RNA pseudoknot structures. [In preparation].
47. Han HSW, Reidys CM: Stacks in canonical RNA pseudoknot structures. Comp Appl Math.
48. Reidys CM: Local Connectivity of Neutral Networks. Bull Math Biol.
49. Jin EY, Reidys CM: Central and Local Limit Theorems for RNA Structures. J Theor Biol. 2008, 250 (3): 547-559. 10.1016/j.jtbi.2007.09.020.
50. Huang FWD, Reidys CM: Statistics of canonical RNA pseudoknot structures. J Theor Biol.
51. Fontana W, Schuster P: Shaping Space: the Possible and the Attainable in RNA Genotype-Phenotype Mapping. J Theor Biol. 1998, 194 (4): 491-515. 10.1006/jtbi.1998.0771.
52. Reidys CM: Large components in random induced subgraphs of N-cubes. Discr Math.
53. Stadler PF, Haslinger C: RNA Structures with Pseudo-Knots. Bull Math Biol. 1999, 61: 437-467. 10.1006/bulm.1998.0085.
54. Haslinger C: RNA Structures with Pseudoknots. PhD thesis. 1997, University of Vienna
55. Shapiro L: A survey of the Riordan Group. Proc Amer Math Soc. 1994

Acknowledgements

We are grateful to J.Z.M. Gao, H.S.W. Han and W.W.J. Peng for helpful discussions. This work was supported by the 973 Project, the PCSIRT Project of the Ministry of Education, the Ministry of Science and Technology, and the National Science Foundation of China.

This article has been published as part of BMC Bioinformatics Volume 10 Supplement 1, 2009: Proceedings of The Seventh Asia Pacific Bioinformatics Conference (APBC) 2009. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/10?issue=S1

Author information

Authors and Affiliations

Center for Combinatorics, LPMC-TJKLC, Nankai University, Tianjin, 300071, PR China
Fenix WD Huang, Linda YM Li & Christian M Reidys

Corresponding author

Correspondence to Christian M Reidys.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors' contributions

All authors contributed equally to this paper.

Rights and permissions

Open Access This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License ( https://creativecommons.org/licenses/by/2.0 ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

About this article

Cite this article

Huang, F.W., Li, L.Y. & Reidys, C.M. Sequence-structure relations of pseudoknot RNA. BMC Bioinformatics 10 (Suppl 1), S39 (2009). https://doi.org/10.1186/1471-2105-10-S1-S39

Published: 30 January 2009
DOI: https://doi.org/10.1186/1471-2105-10-S1-S39
