Sequence-structure relations of pseudoknot RNA
BMC Bioinformatics volume 10, Article number: S39 (2009)
Abstract
Background
The analysis of sequence-structure relations of RNA is based on a specific notion and folding of RNA structure. The notion of coarse-grained structure employed here is that of canonical RNA pseudoknot contact-structures with at most two mutually crossing bonds (3-noncrossing). These structures are folded by a novel, ab initio prediction algorithm, cross, capable of searching all 3-noncrossing RNA structures. The algorithm outputs the minimum free energy structure.
Results
After giving some background on RNA pseudoknot structures and providing an outline of the folding algorithm being employed, we present in this paper various statistical results on the mapping from RNA sequences into 3-noncrossing RNA pseudoknot structures. We study properties such as the fraction of pseudoknot structures, the dominant pseudoknot-shapes, neutral walks, neutral neighbors and local connectivity. We then put our results into the context of molecular evolution of RNA.
Conclusion
Our results imply that, in analogy to RNA secondary structures, 3-noncrossing pseudoknot RNA represents a molecular phenotype that is well suited for molecular and in particular neutral evolution. We can conclude that extended, percolating neutral networks of pseudoknot RNA exist.
Background
Three decades ago, Michael Waterman pioneered the combinatorics and ab initio prediction of the, at that time rather exotic, ribonucleic acid (RNA) secondary structures [1–5]. The motivation for this work came from a fundamental dichotomy represented by RNA. On the one hand, RNA is described by its primary sequence, a linear string composed of the nucleotides A, G, U and C. The primary sequence embodies the genotypic legislative. On the other hand, RNA, being less structurally constrained than its chemical relative DNA, does fold into 3D-structures, representing the phenotypic executive. Therefore one molecule stands for both: geno- and phenotype.
Indeed, a vast variety of RNA activities was found: the discovery of catalytic RNAs, or ribozymes, in 1981 proved that RNA could catalyze reactions just as proteins do. RNA can also act as a messenger between DNA and protein in the form of transfer RNA. The realization that RNA combines features of proteins with those of DNA led to the "RNA world" hypothesis for the origin of life. The idea was that DNA and the much more versatile proteins took over RNA's functions in the transition from the "RNA-world" to the "DNA/protein-world".
Let us have a closer look at RNA phenotypes. RNA molecules form "helical" structures by folding, i.e. pairing their nucleotides and thereby lowering their minimum free energy (mfe). Originally, these bonds were subject to strict combinatorial constraints, for instance "noncrossing" in RNA secondary structures. For the latter, dynamic programming (DP) algorithms predicting the minimum free energy configuration were given in 1980 [5, 6]. It is well known, however, that RNA structures are far more complex than secondary structures. One particularly prominent feature is the existence of cross-serial dependencies [7], that is, crossing arcs or pseudoknots, see Figure 1, where we display the natural UTR-pseudoknot structure of the mouse hepatitis virus. Cross also folds into the natural structure given in Figure 1. In Figure 2 we present another RNA pseudoknot structure, the HDV-pseudoknot. We present here the structure as folded by cross and also its natural structure [8].
In fact, RNA pseudoknots are "everywhere". They occur in functional RNA, for instance RNAse P [9] as well as ribosomal RNA [10]. They are conserved in the catalytic core of group I introns; in plant viral RNAs, pseudoknots mimic tRNA structure; and in in vitro RNA evolution [11], experiments produced families of RNA structures with pseudoknot motifs when binding HIV-1 reverse transcriptase. Important mechanisms like ribosomal frame shifting [12] also involve pseudoknot interactions.
For prediction algorithms the implications of cross-serial dependencies are severe: they imply a higher level of formal language, namely context-sensitive. In general, on this level of formal languages it is not clear whether or not polynomial time ab initio folding algorithms exist. Indeed, Lyngsø et al. [13] showed that "reasonable" classes of RNA pseudoknots require exponential time algorithms. There exist, however, polynomial time folding algorithms capable of the energy-based prediction of certain pseudoknots: Rivas et al. [14], Uemura et al. [15], Akutsu [16] and Lyngsø [13]. The output of these algorithms, however, remains somewhat "mysterious": it is not clear which types of pseudoknots can be generated.
In analogy to the case of RNA secondary structures, the identification of key combinatorial properties of the output class offers deeper understanding. The combinatorial properties of RNA pseudoknot structures discussed in the following have indeed profound implications: first, sequence-structure maps will generate exponentially many structures with neutral networks of exponential size. Second, the latter will come close to each other in sequence space, thereby allowing for efficient evolutionary search. None of these findings depend on the particular choice of loop-energies or the partition function [17]. Furthermore, without combinatorial specification, as is the case for the above-mentioned DP-based pseudoknot folding algorithms [14], one arrives at an impossibly large configuration space.
For instance, the inductive generation of gap-matrices produces an arbitrarily high number of mutually crossing arcs. The results in [18] prove that the exponential growth rate of pseudoknot structures is linear in the crossing number. Accordingly, via gap-matrices, an uncontrollably large output class is being generated. Nevertheless, the DP-routine using pairs of gap-matrices cannot generate any 3-noncrossing nonplanar pseudoknot structure.
We will show that the notion of k-noncrossing diagrams [19] allows us to specify a suitable output-class for pseudoknot folding algorithms. Recall that a diagram is a graph over the vertex set [n] = {1, ..., n} with vertex degree less than or equal to one. It is represented by drawing the vertices in a horizontal line and its arcs (i, j), where i < j, in the upper half-plane. The vertices and arcs correspond to nucleotides and Watson-Crick (A-U, G-C) and (U-G) base pairs, respectively. A diagram is k-noncrossing if it contains at most k − 1 mutually crossing arcs. Diagrams have the following three key parameters: the maximum number of mutually crossing arcs, k − 1, the minimum arc-length, λ, and the minimum stack-length, τ. The length of an arc (i, j) is j − i and a stack of length τ is a sequence of "parallel" arcs of the form
((i, j), (i + 1, j − 1), ..., (i + (τ − 1), j − (τ − 1))),
see Figure 3. We call an arc of length λ a λ-arc. Biophysical constraints on the base pairings imply that in all RNA structures λ is greater than or equal to four. We call diagrams with a minimum stack-length τ, τ-canonical and if λ ≥ 4 we refer to diagrams as structures. To reiterate, in the simplest case we have 2-noncrossing RNA structures, i.e. the secondary structures in which no two arcs cross, see Figure 4. The noncrossing of arcs has far-reaching consequences. It implies that RNA secondary structures form a context-free language and allow for the DP algorithms [20], predicting the loop-based mfe-secondary structure in O(n^{3}) time and O(n^{2}) space.
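The three diagram parameters can be made concrete with a small sketch. The following Python is only an illustration under stated assumptions: a diagram is represented as a list of arcs (i, j) with i < j, all function names are hypothetical, and the crossing number is found by brute force, which is only feasible for very small examples and is not how cross works.

```python
from itertools import combinations

def crosses(a, b):
    """Two arcs (i, j) and (r, s) cross if i < r < j < s."""
    (i, j), (r, s) = sorted([a, b])
    return i < r < j < s

def max_mutually_crossing(arcs):
    """Size of the largest set of pairwise crossing arcs (brute force)."""
    arcs = list(arcs)
    for size in range(len(arcs), 1, -1):
        for subset in combinations(arcs, size):
            if all(crosses(a, b) for a, b in combinations(subset, 2)):
                return size
    return 1 if arcs else 0

def is_k_noncrossing(arcs, k):
    """A diagram is k-noncrossing if at most k - 1 arcs mutually cross."""
    return max_mutually_crossing(arcs) <= k - 1

def min_arc_length(arcs):
    """Minimum of j - i over all arcs (i, j)."""
    return min(j - i for i, j in arcs)

def stack_lengths(arcs):
    """Lengths of the maximal stacks ((i, j), (i + 1, j - 1), ...)."""
    arc_set = set(arcs)
    lengths = []
    for i, j in sorted(arc_set):
        if (i - 1, j + 1) in arc_set:
            continue  # not the outermost arc of its stack
        length = 1
        while (i + length, j - length) in arc_set:
            length += 1
        lengths.append(length)
    return lengths

# Illustrative H-type pseudoknot over [15]: two crossing stacks of length 3.
example = [(1, 10), (2, 9), (3, 8), (5, 15), (6, 14), (7, 13)]
```

For the illustrative diagram `example`, arcs within a stack are nested, so at most two arcs mutually cross: the diagram is 3-noncrossing but not 2-noncrossing, and it is 3-canonical with minimum arc-length five.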
Let us now, having some background on RNA structures, return to the RNA-world. Around 1990 Peter Schuster and his coworkers initiated a paradigm shift. They began to study evolutionary optimization and neutral evolution of RNA via the relation between RNA genotypes and phenotypes. The particular mapping from RNA sequences into RNA secondary structures was obtained by the algorithm ViennaRNA [21], an implementation of the folding routine [6, 22] mentioned above. Two particularly prominent results of this line of work were the existence of neutral networks, i.e. vast, extended networks composed of sequences folding into a given secondary structure [23], and the Intersection Theorem [23]. The latter guarantees for any two secondary structures the existence of at least one sequence which simultaneously satisfies all constraints imposed by their Watson-Crick and G-U base pairs. For the implication of the latter with respect to molecular switches, see [24]. It became evident that the "statistical" properties of this mapping played a central role in the molecular evolution of RNA.
But there is more. Two discoveries suggested that RNA might not just be a stepping stone towards a DNA/protein world. They show that RNA plays an active role in vital cell processes. A large number of very small RNAs of about 22 nucleotides in length, called microRNAs (miRNAs), were discovered. They were found in organisms as diverse as the worm Caenorhabditis elegans and humans, and they are related to certain intermediates in RNA interference (RNAi). These findings have put RNA, in particular noncoding RNA, into the spotlight. In addition, RNA's conformational versatility and catalytic abilities have been identified in the context of protein synthesis and RNA splicing. More and more parallels between RNA and protein are currently being revealed [25].
Let us next briefly review what we know about the combinatorics of our phenotypes, ultimately allowing for the computation of biophysically relevant pseudoknot structures [26]. The key result comes from a seemingly unrelated field, the combinatorics of partitions. Chen et al. proved in a seminal paper [27] a bijection between walks in Weyl chambers and k-noncrossing partitions. This bijection has recently been generalized to tangled diagrams [28]. Now, a k-noncrossing diagram is a special type of k-noncrossing tangle and the relevance of Chen's result lies in the fact that the walks in question can be enumerated via the reflection principle. In fact, the reflection principle facilitated the computation of the generating function of k-noncrossing canonical pseudoknot RNA [19, 26, 29]. Subsequent singularity analysis [26, 29] showed that the exponential growth rates of canonical pseudoknot RNA structures are surprisingly small, see Table 1 [26]. For instance, the number of 3-noncrossing, 3-canonical RNA structures with arc-length greater than or equal to four is asymptotically given by
c n^{-5} 2.0348^{n},
where c is some (explicitly known) constant. This exponential growth rate is very close to Schuster et al.'s finding [30] for 2-canonical RNA secondary structures with arc-length greater than or equal to four
1.4848 n^{-3/2} 1.8444^{n}.
For the analysis presented here, we use the algorithm cross [28], which produces a transparent output. This algorithm does not follow the DP paradigm and generates the mfe k-noncrossing, τ-canonical structure via a combination of branch and bound, as well as DP techniques. cross inductively constructs k-noncrossing, τ-canonical RNA structures via motifs. Currently, full loop-based energy models are derived and implemented for k = 3 and τ ≥ 3.
Therefore, cross finds the mfe-RNA pseudoknot structure in which there are at most two mutually crossing arcs, which has minimum arc-length four and in which each stack has size at least three. While cross is an exponential time algorithm, it folds sequences of length 100 with an average folding time of 4.5 minutes.
Methods
While it is beyond the scope of this paper to present the algorithm cross in detail, the objective of this section is first to sketch its key organization and second to discuss some basic properties of RNA pseudoknot structures. These combinatorial properties enable us to assign a unique, loop-based energy. In the course of our analysis we show that an RNA pseudoknot structure can be constructed via simpler substructures. These serve as the building blocks via which cross derives the mfe-pseudoknot structure. At present we do not have an algorithm computing the partition function version of cross. For RNA secondary structures, the partition function was obtained in 1990 [31], three decades after the first mfe-folding algorithms were derived [32–34]. The partition function is based on a fixed sequence and contains vital statistical information on the probabilities of specific structural configurations of the latter. For any inductively constructed structure class, it allows one to compute the base pairing probabilities. In analogy to similar studies in the case of RNA secondary structures [17, 35–45], the partition function is not of key importance for the type of analysis presented here. We shall derive statistical information on the sequence-structure relation by mfe-folding a large number of sequences instead of considering the ensemble of structural configurations of a single sequence.
Cross
The algorithm cross has three distinct phases: the motif-, skeleton- and saturation-phase, see Figure 5 for an overview. We will here briefly discuss these three parts.
Let ≺ denote the following partial order over arcs
(i_{1}, j_{1}) ≺ (i_{2}, j_{2}) ⇔ i_{2} < i_{1} ∧ j_{1} < j_{2},
i.e. an arc α_{1} is smaller than α_{2} if it is nested in it.
I Motifs
Let us begin by defining core-structures. A k-noncrossing core [29] is a k-noncrossing diagram in which all stacks have size one. The core of a structure is obtained by identifying all its stacks by single arcs, keeping the unpaired nucleotides and finally relabeling, see Figure 6.
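The passage from a structure to its core can be sketched as follows. This is a minimal illustration, assuming a diagram over [n] is given as a list of arcs (i, j) with i < j; the function name `core` is hypothetical and not taken from the cross implementation.

```python
def core(n, arcs):
    """Collapse every stack of a diagram over [1..n] to a single arc,
    keep the unpaired vertices, and relabel the remaining vertices 1..m."""
    arc_set = set(arcs)
    # Keep only the outermost arc of each maximal stack.
    kept = [(i, j) for (i, j) in arc_set if (i - 1, j + 1) not in arc_set]
    # Endpoints of the inner stack arcs disappear from the core.
    removed = {v for arc in arc_set - set(kept) for v in arc}
    survivors = [v for v in range(1, n + 1) if v not in removed]
    relabel = {v: new for new, v in enumerate(survivors, start=1)}
    return len(survivors), sorted((relabel[i], relabel[j]) for i, j in kept)

# Illustrative H-type pseudoknot over [15] with two stacks of length 3.
example = [(1, 10), (2, 9), (3, 8), (5, 15), (6, 14), (7, 13)]
core_size, core_arcs = core(15, example)
```

In this example the two stacks collapse to the arcs (1, 10) and (5, 15), the eight inner endpoints are dropped, and relabeling yields a core over [7] consisting of the two crossing arcs (1, 4) and (3, 7).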
A ⟨k, τ⟩-motif is a ⟨k, τ⟩-diagram over [n], having the following properties
(M1) it has a non-nesting core
(M2) all its arcs are contained in stacks of length exactly τ = 3 and have arc-length λ = 4.
An m-shadow is a k-noncrossing diagram obtained by successively increasing the stacks of a motif m from top to bottom, see Figure 7.
The key observation about motifs is that they can, despite the fact that they exhibit cross-serial dependencies, be generated inductively [46].
II Skeleta
Skeleta represent the non-inductive "frames" of pseudoknot RNA, i.e. skeleta entail exactly the cross-serial dependencies that need to be considered exhaustively. A skeleton, S, is a 3-noncrossing structure whose core has a connected L-graph. The L-graph of a diagram is the graph whose vertices are the arcs of the diagram, two vertices being adjacent if their corresponding arcs cross [46]. An irreducible shadow over [i, j], IS_{i,j}, is a skeleton which has no nested arcs, see Figure 7. Phase II consists in the generation of all skeleta-trees, which are rooted in irreducible shadows.
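The connectivity condition on the L-graph can be checked directly. The sketch below is an illustration under the assumption that a core is given as a list of arcs (i, j) with i < j; the names `l_graph` and `is_connected` are hypothetical, and an empty diagram is treated as not connected by convention.

```python
def crosses(a, b):
    """Two arcs (i, j) and (r, s) cross if i < r < j < s."""
    (i, j), (r, s) = sorted([a, b])
    return i < r < j < s

def l_graph(arcs):
    """Adjacency sets of the L-graph: vertices are arcs, edges join crossing arcs."""
    arcs = list(arcs)
    adjacency = {a: set() for a in arcs}
    for x, a in enumerate(arcs):
        for b in arcs[x + 1:]:
            if crosses(a, b):
                adjacency[a].add(b)
                adjacency[b].add(a)
    return adjacency

def is_connected(adjacency):
    """Depth-first search from an arbitrary vertex; True if every vertex is reached."""
    if not adjacency:
        return False
    start = next(iter(adjacency))
    seen, stack = {start}, [start]
    while stack:
        for neighbor in adjacency[stack.pop()]:
            if neighbor not in seen:
                seen.add(neighbor)
                stack.append(neighbor)
    return len(seen) == len(adjacency)
```

For instance, the core {(1, 4), (3, 7)} has a connected L-graph, whereas the nested pair {(1, 4), (2, 3)} does not, since nested arcs are not adjacent in the L-graph.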
III Saturation
Given a skeleton, cross saturates or "fills" the skeleton-intervals via context-sensitive DP routines. Note that, while the inserted substructures cannot cross any arc of the skeleton, they will in general contain crossing arcs within themselves.
To summarize, first cross inductively constructs all roots of the skeleta-trees, second cross generates the skeleta-trees themselves and third it saturates the skeleta.
Loops
We next discuss loops of 3-noncrossing RNA structures. Loops are not only the basic building blocks for the mfe-evaluation but also of importance for the coarse-grained notion of pseudoknot-shapes discussed below. Let β be an arc in the 3-noncrossing RNA structure S and denote by A_{S}(β) the set of S-arcs that cross β. Clearly, we have β ∈ A_{S}(α) if and only if α ∈ A_{S}(β). An arc α ∈ A_{S}(β) is called a minimal, β-crossing arc if there exists no α' ∈ A_{S}(β) such that α' ≺ α.
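The set of arcs crossing a given arc and its minimal elements with respect to nesting can be computed as in the following sketch. It assumes a structure given as a list of arcs (i, j) with i < j; the function names are hypothetical and chosen for illustration only.

```python
def crosses(a, b):
    """Two arcs (i, j) and (r, s) cross if i < r < j < s."""
    (i, j), (r, s) = sorted([a, b])
    return i < r < j < s

def nested_in(a, b):
    """True if arc a = (i1, j1) is nested in arc b = (i2, j2): i2 < i1 and j1 < j2."""
    (i1, j1), (i2, j2) = a, b
    return i2 < i1 and j1 < j2

def crossing_set(arcs, beta):
    """All arcs of the structure that cross beta."""
    return {a for a in arcs if crosses(a, beta)}

def minimal_crossing_arcs(arcs, beta):
    """Arcs crossing beta that have no other beta-crossing arc nested inside them."""
    crossing = crossing_set(arcs, beta)
    return {a for a in crossing if not any(nested_in(o, a) for o in crossing)}

# Illustrative H-type pseudoknot with two stacks of length 3.
example = [(1, 10), (2, 9), (3, 8), (5, 15), (6, 14), (7, 13)]
```

In the illustrative structure, all three arcs of the first stack cross (5, 15), and the innermost one, (3, 8), is the unique minimal (5, 15)-crossing arc.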
Let the interval [i, j] denote the sequence
(i, i + 1, ..., j − 1, j).
It is shown in [46] that any 3-noncrossing RNA structure can be uniquely decomposed into the following four loop-types:
(1) a hairpin-loop is a pair
((i, j), [i + 1, j − 1])
where (i, j) is an arc.
(2) an interior-loop is a sequence
((i_{1}, j_{1}), [i_{1} + 1, i_{2} − 1], (i_{2}, j_{2}), [j_{2} + 1, j_{1} − 1]),
where (i_{2}, j_{2}) is nested in (i_{1}, j_{1}).
(3) a multi-loop, see Figure 8, is a sequence
((i_{1}, j_{1}), [i_{1} + 1, ω_{1} − 1], S_{ω_{1}}^{τ_{1}}, [τ_{1} + 1, ω_{2} − 1], S_{ω_{2}}^{τ_{2}}, ...),
where S_{ω_{h}}^{τ_{h}} denotes a pseudoknot structure over [ω_{h}, τ_{h}] (i.e. nested in (i_{1}, j_{1})) and subject to the following condition: if all S_{ω_{h}}^{τ_{h}} = (ω_{h}, τ_{h}), i.e. all substructures are simply arcs, for all h, then h ≥ 2.
(4) a pseudoknot, see Figure 9, consisting of the following data:
(P1) a set of arcs
P = {(i_{1}, j_{1}), (i_{2}, j_{2}), ..., (i_{t}, j_{t})},
where i_{1} = min{i_{s}} and j_{t} = max{j_{s}}, such that
(i) the diagram induced by the arc-set P is irreducible, i.e. the line-graph of P is connected and
(ii) for each (i_{s}, j_{s}) ∈ P there exists some arc β (not necessarily contained in P) such that (i_{s}, j_{s}) is minimal β-crossing.
(P2) all vertices i_{1} < r < j_{t}, not contained in hairpin-, interior- or multi-loops.
Decomposition
We now show that each 3-noncrossing RNA structure can uniquely be constructed from simpler substructures [46]. Furthermore, each 3-noncrossing RNA structure has a unique loop decomposition, which is the basis of our energy evaluation. We remark that assertion (b) of the following result remains valid for arbitrary crossing number, k.
Theorem. Suppose k ≥ 2, τ ≥ 3.
(a) Any k-noncrossing, τ-canonical RNA structure corresponds to a unique sequence of shadows.
(b) Any ⟨3, τ⟩-structure has a unique loop-decomposition.
In Figure 10 we illustrate how these decompositions work.
Results and discussion
Our results are organized in two sections. First we describe our findings with respect to the statistics of pseudoknot RNA structures and second we present our data with respect to the particular organization of the sequences in neutral networks.
Minimum free energy RNA pseudoknot structures
In this section we present some key statistics on pseudoknotted RNA structures. In order to put our findings into context we consider two variants of cross: first, cross_{3}, which generates 3-noncrossing, 3-canonical mfe-structures and second, cross_{4}, which produces 3-noncrossing, 4-canonical mfe-structures.
The fraction of pseudoknots
We next compute the fraction of RNA structures with pseudoknots within all structures for cross_{3} and cross_{4}. Figure 11 displays the fraction of structures with pseudoknots as a function of sequence length. It is evident that the fraction of pseudoknotted structures is monotone with respect to the sequence length. Our data are based on folding 2000 random sequences via cross and suggest a linear relation. In particular, for n = 100, approximately 50% of the structures folded by both versions of cross contain pseudoknots.
Pseudoknot-shapes
Next we study the dominant pseudoknot-shapes as a function of sequence length. Our notion of pseudoknot-shape is based on the k-noncrossing cores [29] discussed above. The shape of a structure S is a subset of the core-arcs, induced by all arcs either contained in pseudoknots or contained in multi-loops which contain nested pseudoknots. In other words, a pseudoknot-shape contains all pseudoknot-arcs and all arcs affecting the energy of pseudoknots, see Figure 12. In Figure 12 we display for cross_{3} and cross_{4} the dominant types. The shape data are obtained by folding 2000 random sequences. In Figure 13 we display the fraction of sequences on which cross_{3} and cross_{4} coincide, based on folding 2000 random sequences.
Stack-statistics in pseudoknot RNA
It is well known that large stacks contribute to a low mfe of a structure. In this section we relate the distribution of stacks in random structures to the distribution of stacks in mfe-pseudoknot structures generated by cross. This provides insight into what particular spectrum of pseudoknot structures cross produces.
Let us first discuss the distribution of stacks in random pseudoknot structures. The naive approach would be to generate a random structure and count the number of stacks. However, it is at present time not known how to construct a random pseudoknot structure with uniform probability. Therefore we have to employ a different strategy in order to obtain this distribution for random structures. The key idea [47] is to consider the bivariate generating function
T_{k, τ}(x, u) = \sum_{n ≥ 0} \sum_{t ≥ 0} T_{k, τ}(n, t) u^{t} x^{n},
where T_{k, τ}(n, t) denotes the number of k-noncrossing, τ-canonical pseudoknot structures over [n] having exactly t stacks. T_{k, τ}(x, u) can be computed using the cores introduced above. The stack-distribution is now given by
P_{n}(t) = T_{k, τ}(n, t) / \sum_{t' ≥ 0} T_{k, τ}(n, t'),
and via singularity analysis one can show that this distribution becomes asymptotically normal with mean μ_{k, τ} and variance σ_{k, τ}^{2}, both of which are expressed in terms of γ_{k, τ}(u), the unique dominant singularity parameterized by u = e^{s}. In Table 2 we display the values μ_{k, τ} and σ_{k, τ}^{2} for k = 2, 3, 4 and τ = 3, ..., 7. Accordingly, the number of stacks scales linearly with sequence length and so does the number of loops, since each loop corresponds to a stack. In Figure 14 we present the stack distributions of 3000 structures of random sequences folded by cross_{4} and the normal distribution obtained from Table 2 (lhs). Analogously we present the stack distributions of 3000 structures of random sequences folded by cross_{5} and the normal distribution obtained from Table 2 (rhs).
Neutrality and local connectivity
The mapping from sequence to structures plays an important role for evolution [23, 43, 48]. One of its key roles is to facilitate the search of a sequence-population for better adapted shapes. In this context, Table 1 contains some nontrivial information about the mapping from RNA sequences into their pseudoknot structures. To be precise, Table 1, in combination with central limit theorems for the number of arcs in k-noncrossing RNA structures [49, 50], allows us to conclude that there exist exponentially many k-noncrossing canonical structures with exponentially large preimages. Indeed, according to Table 1 the exponential growth rate of the number of k-noncrossing canonical structures, 3 ≤ k ≤ 9, is strictly smaller than four, the growth rate of the space of all sequences over the natural alphabet.
The central limit theorems for the number of arcs of k-noncrossing, canonical pseudoknot structures [50] exhibit a mean of 0.39 n and a variance of 0.041 n. We conclude from this that sequence-to-structure maps in pseudoknot RNA structures cannot be trivial, since the preimages of particular structures have exponential growth rates strictly smaller than four. As a result the number of canonical pseudoknot structures grows exponentially. Accordingly, a sequence-to-structure map in pseudoknot RNA necessarily generates exponentially many canonical structures.
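The counting argument can be illustrated with a short calculation. The sketch below is only a rough estimate under stated assumptions: it uses the growth rate 2.0348 quoted above for 3-noncrossing, 3-canonical structures, ignores the polynomial factor and the constant c of the asymptotic formula, and the function names are hypothetical. Since 4^n sequences are distributed over a number of structures growing like 2.0348^n, the average preimage size grows at an exponential rate of roughly 4 / 2.0348.

```python
import math

def average_preimage_rate(structure_growth_rate, alphabet_size=4):
    """Exponential growth rate of the average preimage size,
    i.e. alphabet_size^n sequences divided by ~rate^n structures."""
    return alphabet_size / structure_growth_rate

def log10_average_preimage(n, structure_growth_rate, alphabet_size=4):
    """Base-10 logarithm of the exponential part of the average preimage size."""
    return n * math.log10(alphabet_size / structure_growth_rate)

# Growth rate for 3-noncrossing, 3-canonical structures as quoted in the text.
rate = average_preimage_rate(2.0348)
```

Under these simplifications `rate` is slightly below two, so for n = 100 the exponential part of the average preimage size is of order 10^29; the neglected polynomial factor and constant change this estimate but not the exponential rate.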
In light of this, the interesting question becomes how the set of sequences folding into a given structure is "organized" in sequence space. The analysis presented in this section is analogous to the investigations for RNA secondary structures [23, 51] and can be viewed as a basic protocol for the local statistics of a genotype-phenotype map. The only exception is the subsection on local connectivity, which elaborates on this novel concept [48].
It is only possible to derive local statistics, since, for instance, the exhaustive computation of the set of all sequences over the natural alphabet with fixed pseudoknot structure for n > 40 is at present impossible.
Neutral walks
Let us consider a fixed RNA structure, S. Let furthermore C[S] denote the set of S-compatible sequences, consisting of all sequences that have at any two paired positions one of the 6 nucleotide pairs
(A, U), (U, A), (G, U), (U, G), (G, C), (C, G).
The structure S motivates a new adjacency relation within C[S]. Indeed, we may reorganize a sequence (x_{1}, ..., x_{n}) into the pair
((u_{1}, ..., u_{n_{u}}), (p_{1}, ..., p_{n_{p}})),
where the u_{j} denote the unpaired nucleotides and the p_{j} = (x_{i}, x_{k}) all base pairs, respectively, see Figure 15. We can then view v_{u} = (u_{1}, ..., u_{n_{u}}) and v_{p} = (p_{1}, ..., p_{n_{p}}) as elements of the formal cubes Q_{4}^{n_{u}} and Q_{6}^{n_{p}}, implying the new adjacency relation for elements of C[S].
Accordingly, there are two types of compatible neighbors in sequence space, u- and p-neighbors: a u-neighbor has Hamming distance one and differs exactly by a point mutation at an unpaired position. Analogously, a p-neighbor differs by a compatible base pair mutation, see Figure 15. Note, however, that a p-neighbor has either Hamming distance one ((G, C) ↦ (G, U)) or Hamming distance two ((G, C) ↦ (C, G)). We call a u- or a p-neighbor, y, a compatible neighbor. If y is contained in the neutral network we refer to y as a neutral neighbor. This gives rise to the compatible and neutral distances, denoted by C(v, v') and N(v, v'). These are the minimum length of a C[S]-path and of a path in the neutral network between v and v', respectively.
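The two kinds of compatible neighbors can be enumerated as in the following sketch. It assumes that a structure is given as a list of arcs (i, j) with 1-based positions and a sequence as a string over A, C, G, U; the function names are hypothetical. Deciding which compatible neighbors are neutral additionally requires a folding routine and is not part of this sketch.

```python
# The six admissible nucleotide pairs listed in the text.
PAIRS = [("A", "U"), ("U", "A"), ("G", "U"), ("U", "G"), ("G", "C"), ("C", "G")]
NUCLEOTIDES = "ACGU"

def is_compatible(seq, arcs):
    """True if every paired position carries one of the six admissible pairs."""
    return all((seq[i - 1], seq[j - 1]) in PAIRS for i, j in arcs)

def compatible_neighbors(seq, arcs):
    """All u-neighbors (point mutations at unpaired positions) and
    p-neighbors (compatible base pair mutations) of a compatible sequence."""
    paired = {v for arc in arcs for v in arc}
    neighbors = []
    for pos in range(1, len(seq) + 1):          # u-neighbors
        if pos in paired:
            continue
        for x in NUCLEOTIDES:
            if x != seq[pos - 1]:
                neighbors.append(seq[:pos - 1] + x + seq[pos:])
    for i, j in arcs:                           # p-neighbors
        for x, y in PAIRS:
            if (x, y) != (seq[i - 1], seq[j - 1]):
                mutated = list(seq)
                mutated[i - 1], mutated[j - 1] = x, y
                neighbors.append("".join(mutated))
    return neighbors
```

A compatible sequence with n_u unpaired positions and n_p base pairs thus has 3 n_u u-neighbors and 5 n_p p-neighbors; for an illustrative hairpin with three base pairs and four unpaired positions this gives 12 + 15 = 27 compatible neighbors.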
Our basic experiment is as follows: We select a (random) sequence, v, and fold it into the structure S(v). We then proceed inductively: assume v_{i} is constructed. We randomly select some neutral (compatible) neighbor of v_{i}, denoted by v_{i+1}, subject to the condition d_{H}(v, v_{i+1}) > d_{H}(v, v_{i}), where d_{H}(x, y) denotes the Hamming distance. If no such neighbor exists we choose some v_{i+1} ≠ v_{i} with the property d_{H}(v, v_{i+1}) = d_{H}(v, v_{i}). If all neutral v_{i}-neighbors satisfy d_{H}(v, v_{i+1}) < d_{H}(v, v_{i}) we stop and output the integer d_{H}(v, v_{i}). In Figure 16 we study 200 neutral walks for the following four structures: first an H-pseudoknot loop structure (a), second a hairpin-loop structure (b), third an interior-loop structure (c) and finally the phenylalanine tRNA structure (d), see Figure 17. Our findings are in accordance with those for RNA secondary structures. One can easily traverse sequence space neutrally, suggesting the picture of vast, connected networks composed of neutral sequences.
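The walk protocol can be sketched as follows. This is a minimal illustration and not the code used for the experiments: `fold` stands for any folding routine mapping a sequence to a structure (cross in the text), `neighbors` for a routine returning the compatible neighbors of a sequence, and both are passed in as hypothetical callables. The step cap `max_steps` is an added assumption, since steps at equal distance could otherwise continue indefinitely.

```python
import random

def hamming(x, y):
    """Number of positions at which two equal-length sequences differ."""
    return sum(a != b for a, b in zip(x, y))

def neutral_walk(v, fold, neighbors, rng=None, max_steps=10_000):
    """Walk on the neutral network of fold(v), never decreasing the
    Hamming distance to the start sequence v; return the final distance."""
    rng = rng or random.Random(0)
    target = fold(v)
    current = v
    for _ in range(max_steps):
        distance = hamming(v, current)
        neutral = [y for y in neighbors(current) if fold(y) == target]
        farther = [y for y in neutral if hamming(v, y) > distance]
        level = [y for y in neutral
                 if hamming(v, y) == distance and y != current]
        if farther:
            current = rng.choice(farther)   # preferred: increase the distance
        elif level:
            current = rng.choice(level)     # otherwise: stay at equal distance
        else:
            break                           # all neutral neighbors are closer
    return hamming(v, current)
```

With a toy folding map under which every sequence is neutral, the walk reaches the maximal Hamming distance n; with a map under which no neighbor is neutral, it stops immediately at distance zero.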
Neutral neighbors
Complementing the analysis of neutral walks, we now study the distribution of neutral neighbors. Recall that a neutral neighbor of a sequence v with respect to the structure S = S(v) is a u- or a p-neighbor, y, contained in the neutral network of S. It has Hamming distance one or two, depending on whether it is induced by a point or a base pair mutation, see Figure 15. The distribution of neutral neighbors provides relevant information about the mutational robustness of the structure S. The data presented here are obtained in the course of the neutral walk experiments displayed in Figure 16. They are given in Figure 18. In order to put things into context we also present in Figure 19 the distribution of neutral neighbors for 10000 random sequences folded by cross_{4}.
Local connectivity
Connectivity of a subgraph, Γ_{n}, of an n-cube alone does not imply that a small Hamming distance implies a small distance in Γ_{n}. For neutral sequences this means that two neutral sequences with Hamming distance less than four are possibly connected only via a neutral path of much greater length. Evidently, for molecular evolution it is therefore not connectivity but the existence of these short paths that matters. Local connectivity is a property which guarantees the existence of these short paths. If Γ_{n} is locally connected then a small Hamming distance does imply a Γ_{n}-distance scaled by at most a factor of Δ > 0. We shall begin by studying local connectivity for random induced subgraphs of n-cubes, i.e. where we select sequences with independent probability λ_{n}. Then we transfer the derived concepts to neutral networks of RNA pseudoknot structures.
We call Γ_{n} locally connected if and only if almost surely (a.s.)
(†) d_{Γ_{n}}(v, v') ≤ Δ d_{H}(v, v'),
provided v, v' are in Γ_{n}, where d_{Γ_{n}} denotes the distance in Γ_{n}. We immediately observe that, trivially, for any finite n such Δ exists. However, the key point is that (†) employs the notion "almost surely", i.e. it holds for arbitrary n.
Random graph theory [48] shows that on the one hand, for λ_{n} smaller than n^{δ}/\sqrt{n}, where δ > 0 is arbitrarily small, there exists a.s. no finite Δ satisfying (†). On the other hand, for λ_{n} larger than or equal to n^{δ}/\sqrt{n}, there exists a.s. some finite Δ satisfying (†). In other words, there exists a threshold value for local connectivity. Since random subgraphs of n-cubes have giant components for λ_{n} = (1 + ε)/n, where ε > 0 [52], we can conclude that local connectivity emerges distinctly later in the evolution of random subgraphs of n-cubes.
Suppose we are given a structure S and a sequence v contained in its neutral network. By construction, local connectivity refers to the two n-cubes Q_{4}^{n_{u}} and Q_{6}^{n_{p}} induced by S, see Figure 20. Let
C_{2} = |{v' | C(v, v') = 2}|
be the cardinality of the set of sequences in compatible distance two. Then the degree of local connectivity of S at v is given by
D_{S}(v) = |{v' | C(v, v') = 2 and N(v, v') ≤ 4}| / C_{2}.
In other words, D_{S}(v) is the fraction of the compatible distance two neighbors of v that can be reached via a neutral path of length at most four.
We perform the following experiment: we consider neutral walks, as described above, for the UTR-pseudoknot structure of the mouse hepatitis virus displayed in Figure 1. Along these walks we compute the locality degree D_{S}(v_{i}) and the total number of locally connected sequences. Our findings are presented in Figure 21. We can report that the degree of local connectivity is, as suggested by random graph theory, almost 100%.
Conclusion
RNA pseudoknot structures, in particular their statistical properties, are a fascinating and new territory. To our knowledge the only statistical data beyond RNA secondary structures were derived for bi-secondary structures in [53, 54]. The structural concept of k-noncrossing canonical RNA structures and the resulting sequence-to-structure map employed for our experiments is new and represents a natural generalization of RNA secondary and bi-secondary structures. To be precise, bi-secondary structures are exactly planar 3-noncrossing RNA structures [19].
It is clear that for sequence-length less than or equal to 100 we only encounter pseudoknots of limited complexity. Our findings presented in Figure 12 provide a transparent picture of which pseudoknot-shapes dominate for a given sequence length. These results, in combination with the data on the fractions of pseudoknotted structures over sequence length, show that for n = 80 we have approximately 35% structures with nontrivial pseudoknots. In addition it is striking that basically all folded structures are irreducible, i.e. only a very small fraction can be decomposed into several independent substructures. This is of interest since decomposable structures can be folded much faster. It is known [55] that Dyck-paths, i.e. paths starting at the origin, having only up (1, 1) or down (1, −1) steps and ending on the x-axis, decompose on average into three irreducible parts. This is of relevance, since a slight generalization of Dyck-paths, the Motzkin-paths, which have additional horizontal steps, correspond to secondary structures. Our findings suggest that, while secondary structures decompose nontrivially, higher and higher crossing numbers change the picture. This complicates the computation of mfe-pseudoknot RNA due to their inherent irreducibility.
Both versions of cross produce analogous findings, confirming the generality of our results. The vast majority of pseudoknot-shapes is of a single type. As expected, cross_{3} exhibits more structural variety due to the fact that its minimum stack-length is only three. The ratio of pseudoknot structures shifts significantly from n = 80 to n = 100 to approximately 50%. We can conclude from this that pseudoknots cannot be ignored; they evidently become the dominant structure class for n greater than or equal to 100. Figure 13 shows that the fraction of sequences for which cross_{3} and cross_{4} coincide decreases linearly as a function of sequence length. This indicates that larger and larger sequences will exhibit more subtle structural elements whose emergence is facilitated by stabilizing large stacks.
Furthermore, the mfe-pseudoknot structures generated by cross are far from random. The central limit theorems for random k-noncrossing canonical RNA structures, given in Table 2, imply that the numbers of stacks, and consequently of loops, scale linearly with the sequence length. Figure 14 clearly shows that for n = 76 the mfe-structures generated by cross_{4} and cross_{5} have two stacks fewer than random 3-noncrossing structures with stack-length at least four and five, respectively. This deviation is significant and indicates that mfe-pseudoknot structures are far from "typical" random structures. We remark that, while it is straightforward to generate random RNA secondary structures, it is nontrivial to obtain random pseudoknot structures. In particular, at present no polynomial time algorithm is known which generates a random 3-noncrossing RNA structure with uniform probability.
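The remark that random secondary structures are easy to generate can be illustrated with a standard counting recursion. The Python sketch below is our illustration, not the procedure used for the figures: it counts secondary structures on m positions by conditioning on whether the first position is unpaired or paired with position j, and then samples a structure uniformly by choosing each case with probability proportional to its count. It assumes a minimum arc length of two (`MIN_ARC`, a hypothetical parameter) and ignores stack-length constraints, so it does not produce canonical structures.

```python
import random
from functools import lru_cache

MIN_ARC = 2  # assumed minimum arc length j - i (no bond between neighbors)


@lru_cache(maxsize=None)
def count(m):
    """Number of secondary structures on m consecutive positions."""
    if m <= MIN_ARC:
        return 1  # too short for any arc: only the all-unpaired structure
    total = count(m - 1)  # case 1: the first position is unpaired
    for j in range(MIN_ARC + 1, m + 1):
        # case 2: position 1 pairs with position j (1-based), leaving
        # j - 2 enclosed positions and m - j positions to the right.
        total += count(j - 2) * count(m - j)
    return total


def sample(m, rng):
    """Return a uniformly random secondary structure in dot-bracket notation."""
    if m <= MIN_ARC:
        return "." * m
    r = rng.randrange(count(m))
    if r < count(m - 1):
        return "." + sample(m - 1, rng)
    r -= count(m - 1)
    for j in range(MIN_ARC + 1, m + 1):
        weight = count(j - 2) * count(m - j)
        if r < weight:
            return "(" + sample(j - 2, rng) + ")" + sample(m - j, rng)
        r -= weight
    raise AssertionError("unreachable: the case weights sum to count(m)")


if __name__ == "__main__":
    rng = random.Random(0)
    print([count(m) for m in range(9)])
    print(sample(40, rng))
```

The sampler works because the set of secondary structures decomposes uniquely along the first position. No comparably simple decomposition is used here for 3-noncrossing structures, which is consistent with the difficulty noted above.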
The organization of the sequences contained in neutral networks of RNA pseudoknot structures appears closely analogous to that of the neutral networks of RNA secondary structures [23]. Figure 16 shows that neutral walks can effectively traverse sequence space, and the fractions of neutral neighbors presented in Figure 18 and Figure 19 suggest a high degree of neutrality.
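The two quantities referred to here, the fraction of neutral neighbors and the neutral walk, can be sketched in a few lines of Python. The sketch below is an illustration under strong assumptions: it considers point mutations only, and `toy_fold` is a deliberately crude placeholder that pairs each position with its mirror image; it is NOT cross and has no thermodynamic meaning. Any function mapping a sequence to a structure string could be passed in its place. All names are hypothetical.

```python
import random

PAIRS = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G"), ("G", "U"), ("U", "G")}


def toy_fold(seq):
    """Placeholder fold (not cross): pair i with n-1-i if complementary."""
    n = len(seq)
    structure = ["."] * n
    for i in range(n // 2):
        j = n - 1 - i
        if j - i > 3 and (seq[i], seq[j]) in PAIRS:
            structure[i], structure[j] = "(", ")"
    return "".join(structure)


def point_mutants(seq):
    """Yield all sequences at Hamming distance one from seq."""
    for pos, base in enumerate(seq):
        for other in "ACGU":
            if other != base:
                yield seq[:pos] + other + seq[pos + 1:]


def neutral_fraction(seq, fold):
    """Fraction of point mutants that fold into the same structure as seq."""
    target = fold(seq)
    mutants = list(point_mutants(seq))
    return sum(fold(m) == target for m in mutants) / len(mutants)


def hamming(a, b):
    return sum(x != y for x, y in zip(a, b))


def neutral_walk(start, fold, rng):
    """Move to neutral mutants that increase the distance from start."""
    target = fold(start)
    current = start
    while True:
        distance = hamming(start, current)
        steps = [
            m for m in point_mutants(current)
            if hamming(start, m) > distance and fold(m) == target
        ]
        if not steps:
            return current  # no neutral step moves further away
        current = rng.choice(steps)


if __name__ == "__main__":
    start = "GGGAAAACCC"
    end = neutral_walk(start, toy_fold, random.Random(1))
    print(toy_fold(start), neutral_fraction(start, toy_fold))
    print(end, hamming(start, end))
```

The walk terminates because the distance from the start increases strictly at every step and is bounded by the sequence length. With a real folding algorithm in place of `toy_fold`, the final distance gives a rough indication of how far a neutral network extends through sequence space.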
In the subsection on local connectivity we discussed a property of neutral networks which implies the existence of short neutral paths. It is apparent that local connectivity is of central importance for molecular evolution and any type of evolutionary optimization. It has been shown in [48] that local connectivity is a prerequisite for preserving any type of sequence specific information. Having a random graph threshold value localized at 1/\sqrt{n}, local connectivity appears much later than connectivity, whose threshold is localized at 1/n. However, the high neutrality degrees of RNA pseudoknot structures shown in Figure 18 and Figure 19 imply locally connected neutral networks. Our findings for the UTR-pseudoknot structure of the mouse hepatitis virus of length 56, given in Figure 21, confirm the local connectivity of neutral networks of particular pseudoknot RNA structures. At all steps of the neutral walks almost all sequences are locally connected.
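To give a sense of the gap between the two thresholds, the following worked values take n = 56, the length of the UTR-pseudoknot considered above. Reading n as the sequence length is our assumption for illustration only.

```latex
\frac{1}{n} = \frac{1}{56} \approx 0.018,
\qquad
\frac{1}{\sqrt{n}} = \frac{1}{\sqrt{56}} \approx 0.134,
\qquad
\frac{1/\sqrt{n}}{1/n} = \sqrt{n} \approx 7.5
```

Under this reading, the threshold for local connectivity is roughly 7.5 times that for connectivity at n = 56.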
Abbreviations
UTR: Untranslated Region
HDV: Hepatitis Delta Virus
DP: dynamic program
lhs: left hand side
rhs: right hand side
References
Penner RC, Waterman MS: Spaces of RNA secondary structures. Adv Math. 1993, 101: 31-49. 10.1006/aima.1993.1039.
Waterman MS: Combinatorics of RNA hairpins and cloverleaves. Stud Appl Math. 1979, 60: 91-96.
Smith TF, Waterman MS: RNA secondary structure. Math Biol. 1978, 42: 31-49.
Schmitt WR, Waterman MS: Linear trees and RNA secondary structure. Discr Appl Math. 1994, 51: 317-323. 10.1016/0166-218X(92)00038-N.
Howell JA, Smith TF, Waterman MS: Computation of generating functions for biological molecules. J Appl Math. 1980, 39: 119-133.
Nussinov R, Jacobson AB: Fast Algorithm for Predicting the Secondary Structure of Single-Stranded RNA. Proc Natl Acad Sci, USA. 1980, 77: 6309-6313. 10.1073/pnas.77.11.6309.
Searls DB: The language of genes. Nature. 2002, 420: 211-217. 10.1038/nature01255.
Webpage of HDV-pseudoknot structure in natural. [http://www.ekevanbatenburg.nl/PKBASE/PKB00075.HTML]
Loria A, Pan T: Domain Structure of the ribozyme from eubacterial ribonuclease P. RNA. 1996, 2: 551-563.
Konings DAM, Gutell RR: A comparison of thermodynamic foldings with comparatively derived structures of 16S and 16S-like rRNAs. RNA. 1995, 1: 559-574.
Schneider D, Tuerk C, Gold L: Selection of high affinity RNA ligands to the bacteriophage R17 coat protein. J Mol Biol. 1992, 228: 862-869. 10.1016/0022-2836(92)90870-P.
Chamorro M, Parkin N, Varmus HE: An RNA pseudoknot and an optimal heptameric shift site are required for highly efficient ribosomal frameshifting on a retroviral messenger RNA. Proc Natl Acad Sci, USA. 1992, 89 (2): 713-717. 10.1073/pnas.89.2.713.
Lyngsø RB, Pedersen CNS: RNA Pseudoknot Prediction in Energy-Based Models. J Comp Biol. 2000, 7: 409-427. 10.1089/106652700750050862.
Rivas E, Eddy S: A dynamic programming algorithm for RNA structure prediction including pseudoknots. J Mol Biol. 1999, 285 (5): 2053-2068. 10.1006/jmbi.1998.2436.
Uemura Y, Hasegawa A, Kobayashi S, Yokomori T: Tree adjoining grammars for RNA structure prediction. Theor Comp Sci. 1999, 210: 277-303. 10.1016/S0304-3975(98)00090-5.
Akutsu T: Dynamic programming algorithms for RNA secondary prediction with pseudoknots. Discr Appl Math. 2000, 104: 45-62. 10.1016/S0166-218X(00)00186-4.
Tacker M, Stadler PF, Bornberg-Bauer EG, Hofacker IL, Schuster P: Algorithm independent properties of RNA secondary structure predictions. Europ Biophys. 1996, 25: 115-130. 10.1007/s002490050023.
Jin EY, Reidys CM: Asymptotic enumeration of RNA structures with pseudoknots. Bull Math Biol.
Jin EY, Qin J, Reidys CM: Combinatorics of RNA structures with Pseudoknots. Bull Math Biol. 2008, 70 (1): 45-67. 10.1007/s11538-007-9240-y.
Waterman MS, Smith TF: Rapid dynamic programming methods for RNA secondary structure. Adv Appl Math. 1986, 7: 455-464. 10.1016/0196-8858(86)90025-4.
Hofacker IL, Fontana W, Stadler PF, Bonhoeffer LS, Tacker M, Schuster P: Fast Folding and Comparison of RNA Secondary Structures. Monatsh Chem. 1994, 125: 167-188. 10.1007/BF00818163.
Zuker M, Stiegler P: Optimal computer folding of large RNA sequences using thermodynamics and auxiliary information. Nucl Acids Res. 1981, 9: 133-148. 10.1093/nar/9.1.133.
Reidys CM, Stadler PF, Schuster P: Generic properties of combinatory maps: neutral networks of RNA secondary structures. Bull Math Biol. 1997, 59 (2): 339-397. 10.1007/BF02462007.
Schultes EA, Bartel DP: Implications for the Emergence of New Ribozyme Folds. Science. 2000, 289 (5478): 448-452. 10.1126/science.289.5478.448.
Jolly A (Ed): Mapping RNA form and function. Science. 2005, 309: 1441-1632. 10.1126/science.1111873.
Ma G, Reidys CM: Canonical RNA Pseudoknot Structures. J Comp Biol.
Chen WYC, Deng EYP, Du RRX, Stanley RP, Yan CH: Crossings and nestings of matchings and partitions. Trans Am Math Soc. 2007, 359: 1555-1575. 10.1090/S0002-9947-06-04210-3.
Chen WYC, Qin J, Reidys CM: Crossing and Nesting in Tangled-diagrams. Elec J Comb. 2008, 15:
Jin EY, Reidys CM: RNALEGO: Combinatorial Design of Pseudoknot RNA. Adv Appl Math.
Hofacker IL, Schuster P, Stadler PF: Combinatorics of RNA Secondary Structures. Discr Appl Math. 1998, 88: 207-237. 10.1016/S0166-218X(98)00073-0.
McCaskill JS: The equilibrium partition function and base pair binding probabilities for RNA secondary structure. Biopolymers. 1990, 29: 1105-1119. 10.1002/bip.360290621.
Fresco JR, Alberts BM, Doty P: Some Molecular Details of the Secondary Structure of Ribonucleic Acid. Nature. 1960, 188: 98-101. 10.1038/188098a0.
Tinoco I Jr, Uhlenbeck OC, Levine MD: Estimation of Secondary Structure in Ribonucleic Acids. Nature. 1971, 230: 362-367. 10.1038/230362a0.
DeLisi C, Crothers DM: Prediction of RNA secondary structure. Proc Natl Acad Sci USA. 1971, 68: 2682-2685. 10.1073/pnas.68.11.2682.
Huynen M, Stadler PF, Fontana W: Smoothness within ruggedness: the role of neutrality in adaptation. Proc Natl Acad Sci USA. 1996, 93: 397-401. 10.1073/pnas.93.1.397.
Babajide A, Hofacker IL, Sippl MJ, Stadler PF: Neutral Networks in Protein Space: A Computational Study Based on Knowledge-Based Potentials of Mean Force. Folding Design. 1997, 2: 261-269. 10.1016/S1359-0278(97)00037-0.
Schuster P: Genotypes with phenotypes: Adventures in an RNA Toy World. Biophys Chem. 1997, 66: 75-110. 10.1016/S0301-4622(97)00058-6.
Fontana W, Schuster P: Shaping Space: The Possible and the Attainable in RNA Genotype-Phenotype Mapping. J Theor Biol. 1998, 194: 491-515. 10.1006/jtbi.1998.0771.
Stadler PF: Fitness Landscapes Arising from the Sequence-Structure Maps of Biopolymers. J Mol Struct (THEOCHEM). 1999, 463: 7-19. 10.1016/S0166-1280(98)00387-X.
Schuster P, Fontana W: Chance and Necessity in Evolution: Lessons from RNA. Physica D. 1999, 133: 427-452.
Reidys CM, Stadler PF: Combinatorial Landscapes. SIAM Review. 2002, 44: 3-54. 10.1137/S0036144501395952.
Hofacker IL, Fekete M, Flamm C, Huynen MA, Rauscher S, Stolorz PE, Stadler PF: Automatic Detection of Conserved RNA Structure Elements in Complete RNA Virus Genomes. Nucl Acids Res. 1998, 26: 3825-3836. 10.1093/nar/26.16.3825.
Schuster P, Fontana W, Stadler PF, Hofacker IL: From Sequences to Shapes and Back: A Case Study in RNA Secondary Structures. Proc Roy Soc Lond B. 1994, 255: 279-284. 10.1098/rspb.1994.0040.
Gruener W, Giegerich R, Strothmann D, Reidys CM, Weber J, Hofacker IL, Stadler PF, Schuster P: Analysis of RNA sequence structure maps by exhaustive enumeration I. Neutral networks. Monatsh Chem. 1996, 127: 375-389. 10.1007/BF00810882.
Gruener W, Giegerich R, Strothmann D, Reidys CM, Weber J, Hofacker IL, Stadler PF, Schuster P: Analysis of RNA sequence structure maps by exhaustive enumeration. II. Monatsh Chem. 1996, 127: 355-374. 10.1007/BF00810881.
Huang FWD, Peng WWP, Reidys CM: Folding RNA pseudoknot structures. [In preparation].
Han HSW, Reidys CM: Stacks in canonical RNA pseudoknot structures. Comp Appl Math.
Reidys CM: Local Connectivity of Neutral Networks. Bull Math Biol.
Jin EY, Reidys CM: Central and Local Limit Theorems for RNA Structures. J Theor Biol. 2008, 250 (3): 547-559. 10.1016/j.jtbi.2007.09.020.
Huang FWD, Reidys CM: Statistics of canonical RNA pseudoknot structures. J Theor Biol.
Fontana W, Schuster P: Shaping Space: the Possible and the Attainable in RNA Genotype-Phenotype Mapping. J Theor Biol. 1998, 194 (4): 491-515. 10.1006/jtbi.1998.0771.
Reidys CM: Large components in random induced subgraphs of Ncubes. Discr Math.
Stadler PF, Haslinger C: RNA Structures with Pseudo-Knots. Bull Math Biol. 1999, 61: 437-467. 10.1006/bulm.1998.0085.
Haslinger C: RNA Structures with Pseudoknots. PhD thesis. 1997, University of Vienna.
Shapiro L: A survey of the Riordan Group. Proc Amer Math Soc. 1994.
Acknowledgements
We are grateful to J.Z.M. Gao, H.S.W. Han and W.W.J. Peng for helpful discussions. This work was supported by the 973 Project, the PCSIRT Project of the Ministry of Education, the Ministry of Science and Technology, and the National Science Foundation of China.
This article has been published as part of BMC Bioinformatics Volume 10 Supplement 1, 2009: Proceedings of The Seventh Asia Pacific Bioinformatics Conference (APBC) 2009. The full contents of the supplement are available online at http://www.biomedcentral.com/1471-2105/10?issue=S1
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors' contributions
All authors contributed equally to this paper.
Rights and permissions
Open Access This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License ( https://creativecommons.org/licenses/by/2.0 ), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
About this article
Cite this article
Huang, F.W., Li, L.Y. & Reidys, C.M. Sequence-structure relations of pseudoknot RNA. BMC Bioinformatics 10 (Suppl 1), S39 (2009). https://doi.org/10.1186/1471-2105-10-S1-S39
DOI: https://doi.org/10.1186/1471-2105-10-S1-S39