Sequence-structure relations of pseudoknot RNA

Background The analysis of sequence-structure relations of RNA is based on a specific notion and folding of RNA structure. The notion of coarse grained structure employed here is that of canonical RNA pseudoknot contact-structures with at most two mutually crossing bonds (3-noncrossing). These structures are folded by a novel, ab initio prediction algorithm, cross, capable of searching all 3-noncrossing RNA structures. The algorithm outputs the minimum free energy structure. Results After giving some background on RNA pseudoknot structures and providing an outline of the folding algorithm being employed, we present in this paper various statistical results on the mapping from RNA sequences into 3-noncrossing RNA pseudoknot structures. We study properties such as the fraction of pseudoknot structures, the dominant pseudoknot-shapes, neutral walks, neutral neighbors and local connectivity. We then put our results into the context of molecular evolution of RNA. Conclusion Our results imply that, in analogy to RNA secondary structures, 3-noncrossing pseudoknot RNA represents a molecular phenotype that is well suited for molecular and in particular neutral evolution. We can conclude that extended, percolating neutral networks of pseudoknot RNA exist.


Background
Three decades ago, Michael Waterman pioneered the combinatorics and ab initio prediction of the, at that time rather exotic, ribonucleic acid (RNA) secondary structures [1][2][3][4][5]. The motivation for this work came from a fundamental dichotomy represented by RNA. On the one hand, RNA is described by its primary sequence, a linear string composed of the nucleotides A, G, U and C. The primary sequence embodies the genotypic legislative. On the other hand, RNA, being less structurally constrained than its chemical relative DNA, does fold into 3D-structures, representing the phenotypic executive. Therefore one molecule stands for both: genotype and phenotype.
Indeed, a vast variety of RNA activities was found: the discovery of catalytic RNAs, or ribozymes, in 1981 proved that RNA could catalyze reactions just as proteins do. RNA can also act as a messenger between DNA and protein, in the form of messenger RNA. The realization that RNA combines features of proteins with those of DNA led to the "RNA world" hypothesis for the origin of life. The idea was that DNA and the much more versatile proteins took over RNA's functions in the transition from the "RNA-world" to the "DNA/protein-world".
Let us have a closer look at RNA phenotypes. RNA molecules form "helical" structures by folding, i.e. by pairing their nucleotides and thereby lowering their free energy; the configuration of lowest energy is the minimum free energy (mfe) structure. Originally, these bonds were subject to strict combinatorial constraints, for instance "noncrossing" in RNA secondary structures. For the latter, dynamic programming (DP) algorithms predicting the minimum free energy configuration were given in 1980 [5,6]. It is well known, however, that RNA structures are far more complex than secondary structures. One particularly prominent feature is the existence of cross-serial dependencies [7], that is, crossing arcs or pseudoknots, see Figure 1, where we display the natural UTR-pseudoknot structure of the mouse hepatitis virus. The algorithm cross also folds this sequence into the natural structure given in Figure 1. In Figure 2 we present another RNA pseudoknot structure, the HDV-pseudoknot. We present here the structure as folded by cross and also its natural structure [8].
In fact, RNA pseudoknots are "everywhere". They occur in functional RNA, for instance in RNAseP [9] as well as in ribosomal RNA [10]. They are conserved in the catalytic core of group I introns, and in plant viral RNAs pseudoknots mimic tRNA structure. They also arise in in vitro RNA evolution [11], where experiments produced families of RNA structures with pseudoknot motifs when selecting for binding to HIV-1 reverse transcriptase. Important mechanisms like ribosomal frame shifting [12] also involve pseudoknot interactions.
For prediction algorithms the implications of cross-serial dependencies are severe: they imply a higher level of formal language, namely context-sensitive. In general, on this level of formal languages it is not clear whether or not polynomial time ab initio folding algorithms exist. Indeed, Lyngsø et al. [13] showed that "reasonable" classes of RNA pseudoknots require exponential time algorithms. There exist, however, polynomial time folding algorithms capable of the energy based prediction of certain pseudoknots: Rivas et al. [14], Uemura et al. [15], Akutsu [16] and Lyngsø [13]. The output of these algorithms, however, remains somewhat "mysterious": it is not clear which types of pseudoknots can be generated.
In analogy to the case of RNA secondary structures, the identification of key combinatorial properties of the output class offers deeper understanding. The combinatorial properties of RNA pseudoknot structures discussed in the following have indeed profound implications: first, sequence-structure maps will generate exponentially many structures with neutral networks of exponential size. Second, the latter will come close to each other in sequence space, thereby allowing for efficient evolutionary search. None of these findings depend on the particular choice of loop-energies or the partition function [17]. Furthermore, without combinatorial specification, as is the case for the above mentioned DP based pseudoknot folding algorithms [14], one arrives at an impossibly large configuration space.
For instance, the inductive generation of gap-matrices produces an arbitrarily high number of mutually crossing arcs. The results in [18] prove that the exponential growth rate of pseudoknot structures is linear in the crossing number. Accordingly, via gap-matrices, an uncontrollably large output class is being generated. Nevertheless, the DP-routine using pairs of gap-matrices cannot generate any 3-noncrossing nonplanar pseudoknot structure.
We will show that the notion of k-noncrossing diagrams [19] allows us to specify a suitable output-class for pseudoknot folding algorithms. Recall that a diagram is a graph over the vertex set [n] = {1, ..., n} with vertex degree less than or equal to one. It is represented by drawing the vertices in a horizontal line and its arcs (i, j), where i < j, in the upper half-plane. The vertices and arcs correspond to nucleotides and Watson-Crick (A-U, G-C) and (U-G) base pairs, respectively. A diagram is k-noncrossing if it contains at most k - 1 mutually crossing arcs. Diagrams have the following three key parameters: the maximum number of mutually crossing arcs, k - 1, the minimum arc-length, λ, and the minimum stack-length, τ. The length of an arc (i, j) is j - i and a stack of length τ is a sequence of "parallel" arcs of the form ((i, j), (i + 1, j - 1), ..., (i + (τ - 1), j - (τ - 1))), see Figure 3. We call an arc of length λ a λ-arc. Biophysical constraints on the base pairings imply that in all RNA structures λ is greater than or equal to four. We call diagrams with a minimum stack-length τ, τ-canonical and if λ ≥ 4 we refer to diagrams as structures. To reiterate, in the simplest case we have 2-noncrossing RNA structures, i.e. the secondary structures in which no two arcs cross, see Figure 4. The noncrossing of arcs has far-reaching consequences. It implies that RNA secondary structures form a context-free language and allow for DP algorithms [20], predicting the loop-based mfe-secondary structure in O(n^3)-time and O(n^2)-space.

Figure 1: RNA pseudoknot structures. Three representations of the UTR-pseudoknot structure of the mouse hepatitis virus: first the planar graph representation, second the diagram representation and finally the output produced by cross.
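The three diagram parameters defined above can be checked mechanically on small examples. The following Python sketch is purely illustrative and is not part of cross: the function names are hypothetical, arcs are assumed to be given as pairs (i, j) with i < j, and the crossing number is found by brute force over subsets of arcs, which is only feasible for a handful of arcs.

```python
from itertools import combinations


def crosses(a, b):
    """Two arcs cross iff their endpoints interleave: i < p < j < q."""
    (i, j), (p, q) = sorted([a, b])
    return i < p < j < q


def max_mutually_crossing(arcs):
    """Size of the largest set of pairwise crossing arcs (brute force)."""
    arcs = list(arcs)
    for size in range(len(arcs), 1, -1):
        for subset in combinations(arcs, size):
            if all(crosses(a, b) for a, b in combinations(subset, 2)):
                return size
    return 1 if arcs else 0


def is_k_noncrossing(arcs, k):
    """A diagram is k-noncrossing if at most k - 1 arcs mutually cross."""
    return max_mutually_crossing(arcs) <= k - 1


def min_arc_length(arcs):
    """Minimum of j - i over all arcs (the parameter lambda)."""
    return min(j - i for i, j in arcs)


def stack_lengths(arcs):
    """Lengths of the maximal stacks ((i, j), (i + 1, j - 1), ...)."""
    arc_set = set(arcs)
    lengths = []
    for i, j in sorted(arc_set):
        if (i - 1, j + 1) in arc_set:
            continue  # not the outermost arc of its stack
        length = 1
        while (i + length, j - length) in arc_set:
            length += 1
        lengths.append(length)
    return lengths


# Hypothetical H-type pseudoknot: two stacks of three arcs that cross each other.
h_type = [(1, 12), (2, 11), (3, 10), (6, 18), (7, 17), (8, 16)]
print(max_mutually_crossing(h_type), min_arc_length(h_type), stack_lengths(h_type))
```

In this toy example the diagram is 3-noncrossing but not 2-noncrossing, has minimum arc-length seven and is 3-canonical, so it qualifies as a structure in the sense used here.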
Figure 3: k-noncrossing diagrams. We display a 4-noncrossing diagram with arc-length λ ≥ 4 and stack-length τ ≥ 1 (a) and a 3-noncrossing diagram with λ ≥ 4 and τ ≥ 3 (b).

Let us now, having some background on RNA structures, return to the RNA-world. Around 1990 Peter Schuster and his coworkers initiated a paradigm shift. They began to study evolutionary optimization and neutral evolution of RNA via the relation between RNA genotypes and phenotypes. The particular mapping from RNA sequences into RNA secondary structures was obtained by the algorithm ViennaRNA [21], an implementation of the folding routine [6,22] mentioned above. Two particularly prominent results of this line of work were the existence of neutral networks, i.e. vast, extended networks composed of sequences folding into a given secondary structure [23], and the Intersection Theorem [23]. The latter guarantees for any two secondary structures the existence of at least one sequence which simultaneously satisfies all constraints imposed by their Watson-Crick and G-U base pairs. For the implication of the latter with respect to molecular switches, see [24]. It became evident that the "statistical" properties of this mapping played a central role in the molecular evolution of RNA.
But there is more. Two discoveries suggested that RNA might not just be a stepping stone towards a DNA/protein world. They show that RNA plays an active role in vital cell processes. A large number of very small RNAs of about 22 nucleotides in length, called microRNAs (miRNAs), were discovered. They were found in organisms as diverse as the worm Caenorhabditis elegans and humans, and they exhibit a particular relationship to certain intermediates in RNA interference (RNAi). These findings have put RNA, in particular noncoding RNA, into the spotlight. In addition, RNA's conformational versatility and catalytic abilities have been identified in the context of protein synthesis and RNA splicing. More and more parallels between RNA and protein are currently being revealed [25].
Let us next briefly review what we know about the combinatorics of our phenotypes, ultimately allowing for the computation of biophysically relevant pseudoknot structures [26]. The key result comes from a seemingly unrelated field, the combinatorics of partitions. Chen et al. proved in a seminal paper [27] a bijection between walks in Weyl chambers and k-noncrossing partitions. This bijection has recently been generalized to tangled diagrams [28]. Now, a k-noncrossing diagram is a special type of k-noncrossing tangle and the relevance of Chen's result lies in the fact that the walks in question can be enumerated via the reflection principle. In fact, the reflection principle facilitated the computation of the generating function of k-noncrossing canonical pseudoknot RNA [19,26,29]. Subsequent singularity analysis [26,29] showed that the exponential growth rates of canonical pseudoknot RNA structures are surprisingly small, see Table 1 [26]. For instance, the number of 3-noncrossing, 3-canonical RNA structures with arc-length greater than or equal to four grows asymptotically as an explicit function of n involving some (explicitly known) constant c. Its exponential growth rate is very close to Schuster et al.'s finding [30] for 2-canonical RNA secondary structures with arc-length greater than or equal to four.

For the analysis presented here, we use the algorithm cross [28], which produces a transparent output. This algorithm does not follow the DP paradigm alone and generates the mfe k-noncrossing τ-canonical structure via a combination of branch and bound as well as DP techniques. cross inductively constructs k-noncrossing, τ-canonical RNA structures via motifs. Currently full loop-based energy models are derived and implemented for k = 3 and τ ≥ 3.
Therefore, cross finds the mfe-RNA pseudoknot structure in which there are at most two mutually crossing arcs, which has minimum arc-length four and in which each stack has size at least three. While cross is an exponential time algorithm, it can fold sequences of length 100 with an average folding time of 4.5 minutes.

Methods
While it is beyond the scope of this paper to present the algorithm cross in detail, the objective of this section is first to sketch its key organization and second to discuss some basic properties of RNA pseudoknot structures. These combinatorial properties enable us to assign a unique, loop-based energy. In the course of our analysis we show that an RNA pseudoknot structure can be constructed via simpler substructures. These serve as the building blocks via which cross derives the mfe-pseudoknot structure. At present we do not have an algorithm computing the partition function version of cross. For RNA secondary structures, the partition function was obtained in 1990 [31], three decades after the first mfe-folding algorithms were derived [32][33][34]. The partition function is based on a fixed sequence and contains vital statistical information on the probabilities of specific structural configurations of the latter. For any inductively constructed structure class, it allows one to compute the base pairing probabilities. In analogy to similar studies in the case of RNA secondary structures [17,[35][36][37][38][39][40][41][42][43][44][45], the partition function is not of key importance for the type of analysis presented here. We shall derive statistical information on the sequence-structure relation by mfe-folding a large number of sequences instead of considering the ensemble of structural configurations of a single sequence.

Cross
The algorithm cross has three distinct phases: the motif-, skeleton- and saturation-phase, see Figure 5 for an overview. We will here briefly discuss these three parts.
Let < denote the following partial order over arcs: (i, j) < (i', j') if and only if i' < i < j < j', i.e. an arc α_1 is smaller than α_2 if it is nested in it.

I Motifs
Let us begin by defining core-structures. A k-noncrossing core [29] is a k-noncrossing diagram in which all stacks have size one. The core of a structure is obtained by identifying all its stacks by single arcs, keeping the unpaired nucleotides and finally relabeling, see Figure 6.
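The passage from a structure to its core can be sketched in a few lines of Python. This is an illustration under assumptions rather than the routine used in cross: the function name core is hypothetical, vertices are labeled 1, ..., n, and each stack is represented by its innermost arc, which after relabeling gives the same result as keeping any other arc of the stack.

```python
def core(n, arcs):
    """Collapse every stack to a single arc, keep unpaired vertices, relabel 1..m.

    Returns the pair (m, core_arcs), where m is the number of vertices of the core.
    """
    arc_set = set(arcs)
    # Keep the innermost arc of each stack: the arc (i + 1, j - 1) is absent.
    kept = [(i, j) for (i, j) in arc_set if (i + 1, j - 1) not in arc_set]
    # Endpoints of all other stack arcs disappear from the core.
    dropped = {v for arc in arc_set - set(kept) for v in arc}
    survivors = [v for v in range(1, n + 1) if v not in dropped]
    relabel = {v: idx for idx, v in enumerate(survivors, start=1)}
    return len(survivors), sorted((relabel[i], relabel[j]) for i, j in kept)


# Hypothetical structure over [12]: two crossing stacks of size three, arc-length >= 4.
structure = [(1, 9), (2, 8), (3, 7), (4, 12), (5, 11), (6, 10)]
print(core(12, structure))
```

In this example every arc of the structure has length at least four, yet the core consists of the two crossing arcs (1, 3) and (2, 4), both of length two, which illustrates why cores will in general contain 2-arcs.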
A ⟨k, τ⟩-motif is a ⟨k, τ⟩-diagram over [n] having the following properties: (M1) it has a nonnesting core, and (M2) all its arcs are contained in stacks of length exactly τ ≥ 3 and have arc-length λ ≥ 4.
An m-shadow is a k-noncrossing diagram obtained by successively increasing the stacks of a motif m from top to bottom, see Figure 7.
The key observation about motifs is that they can, despite the fact that they exhibit cross-serial dependencies, be generated inductively [46].

II Skeleta
Skeleta represent the non-inductive "frames" of pseudoknot RNA, i.e. skeleta entail exactly the cross-serial dependencies that need to be considered exhaustively. A skeleton, S, is a 3-noncrossing structure whose core has a connected L-graph. The L-graph of a diagram is the graph whose vertices are the arcs of the diagram, two vertices being adjacent if their corresponding arcs cross [46]. An irreducible shadow over [i, j], denoted by IS_{i,j}, is a skeleton which has no nested arcs, see Figure 7. Phase II consists of the generation of all skeleta-trees, which are rooted in irreducible shadows.
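The two conditions in these definitions, a connected L-graph and the absence of nested arcs, are easy to test on small arc sets. The Python sketch below is only an illustration with hypothetical function names; it takes the arcs of a core as input and does not check the 3-noncrossing or arc-length conditions.

```python
def crosses(a, b):
    """Two arcs cross iff their endpoints interleave: i < p < j < q."""
    (i, j), (p, q) = sorted([a, b])
    return i < p < j < q


def l_graph(arcs):
    """Adjacency lists of the L-graph: one vertex per arc, edges between crossing arcs."""
    arcs = list(arcs)
    return {a: [b for b in arcs if b != a and crosses(a, b)] for a in arcs}


def is_connected(adjacency):
    """Depth-first search over the L-graph; an empty graph counts as not connected."""
    if not adjacency:
        return False
    start = next(iter(adjacency))
    seen, stack = {start}, [start]
    while stack:
        for neighbor in adjacency[stack.pop()]:
            if neighbor not in seen:
                seen.add(neighbor)
                stack.append(neighbor)
    return len(seen) == len(adjacency)


def has_nested_arcs(arcs):
    """True if some arc (p, q) lies strictly inside another arc (i, j)."""
    return any(i < p and q < j for (i, j) in arcs for (p, q) in arcs)


chain = [(1, 5), (3, 8), (6, 10)]   # each arc crosses the next one
hairpins = [(1, 4), (5, 8)]         # two independent arcs
print(is_connected(l_graph(chain)), has_nested_arcs(chain))
print(is_connected(l_graph(hairpins)))
```

The chain of three arcs has a connected L-graph and no nested arcs, which are the two properties required of the core of an irreducible shadow, whereas the two independent hairpin arcs give a disconnected L-graph.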

III Saturation
Given a skeleton, cross saturates or "fills" via context-sensitive DP routines the skeleton-intervals. Note that, while the inserted substructures cannot cross any arc of the skeleton, they will in general contain crossing arcs within themselves.
Figure 7: Motifs, shadows and irreducible shadows. We display a ⟨3, 3⟩-motif (a) and one of its induced shadows (b). Furthermore, we show an irreducible shadow (c), which corresponds to the (b)-substructure contained in the dashed box.

Figure 6: Cores will in general have 2-arcs. The structure δ (lhs) is mapped into its core c(δ) (rhs). Clearly δ has arc-length ≥ 4 and, as a consequence of the collapse of the stack ((i - 2, j + 2), (i - 1, j + 1), (i, j)) into the arc (i, j), c(δ) contains the arc (i, j), which is, after relabeling, a 2-arc.

To summarize, first cross inductively constructs all roots of the skeleta-trees, second cross generates the skeleta-trees themselves and third it saturates the skeleta.

Loops
We next discuss loops of 3-noncrossing RNA structures. Loops are not only the basic building blocks for the mfe-evaluation but also of importance for the coarse grained notion of pseudoknot-shapes, discussed in Subsection Pseudoknot-shapes.
Let β be an arc in the 3-noncrossing RNA structure S, and denote by A_S(β) the set of S-arcs that cross β. It is shown in [46] that any 3-noncrossing RNA structure can be uniquely decomposed into the following four loop-types: (1) a hairpin-loop is a pair ((i, j), [i + 1, j - 1]), where (i, j) is an arc and [i + 1, j - 1] is the interval of unpaired nucleotides it encloses.

Decomposition
We now show that each 3-noncrossing RNA structure can uniquely be constructed by simpler substructures [46]. Furthermore, each 3-noncrossing RNA structure has a unique loop decomposition-the basis of our energy evaluation. We remark that assertion (b) of the following result remains valid for arbitrary crossing number, k.
(a) Any k-noncrossing, τ-canonical RNA structure corresponds to a unique sequence of shadows.
In Figure 10 we illustrate how these decompositions work.

Results and discussion
Our results are organized in two sections. First we describe our findings with respect to the statistics of pseudoknot RNA structures and second we present our data with respect to the particular organization of the sequences in neutral networks.

Minimum free energy RNA pseudoknot structures
In this section we present some key statistics on pseudoknotted RNA structures. In order to put our findings into context we consider two variants of cross: first, cross 3 , which generates 3-noncrossing, 3-canonical mfe-structures and second, cross 4 , which produces 3-noncrossing, 4-canonical mfe-structures.

The fraction of pseudoknots
We next compute the fraction of RNA structures with pseudoknots within all structures for cross 3 and cross 4. Figure 11 displays the fraction of structures with pseudoknots as a function of sequence length. It is evident that the fraction of pseudoknotted structures increases monotonically with the sequence length. Our data are based on folding 2000 random sequences via cross and suggest a linear relation. In particular, for n = 100, approximately 50% of the structures folded by both versions of cross contain pseudoknots.

Pseudoknot-shapes
Next we study the dominant pseudoknot-shapes as a function of sequence length. Our notion of pseudoknot-shape is based on the k-noncrossing cores [29] introduced above. The shape of a structure S is a subset of the core-arcs, induced by all arcs either contained in pseudoknots or contained in multi-loops which contain nested pseudoknots. In other words, a pseudoknot-shape contains all pseudoknot-arcs and all arcs affecting the energy of pseudoknots, see Figure 12. In Figure 12 we display the dominant types for cross 3 and cross 4. The shape data are obtained by folding 2000 random sequences. In Figure 13 we display the fraction of sequences on which cross 3 and cross 4 coincide, based on folding 2000 random sequences.

Stack-statistics in pseudoknot RNA
It is well known that large stacks contribute to a low mfe of a structure. In this section we relate the distribution of stacks in random structures to the distribution of stacks in mfe-pseudoknot structures generated by cross. This provides insight into which particular spectrum of pseudoknot structures cross produces.
Let us first discuss the distribution of stacks in random pseudoknot structures. The naive approach would be to generate a random structure and count the number of stacks. However, it is at present not known how to construct a random pseudoknot structure with uniform probability. Therefore we have to employ a different strategy in order to obtain this distribution for random structures. The key idea [47] is to consider the bivariate generating function

$$T_{k,\tau}(x, u) = \sum_{n \ge 0} \sum_{t \ge 0} T_{k,\tau}(n, t)\, u^{t} x^{n},$$

where T_{k,τ}(n, t) denotes the number of k-noncrossing, τ-canonical pseudoknot structures over [n] having exactly t stacks.

Figure 11: The fraction of pseudoknot structures over sequence length. Pseudoknot fractions for cross 3 (hollow diamonds) and pseudoknot fractions for cross 4 (solid diamonds).

Figure 12: The dominant pseudoknot shapes for τ = 4 and τ = 3.
T_{k,τ}(x, u) can be computed using the cores introduced above. Writing X_n for the number of stacks of a structure of length n chosen uniformly at random, the stack-distribution is now given by

$$\mathbb{P}(X_n = t) = \frac{T_{k,\tau}(n, t)}{\sum_{t' \ge 0} T_{k,\tau}(n, t')},$$

and via singularity analysis one can show that this distribution becomes asymptotically normal, with mean and variance determined by the constants μ_{k,τ} and σ²_{k,τ}, which in turn are derived from γ_{k,τ}(u), the unique dominant singularity parameterized by u = e^s. In Table 2 we display the values μ_{k,τ} and σ²_{k,τ} for k = 2, 3, 4 and τ = 3, ..., 7. Accordingly the number of stacks scales linearly with sequence length and so does the number of loops, since each loop corresponds to a stack. In Figure 14 we present the stack distributions of 3000 structures of random sequences folded by cross 4 and the normal distribution obtained from Table 2 (lhs). Analogously we present the stack distributions of 3000 structures of random sequences folded by cross 5 and the normal distribution obtained from Table 2 (rhs).

Neutrality and local connectivity
The mapping from sequences to structures plays an important role for evolution [23,43,48]. One of its key roles is to facilitate the search of a sequence-population for better adapted shapes. In this context, Table 1 contains some nontrivial information about the mapping from RNA sequences into their pseudoknot structures. To be precise, Table 1, in combination with central limit theorems for the number of arcs in k-noncrossing RNA structures [49,50], allows us to conclude that there exist exponentially many k-noncrossing canonical structures with exponentially large preimages. Indeed, according to Table 1 the exponential growth rate of the number of k-noncrossing canonical structures, 3 ≤ k ≤ 9, is strictly smaller than four, the growth rate of the space of all sequences over the natural alphabet.

Figure 13: Fraction of same structures over sequence length. The fraction of sequences for which cross 3 and cross 4, the latter generating pseudoknot structures with minimum stack-length four, coincide. The data are derived by folding 20,000 random sequences.
The central limit theorems for the number of arcs of k-noncrossing, canonical pseudoknot structures [50] exhibit a mean of 0.39 n and a variance of 0.041 n. We conclude from this that sequence to structure maps in pseudoknot RNA structures cannot be trivial, since the preimages of particular structures have exponential growth rates strictly smaller than four. As a result the number of canonical pseudoknot structures grows exponentially. Accordingly, a sequence to structure map in pseudoknot RNA necessarily generates exponentially many canonical structures.
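The role played by a growth rate below four can be made concrete with an elementary counting estimate. The following is only a sketch: the symbols S_k(n), c and γ_k are notation assumed here, and the bound concerns the average preimage size, which is weaker than the statement about individual structures derived from the central limit theorems.

```latex
% Assumed notation: S_k(n) is the number of k-noncrossing canonical structures over [n],
% bounded by S_k(n) \le c\,\gamma_k^{\,n} for some constant c > 0 and growth rate \gamma_k < 4.
\[
  \frac{\#\{\text{sequences of length } n\}}{\#\{\text{structures over } [n]\}}
  \;\ge\; \frac{4^{n}}{c\,\gamma_k^{\,n}}
  \;=\; \frac{1}{c}\left(\frac{4}{\gamma_k}\right)^{n}
  \;\longrightarrow\; \infty
  \qquad (n \to \infty).
\]
```

Since every sequence folds into exactly one mfe-structure, the average number of sequences per realized structure therefore grows exponentially in n whenever γ_k < 4.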
In light of this, the interesting question then becomes how the set of sequences folding into a given structure is "organized" in sequence space. The analysis presented in this section is analogous to the investigations for RNA secondary structures [23,51] and can be viewed as a basic protocol for the local statistics of a genotype-phenotype map. The only exception is Subsection Local connectivity, which elaborates on the novel concept of local connectivity [48].
It is only possible to derive local statistics, since, for instance, the exhaustive computation of the set of all sequences over the natural alphabet with fixed pseudoknot structure is at present impossible for n > 40. We consider two types of compatible neighbors in sequence space, u- and p-neighbors: a u-neighbor has Hamming distance one and differs exactly by a point mutation at an unpaired position. Analogously, a p-neighbor differs by a compatible base pair-mutation, see Figure 15. Note, however, that a p-neighbor has either Hamming distance one ((G, C) → (G, U)) or Hamming distance two ((G, C) → (C, G)). We call a u- or a p-neighbor, y, a compatible neighbor. If y is contained in the neutral network we refer to y as a neutral neighbor. This leads us to consider the compatible and the neutral distance, denoted by C(v, v') and N(v, v'). These are the minimum length of a C[S]-path and of a path in the neutral network between v and v', respectively.
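The two neighbor types can be enumerated directly from a sequence and a structure. The Python sketch below is an illustration under assumptions, not code from cross: the function name compatible_neighbors is hypothetical, positions are 1-based, and the sequence is assumed to be compatible with the given arcs.

```python
BASES = "ACGU"
PAIRS = ["AU", "UA", "GC", "CG", "GU", "UG"]  # Watson-Crick and G-U pairs


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def compatible_neighbors(seq, arcs):
    """Return (u_neighbors, p_neighbors) of seq with respect to the arcs (1-based)."""
    paired = {v for arc in arcs for v in arc}
    u_neighbors = [
        seq[:pos - 1] + base + seq[pos:]
        for pos in range(1, len(seq) + 1) if pos not in paired
        for base in BASES if base != seq[pos - 1]
    ]
    p_neighbors = []
    for i, j in arcs:
        current = seq[i - 1] + seq[j - 1]
        for pair in PAIRS:
            if pair != current:
                mutant = list(seq)
                mutant[i - 1], mutant[j - 1] = pair[0], pair[1]
                p_neighbors.append("".join(mutant))
    return u_neighbors, p_neighbors


# Toy example: one G-C pair closing three unpaired nucleotides.
u, p = compatible_neighbors("GAAAC", [(1, 5)])
print(len(u), len(p), [hamming("GAAAC", y) for y in p])
```

For this toy hairpin there are nine u-neighbors and five p-neighbors, and among the latter only the mutation (G, C) → (G, U) has Hamming distance one.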

Neutral walks
Our basic experiment is as follows: we select a (random) sequence, v, and fold it into the structure S(v). We then proceed inductively: assume v_i is constructed. We randomly select some neutral (compatible) neighbor of v_i, denoted by v_{i+1}, subject to the condition ... In Figure 16 we study 200 neutral walks for the following four structures: first an H-pseudoknot loop structure (a), second a hairpin-loop structure (b), third an interior-loop structure (c) and finally the phenylalanine tRNA structure (d), see Figure 17. Our findings are in accordance with those for RNA secondary structures. One can easily neutrally traverse sequence space, suggesting the picture of vast, connected networks composed of neutral sequences.
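The inductive construction of a neutral walk can be sketched as follows. Everything in this Python example beyond the basic idea is an assumption: the folding routine and the neighbor enumeration are passed in as callables, toy_fold and point_mutants are hypothetical stand-ins that have nothing to do with cross, and the selection condition used here, namely that the Hamming distance to the start sequence must strictly increase, is the condition commonly used for neutral walks on secondary structures and is assumed rather than taken from the text above.

```python
import random


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))


def neutral_walk(start, fold, neighbors, rng=None):
    """Walk on the neutral network of fold(start) until no admissible step is left.

    Assumed condition: each step strictly increases the Hamming distance to start.
    """
    rng = rng or random.Random(0)
    target = fold(start)
    path = [start]
    while True:
        current = path[-1]
        distance = hamming(start, current)
        candidates = [w for w in neighbors(current)
                      if fold(w) == target and hamming(start, w) > distance]
        if not candidates:
            return path
        path.append(rng.choice(candidates))


# Hypothetical stand-ins for a folding algorithm and a neighborhood.
PAIRS = {"AU", "UA", "GC", "CG", "GU", "UG"}


def toy_fold(seq):
    """Toy 'structure': a single arc between the ends if they can pair, else empty."""
    return ((1, len(seq)),) if seq[0] + seq[-1] in PAIRS else ()


def point_mutants(seq):
    """All sequences at Hamming distance one."""
    return [seq[:i] + b + seq[i + 1:] for i in range(len(seq)) for b in "ACGU" if b != seq[i]]


walk = neutral_walk("GAAAAC", toy_fold, point_mutants, random.Random(1))
print(len(walk) - 1, "neutral steps, final distance", hamming(walk[0], walk[-1]))
```

Because the distance to the start sequence increases with every step, such a walk terminates after at most n steps, and its length indicates how far one can move through sequence space without changing the structure.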

Neutral neighbors
Complementing the analysis of neutral walks, we now study the distribution of neutral neighbors. Recall that a neutral neighbor of a sequence v with respect to the structure S = S(v) is a u- or a p-neighbor, y, contained in the neutral network of S. It has Hamming distance one or two. In order to put things into context we also present in Figure 19 the distribution of neutral neighbors for 10000 random sequences folded by cross 4.

Figure 15: Compatible neighbors in sequence space. Diagram representation of an RNA structure (upper right) and its induced compatible neighbors in sequence space (lower left). Note that each base pair gives rise to 5 compatible neighbors, exactly one of which is in Hamming distance one.

Local connectivity
Connectivity of a subgraph, Γ_n, of an n-cube alone does not imply that a small Hamming distance implies a small distance in Γ_n. For neutral sequences this means that two neutral sequences with Hamming distance less than four are possibly connected only via a neutral path of much greater length. Evidently, for molecular evolution it is therefore not connectivity but the existence of these short paths that matters. Local connectivity is a property which guarantees the existence of these short paths. If Γ_n is locally connected then a small Hamming distance does imply a Γ_n-distance scaled by at most a factor of Δ > 0. We shall begin by studying local connectivity for random induced subgraphs of n-cubes, i.e. where we select sequences with independent probability λ_n. Then we transfer the derived concepts to neutral networks of RNA pseudoknot structures.
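We call Γ_n locally connected if and only if there exists some Δ > 0 such that, almost surely (a.s.),

$$d_{\Gamma_n}(v, v') \le \Delta \, d_{Q^n}(v, v') \qquad (\dagger)$$

provided v, v' are in Γ_n, where d_{Γ_n} denotes the distance in Γ_n and d_{Q^n} the Hamming distance in the n-cube. We immediately observe that, trivially, for any finite n such a Δ exists. However, the key point is that (†) employs the notion "almost surely", i.e. it holds for arbitrary n.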
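For a fixed, finite subgraph the comparison between the distance in Γ_n and the Hamming distance can be computed by breadth-first search. The Python sketch below is illustrative only: the function names are hypothetical, vertices are binary strings, and since the definition above is an asymptotic, almost sure statement, the number computed here for a single small graph merely shows what the scaling factor Δ measures.

```python
from collections import deque


def hamming(a, b):
    """Number of positions at which two equal-length strings differ."""
    return sum(x != y for x, y in zip(a, b))


def graph_distances(vertices, source):
    """Breadth-first search in the subgraph of the cube induced by the given vertices."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in vertices:
            if w not in dist and hamming(v, w) == 1:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist


def smallest_delta(vertices):
    """Smallest factor by which the subgraph distance exceeds the Hamming distance.

    Returns infinity if the induced subgraph is disconnected.
    """
    vertices = list(vertices)
    delta = 1.0
    for v in vertices:
        dist = graph_distances(vertices, v)
        for w in vertices:
            if w == v:
                continue
            if w not in dist:
                return float("inf")
            delta = max(delta, dist[w] / hamming(v, w))
    return delta


# A path of five vertices inside the 3-cube: connected, but with one long detour.
gamma = ["000", "001", "011", "111", "110"]
print(smallest_delta(gamma))
```

In this example the vertices 000 and 110 have Hamming distance two but are joined in the subgraph only by a path of length four, so connectivity alone does not prevent the subgraph distance from being twice the Hamming distance.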
Random graph theory [48] shows that on the one hand, for λ_n smaller than n^δ/√n, where δ > 0 is arbitrarily small, there exists a.s. no finite Δ satisfying (†). On the other hand, for λ_n larger than or equal to n^δ/√n, there exists a.s. some finite Δ satisfying (†). In other words, there exists a threshold value for local connectivity. Since random subgraphs of n-cubes have giant components for λ_n = (1 + ε)/n, where ε > 0 [52], we can conclude that local connectivity emerges distinctly later in the evolution of random subgraphs of n-cubes.

Suppose we are given a structure S and a sequence v contained in its neutral network. By construction, local connectivity refers to the two cubes induced by S, see Figure 20. Consider the set of sequences in compatible distance two of v. The degree of local connectivity of S at v, D_S(v), is the fraction of locally connected vertices among the compatible distance two neighbors of v, that is, those that can be reached via a neutral path of length at most four.
We perform the following experiment: we consider neutral walks, see Subsection Neutral walks, for the UTR-pseudoknot structure of the mouse hepatitis virus displayed in Figure 1. Along these walks we compute the locality degree D_S(v_i) and the total number of locally connected sequences. Our findings are presented in Figure 21. We can report that the degree of local connectivity is, as suggested by random graph theory, almost 100%.

Conclusion
RNA pseudoknot structures, in particular their statistical properties, are a fascinating and new territory. To our knowledge the only statistical data beyond RNA secondary structures were derived for bi-secondary structures in [53,54]. The structural concept of k-noncrossing canonical RNA structures and the resulting sequence to structure map employed for our experiments is new and represents a natural generalization of RNA secondary and bi-secondary structures. To be precise, bi-secondary structures are exactly planar 3-noncrossing RNA structures [19].
It is clear that for sequence-length less than or equal to 100 we only encounter pseudoknots of limited complexity. Our findings presented in Figure 12 provide a transparent picture of which pseudoknot-shapes dominate for a given sequence length. These results, in combination with the data on the fractions of pseudoknotted structures over sequence length, show that for n = 80 approximately 35% of the structures have nontrivial pseudoknots. In addition it is striking that basically all folded structures are irreducible, i.e. only a very small fraction can be decomposed into several independent substructures. This is of interest since decomposable structures can be folded much faster. It is known [55] that Dyck-paths, i.e. paths starting at the origin, having only up (1, 1) or down (1, -1) steps and ending on the x-axis, decompose on average into three irreducible parts. This is of relevance, since a slight generalization of Dyck-paths, the Motzkin-paths, having additional horizontal steps, correspond to secondary structures. Our findings suggest that, while secondary structures decompose nontrivially, higher and higher crossing numbers change the picture. This complicates the computation of mfe-pseudoknot RNA due to their inherent irreducibility.
Both versions of cross produce analogous findings, confirming the generality of our results. The vast majority of pseudoknot-shapes is of a single type. As expected, cross 3 exhibits more structural variety due to the fact that its minimum stack-length is only three. The ratio of pseudoknot structures shifts significantly from n = 80 to n = 100, to approximately 50%. We can conclude from this that pseudoknots cannot be ignored: they evidently become the dominant structure class for n greater than or equal to 100. Figure 13 shows that the fraction of sequences for which cross 3 and cross 4 coincide decreases linearly as a function of sequence length. This indicates that larger and larger sequences will exhibit more subtle structural elements whose emergence is facilitated by stabilizing large stacks.
Furthermore, the mfe-pseudoknot structures generated by cross are far from being random. The central limit theorems for random k-noncrossing canonical RNA structures, given in Table 2, imply that stacks and consequently loops scale linearly with the sequence length. Figure 14 clearly shows that the mfe-structures generated by cross 4 and cross 5 have, for n = 76, two stacks less than random 3-noncrossing structures with minimum stack-length at least four and five, respectively. This deviation is significant and indicates that mfe-pseudoknot structures are far from "typical" random structures. We remark that, while it is straightforward to generate random RNA secondary structures, it is nontrivial to obtain random pseudoknot structures. In particular, at present no polynomial time algorithm is known which generates a random 3-noncrossing RNA structure with uniform probability.

Figure 19: Neutral fraction distribution. Neutral fraction distribution of 20,000 random sequences for cross 4.
The organization of the sequences contained in neutral networks of RNA pseudoknot structures seems to be closely analogous to that of the neutral networks of RNA secondary structures [23]. Figure 16 shows that neutral walks can effectively traverse sequence space, and the fractions of neutral neighbors presented in Figure 18 and Figure 19 suggest a high degree of neutrality.
In Subsection Local connectivity we discussed local connectivity, a property of neutral networks which implies the existence of short, neutral paths. It is apparent that local connectivity is of central importance for molecular evolution and any type of evolutionary optimization. It has been shown in [48] that local connectivity is a prerequisite for preserving any type of sequence specific information. Having a random graph threshold value localized at 1/√n, local connectivity appears much later than connectivity, which is localized at 1/n. However, the high neutrality degrees of RNA pseudoknot structures of Figure 18 and Figure 19 imply locally connected neutral networks. Our findings for the UTR-pseudoknot structure of the mouse hepatitis virus of length 56, given in Figure 21, confirm the local connectivity of neutral networks of particular pseudoknot RNA structures. At all steps of the neutral walks almost all sequences are locally connected.

Figure 21: Local connectivity of the UTR-pseudoknot. Presented are the numbers of locally connected sequences during various neutral walks. The degree of local connectivity, D_S(v), is one for all steps.

Figure 20: Deriving the two subcubes. A structure gives rise to rearrange its compatible sequences into unpaired and ...

List of abbreviations used
UTR: Untranslated Region; HDV: Hepatitis Delta Virus; DP: dynamic programming; lhs: left hand side; rhs: right hand side.