RNA secondary structures
RNA is a single-stranded molecule whose bases can bond with each other to form base pairs [5]. This bonding makes the RNA fold into a shape called its secondary structure. Several base-pairing rules exist [5]. A Watson-Crick base pair forms when A bonds with U through two hydrogen bonds or when G bonds with C through three hydrogen bonds. A wobble base pair forms when G bonds with U through two hydrogen bonds. Other pairings, such as G-A and U-C, exist but are relatively rare [5]. RNA secondary structures can be interacting or non-interacting. In an interacting structure, base pairs may form between bases that belong to different strands (inter-molecular). In a non-interacting structure, all base pairs form between bases on the same strand (intra-molecular).
Definition: Secondary structure
Given an RNA sequence $R = \{r_1 r_2 \ldots r_n\}$ of length $n$, the secondary structure $S$ of $R$ is a set of base pairs $(r_i, r_j)$, where $1 \le i < j \le n$, that satisfies the following two criteria [5]:
(1) Each base is paired at most once.
(2) If $(r_i, r_j), (r_k, r_l) \in S$, then $i < k < j \Leftrightarrow i < l < j$. This is called the nested criterion.
Real RNA structures occasionally violate these criteria. If the first criterion is violated, a base triple may form. If the second is violated, a pseudoknot may exist [5].
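The two criteria can be checked mechanically. The following is a minimal sketch, assuming base pairs are given as 1-based index tuples (i, j) with i < j; the function names are illustrative and do not come from [5].

```python
from collections import Counter
from itertools import permutations


def paired_at_most_once(pairs):
    """Criterion (1): no position occurs in more than one base pair."""
    counts = Counter(pos for pair in pairs for pos in pair)
    return all(count == 1 for count in counts.values())


def is_nested(pairs):
    """Criterion (2): for all pairs (i, j) and (k, l), i < k < j iff i < l < j."""
    for (i, j), (k, l) in permutations(pairs, 2):
        if (i < k < j) != (i < l < j):
            return False
    return True


def is_secondary_structure(pairs):
    """True if the pair set satisfies both criteria."""
    return paired_at_most_once(pairs) and is_nested(pairs)
```

Under this sketch, a pair set such as [(1, 6), (4, 9)] fails the nested criterion (a pseudoknot), and [(1, 5), (5, 9)] fails the first criterion (a base triple).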
RNA representations
There are different ways to represent RNA secondary structures. Some popular RNA representations are:
Dot-bracket notation
RNA secondary structure can be represented as a string of length $n$ over the alphabet $\Sigma = \{(, ., ), [, ]\}$. This representation, first proposed in [6], is known as dot-bracket notation (DBN) or nested parentheses. A base pair $(r_i, r_j)$ is represented by an opening bracket at position $i$ and a matching closing bracket at position $j$. Unpaired bases are represented by dots. Pseudoknots are represented using square or curly brackets. Square brackets are sometimes used instead to represent inter-molecular base pairs. Figure 2 shows the dot-bracket representations of the secondary structures in Figure 1.
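Reading a dot-bracket string back into base pairs only needs one stack per bracket type. The sketch below assumes round, square, and curly brackets are each matched independently; the function name and the example strings are illustrative and are not the structures of Figure 2.

```python
def parse_dot_bracket(dbn):
    """Return the sorted list of 1-based base pairs (i, j) encoded by a dot-bracket string."""
    closers = {")": "(", "]": "[", "}": "{"}
    # One stack of open positions per bracket type, so different types may cross.
    stacks = {opener: [] for opener in closers.values()}
    pairs = []
    for pos, symbol in enumerate(dbn, start=1):
        if symbol in stacks:
            stacks[symbol].append(pos)
        elif symbol in closers:
            stack = stacks[closers[symbol]]
            if not stack:
                raise ValueError(f"unmatched '{symbol}' at position {pos}")
            pairs.append((stack.pop(), pos))
        elif symbol != ".":
            raise ValueError(f"unexpected symbol '{symbol}' at position {pos}")
    for opener, stack in stacks.items():
        if stack:
            raise ValueError(f"unmatched '{opener}' at position {stack[-1]}")
    return sorted(pairs)
```

For example, `parse_dot_bracket("((..))")` yields the pairs (1, 6) and (2, 5), while `"(([..))]"` additionally yields the crossing pair (3, 8).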
Planar graph
RNA secondary structure can be represented as a planar graph [6], also known as bond representation. The graph is a simple approximation of the RNA secondary structure in two dimensions. As shown in Figure 3, the planar graph may include the following types of loops in the secondary structure:
♦ Hairpin loop: a loop that contains exactly one base pair.
♦ Stacked pair: a set of consecutive base pairs.
♦ Internal loop: a loop with two base pairs and at least one unpaired base on each side of the loop.
♦ Bulge: like an internal loop, a bulge has two base pairs, but only one side of the loop has unpaired bases.
♦ Multi-loop: any loop with three or more base pairs.
♦ External base: any unpaired base not contained in a loop. A set of consecutive external bases forms an external element.
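The loop types listed above can be recognized from the base pairs alone. The sketch below assumes a nested (pseudoknot-free) set of 1-based pairs (i, j) with i < j, and labels each pair by the loop it closes. The function names and label strings are illustrative choices, not taken from [6].

```python
def classify_loops(n, pairs):
    """Map each closing pair (i, j) to the type of loop it closes."""
    partner = [0] * (n + 1)
    for i, j in pairs:
        partner[i], partner[j] = j, i

    loops = {}
    for i, j in pairs:
        children = []
        p = i + 1
        while p < j:
            if partner[p] > p:           # p opens a pair directly inside (i, j)
                children.append((p, partner[p]))
                p = partner[p] + 1       # skip everything that pair encloses
            else:
                p += 1
        if not children:
            loops[(i, j)] = "hairpin loop"
        elif len(children) >= 2:
            loops[(i, j)] = "multi-loop"
        else:
            k, l = children[0]
            left, right = k - i - 1, j - l - 1   # unpaired bases on each side
            if left == 0 and right == 0:
                loops[(i, j)] = "stacked pair"
            elif left == 0 or right == 0:
                loops[(i, j)] = "bulge"
            else:
                loops[(i, j)] = "internal loop"
    return loops


def external_bases(n, pairs):
    """Return the unpaired positions that are not enclosed by any base pair."""
    opens = dict(pairs)                  # opening position -> closing position
    external, p = [], 1
    while p <= n:
        if p in opens:
            p = opens[p] + 1             # jump over the enclosed region
        else:
            external.append(p)
            p += 1
    return external
```

Counting the closing pair together with the pairs directly inside it matches the definitions above: one pair gives a hairpin, two give a stacked pair, bulge, or internal loop, and three or more give a multi-loop.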
RNA expression
RNA secondary structures can be represented as an expression of the following six types of terms [1]:
♦ H5 and H3: represent the beginning and end of an intra-molecular stem.
♦ I5 and I3: represent the beginning and end of an inter-molecular stem for interacting molecules.
♦ SS: represents a single-stranded (unpaired) region.
♦ BR: indicates a move between RNA sub-patterns in the case of interacting patterns.
Each term can be followed by a tuple (x, y) that gives its minimum and maximum length, respectively. Figure 4 shows the RNA expressions for the structures in Figure 1, where each term has a defined length.
Component-based representation
RNA secondary structures can be represented by components [1]. In this representation, a pattern is defined by three parts: (1) its length, (2) the intra-molecular (INTRAM) component, and (3) the inter-molecular (INTERM) component, if the pattern is interacting. An interacting pattern consists of more than one sub-pattern. In general, $P = \{p_1, p_2, \ldots, p_m\}$, where each $p_j = (\mathrm{len}_j, \{\mathrm{INTERM}_1, \mathrm{INTERM}_2, \ldots, \mathrm{INTERM}_r\}, \{\mathrm{INTRAM}_1, \mathrm{INTRAM}_2, \ldots, \mathrm{INTRAM}_q\})$ for $1 \le j \le m$. If the pattern is not interacting, that is, when $m = 1$, it has only INTRAM components.
Components are defined by the length of their opening and closing brackets and by their relative location in the pattern. The component-based representations of the structures in Figure 1 are shown in Figure 5.
Covariance model
A covariance model (CM) [7] is a probabilistic model that describes a sequence alignment and a consensus secondary structure for a set of RNA sequences. It is represented as an ordered binary tree with different types of nodes: begin (S), pair (P), left singlet (L), right singlet (R), and bifurcation (B). Each node can have a different number of states to allow for insertions, deletions, and mismatches. A CM is a generalization of a hidden Markov model with two additional state types: the pairwise match state allows the emission of a pair of symbols, while the bifurcation state allows multiple helices. A CM works only for a non-interacting RNA sequence, read from the 5' end to the 3' end. Figure 6 shows an example of a CM for the non-interacting RNA in Figure 1.
Connectivity table
RNA secondary structure can be represented as a table called a connectivity table (CT). The table is formatted as follows:
Figure 7 shows an example of a connectivity table for the non-interacting RNA in Figure 1.
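As an illustration of what such a table can contain, the sketch below writes a structure in the column layout commonly used by mfold-style CT files: a header line with the sequence length and a title, then one row per base holding its index, the base, the previous index, the next index, the index of its pairing partner (0 if unpaired), and its natural numbering. This layout is an assumption and may differ in detail from the table in Figure 7, in particular the convention that the last row's next index is 0. The function name and the example sequence are hypothetical.

```python
def to_connectivity_table(sequence, pairs, title="example"):
    """Format a structure as connectivity-table text (assumed mfold-style columns)."""
    n = len(sequence)
    partner = [0] * (n + 1)              # partner[i] == 0 means base i is unpaired
    for i, j in pairs:
        partner[i], partner[j] = j, i

    lines = [f"{n} {title}"]
    for idx, base in enumerate(sequence, start=1):
        next_idx = idx + 1 if idx < n else 0   # assumed: 0 marks the 3' end
        lines.append(f"{idx} {base} {idx - 1} {next_idx} {partner[idx]} {idx}")
    return "\n".join(lines)


print(to_connectivity_table("GGAAACC", [(1, 7), (2, 6)]))
```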
Arc representation
In this representation, a base pair is represented as an arc connecting the two bonded bases. The secondary structure is a set of overlapping or parallel arcs [8]. An example of this representation is shown in Figure 8.
Free energy models
The stability of an RNA structure is determined by its free energy: the lower the free energy, the more stable the structure. This section discusses well-known models used to calculate the free energy.
Base pair energy model
The simplest energy model considers individual base pairs. It either maximizes the number of base pairs or uses the sum of the free energies of individual Watson-Crick base pairs [9, 10]. The free energy of a Watson-Crick pair between base $i$ and base $j$ is denoted $e(i, j)$ and depends on the types of the bonded bases. A bond is called internal if both bases are on the same RNA strand; otherwise it is called external. The energy functions for internal and external bonds may be equal. For any secondary structure $S$, the total energy is the sum of $e(i, j)$ over all base pairs [11]:
$$E(S) = \sum_{(i, j) \in S} e(i, j)$$
Although this model is simple, it is known to be inaccurate [12]. Base pair maximization does not yield biologically relevant structures because it ignores stacking between base pairs, does not consider loop sizes, and has no special scoring for multi-loops.
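To make base pair maximization concrete, here is a minimal sketch of a standard dynamic program for it, usually known as the Nussinov algorithm. It is not necessarily the exact formulation used in [9, 10]. The sketch assumes only Watson-Crick pairs are allowed, and the `min_loop` parameter (minimum number of unpaired bases enclosed by a pair) is an added assumption that defaults to 0.

```python
WATSON_CRICK = {("A", "U"), ("U", "A"), ("G", "C"), ("C", "G")}


def max_base_pairs(seq, min_loop=0):
    """Return the maximum number of nested Watson-Crick pairs that seq can form."""
    n = len(seq)
    if n == 0:
        return 0
    # best[i][j] = maximum number of pairs within seq[i..j] (0-based, inclusive)
    best = [[0] * n for _ in range(n)]
    for span in range(1, n):
        for i in range(n - span):
            j = i + span
            score = best[i][j - 1]                  # case 1: base j is unpaired
            for k in range(i, j - min_loop):        # case 2: base j pairs with k
                if (seq[k], seq[j]) in WATSON_CRICK:
                    left = best[i][k - 1] if k > i else 0
                    inner = best[k + 1][j - 1] if k + 1 <= j - 1 else 0
                    score = max(score, left + inner + 1)
            best[i][j] = score
    return best[0][n - 1]
```

The recursion counts every pair equally, which is why it shares the weaknesses listed above: a stacked helix and the same number of isolated pairs receive the same score.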
Stacked pair energies
The stacked pair model [13, 14] assigns energy $E_s$ to a base pair $(i, j) \in S$ if and only if $(i + 1, j - 1) \in S$. The total energy under this model is:
$$E(S) = \sum_{(i, j) \in S} E_s \, S_{i+1, j-1}$$
where $S_{i+1, j-1} = 1$ if $(i + 1, j - 1) \in S$ and $S_{i+1, j-1} = 0$ otherwise. The number of consecutive base pairs in a stack is called the stacking size. A single base pair is not considered a stack, so the stacking size is at least 2.
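The stacked pair total can be computed directly from the definition. The following is a minimal sketch, assuming a single constant stacking energy. The function name and the value -1.0 used in the example are placeholders, not measured energies.

```python
def stacking_energy(pairs, e_stack):
    """Sum e_stack over every pair (i, j) in S whose inner neighbor (i + 1, j - 1) is also in S."""
    pair_set = set(pairs)
    stacked = sum(1 for i, j in pair_set if (i + 1, j - 1) in pair_set)
    return e_stack * stacked


# A helix of three consecutive pairs contains two stacked pairs.
print(stacking_energy([(1, 10), (2, 9), (3, 8)], e_stack=-1.0))
```

An isolated base pair contributes nothing under this model, consistent with the requirement that a stack contains at least two pairs.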
Loop energy model
The loop energy model, also called the nearest neighbor model [15], considers the free energy of the different types of loops, including externally interacting loops. Under this model, the free energy of a secondary structure is the sum of the free energies of all of its component loops. This model appears to be more accurate, especially for RNA molecules of length ≥ 150 [12]. The contribution of each loop type is determined as follows:
♦ Hairpin loop: the energy contribution of a hairpin loop is determined by two elements: the number of unpaired bases forming the loop, and the contribution of the terminal mismatch, which consists of the two bases adjacent to the closing base pair.
♦ Stacked pair: the energy contribution of a stacked pair is determined by the type and order of the base pairs.
♦ Bulge: the energy contribution of a bulge loop is determined by the number of unpaired bases forming the bulge and by the two closing base pairs.
♦ Internal loop: the energy contribution of an internal loop is determined by the number of unpaired bases forming the loop and by the four unpaired bases adjacent to the opening and closing base pairs.
♦ Multi-loop: the energy contribution of a multi-loop is a function of many factors, including the number of helices and the optimal configuration of free ends and terminal mismatches.
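The additive rule of the loop energy model can be written compactly. The block below is only a notational sketch: the symbol $\mathcal{L}(S)$ for the set of loops of $S$ and the per-type energy functions are names assumed here for illustration, not notation taken from [15].

```latex
E(S) = \sum_{L \in \mathcal{L}(S)} E(L),
\qquad
E(L) =
\begin{cases}
E_{\mathrm{hairpin}}(L)  & \text{if } L \text{ is a hairpin loop} \\
E_{\mathrm{stack}}(L)    & \text{if } L \text{ is a stacked pair} \\
E_{\mathrm{bulge}}(L)    & \text{if } L \text{ is a bulge} \\
E_{\mathrm{internal}}(L) & \text{if } L \text{ is an internal loop} \\
E_{\mathrm{multi}}(L)    & \text{if } L \text{ is a multi-loop}
\end{cases}
```

Each case depends only on the features listed for that loop type, so the energy of a structure can be evaluated one loop at a time.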