Representation of RNA folding pathways
Given an initial structure A and an end structure B, we use a sequence of actions successively applied to A, rather than a sequence of intermediate structures, to represent a folding pathway from A to B. Representing a pathway by an action chain can avoid cyclic additions and deletions of base pairs and make it easy to simulate the formation and deletion of RNA stacks. A similar representation has also been employed in the previous work of Thachuk et al. [17].
We use two types of actions, add
i,j
and del
i,j
in the representation of RNA folding pathways. For an intermediate secondary structure S of an RNA sequence x, the action add
i,j
denotes the 'add'ition of base pair (i, j) to S (i.e. add
i,j
(S) = S ∪ {(i, j)}) and del
i,j
denotes the 'del'etion of base pair (i, j) from S (i.e. del
i,j
(S) = S - {(i, j)}). An action is direct if it concerns a base pair in A ∪ B and indirect otherwise. The simplest direct pathways from A to B concern sequential deletions of all base pairs in A - B followed by additions of all base pairs in B - A.
Consider an example sequence x = GGGGAAAACCCCUUUU with initial and final structures shown in Figure 1. This simple pathway is obtained by first deleting all GC pairs from A until the RNA is single stranded, and then adding all AU pairs until B is obtained. Note that each intermediate structure S
i
differs from both its successor and predecessor by exactly one base pair. The actions in the example are all direct actions and the energy barrier is 5.50 - (-6.60) = 12.10 kcal/mol.
An addition action add
i,j
(S) conflicts with S if either x
i
or x
j
is already paired in S, and it clashes with S if there exists a base pair . A deletion action del
i,j
(S) conflicts with S if (x
i
, x
j
) ∉ S. An addition or deletion action is valid and can be applied to S properly if it neither conflicts with nor clashes with S.
A pathway from A to B can be represented by an action chain, which is a sequence of valid actions a1, . . . , a
m
such that S0 = A, S
t
= a
t
(S
t
-1) for 1 ≤ t ≤ m and S
m
= B. Note that an action chain for A to B implies a sequence of valid actions that can be successively applied to A without introducing conflicts or clashes and produce B. We use the term "action chain" when the sequence is certified to be valid, and the term "sequence of actions" if its validity is not guaranteed.
This representation of a pathway p from A to B has the following important properties. First, every folding pathway can be represented by a unique action chain and every action chain represents a unique folding pathway (note that it is not necessarily true for a sequence of actions). Second, rearranging the order of actions in p results in a new sequence of actions which represents a new folding pathway from A to B when it is valid. (It is an action chain that can be successively applied to A properly and obtain B.) Third, introducing a pair of complementary actions (e.g. add
i,j
and del
i,j
) to p results in a new sequence of actions which also represents a new folding pathway from A to B if it is valid.
In RNAEAPath, folding pathways are represented in the form of action chains, instead of a sequence of intermediate structures. This representation makes the life cycle of a folding pathway transparent to the algorithm and also makes it easier for us to simulate the cooperative formation and destruction of RNA stacks by re-arranging the order of actions or introducing multiple pairs of complementary actions.
Predicting low energy barrier folding pathways
Given an RNA sequence x, an initial structure A and a final structure B, RNAEAPath computes a near optimal low energy barrier folding pathway from A to B in an evolutionary algorithm framework [37]. Figure 2 elucidates the overall paradigm for RNAEAPath. In this algorithm, the population of each generation is comprised of folding pathways ordered by their fitness. The functions are mutation strategies, each of which takes in a pathway p and produces a set of offspring pathways. These mutation strategies are central to the effectiveness of RNAEAPath and will be discussed in the Mutation strategies subsection. ℓ1, ℓ2, ℓ3, MAX and γ are positive integer control parameters.
The initial population of RNAEAPath, ℙ0, is filled with a set of simple pathways. Then, the algorithm goes through several iterations. ℙ
k
-1 is the population of the k - 1st iteration. In the kth iteration, the algorithm produces (an ordered list of pathways) and ℙ
k
(the population of the kth iteration) from ℙ
k
-1. stores the best ℓ1 pathways in ℙ
k
-1 and the best ℓ2 pathways produced by each p ∈ ℙ
k
-1. More specifically, each pathway p ∈ ℙ
k
-1 produces offsprings through every mutation strategy . The resulting offsprings produced by p are stored in a temporary list , and the top ℓ2 pathways are added to . Finally, the best solution of the kth iteration, termed as OPT
k
, is the best pathway in . And, ℙ
k
(the population of the kth iteration) is composed of the best ℓ3 pathways of and will be used in the next iteration to produce ℙ
k
+1. This helps keep the diversity of the population large, since ℙ
k
contains at most ℓ2 offsprings produced by each p ∈ ℙ
k
-1, no matter how many high-qualified offsprings are produced by each pathway. The algorithm terminates when a stopping condition is met, and it returns the best solution of the last iteration. Since retains the best ℓ1 pathways from ℙ
k
-1 in each iteration, the best one ever encountered by the algorithm is retained in lists and ℙ
k
, and stored in OPT
k
. So, OPT
k
has no worse fitness when compared to OPT
k
-1, and RNAEAPath always returns the best action chain it ever discovered.
In the remaining of this section, we discuss details regarding fitness evaluation, initialization of the population, stopping conditions and mutation strategies of RNAEAPath.
Fitness of action chains
The order of folding pathways (valid action chains) is primarily determined by their energy barriers. In case of a tie, the order is determined by the average of energy differences between the initial structure A and intermediate structures. Note that lower energies are preferred in the previous two methods of ordering. If a tie still exists, then shorter action chains are preferred. Action chains are ordered arbitrarily if their relative order can not be determined based on these three criteria.
The initial population of folding pathways
The initial population, ℙ0, contains 4 simple pathways from A to B formed by first deleting all base pairs in A - B and then adding those in B - A, similar to the pathway shown in Figure 1. Although we can also arrange base pair deletions and additions in an arbitrary order, we tailor them in a manner that simulates successive degradation and formation of RNA stacks. This is because random deletions and additions of base pairs tend to form additional unpaired loop regions that introduce entropic penalties (see Figure 3 for an illustration). We can degrade or form each stack either from the outmost base pair to the innermost base pair or vice verse. Usually, it yields a lower energy barrier if we degrade a stack from the outmost base pair to the innermost base pair and form a stack from the innermost base pair to the outmost base pair. However, for the sake of simplicity and generosity, we construct 4 simple pathways in ℙ0, which degrade all the stacks from the same direction and form all the stacks from the same direction. These simple pathways constitute a diversified and unbiased initial population for the algorithm to start from.
The number of offsprings produced by each mutation strategy
In each generation, the expected total number of offsprings produced by each individual is a constant positive integer . The number of offsprings that each individual produces using mutation strategy , in the kth generation, is denoted by . In the initial generation, is equivalent to /Y for all the mutation strategies. In the kth generation, is determined adaptively according to the quality of the offsprings produced using in the k - 1st iteration. Let be number of offsprings that are both produced through and selected to construct ℙk-1, the population of the k - 1st generation. Then, in the kth generation is computed as follows.
Mutation strategies that have produced more high quality offsprings in the (k - 1)st iteration are allowed to generate more offsprings in the kth generation. In contrast, mutation strategies that perform poorly in the k - 1st generation, are only allowed to generate a small number (
min
, with default value 3) of offsprings. Note that, the sum of for 1 ≤ y ≤ Y may be greater than ℓ.
Stopping conditions
The algorithm terminates when (1) the current best solution achieves the lowest possible value |E(B) - E(A)|, or (2) when no improvement has been found over γ consecutive iterations (a plateau), or (3) when MAX number of iterations have passed and successive iterations do not discover better results. Note that the algorithm may simulate further than MAX iterations if improvements are made in the very last iteration and it stops immediately if no improvement is made between successive iterations. More specifically, the algorithm stops when any of the following conditions is satisfied:
-
1.
the energy barrier of OPT
k
is equivalent to |E(B) - E(A)|.
-
2.
k >γ and the fitness of OPT
k
is equivalent to that of OPT
k
-
γ
.
-
3.
k ≥ MAX and the fitness of OPT
k
is equivalent to that of OPT
k
-1.
Mutation strategies
In RNAEAPath, the mutation strategies employed to evolve folding pathways can be categorized into three types: (1) rearranging the order of actions, (2) introducing indirect pathways and (3) formation of a single stack or cooperative conversion of a pair of incompatible stacks. In this section, let denote the mutation strategies and let p = a1, . . . , a
m
denote the input pathway A = S0, . . . , S
m
= B. For each mutation strategy , we describe the process for generating one new pathway q using each mutation strategy when given p.
Type 1: reordering of actions
As described in the subsection of representation of RNA folding pathways, shuffling the order of actions of the input pathway p can result in a new pathway from A to B. In RNAEAPath, two mutation strategies of this type are employed. changes the position of an arbitrary action, and swaps the positions of two arbitrary actions.
M1: Let denote the sequence of actions obtained by first removing an action (1 ≤ t1 ≤ m) from p and then inserting it after , for all t2 ∈ {0,..., t1 - 1, t1 + 1, . . . , m}. Note that the resulting sequence of actions may not necessarily be a valid action chain. For instance, in Figure 1, and are valid action chains, while is not.
The procedure for computing is described in the following.
-
1.
Choose t1 uniformly at random from the interval [1, m].
-
2.
Compute the interval [l, u], (t1 <l <u <m), where l is the minimum and u is the maximum such that for all t2 ∈ [l, u] and t2 ≠ t1, is a valid action chain.
-
3.
Choose t2 from the interval [l, u].
-
3.1. If is an addition operation, for all l ≤ t <t' ≤ u and t ≠ t' ≠ t1, the probability of choosing t is greater than that of t'.
-
3.2. Otherwise (a deletion operation), for all l ≤ t ≤ t' ≤ u and t ≠ t' ≠ t1, the probability of choosing t is less than that of t'
We do not choose t2 (t2 ≠ t1) uniformly at random in [l, u], instead, we tend to place addition operations in the front part of p, and deletion operations in the later part of p. This is because adding base pairs early and deleting them late during the folding may help stabilize the intermediate secondary structures. (Please see Additional file 1 for the detailed description of the discrete probability.)
: Let denote the sequence of actions obtained by swapping with . If the resulting sequence of actions is a valid action chain, let it be q; otherwise, restart the process. For example, in Figure 1, is not a valid action chain, while is. t1 and t2 are chosen uniformly at random from {(t1, t2): 1 ≤ t1 <t2 ≤ m}.
Mutation strategies of type 1 provide methods for shuffling the order of actions of an input pathway and generating slightly different new pathways. However, these strategies are not capable of introducing additional (indirect) base pairs, and the offsprings of a direct pathway produced through type 1 strategies are also direct. In the following, we will describe mutation strategies that are able to construct indirect pathways from a direct pathway.
Type 2: introducing indirect pathways by adding a pair of complementary actions
Morgan and Higgs [16] pointed out that the optimal folding paths are generally indirect pathways. This idea was further described by Dotu et al. [28]. The temporary formation of base pairs, especially those base pairs that do not belong to A ∪ B, may lower the energies of intermediate structures and thus render better folding pathways. Similarly, temporary deletion and reformation of a base pair also can create an indirect pathway.
: Let denote the sequence of actions obtained by introducing an addition action add
i,j
after and its complementary action del
i,j
after . Let denote the sequence of actions obtained by introducing a deletion action del
i,j
after and its complementary action add
i,j
after . For example, in Figure 1, add1,16, a2, ..., a7, del1,16, a8. The procedures for computing and are similar to each other. In the following, we only describe the procedure for computing .
-
1.
Choose t1 uniformly at random from the interval [1, m], and obtain the associated intermediate structure .
-
2.
Find a set of base pairs that neither conflict with nor clash with and choose a base pair (i, j) uniformly at random from the set.
-
3.
Compute the interval [l, u], (t1 <l <u <m), where l is the minimum and u is the maximum such that for all values t2 ∈ [l, u] the resulting sequence of actions of is a valid action chain.
-
4.
Choose t2 from the interval [l, u] with the probability of choosing t greater than that of t' for all t >t'. (This is because (i, j) is not likely to be deleted soon after its formation.)
Mutation strategy is capable of producing an indirect pathway from a direct pathway. In addition, a proper combination of multiple applications of may result in a pathway which simulates the successive formation and deletion of a temporary stack during the folding. Take the pathway p in Figure 1 as an example, we can construct a pathway q that forms a temporary stack consisting of all the GU base pairs via a multiple application of .
Type 3: formation of a single stack or simultaneous formation and deletion of a pair of incompatible stacks
In this section, we will introduce mutation strategies for producing pathways that involve with formation and deletion of stacks. To perform this type of strategies, we first need to find all possible stacks in an RNA sequence x. We use the algorithm of Bafna et al. [38] to find the set of all possible stacks with more than 3 consecutive base pairs, and denote it by STA(x). There are two strategies in Type 3: formation of a single stack and simultaneous formation and destruction of a pair of incompatible stacks .
: Let denote the sequence of actions obtained by forcing the formation of a stack stack
h
∈ STA after action a
t
, where stack
h
is compatible with S
t
. The following describes the procedure for computing .
-
1.
Choose t uniformly at random from the interval [1, m], and obtain the associated intermediate structure S
t
.
-
2.
Find a set of stacks that neither conflict with nor clash with S
t
, and pick up a stack stack
h
uniformly at random from the set.
-
3.
Ensure that each base pair (i, j) in {stack
h
- S
t
} is sequentially (from the innermost base pair to the outmost base pair) formed after a
t
.
-
3.1. If an action add
i,j
appears in {at+1, . . . , a
m
}, move it up and place it after a
t
using strategy .
-
3.2. Otherwise, introduce a pair of complementary actions add
i,j
and del
i,j
to p after a
t
using strategy .
We can introduce additional stacks that are compatible with S
t
using by forcing a sequence of addition actions successively forming base pairs in {stack
h
- S
t
}, after a
t
.
: Let denote the sequence of actions obtained by forcing the formation of a stack stack
h
∈ STA which is incompatible with S
t
, after action a
t
. Shown on the right side of Figure 4 is a folding pathway which simultaneously destructs and forms a pair of incompatible stacks. Shown on the left side is a simple folding pathway which has exactly the same start and end structures, while it folds into a single stranded structure during the folding. Usually, the pathway on the right has lower energy barrier than the one on the left because it never folds into a single stranded structure. The folding pathway on the right side of Figure 4 can be introduced using strategy . And, the procedure for computing is as follows:
-
1.
Choose an arbitrary deletion action a
t
= del
i,j
from p, and obtain the associated intermediate structure S
t
.
-
2.
Find a set of stacks which either conflicts with or clashes with S
t
, and choose a stack stack
h
uniformly at random from the set.
-
3.
For each base pair (i', j') in {stack
h
- S
t
} that is compatible with S
t
, place add
i
',
j
' to p after a
t
using strategy .
-
4.
For each base pair (i', j') in {stack
h
- S
t
} that is incompatible with S
t
,
-
4.1. Find all the base pairs (i*, j*) in S
t
that are incompatible with (i', j'), and ensure that each base pair (i*, j*) is deleted before the action add
i
',
j'
.
-
4.2. If a action appears in {a
t
+1, . . . , a
m
}, move it up before add
i
',
j
'using strategy .
-
4.3. Otherwise, introduce a pair of complementary actions and using strategy .
Using , we can introduce the simultaneous formation of a stack stack
h
, which is incompatible with S
t
, and destruction of existent stacks (or base pairs) that hamper the formation of stack
h
. Since cooperative formation and destruction of stacks may contribute additional stacking energies for stabilizing the intermediate structures, better folding pathways with lower energy barriers may be rendered.