Analysing RNA-kinetics based on folding space abstraction

Background: RNA molecules, especially non-coding RNAs, play vital roles in the cell and their biological functions are mostly determined by structural properties. Often, these properties are related to dynamic changes in the structure, as in the case of riboswitches, and thus the analysis of RNA folding kinetics is crucial for their study. Exact approaches to kinetic folding are computationally expensive and, thus, limited to short sequences. In a previous study, we introduced a position-specific abstraction based on helices which we termed helix index shapes (hishapes) and a hishape-based algorithm for near-optimal folding pathway computation, called HiPath. The combination of these approaches provides an abstract view of the folding space that offers information about the global features.

Results: In this paper we present HiKinetics, an algorithm that can predict RNA folding kinetics for sequences up to several hundred nucleotides long. This algorithm is based on RNAHeliCes, which decomposes the folding space into abstract classes, namely hishapes, and an improved version of HiPath, namely HiPath2, which estimates plausible folding pathways that connect these classes. Furthermore, we analyse the relationship of hishapes to locally optimal structures, the results of which strengthen the use of the hishape abstraction for studying folding kinetics. Finally, we show the application of HiKinetics to the folding kinetics of two well-studied RNAs.

Conclusions: HiKinetics can calculate kinetic folding based on a novel hishape decomposition. HiKinetics, together with HiPath2 and RNAHeliCes, is available for download at http://www.cyanolab.de/software/RNAHeliCes.htm.


Background
RNA molecules play vital roles in the cell, and their function is often determined by structural properties. These properties may be static, such as structural motifs, or dynamic, such as the ability to adopt different conformations as riboswitches do. The latter emphasises the importance of studying RNA folding kinetics, which is the dynamic behaviour of RNA structure over time.
Most approaches to the stochastic simulation of RNA folding kinetics can be described as Monte Carlo simulations [1][2][3] or continuous time Markov chains (CTMC) [4,5]. A Monte Carlo simulation requires a large number of samples of individual trajectories to achieve accuracy, rendering these methods computationally expensive. The same holds true for CTMC-based simulation, as long as it is based on a complete enumeration of the folding space.
The program TREEKIN [4] implements a CTMC-based simulation, and for short sequences (e.g., up to 30 nt), can simulate exact folding kinetics. For longer sequences, however, the exponential growth of the underlying state space requires restricting the analysis to a subset of the folding space. For this purpose so-called "macrostates" were introduced in [4], each of which can be seen as a local minimum and all structures that are connected to it by a gradient walk. A macrostate is represented by its local minimum secondary structure. The problem that arises from the macrostate definition is that neighbouring macrostates cannot easily be identified. The program TREEKIN uses BARRIERS to compute saddle points connecting macrostates and the corresponding transition rates. The dependence on BARRIERS limits this approach to sequences of moderate length (up to 100 nt), which can be partially overcome by restricting the analysis to conformations within a specified energy range above the minimum free energy. To overcome the length restriction and reduce the computational burden, Tang et al. [6] use a sampling strategy called probabilistic Boltzmann-filtered suboptimal sampling. In their approach, sampled structures are connected by transition paths computed using a simple greedy algorithm [7]. These transition paths are weighted with their barrier energy. The procedure may be suboptimal in two ways: first, the sampling may miss important structures in the folding space, and second, the greedy pathway prediction may overestimate energy barriers and lead to inaccurate transition rates.

http://www.biomedcentral.com/1471-2105/15/60
The computation of exact, globally optimal folding pathways between any two secondary structures (e.g., BARRIERS [1,8]) is NP-hard [9]. Many heuristic approaches for computing folding pathways have been proposed. The first approach was proposed by Morgan and Higgs [10] by selecting the least "clashing" base-pairs as the next intermediate structure from a set of neighbouring structures. Subsequently, the idea was extended by Flamm et al. [11]. Instead of selecting the best structure as the next intermediate structure, the k best candidates are maintained during the folding pathway construction (breadth first search, BFS). In contrast to these direct path heuristics (intermediate structures contain only base pairs that are also present in the start or target structure), Dotu et al. [12] presented a heuristic including indirect paths. Li et al. [13] proposed an evolutionary algorithm in which a pathway is represented by an action chain that is mutated by different strategies to find a better solution.
In general, there are two central challenges in CTMC-based folding simulations for RNA. How can the energy landscape be decomposed in a complete, compact and non-heuristic way? And how can the transition rates between partitions be calculated accurately and efficiently?
Our contributions in this paper address these challenges. In previous work [14], we introduced hishapes as classes of structures sharing the same helices. These hishapes intrinsically decompose the folding space into disjoint classes, which are represented by the member with minimum free energy, called the hishrep. This partitioning is complete and non-heuristic, and its coarse-graining can be adjusted based on its abstraction levels, which differ in the type of structural elements they consider. Here, we analyse the degree to which hishapes overlap with locally optimal structures. Additionally, we provide a new folding space restriction, called strictly negative structures, that eliminates suboptimal structures with positive energy substructures. We present HIPATH2 as an improved version of HIPATH [14] and show that it computes lower energy barrier folding pathways for most cases in our benchmark set. Finally, we combine these methods in HIKINETICS, a tool for simulating RNA folding kinetics using strictly negative hishapes for the folding space decomposition and energy barriers estimated by HIPATH2 to derive transition rates using Arrhenius' equation. We apply HIKINETICS to two well-studied RNAs.

Hishapes revisited
We begin with a brief recapitulation of the central concepts and notations of hishapes. For formal definitions, we refer the reader to our previous manuscript [14]. For hishapes, we consider an RNA secondary structure as a set of helices terminated by loops (internal, bulge, multiple and hairpin loops). The innermost base pair (i, j) of a helix corresponds to the closing base pair of the terminating loop, and we define (i + j)/2, the central position of this base pair, to be the helix index of this helix. Additionally, we mark the helix index with m, b, or i for multiple, bulge, or internal loop, respectively. Using a mapping function $\pi$, we can now map any secondary structure to a helix index shape (hishape), which is simply a list of helix indices. Figure 1 illustrates the relationship among helices, helix indices and hishapes. To provide different levels of abstraction, we make use of different mapping functions. The function $\pi_h$ retains only hairpin loop helices and $\pi_{h+}$ additionally keeps track of the nesting within multiloops. The functions $\pi_m$ and $\pi_a$ extend $\pi_{h+}$ through retaining multiloops and all helices, respectively. A hishape defines a class of similar structures, and we use the member with minimum free energy as the hishape representative (hishrep).
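The mapping from a structure to its hairpin-level hishape can be illustrated with a short Python sketch. It is a simplified illustration and not the RNAHELICES implementation: it assumes a pseudoknot-free dot-bracket string with 1-based positions, and the function names `base_pairs` and `hairpin_hishape` are hypothetical. It only reports the helix indices of hairpin-closing base pairs and ignores the nesting information and loop-type markers used by the other abstraction levels.

```python
from typing import List, Tuple


def base_pairs(structure: str) -> List[Tuple[int, int]]:
    """Return the 1-based base pairs (i, j) of a dot-bracket string."""
    stack, pairs = [], []
    for pos, char in enumerate(structure, start=1):
        if char == "(":
            stack.append(pos)
        elif char == ")":
            if not stack:
                raise ValueError("unbalanced structure")
            pairs.append((stack.pop(), pos))
    if stack:
        raise ValueError("unbalanced structure")
    return sorted(pairs)


def hairpin_hishape(structure: str) -> List[float]:
    """Helix indices (i + j) / 2 of all hairpin-closing pairs, 5' to 3'."""
    indices = []
    for i, j in base_pairs(structure):
        # A pair closes a hairpin loop if every position strictly
        # between i and j is unpaired.
        if set(structure[i:j - 1]) <= {"."}:
            indices.append((i + j) / 2)
    return indices


print(hairpin_hishape("((...))..((....))"))
```

In the toy structure, the innermost pairs (2, 6) and (11, 16) close the two hairpin loops, so the sketch reports the helix indices 4 and 13.5.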

Reducing the search space to strictly negative structures
The number of feasible secondary structures grows exponentially with the length of the RNA. We recently presented hishapes, which abstract from helix lengths and, depending on the abstraction type, also from certain loop types. Compared to suboptimal structures, the number of possible hishapes is dramatically reduced, but it still grows exponentially with sequence length.
Hishapes provide deep insight into the folding space of an RNA molecule while keeping the output at a manageable size. Analysing one of our favourite toy examples, the Spliced Leader RNA from Leptomonas collosoma, we recognised that there are pairs of hishapes where the hishrep with an additional helix has a higher energy, as shown in Figure 2. Here, due to the additional helix with helix index 13, the energy of hishape [13,38] is worse than the energy of hishape [38].
The formation of this helix imposes an energy cost of 1.2 kcal/mol and, thus, is thermodynamically unfavourable. To eliminate such unfavourable structures, we cannot simply exclude all positive energy substructures within our recursive DP calculation. Doing so would for example disallow nearly all hairpin loops and thereby the computation of many biologically significant structures. We take the view that closed substructures within the external loop or within a multiloop must not have positive energy. We are aware that disallowing positive energy substructures within multiloops may even remove the minimum free energy (MFE) structure from the structure space. In fact, a test on 10,000 randomly selected sequences from Rfam showed that for 1.67% of the sequences, the MFE structure is removed. For these 167 sequences, the strictly negative optimal structure has a ΔG that is on average 0.49 kcal/mol (σ = 0.367, max = 2.3 kcal/mol) worse than the MFE. However, these differences are on the same scale as (or even below) the uncertainties present in the thermodynamic parameters used for computation. A further reason we think that removing substructures with positive energy is reasonable is that they seem kinetically unfavourable. A helix nucleates by formation of the terminal hairpin loop, which is the time dominating step, and is subsequently stabilised by the stacking of base pairs. For positive energy substructures, the ΔG of the hairpin loop is very large, which results in a low probability of nucleation, and/or the ΔG of the stacking pairs is small, which renders the melting of such helices very likely. For these reasons, we believe that disallowing positive energy substructures is a reasonable method to reduce the search space, although it is a heuristic filtering.
Because we can check for substructures with positive energy during the recursive calculation, this filter has nearly no computational burden. On the contrary, the reduced number of intermediate results actually speeds up the calculation. Restricting the analysis to strictly negative (SN) hishapes significantly reduces the search space (see Figure 3). It still grows exponentially with sequence length, but much more slowly, which is reflected by the much smaller base in the exponential growth asymptotics.

Hishreps versus local optimal structures
Figure 2 Three best hishapes of the spliced leader RNA from L. collosoma. The leftmost column lists hishreps. ΔG is the free energy in kcal/mol and hishape represents the $\pi_m$ hishape. P is the hishape probability. This figure was generated using 'RNAHeliCes -f examples/spliced_leader.seq -q'.

Figure 3 Comparison of structure/hishape spaces. All possible structures and hishapes were predicted for random sequences of lengths ranging from 20-120 nt, using RNASUBOPT -noLP and RNAHELICES with different abstraction levels and restricting to strictly negative (SN) structures, respectively. The average numbers of structures/hishapes for each length were fitted to the formula $a \times b^{n} \times n^{-3/2}$ [15]. The numbers in parentheses give the values for b, which is the dominating factor in this term.

We were interested in the extent to which hishreps overlap with the set of locally optimal structures. As described, e.g., in [16], a locally optimal structure has the lowest free energy compared with its neighbouring structures, which are the structures that differ from it by a single base pair. Because our approach disregards any structure that contains isolated base pairs, we slightly modify the concept of the neighbourhood. Commonly, a neighbour (A′) of the observed structure (A) is defined by adding (or deleting) a base pair in A. This definition also holds true for our purposes, as long as A′ does not carry a lonely base pair. If A′ does contain a single lonely base pair as the result of previously removing a base pair, then we also delete the isolated one, resulting in the structure (A″), which will still be treated as a neighbour of A. Vice versa, if A′ carries an isolated base pair due to its addition, we close, if possible, an adjacent base pair. The resulting structure A″ is then a neighbour to A. Note that in the two described cases, A and A″ differ by two adjacent base pairs. This version of the neighbourhood should be essentially the same as the 'noLP' move set from BARRIERS. Based on this definition, we check whether our predicted hishreps are locally optimal or not. Table 1 shows, for the different abstraction levels and for strictly negative hishapes and all hishapes, the fractions of hishreps that are local optima. Overall, the fractions are quite high, sometimes reaching 100%. The sequence for the S-box leader constitutes a negative outlier, especially in the case of strictly negative structures, where at most only 15% of the $\pi_h$ hishreps are locally optimal. Strikingly, strictly negative hishreps less frequently correspond to local minima compared to the unrestricted case. This result is somewhat counterintuitive but may be explained as follows. Filtering for strictly negative hishapes removes many hishapes. Because most hishapes are actually local minima, as can be seen for the unfiltered version, these hishapes are also affected the most strongly. Thus, the fraction of non-local optima increases in the case of strictly negative hishapes. So what are these non-locally optimal hishreps? In our opinion, they are mainly the result of replacing helices by single stranded regions. Because the formation of the removed helix would result in a neighbouring structure with better energy, the hishrep of the resulting hishape is not a local minimum.
This reasoning, together with the fact that in abstraction type $\pi_a$ the largest number of helices is taken into account, also explains to a large degree why hishreps for abstraction type $\pi_a$ are less often locally optimal than hishreps of types $\pi_m$, $\pi_{h+}$ and $\pi_h$.
The opposite question, "do all locally optimal structures belong to distinct hishapes", is easier to address. For abstractions $\pi_m$, $\pi_{h+}$ and $\pi_h$ the structures do not have to belong to distinct hishapes as two locally optimal structures differing, e.g., by an internal loop, will be mapped to the same hishape. The situation is different for $\pi_a$ hishapes, as they account for differences in all loop types. Starting from any locally optimal structure, the extension and shortening of helices cannot lead to another locally optimal structure. Reaching another locally optimal structure is only possible by adding or removing complete helices or by helix interruption, i.e., the introduction of internal or bulge loops. All these events will introduce new helices into the $\pi_a$ abstraction, thus resulting in different hishapes. This point is nicely reflected by the fractions of locally optimal structures that are also hishreps ($|H \cap L| / |L|$, Table 1). While locally optimal structures have a fairly high overlap with hishreps of the least abstract types $\pi_a$ and $\pi_a^{SN}$, the overlap drops significantly for the other abstraction types, as many local optima differ in the composition of their internal and bulge loops and are thus not retained on these abstraction levels, as described above.

Table 1 In each cell, the upper number represents the fraction of the set of hishreps H that are also locally optimal ($|H \cap L| / |H|$) and the lower number represents the fraction of the set of local optima L that are also hishreps ($|H \cap L| / |L|$). We restricted the computation of hishapes to the best 100 and the computation of the local optima to the corresponding energy range max{ΔG(x) : x ∈ H} above the MFE. The dataset is taken from [12]. SN: strictly negative hishapes.
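The two overlap fractions reported per cell of Table 1 are plain set ratios. The following Python sketch shows how they could be computed once the hishreps H and the local optima L are available as collections of dot-bracket strings; the function name `overlap_fractions` and the toy structures are assumptions made for illustration only and are not taken from the benchmark.

```python
from typing import Iterable, Tuple


def overlap_fractions(hishreps: Iterable[str],
                      local_optima: Iterable[str]) -> Tuple[float, float]:
    """Return (|H ∩ L| / |H|, |H ∩ L| / |L|) for two structure sets."""
    H, L = set(hishreps), set(local_optima)
    if not H or not L:
        raise ValueError("both sets must be non-empty")
    shared = H & L
    return len(shared) / len(H), len(shared) / len(L)


# Toy data (hypothetical): three hishreps, four local optima, two shared.
H = ["((...))....", "....((...))", "(((...)))  "]
L = ["((...))....", "....((...))", ".((...))...", "...((...)).."]
print(overlap_fractions(H, L))
```

The first value corresponds to the upper number of a table cell and the second to the lower number.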

Improved barrier energy estimation
Pathways connecting alternative structures are important features of the folding space, especially when studying folding kinetics. Here, transition rates computed based on the energy barriers, which are derived from the pathways between structures, are commonly used. It has been shown that the problem of computing the globally optimal folding pathway between two structures is NP-hard [9]. In our recent publication [14], we provided an overview of current pathway estimation tools and introduced HIPATH, outperforming the other analysed methods. Here, we present an improved version, which we term HIPATH2. One of the essential features of HIPATH is that it uses a set of related hishapes as anchors for estimating a (near-) optimal pathway between two structures. These related hishapes correspond to hishapes that consist of individual helix indices from the start and target structures or combinations thereof. By detailed inspection of the optimal folding pathways obtained by BARRIERS, we observed that pathway intermediates sometimes carry helices with helix indices that are not identical, but very similar to the helix indices of the start or target hishape, differing by only a few positions. Therefore, we implemented fuzzy related hishapes that also take into account the neighbourhoods (in terms of the helix index distance) of related hishapes. HIPATH2, which is based on fuzzy related hishapes, was benchmarked against existing methods (BARRIERS [1,8], BFS [11], RNATABUPATH [12], RNAEAPATH [13] and HIPATH [14]) on 18 conformational switches taken from [12] (see Table 2). They consist of two parts: five of them are riboswitches (rb1, rb2, rb3, rb4 and rb5) taken from [17,18], and the remaining 13 are taken from PARNASS [19]. All of the algorithms were used with the same energy rules (Turner99) [20,21]. We use the "microstate" grammar [22], which corresponds to the "-d1" option of RNAEVAL from the Vienna RNA package [23]. All other parameters were kept as the defaults.
The results in Table 2 show that in most cases, HIPATH2, together with other methods, produces the lowest energy barrier estimates. In the four cases where exact pathways are known, the sum of errors is reduced from 1.7 to 0.8 compared to HIPATH. Compared to the second best method, RNAEAPATH, HIPATH2 produces slightly (0.1 to 0.4 kcal/mol) less optimal pathways in four cases (rb2, hok, thiM leader, HIV-1 leader). However, in eight cases it performs better by 0.14 to 2.26 kcal/mol. A major difference is found in the runtimes of the two. Table 3 compares the runtimes of HIPATH2 and RNAEAPATH. While RNAEAPATH spends approximately 837 min, HIPATH2 only needs approximately 192 min, thus being 4.4 times faster.

Table 2 The dataset was taken from [12,19], the results for BFS and RNATABUPATH from [12] and the results for EA from [13]. Energy barriers are given in kcal/mol. The maxkeep value k was 10 for BFS itself and for the BFS used within HIPATH and HIPATH2. HIPATH2 was used with auto-adjusted fuzzy related hishape numbers, $\pi_a$ and θ = 1.5. HIPATH was used with the default parameters. Bold numbers represent the minimum value for the respective sequence. The symbol "*" means BARRIERS could not be applied because either the start or the target structure was not locally optimal. The symbol "-" means computation did not finish within one day. The energy range used with RNASUBOPT for BARRIERS was determined using HIPATH2 and set to the barrier energy of HIPATH2 + 1 kcal/mol. Note that the results may be different from the ones shown in [14] since the used start and target structures may differ. Here we used the ones provided in [12], while in [14] we derived them for ourselves.

Table 3 Run times were measured as described before [14], and both programs were used with default parameters. Sequences were taken from [12,19], and all tests were run on an 8x AMD Opteron 8378 machine with 128 GB RAM under openSUSE 11.2 (x86_64).

Simulating folding kinetics
Our approach for simulating folding kinetics is based on a set of hishapes connected by pathways with their corresponding barrier energies. The most straightforward approximation of transition rates can be done using Arrhenius' equation. Consider the two hishapes α and β. We initially compute the hishape ensemble energies (ΔG(α), ΔG(β)) via the hishapes' partition function contributions calculated by RNAHELICES (see Equation 4). Next, using HIPATH2, we estimate the barrier energy ΔG[α, β] between the two hishreps of α and β. Finally, we derive the transition rates using Arrhenius' equation (see Equation 5). Using the hishape ensemble energy can be seen as weighting the energy by the size of the hishape class, which takes into account that the more members a hishape has, the higher the probability of a transition into the hishape. In contrast, transition out of a large (in terms of members) hishape is less likely. Our approach is conceptually similar to the macrostate model introduced with TREEKIN. There, the folding space is partitioned into macrostates, based on local minima and their basins of attraction. These macrostates are computed by the program BARRIERS, which also computes the transition rates based on the barrier energies. The latter are computed on-the-fly, which is elegant, but has one major drawback: the depth (in terms of free energy above the MFE) of the analysis must be sufficiently large to ensure that saddle points connecting all local minima (macrostates) of interest are present. For real-world examples, this depth can easily reach 10-20 kcal/mol (see Table 2), resulting in a large computational effort to compute the transition rates, especially for long sequences. Our approach circumvents this problem, as the computation of the transition rates is separated from the computation of the macrostates, i.e. hishapes, and the latter is more efficient, especially when restricted to strictly negative hishapes. Therefore, HIKINETICS is able to simulate folding kinetics for longer sequences than is possible with BARRIERS and TREEKIN. Of course, this ability does not come for free, and we expect our transition rate estimate to be less accurate than the one made using BARRIERS. The results we present in the next section show that this inaccuracy seems to have only a minor influence.
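To make the rate model concrete, the following Python sketch derives Arrhenius-type rates from ensemble energies and a barrier energy and then integrates the resulting master equation with a simple explicit Euler scheme. Equations 4 and 5 are not reproduced in this section, so the forms used here are assumptions: the ensemble energy is taken as ΔG(α) = −RT ln Z_α and the rate as k(α→β) = A · exp(−(ΔG[α, β] − ΔG(α)) / RT) with an arbitrary prefactor A = 1. All function names and the toy energies are hypothetical, and the integrator is not the one used by HIKINETICS.

```python
import math

RT = 0.0019872 * 310.15  # gas constant in kcal/(mol K) times 37 degrees C in K


def ensemble_energy(partition_function_share: float) -> float:
    """Assumed form of the hishape ensemble energy: -RT ln Z_alpha."""
    return -RT * math.log(partition_function_share)


def arrhenius_rate(g_from: float, g_barrier: float, prefactor: float = 1.0) -> float:
    """Assumed Arrhenius form: A * exp(-(barrier - source energy) / RT)."""
    return prefactor * math.exp(-(g_barrier - g_from) / RT)


def rate_matrix(energies, barriers, prefactor=1.0):
    """Build rates[a][b] from state energies and pairwise barrier energies."""
    n = len(energies)
    rates = [[0.0] * n for _ in range(n)]
    for (a, b), g_barrier in barriers.items():
        rates[a][b] = arrhenius_rate(energies[a], g_barrier, prefactor)
        rates[b][a] = arrhenius_rate(energies[b], g_barrier, prefactor)
    return rates


def euler_populations(rates, start, dt, steps):
    """Integrate dp/dt of the master equation with explicit Euler steps."""
    p = list(start)
    n = len(p)
    for _ in range(steps):
        change = [0.0] * n
        for a in range(n):
            for b in range(n):
                if a != b:
                    flux = rates[a][b] * p[a] * dt
                    change[a] -= flux
                    change[b] += flux
        p = [x + d for x, d in zip(p, change)]
    return p


# Toy two-state system (hypothetical energies in kcal/mol).
energies = [-10.0, -11.0]
barriers = {(0, 1): -8.0}
rates = rate_matrix(energies, barriers)
populations = euler_populations(rates, start=[1.0, 0.0], dt=1.0, steps=5000)
print(populations)
```

Because forward and backward rates share the same barrier, their ratio depends only on the difference of the two ensemble energies, so the long-time populations of this sketch follow the Boltzmann distribution of the states.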

Spliced Leader RNA from Leptomonas collosoma
The Spliced Leader RNA from Leptomonas collosoma [24] has two alternating conformations of nearly equal free energy. Figure 2 shows the results of hishape analysis. The two $\pi_m$ hishapes ([38] and [27]) represent the two native conformations of the Spliced Leader RNA. The probabilities of conformations 1 and 2 are 0.345271 and 0.470394, respectively, which is in agreement with the bistable character of this RNA.
The kinetic analysis in Figure 4 shows that the two major hishapes ([38] and [27]) dominate from t = 10 μs until equilibrium. At the end of the simulation, their equilibrium occupancies are the same as the probabilities calculated by the partition function. Interestingly, both alternative hishape classes build plateaus that persist for a long period (from approximately t = 500 μs to t = 50,000 μs) and cross at approximately t = 50,000 μs. If the Spliced Leader RNA degrades within this period, hishape [38] would be kinetically preferred, achieving almost 50% occupancy. However, if the lifetime of the Spliced Leader RNA exceeds the time needed to reach equilibrium, hishape [27] would win.
To determine the degree to which strictly negative filtering influences the analysis, we performed a simulation based on strictly negative hishapes on the same sequence (see Figure 5). Here, the (arbitrary) timescale of the process is altered, while the characteristics are the same. Note that the two hishapes ([13,38] and [10.5,38]), which are related to [38], are not strictly negative and thus are no longer present. As a result of the filtering, the equilibrium probabilities are also altered from 0.345 to 0.422 for hishape [38] and from 0.470 to 0.575 for hishape [27]. This result is mainly due to the reduced state space, such that each state occurs with higher frequency. Direct computation of the probabilities for the strictly negative hishapes using RNAHELICES results in the same values.
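The shift in equilibrium probabilities can be checked with a small piece of arithmetic. If filtering only removes states and the remaining probabilities are renormalised, every surviving hishape should be scaled by the same factor. The Python sketch below, which uses only the four rounded values quoted above and a hypothetical helper name, computes the probability mass implied for the retained state space by each hishape.

```python
def implied_retained_mass(p_unrestricted: float, p_filtered: float) -> float:
    """Probability mass of the surviving states implied by one hishape."""
    return p_unrestricted / p_filtered


mass_38 = implied_retained_mass(0.345, 0.422)
mass_27 = implied_retained_mass(0.470, 0.575)
print(round(mass_38, 3), round(mass_27, 3))
```

Both hishapes imply nearly the same retained mass, which is consistent with the explanation that the altered probabilities mainly reflect the reduced state space.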
Next, we compared our hishape-based kinetics simulation to the simulation from TREEKIN, whose results are shown in Figure 6. Focussing on the two dominant hishapes [38] and [27], the similarity to the kinetics based on strictly negative structures (Figure 5) is higher than the similarity to the kinetics for the unrestricted approach (Figure 4). By design, the latter retains more detail, which is reflected by the presence of the two not strictly negative hishapes [13,38] and [10.5,38] in this simulation. Again, however, the simulated kinetics closely resembles the TREEKIN results. Overall, this result shows that our approach to the simulation of folding kinetics is accurate enough to capture major features of the folding space, such as the late crossing of hishapes [38] and [27].

The c-di-GMP riboswitch of the tfoX from Candidatus desulforudis audaxviator
In the second example, we analysed the c-di-GMP riboswitch of the tfoX gene from Candidatus desulforudis audaxviator (CP000860.1/c(1860063-1860186)) [25]. As shown in Figure 7, it has two states that differ by approximately 2.3 kcal/mol in free energy. The c-di-GMP riboswitches, like all riboswitches, are composed of two domains: an aptamer and an expression platform. The aptamer is more conserved and is responsible for binding c-di-GMP, while the expression platform controls expression by alternative conformations. Here, helix 116.5, which is present in the second hishrep, constitutes a Rho-independent terminator hairpin.
We simulated the folding kinetics based on strictly negative hishapes and chose the stable helix ([25.5]) of the aptamer as the initial population (see Figure 8). The hishape [25.5,94.5], which corresponds to the native ON conformation, dominates from t = 0.5 μs until thermodynamic equilibrium. Other hishapes such as [7.5,25.5,63.5,94.5,116.5], [25.5,63.5,87,116.5], [25.5,63.5,94.5] and [63.5] appear transiently in different periods. The first two correspond to OFF conformations (helix 116.5 is present), and their fraction is significantly increased from approximately t = 0.01 μs to t = 5,000 μs. The hishape [25.5,63.5,94.5] likely represents a folding intermediate between the ON and OFF conformations, as it is composed of helices from both structures. Its share increases briefly at time point 10,000 μs and drops shortly after, while the fraction of hishape [25.5,94.5] increases, which supports the assumption that hishape [25.5,63.5,94.5] is a folding intermediate between the ON and OFF conformations. The hishape [63.5] appears late ($10^{6}$ μs) in the simulation. The short time span (t = 0.01 μs to t = 5,000 μs) where OFF conformations achieve a significant fraction of the folding space reflects the kinetic control of this riboswitch [27]. The folding kinetics restricts the time period during which the RNA is accessible for regulation.

Conclusions
In this paper, we present several methods for improving folding space analysis. First, we introduce strictly negative hishapes that represent a reasonable subset of the folding space, i.e., those hishapes composed of helices that all have negative energies. We analysed hishapes and their strictly negative variant for correspondence to local optima, and found a large overlap. This result supports our idea of using hishapes for folding space analysis. Second, we present HIPATH2, an improved algorithm for calculating suboptimal folding pathways between two given secondary structures. A benchmark confirms that HIPATH2 outperforms its predecessor and other heuristics on the chosen dataset. Finally, we present a new approach for simulating RNA kinetics, which is based on hishapes and uses HIPATH2 to compute transition rates. The simulated folding kinetics of two well-studied RNAs show that using our approach allows us to draw functional conclusions. The results for the c-di-GMP riboswitch make us wonder if kinetics can help in identifying new riboswitches. To the best of our knowledge, the existing methods for the identification of riboswitches [19,28-31] are based on sequence and/or secondary structure conservation or on structure comparison. No methods use folding kinetics.

Figure 6 Folding kinetics of the Spliced Leader RNA from L. collosoma simulated with TREEKIN. We applied BARRIERS and TREEKIN to simulate folding kinetics based on the macrostate model. Each macrostate representative local minimum was mapped to its $\pi_h$ hishape, and ones with the same hishape were merged. The simulation started from the open chain. We show the results for the 25 best hishapes plus the open chain.
Our strategy to disentangle folding space partitioning and barrier energy estimation makes it possible to simulate folding kinetics for fairly long sequences. The most time-consuming step is the computation of pairwise energy barriers using HIPATH2. Because these computations are independent, this step can be easily parallelised, which we already exploited. For massively parallel applications, GPU-accelerated computing is the method of choice, and might be a reasonable option to significantly speed up folding kinetics simulations using HIKINETICS.

Energy parameters
When not mentioned explicitly, we used the most recent set of energy parameters [32].

Restricting the folding space to strictly negative structures
The algorithm for helix index shape analysis has been developed using Bellman's GAP [33][34][35]. Bellman's GAP supports semantic filtering which filters the answer list with the specified filter function after the objective function is applied. We take advantage of this filtering feature to remove positive energy substructures in the external loop and in multiloops. Because the resulting hishapes have negative energy, we term them strictly negative (SN).

Fuzzy related hishapes
Figure 7 The alternating structures of the c-di-GMP riboswitch of the tfoX gene from C. desulforudis audaxviator MP104C. We took the native structures proposed in [26] and used them as constraints to predict the energetically optimal structure using RNAFOLD. These results were then mapped to the corresponding hishapes. ΔG is the free energy in kcal/mol, and hishape represents the $\pi_h$ hishape.

The helix index (central position of the loop closing base pair the helix ends in) is susceptible to small variations. If one of the pairing partners shifts by a single position, as in helix slipping, the helix index will also change. Furthermore, in folding pathways between two conformations, intermediate structures may occur that have helices with slightly different helix indices.
To account for these small variations, we introduce a less stringent version of related hishapes, which we call fuzzy related hishapes.

Definition (Fuzzy related hishapes). Given two hishapes α and β in an arbitrary abstraction type and a user-defined threshold θ, and letting φ be a function to extract hairpin loop helix indices, fuzzy related hishapes γ are the hishapes that satisfy

$$\max_{t \in \varphi(\gamma)} \; \min_{z \in \varphi(\alpha) \cup \varphi(\beta)} |t - z| \le \theta \qquad (1)$$
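The condition in Equation 1 translates directly into a few lines of Python. In this sketch each hishape is given as the list of its hairpin loop helix indices, i.e., φ has already been applied. The function names are hypothetical, and the treatment of an empty γ (accepted, since the condition is then vacuous) and of empty anchors (rejected) is an assumption that the definition itself does not cover.

```python
from typing import Sequence


def fuzzy_distance(gamma: Sequence[float],
                   alpha: Sequence[float],
                   beta: Sequence[float]) -> float:
    """Left-hand side of Equation 1: max over gamma of the min anchor distance."""
    anchors = set(alpha) | set(beta)
    if not gamma:
        return 0.0  # assumption: no hairpin helices, condition holds vacuously
    if not anchors:
        return float("inf")  # assumption: nothing to be related to
    return max(min(abs(t - z) for z in anchors) for t in gamma)


def is_fuzzy_related(gamma, alpha, beta, theta: float) -> bool:
    """True if gamma is a fuzzy related hishape of alpha and beta."""
    return fuzzy_distance(gamma, alpha, beta) <= theta


# Helix indices borrowed from the Spliced Leader RNA example; theta = 1.5.
print(is_fuzzy_related([37.5, 27], [38], [27], theta=1.5))
print(is_fuzzy_related([13, 38], [38], [27], theta=1.5))
```

In the usage example, a helix that slipped from index 38 to 37.5 is still accepted, whereas the hishape carrying helix 13 is rejected because no helix of the start or target hishape lies within θ of it.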

Restricting the number of fuzzy related hishapes within HIPATH2
The number of (fuzzy related) hishapes has a large impact on the runtime of HIPATH2. For this reason we provide a means to restrict this number. In the previous version (HIPATH), the calculation of related hishapes always starts at the most abstract level. If, in this level, the number of hishapes is not greater than a user-defined threshold n, the next lower abstraction level is used. This step is performed either until the number of hishapes is greater than n or the user-defined lowest abstraction level t is reached. Calculating related hishapes in this way requires repeated hishape calculations for different abstraction types. For example, if the first attempt does not result in a sufficient number of hishapes, they must be calculated for the next abstraction type, and the initial result will be discarded.
To avoid this issue and speed up HIPATH2, we use an auto-adjust strategy that applies the empirically derived formula shown in Equation 2. Precise asymptotics for the number of abstract shapes have been derived in [15,36] and are defined by $a \times b^{n} \times n^{-3/2}$, where n is the sequence length. We use this formula to adjust the number of related hishapes for the HIPATH2 calculations. After empirical testing, we chose $a \times b^{n} = 124{,}000$. Therefore, for n = 500, k is approximately 10, which means that we keep the 10 fuzzy related hishapes with the lowest free energy. This precaution keeps the HIPATH2 calculation within one hour for two hishapes of a 500 nt long sequence.
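The auto-adjust rule can be illustrated numerically. Equation 2 is not reproduced in this section, so the sketch below assumes the reading suggested by the text: the factor $a \times b^{n}$ is replaced by the constant 124,000 and the number of retained hishapes is k = 124,000 × n^(−3/2). The function name is hypothetical, and any rounding or lower bound that HIPATH2 applies is not modelled.

```python
def auto_adjusted_hishape_number(n: int, scale: float = 124_000.0) -> float:
    """Assumed auto-adjust rule: k = scale * n^(-3/2) for sequence length n."""
    if n <= 0:
        raise ValueError("sequence length must be positive")
    return scale * n ** -1.5


for length in (100, 200, 500):
    print(length, round(auto_adjusted_hishape_number(length), 1))
```

Under this reading, the unrounded value for n = 500 is about 11, which matches the order of magnitude of the k ≈ 10 stated above, and shorter sequences are allowed considerably more fuzzy related hishapes.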

HIPATH2 algorithm
For the computation of a single pathway between a given start and target structure, we restrict the search space to fuzzy related hishapes as defined by Equation 1. Additionally, given an RNA sequence x, a start structure S and a target structure T, only the shortest path from the start to the target structure is computed. Algorithm 1 shows an outline of HIPATH in pseudocode. In line 4, the N lowest-energy fuzzy related hishreps in the $\pi_h$ abstraction (-t 1) with respect to the helix index list $H_U$ are calculated using RNAHELICES. In line 7, we use a breadth first search (BFS) to estimate the energy barrier between L[i] and L[j], which is stored in the matrix $M_{BFS}$ at position (i, j). In line 10, we apply a modified version of Dijkstra's algorithm [37] in which the edges are weighted with the barrier energies calculated by the BFS algorithm. Instead of computing the sum of the weights, we take the maximum weight along the path and look for the path with the lowest maximum weight.
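The modified Dijkstra step described above is an instance of the minimax (bottleneck) path problem. The following Python sketch shows one way to implement it with a priority queue; it is an illustration and not the HIPATH2 source code. The graph is assumed to be a nested dictionary that maps each hishrep label to its neighbours and the pairwise barrier energies (the role played by the BFS matrix), and the function name `minimax_path` and the toy labels are hypothetical.

```python
import heapq
from typing import Dict, Hashable, List, Tuple

Graph = Dict[Hashable, Dict[Hashable, float]]


def minimax_path(barriers: Graph, source, target) -> Tuple[float, List]:
    """Find the path whose highest edge weight (barrier) is lowest."""
    best = {source: float("-inf")}  # lowest known bottleneck per node
    previous = {}
    heap = [(float("-inf"), source)]
    done = set()
    while heap:
        bottleneck, node = heapq.heappop(heap)
        if node in done:
            continue
        done.add(node)
        if node == target:
            break
        for neighbour, weight in barriers.get(node, {}).items():
            # Path cost is the maximum edge weight, not the sum.
            candidate = max(bottleneck, weight)
            if candidate < best.get(neighbour, float("inf")):
                best[neighbour] = candidate
                previous[neighbour] = node
                heapq.heappush(heap, (candidate, neighbour))
    if target not in best:
        raise ValueError("target is not reachable from source")
    path = [target]
    while path[-1] != source:
        path.append(previous[path[-1]])
    return best[target], path[::-1]


# Toy barrier energies in kcal/mol between four hypothetical hishreps.
toy = {
    "S": {"A": 5.0, "B": 2.0, "T": 6.0},
    "A": {"S": 5.0, "T": 3.0},
    "B": {"S": 2.0, "T": 7.0},
    "T": {"A": 3.0, "B": 7.0, "S": 6.0},
}
print(minimax_path(toy, "S", "T"))
```

In the toy graph the direct transition has a barrier of 6 and the route via B a barrier of 7, so the route via A with a maximum barrier of 5 is selected, although the first edge towards B is the cheapest single step.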