ExpaRNA-P: simultaneous exact pattern matching and folding of RNAs

Background Identifying sequence-structure motifs common to two RNAs can speed up the comparison of structural RNAs substantially. The core algorithm of the existent approach ExpaRNA solves this problem for a priori known input structures. However, such structures are rarely known; moreover, predicting them computationally is no rescue, since single sequence structure prediction is highly unreliable. Results The novel algorithm ExpaRNA-P computes exactly matching sequence-structure motifs in entire Boltzmann-distributed structure ensembles of two RNAs; thereby we match and fold RNAs simultaneously, analogous to the well-known “simultaneous alignment and folding” of RNAs. While this implies much higher flexibility compared to ExpaRNA, ExpaRNA-P has the same very low complexity (quadratic in time and space), which is enabled by its novel structure ensemble-based sparsification. Furthermore, we devise a generalized chaining algorithm to compute compatible subsets of ExpaRNA-P’s sequence-structure motifs. Resulting in the very fast RNA alignment approach ExpLoc-P, we utilize the best chain as anchor constraints for the sequence-structure alignment tool LocARNA. ExpLoc-P is benchmarked in several variants and versus state-of-the-art approaches. In particular, we formally introduce and evaluate strict and relaxed variants of the problem; the latter makes the approach sensitive to compensatory mutations. Across a benchmark set of typical non-coding RNAs, ExpLoc-P has similar accuracy to LocARNA but is four times faster (in both variants), while it achieves a speed-up over 30-fold for the longest benchmark sequences (≈400nt). Finally, different ExpLoc-P variants enable tailoring of the method to specific application scenarios. ExpaRNA-P and ExpLoc-P are distributed as part of the LocARNA package. The source code is freely available at http://www.bioinf.uni-freiburg.de/Software/ExpaRNA-P. Conclusions ExpaRNA-P’s novel ensemble-based sparsification reduces its complexity to quadratic time and space. Thereby, ExpaRNA-P significantly speeds up sequence-structure alignment while maintaining the alignment quality. Different ExpaRNA-P variants support a wide range of applications. Electronic supplementary material The online version of this article (doi:10.1186/s12859-014-0404-0) contains supplementary material, which is available to authorized users.


Computation of unpaired probabilities in loops
We present the computation for sequence A (the case of B is analogous.) We compute base pair probabilities Pr [(i, j)|A] by the McCaskill algorithm [1]. Furthermore, we extend this algorithm to compute the probabilities Pr [k ∈ loop(i, j)|A] and Pr [(i , j ) ∈ loop(i, j)|A]. For this purpose, we utilize the matrices Q ij , Q b ij , Q m ij , and Q m1 ij as defined by [1]. To recapitulate this briefly: for 1 ≤ i ≤ j ≤ |S|, the entries of these matrices equal the respective sums over the Boltzmann weights of the following sets of structures of A i..j • Q ij : all structures of A i..j • Q b ij : all structures S of A i..j with (i, j) ∈ S • Q m ij : all non-empty structures of A i..j scored as part of a multiloop • Q m1 ij : all structures S of A i..j , scored as part of a multiple loop, such that for some k holds (i, k) ∈ S and for all (i , j ) ∈ S holds i ≤ i < j ≤ k. Intuitively, Q m1 ij counts the Boltzmann weights of all structures that are part of a multiloop and have exactly one outermost base pair, starting at position i. Extending the original set of matrices, we compute the additional matrix Q m2 It represents parts of a multiloop with at least two outermost base pairs. Note that this matrix is computed in O(n 3 ) time without adding to the complexity of McCaskill's algorithm. Given those matrices, we compute Pr [k ∈ loop(i, j)|A] as Please compare the formulas to the visualization in App. Fig. 1. H, I, and M are conditional partition functions, conditioned by k ∈ loop(i, j), in the respective cases where k is contained in a hairpin, interior loop, and multiloop. The constants a, c, k, and T , as well as the energy functions F 1 (hairpin loop) and F 2 (interior loop) are defined as in McCaskill [1], where β := (kT ) −1 . To compute M , we distinguish three cases: k is either in the leftmost, the rightmost, or any other unpaired region of the loop. Note that in the computation of M , one needs to ensure that M considers only multiloops, which must have at least two inner base pairs. Only the matrix Q m2 enables performing this in constant time.

Computation of base pair probabilities in loops
The computation of Pr As visualized in App.  2 Complexity Analysis of ExpaRNA-P Lemma 1 For a fixed j , there are only O(1) base pairs (i, j), such that j is a candidate of (i, j) (and analogously for l and (k, l) in sequence B).
Proof We fix some j and denote by p j (i, j) the probability that a structure of A contains the base pair (i, j) and j occurs as an unpaired base or right end of a base pair in the loop closed by the base pair (i, j): p j (i, j) := Pr [j ∈ loop(i, j)|A] + i<i <j Pr [(i , j ) ∈ loop(i, j)|A]. If j is a candidate, it follows p j (i, j) ≥ θ * := min{θ 2 , θ 3 }, since then either Pr [j ∈ loop(i, j)|A] ≥ θ 2 or Pr [(i , j ) ∈ loop(i, j)|A] ≥ θ 3 for some i . Note that for different (i, j) the events of probabilities p j (i, j) are disjoint, since in any structure j can occur in just one loop. Therefore i,j p j (i, j) ≤ 1. Hence there are at most 1 θ * ∈ O(1) base pairs (i, j) for which p j (i, j) ≥ θ * and only for those j can be a candidate.
Note that for this lemma it is crucial to consider the probabilities within the single loops and not only general base pair and unpaired probabilities. Considering these probabilities is the key insight of this new way of sparsification.
Theorem 1 There are only O(n 2 ) entries L ijkl (j , l ), G ijkl A (j , l ), G ijkl AB (j , l ), and LR ijkl (j , l ) such that j is a candidate of (i, j) and l is a candidate of (k, l).
Proof Due to Lem. 1 there are O(n) many combinations i, j, j . Analogously there are O(n) combinations k, l, l and therefore O(n 2 ) combinations i, j, k, l, j , l satisfying the conditions.

Corollary 1
The time and space complexity of computing all entries L ijkl (j , l ), G ijkl A (j , l ), G ijkl AB (j , l ), and LR ijkl (j , l ), D and F is O(n 2 ).
Proof We show that D has O(n 2 ) entries. Clearly, the number of all other matrix entries is quadratically bounded due to Theorem 1. The argument for D has been given before in the context of LocARNA [2]; also compare the proof of Lemma 1.  greater than 150 we achieve substantial speedup. The inexact mode is also superior to the exact mode for these input sequences. For the longest input sequences, we even obtain respective speedups of 32 and 35 for the exact and inexact mode.
These results show that utilizing anchor constraints to speed up sequence-structure alignments pays off especially if RNAs get longer. Note that while our benchmarks are rather biased to short instances, they clearly show the effect.