 Research Article
 Open Access
 Published:
ExpaRNAP: simultaneous exact pattern matching and folding of RNAs
BMC Bioinformatics volume 15, Article number: 404 (2014)
Abstract
Background
Identifying sequencestructure motifs common to two RNAs can speed up the comparison of structural RNAs substantially. The core algorithm of the existent approach ExpaRNA solves this problem for a priori known input structures. However, such structures are rarely known; moreover, predicting them computationally is no rescue, since single sequence structure prediction is highly unreliable.
Results
The novel algorithm ExpaRNAP computes exactly matching sequencestructure motifs in entire Boltzmanndistributed structure ensembles of two RNAs; thereby we match and fold RNAs simultaneously, analogous to the wellknown “simultaneous alignment and folding” of RNAs. While this implies much higher flexibility compared to ExpaRNA, ExpaRNAP has the same very low complexity (quadratic in time and space), which is enabled by its novel structure ensemblebased sparsification. Furthermore, we devise a generalized chaining algorithm to compute compatible subsets of ExpaRNAP’s sequencestructure motifs. Resulting in the very fast RNA alignment approach ExpLocP, we utilize the best chain as anchor constraints for the sequencestructure alignment tool LocARNA. ExpLocP is benchmarked in several variants and versus stateoftheart approaches. In particular, we formally introduce and evaluate strict and relaxed variants of the problem; the latter makes the approach sensitive to compensatory mutations. Across a benchmark set of typical noncoding RNAs, ExpLocP has similar accuracy to LocARNA but is four times faster (in both variants), while it achieves a speedup over 30fold for the longest benchmark sequences (≈400nt). Finally, different ExpLocP variants enable tailoring of the method to specific application scenarios. ExpaRNAP and ExpLocP are distributed as part of the LocARNA package. The source code is freely available at http://www.bioinf.unifreiburg.de/Software/ExpaRNAP.
Conclusions
ExpaRNAP’s novel ensemblebased sparsification reduces its complexity to quadratic time and space. Thereby, ExpaRNAP significantly speeds up sequencestructure alignment while maintaining the alignment quality. Different ExpaRNAP variants support a wide range of applications.
Background
Genomewide highthroughput transcriptomics has revealed evidence for massive transcription of eukaryotic genomes, vastly exceeding translation to proteins [1][3]. Ultimately, the ENCODE project [4] has established pervasive transcription of most of both strands of the human genome. Remarkably, while only a minor fraction of the transcripts codes for proteins, the majority of the noncoding RNAs (ncRNAs) are associated with function [5]. Nevertheless, the functional annotation is lagging behind strongly: reliable automated annotation pipelines exist only for subclasses of ncRNAs like tRNAs, microRNAs, or snoRNAs [6].
Recent computational screens, e.g. [7], reveal stable, conserved structures in a large part of ncRNAs, again pointing to function. The de novo RNAgene finders qrna [8], MSARi [9], EvoFold [10], and RNAz [11] identify conservation of stable RNA structures in whole genome alignments; this can be boosted by structurebased realignment (REAPR [12]). Identifying RNAs with similar sequence and common secondary structure advances further towards the automatic annotation of noncoding RNAs. At genomic scale, clustering approaches like [13][15] identify remote members of RNAfamilies as defined in the Rfam database [16], and determine new classes of structurally similar – hence, likely functionally related – ncRNAs. Thus, all such analysis of RNAs relies on comparing RNAs.
Simultaneous alignment and folding (SA&F)
Aligning RNAs and, simultaneously, inferring their common structure is considered the gold standard for comparing RNAs. [17] solves this problem in O(n ^{6}) time and O(n ^{4}) space (for RNAs of length n). In practice, e.g. for searching remote members of RNAfamilies, this complexity is strongly limiting. Even worse, identifying novel RNAclasses in the plethora of newly discovered RNAtranscripts (by allagainstall pairwise comparisons) is simply not feasible by Sankoff’s SA&F method.
Many Sankoffimplementations [18][24] reduce the high computational demands by sequencebased heuristics. A prominent line restricts the search space based on alignment probabilities that consider only sequence information. This idea was introduced by [20], and later refined by [22] and [24].
PMcomp [25] introduced an orthogonal idea to gain speed up over Sankoff’s algorithm. Applying a lightweight energy model, which assigns energies to single base pairs, enables to lower computational costs significantly. For – at the same time – high accuracy, it scores structural matches by ensemble base pair probabilities, precomputed in a fullfeatured energy model [26]. LocARNA [13] implements the lightweight energy model of PMcomp, but gains further speedup by introducing the structurebased heuristic ensemblebased sparsification. Employing the structural sparsity of RNA structure ensembles, LocARNA’s complexity is improved to only O(n ^{4}) time and O(n ^{2}) space.
Subsequently, other Sankofflike methods [27][29] apply similar ensemblebased sparsifications in lightweight models. RAF [29] additionally inherits the sequencebased speed up of [24].
In [30], we have proposed the lightweight SA&F strategy ExpLoc; it cuts down the computational demands significantly beyond LocARNA’s sparsification, but unlike other heuristic improvements such as [29], ExpLoc does not restrict the search space based on structureignorant sequence alignments. ExpLoc computes exactly conserved elements in pairs of fixed RNA secondary structures, based on an algorithm with quadratic time and space complexity [31]; subsequently, these elements provide anchors for a LocARNA alignment. In hindsight, this strategy suffers from similar problems as the first generation of RNA alignment methods [32],[33]: relying on a single predicted input structure for each sequence, this strategy fails frequently and causes severe misalignments, since predicting minimum free energy structures from single sequences is highly unreliable.
Simultaneous matching and folding (SM&F)
Here, we present a novel algorithm that enables an ExpLoclike speedup while resolving its fundamental problem (of relying on fixed structures) by performing exact matching and RNA folding simultaneously. Studying exact matching in nonfixed structures with a very different focus, we have discussed heavy path decomposition for related problems [34]; furthermore, we have presented preliminary work on ensemblebased exact matching in [35]. The novel algorithm ExpaRNAP computes exactly sequencestructureconserved elements that form highly probable local substructures in the RNA structure ensembles of both input RNAs.
Analogous to Sankoff’s SA&F idea, the novel strategy performs “simultaneous matching and folding” (SM&F) of RNA sequences. Thereby, it liberates exact pattern matching from its restriction to a priori fixed structure [31]. We point out that a straightforward extension of the fixed input structure matching to SM&F, would require at least O(n ^{4}) time and O(n ^{2}) space, which is still as high as the complexity of LocARNA. However, to speed up RNA comparison significantly, reducing this complexity is fundamental.
Sparsification of SM&F
Thus, our main technical contribution is to solve SM&F in quadratic time and space — as efficiently as plain sequence alignment. This is enabled by a novel sparsification technique that substantially goes beyond prior approaches. Utilizing novel ensemble properties of the sequences, we identify sparse regions of each matrix such that, in total across all matrices, only quadratically many matrix entries have to be computed; each of them calculated in constant time. In contrast, LocARNA reduces only the number of computed DPmatrices, but requires quadratic time for each of them. This novel sparsification is based on limiting the joint probability of a sequence position or a base pair occurring as parts of particular loops in the ensembles of the single RNAs.
Notably, other sparsification approaches [36][39] apply a different (not ensemblebased) form of sparsification. Generally, these methods rule out subsolutions, which are computed by the DP, that can not occur in the optimal solution. This allows deriving a provably optimal solution while reducing the number of required case distinctions. In contrast, the idea of our sparsification is to remove subsolutions that are unlikely in the solution ensemble. Consequently, ensemblebased sparsification does not only allow much stronger savings, but moreover is applicable even for computing partition functions of RNA alignments; this is realized in LocARNAP [40], which computes RNA alignment reliabilities from SA&F partition functions.
Overview of results
To evaluate the practical benefits of our algorithmic innovations, we construct the pipeline ExpLocP for SA&F (in the spirit of ExpLoc), which we sketch in Figure 1. In its first stage, it enumerates suboptimal exact matchings of local sequencestructure patterns due to the introduced algorithm ExpaRNAP. In the second stage, the suboptimal matchings are chained to select an optimal subset of compatible matchings that can simultaneously occur in an alignment of RNAs. Finally, these matchings are heuristically utilized as anchor constraints in the subsequent LocARNA alignment.
First, we study important design choices in the ExpLocP pipeline, which provides insights into practical implications of the developed concepts; in particular, we compare strict and relaxed matching in ExpaRNAP. The latter allows mismatches at structural positions, which improves the coverage of low identity sequences. In extensive benchmarks, ExpLocP produces highquality alignments. At the same time, due to its heuristic use of ExpaRNAP anchors, it achieves a considerable speedup (about fourfold) over the benchmark set of typical RNAs (BRAliBase 2.1). For long sequences (≈400 nt) of the benchmark set, the speedup is more than 30fold.
Methods
Preliminaries
An RNA sequence A is a string over the alphabet {A,C,G,U}. A _{ i } denotes the base at the ith position of A; A _{i..j}, the substring of A from position i to j, which is called subsequence in this context; and A, the length of A. A structure of A is a set S of base pairs (i,j) such that 1≤i<j≤A, where A _{ i } and A _{ j } are complementary (AU, CG, or GU.) Furthermore, structures are noncrossing: in a structure S, each sequence position is involved in at most one base pair, i.e. for all (i,j),(i ^{′},j ^{′})∈S: (i=i ^{′}⇔j=j ^{′}) and i≠j ^{′}, and base pairs do not cross, i.e. there are no base pairs (i,j),(i ^{′},j ^{′})∈S s.t. i<i ^{′}<j<j ^{′}. The span of a base pair (i,j) is j−i.
Let S be a structure of sequence A. We define the pseudobase pair ψ _{ A }:=(0,A+1). The parent of position k in S is the base pair (i,j)∈S∪ψ _{ A } with i<k<j such that there does not exist any (i ^{′},j ^{′})∈S with i<i ^{′}<k<j ^{′}<j. Analogously, the parent of a base pair (i,j) is the parent of i (which is the parent of j at the same time). Note that parents are unique, since noncrossing structures correspond to trees.
Furthermore, we define loop_{ S }(i,j) as the set of positions of A and base pairs in S, whose parent in S is (i,j); note that loop_{ S }(i,j) is empty, if (i,j)∉S. Intuitively, if a base or base pair has the parent (i,j) in S, it belongs to the loop closed by (i,j) in S.
For a sequence A, let Pr[SA] denote the probability of the structure S in the Boltzmann ensemble of A [26]. Pr[(i,j)A] denotes the base pair probability of (i,j), which is defined as \sum _{S\ni (i,j)}Pr\left[SA\right]. Thus, Pr[(i,j)A] is the probability that a random structure S, drawn from the Boltzmann ensemble of A, contains the base pair (i,j).
Pattern matchings in RNA structure ensembles
ExpaRNAP identifies sequencestructure patterns that are shared by two input RNA sequences. We provide a general description of pattern matchings in RNA sequences and specialize to two different variants (for examples, see Figure 2). We fix sequences A and B with lengths A=n and B=m; for stating computational complexities, we assume m≤n. The sets of possible base pairs of respective sequences A and B are denoted by P and Q.
Definition 1(connected, Pattern Matching).
We denote the match of positions i and k by i ∼ k and the base pair match of base pairs (i,j) and (k,l) by i j ∼ k l. We consider pairs of arbitrary sets \mathcal{\mathcal{M}}\subseteq \{i\phantom{\rule{0.3em}{0ex}}\sim \phantom{\rule{0.3em}{0ex}}k\mid i\in [\phantom{\rule{0.3em}{0ex}}1\mathrm{..n}],k\in [\phantom{\rule{0.3em}{0ex}}1\mathrm{..m}\left]\phantom{\rule{0.3em}{0ex}}\right\} and \mathcal{S}\subseteq \{\mathit{\text{ij}}\phantom{\rule{0.3em}{0ex}}\sim \phantom{\rule{0.3em}{0ex}}\mathit{\text{kl}}\mid (i,j)\in {\left[\phantom{\rule{0.3em}{0ex}}1\mathrm{..n}\right]}^{2},i<j,(k,l)\in {\left[\phantom{\rule{0.3em}{0ex}}1\mathrm{..m}\right]}^{2},k<l\}. \mathcal{P}=(\mathcal{\mathcal{M}},\mathcal{S}) is connected, iff the graph {\mathcal{G}}_{\mathcal{P}}=(\mathcal{\mathcal{M}},\mathcal{E}), where \mathcal{E}=\left\{\right(i\phantom{\rule{0.3em}{0ex}}\sim \phantom{\rule{0.3em}{0ex}}k,j\phantom{\rule{0.3em}{0ex}}\sim \phantom{\rule{0.3em}{0ex}}l)\mid (\phantom{\rule{0.3em}{0ex}}j=i+1\text{and}l=k+1)\text{or}\mathit{\text{ij}}\phantom{\rule{0.3em}{0ex}}\sim \phantom{\rule{0.3em}{0ex}}\mathit{\text{kl}}\in \mathcal{S}\}, is (weakly) connected.
is called Pattern Matchingiff
is a matching,
i.e. i=j⇔k=l for all i\phantom{\rule{0.3em}{0ex}}\sim \phantom{\rule{0.3em}{0ex}}k,j\phantom{\rule{0.3em}{0ex}}\sim \phantom{\rule{0.3em}{0ex}}l\in \mathcal{\mathcal{M}}
is noncrossing,
i.e. i<j⇒k<l for all i\phantom{\rule{0.3em}{0ex}}\sim \phantom{\rule{0.3em}{0ex}}k,j\phantom{\rule{0.3em}{0ex}}\sim \phantom{\rule{0.3em}{0ex}}l\in \mathcal{\mathcal{M}}
‘contains’ ,
i.e. \mathit{\text{ij}}\phantom{\rule{0.3em}{0ex}}\sim \phantom{\rule{0.3em}{0ex}}\mathit{\text{kl}}\in \mathcal{S}\Rightarrow \{i\phantom{\rule{0.3em}{0ex}}\sim \phantom{\rule{0.3em}{0ex}}k,j\phantom{\rule{0.3em}{0ex}}\sim \phantom{\rule{0.3em}{0ex}}l\}\subseteq \mathcal{\mathcal{M}}
the structure \left\{\right(i,j)\mid \mathit{\text{ij}}\phantom{\rule{0.3em}{0ex}}\sim \phantom{\rule{0.3em}{0ex}}\mathit{\text{kl}}\in \mathcal{S}\} is noncrossing
(consequently, together with the previous condition, \left\{\right(k,l)\mid \mathit{\text{ij}}\phantom{\rule{0.3em}{0ex}}\sim \phantom{\rule{0.3em}{0ex}}\mathit{\text{kl}}\in \mathcal{S}\} is noncrossing as well).
(\mathcal{\mathcal{M}},\mathcal{S}) is connected.
A position i is matched by (in sequence A) iff there is a position k, s.t. i\phantom{\rule{0.3em}{0ex}}\sim \phantom{\rule{0.3em}{0ex}}k\in \mathcal{\mathcal{M}.} This is symmetrically defined for positions j and sequence B.
We are going to define strict and relaxed exact pattern matchings (cf. Figure 2AB). In the former, all matched nucleotides have to be identical. The latter relaxes this by allowing mismatched nucleotides at matched base pairs (taking compensatory mutations into account).
For this purpose, we distinguish two kinds of matches in a pattern matching (\mathcal{\mathcal{M}},\mathcal{S}): define the set of structure matches as \mathcal{\mathcal{M}}{}_{\mathcal{S}}:=\{i\phantom{\rule{0.3em}{0ex}}\sim \phantom{\rule{0.3em}{0ex}}k,j\phantom{\rule{0.3em}{0ex}}\sim \phantom{\rule{0.3em}{0ex}}l\mid \mathit{\text{ij}}\phantom{\rule{0.3em}{0ex}}\sim \phantom{\rule{0.3em}{0ex}}\mathit{\text{kl}}\in \mathcal{S}\}; the set of sequence matches is
i.e. all matches that are not structural matches.
Definition 2(Strict EPM).
A strict Exact Pattern Matching (strict EPM) is a pattern matching (Def. 1) with the additional property:
Definition 3(Relaxed EPM).
A relaxed Exact Pattern Matching (relaxed EPM) is a pattern matching with the additional property:
We introduce the term EPM to refer to strict EPMs and relaxed EPMs generically. By Definition 1, a pattern matching, and therefore an EPM, does not necessarily match positions of contiguous subsequences, but it is required that the matched sequencestructure motifs are structurelocal [41],[42] in each sequence. For example, in Figure 2B, the sets of gray sequence positions in each RNA are structurelocal, because these positions are (graphtheoretically) connected via edges formed by backbone or base pair bonds; in contrast gray motifs in Figure 2C are not structurelocal, because they consist of two separated connected components.
To characterize good EPMs, we define the score of an EPM (\mathcal{\mathcal{M}},\mathcal{S}) by summing up single score contributions of base and base pair matches:
where σ and τ are scoring functions with properties σ(i,k)>0 if A _{ i }=B _{ k } and τ(i,j,k,l)>0 if A _{ i }=B _{ k } and A _{ j }=B _{ l }. In our studies, we set σ(i,k) to 1 if A _{ i }=B _{ k } (otherwise, −∞); furthermore, τ is parameterized by
The parameters α _{1}, α _{2}, and α _{3} weight respective contributions of sequence matches, structure matches, and stacking. The stacking contribution c_sta rewards stacked base pairs. Each mismatch at the left or right end of a base pair match is penalized by str_mm; for scoring strict EPMs, we set this penalty to −∞, which forbids all kinds of mismatches. In analogy to the notation Pr[(i,j)A], Pr[(i,j)∧(i+1,j−1)A] denotes the joint probability of the stacked base pairs (i,j) and (i+1,j−1). Such probabilities are computed in slight extension of McCaskill’s algorithm [43].
As in the case of RNA structures (of some sequence A), one can define parent relations in EPMs of sequences A and B. In analogy, we define the pseudobase pair match to match the two pseudo base pairs, i.e. ψ:=ψ _{ A } ∼ ψ _{ B }. In the following, we consider the base pair matches i ^{′}j ^{′} ∼ k ^{′}l ^{′} to be order by their spans j ^{′}−i ^{′} (or k ^{′}−l ^{′}; the choice is arbitrary, since we consider only noncrossing structure.) According to this partial order, we define {\text{parent}}_{\mathcal{S}}(i\phantom{\rule{0.3em}{0ex}}\sim \phantom{\rule{0.3em}{0ex}}k) as the smallest {i}^{\prime}{j}^{\prime}\phantom{\rule{0.3em}{0ex}}\sim \phantom{\rule{0.3em}{0ex}}{k}^{\prime}{l}^{\prime}\in \mathcal{S}\cup \left\{\psi \right\} that satisfies i ^{′}≤i≤j ^{′}; {\text{parent}}_{\mathcal{S}}(\mathit{\text{ij}}\phantom{\rule{0.3em}{0ex}}\sim \phantom{\rule{0.3em}{0ex}}\mathit{\text{kl}}) denotes the smallest base pair match that satisfies i ^{′}<i<j<j ^{′}.
We define additional joint probabilities to characterize the “interesting” EPMs.
Definition 4(Joint probabilities).
We define joint occurrence probabilities of elements in loops of structures in the Boltzmann ensemble of X, where X denotes either A or B.
Pr [k ∈ loop(i,j)X] denotes for i<k<j the joint probability that a structure of X contains the base pair (i,j)and the unpaired base k such that (i,j) is the parent of k.
Pr [(i ^{′},j ^{′}) ∈ loop(i,j)X] denotes for i<i ^{′}<j ^{′}<j the joint probability that a structure of X contains the base pairs (i,j)and (i ^{′},j ^{′}) such that (i,j) is the parent of (i ^{′},j ^{′}).
For catchy notation, the expressions loop(i,j) in Def. 4 resemble loop_{ S }(i,j) – notationally omitting the structures S in the Boltzmann ensemble of A (analogously, B).
We introduce an efficient algorithm to compute these probabilities in Section ‘Precomputation: joint loop probabilities’. Since we want to match only structures that have high probability in the Boltzmann ensembles of the given sequences – as computed by McCaskill’s algorithm [26] – we define the notion of significant EPMs. This constraint is crucial for both the quality of the results and the complexity of the algorithm. To define significance, we furthermore introduce three thresholds θ _{1},θ _{2} and θ _{3}. We limit the probability of all matched base pairs by θ _{1}; furthermore, the joint probabilities of matched unpaired bases and base pairs, occurring as part of their enclosing loop, by θ _{2} and θ _{3}, respectively.
Definition 5(Significant EPMs).
Given thresholds θ _{1},θ _{2}, and θ _{3}, an EPM is significantiff
for all \mathit{\text{ij}}\sim \mathit{\text{kl}}\in \mathcal{S}:
Pr[(i,j)A]≥θ _{1} and Pr[(k,l)B]≥θ _{1}
for all i\sim k\in \mathcal{\mathcal{M}}\setminus \mathcal{\mathcal{M}}{}_{\mathcal{S}}:
Pr[i∈loop(i ^{′},j ^{′})A]≥θ _{2}
and Pr[k∈loop(k ^{′},l ^{′})B]≥θ _{2},
where {i}^{\prime}{j}^{\prime}\sim {k}^{\prime}{l}^{\prime}={\text{parent}}_{\mathcal{S}}(i\sim k)\ne \psi
for all \mathit{\text{ij}}\sim \mathit{\text{kl}}\in \mathcal{S}:
Pr[(i,j)∈loop(i ^{′},j ^{′})A]≥θ _{3}
and Pr[(k,l)∈loop(k ^{′},l ^{′})B]≥θ _{3},
where {i}^{\prime}{j}^{\prime}\sim {k}^{\prime}{l}^{\prime}={\text{parent}}_{\mathcal{S}}(\mathit{\text{ij}}\sim \mathit{\text{kl}})\ne \psi
We reduce the return set of our algorithm further by reporting only EPMs that are not included in better (reported) EPMs and that do not include better EPMs. The second condition is relevant only for relaxed EPMs, since this cannot occur for strict EPMs. In the case of strict EPMs, those EPMs are simply maximal w.r.t. the following inclusion order \u2291 of pattern matchings. Hence, we call them maximal strict EPMs.
Definition 6(Inclusion Order on EPMs).
Let \mathcal{P}=(\mathcal{\mathcal{M}},\mathcal{S}) and {\mathcal{P}}^{\prime}=\left({\mathcal{\mathcal{M}}}^{\prime}\phantom{\rule{0.01pt}{0ex}},{\mathcal{S}}^{\prime}\right) be EPMs. is included in {\mathcal{P}}^{\prime}, written \mathcal{P}\u2291{\mathcal{P}}^{\prime}iff
\mathcal{\mathcal{M}}\subseteq {\mathcal{\mathcal{M}}}^{\prime}
for all i\phantom{\rule{0.3em}{0ex}}\sim \phantom{\rule{0.3em}{0ex}}k\in \mathcal{\mathcal{M}}:
{\text{parent}}_{\mathcal{S}}(i\phantom{\rule{0.3em}{0ex}}\sim \phantom{\rule{0.3em}{0ex}}k)={\text{parent}}_{{\mathcal{S}}^{\prime}}(i\phantom{\rule{0.3em}{0ex}}\sim \phantom{\rule{0.3em}{0ex}}k)
Notably, in the inclusion order of Def. 6, EPMs with different structures are not comparable. Consequently, two EPMs that match the same positions can be both maximal, if they match different structure. This is illustrated in Figure 3 (AC).
In the case of strict EPMs, the highest scoring EPMs are always maximal EPMs w.r.t. the inclusion order, which allows us to select the “interesting” EPMs by this simple property. However, the same does not hold for relaxed EPMs: for example, typically the score of a relaxed EPM decreases if it is extended by a structure match with mismatching nucleotides; still, further extensions can increase the total score again. These dependencies are illustrated in Figure 3 (A and DF).
Consequently, since we want to keep the highest scoring EPMs in the case of relaxed EPMs as well, we define a scoreextended partial order.
Definition 7(Score Inclusion Order).
Let \mathcal{P}=(\mathcal{\mathcal{M}},\mathcal{S}) and {\mathcal{P}}^{\prime}=({\mathcal{\mathcal{M}}}^{\prime},{\mathcal{S}}^{\prime}) be EPMs. is smaller than {\mathcal{P}}^{\prime} in the score inclusion order, iff \text{score}\left(\mathcal{P}\right)<\text{score}\left({\mathcal{P}}^{\prime}\right) and (\mathcal{P}\u2291{\mathcal{P}}^{\prime} or {\mathcal{P}}^{\prime}\u2291\mathcal{P}).
We call a relaxed EPM maximal, iff it is maximal w.r.t. this order among all relaxed EPMs. In other words, a relaxed EPM is maximal if and only if there is no second relaxed EPM with a higher score that is, by inclusion order, (a) smaller or (b) larger in the relaxed EPM (see Figure 3 (A and DF)). Note that different patterns with the same score are not comparable so that they cannot rule out each other.
Both maximality definitions are canonically raised to maximal significant strict EPMs and relaxed EPMs.
Precomputation: joint loop probabilities
Fundamentally, our novel sparsification technique relies on the joint probabilities of Def. 4. For sequences X∈{A,B}, one efficiently computes base pair probabilities Pr[(i,j)X] by McCaskill’s algorithm [26]. In this work, we extend this algorithm to compute the probabilities Pr[k ∈ loop(i,j)X] and Pr[(i ^{′},j ^{′}) ∈ loop(i,j)X] for X∈{A,B}. For this purpose, we introduce – on top of the McCaskill matrices – the auxiliary matrix {Q}_{\mathit{\text{ij}}}^{m2}, which represents parts of a multiloop with at least two outermost base pairs. This enables computing the additional joint probabilities efficiently in the complexity bounds of the McCaskill algorithm (Additional file 1).
Importantly, all these probabilities are efficiently precomputed independently for each sequence. Hence, e.g. in clustering scenarios, where all pairs from a set of sequences need to be matched, this preprocessing needs to be done only once for each sequence and not for all quadratically many pairs.
ExpaRNAP: Optimizing over significant EPMs
Figure 4 provides formal recursion equations of the dynamic programming EPM optimization algorithm; the same recursions are presented graphically in Figure 5.
Fundamental to our approach, all matrices and evaluations in the recursions are sparse, i.e. only entries and cases are considered where the probabilities of elements pass the respective probability thresholds (cf. Def. 5). Corresponding constraints are given in the recursion equations – this is also illustrated in Figure 5, using arrows. Otherwise, we can largely postpone this aspect until Section ‘ExpaRNAP: Sparsification’.
The matrix entries D(i j,k l) score the best EPM enclosed by each base pair match i j ∼ k l, i.e. D(i j,k l) denotes the best score of a significant EPM (\mathcal{\mathcal{M}},\mathcal{S}) of A _{i..j} and B _{k..l} with \mathit{\text{ij}}\phantom{\rule{0.3em}{0ex}}\sim \phantom{\rule{0.3em}{0ex}}\mathit{\text{kl}}\in \mathcal{S}.
Inside of the base pair match i j ∼ k l, we determine the (score of the) best (\mathcal{\mathcal{M}},\mathcal{S}) that is either a significant EPM itself or forms a (connected) significant EPM only together with the closing base pair match i j ∼ k l. The first case is covered by the single matrix L, whereas the latter case requires three matrices G _{A}, G _{AB}, and LR. By and large, for deriving one Dentry one starts matching from the left using L. Potentially, one introduces a gap using matrices G _{A} and G _{AB} and continues using matrix LR to match the part that is only connected to the right end of i j ∼ k l.
In more detail, first we determine the best score of a significant EPM \mathcal{P}=(\mathcal{\mathcal{M}},\mathcal{S}) that is connected to the left end i ∼ k of the base pair match, i.e. is empty or contains i+1 ∼ k+1. Concretely, L ^{ijkl}(j ^{′},l ^{′}) is such a score, where \mathcal{\mathcal{M}}\subseteq [i+1\mathrm{..}{j}^{\prime}]\times [\phantom{\rule{0.3em}{0ex}}k+1\mathrm{..}{l}^{\prime}] and {j}^{\prime}\phantom{\rule{0.3em}{0ex}}\sim \phantom{\rule{0.3em}{0ex}}{l}^{\prime}\in \mathcal{\mathcal{M}}. To introduce a gap, the latter condition is changed for G _{A} and G _{AB}. In the case of {G}_{\phantom{\rule{0.3em}{0ex}}\text{A}}^{\mathit{\text{ijkl}}}({j}^{\prime},{l}^{\prime}),\mathcal{\mathcal{M}} does not match j ^{′} but matches l ^{′}; for {G}_{\phantom{\rule{0.3em}{0ex}}\text{AB}}^{\mathit{\text{ijkl}}}({j}^{\prime},{l}^{\prime}),\mathcal{\mathcal{M}} does not match l ^{′} and potentially does not match j ^{′}. Finally, L R ^{ijkl}(j ^{′},l ^{′}) is the best sum of scores of two significant EPMs {\mathcal{P}}_{1}=({\mathcal{\mathcal{M}}}_{1},{\mathcal{S}}_{1}) and {\mathcal{P}}_{2}=({\mathcal{\mathcal{M}}}_{2},{\mathcal{S}}_{2}) where the first is connected to the left base pair match end i ∼ k and the second contains j ^{′} ∼ l ^{′}. Intuitively, the two EPMs are separated by a gap; formally: (for all {i}_{1}\phantom{\rule{0.3em}{0ex}}\sim \phantom{\rule{0.3em}{0ex}}{k}_{1}\in {\mathcal{\mathcal{M}}}_{1} and {i}_{2}\phantom{\rule{0.3em}{0ex}}\sim \phantom{\rule{0.3em}{0ex}}{k}_{2}\in {\mathcal{\mathcal{M}}}_{2},{i}_{1}<{i}_{2}1 and k _{1}<k _{2}) or (for all {i}_{1}\phantom{\rule{0.3em}{0ex}}\sim \phantom{\rule{0.3em}{0ex}}{k}_{1}\in {\mathcal{\mathcal{M}}}_{1} and {i}_{2}\phantom{\rule{0.3em}{0ex}}\sim \phantom{\rule{0.3em}{0ex}}{k}_{2}\in {\mathcal{\mathcal{M}}}_{2},{i}_{1}<{i}_{2} and k _{1}<k _{2}−1).
Our recursion equations (Figure 4 and Figure 5) show the precise case distinctions and dependencies. In L, we check whether there is a sequence match (second case) or a structure match (third case); otherwise, we assign −∞ (first case). LR is analogous to L, only allowing to close a gap left of the structure or sequence match. For this purpose, we introduce an auxiliary matrix H, which does not need to be stored. The gap itself, computed in G _{A} and G _{AB}, allows skipping an arbitrary number of positions in both sequences. The recursion structure ensures that such a gap is introduced at most once per loop match and sequence. To avoid ambiguity, the recursion enforces to first skip positions in A (using G _{A}) and after that positions in B (using G _{AB}); furthermore we enforce a gap in the matchings computed via LR by its initialization.
We compute entries of D in increasing order with respect to their size so that when computing some D(i j,k l), any D(i ^{′}j ^{′},k ^{′}l ^{′}) with i<i ^{′}<j ^{′}<j and k<k ^{′}<l ^{′}<l is already computed.
Since EPMs are not necessarily closed by a base pair match (like the EPMs of D), we finally compute the matrix F. The entries F(j ^{′},l ^{′}), for 0≤j ^{′}≤n and 0≤l ^{′}≤m, denote the maximum score of a significant EPM of {A}_{1\mathrm{..}{j}^{\prime}} and {B}_{1\mathrm{..}{l}^{\prime}}, which ends at (j ^{′},l ^{′}), i.e. with {j}^{\prime}\phantom{\rule{0.3em}{0ex}}\sim \phantom{\rule{0.3em}{0ex}}{l}^{\prime}\in \mathcal{\mathcal{M}}. The recursion for F is almost identical to the recursion for L, except for the first case, which is 0 instead of −∞, since the EPMs in F can start at any point (similar to local sequence alignments). Also, since the matched base pairs in EPMs of F are external (i.e. they are not enclosed by some other base pair of the EPM), we do not perform checks for the second and third condition of significant EPMs (Def. 5).
Matrix initialization Matrix entries corresponding to matches of empty subsequences are initialized. Here, we take special care to disallow such matches for certain matrices (by assigning −∞).
{L}^{\mathit{\text{ijkl}}}(i,k)={G}_{\phantom{\rule{0.3em}{0ex}}\text{A}}^{\mathit{\text{ijkl}}}(i,k)={G}_{\phantom{\rule{0.3em}{0ex}}\text{AB}}^{\mathit{\text{ijkl}}}(i,k)=0 and L R ^{ijkl}(i,k)=−∞ (first matrix entry)
{L}^{\mathit{\text{ijkl}}}(i,{l}^{\prime})={G}_{\phantom{\rule{0.3em}{0ex}}\text{A}}^{\mathit{\text{ijkl}}}(i,{l}^{\prime})=L{R}^{\mathit{\text{ijkl}}}(i,{l}^{\prime})=\infty and {G}_{\phantom{\rule{0.3em}{0ex}}\text{AB}}^{\mathit{\text{ijkl}}}(i,{l}^{\prime})=0 for all l ^{′}>k (first matrix row)
{L}^{\mathit{\text{ijkl}}}({j}^{\prime},k)={G}_{\phantom{\rule{0.3em}{0ex}}\text{AB}}^{\mathit{\text{ijkl}}}({j}^{\prime},k)=L{R}^{\mathit{\text{ijkl}}}({j}^{\prime},k)=\infty and {G}_{\phantom{\rule{0.3em}{0ex}}\text{A}}^{\mathit{\text{ijkl}}}({j}^{\prime},k)=0 for all j ^{′}>i (first matrix column)
By initializing the LR matrix with −∞, we keep matchings represented by LR and L distinct (because in this way, finite LR entries have to be derived via G _{A} or G _{AB} entries, which enforces a gap).
The final matrix F is initialized by F(j ^{′},0)=F(0,l ^{′})=0 for all j ^{′},l ^{′}.
ExpaRNAP: suboptimal traceback & enumerating maximal EPMs
For enumerating only maximal EPMs during suboptimal traceback, we take special care that EPMs cannot be extended at the left or right end of gaps (G _{A} and G _{AB} matrices.) For strict EPMs this is decided independently of the other traced strict EPMs. It suffices to check whether the strict EPM can be extended into the gap matrices, i.e. whether a sequence or structure match is possible at the borders of the gap matrices.
However, the same does not work for relaxed EPMs, since while extending a relaxed EPM, the score might first decrease and then increase again (Figure 3). Therefore, we filter relaxed EPMs in two steps. First, we discard EPMs due to the same criterion as in the case of strict EPMs, checking for exact sequence or structure matches at the borders of the gap matrices. If an EPM cannot be discarded in this way, it is stored until all relaxed EPMs in the same D matrix are traced back. Only then, we compare the withheld relaxed EPMs of the same D matrix according to Def. 7.
Since we complete the whole traceback for a D matrix before tracing into its “enclosed” D matrices, we identify and remove all nonmaximal relaxed EPMs in an early stage of the traceback.
To enumerate all maximal EPMs, we start such tracebacks only from entries F(j ^{′},l ^{′}) that satisfy {A}_{{j}^{\prime}+1}\ne {B}_{{l}^{\prime}+1}. Due to Lemma 1, this condition is necessary and sufficient for strict EPMs.
Lemma 1.
Let \mathcal{P}=(\mathcal{\mathcal{M}},\mathcal{S}) be a maximal strict EPM of {A}_{1\mathrm{..}{j}^{\prime}} and {B}_{1\mathrm{..}{l}^{\prime}} with {j}^{\prime}\phantom{\rule{0.3em}{0ex}}\sim \phantom{\rule{0.3em}{0ex}}{l}^{\prime}\in \mathcal{\mathcal{M}}. is a maximal strict EPM of A and B, iff {A}_{{j}^{\prime}+1}\ne {B}_{{l}^{\prime}+1}.
Proof.
“ ⇒”: Let {A}_{{j}^{\prime}+1}={B}_{{l}^{\prime}+1}. Then {\mathcal{P}}^{\prime}:=(\mathcal{\mathcal{M}}\cup \{{j}^{\prime}+1\phantom{\rule{0.3em}{0ex}}\sim \phantom{\rule{0.3em}{0ex}}{l}^{\prime}+1\},\mathcal{S}) is a strict EPM with \mathcal{P}\u2291{\mathcal{P}}^{\prime}; hence is not maximal for A and B (i.e. among all strict EPMs of A and B). “ ⇐”: Let {A}_{{j}^{\prime}+1}\ne {B}_{{l}^{\prime}+1}.
Assume is not maximal for A and B. Then, there is a strict EPM {\mathcal{P}}^{\prime}=({\mathcal{\mathcal{M}}}^{\prime},{\mathcal{S}}^{\prime})\ne \mathcal{P} with \mathcal{P}\u2291{\mathcal{P}}^{\prime} that is not a strict EPM of {A}_{1\mathrm{..}{j}^{\prime}} and {B}_{1\mathrm{..}{l}^{\prime}}.
Consequently, to satisfy \mathcal{\mathcal{M}}\subset {\mathcal{\mathcal{M}}}^{\prime}, there has to exist \mathit{\text{ij}}\phantom{\rule{0.3em}{0ex}}\sim \phantom{\rule{0.3em}{0ex}}\mathit{\text{kl}}\in {\mathcal{S}}^{\prime} with i≤j ^{′}<j and k≤l ^{′}<l. However in this case, while clearly the parent of j ^{′} ∼ l ^{′} in is ψ, there is a parent of j ^{′} ∼ l ^{′} in {\mathcal{S}}^{\prime} different from ψ (i.e. either i j ∼ k l or some “smaller” base pair match). This contradicts \mathcal{P}\u2291{\mathcal{P}}^{\prime}, because and {\mathcal{P}}^{\prime} are not comparable by inclusion order (Def 6).
By the same argument, the forward direction holds for relaxed EPMs. Therefore, we enumerate all maximal relaxed EPMss by restricting the traceback in the same way. However, since the backward direction does not hold generally, this procedure can enumerate nonmaximal relaxed EPMs. In practice, we observe this very rarely; consequently, while redundant relaxed EPMs could be removed explicitly, we let the chaining procedure handle those EPMs.
ExpaRNAP: Sparsification
ExpaRNAP’s efficiency depends fundamentally on the sparsity of the DP matrices, which we leverage through fixed thresholds θ _{1},θ _{2}, and θ _{3}. Consequently, we compute all DP matrices in only O(n ^{2}) time and space. We compute matrices {L}^{\mathit{\text{ijkl}}},{G}_{\text{A}}^{\mathit{\text{ijkl}}},{G}_{\text{AB}}^{\mathit{\text{ijkl}}}, and L R ^{ijkl} only for base pairs (i,j) and (k,l) that are significant (i.e. Pr[(i,j)A]≥θ _{1} and Pr[(k,l)B]≥θ _{1}). Furthermore, we compute only relevant entries of these matrices.
This is best illustrated by the notion of candidates; each j ^{′} is a candidate of (i,j) in sequence A if it is either a significant singlestranded position within (i,j), i.e. Pr[j ^{′} ∈ loop(i,j)A]≥θ _{2}, or contained in a significant helix of (i,j), i.e. Pr[(i ^{′},j ^{′}) ∈ loop(i,j)A]≥θ _{3} for some i ^{′}. Analogously, we define candidates l ^{′}of (k,l) in sequence B. (For candidates l ^{′} holds Pr[l ^{′} ∈ loop(k,l)B]≥θ _{2} or Pr[(k ^{′},l ^{′}) ∈ loop(k,l)B]≥θ _{3} for some k ^{′}).
Theorem 1.
There are only O(n ^{2}) entries {L}^{\mathit{\text{ijkl}}}({j}^{\prime},{l}^{\prime}),{G}_{\text{A}}^{\mathit{\text{ijkl}}}({j}^{\prime},{l}^{\prime}), {G}_{\text{AB}}^{\mathit{\text{ijkl}}}({j}^{\prime},{l}^{\prime}), and L R ^{ijkl}(j ^{′},l ^{′}) such that j ^{′} is a candidate of (i,j) and l ^{′} is a candidate of (k,l). Consequently, ExpaRNAP has quadratic time and space complexity.
Proof sketch: By definition, only candidates j ^{′} or l ^{′} can be part of a significant EPM as defined in Def. 5; otherwise, we assign −∞ to L ^{ijkl}(j ^{′},l ^{′}) and L R ^{ijkl}(j ^{′},l ^{′}). Furthermore, in the latter case, we neither store nor compute the values for {G}_{\text{A}}^{\mathit{\text{ijkl}}}({j}^{\prime},{l}^{\prime}) and {G}_{\text{AB}}^{\mathit{\text{ijkl}}}({j}^{\prime},{l}^{\prime}). Due to these considerations, in the matrices {L}^{\mathit{\text{ijkl}}},L{R}^{\mathit{\text{ijkl}}},{G}_{\text{A}}^{\mathit{\text{ijkl}}}, and {G}_{\text{AB}}^{\mathit{\text{ijkl}}}, we skip each complete row or column whose index is no candidate. Consequently, after computing a mapping from candidate sequence positions to matrix positions — independently for each sequence and for all significant base pairs, the sparsified algorithm operates on “contracted” matrices that contain only the candidate rows and columns. The first threshold θ _{1} reduces the number of base pairs to a constant number of base pairs per sequence position; in total, quadratically many base pairs pass the filter. The thresholds on joint probabilities guarantee that each sequence position is candidate of only constantly many base pairs. In consequence, each position is considered only a constant number of times during the entire computation; this directly results in quadratic time complexity. (Full proof in Additional file 1: Sec. 2)
Chaining
Chaining selects a noncrossing and nonoverlapping subset of EPMs. Our algorithm generalizes the chaining of ExpaRNA [30]. The chaining algorithm recursively fills the holes of all EPMs with other EPMs. For this purpose, it fills one O(n ^{2}) matrix for each hole and takes O(H n ^{2}) time, where H is total number of holes with H≪n ^{2}. In contrast to ExpaRNA, there may exist more than one EPM ending at each sequence position pair, i.e. there is no onetoone correspondence between EPMs and EPM’s end positions. This is why each matrix requires additional steps in the order of the number of input EPMs E in ExpaRNAP’s chaining; the complexity of the generalized chaining algorithm is O(H·(n ^{2}+E)). Since in the most general case, when we enumerate all suboptimal EPMs up to a maximal difference to the optimal score, E∈O(n ^{2}) is not guaranteed, we implement in addition several ways to control the number of EPMs. For example, ExpaRNAP allows setting an ad hoc limit on this number. Furthermore, we suggest a heuristic strategy: for each sequence position pair, keep only the best EPM ending there. Consequently, typical use cases of ExpaRNAP maintain the chaining complexity of ExpaRNA, i.e. O(H n ^{2}).
Results and discussion
We implemented ExpaRNAP and the chaining algorithm in C++. In particular, we implemented two versions of the traceback: the suboptimal traceback and a heuristic version that, for each match i ∼ k, considers only the optimal EPM ending at that match. Our tool supports two ways to control the EPM enumeration by the suboptimal traceback: either by defining the maximum score difference to the optimal score or the maximum number of EPMs.
In order to assess the performance of ExpaRNAP, we designed the following pipeline: In a first step we compute the significant EPMs with ExpaRNAP and use the chaining algorithm to extract from these EPMs an optimal nonoverlapping and noncrossing subset. Then we compute a sequence structure alignment that includes all matches of the chained EPMs. For this purpose, we utilize the EPMs as anchor constraints for LocARNA. Consequently, LocARNA runs much faster, since each anchor reduces the alignment space. In correspondence with the analogous idea ExpLoc [30], which utilizes ExpaRNA anchors, we call our pipeline ExpLocP.
We performed all benchmarks over the pairwise alignment instances of the BRAliBase 2.1 benchmark set [44],[45]. To measure the quality of the calculated alignment in comparison to the (for each instance) known reference alignment, BRAliBase 2.1 [44] provides the scoring tool compalignp. It computes the similarity between the two alignments as sumofpairs score (SPS). Identical alignments receive the SPS score 1; alignments without any correspondence, 0. In this way, we evaluated different variants of our method and later compare them to existing tools. At the same time, we opposed quality to runtime.
Impact of EPM selection on the performance
We study five ExpLocP variants, where we generate anchor constraints respectively by

1)
heuristic traceback with exact matches

2)
heuristic traceback with inexact structure matches

3)
suboptimal traceback with exact matches

4)
suboptimal traceback with inexact structure matches

5)
suboptimal traceback with inexact structure matches and the additional second filter step
In particular, we compare exact modes (1,3), which follow the strict EPM definition, and inexact modes (2,4,5), which allow mismatches at structure positions (relaxed EPMs.) The score parameters were selected adhoc without parameter learning; in particular, we set the cutoff probabilities to restrictive values θ _{1}=θ _{2}=θ _{3}=0.01 to predict less false positives. Furthermore, we enumerated EPMs that have a score of at least 90 and fix the maximal number of traced EPMs in the suboptimal traceback to 100. The scoring – as defined in Eq. 1 and 2 was instantiated by setting the structure mismatch score str_mm to −10 for structure mismatches in inexact modes. Furthermore we set α _{1}=1,α _{2}=5 and α _{3}=5 in order to favor structured regions. In addition to SPS and runtime, we computed the coverage for each benchmark instance – consisting of sequences A and B. For this purpose, we define coverage as the fraction of nucleotides that are matched by the best chain of EPMs \mathcal{C}=\bigcup (\mathcal{\mathcal{M}},\mathcal{S}):
Figure 6A shows the alignment quality (SPS) versus the sequence identity; we visualized the dependency by estimating a LOWESS curve [46] for each series of benchmark evaluations. Overall, we observed that the difference between the suboptimal and heuristic traceback is not significant, solely for inexact modes, the suboptimal traceback leads to slightly better results. Furthermore, in inexact modes the additional second filter step did not change the quality significantly. Exact modes produced better alignments, however these modes generated much less anchor constraints for low sequence identity regions; in turn, the speedup decreases in these modes. This effect is visible in Figure 6B, which plots the estimated coverage vs. the sequence identity. The exact modes predict EPMs only for sequence identity values above 60%. For the inexact modes, we obtained much higher coverage; notably, we predicted many more relaxed EPMs than strict EPMs for the sequence identity interval from 4060%.
In Table 1, we report total runtimes and average SPS scores of different ExpLocP variants over the entire benchmark set. Furthermore, we provide single timings for preprocessing (first value in brackets), computing and chaining the EPMs (second value), and subsequent LocARNA alignments (third value). The differences in coverage directly impact the runtimes of the different variants, but not as pronounced, since – like one would expect for many real world applications – the benchmark set contains many high identity sequences. Consequently, relaxed EPMs significantly reduced the runtime for instances with sequence identity between 5080% (Additional file 1: Figure S3A). Furthermore, the heuristic traceback was slightly faster than the suboptimal one for long RNA sequences (Additional file 1: Figure S3B), while suboptimal traceback could not significantly improve the alignment quality in this setting. Consequently, for this specific benchmark, the two variants with heuristic traceback turned out to provide the best balance of quality and speedup.
Comparison to other tools
We benchmarked three existing approaches: LocARNA, ExpLoc [30], and RAF [29]. LocARNA without anchors serves as base line approach; in contrast to ExpLocP, ExpLoc identifies EPMs in a single predicted structure for each RNA (using ExpaRNA); and RAF is currently the fastest Sankoffstyle alignment approach due to its heuristic filtering based on sequence alignments. We compared these approaches to ExpLocP variants 1 and 2, which performed best in the previous section.
Table 2 summarizes the results; we report the speedup over LocARNA, total runtime, and average alignment quality (SPS) across the entire benchmark set (Opteron 2356, 2.3 GHz, single thread). Figure 7A shows the behavior of the compalignp score dependent on the sequence identity. LocARNA aligned with the best quality at the expense of the highest computation time. The best alignment quality that has been obtained with ExpLoc in [30] has been achieved with parameter minsize=10. Even this quality is significantly lower than the one for the two variants of ExpLocP (0.81 vs. 0.84 and 0.86). Moreover, the overall speedup for this setting is not much higher than the speedups for ExpLocP. Although RAF achieved the best speedup of 14.4, the quality drops tremendously for sequence similarities below 50%.
The quality drop of RAF alignments at low sequence identities is strongly reminiscent of pure sequence alignment methods. Thus, we conjecture that the specific use of sequencebased heuristics by RAF, while guaranteeing sequence alignment like runtime behavior, compromises RAF’s use for ‘hard’ RNA alignment instances that require structurebased alignment methods.
Furthermore, we investigated the dependency of the lengths of the input sequences on the speedup (see Figure 7B). As expected, the speedup increased for longer input sequences. For RNA sequences longer than 150 bases, we obtained a significantly better speedup with both variants of ExpLocP compared to ExpLoc. Moreover, the speedup difference increases with the lengths of the input sequences (Additional file 1: Figure S4 provides a detailed comparison of ExpLocP variants 1 and 2). For the longest input sequences, ExpLocP achieved respective speedups of 32 and 35 for variants 1 and 2, and RAF of almost 50.
To summarize, ExpLocP provided the best tradeoff between alignment quality and speedup in this setting; robustly, it maintained high alignment quality over the entire range of sequence identities; finally, it proofed to be particularly suited for long instances.
Conclusion
We have introduced the algorithm ExpaRNAP that very efficiently identifies exact pattern matches in RNAs by matching and folding them simultaneously. The method is a major achievement over previous approaches (including the “predecessor” ExpaRNA) that – without being more efficient – are much less flexible, since they require a priori known or unreliably predicted structure.
Due to its novel ensemblebased sparsification, the algorithm ExpaRNAP has only a very low (quadratic) time and space complexity, equalling sequence alignment. This sparsification technique is particularly relevant, since similar techniques can likely be applied to other RNA analysis methods.
We have developed two major variants of this method; one requires strict matches in all positions of an EPM (strict EPMs), the other relaxes this (therefore, relaxed EPMs) to allow mismatches at structural positions. The latter supports compensatory mutations, which are highly relevant in RNA structure analysis in general.
Our benchmarks study EPMs as anchor constraints to speed up RNA structure alignments (in the form of simultaneous alignment and folding by LocARNA). EPMs from structure ensembles have turned out to be substantially more reliable than EPMs from fixed structures. At comparable speed ups, this results in increased quality. Most importantly, the novel approach keeps up the alignment quality even for sequences of low identity, which is ultimately decisive for structure alignment. In striking contrast, the alignment quality of the similarly fast alignment tool RAF breaks down – very much like pure sequence alignment.
We have implemented rigorous suboptimal traceback, which provides extensive control of the set of enumerated EPMs. For example, this level of control is required in the analysis of structural variants common to the RNAs. In addition, we have developed a heuristic traceback, which performs almost indistinguishable in our benchmark. Being much faster than the rigorous method, it offers the best speedquality balance in such settings.
Finally, we conjecture that EPMbased anchor constraints can be combined advantageously with other RNA alignment tools such as RAF. While for LocARNA the constraints yield a considerable speedup, the combination with RAF has the potential to improve RAF’s poor alignment quality for low sequence similarity.
Additional file
References
The FANTOM Consortium: The transcriptional landscape of the mammalian genome . Science. 2005, 309 (5740): 155963. 10.1126/science.1112014.
Cheng J, Kapranov P, Drenkow J, Dike S, Brubaker S, Patel S, Long J, Stern D, Tammana H, Helt G, Sementchenko V, Piccolboni A, Bekiranov S, Bailey DK, Ganesh M, Ghosh S, Bell I, Gerhard DS, Gingeras TR: Transcriptional maps of 10 human chromosomes at 5nucleotide resolution . Science. 2005, 308: 11491154. 10.1126/science.1108625.
Bertone P, Stoc V, Royce TE, Rozowsky JS, Urban AE, Zhu X, Rinn JL, Tongprasit W, Samanta M, Weissman S, Gerstein M, Snyder M: Global identification of human transcribed sequences with genome tiling arrays . Science. 2004, 306: 22422246. 10.1126/science.1103388.
The ENCODE Project Consortium: An integrated encyclopedia of DNA elements in the human genome . Nature. 2012, 489 (7414): 5774. 10.1038/nature11247.
Mattick JS, Taft RJ, Faulkner GJ: A global view of genomic information  moving beyond the gene and the master regulator . Trends Genet. 2010, 26 (1): 218. 10.1016/j.tig.2009.11.002.
Bompfünewerer Consortium AF, Backofen R, Bernhart SH, Flamm C, Fried C, Fritzsch G, Hackermüller J, Hertel J, Hofacker IL, Missal K, Mosig A, Prohaska SJ, Rose D, Stadler PF, Tanzer A, Washietl S, Will S: RNAs everywhere: genomewide annotation of structured RNAs . J Exp Zoolog B Mol Dev Evol. 2007, 308 (1): 125. 10.1002/jez.b.21130.
Smith MA, Gesell T, Stadler PF, Mattick JS: Widespread purifying selection on RNA structure in mammals . Nucleic Acids Res. 2013, 41 (17): 822036. 10.1093/nar/gkt596. doi:10.1093/nar/gkt596,
Rivas E, Eddy SR: Noncoding RNA gene detection using comparative sequence analysis . BMC Bioinformatics. 2001, 2 (1): 810.1186/1471210528.
Coventry A, Kleitman DJ, Berger B: MSARI: multiple sequence alignments for statistical detection of RNA secondary structure . Proc Natl Acad Sci USA. 2004, 101 (33): 121027. 10.1073/pnas.0404193101.
Pedersen JS, Bejerano G, Siepel A, Rosenbloom K, LindbladToh K, Lander ES, Kent J, Miller W, Haussler D: Identification and Classification of Conserved RNA Secondary Structures in the Human Genome . PLoS Comput Biol. 2006, 2 (4): 3310.1371/journal.pcbi.0020033.
Washietl S, Hofacker IL: Identifying structural noncoding RNAs using RNAz . Curr Protoc Bioinformatics. 2007, 19: 12.7.112.7.18.
Will S, Yu M, Berger B: Structurebased wholegenome realignment reveals many novel noncoding RNAs . Genome Res. 2013, 23 (6): 10181027. 10.1101/gr.137091.111.
Will S, Reiche K, Hofacker IL, Stadler PF, Backofen R: Inferring noncoding RNA families and classes by means of genomescale structurebased clustering . PLOS Comput Biol. 2007, 3 (4): 6510.1371/journal.pcbi.0030065.
Kaczkowski B, Torarinsson E, Reiche K, Havgaard JH, Stadler PF, Gorodkin J: Structural profiles of human miRNA families from pairwise clustering . Bioinformatics. 2009, 25 (3): 2914. 10.1093/bioinformatics/btn628.
Parker BJ, Moltke I, Roth A, Washietl S, Wen J, Kellis M, Breaker R, Pedersen JS: New families of human regulatory RNA structures identified by comparative analysis of vertebrate genomes . Genome Res. 2011, 21 (11): 192943. 10.1101/gr.112516.110. doi:10.1101/gr.112516.110,
Burge SW, Daub J, Eberhardt R, Tate J, Barquist L, Nawrocki EP, Eddy SR, Gardner PP, Bateman A: Rfam 11.0: 10 years of RNA families . Nucleic Acids Res. 2013, 41 (Database issue): 22632. 10.1093/nar/gks1005.
Sankoff D: Simultaneous solution of the RNA folding, alignment and protosequence problems . SIAM J Appl Math. 1985, 45 (5): 810825. 10.1137/0145048.
Gorodkin J, Heyer L, Stormo G: Finding the most significant common sequence and structure motifs in a set of RNA sequences . Nucleic Acids Res. 1997, 25 (18): 372432. 10.1093/nar/25.18.3724.
Mathews DH, Turner DH: Dynalign: an algorithm for finding the secondary structure common to two RNA sequences . J Mol Biol. 2002, 317 (2): 191203. 10.1006/jmbi.2001.5351.
Holmes I: Accelerated probabilistic inference of RNA structure evolution . BMC Bioinformatics. 2005, 6: 7310.1186/14712105673. doi:10.1186/14712105673,
Havgaard JH, Lyngso RB, Stormo GD, Gorodkin J: Pairwise local structural alignment of RNA sequences with sequence similarity less than 40% . Bioinformatics. 2005, 21 (9): 181524. 10.1093/bioinformatics/bti279.
Dowell RD, Eddy SR: Efficient pairwise RNA structure prediction and alignment using sequence alignment constraints . BMC Bioinformatics. 2006, 7: 40010.1186/147121057400.
Bradley RK, Pachter L, Holmes I: Specific alignment of structured RNA: stochastic grammars and sequence annealing . Bioinformatics. 2008, 24 (23): 267783. 10.1093/bioinformatics/btn495.
Harmanci AO, Sharma G, Mathews DH: Efficient pairwise RNA structure prediction using probabilistic alignment constraints in Dynalign . BMC Bioinformatics. 2007, 8: 13010.1186/147121058130. doi:10.1186/147121058130,
Hofacker IL, Bernhart SH, Stadler PF: Alignment of RNA base pairing probability matrices . Bioinformatics. 2004, 20 (14): 22227. 10.1093/bioinformatics/bth229.
McCaskill JS: The equilibrium partition function and base pair binding probabilities for RNA secondary structure . Biopolymers. 1990, 29 (67): 110519. 10.1002/bip.360290621.
Torarinsson E, Havgaard JH, Gorodkin J: Multiple structural alignment and clustering of RNA sequences . Bioinformatics. 2007, 23 (8): 92632. 10.1093/bioinformatics/btm049.
Bauer M, Klau GW, Reinert K: Accurate multiple sequencestructure alignment of RNA sequences using combinatorial optimization . BMC Bioinformatics. 2007, 8: 27110.1186/147121058271.
Do CB, Foo CS, Batzoglou S: A maxmargin model for efficient simultaneous alignment and folding of RNA sequences . Bioinformatics. 2008, 24 (13): 6876. 10.1093/bioinformatics/btn177.
Heyne S, Will S, Beckstette M, Backofen R: Lightweight comparison of RNAs based on exact sequencestructure matches . Bioinformatics. 2009, 25 (16): 20952102. 10.1093/bioinformatics/btp065.
Backofen R, Siebert S: Fast detection of common sequence structure patterns in RNAs . J Discrete Algorithms. 2007, 5 (2): 212228. 10.1016/j.jda.2006.03.015.
Höchsmann M, Töller T, Giegerich R, Kurtz S: Local similarity in RNA secondary structures . Proceedings of Computational Systems Bioinformatics (CSB 2003). Volume 2 . 2003, IEEE Computer Society, Washington, 159168.
Siebert S, Backofen R: MARNA: multiple alignment and consensus structure prediction of RNAs based on sequence structure comparisons . Bioinformatics. 2005, 21 (16): 33529. 10.1093/bioinformatics/bti550.
Amit M, Backofen R, Heyne S, Landau G. M, Möhl M, Otto C, Will S: Local exact pattern matching for nonfixed RNA structures . IEEE/ACM Trans Comput Biol Bioinformatics. 2014, 11: 112. 10.1109/TCBB.2013.2297113.
Schmiedl C, Möhl M, Heyne S, Amit M, Landau G. M, Will S, Backofen R: Exact pattern matching for RNA structure ensembles . Proceedings of the 16th International Conference on Research in Computational Molecular Biology (RECOMB 2012). LNCS, Volume 7262 . 2012, Springer, Berlin Heidelberg, 245260.
Wexler Y, Zilberstein C, ZivUkelson M: A study of accessible motifs and RNA folding complexity . J Comput Biol. 2007, 14 (6): 85672. 10.1089/cmb.2007.R020.
ZivUkelson M, GatViks I, Wexler Y, Shamir R: A faster algorithm for RNA cofolding . WABI 2008. Lecture Notes in Computer Science. Volume 5251 . Edited by: Crandall KA, Lagergren J. 2008, Springer, Berlin Heidelberg, 174185.
Backofen R, Tsur D, Zakov S, ZivUkelson M: Sparse RNA folding: Time and space efficient algorithms . Proc. 20th Symp. Combinatorial Pattern Matching. LNCS, Volume 5577 . Edited by: Kucherov G, Ukkonen E. 2009, Springer, Berlin Heidelberg, 249262.
Time and space efficient RNARNA interaction prediction via sparse folding . 2010, Springer, Berlin Heidelberg
Will S, Joshi T, Hofacker IL, Stadler PF, Backofen R: LocARNAP: Accurate boundary prediction and improved detection of structural RNAs . RNA. 2012, 18 (5): 90014. 10.1261/rna.029041.111.
Backofen R, Will S: Local sequencestructure motifs in RNA . J Bioinformatics Comput Biol (JBCB). 2004, 2 (4): 681698. 10.1142/S0219720004000818.
Otto W, Will S, Backofen R: Structure local multiple alignment of RNA . Proceedings of German Conference on Bioinformatics (GCB’2008). Lecture Notes in Informatics (LNI), Volume P136 . 2008, Bonn, Gesellschaft für Informatik (GI), 178188.
Bompfünewerer AF, Backofen R, Bernhart SH, Hertel J, Hofacker IL, Stadler PF, Will S: Variations on RNA folding and alignment: lessons from Benasque . J Math Biol. 2008, 56 (12): 129144. 10.1007/s0028500701075.
Wilm A, Mainz I, Steger G: An enhanced RNA alignment benchmark for sequence alignment programs . Algorithms Mol Biol. 2006, 1: 1910.1186/17487188119.
Gardner PP, Wilm A, Washietl S: A benchmark of multiple sequence alignment programs upon structural RNAs . Nucleic Acids Res. 2005, 33 (8): 24339. 10.1093/nar/gki541.
Cleveland WS: Lowess: A program for smoothing scatterplots by robust locally weighted regression . Am Stat. 1981, 35: (54)10.2307/2683591.
Acknowledgements
This work was partially supported by the German Research Foundation (BA 2168/33 and MO 2402/11), the German Federal Ministry of Education and Research (BMBF, grant 0316165A e:Bio RNAsys to RB), the National Science Foundation (Award 0904246 to GML), the Israel Science Foundation (grant 347/09 to GML), and the United StatesIsrael Binational Science Foundation (BSF) and DFG (grant 2008217 to GML). We thank the anonymous reviewers for their help to improve the paper. Finally, we acknowledge support from the German Research Foundation (DFG) and Leipzig University within the program of Open Access Publishing.
Author information
Authors and Affiliations
Corresponding author
Additional information
Competing interests
The authors declare that they have no competing interests.
Authors’ contributions
MM and SW initiated the project and developed the ExpaRNAP core algorithm; CO, RB, SH, MA, and GML made various contributions to the final algorithm, the ExpLocP pipeline, and the presentation of algorithms. SH implemented the generalized chaining algorithm. CO implemented the other algorithms (making use of the LocARNA library by SW) and, in the course, identified and solved many detailed technical and algorithmic aspects. CO and SW wrote major parts of the manuscript, where RB significantly contributed to the introduction. CO, SH, RB, and SW designed experiments, which CO performed. RB and SW furthermore contributed with guidance and supervision. All authors read and approved the final manuscript.
Electronic supplementary material
12859_2014_404_MOESM1_ESM.pdf
Additional file 1: Supporting Information. Supplementary document containing details about precomputing joint loop probabilities (including a complexity analysis), the proof of ExpaRNAP’s complexity and supplementary results. (PDF 410 KB)
Authors’ original submitted files for images
Below are the links to the authors’ original submitted files for images.
Rights and permissions
Open Access This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made.
The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder.
To view a copy of this licence, visit https://creativecommons.org/licenses/by/4.0/.
The Creative Commons Public Domain Dedication waiver (https://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data.
About this article
Cite this article
Otto, C., Möhl, M., Heyne, S. et al. ExpaRNAP: simultaneous exact pattern matching and folding of RNAs. BMC Bioinformatics 15, 404 (2014). https://doi.org/10.1186/s1285901404040
Received:
Accepted:
Published:
DOI: https://doi.org/10.1186/s1285901404040
Keywords
 RNA bioinformatics
 Structurebased comparison of RNA
 Sparsification