Fast online and indexbased algorithms for approximate search of RNA sequencestructure patterns
 Fernando Meyer^{1},
 Stefan Kurtz^{1} and
 Michael Beckstette^{1}Email author
DOI: 10.1186/1471210514226
© Meyer et al.; licensee BioMed Central Ltd. 2013
Received: 27 February 2013
Accepted: 11 July 2013
Published: 17 July 2013
Abstract
Background
It is well known that the search for homologous RNAs is more effective if both sequence and structure information is incorporated into the search. However, current tools for searching with RNA sequencestructure patterns cannot fully handle mutations occurring on both these levels or are simply not fast enough for searching large sequence databases because of the high computational costs of the underlying sequencestructure alignment problem.
Results
We present new fast indexbased and online algorithms for approximate matching of RNA sequencestructure patterns supporting a full set of edit operations on single bases and base pairs. Our methods efficiently compute semiglobal alignments of structural RNA patterns and substrings of the target sequence whose costs satisfy a userdefined sequencestructure edit distance threshold. For this purpose, we introduce a new computing scheme to optimally reuse the entries of the required dynamic programming matrices for all substrings and combine it with a technique for avoiding the alignment computation of nonmatching substrings. Our new indexbased methods exploit suffix arrays preprocessed from the target database and achieve running times that are sublinear in the size of the searched sequences. To support the description of RNA molecules that fold into complex secondary structures with multiple ordered sequencestructure patterns, we use fast algorithms for the local or global chaining of approximate sequencestructure pattern matches. The chaining step removes spurious matches from the set of intermediate results, in particular of patterns with little specificity. In benchmark experiments on the Rfam database, our improved online algorithm is faster than the best previous method by up to factor 45. Our best new indexbased algorithm achieves a speedup of factor 560.
Conclusions
The presented methods achieve considerable speedups compared to the best previous method. This, together with the expected sublinear running time of the presented indexbased algorithms, allows for the first time approximate matching of RNA sequencestructure patterns in large sequence databases. Beyond the algorithmic contributions, we provide with RaligNAtor a robust and well documented opensource software package implementing the algorithms presented in this manuscript. The RaligNAtor software is available at http://www.zbh.unihamburg.de/ralignator.
Background
Due to their participation in several important molecularbiological processes, ranging from passive carriers of genetic information (tRNAs) over regulatory functions (microRNAs) to proteinlike catalytic activities (Riboswitsches), noncoding RNAs (ncRNAs) are of central research interest in molceular biology [1]. NcRNAs, although synthesized as singlestranded molecules, present surprising complexity by being able to base pair with themselves and fold into numerous different structures. It is to a large extent the structure that allows them to interact with other molecules and hence to carry out various biological functions. This can also be observed in families of functionally related ncRNAs like the ones compiled in the Rfam database [2]. Here members of a family often share only few sequence features, but share by far more specific structural and functional properties. Consequently, methods for effective RNA homology search (i.e. finding new members of an RNA family) cannot rely on sequence similarity alone, but also have to take structural similarity into account.
In this paper, we address the problem of searching nucleotide databases for occurrences of RNA family members. Since for this task it is not sufficient to rely on pure sequence alignment, we briefly review search methods that employ sequence and structure information.
There exist various general sequencestructure alignment tools which determine structural similarities that are too diverse to be alignable at the sequence level. Such tools can roughly be divided into two classes. The first class consists of tools that align RNAs with given structures or determine a common structure during the alignment process. Tools like MARNA[3] and RNAforester[4] require an a priori known secondary structure for both input RNAs. However, they suffer from the low quality of secondary structure prediction. Addressing this problem, other tools implement variations of the Sankoff algorithm [5], which provides a general but computationally demanding solution to the problem of simultaneously computing an alignment and the common secondary structure of the two aligned sequences. Unfortunately, even tools with improved running times using variations of this algorithm (LocARNA[6], Foldalign[7, 8], Dynalign[9, 10]) or heuristics [11] are simply not fast enough for rapid searches in large nucleotide databases. Hence, in a second class we identify more specialized tools for searching RNA families in nucleotide databases. These tools use a model or motif descriptors (i.e. patterns) defining consensus sequence and secondary structure properties of the families to be searched for. For example, Infernal[12] and RSEARCH[13] infer a covariance model from a given multiple sequence alignment annotated with structure information. This model can then be used to search sequence databases for new family members. Another tool, ERPIN[14] is also based on automatically generated statistical secondary profiles. Although being very sensitive in RNA homology search, in particular Infernal and RSEARCH suffer from high computational demands. An alternative are tools like RNAMotif[15], RNAMOT[16], RNABOB[17], RNAMST[18], PatScan[19], PatSearch[20], or Palingol[21]. These methods use userdefined motif descriptors created from a priori knowledge about the secondary structure of the described RNA family. Another tool, Locomotif[22], generates a thermodynamic matcher program from a pattern drawn interactively by the user via a graphical interface. Although these tools based on motif descriptors are faster than the previously mentioned tools, they have a running time that scales at least linearly with the size of the target sequence database. This makes their application to large databases challenging. Previously, we addressed this problem by presenting Structator[23], an ultra fast indexbased bidirectional matching tool that achieves sublinear running time by exploiting base pair complementarity constraints for search space reduction.
Apart from running time constraints, another major disadvantage of all current tools that search for sequencestructure patterns is their limited capacity to find approximate matches to the patterns. Although variability in length of pattern elements is often allowed, this is constrained to certain pattern positions that must be specified by the user. This limitation also holds for our Structator tool. Also, variations (insertions, deletions, or replacements) in the sequence that lead to small structural changes, such as the breaking of a base pair, are not supported. This often hampers the creation of patterns that are specific but generalized enough to match all family members. An algorithm presented in [24] only partially alleviates this problem by finding approximate matches of a helix in a genome allowing edit operations on single bases, but not on the structure.
To overcome these issues, we present new fast indexbased and online algorithms for approximate matching of sequencestructure patterns, all implemented in an easytouse software package. Given one or more patterns describing any (branching, noncrossing) RNA secondary structure, our algorithms compute alignments of the complete patterns to substrings of the target sequence, i.e. semiglobal alignments, taking sequence and structure into account. For this, they apply a full set of edit operations on single bases and base pairs. Matches are reported for alignments whose sequencestructure edit cost and number of insertions and deletions do not exceed userdefined thresholds. Our most basic algorithm is a scanning variant of the dynamic programming algorithm for global pairwise sequencestructure alignment of Jiang et al.[25], for which no implementation was available. Because its running time is too large for database searches on a large scale, we present accelerated online and indexbased algorithms. All our new algorithms profit from a new computing scheme to optimally reuse the required dynamic programming matrices and a technique to save computation time by determining as early as possible whether a substring of the target sequence can contain a match. In addition, our indexbased algorithms employ the suffix array data structure compiled from the search space. This further reduces the running time.
As in [23], we also support the description of an RNA molecule by multiple ordered sequencestructure patterns. In this way, the molecule’s secondary structure is decomposed into a sequence of substructures described by independent sequencestructure patterns. These patterns are efficiently aligned to the target sequences using one of our new algorithms and the results are combined with fast global and local chaining algorithms [23, 26]. This allows a better balancing of running time, sensitivity, and specificity compared to searching with a single long pattern describing the complete sequence and secondary structure.
Before we describe our algorithms, we formalize the approximate search problem with the involved sequencestructure edit operations. Then we present, step by step, two efficient online and two indexbased matching algorithms. We proceed with a short review of the approach for computing chains of matches. Finally, we present several benchmark experiments.
Methods
Preliminaries
An RNA sequence S of length n = S over the set of bases $\mathcal{A}=\{\text{A, C, G, U}\}$ is a juxtaposition of n bases from $\mathcal{A}$. S[i], 1 ≤ i ≤ n, denotes the base of S at position i. Let ε denote the empty sequence, the only sequence of length 0. By ${\mathcal{A}}^{n}$ we denote the set of sequences of length n ≥ 0 over $\mathcal{A}$. The set of all possible sequences over $\mathcal{A}$ including the empty sequence ε is denoted by ${\mathcal{A}}^{\ast}$.
For a sequence S = S[1] S[2] … S[n] and 1 ≤ i ≤ j ≤ n, S[i..j] denotes the substring S[i] S[i + 1] … S[j] of S. For S = u v, u and $v\in {\mathcal{A}}^{\ast}$, u is a prefix of S, and v is a suffix of S. The kth suffix of S starts at position k, while the kth prefix of S ends at k. For 1 ≤ k ≤ n, S_{ k } denotes the kth suffix of S. For stating the space requirements of our index structures, we assume that S < 2^{32}, so that sequence positions and lengths can be stored in 4 bytes.
The secondary structure of an RNA molecule is formed by WatsonCrick pairing of complementary bases and also by the slightly weaker wobble pairs. We say that two bases $(c,d)\in \mathcal{A}\times \mathcal{A}$ are complementary and can form a base pair if and only if $(c,d)\in \mathcal{C}=\{\text{(A, U), (U, A), (C, G), (G, C), (G, U), (U, G)}\}$. If two bases a and b form a base pair we also say that there exists an arc between a and b. A noncrossing RNA structure$\hat{R}$of length m is a set of base pairs (i,j), 1 ≤ i < j ≤ m, stating that the base at position i pairs with the base at position j, such that for all $(i,j),({i}^{\prime},{j}^{\prime})\in \hat{R}:i<{i}^{\prime}<{j}^{\prime}<j$ or i^{′} < i < j < j^{′} or i < j < i^{′} < j^{′} or i^{′} < j^{′} < i < j. A standard notation for $\hat{R}$ is a structure string R over the alphabet {.,(,)} such that for each base pair $(i,j)\in \hat{R}$, R[i] = ( and R[j] = ), and R[r] = . for positions r, 1 ≤ r ≤ m, that do not occur in any base pair of $\hat{R}$, i.e. r ≠ i and r ≠ j for all $(i,j)\in \hat{R}$.
Let Φ = {R, Y, M, K, W, S, B, D, H, V, N} be a set of characters. According to the IUPAC definition, each character in Φ denotes a specific character class $\phi (x)\subseteq \mathcal{A}$. Each character $x\in \mathcal{A}$ can be seen as a character class φ(x) = {x} of exactly one element. A sequence pattern is a sequence $P\in {(\mathcal{A}\cup \mathrm{\Phi})}^{\ast}$. An RNA sequencestructure pattern (RSSP)$\mathcal{Q}=(P,R)$ of length m is a pair of a sequence pattern P and a structure string R, both of length m. With $\mathcal{Q}[\mathit{\text{i..j}}]$ we denote the RSSP region (P[i..j],R[i..j]).
Approximate matching of RNA sequencestructure patterns
To find in a long RNA sequence S approximate matches of an RSSP $\mathcal{Q}$ describing a part of an RNA molecule, we compute alignments of the complete $\mathcal{Q}$ and substrings of S considering edit operations for unpaired bases and base pairs. That is, we compute semiglobal alignments simultaneously obtaining the sequencestructure edit distance of $\mathcal{Q}$ and substrings of S.
We define the alignment of $\mathcal{Q}$ and a substring S[p..q], 1 ≤ p ≤ q ≤ n, as set A = A_{match} ⊎ A_{gap}. The set A_{match} ⊆ [1..m] × [p..q] of match edges satisfies that, for all different (k,l),(k^{′},l^{′}) ∈ A_{match}, k > k^{′} implies l > l^{′}. The set A_{gap} of gap edges is defined as $\{(x,)x\in [1\mathit{\text{..m}}]\wedge \nexists y,(x,y)\in {A}_{\text{match}}\}\cup \{(,y)y\in [\mathit{\text{p..q}}]\wedge \nexists x,(x,y)\in {A}_{\text{match}}\}$. See Figure 1 for an example of a semiglobal alignment and associated alignment edges. The alignment cost is based on a sequencestructure edit distance. The allowed edit operations on unpaired bases P[k] and S[l], 1 ≤ k ≤ m, p ≤ l ≤ q, are base mismatch (match), with cost ω_{m} (zero), which occurs if there is an edge (k,l) ∈ A_{match} and S[l] ∉ φ(P[k]) (S[l] ∈ φ(P[k])), and base deletion (insertion), with cost ω_{d}, which occurs if (k,−) ∈ A_{gap} ((−,l) ∈ A_{gap}). The possible edit operations on base pairs were first introduced by Jiang et al.[25] and are defined as follows. Let (k_{1},k_{2}) be a base pair in $\hat{R}$ and l_{1} and l_{2}, p ≤ l_{1} < l_{2} ≤ q, be positions in S.

An arc breaking, with cost ω_{b}, occurs if (k_{1},l_{1}) ∈ A_{match} and (k_{2},l_{2}) ∈ A_{match} but bases S[l_{1}] and S[l_{2}] are not complementary. An additional base mismatch cost ω_{m} is caused if S[l_{1}] ∉ φ(P[k_{1}]) and another if S[l_{2}] ∉ φ(P[k_{2}]). To give an example, consider the semiglobal alignment in Figure 1. RSSP $\mathcal{Q}$ contains base pair $(5,9)\in \hat{R}$ and there exist edges (5,11) ∈ A_{match} and (9,16) ∈ A_{match} but S[11] = G and S[16] = G are not complementary. We note a difference between our definition and the definition of Jiang et al., where both aligned sequences are annotated with structure information. There, an arc breaking occurs if bases S[l_{1}] and S[l_{2}] are annotated as unpaired in addition to the condition of existing edges (k_{1},l_{1}) ∈ A_{match} and (k_{2},l_{2}) ∈ A_{match}. Hence, because in our case sequence S has no structure annotation, our definition is based on the complementarity of bases S[l_{1}] and S[l_{2}].

An arc altering, with cost ω_{a}, occurs if either (1) (k_{1},l_{1}) ∈ A_{match} and (k_{2},−) ∈ A_{gap} or (2) (k_{2},l_{2}) ∈ A_{match} and (k_{1},−) ∈ A_{gap}. Each case induces an additional base mismatch cost ω_{m} if S[l_{1}] ∉ φ(P[k_{1}]) or S[l_{2}] ∉ φ(P[k_{2}]). As an example, observe in the alignment shown in Figure 1 that there exist a base pair$(11,16)\in \hat{R}$ and edges (11,−) ∈ A_{gap} and (16,21) ∈ A_{match}.

An arc removing, with cost ω_{r}, occurs if (k_{1},−) ∈ A_{gap} and (k_{2},−) ∈ A_{gap}. As an example, observe in the alignment in Figure 1 that there exist a base pair$(3,19)\in \hat{R}$and edges (3,−) ∈ A_{gap} and (19,−) ∈ A_{gap}.
An alignment A of minimum cost between$\mathcal{Q}$ and S[p..q] is an optimal alignment of$\mathcal{Q}$ and S[p..q].
In practice, one is often interested in finding substrings of an RNA sequence S having a certain degree of similarity to a given RSSP$\mathcal{Q}$ on both the sequence and structure levels. Therefore, we are only concerned about optimal alignments of$\mathcal{Q}$ and substrings S[p..q] with up to a userdefined sequencestructure edit distance and a limited number of allowed insertions and deletions (indels). More precisely:

the cost$\mathit{\text{dist}}(\mathcal{Q},S[\mathit{\text{p..q}}])$ should not exceed a given threshold$\mathcal{K}$, and

the number of indels in the alignment should be at most d.
We call every substring S[p..q] satisfying Equation (3) a match of$\mathcal{Q}$ in S. In the subsequent sections we present algorithms for searching for matches of an RSSP$\mathcal{Q}$ in a sequence S.
Online approximate RNA database search for RSSPs: ScanAlign
A straightforward algorithm to search for approximate matches of an RSSP$\mathcal{Q}$ in an RNA sequence S consists of sliding a window of length m^{′} = m + d along S while computing$\mathit{\text{dist}}(\mathcal{Q},S[\mathit{\text{p..q}}])$ for 1 ≤ p ≤ q ≤ n and q − p + 1 = m^{′}. We note that, although the length of a match can vary in the range m − d to m + d, to find matches of all possible lengths it suffices to slide a window of length m^{′} along S corresponding to substrings S[p..q]. This holds because the alignment to a window of length m^{′} entails all possible alignments with up to d allowed indels. In the following we present a dynamic programming algorithm computing$\mathit{\text{dist}}(\mathcal{Q},S[\mathit{\text{p..q}}])$ for every window S[p..q]. Our recurrences are derived from the algorithm for global pairwise sequencestructure alignment of Jiang et al.[25], i.e. an algorithm for aligning sequences of similar lengths. Although Jiang’s algorithm supports the sequencestructure edit operations described above, we emphasize that it is not suitable for computing semiglobal alignments, which is what we are interested in.
 1.For computing base match and mismatch costs for positions i and j of the RSSP $\mathcal{Q}=(P,R)$ and substring T, respectively, we define a function $\chi :\mathbb{N}\times \mathbb{N}\to \{0,1\}$ as:$\begin{array}{l}\chi (i,j)=\left\{\begin{array}{lll}0& \text{if}\phantom{\rule{.3em}{0ex}}T[j]\in \phi (P[i])& \text{(base match)}\\ 1& \text{otherwise.}& \text{(base mismatch)}\end{array}\right.\end{array}$(4)
 2.To determine whether an arc breaking operation can occur, we must also be able to check for base complementarity at positions i and j of T. Therefore, we define a function $\mathit{\text{comp}}:\mathbb{N}\times \mathbb{N}\to \{0,1\}$ as:$\begin{array}{l}\mathit{\text{comp}}(i,j)=\left\{\begin{array}{lll}0& \text{if}\phantom{\rule{.3em}{0ex}}(T[i],T[j])\in \mathcal{C}& \text{(complementary)}\\ 1& \text{otherwise.}& \text{(not complementary)}\end{array}\right.\end{array}$(5)
 3.For determining the correct row (of the dynamic programming matrices introduced below) where certain operation costs must be stored we introduce a function $\mathit{\text{row}}:\mathbb{N}\to \mathbb{N}$ defined as:$\mathit{\text{row}}(i)=\left\{\begin{array}{ll}{i}^{\prime}& \text{if}({i}^{\prime},i)\in \hat{R}\phantom{\rule{.3em}{0ex}}\text{and}\phantom{\rule{.3em}{0ex}}1<{i}^{\prime}<i<m\phantom{\rule{.3em}{0ex}}\text{and}\phantom{\rule{.3em}{0ex}}R[i+1]\\ =.\phantom{\rule{.3em}{0ex}}\text{and}\phantom{\rule{.3em}{0ex}}R[{i}^{\prime}1]\ne \mathtt{\text{(}}\\ 0& \text{if}\phantom{\rule{.3em}{0ex}}(i,{i}^{\prime})\in \hat{R}\phantom{\rule{.3em}{0ex}}\text{and}\phantom{\rule{.3em}{0ex}}R[i+1]=.\\ i& \text{otherwise}.\end{array}\right.$(6)
Intuitively, function row satisfies the following: (1) given the right index i of a base pair (i^{′},i), it returns the left index i^{′} if (i^{′},i) is preceded or followed by other structures; (2) given the left index i of a base pair (i,i^{′}), it returns 0 if the base at position i + 1 of$\mathcal{Q}$ is unpaired; and (3) given any other position index i, it returns i itself.
 1.If i = 0 or R[i] =. (unpaired base), then$\begin{array}{l}\begin{array}{l}D{P}_{k}(i,j)=\left\{\begin{array}{ll}0& \text{if}\phantom{\rule{.3em}{0ex}}i=0\phantom{\rule{.3em}{0ex}}\text{and}\phantom{\rule{.3em}{0ex}}j=0\\ D{P}_{k}(0,j1)+{\omega}_{\mathrm{d}}& \text{if}\phantom{\rule{.3em}{0ex}}i=0\phantom{\rule{.3em}{0ex}}\text{and}\phantom{\rule{.3em}{0ex}}j>0\\ D{P}_{k}(\mathit{\text{row}}(i1),0)+{\omega}_{\mathrm{d}}& \text{if}\phantom{\rule{.3em}{0ex}}i>0\phantom{\rule{.3em}{0ex}}\text{and}\phantom{\rule{.3em}{0ex}}j=0\\ \text{min}\left\{\phantom{\rule{0.3em}{0ex}}\begin{array}{l}D{P}_{k}(\mathit{\text{row}}(i1),j)+{\omega}_{\mathrm{d}}\\ D{P}_{k}(i,j1)+{\omega}_{\mathrm{d}}\\ D{P}_{k}(\mathit{\text{row}}(i1),j1)+\chi (i,j){\omega}_{\mathrm{m}}\end{array}\right\}& \text{if}\phantom{\rule{.3em}{0ex}}i>0\phantom{\rule{.3em}{0ex}}\text{and}\phantom{\rule{.3em}{0ex}}j>0\end{array}\right.\end{array}\end{array}$(7)
 2.If R[i] ≠. (paired base), then
 (a)If R[i] =) where i forms base pair $({i}^{\prime},i)\in \hat{R}$,$\begin{array}{l}D{P}_{k}(i,j)=\left\{\begin{array}{ll}D{P}_{k}(\mathit{\text{row}}(i1),0)+{\omega}_{\mathrm{r}}& \text{if}\phantom{\rule{.3em}{0ex}}j=0\\ \text{min}\left\{\begin{array}{l}D{P}_{k}(\mathit{\text{row}}(i1),j1)+\chi (i,j+k){\omega}_{\mathrm{m}}+{\omega}_{\mathrm{a}}\\ D{P}_{k+1}(\mathit{\text{row}}(i1),j1)+\chi ({i}^{\prime},k){\omega}_{\mathrm{m}}+{\omega}_{\mathrm{a}}\\ D{P}_{k}(\mathit{\text{row}}(i1),j)+{\omega}_{\mathrm{r}}\\ D{P}_{k}(i,j1)+{\omega}_{\mathrm{d}}\\ D{P}_{k+1}(i,j1)+{\omega}_{\mathrm{d}}\\ D{P}_{k+1}(\mathit{\text{row}}(i1),j2)+(\chi (i,j+k)\\ \phantom{\rule{2.35982pt}{0ex}}+\chi ({i}^{\prime},k+1)){\omega}_{\mathrm{m}}+\\ \mathit{\text{comp}}(k+1,j+k){\omega}_{\mathrm{b}},\phantom{\rule{1.88821pt}{0ex}}\text{if}\phantom{\rule{.3em}{0ex}}j>1\end{array}\right\}& \text{if}\phantom{\rule{.3em}{0ex}}j>0\end{array}\right.\end{array}$(8)
 (b)If (a) holds and either R[i ^{′}−1] =. or R[i ^{′}−1] =), compute in addition to Equation (8)$D{P}_{k}(\mathit{\text{row}}(i),j)=\left\{\begin{array}{ll}D{P}_{k}(\mathit{\text{row}}({i}^{\prime}1),0)+D{P}_{k}(i,0)& \text{if}\phantom{\rule{.3em}{0ex}}j=0\\ \text{min}\left\{D{P}_{k}(\mathit{\text{row}}({i}^{\prime}1),{j}^{\prime})+D{P}_{k+{j}^{\prime}}(i,j{j}^{\prime})0\le {j}^{\prime}\le j\right\}& \text{if}\phantom{\rule{.3em}{0ex}}j>0\end{array}\right.$(9)
 (a)
A natural way to compute these DP matrices is top down, checking whether case 1, 2(a), or 2(b) applies, in this order. Due to the matrix dependencies in cases 2(a) and (b), the matrices need to be computed simultaneously.
Note that for all j, 1 ≤ j ≤ m^{′}, clearly$D{P}_{1}(\mathit{\text{row}}(m),j)=\mathit{\text{dist}}(\mathcal{Q},T[1\mathit{\text{..j}}])$. Therefore all candidate matches shorter than m^{′} beginning at position p are also computed in the computation of$\mathit{\text{dist}}(\mathcal{Q},T[1\mathrm{..}{m}^{\prime}])$. The following Lemma is another important contribution of this work and also the key for the development of an efficient algorithm.
Lemma 1
When sliding a window along S to compute$\mathit{\text{dist}}(\mathcal{Q},S[\mathit{\text{p..q}}])$, 1 ≤ p ≤ q ≤ n, m^{′} = q − p + 1 = m + d, a window shift by one position to the right requires to compute only column m^{′}−k + 1, i.e. the last column of matrices D P_{ k }, 1 ≤ k ≤ m^{′}.
Proof
Let T[1..m^{′}] = S[p..q]. The computation of$\mathit{\text{dist}}(\mathcal{Q},T[1\mathrm{..}{m}^{\prime}])$ requires to compute m^{′} + 1 DP matrices, one for each suffix T_{ k } of string T = T[1..m^{′}], 1 ≤ k ≤ m^{′}, and one for the empty sequence ε. As a result, it holds for every k that$\mathit{\text{dist}}(\mathcal{Q},{T}_{k})=D{P}_{k}(\mathit{\text{row}}(m),{m}^{\prime})$ which is obtained as a byproduct of the$\mathit{\text{dist}}(\mathcal{Q},T)$ computation. Because each substring T_{l+1}[1..m^{′}−l] = S[p + l..q], 0 ≤ l < m^{′}, only differs by its last character from S[p + l + 1..q + 1] which are suffixes of the window substring shifted by one position to the right, the lemma holds. □
Our ScanAlign algorithm has the following time complexity: computing D P_{ k }(i,j) in cases 1 and 2(a) takes O(1) time and in case 2(b) it takes O(m^{′}) time. Now consider the two situations:

For the first computed window substring S[1..m^{′}], cases 1 and 2(a) require O(m m^{′2}) time in total and case 2(b) requires O(m m^{′3}) time in total. This leads to an overall time of O(m m^{′3}).

For one window shift, cases 1 and 2(a) require O(m m^{′}) time in total and case 2(b) requires O(m m^{′2}) time in total, leading to an overall time of O(m m^{′2}).
Since there are n − m^{′}−1 window shifts, the computation for all shifted windows takes O(m m^{′2}(n−m^{′})) = O(m m^{′2}n) time. We observe that the time needed by ScanAlign to compute all window shifts reduces to O(m m^{′}n) if recurrence case 2(b) is not required. This is the case if the structure of$\mathcal{Q}$ does not contain unpaired bases before a base pair constituting e.g. a left dangling end or left bulge.
Faster online alignment with earlystop computation: LScanAlign
Often, before completing the computation of the alignment between an RSSP$\mathcal{Q}$ and a window substring S[p..q] of the searched RNA sequence, we can determine whether the cost of this alignment will exceed the cost threshold$\mathcal{K}$. By identifying this situation as early as possible, we can improve algorithm ScanAlign to skip the window, thus saving computation time and proceed with aligning the next window. The idea consists in checking, during the alignment computation, whether the cost of an already aligned region of$\mathcal{Q}$ and a substring of S[p..q] exceeds$\mathcal{K}$. In such a case, the alignment cost of the complete$\mathcal{Q}$ and S[p..q] will also exceed$\mathcal{K}$. In more detail, this works as follows.

We decompose the RSSP$\mathcal{Q}$ into regions that can themselves represent a pattern, e.g. a stemloop or unpaired region. A basic constraint is to not split base pairs to different regions.

We compute the alignment of a given initial RSSP region and a substring of the current window S[p..q], progressively extending the alignment to other regions.

If the cost of aligning an RSSP region to a substring of the window exceeds cost threshold$\mathcal{K}$, then the entire pattern cannot match the window. This means that the window can immediately be skipped.
 1.
$\mathcal{Q}[\mathit{\text{x..y}}]$ is a left dangling (unpaired) end of the pattern in 5^{′} to 3^{′} direction, i.e. x = 1. Alternatively, it is an unpaired region of maximal length such that position x − 1 forms a base pair $(x1,{y}^{\prime})\in \hat{R}$ for some position y ^{′} of $\mathcal{Q}$. Observe that no extension of $\mathcal{Q}[\mathit{\text{x..y}}]$ by another unpaired position is possible. As an example, consider the green marked regions $\mathcal{Q}[1\mathrm{..}2]$, $\mathcal{Q}[4\mathrm{..}4]$, $\mathcal{Q}[6\mathrm{..}8]$, and $\mathcal{Q}[12\mathrm{..}15]$ in Figure 3.
 2.
Position y is unpaired and there is at least one base pair $({x}^{\prime},{y}^{\prime})\in \hat{R}$, x ≤ x ^{′} < y ^{′} < y. No extension of $\mathcal{Q}[\mathit{\text{x..y}}]$ by another unpaired position is possible. As examples of regions under these requirements, see the regions in orange of the RSSP $\mathcal{Q}$ in Figure 3, namely $\mathcal{Q}[4\mathrm{..}10]$, $\mathcal{Q}[4\mathrm{..}18]$, and $\mathcal{Q}[1\mathrm{..}20]$.
 3.
$(x,y)\in \hat{R}$ is a base pair. For examples of such RSSP regions, see the regions in blue of the RSSP in Figure 3, namely $\mathcal{Q}[5\mathrm{..}9]$, $\mathcal{Q}[11\mathrm{..}16]$, and $\mathcal{Q}[3\mathrm{..}19]$.
 4.
y forms a base pair $({x}^{\prime},y)\in \hat{R}$ where either R[x ^{′}−1] =. or R[x ^{′}−1] =), 1 ≤ x ≤ x ^{′}−1. In addition, x = 1 or $(x1,{y}^{\prime})\in \hat{R}$ for some y ^{′} > y. Examples of such RSSP regions are shown in red in Figure 3, i.e. regions $\mathcal{Q}[4\mathrm{..}9]$, $\mathcal{Q}[4\mathrm{..}16]$, and $\mathcal{Q}[1\mathrm{..}19]$.
Note that regions can be embedded in other regions but cannot partially overlap another.
 1.
All rows in the interval [x..y] are computed by Equation (7).
 2.
One scans the structure of region $\mathcal{Q}[\mathit{\text{x..y}}]$ from position y to position x until one finds a paired position y ^{′}. Then, all rows in the interval [y ^{′} + 1..y] are computed by Equation (7).
 3.
Row y is computed by recurrence (a) of Equation (8).
 4.
Row row(y) is computed by recurrence (b) of Equation (8).
The sequential computation of the rows belonging to each region naturally leads to the computation of the entire alignment of$\mathcal{Q}$ and sequencestructure edit distance$\mathit{\text{dist}}(\mathcal{Q},T)$.
Our improvement of the ScanAlignalgorithm is based on the following two observations.

The standard dynamic programming algorithm for aligning two plain text sequences of lengths m and n requires an (m + 1) × (n + 1) matrix. Let i and j be indices of each of the matrix dimensions and a diagonal v be those entries defined by i and j such that j − i = v. Given that the cost of each edit operation is a positive value, the cost of the entries along a diagonal of the matrix are always nondecreasing [28].

Moreover, one indel operation implies that an optimal alignment path including an entry on diagonal v also includes at least one entry on diagonal v + 1 or v − 1. Now let v be the diagonal ending at the entry on the lowerright corner of the matrix and d be the number of allowed indels. One can stop the alignment computation as soon as all the entries of one row in the matrix and along diagonals v + d^{′}, −d ≤ d^{′} ≤ d, exceed$\mathcal{K}$.
For our improvement of algorithm ScanAlign, based on the following Lemma, we define a diagonal for each RSSP region instead of only one for the entire matrices.
Lemma 2
Assume an RSSP$\mathcal{Q}=(P,R)$, a region$\mathcal{Q}[\mathit{\text{x..y}}]$of length l = y − x + 1, a window substring T[1..m^{′}] of the searched RNA sequence, a cost threshold$\mathcal{K}$, and number d of allowed indels. If for every d^{′}, −d ≤ d^{′} ≤ min{d,x}, z ∈ {d^{′}−d,−d^{′}+d}, y + d^{′} ≤ m^{′}, it holds that$\mathit{\text{dist}}(\mathcal{Q}[\mathit{\text{x..y}}],{T}_{x+{d}^{\prime}}[1\mathrm{..l}+z])>k$, then, for every d^{′′}, 0 ≤ d^{′′} ≤ d,$\mathit{\text{dist}}(\mathcal{Q},T[1\mathrm{..}{m}^{\prime}{d}^{\prime \prime}])>k$.
Proof
If the RSSP region$\mathcal{Q}[\mathit{\text{x..y}}]$ originates from condition 1 or 2 (3 or 4) above, we define the entries on a diagonal e as those entries D P_{ k }(i,j) (D P_{ k }(r o w(y),j)), 1 ≤ k ± d ≤ m^{′}, such that j − i + offset = e, where offset = x − 1. Without loss of generality let d = 1. Assuming x − 1 > 0 and y + 1 ≤ m^{′}, this means that an optimal alignment of pattern$\mathcal{Q}$ and substring T requires$\mathcal{Q}[\mathit{\text{x..y}}]$to align with:

T[x..y], T[x..y − 1], or T[x..y + 1], requiring for all three alignments the computation of$\mathit{\text{dist}}(\mathcal{Q}[\mathit{\text{x..y}}],{T}_{x}[1\mathrm{..l}+z])$ for z ∈ {0−1,0 + 1} = {−1,1};

T[x − 1..y − 1], requiring the computation of$\mathit{\text{dist}}(\mathcal{Q}[\mathit{\text{x..y}}],{T}_{x1}[1\mathrm{..l}+z])$ for z ∈ {−1−1,−−1 + 1} = {0}; or

T[x + 1..y + 1], requiring the computation of$\mathit{\text{dist}}(\mathcal{Q}[\mathit{\text{x..y}}],{T}_{x+1}[1\mathrm{..l}+z])$for z ∈ {1−1,−1 + 1} = {0}.
The alignments with T[x..y], T[x..y + 1], and T[x..y−1] end in matrix D P_{ x }. The alignments with T[x−1..y−1] end in matrix D P_{x−1}, and the alignments with T[x + 1..y + 1] end in matrix D P_{x+1}. Every minimizing path obtained for the entire alignment of$\mathcal{Q}$ and T can only include the entries on the diagonals e, e + 1, and/or e − 1 for the alignments with T[x..y], T[x..y + 1], and T[x..y − 1], and can only include the entries on diagonal e for the alignments with T[x − 1..y − 1] and T[x + 1..y + 1] because these substrings already imply alignments with one indel. As the sum of the cost of the edit operations on the minimizing path increases monotonically and there cannot be other minimizing paths due to the limited number of indels d, the lemma holds. □
Let$\mathcal{Q}$ be an RSSP whose regions are sorted by the order of computation of their respective rows in the DP tables above, let d be the number of allowed indels, and T = T[1..m^{′}] be a window substring of the searched RNA sequence. Applying Lemma 2, we modify algorithm ScanAlign to compute the alignment of each region$\mathcal{Q}[\mathit{\text{x..y}}]$ to substrings${T}_{x+{d}^{\prime}}$, −d ≤ d^{′} ≤ min{d,x}, y + d^{′} ≤ m^{′}, and progressively extend the alignment to other RSSP regions and substrings of T as long as$\mathit{\text{dist}}(\mathcal{Q}[\mathit{\text{x..y}}],{T}_{x+{d}^{\prime}}[1\mathrm{..l}+z])\le k$, z ∈ {d^{′}−d,−d^{′} + d}, holds. That is, for each RSSP region, it determines the rows and recurrence case required for their computation according to conditions 1, 2, 3, or 4 above. Then, within each processed row i, it checks whether for at least one entry D P_{ k }(i,j) on a possible minimizing path, i.e. on diagonals e^{′}, e − d ≤ e^{′} ≤ e + d, DP_{ k }(i,j) ≤ k. If no entry is below$\mathcal{K}$, it skips the alignment computation for all remaining RSSP regions and proceeds with aligning the next window. See Figure 2 for an example of the DP matrices of an alignment computation whose entries on a possible minimizing path are highlighted in yellow.
When scanning the searched RNA sequence, a window can be shifted before all DP matrices entries are computed. Hence, a direct application of Lemma 1 is no longer possible. To overcome this, we define an array Z in the range 1 to z, where z is the number of RSSP regions, and associate each region with an index r, 1 ≤ r ≤ z. Let p be the starting position of the window substring S[p..q] in the RNA sequence. We set Z[r] = p whenever all DP matrices rows and columns belonging to region r are computed. This occurs when the cost of aligning this region does not exceed cost threshold$\mathcal{K}$. Now, when aligning the same RSSP region r to a different window substring S[p^{′}..q^{′}], p^{′} > p, computing all DP matrices columns requires to compute the last p^{′} − p columns. If p^{′} − p < m^{′} (recall that m^{′} = q − p = q^{′}−p^{′}), this means that the two window substrings do not overlap and therefore no DP matrix column can be reused.
Our improved algorithm, hereinafter called LScanAlign, in the worst case needs to process every RSSP region for every window shift. Hence, it has the same time complexity as algorithm ScanAlign. However, as in many cases only a few RSSP regions are evaluated, it is much faster in practice as will be shown later. ScanAlign and LScanAlign are the basis for further improvements presented in the subsequent sections.
Indexbased search: LESAAlign
Suffix trees and enhanced suffix arrays are powerful data structures for exact string matching and for solving other string processing problems [29, 30]. In the following we show how the use of enhanced suffix arrays leads to even faster algorithms for searching for matches of an RSSP$\mathcal{Q}$ in an RNA sequence S.
Consider an RSSP$\mathcal{Q}$ to be matched against an RNA sequence S with up to d indels. For each i, 1 ≤ i ≤ n, let p_{ i } = min{m + d,S _{ suf }_{[i]}} be the reading depth of suffix S _{ suf }_{[i]}. When searching for matches of$\mathcal{Q}$ in S, we observe that algorithms ScanAlign and LScanAlign scan S computing$\mathit{\text{dist}}(\mathcal{Q},S[\mathit{\text{p..q}}])$ for every window substring of length q − p + 1 = m + d. In the suffix array, each substring S[p..q] is represented by a suffix S _{ suf }_{[i]} up to reading depth p_{ i }, i.e. there is a substring S _{ suf }_{[i]}[1..p_{ i }] such that S _{ suf }_{[i]}[1..p_{ i }] = S[p..q]. To match$\mathcal{Q}$ in S using a suffix array, we simulate a depth first traversal of the lcp interval tree [30] of S on the enhanced suffix array of S such that the reading depth of each suffix is limited by p_{ i }. That is, we traverse the suffix array of S top down, computing the sequencestructure edit distance$\mathit{\text{dist}}(\mathcal{Q},{{S}_{\mathsf{suf}}}_{[i]}[1\mathrm{..}{p}_{i}])$for each suffix S _{ suf }_{[i]}. We recall that candidate matches of$\mathcal{Q}$ have length between m − d and m + d and that p_{ i } ≤ m + d. In case p_{ i } < m − d, we can skip S _{ suf }_{[i]}. Also, remember that all candidate matches shorter than p_{ i } are obtained as a byproduct of the computation of$\mathit{\text{dist}}(\mathcal{Q},{{S}_{\mathit{\text{suf}}}}_{[i]}[1\mathrm{..}{p}_{i}])$. Hence, for every p^{′}, m − d ≤ p^{′} ≤ p_{ i }, if$\mathit{\text{dist}}(\mathcal{Q},{{S}_{\mathit{\text{suf}}}}_{[i]}[1\mathrm{..}{p}^{\prime}])\le \mathcal{K}$ we report [suf[i]..suf[i] + p^{′}] as a matching interval of$\mathcal{Q}$ in S. That is,$\mathcal{Q}$ matches substring S[suf[i]..suf[i] + p^{′}] beginning at position suf[i] of S.
 1.
If p _{ i } ≤ lcp and $\mathit{\text{dist}}(\mathcal{Q},{{S}_{\mathsf{suf}}}_{[i1]}[1\mathrm{..}{p}_{i1}])\le \mathcal{K}$, then clearly $\mathit{\text{dist}}(\mathcal{Q},{{S}_{\mathsf{suf}}}_{[i]}[1\mathrm{..}{p}_{i}])\le \mathcal{K}$ and at least one match of $\mathcal{Q}$ starts at position suf[i] of S; and
 2.
If p _{ i } ≤ lcp and $\mathit{\text{dist}}(\mathcal{Q},{{S}_{\mathsf{suf}}}_{[i1]}[1\mathrm{..}{p}_{i1}])>\mathcal{K}$, then $\mathit{\text{dist}}(\mathcal{Q},{{S}_{\mathsf{suf}}}_{[i]}[1\mathrm{..}{p}_{i}])>\mathcal{K}$.
These situations allow LESAAlign to benefit from the enhanced suffix array in a second important way. That is, it skips all suffixes S _{ suf }_{[i]}, S _{ suf }_{[i+1]},..., S _{ suf }_{[j]} sharing a common prefix of at least length lcp with S _{ suf }_{[i−1]}. To find the index j of the last suffix S _{ suf }_{[j]} to be skipped, it suffices to look for the largest j such that min{lcp[i],lcp[i+1],...,lcp[j]} ≥ lcp. If the first situation above holds, there are matches of$\mathcal{Q}$ in S at positions suf[i], suf[i +1 ],..., suf[j]. We note that suffixes can also be efficiently skipped using socalled skiptables as described in [36]. However, to save the 4n additional bytes required to store such tables we do not use them here. Our algorithm continues the topdown traversal of the suffix array with suffix S _{ suf }_{[j+1]}, taking into account that the DP tables were last computed for S _{ suf }_{[i−1]}. Consequently, the length of the longest common prefix between S _{ suf }_{[i−1]} and S _{ suf }_{[j+1]} to be considered in the processing of S _{ suf }_{[j+1]} is min{lcp[i],lcp[i+1],...,lcp[j],lcp[j+1]}.
We also incorporate in our indexbased algorithm the earlystop alignment computation scheme of algorithm LScanAlign. This allows to skip suffixes S _{ suf }[i] as soon as it becomes clear that the sequencestructure edit distance of RSSP$\mathcal{Q}$ and S _{ suf }_{[i]} up to reading depth p_{ i } will exceed the cost threshold$\mathcal{K}$. For this, LESAAlign progressively aligns regions of$\mathcal{Q}$ to a substring of the current suffix as in algorithm LScanAlign, checking whether the cost of each subalignment remains below the cost threshold$\mathcal{K}$, thus applying Lemma 2. If the cost exceeds$\mathcal{K}$, the alignment computation of the remaining pattern regions is skipped and the algorithm proceeds with processing the next suffix. To avoid recomputing as many entries of the DP matrices as possible while traversing the suffix array, LESAAlign differs from LScanAlign in the way it manages (non) aligned regions for each suffix. Lemma 1, which algorithm LScanAlign applies to support earlystop computation, relies on scanning the searched RNA sequence S and overlapping window substrings. This makes it unsuitable for use with the suffix array. Instead, LESAAlign only uses information from the lcp table as follows. Let z be the number of regions of$\mathcal{Q}$ indexed from 1 to z and T = S _{ suf }_{[i]}[1..p_{ i }] be the current substring. When progressively aligning the regions of$\mathcal{Q}$ to a substring of T, we store the index r of the first region whose alignment cost exceeds$\mathcal{K}$, if there is any. That is, for the first region$\mathcal{Q}[\mathit{\text{x..y}}]$ whose index r we store, it holds that for every d^{′}, −d ≤ d^{′} ≤ min{d,x},$\mathit{\text{dist}}(\mathcal{Q}[\mathit{\text{x..y}}],{T}_{x+{d}^{\prime}}[1\mathrm{..l}+z])>k$ with l = y − x + 1, z ∈ {d^{′}−d,−d^{′}+d}, and y+d^{′} ≤ m + d (see Lemma 2). Then, when aligning$\mathcal{Q}$ to a subsequent substring S _{ suf }_{[j]}[1..p_{ j }], we must distinguish the regions of$\mathcal{Q}$previously computed from regions not computed.

Previously computed pattern regions are all regions whose index is strictly smaller than r. The alignment computation of these regions profits from the common prefix between S_{ suf }_{[i]}[1..p_{ i }] and S_{ suf }_{[ j]}[1..p_{ j }] by avoiding the recomputation of DP matrices columns as described above.

Noncomputed pattern regions are all regions whose index is larger than or equal to r. In this case, all DP matrices columns of the respective pattern region need to be computed, even if S_{ suf }_{[i]}[1..p_{ i }] and S_{ suf }_{[ j]}[1..p_{ j }] share a common prefix.
We observe that longer ranges of suffixes not containing matches to$\mathcal{Q}$can be skipped thanks to the earlystop alignment computation scheme. Note that the leftmost character of T needed to assert$\mathit{\text{dist}}(\mathcal{Q}[\mathit{\text{x..y}}],{T}_{x+{d}^{\prime}}[1\mathrm{..l}+z])>\mathcal{K}$ is T[x + l + d − 1] = T[x + y − x + 1 + d − 1] = T[y + d] as l = y − x + 1. Therefore, no suffix sharing prefix T[1..y + d] can match$\mathcal{Q}$ and thus can be skipped in the topdown traversal of the suffix array of S. Because in most cases y + d < p_{ i }, more suffixes are likely to share a prefix of length y + d than of length p_{ i } with S _{ suf }_{[i]}. For the pseudocode of algorithm LESAAlign, see Section 1 of Additional file 1.
Enhanced indexbased search: LGSlinkAlign
Given an RSSP$\mathcal{Q}$ to be searched in an RNA sequence S, algorithm LESAAlignis very fast when it can

avoid recomputation of DP matrices columns due to a common prefix between suffixes of S; and

skip long ranges of suffixes of the suffix array suf whose common prefix up to a required reading depth are known to match or not match$\mathcal{Q}$.
Therefore, LESAAlign exploits repetitions of substrings of S, i.e. substrings shared by different suffixes, and information of the lcp table to save computation time. However, the use of information of the lcp table alone does not necessarily lead to large speedups. Consider e.g. the DP matrices for the computation of the alignment of$\mathcal{Q}=(\text{AAGUUUC},\mathtt{\text{..(...)}})$ and substring S _{ suf }_{[4]}[1..p_{4}] = ACCCUCUU in Figure 5. The enhanced suffix array of S is shown in Figure 4. The substring S _{ suf }_{[4]}[1..p_{4}] of length 8 shares a common prefix of length lcp[4] = 4 with the previously processed substring S _{ suf }_{[3]}[1..p_{3}]. Despite this common prefix, still 182/252 ≈ 72% of the DP matrices entries need to be computed (disregarding initialization rows and columns 0) in case no earlystop is possible, i.e. in case$\mathcal{K}>4$. This is more than the at most 56/252 ≈ 22% of the DP matrices entries computed by the online algorithm LScanAlign for a window shift.
 1.
can skip more suffixes; and
 2.
improves the use of already computed DP matrices entries, reusing computed entries for as many suffixes as possible.
To address the first goal, we motivate our method by recalling the alignment computation example in Figure 2. In this example, one of the regions of$\mathcal{Q}=(\text{AAGUUUC},\mathtt{\text{..(\u2026)}})$ is$\mathcal{Q}[3\mathrm{..}7]=(\text{GUUUC},\mathtt{\text{(\u2026)}})$. Assume$\mathcal{K}=d=1$ and observe that$\mathit{\text{dist}}(\mathcal{Q}[3\mathrm{..}7],{T}_{3+{d}^{\prime}}[1\mathrm{..}5+z])>1$ for every d^{′}, −1 ≤ d^{′} ≤ 1, z ∈ {d^{′}−1,−d^{′} + 1}, i.e. the alignment cost for this pattern region already exceeds the cost threshold of 1 (in accordance with Lemma 2). In other words,$\mathcal{Q}[3\mathrm{..}7]$ cannot align to any of the substrings T[2..6] = CCCUC, T[3..6] = CCUC, T[3..7] = CCUCU, T[3..8] = CCUCUU, or T[4..8] = CUCUU with a cost lower than 1. Observe further that the alignment computation of region$\mathcal{Q}[3\mathrm{..}7]$ does not depend on any previous computation of any other region. We can therefore conclude that no suffix containing substring T[2..8]=CCCUCUU from position 2 to 8 can match$\mathcal{Q}$, independently of any prefix of length 1. Our goal is to find and eliminate from the search space all such suffixes, in addition to skipping all suffixes sharing prefix T[1..8] as performed by LESAAlign. That is, we want to skip suffixes sharing a substring, not limited to a prefix, whose alignment cost to a pattern region exceeds cost threshold$\mathcal{K}$.
Then, by letting j = link(i,x^{′}),${{S}_{\mathsf{suf}}}_{[\mathit{\text{link}}(i,{x}^{\prime})]}={{S}_{\mathsf{suf}}}_{[i]+{x}^{\prime}}$ holds for any x^{′} ≥ 1. All suffixes sharing T[x..y] as a prefix are all suffixes in the range j_{start} to j_{end} where j_{start} is the smallest and j_{end} is the largest index satisfying min{lcp[j_{start} + 1],...,lcp[j],...,lcp[j_{end}]} ≥ y − x + 1. Finally, we find that all suffixes of S sharing substring T[x..y] from position x to y are all${{S}_{\mathsf{suf}}}_{[{j}^{\prime}]{x}^{\prime}}$,${j}_{\text{start}}\le {j}^{\prime}\le {j}_{\text{end}}$, satisfying$\mathsf{suf}[{j}^{\prime}]>{x}^{\prime}$. To skip these suffixes not containing matches to$\mathcal{Q}$ in the topdown traversal of the suffix array suf, we mark their positions as true (for already“processed”) in a bit array vtab of n bits. The suffix array traversal proceeds from position suf[i], but skips the marked suffixes when their positions are reached.
We remark that the described method for skipping suffixes can profit from a resorting according to the order by which RSSP regions are aligned. In the alignment computation example in Figure 2, determining$\mathit{\text{dist}}(\mathcal{Q}[3\mathrm{..}4],{T}_{3+{d}^{\prime}}[1\mathrm{..}2+z])>1$, −1≤d^{′} ≤ 1, z ∈ {d^{′}−1,−d^{′} + 1}, does not depend on character T[1] and region$\mathcal{Q}[1\mathrm{..}1]$. Hence, region$\mathcal{Q}[1\mathrm{..}1]$ is unnecessarily aligned first when the regions are sorted by a topdown analysis of the DP tables. To decrease the chance that unnecessary computations occur, we sort the RSSP regions to begin aligning with the leftmost RSSP region$\mathcal{Q}[\mathit{\text{x..y}}]$ not depending on the alignment of any other region and satisfying x − d > 1.
We now address the second goal, namely reusing computed DP matrices entries for as many suffixes as possible. Recall that computing the sequencestructure edit distance$\mathit{\text{dist}}(\mathcal{Q},{{S}_{\mathsf{suf}}}_{[i]}[1\mathrm{..}{p}_{i}])$ for each suffix S _{ suf }_{[i]} up to reading depth p_{ i } means computing p_{ i } + 1DP matrices, one for each suffix T_{ k } of string T = S _{ suf }_{[i]}[1..p_{ i }], 1 ≤ k ≤ m^{′}, and one for the empty sequence ε. Observe that each suffix T_{ k }, T_{ k } ≠ T, also occurs itself as a prefix of a suffix in table suf, i.e. there exists a suffix S _{ suf }_{[j]} shorter than S _{ suf }_{[i]} by exactly k − 1 characters which has prefix T_{ k }. Consequently, T_{ k } is processed again in an alignment to RSSP$\mathcal{Q}$ at a different point in time during the traversal of suf. Let T^{′} = S _{ suf }_{[j]}[1..p_{ j }]. Now note that if T^{′} is at a (nearly) contiguous position in suf to T, T^{′} and T are likely to share a common prefix due to their similar lexicographic ranking. This allows algorithm LESAAlign to avoid recomputation of DP matrices columns by using information from the lcp table. Unfortunately, T^{′} and T can be lexicographically ranked far away from each other in table suf, meaning that the DP matrices computed for T^{′}either:

were already computed once because T^{′} is lexicographically smaller than T, but were discarded to allow the processing of other suffixes until T was traversed; or

are computed for the first time otherwise, but will not be reused to also allow the processing of other suffixes until T^{′} occurs in table suf as a prefix of a suffix itself.
In both cases, redundant computations occur. To avoid this, we optimize the use of computed DP matrices by processing T^{′} directly after processing T for fixed k = 2, recalling that T = S _{ suf }_{[i]}[1..p_{ i }] and T^{′} = S _{ suf }_{[j]}[1..p_{ j }]. This value of k implies that S _{ suf }_{[j]} does not contain the first character of S _{ suf }_{[i]} and that we can locate S _{ suf }_{[j]} in table suf by computing the suffix link j = link(i,1). Also, k = 2 implies that T^{′} only differs by its last character from T, aside from not beginning with character T[1]. Therefore, to determine$\mathit{\text{dist}}(\mathcal{Q},{T}^{\prime})$, we only have to compute the last column of the DP matrices required to compute$\mathit{\text{dist}}(\mathcal{Q},T)$ as shown by Lemma 1. We note that, because i and j are not necessarily contiguous positions in suf, we mark the processed suffix S _{ suf }_{[j]} in the bit array vtab so that it is only processed once. If no match to RSSP$\mathcal{Q}$ begins at position suf[j], we also mark and skip every suffix sharing the substring with T^{′} whose alignment to a region of$\mathcal{Q}$ is known to exceed threshold$\mathcal{K}$. Once T^{′} is processed and all possible suffixes are skipped, we recursively repeat this optimization scheme by setting T = T^{′} and processing the next${T}^{\prime}={{S}_{\mathsf{suf}}}_{[{j}^{\prime}]}[1\mathrm{..}{p}_{{j}^{\prime}}]$ where j^{′} = link(j,1). The recursion stops when${p}_{{j}^{\prime}}<md$, meaning that T^{′} is too short to match$\mathcal{Q}$, or when suf[j^{′}] is already marked as processed in vtab. The suffix array traversal proceeds at position i + 1 repeating the entire scheme.
We call our algorithm incorporating the presented improvements LGSlinkAlign. For its pseudocode, see Section 1 of Additional file 1. LGSlinkAligninherits all the improvements of the above presented algorithms. In summary, its improvements are as follows.

LGSlinkAlign traverses the enhanced suffix array of the searched sequence S, i.e. the suffix array suf enhanced with tables lcp and suf^{−1}. During this traversal, it benefits from common prefixes shared among suffixes to (1) avoid the computation of DP matrix columns and to (2) skip ranges of suffixes known to match or not match RSSP$\mathcal{Q}$ as in algorithm LESAAlign.

The suffix array traversal is predominantly top down, but noncontiguous suffixes are processed to optimize the use of computed DP matrices.

LGSlinkAlign stops the alignment computation as early as the alignment cost of a region of RSSP$\mathcal{Q}$ and a substring of the prefix of the current suffix exceeds threshold$\mathcal{K}$, an improvement first introduced in algorithm LScanAlign.

Due to the earlystop computation scheme, suffixes sharing common prefixes shorter than m + d can be skipped, leading to larger ranges of skipped suffixes. The earlystop computation scheme also helps to identify and skip noncontiguous suffixes sharing a common substring which is not their prefix.
Example: searching for an RSSP with algorithm LGSlinkAlign
We elucidate the ideas of algorithm LGSlinkAlign with the following example. Consider the RSSP$\mathcal{Q}=(\text{AAGUUUC},\mathtt{\text{..(...)}})$ to be matched in the sequence S whose enhanced suffix array is shown in Figure 4. To keep the example simple, we only allow a small cost threshold and number of indels, i.e. we set$\mathcal{K}=d=1$. The costs of the edit operations are ω_{d} = ω_{m} = ω_{b} = ω_{a} = 1 and ω_{r} = 2. When traversing the enhanced suffix array of S, LGSlinkAlign always begins to align$\mathcal{Q}$ to a substring of S with region$\mathcal{Q}[4\mathrm{..}6]$, because the alignment computation of this region does not depend on any other region. In addition, the left index of this region satisfies 4 − d > 1. This means that the alignment computation of region$\mathcal{Q}[1\mathrm{..}2]$ is avoided if the cost of aligning region$\mathcal{Q}[4\mathrm{..}6]$ exceeds the threshold$\mathcal{K}$. The algorithm starts the traversal of the enhanced suffix array of S aligning$\mathcal{Q}[4\mathrm{..}6]$ to substrings of T = S _{ suf }_{[1]}[1..p_{1}] = S_{14}[1..8] from positions 4 − d = 3 and 6 + d = 7. For this, it computes$\mathit{\text{dist}}(\mathcal{Q}[4\mathrm{..}6],{T}_{4+{d}^{\prime}}[1\mathrm{..}3+z])$ for −1 ≤ d^{′} ≤ 1 and z ∈ {d^{′}−1,−d^{′}+1}. Observe that$\mathit{\text{dist}}(\mathcal{Q}[4\mathrm{..}5],{T}_{4+{d}^{\prime}}[1\mathrm{..}2+z])>1$ holds. Hence (1) no suffix with prefix T[1..6] = AACACC can match$\mathcal{Q}$ and thus can be skipped and (2) no suffix containing substring T[3..6] = CACC from position 4−d = 3 to 5 + d = 6 can match$\mathcal{Q}$ and thus can be skipped as well. We notice that there is no other suffix with prefix AACACC because lcp[2] < 6, so we analyze case (2). The algorithm looks for suffixes sharing substring CACC from position 3 to 6. It begins by locating suffixes without the first two characters of T and containing CACC as a prefix. It follows the suffix link link(1,2) = suf^{−1}[suf[1]+2] = suf^{−1}[16] = 7 and looks for the smallest j_{start} and largest j_{end} satisfying min{lcp[j_{start} + 1],...,lcp[8],...,lcp[j_{end}]} ≥ 4 = CACC. It finds that j_{start} = 5 and j_{end} = 8, since min{lcp[5+1],lcp[7],lcp[8]} = min{4,5,5} ≥ 4 holds. The suffixes containing CACC from position 3 to 6 are S _{ suf }_{[5]−2} = S_{11}, S _{ suf }_{[6]−2} = S_{7}, and S _{ suf }_{[8]−2} = S_{14}. S_{11}and S_{7} are marked in the bit array vtab, whereas S_{14} = S _{ suf }_{[1]} was already processed and does not need to be marked. We observe that S _{ suf }_{[7]−2} = S_{−1} is not a valid suffix. To reuse as many computed DP matrices entries as possible, the algorithm next processes the suffix S _{ suf }_{[j]} which does not contain the first character of S _{ suf }_{[1]}. It determines j = link(1,1)=suf^{−1}[suf[1]+1] = 11 and sets T = S _{ suf }_{[12]}[1..p_{12}] = S_{15}[1..8]. The alignment to this substring T begins with its substrings from positions 3 to 7 and$\mathcal{Q}[4\mathrm{..}6]$. We observe that$\mathit{\text{dist}}(\mathcal{Q}[4\mathrm{..}5],{T}_{4+{d}^{\prime}}[1\mathrm{..}2+z])>1$ holds and consequently T cannot match$\mathcal{Q}$. Because suffix S _{ suf }_{[12]} = S_{15} was traversed via a suffix link, it is marked as processed in vtab. We now again analyze two cases of suffixes that cannot match$\mathcal{Q}$ and therefore can be skipped: (1) suffixes sharing prefix T[1..6] = CCACCC and (2) suffixes containing substring T[3..6] = ACCC from position 3 to 6. Satisfying case (1) are suffixes S _{ suf }_{[11]} = S_{1} and S _{ suf }_{[10]} = S_{8} since lcp[12] ≥ 6 and lcp[11] ≥ 6. These suffixes are marked in vtab. We now check if there are suffixes satisfying case (2). The algorithm begins by locating suffixes containing substring T[3..6] = ACCC as a prefix. For this, it follows the suffix link link(12,2) = suf^{−1}[suf[12] + 2] = 4 and determines j_{start} = 2 and j_{end} = 4. The property min{lcp[2 + 1],lcp[4]} ≥ 4 is satisfied. The suffixes containing ACCC from position 3 to 6 are S _{ suf }_{[2]−2} = S_{8}, S _{ suf }_{[3]−2} = S_{1}, and S _{ suf }_{[4]−2} = S_{15}. Since these were already marked in vtab, none of them needs to be marked. The algorithmic scheme of LGSlinkAlign to reuse as many computed DP matrices entries as possible continues processing other suffixes which are located by iteratively following the suffix links. It locates suffixes S _{ suf }_{[8]}, S _{ suf }_{[4]}, S _{ suf }_{[18]}, and S _{ suf }_{[19]} because link(12,1) = 8, link(8,1) = 4, link(4,1) = 18, and link(18,1) = 19, respectively. These suffixes are processed analogously as above, one after the other, not resulting in matches to$\mathcal{Q}$. The iteration then leads to suffix S _{ suf }_{[20]}, since link(19,1) = 20. However, S _{ suf }_{[20]} < m−d, meaning that this suffix is too short to contain a match to$\mathcal{Q}$. This causes the iteration to stop. The suffix array traversal proceeds and repeats the entire matching scheme from the suffix that follows the last processed suffix not located via a suffix link, i.e. suffix S _{ suf }_{[2]}. After processing and skipping all possible suffixes, we note that LGSlinkAlign does not report any matches for the defined cost threshold and allowed number of indels$\mathcal{K}=d=1$. By setting$\mathcal{K}=5$, it reports a match at position 16.
RNA secondary structure descriptors based on multiple ordered RSSPs
RNAs with complex branching structures often cannot be adequately described by a single RSSP due to difficulties in balancing sensitivity, specificity, and reasonable running time of the used search algorithm. Although their description by a single short RSSP specifying an unbranched fragment of the molecule might be very sensitive, it is often too unspecific and likely to generate many spurious matches when searching for structural homologs in large sequence databases or complete genomes. In contrast, using a single long RSSP often requires a higher cost threshold$\mathcal{K}$ for being sensitive enough which in turn, together with the increased RSSP length, has a negative influence on the search time. This might lead to disadvantageous running times in larger search scenarios in practice.
We solve this problem by applying the powerful concept of RNA secondary structure descriptors (SSDs for short) recently introduced in [23]. The underlying concept of SSDs is similar to the idea of PSSM family models [38], which are successfully used for fast and sensitive protein homology search. SSDs use the information of multiple ordered RSSPs derived from the decomposition of an RNA’s secondary structure into stemloop like structural elements. In a first step, approximate matches to the single RSSPs the SSD consists of are obtained using one of the algorithms presented above. From these matches, either local or global highscoring chains are computed with the efficient chaining algorithms described in [23]. These algorithms take the chain’s score, i.e. the weights of the fragments in the chain, into account (see [23] for details). For chaining of approximate RSSP matches, we use the fragment weight${\omega}_{\mathcal{Q}}^{\ast}\mathit{\text{dist}}(\mathcal{Q},T)$ for an RSSP$\mathcal{Q}$ of length m matching substring T, where${\omega}_{\mathcal{Q}}^{\ast}=m\ast {\omega}_{\mathrm{m}}+\mathit{\text{bps}}\ast {\omega}_{\mathrm{r}}$ and bps denotes the number of base pairs in$\mathcal{Q}$. Here${\omega}_{\mathcal{Q}}^{\ast}$ is the maximal possible weighting$\mathcal{Q}$ can gain when being aligned and therefore it reflects the situation of a perfect match between$\mathcal{Q}$ and T. With this definition of a fragment’s weight, a positive weight is always guaranteed, thus satisfying a requirement for the chaining algorithm. Once the chaining of matches to the RSSPs is completed, the highscoring chains are reported in descending order of their chain score. By restricting to highscoring chains, spurious RSSP matches are effectively eliminated. Moreover, the relatively short RSSPs used in an SSD can be matched efficiently with the presented algorithms leading to short running times that even allow for the large scale application of approximate RSSP search.
Results and discussion
Implementation and computational results
We implemented (1) the fast indexbased algorithms LESAAlign and LGSlinkAlign, (2) the online algorithms LScanAlign, ScanAlign, both operating on the plain sequence, and (3) the efficient global and local chaining algorithms described in [23]. In our experiments we use ScanAlign, which is the scanning version of the method proposed in [25], for reference benchmarking. All algorithms are included in the program RaligNAtor. The algorithms for index construction were implemented in the program sufconstruct, which makes use of routines from the libdivsufsort2 library (see http://code.google.com/p/libdivsufsort/) for computing the suf table in O(n log n) time. For the construction of table lcp we employ our own implementation of the linear time algorithm of [35]. All programs were written in C and compiled with the GNU C compiler (version 4.5.0, optimization option O3). All measurements are performed on a Quad Core Xeon E5620 CPU running at 2.4 GHz, with 64 GB main memory (using only one CPU core). To minimize the influence of disk subsystem performance, the reported running times are user times averaged over 10 runs. Allowed base pairs are canonical WatsonCrick and wobble, unless stated otherwise. The used sequencestructure operation costs are ω_{d} = ω_{m} = ω_{b} = ω_{a} = 1 and ω_{r} = 2.
Comparison of running times
Scaling behavior of the online and indexbased algorithms
Influence of stem and loop lengths on the search time
RNA family classification by global chaining of RSSP matches
In the next experiment we show the effectiveness of global chaining when searching with two SSDs built for Rfam families Cripavirus internal ribosome entry site (Acc.: RF00458) and flgRhizobiales RNA motif (Acc.: RF01736) [40]. These two families present only 53% and 69% sequence identity, respectively, much below the average of ∼80% of the Rfam 10.1 families. This illustrates the importance of using both sequence and structure information encoded in the SSDs of this experiment. The SSD of family RF01736 comprises three RSSPs, denoted by flg1, flg2, and flg3 in Figure 8, derived from the three stemloop substructures the members of this family fold into. The SSD of family RF00458 comprises five RSSPs, denoted by ires1, ires2, ires3, ires4, and ires5 in Figure S5 of Additional file 1, where the last four RSSPs describe the stemloop substructures the members of this family fold into. ires1 describes a moderately conserved strand occurring in these members. Observe also in Figures 8 and S5 the cost threshold$\mathcal{K}$ and allowed number of indels d used per pattern, remembering that these are denoted cost and indels in the RaligNAtor RSSP syntax.
Searching with the SSD of family RF00458 in RFAM10.1 delivers 16,033,351 matches for ires1, 8,950,417 for ires2, 1,052 for ires3, 112 for ires4, and 1,222,639 for ires5. From these matches, RaligNAtor computes highscoring chains of matches, eliminating spurious matches and resulting in exactly 17 chains. Each chain occurs in one of the 16 sequence members of the family in the full alignment except in sequence AF014388, where two chains with equal score occur. The highest (lowest) chain score is 171 (162). Using ScanAlign, LScanAlign, LESAAlign, and LGSlinkAlign, the search for all five RSSPs requires 688.32, 585.59, 186.88, and 92.25 minutes, respectively, whereas chaining requires 13.66 seconds. See Table S8 of Additional file 1 for the time required to match each pattern using the different algorithms.
The same search is performed using the SSD of family RF01736. It results in 4,145 matches for flg1, 68,024 for flg2, and 67 for flg3. Chaining the matches leads to 15 chains occurring each in one of the 15 sequence members of the family in the full alignment. The highest (lowest) chain score is 163 (156). Using ScanAlign, LScanAlign, LESAAlign, and LGSlinkAlign, the search for all three RSSPs requires 755.48, 336.69, 133.58, and 52.86 minutes, respectively, whereas chaining requires 0.03 seconds. The time required to match each pattern using each algorithm is reported in Table S9 of Additional file 1.
We also show that the lack of the sequencestructure edit operations supported by RaligNAtor deteriorates sensitivity and specificity in the search for sequence members of families RF00458 and RF01736. For this, we report in Section 4 and Table S10 of Additional file 1 results obtained with the Structator tool [23]. Structator is much faster but, in contrast to RaligNAtor, does not support all sequencestructure edit operations.
Importance of structural constraints for RNA family classification
Results of RaligNAtor and blastn database searches for members of RNA families of different degrees of sequence identity in RFAM10.1
RaligNAtor  RaligNAtor (sequence only)  blastn  

Family Acc.  Size  Seq. ident.  $\mathcal{K}\mathbf{=}\mathit{d}$  #TP  #FP  AUC  (pAUC)  $\mathcal{K}\mathbf{=}\mathit{d}$  #TP  #FP  AUC  (pAUC)  #TP  #FP  AUC  (pAUC) 
RF00032  9,900  48%  3  9,900  1,088,131  0.95  (0.17)  3  9,900  2,723,135  0.82  (0.09)  3,000  68  0.29  (0.05) 
RF00080  688  52%  33  688  698,942  0.71  (0.08)  19  688  1,279,375  0.60  (0.06)  326  540  0.42  (0.06) 
RF02003  176  52%  21  176  1,174,167  0.53  (0.03)  6  176  1,168,093  0.32  (0.00)  28  814  0.11  (0.01) 
RF00458  16  53%  20  16  88  0.94  (0.18)  14  16  2,688  0.96  (0.18)  12  1,224  0.73  (0.13) 
RF00685  131  55%  18  131  40,952  0.98  (0.19)  7  131  103,276  0.97  (0.19)  88  2,945  0.63  (0.10) 
RF00167  1,244  56%  25  1,244  2,514,701  0.58  (0.04)  17  1,244  2,611,256  0.28  (0.00)  660  624  0.52  (0.10) 
RF01705  598  56%  26  598  2,704,796  0.49  (0.02)  17  598  2,698,712  0.42  (0.00)  57  60  0.08  (0.01) 
RF01852  1,050  56%  22  1,050  1,026,233  0.99  (0.19)  14  1,050  1,488,254  0.94  (0.17)  543  83,268  0.44  (0.06) 
RF01734  584  57%  10  584  2,614,228  0.69  (0.05)  5  584  2,668,392  0.46  (0.01)  201  114  0.30  (0.05) 
RF00556  201  58%  8  201  69,808  0.97  (0.18)  6  201  1,514,311  0.92  (0.15)  91  1,024  0.44  (0.08) 
RF00713  14  58%  27  14  10,349  0.99  (0.19)  18  14  16,477  0.88  (0.16)  13  552  0.92  (0.18) 
RF00170  41  59%  13  41  53  0.97  (0.18)  9  41  9,197  0.96  (0.18)  29  176  0.70  (0.14) 
RF00706  69  59%  13  69  1  1.00  (0.20)  9  69  12  0.97  (0.19)  66  194  0.95  (0.18) 
RF00747  29  59%  20  29  130  0.97  (0.18)  16  29  159,898  0.96  (0.18)  28  236  0.96  (0.19) 
RF00778  20  59%  33  20  394,560  0.93  (0.17)  23  20  167,029  0.79  (0.13)  17  390  0.84  (0.16) 
RF01065  118  59%  17  118  0  1.00  (0.20)  9  118  0  1.00  (0.20)  70  305  0.59  (0.11) 
RF01733  9  63%  9  9  0  1.00  (0.20)  7  9  0  1.00  (0.20)  7  918  0.77  (0.15) 
RF00522  415  67%  5  415  1,461  0.99  (0.19)  5  415  32,224  0.99  (0.19)  359  391  0.63  (0.10) 
RF01862  15  67%  7  15  0  1.00  (0.20)  5  15  0  1.00  (0.20)  10  82  0.66  (0.13) 
RF00104  406  69%  24  406  989,362  0.99  (0.19)  14  406  1,560,674  0.99  (0.19)  237  72  0.45  (0.07) 
RF00165  431  69%  9  431  0  1.00  (0.20)  8  431  1  0.99  (0.19)  318  192  0.73  (0.14) 
RF01185  108  69%  13  108  24,759  0.99  (0.19)  13  108  24,759  0.99  (0.19)  104  329  0.93  (0.18) 
RF01838  77  69%  4  77  0  1.00  (0.20)  4  77  0  1.00  (0.20)  77  172  1.00  (0.20) 
RF02031  164  71%  17  164  297,941  0.99  (0.19)  12  164  521,018  0.99  (0.19)  100  218  0.60  (0.11) 
RF00052  210  72%  16  210  0  1.00  (0.20)  12  210  0  1.00  (0.20)  207  12,496  0.98  (0.19) 
RF00543  103  73%  26  103  0  1.00  (0.20)  19  103  0  1.00  (0.20)  102  110  0.99  (0.19) 
RF01744  14  73%  7  14  0  1.00  (0.20)  5  14  0  1.00  (0.20)  11  5,377  0.74  (0.14) 
RF01769  149  75%  16  149  0  1.00  (0.20)  10  149  0  1.00  (0.20)  149  150  0.99  (0.19) 
RF00110  161  81%  19  161  0  1.00  (0.20)  17  161  0  1.00  (0.20)  160  791  0.99  (0.19) 
RF01967  50  84%  37  50  660,130  0.98  (0.19)  26  50  475,242  0.98  (0.19)  48  691  0.95  (0.19) 
RF01472  26  85%  6  26  0  1.00  (0.20)  1  26  0  1.00  (0.20)  26  412  1.00  (0.20) 
RF01953  46  85%  32  46  0  1.00  (0.20)  22  46  0  1.00  (0.20)  46  772  1.00  (0.20) 
RF00372  45  86%  28  45  0  1.00  (0.20)  24  45  0  1.00  (0.20)  45  197  0.99  (0.19) 
RF01980  43  86%  39  43  830,971  0.97  (0.19)  28  43  702,352  0.96  (0.19)  43  341  1.00  (0.20) 
RF00469  1,366  89%  12  1,366  46,351  0.99  (0.19)  7  1,366  99,045  0.99  (0.19)  1,341  474  0.97  (0.19) 
Average  66%  0.93  (0.17)  0.89  (0.16)  0.72  (0.14) 
In addition, we show in Figures 11(2) and (3) the results of the ROC analysis for the families with the lowest and highest degree of sequence identity. For the ROC curve of each selected family, see Figures S7 and S8 of Additional file 1. Clearly, by using base pairing information, RaligNAtor achieves a higher sensitivity with a reduced false positive rate compared to searches ignoring base pairing (compare columns “RaligNAtor” and “RaligNAtor (sequence only)” in Table 1). This is in particular evident when searching for families with a low degree of sequence identity. This can be explained by the small amount of information left in the RSSP for such a family, once the structural information is removed. Due to the high variability of bases in the columns of the multiple alignment of the family, the pattern contains a large number of wildcards. These symbols alone, without the constraints imposed by the base pairs, lead to unspecific patterns and therefore to a large number of false positives. We observe that, for families with sequence identity of up to 59%, the area under the curve (AUC) is considerably larger when base pairing information is taken into account. This difference decreases with increasing sequence identity (compare Figures 11(2) and (3)). Overall, the average AUC value over all families is, with a value of 0.93, still notably higher when base pairing information is considered compared to 0.89 if base pairing information is ignored (see Table 1). In this experiment, blastn only finds all members of those families whose sequence identity is at least 85%. This is due to the fact that blastn cannot appropriately handle IUPAC wildcard characters. Hence, by taking the most frequent symbol in an alignment column as consensus symbol, the heterogeneity of less conserved positions in the alignment cannot be adequately modeled. For the blastn searches, the average AUC value over all families is only 0.72.
RaligNAtorsoftware package
RaligNAtor is an opensource software package for fast approximate matching of RNA sequencestructure patterns (RSSPs). It allows the user to search target RNA or DNA sequences choosing one of the new online or further accelerated indexbased algorithms presented in this work. The index of the sequence to be searched can be easily constructed with program sufconstruct distributed with RaligNAtor.
Searched RSSPs can describe any (branching, noncrossing) RNA secondary structure; see examples in Figures 1, 6, 8, and S5 of Additional file 1. Bases composing the sequence information of RSSPs can be ambiguous IUPAC characters. As part of the search parameters for RSSPs, the user can specify the cost of each sequencestructure edit operation defined above, the cost threshold of possible matches, and the number of allowed indels. The RSSPs, along with costs and thresholds per RSSP, are specified in a simple text file using a syntax that is expressive but easy to understand as shown in the mentioned figures. Another possibility is to provide the same costs and thresholds for all searched patterns as parameters in the command line call to RaligNAtor. To ensure maximal flexibility, the user can also define the base pairing rules from an arbitrary subset of$\mathcal{A}\times \mathcal{A}$ as valid pairings in a separate text file. Searches can be performed on the forward and reverse strands of the target sequence. Searching on the reverse strand is implemented by reversal of the RSSP and transformation according to WatsonCrick base pairing. Wobble pairs {(G,U), (U,G)} automatically become {(C,A), (A,C)}. Due to these transformations, the index is built for one strand only.
For describing a complex RNA with our concept of secondary structure descriptor (SSD), i.e. with multiple RSSPs, the user specifies all RSSPs in one text file. The order of the RSSPs in the file will then specify the order of the RSSP matches used to build highscoring chains. The chain score directly depends on the score of each match occurring in the chain. This is inversely proportional to the sequencestructure edit distance of the RSSP and its matching substring in the target sequence. Hence, higher scores indicate sequences with a higher conservation which are probably more closely related to the sought RNA family.
Chaining of matches discards spurious matches not occurring in any chain. An additional filtering option eliminates matches overlapping another with a higher score for the same RSSP. This is particularly useful when indels lead to almost identical matches that are only shifted by a few positions in the target sequence.
The output of RaligNAtor includes not only matching positions to single RSSPs and chains, but their sequencestructure alignment to the matched substrings as well. In the RaligNAtor software package, all programs for searching patterns support multithreading to take advantage of computer systems with multiple CPU cores. There are two modes of parallelism. At first, different patterns are searched using multiple threads. Additionally, the search space (i.e. the sequence for the online algorithms and the index structure for the indexbased methods) is partitioned, processing each part using a different thread. Lastly, we remark that our software also provides an implementation of the original algorithm of Jiang et al. for global sequencestructure alignment [25], easily applicable by the user.
Conclusions
Overview of the presented algorithms
Algorithm  Online  Indexed  Earlystop  Additional memory  Used index tables  

Acceleration  Requirements [bytes]  suf  lcp  suf ^{−1}  vtab  
ScanAlign  $\u2713$  0  
LScanAlign  $\u2713$  $\u2713$  0  
LESAAlign  $\u2713$  $\u2713$  5n  $\u2713$  $\u2713$  
LGSlinkAlign  $\u2713$  $\u2713$  9.125n  $\u2713$  $\u2713$  $\u2713$  $\u2713$ 
Availability
The RaligNAtor software package including documentation is available in binary format for different operating systems and architectures and as source code under the GNU General Public License Version 3. See http://www.zbh.unihamburg.de/ralignator for details.
Declarations
Acknowledgements
This work was supported by basic funding of the University of Hamburg. We thank Steffen Dettmann for interesting discussions and algorithmic ideas that contributed to this work.
Authors’ Affiliations
References
 Mattick J: RNA regulation: a new genetics?. Nat Rev Genet. 2004, 5 (4): 316323. 10.1038/nrg1321.View ArticlePubMedGoogle Scholar
 Burge SW, Daub J, Eberhardt R, Tate J, Barquist L, Nawrocki EP, Eddy SR, Gardner PP, Bateman A: Rfam 11.0: 10 years of RNA families. Nucleic Acids Res. 2012, 41 (D1):Google Scholar
 Siebert S, Backofen R: MARNA: multiple alignment and consensus structure prediction of RNAs based on sequence structure comparisons. Bioinformatics. 2005, 21 (16): 33523359. 10.1093/bioinformatics/bti550.View ArticlePubMedGoogle Scholar
 Höchsmann M, Voss B, Giegerich R: Pure multiple RNA secondary structure alignments: a progressive profile approach. IEEE/ACM Trans Comput Bio Bioinformatics. 2004, 1: 5362. 10.1109/TCBB.2004.11.View ArticleGoogle Scholar
 Sankoff D: Simultaneous solution of the RNA folding, alignment and protosequence problem. SIAM J Appl Mathe. 1985, 45: 810825. 10.1137/0145048.View ArticleGoogle Scholar
 Will S, Reiche K, Hofacker IL, Stadler PF, Backofen R: Inferring noncoding RNA families and classes by means of genomescale structurebased clustering. PLoS Comput Biol. 2007, 3 (4): e65+PubMed CentralView ArticlePubMedGoogle Scholar
 Havgaard JH, Torarinsson E, Gorodkin J: Fast pairwise structural RNA alignments by pruning of the dynamical programming matrix. PLoS Comput Biol. 2007, 3 (10): e193+PubMed CentralView ArticleGoogle Scholar
 Torarinsson E, Havgaard JH, Gorodkin J: Multiple structural alignment and clustering of RNA sequences. Bioinformatics. 2007, 23 (8): 926932. 10.1093/bioinformatics/btm049.View ArticlePubMedGoogle Scholar
 Mathews DH, Turner DH: Dynalign: an algorithm for finding the secondary structure common to two RNA sequences. J Mol Biol. 2002, 317 (2): 191203. 10.1006/jmbi.2001.5351.View ArticlePubMedGoogle Scholar
 Mathews DH: Predicting a set of minimal free energy RNA secondary structures common to two sequences. Bioinformatics. 2005, 21 (10): 22462253. 10.1093/bioinformatics/bti349.View ArticlePubMedGoogle Scholar
 Dalli D, Wilm A, Mainz I, Steger G: STRAL: progressive alignment of noncoding RNA using base pairing probability vectors in quadratic time. Bioinformatics. 2006, 22 (13): 15931599. 10.1093/bioinformatics/btl142.View ArticlePubMedGoogle Scholar
 Nawrocki EP, Kolbe DL, Eddy SR: Infernal 1.0: inference of RNA alignments. Bioinformatics. 2009, 25 (10): 13351337. 10.1093/bioinformatics/btp157.PubMed CentralView ArticlePubMedGoogle Scholar
 Klein R, Eddy S: RSEARCH: finding homologs of single structured RNA sequences. BMC Bioinformatics. 2003, 4: 4410.1186/14712105444.PubMed CentralView ArticlePubMedGoogle Scholar
 Gautheret D, Lambert A: Direct RNA motif definition and identification from multiple sequence alignments using secondary structure profiles. J Mol Biol. 2001, 313: 100311. 10.1006/jmbi.2001.5102.View ArticlePubMedGoogle Scholar
 Macke T, Ecker D, Gutell R, Gautheret D, Case D, Sampath R: RNAMotif  A new RNA secondary structure definition and discovery algorithm. Nucleic Acids Res. 2001, 29 (22): 47244735. 10.1093/nar/29.22.4724.PubMed CentralView ArticlePubMedGoogle Scholar
 Gautheret D, Major F, Cedergren R: Pattern searching/alignment with RNA primary and secondary structures: an effective descriptor for tRNA. Comput Appl Biosci. 1990, 6 (4): 325331.PubMedGoogle Scholar
 RNABOB: a program to search for RNA secondary structure motifs in sequence databases. [http://selab.janelia.org/software.html],
 Chang T, Huang H, Chuang T, Shien D, Horng J: RNAMST: efficient and flexible approach for identifying RNA structural homologs. Nucleic Acids Res. 2006, 34: W423W428. 10.1093/nar/gkl231.PubMed CentralView ArticlePubMedGoogle Scholar
 Dsouza M, Larsen N, Overbeek R: Searching for patterns in genomic data. Trends Genet. 1997, 13 (12): 497498.View ArticlePubMedGoogle Scholar
 Grillo G, Licciulli F, Liuni S, SbisÃă E, Pesole G: PatSearch: A program for the detection of patterns and structural motifs in nucleotide sequences. Nucleic Acids Res. 2003, 31 (13): 36083612. 10.1093/nar/gkg548.PubMed CentralView ArticlePubMedGoogle Scholar
 Billoud B, Kontic M, Viari A: Palingol: a declarative programming language to describe nucleic acids’ secondary structures and to scan sequence database. Nucleic Acids Res. 1996, 24 (8): 13951403. 10.1093/nar/24.8.1395.PubMed CentralView ArticlePubMedGoogle Scholar
 Reeder J, Giegerich R: A graphical programming system for molecular motif search. Proceedings of the 5th international Conference on Generative Programming and Component Engineering. 2006, New York: ACM Press, 131140.Google Scholar
 Meyer F, Kurtz S, Backofen R, Will S, Beckstette M: Structator: fast indexbased search for RNA sequencestructure patterns. BMC Bioinformatics. 2011, 12: 21410.1186/1471210512214.PubMed CentralView ArticlePubMedGoogle Scholar
 ElMabrouk N, Raffinot M, Duchesne JE, Lajoie M, Luc N: Approximate matching of structured motifs in DNA sequences. J Bioinform Comput Biol. 2005, 3 (2): 317342. 10.1142/S0219720005001065.View ArticlePubMedGoogle Scholar
 Jiang T, Lin G, Ma B, Zhang K: A general edit distance between RNA structures. J Comput Biol. 2002, 9 (2): 371388. 10.1089/10665270252935511.View ArticlePubMedGoogle Scholar
 Abouelhoda M, Ohlebusch E: Chaining algorithms for multiple genome comparison. J Discrete Algo. 2005, 3 (24): 321341.View ArticleGoogle Scholar
 Will S, Siebauer M, Heyne S, Engelhardt J, Stadler P, Reiche K, Backofen R: LocARNAscan: incorporating thermodynamic stability in sequence and structurebased RNA homology search. Algo Mol Biol. 2013, 8: 1410.1186/17487188814.View ArticleGoogle Scholar
 Ukkonen E: Algorithms for approximate string matching. Inf Control. 1985, 64 (13): 100118.View ArticleGoogle Scholar
 Manber U, Myers E: Suffix arrays: a new method for online string searches. SIAM J Comput. 1993, 22 (5): 935948. 10.1137/0222058.View ArticleGoogle Scholar
 Abouelhoda M, Kurtz S, Ohlebusch E: Replacing suffix trees with enhanced suffix arrays. J Discrete Algo. 2004, 2: 5386. 10.1016/S15708667(03)000650.View ArticleGoogle Scholar
 Kärkkäinen J, Sanders P: Simple linear work suffix array construction. Proceedings of the 13th International Conference on Automata, Languages and Programming. 2003, Berlin  Heidelberg: SpringerGoogle Scholar
 Puglisi SJ, Smyth W, Turpin A: The performance of linear time suffix sorting algorithms. DCC ’05: Proceedings of the Data Compression Conference. 2005, Washington: IEEE Computer Society, 358367.View ArticleGoogle Scholar
 Manzini G, Ferragina P: Engineering a lightweight suffix array construction algorithm. Algorithmica. 2004, 40: 3350. 10.1007/s0045300410941.View ArticleGoogle Scholar
 Fischer J: Wee LCP. Inf Proc Let. 2010, 110 (89): 317320.View ArticleGoogle Scholar
 Kasai T, Lee G, Arimura H, Arikawa S, Park K: Lineartime longestcommonprefix computation in suffix arrays and its applications. Proceedings of the 18th Annual Symposium on Combinatorial Pattern Matching. 2001, Berlin  Heidelberg: Springer, 181192.View ArticleGoogle Scholar
 Beckstette M, Homann R, Giegerich R, Kurtz S: Fast index based algorithms and software for matching position specific scoring matrices. BMC Bioinformatics. 2006, 7: 38910.1186/147121057389.PubMed CentralView ArticlePubMedGoogle Scholar
 Ukkonen E: Online construction of suffix trees. Algorithmica. 1995, 14 (3): 249260. 10.1007/BF01206331.View ArticleGoogle Scholar
 Beckstette M, Homann R, Giegerich R, Kurtz S: Significant speedup of database searches with HMMs by search space reduction with PSSM family models. Bioinformatics. 2009, 25 (24): 32513258. 10.1093/bioinformatics/btp593.PubMed CentralView ArticlePubMedGoogle Scholar
 Darty K, Denise A, Ponty Y: VARNA: Interactive drawing and editing of the RNA secondary structure. Bioinformatics. 2009, 25 (15): 19741975. 10.1093/bioinformatics/btp250.PubMed CentralView ArticlePubMedGoogle Scholar
 Weinberg Z, Wang J, Bogue J, Yang J, Corbino K, Moy R, Breaker R: Comparative genomics reveals 104 candidate structured RNAs from bacteria, archaea, and their metagenomes. Genome Biol. 2010, 11 (3): R3110.1186/gb2010113r31.PubMed CentralView ArticlePubMedGoogle Scholar
 Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ: Gapped BLAST and PSIBLAST: a new generation of protein database search programs. Nucleic Acids Res. 1997, 25 (17): 33893402. 10.1093/nar/25.17.3389.PubMed CentralView ArticlePubMedGoogle Scholar
 Mathews DH, Turner DH: Prediction of RNA secondary structure by free energy minimization. Curr Opin Struct Biol. 2006, 16 (3): 270278. 10.1016/j.sbi.2006.05.010.View ArticlePubMedGoogle Scholar
Copyright
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.